GCC-v7x

hordeprobable · Biotechnology

4 Oct 2013

GCC


Genomics Core Computing




Current situation GCC 1.0

Roche 454

Current cluster

UZ network

8C 16Gb 2TB

UZ NAS Storage

8C 16Gb

8C 16Gb

Per run:

~1 million reads

~2 GB raw data

New sequencer: 1000x increase


1.1 TB / run (200 Gbp)

~1000 million reads

8-day run!


Basic analysis of 1 full run

< 1 week on 3 nodes with 48 GB RAM and 8 CPU cores each (and needs 7 TB space)


Full capacity sequencing = full capacity of all 24 CPU cores
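A rough check of the capacity claim above, as a sketch only: it assumes the "< 1 week" analysis takes the full 7 days and that a new run arrives every 8 days. The function and constant names are illustrative and do not come from the slides.

```python
# Figures taken from the slides; ANALYSIS_DAYS = 7 is an assumed upper bound for "< 1 week".
RUN_DAYS = 8
ANALYSIS_DAYS = 7
NODES = 3
CORES_PER_NODE = 8


def cluster_utilization(run_days, analysis_days):
    """Fraction of time the cluster is busy if one run arrives every run_days."""
    return analysis_days / run_days


def cpu_hours_per_run(nodes, cores_per_node, analysis_days):
    """Upper bound on CPU-hours consumed by the basic analysis of one run."""
    return nodes * cores_per_node * analysis_days * 24


total_cores = NODES * CORES_PER_NODE
print(total_cores)                                              # 24
print(cluster_utilization(RUN_DAYS, ANALYSIS_DAYS))             # 0.875
print(cpu_hours_per_run(NODES, CORES_PER_NODE, ANALYSIS_DAYS))  # 4032
```

Under these assumptions the three nodes would be busy close to 90% of the time with basic analysis alone, which leaves little room for the meta-analyses described next.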

Meta-analyses & post-analyses


Several fold higher needs than basic run analyses


Integrate multiple runs (e.g., patients versus controls, families, etc.)


Integrate with previous data


Integrate with publicly available data


RNA-Seq + gene expression data from GEO


Integrate with other data sources


DNA-Seq + RNA-Seq + Methyl-Seq


Integrate with genome browsers


Galaxy, UCSC, Ensembl


Make analysis pipelines available to users as a service


Custom analyses as a service or in collaboration

Ideal computing setup


High Performance Computing (HPC)

500MB/s

UZ - GBIOMED - VSC

8C 16Gb 2TB

8C 16Gb

8C 16Gb

UZ NAS Storage

- Additional RAM (32 GB or 48 GB per node)

- Additional storage? DAS or NAS? Dell, NetApp?

Open MPI

SGE

Distributed computing

Torque/PBS

Distributed computing

Flexible computing

~100 CPUs

6 GB RAM/core

NetApp + DDN Storage

- Servers
- Storage
- Switches

Software:
- Academic tools
- CLCBio?

Software:
- CASAVA (parallelization by user)
- Academic: bowtie, bwa, …
- CLCBio?

UZ patient data

Software:
- CASAVA
- CLCBio
- Roche

- Computing (0.5 EUR / CPU-hour)
- Storage (750-1500 EUR / TB)


VSC

gbiomed

UZ

To be discussed


How can the HiSEQ2000 choose between the UZ and KULeuven networks to send run data to storage?


1Gb


350 GB / run compressed
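To put the link speeds in perspective, here is a minimal sketch of the theoretical transfer time per run. It assumes that "1Gb" and "10Gb" refer to network line rates in Gbit/s, that sizes are decimal gigabytes, and that the link runs at full line rate with no protocol overhead. Real transfers will be slower. The function name is illustrative.

```python
def transfer_hours(size_gb, link_gbit_per_s):
    """Theoretical hours to move size_gb gigabytes over a link at full line rate."""
    size_gbit = size_gb * 8          # bytes to bits
    seconds = size_gbit / link_gbit_per_s
    return seconds / 3600


# Compressed run (350 GB) and raw run (1.1 TB = 1100 GB) over 1 and 10 Gbit/s links.
for size_gb in (350, 1100):
    for link in (1, 10):
        hours = transfer_hours(size_gb, link)
        print(f"{size_gb} GB over {link} Gbit/s: {hours:.2f} h")
```

Even at ideal throughput, a raw run needs well over two hours on a 1 Gbit/s link, so the choice of network matters mainly when several runs or intermediate files move at once.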


Where to store data after secondary analysis?


Cheap storage


External HDD


tape


Who does what?


Jeroen / Jan for UZ?


Stein / Gert / Raf for Biomed?


Can we already buy additional RAM for UZ cluster?


Can we connect gbiomed servers directly to UZ storage?


What are the requirements?


Estimate load over 3 levels


# users


# runs


Difficult to estimate now


Evaluate after 1 year



What’s next


Decide on gbiomed hardware


List of things needed at UZ


Start testing CASAVA on UZ system and on VSC


Test CLCBio on UZ system for Illumina data

Test with 1000 genomes data



Storage


How much do we need?


1.1 TB per run


7 TB space during analysis


BUT: keep only runs that are being analyzed


~ 3 at a time?



10-15 TB


After analysis:


Data delivered to client


Data compressed and moved to offline storage


Cheap HDD array?


Tape?


External HDDs?
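The 10-15 TB estimate can be reproduced with a small calculation. This is a sketch under one possible reading of the slide: raw data for every run kept online, plus 7 TB of working space for each run that is actively being analyzed. Whether the 7 TB already includes the raw data is not stated, and the function and parameter names are hypothetical.

```python
def online_storage_tb(runs_online, concurrent_analyses=1,
                      raw_tb_per_run=1.1, scratch_tb_per_analysis=7.0):
    """Online storage needed: raw data for all kept runs plus analysis scratch space."""
    raw = runs_online * raw_tb_per_run
    scratch = concurrent_analyses * scratch_tb_per_analysis
    return round(raw + scratch, 1)


print(online_storage_tb(3))                          # 10.3
print(online_storage_tb(5))                          # 12.5
print(online_storage_tb(3, concurrent_analyses=2))   # 17.3
```

With one analysis running at a time, 3 to 5 runs online fit within 10-15 TB. Two simultaneous analyses would already exceed that range under this reading.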

Proposal for GCC2.0 (ideas under construction)

UZ computing nodes (existing)

8C 16Gb 2TB

UZ NetApp Storage

8C 16Gb

8C 16Gb

Patient-related data

Non-patient-related data (e.g., model organisms, cell lines, …)

32C 256Gb

8C 48Gb

8C 48Gb

gbiomed computing nodes

Fast interconnect; high I/O bandwidth

Illumina HiSEQ2000

Roche 454

ICTS/VSC NetApp + DDN Storage

VSC (existing), pay per CPU-hour

!

Non-patient-related data

!

! = to create, to test, or to open 10Gb link

GCC2.0 features


Divide and conquer: solution at 3 levels


UZ: for UZ-patient-related data (protected)


Gbiomed: ad hoc, flexible computing for research (non-UZ-patient-related data)


VSC: high-performance computing (non-UZ-patient-related data)
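The three-level split can be expressed as a simple placement rule. The sketch below is only an illustration of the policy as stated on this slide; the `Site` enum, the `choose_site` function and the `needs_hpc` flag are hypothetical names, not part of any existing GCC tooling.

```python
from enum import Enum


class Site(Enum):
    UZ = "UZ"
    GBIOMED = "gbiomed"
    VSC = "VSC"


def choose_site(patient_related: bool, needs_hpc: bool) -> Site:
    """Pick where a dataset is stored and analyzed under the 3-level proposal."""
    if patient_related:
        # UZ-patient-related data is protected and stays inside UZ.
        return Site.UZ
    # Non-patient data: VSC for heavy HPC jobs, gbiomed for ad hoc, flexible work.
    return Site.VSC if needs_hpc else Site.GBIOMED


print(choose_site(patient_related=True, needs_hpc=True).value)    # UZ
print(choose_site(patient_related=False, needs_hpc=True).value)   # VSC
print(choose_site(patient_related=False, needs_hpc=False).value)  # gbiomed
```

The patient-data check comes first on purpose: under this policy, no amount of computing need moves protected data out of UZ.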


Storage (too expensive to duplicate)


VSC storage with Gbiomed access (create 10Gb fast interconnect from ICTS to gbiomed)


UZ storage with Gbiomed access (create 'open-access' policy for non-patient-related data)


Gbiomed ad hoc storage (HDDs in the local servers)


Computing


VSC for HPC


Servers in UZ (patient-related data)


Servers in gbiomed (for research-related ad hoc analyses, web services, development, software testing, …)


Requires fast (10Gb ethernet) access to ICTS storage and fast (and open) access to UZ-open storage

GCC2.0 Cost, Timing & Effort estimates


Budget from Stichting tegen Kanker


200-250 K left for computing


Solution for the first 3 years should be possible (excluding bioinformatics manpower)


Budget spread between VSC, gbiomed and UZ: to be decided internally in the Genomics Core


VSC: x%


Storage (86,400 EUR for 32 TB; ~80 TB is needed for 25 runs per year)


Computing time (29,594 EUR for 55,000 CPU-hours)


Gbiomed local servers and local storage: y%


UZ additional storage: z%


Software licenses (CLCBio) (price quote requested)


More investments needed over time (e.g., new hardware is only for 3 years)
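The VSC quotes above can be turned into unit prices with a few lines of arithmetic. This sketch uses only the figures on the slide; the constant and function names are illustrative, and the period covered by the storage quote is not stated in the slides.

```python
# Quoted figures from the slide.
STORAGE_QUOTE_EUR = 86_400
STORAGE_QUOTE_TB = 32
COMPUTE_QUOTE_EUR = 29_594
COMPUTE_QUOTE_CPU_HOURS = 55_000
STORAGE_NEEDED_TB = 80  # "~80 TB is needed for 25 runs per year"


def unit_price(total_eur, quantity):
    """Price per unit (per TB or per CPU-hour) implied by a quote."""
    return total_eur / quantity


eur_per_tb = unit_price(STORAGE_QUOTE_EUR, STORAGE_QUOTE_TB)
eur_per_cpu_hour = unit_price(COMPUTE_QUOTE_EUR, COMPUTE_QUOTE_CPU_HOURS)

print(f"Storage: {eur_per_tb:.0f} EUR/TB")
print(f"Computing: {eur_per_cpu_hour:.3f} EUR/CPU-hour")
print(f"80 TB at quoted rate: {eur_per_tb * STORAGE_NEEDED_TB:.0f} EUR")
```

At the quoted rate, 80 TB comes to about 216,000 EUR, which is close to the entire 200-250 K computing budget, so the split between VSC, gbiomed and UZ depends heavily on how much data must stay online.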



Timing: 31 August 2010?


Estimated effort (to be discussed)


VSC:


Create 10Gb ethernet link to gbiomed (cost?)


mandays for startup and testing (network connections, storage, software)


Maintenance included in price


Genomics Core bioinformaticians (VRC, CME)


mandays for startup and testing


Gbiomed IT:


mandays for setting up local servers & integration with ICTS storage


… FTE for maintenance of local servers


UZ: … mandays for additional storage and setup NetApp share


Hurdles to overcome


1) 10Gb ethernet link between VSC and gbiomed


For non-UZ-patient-related data


To transfer Illumina data to VSC


To run ad hoc analyses on local gbiomed servers, connected to the VSC storage, without the need to duplicate the storage solution and the data (too costly)


An absolute requirement


Currently not available


A necessary investment for future VSC-BMW interactions


2) UZ-patient-related data cannot be transferred to VSC storage, nor computed at VSC


Can VSC provide a secure transfer, storage and computing environment for UZ data? If not, data analysis and storage for UZ data remains in UZ.


3) Link between UZ storage and gbiomed for non-patient-related data


Gbiomed-UZ


10Gb link is possible in principle. Perhaps during transition period (while waiting for 10Gb link VSC-gbiomed)?

Alternatives


All-in-one solution


PSSCLabs


Public tender


Bioinformatics analyses


Estimated effort from Genomics Core bioinformatician for basic analysis of 1 run: ~2-3 mandays


Included in service fee?


This analysis will not be satisfactory for most projects


Fee-based bioinformatics and data analysis service for more advanced analyses?


Many users have a bioinformatician in the group or already collaborate with bioinformaticians


Contribution in the service fee for GCC hardware & maintenance cost, and software licenses



Estimated effort:


Either only basic analysis services are offered: ½ FTE bioinformatics postdoc


Or basic plus advanced bioinformatics services will be offered: 1 FTE bioinformatics postdoc.
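As a rough cross-check of the staffing estimate, the sketch below compares the basic-analysis workload with half an FTE. It combines the ~2-3 mandays per run from this slide with the 25 runs per year mentioned in the cost estimates, and it assumes 220 working days per FTE-year. That last figure is an assumption, not a number from the slides.

```python
WORKING_DAYS_PER_FTE_YEAR = 220  # assumption, not from the slides
RUNS_PER_YEAR = 25               # from the cost estimate slide


def basic_analysis_mandays(runs_per_year, mandays_per_run):
    """Yearly mandays spent on basic analysis at a given effort per run."""
    return runs_per_year * mandays_per_run


low = basic_analysis_mandays(RUNS_PER_YEAR, 2)
high = basic_analysis_mandays(RUNS_PER_YEAR, 3)
half_fte = 0.5 * WORKING_DAYS_PER_FTE_YEAR

print(low, high)   # 50 75
print(half_fte)    # 110.0
```

Under these assumptions, basic analysis alone takes 50-75 mandays per year, which fits within half an FTE and leaves some time for maintenance and user support.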