
Strategy for Physical Infrastructure

Tim Hubbard

ELIXIR Stakeholders Meeting

20th May 2009

Objectives


Understand requirements for physical infrastructure for ELIXIR for the next 5-10 years


Core databases (mainly at EMBL-EBI)


Specialized databases


Links to core throughout Europe


Explore options for construction and
recommendations on strategy


Initial design and costing of preferred options for a European Data Centre (EDC)

Questions


Size of computing & storage requirement


How will it grow over 5-10 years?


Distribution of resources (core, member states)


What criteria are appropriate to identify components for communal funding?


Physical infrastructure of EDC


Upgrade the existing data centre or build an additional data centre?


Challenges


Existing data scale is petabytes


Data size comparable with largest research communities


1000 genomes / ICGC = ~1 petabyte per year


CERN Large Hadron Collider (LHC)



~10 PB/year at start

~1000 PB in ~10 years


2500 physicists collaborating

http://www.cern.ch

Pan-STARRS (Haleakala, Hawaii)

US Air Force

now: 800 TB/year

soon: 4 PB/year

Large Synoptic Survey Telescope (LSST)

NSF, DOE, and private donors


~5-10 PB/year at start in 2012

~100 PB by 2025



http://www.lsst.org; http://pan-starrs.ifa.hawaii.edu/public/

Challenges


Existing data scale is petabytes


Data size comparable with largest research communities


Data security is a major issue


Tape backup no longer viable


Secure remote data replication required


Data growth has been exponential for many years


Has exceeded improvements in CPU/disk/network

Trace Archive doubling time: 11 months

Moore's law: CPU power doubles in ~18-24 months; hard drive capacity doubles in ~12 months; network bandwidth doubles in ~20 months.

Challenges


Existing data scale is petabytes


Data size comparable with largest research communities


Data security is a major issue


Tape backup no longer viable


Secure remote data replication required


Data growth has been exponential for many years


Has exceeded improvements in CPU/disk/network


Data growth is accelerating


Huge drop in price and jump in rate of growth of sequencing technology


Sequencing has become the biological assay of choice


Imaging likely to follow

Doubling times


Disk: 12 months

CPU: 18-24 months

Network: 20 months

Trace Archive: 11 months

Illumina run output: 3-6 months (this year)
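To see why these doubling times matter, a back-of-envelope projection is sketched below in Python; the five-year horizon and the midpoints taken for the quoted ranges are assumptions for illustration, not a formal forecast.

# Back-of-envelope projection: how much each quantity grows over 5 years,
# given the doubling times quoted above. Starting sizes are 1 (relative units).

doubling_months = {
    "Disk capacity": 12,
    "CPU (Moore's law)": 21,        # midpoint of the quoted 18-24 months
    "Network bandwidth": 20,
    "Trace Archive": 11,
    "Illumina run output": 4.5,     # midpoint of the quoted 3-6 months
}

YEARS = 5
months = YEARS * 12

for name, d in doubling_months.items():
    growth = 2 ** (months / d)      # relative growth factor after `months`
    print(f"{name:22s} doubles every {d:>4} months -> x{growth:,.0f} in {YEARS} years")

Run as written, this shows sequence data (11-month doubling) growing roughly 40-fold in five years while disk only grows about 30-fold and CPU about 7-fold, which is the gap the rest of the deck is concerned with.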

WTSI projection of IT as fraction of
cost of sequencing

Long term value of sequence


WGAS (Whole Genome Association Studies) in
Human Genetics


Success of WTCCC (Wellcome Trust Case Control
Consortium)


Confounded by human population structure


Can detect causal genes, but not causal SNPs


Determining the structure of human variation


1000 genomes project


Used with WGAS, improves ability to detect causal SNPs


Data for the pilot has already saturated the internet backbone


Only scratching the surface: 3 populations, 400 individuals each, 1% allele frequency


Resulting data structure will be large, frequently used,
valuable


Underlying data still required to allow refinement of structure

Consequences


EBI/Genome Campus


Space/Electricity are limiting factors


More about storage than CPU


Tests using Supercomputer centres (WP13)


can be possible with extra work


cost/benefit unclear


for many applications compute would have to be near the data


Tests using commercial clouds


Recent trends


“Cloud” computing
becomes a reality


Resources for hire


Amazon S3, EC2


Google, Microsoft
both developing
competitors

For example: Microsoft is constructing a new $500M data center in Chicago.

Four new electrical substations totalling 168 MW power.

About 200 40’ truckable containers, each containing ~1000-2000 servers.

Estimated 200K-400K servers total.



Comparisons to Google, Microsoft, etc. aren’t entirely appropriate; the scale of their budgets vs. ours isn’t comparable.

Google FY2007: $11.5B; ~$1B to computing hardware



Though they do give us early warning of coming trends (container data centers; cloud computing).

Private sector datasets and computing capacity
are already huge.

Google, Yahoo!, Microsoft: probably ~100 PB or so

eBay, Facebook, Walmart: probably ~10 PB or so

Consequences


EBI/Genome Campus


Space/Electricity are limiting factors


More about storage than CPU


Tests using Supercomputer centres (WP13)


can be possible with extra work


cost/benefit unclear


for many applications compute would have to be near the data


Tests using commercial clouds


May be useful for individual researchers


Not stable for archives (Google abandoned its academic data program)


Currently expensive

Physical infrastructure vision

[Diagram: a European BioCloud Data and Compute Infrastructure linking EBI/Elixir (aggregate, organise), BioCloud A, BioCloud B and a large scale Centre A. A bioinformatics researcher can take a data slice if wanted and can submit compute if wanted; test/small datasets; submission for large scale.]

3. Distributed, hierarchical, redundant data archives and analysis

(CERN LHC’s four tiers of data centers: 10 Tier 1 sites, 30 Tier 2 sites)


4. Computational infrastructure is integral to experimental design

Physical infrastructure vision

[Diagram: the same European BioCloud Data and Compute Infrastructure, now shown as EBI/Elixir (Node 1) plus Elixir Node 2 and Elixir Node 3 alongside large scale Centre A. A bioinformatics researcher can take a data slice if wanted and can submit compute if wanted; test/small datasets; submission for large scale.]

Conclusions


The need is real; biology just hasn’t been here before



Essential to upgrade data centre capacity at EBI


Implement data replication for data security


Improve data transfer for large data sets



A network of a small number of nodes for storage and compute around Europe would address data security and allow distributed compute access


Less hierarchical than physics, more data integration required


High reliability of replication needed: single command/control structure?

Recent trends


Very rapid growth in genome sequence data


Challenges of data transfers between centres (1000
genomes)


Tape backup of archives no longer sufficient, data replication
required


Strong trend towards federation of separate databases


APIs, webservices, federation technologies (DAS, BioMart); a minimal DAS query sketch follows this list


Proposed approach for International Cancer Genome
Consortium (ICGC)
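To make the federation bullet above concrete, here is a minimal sketch of a DAS-style query over plain HTTP. The server URL and data-source name are hypothetical placeholders, not a real endpoint; a real deployment would point at a registered DAS source, which follows the same /das/<source>/features?segment=... pattern.

# Minimal DAS-style federation sketch: fetch features for a genomic segment
# from a (hypothetical) DAS server and print their types and positions.
# DAS_SERVER and DATA_SOURCE are placeholders -- substitute a real DAS source.
import urllib.request
import xml.etree.ElementTree as ET

DAS_SERVER = "http://das.example.org/das"        # hypothetical server
DATA_SOURCE = "hsapiens_reference"               # hypothetical data source
SEGMENT = "1:1000000,1005000"                    # chromosome:start,stop

url = f"{DAS_SERVER}/{DATA_SOURCE}/features?segment={SEGMENT}"
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# DAS 1.x responses wrap features in <DASGFF><GFF><SEGMENT><FEATURE> elements.
for feature in tree.iter("FEATURE"):
    ftype = feature.findtext("TYPE", default="?")
    start = feature.findtext("START", default="?")
    end = feature.findtext("END", default="?")
    print(feature.get("id"), ftype, start, end)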


WP6 Activities


Round table “future hardware needs of computational
biology” at “Grand Challenges in Computational
Biology” in Barcelona


Initial informal survey of historical and recent growth at
several European Bioinformatics Centres


EBI and WTSI both engaged in projections of future
growth

Informal survey


Split out EBI/Sanger from others (numbers from EBI+Sanger are a factor of ~10 higher)


Steady growth, showing sharp recent increases

[Charts: installed cores; estimated CPU growth, given Moore’s law (doubling every 18 months); disk space]

Disk vs Compute


Bioinformatics has always been more data orientated than other IT-heavy sciences (arguably only astronomy is as data orientated)


This trend is accelerating with the arrival of next-gen sequencing and imaging


Data volumes on a par with LHC distribution volumes

Options for a sustainable physical
infrastructure


Continue to “ride” the decrease in cost per CPU and
disk in each institute


Unlikely to handle the spike in data growth, especially from next-generation sequencing and imaging


Be smarter


Coordination of archives and centres: centres mounting archives “locally” avoids duplication of data


Better compression: SRF/SRA 10/100-fold better compression than old-style “trace” information


More sophisticated pooling


Must be data aware (see the sketch below)
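As a toy illustration of the data-aware pooling idea above, the sketch below places a job at a centre that already mounts the archives it needs rather than copying data to the job; the centre names, archive names and core counts are invented for the example.

# Toy "data-aware" scheduler: place a job at a centre that already mounts
# the archives it needs, instead of duplicating the data. Names are invented.

ARCHIVE_LOCATIONS = {
    "trace_archive": {"EBI", "CentreA"},
    "ega":           {"EBI"},
    "imaging_store": {"CentreB"},
}

CENTRE_FREE_CORES = {"EBI": 200, "CentreA": 1500, "CentreB": 800}

def place_job(archives_needed):
    """Return the centre with the most free cores among those that mount
    local copies of every archive the job needs, or None if no centre does."""
    candidates = set(CENTRE_FREE_CORES)
    for archive in archives_needed:
        candidates &= ARCHIVE_LOCATIONS.get(archive, set())
    if not candidates:
        return None   # would force a bulk data copy -- the case to avoid
    return max(candidates, key=CENTRE_FREE_CORES.get)

print(place_job(["trace_archive"]))          # -> CentreA (most free cores)
print(place_job(["trace_archive", "ega"]))   # -> EBI (only common site)
print(place_job(["ega", "imaging_store"]))   # -> None (no single site holds both)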

Compute pooling trends

1990: Unix nice, VMS queues; one machine’s resource management

1995: “cheap” Linux boxes in clusters; dedicated, poor queuing

2000: LSF + Condor (becomes SunGrid Engine) used in life science to manage institute compute

2005: “GRID” computing developed for the LHC; virtualisation; cloud computing

Virtualisation


Xen and VMware are robust and sensible


Amazon offering commercial cloud services based on virtualisation (EC2)



Removes flexibility barriers for compute style


“linux” is your programming environment


Can move compute to data, not data to compute (sketched below)


Important shift in structure: previous GRIDs moved both data and compute remotely


Also useful for medical ethics issues


EBI & WTSI prototyping virtualisation


Ensembl in Amazon cloud; Xen for EGA access; WTSI
is GRID node
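A minimal sketch of the “move compute to data” pattern, under the assumption of a simple registry mapping datasets to hosting sites; the image names, dataset names and the submit function are hypothetical stand-ins for a real VM/cloud submission API, not the actual EBI/WTSI setup.

# "Move compute to data" sketch: package the analysis, look up where the
# dataset lives, and submit the job there. Everything here is a placeholder
# standing in for a real VM/cloud submission API (e.g. an EC2- or Xen-based setup).

from dataclasses import dataclass

DATASET_SITES = {                      # hypothetical registry: dataset -> hosting site
    "1000genomes_pilot": "ebi-cloud",
    "ega_study_123":     "wtsi-cloud",
}

@dataclass
class Job:
    image: str       # VM image containing the researcher's environment
    script: str      # analysis entry point inside the image
    dataset: str     # dataset to mount read-only at the hosting site

def submit_to_data_host(job: Job) -> str:
    """Pretend submission: in a real system this would call the hosting
    site's API; here we just report the placement decision."""
    site = DATASET_SITES[job.dataset]
    return (f"job using image '{job.image}' submitted to {site}; "
            f"dataset '{job.dataset}' mounted locally, only summary output returned")

print(submit_to_data_host(Job("my-analysis-vm", "run_variant_calling.sh",
                              "1000genomes_pilot")))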

Data shape


Two radically different datasets commonly used
in bioinformatics



“Your” or “local” data, from experiments as part
of this study


The corpus of current information in both
archived and processed formats

Openness




Privacy


Can you have both?



Biology getting closer to medicine

ICGC Database Model

Current approach to providing
access to EGA / DbGaP data


Genotypes / Sequence de-identified, but potentially re-identifiable


Summary data publicly available


Access to individual data requires registration


Current approach to providing
access to EGA / DbGaP data


Genotypes / Sequence de-identified, but potentially re-identifiable


Summary data publicly available


Access to individual data requires registration


Risk:


registered individuals (2nd party) allowed download access (encrypted)


will 2nd party provide appropriate security to prevent leak to 3rd party?

Future Human Genetics Data


Now: long-term case-control studies


Genotype all (now: WTCCC)


Sequence all (future)


Future: study design no longer just around individual
diseases


UK Biobank (just 500,000 people over 40)


UK NHS patient records (whole population)

Hard to be totally anonymous and
still useful


Patient records are anonymised, but associated data makes them potentially re-identifiable (see the sketch below)


Height, weight, age


Location (county, town, postcode?)


Need access to this data to carry out useful analysis


e.g. need to calculate correlations with postcodes to investigate
environmental effects
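As a small illustration of this re-identification risk, the sketch below counts how many records in a toy table are unique on the quasi-identifiers listed above; the records and field values are invented for the example.

# Toy illustration of re-identification risk: count records that are unique
# on the quasi-identifiers (age, height, weight, postcode). Invented data.
from collections import Counter

records = [
    # (age, height_cm, weight_kg, postcode)
    (52, 178, 81, "CB10"),
    (52, 178, 81, "CB10"),   # shares its quasi-identifiers with the record above
    (47, 165, 62, "CB22"),   # unique combination -> potentially re-identifiable
    (61, 170, 90, "SW1A"),   # unique combination -> potentially re-identifiable
]

counts = Counter(records)
unique = [r for r in records if counts[r] == 1]
print(f"{len(unique)} of {len(records)} records are unique on "
      f"(age, height, weight, postcode)")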


Secure analysis of private data


Privacy is an issue


Public happy to contribute to health research


Public not happy to discover personal details have been lost from laptops
/ DVDs etc.


3 potential solutions


“Fuzzify” data accessible for research


Social revolution (personal genetic openness)


Technical solution

Honest Broker


Virtual machines attached to data


Researcher can carry out any data analysis
they wish (with their own code), but is
guaranteed to only be able to see “privacy safe”
summary results
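A minimal sketch of this honest-broker idea: researcher-supplied code runs against the raw data inside the secure environment, but only aggregate summaries are released. The minimum group size and the choice of summary statistics are illustrative assumptions, not a specification.

# Honest-broker sketch: run arbitrary researcher code against raw data,
# but export only "privacy safe" aggregate summaries. The minimum group
# size of 10 is an illustrative assumption, not a defined policy.
import statistics

MIN_GROUP_SIZE = 10   # refuse to release summaries over very small groups

def honest_broker(raw_data, analysis):
    """Apply the researcher's `analysis` function to the raw data inside the
    secure environment, then release only count/mean/stdev of its output."""
    values = [analysis(record) for record in raw_data]
    if len(values) < MIN_GROUP_SIZE:
        raise PermissionError("group too small: summary withheld")
    return {                      # summary only -- raw values never leave
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

# Example: the researcher asks for a derived quantity per record (BMI here),
# and receives only the aggregate, never the individual measurements.
cohort = [{"height_m": 1.6 + 0.01 * i, "weight_kg": 60 + i} for i in range(25)]
print(honest_broker(cohort, lambda r: r["weight_kg"] / r["height_m"] ** 2))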

Honest Broker

[Diagram sequence: a researcher outside the secure environment sends requests (“Correlate A & B”, “Correlate A & C”, “Run X on A, B & C”) against data sets A, B and C held inside the secure environment; in the final step algorithm X runs inside a virtual machine attached to the data; only summary results (no raw data) are returned.]


Virtual machine (VM):



VM has sole access to raw data.



Algorithms implement analysis within VM.



VM guarantees that only summary data can be exported

Existing examples:



cloud computing: Amazon EC2



iPhone SDK (all software is developed against the SDK, with controlled access)
