An Introduction to CAMERA and Underlying
Technologies

Philip Papadopoulos

University of California, San Diego

San Diego Supercomputer Center

California Institute of Telecommunications and
Information Technology (Calit2)

PI Larry Smarr

Announced 17 Jan 2006. Public Release 13 March 2007

$24.5M Over Seven Years

DNA Basics for Non-Biologists


Nucleotide bases of DNA


ACGT (Adenine, Cytosine, Guanine, Thymine)


A Sequence of Bases Forms One Side of a DNA
Strand


Complementary Bases form the other side of
DNA


A matches T (pair)


C matches G (pair)


During cell replication, DNA is “unzipped”. Each side then serves as a
template from which its complementary side is rebuilt



Human DNA is about 3 billion base pairs on 23 pairs of
Chromosomes
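
The pairing rule above (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example and are not part of any CAMERA tool.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Complement written in the opposite direction, the usual convention
    for reporting the other side of the double strand."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("GGGAAACCC"))          # CCCTTTGGG
    print(reverse_complement("GGGAAACCC"))  # GGGTTTCCC
```

Applying `complement` twice returns the original strand, which is why either side of the DNA is enough to rebuild the other.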



Bases


Amino Acids


Triplets of nucleotide bases are called codons and define
amino acids.


Amino acids are the basic building blocks of proteins


There are 20 amino acids, but 4^3 = 64 nucleotide combinations.


Many amino acids have multiple codons


Special codons (called start and stop codons) mark where translation of a
gene into protein begins and ends.



Reading Frames of: GGGAAACCC


This raw sequence could be read as


GGGAAACCC (GGG AAA CCC) (Glycine, Lysine, Proline)


GGAAACCC (GGA AAC) (Glycine, Asparagine)


GAAACCC (GAA ACC) (Glutamic Acid, Threonine)
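
The three readings of GGGAAACCC can be reproduced with a short Python sketch. The codon table here is deliberately partial (only the seven codons in this example); a real translation would need all 4^3 = 64 codons, including the stop codons. The function names are illustrative.

```python
# Partial codon table: only the codons needed for the GGGAAACCC example.
CODON_TABLE = {
    "GGG": "Glycine", "GGA": "Glycine",
    "AAA": "Lysine", "AAC": "Asparagine",
    "CCC": "Proline",
    "GAA": "Glutamic Acid", "ACC": "Threonine",
}


def read_frame(sequence: str, offset: int) -> list:
    """Split the sequence into complete codons, starting at offset 0, 1, or 2.
    Trailing bases that do not fill a whole codon are dropped."""
    usable = sequence[offset:]
    return [usable[i:i + 3] for i in range(0, len(usable) - 2, 3)]


def translate(sequence: str, offset: int) -> list:
    """Map each codon in the chosen frame to an amino acid name ('?' if unknown)."""
    return [CODON_TABLE.get(codon, "?") for codon in read_frame(sequence, offset)]


if __name__ == "__main__":
    for offset in range(3):
        print(offset, read_frame("GGGAAACCC", offset), translate("GGGAAACCC", offset))
```

Shifting the start by a single base changes every codon that follows, which is why the same raw read yields three different amino acid sequences.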

Sequencing Tidbits


The Institute for Genomic Research (TIGR) sequenced the genome of the
bacterium Haemophilus influenzae in 1995 using shotgun sequencing



1.8 Million Base Pairs (Human: 3 Billion)



Sequencing does NOT tell you what function a particular gene plays



It is believed that only ~1.5% of the human genome codes for expressed
characteristics


The non-coding portions contain our genetic history


It is unknown what function the rest of our DNA plays

Most of Evolutionary Time Was in the Microbial World

You Are Here

Source: Carl Woese, et al

Tree of Life Derived from 16S rRNA Sequences

Marine Genome Sequencing Project



Measuring the Genetic Diversity of Ocean Microbes

Sorcerer II Data Will Double the Number of Proteins in GenBank!

Need Ocean Data

Some CAMERA Goals


Provide an infrastructure where scientists from around the world can
perform analysis on genetic communities


Global Ocean Sampling (GOS) is the initial large data set


~ 8.5 Billion base pairs of raw Reads


Metadata is available for samples


Salinity, Temperature, Geographic Location, Water Depth, Time of Day …


Other metadata will be correlated with samples (e.g. MODIS Satellite)



Allow others to search and compare input sequences against CAMERA data.



Overall provide a resource dedicated to metagenomics


Support new datasets


Support new analysis tools and web services


Global Ocean Sampling (GOS) Sequences are Largely
Bacterial

Source: Shibu Yooseph, et al. (PLOS Biology in press 2006)

~3 Million Previously Known Sequences

~5.6 Million GOS Sequences

Reason for CAMERA


The Global Ocean Sampling (GOS) data set is a huge influx of
sequence data


Factors that interrelate microbes and microbial
communities are not well known


Significant analysis requires large resources


All-to-all comparisons


Integration of other environmental (meta) data (weather,
temperature, salinity,…) is essential


Raw Sequence Data sets are mid-sized


Current set of GOS Raw Reads is about 100GB (FASTA
Files)
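
To see why all-to-all comparisons need large resources even though the raw files are only mid-sized, a quick count of sequence pairs helps. The sketch below uses the sequence counts quoted earlier in this deck (~5.6 million GOS, ~3 million previously known); treating every sequence as one unit of comparison is a simplifying assumption.

```python
def all_to_all_pairs(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


GOS_SEQUENCES = 5_600_000     # "~5.6 Million GOS Sequences"
KNOWN_SEQUENCES = 3_000_000   # "~3 Million Previously Known Sequences"

gos_pairs = all_to_all_pairs(GOS_SEQUENCES)
combined_pairs = all_to_all_pairs(GOS_SEQUENCES + KNOWN_SEQUENCES)

if __name__ == "__main__":
    print(f"GOS only: {gos_pairs:.3e} pairs")
    print(f"GOS + known: {combined_pairs:.3e} pairs")
```

The pair count grows with the square of the number of sequences, so the work is on the order of 10^13 comparisons while the input stays near 100 GB.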



Calit2 CAMERA Production

Compute and Storage Complex is On-Line

512 Processors

~5 Teraflops

~ 200 Terabytes Storage

User Map


03 May 2007



Site in production on 13 March 2007



More than 500 Registered users from
around the globe (~10 new users/day)

Calit2’s Direct Access Core Architecture

CAMERA’s Metagenomics Server Complex

[Diagram labels: Traditional User (Request / Response); Web Portal + Web Services; Web (other service); Local Cluster and Local Environment with Direct Access Lambda Connections; 10 GigE Fabric; Data-Base Farm; Flat File Server Farm; Dedicated Compute Farm (100s of CPUs); TeraGrid: Cyberinfrastructure Backplane (scheduled activities, e.g. all by all comparison) (10000s of CPUs)]

Source: Phil Papadopoulos, SDSC, Calit2

Sargasso Sea Data


Sorcerer II Expedition (GOS)


JGI Community Sequencing Project


Moore Marine Microbial Project


NASA and NOAA Satellite Data


Community Microbial Metagenomics Data

Calit2 CAMERA Production

Compute and Storage Complex is On-Line

Compute Nodes

1 and 10 Gbit/s Switching

200 TB File Storage

10 Gbit/s Network

Web, Application, DB Servers

Global Elements


Data location


Storage Resource Broker metadata catalog


Data-type aggregation, cross-correlation, integration


BIRN Data Mediator


Identity Management


Use Grid Security Infrastructure (GSI) Public Key
System


Integrated Grid Account Management Architecture
(GAMA) from SDSC for ease-of-use and Single Sign On


Portal Services


Based on GridSphere


Small Dedicated Compute Cluster (32 nodes)




Cluster Nodes and File Servers

Logical Layout of Servers

Web Server

Portal Server (Tomcat)

Single Sign-on Server

Postgres Database

GAMA Server

Blast Master (Jboss)

Cluster Frontend

Single Sign On Layer

Public Net

Private Net

An Incomplete List of Software Components


Postgres Database


Apache Tomcat


Jboss Servlet Container


Google Web Toolkit


Sun Grid Engine


GAMA (Grid Account Management Architecture)/GSI from Globus


OPAL (Grid/Web Services Wrapper)


GridSphere Portlet Container


CAMERA Registration Portal


Venter Application Portal


NCBI Blast, MPIBlast, ClustalW, MrBayes, CDHit, and a host of other Bio
Software


Ergatis Workflow Engine


JForum


Drupal


All Integrated with Rocks … Single Person Deployment


OptIPortal


Another Rocks Cluster

Termination Device for the OptIPuter Global Backplane


20 Dual CPU Nodes, 20 24” Monitors, ~$50,000


1/4 Teraflop, 5 Terabyte Storage, 45 Mega Pixels -- Nice PC!


Scalable Adaptive Graphics Environment (SAGE), Jason Leigh, EVL-UIC

Source: Phil Papadopoulos SDSC, Calit2

Use of OptIPortal to Interactively View Microbial Genome

Source: Raj Singh, UCSD

Acidobacteria bacterium Ellin345 (NCBI)

Soil Bacterium 5.6 Mb

15,000 x 15,000 Pixels

A Look at Networking

Introduction to Quartzite

An Experimental Network

Sunlight (10 Gigabit) Campus/WAN

Using a Lambda Network for CAMERA


Many community databases



Protein Databank (PDB)


GenBank


SwissProt


Support only web or web services interfaces


New analysis/programs need access to raw databases/files


Usually, groups make a point-in-time copy of the database


We call this a data “fork”


Updates are not processed


Papers published with point-in-time data are out of date by months or
years


CAMERA “Direct Connect” will allow us to provide a high-speed connection
to the backend servers


Try to eliminate data forking


Copies of CAMERA data are inevitable


Need mechanisms that allow others to keep their copies in sync with
CAMERA
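
The slides do not say what the synchronization mechanism would be. One common approach, sketched here purely as an assumption, is for the data provider to publish a manifest of file checksums that a remote site compares against its local copy. The manifest format (file name to SHA-256 hex digest) and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks so large FASTA files
    do not have to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def stale_files(manifest: dict, local_dir: Path) -> list:
    """Return the names (sorted) whose local copy is missing or whose
    checksum differs from the published manifest (hypothetical format)."""
    stale = []
    for name, expected in sorted(manifest.items()):
        local = local_dir / name
        if not local.is_file() or file_digest(local) != expected:
            stale.append(name)
    return stale
```

A site holding a point-in-time copy would re-fetch only the files that `stale_files` reports, rather than forking the whole data set again.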


UCSD Quartzite Core at Completion (Year 5 of OptIPuter)

[Diagram labels: Quartzite Core; Quartzite Communications Core Year 3; DWDM; CalREN-HPR Research Cloud; Campus Research Cloud; Wavelength Selective Switch (Lucent); GlimmerGlass 128 port OOO; Juniper T320; Force10 E1200; 32 10 GigE; 4 GigE, 4 pair fiber; three GigE Switches with Dual 10 GigE Uplinks, each to cluster nodes; links to 10 GigE cluster node interfaces and other switches; GigE and 10 GigE links to other nodes]


Funded 15 Sep 2004



Physical HW to Enable OptIPuter and Other Campus
Networking Research



Hybrid Network Instrument

Reconfigurable Network and Endpoints

AT&T Labs, October 2007

4x4 Wavelength Cross-Connect:


All integrated optics (except optical amplifiers)


4 1x4 WSS modules


4 4x1 passive optical combiners


4 x 40 λ x 40 Gbps = 6.4 Tbps switching capacity


currently using central 8 λ
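
The switching capacity figure is a straight product of ports, wavelengths (λ) per port, and line rate per wavelength. The short check below reproduces it; the second figure, for the central 8 λ, assumes the same 40 Gbps per wavelength, which the slide does not state.

```python
def switching_capacity_gbps(ports: int, wavelengths: int, gbps_per_wavelength: int) -> int:
    """Aggregate capacity = ports x wavelengths per port x rate per wavelength."""
    return ports * wavelengths * gbps_per_wavelength


full_capacity = switching_capacity_gbps(4, 40, 40)     # 6400 Gbps = 6.4 Tbps
# Assumption: the 8 wavelengths in use would also run at 40 Gbps each.
central_capacity = switching_capacity_gbps(4, 8, 40)   # 1280 Gbps

if __name__ == "__main__":
    print(full_capacity / 1000, "Tbps")
    print(central_capacity / 1000, "Tbps")
```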

[Figure labels: 4x4 WXC rack with four 1x4 WSS modules (WSSs), combiners, and optical amps]


WXC performance demonstration:

[Figure labels: ASE source, four 1x4 WSS modules, 4x1 switch, OSA]

8 lasers at centre of C-Band at 100 GHz spacing

use ASE source to illustrate wide bandwidth

1. use external 4x1 switch to scan WXC ports

2. alter switch states of WSS1 and WSS3

shown in movie on next page

      WSS1   WSS2   WSS3   WSS4
λ1    1      2      3      4
λ2    2      3      4      1
λ3    3/1    4      1/3    2
λ4    4      1      2      3
λ5    1      2      3      4
λ6    2      3      4      1
λ7    3/1    4      1/3    2
λ8    4      1      2      3
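
The per-wavelength WSS states listed above appear to follow a cyclic pattern: each wavelength shifts the state by one, and each WSS is offset by one from its neighbour. The sketch below encodes that observed pattern. The formula is inferred from the listed values rather than taken from the presenters, and it covers only the base states, not the alternate 3/1 and 1/3 states of WSS1 and WSS3.

```python
def wss_state(wavelength: int, wss: int, ports: int = 4) -> int:
    """Inferred cyclic pattern: state = ((wavelength - 1) + (wss - 1)) mod ports + 1."""
    return ((wavelength - 1) + (wss - 1)) % ports + 1


# Rows are wavelengths 1..8, columns are WSS1..WSS4.
table = {lam: [wss_state(lam, w) for w in range(1, 5)] for lam in range(1, 9)}

if __name__ == "__main__":
    for lam, row in table.items():
        print(f"wavelength {lam}: {row}")
```

In every row the four modules take four different states, so under this pattern no two WSS modules select the same state for the same wavelength.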


What Does it Cost to Drive the Network


Dominant cost is DWDM optics



Construction of Multiplexers is Simple and not expensive: ~$250/Channel/End

Layer 1: Four Channel DWDM (Channels 31, 32, 33, 34 over 1 Fiber Pair)

10Gbps Switch x 4 Per Side (optional)

XFP Switch Module x 4 Per Side (optional)

XFP DWDM Optics x 4 Per Side, Used in Host or Switch

SC to LC Fiber 2M x 5 Per Side

DWDM Mux, Transmit, x 1 Per Side

DWDM DeMux, Receive, x 1 Per Side

Corning 1U Rack Containing DWDM Mux / DeMux + SC to SC couplers, 1 Per Side


1) Optics

SFP/XFP Optics Costs (DWDM Optics from AACTelecom)

10 Gbps Luminent XFP DWDM per unit (ZR 80 km), OC-192 and 10GE compatible: 3500 US

10 Gbps Luminent (assembled in US) XFP DWDM per unit (ER 40 km), OC-192 and 10GE compatible: 2900 US

1 Gbps SFP DWDM per unit (80 km model), OC-48 compliant and 1 GE compatible: 1220 US

10 Gbps non-DWDM 1310 nm (LR 10 km model): 700 US

2) Optional - Layer 2 Switch (10Gbps capable)

10 Gbps capable switch: SMC8748L2 (A0707505) + EXP MOD-10G (A0707506) from Dell

Switch, 2 x 10 Gbps XFP ports, 48 x 1 Gbps Copper: 1700 US

10 Gbps module (holds XFP): 300 US

3) DWDM Mux DeMux

DWDM Mux DeMux (SC connector type), 4, 8, 16 channel = DWDM-100, from oemarket.com

4 Channel (31, 32, 33, 34): 560 US

8 Channel: 880 US

16 Channel: 1600 US (approx)

4) Corning Rack Mount, Couplers, Fiber

Corning Mux DeMux container, 1U rack mount: Corning PCH-01U from Ed Carlin Graybar

1 U (sufficient for 4, 8 or 16 channel): 200 US

2 sets of SC to SC adaptors: 100 US (approx)

Fiber Patch Cables, Single Mode, from Ed Carlin Graybar

2M, SC to LC connector type: 30 US (approx) each

Complete Solution

5) Optional - DWDM Media Converter

DWDM to Copper Media Converter, from Carl Stelling at Aaxeon.com

SFP pluggable DWDM to copper media converter: 150 US each, not including DWDM optics (just converter)
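
The "~$250/Channel/End" figure quoted earlier can be checked against the prices listed above for a four-channel link. The tally below is a rough sketch. It assumes that the 560 US four-channel unit covers both mux and demux for one end, and it uses the 2900 US (ER 40 km) XFP for the optics line.

```python
CHANNELS = 4

# Passive parts for one end of the link, prices in US dollars from the slides.
passive_per_end = {
    "4-channel DWDM mux/demux": 560,   # assumption: one unit serves one end
    "Corning 1U rack mount": 200,
    "2 sets of SC to SC adaptors": 100,
    "5 x SC to LC patch cable at 30 each": 5 * 30,
}

passive_total = sum(passive_per_end.values())
passive_per_channel = passive_total / CHANNELS

# Active optics for one end: one XFP DWDM per channel (ER 40 km model).
optics_per_end = CHANNELS * 2900

if __name__ == "__main__":
    print("passive per end:", passive_total)
    print("passive per channel per end:", passive_per_channel)
    print("DWDM optics per end:", optics_per_end)
```

Under these assumptions the passive multiplexing hardware comes to about 250 US per channel per end, while the DWDM optics cost more than ten times as much, consistent with the statement that optics are the dominant cost.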

Quartzite State Nov 2007


Core Packet Switch with 68 10 GigE ports (More than ½ Terabit)


Approximately 30 Channels Lit


64-port All-Optical Glimmerglass Switch - All Fiber into Quartzite is
switchable


4 port x 8 Lambda DWDM switch at Lucent (On site at Calit2 in Dec)


4 Channel DWDM Between Calit2 and SDSC


One channel is used for 10Gigabit Production to BIRN Data Racks.



Ordered, but waiting for fulfillment


20 Mux/Demux (8 C-band DWDM Channels + 1 1310 (LR) Passband)


32 DWDM XFPs (Channels 40-43); will fill out the rest of the channels in 2008