Genetic Sequence Analysis in the Clouds:
Applications of MapReduce to the Life Sciences

Jimmy Lin, Michael Schatz, and Ben Langmead

University of Maryland


Wednesday, June 10, 2009

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Cloud Computing @ Maryland


Teaching


Cloud computing course (version 1.0): Spring 2008

Part of the Google/IBM Academic Cloud Computing Initiative


Cloud computing course (version 2.0): Fall 2008

Sponsored by Amazon Web Services through a teaching grant


Research


Web-scale text processing


Statistical machine translation


Bioinformatics

[Figure: word alignment example]

Spanish: Maria | no | dio | una | bofetada | bruja | verde

English: Mary | did not | a | slap | witch | green

Phrase pair: bruja verde / green witch

Learning Translation Models

Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique.

Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz.

Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro.

These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious.

We built systems for “learning” translation models in Hadoop…

… sort of like the word count example, but with more math


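[Figure: translation options for the spans of a Spanish sentence]

Source: Maria no dio una bofetada a la bruja verde

Candidate English phrases: Mary, not, did not, no, did not give, give, a, slap, to, the, witch, green, slap, a slap, to the, to, the, green witch, the witch, by, slap

Example from Koehn (2006)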

Translation as a “Tiling” Problem

From Text to DNA Sequences


Text processing: [0-9A-Za-z]+


DNA sequence processing: [ATCG]+

(Nope, not really)

Michael Schatz (Ph.D. student, Computer Science; Spring 2008)

Ben Langmead (M.S. student, Computer Science; Fall 2008)

Analogy

(And two disclaimers)

Strangely-Formatted Manuscript


Dickens:
A Tale of Two Cities


Text written on a long spool









It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

… With Duplicates


Dickens:
A Tale of Two Cities


“Backup” on four more copies










It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

Shredded Book Reconstruction


Dickens accidentally shreds the manuscript










How can he reconstruct the text?


5 copies x 138,656 words / 5 words per fragment = 138k fragments


The short fragments from every copy are mixed together


Some fragments are identical



It was the best of

of times, it was the

times, it was the worst

age of wisdom, it was

the age of foolishness, …

It was the best

worst of times, it was

of times, it was the

the age of wisdom, it

was the age of foolishness,

It was the

the worst of times, it

best of times, it was

was the age of wisdom,

it was the age of

foolishness, …

It was

was the worst of times,

the best of times, it

it was the age of

wisdom, it was the age

of foolishness, …

It

it was the worst of

was the best of times,

times, it was the age

of wisdom, it was the

age of foolishness, …

Overlaps


Generally prefer longer overlaps to shorter overlaps


In the presence of error, we might allow the
overlapping fragments to differ by a small amount

It was the best of

of times, it was the

best of times, it was

times, it was the worst

was the best of times,

the best of times, it

it was the worst of

was the worst of times,

worst of times, it was

of times, it was the

times, it was the age

it was the age of

was the age of wisdom,

the age of wisdom, it

age of wisdom, it was

of wisdom, it was the

wisdom, it was the age

it was the age of

was the age of foolishness,

the worst of times, it

It was the best of + was the best of times, → 4 word overlap

It was the best of + of times, it was the → 1 word overlap

It was the best of + of wisdom, it was the → 1 word overlap
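The overlap counts above can be checked with a few lines of Python. This is a minimal sketch that assumes exact word matching (no tolerance for errors); the function name `overlap` is illustrative and not from the slides.

```python
def overlap(left, right):
    """Number of words by which a suffix of `left` matches a prefix of `right`."""
    a, b = left.split(), right.split()
    # Try the longest candidate first, since longer overlaps are preferred.
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


print(overlap("It was the best of", "was the best of times,"))
print(overlap("It was the best of", "of times, it was the"))
print(overlap("It was the best of", "of wisdom, it was the"))
```

Allowing fragments "to differ by a small amount" would mean replacing the exact comparison with a mismatch count and a threshold, which this sketch does not attempt.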

Greedy Assembly


The repeated sequence makes the correct
reconstruction ambiguous

It was the best of

of times, it was the

best of times, it was

times, it was the worst

was the best of times,

the best of times, it

of times, it was the

times, it was the age

It was the best of

of times, it was the

best of times, it was

times, it was the worst

was the best of times,

the best of times, it

it was the worst of

was the worst of times,

worst of times, it was

of times, it was the

times, it was the age

it was the age of

was the age of wisdom,

the age of wisdom, it

age of wisdom, it was

of wisdom, it was the

wisdom, it was the age

it was the age of

was the age of foolishness,

the worst of times, it
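A rough sketch of greedy assembly, under the assumption that fragments match exactly and that the pair with the longest suffix/prefix overlap is merged first. The names `overlap` and `greedy_assemble` are illustrative. The last two lines show where the ambiguity comes from: the fragment "of times, it was the" has two equally good successors.

```python
def overlap(a, b):
    """Longest suffix of word list `a` that equals a prefix of word list `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    # Drop exact duplicate fragments while keeping their first-seen order.
    frags = [f.split() for f in dict.fromkeys(fragments)]
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:  # ties are broken by scan order
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return [" ".join(f) for f in frags]


print(greedy_assemble([
    "It was the best of",
    "was the best of times,",
    "the best of times, it",
]))

# The repeat: both continuations overlap the same fragment by 4 words.
start = "of times, it was the".split()
print([overlap(start, c.split())
       for c in ("times, it was the worst", "times, it was the age")])
```

Because ties are resolved by scan order here, a greedy pass can follow either branch after the repeated phrase, which is the ambiguity the slide points out.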

The Real Problem

(The easier version)

GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGC
AATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT

[Figure: Subject genome (?) → Sequencer → Reads]

DNA Sequencing

ATCTGATAAGTCCCAGGACTTCAGT

GCAAGGCAAACCCGAGCCCAGTTT

TCCAGTTCTAGAGTTTCACATGATC

GGAGTTAGTAAAAGTCCACATTGAG


Genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G

Bacteria: ~5 million bp

Humans: ~3 billion bp

Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp)

Shorter reads, but much higher throughput

Per-base error rate estimated at 1-2% (Simpson et al., 2009)

Recent studies of entire human genomes have used 3.3 billion (Wang et al., 2008) and 4.0 billion (Bentley et al., 2008) 36 bp reads

~144 GB of compressed sequence data
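To put these numbers side by side, the short calculation below multiplies read count by read length and divides by genome size. It assumes the ~3 billion bp human genome size quoted above; the function names are illustrative. The slides do not say how the ~144 GB figure was derived, so the code makes no claim about that.

```python
GENOME_BP = 3_000_000_000  # ~3 billion bp, as quoted on the slide
READ_LEN = 36              # bp per read


def total_bases(num_reads, read_len=READ_LEN):
    """Total number of sequenced bases across all reads."""
    return num_reads * read_len


def coverage(num_reads, read_len=READ_LEN, genome_bp=GENOME_BP):
    """Average number of times each genome position is sequenced."""
    return total_bases(num_reads, read_len) / genome_bp


for study, reads in [("Wang et al., 2008", 3_300_000_000),
                     ("Bentley et al., 2008", 4_000_000_000)]:
    print(f"{study}: {total_bases(reads) / 1e9:.1f} Gbp, "
          f"about {coverage(reads):.0f}x coverage")
```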

How do we put Humpty Dumpty back together?

Human Genome

11 years, cost $3 billion… your tax dollars at work!

A complete human DNA sequence was published in
2003, marking the end of the Human Genome Project

CGGTCTAGATGCTTAGCTATGCGGGCCCCTT

Reference sequence

Alignment

GCTTATCTAT
TTATCTATGC
ATCTATGCGG
ATCTATGCGG
GCTTATCTAT
TCTAGATGCT
CTATGCGGGC
CTAGATGCTT
ATCTATGCGG
CTATGCGGGC
ATCTATGCGG

Subject reads

CGGTCTAGATGCTTATCTATGCGGGCCCCTT

GCTTATCTAT
TTATCTATGC
ATCTATGCGG
ATCTATGCGG
GCTTATCTAT
GGCCCCTT
GCCCCTT
CCTT
CGG
CGGTC
CGGTCT
CGGTCTAG
TCTAGATGCT
CTATGCGGGC
CTAGATGCTT
CTT
ATGCGGGCCC

Reference sequence

Subject reads

Reference: ATGAACCACGAACACTTTTTTGGCAACGATTTAT

Query: ATGAACAAAGAACACTTTTTTGGCCACGATTTAT…

[Figure labels: Insertion, Deletion, Mutation]

CloudBurst

1. Map: Catalog k-mers

Emit every k-mer in the genome and non-overlapping k-mers in the reads

Non-overlapping k-mers sufficient to guarantee an alignment will be found

[Figure: Map over Human chromosome 1, Read 1, Read 2]

2. Shuffle: Coalesce Seeds

Hadoop internal shuffle groups together k-mers shared by the reads and the reference

Conceptually build a hash table of k-mers and their occurrences

[Figure: shuffle]

3. Reduce: End-to-end alignment

Locally extend alignment beyond seeds by computing “match distance”

If read aligns end-to-end, record the alignment

[Figure: Reduce output]

Read 1, Chromosome 1, 12345-12365

Read 2, Chromosome 1, 12350-12370
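The three phases above can be imitated in plain Python. This is only a sketch of the seed-and-extend idea, not CloudBurst's code: it runs in one process instead of on Hadoop, handles mismatches only (no insertions or deletions), and every function name is hypothetical. It also assumes the seed length is the read length divided by (allowed mismatches + 1), so that a read with at most that many mismatches must contain one exact seed. The reference and reads are taken from the alignment slide.

```python
from collections import defaultdict


def map_reference(name, seq, s):
    """Map: emit every (overlapping) s-mer of the reference with its position."""
    for pos in range(len(seq) - s + 1):
        yield seq[pos:pos + s], ("ref", name, pos)


def map_read(name, seq, s):
    """Map: emit only the non-overlapping s-mers of a read with their offsets."""
    for off in range(0, len(seq) - s + 1, s):
        yield seq[off:off + s], ("read", name, off)


def shuffle(pairs):
    """Shuffle: group values by seed, as Hadoop's shuffle does by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values, reference, reads, max_mismatches):
    """Reduce: extend each shared seed to a full-length, end-to-end comparison."""
    ref_hits = [v for v in values if v[0] == "ref"]
    read_hits = [v for v in values if v[0] == "read"]
    for _, ref_name, pos in ref_hits:
        for _, read_name, off in read_hits:
            read = reads[read_name]
            start = pos - off
            if start < 0 or start + len(read) > len(reference[ref_name]):
                continue  # read would hang off the end of the reference
            window = reference[ref_name][start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                yield read_name, ref_name, start, mismatches


def align(reference, reads, max_mismatches):
    shortest = min(len(r) for r in reads.values())
    s = shortest // (max_mismatches + 1)  # seed length (pigeonhole argument)
    pairs = []
    for name, seq in reference.items():
        pairs.extend(map_reference(name, seq, s))
    for name, seq in reads.items():
        pairs.extend(map_read(name, seq, s))
    hits = set()  # the same alignment may be found through several seeds
    for values in shuffle(pairs).values():
        hits.update(reduce_seed(values, reference, reads, max_mismatches))
    return sorted(hits)


reference = {"ref": "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"}
reads = {"read1": "GCTTATCTAT", "read2": "TCTAGATGCT"}
for hit in align(reference, reads, max_mismatches=1):
    print(hit)  # (read name, reference name, start position, mismatches)
```

Emitting only non-overlapping seeds from the reads keeps the map output small, while emitting every seed from the reference ensures that any exact seed in a read can meet its reference position in the shuffle.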

[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads (0-8); y-axis: Runtime (s) (0-16000); one series per number of mismatches (0-4)]

[Figure: Running Time vs Number of Reads on Chr 22. x-axis: Millions of Reads (0-8); y-axis: Runtime (s) (0-3000); one series per number of mismatches (0-4)]

Results from a small, 24-core cluster, with different numbers of mismatches

Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.

[Figure: Running Time on EC2, High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s) (0-1800)]

CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on EC2

Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.

What’s Next?

(Michael Schatz’s Ph.D. dissertation)

Wait, no reference?

de Bruijn Graph Construction


D_k = (V, E)

V = All length-k subfragments (k < l, where l is the fragment length)

E = Directed edges between consecutive subfragments

Nodes overlap by k-1 words

Locally constructed graph reveals the global sequence structure

Overlaps implicitly computed

Original fragment: It was the best of

Directed edge: It was the best → was the best of

de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001
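The construction above can be sketched for word-level fragments as follows. The function name `de_bruijn` and the dictionary-of-sets representation are choices made for this illustration, not something the slides specify.

```python
def de_bruijn(fragments, k):
    """Build a word-level de Bruijn graph.

    Nodes are the length-k subfragments (as tuples of words); a directed
    edge joins each subfragment to the next one in the same fragment, so
    connected nodes overlap by k-1 words.
    """
    edges = {}
    for fragment in fragments:
        words = fragment.split()
        kmers = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
        for node in kmers:
            edges.setdefault(node, set())  # keep nodes with no outgoing edge
        for src, dst in zip(kmers, kmers[1:]):
            edges[src].add(dst)
    return edges


graph = de_bruijn(["It was the best of"], k=4)
for src, dsts in graph.items():
    print(" ".join(src), "->", [" ".join(d) for d in dsts])
```

No pairwise comparison of fragments is needed: two fragments that share a length-k subfragment end up touching the same node, which is the sense in which overlaps are computed implicitly.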

de Bruijn Graph Assembly

the age of foolishness

It was the best

best of times, it

was the best of

the best of times,

of times, it was

times, it was the

it was the worst

was the worst of

worst of times, it

the worst of times,

it was the age

was the age of

the age of wisdom,

age of wisdom, it

of wisdom, it was

wisdom, it was the

Compressed de Bruijn Graph







Unambiguous non-branching paths replaced by single nodes

An Eulerian traversal of the graph spells a compatible reconstruction of the original text

There may be many traversals of the graph

Different sequences can have the same string graph

It was the best of times, it was the worst of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …



of times, it was the

It was the best of times, it

it was the age of

the age of wisdom, it was the

it was the worst of times, it

the age of foolishness
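The compression step can be sketched as below, assuming the graph is built directly from the sentence with k = 4 words and that a path is extended only while the current node has a single successor and that successor has a single predecessor. The names `build_graph` and `compress` are hypothetical, and finding an Eulerian traversal is not attempted here.

```python
def build_graph(text, k):
    """Word-level de Bruijn graph of `text`: node -> set of successor nodes."""
    words = text.split()
    kmers = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
    edges = {}
    for node in kmers:
        edges.setdefault(node, set())
    for src, dst in zip(kmers, kmers[1:]):
        edges[src].add(dst)
    return edges


def compress(edges):
    """Replace each unambiguous non-branching path by a single label."""
    preds = {node: set() for node in edges}
    for src, dsts in edges.items():
        for dst in dsts:
            preds[dst].add(src)

    def continues(u, v):
        # u -> v is unambiguous if u has one way out and v has one way in.
        return len(edges[u]) == 1 and len(preds[v]) == 1

    labels = []
    for node in edges:
        if len(preds[node]) == 1 and continues(next(iter(preds[node])), node):
            continue  # interior of a path that starts elsewhere
        path = [node]
        while len(edges[path[-1]]) == 1:
            nxt = next(iter(edges[path[-1]]))
            if not continues(path[-1], nxt):
                break
            path.append(nxt)
        # Spell the path: first node, then the last word of each later node.
        labels.append(" ".join(path[0] + tuple(n[-1] for n in path[1:])))
    return labels  # note: an isolated cycle with no branch point is skipped


text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness")
for label in compress(build_graph(text, k=4)):
    print(label)
```

On this sentence the sketch yields six compressed nodes, matching the six listed above.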

Hadoopification


(Stay tuned!)

Cloud worthy?

How much data?

Bottom Line: Bioinformatics


Great use case of Hadoop


Interesting computer science problems


Help unravel life’s mysteries?

Questions?

Comments?

Thanks to the organizations who support our work: