Genetic Sequence Analysis in the Clouds:
Applications of MapReduce to the Life Science
Jimmy Lin, Michael Schatz, and Ben Langmead
University of Maryland
Wednesday, June 10, 2009
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Cloud Computing @ Maryland
Teaching
Cloud computing course (version 1.0): Spring 2008
Part of the Google/IBM Academic Cloud Computing Initiative
Cloud computing course (version 2.0): Fall 2008
Sponsored by Amazon Web Services through a teaching grant
Research
Web-scale text processing
Statistical machine translation
Bioinformatics
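[Figure: word alignment between the Spanish words "Maria no dio una bofetada … bruja verde" and the English words "Mary did not slap … green witch"; for example, "bruja verde" aligns with "green witch"]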
Learning Translation Models
Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique.
Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz.
Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro.
These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious.
We built systems for “learning” translation models in Hadoop…
… sort of like the word count example, but with more math
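The "word count with more math" idea can be sketched as follows. This is a minimal illustration, not the system described in the talk: the two-sentence corpus and the function names `map_pairs` and `reduce_counts` are assumptions made for the example, and the "more math" (turning counts into translation probabilities) is left out.

```python
from collections import defaultdict
from itertools import product

def map_pairs(sentence_pair):
    """Map: emit ((source_word, target_word), 1) for every co-occurring pair."""
    source, target = sentence_pair
    for s, t in product(source.split(), target.split()):
        yield (s, t), 1

def reduce_counts(emitted):
    """Reduce: sum the counts for each key, exactly as in word count."""
    totals = defaultdict(int)
    for key, value in emitted:
        totals[key] += value
    return dict(totals)

# Hypothetical toy parallel corpus (Spanish, English).
corpus = [
    ("bruja verde", "green witch"),
    ("la bruja", "the witch"),
]
emitted = [kv for pair in corpus for kv in map_pairs(pair)]
counts = reduce_counts(emitted)
print(counts[("bruja", "witch")])  # the pair seen in both sentences
```

Pairs that co-occur across many sentence pairs, such as ("bruja", "witch") here, accumulate higher counts than accidental pairings.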
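Translation as a “Tiling” Problem
[Figure: the Spanish sentence "Maria no dio una bofetada a la bruja verde" covered by overlapping candidate English phrase translations, including "Mary", "not", "did not", "no", "did not give", "give", "a slap", "slap", "to the", "to", "the", "by", "green witch", "the witch", "witch", and "green". Example from Koehn (2006)]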
From Text to DNA Sequences
Text processing: [0-9A-Za-z]+
DNA sequence processing: [ATCG]+
(Nope, not really)
Michael Schatz (Ph.D. student, Computer Science; Spring 2008)
Ben Langmead (M.S. student, Computer Science; Fall 2008)
Analogy
(And two disclaimers)
Strangely-Formatted Manuscript
Dickens: A Tale of Two Cities
Text written on a long spool
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
… With Duplicates
Dickens: A Tale of Two Cities
“Backup” on four more copies
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Shredded Book Reconstruction
Dickens accidentally shreds the manuscript
How can he reconstruct the text?
5 copies x 138,656 words / 5 words per fragment = 138k fragments
The short fragments from every copy are mixed together
Some fragments are identical
It was the best of
of times, it was the
times, it was the worst
age of wisdom, it was
the age of foolishness, …
It was the best
worst of times, it was
of times, it was the
the age of wisdom, it
was the age of foolishness,
It was the
the worst of times, it
best of times, it was
was the age of wisdom,
it was the age of
foolishness, …
It was
was the worst of times,
the best of times, it
it was the age of
wisdom, it was the age
of foolishness, …
It
it was the worst of
was the best of times,
times, it was the age
of wisdom, it was the
age of foolishness, …
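The shredding can be simulated with a short sketch. It assumes, as the fragments above suggest, that each copy is cut into 5-word pieces but with the first cut shifted by one word per copy; the function name `shred` and its parameters are illustrative, and the fragments are left unshuffled so the output is reproducible.

```python
def shred(text, copies=5, fragment_len=5):
    """Cut each copy into fragment_len-word pieces, shifting the first cut per copy."""
    words = text.split()
    fragments = []
    for copy in range(copies):
        # Copy 0 is first cut after 5 words, copy 1 after 4, and so on.
        first_cut = fragment_len - (copy % fragment_len)
        starts = [0] + list(range(first_cut, len(words), fragment_len))
        ends = starts[1:] + [len(words)]
        fragments.extend(" ".join(words[s:e]) for s, e in zip(starts, ends))
    return fragments

opening = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness,")
fragments = shred(opening)

# Some fragments are identical because the text repeats itself.
duplicates = fragments.count("of times, it was the")

# The slide's estimate for the whole book: 5 copies x 138,656 words / 5 words each.
estimated_fragments = 5 * 138_656 // 5
print(len(fragments), duplicates, estimated_fragments)
```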
Overlaps
Generally prefer longer overlaps to shorter overlaps
In the presence of error, we might allow the overlapping fragments to differ by a small amount
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
worst of times, it was
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
wisdom, it was the age
it was the age of
was the age of foolishness,
the worst of times, it
"It was the best of" + "was the best of times,": 4-word overlap
"It was the best of" + "of times, it was the": 1-word overlap
"It was the best of" + "of wisdom, it was the": 1-word overlap
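A minimal sketch of how such overlaps could be computed, measured in whole words. The function name `overlap` is an assumption for this example, and it requires exact matches; tolerating small differences in the presence of error is not handled.

```python
def overlap(left, right):
    """Number of words in the longest suffix of `left` that equals a prefix of `right`."""
    a, b = left.split(), right.split()
    # Try the longest possible overlap first, so longer overlaps are preferred.
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

print(overlap("It was the best of", "was the best of times,"))
print(overlap("It was the best of", "of times, it was the"))
print(overlap("It was the best of", "of wisdom, it was the"))
```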
Greedy Assembly
The repeated sequence makes the correct reconstruction ambiguous
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
of times, it was the
times, it was the age
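Greedy assembly can be sketched as repeatedly merging the pair of fragments with the longest overlap. This is an illustrative toy under stated assumptions (exact word matches, ties broken by input order, hypothetical function names), not a production assembler. The last two lines show the ambiguity: the repeated fragment overlaps two different continuations equally well.

```python
def overlap(left, right):
    """Number of words in the longest suffix of `left` that equals a prefix of `right`."""
    a, b = left.split(), right.split()
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(fragments):
    """Merge the best-overlapping pair until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j and overlap(a, b) > best_n:
                    best_n, best_i, best_j = overlap(a, b), i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = " ".join(frags[best_i].split() + frags[best_j].split()[best_n:])
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags

assembled = greedy_assemble([
    "It was the best of", "was the best of times,", "the best of times, it",
])
print(assembled)

# The repeat is followed equally well by two different fragments.
tie_worst = overlap("of times, it was the", "times, it was the worst")
tie_age = overlap("of times, it was the", "times, it was the age")
```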
The Real Problem
(The easier version)
[Figure: a subject genome is fed to a sequencer, which produces many short overlapping reads; the original sequence (marked "?") has to be reconstructed from the reads]
DNA Sequencing
ATCTGATAAGTCCCAGGACTTCAGT
GCAAGGCAAACCCGAGCCCAGTTT
TCCAGTTCTAGAGTTTCACATGATC
GGAGTTAGTAAAAGTCCACATTGAG
Genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: ATCG
Bacteria: ~5 million bp
Humans: ~3 billion bp
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp)
Shorter reads, but much higher throughput
Per-base error rate estimated at 1-2% (Simpson, et al., 2009)
Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36bp reads
~144 GB of compressed sequence data
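The scale of these numbers can be checked with a little arithmetic. The sketch below uses only the figures quoted above (about 3 billion bp per human genome, 36bp reads); "coverage", meaning the average number of reads spanning each base, is a term introduced here for the example.

```python
GENOME_BP = 3_000_000_000  # ~3 billion bp human genome
READ_LEN = 36              # bp per read

def coverage(num_reads, read_len=READ_LEN, genome_bp=GENOME_BP):
    """Average number of reads covering each base of the genome."""
    return num_reads * read_len / genome_bp

studies = (
    ("Wang et al., 2008", 3_300_000_000),
    ("Bentley et al., 2008", 4_000_000_000),
)
for study, reads in studies:
    total_gbp = reads * READ_LEN / 1e9
    print(f"{study}: {total_gbp:.1f} Gbp sequenced, about {coverage(reads):.1f}x coverage")
```

Each base is therefore covered by roughly 40 to 50 reads on average, which is the redundancy that makes reconstruction possible despite per-base errors.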
How do we put Humpty Dumpty back together?
Human Genome
11 years, cost $3 billion… your tax dollars at work!
A complete human DNA sequence was published in 2003, marking the end of the Human Genome Project
[Figure: reference sequence]
Alignment
[Figure: subject reads placed at their matching positions along the reference sequence]
Reference: ATGAACCAC GAACAC TTTT TT GGC A ACGATTTAT …
Query: ATGAACAAA GAACAC TTTT TT GGC C ACGATTTAT …
Types of difference: Insertion, Deletion, Mutation
CloudBurst
1. Map: Catalog k-mers
• Emit every k-mer in the genome and non-overlapping k-mers in the reads
• Non-overlapping k-mers are sufficient to guarantee that an alignment will be found
[Figure: the map step emits k-mers from human chromosome 1 and from Read 1 and Read 2]
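The map step can be sketched in plain Python. This is an illustration of the idea rather than CloudBurst's actual code: the function names, the `"ref"` tag, and the toy sequences are assumptions. The guarantee rests on a pigeonhole argument: if a read may differ from the reference in at most e positions and is cut into e + 1 non-overlapping seeds, at least one seed must match exactly.

```python
def seed_length(read_len, max_mismatches):
    """Seed length such that the read splits into max_mismatches + 1 seeds."""
    return read_len // (max_mismatches + 1)

def genome_kmers(genome, k):
    """Emit every (overlapping) k-mer in the reference with its offset."""
    for i in range(len(genome) - k + 1):
        yield genome[i:i + k], ("ref", i)

def read_kmers(read_id, read, k):
    """Emit only the non-overlapping k-mers of a read with their offsets."""
    for i in range(0, len(read) - k + 1, k):
        yield read[i:i + k], (read_id, i)

# Hypothetical toy data.
ref_pairs = list(genome_kmers("ACGTACGTAC", 4))
read_pairs = list(read_kmers("read1", "GTACGTAA", 4))
print(ref_pairs[:3])
print(read_pairs)
```

For a 36bp read and at most 4 mismatches, `seed_length(36, 4)` gives seeds of 7bp.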
2. Shuffle: Coalesce Seeds
• Hadoop's internal shuffle groups together k-mers shared by the reads and the reference
• Conceptually, this builds a hash table of k-mers and their occurrences
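Conceptually, the shuffle behaves like the hash table sketched below. In Hadoop this grouping is done by the framework itself; the functions `shuffle` and `shared_seeds`, the `"ref"` tag, and the sample pairs are assumptions for illustration only.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group emitted (k-mer, origin) pairs by k-mer, as the shuffle does by key."""
    groups = defaultdict(list)
    for kmer, origin in pairs:
        groups[kmer].append(origin)
    return groups

def shared_seeds(groups):
    """Keep only k-mers that occur in the reference and in at least one read."""
    shared = {}
    for kmer, origins in groups.items():
        ref_hits = [pos for src, pos in origins if src == "ref"]
        read_hits = [(src, pos) for src, pos in origins if src != "ref"]
        if ref_hits and read_hits:
            shared[kmer] = (ref_hits, read_hits)
    return shared

# Hypothetical output of the map step.
pairs = [
    ("ACGT", ("ref", 0)), ("GTAC", ("ref", 2)), ("GTAC", ("ref", 6)),
    ("GTAC", ("read1", 0)), ("GTAA", ("read1", 4)),
]
seeds = shared_seeds(shuffle(pairs))
print(seeds)
```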
3. Reduce: End-to-end alignment
• Locally extend alignment beyond seeds by computing “match distance”
• If read aligns end-to-end, record the alignment
[Figure: the reduce step outputs the alignments]
Read 1, Chromosome 1, 12345-12365
Read 2, Chromosome 1, 12350-12370
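A simplified sketch of the reduce step follows. It assumes the "match distance" is a plain count of mismatched positions, with no insertions or deletions, which may be narrower than what the talk's implementation computes; the function name and toy sequences are hypothetical.

```python
def extend_seed(reference, read, ref_pos, read_pos, max_mismatches):
    """Return (start, end, mismatches) if the whole read aligns, else None.

    ref_pos and read_pos are the offsets of a shared seed in the
    reference and in the read, respectively.
    """
    start = ref_pos - read_pos          # where the read would begin
    end = start + len(read)             # exclusive end on the reference
    if start < 0 or end > len(reference):
        return None                     # read would hang off the reference
    mismatches = sum(1 for a, b in zip(reference[start:end], read) if a != b)
    if mismatches > max_mismatches:
        return None
    return start, end, mismatches

reference = "ACGTACGTAC"
read = "GTACGTAA"
hit = extend_seed(reference, read, ref_pos=2, read_pos=0, max_mismatches=1)
off_end = extend_seed(reference, read, ref_pos=6, read_pos=0, max_mismatches=1)
too_strict = extend_seed(reference, read, ref_pos=2, read_pos=0, max_mismatches=0)
print(hit, off_end, too_strict)
```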
[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads; y-axis: Runtime (s); one curve each for 0, 1, 2, 3, and 4 mismatches]
[Figure: Running Time vs Number of Reads on Chr 22. x-axis: Millions of Reads; y-axis: Runtime (s); one curve each for 0, 1, 2, 3, and 4 mismatches]
Results from a small, 24-core cluster, with different numbers of mismatches
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
[Figure: Running Time on EC2 High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s)]
CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on EC2
What’s Next?
(Michael Schatz’s Ph.D. dissertation)
Wait, no reference?
de Bruijn Graph Construction
D_k = (V, E)
V = All length-k subfragments (k < l)
E = Directed edges between consecutive subfragments
Nodes overlap by k-1 words
Locally constructed graph reveals the global sequence structure
Overlaps implicitly computed
Original fragment: It was the best of
Directed edge: It was the best → was the best of
de Bruijn, 1946
Idury and Waterman, 1995
Pevzner, Tang, Waterman, 2001
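The construction can be sketched for word-based fragments as below. The function name `de_bruijn` and the choice of a dictionary of successor sets are assumptions for the example; note that no fragment is ever compared with another, which is what "overlaps implicitly computed" refers to.

```python
from collections import defaultdict

def de_bruijn(fragments, k):
    """Nodes are k-word subfragments; edges link consecutive subfragments."""
    edges = defaultdict(set)
    for fragment in fragments:
        words = fragment.split()
        # A fragment of l words yields l - k edges (none if l <= k).
        for i in range(len(words) - k):
            left = " ".join(words[i:i + k])
            right = " ".join(words[i + 1:i + k + 1])
            edges[left].add(right)
    return dict(edges)

graph = de_bruijn(["It was the best of", "was the best of times,"], k=4)
print(graph)
```

The two fragments are linked through the shared node "was the best of" without any pairwise overlap computation.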
de Bruijn Graph Assembly
the age of foolishness
It was the best
best of times, it
was the best of
the best of times,
of times, it was
times, it was the
it was the worst
was the worst of
worst of times, it
the worst of times,
it was the age
was the age of
the age of wisdom,
age of wisdom, it
of wisdom, it was
wisdom, it was the
Compressed de Bruijn Graph
Unambiguous non-branching paths replaced by single nodes
An Eulerian traversal of the graph spells a compatible reconstruction of the original text
There may be many traversals of the graph
Different sequences can have the same string graph
It was the best of times, it was the worst of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
of times, it was the
It was the best of times, it
it was the age of
the age of wisdom, it was the
it was the worst of times, it
the age of foolishness
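The compression step can be sketched as follows, assuming 4-word nodes built directly from the opening sentence. The helper names are hypothetical, and isolated cycles (which have no natural starting node) are ignored for brevity. On this input the sketch yields the six compressed nodes listed above.

```python
from collections import defaultdict

def build_graph(text, k):
    """de Bruijn graph over k-word windows of the text."""
    words = text.split()
    edges = defaultdict(set)
    for i in range(len(words) - k):
        edges[" ".join(words[i:i + k])].add(" ".join(words[i + 1:i + k + 1]))
    return edges

def compress(edges):
    """Collapse each unambiguous non-branching path into a single node label."""
    preds = defaultdict(set)
    nodes = set(edges)
    for u, successors in edges.items():
        for v in successors:
            preds[v].add(u)
            nodes.add(v)

    def continues(u):
        """True if u has exactly one successor, whose only predecessor is u."""
        succ = edges.get(u, set())
        return len(succ) == 1 and len(preds[next(iter(succ))]) == 1

    def is_start(v):
        """True if v cannot be absorbed into a predecessor's path."""
        return len(preds[v]) != 1 or not continues(next(iter(preds[v])))

    paths = []
    for v in sorted(nodes):
        if not is_start(v):
            continue
        words = v.split()
        while continues(v):
            v = next(iter(edges[v]))
            words.append(v.split()[-1])  # consecutive nodes share k-1 words
        paths.append(" ".join(words))
    return paths

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness")
compressed = compress(build_graph(text, 4))
print(compressed)
```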
Hadoopification…
(Stay tuned!)
Cloud worthy?
How much data?
Bottom Line: Bioinformatics
Great use case of Hadoop
Interesting computer science problems
Help unravel life’s mysteries?
Questions?
Comments?
Thanks to the organizations who support our work: