
Embracing the Data Deluge: Data-Intensive Computing for the Masses

Jimmy Lin

University of Maryland


Tuesday, July 13, 2010

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Introduction


We live in a world of large data…


Staying relevant requires embracing it!


In text processing…


Emergence and dominance of empirical, data-driven research


Constant danger: uninteresting conclusions on “toy” datasets (or experiments taking forever)


In the natural sciences…


Emergence of the 4th Paradigm: data-intensive eScience


Difficult computer science problems!


How do we practically scale to large datasets?


Case study in text processing: statistical machine translation


Case study in bioinformatics: DNA sequence alignment

How much data?


Google processes 20 PB a day (2008)


Wayback Machine has 3 PB + 100 TB/month (3/2009)


eBay has 6.5 PB of user data + 50 TB/day (5/2009)


Facebook has 36 PB of user data + 80-90 TB/day (6/2010)


CERN’s LHC: 15 PB a year (any day now)


LSST: 6-10 PB a year (~2015)





“640K ought to be enough for anybody.”

No data like more data!

(Banko and Brill, ACL 2001)

(Brants et al., EMNLP 2007)

s/knowledge/data/g;

How do we get here if we’re not Google?

cheap commodity clusters
+ simple, distributed programming models
= data-intensive computing for the masses!

(or utility computing)

Source: flickr (turtlemom_nancy/2046347762)

Why is this different?

Path to data nirvana?

Parallel computing is hard!

[Diagram: two traditional models. Message passing: processes P1-P5 exchange messages directly. Shared memory: processes P1-P5 read and write a single shared memory.]

Different programming models

Different programming constructs:
mutexes, condition variables, barriers, …
masters/slaves, producers/consumers, work queues, …

Fundamental issues:
scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …

Common problems:
livelock, deadlock, data starvation, priority inversion, …
dining philosophers, sleeping barbers, cigarette smokers, …

Architectural issues:
Flynn’s taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, UMA vs. NUMA, cache coherence

The reality: the programmer shoulders the burden of managing concurrency…

(I want my students developing new machine learning algorithms, not debugging race conditions)
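To make that burden concrete, here is a minimal race-condition sketch in Python (my illustration, not from the talk): two threads perform an unsynchronized read-modify-write on a shared counter, and updates are silently lost.

import threading
import time

counter = 0  # shared state, deliberately unprotected

def worker():
    global counter
    for _ in range(100):
        current = counter        # read
        time.sleep(0.0001)       # widen the race window
        counter = current + 1    # write back a possibly stale value

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("counter =", counter)  # expected 200; lost updates leave it short

Wrapping the update in a lock fixes this one, but finding such bugs in a large codebase is exactly the debugging we want to avoid.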

Source: Ricardo Guimarães Herrmann

Source: MIT OpenCourseWare

Source: NY Times (6/14/2006)

The datacenter is the computer!

MapReduce


Functional programming meets distributed processing


Independent per-record processing in parallel


Aggregation of intermediate results to generate final output


Programmers specify two functions (a word-count sketch follows below):

map (k, v) → <k’, v’>*

reduce (k’, v’) → <k’, v’>*


All values with the same key are sent to the same reducer


The execution framework handles everything else…


Handles scheduling


Handles data management, transport, etc.


Handles synchronization


Handles errors and faults
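To ground the model, a minimal word-count sketch in plain Python (an illustration under my own assumptions, not code from the talk), mimicking the three stages: map over records, shuffle values by key, reduce each group.

from collections import defaultdict

def map_fn(key, value):
    # map (k, v) -> <k', v'>*: emit (word, 1) for each word in the line
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # reduce (k', v'*) -> <k', v'>*: sum the counts for one word
    yield key, sum(values)

def mapreduce(records):
    groups = defaultdict(list)
    for key, value in records:             # map phase
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)          # shuffle: group values by key
    output = []
    for k2 in sorted(groups):              # sort keys, then reduce phase
        output.extend(reduce_fn(k2, groups[k2]))
    return output

lines = [(0, "it was the best of times"), (1, "it was the worst of times")]
print(mapreduce(lines))                    # [('best', 1), ('it', 2), ...]

A real framework runs the same two user functions, but distributes the map and reduce calls across machines and performs the shuffle over the network.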

[Diagram: mappers process input records in parallel; Shuffle and Sort aggregates values by key; reducers process each key group. E.g., scattered pairs with keys a, b, c are grouped into a → [1, 5], b → [2, 7], c → [2, 3, 6, 8] and reduced to outputs (r1, s1), (r2, s2), (r3, s3).]

[Diagram, adapted from (Dean and Ghemawat, OSDI 2004): the user program (1) submits a job to the master, which (2) schedules map and reduce tasks on workers. Map-phase workers (3) read input splits 0-4 and (4) write intermediate files to local disk; reduce-phase workers (5) remotely read the intermediate files and (6) write output files 0 and 1.]


MapReduce Implementations


Google has a proprietary implementation in C++


Bindings in Java, Python


Hadoop is an open-source implementation in Java


Development led by Yahoo, used in production


Now an Apache project


Rapidly expanding software ecosystem


Lots of custom research implementations


For GPUs, cell processors, etc.

Case Study #1

Statistical Machine Translation

Chris Dyer

(Linguistics Ph.D., 2010)

Statistical Machine Translation

[Diagram: training data consists of parallel sentences (e.g. “i saw the small table” / “vi la mesa pequeña”) and target-language text (e.g. “he sat at the table”, “the service was good”). Word alignment and phrase extraction over the parallel sentences yield a translation model of phrase pairs such as (vi, i saw) and (la mesa pequeña, the small table); the target-language text trains a language model. The decoder combines both to turn the foreign input sentence “maria no daba una bofetada a la bruja verde” into the English output “mary did not slap the green witch”.]

Translation as a Tiling Problem

[Diagram: the source sentence “Maria no dio una bofetada a la bruja verde” sits above a lattice of candidate English phrases per span: Mary; not / did not / no / did not give; give / a / slap / by; to / the / to the / a slap; witch / green / green witch / the witch. One tiling covers the sentence as: Mary | did not | slap | the | green witch.]
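A toy tiling sketch in Python (hypothetical phrase table and greedy longest-match strategy, mine for illustration only; a real decoder searches over many tilings and reorderings using model scores):

# hypothetical phrase table for the running example
phrase_table = {
    ("maria",): "mary",
    ("no",): "did not",
    ("daba", "una", "bofetada"): "slap",
    ("a", "la"): "the",
    ("bruja", "verde"): "green witch",
}

def greedy_tile(sentence):
    # cover the source left to right with the longest matching phrase
    words, i, out = sentence.split(), 0, []
    while i < len(words):
        for n in range(len(words) - i, 0, -1):   # longest span first
            phrase = tuple(words[i:i + n])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += n
                break
        else:
            out.append(words[i])                 # pass unknown words through
            i += 1
    return " ".join(out)

print(greedy_tile("maria no daba una bofetada a la bruja verde"))
# -> mary did not slap the green witch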

The Data Bottleneck

“Every time I fire a linguist, the performance of our … system goes up.”
- Fred Jelinek


We’ve built MapReduce implementations of two of these components: word alignment and phrase extraction!

HMM Alignment: Giza

[Chart: running time of Giza word alignment on a single-core commodity server.]

HMM Alignment: MapReduce

[Chart: running time of the MapReduce implementation on a 38-processor cluster vs. Giza on a single-core commodity server.]

HMM Alignment: MapReduce

[Chart: running time on the 38-processor cluster vs. 1/38 of the single-core running time, i.e., ideal linear speedup.]

What’s the point?


The optimally-parallelized version doesn’t exist!


MapReduce occupies a sweet spot in the design space for a large class of problems:


Fast
… in terms of running time + scaling characteristics


Easy
… in terms of programming effort


Cheap
… in terms of hardware costs



Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce. Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008.

Case Study #2

DNA Sequence Alignment

Michael Schatz

(Computer Science Ph.D., 2010)

Strangely-Formatted Manuscript


Dickens: A Tale of Two Cities


Text written on a long spool

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …

… With Duplicates


Dickens: A Tale of Two Cities


“Backup” on four more copies

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … (the original plus four identical backup copies)

Shredded Book Reconstruction


Dickens accidentally shreds the manuscript

How can he reconstruct the text?


5 copies x 138,656 words / 5 words per fragment = 138k fragments


The short fragments from every copy are mixed together


Some fragments are identical


[The five-word fragments from all five copies, mixed together: “It was the best of”, “of times, it was the”, “times, it was the worst”, “age of wisdom, it was”, “the age of foolishness, …”, “It was the best”, “worst of times, it was”, “the age of wisdom, it”, “was the age of foolishness,”, …]


Greedy Assembly

[Diagram: fragments with matching overlaps are stacked and merged greedily, e.g. “It was the best of” over “of times, it was the”. But “times, it was the” continues as “times, it was the worst” in some fragments and “times, it was the age” in others, and “it was the age of” is followed by both “was the age of wisdom,” and “was the age of foolishness,”.]

The repeated sequences make the correct reconstruction ambiguous!
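A minimal greedy-assembly sketch in Python (my own, assuming clean fragments) makes the procedure, and its failure mode, concrete: repeatedly merge the pair of fragments with the largest suffix/prefix overlap.

def overlap(a, b):
    # longest suffix of a that is a prefix of b, counted in words
    a, b = a.split(), b.split()
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(fragments):
    frags = list(fragments)
    while len(frags) > 1:
        # pick the pair with maximal overlap and merge it
        n, i, j = max((overlap(frags[i], frags[j]), i, j)
                      for i in range(len(frags))
                      for j in range(len(frags)) if i != j)
        if n == 0:
            break                                # nothing overlaps; stop
        merged = frags[i] + " " + " ".join(frags[j].split()[n:])
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

print(greedy_assemble(["it was the best of",
                       "of times, it was",
                       "the best of times,"]))
# -> ['it was the best of times, it was']

With repeats, several candidate merges tie for the maximal overlap, and the greedy choice among them is exactly where the ambiguity bites.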


Alternative: model sequence reconstruction as a graph problem…


de Bruijn Graph Construction

Dk = (V, E)

V = all length-k subfragments (k < l)

E = directed edges between consecutive subfragments (nodes overlap by k-1 words)






Locally constructed graph reveals the global structure


Overlaps between sequences implicitly computed



Original fragment: “It was the best of”

Directed edge: “It was the best” → “was the best of”

(de Bruijn, 1946)
(Idury and Waterman, 1995)
(Pevzner, Tang, and Waterman, 2001)
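A compact construction sketch in Python (mine, using word windows in place of DNA bases): each fragment contributes its length-k windows as nodes, with an edge between each consecutive pair; identical windows from different copies collapse into one node.

from collections import defaultdict

def de_bruijn(fragments, k):
    # nodes: length-k word windows; edges: consecutive windows (overlap k-1)
    graph = defaultdict(set)
    for frag in fragments:
        words = frag.split()
        windows = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
        for u, v in zip(windows, windows[1:]):
            graph[u].add(v)          # duplicates from the five copies collapse
    return graph

for u, vs in de_bruijn(["It was the best of",
                        "was the best of times,"], 4).items():
    for v in vs:
        print(" ".join(u), "->", " ".join(v))
# It was the best -> was the best of
# was the best of -> the best of times,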

de Bruijn Graph Assembly

[Graph: 4-word nodes (“It was the best”, “was the best of”, “the best of times,”, …, “the age of wisdom,”, “the age of foolishness”) connected wherever consecutive nodes overlap by 3 words; the node “times, it was the” branches to both “it was the worst” and “it was the age”.]

A unique Eulerian tour of the graph reconstructs the original text.

If a unique tour does not exist, try to simplify the graph as much as possible.
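For completeness, a small Eulerian-walk sketch in Python (Hierholzer's algorithm; an illustration, not the assembler's actual code) showing how a tour of the graph re-emits the text, one new word per edge:

from collections import defaultdict

def eulerian_path(graph):
    # Hierholzer's algorithm; assumes an Eulerian path exists
    out, degree = defaultdict(list), defaultdict(int)
    for u, vs in graph.items():
        for v in vs:
            out[u].append(v)
            degree[u] += 1           # out-degree minus in-degree
            degree[v] -= 1
    # start where out-degree exceeds in-degree by one, if such a node exists
    start = next((u for u in list(out) if degree[u] == 1), next(iter(out)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())     # dead end: emit node
    return path[::-1]

graph = {("It", "was", "the", "best"): [("was", "the", "best", "of")],
         ("was", "the", "best", "of"): [("the", "best", "of", "times,")]}
nodes = eulerian_path(graph)
print(" ".join(nodes[0]), " ".join(n[-1] for n in nodes[1:]))
# -> It was the best of times,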

de Bruijn Graph Assembly (after simplification)

[Graph: compressed into six nodes: “It was the best of times, it”, “of times, it was the”, “it was the worst of times, it”, “it was the age of”, “the age of wisdom, it was the”, “the age of foolishness”.]

[Diagram: a sequencer samples many short reads, strings over the DNA letters A, C, G, T (e.g. AATGCTTACTATGCGGGCCCCTT), from a subject genome; some reads contain errors. The assembly problem: reconstruct the unknown genome from the reads.]

Human genome: 3 Gbp

A few billion short reads (~100 GB compressed data)

Present solutions: large shared-memory machines, or clusters with high-speed interconnects

Can we get by with MapReduce on cheap commodity clusters?

Graph Compression

Challenges:
Nodes stored on different machines
Nodes can only access direct neighbors

Randomized solution:
Randomly assign H / T to each compressible node
Compress H → T links


Fast Graph Compression

Initial Graph: 42 nodes
Round 1: 26 nodes (38% savings)
Round 2: 15 nodes (64% savings)
Round 3: 6 nodes (86% savings)
Round 4: 5 nodes (88% savings)

Contrail

De Novo Assembly of the Human Genome

Genome: African male NA18507 (SRA000271, Bentley et al., 2008)

Input: 3.5B 36bp reads, 210bp insert (~40x coverage)

Stage          N        Max
Initial        >7 B     27 bp
Compressed     >1 B     303 bp
Clip Tips      5.0 M    14,007 bp
Pop Bubbles    4.2 M    20,594 bp

[Diagrams: clipping tips removes short dead-end branches B and B′ hanging off node A; popping bubbles merges near-identical parallel paths between nodes A and C.]

Assembly of Large Genomes with Cloud Computing. Schatz MC, Sommer D, Kelley D, Pop M, et al. In preparation.

Source: flickr (fatboyke/2918399820)

Source: flickr (60in3/2338247189)

Best thing since sliced bread?


Distributed programming models:
MapReduce is the first
Definitely not the only
And probably not even the best
Alternatives: Pig, Dryad/DryadLINQ, Pregel, etc.

It’s all about the right level of abstraction:
the von Neumann architecture won’t cut it anymore

Separating the what from how:
Developer specifies the computation that needs to be performed
Execution framework handles actual execution
Framework hides system-level details from the developers

Source: NY Times (6/14/2006)

The datacenter is the computer!

What are the appropriate abstractions for the datacenter computer?

Source: flickr (infidelic/3008675635)

Source: Wikipedia (Tide)

Commoditization of large-data processing capabilities allows us to ride the rising tide!

Questions?