Tutorial for


MapReduce (Hadoop) & Large Scale Processing

Le Zhao (LTI, SCS, CMU)

Database Seminar & Large Scale Seminar

2010-Feb-15

Some slides adapted from IR course lectures by Jamie Callan

© 2010, Le Zhao


Outline


Why MapReduce (Hadoop)


MapReduce basics


The MapReduce way of thinking


Manipulating large data


Outline


Why MapReduce (Hadoop)


Why go large scale


Compared to other parallel computing models


Hadoop related tools


MapReduce basics


The MapReduce way of thinking


Manipulating large data


Why NOT to do parallel computing


Concerns: a parallel system needs to provide:


Data distribution


Computation distribution


Fault tolerance


Job scheduling


Why MapReduce (Hadoop)


Previous parallel computation models

1) scp + ssh

» Manual everything

2) network cross-mounted disks + condor/torque

» No data distribution; disk access is the bottleneck

» Can only partition totally distributive computation

» No fault tolerance

» Prioritized job scheduling


Hadoop


Parallel batch computation

Data distribution

» Hadoop Distributed File System (HDFS)

» Like the Linux FS, but with automatic data replication

Computation distribution

» Automatic; the user only needs to specify the number of input splits

» Can distribute aggregation computations as well

Fault tolerance

» Automatic recovery from failure

» Speculative execution (backup tasks)

Job scheduling

» OK, but still relies on the politeness of users


How you can use Hadoop


Hadoop Streaming


Quick hacking, much like shell scripting

» Uses STDIN & STDOUT to carry data

» cat file | mapper | sort | reducer > output

Easier to use legacy code; works with all programming languages


Hadoop Java API


Build large systems

» More data types

» More control over Hadoop’s behavior

» Easier debugging with Java’s error stack trace display

NetBeans plugin for Hadoop provides easy programming

» http://hadoopstudio.org/docs.html
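The `cat file | mapper | sort | reducer` pipeline above can be imitated in a few lines of plain Python. This is only a sketch of Streaming's semantics, not actual Hadoop code; the word-count mapper and reducer are invented here as an example:

```python
from itertools import groupby

def mapper(lines):
    # A Streaming mapper reads records from STDIN and emits (key, value) pairs
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # A Streaming reducer sees pairs sorted by key; sum the counts per word
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# cat file | mapper | sort | reducer > output
lines = ["long ago and far away", "once upon a time long ago"]
counts = dict(reducer(sorted(mapper(lines))))
```

In a real Streaming job the sort between the two stages is done by the framework, not by the user.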


Outline


Why MapReduce (Hadoop)

MapReduce basics

The MapReduce way of thinking

Manipulating large data


© 2009, Jamie Callan


Map and Reduce

MapReduce is a new use of an old idea in Computer Science


Map:
Apply a function to every object in a list


Each object is independent

» Order is unimportant

» Maps can be done in parallel


The function produces a result


Reduce:
Combine the results to produce a final result


You may have seen this in a Lisp or functional programming course

© 2010, Jamie Callan


MapReduce


Input reader


Divide input into splits, assign each split to a Map processor


Map


Apply the Map function to each record in the split


Each Map function returns a list of (key, value) pairs


Shuffle/Partition and Sort


Shuffle distributes sorting & aggregation to many reducers


All records for key k are directed to the same reduce processor


Sort groups the same keys together, and prepares for aggregation


Reduce


Apply the Reduce function to each key


The result of the Reduce function is a list of (key, value) pairs

MapReduce in One Picture



Tom White, Hadoop: The Definitive Guide

Outline


Why MapReduce (Hadoop)

MapReduce basics

The MapReduce way of thinking

Two simple use cases

Two more advanced & useful MapReduce tricks

Two MapReduce applications

Manipulating large data


MapReduce Use Case (1)


Map Only

Data distributive tasks


Map Only


E.g. classify individual documents


Map does everything


Input: (docno, doc_content), …


Output: (docno, [class, class, …]), …


No reduce


MapReduce Use Case (2)


Filtering and Accumulation

Filtering & Accumulation


Map and Reduce


E.g. Counting total enrollments of two given classes


Map selects records and outputs initial counts


In: (Jamie, 11741), (Tom, 11493), …


Out: (11741, 1), (11493, 1), …


Shuffle/Partition by class_id


Sort


In: (11741, 1), (11493, 1), (11741, 1), …


Out: (11493, 1), …, (11741, 1), (11741, 1), …


Reduce accumulates counts


In: (11493, [1, 1, …]), (11741, [1, 1, …])


Sum and Output: (11493, 16), (11741, 35)
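The filtering-and-accumulation flow above can be run locally. A minimal sketch in plain Python, using a tiny record set modeled on the slide's example (the helper names are invented here, not Hadoop API):

```python
from collections import defaultdict

def map_enrollment(record):
    # Select a record and emit an initial count of 1 for its class_id
    student, class_id = record
    return class_id, 1

def shuffle_sort(pairs):
    # Group all (key, value) pairs by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_count(class_id, counts):
    # Accumulate the counts for one class
    return class_id, sum(counts)

records = [("Jamie", 11741), ("Tom", 11493), ("Le", 11741)]
grouped = shuffle_sort(map(map_enrollment, records))
totals = dict(reduce_count(k, v) for k, v in grouped.items())
```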


MapReduce Use Case (3)


Database Join

Problem: Massive lookups


Given two large lists: (URL, ID) and (URL, doc_content) pairs


Produce (ID, doc_content)

Solution: Database join


Input stream: both (URL, ID) and (URL, doc_content) lists


(http://del.icio.us/post, 0), (http://digg.com/submit, 1), …


(http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), …


Map simply passes input along


Shuffle and Sort on URL (group ID & doc_content for the same URL together)


Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, <html0>),
(http://digg.com/submit, <html1>), (http://digg.com/submit, 1), …


Reduce outputs result stream of (ID, doc_content) pairs


In: (http://del.icio.us/post, [0, html0]), (http://digg.com/submit, [html1, 1]), …


Out: (0, <html0>), (1, <html1>), …
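A local sketch of the join above. Real jobs usually tag each record with its source stream; here, for brevity only, the IDs are ints and the contents are strings, so the reducer can tell the two streams apart by type (an assumption of this toy example):

```python
from collections import defaultdict

def reduce_join(url, values):
    # Pair the ID with the doc_content grouped under the same URL
    doc_id = next(v for v in values if isinstance(v, int))
    content = next(v for v in values if isinstance(v, str))
    return doc_id, content

ids = [("http://del.icio.us/post", 0), ("http://digg.com/submit", 1)]
docs = [("http://del.icio.us/post", "<html0>"), ("http://digg.com/submit", "<html1>")]

# Map passes both streams along unchanged; shuffle/sort groups records by URL
groups = defaultdict(list)
for url, value in ids + docs:
    groups[url].append(value)

result = dict(reduce_join(url, vals) for url, vals in groups.items())
```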


MapReduce Use Case (4)


Secondary Sort

Problem: Sorting on values


E.g. Reverse graph edge directions & output in node order


Input: adjacency list of graph (3 nodes and 4 edges)

(3, [1, 2]), (1, [2, 3])

Desired output: (1, [3]), (2, [1, 3]), (3, [1])

Note, the node_ids in the output values are also sorted. But Hadoop only sorts on keys!

Solution: Secondary sort

Map

In: (3, [1, 2]), (1, [2, 3])

Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) (reverse edge direction)

Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])

Copy node_ids from value to key.




MapReduce Use Case (4)


Secondary Sort

Secondary Sort (ctd.)

Shuffle on Key.field1, and Sort on the whole Key (both fields)

In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])

Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1])

Grouping comparator

Merge according to part of the key

Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1])

This will be the reducer’s input

Reduce

Merge & output: (1, [3]), (2, [1, 3]), (3, [1])
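The whole trick can be simulated locally: sort on the full composite key, then group on its first field only. A sketch using the 3-node graph from the slide, with plain Python standing in for Hadoop's partitioner and grouping comparator:

```python
from itertools import groupby

def map_reverse(adjacency):
    # Reverse each edge and copy the node_id from value into the key: (<to, from>, from)
    for src, outs in adjacency:
        for dst in outs:
            yield (dst, src), src

def shuffle_group(pairs):
    # Sort on the whole composite key <field1, field2>, then group on field1 alone
    for node, group in groupby(sorted(pairs), key=lambda kv: kv[0][0]):
        yield node, [src for _, src in group]  # values arrive already sorted

graph = [(3, [1, 2]), (1, [2, 3])]
reversed_graph = dict(shuffle_group(map_reverse(graph)))
```

Because the values were sorted as part of the key, each output list is in node order without any in-memory sort at the reducer.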


Using MapReduce to Construct Indexes:

Preliminaries

Construction of binary inverted lists

Input: documents: (docid, [term, term..]), (docid, [term, ..]), ..

Output: (term, [docid, docid, …])

E.g., (apple, [1, 23, 49, 127, …])

Binary inverted lists fit on a slide more easily

Everything also applies to frequency and positional inverted lists

A document id is an internal document id, e.g., a unique integer

Not an external document id such as a URL

MapReduce elements

Combiner, Secondary Sort, complex keys, sorting on keys’ fields


Using MapReduce to Construct Indexes:

A Simple Approach

A simple approach to creating binary inverted lists

Each Map task is a document parser

Input: A stream of documents

Output: A stream of (term, docid) tuples

» (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …

Shuffle sorts tuples by key and routes tuples to Reducers

Reducers convert streams of keys into streams of inverted lists

Input: (long, 1) (long, 127) (long, 49) (long, 23) …

The reducer sorts the values for a key and builds an inverted list

» Longest inverted list must fit in memory

Output: (long, [df:492, docids:1, 23, 49, 127, …])
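A runnable miniature of this simple approach (an illustration only; the two toy documents are invented to echo the slide's example terms):

```python
from collections import defaultdict

def parse(docid, content):
    # Map: the document parser emits one (term, docid) tuple per token
    for term in content.split():
        yield term, docid

docs = [(1, "long ago and"), (2, "once upon a time long ago")]

# Shuffle routes every tuple for a given term to the same reducer
postings = defaultdict(list)
for docid, content in docs:
    for term, d in parse(docid, content):
        postings[term].append(d)

# Reduce sorts the docids for each term and records the document frequency
index = {t: (len(set(ds)), sorted(set(ds))) for t, ds in postings.items()}
```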


Using MapReduce to Construct Indexes:

A Simple Approach

A more succinct representation of the previous algorithm

Map: (docid1, content1) → (t1, docid1) (t2, docid1) …

Shuffle by t

Sort by t

(t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …

Reduce: (t4, [docid3 docid1 …]) → (t, ilist)

docid: a unique integer

t: a term, e.g., “apple”

ilist: a complete inverted list

But: a) inefficient; b) docids are sorted in reducers; c) assumes the ilist of a word fits in memory


Using MapReduce to Construct Indexes:

Using Combine


Map: (docid1, content1) → (t1, ilist1,1) (t2, ilist2,1) (t3, ilist3,1) …

Each output inverted list covers just one document

Combine

Sort by t

Combine: (t1, [ilist1,2 ilist1,3 ilist1,1 …]) → (t1, ilist1,27)

Each output inverted list covers a sequence of documents

Shuffle by t

Sort by t

(t4, ilist4,1) (t5, ilist5,3) … → (t4, ilist4,2) (t4, ilist4,4) (t4, ilist4,1) …

Reduce: (t7, [ilist7,2, ilist7,1, ilist7,4, …]) → (t7, ilistfinal)

ilisti,j: the j’th inverted list fragment for term i
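The combiner's job, distilled: merge many small per-document fragments into one fragment before anything crosses the network. A sketch, assuming each fragment is a sorted docid list:

```python
def combine(term, fragments):
    # Merge several small sorted posting fragments into one larger fragment
    merged = sorted(docid for fragment in fragments for docid in fragment)
    return term, merged

# Three one-document fragments for one term, produced by a single map task
term, ilist = combine("long", [[23], [1], [49]])
```

The reducer then runs the same merge once more over the (far fewer) combined fragments to produce the final inverted list.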





Using MapReduce to Construct Indexes

[Figure: Documents → Parser/Indexer (Map/Combine processors) → inverted list fragments → Shuffle/Sort → Merger (Reduce processors) → inverted lists, partitioned A-F, G-P, Q-Z]

Using MapReduce to Construct

Partitioned Indexes


Map: (docid1, content1) → ([p, t1], ilist1,1)

Combine to sort and group values

([p, t1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1], ilist1,27)

Shuffle by p

Sort values by [p, t]

Reduce: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], ilistfinal)

p: partition (shard) id


Using MapReduce to Construct Indexes:

Secondary Sort

So far, we have assumed that Reduce can sort values in memory
… but what if there are too many to fit in memory?

Map: (docid1, content1) → ([t1, fd1,1], ilist1,1)

Combine to sort and group values

Shuffle by t

Sort by [t, fd], then Group by t (Secondary Sort)

([t7, fd7,2], ilist7,2), ([t7, fd7,1], ilist7,1) … → (t7, [ilist7,1, ilist7,2, …])

Reduce: (t7, [ilist7,1, ilist7,2, …]) → (t7, ilistfinal)

Values arrive in order, so Reduce can stream its output

fdi,j is the first docid in ilisti,j


Using MapReduce to Construct Indexes:

Putting it All Together


Map: (docid1, content1) → ([p, t1, fd1,1], ilist1,1)

Combine to sort and group values

([p, t1, fd1,1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1, fd1,27], ilist1,27)

Shuffle by p

Secondary Sort by [(p, t), fd]

([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …])

Reduce: ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …]) → ([p, t7], ilistfinal)




Using MapReduce to Construct Indexes

[Figure: Documents → Parser/Indexer (Map/Combine processors) → inverted list fragments → Shuffle/Sort → Merger (Reduce processors) → inverted lists, one shard per Merger]

PageRank Calculation:

Preliminaries

One PageRank iteration:


Input:

(id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..

Output:

(id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..

MapReduce elements

Score distribution and accumulation

Database join

Side-effect files


PageRank:

Score Distribution and Accumulation


Map

In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..

Out: (out11, score1(t)/n1), (out12, score1(t)/n1) .., (out21, score2(t)/n2), ..

Shuffle & Sort by node_id

In: (id2, score1), (id1, score2), (id1, score1), ..

Out: (id1, score1), (id1, score2), .., (id2, score1), ..

Reduce

In: (id1, [score1, score2, ..]), (id2, [score1, ..]), ..

Out: (id1, score1(t+1)), (id2, score2(t+1)), ..
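One local iteration of the score distribution and accumulation above, as a sketch; it deliberately omits the damping factor and the dangling-node handling covered on the following slides:

```python
from collections import defaultdict

def map_distribute(node_id, score, outlinks):
    # Send an equal share of this node's current score to each outlink
    for out in outlinks:
        yield out, score / len(outlinks)

def reduce_accumulate(shares):
    # Shuffle groups shares by node_id; Reduce sums them into the new score
    totals = defaultdict(float)
    for node_id, share in shares:
        totals[node_id] += share
    return dict(totals)

# node_id -> (score at step t, outlinks); a 3-node toy graph
graph = {1: (1.0, [2, 3]), 2: (1.0, [3]), 3: (1.0, [1])}
shares = [p for nid, (s, outs) in graph.items() for p in map_distribute(nid, s, outs)]
new_scores = reduce_accumulate(shares)
```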


PageRank:

Database Join to associate outlinks with score


Map

In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]), (id2, [out21, out22, ..]) ..

Shuffle & Sort by node_id

Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2, score2(t+1)), ..

Reduce

In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), ..

Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..


PageRank:

Side Effect Files for dangling nodes


Dangling Nodes

Nodes with no outlinks (observed but not crawled URLs)

Score has no outlet

» need to distribute to all graph nodes evenly

Map for dangling nodes:

In: .., (id3, [score3]), ..

Out: .., ("*", 0.85 × score3), ..

Reduce

In: .., ("*", [score1, score2, ..]), ..

Out: .., everything else, ..

Output to side-effect file: ("*", score), fed to the Mapper of the next iteration
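What the next iteration's Mapper might then do with the side-effect record: spread the accumulated (already damped) dangling mass evenly across all graph nodes. A sketch with made-up numbers, not the slides' actual code:

```python
def fold_in_dangling_mass(scores, star_score, n_nodes):
    # The ("*", score) side-effect record is divided evenly over all graph nodes
    share = star_score / n_nodes
    return {node: s + share for node, s in scores.items()}

scores = {1: 0.4, 2: 0.3, 3: 0.2}
updated = fold_in_dangling_mass(scores, star_score=0.09, n_nodes=3)
```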


Outline


Why MapReduce (Hadoop)

MapReduce basics

The MapReduce way of thinking

Manipulating large data


Manipulating Large Data


Do everything in Hadoop (and HDFS)


Make sure every step is parallelized!


Any serial step breaks your design


E.g. storing the URL list for a Web graph

Each node in the Web graph has an id

[URL1, URL2, …], using the line number as the id: a serial bottleneck

[(id1, URL1), (id2, URL2), …], explicit ids


Hadoop-based Tools


For developing in Java, NetBeans plugin

http://www.hadoopstudio.org/docs.html

Pig Latin, a SQL-like high-level data processing script language


Hive,
Data warehouse, SQL


Cascading,
Data processing


Mahout,
Machine Learning algorithms on Hadoop


HBase,
Distributed data store as a large table


More


http://hadoop.apache.org/


http://en.wikipedia.org/wiki/Hadoop


Many other toolkits, Nutch, Cloud9, Ivory


Get Your Hands Dirty


Hadoop Virtual Machine

http://www.cloudera.com/developers/downloads/virtual-machine/

» This runs Hadoop 0.20

An earlier Hadoop 0.18.0 version is here:
http://code.google.com/edu/parallel/tools/hadoopvm/index.html


Amazon EC2


Various other Hadoop clusters around


The NetBeans plugin simulates Hadoop


The workflow view works on Windows


Local running & debugging works on MacOS and Linux


http://www.hadoopstudio.org/docs.html


Conclusions


Why large scale


MapReduce advantages


Hadoop uses


Use cases


Map only: for totally distributive computation


Map+Reduce: for filtering & aggregation


Database join: for massive dictionary lookups


Secondary sort: for sorting on values


Inverted indexing: combiner, complex keys


PageRank: side effect files


Large data


For More Information


L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 2003.

J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.

S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003.

I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999.

J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38 (2). 2006.

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. “Map/Reduce Tutorial”. Fetched January 21, 2010.

Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009.

J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce, Book Draft. February 7, 2010.