Understanding MapReduce - pseudo code - hadoop-practise


Understanding MapReduce



Map Reduce - An Introduction

- Word count: default
- Word count: custom


Programming model to process large datasets



Supported languages for MR:
- Java
- Ruby
- Python
- C++


MapReduce programs are inherently parallel:
- More data simply means more machines to analyze it.
- No need to change anything in the code.


Start with the WORDCOUNT example.



“Do as I say, not as I do”

Word   Count
As     2
Do     2
I      2
Not    1
Say    1

define wordCount as Map<String, long>;

for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}

display(wordCount);
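For concreteness, here is a minimal runnable Java version of the pseudocode above (plain Java, no Hadoop; the tokenize() helper and the sample document are illustrative assumptions):

import java.util.HashMap;
import java.util.Map;

public class NaiveWordCount {

    // Illustrative tokenizer: lower-case the text and split on non-word characters.
    static String[] tokenize(String document) {
        return document.toLowerCase().split("\\W+");
    }

    public static void main(String[] args) {
        String[] documentSet = { "Do as I say, not as I do" };
        Map<String, Long> wordCount = new HashMap<>();
        for (String document : documentSet) {
            for (String token : tokenize(document)) {
                if (token.isEmpty()) continue;          // guard against empty splits
                wordCount.merge(token, 1L, Long::sum);  // wordCount[token]++
            }
        }
        System.out.println(wordCount);  // e.g. {as=2, do=2, i=2, not=1, say=1}
    }
}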



This works only as long as the number of documents to process is not very large.


Consider a spam filter:
- Millions of emails
- Word count needed for the analysis
- Working from a single computer is time consuming



Rewrite the program to count from multiple machines.

How do we attain parallel computing?
1. Every machine computes counts for a fraction of the documents.
2. Combine the results from all the machines.



STAGE 1

define wordCount as Map<String, long>;

for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}



STAGE 2

define totalWordCount as Multiset;

for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}

display(totalWordCount);
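A hedged Java sketch of the two stages (how the per-machine maps reach the second phase is hand-waved here; the method names are made up for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStageWordCount {

    // STAGE 1: each machine counts the words in its own subset of documents.
    static Map<String, Long> countSubset(List<String> documentSubset) {
        Map<String, Long> wordCount = new HashMap<>();
        for (String document : documentSubset) {
            for (String token : document.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) wordCount.merge(token, 1L, Long::sum);
            }
        }
        return wordCount;
    }

    // STAGE 2: a single machine merges the per-machine maps
    // (the multisetAdd of the pseudocode).
    static Map<String, Long> mergeCounts(List<Map<String, Long>> perMachineCounts) {
        Map<String, Long> totalWordCount = new HashMap<>();
        for (Map<String, Long> wordCount : perMachineCounts) {
            wordCount.forEach((word, n) -> totalWordCount.merge(word, n, Long::sum));
        }
        return totalWordCount;
    }
}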


[Diagram: a Master distributes the Documents across four worker machines, Comp-1 through Comp-4.]

Problems with STAGE 1

- Document segregation across the machines has to be well defined.
- Network transfer becomes the bottleneck: this is data-intensive processing, not computationally intensive processing.
- So it is better to store the files on the processing machines themselves.

BIGGEST FLAW

- Storing the words and counts in memory.
- A disk-based hash-table implementation is needed.


Problems with STAGE 2

- Phase 2 runs on only one machine: a bottleneck, even though phase 1 is highly distributed.
- So make phase 2 distributed as well.

That needs changes in phase 1:
- Partition the phase-1 output (say, based on the first character of each word).
- With 26 machines in phase 2, the single disk-based hash table now becomes 26 disk-based hash tables: wordcount-a, wordcount-b, wordcount-c, ...
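A small sketch of what that partitioning decision might look like (the first-character scheme is the slide's example; sending non-letters to the last bucket is my assumption):

public class FirstLetterPartitioner {

    // Decide which phase-2 machine receives the counts for a given word.
    static int partitionFor(String word, int numPhase2Machines) {
        char first = Character.toLowerCase(word.charAt(0));
        int bucket = (first >= 'a' && first <= 'z')
                ? first - 'a'   // 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
                : 25;           // assumption: lump non-letters into the last bucket
        return bucket % numPhase2Machines;
    }
}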





[Diagram: after partitioning, each phase-1 machine (Comp-1 through Comp-4) holds per-letter counts (A, B, C, D, E, ...); the counts for each letter are sent to a dedicated phase-2 machine (Comp-10, Comp-20, Comp-30, Comp-40, ...) for aggregation.]



After phase 1:
- From Comp-1: WordCount-A goes to Comp-10, WordCount-B goes to Comp-20, and so on.
- Each machine in phase 1 shuffles its output to the different machines in phase 2.



This is getting complicated:
- Store files where they are being processed.
- Write a disk-based hash table, obviating RAM limitations.
- Partition the phase-1 output.
- Shuffle the phase-1 output and send it to the appropriate reducer.





This is already a lot of machinery for a simple word count, and we haven't even touched fault tolerance: what if Comp-1 or Comp-10 fails?

So there is a need for a framework to take care of all these things, so that we concentrate only on the business logic.





[Diagram: the same pipeline in MapReduce terms. Documents in HDFS feed the MAPPERs (Comp-1 through Comp-4); the interim output is partitioned and shuffled to the REDUCERs (Comp-10 through Comp-40).]


Mapper and Reducer:
- The Mapper filters and transforms the input.
- The Reducer collects that output and aggregates over it.

Extensive research went into arriving at this two-phase strategy.



Mapper, Reducer, Partitioner, and Shuffling work together as a common structure for data processing:

           Input             Output
Mapper     <K1, V1>          List<K2, V2>
Reducer    <K2, list(V2)>    List<K3, V3>


For word count:

Mapper
- Input:  <key, words_per_line>
- Output: <word, 1>

Reducer
- Input:  <word, list(1)>
- Output: <word, count(list(1))>
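In real Hadoop code these signatures look essentially like the classic WordCount example (a sketch against the 0.20-era org.apache.hadoop.mapreduce API; the class names TokenizerMapper and IntSumReducer follow the stock example):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: <offset, line of text> in, <word, 1> out.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit <word, 1>
            }
        }
    }

    // Reducer: <word, list(1)> in, <word, count(list(1))> out.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();           // count(list(1))
            }
            result.set(sum);
            context.write(key, result);     // emit <word, count>
        }
    }
}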




As said, don't store the data in memory:
- Keys and values regularly have to be written to disk, so they must be serialized.
- Hadoop provides its own serialization mechanism.
- Any class used as a key or value has to implement the Writable interface (keys additionally implement WritableComparable, since keys are sorted).

Java Type    Hadoop Serialized Type
String       Text
Integer      IntWritable
Long         LongWritable
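To illustrate the Writable contract, here is a hypothetical custom value type (WordStats and its fields are invented for this example):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class WordStats implements Writable {

    private long count;      // total occurrences of a word
    private long documents;  // number of documents the word appears in

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeLong(documents);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // fields must be read back in the same order they were written
        count = in.readLong();
        documents = in.readLong();
    }
}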


Let’s try to execute the following command:

    hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount

    hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>



What does this code do?

Switch to Eclipse.
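For reference, the code behind that command is essentially the classic WordCount driver: it wires a mapper and reducer like the ones sketched earlier into a Job and points it at <input> and <output>. A sketch against the same 0.20-era API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");  // Job.getInstance(conf, ...) in newer releases

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);  // mapper from the earlier sketch
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);   // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // <input>
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // <output>

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}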