ecs251_s2011_L006 - Map and Reduce


ecs251 Spring 2011: Operating Systems

#6: Map and Reduce

Dr. S. Felix Wu
Computer Science Department
University of California, Davis

http://www.facebook.com/group.php?gid=29670204725
http://cyrus.cs.ucdavis.edu/~wu/ecs251

Programming Model

Input: a key\value pair
Output: a key\value pair

The MapReduce library contains 2 functions:

MAP: takes an input key\value pair and produces intermediate key\value pairs

REDUCE: takes an intermediate key I and the set of values for I, and merges them into a smaller set of values

The MapReduce library groups all intermediate values with the same intermediate key I before handing them to REDUCE

MapReduce : Example

Counting the number of occurrences of each word in a large collection of documents.

MAP: (doc name, doc contents) → list of (word, occurrences) pairs

REDUCE: (word, list of counts) → sum of all counts for that word

Input and output types:

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
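To make these types concrete, here is a hedged, single-machine sketch of the word-count data flow in Java (names are illustrative; this shows the types, not the distributed implementation):

import java.util.*;

public class WordCountTypes {

    // map(k1, v1) -> list(k2, v2): (doc name, doc contents) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(String docName, String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : contents.split("\\W+")) {
            if (word.length() > 0) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // reduce(k2, list(v2)) -> list(v2): (word, list of counts) -> total count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        // group intermediate values by key, as the MapReduce library would
        Map<String, List<Integer>> groups = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> p : map("doc1", "I am a tiger, you are also a tiger")) {
            if (!groups.containsKey(p.getKey())) {
                groups.put(p.getKey(), new ArrayList<Integer>());
            }
            groups.get(p.getKey()).add(p.getValue());
        }
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            System.out.println(g.getKey() + "\t" + reduce(g.getKey(), g.getValue()));
        }
    }
}

Run on the sentence used later in the deck, it prints each word with its total count (a 2, tiger 2, and so on); the walkthrough on slide 16 below shows the same flow split across map and reduce tasks.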


MapReduce : Execution


HDFS Architecture

Components: Client, NameNode, Secondary NameNode, DataNodes (cluster membership)

NameNode: maps a file to a file-id and a list of DataNodes

DataNode: maps a block-id to a physical location on disk

SecondaryNameNode: periodic merge of the transaction log
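The division of labor matters for clients too: a client asks the NameNode only for metadata, then streams block data directly from DataNodes. A hedged sketch using the standard HDFS Java client (the file path is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up fs.default.name
        FileSystem fs = FileSystem.get(conf);       // metadata RPCs go to the NameNode
        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {            // block bytes come from DataNodes
            System.out.write(buf, 0, n);
        }
        in.close();
        System.out.flush();
    }
}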


Map and Reduce

The idea of Map and Reduce is 40+ years old

Present in all functional programming languages

See, e.g., APL, Lisp and ML

An alternate name for Map: Apply-All

Higher-order functions:
  take function definitions as arguments, or
  return a function as output

Map and Reduce are higher-order functions.



GFS: Google File System

“failures” are the norm

Multiple-GB files are common

Append rather than overwrite

Random writes are rare

Can we relax the consistency?



The user supplies:

# an input reader

# a Map function

# a partition function

# a compare function

# a Reduce function

# an output writer
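In Hadoop's classic mapred API, each of these six pieces plugs into the JobConf. A hedged sketch (MyMap and MyRed are the skeleton classes the deck defines later; the comparator line is one reasonable choice, not the only one):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class HookDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HookDemo.class);
        conf.setInputFormat(TextInputFormat.class);               // an input reader
        conf.setMapperClass(MyMap.class);                         // a Map function
        conf.setPartitionerClass(HashPartitioner.class);          // a partition function
        conf.setOutputKeyComparatorClass(Text.Comparator.class);  // a compare function
        conf.setReducerClass(MyRed.class);                        // a Reduce function
        conf.setOutputFormat(TextOutputFormat.class);             // an output writer
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}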


Map: A Higher-Order Function

F(x: int) returns r: int

Let V be an array of integers.

W = map(F, V)

W[i] = F(V[i]) for all i

i.e., apply F to every element of V
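A sketch of map as a higher-order function in Java (pre-lambda style, matching the deck's era; the names are illustrative):

interface IntFunction {
    int apply(int x);
}

public class MapDemo {
    // W = map(F, V): apply F to every element of V
    static int[] map(IntFunction f, int[] v) {
        int[] w = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            w[i] = f.apply(v[i]);   // W[i] = F(V[i]) for all i
        }
        return w;
    }

    public static void main(String[] args) {
        IntFunction addOne = new IntFunction() {
            public int apply(int x) { return x + 1; }
        };
        int[] w = map(addOne, new int[] {1, 2, 3, 4, 5});   // {2, 3, 4, 5, 6}
    }
}

The Haskell examples on the next slide are the same idea with far less ceremony.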


Map Examples in Haskell

map (+1) [1,2,3,4,5]
== [2, 3, 4, 5, 6]

map (toLower) "abcDEFG12!@#"
== "abcdefg12!@#"

map (`mod` 3) [1..10]
== [1, 2, 0, 1, 2, 0, 1, 2, 0, 1]


Word Count Example


Read text files and count how often words
occur.


The input is text files


The output is a text file


each line: word, tab, count


Map: Produce pairs of (word, count)


Reduce: For each word, sum up the counts.


I am a tiger, you are also a tiger

map 1 emits: I,1  am,1  a,1  tiger,1
map 2 emits: you,1  are,1  also,1
map 3 emits: a,1  tiger,1

After sort/copy, the intermediate pairs are grouped by key:

a,1  a,1  |  also,1  |  am,1  |  are,1  |  I,1  |  tiger,1  tiger,1  |  you,1

reduce 1 emits: a,2  also,1  am,1  are,1
reduce 2 emits: I,1  tiger,2  you,1

Final output:

a,2
also,1
am,1
are,1
I,1
tiger,2
you,1


Grep Example


Search input files for a given pattern


Map: emits a line if pattern is matched


Reduce: Copies results to output
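In the deck's classic Hadoop API this is a one-condition mapper. A hedged sketch (the pattern is hard-coded for illustration; a real job would read it from the job configuration), with the reduce side left as the identity:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class GrepMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private static final String PATTERN = "error";    // illustrative pattern

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        if (line.toString().contains(PATTERN)) {
            output.collect(line, new Text(""));        // emit the matching line
        }
    }
}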


Inverted Index Example

Generate an inverted index of words from a given set of files

Map: parses a document and emits <word, docId> pairs

Reduce: takes all pairs for a given word, sorts the docId values, and emits a <word, list(docId)> pair
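A hedged sketch of the pair (assuming an input format that delivers <docId, contents> records, e.g. KeyValueTextInputFormat; names are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndex {

    public static class IndexMap extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text docId, Text contents,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            for (String word : contents.toString().split("\\W+")) {
                if (word.length() > 0) {
                    output.collect(new Text(word), docId);    // emit <word, docId>
                }
            }
        }
    }

    public static class IndexReduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text word, Iterator<Text> docIds,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            List<String> ids = new ArrayList<String>();
            while (docIds.hasNext()) {
                ids.add(docIds.next().toString());
            }
            Collections.sort(ids);                            // sort the docId values
            output.collect(word, new Text(ids.toString()));   // emit <word, list(docId)>
        }
    }
}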


Execution on Clusters

1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return


<Key, Value> Pair

Map
  Input: row data
  Output: <key, value> pairs, e.g. (key1, val), (key2, val), (key1, val), …
  The map step selects the key for each record.

Reduce
  Input: a key with all of its values, e.g. key1: val, val, …, val
  Output: one result per key


input (HDFS): split 0, split 1, split 2, split 3, split 4
  → map tasks
  → sort/copy
  → merge
  → reduce tasks
  → output (HDFS): part0, part1


public class MR {

    public static class MyMap { … }      // Map function (filled in on the next slide)

    public static class MyRed { … }      // Reduce function (filled in after that)

    // other parts of the program: the job configuration
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MR.class);
        conf.setMapperClass(MyMap.class);        // Map
        conf.setReducerClass(MyRed.class);       // Reduce
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);                  // Config
    }
}


class MyMap extends MapReduceBase
    implements Mapper<INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE> {

    // global variables

    public void map(INPUT_KEY key, INPUT_VALUE value,
                    OutputCollector<OUTPUT_KEY, OUTPUT_VALUE> output,
                    Reporter reporter) throws IOException {

        // local variables and program

        output.collect(newKey, newValue);
    }
}

INPUT_KEY, INPUT_VALUE, OUTPUT_KEY and OUTPUT_VALUE are placeholders: each position must be filled with a concrete type.
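Filling the placeholders for word count, for example, gives a mapper like this (an illustrative sketch, not from the slides):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // input key = byte offset of the line, input value = the line of text
        for (String w : line.toString().split("\\W+")) {
            if (w.length() > 0) {
                word.set(w);
                output.collect(word, ONE);   // emit (word, 1)
            }
        }
    }
}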


class MyRed extends MapReduceBase
    implements Reducer<INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE> {

    // global variables

    public void reduce(INPUT_KEY key, Iterator<INPUT_VALUE> values,
                       OutputCollector<OUTPUT_KEY, OUTPUT_VALUE> output,
                       Reporter reporter) throws IOException {

        // local variables and program

        output.collect(newKey, newValue);
    }
}

As with the mapper, the four placeholder positions must be filled with concrete types.
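The matching word-count reducer (same caveat: an illustrative sketch) sums the counts for each word:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();   // add up all the 1s for this word
        }
        output.collect(word, new IntWritable(sum));
    }
}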



Complete web search engine

Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
  » + Plugins
  » + MapReduce & Distributed FS (Hadoop)

Java based, open source, many customizable scripts available at http://lucene.apache.org/nutch/

Features:
  Customizable
  Extensible (e.g. extend to Solr for enhanced portability)


Data Structures used by Nutch

Web Database or WebDB
  Mirrors the properties/structure of the web graph being crawled

Segment
  Intermediate index
  Contains pages fetched in a single run

Index
  Final inverted index obtained by “merging” segments (Lucene)



WebDB

Customized graph database

Used by the Crawler only

Persistent storage for “pages” & “links”

Page DB: indexed by URL and hash; contains content, outlinks, fetch information & score

Link DB: contains “source to target” links and anchor text


Crawling

Cyclic process:
  the crawler generates a set of fetchlists from the WebDB
  fetchers download the content from the Web
  the crawler updates the WebDB with new links that were found
  and then the crawler generates a new set of fetchlists

Repeat until you reach the target “depth”


Indexing

Iterate through all k page sets in parallel, constructing the inverted index

Creates a “searchable document” of:
  URL text
  Content text
  Incoming anchor text

Other content types might have different document fields
  E.g., email has sender/receiver
  Any searchable field the end user will want

Uses the Lucene text indexer


Lucene

Open source search project
  http://lucene.apache.org

Index & search local files
  Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
  Extract files
  Build an index for a directory:
    java org.apache.lucene.demo.IndexFiles dir_path
  Try search at the command line:
    java org.apache.lucene.demo.SearchFiles


Lucene’s Open Architecture

Crawling: sources (file system, WWW, IMAP server) are fetched by crawlers (FS crawler, Larm)

Parsing: per-format parsers (TXT, PDF, HTML, DOC) convert raw files into Lucene Documents

Indexing: an Analyzer (StopAnalyzer, StandardAnalyzer, CN/DE analyzers) feeds the indexer, which builds the Index

Searching: the searcher answers queries against the Index



Index → Documents → Fields:

  An Index holds a set of Documents
  A Document is a set of Fields
  A Field is a (Name, Value) pair
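Using the Lucene 2.2-era API from the demo above, a hedged sketch of building one Document with two Fields and adding it to an Index (paths and field names are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BuildIndex {
    public static void main(String[] args) throws Exception {
        // one Index, built by IndexWriter
        IndexWriter writer = new IndexWriter("/tmp/demo-index", new StandardAnalyzer(), true);

        // one Document with two (Name, Value) Fields
        Document doc = new Document();
        doc.add(new Field("path", "docs/hello.txt", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", "Penn State Football", Field.Store.NO, Field.Index.TOKENIZED));

        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}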



Create an Analyzer

WhitespaceAnalyzer
  divides text at whitespace

SimpleAnalyzer
  divides text at non-letters
  converts to lower case

StopAnalyzer
  SimpleAnalyzer, plus removes stop words

StandardAnalyzer
  good for most European languages
  removes stop words
  converts to lower case



Inverted Index (Inverted File)

Doc 1: Penn State Football … football …
Doc 2: Football players … State …

Posting table:

id   word       postings (doc, word offset)
1    football   (Doc 1, 3), (Doc 1, 67), (Doc 2, 1)
2    penn       (Doc 1, 1)
3    players    (Doc 2, 2)
4    state      (Doc 1, 2), (Doc 2, 13)


How a query is evaluated:

Query → Term Info Index (in memory, constant-time lookup)
      → Term Dictionary (random file access)
      → Frequency File (random file access) and Position File (random file access)

Field info is kept in memory


Map/Reduce Cluster Implementation

Input files (split 0 … split 4) → M map tasks → intermediate files → R reduce tasks → output files (Output 0, Output 1)

Several map or reduce tasks can run on a single computer

Each intermediate file is divided into R partitions by the partitioning function

Each reduce task corresponds to one partition


Hadoop Usage at Facebook


Data warehouse running Hive


600 machines, 4800 cores, 2.4 PB disk


3200 jobs per day


50+ engineers have used Hadoop


Facebook Data Pipeline

Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL → Analysts


Facebook Job Types

Production jobs: load data, compute statistics, detect spam, etc.

Long experiments: machine learning, etc.

Small ad-hoc queries: Hive jobs, sampling

GOAL: provide fast response times for small jobs and guaranteed service levels for production jobs


Cloud Computing Scheduling

FIFO, Fair-Sharing

Job scheduling with “constraints”:
  Dependency
  Priority-oriented
  Soft Deadline


Hive

Developed at Facebook

Used for the majority of Facebook jobs

“Relational database” built on Hadoop
  Maintains a list of table schemas
  SQL-like query language (HQL)
  Can call Hadoop Streaming scripts from HQL
  Supports table partitioning, clustering, complex data types, some optimizations


Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair

Ex: /hive/page_view/dt=2008-06-08,country=US
    /hive/page_view/dt=2008-06-08,country=CA


Simple Query

Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive only reads the matching date partitions instead of scanning the entire table


Aggregation and Joins

Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output:

page_url    gender   count(userid)
home.php    MALE     12,141,412
home.php    FEMALE   15,431,579
photo.php   MALE     23,941,451
photo.php   FEMALE   21,231,314