Lecture 11 notes, MapReduce and Hadoop - Scott Streit


MapReduce
From Wikipedia, the free encyclopedia

Contents

1 Logical view
  1.1 Example
2 Dataflow
  2.1 Input reader
  2.2 Map function
  2.3 Partition function
  2.4 Comparison function
  2.5 Reduce function
  2.6 Output writer
3 Distribution and reliability
4 Uses
5 Implementations
6 References
7 External links
  7.1 Papers


MapReduce is a software framework introduced by Google to support parallel computations over large (multiple petabyte [1]) data sets on clusters of computers. This framework is largely taken from the map and reduce functions commonly used in functional programming, [2] although the actual semantics of the framework are not the same. [3] MapReduce implementations have been written in C++, Java, Python and other languages.

Logical view

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) -> list(k2, v2)

The map function is applied in parallel to every item in the input dataset. This produces a list of (k2, v2) pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, thus creating one group for each of the different generated keys.

The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) -> list(v2)

Each Reduce call typically produces either one value v2 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.

Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.

Example

The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:

  map(String name, String document):
    // key: document name
    // value: document contents
    for each word w in document:
      EmitIntermediate(w, 1);

  reduce(String word, Iterator partialCounts):
    // key: a word
    // values: a list of aggregated partial counts
    int result = 0;
    for each v in partialCounts:
      result += ParseInt(v);
    Emit(result);

Here, each document is split into words, and each word is counted initially with a "1" value by the Map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to Reduce, so this function just needs to sum all of its input values to find the total appearances of that word.
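To make the (key, value) flow concrete, the sketch below runs the same word count inside a single JVM in plain Java: map() emits (word, 1) pairs, a hash map stands in for the framework's grouping step, and reduce() sums each group. It is an illustration of the logical view only, not the distributed framework.

import java.util.*;

public class WordCountSketch {
    // Map: (documentName, contents) -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String name, String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : document.split("\\s+")) {
            if (!w.isEmpty()) pairs.add(Map.entry(w, 1));
        }
        return pairs;
    }

    // Reduce: (word, list of partial counts) -> total count
    static int reduce(String word, List<Integer> partialCounts) {
        int result = 0;
        for (int v : partialCounts) result += v;
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "doc1", "the quick brown fox",
            "doc2", "the lazy dog and the fox");

        // Grouping step: collect all (k2, v2) pairs by key
        Map<String, List<Integer>> groups = new HashMap<>();
        for (var doc : docs.entrySet()) {
            for (var pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Reduce step, applied once per distinct key
        for (var group : groups.entrySet()) {
            System.out.println(group.getKey() + " -> " + reduce(group.getKey(), group.getValue()));
        }
    }
}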

Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

- an input reader
- a Map function
- a partition function
- a compare function
- a Reduce function
- an output writer


Input reader

The input reader divides the input into 16 MB to 128 MB splits and the framework assigns one split to each Map function. The input reader reads data from stable storage (typically a distributed file system like Google File System) and generates key/value pairs.

A common example will read a directory full of text files and return each line as a record.

Map function

Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

If the application is doing a word count, the map function would break the line into words and output the word as the key and "1" as the value.

Partition function

The output of all of the maps is allocated to particular reduces by the application's partition function. The partition function is given the key and the number of reduces, and returns the index of the desired reduce.

A typical default is to hash the key and take the result modulo the number of reduces.
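A minimal sketch of such a default in Java (an illustrative helper, not any particular framework's API): hash the key, mask the sign bit so the index is never negative, and take the result modulo the number of reduces. Hadoop's default HashPartitioner follows essentially this pattern.

public class DefaultPartition {
    // Returns the index of the reduce that should receive this key.
    static int partition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) {
        System.out.println(partition("hello", 4)); // an index in [0, 4)
    }
}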

Comparison function

The input for each reduce is pulled from the machine where the map ran and sorted using the application's comparison function.

Reduce function

The framework calls the application's reduce function once for each unique key in the sorted order. The reduce can iterate through the values that are associated with that key and output zero or more key/value pairs.

In the word count example, the reduce function takes the input values, sums them, and generates a single output of the word and the final sum.

Output writer

The Output Writer writes the output of the reduce to stable storage, usually a distributed file system such as Google File System.

Distribution and reliability

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead, and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that there are not parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).

The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or in the same rack as the node holding the data being operated on; this property is desirable as it conserves bandwidth across the backbone network of the datacenter.

Implementations may not be highly available; in Hadoop, for example, the NameNode is a single point of failure for the distributed filesystem; if the JobTracker fails, all outstanding work is lost.

Uses

MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses. [4]

MapReduce's stable inputs and outputs are usually stored in a distributed file system. The transient data is usually stored on local disk and fetched remotely by the reduces.

David DeWitt and Michael Stonebraker, pioneering experts in parallel databases and shared-nothing architectures, have made some controversial assertions about the breadth of problems that MapReduce can be used for. They called its interface too low-level, and questioned whether it really represents the paradigm shift its proponents have claimed it is. [5] They challenge the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades; they compared MapReduce programmers to Codasyl programmers, noting both are "writing in a low-level language performing low-level record manipulation". [5] MapReduce advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real world use [citation needed]. MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as PigLatin and Sawzall are starting to address these problems. [6]

Implementations



- The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
- The Hadoop project is a free open source Java MapReduce implementation.
- Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL and other languages.
- Phoenix [1] is a shared-memory implementation of MapReduce implemented in C.
- MapReduce has also been implemented for the Cell Broadband Engine, also in C. [2]
- MapReduce has been implemented on NVIDIA GPUs (graphics processors) using CUDA [3].
- Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
- CouchDB uses a MapReduce framework for defining views over distributed documents.
- Skynet is an open source Ruby implementation of Google's MapReduce framework.
- Disco is an open source MapReduce implementation by Nokia. Its core is written in Erlang and jobs are normally written in Python.
- Aster Data Systems nCluster In-Database MapReduce implements MapReduce inside the database.

References

Specific references:

1. ^ Google spotlights data center inner workings | Tech news blog - CNET News.com
2. ^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." - "MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
3. ^ "Google's MapReduce Programming Model -- Revisited" - paper by Ralf Lammel; from Microsoft
4. ^ "How Google Works". baselinemag.com. "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes."
5. ^ a b David DeWitt; Michael Stonebraker. "MapReduce: A major step backwards". databasecolumn.com. Retrieved on 2008-08-27.
6. ^ David DeWitt; Michael Stonebraker. "MapReduce II". databasecolumn.com. Retrieved on 2008-08-27.

General references:

- Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.



MapReduce: A major step backwards
By David DeWitt on January 17, 2008

[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]


On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.

For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshmen how to program using the MapReduce framework.

As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

1. A giant step backward in the programming paradigm for large-scale data intensive applications
2. A sub-optimal implementation, in that it uses brute force instead of indexing
3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
4. Missing most of the features that are routinely included in current DBMSs
5. Incompatible with all of the tools DBMS users have come to depend on

First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.



What is MapReduce?


The basic idea of MapReduce is straightforward. It consists of two programs that the user writes, called map and reduce, plus a framework for executing a possibly large number of instances of each program on a compute cluster.

The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of the N nodes, for a total of N * M files: F(i,j), 1 ≤ i ≤ N, 1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, R(j), 1 ≤ j ≤ M. The input for each reduce instance R(j) consists of the files F(i,j), 1 ≤ i ≤ N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

We now turn to the five concerns we have with this computing paradigm.



1. MapReduce is a step backwards in database access


As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.

- Schemas are good.
- Separation of the schema from the application is good.
- High-level access languages are good.

MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern DBMSs were invented.

The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.

It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.

During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:

- By stating what you want, rather than presenting an algorithm for how to get it (relational view)
- By presenting an algorithm for data access (Codasyl view)

The result is now ancient history, but the entire world saw the value of high-level languages, and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.

MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.

Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}), different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.



2. MapReduce is a poor implementation


All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.

MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.

One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built, including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.

In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.

There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.

One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database Systems: The Future of High Performance Database Systems" [6], skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.

There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the N map instances produces M output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously -- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.

Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.



3. MapReduce is not novel


The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed in "Application of Hash to Data Base Machine and Its Architecture" [11] as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms" [7], Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash based splitting. DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.

Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.

While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid 1980s. Essentially, all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.





4. MapReduce is missing features


All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:

- Bulk loader -- to transform input data in files into a desired format and load it into a DBMS
- Indexing -- as noted above
- Updates -- to change the data in the data base
- Transactions -- to support parallel update and recovery from failures during update
- Integrity constraints -- to help keep garbage out of the data base
- Referential integrity -- again, to help keep garbage out of the data base
- Views -- so the schema can change without having to rewrite the application program

In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.



5. MapReduce is incompatible with the DBMS tools



A modern SQL DBMS has available all of the following classes of tools:

- Report writers (e.g., Crystal Reports) to prepare reports for human visualization
- Business intelligence tools (e.g., Business Objects or Cognos) to enable ad-hoc querying of large data warehouses
- Data mining tools (e.g., Oracle Data Mining or IBM DB2 Intelligent Miner) to allow a user to discover structure in large data sets
- Replication tools (e.g., Golden Gate) to allow a user to replicate data from one DBMS to another
- Database design tools (e.g., Embarcadero) to assist the user in constructing a data base

MapReduce cannot use these tools and has none of its own. Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.



In Summary


It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology -- in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.

We fully understand that database systems are not without their problems. The database community recognizes that database systems are too "hard" to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig [10] project at Yahoo! Research is one such effort.


MapReduce II
By David DeWitt on January 25, 2008

[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]


Last week's MapReduce post attracted tens of thousands of readers and generated many comments, almost all of them attacking our critique. Just to let you know, we don't hold a personal grudge against MapReduce. MapReduce didn't kill our dog, steal our car, or try and date our daughters.

Our motivations for writing about MapReduce stem from MapReduce being increasingly seen as the most advanced and/or only way to analyze massive datasets. Advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real world use.

The point of our initial post was to say that there are striking similarities between MapReduce and a fairly primitive parallel database system. As such, MapReduce can be significantly improved by learning from the parallel database community.

So, hold off on your comments for just a few minutes, as we will spend the rest of this post addressing four specific topics brought up repeatedly by those who commented on our previous blog:

1. MapReduce is not a database system, so don't judge it as one
2. MapReduce has excellent scalability; the proof is Google's use
3. MapReduce is cheap and databases are expensive
4. We are the old guard trying to defend our turf/legacy from the young turks



Feedback No. 1: MapReduce is not a database system, so don't judge it as one


It's not that we don't understand this viewpoint. We are not claiming that MapReduce is a database system. What we are saying is that, like a DBMS + SQL + analysis tools, MapReduce can be and is being used to analyze and perform computations on massive datasets. So we aren't judging apples and oranges. We are judging two approaches to analyzing massive amounts of information, even for less structured information.

To illustrate our point, assume that you have two very large files of facts. The first file contains structured records of the form:

Rankings (pageURL, pageRank)

Records in the second file have the form:

UserVisits (sourceIPAddr, destinationURL, date, adRevenue)

Someone might ask, "What IP address generated the most ad revenue during the week of January 15th to the 22nd, and what was the average page rank of the pages visited?"

This question is a little tricky to answer in MapReduce because it consumes two data sets rather than one, and it requires a "join" of the two datasets to find pairs of Ranking and UserVisit records that have matching values for pageURL and destinationURL. In fact, it appears to require three MapReduce phases, as noted below.

Phase 1

This phase filters UserVisits records that are outside the desired date range and then "joins" the qualifying records with records from the Rankings file.

Map program: The map program scans through UserVisits and Rankings records. Each UserVisit record is filtered on the date range specification. Qualifying records are emitted with composite keys of the form <destinationURL, T1>, where T1 indicates that it is a UserVisits record. Rankings records are emitted with composite keys of the form <pageURL, T2> (T2 is a tag indicating it is a Rankings record). Output records are repartitioned using a user-supplied partitioning function that only hashes on the URL portion of the composite key. (A rough code sketch of this tagging step appears after the Phase 3 description below.)

Reduce program: The input to the reduce program is a single sorted run of records in URL order. For each unique URL, the program splits the incoming records into two sets (one for Rankings records and one for UserVisits records) using the tag component of the composite key. To complete the join, reduce finds all matching pairs of records of the two sets. Output records are in the form of Temp1 (sourceIPAddr, pageURL, pageRank, adRevenue).

The reduce program must be capable of handling the case in which one or both of these sets with the same URL are too large to fit into memory and must be materialized on disk. Since access to these sets is through an iterator, a straightforward implementation will result in what is termed a nested-loops join. This join algorithm is known to have very bad I/O performance characteristics, as the "inner" set is scanned once for each record of the "outer" set.



Phase 2

This phase computes the total ad revenue and average page rank for each source IP address.

Map program: Scan Temp1 using the identity function on sourceIPAddr.

Reduce program: The reduce program makes a linear pass over the data. For each sourceIPAddr, it will sum the ad revenue and compute the average page rank, retaining the one with the maximum total ad revenue. Each reduce worker then outputs a single record of the form Temp2 (sourceIPAddr, total_adRevenue, average_pageRank).


Phase 3

Map program: The program uses a single map worker that scans Temp2 and outputs the record with the maximum value for total_adRevenue.

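Purely as an illustration of the Phase 1 map step described above, here is a small self-contained Java sketch of the tagging trick. The record types, tag strings, date constants, and the in-memory output list are stand-ins invented for the example (Java 16+ records), not part of any real MapReduce API.

import java.time.LocalDate;
import java.util.*;

public class Phase1MapSketch {
    record Ranking(String pageURL, int pageRank) {}
    record UserVisit(String sourceIPAddr, String destinationURL, LocalDate date, double adRevenue) {}
    record TaggedPair(String url, String tag, Object record) {}

    static final LocalDate START = LocalDate.of(2008, 1, 15);
    static final LocalDate END = LocalDate.of(2008, 1, 22);

    // Map: filter UserVisits on the date range and tag every output record so the
    // reduce side can separate the two inputs again before joining them on URL.
    static List<TaggedPair> map(List<Ranking> rankings, List<UserVisit> visits) {
        List<TaggedPair> out = new ArrayList<>();
        for (UserVisit v : visits) {
            if (!v.date().isBefore(START) && !v.date().isAfter(END)) {
                out.add(new TaggedPair(v.destinationURL(), "T1", v)); // T1 = UserVisits record
            }
        }
        for (Ranking r : rankings) {
            out.add(new TaggedPair(r.pageURL(), "T2", r));            // T2 = Rankings record
        }
        // A user-supplied partition function would hash only the URL part of the
        // composite key, so matching records meet at the same reduce instance.
        return out;
    }

    public static void main(String[] args) {
        List<Ranking> rankings = List.of(new Ranking("a.com/x", 7));
        List<UserVisit> visits = List.of(
            new UserVisit("1.2.3.4", "a.com/x", LocalDate.of(2008, 1, 16), 0.50));
        map(rankings, visits).forEach(System.out::println);
    }
}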

We realize that portions of the processing steps described above are handled automatically by the MapReduce infrastructure (e.g., sorting and partitioning the records). Although we have not written this program, we estimate that the custom parts of the code (i.e., the map() and reduce() functions) would require substantially more code than the two fairly simple SQL statements to do the same:

Q1

Select as Temp
  sourceIPAddr, avg(pageRank) as avgPR, sum(adRevenue) as adTotal
From Rankings, UserVisits
Where Rankings.pageURL = UserVisits.destinationURL
  and date > "Jan 14" and date < "Jan 23"
Group by sourceIPAddr

Q2

Select sourceIPAddr, adTotal, avgPR
From Temp
Where adTotal = max(adTotal)

No matter what you think of SQL, eight lines of code is almost certainly easier to write and debug than the programming required for MapReduce. We believe that MapReduce advocates should consider the advantages that layering a high-level language like SQL could provide to users of MapReduce. Apparently we're not alone in this assessment, as efforts such as PigLatin and Sawzall appear to be promising steps in this direction.

We also firmly believe that augmenting the input files with a schema would provide the basis for improving the overall performance of MapReduce applications by allowing B-trees to be created on the input data sets and techniques like hash partitioning to be applied. These are technologies in widespread practice in today's parallel DBMSs, of which there are quite a number on the market, including ones from IBM, Teradata, Netezza, Greenplum, Oracle, and Vertica. All of these should be able to execute this program with the same or better scalability and performance as MapReduce.


Here's how these capabilities could benefit MapReduce:

1. Indexing. The filter condition (date > "Jan 14" and date < "Jan 23") can be executed by using a B-tree index on the date attribute of the UserVisits table, avoiding a sequential scan of the entire table.

2. Data movement. When you load files into a distributed file system prior to running MapReduce, data items are typically assigned to blocks/partitions in sequential order. As records are loaded into a table in a parallel database system, it is standard practice to apply a hash function to an attribute value to determine which node the record should be stored on (the same basic idea as is used to determine which reduce worker should get an output record from a map instance). For example, records being loaded into the Rankings and UserVisits tables might be mapped to a node by hashing on the pageURL and destinationURL attributes, respectively. If loaded this way, the join of Rankings and UserVisits in Q1 above would be performed completely locally, with absolutely no data movement between nodes. Furthermore, as result records from the join are materialized, they will be pipelined directly into a local aggregate computation without being written first to disk. This local aggregate operator will partially compute the two aggregates (sum and average) concurrently (what is called a combiner in MapReduce terminology). These partial aggregates are then repartitioned by hashing on the sourceIPAddr to produce the final results for Q1.

It is certainly the case that you could do the same thing in MapReduce by using hashing to map records to chunks of the file and then modifying the MapReduce program to exploit the knowledge of how the data was loaded. But in a database, physical data independence happens automatically. When Q1 is "compiled," the query optimizer will extract partitioning information about the two tables from the schema. It will then generate the correct query plan based on this partitioning information (e.g., maybe Rankings is hash partitioned on pageURL but UserVisits is hash partitioned on sourceIPAddr). This happens transparently to any user (modulo changes in response time) who submits a query involving a join of the two tables.

3. Column representation. Many questions access only a subset of the fields of the input files. The others do not need to be read by a column store.

4. Push, not pull. MapReduce relies on the materialization of the output files from the map phase on disk for fault tolerance. Parallel database systems push the intermediate files directly to the receiving (i.e., reduce) nodes, avoiding writing the intermediate results and then reading them back as they are pulled by the reduce computation. This provides MapReduce far superior fault tolerance at the expense of additional I/Os.




In general, we expect these mechanisms to provide about a factor of 10 to 100 performance advantage,
depending on the selectivity of the query, the width of the input records to the map computation, and
the size of the output files from the map phase. As such, we believe that 10 to 100 parallel database
nodes can do the work of 1,000 MapReduce nodes.


To further illustrate our point, suppose you have a more general filter, F, a more general group_by function, G, and a more general reduce function, R. PostgreSQL (an open source, free DBMS) allows the following SQL query over a table T:

Select R(T)
From T
Where F(T)
Group_by G(T)

F, R, and G can be written in a general-purpose language like C or C++. A SQL engine, extended with user-defined functions and aggregates, has nearly -- if not all -- of the generality of MapReduce.

As such, we claim that most things that are possible in MapReduce are also possible in a SQL engine. Hence, it is exactly appropriate to compare the two approaches. We are working on a more complete paper that demonstrates the relative performance and relative programming effort between the two approaches, so stay tuned.





Feedback No. 2: MapReduce has excellent scalability; the proof is Google's use


Many readers took offense at our comment about scaling and asserted that since Google runs MapReduce programs on 1,000s (perhaps 10s of 1,000s) of nodes, it must scale well. Having started benchmarking database systems 25 years ago (yes, in 1983), we believe in a more scientific approach toward evaluating the scalability of any system for data intensive applications.

Consider the following scenario. Assume that you have a 1 TB data set that has been partitioned across 100 nodes of a cluster (each node will have about 10 GB of data). Further assume that some MapReduce computation runs in 5 minutes if 100 nodes are used for both the map and reduce phases. Now scale the dataset to 10 TB, partition it over 1,000 nodes, and run the same MapReduce computation using those 1,000 nodes. If the performance of MapReduce scales linearly, it will execute the same computation on 10x the amount of data using 10x more hardware in the same 5 minutes. Linear scaleup is the gold standard for measuring the scalability of data intensive applications. As far as we are aware, there are no published papers that study the scalability of MapReduce in a controlled scientific fashion. MapReduce may indeed scale linearly, but we have not seen published evidence of this.





Feedback No. 3: MapReduce is cheap and databases are expensive


Every organization has a "build" versus "buy" decision, and we don't question the decision by Google to roll its own data analysis solution. We also don't intend to defend DBMS pricing by the commercial vendors. What we wanted to point out is that we believe it is possible to build a version of MapReduce with more functionality and better performance. Pig is an excellent step in this direction.

Also, we want to mention that there are several open source (i.e., free) DBMSs, including PostgreSQL, MySQL, Ingres, and BerkeleyDB. Several of the aforementioned parallel DBMS companies have increased the scale of these open source systems by adding parallel computing extensions.

A number of individuals also commented that SQL and the relational data model are too restrictive. Indeed, the relational data model might very well be the wrong data model for the types of datasets that MapReduce applications are targeting. However, there is considerable ground between the relational data model and no data model at all. The point we were trying to make is that developers writing business applications have benefited significantly from the notion of organizing data in the database according to a data model and accessing that data through a declarative query language. We don't care what that language or model is. Pig, for example, employs a nested relational model, which gives developers more flexibility than a traditional 1NF model allows.



Feedback No. 4: We are the old guard trying to defend our turf/legacy from the young turks


Since both of us are among the "gray beards" and have been on this earth about 2 giga-seconds, we have seen a lot of ideas come and go. We are constantly struck by the following two observations:

How insular computer science is. The propagation of ideas from sub-discipline to sub-discipline is very slow and sketchy. Most of us are content to do our own thing, rather than learn what other sub-disciplines have to offer.

How little knowledge is passed from generation to generation. In a recent paper entitled "What goes around comes around" (M. Stonebraker/J. Hellerstein, Readings in Database Systems, 4th edition, MIT Press, 2004), one of us noted that many current database ideas were tried a quarter of a century ago and discarded. However, such wisdom does not seem to be passed down from the "gray beards" to the "young turks."

The turks and gray beards aren't usually, and shouldn't be, adversaries.

Thanks for stopping by the "pasture" and reading this post. We look forward to reading your feedback, comments and alternative viewpoints.







Serial vs. Parallel Programming

In the early days of computing, programs were serial; that is, a program consisted of a sequence of instructions, where each instruction executed one after the other. It ran from start to finish on a single processor.

Parallel programming developed as a means of improving performance and efficiency. In a parallel program, the processing is broken up into parts, each of which can be executed concurrently. The instructions from each part run simultaneously on different CPUs. These CPUs can exist on a single machine, or they can be CPUs in a set of computers connected via a network.

Not only are parallel programs faster, they can also be used to solve problems on large datasets using non-local resources. When you have a set of computers connected on a network, you have a vast pool of CPUs, and you often have the ability to read and write very large files (assuming a distributed file system is also in place).

The Basics

The first step in building a parallel program is identifying sets of tasks that can run concurrently and/or partitions of data that can be processed concurrently. Sometimes it's just not possible. Consider a Fibonacci function:

F(k+2) = F(k) + F(k+1)

A function to compute this, based on the form above, cannot be "parallelized" because each computed value is dependent on previously computed values.

A common situation is having a large amount of consistent data which must be processed. If the data can be decomposed into equal-size partitions, we can devise a parallel solution. Consider a huge array which can be broken up into sub-arrays.

If the same processing is required for each array element, with no dependencies in the computations, and no communication required between tasks, we have an ideal parallel computing opportunity. Here is a common implementation technique called master/worker.

The MASTER:

- initializes the array and splits it up according to the number of available WORKERS
- sends each WORKER its subarray
- receives the results from each WORKER

The WORKER:

- receives the subarray from the MASTER
- performs processing on the subarray
- returns results to MASTER

This model implements static load balancing, which is commonly used if all tasks are performing the same amount of work on identical machines. In general, load balancing refers to techniques which try to spread tasks among the processors in a parallel system to avoid some processors being idle while others have tasks queueing up for execution.

A static load balancer allocates processes to processors at run time while taking no account of current network load. Dynamic algorithms are more flexible, though more computationally expensive, and give some consideration to the network load before allocating the new process to a processor.

As an example of the MASTER/WORKER technique, consider one of the methods for approximating pi. The first step is to inscribe a circle inside a square:

[Figure: a circle of radius r inscribed in a square with sides of length 2r]

The area of the square is As = (2r)^2 = 4r^2.

The area of the circle is Ac = pi * r^2. So:

pi = Ac / r^2
As = 4r^2, so r^2 = As / 4
pi = 4 * Ac / As

The reason we are doing all this algebraic manipulation is that we can parallelize this method in the following way:

1. Randomly generate points in the square
2. Count the number of generated points that are both in the circle and in the square
3. r = the number of points in the circle divided by the number of points in the square
4. PI = 4 * r

And here is how we parallelize it:

NUMPOINTS = 100000; // some large number - the bigger, the closer the approximation

p = number of WORKERS;
numPerWorker = NUMPOINTS / p;
countCircle = 0; // one of these for each WORKER

// each WORKER does the following:
for (i = 0; i < numPerWorker; i++) {
  generate 2 random numbers that lie inside the square;
  xcoord = first random number;
  ycoord = second random number;
  if (xcoord, ycoord) lies inside the circle
    countCircle++;
}

MASTER:
  receives from WORKERS their countCircle values
  computes PI from these values: PI = 4.0 * countCircle / NUMPOINTS;
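A compact, runnable Java version of the same MASTER/WORKER scheme is sketched below; the worker count and point count are arbitrary choices, and it samples the unit square with a quarter circle, which gives the same 4 * (hits / points) ratio. Each worker thread counts how many of its random points fall inside the circle, and the master sums the counts and computes pi.

import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

public class PiMasterWorker {
    public static void main(String[] args) throws Exception {
        final long NUMPOINTS = 1_000_000;          // bigger -> closer approximation
        final int p = Runtime.getRuntime().availableProcessors(); // number of WORKERS
        final long numPerWorker = NUMPOINTS / p;

        ExecutorService pool = Executors.newFixedThreadPool(p);
        LongAdder countCircle = new LongAdder();

        // Each WORKER generates its share of random points in the unit square
        // and counts how many fall inside the inscribed quarter circle.
        for (int w = 0; w < p; w++) {
            pool.submit(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                long local = 0;
                for (long i = 0; i < numPerWorker; i++) {
                    double x = rnd.nextDouble(), y = rnd.nextDouble();
                    if (x * x + y * y <= 1.0) local++;
                }
                countCircle.add(local);            // report back to the MASTER
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        // MASTER: PI = 4 * (points in circle) / (points actually generated)
        double pi = 4.0 * countCircle.sum() / (numPerWorker * p);
        System.out.println("Estimated pi = " + pi);
    }
}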

What is MapReduce?

Now that we have seen some basic examples of parallel programming, we can look at the MapReduce programming model. This model derives from the map and reduce combinators from a functional language like Lisp.

In Lisp, a map takes as input a function and a sequence of values. It then applies the function to each value in the sequence. A reduce combines all the elements of a sequence using a binary operation. For example, it can use "+" to add up all the elements in the sequence.
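For readers who do not know Lisp, the same two combinators exist in Java's streams; a minimal sketch:

import java.util.List;

public class MapReduceCombinators {
    public static void main(String[] args) {
        List<Integer> values = List.of(1, 2, 3, 4, 5);

        // map: apply a function to every element (here, squaring)
        // reduce: combine all elements with a binary operation (here, "+")
        int sumOfSquares = values.stream()
                                 .map(x -> x * x)
                                 .reduce(0, Integer::sum);

        System.out.println(sumOfSquares); // prints 55
    }
}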

MapReduce is inspired by these concepts. It was developed within Google as a mechanism for processing large amounts of raw data, for example, crawled documents or web request logs. This data is so large, it must be distributed across thousands of machines in order to be processed in a reasonable time. This distribution implies parallel computing since the same computations are performed on each CPU, but with a different dataset. MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.

Map, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. [1]

Consider the problem of counting the number of occurrences of each word in a large collection of documents:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result)); [1]

The map function emits each word plus an associated count of occurrences ("1" in this example). The reduce function sums together all the counts emitted for a particular word.
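The same word count, written against Apache Hadoop's MapReduce API, looks like the sketch below. It assumes the Hadoop 2.x+ org.apache.hadoop.mapreduce classes and closely follows the standard WordCount example from the Hadoop documentation; run it with input and output paths as the two command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}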

MapReduce Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits or shards. The input shards can be processed in parallel on different machines.

Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

The illustration below shows the overall flow of a MapReduce operation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the illustration correspond to the numbers in the list below).

[Figure: overall flow of a MapReduce operation]

1. The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

3. A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files. [1]

To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are re-executed when failure occurs because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.

MapReduce Examples

Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>.

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.

Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions. [1]
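As one concrete illustration, here is a sketch of the inverted index using the same Hadoop 2.x+ API assumed in the word-count sketch earlier. The tab-separated "docID<TAB>text" input format is an assumption made for the example, and the pair of classes would still need to be wired into a Job driver like the one shown for word count.

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits <word, documentID> for every word in the document body.
public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2); // assumed "docID<TAB>text" lines
        if (parts.length < 2) return;
        docId.set(parts[0]);
        StringTokenizer itr = new StringTokenizer(parts[1]);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, docId);
        }
    }
}

// Reducer: collects, de-duplicates, and sorts the document IDs for each word.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        TreeSet<String> ids = new TreeSet<>();
        for (Text v : values) ids.add(v.toString());
        context.write(key, new Text(String.join(",", ids)));
    }
}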



This story appeared on JavaWorld at http://www.javaworld.com/javaworld/jw-09-2008/jw-09-hadoop.html

MapReduce programming with Apache Hadoop

Process massive data sets in parallel on large clusters

By Ravi Shankar and Govindu Narendra, JavaWorld.com, 09/23/08

Google and its MapReduce framework may rule the roost when it comes to massive-scale data processing, but there's still plenty of that goodness to go around. This article gets you started with Hadoop, the open source MapReduce implementation for processing large data sets. Authors Ravi Shankar and Govindu Narendra first demonstrate the powerful combination of map and reduce in a simple Java program, then walk you through a more complex data-processing application based on Hadoop. Finally, they show you how to install and deploy your application in both standalone mode and clustering mode.

Are you amazed by the fast response you get while searching the Web with Google or Yahoo? Have you ever wondered how these services manage to search millions of pages and return your results in milliseconds or less? The algorithms that drive both of these major-league search services originated with Google's MapReduce framework. While MapReduce is proprietary technology, the Apache Foundation has implemented its own open source map-reduce framework, called Hadoop. Hadoop is used by Yahoo and many other services whose success is based on processing massive amounts of data. In this article we'll help you discover whether it might also be a good solution for your distributed data processing needs.

We'll start with an overview of MapReduce, followed by a couple of Java programs that demonstrate the simplicity and power of the framework. We'll then introduce you to Hadoop's MapReduce implementation and walk through a complex application that searches a huge log file for a specific string. Finally, we'll show you how to install Hadoop in a Microsoft Windows environment and deploy the application -- first as a standalone application and then in clustering mode.

You won't be an expert in all things Hadoop when you're done reading this article, but you will have enough material to explore and possibly implement Hadoop for your own large-scale data-processing requirements.

About MapReduce

MapReduce is a programming model specifically implemented for processing large data sets. The model was developed by Jeffrey Dean and Sanjay Ghemawat at Google (see "MapReduce: Simplified data processing on large clusters"). At its core, MapReduce is a combination of two functions -- map() and reduce(), as its name would suggest.

A quick look at a sample Java program will help you get your bearings in MapReduce. This application implements a very simple version of the MapReduce framework, but isn't built on Hadoop. The simple, abstracted program will illustrate the core parts of the MapReduce framework and the terminology associated with it. The application creates some strings, counts the number of characters in each string, and finally sums them up to show the total number of characters altogether. Listing 1 contains the program's Main class.

Listing 1. Main class for a simple MapReduce Java app

public class Main
{
    public static void main(String[] args)
    {
        MyMapReduce my = new MyMapReduce();
        my.init();
    }
}
Listing 1 just instantiates a class called MyMapReduce, which is shown in Listing 2.

Listing 2. MyMapReduce.java

import java.util.*;

public class MyMapReduce
...

Download complete Listing 2
As you see, the crux of the class lies in just four functions:

The init() method creates some dummy data (just 30 strings). This data serves as the input data for the program. Note that in the real world, this input could be gigabytes, terabytes, or petabytes of data!

The step1ConvertIntoBuckets() method segments the input data. In this example, the data is divided into five smaller chunks and put inside an ArrayList named buckets. You can see that the method takes a list, which contains all of the input data, and another int value, numberOfBuckets. This value has been hardcoded to five; if you divide 30 strings into five buckets, each bucket will have six strings. Each bucket in turn is represented as an ArrayList. These array lists are finally put into another list and returned. So, at the end of the function, you have an array list with five buckets (array lists) of six strings each. These buckets can be put in memory (as in this case), saved to disk, or put onto different nodes in a cluster!

step2RunMapFunctionForAllBuckets() is the next method invoked from init(). This method internally creates five threads (because there are five buckets -- the idea is to start a thread for each bucket). The class responsible for threading is StartThread, which is implemented as an inner class. Each thread processes each bucket and puts the individual result in another array list named intermediateresults. All the computation and threading takes place within the same JVM, and the whole process runs on a single machine. If the buckets were on different machines, a master would have to monitor them to know when the computation is over, whether there are any failures in processing on any of the nodes, and so on. It would be great if the master could perform the computations on the different nodes, rather than bringing the data from all nodes to the master itself and executing it there.

The step3RunReduceFunctionForAllBuckets() method collates the results from intermediateresults, sums them up, and gives you the final output. Note that intermediateresults needs to combine the results from the parallel processing explained in the previous point. The exciting part is that this combining step can also happen concurrently! (A simple sketch of what such a class might look like follows this list.)
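For illustration, here is a minimal sketch of a class along these lines. It is not the downloadable Listing 2: the step names are kept, but the bodies are simplified (plain java.lang.Thread instead of a StartThread inner class, a hardcoded bucket count) so you can see the init/bucket/map/reduce flow in one place.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- a simplified stand-in for the article's MyMapReduce class.
public class MyMapReduceSketch {

    private final List<Integer> intermediateresults = new ArrayList<>();

    public void init() {
        // Create some dummy input data (30 strings).
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            data.add("input string number " + i);
        }
        List<List<String>> buckets = step1ConvertIntoBuckets(data, 5);
        step2RunMapFunctionForAllBuckets(buckets);
        step3RunReduceFunctionForAllBuckets();
    }

    // Step 1: split the input into numberOfBuckets smaller lists.
    List<List<String>> step1ConvertIntoBuckets(List<String> data, int numberOfBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        int bucketSize = data.size() / numberOfBuckets;
        for (int b = 0; b < numberOfBuckets; b++) {
            buckets.add(new ArrayList<>(data.subList(b * bucketSize, (b + 1) * bucketSize)));
        }
        return buckets;
    }

    // Step 2: one thread per bucket; each thread "maps" its bucket to a character count.
    void step2RunMapFunctionForAllBuckets(List<List<String>> buckets) {
        List<Thread> threads = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Thread t = new Thread(() -> {
                int count = 0;
                for (String s : bucket) {
                    count += s.length();
                }
                synchronized (intermediateresults) {
                    intermediateresults.add(count);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

    // Step 3: "reduce" the per-bucket counts to a single total.
    void step3RunReduceFunctionForAllBuckets() {
        int total = 0;
        for (int partial : intermediateresults) {
            total += partial;
        }
        System.out.println("Total number of characters: " + total);
    }

    public static void main(String[] args) {
        new MyMapReduceSketch().init();
    }
}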

A more complicated scenario

Processing 30 input elements doesn't really make for an interesting scenario. Imagine instead that there are 100,000 elements of data to be processed. The task at hand is to search for the total number of occurrences of the word JavaWorld. The data may be structured or unstructured. Here's how you'd approach it:

Assume that, in some way, the data is divided into smaller chunks and is inserted into buckets. You have a total of 10 buckets now, with 10,000 elements of data within each of them. (Don't bother worrying about who exactly does the dividing at the moment.)

Apply a function named map(), which in turn executes your search algorithm on a single bucket and repeats it concurrently for all the buckets in parallel, storing the result (of processing each bucket) in another set of buckets, called result buckets. Note that there may be more than one result bucket.

Apply a function named reduce() on each of these result buckets. This function iterates through the result buckets, takes in each value, and then performs some kind of processing, if needed. The processing may either aggregate the individual values or apply some kind of business logic on the aggregated or individual values. This functionality once again takes place concurrently.

Finally, you will get the result you expected.

These four steps are very simple but there is so much power in them! Let's look at the details.

Dividing the data

In Step 1, note that the buckets created by someone for you may be on a single machine or on multiple machines (though they must be on the same cluster in that case). In practice, that means that in large enterprise projects, multiple terabytes or petabytes of data could be segmented into thousands of buckets on different machines in the cluster, and processing could be performed in parallel, giving the user an extremely fast response. Google uses this concept to index every Web page it crawls. If you take advantage of the power of the underlying filesystem used for storing the data in individual machines of the cluster, the result could be even more fascinating. Google uses the proprietary Google File System (GFS) for this.

The map() function

In Step 2, the map() function understands exactly where it should go to process the data. The source of data may be memory, or disk, or another node in the cluster. Please note that bringing data to the place where the map() function resides is more costly and time-consuming than letting the function execute at the place where the data resides. If you write a C++ or Java program to process data on multiple threads, the program fetches data from a data source (typically a remote database server) and is usually executed on the machine where your application is running. In MapReduce implementations, the computation happens on the distributed nodes.

The reduce() function

In Step 3, the reduce() function operates on one or more lists of intermediate results by fetching each of them from memory, disk, or a network transfer and performing a function on each element of each list. The final result of the complete operation is obtained by collating and interpreting the results from all processes running reduce() operations.

In Step 4, you get the final output, which can be either 0 or some data element.
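As a quick, hedged illustration of the scenario just described, the following Java sketch divides a small list of strings into buckets, searches each bucket for "JavaWorld" concurrently (the map step), and then sums the per-bucket counts (the reduce step). The class and method names are invented for this example; a real deployment would spread the buckets across machines rather than threads.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: count occurrences of "JavaWorld" across buckets in parallel.
public class JavaWorldSearchSketch {

    public static void main(String[] args) throws Exception {
        // Stand-in for the 100,000-element data set; here it is just a small sample.
        List<String> data = new ArrayList<>();
        data.add("Welcome to JavaWorld");
        data.add("This article appeared on JavaWorld");
        data.add("MapReduce is a programming model");

        int numberOfBuckets = 2;
        List<List<String>> buckets = split(data, numberOfBuckets);

        // Map step: search each bucket concurrently, producing one partial count per bucket.
        ExecutorService pool = Executors.newFixedThreadPool(numberOfBuckets);
        List<Future<Integer>> resultBuckets = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Callable<Integer> searchTask = () -> countOccurrences(bucket, "JavaWorld");
            resultBuckets.add(pool.submit(searchTask));
        }

        // Reduce step: aggregate the partial counts into the final answer.
        int total = 0;
        for (Future<Integer> partial : resultBuckets) {
            total += partial.get();
        }
        pool.shutdown();
        System.out.println("Occurrences of JavaWorld: " + total);
    }

    // Round-robin the elements into numberOfBuckets buckets.
    static List<List<String>> split(List<String> data, int numberOfBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        for (int b = 0; b < numberOfBuckets; b++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < data.size(); i++) {
            buckets.get(i % numberOfBuckets).add(data.get(i));
        }
        return buckets;
    }

    // Count every occurrence of word within the elements of one bucket.
    static int countOccurrences(List<String> bucket, String word) {
        int count = 0;
        for (String element : bucket) {
            int index = element.indexOf(word);
            while (index >= 0) {
                count++;
                index = element.indexOf(word, index + word.length());
            }
        }
        return count;
    }
}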

With this simple Java program under your belt, you're ready to understand the more complex MapReduce implementation in Apache Hadoop.

Apache Hadoop

Hadoop is an open source implementation of the MapReduce programming model. Hadoop relies not on the Google File System (GFS), but on its own Hadoop Distributed File System (HDFS). HDFS replicates data blocks in a reliable manner and places them on different nodes; computation is then performed by Hadoop on these nodes. HDFS is similar to other filesystems, but is designed to be highly fault tolerant. This distributed filesystem does not require any high-end hardware, but can run on commodity computers and software; it is also scalable, which is one of the primary design goals for the implementation. HDFS is independent of any specific hardware or software platform, and is hence easily portable across heterogeneous systems.

If you've worked with clustered Java EE applications, you're probably familiar with the concept of a master instance that manages other instances of the application server (called slaves) in a network deployment architecture. These master instances may be called deployment managers (if you're using WebSphere), manager servers (with WebLogic) or admin servers (with Tomcat). It is the responsibility of the master server instance to delegate various responsibilities to slave application server instances, to listen for handshaking signals from each instance so as to decide which are alive and which are dead, to do IP multicasting whenever required for synchronization of serializable sessions and data, and other similar tasks. The master stores the metadata and relevant port information of the slaves and works in a collaborative manner so that the end user feels as if there is only one instance.

HDFS works more or less in a similar way. In the HDFS architecture, the master is called a NameNode and the slaves are called DataNodes. There is only a single NameNode in HDFS, whereas there are many DataNodes across the cluster, usually one per node. HDFS allocates a namespace (similar to a package in Java, a tablespace in Oracle, or a namespace in C++) for storing user data. A file might be split into one or more data blocks, and these data blocks are kept in a set of DataNodes. The NameNode holds the necessary metadata on how the blocks are mapped to each other and which blocks are stored in which DataNodes. Note that not all the requests delegated to DataNodes need to pass through the NameNode. All the filesystem's client requests for reading and writing are processed directly by the DataNodes, whereas namespace operations like the opening, closing, and renaming of directories are performed by the NameNode. The NameNode is also responsible for issuing instructions to DataNodes for data block creation, replication, and deletion.

A typical deployment of HDFS has a dedicated machine that runs only the NameNode. Each of the other machines in the cluster typically runs one instance of the DataNode software, though the architecture does allow you to run multiple DataNodes on the same machine. The NameNode is concerned with metadata repository and control, but otherwise never handles user data. The NameNode uses a special kind of log, named EditLog, for the persistence of metadata.
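To give a feel for how client code talks to HDFS, here is a small, hedged example using Hadoop's org.apache.hadoop.fs.FileSystem API. The path names are invented for this example; which filesystem the code actually talks to depends on the fs.default.name setting picked up from your configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: write a small file into the configured filesystem and read it back.
public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up hadoop-default.xml and hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);          // the filesystem named by fs.default.name

        Path dir = new Path("/demo");                  // hypothetical directory
        Path file = new Path(dir, "greeting.txt");     // hypothetical file
        fs.mkdirs(dir);

        FSDataOutputStream out = fs.create(file);      // the NameNode decides where the blocks go
        out.writeUTF("Hello HDFS");
        out.close();

        FSDataInputStream in = fs.open(file);          // block data is streamed back from DataNodes
        System.out.println(in.readUTF());
        in.close();
    }
}

Run against a standalone installation, this simply uses the local filesystem, which makes it a convenient way to test code before pointing it at a real cluster.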

Deploying Hadoop

Though Hadoop is a pure Java implementation, you can use it in two different ways. You can either take advantage of a streaming API provided with it or use Hadoop pipes. The latter option allows you to build Hadoop apps with C++; this article will focus on the former.

Hadoop's main design goal is to provide storage and communication on lots of homogeneous commodity machines. The implementers selected Linux as their initial platform for development and testing; hence, if you're working with Hadoop on Windows, you will have to install separate software to mimic the shell environment.

Hadoop can run in three different ways, depending on how the processes are distributed:

Standalone mode: This is the default mode provided with Hadoop. Everything is run as a single Java process.

Pseudo-distributed mode: Here, Hadoop is configured to run on a single machine, with different Hadoop daemons run as different Java processes.

Fully distributed or cluster mode: Typically, one machine in the cluster is designated as the NameNode and another machine as the JobTracker. There is exactly one NameNode in each cluster, which manages the namespace, filesystem metadata, and access control. You can also set up an optional SecondaryNameNode, used for periodic handshaking with the NameNode for fault tolerance. The rest of the machines within the cluster act as both DataNodes and TaskTrackers. The DataNode holds the system data; each data node manages its own locally scoped storage, or its local hard disk. The TaskTrackers carry out map and reduce operations.

Writing a Hadoop MapReduce application

The surest way to understand how Hadoop works is to walk through the process of writing a Hadoop MapReduce application. For the remainder of this article, we'll be working with EchoOhce, a simple MapReduce application that can reverse many strings. The input strings to be reversed represent the large amount of data that MapReduce applications typically work with. The example divides the data into different nodes, performs the reversal operations, combines the result strings, and then outputs the results. This application provides an opportunity to examine all of the main concepts of Hadoop. After you understand how it works, you'll see how it can be deployed in different modes.

First, take a look at the package declaration and imports in Listing 3. The EchoOhce class is in the com.javaworld.mapreduce package.

Listing 3. Package declarations for EchoOhce

package com.javaworld.mapreduce;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;
import java.io.*;
import java.net.*;
import java.util.regex.MatchResult;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

The first set of imports is for the standard Java classes, and the second set is for the MapReduce implementation.

The EchoOhce class begins by extending org.apache.hadoop.conf.Configured and implementing the interface org.apache.hadoop.util.Tool, as you can see in Listing 4.

Listing 4. Extending Configured, implementing Tool

public class EchoOhce extends Configured implements Tool {

    //..your code goes here

}
The Configured class is responsible for delivering the configuration parameters specified in certain XML files. This is done when the programmer invokes the getConf() method of this class. This method returns an instance of org.apache.hadoop.conf.Configuration, which is basically a holder for the resources specified as name-value pairs in XML data. Each resource is named by either a String or by an org.apache.hadoop.fs.Path instance.

By default, the two resources loaded in order from the classpath are:

hadoop-default.xml: This file contains read-only defaults for Hadoop, like global properties, logging properties, I/O properties, filesystem properties, and the like. If you want to use your own values for any of these properties, you can override them in hadoop-site.xml.

hadoop-site.xml: Here you can override the values in hadoop-default.xml that do not meet your specific objectives.

Please note that applications may add additional resources -- as many as you want. Those are loaded in order from the classpath. You can find out more from the Hadoop API documentation for the addResource() and addFinalResource() methods. addFinalResource() allows the flexibility of declaring a resource to be final so that subsequently loaded resources cannot alter that value.
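As a brief, hedged illustration (the resource name my-app-site.xml is invented for this example), loading an extra resource and reading back a property might look like this:

import org.apache.hadoop.conf.Configuration;

// Illustrative sketch: load an extra configuration resource and read a property.
public class ConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads hadoop-default.xml, then hadoop-site.xml
        conf.addResource("my-app-site.xml");        // hypothetical extra resource on the classpath
        // Read a property, falling back to a default value if it is not set anywhere.
        String tracker = conf.get("mapred.job.tracker", "local");
        System.out.println("mapred.job.tracker = " + tracker);
    }
}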

You might have noticed that the code implements an interface named Tool. This interface supports a variety of methods to handle generic command-line options. The interface forces the programmer to write a method, run(), that takes in a String array as a parameter and returns an int. The integer returned determines whether the execution has been successful or not. Once you've implemented the run() method in your class, you can write your main() method, as in Listing 5.

Listing 5. main() method

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new EchoOhce(), args);
    System.exit(res);
}

The org.apache.hadoop.util.ToolRunner class invokes the run() method implemented in the EchoOhce class. The ToolRunner utility helps to run classes that implement the Tool interface. With this facility, developers can avoid writing a custom handler to process various input options.

Map and reduce

Now you can jump into the actual MapReduce implementation. You're going to write two inner classes within the EchoOhce class. They are:

Map: Includes functionality for processing input key-value pairs to generate output key-value pairs.

Reduce: Includes functionality for collecting output from parallel map processing and outputting that collected data.

Figure 1 illustrates how the sample app will work.

Figure 1. Map and Reduce in action

First, take a look at the Map class in Listing 6.

Listing 6. Map class

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

    private Text inputText = new Text();
    private Text reverseText = new Text();

    public void map(LongWritable key, Text inputs,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {

        String inputString = inputs.toString();
        int length = inputString.length();
        StringBuffer reverse = new StringBuffer();
        for (int i = length - 1; i >= 0; i--)
        {
            reverse.append(inputString.charAt(i));
        }
        inputText.set(inputString);
        reverseText.set(reverse.toString());

        output.collect(inputText, reverseText);
    }
}
As mentioned earlier, the EchoOhce application must take an input string, reverse it, and return a key-value pair with the input and reverse strings together. First, it gets the parameters for the map() function -- namely, the inputs and the output. From the inputs, it gets the input String. The application uses the simple Java API to find the reverse of this String, then creates a key-value pair by setting the input String and the reverse String. You end up with an OutputCollector instance, which contains the result of this processing. Assume that this is one result obtained from one execution of the map() function on one of the nodes.

Obviously, you'll need to combine all such outputs. This is exactly what the reduce() method of the Reduce class, shown in Listing 7, will do.

Listing 7. Reduce.reduce()

public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}
The MapReduce framework knows how many OutputCollectors there are and which are to be combined for the final result. The reduce() method actually does the grunt work.

Finally, to complete the EchoOhce class, you need to set the values for your configuration. Basically, these values inform the MapReduce framework about the types of the output keys and values, the names of the Map and Reduce classes, and so on. The complete run() method is shown in Listing 8.

Listing 8. run()

public int run(String[] args) throws Exception {

    JobConf conf = new JobConf(getConf(), EchoOhce.class);
    conf.setJobName("EchoOhce");
    ...

Download complete Listing 8.

As you can see in the listing, you must first create a Configuration instance; org.apache.hadoop.mapred.JobConf extends from the Configuration class. JobConf has the primary responsibility of sending your map and reduce implementations to the Hadoop framework for execution. Once the JobConf instance has been given the appropriate values for your MapReduce implementation, you invoke the most important method, named runJob(), on the org.apache.hadoop.mapred.JobClient class, by passing in the JobConf instance. JobClient internally communicates with the org.apache.hadoop.mapred.JobTracker class, and provides facilities for submission of jobs, tracking progress, accessing the progress or logs, and getting cluster status.
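Because the complete Listing 8 is only available as a download, here is a hedged sketch of the kind of wiring a run() method typically performs with the old org.apache.hadoop.mapred API. It is not the authors' listing -- it assumes the EchoOhce class and imports from Listings 3 and 4, and the exact setter calls and input/output handling depend on your Hadoop version.

public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), EchoOhce.class);
    conf.setJobName("EchoOhce");

    // Types of the keys and values the job will emit.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // The inner classes that implement the map and reduce steps.
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    // Input and output locations taken from the command line (for example, "words" and "result").
    org.apache.hadoop.mapred.FileInputFormat.setInputPaths(conf, new Path(args[0]));
    org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Hand the configured job to the framework and wait for it to finish.
    JobClient.runJob(conf);
    return 0;
}

With a run() along these lines, the command shown later in this article (bin/hadoop jar EchoOhce.jar com.javaworld.mapreduce.EchoOhce words result) reads its input from the words directory and writes its output to result.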

That should give you a good sense of how EchoOhce, a sample MapReduce application, works. We'll conclude with instructions for installing the relevant software and running the application.

Installing a MapReduce application in standalone mode

Unlike a Java EE application that can easily be deployed onto an app server, a MapReduce application using Hadoop requires some extra steps for deployment. First, you should understand the default, out-of-the-box way that Hadoop operates: standalone mode. The following steps describe how to set up the application on a Windows XP Professional system; the process would be almost identical for any other environment, with an important exception that you'll learn more about in a moment.

1. Ensure that version 5.0 or above of Java is installed on your machine.

2. Download the latest version of Hadoop. At the time that this article was published, the latest distribution was version 0.18.0. Save the downloads into a directory -- this example will use D:\hadoop.

3. Make sure that you're logged in with an OS user name that doesn't contain spaces. For example, a username like "Ravi" should be used rather than "Ravi Shankar". This is to avoid some problems (which will be fixed in later versions) while using SSH communication. Please also make sure that your system uses a username and password to log on at startup. Do not bypass authentication. SSH will synchronize with Windows login while doing some handshakes.

4. As mentioned earlier, you will need to have an execution environment for shell scripts. If you're using a Unix-like OS, you will already have a command line available to you; but on a Windows machine, you will need to install the Cygwin tools. Download the Cygwin package, making sure that you have selected the openSSH package (under the NET category) before you begin. For the other packages, you can simply use the defaults.

5. In this example, Java has been installed in D:\Tiger. You need to make Hadoop aware of this directory. Go to your Hadoop installation in the D:\hadoop directory, then to the conf subdirectory. Open the file named hadoop-env.sh and change the value of JAVA_HOME (uncommenting, if necessary) to the following:

   JAVA_HOME=/cygdrive/d/Tiger

   (Note the /cygdrive prefix. This is how Cygwin maps your Windows directory to a Unix-style directory format.)

6. Start Cygwin by choosing Start > All Programs > Cygwin > Cygwin Bash Shell.

7. In Hadoop, communication between different processes across different machines is achieved through SSH, so the next important step is to get sshd running. If you're using SSH for the first time, please note that sshd needs a config file to run, which is generated by the following command:

   ssh-host-config

   When you enter this, you will get a prompt usually asking for the value for CYGWIN. Enter ntsec tty. If you are again prompted with a question on the privilege separation that should be used, your answer should be no. If asked for your consent for installing SSH as a service, give yes as your response.

   Once this has been set up, start the sshd service by typing:

   /usr/sbin/sshd

   To make sure that sshd is running, check the process status:

   ps | grep sshd

8. If sshd is running, you can try to SSH to localhost:

   ssh localhost

   If you're asked for a passphrase to SSH to the localhost, press Ctrl-C and enter:

   ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
   cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

9. Try running the example programs available at the Hadoop site. If all of the above steps have gone as they should, you should get the expected output.

10. Now it's time to create the input data for the EchoOhce application:

    echo "Hello" >> word1
    echo "World" >> word2
    echo "Goodbye" >> word3
    echo "JavaWorld" >> word4

11. Next, you need to put the files you created in Step 10 into HDFS after creating a directory. Note that you do not need to create any partitions for HDFS. It comes as part of the Hadoop installation, and all you need to do is execute the following commands:

    bin/hadoop dfs -mkdir words
    bin/hadoop dfs -put word1 words/
    bin/hadoop dfs -put word2 words/
    bin/hadoop dfs -put word3 words/
    bin/hadoop dfs -put word4 words/

12. Next, create a JAR file for the sample application. As an easy and extensible approach, create two environment variables on your machine, HADOOP_HOME and HADOOP_VERSION. (For the sample under consideration, the values will be D:\Hadoop and 0.17.1, respectively.) Now you can create EchoOhce.jar with the following commands:

    mkdir EchoOhce_classes
    javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d EchoOhce_classes EchoOhce.java
    jar -cvf EchoOhce.jar -C EchoOhce_classes/ .

13. Finally, it's time to see the output. Run the application with the following command:

    bin/hadoop jar EchoOhce.jar com.javaworld.mapreduce.EchoOhce words result

    You will see an output screen with details like the following:

    08/07/18 11:14:45 INFO streaming.StreamJob: map 0% reduce 0%
    08/07/18 11:14:52 INFO streaming.StreamJob: map 40% reduce 0%
    08/07/18 11:14:53 INFO streaming.StreamJob: map 80% reduce 0%
    08/07/18 11:14:54 INFO streaming.StreamJob: map 100% reduce 0%
    08/07/18 11:15:03 INFO streaming.StreamJob: map 100% reduce 100%
    08/07/18 11:15:03 INFO streaming.StreamJob: Job complete: job_20080718003_0007
    08/07/18 11:15:03 INFO streaming.StreamJob: Output: result

    Now go to the result directory, and look in the file named result. It should contain the following:

    Hello olleH
    World dlroW
    Goodbye eybdooG
    JavaWorld dlroWavaJ
Installing a MapReduce application in real cluster mode

Running the sample application in standalone mode will prove that things are working properly, but it isn't really very exciting. To really demonstrate the power of Hadoop, you'll want to execute it in real cluster mode.

1. Pick six open port numbers that you can use; this example will use 8000 through 8005. (If the details from the netstat command reveal that these are not available, please feel free to use any six of your choice.) You will need four machines, MACH1 to MACH4, all interconnected either through a cable or wireless LAN. In the sample scenario described here, they are connected via a home network.

2. MACH1 will be the NameNode, and MACH2 will be the JobTracker. As mentioned earlier, in a cluster-based environment there will be only one of each.

3. Open the file named hadoop-site.xml under the conf directory of your Hadoop installation. Change the values to match those shown in Listing 9.

Listing 9. hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>MACH1:8000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>MACH2:8000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.secondary.info.port</name>
    <value>8001</value>
  </property>
  <property>
    <name>dfs.info.port</name>
    <value>8002</value>
  </property>
  <property>
    <name>mapred.job.tracker.info.port</name>
    <value>8003</value>
  </property>
  <property>
    <name>tasktracker.http.port</name>
    <value>8004</value>
  </property>
</configuration>
4. Open the file named masters under the conf directory. Here you need to add the master NameNode and JobTracker names, as shown in Listing 10. (If there are existing entries, please replace them with those shown in the listing.)

   Listing 10. Adding NameNode and JobTracker names

   MACH1
   MACH2

5. Open the file named slaves under the conf directory. This is where you put the names of the DataNodes, as shown in Listing 11. (Again, if there are existing entries in this file, please replace them.)

   Listing 11. Adding DataNode names

   MACH3
   MACH4

6. Now you're ready to go, and it's time to start the Hadoop cluster. Log on to each node, accepting the defaults. Log into your NameNode as follows:

   ssh MACH1

   Now go to the Hadoop directory:

   cd /hadoop0.17.1/

   Execute the start script to launch HDFS:

   bin/start-dfs.sh

   (Note that you can stop this later with the stop-dfs.sh command.)

7. Start the JobTracker exactly as above, with the following commands:

   ssh MACH2
   cd /hadoop0.17.1/
   bin/start-mapred.sh

   (Again, this can be stopped later by the corresponding stop-mapred.sh command.)

You can now execute the EchoOhce application as described in the previous section, in the same way. The difference is that now the program will be executed across a cluster of DataNodes. You can confirm this by going to the Web interface provided with Hadoop. Point your browser to http://localhost:8002. (The default is actually port 50070; to see why you'd need to use port 8002 here, take a closer look at Listing 9.) You should see a frame similar to the one in Figure 2, showing the details of the NameNode and all jobs managed by it.

Figure 2. Hadoop Web interface, showing the number of nodes and their status

This Web interface will provide many details to browse through, showing you the full statistics of your application. Hadoop comes with several different Web interfaces by default; you can see their default URLs in hadoop-default.xml. For example, in this sample application, http://localhost:8003 will show you JobTracker statistics. (The default port is 50030.)

In conclusion

In this article, we've presented the fundamentals of MapReduce programming with the open source Hadoop framework. This excellent framework accelerates the processing of large amounts of data through distributed processes, delivering very fast responses. It can be adopted and customized to meet various development requirements and can be scaled by increasing the number of nodes available for processing. The extensibility and simplicity of the framework are the key differentiators that make it a promising tool for data processing.

About the author

Ravi Shankar is an assistant vice president of technology development, currently working in the financial industry. He is a Sun Certified Programmer and Sun Certified Enterprise Architect with 15 years of industry experience. He has been a presenter at international conferences like JavaOne 2004, IIWAS 2004, and IMECS 2006. Ravi served earlier as a technical member of the OASIS Framework for Web Services Implementation Committee. He spends most of his leisure time exploring new technologies.

Govindu Narendra is a technical architect pursuing development of parallel processing technologies and portal development in data warehousing. He is a Sun Certified Programmer.

All contents copyright 1995-2008 Java World, Inc. http://www.javaworld.com