/*
  Joe Stein, Chief Architect
  http://www.medialets.com
  Twitter: @allthingshadoop
*/

Tutorial: Streaming Jobs (& Non-Java Hadoop)

Sample Code: https://github.com/joestein/amaunet
Overview

• Intro
• Sample Dataset
• Options
• Deep Dive

http://allthingshadoop.com/2010/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/
Medialets

• Largest deployment of rich media ads for mobile devices
• Installed on hundreds of millions of devices
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
  – 35% JVM (20% Scala & 10% Java)
  – 30% Ruby
  – 20% C/C++
  – 13% Python
  – 2% Bash

MapReduce 101
Why and How It Works
Sample Dataset

Data set 1: countries.dat

name|key
United States|US
Canada|CA
United Kingdom|UK
Italy|IT
Sample Dataset

Data set 2: customers.dat

name|type|country
Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA
Sample Dataset

The requirement: grouped by type of customer, find out how many of each type are in each country, with the full country name from countries.dat in the final result (not the 2-digit country code).

To do this you need to:
1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results
Sample Dataset

United States|US
Canada|CA
United Kingdom|UK
Italy|IT

Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA
The expected result:

Canada                  not so good   1
Canada                  valued        3
JA - Unknown Country    not so bad    1
United Kingdom          not so good   1
United Kingdom          valued        1
United States           not bad       2
So many ways to MapReduce

• Java
• Hive
• Pig
• Datameer
• Cascading
  – Cascalog
  – Scalding
• Streaming with a framework
  – Wukong
  – Dumbo
  – mrjob
• Streaming without a framework
  – You can even do it with bash scripts, but don't
Why and When

There are two types of jobs in Hadoop: 1) data transformation and 2) queries.

• Java
  – Faster? Maybe not; you might not know how to optimize it as well as the Pig and Hive committers do. And it's Java, so it does not work outside of Hadoop without other Apache projects to let it do so.
• Hive & Pig
  – Definitely a possibility, but maybe better after you have created your data set. Does not work outside of Hadoop.
• Datameer
  – WICKED cool front end, seriously!!!
• Streaming
  – With a framework: one more thing to learn.
  – Without a framework: MapReduce with and without Hadoop. Huh, really? Yeah!!!
How does streaming work
stdin & stdout

• Hadoop actually opens a process, then writes to and reads from it
• Is this efficient? Yeah, it is when you look at it
• You can read/write to your process without Hadoop – score!!!
• Why would you do this?
  – You should not put things into Hadoop that don't belong there. Prototype and go live without the overhead!
  – You can have your MapReduce program run outside of Hadoop until it is ready and NEEDS to be running there
  – Really great dev lifecycles
  – Did I mention the great dev lifecycles?
  – You can write a script in 5 minutes, seriously, and then interrogate TERABYTES of data without a fuss
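To make that concrete, here is a minimal sketch (not from the original talk; the file name identity.py is illustrative) of what a streaming task really is: a process that reads lines from stdin and writes lines to stdout. Hadoop, or a plain shell pipe, supplies both ends.

#!/usr/bin/env python
# identity.py - the smallest possible streaming task: whatever feeds stdin
# (Hadoop or a shell pipe) and collects stdout is the "framework";
# no Hadoop API is involved at all
import sys

for line in sys.stdin:
    sys.stdout.write(line)  # identity mapper: pass every record through

Because it is just a process, echo hello | ./identity.py exercises exactly the same code path Hadoop will run.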
Blah blah blah
Where's the beef?

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    try:  # bad data can cause errors; handle lint and bad data however you like
        personName = "-1"     # default, sorts first
        personType = "-1"     # default, sorts first
        countryName = "-1"    # default, sorts first
        country2digit = "-1"  # default, sorts first
        # remove leading and trailing whitespace
        line = line.strip()
        splits = line.split("|")
        if len(splits) == 2:  # country data
            countryName = splits[0]
            country2digit = splits[1]
        else:  # people data
            personName = splits[0]
            personType = splits[1]
            country2digit = splits[2]
        print '%s^%s^%s^%s' % (country2digit, personType, personName, countryName)
    except:  # swallow bad lines; an unhandled error would fail the job, which you may or may not want
        pass
Here is the output of that

CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JA^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^-1^-1^United States
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1
Padding is your friend
All sorts are not created equal

Josephs-MacBook-Pro:~ josephstein$ cat test
1,,2
1,1,2
Josephs-MacBook-Pro:~ josephstein$ cat test | sort
1,,2
1,1,2

[root@megatron joestein]# cat test
1,,2
1,1,2
[root@megatron joestein]# cat test | sort
1,1,2
1,,2
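This is why the mapper pads empty fields with "-1" instead of leaving them blank: GNU sort obeys the locale's collation rules, while BSD sort on the Mac (and Hadoop's byte-oriented key comparison) compares raw bytes, so unpadded fields can order differently from machine to machine. Running the local test with LC_ALL=C sort reproduces byte order. A small illustration (hypothetical, not from the deck) of how the padding also guarantees that a country's mapping line sorts ahead of its people:

# '-' is 0x2D, below every digit and letter, so under byte-order comparison
# the "-1" padded country line always reaches the reducer before its people
lines = [
    "CA^valued^Sam Sneed^-1",
    "CA^-1^-1^Canada",
    "CA^not so good^Yo Yo Ma^-1",
]
for line in sorted(lines):  # Python sorts by code point, i.e. byte order here
    print(line)
# CA^-1^-1^Canada
# CA^not so good^Yo Yo Ma^-1
# CA^valued^Sam Sneed^-1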
And the reducer

#!/usr/bin/env python
import sys

# running state: the current key, count, and country mapping
foundKey = ""
foundValue = ""
isFirst = 1
currentCount = 0
currentCountry2digit = "-1"
currentCountryName = "-1"
isCountryMappingLine = False

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from the mapper
        country2digit, personType, personName, countryName = line.split('^')
        # the first line for a country should be its mapping line; otherwise currentCountryName falls back to not known
        if personName == "-1":  # this is a new country, which may or may not have people in it
            currentCountryName = countryName
            currentCountry2digit = country2digit
            isCountryMappingLine = True
        else:
            isCountryMappingLine = False  # this is a person we want to count
        if not isCountryMappingLine:  # we only count people, but use the country line to get the right name
            # first check that the 2-digit code matches the last mapping line; it might be an unknown country
            if currentCountry2digit != country2digit:
                currentCountry2digit = country2digit
                currentCountryName = '%s - Unknown Country' % currentCountry2digit
            currentKey = '%s\t%s' % (currentCountryName, personType)
            if foundKey != currentKey:  # new combination of keys to count
                if isFirst == 0:
                    print '%s\t%s' % (foundKey, currentCount)
                    currentCount = 0  # reset the count
                else:
                    isFirst = 0
                foundKey = currentKey  # remember the key we saw so the next loop knows whether to increment or print
            currentCount += 1  # we increment for every line that is not a mapping line
    except:
        pass

# flush the count for the final key
try:
    print '%s\t%s' % (foundKey, currentCount)
except:
    pass
How to run it

• Locally, without Hadoop:

cat customers.dat countries.dat | ./smplMapper.py | sort | ./smplReducer.py

• On the cluster:

su hadoop -c "hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.89-streaming.jar \
  -D mapred.map.tasks=75 \
  -D mapred.reduce.tasks=42 \
  -file ./smplMapper.py -mapper ./smplMapper.py \
  -file ./smplReducer.py -reducer ./smplReducer.py \
  -input $1 -output $2 \
  -inputformat SequenceFileAsTextInputFormat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -jobconf stream.map.output.field.separator=^ \
  -jobconf stream.num.map.output.key.fields=4 \
  -jobconf map.output.key.field.separator=^ \
  -jobconf num.key.fields.for.partition=1"
Breaking down the Hadoop job

• -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
  – This is how you handle keying on values
• -jobconf stream.map.output.field.separator=^
  – Tells Hadoop how to parse your map output so it can key on it
• -jobconf stream.num.map.output.key.fields=4
  – How many fields make up the key in total
• -jobconf map.output.key.field.separator=^
  – You can key on your map fields separately
• -jobconf num.key.fields.for.partition=1
  – How many of those fields are your "key" for partitioning; the rest are used for sorting
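Putting those settings together: every map output line has four ^-separated key fields, only the first (the 2-digit country) picks the reducer, and all four fields drive the sort inside it. A rough, hypothetical sketch of that behavior (not Hadoop's actual code; the toy hash stands in for Hadoop's partition hash):

def partition(line, num_reducers=42):
    # num.key.fields.for.partition=1: only the first ^-separated field
    # decides which reducer gets the record
    country2digit = line.split("^")[0]
    return sum(ord(c) for c in country2digit) % num_reducers  # toy stand-in for Hadoop's hash

mapped = [
    "US^not bad^Alice Bob^-1",
    "CA^-1^-1^Canada",
    "CA^valued^Jon Sneed^-1",
]
buckets = {}
for record in mapped:
    buckets.setdefault(partition(record), []).append(record)
for reducer in sorted(buckets):
    # stream.num.map.output.key.fields=4: all four fields form the sort key,
    # so within a reducer the country mapping line arrives first
    print("reducer %d gets %s" % (reducer, sorted(buckets[reducer])))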
Some tips

• chmod a+x your .py files; they need to execute on the nodes because each one is LITERALLY a process that is run
• NEVER hold too much in memory; it is better to use the last-variable method (as in the reducer above) than to hold, say, a hashmap
• It is OK to have multiple jobs. DON'T put too much into each of them; it is better to make more passes over the data. Transform, then query and calculate. Creating data sets for your data lets others interrogate the data too
• To join smaller data sets, use -file and open the file in the script (see the sketch after this list)
• http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
• For Ruby streaming check out the podcast: http://allthingshadoop.com/2010/05/20/ruby-streaming-wukong-hadoop-flip-kromer-infochimps/
• Sample code for this talk: https://github.com/joestein/amaunet
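For that -file tip: shipping a small file with -file places it in each task's working directory, so the script can open it by its plain name. A hypothetical map-side join sketch (the script name joinMapper.py and its variable names are illustrative, not part of the talk's sample code):

#!/usr/bin/env python
# joinMapper.py - run with: ... -file ./countries.dat -file ./joinMapper.py -mapper ./joinMapper.py
# -file drops countries.dat into the task's working directory, so open() just works
import sys

countries = {}  # the lookup side is small enough to hold in memory
for row in open("countries.dat"):
    name, code = row.strip().split("|")
    countries[code] = name

for line in sys.stdin:  # customer records arrive on stdin
    name, personType, code = line.strip().split("|")
    print("%s\t%s" % (countries.get(code, "Unknown"), personType))

This trades the reduce-side join above for an in-memory lookup, which only works when one side of the join fits in RAM.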
Medialets
The rich media ad platform for mobile.

connect@medialets.com
www.medialets.com/showcase

We are hiring!

/*
  Joe Stein, Chief Architect
  http://www.medialets.com
  Twitter: @allthingshadoop
*/