/*
    Joe Stein, Chief Architect
    http://www.medialets.com
    Twitter: @allthingshadoop
*/


Tutorial: Streaming Jobs (& Non-Java Hadoop)

Sample Code: https://github.com/joestein/amaunet



Overview

- Intro
- Sample Dataset
- Options
- Deep Dive

http://allthingshadoop.com/2010/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/



Medialets

- Largest deployment of rich media ads for mobile devices
- Installed on hundreds of millions of devices
- 3-4 TB of new data every day
- Thousands of services in production
- Hundreds of thousands of events received every second
- Response times are measured in microseconds

Languages

- 35% JVM (20% Scala & 10% Java)
- 30% Ruby
- 20% C/C++
- 13% Python
- 2% Bash


MapReduce 101

Why and How It Works

Sample Dataset

Data set 1: countries.dat

name|key

United States|US
Canada|CA
United Kingdom|UK
Italy|IT



Sample Dataset

Data set 2: customers.dat

name|type|country

Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA


Sample Dataset

The requirement: grouped by type of customer, find out how many of each type are in each country, with the country name from countries.dat in the final result (not the 2-digit country code).

To do this you need to:

1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results

A minimal local sketch of these steps follows.
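The streaming mapper and reducer later in this deck implement exactly these steps. For orientation only, here is a minimal, non-Hadoop sketch of the same join-and-count in plain Python; it assumes countries.dat and customers.dat are in the current directory and it keeps everything in memory, which is fine for this tiny sample but not for real data (see the memory tip near the end of the deck).

#!/usr/bin/env python
# Hypothetical local sketch: join customers to country names, key on
# country, count customer types per country, print the results.
from collections import defaultdict

countries = {}                      # 2-digit code -> country name
for raw in open("countries.dat"):
    line = raw.strip()
    if not line:
        continue
    name, code = line.split("|")
    countries[code] = name

counts = defaultdict(int)           # (country name, customer type) -> count
for raw in open("customers.dat"):
    line = raw.strip()
    if not line:
        continue
    name, ctype, code = line.split("|")
    country = countries.get(code, "%s - Unknown Country" % code)
    counts[(country, ctype)] += 1

for (country, ctype), count in sorted(counts.items()):
    print("%s\t%s\t%s" % (country, ctype, count))

The streaming version below produces the same counts, but it never holds the whole lookup result in memory: the framework's sort brings each country's rows together, and the reducer only tracks the current key.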


Sample Dataset

United States|US
Canada|CA
United Kingdom|UK
Italy|IT

Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA



Canada                  not so good    1
Canada                  valued         3
JA - Unknown Country    not so bad     1
United Kingdom          not so good    1
United Kingdom          valued         1
United States           not bad        2


So many ways to MapReduce

- Java
- Hive
- Pig
- Datameer
- Cascading
- Cascalog
- Scalding
- Streaming with a framework
  - Wukong
  - Dumbo
  - mrjob
- Streaming without a framework
  - You can even do it with bash scripts, but don't


Why and When

There are two types of jobs in Hadoop: 1) data transformation and 2) queries.

- Java
  - Faster? Maybe not, because you might not know how to optimize it as well as the Pig and Hive committers do; it's Java ... so ...
  - Does not work outside of Hadoop without other Apache projects to let it do so.
- Hive & Pig
  - Definitely a possibility, but maybe better after you have created your data set. Does not work outside of Hadoop.
- Datameer
  - WICKED cool front end, seriously!!!
- Streaming
  - With a framework: one more thing to learn
  - Without a framework: MapReduce with and without Hadoop, huh? really? Yeah!!!




How does streaming work

stdin & stdout

- Hadoop actually opens a process, then writes to it and reads from it
- Is this efficient? Yeah, it is when you look at it
- You can read/write to your process without Hadoop. Score!!!

Why would you do this?

- You should not put things into Hadoop that don't belong there. Prototype and go live without the overhead!
- You can have your MapReduce program run outside of Hadoop until it is ready and NEEDS to be running there
- Really great dev lifecycles
- Did I mention the great dev lifecycles?
- You can write a script in 5 minutes, seriously, and then interrogate TERABYTES of data without a fuss

A rough sketch of that process model follows.
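This is not Hadoop code, just an illustration of the contract: the framework launches your mapper as a child process, writes input records to its stdin, and reads the key/value lines it prints to stdout. The script and file names below are the ones used later in this deck.

#!/usr/bin/env python
# Stand-in for the framework: pipe raw records into the mapper process
# and collect whatever it prints, the same way Hadoop streaming would.
import subprocess

mapper = subprocess.Popen(["./smplMapper.py"],
                          stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE,
                          universal_newlines=True)

with open("customers.dat") as f:
    output, _ = mapper.communicate(f.read())

for line in output.splitlines():
    print(line)   # e.g. US^not bad^Alice Bob^-1

Because the contract is just stdin and stdout, the same two scripts also run with nothing but cat and sort, which is exactly the local test shown on the "How to run it" slide.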


Blah blah blah

Where's the beef?

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    try: # sometimes bad data can cause errors; use this however you like to deal with lint and bad data

        personName = "-1" # default, sorted as first
        personType = "-1" # default, sorted as first
        countryName = "-1" # default, sorted as first
        country2digit = "-1" # default, sorted as first

        # remove leading and trailing whitespace
        line = line.strip()

        splits = line.split("|")

        if len(splits) == 2: # country data
            countryName = splits[0]
            country2digit = splits[1]
        else: # people data
            personName = splits[0]
            personType = splits[1]
            country2digit = splits[2]

        print '%s^%s^%s^%s' % (country2digit,personType,personName,countryName)
    except: # errors are going to make your job fail, which you may or may not want
        pass



Here is the output of that

CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JA^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^-1^-1^United States
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1



Padding is your friend

All sorts are not created equal:

Josephs-MacBook-Pro:~ josephstein$ cat test
1,,2
1,1,2
Josephs-MacBook-Pro:~ josephstein$ cat test | sort
1,,2
1,1,2

[root@megatron joestein]# cat test
1,,2
1,1,2
[root@megatron joestein]# cat test | sort
1,1,2
1,,2

The difference is locale collation: the Linux box collates past the punctuation, while the Mac's sort compares bytes, and Hadoop orders its text keys byte-wise too. So pad missing key fields with a default value rather than leaving them empty. The snippet below shows the same effect on this job's '^' keys.
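This is also why the mapper fills every missing field with "-1" instead of leaving it blank. Under a plain byte-wise ordering (what LC_ALL=C gives you locally, and roughly how Hadoop sorts text keys), "-" sorts ahead of letters and digits, so a country's mapping line reaches the reducer before that country's people. A small illustration using keys from the mapper output above:

# Keys taken from the mapper output on the previous slide.
lines = [
    "CA^valued^Jon Sneed^-1",
    "CA^-1^-1^Canada",
    "CA^not so good^Yo Yo Ma^-1",
]
for line in sorted(lines):     # plain byte-wise ordering
    print(line)
# CA^-1^-1^Canada              <- mapping line first, because '-' < 'n' < 'v'
# CA^not so good^Yo Yo Ma^-1
# CA^valued^Jon Sneed^-1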


And the reducer

#!/usr/bin/env python

import sys

# maps words to their counts
foundKey = ""
foundValue = ""
isFirst = 1
currentCount = 0
currentCountry2digit = "-1"
currentCountryName = "-1"
isCountryMappingLine = False

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the input we got from mapper.py
        country2digit,personType,personName,countryName = line.split('^')

        # the first line should be a mapping line, otherwise we need to set the currentCountryName to not known
        if personName == "-1": # this is a new country which may or may not have people in it
            currentCountryName = countryName
            currentCountry2digit = country2digit
            isCountryMappingLine = True
        else:
            isCountryMappingLine = False # this is a person we want to count

        if not isCountryMappingLine: # we only want to count people but use the country line to get the right name

            # first check to see if the 2-digit country info matches up, might be an unknown country
            if currentCountry2digit != country2digit:
                currentCountry2digit = country2digit
                currentCountryName = '%s - Unknown Country' % currentCountry2digit

            currentKey = '%s\t%s' % (currentCountryName,personType)

            if foundKey != currentKey: # new combo of keys to count
                if isFirst == 0:
                    print '%s\t%s' % (foundKey,currentCount)
                    currentCount = 0 # reset the count
                else:
                    isFirst = 0

                foundKey = currentKey # make the found key what we see so when we loop again we can see if we increment or print out

            currentCount += 1 # we increment anything not in the map list
    except:
        pass

try:
    print '%s\t%s' % (foundKey,currentCount)
except:
    pass



How to run it

cat customers.dat countries.dat | ./smplMapper.py | sort | ./smplReducer.py

su hadoop -c "hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.89-streaming.jar \
  -D mapred.map.tasks=75 \
  -D mapred.reduce.tasks=42 \
  -file ./smplMapper.py \
  -mapper ./smplMapper.py \
  -file ./smplReducer.py \
  -reducer ./smplReducer.py \
  -input $1 \
  -output $2 \
  -inputformat SequenceFileAsTextInputFormat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -jobconf stream.map.output.field.separator=^ \
  -jobconf stream.num.map.output.key.fields=4 \
  -jobconf map.output.key.field.separator=^ \
  -jobconf num.key.fields.for.partition=1"


Breaking down the Hadoop job

-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
    This is how you handle keying on values

-jobconf stream.map.output.field.separator=^
    Tells Hadoop how to parse your output so it can key on it

-jobconf stream.num.map.output.key.fields=4
    How many fields total make up the key

-jobconf map.output.key.field.separator=^
    You can key on your map fields separately

-jobconf num.key.fields.for.partition=1
    This is how many of those fields are your "key"; the rest are sort

A small simulation of this partition/sort split follows.
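In other words: the map output key is the first 4 '^'-separated fields, only the first of those fields (the 2-digit country) picks the reducer, and the full key still drives the sort order within that reducer. A rough, hypothetical simulation of that split (the real KeyFieldBasedPartitioner hashes the bytes of the chosen key fields; Python's hash() is only a stand-in here):

# Hypothetical simulation of the partition vs. sort split for this job.
records = [
    "US^not bad^Alice Bob^-1",
    "CA^-1^-1^Canada",
    "US^-1^-1^United States",
    "CA^valued^Sam Sneed^-1",
]

num_reducers = 2                                 # arbitrary for the example
partitions = {}
for rec in records:
    country = rec.split("^")[0]                  # num.key.fields.for.partition=1
    partitions.setdefault(hash(country) % num_reducers, []).append(rec)

for p, recs in sorted(partitions.items()):
    for rec in sorted(recs):                     # all 4 key fields drive the sort
        print("reducer %d gets %s" % (p, rec))

Because partitioning only looks at the country code, a country's "-1" mapping line and all of its people land on the same reducer, which is what the reducer's last-key logic relies on.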



Some tips

- chmod a+x your .py files; they need to execute on the nodes because they are LITERALLY a process that is run
- NEVER hold too much in memory; it is better to use the last-variable method than holding, say, a hashmap
- It is OK to have multiple jobs. DON'T put too much into each of them; it is better to make another pass over the data. Transform, then query and calculate. Creating data sets for your data lets others also interrogate the data
- To join smaller data sets, use -file and open it in the script (see the sketch after this list)
- http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
- For Ruby streaming check out the podcast: http://allthingshadoop.com/2010/05/20/ruby-streaming-wukong-hadoop-flip-kromer-infochimps/
- Sample code for this talk: https://github.com/joestein/amaunet
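For that "-file and open it in the script" tip, a sketch of what the mapper side of a small-table join could look like: ship countries.dat with -file (the streaming command above already ships the scripts this way), and Hadoop places it in each task's working directory, so a plain relative open works on every node. The field layout matches the sample data; the script name is hypothetical.

#!/usr/bin/env python
# Sketch: sideFileMapper.py (hypothetical name)
# Run with something like:
#   ... -file ./countries.dat -file ./sideFileMapper.py -mapper ./sideFileMapper.py
# Loads the small lookup once, then streams only customers.dat as job input.
import sys

countries = {}                         # 2-digit code -> country name
for raw in open("countries.dat"):      # shipped with -file, found in the task dir
    line = raw.strip()
    if line:
        name, code = line.split("|")
        countries[code] = name

for raw in sys.stdin:
    line = raw.strip()
    if not line:
        continue
    name, ctype, code = line.split("|")
    country = countries.get(code, "%s - Unknown Country" % code)
    print("%s\t%s" % (country, ctype))  # the reducer then just counts per key

This trades the mapper-side defaults and the sort trick for a simple in-memory lookup, which only makes sense while the lookup side stays small.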






connect@medialets.com
www.medialets.com/showcase

Medialets
The rich media ad platform for mobile.





We are hiring!



/*
    Joe Stein, Chief Architect
    http://www.medialets.com
    Twitter: @allthingshadoop
*/