Machine Learning with Apache Hama
Tommaso Teofili
tommaso [at] apache [dot] org
About me
ASF member having fun with:
Lucene / Solr
Hama
UIMA
Stanbol
… some others
SW engineer @ Adobe R&D
Agenda
Apache Hama and BSP
Why machine learning on BSP
Some examples
Benchmarks
Apache Hama
Bulk Synchronous Parallel computing
framework on top of HDFS for massive
scientific computations
TLP since May 2012
0.6.0 release out soon
Growing community
BSP supersteps
A BSP algorithm is composed of a sequence of “supersteps”
BSP supersteps
Each task
Superstep 1
Do some computation
Communicate with other tasks
Synchronize
Superstep 2
Do some computation
Communicate with other tasks
Synchronize
…
…
…
Superstep N
Do some computation
Communicate with other tasks
Synchronize
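The compute / communicate / synchronize cycle above can be sketched in plain Java without any Hama dependency. This is only a toy illustration: threads stand in for tasks, in-memory queues stand in for message transfer, and a CyclicBarrier stands in for the synchronization step. The class name BspSimulation and its run method are hypothetical and are not part of Apache Hama.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

/** Toy in-process BSP run (hypothetical helper, not Hama code). */
class BspSimulation {

    /** Each task sums its own partition, broadcasts the partial sum, then aggregates. */
    static long[] run(int[][] partitions) {
        int numTasks = partitions.length;
        // One inbox per task; messages sent before the barrier are read after it.
        List<ConcurrentLinkedQueue<Long>> inboxes = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            inboxes.add(new ConcurrentLinkedQueue<>());
        }
        CyclicBarrier barrier = new CyclicBarrier(numTasks);
        long[] results = new long[numTasks];
        Thread[] tasks = new Thread[numTasks];

        for (int t = 0; t < numTasks; t++) {
            final int id = t;
            tasks[t] = new Thread(() -> {
                try {
                    // Superstep 1: do some computation on local data.
                    long partial = 0;
                    for (int v : partitions[id]) {
                        partial += v;
                    }
                    // Superstep 1: communicate (broadcast to every task, itself included).
                    for (ConcurrentLinkedQueue<Long> inbox : inboxes) {
                        inbox.add(partial);
                    }
                    // Superstep 1: synchronize.
                    barrier.await();

                    // Superstep 2: aggregate the messages sent in superstep 1.
                    long total = 0;
                    for (Long msg : inboxes.get(id)) {
                        total += msg;
                    }
                    results[id] = total;
                    barrier.await();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            tasks[t].start();
        }
        try {
            for (Thread task : tasks) {
                task.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        }
        return results;
    }

    public static void main(String[] args) {
        // Three tasks holding {1,2,3}, {4,5} and {6}: every task ends with the total 21.
        System.out.println(Arrays.toString(run(new int[][] {{1, 2, 3}, {4, 5}, {6}})));
    }
}
```

Running main prints [21, 21, 21]: after one barrier every task has seen every partial sum, which is the property iterative algorithms rely on.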
Why BSP
Simple programming model
Superstep semantics are easy
Preserves data locality
Improves performance
Well suited for iterative algorithms
Apache Hama architecture
BSP Program execution flow
Apache Hama architecture
Apache Hama features
BSP API
M/R like I/O API
Graph API
Job management / monitoring
Checkpoint recovery
Local & (Pseudo) Distributed run modes
Pluggable message transfer architecture
YARN supported
Running in Apache Whirr
Apache Hama BSP API
public abstract class BSP<K1, V1, K2, V2, M extends Writable> …
K1, V1 are the key and value types for inputs
K2, V2 are the key and value types for outputs
M is the type of messages used for task communication
Apache Hama BSP API
public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
public void setup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
public void cleanup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
Machine learning on BSP
Lots (most?) of ML algorithms are
inherently iterative
Hama ML module currently includes
Collaborative filtering
Clustering
Gradient descent
Benchmarking architecture
[Diagram components: HDFS, Solr, Lucene, DBMS, Hama, Mahout, and four Nodes]
Collaborative filtering
Given user preferences on movies
We want to find users “near” to some specific user
So that the user can “follow” them
And/or see what they like (which he/she could like too)
Collaborative filtering BSP
Given a specific user
Iteratively (for each task)
Superstep 1*i
Read a new user preference row
Find how near that user is to the current user
That is, find how near their preferences are
Since preferences are given as vectors, we may use vector distance measures such as Euclidean or cosine distance
Broadcast the measure output to other peers
Superstep 2*i
Aggregate measure outputs
Update most relevant users
Still to be committed (HAMA-612)
Collaborative filtering BSP
Given user ratings about movies
"john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8
"paula" -> 7, 3, 8, 2, 8.5, 0, 0
"jim" -> 4, 5, 0, 5, 8, 0, 1.5
"tom" -> 9, 4, 9, 1, 5, 0, 8
"timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0
We ask for 2 nearest users to “paula” and we get “timothy” and “tom”
user recommendation
We can extract movies highly rated by “timothy” and “tom” that “paula” didn’t see
Item recommendation
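The user recommendation result above can be reproduced on a single machine with a few lines of Java. This is a minimal sketch, not the HAMA-612 implementation: it assumes plain Euclidean distance (one of the measures the slides mention), treats a 0 as an ordinary rating value, and uses hypothetical names (NearestUsers, nearest, sampleRatings).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Single-process nearest-user lookup (hypothetical helper, not Hama code). */
class NearestUsers {

    /** Euclidean distance between two rating rows of equal length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the k users whose rating rows are closest to the target user's row. */
    static List<String> nearest(Map<String, double[]> ratings, String target, int k) {
        double[] targetRow = ratings.get(target);
        List<String> others = new ArrayList<>();
        for (String user : ratings.keySet()) {
            if (!user.equals(target)) {
                others.add(user);
            }
        }
        // Sort the remaining users by ascending distance to the target.
        others.sort(Comparator.comparingDouble((String u) -> euclidean(ratings.get(u), targetRow)));
        return new ArrayList<>(others.subList(0, Math.min(k, others.size())));
    }

    /** The five rating rows shown on the slide. */
    static Map<String, double[]> sampleRatings() {
        Map<String, double[]> ratings = new LinkedHashMap<>();
        ratings.put("john", new double[] {0, 0, 0, 9.5, 4.5, 9.5, 8});
        ratings.put("paula", new double[] {7, 3, 8, 2, 8.5, 0, 0});
        ratings.put("jim", new double[] {4, 5, 0, 5, 8, 0, 1.5});
        ratings.put("tom", new double[] {9, 4, 9, 1, 5, 0, 8});
        ratings.put("timothy", new double[] {7, 3, 5.5, 0, 9.5, 6.5, 0});
        return ratings;
    }

    public static void main(String[] args) {
        System.out.println(nearest(sampleRatings(), "paula", 2));
    }
}
```

With these rows the squared distances to “paula” are 53.5 for “timothy”, 83.25 for “tom”, 88.5 for “jim” and 348.5 for “john”, so main prints [timothy, tom]. In the BSP version each task would compute these distances for its own rows and broadcast them for aggregation.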
Benchmarks
Fairly simple algorithm
Highly iterative
Compared to Apache Mahout
Behaves better than ALS-WR
Behaves similarly to RecommenderJob and ItemSimilarityJob
K-Means clustering
We have a bunch of data (e.g. documents)
We want to group those docs in k
homogeneous clusters
Iteratively for each cluster
Calculate new cluster center
Add doc nearest to new center to the cluster
K-Means clustering
K-Means clustering BSP
Iteratively
Superstep 1*i
Assignment phase
Read vector splits
Sum up temporary centers with assigned vectors
Broadcast sum and ingested vectors count
Superstep 2*i
Update phase
Calculate the total sum over all received messages and average
Replace old centers with new centers and check for convergence
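The assignment and update phases can be sketched as two functions, with a plain loop standing in for the tasks and their message exchange. This is an illustrative single-process sketch, not Hama's K-Means code: the names (KMeansBspSketch, Partial, assign, update, cluster) are hypothetical, the distance is squared Euclidean, and convergence is checked by exact equality of centers, where a real implementation would more likely use a threshold.

```java
import java.util.Arrays;

/** Single-process sketch of the two K-Means supersteps (hypothetical, not Hama code). */
class KMeansBspSketch {

    /** What one task would broadcast: per-center vector sums and vector counts. */
    static final class Partial {
        final double[][] sums;
        final int[] counts;

        Partial(int k, int dim) {
            sums = new double[k][dim];
            counts = new int[k];
        }
    }

    /** Index of the center closest to v (squared Euclidean distance). */
    static int nearestCenter(double[] v, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int i = 0; i < v.length; i++) {
                double diff = v[i] - centers[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Assignment phase for one split: sum the vectors assigned to each center. */
    static Partial assign(double[][] split, double[][] centers) {
        Partial p = new Partial(centers.length, centers[0].length);
        for (double[] v : split) {
            int c = nearestCenter(v, centers);
            for (int i = 0; i < v.length; i++) {
                p.sums[c][i] += v[i];
            }
            p.counts[c]++;
        }
        return p;
    }

    /** Update phase: total the received sums and counts, then average. */
    static double[][] update(Partial[] messages, double[][] oldCenters) {
        int k = oldCenters.length;
        int dim = oldCenters[0].length;
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++) {
            int count = 0;
            for (Partial m : messages) {
                count += m.counts[c];
                for (int i = 0; i < dim; i++) {
                    next[c][i] += m.sums[c][i];
                }
            }
            if (count == 0) {
                next[c] = oldCenters[c].clone(); // empty cluster: keep the old center
            } else {
                for (int i = 0; i < dim; i++) {
                    next[c][i] /= count;
                }
            }
        }
        return next;
    }

    /** Repeats both phases until the centers stop changing or maxIterations is reached. */
    static double[][] cluster(double[][][] splits, double[][] initial, int maxIterations) {
        double[][] centers = initial;
        for (int iter = 0; iter < maxIterations; iter++) {
            Partial[] messages = new Partial[splits.length];
            for (int t = 0; t < splits.length; t++) {
                messages[t] = assign(splits[t], centers);
            }
            double[][] next = update(messages, centers);
            if (Arrays.deepEquals(next, centers)) {
                break; // converged
            }
            centers = next;
        }
        return centers;
    }
}
```

Only the per-center sums and counts cross task boundaries, never the vectors themselves, which is why the message volume stays small even for large splits.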
Benchmarks
One rack (16 nodes, 256 cores) cluster
10G network
On average faster than Mahout’s implementation
Gradient descent
Optimization algorithm
Find a (local) minimum of some function
Used for
solving linear systems
solving nonlinear systems
in machine learning tasks
linear regression
logistic regression
neural network backpropagation
…
Gradient descent
Minimize a given (cost) function
Give the function a starting point (set of parameters)
Iteratively change parameters in order to minimize the
function
Stop at the (local) minimum
There’s some math but intuitively:
evaluate derivatives at a given point in order to choose
where to “go” next
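The intuition above is usually written as the following update rule. The symbols are assumptions, since the slides do not name them: \theta_j is the j-th parameter, J is the cost function and \alpha is a small positive step size (learning rate).

```latex
\theta_j \leftarrow \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}
\qquad \text{for every parameter } j, \text{ repeated until } J(\theta) \text{ stops decreasing}
```

The derivative gives the direction of steepest increase at the current point, so stepping against it moves towards a (local) minimum.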
Gradient descent BSP
Iteratively
Superstep 1*i
each task calculates and broadcasts portions of the
cost function with the current parameters
Superstep 2*i
aggregate and update cost function
check the aggregated cost and iterations count
cost should always decrease
Superstep 3*i
each task calculates and broadcasts portions of
(partial) derivatives
Superstep 4*i
aggregate and update parameters
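The four supersteps can be sketched for one-variable linear regression as follows. This is a single-process illustration under stated assumptions, not Hama's gradient descent code: the names (GradientDescentBspSketch, partialCost, partialGradient, fit) are hypothetical, the cost is the usual halved mean squared error, and a loop over splits stands in for the tasks and their broadcasts.

```java
/** Single-process sketch of the gradient descent supersteps (hypothetical, not Hama code). */
class GradientDescentBspSketch {

    /** Linear hypothesis h(x) = theta0 + theta1 * x. */
    static double predict(double[] theta, double x) {
        return theta[0] + theta[1] * x;
    }

    /** Superstep 1: one task's portion of the squared-error cost; rows are {x, y}. */
    static double partialCost(double[][] split, double[] theta) {
        double sum = 0;
        for (double[] row : split) {
            double err = predict(theta, row[0]) - row[1];
            sum += err * err;
        }
        return sum;
    }

    /** Superstep 3: one task's portion of the partial derivatives. */
    static double[] partialGradient(double[][] split, double[] theta) {
        double[] g = new double[2];
        for (double[] row : split) {
            double err = predict(theta, row[0]) - row[1];
            g[0] += err;          // derivative with respect to theta0
            g[1] += err * row[0]; // derivative with respect to theta1
        }
        return g;
    }

    /** Runs the four supersteps until the cost improvement drops below tolerance. */
    static double[] fit(double[][][] splits, double alpha, int maxIterations, double tolerance) {
        int m = 0;
        for (double[][] split : splits) {
            m += split.length;
        }
        double[] theta = {0.0, 0.0};
        double previousCost = Double.MAX_VALUE;
        for (int iter = 0; iter < maxIterations; iter++) {
            // Supersteps 1 and 2: aggregate the partial costs and check them.
            double cost = 0;
            for (double[][] split : splits) {
                cost += partialCost(split, theta);
            }
            cost /= 2.0 * m;
            if (cost > previousCost) {
                // The cost should always decrease; an increase suggests alpha is too large.
                throw new IllegalStateException("cost increased, try a smaller alpha");
            }
            if (previousCost - cost < tolerance) {
                break;
            }
            previousCost = cost;
            // Supersteps 3 and 4: aggregate the partial derivatives and update parameters.
            double[] grad = new double[2];
            for (double[][] split : splits) {
                double[] g = partialGradient(split, theta);
                grad[0] += g[0];
                grad[1] += g[1];
            }
            theta[0] -= alpha * grad[0] / m;
            theta[1] -= alpha * grad[1] / m;
        }
        return theta;
    }
}
```

For example, fitting the points (0, 1), (1, 3), (2, 5), (3, 7) spread over two splits with alpha = 0.1 converges towards theta0 = 1 and theta1 = 2, the line y = 2x + 1 that generated them.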
Gradient descent BSP
Simplistic example
Linear regression
Given real estate market dataset
Estimate new houses’ prices given known houses’ sizes, geographic regions and prices
Expected output: actual parameters for the
(linear) prediction function
Gradient descent BSP
Generate a different model for each region
House item vectors
price -> size
150k -> 80
2 dimensional space
~1.3M vectors dataset
Gradient descent BSP
Dataset and model fit
[Plot: dataset points and fitted model; axis ticks run from 0 to 800000 and from 0 to 5000]
Gradient descent BSP
Cost checking
Gradient descent BSP
Classification
Logistic regression with gradient descent
Real estate market dataset
We want to find which estate listings belong to agencies
To avoid buying from them
Same algorithm
With different cost function and features
Existing items are tagged or not as “belonging to agency”
Create vectors from items’ text
Sample vector
1 -> 1 3 0 0 5 3 4 1
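"Same algorithm with a different cost function" can be made concrete with the standard logistic regression cost. The sketch below is an assumption about what that cost looks like (sigmoid hypothesis, cross-entropy cost, an intercept stored in theta[0]); the slides do not show the actual Hama code, and the names LogisticCostSketch, hypothesis and cost are hypothetical.

```java
/** Logistic hypothesis and cost (hypothetical sketch, not Hama code). */
class LogisticCostSketch {

    /** Squashes any real number into the interval (0, 1). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Estimated probability that x is labelled 1; theta[0] is the intercept. */
    static double hypothesis(double[] theta, double[] x) {
        double z = theta[0];
        for (int i = 0; i < x.length; i++) {
            z += theta[i + 1] * x[i];
        }
        return sigmoid(z);
    }

    /** Cross-entropy cost averaged over all examples; labels are 0 or 1. */
    static double cost(double[] theta, double[][] xs, int[] labels) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            double h = hypothesis(theta, xs[i]);
            sum += labels[i] == 1 ? -Math.log(h) : -Math.log(1.0 - h);
        }
        return sum / xs.length;
    }

    public static void main(String[] args) {
        // The sample vector from the slide: label 1, eight text-derived features.
        double[][] xs = {{1, 3, 0, 0, 5, 3, 4, 1}};
        int[] labels = {1};
        double[] theta = new double[9]; // all zeros: intercept plus eight weights
        System.out.println(cost(theta, xs, labels));
    }
}
```

With all parameters at zero the hypothesis is 0.5 for every item, so the cost starts at ln 2 (about 0.693) and gradient descent then lowers it in the same superstep loop used for linear regression.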
Gradient descent BSP
Classification
Benchmarks
Not directly comparable to Mahout’s
regression algorithms
Both SGD and CGD are inherently better than
plain GD
But Hama GD had on average the same performance as Mahout’s SGD / CGD
Next step is implementing SGD / CGD on top of
Hama
Wrap up
Even if the ML module is still “young” / work in progress
and tools like Apache Mahout have better
“coverage”
Apache Hama can be particularly useful in
certain “highly iterative” use cases
Interesting benchmarks
Thanks!
