
Large Scale Machine Learning based on MapReduce & GPU

Lanbo Zhang

Motivation


Massive data challenges

More and more data to process (Google: 20,000 terabytes per day)

Data arrives faster and faster

Solutions

Invent faster ML algorithms: online algorithms

Stochastic gradient descent vs. batch gradient descent (the update rules are sketched after this list)

Parallelize learning processes: MapReduce, GPU, etc.
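
For contrast, the two update rules look like this (standard textbook forms; the notation is mine, not taken from the slides). With m training examples and per-example loss \ell(\theta; x_i, y_i):

\theta \leftarrow \theta - \alpha \sum_{i=1}^{m} \nabla_{\theta}\, \ell(\theta; x_i, y_i)    (batch gradient descent: one update touches all m examples)

\theta \leftarrow \theta - \alpha \, \nabla_{\theta}\, \ell(\theta; x_i, y_i)    (stochastic gradient descent: one cheap update per example, suited to data that arrives as a stream)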

MapReduce


A programming model invented by Google


Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), pp. 10-10, December 06-08, 2004, San Francisco, CA.


The objective

To support distributed computing on large data sets on clusters of computers

Features

Automatic parallelization and distribution

Fault-tolerance

I/O scheduling

Status and monitoring


User Interface


Users need to implement two functions:

map(in_key, in_value) -> list(out_key, intermediate_value)

reduce(out_key, list(intermediate_value)) -> list(out_value)


Example: Count word occurrences

Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

MapReduce Usage Statistics in Google

MapReduce for Machine Learning

C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," in NIPS 2006.


Algorithms that can be expressed in summation form could be parallelized in the MapReduce framework (a worked example follows this list):

Locally weighted linear regression (LWLR)

Logistic regression (LR): Newton-Raphson method

Naive Bayes (NB)

PCA

Linear SVM

EM: mixture of Gaussians (M-step)
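
To illustrate the summation form with the first algorithm on this list, LWLR reduces to solving weighted normal equations (the notation below is mine, not taken from the slides):

A = \sum_{i=1}^{m} w_i \, x_i x_i^{\top}, \qquad b = \sum_{i=1}^{m} w_i \, y_i \, x_i, \qquad \theta = A^{-1} b

Each mapper computes the partial sums of A and b over its split of the data; a reducer adds the partial sums and solves the small linear system, so the data-dependent work parallelizes cleanly.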

Time complexity

P: # of cores

P': speedup of matrix inversion and eigen-decomposition on multicore
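
As a rough illustration of how P and P' enter (my own reading, not the paper's exact table): for the linear-regression-style algorithms above, accumulating the m summands of an n-by-n matrix costs about O(mn^2) and splits across the P cores, while the final n-by-n inversion costs O(n^3) and only benefits from the multicore speedup P', giving on the order of

O\!\left( \frac{m n^{2}}{P} + \frac{n^{3}}{P'} \right)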

Experimental Results

Speedup from 1 to 16 cores over all datasets

Apache Hadoop

http://hadoop.apache.org/

An open-source implementation of MapReduce

An excellent tutorial

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
(with the famous WordCount example)

Very helpful if you need to quickly develop a simple Hadoop program

A comprehensive book

Tom White. Hadoop: The Definitive Guide. O'Reilly Media, May 2009.
http://oreilly.com/catalog/9780596521981

Topics: Hadoop distributed file system, Hadoop I/O, how to set up a Hadoop cluster, how to develop a Hadoop application, administration, etc.

Helpful if you want to become a Hadoop expert

Key User Interfaces of Hadoop

Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Implement the map function to define your map routines

Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Implement the reduce function to define your reduce routines

Class JobConf

The primary interface for configuring job parameters, which include but are not limited to:

Input and output paths (Hadoop Distributed File System)

Number of Mappers and Reducers

Job name
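
To make these interfaces concrete, here is a minimal WordCount sketch. The slides mix the class-style Mapper/Reducer generics with JobConf, so this sketch sticks to the older org.apache.hadoop.mapred API throughout, where JobConf drives job configuration; the class names, paths, and task counts are illustrative assumptions, not something the slides prescribe.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: (byte offset, line of text) -> (word, 1)
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        output.collect(word, ONE);
      }
    }
  }

  // Reducer: (word, list of counts) -> (word, total count)
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);   // primary job-configuration interface
    conf.setJobName("wordcount");                  // job name
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(2);                     // number of reducers
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS input path
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // HDFS output path
    JobClient.runJob(conf);
  }
}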




Apache Mahout


http://lucene.apache.org/mahout/

A library of parallelized machine learning algorithms implemented on top of Hadoop

Applications

Clustering

Classification

Batch-based collaborative filtering

Frequent itemset mining




Mahout in progress


Algorithms already implemented

K-Means, Fuzzy K-Means, Naive Bayes, Canopy clustering, Mean Shift, Dirichlet process clustering, Latent Dirichlet Allocation, Random Forests, etc.

Algorithms to be implemented

Stochastic gradient descent, SVM, NN, PCA, ICA, GDA, EM, etc.

GPU for Large-Scale ML

A Graphics Processing Unit (GPU) is a specialized processor that offloads 3D or 2D graphics rendering from the CPU

GPUs' highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms

NVIDIA GeForce 8800 GTX Specification

Number of Streaming Multiprocessors: 16
Multiprocessor Width: 8
Local Store Size: 16 KB
Total Number of Stream Processors: 128
Peak SP Floating Point Rate: 346 Gflops
Clock: 1.35 GHz
Device Memory: 768 MB
Peak Memory Bandwidth: 86.4 GB/s
Connection to Host CPU: PCI Express
CPU -> GPU bandwidth: 2.2 GB/s*
GPU -> CPU bandwidth: 1.7 GB/s*

Logical Organization to Programmers


Each block can have up to 512 threads that synchronize

Millions of blocks can be issued

Programming Environment: CUDA


Compute Unified Device Architecture (CUDA)

A parallel computing architecture developed by NVIDIA

The computing engine in GPUs is accessible to software developers through an industry-standard programming language

SVM on GPUs


Catanzaro, B., Sundaram, N., and Keutzer, K. 2008. Fast Support Vector Machine Training and Classification on Graphics Processors. In Proceedings of the 25th International Conference on Machine Learning (Helsinki, Finland, July 05-09, 2008). ICML '08, vol. 307.

SVM Training


Quadratic Program (the dual problem; a standard statement is sketched below)

The SMO algorithm
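
For reference, the quadratic program is the standard SVM dual (textbook form; C, K, and the indexing are my notation, not taken from the slides):

\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)

\text{subject to} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0

SMO makes progress by repeatedly optimizing this objective over two of the alpha_i at a time while keeping the constraints satisfied.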


SVM Training on GPU


Each thread computes the following variable for each point:
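
The expression itself did not survive extraction; in SMO-based training (as in the Catanzaro et al. paper cited above), the natural per-point quantity is the optimality indicator, so a plausible reconstruction, in my notation, is

f_i \;=\; \sum_{j=1}^{m} \alpha_j \, y_j \, K(x_j, x_i) \;-\; y_i

which each thread can recompute (or incrementally update) for its point after every SMO step.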

Result: SVM training on GPU

(Speedup over LibSVM; bar chart over the Adult, Faces, Forest, Mnist, Usps, and Web datasets; figure not reproduced)
SVM Classification


The SVM classification task involves finding which side of the hyperplane a point lies on

Each thread evaluates the kernel function for a point
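
Concretely, classification reduces to evaluating the standard decision function (notation mine):

f(z) \;=\; \sum_{i \in \mathrm{SV}} \alpha_i \, y_i \, K(x_i, z) + b, \qquad \hat{y} = \operatorname{sign}\big(f(z)\big)

so the kernel evaluations K(x_i, z), which dominate the cost, can be spread across GPU threads, one point (or one support vector) per thread.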

Result: SVM classification on GPU (Speedup over LibSVM)

GPUMiner: Parallel Data Mining on Graphics Processors

Wenbin Fang, et al. Parallel Data Mining on Graphics Processors. Technical Report HKUST-CS08-07, Oct 2008.

Three components

The CPU-based storage and buffer manager to handle I/O and data transfer between CPU and GPU

The GPU-CPU co-processing parallel mining module

The GPU-based mining visualization module

Two mining algorithms implemented

K-Means clustering

Apriori (frequent pattern mining algorithm)


GPUMiner: System Architecture

The bitmap technique


Use a bitmap to represent the association between data objects and clusters (for K-means), and the association between items and transactions (for Apriori)

Supports efficient row-wise and column-wise operations exploiting the thread parallelism on the GPU

Use a summary vector to store the number of ones, to accelerate counting the number of ones in a row/column
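
A minimal CPU-side sketch of the idea (GPUMiner runs these operations as GPU kernels; the class and method names here are illustrative assumptions, not the library's API): each row is packed into 64-bit words, and a summary vector caches the number of ones per row.

public class Bitmap {
  private final long[][] rows;   // each row packed into 64-bit words
  private final int[] summary;   // summary vector: number of ones per row

  public Bitmap(int numRows, int numCols) {
    this.rows = new long[numRows][(numCols + 63) / 64];
    this.summary = new int[numRows];
  }

  // Set bit (row, col), e.g. "data object col belongs to cluster row".
  public void set(int row, int col) {
    long mask = 1L << (col & 63);
    if ((rows[row][col >> 6] & mask) == 0) {
      rows[row][col >> 6] |= mask;
      summary[row]++;            // keep the summary vector in sync
    }
  }

  // Row-wise count of ones is O(1) thanks to the summary vector.
  public int countOnes(int row) {
    return summary[row];
  }

  // 1-bit counting over the intersection of two rows,
  // e.g. the support of a 2-itemset in Apriori.
  public int countIntersection(int rowA, int rowB) {
    int count = 0;
    for (int w = 0; w < rows[rowA].length; w++) {
      count += Long.bitCount(rows[rowA][w] & rows[rowB][w]);
    }
    return count;
  }
}

The countIntersection routine is the same 1-bit counting and intersection pattern that the Apriori slides below rely on when computing itemset supports.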

K-means

Three functions executed on GPU in parallel

makeBitmap_kernel

computeCentriod_kernel

findCluster_kernel
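
A conceptual, sequential sketch of what the findCluster step does (in GPUMiner each GPU thread would handle one point; the method below is an illustrative assumption, not the actual kernel code):

  // Assign each point to its nearest centroid (squared Euclidean distance).
  // GPUMiner would then record these point-cluster assignments in the bitmap.
  static int[] findClusters(double[][] points, double[][] centroids) {
    int[] assignment = new int[points.length];
    for (int p = 0; p < points.length; p++) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double dist = 0.0;
        for (int d = 0; d < points[p].length; d++) {
          double diff = points[p][d] - centroids[c][d];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      assignment[p] = best;
    }
    return assignment;
  }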

Apriori


To find those frequent itemsets among a large number of transactions

The trie-based implementation

Uses a trie to store candidates and their supports

Uses a bitmap to store the Item-Transaction matrix

Obtains the item supports by counting 1s in the bitmap

The 1-bit counting and intersection operations are implemented as GPU programs

Experiments


Settings


GPU: NVIDIA GTX280, 30*8 processors


CPU: Intel Core2 Quad Core

Result: K-means

Baseline: Uvirginia

35x faster than the four-threaded CPU-based counterpart

Result: Apriori

Baselines

CPU-based Apriori

Best implementation of FIMI'03

Conclusion


Both MapReduce and GPU are feasible strategies for large-scale parallelized machine learning

MapReduce aims at parallelization over computer clusters

The hardware architecture of GPUs makes them a natural choice for parallelized ML