Map-Reduce and Parallel Computing for Large-Scale Media Processing

Youjie Zhou

Outline

- Motivations
- Map-Reduce Framework
- Large-scale Multimedia Processing Parallelization
- Machine Learning Algorithm Transformation
- Map-Reduce Drawbacks and Variants
- Conclusions

Motivations

- Why do we need parallelization?
  - "Time is money"
    - work on many pieces simultaneously
    - divide-and-conquer
  - Data is too huge to handle
    - 1 trillion (10^12) unique URLs on the web in 2008
  - CPU speed limitation

Motivations

- Why do we need parallelization?
  - Increasing data
    - social networks
  - Scalability!
  - "Brute force"
    - no approximations
  - Cheap clusters vs. expensive computers

Motivations

- Why do we choose Map-Reduce?
  - Popular
    - a parallelization framework proposed by Google, which Google uses every day
    - Yahoo and Amazon are also involved
  - Popular = Good?
    - "hides" parallelization details from users
    - provides high-level operations that suit the majority of algorithms
    - a good starting point for deeper parallelization research

Map-Reduce Framework

- A simple idea inspired by functional languages (like LISP); see the small Python illustration below
  - map: a type of iteration in which a function is successively applied to each element of one sequence
  - reduce: a function that combines all the elements of a sequence using a binary operation
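A minimal Python illustration of these two functional primitives, using the built-in map and functools.reduce:

from functools import reduce

numbers = [1, 2, 3, 4]

# map: apply a function to each element of a sequence
squares = list(map(lambda x: x * x, numbers))   # [1, 4, 9, 16]

# reduce: combine all elements with a binary operation
total = reduce(lambda a, b: a + b, squares)     # 30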

Map-Reduce Framework

- Data representation: <key,value>
  - map generates <key,value> pairs
  - reduce combines the <key,value> pairs that share the same key
- "Hello, world!" example

Map-Reduce Framework

[Diagram: the input data is partitioned into splits (split0, split1, split2); each split is processed by a map task; the intermediate pairs are shuffled to reduce tasks, which write the final output.]

Map-Reduce Framework

- Count the appearances of each different word in a set of documents (a runnable Python version follows)

void map(Document)
    for each word in Document
        generate <word,1>

void reduce(word, CountList)
    int count = 0
    for each number in CountList
        count += number
    generate <word,count>
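A runnable sketch of the same word count in plain Python, simulating the framework's shuffle step with a dictionary (function names are illustrative, not any particular framework's API):

from collections import defaultdict

def map_doc(document):
    # map: emit a <word, 1> pair for every word in the document
    for word in document.split():
        yield (word, 1)

def reduce_word(word, counts):
    # reduce: sum all the counts emitted for one word
    return (word, sum(counts))

def run_word_count(documents):
    # shuffle: group the intermediate pairs by key, as the framework would
    groups = defaultdict(list)
    for doc in documents:
        for word, one in map_doc(doc):
            groups[word].append(one)
    return [reduce_word(word, counts) for word, counts in groups.items()]

print(run_word_count(["hello world", "hello map reduce"]))
# [('hello', 2), ('world', 1), ('map', 1), ('reduce', 1)]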

Map-Reduce Framework

- Different implementations
  - Distributed computing
    - each computer acts as a computing node
    - focuses on reliability over distributed computer networks
    - Google's clusters
      - closed source
      - GFS: Google's distributed file system
    - Hadoop
      - open source
      - HDFS: the Hadoop Distributed File System

Map-Reduce Framework

- Different implementations
  - Multi-core computing
    - each core acts as a computing node
    - focuses on high-speed computing using large shared memories
    - Phoenix++
      - a two-dimensional <key,value> table stored in memory, where map and reduce read and write pairs
      - open source, created at Stanford
    - GPU
      - 10x higher memory bandwidth than a CPU
      - 5x to 32x speedups on SVM training

Large-scale Multimedia Processing Parallelization

- Clustering
  - k-means
  - Spectral Clustering
- Classifier training
  - SVM
- Feature extraction and indexing
  - Bag-of-Features
  - Text Inverted Indexing

Clustering

- k-means
  - basic and fundamental
  - original algorithm:
    1. Pick k initial center points
    2. Iterate until convergence:
       1. Assign each point to the nearest center
       2. Calculate the new centers
  - easy to parallelize!

Clustering

- k-means
  - a shared file contains the center points
  - map
    1. for each point, find the nearest center
    2. generate a <key,value> pair
       - key: the center id
       - value: the current point's coordinates
  - reduce
    1. collect all points belonging to the same cluster (they share the same key)
    2. calculate their average: the new center
  - iterate (a single-machine sketch of one iteration follows)
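A minimal sketch of one such iteration, assuming NumPy arrays for the points and centers (the shared center file is just an in-memory array here):

import numpy as np
from collections import defaultdict

def kmeans_map(points, centers):
    # map: assign each point to its nearest center, emit <center_id, point>
    for p in points:
        cid = int(np.argmin([np.linalg.norm(p - c) for c in centers]))
        yield (cid, p)

def kmeans_reduce(members):
    # reduce: average the points of one cluster to get its new center
    return np.mean(members, axis=0)

def kmeans_iteration(points, centers):
    groups = defaultdict(list)            # shuffle: group points by center id
    for cid, p in kmeans_map(points, centers):
        groups[cid].append(p)
    new_centers = centers.copy()
    for cid, members in groups.items():
        new_centers[cid] = kmeans_reduce(members)
    return new_centers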

Clustering

- Spectral Clustering
  - builds on a pairwise similarity matrix S
  - S is huge: for 10^6 points, storing doubles needs 8TB
  - sparsify it!
    - retain only S_ij where j is among the t nearest neighbors of i
  - Locality Sensitive Hashing?
    - it's an approximation
  - we can calculate the neighbors directly, in parallel

Clustering

- Spectral Clustering
  - calculate the distance matrix (see the sketch below)
    - map
      - creates <key,value> pairs so that every n/p points share the same key
      - p is the number of nodes in the computer cluster
    - reduce
      - collects the points with the same key, so the data is split into p parts and each part is stored on one node
    - on each node, find the t nearest neighbors of each local point against the whole data set
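A rough single-machine sketch of the partitioning and the per-node neighbor search, using brute-force distances (the function names and the exact division of work are illustrative):

import numpy as np

def partition_map(points, p):
    # map: give every n/p consecutive points the same key (a partition id)
    size = (len(points) + p - 1) // p
    for i, x in enumerate(points):
        yield (i // size, (i, x))

def knn_on_node(local_part, whole_data, t):
    # after reduce has placed one part on this node: find the t nearest
    # neighbors of each local point against the whole data set
    neighbors = {}
    for i, x in local_part:
        dists = np.linalg.norm(whole_data - x, axis=1)
        dists[i] = np.inf                 # exclude the point itself
        neighbors[i] = np.argsort(dists)[:t]
    return neighbors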

Clustering

- Spectral Clustering
  - Symmetry
    - x_j being in the t-nearest-neighbor set of x_i does not imply that x_i is in the t-nearest-neighbor set of x_j
  - map
    - for each nonzero element, generates two <key,value> pairs
      - first: key is the row ID; value is the column ID and the distance
      - second: key is the column ID; value is the row ID and the distance
  - reduce (sketched below)
    - uses the key as the row ID and fills in the columns specified by the column IDs in the values
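A small sketch of this symmetrization pass (pairs are plain tuples, and the shuffle is a dictionary):

from collections import defaultdict

def symmetry_map(nonzeros):
    # map: for each nonzero (row, col, dist), emit the entry both ways
    for row, col, dist in nonzeros:
        yield (row, (col, dist))   # key is the row ID
        yield (col, (row, dist))   # key is the column ID

def symmetrize(nonzeros):
    grouped = defaultdict(list)    # shuffle: group values by key
    for key, value in symmetry_map(nonzeros):
        grouped[key].append(value)
    # reduce: the key is the row ID; fill in the columns listed in the values
    return {row: {col: dist for col, dist in entries}
            for row, entries in grouped.items()}

# e.g. symmetrize([(0, 1, 0.5)]) -> {0: {1: 0.5}, 1: {0: 0.5}}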

Classification

- SVM

Classification

- SVM
  - SMO: Sequential Minimal Optimization
    - instead of solving for all the alphas together, use coordinate ascent:
      - pick one alpha, fix the others
      - optimize alpha_i

Classification

- SVM
  - SMO
    - but for the SVM we cannot optimize only one alpha at a time: the dual's equality constraint couples the alphas (see below)
    - so we need to optimize two alphas in each iteration
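For reference, the standard soft-margin SVM dual that SMO optimizes; the equality constraint at the end is what forces updating a pair of alphas jointly:

\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0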

Classification

- SVM
  - repeat until convergence:
    - map
      - given the two chosen alphas, update the optimization information
    - reduce
      - find the two maximally violating alphas (sketched below)
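A highly simplified sketch of the selection step: each map task scores its alphas and reports local candidates, and reduce picks the global pair. Here kkt_violation is a hypothetical helper standing in for whatever "optimization information" the maps maintain; a real implementation selects the pair under the KKT conditions more carefully.

def smo_map(local_indices, kkt_violation):
    # map: on one data split, score each alpha's KKT violation
    # (kkt_violation is a hypothetical helper: index -> violation score)
    scored = sorted(((kkt_violation(i), i) for i in local_indices), reverse=True)
    return scored[:2]                  # the two locally most-violating alphas

def smo_reduce(candidate_lists):
    # reduce: merge the local candidates and pick the two
    # maximally violating alphas overall
    merged = sorted((c for lst in candidate_lists for c in lst), reverse=True)
    (_, i), (_, j) = merged[0], merged[1]
    return i, j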




Feature Extraction and Indexing

- Bag-of-Features
  - pipeline: features -> feature clusters -> histogram
  - feature extraction
    - map takes images in and outputs features directly
  - feature clustering
    - clustering algorithms, like k-means

Feature Extraction and Indexing

- Bag-of-Features
  - feature quantization histogram (sketched below)
    - map
      - for each feature of one image, find the nearest feature cluster
      - generates <imageID,clusterID>
    - reduce
      - receives <imageID,cluster0,cluster1,...>
      - for each feature cluster, updates the histogram
      - generates <imageID,histogram>
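A compact sketch of the quantization step, assuming the cluster centers come from the earlier clustering stage as a NumPy array (names are illustrative):

import numpy as np
from collections import Counter

def quantize_map(image_id, features, cluster_centers):
    # map: for each feature of one image, emit <imageID, nearest clusterID>
    for f in features:
        cid = int(np.argmin(np.linalg.norm(cluster_centers - f, axis=1)))
        yield (image_id, cid)

def histogram_reduce(image_id, cluster_ids, k):
    # reduce: turn one image's list of cluster IDs into a k-bin histogram
    counts = Counter(cluster_ids)
    return (image_id, [counts.get(c, 0) for c in range(k)])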

Feature Extraction and Indexing

- Text Inverted Indexing
  - the inverted index of a term is a list of the documents containing that term
    - each item in the document list stores statistical information: frequency, positions, field information
  - map
    - for each term in one document, generates <term,docID>
  - reduce
    - receives <term,doc0,doc1,doc2,...>
    - for each document, updates the statistical information for that term
    - generates <term,list> (see the sketch below)
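A minimal indexing sketch that keeps only term frequency as the per-document statistic:

from collections import defaultdict, Counter

def index_map(doc_id, text):
    # map: emit <term, docID> for every term occurrence in the document
    for term in text.lower().split():
        yield (term, doc_id)

def index_reduce(term, doc_ids):
    # reduce: build the term's posting list with per-document frequencies
    postings = Counter(doc_ids)        # docID -> term frequency
    return (term, sorted(postings.items()))

def build_index(docs):
    groups = defaultdict(list)         # shuffle: group docIDs by term
    for doc_id, text in docs.items():
        for term, d in index_map(doc_id, text):
            groups[term].append(d)
    return dict(index_reduce(t, ds) for t, ds in groups.items())

# build_index({1: "map reduce", 2: "map"})
# -> {'map': [(1, 1), (2, 1)], 'reduce': [(1, 1)]}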

Machine Learning Algorithm Transformation

- How can we know whether an algorithm can be transformed into a Map-Reduce fashion? And if so, how do we do it?
- Statistical Query and Summation Form
  - all we want is to estimate or infer quantities: cluster ids, labels, ...
  - ...from sufficient statistics: distances between points, point positions, ...
  - and the statistics computation can be divided across the data

Machine Learning Algorithm Transformation

- Linear Regression in Summation Form
  - the normal-equation solution theta = (X^T X)^(-1) X^T y decomposes into sums over the data points: A = sum_i x_i x_i^T and b = sum_i x_i y_i
  - map: each task computes the partial sums of A and b over its data split
  - reduce: adds the partial sums and solves A theta = b (sketched below)
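A sketch of that decomposition, assuming each split is a NumPy matrix of rows and its target vector:

import numpy as np

def lr_map(X_split, y_split):
    # map: partial sums of A = sum x x^T and b = sum x y on one split
    return X_split.T @ X_split, X_split.T @ y_split

def lr_reduce(partials):
    # reduce: add the partial sums and solve A theta = b
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)

# splits = [(X1, y1), (X2, y2), ...]
# theta = lr_reduce([lr_map(X, y) for X, y in splits])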

Machine Learning Algorithm Transformation

- Naïve Bayes
  - map: counts label occurrences and (feature value, label) co-occurrences over its data split
  - reduce: sums the counts, which normalize into the priors P(y) and the conditionals P(x_j|y) (sketched below)
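A count-based sketch of the same pattern for discrete features (smoothing and normalization omitted for brevity):

from collections import Counter

def nb_map(split):
    # map: count labels and (feature index, value, label) triples on one split
    label_counts, feature_counts = Counter(), Counter()
    for x, y in split:                  # x is a tuple of discrete features
        label_counts[y] += 1
        for j, v in enumerate(x):
            feature_counts[(j, v, y)] += 1
    return label_counts, feature_counts

def nb_reduce(partials):
    # reduce: sum the partial counts; they normalize into P(y) and P(x_j=v|y)
    labels, features = Counter(), Counter()
    for lc, fc in partials:
        labels.update(lc)
        features.update(fc)
    return labels, features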

Machine Learning Algorithm Transformation

- Solution
  - find the statistics-calculation part of the algorithm
  - distribute the calculation over the data using map
  - gather and refine all the statistics in reduce

Map-Reduce Systems Drawbacks

- Batch-based system
  - "pull" model
    - reduce must wait for unfinished maps
    - reduce "pulls" data from map
  - no direct support for iteration
- Focuses too much on distributed systems and failure tolerance
  - a local computing cluster may not need them


Map-Reduce Variants

- Map-Reduce Online
  - "push" model
    - map "pushes" data to reduce
    - reduce can also "push" results to the map of the next job
    - builds a pipeline
- Iterative Map-Reduce
  - higher-level schedulers that schedule the whole iteration process

Map-Reduce Variants

- Series Map-Reduce?

[Diagram: several multi-core Map-Reduce nodes chained in series; what coordinates the chain itself: Map-Reduce? MPI? Condor?]

Conclusions

- A good parallelization framework
  - schedules jobs automatically
  - failure tolerance
  - distributed computing supported
  - high-level abstraction: easy to port algorithms onto it
- Too "industry"
  - why do we need a large distributed system?
  - why do we need so much data safety?


References

[1] Map-Reduce for Machine Learning on Multicore
[2] A Map Reduce Framework for Programming Graphics Processors
[3] MapReduce Distributed Computing for Machine Learning
[4] Evaluating MapReduce for Multi-core and Multiprocessor Systems
[5] Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System
[6] Phoenix++: Modular MapReduce for Shared-Memory Systems
[7] Web-Scale Computer Vision Using MapReduce for Multimedia Data Mining
[8] MapReduce Indexing Strategies: Studying Scalability and Efficiency
[9] Batch Text Similarity Search with MapReduce
[10] Twister: A Runtime for Iterative MapReduce
[11] MapReduce Online
[12] Fast Training of Support Vector Machines Using Sequential Minimal Optimization
[13] Social Content Matching in MapReduce
[14] Large-Scale Multimedia Semantic Concept Modeling Using Robust Subspace Bagging and MapReduce
[15] Parallel Spectral Clustering in Distributed Systems

Thanks




Q & A