Make Sense of Big Data

Researched by JIANG Wen-rui
Led by Prof. ZOU

Three Levels of Big Data

- Data Analysis (SaaS)
- Software Infrastructure (PaaS)
- Hardware Infrastructure (IaaS)

Contradiction Between the First and Second Levels

- Data Analysis: Machine Learning, Data Warehouse, Statistics
- Software Infrastructure: MapReduce, Pregel, GraphLab, GraphBuilder, Spark

(Evolution of Big Data technology)

[Figure: representative systems at the Data Intelligence and Software Architecture levels, including the Hadoop ecosystem (HDFS, MapReduce, HBase, Hive, Pig, Mahout) as distributed by Cloudera and MapR, the BDAS stack (Spark, Shark, MLBase), graph systems (Pregel, GraphLab, GraphBuilder, graph applications), and BC-PDM.]

The 4 Vs of Big Data

Volume
Big Data is just that: data sets that are so massive that typical software systems are incapable of economically storing, let alone managing and computing, the information. A Big Data platform must capture and readily provide such quantities in a comprehensive and uniform storage framework to enable straightforward management and development.

Variety
One of the tenets of Big Data is the exponential growth of unstructured data. The vast majority of data now originates from sources with either limited or variable structure, such as social media and telemetry. A Big Data platform must accommodate the full spectrum of data types and forms.

Velocity
As organizations continue to seek new questions, patterns, and metrics within their data sets, they demand rapid and agile modeling and query capabilities. A Big Data platform should maintain the original format and precision of all ingested data to ensure full latitude for future analysis and processing cycles.

Value
Driving relevant value, whether as revenue or cost savings, from data is the primary motivator for many organizations. The popularity of long-tail business models has forced companies to examine their data in detail to find the patterns, affiliations, and connections that drive these new opportunities.

Model vs. Framework: Performance

- Google MapReduce: good at data-independent tasks, not at machine learning and graph processing (data-dependent and iterative tasks). Based on acyclic data flow. "Think like a key."

- Google Pregel: good at iterative and data-dependent computations, including graph processing. Uses the BSP (Bulk Synchronous Parallel) model; a message-passing abstraction.

- CMU GraphLab: good at iterative and data-dependent computations, especially natural-graph problems. Uses an asynchronous distributed shared-memory model; a shared-state abstraction. "Think like a vertex."

- UC Berkeley BDAS Spark: good at iterative algorithms, interactive data mining, and OLAP reports. Uses the RDD (resilient distributed datasets) abstraction, built on in-memory cluster computing and a distributed-memory model.

MapReduce

[Figure slides: the Map phase, the Reduce phase, and RPC in MapReduce.]
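To make the "think like a key" style concrete, here is a minimal word-count sketch in plain Python that mimics the two MapReduce phases. The function names and the in-process shuffle are illustrative, not part of any specific framework.

    from collections import defaultdict

    # Map phase: emit (key, value) pairs -- "think like a key".
    def map_fn(line):
        for word in line.split():
            yield (word, 1)

    # Reduce phase: combine all values that share the same key.
    def reduce_fn(word, counts):
        yield (word, sum(counts))

    def mapreduce(lines):
        # Shuffle: group intermediate pairs by key (done by the framework on a real cluster).
        groups = defaultdict(list)
        for line in lines:
            for key, value in map_fn(line):
                groups[key].append(value)
        results = []
        for key, values in groups.items():
            results.extend(reduce_fn(key, values))
        return results

    print(mapreduce(["big data big ideas", "big clusters"]))
    # [('big', 3), ('data', 1), ('ideas', 1), ('clusters', 1)]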

MapReduce + BSP

BSP Model

A BSP computation proceeds in supersteps across the processors; each superstep consists of local computation, communication, and a barrier synchronization.
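A minimal sketch of the superstep loop described above, in plain Python. The vertex and message structures are illustrative assumptions, not an actual BSP framework.

    # A toy single-process BSP loop: in each superstep every vertex does local
    # computation on the messages delivered at the last barrier, then sends new
    # messages; the end of the loop iteration plays the role of the barrier.
    def bsp_run(vertices, compute, max_supersteps=30):
        inbox = {v: [] for v in vertices}          # messages delivered at the barrier
        for step in range(max_supersteps):
            outbox = {v: [] for v in vertices}
            for v in vertices:                     # local computation phase
                for target, msg in compute(v, inbox[v]):   # targets must be known vertices
                    outbox[target].append(msg)     # communication phase
            inbox = outbox                         # barrier: messages become visible
            if not any(outbox.values()):           # no traffic -> computation halts
                break
        return step

A Pregel-style vertex program, such as the Pregel_PageRank shown later in the deck, plugs in as the compute function.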

GraphLab: Think Like a Vertex

GraphLab Working Patterns


Pattern                       Functions
MR  (Map-Reduce)              map_reduce_vertices, map_reduce_edges, transform_vertices, transform_edges
GAS (Gather-Apply-Scatter)    gather_edges, gather, apply, scatter_edges, scatter
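As a sketch of the GAS pattern above, here is PageRank written with gather and apply callbacks in plain Python. The driver loop and data structures are simplified assumptions, not the GraphLab API.

    # PageRank expressed in the Gather-Apply-Scatter pattern (simplified, single machine).
    # graph: dict vertex -> list of (in_neighbor, weight); rank: dict vertex -> float.
    def gather(rank, in_edges):
        # Sum weighted ranks over the in-edges.
        return sum(rank[j] * w for j, w in in_edges)

    def apply_rank(total):
        return 0.15 + total            # same update rule as the PageRank slides

    def pagerank_gas(graph, iterations=20):
        rank = {v: 1.0 for v in graph}
        for _ in range(iterations):
            gathered = {v: gather(rank, graph[v]) for v in graph}   # Gather
            rank = {v: apply_rank(g) for v, g in gathered.items()}  # Apply
            # Scatter would signal out-neighbors whose inputs changed; here we
            # simply rerun every vertex in each iteration.
        return rank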

[Figure: distributed execution of a PowerGraph vertex-program. A vertex is split across machines into one master and several mirrors; each machine computes a partial gather result (Σ1..Σ4), the partials are summed and applied at the master, and the updated value Y' is scattered back to the mirrors.]

GraphLab vs. Pregel: Example

What is the popularity of this user? It depends on the popularity of her followers, which in turn depends on the popularity of their followers.

GraphLab vs. Pregel: PageRank

- Update ranks in parallel
- Iterate until convergence

The rank of user i is a weighted sum of the neighbors' ranks, matching the vertex programs below:

    R[i] = 0.15 + sum over in-neighbors j of ( w_ji * R[j] )

The Pregel Abstraction

Vertex-programs interact by sending messages.

    Pregel_PageRank(i, messages):
        // Receive all the messages
        total = 0
        foreach (msg in messages):
            total = total + msg

        // Update the rank of this vertex
        R[i] = 0.15 + total

        // Send new messages to neighbors
        foreach (j in out_neighbors[i]):
            Send msg(R[i] * w_ij) to vertex j

Malewicz et al. [PODC'09, SIGMOD'10]

[Figure: a Pregel superstep alternates a compute phase and a communicate phase, separated by a barrier.]
The GraphLab Abstraction

Vertex-programs directly read the neighbors' state.

    GraphLab_PageRank(i):
        // Compute sum over neighbors
        total = 0
        foreach (j in in_neighbors(i)):
            total = total + R[j] * w_ji

        // Update the PageRank
        R[i] = 0.15 + total

        // Trigger neighbors to run again
        if R[i] not converged then
            foreach (j in out_neighbors(i)):
                signal vertex-program on j

Low et al. [UAI'10, VLDB'12]

GraphLab Execution

The scheduler determines the order in which vertices are executed: signaled vertices are queued, and worker CPUs repeatedly pull vertices from the scheduler and run their vertex-programs. The process repeats until the scheduler is empty.

[Figure: two CPUs pulling vertices from a shared scheduler queue.]
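A minimal single-threaded sketch of that scheduler loop, assuming a FIFO queue of signaled vertices; real GraphLab schedulers are parallel and offer several ordering policies.

    from collections import deque

    # Toy scheduler loop: run a vertex-program, let it signal neighbors,
    # and keep going until no vertex remains scheduled.
    def run_until_empty(initial_vertices, vertex_program):
        scheduler = deque(initial_vertices)
        scheduled = set(initial_vertices)           # avoid duplicate queue entries
        while scheduler:                            # "repeats until the scheduler is empty"
            v = scheduler.popleft()
            scheduled.discard(v)
            for neighbor in vertex_program(v):      # the program returns vertices to signal
                if neighbor not in scheduled:
                    scheduler.append(neighbor)
                    scheduled.add(neighbor)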

GraphLab vs. Pregel (BSP)

[Figure: multicore PageRank on 25M vertices and 355M edges, plotting the number of vertices against the number of updates each received. 51% of the vertices were updated only once, so dynamic scheduling, rather than updating every vertex in every superstep, is better for ML.]

Graph-Parallel Abstractions

- GraphLab: shared state, asynchronous execution.
- Pregel: messaging, synchronous execution.

Challenges:

- Asynchronous execution requires heavy locking (GraphLab).
- Synchronous execution is prone to stragglers (Pregel).

Challenges of high-degree vertices:

- A vertex-program touches a large fraction of the graph (GraphLab).
- Edges are processed sequentially.
- Many messages are sent (Pregel).
- Edge meta-data is too large for a single machine.

Berkeley Data Analytics Stack (BDAS)

- HDFS
- MapReduce, MPI, GraphLab, etc.
- Mesos (cluster resource manager)
- Shared RDDs (distributed memory)
- Spark
- Shark (Spark + Hive): SQL
- BlinkDB (approximate queries)
- MLBase

The stack targets all four Vs: Volume, Velocity, Variety, Value.

Spark: Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Figure: input read from stable storage, flowing through map and reduce stages to output on stable storage.]

Spark

- Iterative algorithms, including many machine learning algorithms and graph algorithms such as PageRank.
- Interactive data mining, where a user would like to load data into RAM across a cluster and query it repeatedly.
- OLAP reports that run multiple aggregation queries on the same data.

Spark

- Spark allows iterative computation on the same data, which would form a cycle if the jobs were visualized.
- Spark offers an abstraction called resilient distributed datasets (RDDs) to support these applications efficiently.

RDDs

- A Resilient Distributed Dataset (RDD) serves as an abstraction over raw data; some data is kept in memory and cached for later use.
- Spark allows data to be kept in RAM, giving an approximately 20x speedup over disk-based MapReduce; RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
- RDDs are immutable and are created through parallel transformations such as map, filter, groupBy, and reduce (see the sketch below).
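A small PySpark sketch of the RDD workflow described above: immutable datasets built by parallel transformations and cached in memory for repeated queries. The file path and the filter predicate are illustrative assumptions.

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-sketch")

    # Build RDDs by parallel transformations (immutable: each step returns a new RDD).
    lines = sc.textFile("hdfs:///logs/access.log")          # illustrative path
    errors = lines.filter(lambda line: "ERROR" in line)

    # Cache in cluster memory so repeated queries avoid re-reading from disk.
    errors.cache()

    # Interactive-style repeated queries over the cached data.
    print(errors.count())
    counts = (errors.map(lambda line: (line.split()[0], 1))  # key by the first field
                    .reduceByKey(lambda a, b: a + b)
                    .collect())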

MapReduce vs. Spark: Logistic Regression Performance

[Figure: running time (s) versus number of iterations (1 to 30). Hadoop takes about 127 s per iteration; Spark takes 174 s for the first iteration and about 6 s for each further iteration once the data is cached in memory.]

MLBase: Motivation (Two Gaps)

- In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques, and they need to tune and compare several suitable algorithms.
- Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives.

So MLBase is designed as a system that is extensible to novel ML algorithms.

MLBase: Four Pieces

- MQL: a simple declarative way to specify ML tasks (an illustrative sketch follows this list).
- ML-Library: a library of distributed algorithms, plus a set of high-level operators that let ML researchers scalably implement a wide range of ML methods without deep systems knowledge.
- ML-Optimizer: a novel optimizer to select and dynamically adapt the choice of learning algorithm.
- ML-Runtime: a new run-time optimized for the data-access patterns of these high-level operators.
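To illustrate the idea of a declarative task handed to an optimizer (this deck does not show real MQL syntax), here is a hypothetical Python sketch; every name in it is an assumption, not the MLBase API.

    # Hypothetical sketch of the MLBase idea: the user declares *what* to learn and an
    # optimizer picks and tunes the algorithm on the ML-Runtime. All names here are
    # illustrative; they are not the real MQL syntax or the MLBase API.
    def do_classify(data, label, candidates=("logistic_regression", "svm", "decision_tree")):
        # ML-Optimizer: a real optimizer would search algorithms and hyper-parameters
        # using the ML-Library operators; here the search is faked.
        best = candidates[0]
        plan = {"algorithm": best, "params": {"regularization": 0.1}}
        return plan

    print(do_classify(data="patients_table", label="outcome"))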

MLBase: Architecture

[Figure: MLBase architecture.]


Just a Hadoop Frame?

In a sense, the distributed platforms are just a language: we cannot do without them, but we should not rely on them alone. What matters more is the following:

- Machine Learning! (Reading: Machine Learning: A Probabilistic Perspective.)
- Deep Learning.












Parallel Time Series Regression

Led by Dr. Yang
Group: LI Zhong-hua, WANG Yun-zhi, JIANG Wen-rui
FUJITSU

Parallel Time Series Regression: Properties and Performance

Platform
- Hadoop from Apache: an open-source implementation of Google's MapReduce.
- GraphLab from Carnegie Mellon University (open source).
- Both are good at distributed parallel processing: MapReduce is good at acyclic data flow; GraphLab is good at iterative and data-dependent computations.

Volume
- Support for big data: the algorithm has good scalability. When a large amount of data arrives, the algorithm can handle it without any modification, just by increasing the number of cluster nodes.

Velocity
- Rapid and agile modeling and handling capabilities for big data.

Interface
- An XML file is used to set the input parameters, allowing customers to set parameters intuitively (see the sketch below).
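The deck does not show the actual XML schema, so here is a hypothetical example of such a parameter file and how it could be read with Python's standard library; all element and attribute names are assumptions.

    import xml.etree.ElementTree as ET

    # Hypothetical parameter file for the parallel time-series regression job;
    # the element and attribute names are illustrative, not the real schema.
    PARAMS_XML = """
    <job>
      <input path="hdfs:///tsr/input" days="90"/>
      <fragment maxLength="96"/>
      <cluster mappers="4" reducers="8"/>
    </job>
    """

    root = ET.fromstring(PARAMS_XML)
    days = int(root.find("input").get("days"))
    max_len = int(root.find("fragment").get("maxLength"))
    mappers = int(root.find("cluster").get("mappers"))
    print(days, max_len, mappers)   # 90 96 4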

Parallel Time Series Regression: Workflow

- Decompose: MapReduce
- CycLenCalcu: MapReduce
- Indicative Frag: MapReduce
- TBSCPro: MapReduce
- Clustering: GraphLab
- Choose Cluster: MapReduce

Design for Parallelism: Indicative Fragment

Indicative fragment: identify the best length of the indicative fragment.
Assume 90 days of data and a maximum indicative fragment length of 96.

Compare serial and parallel time complexity: for each candidate length (1 to 96), every pair of days must be compared, i.e. C(90,2) = 90*89/2 pairs per length.

- Serial: time complexity 96 * C(90,2).
- Parallel: all 96 * (90*89/2) operation pairs are generated before the parallel computation, so every pair is processed concurrently and the time complexity is 1 (see the sketch below).
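A minimal sketch of generating those operation pairs up front, assuming the pairs are (length, day_i, day_j) work items handed to the parallel stage; the data structures are illustrative.

    from itertools import combinations

    DAYS = 90
    MAX_FRAGMENT_LENGTH = 96

    # Generate every (fragment_length, day_i, day_j) work item up front, so the
    # parallel stage can process all pairs independently.
    operation_pairs = [
        (length, i, j)
        for length in range(1, MAX_FRAGMENT_LENGTH + 1)
        for i, j in combinations(range(DAYS), 2)
    ]

    assert len(operation_pairs) == MAX_FRAGMENT_LENGTH * DAYS * (DAYS - 1) // 2  # 96 * C(90,2) = 384480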

TBSCPro

[Figure: candidate series (a, b, c, d, e) over all days are scanned, and the best candidates are kept in a heap with capacity 3.]
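A minimal sketch of keeping only the best three candidates with a bounded heap, assuming a higher score is better; the scoring itself is outside this sketch.

    import heapq

    # Keep the top-3 scoring candidates using a min-heap of capacity 3:
    # the smallest of the kept scores sits at the root and is evicted first.
    def top3(scored_candidates):
        heap = []                                   # entries are (score, candidate)
        for score, candidate in scored_candidates:
            if len(heap) < 3:
                heapq.heappush(heap, (score, candidate))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, candidate))
        return sorted(heap, reverse=True)

    print(top3([(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.1, "e")]))
    # [(0.9, 'b'), (0.7, 'd'), (0.5, 'c')]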



Parallel Time Series Regression Model: Results

[Figure: run time (s) versus data size (16 to 1000) on a 4-node cluster, for the Map=2/Reduce=2, Map=4/Reduce=4, Map=4/Reduce=8, and Map=4/Reduce=12 configurations; the best time is highlighted.]

[Figure: data selection run time (s) versus number of days (90 to 3600), comparing data-selection and primary-seeds times on a 4-node cluster with data-selection time on a single node.]

Thank you!