
Berkeley Data Analysis Stack (BDAS)

Mesos, Spark, Shark, Spark Streaming

Current Data Analysis Open Stack

[Stack diagram with layers: Application, Storage, Data Processing, Infrastructure]

Characteristics:

- Batch processing on on-disk data.
- Not very efficient for "interactive" and "streaming" computations.

Goal

Berkeley Data Analytics Stack (BDAS)

[Stack diagram with layers: Infrastructure, Storage, Data Processing, Application, annotated with Resource Management and Data Management]

- Resource Management: share infrastructure across frameworks (multi-programming for datacenters).
- Data Management: efficient data sharing across frameworks.
- Data Processing: in-memory processing; trade between time, quality, and cost.
- Application: new apps such as AMP-Genomics, Carat, …

BDAS Components

Mesos

- A platform for sharing commodity clusters between diverse computing frameworks.

B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," tech report, UCB, 2010.

Mesos

- "Resource offers": Mesos publishes its available resources to the frameworks.
- Mesos has to deal with framework-specific constraints (without knowing the specific constraints), e.g., data locality.
- It therefore allows the framework scheduler to reject an offer if the framework's constraints are not met, as sketched below.
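To make the offer/reject cycle concrete, here is a minimal, self-contained Scala sketch of the idea; the Offer type, the locality check, and the scheduler class are hypothetical stand-ins, not the real Mesos API:

// Hypothetical model of Mesos-style resource offers (not the real Mesos API).
case class Offer(slaveId: String, cpus: Double, memGB: Double)

class FrameworkScheduler(preferredSlaves: Set[String]) {

  // Mesos publishes an offer without knowing our constraints;
  // the framework applies them and may reject the offer.
  def resourceOffer(offer: Offer): Boolean = {
    val hasLocality = preferredSlaves.contains(offer.slaveId) // e.g. data locality
    val bigEnough   = offer.cpus >= 1.0 && offer.memGB >= 2.0
    if (hasLocality && bigEnough) {
      launchTask(offer) // accept: run a task on the offered resources
      true
    } else {
      false // reject: Mesos re-offers the resources to other frameworks
    }
  }

  private def launchTask(offer: Offer): Unit =
    println(s"launching task on ${offer.slaveId}")
}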

Mesos

Other issues:

- Resource allocation strategies: pluggable (a fair-sharing plugin is implemented).
- Revocation.
- Isolation: existing OS isolation techniques, e.g., Linux Containers.
- Fault tolerance:
  - Master: standby master nodes coordinated via ZooKeeper.
  - Slaves: Mesos reports task/slave failures to the framework, which handles them.
  - Framework scheduler failure: replicate the scheduler.

Spark

Current popular programming models for clusters transform data flowing from stable storage to stable storage.

Spark: in-memory cluster computing for iterative and interactive applications.

Assumptions:

- The same working data set is used across iterations or for a number of interactive queries.
- Commodity cluster.
- Local data partitions fit in memory.

Some slides taken from presentation.

Spark

- Acyclic data flows are a powerful abstraction,
- but they are not efficient for iterative/interactive applications that repeatedly use the same "working data set".

Solution: augment the data flow model with in-memory resilient distributed datasets (RDDs).

RDDs

- An RDD is an immutable, partitioned, logical collection of records.
- It need not be materialized; rather, it contains the information needed to rebuild the dataset from stable storage (lazy loading and lineage).
- A lost partition can be rebuilt (transform once, read many).
- Partitioning can be based on a key in each record (using hash or range partitioning).
- RDDs are created by transforming data in stable storage using data flow operators (map, filter, group-by, …).
- RDDs can be cached for future reuse, as the sketch below shows.
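A minimal sketch of these properties in Scala, assuming an existing SparkContext sc and a placeholder HDFS path: transformations only record lineage, and nothing executes until an action runs:

// Assumes an existing SparkContext `sc`; the HDFS path is a placeholder.
val nums  = sc.textFile("hdfs://.../numbers.txt") // base RDD: lazy, nothing read yet
val evens = nums.map(_.trim.toInt)                // transformation: lineage recorded only
                .filter(_ % 2 == 0)               // still lazy
evens.cache()                                     // mark for in-memory reuse

println(evens.count()) // action: triggers the read, computes, caches partitions
println(evens.count()) // second action: served from the cache, no recomputation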

Generality of RDDs

- Claim: Spark's combination of data flow with RDDs unifies many proposed cluster programming models.
- General data flow models: MapReduce, Dryad, SQL.
- Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MR), Continuous Bulk Processing.
- Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

Programming Model

Transformations (define a new RDD):
map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel operations (return a result to the driver):
reduce, collect, count, save, lookupKey, …

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to three workers, each reading one HDFS block (Block 1-3) of the base RDD; the transformed RDD is cached on each worker (Cache 1-3), and the results of each parallel operation flow back to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split("\t")(2))
                          .cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD

Benefits of RDD Model

- Consistency is easy due to immutability.
- Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data).
- Locality-aware scheduling of tasks on partitions.
- Despite being restricted, the model seems applicable to a broad variety of applications.

Example: Logistic Regression

Goal: find the best line separating two sets of points.

[Figure: two point sets with a random initial line iteratively converging to the target separator.]

Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Logistic Regression Performance

- Hadoop: 127 s / iteration
- Spark: first iteration 174 s, further iterations 6 s

Page Rank: Scala Implementation

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)


Spark Summary

- Fast, expressive cluster computing system compatible with Apache Hadoop.
- Works with any Hadoop-supported storage system (HDFS, S3, Avro, …).
- Improves efficiency through in-memory computing primitives and general computation graphs.
- Improves usability through rich APIs in Java, Scala, Python, and an interactive shell.
- Up to 100× faster; often 2-10× less code.

Spark Streaming

- Framework for large-scale stream processing.
- Scales to 100s of nodes.
- Can achieve second-scale latencies.
- Integrates with Spark's batch and interactive processing.
- Provides a simple batch-like API for implementing complex algorithms.
- Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

Requirements:

- Scalable to large clusters.
- Second-scale latencies.
- Simple programming model.
- Integrated with batch & interactive processing.


Stateful Stream Processing

- Traditional streaming systems have an event-driven, record-at-a-time processing model:
  - Each node has mutable state.
  - For each record, the node updates its state & sends new records.
- State is lost if a node dies!
- Making stateful stream processing fault-tolerant is challenging.

[Diagram: input records flow into nodes 1 and 2, each holding mutable state, and on to node 3.]

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs:

- Chop up the live stream into batches of X seconds.
- Spark treats each batch of data as an RDD and processes it using RDD operations.
- Finally, the processed results of the RDD operations are returned in batches.

[Diagram: a live data stream enters Spark Streaming, which hands batches of X seconds to Spark; Spark returns the processed results.]
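To make the batching concrete, here is a minimal Scala sketch using the standard Spark Streaming word-count pattern; the socket source, port, and 1-second batch interval are illustrative assumptions, not from the slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamSketch")
// Chop the live stream into 1-second batches; each batch is handled as an RDD.
val ssc = new StreamingContext(conf, Seconds(1))

// Illustrative source: text lines arriving on a local TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)

// Ordinary RDD-style operations, applied to every batch.
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print() // processed results come back batch by batch

ssc.start()            // start receiving data and processing it
ssc.awaitTermination()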

Example 1

Get hashtags from Twitter:

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of RDDs representing a stream of data.

[Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, @ t+1, @ t+2) is stored in memory as an RDD (immutable, distributed).]

Example 1

Get hashtags from Twitter:

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream.

[Diagram: flatMap turns each batch (@ t, @ t+1, @ t+2) of the tweets DStream into a batch of the hashTags DStream ([#cat, #dog, …]); new RDDs are created for every batch of the new DStream.]

Example 1

Get hashtags from Twitter:

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage.

[Diagram: after each flatMap, a save step writes every batch of the hashTags DStream to HDFS.]

Fault-tolerance

- RDDs remember the sequence of operations that created them from the original fault-tolerant input data.
- Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant.
- Data lost due to worker failure can be recomputed from the input data.

[Diagram: the tweets RDD (input data replicated in memory) feeds the hashTags RDD via flatMap; lost partitions are recomputed on other workers.]

Key concepts

- DStream: a sequence of RDDs representing a stream of data.
  - Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets.
- Transformations: modify data from one DStream to another.
  - Standard RDD operations: map, countByValue, reduce, join, …
  - Stateful operations: window, countByValueAndWindow, … (see the sketch below)
- Output operations: send data to an external entity.
  - saveAsHadoopFiles: saves to HDFS.
  - foreach: do anything with each batch of results.
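As a small illustration of a windowed (stateful) operation, a hedged sketch building on the hashTags DStream from the earlier example; the 10-minute window and 1-second slide interval are arbitrary choices:

import org.apache.spark.streaming.{Minutes, Seconds}

// Sliding window: all hashtags seen in the last 10 minutes,
// recomputed every 1 second, then counted per tag.
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
tagCounts.print()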

Comparison with Storm and S4

Higher throughput than Storm:

- Spark Streaming: 670k records/second/node
- Storm: 115k records/second/node
- Apache S4: 7.5k records/second/node

[Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for WordCount and Grep, Spark vs. Storm.]