Real-time PMML Scoring over Spark Streaming and Storm


Dr. Vijay Srinivas Agneeswaran
Director and Head, Big-data R&D, Innovation Labs, Impetus

Contents

- Big Data Computations
- BDAS (Berkeley Data Analytics Stack): Spark
- BDAS: Discretized Streams
- Real-time analytics with Storm
- PMML Primer
- Naïve Bayes Primer
- PMML Scoring for Naïve Bayes

Big Data Computations

Computations/Operations (the "giants" of [1]):
- Giant 1 (simple statistics): perfect for Hadoop.
- Giants 2 (linear algebra), 3 (N-body), 4 (optimization): Spark from UC Berkeley is efficient. Examples: logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. A concrete case is the social-group-first approach to consumer churn analysis [2].
- Interactive/on-the-fly data processing: Storm.
- OLAP: data cube operations. Dremel/Drill.
- Data sets that are not embarrassingly parallel?
- Deep learning (artificial neural networks): machine vision from Google, speech analysis from Microsoft.
- Giant 5 (graph processing): GraphLab, Pregel, Giraph.

[1] National Research Council: Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.

[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.

Berkeley Big-data Analytics Stack (BDAS)

BDAS: Spark

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2012.

Transformations/Actions and their descriptions:

map(f1): Pass each element of the RDD through f1 in parallel and return the resulting RDD.
filter(f2): Select the elements of the RDD that return true when passed through f2.
flatMap(f3): Similar to map, but f3 returns a sequence, so a single input can map to multiple outputs.
union(r1): Returns the union of the RDD r1 with self.
sample(flag, p, seed): Returns a randomly sampled (with seed) p percent of the RDD.
groupByKey(noTasks): Can only be invoked on key-value paired data; returns data grouped by key. The number of parallel tasks is given as an argument (default is 8).
reduceByKey(f4, noTasks): Aggregates the result of applying f4 to elements with the same key. The number of parallel tasks is the second argument.
join(r2, noTasks): Joins RDD r2 with self; computes all possible pairs for each key.
groupWith(r3, noTasks): Joins RDD r3 with self and groups by key.
sortByKey(flag): Sorts the self RDD in ascending or descending order based on the flag.
reduce(f5): Aggregates the result of applying f5 to all elements of the self RDD.
collect(): Returns all elements of the RDD as an array.
count(): Counts the number of elements in the RDD.
take(n): Gets the first n elements of the RDD.
first(): Equivalent to take(1).
saveAsTextFile(path): Persists the RDD as a text file in HDFS or another Hadoop-supported file system at the given path.
saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs whose types implement Hadoop's Writable interface or an equivalent.
foreach(f6): Runs f6 in parallel on the elements of the self RDD.
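
To make the table concrete, a minimal sketch chaining several of these operations with the Spark Scala API; the input file name and "user,url" record format are illustrative assumptions, not from the deck:

import org.apache.spark.{SparkConf, SparkContext}

object RddOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RddOpsSketch").setMaster("local[*]"))
    // Hypothetical input: one "user,url" record per line.
    val lines = sc.textFile("access.csv")
    val pairs = lines.map { line => (line.split(",")(0), 1) }   // map
    val counts = pairs.reduceByKey(_ + _, 8)                    // reduceByKey, 8 tasks
    val heavy = counts.filter { case (_, n) => n > 10 }         // filter
    heavy.sortByKey().take(5).foreach(println)                  // sortByKey + take
    println("total users: " + counts.count())                   // count (action)
    sc.stop()
  }
}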

BDAS: Discretized Streams

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)

- Treats streams as a series of small time-interval batch computations.
- Event-based APIs for stream handling.
- How to make the interval granularity very low (milliseconds)?
- Built over Spark RDDs: the in-memory distributed cache.
- Fault tolerance is based on RDD lineage (the series of transformations, which can be stored and recomputed on failure).
- Parallel recovery: re-computations happen in parallel across the cluster.

BDAS: D-Streams Streaming Operators

words = sentences.flatMap(s => s.split(" "))
pairs = words.map(w => (w, 1))
counts = pairs.reduceByKey((a, b) => a + b)

- Windowing: pairs.window("5s").reduceByKey(_ + _)
- Incremental aggregation: pairs.reduceByWindow("5s", (a, b) => a + b)
- Time-skewed joins

A runnable version of the windowed count is sketched below.
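
A minimal windowed word count in the Spark Streaming Scala API as it later shipped; the socket source on localhost:9999 is an illustrative assumption:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))          // 1 s batch interval
    val sentences = ssc.socketTextStream("localhost", 9999)   // assumed text source
    val pairs = sentences.flatMap(_.split(" ")).map(w => (w, 1))
    // Aggregate counts over a sliding 5 s window, recomputed each batch.
    val counts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5))
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}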

BDAS: Use Cases

Ooyala
- Uses Cassandra for video data personalization.
- Trade-off: pre-computed aggregates vs. on-the-fly queries.
- Moved to Spark for ML and computing views.
- Moved to Shark for on-the-fly queries: C* OLAP aggregate queries take 130 seconds on Cassandra vs. 60 ms in Spark.

Conviva
- Uses Hive for repeatedly running ad-hoc queries on video data.
- Optimized ad-hoc queries using Spark RDDs; found Spark to be 30 times faster than Hive.
- ML for connection analysis and video streaming optimization.

Yahoo
- Advertisement targeting: 30K nodes on Hadoop YARN.
- Hadoop for batch processing, Spark for iterative processing, Storm for on-the-fly processing.
- Content recommendation via collaborative filtering.

Real-time Analytics: R over Storm

Real-time Analytics UC 1: Internet Traffic Analysis

Real-time Analytics UC 2: Arrhythmia Detection

PMML Primer

- Predictive Model Markup Language.
- Developed by the DMG (Data Mining Group).
- An XML representation of a model.
- PMML offers a standard way to define a model, so that a model generated in tool A can be used directly in tool B.
- May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.

Naïve Bayes Primer

- A simple probabilistic classifier based on Bayes' theorem.
- Given features X1, X2, ..., Xn, predict a label Y by calculating the probability for all possible values of Y:

  P(Y \mid X_1, \dots, X_n) = \frac{P(Y) \prod_{i=1}^{n} P(X_i \mid Y)}{P(X_1, \dots, X_n)}

  where P(Y) is the prior, \prod_{i=1}^{n} P(X_i \mid Y) is the likelihood, and P(X_1, \dots, X_n) is the normalization constant.

PMML Scoring for Naïve Bayes

- Wrote a PMML-based scoring engine for the Naïve Bayes algorithm.
- This can theoretically be used in any data-processing framework by invoking the API.
- Deployed a Naïve Bayes PMML model generated from R into the Storm, Spark, and Samza frameworks.
- Real-time predictions with the above APIs; a sketch of a Storm scoring bolt follows.
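
A sketch of how such an engine plugs into Storm: a basic bolt that delegates each incoming tuple to a scoring call. The PmmlScorer trait and its score method are hypothetical stand-ins for the deck's (non-public) scoring engine; the field names match the PMML below, and the Storm API is the pre-Apache backtype.storm one current at the time:

import backtype.storm.topology.base.BaseBasicBolt
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import backtype.storm.tuple.{Fields, Tuple, Values}

// Hypothetical stand-in for the PMML scoring engine described above.
trait PmmlScorer extends Serializable {
  def score(features: Map[String, String]): String  // e.g. "democrat" or "republican"
}

// Storm bolt that scores one vote record per tuple.
class PmmlScoringBolt(scorer: PmmlScorer) extends BaseBasicBolt {

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    // Pull the three predictor fields (V1, V2, V3) off the tuple.
    val features = Map(
      "V1" -> input.getStringByField("V1"),
      "V2" -> input.getStringByField("V2"),
      "V3" -> input.getStringByField("V3"))
    collector.emit(new Values(scorer.score(features)))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("Predicted_Class"))
}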

<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

(ctd. on the next slide)

PMML Scoring for Naïve Bayes (ctd.)

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
    <MiningField name="V2" usageType="active"/>
    <MiningField name="V3" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
  </Output>

(ctd. on the next slide)

PMML Scoring for Naïve Bayes (ctd.)

  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
    <BayesInput fieldName="V2">
      *
    </BayesInput>
    <BayesInput fieldName="V3">
      *
    </BayesInput>
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>
</NaiveBayesModel>


PMML Scoring for Naïve Bayes

Definition of elements:
- DataDictionary: definitions for the fields as used in mining models (Class, V1, V2, V3).
- NaiveBayesModel: indicates that this is a Naïve Bayes PMML model.
- MiningSchema: lists the fields as used in that model. Class is the "predicted" field; V1, V2, V3 are "active" predictor fields.
- Output: describes the set of result values that can be returned from the model.

PMML Scoring for Naïve Bayes

Definition of elements (ctd.):
- BayesInputs: for each input field, contains the counts of each input value against each output value.
- BayesOutput: contains the counts associated with the values of the target field.

Sample Input

Eg1: n y y n y y n n n n n n y y y y
Eg2: n y n y y y n n n n n y y y n y

- The 1st, 2nd and 3rd columns are the predictor variables (attribute "name" in element MiningField).
- Using these, we predict whether the output is Democrat or Republican (PMML element BayesOutput). A worked example using the counts above follows.
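
For intuition, a worked posterior using the V1 counts from the PMML above (the V2 and V3 terms, elided in the deck, would multiply in the same way). For a record with V1 = n:

  P(democrat) = 124/232 ≈ 0.534;  P(republican) = 108/232 ≈ 0.466
  P(V1 = n | democrat) = 51/124 ≈ 0.411;  P(V1 = n | republican) = 85/108 ≈ 0.787
  Unnormalized scores: democrat 0.534 × 0.411 ≈ 0.220; republican 0.466 × 0.787 ≈ 0.366
  Normalized: P(democrat | V1 = n) ≈ 0.220 / (0.220 + 0.366) ≈ 0.375

So on V1 alone the model predicts republican with probability ≈ 0.625.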

PMML Scoring for Naïve Bayes

3-node Xeon Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors):

Number of records (millions)    Time taken (seconds)
0.1                             4
0.4                             7
1.0                             12
2.0                             21
10                              129
25                              310

PMML Scoring for Naïve Bayes

3-node Xeon Spark cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space):

Number of records (millions)    Time taken
0.1                             1 min 47 sec
0.2                             3 min 35 sec
0.4                             6 min 40 sec
1.0                             35 min 17 sec
10                              more than 3 hrs

Thank You!

Mail: vijay.sa@impetus.co.in
LinkedIn: http://in.linkedin.com/in/vijaysrinivasagneeswaran
Blogs: blogs.impetus.com
Twitter: @a_vijaysrinivas

Back-up Slides

Representation of an RDD

- Set of partitions: HadoopRDD has 1 per HDFS block; FilteredRDD, same as parent; JoinedRDD, 1 per reduce task.
- Set of dependencies: HadoopRDD, none; FilteredRDD, 1-to-1 on parent; JoinedRDD, a shuffle on each parent.
- Function to compute the data set based on parents: HadoopRDD reads the corresponding block; FilteredRDD computes the parent and filters it; JoinedRDD reads and joins the shuffled data.
- Meta-data on location (preferredLocations): HadoopRDD, the HDFS block location from the namenode; FilteredRDD, none (parent's); JoinedRDD, none.
- Meta-data on partitioning (partitioningScheme): HadoopRDD, none; FilteredRDD, none; JoinedRDD, HashPartitioner.

These five pieces of per-RDD metadata correspond to the members a custom RDD overrides; see the sketch below.
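
A conceptual sketch of that interface in Scala; the trait is illustrative and mirrors the overridable members of org.apache.spark.rdd.RDD rather than reproducing Spark's actual class hierarchy:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Illustrative trait: one member per row of the table above.
trait RddRepresentation[T] {
  def getPartitions: Array[Partition]                               // set of partitions
  def getDependencies: Seq[Dependency[_]]                           // set of dependencies
  def compute(split: Partition, context: TaskContext): Iterator[T]  // compute from parents
  def getPreferredLocations(split: Partition): Seq[String]          // meta-data on location
  val partitioner: Option[Partitioner]                              // meta-data on partitioning
}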

Logistic Regression: Spark vs. Hadoop (performance chart; source: http://spark-project.org)


Some Spark(ling) Examples

Scala code (serial):

var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x * x + y * y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the unit square and count how many fall inside the unit circle (about PI/4 of them); this yields an approximate value for PI. With AS and AC the areas of the square and circle, and PS and PC the corresponding point counts, PS/PC ≈ AS/AC = 4/PI, so PI ≈ 4 * (PC/PS).

Some Spark(ling) Examples

Spark code (parallel):

val spark = new SparkContext(<Mesos master>)
val count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x * x + y * y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 100000.0)

Notable points:

1. A Spark context is created; it talks to the Mesos(1) master.
2. count becomes a shared variable: an accumulator.
3. parallelize creates an RDD by breaking the Scala range object (1 to 100000) into 12 slices.
4. The for loop over that RDD invokes its foreach method.

(1) Mesos is an Apache-incubated clustering system: http://mesosproject.org



Logistic Regression in Spark: Serial Code

// Read data file and convert it into Point objects
val lines = scala.io.Source.fromFile("data.txt").getLines()
val points = lines.map(x => parsePoint(x))

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)

Logistic Regression in Spark

// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1 / (1 + Math.exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)