Berkeley Data Analysis Stack: Shark, Bagel


Previous Presentation Summary: Mesos, Spark, Spark Streaming

The BDAS layers and what each contributes:

Infrastructure (Resource Management): share infrastructure across frameworks (multi-programming for datacenters)

Storage (Data Management): efficient data sharing across frameworks

Data Processing: in-memory processing; trade between time, quality, and cost

Application: new apps: AMP-Genomics, Carat, …

Previous Presentation Summary: Mesos, Spark, Spark Streaming

Spark Example: Log Mining


Load error messages from a log into memory,
then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the Driver ships tasks to Workers and collects results; each Worker reads one block of the file (Block 1-3) and keeps its partition in memory (Cache 1-3). Legend: Base RDD, Transformed RDD, Cached RDD, Parallel operation.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Logistic Regression Performance: Hadoop takes 127 s per iteration; with Spark, the first iteration takes 174 s and further iterations take 6 s.

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  // loop body elided on the slide; see the sketch below
}

println("Final w: " + w)

HIVE: Components

[Architecture diagram: the Hive CLI (DDL, queries, browsing), a management web UI, and the Thrift API sit above the HiveQL Parser and Planner, the Execution layer (MapReduce), and the SerDe libraries (Thrift, Jute, JSON, ...). The MetaStore holds table metadata; the data itself lives in HDFS.]
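As a small illustration of how a client reaches these components, the sketch below issues a HiveQL query over the Thrift API via JDBC. The driver class and URL are the classic HiveServer(1) ones, and the host, port, and query are placeholders; treat this as a sketch, not a drop-in snippet.

import java.sql.DriverManager

object HiveThriftQuery {
  def main(args: Array[String]): Unit = {
    // Assumed pre-HiveServer2 JDBC driver; newer Hive versions use org.apache.hive.jdbc.HiveDriver
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    // The Parser and Planner compile this into MapReduce jobs; the MetaStore supplies the schema
    val rs = stmt.executeQuery("SELECT count(1) FROM page_view")
    while (rs.next()) println(rs.getLong(1))
    conn.close()
  }
}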

Data Model

Hive Entity      | Sample Metastore Entity | Sample HDFS Location
Table            | T                       | /wh/T
Partition        | date=d1                 | /wh/T/date=d1
Bucketing column | userid                  | /wh/T/date=d1/part-0000 … /wh/T/date=d1/part-1000 (hashed on userid)
External Table   | extT                    | /wh2/existing/dir (arbitrary location)
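To make the mapping concrete, the layout above can be reproduced mechanically. A minimal sketch follows; the warehouse root, hash function, and bucket count are assumptions for illustration, not Hive's actual logic.

// Sketch: derive HDFS locations following the layout in the table above
val warehouse = "/wh"

def tablePath(table: String): String =
  s"$warehouse/$table"                                                        // e.g. /wh/T

def partitionPath(table: String, parts: Seq[(String, String)]): String =
  (tablePath(table) +: parts.map { case (k, v) => s"$k=$v" }).mkString("/")   // e.g. /wh/T/date=d1

def bucketFile(table: String, parts: Seq[(String, String)],
               bucketCol: Any, numBuckets: Int): String = {
  val bucket = (bucketCol.hashCode & Int.MaxValue) % numBuckets               // assumed hash of the bucketing column
  f"${partitionPath(table, parts)}/part-$bucket%04d"                          // e.g. /wh/T/date=d1/part-0000
}

For example, bucketFile("T", Seq("date" -> "d1"), someUserId, 32) would land the row in one of /wh/T/date=d1/part-0000 through part-0031.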

Hive/Shark flowchart (Insert into table)

Two ways to do this:

1. Load from an “external table”: query the external table for each “bucket” and write that bucket to HDFS.

2. Load “buckets” directly: the user is responsible for creating the buckets.

CREATE TABLE page_view(
    viewTime INT,
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Creates the table directory (in the layout above, /wh/page_view).

Way 1, loading from the “external table”, proceeds in three steps.


Step 1:

CREATE EXTERNAL TABLE page_view_stg(
    viewTime INT,
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

Step 2:

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

Step 3:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
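Because Shark accepts the same HiveQL and reads the same MetaStore, these three steps run unchanged when Shark is the query engine; only the MapReduce stages are replaced by Spark stages.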

Hive data flow: SerDe, ObjectInspector, and Hadoop serialization

[Diagram: a Mapper reads a file on HDFS through the FileFormat / Hadoop serialization layer as Writables; the SerDe deserializes each Writable into a hierarchical object, which Hive operators process (ObjectInspectors describe its layout). Serialization runs the same path in reverse, writing Writables to the map output file; the Reducer repeats the cycle down to a file on HDFS. A user script can also sit in the pipeline, exchanging rows with the operators as streams.]

Example rows on disk: plain text such as

1.0 3 54
0.2 1 33
2.2 8 212
0.7 2 22

or binary records such as thrift_record<…>, which arrive as BytesWritable(\x3F\x64\x72\x00).

The hierarchical object handed to operators can be:

Java Object: an object of a Java class
Standard Object: ArrayList for struct and array, HashMap for map
LazyObject: lazily deserialized

On the Writable side, a text row arrives as Text(‘1.0 3 54’) // UTF8 encoded
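A toy sketch of the difference between the standard and lazy representations (the classes, field layout, and delimiter here are invented purely for illustration):

// Illustration only: eager vs. lazy views of a text row such as "1.0 3 54"
class StandardRow(line: String) {
  val fields: Array[String] = line.split(" ")               // parsed as soon as the object is built
}

class LazyRow(line: String) {
  private lazy val fields: Array[String] = line.split(" ")  // parsed only on first access
  def field(i: Int): String = fields(i)
}

val row = new LazyRow("1.0 3 54")
// nothing has been split yet; the work happens only when a column is actually read:
println(row.field(2))   // prints 54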

User-defined SerDes work per row: the SerDe exposes deserialize, serialize, and getOI, and the ObjectInspectors it returns expose getType plus navigation methods such as getFieldOI / getStructField for structs and getMapValueOI / getMapValue for maps.
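Put as code, the per-row contract sketched above looks roughly like the following simplified traits (stand-ins for illustration, not Hive's actual org.apache.hadoop.hive.serde2 interfaces):

// Simplified stand-ins for the SerDe / ObjectInspector methods named above
trait ObjectInspector { def getType: String }

trait StructObjectInspector extends ObjectInspector {
  def getFieldOI(field: String): ObjectInspector      // inspector for one struct field
  def getStructField(row: Any, field: String): Any    // extract that field from a row object
}

trait MapObjectInspector extends ObjectInspector {
  def getMapValueOI: ObjectInspector                  // inspector for map values
  def getMapValue(map: Any, key: Any): Any
}

trait SerDe {
  def deserialize(blob: Any): Any                     // Writable -> hierarchical object
  def serialize(row: Any, oi: ObjectInspector): Any   // hierarchical object -> Writable
  def getOI: ObjectInspector                          // describes what deserialize returns
}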

SerDe, ObjectInspector and TypeInfo

[Diagram: a Writable, e.g. BytesWritable(\x3F\x64\x72\x00) or Text(‘a=av:b=bv 23 1:2=4:5 abcd’), is deserialized into a hierarchical object; ObjectInspectors (e.g. a String ObjectInspector for one field) and the TypeInfo describe its type, a struct whose fields include an int, a string, a list of structs, and a map<string, string>.]

class HO {
  HashMap<String, String> a;
  Integer b;
  List<ClassC> c;
  String d;
}

class ClassC {
  Integer a;
  Integer b;
}

The same row as a standard object:

List(
  HashMap(“a” -> “av”, “b” -> “bv”),
  23,
  List(List(1, null), List(2, 4), List(5, null)),
  “abcd”
)
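Either form of the row, the HO instance or the nested List/HashMap, describes the same struct value (a map<string,string>, an int, a list of two-int structs, and a string); the ObjectInspector is what lets Hive operators read both representations uniformly.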
