Column-Oriented Storage Techniques



Avrilia Floratou (University of Wisconsin-Madison)

Jignesh M. Patel (University of Wisconsin-Madison)

Eugene J. Shekita (while at IBM Almaden Research Center)

Sandeep Tata (IBM Almaden Research Center)

Presented by: Luyang Zhang & Yuguan Li

Column-Oriented Storage Techniques for MapReduce

1

Motivation

[Diagram: Databases, MapReduce, and Column-Oriented Storage, compared on performance, programmability, and fault tolerance]

2

Column-Oriented Storage

3


Benefits:

- Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data.

- Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once.

- Column data is of uniform type, which provides some opportunity for storage-size optimization (e.g., compression).
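The last benefit can be sketched in plain Java: compressing a uniform column of ages against the same values interleaved with names, using java.util.zip.Deflater. The data below is made up for illustration.

```java
import java.util.zip.Deflater;

// Sketch: uniform column data typically compresses better than the same
// values interleaved with other fields, because runs of similar bytes
// are longer. The dataset below is invented for illustration.
public class ColumnCompression {

    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length * 2 + 64];
        int n = 0;
        while (!deflater.finished()) {
            n += deflater.deflate(buf, n, buf.length - n);
        }
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        StringBuilder column = new StringBuilder(); // one column: ages only
        StringBuilder rows = new StringBuilder();   // row-oriented: name,age pairs
        for (int i = 0; i < 1000; i++) {
            column.append(20 + (i % 5)).append('\n');
            rows.append("user").append(i).append(',').append(20 + (i % 5)).append('\n');
        }
        int colSize = compressedSize(column.toString().getBytes());
        int rowSize = compressedSize(rows.toString().getBytes());
        System.out.println("column: " + colSize + " bytes, rows: " + rowSize + " bytes");
    }
}
```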

Questions

4

- How can columnar storage be incorporated into an existing MR system (Hadoop) without changing its core parts?

- How can columnar storage operate efficiently on top of a DFS (HDFS)?

- Is it easy to apply well-studied techniques from the database field to the MapReduce framework, given that it:

  - processes one tuple at a time;
  - does not use a restricted set of operators;
  - is used to process complex data types.

Challenges

5

- In Hadoop, it is often convenient to use complex types like arrays, maps, and nested records to model data, which leads to a high deserialization cost and a lack of effective column-oriented compression techniques.

- Serialization: data structure in memory → bytes that can be transmitted.

- Deserialization: bytes → data structure in memory.

(Since Hadoop is written in Java, this is more complex than in C++.)
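As a concrete, simplified illustration of these two steps, here is a plain-Java round trip. Hadoop's Writable types follow the same write/read pattern, but this sketch is stand-alone, not Hadoop code.

```java
import java.io.*;

// Sketch of the round trip described above:
// in-memory structure -> bytes (serialize), bytes -> structure (deserialize).
public class SerDeSketch {

    // Serialize a (name, age) record to bytes.
    static byte[] serialize(String name, int age) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(name); // length-prefixed string
        out.writeInt(age);
        out.flush();
        return bytes.toByteArray();
    }

    // Deserialize bytes back into the two fields.
    static Object[] deserialize(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return new Object[] { in.readUTF(), in.readInt() };
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = serialize("Joe", 23);
        Object[] fields = deserialize(wire);
        System.out.println(fields[0] + " / " + fields[1]);
    }
}
```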

Challenges

6

- Compression: although column data tends to be similar and to compress well, the complex types mean that some existing techniques cannot be applied to Hadoop.

- Programming API: some techniques are not feasible for hand-coded MapReduce functions.

Outline

- Column-Oriented Storage
- Lazy Tuple Construction
- Compression
- Experimental Evaluation
- Conclusions

7

Column-Oriented Storage in Hadoop

8

- Main idea: store each column of the dataset in a separate file.

- Problems:

  - How can we generate roughly equal-sized splits so that a job can be effectively parallelized over the cluster?
  - How do we make sure that the corresponding values from different columns in the dataset are co-located on the same node running the map task?

Column-Oriented Storage in Hadoop

9

Original dataset:

Name    Age   Info
Joe     23    "hobbies": {tennis}; "friends": {Ann, Nick}
David   32    "friends": {George}
John    45    "hobbies": {tennis, golf}
Smith   65    "hobbies": {swimming}; "friends": {Helen}

Horizontally partitioned into split-directories (/data/2013-03-26/ contains /data/2013-03-26/s1 and /data/2013-03-26/s2):

1st node, split s1:

Name    Age   Info
Joe     23    "hobbies": {tennis}; "friends": {Ann, Nick}
David   32    "friends": {George}

2nd node, split s2:

Name    Age   Info
John    45    "hobbies": {tennis, golf}
Smith   65    "hobbies": {swimming}; "friends": {Helen}

Within each split, every column goes to its own file:

s1:  Name {Joe, David};  Age {23, 32};  Info {"hobbies": {tennis}; "friends": {Ann, Nick}}, {"friends": {George}}
s2:  Name {John, Smith}; Age {45, 65};  Info {"hobbies": {tennis, golf}}, {"hobbies": {swimming}; "friends": {Helen}}

Introduce a new InputFormat/OutputFormat pair:

- ColumnInputFormat (CIF)
- ColumnOutputFormat (COF)

ColumnInputFormat vs. RCFile Format

10

- RCFile Format:

  - Avoids the replication and co-location problem.
  - Uses PAX instead of a true column-oriented format: all columns are packed into a single row-group per split.
  - Efficient I/O elimination becomes difficult.
  - Metadata needs additional space overhead.

- CIF:

  - Needs to tackle replication and co-location.
  - Efficient I/O elimination.
  - Consider adding a column to a dataset.

Replication and Co-location

11

Original dataset:

Name    Age   Info
Joe     23    "hobbies": {tennis}; "friends": {Ann, Nick}
David   32    "friends": {George}
John    45    "hobbies": {tennis, golf}
Smith   65    "hobbies": {swimming}; "friends": {Helen}

Column files for the first split:

Name: {Joe, David}
Age:  {23, 32}
Info: {"hobbies": {tennis}; "friends": {Ann, Nick}}, {"friends": {George}}

[Diagram: the default HDFS replication policy places the replicas of the Name, Age, and Info files on nodes A-D independently, so corresponding column files are not guaranteed to be co-located on the node running the map task.]

CPP:

- Introduce a new column placement policy (CPP).
- It can be assigned to "dfs.block.replicator.classname".
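In hdfs-site.xml, assigning the policy via the property the slide names might look like the following; the class name here is illustrative, not the paper's actual implementation class:

```
<!-- hdfs-site.xml: plug in a custom block placement policy.
     The class name below is illustrative. -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.example.ColumnPlacementPolicy</value>
</property>
```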

Example

12

ColumnInputFormat.setColumns(job, "Age", "Name")

Age column:  23, 32, 45, 30, 50
Name column: Joe, David, John, Mary, Ann

Map method:

    if (age < 35)
        return name

Records passing the predicate: (23, Joe), (32, David)

What if age > 35? Can we avoid reading and deserializing the name field?

Outline

- Column-Oriented Storage
- Lazy Tuple Construction
- Compression
- Experiments
- Conclusions

13

Lazy Tuple Construction

14

Deserialization of each record field is deferred to the point where it is actually accessed, i.e. when the get() methods are called: deserialize only those columns that are actually accessed in the map function.

Eager version:

    Mapper(NullWritable key, Record value)
    {
        String name;
        int age = value.get("age");
        if (age < 35)
            name = value.get("name");
    }

Lazy version (same mapper code, different record type):

    Mapper(NullWritable key, LazyRecord value)
    {
        String name;
        int age = value.get("age");
        if (age < 35)
            name = value.get("name");
    }
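A minimal stand-alone sketch of the idea (not the paper's implementation; the class and the deserialization counter are invented for illustration): each field stays as raw bytes until the first get() call.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy record construction: each field is held as raw bytes
// and deserialized only when get() is first called on it.
public class LazyRecordSketch {
    private final Map<String, byte[]> rawColumns = new HashMap<>();
    private final Map<String, Object> cache = new HashMap<>();
    int deserializations = 0; // exposed so the demo can count field decodes

    void putRaw(String column, byte[] bytes) { rawColumns.put(column, bytes); }

    // Deserialize a column only on first access.
    Object get(String column) {
        return cache.computeIfAbsent(column, c -> {
            deserializations++;
            byte[] raw = rawColumns.get(c);
            if (c.equals("age")) {
                return ByteBuffer.wrap(raw).getInt();
            }
            return new String(raw, StandardCharsets.UTF_8);
        });
    }

    public static void main(String[] args) {
        LazyRecordSketch value = new LazyRecordSketch();
        value.putRaw("age", ByteBuffer.allocate(4).putInt(45).array());
        value.putRaw("name", "John".getBytes(StandardCharsets.UTF_8));

        // Mirrors the mapper above: name is never deserialized when age >= 35.
        int age = (Integer) value.get("age");
        if (age < 35) {
            String name = (String) value.get("name");
        }
        System.out.println("deserialized fields: " + value.deserializations);
    }
}
```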

LazyRecord implements Record

15

[Diagram: for each column (e.g. name, age), the reader keeps a curPos pointer, a lastPos pointer, and skip information; lastPos = curPos once the column is actually read.]

Why do we need these? "Without the lastPos pointer, each nextRecord call would require all the columns to be deserialized to extract the length information needed to update their respective curPos pointers."

Skip List (Logical Behavior)

16

[Diagram: records R1, R2, ..., R100 stored in sequence, with skip pointers at two granularities: "skip 10" pointers hopping R1 → R10 → R20 → ... → R90, and a "skip 100" pointer jumping from R1 directly to R100.]

Example

17

    if (age < 35)
        return name

Skip bytes: the Age column (23, 39, 45, 30, ...) is read sequentially, while the Name column (Joe, Jane, David, ..., Mary, Ann, ..., John) stores skip-byte counts alongside the data: Skip10 = 1002 bytes for the first 10 rows, Skip100 = 9017 bytes for the next 100 rows, then Skip10 = 868 bytes for the following 10 rows. When the predicate fails, the reader skips the corresponding bytes in the Name column (jumping past records 0, 1, 2, ..., 102) instead of deserializing each name.
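The skip mechanism can be sketched in plain Java: a column of length-prefixed values plus a coarse index of byte offsets for every 10th record. The layout and names here are illustrative, not the paper's exact on-disk format.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of skip bytes: a column stored as length-prefixed values, plus a
// coarse "skip10" index recording the byte offset of every 10th record,
// so a reader can jump past records without deserializing them.
public class SkipColumnSketch {
    final byte[] data;
    final List<Integer> skip10 = new ArrayList<>(); // offsets of records 0, 10, 20, ...

    SkipColumnSketch(List<String> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int i = 0; i < values.size(); i++) {
            if (i % 10 == 0) skip10.add(bytes.size());
            byte[] v = values.get(i).getBytes(StandardCharsets.UTF_8);
            out.writeInt(v.length); // length prefix
            out.write(v);
        }
        out.flush();
        data = bytes.toByteArray();
    }

    // Read record i: jump via the skip index, then walk length prefixes only.
    String read(int i) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.skipBytes(skip10.get(i / 10));         // coarse jump to record (i/10)*10
        for (int j = (i / 10) * 10; j < i; j++) { // fine-grained skipping
            in.skipBytes(in.readInt());           // skip the value, no decode
        }
        byte[] v = new byte[in.readInt()];
        in.readFully(v);
        return new String(v, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 100; i++) names.add("name" + i);
        SkipColumnSketch col = new SkipColumnSketch(names);
        System.out.println(col.read(37)); // jumps to record 30, skips 7 values
    }
}
```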

Example

18

    if (age < 35)
        return hobbies

The same idea applies to a complex column: the Info column ("hobbies": tennis; "friends": Ann, Nick / null / "friends": George / "hobbies": tennis, golf / ...) stores skip-byte counts: Skip10 = 2013 bytes, Skip100 = 19400 bytes, then Skip10 = 1246 bytes. When the predicate on Age (23, 39, 45, 30, ...) fails, the reader skips whole serialized Info values without deserializing them.

Outline

- Column-Oriented Storage
- Lazy Record Construction
- Compression
- Experiments
- Conclusions

19

Compression

20

Compressed blocks:

- Each column file is split into blocks B1, B2, ..., each compressed with LZO/ZLIB.
- Each block header records the number of records in the block (e.g. B1 holds RIDs 0-9, B2 holds RIDs 10-35), so whole blocks can be skipped, but a block must be decompressed before any record in it is accessed.

Dictionary-compressed skip lists:

- A dictionary maps repeated keys to small codes ("hobbies" → 0, "friends" → 1), so a value like {"hobbies": {tennis, golf}} is stored as 0: {tennis, golf}.
- Skip-byte counts are kept as before (e.g. Skip10 = 210, Skip100 = 1709, Skip10 = 304), and values such as 0: {tennis}, 1: {Ann, Nick}, 1: {George} are decompressed on access.
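A minimal sketch of such a dictionary (illustrative, not the paper's code): repeated keys are mapped to small integer codes and decoded on access.

```java
import java.util.*;

// Sketch of lightweight dictionary compression for a complex column:
// repeated map keys such as "hobbies" and "friends" are replaced by
// small integer codes, matching the slide's dictionary.
public class DictionarySketch {
    final Map<String, Integer> dict = new LinkedHashMap<>();
    final List<String> reverse = new ArrayList<>();

    // Return the code for a key, assigning the next code on first sight.
    int encode(String key) {
        return dict.computeIfAbsent(key, k -> {
            reverse.add(k);
            return reverse.size() - 1;
        });
    }

    String decode(int code) { return reverse.get(code); }

    public static void main(String[] args) {
        DictionarySketch d = new DictionarySketch();
        // "hobbies" -> 0, "friends" -> 1, as in the slide's dictionary.
        System.out.println(d.encode("hobbies") + " " + d.encode("friends")
                + " " + d.encode("hobbies"));
        System.out.println(d.decode(0));
    }
}
```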

Outline

- Column-Oriented Storage
- Lazy Record Construction
- Compression
- Experiments
- Conclusions

21

RCFile

22

Original dataset:

Name    Age   Info
Joe     23    "hobbies": {tennis}; "friends": {Ann, Nick}
David   32    "friends": {George}
John    45    "hobbies": {tennis, golf}
Smith   65    "hobbies": {swimming}; "friends": {Helen}

RCFile layout (columns packed inside row-groups):

Row Group 1:
  Metadata
  Name: Joe, David
  Age:  23, 32
  Info: {"hobbies": {tennis}; "friends": {Ann, Nick}}, {"friends": {George}}

Row Group 2:
  Metadata
  Name: John, Smith
  Age:  45, 65
  Info: {"hobbies": {tennis, golf}}, {"hobbies": {swimming}; "friends": {Helen}}

Experimental Setup

23

- 42-node cluster
- Each node:
  - two quad-core 2.4 GHz sockets
  - 32 GB main memory
  - four 500 GB HDDs
- Network: 1 Gbit Ethernet switch

Overhead of Columnar Storage

24

[Chart: time in seconds (0-4000) for the single-node experiment.]

- Synthetic dataset: 57 GB, 13 columns (6 integers, 6 strings, 1 map)
- Query: SELECT *
- Single-node experiment

Benefits of Column-Oriented Storage

25

[Chart: time in seconds (0-2000) for CIF, compressed RCFile, and uncompressed RCFile when projecting different columns: all columns, 1 integer, 1 string, 1 map, 1 string / 1 map.]

- Query: projection of different columns
- Single-node experiment

Workload

26

Schema:

    URLInfo {
        String url
        String srcUrl
        time fetchTime
        String inlink[]
        Map <String, String[]> metadata
        Map <String, String> annotations
        byte[] content
    }

Query:

    if (url contains "ibm.com/jp")
        find all the distinct encodings reported by the page

- Dataset: 6.4 TB
- Query selectivity: 6%
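The benchmark query can be sketched as a plain-Java function standing in for the map function over URLInfo-like records; the "encoding" metadata key is an assumption, since the slide only says encodings are reported by the page.

```java
import java.util.*;

// Sketch of the benchmark query: for pages whose url contains
// "ibm.com/jp", collect the distinct encodings reported in the metadata
// map. Plain-Java stand-in for the MapReduce job, not the paper's code.
public class EncodingQuery {

    static Set<String> distinctEncodings(List<Map<String, Object>> records) {
        Set<String> encodings = new TreeSet<>();
        for (Map<String, Object> rec : records) {
            String url = (String) rec.get("url"); // only url is examined first
            if (url != null && url.contains("ibm.com/jp")) {
                @SuppressWarnings("unchecked")
                Map<String, String[]> metadata = (Map<String, String[]>) rec.get("metadata");
                String[] enc = metadata == null ? null : metadata.get("encoding");
                if (enc != null) encodings.addAll(Arrays.asList(enc));
            }
        }
        return encodings;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = new ArrayList<>();
        records.add(Map.of("url", "http://www.ibm.com/jp/page1",
                "metadata", Map.of("encoding", new String[] { "Shift_JIS" })));
        records.add(Map.of("url", "http://example.com",
                "metadata", Map.of("encoding", new String[] { "UTF-8" })));
        System.out.println(distinctEncodings(records)); // only the ibm.com/jp page counts
    }
}
```

With lazy record construction, only the url column would be deserialized for the 94% of records that fail the predicate.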

27

Comparison of Column-Layouts (Map phase)

SEQ: 754 sec

Speedup over SEQ (map phase):

Layout              Speedup
SEQ                 1
Compressed RCFile   3.7
CIF                 60.8
CIF-SL              81.9
CIF-DCSL            107.8

28

Comparison of Column-Layouts (Map phase)

Data read (GB):

Layout              Data Read (GB)
SEQ                 3040
Compressed RCFile   102
CIF                 96
CIF-SL              75
CIF-DCSL            61

Comparison of Column-Layouts (Total job)

29

SEQ: 806 sec

Speedup over SEQ (total job):

Layout              Speedup
SEQ                 1
Compressed RCFile   2.8
CIF                 10.3
CIF-SL              11.5
CIF-DCSL            12.8

Conclusions

- Described a new column-oriented binary storage format for MapReduce.
- Introduced the skip list layout.
- Described the implementation of lazy record construction.
- Showed that lightweight dictionary compression for complex columns can be beneficial.

30