
„Big Data"

Benczúr András

MTA SZTAKI

Big Data – Szeged – 23 March 2012

Big Data – the new hype

"Big data" is when the size of the data itself becomes part of the problem.

"Big data" is data that becomes large enough that it cannot be processed using conventional methods.



Google sorts 1 PB in 33 minutes (07-09-2011)

The Amazon S3 store contains 499 B objects (19-07-2011)

New Relic: 20 B+ application metrics/day (18-07-2011)

Walmart monitors 100 M entities in real time (12-09-2011)

Source: The Emerging Big Data slide from the Intelligent Information Management DG INFSO/E2 Objective ICT-2011.4.4 Info day in Luxembourg on 26 September 2011


Fast data

Big analytics

Big Data Services

Big Data Planes


Overview

Introduction
  Buzzwords
Part I: Background
  Examples
    Mobility and navigation traces
    Sensors, smart city
    IT logs
    Wind power
  Scientific and business relevance
Part II: Infrastructures
  NoSQL, key-value stores, Hadoop, H*, Pregel, …
Part III: Algorithms
  Brief history of algorithms
  Web processing, PageRank with algorithms
  Streaming algorithms
  Entity resolution – detailed comparison (QDB 2011)


Navigation and Mobility Traces

Streaming data at mobile base stations
Privacy issues
  Regulations let only anonymized data leave the operator beyond network operations and billing
  Do regulation policy makers know about deanonymization attacks?
  What your wife/husband will not know, your mobile provider will


Sensors

Smart home, city, country, …
  Road and parking slot sensors
  Mobile parking traces
  Public transport, Oyster cards
  Bike hire schemes

Source: Internet of Things Comic Book, http://www.smartsantander.eu/images/IoT_Comic_Book.pdf



… even agriculture




… and wind power stations


Corporate IT log processing

Our experience: 30-100+ GB/day, 3-60 M events
Aggregation into a data warehouse
Identify bottlenecks
Optimize procedures
Detect misuse, fraud, attacks
Traditional methods fail


Scientific and business relevance

VLDB 2011 (~100 papers):
  6 papers on MapReduce/Hadoop, 10 on big data (+ keynote), 11 on NoSQL architectures, 6 on GPS/sensory data
  tutorials, demos (Microsoft, SAP, IBM NoSQL tools)
  session: Big Data Analysis, MapReduce, Scalable Infrastructures
EWEA 2011: 28% of papers on wind power raise data size issues
SIGMOD 2011: out of 70 papers, 10 on new architectures and extensions for analytics
Gartner 2011 trend No. 5: Next Generation Analytics – "significant changes to existing operational and business intelligence infrastructures"
The Economist, 27 February 2010: "Monstrous amounts of data … Information is transforming traditional businesses"
News special issue on Big Data this April


New challenges in database technologies

Question of research and practice:
  Applicability to a specific problem?
  Applicability as a general technique?


Overview

Part I: Background
  Examples
  Scientific and business relevance
Part II: Infrastructures
  NoSQL
  Key-value stores
  Hadoop and tools built on Hadoop
  Bulk Synchronous Parallel, Pregel
  Streaming, S4
Part III: Algorithms, Examples


Now come many external slide shows …

NoSQL introduction
  www.intertech.com/resource/usergroup/NoSQL.ppt
Key-value stores
  BerkeleyDB – not distributed
  Voldemort
    behemoth.strlen.net/~alex/voldemort-nosql_live.ppt
  Cassandra, Dynamo, …
  Also available on top of Hadoop (below): HBase
Hadoop
  Miki Erdélyi's slides
HBase
  datasearch.ruc.edu.cn/course/cloudcomputing20102/slides/Lec07.ppt
Cascading – will not be covered
Mahout
  cwiki.apache.org/MAHOUT/faq.data/Mahout%20Overview.ppt
Why do we need something else? What else do we need?
Bulk Synchronous Parallel
  GraphLab – Danny Bickson's slides
MOA
  http://www.slideshare.net/abifet/moa-5636332/download
Streaming
  S4
  http://www.slideshare.net/alekbr/s4-stream-computing-platform



Bulk Synchronous Parallel architecture

HAMA: a Pregel clone


Use of large matrices

Main step in all distributed algorithms
  Network-based features in classification
  Partitioning for efficient algorithms
  Exploring the data, navigation (e.g. ranking to select a nice compact subgraph)
Hadoop apps (e.g. PageRank) move the entire data around in each iteration
Baseline C++ code keeps data local

[Figure: running time comparison of Hadoop, Hadoop + key-value store, and the best custom C++ code]


BSP vs. MapReduce

MapReduce: data locality is not preserved between Map and Reduce invocations or MapReduce iterations.
BSP: tailored towards processing data with locality.
  Proprietary: Google Pregel
  Open-source: HAMA (several flaws for now)
  Home-developed C++ code base
Both: easy parallelization and distribution.

Sidló et al., Infrastructures and bounds for distributed entity resolution. QDB 2011


Overview

Part I: Background
Part II: Infrastructures
  NoSQL, key-value store, Hadoop, H*, Pregel, …
Part III: Algorithms, Examples, Comparison
  Data and computation intense tasks, architectures
  History of algorithms
  Web processing, PageRank with algorithms
  Entity resolution – detailed, with algorithms
Summary, conclusions


Types of Big Data problems

Data intense
  Web processing, info retrieval, classification
  Log processing (telco, IT, supermarket, …)
Compute intense
  Expectation Maximization, Gaussian mixture decomposition, image retrieval, …
  Genome matching, phylogenetic trees, …
Data AND compute intense
  Network (Web, friendship, …) partitioning, finding similarities, centers, hubs, …
  Singular value decomposition


Hardware

Data intense:
  MapReduce (Hadoop), cloud, …
Compute intense:
  Shared memory
  Message passing
  Processor arrays, … – recently became an affordable choice, as graphics co-processors!
Data AND compute intense??


Big data: Why now?

Hardware is just getting better and cheaper?
But data is getting larger and easier to access
Bad news for algorithms slower than ~linear


Moore's Law: doubling in 18 months

But in a key aspect, the trend has changed!
From speed to number of cores


"Numbers Everyone Should Know" (Jeff Dean, Google)

RAM
  L1 cache reference: 0.5 ns
  L2 cache reference: 7 ns
  Main memory reference: 100 ns
  Read 1 MB sequentially from memory: 250,000 ns
Intra-process communication
  Mutex lock/unlock: 100 ns
  Read 1 MB sequentially from network: 10,000,000 ns
Disk
  Disk seek: 10,000,000 ns
  Read 1 MB sequentially from disk: 30,000,000 ns

Typical capacities
  Disk: 10+ TB
  RAM: 100+ GB
  CPU: L2 1+ MB, L1 10+ KB
  GPU onboard memory: global 4-8 GB, block shared 10+ KB

(a back-of-the-envelope check of these numbers follows below)
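A back-of-the-envelope check of these numbers, reading a dataset sequentially from each medium (a minimal sketch; the 1 TB dataset size is an illustrative assumption, not a figure from the slides):

```python
# Back-of-the-envelope: time to read 1 TB sequentially, using the
# per-MB latencies quoted above (Jeff Dean's numbers).
NS_PER_MB = {
    "memory":  250_000,      # 0.25 ms per MB
    "network": 10_000_000,   # 10 ms per MB
    "disk":    30_000_000,   # 30 ms per MB
}

data_mb = 1_000_000  # 1 TB expressed in MB (illustrative size)

for medium, ns in NS_PER_MB.items():
    seconds = data_mb * ns / 1e9
    print(f"{medium:7s}: {seconds:8.0f} s  (~{seconds/3600:.1f} h)")

# memory: 250 s (~0.1 h), network: 10,000 s (~2.8 h), disk: 30,000 s (~8.3 h)
# This is why Big Data systems try hard to keep computation close to the data.
```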


Back to Databases, this means …

[Figure: transactions per second vs. number of CPUs, comparing the ideal linear speed-up with the actual sub-linear speed-up; e.g. 1000/sec at 5 CPUs and 2000/sec at 10 CPUs on the linear curve, but only 1600/sec at 16 CPUs on the sub-linear curve]

Cost
Security
Integrity control more difficult
Lack of standards
Lack of experience
Complexity of management and control
Increased storage requirements
Increased training cost

Read 1 MB sequentially from memory: 250,000 ns; from network: 10,000,000 ns; from disk: 30,000,000 ns

[Diagram: shared-nothing nodes, each with its own memory and CPU, vs. a shared-memory machine with one memory and many CPUs]

Connolly, Begg: Database systems: a practical approach to design, implementation, and management. International Computer Science Series, Pearson Education, 2005




"The brief history of Algorithms"

P, NP
PRAM theoretic models
Thinking Machines: hypercube
Cray: vector processors
SIMD, MIMD, message passing
MapReduce (Google)
Multi-core
Many-core
Cloud
Flash disk
External memory algorithms
CM-5: many vector processors



Earliest history: P, NP

P: graph traversal, spanning tree
NP: Steiner trees

[Figure: an edge-weighted example graph illustrating a spanning tree vs. a Steiner tree]


Why do we care about graphs, trees?

Entity resolution example:

  name        e-mail              ID
  Mary Smith  m.smith@mail-1.com  50071
  M. Doe      mary@mail-2.com     79216
  Mary Doe    mary@mail-2.com     50071
  M. Smith    m.smith@mail-1.com  34302

Image segmentation
Entity resolution


History of algs: spanning trees in parallel

Iterative minimum spanning forest:
  every node is a tree at start; every iteration merges trees (see the sketch below)
Bentley: A parallel algorithm for constructing minimum spanning trees, 1980
Harish et al.: Fast Minimum Spanning Tree for Large Graphs on the GPU, 2009

[Figure: an 8-node example graph]
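The iterative merging described above is essentially Borůvka's minimum spanning forest algorithm. A minimal single-machine sketch (the 5-node example graph is hypothetical):

```python
# Minimal Borůvka-style minimum spanning forest: every node starts as its
# own tree (component); each round, every component picks its cheapest
# outgoing edge and the chosen edges merge components.
def boruvka_msf(n, edges):                      # edges: (weight, u, v)
    parent = list(range(n))

    def find(x):                                # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    merged = True
    while merged:
        merged = False
        cheapest = {}                           # component -> best outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv                 # merge the two trees
                forest.append((u, v, w))
                merged = True
    return forest

# Hypothetical 5-node example graph:
print(boruvka_msf(5, [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 2, 3), (5, 3, 4)]))
# 4 edges, total weight 11
```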


Overview

Part I: Background
Part II: Infrastructures
  NoSQL, key-value store, Hadoop, H*, Pregel, …
Part III: Algorithms, Examples, Comparison
  Data and computation intense tasks, architectures
  History of algorithms
  Web processing, PageRank with algorithms
  Streaming algorithms
  Entity resolution – detailed, with algorithms
Summary, conclusions


The Web is about data too

Posted by John Klossner on Aug 03, 2009

WEB 1.0 (browsers): Users find data
WEB 2.0 (social networks): Users find each other
WEB 3.0 (semantic Web): Data find each other
WEB 4.0: Data create their own Facebook page, restrict friends.
WEB 5.0: Data decide they can work without humans, create their own language.
WEB 6.0: Human users realize that they no longer can find data unless invited by data.
WEB 7.0: Data get cheaper cell phone rates.
WEB 8.0: Data horde all the good YouTube videos, leaving human users with access to bad '80s music videos only.
WEB 9.0: Data create and maintain own blogs, are more popular than human blogs.
WEB 10.0: All episodes of Battlestar Galactica will now be shown from the Cylons' point of view.

Big Data interpretation: recommenders, personalization, info extraction


Longitudinal Analytics of Web Archive Data

Building a Virtual Web Observatory on large temporal data of Internet archives


Partner approaches to hardware

Hanzo Archives (UK): Amazon EC2 cloud + S3
Internet Memory Foundation: 50 low-end servers
We: indexing 3 TB compressed, 0.5 B pages
  Open source tools not yet mature
  One week of processing on 50 old dual cores
  Hardware worth approx. 10,000; Amazon price around 5,000


Text REtrieval Conference measurement

Documents stored in HBase tables over the Hadoop file system (HDFS)
Indexing:
  200 instances of our own C++ search engine
  40 Lucene instances, then the top 50,000 hits with our own search engine
  SolR? Katta? Do they really work in real time? Ranking?
Realistic: even spam is important!
Spam:
  Obvious parallelization: each node processes all pages of one host
  Link features (e.g. PageRank) cannot be computed in this way

M. Erdélyi, A. Garzó, and A. A. Benczúr: Web spam classification: a few features worth more (WebQuality 2011)


Distributed storage: HBase vs WARC files

o WARC
  o Many, many medium-sized files: very inefficient with Hadoop
  o Either huge block size, wasting space
  o Or data locality lost, as blocks may continue at a non-local HDFS node
o HBase
  o Data locality preserving ranges – cooperation with Hadoop
  o Experiments up to 3 TB compressed Web data
o WARC to HBase
  o One-time expensive step, no data locality
  o One-by-one inserts fail, very low performance
  o MapReduce jobs to create HFiles, the native HBase format
  o HFile transfer
  o HBase insertion rate 100,000 per hour


The Random Surfer Model

Nodes = Web pages
Edges = hyperlinks
Starts at a random page, arrives at a quality page


PageRank: The Random Surfer Model

Chooses a random neighbor with probability 1 − ε




The Random Surfer Model

Or, with probability ε, teleports to a random page
(gets bored and types a new URL)


The Random Surfer Model

And continues with the random walk …




The Random Surfer Model

Until convergence … ? [Brin, Page 98]


PageRank as Quality

A quality page is pointed to by several quality pages

PR^(k+1) = PR^(k) ((1 − ε) M + ε U) = PR^(1) ((1 − ε) M + ε U)^k

(a power-iteration sketch follows below)
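A minimal power-iteration sketch of this recursion (dense, single machine; the 3-page link matrix and the teleport probability ε = 0.15 are illustrative assumptions):

```python
import numpy as np

# Power iteration for PR^(k+1) = PR^(k) ((1 - eps) M + eps U)
# M: row-stochastic hyperlink matrix, U: uniform teleport matrix.
def pagerank(M, eps=0.15, iters=50):
    n = M.shape[0]
    U = np.full((n, n), 1.0 / n)
    G = (1 - eps) * M + eps * U        # the combined transition matrix
    pr = np.full(n, 1.0 / n)           # PR^(1): uniform start
    for _ in range(iters):
        pr = pr @ G                    # one random-surfer step
    return pr

# Hypothetical 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
M = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])
print(pagerank(M))                     # scores sum to 1
```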


Personalized PageRank

Or, with probability ε, teleports to a random page selected from her bookmarks


Algorithmics

Estimated 10+ billion Web pages worldwide
PageRank (as floats) fits into 40 GB storage (10 billion pages × 4 bytes)
Personalization just to single pages:
  10 billion PageRank scores for each page
  Storage exceeds several exabytes (10 billion × 10 billion × 4 bytes ≈ 400 EB)!
NB single-page personalization is enough: personalized PageRank for any teleport distribution is a linear combination of the single-page vectors




"For light to reach the other side of the Galaxy … takes rather longer: five hundred thousand years. The record for hitch hiking this distance is just under five years, but you don't get to see much on the way."

D. Adams, The Hitchhiker's Guide to the Galaxy, 1979

For certain things are just too big?



Markov Chain Monte Carlo

Reformulation by simple tricks of linear algebra
From u, simulate N independent random walks
Database of fingerprints: ending vertices of the walks from all vertices
Query: PPR(u,v) := # (walks u → v) / N
N ≈ 1,000 approximates the top 100 well
(a minimal simulation sketch follows below)

Fogaras-Rácz: Towards Scaling Fully Personalized PageRank, WAW 2004
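A minimal sketch of the fingerprint simulation (single machine, adjacency lists in a plain dict; the 4-page graph and ε = 0.15 are hypothetical assumptions, not parameters from the cited paper):

```python
import random
from collections import Counter

# Monte Carlo personalized PageRank: from u, run N independent random walks
# that stop with probability eps at each step; the walk's end vertex is a
# fingerprint, and PPR(u, v) is estimated as the fraction of walks ending at v.
def ppr_fingerprints(graph, u, N=1000, eps=0.15):
    ends = Counter()
    for _ in range(N):
        v = u
        while random.random() > eps:           # keep walking with prob. 1 - eps
            out = graph.get(v)
            if not out:                         # dangling page: restart at u
                v = u
            else:
                v = random.choice(out)
        ends[v] += 1
    return {w: c / N for w, c in ends.items()}  # estimated PPR(u, w)

# Hypothetical 4-page graph
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
print(ppr_fingerprints(graph, 0))
```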


SimRank: similarity in graphs

"Two pages are similar if pointed to by similar pages" [Jeh, Widom, KDD 2002]

Same trick: path pair summation (can be sampled [Fogaras, Rácz WWW 2005]) over
  u = w_0, w_1, …, w_{k−1}, w_k = v_2
  u = w'_0, w'_1, …, w'_{k−1}, w'_k = v_1
(a sampling sketch follows below)

DB application, e.g.: Yin, Han, Yu. LinkClus: efficient clustering via heterogeneous semantic links, VLDB '06
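A minimal sketch of the sampling idea: estimate the similarity of v1 and v2 from coupled reverse random walks, crediting C^t when they first meet after t steps (the tiny graph, the walk cutoff and the decay constant C = 0.8 are illustrative assumptions, not values from the slides):

```python
import random

# Sampled SimRank: run paired reverse random walks from v1 and v2 on the
# reversed graph; if they first meet after t steps, that sample contributes C**t.
def simrank_sample(in_neighbors, v1, v2, N=10000, C=0.8, max_len=10):
    total = 0.0
    for _ in range(N):
        a, b = v1, v2
        for t in range(1, max_len + 1):
            if not in_neighbors.get(a) or not in_neighbors.get(b):
                break                       # a node with no in-links: contributes 0
            a = random.choice(in_neighbors[a])
            b = random.choice(in_neighbors[b])
            if a == b:                      # the two reverse walks meet
                total += C ** t
                break
    return total / N

# Hypothetical graph given by in-neighbor lists (who points to whom)
in_neighbors = {0: [2, 3], 1: [2, 3], 2: [], 3: []}
print(simrank_sample(in_neighbors, 0, 1))   # 0 and 1 share all in-links; ~ C/2 = 0.4
```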


Communication complexity bounding

Bit-vector probing (BVP):
  Alice's input: a bit vector x = (x_1, x_2, …, x_m)
  Bob's input: a number 1 ≤ k ≤ m
  After B bits of communication, Bob must output x_k
  Theorem: B ≥ m for any protocol

Reduction from BVP to Exact-PPR-compare:
  Alice has x = (x_1, x_2, …, x_m); she encodes it as a graph G with V vertices, where V^2 = m, and pre-computes an exact PPR database of size D
  Communication: the exact PPR database, D bits
  Bob has 1 ≤ k ≤ m; by comparing PPR(u,v) ? PPR(u,w) for suitable vertices u, v, w he recovers x_k
  Thus D = B ≥ m = V^2


Theory of Streaming algorithms

Distinct values example
  Motwani slides
Sequential, RAM algorithms
External memory algorithms
Sampling: negative result
"Sketching" technique
(a sketch example follows below)
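A minimal one-pass sketch for the distinct-values problem in the Flajolet-Martin / LogLog spirit (the use of Python's built-in hash, the 64 hash functions, and the missing bias correction are simplifying assumptions):

```python
import random

# One-pass distinct-count sketch: hash every item and track the maximum number
# of trailing zero bits seen per hash function; 2**(average max) gives a rough
# estimate of the number of distinct values using O(log n) memory per hash
# instead of storing the full set of values.
def distinct_estimate(stream, num_hashes=64):
    seeds = [random.getrandbits(64) for _ in range(num_hashes)]
    max_zeros = [0] * num_hashes
    for item in stream:
        for i, seed in enumerate(seeds):
            h = hash((seed, item)) & 0xFFFFFFFF
            z = (h & -h).bit_length() - 1 if h else 32  # trailing zeros of h
            if z > max_zeros[i]:
                max_zeros[i] = z
    # rough estimate; real LogLog / HyperLogLog add a bias-correction constant
    return 2 ** (sum(max_zeros) / num_hashes)

stream = [random.randrange(10_000) for _ in range(100_000)]
print(len(set(stream)), round(distinct_estimate(stream)))
```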


Overview

Part I: Background
  Examples
  Scientific and business relevance
Part II: Foundations Illustrated
  Data and computation intense tasks, architectures
  History of algorithms
  Web processing, PageRank with algorithms
  Streaming algorithms
  Entity resolution – detailed, with algorithms
Summary, conclusions


Distributed Computing Paradigms and Tools

Distributed key-value stores:
  distributed B-tree index for all attributes
  Project Voldemort
MapReduce:
  map → reduce operations
  Apache Hadoop
Bulk Synchronous Parallel:
  supersteps: computation → communication → barrier sync
  Apache Hama
(a minimal MapReduce-style sketch follows below)
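A minimal local illustration of the map → reduce pattern, counting records per attribute value (pure Python, no Hadoop; the sample records are hypothetical and the map_fn/reduce_fn names are mine):

```python
from collections import defaultdict
from itertools import chain

# Local illustration of the MapReduce pattern:
#   map:     record -> (key, value) pairs
#   shuffle: group all values by key
#   reduce:  (key, values) -> output
def map_fn(record):
    name, email, zip_code = record
    yield ("zip:" + zip_code, 1)                 # count records per ZIP code

def reduce_fn(key, values):
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        groups[key].append(value)                # the "shuffle" phase
    return [reduce_fn(k, vs) for k, vs in groups.items()]

records = [("Mary Smith", "m.smith@mail-1.com", "50071"),
           ("Mary Doe", "mary@mail-2.com", "50071"),
           ("M. Smith", "m.smith@mail-1.com", "34302")]
print(run_mapreduce(records, map_fn, reduce_fn))
# -> [('zip:50071', 2), ('zip:34302', 1)]
```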


Entity resolution

       A1                A2           A3
  r1:  Mary Major        09.12.1979   50071  ...   -> e1
  r2:  J. Doe            23.04.1965   79216  ...   -> e2
  r3:  John Doe          23.04.1965   79216  ...   -> e2
  r4:  Richard Miles     31.09.1980   34302  ...   -> e3
  r5:  Richard G. Miles  21.09.1980   34302  ...   -> e3

Records with attributes: r1, …, r5
Entities formed by sets of records: e1, e2, e3
Entity Resolution problem = partition the records into customers, …
(a minimal matching sketch follows below)
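A minimal sketch of resolving these records by single-linkage over shared attribute values: union-find links any two records that agree on some attribute, and the resulting connected components are the entities (a simplification; a real ER system would use weighted, approximate attribute similarity):

```python
from collections import defaultdict

# Group records into entities: records sharing an attribute value are linked,
# and connected components of this record graph become the entities.
def resolve(records):                      # records: list of attribute tuples
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_value = defaultdict(list)           # (attribute index, value) -> record ids
    for rid, rec in enumerate(records):
        for col, val in enumerate(rec):
            by_value[(col, val)].append(rid)
    for rids in by_value.values():         # union all records sharing a value
        for other in rids[1:]:
            parent[find(rids[0])] = find(other)

    entities = defaultdict(list)
    for rid in range(len(records)):
        entities[find(rid)].append(rid)
    return list(entities.values())

records = [("Mary Major", "09.12.1979", "50071"),
           ("J. Doe", "23.04.1965", "79216"),
           ("John Doe", "23.04.1965", "79216"),
           ("Richard Miles", "31.09.1980", "34302"),
           ("Richard G. Miles", "21.09.1980", "34302")]
print(resolve(records))                    # -> [[0], [1, 2], [3, 4]]
```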


A communication complexity lower bound

Set intersection cannot be decided by communicating less than Θ(n) bits [Kalyanasundaram, Schnitger 1992]
Implication: if data is spread over multiple servers, one needs Θ(n) communication to decide whether it may have a duplicate with another node
The best we can do is communicate all data
Related area: Locality Sensitive Hashing
  no LSH for "minimum", i.e. to decide if two attributes agree
  similar to negative results on Donoho's zero "norm" (number of non-zero coordinates)


Wait – how about blocking?

Blocking speeds up shared-memory parallel algorithms
  [many in the literature, e.g. Whang, Menestrina, Koutrika, Theobald, Garcia-Molina. ER with Iterative Blocking, 2009]
Even if we could partition the data with no duplicates split, the lower bound applies just to test that we are done
Still, blocking is a good idea
  we may have to communicate much more than Θ(n) bits


Distributed Key-Value Store (KVS)

The record graph can be served from the KVS
KVS mainly provides random access to a huge graph that would not fit in main memory
Computing nodes, many indexing nodes
Implement a graph traversal:
  Breadth-First Search (BFS) with queues
  spanning forest with Union-Find
  basic textbook algorithms [Cormen-Leiserson-Rivest, …]
(a minimal BFS-over-KVS sketch follows below)
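A minimal sketch of BFS where adjacency lists are fetched through a key-value get() interface instead of residing in local memory (the InMemoryKVS class is a stand-in for a real distributed store such as Voldemort or HBase):

```python
from collections import deque

# Stand-in for a distributed key-value store: key = vertex id,
# value = its adjacency list. A real store would answer get() over the network.
class InMemoryKVS:
    def __init__(self, table):
        self.table = table
    def get(self, key):
        return self.table.get(key, [])

# Textbook BFS, except every neighbor lookup is a KVS get(); the frontier queue
# and the visited set are all the local state a computing node needs.
def bfs(kvs, source):
    visited = {source}
    queue = deque([source])
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in kvs.get(v):               # random access into the huge graph
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

kvs = InMemoryKVS({0: [1, 2], 1: [3], 2: [3], 3: []})
print(bfs(kvs, 0))                          # -> [0, 1, 2, 3]
```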


MapReduce (MR)

Sorting as the prime MR app
  For each feature, sort to form a graph of records
MR has no data locality
  All data is moved around the network in each iteration
Find connected components of the graph
  Again the textbook matrix power method can be implemented in MR
  Iterated matrix multiplication is not MR-friendly [Kang, Tsourakakis, Faloutsos. Pegasus framework, 2009]
  This step will move huge amounts around: the whole data, even if we have only small components
(a minimal label-propagation sketch follows below)
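A minimal local sketch of connected components by iterated label propagation, organized as MapReduce-style rounds; every round re-emits all edges, mimicking the data movement criticized above (this illustrates the pattern only, it is not the Pegasus implementation):

```python
from collections import defaultdict

# Connected components as repeated MapReduce rounds: each vertex carries a
# label (initially its own id); each round every edge is re-emitted so both
# endpoints can adopt the smaller label. Note the whole edge set is "shuffled"
# in every iteration, which is exactly the data movement criticized above.
def connected_components(vertices, edges):
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        # map: emit candidate labels across every edge
        inbox = defaultdict(list)
        for u, v in edges:
            inbox[u].append(label[v])
            inbox[v].append(label[u])
        # reduce: each vertex keeps the minimum label it has seen
        for v, candidates in inbox.items():
            best = min(candidates)
            if best < label[v]:
                label[v] = best
                changed = True
    return label

print(connected_components([0, 1, 2, 3, 4], [(0, 1), (1, 2), (3, 4)]))
# -> {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```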


Bulk Synchronous Parallel (BSP)

One master node + several processing nodes perform a merge sort
Algorithm:
  resolve local data at each node locally
  send attribute values in sorted order
  central server identifies and sends back candidates
  find connected components
    this is the prominent BSP app, very fast
(a minimal superstep sketch follows below)
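A minimal sketch of the superstep structure: local computation, then communication to the master, then a barrier. The partitioning, record format, and matching rule are hypothetical illustrations, not the actual QDB 2011 system:

```python
from collections import defaultdict

# Superstep 1 (computation): each worker sorts the attribute values of its
# local records. Superstep 2 (communication + barrier): all workers send the
# sorted values to the master, which identifies values seen on more than one
# worker and sends those back as cross-partition candidate matches.
def superstep_local(partition):
    return sorted({value for record in partition for value in record})

def superstep_master(messages):             # messages: worker id -> sorted values
    seen = defaultdict(set)
    for worker, values in messages.items():
        for value in values:
            seen[value].add(worker)
    return {value: workers for value, workers in seen.items() if len(workers) > 1}

partitions = {
    "worker0": [("J. Doe", "79216"), ("Mary Major", "50071")],
    "worker1": [("John Doe", "79216"), ("Richard Miles", "34302")],
}
messages = {w: superstep_local(p) for w, p in partitions.items()}   # barrier here
print(superstep_master(messages))
# -> {'79216': {'worker0', 'worker1'}}  candidates to resolve across workers
```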


Experiments: scalability

15 older blade servers, 4 GB memory, 3 GHz CPU each
Insurance client dataset (~2 records per entity)




Experiments: scalability

15 older blade servers, 4 GB memory, 3 GHz CPU each
Insurance client dataset (~2 records per entity)

[Figures: running time broken down into Hadoop phases and HAMA phases]


Conclusions

Big Data is founded on several subfields
  Architectures – processor arrays, many-core affordable
  Algorithms – design principles from the '90s
  Databases – distributed, column oriented, NoSQL
  Data mining, information retrieval, machine learning, networks – for the top application
Hadoop and stream processing are the two main efficient techniques
  Limitations for data AND compute intense problems
  Many emerging alternatives (e.g. BSP)
Selecting the right architecture is often the question



Questions?

András Benczúr
Head, Informatics Laboratory
http://datamining.sztaki.hu/
Institute for Computer Science and Control, Hungarian Academy of Sciences
Email: benczur@sztaki.hu