Authors: Robson L.F. Cordeiro et al., SIGKDD 2011. Presented by Shih-Feng Sun


How to cluster a very large dataset of moderate-to-high dimensional elements?


Currently, no subspace clustering algorithm is able to handle very large datasets in feasible time.


For data that do not fit even on a single disk, parallelism is mandatory.



Major problems for clustering very large datasets with MapReduce:

How to minimize the I/O cost

How to minimize the network cost among processing nodes


Parallel Clustering -- ParC

Sample-and-Ignore -- SnI

Mappers emit <key, point> pairs; the plug-in clustering method runs inside the reducers and emits <key, cluster_desc> pairs (see the sketch below).
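As a rough illustration of this map/reduce flow, here is a minimal Hadoop sketch. It assumes text input where each line is one serialized point; the hash-based partition key, the NUM_PARTITIONS constant, and the runPlugInClustering helper are hypothetical stand-ins for ParC's actual data-space partitioning and for the plug-in method (e.g. MrCC in the paper).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: route each point to a reducer, emitting <key, point>.
public class ParCMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final int NUM_PARTITIONS = 16; // assumed partition count

    @Override
    protected void map(LongWritable offset, Text point, Context context)
            throws IOException, InterruptedException {
        // Illustrative key: hash the serialized point into a partition.
        // The real ParC partitions the data space, not a hash of the record.
        int key = Math.abs(point.toString().hashCode()) % NUM_PARTITIONS;
        context.write(new IntWritable(key), point);
    }
}

// Reduce phase: run the plug-in clustering method on one partition's
// points and emit <key, cluster_desc>.
class ParCReducer
        extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable key, Iterable<Text> points,
            Context context) throws IOException, InterruptedException {
        // Hypothetical hook: any serial subspace clustering algorithm
        // (such as MrCC) can be plugged in here.
        String clusterDesc = runPlugInClustering(points);
        context.write(key, new Text(clusterDesc));
    }

    private String runPlugInClustering(Iterable<Text> points) {
        // Toy stand-in: report only the number of points received.
        long n = 0;
        for (Text ignored : points) n++;
        return "cluster_desc(points=" + n + ")";
    }
}

SnI differs mainly in that a first pass clusters a small sample, so that later passes can ignore points already covered by the sampled clusters and ship far less data over the network.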


Symbol          Def.
m               Number of mappers
r               Number of reducers
D_s             Disk transfer rate
D_r             Dispersion ratio (0.5)
N_s             Network transfer rate
F_s             Database file size
S_r             Sampling ratio
R_r             Reduction ratio (0.1)
start_up_cost   Tasks' startup time
plug_in_cost    Clustering method's running time


The Best of both Worlds (BoW) method picks between ParC and SnI using a cost model over the symbols above (see the sketch below).
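A minimal sketch of the kind of cost-based choice BoW makes, using the symbols from the table above. The formulas and the default values below are illustrative assumptions capturing the trade-off (SnI pays extra startup and a second pass over the disk to cut network traffic), not the paper's exact cost equations.

public final class BowCostModel {

    // Symbols from the table above (illustrative default values).
    static double Ds = 100e6;          // disk transfer rate (bytes/s)
    static double Ns = 12.5e6;         // network transfer rate (bytes/s)
    static double Dr = 0.5;            // dispersion ratio
    static double Rr = 0.1;            // reduction ratio
    static double Sr = 0.01;           // sampling ratio
    static double startUpCost = 10.0;  // per-task startup time (s)
    static double plugInCost = 120.0;  // plug-in clustering running time (s)

    // Illustrative ParC cost: one job reads the whole file from disk and
    // shuffles a D_r fraction of it across the network.
    static double costParC(double Fs, int m, int r) {
        return startUpCost * (m + r)
                + Fs / Ds              // read input from disk
                + Dr * Fs / Ns         // shuffle data to reducers
                + plugInCost;          // serial clustering in reducers
    }

    // Illustrative SnI cost: a sampling job ships only an S_r fraction,
    // then a second pass ships only the points not ignored (R_r fraction).
    static double costSnI(double Fs, int m, int r) {
        return startUpCost * (2 * m + r + 1)
                + 2 * Fs / Ds          // two passes over the input
                + (Sr + Rr) * Fs / Ns  // sampled + non-ignored traffic
                + 2 * plugInCost;      // cluster the sample, then the rest
    }

    public static void main(String[] args) {
        double Fs = 0.2e12; // 0.2 TB database file, as in the experiments
        int m = 128, r = 128;
        double parc = costParC(Fs, m, r), sni = costSnI(Fs, m, r);
        // BoW runs whichever variant the model predicts to be cheaper.
        System.out.printf("ParC: %.0fs  SnI: %.0fs  ->  pick %s%n",
                parc, sni, parc <= sni ? "ParC" : "SnI");
    }
}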


All implementations are done using Hadoop, an open-source implementation of the MapReduce framework.


The density-based subspace clustering algorithm MrCC is applied as the plug-in method in the reducers.


Quality may decrease for small datasets when using a large number of reducers, because each reducer receives too little data to represent the data patterns.


It takes about 8 minutes to cluster the full dataset, which contains 1.4 billion points and is 0.2 TB in size.


The plug-in takes 43% of the total time, which is the main bottleneck.



The proposed BoW method can balance the disk delay and the network delay in the MapReduce framework.


In the experiment, YahooEig is the largest real dataset ever reported in the database subspace clustering literature.


Thank you!

Q & A