PARSIMONY: Scalable Parallel Data Mining

25 Nov 2013

Alok Choudhary

choudhar@ece.nwu.edu
1

Northwestern University

PARSIMONY: Scalable Parallel Data Mining

Alok N. Choudhary

Northwestern University

(ACK: Harsha Nagesh (Bell Labs) and Sanjay Goil (Sun))





Outline


Overview of PARSIMONY


MAFIA (Subspace clustering)


pMAFIA (Parallel Subspace Clustering)


Performance Results


PARSIMONY


multidimensional data analysis


parallel classification


Summary


Overview of Knowledge Discovery Process


PARSIMONY Overview


MAFIA: Subspace Clustering for High-Dimensional Data Sets


Clustering and subspace clustering


Base MAFIA Algorithm


pMAFIA (parallelization)


Performance Results



Clustering Multi-dimensional Data Sets

Determine range of attributes in each dimension of the cluster(s)


Issues to be addressed


Basic algorithm computational optimizations


Scalability with database size (out-of-core data sets)


Scalability with the dimensionality of data


Efficient Parallelization


Recognition of arbitrarily shaped clusters




Related Work


Partition-based clustering:

A user-specified number k of representative points are taken as cluster centers, and points are assigned to these cluster centers.

k-means, k-medoids, CLARANS (VLDB 94), BIRCH (SIGMOD 96), ...

Treats clustering as a partitioning of the points.

Hierarchical clustering:

Each point starts as its own cluster; similar points are merged together gradually.

CURE (SIGMOD 98), which uses sampling.

Categorical clustering:

Clustering of categorical data, e.g. automobile sales data: color, year, model, price, etc.

Best suited for non-numerical data.

CACTUS (KDD 99), STIRR (VLDB 98)



Related Work


Density- and grid-based clustering:

Clusters are regions of higher density than their surroundings.

WaveCluster (VLDB 98), DBSCAN, CLIQUE (SIGMOD 98)

The number of subspaces is exponential in the data dimensionality.

The multidimensional space is divided into grids, and the histogram in each hyper-rectangle is found. Grid regions with a significant histogram value are cluster regions.

Post-processing is done to grow the connected cluster regions.

A fine grid size results in an explosion in the number of hyper-rectangles; coarser grids fail to detect clusters.

The correct grid size is critical!
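
The grid-and-histogram idea above can be illustrated with a minimal sketch. It assumes a fixed, uniform grid and a plain count threshold; the function name `dense_cells` and the parameters `bins` and `min_count` are hypothetical and are not taken from any of the cited algorithms.

```python
from collections import Counter

def dense_cells(points, lows, highs, bins, min_count):
    """Count points per grid cell and keep cells holding at least min_count points."""
    counts = Counter()
    for p in points:
        cell = []
        for x, lo, hi in zip(p, lows, highs):
            # Map the coordinate to a bin index in [0, bins - 1].
            idx = int((x - lo) / (hi - lo) * bins)
            cell.append(min(idx, bins - 1))
        counts[tuple(cell)] += 1
    return {cell: n for cell, n in counts.items() if n >= min_count}

points = [(0.10, 0.10), (0.12, 0.15), (0.18, 0.11), (0.90, 0.50), (0.50, 0.95)]
dense = dense_cells(points, lows=(0.0, 0.0), highs=(1.0, 1.0), bins=4, min_count=3)
print(dense)

# The number of hyper-rectangles grows as bins ** dimensions.
print(4 ** 2, 100 ** 10)
```

With 4 bins per dimension only one cell reaches the threshold here. The last line shows why a fine grid in many dimensions is impractical: the cell count is the number of bins raised to the number of dimensions.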


Subspace Clustering



Observation: "If a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensional projection of the space."

Algorithm: growing clusters. Candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions.

Ex: ( {a1,b7,d9}, {b7,c8,d9} ) --> {a1,b7,c8,d9}

Candidate dense units are populated by a pass over the data set, and the dense units are identified in each dimension.

The dense units found are combined to form candidate dense units.

The algorithm terminates when no more candidate dense units are found.
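
The level-wise procedure above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes records are already mapped to bin indices on a fixed grid, and it uses a single count threshold `min_count`. The names `support`, `join_units` and `bottom_up` are hypothetical.

```python
from itertools import combinations

def support(records, unit):
    """Count records that fall inside every (dimension, bin) pair of the unit."""
    return sum(all(rec[d] == b for d, b in unit) for rec in records)

def join_units(dense_units):
    """Merge (k-1)-dimensional units that share any (k-2) (dimension, bin) pairs."""
    candidates = set()
    for u, v in combinations(dense_units, 2):
        merged = set(u) | set(v)
        distinct_dims = {d for d, _ in merged}
        # Exactly one new pair is added, and no dimension appears twice.
        if len(merged) == len(u) + 1 and len(distinct_dims) == len(merged):
            candidates.add(tuple(sorted(merged)))
    return candidates

def bottom_up(records, n_dims, n_bins, min_count):
    """Return the dense units found at each level, lowest dimension first."""
    candidates = {((d, b),) for d in range(n_dims) for b in range(n_bins)}
    levels = []
    while candidates:
        # One pass over the data populates the candidates of this level.
        dense = [u for u in sorted(candidates) if support(records, u) >= min_count]
        if not dense:
            break
        levels.append(dense)
        candidates = join_units(dense)
    return levels

# Records are tuples of bin indices, one per dimension (hypothetical data).
records = [(1, 2, 0), (1, 2, 1), (1, 2, 2), (1, 2, 0), (0, 0, 1), (2, 1, 2)]
levels = bottom_up(records, n_dims=3, n_bins=3, min_count=4)
for k, dense in enumerate(levels, start=1):
    print(k, dense)
```

In this toy data, dimensions 0 and 1 each contain one dense bin and dimension 2 contains none, so the only 2-dimensional dense unit lies in the subspace of dimensions 0 and 1, and the loop stops when no further candidates can be formed.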



Adaptive Grids (reducing computation in practice)

Automatic grid fitting based on data distribution

MAFIA: Merging of Adaptive Finite Intervals!

Optimal bins in each dimension lead to very few units in the grid (candidate dense units)

Figure panels: (a) CLIQUE, (b) MAFIA

Alok Choudhary

choudhar@ece.nwu.edu
14

Northwestern University

Base MAFIA Algorithm



Divide each dimension into very fine regions.

Compute the histogram in these regions along every dimension.

Set the value of a sliding window to the maximum histogram value in the window.

Adjacent units which have nearly the same histogram values are merged together to form larger bins.

The threshold of each bin formed is computed automatically.

A bin having a histogram value much greater (by a factor of 2~3) than that of an equi-distribution of data is DENSE.
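
The steps above can be sketched for a single dimension as follows. This is a minimal illustration under stated assumptions, not the MAFIA code: "nearly the same" is modeled as a relative difference of at most `beta`, non-overlapping windows of `window` fine cells are used, and the density threshold is `alpha` times the count expected under a uniform distribution. The names `adaptive_bins`, `n_fine`, `window`, `beta` and `alpha` are hypothetical.

```python
def adaptive_bins(values, lo, hi, n_fine=10, window=2, beta=0.25, alpha=2.0):
    """Return (start_cell, end_cell, count, is_dense) for each adaptive bin."""
    # 1. Fine histogram over n_fine equal-width cells.
    hist = [0] * n_fine
    for x in values:
        i = min(int((x - lo) / (hi - lo) * n_fine), n_fine - 1)
        hist[i] += 1

    # 2. Each window takes the maximum histogram value inside it.
    windows = []
    for s in range(0, n_fine, window):
        cells = hist[s:s + window]
        windows.append((s, s + len(cells), max(cells), sum(cells)))

    # 3. Merge adjacent windows whose values are nearly the same.
    bins = [list(windows[0])]
    for s, e, peak, total in windows[1:]:
        last = bins[-1]
        if abs(peak - last[2]) <= beta * max(peak, last[2]):
            last[1] = e
            last[2] = max(last[2], peak)
            last[3] += total
        else:
            bins.append([s, e, peak, total])

    # 4. A bin is dense if it holds more than alpha times the uniform expectation.
    n = len(values)
    result = []
    for s, e, _, total in bins:
        expected = n * (e - s) / n_fine
        result.append((s, e, total, total > alpha * expected))
    return result

# One point in each outer cell, plus a cluster of 16 points in [0.4, 0.6).
background = [0.05, 0.15, 0.25, 0.35, 0.65, 0.75, 0.85, 0.95]
cluster = [0.45] * 8 + [0.55] * 8
result = adaptive_bins(background + cluster, lo=0.0, hi=1.0)
print(result)
```

The ten fine cells collapse into three bins, and only the middle bin is flagged dense, so higher-dimensional candidates would be built from one bin in this dimension rather than from ten.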



MAFIA Algorithm (merging dense units)




Algorithm: Candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions.

Ex: ( {a1,c7,b8}, {c7,b8,d9} ) --> {a1,c7,b8,d9}



CLIQUE: candidate dense units (CDUs) in any dimension k are formed by combining dense units of dimension (k-1) which share the first (k-2) dimensions.

MAFIA: CDUs in any dimension k are formed by combining dense units of dimension (k-1) which share any (k-2) dimensions.
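
The difference between the two joining rules can be sketched as follows. This is an illustration under assumptions: a dense unit is represented as a tuple of (dimension, bin) pairs sorted by dimension, and the function names `join_first`, `join_any` and `candidates` are hypothetical.

```python
from itertools import combinations

def join_first(u, v):
    """Merge only if the first (k-2) entries of both units are identical."""
    if u[:-1] == v[:-1] and u[-1][0] != v[-1][0]:
        return tuple(sorted(set(u) | set(v)))
    return None

def join_any(u, v):
    """Merge if any (k-2) entries are shared and no dimension repeats."""
    merged = set(u) | set(v)
    dims = {d for d, _ in merged}
    if len(merged) == len(u) + 1 and len(dims) == len(merged):
        return tuple(sorted(merged))
    return None

def candidates(dense_units, join):
    """Apply a joining rule to every pair of dense units."""
    out = set()
    for u, v in combinations(dense_units, 2):
        c = join(u, v)
        if c is not None:
            out.add(c)
    return out

# The two 3-dimensional dense units {a1,b8,c7} and {b8,c7,d9}.
dense = [
    (("a", 1), ("b", 8), ("c", 7)),
    (("b", 8), ("c", 7), ("d", 9)),
]
print(candidates(dense, join_first))
print(candidates(dense, join_any))
```

The two units share b8 and c7, but not as their first two entries, so only the "any (k-2)" rule proposes the 4-dimensional candidate {a1,b8,c7,d9} from this pair.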



Data set: CLIQUE versus MAFIA

CLIQUE: fixed grids; first (k-2) algorithm. Huge CDU set; non-cluster dimensions reported; huge search space.

MAFIA: adaptive grids; any (k-2) algorithm. Much reduced CDU set; correct cluster dimensions reported; reduced search space; dimensions aware of data distribution.


Parallel MAFIA



pMAFIA: Parallel Subspace Clustering

Scalable in data size and number of dimensions

Grid- and density-based clustering algorithm

Parallelization provides speedup for the subspace clustering algorithm.


pMAFIA

Algorithm:

Each processor reads part of the data in a data-parallel fashion and constructs a histogram in every dimension.
    // Data is read in chunks (out of core)

Reduce communication to obtain the global histogram. All processors build adaptive grids using the histogram.
    // Each bin formed is a candidate dense unit (CDU).

Current dimension k = 1

while (dense units are still being found)
    if (k > 1)
        Build candidate dense units();
    Populate the candidate dense units in a data-parallel fashion and in chunks (out-of-core) of B records.
    Reduce communication to obtain the global CDU population.
    Identify the dense units();
    Build dense unit data structures();    // for the next higher dimension
    k = k + 1
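
The data-parallel histogram step with a Reduce can be sketched as a single-process simulation. The assumptions are explicit: the "processors" are just list partitions, `reduce_sum` stands in for a message-passing Reduce collective, records are already mapped to bin indices, and all function names are hypothetical.

```python
def local_histogram(records, n_dims, n_bins, chunk_size):
    """Histogram of one partition, scanned in chunks of chunk_size records."""
    hist = [[0] * n_bins for _ in range(n_dims)]
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]  # one out-of-core read
        for rec in chunk:
            for d, b in enumerate(rec):
                hist[d][b] += 1
    return hist

def reduce_sum(hists):
    """Element-wise sum of per-processor histograms (stand-in for Reduce)."""
    n_dims, n_bins = len(hists[0]), len(hists[0][0])
    return [[sum(h[d][b] for h in hists) for b in range(n_bins)]
            for d in range(n_dims)]

# Six 2-dimensional records given as bin indices (hypothetical data).
records = [(0, 1), (0, 1), (1, 1), (2, 0), (0, 1), (1, 2)]
p = 3
parts = [records[i::p] for i in range(p)]  # each "processor" gets a share
local = [local_histogram(part, n_dims=2, n_bins=3, chunk_size=1) for part in parts]
global_hist = reduce_sum(local)
print(global_hist)
```

Because the sum is associative, the global histogram equals the one a single processor would compute over all records, which is what allows every processor to build the same adaptive grid afterwards. The same pattern applies to the CDU population counts inside the loop.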


Build-Candidate-Dense-Units



Current Dimension(k) = 3.


Build Candidate Dense Units


Each dense unit is compared with every other dense unit to form CDUs, resulting in an O(Ndu^2) algorithm (Ndu: number of dense units).

For large values of Ndu, CDUs are built in parallel.

Processors 0,..,(p-1) work in parallel on parts of the total Ndu dense units. Processor i compares the dense units between N_i and N_(i+1) with all the other dense units; for optimal task partitioning we have

[equation not recoverable from the slide text]

Identical CDUs generated during the process need to be discarded.

Each generated CDU is compared with every other CDU to identify identical ones, resulting in an O(Ncdu^2) algorithm.
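
One way to balance the pairwise comparison work across processors is sketched below. It rests on an assumption that the slide text does not confirm: dense unit j only needs to be compared with the units after it, so the work per unit shrinks linearly and equal-sized ranges would be unbalanced. The boundaries computed here are not the lost formula from the slide, and the names `partition_triangular` and `work` are hypothetical.

```python
def partition_triangular(n_units, p):
    """Return boundaries n_0..n_p so processor i handles units [n_i, n_(i+1))."""
    total = n_units * (n_units - 1) // 2  # total number of pairwise comparisons
    bounds, done, j = [0], 0, 0
    for i in range(1, p):
        target = total * i / p
        while j < n_units:
            cost = n_units - 1 - j  # comparisons of unit j with later units
            if done + cost / 2 > target:
                break  # adding unit j would overshoot the target by more than half
            done += cost
            j += 1
        bounds.append(j)
    bounds.append(n_units)
    return bounds

def work(bounds, n_units):
    """Comparisons performed by each processor for the given boundaries."""
    return [sum(n_units - 1 - j for j in range(a, b))
            for a, b in zip(bounds, bounds[1:])]

bounds = partition_triangular(8, 2)
print(bounds, work(bounds, 8))
print(work([0, 4, 8], 8))  # equal-sized ranges, for comparison
```

For 8 dense units on 2 processors, the balanced split assigns 13 and 15 comparisons, while equal-sized ranges would assign 22 and 6.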


Parallelization (Building CDUs)

The total work done by processor Pi is shown.


pMAFIA: Analysis

Data parallelism in populating the CDUs in every dimension: effective for massive data sets.

Gains from task parallelism are realized when the data contains a large number of dense units (clusters).

A k-dimensional dense unit is allocated just 2k bytes of memory: k bytes for the dimensions and k bytes for the bin indices.

Data structures are linear arrays of bytes: very small message buffers are communicated, and space is optimized.

Although the bottom-up algorithm is exponential in the data dimension, for low subspace coverage the use of adaptive grids and the parallel formulation give very promising results.

If k is the dimensionality of the highest-dimensional dense unit, we explore all possible subspaces of these k dimensions ==> O(c^k)

The k passes over the data set cost O((N/(pB)) * k * Tio), where N is the total number of records, p the number of processors, B the number of records per chunk, and Tio the I/O access time for a block of B records.

Communication overhead is O(Tcomm * S * p * k), where Tcomm is a constant for communication and S is the size of the message exchanged.

Total: O(c^k + (N/(pB)) * k * Tio + Tcomm * S * p * k)
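
The 2k-byte dense unit layout mentioned in the analysis can be sketched as follows. The sketch assumes that dimension identifiers and bin indices each fit in one byte (values 0 to 255) and that the k dimension bytes precede the k bin bytes; the ordering and the names `pack_unit` and `unpack_unit` are hypothetical.

```python
def pack_unit(dims, bins):
    """Encode a k-dimensional dense unit as 2k bytes: k dimension ids, then k bin indices."""
    if len(dims) != len(bins):
        raise ValueError("dims and bins must have the same length")
    return bytes(dims) + bytes(bins)  # bytes() rejects values outside 0..255

def unpack_unit(buf):
    """Recover the dimension ids and bin indices from a packed unit."""
    k = len(buf) // 2
    return list(buf[:k]), list(buf[k:])

buf = pack_unit([0, 3, 7], [2, 5, 1])
print(len(buf), unpack_unit(buf))
```

A flat byte array of this kind can be concatenated for many units and sent as one small message buffer, which is consistent with the space and communication points made in the analysis.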


pMAFIA Outline


Clustering and subspace clustering


Base MAFIA Algorithm


pMAFIA (parallelization)


Performance Results


Adaptivity performance


Scalability with data set size.


Scalability with data dimensionality.


Scalability with cluster dimensionality.


Some Real world data sets.



Advantage of Adaptive Grids


300,000 records in a 15-dimensional space with 1 cluster of 5 dimensions.

A speedup of 80 obtained over CLIQUE.

CLIQUE failed to produce results with our modified CDU generation algorithm even in 2 hours on 16 processors.

This relatively small data set was mined in 32 seconds on 1 processor.


Scalability with Data Set Size


20-dimensional data with 5 clusters in 5 different subspaces.

Data sets: up to 11.8 million records.

Clusters detected in just about 3 minutes on 16 processors!

Almost linear with the increase in the data set size (because most of the time is spent scanning).


Parallelization (on IBM SP2)


30-dimensional data with 8.3 million records, 5 clusters each in a 6-dimensional subspace.


Near linear speedups


Negligible Communication overheads (<1%)


Data Dimensionality


250,000 records, 3 clusters in different 5-dimensional subspaces.

Near linear behavior with data dimensionality: the algorithm depends on the maximum number of dimensions in a cluster and not on the data dimensionality.


Cluster Dimensionality


50-dimensional data with 1 cluster, 650,000 records; cluster dimension from 3 to 10.

Behavior in line with the order of the algorithm: increases with the subspace coverage of the cluster.


Scalability on Movie Data


72,916 users rated 1628 movies in 18 months: 2.8 million ratings

4D data: {user-id, movie-id, weight, score}

Discovered seven interesting 2D clusters!






Other Data Sets


One-day-ahead prediction of DAX (German Stock Exchange)

The DAX prediction data set was based on 12 input time series such as stock indices, bond indices, gold prices, etc.

22 dimensions with 2757 records: major gains from task parallelism.

Mined clusters in 8.16 seconds on 8 processors.

Unique clusters discovered in 3-, 4-, 5- and 6-dimensional subspaces.

Ionosphere data (UCI repository)

Radar data collected in Goose Bay, Labrador.

34-dimensional data, 351 records.

Discovered unique clusters only in 3- and 4-dimensional subspaces.


Summary and Conclusions for pMAFIA


MAFIA : unsupervised subspace clustering algorithm


Introduced the adaptive grids formulation


First parallel subspace clustering algorithm for massive data sets

Incorporates both task and data parallelism.

pMAFIA: parallel implementation, scalable in data size and number of dimensions


PARSIMONY Overview


Sparse Data Structures and Representations


Using OLAP Framework for
