System Support for High Performance Scientific Data Mining

20 Nov 2013

Gagan Agrawal

Ruoming Jin

Raghu Machiraju

S. Parthasarathy

Department of Computer and Information Sciences

Ohio State University

Scientific Data Mining Problem


Datasets used for scientific data mining are large, particularly those from simulations

Our understanding of which algorithms and parameters will give the desired insights is limited

The time required to implement different algorithms and run them with different parameters on large datasets slows down the scientific data mining process


Project Overview


FREERIDE (Framework for Rapid Implementation of Datamining Engines) is the base system

Already demonstrated for a variety of standard mining algorithms

Currently being applied to feature analysis and mining of simulation data

FREERIDE offers:

The ability to rapidly prototype a high-performance mining implementation

Distributed memory parallelization

Shared memory parallelization

Ability to process large and disk-resident datasets

Only modest modifications to a sequential implementation are needed for the above three











Key Observation from Mining Algorithms


Popular mining algorithms share a common canonical loop

This loop can be used as the basis for a common middleware


While( ) {
    forall( data instances d) {
        I = process(d)
        R(I) = R(I) op d
    }
    …….
}
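The canonical loop above can be sketched in Python. This is a minimal illustration and not FREERIDE's actual interface: the names `generalized_reduction`, `kmeans_1d`, `process`, `op`, and `init` are hypothetical, and one-dimensional k-means is chosen only as a familiar example of an algorithm that fits the pattern.

```python
from collections import defaultdict


def generalized_reduction(data, process, op, init):
    """One pass of the canonical loop: R(I) = R(I) op d for every instance d."""
    reduction = defaultdict(init)            # the reduction object R
    for d in data:                           # forall (data instances d)
        i = process(d)                       # I = process(d)
        reduction[i] = op(reduction[i], d)   # R(I) = R(I) op d
    return reduction


def kmeans_1d(data, centers, max_iters=20):
    """1-D k-means written on top of the reduction loop (illustrative only)."""
    centers = list(centers)
    for _ in range(max_iters):               # the outer While( ) loop
        def nearest(d):
            # Index of the center closest to d; this plays the role of process(d).
            return min(range(len(centers)), key=lambda k: abs(d - centers[k]))

        def accumulate(acc, d):
            # Running (sum, count) per cluster; this plays the role of op.
            return (acc[0] + d, acc[1] + 1)

        r = generalized_reduction(data, nearest, accumulate, lambda: (0.0, 0))
        # Empty clusters keep their previous center.
        new_centers = [
            r[k][0] / r[k][1] if r[k][1] else centers[k]
            for k in range(len(centers))
        ]
        if new_centers == centers:           # converged
            break
        centers = new_centers
    return centers
```

Because the only state updated inside the inner loop is the reduction object, the order in which data instances are visited does not matter when `op` is associative and commutative, which is what makes the pattern a natural target for parallelization.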



Performance of Shared Memory Parallelization

[Chart: K-means clustering. X-axis: 1 thread, 4 threads, 16 threads. Series: full repl, opt full locks, cache sens. locks. Y-axis ticks from 0 to 1600; axis label not recoverable.]
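The "full repl" series in the chart legend presumably refers to full replication, in which each thread updates a private copy of the reduction object and the copies are merged afterwards, so no locks are needed during the main loop. The sketch below illustrates that idea under this assumption. All names (`local_reduction`, `full_replication_reduce`, `combine`) are hypothetical and do not come from FREERIDE.

```python
from concurrent.futures import ThreadPoolExecutor


def local_reduction(chunk, process, op, identity, size):
    """Reduce one chunk into a private copy of the reduction object."""
    private = [identity] * size
    for d in chunk:
        i = process(d)
        private[i] = op(private[i], d)
    return private


def full_replication_reduce(data, process, op, combine, identity, size,
                            num_threads=4):
    """Each thread reduces into its own copy; the copies are merged at the end."""
    # Strided partitioning of the data across threads.
    chunks = [data[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        copies = list(pool.map(
            lambda chunk: local_reduction(chunk, process, op, identity, size),
            chunks,
        ))
    # Merge phase: combine the per-thread copies element by element.
    merged = [identity] * size
    for copy in copies:
        for i in range(size):
            merged[i] = combine(merged[i], copy[i])
    return merged
```

The trade-off is memory for synchronization: replication costs one copy of the reduction object per thread plus a merge step, whereas locking schemes share a single copy and pay for lock acquisition on each update. In CPython the global interpreter lock prevents this pure-Python loop from showing a real speedup, so the code shows only the structure of the technique.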



Performance on Cluster of SMPs

[Chart: Apriori association mining. X-axis: 1 node, 2 nodes, 4 nodes, 8 nodes. Series: 1 thread, 2 threads, 3 threads. Y-axis ticks from 0 to 70000; axis label not recoverable.]


SPIES On (a) FREERIDE


Developed a new communication-efficient decision tree construction algorithm

Statistical Pruning of Intervals for Enhanced Scalability (SPIES)

Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume

Does not require sorting of data, or partitioning and writing back of records
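To illustrate why interval-based summaries avoid sorting and keep communication small, the sketch below builds per-interval class counts for one numerical attribute and evaluates candidate splits only at interval boundaries. It is not the SPIES algorithm: the statistical pruning step is omitted, and the equal-width intervals, the Gini criterion, and all function names are assumptions made for the example.

```python
def interval_class_counts(values, labels, lo, hi, num_intervals, num_classes):
    """Class histogram per equal-width interval, built in one unsorted pass."""
    counts = [[0] * num_classes for _ in range(num_intervals)]
    width = (hi - lo) / num_intervals
    for v, y in zip(values, labels):
        # Clamp so that v == hi falls into the last interval.
        idx = min(int((v - lo) / width), num_intervals - 1)
        counts[idx][y] += 1
    return counts


def gini(class_counts):
    """Gini impurity of a class-count vector."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)


def best_boundary_split(counts, lo, hi):
    """Return (weighted Gini, threshold) for the best interval-boundary split."""
    num_intervals = len(counts)
    num_classes = len(counts[0])
    width = (hi - lo) / num_intervals
    total = [sum(row[c] for row in counts) for c in range(num_classes)]
    n = sum(total)
    best = (float("inf"), None)
    if n == 0:
        return best
    left = [0] * num_classes
    for b in range(num_intervals - 1):
        left = [l + c for l, c in zip(left, counts[b])]
        right = [t - l for t, l in zip(total, left)]
        score = (sum(left) * gini(left) + sum(right) * gini(right)) / n
        if score < best[0]:
            best = (score, lo + (b + 1) * width)
    return best
```

The histograms are additive, so in a distributed setting each node could build counts over its local records and the nodes would exchange only the small count tables rather than the records themselves.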

[Chart on this slide: X-axis: 1 node, 8 nodes. Series: 1 thread, 2 threads, 3 threads. Y-axis ticks from 0 to 7000; axis label not recoverable.]

Broader Research Agenda



Applying FREERIDE for Scientific Data Mining

Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.

A feature is a region of interest in a dataset

A suite of algorithms exists for extracting and tracking features

[Diagram: A Feature Analysis Algorithm (Classify-Aggregate). Data items shown: Data, ROIs, Catalog. Step labels as extracted, original order not recoverable: Aggregate, Classify Points, Rank, Denoise, Track, Transform, Operator, Tour Grid.]
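As a rough illustration of the classify-then-aggregate idea, the sketch below marks grid points as candidates and then groups adjacent candidates into regions of interest. The threshold test and the 4-connected flood fill are assumptions chosen for brevity; the slides do not specify the operators used by Machiraju et al., and the function names are hypothetical.

```python
from collections import deque


def classify_points(grid, threshold):
    """Mark each grid point as a candidate (True) if its value exceeds threshold."""
    return [[value > threshold for value in row] for row in grid]


def aggregate_regions(mask):
    """Group 4-connected candidate points into regions of interest (ROIs)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited candidate.
            region = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions
```

The classification step is a per-point operation and parallelizes trivially. The aggregation step has dependencies between neighboring points, so it does not reduce to independent per-point updates in the same way.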

Ongoing Work


Parallelization Using FREERIDE

Most of the steps involve generalized reductions, which FREERIDE supports well

Extensions to FREERIDE are required for the aggregation and tracking steps

Overall, FREERIDE can allow rapid implementation of scalable versions of a variety of steps and algorithms that are part of the feature mining paradigm