System Support for High Performance Scientific Data Mining
Gagan Agrawal
Ruoming Jin
Raghu Machiraju
S. Parthasarathy
Department of Computer and Information Sciences
Ohio State University
Scientific Data Mining Problem
Datasets used for scientific data mining are large, particularly those from simulations
Our understanding of what algorithms and parameters will give desired insights is limited
Time required for implementing different algorithms and running them with different parameters on large datasets slows down the scientific data mining process
Project Overview
FREERIDE (Framework for Rapid Implementation of datamining engines) as the base system
Already demonstrated for a variety of standard mining algorithms
Currently being applied to feature analysis and mining of simulation data
FREERIDE offers:
The ability to rapidly prototype a high-performance mining implementation
Distributed memory parallelization
Shared memory parallelization
Ability to process large and disk-resident datasets
Only modest modifications to a sequential implementation are needed for the above three
Key Observation from Mining Algorithms
Popular algorithms have a common canonical loop
This loop can be used as the basis for supporting a common middleware
While( ) {
    forall( data instances d) {
        I = process(d)
        R(I) = R(I) op d
    }
    …….
}
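The canonical loop above can be sketched in Python as follows. This is a minimal illustration, not FREERIDE's actual interface: the function `generalized_reduction`, its parameters, and the one-dimensional k-means pass used as the example are all assumptions made for this sketch.

```python
from collections import defaultdict

def generalized_reduction(data, process, op, init):
    """One pass of the canonical loop: R(I) = R(I) op d for every instance d."""
    reduction = defaultdict(init)  # reduction object R, keyed by I
    for d in data:  # forall (data instances d)
        i = process(d)  # I = process(d)
        reduction[i] = op(reduction[i], d)  # R(I) = R(I) op d
    return dict(reduction)

# Hypothetical instance: one k-means assignment pass over 1-D points.
centers = [0.0, 10.0]

def nearest_center(d):
    """process(d): index of the center closest to d."""
    return min(range(len(centers)), key=lambda k: abs(d - centers[k]))

def accumulate(acc, d):
    """op: add d to the running (sum, count) pair of its cluster."""
    total, count = acc
    return (total + d, count + 1)

points = [1.0, 2.0, 9.0, 11.0, 0.0]
result = generalized_reduction(points, nearest_center, accumulate, lambda: (0.0, 0))
new_centers = [result[k][0] / result[k][1] for k in sorted(result)]
```

Because `op` here is associative and commutative, the instances can be processed in any order, which is the property that makes this loop a candidate for parallelization.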
Performance of Shared Memory Parallelization
[Figure: k-means clustering with 1, 4, and 16 threads; series labelled "full repl", "opt full locks", and "cache sens. locks"; vertical axis from 0 to 1600, units not given]
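The "full repl" series presumably refers to full replication of the reduction object, where each thread updates a private copy and the copies are merged at the end. The sketch below illustrates that idea under this assumption; the function names, the round-robin chunking, and the modulo-based `process` step are hypothetical and not taken from FREERIDE.

```python
from concurrent.futures import ThreadPoolExecutor

def local_reduce(chunk, num_bins):
    """Reduce one chunk into a private copy of the reduction object."""
    local = [0] * num_bins
    for d in chunk:
        local[d % num_bins] += 1  # hypothetical process(d) and op
    return local

def replicated_reduce(data, num_bins, num_threads):
    """Give each thread its own reduction object, then merge the copies."""
    # Round-robin split of the data, one chunk per thread.
    chunks = [data[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(lambda c: local_reduce(c, num_bins), chunks))
    # Merge phase: combine the private copies element by element.
    merged = [0] * num_bins
    for local in partials:
        for i, value in enumerate(local):
            merged[i] += value
    return merged

counts = replicated_reduce(list(range(10)), num_bins=3, num_threads=4)
```

No locks are needed during the loop because no two threads touch the same copy; the cost is extra memory and a merge step. In CPython the global interpreter lock prevents these threads from running in parallel, so the sketch shows the structure only, not the speedup.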
Performance on Cluster of SMPs
[Figure: Apriori association mining on 1, 2, 4, and 8 nodes with 1, 2, and 3 threads; vertical axis from 0 to 70000, units not given]
SPIES On (a) FREERIDE
Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume
Does not require sorting of data, or partitioning and writing-back of records
[Figure: SPIES performance; node counts labelled 1 and 8, with 1, 2, and 3 threads; vertical axis from 0 to 7000, units not given]
Broader Research Agenda
Applying FREERIDE for Scientific Data Mining
Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
A feature is a region of interest in a dataset
A suite of algorithms for extracting and tracking features
[Diagram: steps labelled Aggregate, Classify Points, Rank, Denoise, Track, Transform, Operator, Tour Grid]
A Feature Analysis Algorithm
[Diagram: labels ROIs, Data, Catalog, Classify-Aggregate]
Ongoing Work: Parallelization Using FREERIDE
Most of the steps involve generalized reductions, which are supported well in FREERIDE
Extensions to FREERIDE required for aggregation and tracking steps
Overall, FREERIDE can allow rapid implementation of scalable versions of a variety of steps and algorithms that are part of the feature mining paradigm