Data Mining Process
1. The SEMMA method from SAS Institute
Sample the data by creating one or more tables.
Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
Modify the data by creating, selecting, and transforming the variables.
Model the data by using analytical tools to search for a combination of the data that reliably predicts a desired outcome.
Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
Fig. 1 SAS Institute Enterprise Miner Analysis Diagram (uses SEMMA)
[Diagram nodes: Input data, Plot data, Transform variables, Select variables/features, Partition data, Regression, Decision tree, Neural net, Assessment, Cost/score]
2. Realistic data mining process
Caveat: the process as well as all phases/steps are iterative; the sequence of steps/phases depends on the real-life circumstances.
Data preparation phase (i.e. “turn a mess of data into an organized whole”):
Data cleaning: consistency, stale information, typos
Missing values: fill them in or ignore them
Data derivation: derive composite features
Merging data from several databases: using a flat file (usually)
Transforming raw data: normalizing, smoothing, scaling, encoding
Dealing with outliers
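Three of the preparation steps above (filling in missing values, normalizing, and flagging outliers) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the sample numbers are hypothetical, mean imputation and the 1.5 x IQR rule are just two common choices, and `statistics.quantiles` requires Python 3.8 or later.

```python
from statistics import mean, quantiles

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]

# Hypothetical recovery times in weeks, with one missing entry.
weeks = [1.0, None, 3.0]
filled = fill_missing(weeks)          # missing entry replaced by the mean
scaled = min_max_scale(filled)        # values mapped onto [0, 1]

# Hypothetical measurements containing one suspicious value.
suspects = iqr_outliers([10, 12, 11, 13, 12, 95])
```

Whether a flagged value is then dropped, capped, or kept depends on the study; the sketch only identifies candidates.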
Define a study (i.e. selecting data to mine and output, phase I; i.e. “what are we doing here”):
Define the goal, for example:
Define the characteristics (i.e. the profile) of patients who have allergies
Profile the patients who recover in 0-2 weeks, 2-3 weeks, 3-6 weeks, and 6+ weeks
Profile patients who use mild or strong pain relievers in order to reduce pain
Identify which features are of interest, for example:
What features are useful for profiling people with allergies?
How descriptive are the fields in the current features?
What types of features should we include?
Identify input and output features
Selecting data to mine and output, phase II (i.e. “how can I mine only a subset of data and get good results if I have a large database”):
“Shrink” the table along the X axis (i.e. reduce the number of columns either by deleting or merging features):
Feature selection and reduction (e.g. by comparing mean and variance, by entropy, by principal component analysis)
Feature composition (e.g. by merging using principal components)
Reducing feature values (e.g. by discretizing feature values using binning)
Merging input intervals (e.g. by chi-square)
“Shrink” the table along the Y axis (i.e. reduce the number of rows either by deleting or merging samples/cases):
Case selection and reduction
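As a rough illustration of both directions of shrinking, the sketch below discretizes a numeric feature with equal-width binning (fewer distinct feature values) and draws a simple random sample of cases (fewer rows). The function names, the bin count, and the sampling fraction are assumptions made for the example; equal-width bins and uniform sampling are only the simplest of several possible choices.

```python
import random

def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # all values identical: a single bin is enough
        return [0 for _ in values]
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def sample_rows(rows, fraction, seed=0):
    """Draw a simple random sample of rows without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)

# Hypothetical feature: recovery time in days, reduced to 4 discrete levels.
days = [0, 5, 10, 15, 20]
levels = equal_width_bins(days, 4)

# Hypothetical table of 100 case ids, reduced to a 10% sample.
cases = list(range(100))
subset = sample_rows(cases, 0.1)
```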
Build the model and mine it:
Pick suitable data mining strategy(ies) and tool(s)
Validate the model:
Test the model on data which wasn’t used to build the model.
If you built several models (you did, most likely), determine which ones are the best
Calculate the error
Issues: Is the model accurate and understandable? Does it let you know where it is confident and where it isn’t, and why (i.e. does it provide a quantitative assessment along with complex conclusions)? Does it let you trace which inputs affect the output? Is it fast?
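The validation steps above (hold out data, build several models, calculate the error, keep the best) can be sketched as follows. Everything here is an assumption for illustration: the data is a noise-free toy line, the two candidate models (a constant mean predictor and a least-squares line) are deliberately trivial, and mean squared error is only one possible error measure.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and hold out a fraction that the models never see."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_mean(train):
    """Baseline model: always predict the mean training output."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def fit_line(train):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mean_squared_error(model, rows):
    """Average squared difference between predictions and true outputs."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Toy data (hypothetical): output is exactly 2 * input + 1.
data = [(x, 2.0 * x + 1.0) for x in range(20)]
train_rows, test_rows = train_test_split(data)

models = {"mean": fit_mean(train_rows), "line": fit_line(train_rows)}
errors = {name: mean_squared_error(m, test_rows) for name, m in models.items()}
best = min(errors, key=errors.get)  # model with the lowest held-out error
```

The error is computed only on `test_rows`, which played no part in fitting, so it estimates how each model behaves on unseen data.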
3. Data Mining Techniques Anatomy
Each data mining technique can be classified into the following categories based on the functionality provided by the technique (i.e. the primary data mining task):
Classification
Regression
Clustering
Summarization
Dependency modeling
Change and deviation detection
All data mining techniques can also be classified as using: supervised or unsupervised learning; and using inductive or deductive learning.
4. Menu of data mining strategies/tools
Statistical methods:
Bayesian inference
Logistic regression
ANOVA (analysis of variance)
Log-linear models
Cluster analysis:
Divisive algorithms
Agglomerative hierarchical clustering
Partitional clustering
Incremental clustering
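To make the agglomerative entry concrete, here is a minimal sketch of bottom-up hierarchical clustering. It assumes single linkage (distance between the closest pair of points) and Euclidean distance, and it stops at a requested number of clusters; the function names and sample points are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
from math import dist

def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, n_clusters):
    """Start with one cluster per point and merge the closest pair repeatedly."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Two visibly separated groups of hypothetical 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
groups = agglomerative(points, 2)
```

This brute-force version recomputes all pairwise cluster distances on every merge, which is fine for a handful of points but far too slow for large tables.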
Decision trees and rules:
CLS algorithm
ID3 algorithm
C4.5 algorithm
Pruning algorithm
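ID3 chooses the attribute to split on by information gain, that is, the reduction in entropy of the output feature after the split. The sketch below shows only that selection step, not a full tree builder; the function names and the tiny allergy table are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain (the ID3 split choice)."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical patient records: does the patient have an allergy?
patients = [
    {"pollen": "high", "season": "spring", "allergy": "yes"},
    {"pollen": "high", "season": "fall",   "allergy": "yes"},
    {"pollen": "low",  "season": "spring", "allergy": "no"},
    {"pollen": "low",  "season": "fall",   "allergy": "no"},
]
split = best_attribute(patients, ["pollen", "season"], "allergy")
```

In this toy table `pollen` separates the classes perfectly while `season` tells nothing, so the gain for `pollen` is the full one bit of entropy in the output feature.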
Association rules:
Market basket analysis
Apriori algorithm
WWW path traversal patterns
Text mining
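The core of the Apriori algorithm, finding frequent itemsets level by level, can be sketched as below. It relies on the property that every subset of a frequent itemset must itself be frequent. The sketch stops at frequent itemsets and does not derive association rules; the function name, the baskets, and the use of an absolute count for `min_support` are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item seen in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-item subset.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
itemsets = apriori(baskets, min_support=2)
```

With these baskets, the three-item candidate is pruned without being counted because the pair of milk and butter appears in only one basket.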
Artificial neural nets:
Multilayer perceptrons with backpropagation learning
Kohonen networks
Genetic algorithms
Fuzzy Inference Systems
N-dimensional visualization methods