Data Mining Process


20 Nov 2013




1. The SEMMA method from SAS Institute


Sample the data by creating one or more tables.

Explore the data by searching for anticipated relationships, unanticipated trends and anomalies in order to gain understanding and ideas.

Modify the data by creating, selecting, and transforming the variables.

Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.




Fig. 1 - SAS Institute Enterprise Miner Analysis Diagram (uses SEMMA)


2. Realistic data mining process


Caveat:



the process as well as all phases/steps are iterative;

the sequence of steps/phases depends on the real-life circumstances.


Data preparation phase (i.e. “turn a mess of data into an organized whole”):


Data cleaning: consistency, stale information, typos

Missing values: fill them in, ignore them

Data derivation: derive composite features

Merging data from several databases: using a flat file (usually)

Transforming raw data: normalizing, smoothing, scaling, encoding

Dealing with outliers

[Fig. 1 diagram nodes: Input data, Plot data, Transform variables, Select variables/features, Partition data, Regression, Decision tree, Neural net, Assessment, Cost/score]
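Three of the data preparation steps listed above (filling in missing values, normalizing, and dealing with outliers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ages list is made-up data, mean imputation and min-max scaling are just one choice among many, and the z-score threshold of 2.0 is arbitrary.

```python
from statistics import mean, stdev

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def flag_outliers(values, z_threshold=2.0):
    """Mark values lying more than z_threshold sample standard
    deviations away from the mean."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return [False for _ in values]
    return [abs(v - m) / s > z_threshold for v in values]

# Hypothetical patient ages: one missing entry, one suspicious value.
ages = [34, None, 29, 41, 38, 36, 31, 44, 120]

filled = fill_missing(ages)       # the None becomes the mean of the rest
outliers = flag_outliers(filled)  # only the value 120 is flagged
scaled = min_max_scale(filled)    # every value now lies in [0, 1]
```

Whether a flagged value should be dropped, capped, or kept depends on the study; the sketch only marks it.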


Define a study (i.e. selecting data to mine and output, phase I; i.e. “what are we doing here”):


Define the goal, for example:



Define the characteristics (i.e. the profile) of patients who have allergies



Profile the patients who recover in 0-2 weeks, 2-3 weeks, 3-6 weeks, and 6+ weeks



Profile patients who use mild or high pain relievers in order to reduce pain


Identify which features are of interest, for example:



What features are useful for profiling people with allergies?



How descriptive are the fields in the current features?



What types of features should we include?


Identify input and output features



Selecting data to mine and output, phase II (i.e. “how can I mine only a subset of data and get good results if I have a large database”):


“Shrink” the table along the X axis (i.e. reduce the number of columns either by deleting or merging features):



Feature selection and reduction (e.g. by comparing mean and variance, by
entropy, by principal component analysis)



Feature composition (e.g. by merging using principal components)



Reducing feature values (e.g. by discretizing feature values using binning)



Merging input intervals (e.g. by Chi-square)
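Two of the column-shrinking ideas above can be sketched as follows. The first function discretizes a feature by equal-width binning. The second drops features whose variance does not exceed a threshold, which is a much simpler stand-in for the mean-and-variance comparison mentioned above and not the same method. The table, its column names, and the threshold are hypothetical.

```python
from statistics import pvariance

def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # min(...) keeps the largest value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def drop_low_variance(table, threshold):
    """Keep only the columns whose population variance exceeds threshold."""
    return {name: col for name, col in table.items()
            if pvariance(col) > threshold}

# Hypothetical table stored column by column.
table = {
    "age": [23, 35, 47, 59, 71],
    "clinic_id": [1, 1, 1, 1, 1],        # constant, carries no information
    "weeks_to_recover": [1, 2, 2, 5, 8],
}

reduced = drop_low_variance(table, threshold=0.0)
age_bins = equal_width_bins(table["age"], n_bins=3)
```

Here the constant clinic_id column is removed, and the five ages collapse to three distinct bin indices.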


“Shrink” the table along Y axis (i.e. reduce number of rows either by deleting or merging
samples/cases):



Case selection and reduction
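Shrinking the table along the Y axis can be sketched in the same spirit. The example below shows one way of deleting rows (simple random sampling) and one way of merging rows (collapsing exact duplicates). The rows, the sampling fraction, and the seed are made up for illustration; stratified or more elaborate sampling schemes may be more appropriate in practice.

```python
import random

def sample_cases(rows, fraction, seed=0):
    """Return a simple random sample containing about `fraction` of the rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)

def merge_duplicate_cases(rows):
    """Collapse identical rows into one, preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical cases: (age group, pain reliever, weeks to recover).
rows = [
    ("30-39", "mild", 2), ("30-39", "mild", 2), ("40-49", "high", 5),
    ("20-29", "mild", 1), ("40-49", "high", 5), ("50-59", "high", 7),
    ("20-29", "mild", 3), ("30-39", "high", 4), ("60-69", "mild", 6),
    ("50-59", "mild", 2),
]

unique_rows = merge_duplicate_cases(rows)
subset = sample_cases(unique_rows, fraction=0.5)
```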


Build the model and mine it:


Pick suitable data mining strategy(ies) and tool(s)



Validate the model:


Test the model on data which wasn’t used to build the model.

If you built several models (you did, most likely), determine which ones are the best



Calculate the error



Issues: is the model accurate and understandable; does it let you know where it’s confident and where it isn’t, and why (i.e. provide quantitative assessment with complex conclusions); does it let you trace which inputs affect the output; is it fast?
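The validation steps above (test on data not used for building, compare several models, calculate the error) can be sketched with a holdout split. Everything here is an assumption made for illustration: the data is synthetic, the two "models" are deliberately trivial, and the 30% test fraction is arbitrary.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(model, rows):
    """Fraction of rows whose label the model predicts incorrectly."""
    wrong = sum(1 for x, y in rows if model(x) != y)
    return wrong / len(rows)

def fit_majority(train):
    """Baseline model: always predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_threshold(train):
    """Predict 1 when x is at or above the midpoint of the two class means."""
    mean0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    mean1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    cut = (mean0 + mean1) / 2
    return lambda x: 1 if x >= cut else 0

# Synthetic cases: (feature value, label), label is 1 when the value is >= 50.
rows = [(x, int(x >= 50)) for x in range(0, 100, 5)]

train, test = holdout_split(rows)
models = {"majority": fit_majority(train), "threshold": fit_threshold(train)}

# The error is measured only on the held-out test rows.
errors = {name: error_rate(m, test) for name, m in models.items()}
best = min(errors, key=errors.get)
```

A single holdout split gives a noisy estimate on small data sets; repeating the split or using cross-validation is a common refinement.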




3. Data Mining Techniques Anatomy


Each data mining technique can be classified into the following categories based on the functionality provided by the technique (i.e. the primary data mining task):



Classification



Regression



Clustering



Summarization



Dependency modeling



Change and Deviation detection


All data mining techniques can also be classified as using:



supervised or unsupervised learning; and using



inductive or deductive learning.


4. Menu of data mining strategies/tools


Statistical methods:


Bayesian inference


Logistic regression


ANOVA (analysis of variance)


Log-linear models


Cluster analysis:


Divisive algorithms


Agglomerative hierarchical clustering


Partitional clustering


Incremental clustering



Decision trees and rules:


CLS algorithm


ID3 algorithm


C4.5 algorithm


Pruning algorithm


Association rules:


Market basket analysis


Apriori algorithm


WWW path traversal patterns



Text mining


Artificial neural nets:


Multilayer perceptrons with backpropagation learning


Kohonen networks


Genetic algorithms


Fuzzy Inference Systems


N-dimensional visualization methods