A Short Course in Data Mining


U.S. Headquarters: StatSoft, Inc.

2300 E. 14th St., Tulsa, OK 74104

USA

(918) 749-1119

Fax: (918) 749-2217

info@statsoft.com

www.statsoft.com
©Copyright StatSoft, Inc., 1984-2008. StatSoft, the StatSoft logo, and STATISTICA are trademarks of StatSoft, Inc.

Australia: StatSoft Pacific Pty Ltd.
France: StatSoft France
Italy: StatSoft Italia srl
Poland: StatSoft Polska Sp. z o.o.
S. Africa: StatSoft S. Africa (Pty) Ltd.
Brazil: StatSoft South America
Germany: StatSoft GmbH
Japan: StatSoft Japan Inc.
Portugal: StatSoft Ibérica Lda
Sweden: StatSoft Scandinavia AB
Bulgaria: StatSoft Bulgaria Ltd.
Hungary: StatSoft Hungary Ltd.
Korea: StatSoft Korea
Russia: StatSoft Russia
Taiwan: StatSoft Taiwan
Czech Rep.: StatSoft Czech Rep. s.r.o.
India: StatSoft India Pvt. Ltd.
Netherlands: StatSoft Benelux BV
Spain: StatSoft Ibérica Lda
UK: StatSoft Ltd.
China: StatSoft China
Israel: StatSoft Israel Ltd.
Norway: StatSoft Norway AS
data analysis

data mining

quality control

web-based analytics
A Short Course in Data Mining
Outline
Overview of Data Mining

What is Data Mining?

Models for Data Mining

Steps in Data Mining

Overview of Data Mining techniques

Points to Remember
What is Data Mining?

The need for data mining arises when expensive problems in business (manufacturing, engineering, etc.) have no obvious solutions:
- Optimizing a manufacturing process or a product formulation.
- Detecting fraudulent transactions.
- Assessing risk.
- Segmenting customers.

A solution must be found:
- Pretend the problem does not exist. Denial.
- Consult the local psychic.
- Use data mining. (Note: We recommend this approach …)
What is Data Mining?

- Data mining is an analytic process designed to explore large amounts of data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
- Data mining is a business process for maximizing the value of data collected by the business.
- Data mining is used to:
  - Detect patterns in fraudulent transactions, insurance claims, etc.
  - Detect patterns in events and behaviors.
  - Model customer buying patterns and behavior for cross-selling, up-selling, and customer acquisition.
  - Optimize product performance and manufacturing processes.
- Data mining can be utilized in any organization that needs to find patterns or relationships in its data, wherever the derived insights will deliver business value.
What is Data Mining?

The typical goals of data mining projects are:
- Identification of groups, clusters, strata, or dimensions in data that display no obvious structure.
- Identification of factors that are related to a particular outcome of interest (root-cause analysis).
- Accurate prediction of outcome variable(s) of interest (in the future, or in new customers, clients, applicants, etc.; this application is usually referred to as predictive data mining).
What is Data Mining?

Data mining is a tool, not a magic wand.
- Data mining will not automatically discover solutions without guidance.
- Data mining will not sit inside of your database and send you an email when some interesting pattern is discovered.
- Data mining may find interesting patterns, but it does not tell you the value of such patterns.
- Data mining does not infer causality. For example, it might be determined that males with a certain income who exercise regularly are likely purchasers of a certain product; however, that does not mean such factors cause them to purchase the product, only that the relationship exists.
Models for Data Mining

In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.

- CRISP: proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining.
  Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
- DMAIC: the Six Sigma methodology, a data-driven methodology for eliminating defects, waste, or quality control problems of all kinds.
  Define, Measure, Analyze, Improve, Control
- SEMMA (SAS Institute): focused more on the technical aspects of data mining.
  Sample, Explore, Modify, Model, Assess
Steps in Data Mining

Stage 0: Precise statement of the problem.
- Before opening a software package and running an analysis, the analyst must be clear as to what question he wants to answer. If you have not given a precise formulation of the problem you are trying to solve, then you are wasting time and money.

Stage 1: Initial exploration.
- This stage usually starts with data preparation that may involve the "cleaning" of the data (e.g., identification and removal of incorrectly coded data, etc.), data transformations, selecting subsets of records, and, in the case of data sets with large numbers of variables ("fields"), performing preliminary feature selection. Data description and visualization are key components of this stage (e.g., descriptive statistics, correlations, scatterplots, box plots, etc.).

Stage 2: Model building and validation.
- This stage involves considering various models and choosing the best one based on their predictive performance.

Stage 3: Deployment.
- When the goal of the data mining project is to predict or classify new cases (e.g., to predict the creditworthiness of individuals applying for loans), the third and final stage typically involves the application of the best model or models (determined in the previous stage) to generate predictions.
Stage 1: Initial exploration

- "Cleaning" of data: identification and removal of incorrectly coded data (e.g., Male=Yes, Pregnant=Yes).
- Data transformations: data may be skewed (that is, outliers in one direction or another may be present); log transformation, Box-Cox transformation, etc. (a minimal sketch follows this list).
- Data reduction: selecting subsets of records and, in the case of data sets with large numbers of variables ("fields"), performing preliminary feature selection.
- Data description and visualization are key components of this stage (e.g., descriptive statistics, correlations, scatterplots, box plots, brushing tools, etc.).
  - Data description allows you to get a snapshot of the important characteristics of the data (e.g., central tendency and dispersion).
  - Patterns are often easier to perceive visually than with lists and tables of numbers.
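To make the transformation idea concrete, here is a minimal sketch (not part of the original deck), assuming Python with NumPy/SciPy and synthetic right-skewed data; all variable names are illustrative.

```python
# Illustrative sketch: reducing skew with a log and a Box-Cox transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # hypothetical skewed positive values

x_log = np.log(x)                  # simple log transform
x_boxcox, lmbda = stats.boxcox(x)  # Box-Cox chooses lambda by maximum likelihood

print("skewness raw:    %.2f" % stats.skew(x))
print("skewness log:    %.2f" % stats.skew(x_log))
print("skewness boxcox: %.2f (lambda=%.2f)" % (stats.skew(x_boxcox), lmbda))
```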
Stage 2: Model building and validation.

- Data mining involves creating models of reality.
- A model takes one or more inputs and produces one or more outputs.
- A model can be "transparent", for example, a series of if/then statements where the structure is easily discerned; or a model can be seen as a black box, for example, a neural network, where the structure or the rules that govern the predictions are impossible to fully comprehend.

[Diagram: Age, Gender, and Income feed into a MODEL whose output answers "Will customer likely default on his loan?"]
Stage 2: Model building and validation.

- A model is typically rated according to two aspects:
  - Accuracy
  - Understandability
  These aspects sometimes conflict with one another.
- Decision trees and linear regression models are less complicated and simpler than models such as neural networks, boosted trees, etc., and thus easier to understand; however, you might be giving up some predictive accuracy.
- Remember not to confuse the data mining model with reality (a road map is not a perfect representation of the road), but it can be used as a useful guide.
- Generalization is the ability of a model to make accurate predictions when faced with data not drawn from the original training set (but drawn from the same source as the training set).
Stage 2: Model building and validation.

- Validation of the model requires that you train the model on one set of data and evaluate it on another, independent set of data.
- There are two main methods of validation:
  - Split the data into train/test data sets (e.g., a 75-25 split).
  - If you do not have enough data to have a hold-out sample, then use v-fold cross-validation. In v-fold cross-validation, repeated (v) random samples are drawn from the data for the analysis, and the respective model or prediction method, etc. is then applied to compute predicted values, classifications, etc. Typically, summary indices of the accuracy of the prediction are computed over the v replications; thus, this technique allows the analyst to evaluate the overall accuracy of the respective prediction model or method in repeatedly drawn random samples. (A minimal sketch follows.)
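The following sketch is not from the original deck; it simply illustrates v-fold (here 5-fold) cross-validation, assuming Python with scikit-learn and a made-up feature matrix X and target y.

```python
# Minimal sketch of 5-fold cross-validation on synthetic data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                  # 200 cases, 4 predictors (toy data)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy binary outcome

model = DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(model, X, y, cv=5)    # accuracy on each of the 5 held-out folds
print("fold accuracies:", np.round(scores, 3))
print("overall estimate: %.3f" % scores.mean())
```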








Stage 2: Model building and validation.

- If you do not have enough data to have a hold-out sample, then use v-fold cross-validation.
Stage 2: Model building and validation.

- How predictive is the model? Compute error sums of squares (regression) or a confusion matrix (classification).
- Error on training data is not a good indicator of performance on future data: the new data will probably not be exactly the same as the training data!
- Overfitting: fitting the training data too precisely usually leads to poor results on new data.

In general, the term overfitting refers to the condition where a predictive model (e.g., for predictive data mining) is so "specific" that it reproduces various idiosyncrasies (random "noise" variation) of the particular data from which the parameters of the model were estimated; as a result, such models often may not yield accurate predictions for new observations (e.g., during deployment of a predictive data mining project). Often, various techniques such as cross-validation and v-fold cross-validation are applied to avoid overfitting.
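As an illustration of the point above (not from the original deck), the sketch below, assuming Python with scikit-learn and synthetic data, compares a simple and a very flexible model: the flexible one fits the training data almost perfectly but does worse on held-out data.

```python
# Overfitting sketch: train error vs. test error for a shallow vs. deep tree.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)   # signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for depth in (2, 20):                                    # simple vs. overgrown tree
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_tr, y_tr)
    print("depth %2d  train MSE %.3f  test MSE %.3f" % (
        depth,
        mean_squared_error(y_tr, tree.predict(X_tr)),
        mean_squared_error(y_te, tree.predict(X_te))))
```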
Stage 2: Model building and validation.

Model Validation Measures
- Possible validation measures:
  - Classification accuracy
  - Total cost/benefit (when different errors involve different costs)
  - Lift and gains curves
  - Error in numeric predictions
- Error rate: the proportion of errors made over the whole set of instances.
- The training set error rate is way too optimistic! You can find patterns even in random data.

The lift chart provides a visual summary of the usefulness of the information provided by one or more statistical models for predicting a binomial (categorical) outcome variable (dependent variable); for multinomial (multiple-category) outcome variables, lift charts can be computed for each category. Specifically, the chart summarizes the utility that one may expect by using the respective predictive models compared to using baseline information only.
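One way to compute the numbers behind a lift chart is by deciles of the model score. The sketch below is illustrative only (not from the deck) and assumes Python with NumPy and hypothetical arrays of model scores and 0/1 outcomes.

```python
# Decile lift sketch: rank cases by predicted probability, then compare the
# response rate in each decile to the overall (baseline) rate.
import numpy as np

rng = np.random.default_rng(3)
proba = rng.uniform(size=1000)                       # hypothetical model scores
y = (rng.uniform(size=1000) < proba).astype(int)     # outcomes correlated with the scores

order = np.argsort(-proba)                           # best-scored cases first
deciles = np.array_split(y[order], 10)
baseline = y.mean()
for i, d in enumerate(deciles, start=1):
    print("decile %2d  response rate %.2f  lift %.2f" % (i, d.mean(), d.mean() / baseline))
```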
Stage 3: Deployment.

- A model is built once, but can be used over and over again.
- The model should be easily deployable.
- A linear regression is easily deployed: simply gather the regression coefficients. For example, if a new observed data vector comes in, {x1, x2, x3}, then simply plug it into the linear equation to generate the predicted value:

  Prediction = B0 + B1*X1 + B2*X2 + B3*X3

- A Classification and Regression Tree model is easily deployed: a series of If/Then/Else statements.
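A minimal deployment sketch for the linear-regression case above (the coefficient values below are made up for illustration; only the equation form comes from the slide):

```python
# Deploying a fitted linear regression: keep the coefficients and apply the
# equation Prediction = B0 + B1*X1 + B2*X2 + B3*X3 to each new data vector.
def predict(x1, x2, x3, b0=1.5, b1=0.4, b2=-2.0, b3=0.7):
    return b0 + b1 * x1 + b2 * x2 + b3 * x3

new_observation = (2.0, 0.5, 3.1)   # incoming vector {x1, x2, x3}
print(predict(*new_observation))
```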
Steps in Data Mining
Stage 0: Precise statement of the problem.
Stage 1: Initial exploration.
Stage 2: Model building and validation.
Stage 3: Deployment.
Overview of Data Mining Techniques

- Supervised Learning
  - Classification: response is categorical
  - Regression: response is continuous
  - Time Series: dealing with observations across time
  - Optimization: minimize or maximize some characteristic
- Unsupervised Learning
  - Principal Component Analysis: feature reduction
  - Clustering: grouping like objects together
  - Association and Link Analysis: a descriptive approach to exploring data that helps identify relationships among values in a database, e.g., market basket analysis (those customers that buy hammers also buy nails); examine conditional probabilities.

Supervised Learning: a category of data mining methods that use a set of labeled training examples (e.g., each example consists of a set of values on predictors and outcomes) to fit a model that later can be used for deployment.

Unsupervised Learning: a data mining method based on training data where the outcomes are not provided.
Overview of Data Mining Techniques

The view from 20,000 feet above: the next few slides will cover the commonly used data mining techniques below at a high level, focusing on the big picture so that you can see how each technique fits into the overall landscape.

- Descriptive Statistics
- Linear and Logistic Regression
- Analysis of Variance (ANOVA)
- Discriminant Analysis
- Decision Trees
- Clustering Techniques (K-Means & EM)
- Neural Networks
- Association and Link Analysis
- MSPC (Multivariate Statistical Process Control)
Disadvantages of nonparametric models

- Some data mining algorithms need a lot of data.
- Curse of dimensionality is a term coined by Richard Bellman, applied to the problem caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space. Leo Breiman gives as an example the fact that 100 observations cover the one-dimensional unit interval [0,1] on the real line quite well; one could draw a histogram of the results and draw inferences. If one now considers the corresponding 10-dimensional unit hypercube, 100 observations are now isolated points in a vast empty space. To get similar coverage of the 10-dimensional space would now require 10^20 observations, which is at least a massive undertaking and may well be impractical.

The term curse of dimensionality (Bellman, 1961; Bishop, 1995) generally refers to the difficulties involved in fitting models in many dimensions. As the dimensionality of the input data space (i.e., the number of predictors) increases, it becomes exponentially more difficult to find global optima for the models. Hence, it is simply a practical necessity to pre-screen and preselect, from among a large set of input (predictor) variables, those that are of likely utility for predicting the outcomes of interest. The curse of dimensionality is a significant obstacle in machine learning problems that involve learning from few data samples in a high-dimensional feature space.

See also: http://en.wikipedia.org/wiki/Curse_of_dimensionality
Descriptive Statistics, etc.

- Typically there is so much data in a database that it first must be summarized in order to begin to make any sense of it.
- The first step in data mining is describing and summarizing the data.
- Two main types of statistics commonly used to characterize a distribution of data are:
  - Measures of central tendency: mean, median, mode
  - Measures of dispersion: standard deviation, variance
- Visualization of the data is crucial. Patterns that can be seen by the eye leave a much stronger imprint than a table of numbers or statistics.
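For concreteness, a minimal sketch (not from the deck) of computing the two kinds of summary statistics named above, assuming Python with NumPy/SciPy and a small made-up column of data:

```python
# Central tendency and dispersion for a hypothetical variable.
import numpy as np
from scipy import stats

x = np.array([4.1, 3.9, 5.2, 4.4, 4.4, 6.0, 3.8, 4.7])

print("mean    ", x.mean())
print("median  ", np.median(x))
print("mode    ", stats.mode(x).mode)

print("variance", x.var(ddof=1))   # sample variance
print("std dev ", x.std(ddof=1))   # sample standard deviation
```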
Descriptive Statistics, etc.

[Figure: histogram of Var1 (Spreadsheet1, 1v*10000c), number of observations by value of Var1.]

A histogram is a simple yet effective way of summarizing information in a column or variable. We can quickly determine the range of the variable (min and max), the mean, median, mode, and variance.

Descriptive Statistics (Spreadsheet1)
Variable  Valid N  Mean      Median    Minimum   Maximum   Variance  Std.Dev.
Var1      10000    3.992616  3.677125  3.000038  11.46572  0.995422  0.997708
Descriptive Statistics, etc.

[Figure: scatterplots illustrating positive correlation and negative correlation.]

The correlation coefficient:

r_xy = Σ(x - x̄)(y - ȳ) / √( Σ(x - x̄)² · Σ(y - ȳ)² )
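A quick worked check of the formula above (not from the deck), assuming Python with NumPy and a tiny made-up sample:

```python
# Correlation coefficient computed directly from the formula and via np.corrcoef.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
print(num / den)                # direct formula
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy
```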
Linear and Logistic Regression

- Regression analysis is a statistical methodology that utilizes the relationship between two or more quantitative variables, so that one variable can be predicted from the others.
- We shall first consider linear regression. This is the case where the response variable is continuous. In logistic regression the response is dichotomous.
- Examples include:
  - Sales of a product can be predicted utilizing the relationship between sales and advertising expenditures.
  - The performance of an employee on a job can be predicted by utilizing the relationship between performance and a battery of aptitude tests.
  - The size of vocabulary of a child can be predicted by utilizing the relationship between size of vocabulary and age of the child and amount of education of the parents.
Linear and Logistic Regression

The simplest form of regression contains one predictor and one response: Y = β0 + β1*X. The slope and intercept terms are found such that the sum of squared deviations from the line is minimized. This is the principle of least squares.
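A minimal least-squares sketch (not from the deck), assuming Python with NumPy and synthetic data generated from a known line plus noise:

```python
# Fit Y = b0 + b1*X by least squares; np.polyfit minimizes the sum of squared deviations.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=50)   # true intercept 2.0, slope 0.8

b1, b0 = np.polyfit(x, y, deg=1)   # returns highest-degree coefficient first
print("intercept %.2f  slope %.2f" % (b0, b1))
```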
Linear and Logistic Regression

- Personnel professionals customarily use multiple regression procedures to determine equitable compensation. The personnel analyst usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:

  Salary = .5*(Amount of Responsibility) + .8*(Number of People To Supervise)

- What if the pattern in the data is NOT linear?
  - More predictors can be used.
  - Transformations can be applied.
  - Interactions and higher-order polynomial terms can be added (hence linear regression does not mean linear in the predictors, but rather linear in the parameters); that is, we can easily bend the line into curves that are nonlinear.
- We can develop really complicated models using this approach; however, this takes expertise on the part of the modeler in both the domain of application as well as with the methodology. Techniques that we will look at later can do a lot of the "grunt" work for us…
Linear and Logistic Regression

- What if the response is dichotomous?
- We might use linear regression to predict the 0/1 response. However, there are problems such as:
  - If you use linear regression, the predicted values will become greater than one and less than zero if you move far enough on the X-axis. Such values are theoretically inadmissible.
  - One of the assumptions of regression is that the variance of Y is constant across values of X (homoscedasticity). This cannot be the case with a binary variable, because the variance is PQ. When 50 percent of the people are 1s, then the variance is .25, its maximum value. As we move to more extreme values, the variance decreases. When P = .10, the variance is .1*.9 = .09, so as P approaches 1 or zero, the variance approaches zero.
  - The significance testing of the b weights rests upon the assumption that errors of prediction (Y-Y') are normally distributed. Because Y only takes the values 0 and 1, this assumption is pretty hard to justify, even approximately. Therefore, the tests of the regression weights are suspect if you use linear regression with a binary DV.
Linear and Logistic Regression

- There are a variety of regression applications in which the response variable has only two qualitative outcomes (male/female, default/no default, success/failure).
- In a study of labor force participation of wives, as a function of age of wife, number of children, and husband's income, the response variable was defined to have two outcomes: wife in labor force, wife not in labor force.
- In a longitudinal study of coronary heart disease as a function of age, gender, smoking history, cholesterol level, and blood pressure, the response variable was defined to have two possible outcomes: person developed heart disease, person did NOT develop heart disease during the study.
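For the dichotomous case, a minimal logistic-regression sketch (not from the deck), assuming Python with scikit-learn and invented predictor data standing in for variables such as age, number of children, and income:

```python
# Logistic regression for a two-outcome response on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))                                          # hypothetical predictors
y = (X @ np.array([0.8, -1.2, 0.5]) + rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print("class probabilities for a new case:", model.predict_proba([[0.2, 1.0, -0.3]])[0])
```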
ANOVA

- Analysis of Variance (ANOVA) allows you to test differences among means.
- The explanatory variables in an ANOVA are typically qualitative (gender, geographic location, plant shift, etc.).
- If predictor variables are quantitative, then no assumption is made about the nature of the regression function between response and predictors.
- Some examples include:
  - An experiment to study the effects of five different brands of gasoline on automobile operating efficiency (mpg).
  - An experiment to assess the effects of different amounts of a particular psychedelic drug on manual dexterity.
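A one-way ANOVA sketch for the gasoline example (not from the deck; the mpg measurements are invented), assuming Python with SciPy:

```python
# Test whether mean mpg differs across three gasoline brands.
from scipy import stats

brand_a = [24.1, 25.0, 23.8, 24.6]
brand_b = [26.2, 25.9, 26.8, 26.0]
brand_c = [24.9, 25.1, 24.4, 25.3]

f_stat, p_value = stats.f_oneway(brand_a, brand_b, brand_c)
print("F = %.2f, p = %.4f" % (f_stat, p_value))
```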
Discriminant Analysis

- Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups.
- For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education. For that purpose the researcher could collect data on numerous variables prior to students' graduation. After graduation, most students will naturally fall into one of the three categories. Discriminant Analysis could then be used to determine which variable(s) are the best predictors of students' subsequent educational choice.
- A medical researcher may record different variables relating to patients' backgrounds in order to learn which variables best predict whether a patient is likely (1) to recover completely (group 1), (2) partially (group 2), or (3) not at all (group 3).
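A minimal linear discriminant analysis sketch (not from the deck), assuming Python with scikit-learn and synthetic data standing in for three naturally occurring groups:

```python
# LDA on toy data with three groups.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=m, size=(40, 2)) for m in (0.0, 2.0, 4.0)])
groups = np.repeat([1, 2, 3], 40)   # e.g., college / trade school / no further training

lda = LinearDiscriminantAnalysis().fit(X, groups)
print("predicted group for a new case:", lda.predict([[1.8, 2.1]])[0])
```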
Decision Trees

- Decision trees are predictive models that can be viewed as a tree.
- Models can be made to predict categorical or continuous responses.
- A decision tree is nothing more than a sequence of if/then statements.

Some Advantages of Tree Models
- Easy to interpret results. In most cases, the interpretation of results summarized in a tree is very simple.
- Tree methods are nonparametric and nonlinear. The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or that they are even monotonic in nature.
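To show the "sequence of if/then statements" view concretely, here is a sketch (not from the deck) assuming Python with scikit-learn; the data and the rule the tree recovers are invented:

```python
# Fit a small classification tree and print it as if/then rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0.3).astype(int)   # toy rule the tree should recover

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
```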
Clustering

- Clustering is the method in which like records are grouped together.
- A simple example of clustering would be the clustering that one does when doing laundry: grouping the permanent press, dry cleaning, whites, and brightly colored clothes.
- This is straightforward, except for the white shirt with red stripes… where does this go?

Cluster Analysis: Marketing Application
A typical example application is a marketing research study where a number of consumer-behavior related variables are measured for a large sample of respondents; the purpose of the study is to detect "market segments," i.e., groups of respondents that are somehow more similar to each other (to all other members of the same cluster) when compared to respondents that "belong to" other clusters. In addition to identifying such clusters, it is usually equally of interest to determine how those clusters are different, i.e., the specific variables or dimensions on which the members in different clusters will vary, and how.
Clustering

Clustering techniques are used to…
- Identify characteristics of people belonging to different clusters (income, age, marital status, etc.).

Possible applications of results…
- Develop special marketing campaigns, services, or recommendations for particular types of stores based on those characteristics.
- Arrange stores according to the taste of different shopper groups.
- Enhance the overall attractiveness and quality of the shopping experience…

K-Means Clustering: The classic k-means algorithm was popularized and refined by Hartigan (1975; see also Hartigan and Wong, 1978). The basic operation of that algorithm is relatively simple: given a fixed number of (desired or hypothesized) k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible.

EM Clustering: The EM algorithm for clustering is described in detail in Witten and Frank (2001). The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data (distribution). Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters.
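A minimal sketch of both approaches (not from the deck), assuming Python with scikit-learn, synthetic two-dimensional data, and a fixed choice of k = 3 clusters:

```python
# k-means and EM (Gaussian mixture) clustering on the same toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)   # EM under the hood

print("k-means cluster centers:\n", kmeans.cluster_centers_)
print("EM (mixture) means:\n", gmm.means_)
```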
Neural Networks

Like most statistical models, neural networks are capable of performing major tasks including regression and classification. Regression tasks are concerned with relating a number of input variables x with a set of continuous outcomes t (target variables). By contrast, classification tasks assign class memberships to a categorical target variable given a set of input values. In the next section we will consider regression in more detail.
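As an illustration (not from the deck), a small multilayer perceptron used once for regression and once for classification, assuming Python with scikit-learn and synthetic data:

```python
# A small neural network (MLP) for a continuous target and a categorical target.
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 3))
y_cont = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)   # continuous target t
y_class = (y_cont > 0).astype(int)                                 # categorical target

reg = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y_cont)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y_class)

print("regression prediction:", reg.predict(X[:1]))
print("class prediction:     ", clf.predict(X[:1]))
```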
Association and Link Analysis

- Find items in a database that occur together (Association Rules).
- An association is an expression of the form: Body → Head (Support, Confidence).
- Example: If buys(x, "flashlight") → buys(x, "batteries") (250, 89%)
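To show how the support and confidence of one such rule are computed, here is a minimal sketch (not from the deck) in plain Python over a list of hypothetical transactions:

```python
# Support and confidence for the rule "flashlight -> batteries".
transactions = [
    {"flashlight", "batteries", "tent"},
    {"flashlight", "batteries"},
    {"batteries", "soda"},
    {"flashlight", "map"},
]

body, head = {"flashlight"}, {"batteries"}
n_body = sum(body <= t for t in transactions)            # transactions containing the body
n_both = sum((body | head) <= t for t in transactions)   # transactions containing body and head

support = n_both                       # often also reported as a fraction of all transactions
confidence = n_both / n_body
print("support =", support, " confidence = %.0f%%" % (100 * confidence))
```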
MSPC

- Built upon the capabilities of PCA and PLS techniques, MSPC is a selection of methods particularly designed for process monitoring and quality control in industrial batch processing.
- Batch processes are of considerable importance in making products with the desired specifications and standards in many sectors of industry, such as polymers, paint, fertilizers, pharmaceuticals/biopharm, cement, petroleum products, perfumes, and semiconductors.
- The objectives of batch processing are related to profitability, achieved by reducing product variability as well as increasing quality.
- From a quality point of view, batch processes can be divided into normal and abnormal batches. Generally speaking, a normal batch leads to a product with the desired specifications and standards. This is in contrast to abnormal batch runs, where the end product is expected to have poor quality.
- Another reason for batch monitoring is related to regulatory and safety purposes. Often industrial productions are required to keep full track (i.e., history) of the batch process for presentation of evidence on good quality control practice.

MSPC helps construct an effective engineering system that can be used to monitor the progression of a batch and predict the quality of the end product.
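The sketch below is only a rough illustration of the PCA-based monitoring idea (it is not the deck's MSPC procedure): fit PCA on data from normal batches, then flag new batches whose reconstruction error is unusually large. It assumes Python with scikit-learn; the data, threshold choice, and the abnormal shift are all invented.

```python
# PCA-based monitoring sketch: squared prediction error (SPE) against an empirical limit.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(10)
normal_batches = rng.normal(size=(100, 8))          # in-control process data (hypothetical)
pca = PCA(n_components=3).fit(normal_batches)

def spe(batch):
    """Squared prediction error of one batch against the PCA model."""
    reconstructed = pca.inverse_transform(pca.transform(batch.reshape(1, -1)))
    return float(np.sum((batch - reconstructed) ** 2))

limit = np.percentile([spe(b) for b in normal_batches], 99)          # empirical control limit
new_batch = rng.normal(size=8) + np.array([0, 0, 0, 4, 0, 0, 0, 0])  # simulated abnormal shift
print("SPE = %.2f, limit = %.2f, abnormal = %s" % (spe(new_batch), limit, spe(new_batch) > limit))
```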
Points to Remember…
Data mining is a tool, not a magic wand.
Data mining will not automatically discover solutions without guidance.
Predictive relationships found via data mining are not necessarily causes of an action or behavior.
To ensure meaningful results, it’s vital that you understand your data.
Data mining’s central quest: Find meaningful, effective patterns and avoid overfitting (finding random patterns by searching too many possibilities).