Some Basic Concepts in Data Mining


CIS600/CSE 690: Analytical Data Mining


Some Basic Concepts in Data Mining

(This set will be updated as needed during the semester.) September 8, 2009




1. Supervised and unsupervised learning
2. Model selection and assessment
3. Training, validation and test data
4. Cross validation

1. Supervised and Unsupervised Learning

Supervised Learning

This type of learning is “similar” to human learning from experience. Since computers have no
experience, we provide previous data, called tr
aining data as a substitute. If is analogous to learning
from a teacher and hence the name supervised. Two such tasks in data mining are classification and
prediction. In classification, data attributes are related to a class label while in prediction they

are
related to a numerical value.


Unsupervised Learning

In this type of learning, we discover patterns in data attributes to learn or better understand the data.
Clustering algorithms are used to discover such patterns, i.e
.
, to determine data clusters.
Algorithms
are employed to organize data into groups (clusters) where members in
a

g
r
oup are similar in some
way and different from members in other groups.
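
To make the distinction concrete, here is a minimal Python sketch (an illustration, not part of the course material) that fits a classifier to labeled data and, separately, clusters the same attributes with the labels withheld. The synthetic dataset and the choice of decision tree and k-means are assumptions for demonstration only.

    # Hypothetical illustration: supervised vs. unsupervised learning.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    # Synthetic data: attribute matrix X and class labels y.
    X, y = make_blobs(n_samples=100, centers=2, random_state=0)

    # Supervised: the learner is given attribute/label pairs (the "teacher").
    clf = DecisionTreeClassifier().fit(X, y)
    print("predicted class of one item:", clf.predict(X[:1]))

    # Unsupervised: only the attributes are given; the algorithm must
    # discover groups (clusters) of similar items on its own.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster assignments (first 10):", km.labels_[:10])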


2. Model Selection and Assessment


In data mining applications we seek a model that both learns well from the available data and has good generalization performance. Towards this objective, we need a way to manage model complexity and a method to measure the performance of the chosen model. A common approach is to divide the data into three sets: training, validation and test. The training data are used to learn or develop candidate models, the validation set is used to select a model, and the test set is used for assessing model performance on future data. However, in many applications only two sets are created, training and test.
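
As a sketch of this idea, the following Python fragment (numpy assumed; the 8/4/4 sizes mirror the example in the next section) randomly partitions 16 item indices into the three sets.

    # Hypothetical sketch: random training/validation/test split.
    import numpy as np

    rng = np.random.default_rng(seed=0)
    idx = rng.permutation(16)     # random ordering of the 16 item indices
    train_idx = idx[:8]           # 8 items: develop candidate models
    val_idx = idx[8:12]           # 4 items: select among the models
    test_idx = idx[12:]           # 4 items: assess the chosen model
    print(train_idx, val_idx, test_idx)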

3. Training, Validation and Test Data


Example:

(A) We have data on 16 items, their attributes and class labels. RANDOMLY divide them into 8 for training, 4 for validation and 4 for testing.




Training

Item No.   d-Attributes     Class
 1         KNOWN FOR ALL      0
 2         DATA ITEMS         0
 3                            1
 4                            1
 5                            1
 6                            1
 7                            0
 8                            0

Validation
 9                            0
10                            0
11                            1
12                            0

Test
13                            0
14                            0
15                            1
16                            1


(B) Next, suppose we develop three classification models A, B, C from the training data. Let the training errors of these models be as shown below (recall that the models do not necessarily give perfect results on the training data, nor are they required to).








                                       Classification results from
Item No.   d-Attributes   True Class   Model A   Model B   Model C
 1                            0           0         1         1
 2         ALL KNOWN          0           0         0         0
 3                            1           0         1         0
 4                            1           1         0         1
 5                            1           0         0         0
 6                            1           1         1         1
 7                            0           0         0         0
 8                            0           0         0         0

Classification Error                     2/8       3/8       3/8
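
The error row can be checked mechanically; here is a minimal Python sketch (numpy assumed) that reproduces the 2/8, 3/8, 3/8 figures by reading the labels and predictions off the table above.

    # Training-set labels and each model's predictions, from the table.
    import numpy as np

    true_class = np.array([0, 0, 1, 1, 1, 1, 0, 0])   # items 1-8
    preds = {
        "A": np.array([0, 0, 0, 1, 0, 1, 0, 0]),
        "B": np.array([1, 0, 1, 0, 0, 1, 0, 0]),
        "C": np.array([1, 0, 0, 1, 0, 1, 0, 0]),
    }
    for name, p in preds.items():
        # Training error = fraction of items whose predicted class
        # differs from the true class.
        print(f"Model {name}: {np.mean(p != true_class):.3f}")  # 0.250, 0.375, 0.375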




(C) Next, use the three models A, B, C to classify each item in the validation set based on its attribute values. Recall that we also know their true labels. Suppose we get the following results:






                                       Classification results from
Item No.   d-Attributes   True Class   Model A   Model B   Model C
 9                            0           1         0         0
10                            0           0         1         0
11                            1           0         1         0
12                            0           0         1         0

Classification Error                     2/4       2/4       1/4

If we use minimum validation error as the model selection criterion, we would select model C.
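
The selection step itself is a simple argmin over the candidate models; a Python sketch (numpy assumed) using the validation-table values above:

    # Validation labels (items 9-12) and model predictions, from the table.
    import numpy as np

    true_class = np.array([0, 0, 1, 0])
    val_preds = {
        "A": np.array([1, 0, 0, 0]),
        "B": np.array([0, 1, 1, 1]),
        "C": np.array([0, 0, 0, 0]),
    }
    val_err = {m: np.mean(p != true_class) for m, p in val_preds.items()}
    best = min(val_err, key=val_err.get)  # model with minimum validation error
    print(val_err)                        # {'A': 0.5, 'B': 0.5, 'C': 0.25}
    print("selected model:", best)        # C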


(D) Now use model C to determine class values for each data point in the test set. We do so by substituting the (known) attribute values into classification model C. Again, recall that we know the true label of each of these data items, so we can compare the values obtained from the classification model with the true labels to determine the classification error on the test set. Suppose we get the following results.





                                       Classification results from
Item No.   d-Attributes   True Class   Model C
13                            0           0
14         ALL KNOWN          0           0
15                            1           0
16                            1           1

Classification Error                     1/4


(E) Based on the above, an estimate of the generalization error is 25%. What this means is that if we use Model C to classify future items, for which only the attributes will be known, not the class labels, we are likely to make incorrect classifications about 25% of the time.
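
The final assessment is one more error computation, now on the held-out test items; a short Python sketch (numpy assumed) with the values from the test table:

    # Test labels (items 13-16) and Model C's predictions, from the table.
    import numpy as np

    true_class = np.array([0, 0, 1, 1])
    model_c = np.array([0, 0, 0, 1])
    test_err = np.mean(model_c != true_class)
    print(f"estimated generalization error: {test_err:.0%}")  # 25%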


(F) A summary of the above (errors in %) is as follows:

Model   Training   Validation   Test
A         25           50       ----
B         37.5         50       ----
C         37.5         25        25

4. Cross Validation


If the available data are limited, we employ Cross Validation (CV). In this approach, the data are randomly divided into k (almost) equal sets. Training is done on (k-1) of the sets and the k-th set is used for testing. This process is repeated k times (k-fold CV). The average error over the k repetitions is used as a measure of the test error. For the special case when k equals the number of data items, so that each test set contains a single item, the above is called Leave-One-Out Cross-Validation (LOO-CV).
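
A minimal k-fold CV sketch in Python, assuming scikit-learn; the decision-tree classifier and synthetic 16-item dataset are stand-ins for whatever model and data are actually at hand:

    # Hypothetical sketch: k-fold cross validation (k = 4).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=16, random_state=0)
    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=4, shuffle=True,
                                     random_state=0).split(X):
        # Train on the other k-1 sets, test on the held-out set.
        clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        fold_errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))

    # The CV estimate of test error is the average over the k folds.
    print("estimated test error:", np.mean(fold_errors))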


EXAMPLE: Consider the above data consisting of 16 items.


(A) Let k = 4, i.e., 4-fold Cross Validation. Divide the data into four sets of 4 items each. Suppose the following setup occurs and the errors obtained are as shown.



                            Set 1         Set 2              Set 3             Set 4
Training                    Items 1-12    Items 1-8, 13-16   Items 1-4, 9-16   Items 5-16
Test                        Items 13-16   Items 9-12         Items 5-8         Items 1-4
Error on test set (assume)  25%           35%                28%               32%



Estimated Classification Error (CE) = (25 + 35 + 28 + 32) / 4 = 30%

(B) LOO-CV

For this, the data are divided into 16 sets, each consisting of 15 training items and one test item.



                            Set 1        Set 2            ...   Set 15          Set 16
Training                    Items 1-15   Items 1-14, 16   ...   Items 1, 3-16   Items 2-16
Test                        Item 16      Item 15          ...   Item 2          Item 1
Error on test set (assume)  0%           100%             ...   100%            100%


Suppose the Average Classification Error based on the values in the last row is CE = 32%. Then the estimate of the test error is 32%.
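
LOO-CV follows the same pattern with n = 16 folds, each testing on a single item, so each fold's error is either 0% or 100%; a sketch under the same scikit-learn assumptions as before:

    # Hypothetical sketch: Leave-One-Out Cross-Validation.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import LeaveOneOut
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=16, random_state=0)
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        # One held-out item per fold: the fold error is 0.0 or 1.0.
        errors.append(float(clf.predict(X[test_idx])[0] != y[test_idx][0]))

    print("LOO-CV estimate of test error:", np.mean(errors))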