Basic Data Mining Techniques


Chapter 3

3.1 Decision Trees

An Algorithm for Building Decision Trees


1. Let T be the set of training instances.

2. Choose an attribute that best differentiates the instances in T.

3. Create a tree node whose value is the chosen attribute.

   - Create child links from this node, where each link represents a unique
     value for the chosen attribute.

   - Use the child link values to further subdivide the instances into
     subclasses.

4. For each subclass created in step 3:

   - If the instances in the subclass satisfy predefined criteria, or if the
     set of remaining attribute choices for this path is null, specify the
     classification for new instances following this decision path.

   - If the subclass does not satisfy the criteria and there is at least one
     attribute to further subdivide the path of the tree, let T be the current
     set of subclass instances and return to step 2.
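
The sketch below (my own code, not the book's) turns these steps into runnable Python for instances stored as dicts of categorical attribute values. The algorithm above does not prescribe a particular goodness measure for step 2, so a simple majority-count purity score stands in for it here; Table 3.1, which follows, is the kind of data it expects (with the numeric Age attribute omitted or pre-binned).

```python
# A minimal sketch of the tree-building steps above (illustrative only).
# Instances are dicts of categorical attribute values.

from collections import Counter

def build_tree(instances, attributes, target):
    labels = [inst[target] for inst in instances]

    # Step 4 stopping test: the subclass is pure or no attributes remain;
    # return the classification for this decision path.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Step 2: pick the attribute that best differentiates the instances
    # (here: the attribute whose partition is most class-homogeneous).
    def purity(attr):
        score = 0
        for value in set(inst[attr] for inst in instances):
            branch = [i[target] for i in instances if i[attr] == value]
            score += Counter(branch).most_common(1)[0][1]
        return score

    best = max(attributes, key=purity)

    # Step 3: one child link per unique value of the chosen attribute,
    # then recurse on each subclass (back to step 2).
    return {best: {value: build_tree([i for i in instances if i[best] == value],
                                     [a for a in attributes if a != best],
                                     target)
                   for value in set(inst[best] for inst in instances)}}
```

The returned nested dict mirrors the tree structure: each key is a node attribute, each sub-key a child link value, and each leaf a classification.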

Table 3.1  The Credit Card Promotion Database

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40–50K         No                         No                      Male     45
30–40K         Yes                        No                      Female   40
40–50K         No                         No                      Male     42
30–40K         Yes                        Yes                     Male     43
50–60K         Yes                        No                      Female   38
20–30K         No                         No                      Female   55
30–40K         Yes                        Yes                     Male     35
20–30K         No                         No                      Male     27
30–40K         No                         No                      Male     43
30–40K         Yes                        No                      Female   41
40–50K         Yes                        No                      Female   43
20–30K         Yes                        No                      Male     29
50–60K         Yes                        No                      Female   39
40–50K         No                         No                      Male     55
20–30K         Yes                        Yes                     Female   19
Figure 3.1 A partial decision tree with root node = income range
(life insurance promotion counts by branch: 20–30K → 2 Yes, 2 No;
30–40K → 4 Yes, 1 No; 40–50K → 1 Yes, 3 No; 50–60K → 2 Yes, 0 No)
Figure 3.2 A partial decision tree with root node = credit card insurance
(life insurance promotion counts by branch: Yes → 3 Yes, 0 No; No → 6 Yes, 6 No)
Figure 3.3 A partial decision tree with root node = age
(life insurance promotion counts by branch: Age <= 43 → 9 Yes, 3 No; Age > 43 → 0 Yes, 3 No)
Decision Trees for the Credit Card Promotion Database

Figure 3.4 A three-node decision tree for the credit card database
(root: Age. Age > 43 → No (3/0). Age <= 43 → Sex; Sex = Female → Yes (6/0);
Sex = Male → Credit Card Insurance; Credit Card Insurance = Yes → Yes (2/0);
Credit Card Insurance = No → No (4/1))
Figure 3.5 A two-node decision tree for the credit card database
(root: Credit Card Insurance. Credit Card Insurance = Yes → Yes (3/0).
Credit Card Insurance = No → Sex; Sex = Female → Yes (6/1); Sex = Male → No (6/1))
Table 3.2  Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex    Age
40–50K         No                         No                      Male   42
20–30K         No                         No                      Male   27
30–40K         No                         No                      Male   43
20–30K         Yes                        No                      Male   29


Decision Tree Rules

A Rule for the Tree in Figure 3.4:

IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

A Simplified Rule Obtained by Removing Attribute Age:

IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
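
As a small illustration (mine, not the book's), the simplified rule can be written directly as a predicate over an instance stored as a dict:

```python
# The simplified rule above, expressed as a Python predicate (illustrative only).

def rule_fires(instance):
    """IF Sex = Male & Credit Card Insurance = No
       THEN Life Insurance Promotion = No"""
    return instance["Sex"] == "Male" and instance["Credit Card Insurance"] == "No"

# First instance of Table 3.2:
instance = {"Income Range": "40-50K", "Credit Card Insurance": "No",
            "Sex": "Male", "Age": 42}
print(rule_fires(instance))   # True, so the rule predicts Life Insurance Promotion = No
```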


Other Methods for Building Decision Trees

- CART
- CHAID


Advantages of Decision Trees

- Easy to understand.
- Map nicely to a set of production rules.
- Have been successfully applied to real-world problems.
- Make no prior assumptions about the data.
- Able to process both numerical and categorical data.



Disadvantages of Decision Trees

- Output attribute must be categorical.
- Limited to one output attribute.
- Decision tree algorithms are unstable.
- Trees created from numeric datasets can be complex.



3.2 Generating Association Rules

Confidence and Support

Rule Confidence

Given a rule of the form "If A then B", rule confidence is the conditional
probability that B is true when A is known to be true.

Rule Support

The minimum percentage of instances in the database that contain all items
listed in a given association rule.
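
A small sketch (not from the text) of these two measures, for transactions stored as dicts of attribute=value pairs:

```python
# Support and confidence, computed directly from a list of transactions
# (illustrative sketch).

def support(transactions, items):
    """Fraction of transactions that contain every attribute=value pair in items."""
    hits = sum(all(t.get(a) == v for a, v in items.items()) for t in transactions)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent is true | antecedent is true)."""
    both = {**antecedent, **consequent}
    return support(transactions, both) / support(transactions, antecedent)
```

Here support is returned as a fraction of the database; the "minimum percentage" in the definition above is the threshold that such a fraction is compared against when deciding which rules to keep.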

Mining Association Rules: An Example

Table 3.3  A Subset of the Credit Card Promotion Database

Magazine Promotion   Watch Promotion   Life Insurance Promotion   Credit Card Insurance   Sex
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female
No                   No                No                         No                      Male
Yes                  Yes               Yes                        Yes                     Male
Yes                  No                Yes                        No                      Female
No                   No                No                         No                      Female
Yes                  No                Yes                        Yes                     Male
No                   Yes               No                         No                      Male
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female
Table 3.4  Single-Item Sets

Single-Item Sets                 Number of Items
Magazine Promotion = Yes         7
Watch Promotion = Yes            4
Watch Promotion = No             6
Life Insurance Promotion = Yes   5
Life Insurance Promotion = No    5
Credit Card Insurance = No       8
Sex = Male                       6
Sex = Female                     4


Table 3.5  Two-Item Sets

Two-Item Sets                                                Number of Items
Magazine Promotion = Yes & Watch Promotion = No              4
Magazine Promotion = Yes & Life Insurance Promotion = Yes    5
Magazine Promotion = Yes & Credit Card Insurance = No        5
Magazine Promotion = Yes & Sex = Male                        4
Watch Promotion = No & Life Insurance Promotion = No         4
Watch Promotion = No & Credit Card Insurance = No            5
Watch Promotion = No & Sex = Male                            4
Life Insurance Promotion = No & Credit Card Insurance = No   5
Life Insurance Promotion = No & Sex = Male                   4
Credit Card Insurance = No & Sex = Male                      4
Credit Card Insurance = No & Sex = Female                    4
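
Tables 3.4 and 3.5 list only the item sets covered by at least four of the ten instances in Table 3.3. The sketch below (my own, not the book's code) reproduces that counting; size=1 gives the single-item sets and size=2 the two-item sets.

```python
# Item set counting behind Tables 3.4 and 3.5 (illustrative sketch).
# Instances are dicts of attribute=value pairs; an item set is a tuple of
# (attribute, value) pairs covered by at least `min_cover` instances.

from itertools import combinations
from collections import Counter

def frequent_item_sets(instances, size, min_cover=4):
    counts = Counter()
    for inst in instances:
        for combo in combinations(sorted(inst.items()), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_cover}
```

Each two-item set then seeds candidate rules. For example, the set {Magazine Promotion = Yes, Life Insurance Promotion = Yes} is covered 5 times, and Magazine Promotion = Yes alone is covered 7 times, so the rule IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes has confidence 5/7, about 71%.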
General Considerations

- We are interested in association rules that show a lift in product sales,
  where the lift is the result of the product's association with one or more
  other products.
- We are also interested in association rules that show a lower-than-expected
  confidence for a particular association.

3.3 The K-Means Algorithm

1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
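
A minimal sketch of these steps for two-dimensional points (my own code, not the book's; math.dist requires Python 3.8 or later):

```python
# K-Means for 2-D points, following the five steps above (illustrative sketch).

import math
import random

def k_means(points, k, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)          # step 2: initial centers
    while True:
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Step 4: recompute each center as the mean of its cluster.
        new_centers = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 5: stop when the centers no longer change.
        if new_centers == centers:
            return centers, clusters
        centers = new_centers
```

Applied to the six points of Table 3.6 below with K = 2, different initial centers lead to different final groupings, as summarized in Table 3.7.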

An Example Using K-Means

Table 3.6  K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0
Figure 3.6 A coordinate mapping of the data in Table 3.6
(scatter plot of the six instances, x vs. f(x))
Table 3.7  Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points    Squared Error
1         (2.67, 4.67)      2, 4, 6           14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3              15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5     9.60
          (5, 6)            6
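
As a quick check (mine) of the best outcome in Table 3.7, the squared error is the sum of squared Euclidean distances from each point to its assigned cluster center:

```python
# Squared error for outcome 3 of Table 3.7 (illustrative check).

points = {1: (1.0, 1.5), 2: (1.0, 4.5), 3: (2.0, 1.5),
          4: (2.0, 3.5), 5: (3.0, 2.5), 6: (5.0, 6.0)}
assignment = {(1.8, 2.7): [1, 2, 3, 4, 5], (5.0, 6.0): [6]}

error = sum((points[i][0] - cx) ** 2 + (points[i][1] - cy) ** 2
            for (cx, cy), members in assignment.items()
            for i in members)
print(round(error, 2))   # 9.6, reported as 9.60 in Table 3.7
```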

Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)
(scatter plot of the two clusters, x vs. f(x))
General Considerations

- Requires real-valued data.
- We must select the number of clusters present in the data.
- Works best when the clusters in the data are of approximately equal size.
- Attribute significance cannot be determined.
- Lacks explanation capabilities.

3.4 Genetic Learning

Genetic Learning Operators


- Selection
- Crossover
- Mutation

Genetic Algorithms and Supervised Learning

Figure 3.8 Supervised genetic learning
(components: population elements, a fitness function, and the training data;
elements are kept or thrown out based on fitness, and kept elements become
candidates for crossover and mutation)
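
A hedged sketch (my own) of the cycle in Figure 3.8. The fitness and crossover functions are supplied by the caller because the chapter does not fix specific formulas at this point; fitness is assumed to score a population element against the training data, and crossover to return two offspring (mutation, the third operator, is omitted for brevity):

```python
# The supervised genetic learning cycle of Figure 3.8 (illustrative sketch;
# fitness and crossover are caller-supplied placeholders).

import random

def evolve(population, training_data, fitness, crossover,
           generations=10, keep_fraction=0.5):
    for _ in range(generations):
        # Score every population element against the training data.
        scored = sorted(population,
                        key=lambda element: fitness(element, training_data),
                        reverse=True)
        # Keep the fittest elements; throw the rest out.
        keep = scored[:max(2, int(len(scored) * keep_fraction))]
        # Surviving elements become candidates for crossover.
        children = []
        while len(keep) + len(children) < len(population):
            parent1, parent2 = random.sample(keep, 2)
            children.extend(crossover(parent1, parent2))
        population = (keep + children)[:len(population)]
    return population
```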
Table 3.8  An Initial Population for Supervised Genetic Learning

Population Element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                    20–30K         No                         Yes                     Male     30–39
2                    30–40K         Yes                        No                      Female   50–59
3                    ?              No                         No                      Male     40–49
4                    30–40K         Yes                        Yes                     Male     40–49


Table 3.9  Training Data for Genetic Learning

Training Instance   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                   30–40K         Yes                        Yes                     Male     30–39
2                   30–40K         Yes                        No                      Female   40–49
3                   50–60K         Yes                        No                      Female   30–39
4                   20–30K         No                         No                      Female   50–59
5                   20–30K         No                         No                      Male     20–29
6                   30–40K         No                         No                      Male     40–49


Figure 3.9 A crossover operation

Parents (elements #1 and #2 of Table 3.8):

Population Element   Age     Sex      Credit Card Insurance   Life Insurance Promotion   Income Range
#1                   30–39   Male     Yes                     No                         20–30K
#2                   50–59   Female   No                      Yes                        30–40K

Offspring after crossover:

Population Element   Age     Sex      Credit Card Insurance   Life Insurance Promotion   Income Range
#2                   30–39   Male     Yes                     Yes                        30–40K
#1                   50–59   Female   No                      No                         20–30K
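
A sketch (mine, not the book's code) of the crossover shown in Figure 3.9: the two elements exchange the attribute values that lie to one side of the crossover point. The cut position of 3 used here is an assumption chosen to reproduce the swap in the figure.

```python
# Single-point crossover over attribute-value strings (illustrative sketch).

ATTRIBUTES = ["Age", "Sex", "Credit Card Insurance",
              "Life Insurance Promotion", "Income Range"]

def crossover(parent1, parent2, cut=3):
    """Swap all attribute values from position `cut` onward."""
    child1, child2 = dict(parent1), dict(parent2)
    for attr in ATTRIBUTES[cut:]:
        child1[attr], child2[attr] = parent2[attr], parent1[attr]
    return child1, child2

# Elements #1 and #2 of Table 3.8:
e1 = {"Age": "30-39", "Sex": "Male", "Credit Card Insurance": "Yes",
      "Life Insurance Promotion": "No", "Income Range": "20-30K"}
e2 = {"Age": "50-59", "Sex": "Female", "Credit Card Insurance": "No",
      "Life Insurance Promotion": "Yes", "Income Range": "30-40K"}
c1, c2 = crossover(e1, e2)   # c1 and c2 match the offspring in Figure 3.9
```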
Table 3.10  A Second-Generation Population

Population Element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
1                    20–30K         No                         No                      Female   50–59
2                    30–40K         Yes                        Yes                     Male     30–39
3                    ?              No                         No                      Male     40–49
4                    30–40K         Yes                        Yes                     Male     40–49


Genetic Algorithms and Unsupervised Clustering

Figure 3.10 Unsupervised genetic clustering
(P instances I1, ..., Ip described by attributes a1, a2, ..., an; candidate
solutions S1, S2, ..., SK, each made up of elements E11, E12, ..., Ek1, Ek2)
Table 3.11  A First-Generation Population for Unsupervised Clustering

                                         S1           S2           S3
Solution elements (initial population)   (1.0, 1.0)   (3.0, 2.0)   (4.0, 3.0)
                                         (5.0, 5.0)   (3.0, 5.0)   (5.0, 1.0)
Fitness score                            11.31        9.78         15.55
Solution elements (second generation)    (5.0, 1.0)   (3.0, 2.0)   (4.0, 3.0)
                                         (5.0, 5.0)   (3.0, 5.0)   (1.0, 1.0)
Fitness score                            17.96        9.78         11.34
Solution elements (third generation)     (5.0, 5.0)   (3.0, 2.0)   (4.0, 3.0)
                                         (1.0, 5.0)   (3.0, 5.0)   (1.0, 1.0)
Fitness score                            13.64        9.78         11.34
General Considerations

- Global optimization is not a guarantee.
- The fitness function determines the complexity of the algorithm.
- Genetic algorithms can explain their results, provided the fitness function
  is understandable.
- Transforming the data to a form suitable for genetic learning can be a
  challenge.

3.5 Choosing a Data Mining Technique

Initial Considerations

- Is learning supervised or unsupervised?
- Is explanation required?
- What is the interaction between input and output attributes?
- What are the data types of the input and output attributes?


Further Considerations

- Do We Know the Distribution of the Data?
- Do We Know Which Attributes Best Define the Data?
- Does the Data Contain Missing Values?
- Is Time an Issue?
- Which Technique Is Most Likely to Give the Best Test Set Accuracy?