EECS 440: Machine Learning
Soumya Ray
http://engr.case.edu/ray_soumya/eecs440_fall13/
sray@case.edu
Office: Olin 516
Office hours: Th, Fri 1:30-2:30 or by appointment
Text: Machine Learning by Tom Mitchell
9/4/2013 1 Soumya Ray, Case Western Reserve U.
Announcements
• Office hours: Thursday, Friday 1:30-2:30 (or by appointment)
• PA1 out (check website later today), due Sep
19
Today
• Propositional supervised learning
• Decision tree induction
Supervised Learning
• Examples are annotated with target concept’s
output
• Learning system must find a concept that
matches annotations
• Example: learn to recognize objects
Supervised Learning
[Slide of example images with their labels: tiger, cow, elephant, starfish]
Feature Vector Representation
• Examples are attribute-value pairs (note “feature” == “attribute”)
• The number of attributes is fixed
• Can be written as an n-by-m matrix
             Attribute 1   Attribute 2   Attribute 3
Example 1    Value 11      Value 12      Value 13
Example 2    Value 21      Value 22      Value 23
Example 3    Value 31      Value 32      Value 33
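The n-by-m layout can be sketched directly (a minimal illustration; the numeric values are made up, standing in for the slide's "Value ij" placeholders):

```python
# Hypothetical sketch of the n-by-m feature-vector representation:
# rows are examples, columns are attributes.
X = [
    [11, 12, 13],   # Example 1: Value 11, Value 12, Value 13
    [21, 22, 23],   # Example 2
    [31, 32, 33],   # Example 3
]

n, m = len(X), len(X[0])   # n examples, m attributes (m is fixed)
print(n, m)                # 3 3
print(X[1])                # feature vector of Example 2: [21, 22, 23]
```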
Example
            Hasfur?   LongTeeth?   Scary?
Animal 1    Yes       No           No
Animal 2    No        Yes          Yes
Animal 3    Yes       Yes          Yes
Types of Features
• Discrete, Nominal: Color ∈ (red, blue, green)
• Continuous: Height
• Discrete, Ordered: Size ∈ (small, medium, large)
• Hierarchical: Shape ∈
      closed
      ├── polygon
      │   ├── triangle
      │   └── square
      └── continuous
          ├── circle
          └── ellipse
Feature Space
• We can think of examples as embedded in an n-dimensional vector space
[Figure: a point in a 3-dimensional feature space with axes Size, Weight, and Color, e.g. (Big, 2500, Gray)]
The Binary Classification Problem
• Simplest propositional supervised learning
problem
• Target concept assigns one of two labels (“positive” or “negative”) to all examples: the class label
• Can extend to “multi-class” classification and “regression”
Example
            Hasfur?   LongTeeth?   Scary?     Lion?
Animal 1    Yes       No           No         No
Animal 2    No        Yes          Yes        No
Animal 3    Yes       Yes          Yes        Yes

The attribute columns form X (entry x_ij is the value of attribute j for example i); the label column is Y. Each example is a pair (x_i, y_i).
The Learning Problem
• Given: A binary classification problem
• Do: Produce a “classifier” (concept) that
assigns a label to a new example
Binary Classifier Concept Geometry
• An N-dimensional volume in feature space (possibly a disjoint collection)
[Figure: feature space with axes Size, Color, and Weight, divided by a decision boundary / separating surface]
Decision Tree Induction
Decision Trees
• A “classical” (1980s) machine learning
algorithm for classification
• Widely used and extremely popular, available
in nearly all ML toolkits
• Not to be confused with decision trees in
decision theory
What is a Decision Tree?
• Tree: directed acyclic graph, each node has at
most one parent
• Internal nodes: Tests on attributes
• Leaves: Class labels
Example
            Hasfur?   LongTeeth?   Scary?     Lion?
Animal 1    Yes       No           No         No
Animal 2    No        Yes          Yes        No
Animal 3    Yes       Yes          Yes        Yes
Example
LongTeeth=Yes?
├── False → NotLion
└── True  → Scary=Yes?
            ├── False → NotLion
            └── True  → Hasfur=Yes?
                        ├── False → NotLion
                        └── True  → Lion
Classification with a decision tree
• Suppose we are given a tree and a new
example
• Starting at the root, apply each attribute test to the example
• The tests identify a path through the tree; follow it until we reach a leaf
• Assign the class label in the leaf
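The traversal above can be sketched in a few lines (a minimal sketch; the nested-tuple tree encoding is an illustrative choice, not from the slides — it encodes the Lion tree used as the running example):

```python
# Internal nodes are (attribute, branches) tuples; leaves are class labels.
tree = ("LongTeeth", {                 # test LongTeeth=Yes?
    "No": "NotLion",
    "Yes": ("Scary", {                 # test Scary=Yes?
        "No": "NotLion",
        "Yes": ("Hasfur", {            # test Hasfur=Yes?
            "No": "NotLion",
            "Yes": "Lion",
        }),
    }),
})

def classify(node, example):
    """Follow attribute tests from the root until a leaf label is reached."""
    while isinstance(node, tuple):     # internal node: keep descending
        attribute, branches = node
        node = branches[example[attribute]]
    return node                        # leaf: the class label

animal1 = {"Hasfur": "Yes", "LongTeeth": "Yes", "Scary": "No"}
print(classify(tree, animal1))         # NotLion
```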
Example
LongTeeth=Yes?
├── False → NotLion
└── True  → Scary=Yes?
            ├── False → NotLion
            └── True  → Hasfur=Yes?
                        ├── False → NotLion
                        └── True  → Lion

            Hasfur?   LongTeeth?   Scary?
Animal 1    Yes       Yes          No

(Animal 1 follows LongTeeth=Yes, then Scary=No, reaching the leaf NotLion.)
Decision Tree Induction
• Given a set of examples, produce a decision tree
• Decision tree induction works using the idea of
recursive partitioning
– At each step, the algorithm will choose an attribute
test
– The chosen test will partition the examples into
disjoint partitions
– The algorithm will then recursively call itself on each
partition until
• a partition only has data from one class OR
• it runs out of attributes
Choosing an Attribute
• Which attribute should we choose to test
first?
– Ideally, the one that is “most predictive” of the
class label
• i.e., the one that gives us the “most information” about
what the label should be
• This idea is captured by the concept of
“(Shannon) entropy” of a random variable
Entropy of a Random Variable
• Suppose a random variable X has density p(x).
Its (Shannon) “entropy” is defined by:
  H(X) = E[−log₂(p(X))] = −Σ_x p(X=x) log₂(p(X=x))

• Note: 0·log(0) = 0.
Example
• Suppose X has two values, 0 and 1, and pdf
p(0)=0.5, p(1)=0.5
– Then H(X)=?
• Suppose X has two values, 0 and 1, and pdf
p(0)=0.99, p(1)=0.01
– Then H(X)=?
• Suppose X has two values, 0 and 1, and pdf
p(0)=0.01, p(1)=0.99
– Then H(X)=?
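The three cases above can be checked numerically (a small helper sketch, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p log2(p), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: maximum uncertainty
print(entropy([0.99, 0.01]))  # ~0.0808 bits: nearly deterministic
print(entropy([0.01, 0.99]))  # same as above, by symmetry
```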
Entropy of a Bernoulli r.v.
[Plot: H(p) as a function of p over [0, 1]; H rises from 0 at p = 0 to a maximum of 1 at p = 0.5, then falls back to 0 at p = 1]

Entropy is typically denoted by H(·).
What is entropy?
• Measure of “information content” in a
distribution
• Suppose we wanted to describe an r.v. X with n values and distribution p(X=x)
– The shortest lossless description takes −log₂(p(x)) bits for each x
– So entropy is the expected length of the shortest lossless description of the r.v.
Source Coding Theorem,
Claude Shannon 1948
What is entropy?
• Alternatively, the minimum expected number
of binary questions to pin down the value of X
What’s the connection?
• Entropy measures the information content of
a random variable
• Suppose we treat the class variable, Y, as a
random variable and measure its entropy
• Then we measure its entropy after
partitioning the examples with an attribute X
The Entropy Connection
• The difference will be a measure of the
“information gained” about Y by partitioning
the examples with X
• So if we can choose the attribute X that
maximizes this “information gain”, we have
found what we needed
Information Gain
• IG(X) = expected reduction in entropy of the class label if the data is partitioned using X
• Suppose at some point we have N training examples, of which pos are labeled “positive” and neg are labeled “negative” (pos + neg = N)
• We’ll treat the class label as a Bernoulli r.v. Y that takes value 1 with prob. p+ = pos/N and 0 with prob. p− = neg/N
Information Gain contd.
• Then H(Y) = −p+ log₂(p+) − p− log₂(p−)
• Suppose an attribute X takes two values, 1 and 0. After partitioning, we get the quantities p+^{X1}, p−^{X1}, p+^{X0}, and p−^{X0} (the class fractions within each partition). Then:

  H(Y | X=1) = −p+^{X1} log₂(p+^{X1}) − p−^{X1} log₂(p−^{X1})
  H(Y | X=0) = −p+^{X0} log₂(p+^{X0}) − p−^{X0} log₂(p−^{X0})
  H(Y | X)   = p(X=1) H(Y | X=1) + p(X=0) H(Y | X=0)
  IG(X)      = H(Y) − H(Y | X)
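These formulas translate directly into code (a minimal sketch; the example data at the end is made up to show a perfectly predictive attribute):

```python
import math

def bernoulli_entropy(p_pos):
    """H(Y) = -p+ log2(p+) - p- log2(p-), with 0*log(0) = 0."""
    if p_pos in (0.0, 1.0):
        return 0.0
    return -p_pos * math.log2(p_pos) - (1 - p_pos) * math.log2(1 - p_pos)

def info_gain(examples):
    """IG(X) = H(Y) - H(Y|X) for a binary attribute X and binary label Y.
    `examples` is a list of (x, y) pairs with x, y in {0, 1}."""
    n = len(examples)
    h_y = bernoulli_entropy(sum(y for _, y in examples) / n)
    h_y_given_x = 0.0
    for v in (0, 1):
        part = [y for x, y in examples if x == v]
        if part:
            # weight each partition's entropy by p(X = v)
            h_y_given_x += (len(part) / n) * bernoulli_entropy(sum(part) / len(part))
    return h_y - h_y_given_x

# A perfectly predictive attribute recovers all of H(Y):
print(info_gain([(0, 0), (0, 0), (1, 1), (1, 1)]))  # 1.0
```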
Flowchart
[Flowchart: at the root, p+ = pos/N and p− = neg/N. The test X=Yes? splits the examples: the True branch holds a fraction p(X=Yes) = N_{X=Yes}/N of them, with class fractions p+^{X=Yes} = pos_{X=Yes}/N_{X=Yes} and p−^{X=Yes} = neg_{X=Yes}/N_{X=Yes}, giving entropy H(Y | X=Yes); the False branch likewise has fraction p(X≠Yes) = N_{X≠Yes}/N and entropy H(Y | X≠Yes). H(Y | X) is the weighted sum of the two branch entropies.]
Example
            Hasfur?   LongTeeth?   Scary?     Lion?
Animal 1    Yes       No           No         No
Animal 2    No        Yes          Yes        No
Animal 3    Yes       No           Yes        Yes
Animal 4    No        Yes          Yes        Yes
ID3 Algorithm: Training Phase
• Select the attribute “most predictive” of the class label (the one maximizing IG(X))
• Partition the data based on this attribute, and remove the attribute
• If all partitions are “pure” OR no attributes remain, stop
– Create a leaf node with the majority class
• Else, recursively process each impure partition
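The steps above can be sketched as a short recursive function (an illustrative sketch of the recursive-partitioning idea, not Quinlan's full ID3 — no continuous attributes or pruning; the demo data is the three-animal Lion table from these slides):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2(p(y)) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """IG(attr) = H(Y) - H(Y|attr) on a set of (features, label) pairs."""
    n = len(examples)
    h_cond = 0.0
    for v in {x[attr] for x, _ in examples}:
        part = [y for x, y in examples if x[attr] == v]
        h_cond += (len(part) / n) * entropy(part)
    return entropy([y for _, y in examples]) - h_cond

def id3(examples, attributes):
    """Recursive partitioning: pick the max-IG attribute, split, remove it,
    and recurse until a partition is pure or attributes run out."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    best = max(attributes, key=lambda a: info_gain(examples, a))
    branches = {}
    for v in {x[best] for x, _ in examples}:          # one branch per value
        part = [(x, y) for x, y in examples if x[best] == v]
        branches[v] = id3(part, [a for a in attributes if a != best])
    return (best, branches)

data = [
    ({"Hasfur": "Yes", "LongTeeth": "No",  "Scary": "No"},  "NotLion"),
    ({"Hasfur": "No",  "LongTeeth": "Yes", "Scary": "Yes"}, "NotLion"),
    ({"Hasfur": "Yes", "LongTeeth": "Yes", "Scary": "Yes"}, "Lion"),
]
tree = id3(data, ["Hasfur", "LongTeeth", "Scary"])
```

On this tiny data set all three attributes tie on IG at the root, so the first one in the list is chosen; a real implementation would need an explicit tie-breaking rule.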