EECS 440: Machine Learning


Soumya Ray
http://engr.case.edu/ray_soumya/eecs440_fall13/
sray@case.edu

Office: Olin 516
Office hours: Th, Fri 1:30-2:30 or by appointment
Text: Machine Learning by Tom Mitchell
Announcements
• Office hours: Thursday, Friday 1:30-2:30 (or by appointment)
• PA1 out (check website later today), due Sep 19
Today
• Propositional supervised learning
• Decision tree induction
Supervised Learning
• Examples are annotated with the target concept’s output

• The learning system must find a concept that matches the annotations

• Example: learn to recognize objects
Supervised Learning
[Figure: example images labeled tiger, cow, elephant, starfish]
Feature Vector Representation
• Examples are attribute-value pairs (note “feature” == “attribute”)
• The number of attributes is fixed
• Can be written as an n-by-m matrix
            Attribute 1   Attribute 2   Attribute 3
Example 1   Value 11      Value 12      Value 13
Example 2   Value 21      Value 22      Value 23
Example 3   Value 31      Value 32      Value 33
Example
           Has-fur?   Long-Teeth?   Scary?
Animal 1   Yes        No            No
Animal 2   No         Yes           Yes
Animal 3   Yes        Yes           Yes
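As a minimal sketch (not from the slides), this table can be stored as such an n-by-m matrix, one row per example and one column per attribute; the variable names are illustrative:

```python
# Illustrative only: the example table as an n-by-m (here 3-by-3) matrix
# (rows = examples, columns = attributes).
attributes = ["Has-fur?", "Long-Teeth?", "Scary?"]
X = [
    ["Yes", "No",  "No"],   # Animal 1
    ["No",  "Yes", "Yes"],  # Animal 2
    ["Yes", "Yes", "Yes"],  # Animal 3
]
```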
Types of Features
• Discrete, Nominal: Color ∈ {red, blue, green}

• Continuous: Height

• Discrete, Ordered: Size ∈ {small, medium, large}

• Hierarchical: Shape ∈ a hierarchy, e.g.:
    closed
    ├── polygon: triangle, square
    └── continuous: circle, ellipse
Feature Space
• We can think of examples as embedded in an n-dimensional vector space

[Figure: a 3-D feature space with axes Size, Weight, and Color; one example shown at the point (Big, 2500, Gray)]
The Binary Classification Problem
• The simplest propositional supervised learning problem

• The target concept assigns one of two labels (“positive” or “negative”) to all examples: the class label

• Can be extended to “multiclass” classification and “regression”
Example
           Has-fur?   Long-Teeth?   Scary?   Lion?
Animal 1   Yes        No            No       No
Animal 2   No         Yes           Yes      No
Animal 3   Yes        Yes           Yes      Yes

Each cell is a value x_ij (attribute j of example i); the attribute columns together form X, the Lion? column forms Y, and each row is a labeled example (x_i, y_i).
The Learning Problem
• Given: A binary classification problem

• Do: Produce a “classifier” (concept) that assigns a label to a new example
Binary Classifier Concept Geometry

• An n-dimensional volume in feature space (possibly a disjoint collection)

[Figure: a 3-D feature space with axes Size, Weight, and Color; the positive region is bounded by a decision boundary/separating surface]
Decision Tree Induction
Decision Trees
• A “classical” (1980s) machine learning algorithm for classification

• Widely used and extremely popular, available in nearly all ML toolkits

• Not to be confused with decision trees in decision theory
What is a Decision Tree?
• Tree: a directed acyclic graph in which each node has at most one parent

• Internal nodes: tests on attributes

• Leaves: class labels
Example
           Has-fur?   Long-Teeth?   Scary?   Lion?
Animal 1   Yes        No            No       No
Animal 2   No         Yes           Yes      No
Animal 3   Yes        Yes           Yes      Yes
Example
Long-Teeth=Yes?
├── False → Not-Lion
└── True → Scary=Yes?
    ├── False → Not-Lion
    └── True → Has-fur=Yes?
        ├── False → Not-Lion
        └── True → Lion
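As a minimal sketch (not from the lecture), this tree can be encoded as nested tuples and dictionaries, with internal nodes holding an attribute test and leaves holding class labels; the encoding and names are illustrative:

```python
# Illustrative encoding: internal node = (attribute, {value: subtree}),
# leaf = class label string.
lion_tree = ("Long-Teeth?", {
    "No":  "Not-Lion",
    "Yes": ("Scary?", {
        "No":  "Not-Lion",
        "Yes": ("Has-fur?", {
            "No":  "Not-Lion",
            "Yes": "Lion",
        }),
    }),
})
```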
Classification with a decision tree
• Suppose we are given a tree and a new example
• Starting at the root, apply each attribute test
• This identifies a path through the tree; follow it until we reach a leaf
• Assign the class label in the leaf (a code sketch of this procedure follows)
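A minimal Python sketch of this procedure, assuming the nested-tuple tree encoding above (function and variable names are illustrative, not from the lecture):

```python
def classify(tree, example):
    """Follow attribute tests from the root down to a leaf and
    return the class label stored at that leaf."""
    node = tree
    while isinstance(node, tuple):       # internal node: (attribute, branches)
        attribute, branches = node
        node = branches[example[attribute]]
    return node                          # leaf: a class label

# The query from the next slide: fur and long teeth, but not scary.
animal = {"Has-fur?": "Yes", "Long-Teeth?": "Yes", "Scary?": "No"}
print(classify(lion_tree, animal))       # -> Not-Lion
```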
Example
Long-Teeth=Yes?
├── False → Not-Lion
└── True → Scary=Yes?
    ├── False → Not-Lion
    └── True → Has-fur=Yes?
        ├── False → Not-Lion
        └── True → Lion

           Has-fur?   Long-Teeth?   Scary?
Animal 1   Yes        Yes           No

Following the path: Long-Teeth=Yes is True, then Scary=Yes? is False, so the tree labels this example Not-Lion.
Decision Tree Induction
• Given a set of examples, produce a decision tree
• Decision tree induction works using the idea of recursive partitioning (see the sketch after this list):
  – At each step, the algorithm chooses an attribute test
  – The chosen test partitions the examples into disjoint partitions
  – The algorithm then recursively calls itself on each partition until
    • a partition only has data from one class, OR
    • it runs out of attributes
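A minimal sketch of the partitioning step, assuming each example is a (features, label) pair where features is a dict of attribute values (names are illustrative, not from the lecture):

```python
from collections import defaultdict

def partition_by(examples, attribute):
    """Split (features, label) pairs into disjoint groups,
    one group per value of the chosen attribute."""
    groups = defaultdict(list)
    for features, label in examples:
        groups[features[attribute]].append((features, label))
    return groups

# The three-animal table, partitioned on Has-fur?:
animals = [
    ({"Has-fur?": "Yes", "Long-Teeth?": "No",  "Scary?": "No"},  "No"),
    ({"Has-fur?": "No",  "Long-Teeth?": "Yes", "Scary?": "Yes"}, "No"),
    ({"Has-fur?": "Yes", "Long-Teeth?": "Yes", "Scary?": "Yes"}, "Yes"),
]
print({v: len(part) for v, part in partition_by(animals, "Has-fur?").items()})
# -> {'Yes': 2, 'No': 1}
```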
Choosing an Attribute
• Which attribute should we choose to test first?
  – Ideally, the one that is “most predictive” of the class label
    • i.e., the one that gives us the “most information” about what the label should be

• This idea is captured by the concept of the “(Shannon) entropy” of a random variable
Entropy of a Random Variable
• Suppose a random variable X has density p(x). Its (Shannon) “entropy” is defined by:

  $H(X) = \mathbb{E}[-\log_2 p(X)] = -\sum_x p(X=x)\,\log_2 p(X=x)$

• Note: 0 log(0) = 0.
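A minimal Python sketch of this definition (the function name is illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a distribution given as a list of
    probabilities; the p > 0 filter applies the 0*log(0) = 0 convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```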
Example
• Suppose X has two values, 0 and 1, and pmf p(0)=0.5, p(1)=0.5
  – Then H(X)=?
• Suppose X has two values, 0 and 1, and pmf p(0)=0.99, p(1)=0.01
  – Then H(X)=?
• Suppose X has two values, 0 and 1, and pmf p(0)=0.01, p(1)=0.99
  – Then H(X)=?

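Using the sketch above, the three cases work out as follows:

```python
print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.99, 0.01]))  # ~0.081 bits: the outcome is nearly certain
print(entropy([0.01, 0.99]))  # ~0.081 bits: entropy is symmetric in p and 1-p
```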
Entropy of a Bernoulli r.v.
[Figure: H(p) plotted against p for a Bernoulli r.v.; H(p) = 0 at p = 0 and p = 1 and peaks at 1 bit when p = 0.5. Entropy is typically denoted by H(·).]
What is entropy?
• A measure of the “information content” in a distribution
• Suppose we wanted to describe an r.v. X with n values and distribution p(X=x)
  – The shortest lossless description takes $-\log_2 p(x)$ bits for each x
  – So entropy is the expected length of the shortest lossless description of the r.v. (Source Coding Theorem, Claude Shannon 1948)
What is entropy?
• Alternatively, the minimum expected number of binary questions needed to pin down the value of X
What’s the connection?
• Entropy measures the information content of a random variable

• Suppose we treat the class variable, Y, as a random variable and measure its entropy

• Then we measure its entropy after partitioning the examples with an attribute X
The Entropy Connection
• The difference will be a measure of the “information gained” about Y by partitioning the examples with X

• So if we can choose the attribute X that maximizes this “information gain”, we have found what we needed
Information Gain
• IG(X) = expected reduction in entropy of the class label if the data is partitioned using X

• Suppose at some point we have N training examples, of which pos are labeled “positive” and neg are labeled “negative” (pos + neg = N)

• We’ll treat the class label as a Bernoulli r.v. Y that takes value 1 with prob. $p_+ = pos/N$ and 0 with prob. $p_- = neg/N$
Information Gain contd.
• Then $H(Y) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$

• Suppose an attribute X takes two values, 1 and 0. After partitioning, we get the quantities $p_+^{X1}$, $p_-^{X1}$, $p_+^{X0}$, and $p_-^{X0}$. Then,

$$
\begin{aligned}
H(Y \mid X=1) &= -p_+^{X1} \log_2 p_+^{X1} - p_-^{X1} \log_2 p_-^{X1} \\
H(Y \mid X=0) &= -p_+^{X0} \log_2 p_+^{X0} - p_-^{X0} \log_2 p_-^{X0} \\
H(Y \mid X) &= p(X=1)\, H(Y \mid X=1) + p(X=0)\, H(Y \mid X=0) \\
IG(X) &= H(Y) - H(Y \mid X)
\end{aligned}
$$
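A minimal Python sketch of these formulas for a binary class label, reusing the entropy and partition_by helpers sketched earlier (the function names and the “Yes”/“No” label convention are assumptions, not from the lecture):

```python
def label_entropy(examples):
    """H(Y) for a list of (features, label) pairs with "Yes"/"No" labels."""
    n = len(examples)
    pos = sum(1 for _, label in examples if label == "Yes")
    return entropy([pos / n, (n - pos) / n])

def info_gain(examples, attribute):
    """IG(X) = H(Y) - H(Y|X): class entropy minus its expected value
    after partitioning the examples on `attribute`."""
    n = len(examples)
    conditional = sum(
        len(part) / n * label_entropy(part)
        for part in partition_by(examples, attribute).values()
    )
    return label_entropy(examples) - conditional
```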
Flowchart
The computation as a flowchart: at a node with N examples, $p_+ = pos/N$ and $p_- = neg/N$ give H(Y). The test X=Yes? splits the examples into two branches:

• True branch (the $N_{X=Yes}$ examples with X = Yes): $p_+^{X=Yes} = pos_{X=Yes}/N_{X=Yes}$ and $p_-^{X=Yes} = neg_{X=Yes}/N_{X=Yes}$, giving $H(Y \mid X=Yes)$

• False branch (the $N_{X\neq Yes}$ examples with X ≠ Yes): $p_+^{X\neq Yes} = pos_{X\neq Yes}/N_{X\neq Yes}$ and $p_-^{X\neq Yes} = neg_{X\neq Yes}/N_{X\neq Yes}$, giving $H(Y \mid X\neq Yes)$

Weighting the branch entropies by $p(X=Yes) = N_{X=Yes}/N$ and $p(X\neq Yes) = N_{X\neq Yes}/N$ and summing yields $H(Y \mid X)$.
Example
           Has-fur?   Long-Teeth?   Scary?   Lion?
Animal 1   Yes        No            No       No
Animal 2   No         Yes           Yes      No
Animal 3   Yes        No            Yes      Yes
Animal 4   No         Yes           Yes      Yes
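Running the sketched info_gain on this table (a worked check, not from the slides; values rounded) shows why Scary? would be tested first:

```python
data = [
    ({"Has-fur?": "Yes", "Long-Teeth?": "No",  "Scary?": "No"},  "No"),
    ({"Has-fur?": "No",  "Long-Teeth?": "Yes", "Scary?": "Yes"}, "No"),
    ({"Has-fur?": "Yes", "Long-Teeth?": "No",  "Scary?": "Yes"}, "Yes"),
    ({"Has-fur?": "No",  "Long-Teeth?": "Yes", "Scary?": "Yes"}, "Yes"),
]
for attr in ["Has-fur?", "Long-Teeth?", "Scary?"]:
    print(attr, round(info_gain(data, attr), 3))
# Has-fur? 0.0, Long-Teeth? 0.0, Scary? 0.311 -- only Scary? is informative.
```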
ID3 Algorithm: Training Phase
• Select the attribute “most predictive” of the class label, i.e., the one maximizing IG(X)
• Partition the data based on this attribute, and remove the attribute
• If all partitions are “pure” OR there are no more attributes, stop
  – Create a leaf node with the majority class
• Else, recursively process each impure partition (a sketch of the whole loop follows)
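Putting the pieces together, a minimal sketch of this training loop, reusing the partition_by and info_gain helpers sketched above (a simplified illustration of the slide’s description, not a full-featured ID3):

```python
from collections import Counter

def id3(examples, attributes):
    """Pick the attribute with the highest information gain, partition
    on it, and recurse; stop at pure partitions or when attributes run
    out, creating a majority-class leaf."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority class
    best = max(attributes, key=lambda a: info_gain(examples, a))
    remaining = [a for a in attributes if a != best]
    return (best, {
        value: id3(part, remaining)
        for value, part in partition_by(examples, best).items()
    })

tree = id3(data, ["Has-fur?", "Long-Teeth?", "Scary?"])
print(tree)  # rooted at Scary?, the highest-IG attribute
```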