EECS 440: Machine Learning

Soumya Ray

http://engr.case.edu/ray_soumya/eecs440_fall13/

sray@case.edu

Office: Olin 516

Office hours: Th, Fri 1:30-2:30 or by appointment

Text: Machine Learning by Tom Mitchell


Announcements

• Office hours: Thursday, Friday 1:30-2:30 (or by appointment)

• PA1 out (check website later today), due Sep 19


Today

• Propositional supervised learning

• Decision tree induction


Supervised Learning

• Examples are annotated with the target concept's output

• Learning system must find a concept that matches the annotations

• Example: learn to recognize objects


Supervised Learning


[Images of labeled example objects: tiger, cow, elephant, starfish]

Feature Vector Representation

• Examples are attribute-value pairs (note “feature”==“attribute”)

• Number of attributes is fixed

• Can be written as an n-by-m matrix

            Attribute 1   Attribute 2   Attribute 3
Example 1   Value 11      Value 12      Value 13
Example 2   Value 21      Value 22      Value 23
Example 3   Value 31      Value 32      Value 33
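As a concrete sketch (my own illustration, not from the slides), such an n-by-m matrix could be written in Python as follows; the Value entries are placeholders, as in the table above:

```python
import numpy as np

# Rows are examples, columns are attributes: an n-by-m matrix.
X = np.array([
    ["Value 11", "Value 12", "Value 13"],  # Example 1
    ["Value 21", "Value 22", "Value 23"],  # Example 2
    ["Value 31", "Value 32", "Value 33"],  # Example 3
])
print(X.shape)  # (3, 3): n examples by m attributes
```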

Example

            Has-fur?   Long-Teeth?   Scary?
Animal 1    Yes        No            No
Animal 2    No         Yes           Yes
Animal 3    Yes        Yes           Yes

Types of Features

• Discrete, Nominal: Color ∈ (red, blue, green)

• Continuous: Height

• Discrete, Ordered: Size ∈ (small, medium, large)

• Hierarchical: Shape ∈ a hierarchy:

  closed
  ├── polygon
  │   ├── triangle
  │   └── square
  └── continuous
      ├── circle
      └── ellipse

Feature Space

• We can think of examples embedded in an n-dimensional vector space

[Figure: a 3-D feature space with axes Size, Color, and Weight; one example plotted at Size=Big, Color=Gray, Weight=2500]

The Binary Classification Problem

• Simplest propositional supervised learning problem

• Target concept assigns one of two labels (“positive” or “negative”) to all examples---the class label

• Can extend to “multiclass” classification, “regression”


Example

            Has-fur?   Long-Teeth?   Scary?   Lion?
Animal 1    Yes        No            No       No
Animal 2    No         Yes           Yes      No
Animal 3    Yes        Yes           Yes      Yes

The feature values form the matrix X (entry $x_{ij}$ is attribute j of example i); the Lion? column is the label vector Y, and each example is a pair $(x_i, y_i)$.

The Learning Problem

• Given: A binary classification problem

• Do: Produce a “classifier” (concept) that assigns a label to a new example


Binary Classifier Concept Geometry

• N-dimensional volume in feature space (possibly a disjoint collection)

[Figure: a 3-D feature space with axes Size, Color, and Weight, showing a decision boundary/separating surface]

Decision Tree Induction


Decision Trees

• A “classical” (1980s) machine learning algorithm for classification

• Widely used and extremely popular, available in nearly all ML toolkits

• Not to be confused with decision trees in decision theory


What is a Decision Tree?

• Tree: directed acyclic graph, each node has at most one parent

• Internal nodes: Tests on attributes

• Leaves: Class labels

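As a concrete sketch (my own illustration, not from the slides), such a tree can be represented in Python; each “Attr=Yes?” test is written here as a branch on the attribute's value:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    attribute: Optional[str] = None               # attribute tested here (None at a leaf)
    children: Optional[Dict[str, "Node"]] = None  # one child per attribute value
    label: Optional[str] = None                   # class label (set only at a leaf)


# The "Lion" tree shown on the following slides, written out by hand:
lion_tree = Node(attribute="Long-Teeth?", children={
    "No": Node(label="Not-Lion"),
    "Yes": Node(attribute="Scary?", children={
        "No": Node(label="Not-Lion"),
        "Yes": Node(attribute="Has-fur?", children={
            "No": Node(label="Not-Lion"),
            "Yes": Node(label="Lion"),
        }),
    }),
})
```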

Example

            Has-fur?   Long-Teeth?   Scary?   Lion?
Animal 1    Yes        No            No       No
Animal 2    No         Yes           Yes      No
Animal 3    Yes        Yes           Yes      Yes

Example

Long-Teeth=Yes?
├── False: Not-Lion
└── True: Scary=Yes?
    ├── False: Not-Lion
    └── True: Has-fur=Yes?
        ├── False: Not-Lion
        └── True: Lion

Classification with a decision tree

• Suppose we are given a tree and a new example

• Starting at the root, check each attribute test

• This identifies a path through the tree; follow it until we reach a leaf

• Assign the class label in the leaf

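A sketch of this procedure in Python, using the hypothetical Node structure and lion_tree from the earlier example:

```python
def classify(node: Node, example: Dict[str, str]) -> str:
    """Follow attribute tests from the root down to a leaf and
    return that leaf's class label."""
    while node.label is None:            # internal node: apply its test
        value = example[node.attribute]  # e.g. "Yes" or "No"
        node = node.children[value]      # follow the matching branch
    return node.label


# Animal 1 on the next slide: Has-fur=Yes, Long-Teeth=Yes, Scary=No
print(classify(lion_tree, {"Has-fur?": "Yes", "Long-Teeth?": "Yes", "Scary?": "No"}))
# -> "Not-Lion": Long-Teeth=Yes, then Scary=No reaches a Not-Lion leaf
```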

Example

Long-Teeth=Yes?
├── False: Not-Lion
└── True: Scary=Yes?
    ├── False: Not-Lion
    └── True: Has-fur=Yes?
        ├── False: Not-Lion
        └── True: Lion

            Has-fur?   Long-Teeth?   Scary?
Animal 1    Yes        Yes           No

Decision Tree Induction

• Given a set of examples, produce a decision tree

• Decision tree induction works using the idea of recursive partitioning

  – At each step, the algorithm will choose an attribute test

  – The chosen test will partition the examples into disjoint partitions

  – The algorithm will then recursively call itself on each partition until

    • a partition only has data from one class OR

    • it runs out of attributes


Choosing an Attribute

• Which attribute should we choose to test first?

  – Ideally, the one that is “most predictive” of the class label

    • i.e., the one that gives us the “most information” about what the label should be

• This idea is captured by the concept of “(Shannon) entropy” of a random variable


Entropy of a Random Variable

• Suppose a random variable X has density p(x). Its (Shannon) “entropy” is defined by:

  $H(X) = E[-\log_2(p(X))] = -\sum_x p(X=x) \log_2(p(X=x))$

• Note: 0 log(0) = 0.
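This definition translates directly into Python (a minimal sketch; the function name is my own):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a distribution given as a list of
    probabilities. Zero-probability terms contribute 0, per 0 log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```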

Example

• Suppose X has two values, 0 and 1, and pdf p(0)=0.5, p(1)=0.5

  – Then H(X)=?

• Suppose X has two values, 0 and 1, and pdf p(0)=0.99, p(1)=0.01

  – Then H(X)=?

• Suppose X has two values, 0 and 1, and pdf p(0)=0.01, p(1)=0.99

  – Then H(X)=?

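Checking these with the entropy sketch above:

```python
print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.99, 0.01]))  # ~0.081 bits: nearly deterministic, little information
print(entropy([0.01, 0.99]))  # ~0.081 bits: entropy is symmetric in the probabilities
```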

Entropy of a Bernoulli r.v.

[Figure: plot of H(p) against p for a Bernoulli random variable, rising from 0 at p=0 to a maximum of 1 at p=0.5 and falling back to 0 at p=1]

• Entropy is typically denoted by H(.)
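The curve can be reproduced with a short script (a sketch using matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-6, 1 - 1e-6, 500)            # avoid log2(0) at the endpoints
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)  # Bernoulli entropy H(p)
plt.plot(p, H)
plt.xlabel("p")
plt.ylabel("H(p)")
plt.show()
```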

What is entropy?

• Measure of “information content” in a distribution

• Suppose we wanted to describe an r.v. X with n values and distribution p(X=x)

  – Shortest lossless description takes $-\log_2(p(x))$ bits for each x

  – So entropy is the expected length of the shortest lossless description of the r.v.

(Source Coding Theorem, Claude Shannon 1948)

What is entropy?

• Alternatively, the minimum expected number of binary questions to pin down the value of X


What’s the connection?

• Entropy measures the information content of a random variable

• Suppose we treat the class variable, Y, as a random variable and measure its entropy

• Then we measure its entropy after partitioning the examples with an attribute X


The Entropy Connection

• The difference will be a measure of the “information gained” about Y by partitioning the examples with X

• So if we can choose the attribute X that maximizes this “information gain”, we have found what we needed


Information Gain

• IG(X) = expected reduction in entropy of the class label if the data is partitioned using X

• Suppose at some point we have N training examples, of which pos are labeled “positive” and neg are labeled “negative” (pos+neg=N)

• We'll treat the class label as a Bernoulli r.v. Y that takes value 1 with prob. $p_+ = pos/N$ and 0 with prob. $p_- = neg/N$


Information Gain contd.

• Then $H(Y) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$

• Suppose an attribute X takes two values, 1 and 0. After partitioning, we get the quantities $p_+^{X1}$, $p_-^{X1}$, $p_+^{X0}$, and $p_-^{X0}$. Then,

  $H(Y|X=1) = -p_+^{X1} \log_2 p_+^{X1} - p_-^{X1} \log_2 p_-^{X1}$

  $H(Y|X=0) = -p_+^{X0} \log_2 p_+^{X0} - p_-^{X0} \log_2 p_-^{X0}$

  $H(Y|X) = p(X=1) H(Y|X=1) + p(X=0) H(Y|X=0)$

  $IG(X) = H(Y) - H(Y|X)$
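These formulas translate into a short Python sketch, reusing the entropy function above (function and variable names are my own assumptions):

```python
def information_gain(examples, labels, attribute):
    """IG(attribute) = H(Y) - H(Y | attribute). Each example is a dict of
    attribute values; labels is the parallel list of class labels."""
    def label_entropy(lbls):
        return entropy([lbls.count(c) / len(lbls) for c in set(lbls)])

    h_before = label_entropy(labels)
    h_after = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [y for ex, y in zip(examples, labels) if ex[attribute] == value]
        h_after += (len(subset) / len(labels)) * label_entropy(subset)  # weighted H(Y|X=value)
    return h_before - h_after
```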

Flowchart

[Flowchart: computing $H(Y|X)$ for a binary test X=Yes?]

At the root: $p_+ = pos/N$, $p_- = neg/N$

False branch (X ≠ Yes): $p_+^{X \neq Yes} = pos_{X \neq Yes}/N_{X \neq Yes}$ and $p_-^{X \neq Yes} = neg_{X \neq Yes}/N_{X \neq Yes}$ give $H(Y|X \neq Yes)$

True branch (X = Yes): $p_+^{X=Yes} = pos_{X=Yes}/N_{X=Yes}$ and $p_-^{X=Yes} = neg_{X=Yes}/N_{X=Yes}$ give $H(Y|X=Yes)$

Weighting by $p(X \neq Yes) = N_{X \neq Yes}/N$ and $p(X=Yes) = N_{X=Yes}/N$ combines these into $H(Y|X)$

Example

            Has-fur?   Long-Teeth?   Scary?   Lion?
Animal 1    Yes        No            No       No
Animal 2    No         Yes           Yes      No
Animal 3    Yes        No            Yes      Yes
Animal 4    No         Yes           Yes      Yes
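Running the information_gain sketch on this table gives a worked check (my own computation, not from the slides):

```python
animals = [
    {"Has-fur?": "Yes", "Long-Teeth?": "No",  "Scary?": "No"},   # Animal 1
    {"Has-fur?": "No",  "Long-Teeth?": "Yes", "Scary?": "Yes"},  # Animal 2
    {"Has-fur?": "Yes", "Long-Teeth?": "No",  "Scary?": "Yes"},  # Animal 3
    {"Has-fur?": "No",  "Long-Teeth?": "Yes", "Scary?": "Yes"},  # Animal 4
]
lion = ["No", "No", "Yes", "Yes"]

for attr in ["Has-fur?", "Long-Teeth?", "Scary?"]:
    print(attr, information_gain(animals, lion, attr))
# Has-fur?    0.0     (both partitions remain 50/50 splits)
# Long-Teeth? 0.0     (both partitions remain 50/50 splits)
# Scary?      ~0.311  (the Scary=No partition is pure), so Scary? is tested first
```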

ID3 Algorithm---Training phase

• Select attribute “most predictive” of class label (calculating IG(X))

• Partition data based on this attribute (maximizing IG(X)), and remove attribute

• If all partitions are “pure” OR no more attributes, stop

  – Create a leaf node with the majority class

• Else, recursively process each impure partition

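Putting the pieces together, a compact sketch of this training phase (assuming the hypothetical Node class and information_gain function from earlier; an illustration, not the original ID3 source):

```python
from collections import Counter

def id3(examples, labels, attributes):
    """Recursively build a decision tree, choosing the attribute with the
    highest information gain at each step."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:  # pure partition or no attributes left
        return Node(label=majority)
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    children = {}
    for value in set(ex[best] for ex in examples):
        sub = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == value]
        sub_ex, sub_y = [e for e, _ in sub], [y for _, y in sub]
        remaining = [a for a in attributes if a != best]  # remove the chosen attribute
        children[value] = id3(sub_ex, sub_y, remaining)
    return Node(attribute=best, children=children)


# e.g. tree = id3(animals, lion, ["Has-fur?", "Long-Teeth?", "Scary?"])
```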
