CS 424 Gregory Dudek
Lecture 14
• Learning
  – Probably approximately correct learning (cont’d)
  – Version spaces
  – Decision trees
PAC: definition
Relax this requirement by not requiring that the learning program necessarily achieve a small error, but only that it keep the error small with high probability.
A program is probably approximately correct (PAC) with probability δ and error at most ε if, given any set of training examples drawn according to the fixed distribution, the program outputs a hypothesis f such that
Pr(Error(f) > ε) < δ
PAC Training examples
Theorem: If the number of hypotheses |H| is finite, then a program that returns a hypothesis that is consistent with
m = ln(δ/|H|) / ln(1 − ε)
training examples (drawn according to Pr) is guaranteed to be PAC with probability δ and error bounded by ε.
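As a quick sanity check, this bound can be evaluated numerically; a minimal sketch (the hypothesis-space size and the ε, δ values are made-up illustrations, not from the lecture):

```python
import math

def pac_sample_size(num_hypotheses, eps, delta):
    """Smallest m with |H| * (1 - eps)^m <= delta, i.e.
    m = ln(delta / |H|) / ln(1 - eps), rounded up to an integer."""
    return math.ceil(math.log(delta / num_hypotheses) / math.log(1 - eps))

# e.g. |H| = 3^10 conjunctions, error eps = 0.1, failure probability delta = 0.05
m = pac_sample_size(3 ** 10, 0.1, 0.05)
```

Note that m grows only logarithmically in |H|, which is why even large finite hypothesis spaces can need modest amounts of data.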
We want…
• PAC (so far) describes the accuracy of the hypothesis, and the chances of finding such a concept.
  – How many examples do we need to rule out the “really bad” hypotheses?
• We also want the process to proceed quickly.
PAC learnable spaces
A class of concepts C is said to be PAC learnable for a hypothesis space H if there exists a polynomial-time algorithm A such that:
for any c ∈ C, distribution Pr, ε > 0, and δ > 0, if A is given a quantity of training examples polynomial in 1/ε and 1/δ, then with probability 1 − δ the algorithm will return a hypothesis f from H such that
error(f) ≤ ε.
Observations on PAC
• PAC learnability doesn’t tell us how to find the learning algorithm.
• The number of examples needed grows slowly as the concept space increases, and with the other key parameters.
Example
• Target and learned concepts are conjunctions with up to n predicates. (This is our bias.)
  – Each predicate might appear in either positive or negated form, or be absent: 3 options.
  – This gives 3^n possible conjunctions in the hypothesis space.
Result
• I have such a formula in mind.
• I’ll give you some examples.
• You try to guess what the formula is.
A concept that matches all our examples will be PAC if m is at least
(n/ε) ln(3/δ)
How
• How can we actually find a suitable concept?
• One key approach: start with the examples themselves, and try to generalize.
• E.g. given f(3,5) and f(5,5):
  – We might try replacing the first argument with a variable X: f(X,5).
  – We might try replacing both arguments with variables: f(X,Y).
  – We want to get as general as possible, but not too general.
• The converse of this generalization is specialization.
Version Space [RN 18.5; DAA 5.3]
• Deals with conjunctive concepts.
• Consider a concept C as being identified with the set of positive examples it is associated with.
  – C: odd-numbered hearts = {3♥, 5♥, 7♥, 9♥}.
• A concept C1 is a specialization of concept C2 if the examples associated with C1 are a subset of those associated with C2.
• 3♥ (the 3 of hearts) is more specialized than odd hearts.
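Under this set-based view, the specialization test is just a subset check; a minimal sketch (the string card encodings are hypothetical, chosen only for illustration):

```python
def is_specialization(c1, c2):
    """C1 is a specialization of C2 iff the positive examples
    associated with C1 are a subset of those associated with C2."""
    return set(c1) <= set(c2)

odd_hearts = {"3H", "5H", "7H", "9H"}   # hypothetical encoding of the cards
three_of_hearts = {"3H"}
# The 3 of hearts is more specialized than odd hearts,
# but odd hearts is not a specialization of the 3 of hearts.
```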
Specialization/Generalization
[Lattice figure: Cards at the top, specializing to Black and Red; Red specializes to Odd red and Even red; Even ♥ (red is implied); individual cards 3♥, 5♥, 7♥, 9♥ at the bottom.]
Immediate
• Immediate specialization: no intermediate.
• Red is not the immediate generalization of the 2♥ of hearts.
• Red is the immediate generalization of hearts and diamonds.
  – Note: this observation depends on knowing the hypothesis-space restriction.
Algorithm outline
• Incrementally process training data.
• Keep a list of the most and least specific concepts consistent with the observed data.
  – For two concepts A and B that are consistent with the data, the concept C = (A AND B) will also be consistent, yet more specific.
• Tied in a subtle way to conjunctions.
  – Disjunctive concepts can be obtained trivially by joining examples, but they’re not interesting.
VS example
• 4: no
• 5: yes
• 5: no
• 7: yes
• 9
• 3: yes
[Lattice figure: Cards at the top; Black and Red; Even red and Odd red; Odd ♥ (red is implied); 3♥, 5♥, 7♥, 9♥ at the bottom.]
Algorithm specifics
• Maintain two bounding concepts:
  – The most specialized (specific boundary, the S set)
  – The broadest (general boundary, the G set)
• Each example we see is either positive (yes) or negative (no).
• Positive examples (+) tend to make the concept more general (or inclusive). Negative examples (−) are used to make the concept more exclusive (to reject them).
  + -> move “up” the specific boundary
  − -> move “down” the general boundary
Detailed algorithm: RN p. 549; DAA pp. 191–192.
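These S/G updates can be sketched compactly for conjunctions over attribute tuples, with '?' standing for "any value". This is a simplified sketch, not the full algorithm from RN/DAA: in particular, S is kept as a single most-specific hypothesis, and non-maximal members of G are not pruned.

```python
def covers(h, x):
    """A conjunctive hypothesis covers x if every non-'?' attribute matches."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(examples):
    """examples: list of (attribute_tuple, is_positive).
    Returns (S, G): the specific hypothesis and the general boundary set.
    Assumes at least one positive example and noise-free data."""
    n = len(examples[0][0])
    S = next(x for x, pos in examples if pos)   # start S at a positive example
    G = {('?',) * n}                            # start G at the empty conjunction
    for x, pos in examples:
        if pos:
            # + -> generalize S just enough to cover x (move "up")
            S = tuple(sv if sv == xv else '?' for sv, xv in zip(S, x))
            # drop general hypotheses that fail to cover a positive example
            G = {g for g in G if covers(g, x)}
        else:
            # - -> specialize any g that wrongly covers x (move "down")
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)
                    continue
                for i in range(n):
                    # minimal specialization: pin one '?' to S's value,
                    # provided that excludes the negative example
                    if g[i] == '?' and S[i] != '?' and S[i] != x[i]:
                        new_G.add(g[:i] + (S[i],) + g[i + 1:])
            G = new_G
    return S, G
```

For instance, on the office-size data used later in the lecture, the first three examples already pin the concept down to "cs":

```python
exs = [(('large', 'cs', 'faculty'), True),
       (('large', 'ee', 'faculty'), False),
       (('large', 'cs', 'student'), True)]
S, G = candidate_elimination(exs)
# S == ('large', 'cs', '?'),  G == {('?', 'cs', '?')}
```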
Observations
• It allows you to GENERALIZE from a training set to examples never before seen!
  – In contrast, consider table lookup or rote learning.
• Why is that good?
  1. It allows you to infer things about new data (the whole point of learning).
  2. It allows you to (potentially) remember old data much more efficiently.
• The version space method is optimal for a conjunction of positive literals.
  – How does it perform with noisy data?
Restaurant Selector
Example attributes:
1. Alternate
2. Bar
3. Fri/Sat
4. Hungry
5. Patrons
6. Price
etc.
Forall restaurants r: Patrons(r, Full) AND WaitEstimate(r, under_10_min) AND Hungry(r, N) -> WillWait(r)
Example 2
Maybe we should have made a reservation? (Using a decision tree.)
• Restaurant lookup: you’ve heard Joe’s is good.
• Lookup Joe’s
• Lookup Chez Joe
• Lookup Restaurant Joe’s
• Lookup Bistro Joe’s
• Lookup Restaurant Chez Joe’s
• Lookup Le Restaurant Chez Joe
• Lookup Le Bar Restaurant Joe
• Lookup Le Restaurant Casa Chez Joe
Decision trees: issues
• Constructing a decision tree is easy… really easy!
  – Just add examples in turn.
• Difficulty: how can we extract a simplified decision tree?
  – This implies (among other things) establishing a preference order (bias) among alternative decision trees.
  – Finding the smallest one proves to be VERY hard. Improving over the trivial one is okay.
Office size example
Training examples:
1. large ^ cs ^ faculty -> yes
2. large ^ ee ^ faculty -> no
3. large ^ cs ^ student -> yes
4. small ^ cs ^ faculty -> no
5. small ^ cs ^ student -> no
The questions about office size, department and status tell us something about the mystery attribute.
Let’s encode all this as a decision tree.
Decision tree #1

          size
         /    \
     large    small
      /          \
    dept       no {4,5}
    /  \
  cs    ee
   |     |
  yes    no
 {1,3}  {2}
Decision tree #2

              status
             /      \
       faculty      student
         /             \
       dept            dept
       /  \            /  \
     cs    ee        ee    cs
      |     |         |     |
    size    no        ?   size
    /  \                  /  \
 large  small         large  small
   |      |             |      |
  yes    no {4}        yes    no {5}
Making a tree
How can we build a decision tree (that might be good)?
Objective: an algorithm that builds a decision tree from the root down. Each node in the decision tree is associated with a set of training examples that are split among its children.
• Input: a node in a decision tree with no children, associated with a set of training examples.
• Output: a decision tree that classifies all of the examples, i.e., all of the training examples stored in each leaf of the decision tree are in the same class.
Procedure: Buildtree
If all of the training examples are in the same class, then quit;
else:
  1. Choose an attribute to split the examples.
  2. Create a new child node for each attribute value.
  3. Redistribute the examples among the children according to the attribute values.
  4. Apply Buildtree to each child node.
Is this a good decision tree? Maybe? How do we decide?
A Bad tree
• To identify an animal (goat, dog, housecat, tiger):
• Is it a dog?
• Is it a housecat?
• Is it a tiger?
• Is it a goat?
• A good tree?
• Is it a cat (cat family)? (If yes, what kind?)
• Is it a dog?
• Max depth: 2 questions.
Best Property
• Need to select a property / feature / attribute.
• Goal: find a short tree (Occam’s razor).
  – Base this on MAXIMUM depth.
• Select the most informative feature:
  – One that best splits (classifies) the examples.
• Use a measure from information theory:
  – Claude Shannon (1949).
Entropy
• Entropy is often described as a measure of disorder.
  – Maybe better to think of it as a measure of unpredictability.
• Low entropy = highly ordered.
• High entropy = unpredictable = surprising -> chaotic.
• As defined, entropy is related to the number of bits needed. Over some set of states or outcomes with probability p:
  − ∑ p log p
Entropy
• Measures the (im)purity of a collection S of examples:
  Entropy(S) = − p_{+} log_2(p_{+}) − p_{−} log_2(p_{−})
• p_{+} is the proportion of positive examples.
• p_{−} is the proportion of negative examples.
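The same −∑ p log₂ p formula works for any number of classes, not just positive/negative; a minimal sketch:

```python
import math

def entropy(labels):
    """Entropy of a collection of class labels: -sum p * log2(p) over the
    label frequencies. 0.0 for a pure set; 1.0 for a 50/50 two-class split."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A pure leaf (all one class) has entropy 0, which is exactly the stopping condition in Buildtree; a split that drives the children's entropy down fastest is the "most informative" feature.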