Kansas State University
Department of Computing and Information Sciences
CIS 732: Machine Learning and Pattern Recognition
Friday, 09 March 2007
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.kddresearch.org
http://www.cis.ksu.edu/~bhsu
Readings: Sections 7.4.1-7.4.3 and 7.5.1-7.5.3, Mitchell; Chapter 1, Kearns and Vazirani
Bayesian Networks: Learning Distributions
COLT: PAC and VC Dimension
Lecture 24 of 42
Lecture Outline
• Read Sections 7.4.1-7.4.3 and 7.5.1-7.5.3, Mitchell; Chapter 1, Kearns and Vazirani
• Suggested Exercises: 7.2, Mitchell; 1.1, Kearns and Vazirani
• PAC Learning (Continued)
  – Examples and results: learning rectangles, normal forms, conjunctions
  – What PAC analysis reveals about problem difficulty
  – Turning PAC results into design choices
• Occam’s Razor: A Formal Inductive Bias
  – Preference for shorter hypotheses
  – More on Occam’s Razor when we get to decision trees
• Vapnik–Chervonenkis (VC) Dimension
  – Objective: label a set of points in every possible way (shatter it) with a set of functions
  – VC(H): a measure of the expressiveness of hypothesis space H
• Mistake Bounds
  – Estimating the number of mistakes made before convergence
  – Optimal error bounds
Kansas State University
Department of Computing and Information Sciences
CIS 732: Machine Learning and Pattern Recognition
Bayesian Belief Networks (BBNs):
Definition
[Figure: the “Sprinkler” BBN:
  X1 = Season: Spring, Summer, Fall, Winter
  X2 = Sprinkler: On, Off
  X3 = Rain: None, Drizzle, Steady, Downpour
  X4 = Ground: Wet, Dry
  X5 = Ground: Slippery, Not Slippery]
P(Summer, Off, Drizzle, Wet, Not Slippery) = P(S) · P(O | S) · P(D | S) · P(W | O, D) · P(N | W)
• Conditional Independence
  – X is conditionally independent (CI) of Y given Z (sometimes written X ⊥ Y | Z) iff P(X | Y, Z) = P(X | Z) for all values of X, Y, and Z
  – Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning), i.e., T ⊥ R | L
• Bayesian Network
  – Directed graph model of conditional dependence assertions (or CI assumptions)
  – Vertices (nodes): denote events (each a random variable)
  – Edges (arcs, links): denote conditional dependencies
• General Product (Chain) Rule for BBNs:
  P(X1, X2, …, Xn) = ∏i = 1..n P(Xi | parents(Xi))
• Example: the “Sprinkler” BBN above
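The chain-rule factorization can be checked mechanically. The sketch below fills in conditional probability tables for the “Sprinkler” network; every numeric value is an invented assumption for illustration, and only the factorization P(S) · P(O | S) · P(R | S) · P(G | O, R) · P(L | G) comes from the slide. Any CPTs whose conditionals each sum to 1 yield a proper joint distribution:

```python
from itertools import product

# Illustrative (made-up) CPTs for the "Sprinkler" BBN.
seasons = ["Spring", "Summer", "Fall", "Winter"]
P_S = dict(zip(seasons, [0.25, 0.25, 0.25, 0.25]))
P_O_given_S = {s: {"On": p, "Off": 1 - p}
               for s, p in zip(seasons, [0.4, 0.7, 0.2, 0.05])}
rains = ["None", "Drizzle", "Steady", "Downpour"]
P_R_given_S = {
    "Spring": dict(zip(rains, [0.3, 0.3, 0.3, 0.1])),
    "Summer": dict(zip(rains, [0.6, 0.2, 0.15, 0.05])),
    "Fall":   dict(zip(rains, [0.3, 0.3, 0.3, 0.1])),
    "Winter": dict(zip(rains, [0.2, 0.3, 0.4, 0.1])),
}
# P(Ground = Wet | Sprinkler, Rain)
P_wet = {("On", r): p for r, p in zip(rains, [0.9, 0.95, 0.99, 1.0])}
P_wet.update({("Off", r): p for r, p in zip(rains, [0.05, 0.6, 0.9, 0.99])})
# P(Ground = Slippery | Ground wetness)
P_slip = {"Wet": 0.8, "Dry": 0.05}

def joint(s, o, r, g, l):
    """P(s, o, r, g, l) via the BBN chain rule: one factor per node."""
    pg = P_wet[(o, r)] if g == "Wet" else 1 - P_wet[(o, r)]
    pl = P_slip[g] if l == "Slippery" else 1 - P_slip[g]
    return P_S[s] * P_O_given_S[s][o] * P_R_given_S[s][r] * pg * pl

# The factored joint is a proper distribution: it sums to 1 over all assignments.
total = sum(joint(s, o, r, g, l)
            for s, o, r, g, l in product(seasons, ["On", "Off"], rains,
                                         ["Wet", "Dry"],
                                         ["Slippery", "Not Slippery"]))
print(round(total, 10))  # 1.0
```

Note how the factorization needs only 4 + 4 + 16 + 8 + 2 parameters instead of one entry per joint assignment.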
Bayesian Belief Networks:
Properties
• Conditional Independence
  – Variable (node): conditionally independent of non-descendants given parents
  – Example: see the figure below
  – Result: chain rule for probabilistic inference
• Bayesian Network: Probabilistic Semantics
  – Node: variable
  – Edge: one axis of a conditional probability table (CPT)
  – P(X1, X2, …, Xn) = ∏i = 1..n P(Xi | Pa_i), where Pa_i = parents(Xi)
[Figure: BBN for a cancer domain:
  X1 = Age, X2 = Gender, X3 = Exposure To Toxics, X4 = Smoking,
  X5 = Cancer, X6 = Serum Calcium, X7 = Lung Tumor;
  annotated with the Parents, Non-Descendants, and Descendants of X5]
Topic 0:
A Brief Overview of Machine Learning
• Overview: Topics, Applications, Motivation
• Learning = Improving with Experience at Some Task
  – Improve over task T,
  – with respect to performance measure P,
  – based on experience E.
• Brief Tour of Machine Learning
  – A case study
  – A taxonomy of learning
  – Intelligent systems engineering: specification of learning problems
• Issues in Machine Learning
  – Design choices
  – The performance element: intelligent systems
• Some Applications of Learning
  – Database mining, reasoning (inference/decision support), acting
  – Industrial usage of intelligent systems
Topic 1:
Concept Learning and Version Spaces
• Concept Learning as Search through H
  – Hypothesis space H as a state space
  – Learning: finding the correct hypothesis
• General-to-Specific Ordering over H
  – Partially-ordered set: Less-Specific-Than (More-General-Than) relation
  – Upper and lower bounds in H
• Version Space Candidate Elimination Algorithm
  – S and G boundaries characterize learner’s uncertainty
  – Version space can be used to make predictions over unseen cases
• Learner Can Generate Useful Queries
• Next Lecture: When and Why Are Inductive Leaps Possible?
Topic 2:
Inductive Bias and PAC Learning
• Inductive Leaps Possible Only if Learner Is Biased
  – Futility of learning without bias
  – Strength of inductive bias: proportional to restrictions on hypotheses
• Modeling Inductive Learners with Equivalent Deductive Systems
  – Representing inductive learning as theorem proving
  – Equivalent learning and inference problems
• Syntactic Restrictions
  – Example: m-of-n concept
• Views of Learning and Strategies
  – Removing uncertainty (“data compression”)
  – Role of knowledge
• Introduction to Computational Learning Theory (COLT)
  – Things COLT attempts to measure
  – Probably-Approximately-Correct (PAC) learning framework
• Next: Occam’s Razor, VC Dimension, and Error Bounds
Topic 3:
PAC, VC Dimension, and Mistake Bounds
• COLT: Framework Analyzing Learning Environments
  – Sample complexity of C (what is m?)
  – Computational complexity of L
  – Required expressive power of H
  – Error and confidence bounds (PAC: 0 < ε < 1/2, 0 < δ < 1/2)
• What PAC Prescribes
  – Whether to try to learn C with a known H
  – Whether to try to reformulate H (apply change of representation)
• Vapnik–Chervonenkis (VC) Dimension
  – A formal measure of the complexity of H (besides |H|)
  – Based on X and a worst-case labeling game
• Mistake Bounds
  – How many mistakes could L incur?
  – Another way to measure the cost of learning
• Next: Decision Trees
Topic 4:
Decision Trees
• Decision Trees (DTs)
  – Can be boolean (c(x) ∈ {+, −}) or range over multiple classes
  – When to use DT-based models
• Generic Algorithm Build-DT: Top-Down Induction
  – Calculating best attribute upon which to split
  – Recursive partitioning
• Entropy and Information Gain
  – Goal: to measure uncertainty removed by splitting on a candidate attribute A
    • Calculating information gain (change in entropy)
    • Using information gain in construction of tree
  – ID3 ≡ Build-DT using Gain(•)
• ID3 as Hypothesis Space Search (in State Space of Decision Trees)
• Heuristic Search and Inductive Bias
• Data Mining using MLC++ (Machine Learning Library in C++)
• Next: More Biases (Occam’s Razor); Managing DT Induction
Topic 5:
DTs, Occam’s Razor, and Overfitting
• Occam’s Razor and Decision Trees
  – Preference biases versus language biases
  – Two issues regarding Occam algorithms
    • Why prefer smaller trees? (less chance of “coincidence”)
    • Is Occam’s Razor well defined? (yes, under certain assumptions)
  – MDL principle and Occam’s Razor: more to come
• Overfitting
  – Problem: fitting training data too closely
    • General definition of overfitting
    • Why it happens
  – Overfitting prevention, avoidance, and recovery techniques
• Other Ways to Make Decision Tree Induction More Robust
• Next: Perceptrons, Neural Nets (Multi-Layer Perceptrons), Winnow
Topic 6:
Perceptrons and Winnow
• Neural Networks: Parallel, Distributed Processing Systems
  – Biological and artificial (ANN) types
  – Perceptron (LTU, LTG): model neuron
• Single-Layer Networks
  – Variety of update rules
    • Multiplicative (Hebbian, Winnow), additive (gradient: Perceptron, Delta Rule)
    • Batch versus incremental mode
  – Various convergence and efficiency conditions
  – Other ways to learn linear functions
    • Linear programming (general-purpose)
    • Probabilistic classifiers (some assumptions)
• Advantages and Disadvantages
  – “Disadvantage” (tradeoff): simple and restrictive
  – “Advantage”: perform well on many realistic problems (e.g., some text learning)
• Next: Multi-Layer Perceptrons, Backpropagation, ANN Applications
Topic 7:
MLPs and Backpropagation
• Multi-Layer ANNs
  – Focused on feedforward MLPs
  – Backpropagation of error: distributes penalty (loss) function throughout network
  – Gradient learning: takes derivative of error surface with respect to weights
    • Error is based on difference between desired output (t) and actual output (o)
    • Actual output (o) is based on activation function
    • Must take partial derivative of the activation function: choose one that is easy to differentiate
    • Two definitions: sigmoid (aka logistic) and hyperbolic tangent (tanh)
• Overfitting in ANNs
  – Prevention: attribute subset selection
  – Avoidance: cross-validation, weight decay
• ANN Applications: Face Recognition, Text-to-Speech
• Open Problems
• Recurrent ANNs: Can Express Temporal Depth (Non-Markovity)
• Next: Statistical Foundations and Evaluation, Bayesian Learning Intro
Topic 8:
Statistical Evaluation of Hypotheses
• Statistical Evaluation Methods for Learning: Three Questions
  – Generalization quality
    • How well does observed accuracy estimate generalization accuracy?
    • Estimation bias and variance
    • Confidence intervals
  – Comparing generalization quality
    • How certain are we that h1 is better than h2?
    • Confidence intervals for paired tests
  – Learning and statistical evaluation
    • What is the best way to make the most of limited data?
    • k-fold CV
• Tradeoffs: Bias versus Variance
• Next: Sections 6.1-6.5, Mitchell (Bayes’s Theorem; ML; MAP)
Topic 9:
Bayes’s Theorem, MAP, MLE
• Introduction to Bayesian Learning
  – Framework: using probabilistic criteria to search H
  – Probability foundations
    • Definitions: subjectivist, objectivist; Bayesian, frequentist, logicist
    • Kolmogorov axioms
• Bayes’s Theorem
  – Definition of conditional (posterior) probability
  – Product rule
• Maximum A Posteriori (MAP) and Maximum Likelihood (ML) Hypotheses
  – Bayes’s Rule and MAP
  – Uniform priors: allow use of MLE to generate MAP hypotheses
  – Relation to version spaces, candidate elimination
• Next: Sections 6.6-6.10, Mitchell; Chapters 14-15, Russell and Norvig; Roth
  – More Bayesian learning: MDL, BOC, Gibbs, Simple (Naïve) Bayes
  – Learning over text
Topic 10:
Bayesian Classifiers: MDL, BOC, and Gibbs
• Minimum Description Length (MDL) Revisited
  – Bayesian Information Criterion (BIC): justification for Occam’s Razor
• Bayes Optimal Classifier (BOC)
  – Using BOC as a “gold standard”
• Gibbs Classifier
  – Ratio bound
• Simple (Naïve) Bayes
  – Rationale for assumption; pitfalls
• Practical Inference using MDL, BOC, Gibbs, Naïve Bayes
  – MCMC methods (Gibbs sampling)
  – Glossary: http://www.media.mit.edu/~tpminka/statlearn/glossary/glossary.html
  – To learn more: http://bulky.aecom.yu.edu/users/kknuth/bse.html
• Next: Sections 6.9-6.10, Mitchell
  – More on simple (naïve) Bayes
  – Application to learning over text
Meta-Summary
• Machine Learning Formalisms
  – Theory of computation: PAC, mistake bounds
  – Statistical, probabilistic: PAC, confidence intervals
• Machine Learning Techniques
  – Models: version space, decision tree, perceptron, winnow, ANN, BBN
  – Algorithms: candidate elimination, ID3, backprop, MLE, Naïve Bayes, K2, EM
• Midterm Study Guide
  – Know
    • Definitions (terminology)
    • How to solve problems from Homework 1 (problem set)
    • How algorithms in Homework 2 (machine problem) work
  – Practice
    • Sample exam problems (handout)
    • Example runs of algorithms in Mitchell, lecture notes
  – Don’t panic!
PAC Learning:
Definition and Rationale
• Intuition
  – Can’t expect a learner to learn exactly
    • Multiple consistent concepts
    • Unseen examples could have any label (“OK” to mislabel if “rare”)
  – Can’t always approximate c closely (probability of D not being representative)
• Terms Considered
  – Class C of possible concepts, learner L, hypothesis space H
  – Instances X, each of length n attributes
  – Error parameter ε, confidence parameter δ, true error error_D(h)
  – size(c) = the encoding length of c, assuming some representation
• Definition
  – C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε
  – Efficiently PAC-learnable: L runs in time polynomial in 1/ε, 1/δ, n, size(c)
PAC Learning:
Results for Two Hypothesis Languages
• Unbiased Learner
  – Recall: sample complexity bound m ≥ 1/ε (ln |H| + ln (1/δ))
  – Sample complexity not always polynomial
  – Example: for unbiased learner, |H| = 2^|X|
  – Suppose X consists of n booleans (binary-valued attributes)
    • |X| = 2^n, |H| = 2^(2^n)
    • m ≥ 1/ε (2^n ln 2 + ln (1/δ))
    • Sample complexity for this H is exponential in n
• Monotone Conjunctions
  – Target function of the form y = f(x1, …, xn) = x1′ ∧ x2′ ∧ … ∧ xk′
  – Active learning protocol (learner gives query instances): n examples needed
  – Passive learning with a helpful teacher: k examples (k literals in true concept)
  – Passive learning with randomly selected examples (proof to follow):
    m ≥ 1/ε (ln |H| + ln (1/δ)) = 1/ε (n ln 2 + ln (1/δ))
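The contrast between the two cases can be made concrete numerically. A small sketch of the bound above (the choices of ε, δ, and the range of n are arbitrary, for illustration only):

```python
import math

def pac_sample_bound(ln_H, eps, delta):
    """m >= (1/eps)(ln|H| + ln(1/delta)): the bound quoted on the slide."""
    return math.ceil((ln_H + math.log(1 / delta)) / eps)

eps, delta = 0.1, 0.05
for n in [5, 10, 20]:
    # Unbiased learner: |H| = 2^(2^n), so ln|H| = 2^n * ln 2 (exponential in n).
    unbiased = pac_sample_bound((2 ** n) * math.log(2), eps, delta)
    # Monotone conjunctions: |H| = 2^n, so ln|H| = n * ln 2 (linear in n).
    monotone = pac_sample_bound(n * math.log(2), eps, delta)
    print(n, unbiased, monotone)
```

Even at n = 20 the biased (conjunctive) learner needs only a few hundred examples, while the unbiased bound is in the millions; this is the sense in which bias makes induction feasible.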
PAC Learning:
Monotone Conjunctions [1]
• Monotone Conjunctive Concepts
  – Suppose c ∈ C (and h ∈ H) is of the form x1 ∧ x2 ∧ … ∧ xm
  – n possible variables: each either omitted or included (i.e., positive literals only)
• Errors of Omission (False Negatives)
  – Claim: the only possible errors are false negatives (h(x) = −, c(x) = +)
  – Mistake iff (∃z ∈ h) (z ∉ c) ∧ (∃x ∈ D_test . x(z) = false): then h(x) = −, c(x) = +
• Probability of False Negatives
  – Let z be a literal; let Pr(Z) be the probability that z is false in a positive x ~ D
  – z in target concept (correct conjunction c = x1 ∧ x2 ∧ … ∧ xm) ⇒ Pr(Z) = 0
  – Pr(Z) is the probability that a randomly chosen positive example has z = false (inducing a potential mistake, or deleting z from h if training is still in progress)
  – error(h) ≤ Σz ∈ h Pr(Z)
[Figure: instance space X with hypothesis h nested inside concept c; + examples fall inside c, − examples outside]
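The learner implied here, which keeps every literal until a positive example falsifies it, can be sketched as follows. The target concept, attribute count, and random data generator are illustrative assumptions; only the elimination idea comes from the slides:

```python
import random

def learn_monotone_conjunction(examples, n):
    """Elimination (cf. Find-S): start with all n variables in h and drop any
    variable that appears false in some positive example."""
    h = set(range(n))
    for x, label in examples:
        if label:                            # positive example
            h -= {i for i in h if not x[i]}  # these literals cannot be in c
    return h

def conj(h, x):
    return all(x[i] for i in h)

# Toy setup (assumptions): n = 5 boolean attributes, target c = x0 AND x2,
# examples drawn uniformly at random.
random.seed(0)
n, c = 5, {0, 2}
data = [(x, conj(c, x))
        for x in (tuple(random.random() < 0.5 for _ in range(n))
                  for _ in range(200))]

h = learn_monotone_conjunction(data, n)
print(sorted(h))
# h always contains c (literals of c are never falsified by a positive example),
# so any error is a false negative, exactly as claimed above.
```

Since h only ever shrinks toward c from the specific side, it never produces a false positive; the analysis on this slide only has to bound the false-negative mass.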
PAC Learning:
Monotone Conjunctions [2]
• Bad Literals
  – Call a literal z bad if Pr(Z) > ε′/n
  – z does not belong in h, and is likely to be dropped (by appearing with value true in a positive x ~ D), but has not yet appeared in such an example
• Case of No Bad Literals
  – Lemma: if there are no bad literals, then error(h) ≤ ε′
  – Proof: error(h) ≤ Σz ∈ h Pr(Z) ≤ Σz ∈ h ε′/n ≤ ε′ (worst case: all n literals z are in h)
• Case of Some Bad Literals
  – Let z be a bad literal
  – Survival probability (probability that it will not be eliminated by a given example): 1 − Pr(Z) < 1 − ε′/n
  – Survival probability over m examples: (1 − Pr(Z))^m < (1 − ε′/n)^m
  – Worst-case survival probability over m examples (n bad literals) = n (1 − ε′/n)^m
  – Intuition: more chance of a mistake = greater chance to learn
PAC Learning:
Monotone Conjunctions [3]
• Goal: Achieve An Upper Bound for Worst-Case Survival Probability
  – Choose m large enough so that the probability of a bad literal z surviving across m examples is less than δ
  – Pr(z survives m examples) = n (1 − ε′/n)^m < δ
  – Solve for m using the inequality 1 − x < e^−x
    • n e^(−mε′/n) < δ
    • m > n/ε′ (ln (n) + ln (1/δ)) examples needed to guarantee the bounds
  – This completes the proof of the PAC result for monotone conjunctions
  – Nota Bene: a specialization of m ≥ 1/ε (ln |H| + ln (1/δ)), with n/ε′ playing the role of 1/ε
• Practical Ramifications
  – Suppose δ = 0.1, ε′ = 0.1, n = 100: we need 6907 examples
  – Suppose δ = 0.1, ε′ = 0.1, n = 10: we need only 460 examples
  – Suppose δ = 0.01, ε′ = 0.1, n = 10: we need only 690 examples
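These figures follow directly from the bound just derived; a quick check (truncating the real-valued bound to an integer, which is how the slide’s numbers appear to be obtained):

```python
import math

def monotone_conjunction_m(n, eps_prime, delta):
    """m > (n/eps')(ln n + ln(1/delta)), truncated to an integer."""
    return int((n / eps_prime) * (math.log(n) + math.log(1 / delta)))

print(monotone_conjunction_m(100, 0.1, 0.1))  # 6907, as on the slide
print(monotone_conjunction_m(10, 0.1, 0.1))   # 460
print(monotone_conjunction_m(10, 0.1, 0.01))  # 690
```

Note the asymmetry: tightening δ from 0.1 to 0.01 is cheap (460 to 690 examples) because δ enters only logarithmically, while growing n costs roughly n ln n examples.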
PAC Learning:
k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF
• k-CNF (Conjunctive Normal Form) Concepts: Efficiently PAC-Learnable
  – Conjunctions of any number of disjunctive clauses, each with at most k literals
  – c = C1 ∧ C2 ∧ … ∧ Cm; Ci = l1 ∨ l2 ∨ … ∨ lk; ln (|k-CNF|) = ln (2^((2n)^k)) = Θ(n^k)
  – Algorithm: reduce to learning monotone conjunctions over n^k pseudo-literals Ci
• k-Clause-CNF
  – c = C1 ∧ C2 ∧ … ∧ Ck; Ci = l1 ∨ l2 ∨ … ∨ lm; ln (|k-Clause-CNF|) = ln (3^(kn)) = Θ(kn)
  – Efficiently PAC-learnable? See below (k-Clause-CNF, k-Term-DNF are duals)
• k-DNF (Disjunctive Normal Form)
  – Disjunctions of any number of conjunctive terms, each with at most k literals
  – c = T1 ∨ T2 ∨ … ∨ Tm; Ti = l1 ∧ l2 ∧ … ∧ lk
• k-Term-DNF: “Not” Efficiently PAC-Learnable (Kind Of, Sort Of…)
  – c = T1 ∨ T2 ∨ … ∨ Tk; Ti = l1 ∧ l2 ∧ … ∧ lm; ln (|k-Term-DNF|) = ln (k · 3^n) = Θ(n + ln k)
  – Polynomial sample complexity, not computational complexity (unless RP = NP)
  – Solution: Don’t use H = C! k-Term-DNF ⊆ k-CNF (so let H = k-CNF)
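The reduction for k-CNF can be made concrete: treat each clause of at most k literals as a pseudo-literal, then run the same elimination idea used for monotone conjunctions, keeping exactly the clauses satisfied by every positive example. The toy target, n, and k below are assumptions for illustration:

```python
from itertools import combinations, product

def all_clauses(n, k):
    """Every disjunction of at most k literals over x0..x(n-1).
    A literal is (index, sign); sign False means the negated variable."""
    clauses = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product([True, False], repeat=size):
                clauses.append(tuple(zip(idxs, signs)))
    return clauses

def clause_true(clause, x):
    return any(x[i] == s for i, s in clause)

def learn_kcnf(examples, n, k):
    """Elimination over clause 'pseudo-literals': keep exactly the clauses
    satisfied by every positive example."""
    return [c for c in all_clauses(n, k)
            if all(clause_true(c, x) for x, y in examples if y)]

def predict(h, x):
    return all(clause_true(c, x) for c in h)

# Toy target in 2-CNF over n = 3 variables: (x0 or x1) and (not x2).
target = lambda x: (x[0] or x[1]) and not x[2]
data = [(x, target(x)) for x in product([False, True], repeat=3)]
h = learn_kcnf(data, n=3, k=2)
print(all(predict(h, x) == y for x, y in data))  # True: h is consistent here
```

The hypothesis keeps many redundant clauses besides the target’s own, but every kept clause holds on all positives, so the conjunction is still consistent; the point is that the feature space has only O((2n)^k) pseudo-literals, which is why the class is efficiently learnable.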
PAC Learning:
Rectangles
•
Assume Target Concept Is An Axis Parallel (Hyper)rectangle
•
Will We Be Able To Learn The Target Concept?
•
Can We Come Close?
[Figure: labeled points in the X-Y plane; + examples cluster inside an axis-parallel rectangle, − examples lie outside it]
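A natural consistent learner for this class returns the tightest axis-parallel rectangle around the positive examples. A sketch (the hidden target rectangle, sample size, and uniform distribution are all assumptions for the demo):

```python
import random

def tightest_rectangle(positives):
    """Smallest axis-parallel rectangle containing every positive point:
    a consistent learner for this concept class."""
    xs, ys = [p[0] for p in positives], [p[1] for p in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def inside(rect, p):
    x0, x1, y0, y1 = rect
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

# Hidden target rectangle (x0, x1, y0, y1) and a uniform sample.
target = (0.2, 0.7, 0.3, 0.8)
random.seed(1)
sample = [(random.random(), random.random()) for _ in range(500)]
labels = [inside(target, p) for p in sample]

h = tightest_rectangle([p for p, y in zip(sample, labels) if y])
# h sits inside the target, so it never produces a false positive; the only
# possible errors are false negatives in the thin strip between h and target.
errors = sum(inside(h, p) != y for p, y in zip(sample, labels))
print(h, errors)  # errors on the training sample is always 0
```

So the learner cannot recover the target exactly, but its error is confined to four thin border strips whose probability mass shrinks as m grows, which is exactly the PAC sense of “coming close.”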
Consistent Learners
• General Scheme for Learning
  – Follows immediately from definition of consistent hypothesis
  – Given: a sample D of m examples
  – Find: some h ∈ H that is consistent with all m examples
  – PAC: show that if m is large enough, a consistent hypothesis must be close enough to c
  – Efficient PAC (and other COLT formalisms): show that you can compute the consistent hypothesis efficiently
• Monotone Conjunctions
  – Used an Elimination algorithm (compare: Find-S) to find a hypothesis h that is consistent with the training set (easy to compute)
  – Showed that with sufficiently many examples (polynomial in the parameters), h is close to c
  – Sample complexity gives an assurance of “convergence to criterion” for specified m, and a necessary condition (polynomial in n) for tractability
Occam’s Razor and PAC Learning [1]
• Bad Hypothesis
  – Want to bound: the probability that there exists a hypothesis h ∈ H that
    • is consistent with m examples, and
    • satisfies error_D(h) > ε
  – Claim: this probability is less than |H| (1 − ε)^m
• Proof
  – Let h be such a bad hypothesis, i.e., error_D(h) = Pr_x~D [c(x) ≠ h(x)] > ε
  – The probability that h is consistent with one example ⟨x, c(x)⟩ of c is Pr_x~D [c(x) = h(x)] < 1 − ε
  – Because the m examples are drawn independently of each other, the probability that h is consistent with m examples of c is less than (1 − ε)^m
  – The probability that some hypothesis in H is consistent with m examples of c is less than |H| (1 − ε)^m, Quod Erat Demonstrandum
Occam’s Razor and PAC Learning [2]
• Goal
  – We want this probability to be smaller than δ, that is:
    • |H| (1 − ε)^m < δ
    • ln (|H|) + m ln (1 − ε) < ln (δ)
  – With ln (1 − ε) ≤ −ε: m ≥ 1/ε (ln |H| + ln (1/δ))
  – This is the result from last time [Blumer et al., 1987; Haussler, 1988]
• Occam’s Razor
  – “Entities should not be multiplied without necessity”
  – So called because it indicates a preference towards a small H
  – Why do we want small H?
    • Generalization capability: explicit form of inductive bias
    • Search capability: more efficient, compact
  – To guarantee consistency, need H ⊇ C; do we really want the smallest H possible?
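The derivation can be replayed numerically: solve |H| (1 − ε)^m < δ for m directly, then compare with the closed-form bound obtained via ln(1 − ε) ≤ −ε. The values of |H|, ε, and δ below are arbitrary choices for illustration:

```python
import math

def exact_m(H_size, eps, delta):
    """Smallest integer m with |H|(1 - eps)^m < delta:
    ln|H| + m ln(1 - eps) < ln(delta)  =>  m > (ln|H| - ln delta)/(-ln(1 - eps))."""
    return math.floor((math.log(H_size) - math.log(delta))
                      / -math.log(1 - eps)) + 1

def blumer_m(H_size, eps, delta):
    """Closed-form bound m >= (1/eps)(ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

H_size, eps, delta = 2 ** 20, 0.05, 0.01
m1, m2 = exact_m(H_size, eps, delta), blumer_m(H_size, eps, delta)
print(m1, m2)  # 361 370
# The closed form is slightly conservative (m2 >= m1), and both values of m
# drive |H|(1 - eps)^m below delta:
print(H_size * (1 - eps) ** m1 < delta,
      H_size * (1 - eps) ** m2 < delta)  # True True
```

The gap between 361 and 370 is the small price paid for replacing −ln(1 − ε) by ε to get a formula that is linear in 1/ε.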
VC Dimension:
Framework
• Infinite Hypothesis Space?
  – Preceding analyses were restricted to finite hypothesis spaces
  – Some infinite hypothesis spaces are more expressive than others, e.g.,
    • rectangles vs. 17-sided convex polygons vs. general convex polygons
    • linear threshold (LT) function vs. a conjunction of LT units
  – Need a measure of the expressiveness of an infinite H other than its size
• Vapnik–Chervonenkis Dimension: VC(H)
  – Provides such a measure
  – Analogous to |H|: there are bounds for sample complexity using VC(H)
VC Dimension:
Shattering A Set of Instances
• Dichotomies
  – Recall: a partition of a set S is a collection of disjoint sets Si whose union is S
  – Definition: a dichotomy of a set S is a partition of S into two subsets S1 and S2
• Shattering
  – A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S, there exists a hypothesis in H consistent with this dichotomy
  – Intuition: a rich set of functions shatters a larger instance space
• The “Shattering Game” (An Adversarial Interpretation)
  – Your client selects an S (a subset of an instance space X)
  – You select an H
  – Your adversary labels S (i.e., chooses a point c from concept space C = 2^X)
  – You must then find some h ∈ H that “covers” (is consistent with) c
  – If you can do this for any c your adversary comes up with, H shatters S
VC Dimension:
Examples of Shattered Sets
• Three Instances Shattered
  [Figure: eight hypotheses over an instance space X, one per labeling of three points]
• Intervals
  – Left-bounded intervals on the real axis: [0, a), for a ∈ R, a ≥ 0
    • Sets of 2 points cannot be shattered
    • Given 2 points, can label so that no hypothesis will be consistent (label the left point − and the right point +)
  – Intervals on the real axis ([a, b], b ∈ R > a ∈ R): can shatter 1 or 2 points, not 3
  – Half-spaces in the plane (non-collinear points): 1? 2? 3? 4?
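The interval claims can be verified by brute force: enumerate every labeling of a point set and ask whether some interval [a, b] realizes it. For a finite point set it suffices (an assumption of this sketch) to try endpoints drawn from the points themselves plus values just outside them:

```python
from itertools import product

def interval_hypotheses(points):
    """Candidate intervals [a, b] with endpoints on a grid around the points;
    enough to realize any labeling that intervals can realize on this set."""
    cuts = sorted(points)
    grid = [cuts[0] - 1] + cuts + [cuts[-1] + 1]
    return [(a, b) for a in grid for b in grid if a <= b]

def shatters(points):
    """True iff intervals on the real line realize all 2^|S| labelings of S."""
    hyps = interval_hypotheses(points)
    for labeling in product([False, True], repeat=len(points)):
        if not any(all((a <= p <= b) == y for p, y in zip(points, labeling))
                   for a, b in hyps):
            return False
    return True

print(shatters([1.0, 2.0]))       # True: VC(intervals) >= 2
print(shatters([1.0, 2.0, 3.0]))  # False: labeling (+, -, +) is unrealizable
```

Any interval containing the two outer points must also contain the middle one, which is exactly why the (+, −, +) dichotomy defeats every candidate.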
VC Dimension:
Definition and Relation to Inductive Bias
• Vapnik–Chervonenkis Dimension
  – The VC dimension VC(H) of hypothesis space H (defined over implicit instance space X) is the size of the largest finite subset of X shattered by H
  – If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞
  – Examples
    • VC(half intervals in R) = 1: no subset of size 2 can be shattered
    • VC(intervals in R) = 2: no subset of size 3
    • VC(half-spaces in R^2) = 3: no subset of size 4
    • VC(axis-parallel rectangles in R^2) = 4: no subset of size 5
• Relation of VC(H) to Inductive Bias of H
  – Unbiased hypothesis space H shatters the entire instance space X
  – i.e., H is able to induce every partition on the set X of all possible instances
  – The larger the subset of X that can be shattered, the more expressive the hypothesis space is, i.e., the less biased
VC Dimension:
Relation to Sample Complexity
• VC(H) as A Measure of Expressiveness
  – Prescribes an Occam algorithm for infinite hypothesis spaces
  – Given: a sample D of m examples
    • Find some h ∈ H that is consistent with all m examples
    • If m ≥ 1/ε (8 VC(H) lg (13/ε) + 4 lg (2/δ)), then with probability at least (1 − δ), h has true error less than ε
  – Significance
    • If m is polynomial, we have a PAC learning algorithm
    • To be efficient, we need to produce the hypothesis h efficiently
• Note
  – |H| > 2^m is required to shatter m examples
  – Therefore VC(H) ≤ lg (|H|)
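Plugging numbers into the VC-based bound makes it tangible. Here VC(H) = 4 for axis-parallel rectangles in the plane (from the previous slide); the choices of ε and δ are arbitrary:

```python
import math

def vc_sample_bound(vc, eps, delta):
    """m >= (1/eps)(8 VC(H) lg(13/eps) + 4 lg(2/delta)), per the slide."""
    return math.ceil((8 * vc * math.log2(13 / eps)
                      + 4 * math.log2(2 / delta)) / eps)

# Axis-parallel rectangles in R^2: VC dimension 4.
print(vc_sample_bound(4, eps=0.1, delta=0.05))  # 2461
```

Like the |H|-based bound, this is loose in practice, but it applies even though the rectangle hypothesis space is infinite, which the ln |H| bound cannot handle at all.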
Mistake Bounds:
Halving Algorithm
• Scenario for Analyzing Mistake Bounds
  – Halving Algorithm: learn concept using version space
    • e.g., Candidate-Elimination algorithm (or List-Then-Eliminate)
  – Need to specify performance element (how predictions are made)
    • Classify new instances by majority vote of version space members
• How Many Mistakes before Converging to Correct h?
  – … in worst case?
    • Can make a mistake when the majority of hypotheses in VS_H,D are wrong
    • But then we can remove at least half of the candidates
    • Worst-case number of mistakes: log2 |H|
  – … in best case?
    • Can get away with no mistakes!
    • (If we were lucky and majority vote was right, VS_H,D still shrinks)
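The halving algorithm is easy to simulate for a small finite H, for example monotone conjunctions over n = 4 attributes (the target concept and instance ordering below are illustrative assumptions). Each mistake eliminates at least the majority side of the version space, so the mistake count respects the log2 |H| worst-case bound:

```python
from itertools import combinations, product
from math import log2

n = 4
# H: all monotone conjunctions over n boolean attributes; |H| = 2^n = 16.
H = [frozenset(s) for r in range(n + 1) for s in combinations(range(n), r)]

def classify(h, x):
    return all(x[i] for i in h)

def halving_run(target, instances):
    """Predict by majority vote of the version space; after seeing the true
    label, eliminate every hypothesis inconsistent with the example."""
    vs, mistakes = list(H), 0
    for x in instances:
        votes = sum(classify(h, x) for h in vs)
        pred = 2 * votes >= len(vs)          # majority prediction
        truth = classify(target, x)
        mistakes += pred != truth            # wrong majority -> vs at least halves
        vs = [h for h in vs if classify(h, x) == truth]
    return mistakes

target = frozenset({0, 2})                   # c = x0 AND x2
instances = list(product([False, True], repeat=n))
mistakes = halving_run(target, instances)
print(mistakes, log2(len(H)))
```

Note that the version space shrinks on every informative example even when the majority vote happens to be right, which is why the best case can be zero mistakes.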
Optimal Mistake Bounds
• Upper Mistake Bound for A Particular Learning Algorithm
  – Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C
    • Maximum over c ∈ C and all possible training sequences D
    • M_A(C) ≡ max c ∈ C M_A(c)
• Minimax Definition
  – Let C be an arbitrary non-empty concept class
  – The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):
    Opt(C) ≡ min A ∈ learning algorithms M_A(C)
  – VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ lg (|C|)
COLT Conclusions
• PAC Framework
  – Provides a reasonable model for theoretically analyzing the effectiveness of learning algorithms
  – Prescribes things to do: enrich the hypothesis space (search for a less restrictive H); make H more flexible (e.g., hierarchical); incorporate knowledge
• Sample Complexity and Computational Complexity
  – Sample complexity for any consistent learner using H can be determined from measures of H’s expressiveness (|H|, VC(H), etc.)
  – If the sample complexity is tractable, then the computational complexity of finding a consistent h governs the complexity of the problem
  – Sample complexity bounds are not tight! (But they separate learnable classes from non-learnable classes)
  – Computational complexity results exhibit cases where information-theoretic learning is feasible, but finding a good h is intractable
• COLT: Framework For Concrete Analysis of the Complexity of L
  – Dependent on various assumptions (e.g., the x ∈ X contain relevant variables)
Terminology
• PAC Learning: Example Concepts
  – Monotone conjunctions
  – k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF
  – Axis-parallel (hyper)rectangles
  – Intervals and semi-intervals
• Occam’s Razor: A Formal Inductive Bias
  – Occam’s Razor: ceteris paribus (all other things being equal), prefer shorter hypotheses (in machine learning, prefer the shortest consistent hypothesis)
  – Occam algorithm: a learning algorithm that prefers short hypotheses
• Vapnik–Chervonenkis (VC) Dimension
  – Shattering
  – VC(H)
• Mistake Bounds
  – M_A(C) for A ∈ {Find-S, Halving}
  – Optimal mistake bound Opt(H)
Summary Points
• COLT: Framework Analyzing Learning Environments
  – Sample complexity of C (what is m?)
  – Computational complexity of L
  – Required expressive power of H
  – Error and confidence bounds (PAC: 0 < ε < 1/2, 0 < δ < 1/2)
• What PAC Prescribes
  – Whether to try to learn C with a known H
  – Whether to try to reformulate H (apply change of representation)
• Vapnik–Chervonenkis (VC) Dimension
  – A formal measure of the complexity of H (besides |H|)
  – Based on X and a worst-case labeling game
• Mistake Bounds
  – How many mistakes could L incur?
  – Another way to measure the cost of learning
• Next Week: Decision Trees