Bayesian Networks: Learning Distributions; COLT: PAC and VC Dimension


Kansas State University

Department of Computing and Information Sciences

CIS 732: Machine Learning and Pattern Recognition

Friday, 09 March 2007


William H. Hsu

Department of Computing and Information Sciences, KSU

http://www.kddresearch.org

http://www.cis.ksu.edu/~bhsu


Readings: Sections 7.4.1-7.4.3, 7.5.1-7.5.3, Mitchell; Chapter 1, Kearns and Vazirani

Bayesian Networks: Learning Distributions

COLT: PAC and VC Dimension

Lecture 24 of 42


Lecture Outline


Read 7.4.1-7.4.3, 7.5.1-7.5.3, Mitchell; Chapter 1, Kearns and Vazirani


Suggested Exercises: 7.2, Mitchell; 1.1, Kearns and Vazirani


PAC Learning (Continued)


Examples and results: learning rectangles, normal forms, conjunctions


What PAC analysis reveals about problem difficulty


Turning PAC results into design choices


Occam’s Razor: A Formal Inductive Bias


Preference for shorter hypotheses


More on Occam’s Razor when we get to decision trees


Vapnik-Chervonenkis (VC) Dimension


Objective: label (shatter) a set of points in every possible way with a set of functions
VC(H): a measure of the expressiveness of hypothesis space H


Mistake Bounds


Estimating the number of mistakes made before convergence


Optimal error bounds


Bayesian Belief Networks (BBNs): Definition

[Figure: "Sprinkler" BBN over five nodes]
X1 = Season: Spring, Summer, Fall, Winter
X2 = Sprinkler: On, Off
X3 = Rain: None, Drizzle, Steady, Downpour
X4 = Ground: Wet, Dry
X5 = Ground: Slippery, Not-Slippery

P(Summer, Off, Drizzle, Wet, Not-Slippery) = P(S) · P(O | S) · P(D | S) · P(W | O, D) · P(N | W)


Conditional Independence
X is conditionally independent (CI) of Y given Z (sometimes written X ⊥ Y | Z) iff P(X | Y, Z) = P(X | Z) for all values of X, Y, and Z


Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning), i.e., T ⊥ R | L

Bayesian Network
Directed graph model of conditional dependence assertions (or CI assumptions)
Vertices (nodes): denote events (each a random variable)
Edges (arcs, links): denote conditional dependencies
General Product (Chain) Rule for BBNs:
P(X_1, X_2, …, X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i))
Example ("Sprinkler" BBN)
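
To make the chain rule concrete, here is a minimal Python sketch that evaluates the joint probability in the "Sprinkler" example above; the CPT entries are illustrative placeholders, not values from the lecture.

    # Minimal sketch: evaluate P(S, O, D, W, N) = P(S) P(O|S) P(D|S) P(W|O,D) P(N|W).
    # All numeric CPT entries below are made up for illustration.
    p_season    = {"Summer": 0.25}                        # P(S)
    p_sprinkler = {("Off", "Summer"): 0.4}                # P(O | S)
    p_rain      = {("Drizzle", "Summer"): 0.2}            # P(D | S)
    p_ground    = {("Wet", "Off", "Drizzle"): 0.7}        # P(W | O, D)
    p_slippery  = {("Not-Slippery", "Wet"): 0.3}          # P(N | W)

    def joint(s, o, d, w, n):
        """Joint probability of one full assignment via the BBN chain rule."""
        return (p_season[s] * p_sprinkler[(o, s)] * p_rain[(d, s)]
                * p_ground[(w, o, d)] * p_slippery[(n, w)])

    print(joint("Summer", "Off", "Drizzle", "Wet", "Not-Slippery"))  # 0.25*0.4*0.2*0.7*0.3 = 0.0042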


Bayesian Belief Networks:

Properties


Conditional Independence
Variable (node): conditionally independent of non-descendants given parents
Example
Result: chain rule for probabilistic inference

Bayesian Network: Probabilistic Semantics
Node: variable
Edge: one axis of a conditional probability table (CPT)







P(X_1, X_2, …, X_n) = ∏_{i=1}^{n} P(X_i | Pa_i), where Pa_i ≡ parents(X_i)

[Figure: example BBN with X1 = Age, X2 = Gender, X3 = Exposure-To-Toxics, X4 = Smoking, X5 = Cancer, X6 = Serum Calcium, X7 = Lung Tumor, illustrating a node's parents, descendants, and non-descendants]



Topic 0:

A Brief Overview of Machine Learning


Overview: Topics, Applications, Motivation


Learning = Improving with Experience at Some Task


Improve over task T,
with respect to performance measure P,
based on experience E.


Brief Tour of Machine Learning


A case study


A taxonomy of learning


Intelligent systems engineering:
specification of learning problems


Issues in Machine Learning


Design choices


The performance element: intelligent systems


Some Applications of Learning


Database mining, reasoning (inference/decision support), acting


Industrial usage of intelligent systems


Topic 1:

Concept Learning and Version Spaces


Concept Learning as Search through H
Hypothesis space H as a state space
Learning: finding the correct hypothesis

General-to-Specific Ordering over H
Partially-ordered set: Less-Specific-Than (More-General-Than) relation
Upper and lower bounds in H

Version Space Candidate Elimination Algorithm
S and G boundaries characterize learner’s uncertainty


Version space can be used to make predictions over unseen cases


Learner Can Generate Useful Queries


Next Lecture: When and Why Are Inductive Leaps Possible?


Topic 2:

Inductive Bias and PAC Learning


Inductive Leaps Possible Only if Learner Is Biased


Futility of learning without bias


Strength of inductive bias: proportional to restrictions on hypotheses


Modeling Inductive Learners with Equivalent Deductive Systems


Representing inductive learning as theorem proving


Equivalent learning and inference problems


Syntactic Restrictions


Example: m-of-n concept


Views of Learning and Strategies


Removing uncertainty (“data compression”)


Role of knowledge


Introduction to Computational Learning Theory (COLT)


Things COLT attempts to measure


Probably-Approximately-Correct (PAC) learning framework


Next: Occam’s Razor, VC Dimension, and Error Bounds


Topic 3:

PAC, VC-Dimension, and Mistake Bounds

COLT: Framework Analyzing Learning Environments
Sample complexity of C (what is m?)
Computational complexity of L
Required expressive power of H
Error and confidence bounds (PAC: 0 < ε < 1/2, 0 < δ < 1/2)

What PAC Prescribes
Whether to try to learn C with a known H
Whether to try to reformulate H (apply change of representation)

Vapnik-Chervonenkis (VC) Dimension
A formal measure of the complexity of H (besides |H|)
Based on X and a worst-case labeling game

Mistake Bounds
How many could L incur?


Another way to measure the cost of learning


Next: Decision Trees


Topic 4:

Decision Trees


Decision Trees (DTs)


Can be boolean (c(x) ∈ {+, −}) or range over multiple classes
When to use DT-based models

Generic Algorithm Build-DT: Top Down Induction


Calculating best attribute upon which to split


Recursive partitioning


Entropy and Information Gain


Goal: to measure uncertainty removed by splitting on a candidate attribute A


Calculating information gain (change in entropy)


Using information gain in construction of tree


ID3 ≡ Build-DT using Gain(•)


ID3 as Hypothesis Space Search (in State Space of Decision Trees)


Heuristic Search and Inductive Bias


Data Mining using MLC++ (Machine Learning Library in C++)


Next: More Biases (Occam’s Razor); Managing DT Induction


Topic 5:

DTs, Occam’s Razor, and Overfitting


Occam’s Razor and Decision Trees


Preference biases versus language biases


Two issues regarding Occam algorithms


Why prefer smaller trees? (less chance of “coincidence”)
Is Occam’s Razor well defined? (yes, under certain assumptions)


MDL principle and Occam’s Razor: more to come


Overfitting


Problem: fitting training data too closely


General definition of overfitting


Why it happens


Overfitting prevention, avoidance, and recovery techniques


Other Ways to Make Decision Tree Induction More Robust


Next: Perceptrons, Neural Nets (Multi-Layer Perceptrons), Winnow


Topic 6:

Perceptrons and Winnow


Neural Networks: Parallel, Distributed Processing Systems


Biological and artificial (ANN) types


Perceptron (LTU, LTG): model neuron


Single-Layer Networks


Variety of update rules


Multiplicative (Hebbian, Winnow), additive (gradient: Perceptron, Delta Rule)


Batch versus incremental mode


Various convergence and efficiency conditions


Other ways to learn linear functions


Linear programming (general-purpose)


Probabilistic classifiers (some assumptions)


Advantages and Disadvantages


“Disadvantage” (tradeoff): simple and restrictive


“Advantage”: perform well on many realistic problems (e.g., some text learning)


Next: Multi-Layer Perceptrons, Backpropagation, ANN Applications


Topic 7:

MLPs and Backpropagation


Multi-Layer ANNs


Focused on feedforward MLPs


Backpropagation of error: distributes penalty (loss) function throughout network


Gradient learning: takes derivative of error surface with respect to weights


Error is based on difference between desired output (t) and actual output (o)
Actual output (o) is based on activation function


Must take partial derivative of the activation function: choose one that is easy to differentiate
Two definitions: sigmoid (aka logistic) and hyperbolic tangent (tanh)


Overfitting in ANNs


Prevention: attribute subset selection


Avoidance: cross-validation, weight decay

ANN Applications: Face Recognition, Text-to-Speech


Open Problems


Recurrent ANNs: Can Express Temporal Depth (Non-Markovity)


Next: Statistical Foundations and Evaluation, Bayesian Learning Intro


Topic 8:

Statistical Evaluation of Hypotheses


Statistical Evaluation Methods for Learning: Three Questions


Generalization quality


How well does observed accuracy estimate generalization accuracy?


Estimation bias and variance


Confidence intervals


Comparing generalization quality


How certain are we that h1 is better than h2?


Confidence intervals for paired tests


Learning and statistical evaluation


What is the best way to make the most of limited data?


k-fold CV


Tradeoffs: Bias versus Variance


Next: Sections 6.1-6.5, Mitchell (Bayes’s Theorem; ML; MAP)


Topic 9:

Bayes’s Theorem, MAP, MLE


Introduction to Bayesian Learning


Framework: using probabilistic criteria to search H


Probability foundations


Definitions: subjectivist, objectivist; Bayesian, frequentist, logicist


Kolmogorov axioms


Bayes’s Theorem


Definition of conditional (posterior) probability


Product rule


Maximum A Posteriori (MAP) and Maximum Likelihood (ML) Hypotheses


Bayes’s Rule and MAP


Uniform priors: allow use of MLE to generate MAP hypotheses


Relation to version spaces, candidate elimination


Next: 6.6-6.10, Mitchell; Chapters 14-15, Russell and Norvig; Roth


More Bayesian learning: MDL, BOC, Gibbs, Simple (Naïve) Bayes


Learning over text


Topic 10:

Bayesian Classifiers: MDL, BOC, and Gibbs


Minimum Description Length (MDL) Revisited
Bayesian Information Criterion (BIC): justification for Occam’s Razor

Bayes Optimal Classifier (BOC)


Using BOC as a “gold standard”


Gibbs Classifier


Ratio bound


Simple (Naïve) Bayes


Rationale for assumption; pitfalls


Practical Inference using MDL, BOC, Gibbs, Naïve Bayes


MCMC methods (Gibbs sampling)


Glossary:
http://www.media.mit.edu/~tpminka/statlearn/glossary/glossary.html


To learn more:
http://bulky.aecom.yu.edu/users/kknuth/bse.html


Next: Sections 6.9-6.10, Mitchell


More on simple (naïve) Bayes


Application to learning over text


Meta-Summary


Machine Learning Formalisms


Theory of computation: PAC, mistake bounds


Statistical, probabilistic: PAC, confidence intervals


Machine Learning Techniques


Models: version space, decision tree, perceptron, winnow, ANN, BBN


Algorithms: candidate elimination, ID3, backprop, MLE, Naïve Bayes, K2, EM


Midterm Study Guide


Know


Definitions (terminology)


How to solve problems from Homework 1 (problem set)


How algorithms in Homework 2 (machine problem) work


Practice


Sample exam problems (handout)


Example runs of algorithms in Mitchell, lecture notes


Don’t panic!



PAC Learning:

Definition and Rationale


Intuition


Can’t expect a learner to learn exactly


Multiple consistent concepts


Unseen examples: could have any label (“OK” to mislabel if “rare”)
Can’t always approximate c closely (probability of D not being representative)


Terms Considered


Class C of possible concepts, learner L, hypothesis space H
Instances X, each of length n attributes
Error parameter ε, confidence parameter δ, true error error_D(h)
size(c) = the encoding length of c, assuming some representation


Definition
C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε
Efficiently PAC-learnable: L runs in time polynomial in 1/ε, 1/δ, n, size(c)


PAC Learning:

Results for Two Hypothesis Languages


Unbiased Learner
Recall: sample complexity bound m ≥ 1/ε (ln |H| + ln (1/δ))
Sample complexity not always polynomial
Example: for unbiased learner, |H| = 2^|X|
Suppose X consists of n booleans (binary-valued attributes)
|X| = 2^n, |H| = 2^(2^n)
m ≥ 1/ε (2^n ln 2 + ln (1/δ))
Sample complexity for this H is exponential in n


Monotone Conjunctions
Target function of the form y = f(x_1, …, x_n) = x_1′ ∧ ⋯ ∧ x_k′
Active learning protocol (learner gives query instances): n examples needed
Passive learning with a helpful teacher: k examples (k literals in true concept)
Passive learning with randomly selected examples (proof to follow):
m ≥ 1/ε (ln |H| + ln (1/δ)) = n/ε′ (ln n + ln (1/δ))
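
To see the contrast numerically, here is a small Python sketch (helper names are mine) that plugs the two sample-complexity expressions above into code, under the assumption stated on this slide that |H| = 2^(2^n) for the unbiased learner:

    from math import log, ceil

    def m_unbiased(n, eps, delta):
        # Unbiased learner: ln|H| = 2^n * ln 2, so the bound is exponential in n.
        return ceil((1.0 / eps) * ((2 ** n) * log(2) + log(1.0 / delta)))

    def m_monotone_conjunctions(n, eps_prime, delta):
        # Monotone conjunctions (derived on the next slides): m > (n/eps')(ln n + ln(1/delta)).
        return ceil((n / eps_prime) * (log(n) + log(1.0 / delta)))

    print(m_unbiased(10, 0.1, 0.05))               # exponential in n (try n = 20)
    print(m_monotone_conjunctions(10, 0.1, 0.05))  # polynomial in n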







PAC Learning:

Monotone Conjunctions [1]


Monotone Conjunctive Concepts
Suppose c ∈ C (and h ∈ H) is of the form x_1 ∧ x_2 ∧ … ∧ x_m
n possible variables: either omitted or included (i.e., positive literals only)


Errors of Omission (False Negatives)
Claim: the only possible errors are false negatives (h(x) = −, c(x) = +)
Mistake iff (z ∈ h) ∧ (z ∉ c) ∧ (∃ x ∈ D_test: x(z) = false): then h(x) = −, c(x) = +


Probability of False Negatives
Let z be a literal; let Pr(Z) be the probability that z is false in a positive x ∈ D
z in target concept (correct conjunction c = x_1 ∧ x_2 ∧ … ∧ x_m) ⇒ Pr(Z) = 0
Pr(Z) is the probability that a randomly chosen positive example has z = false (inducing a potential mistake, or deleting z from h if training is still in progress)
error(h) ≤ Σ_{z ∈ h} Pr(Z)

[Figure: instance space X with regions for target concept c and hypothesis h; examples labeled + and −]


PAC Learning:


Monotone Conjunctions [2]


Bad Literals
Call a literal z bad if Pr(Z) > ε′/n
z does not belong in h, and is likely to be dropped (by appearing with value false in a positive x ∈ D), but has not yet appeared in such an example

Case of No Bad Literals
Lemma: if there are no bad literals, then error(h) ≤ ε′
Proof: error(h) ≤ Σ_{z ∈ h} Pr(Z) ≤ Σ_{z ∈ h} ε′/n ≤ ε′ (worst case: all n z’s are in h ~ c)


Case of Some Bad Literals
Let z be a bad literal
Survival probability (probability that it will not be eliminated by a given example): 1 − Pr(Z) < 1 − ε′/n
Survival probability over m examples: (1 − Pr(Z))^m < (1 − ε′/n)^m
Worst case survival probability over m examples (n bad literals) = n (1 − ε′/n)^m


Intuition: more chance of a mistake = greater chance to learn


PAC Learning:


Monotone Conjunctions [3]


Goal: Achieve An Upper Bound for Worst-Case Survival Probability
Choose m large enough so that probability of a bad literal z surviving across m examples is less than δ
Pr(z survives m examples) = n (1 − ε′/n)^m < δ
Solve for m using the inequality 1 − x < e^(−x)
n e^(−mε′/n) < δ
m > n/ε′ (ln (n) + ln (1/δ)) examples needed to guarantee the bounds
This completes the proof of the PAC result for monotone conjunctions
Nota Bene: a specialization of m ≥ 1/ε (ln |H| + ln (1/δ)); n/ε′ = 1/ε



Practical Ramifications
Suppose δ = 0.1, ε′ = 0.1, n = 100: we need 6907 examples
Suppose δ = 0.1, ε′ = 0.1, n = 10: we need only 460 examples
Suppose δ = 0.01, ε′ = 0.1, n = 10: we need only 690 examples
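
These example counts follow directly from m > n/ε′ (ln n + ln (1/δ)); a quick Python check (the function name is mine):

    from math import log

    def m_bound(n, eps_prime, delta):
        # m > (n / eps') * (ln n + ln(1/delta)) examples suffice (monotone conjunctions)
        return (n / eps_prime) * (log(n) + log(1.0 / delta))

    print(m_bound(100, 0.1, 0.1))   # ~6907.8
    print(m_bound(10, 0.1, 0.1))    # ~460.5
    print(m_bound(10, 0.1, 0.01))   # ~690.8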


PAC Learning:

k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF


k-CNF (Conjunctive Normal Form) Concepts: Efficiently PAC-Learnable
Conjunctions of any number of disjunctive clauses, each with at most k literals
c = C_1 ∧ C_2 ∧ … ∧ C_m; C_i = l_1 ∨ l_2 ∨ … ∨ l_k; ln (|k-CNF|) = ln (2^((2n)^k)) = Θ(n^k)
Algorithm: reduce to learning monotone conjunctions over n^k pseudo-literals C_i (sketched at the end of this slide)


k-Clause-CNF
c = C_1 ∧ C_2 ∧ … ∧ C_k; C_i = l_1 ∨ l_2 ∨ … ∨ l_m; ln (|k-Clause-CNF|) = ln (3^(kn)) = Θ(kn)
Efficiently PAC learnable? See below (k-Clause-CNF, k-Term-DNF are duals)


k-DNF (Disjunctive Normal Form)
Disjunctions of any number of conjunctive terms, each with at most k literals
c = T_1 ∨ T_2 ∨ … ∨ T_m; T_i = l_1 ∧ l_2 ∧ … ∧ l_k

k-Term-DNF: “Not” Efficiently PAC-Learnable (Kind Of, Sort Of…)
c = T_1 ∨ T_2 ∨ … ∨ T_k; T_i = l_1 ∧ l_2 ∧ … ∧ l_m; ln (|k-Term-DNF|) = ln (k · 3^n) = Θ(n + ln k)
Polynomial sample complexity, not computational complexity (unless RP = NP)
Solution: Don’t use H = C!  k-Term-DNF ⊆ k-CNF (so let H = k-CNF)
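
The reduction from k-CNF to monotone conjunctions mentioned above treats every clause of at most k literals as one pseudo-literal; a hedged Python sketch (helper names and the toy data are mine):

    from itertools import combinations, product

    def clause_features(x, k):
        """Truth value, on assignment x, of every disjunctive clause with at most k literals.
        A clause is represented as a frozenset of (variable index, sign) pairs."""
        n = len(x)
        feats = {}
        for size in range(1, k + 1):
            for idxs in combinations(range(n), size):
                for signs in product([True, False], repeat=size):
                    clause = frozenset(zip(idxs, signs))
                    feats[clause] = any(x[i] == s for i, s in clause)
        return feats

    def learn_k_cnf(examples, k):
        """Elimination over the pseudo-literals: keep only clauses true in every positive x."""
        n = len(examples[0][0])
        h = set(clause_features((True,) * n, k).keys())   # start with all candidate clauses
        for x, label in examples:
            if label:
                feats = clause_features(x, k)
                h = {c for c in h if feats[c]}
        return h                                          # hypothesis: conjunction of survivors

    # Toy target over three variables: (x0 OR x1) AND (NOT x2)
    D = [((True, False, False), True), ((False, True, False), True), ((True, True, True), False)]
    print(len(learn_k_cnf(D, 2)))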


PAC Learning:

Rectangles


Assume Target Concept Is An Axis Parallel (Hyper)rectangle











Will We Be Able To Learn The Target Concept?


Can We Come Close?

[Figure: positive (+) and negative (−) examples in the X-Y plane, with an axis-parallel rectangle as the target concept]


Consistent Learners


General Scheme for Learning
Follows immediately from definition of consistent hypothesis
Given: a sample D of m examples
Find: some h ∈ H that is consistent with all m examples
PAC: show that if m is large enough, a consistent hypothesis must be close enough to c
Efficient PAC (and other COLT formalisms): show that you can compute the consistent hypothesis efficiently


Monotone Conjunctions
Used an Elimination algorithm (compare: Find-S) to find a hypothesis h that is consistent with the training set (easy to compute); see the sketch below
Showed that with sufficiently many examples (polynomial in the parameters), h is close to c
Sample complexity gives an assurance of “convergence to criterion” for specified m, and a necessary condition (polynomial in n) for tractability
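
A minimal Python sketch of the Elimination (Find-S-style) learner for monotone conjunctions described above, assuming boolean instances; the data and names are illustrative:

    def learn_monotone_conjunction(examples):
        """examples: list of (x, label), x a tuple of booleans.
        Returns the set of variable indices kept in the hypothesis h."""
        n = len(examples[0][0])
        h = set(range(n))                       # most specific h: conjunction of all n literals
        for x, label in examples:
            if label:                           # only positive examples eliminate literals
                h = {i for i in h if x[i]}      # drop literals that are false in a positive x
        return h

    def predict(h, x):
        return all(x[i] for i in h)             # h(x): conjunction of the surviving literals

    # Toy target c = x0 AND x2 over three variables.
    D = [((True, True, True), True), ((True, False, True), True), ((False, True, True), False)]
    h = learn_monotone_conjunction(D)
    print(sorted(h), predict(h, (True, False, True)))   # [0, 2] True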


Occam’s Razor and PAC Learning [1]


Bad Hypothesis
error_D(h) ≡ Pr_{x ∈ D}[c(x) ≠ h(x)]
Want to bound: probability that there exists a hypothesis h ∈ H that
is consistent with m examples
satisfies error_D(h) > ε
Claim: the probability is less than |H| (1 − ε)^m

Proof
Let h be such a bad hypothesis
The probability that h is consistent with one example ⟨x, c(x)⟩ of c is Pr_{x ∈ D}[c(x) = h(x)] ≤ 1 − ε
Because the m examples are drawn independently of each other, the probability that h is consistent with m examples of c is less than (1 − ε)^m
The probability that some hypothesis in H is consistent with m examples of c is less than |H| (1 − ε)^m, Quod Erat Demonstrandum

Occam’s Razor and PAC Learning [2]


Goal
We want this probability to be smaller than δ, that is:
|H| (1 − ε)^m < δ
ln (|H|) + m ln (1 − ε) < ln (δ)
With ln (1 − ε) ≤ −ε:
m ≥ 1/ε (ln |H| + ln (1/δ))
This is the result from last time [Blumer et al., 1987; Haussler, 1988]


Occam’s Razor
“Entities should not be multiplied without necessity”
So called because it indicates a preference towards a small H
Why do we want small H?
Generalization capability: explicit form of inductive bias
Search capability: more efficient, compact
To guarantee consistency, need H ⊇ C; do we really want the smallest H possible?


VC Dimension:

Framework


Infinite Hypothesis Space?
Preceding analyses were restricted to finite hypothesis spaces
Some infinite hypothesis spaces are more expressive than others, e.g.,
rectangles vs. 17-sided convex polygons vs. general convex polygons
linear threshold (LT) function vs. a conjunction of LT units
Need a measure of the expressiveness of an infinite H other than its size

Vapnik-Chervonenkis Dimension: VC(H)
Provides such a measure
Analogous to |H|: there are bounds for sample complexity using VC(H)


VC Dimension:

Shattering A Set of Instances


Dichotomies
Recall: a partition of a set S is a collection of disjoint sets S_i whose union is S
Definition: a dichotomy of a set S is a partition of S into two subsets S_1 and S_2

Shattering
A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S, there exists a hypothesis in H consistent with this dichotomy
Intuition: a rich set of functions shatters a larger instance space

The “Shattering Game” (An Adversarial Interpretation)
Your client selects an S (an instance space X)
You select an H
Your adversary labels S (i.e., chooses a point c from concept space C = 2^X)
You must then find some h ∈ H that “covers” (is consistent with) c
If you can do this for any c your adversary comes up with, H shatters S


VC Dimension:

Examples of Shattered Sets


Three Instances Shattered

Intervals
Left-bounded intervals on the real axis: [0, a), for a ∈ R, a ≥ 0
Sets of 2 points cannot be shattered
Given 2 points, can label so that no hypothesis will be consistent
Intervals on the real axis ([a, b], a, b ∈ R, b > a): can shatter 1 or 2 points, not 3 (a brute-force check is sketched below)
Half-spaces in the plane (non-collinear): 1? 2? 3? 4?

[Figure: instance space X (the real line) with a left-bounded interval [0, a) and an interval [a, b], showing + and − labels]
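
For small point sets, claims like these can be checked by brute force: enumerate every dichotomy and ask whether some hypothesis in the class is consistent with it. A Python sketch for the two interval classes above (helper names are mine):

    from itertools import product

    def shattered(points, consistent_with):
        """True iff every +/- labeling of `points` is realized by some hypothesis."""
        return all(consistent_with(points, labels)
                   for labels in product([True, False], repeat=len(points)))

    def left_bounded_ok(points, labels):
        # Some a >= 0 with h(x) = (0 <= x < a) matching the labels?  Try boundary candidates.
        candidates = [0.0] + [x + 1e-9 for x in points]
        return any(all((0 <= x < a) == lab for x, lab in zip(points, labels))
                   for a in candidates)

    def interval_ok(points, labels):
        # Some [a, b] with h(x) = (a <= x <= b) matching the labels?
        candidates = sorted(points) + [min(points) - 1.0]
        return any(all((a <= x <= b) == lab for x, lab in zip(points, labels))
                   for a in candidates for b in candidates)

    print(shattered([1.0], left_bounded_ok), shattered([1.0, 2.0], left_bounded_ok))    # True False
    print(shattered([1.0, 2.0], interval_ok), shattered([1.0, 2.0, 3.0], interval_ok))  # True False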


VC Dimension:

Definition and Relation to Inductive Bias


Vapnik-Chervonenkis Dimension
The VC dimension VC(H) of hypothesis space H (defined over implicit instance space X) is the size of the largest finite subset of X shattered by H
If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞

Examples
VC(half intervals in R) = 1: no subset of size 2 can be shattered
VC(intervals in R) = 2: no subset of size 3
VC(half-spaces in R^2) = 3: no subset of size 4
VC(axis-parallel rectangles in R^2) = 4: no subset of size 5


Relation of VC(H) to Inductive Bias of H
Unbiased hypothesis space H shatters the entire instance space X
i.e., H is able to induce every partition on the set X of all possible instances
The larger the subset of X that can be shattered, the more expressive a hypothesis space is, i.e., the less biased


VC Dimension:

Relation to Sample Complexity


VC(H) as A Measure of Expressiveness
Prescribes an Occam algorithm for infinite hypothesis spaces
Given: a sample D of m examples
Find some h ∈ H that is consistent with all m examples
If m > 1/ε (8 VC(H) lg (13/ε) + 4 lg (2/δ)), then with probability at least (1 − δ), h has true error less than ε

Significance
If m is polynomial, we have a PAC learning algorithm
To be efficient, we need to produce the hypothesis h efficiently

Note
|H| ≥ 2^m required to shatter m examples
Therefore VC(H) ≤ lg (|H|)
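
A small Python helper (the function name is mine) for the sample-size bound quoted above:

    from math import log2, ceil

    def m_vc_bound(vc_dim, eps, delta):
        # m > (1/eps) * (8 * VC(H) * lg(13/eps) + 4 * lg(2/delta))
        return ceil((1.0 / eps) * (8 * vc_dim * log2(13.0 / eps) + 4 * log2(2.0 / delta)))

    # Example: axis-parallel rectangles in R^2, VC dimension 4 (previous slide).
    print(m_vc_bound(4, 0.1, 0.05))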


Mistake Bounds:

Halving Algorithm


Scenario for Analyzing Mistake Bounds
Halving Algorithm: learn concept using version space
e.g., Candidate-Elimination algorithm (or List-Then-Eliminate)
Need to specify performance element (how predictions are made)
Classify new instances by majority vote of version space members

How Many Mistakes before Converging to Correct h?
… in worst case?
Can make a mistake when the majority of hypotheses in VS_{H,D} are wrong
But then we can remove at least half of the candidates
Worst case number of mistakes: log_2 |H|
… in best case?
Can get away with no mistakes!
(If we were lucky and majority vote was right, VS_{H,D} still shrinks)
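
A minimal Python sketch of the halving algorithm over a finite hypothesis space (representation and names are mine): predict by majority vote of the current version space, then discard every hypothesis that disagrees with the revealed label.

    def halving(hypotheses, stream):
        """hypotheses: list of callables h(x) -> bool; stream: iterable of (x, label)."""
        version_space = list(hypotheses)
        mistakes = 0
        for x, label in stream:
            votes_true = sum(1 for h in version_space if h(x))
            prediction = votes_true * 2 > len(version_space)   # majority vote of VS members
            if prediction != label:
                mistakes += 1                                  # each mistake at least halves VS
            version_space = [h for h in version_space if h(x) == label]
        return version_space, mistakes

    # Toy run: 8 threshold hypotheses on integers; target concept is "x >= 5".
    H = [lambda x, t=t: x >= t for t in range(8)]
    vs, m = halving(H, [(x, x >= 5) for x in [3, 6, 4, 5, 7, 0]])
    print(len(vs), m)   # m never exceeds log2(|H|) = 3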

Optimal Mistake Bounds


Upper Mistake Bound for A Particular Learning Algorithm
Let M_A(C) be the max number of mistakes made by algorithm A to learn concepts in C
Maximum over c ∈ C, all possible training sequences D
M_A(C) ≡ max_{c ∈ C} M_A(c)

Minimax Definition
Let C be an arbitrary non-empty concept class
The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C)
Opt(C) ≡ min_{A ∈ learning algorithms} M_A(C)
VC(C) ≤ Opt(C) ≤ M_{Halving}(C) ≤ lg (|C|)

COLT Conclusions


PAC Framework


Provides reasonable model for theoretically analyzing effectiveness of learning
algorithms


Prescribes things to do: enrich the hypothesis space (search for a less restrictive H); make H more flexible (e.g., hierarchical); incorporate knowledge


Sample Complexity and Computational Complexity


Sample complexity for any consistent learner using H can be determined from measures of H’s expressiveness (|H|, VC(H), etc.)


If the sample complexity is tractable, then the computational complexity of finding a consistent h governs the complexity of the problem
Sample complexity bounds are not tight! (But they separate learnable classes from non-learnable classes)


Computational complexity results exhibit cases where information-theoretic learning is feasible, but finding a good h is intractable


COLT: Framework For Concrete Analysis of the Complexity of L
Dependent on various assumptions (e.g., x ∈ X contain relevant variables)


Terminology


PAC Learning: Example Concepts


Monotone conjunctions


k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF
Axis-parallel (hyper)rectangles
Intervals and semi-intervals


Occam’s Razor: A Formal Inductive Bias


Occam’s Razor: ceteris paribus (all other things being equal), prefer shorter hypotheses (in machine learning, prefer shortest consistent hypothesis)


Occam algorithm: a learning algorithm that prefers short hypotheses


Vapnik-Chervonenkis (VC) Dimension


Shattering


VC(H)


Mistake Bounds


M_A(C) for A ∈ {Find-S, Halving}
Optimal mistake bound Opt(H)


Summary Points


COLT: Framework Analyzing Learning Environments


Sample complexity of C (what is m?)
Computational complexity of L
Required expressive power of H
Error and confidence bounds (PAC: 0 < ε < 1/2, 0 < δ < 1/2)

What PAC Prescribes
Whether to try to learn C with a known H
Whether to try to reformulate H (apply change of representation)

Vapnik-Chervonenkis (VC) Dimension
A formal measure of the complexity of H (besides |H|)
Based on X and a worst-case labeling game

Mistake Bounds
How many could L incur?


Another way to measure the cost of learning


Next Week: Decision Trees