Notions of Teaching and
Complexity in Computational
Learning Theory
Dissertation
submitted for the degree of
Doctor of Mathematics
at the Ruhr-Universität Bochum,
Faculty of Mathematics
submitted by
Thorsten Doliwa
under the supervision of
Hans Ulrich Simon
Bochum
July 2013
Declaration of Authorship
I hereby declare that I have written this thesis independently and have used
no aids other than those indicated.
I further declare that I have marked as such everything taken from others in
thought, content, or wording (e.g., from books, journals, newspapers,
encyclopedias, the Internet, etc.), i.e., that I have documented the respective
source in the text or in the notes. Where applicable, this also holds for
tables, sketches, drawings, pictorial representations, etc.
Place, date Signature
Acknowledgments
This research project would not have been possible without the support of
many people. First and foremost, I would like to thank my supervisor Prof.
Dr. Hans Ulrich Simon for giving me the opportunity to work in this research
project, for his valuable guidance, and for his continuous support in keeping
the momentum going.
I would also like to thank Dr. Sandra Zilles for introducing the problem of
cooperative learning to our research group, the fundamental problem which
this thesis is about, and for the warm welcome and fruitful discussions during
my short visit at the University of Regina.
My sincere thanks go to my colleagues Malte Darnstädt and Michael Kallweit
for scientific discussion and helpful advice, especially during the last hectic
weeks of our research.
Last but by no means least, I am forever indebted to my fiancée Dr. Christine
Kiss for her abundance of patience, understanding and constant support.
Contents
List of Symbols
Introduction
1 A Brief Introduction to Machine Learning
  1.1 Fundamental Notions in Learning Theory
  1.2 The Probably Approximately Correct Learning Model
  1.3 The Vapnik-Chervonenkis Dimension
  1.4 Special Classes
    1.4.1 Maximum and Maximal Classes
    1.4.2 Dudley Classes
    1.4.3 Intersection-closed Classes
    1.4.4 Nested Differences
  1.5 Online-Learning Model
  1.6 Partial Equivalence Query Model
  1.7 Self-directed Learning
  1.8 Sample Compression and its Relation to PAC Learning
2 The Teaching Dimension
  2.1 Definitions
  2.2 Collusions and Coding Tricks
  2.3 Relation to the VC-dimension
  2.4 The Teaching Dimension for Special Classes
  2.5 New Results for Shortest-Path-Closed Classes
3 The Recursive Teaching Dimension
  3.1 Definitions and Properties
  3.2 Relation to Other Complexity Notions
  3.3 Recursive Teaching Dimension and VC-dimension
    3.3.1 Classes with RTD Exceeding VCD
    3.3.2 Intersection-closed Classes
    3.3.3 Maximum Classes
  3.4 Conclusions
4 Order Compression Schemes
  4.1 Sample Compression
  4.2 Properties of Order Compression Schemes
  4.3 Order Compression Schemes and Teaching
  4.4 Order Schemes for Special Classes
  4.5 Conclusions
A Python Programs for Complexity Notions
  A.1 A Small Toolbox
Index
Bibliography
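List of Symbols
$\mathcal{C}$ ... concept class
$G(\mathcal{C})$ ... one-inclusion graph of $\mathcal{C}$
$G_{\mathrm{comp}}(f,g)$ ... compression graph associated with $(f,g)$
$I(\mathcal{C})$ ... largest minimal spanning set size
$\mathrm{Cons}(S,\mathcal{C})$ ... set of concepts in $\mathcal{C}$ consistent with $S$
$\mathcal{TS}(C,\mathcal{C})$ ... set of all teaching sets of $C \in \mathcal{C}$
$X$ ... instance space
$\deg_{G}(C)$ ... the degree of $C$
$\mathrm{dens}(G(\mathcal{C}))$ ... density of $G(\mathcal{C})$
$\mathrm{DIFF}^{d}(\mathcal{C})$ ... nested difference of depth $d$
$\langle S \rangle_{\mathcal{C}}$ ... the smallest concept in $\mathcal{C}$ containing $S$
$\mathrm{LC\text{-}PARTIAL}(\mathcal{C})$ ... partial equivalence query complexity
$\mathbb{N}$ ... the integers
$\mathrm{OCN}(\mathcal{C},\mathcal{H})$ ... order compression scheme
$\mathbb{R}$ ... the real numbers
$\mathrm{rfRTD}(\mathcal{C})$ ... repetition-free recursive teaching dimension of $\mathcal{C}$
$\mathrm{RTD}(\mathcal{C})$ ... recursive teaching dimension of $\mathcal{C}$
$\mathrm{SDC}(\mathcal{C})$ ... self-directed learning complexity
$\mathrm{TD}(\mathcal{C})$ ... teaching dimension of $\mathcal{C}$
$\mathrm{TS}(C,\mathcal{C})$ ... size of smallest teaching set for $C \in \mathcal{C}$
$\mathrm{TS}_{\mathrm{avg}}(\mathcal{C})$ ... average teaching dimension of $\mathcal{C}$
$\mathrm{TS}_{\max}(\mathcal{C})$ ... maximum teaching dimension of $\mathcal{C}$
$\mathrm{TS}_{\min}(\mathcal{C})$ ... minimum teaching dimension of $\mathcal{C}$
$\mathrm{VCdim}(\mathcal{C})$ ... VC-dimension of $\mathcal{C}$
$A \triangle B$ ... symmetric difference of $A$ and $B$
$\mathcal{D}_{\mathcal{F},h}$ ... Dudley class
$I(C, G(\mathcal{C}))$ ... incident edges of $C$ in $G(\mathcal{C})$
$M_{\mathrm{opt}}(\mathcal{C})$ ... optimal mistake bound
$N$ ... usually the number of concepts
$n$ ... usually the number of instances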
Introduction
In the design and analysis of machine learning algorithms, the amount of
training data that needs to be provided for the learning algorithm to be
successful is an aspect of central importance. In many applications, training
data is expensive or difficult to obtain, and thus input-efficient learning
algorithms are desirable. In computational learning theory, therefore, one way
of measuring the complexity of a concept class is to determine the worst-case
number of input examples required by the best valid learning algorithm. What
counts as a valid learning algorithm depends on the underlying model of
learning. We refer to this kind of complexity measure as information
complexity. For example, in PAC-learning [43], the information complexity of a
concept class $\mathcal{C}$ is the worst-case sample complexity a best possible
PAC learner for $\mathcal{C}$ can achieve on all concepts in $\mathcal{C}$. In
query learning [2], it is the best worst-case number of queries a learner would
have to ask to identify an arbitrary concept in $\mathcal{C}$. In the classical
model of teaching [19; 41], the information complexity of $\mathcal{C}$ is given
by its teaching dimension, i.e., the largest number of labeled examples that
would have to be provided for distinguishing any concept in $\mathcal{C}$ from
all other concepts in $\mathcal{C}$.
Besides the practical need to limit the required amount of training data, there
are a number of reasons for formally studying information complexity. Firstly,
a theoretical study of information complexity yields formal guarantees
concerning the amount of data that needs to be processed to solve a learning
problem. Secondly, analyzing information complexity often helps to understand
the structural properties of concept classes that are particularly hard or
particularly easy to learn. Thirdly, the theoretical study of information
complexity helps to identify connections between various formal models of
learning, for example if it turns out that, for a certain type of concept class,
the information complexity under learning model A is in some relationship
with the information complexity under model B. This third aspect is the main
motivation of our study.
In the past two decades, several learning models were defined with the aim of
understanding in which way a low information complexity can be achieved.
One such model is learning from partial equivalence queries [33], which
subsume all types of queries for which negative answers are witnessed by
counterexamples, e.g., membership, equivalence, subset, superset, and
disjointness queries [2]. As lower bounds on the information complexity in this
query model (here called query complexity) hold for numerous other query
learning models, they are particularly interesting objects of study. Even more
powerful are self-directed learners [22]. Each query of a self-directed learner
is a prediction of a label for an instance of the learner's choice, and the
learner gets charged only for wrong predictions. The query complexity in this
model lower-bounds the one obtained from partial equivalence queries [21].
Dual to the models of query learning, in which the learner actively chooses
the instances it wants information on, the literature proposes models of
teaching [19; 41], in which a helpful teacher selects a set of examples and
presents it to the learner, again aiming at a low information complexity.
One of these attempts is the subset teaching dimension (STD) introduced by
Zilles et al. [47]. Its key idea is to reduce teaching sets (uniquely
identifying sample sets) to minimal subsets that are not part of any teaching
set for another concept. Although such a subset may be consistent with more
than one concept, a smart learner could interpret this sophisticated choice in
the desired manner. Zilles et al. present several classes for which the STD is
substantially smaller than the teaching dimension. However, besides lacking
some desirable properties, the STD behaves counterintuitively on some classes.
Another recent model of teaching with low information complexity is recursive
teaching, where a teacher chooses a sample based on a sequence of nested
subclasses of the underlying concept class $\mathcal{C}$ [47]. The nesting is
defined as follows. The outermost "layer" consists of all concepts in
$\mathcal{C}$ that are easiest to teach, i.e., that have the smallest sets of
examples distinguishing them from all other concepts in $\mathcal{C}$. The next
layers are formed by recursively repeating this process with the remaining
concepts. The largest number of examples required for teaching at any layer is
the recursive teaching dimension (RTD) of $\mathcal{C}$. The RTD substantially
reduces the information complexity bounds obtained in previous teaching models.
It lower-bounds not only the teaching dimension (the measure of information
complexity in the "classical" teaching model [19; 41]) but also the information
complexity of iterated optimal teaching [4], which is often substantially
smaller than the classical teaching dimension. The recursive teaching dimension
is the topic that dominates most of this thesis.
A combinatorial parameter of central importance in learning theory is the
VC-dimension [44]. Among other relevant properties, it provides bounds on the
sample complexity of PAC-learning [9]. Since the VC-dimension is the
best-studied quantity related to information complexity in learning, it is a
natural first parameter to compare with when it comes to identifying
connections between information complexity notions across various models of
learning. For example, even though the self-directed learning complexity can
exceed the VC-dimension, existing results show some connection between these
two complexity measures [21]. However, the teaching dimension, i.e., the
information complexity of the classical teaching model, does not exhibit any
general relationship to the VC-dimension: the two parameters can be arbitrarily
far apart in either direction [19]. Similarly, there is no known connection
between teaching dimension and query complexity.
In the context of concept learning, sample compression schemes are schemes
for "encoding" a set of examples in a small subset of examples. For instance,
from the set of examples they process, learning algorithms often extract a
subset of particularly "significant" examples in order to represent their
hypotheses. This way, sample bounds for PAC-learning of a concept class
$\mathcal{C}$ can be obtained from the size of a smallest sample compression
scheme for $\mathcal{C}$ [32; 16]. The size of a sample compression scheme is
the size of the largest subset resulting from compressing any sample consistent
with some concept in $\mathcal{C}$. In what follows, we will use the term sample
compression number of a concept class $\mathcal{C}$ to refer to the smallest
possible size of a sample compression scheme for $\mathcal{C}$.
Outline of this Thesis. In Chapter 1 we summarize the results from the
literature of the last 25 years that are relevant to this thesis. Chapter 2 is
concerned with the teaching dimension. Besides summarizing the existing facts,
we prove new results for shortest-path-closed classes.
In Chapter 3 we examine the recursive teaching dimension and exhibit its
remarkable properties. In particular, we show for many structured concept
classes that the recursive teaching dimension is bounded by the VC-dimension.
This includes both intersection-closed and maximum concept classes. These two
types of concept classes are of great interest to the learning theory community
due to their expressive power despite their simple definition. Moreover, we
give the first link between teaching and sample compression known to date.
Influenced by these ideas, we introduce the notion of order compression
schemes in Chapter 4. We show that any total order over a finite concept
class $\mathcal{C}$ induces a special type of sample compression scheme, called
an order compression scheme. These schemes can serve as a first step in proving
or disproving a long-standing open problem: the sample compression conjecture
by Floyd and Warmuth [16]. It states that any concept class of VC-dimension $d$
admits a sample compression scheme of size linear in $d$. And indeed, we prove
the existence of order compression schemes of the size of the VC-dimension for
many well-known concept classes, e.g., maximum classes, Dudley classes,
intersection-closed classes and classes of VC-dimension 1. Since order
compression schemes are much simpler than sample compression schemes in
general, their study seems to be a promising step towards resolving the sample
compression conjecture. We reveal a number of fundamental properties of order
compression schemes, which are helpful in such a study. In particular, order
compression schemes exhibit interesting graph-theoretic properties as well as
connections to the theory of learning from teachers.
1 A Brief Introduction to Machine Learning
1.1 Fundamental Notions in Learning Theory
Throughout this work, $X$ denotes a finite set called the instance space and
$\mathcal{C}$ denotes a concept class over the domain $X$. The number of
instances is denoted by $n$, and $|\mathcal{C}| = N$. A single concept
$C \in \mathcal{C}$ can be represented interchangeably in different ways: as a
subset of $X$, as a boolean mapping $C: X \to \{0,1\}$, or as a value vector in
$\{0,1\}^{|X|}$. Thus, a concept class can be a subset of $2^{X}$, a family of
boolean functions, or a set of vectors, respectively. Depending on the context,
we will use the definition which serves our needs best. A sample set w.r.t.
$\mathcal{C}$ is a subset $S$ of $X \times \{0,1\}$ such that there exists a
concept $C \in \mathcal{C}$ which fulfills $C(x) = b$ for all $(x,b) \in S$,
i.e., $C$ is consistent with $S$. The set
$\mathrm{Cons}(S,\mathcal{C}) \subseteq \mathcal{C}$ denotes the set of concepts
consistent with $S$. For a given sample set $S$, we define $X(S)$ to be the set
of unlabeled instances included in $S$ and $\mathcal{C}|_{X}$ the restriction of
$\mathcal{C}$ to $X$.
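The notions above can be made concrete with a minimal Python sketch. It assumes that concepts are stored as frozensets over a small instance space; the toy class and the helper names `as_vector`, `cons` and `instances_of` are illustrative assumptions and not part of the thesis.

```python
# Hypothetical toy instance space and concept class (not from the thesis).
X = [0, 1, 2]
concept_class = [frozenset(), frozenset({0}), frozenset({0, 1}), frozenset({0, 1, 2})]

def as_vector(concept, instances):
    """Value vector in {0,1}^|X| of a concept given as a subset of X."""
    return tuple(1 if x in concept else 0 for x in instances)

def cons(sample, concepts):
    """Cons(S, C): all concepts C with C(x) = b for every (x, b) in S."""
    return [c for c in concepts if all((x in c) == bool(b) for x, b in sample)]

def instances_of(sample):
    """X(S): the unlabeled instances occurring in the sample set S."""
    return {x for x, _ in sample}

sample = {(0, 1), (2, 0)}                 # x = 0 is positive, x = 2 is negative
print(as_vector(frozenset({0, 1}), X))    # (1, 1, 0)
print(cons(sample, concept_class))        # the concepts {0} and {0, 1}
print(instances_of(sample))
```

A sample set in the sense of the definition is exactly a set of labeled pairs for which `cons` returns a non-empty list.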
In many learning-theoretic frameworks the learner has to identify a special
concept $C^{*} \in \mathcal{C}$, either with a certain error on $X$ or exactly.
This special concept is called the target concept and is denoted by $C^{*}$.
Moreover, an expert or oracle will often provide training examples to the
learner. The learner's current guess for the target is called the hypothesis,
and the set of all hypotheses consistent with the examples seen so far is called
the version space.
[Figure 1.1: The PAC model. Diagram labels: oracle, learner, hypothesis, unknown
distribution $D$, concept class with target concept $C$, sample labeled
according to $C$. The error of the resulting hypothesis is measured according to
the distribution $D$.]
1.2 The Probably Approximately Correct Learning Model
The PAC model was defined in the groundbreaking paper of Valiant [43] and
is the basis for many subsequent studies in machine learning theory. It
reflects the general idea of learning. A learner has to learn some target
concept and is provided with random examples by an expert or oracle, while
the underlying distribution is not known to the learner. From the presented
examples the learner needs to build a hypothesis which reflects his
knowledge about the target. But instead of learning the target perfectly, he is
allowed to make some mistakes, like human beings. Thus, he is allowed to
misclassify a small proportion of the instance space, say $\varepsilon$,
weighted according to the unknown distribution. And since the learner has no
knowledge about the distribution and thus about the presented examples, he may
fail altogether with a relatively small probability, say $\delta$. This is a
necessary concession since the sample drawn according to the distribution could
be highly uninformative. Based on these simple and natural assumptions we can
define the PAC model.
Definition 1.1 (Probably approximately correct learning). A concept class
$\mathcal{C}$ is efficiently PAC-learnable if there exists an algorithm $A$
which, for all concepts $C^{*} \in \mathcal{C}$, all distributions $D$ over $X$,
and any $0 < \varepsilon < \frac{1}{2}$ and $0 < \delta < \frac{1}{2}$, produces
a hypothesis $H \in \mathcal{C}$ such that with probability at least
$1 - \delta$ it holds that $\mathrm{err}_{D}(H) \le \varepsilon$ when given
access to labeled samples $S$ consistent with $C^{*}$ and drawn according to
$D$. Additionally, with $m$ denoting the number of samples,
• $A$ runs in polynomial time in $n$ and $m$,
• $m \in O(\mathrm{poly}(\frac{1}{\varepsilon}, \frac{1}{\delta}, \mathrm{size}(C^{*})))$.
Here,
\[ \mathrm{err}_{D}(H) = \sum_{x \in H \triangle C^{*}} D(x), \]
with $\triangle$ being the symmetric difference, is the true error of the
hypothesis weighted according to $D$. The number of examples required for
successful learning is called the sample complexity, which is the main quantity
of interest in most cases.
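As a small illustration of the error measure, the following Python sketch evaluates $\mathrm{err}_{D}(H)$ for a finite instance space. The distribution, target and hypothesis are made-up toy values, and representing $D$ as a dictionary of point masses is an assumption of this sketch.

```python
def true_error(hypothesis, target, dist):
    """err_D(H): total D-weight of the symmetric difference of H and the target."""
    return sum(dist[x] for x in hypothesis ^ target)

# Hypothetical distribution over X = {0, 1, 2, 3}; the weights sum to 1.
D = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
target = frozenset({0, 1})
hypothesis = frozenset({0, 2})

# The hypothesis disagrees with the target exactly on the instances 1 and 2.
print(true_error(hypothesis, target, D))  # 0.375
```

A hypothesis is acceptable for a given $\varepsilon$ precisely when this value is at most $\varepsilon$.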
In a slight alteration of the above model, the hypothesis is picked from another
class $\mathcal{H}$ instead of $\mathcal{C}$ itself. For some classes, this
makes the learning task easier. This type of learning is called improper PAC
learning, as opposed to proper PAC learning where $\mathcal{C} = \mathcal{H}$.
The idea of proper and improper learning will become important in a different
setting in Chapter 4.
Although the PAC model is not the subject of this work, it is one of the most
important models, not least because of its strong connection to the VC-dimension
and sample compression schemes, both of which we will define shortly. For
further information about the PAC model we refer the interested reader to the
book of Kearns and Vazirani [26].
1.3 The Vapnik-Chervonenkis Dimension
The VC-dimension was discovered independently by several authors [40; 39;
44] and has proven to be a very useful tool in machine learning theory,
combinatorics and probability theory. The main idea behind this complexity
notion is to measure the complexity of a class $\mathcal{C}$ not simply by its
size but rather by its richness. By richness we mean the ability to fit a given
labeling on a set of instances.
[Figure 1.2: a) and b) show two configurations that can be realized by
axis-aligned rectangles. c) is an example of an unrealizable case with 5 sample
points.]
Definition 1.2. Let $\mathcal{C}$ be a concept class over an instance space $X$.
For any set $S = \{x_1, \ldots, x_m\} \subseteq X$ let
\[ \Pi_{\mathcal{C}}(S) := \{C \cap S \mid C \in \mathcal{C}\} \]
denote the set of all labelings or dichotomies of $\mathcal{C}$ on $S$.
Alternatively, one can think of the set of all vectors produced by
$\mathcal{C}$ given by the projection onto $S = \{x_1, \ldots, x_m\}$, i.e.,
\[ \Pi_{\mathcal{C}}(S) = \{(C(x_1), \ldots, C(x_m)) \mid C \in \mathcal{C}\}. \]
If $|\Pi_{\mathcal{C}}(S)| = 2^{|S|}$ then $S$ is said to be shattered by
$\mathcal{C}$. The Vapnik-Chervonenkis dimension of $\mathcal{C}$, denoted by
$\mathrm{VCdim}(\mathcal{C})$, is defined as the size of the largest set $S$
shattered by $\mathcal{C}$. The maximum number of realizable patterns with $m$
instances is denoted by
\[ \Pi_{\mathcal{C}}(m) = \max\{|\Pi_{\mathcal{C}}(S)| \mid |S| = m,\ S \subseteq X\}. \]
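For finite classes, the definition can be checked by brute force. The following Python sketch assumes concepts are frozensets over a small instance space; the function names and the toy class of discrete intervals are illustrative assumptions, and the running time is exponential in $|X|$.

```python
from itertools import combinations

def dichotomies(concepts, subset):
    """The set of distinct restrictions of the concepts to the given subset."""
    s = frozenset(subset)
    return {c & s for c in concepts}

def is_shattered(concepts, subset):
    return len(dichotomies(concepts, subset)) == 2 ** len(subset)

def vc_dim(concepts, instances):
    """Size of the largest shattered subset of the instance space."""
    best = 0
    for k in range(1, len(instances) + 1):
        if any(is_shattered(concepts, s) for s in combinations(instances, k)):
            best = k
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

def growth(concepts, instances, m):
    """Maximum number of dichotomies realized on any m instances."""
    return max(len(dichotomies(concepts, s)) for s in combinations(instances, m))

# Toy class: all discrete intervals {a, ..., b-1} over X = {0, 1, 2, 3},
# including the empty interval.
X = range(4)
intervals = {frozenset(range(a, b)) for a in range(5) for b in range(a, 5)}
print(vc_dim(intervals, X))     # 2
print(growth(intervals, X, 3))  # 7
```

On three points an interval cannot pick the two outer points without the middle one, which is why only 7 of the 8 patterns occur.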
Example 1.3. Consider the class of axis-aligned rectangles in the Euclidean
plane. Any labeling of the 4 points in Figure 1.2 a) and b) can be realized.
While these subfigures show two of the 16 different realizations for this
specific constellation of 4 points, c) shows an unrealizable labeling of 5
points. Indeed, such an impossible labeling exists for any set of 5 points:
pick the leftmost, rightmost, top and bottom points and label them positively.
These are at most 4 points. Every remaining point lies inside their bounding
box and is labeled negatively. As such, the VC-dimension of axis-aligned
rectangles is 4.
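The argument of the example can be replayed numerically. The sketch below relies on the observation that a labeling is realizable by an axis-aligned rectangle exactly when the bounding box of the positive points contains no negative point; the concrete point sets are hypothetical and merely stand in for the configurations of Figure 1.2.

```python
from itertools import product

def realizable(points, labels):
    """True iff some axis-aligned rectangle contains exactly the positive points."""
    pos = [p for p, b in zip(points, labels) if b]
    if not pos:
        return True  # a rectangle placed away from all points
    x_lo, x_hi = min(p[0] for p in pos), max(p[0] for p in pos)
    y_lo, y_hi = min(p[1] for p in pos), max(p[1] for p in pos)
    # The bounding box of the positives is the smallest candidate rectangle.
    return not any(x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi
                   for p, b in zip(points, labels) if not b)

def shattered(points):
    return all(realizable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(shattered(diamond))             # True
print(shattered(diamond + [(0, 0)]))  # False
```

The fifth point lies inside the bounding box of the four extreme points, so labeling it negatively and the others positively cannot be realized.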
A first fundamental result is an upper bound on the size of a concept class of
finite VC-dimension. There exist numerous different proofs of the following
lemma. One beautiful proof is given by Smolensky [42]. We will pick up his
strategy and modify it slightly in Chapter 3 to prove similar results for a
different complexity notion.
Lemma 1.4 (Sauer's Lemma, [39]). Let $\mathcal{C} \subseteq 2^{X}$ be an
arbitrary concept class with $\mathrm{VCdim}(\mathcal{C}) = d$. Then for any $m$
it holds that
\[ \Pi_{\mathcal{C}}(m) \le \sum_{k=0}^{d} \binom{m}{k}. \]
Throughout this work, we are only concerned with finite concept classes. In
this special case it follows that
$|\mathcal{C}| \le \sum_{k=0}^{d} \binom{n}{k}$ with $|X| = n$.
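The right-hand side of Sauer's Lemma is easy to evaluate. The following Python helper is a minimal sketch; the name `sauer_bound` is an assumption of this example.

```python
from math import comb

def sauer_bound(m, d):
    """Sum of the binomial coefficients C(m, k) for k = 0, ..., d."""
    return sum(comb(m, k) for k in range(d + 1))

print(sauer_bound(4, 2))   # 1 + 4 + 6 = 11
print(sauer_bound(5, 4))   # 31, one less than 2**5
print(sauer_bound(3, 3))   # 8 = 2**3, the bound is trivial for m <= d
```

For $m \le d$ the bound equals $2^{m}$, and for $m > d$ it grows only polynomially in $m$.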
A very famous result by Blumer et al. [9] is the upper bound on the number of
samples needed for successful learning in the PAC model, which is independent
of the size of the concept class.
Theorem 1.5 ([9]). Let $\mathcal{C}$ be an arbitrary concept class with
$\mathrm{VCdim}(\mathcal{C}) = d$. Any consistent learner will output some
hypothesis $H \in \mathcal{C}$ with $\mathrm{err}_{D}(H) \le \varepsilon$ with
probability at least $1 - \delta$, if he is provided with
\[ m \ge \max\left( \frac{4}{\varepsilon} \log\frac{2}{\delta},\ \frac{8d}{\varepsilon} \log\frac{13}{\varepsilon} \right) \]
many samples.
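To get a feeling for the magnitude of this bound, it can be evaluated numerically. The sketch below assumes that the logarithm is taken to base 2, which the theorem statement above does not specify; the function name is likewise an assumption.

```python
from math import ceil, log2

def blumer_sample_bound(eps, delta, d):
    """Smallest integer m satisfying the bound of Theorem 1.5 (base-2 logs assumed)."""
    return ceil(max(4 / eps * log2(2 / delta),
                    8 * d / eps * log2(13 / eps)))

# Example: accuracy 0.1, confidence 0.95, VC-dimension 4.
print(blumer_sample_bound(0.1, 0.05, 4))  # 2248
```

In this example the second term, which scales linearly with $d$, dominates the first.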
1.4 Special Classes
Mostly, machine learning theorists do not deal simply with an arbitrary subset
of $\{0,1\}^{n}$. Usually they make more or less strong assumptions on the
structure of the concept class. Widely known, also outside the community of
computational learning theory, are classes of (monotone) monomials over boolean
variables and boolean formulas in disjunctive (resp. conjunctive) normal form,
often abbreviated as DNF (resp. CNF), and their variants such as m-term DNF or
k-DNF. But there also exist special types of classes which might not be known
to the reader. Therefore, we want to introduce some of the most important ones,
first and foremost maximum and maximal classes. On the one hand, researchers
have proven the strongest and most significant results for maximum classes.
On the other hand, maximal classes are in some sense the evil twin of maximum
classes: only a few vital results for maximal classes are known to date.
1.4.1 Maximum and Maximal Classes
A finite class $\mathcal{C}$ over $X$ with $|X| = n$ and
$\mathrm{VCdim}(\mathcal{C}) = d$ is called a maximum class if
$|\mathcal{C}| = \sum_{k=0}^{d} \binom{n}{k}$, i.e., it meets the Sauer bound
from Lemma 1.4. A class is called a maximal class if for any proper superset
$\mathcal{C}' \supsetneq \mathcal{C}$ it holds that
$\mathrm{VCdim}(\mathcal{C}') > \mathrm{VCdim}(\mathcal{C})$. Note that any
finite maximum class is also maximal, but the converse is not necessarily true
(see [45; 17]).
Essentially, maximum classes are unions of hypercubes of dimension
$\mathrm{VCdim}(\mathcal{C})$ (see Floyd and Warmuth [16]) and so they are
well-structured. Many examples of maximum classes have a nice construction or
graphical representation.
Example 1.6 (At most two positive examples, [15]). As an example, consider
the maximum class $\mathcal{C}$ of VC-dimension two on $X$ that consists of all
concepts with at most two positive examples. Then, for
$\{x_1, x_2\} \subseteq X$, $c_{x_1,x_2}$ denotes the concept on
$X \setminus \{x_1, x_2\}$ where every example is a negative example. This is
the only concept on $X \setminus \{x_1, x_2\}$ that remains a concept in
$\mathcal{C}$ if both $x_1$ and $x_2$ are positive examples.
One important property of maximum classes of VC-dimension $d$ is that every
set of instances of size at most $d$ is shattered. Note that this is only a
necessary and not a sufficient condition. This can be seen by removing the
empty concept from the example stated above. All sets of size at most $d$ are
still shattered, but the class is obviously not maximum anymore if
$|X| \ge d + 1$.
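This observation can be verified on a small instance. The Python sketch below builds the class of all concepts with at most two positive examples over a hypothetical four-element instance space and then removes the empty concept; all names are assumptions of this example.

```python
from itertools import combinations
from math import comb

def shattered(concepts, subset):
    s = frozenset(subset)
    return len({c & s for c in concepts}) == 2 ** len(s)

def vc_dim(concepts, instances):
    d = 0
    for k in range(1, len(instances) + 1):
        if any(shattered(concepts, s) for s in combinations(instances, k)):
            d = k
    return d

def is_maximum(concepts, instances):
    """True iff the class meets the Sauer bound for its own VC-dimension."""
    d = vc_dim(concepts, instances)
    return len(concepts) == sum(comb(len(instances), k) for k in range(d + 1))

X = range(4)
at_most_two = {frozenset(s) for k in range(3) for s in combinations(X, k)}
without_empty = at_most_two - {frozenset()}

print(is_maximum(at_most_two, X))                                    # True
print(all(shattered(without_empty, s) for s in combinations(X, 2)))  # True
print(is_maximum(without_empty, X))                                  # False
```

After the removal every pair is still shattered, because a singleton outside the pair realizes the all-negative pattern, but the class has one concept fewer than the Sauer bound.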
There exists a large variety of examples of maximum classes, and most of them
can be described or defined in a natural way.
Example 1.7 (Intervals on a line, [15]). Let $\mathcal{C}_n$ be the class
containing all unions of at most $n$ positively labeled intervals on a line.
This class is maximum of VC-dimension $2n$. This follows because for a finite
set of $m$ points on the line, for $m \ge 2n$, there are
$\sum_{i=0}^{2n} \binom{m}{i}$ ways to label those $m$ points consistent with at
most $n$ positive intervals on a line.
The well-studied class of halfspaces is not a maximum class itself, but
fortunately it is a composition of such classes.
Example 1.8 ([15], [16]). For any $h = (h_1, \ldots, h_{n+1}) \in \mathbb{R}^{n+1}$
and $x \in \mathbb{R}^{n}$ let
\[ f_h(x) = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} h_i x_i + h_{n+1} \ge 0, \\ 0, & \text{else.} \end{cases} \]
Then $\mathrm{HS}_n := \{f_h \mid h \in \mathbb{R}^{n+1}\}$ is the set of
halfspaces in $\mathbb{R}^{n}$. This can be partitioned into two maximum classes
as follows: let
$\mathrm{PHS}_n := \{f_h \mid h \in \mathbb{R}^{n+1}, h_1 > 0\}$ and
$\mathrm{NHS}_n := \{f_h \mid h \in \mathbb{R}^{n+1}, h_1 < 0\}$
be the set of positive and negative halfspaces, respectively. Then any
restriction of one of these to a set of finitely many points in general position
leads to a maximum class of VC-dimension $n$.
As already indicated above, maximal classes do not have such nice properties
besides their inclusion-maximality w.r.t. the VC-dimension. To the best of
our knowledge, there are also no natural examples of maximal classes as there
are in the maximum case. Nevertheless, maximal classes are of special interest
since any class can be embedded into a maximal class of exactly the same
VC-dimension. Thus, results concerning maximal classes can often be easily
transferred to arbitrary classes, but the theory lacks such significant results.
Therefore, the embeddability of arbitrary classes into maximum classes with at
most linear growth of the VC-dimension gains importance, but this is still
an unsolved problem. Recently, Rubinstein and Rubinstein [37] showed that
a special type of maximum classes is not sufficient for transferring its
properties to arbitrary classes.
1.4.2 Dudley Classes
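Definition 1.9 (Dudley [13]). Let $\mathcal{F}$ be a vector space of real-valued
functions over some domain $X$ and $h: X \to \mathbb{R}$. For every
$f \in \mathcal{F}$ let
\[ C_f(x) := \begin{cases} 1, & \text{if } f(x) + h(x) \ge 0, \\ 0, & \text{else.} \end{cases} \]
Then $\mathcal{D}_{\mathcal{F},h} = \{C_f \mid f \in \mathcal{F}\}$ is called a
Dudley class. The dimension of $\mathcal{D}_{\mathcal{F},h}$ is equal to the
dimension of the vector space $\mathcal{F}$.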
The definition of Dudley classes seems to be rather technical, but actually
there are some popular and well-known Dudley classes, e.g.,
• collections of halfspaces over $\mathbb{R}^{n}$, which are very common
objects of study in machine learning, such as in artificial neural networks and
support vector machines, see, e.g., [1],
• unions of at most $k$ intervals over $\mathbb{R}$,
• $n$-dimensional balls.
Besides this, Dudley classes have a strong connection to maximum classes.
Lemma 1.10 (Ben-David and Litman [6]). Dudley classes of dimension $d$ are
embeddable in maximum classes of VC-dimension $d$.
Thus, many results for maximum classes can be easily transferred to Dudley
classes and their notable members.
1.4.3 Intersection-closed Classes
A concept class $\mathcal{C}$ is called intersection-closed if for every pair
$C, C' \in \mathcal{C}$ also $C \cap C' \in \mathcal{C}$. Among the standard
examples of intersection-closed classes are the $d$-dimensional boxes over
$[n]^{d}$:
\[ \mathrm{BOX}^{d}_{n} := \{[a_1, b_1] \times \cdots \times [a_d, b_d] \mid \forall i = 1, \ldots, d: 1 \le a_i, b_i \le n\}. \]
Here, $[a,b]$ is an abbreviation for $\{a, a+1, \ldots, b\}$, where
$[a,b] = \emptyset$ if $a > b$. As a result of this, monomials are also
intersection-closed when viewed as orthogonal subrectangles of the boolean
hypercube.
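For $T \subseteq X$, we define $\langle T \rangle_{\mathcal{C}}$ as the smallest
concept in $\mathcal{C}$ containing $T$, i.e.,
\[ \langle T \rangle_{\mathcal{C}} := \bigcap_{T \subseteq C \in \mathcal{C}} C. \]
A spanning set for $T \subseteq X$ w.r.t. $\mathcal{C}$ is a set
$S \subseteq T$ such that
$\langle S \rangle_{\mathcal{C}} = \langle T \rangle_{\mathcal{C}}$.
$S$ is called a minimal spanning set w.r.t. $\mathcal{C}$ if, for every proper
subset $S'$ of $S$,
$\langle S' \rangle_{\mathcal{C}} \ne \langle S \rangle_{\mathcal{C}}$.
$I(\mathcal{C})$ denotes the size of the largest minimal spanning set w.r.t.
$\mathcal{C}$.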
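The closure operator and minimal spanning sets can be explored by brute force on small classes. The following Python sketch uses the one-dimensional boxes over $\{1, \ldots, 5\}$ as a toy intersection-closed class; the function names are assumptions of this example, and the enumeration is exponential in $|X|$.

```python
from itertools import combinations

def closure(T, concepts):
    """Intersection of all concepts containing T (None if no concept contains T)."""
    supersets = [c for c in concepts if T <= c]
    return frozenset.intersection(*supersets) if supersets else None

def is_minimal_spanning(S, concepts):
    """True iff no proper subset of S has the same closure as S."""
    target = closure(S, concepts)
    return all(closure(frozenset(sub), concepts) != target
               for k in range(len(S)) for sub in combinations(S, k))

def largest_minimal_spanning(concepts, instances):
    """I(C): size of the largest minimal spanning set, by full enumeration."""
    return max(len(S)
               for k in range(len(instances) + 1)
               for S in map(frozenset, combinations(instances, k))
               if closure(S, concepts) is not None
               and is_minimal_spanning(S, concepts))

# Toy class: intervals [a, b] over {1, ..., 5}, empty whenever a > b.
X = range(1, 6)
intervals = {frozenset(range(a, b + 1)) for a in X for b in X}

print(sorted(closure(frozenset({2, 4}), intervals)))          # [2, 3, 4]
print(is_minimal_spanning(frozenset({2, 4}), intervals))      # True
print(is_minimal_spanning(frozenset({2, 3, 4}), intervals))   # False
print(largest_minimal_spanning(intervals, X))                 # 2
```

For intervals, the two endpoints of a set already span it, so no minimal spanning set has more than two elements.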
Theorem 1.11 ([34], [24]). Let $\mathcal{C}$ be an intersection-closed concept
class. Then every minimal spanning set w.r.t. $\mathcal{C}$ is shattered and
therefore
\[ I(\mathcal{C}) = \mathrm{VCdim}(\mathcal{C}). \]
Note that, for every $C \in \mathcal{C} \subseteq 2^{X}$,
$I(\mathcal{C}|_{C}) \le I(\mathcal{C})$, because each spanning set for a set
$T \subseteq C$ w.r.t. $\mathcal{C}$ is also a spanning set for $T$ w.r.t.
$\mathcal{C}|_{C}$. Spanning sets will be useful in proving upper bounds for the
complexity notions we introduce in later chapters.
Natarajan [34] gives an efficient learning algorithm with one-sided error for
intersection-closed classes, called the Closure algorithm. One-sided error
means that the hypothesis $h$ given by the learner errs only on instances in
$M = \{x \in X \mid C^{*}(x) = 1\}$, and only to a certain extent. In short,
this is because the Closure algorithm starts with the empty concept, ignores
negative examples and learns only from positive ones.
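The idea can be sketched in a few lines of Python. This is a batch-style illustration under the assumptions that the class is finite, intersection-closed, contains the empty concept, and that the examples are consistent with some target in the class; it is not Natarajan's original formulation, and the names are hypothetical.

```python
def closure_algorithm(examples, concepts):
    """Return the smallest concept containing all positive examples seen."""
    # Negative examples are ignored entirely.
    positives = frozenset(x for x, label in examples if label == 1)
    candidates = [c for c in concepts if positives <= c]
    # For an intersection-closed class this intersection is itself a concept.
    return frozenset.intersection(*candidates)

# Toy class: intervals [a, b] over {1, ..., 5}, empty whenever a > b.
X = range(1, 6)
intervals = {frozenset(range(a, b + 1)) for a in X for b in X}

examples = [(2, 1), (4, 1), (5, 0)]   # labeled according to the target {2, 3, 4}
print(sorted(closure_algorithm(examples, intervals)))   # [2, 3, 4]
print(sorted(closure_algorithm([(3, 1)], intervals)))   # [3]
```

Since the hypothesis is always contained in the target, it can only err on instances that the target labels positively, which is the one-sided error described above.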
Moreover, Natarajan [34] has shown that a class is learnable with one-sided
error if and only if it is intersection-closed and the VC-dimension of the class
grows polynomially (in the relevant parameters), making his Closure algorithm
a valuable tool in this case. Any class $\mathcal{C}$ that is not
intersection-closed can be embedded into a class $\mathcal{C}'$ that is
intersection-closed. However, the VC-dimension can grow drastically in the
process, making the Closure algorithm inefficient in the initial parameters.
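1.4.4 Nested Differences
The class of nested differences of depth $d$ (at most $d$) with concepts from
$\mathcal{C}$, denoted $\mathrm{DIFF}^{d}(\mathcal{C})$
($\mathrm{DIFF}^{\le d}(\mathcal{C})$, resp.), is defined inductively as follows:
\[ \mathrm{DIFF}^{1}(\mathcal{C}) := \mathcal{C}, \]
\[ \mathrm{DIFF}^{d}(\mathcal{C}) := \{C \setminus D \mid C \in \mathcal{C},\ D \in \mathrm{DIFF}^{d-1}(\mathcal{C})\}, \]
\[ \mathrm{DIFF}^{\le d}(\mathcal{C}) := \bigcup_{i=1}^{d} \mathrm{DIFF}^{i}(\mathcal{C}). \]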
Expanding the recursive definition of $\mathrm{DIFF}^{d}(\mathcal{C})$ shows
that, e.g., a set in $\mathrm{DIFF}^{4}(\mathcal{C})$ has the form
$C_1 \setminus (C_2 \setminus (C_3 \setminus C_4))$ where
$C_1, C_2, C_3, C_4 \in \mathcal{C}$. Figure 1.3 shows some constructions based
on axis-parallel rectangles.
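The inductive definition translates directly into code. The following Python sketch enumerates nested differences of a small finite class; the function names and the toy class of intervals are assumptions of this example.

```python
def nested_differences(concepts, depth):
    """Nested differences C_1 minus (C_2 minus (... minus C_depth)), all C_i in the class."""
    level = set(concepts)                      # depth 1 is the class itself
    for _ in range(depth - 1):
        level = {c - d for c in concepts for d in level}
    return level

def nested_differences_up_to(concepts, depth):
    """Union of the nested differences of depth 1, ..., depth."""
    return set().union(*(nested_differences(concepts, i)
                         for i in range(1, depth + 1)))

# Toy class: intervals [a, b] over {1, ..., 4}, empty whenever a > b.
X = range(1, 5)
intervals = {frozenset(range(a, b + 1)) for a in X for b in X}

hole = frozenset({1, 4})                        # [1, 4] minus [2, 3]
print(hole in nested_differences(intervals, 1))  # False
print(hole in nested_differences(intervals, 2))  # True
```

Depth two already produces sets such as an interval with a hole, which are not intervals themselves.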
[Figure 1.3: a) and b) illustrate the nested differences $C_2 \setminus C_1$ and
$C_3 \setminus (C_2 \setminus C_1)$, respectively; c) shows a more complex
scenario.]
Concepts in $\mathrm{DIFF}^{2}(\mathcal{C})$ were shown to be learnable at an
early stage of research on computational learning theory by Kearns et al. [27].
Regarding arbitrary depth, Helmbold et al. [24] considered nested differences of
intersection-closed classes. They developed several algorithms, which they call
inclusion/exclusion algorithms, each with different properties. The Total
Recall algorithm, a batch algorithm, remembers all examples seen during
learning $\mathrm{DIFF}^{\le d}(\mathcal{C})$. The Space Efficient algorithm is
an online algorithm and stores only VC-dimension many examples during
execution. Both use the Closure algorithm iteratively as a subroutine. A
central result is the following:
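Lemma 1.12 ([24]). If $\mathcal{C}$ is intersection-closed and $d \ge 1$, then
it holds that
\[ \mathrm{VCdim}(\mathrm{DIFF}^{d}(\mathcal{C})) \le \mathrm{VCdim}(\mathrm{DIFF}^{\le d}(\mathcal{C})) \le d \cdot \mathrm{VCdim}(\mathcal{C}). \]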
Further, Helmbold et al. [24] observed: if both $\mathcal{C}_1$ and
$\mathcal{C}_2$ are intersection-closed, then
$\mathcal{C}_1 \wedge \mathcal{C}_2 := \{C_1 \cap C_2 \mid C_i \in \mathcal{C}_i\}$
is intersection-closed as well. The same result holds for the intersection
$\mathcal{C}_1 \cap \mathcal{C}_2 := \{C \mid C \in \mathcal{C}_1 \text{ and } C \in \mathcal{C}_2\}$.
Substituting intersection-closed by union-closed leads to dual results for
unions of classes.
For the next result we need the notion of the universal concept, also coined by
Helmbold et al. [24], which is simply the concept equal to the full domain.
Assuming that the universal concept exists in a class is only a minor fix, since
adding it to a concept class does not destroy the property of being
intersection-closed.
Now, based upon these properties and the use of minimal spanning sets, they
have proved several relationships between unions and intersections of classes
and nested differences of those.
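Lemma 1.13 ([24]). Let $\mathcal{C}_1, \ldots, \mathcal{C}_r$ be arbitrary
concept classes, each including the universal concept. Then
• $\bigcup_{j=1}^{r} \mathcal{C}_j$ is a subclass of $\bigwedge_{j=1}^{r} \mathcal{C}_j$,
• $\mathrm{VCdim}(\bigcup_{j=1}^{r} \mathcal{C}_j) \le \mathrm{VCdim}(\bigwedge_{j=1}^{r} \mathcal{C}_j) \le \sum_{j=1}^{r} \mathrm{VCdim}(\mathcal{C}_j)$,
• $\mathrm{DIFF}^{\le d}(\bigcup_{j=1}^{r} \mathcal{C}_j)$ is a subclass of $\mathrm{DIFF}^{\le d}(\bigwedge_{j=1}^{r} \mathcal{C}_j)$,
• $\mathrm{VCdim}(\mathrm{DIFF}^{\le d}(\bigcup_{j=1}^{r} \mathcal{C}_j)) \le \mathrm{VCdim}(\mathrm{DIFF}^{\le d}(\bigwedge_{j=1}^{r} \mathcal{C}_j)) \le d \sum_{j=1}^{r} \mathrm{VCdim}(\mathcal{C}_j)$.
This way they were able to extend their efficient algorithms to
$\mathrm{DIFF}^{\le d}(\bigcup_{j=1}^{r} \mathcal{C}_j)$. We will reuse some of
their techniques in Chapter 3.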
1.5 Online-Learning Model
The online-learning model was introduced by Littlestone [31] and further
examined in [22; 18; 7]. Here, instead of learning from random labeled
examples, the learner is faced with a sequence of instances presented one by
one. The task is to predict the label of the current instance, and after seeing
the correct label the learner can adjust his hypothesis. A natural measure of
complexity in this model is the number of classification mistakes a learner or
algorithm makes during the whole process.
Definition 1.14 ([31]). For any online-learning algorithm $A$ and any target
concept $C$, let $M_{A}(C)$ be the maximum number of mistakes $A$ makes over all
possible sequences of instances. For any $A$ and any concept class
$\mathcal{C}$ we define $M_{A}(\mathcal{C}) = \max_{C \in \mathcal{C}} M_{A}(C)$
and call any such number a mistake bound. The optimal mistake bound
$M_{\mathrm{opt}}(\mathcal{C})$ for a concept class $\mathcal{C}$ is the minimum
over all possible learning algorithms $A$ of $M_{A}(\mathcal{C})$.
The halving algorithm, given by Littlestone [31], is a simple strategy that can be applied to any finite concept class $\mathcal{C}$. Given an instance $x \in X$, the algorithm compares the cardinalities $|\mathrm{Cons}(\{(x,0)\}, \mathcal{C})|$ and $|\mathrm{Cons}(\{(x,1)\}, \mathcal{C})|$. It then predicts for $x$ the label of the larger version space and, after receiving the actual label, proceeds with the remaining concept class $\mathcal{C}'$, which is the version space that matches this label. Therefore, in case of a mistake at least half of the concept class is discarded, which gave the halving algorithm its name. These observations immediately imply the following theorem.
Theorem 1.15 ([31]). For any finite concept class $\mathcal{C}$, $M_{Halving}(\mathcal{C}) \le \log |\mathcal{C}|$.
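The halving strategy can be sketched in a few lines. The following is a minimal illustration, not Littlestone's original formulation: concepts are encoded as frozensets of positively labeled instances, ties are broken in favor of label 1, and the function name `halving_mistakes` is chosen here for illustration only.

```python
import math
from itertools import combinations

def halving_mistakes(concept_class, target, sequence):
    """Count the mistakes of the halving algorithm on one instance sequence.

    Concepts are frozensets of positively labelled instances (assumed encoding).
    """
    version_space = set(concept_class)
    mistakes = 0
    for x in sequence:
        positive = {c for c in version_space if x in c}   # Cons({(x, 1)}, C)
        negative = version_space - positive               # Cons({(x, 0)}, C)
        prediction = 1 if len(positive) >= len(negative) else 0  # ties -> 1
        label = 1 if x in target else 0
        if prediction != label:
            mistakes += 1   # the surviving version space is at most half as large
        version_space = positive if label == 1 else negative
    return mistakes

# Example: all subsets of a 4-element domain, so |C| = 16 and log2|C| = 4.
domain = [0, 1, 2, 3]
power_set = [frozenset(s) for k in range(5) for s in combinations(domain, k)]
worst = max(halving_mistakes(power_set, c, domain) for c in power_set)
print(worst, "<=", math.log2(len(power_set)))
```

For the power set the bound of Theorem 1.15 is attained, since every instance splits the current version space exactly in half.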
However, many classes are known to admit algorithms with better, polynomially sized mistake bounds, e.g. disjunctions, conjunctions, k-CNF and k-term DNF. A second algorithm for finite classes given by Littlestone [31] is the standard optimal algorithm. It does not merely perform better than the halving algorithm for special classes; it is also proven to be optimal, i.e. it meets the optimal mistake bound.
Theorem 1.16 ([31]). For any concept class $\mathcal{C}$, $\mathrm{VCdim}(\mathcal{C}) \le M_{opt}(\mathcal{C})$.
The proof of Theorem 1.16 is rather obvious: for a sequence starting with a shattered set of size $d$, the labels can be chosen such that at least $d$ mistakes are made. Summing up, the complexity notions introduced so far relate as follows.
Lemma 1.17 ([31]). For arbitrary concept classes $\mathcal{C}$ it holds that
$\mathrm{VCdim}(\mathcal{C}) \le M_{opt}(\mathcal{C}) \le M_{Halving}(\mathcal{C}) \le \log(|\mathcal{C}|)$
Finally, a well-known fact relates the PAC model and the online-learning model.
Theorem 1.18 ([31]). If some algorithm $A$ learns a concept class $\mathcal{C}$ in the online-learning model, then $A$ can be used to learn in the PAC model. Moreover, the deduced PAC algorithm learns efficiently if it is provided with
$\frac{M_A(\mathcal{C})}{\varepsilon} \ln \frac{M_A(\mathcal{C})}{\delta}$
many examples.
In contrast, Blum [8] has shown that there exist classes that are efficiently learnable in the Valiant model but not in the online-learning model. Non-efficiency in the latter model means that the learner does not run in polynomial time.
1.6 Partial Equivalence Query Model
In their work, Maass and Turán [33] analyzed some variants of the PAC and online-learning models. Among others, they studied the relations between previously introduced query types such as partial equivalence queries. In the partial equivalence query model the learner can formulate a so-called partial hypothesis $h \in \{0,1,*\}^{|X|}$. The "$*$" can be interpreted as "don't care". The oracle compares the predicted labels with the real labels of the target but neglects the instances marked with $*$. If the hypothesis is consistent with the target on all other instances, the oracle returns "yes" (or "true"); otherwise it provides a counterexample.
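The behavior of the oracle can be made concrete with a small sketch. It assumes that hypothesis and target are given as sequences indexed by the instances of X and that the oracle reports the first disagreement it finds; which counterexample is returned is an arbitrary choice made here, and the function name is hypothetical.

```python
def partial_equivalence_query(hypothesis, target):
    """Oracle for a partial hypothesis over {0, 1, '*'}.

    Both arguments are sequences indexed by the instances of X; '*' entries of
    the hypothesis are ignored. Returns ("yes", None) or ("no", x), where x is
    an instance on which the hypothesis is wrong.
    """
    for x, (h, t) in enumerate(zip(hypothesis, target)):
        if h != "*" and h != t:
            return ("no", x)      # first disagreement (an arbitrary choice)
    return ("yes", None)

target = (1, 0, 1, 1)
first = partial_equivalence_query((1, "*", "*", 1), target)   # correct where specified
second = partial_equivalence_query((1, 1, "*", 0), target)    # wrong on instances 1 and 3
print(first, second)
```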
The learning complexity LC(A) of a specific algorithm A in the partial equivalence model is defined as the worst-case number of queries A asks before identifying the target concept $C^*$ exactly. Accordingly, the learning complexity of a concept class, denoted by LC-PARTIAL($\mathcal{C}$), is the minimum of LC(A) over all possible algorithms.
There are only a few things we need to know about LC-PARTIAL, since we will not make much use of it. One remarkable property is its relation to the optimal mistake bound.
Theorem 1.19 ([33]). Let $\mathcal{C}$ be an arbitrary concept class. Then,
$\mathrm{LC\text{-}PARTIAL}(\mathcal{C}) \le M_{opt}(\mathcal{C})$.
Unfortunately, LC-PARTIAL is incomparable to the VC-dimension, which was also proven by Maass and Turán [33]. Up to this point, we have
$\mathrm{LC\text{-}PARTIAL}(\mathcal{C}) \le M_{opt}(\mathcal{C}) \le \log(|\mathcal{C}|)$
by utilizing Lemma 1.17.
1.7 Self-directed Learning
The self-directed learning model was introduced by Goldman et al. [22] and later studied in [21;5;28]. It is very similar to the online-learning model, in which a sequence of instances has to be labeled by the learner and he is charged for every mistake he makes. But instead of an adversary presenting the instances in a possibly malicious order, in each trial the learner chooses the instance himself. This selection can be based on all the data the learner has seen so far and has to be done in polynomial time. After the learner passes the instance $x$ and his prediction to the oracle, the latter returns the true label $C^*(x)$ to the learner.
Definition 1.20 ([22]). The self-directed learning complexity of a concept class $\mathcal{C}$, denoted by SDC($\mathcal{C}$), is defined as the smallest number $q$ such that there is some self-directed learning algorithm which can exactly identify any concept $C \in \mathcal{C}$ without making more than $q$ mistakes.
Figure 1.4: The self-directed learning model. [Diagram: the learner sends an instance $x \in X$ and a prediction $b$ to the oracle; the oracle, which knows the target concept $C^*$ from the concept class, provides the true label $C^*(x)$.]
The SDC is known for many common classes, e.g. the self-directed learning complexity of
• classes of VC-dimension 1 is 1
• monotone monomials over n variables is 1
• monomials over n variables is 2
• $\mathrm{BOX}^d_n$ is 2
• m-term monotone DNF is smaller than m
The first result is due to Kuhlmann [28], all others are given by Goldman and Sloan [21].
It is easy to see that the VC-dimension can be much larger than the SDC. Conversely, Ben-David and Eiron [5] showed that for any n and d there exists a concept class $\mathcal{C}^d_n$ with $\mathrm{SDC}(\mathcal{C}^d_n)$ arbitrarily larger than $\mathrm{VCdim}(\mathcal{C}^d_n)$. Thus, there is no general relation between the VC-dimension and the self-directed learning complexity.
Goldman and Sloan [21] conjectured that the self-directed learning complexity is bounded by the VC-dimension at least in the case of intersection-closed classes. But this was disproved first by Ben-David and Eiron [5], who constructed a concept class with $\frac{3}{2}\mathrm{VCdim}(\mathcal{C}) = \mathrm{SDC}(\mathcal{C})$. Later, Kuhlmann [28] found a class with an arbitrarily large gap between the two notions.
Finally, the self-directed learning complexity fits nicely into the chain of inequalities relating the notions presented so far.
Theorem 1.21 ([21;33;31]).
$\mathrm{SDC}(\mathcal{C}) \le \mathrm{LC\text{-}PARTIAL}(\mathcal{C}) \le M_{opt}(\mathcal{C}) \le \log(|\mathcal{C}|)$
1.8 Sample Compression and its Relation to PAC Learning
Another approach to learning concepts from a concept class is the notion of sample compression schemes. These are often used internally in generic algorithms, for instance when learning axis-aligned rectangles. Given any sample set with both positive and negative examples, the straightforward strategy for this class is to pick the leftmost, rightmost, top and bottom positive points and to build the minimal rectangle including those. This way the whole sample set is reduced to a minimal subset which still allows reconstructing the label of any point from the original sample set. This is the fundamental idea behind sample compression.
Definition 1.22 (sample compression scheme [32]). A sample compression scheme of size $k$ consists of a compression function $f$ and a reconstruction function $g$. The compression function $f$ maps every labeled sample $S$ to a subset of size at most $k$, called compression set. The reconstruction function $g$ maps the compression set to a hypothesis $h \subseteq X$. Furthermore, these must fulfill
$\forall (x,l) \in S: g(f(S))(x) = l$,
i.e. $h$ must be consistent with the sample set.
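The rectangle strategy described before Definition 1.22 gives a compression scheme of size 4. The sketch below is one possible rendering under the assumption that points are (x, y) tuples and that the sample is labeled by some axis-aligned rectangle; the function names are chosen here for illustration.

```python
def compress_rectangle(sample):
    """f: keep the leftmost, rightmost, bottom and top positive examples (at most 4)."""
    positives = [p for p, label in sample if label == 1]
    if not positives:
        return []
    extremes = [
        min(positives, key=lambda p: p[0]),   # leftmost
        max(positives, key=lambda p: p[0]),   # rightmost
        min(positives, key=lambda p: p[1]),   # bottom
        max(positives, key=lambda p: p[1]),   # top
    ]
    return [(p, 1) for p in dict.fromkeys(extremes)]   # drop duplicates, keep order

def reconstruct_rectangle(compression_set):
    """g: hypothesis = membership in the smallest rectangle enclosing the kept points."""
    if not compression_set:
        return lambda p: 0
    xs = [p[0] for p, _ in compression_set]
    ys = [p[1] for p, _ in compression_set]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    return lambda p: int(x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)

sample = [((1, 2), 1), ((4, 1), 1), ((2, 3), 1), ((3, 2), 1),
          ((0, 0), 0), ((5, 2), 0), ((2, 4), 0)]
kept = compress_rectangle(sample)
h = reconstruct_rectangle(kept)
print(kept, all(h(p) == label for p, label in sample))
```

Consistency holds because the smallest rectangle enclosing the positive examples is contained in the target rectangle, so no negative example can fall inside it.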
Example 1.23. Unions of n intervals can be compressed and decompressed as follows:
Sweep the line from left to right. Save the first upcoming positive example and its first following negative example. This pair represents the first interval. Simply proceed in the same way for the remaining $n-1$ intervals. The label of a given unlabeled point $x$ corresponds to the label of its left neighboring point in the compression set. Figure 1.5 a) shows a union of two intervals. The compression set is highlighted by bold letters.
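The sweep of Example 1.23 amounts to keeping exactly those sample points at which the label flips. The following sketch assumes points on the real line with distinct coordinates and treats everything to the left of the first kept point as negative; the labeling used in the demonstration is an illustrative union of two intervals over ten points, and the function names are hypothetical.

```python
from bisect import bisect_right

def compress_intervals(sample):
    """Sweep left to right and keep every example at which the label flips."""
    kept, previous = [], 0        # left of all sample points the label is taken as 0
    for point, label in sorted(sample):
        if label != previous:
            kept.append((point, label))
            previous = label
    return kept

def reconstruct_intervals(kept):
    """Label x like its left neighbour in the compression set (0 if there is none)."""
    points = [p for p, _ in kept]
    def hypothesis(x):
        i = bisect_right(points, x)          # number of kept points <= x
        return kept[i - 1][1] if i else 0
    return hypothesis

labels = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]      # x_1, ..., x_10
sample = list(zip(range(1, 11), labels))
kept = compress_intervals(sample)
h = reconstruct_intervals(kept)
print(kept, all(h(x) == label for x, label in sample))
```

For a union of n intervals at most 2n points are kept, one positive and one negative example per interval.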
Figure 1.5: a) A union of two intervals over $X = \{x_1, \ldots, x_{10}\}$; b) A positive halfspace representing the labeling of $X = \{x_1, \ldots, x_8\}$. [Panel a): the points $x_1, \ldots, x_{10}$ carry the labels 0, 0, 1, 1, 0, 1, 1, 1, 0, 0.]
Example 1.24. Consider a finite domain $X \subset \mathbb{R}^2$ whose members are in general position (i.e. any subset $S \subseteq X$ contains at most two collinear points) and the class $\mathcal{C}$ of positive halfspaces over $X$. Then any sample set $S$ labeled according to some concept from $\mathcal{C}$ can be compressed to a pair of points, one labeled positive and one labeled negative. Afterwards any unlabeled point from $S$ can be labeled correctly by its relative position to the induced hyperplane. Figure 1.5 b) illustrates this for $X = \{x_1, \ldots, x_8\}$.
Littlestone and Warmuth [32] have shown that a sample compression scheme can be used for learning. Given a sample, an algorithm or a learner can use a sample compression scheme built by himself to construct a hypothesis $g(f(S))$, as indicated by the examples above. They have also shown that the existence of such a scheme is sufficient for learnability and proved an analogue of Theorem 1.5. Later, Floyd and Warmuth [16] proved a slightly better upper bound on the sample size, as stated below.
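Theorem 1.25 ([16]). Let $\mathcal{C} \subseteq 2^X$ be any concept class with a sample compression scheme of size at most $k$. Then for $0 < \varepsilon, \delta < 1$, the learning algorithm using this scheme learns $\mathcal{C}$ with sample size
$m \ge \frac{1}{1-\beta}\left(\frac{1}{\varepsilon}\ln\frac{1}{\delta} + k + \frac{k}{\varepsilon}\ln\frac{1}{\beta\varepsilon}\right)$
for any $0 < \beta < 1$.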
The sample size can further be optimized in the choice of $\beta$, but only with marginal effects. Obviously, the bound is asymptotically equivalent to the one given in Theorem 1.5. However, using sample compression schemes can result in better upper bounds for certain concept classes, as stated by Floyd and Warmuth [16], e.g. halfspaces in the plane with the instances in $X$ in general position. These have VC-dimension 3, but there exist sample compression schemes of size 2.
For arbitrary concept classes, Floyd and Warmuth [16] gave the One-Pass algorithm, which utilizes a mistake-driven algorithm P to construct a sample compression scheme whose size is bounded by $M_{opt}(\mathcal{C})$. The main idea of the algorithm is to compress the sample to those examples on which the online learner makes a mistake. Together with a default ordering of the instances, the labels of the original sample can then be restored. Due to Theorem 1.15, this strategy leads to the following upper bound for arbitrary finite classes.
Theorem 1.26 ([16]). Let $\mathcal{C} \subseteq 2^X$ be any finite concept class. Then the One-Pass Compression Scheme using the Halving algorithm gives a sample compression scheme of size at most $\log |\mathcal{C}|$.
For well-structured classes, this bound can be undercut dramatically, especially for maximum classes.
Theorem 1.27 ([16]). Let $\mathcal{C} \subseteq 2^X$ be a maximum concept class of VC-dimension $d$ on a finite domain $X$ with $|X| = n \ge d$. Then for each concept $C \in \mathcal{C}$, there is a compression set $A \subseteq X \times \{0,1\}$ of exactly $d$ elements such that $C$ is the only consistent concept.
More recently, Kuzmin and Warmuth [30] introduced the notion of unlabeled compression schemes.
Definition 1.28. An unlabeled compression scheme for a maximum class $\mathcal{C}$ of VC-dimension $d$ is given by an injective mapping $r$ that assigns to every concept $C \in \mathcal{C}$ a set $r(C) \subseteq X$ of size at most $d$ such that the following condition is satisfied:
$\forall C, C' \in \mathcal{C}\ (C \ne C'),\ \exists x \in r(C) \cup r(C'): C(x) \ne C'(x).$ (1.1)
(1.1) is referred to as the non-clashing property. In order to ease notation, we add the following technical definitions. A representation mapping of order $k$ for a (not necessarily maximum) class $\mathcal{C}$ is any injective mapping $r$ that assigns to every concept $C$ a set $r(C) \subseteq X$ of size at most $k$ such that (1.1) holds.
Recursive Tail Matching Algorithm match
Input: a maximum class $\mathcal{C}$
Output: a representation mapping $r$ for $\mathcal{C}$
if $\mathrm{VCdim}(\mathcal{C}) = 0$ (i.e. $\mathcal{C} = \{C\}$) then
    $r(C) := \emptyset$
else
    pick $x \in \mathrm{dom}(\mathcal{C})$ randomly and set $r := \mathrm{match}(\mathcal{C}^x)$
    extend that mapping to $0\mathcal{C}^x \cup 1\mathcal{C}^x$:
        $\forall C \in \mathcal{C}^x$: $r(C \cup \{x = 0\}) := r(C)$ and $r(C \cup \{x = 1\}) := r(C) \cup \{x\}$
    extend $r$ to $\mathrm{tail}_x(\mathcal{C})$
end if
return $r$
Figure 1.6: The recursive tail matching algorithm for constructing an unlabeled compression scheme for a maximum class.
A representation mapping $r$ is said to have the acyclic non-clashing property if there is an ordering $C_1, \ldots, C_N$ of the concepts in $\mathcal{C}$ such that
$\forall\, 1 \le i < j \le N,\ \exists x \in r(C_i): C_i(x) \ne C_j(x).$ (1.2)
Considering maximum classes, it was shown [30] that a representation mapping with the non-clashing property guarantees that, for every sample $S$ labeled according to a concept from $\mathcal{C}$, there is exactly one concept $C \in \mathcal{C}$ that is consistent with $S$ and satisfies $r(C) \subseteq X(S)$. This allows one to encode (compress) a labeled sample $S$ by $r(C)$ and, since $r$ is injective, to decode (decompress) $r(C)$ by $C$ (so that the labels in $S$ can be reconstructed). This coined the term "unlabeled compression scheme".
An inspection of the work by Kuzmin and Warmuth [30] reveals that the unlabeled compression scheme obtained by the tail matching algorithm has the acyclic non-clashing property. The main ingredient of the algorithm is the decomposition of $\mathcal{C}$ into a disjoint union of three parts:
$\mathcal{C} = 0\mathcal{C}^x \,\dot\cup\, 1\mathcal{C}^x \,\dot\cup\, \mathrm{tail}_x(\mathcal{C})$
The term $\mathcal{C}^x$ is called the reduction of $\mathcal{C}$ w.r.t. $x$. It consists of all concepts of $\mathcal{C}|_{X \setminus \{x\}}$ for which both extensions in $x$ exist in $\mathcal{C}$. The tail of a concept class consists of those concepts of $\mathcal{C}|_{X \setminus \{x\}}$ for which only one of the two extensions is included in $\mathcal{C}$. The notation $a\mathcal{C}^x$ for $a \in \{0,1\}$ is a shorthand for the set of concepts of $\mathcal{C}^x$ extended by the label $a$ on $x$.
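The decomposition into reduction and tail can be computed directly from the definitions. The sketch below assumes that concepts are 0/1 tuples indexed by the instances and that x is a coordinate index; it returns the tail as concepts of the original class (the unique existing extensions), which is the form used in the disjoint union. The function name is chosen here for illustration.

```python
def split_by_instance(concept_class, x):
    """Return (reduction C^x, 0C^x, 1C^x, tail_x(C)) for coordinate index x."""
    concepts = set(concept_class)

    def restrict(c):                 # drop coordinate x
        return c[:x] + c[x + 1:]

    def extend(r, a):                # re-insert coordinate x with label a
        return r[:x] + (a,) + r[x:]

    restrictions = {restrict(c) for c in concepts}
    reduction = {r for r in restrictions
                 if extend(r, 0) in concepts and extend(r, 1) in concepts}
    zero_part = {extend(r, 0) for r in reduction}
    one_part = {extend(r, 1) for r in reduction}
    tail = concepts - zero_part - one_part      # exactly one extension exists
    return reduction, zero_part, one_part, tail

# Maximum class of VC-dimension 1 over three instances: at most one positive label.
cls = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)}
reduction, zero_part, one_part, tail = split_by_instance(cls, 0)
print(reduction, zero_part, one_part, tail)
```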
The tail matching algorithm, given in Figure 1.6, builds the unlabeled compression sets $r(C)$ for each concept $C$ recursively, depending on which of these classes $C$ belongs to during the different stages of the recursion. If $C \in 0\mathcal{C}^x$, it leaves $r$ untouched, whereas for any $C \in 1\mathcal{C}^x$ it adds $x$ to $r(C)$. For the tail a more complex routine is needed, which is omitted here for the sake of simplicity. At its heart, it is a mapping onto the forbidden labels of $\mathcal{C}^x$. This way clashes between the elements of the reduction and the tail are avoided.
In nearly the same manner one can find an acyclic ordering of $\mathcal{C}^x$ which, together with the representation mapping, fulfills the acyclic non-clashing property. First of all, it holds that
$\forall C \in 1\mathcal{C}^x,\ C' \in 0\mathcal{C}^x: C(x) \ne C'(x)$
since $x \in C$ for any $C \in 1\mathcal{C}^x$ and $x \notin C$ for $C \in 0\mathcal{C}^x$. Since the compression sets of the elements of the tail correspond to forbidden labels of $\mathcal{C}^x$, it holds that
$\forall C \in 1\mathcal{C}^x,\ C' \in \mathrm{tail}_x(\mathcal{C})\ \exists y \in r(C'): C(y) \ne C'(y).$
Of course this holds for concepts in $0\mathcal{C}^x$ as well. Overall, the acyclic ordering of the concepts is induced (recursively) by the fact that for $C \in \mathrm{tail}_x(\mathcal{C})$, $C' \in 1\mathcal{C}^x$ and $C'' \in 0\mathcal{C}^x$ it holds that $C < C' < C''$.
The main result of Kuzmin and Warmuth [30] is the existence of unlabeled compression schemes for maximum classes whose size equals the VC-dimension, condensed in the following theorem.
Theorem 1.29 ([30]). Let $\mathcal{C}$ be a maximum class of VC-dimension $d$. Then the recursive tail matching algorithm returns a representation mapping of order at most $d$.
Another strategy for building a compression scheme is based on the one-inclusion graph. Below, a concept class $\mathcal{C}$ over a domain $X$ of size $n$ is identified with a subset of $\{0,1\}^n$.
Definition 1.30. The one-inclusion graph $G(\mathcal{C})$ associated with $\mathcal{C}$ is defined as follows:
• The nodes are the concepts from $\mathcal{C}$.
• Two concepts are connected by an edge if and only if they differ in exactly one coordinate (when viewed as nodes in the Boolean cube).
A cube $\mathcal{C}'$ in $\mathcal{C}$ is a subcube of $\{0,1\}^n$ such that every node in $\mathcal{C}'$ represents a concept from $\mathcal{C}$.
In the context of the one-inclusion graph, the instances (corresponding to the dimensions in the Boolean cube) are usually called "colors" (and an edge along dimension $i$ is viewed as having color $i$). For a concept $C \in \mathcal{C}$, $I(C, G(\mathcal{C}))$ denotes the union of the instances associated with the colors of the incident edges of $C$ in $G(\mathcal{C})$, called incident instances of $C$. The degree of $C$ in $G(\mathcal{C})$ is denoted by $\deg_{G(\mathcal{C})}(C)$. Recall that the density of a graph with $m$ edges and $n$ nodes is defined as $m/n$. As shown by Haussler et al. [23, Lemma 2.4], the density of the one-inclusion graph is a lower bound on the VC-dimension, i.e., $\mathrm{dens}(G(\mathcal{C})) < \mathrm{VCdim}(\mathcal{C})$.
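For small classes the density bound can be checked by brute force. The sketch below assumes concepts are 0/1 tuples of equal length; the helper names are chosen here for illustration, and the VC-dimension routine is an exhaustive search that is only practical for tiny domains.

```python
from itertools import combinations

def one_inclusion_edges(concepts):
    """Edges of G(C): pairs of concepts differing in exactly one coordinate."""
    nodes = sorted(set(concepts))
    return [(c, d) for c, d in combinations(nodes, 2)
            if sum(a != b for a, b in zip(c, d)) == 1]

def density(concepts):
    return len(one_inclusion_edges(concepts)) / len(set(concepts))

def vc_dimension(concepts):
    """Largest k such that some set of k coordinates is shattered (brute force)."""
    nodes = set(concepts)
    n = len(next(iter(nodes)))
    d = 0
    for k in range(1, n + 1):
        if not any(len({tuple(c[i] for i in idx) for c in nodes}) == 2 ** k
                   for idx in combinations(range(n), k)):
            break          # no shattered set of size k, hence none of larger size
        d = k
    return d

singletons_and_empty = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
cube = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
for cls in (singletons_and_empty, cube):
    print(density(cls), "<", vc_dimension(cls))
```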
The following definitions were introduced by Rubinstein and Rubinstein [37]:
Definition 1.31. A corner-peeling plan for $\mathcal{C}$ is a sequence
$P = ((C_1, \mathcal{C}'_1), \ldots, (C_N, \mathcal{C}'_N))$ (1.3)
with the following properties:
1. $N = |\mathcal{C}|$ and $\mathcal{C} = \{C_1, \ldots, C_N\}$.
2. For all $t = 1, \ldots, N$, $\mathcal{C}'_t$ is a cube in $\{C_t, \ldots, C_N\}$ which contains $C_t$ and all its neighbors in $G(\{C_t, \ldots, C_N\})$. (Note that this uniquely specifies $\mathcal{C}'_t$.)
The nodes $C_t$ are called the corners of the cubes $\mathcal{C}'_t$, respectively. The dimension of the largest cube among $\mathcal{C}'_1, \ldots, \mathcal{C}'_N$ is called the order of the corner-peeling plan $P$. $\mathcal{C}$ can be $d$-corner-peeled if there exists a corner-peeling plan of order $d$.
$\mathcal{C}$ is called shortest-path closed if, for every pair of distinct concepts $C, C' \in \mathcal{C}$, $G(\mathcal{C})$ contains a path of length $H(C, C')$ that connects $C$ and $C'$, where $H$ denotes the Hamming distance. Rubinstein and Rubinstein [37] showed the following:
Lemma 1.32 ([37]). If a maximum class $\mathcal{C}$ has a corner-peeling plan (1.3) of order $\mathrm{VCdim}(\mathcal{C})$, then an unlabeled compression scheme for $\mathcal{C}$ is obtained by setting $r(C_t)$ equal to the set of colors in cube $\mathcal{C}'_t$ for $t = 1, \ldots, N$.
Theorem 1.33 ([37]). Every maximum class $\mathcal{C}$ can be $\mathrm{VCdim}(\mathcal{C})$-corner-peeled.
The key element of their proof is that corner-peeling preserves shortest-path closedness, which is used to show the uniqueness of the colors of the particular cubes $\mathcal{C}'_t$ during peeling.
Although it was known before [30] that any maximum class has an unlabeled compression scheme of size $\mathrm{VCdim}(\mathcal{C})$, the scheme resulting from corner-peeling has some very special and nice features, e.g. having the acyclic non-clashing property. Thus, the corner-peeling technique will come in handy in Chapter 3. All the previous results lead to the following fundamental question.
Conjecture 1.34 ([32]). Any concept class $\mathcal{C} \subseteq 2^X$ possesses a sample compression scheme of size $O(\mathrm{VCdim}(\mathcal{C}))$.
The sample compression conjecture has been open for almost three decades now, although it has been answered positively for special classes. Besides the aforementioned maximum classes, Helmbold et al. [24] give mistake-bound algorithms for nested differences of constant depth of intersection-closed classes whose mistake bounds are in $O(\mathrm{VCdim}(\mathcal{C}))$. Together with the above-mentioned One-Pass algorithm given by Floyd and Warmuth [16], it follows that there is also a sample compression scheme of size linear in the VC-dimension. The only bound involving the VC-dimension known to date is a lower bound:
Theorem 1.35 ([16]). For an arbitrary concept class $\mathcal{C}$ of VC-dimension $d$, there is no sample compression scheme of size at most $d/5$ for sample sets of size at least $d$.
2 The Teaching Dimension
The following chapter is concerned with the so-called teacher-directed learning model and its associated complexity parameter, the teaching dimension. As the name suggests, the oracle from the first chapter is replaced by a teacher who guides the learner through the learning process by presenting helpful examples. However, there is no exact definition of the term "helpful" in general. Although the model and the teaching dimension are quite reasonable, there are some stringent restrictions on the possible behavior of the learner which limit the teacher in his "helpfulness". A goal of the later chapters will be to relax these rules and improve the sample complexity drastically for special classes.
Additionally, teacher and learner are not allowed to use coding tricks. What are coding tricks? To explain this, consider the following situation: teacher and learner agree on some ordering $\mathcal{C} = (C_1, \ldots, C_m)$. If the teacher has to teach the concept $C_i$ and $X$ is sufficiently large, he can simply encode the index $i$ in a single example by sending the $i$-th instance with an arbitrary label (preferably consistent with $C_i$ to obscure their fraud). The learner takes this hint and outputs $C_i$ as his hypothesis. Thus, learning took only a single example. Obviously, this is not what learning is meant to be. Coding tricks are often also called "collusions" and have been studied by Angluin and Krikis [3], Ott and Stephan [35] and Goldman and Mathias [20].
The strategy presented in this chapter does not involve coding tricks. Simply put, the teacher presents the learner a sample $S$ that is consistent only with the target concept $C^*$. Such a sample is called a teaching set for $C^*$.
Figure 2.1: The teacher-directed learning model. The resulting hypothesis needs to be exactly the target concept. [Diagram: the teacher, who knows the concept class and the target concept $C^*$, passes a well-chosen sample set to the learner, who knows the concept class and outputs a hypothesis.]
The teaching dimension and its related notions are very well researched [41;19;20;5;47;48]. Although most of the results we present in this chapter were elaborated more than 25 years ago, they are the fundamental basis for our work. Besides summarizing results of prior studies, we give new results for shortest-path closed classes at the end of the chapter that are contained in Doliwa et al. [12].
2.1 Deﬁnitions
The teaching model was independently introduced by Goldman and Kearns [19] and Shinohara and Miyano [41]. In contrast to the previous models, the learner does not get random or self-chosen examples. Instead, examples are chosen by a helpful teacher in such a way that he can ensure that the learner will identify the correct concept. A teaching set for a concept $C \in \mathcal{C}$ is an unordered labeled set $S$ such that solely $C$ is consistent with $S$, i.e. $\mathrm{Cons}(S, \mathcal{C}) = \{C\}$. The family $\mathcal{TS}(C, \mathcal{C})$ is the set of all teaching sets for $C$, and $\mathrm{TS}(C, \mathcal{C})$ denotes the size of a smallest teaching set. The following complexity notions are well known and have been explored in numerous works:
$\mathrm{TS}_{min}(\mathcal{C}) := \min_{C \in \mathcal{C}} \mathrm{TS}(C, \mathcal{C})$
$\mathrm{TS}_{max}(\mathcal{C}) := \max_{C \in \mathcal{C}} \mathrm{TS}(C, \mathcal{C})$
$\mathrm{TS}_{avg}(\mathcal{C}) := \frac{1}{|\mathcal{C}|} \sum_{C \in \mathcal{C}} \mathrm{TS}(C, \mathcal{C})$
The term $\mathrm{TS}_{max}(\mathcal{C}) =: \mathrm{TD}(\mathcal{C})$ is also called the teaching dimension. The quantity $\mathrm{TS}_{min}(\mathcal{C})$ is known as the minimum teaching dimension of $\mathcal{C}$, and $\mathrm{TS}_{avg}(\mathcal{C})$ is called the average-case teaching dimension of $\mathcal{C}$. Obviously,
$\mathrm{TS}_{min}(\mathcal{C}) \le \mathrm{TS}_{avg}(\mathcal{C}) \le \mathrm{TS}_{max}(\mathcal{C}) = \mathrm{TD}(\mathcal{C}).$
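For small finite classes the three quantities can be computed by exhaustive search. The following sketch assumes concepts are 0/1 tuples over a common finite domain; the function names are chosen here for illustration and the search is exponential in the domain size.

```python
from fractions import Fraction
from itertools import combinations

def ts(concept, concept_class):
    """Size of a smallest teaching set for `concept` within the class."""
    others = [c for c in set(concept_class) if c != concept]
    n = len(concept)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            # The labels of `concept` on idx form a teaching set iff every other
            # concept disagrees with `concept` on at least one instance in idx.
            if all(any(c[i] != concept[i] for i in idx) for c in others):
                return k
    raise AssertionError("unreachable: distinct concepts differ on some instance")

def ts_min(cls):
    return min(ts(c, cls) for c in set(cls))

def ts_max(cls):          # the teaching dimension TD
    return max(ts(c, cls) for c in set(cls))

def ts_avg(cls):
    cls = set(cls)
    return Fraction(sum(ts(c, cls) for c in cls), len(cls))

# The empty concept together with all singletons over four instances.
cls = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
print(ts_min(cls), ts_avg(cls), ts_max(cls))
```

In this example the empty concept needs all four instances while each singleton needs only its positive instance, so the three values differ.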
The teaching dimension is also known in several other fields of research, e.g. combinatorics and complexity theory. There, a teaching set is often called a witness set (e.g. Jukna [25]). In fields related to communication complexity, for example, this notion refers to the property of being able to testify the knowledge of a pre-shared secret.
2.2 Collusions and Coding Tricks
Of course, there are more strategies violating the common idea of learning than the one mentioned in the introduction of this chapter. A first attempt to give a formal definition for this was made by Goldman and Mathias [20]. For a given pair $(\tau, \lambda)$ of a teacher and a learner, they defined $\tau(C)$ to be the sample that the teacher selects in pursuit of teaching $C$. Accordingly, $\lambda(S)$ is the hypothesis of the learner on input $S$. According to Goldman and Mathias [20], a teacher-learner pair is said to be collusion-free (for $\mathcal{C}$) if
$\forall C \in \mathcal{C}\ \forall S \supseteq \tau(C): \lambda(S) = C.$
More or less, the property demands that enriching the sample with consistently labeled points does not make the learner fail on $S$. Coding tricks like the one described above are prevented by this requirement. It is easy to see that learning with the help of minimum teaching sets provided by a teacher is indeed collusion-free.
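The index-encoding trick from the chapter introduction and its failure under the collusion-freeness requirement can be illustrated on a toy class. The sketch below makes several assumptions that are not part of the source: concepts are 0/1 tuples, instances and concepts are indexed from 0, and the colluding learner decodes the smallest instance index it sees.

```python
def teacher(i, concepts):
    """Coding trick: send only instance i (labelled consistently) to encode index i."""
    return {(i, concepts[i][i])}

def learner(sample, concepts):
    """Colluding learner: decode the smallest instance index as the concept index."""
    return concepts[min(x for x, _ in sample)]

def is_consistent(sample, concept):
    return all(concept[x] == label for x, label in sample)

concepts = [(0, 0, 0), (0, 1, 0), (1, 1, 0)]      # C_0, C_1, C_2 over x_0, x_1, x_2

# The trick "works": a single example identifies every concept.
decoded = [learner(teacher(i, concepts), concepts) for i in range(3)]

# But enriching the sample for C_2 with a consistent example breaks the learner.
target = concepts[2]
enriched = teacher(2, concepts) | {(0, target[0])}
fooled = learner(enriched, concepts)
print(decoded == concepts, is_consistent(enriched, target), fooled != target)
```

The single example sent for C_2 is consistent with every concept in the class, so it is not a teaching set; the pair succeeds only because of the agreed encoding, and it fails on the enriched sample.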
          x_1   x_2   x_3   ...   x_{n-1}   x_n
C_0        0     0     0    ...      0       0
C_1        1     0     0    ...      0       0
C_2        0     1     0    ...      0       0
C_3        0     0     1    ...      0       0
...       ...   ...   ...   ...     ...     ...
C_{n-1}    0     0     0    ...      1       0
C_n        0     0     0    ...      0       1
Figure 2.2: The concept class $\mathcal{C}_{sing} \cup \{\emptyset\}$, for which $\mathrm{TD}(\mathcal{C}) = |\mathcal{C}| - 1$ and $\mathrm{VCdim}(\mathcal{C}) = 1$.
Zilles et al. [48] have given a different interpretation of coding tricks that is not as restrictive as the one given above. Thus, they were able to prove the soundness of different protocols based on other complexity notions for teaching. The interested reader is referred to [48;19;20;35;4] for further discussions about coding tricks.
2.3 Relation to the VC-dimension
Lemma 2.1 ([19]). There is a concept class $\mathcal{C}$ for which $\mathrm{TD}(\mathcal{C}) = |\mathcal{C}| - 1$ and $\mathrm{VCdim}(\mathcal{C}) = 1$.
Proof. See Figure 2.2. Obviously the given class has VC-dimension 1, since any concept has at most one positive label. Thus no set of size two can be shattered.
Each of the concepts $C_1, \ldots, C_n$ has a teaching set of size one (the single positively labeled instance $x_i$ for $C_i$). But the concept $C_0$ requires the labels of all $|\mathcal{C}| - 1$ instances to be revealed for any consistent learner, since each instance distinguishes only one of the other concepts from $C_0$.
We would like to highlight that the concept class presented above is also maximum, i.e. not even such well-structured concept classes feature a nice relation between teaching and shattering.
Lemma 2.2 ([33;19]). Consider the concept class $\mathcal{C}^n_{addr}$ illustrated in Figure 2.3 (addr stands for addressing). It holds that
$1 = \mathrm{TD}(\mathcal{C}^n_{addr}) < \mathrm{VCdim}(\mathcal{C}^n_{addr}) = \log(n).$
          x_1   x_2   ...   x_{\log_2(n)}   x_{1+\log_2(n)}   ...   x_{n-1+\log_2(n)}   x_{n+\log_2(n)}
C_1        0     0    ...        0                1           ...          0                  0
C_2        1     0    ...        0                0           ...          0                  0
C_3        0     1    ...        0                0           ...          0                  0
...       ...   ...   ...       ...              ...          ...         ...                ...
C_{n-1}    1     1    ...        0                0           ...          1                  0
C_n        1     1    ...        1                0           ...          0                  1
Figure 2.3: The concept class $\mathcal{C}^n_{addr}$ for which $\mathrm{TD}(\mathcal{C}) < \mathrm{VCdim}(\mathcal{C})$.
Proof. See Figure 2.3 and let $n = 2^k$. Note that for the concept $C_i$ only $x_{k+i}$ is positively labeled among the instances $x_{k+1}, \ldots, x_{k+n}$. Obviously $\{x_1, \ldots, x_k\}$ is shattered, whereas the positive label of $x_{k+i}$ is sufficient for teaching $C_i$.
Note that Figure 2.3 also exhibits the maximal factor between these two notions, since for any finite $\mathcal{C}$ it holds that $\mathrm{VCdim}(\mathcal{C}) \le \log(|\mathcal{C}|)$ and $\mathrm{TD}(\mathcal{C}) \ge 1$.
Though there is no general relation between the VC-dimension and the teaching dimension, Goldman and Kearns [19] were able to prove the following by taking the size of the class itself into account.
Theorem 2.3 (Goldman and Kearns [19]). For any concept class $\mathcal{C}$ it holds that
$\mathrm{TD}(\mathcal{C}) \le \mathrm{VCdim}(\mathcal{C}) + |\mathcal{C}| - 2^{\mathrm{VCdim}(\mathcal{C})}.$
Proof. Let $\{x_1, \ldots, x_d\}$ be a shattered set of size $d = \mathrm{VCdim}(\mathcal{C})$. After taking $x_1, \ldots, x_d$ and their corresponding labels into a temporary teaching set, there are at most $|\mathcal{C}| - 2^d + 1$ concepts left in the version space. Since each concept differs in at least one instance from all other concepts, we need at most $|\mathcal{C}| - 2^d$ more labeled instances for a complete teaching set.
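As a quick sanity check, added here for illustration, the bound of Theorem 2.3 can be evaluated for the class of Figure 2.2 over $n$ instances, which has $|\mathcal{C}| = n + 1$ concepts, VC-dimension 1 and, by Lemma 2.1, teaching dimension $|\mathcal{C}| - 1 = n$:

```latex
\mathrm{VCdim}(\mathcal{C}) + |\mathcal{C}| - 2^{\mathrm{VCdim}(\mathcal{C})}
  = 1 + (n + 1) - 2^{1}
  = n
  = |\mathcal{C}| - 1
  = \mathrm{TD}(\mathcal{C})
```

Hence the bound is attained with equality for this class.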
2.4 The Teaching Dimension for Special Classes
Computing the teaching dimension is known to be NP-complete. This was proven by Shinohara and Miyano [41] by giving a reduction from the hitting set problem. Therefore, the teaching dimension of well-structured classes is of special interest. Goldman and Kearns [19] determined the teaching dimension for a large collection of interesting concept classes.
Proposition 2.4 (Goldman and Kearns [19]). For the concept class $\mathcal{C}$
• of monotone monomials over $n$ variables and $r$ relevant variables
  $\mathrm{TD}(\mathcal{C}) = \min(r + 1, n)$
• of monomials over $n$ variables and $r$ relevant variables
  $\mathrm{TD}(\mathcal{C}) = \min(r + 2, n + 1)$
• of monotone $k$-term DNF formulas over $n$ variables
  $l + 1 \le \mathrm{TD}(\mathcal{C}) \le l + k$
  where $l$ is the number of literals in the target formula;
• of monotone decision lists over $n$ variables
  $n \le \mathrm{TD}(\mathcal{C}) \le 2n - 1$
• $\mathrm{BOX}^d_n$ of $d$-dimensional boxes over the domain $[n]^d$
  $\mathrm{TD}(\mathcal{C}) = 2 + 2d.$
According to Kuhlmann, the minimum teaching set size is a lower bound on the self-directed learning complexity.
Lemma 2.5 ([28]). For every concept class $\mathcal{C}$: $\mathrm{TS}_{min}(\mathcal{C}) \le \mathrm{SDC}(\mathcal{C})$.
Thus, the following holds:
Corollary 2.6 ([28]). Let $\mathcal{C}$ be an arbitrary concept class. If $\mathrm{VCdim}(\mathcal{C}) = 1$ then $\mathrm{TS}_{min}(\mathcal{C}) = 1$.
Additionally, Kuhlmann proved the following:
Lemma 2.7 ([28]). Let $\mathcal{C}$ be an intersection-closed concept class. Then,
$\mathrm{TS}_{min}(\mathcal{C}) \le I(\mathcal{C}).$
According to results of Ben-David and Eiron [5], the teaching dimension itself does not fit into our landscape of complexity notions very well.
Lemma 2.8 ([5]). The self-directed learning complexity of a concept class is not bounded by any function of its teacher-directed complexity, i.e. for every $n$ and $d \ge 3$, there exists a concept class $\mathcal{C}$ with $\mathrm{TD}(\mathcal{C}) = d$ and $\mathrm{SDC}(\mathcal{C}) \ge n$.
Summing up, at least the minimum teaching set size can be integrated into our chain of inequalities between complexity notions. Thus, we can extend Theorem 1.21 to the following:
Corollary 2.9. For arbitrary concept classes $\mathcal{C}$ it holds that
$\mathrm{TS}_{min}(\mathcal{C}) \le \mathrm{SDC}(\mathcal{C}) \le M_{opt}(\mathcal{C}) \le \log |\mathcal{C}|$
2.5 New Results for Shortest-Path-Closed Classes
In this section, we study the best-case teaching dimension, $\mathrm{TS}_{min}(\mathcal{C})$, and the average-case teaching dimension, $\mathrm{TS}_{avg}(\mathcal{C})$, of a shortest-path closed concept class $\mathcal{C}$.
It is known that the instances of $I(C, G(\mathcal{C}))$, augmented by their $C$-labels, form a unique minimum teaching set for $C$ in $\mathcal{C}$, provided that $\mathcal{C}$ is a maximum class [30]. Lemma 2.10 slightly generalizes this observation.
Lemma 2.10. Let $\mathcal{C}$ be any concept class. Then the following two statements are equivalent:
1. $\mathcal{C}$ is shortest-path closed.
2. Every $C \in \mathcal{C}$ has a unique minimum teaching set $S$, namely the set $S$ such that $X(S) = I(C, G(\mathcal{C}))$.
Proof. $1 \Rightarrow 2$ is easy to see. Let $\mathcal{C}$ be an arbitrary shortest-path closed concept class, and let $C$ be any concept in $\mathcal{C}$. Clearly, any teaching set $S$ for $C$ must satisfy $I(C, G(\mathcal{C})) \subseteq X(S)$ because $C$ must be distinguished from all its neighbors in $G(\mathcal{C})$. Let $C' \ne C$ be any other concept in $\mathcal{C}$. Since $C$ and $C'$ are connected by a path $P$ of length $|C \,\triangle\, C'|$, $C$ and $C'$ are distinguished by the color of the first edge in $P$, say by the color $x \in I(C, G(\mathcal{C}))$. Thus, no other instances (= colors) besides $I(C, G(\mathcal{C}))$ are needed to distinguish $C$ from any other concept in $\mathcal{C}$.
To show $2 \Rightarrow 1$, we suppose 2 and prove by induction on $k$ that any two concepts $C, C' \in \mathcal{C}$ with $k = |C \,\triangle\, C'|$ are connected by a path of length $k$ in $G(\mathcal{C})$. The case $k = 1$ is trivial. For a fixed $k$, assume all pairs of concepts of Hamming distance $k$ are connected by a path of length $k$ in $G(\mathcal{C})$. Let $C, C' \in \mathcal{C}$ with $|C \,\triangle\, C'| = k + 1 \ge 2$. Since $I(C, G(\mathcal{C})) = X(S)$, there is an $x \in I(C, G(\mathcal{C}))$ such that $C(x) \ne C'(x)$. Let $C''$ be the $x$-neighbor of $C$ in $G(\mathcal{C})$. Note that $C''(x) = C'(x)$, so that $C''$ and $C'$ have Hamming distance $k$. According to the inductive hypothesis, there is a path of length $k$ from $C''$ to $C'$ in $G(\mathcal{C})$. It follows that $C$ and $C'$ are connected by a path of length $k + 1$.
Theorem 2.11. Let $\mathcal{C}$ be a shortest-path closed concept class. Then,
$\mathrm{TS}_{avg}(\mathcal{C}) < 2\,\mathrm{VCdim}(\mathcal{C}).$
Proof. According to Lemma 2.10, the average-case teaching dimension of $\mathcal{C}$ coincides with the average vertex degree in $G(\mathcal{C})$, which is twice the density of $G(\mathcal{C})$. The result follows from the fact, shown by Haussler et al. [23], that the density of $G(\mathcal{C})$ is a lower bound on the VC-dimension of $\mathcal{C}$.
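The two ingredients of this argument, shortest-path closedness and the identity between teaching set size and vertex degree, can be checked by brute force on a small class. The sketch below assumes concepts are 0/1 tuples; the helper names are chosen here for illustration, and the example class is a chain of nested sets, which has VC-dimension 1.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

def hamming(c, d):
    return sum(a != b for a, b in zip(c, d))

def degree(c, concepts):
    return sum(1 for d in concepts if hamming(c, d) == 1)

def is_shortest_path_closed(concepts):
    """Check that graph distance in G(C) equals Hamming distance for all pairs."""
    nodes = set(concepts)
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:                                   # breadth-first search in G(C)
            c = queue.popleft()
            for d in nodes:
                if d not in dist and hamming(c, d) == 1:
                    dist[d] = dist[c] + 1
                    queue.append(d)
        if any(dist.get(t) != hamming(source, t) for t in nodes):
            return False
    return True

def ts(concept, concepts):
    """Size of a smallest teaching set, by brute force over instance subsets."""
    others = [c for c in set(concepts) if c != concept]
    n = len(concept)
    return next(k for k in range(n + 1) for idx in combinations(range(n), k)
                if all(any(c[i] != concept[i] for i in idx) for c in others))

chain = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]   # VC-dimension 1
assert is_shortest_path_closed(chain)
assert all(ts(c, chain) == degree(c, chain) for c in chain)
average = Fraction(sum(ts(c, chain) for c in chain), len(chain))
print(average)
```

The class consisting only of (0, 0, 0) and (1, 1, 0) is not shortest-path closed, and there the teaching set size (1) differs from the degree (0).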
Theorem 2.11 generalizes a result by Kuhlmann [28], who showed that the average-case teaching dimension of "d-balls" (sets of concepts of Hamming distance at most $d$ from a center concept) is smaller than $2d$. It also simplifies Kuhlmann's proof substantially. In Theorem 4 of the same paper, Kuhlmann [28] stated furthermore that $\mathrm{TS}_{avg}(\mathcal{C}) < 2$ if $\mathrm{VCdim}(\mathcal{C}) = 1$, but his proof is flawed.¹ Despite the flawed proof, the claim itself is correct, as we show now:
Theorem 2.12. Let $\mathcal{C}$ be any concept class. If $\mathrm{VCdim}(\mathcal{C}) = 1$ then $\mathrm{TS}_{avg}(\mathcal{C}) < 2$.
Proof.By Theorem 2.11,the averagecase teaching dimension of a maximum
class of VCdimension 1 is less than 2.It thus sufﬁces to show that any class
C of VCdimension 1 can be transformed into a maximum class C
0
of VC
dimension 1 without decreasing the averagecase teaching dimension.Let
1
His Claim2 states the following.If VCdim(C) = 1,C
1
;C
2
2 C,x 2 X,x =2 C
1
,C
2
= C
1
[
fxg,then,for either (i;j) = (1;2) or (i;j) = (2;1),one obtains TS(C
i
;C) = TS(C
i
x;Cx)+1
and TS(C
j
;C) = 1.This is not correct,as can be shown by the class C = ffx
z
:1 z kg:
0 k 5g over X = fx
k
:1 k 5g,which has VCdimension 1.For C
1
= fx
1
;x
2
g,
C
2
= fx
1
;x
2
;x
3
g,and x = x
3
,we get TS(C
1
;C) = TS(C
2
;C) = TS(C
1
x;C x) = 2.
30
New Results for ShortestPathClosed Classes
X
0
X be any maximal subset of X that does not contain redundant in
stances.(e.g.,X
0
results from X by removing redundant instances as long
as possible).Let C
0
= C
jX
0.Obviously,jC
0
j = jCj and VCdim(C
0
) = 1.Let
m = jX
0
j so that jC
0
j
m
0
+
m
1
= m+ 1.Now we prove that C
0
is maxi
mum.Since no x 2 X
0
is redundant for C
0
,every x 2 X
0
occurs as color in
G(C
0
).As VCdim(C
0
) = 1,no color can occur twice.Thus jE(G(C
0
))j = m.
Moreover,there is no cycle in G(C
0
) since a cycle would require at least one
repeated color.As G(C
0
) is an acyclic graph of m edges,it has at least m+1
vertices,i.e.jC
0
j m+1.Thus,jC
0
j = m+1 and C
0
is maximum.This implies
that TS
avg
(C
0
) < 2VCdim(C
0
).Since X
0
results from X by removing redun
dant instances only,we obtain TS(C;C) TS(C
jX
0;C
0
) for all C 2 C.Thus,
TS
avg
(C) TS
avg
(C
0
) < 2VCdim(C
0
) = 2,which concludes the proof.
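The chain class used in the footnote on Kuhlmann's Claim 2 is small enough to check by brute force. The following is a minimal sketch, assuming concepts are encoded as 0/1 tuples over $x_1, \ldots, x_5$; the helper names `teaching_dim` and `vc_dim` are illustrative and not taken from the thesis.

```python
from itertools import combinations

def teaching_dim(c, cls):
    """Size of a smallest instance set on which c disagrees with every other concept."""
    others = {d for d in cls if d != c}
    n = len(c)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            if all(any(c[i] != d[i] for i in idx) for d in others):
                return k

def vc_dim(cls):
    """Largest k such that some set of k instances is shattered (brute force)."""
    n = len(next(iter(cls)))
    d = 0
    for k in range(1, n + 1):
        if any(len({tuple(c[i] for i in idx) for c in cls}) == 2 ** k
               for idx in combinations(range(n), k)):
            d = k
    return d

# Chain class {{x_z : 1 <= z <= k} : 0 <= k <= 5} as 0/1 vectors over x_1..x_5.
chain = [tuple(1 if z < k else 0 for z in range(5)) for k in range(6)]
ts_avg = sum(teaching_dim(c, chain) for c in chain) / len(chain)

c1 = (1, 1, 0, 0, 0)                 # C_1 = {x_1, x_2}
c2 = (1, 1, 1, 0, 0)                 # C_2 = C_1 united with {x_3}

def drop_x3(c):
    """Restrict a concept to X without x_3."""
    return c[:2] + c[3:]

reduced = {drop_x3(c) for c in chain}  # the class C - x for x = x_3

print("VCdim:", vc_dim(chain), "TS_avg:", ts_avg)
print(teaching_dim(c1, chain), teaching_dim(c2, chain),
      teaching_dim(drop_x3(c1), reduced))
```

All three teaching dimensions in the last line come out as 2, so neither assignment of $(i, j)$ satisfies the claimed identities, while the average over the class stays below 2 as Theorem 2.12 requires.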
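We briefly note that $\mathrm{TS}_{\mathrm{avg}}(\mathcal{C})$ cannot in general be bounded by $O(\mathrm{VCdim}(\mathcal{C}))$. Kushilevitz et al. [29] present a family $(\mathcal{C}_n)$ of concept classes such that $\mathrm{TS}_{\mathrm{avg}}(\mathcal{C}_n) = \Omega(\sqrt{|\mathcal{C}_n|})$ but $\mathrm{VCdim}(\mathcal{C}_n) \le \log |\mathcal{C}_n|$.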
Lemma 2.13. If $\deg_{G(\mathcal{C})}(C) \ge |X| - 1$ for all $C \in \mathcal{C}$, then $\mathcal{C}$ is shortest-path closed.
Proof. Assume $\mathcal{C}$ is not shortest-path closed. Pick two concepts $C, C' \in \mathcal{C}$ of minimal Hamming distance $d$ that are not connected by a path of length $d$ in $G(\mathcal{C})$. It follows that $d \ge 2$. By minimality of $d$, any neighbor of $C$ with Hamming distance $d - 1$ to $C'$ does not belong to $\mathcal{C}$. Since there are $d$ such missing neighbors, the degree of $C$ in $G(\mathcal{C})$ is bounded by $|X| - d \le |X| - 2$. This yields a contradiction.
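The degree condition of Lemma 2.13 can be tested on small classes. The sketch below assumes the one-inclusion graph $G(\mathcal{C})$ joins two concepts exactly when they differ in a single instance, and compares breadth-first distances with Hamming distances; the function names are hypothetical.

```python
from collections import deque
from itertools import product

def neighbors(c, cls):
    """Concepts of cls that differ from c in exactly one instance."""
    flips = (c[:i] + (1 - c[i],) + c[i + 1:] for i in range(len(c)))
    return [d for d in flips if d in cls]

def graph_distances(src, cls):
    """Breadth-first distances from src in the one-inclusion graph of cls."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        c = queue.popleft()
        for d in neighbors(c, cls):
            if d not in dist:
                dist[d] = dist[c] + 1
                queue.append(d)
    return dist

def is_shortest_path_closed(cls):
    """True if every pair of concepts is joined by a path of Hamming-distance length."""
    for c in cls:
        dist = graph_distances(c, cls)
        for d in cls:
            hamming = sum(a != b for a, b in zip(c, d))
            if dist.get(d) != hamming:
                return False
    return True

# Full cube over 3 instances with one concept removed: every degree is >= |X| - 1.
cls = set(product((0, 1), repeat=3)) - {(1, 1, 1)}
min_degree = min(len(neighbors(c, cls)) for c in cls)
print(min_degree, is_shortest_path_closed(cls))
```

The class $\{(0,0), (1,1)\}$, whose two concepts have degree 0, is a simple example where the check fails.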
In their Theorem 43, Rubinstein et al. [36] provide a concept class $V$ with $\mathrm{TS}_{\min}(V) > \mathrm{VCdim}(V)$. While there are several such classes known, Section 3.3.3 will show that $\mathrm{TS}_{\min}(\mathcal{C}) = \mathrm{VCdim}(\mathcal{C})$ for all maximum classes $\mathcal{C}$. Shortest-path-closed classes generalize maximum classes, but an inspection of [36] and Lemma 2.13 shows that $V$ is shortest-path closed, since each concept in $V$ has degree $|X|$ or $|X| - 1$ in $G(V)$. Thus we can prove the existence of shortest-path-closed classes with $\mathrm{TS}_{\min}(\mathcal{C}) > \mathrm{VCdim}(\mathcal{C})$, i.e., the result does not generalize to shortest-path-closed classes. This knowledge might be helpful in the study of the open question whether $\mathrm{TS}_{\min}(\mathcal{C}) \in O(\mathrm{VCdim}(\mathcal{C}))$, a question related to the long-standing open sample compression conjecture [16]; see Section 3.3.
3 The Recursive Teaching Dimension
Another recent model of teaching with low information complexity is recursive teaching, where a teacher chooses a sample based on a sequence of nested subclasses of the underlying concept class $\mathcal{C}$ [47]. The nesting is defined as follows. The outermost "layer" consists of all concepts in $\mathcal{C}$ that are easiest to teach, i.e., that have the smallest sets of examples distinguishing them from all other concepts in $\mathcal{C}$. The next layers are formed by recursively repeating this process with the remaining concepts. The samples used are called recursive teaching sets, and the largest number of examples required for teaching at any layer is the recursive teaching dimension (RTD) of $\mathcal{C}$. The RTD substantially reduces the information complexity bounds obtained in traditional teaching models. It lower-bounds not only the teaching dimension, which is the measure of information complexity in the "classical" teaching model [19; 41], but also the information complexity of iterated optimal teaching [4] (which we do not cover here), which is often substantially smaller than the classical teaching dimension. The intuitive idea behind this construction is the following: a learner provided with a recursive teaching set realizes that the given sample is not sufficient for exactly identifying one particular concept. But he can discard those concepts from $\mathcal{C}$ that are taught with different samples of minimal size among all minimal teaching sets for concepts in the class: if the teacher wanted to suggest that any of these is the target, he would have chosen one of these uniquely identifying sets. After dropping these concepts from $\mathcal{C}$, he can repeat the same reasoning over and over again until the initially received sample coincides with a teaching set for the target concept in some subset of $\mathcal{C}$.
Since the VC-dimension is the best-studied quantity related to information complexity in learning, it is a natural first parameter to compare to when it comes to identifying connections between information complexity notions across various models of learning. However, the teaching dimension, i.e., the information complexity of the classical teaching model, does not exhibit any general relationship to the VC-dimension: the two parameters can be arbitrarily far apart in either direction, as seen in Chapter 2. Similarly, there is no known connection between teaching dimension and query complexity.
In this chapter, we establish the first known relationships between the information complexity of teaching and query complexity, as well as between the information complexity of teaching and the VC-dimension. All these relationships are exhibited by the RTD. The main contributions of this chapter are the following:
• We show that the RTD is never higher (and often considerably lower) than the complexity of self-directed learning. Hence all lower bounds on the RTD hold likewise for self-directed learning, for learning from partial equivalence queries, and for a variety of other query learning models.
• We reveal a strong connection between the RTD and the VC-dimension. Though there are classes for which the RTD exceeds the VC-dimension, we present a number of quite general and natural cases in which the RTD is upper-bounded by the VC-dimension. These include classes of VC-dimension 1, intersection-closed classes, a variety of naturally structured Boolean function classes, and finite maximum classes in general (i.e., classes of maximum possible cardinality for a given VC-dimension and domain size). Many natural concept classes are maximum, e.g., the class of unions of up to $k$ intervals, for any $k \in \mathbb{N}$, or the class of simple arrangements of positive halfspaces. It remains open whether every class of VC-dimension $d$ has an RTD linear in $d$.
• We reveal a relationship between the RTD and sample compression schemes.
The relationship between RTD and unlabeled sample compression schemes is established via corner-peeling plans and Theorem 1.33. Like the RTD, corner-peeling is associated with a nesting of subclasses of the underlying concept class. A crucial observation we make in this chapter is that every maximum class of VC-dimension $d$ allows corner-peeling with an additional property, which ensures that the resulting unlabeled samples contain exactly those instances a teacher following the RTD model would use. Similarly, we show that the unlabeled compression schemes constructed by Kuzmin and Warmuth's Tail Matching algorithm (see Figure 1.6) exactly coincide with the teaching sets used in the RTD model, all of which have size at most $d$.
This remarkable relationship between the RTD and sample compression suggests that the open question of whether or not the RTD is linear in the VC-dimension might be related to the long-standing sample compression Conjecture 1.34. To this end, we observe that a negative answer to the former question would have implications on potential approaches to settling the latter. In particular, if the RTD is not linear in the VC-dimension, then there is no mapping that maps every concept class of VC-dimension $d$ to a superclass that is maximum of VC-dimension $O(d)$. Constructing such a mapping would be one way of proving that the best possible size of sample compression schemes is linear in the VC-dimension.
Note that sample compression schemes are not bound to any constraints as to how the compression sets have to be formed, other than that they be subsets of the set to be compressed. In particular, any kind of agreement on, say, an order over the instance space or an order over the concept class can be exploited for creating the smallest possible compression scheme. As opposed to that, the RTD is defined following a strict "recipe" in which teaching sets are independent of orderings of the instance space or the concept class. These differences between the models make the relationship revealed in this chapter even more remarkable.
Lemma 3.3 and all results of Sections 3.2 and 3.3 have been published in the joint work Doliwa et al. [11] or are contained in Doliwa et al. [12]. The only exceptions are Lemma 3.8 and Lemma 3.9, which are unpublished.
3.1 Definitions and Properties
Definition 3.1. Let $K$ be a function that assigns a "complexity" $K(\mathcal{C}) \in \mathbb{N}$ to each concept class $\mathcal{C}$. We say that $K$ is monotonic if $\mathcal{C}' \subseteq \mathcal{C}$ implies that $K(\mathcal{C}') \le K(\mathcal{C})$. We say that $K$ is twofold monotonic if $K$ is monotonic and, for every concept class $\mathcal{C}$ over $X$ and every $X' \subseteq X$, it holds that $K(\mathcal{C}|_{X'}) \le K(\mathcal{C})$.
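Definition 3.2 ([48]). A teaching plan for $\mathcal{C}$ is a sequence
$$P = ((C_1, S_1), \ldots, (C_N, S_N)) \qquad (3.1)$$
with the following properties:
1. $N = |\mathcal{C}|$ and $\mathcal{C} = \{C_1, \ldots, C_N\}$.
2. For all $t = 1, \ldots, N$, $S_t \in \mathrm{TS}(C_t; \{C_t, \ldots, C_N\})$.
The quantity $\mathrm{ord}(P) := \max_{t=1,\ldots,N} |S_t|$ is called the order of the teaching plan $P$. Finally, we define
$$\mathrm{RTD}(\mathcal{C}) := \min\{\mathrm{ord}(P) \mid P \text{ is a teaching plan for } \mathcal{C}\},$$
$$\mathrm{RTD}^*(\mathcal{C}) := \max_{X' \subseteq X} \mathrm{RTD}(\mathcal{C}|_{X'}).$$
The quantity $\mathrm{RTD}(\mathcal{C})$ is called the recursive teaching dimension of $\mathcal{C}$.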
A teaching plan (3.1) is said to be repetition-free if the sets $X(S_1), \ldots, X(S_N)$ are pairwise distinct. (Clearly, the corresponding labeled sets, $S_1, \ldots, S_N$, are always pairwise distinct.) Similar to the recursive teaching dimension we define
$$\mathrm{rfRTD}(\mathcal{C}) := \min\{\mathrm{ord}(P) \mid P \text{ is a repetition-free teaching plan for } \mathcal{C}\}.$$
One can show that every concept class possesses a repetition-free teaching plan. First, by induction on $|X| = m$, the full cube $2^X$ has a repetition-free teaching plan of order $m$: It results from a repetition-free plan for the $(m-1)$-dimensional subcube of concepts for which a fixed instance $x$ is labeled 1, where each teaching set is supplemented by the example $(x, 1)$, followed by a repetition-free teaching plan for the $(m-1)$-dimensional subcube of concepts with $x = 0$. Second, "projecting" a (repetition-free) teaching plan for a concept class $\mathcal{C}$ onto the concepts in a subclass $\mathcal{C}' \subseteq \mathcal{C}$ yields a (repetition-free) teaching plan for $\mathcal{C}'$. Putting these two observations together, it follows that every class over instance set $X$ has a repetition-free teaching plan of order $|X|$.

          x_1  x_2  x_3  x_4  x_5   TS_min(C_i;C)   TS_min(C_i;C_{-1})   TS_min(C_i;C_{-2})   TS_min(C_i;C_{-1,-2})
  C_1      0    0    0    0    0          2                 -                    2                     -
  C_2      1    1    0    0    0          2                 2                    -                     -
  C_3      0    1    0    0    0          4                 3                    3                     2
  C_4      1    0    1    0    0          3                 3                    3                     3
  C_5      1    0    1    0    1          3                 3                    3                     3
  C_6      0    1    1    0    1          3                 3                    3                     3
  C_7      0    1    1    1    1          3                 3                    3                     3
  C_8      0    1    1    1    0          3                 3                    3                     3
  C_9      1    0    1    1    0          3                 3                    3                     3
  C_10     1    0    0    1    0          4                 3                    3                     3
  C_11     1    0    0    1    1          3                 3                    3                     3
  C_12     0    1    0    1    0          4                 4                    4                     4
  C_13     0    1    0    1    1          3                 3                    3                     3

Figure 3.1: A class with $\mathrm{RTD}(\mathcal{C}) = 2$ but $\mathrm{rfRTD}(\mathcal{C}) = 3$. $\mathcal{C}_{-i}$ denotes the class $\mathcal{C}$ without concept $C_i$ and $\mathcal{C}_{-i,-j}$ the class without both $C_i$ and $C_j$.
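The inductive construction of a repetition-free teaching plan for the full cube can be sketched in a few lines. This is an illustrative sketch only: concepts are assumed to be frozensets of the instances labeled 1, samples are frozensets of (instance, label) pairs, and the names `cube_plan`, `consistent` and `is_teaching_plan` are hypothetical.

```python
def cube_plan(instances):
    """Repetition-free teaching plan for the full cube over `instances`.

    Returns a list of (concept, sample) pairs in teaching order.
    """
    if not instances:
        return [(frozenset(), frozenset())]
    x, rest = instances[0], instances[1:]
    sub = cube_plan(rest)
    # Subcube with x labeled 1: every teaching set is supplemented by (x, 1).
    with_x = [(c | {x}, s | {(x, 1)}) for c, s in sub]
    # Subcube with x labeled 0: the plan for the remaining instances is reused.
    return with_x + sub

def consistent(concept, sample):
    """True if the concept agrees with every labeled example in the sample."""
    return all((x in concept) == bool(label) for x, label in sample)

def is_teaching_plan(plan):
    """Check that S_t is a teaching set for C_t with respect to {C_t, ..., C_N}."""
    for t, (c, s) in enumerate(plan):
        if not consistent(c, s):
            return False
        if any(consistent(d, s) for d, _ in plan[t + 1:]):
            return False
    return True

plan = cube_plan(["x1", "x2", "x3"])
instance_sets = [frozenset(x for x, _ in s) for _, s in plan]
order = max(len(s) for _, s in plan)
print(len(plan), order, len(set(instance_sets)))
```

In this construction each of the $2^m$ concepts receives a different subset of $X$ as its unlabeled teaching set, which is why no instance set is repeated.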
It should be noted though that $\mathrm{rfRTD}(\mathcal{C})$ may exceed $\mathrm{RTD}(\mathcal{C})$. For example, consider the class in Figure 3.1, which is of RTD 2. This can be seen by picking the trivial order $((C_1, S_1), (C_2, S_2), \ldots, (C_{13}, S_{13}))$. Then each $S_i$ is given by the corresponding labeled instances highlighted by bold letters in Figure 3.1. In any teaching plan of order 2, both $C_1$ and $C_2$ have to be taught first with the same instance set $\{x_1, x_2\}$ augmented by the appropriate labels. Consequently, no teaching plan of order 2 is repetition-free, and the best repetition-free teaching plan for this class is of order 3.
As observed by Zilles et al. [48], the following holds:
• RTD is monotonic.
• The RTD coincides with the order of any teaching plan that is in canonical form, i.e., a teaching plan $((C_1, S_1), \ldots, (C_N, S_N))$ such that $|S_t| = \mathrm{TS}_{\min}(\{C_t, \ldots, C_N\})$ holds for all $t = 1, \ldots, N$.
Intuitively, a canonical teaching plan is a sequence that is recursively built by always picking an easiest-to-teach concept $C_t$ in the class $\mathcal{C} \setminus \{C_1, \ldots, C_{t-1}\}$ together with an appropriate teaching set $S_t$.
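Because the order of any canonical teaching plan equals the RTD, a greedy procedure suffices to compute the RTD of small classes. The following brute-force sketch assumes concepts are 0/1 tuples; `min_teaching_set`, `canonical_plan` and `rtd` are illustrative names, and the data is the class of Figure 3.1.

```python
from itertools import combinations

def min_teaching_set(c, cls):
    """Indices of a smallest instance set distinguishing c from all other concepts."""
    others = [d for d in cls if d != c]
    n = len(c)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            if all(any(c[i] != d[i] for i in idx) for d in others):
                return idx

def canonical_plan(cls):
    """Repeatedly pick an easiest-to-teach concept among the remaining ones."""
    remaining = list(cls)
    plan = []
    while remaining:
        best = min(remaining, key=lambda c: len(min_teaching_set(c, remaining)))
        idx = min_teaching_set(best, remaining)
        plan.append((best, tuple((i, best[i]) for i in idx)))
        remaining.remove(best)
    return plan

def rtd(cls):
    """Order of a canonical teaching plan, which coincides with the RTD."""
    return max(len(sample) for _, sample in canonical_plan(cls))

# The 13 concepts of Figure 3.1 as vectors over x_1..x_5.
fig31 = [
    (0, 0, 0, 0, 0), (1, 1, 0, 0, 0), (0, 1, 0, 0, 0), (1, 0, 1, 0, 0),
    (1, 0, 1, 0, 1), (0, 1, 1, 0, 1), (0, 1, 1, 1, 1), (0, 1, 1, 1, 0),
    (1, 0, 1, 1, 0), (1, 0, 0, 1, 0), (1, 0, 0, 1, 1), (0, 1, 0, 1, 0),
    (0, 1, 0, 1, 1),
]
powerset2 = [(0, 0), (0, 1), (1, 0), (1, 1)]

for concept, sample in canonical_plan(fig31):
    print(concept, sample)
print("RTD:", rtd(fig31))
```

The samples printed here are one valid choice; a canonical plan is in general not unique, since ties between equally easy concepts may be broken arbitrarily.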
The definition of teaching plans immediately yields the following result:
Lemma 3.3.
1. If $K$ is monotonic and $\mathrm{TS}_{\min}(\mathcal{C}) \le K(\mathcal{C})$ for every concept class $\mathcal{C}$, then $\mathrm{RTD}(\mathcal{C}) \le K(\mathcal{C})$ for every concept class $\mathcal{C}$.
2. If $K$ is twofold monotonic and $\mathrm{TS}_{\min}(\mathcal{C}) \le K(\mathcal{C})$ for every concept class $\mathcal{C}$, then $\mathrm{RTD}^*(\mathcal{C}) \le K(\mathcal{C})$ for every concept class $\mathcal{C}$.

(a) The concept class.

          x_1  x_2
  C_1      0    0
  C_2      0    1
  C_3      1    0
  C_4      1    1

(b) Two possible teaching plans.

  $P_1 = ((C_1, \{(x_1, 0), (x_2, 0)\}), (C_2, \{(x_1, 0)\}), (C_3, \{(x_2, 0)\}), (C_4, \emptyset))$
  $P_2 = ((C_2, \{(x_1, 0), (x_2, 1)\}), (C_1, \{(x_1, 0)\}), (C_3, \{(x_2, 0)\}), (C_4, \emptyset))$

Figure 3.2: An example for the ambiguity of recursive teaching sets and related teaching plans in a naive protocol.
For some concept classes $\mathcal{C}$, there exist different orderings of $\mathcal{C}$ such that the induced teaching sets lead to a failure when teacher and learner use some "naive" recursive teaching protocol.
Example 3.4. Consider the powerset over 2 instances as a concept class. Figure 3.2 shows this simple scenario along with two teaching plans. Both plans are not only valid recursive teaching plans but also canonical teaching plans. Thus, both plans are reasonable for recursive teaching. But in the end, a teacher using $P_1$ and a learner using $P_2$ would miscommunicate, as highlighted by the bold teaching sets: the sample $\{(x_1, 0)\}$ is assigned to $C_2$ in $P_1$ but to $C_1$ in $P_2$. While the teacher would try to teach $C_2$, the learner would identify $C_1$ as the target concept.
Does that mean that teacher and learner have to agree on some ordering of $\mathcal{C}$ in advance? Of course, that would inevitably lead to coding tricks. Fortunately, this apparent problem can be solved by the recursive teaching hierarchy, a valid protocol given by Zilles et al. [48] which makes use of recursive teaching sets.
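Definition 3.5 ([48]). Let $\mathcal{C}$ be a concept class. The recursive teaching hierarchy for $\mathcal{C}$ is the sequence $H = ((\mathcal{C}_1, d_1), \ldots, (\mathcal{C}_h, d_h))$ that fulfills, for all $j \in \{1, \ldots, h\}$,
$$\mathcal{C}_j = \{C \in \overline{\mathcal{C}}_j \mid \mathrm{TS}(C; \overline{\mathcal{C}}_j) = d_j = \mathrm{TS}_{\min}(\overline{\mathcal{C}}_j)\},$$
where $\overline{\mathcal{C}}_1 = \mathcal{C}$ and $\overline{\mathcal{C}}_i = \mathcal{C} \setminus (\mathcal{C}_1 \cup \ldots \cup \mathcal{C}_{i-1})$ for all $2 \le i \le h$. For any $i \in [h]$ and any $C \in \mathcal{C}_i$, a sample $S \in \mathcal{TS}(C; \overline{\mathcal{C}}_i)$ with $|S| = d_i$ is called a recursive teaching set for $C$.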
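The layers of the recursive teaching hierarchy can be computed by peeling off, in each round, all concepts whose teaching dimension within the remaining class is minimal. This is a minimal sketch under the assumption that concepts are 0/1 tuples; the helper names are hypothetical.

```python
from itertools import combinations

def teaching_dim(c, cls):
    """Size of a smallest instance set distinguishing c from all other concepts."""
    others = [d for d in cls if d != c]
    n = len(c)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            if all(any(c[i] != d[i] for i in idx) for d in others):
                return k

def teaching_hierarchy(cls):
    """Return the layers [(C_1, d_1), ..., (C_h, d_h)] of the hierarchy."""
    remaining = set(cls)
    layers = []
    while remaining:
        dims = {c: teaching_dim(c, remaining) for c in remaining}
        d = min(dims.values())
        layer = {c for c in remaining if dims[c] == d}
        layers.append((layer, d))
        remaining -= layer
    return layers

powerset2 = {(0, 0), (0, 1), (1, 0), (1, 1)}   # the class of Example 3.4
chain3 = {(0, 0), (1, 0), (1, 1)}              # a small chain for comparison

print(teaching_hierarchy(powerset2))
print(teaching_hierarchy(chain3))
```

For the powerset over two instances all four concepts land in a single layer with $d_1 = 2$, so every concept is taught with both labeled instances and the ambiguity of Example 3.4 does not arise. Since the layers depend only on the class, teacher and learner can compute them without agreeing on an ordering.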
The teaching hierarchy can be built independently by both teacher and learner, and $\max_{i \in [h]} d_i = \mathrm{RTD}(\mathcal{C})$. Unlike the strategies following the idea of, e.g., subset teaching, there exists a teacher-learner pair given by this method that is collusion-free in the sense of Goldman and Kearns [19], as shown by Zilles et al. [48].
In a subsequent work, Samei et al. [38] proved an upper bound on the size of a concept class in terms of the recursive teaching dimension, similar to Sauer's bound (see Lemma 1.4).
Lemma 3.6 ([38]). Let $\mathcal{C}$ be an arbitrary concept class over $X$ with $|X| = n$ and $\mathrm{RTD}(\mathcal{C}) = r$. Then it holds that
$$|\mathcal{C}| \le \sum_{k=0}^{r} \binom{n}{k}.$$
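As a quick sanity check of the bound (a worked instance, not part of the original argument), consider the class of Figure 3.1, which has 13 concepts over $n = 5$ instances and RTD $r = 2$, and the full cube over $n$ instances, for which every concept has teaching dimension $n$ and hence $r = n$:

```latex
\[
  13 = |\mathcal{C}| \;\le\; \binom{5}{0} + \binom{5}{1} + \binom{5}{2} = 1 + 5 + 10 = 16,
  \qquad
  |2^X| = 2^n = \sum_{k=0}^{n} \binom{n}{k}.
\]
```

The second identity shows that the bound is attained by the full cube.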
Proof. We give a simplified version of the proof of Samei et al. [38]. Let $F$ be the real vector space of real-valued functions over $\mathcal{C}$. Then $\dim(F) = |\mathcal{C}| = m$. Now consider $\mathcal{C}$ as a subset of $\mathbb{R}^n$. We will show that any real-valued function on $\mathcal{C}$ can be written as a polynomial of degree at most $r$ that is linear in all variables. As such, $F$ is spanned by these polynomials and $\dim(F) \le \sum_{k=0}^{r} \binom{n}{k}$.
Let $C_1, \ldots, C_N$ be the concepts in $\mathcal{C}$ ordered according to a canonical teaching plan $P$ of order $r$. Hence, $\mathrm{TS}(C_j; \{C_j, \ldots, C_N\}) \le r$. For the sake of conve