Notions of Teaching and Complexity in Computational Learning Theory


Notions of Teaching and
Complexity in Computational
Learning Theory
Dissertation
submitted in partial fulfillment of the requirements for the degree of
Doktor der Mathematik (Doctor of Mathematics)
of the Ruhr-Universität Bochum, Faculty of Mathematics
by
Thorsten Doliwa
under the supervision of
Hans Ulrich Simon
Bochum
July 2013
Declaration of Authorship

I hereby declare that I have written this thesis independently and have used no aids other than those indicated.
I further declare that everything taken from other sources (e.g. books, journals, newspapers, encyclopedias, the Internet, etc.), whether in thought, content, or wording, has been marked as such, i.e. its origin has been documented in the text or in the notes. Where applicable, this also holds for tables, sketches, drawings, pictorial representations, etc.

Place, Date                      Signature
Acknowledgments

This research project would not have been possible without the support of many people. First and foremost, I would like to thank my supervisor Prof. Dr. Hans Ulrich Simon for giving me the opportunity to work on this research project, for his valuable guidance, and for his continuous support in keeping the momentum going.
I would also like to thank Dr. Sandra Zilles for introducing to our research group the problem of cooperative learning, the fundamental problem this thesis is about, and for the warm welcome and fruitful discussions during my short visit at the University of Regina.
My sincere thanks go to my colleagues Malte Darnstädt and Michael Kallweit for scientific discussions and helpful advice, especially during the last hectic weeks of our research.
Last but by no means least, I am forever indebted to my fiancée Dr. Christine Kiss for her abundance of patience, understanding and constant support.
Contents

List of Symbols
Introduction
1 A Brief Introduction to Machine Learning
  1.1 Fundamental Notions in Learning Theory
  1.2 The Probably Approximately Correct Learning Model
  1.3 The Vapnik-Chervonenkis-Dimension
  1.4 Special Classes
    1.4.1 Maximum and Maximal Classes
    1.4.2 Dudley Classes
    1.4.3 Intersection-closed Classes
    1.4.4 Nested Differences
  1.5 Online-Learning Model
  1.6 Partial Equivalence Query Model
  1.7 Self-directed Learning
  1.8 Sample Compression and its Relation to PAC Learning
2 The Teaching Dimension
  2.1 Definitions
  2.2 Collusions and Coding Tricks
  2.3 Relation to the VC-dimension
  2.4 The Teaching Dimension for Special Classes
  2.5 New Results for Shortest-Path-Closed Classes
3 The Recursive Teaching Dimension
  3.1 Definitions and Properties
  3.2 Relation to Other Complexity Notions
  3.3 Recursive Teaching Dimension and VC-dimension
    3.3.1 Classes with RTD Exceeding VCD
    3.3.2 Intersection-closed Classes
    3.3.3 Maximum Classes
  3.4 Conclusions
4 Order Compression Schemes
  4.1 Sample Compression
  4.2 Properties of Order Compression Schemes
  4.3 Order Compression Schemes and Teaching
  4.4 Order Schemes for Special Classes
  4.5 Conclusions
A Python Programs for Complexity Notions
  A.1 A Small Toolbox
Index
Bibliography
List of Symbols

C ............... concept class
G(C) ............ one-inclusion graph of C
G_comp(f,g) ..... compression graph associated with (f,g)
I(C) ............ largest minimal spanning set size
Cons(S,C) ....... set of concepts in C consistent with S
𝒯𝒮(C,C) ......... set of all teaching sets of C ∈ C
X ............... instance space
deg_{G(C)}(C) ... the degree of C in G(C)
dens(G(C)) ...... density of G(C)
DIFF_d(C) ....... nested difference of depth d
⟨S⟩_C ........... the smallest concept in C containing S
LC-PARTIAL(C) ... partial equivalence query complexity
ℕ ............... the integers
OCN(C,H) ........ order compression scheme
ℝ ............... the real numbers
rfRTD(C) ........ repetition-free recursive teaching dimension of C
RTD(C) .......... recursive teaching dimension of C
SDC(C) .......... self-directed learning complexity
TD(C) ........... teaching dimension of C
TS(C,C) ......... size of smallest teaching set for C ∈ C
TS_avg(C) ....... average teaching dimension of C
TS_max(C) ....... maximum teaching dimension of C
TS_min(C) ....... minimum teaching dimension of C
VCdim(C) ........ VC-dimension of C
A △ B ........... symmetric difference of A and B
D_{F,h} ......... Dudley class
I(C, G(C)) ...... incident edges of C in G(C)
M_opt(C) ........ optimal mistake bound
N ............... usually the number of concepts
n ............... usually the number of instances
Introduction
In the design and analysis of machine learning algorithms, the amount of training data that needs to be provided for the learning algorithm to be successful is an aspect of central importance. In many applications, training data is expensive or difficult to obtain, and thus input-efficient learning algorithms are desirable. In computational learning theory, therefore, one way of measuring the complexity of a concept class is to determine the worst-case number of input examples required by the best valid learning algorithm. What constitutes a valid learning algorithm depends on the underlying model of learning. We refer to this kind of complexity measure as information complexity. For example, in PAC-learning [43], the information complexity of a concept class C is the worst-case sample complexity a best possible PAC learner for C can achieve on all concepts in C. In query learning [2], it is the best worst-case number of queries a learner would have to ask to identify an arbitrary concept in C. In the classical model of teaching [19; 41], the information complexity of C is given by its teaching dimension, i.e., the largest number of labeled examples that would have to be provided for distinguishing any concept in C from all other concepts in C.

Besides the practical need to limit the required amount of training data, there are a number of reasons for formally studying information complexity. Firstly, a theoretical study of information complexity yields formal guarantees concerning the amount of data that needs to be processed to solve a learning problem. Secondly, analyzing information complexity often helps to understand the structural properties of concept classes that are particularly hard to learn or particularly easy to learn. Thirdly, the theoretical study of information complexity helps to identify connections between various formal models of learning, for example if it turns out that, for a certain type of concept class, the information complexity under learning model A is in some relationship with the information complexity under model B. This third aspect is the main motivation of our study.
In the past two decades, several learning models were defined with the aim of understanding in which way a low information complexity can be achieved. One such model is learning from partial equivalence queries [33], which subsume all types of queries for which negative answers are witnessed by counterexamples, e.g., membership, equivalence, subset, superset, and disjointness queries [2]. As lower bounds on the information complexity in this query model (here called query complexity) hold for numerous other query learning models, they are particularly interesting objects of study. Even more powerful are self-directed learners [22]. Each query of a self-directed learner is a prediction of a label for an instance of the learner's choice, and the learner gets charged only for wrong predictions. The query complexity in this model lower-bounds the one obtained from partial equivalence queries [21].

Dual to the models of query learning, in which the learner actively chooses the instances it wants information on, the literature proposes models of teaching [19; 41], in which a helpful teacher selects a set of examples and presents it to the learner, again aiming at a low information complexity.
One of these attempts is the subset teaching dimension introduced by Zilles et al. [47], denoted by STD. Its key idea is to reduce teaching sets (uniquely identifying sample sets) to minimal subsets that are not contained in any teaching set of any other concept. Although such a subset is in general consistent with more than one concept, a smart learner can interpret this sophisticated choice in the desired manner. They present several classes for which the STD performs substantially better than the teaching dimension. However, besides lacking some desirable properties, there also exist classes for which the STD behaves counterintuitively.
Another recent model of teaching with low information complexity is recursive teaching, where a teacher chooses a sample based on a sequence of nested subclasses of the underlying concept class C [47]. The nesting is defined as follows. The outermost "layer" consists of all concepts in C that are easiest to teach, i.e., that have the smallest sets of examples distinguishing them from all other concepts in C. The next layers are formed by recursively repeating this process with the remaining concepts. The largest number of examples required for teaching at any layer is the recursive teaching dimension (RTD) of C. The RTD substantially reduces the information complexity bounds obtained in previous teaching models. It lower-bounds not only the teaching dimension (the measure of information complexity in the "classical" teaching model [19; 41]) but also the information complexity of iterated optimal teaching [4], which is often substantially smaller than the classical teaching dimension. The recursive teaching dimension will be the issue that dominates most of this thesis.
A combinatorial parameter of central importance in learning theory is the VC-dimension [44]. Among other relevant properties, it provides bounds on the sample complexity of PAC-learning [9]. Since the VC-dimension is the best-studied quantity related to information complexity in learning, it is a natural first parameter to compare with when it comes to identifying connections between information complexity notions across various models of learning. For example, even though the self-directed learning complexity can exceed the VC-dimension, existing results show some connection between these two complexity measures [21]. However, the teaching dimension, i.e., the information complexity of the classical teaching model, does not exhibit any general relationship to the VC-dimension; the two parameters can be arbitrarily far apart in either direction [19]. Similarly, there is no known connection between the teaching dimension and the query complexity.
In the context of concept learning, sample compression schemes are schemes for "encoding" a set of examples in a small subset of examples. For instance, from the set of examples they process, learning algorithms often extract a subset of particularly "significant" examples in order to represent their hypotheses. This way, sample bounds for PAC-learning of a concept class C can be obtained from the size of a smallest sample compression scheme for C [32; 16]. The size of a sample compression scheme is the size of the largest subset resulting from compressing any sample consistent with some concept in C. In what follows, we will use the term sample compression number of a concept class C to refer to the smallest possible size of a sample compression scheme for C.
Outline of this Thesis. In chapter 1 we summarize the results from the literature of the last 25 years which are relevant for us. Chapter 2 is concerned with the teaching dimension; besides summarizing the existing facts, we prove new results for shortest-path closed classes.

In chapter 3 we examine the recursive teaching dimension and highlight its remarkable properties. In particular, we show for many structured concept classes that the recursive teaching dimension is bounded by the VC-dimension. This includes both intersection-closed and maximum concept classes, two families of concept classes that are of great interest to the learning theory community due to their expressive power despite their simple definition. Moreover, we give the first link between teaching and sample compression known to date.

Influenced by these ideas, we introduce the notion of order compression schemes in chapter 4. We show that any total order over a finite concept class C induces a special type of sample compression scheme, called an order compression scheme. These schemes can serve as a first step in solving or disproving a long-open problem: the sample compression conjecture by Floyd and Warmuth [16]. It states that any concept class of VC-dimension d admits a sample compression scheme of size linear in d. And indeed, we prove the existence of order compression schemes of size equal to the VC-dimension for many well-known concept classes, e.g., maximum classes, Dudley classes, intersection-closed classes and classes of VC-dimension 1. Since order compression schemes are much simpler than general sample compression schemes, their study seems to be a promising step towards resolving the sample compression conjecture. We reveal a number of fundamental properties of order compression schemes which are helpful in such a study. In particular, order compression schemes exhibit interesting graph-theoretic properties as well as connections to the theory of learning from teachers.
1 A Brief Introduction to Machine Learning
1.1 Fundamental Notions in Learning Theory
Throughout this work, X denotes a finite set called the instance space and C denotes a concept class over the domain X. The number of instances is denoted by n and |C| = N. A single concept C ∈ C can be represented interchangeably in different ways: as a subset of X, as a Boolean mapping C: X → {0,1}, or as a value vector in {0,1}^|X|. Thus, a concept class can be viewed as a subset of 2^X, as a family of Boolean functions, or as a family of vectors, respectively. Depending on the context we will use the representation which serves our needs best. A sample set S w.r.t. C is a subset of X × {0,1} such that there exists a concept C ∈ C which fulfills C(x) = b for all (x,b) ∈ S, i.e., C is consistent with S. The set Cons(S,C) ⊆ C denotes the set of concepts consistent with S. For a given sample set S, we define X(S) to be the set of unlabeled instances included in S and, for X' ⊆ X, we write C|X' for the restriction of C to X'.

In many learning-theoretic frameworks the learner has to identify a special concept C* ∈ C, either exactly or up to a certain error on X. This special concept is called the target concept and is denoted by C*. Moreover, an expert or oracle often provides training examples to the learner. The learner's current guess for the target is called a hypothesis, and the set of all possible hypotheses consistent with the examples seen so far is called the version space.
Figure 1.1: The PAC model. An oracle with access to the concept class passes a sample labeled according to C*, drawn from an unknown distribution D, to the learner, who outputs a hypothesis. The error of the resulting hypothesis is measured according to the distribution D.
1.2 The Probably Approximately Correct Learning Model
The PAC model was defined in the groundbreaking paper of Valiant [43] and is the basis for many of the subsequent studies in machine learning theory. It reflects the general idea of learning: a learner has to learn some target concept and is provided with random examples by an expert or oracle, while the underlying distribution is not known to the learner. According to the presented examples the learner needs to build a hypothesis which reflects his knowledge about the target. But instead of learning the target perfectly, the learner is allowed to make some mistakes, just like a human being. Thus, he is allowed to fail in classification on a small proportion of the instance space, say ε, weighted according to the unknown distribution. And since the learner has no knowledge about the distribution and thus about the presented examples, he may fail entirely with a small probability, say δ. This is a necessary concession, since the sample drawn according to the distribution could be highly uninformative. Based on these simple and natural assumptions we can define the PAC model.
Definition 1.1 (Probably approximately correct learning). A concept class C is efficiently PAC-learnable if there exists an algorithm A which, for all concepts C* ∈ C, all distributions D over X, and any 0 < ε < 1/2 and 0 < δ < 1/2, produces a hypothesis H ∈ C such that with probability at least 1 − δ it holds that err_D(H) ≤ ε, when given access to m labeled samples S consistent with C* and drawn according to D. Additionally,
• A runs in polynomial time in n and m,
• m ∈ O(poly(1/ε, 1/δ, size(C*))).
Here, err_D(H) = ∑_{x ∈ H △ C*} D(x), with △ being the symmetric difference, is the true error of the hypothesis weighted according to D. The number of examples needed for successful learning is called the sample complexity, which is the main quantity of interest in most cases.

In a slight alteration of the above model, the hypothesis is picked from another class H instead of C itself. This way, there exist classes for which the learning task becomes easier. This type of learning is called improper PAC learning, as opposed to proper PAC learning where C = H. The idea of proper and improper learning will become important in a different setting in chapter 4.

Although the PAC model is not the object of this work, it is one of the most important models, not least because of its strong connection to the VC-dimension and to sample compression schemes, both of which we will define shortly. For further information about the PAC model we refer the interested reader to the book of Kearns and Vazirani [26].
1.3 The Vapnik-Chervonenkis-Dimension
The VC-dimension was discovered independently by several authors [40; 39; 44] and has proven to be a very useful tool in machine learning theory, combinatorics and probability theory. The main idea behind this complexity notion is not to measure the complexity of a class C simply by its size but rather by its richness. By richness we mean the ability to fit a given labeling on a set of instances.
Definition 1.2. Let C be a concept class over an instance space X. For any set S = {x_1, ..., x_m} ⊆ X let

    Π_C(S) := {C ∩ S | C ∈ C}

denote the set of all labelings or dichotomies of C on S. Alternatively, one can think of the set of all vectors produced by C under the projection onto S = {x_1, ..., x_m}, i.e.,

    Π_C(S) = {(C(x_1), ..., C(x_m)) | C ∈ C}.

If |Π_C(S)| = 2^|S|, then S is said to be shattered by C. The Vapnik-Chervonenkis-dimension of C, denoted by VCdim(C), is defined as the size of the largest set S shattered by C. The maximum number of realizable patterns with m instances is denoted by

    Π_C(m) = max{|Π_C(S)| : |S| = m, S ⊆ X}.

Figure 1.2: a) and b) show two configurations that can be realized by axis-aligned rectangles; c) is an example of an unrealizable case with 5 sample points.
Example 1.3. Consider the class of axis-aligned rectangles in the Euclidean plane. Any labeling of the 4 points in Figure 1.2 a) and b) can be realized. While these subfigures show two of the 16 different realizations for this specific constellation of 4 points, c) shows an unrealizable labeling of 5 points. Indeed, there always exists such an impossible labeling for any set of 5 points: pick the left-most, right-most, top and bottom points and label them positively; this yields a set of at most 4 points. The remaining point is labeled negatively, and any axis-aligned rectangle containing all positive points must also contain it. As such, the VC-dimension of axis-aligned rectangles is 4.
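To make these notions concrete for finite classes, the following small Python sketch (written in the spirit of the toolbox in Appendix A, though not taken from it; all names are illustrative) computes the VC-dimension of a finite concept class, given as a collection of 0/1-tuples, by brute force.

    from itertools import combinations, product

    def shattered(concepts, S):
        """Check whether the instance set S (given by indices) is shattered."""
        patterns = {tuple(c[i] for i in S) for c in concepts}
        return len(patterns) == 2 ** len(S)

    def vc_dimension(concepts):
        """Brute-force VC-dimension of a finite class of 0/1-tuples."""
        n = len(next(iter(concepts)))
        d = 0
        for k in range(1, n + 1):
            if any(shattered(concepts, S) for S in combinations(range(n), k)):
                d = k
            else:
                break  # a superset of an unshattered set is never shattered
        return d

    # Example: all concepts with at most two positive examples over 4 instances.
    C = [c for c in product((0, 1), repeat=4) if sum(c) <= 2]
    print(vc_dimension(C))  # prints 2

The early exit is justified because every subset of a shattered set is shattered, so the largest shattered size is found by increasing k until no set of size k is shattered.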
A first fundamental result is that a finite VC-dimension implies an upper bound on the size of a concept class. There exist numerous different proofs of the following lemma. One beautiful proof is given by Smolensky [42]. We will pick up and slightly modify his strategy later in chapter 3 to prove similar results for a different complexity notion.
Lemma 1.4 (Sauer's Lemma, [39]). Let C ⊆ 2^X be an arbitrary concept class with VCdim(C) = d. Then for any m it holds that

    Π_C(m) ≤ ∑_{k=0}^{d} (m choose k).

Throughout this work, we are only concerned with finite concept classes. In this special case it follows that |C| ≤ ∑_{k=0}^{d} (n choose k) with |X| = n.
A very famous result given by Blumer et al. [9] is the following upper sample bound for successful learning in the PAC model, which is independent of the size of the concept class.

Theorem 1.5 ([9]). Let C be an arbitrary concept class with VCdim(C) = d. Any consistent learner will output a hypothesis H ∈ C with err_D(H) ≤ ε with probability at least 1 − δ, if it is provided with

    m ≥ max( (4/ε) log(2/δ), (8d/ε) log(13/ε) )

many samples.
1.4 Special Classes
Mostly, machine learning theorists do not deal simply with an arbitrary subset of {0,1}^n. Usually they make more or less strong assumptions on the structure of the concept class. Widely known, also outside the community of computational learning theory, are classes of (monotone) monomials over Boolean variables, Boolean formulas in disjunctive (resp. conjunctive) normal form, often abbreviated as DNF (resp. CNF), and their variants like m-term DNF or k-DNF. But there also exist special types of classes which might not be known to the reader. Therefore, we want to introduce some of the most important ones, first and foremost maximum and maximal classes. On the one hand, researchers have proven the strongest and most significant results for maximum classes. On the other hand, maximal classes are, in some sense, the evil twin of maximum classes: only a few vital results for maximal classes are known to date.
1.4.1 Maximum and Maximal Classes
A finite class C over X with |X| = n is called a maximum class if |C| = ∑_{k=0}^{d} (n choose k) with d = VCdim(C), i.e., it meets the Sauer bound from Lemma 1.4. A class is called a maximal class if for any proper superset C' ⊋ C it holds that VCdim(C') > VCdim(C). Note that any finite maximum class is also maximal, but the converse is not necessarily true (see [45; 17]).

Essentially, maximum classes are unions of hypercubes of dimension VCdim(C) (see Floyd and Warmuth [16]) and so they are well-structured. Many examples of maximum classes have a nice construction or graphical representation.
Example 1.6 (At most two positive examples, [15]). As an example, consider the maximum class C of VC-dimension two over X that consists of all concepts with at most two positive examples. Then, for {x_1, x_2} ⊆ X, c_{x_1,x_2} denotes the concept on X \ {x_1, x_2} where every example is a negative example. This is the only concept on X \ {x_1, x_2} that remains a concept in C if both x_1 and x_2 are positive examples.
One important property of maximum classes of VC-dimension d is that every set of instances of size at most d is shattered. Note that this is only a necessary and not a sufficient condition. This can be seen by removing the empty concept from the example stated above: all sets of size at most d are still shattered, but the class is obviously not maximum anymore if |X| ≥ d + 1.
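For small finite classes the maximum property can be tested mechanically by comparing |C| with the Sauer bound. The following Python sketch (a hypothetical helper, not part of the toolbox in Appendix A) does exactly that and reproduces the observation above about removing the empty concept.

    from itertools import combinations, product
    from math import comb

    def vc_dimension(concepts):
        """Brute-force VC-dimension of a finite class of 0/1-tuples."""
        n = len(next(iter(concepts)))
        def shattered(S):
            return len({tuple(c[i] for i in S) for c in concepts}) == 2 ** len(S)
        d = 0
        for k in range(1, n + 1):
            if any(shattered(S) for S in combinations(range(n), k)):
                d = k
            else:
                break
        return d

    def is_maximum(concepts):
        """Does |C| meet the Sauer bound sum_{k<=d} (n choose k)?"""
        n, d = len(next(iter(concepts))), vc_dimension(concepts)
        return len(set(concepts)) == sum(comb(n, k) for k in range(d + 1))

    # "At most two positive examples" over 4 instances (Example 1.6) is maximum;
    # removing the all-negative concept keeps all small sets shattered but is not.
    C = [c for c in product((0, 1), repeat=4) if sum(c) <= 2]
    print(is_maximum(C), is_maximum([c for c in C if sum(c) > 0]))  # True False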
There exists a large variety of examples of maximum classes, and most of them can be described or defined in a natural way.
Example 1.7 (Intervals on a line, [15]). Let C_n be the class containing all unions of at most n positively labeled intervals on a line. This class is maximum of VC-dimension 2n. This follows because, for a finite set of m points on the line with m ≥ 2n, there are ∑_{i=0}^{2n} (m choose i) ways to label those m points consistently with at most n positive intervals on a line.
The well-studied class of halfspaces is not a maximum class itself, but fortunately it is a composition of such classes.

Example 1.8 ([15], [16]). For any h = (h_1, ..., h_{n+1}) ∈ ℝ^{n+1} and x ∈ ℝ^n let

    f_h(x) = 1 if ∑_{i=1}^{n} h_i x_i + h_{n+1} ≥ 0, and f_h(x) = 0 otherwise.

Then HS_n := {f_h : h ∈ ℝ^{n+1}} is the set of halfspaces in ℝ^n. This class can be partitioned into two maximum classes as follows: let PHS_n := {f_h : h ∈ ℝ^{n+1}, h_1 > 0} and NHS_n := {f_h : h ∈ ℝ^{n+1}, h_1 < 0} be the sets of positive and negative halfspaces, respectively. Then any restriction of one of these to a set of finitely many points in general position leads to a maximum class of VC-dimension n.
As already indicated above, maximal classes do not have such nice properties besides their inclusion-maximality w.r.t. the VC-dimension. To the best of our knowledge, there are also no natural examples of maximal classes as in the maximum case. Nevertheless, maximal classes are of special interest, since any class can be embedded into a maximal class of exactly the same VC-dimension. Thus, results concerning maximal classes could often be transferred easily to arbitrary classes, but the theory lacks such significant results. Therefore, the embeddability of arbitrary classes into maximum classes with at most linear growth of the VC-dimension gains in importance, but this is still an unsolved problem. Recently, Rubinstein and Rubinstein [37] showed that a special type of maximum class is not sufficient to pass its properties on to arbitrary classes.
1.4.2 Dudley Classes
Definition 1.9 (Dudley [13]). Let F be a vector space of real-valued functions over some domain X and let h: X → ℝ. For every f ∈ F let

    C_f(x) := 1 if f(x) + h(x) ≥ 0, and C_f(x) := 0 otherwise.

Then D_{F,h} = {C_f : f ∈ F} is called a Dudley class. The dimension of D_{F,h} is equal to the dimension of the vector space F.
The definition of Dudley classes seems to be rather technical, but actually there are some popular and well-known Dudley classes, e.g.,
• collections of halfspaces over ℝ^n, which are very common objects of study in machine learning, such as in artificial neural networks and support vector machines, see, e.g., [1],
• unions of at most k intervals over ℝ,
• n-dimensional balls.
Beside this, Dudley classes have a strong connection to maximum classes.

Lemma 1.10 (Ben-David and Litman [6]). Dudley classes of dimension d are embeddable in maximum classes of VC-dimension d.

Thus, many results for maximum classes can be easily transferred to Dudley classes and their notable members.
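As an illustration of Definition 1.9, the following Python sketch samples functions from a two-dimensional vector space F of linear functions over a handful of points in the plane (the points and the constant offset h are made up for illustration) and checks by brute force that the resulting subclass of D_{F,h} has VC-dimension at most dim(F) = 2, in line with Lemma 1.10.

    import itertools
    import random

    # Finite domain: a few points in the plane (hypothetical example data).
    X = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (-1.0, 0.5), (0.5, -1.0)]

    def dudley_concept(w, h_offset=-1.0):
        """C_f with f(x) = w.x from the 2-dimensional space F and h = const."""
        return tuple(1 if w[0]*x[0] + w[1]*x[1] + h_offset >= 0 else 0 for x in X)

    # Sample many f in F; the distinct label vectors form a subclass of D_{F,h}|X.
    random.seed(0)
    concepts = {dudley_concept((random.uniform(-3, 3), random.uniform(-3, 3)))
                for _ in range(20000)}

    def shattered(concepts, S):
        return len({tuple(c[i] for i in S) for c in concepts}) == 2 ** len(S)

    def vc_dimension(concepts):
        n = len(next(iter(concepts)))
        d = 0
        for k in range(1, n + 1):
            if any(shattered(concepts, S) for S in itertools.combinations(range(n), k)):
                d = k
            else:
                break
        return d

    print(vc_dimension(concepts))  # at most 2 = dim(F)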
1.4.3 Intersection-closed Classes
A concept class C is called intersection-closed if for every pair C, C' ∈ C also C ∩ C' ∈ C. Among the standard examples of intersection-closed classes are the d-dimensional boxes over [n]^d:

    BOX_n^d := { [a_1, b_1] × ... × [a_d, b_d] : for all i = 1, ..., d: 1 ≤ a_i, b_i ≤ n }.

Here, [a, b] is an abbreviation for {a, a+1, ..., b}, where [a, b] = ∅ if a > b. As a result of this, monomials are also intersection-closed when viewed as orthogonal subrectangles of the Boolean hypercube.
For T ⊆ X, we define ⟨T⟩_C as the smallest concept in C containing T, i.e.,

    ⟨T⟩_C := ∩_{C ∈ C, T ⊆ C} C.

A spanning set for T ⊆ X w.r.t. C is a set S ⊆ T such that ⟨S⟩_C = ⟨T⟩_C. S is called a minimal spanning set w.r.t. C if, for every proper subset S' of S, ⟨S'⟩_C ≠ ⟨S⟩_C. I(C) denotes the size of the largest minimal spanning set w.r.t. C.
Theorem 1.11 ([34], [24]). Let C be an intersection-closed concept class. Then every minimal spanning set w.r.t. C is shattered and therefore

    I(C) = VCdim(C).

Note that, for every C ∈ C ⊆ 2^X, I(C|C) ≤ I(C), because each spanning set for a set T ⊆ C w.r.t. C is also a spanning set for T w.r.t. C|C. Spanning sets will be useful in proving upper bounds for the complexity notions we introduce in later chapters.
Natarajan [34] gives an efficient algorithm with one-sided error for intersection-closed classes, called the Closure algorithm. One-sided error means that the hypothesis h produced by the learner makes mistakes, up to a certain amount, only on M = {x ∈ X : C*(x) = 1}. In short, this is because the Closure algorithm starts with the empty concept, ignores negative examples, and learns only from positive ones, as sketched below.
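For a finite class given explicitly as a family of sets, the closure operator ⟨·⟩_C and the Closure algorithm can be sketched in a few lines of Python (illustrative code with made-up names, not Natarajan's original implementation).

    def hull(T, concepts):
        """<T>_C: intersection of all concepts in C (frozensets) containing T."""
        containing = [C for C in concepts if T <= C]
        if not containing:
            return None  # T is not contained in any concept of C
        result = set(containing[0])
        for C in containing[1:]:
            result &= C
        return frozenset(result)

    def closure_algorithm(sample, concepts):
        """Closure algorithm: output the smallest concept containing all
        positive examples; negative examples are ignored."""
        positives = {x for (x, label) in sample if label == 1}
        return hull(positives, concepts)

    # Example class: intervals [a,b] within {1,2,3}, plus the empty concept.
    X = [1, 2, 3]
    concepts = {frozenset(range(a, b + 1)) for a in X for b in X if a <= b} | {frozenset()}

    # Two positive examples at 1 and 3 already force the hypothesis [1,3].
    sample = [(1, 1), (3, 1)]
    print(sorted(closure_algorithm(sample, concepts)))  # [1, 2, 3]

Note the one-sided error: the hypothesis can only miss positive points of the target that are not implied by the positive examples seen so far; it never labels a point positively that the target labels negatively.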
Moreover, Natarajan [34] has shown that a class is learnable with one-sided error if and only if it is intersection-closed and the VC-dimension of the class grows polynomially (in the relevant parameters), making his Closure algorithm a valuable tool in this case. Any class C that is not intersection-closed can be embedded into a class C' that is intersection-closed. However, the VC-dimension can grow drastically in doing so, making the Closure algorithm inefficient in the initial parameters.
1.4.4 Nested Differences
The class of nested differences of depth d (at most d) with concepts from C, denoted DIFF_d(C) (DIFF_{≤d}(C), resp.), is defined inductively as follows:

    DIFF_1(C) := C,
    DIFF_d(C) := { C \ D : C ∈ C, D ∈ DIFF_{d−1}(C) },
    DIFF_{≤d}(C) := ∪_{i=1}^{d} DIFF_i(C).
Expanding the recursive definition of DIFF_d(C) shows that, e.g., a set in DIFF_4(C) has the form C_1 \ (C_2 \ (C_3 \ C_4)) where C_1, C_2, C_3, C_4 ∈ C. Figure 1.3 shows some constructions based on axis-parallel rectangles.

Figure 1.3: a) and b) illustrate the nested differences C_2 \ C_1 and C_3 \ (C_2 \ C_1), respectively; c) shows a more complex scenario.
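For finite classes the nested differences can be enumerated directly. The following Python sketch (illustrative only) builds DIFF_d(C) for the class of intervals over {1, ..., 4} and shows that DIFF_2(C) contains concepts that are not in C itself.

    def nested_differences(concepts, depth):
        """DIFF_d(C) for a finite class of frozensets, built by the recursion
        DIFF_1(C) = C and DIFF_d(C) = {A \ B : A in C, B in DIFF_{d-1}(C)}."""
        level = set(concepts)
        for _ in range(depth - 1):
            level = {A - B for A in concepts for B in level}
        return level

    # Intervals over {1,...,4}; DIFF_2 contains e.g. [1,4] \ [2,3] = {1,4},
    # which is not an interval and hence not a member of C.
    X = range(1, 5)
    C = {frozenset(range(a, b + 1)) for a in X for b in X if a <= b}
    D2 = nested_differences(C, 2)
    print(frozenset({1, 4}) in C, frozenset({1, 4}) in D2)  # False True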
Concepts in DIFF_2(C) have been shown to be learnable at an early stage of research on computational learning theory by Kearns et al. [27]. Regarding arbitrary depth, Helmbold et al. [24] considered nested differences of intersection-closed classes. They developed several algorithms, which they themselves call inclusion/exclusion algorithms, each with different properties. The Total Recall algorithm, a batch algorithm, remembers all examples seen during the learning of DIFF_d(C). The Space Efficient algorithm is an online algorithm and stores only VCdim-many examples during execution. Both use the Closure algorithm iteratively as a subroutine. A central result is the following:
Lemma 1.12 ([24]). If C is intersection-closed and d ≥ 1, then it holds that

    VCdim(DIFF_d(C)) ≤ VCdim(DIFF_{≤d}(C)) ≤ d · VCdim(C).
Further, Helmbold et al. [24] observed: if both C_1 and C_2 are intersection-closed, then C_1 ∧ C_2 := {C_1 ∩ C_2 : C_i ∈ C_i} is intersection-closed as well. The same result holds for the intersection C_1 ∩ C_2 := {C : C ∈ C_1 and C ∈ C_2}. Substituting intersection-closed by union-closed leads to dual results for unions of classes.

For the next result we need the notion of the universal concept, also coined by Helmbold et al. [24], which is simply the concept equal to the full domain. Assuming that the universal concept exists in a class is only a minor restriction, since adding it to a concept class does not destroy the property of being intersection-closed.

Now, based upon these properties and the use of minimal spanning sets, they proved several relationships between unions and intersections of classes and nested differences thereof.
Lemma 1.13 ([24]). Let C_1, ..., C_r be arbitrary concept classes, each including the universal concept. Then
• ∪_{j=1}^{r} C_j is a subclass of ∧_{j=1}^{r} C_j,
• VCdim(∪_{j=1}^{r} C_j) ≤ VCdim(∧_{j=1}^{r} C_j) ≤ ∑_{j=1}^{r} VCdim(C_j),
• DIFF_d(∪_{j=1}^{r} C_j) is a subclass of DIFF_d(∧_{j=1}^{r} C_j),
• VCdim(DIFF_d(∪_{j=1}^{r} C_j)) ≤ VCdim(DIFF_d(∧_{j=1}^{r} C_j)) ≤ d · ∑_{j=1}^{r} VCdim(C_j).
This way they were able to extend their efficient algorithms to DIFF_d(∪_{j=1}^{r} C_j). We will reuse some of their techniques later in chapter 3.
1.5 Online-Learning Model
The online-learning model was introduced by Littlestone [31] and further examined in [22; 18; 7]. Here, instead of learning from random labeled examples, the learner is faced with a sequence of instances presented one by one. The task is to predict the label of the current instance, and after seeing the correct label the learner can adjust his hypothesis. A natural measure of complexity in this model is the number of classification mistakes a learner or algorithm makes during the whole process.
Definition 1.14 ([31]). For any online-learning algorithm A and any target concept C, let M_A(C) be the maximum number of mistakes over all possible sequences of instances. For any A and any concept class C we define M_A(C) = max_{C ∈ C} M_A(C) and call any such number a mistake bound. The optimal mistake bound M_opt(C) for a concept class C is the minimum of M_A(C) over all possible learning algorithms A.
The halving algorithm, given by Littlestone [31], is a simple strategy that can be applied to any finite concept class C. Given an instance x ∈ X, the algorithm compares the cardinalities |Cons({(x,0)}, C)| and |Cons({(x,1)}, C)|. Then the algorithm predicts for x the label of the larger version space and, after receiving the actual label, proceeds with the remaining concept class C' which corresponds to one of the version spaces above. Therefore, in case of a mistake at least half of the concept class is discarded, which coined the name halving algorithm. These observations immediately imply the following theorem.
Theorem 1.15 ([31]). For any finite concept class C, M_Halving(C) ≤ log |C|.
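A minimal Python sketch of the halving strategy (illustrative only, not an implementation from the literature) is given below; it runs the learner on a small class and confirms that the number of mistakes stays below log |C|. The class of singletons plus the empty concept used here reappears in Figure 2.2.

    from math import log2

    def halving_learner(concepts, instance_sequence, target):
        """Halving algorithm: predict with the majority of the current version
        space; on a mistake, at least half of the concepts are discarded."""
        version_space = set(concepts)
        mistakes = 0
        for x in instance_sequence:
            votes_one = sum(1 for c in version_space if x in c)
            prediction = 1 if votes_one >= len(version_space) - votes_one else 0
            truth = 1 if x in target else 0
            if prediction != truth:
                mistakes += 1
            version_space = {c for c in version_space if (x in c) == (x in target)}
        return mistakes

    # Singletons plus the empty concept over 8 instances.
    X = range(8)
    concepts = [frozenset({i}) for i in X] + [frozenset()]
    m = halving_learner(concepts, list(X), target=frozenset({5}))
    print(m, "<=", log2(len(concepts)))  # the mistake count respects log |C|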
But many classes are known to have better, polynomially sized mistake-bound algorithms, e.g., disjunctions, conjunctions, k-CNF and k-term DNF. A second algorithm for finite classes given by Littlestone [31] is the standard optimal algorithm. It not only performs better for special classes compared to the halving algorithm; it is also proven to be optimal, i.e., it meets the optimal mistake bound.
Theorem 1.16 ([31]). For any concept class C, VCdim(C) ≤ M_opt(C).
The proof of Theorem 1.16 is rather obvious, since on a sequence starting with a shattered set of size d an adversary can force at least that many mistakes. Summing up, the complexity notions introduced so far relate as follows.
Lemma 1.17 ([31]). For arbitrary concept classes C it holds that

    VCdim(C) ≤ M_opt(C) ≤ M_Halving(C) ≤ log(|C|).
At last, a well-known fact relates the PAC model and the online-learning model.

Theorem 1.18 ([31]). If some algorithm A learns a concept class C in the online-learning model, then A can be used to learn in the PAC model. Moreover, the deduced PAC algorithm learns efficiently if it is provided with

    (M_A(C)/ε) · ln(M_A(C)/δ)

many examples.
In contrast to that, Blum [8] has shown that there exist classes that are efficiently learnable in the Valiant model but not in the online-learning model. Non-efficiency in the latter model means that the learner does not run in polynomial time.
1.6 Partial Equivalence Query Model
In their work, Maass and Turán [33] analyzed some variants of the PAC and online learning models. They studied the relation of previously introduced types of queries, among others partial equivalence queries. In the partial equivalence query model the learner can formulate a so-called partial hypothesis h ∈ {0,1,*}^|X|. The "*" can be interpreted as "don't care". The oracle compares the predicted labels with the real labels of the target but neglects the instances marked with *. If the hypothesis is consistent with all other instances, it returns "yes" (or "true"); otherwise it provides a counterexample.
The learning complexity LC(A) of a specific algorithm A in the partial equivalence query model is defined as the worst-case number of queries A asks before identifying C* exactly. Accordingly, the learning complexity of a concept class, denoted by LC-PARTIAL(C), is the minimum of LC(A) over all possible algorithms A.
There are only a few things we need to know about LC-PARTIAL since we will not make much use of it. One remarkable property is its relation to the optimal mistake bound.

Theorem 1.19 ([33]). Let C be an arbitrary concept class. Then

    LC-PARTIAL(C) ≤ M_opt(C).
Unfortunately, LC-PARTIAL is incomparable to the VC-dimension, which was also proven by Maass and Turán [33]. Up to this point, we have

    LC-PARTIAL(C) ≤ M_opt(C) ≤ log(|C|)

by utilizing Lemma 1.17.
1.7 Self-directed Learning
The self-directed learning model was introduced by Goldman et al. [22] and later studied in [21; 5; 28]. It is very similar to the online-learning model, in which a sequence of instances has to be labeled by the learner and he is charged for every mistake he makes. But instead of an adversary presenting the instances in a possibly malicious order, in each trial the learner chooses the instance by himself. This selection can be based on all the data the learner has seen so far and has to be done in polynomial time. After passing the instance and his prediction to the oracle, the latter returns the true label C*(x) to the learner.

Definition 1.20 ([22]). The self-directed learning complexity of a concept class, denoted by SDC(C), is defined as the smallest number q such that there is some self-directed learning algorithm which can exactly identify any concept C ∈ C without making more than q mistakes.
Figure 1.4: The self-directed learning model. The learner, who has access to the concept class, sends an instance x ∈ X together with a prediction b to the oracle, which returns the true label C*(x).
The SDC is known for many common classes, e.g., the self-directed learning complexity of
• classes of VC-dimension 1 is 1,
• monotone monomials over n variables is 1,
• monomials over n variables is 2,
• BOX_n^d is 2,
• m-term monotone DNF is smaller than m.
The first result is due to Kuhlmann [28]; all others are given by Goldman and Sloan [21].
It is easy to see that the VC-dimension can be much larger than the SDC. Ben-David and Eiron [5] showed that for any n and d there exists a concept class C_n^d with SDC(C_n^d) arbitrarily larger than VCdim(C_n^d). Thus, there is no general relation between the VC-dimension and the self-directed learning complexity.

Goldman and Sloan [21] conjectured that the self-directed learning complexity is bounded by the VC-dimension at least in the case of intersection-closed classes. But this was disproved first by Ben-David and Eiron [5], who constructed a concept class with (3/2) · VCdim(C) = SDC(C). Later, Kuhlmann [28] found a class with an arbitrarily large gap between the two notions.
Finally, the self-directed learning complexity fits nicely into the chain of inequalities between the notions presented so far.

Theorem 1.21 ([21; 33; 31]).

    SDC(C) ≤ LC-PARTIAL(C) ≤ M_opt(C) ≤ log(|C|).
1.8 Sample Compression and its Relation to PAC Learning
Another approach to learning concepts from a concept class is the notion of sample compression schemes. These are often used implicitly in generic algorithms, for example in learning axis-aligned rectangles. Given any sample set with both positive and negative examples, the straightforward strategy for this class is to pick the left-most, right-most, top and bottom positive points and to build the minimal-size rectangle including those. This way the whole sample set is reduced to a minimal subset which allows reconstructing the label of any given unlabeled point from the original sample set. This is the fundamental idea behind sample compression.
Definition 1.22 (sample compression scheme [32]). A sample compression scheme of size k consists of a compression function f and a reconstruction function g. The compression function f maps every labeled sample S to a subset of S of size at most k, called the compression set. The reconstruction function g maps the compression set to a hypothesis h ⊆ X. Furthermore, these must fulfill

    for all (x, l) ∈ S: g(f(S))(x) = l,

i.e., h must be consistent with the sample set.
Example 1.23. Unions of n intervals can be compressed and decompressed as follows. Sweep the line from left to right. Save the first upcoming positive example and the first negative example following it. This pair represents the first interval. Simply proceed in the same way for the remaining n − 1 intervals. The label of a given unlabeled point x corresponds to the label of its left neighboring point in the compression set. Figure 1.5 a) shows a union of two intervals; the compression set is highlighted in bold.
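A minimal Python sketch of this scheme, using the labels of Figure 1.5 a) with made-up positions and illustrative function names, could look as follows.

    def compress_intervals(sample):
        """Sweep left to right, keep each first positive point and the first
        negative point following it. `sample` is a list of (position, label)."""
        ordered = sorted(sample)
        kept, inside = [], False
        for x, label in ordered:
            if label == 1 and not inside:
                kept.append((x, 1)); inside = True
            elif label == 0 and inside:
                kept.append((x, 0)); inside = False
        return kept

    def reconstruct(compression_set, x):
        """Label of x = label of its left neighbor in the compression set
        (0 if there is no such neighbor)."""
        label = 0
        for y, l in compression_set:
            if y <= x:
                label = l
            else:
                break
        return label

    # Two intervals with the labels of Figure 1.5 a).
    sample = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 0),
              (6, 1), (7, 1), (8, 1), (9, 0), (10, 0)]
    kept = compress_intervals(sample)
    print(kept)  # [(3, 1), (5, 0), (6, 1), (9, 0)], i.e. size 2n for n = 2
    print(all(reconstruct(kept, x) == l for x, l in sample))  # True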
Figure 1.5: a) A union of two intervals over X = {x_1, ..., x_10}, labeled 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, with the compression set highlighted in bold; b) a positive halfspace representing the labeling of X = {x_1, ..., x_8}.
Example 1.24. Consider a finite domain X ⊆ ℝ^2 whose members are in general position (i.e., any subset S ⊆ X contains at most two collinear points) and the class C of positive halfspaces over X. Then any sample set S labeled according to some concept from C can be compressed to a pair of points, one labeled positive and one labeled negative. Afterwards, any unlabeled point from S can be labeled correctly by its relative position to the induced hyperplane. Figure 1.5 b) illustrates this for X = {x_1, ..., x_8}.
Littlestone and Warmuth [32] have shown that a sample compression scheme can be used for learning. Given a sample, an algorithm or a learner can use a sample compression scheme built by himself for constructing a hypothesis g(f(S)), as indicated by the examples above. They have also shown that the existence of such a scheme is sufficient for learnability and proved an analogon to Theorem 1.5. Later, Floyd and Warmuth [16] proved a slightly better upper bound on the sample size, as stated below.

Theorem 1.25 ([16]). Let C ⊆ 2^X be any concept class with a sample compression scheme of size at most k. Then for 0 < ε, δ < 1, the learning algorithm using this scheme learns C with sample size

    m ≥ (1/(1 − β)) · ( (1/ε) ln(1/δ) + k + (k/ε) ln(1/ε) )

for any 0 < β < 1.
The sample size can further be optimized in the choice of β, but only with marginal effects. Obviously, the bound is asymptotically equivalent to the one given in Theorem 1.5. However, using sample compression schemes can result in better upper bounds for certain concept classes, as stated by Floyd and Warmuth [16], e.g., halfspaces in the plane with the instances in X in general position. These have VC-dimension equal to 3, but there exist sample compression schemes of size 2.
For arbitrary concept classes, Floyd and Warmuth [16] gave the One-Pass algorithm, which utilizes a mistake-driven online algorithm P to construct a sample compression scheme whose size is bounded by M_opt(C). The main idea of the algorithm is to compress to the mistakes the online learner makes. Thus, together with the default ordering, the labels of the original sample can be restored. Due to Theorem 1.15, this strategy leads to the following upper bound for arbitrary finite classes.
Theorem 1.26 ([16]). Let C ⊆ 2^X be any finite concept class. Then the One-Pass compression scheme using the halving algorithm gives a sample compression scheme of size at most log |C|.

For well-structured classes this bound can be undercut dramatically, especially for maximum classes.
Theorem 1.27 ([16]). Let C ⊆ 2^X be a maximum concept class of VC-dimension d on a finite domain X with |X| = n ≥ d. Then for each concept C ∈ C, there is a compression set A ⊆ X × {0,1} of exactly d elements such that C is the only consistent concept.
More recently, Kuzmin and Warmuth [30] introduced the notion of unlabeled compression schemes.

Definition 1.28. An unlabeled compression scheme for a maximum class of VC-dimension d is given by an injective mapping r that assigns to every concept C a set r(C) ⊆ X of size at most d such that the following condition is satisfied:

    for all C, C' ∈ C with C ≠ C', there exists x ∈ r(C) ∪ r(C') such that C(x) ≠ C'(x).   (1.1)

(1.1) is referred to as the non-clashing property. In order to ease notation, we add the following technical definition: a representation mapping of order k for a (not necessarily maximum) class C is any injective mapping r that assigns to every concept C a set r(C) ⊆ X of size at most k such that (1.1) holds.
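The non-clashing property (1.1) is easy to check mechanically for small classes. The following Python sketch (hypothetical helper code, not from [30]) verifies that a candidate mapping r is a representation mapping of a given order.

    from itertools import combinations

    def is_representation_mapping(r, concepts, order):
        """Check injectivity, the size bound, and the non-clashing property (1.1)
        for r: concept -> frozenset of instance indices. Concepts are 0/1-tuples."""
        if len({r[c] for c in concepts}) != len(concepts):
            return False                  # r must be injective
        if any(len(r[c]) > order for c in concepts):
            return False                  # sets of size at most `order`
        for c1, c2 in combinations(concepts, 2):
            if not any(c1[x] != c2[x] for x in r[c1] | r[c2]):
                return False              # c1 and c2 clash
        return True

    # Toy maximum class of VC-dimension 1 over 2 instances: all concepts but (1,1).
    concepts = [(0, 0), (0, 1), (1, 0)]
    r = {(0, 0): frozenset(), (0, 1): frozenset({1}), (1, 0): frozenset({0})}
    print(is_representation_mapping(r, concepts, order=1))  # True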
Recursive Tail Matching Algorithm match
Input: a maximum class C
Output: a representation mapping r for C

if VCdim(C) = 0 (i.e. C = {C} consists of a single concept) then
    r(C) := ∅
else
    pick x ∈ dom(C) randomly and set r := match(C^x)
    extend that mapping to 0C^x ∪ 1C^x:
        for all C ∈ C^x: r(C ∪ {x = 0}) := r(C) and r(C ∪ {x = 1}) := r(C) ∪ {x}
    extend r to tail_x(C)
end if
return r

Figure 1.6: The recursive tail matching algorithm for constructing an unlabeled compression scheme for a maximum class.
A representation mapping r is said to have the acyclic non-clashing property if there is an ordering C_1, ..., C_N of the concepts in C such that

    for all 1 ≤ i < j ≤ N, there exists x ∈ r(C_i) such that C_i(x) ≠ C_j(x).   (1.2)
Considering maximum classes, it was shown in [30] that a representation mapping with the non-clashing property guarantees that, for every sample S labeled according to a concept from C, there is exactly one concept C ∈ C that is consistent with S and satisfies r(C) ⊆ X(S). This allows one to encode (compress) a labeled sample S by r(C) and, since r is injective, to decode (decompress) r(C) by C (so that the labels in S can be reconstructed). This coined the term "unlabeled compression scheme".
An inspection of the work by Kuzmin and Warmuth [30] reveals that the unlabeled compression scheme obtained by the tail matching algorithm has the acyclic non-clashing property. The main ingredient of the algorithm is the disjoint partition of C into three parts:

    C = 0C^x ∪̇ 1C^x ∪̇ tail_x(C).

The term C^x is called the reduction of C w.r.t. x. It consists of all concepts of C|(X \ {x}) for which both extensions in x exist in C. The tail of a concept class consists of those concepts of C|(X \ {x}) for which only one of the two extensions is included in C. The notation aC^x for a ∈ {0,1} is a shorthand for the set of concepts of C^x for which x equals a.
The tail matching algorithm, given in Figure 1.6, builds the unlabeled compression sets r(C) for each concept C recursively, depending on which of these parts they belong to during the different stages of the recursion. If C ∈ 0C^x, it leaves r untouched, whereas for any C ∈ 1C^x it adds x to r(C). For the tail a more complex routine is needed, but for the sake of simplicity it is omitted here. At its heart it is a mapping onto the forbidden labels of C^x. This way clashes between the elements of the reduction and the tail are avoided.
In nearly the same manner one can find an acyclic ordering of C which, together with the representation mapping, fulfills the acyclic non-clashing property. First of all it holds that

    for all C ∈ 1C^x and C' ∈ 0C^x: C(x) ≠ C'(x),

since x ∈ C for any C ∈ 1C^x and x ∉ C for C ∈ 0C^x. Since the compression sets of elements of the tail correspond to forbidden labels of C^x, it holds that

    for all C ∈ 1C^x and C' ∈ tail_x(C), there exists y ∈ r(C') such that C(y) ≠ C'(y).

Of course this holds for concepts in 0C^x as well. Overall, the acyclic ordering of the concepts is induced (recursively) by the fact that for C ∈ tail_x(C), C' ∈ 1C^x and C'' ∈ 0C^x it holds that C < C' < C''.
The main result of Kuzmin and Warmuth [30] is the existence of unlabeled compression schemes for maximum classes of size equal to the VC-dimension, condensed in the following theorem.

Theorem 1.29 ([30]). Let C be a maximum class of VC-dimension d. Then the recursive tail matching algorithm returns a representation mapping of order at most d.
Another strategy for building a compression scheme is based on the one-inclusion graph. Below, a concept class C over a domain X of size n is identified with a subset of {0,1}^n.

Definition 1.30. The one-inclusion graph G(C) associated with C is defined as follows:
• The nodes are the concepts from C.
• Two concepts are connected by an edge if and only if they differ in exactly one coordinate (when viewed as nodes in the Boolean cube).
A cube C' in C is a subcube of {0,1}^n such that every node in C' represents a concept from C.
In the context of the one-inclusion graph, the instances (corresponding to the dimensions of the Boolean cube) are usually called "colors" (and an edge along dimension i is viewed as having color i). For a concept C ∈ C, I(C, G(C)) denotes the union of the instances associated with the colors of the incident edges of C in G(C), called the incident instances of C. The degree of C in G(C) is denoted by deg_{G(C)}(C). Recall that the density of a graph with m edges and n nodes is defined as m/n. As shown by Haussler et al. [23, Lemma 2.4], the density of the one-inclusion graph lower-bounds the VC-dimension, i.e., dens(G(C)) < VCdim(C).
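For finite classes the one-inclusion graph and its density can be computed directly; the following Python sketch (illustrative only) does so for the full Boolean cube over three instances.

    from itertools import combinations

    def one_inclusion_graph(concepts):
        """Build G(C): nodes are concepts (0/1-tuples), edges connect concepts
        differing in exactly one coordinate; each edge carries its color."""
        edges = []
        for c1, c2 in combinations(concepts, 2):
            diff = [i for i in range(len(c1)) if c1[i] != c2[i]]
            if len(diff) == 1:
                edges.append((c1, c2, diff[0]))
        return edges

    def density(concepts):
        return len(one_inclusion_graph(concepts)) / len(concepts)

    # The full Boolean cube over 3 instances has VC-dimension 3 and density 3/2.
    cube = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    print(density(cube))  # 1.5, which is below the VC-dimension as stated above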
The following definitions were introduced by Rubinstein and Rubinstein [37].

Definition 1.31. A corner-peeling plan for C is a sequence

    P = ((C_1, C'_1), ..., (C_N, C'_N))   (1.3)

with the following properties:
1. N = |C| and C = {C_1, ..., C_N}.
2. For all t = 1, ..., N, C'_t is a cube in {C_t, ..., C_N} which contains C_t and all its neighbors in G({C_t, ..., C_N}). (Note that this uniquely specifies C'_t.)

The nodes C_t are called the corners of the cubes C'_t, respectively. The dimension of the largest cube among C'_1, ..., C'_N is called the order of the corner-peeling plan P. C can be d-corner-peeled if there exists a corner-peeling plan of order d.
C is called shortest-path closed if, for every pair of distinct concepts C, C' ∈ C, G(C) contains a path of length H(C, C') that connects C and C', where H denotes the Hamming distance. Rubinstein and Rubinstein [37] showed the following:

Lemma 1.32 ([37]). If a maximum class C has a corner-peeling plan (1.3) of order VCdim(C), then an unlabeled compression scheme for C is obtained by setting r(C_t) equal to the set of colors in cube C'_t, for t = 1, ..., N.

Theorem 1.33 ([37]). Every maximum class C can be VCdim(C)-corner-peeled.
The key element of their proof is that corner-peeling preserves shortest-path closedness, which is used to show the uniqueness of the colors of the particular cubes C'_t during peeling.

Although it was known before [30] that any maximum class has an unlabeled compression scheme of size VCdim(C), the scheme resulting from corner-peeling has some very special and nice features, e.g., the acyclic non-clashing property. Thus, the corner-peeling technique will come in handy in chapter 3. All the previous results lead to the following fundamental question.

Conjecture 1.34 ([32]). Any concept class C ⊆ 2^X possesses a sample compression scheme of size O(VCdim(C)).
The sample compression conjecture has been unsolved for almost three decades now, although it has been answered positively for special classes. Besides the aforementioned maximum classes, Helmbold et al. [24] give mistake-bound algorithms for nested differences of constant depth of intersection-closed classes which are bounded by O(VCdim(C)). Together with the above-mentioned One-Pass algorithm given by Floyd and Warmuth [16], it follows that there is also a sample compression scheme linear in the VC-dimension. The only bound involving the VC-dimension known to date is a lower bound:

Theorem 1.35 ([16]). For an arbitrary concept class C of VC-dimension d, there is no sample compression scheme of size at most d/5 for sample sets of size at least d.
2 The Teaching Dimension
The following chapter is concerned with the so-called teacher-directed learning model and its associated complexity parameter, the teaching dimension. As its name suggests, the oracle from the first chapter is replaced by a teacher who guides the learner through the learning process by presenting helpful examples. However, there is no exact definition for the term "helpful" in general. Although the model and the teaching dimension are quite reasonable, there are some stringent restrictions on the possible behavior of the learner which limit the teacher in his "helpfulness". A goal of the later chapters will be to relax these rules and to improve the sample complexity drastically for special classes.
Additionally, teacher and learner are not allowed to use coding tricks. What are coding tricks? To explain this, consider the following situation: teacher and learner agree on some ordering C = (C_1, ..., C_m) of the class. If the teacher has to teach the concept C_i and X is sufficiently large, he can simply encode the index i in a single example by sending the i-th instance with an arbitrary label (preferably consistent with C_i, to obscure the fraud). The learner takes this hint and outputs C_i as his hypothesis. Thus, learning took only a single example. Obviously, this is not what learning is meant to be. Coding tricks are often also called "collusions" and have been studied by Angluin and Krikis [3], Ott and Stephan [35] and Goldman and Mathias [20].
The strategy presented in this chapter does not allow coding tricks. Simply put, the teacher presents the learner a sample S that is consistent only with the target concept C*. Such a sample is called a teaching set for C*.
Figure 2.1: The teacher-directed learning model. The teacher, who knows the target concept C* and the concept class, presents a well-chosen sample set to the learner, who outputs a hypothesis. The resulting hypothesis needs to be exactly the target concept.
The teaching dimension and its related notions are very well researched [41; 19; 20; 5; 47; 48]. Although most of the results we present in this chapter were elaborated more than 25 years ago, they are the fundamental basis for our work. Besides summarizing results of prior studies, we give new results for shortest-path closed classes at the end of the chapter, which are contained in Doliwa et al. [12].
2.1 Definitions
The teaching model was independently introduced by Goldman and Kearns [19] and Shinohara and Miyano [41]. In contrast to the previous models, the learner does not get random or self-chosen examples. Instead, samples are chosen by a helpful teacher in such a way that he can ensure that the learner will identify the correct concept. A teaching set for a concept C ∈ C is an unordered labeled set S such that solely C is consistent with S, i.e., Cons(S, C) = {C}. The family 𝒯𝒮(C, C) is the set of all teaching sets for C, and TS(C, C) denotes the size of a smallest teaching set. The following complexity notions are well known and have been explored in numerous works:
    TS_min(C) := min_{C ∈ C} TS(C, C)
    TS_max(C) := max_{C ∈ C} TS(C, C)
    TS_avg(C) := (1/|C|) ∑_{C ∈ C} TS(C, C)
The term TS_max(C) =: TD(C) is also called the teaching dimension. The quantity TS_min(C) is known as the minimum teaching dimension of C and TS_avg(C) is called the average-case teaching dimension of C. Obviously,

    TS_min(C) ≤ TS_avg(C) ≤ TS_max(C) = TD(C).
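For small finite classes these quantities can be computed by exhaustive search. The following Python sketch (an illustrative helper in the spirit of Appendix A, not taken from it) computes TS_min, TS_avg and TD; applied to the class of singletons plus the empty concept (the class of Figure 2.2 for n = 4) it returns (1, 1.6, 4).

    from itertools import combinations

    def teaching_set_size(target, concepts):
        """TS(target, C): size of a smallest sample consistent with `target` only.
        Concepts are 0/1-tuples over the instances 0, ..., n-1."""
        n = len(target)
        for k in range(n + 1):
            for S in combinations(range(n), k):
                consistent = [c for c in concepts if all(c[x] == target[x] for x in S)]
                if consistent == [target]:
                    return k
        return n

    def teaching_dimensions(concepts):
        sizes = [teaching_set_size(c, concepts) for c in concepts]
        return min(sizes), sum(sizes) / len(sizes), max(sizes)  # TS_min, TS_avg, TD

    C = [tuple(1 if i == j else 0 for i in range(4)) for j in range(4)] + [(0, 0, 0, 0)]
    print(teaching_dimensions(C))  # (1, 1.6, 4)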
The teaching dimension is also known in several other fields of research, e.g., combinatorics and complexity theory. There, a teaching set is often called a witness set (e.g., Jukna [25]). In communication-complexity-related fields, for example, this notion refers to the property of being able to testify knowledge of a pre-shared secret.
2.2 Collusions and Coding Tricks
Of course, there are more strategies violating the common idea of learning than the one mentioned in the introduction of this chapter. A first attempt to give a formal definition for this was made by Goldman and Mathias [20]. For a given pair (τ, λ) of a teacher τ and a learner λ, they defined τ(C) to be the sample that the teacher selects in pursuit of teaching C. Accordingly, λ(S) is the hypothesis of the learner on input S. Due to Goldman and Mathias [20], a teacher-learner pair is said to be collusion-free (for C) if

    for all C ∈ C and all samples S ⊇ τ(C) that are consistent with C: λ(S) = C.

More or less, this property demands that enriching the sample with consistently labeled points does not make the learner fail on S. Coding tricks like the one described above are prevented by this requirement. It is easy to see that learning with the help of minimum teaching sets provided by a teacher is indeed collusion-free.
           x_1  x_2  x_3  ...  x_{n-1}  x_n
    C_0     0    0    0   ...     0      0
    C_1     1    0    0   ...     0      0
    C_2     0    1    0   ...     0      0
    C_3     0    0    1   ...     0      0
    ...
    C_{n-1} 0    0    0   ...     1      0
    C_n     0    0    0   ...     0      1

Figure 2.2: The concept class C_sing ∪ {∅}, for which TD(C) = |C| − 1 and VCdim(C) = 1.
Zilles et al. [48] have given a different interpretation of coding tricks that is not as restrictive as the one given above. Thus, they were able to prove the soundness of different protocols based on other complexity notions for teaching. The interested reader is referred to [48; 19; 20; 35; 4] for further discussions about coding tricks.
2.3 Relation to the VC-dimension
Lemma 2.1 ([19]).There is a concept class C for which TD(C) = jCj 1 and
VCdim(C) = 1.
Proof.See figure 2.2.Obviously the given class has only a VC-dimension equal
to 1 since any concept has at most one positive label.Thus no set of size two
can be shattered.
Each of the concepts C_1, ..., C_n has a teaching set of size one (the single positively labeled instance x_i for C_i). But the concept C_0 requires the labels of all |C| − 1 instances to be revealed for any consistent learner, since each instance distinguishes only one of the other concepts from C_0.
We would particularly like to highlight that the concept class presented above is also maximum, i.e., not even these well-structured concept classes feature a nice relation between teaching and shattering.
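For a concrete sanity check of Lemma 2.1, the snippet below builds the class of Figure 2.2 for n = 5; it assumes that the brute-force helpers teaching_set_size and td from the sketch in Section 2.1 are in scope.

```python
# Assumes teaching_set_size and td from the Section 2.1 sketch are in scope.
n = 5
sing = {"C0": tuple(0 for _ in range(n))}                       # the empty concept
sing.update({f"C{i + 1}": tuple(int(j == i) for j in range(n))  # the singletons
             for i in range(n)})

# Every singleton is taught by its single positive instance, but the
# all-zero concept needs all n = |C| - 1 instances.
assert all(teaching_set_size(f"C{i + 1}", sing) == 1 for i in range(n))
assert teaching_set_size("C0", sing) == n
assert td(sing) == len(sing) - 1
```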
Lemma 2.2 ([33; 19]). Consider the concept class C^n_addr illustrated in Figure 2.3 (addr stands for addressing). It holds that
\[
1 = \mathrm{TD}(C^{n}_{\mathrm{addr}}) < \mathrm{VCdim}(C^{n}_{\mathrm{addr}}) = \log(n).
\]
             x_1  x_2  ...  x_{log2(n)}  x_{1+log2(n)}  ...  x_{n-1+log2(n)}  x_{n+log2(n)}
  C_1         0    0   ...      0             1         ...        0                0
  C_2         1    0   ...      0             0         ...        0                0
  C_3         0    1   ...      0             0         ...        0                0
  ...
  C_{n-1}     1    1   ...      0             0         ...        1                0
  C_n         1    1   ...      1             0         ...        0                1

Figure 2.3: The concept class C^n_addr for which TD(C) < VCdim(C).
Proof. See Figure 2.3 and let n = 2^k. Note that for the concept C_i only x_{k+i} is positively labeled among the instances x_{k+1}, ..., x_{k+n}. Obviously {x_1, ..., x_k} is shattered, whereas the positive label of x_{k+i} is sufficient for teaching C_i.
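The following snippet checks both claims of Lemma 2.2 for n = 4 (so k = 2). It reuses teaching_set_size from the sketch in Section 2.1 (an assumption of this snippet) and adds a small brute-force VC-dimension routine; the exact binary encoding of the address bits is our own choice and may differ from Figure 2.3, which does not affect the argument.

```python
from itertools import combinations

# Assumes teaching_set_size from the Section 2.1 sketch is in scope.
def vc_dimension(concepts):
    """Brute-force VC-dimension of a finite class given as label tuples."""
    rows = list(concepts.values())
    n = len(rows[0])
    best = 0
    for d in range(1, n + 1):
        shattered = any(
            len({tuple(r[i] for i in subset) for r in rows}) == 2 ** d
            for subset in combinations(range(n), d)
        )
        if not shattered:
            break
        best = d
    return best

k, n = 2, 4
addr = {}
for i in range(n):
    address = tuple((i >> b) & 1 for b in range(k))     # k address bits
    data = tuple(int(j == i) for j in range(n))         # one positive "data" bit
    addr[f"C{i + 1}"] = address + data

assert max(teaching_set_size(c, addr) for c in addr) == 1   # TD = 1
assert vc_dimension(addr) == k                              # VCdim = log2(n)
```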
Note that Figure 2.3 also exhibits the maximal possible gap between these two notions, since for any finite C it holds that VCdim(C) ≤ log(|C|) and TD(C) ≥ 1. Although there is no general relation between the VC-dimension and the teaching dimension, Goldman and Kearns [19] were able to prove the following by taking the size of the class itself into account.
Theorem 2.3 (Goldman and Kearns [19]). For any concept class C it holds that
\[
\mathrm{TD}(\mathcal{C}) \le \mathrm{VCdim}(\mathcal{C}) + |\mathcal{C}| - 2^{\mathrm{VCdim}(\mathcal{C})}.
\]
Proof. Let {x_1, ..., x_d} be a shattered set of size d = VCdim(C). After taking x_1, ..., x_d and their corresponding labels into a temporary teaching set, there are at most |C| − 2^d + 1 concepts left in the version space: the shattered set realizes all 2^d label patterns, so at least 2^d − 1 concepts disagree with the target on it and are eliminated. Since each remaining concept differs from the target in at least one instance, we will need at most |C| − 2^d more labeled instances for a complete teaching set.
2.4 The Teaching Dimension for Special Classes
Computing the teaching dimension is known to be NP-complete. This was proven by [41] by giving a reduction from the hitting set problem. Therefore, the teaching dimension of well-structured classes is of special interest. Goldman and Kearns [19] determined the teaching dimension for a large collection of interesting concept classes.
Proposition 2.4 (Goldman and Kearns [19]). For the concept class C
• of monotone monomials over n variables and r relevant variables: TD(C) = min(r + 1, n) (a small sanity check is sketched below);
• of monomials over n variables and r relevant variables: TD(C) = min(r + 2, n + 1);
• of monotone k-term DNF formulas over n variables: l + 1 ≤ TD(C) ≤ l + k, where l is the number of literals in the target formula;
• of monotone decision lists over n variables: n ≤ TD(C) ≤ 2n − 1;
• BOX^d_n of d-dimensional boxes over the domain [n]^d: TD(C) = 2 + 2d.
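The following snippet verifies the first item of Proposition 2.4 for n = 3 by enumerating all monotone monomials as concepts over the Boolean cube; it assumes the teaching_set_size helper from the sketch in Section 2.1 is in scope, and the construction itself is only an illustration, not taken from [19].

```python
from itertools import combinations, product

# Assumes teaching_set_size from the Section 2.1 sketch is in scope.
# Monotone monomials over n Boolean variables: one concept per subset R of
# variables, labeling an assignment 1 iff all variables in R are set to 1.
n = 3
assignments = list(product((0, 1), repeat=n))          # instance space, size 2^n
monomials = {
    frozenset(R): tuple(int(all(a[i] for i in R)) for a in assignments)
    for size in range(n + 1)
    for R in combinations(range(n), size)
}

# Proposition 2.4, first item: a target with r relevant variables has a
# smallest teaching set of size min(r + 1, n).
for R in monomials:
    assert teaching_set_size(R, monomials) == min(len(R) + 1, n)
```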
According to Kuhlmann, the minimum teaching set size is a lower bound on the self-directed learning complexity.

Lemma 2.5 ([28]). For every concept class C: TS_min(C) ≤ SDC(C).
Thus, the following holds:

Corollary 2.6 ([28]). Let C be an arbitrary concept class. If VCdim(C) = 1 then TS_min(C) = 1.
Additionally, Kuhlmann proved the following:

Lemma 2.7 ([28]). Let C be an intersection-closed concept class. Then,
\[
\mathrm{TS}_{\min}(\mathcal{C}) \le I(\mathcal{C}).
\]
According to results of Ben-David and Eiron [5], the teaching dimension itself does not fit into our landscape of complexity notions very well.

Lemma 2.8 ([5]). The self-directed learning complexity of a concept class is not bounded by any function of its teacher-directed complexity, i.e., for every n and every d ≥ 3, there exists a concept class C with TD(C) = d and SDC(C) ≥ n.
Summing up, at least the minimum teaching set size can be integrated into our chain of inequalities between complexity notions. Thus, we can extend Theorem 1.21 to the following:
Corollary 2.9. For arbitrary concept classes C it holds that
\[
\mathrm{TS}_{\min}(\mathcal{C}) \le \mathrm{SDC}(\mathcal{C}) \le M_{\mathrm{opt}}(\mathcal{C}) \le \log |\mathcal{C}|.
\]
2.5 New Results for Shortest-Path-Closed Classes
In this section, we study the best-case teaching dimension, TS_min(C), and the average-case teaching dimension, TS_avg(C), of a shortest-path-closed concept class C.
It is known that the instances of I(C, G(C)), augmented by their C-labels, form a unique minimum teaching set for C in C, provided that C is a maximum class [30]. Lemma 2.10 slightly generalizes this observation.
Lemma 2.10. Let C be any concept class. Then the following two statements are equivalent:
1. C is shortest-path closed.
2. Every C ∈ C has a unique minimum teaching set S, namely the set S such that X(S) = I(C, G(C)).
Proof. 1 ⇒ 2 is easy to see. Let C be an arbitrary shortest-path-closed concept class, and let C be any concept in C. Clearly, any teaching set S for C must satisfy I(C, G(C)) ⊆ X(S) because C must be distinguished from all its neighbors in G(C). Let C′ ≠ C be any other concept in C. Since C and C′ are connected by a path P of length |C △ C′|, C and C′ are distinguished by the color of the first edge in P, say by the color x ∈ I(C, G(C)). Thus, no other
instances (= colors) besides I(C, G(C)) are needed to distinguish C from any other concept in C.
To show 2 ⇒ 1, we suppose 2 and prove by induction on k that any two concepts C, C′ ∈ C with k = |C △ C′| are connected by a path of length k in G(C). The case k = 1 is trivial. For a fixed k, assume that all pairs of concepts of Hamming distance k are connected by a path of length k in G(C). Let C, C′ ∈ C with |C △ C′| = k + 1 ≥ 2. Since I(C, G(C)) = X(S) for the (unique minimum) teaching set S of C, there is an x ∈ I(C, G(C)) such that C(x) ≠ C′(x). Let C″ be the x-neighbor of C in G(C). Note that C″(x) = C′(x), so that C″ and C′ have Hamming distance k. According to the inductive hypothesis, there is a path of length k from C″ to C′ in G(C). It follows that C and C′ are connected by a path of length k + 1.
Theorem 2.11. Let C be a shortest-path-closed concept class. Then,
\[
\mathrm{TS}_{\mathrm{avg}}(\mathcal{C}) < 2\,\mathrm{VCdim}(\mathcal{C}).
\]
Proof. According to Lemma 2.10, the minimum teaching set of each C ∈ C has size |I(C, G(C))| = deg_{G(C)}(C), so the average-case teaching dimension of C coincides with the average vertex degree in G(C), which is twice the density of G(C). The result follows from the fact that Haussler et al. [23] have shown that the density of G(C) is a lower bound on the VC-dimension of C.
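The correspondence between minimum teaching sets and vertex degrees in the one-inclusion graph is easy to observe numerically. The sketch below does so for a small chain class of VC-dimension 1 (which is shortest-path closed); it assumes the teaching_set_size helper from the sketch in Section 2.1 is in scope, and the graph routine and the example class are our own illustrative choices.

```python
from itertools import combinations

# Assumes teaching_set_size from the Section 2.1 sketch is in scope.
def one_inclusion_degrees(concepts):
    """Degree of each concept in the one-inclusion graph G(C), whose edges
    connect concepts at Hamming distance exactly 1."""
    degrees = {name: 0 for name in concepts}
    for (a, va), (b, vb) in combinations(concepts.items(), 2):
        if sum(x != y for x, y in zip(va, vb)) == 1:
            degrees[a] += 1
            degrees[b] += 1
    return degrees

# A chain class over 5 instances: VC-dimension 1 and shortest-path closed.
chain = {f"C{k}": tuple(int(i < k) for i in range(5)) for k in range(6)}

degrees = one_inclusion_degrees(chain)
sizes = {c: teaching_set_size(c, chain) for c in chain}

# Lemma 2.10: the minimum teaching set of C consists exactly of the
# instances incident to C in G(C).
assert all(sizes[c] == degrees[c] for c in chain)

# Theorem 2.11 / 2.12: TS_avg equals the average degree, here 10/6 < 2.
print(sum(sizes.values()) / len(chain))
```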
Theorem 2.11 generalizes a result by Kuhlmann [28], who showed that the average-case teaching dimension of “d-balls” (sets of concepts of Hamming distance at most d from a center concept) is smaller than 2d. It also simplifies Kuhlmann's proof substantially. In Theorem 4 of the same paper, Kuhlmann [28] furthermore stated that TS_avg(C) < 2 if VCdim(C) = 1, but his proof is flawed.¹ Despite the flawed proof, the claim itself is correct, as we show now:

¹ His Claim 2 states the following. If VCdim(C) = 1, C_1, C_2 ∈ C, x ∈ X, x ∉ C_1, and C_2 = C_1 ∪ {x}, then, for either (i, j) = (1, 2) or (i, j) = (2, 1), one obtains TS(C_i, C) = TS(C_i − x, C − x) + 1 and TS(C_j, C) = 1. This is not correct, as can be shown by the class C = {{x_z : 1 ≤ z ≤ k} : 0 ≤ k ≤ 5} over X = {x_k : 1 ≤ k ≤ 5}, which has VC-dimension 1. For C_1 = {x_1, x_2}, C_2 = {x_1, x_2, x_3}, and x = x_3, we get TS(C_1, C) = TS(C_2, C) = TS(C_1 − x, C − x) = 2.

Theorem 2.12. Let C be any concept class. If VCdim(C) = 1 then TS_avg(C) < 2.

Proof. By Theorem 2.11, the average-case teaching dimension of a maximum class of VC-dimension 1 is less than 2. It thus suffices to show that any class C of VC-dimension 1 can be transformed into a maximum class C′ of VC-dimension 1 without decreasing the average-case teaching dimension. Let
X′ ⊆ X be any maximal subset of X that does not contain redundant instances (e.g., X′ results from X by removing redundant instances as long as possible). Let C′ = C_{|X′}. Obviously, |C′| = |C| and VCdim(C′) = 1. Let m = |X′|, so that |C′| ≤ \binom{m}{0} + \binom{m}{1} = m + 1. Now we prove that C′ is maximum. Since no x ∈ X′ is redundant for C′, every x ∈ X′ occurs as a color in G(C′). As VCdim(C′) = 1, no color can occur twice. Thus |E(G(C′))| = m. Moreover, there is no cycle in G(C′), since a cycle would require at least one repeated color. As G(C′) is an acyclic graph with m edges, it has at least m + 1 vertices, i.e., |C′| ≥ m + 1. Thus, |C′| = m + 1 and C′ is maximum. This implies that TS_avg(C′) < 2 · VCdim(C′). Since X′ results from X by removing redundant instances only, we obtain TS(C, C) ≤ TS(C_{|X′}, C′) for all C ∈ C. Thus, TS_avg(C) ≤ TS_avg(C′) < 2 · VCdim(C′) = 2, which concludes the proof.
We briefly note that TS_avg(C) cannot in general be bounded by O(VCdim(C)). Kushilevitz et al. [29] present a family (C_n) of concept classes such that TS_avg(C_n) = Ω(√|C_n|) but VCdim(C_n) ≤ log |C_n|.
Lemma 2.13. If deg_{G(C)}(C) ≥ |X| − 1 for all C ∈ C, then C is shortest-path closed.
Proof. Assume C is not shortest-path closed. Pick two concepts C, C′ ∈ C of minimal Hamming distance d that are not connected by a path of length d in G(C). It follows that d ≥ 2. By the minimality of d, any neighbor of C with Hamming distance d − 1 to C′ does not belong to C. Since there are d such missing neighbors, the degree of C in G(C) is bounded by |X| − d ≤ |X| − 2. This yields a contradiction.
In their Theorem 43, Rubinstein et al. [36] provide a concept class V with TS_min(V) > VCdim(V). While several such classes are known, Section 3.3.3 will show that TS_min(C) = VCdim(C) for all maximum classes C. Shortest-path-closed classes generalize maximum classes, but an inspection of [36] together with Lemma 2.13 shows that V is shortest-path closed, since each concept in V has degree |X| or |X| − 1 in G(V). Thus we can prove the existence of shortest-path-closed classes with TS_min(C) > VCdim(C), i.e., the result does not generalize to shortest-path-closed classes. This knowledge might be helpful in the study of the open question whether TS_min(C) ∈ O(VCdim(C)), a question related to the long-standing open sample compression conjecture [16]; see Section 3.3.
3 The Recursive Teaching Dimension
Another recent model of teaching with low information complexity is recursive teaching, where a teacher chooses a sample based on a sequence of nested subclasses of the underlying concept class C [47]. The nesting is defined as follows. The outermost “layer” consists of all concepts in C that are easiest to teach, i.e., that have the smallest sets of examples distinguishing them from all other concepts in C. The next layers are formed by recursively repeating this process with the remaining concepts. The samples used are called recursive teaching sets, and the largest number of examples required for teaching at any layer is the recursive teaching dimension (RTD) of C. The RTD substantially reduces the information complexity bounds obtained in traditional teaching models. It lower-bounds not only the teaching dimension, the measure of information complexity in the “classical” teaching model [19; 41], but also the information complexity of iterated optimal teaching [4] (which we do not cover here), which is often substantially smaller than the classical teaching dimension. The intuitive idea behind this construction is the following: a learner provided with a recursive teaching set realizes that the given sample is not sufficient for exactly identifying one special concept. But he can discard those concepts from C that are taught with different samples of minimal size among all minimal teaching sets for concepts in the class; if the teacher had intended any of these to be the target, he would have chosen the corresponding uniquely identifying set. After dropping these concepts from C, he can undertake the same investigation over and over again until the initially received sample coincides with a teaching set for the target concept in some subset of C.
Since the VC-dimension is the best-studied quantity related to information complexity in learning, it is a natural first parameter to compare to when it comes to identifying connections between information complexity notions across various models of learning. However, the teaching dimension, i.e., the information complexity of the classical teaching model, does not exhibit any general relationship to the VC-dimension: the two parameters can be arbitrarily far apart in either direction, as seen in Chapter 2. Similarly, there is no known connection between the teaching dimension and query complexity.
In this chapter, we establish the first known relationships between the information complexity of teaching and query complexity, as well as between the information complexity of teaching and the VC-dimension. All these relationships are exhibited by the RTD. The main contributions of this chapter are the following:
• We show that the RTD is never higher (and often considerably lower) than the complexity of self-directed learning. Hence all lower bounds on the RTD hold likewise for self-directed learning, for learning from partial equivalence queries, and for a variety of other query learning models.

• We reveal a strong connection between the RTD and the VC-dimension. Though there are classes for which the RTD exceeds the VC-dimension, we present a number of quite general and natural cases in which the RTD is upper-bounded by the VC-dimension. These include classes of VC-dimension 1, intersection-closed classes, a variety of naturally structured Boolean function classes, and finite maximum classes in general (i.e., classes of maximum possible cardinality for a given VC-dimension and domain size). Many natural concept classes are maximum, e.g., the class of unions of up to k intervals, for any k ∈ ℕ, or the class of simple arrangements of positive halfspaces. It remains open whether every class of VC-dimension d has an RTD linear in d.

• We reveal a relationship between the RTD and sample compression schemes.
The relationship between the RTD and unlabeled sample compression schemes is established via corner-peeling plans and Theorem 1.33. Like the RTD, corner-peeling is associated with a nesting of subclasses of the underlying concept class. A crucial observation we make in this chapter is that every maximum class of VC-dimension d allows corner-peeling with an additional property, which ensures that the resulting unlabeled samples contain exactly those instances a teacher following the RTD model would use. Similarly, we show that the unlabeled compression schemes constructed by Kuzmin and Warmuth's Tail Matching algorithm (see Figure 1.6) exactly coincide with the teaching sets used in the RTD model, all of which have size at most d.
This remarkable relationship between the RTD and sample compression suggests that the open question of whether or not the RTD is linear in the VC-dimension might be related to the long-standing sample compression Conjecture 1.34. To this end, we observe that a negative answer to the former question would have implications for potential approaches to settling the latter. In particular, if the RTD is not linear in the VC-dimension, then there is no mapping that maps every concept class of VC-dimension d to a maximum superclass of VC-dimension O(d). Constructing such a mapping would be one way of proving that the best possible size of sample compression schemes is linear in the VC-dimension.
Note that sample compression schemes are not bound by any constraints on how the compression sets have to be formed, other than that they be subsets of the set to be compressed. In particular, any kind of agreement on, say, an order over the instance space or an order over the concept class can be exploited for creating the smallest possible compression scheme. In contrast, the RTD is defined following a strict “recipe” in which teaching sets are independent of orderings of the instance space or the concept class. These differences between the models make the relationship revealed in this chapter even more remarkable.
Lemma 3.3 and all results of Sections 3.2 and 3.3 have been published in the joint work Doliwa et al. [11] or are contained in Doliwa et al. [12]. The only exceptions are Lemma 3.8 and Lemma 3.9, which are unpublished.
3.1 Definitions and Properties
Definition 3.1. Let K be a function that assigns a “complexity” K(C) ∈ ℕ to each concept class C. We say that K is monotonic if C′ ⊆ C implies that K(C′) ≤ K(C). We say that K is twofold monotonic if K is monotonic and, for every concept class C over X and every X′ ⊆ X, it holds that K(C_{|X′}) ≤ K(C).
Definition 3.2 ([48]). A teaching plan for C is a sequence
\[
P = ((C_1, S_1), \ldots, (C_N, S_N)) \qquad (3.1)
\]
with the following properties:
1. N = |C| and C = {C_1, ..., C_N};
2. for all t = 1, ..., N, S_t ∈ 𝒯𝒮(C_t, {C_t, ..., C_N}).
The quantity ord(P) := max_{t=1,...,N} |S_t| is called the order of the teaching plan P. Finally, we define
\[
\mathrm{RTD}(\mathcal{C}) := \min\{\mathrm{ord}(P) \mid P \text{ is a teaching plan for } \mathcal{C}\}, \qquad
\mathrm{RTD}^{*}(\mathcal{C}) := \max_{X' \subseteq X} \mathrm{RTD}(\mathcal{C}_{|X'}).
\]
The quantity RTD(C) is called the recursive teaching dimension of C.
A teaching plan (3.1) is said to be repetition-free if the sets X(S_1), ..., X(S_N) are pairwise distinct. (Clearly, the corresponding labeled sets S_1, ..., S_N are always pairwise distinct.) Similarly to the recursive teaching dimension, we define
\[
\mathrm{rfRTD}(\mathcal{C}) := \min\{\mathrm{ord}(P) \mid P \text{ is a repetition-free teaching plan for } \mathcal{C}\}.
\]
One can show that every concept class possesses a repetition-free teaching plan. First, by induction on |X| = m, the full cube 2^X has a repetition-free teaching plan of order m: it results from a repetition-free plan for the (m−1)-dimensional subcube of concepts for which a fixed instance x is labeled 1, where each teaching set is supplemented by the example (x, 1), followed by a repetition-free teaching plan for the (m−1)-dimensional subcube of concepts with x labeled 0. Second, “projecting” a (repetition-free) teaching plan for a concept class C onto the concepts in a subclass C′ ⊆ C yields a (repetition-free) teaching plan for C′. Putting these two observations together, it follows that
every class over instance set X has a repetition-free teaching plan of order |X|.

          x_1  x_2  x_3  x_4  x_5   TS_min(C_i,C)  TS_min(C_i,C^{-1})  TS_min(C_i,C^{-2})  TS_min(C_i,C^{-1,2})
  C_1      0    0    0    0    0          2                 -                    2                    -
  C_2      1    1    0    0    0          2                 2                    -                    -
  C_3      0    1    0    0    0          4                 3                    3                    2
  C_4      1    0    1    0    0          3                 3                    3                    3
  C_5      1    0    1    0    1          3                 3                    3                    3
  C_6      0    1    1    0    1          3                 3                    3                    3
  C_7      0    1    1    1    1          3                 3                    3                    3
  C_8      0    1    1    1    0          3                 3                    3                    3
  C_9      1    0    1    1    0          3                 3                    3                    3
  C_10     1    0    0    1    0          4                 3                    3                    3
  C_11     1    0    0    1    1          3                 3                    3                    3
  C_12     0    1    0    1    0          4                 4                    4                    4
  C_13     0    1    0    1    1          3                 3                    3                    3

Figure 3.1: A class with RTD(C) = 2 but rfRTD(C) = 3. C^{-i} denotes the class C without the concept C_i, and C^{-i,j} the class without both C_i and C_j.
It should be noted, though, that rfRTD(C) may exceed RTD(C). For example, consider the class in Figure 3.1, which has RTD 2. This can be seen by picking the trivial order ((C_1, S_1), (C_2, S_2), ..., (C_13, S_13)), in which each S_i can be chosen to consist of at most two labeled instances. In any teaching plan of order 2, both C_1 and C_2 have to be taught first with the same instance set {x_1, x_2}, augmented by the appropriate labels. Consequently, the best repetition-free teaching plan for this class has order 3.
As observed by Zilles et al. [48], the following holds:
• RTD is monotonic.
• The RTD coincides with the order of any teaching plan that is in canonical form, i.e., a teaching plan ((C_1, S_1), ..., (C_N, S_N)) such that |S_t| = TS_min({C_t, ..., C_N}) holds for all t = 1, ..., N.
Intuitively, a canonical teaching plan is a sequence that is recursively built by always picking an easiest-to-teach concept C_t in the class C \ {C_1, ..., C_{t−1}}, together with an appropriate teaching set S_t.
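This recipe translates directly into a brute-force procedure. The sketch below greedily builds a canonical teaching plan and reads off the RTD as the largest teaching-set size used; it assumes the teaching_set_size helper from the sketch in Section 2.1 is in scope, and the names are our own.

```python
# Assumes teaching_set_size from the Section 2.1 sketch is in scope.
def canonical_teaching_plan(concepts):
    """Greedily pick an easiest-to-teach concept of the remaining class,
    remove it, and repeat.  Returns the list of (concept, teaching-set size)
    in teaching order; RTD(C) is the maximum of the recorded sizes."""
    remaining = dict(concepts)
    plan = []
    while remaining:
        easiest = min(remaining, key=lambda c: teaching_set_size(c, remaining))
        plan.append((easiest, teaching_set_size(easiest, remaining)))
        del remaining[easiest]
    return plan

def rtd(concepts):
    return max(size for _, size in canonical_teaching_plan(concepts))

# The singleton class of Figure 2.2 (here n = 5) has TD = |C| - 1 = 5,
# but its RTD is 1: the singletons are taught first with one example each,
# and the all-zero concept is left for last.
n = 5
sing = {"C0": tuple(0 for _ in range(n))}
sing.update({f"C{i + 1}": tuple(int(j == i) for j in range(n)) for i in range(n)})
assert rtd(sing) == 1
```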
The definition of teaching plans immediately yields the following result:
Lemma 3.3. 1. If K is monotonic and TS_min(C) ≤ K(C) for every concept class C, then RTD(C) ≤ K(C) for every concept class C.
2. If K is twofold monotonic and TS_min(C) ≤ K(C) for every concept class C, then RTD*(C) ≤ K(C) for every concept class C.

        x_1  x_2
  C_1    0    0
  C_2    0    1
  C_3    1    0
  C_4    1    1

  (a) The concept class.

  P_1 = ((C_1, {(x_1, 0), (x_2, 0)}), (C_2, {(x_1, 0)}), (C_3, {(x_2, 0)}), (C_4, ∅))
  P_2 = ((C_2, {(x_1, 0), (x_2, 1)}), (C_1, {(x_1, 0)}), (C_3, {(x_2, 0)}), (C_4, ∅))

  (b) Two possible teaching plans.

Figure 3.2: An example for the ambiguity of recursive teaching sets and related teaching plans in a naive protocol.
It can be shown that, for certain concept classes C, there exist different orderings whose induced teaching sets can lead to a failure when teacher and learner use a 'naive' recursive teaching protocol.
Example 3.4. Consider the powerset over 2 instances as a concept class. Figure 3.2 shows this simple scenario along with two teaching plans. Both plans are not only valid recursive teaching plans but also canonical teaching plans. Thus, both plans are reasonable for recursive teaching. Nevertheless, a teacher using P_1 and a learner using P_2 would miscommunicate: in order to teach C_2, the teacher (following P_1) presents the sample {(x_1, 0)}, which the learner (following P_2) interprets as the teaching set for C_1. While the teacher tries to teach C_2, the learner identifies C_1 as the target concept.
Does that mean that teacher and learner have to agree on some ordering of C in advance? That would inevitably lead to coding tricks. Fortunately, this apparent problem can be solved by the recursive teaching hierarchy, a valid protocol given by Zilles et al. [48] which makes use of recursive teaching sets.
Definition 3.5 ([48]). Let C be a concept class. The recursive teaching hierarchy for C is the sequence H = ((C_1, d_1), ..., (C_h, d_h)) that fulfills, for all j ∈ {1, ..., h},
\[
C_j = \{\, C \in \hat{\mathcal{C}}_j \mid \mathrm{TS}(C, \hat{\mathcal{C}}_j) = d_j = \mathrm{TS}_{\min}(\hat{\mathcal{C}}_j) \,\},
\]
where Ĉ_1 = C and Ĉ_i = C \ (C_1 ∪ ... ∪ C_{i−1}) for all 2 ≤ i ≤ h. Note that, for any i ∈ [h] and any C ∈ C_i, a sample S ∈ 𝒯𝒮(C, Ĉ_i) with |S| = d_i is called a recursive teaching set for C.
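The hierarchy can be computed layer by layer with the same brute-force machinery as before. The sketch below returns the pairs (C_j, d_j) of Definition 3.5 and assumes the teaching_set_size helper from the sketch in Section 2.1 is in scope; applied to the powerset of Figure 3.2 it yields a single layer containing all four concepts with d_1 = 2, so no agreement on an ordering is needed.

```python
# Assumes teaching_set_size from the Section 2.1 sketch is in scope.
def recursive_teaching_hierarchy(concepts):
    """Layers (C_j, d_j) of the recursive teaching hierarchy: in each round,
    d_j is the smallest teaching-set size within the remaining class and the
    layer collects all concepts attaining it."""
    remaining = dict(concepts)
    hierarchy = []
    while remaining:
        sizes = {c: teaching_set_size(c, remaining) for c in remaining}
        d = min(sizes.values())
        layer = sorted(c for c, s in sizes.items() if s == d)
        hierarchy.append((layer, d))
        for c in layer:
            del remaining[c]
    return hierarchy

# The powerset over two instances (Figure 3.2): a single layer of size four
# with d_1 = 2, and max_j d_j = RTD(C) = 2.
powerset = {"C1": (0, 0), "C2": (0, 1), "C3": (1, 0), "C4": (1, 1)}
print(recursive_teaching_hierarchy(powerset))
```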
The teaching hierarchy can be built independently by both teacher and learner, and max_{i∈[h]} d_i = RTD(C). Unlike the strategies following the idea of, e.g., subset teaching, there exists a teacher-learner pair given by this method that is collusion-free in the sense of Goldman and Kearns [19], as shown by Zilles et al. [48].
In a subsequent work, Samei et al. [38] proved an upper bound on the size of a concept class in terms of its recursive teaching dimension, similar to Sauer's bound (see Lemma 1.4).
Lemma 3.6 ([38]). Let C be an arbitrary concept class over X with |X| = n and RTD(C) = r. Then it holds that
\[
|\mathcal{C}| \le \sum_{k=0}^{r} \binom{n}{k}.
\]
Proof. We give a simplified version of the proof of Samei et al. [38]. Let F be the real vector space of real-valued functions over C. Then dim(F) = |C| = N. Now consider C as a subset of ℝ^n. We will show that any real-valued function on C can be written as a polynomial of degree at most r that is linear in all variables. As such, F is spanned by these polynomials and dim(F) ≤ \sum_{k=0}^{r} \binom{n}{k}.
Let C_1, ..., C_N be the concepts in C, ordered according to a canonical teaching plan P of order r. Hence, TS(C_j, {C_j, ..., C_N}) ≤ r. For the sake of conve-