The Fifth International Conference on Neural Networks and Artificial Intelligence, May 27–30, Minsk, Belarus
The Estimations Based on the Kolmogorov Complexity and Machine Learning from Examples
Vladimir I. Donskoy
Taurian National University, 4 Vernadsky Avenue, Simferopol, 95007, Ukraine, vidonskoy@mail.ru
Abstract. In this paper, the interrelation between the Kolmogorov complexity and the VCD of the classes of partial recursive functions used in machine learning from examples is investigated. A novel pVCD method for programming estimations of the VCD and the Kolmogorov complexity is proposed. It is shown how the Kolmogorov complexity can be used to substantiate the significance of regularities discovered in training samples.
Keywords: Kolmogorov complexity, VCD, machine learning, samples.
I. INTRODUCTION
Examining the problems of machine learning, it is natural to limit the class of decision functions used to the partial recursive functions. In that case, we need an algorithmic approach to machine learning and must examine the algorithmic complexity of models. The statistical Vapnik–Chervonenkis theory of learning [1], the Kolmogorov approach [2], MDL [3], and various heuristics used in machine learning are all based on concepts of the complexity of the models which are used to find regularities or decision-making rules. From different points of view, when learning from examples is used, it is expedient to choose as simple a decision rule (model) as possible. The nature of the arising problem can be viewed as a decree of Nature: a regularity almost always has to be very simple, or to have a very simple description; in other words, low complexity.
In this paper, Vapnik–Chervonenkis theory is extensively used. This theory begins with the concepts of the shatter coefficient and the Vapnik–Chervonenkis dimension (VCD) [4]. Let X^l = (x_1, …, x_l) be a sample; x_i ∈ X, i = 1, …, l; X is a set which is defined in the applications. Δ^S(X^l) is the number of various partitions of the sample X^l into two classes which can be realized by the rules (functions) of the family S. It is evident that Δ^S(X^l) ≤ 2^l. The function m^S(l) = max over X^l of Δ^S(X^l) is called the growth function of the family S, or the l-th shatter coefficient of S [4]. The set of all possible samples X^l is denoted 𝒳^l. The growth function m^S(l) either is identically equal to 2^l, or is majorized by a polynomial in l of degree h, where h is the minimum value of l at which m^S(l) < 2^l.
The following definition is based on this estimation: if there exists a finite h such that m^S(l) < 2^l for any l > h, then it is said that the family S has finite capacity h (or VCD(S) = h). If m^S(l) = 2^l for all l, then it is said that the VCD is infinite: VCD(S) = ∞. If card S = N < ∞, then m^S(l) ≤ N and VCD(S) ≤ log₂ N.
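As a toy illustration (ours, not the paper's), the number of realizable partitions can be computed by brute force for a small concrete family; for one-dimensional threshold rules it grows only linearly in the sample length, far below 2^l:

```python
def shatter_coefficient(sample, rules):
    """Delta^S(X^l): number of distinct two-class partitions of `sample`
    realizable by the rule family."""
    return len({tuple(rule(x) for x in sample) for rule in rules})

# Family S: threshold classifiers on the line, A_t(x) = 1 iff x >= t.
sample = [0.5, 1.5, 2.5, 3.5]                     # a sample X^l with l = 4
rules = [lambda x, t=t: int(x >= t) for t in range(5)]

print(shatter_coefficient(sample, rules))  # prints 5 = l + 1, far below 2^l = 16
```

Because only l + 1 of the 2^l labelings are realizable on samples of length l ≥ 2, the capacity (VCD) of this family is finite.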
The main result of the statistical Vapnik–Chervonenkis theory is: the finiteness of the VCD(S) guarantees the learning ability of the method of empirical induction when the classification rule is chosen from the family S. The fundamental inequality of this theory is used to estimate the length l of a sample which is necessary to guarantee that the empirical error of learning (the frequency ratio) of the classification rule will be close to the unknown probability of error of this rule.
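A minimal sketch of how such a sample-length estimate can be tabulated, assuming the classical form of the bound from [4], P{sup |ν − p| > ε} ≤ 4 m(2l) exp(−ε²l/8), together with the crude polynomial majorant m(2l) ≤ (2l)^h + 1 for a family of finite capacity h:

```python
import math

def vc_bound(l, h, eps):
    """Classical VC bound on P{ sup |empirical error - true error| > eps },
    using the crude growth-function majorant m(2l) <= (2l)**h + 1."""
    return 4 * ((2 * l) ** h + 1) * math.exp(-eps ** 2 * l / 8)

def required_length(h, eps, delta):
    """Smallest sample length l for which the bound drops below delta."""
    l = 1
    while vc_bound(l, h, eps) > delta:
        l += 1
    return l

# E.g. a family with capacity h = 3, accuracy eps = 0.2, confidence delta = 0.05:
print(required_length(h=3, eps=0.2, delta=0.05))
```

The numbers it prints are illustrative only; sharper majorants of the growth function give smaller required lengths.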
The main purpose of this paper is to analyze the process of machine learning from examples when recursive function families are used. To achieve this purpose, we defined the Kolmogorov complexity [5] of a family of recursive functions and proved an inequality connecting it with the VCD. A novel method for estimating both the VCD and the Kolmogorov complexity is proposed. Finally, a majorant is obtained for the probability of a random choice of a recursive rule which is absolutely correct on all examples of a sample of length l, when this rule is found by means of machine learning.
The results obtained in this paper are based on the Kolmogorov approach, which supposes the consideration of nonrandomness as regularity.
II. KOLMOGOROV COMPLEXITY OF THE RECURSIVE CLASSIFIERS
Let S be a family of general recursive functions (of algorithms) of the form A: X → {0, 1}. A training sample, which is denoted X^l, contains l arbitrary elements from the set X. This sample presents an ordered collection which consists of bounded natural numbers. The bounded set of all these samples is denoted 𝒳^l, and a finite number of bits is required to present its states. The set of 0–1 strings (words) of arbitrary length, as usual, presents the numbers 0, 1, 2, … . The length of a string p is denoted l(p). Φ is the class of partial recursive functions.
We define more exactly the training sequence, or the sample, as the pair ⟨X^l, Y^l⟩, where Y^l = (y_1, …, y_l), y_i = f(x_i); f is some a priori unknown, but existing, classification function. The set of all possible training samples is a general population from which samples can be extracted. The machine learning problem consists in finding the unknown function f by using the given sample. Practically, the result of machine learning is a function A which is not equal to f, but which is, in a certain sense, as close to f as possible.
The family S is defined by the choice of the model of machine learning (and by the corresponding family of classification algorithms): for example, by decision trees, neural networks, potential functions, and other heuristics. The most intricate problem is the determination of a family S which is relevant and adequate to the initial information; this is why empirical learning problems are so complicated.
Definition 1.
1º The complexity of the algorithm A relative to the sample X^l by the partial recursive function φ is
K_φ(A | X^l) = min { l(p) : φ(p, X^l) = α },
where α = (A(x_1), …, A(x_l)) is a binary word of the length l.
2º The complexity of the algorithm A at the set 𝒳^l by the partial recursive function φ is
K_φ(A) = max over X^l ∈ 𝒳^l of K_φ(A | X^l).
3º The complexity of the family of algorithms S at the set 𝒳^l by the partial recursive function φ is
K_φ(S) = max over A ∈ S of K_φ(A).
4º The complexity of the family of algorithms S at the set 𝒳^l is
KC(l) = min over φ ∈ Φ of K_φ(S).
Theorem 1. Let the family of partial recursive functions S have finite VCD(S) and Kolmogorov complexity KC(l). Then K_φ(S) ≥ log₂ m^S(l) for any l and any φ ∈ Φ.
Proof. The complexity K_φ(S) of the family S is defined by the expression of Definition 1, where a binary word α fixes the variant of the partition of the sample X^l into two subsets. All possible variants of such partitions are defined by the functions of the family S. For a function A, the binary word α is defined by the expression α = (A(x_1), …, A(x_l)); moreover, if the functions A and B from S are equivalent, their binary words α are the same. If the partial recursive function φ is fixed, the equality φ(p, X^l) = α must be fulfilled for any A on any sample X^l according to Definition 1. Therefore, the argument p must admit not less than m^S(l) values, where m^S(l) is the growth function of the family S. Recall that m^S(l) is the maximum number of various partitions of the sample X^l; therefore m^S(l) defines the maximum possible number of various binary words α of the length l over all samples X^l from 𝒳^l. And because φ is a function, the inequality 2^{l(p)} ≥ m^S(l) takes place.
Furthermore, the equality
min over φ ∈ Φ of K_φ(S) = ⌈log₂ m^S(l)⌉  (1)
is true. Really, it is sufficient to point out a function φ* such that K_{φ*}(S) = ⌈log₂ m^S(l)⌉. This function φ* can be defined by the following Table 1.
TABLE I. DETERMINATION OF THE FUNCTION φ*
(Rows: the code (number) of the program p. Columns: the code (number) of the sample X^l. Cells: the binary words α.)
The values α contained in the table are binary words of the length l, which are treated as binary natural numbers (we mean the natural numbers extended with zero). Just as the values α, the samples X^l and the codes p are interpreted as natural numbers. So, the function φ* can be defined on the finite set of values of arguments presented in Table 1. On any other admissible values of arguments which are not contained in this table, the function φ* can be set to zero. We recall: natural functions of natural arguments which have nonzero values only on a finite set from their domain of definition are recursive. Under these conditions, the codes p can be taken as binary numbers of the length ⌈log₂ m^S(l)⌉, so K_{φ*}(S) = ⌈log₂ m^S(l)⌉. And finally, taking into account the equality (1), we get KC(l) = ⌈log₂ m^S(l)⌉.
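A toy numerical check of this equality (our example, not the paper's): for the one-dimensional threshold family the l-th shatter coefficient equals l + 1, so the complexity ⌈log₂ m^S(l)⌉ grows only logarithmically in the sample length.

```python
import math

# KC(l) = ceil(log2 m(l)) for a family whose shatter coefficient is
# m(l) = l + 1 (one-dimensional threshold rules; our toy example).
def kc(l):
    return math.ceil(math.log2(l + 1))

for l in (3, 7, 15, 100):
    print(l, kc(l))   # prints: 3 2, 7 3, 15 4, 100 7
```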
Corollary 1. The Kolmogorov complexity of the family of algorithms S is equal to the least whole number which is greater than or equal to the logarithm of the l-th shatter coefficient of this family: KC(l) = ⌈log₂ m^S(l)⌉.
Corollary 2.
III. THE METHOD OF PROGRAMMING OF ESTIMATIONS OF THE VCD AND SHATTER COEFFICIENTS
The complexity KC(l) of the class S of algorithms is defined above as the minimum length of the binary word (program) p which can be used to define the word α by means of the corresponding partial recursive function φ (an external algorithm) in the most unfavorable case over the set 𝒳^l of samples and the algorithms from S. It is evident that KC(l) ≤ K_φ(S) for any function φ. Therefore, for the upper estimation of the KC(l), any Turing machine can be used alternatively as the algorithm φ, if this machine calculates α. An appropriate program φ in any programming language such that φ(p, X^l) = α for the input (p, X^l) can be used as well as a Turing machine. So, if the word p and an appropriate way of calculating α are defined, then the VCD can be estimated: KC(l) ≤ l(p) and VCD(S) ≤ l(p).
The novel, so-called pVCD method of programming the estimation of the VCD is based on the inequality VCD(S) ≤ l(p), where the word p is defined by the expressions φ(p, X^l) = α and α = (A(x_1), …, A(x_l)). Taking into account the equality KC(l) = ⌈log₂ m^S(l)⌉ (Corollary 1), we have KC(l) ≤ l(p) and m^S(l) ≤ 2^{l(p)} for any l; the shatter coefficient can be estimated by the latter inequality. The following very important detail must be underlined. As we noted above, we consider binary strings as natural numbers; therefore the algorithm φ transforms the pair (p, X^l) of natural numbers into the natural number α. When α is found as a number, this number must be decoded into the string of the length l. To present the number α as a binary string we need to have information about the value of l, so we need ⌈log₂ l⌉ binary digits added into the word p, which defines any algorithm A.
To realize the pVCD method, the following steps must be done:
1º Analysis of the family S; definition of a more restricted set of parameters and/or properties of this family in order to form the structure of the word p which completely defines any algorithm A ∈ S; pointing out the algorithm φ (a Turing machine, a partial recursive function, a program for any computer) such that φ(p, X^l) = α.
2º Definition of the length l(p) of the word p: l(p) serves as the upper estimation of the VCD(S), and 2^{l(p)} as the upper estimation of m^S(l).
The pVCD method suggests designing a compressed description p for any element of the family S and an algorithm φ which processes the input (p, X^l). In particular cases, evidence of the existence of such an algorithm is sufficient; but in general, the art of programming and of data organization is needed to present the structure of the word p and the algorithm φ.
If we use a computer with a fixed register capacity, and the algorithms from the family S use this register capacity to present any parameter of the algorithm, a more detailed estimation can be obtained.
We illustrate the pVCD method for the family of binary decision trees (BDT) with not more than k terminal nodes. We suppose Boolean samples and space dimension n. Every internal node of any tree from this family contains the number of a Boolean variable from the set {x_1, …, x_n} and two pointers, the left and the right. Each pointer defines a transition to the next node according to the value of this variable. Any terminal node contains the number of a class (the result of computation), 0 or 1. An example tree is shown in Fig. 1.
Any tree defines an algorithm A. This algorithm can be compressed into the word p in the following way. The word p consists of the concatenation of fragments, each containing the number of a Boolean variable and a generalized pointer, as shown in Fig. 2. Finally, these fragments, as well as the whole word p, are presented as binary numbers. The meaning of the generalized pointer is explained in Table 2.
Fig. 1. The BDT with four internal nodes
Fig. 2. The structure of the fragment
TABLE II. THE MEANING OF THE GENERALIZED POINTER

Value | Explanation
0     | return_class(0)
1     | return_class(1)
2     | if x_i = 0 then return_class(0) else next_fragment
3     | if x_i = 0 then return_class(1) else next_fragment
4     | if x_i = 1 then return_class(0) else next_fragment
5     | if x_i = 1 then return_class(1) else next_fragment
6     | if x_i = 1 then goto_fragment(2) else next_fragment
…     | …
v     | if x_i = 1 then goto_fragment(v − 4) else next_fragment
Now we can write the word p which contains all information needed to decode the tree given in Fig. 1. This word consists of four concatenated fragments corresponding to the internal nodes. Each fragment consists of two fields, presented in decimal form for easy understanding; but below we suppose binary fixed fields of all fragments, according to Table 2. Note that the fragments with indexes 0 and 1 in the word p never need to be pointed to; therefore the generalized pointer always points to indexes of fragments beginning from 2. The algorithm φ which decodes the given word p can be easily understood:
a) Get fragment 0.
b) Decode the number of the Boolean variable extracted from the first field of the fragment, and the value of the generalized pointer extracted from the second field.
c) Execute the program code for the value of the generalized pointer according to Table 2. Either the result (the number of a class) will be obtained and the algorithm will stop; or a transition to the fragment pointed to by the generalized pointer, or a transition to the next concatenated fragment, will be completed.
We explain the procedures used:
return_class(c) – returns the answer 0 if c = 0 or the answer 1 if c = 1, and then the algorithm ends;
next_fragment – transition to the right, to the next fragment;
goto_fragment(j) – transition to the fragment number j.
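The decoding steps a)–c) can be sketched as a short interpreter (our illustration; the exact assignment of the pointer values 2–5 and the branch tested by the goto values are assumptions consistent with the structure of Table 2):

```python
def classify(word, x):
    """Decode a compressed BDT word and classify the Boolean vector x.
    `word` is a list of fragments (variable_index, g), where g is the
    generalized pointer.  The concrete semantics of values 2-5 and of
    the goto branch are our assumptions, following Table 2's layout."""
    i = 0                                    # start from fragment 0
    while True:
        var, g = word[i]
        v = x[var]
        if g == 0:
            return 0                         # return_class(0)
        if g == 1:
            return 1                         # return_class(1)
        if g in (2, 3):                      # test x_var = 0
            if v == 0:
                return g - 2                 # class 0 (g=2) or 1 (g=3)
            i += 1                           # next_fragment
        elif g in (4, 5):                    # test x_var = 1
            if v == 1:
                return g - 4                 # class 0 (g=4) or 1 (g=5)
            i += 1                           # next_fragment
        else:                                # g >= 6: goto_fragment(g - 4)
            if v == 1:
                i = g - 4                    # fragment indexes start from 2
            else:
                i += 1                       # next_fragment

# A 3-fragment word computing x0 AND (NOT x1):
word = [(0, 2), (1, 4), (0, 1)]
print(classify(word, (1, 0)))  # prints 1
print(classify(word, (1, 1)))  # prints 0
print(classify(word, (0, 0)))  # prints 0
```

In a real encoding each fragment would then be packed into fixed-width binary fields, as described in the text.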
To encode any tree, at most k − 1 fragments are needed, because k − 1 is the number of internal nodes if the number of terminal nodes is k. Thus, the generalized pointer has to possess six special values and k − 3 values to point to the fragments indexed as 2, …, k − 2. Therefore, k + 3 values for the generalized pointer are needed, and to encode them, ⌈log₂(k + 3)⌉ binary digits are needed. Finally, ⌈log₂ n⌉ + ⌈log₂(k + 3)⌉ binary digits are needed to encode one fragment, and the length of the binary word p is obtained:
l(p) = (k − 1)(⌈log₂ n⌉ + ⌈log₂(k + 3)⌉).
Note that ⌈log₂ l⌉ binary digits are added into the word p to define the length l of the binary string α which is the output of the algorithm φ. Since a binary decision tree never depends on the sample length l, this addition must be excluded from l(p) when the VCD is estimated by the pVCD method. Taking into account the inequality VCD(S) ≤ l(p), we get the following estimation for the family of binary decision trees with at most k terminal nodes when at most n binary variables are used:
VCD ≤ (k − 1)(⌈log₂ n⌉ + ⌈log₂(k + 3)⌉).
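A small calculator for this estimation (a sketch under the fragment accounting described in the text: k − 1 fragments, ⌈log₂ n⌉ bits for the variable number and ⌈log₂(k + 3)⌉ bits for the generalized pointer; the closed form is our reading of that accounting):

```python
import math

def bdt_vcd_bound(k, n):
    """Upper estimate of the VCD of binary decision trees with at most k
    terminal nodes over n Boolean variables, via the pVCD word length
    l(p) = (k - 1) * (ceil(log2 n) + ceil(log2 (k + 3)))."""
    bits_var = math.ceil(math.log2(n))        # variable-number field
    bits_ptr = math.ceil(math.log2(k + 3))    # generalized-pointer field
    return (k - 1) * (bits_var + bits_ptr)

print(bdt_vcd_bound(k=8, n=16))  # (8-1) * (4 + ceil(log2 11)) = 7 * 8 = 56
```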
For the family of binary decision trees with at most k terminal nodes, with a linear predicate in any internal node, with at most n variables, and with the coefficients and variable values presented in a fixed number of digits per word, we easily get the analogous estimation. Note that this family is a very extensive class of algorithms; therefore the estimation takes large values when all n variables are used to define a linear separating rule in any internal node.
For neural networks with a given number of nodes in the single hidden layer, an analogous estimation is obtained.
IV. VERIFICATION OF THE SIGNIFICANCE LEVEL OF REGULARITIES DISCOVERED IN EMPIRICAL DATA IN TERMS OF THE KOLMOGOROV APPROACH
Definition 2. Let ⟨X^l, Y^l⟩ be a fixed sample given from the general population, and let S be the family of algorithms used for training. The solution A ∈ S of the functional system (2), if it exists, is called a correct tuning on the sample. The solution of the functional system (3), if it exists, is called a tuning on fixed elements of the sample.
A(x_i) = y_i, i = 1, …, l;  (2)
A(x_i) = y_i, i ∈ I, I ⊂ {1, …, l}.  (3)
Evidently, a tuning on fixed elements of the sample is a correct tuning on some part of the sample. In machine learning problems, as usual, the sample is randomly and independently derived from the general population. Below we use this model of derivation from the general population. In a randomly derived pair (x, y), the Boolean vector x appears with a certain probability. When the correct tuning is realized in some way and there are no errors on the given sample, the values of the rule outside the sample can be arbitrary, and the decision rule A which is found can be erroneous, generally speaking, on any object not in the sample. In other words, direct solving of the systems (2) or (3) is absolutely not equivalent to learning from examples! To realize an empirical induction based on the sample, it is necessary to generalize the properties of this sample so as to obtain not only zero empirical error on this sample, but as few errors as possible on all admissible objects of the set X. What happens when we choose a family S which contains a correct tuning on the given sample, but does not contain the true (or close to the true) regularity which generates the samples derived from the general population? We consider such an event as a random tuning on the sample.
Theorem 2. Let the probability model of derivation from the general population be such that the appearance of any Boolean vector x in an arbitrarily derived pair (x, y) is equally probable. Then the probability P of a random tuning on some l − ε elements of the sample satisfies the inequality
P ≤ C_l^ε · 2^{KC(l) − l + ε},
where KC(l) is the Kolmogorov complexity of the family S, and ε is the number of errors assumed on the training sample by the algorithm realized as a result of training.
Proof. The family S unambiguously generates the finite set Q(X^l) of various ways of classification for any given sample X^l. The cardinality of the set Q(X^l) is at most m^S(l). A correct tuning on all elements of a sample can be realized if and only if the way α of the classification of the sequence X^l into two classes is contained in the set Q(X^l) (in other words, when a random extraction of a sample is realized, the vector α "hits" the set Q(X^l)). Any possible α which can be presented in a sample is equally probable according to the condition of the theorem. Therefore, the probability of a correct tuning on a fixed part of the sample of the length l − ε is at most m^S(l) / 2^{l − ε}. The ε remaining elements can be chosen in C_l^ε ways. Therefore, we have the estimation P ≤ C_l^ε · m^S(l) / 2^{l − ε}. According to Corollary 1, m^S(l) ≤ 2^{KC(l)}; therefore P ≤ C_l^ε · 2^{KC(l) − l + ε}.
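A Monte Carlo sanity check of the ε = 0 case (our toy model: the at most 2^KC ways of classification are simulated by 2^KC randomly fixed labelings, and a uniformly random labeling of the sample plays the role of Y^l):

```python
import random

def random_tuning_frequency(kc, l, trials=100_000, seed=0):
    """Empirical frequency with which a uniformly random labeling of a
    sample of length l is perfectly fitted by a family realizing 2**kc
    fixed ways of classification (a toy model of Theorem 2, eps = 0)."""
    rng = random.Random(seed)
    # The family induces at most 2**kc distinct labelings on the sample.
    family = {rng.getrandbits(l) for _ in range(2 ** kc)}
    hits = sum(rng.getrandbits(l) in family for _ in range(trials))
    return hits / trials

freq = random_tuning_frequency(kc=5, l=12)
print(freq, "theoretical majorant:", 2 ** (5 - 12))
```

Up to sampling noise, the observed frequency stays near the majorant 2^(KC − l) = 1/128, as the theorem predicts.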
Corollary 3. The probability P₀ of a random correct tuning on the whole sample satisfies the inequality P₀ ≤ 2^{KC(l) − l}.
Corollary 4. If the estimation KC(l) ≤ l(p) of the Kolmogorov complexity is obtained by the pVCD method, then P₀ ≤ 2^{l(p) − l}. If l ≥ l(p) + 5, then P₀ ≤ 2^{−5} = 1/32 < 0.04, and the nonrandomness of the regularity found will be not less than 0.96. This is acceptable in practice. Thus we have the "rule of plus five": to obtain a reliable regularity when machine learning is used, the length of the training sample must be greater by 5 than the Kolmogorov complexity of the algorithm family used.
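The majorant of Theorem 2 and the "rule of plus five" are easy to evaluate numerically (a sketch assuming the majorant has the form P ≤ C(l, ε) · 2^(KC − l + ε), which is our reading of the theorem):

```python
from math import comb

def random_tuning_bound(kc, l, eps=0):
    """Majorant of the random-tuning probability:
    P <= C(l, eps) * 2**(kc - l + eps)."""
    return comb(l, eps) * 2 ** (kc - l + eps)

# The "rule of plus five": with l = KC + 5 the bound is 2**-5 = 1/32.
print(random_tuning_bound(kc=20, l=25))        # prints 0.03125 < 0.04
print(1 - random_tuning_bound(kc=20, l=25))    # prints 0.96875 >= 0.96
```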
CONCLUSION
The novel pVCD method presented in this paper allows one to estimate both the VCD and the Kolmogorov complexity of a family of learning algorithms by using a programming technique, which gives it advantages as compared to the more complicated combinatorial approach.
The possible applications of the presented results are the following: obtaining novel estimations of the VCD; reliability estimation of the algorithms which are found as a result of machine learning; estimation of the required lengths of training samples.
REFERENCES
[1] V. N. Vapnik. Recovery of Dependencies by Empirical Data. Moscow: Nauka, 1979 (in Russian).
[2] A. N. Kolmogorov. Information Theory and Theory of Algorithms. Moscow: Nauka, 1987 (in Russian).
[3] P. M. B. Vitanyi, M. Li. Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity. IEEE Trans. on Inf. Theory, 46(2), 2000, pp. 446–464.
[4] L. Devroye, L. Györfi, G. Lugosi. A Probabilistic Theory of Pattern Recognition. NY: Springer-Verlag, 1997.
[5] V. I. Donskoy. Kolmogorov complexity of the classes of partly recursive functions with a restricted capacity. Tavrian Herald for Computer Science Theory and Mathematics, 1, 2005, pp. 25–34 (in Russian).