
The Fifth International Conference on Neural Networks and Artificial Intelligence, May 27-30, Minsk, Belarus


The Estimations Based on the Kolmogorov Complexity and Machine Learning from Examples

Vladimir I. Donskoy

Taurian National University, 4, Vernadsky Avenue, Simferopol, 95007, Ukraine, vidonskoy@mail.ru


Abstract - In this paper, the interrelation between the Kolmogorov complexity and the VCD of the classes of partly recursive functions used in machine learning from examples is studied. The novel pVCD method of programming of estimations of the VCD and the Kolmogorov complexity is proposed. It is shown how the Kolmogorov complexity can be used to substantiate the significance of regularities discovered in training samples.

Keywords - Kolmogorov complexity, VCD, machine learning, samples.

I. INTRODUCTION

When examining the problems of machine learning, it is natural to restrict the class of decision functions used to the partly recursive functions. In that case we need an algorithmic approach to machine learning and must examine the algorithmic complexity of models. The statistical Vapnik-Chervonenkis theory of learning [1], the Kolmogorov approach [2], MDL [3], and the various heuristics used in machine learning are all based on concepts of the complexity of the models which are used to find regularities or decision-making rules. From these different points of view, when learning from examples is used, it is expedient to choose as simple a decision rule (model) as possible. The nature of the arising problem can be seen as a decree of Nature: a regularity almost always has to be very simple or to have a very simple description; in other words, low complexity.

In this paper, Vapnik-Chervonenkis theory is extensively used. This theory begins with the concepts of the shatter coefficient and the Vapnik-Chervonenkis dimension (VCD) [4]. Let $\tilde{x}_l = (x_1, \dots, x_l)$ be a sample; $x_i \in X$, $i = 1, \dots, l$; $X$ is a set which is defined by the application. $\Delta^S(\tilde{x}_l)$ is the number of distinct partitions of the sample $\tilde{x}_l$ into two classes which can be realized by the rules (functions) of the family $S$. It is evident that $\Delta^S(\tilde{x}_l) \le 2^l$. The function $m^S(l) = \max_{\tilde{x}_l} \Delta^S(\tilde{x}_l)$ is called the growth function of the family $S$, or the $l$-th shatter coefficient of $S$ [4]. The set of all possible samples $\tilde{x}_l$ is denoted $X^l$. The growth function $m^S(l)$ either is identically equal to $2^l$, or is majorized by the function $l^{d_0} + 1$, where $d_0$ is the minimum value of $l$ at which $m^S(l) < 2^l$. The following definition is based on this estimation: if there exists a $d$ such that $m^S(l) \le l^d + 1$ for any $l$, then it is said that the family $S$ has finite capacity $d$ (or $VCD(S)$). If $m^S(l) = 2^l$ for all $l$, then it is said that the VCD is infinite: $VCD(S) = \infty$. If $\mathrm{card}\, S = N < \infty$, then $m^S(l) \le N$ and $VCD(S) \le \log_2 N$.
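To make these definitions concrete, here is a minimal brute-force sketch (ours, not from the paper) which computes $\Delta^S(\tilde{x}_l)$ and $m^S(l)$ for a toy family $S$ of threshold rules on the line; the function names and the toy family are our illustration only.

    from itertools import combinations

    def labelings(sample, thresholds):
        # All distinct 0-1 labelings of `sample` realizable by rules A_c(x) = [x >= c];
        # their count is Delta^S for this sample.
        return {tuple(int(x >= c) for x in sample) for c in thresholds}

    def shatter_coefficient(l, X, thresholds):
        # m^S(l): the maximum of Delta^S over all samples of length l from X.
        return max(len(labelings(s, thresholds)) for s in combinations(X, l))

    X = range(10)
    thresholds = range(11)
    for l in (1, 2, 3):
        print(l, shatter_coefficient(l, X, thresholds), 2 ** l)
    # Prints "1 2 2", "2 3 4", "3 4 8": m^S(l) = l + 1 <= 2^l, so VCD(S) = 1 here.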
The main result of the statistical Vapnik-Chervonenkis theory is that the finiteness of $VCD(S)$ guarantees the ability to learn by the method of empirical induction when the classification rule is chosen from the family $S$.
The fundamental inequality

$P\{\sup_{A \in S} |\nu_l(A) - P(A)| > \varepsilon\} \le 4\, m^S(2l)\, e^{-\varepsilon^2 l / 8}$

is used to estimate the length $l$ of a sample which is necessary to guarantee that the empirical error of learning (frequency ratio) $\nu_l(A)$ of the classification rule $A \in S$ will be $\varepsilon$-close to the unknown probability of error $P(A)$ of this rule.
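As a hedged illustration of how this inequality fixes the sample length, the following sketch assumes the reconstructed form of the bound above together with the majorant $m^S(2l) \le (2l)^d + 1$; the constants are those of the reconstruction, not ones stated explicitly in this text.

    import math

    def required_length(d, eps, delta):
        # Smallest l with 4 * m^S(2l) * exp(-eps^2 * l / 8) <= delta,
        # assuming the majorant m^S(2l) <= (2l)^d + 1.
        l = 1
        while 4 * ((2 * l) ** d + 1) * math.exp(-eps ** 2 * l / 8) > delta:
            l += 1
        return l

    print(required_length(d=3, eps=0.1, delta=0.05))  # on the order of 3 * 10^4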


The main purpose of this paper is to analyze the process of machine learning from examples when recursive function families $S$ are used. To achieve this purpose, we define the Kolmogorov complexity [5] of the family of recursive functions and prove the inequality $VCD(S) \le K(S|X^l)$. A novel method of estimating both $VCD(S)$ and $K(S|X^l)$ is proposed. Finally, a majorant is obtained for the probability of the random choice of a recursive rule $A$ which is absolutely correct on all examples of a sample of length $l$, when this rule is found by means of machine learning. The results obtained in this paper are based on the Kolmogorov approach, which treats nonrandomness as regularity.

II. KOLMOGOROV COMPLEXITY OF THE RECURSIVE CLASSIFIERS

Let $S$ be a family of general recursive functions (of algorithms) of the form $A: X \to \{0, 1\}$. A training sample, which is denoted $\tilde{x}_l$, contains $l$ arbitrary elements from $X$. This sample presents an ordered collection which consists of $l$ bounded natural numbers. The bounded set of all these samples is denoted $X^l$, and it requires $\lceil \log_2 \mathrm{card}\, X^l \rceil$ bits to present its states. The set of 0-1-strings $p$ (words) of arbitrary length, as usual, presents the numbers 0, 1, 2, ... . The length of the string $p$ is denoted $l(p)$. The class of partly recursive functions is denoted $\mathcal{F}$.
We define more exactly the training sequence, or the sample, as the pair $(\tilde{x}_l, \tilde{\alpha}_l)$, where $\tilde{\alpha}_l = (\alpha_1, \dots, \alpha_l)$, $\alpha_i = f(x_i)$, $\alpha_i \in \{0, 1\}$; $f$ is some a priori unknown, but existing, classification function. The set of all possible training samples is denoted $X^l \times \{0,1\}^l$. This set is the general population from which samples can be extracted. The machine learning problem consists in finding the unknown function $f$ by using a given sample. Practically, the result of machine learning is a function $A$ which is not equal to $f$, but which is, in a certain sense, as close as possible to $f$. The family $S$ is defined by the choice of the model of machine learning (and by the corresponding family of classification algorithms): for example, by decision trees, neural networks, potential functions, and other heuristics. The most intricate problem is the determination of a family $S$ which is relevant and adequate to the initial information; this is why empirical learning problems are so complicated.


Definition 1.

1º The complexity of the algorithm $A$ relative to the sample $\tilde{x}_l$ by the partly recursive function $F$ is

$K_F(A|\tilde{x}_l) = \min\{l(p) : F(p, \tilde{x}_l) = \tilde{\alpha}_l^A\}$,

where $\tilde{\alpha}_l^A = (A(x_1), \dots, A(x_l))$ is a binary word of the length $l$.

2º The complexity of the algorithm $A$ at the set $X^l$ by the partly recursive function $F$ is

$K_F(A|X^l) = \max_{\tilde{x}_l \in X^l} K_F(A|\tilde{x}_l)$.

3º The complexity of the family of algorithms $S$ at the set $X^l$ by the partly recursive function $F$ is

$K_F(S|X^l) = \max_{A \in S} K_F(A|X^l)$.

The complexity of the family of algorithms $S$ at the set $X^l$ is $K(S|X^l) = \min_F K_F(S|X^l)$.
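The following toy sketch (an illustration of ours, with a hypothetical function F) makes point 1º operational: it enumerates binary words $p$ in order of increasing length until $F(p, \tilde{x}_l)$ reproduces the word of answers of $A$ on the sample.

    from itertools import product

    def complexity(F, A, sample):
        # K_F(A | x_l): the length of the shortest p with F(p, sample) = alpha_l^A.
        target = tuple(A(x) for x in sample)
        length = 0
        while True:
            length += 1
            for p in product((0, 1), repeat=length):
                if F(p, sample) == target:
                    return length

    def F(p, sample):
        # A toy partly recursive function: p is read as a binary threshold c.
        c = int("".join(map(str, p)), 2)
        return tuple(int(x >= c) for x in sample)

    A = lambda x: int(x >= 5)                 # one algorithm of a threshold family
    print(complexity(F, A, (1, 4, 5, 8)))     # 3: p = 101 encodes c = 5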


Theorem 1. Let the family of partly recursive functions $S$ have finite $VCD(S)$ and Kolmogorov complexity $K(S|X^l)$. Then $VCD(S) \le K(S|X^l) \le l$ for any $X^l$ and any $l \ge VCD(S)$.


Proof. The complexity of the family $S$ is defined by the expression $K_F(S|X^l) = \max_{A \in S} \max_{\tilde{x}_l \in X^l} \min\{l(p) : F(p, \tilde{x}_l) = \tilde{\alpha}_l^A\}$, where the binary word $\tilde{\alpha}_l^A$ fixes the variant of the partition of the sample $\tilde{x}_l$ into two subsets. All possible variants of such partitions are defined by the functions of the family $S$. For the function $A$ the binary word $\tilde{\alpha}_l^A$ is defined by the expression $\tilde{\alpha}_l^A = (A(x_1), \dots, A(x_l))$; moreover, if the functions $A$ and $B$ from $S$ are equivalent, the binary words $\tilde{\alpha}_l^A$ and $\tilde{\alpha}_l^B$ are the same. If the partly recursive function $F$ is fixed, the equality $F(p, \tilde{x}_l) = \tilde{\alpha}_l^A$ needs to be fulfilled for any $A \in S$ on any sample $\tilde{x}_l$ according to Definition 1. Therefore the argument $p$ must admit not less than $\Delta^S(\tilde{x}_l)$ values, where $m^S(l) = \max_{\tilde{x}_l} \Delta^S(\tilde{x}_l)$ is the growth function of the family $S$. Remind that $m^S(l)$ is the maximum number of various partitions of the sample $\tilde{x}_l$; therefore $m^S(l)$ defines the maximum possible number of various binary words $\tilde{\alpha}_l^A$ of the length $l$ for all samples $\tilde{x}_l$ from $X^l$. And because $F$ is a function, the inequality $K_F(S|X^l) \ge \lceil \log_2 m^S(l) \rceil$ takes place. Furthermore, the equality

$\min_F K_F(S|X^l) = \lceil \log_2 m^S(l) \rceil$   (1)

is true. Really, it is sufficient to point out a function $F^*$ such that $K_{F^*}(S|X^l) = \lceil \log_2 m^S(l) \rceil$. This function can be defined by the following Table 1 consisting of $m^S(l) \cdot \mathrm{card}\, X^l$ cells.

TABLE I. DETERMINATION OF THE FUNCTION $F^*$ (rows: the code (number) of the program $p$; columns: the code (number) of the sample $\tilde{x}_l$; each cell contains the binary word $\tilde{\alpha}_l$ which $F^*$ returns for that pair of arguments).
The values $\tilde{\alpha}_l$ contained in the table are the binary words of the length $l$, which are binary natural numbers. We mean the natural numbers extended with a zero. Just as the values $\tilde{\alpha}_l$, the samples $\tilde{x}_l$ and the codes $p$ are interpreted as natural numbers. So the function $F^*$ can be defined on the finite set of values of arguments presented in Table 1. On any other admissible values of arguments which are not contained in this table, the function $F^*$ can be determined as a zero. We remind: the natural functions of natural arguments which have nonzero values only on a finite set from their domain of definition are recursive. Under the conditions $l \ge VCD(S)$ and $VCD(S) < \infty$ the following expressions take place: $m^S(l) \ge 2^{VCD(S)}$, hence $\lceil \log_2 m^S(l) \rceil \ge VCD(S)$; and $m^S(l) \le 2^l$, hence $\lceil \log_2 m^S(l) \rceil \le l$. And finally, taking into account the equality (1), we get $VCD(S) \le K(S|X^l) \le l$.


Corollary 1. The Kolmogorov complexity of the family of algorithms $S$ is equal to the least whole number which is greater than or equal to the logarithm of the $l$-th shatter coefficient of this family: $K(S|X^l) = \lceil \log_2 m^S(l) \rceil$.
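For instance (a worked example of ours, using the toy threshold family from the sketch above, for which $m^S(l) = l + 1$):

$K(S|X^l) = \lceil \log_2 (l + 1) \rceil; \qquad l = 7: \; K(S|X^7) = \lceil \log_2 8 \rceil = 3,$

i.e. three bits are enough to index all eight realizable labelings of a sample of seven points.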


Corollary 2.


III. THE METHOD OF PROGRAMMING OF ESTIMATIONS OF VCD AND SHATTER COEFFICIENTS

The complexity $K(S|X^l)$ of the class $S$ of algorithms is defined above as the minimum length of the binary word (program) $p$ which can be used to define the word $\tilde{\alpha}_l^A$ by means of the corresponding partly recursive function $F$ (the external algorithm) in the most unfavorable case over the set of samples and algorithms from $S$. It is evident that $K(S|X^l) \le K_F(S|X^l)$ for any function $F$. Therefore, for the upper estimation of $K(S|X^l)$, any Turing machine can be used alternatively as the algorithm $F$, if it calculates $\tilde{\alpha}_l^A$. An appropriate program $F$ in any programming language such that $F(p, \tilde{x}_l) = \tilde{\alpha}_l^A$ for the input $(p, \tilde{x}_l)$ can be used as well as a Turing machine. So, if the word $p$ and an appropriate way of calculation of $F$ are defined, then the VCD can be estimated: $K(S|X^l) \le l(p)$ and $VCD(S) \le l(p)$.

The novel so-called pVCD method of programming of the estimation of VCD is based on the inequality $VCD(S) \le K(S|X^l) \le K_F(S|X^l) \le l(p)$, where the word $p$ is defined by the expression $F(p, \tilde{x}_l) = \tilde{\alpha}_l^A$ in the most unfavorable case over $A \in S$ and $\tilde{x}_l \in X^l$. Taking into account the equality $K(S|X^l) = \lceil \log_2 m^S(l) \rceil$ (Corollary 1), we have $\lceil \log_2 m^S(l) \rceil \le l(p)$ and $VCD(S) \le l(p)$ for any $l$. The shatter coefficient $m^S(l)$ can be estimated by the inequality $m^S(l) \le 2^{l(p)}$. The following very important detail must be underlined. As we noted above, we consider binary strings as natural numbers; therefore the algorithm $F$ transforms the pair $(p, \tilde{x}_l)$ of natural numbers into the natural number $\tilde{\alpha}_l^A$. When $\tilde{\alpha}_l^A$ is found as a number, this number must be decoded into the string of the length $l$. To present the number $\tilde{\alpha}_l^A$ as a binary string we need to have information about the value of $l$, so we need $\lceil \log_2 l \rceil$ binary digits added into the word $p$ which defines any algorithm $A \in S$.
We denote by $n_S$ the number of binary digits which are sufficient to encode any algorithm of the family $S$, so that $l(p) = n_S + \lceil \log_2 l \rceil$. To realize the pVCD method, the following steps must be done:

1º Analysis of the family $S$; definition of the most restricted set of parameters and/or properties of this family in order to form the structure of the word $p$ which completely defines any algorithm $A \in S$. Pointing out the algorithm $F$ (a Turing machine, a partly recursive function, a program for any computer) such that $F(p, \tilde{x}_l) = \tilde{\alpha}_l^A$.

2º Definition of the length $l(p)$ of the word $p$ for the upper estimation of $K(S|X^l)$, or of $n_S = l(p) - \lceil \log_2 l \rceil$ as the upper estimation of $VCD(S)$.
The pVCD method suggests designing a compressed description $p$ for any element of the family $S$ and the algorithm $F$ which processes the input $(p, \tilde{x}_l)$. In particular cases it is sufficient to show the existence of such an algorithm, but generally, the art of programming and of data organization is needed to present the structure of the word $p$ and the algorithm $F$.
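A minimal sketch of step 2º under the notation above; the function name and its arguments are our illustration, assuming $n_S$ bits suffice to encode any algorithm of $S$.

    import math

    def pvcd_bounds(n_S, l):
        # n_S: bits of the compressed description of any algorithm from S,
        # excluding the ceil(log2 l) field that encodes the output length.
        lp = n_S + math.ceil(math.log2(l))   # total length l(p)
        vcd_upper = n_S                      # VCD(S) <= l(p) - ceil(log2 l)
        shatter_upper = 2 ** lp              # m^S(l) <= 2^l(p)
        return vcd_upper, shatter_upper

    print(pvcd_bounds(n_S=24, l=100))        # (24, 2**31)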


If we use a computer with register capacity $r$, and the algorithms from the family $S$ use this register capacity to present any parameter of the algorithm, a more detailed estimation can be obtained.


We illustrate the pVCD method for the family $S_{BDT}$ of Binary Decision Trees (BDT) with not more than $t$ terminal nodes. We suppose Boolean samples and space dimension $n$, i.e. $X = \{0, 1\}^n$. Every internal node of any tree from $S_{BDT}$ contains the number of a Boolean variable from the set $\{x_1, \dots, x_n\}$ and two pointers: the left and the right. Each pointer defines a transition to the next node according to the value of this variable. Any terminal node contains the number of a class (the result of computation), 0 or 1. A tree with four internal nodes and five terminal nodes is shown in Fig. 1.

Any tree from $S_{BDT}$ defines an algorithm $A$. This algorithm can be compressed into the word $p$ in the following way. The word $p$ consists of the concatenation of fragments, each containing the number $j$ of a Boolean variable and a generalized pointer $g$, as shown in Fig. 2. Finally, these fragments, as well as the whole word $p$, are presented as binary numbers. The meaning of the generalized pointer $g$ is explained in Table 2.



Fig. 1. The BDT with four internal nodes.

Fig. 2. The structure of the fragment: the number $j$ of the Boolean variable and the generalized pointer $g$.

TABLE II. THE MEANING OF THE GENERALIZED POINTER

Value | Explanation
0 | return_class(0)
1 | return_class(1)
2 | If x_j = 0 then return_class(0), else next_fragment
3 | If x_j = 0 then return_class(1), else next_fragment
4 | If x_j = 1 then return_class(0), else next_fragment
5 | If x_j = 1 then return_class(1), else next_fragment
6, 8, 10, ... | If x_j = 0 then goto_fragment(k), else next_fragment
7, 9, 11, ... | If x_j = 1 then goto_fragment(k), else next_fragment

Now we can write the word $p$ which contains all the information needed to decode the tree given in Fig. 1. This word consists of four concatenated fragments corresponding to the internal nodes. Each fragment consists of two fields, presented in decimal form for easy understanding; below, however, we suppose binary fixed fields of all fragments, formed according to Table 2. Note that the fragments with the indexes 0 and 1 in the word $p$ never need to be pointed to. Therefore the generalized pointer always points to indexes of fragments beginning from 2. The algorithm $F$ which decodes the given word $p$ can be easily understood:

a) Get the fragment 0.

b) Decode the number $j$ of the Boolean variable extracted from the first field of the fragment, and the value $g$ of the generalized pointer extracted from the second field.

c) Execute the program code for the value of $g$ according to Table 2. Either the result (the number of a class) will be obtained and the algorithm will stop; or the transition to the fragment pointed to by the value $g$, or the transition to the next concatenated fragment, will be completed.


We explain the procedures used: return_class($\alpha$) returns the answer 0 if $\alpha = 0$ or the answer 1 if $\alpha = 1$, and then the algorithm ends; next_fragment is the transition to the right, to the next fragment; goto_fragment($k$) is the transition to the fragment number $k$.
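Since the exact bit-level packing of the word $p$ is only partly specified here, the following sketch (ours, under the reconstructed Table 2) interprets the fragments at the level of their decoded fields, the variable number $j$ and the generalized pointer $g$; the symbolic ('goto', v, k) value is our shorthand for the transition rows of the table.

    def run_bdt(fragments, x):
        i = 0                                    # a) get the fragment 0
        while True:
            j, g = fragments[i]                  # b) decode the two fields
            if g == 0 or g == 1:
                return g                         # return_class(g)
            if g in (2, 3, 4, 5):                # conditional return rows
                tested, cls = (0, g - 2) if g < 4 else (1, g - 4)
                if x[j] == tested:
                    return cls
                i += 1                           # next_fragment
            else:
                _, v, k = g                      # ('goto', v, k) transition row
                i = k if x[j] == v else i + 1    # goto_fragment(k) or next_fragment

    AND = [(0, 2), (1, 5), (1, 0)]               # the tree computing x_0 & x_1
    for x in ((0, 0), (0, 1), (1, 0), (1, 1)):
        print(x, run_bdt(AND, x))                # prints 0, 0, 0, 1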


To encode any tree from $S_{BDT}$, at most $t - 1$ fragments are needed, because $t - 1$ is the number of internal nodes if the number of terminal nodes is $t$. Thus the generalized pointer has to possess six special values and the values to point to the fragments indexed as $2, \dots, t - 2$ in the two branch variants. Therefore the number $6 + 2(t - 3) = 2t$ of values for the generalized pointer is needed, and to encode them, the number $\lceil \log_2 2t \rceil$ of binary digits is needed. Finally, $\lceil \log_2 n \rceil + \lceil \log_2 2t \rceil$ binary digits are needed to encode one fragment, and the length of the binary word $p$ is obtained:

$l(p) = (t - 1)(\lceil \log_2 n \rceil + \lceil \log_2 2t \rceil) + \lceil \log_2 l \rceil$.


Note that $\lceil \log_2 l \rceil$ binary digits are added into the word $p$ to define the length $l$ of the binary string which is the output of the algorithm $F$. Since the VCD never depends on the sample length $l$, the addition $\lceil \log_2 l \rceil$ must be excluded from $l(p)$ when the VCD is estimated by the pVCD method. Taking into account the inequality $VCD(S) \le l(p) - \lceil \log_2 l \rceil$, we get the following estimation for the family of Binary Decision Trees with at most $t$ terminal nodes when at most $n$ binary variables are used:

$VCD(S_{BDT}) \le (t - 1)(\lceil \log_2 n \rceil + \lceil \log_2 2t \rceil)$.

For the family $S_{LBDT}$ of Binary Decision Trees with at most $t$ terminal nodes, with a linear predicate in any internal node, with at most $n$ variables, and with the coefficients and variable values presented in $r$ digits per word, we easily get the estimation

$VCD(S_{LBDT}) \le (t - 1)((n + 1) r + \lceil \log_2 2t \rceil)$.

Note that the family $S_{LBDT}$ is a very extensive class of algorithms; therefore the estimation of $VCD(S_{LBDT})$ yields large values when all $n$ variables are used to define a linear separating rule in any internal node.

For the neural networks with $h$ nodes in the single hidden layer, with $n$ inputs, and with the weights presented in $r$ digits per word, the following estimation is obtained in the same way:

$VCD(S_{NN}) \le (h(n + 1) + h + 1) r$.


IV. VERIFICATION OF SIGNIFICANCE LEVEL OF REGULARITIES DISCOVERED IN EMPIRICAL DATA IN THE TERMS OF THE KOLMOGOROV APPROACH


Definition 2. Let $(\tilde{x}_l, \tilde{\alpha}_l)$ be a fixed sample given from $X^l \times \{0,1\}^l$, and let $S$ be the family of algorithms used for training. The solution $A \in S$ of the functional system (2), if it exists, is called a correct tuning on the sample. The solution of the functional system (3), if it exists, is called a tuning on $l - k$ fixed elements of the sample.

$A(x_i) = \alpha_i, \quad i = 1, \dots, l; \quad A \in S$   (2)

$A(x_i) = \alpha_i, \quad i \in I \subseteq \{1, \dots, l\}, \quad \mathrm{card}\, I = l - k; \quad A \in S$   (3)


Evidently, a tuning on $l - k$ fixed elements of the sample is a correct tuning on some part of the sample. In machine learning problems, as usual, the sample is random and independently derived from the general population.

Below we use the model of deriving from the general population $X \times \{0, 1\}$. In a randomly derived pair $(x, \alpha)$, the Boolean vector $x$ appears with a certain probability. When the correct tuning is realized in some way and there are no errors on the given sample, the values $A(x)$ on the set $X \setminus \tilde{x}_l$ can be arbitrary, and the decision rule $A$ which is found can be erroneous, generally speaking, on any $x \notin \tilde{x}_l$. In other words, the direct solving of the systems (2) or (3) is absolutely not equivalent to learning from examples! To realize an empirical induction based on the sample, it is necessary to generalize the properties of this sample so as to obtain not only zero empirical error on this sample, but as few errors as possible on all admissible objects of the set $X$. What happens when we choose a family $S$ which contains a correct tuning on the given sample, but does not contain the true (or close to the true) regularity $f$ which generates the samples derived from the general population? We consider such an event as a random tuning on the sample.


Theorem 2. Let the probability model of derivation from the general population $X \times \{0, 1\}$ be such that the appearance of any Boolean vector $x \in X$ in an arbitrarily derived pair $(x, \alpha)$ is equally probable. Then the probability $P_{rt}$ of a random tuning on some $l - k$ elements of the sample satisfies the inequality

$P_{rt} \le \binom{l}{k} \, 2^{K(S|X^l) - l + k}$,

where $K(S|X^l)$ is the Kolmogorov complexity of the family $S$, and $k$ is the number of errors assumed on the training sample by the algorithm realized as the result of training.


Proof. The family $S$ unambiguously generates the finite set $D(S, \tilde{x}_l)$ of the various ways of classification for any given sample $\tilde{x}_l$. The cardinality of the set $D(S, \tilde{x}_l)$ is at most $m^S(l)$. A correct tuning on all elements of a sample can be realized if and only if the way of classification of the sequence $\tilde{x}_l$ into two classes is contained in the set $D(S, \tilde{x}_l)$ (in other words, when a random extraction of a sample is realized, the vector $\tilde{\alpha}_l$ "hits" into the set $D(S, \tilde{x}_l)$). Any possible $\tilde{\alpha}_l$ which can be presented in a sample is equally probable according to the condition of the theorem. Therefore the probability of a correct tuning on the fixed part of the sample of the length $l - k$ is at most $m^S(l) / 2^{l - k}$. The $l - k$ elements from $l$ can be chosen in $\binom{l}{k}$ ways. Therefore we have the estimation

$P_{rt} \le \binom{l}{k} \, m^S(l) \, 2^{-(l - k)}$.

According to the Corollary 1, $K(S|X^l) = \lceil \log_2 m^S(l) \rceil$, so $m^S(l) \le 2^{K(S|X^l)}$. Therefore

$P_{rt} \le \binom{l}{k} \, 2^{K(S|X^l) - l + k}$.


Corollary 3. The probability $P_{rt}$ of a random correct tuning on the whole sample ($k = 0$) satisfies the inequality $P_{rt} \le 2^{K(S|X^l) - l}$.


Corollary 4. If the estimation of the Kolmogorov complexity by the pVCD method is obtained so that $K(S|X^l) \le l(p)$, then $P_{rt} \le \binom{l}{k} \, 2^{l(p) - l + k}$.
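A small numeric check of these bounds (our sketch, assuming the reconstructed inequality of Theorem 2, with the pVCD estimate $l(p)$ usable in place of $K(S|X^l)$ by Corollary 4):

    from math import comb

    def random_tuning_bound(K, l, k=0):
        # P_rt <= C(l, k) * 2^(K - l + k), Theorem 2 / Corollaries 3 and 4.
        return comb(l, k) * 2.0 ** (K - l + k)

    print(random_tuning_bound(K=20, l=25))        # k = 0: 2^-5 = 0.03125
    print(random_tuning_bound(K=20, l=40, k=2))   # ~0.003 with 2 errors allowed

The first call is exactly the $l - K = 5$ case behind the "rule of plus five" below.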


If $l - l(p) \ge 5$ and $k = 0$, then $P_{rt} \le 2^{-5} = 0.03125$, and the nonrandomness of the regularity found will be not less than 0.96. This is acceptable in practice. Thus we have

The "rule of plus five": to obtain a reliable regularity when machine learning is used, the length of the training sample must be greater by 5 than the Kolmogorov complexity of the algorithm family used.


CONCLUSION

The novel pVCD method presented in this paper allows one to estimate both the $VCD(S)$ and the Kolmogorov complexity $K(S|X^l)$ of a family $S$ of learning algorithms by using the technique of programming, which gives it advantages as compared to the more complicated combinatorial approach.

The possible applications of the presented results are the following: obtaining novel estimations of the VCD; reliability estimation of the algorithms which are found as a result of machine learning; estimation of the required lengths of training samples.

REFERENCES

[1] V. N. Vapnik. Recovery of Dependencies by Empirical Data. Moscow: Nauka, 1979 (in Russian).

[2] A. N. Kolmogorov. Information Theory and Theory of Algorithms. Moscow: Nauka, 1987 (in Russian).

[3] P. M. B. Vitanyi, M. Li. Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity. IEEE Trans. on Inf. Theory, 46(2), 2000, pp. 446-464.

[4] L. Devroye, L. Györfi, G. Lugosi. A Probabilistic Theory of Pattern Recognition. NY: Springer-Verlag, 1997.

[5] V. I. Donskoy. Kolmogorov complexity of the classes of partly recursive functions with a restricted capacity. Tavrian Herald for Computer Science Theory and Mathematics, 1, 2005, pp. 25-34 (in Russian).