Twisting Sample Observations with Population Properties to learn
B. APOLLONI, S. BASSIS, S. GAITO and D. MALCHIODI
Dipartimento di Scienze dell’Informazione
Università degli Studi di Milano
Via Comelico 39/41 20135 Milano
ITALY
Abstract: We introduce a theoretical framework for Probably Approximately Correct learning. This enables us
to compute the distribution law of the random variable representing the probability of the region where the
hypothesis is incorrect. The distinguishing feature with respect to the inference of an analogous probability from
a Bernoulli variable is the dependence of this distribution on a complexity parameter playing a companion role
to the Vapnik-Chervonenkis dimension.
KeyWords: Computational learning, statistical inference, twisting argument.
1 Introduction
A very innovative aspect of PAC learning [7] is to
assume that probabilities are random variables per
se. This is not the point of confidence intervals in
classic statistical theory, where randomness is due to
the extremes of the intervals rather than to the value
of the probabilistic parameter at hand. For instance,
let us consider a Bernoulli variable X and the
inequality
$$\min \left\{ T \,\middle|\, P(\theta \le T) \ge 1-\delta \right\} \qquad (1)$$
identifying a 1−δ confidence interval for the parameter θ = P(X=1). Here T is a function f of a random sample (X_1, ..., X_m), the minimum is taken over some free parameters of f, and the randomness of the event in the brackets derives only from the randomness of the sample. In this statistical framework, typical PAC learning inequalities [8] conceal some subtle aspects. With respect to a class C of Boolean functions {g_α} and two elements of it representing the target concept c and its approximating hypothesis h, consider the event
$$B = \left( \sup_{\alpha \in \Delta} \left( R(\alpha) - \nu(\alpha) \right) < \varepsilon \right) \qquad (2)$$
for a given ε, where Δ is a set of reals, α indexes candidate h's within C, R(α) is the probability measure of the related symmetric difference c÷h and ν(α) is the sample frequency of falling in this domain. If we fix ν(α)=0, the probabilistic features of this event come from the randomness of the set Δ collecting the hypotheses whose symmetric difference with c contains no points of a random sample. However, if, in order to approximate the probability of this event from above, we enlarge Δ to the whole set of indices of C, we refer to an event that makes no sense in classical statistical theory. By contrast, it makes sense to require
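$$P(B \mid \nu(\alpha) = 0) = P\left( \sup_{\alpha \in \Delta} R(\alpha) < \varepsilon \right) = P\left( R(\alpha^*) < \varepsilon \right) > 1-\delta \qquad (3)$$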
for some α* ∈ Δ if we assume the probability measure on the error domain related to α* to be, in turn, a random variable.
Right from the start, the object of our inference is a string of data X (possibly of infinite length) that we partition into a prefix we assume to be known at present (and therefore call sample) and a suffix of unknown future data we call a population (see Fig. 1). All these data share the feature of being independent observations of the same phenomenon. Therefore, without loss of generality, we assume these data to be the output of some function g_θ having input from a set of independent random variables U uniformly distributed in the unit interval. By default, capital letters (such as U, X) will denote random variables and small letters (u, x) their corresponding realizations.

Fig. 1. Generating a sample of Bernoulli variables. Independent variables: x-axis: index of the U realizations (sample prefix, then population suffix); y-axis: both U (bars) and X (bullets) values. The threshold line p realizes a mapping from U to X through (4).
We will refer to M=(U, g_θ) as a sampling mechanism and to g_θ as an explaining function, and this function is precisely the object of our inference. Let us consider, for instance, the sampling mechanism M=(U, g_p), where

$$g_p(u) = \begin{cases} 1 & \text{if } u \le p \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

explains sample and population distributed according to a Bernoulli law of mean p. As shown in Fig. 1, for a given sequence of U's we obtain different binary strings depending on the height of the threshold line corresponding to p. Thus it is easy to derive the following implication chain

$$(k_{\tilde p} \ge k) \Leftarrow (p < \tilde p) \Leftarrow (k_{\tilde p} \ge k+1) \qquad (5)$$
and the consequent bound on the probability

$$P(K_{\tilde p} \ge k) \ge P(P < \tilde p) = F_P(\tilde p) \ge P(K_{\tilde p} \ge k+1) \qquad (6)$$
which characterizes the cumulative distribution function (c.d.f.) F_P of the parameter P, representing the asymptotic frequency of 1 in the population compatible with the number k of 1 in the sample. Here k denotes the number of 1 in the sample and K_p̃ denotes the random variable counting the number of 1 in the sample if the threshold in the explaining function switches to p̃ for the same realizations of U.
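The sampling mechanism (4) and the implication chain (5) can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function names, the seed and the number of trials are illustrative assumptions. It fixes one vector of uniform seeds and verifies both implications of (5) for randomly drawn pairs of thresholds p and p̃.

```python
import random


def explain(u, p):
    """Explaining function g_p of (4): 1 if u <= p, 0 otherwise."""
    return 1 if u <= p else 0


def count_ones(us, p):
    """Statistic k_p: number of ones produced by the seeds `us` under threshold p."""
    return sum(explain(u, p) for u in us)


def twisting_holds(us, p, p_tilde):
    """Check both implications of (5) for one pair of thresholds on the same seeds."""
    k = count_ones(us, p)
    k_tilde = count_ones(us, p_tilde)
    # (k_tilde >= k + 1) implies (p < p_tilde)
    right = (k_tilde < k + 1) or (p < p_tilde)
    # (p < p_tilde) implies (k_tilde >= k)
    left = (not p < p_tilde) or (k_tilde >= k)
    return left and right


rng = random.Random(0)  # hypothetical seed, only for reproducibility
us = [rng.random() for _ in range(20)]  # seeds of a sample of size 20
checks = [twisting_holds(us, rng.random(), rng.random()) for _ in range(1000)]
print(all(checks))
```

Both implications rest only on the fact that the count of ones is non-decreasing in the threshold when the seeds are kept fixed, which is why the check does not depend on the particular seeds drawn.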
With reference to the probability space (Ω, Σ, P), where Ω is the [0, 1] interval, Σ the related sigma-algebra and P is uniformly distributed on Ω, P in (6) refers to the product space of the sample and population of U's in Ω. The sample and population for the Bernoulli variable are just a function of them. The probabilities regarding the statistic come from the marginalization of the joint distribution with respect to the population, while the distribution of the parameter comes from the marginalization with respect to the sample.
Note the asymmetry in the implications. It derives from the fact that raising the threshold parameter in g_p cannot decrease the number of 1 in the observed sample, but we can recognize that such a raise occurred only if we actually see a number of ones in the sample greater than k. We will refer to every expression similar to (5) as a twisting argument [4], since it allows us to exchange events on parameters with events on statistics. Its peculiarity lies in the fact that, since the first and last probabilities in (6) are completely known, we are able to identify the distribution law of the unknown parameter. A more thorough discussion of this tool and of the related algorithmic inference framework can be found in [4].
The principal use we will make of relations like (6) is to compute confidence intervals for the unknown parameter P. Namely, in the above inference framework it does not make sense to assume a specific random variable as given; rather we refer to families of random variables with some free parameters whose distribution law is discovered from a sample through twisting arguments. Nevertheless, both for the sake of conciseness and to link our results with conventional ones, we will often keep referring, by abuse of notation, to a given random variable X, yet with random parameters. Within this notation we match our notion of sample, as a prefix of a string of data, with the usual definition of it as a specification of a set of identically distributed random variables.
Example [Inferring a probability]. Let X denote a random variable distributed according to a Bernoulli law of mean P, (X_1, …, X_m) a sample of size m from X and k = Σ x_i the number of 1 in a specification of the sample. A symmetric confidence interval of level δ for P is (l_i, l_s), where l_i is the δ/2 quantile of the Beta distribution of parameters k and m−k+1 [6], and l_s is the analogous 1−δ/2 quantile for parameters k+1 and m−k. Indeed, consider the explanation of X given by (4) and the twisting argument (5). In this case K_p̃ follows a Binomial distribution law of parameters m and p̃, so that (6) reads

$$\sum_{i=k}^{m} \binom{m}{i} \tilde p^{\,i} (1-\tilde p)^{m-i} \ge F_P(\tilde p) \ge \sum_{i=k+1}^{m} \binom{m}{i} \tilde p^{\,i} (1-\tilde p)^{m-i} \qquad (7)$$
Having introduced the incomplete Beta function I_β as the c.d.f. of the random variable Be(h, r) following a Beta distribution of parameters h and r, that is

$$I_\beta(h, r) \equiv P\left( \mathrm{Be}(h, r) \le \beta \right) = 1 - \sum_{i=0}^{h-1} \binom{h+r-1}{i} \beta^{i} (1-\beta)^{h+r-1-i} \qquad (8)$$
the above bounds can be written as

$$I_{\tilde p}(k, m-k+1) \ge F_P(\tilde p) \ge I_{\tilde p}(k+1, m-k) \qquad (9)$$
Therefore, getting

$$1 - \delta = I_{l_s}(k+1, m-k) - I_{l_i}(k, m-k+1) \qquad (10)$$

as a lower bound to F_P(l_s) − F_P(l_i) = P(l_i < P < l_s), the desired confidence interval can be found by dividing the probability measure outside (l_i, l_s) in two equal parts in order to obtain a two-sided interval symmetric in the tail probabilities. We thus obtain the extremes of the interval as the solutions l_i and l_s of the system of equations

$$I_{l_s}(k+1, m-k) = 1 - \delta/2 \qquad (11)$$

$$I_{l_i}(k, m-k+1) = \delta/2 \qquad (12)$$
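Equations (11) and (12) can be solved numerically for given k, m and δ. The sketch below is only an illustration under stated assumptions: it evaluates the incomplete Beta function through the binomial sum (8), which is valid for integer parameters, and finds the quantiles by bisection. The function names, the tolerance and the conventions l_i = 0 for k = 0 and l_s = 1 for k = m are choices made here, not taken from the paper.

```python
from math import comb


def inc_beta(x, h, r):
    """I_x(h, r) for integer h, r >= 1, computed with the binomial sum of (8)."""
    n = h + r - 1
    return 1.0 - sum(comb(n, i) * x**i * (1.0 - x) ** (n - i) for i in range(h))


def beta_quantile(q, h, r, tol=1e-12):
    """Solve I_x(h, r) = q for x in [0, 1] by bisection (I_x is increasing in x)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if inc_beta(mid, h, r) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def bernoulli_interval(k, m, delta):
    """Extremes (l_i, l_s) solving (12) and (11) for k ones out of m."""
    # Boundary conventions (an assumption of this sketch): the Beta law
    # degenerates when one of its parameters would be 0.
    l_i = 0.0 if k == 0 else beta_quantile(delta / 2.0, k, m - k + 1)
    l_s = 1.0 if k == m else beta_quantile(1.0 - delta / 2.0, k + 1, m - k)
    return l_i, l_s


if __name__ == "__main__":
    # Example: k = 5 ones in a sample of m = 20, delta = 0.1
    print(bernoulli_interval(5, 20, 0.1))
```

Bisection is used only to keep the sketch free of external dependencies; any routine computing Beta quantiles would serve the same purpose.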
To check the effectiveness of this computation we considered a string of 20+200 uniform variables on the unit interval representing, respectively, the randomness source of a sample and of a population of Bernoulli variables. Then, according to the explaining function (4), we computed a sequence of 220-bit-long Bernoulli vectors with p rising from 0 to 1. The trajectory described by the point of coordinates k/20 and h/200, which are the frequencies of ones in the sample and in the population respectively, is reported along one fret line in Fig. 2. We repeated this experiment 20 times (each time using different vectors of uniform variables). Then we drew on the same graph the solutions of equations (11) and (12) with respect to l_i and l_s with varying k for δ=0.1. As we can see, for a given value of k the intercepts of the above curves with a vertical line of abscissa k/20 determine an interval containing almost all intercepts of the frets with the same line.
Fig. 2. Generating 0.9 confidence intervals for the mean P of a Bernoulli random variable with population and sample of n=200 and m=20 elements, respectively. φ = k/m = frequency of ones in the sample; ψ = h/n = frequency of ones in the population. Fret lines: trajectories described by the number of 1 in sample and population when p ranges from 0 to 1, for different sets of initial uniform random variables. Curves: trajectories described by the interval extremes when the observed number k of 1 in the sample ranges from 0 to m.
A more intensive experiment would show that, in the approximation of h/200 with the asymptotic frequency of ones in the suffixes of the first 20 sampled values, on all samples and even for each sample if we draw many suffixes of the same one, almost 100(1−δ) percent of the frets fall within the analytically computed curves.
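The experiment described above can be reproduced in a few lines. What follows is a rough simulation sketch, not the authors' procedure in every detail: the grid of thresholds, the number of repetitions, the seed and all function names are assumptions made for illustration. For each vector of 20+200 uniform seeds and each threshold p it computes k and h, and counts how often h/200 falls between the extremes obtained from (11) and (12).

```python
import random
from math import comb


def inc_beta(x, h, r):
    """I_x(h, r) for integer h, r >= 1, through the binomial sum of (8)."""
    n = h + r - 1
    return 1.0 - sum(comb(n, i) * x**i * (1.0 - x) ** (n - i) for i in range(h))


def beta_quantile(q, h, r, tol=1e-12):
    """Bisection solution of I_x(h, r) = q."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if inc_beta(mid, h, r) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def interval(k, m, delta):
    """Extremes from (12) and (11); the k = 0 and k = m cases are conventions."""
    l_i = 0.0 if k == 0 else beta_quantile(delta / 2.0, k, m - k + 1)
    l_s = 1.0 if k == m else beta_quantile(1.0 - delta / 2.0, k + 1, m - k)
    return l_i, l_s


def coverage(m=20, n=200, delta=0.1, runs=200, grid=21, seed=0):
    """Fraction of (sample, population) pairs whose population frequency
    lies inside the interval computed from the sample count k."""
    rng = random.Random(seed)
    bounds = [interval(k, m, delta) for k in range(m + 1)]  # one interval per k
    hits = total = 0
    for _ in range(runs):
        us = [rng.random() for _ in range(m + n)]  # sample seeds, then population seeds
        for g in range(grid):
            p = g / (grid - 1)  # threshold rising from 0 to 1
            k = sum(u <= p for u in us[:m])
            h = sum(u <= p for u in us[m:])
            l_i, l_s = bounds[k]
            hits += l_i <= h / n <= l_s
            total += 1
    return hits / total


if __name__ == "__main__":
    print(coverage())
```

The returned value is an empirical frequency that depends on the seed and on the grid; it is meant to be compared with the nominal 1−δ, keeping in mind that a population of 200 elements only approximates the asymptotic frequency.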
Fig. 3. Functionally linked variables: circle h describing the sample and possible circles describing the population. Small diamonds and circles: sampled points; line-filled region: symmetric difference.
2 Algorithmic inference of Boolean functions
In the PAC learning framework the parameter under investigation is the probability that the inferred function will err on subsequent inputs (that is, will not explain newly sampled points). In greater detail, the general form of the sample is

Z_m = {(X_i, b_i), i=1, ..., m}    (13)

where b_i are Boolean variables. If we assume that for every M and every Z_M an f exists in a Boolean class C, call it c, such that Z_M = {(X_i, c(X_i)), i=1, ..., M}, then we are interested in the measure of the symmetric difference between another function computed from Z_m, which we denote as hypothesis h, and any such c (see Fig. 3).
The peculiarity of this inference problem is that some degrees of freedom of our sample are burned by the functional links on the labels. Namely, let us denote by U_{c÷h} the measure of the above symmetric difference: for a given z_m this is the random variable corresponding to the parameter R(α) for a suitable mapping from {h} to Δ. Then the twisting argument reads (with some caveats):

$$(T_\varepsilon \ge t_{U_{c \div h}} + 1) \Leftarrow (U_{c \div h} < \varepsilon) \Leftarrow (T_\varepsilon \ge t_{U_{c \div h}} + \mu) \qquad (14)$$
where t_{U_{c÷h}} is the number of actual sample points falling in c÷h (the empirical risk in the Vapnik notation [8]), T_ε is the analogous statistic for an enlargement of c÷h of measure ε, and μ is a new complexity measure directly referred to C. The threshold in the left inequality is due to the fact that h is, in its turn, a function A of a sample specification z_m, so that if A is such that the symmetric difference grows with the set of included sample points and vice versa, then (U_{c÷h} < ε) implies that any enlargement region containing c÷h must violate the label of at least one more of the sampled points at the basis of h's computation. The quantity μ is an upper bound to the number of sample points sufficient to witness a possible increase of U_{c÷h} after a new hypothesis containing c÷h has been generated. Its extension to the whole class of concepts C and class of hypotheses H is called detail D_{C,H}. Although semantically different from the VC dimension [5], when H coincides with C this complexity index is related to the latter by the following theorem:
Theorem [1]: Denoting by d_VC(C) the VC dimension of a concept class C with detail D_{C,C},

$$\frac{d_{VC}(C) - 1}{176} < D_{C,C} < d_{VC}(C) + 1 \qquad (15)$$
Theorem [2]: Assume we are given a concept class C on a space 𝔛, a sample z_m drawn from Z_m as in (13), a learning function¹ A: {z_m} → C. Consider the family of sets {c÷h} with c∈C labeling z_m, h=A(z_m) and its detail D_{C,C}=μ, misclassifying at least t' and at most t points of probability π∈(0, 1), and denote with U_{c÷h} the random variable given by the probability measure of c÷h and by F_{U_{c÷h}} its c.d.f. Then for each z_m and β∈(π, 1)

$$I_\beta(1+t', m-t') \ge F_{U_{c \div h}}(\beta) \ge I_\beta(\mu+t, m-(\mu+t)+1) \qquad (16)$$

where

$$I_\beta(\mu+t, m-(\mu+t)+1) = 1 - \sum_{i=0}^{\mu+t-1} \binom{m}{i} \beta^{i} (1-\beta)^{m-i} \qquad (17)$$

is the incomplete Beta function.
¹ Satisfying the usual regularity conditions, which are the counterpart of the well-behaved function requirement [5]. For a formal definition see [1].

Corollary: Within the same hypotheses of the above theorem:
1. the ratio between maximum and minimum numbers of examples needed to learn C with accuracy parameters 0<ε<1/8, 0<δ<1/100 is bounded by a constant [1];
2. a pair of extremes (l_i, l_s) of a confidence interval of level δ for U_{c÷h} is constituted respectively by the δ/2 quantile of the Beta distribution of parameters 1+t' and m−t', and the analogous 1−δ/2 quantile for parameters μ+t and m−(μ+t)+1 [4];
3. class complexity μ and hypothesis accuracy t linearly add [3].
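Item 2 of the corollary translates directly into a computation. The sketch below is illustrative only: the function names and argument order are assumptions, the detail is passed as an integer `mu`, and the Beta quantiles are obtained by bisection on the binomial sum (17) rather than by a library routine.

```python
from math import comb


def inc_beta(x, h, r):
    """I_x(h, r) for integer h, r >= 1, in the binomial-sum form of (17)."""
    n = h + r - 1
    return 1.0 - sum(comb(n, i) * x**i * (1.0 - x) ** (n - i) for i in range(h))


def beta_quantile(q, h, r, tol=1e-12):
    """Bisection solution of I_x(h, r) = q for x in [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if inc_beta(mid, h, r) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def risk_interval(t_low, t_up, mu, m, delta):
    """Extremes (l_i, l_s) for the measure of the symmetric difference.

    t_low, t_up: minimum and maximum numbers of misclassified sample points
    (t' and t in the theorem); mu: detail of the class; m: sample size.
    """
    if mu + t_up > m:
        raise ValueError("mu + t must not exceed the sample size m")
    l_i = beta_quantile(delta / 2.0, 1 + t_low, m - t_low)
    l_s = beta_quantile(1.0 - delta / 2.0, mu + t_up, m - (mu + t_up) + 1)
    return l_i, l_s


if __name__ == "__main__":
    # Consistent hypothesis (t' = t = 0) on m = 30 points, delta = 0.1
    for mu in (1, 4):
        print(mu, risk_interval(0, 0, mu, 30, 0.1))
```

Running the example for two values of `mu` shows the upper extreme moving up as the detail grows while the lower extreme stays put, since only the second pair of Beta parameters depends on μ.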
Example [Learning rectangles]: Consider h belonging to the class of rectangles. We move from the unidimensional case of Fig. 2 to the bidimensional case depicted in Fig. 4. Here again we give label 1 to the single coordinate u_j, j = 1, 2 (each ruled by a corresponding uniform random variable U_j) if it falls below a given threshold p_j, label 0 otherwise. Moreover we give to the point a_i of coordinates (u_1, u_2) a label equal to the product of the labels of the single coordinates. Thus the probability p_c that a point a falls in the open rectangle c bounded by the coordinate axes and the two mentioned threshold lines (for short we will henceforth refer to these rectangles as bounds' rectangles) is p_1 p_2.
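The labeling rule for bounds' rectangles is easy to simulate. The following sketch is illustrative only (function names, thresholds, number of trials and seed are assumptions): a point receives the product of its coordinate labels, and the frequency of 1-labeled points is compared with the product of the thresholds.

```python
import random


def coordinate_label(u, p):
    """Label of a single coordinate: 1 if it falls below the threshold p."""
    return 1 if u <= p else 0


def point_label(point, thresholds):
    """Label of a point: product of the labels of its coordinates."""
    label = 1
    for u, p in zip(point, thresholds):
        label *= coordinate_label(u, p)
    return label


def estimate_measure(thresholds, trials=100_000, seed=0):
    """Monte Carlo frequency of points falling in the bounds' rectangle."""
    rng = random.Random(seed)
    hits = sum(
        point_label([rng.random() for _ in thresholds], thresholds)
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    p1, p2 = 0.5, 0.6  # hypothetical thresholds
    print(estimate_measure((p1, p2)), p1 * p2)
```

With independent uniform coordinates the two printed numbers should be close, the first being an estimate and the second the exact measure p_1 p_2.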
Let us complicate the inference scheme in two ways:
1. We move from U_j to the family of uniform random variables Z_j in [0, θ] explained by the function z = θu with θ ∈ (0, +∞).
2. We maintain the same labeling rule but do not know c, i.e. the thresholds p_1 and p_2. Rather, within the class of bounds' rectangles containing all 1-labeled sample points yet excluding all 0-labeled sample points (consistent bounds' rectangles as statistics), we will identify it with the maximal one h, i.e. the one having the largest consistent edges (just before the closest 0-labeled points). Letting p'_1 and p'_2 be the lengths of these edges, we presently look for the probability p_h = p'_1 p'_2 / θ² representing the asymptotic frequency with which future points (generated with the above sampling mechanism for any θ) will fall in h.
We may imagine a whole family of sequences of domains B, each sequence pivoted on a possible rectangle. In some sequences the domain B_p̃ of measure p̃ will include the pivot, in others it will be included by it. Thus we need witnesses that our actual h computed from the actual sample constitutes a pivot included in B_p̃. But this happens if two special points (exactly the negative one preventing the rectangle from expanding on the left and the negative one preventing the rectangle from expanding upward) are included in B_p̃. Thus let us enrich the family of sequences having, for each rectangle and each possible pair of witness points, the pivot constituted by the union of the rectangle with these points. With respect to the sequence pivoted on our actual rectangle and witnesses we have that, if 2 or more negative points are included in B_p̃, the witness points are surely among them. Hence a twisting argument reads:

$$(p_h < \tilde p) \Leftarrow (k_{\tilde p} \ge k+2) \qquad (18)$$
where k_p̃ is still a specification of a Binomial random variable of parameters m (sample size) and p̃, accounting for the sample points contained in B_p̃. As for the left-hand implication, let us consider the family of bounds' rectangles. As θ is free, p_h < p̃ requires that there must be an enlargement of h whose measure, for a proper θ, exactly equals p̃. But since both edges of h are bounded by a negative point, this enlargement must contain at least one point more than h itself. Formally

$$(k_{\tilde p} \ge k+1) \Leftarrow (p_h < \tilde p) \qquad (19)$$
Putting together the two pieces of the twisting argument, we obtain the corresponding bounds on probabilities as follows:

$$P(K_{\tilde p} \ge k+1) \ge P(P_h < \tilde p) = F_{P_h}(\tilde p) \ge P(K_{\tilde p} \ge k+2) \qquad (20)$$
Generalizing our arguments to an n-dimensional space, we recognize that the number of witnessing points of expansions of the maximal hypotheses on bounds' rectangles is at most n. Hence n constitutes the detail of this class of hypotheses.
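The two bounds in (20), and their n-dimensional analogue with k+n in place of k+2, are plain Binomial tail probabilities. A minimal sketch follows, assuming that the only change in higher dimension is the replacement of 2 by the detail; the function and parameter names (`binom_tail`, `cdf_bounds`, `detail`) are hypothetical.

```python
from math import comb


def binom_tail(m, p, k):
    """P(K >= k) for K following a Binomial law of parameters m and p."""
    return sum(
        comb(m, i) * p**i * (1.0 - p) ** (m - i) for i in range(max(k, 0), m + 1)
    )


def cdf_bounds(p_tilde, k, m, detail):
    """Lower and upper bounds on F_{P_h}(p_tilde) as in (20).

    k: number of sample points inside h; detail: number of witness points
    (2 for the two-dimensional bounds' rectangles of the example).
    """
    upper = binom_tail(m, p_tilde, k + 1)
    lower = binom_tail(m, p_tilde, k + detail)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical numbers: 12 of 30 sample points inside h, detail 2
    print(cdf_bounds(0.5, 12, 30, 2))
```

The gap between the two bounds is the probability that K_p̃ takes a value from k+1 to k+detail−1, so it widens as the detail increases.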
We extended the experiment shown in Fig. 2 as follows: we built a sample of m=30 elements drawing random Y_i coordinates. To stress the spread of the confidence intervals we abandoned the uniform distribution; namely we used the sampling mechanism (U, g(u)) with g(u) = u^j for computing the j-th coordinate (namely, the first coordinate is u, the second u², etc.). Moreover, to mark the drift from the Bernoulli variable, we refer to a four-dimensional rectangle in Ψ=[0,1]^4, with respect to which the rectangle in Fig. 4(a) could represent just a projection. Then, to figure out a wide set of possible labeling mechanisms, we stored the sample point labelings according to all rectangles with vertices in a suitable discretization (steps of 1/10 on each edge) of the unitary hypercube. Then we forgot the source figures and for each labeling computed the maximal consistent bounds' rectangle h and we drew the graph in Fig. 4(b). Namely, we drew 10 samples as before, and for each sample labeling we reported on the graph the actual frequency φ and the probability p_h (analytically computed on the basis of the rectangle coordinates) of drawing a point in Ψ belonging to the guessed maximal hypothesis. On the same graph we also reported the curves describing the course of the symmetric 0.9 confidence interval for P_h with the observed frequency of falling inside the rectangle h according to (16). Finally, for comparison's sake we also drew the curves obtained for a pure Bernoulli variable. In spite of some apparent unbalancing in the figure, the percentages of points falling out of the upper and lower bound curves are approximately equal, 3.32 and 3.67% respectively, thus satisfying the 5% upper bounds used for drawing these curves. The analogous percentages, 16.2 and 0.28%, denote the inadequacy of the curves drawn for the Bernoulli distribution. The smaller values of actual versus allowed bounds trespassers can be attributed to the worst-case duty of our curves: i.e. they must guarantee a given confidence whatever the underlying distribution law is.

Fig. 4. Generating 0.9 confidence intervals for the probability P_h of a bounds' rectangle in Ψ=[0,1]^4 from a sample of 30 elements. (a) The drawn sample and one of its possible labelings in a two-dimensional projection. Bullets: 1-labeled (positive) sampled points; diamonds: 0-labeled (negative) sampled points. (b) Points: curves of the frequency φ and probability p_h of falling inside a bounds' rectangle with a close lattice of labeling functions. Dashed curves: trajectories described by the confidence interval extremes with reference to μ=1. Plain curves: trajectories described by the confidence interval extremes with reference to μ=4.
The above check on domain measures is the key action of any PAC learning task, where confidence intervals like the ones in Fig. 4(b) are the ultimate probabilistic learning target. Indeed, the distinguishing features of the above case study are the following:
1. We are building a domain h on the basis of the sampled coordinates (besides the a priori specifications that the rectangle left-lower vertex coincides with the axes origin and that the edges are parallel to the axes).
2. Though coming from independent U_i's, some sample points, those labeled by 1, share the fact of being all inside h, and those labeled by 0 vice versa.
3. The above twisting argument holds whatever the joint distribution law of the coordinates is.
The experiment in the figure confirms that the sole probabilistic consequence of these additional features in comparison to the original case study of Fig. 2 is in the bounds of the confidence region, now pushed up by the fact that four points in place of one need to witness the inclusion of h in a proper domain of measure p̃ and at least one is additionally included in an enlargement of h with this measure.
3 Conclusions
PAC learning represents a very innovative perspective in inferential statistics which relates the randomness of the sample data to the mutual structure deriving from their syntactical properties. We set up an inferential mechanism and a structural complexity index to stress this idea. Assuming a source of uniformly random data, we want to discover the function g_θ mapping these data into random variables of a given and possibly unknown distribution law. In the case of Bernoulli variables whose values are related to ancillary data, a prominent part of our inference may lie in fixing this relation, which is exactly an instance of the problem of learning Boolean functions.
In the paper we show the benefit of this contrivance in terms of its capability of studying the distribution law of the error risk, in connection with specific features of the computed hypothesis such as the detail of its class. From an operational viewpoint this benefit translates into a more favourable relation between sample size and accuracy parameters ε and δ. In particular we obtain a definite narrowing of the confidence intervals of the error risk with respect to those usually computed in the Vapnik approach.
In a more philosophical perspective, our approach provides a clear statistical rationale to the learning theory, putting the premises for new development of this theory.
References:
[1] B. Apolloni and S. Chiaravalli, PAC learning of concept classes through the boundaries of their items, Theoretical Computer Science 172, 1997, 91-120.
[2] B. Apolloni, E. Esposito, D. Malchiodi, C. Orovas, G. Palmas and J. G. Taylor, A General Framework for Learning Rules from Data, IEEE Transactions on Neural Networks, 2004, to appear.
[3] B. Apolloni and D. Malchiodi, Gaining degrees of freedom in subsymbolic learning, Theoretical Computer Science 255, 2001, 295-321.
[4] B. Apolloni, D. Malchiodi and S. Gaito, Algorithmic Inference in Machine Learning, International Series on Advanced Intelligence Vol. 5, Advanced Knowledge International, Magill, Adelaide, 2003.
[5] A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth, Learnability and the Vapnik-Chervonenkis Dimension, Journal of the ACM 36, 1989, 929-965.
[6] J. W. Tukey, Nonparametric estimation II. Statistically equivalent blocks and tolerance regions – the continuous case, Annals of Mathematical Statistics 18, 1947, 529-539.
[7] L. G. Valiant, A theory of the learnable, Communications of the ACM 27 (11), 1984, 1134-1142.
[8] V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998.