Credentials
Our understanding of this topic is based on the work of many researchers. In particular:
Rosa Arriaga
Peter Bartlett
Avrim Blum
Bhaskar DasGupta
Nadav Eiron
Barbara Hammer
David Haussler
Klaus-Uwe Höffgen
Lee Jones
Michael Kearns
Christian Kuhlman
Phil Long
Ron Rivest
Hava Siegelmann
Hans Ulrich Simon
Eduardo Sontag
Leslie Valiant
Kevin Van Horn
Santosh Vempala
Van Vu
Introduction
Neural Nets are among the most popular, effective, practical learning tools.
Yet, after almost 40 years of research, there are no efficient algorithms for learning with NN's.
WHY?
Outline of this Talk
1. Some background.
2. Survey of recent strong hardness results.
3. New efficient learning algorithms for some basic NN architectures.
The Label Prediction Problem
Formal definition: Given some domain set X, a sample S of labeled members of X is generated by some (unknown) distribution. For a next point x, predict its label.
Example: Data files of drivers. Drivers in a sample are labeled according to whether they filed an insurance claim. Will the customer you interview file a claim?
The Agnostic Learning Paradigm
Choose a Hypothesis Class H of subsets of X.
For an input sample S, find some h in H that fits S well.
For a new point x, predict a label according to its membership in h.
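This paradigm amounts to empirical risk minimization: pick the h in H with the best agreement ratio on S. A minimal Python sketch over a small, hypothetical finite class of threshold functions on the real line (the class and the sample below are illustrative assumptions, not from the talk):

```python
def erm(sample, hypotheses):
    """Return the hypothesis in H with the best agreement ratio on the sample."""
    def agreement(h):
        return sum(1 for x, y in sample if h(x) == y) / len(sample)
    return max(hypotheses, key=agreement)

# A finite class of threshold hypotheses over the line: h_t(x) = 1 iff x >= t.
H = [lambda x, t=t: int(x >= t) for t in (0.0, 1.0, 2.0, 3.0)]

# A small labeled sample (x, y); it is noisy, so no hypothesis fits perfectly.
S = [(0.5, 0), (1.2, 1), (1.8, 1), (2.5, 1), (0.2, 0), (2.9, 0)]

best = erm(S, H)
print(sum(1 for x, y in S if best(x) == y))  # -> 5 (threshold t = 1.0 wins)
```

The agnostic point is visible here: no hypothesis fits S perfectly, so the learner settles for the best agreement in the class.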
The Mathematical Justification
If H is not too rich (has small VC-dimension), then, for every h in H, the agreement ratio of h on the sample S is a good estimate of its probability of success on a new x.
The Mathematical Justification, Formally
If S is sampled i.i.d. by some D over X × {0,1}, then with probability > 1 − δ, for every h in H:
agreement ratio of h on S ≈ probability of success of h on a new x,
up to an error that shrinks as |S| grows relative to the VC-dimension of H.
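This guarantee can be illustrated numerically: for a fixed h, the agreement ratio on a large i.i.d. sample concentrates near the true probability of success. A small simulation (the distribution D and the hypothesis h below are illustrative assumptions, not from the talk):

```python
import random

random.seed(0)

def h(x):
    """A fixed hypothesis: predict 1 iff x >= 0.5."""
    return int(x >= 0.5)

def draw():
    """D over X x {0,1}: x uniform on [0,1], label = threshold at 0.5
    flipped with probability 0.1 (so h succeeds with probability 0.9)."""
    x = random.random()
    y = int(x >= 0.5) if random.random() < 0.9 else 1 - int(x >= 0.5)
    return x, y

sample = [draw() for _ in range(10000)]
agreement = sum(1 for x, y in sample if h(x) == y) / len(sample)
print(abs(agreement - 0.9) < 0.02)  # the agreement ratio lands close to 0.9
```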
A Comparison to 'Classic' PAC
PAC framework: Sample labels are consistent with some h in H. The learner's hypothesis is required to meet an absolute upper bound on its error.
Agnostic framework: No prior restriction on the sample labels. The required upper bound on the hypothesis error is only relative (to the best hypothesis in the class).
The Model Selection Issue
[Diagram: the output of the learning algorithm is compared with the best regressor for P within the class H; the total error decomposes into Approximation Error, Estimation Error, and Computational Error.]
The Big Question
Are there hypothesis classes that:
1. Are expressive (small approximation error).
2. Have small VC-dimension (small generalization error).
3. Have efficient, good approximation algorithms.
NN's are quite successful as approximators (property 1). If they are small (relative to the data size), then they also satisfy property 2. We investigate property 3 for such NN's.
The Computational Problem
For some class H of domain subsets:
Input: a finite set of {0,1}-labeled points S in R^n.
Output: some h in H that maximizes the number of correctly classified points of S.
"Old" Work
Hardness results: Blum and Rivest showed that it is NP-hard to optimize the weights of a 3-node NN. Similar hardness-of-optimization results for other classes followed. But learning can settle for less than optimization.
Efficient algorithms: known perceptron algorithms are efficient for linearly separable input data (or the image of such data under 'tamed' noise). But natural data sets are usually not separable.
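For reference, the classic perceptron update that is efficient on linearly separable data can be sketched as follows (the toy 2-D data set is an illustrative assumption):

```python
def perceptron(points, labels, epochs=100):
    """Classic perceptron: labels in {-1, +1}; returns weights w and bias b."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # Mistake: move the hyperplane toward the misclassified point.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a separating hyperplane has been found
            break
    return w, b

# A linearly separable toy set.
X = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0), (4.0, 2.0)]
Y = [-1, -1, 1, 1]
w, b = perceptron(X, Y)
print(all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0 for x, y in zip(X, Y)))
# -> True: every point ends up on the correct side
```

On separable data this loop terminates after finitely many mistakes; on non-separable data (the usual case for natural data sets, as the slide notes) it never settles, which is exactly the gap the rest of the talk addresses.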
The Focus of this Tutorial
The results mentioned above (Blum and Rivest, etc.) show that for many "natural" NN's, finding such an S-optimal h in H is NP-hard.
Are there efficient algorithms that output good approximations to the S-optimal hypothesis?
Hardness-of-Approximation Results
For each of the following classes there exists some constant s.t. approximating the best agreement rate for h in H (on a given input sample S) up to this constant ratio is NP-hard:
Monomials
Constant-width Monotone Monomials
Half-spaces
Balls
Axis-aligned Rectangles
Threshold NN's with constant 1st-layer width
(Results by BD-Eiron-Long and Bartlett-BD.)
How Significant are Such Hardness Results?
All the above results are proved via reductions from some known-to-be-hard problem.
Relevant Questions
1. Samples that are hard for one H are easy for another (a model selection issue).
2. Where do 'naturally generated' samples fall?
Data-Dependent Success
Note that the definition of success for agnostic learning is data-dependent; the success rate of the learner on S is compared to that of the best h in H.
We extend this approach to a data-dependent success definition for approximations; the required success rate is a function of the input data.
A New Success Criterion
A learning algorithm A is m-margin successful if, for every input S ⊆ R^n × {0,1},
|{(x,y) ∈ S : A(S)(x) = y}| ≥ |{(x,y) ∈ S : h(x) = y and d(h,x) > m}|
for every h ∈ H.
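For half-spaces, the right-hand side of this criterion can be computed directly, reading d(h,x) as the geometric distance from x to the separating hyperplane. A minimal Python sketch (the toy sample and the candidate half-space are illustrative assumptions):

```python
import math

def margin_correct_count(sample, w, b, m):
    """Count the points (x, y), y in {-1, +1}, that the half-space w.x + b
    classifies correctly with geometric margin strictly greater than m."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    count = 0
    for x, y in sample:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Correct side, and distance |score|/||w|| beyond the margin m.
        if y * score > 0 and abs(score) / norm > m:
            count += 1
    return count

# Toy sample in the plane and a candidate half-space x1 >= 1.
S = [((0.0, 0.0), -1), ((0.9, 0.0), -1), ((1.1, 0.0), 1), ((3.0, 0.0), 1)]
w, b = (1.0, 0.0), -1.0
print(margin_correct_count(S, w, b, 0.5))  # -> 2: only the two far points count
```

Note how raising m only shrinks the right-hand side of the criterion: points hugging the separator stop counting, which is what lets an m-margin algorithm succeed where exact optimization is hard.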
Some Intuition
If there exists some optimal h which separates with generous margins, then an m-margin algorithm must produce an optimal separator.
On the other hand, if every good separator can be degraded by small perturbations, then an m-margin algorithm can settle for a hypothesis that is far from optimal.
First Virtue of the New Measure
The m-margin requirement is a rigorous performance guarantee that can be achieved by efficient algorithms (unlike the common approximate optimization).
Another Appealing Feature of the New Criterion
It turns out that for each of the three classes analysed so far (Half-spaces, Balls and Hyper-Rectangles), there exists a critical value m_0 so that:
m-margin learnability is NP-hard for all m < m_0, while, on the other hand, for any m > m_0 there exists a poly-time m-margin learning algorithm.
A New Positive Result [B-D, Simon]
For every positive m, there is a poly-time algorithm that classifies correctly as many input points as any half-space can classify correctly with margin > m.
A Complementing Hardness Result
Unless P = NP, no algorithm can do this in time polynomial in 1/m (and in |S| and n).
Proof of the Positive Result (Outline)
We apply the following chain of reductions:
Best Separating Hyperplane
→ Best Separating Homogeneous Hyperplane
→ Densest Hemisphere (unsupervised input)
→ Densest Open Ball

The Densest Open Ball Problem
Input: a finite set P of points on the unit sphere S^(n−1).
Output: an open ball B of radius 1 so that |B ∩ P| is maximized.
Algorithms for the Densest Open Ball Problem
Alg. 1. For every x_1, …, x_n ∈ P,
• find the center Z(x_1, …, x_n) of their minimal enclosing ball,
• check |B[Z(x_1, …, x_n), 1] ∩ P|.
Output the ball with maximum intersection with P.
Running time: ~|P|^n, exponential in n!
Another Algorithm (for the Densest Open Ball Problem)
Fix a parameter k << n.
Alg. 2. Apply Alg. 1 only for subsets of size ≤ k, i.e.,
for every x_1, …, x_k ∈ P,
• find the center Z(x_1, …, x_k) of their minimal enclosing ball,
• check |B[Z(x_1, …, x_k), 1] ∩ P|.
Output the ball with maximum intersection with P.
Running time: ~|P|^k.
But, does it output a good hypothesis?
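A sketch of Alg. 2 in Python. As a simplification, the centroid of each subset stands in for the minimal-enclosing-ball center Z (computing the true Z requires, e.g., Welzl's algorithm); the toy point set is an illustrative assumption:

```python
from itertools import combinations

def approx_center(points):
    """Simplification: use the centroid in place of the minimal-enclosing-ball
    center Z from the talk."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def densest_ball_k(P, k, radius=1.0):
    """Alg. 2 sketch: try the (approximate) center of every subset of size
    <= k and return the best count of points of P inside the open ball."""
    best = 0
    for size in range(1, k + 1):
        for subset in combinations(P, size):
            c = approx_center(subset)
            inside = sum(
                1 for p in P
                if sum((pi - ci) ** 2 for pi, ci in zip(p, c)) < radius ** 2
            )
            best = max(best, inside)
    return best

# Toy point set: a tight cluster of three points plus one far-away point.
P = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0)]
print(densest_ball_k(P, k=2))  # -> 3: the cluster fits in one unit ball
```

The loop structure makes the ~|P|^k running time visible: the outer enumeration ranges over all subsets of size at most k.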
Our Core Mathematical Result
The following is a local approximation result. It shows that computations from local data (k-size subsets) can approximate global computations, with a precision guarantee depending only on the local parameter k.
Theorem: For every k < n and every x_1, …, x_n on the unit sphere S^(n−1), there exists a subset of size k such that the center of its minimal enclosing ball approximates the center of the minimal enclosing ball of all n points, with error depending only on k (vanishing as k grows).
The Resulting Perceptron Algorithm
On input S, consider all k-size sub-samples.
For each such sub-sample, find its largest-margin separating hyperplane.
Among all the (~|S|^k) resulting hyperplanes, choose the one with best performance on S.
(The choice of k is a function of the desired margin m; k ~ 1/m².)
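The sub-sampling idea can be sketched for the special case k = 2: for a pair with opposite labels, the largest-margin separator of that pair alone is its perpendicular bisector, and we keep the candidate that performs best on all of S. This pair-and-bisector version is a simplification of the general k-subset, max-margin step; the data set is an illustrative assumption:

```python
from itertools import combinations

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def subsample_separator(S):
    """For every pair of points with opposite labels, take the perpendicular
    bisector (the max-margin separator of that pair); return the candidate
    hyperplane with the best agreement on all of S, and its correct count."""
    best, best_correct = None, -1
    for (x1, y1), (x2, y2) in combinations(S, 2):
        if y1 == y2:
            continue
        pos, neg = (x1, x2) if y1 == 1 else (x2, x1)
        # Normal points from the negative to the positive example; the
        # hyperplane passes through the midpoint of the pair.
        w = tuple(p - q for p, q in zip(pos, neg))
        mid = tuple((p + q) / 2 for p, q in zip(pos, neg))
        b = -sum(wi * mi for wi, mi in zip(w, mid))
        correct = sum(1 for x, y in S if classify(w, b, x) == y)
        if correct > best_correct:
            best, best_correct = (w, b), correct
    return best, best_correct

S = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((3.0, 3.0), 1), ((4.0, 2.0), 1)]
(w, b), correct = subsample_separator(S)
print(correct)  # -> 4: some bisector classifies all four points correctly
```

The enumeration over pairs mirrors the ~|S|^k candidate count of the general algorithm with k = 2.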
A Different, Randomized, Algorithm
Avrim Blum noticed that the 'randomized projection' algorithm of Rosa Arriaga and Santosh Vempala '99 achieves, with high probability, a similar performance as our algorithm.
Directions for Further Research
Can similar efficient algorithms be derived for more complex NN architectures?
How well do the new algorithms perform on real data sets?
Can the 'local approximation' results be extended to more geometric functions?