The Computational Complexity of Learning with Neural Networks


Credentials

Our understanding of this topic is based on the work of many researchers. In particular:

Rosa Arriaga
Peter Bartlett
Avrim Blum
Bhaskar DasGupta
Nadav Eiron
Barbara Hammer
David Haussler
Klaus Hoffgen
Lee Jones
Michael Kearns
Christian Kuhlman
Phil Long
Ron Rivest
Hava Siegelmann
Hans Ulrich Simon
Eduardo Sontag
Leslie Valiant
Kevin Van Horn
Santosh Vempala
Van Vu



Introduction

Neural Nets are among the most popular, effective, and practical learning tools.

Yet, after almost 40 years of research, there are no efficient algorithms for learning with NN's.

WHY?

Outline of this Talk

1. Some background.

2. A survey of recent strong hardness results.

3. New efficient learning algorithms for some basic NN architectures.




The Label Prediction Problem

Formal definition:
Given some domain set X, a sample S of labeled members of X is generated by some (unknown) distribution. For the next point x, predict its label.

Example:
Data files of drivers. Drivers in a sample are labeled according to whether they filed an insurance claim. Will the customer you interview file a claim?

The Agnostic Learning Paradigm

Choose a Hypothesis Class H of subsets of X.

For an input sample S, find some h in H that fits S well.

For a new point x, predict a label according to its membership in h.


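To make the paradigm concrete, here is a minimal sketch (my own illustration, not from the talk), assuming a toy hypothesis class of threshold functions on the real line; all names in it are invented for this example.

```python
# A minimal sketch of the agnostic learning paradigm, assuming the
# hypothesis class H consists of one-dimensional thresholds
# h_t(x) = 1 iff x >= t.  All names here are illustrative.

def fit_threshold(sample):
    """Step 2: pick the threshold in H with the best agreement on S."""
    candidates = [float("-inf")] + [x for x, _ in sample]
    def agreement(t):
        return sum(1 for x, y in sample if (x >= t) == (y == 1))
    return max(candidates, key=agreement)

def predict(t, x):
    """Step 3: label a new point by its membership in the chosen hypothesis."""
    return 1 if x >= t else 0

S = [(0.5, 0), (1.1, 0), (1.9, 0), (2.3, 1), (3.0, 1)]
t = fit_threshold(S)          # Step 1 was fixing H = threshold functions
print(t, predict(t, 2.0))     # chosen threshold 2.3, prediction 0
```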
The Mathematical Justification

If H is not too rich (has small VC-dimension), then, for every h in H, the agreement ratio of h on the sample S is a good estimate of its probability of success on a new x.



The Mathematical Justification - Formally

If S is sampled i.i.d. by some D over X × {0,1}, then with probability > 1 - δ, for every h in H,

  | Pr_D[h(x) = y]  -  |{(x,y) ∈ S : h(x) = y}| / |S| |  <  ε,

i.e., the agreement ratio of h on S is within ε of its probability of success on a new point, where ε depends on VCdim(H), |S| and δ.

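For concreteness, one standard quantitative form of this uniform-convergence guarantee is given below (c is a generic constant; the exact constants used on the original slide are not recoverable from this copy).

```latex
% With probability at least 1 - \delta over the draw of S, uniformly over H:
\Pr_{S \sim D^{|S|}}\!\left[\;\forall h \in H:\;
  \left|\, \Pr_{(x,y)\sim D}[h(x)=y]
    - \frac{|\{(x,y)\in S : h(x)=y\}|}{|S|} \,\right|
  \;\le\; c\,\sqrt{\frac{\mathrm{VCdim}(H)\,\ln|S| + \ln(1/\delta)}{|S|}}
  \;\right] \;\ge\; 1-\delta
```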
A Comparison to ‘Classic’ PAC

PAC framework:
Sample labels are consistent with some h in H.
The learner's hypothesis is required to meet an absolute upper bound on its error.

Agnostic framework:
No prior restriction on the sample labels.
The required upper bound on the hypothesis error is only relative (to the best hypothesis in the class).

The Model Selection Issue

[Figure: the error of the learning algorithm's output, relative to the best regressor for P, decomposes into the approximation error of the class H, an estimation error, and a computational error.]

The Big Question

Are there hypothesis classes that are:

1. Expressive (small approximation error).

2. Of small VC-dimension (small generalization error).

3. Equipped with efficient good-approximation algorithms.

NN's are quite successful as approximators (property 1).

If they are small (relative to the data size), then they also satisfy property 2.

We investigate property 3 for such NN's.


The Computational Problem

For some class H of domain subsets:

Input: A finite set S of {0,1}-labeled points in R^n.

Output: Some h in H that maximizes the number of correctly classified points of S.



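As an illustration (not part of the original talk), here is the objective being maximized when H is the class of half-spaces in R^n; the search over all (w, b) is exactly the computationally hard part.

```python
import numpy as np

# Illustrative objective when H is the class of half-spaces
# h_{w,b}(x) = 1 iff <w, x> + b >= 0.

def agreement_count(w, b, S):
    """Number of labeled points (x, y) in S that h_{w,b} classifies correctly."""
    X = np.array([x for x, _ in S])            # points, shape (|S|, n)
    y = np.array([label for _, label in S])    # labels in {0, 1}
    predictions = (X @ w + b >= 0).astype(int)
    return int((predictions == y).sum())

# Tiny example in R^2 with a (hypothetical) separating direction w = (0, 1).
S = [((0.0, 1.0), 1), ((1.0, 2.0), 1), ((0.0, -1.0), 0), ((-2.0, -1.0), 0)]
print(agreement_count(np.array([0.0, 1.0]), 0.0, S))   # -> 4
```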
“Old” Work

Hardness results: Blum and Rivest showed that it is NP-hard to optimize the weights of a 3-node NN. Similar hardness-of-optimization results for other classes followed.

But learning can settle for less than optimization.

Efficient algorithms: known perceptron algorithms are efficient for linearly separable input data (or the image of such data under ‘tamed’ noise).

But natural data sets are usually not separable.

The Focus of this Tutorial

The results mentioned above (Blum and Rivest etc.) show that for many “natural” NN's, finding such an S-optimal h in H is NP-hard.

Are there efficient algorithms that output good approximations to the S-optimal hypothesis?





Hardness-of-Approximation Results

For each of the following classes there exists some constant such that approximating the best agreement rate for h in H (on a given input sample S) up to this constant ratio is NP-hard:

Monomials
Constant-width Monotone Monomials
Half-spaces
Balls
Axis-aligned Rectangles
Threshold NN's with constant 1st-layer width

(BD-Eiron-Long; Bartlett-BD)

How Significant are Such Hardness Results?

All the above results are proved via reductions from some known-to-be-hard problem.

Relevant questions:

1. Samples that are hard for one H are easy for another (a model selection issue).

2. Where do ‘naturally generated’ samples fall?

Data-Dependent Success

Note that the definition of success for agnostic learning is data-dependent: the success rate of the learner on S is compared to that of the best h in H.

We extend this approach to a data-dependent success definition for approximations: the required success rate is a function of the input data.
A New Success Criterion

A learning algorithm A is m-margin successful if, for every input S ⊆ R^n × {0,1},

  |{(x,y) ∈ S : A(S)(x) = y}|  ≥  |{(x,y) ∈ S : h(x) = y and d(h, x) > m}|

for every h ∈ H.

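An illustrative sketch of this criterion for the half-space case (my own code, not from the talk), taking d(h, x) to be the Euclidean distance of x from h's separating hyperplane and assuming w is nonzero:

```python
import numpy as np

def correct_count(w, b, S):
    """|{(x, y) in S : h_{w,b}(x) = y}|"""
    X = np.array([x for x, _ in S]); y = np.array([l for _, l in S])
    return int(((X @ w + b >= 0).astype(int) == y).sum())

def margin_correct_count(w, b, S, m):
    """|{(x, y) in S : h_{w,b}(x) = y and d(h, x) > m}|"""
    X = np.array([x for x, _ in S]); y = np.array([l for _, l in S])
    scores = X @ w + b
    correct = (scores >= 0).astype(int) == y
    far = np.abs(scores) / np.linalg.norm(w) > m     # distance to the hyperplane
    return int((correct & far).sum())

def satisfies_criterion_on(S, m, algorithm_output, comparison_hypotheses):
    """Check the m-margin success inequality on one sample S against a
    (finite, illustrative) set of competing half-spaces (w, b)."""
    w_a, b_a = algorithm_output
    return all(correct_count(w_a, b_a, S) >= margin_correct_count(w, b, S, m)
               for w, b in comparison_hypotheses)
```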
Some Intuition

If there exists some optimal h which separates with generous margins, then an m-margin algorithm must produce an optimal separator.

On the other hand, if every good separator can be degraded by small perturbations, then an m-margin algorithm can settle for a hypothesis that is far from optimal.

First Virtue of the New Measure

The m-margin requirement is a rigorous performance guarantee that can be achieved by efficient algorithms (unlike the common approximate optimization).





Another Appealing Feature of the New Criterion

It turns out that for each of the three classes analysed so far (Half-spaces, Balls and Hyper-Rectangles), there exists a critical value m_0 so that:

m-margin learnability is NP-hard for all m ≤ m_0,

while, on the other hand,

for any m > m_0, there exists a poly-time m-margin learning algorithm.


A New Positive Result [B-D, Simon]

For every positive m, there is a poly-time algorithm that classifies correctly as many input points as any half-space can classify correctly with margin > m.









A Complementing Hardness Result

Unless P = NP, no algorithm can do this in time polynomial in 1/m (and in |S| and n).


Proof of the Positive Result (Outline)

We apply the following chain of reductions:

Best Separating Hyperplane
Best Separating Homogeneous Hyperplane
Densest Hemisphere (unsupervised input)
Densest Open Ball

The Densest Open Ball Problem

Input: A finite set P of points on the unit sphere S^{n-1}.

Output: An open ball B of radius 1 so that |B ∩ P| is maximized.

[Figure: the unit sphere S^{n-1} and an open ball B.]


Algorithms for the Densest Open Ball Problem

Alg. 1:

For every x_1, ..., x_n ∈ P, find the center Z(x_1, ..., x_n) of their minimal enclosing ball.

Check |B[Z(x_1, ..., x_n), 1] ∩ P|.

Output the ball with maximum intersection with P.

Running time: ~|P|^n, i.e., exponential in n.




Another Algorithm (for the Densest Open Ball Problem)

Fix a parameter k << n.

Alg. 2: Apply Alg. 1 only to subsets of size ≤ k, i.e.,

For every x_1, ..., x_k ∈ P, find the center Z(x_1, ..., x_k) of their minimal enclosing ball.

Check |B[Z(x_1, ..., x_k), 1] ∩ P|.

Output the ball with maximum intersection with P.

Running time: ~|P|^k

But, does it output a good hypothesis?

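A minimal Python sketch of Alg. 2 (my own illustration, not code from the talk); it replaces the exact minimal-enclosing-ball computation with a simple Badoiu-Clarkson style iteration, so the centers it uses are only approximate.

```python
import itertools
import numpy as np

def approx_enclosing_center(points, iters=100):
    """Approximate center of the minimal enclosing ball of `points`
    (repeatedly step toward the currently farthest point)."""
    c = points[0].astype(float).copy()
    for i in range(1, iters + 1):
        farthest = max(points, key=lambda p: np.linalg.norm(p - c))
        c += (farthest - c) / (i + 1)
    return c

def densest_open_ball(P, k, radius=1.0):
    """Alg. 2: for every subset of P of size <= k, take the (approximate)
    center of its minimal enclosing ball, count how many points of P fall
    in the open unit ball around it, and return the best center found."""
    P = [np.asarray(p, dtype=float) for p in P]
    best_center, best_count = None, -1
    for size in range(1, k + 1):
        for subset in itertools.combinations(P, size):
            c = approx_enclosing_center(list(subset))
            count = sum(np.linalg.norm(p - c) < radius for p in P)  # open ball
            if count > best_count:
                best_center, best_count = c, count
    return best_center, int(best_count)
```

Taking k = n essentially recovers Alg. 1 (up to the approximation of the center); the point of Alg. 2 is that the loop runs over only ~|P|^k subsets.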

Our Core Mathematical Result

The following is a local approximation result. It shows that computations from local data (k-size subsets) can approximate global computations, with a precision guarantee that depends only on the local parameter k.

Theorem: For every k < n and every x_1, ..., x_n on the unit sphere S^{n-1}, there exists a subset ... so that ...









The Resulting Perceptron Algorithm

On input S, consider all k-size sub-samples.

For each such sub-sample, find its largest-margin separating hyperplane.

Among all the (~|S|^k) resulting hyperplanes, choose the one with the best performance on S.

(The choice of k is a function of the desired margin m, k ~ 1/m^2.)



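A minimal sketch of this procedure (my own illustration, assuming scikit-learn is available; a linear SVM with a very large C stands in for the exact largest-margin separator of each sub-sample):

```python
import itertools
import numpy as np
from sklearn.svm import SVC   # assumption: scikit-learn is installed

def k_subsample_perceptron(X, y, k):
    """For every k-size sub-sample containing both labels, fit a (nearly)
    hard-margin linear separator and keep the one that classifies the most
    points of the full sample S correctly."""
    best_clf, best_score = None, -1
    for idx in itertools.combinations(range(len(X)), k):
        idx = list(idx)
        if len(set(y[idx])) < 2:              # the SVM needs both classes
            continue
        clf = SVC(kernel="linear", C=1e6)     # large C ~ hard margin
        clf.fit(X[idx], y[idx])
        score = int((clf.predict(X) == y).sum())
        if score > best_score:
            best_clf, best_score = clf, score
    return best_clf, best_score
```

Here k would be chosen as a function of the desired margin m, roughly proportional to 1/m^2 as on the slide above.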

A Different, Randomized, Algorithm

Avrim Blum noticed that the ‘randomized projection’ algorithm of Rosa Arriaga and Santosh Vempala '99 achieves, with high probability, a similar performance to our algorithm.





Directions for Further Research

Can similar efficient algorithms be derived for more complex NN architectures?

How well do the new algorithms perform on real data sets?

Can the ‘local approximation’ results be extended to more geometric functions?