# The Computational Complexity of Learning with Neural Networks

Oct 19, 2013

Credentials

Our understanding of this topic is based on the work of many researchers. In particular:

Rosa Arriaga

Peter Bartlett

Avrim Blum

Bhaskar DasGupta

Nadav Eiron

Barbara Hammer

David Haussler

Klaus Höffgen

Lee Jones

Michael Kearns

Christian Kuhlmann

Phil Long

Ron Rivest

Hava Siegelmann

Hans Ulrich Simon

Eduardo Sontag

Leslie Valiant

Kevin Van Horn

Santosh Vempala

Van Vu

Introduction

Neural nets are the most popular, effective, and practical learning tools.

Yet, after almost 40 years of research, there are no efficient algorithms for learning with NNs.

Why?

Outline of this Talk

1. Some background.
2. Survey of recent strong hardness results.
3. New efficient learning algorithms for some basic NN architectures.

The Label Prediction Problem

Formal Definition

Given some domain set X.

A sample S of labeled members of X is generated by some (unknown) distribution.

For a next point x, predict its label.

Example

Data files of drivers.

Drivers in a sample are labeled according to whether they filed an insurance claim.

Will the customer you interview file a claim?

The Agnostic Learning Paradigm

Choose a hypothesis class H of subsets of X.

For an input sample S, find some h in H that fits S well.

For a new point x, predict a label according to its membership in h.
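The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the domain is taken to be the real line, the hypothesis class is the hypothetical class of threshold sets {x : x >= t}, and "fits S well" is read as "has the best agreement ratio on S". The names `agreement_ratio`, `threshold_class`, and `fit` are made up for this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[Tuple[float, int]]      # labeled points (x, y), y in {0, 1}
Hypothesis = Callable[[float], int]       # membership indicator of a subset h of X


def agreement_ratio(h: Hypothesis, sample: Sample) -> float:
    """Fraction of the sample on which h predicts the given label."""
    return sum(1 for x, y in sample if h(x) == y) / len(sample)


def threshold_class(sample: Sample) -> List[Hypothesis]:
    """Hypothetical class H: sets of the form {x : x >= t}.

    Thresholds at the sample points, plus one above all of them, are enough:
    any other threshold labels the sample like one of these.
    """
    cuts = sorted({x for x, _ in sample})
    cuts.append(cuts[-1] + 1.0)
    return [lambda x, t=t: int(x >= t) for t in cuts]


def fit(sample: Sample) -> Hypothesis:
    """Return some h in H with the best agreement ratio on the sample."""
    return max(threshold_class(sample), key=lambda h: agreement_ratio(h, sample))


# The labels are not consistent with any threshold (agnostic setting).
S = [(0.1, 0), (0.4, 0), (0.5, 1), (0.7, 0), (0.9, 1), (1.2, 1)]
h = fit(S)
print(agreement_ratio(h, S))   # best agreement ratio a threshold reaches on S
print(h(1.0))                  # predicted label for a new point
```

For this one-dimensional class the search is trivial; the rest of the talk concerns classes where finding a well-fitting h is computationally hard.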

The Mathematical Justification

If H is not too rich (has small VC-dimension) then, for every h in H, the agreement ratio of h on the sample S is a good estimate of its probability of success on a new x.

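The Mathematical Justification - Formally

If S is sampled i.i.d. by some D over $X \times \{0,1\}$ then, with probability $> 1 - \delta$, for every h in H the following two quantities differ by only a small amount, which depends on the VC-dimension of H, on $\delta$, and on $|S|$:

Agreement ratio: $|\{(x,y) \in S : h(x) = y\}| \,/\, |S|$

Probability of success: $\Pr_{(x,y) \sim D}[h(x) = y]$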
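One standard way to write this guarantee is the uniform convergence bound below. The slide's own formula is not available here, so the exact form, the universal constant $c$, and the shorthand $d = \mathrm{VCdim}(H)$ are assumptions taken from the usual textbook statement, not from the source.

```latex
\Pr_{S \sim D^{|S|}}\left[\, \forall h \in H:\;
  \left| \frac{|\{(x,y) \in S : h(x) = y\}|}{|S|}
       - \Pr_{(x,y) \sim D}\bigl[h(x) = y\bigr] \right|
  \le c \sqrt{\frac{d + \ln(1/\delta)}{|S|}} \,\right] \ge 1 - \delta
```

The first term inside the absolute value is the agreement ratio and the second is the probability of success; the right-hand side shrinks as the sample grows relative to the VC-dimension.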

A Comparison to ‘Classic’ PAC

| PAC framework | Agnostic framework |
| --- | --- |
| Sample labels are consistent with some h in H. | No prior restriction on the sample labels. |
| The learner's hypothesis is required to meet an absolute upper bound on its error. | The required upper bound on the hypothesis error is only relative (to the best hypothesis in the class). |

The Model Selection Issue

[Figure: the class H, the best regressor for P, and the output of the learning algorithm; the labeled gaps are the approximation error, the estimation error, and the computational error.]

The Big Question

Are there hypothesis classes that:

1. Are expressive (small approximation error).
2. Have small VC-dimension (small generalization error).
3. Have efficient good-approximation algorithms.

NNs are quite successful as approximators (property 1). If they are small (relative to the data size) then they also satisfy property 2.

We investigate property 3 for such NNs.

The Computational Problem

For some class H of domain subsets:

Input: A finite set S of {0,1}-labeled points in $\mathbb{R}^n$.

Output: Some h in H that maximizes the number of correctly classified points of S.

“Old” Work

Hardness results: Blum and Rivest showed that it is NP-hard to optimize the weights of a 3-node NN. Similar hardness-of-optimization results for other classes followed.

But learning can settle for less than optimization.

Efficient algorithms: Known perceptron algorithms are efficient for linearly separable input data (or the image of such data under ‘tamed’ noise).

But natural data sets are usually not separable.

The Focus of this Tutorial

The results mentioned above (Blum and Rivest etc.) show that, for many “natural” NNs, finding such an S-optimal h in H is NP-hard.

Are there efficient algorithms that output good approximations to the S-optimal hypothesis?

Hardness-of-Approximation Results

For each of the following classes there exists some constant such that approximating the best agreement rate for h in H (on a given input sample S) up to this constant ratio is NP-hard:

Monomials

Constant-width monotone monomials

Half-spaces

Balls

Axis-aligned rectangles

Threshold NNs with constant first-layer width

(Results of BD-Eiron-Long and Bartlett-BD.)

How Significant are Such Hardness Results?

All the above results are proved via reductions from some known-to-be-hard problem.

Relevant Questions

1. Samples that are hard for one H are easy for another (a model selection issue).
2. Where do ‘naturally generated’ samples fall?

Data-Dependent Success

Note that the definition of success for agnostic learning is data-dependent: the success rate of the learner on S is compared to that of the best h in H.

We extend this approach to a data-dependent success definition for approximations: the required success rate is a function of the input data.

A New Success Criterion

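A learning algorithm A is μ-margin successful if, for every input $S \subseteq \mathbb{R}^n \times \{0,1\}$,

$$|\{(x,y) \in S : A(S)(x) = y\}| \;\ge\; |\{(x,y) \in S : h(x) = y \text{ and } d(h,x) > \mu\}|$$

for every $h \in H$. Here $d(h, x)$ is the distance of x from the boundary of h.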
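To make the two sides of this criterion concrete, here is a minimal Python sketch for the case where h is a half-space. It assumes that d(h, x) is the Euclidean distance from x to the separating hyperplane and represents a half-space by a hypothetical pair `(w, b)`; the function names and the margin argument `mu` are illustrative, not from the source.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[Sequence[float], int]   # (x, y) with y in {0, 1}


def signed_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def correct_count(w: Sequence[float], b: float, sample: Sequence[Point]) -> int:
    """Left-hand side: points labeled correctly by the half-space {x : w.x + b > 0}."""
    return sum(1 for x, y in sample if int(signed_distance(w, b, x) > 0) == y)


def margin_count(w: Sequence[float], b: float, sample: Sequence[Point], mu: float) -> int:
    """Right-hand side: points labeled correctly AND farther than mu from the boundary."""
    count = 0
    for x, y in sample:
        dist = signed_distance(w, b, x)
        if int(dist > 0) == y and abs(dist) > mu:
            count += 1
    return count


S = [((2.0, 0.0), 1), ((0.1, 0.0), 1), ((-1.0, 0.0), 0), ((-0.2, 1.0), 1)]
w, b = (1.0, 0.0), 0.0
print(correct_count(w, b, S))      # plain agreement of this half-space on S
print(margin_count(w, b, S, 0.5))  # agreement counted only outside the margin
```

A learner meets the criterion on S if the plain agreement count of its output is at least `margin_count` of every half-space in the class, so points close to a competitor's boundary do not count in that competitor's favour.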

Some Intuition

If there exists some optimal h which separates with generous margins, then a μ-margin algorithm must produce an optimal separator.

On the other hand, if every good separator can be degraded by small perturbations, then a μ-margin algorithm can settle for a hypothesis that is far from optimal.

First Virtue of the New Measure

The μ-margin requirement is a rigorous performance guarantee that can be achieved by efficient algorithms (unlike the common approximate optimization).

Another Appealing Feature of the New Criterion

It turns out that for each of the three classes analysed so far (Half-spaces, Balls and Hyper-Rectangles), there exists a critical value $\mu_0$ so that:

μ-margin learnability is NP-hard for all $\mu \le \mu_0$,

while, on the other hand, for any $\mu > \mu_0$, there exists a poly-time μ-margin learning algorithm.

A New Positive Result [B-D, Simon]

For every positive μ, there is a poly-time algorithm that classifies correctly as many input points as any half-space can classify correctly with margin > μ.

A Complementing Hardness Result

Unless P = NP, no algorithm can do this in time polynomial in 1/μ (and in |S| and n).

Proof of the Positive Result (Outline)

We apply the following chain of reductions:

1. Best Separating Hyperplane
2. Best Separating Homogeneous Hyperplane
3. Densest Hemisphere (unsupervised input)
4. Densest Open Ball

The Densest Open Ball Problem

Input: A finite set P of points on the unit sphere $S^{n-1}$.

Output: An open ball B of radius 1 so that $|B \cap P|$ is maximized.

[Figure: the unit sphere $S^{n-1}$ and an open ball B.]

Algorithms for the Densest Open Ball Problem

Alg. 1. For every $x_1, \dots, x_n \in P$, find the center of their minimal enclosing ball, $Z(x_1, \dots, x_n)$.

Check $|B[Z(x_1, \dots, x_n), 1] \cap P|$.

Output the ball with maximum intersection with P.

Running time: $\sim |P|^n$, exponential in n!

Another Algorithm (for the Densest Open Ball Problem)

Fix a parameter $k \ll n$.

Alg. 2. Apply Alg. 1 only for subsets of size $\le k$, i.e.,

For every $x_1, \dots, x_k \in P$, find the center of their minimal enclosing ball, $Z(x_1, \dots, x_k)$.

Check $|B[Z(x_1, \dots, x_k), 1] \cap P|$.

Output the ball with maximum intersection with P.

Running time: $\sim |P|^k$

But does it output a good hypothesis?
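The enumeration in Alg. 2 can be sketched as follows. This is a rough illustration, not the talk's implementation: the source does not say how the minimal-enclosing-ball center Z is computed, so the sketch substitutes the Badoiu-Clarkson iteration, which only approximates that center. The names `approx_meb_center` and `densest_open_ball` and the `iters` parameter are assumptions of the sketch.

```python
import itertools
import math
from typing import Sequence, Tuple

Vec = Tuple[float, ...]


def dist(a: Vec, b: Vec) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def approx_meb_center(points: Sequence[Vec], iters: int = 200) -> Vec:
    """Approximate minimal-enclosing-ball center (Badoiu-Clarkson iteration).

    Start at one of the points and repeatedly step toward the farthest
    point, with step size 1 / (i + 1).
    """
    c = list(points[0])
    for i in range(1, iters + 1):
        far = max(points, key=lambda p: dist(tuple(c), p))
        c = [ci + (fi - ci) / (i + 1) for ci, fi in zip(c, far)]
    return tuple(c)


def densest_open_ball(P: Sequence[Vec], k: int) -> Tuple[Vec, int]:
    """Try the (approximate) center of every subset of size <= k and return
    (center, count) for the open radius-1 ball that covers the most points of P."""
    best_center, best_count = None, -1
    for size in range(1, k + 1):
        for subset in itertools.combinations(P, size):
            z = approx_meb_center(subset)
            count = sum(1 for p in P if dist(z, p) < 1.0)
            if count > best_count:
                best_center, best_count = z, count
    return best_center, best_count


# Four points on the unit circle; the first and last are antipodal,
# so no open ball of radius 1 can contain all four.
angles = [0, 30, 60, 180]
P = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]
center, count = densest_open_ball(P, k=2)
print(count)
```

The loop visits on the order of |P|^k subsets, matching the running time stated above; whether the best of these local centers is close to the global optimum is the question the next slide addresses.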

Our Core Mathematical Result

The following is a local approximation result. It shows that computations from local data (k-size subsets) can approximate global computations, with a precision guarantee depending only on the local parameter, k.

Theorem: For every $k < n$ and $x_1, \dots, x_n$ on the unit sphere $S^{n-1}$, there exists a subset [formula missing] so that [formula missing].

The Resulting Perceptron Algorithm

On input S, consider all k-size sub-samples.

For each such sub-sample, find its largest-margin separating hyperplane.

Among all the ($\sim |S|^k$) resulting hyperplanes, choose the one with best performance on S.

(The choice of k is a function of the desired margin μ: $k \sim 1/\mu^2$.)
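A quick calculation shows how the number of candidate hyperplanes behaves. The sketch assumes that the sub-sample size grows like 1/μ² for margin parameter μ and, purely for illustration, takes the hidden constant to be 1; `subsample_size`, `num_subsamples`, and the constant `c` are hypothetical names, not from the source.

```python
import math


def subsample_size(mu: float, c: float = 1.0) -> int:
    """Hypothetical choice k = ceil(c / mu^2); the constant c is an assumption."""
    return math.ceil(c / (mu * mu))


def num_subsamples(sample_size: int, k: int) -> int:
    """Number of k-size sub-samples of a sample; at most sample_size ** k."""
    return math.comb(sample_size, k)


for mu in (0.5, 0.25):
    k = subsample_size(mu)
    # For fixed mu the count is polynomial in the sample size,
    # but halving mu quadruples the exponent k.
    print(mu, k, num_subsamples(1000, k))
```

This is consistent with the complementing hardness result: the work is polynomial in |S| for each fixed margin, but not polynomial in 1/μ.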

A Different, Randomized, Algorithm

Avrim Blum noticed that the ‘randomized projection’ algorithm of Rosa Arriaga and Santosh Vempala ’99 achieves, with high probability, a performance similar to that of our algorithm.

Directions for Further Research

Can similar efficient algorithms be derived for more complex NN architectures?

How well do the new algorithms perform on real data sets?

Can the ‘local approximation’ results be extended to more geometric functions?