
A Quarter-Century of Efficient Learnability

Rocco Servedio
Columbia University

Valiant 60th Birthday Symposium
Bethesda, Maryland
May 30, 2009

1984

…and of course: Probably Approximately Correct learning [Valiant84]


Concept class C of Boolean functions over domain X (typically X = {0,1}^n or R^n)

[Valiant84] presents a range of learning models and oracles.

Unknown and arbitrary distribution D over X; D models the (possibly complex) world.

Unknown target concept c in C, to be learned from examples.

Learner has access to i.i.d. draws from D labeled according to c:
(x_1, c(x_1)), (x_2, c(x_2)), …, (x_m, c(x_m)), where each x_i belongs to X and is drawn i.i.d. from D.

PAC learning concept class C

Learner's goal: come up with a hypothesis that will have high accuracy on future examples.

For any target function c in C and for any distribution D over X, with probability at least 1 - δ the learner outputs a hypothesis h that is ε-accurate w.r.t. D.

Efficiently

Algorithm must be computationally efficient: should run in time poly(n, 1/ε, 1/δ).
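Putting the pieces together, the PAC requirement can be stated as follows (a standard formalization; the notation here is my gloss on the slides, not a verbatim quote):

\[
\Pr_{x_1,\dots,x_m \sim D}\Big[\; \Pr_{x \sim D}\big[h(x) \neq c(x)\big] \le \varepsilon \;\Big] \;\ge\; 1-\delta,
\qquad \text{running time } \mathrm{poly}(n,\, 1/\varepsilon,\, 1/\delta).
\]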



From [Valiant84]:

"The results of learnability theory would then indicate the maximum granularity of the single concepts that can be acquired without programming."

"This paper attempts to explore the limits of what is learnable as allowed by algorithmic complexity…. The identification of these limits is a major goal of the line of work proposed in this paper."

The PAC model, and its variants, provide a clean theoretical framework for studying the computational complexity of learning problems.

So, what can be learned efficiently?

In the rest of the 1980s,




Valiant & colleagues gave remarkable results on the abilities and
limitations of computationally efficient learning algorithms.


This work introduced research directions and questions that continue

to be intensively studied to this day.

25 years of efficient learnability

Rest of talk: survey some

positive results (algorithms)

negative results (two flavors of hardness results)

(He didn't just ask the question "what can be learned efficiently"; he did a great deal towards answering it. We highlight some of these contributions and how the field has evolved since then.)




Positive results: learning k-DNF

Theorem [Valiant84]: k-DNF is learnable in polynomial time for any k = O(1).

e.g. k = 2: view a k-DNF as a disjunction over "metavariables" (conjunctions of at most k literals), and learn that disjunction using elimination. (A small code sketch of this elimination idea follows below.)

25 years later: improving this beyond constant k is still a major open question!

Much has been learned in trying for this improvement…
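As an illustration, here is a minimal sketch of the metavariable-plus-elimination idea for k-DNF (my own code, not from the slides; the example format and helper names are assumptions):

```python
from itertools import combinations

def learn_kdnf(examples, n, k):
    """Elimination algorithm for k-DNF over n Boolean variables.

    examples: list of (x, y) pairs, x a 0/1 tuple of length n, y in {0, 1}.
    Start with every conjunction ("metavariable") of at most k literals;
    each negative example eliminates every conjunction it satisfies, since
    a term of the target k-DNF can never be true on a negative point.
    """
    # A literal is (index, value): variable `index` must take value 0 or 1.
    literals = [(i, b) for i in range(n) for b in (0, 1)]

    def uses_distinct_vars(term):
        return len({i for i, _ in term}) == len(term)

    terms = [t for r in range(1, k + 1)
             for t in combinations(literals, r) if uses_distinct_vars(t)]

    def satisfies(x, term):
        return all(x[i] == b for i, b in term)

    for x, y in examples:
        if y == 0:                       # negative example: prune terms
            terms = [t for t in terms if not satisfies(x, t)]

    # Hypothesis: OR of all surviving candidate terms.
    return lambda x: int(any(satisfies(x, t) for t in terms))
```

There are O(n^k) metavariables, so for constant k both the running time and the sample complexity are polynomial in n.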

Poly-time PAC learning, general distributions

Decision lists (greedy alg.) [Rivest87]

Halfspaces (poly-time LP) [Littlestone87, BEHW89, …] (see the LP sketch after this list)

Parities, integer lattices (Gaussian elim.) [HelmboldSloanWarmuth92, FischerSimon92]

Restricted types of branching programs (DL + parities) [ErgunKumarRubinfeld95, BshoutyTamonWilson98]

Geometric concept classes (…random projections…) [BshoutyChenHomer94, BGMST98, Vempala99, …]

and more…
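For the halfspace entry above, here is a minimal sketch of finding a consistent halfspace by linear programming (my own illustration, not from the slides; it assumes a linearly separable sample and uses scipy's linprog):

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """Find (w, b) with w.x + b >= 1 on positives and <= -1 on negatives.

    X: (m, n) array of examples; y: length-m array with entries +1 / -1.
    This is a pure feasibility LP: any separating halfspace can be rescaled
    to meet the margin-1 constraints, so a solution exists iff the labeled
    sample is linearly separable.
    """
    m, n = X.shape
    # Variables are [w_1, ..., w_n, b]; constraint for example i:
    #   -y_i * (w . x_i + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    b_ub = -np.ones(m)
    c = np.zeros(n + 1)                              # nothing to optimize
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    if not res.success:
        raise ValueError("no consistent halfspace: sample is not separable")
    return res.x[:n], res.x[n]
```

Outputting any consistent halfspace, combined with standard sample-complexity bounds (the VC dimension of halfspaces over R^n is n + 1), gives a poly-time PAC learner.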


General-distribution PAC learning, cont.

Quasi-poly / sub-exponential-time learning:

poly-size decision trees [EhrenfeuchtHaussler89, Blum92]

poly-size DNF [Bshouty96, TaruiTsukiji99, KlivansS01]

intersections of few poly(n)-weight halfspaces [KlivansO'DonnellS02]

"PTF method" (halfspaces + metavariables): link with complexity theory (see the worked sketch below)
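A worked sketch of the PTF method, as I understand the standard argument (not a verbatim slide): if every function f in the class has a polynomial threshold function of degree d, i.e.

\[
f(x) = \mathrm{sign}\big(p(x)\big) \quad \text{with } \deg(p) \le d,
\]

then f is a halfspace over the at most n^d monomial "metavariables" x_S = \prod_{i \in S} x_i with |S| \le d, so a consistent hypothesis can be found by linear programming in time n^{O(d)}. For s-term DNF the PTF degree bound d = O(n^{1/3} log s) of [KlivansS01] yields a 2^{Õ(n^{1/3})}-time algorithm for poly-size DNF.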


[Figure: a DNF drawn as an OR of ANDs over literals (x1, …, x7, some negated), labeled positive and negative points, and a decision tree over x1, …, x5 with -1 / +1 leaves.]

Distribution-specific learning

Theorem [KearnsLiValiant87]: monotone Boolean functions can be weakly learned (accuracy nontrivially better than 1/2) in poly time under the uniform distribution on {0,1}^n.

Ushered in the study of algorithms for uniform-distribution and distribution-specific learning: halfspaces [Baum90], DNF [Verbeurgt90, Jackson95], decision trees [KushilevitzMansour93], AC^0 [LinialMansourNisan89, FurstJacksonSmith91], extended AC^0 [JacksonKlivansS02], juntas [MosselO'DonnellS03], general monotone functions [BshoutyTamon96, BlumBurchLangford98, O'DonnellWimmer09], monotone decision trees [O'DonnellS06], intersections of halfspaces [BlumKannan94, Vempala97, KwekPitt98, KlivansO'DonnellS08], convex sets, much more…

Key tool: Fourier analysis of Boolean functions (a brief refresher follows below)

Recently come full circle on monotone functions: [O'DonnellWimmer09] achieves poly time with accuracy that is optimal (matching the bound of [BlumBurchLangford98]).


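The Fourier-analysis tool mentioned above, in brief (standard background, stated in my own notation rather than quoted from the slides): every f : {-1,1}^n -> {-1,1} has a unique expansion

\[
f(x) \;=\; \sum_{S \subseteq [n]} \hat{f}(S)\, \chi_S(x),
\qquad \chi_S(x) = \prod_{i \in S} x_i,
\qquad \hat{f}(S) = \mathbb{E}_{x}\big[f(x)\,\chi_S(x)\big],
\]

where the expectation is over uniform x. Each coefficient is an expectation, so it can be estimated from uniform random labeled examples; the "low-degree algorithm" of [LinialMansourNisan89] learns by estimating all coefficients with |S| small and outputting the sign of the resulting polynomial.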

Other variants

After [Valiant84], efficient learning algorithms have been studied in many settings:

Learning in the presence of noise: malicious [Valiant85], agnostic [KearnsSchapireSellie93], random misclassification [AngluinLaird87], …

Related models: exact learning from queries and counterexamples [Angluin87], Statistical Query learning [Kearns93], many others…

PAC-style analyses of unsupervised learning problems: learning discrete distributions [KMRRSS94], learning mixture distributions [Dasgupta99, AroraKannan01, many others…]

Evolvability framework [Valiant07, Feldman08, …]

Nice algorithmic results in all these settings.


Limits of efficient learnability: is proper learning feasible?

Proper learning: a learning algorithm for class C must use hypotheses from C.

There are efficient proper learning algorithms for conjunctions, disjunctions, halfspaces, decision lists, parities, k-DNF, k-CNF.

What about k-term DNF: can we learn it using k-term DNF as hypotheses?

Proper learning is computationally hard

Theorem [PittValiant87]: Unless NP = RP, no poly-time algorithm can learn 3-term DNF using 3-term DNF hypotheses.

Given a graph G, the reduction produces a distribution over labeled examples such that a high-accuracy 3-term DNF exists iff G is 3-colorable.

Note: one can learn 3-term DNF in poly time using 3-CNF hypotheses! (See the identity below.)

"Often a change of representation can make a difficult learning task easy."
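The representation change behind that note is the standard distributivity identity (my phrasing, not a slide):

\[
T_1 \vee T_2 \vee T_3 \;=\; \bigwedge_{\ell_1 \in T_1,\ \ell_2 \in T_2,\ \ell_3 \in T_3} (\ell_1 \vee \ell_2 \vee \ell_3),
\]

so every 3-term DNF is also a 3-CNF, and 3-CNF is learnable in poly time by treating each of the O(n^3) possible clauses as a metavariable and running elimination (dually to the k-DNF algorithm sketched earlier).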

reduction: G maps to a distribution over labeled examples

(011111, +)   (001111, -)
(101111, +)   (010111, -)
(110111, +)   (011101, -)
…             …

(A code sketch of this reduction follows.)
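A minimal sketch of that reduction (my own code; the vertex numbering and function name are illustrative assumptions):

```python
def pitt_valiant_examples(n, edges):
    """Reduction from Graph 3-Coloring in the spirit of [PittValiant87].

    Vertices are 1..n.  Each vertex v yields a positive example: all 1s
    except a 0 in position v.  Each edge {u, v} yields a negative example:
    all 1s except 0s in positions u and v.  A 3-term DNF consistent with
    these examples exists iff G is 3-colorable: the term for color class c
    is the AND of x_v over all vertices v NOT colored c.
    """
    def point(zeros):
        return tuple(0 if i in zeros else 1 for i in range(1, n + 1))

    positives = [(point({v}), '+') for v in range(1, n + 1)]
    negatives = [(point({u, v}), '-') for u, v in edges]
    return positives + negatives

# A 6-vertex example whose first few examples match the pairs listed above.
examples = pitt_valiant_examples(6, [(1, 2), (1, 3), (1, 5)])
```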

From 1987…

Theorem [PittValiant87]: Learning k-term DNF using (2k-3)-term DNF hypotheses is hard.

Opened the door to a whole range of hardness results of the form "class C is hard to learn using hypotheses from class C'."

This work showed computational barriers to learning with restricted representations in general, not just proper learning:

… to 2009

Great progress in recent years using sophisticated machinery from hardness of approximation.

[ABFKP04]: Hard to learn n-term DNF using n^100-size OR-of-halfspace hypotheses. [Feldman06]: Holds even if the learner can make membership queries to the target function.

[KhotSaket08]: Hard to (even weakly) learn an intersection of 2 halfspaces using 100 halfspaces as hypotheses.

If data is corrupted with 1% noise, then:

[FeldmanGopalanKhotPonnuswami08]: Hard to (even weakly) learn an AND using an AND as hypothesis. Same for halfspaces.

[GopalanKhotSaket07, Viola08]: Hard to (even weakly) learn a parity even using degree-100 GF(2) polynomials as hypotheses.

Active area with lots of ongoing work.


Representation-Independent Hardness

Suppose there are no hypothesis restrictions: any poly-size circuit is OK. Are there learning problems that are still hard for computational reasons?

Yes: [Valiant84]: the existence of pseudorandom functions [GoldreichGoldwasserMicali84] implies that general Boolean circuits are (representation-independently) hard to learn.

PKC and hardness of learning

Key insight of [KearnsValiant89]: public-key cryptosystems yield hard-to-learn functions.

The adversary can create labeled examples of the decryption function by herself (using the public key)… so it must not be learnable from labeled examples, or else the cryptosystem would be insecure! (See the sketch below.)
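A minimal sketch of this example-generation argument (my own illustration; keygen and encrypt are hypothetical placeholders, not a specific cryptosystem or library):

```python
import random

def labeled_examples_from_pkc(keygen, encrypt, m):
    """Anyone holding only the public key can generate labeled examples of
    the secret decryption map: pick a random bit b, encrypt it, and label
    the resulting ciphertext with b.  A learner that could predict b from
    the ciphertext with noticeable accuracy would break the cryptosystem,
    so the decryption function cannot be efficiently learnable.
    """
    public_key, _secret_key = keygen()
    examples = []
    for _ in range(m):
        b = random.randint(0, 1)               # random plaintext bit
        ciphertext = encrypt(public_key, b)    # uses only public information
        examples.append((ciphertext, b))       # labeled example of decryption
    return examples
```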



Theorem [KearnsValiant89]: Simple classes of functions (NC^1, TC^0, poly-size DFAs) are inherently hard to learn.

Theorem [Regev05, KlivansSherstov06]: Really simple functions (poly-size ORs of halfspaces) are inherently hard to learn.

Closing the gap: can these results be extended to show that DNF are inherently hard to learn? Or are DNF efficiently learnable?



Efficient learnability: Model and Results

Thank you, Les!

Valiant provided an elegant model for the computational study of learning, and followed this up with foundational results on what is (and isn't) efficiently learnable.

These fundamental questions continue to be intensively studied and cross-fertilize other topics in TCS.