Sublinear Optimization for Machine Learning

Kenneth L. Clarkson, IBM Almaden Research Center

Elad Hazan, Technion - Israel Institute of Technology

David P. Woodruff, IBM Almaden Research Center

In this paper we describe and analyze sublinear-time approximation algorithms for some optimization problems arising in machine learning, such as training linear classifiers and finding minimum enclosing balls. Our algorithms can be extended to some kernelized versions of these problems, such as SVDD, hard margin SVM, and L2-SVM, for which sublinear-time algorithms were not known before. These new algorithms use a combination of novel sampling techniques and a new multiplicative update algorithm. We give lower bounds which show the running times of many of our algorithms to be nearly best possible in the unit-cost RAM model.

1. INTRODUCTION

Linear classification is a fundamental problem of machine learning, in which positive and negative examples of a concept are represented in Euclidean space by their feature vectors, and we seek to find a hyperplane separating the two classes of vectors.

The Perceptron Algorithm for linear classification is one of the oldest algorithms studied in machine learning [Novikoff 1963; Minsky and Papert 1988]. It can be used to efficiently give a good approximate solution, if one exists, and has nice noise-stability properties which allow it to be used as a subroutine in many applications such as learning with noise [Bylander 1994; Blum et al. 1998], boosting [Servedio 1999] and more general optimization [Dunagan and Vempala 2004]. In addition, it is extremely simple to implement: the algorithm starts with an arbitrary hyperplane, and iteratively finds a vector on which it errs, and moves in the direction of this vector by adding a multiple of it to the normal vector to the current hyperplane.
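The classic update rule just described can be sketched in a few lines (a minimal illustration of ours, not the paper's sublinear algorithm; the toy data is invented):

```python
# Classic Perceptron: repeatedly find a misclassified example and add it
# to the normal vector.  Labels are assumed folded into the rows
# (A_i <- y_i * A_i), so "misclassified" means A_i . x <= 0.

def perceptron(rows, max_iters=1000):
    d = len(rows[0])
    x = [0.0] * d                      # start from the zero normal vector
    for _ in range(max_iters):
        bad = next((a for a in rows
                    if sum(ai * xi for ai, xi in zip(a, x)) <= 0), None)
        if bad is None:                # every example classified correctly
            return x
        x = [xi + ai for xi, ai in zip(x, bad)]  # step toward the bad vector
    return x

# Two separable points; any returned x has positive inner product with both.
x = perceptron([(1.0, 0.0), (0.8, 0.6)])
```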

The standard implementation of the Perceptron Algorithm must iteratively find a "bad vector" which is classified incorrectly, that is, for which the inner product with the current normal vector has an incorrect sign. Our new algorithm is similar to the Perceptron Algorithm, in that it maintains a hyperplane and modifies it iteratively, according to the examples seen. However, instead of explicitly finding a bad vector, we run another dual learning algorithm to learn the "most adversarial" distribution over the vectors, and use that distribution to generate an "expected bad" vector. Moreover, we do not compute the inner products with the current normal vector exactly, but instead estimate them using a fast sampling-based scheme.

Thus our update to the hyperplane uses a vector whose "badness" is determined quickly, but very crudely. We show that despite this, an approximate solution is still obtained in about the same number of iterations as the standard perceptron. So our algorithm is faster; notably, it can be executed in time sublinear in the size of the

Part of this work was done while E. Hazan was at IBM Almaden Research Center. He is currently supported by Israel Science Foundation grant 810/11.


© 2012 ACM 0004-5411/2012/01-ART0 $10.00
DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

Journal of the ACM, Vol. 0, No. 0, Article 0, Publication date: January 2012.

0:2 Clarkson, Hazan and Woodruff

Problem               | Previous time                              | Time Here                         | Lower Bound
linear classification | Õ(ε^{-2} M) [Novikoff 1963]                | Õ(ε^{-2}(n+d))  §2                | Ω(ε^{-2}(n+d))  §6.1
min. enc. ball (MEB)  | Õ(ε^{-1/2} M) [Saha and Vishwanathan 2009] | Õ(ε^{-2} n + ε^{-1} d)  §3.1      | Ω(ε^{-2} n + ε^{-1} d)  §6.2
QP in the simplex     | O(ε^{-1} M) [Frank and Wolfe 1956]         | Õ(ε^{-2} n + ε^{-1} d)  §3.3      |
Las Vegas versions    |                                            | additive O(M)  Cor 2.11           | Ω(M)  §6.4
kernelized MEB and QP |                                            | extra factors O(s^4) or O(q)  §5  |

Fig. 1. Our results; parameters defined in relevant sections.

input data, and still have good output, with high probability. (Here we must make some reasonable assumptions about the way in which the data is stored, as discussed below.)

This technique applies more generally than to the perceptron: we also obtain sublinear-time approximation algorithms for the related problems of finding an approximate Minimum Enclosing Ball (MEB) of a set of points, and training a Support Vector Machine (SVM), in the hard margin or L2-SVM formulations.

We give lower bounds that imply that our algorithms for classification are best possible, up to polylogarithmic factors, in the unit-cost RAM model, while our bounds for MEB are best possible up to an Õ(ε^{-1}) factor. For most of these bounds, we give a family of inputs such that a single coordinate, randomly "planted" over a large collection of input vector coordinates, determines the output to such a degree that all coordinates in the collection must be examined for even a 2/3 probability of success.

Our approach can be extended to give algorithms for the kernelized versions of these problems, for some popular kernels including the Gaussian and polynomial, and also easily gives Las Vegas results, where the output guarantees always hold, and only the running time is probabilistic.¹

Our main results are given in Figure 1, using the following notation: all the problems we consider have an n × d matrix A as input, with M nonzero entries, and with each row of A with Euclidean length no more than one. The parameter ε > 0 is the additive error; for MEB, this can be a relative error, after a simple O(M) preprocessing step. We use the asymptotic notation Õ(f) = O(f · polylog(nd/ε)). The parameter σ is the margin of the problem instance, explained below. The parameters s and q determine the standard deviation of a Gaussian kernel, and the degree of a polynomial kernel, respectively.

The time bounds given for our algorithms, except the Las Vegas ones, are under the assumption of constant error probability; for output guarantees that hold with probability 1 − δ, our bounds should be multiplied by log(n/δ).

The time bounds also require the assumption that the input data is stored in such a way that a given entry A_{i,j} can be recovered in constant time. This can be done by, for example, keeping each row A_i of A as a hash table. (Simply keeping the entries of the row in sorted order by column number is also sufficient, incurring an O(log d) overhead in running time for binary search.)
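In code, the hash-table storage of rows can be sketched as follows (a toy illustration of ours; the dict-of-dicts layout and helper names are our choices, not the paper's):

```python
# Store only the nonzero entries of each row of A in a dict, so that a
# lookup of entry (i, j) is an expected O(1) hash-table probe; absent
# keys are treated as zero.

def to_hashed_rows(dense_rows):
    return [{j: v for j, v in enumerate(row) if v != 0.0} for row in dense_rows]

def entry(hashed, i, j):
    return hashed[i].get(j, 0.0)   # expected O(1) per access

A = to_hashed_rows([[0.0, 0.5, 0.0],
                    [0.3, 0.0, 0.4]])
```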

¹For MEB and the kernelized versions, we assume that the Euclidean norms of the relevant input vectors are known. Even with the addition of this linear-time step, all our algorithms improve on prior bounds, with the exception of MEB when M = o(ε^{-3/2}(n + d)).


Formal Description: Classification. In the linear classification problem, the learner is given a set of n labeled examples in the form of d-dimensional vectors, comprising the input matrix A. The labels comprise a vector y ∈ {+1, −1}^n.

The goal is to find a separating hyperplane, that is, a normal vector x in the unit Euclidean ball B such that for all i, y(i) A_i x ≥ 0; here y(i) denotes the i'th coordinate of y. As mentioned, we will assume throughout that A_i ∈ B for all i ∈ [n], where generally [m] denotes the set of integers {1, 2, ..., m}.

As is standard, we may assume that the labels y(i) are all 1, by taking A_i ← −A_i for any i with y(i) = −1. The approximation version of linear classification is to find a vector x_ε ∈ B that is an ε-approximate solution, that is,

  ∀i:  A_i x_ε ≥ max_{x∈B} min_i A_i x − ε.    (1)

The optimum for this formulation is obtained when ‖x‖ = 1, except when no separating hyperplane exists, and then the optimum x is the zero vector.

Note that min_i A_i x = min_{p∈Δ} p^⊤ A x, where Δ ⊆ R^n is the unit simplex {p ∈ R^n | p_i ≥ 0, Σ_i p_i = 1}. Thus we can regard the optimum as the outcome of a game to determine p^⊤ A x, between a minimizer choosing p ∈ Δ, and a maximizer choosing x ∈ B, yielding

  σ ≡ max_{x∈B} min_{p∈Δ} p^⊤ A x,

where this optimum σ is called the margin. From standard duality results, σ is also the optimum of the dual problem

  min_{p∈Δ} max_{x∈B} p^⊤ A x,

and the optimum vectors p* and x* are the same for both problems.
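For intuition, the game value can be checked directly on a tiny instance (a hand-computed example of ours): with rows A_1 = (1, 0) and A_2 = (0, 1), symmetry gives the maximin at x* = (1, 1)/√2 and margin σ = 1/√2.

```python
import math

# margin(x) = min_i A_i . x; the margin sigma of the instance is its
# maximum over the unit ball B.  For A = {(1,0), (0,1)} the optimum is
# x* = (1,1)/sqrt(2) by symmetry.
rows = [(1.0, 0.0), (0.0, 1.0)]

def min_margin(x):
    return min(a[0] * x[0] + a[1] * x[1] for a in rows)

best = (1 / math.sqrt(2), 1 / math.sqrt(2))
sigma = min_margin(best)            # = 1/sqrt(2)

# No other unit vector does better: check a coarse grid of directions.
grid = [(math.cos(t), math.sin(t)) for t in
        [k * math.pi / 36 for k in range(72)]]
assert all(min_margin(x) <= sigma + 1e-12 for x in grid)
```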

The classical Perceptron algorithm returns an ε-approximate solution to this problem in 1/ε² iterations, and total time O(ε^{-2} M)². For given δ ∈ (0, 1), our new algorithm takes O(ε^{-2}(n + d)(log n) log(n/δ)) time to return an ε-approximate solution with probability at least 1 − δ. Further, we show this is optimal in the unit-cost RAM model, up to poly-logarithmic factors.

Formal Description: Minimum Enclosing Ball (MEB). The MEB problem is to find the smallest Euclidean ball in R^d containing the rows of A. It is a special case of quadratic programming (QP) in the unit simplex, namely, to find min_{p∈Δ} p^⊤ b + p^⊤ A A^⊤ p, where b is an n-vector. This relationship, and the generalization of our MEB algorithm to QP in the simplex, is discussed in §3.3; for more general background on QP in the simplex, and related problems, see for example [Clarkson 2008].

1.1. Related work

Perhaps the most closely related work is that of [Grigoriadis and Khachiyan 1995], who showed how to approximately solve a zero-sum game up to additive precision ε in time Õ(ε^{-2}(n + d)), where the game matrix is n × d. This problem is analogous to ours, and our algorithm is similar in structure to theirs, but where we minimize over p ∈ Δ and maximize over x ∈ B, their optimization has not only p but also x in a unit simplex.

Their algorithm (and ours) relies on sampling based on x and p, to estimate inner products x^⊤ v or p^⊤ w for vectors v and w that are rows or columns of A. For a vector p ∈ Δ, this estimation is easily done by returning w_i with probability p_i.

²Strictly speaking, this is true only for ε equal to the margin, denoted σ and defined below. Yet a slight modification of the perceptron gives this running time for any small enough ε > 0.


For vectors x ∈ B, however, the natural estimation technique is to pick i with probability x_i², and return v_i/x_i. The estimator from this ℓ2-sample is less well-behaved, since it is unbounded, and can have a high variance. While ℓ2 sampling has been used in streaming applications [Monemizadeh and Woodruff 2010], it has not previously found applications in optimization due to this high variance problem.

Indeed, it might seem surprising that sublinearity is at all possible, given that the correct classifier might be determined by very few examples, as shown in Figure 2. It thus seems necessary to go over all examples at least once, instead of looking at noisy estimates based on sampling.

Fig. 2. The optimum x* is determined by the vectors near the horizontal axis.

However, as we show, in our setting there is a version of the fundamental Multiplicative Weights (MW) technique that can cope with unbounded updates, and for which the variance of ℓ2-sampling is manageable. In our version of MW, the multiplier associated with a value z is quadratic in z, in contrast to the more standard multiplier that is exponential in z; while the latter is a fundamental building block in approximate optimization algorithms, as discussed in [Plotkin et al. 1991], in our setting such exponential updates can lead to d^{Ω(1)} iterations, which is very expensive.

We analyze MW from the perspective of on-line optimization, and show that our version of MW has low expected regret given only that the random updates have the variance bounds provable for ℓ2 sampling. We also use another technique from on-line optimization, a gradient descent variant which is better suited for the ball.

For the special case of zero-sum games in which the entries are all non-negative (this is equivalent to packing and covering linear programs), [Koufogiannakis and Young 2007] give a sublinear-time algorithm which returns a relative approximation in time Õ(ε^{-2}(n + d)). Our lower bounds show that a similar relative approximation bound for sublinear algorithms is impossible for general classification, and hence general linear programming.

2. LINEAR CLASSIFICATION AND THE PERCEPTRON

Before our algorithm, some reminders and further notation: Δ ⊆ R^n is the unit simplex {p ∈ R^n | p_i ≥ 0, Σ_i p_i = 1}, B ⊆ R^d is the Euclidean unit ball, and the unsubscripted ‖x‖ denotes the Euclidean norm ‖x‖₂. The n-vector, all of whose entries are one, is denoted by 1_n.

The i'th row of the input matrix A is denoted A_i, although a vector is a column vector unless otherwise indicated. The i'th coordinate of vector v is denoted v(i). For a vector v, we let v² denote the vector whose coordinates have v²(i) ≡ v(i)² for all i.


2.1. The Sublinear Perceptron

Our sublinear perceptron algorithm is given in Algorithm 1. The algorithm maintains a vector w_t ∈ R^n, with nonnegative coordinates, and also p_t ∈ Δ, which is w_t scaled to have unit ℓ1 norm. A vector y_t ∈ R^d is maintained also, and x_t, which is y_t scaled to have Euclidean norm no larger than one. These normalizations are done on line 4.

In lines 5 and 6, the algorithm is updating y_t by adding a row of A randomly chosen using p_t. This is a randomized version of Online Gradient Descent (OGD); due to the random choice of i_t, A_{i_t} is an unbiased estimator of p_t^⊤ A, which is the gradient of p_t^⊤ A y with respect to y.

In lines 7 through 12, the algorithm is updating w_t using a column j_t of A randomly chosen based on x_t, and also using the value x_t(j_t). This is a version of the Multiplicative Weights (MW) technique for online optimization in the unit simplex, where v_t is an unbiased estimator of A x_t, the gradient of p^⊤ A x_t with respect to p.

Actually, v_t is not unbiased, after the clip operation: for z, V ∈ R, clip(z, V) ≡ min{V, max{−V, z}}, and our analysis is helped by clipping the entries of v_t; we show that the resulting slight bias is not harmful.
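The clip operation is a one-line truncation to the interval [−V, V]:

```python
def clip(z, V):
    # clip(z, V) = min(V, max(-V, z)): truncate z into [-V, V]
    return min(V, max(-V, z))
```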

As discussed in §1.1, the sampling used to choose j_t (and update p_t) is ℓ2-sampling, and that for i_t, ℓ1-sampling. These techniques, which can be regarded as special cases of an ℓp-sampling technique, for p ∈ [1, ∞), yield unbiased estimators of vector dot products. It is important for us also that ℓ2-sampling has a variance bound here; in particular, for each relevant i and t,

  E[v_t(i)²] ≤ ‖A_i‖² ‖x_t‖² ≤ 1.    (2)
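As a sanity check of the ℓ2-sampling estimator (our own small verification, with invented vectors): summing the estimator over all coordinates, weighted by their sampling probabilities, recovers the inner product exactly, and the second moment matches the bound ‖A_i‖²‖x_t‖² of (2).

```python
# l2-sampling: pick coordinate j with probability x(j)^2 / ||x||^2 and
# return a(j) * ||x||^2 / x(j).  Its expectation is exactly a . x, and
# its second moment is ||a||^2 * ||x||^2.

def l2_estimates(a, x):
    nx2 = sum(xj * xj for xj in x)
    # (probability of drawing j, value of the estimator when j is drawn)
    return [((xj * xj) / nx2, aj * nx2 / xj)
            for aj, xj in zip(a, x) if xj != 0.0]

a = [0.2, -0.5, 0.1]
x = [0.6, 0.3, -0.4]
expectation = sum(p * v for p, v in l2_estimates(a, x))
true_dot = sum(aj * xj for aj, xj in zip(a, x))
```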

Algorithm 1 Sublinear Perceptron

1: Input: ε > 0, A ∈ R^{n×d} with A_i ∈ B for i ∈ [n].
2: Let T ← 200² ε^{-2} log n, y_1 ← 0, w_1 ← 1_n, η ← √((log n)/T).
3: for t = 1 to T do
4:   p_t ← w_t/‖w_t‖₁, x_t ← y_t / max{1, ‖y_t‖}.
5:   Choose i_t ∈ [n] by i_t ← i with prob. p_t(i).
6:   y_{t+1} ← y_t + (1/√(2T)) A_{i_t}.
7:   Choose j_t ∈ [d] by j_t ← j with probability x_t(j)²/‖x_t‖².
8:   for i ∈ [n] do
9:     ṽ_t(i) ← A_i(j_t) ‖x_t‖²/x_t(j_t)
10:    v_t(i) ← clip(ṽ_t(i), 1/η)
11:    w_{t+1}(i) ← w_t(i)(1 − η v_t(i) + η² v_t(i)²)
12:  end for
13: end for
14: return x̄ = (1/T) Σ_t x_t
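A direct, non-optimized rendering of Algorithm 1 (a sketch of ours, with invented toy data; the guard for x_t = 0, where ℓ2-sampling is undefined and A x_t = 0 can be estimated by zero, is our addition):

```python
import math, random

def sublinear_perceptron(rows, T, seed=0):
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    eta = math.sqrt(math.log(n) / T)
    y = [0.0] * d
    w = [1.0] * n
    xbar = [0.0] * d
    for t in range(T):
        sw = sum(w)
        p = [wi / sw for wi in w]                        # line 4
        ny = math.sqrt(sum(v * v for v in y))
        x = [v / max(1.0, ny) for v in y]
        for k in range(d):
            xbar[k] += x[k] / T                          # running average
        i_t = rng.choices(range(n), weights=p)[0]        # line 5: l1-sample
        y = [yv + av / math.sqrt(2 * T)                  # line 6: OGD step
             for yv, av in zip(y, rows[i_t])]
        nx2 = sum(v * v for v in x)
        if nx2 == 0.0:                                   # guard: A x_t = 0
            continue
        j_t = rng.choices(range(d),
                          weights=[v * v / nx2 for v in x])[0]  # line 7
        for i in range(n):                               # lines 8-12
            vt = rows[i][j_t] * nx2 / x[j_t]             # l2 estimate of A_i . x_t
            vt = min(1 / eta, max(-1 / eta, vt))         # clip
            w[i] *= 1 - eta * vt + eta * eta * vt * vt   # quadratic MW update
    return xbar

rows = [(1.0, 0.0), (0.8, 0.6), (0.8, -0.6)]
xbar = sublinear_perceptron(rows, T=400)
```

On this toy instance every row has a large first coordinate, so the averaged hyperplane separates all three points while staying inside the unit ball.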

First we note the running time.

THEOREM 2.1. The sublinear perceptron takes O(ε^{-2} log n) iterations, with a total running time of O(ε^{-2}(n + d) log n).

PROOF. The algorithm iterates T = O((log n)/ε²) times. Each iteration requires:

(1) One ℓ2-sample per iterate, which takes O(d) time using known data structures.
(2) Sampling i_t ∈_R p_t, which takes O(n) time.
(3) The update of x_t and p_t, which takes O(n + d) time.

The total running time is O(ε^{-2}(n + d) log n). ∎

Next we analyze the output quality. The proof uses new tools from regret minimization and sampling that are the building blocks of most of our upper bound results. Let us first state the MW algorithm used in all our algorithms.

Definition 2.2 (MW algorithm). Consider a sequence of vectors q_1, ..., q_T ∈ R^n. The Multiplicative Weights (MW) algorithm is as follows. Let w_1 ← 1_n, and for t ≥ 1,

  p_t ← w_t / ‖w_t‖₁,    (3)

and for 0 < η ∈ R, for all i ∈ [n],

  w_{t+1}(i) ← w_t(i)(1 − η q_t(i) + η² q_t(i)²).    (4)

The following is a key lemma, which proves a novel bound on the regret of the MW algorithm above, suitable for the case where the losses are random variables with bounded variance. As opposed to previous multiplicative-updates algorithms, this is the only MW algorithm we are familiar with that does not require an upper bound on the losses/payoffs. The proof is deferred to after the main theorem and its proof.

LEMMA 2.3 (VARIANCE MW LEMMA). The MW algorithm satisfies (recall that v² denotes the vector with v²(i) ≡ v(i)²)

  Σ_{t∈[T]} p_t^⊤ q_t ≤ min_{i∈[n]} Σ_{t∈[T]} max{q_t(i), −1/η} + (log n)/η + η Σ_{t∈[T]} p_t^⊤ q_t².

The following three lemmas give concentration bounds on our random variables from their expectations. The first two are based on standard martingale analysis, and the last is a simple Markov application.

LEMMA 2.4. For η ≤ √((log n)/T), with probability at least 1 − O(1/n),

  max_i Σ_{t∈[T]} [v_t(i) − A_i x_t] ≤ 4ηT.

LEMMA 2.5. For η ≤ √((log n)/T), with probability at least 1 − O(1/n), it holds that

  |Σ_{t∈[T]} A_{i_t} x_t − Σ_t p_t^⊤ v_t| ≤ 10ηT.

LEMMA 2.6. With probability at least 1 − 1/4, it holds that Σ_t p_t^⊤ v_t² ≤ 8T.

THEOREM 2.7 (MAIN THEOREM). With probability 1/2, the sublinear perceptron returns a solution x̄ that is an ε-approximation.

PROOF. First we use the regret bounds for lazy gradient descent to lower bound Σ_{t∈[T]} A_{i_t} x_t, next we get an upper bound for that quantity using Lemma 2.3, and then we combine the two.

By definition, A_i x* ≥ σ for all i ∈ [n], and so, using the bound of Lemma A.2,

  σT ≤ max_{x∈B} Σ_{t∈[T]} A_{i_t} x ≤ Σ_{t∈[T]} A_{i_t} x_t + 2√(2T),    (5)


or rearranging,

  Σ_{t∈[T]} A_{i_t} x_t ≥ σT − 2√(2T).    (6)

Now we turn to the MW part of our algorithm. By the Weak Regret Lemma 2.3, and using the clipping of v_t(i),

  Σ_{t∈[T]} p_t^⊤ v_t ≤ min_{i∈[n]} Σ_{t∈[T]} v_t(i) + (log n)/η + η Σ_{t∈[T]} p_t^⊤ v_t².

By Lemma 2.4 above, with high probability, for any i ∈ [n],

  Σ_{t∈[T]} A_i x_t ≥ Σ_{t∈[T]} v_t(i) − 4ηT,

so that with high probability

  Σ_{t∈[T]} p_t^⊤ v_t ≤ min_{i∈[n]} Σ_{t∈[T]} A_i x_t + (log n)/η + η Σ_{t∈[T]} p_t^⊤ v_t² + 4ηT.    (7)

Combining (6) and (7) we get

  min_{i∈[n]} Σ_{t∈[T]} A_i x_t ≥ −(log n)/η − η Σ_{t∈[T]} p_t^⊤ v_t² − 4ηT + σT − 2√(2T) − |Σ_{t∈[T]} p_t^⊤ v_t − Σ_{t∈[T]} A_{i_t} x_t|.

By Lemmas 2.5 and 2.6 we have, with probability at least 3/4 − O(1/n) ≥ 1/2,

  min_{i∈[n]} Σ_{t∈[T]} A_i x_t ≥ −(log n)/η − 8ηT − 4ηT + σT − 2√(2T) − 10ηT ≥ σT − √(T log n) − 22ηT − 2√(2T).

Dividing through by T, and using our choice of η and T, we have min_i A_i x̄ ≥ σ − ε/2 w.p. at least 1/2, as claimed. ∎

PROOF OF LEMMA 2.3. We first show an upper bound on log ‖w_{T+1}‖₁, then a lower bound, and then relate the two.

From (4) and (3) we have

  ‖w_{t+1}‖₁ = Σ_{i∈[n]} w_{t+1}(i)
            = Σ_{i∈[n]} p_t(i) ‖w_t‖₁ (1 − η q_t(i) + η² q_t(i)²)
            = ‖w_t‖₁ (1 − η p_t^⊤ q_t + η² p_t^⊤ q_t²).

This implies by induction on t, and using 1 + z ≤ exp(z) for z ∈ R, that

  log ‖w_{T+1}‖₁ = log n + Σ_{t∈[T]} log(1 − η p_t^⊤ q_t + η² p_t^⊤ q_t²) ≤ log n − Σ_{t∈[T]} (η p_t^⊤ q_t − η² p_t^⊤ q_t²).    (8)

Now for the lower bound. From (4) we have by induction on t that

  w_{T+1}(i) = Π_{t∈[T]} (1 − η q_t(i) + η² q_t(i)²),

and so

  log ‖w_{T+1}‖₁ = log [ Σ_{i∈[n]} Π_{t∈[T]} (1 − η q_t(i) + η² q_t(i)²) ]
               ≥ log [ max_{i∈[n]} Π_{t∈[T]} (1 − η q_t(i) + η² q_t(i)²) ]
               = max_{i∈[n]} Σ_{t∈[T]} log(1 − η q_t(i) + η² q_t(i)²)
               ≥ max_{i∈[n]} Σ_{t∈[T]} min{−η q_t(i), 1},

where the last inequality uses the fact that 1 + z + z² ≥ exp(min{z, 1}) for all z ∈ R.

Putting this together with the upper bound (8), we have

  max_{i∈[n]} Σ_{t∈[T]} min{−η q_t(i), 1} ≤ log n − Σ_{t∈[T]} (η p_t^⊤ q_t − η² p_t^⊤ q_t²).

Changing sides,

  Σ_{t∈[T]} η p_t^⊤ q_t ≤ −max_{i∈[n]} Σ_{t∈[T]} min{−η q_t(i), 1} + log n + Σ_{t∈[T]} η² p_t^⊤ q_t²
                      = min_{i∈[n]} Σ_{t∈[T]} max{η q_t(i), −1} + log n + Σ_{t∈[T]} η² p_t^⊤ q_t²,

and the lemma follows, dividing through by η. ∎
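Lemma 2.3 holds for an arbitrary loss sequence, so the inequality can be checked numerically on a small deterministic sequence (a verification harness of ours; the sequence q_t(i) = sin(0.7t + i) is invented):

```python
import math

# Variance MW: w_{t+1}(i) = w_t(i) * (1 - eta*q_t(i) + eta^2*q_t(i)^2),
# p_t = w_t / ||w_t||_1.  Compute both sides of Lemma 2.3 and compare.

n, T, eta = 3, 60, 0.1
q = [[math.sin(0.7 * t + i) for i in range(n)] for t in range(T)]  # losses in [-1, 1]

w = [1.0] * n
lhs = 0.0                      # sum_t p_t . q_t
quad = 0.0                     # sum_t p_t . q_t^2
totals = [0.0] * n             # per-expert sum_t max(q_t(i), -1/eta)
for t in range(T):
    sw = sum(w)
    p = [wi / sw for wi in w]
    lhs += sum(pi * qi for pi, qi in zip(p, q[t]))
    quad += sum(pi * qi * qi for pi, qi in zip(p, q[t]))
    for i in range(n):
        totals[i] += max(q[t][i], -1 / eta)
        w[i] *= 1 - eta * q[t][i] + eta ** 2 * q[t][i] ** 2

rhs = min(totals) + math.log(n) / eta + eta * quad
```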

COROLLARY 2.8 (DUAL SOLUTION). The vector p̄ ≡ Σ_t e_{i_t}/T is, with probability 1/2, an O(ε)-approximate dual solution.

PROOF. Observing in (5) that the middle expression max_{x∈B} Σ_{t∈[T]} A_{i_t} x is equal to T max_{x∈B} p̄^⊤ A x, we have T max_{x∈B} p̄^⊤ A x ≤ Σ_{t∈[T]} A_{i_t} x_t + 2√(2T), or changing sides,

  Σ_{t∈[T]} A_{i_t} x_t ≥ T max_{x∈B} p̄^⊤ A x − 2√(2T).

Recall from (7) that with high probability,

  Σ_{t∈[T]} p_t^⊤ v_t ≤ min_{i∈[n]} Σ_{t∈[T]} A_i x_t + (log n)/η + η Σ_{t∈[T]} p_t^⊤ v_t² + 4ηT.

Following the proof of the main theorem, we combine both inequalities and use Lemmas 2.5 and 2.6, such that with probability at least 1/2:

  T max_{x∈B} p̄^⊤ A x ≤ min_{i∈[n]} Σ_{t∈[T]} A_i x_t + (log n)/η + η Σ_{t∈[T]} p_t^⊤ v_t² + 90ηT + 2√(2T) + |Σ_{t∈[T]} p_t^⊤ v_t − Σ_{t∈[T]} A_{i_t} x_t|
                      ≤ σT + O(√(T log n)).

Dividing through by T, we have with probability at least 1/2 that max_{x∈B} p̄^⊤ A x ≤ σ + O(ε) for our choice of T and η. ∎


2.2. High Success Probability and Las Vegas

Given two vectors u, v ∈ B, we have seen that a single ℓ2-sample is an unbiased estimator of their inner product with variance at most one. Averaging ε^{-2} such samples reduces the variance to ε², which reduces the standard deviation to ε. Repeating O(log(1/δ)) such estimates, and taking the median, gives an estimator denoted X_{ε,δ}, which satisfies, via a Chernoff bound:

  Pr[|X_{ε,δ} − v^⊤ u| > ε] ≤ δ.

As an immediate corollary of this fact we obtain:

COROLLARY 2.9. There exists a randomized algorithm that with probability 1 − δ, successfully determines whether a given hyperplane with normal vector x ∈ B, together with an instance of linear classification and parameter σ > 0, is an ε-approximate solution. The algorithm runs in time O(d + (n/ε²) log(n/δ)).

PROOF. Let δ′ = δ/n. Generate the random variable X_{ε,δ′} for each inner-product pair ⟨x, A_i⟩, and return true if and only if X_{ε,δ′} ≥ σ − ε for each pair. By the observation above, and taking a union bound over all n inner products, with probability 1 − δ the estimate X_{ε,δ′} was ε-accurate for all inner-product pairs, and hence the algorithm returned a correct answer.

The running time includes preprocessing of x in O(d) time, and n inner-product estimates, for a total of O(d + (n/ε²) log(n/δ)). ∎
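The median-of-means estimator X_{ε,δ} can be sketched as follows (our own implementation; the sample counts follow the text up to constants, and the toy vectors are invented — when u has a single nonzero coordinate every ℓ2-sample is exact, so the estimate has no error at all):

```python
import math, random, statistics

def l2_sample(v, u, rng):
    # one l2-sample of v . u: draw j ~ u(j)^2/||u||^2, return v(j)*||u||^2/u(j)
    nu2 = sum(x * x for x in u)
    j = rng.choices(range(len(u)), weights=[x * x / nu2 for x in u])[0]
    return v[j] * nu2 / u[j]

def X(v, u, eps, delta, rng):
    # median of O(log(1/delta)) averages, each of ~1/eps^2 l2-samples
    reps = max(1, math.ceil(math.log(1 / delta)))
    means = []
    for _ in range(reps):
        k = math.ceil(1 / eps ** 2)
        means.append(sum(l2_sample(v, u, rng) for _ in range(k)) / k)
    return statistics.median(means)

rng = random.Random(0)
est = X(v=[0.3, 0.7, -0.2], u=[0.0, 1.0, 0.0], eps=0.5, delta=0.1, rng=rng)
```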

Hence, we can amplify the success probability of Algorithm 1 to 1 − δ for any δ > 0, albeit incurring additional poly-log factors in running time:

COROLLARY 2.10 (HIGH PROBABILITY). There exists a randomized algorithm that with probability 1 − δ returns an ε-approximate solution to the linear classification problem, and runs in expected time O(((n + d)/ε²) log(n/δ)).

PROOF. Run Algorithm 1 for log₂(1/δ) repetitions to generate that many candidate solutions. By Theorem 2.7, at least one candidate solution is an ε-approximate solution with probability at least 1 − 2^{−log₂(1/δ)} = 1 − δ.

For each candidate solution apply the verification procedure above with success probability 1 − δ/(2 log₂(1/δ)), and all verifications will be correct again with probability at least 1 − δ. Hence, both events hold with probability at least 1 − 2δ. The result follows after adjusting constants.

The worst-case running time comes to O(((n + d)/ε²) log(n/δ) log(1/δ)). However, we can generate the candidate solutions and verify them one at a time, rather than all at once. The expected number of candidates we need to generate is constant. ∎

It is also possible to obtain an algorithm that never errs:

COROLLARY 2.11 (LAS VEGAS VERSION). After O(ε^{-2} log n) iterations, the sublinear perceptron returns a solution that with probability 1/2 can be verified in O(M) time to be ε-approximate. Thus with expected O(1) repetitions, and a total of expected O(M + ε^{-2}(n + d) log n) work, a verified ε-approximate solution can be found.

PROOF. We have

  min_i A_i x ≤ σ ≤ ‖p^⊤ A‖,

and so if

  min_i A_i x ≥ ‖p^⊤ A‖ − ε,    (9)

then x is an ε-approximate solution, and x will pass this test if it and p are (ε/2)-approximate solutions, and the same for p.

Thus, running the algorithm for a constant factor more iterations, so that with probability 1/2, x and p are both (ε/2)-approximate solutions, it can be verified that both are ε-approximate solutions. ∎

2.3. Further Optimizations

The regret of OGD as given in Lemma A.2 is smaller than that of the dual strategy of random MW. We can take advantage of this and improve the running time slightly, by replacing line 6 of the sublinear algorithm with the line shown below.

[6'] With probability 1/log T, let y_{t+1} ← y_t + (1/(2√T)) A_{i_t} (else do nothing).

This has the effect of increasing the regret of the primal online algorithm by a log n factor, which does not hurt the number of iterations required to converge, since the overall regret is dominated by that of the MW algorithm.

Since the primal solution x_t is not updated in every iteration, we improve the running time slightly to

  O(ε^{-2} log n · (n + d/(log(1/ε) + log log n))).

We use this technique to greater effect for the MEB problem, where it is discussed in more detail.

2.4. Implications in the PAC model

Consider the "separable" case of hyperplane learning, in which there exists a hyperplane classifying all data points correctly. It is well known that the concept class of hyperplanes in d dimensions with margin σ has effective dimension at most min{d, 1/σ²} + 1. Consider the case in which the margin is significant, i.e. 1/σ² < d. PAC learning theory implies that the number of examples needed to attain generalization error ε is O(1/(εσ²)).

Using the method of online to batch conversion (see [Cesa-Bianchi et al. 2004]), and applying the online gradient descent algorithm, it is possible to obtain ε generalization error in time O(d/(εσ²)), by going over the data once and performing a gradient step on each example.

Our algorithm improves upon this running time bound as follows: we use the sublinear perceptron to compute a σ/2-approximation to the best hyperplane over the test data, where the number of examples is taken to be n = O(1/(εσ²)) (in order to obtain ε generalization error). As shown previously, the total running time amounts to Õ(σ^{-2}(1/(εσ²) + d)) = Õ(1/(εσ⁴) + d/σ²).

This improves upon standard methods by a factor of Õ(σ²d), which is always an improvement by our initial assumption on σ and d.


3. STRONGLY CONVEX PROBLEMS: MEB AND SVM

3.1. Minimum Enclosing Ball

In the Minimum Enclosing Ball problem the input consists of a matrix A ∈ R^{n×d}. The rows are interpreted as vectors and the problem is to find a vector x ∈ R^d such that

  x* ≡ argmin_{x∈R^d} max_{i∈[n]} ‖x − A_i‖².

We further assume for this problem that all vectors A_i have Euclidean norm at most one. Denote by λ = max_{i∈[n]} ‖x* − A_i‖² the squared radius of the optimal ball, and we say that a solution is ε-approximate if the ball it generates has squared radius at most λ + ε.
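The MEB objective max_i ‖x − A_i‖² is easy to evaluate directly, which is useful for checking candidate centers (a helper of ours, with an invented two-point example where the midpoint is the optimal center):

```python
def meb_objective(x, rows):
    # Squared radius of the smallest ball centered at x containing all rows.
    return max(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for a in rows)

# For two points the midpoint is optimal: lambda = (diameter / 2)^2.
rows = [(0.0, 0.0), (1.0, 0.0)]
mid = (0.5, 0.0)
lam = meb_objective(mid, rows)
```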

As in the case of linear classification, to obtain tight running time bounds we use a primal-dual approach; the algorithm is given below. (This is a "conceptual" version of the algorithm: in the analysis of the running time, we use the fact that we can batch together the updates for w_t over the iterations for which x_t does not change.)

Algorithm 2 Sublinear Primal-Dual MEB

1: Input: ε > 0, A ∈ R^{n×d} with A_i ∈ B for i ∈ [n] and ‖A_i‖ known.
2: Let T ← Θ(ε^{-2} log n), y_1 ← 0, w_1 ← 1_n, η ← √((log n)/T), τ ← (log T)/√(T log n).
3: for t = 1 to T do
4:   p_t ← w_t/‖w_t‖₁
5:   Choose i_t ∈ [n] by i_t ← i with probability p_t(i).
6:   With probability τ, update y_{t+1} ← y_t + A_{i_t}, x_{t+1} ← y_{t+1}/(τ t) (else do nothing).
7:   Choose j_t ∈ [d] by j_t ← j with probability x_t(j)²/‖x_t‖².
8:   for i ∈ [n] do
9:     ṽ_t(i) ← −2 A_i(j_t) ‖x_t‖²/x_t(j_t) + ‖A_i‖² + ‖x_t‖².
10:    v_t(i) ← clip(ṽ_t(i), 1/η).
11:    w_{t+1}(i) ← w_t(i)(1 + η v_t(i) + η² v_t(i)²).
12:  end for
13: end for
14: return x̄ = (1/T) Σ_t x_t

THEOREM 3.1. Algorithm 2 runs in O((log n)/ε²) iterations, with a total expected running time of

  Õ(n/ε² + d/ε),

and with probability 1/2, returns an ε-approximate solution.

PROOF.

Except for the running time analysis,the proof of this theorem is very similar to

that of Theorem 2.7,where we take advantage of a tighter regret bound for strictly

convex loss functions in the case of MEB,for which the OGDalgorithmwith a learning

rate of

1

t

is known to obtain a tighter regret bound of O(log T) instead of O(

p

T).For

presentation,we use asymptotic notation rather than computing the exact constants

(as done for the linear classiﬁcation problem).

Let f

t

(x) = kx A

i

t

k

2

.Notice that arg min

x2B

P

t

=1

f

(x) =

P

t

=1

A

i

t

.By Lemma A.5

such that f

t

(x) = kx A

i

t

k

2

,with G 2 and H = 2,and x

being the solution to the

Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.

0:12 Clarkson,Hazan and Woodruff

instance,we have

E

fc

t

g

[

X

t

kx

t

A

i

t

k

2

] E

fc

t

g

[

X

t

kx

A

i

t

k

2

] +

4

log T T +

4

log T;(10)

where is the squared MEB radius.Here the expectation is taken only over the ran-

domcoin tosses for updating x

t

,denoted c

t

,and holds for any outcome of the indices i

t

sampled fromp

t

and the coordinates j

t

used for the`

2

sampling.

Now we turn to the MW part of our algorithm.By the Weak Regret Lemma 2.3,

using the clipping of v

t

(i),and reversing inequalities to account for the change of sign,

we have

X

t2[T]

p

>

t

v

t

max

i2[n]

X

t2[T]

v

t

(i) O(

log n

+

X

t2[T]

p

>

t

v

2

t

):

Using Lemmas B.4,B.5 with high probability

8i 2 [n]:

X

t2[T]

v

t

(i)

X

t2[T]

kA

i

x

t

k

2

O(T);

X

t2[T]

kx

t

A

i

t

k

2

X

t

p

>

t

v

t

= O(T):

Plugging these two facts in the previous inequality we have w.h.p

X

t2[T]

kx

t

A

i

t

k

2

max

i2[n]

X

t2[T]

kA

i

x

t

k

2

O(

log n

+

X

t2[T]

p

>

t

v

2

t

+T):

This holds w.h.p over the random choices of fi

t

;j

t

g,and irrespective of the coin tosses

fc

t

g.Hence,we can take expectations w.r.t fc

t

g,and obtain

E

fc

t

g

[

X

t2[T]

kx

t

A

i

t

k

2

] E

fc

t

g

[max

i2[n]

X

t2[T]

kA

i

x

t

k

2

] O(

log n

+

X

t2[T]

p

>

t

v

2

t

+T):(11)

Combining with equation (10), we obtain that w.h.p. over the random variables $\{i_t, j_t\}$,
$$\sigma T + 4\log T \;\ge\; \mathbb{E}_{\{c_t\}}\Big[\max_{i\in[n]} \sum_{t\in[T]} \|x_t - A_i\|^2\Big] - O\Big(\frac{\log n}{\eta} + \eta\sum_{t\in[T]} p_t^\top v_t^2 + \varepsilon T\Big).$$

Rearranging and using Lemma B.8, we have w.p. at least $\frac{1}{2}$,
$$\mathbb{E}_{\{c_t\}}\Big[\max_{i\in[n]} \sum_{t\in[T]} \|x_t - A_i\|^2\Big] \;\le\; O\Big(\sigma T + \log T + \frac{\log n}{\eta} + \eta T + \varepsilon T\Big).$$

Dividing through by $T$ and applying Jensen's inequality, we have
$$\mathbb{E}\big[\max_j \|\bar{x} - A_j\|^2\big] \;\le\; \frac{1}{T}\,\mathbb{E}\Big[\max_{i\in[n]} \sum_{t\in[T]} \|x_t - A_i\|^2\Big] \;\le\; O\Big(\sigma + \frac{\log T}{T} + \frac{\log n}{\eta T} + \eta + \varepsilon\Big).$$
Optimizing over the values of $\eta$, $\varepsilon$, and $T$, this implies that the expected error is $O(\varepsilon)$, and so using Markov's inequality, $\bar{x}$ is an $O(\varepsilon)$-approximate solution with probability at least 1/2.


Running time. The algorithm above consists of $T = O(\frac{\log n}{\varepsilon^2})$ iterations. Naively, this would result in the same running time as for linear classification. Yet notice that $x_t$ changes only an expected $\varepsilon T$ times, and only then do we perform an $O(d)$ operation. The expected number of iterations in which $x_t$ changes is $\varepsilon T \le 16\varepsilon^{-1}\log T$, and so the running time is
$$O\big(\varepsilon^{-1}(\log T)\,d + \tfrac{\log n}{\varepsilon^2}\,n\big) = \tilde{O}(\varepsilon^{-2} n + \varepsilon^{-1} d).$$

The following corollary is a direct analogue of Corollary 2.8.

COROLLARY 3.2 (DUAL SOLUTION). The vector $\bar{p} = \sum_t e_{i_t}/T$ is, with probability 1/2, an $O(\varepsilon)$-approximate dual solution.

3.2. High Success Probability and Las Vegas

As for linear classification, we can amplify the success probability of Algorithm 2 to $1-\delta$ for any $\delta > 0$, albeit incurring additional poly-log factors in running time.

COROLLARY 3.3 (MEB HIGH PROBABILITY). There exists a randomized algorithm that with probability $1-\delta$ returns an $\varepsilon$-approximate solution to the MEB problem, and runs in expected time $\tilde{O}\big(\frac{n}{\varepsilon^2}\log\frac{n}{\varepsilon} + \frac{d}{\varepsilon}\log\frac{1}{\varepsilon}\big)$. There is also a randomized algorithm that returns an $\varepsilon$-approximate solution in $\tilde{O}\big(M + \frac{n}{\varepsilon^2} + \frac{d}{\varepsilon}\big)$ time.

PROOF. We can estimate the distance between two points in $B$ in $O(\varepsilon^{-2}\log(1/\delta))$ time, with error at most $\varepsilon$ and failure probability at most $\delta$, using the dot product estimator described in §2.2. Therefore we can estimate the maximum distance of a given point to every input point in $O(n\varepsilon^{-2}\log(n/\delta))$ time, with error at most $\varepsilon$ and failure probability at most $\delta$. This distance is at least $\rho - \varepsilon$, where $\rho$ is the optimal radius attainable, w.p. $1-\delta$. Because Algorithm 2 yields an $\varepsilon$-dual solution with probability 1/2, we can use this solution to verify that the radius of any possible solution to the farthest point is at least $\rho - \varepsilon$.

So, to obtain a solution as described in the lemma statement, run Algorithm 2, and verify that it yields an $\varepsilon$-approximation, using this approximate dual solution; with probability 1/2, this gives a verified $\varepsilon$-approximation. Keep trying until this succeeds, in an expected 2 trials.

For a Las Vegas algorithm, we simply apply the same scheme, but verify the distances exactly.
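The dot product estimator of §2.2 used in the proof above can be sketched via $\ell_2$-sampling: pick coordinate $j$ with probability $y(j)^2/\|y\|^2$ and return $x(j)\|y\|^2/y(j)$, which is unbiased for $x^\top y$. The following is a minimal illustrative sketch with made-up data, not the paper's exact procedure:

```python
import random

def l2_sample_dot(x, y, rng):
    """One l2-sample: pick j with probability y[j]^2/||y||^2,
    return x[j]*||y||^2/y[j]; unbiased for <x, y>."""
    norm2 = sum(v * v for v in y)
    r = rng.random() * norm2
    acc = 0.0
    pick = max(range(len(y)), key=lambda j: abs(y[j]))  # fallback for float edge cases
    for j, v in enumerate(y):
        if v == 0.0:
            continue  # zero coordinates are never sampled
        acc += v * v
        if r <= acc:
            pick = j
            break
    return x[pick] * norm2 / y[pick]

rng = random.Random(0)
x = [0.5, -0.25, 0.25]
y = [0.5, 0.5, -0.5]
exact = sum(a * b for a, b in zip(x, y))
est = sum(l2_sample_dot(x, y, rng) for _ in range(200000)) / 200000
```

Averaging many samples drives the estimate toward the true inner product, at a per-sample cost independent of $d$ after preprocessing the sampling distribution.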

3.3. Convex Quadratic Programming in the Simplex

We can extend our approach to problems of the form
$$\min_{p\in\Delta} p^\top b + p^\top A A^\top p, \tag{12}$$
where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n\times d}$, and $\Delta$ is, as usual, the unit simplex in $\mathbb{R}^n$. As is well known, and as we partially review below, this problem includes the MEB problem, margin estimation as for hard margin support vector machines, the $L_2$-SVM variant of support vector machines, the problem of finding the shortest vector in a polytope, and others.

Applying $\|v - x\|^2 = v^\top v + x^\top x - 2v^\top x \ge 0$ with $v = A^\top p$, we have
$$\max_{x\in\mathbb{R}^d} 2p^\top A x - \|x\|^2 = p^\top A A^\top p, \tag{13}$$
with equality at $x = A^\top p$. Thus (12) can be written as
$$\min_{p\in\Delta} \max_{x\in\mathbb{R}^d} p^\top\big(b + 2Ax - \mathbf{1}_n\|x\|^2\big). \tag{14}$$
The Wolfe dual of this problem exchanges the max and min:
$$\max_{x\in\mathbb{R}^d} \min_{p\in\Delta} p^\top\big(b + 2Ax - \mathbf{1}_n\|x\|^2\big). \tag{15}$$
Since
$$\min_{p\in\Delta} p^\top\big(b + 2Ax - \mathbf{1}_n\|x\|^2\big) = \min_i b(i) + 2A_i x - \|x\|^2, \tag{16}$$
with equality when $p_{\hat{i}} = 0$ if $\hat{i}$ is not a minimizer, the dual can also be expressed as
$$\max_{x\in\mathbb{R}^d} \min_i b(i) + 2A_i x - \|x\|^2. \tag{17}$$
By the two relations (13) and (16) used to derive the dual problem from the primal, we have immediately the weak duality condition that the objective function value of the dual (17) is always no more than the objective function value of the primal (12). The strong duality condition, that the two problems take the same optimal value, also holds here; indeed, the optimum $x^*$ also solves (13), and the optimal $p^*$ also solves (16).
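Relation (13) is easy to check numerically: at $x = A^\top p$ the quadratic $2p^\top Ax - \|x\|^2$ equals $p^\top AA^\top p$, and any other $x$ does worse. The sketch below uses arbitrary made-up data, and `obj` is a hypothetical helper, not notation from the paper:

```python
import random

random.seed(0)
n, d = 5, 3
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
p = [random.random() for _ in range(n)]
s = sum(p)
p = [v / s for v in p]  # a point in the unit simplex

# x* = A^T p, the maximizer of 2 p^T A x - ||x||^2
x_star = [sum(p[i] * A[i][j] for i in range(n)) for j in range(d)]

def obj(x):
    """Evaluate 2 p^T A x - ||x||^2."""
    pAx = sum(p[i] * sum(A[i][j] * x[j] for j in range(d)) for i in range(n))
    return 2 * pAx - sum(v * v for v in x)

rhs = sum(v * v for v in x_star)  # p^T A A^T p = ||A^T p||^2
assert abs(obj(x_star) - rhs) < 1e-9
# any other x does no better, since ||A^T p - x||^2 >= 0
x_other = [v + 0.3 for v in x_star]
assert obj(x_other) <= obj(x_star) + 1e-9
```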

To generalize Algorithm 2, we make $v_t$ an unbiased estimator of $b + 2Ax_t - \mathbf{1}_n\|x_t\|^2$, and set $x_{t+1}$ to be the optimizer of
$$\sum_{t'\in[t]} b(i_{t'}) + 2A_{i_{t'}} x - \|x\|^2;$$
namely, as with MEB, $y_{t+1} \leftarrow \sum_{t'\in[t]} A_{i_{t'}}$, and $x_{t+1} \leftarrow y_{t+1}/t$. (We also make some sign changes to account for the max-min formulation here, versus the min-max formulation used for MEB above.) This allows the use of Lemma A.4 for essentially the same analysis as for MEB; the gradient bound $G$ and Hessian bound $H$ are both at most 2, again assuming that all $A_i \in B$.

MEB. When $b(i) = -\|A_i\|^2$, we have
$$\max_{x\in\mathbb{R}^d} \min_i b(i) + 2A_i x - \|x\|^2 = -\min_{x\in\mathbb{R}^d} \max_i \|A_i\|^2 - 2A_i x + \|x\|^2 = -\min_{x\in\mathbb{R}^d} \max_i \|x - A_i\|^2,$$
the negative of the objective function for the MEB problem.

Margin Estimation. When $b = 0$ in the primal problem (12), that problem is one of finding the shortest vector in the polytope $\{A^\top p \mid p \in \Delta\}$. Considering this case of the dual problem (17), for any given $x \in \mathbb{R}^d$ with $\min_i A_i x \le 0$, the value of $\lambda \in \mathbb{R}$ such that $\lambda x$ maximizes $\min_i 2\lambda A_i x - \|\lambda x\|^2$ is $\lambda = 0$. On the other hand, if $x$ is such that $\min_i A_i x > 0$, the maximizing value is $\lambda = \min_i A_i x/\|x\|^2$, so that the solution of (17) also maximizes $\min_i (A_i x)^2/\|x\|^2$. The latter is the square of the margin $\sigma$, which as before is the minimum distance of the points $A_i$ to the hyperplane that is normal to $x$ and passes through the origin.

Adapting Algorithm 2 for margin estimation, and with the slight changes needed for its analysis, we have that there is an algorithm taking $\tilde{O}(n/\varepsilon^2 + d/\varepsilon)$ time that finds $x \in \mathbb{R}^d$ such that, for all $i \in [n]$,
$$2A_i x - \|x\|^2 \ge \sigma^2 - \varepsilon.$$
When $\sigma^2 \le \varepsilon$, we don't appear to gain any useful information. However, when $\sigma^2 > \varepsilon$, we have $\min_{i\in[n]} A_i x > 0$, and so, by appropriate scaling of $x$, we have $\hat{x}$ such that
$$\hat{\sigma}^2 = \min_{i\in[n]} (A_i\hat{x})^2/\|\hat{x}\|^2 \ge \min_{i\in[n]} 2A_i\hat{x} - \|\hat{x}\|^2 \ge \sigma^2 - \varepsilon,$$
and so $\hat{\sigma}^2 \ge \sigma^2 - \varepsilon$. That is, letting $\varepsilon = \varepsilon'\sigma^2$, if $\varepsilon' \le 1$, there is an algorithm taking $\tilde{O}(n/(\varepsilon'\sigma^2)^2 + d/(\varepsilon'\sigma^2))$ time that finds a solution $\hat{x}$ with $\hat{\sigma}^2 \ge (1-\varepsilon')\sigma^2$.

4. A GENERIC SUBLINEAR PRIMAL-DUAL ALGORITHM

We note that our technique above can be applied more broadly to any constrained optimization problem for which low-regret algorithms exist and low-variance sampling can be applied efficiently; that is, consider the general problem with optimum $\sigma$:
$$\max_{x\in K} \min_i c_i(x) = \sigma. \tag{18}$$
Suppose that for the set $K$ and cost functions $c_i(x)$, there exists an iterative low-regret algorithm, denoted LRA, with regret $R(T) = o(T)$. Let $T_\varepsilon(\mathrm{LRA})$ be the smallest $T$ such that $\frac{R(T)}{T} \le \varepsilon$. We denote by $x_{t+1} \leftarrow \mathrm{LRA}(x_t, c)$ an invocation of this algorithm, when at state $x_t \in K$ and the cost function $c$ is observed.

Let $\mathrm{Sample}(x, c)$ be a procedure that returns an unbiased estimate of $c(x)$ with variance at most one, that runs in constant time. Further assume $|c_i(x)| \le 1$ for all $x \in K$, $i \in [n]$.

Algorithm 3 Generic Sublinear Primal-Dual Algorithm
1: Let $T \leftarrow \max\{T_\varepsilon(\mathrm{LRA}), \frac{\log n}{\varepsilon^2}\}$, $x_1 \leftarrow \mathrm{LRA}(\text{initial})$, $w_1 \leftarrow \mathbf{1}_n$, $\eta \leftarrow \frac{1}{100}\sqrt{\frac{\log n}{T}}$.
2: for $t = 1$ to $T$ do
3:  for $i \in [n]$ do
4:   $\tilde{v}_t(i) \leftarrow \mathrm{Sample}(x_t, c_i)$
5:   $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$
6:   $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
7:  end for
8:  $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$
9:  Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
10:  $x_t \leftarrow \mathrm{LRA}(x_{t-1}, c_{i_t})$
11: end for
12: return $\bar{x} = \frac{1}{T}\sum_t x_t$
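A compact sketch of Algorithm 3's control flow may help. As assumptions for illustration only: LRA is taken to be "add the chosen row to a running sum and project onto the unit ball", Sample evaluates $c_i(x) = \langle A_i, x\rangle$ exactly (a zero-variance sampler), and the data is made up; this is not the paper's implementation:

```python
import math
import random

def clip(v, c):
    """Truncate v to the interval [-c, c]."""
    return max(-c, min(c, v))

def sublinear_primal_dual(A, T, rng):
    """Control-flow sketch of Algorithm 3 for c_i(x) = <A_i, x> over the unit ball."""
    n = len(A)
    d = len(A[0])
    eta = 0.01 * math.sqrt(math.log(n) / T)
    w = [1.0] * n              # MW weights, w_1 = 1_n
    y = [0.0] * d              # running sum maintained by the "LRA"
    x = [0.0] * d              # current iterate x_t
    x_bar = [0.0] * d
    for _ in range(T):
        # estimate and clip v_t(i), then apply the MW update
        v = [clip(sum(a * b for a, b in zip(A[i], x)), 1.0 / eta) for i in range(n)]
        w = [w[i] * (1 - eta * v[i] + eta ** 2 * v[i] ** 2) for i in range(n)]
        tot = sum(w)
        # choose i_t with probability p_t(i) = w_t(i)/||w_t||_1
        r, acc, i_t = rng.random() * tot, 0.0, n - 1
        for i, wi in enumerate(w):
            acc += wi
            if r <= acc:
                i_t = i
                break
        # "LRA" step: incorporate row i_t, project the sum onto the unit ball
        y = [yi + ai for yi, ai in zip(y, A[i_t])]
        norm = math.sqrt(sum(u * u for u in y))
        x = [yi / max(1.0, norm) for yi in y]
        x_bar = [xb + xi / T for xb, xi in zip(x_bar, x)]
    return x_bar

rng = random.Random(1)
A = [[1.0, 0.0], [0.8, 0.6]]  # made-up rows sharing a positive-margin direction
x_bar = sublinear_primal_dual(A, 400, rng)
margin = min(sum(a * b for a, b in zip(row, x_bar)) for row in A)
```

On this toy instance both rows have a common positive-margin direction, so the averaged iterate $\bar{x}$ attains a clearly positive margin.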

Applying the techniques of Section 2, we can obtain the following generic lemma.

LEMMA 4.1. The generic sublinear primal-dual algorithm returns a solution $\bar{x}$ that with probability at least $\frac{1}{2}$ is an $\varepsilon$-approximate solution, in $\max\{T_\varepsilon(\mathrm{LRA}), \frac{\log n}{\varepsilon^2}\}$ iterations.

PROOF. First we use the regret bounds for LRA to lower bound $\sum_{t\in[T]} c_{i_t}(x_t)$, next we get an upper bound for that quantity using the Weak Regret Lemma, and then we combine the two in expectation.

By definition, $c_i(x^*) \ge \sigma$ for all $i \in [n]$, and so, using the LRA regret guarantee,
$$\sigma T \le \max_{x\in K} \sum_{t\in[T]} c_{i_t}(x) \le \sum_{t\in[T]} c_{i_t}(x_t) + R(T), \tag{19}$$


or rearranging,
$$\sum_{t\in[T]} c_{i_t}(x_t) \ge \sigma T - R(T). \tag{20}$$
Now we turn to the MW part of our algorithm. By the Weak Regret Lemma 2.3, and using the clipping of $v_t(i)$,
$$\sum_{t\in[T]} p_t^\top v_t \le \min_{i\in[n]} \sum_{t\in[T]} v_t(i) + (\log n)/\eta + \eta\sum_{t\in[T]} p_t^\top v_t^2.$$

Using Lemma B.4 and Lemma B.5, since the procedure Sample is unbiased and has variance at most one, with high probability:
$$\forall i \in [n],\quad \sum_{t\in[T]} v_t(i) \le \sum_{t\in[T]} c_i(x_t) + O(\varepsilon T), \qquad \sum_{t\in[T]} c_{i_t}(x_t) - \sum_t p_t^\top v_t = O(\varepsilon T).$$

Plugging these two facts in the previous inequality, we have w.h.p.
$$\sum_{t\in[T]} c_{i_t}(x_t) \le \min_{i\in[n]} \sum_{t\in[T]} c_i(x_t) + O\Big(\frac{\log n}{\eta} + \eta\sum_{t\in[T]} p_t^\top v_t^2 + \varepsilon T\Big). \tag{21}$$

Combining (20) and (21) we get w.h.p.
$$\min_{i\in[n]} \sum_{t\in[T]} c_i(x_t) \ge \sigma T - O\Big(\frac{\log n}{\eta} + \varepsilon T + \eta\sum_{t\in[T]} p_t^\top v_t^2\Big) - R(T).$$
And via Lemma B.8 we have w.p. at least $\frac{1}{2}$ that
$$\min_{i\in[n]} \sum_{t\in[T]} c_i(x_t) \ge \sigma T - O\Big(\frac{\log n}{\eta} + \varepsilon T\Big) - R(T).$$
Dividing through by $T$, and using our choice of $\eta$, we have $\min_i c_i(\bar{x}) \ge \sigma - \varepsilon/2$ w.p. at least 1/2, as claimed.

High-probability results can be obtained using the same technique as for linear classification.

4.1. More applications

The generic algorithm above can be used to derive the result of [Grigoriadis and Khachiyan 1995] on sublinear approximation of zero-sum games with payoffs/losses bounded by one (up to poly-logarithmic factors in running time). A zero-sum game can be cast as the following min-max optimization problem:
$$\min_{x\in\Delta_d} \max_{i\in[n]} A_i x.$$
That is, the constraints are inner products with the rows of the game matrix. This is exactly the same as the linear classification problem, but the vectors $x$ are taken from the convex set $K$ which is the simplex, i.e., the set of all mixed strategies of the column player.


A low-regret algorithm for the simplex is the multiplicative weights algorithm, which attains regret $R(T) \le 2\sqrt{T\log n}$. The procedure $\mathrm{Sample}(x, A_i)$ to estimate the inner product $A_i x$ is much simpler than the one used for linear classification: we sample from the distribution $x$ and return $A_i(j)$ w.p. $x(j)$. This has correct expectation and variance bounded by one (in fact, the random variable is always bounded by one). Lemma 4.1 then implies:

COROLLARY 4.2. The sublinear primal-dual algorithm applied to zero-sum games returns a solution $\bar{x}$ that with probability at least $\frac{1}{2}$ is an $\varepsilon$-approximate solution in $O(\frac{\log n}{\varepsilon^2})$ iterations and total time $\tilde{O}(\frac{n+d}{\varepsilon^2})$.
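The sampling procedure just described admits a one-line sketch: draw $j$ with probability $x(j)$ and return $A_i(j)$. A minimal illustration with made-up data:

```python
import random

def sample_payoff(x, row, rng):
    """Estimate <row, x> for x in the simplex: draw j ~ x, return row[j]."""
    r, acc = rng.random(), 0.0
    for j, xj in enumerate(x):
        acc += xj
        if r <= acc:
            return row[j]
    return row[-1]  # guard against float rounding in the cumulative sum

rng = random.Random(0)
x = [0.2, 0.5, 0.3]        # a mixed strategy
row = [1.0, -0.5, 0.25]    # one row of the game matrix, entries in [-1, 1]
exact = sum(a * b for a, b in zip(row, x))
est = sum(sample_payoff(x, row, rng) for _ in range(200000)) / 200000
```

Since the payoffs are bounded by one, each sample is bounded by one, so the variance condition of Lemma 4.1 holds trivially.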

Essentially any constrained optimization problem which has convex or linear constraints, and is over a simple convex body such as the ball or simplex, can be approximated in sublinear time using our method.

5. KERNELIZING THE SUBLINEAR ALGORITHMS

An important generalization of linear classifiers is that of kernel-based linear predictors (see e.g. [Schölkopf and Smola 2003]). Let $\psi : \mathbb{R}^d \mapsto \mathcal{H}$ be a mapping of feature vectors into a reproducing kernel Hilbert space. In this setting, we seek a non-linear classifier given by $h \in \mathcal{H}$ so as to maximize the margin:
$$\max_{h\in\mathcal{H}} \min_{i\in[n]} \langle h, \psi(A_i)\rangle.$$
The kernels of interest are those for which we can compute inner products of the form $k(x, y) = \langle \psi(x), \psi(y)\rangle$ efficiently.

One popular kernel is the polynomial kernel, for which the corresponding Hilbert space is the set of polynomials over $\mathbb{R}^d$ of degree $q$. The mapping for this kernel is given by
$$\forall S \subseteq [d], |S| \le q:\quad \psi(x)_S = \prod_{i\in S} x_i.$$
That is, all monomials of degree at most $q$. The kernel function in this case is given by $k(x, y) = (x^\top y)^q$. Another useful kernel is the Gaussian kernel $k(x, y) = \exp(-\frac{\|x-y\|^2}{2s^2})$, where $s$ is a parameter. The mapping here is defined by the kernel function (see [Schölkopf and Smola 2003] for more details).

The kernel version of Algorithm 1 is shown in Figure 4. Note that $x_t$ and $y_t$ are members of $\mathcal{H}$, and not maintained explicitly, but rather are implicitly represented by the values $i_t$. (And thus $\|y_t\|$ is the norm of $\mathcal{H}$, not $\mathbb{R}^d$.) Also, $\psi(A_i)$ is not computed. The needed kernel product $\langle x_t, \psi(A_i)\rangle$ is estimated by the procedure Kernel-L2-Sampling, using the implicit representations and specific properties of the kernel being used. In the regular sublinear algorithm, this inner product could be sufficiently well approximated in $O(1)$ time via $\ell_2$-sampling. As we show below, for many interesting kernels the time for Kernel-L2-Sampling is not much longer.

For the analog of Theorem 2.7 to apply, we need the expectation of the estimates $v_t(i)$ to be correct, with variance $O(1)$. By Lemma C.1, it is enough if the estimates $v_t(i)$ have an additive bias of $O(\varepsilon)$. Hence, we define the procedure Kernel-L2-Sampling to obtain such a not-too-biased estimator with variance at most one; first we show how to implement Kernel-L2-Sampling, assuming that there is an estimator $\tilde{k}(\cdot)$ of the kernel $k(\cdot)$ such that $\mathbb{E}[\tilde{k}(x,y)] = k(x,y)$ and $\mathrm{Var}(\tilde{k}(x,y)) \le 1$, and then we show how to implement such kernel estimators.


Algorithm 4 Sublinear Kernel Perceptron
1: Input: $\varepsilon > 0$, $A \in \mathbb{R}^{n\times d}$ with $A_i \in B$ for $i \in [n]$.
2: Let $T \leftarrow 200^2\varepsilon^{-2}\log n$, $y_1 \leftarrow \mathbf{0}$, $w_1 \leftarrow \mathbf{1}_n$, $\eta \leftarrow \frac{1}{100}\sqrt{\frac{\log n}{T}}$.
3: for $t = 1$ to $T$ do
4:  $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$, $x_t \leftarrow \frac{y_t}{\max\{1, \|y_t\|\}}$.
5:  Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
6:  $y_{t+1} \leftarrow \sum_{\tau\in[t]} \psi(A_{i_\tau})/\sqrt{2T}$.
7:  for $i \in [n]$ do
8:   $\tilde{v}_t(i) \leftarrow \text{Kernel-L2-Sampling}(x_t, \psi(A_i))$. (estimating $\langle x_t, \psi(A_i)\rangle$)
9:   $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$.
10:   $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$.
11:  end for
12: end for
13: return $\bar{x} = \frac{1}{T}\sum_t x_t$

5.1. Implementing Kernel-L2-Sampling

Estimating $\|y_t\|$. A key step in Kernel-L2-Sampling is the estimation of $\|y_t\|$, which readily reduces to estimating
$$Y_t \equiv 2T\|y_t\|^2/t^2 = \frac{1}{t^2}\sum_{\tau,\tau'\in[t]} k(A_{i_\tau}, A_{i_{\tau'}}),$$
that is, the mean of the summands. Since we use $\max\{1, \|y_t\|\}$, we need not be concerned with small $\|y_t\|$, and it is enough that the additive bias in our estimate of $Y_t$ be at most $(\varepsilon/T)(2T/t^2)$ for $t \in [T]$, implying a bias for $\|y_t\|$ no more than $\varepsilon$. Since we need $1/\|y_t\|$ in the algorithm, it is not enough for estimates of $Y_t$ just to be good in mean and variance; we will find an estimator whose error bounds hold with high probability.

Our estimate $\tilde{Y}_t$ of $Y_t$ can first be considered assuming we only need to make an estimate for a single value of $t$.

Let $N_Y \leftarrow t^2\lceil (8/3)\log(1/\delta)\, T^2/(\varepsilon^2 t^2)\rceil$. To estimate $Y_t$, we compute, for each $\tau, \tau' \in [t]$, $n_t \equiv N_Y/t^2$ independent estimates
$$X_{\tau,\tau',m} \leftarrow \mathrm{clip}\big(\tilde{k}(A_{i_\tau}, A_{i_{\tau'}}),\, T/\varepsilon\big), \quad \text{for } m \in [n_t],$$
and our estimate is
$$\tilde{Y}_t \leftarrow \sum_{\substack{\tau,\tau'\in[t]\\ m\in[n_t]}} X_{\tau,\tau',m}/N_Y.$$

LEMMA 5.1. With probability at least $1-\delta$, $|Y_t - \tilde{Y}_t| \le \varepsilon/T$.

PROOF. We apply Bernstein's inequality (as in (31)) to the $N_Y$ random variables $X_{\tau,\tau',m} - \mathbb{E}[X_{\tau,\tau',m}]$, which have mean zero, variance at most one, and are at most $T/\varepsilon$ in magnitude. Bernstein's inequality implies, using $\mathrm{Var}[X_{\tau,\tau',m}] \le 1$,
$$\log\mathrm{Prob}\Big\{\sum_{\substack{\tau,\tau'\in[t]\\ m\in[n_t]}} \big(X_{\tau,\tau',m} - \mathbb{E}[X_{\tau,\tau',m}]\big) > \lambda\Big\} \le -\lambda^2/\big(N_Y + (T/\varepsilon)\lambda/3\big),$$
and putting $\lambda = N_Y\varepsilon/T$ gives
$$\log\mathrm{Prob}\{\tilde{Y}_t - \mathbb{E}[\tilde{Y}_t] > \varepsilon/T\} \le -N_Y^2(\varepsilon/T)^2/\big(N_Y + (T/\varepsilon)N_Y(\varepsilon/T)/3\big) \le -(8/3)\log(1/\delta)(3/4) = -2\log(1/\delta).$$
Similar reasoning for $-X_{\tau,\tau',m}$, and the union bound, implies the lemma.

To compute $\tilde{Y}_t$ for $t = 1\ldots T$, we can save some work by reusing estimates from one $t$ to the next. Now let $N_Y \leftarrow \lceil (8/3)\log(1/\delta)\,T^2/\varepsilon^2\rceil$. Compute $\tilde{Y}_1$ as above for $t = 1$, and let $\hat{Y}_1 \leftarrow \tilde{Y}_1$. For $t > 1$, let $n_t \leftarrow \lceil N_Y/t^2\rceil$, and let
$$\hat{Y}_t \leftarrow \sum_{m\in[n_t]} X_{t,t,m}/n_t + \sum_{\substack{\tau\in[t-1]\\ m\in[n_t]}} \big(X_{t,\tau,m} + X_{\tau,t,m}\big)/n_t,$$
and return $\tilde{Y}_t \leftarrow \sum_{\tau\in[t]} \hat{Y}_\tau/t^2$.

Since for each $\tau$ and $\tau'$, the expected total contribution of all $X_{\tau,\tau',m}$ terms to $\tilde{Y}_t$ is $k(A_{i_\tau}, A_{i_{\tau'}})$, we have $\mathbb{E}[\tilde{Y}_t] = Y_t$. Moreover, the number of instances of $X_{\tau,\tau',m}$ averaged to compute $\tilde{Y}_t$ is always at least as large as the number used for the above "batch" version; it follows that the total variance of $\tilde{Y}_t$ is non-increasing in $t$, and therefore Lemma 5.1 holds also for the $\tilde{Y}_t$ computed stepwise.

Since the number of calls to $\tilde{k}(\cdot,\cdot)$ is $\sum_{t\in[T]} (1 + 2n_t) = O(N_Y)$, we have the following lemma.

LEMMA 5.2. The values $\tilde{Y}_t(t^2/2T) \approx \|y_t\|^2$, $t \in [T]$, can be estimated with $O(\log(1/\delta)T^2/\varepsilon^2)$ calls to $\tilde{k}(\cdot,\cdot)$, so that with probability at least $1-\delta$, $|\tilde{Y}_t(t^2/2T) - \|y_t\|^2| \le \varepsilon$. The values $\|y_t\|$, $t \in [T]$, can be computed exactly with $T^2$ calls to the exact kernel $k(\cdot,\cdot)$.

PROOF. This follows from the discussion above, applying the union bound over $t \in [T]$, and adjusting constants. The claim for exact computation is straightforward.

Given this procedure for estimating $\|y_t\|$, we can describe Kernel-L2-Sampling. Since $x_{t+1} = y_{t+1}/\max\{1, \|y_{t+1}\|\}$, we have
$$\langle x_{t+1}, \psi(A_i)\rangle = \frac{1}{\max\{1,\|y_{t+1}\|\}\sqrt{2T}} \sum_{\tau\in[t]} \langle \psi(A_{i_\tau}), \psi(A_i)\rangle = \frac{1}{\max\{1,\|y_{t+1}\|\}\sqrt{2T}} \sum_{\tau\in[t]} k(A_{i_\tau}, A_i), \tag{22}$$
so that the main remaining step is to estimate $\sum_{\tau\in[t]} k(A_{i_\tau}, A_i)$, for $i \in [n]$. Here we simply call $\tilde{k}(A_{i_\tau}, A_i)$ for each $\tau$. We save time, at the cost of $O(n)$ space, by saving the value of the sum for each $i \in [n]$, and updating it for the next $t$ with $n$ calls $\tilde{k}(A_{i_t}, A_i)$.

LEMMA 5.3. Let $L_k$ denote the expected time needed for one call to $\tilde{k}(\cdot,\cdot)$, and $T_k$ denote the time needed for one call to $k(\cdot,\cdot)$. Except for estimating $\|y_t\|$, Kernel-L2-Sampling can be computed in $nL_k$ expected time per iteration $t$. The resulting estimate has expectation within additive $\varepsilon$ of $\langle x_t, \psi(A_i)\rangle$, and variance at most one. Thus Algorithm 4 runs in time $\tilde{O}\big(\frac{L_k n + d}{\varepsilon^2} + \min\{\frac{L_k}{\varepsilon^6}, \frac{T_k}{\varepsilon^4}\}\big)$, and produces a solution with properties as in Algorithm 1.

PROOF. For Kernel-L2-Sampling it remains only to show that its variance is at most one, given that each $\tilde{k}(\cdot,\cdot)$ has variance at most one. We observe from (22) that $t$ independent estimates $\tilde{k}(\cdot,\cdot)$ are added together, and scaled by a value that is at most $1/\sqrt{2T}$. Since the variance of the sum is at most $t$, and the variance is scaled by a value no more than $1/2T$, the variance of Kernel-L2-Sampling is at most one. The only bias in the estimate is due to estimation of $\|y_t\|$, which gives relative error of $\varepsilon$. For our kernels, $\|\psi(v)\| \le 1$ if $v \in B$, so the additive error of Kernel-L2-Sampling is $O(\varepsilon)$.

The analysis of Algorithm 4 then follows as for the un-kernelized perceptron; we neglect the time needed for preprocessing for the calls to $\tilde{k}(\cdot,\cdot)$, as it is dominated by other terms for the kernels we consider, and this is likely in general.

5.2. Implementing the Kernel Estimators

Using the lemma above we can derive corollaries for the Gaussian and polynomial kernels. More general kernels can be handled via the technique of [Cesa-Bianchi et al. 2010].

Polynomial kernels. For the polynomial kernel of degree $q$, estimating a single kernel product, i.e. $k(x,y) = k(A_i, A_j)$, where the norms of $x$ and $y$ are at most one, takes $O(q)$ time, as follows. Recall that for the polynomial kernel, $k(x,y) = (x^\top y)^q$. To estimate this kernel we take the product of $q$ independent $\ell_2$-samples, yielding $\tilde{k}(x,y)$. Notice that the expectation of this estimator is exactly equal to the product of expectations, $\mathbb{E}[\tilde{k}(x,y)] = (x^\top y)^q$. The variance of this estimator is bounded by the product of the second moments of the samples, which gives $\mathrm{Var}(\tilde{k}(x,y)) \le (\|x\|\|y\|)^{2q} \le 1$. Of course, calculating the inner product exactly takes $O(d\log q)$ time. We obtain:

COROLLARY 5.4. For the polynomial degree-$q$ kernel, Algorithm 4 runs in time
$$\tilde{O}\Big(\frac{q(n+d)}{\varepsilon^2} + \min\Big\{\frac{d\log q}{\varepsilon^4}, \frac{q}{\varepsilon^6}\Big\}\Big).$$
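The product-of-$\ell_2$-samples construction above can be sketched directly; the vectors below are made up, and the helper names are hypothetical:

```python
import random

def l2_sample(x, y, rng):
    """One l2-sample of <x, y>: j with probability y[j]^2/||y||^2,
    value x[j]*||y||^2/y[j]; unbiased for <x, y>."""
    norm2 = sum(v * v for v in y)
    r, acc = rng.random() * norm2, 0.0
    pick = max(range(len(y)), key=lambda j: abs(y[j]))
    for j, v in enumerate(y):
        if v == 0.0:
            continue
        acc += v * v
        if r <= acc:
            pick = j
            break
    return x[pick] * norm2 / y[pick]

def poly_kernel_estimate(x, y, q, rng):
    """Unbiased estimate of (x^T y)^q: product of q independent l2-samples."""
    out = 1.0
    for _ in range(q):
        out *= l2_sample(x, y, rng)
    return out

rng = random.Random(0)
x = [0.6, 0.0, 0.6]
y = [0.5, 0.5, 0.5]
q = 2
exact = sum(a * b for a, b in zip(x, y)) ** q
est = sum(poly_kernel_estimate(x, y, q, rng) for _ in range(200000)) / 200000
```

Independence of the $q$ factors is what makes the expectation of the product equal the product of the expectations.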

Gaussian kernels. To estimate the Gaussian kernel function, we assume that $\|x\|$ and $\|y\|$ are known and no more than $s/2$; thus to estimate
$$k(x,y) = \exp(-\|x-y\|^2/2s^2) = \exp\big(-(\|x\|^2 + \|y\|^2)/2s^2\big)\cdot\exp(x^\top y/s^2),$$
we need to estimate $\exp(x^\top y/s^2)$. For $\exp(\alpha X) = \sum_{i\ge 0} \alpha^i X^i/i!$ with random $X$ and parameter $\alpha > 0$, we pick index $i$ with probability $\exp(-\alpha)\alpha^i/i!$ (that is, $i$ has a Poisson distribution) and return $\exp(\alpha)$ times the product of $i$ independent estimates of $X$. In our case we take $X$ to be the average of $c$ $\ell_2$-samples of $x^\top y$, and hence $\mathbb{E}[X] = x^\top y$ and $\mathbb{E}[X^2] \le (x^\top y)^2 + \frac{1}{c} \le 1 + \frac{1}{c}$. The expectation of our kernel estimator is thus:
$$\mathbb{E}[\tilde{k}(x,y)] = \mathbb{E}\Big[\sum_{i\ge 0} \frac{e^{-\alpha}\alpha^i}{i!}\, e^{\alpha} X^i\Big] = \sum_{i\ge 0} \frac{\alpha^i}{i!}\prod_{j=1}^i \mathbb{E}[X] = \exp(\alpha\, x^\top y).$$
The second moment of this estimator is bounded by:
$$\mathbb{E}[\tilde{k}(x,y)^2] = \mathbb{E}\Big[\sum_{i\ge 0} \frac{e^{-\alpha}\alpha^i}{i!}\, e^{2\alpha} (X^i)^2\Big] = e^{\alpha}\sum_{i\ge 0} \frac{\alpha^i}{i!}\prod_{j=1}^i \mathbb{E}[X^2] \le \exp\Big(2\alpha + \frac{\alpha}{c}\Big).$$
Hence, we take $\alpha = c = \frac{1}{s^2}$. This gives a correct estimator in terms of expectation and constant variance. The variance can further be made smaller than one by taking the average of a constant number of estimators of the above type.

As for evaluation time, the expected size of the index $i$ is $\alpha = \frac{1}{s^2}$. Thus, we require in expectation $c\alpha = \frac{1}{s^4}$ $\ell_2$-samples.

We obtain:


COROLLARY 5.5. For the Gaussian kernel with parameter $s$, Algorithm 4 runs in time
$$\tilde{O}\Big(\frac{n+d}{s^4\varepsilon^2} + \min\Big\{\frac{d}{\varepsilon^4}, \frac{1}{s^4\varepsilon^6}\Big\}\Big).$$
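The Poisson-index trick for exponentials can be sketched in isolation. Here `sample_X` is a made-up unbiased estimator of a mean $\mu$ (standing in for the average of $\ell_2$-samples of $x^\top y$), and the hypothetical helper `exp_estimate` returns $e^{\alpha}$ times a product of a Poisson-distributed number of independent estimates:

```python
import math
import random

def poisson(alpha, rng):
    """Sample i ~ Poisson(alpha) by inversion (cap guards the float tail)."""
    r, i, p, c = rng.random(), 0, math.exp(-alpha), 0.0
    while c + p < r and i < 1000:
        c += p
        i += 1
        p *= alpha / i
    return i

def exp_estimate(alpha, sample_X, rng):
    """Unbiased estimate of exp(alpha * E[X]): exp(alpha) times the product
    of i ~ Poisson(alpha) independent estimates of X."""
    out = math.exp(alpha)
    for _ in range(poisson(alpha, rng)):
        out *= sample_X(rng)
    return out

rng = random.Random(0)
alpha, mu = 0.5, 0.3
sample_X = lambda r: mu + r.choice([-0.1, 0.1])  # made-up unbiased estimator of mu
exact = math.exp(alpha * mu)
est = sum(exp_estimate(alpha, sample_X, rng) for _ in range(200000)) / 200000
```

Because the Poisson index has expectation $\alpha$, the expected number of estimates of $X$ consumed per call is $\alpha$, matching the evaluation-time accounting above.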

5.3. Kernelizing the MEB and strictly convex problems

Analogously to Algorithm 4, we can define the kernel version of strongly convex problems, including MEB. The kernelized version of MEB is particularly efficient, since as in Algorithm 2, the norm $\|y_t\|$ is never required. This means that the procedure Kernel-L2-Sampling can be computed in time $O(nL_k)$ per iteration, for a total running time of $O(L_k(\varepsilon^{-2}n + \varepsilon^{-1}d))$.

6. LOWER BOUNDS

All of our lower bounds are information-theoretic, meaning that any successful algorithm must read at least some number of entries of the input matrix $A$. Clearly this also lower bounds the time complexity of the algorithm in the unit-cost RAM model.

Some of our arguments use the following meta-theorem. Consider a $p \times q$ matrix $A$, where $p$ is an even integer. Consider the following random process. Let $W \ge q$. Let $a = 1 - 1/W$, and let $e_j$ denote the $j$-th standard $q$-dimensional unit vector. For each $i \in [p/2]$, choose a random $j \in [q]$ uniformly, and set $A_{i+p/2} \leftarrow A_i \leftarrow a e_j + b(\mathbf{1}_q - e_j)$, where $b$ is chosen so that $\|A_i\|_2 = 1$. We say that such an $A$ is a YES instance. With probability 1/2, transform $A$ into a NO instance as follows: choose a random $i^* \in [p/2]$ uniformly, and if $A_{i^*} = a e_{j^*} + b(\mathbf{1}_q - e_{j^*})$ for a particular $j^* \in [q]$, set $A_{i^*+p/2} \leftarrow -a e_{j^*} + b(\mathbf{1}_q - e_{j^*})$.

Suppose there is a randomized algorithm reading at most $s$ positions of $A$ which distinguishes YES and NO instances with probability $\ge 2/3$, where the probability is over the algorithm's coin tosses and this distribution $\mu$ on YES and NO instances. By averaging, this implies a deterministic algorithm Alg reading at most $s$ positions of $A$ and distinguishing YES and NO instances with probability $\ge 2/3$, where the probability is taken only over $\mu$. We show the following meta-theorem with a standard argument.

THEOREM 6.1. (Meta-theorem) For any such algorithm Alg, $s = \Omega(pq)$.

This Meta-Theorem follows from the following folklore fact:

FACT 6.2. Consider the following random process. Initialize a length-$r$ array $A$ to an array of $r$ zeros. With probability 1/2, choose a random position $i \in [r]$ and set $A[i] = 1$. With the remaining probability 1/2, leave $A$ as the all-zero array. Then any algorithm which determines if $A$ is the all-zero array with probability $\ge 2/3$ must read $\Omega(r)$ entries of $A$.

Let us prove Theorem 6.1 using this fact:

PROOF. Consider the matrix $B \in \mathbb{R}^{(p/2)\times q}$ which is defined by subtracting the "bottom" half of the matrix from the top half, that is, $B_{i,j} = A_{i,j} - A_{i+p/2,j}$. Then $B$ is the all-zeros matrix, except that with probability 1/2, there is one entry whose value is roughly two, and whose location is random and distributed uniformly. An algorithm distinguishing between YES and NO instances of $A$ in particular distinguishes between the two cases for $B$, which cannot be done without reading a linear number of entries.
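The construction and the reduction to $B$ can be sketched with small made-up parameters (`make_instance` is a hypothetical helper, not from the paper):

```python
import math
import random

def make_instance(p, q, W, rng, yes=True):
    """Build p rows in p/2 duplicated pairs: A[i] = a*e_j + b*(1 - e_j),
    with A[i + p/2] a copy; a NO instance negates the e_j coordinate of
    one copied row."""
    a = 1.0 - 1.0 / W
    b = math.sqrt((1.0 - a * a) / (q - 1))  # chosen so each row has unit norm
    A = [[0.0] * q for _ in range(p)]
    for i in range(p // 2):
        j = rng.randrange(q)
        row = [b] * q
        row[j] = a
        A[i] = row[:]
        A[i + p // 2] = row[:]
    if not yes:
        i_star = rng.randrange(p // 2)
        j_star = A[i_star].index(a)
        A[i_star + p // 2][j_star] = -a
    return A

rng = random.Random(0)
p, q, W = 6, 4, 16
A = make_instance(p, q, W, rng, yes=False)
# B = top half minus bottom half: all zeros except one entry of size ~2a
B = [[A[i][j] - A[i + p // 2][j] for j in range(q)] for i in range(p // 2)]
nonzero = [(i, j) for i in range(p // 2) for j in range(q) if abs(B[i][j]) > 1e-9]
```

For a NO instance exactly one entry of $B$ survives, with magnitude $2a$, so locating it is exactly the needle-finding problem of Fact 6.2.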

In the proofs of Theorem 6.3, Corollary 6.4, and Theorem 6.6, it will be more convenient to use $M$ as an upper bound on the number of non-zero entries of $A$ rather than the exact number of non-zero entries. However, it should be understood that these theorems (and corollary) hold even when $M$ is exactly the number of non-zero entries of $A$.

To see this, note that our random matrices $A$ constructed in the proofs have at most $M$ non-zero entries. If this number $M'$ is strictly less than $M$, we arbitrarily replace $M - M'$ zero entries with the value $(nd)^{-C}$ for a large enough constant $C > 0$. Under our assumptions on the margin or the minimum enclosing ball radius of the points, the solution value changes by at most a factor of $(1 \pm (nd)^{1-C})$, which does not affect the proofs.

6.1. Classification

Recall that the margin $\sigma(A)$ of an $n \times d$ matrix $A$ is given by $\max_{x\in B}\min_i A_i x$. Since we assume that $\|A_i\|_2 \le 1$ for all $i$, we have that $\sigma(A) \le 1$.

6.1.1. Relative Error. We start with a theorem for relative-error algorithms.

THEOREM 6.3. Let $\gamma > 0$ be a sufficiently small constant. Let $\varepsilon$ and $\sigma(A)$ satisfy $\sigma(A)^{-2}\varepsilon^{-1} \le \gamma\min(n,d)$ and $\sigma(A) \le 1 - \varepsilon$, with $\varepsilon$ also bounded above by a sufficiently small constant. Also assume that $M \ge 2(n+d)$, that $n \ge 2$, and that $d \ge 3$. Then any randomized algorithm which, with probability at least 2/3, outputs a number in the interval $[\sigma(A) - \varepsilon\sigma(A), \sigma(A)]$ must read
$$\Omega\big(\min(M, \sigma(A)^{-2}\varepsilon^{-1}(n+d))\big)$$
entries of $A$. This holds even if $\|A_i\|_2 = 1$ for all rows $A_i$.

Notice that this yields a stronger theorem than assuming that both $n$ and $d$ are sufficiently large, since one of these values may be constant.

PROOF. We divide the analysis into cases: the case in which $d$ or $n$ is constant, and the case in which each is sufficiently large. Let $\alpha \in [0, 1-\varepsilon]$ be a real number to be determined.

Case: $d$ or $n$ is a constant. By our assumption that $\sigma(A)^{-2}\varepsilon^{-1} \le \gamma\min(n,d)$, the values $\sigma(A)$ and $\varepsilon$ are constant, and sufficiently large. Therefore we just need to show an $\Omega(\min(M, n+d))$ bound on the number of entries read. By the premise of the theorem, $M = \Omega(n+d)$, so we can just show an $\Omega(n+d)$ bound.

An $\Omega(d)$ bound. We give a randomized construction of an $n \times d$ matrix $A$. The first row of $A$ is built as follows. Let $A_{1,1} \leftarrow \alpha$ and $A_{1,2} \leftarrow 0$. Pick $j^* \in \{3, 4, \ldots, d\}$ uniformly at random, and let $A_{1,j^*} \leftarrow \varepsilon^{1/2}$. For all remaining $j \in \{3, 4, \ldots, d\}$, assign $A_{1,j} \leftarrow \zeta$, where $\zeta \le 1/d^3$. (The role of $\zeta$ is to make an entry slightly non-zero, to prevent an algorithm which has access to exactly the non-zero entries from skipping over it.) Now using the conditions on $\varepsilon$, we have (with $A_{1,2}$ still zero)
$$X \equiv \|A_1\|^2 = \alpha^2 + (d-3)\zeta^2 + \varepsilon \le (1-\varepsilon)^2 + d\zeta^2 + \varepsilon = 1 - \varepsilon + \varepsilon^2 + d\zeta^2 \le 1,$$
and so by letting $A_{1,2} \leftarrow \sqrt{1 - X}$, we have $\|A_1\| = 1$.

Now we let $A_2 \leftarrow -A_1$, with two exceptions: we let $A_{2,1} \leftarrow A_{1,1} = \alpha$, and with probability 1/2, we negate $A_{2,j^*}$. Thus $\|A_2\| = 1$ also. For row $i$ with $i > 2$, put $A_{i,1} \leftarrow \alpha(1+\varepsilon)$, $A_{i,2} \leftarrow \sqrt{1 - A_{i,1}^2}$, and all remaining entries zero. We have the following picture.


$$\begin{pmatrix}
\alpha & (1-\alpha^2-(d-3)\zeta^2-\varepsilon)^{1/2} & \zeta & \cdots & \varepsilon^{1/2} & \cdots & \zeta \\
\alpha & -(1-\alpha^2-(d-3)\zeta^2-\varepsilon)^{1/2} & -\zeta & \cdots & \pm\varepsilon^{1/2} & \cdots & -\zeta \\
\alpha(1+\varepsilon) & (1-\alpha^2(1+\varepsilon)^2)^{1/2} & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\alpha(1+\varepsilon) & (1-\alpha^2(1+\varepsilon)^2)^{1/2} & 0 & \cdots & 0 & \cdots & 0
\end{pmatrix}$$

Observe that the number of non-zero entries of the resulting matrix is $2n + 2d - 4$, which satisfies the premise of the theorem. Moreover, all rows $A_i$ satisfy $\|A_i\| = 1$.

Notice that if $A_{1,j^*} = -A_{2,j^*}$, then the margin of $A$ is at most $\alpha$, which follows by observing that all but the first coordinates of $A_1$ and $A_2$ have opposite signs.

On the other hand, if $A_{1,j^*} = A_{2,j^*}$, consider the vector $y$ with $y_1 \leftarrow 1$, $y_{j^*} \leftarrow \sqrt{\varepsilon}$, and all other entries zero. Then for all $i$, $A_i y \ge \alpha(1+\varepsilon)$, and so the unit vector $x \leftarrow y/\|y\|$ has
$$A_i x \ge \frac{\alpha(1+\varepsilon)}{\sqrt{1+\varepsilon}} = \alpha(1+\varepsilon)^{1/2} = \alpha(1 + \Omega(\varepsilon)).$$
It follows that in this case the margin of $A$ is at least $\alpha(1+\Omega(\varepsilon))$. Setting $\alpha = \Theta(\sigma)$ and rescaling $\varepsilon$ by a constant factor, it follows that these two cases can be distinguished by an algorithm satisfying the premise of the theorem. By Fact 6.2, any algorithm distinguishing these two cases with probability $\ge 2/3$ must read $\Omega(d)$ entries of $A$.

An $\Omega(n)$ bound. We construct the $n \times d$ matrix $A$ as follows. All but the first two columns are 0. We set $A_{i,1} \leftarrow \alpha$ and $A_{i,2} \leftarrow \sqrt{1-\alpha^2}$ for all $i \in [n]$. Next, with probability 1/2, we pick a random row $i^*$, and negate $A_{i^*,2}$. We have the following picture.

$$\begin{pmatrix}
\alpha & \sqrt{1-\alpha^2} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\alpha & -\sqrt{1-\alpha^2} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\alpha & \sqrt{1-\alpha^2} & 0 & \cdots & 0
\end{pmatrix}$$

The number of non-zeros of the resulting matrix is $2n < M$. Depending on the sign of $A_{i^*,2}$, the margin of $A$ is either 1 or $\alpha$. Setting $\alpha = \Theta(\sigma)$, an algorithm satisfying the premise of the theorem can distinguish the two cases. By Fact 6.2, any algorithm distinguishing these two cases with probability $\ge 2/3$ must read $\Omega(n)$ entries of $A$.
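The collapse of the margin to $\alpha$ in the negated-row case follows from an averaging argument: for any unit $x$, the smaller of the two row products is at most their average $\alpha x_1 \le \alpha$. A numeric check of this with a made-up $\alpha$, scanning unit vectors in the plane:

```python
import math

alpha = 0.3
beta = math.sqrt(1 - alpha ** 2)
row_plus = [alpha, beta]    # an un-negated row
row_minus = [alpha, -beta]  # the negated row

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# for any unit x: min(row_plus . x, row_minus . x) <= average = alpha * x[0] <= alpha
best = -1.0
steps = 10000
for k in range(steps + 1):
    theta = 2 * math.pi * k / steps
    x = [math.cos(theta), math.sin(theta)]
    best = max(best, min(dot(row_plus, x), dot(row_minus, x)))
```

The scan attains its maximum $\alpha$ at $x = e_1$, confirming the two-case gap used in the reduction.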

Case: $d$ and $n$ are both sufficiently large. Suppose first that $M = \Omega(\sigma(A)^{-2}\varepsilon^{-1}(n+d))$ for a sufficiently large constant in the $\Omega(\cdot)$. Let $s$ be an even integer in $\Theta(\sigma(A)^{-2}\varepsilon^{-1})$ with $s < \min(n,d) - 1$. We will also choose a value $\alpha$ in $\Theta(\sigma(A))$. We can assume without loss of generality that $n$ and $d$ are sufficiently large, and even.

loss of generality that n and d are sufﬁciently large,and even.

An $\Omega(ns)$ bound. We set the $d$-th entry of each row of $A$ to the value $\alpha$. We set all entries in columns $s+1$ through $d-1$ to 0. We then choose the remaining entries of $A$ as follows. We apply Theorem 6.1 with parameters $p = n$, $q = s$, and $W = d^2$, obtaining an $n \times s$ matrix $B$, where $\|B_i\| = 1$ for all rows $B_i$. Put $B' \leftarrow B\sqrt{1-\alpha^2}$. We then set $A_{i,j} \leftarrow B'_{i,j}$ for all $i \in [n]$ and $j \in [s]$. We have the following block structure for $A$.

$$\begin{pmatrix} B\sqrt{1-\alpha^2} & \mathbf{0}^{n\times(d-s-1)} & \alpha\mathbf{1}^n \end{pmatrix}$$


Here $\mathbf{0}^{n\times(d-s-1)}$ is a matrix of all 0's, of the given dimensions. Notice that $\|A_i\| = 1$ for all rows $A_i$, and the number of non-zero entries is at most $n(s+1)$, which is less than the value $M$.

We claim that if $B$ is a YES instance, then the margin of $A$ is $\alpha(1 + \Omega(\varepsilon))$. Indeed, consider the unit vector $x$ for which
$$x_j = \begin{cases} \left(\dfrac{\varepsilon}{s} - \dfrac{\varepsilon^2}{4s}\right)^{1/2} & j \in [s] \\ 0 & j \in [s+1, d-1] \\ 1 - \varepsilon/2 & j = d \end{cases} \tag{23}$$

For any row $A_i$,
$$A_i x \ge \left(\frac{\varepsilon}{s} - \frac{\varepsilon^2}{4s}\right)^{1/2}\left(\sqrt{1-\alpha^2} - O\left(\frac{\sqrt{1-\alpha^2}}{d^2}\right)\right) + \left(1 - \frac{\varepsilon}{2}\right)\alpha$$
$$\ge \left(\frac{\varepsilon}{s} - \frac{\varepsilon^2}{4s}\right)^{1/2}(1-\alpha)\left(1 - O\left(\frac{1}{d^2}\right)\right) + \alpha - \frac{\alpha\varepsilon}{2} \qquad \text{since } \sqrt{1-\alpha^2} \ge 1-\alpha$$
$$\ge \left(\frac{\varepsilon}{s}\right)^{1/2}(1-\alpha) + \alpha - \frac{\alpha\varepsilon}{2} - O(\varepsilon^2\alpha^2) \qquad \text{since } \sqrt{\frac{\varepsilon}{s}}\cdot\frac{1}{d^2} = O(\varepsilon^2\alpha^2).$$
If we set $s = c\alpha^{-2}\varepsilon^{-1}$ for $c \in (0,4)$, then
$$A_i x \;\ge\; \alpha + \frac{\varepsilon\alpha(1-\alpha)}{\sqrt{c}} - \frac{\alpha\varepsilon}{2} - O(\varepsilon^2\alpha^2) \;=\; \alpha(1 + \Omega(\varepsilon)). \tag{24}$$

On the other hand, if $B$ is a NO instance, we claim that the margin of $A$ is at most $\gamma(1 + O(\varepsilon^2))$. By definition of a NO instance, there are rows $A_i$ and $A_j$ of $A$ which agree except on a single column $k$, for which $A_{i,k} = \sqrt{1-\gamma^2}\left(1 - O\!\left(\frac{1}{d^2}\right)\right)$ while $A_{j,k} = -A_{i,k}$.

It follows that the $x$ which maximizes $\min\{A_i x, A_j x\}$ has $x_k = 0$. But

$$\sum_{k' \neq k} A_{i,k'}^2 = 1 - (1-\gamma^2)\left(1 - O\!\left(\frac{1}{d^2}\right)\right) = \gamma^2 + O\!\left(\frac{1}{d^2}\right).$$

Since $\|x\| \leq 1$, by the Cauchy-Schwarz inequality

$$A_i x = A_j x \leq \left(\gamma^2 + O\!\left(\frac{1}{d^2}\right)\right)^{1/2} \leq \gamma + O(\varepsilon^2\gamma) = \gamma(1 + O(\varepsilon^2)), \qquad (25)$$

where the first inequality follows from our bound $\gamma^{-2}\varepsilon^{-1} = O(d)$.

Setting $\gamma = \Theta(\sigma(A))$ and rescaling $\varepsilon$ by a constant factor, an algorithm satisfying the premise of the theorem can distinguish the two cases, and so by Theorem 6.1, it must read $\Omega(ns) = \Omega(\sigma(A)^{-2}\varepsilon^{-1} n)$ entries of $A$.

An $\Omega(ds)$ bound. We first define rows $s+1$ through $n$ of our $n\times d$ input matrix $A$. For $i > s$, put $A_{i,d} \leftarrow \gamma(1+\varepsilon)$, $A_{i,d-1} \leftarrow (1 - \gamma^2(1+\varepsilon)^2)^{1/2}$, and all remaining entries zero.

We now define rows $1$ through $s$. Put $A_{i,d} \leftarrow \gamma$ for all $i \in [s]$. Now we apply Theorem 6.1 with $p = s$, $q = d-2$, and $W = d^2$, obtaining an $s\times(d-2)$ matrix $B$, where $\|B_i\| = 1$ for all rows $B_i$. Put $B' \leftarrow B\sqrt{1-\gamma^2}$, and set $A_{i,j} \leftarrow B'_{i,j}$ for all $i \in [s]$ and $j \in [d-2]$. We have the following block structure for $A$:

have the following block structure for A.

B

p

1

2

0

s

1

s

0

(ns)(d2)

1

ns

(1

2

(1 +")

2

)

1=2

1

ns

(1 +")

Notice that $\|A_i\| = 1$ for all rows $A_i$, and the number of non-zero entries is at most $2n + sd < M$.


If $B$ is a YES instance, let $x$ be as in Equation (23). Since the first $s$ rows of $A$ agree with those in our proof of the $\Omega(ns)$ bound, then as shown in Equation (24), $A_i x = \gamma(1 + \Omega(\varepsilon))$ for $i \in [s]$. Moreover, for $i > s$, since YES instances $B$ are entry-wise positive, we have

$$A_i x > \left(1 - \frac{\varepsilon}{2}\right)\gamma(1+\varepsilon) = \gamma(1 + \Omega(\varepsilon)).$$

Hence, if $B$ is a YES instance the margin is $\gamma(1 + \Omega(\varepsilon))$.

Now suppose $B$ is a NO instance. Then, as shown in Equation (25), for any $x$ for which $\|x\| \leq 1$, we have $A_i x \leq \gamma(1 + O(\varepsilon^2))$ for some $i \in [s]$. Hence, if $B$ is a NO instance, the margin is at most $\gamma(1 + O(\varepsilon^2))$.

Setting $\gamma = \Theta(\sigma(A))$ and rescaling $\varepsilon$ by a constant factor, an algorithm satisfying the premise of the theorem can distinguish the two cases, and so by Theorem 6.1, it must read $\Omega(ds) = \Omega(\sigma(A)^{-2}\varepsilon^{-1} d)$ entries of $A$.

Finally, if $M = O((n+d)\sigma(A)^{-2}\varepsilon^{-1})$, then we must show an $\Omega(M)$ bound. We will use our previous construction for showing an $\Omega(ns)$ bound, but replace the value of $n$ there with $n'$, where $n'$ is the largest integer for which $n's \leq M/2$. We claim that $n' \geq 1$. To see this, by the premise of the theorem $M \geq 2(n+d)$. Moreover, $s = \Theta(\sigma(A)^{-2}\varepsilon^{-1})$ and $\sigma(A)^{-2}\varepsilon^{-1} \leq \alpha(n+d)$. For a small enough constant $\alpha > 0$, $s \leq (n+d) \leq M/2$, as needed.

As the theorem statement concerns matrices with $n$ rows, each of unit norm, we must have an input $A$ with $n$ rows. To achieve this, we put $A_{i,d} = \gamma(1+\varepsilon)$ and $A_{i,d-1} = (1-\gamma^2(1+\varepsilon)^2)^{1/2}$ for all $i > n'$. In all remaining entries in rows $A_i$ with $i > n'$, we put the value $0$. This ensures that $\|A_i\| = 1$ for all $i > n'$, and it is easy to verify that this does not change the margin of $A$. Hence, the lower bound is $\Omega(n's) = \Omega(M)$. Notice that the number of non-zero entries is at most $2n + n's \leq 2M/3 + M/3 = M$, as needed. This completes the proof.

6.1.2. Additive Error. Here we give a lower bound for the additive error case. We give two different bounds, one when $\varepsilon < \sigma(A)$, and one when $\varepsilon \geq \sigma(A)$. Notice that $\sigma(A) \geq 0$ since we may take the solution $x = 0^d$. The following is a corollary of Theorem 6.3.

COROLLARY 6.4. Let $\alpha > 0$ be a sufficiently small constant. Let $\varepsilon, \sigma(A)$ be such that $\sigma(A)^{-1}\varepsilon^{-1} \leq \alpha\min(n,d)$ and $\varepsilon/\sigma(A) \leq \varepsilon_0$ for a sufficiently small constant $\varepsilon_0 > 0$. Also assume that $M \geq 2(n+d)$, $n \geq 2$, and $d \geq 3$. Then any randomized algorithm which, with probability at least $2/3$, outputs a number in the interval $[\sigma(A) - \varepsilon, \sigma(A)]$ must read

$$\Omega(\min(M, \sigma(A)^{-1}\varepsilon^{-1}(n+d)))$$

entries of $A$. This holds even if $\|A_i\| = 1$ for all rows $A_i$.

PROOF. We simply set the value of $\varepsilon$ in Theorem 6.3 to $\varepsilon/\sigma(A)$. Notice that $\varepsilon/\sigma(A)$ is at most a sufficiently small constant, and the value $\sigma(A)^{-2}\varepsilon^{-1}$ in Theorem 6.3 equals $\sigma(A)^{-1}\varepsilon^{-1}$, which is at most $\alpha\min(n,d)$ by the premise of the corollary, as needed to apply Theorem 6.3.

The following handles the case when $\varepsilon = \Omega(\sigma(A))$.

COROLLARY 6.5. Let $\alpha > 0$ be a sufficiently small constant. Let $\varepsilon, \sigma(A)$ be such that $\varepsilon^{-2} \leq \alpha\min(n,d)$, $\sigma(A) + \varepsilon < \frac{1}{\sqrt 2}$, and $\varepsilon = \Omega(\sigma(A))$. Also assume that $M \geq 2(n+d)$, $n \geq 2$, and $d \geq 3$. Then any randomized algorithm which, with probability at least $2/3$, outputs a number in the interval $[\sigma(A) - \varepsilon, \sigma(A)]$ must read

$$\Omega(\min(M, \varepsilon^{-2}(n+d)))$$

entries of $A$. This holds even if $\|A_i\| = 1$ for all rows $A_i$.

PROOF. The proof is very similar to that of Theorem 6.3, so we just outline the differences. In the case that $d$ or $n$ is constant, we have the following families of hard instances.

An $\Omega(n)$ bound for constant $d$:

$$\begin{pmatrix} \left(1 - \gamma^2(d-3) - 2(\varepsilon+\sigma)^2\right)^{1/2} & \sqrt 2\,(\varepsilon+\sigma) & \gamma & \cdots & \gamma\\ \pm\left(1 - \gamma^2(d-3) - 2(\varepsilon+\sigma)^2\right)^{1/2} & \sqrt 2\,(\varepsilon+\sigma) & \gamma & \cdots & \gamma\\ \sqrt 2\,(\varepsilon+\sigma) & \left(1 - 2(\varepsilon+\sigma)^2\right)^{1/2} & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & & \vdots\\ \sqrt 2\,(\varepsilon+\sigma) & \left(1 - 2(\varepsilon+\sigma)^2\right)^{1/2} & 0 & \cdots & 0 \end{pmatrix}$$

An $\Omega(d)$ bound for constant $n$:

$$\begin{pmatrix} \sqrt{1-\gamma^2} & 0 & \cdots & 0 & \gamma & 0 & \cdots & 0\\ 0 & \sqrt{1-\gamma^2} & \cdots & 0 & \gamma & 0 & \cdots & 0\\ \vdots & & \ddots & & \vdots & & & \vdots\\ 0 & 0 & \cdots & \sqrt{1-\gamma^2} & \pm\gamma & 0 & \cdots & 0 \end{pmatrix}$$

In these two cases, depending on the sign of the undetermined entry, the margin is either $\sigma$ or at least $\sigma + \varepsilon$ (in the $\Omega(d)$ bound, it is $\sigma$ or $1$, but we assume $\sigma + \varepsilon < \frac{1}{\sqrt 2}$). It follows that for $\sigma = \sigma(A)$, the algorithm of the corollary can distinguish these two cases, for which the lower bounds follow from the proof of Theorem 6.3.

For the case of $n$ and $d$ sufficiently large, we have the following families of hard instances. In each case, the matrix $B$ is obtained by invoking Theorem 6.1 with the value of $s = \Theta(\varepsilon^{-2})$.

An $\Omega(n\varepsilon^{-2})$ bound for $n, d$ sufficiently large:

$$\begin{pmatrix} B\sqrt{1-\sigma^2} & 0^{n\times(d-s-1)} & \sigma\,1^n \end{pmatrix}$$

An $\Omega(d\varepsilon^{-2})$ bound for $n, d$ sufficiently large:

$$\begin{pmatrix} B\sqrt{1-\sigma^2} & 0^{s\times 1} & \sigma\,1^s \\ 0^{(n-s)\times(d-2)} & (1 - (\sigma+\varepsilon)^2)^{1/2}\,1^{n-s} & (\sigma+\varepsilon)\,1^{n-s} \end{pmatrix}$$

In these two cases, by setting $W = \mathrm{poly}(nd)$ to be sufficiently large in Theorem 6.1, depending on whether $B$ is a YES or a NO instance the margin is either at most $\sigma + \frac{1}{\mathrm{poly}(nd)}$ or at least $\sigma + \sqrt{1-\sigma^2}\cdot 2\varepsilon$ (for an appropriate choice of $s$). For $\sigma < 1/\sqrt 2$, the algorithm of the corollary can distinguish these two cases, and therefore needs $\Omega(ns)$ time in the first case, and $\Omega(ds)$ time in the second.

The extension of the proofs to handle the case $M = o((n+d)\varepsilon^{-2})$ is identical to that given in the proof of Theorem 6.3.


6.2. Minimum Enclosing Ball

We start by proving the following lower bound for estimating the squared MEB radius to within an additive $\varepsilon$. In the next subsection we improve the $\Omega(\varepsilon^{-1}n)$ term in the lower bound to $\tilde\Omega(\varepsilon^{-2}n)$ for algorithms that either additionally output a coreset, or output a MEB center that is a convex combination of the input points. Our primal-dual algorithm has both of these properties, though satisfying one or the other would be enough to apply the lower bound, so those bounds apply to it. Together with the $\varepsilon^{-1}d$ bound given by the next theorem, these bounds establish its optimality.

THEOREM 6.6. Let $\beta > 0$ be a sufficiently small constant. Assume $\varepsilon^{-1} \leq \beta\min(n,d)$ and that $\varepsilon$ is less than a sufficiently small constant. Also assume that $M \geq 2(n+d)$ and that $n \geq 2$. Then any randomized algorithm which, with probability at least $2/3$, outputs a number in the interval

$$\left[\min_x \max_i \|x - A_i\|^2 - \varepsilon,\ \min_x \max_i \|x - A_i\|^2\right]$$

must read

$$\Omega(\min(M, \varepsilon^{-1}(n+d)))$$

entries of $A$. This holds even if $\|A_i\| = 1$ for all rows $A_i$.

PROOF. As with classification, we divide the analysis into cases: the case in which $d$ or $n$ is constant, and the case in which each is sufficiently large.

Case: $d$ or $n$ is a constant. By our assumption that $\varepsilon^{-1} \leq \beta\min(n,d)$, $\varepsilon$ is at least a constant. So we just need to show an $\Omega(\min(M, n+d))$ bound. By the premise of the theorem, $M \geq 2(n+d)$, so we need only show an $\Omega(n+d)$ bound.

An $\Omega(d)$ bound. We construct an $n\times d$ matrix $A$ as follows. For $i > 2$, each row $A_i$ is just the vector $e_1 = (1, 0, 0, \ldots, 0)$.

Let $A_{1,1} \leftarrow 0$, and initially assign the value $\gamma = 1/d$ to all remaining entries of $A_1$. Choose a random integer $j^* \in [2, d]$, and assign $A_{1,j^*} \leftarrow \sqrt{1 - (d-2)\gamma^2}$. Note that $\|A_1\| = 1$. Let $A_2 \leftarrow A_1$, and then with probability $1/2$, negate $A_{2,j^*}$.

Our matrix $A$ is as follows:

$$A = \begin{pmatrix} 0 & \gamma & \cdots & \sqrt{1-(d-2)\gamma^2} & \cdots & \gamma\\ 0 & \gamma & \cdots & \pm\sqrt{1-(d-2)\gamma^2} & \cdots & \gamma\\ 1 & 0 & \cdots & 0 & \cdots & 0\\ \vdots & \vdots & & \vdots & & \vdots\\ 1 & 0 & \cdots & 0 & \cdots & 0 \end{pmatrix}$$

Observe that $A$ has at most $2n + 2d \leq M$ non-zero entries, and all rows satisfy $\|A_i\| = 1$. If $A_{1,j^*} = -A_{2,j^*}$, then $A_1$ and $A_2$ form a diametral pair, and the MEB radius is $1 - O(1/d)$.

On the other hand, if $A_{1,j^*} = A_{2,j^*}$, then consider the ball center $x$ with $x_1 = x_{j^*} = 1/\sqrt 2$, and all other entries zero. Then for all $i > 2$, $\|x - A_i\|^2 = \frac12 + \left(1 - \frac{1}{\sqrt 2}\right)^2 = 2 - \sqrt 2$. On the other hand, for $i \in \{1, 2\}$, we have

$$\|x - A_i\|^2 \leq \frac12 + (d-2)\gamma^2 + \left(1 - \frac{1}{\sqrt 2}\right)^2 \leq 2 - \sqrt 2 + \frac1d.$$


It follows that for $\varepsilon$ satisfying the premise of the theorem, an algorithm satisfying the premise of the theorem can distinguish the two cases. By Fact 6.2, any algorithm distinguishing these two cases with probability $\geq 2/3$ must read $\Omega(d)$ entries of $A$.

An $\Omega(n)$ bound. We construct the $n\times d$ matrix $A$ as follows. Initially set all rows $A_i \leftarrow e_1 = (1, 0, 0, \ldots, 0)$. Then with probability $1/2$ choose a random $i^* \in [n]$, and negate $A_{i^*,1}$.

We have the following picture:

$$A = \begin{pmatrix} 1 & 0 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ \pm 1 & 0 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 1 & 0 & \cdots & 0 \end{pmatrix}$$

The number of non-zeros of the resulting matrix is $n \leq M$. In the case where there is an entry of $-1$, the MEB radius of $A$ is $1$, but otherwise the MEB radius is $0$. Hence, an algorithm satisfying the premise of the theorem can distinguish the two cases. By Fact 6.2, any algorithm distinguishing these two cases with probability $\geq 2/3$ must read $\Omega(n)$ entries of $A$.

Case: $d$ and $n$ are sufficiently large. Suppose first that $M = \Omega(\varepsilon^{-1}(n+d))$ for a sufficiently large constant in the $\Omega(\cdot)$. Put $s = \Theta(\varepsilon^{-1})$. We can assume without loss of generality that $n$, $d$, and $s$ are sufficiently large integers. We need the following simple claim.

CLAIM 6.7. Given an instance of the minimum enclosing ball problem in $T > t$ dimensions on a matrix with rows $\{a e_i + b\sum_{j\in[t]\setminus\{i\}} e_j\}_{i=1}^t$ for distinct standard unit vectors $e_i$ and $a > b \geq 0$, the solution $x = \sum_{i=1}^t \frac{a + (t-1)b}{t}\, e_i$ of cost $(a-b)^2(1 - 1/t)$ is optimal.

PROOF. We can subtract the point $b\,1_T$ from each of the points, and an optimal solution $y$ for the translated problem yields an optimal solution $y + b\,1_T$ for the original problem with the same cost. We can assume without loss of generality that $T = t$ and that $e_1, \ldots, e_t$ are the $t$ standard unit vectors in $\mathbb{R}^t$. Indeed, the value of each of the rows on each of the remaining coordinates is $0$. The cost of the point $y^* = \sum_{i=1}^t \frac{a-b}{t}\, e_i$ in the translated problem is

$$(a-b)^2\left(1 - \frac1t\right)^2 + (t-1)(a-b)^2/t^2 = (a-b)^2\left(1 - \frac1t\right).$$

On the other hand, for any point $y$, the cost with respect to row $i$ is $(a - b - y_i)^2 + \sum_{j\neq i} y_j^2$. By averaging and Cauchy-Schwarz, there is a row of cost at least

$$\frac1t\left[\sum_{i=1}^t (a - b - y_i)^2 + (t-1)\sum_{i=1}^t y_i^2\right] = \|y\|^2 + (a-b)^2 - \frac{2(a-b)\|y\|_1}{t} \geq \|y\|^2 + (a-b)^2 - \frac{2(a-b)\|y\|}{\sqrt t}.$$

Taking the derivative with respect to $\|y\|$, this is minimized when $\|y\| = \frac{a-b}{\sqrt t}$, for which the cost is at least $(a-b)^2(1 - 1/t)$.
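The optimal cost claimed in Claim 6.7 can be spot-checked numerically (a sketch; the values of $a$, $b$, $t$ below and the helper name `claim_cost` are arbitrary illustrative choices, not from the paper):

```python
def claim_cost(a, b, t):
    """Max squared distance from the candidate center of Claim 6.7 to the rows
    a*e_i + b*sum_{j != i} e_j; it should equal (a-b)^2 * (1 - 1/t)."""
    center = [(a + (t - 1) * b) / t] * t
    rows = [[a if j == i else b for j in range(t)] for i in range(t)]
    return max(sum((c - r) ** 2 for c, r in zip(center, row)) for row in rows)

a, b, t = 1.0, 0.25, 8
lhs = claim_cost(a, b, t)
rhs = (a - b) ** 2 * (1 - 1 / t)
print(abs(lhs - rhs) < 1e-12)  # True
```

By symmetry every row is equidistant from the candidate center, which is why the maximum matches the closed form exactly.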

An $\Omega(ns)$ bound. We set the first $s$ rows of $A$ to $e_1, \ldots, e_s$. We set all entries outside of the first $s$ columns of $A$ to $0$. We choose the remaining $n - s = \Theta(n)$ rows of $A$ by applying Theorem 6.1 with parameters $p = n-s$, $q = s$, and $W = d$. If $A$ is a YES instance, then by Claim 6.7, there is a solution with cost $(a-b)^2(1 - 1/s) = 1 - \Theta(1/s)$. On the other hand, if $A$ is a NO instance, then for a given $x$, either $\|A_j - x\|^2$ or $\|A_{p/2+j} - x\|^2$ is at least $a^2 = 1 - O(1/d)$. By setting $s = \Theta(\varepsilon^{-1})$ appropriately, these two cases differ by an additive $\varepsilon$, as needed.

An $\Omega(ds)$ bound. We choose $A$ by applying Theorem 6.1 with parameters $p = s$, $q = d$, and $W = d$. If $A$ is a YES instance, then by Claim 6.7, there is a solution of cost at most $(a-b)^2(1 - 1/s) = 1 - \Theta(1/s)$. On the other hand, if $A$ is a NO instance, then for a given $x$, either $\|A_j - x\|^2$ or $\|A_{p/2+j} - x\|^2$ is at least $a^2 = 1 - O(1/d)$. As before, setting $s = \Theta(\varepsilon^{-1})$ appropriately causes these cases to differ by an additive $\varepsilon$.

Finally, it remains to show an $\Omega(M)$ bound in case $M = O(\varepsilon^{-1}(n+d))$. We will use our previous construction for showing an $\Omega(ns)$ bound, but replace the value of $n$ there with $n'$, where $n'$ is the largest integer for which $n's \leq M/2$. We claim that $n' \geq 1$. To see this, by the premise of the theorem $M \geq 2(n+d)$. Moreover, $s = \Theta(\varepsilon^{-1})$ and $\varepsilon^{-1} \leq \beta\min(n,d)$. For a small enough constant $\beta > 0$, $s \leq (n+d) \leq M/2$, as needed.

As the theorem statement concerns matrices with $n$ rows, each of unit norm, we must have an input $A$ with $n$ rows. In this case, since the first row of $A$ is $e_1$, which has sparsity $1$, we can simply set all remaining rows to the value of $e_1$, without changing the MEB solution. Hence, the lower bound is $\Omega(n's) = \Omega(M)$. Notice that the number of non-zero entries is at most $n + n's \leq M/2 + M/2 = M$, as needed. This completes the proof.

6.3. An $\tilde\Omega(n\varepsilon^{-2})$ Bound for Minimum Enclosing Ball

6.3.1. Intuition. Before diving into the intricate lower bound of this section, we describe a simple construction which lies at its core. Consider two distributions over arrays of size $d$: the first distribution, $\mu$, is uniformly distributed over all strings with exactly $\frac{3d}{4}$ entries that are $1$, and $\frac{d}{4}$ entries that are $-1$. The second distribution, $\nu$, is uniformly distributed over all strings with exactly $\frac{3d}{4} - D$ entries that are $1$, and $\frac{d}{4} + D$ entries that are $-1$, for $D = \tilde O(\sqrt d)$.

Let $x \sim \mu$ with probability $\frac12$ and $x \sim \nu$ with probability $\frac12$. Consider the task of deciding from which distribution $x$ was sampled. In both cases, the distributions are over the sphere of radius $\sqrt d$, so the norm itself cannot be used to distinguish them. At the heart of our construction lies the following fact:

FACT 6.8. Any algorithm that decides with probability $\geq \frac34$ the distribution that $x$ was sampled from must read at least $\tilde\Omega(d)$ entries from $x$.

We prove a version of this fact in the next sections. But first, let us explain the use of this fact in the lower bound construction: we create an instance of MEB which contains either $n$ vectors similar to the first type, or alternatively $n-1$ vectors of the first type and an extra vector of the second type (with a small bias). To distinguish between the two types of instances, an algorithm has no choice but to check all $n$ vectors, and for each invest $\tilde\Omega(d)$ work as per the above fact. In our parameter setting, we will choose $d = \tilde\Theta(\varepsilon^{-2})$, attaining the lower bound of $\tilde\Omega(nd) = \tilde\Omega(n\varepsilon^{-2})$ in terms of time complexity.


To compute the difference in MEB center as $n \to \infty$, note that by symmetry in the first case the center will be of the form $(a, a, \ldots, a)$, where the value $a \in \mathbb{R}$ is chosen to minimize the maximal distance:

$$\arg\min_a \left\{\frac34(1-a)^2 + \frac14(-1-a)^2\right\} = \arg\min_a \{a^2 - a + 1\} = \frac12.$$

The second MEB center will be

$$\arg\min_a \left\{\left(\frac34 - \frac{D}{d}\right)(1-a)^2 + \left(\frac14 + \frac{D}{d}\right)(-1-a)^2\right\} = \arg\min_a \left\{a^2 - \left(1 - \frac{4D}{d}\right)a + 1\right\} = \frac12 - \frac{2D}{d}.$$

Hence, the difference in MEB centers is on the order of $\sqrt{d\cdot(D/d)^2} = O(D/\sqrt d) = \tilde O(1)$. However, the whole construction is scaled to fit in the unit ball, and hence the difference in MEB centers becomes roughly $\frac{1}{\sqrt d} \approx \varepsilon$. Hence for an $\varepsilon$ approximation the algorithm must distinguish between the two distributions, which in turn requires $\tilde\Omega(\varepsilon^{-2})$ work.
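Both minimizations above are quadratics in $a$, so the minimizer follows by setting the derivative to zero; a small numeric check (a sketch, with arbitrary illustrative values of $d$ and $D$):

```python
def best_a(p_plus):
    """Minimizer over a of p_plus*(1-a)^2 + (1-p_plus)*(-1-a)^2; setting the
    derivative to zero gives a = 2*p_plus - 1."""
    return 2 * p_plus - 1

d, D = 10_000, 25
print(best_a(3 / 4))  # 0.5, the first center above
# second center should be 1/2 - 2D/d, matching the displayed argmin
print(abs(best_a(3 / 4 - D / d) - (1 / 2 - 2 * D / d)) < 1e-12)  # True
```

With weight $p$ on the $+1$ entries, the objective expands to $a^2 - (2p-1)\cdot 2a/2 + 1$ up to constants, so the minimizer $2p-1$ recovers both displayed values.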

6.3.2. Probabilistic Lemmas. For a set $S$ of points in $\mathbb{R}^d$, let $\mathrm{MEB}(S)$ denote the smallest ball that contains $S$. Let $\mathrm{Radius}(S)$ be the radius of $\mathrm{MEB}(S)$, and $\mathrm{Center}(S)$ the unique center of $\mathrm{MEB}(S)$.

For our next lower bound, our bad instance will come from points on the hypercube $H_d = \{-\frac{1}{\sqrt d}, \frac{1}{\sqrt d}\}^d$.

Call a vertex of $H_d$ regular if it has $\frac{3d}{4}$ coordinates equal to $\frac{1}{\sqrt d}$ and $\frac{d}{4}$ coordinates equal to $-\frac{1}{\sqrt d}$. Call a vertex special if it has $\frac{3d}{4} - 12dD$ coordinates equal to $\frac{1}{\sqrt d}$ and $\frac{d}{4} + 12dD$ coordinates equal to $-\frac{1}{\sqrt d}$, where $D \equiv \frac{\ln n}{\sqrt d}$.

We will consider instances where all but one of the input rows $A_i$ are random regular points, and one row may or may not be a random special point. We will need some lemmas about these points.

LEMMA 6.9. Let $a$ denote a random regular point, $b$ a special point, and $c$ denote the point $1_d/(2\sqrt d) = (\frac{1}{2\sqrt d}, \frac{1}{2\sqrt d}, \ldots, \frac{1}{2\sqrt d})$. Then

$$\|a\|^2 = \|b\|^2 = 1 \qquad (26)$$
$$\|c\|^2 = a^\top c = \frac14 \qquad (27)$$
$$\|a - c\|^2 = \frac34 \qquad (28)$$
$$b^\top c = \mathrm{E}[a^\top b] = \frac14 - 12D \qquad (29)$$

PROOF. The norm claims are entirely straightforward, and we have

$$a^\top c = \frac{1}{2d}\cdot\frac{3d}{4} - \frac{1}{2d}\cdot\frac{d}{4} = \frac14.$$

Also (28) follows by

$$\|a - c\|^2 = \|a\|^2 + \|c\|^2 - 2a^\top c = 1 + \frac14 - 2\cdot\frac14 = \frac34.$$

For (29), we have

$$b^\top c = \frac{1}{2d}\left(\frac{3d}{4} - 12dD\right) - \frac{1}{2d}\left(\frac{d}{4} + 12dD\right) = \frac38 - 6D - \frac18 - 6D = \frac14 - 12D,$$


and by linearity of expectation,

$$\mathrm{E}[a^\top b] = d\cdot\frac1d\left[\frac34\left(\frac34 - 12D\right) + \frac14\left(\frac14 + 12D\right) - \frac34\left(\frac14 + 12D\right) - \frac14\left(\frac34 - 12D\right)\right] = \frac14 - 12D.$$

Next, we show that $a^\top b$ is concentrated around its expectation (29).
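The identities of Lemma 6.9 can be verified on explicit points (a sketch; the dimension $d = 64$ and the flip count $k$ are arbitrary stand-ins for the paper's $12dD$):

```python
import math

def make_point(d, num_pos):
    """A hypercube point in {-1/sqrt(d), +1/sqrt(d)}^d with num_pos positive coords."""
    s = 1 / math.sqrt(d)
    return [s] * num_pos + [-s] * (d - num_pos)

d = 64
k = 2                                   # illustrative stand-in for 12*d*D
a = make_point(d, 3 * d // 4)           # regular point
b = make_point(d, 3 * d // 4 - k)       # special point
c = [1 / (2 * math.sqrt(d))] * d

dot = lambda u, v: sum(x * y for x, y in zip(u, v))
print(abs(dot(a, a) - 1) < 1e-12)                 # ||a||^2 = 1      (26)
print(abs(dot(c, c) - 1 / 4) < 1e-12)             # ||c||^2 = 1/4    (27)
print(abs(dot(a, c) - 1 / 4) < 1e-12)             # a^T c  = 1/4     (27)
print(abs(dot(b, c) - (1 / 4 - k / d)) < 1e-12)   # b^T c = 1/4 - k/d, cf. (29)
```

With $k = 12dD$ the last line reproduces $b^\top c = \frac14 - 12D$ exactly.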

LEMMA 6.10. Let $a$ be a random regular point, and $b$ a special point. For $d \geq 8\ln^2 n$, $\Pr[a^\top b > \frac14 - 6D] \leq \frac{1}{n^3}$, and $\Pr[a^\top b < \frac14 - 18D] \leq \frac{1}{n^3}$.

PROOF. We will prove the first tail estimate, and then discuss the changes needed to prove the second estimate.

We apply the upper tail of the following enhanced form of Hoeffding's bound, which holds for random variables with bounded correlation.

FACT 6.11. (Theorem 3.4 of [Panconesi and Srinivasan 1997] with their value of $\lambda$ equal to $1$) Let $X_1, \ldots, X_d$ be given random variables with support $\{0,1\}$ and let $X = \sum_{j=1}^d X_j$. Let $\delta > 0$ be arbitrary. If there exist independent random variables $\hat X_1, \ldots, \hat X_d$ with $\hat X = \sum_{j=1}^d \hat X_j$ and $\mathrm{E}[X] \leq \mathrm{E}[\hat X]$ such that for all $J \subseteq [d]$,

$$\Pr\left[\bigwedge_{j\in J} X_j = 1\right] \leq \prod_{j\in J} \Pr\left[\hat X_j = 1\right],$$

then

$$\Pr[X > (1+\delta)\,\mathrm{E}[\hat X]] \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{\mathrm{E}[\hat X]}.$$

Define $X_j = \frac{d}{2}\left(a_j b_j + \frac1d\right)$. Since $a_j b_j \in \{-\frac1d, \frac1d\}$, the $X_j$ have support $\{0,1\}$. Let $\hat X_1, \ldots, \hat X_d$ be i.i.d. variables with support $\{0,1\}$ with $\mathrm{E}[\hat X_j] = \mathrm{E}[X_j]$ for all $j$.

We claim that for all $J \subseteq [d]$, $\Pr[\bigwedge_{j\in J} X_j = 1] \leq \prod_{j\in J}\Pr[\hat X_j = 1]$. By symmetry, it suffices to prove it for $J \in \{[1], [2], \ldots, [d]\}$. We prove it by induction. The base case $J = [1]$ follows since $\mathrm{E}[\hat X_j] = \mathrm{E}[X_j]$. To prove the inequality for $J = [\ell]$, $\ell \geq 2$, assume the inequality holds for $[\ell-1]$. Then,

the inequality holds for [`1].Then,

Pr[^

j2[`]

X

j

= 1] = Pr[^

j2[`1]

X

j

= 1] Pr[X

`

= 1 j ^

j2[`1]

X

j

= 1];

and by the inductive hypothesis,

Pr[^

j2[`1]

X

j

= 1]

Y

j2[`1]

Pr

h

^

X

j

= 1

i

;

so to complete the induction it is enough to show

$$\Pr\left[X_\ell = 1 \,\Big|\, \bigwedge_{j\in[\ell-1]} X_j = 1\right] \leq \Pr[X_\ell = 1]. \qquad (30)$$

Letting $\Delta(a,b)$ be the number of coordinates $j$ for which $a_j \neq b_j$, we have

$$\Pr[X_\ell = 1] = 1 - \frac{\mathrm{E}[\Delta(a,b)]}{d}.$$


If $\bigwedge_{j\in[\ell-1]} X_j = 1$ occurs, then on the first $\ell-1$ coordinates $a$ and $b$ have the same sign, and so

$$\Pr\left[X_\ell = 1 \,\Big|\, \bigwedge_{j\in[\ell-1]} X_j = 1\right] = 1 - \frac{\mathrm{E}[\Delta(a,b) \mid \bigwedge_{j\in[\ell-1]} X_j = 1]}{d - \ell + 1} = 1 - \frac{\mathrm{E}[\Delta(a,b)]}{d - \ell + 1},$$

which proves (30).

We will apply Fact 6.11 to bound $\Pr[a^\top b > r]$ for $r = \frac14 - 6D$. Since $X = \frac{d}{2}a^\top b + \frac{d}{2} = \frac{d}{2}(1 + a^\top b)$, we have

$$\frac{X - \mathrm{E}[X]}{\mathrm{E}[X]} = \frac{a^\top b - \mathrm{E}[a^\top b]}{1 + \mathrm{E}[a^\top b]},$$

where we have used that (29) implies $\mathrm{E}[X]$ is positive (for large enough $d$), so we can perform the division. So

where we have used that (29) implies E[X] is positive (for large enough d),so we can

performthe division.So

X E[X]

E[X]

r E[a

>

b]

1 +E[a

>

b]

=

a

>

b E[a

>

b]

1 +E[a

>

b]

r E[a

>

b]

1 +E[ a

>

b]

=

a

>

b r

1 +E[a

>

b]

;

and so

Pr[a

>

b > r] = Pr

X E[X]

E[X]

>

r E[a

>

b]

1 +E[a

>

b]

:

By Fact 6.11, for $\delta = \frac{r - \mathrm{E}[a^\top b]}{1 + \mathrm{E}[a^\top b]}$, we have for $\delta > 0$,

$$\Pr[a^\top b > r] \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{d(1 + \mathrm{E}[a^\top b])/2}.$$

By (29), $r - \mathrm{E}[a^\top b] = 6D$, and $1 \leq 1 + \mathrm{E}[a^\top b] \leq 2$, so $\delta \in [3D, 6D]$. It is well-known (see Theorem 4.3 of [Motwani and Raghavan 1995]) that for $0 < \delta < 2e - 1$, $\frac{e^\delta}{(1+\delta)^{1+\delta}} \leq e^{-\delta^2/4}$, and so

$$\left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{d(1 + \mathrm{E}[a^\top b])/2} \leq \exp\left(-\frac{\delta^2}{4}\cdot\frac{d(1 + \mathrm{E}[a^\top b])}{2}\right) = \exp(-\delta^2 d(1 + \mathrm{E}[a^\top b])/8).$$
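The cited Chernoff-type estimate $e^\delta/(1+\delta)^{1+\delta} \leq e^{-\delta^2/4}$ can be spot-checked on a grid (a sketch; we only test the sub-range $\delta \in (0, 4]$ of the cited interval, which is all that matters here since $\delta \leq 6D$ is tiny):

```python
import math

def chernoff_ok(delta):
    """Check e^delta / (1+delta)^(1+delta) <= e^(-delta^2 / 4) at one point."""
    lhs = math.exp(delta) / (1 + delta) ** (1 + delta)
    return lhs <= math.exp(-delta * delta / 4)

# grid over (0, 4]
print(all(chernoff_ok(i * 4 / 1000) for i in range(1, 1001)))  # True
```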

Since $\delta \geq 3D$ and $\mathrm{E}[a^\top b] > 0$, this is at most $\exp(-D^2 d) \leq \exp(-(\ln n)^2) \leq n^{-3}$, for large enough $n$, using the definition of $D$.

For the second tail estimate, we can apply the same argument to $-a$ and $b$, proving that $\Pr[-a^\top b > \tilde r] \leq 1/n^3$, where $\tilde r \equiv -1/4 + 18D$. We let $X_j$ be the $\{0,1\}$ variables $\frac{d}{2}\left(-a_j b_j + \frac1d\right)$, with expected sum $\mathrm{E}[X] = \frac{3d}{8} + 6dD$. As above, $\Pr[-a^\top b > \tilde r] = \Pr\left[\frac{X - \mathrm{E}[X]}{\mathrm{E}[X]} > \delta\right]$, where $\delta = \frac{\tilde r - \mathrm{E}[-a^\top b]}{1 + \mathrm{E}[-a^\top b]}$. Now $\delta\sqrt d$ is between $6\ln n$ and $8\ln n$, so the same relations apply as above, and the second tail estimate follows.

Note that since by (28) all regular points are at distance $\frac{\sqrt 3}{2}$ from $c$, that distance is an upper bound on the MEB radius of a collection of regular points.

The next lemmas give more properties of MEBs involving regular and special points, under the assumption that the above concentration bounds on $a^\top b$ hold for a given special point $b$ and all $a$ in a collection of regular points.

That is, let $S$ be a collection of random regular points. Let $\mathcal{E}$ be the event that for all $a \in S$, $\frac14 - 18D \leq a^\top b \leq \frac14 - 6D$. By Lemma 6.10 and a union bound,

$$\Pr[\mathcal{E}] \geq 1 - \frac{2}{n^2},$$

when $S$ has at most $n$ points.


The condition of event $\mathcal{E}$ applies not only to every point in $S$, but to every point in the convex hull $\mathrm{conv}\, S$.

LEMMA 6.12. For special point $b$ and collection $S$ of points $a$, if event $\mathcal{E}$ holds, then for every $a_S \in \mathrm{conv}\, S$, $\frac14 - 18D \leq a_S^\top b \leq \frac14 - 6D$.

PROOF. Since $a_S \in \mathrm{conv}\, S$, we have $a_S = \sum_{a\in S} p_a a$ for some values $p_a$ with $\sum_{a\in S} p_a = 1$ and $p_a \geq 0$ for all $a \in S$. Therefore, assuming $\mathcal{E}$ holds,

$$a_S^\top b = \left[\sum_{a\in S} p_a a\right]^\top b = \sum_{a\in S} p_a\, a^\top b \leq \sum_{a\in S} p_a\left(\frac14 - 6D\right) = \frac14 - 6D,$$

and similarly $a_S^\top b \geq \frac14 - 18D$.

LEMMA 6.13. Suppose $b$ is a special point and $S$ is a collection of regular points such that event $\mathcal{E}$ holds. Then for any $a_S \in \mathrm{conv}\, S$, $\|a_S - b\| \leq \sqrt{3/2} + 6\sqrt D$. Since $\mathrm{Center}(S) \in \mathrm{conv}\, S$, this bound applies to $\|\mathrm{Center}(S) - b\|$ as well.

PROOF. Let $H$ be the hyperplane normal to $c = 1_d/(2\sqrt d)$ and containing $c$. Then $S \subseteq H$, and so $\mathrm{conv}\, S \subseteq H$, and since the minimum norm point in $H$ is $c$, all points $a_S \in \mathrm{conv}\, S$ have $\|a_S\|^2 \geq \|c\|^2 = 1/4$; as a convex combination of unit vectors, $a_S$ also satisfies $\|a_S\| \leq 1$. By the assumption that event $\mathcal{E}$ holds, and the previous lemma, we have $a_S^\top b \geq \frac14 - 18D$. Using this fact, $\|b\| = 1$, and $\|a_S\|^2 \leq 1$, we have

$$\|a_S - b\|^2 = \|a_S\|^2 + \|b\|^2 - 2a_S^\top b \leq 1 + 1 - 2\left(\frac14 - 18D\right) = \frac32 + 36D,$$

and so $\|a_S - b\| \leq \sqrt{3/2} + 6\sqrt D$ provided $D$ is smaller than a small constant.

LEMMA 6.14. Suppose $a$ is a regular point, $b$ is a special point, and $a^\top b \geq \frac14 - 18D$. Then there is a point $q \in \mathbb{R}^d$ for which $\|q - b\| = \frac{\sqrt 3}{2}$ and $\|q - a\| \leq \frac{\sqrt 3}{2} + \Theta(D^2)$, as $D \to 0$.

PROOF. As usual let $c \equiv 1_d/(2\sqrt d)$ and consider the point $q$ at distance $\frac{\sqrt 3}{2}$ from $b$ on the line segment $\overline{cb}$, so

$$q = c + \alpha(b - c),$$

where $\alpha \equiv 1 - \frac{\sqrt 3/2}{\|b - c\|}$ is a value in $\Theta(D)$. From the definition of $q$,

$$\|q - a\|^2 = \|q\|^2 + \|a\|^2 - 2a^\top q = \|c\|^2 + 2\alpha\, b^\top c - 2\alpha\|c\|^2 + \alpha^2\|b - c\|^2 + \|a\|^2 - 2a^\top c - 2\alpha\, a^\top b + 2\alpha\, a^\top c.$$

Recalling from (26) that $\|a\| = 1$, from (27) that $a^\top c = \|c\|^2 = \frac14$, from (29) that $b^\top c = \frac14 - 12D$, and the assumption $a^\top b \geq \frac14 - 18D$, we have

$$\|q - a\|^2 \leq \frac14 + 2\alpha\left(\frac14 - 12D\right) - 2\alpha\cdot\frac14 + \alpha^2\|b - c\|^2 + 1 - 2\cdot\frac14 - 2\alpha\left(\frac14 - 18D\right) + 2\alpha\cdot\frac14 = \frac34 + 12\alpha D + \alpha^2\|b - c\|^2 \leq \frac34 + \Theta(D^2),$$

where the last inequality uses $\alpha = \Theta(D)$ and $\|b - c\| = \Theta(1)$.


6.3.3. Main Theorem. Given an $n\times d$ matrix $A$ together with the norms $\|A_i\|$ for all rows $A_i$, as well as the promise that all $\|A_i\| = O(1)$, the $\varepsilon$-MEB-Coreset problem is to output a subset $S$ of $\tilde O(\varepsilon^{-1})$ rows of $A$ for which $A_i \in (1+\varepsilon)\,\mathrm{MEB}(S)$. Our main theorem in this section is the following.

THEOREM 6.15. If $n \geq \varepsilon^{-1} d$ and $d = \tilde\Theta(\varepsilon^{-2})$, then any randomized algorithm which with probability $\geq 4/5$ solves $\varepsilon$-MEB-Coreset must read $\tilde\Omega(n\varepsilon^{-2})$ entries of $A$ for some choice of its random coins.

We also define the following problem. Given an $n\times d$ matrix $A$ together with the norms $\|A_i\|$ for all rows $A_i$, as well as the promise that all $\|A_i\| = O(1)$, the $\varepsilon$-MEB-Center problem is to output a vector $x \in \mathbb{R}^d$ for which $\|A_i - x\| \leq (1+\varepsilon)\min_{y\in\mathbb{R}^d}\max_{i\in[n]}\|y - A_i\|$ for all $i \in [n]$. We also show the following.

THEOREM 6.16. If $n \geq \varepsilon^{-2} d$ and $d = \tilde\Theta(\varepsilon^{-2})$, then any randomized algorithm which with probability $\geq 4/5$ solves $\varepsilon$-MEB-Center by outputting a convex combination of the rows $A_i$ must read $\tilde\Omega(n\varepsilon^{-2})$ entries of $A$ for some choice of its random coins.

These theorems will follow from the same hardness construction, which we now describe. Put $d = 8\varepsilon^{-2}\ln^2 n$, which we assume is a sufficiently large power of $2$. We also assume $n$ is even. We construct two families $\mathcal{F}$ and $\mathcal{G}$ of $n\times d$ matrices $A$.

The family $\mathcal{F}$ consists of all $A$ for which each of the $n$ rows in $A$ is a regular point. The family $\mathcal{G}$ consists of all $A$ for which exactly $n-1$ rows of $A$ are regular points, and one row of $A$ is a special point.

(Recall that we say that a vertex of $H_d$ is regular if it has exactly $\frac{3d}{4}$ coordinates equal to $\frac{1}{\sqrt d}$. We say a point on $H_d$ is special if it has exactly $\frac{3d}{4} - 12dD$ coordinates equal to $\frac{1}{\sqrt d}$, where $D$ is $\frac{\ln n}{\sqrt d}$.)

Let $\mathcal{D}$ be the distribution on $n\times d$ matrices for which half of its mass is uniformly distributed on matrices in $\mathcal{F}$, while the remaining half is uniformly distributed on the matrices in $\mathcal{G}$. Let $A \sim \mathcal{D}$. We show that any randomized algorithm Alg which decides whether $A \in \mathcal{F}$ or $A \in \mathcal{G}$ with probability at least $3/4$ must read $\tilde\Omega(nd)$ entries of $A$ for some choice of its random coins. W.l.o.g., we may assume that Alg is deterministic, since we may average out its random coins, fixing the coin tosses that lead to the largest success probability (over the choice of $A$). By symmetry and independence of the rows, we can assume that in each row, Alg queries entries in order; that is, if Alg makes $s$ queries to a row $A_i$, we can assume it queries $A_{i,1}, A_{i,2}, \ldots, A_{i,s}$, in that order.

Let $r = d/(C\ln^2 n)$ for a sufficiently large constant $C > 0$. For a vector $u \in \mathbb{R}^d$, let $\mathrm{pref}(u)$ denote its first $r$ coordinates. Let $\mu$ be the distribution of $\mathrm{pref}(u)$ for a random regular point $u$. Let $\mu'$ be the distribution of $\mathrm{pref}(u)$ for a random special point $u$.

LEMMA 6.17. (Statistical Difference Lemma) For $C > 0$ a sufficiently large constant,

$$\|\mu - \mu'\|_1 \leq \frac{1}{10}.$$

PROOF. We will apply the following fact twice, once to $\mu$ and once to $\mu'$.

FACT 6.18. (special case of Theorem 4 of [Diaconis and Freedman 1980]) Suppose an urn $U$ contains $d$ balls, each marked by one of two colors. Let $H^U_r$ be the distribution of $r$ draws made at random without replacement from $U$, and $M^U_r$ be the distribution of $r$ draws made at random with replacement. Then,

$$\|H^U_r - M^U_r\|_1 \leq \frac{4r}{d}.$$


Let $\nu$ be the distribution with support $\{-\frac{1}{\sqrt d}, \frac{1}{\sqrt d}\}$ with $\nu(\frac{1}{\sqrt d}) = \frac34$ and $\nu(-\frac{1}{\sqrt d}) = \frac14$. Let $\nu'$ be the distribution with support $\{-\frac{1}{\sqrt d}, \frac{1}{\sqrt d}\}$ with $\nu'(\frac{1}{\sqrt d}) = \frac34 - 12D$ and $\nu'(-\frac{1}{\sqrt d}) = \frac14 + 12D$.

Let $\nu_r$ be the joint distribution of $r$ independent samples from $\nu$, and similarly define $\nu'_r$. Applying Fact 6.18 with $r = 1/(100D^2)$,

$$\|\mu - \nu_r\|_1 \leq \frac{1}{25dD^2},$$

and

$$\|\mu' - \nu'_r\|_1 \leq \frac{1}{25dD^2}.$$
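Fact 6.18 can be checked exactly for small urns by comparing the hypergeometric and binomial laws of the number of marked balls drawn, since the count is a sufficient statistic for both draw processes (a sketch; the urn sizes below are arbitrary illustrative choices):

```python
from math import comb

def tv_hypergeom_binomial(d, m, r):
    """Total variation distance between the number of marked balls in r draws
    without replacement (hypergeometric) and with replacement (binomial) from
    an urn of d balls, m of them marked. Fact 6.18 bounds this by 4r/d."""
    p = m / d
    total = 0.0
    for k in range(r + 1):
        hyper = comb(m, k) * comb(d - m, r - k) / comb(d, r)
        binom = comb(r, k) * p**k * (1 - p) ** (r - k)
        total += abs(hyper - binom)
    return total / 2

d, m, r = 400, 300, 10   # 3/4 of the balls marked, as in the regular points
print(tv_hypergeom_binomial(d, m, r) <= 4 * r / d)  # True
```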

By the triangle inequality,

$$\|\mu - \mu'\|_1 \leq \|\mu - \nu_r\|_1 + \|\nu_r - \nu'_r\|_1 + \|\nu'_r - \mu'\|_1 \leq \|\nu_r - \nu'_r\|_1 + \frac{2}{25dD^2},$$

and so it remains to bound $\|\nu_r - \nu'_r\|_1$. To do this, we use Stein's Lemma (see, e.g., [Cover and Thomas 1991], Section 12.8), which shows that for two coins with bias in $[\Omega(1), 1-\Omega(1)]$, one needs $\Omega(z^{-2})$ independent coin tosses to distinguish the distributions with constant probability, where $z$ is the difference in their expectations. Here, $z = 12D$, and so for constant $C > 0$ sufficiently large, for $r = 1/(CD^2)$, it follows that $\|\nu_r - \nu'_r\|_1 \leq \frac{1}{20}$.

We thus have

$$\|\mu - \mu'\|_1 \leq \frac{1}{20} + \frac{2}{25dD^2} \leq \frac{1}{10},$$

where the last inequality uses $dD^2 = (\ln n)^2 \to \infty$.

We use Lemma 6.17 to prove the following. We assume that Alg outputs $1$ if it decides that $A \in \mathcal{F}$, and otherwise it outputs $0$.

THEOREM 6.19. If Alg queries $o(nr)$ entries of $A$, it cannot decide if $A \in \mathcal{F}$ with probability at least $3/4$.

PROOF. We can think of $A$ as being generated according to the following random process.

(1) Choose an index $i^* \in [n]$ uniformly at random.
(2) Choose rows $A_j$ for $j \in [n]$ to be random independent regular points.
(3) With probability $1/2$, do nothing. Otherwise, with the remaining probability $1/2$, replace $A_{i^*}$ with a random special point.
(4) Output $A$.

Define the advantage $\mathrm{adv}(\mathrm{Alg})$ to be:

$$\mathrm{adv}(\mathrm{Alg}) \equiv \left|\Pr_{A\in_R \mathcal{G}}[\mathrm{Alg}(A) = 1] - \Pr_{A\in_R \mathcal{F}}[\mathrm{Alg}(A) = 1]\right|.$$

To prove the theorem, it suffices to show $\mathrm{adv}(\mathrm{Alg}) < 1/4$. Let $A_{-i^*}$ denote the rows of $A$, excluding row $i^*$, generated in step 2. By the description of the random process above, we have

$$\mathrm{adv}(\mathrm{Alg}) = \left|\mathrm{E}_{i^*, A_{-i^*}}\left[\Pr_{\text{special } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid i^*, A_{-i^*}] - \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid i^*, A_{-i^*}]\right]\right|.$$

To analyze this quantity, we first condition on a certain event $E(i^*, A_{-i^*})$ holding, which will occur with probability $1 - o(1)$, and allows us to discard the pairs $(i, A_{-i})$ that do not satisfy the condition of the event. Intuitively, the event is just that for most regular $A_{i^*}$, algorithm Alg does not read more than $r$ entries in $A_{i^*}$. This holds with probability $1 - o(1)$, over the choice of $i^*$ and $A_{-i^*}$, because all $n$ rows of $A$ are i.i.d., and so on average Alg can only afford to read $o(r)$ entries in each row.

More formally, we say a pair $(i, A_{-i})$ is good if

$$\Pr_{\text{regular } A_i}[\text{Alg makes at most } r \text{ queries to } A_i \mid (i, A_{-i}) = (i^*, A_{-i^*})] = 1 - o(1).$$

Let $E(i^*, A_{-i^*})$ be the event that $(i^*, A_{-i^*})$ is good. Then $\Pr_{i^*, A_{-i^*}}[E(i^*, A_{-i^*})] = 1 - o(1)$, and we can upper bound the advantage by

we can upper bound the advantage by

E

i

;

A

i

Pr

special A

i

[Alg(A) = 1 j E(i

;

A

i

);i

;

A

i

] Pr

regular A

i

[Alg(A) = 1 j E(i

;

A

i

);i

;

A

i

]

+o(1):

Consider the algorithm $\mathrm{Alg}'_{i^*}$, which on input $A$ makes the same sequence of queries to $A$ as Alg unless it must query more than $r$ positions of $A_{i^*}$. In this case, it outputs an arbitrary value in $\{0,1\}$; otherwise it outputs $\mathrm{Alg}(A)$.

CLAIM 6.20.

$$\left|\Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid E(i^*, A_{-i^*}), i^*, A_{-i^*}] - \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}]\right| = o(1).$$

PROOF. Since $E(i^*, A_{-i^*})$ occurs,

$$\Pr_{\text{regular } A_{i^*}}[\text{Alg makes at most } r \text{ queries to } A_{i^*} \mid i^*, A_{-i^*}] = 1 - o(1).$$

This implies that

$$\left|\Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}(A) = 1 \mid E(i^*, A_{-i^*}), i^*, A_{-i^*}] - \Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}]\right| = o(1).$$

By Lemma 6.17, we have that

$$\left|\Pr_{\text{regular } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}] - \Pr_{\text{special } A_{i^*}}[\mathrm{Alg}'_{i^*}(A) = 1 \mid i^*, A_{-i^*}]\right| \leq \frac{1}{10}.$$

Hence, by Claim 6.20 and the triangle inequality, we have that

| Pr_{regular A_{i*}}[Alg(A) = 1 | E(i*, A_{i*}), i*, A_{i*}] − Pr_{special A_{i*}}[Alg'_{i*}(A) = 1 | i*, A_{i*}] | ≤ 1/10 + o(1).

To finish the proof, it suffices to show the following claim.

CLAIM 6.21.

| Pr_{special A_{i*}}[Alg(A) = 1 | E(i*, A_{i*}), i*, A_{i*}] − Pr_{special A_{i*}}[Alg'_{i*}(A) = 1 | i*, A_{i*}] | ≤ 1/10 + o(1).

Indeed, if we show Claim 6.21, then by the triangle inequality we will have that adv(Alg) ≤ 1/5 + o(1) < 1/4.

Proof of Claim 6.21: Since E(i*, A_{i*}) occurs,

Pr_{regular A_{i*}}[Alg makes at most r queries to A_{i*} | i*, A_{i*}] = 1 − o(1).


Since μ is the distribution of prefixes of regular points, this condition can be rewritten as

Pr_{u∼μ}[Alg makes at most r queries to the i*-th row | i*, A_{i*}, pref(A_{i*}) = u] = 1 − o(1).

By Lemma 6.17, we thus have

Pr_{u∼μ′}[Alg makes at most r queries to the i*-th row | i*, A_{i*}, pref(A_{i*}) = u] ≥ 9/10 − o(1).

Since μ′ is the distribution of prefixes of special points, this condition can be rewritten as

Pr_{special A_{i*}}[Alg makes at most r queries to A_{i*} | i*, A_{i*}] ≥ 9/10 − o(1).

This implies that

| Pr_{special A_{i*}}[Alg(A) = 1 | E(i*, A_{i*}), i*, A_{i*}] − Pr_{special A_{i*}}[Alg'_{i*}(A) = 1 | i*, A_{i*}] | ≤ 1/10 + o(1).

This completes the proof of the theorem.

6.3.4. Proofs of Theorems 6.15 and 6.16. Next we show how Theorem 6.19 implies Theorem 6.15 and Theorem 6.16, using the results on MEBs of regular and special points.

Proof of Theorem 6.15: We set the dimension d = 4 · 36 · ε^{−2} ln²(n − 1). Let A′ denote the set of regular rows of A. We condition on event E, namely, that every convex combination p^⊤A, where p ∈ Δ_{n−1}, satisfies (p^⊤A′)b ≤ 1/4 + 6D. This event occurs with probability at least 1 − 2n^{−2}. (We may neglect the difference between n and n − 1 in some expressions.)

It follows by Lemma 6.13 that if A ∈ G, then for every S ⊆ A′,

‖Center(S) − b‖ ≥ √3/2 + 2ε.

By (28), Radius(A′) ≤ √3/2. It follows that any algorithm that, with probability at least 4/5, outputs a subset S of Õ(ε^{−1}) rows of A for which A_i ∈ (1 + ε)MEB(S) must include the point b in S.

Given such an algorithm, by reading each of the Õ(ε^{−1}) rows output, we can determine if A ∈ F or A ∈ G with an additional Õ(ε^{−1}d) time. By Theorem 6.19, the total time must be Ω̃(nε^{−2}). By assumption, n ≥ ε^{−1}d, and so any randomized algorithm that solves ε-MEB-Coreset with probability at least 4/5 can decide if A ∈ F with probability at least 4/5 − 2n^{−2} ≥ 3/4, and so it must read Ω̃(nε^{−2}) entries for some choice of its random coins.

Proof of Theorem 6.16: We again set the dimension d = 4 · 36 · ε^{−2} ln²(n − 1). Let A′ denote the set of regular rows of A. We again condition on the event E.

By Lemma 6.13, if A ∈ G, then for every convex combination p^⊤A′,

‖p^⊤A′ − b‖ ≥ √3/2 + 2ε,

and so the MEB radius returned by any algorithm that outputs a convex combination of rows of A′ must be at least √3/2 + 2ε.


However, by (28), if A ∈ F, then Radius(A) ≤ √3/2. On the other hand, by Lemma 6.14, if A ∈ G, then MEB-radius(A) ≤ √3/2 + O(ε²).

It follows that if A ∈ G, then the convex combination p^⊤A output by the algorithm must have a non-zero coefficient multiplying the special point b. This, in particular, implies that p^⊤A is not on the affine hyperplane H with normal vector 1_d containing the point c = 1_d/(2√d). However, if A ∈ F, then any convex combination of the points is on H. The output p^⊤A of the algorithm is on H if and only if p^⊤A · 1_d = √d/2, which can be tested in O(d) time.

By Theorem 6.19, the total time must be Ω̃(nε^{−2}). By assumption, n ≥ ε^{−2}d, and so any randomized algorithm that solves ε-MEB-Center with probability 4/5 by outputting a convex combination of rows can decide if A ∈ F with probability at least 4/5 − 2n^{−2} ≥ 3/4, and so must read Ω̃(nε^{−2}) entries for some choice of its random coins.

6.4. Las Vegas Algorithms

While our algorithms are Monte Carlo, meaning they err with small probability, it may be desirable to obtain Las Vegas algorithms, i.e., randomized algorithms that have low expected running time but never err. We show this cannot be done in sublinear time.

THEOREM 6.22. For the classification and minimum enclosing ball problems, there is no Las Vegas algorithm that reads an expected o(M) entries of its input matrix and solves the problem to within a one-sided additive error of at most 1/2. This holds even if ‖A_i‖ = 1 for all rows A_i.

PROOF. Suppose first that n ≥ M. Consider n × d matrices A, B¹, …, B^M, where for each C ∈ {A, B¹, …, B^M}, C_{i,j} = 0 if either j > 1 or i > M. Also, A_{i,1} = 1 for i ∈ [M], while for each j, B^j_{i,1} = 1 for i ∈ [M] \ {j}, while B^j_{j,1} = −1. With probability 1/2 the matrix A is chosen; otherwise a matrix B^j is chosen for a random j. Notice that, whichever case we are in, each of the first M rows of the input matrix has norm equal to 1, while all remaining rows have norm 0. It is easy to see that distinguishing these two cases with probability 2/3 requires reading Ω(M) entries. As Ω(M) is a lower bound for Monte Carlo algorithms, it is also a lower bound for Las Vegas algorithms. Moreover, distinguishing these two cases is necessary: if the problem is classification, then if C = A the margin is 1 and otherwise it is 0, while if the problem is minimum enclosing ball, then if C = A the cost is 0 and otherwise it is 1.

We now assume M > n. Let d′ be the largest integer for which nd′ < M. Here d′ ≥ 1. Let A be the n × d′ matrix where A_{i,j} = 1/√d′ for all i and j. The margin of A is 1, and the minimum enclosing ball has radius 0.

Suppose there were an algorithm Alg such that, on input A, there is an assignment to Alg's random tape r for which Alg reads at most nd′/4 of its entries. If there were no such r, the expected running time of Alg would already be Ω(nd′) = Ω(M). Let A_ℓ be a row of A for which Alg reads at most d′/4 entries of A_ℓ given random tape r, and let S ⊆ [d′] be the set of indices of A_ℓ read, where |S| ≤ d′/4. Consider the n × d′ matrix B for which B_{i,j} = A_{i,j} for all i ≠ ℓ, while B_{ℓ,j} = A_{ℓ,j} for all j ∈ S and B_{ℓ,j} = −A_{ℓ,j} for all j ∈ [d′] \ S. Notice that all rows of A and B have norm 1.

To bound the margin of B, consider any vector x of norm at most 1. Then

(A_ℓ + B_ℓ)x ≤ ‖x‖ · ‖A_ℓ + B_ℓ‖ ≤ ‖A_ℓ + B_ℓ‖.


A_ℓ + B_ℓ has at least 3d′/4 entries that are 0, while the non-zero entries all have value 2/√d′. Hence, ‖A_ℓ + B_ℓ‖² ≤ (d′/4) · (4/d′) = 1. It follows that either A_ℓ x or B_ℓ x is at most 1/2, which bounds the margin of B. As Alg cannot distinguish A and B given random tape r, it cannot have one-sided additive error at most 1/2.

For minimum enclosing ball, notice that (1/4)‖A_ℓ − B_ℓ‖² ≥ (1/4) · (3d′/4) · (4/d′) = 3/4, which lower bounds the cost of the minimum enclosing ball of B. As Alg cannot distinguish A and B given random tape r, it cannot have one-sided additive error at most 3/4.
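The two norm bounds this argument rests on can be checked concretely. The following is our own Python illustration (not from the paper) of the M > n construction: row A_ℓ has all entries 1/√d′, and B_ℓ agrees with A_ℓ on the read set S (|S| ≤ d′/4) but has flipped signs elsewhere.

```python
import math

# Illustration of the hard instance in the proof of Theorem 6.22 (M > n case).
d_prime = 16
S = set(range(d_prime // 4))                      # the <= d'/4 entries Alg read
a_row = [1.0 / math.sqrt(d_prime)] * d_prime
b_row = [v if j in S else -v for j, v in enumerate(a_row)]

norm_a = sum(v * v for v in a_row)                # both rows have unit norm
norm_b = sum(v * v for v in b_row)
sum_sq = sum((x + y) ** 2 for x, y in zip(a_row, b_row))
diff_sq = sum((x - y) ** 2 for x, y in zip(a_row, b_row))

print(abs(norm_a - 1) < 1e-9 and abs(norm_b - 1) < 1e-9)
print(sum_sq <= 1 + 1e-9)           # ||A_ell + B_ell||^2 <= 1, so margin(B) <= 1/2
print(diff_sq / 4 >= 0.75 - 1e-9)   # ||A_ell - B_ell||^2 / 4 >= 3/4
```

Since Alg's reads on tape r never touch a flipped entry, it behaves identically on A and B, while the checks above show the two instances have very different margin and MEB cost.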

7. CONCLUDING REMARKS

We have described a general method for sublinear optimization of constrained convex programs, and showed applications to classical problems in machine learning such as linear classification and minimum enclosing ball, obtaining improvements in leading-order terms over the state of the art.

In all our running times the dimension d can be replaced by the parameter S, which is the maximum over the input rows A_i of the number of nonzero entries in A_i. Note that d ≥ S ≥ M/n. Here we require the assumption that the entries of any given row can be recovered in O(S) time, which is compatible with keeping each row as a hash table or (up to a logarithmic factor in run-time) in sorted order.

Acknowledgements. We thank Nati Srebro and an anonymous referee for helpful comments on the relation between this work and PAC learning theory.

REFERENCES

BLUM, A., FRIEZE, A. M., KANNAN, R., AND VEMPALA, S. 1998. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica 22, 1/2, 35–52.

BYLANDER, T. 1994. Learning linear threshold functions in the presence of classification noise. In COLT '94: Proceedings of the Seventh Annual Conference on Computational Learning Theory. ACM, New York, NY, USA, 340–347.

CESA-BIANCHI, N., CONCONI, A., AND GENTILE, C. 2004. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory 50, 9, 2050–2057.

CESA-BIANCHI, N., SHALEV-SHWARTZ, S., AND SHAMIR, O. 2010. Online learning of noisy data with kernels. In COLT, A. T. Kalai and M. Mohri, Eds. Omnipress, 218–230.

CLARKSON, K. L. 2008. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In SODA '08: Proc. Nineteenth ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 922–931.

COVER, T. AND THOMAS, J. A. 1991. Elements of Information Theory. Wiley Series in Telecommunications.

DIACONIS, P. AND FREEDMAN, D. 1980. Finite exchangeable sequences. The Annals of Probability 8, 745–764.

DUNAGAN, J. AND VEMPALA, S. 2004. A simple polynomial-time rescaling algorithm for solving linear programs. In STOC '04: Proceedings of the Thirty-Sixth Annual ACM Symposium on the Theory of Computing. ACM, New York, NY, USA, 315–320.

FRANK, M. AND WOLFE, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3, 95–110.

GRIGORIADIS, M. D. AND KHACHIYAN, L. G. 1995. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters 18, 53–58.

HAZAN, E. 2011. The convex optimization approach to regret minimization. Optimization for Machine Learning 1.

HAZAN, E., AGARWAL, A., AND KALE, S. 2007. Logarithmic regret algorithms for online convex optimization. Machine Learning 69, 2-3, 169–192.

KOUFOGIANNAKIS, C. AND YOUNG, N. E. 2007. Beating simplex for fractional packing and covering linear programs. In FOCS. IEEE Computer Society, 494–504.

MINSKY, M. AND PAPERT, S. 1988. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, Mass.

MONEMIZADEH, M. AND WOODRUFF, D. 2010. 1-pass relative error ℓ_p-sampling with applications. In SODA '10: Proc. Twenty-First ACM-SIAM Symposium on Discrete Algorithms.

MOTWANI, R. AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press.

NOVIKOFF, A. B. 1963. On convergence proofs for perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata. Vol. 12. 615–622.

PANCONESI, A. AND SRINIVASAN, A. 1997. Randomized distributed edge coloring via an extension of the Chernoff-Hoeffding bounds. SIAM J. Comput. 26, 2, 350–368.

PLOTKIN, S. A., SHMOYS, D. B., AND TARDOS, E. 1991. Fast approximation algorithms for fractional packing and covering problems. In SFCS '91: Proceedings of the 32nd Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, DC, USA, 495–504.

SAHA, A. AND VISHWANATHAN, S. 2009. Efficient approximation algorithms for minimum enclosing convex shapes. arXiv:0909.1062v2.

SCHÖLKOPF, B. AND SMOLA, A. J. 2003. A Short Introduction to Learning with Kernels. Springer-Verlag New York, Inc., New York, NY, USA.

SERVEDIO, R. A. 1999. On PAC learning using Winnow, Perceptron, and a Perceptron-like algorithm. In COLT '99: Proceedings of the Twelfth Annual Conference on Computational Learning Theory. ACM, New York, NY, USA, 296–307.

ZINKEVICH, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML). 928–936.

Received January 0000; revised January 0000; accepted January 0000

A. MAIN TOOLS

A.1. Tools from online learning

Online linear optimization. The following lemma is essentially due to [Zinkevich 2003]:

LEMMA A.1 (OGD). Consider a set of vectors q_1, …, q_T ∈ R^d such that ‖q_i‖_2 ≤ c. Let x_0 ← 0, and x̃_{t+1} ← x_t + (1/√T)q_t, x_{t+1} ← x̃_{t+1}/max{1, ‖x̃_{t+1}‖}. Then

max_{x∈B} Σ_{t=1}^T q_t^⊤x − Σ_{t=1}^T q_t^⊤x_t ≤ 2c√T.

This is true even if each q_t is dependent on x_1, …, x_{t−1}.

PROOF. Assume c = 1; the generalization is by straightforward scaling. Let η = 1/√T. By definition, and for any ‖x‖ ≤ 1,

‖x − x_{t+1}‖² ≤ ‖x − x̃_{t+1}‖² = ‖x − x_t − ηq_t‖² = ‖x − x_t‖² − 2ηq_t^⊤(x − x_t) + η²‖q_t‖².

Rearranging we obtain

q_t^⊤(x − x_t) ≤ (1/(2η))[‖x − x_t‖² − ‖x − x_{t+1}‖²] + η/2.

Summing up over t = 1 to T yields

Σ_t q_t^⊤x − Σ_t q_t^⊤x_t ≤ (1/(2η))‖x − x_1‖² + Tη/2 ≤ 2/(2η) + (η/2)T ≤ 2√T.
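The update of Lemma A.1 is simple enough to sketch directly. The following is our own minimal Python illustration (not the paper's implementation): gradient ascent on the linear rewards q_t^⊤x with step 1/√T and projection onto the unit ball, checked against the 2c√T regret bound.

```python
import math, random

def ogd(qs):
    """Play the Lemma A.1 iterates against the reward vectors qs (||q_t|| <= 1)."""
    T = len(qs)
    eta = 1.0 / math.sqrt(T)
    x = [0.0] * len(qs[0])
    played = []
    for q in qs:
        played.append(list(x))
        x = [xi + eta * qi for xi, qi in zip(x, q)]     # ascent step
        nrm = math.sqrt(sum(xi * xi for xi in x))
        if nrm > 1.0:                                    # project onto unit ball
            x = [xi / nrm for xi in x]
    return played

random.seed(0)
T, d = 200, 3
qs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
for q in qs:                                             # enforce ||q_t|| <= 1
    nrm = math.sqrt(sum(v * v for v in q))
    for i in range(d):
        q[i] /= max(nrm, 1.0)

xs = ogd(qs)
alg = sum(sum(qi * xi for qi, xi in zip(q, x)) for q, x in zip(qs, xs))
# The best fixed point in B earns ||sum_t q_t||.
best = math.sqrt(sum(sum(q[i] for q in qs) ** 2 for i in range(d)))
print(best - alg <= 2 * math.sqrt(T))                    # regret bound, c = 1
```

Note the adversary here is oblivious only for simplicity; the lemma allows q_t to depend on x_1, …, x_{t−1}.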

A simpler version of gradient descent, also essentially due to [Zinkevich 2003], is given by:


LEMMA A.2 (LAZY PROJECTION OGD). Consider a set of vectors q_1, …, q_T ∈ R^d such that ‖q_i‖_2 ≤ 1. Let

x_{t+1} ← arg min_{x∈B} { −Σ_{τ=1}^t q_τ^⊤x + (√(2T)/2)‖x‖²_2 }.

Then

max_{x∈B} Σ_{t=1}^T q_t^⊤x − Σ_{t=1}^T q_t^⊤x_t ≤ 2√(2T).

This is true even if each q_t is dependent on x_1, …, x_{t−1}.

For a proof see Theorem 2.1 in [Hazan 2011], where we take R(x) = ‖x‖²_2, and the norm of the linear cost functions is bounded by ‖q_t‖_2 ≤ 1, as is the diameter of K (the ball, in our case). Notice that the solution of the above optimization problem is simply:

x_{t+1} = y_{t+1}/max{1, ‖y_{t+1}‖},   y_{t+1} = (Σ_{τ=1}^t q_τ)/√(2T).
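The closed form makes the lazy version a one-line update on the running sum. Below is our own sketch (not from the paper) of the Lemma A.2 iterates, checked against the 2√(2T) regret bound.

```python
import math, random

def lazy_ogd(qs):
    """Lazy-projection OGD: project the scaled running sum of q_1..q_t."""
    T, d = len(qs), len(qs[0])
    run = [0.0] * d
    x = [0.0] * d
    played = []
    for q in qs:
        played.append(list(x))
        run = [r + qi for r, qi in zip(run, q)]
        y = [r / math.sqrt(2 * T) for r in run]      # closed-form minimizer
        nrm = math.sqrt(sum(v * v for v in y))
        x = [v / max(1.0, nrm) for v in y]           # project onto unit ball
    return played

random.seed(1)
T, d = 128, 4
qs = []
for _ in range(T):
    q = [random.gauss(0, 1) for _ in range(d)]
    nrm = math.sqrt(sum(v * v for v in q))
    qs.append([v / nrm for v in q])                  # ||q_t||_2 = 1

xs = lazy_ogd(qs)
alg = sum(sum(a * b for a, b in zip(q, x)) for q, x in zip(qs, xs))
best = math.sqrt(sum(sum(q[i] for q in qs) ** 2 for i in range(d)))
print(best - alg <= 2 * math.sqrt(2 * T))            # bound of Lemma A.2
```

The appeal in the streaming setting is that only the running sum needs to be maintained between rounds.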

Strongly convex loss functions. The following lemma is essentially due to [Hazan et al. 2007].

For H ∈ R with H > 0, a function f: R^d → R is H-strongly convex in B if for all x ∈ B, all the eigenvalues of ∇²f(x) are at least H.

LEMMA A.3 (OGD STRICTLY CONVEX). Consider a set of H-strongly convex functions f_1, …, f_T such that the norm of their gradients is bounded over the unit ball B by G ≥ max_t max_{x∈B} ‖∇f_t(x)‖. Let x_0 ∈ B, and x̃_{t+1} ← x_t − (1/(Ht))∇f_t(x_t), x_{t+1} ← x̃_{t+1}/max{1, ‖x̃_{t+1}‖}. Then

Σ_{t=1}^T f_t(x_t) − min_{‖x‖_2≤1} Σ_{t=1}^T f_t(x) ≤ (G²/H) log T.

This is true even if each f_t is dependent on x_1, …, x_{t−1}.

For the MEB application and its relatives it is easier to implement the lazy versions in the streaming model. The following lemma is the analogous tool we need:

LEMMA A.4. Consider a set of H-strongly convex functions f_1, …, f_T such that the norm of their gradients is bounded over the unit ball B by G ≥ max_t max_{x∈B} ‖∇f_t(x)‖. Let

x_{t+1} ← arg min_{x∈B} { Σ_{τ=1}^t f_τ(x) }.

Then

Σ_{t=1}^T f_t(x_t) − min_{x∈B} Σ_{t=1}^T f_t(x) ≤ (2G²/H) log T.

This is true even if each f_t is dependent on x_1, …, x_{t−1}.

PROOF. By Lemma 2.3 in [Hazan 2011] we have:

Σ_{t=1}^T f_t(x_t) − min_{‖x‖_2≤1} Σ_{t=1}^T f_t(x) ≤ Σ_t [f_t(x_t) − f_t(x_{t+1})].


Denote Φ_t(x) = Σ_{τ=1}^t f_τ(x). Then by Taylor expansion at x_{t+1}, there exists a z_t ∈ [x_{t+1}, x_t] for which

Φ_t(x_t) = Φ_t(x_{t+1}) + (x_t − x_{t+1})^⊤∇Φ_t(x_{t+1}) + (1/2)‖x_t − x_{t+1}‖²_{z_t} ≥ Φ_t(x_{t+1}) + (1/2)‖x_t − x_{t+1}‖²_{z_t},

using the notation ‖y‖²_z = y^⊤∇²Φ_t(z)y. The inequality above is true because x_{t+1} is a minimum of Φ_t over K. Thus,

‖x_t − x_{t+1}‖²_{z_t} ≤ 2Φ_t(x_t) − 2Φ_t(x_{t+1})
= 2(Φ_{t−1}(x_t) − Φ_{t−1}(x_{t+1})) + 2[f_t(x_t) − f_t(x_{t+1})]
≤ 2[f_t(x_t) − f_t(x_{t+1})]   (optimality of x_t)
≤ 2∇f_t(x_t)^⊤(x_t − x_{t+1})   (convexity of f_t).

By convexity and Cauchy-Schwarz (writing ‖·‖*_z for the dual norm of ‖·‖_z):

f_t(x_t) − f_t(x_{t+1}) ≤ ∇f_t(x_t)^⊤(x_t − x_{t+1}) ≤ ‖∇f_t(x_t)‖*_{z_t} ‖x_t − x_{t+1}‖_{z_t} ≤ ‖∇f_t(x_t)‖*_{z_t} √(2∇f_t(x_t)^⊤(x_t − x_{t+1})).

Shifting sides and squaring, we get

f_t(x_t) − f_t(x_{t+1}) ≤ ∇f_t(x_t)^⊤(x_t − x_{t+1}) ≤ 2(‖∇f_t(x_t)‖*_{z_t})².

Since the f_t are assumed to be H-strongly convex, we have ‖y‖_{z_t} ≥ √(Ht)‖y‖, and hence, for the dual norm,

f_t(x_t) − f_t(x_{t+1}) ≤ 2(‖∇f_t(x_t)‖*_{z_t})² ≤ 2‖∇f_t(x_t)‖²/(Ht) ≤ 2G²/(Ht).

Summing over all iterations we get

Σ_{t=1}^T f_t(x_t) − min_{‖x‖_2≤1} Σ_{t=1}^T f_t(x) ≤ Σ_t [f_t(x_t) − f_t(x_{t+1})] ≤ Σ_t 2G²/(Ht) ≤ (2G²/H) log T.
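A toy instance of Lemma A.4 can be run directly. The choice of losses below is ours, not the paper's: f_t(x) = ‖x − a_t‖² with a_t on the unit sphere, so H = 2 and G ≤ 4, and the leader arg min over B of the partial sums is just the running mean of the a_τ (the unconstrained minimizer, which already lies in B).

```python
import math, random

random.seed(2)
T, d = 256, 3
a = []
for _ in range(T):
    v = [random.gauss(0, 1) for _ in range(d)]
    nrm = math.sqrt(sum(c * c for c in v))
    a.append([c / nrm for c in v])        # unit-norm targets a_t

def loss(x, at):
    return sum((xi - ai) ** 2 for xi, ai in zip(x, at))

x = [0.0] * d                              # x_1, arbitrary point of B
s = [0.0] * d                              # running sum of the a_t
total = 0.0
for t, at in enumerate(a, start=1):
    total += loss(x, at)
    s = [si + ai for si, ai in zip(s, at)]
    x = [si / t for si in s]               # leader = mean of a_1..a_t

mean = [si / T for si in s]
opt = sum(loss(mean, at) for at in a)      # best fixed point in B
H, G = 2.0, 4.0
print(total - opt <= 2 * G * G / H * math.log(T) + 1e-9)   # (2G^2/H) log T
```

The printed check confirms the follow-the-leader regret stays under the (2G²/H) log T bound on this instance.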

Combining sampling and regret minimization.

LEMMA A.5. Consider a set of H-strongly convex functions f_1, …, f_T such that the norm of their gradients is bounded over the unit ball by G ≥ max_t max_{x∈B} ‖∇f_t(x)‖. Let

y_{t+1} ← { arg min_{x∈B} { Σ_{τ=1}^t f_τ(x) }  with probability δ;  y_t  otherwise. }

Then for a fixed x* we have

E[ Σ_{t=1}^T f_t(y_t) − Σ_{t=1}^T f_t(x*) ] ≤ (1/δ)(2G²/H) log T.

This is true even if each f_t is dependent on y_1, …, y_{t−1}.

PROOF.


Consider the sequence of functions f̃_t defined as

f̃_t ← { f_t  with probability δ;  0  otherwise, }

where 0 denotes the all-zero function. Then the algorithm from Lemma A.4 applied to the functions f̃_t is exactly the algorithm we apply above to the functions f_t. Notice that the functions f̃_t are Hδ-strongly convex, and in addition their gradients are bounded by G. Hence applying Lemma A.4 we obtain

E[ Σ_{t=1}^T f_t(y_t) − Σ_{t=1}^T f_t(x*) ] = E[ Σ_{t=1}^T f̃_t(x_t) − Σ_{t=1}^T f̃_t(x*) ] ≤ (1/δ)(2G²/H) log T.

B. AUXILIARY LEMMAS

First, some simple lemmas about random variables.

LEMMA B.1. Let X be a random variable, let

X̄ = clip(X, C) = min{C, max{−C, X}},

and assume that |E[X]| ≤ C/2 for some C > 0. Then

| E[X̄] − E[X] | ≤ 2Var[X]/C.

PROOF. As a first step, note that for x > C we have x − E[X] ≥ C/2, so that

C(x − C) ≤ 2(x − E[X])(x − C) ≤ 2(x − E[X])².

Hence, we obtain

E[X] − E[X̄] = ∫_{x<−C} (x + C) dμ_X + ∫_{x>C} (x − C) dμ_X ≤ ∫_{x>C} (x − C) dμ_X ≤ (2/C) ∫_{x>C} (x − E[X])² dμ_X ≤ (2/C) Var[X].

Similarly one can prove that E[X] − E[X̄] ≥ −2Var[X]/C, and the result follows.
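The bound of Lemma B.1 can be verified exactly on a small discrete distribution. The distribution below is our own example, chosen so that |E[X]| ≤ C/2 holds while the tails are actually clipped.

```python
def clip(x, C):
    """Clipping as in Lemma B.1: min{C, max{-C, x}}."""
    return min(C, max(-C, x))

# X takes value v with probability p; heavy tails on both sides.
dist = [(-9.0, 0.05), (-1.0, 0.40), (1.0, 0.40), (11.0, 0.15)]
C = 8.0

mean = sum(v * p for v, p in dist)
var = sum((v - mean) ** 2 * p for v, p in dist)
assert abs(mean) <= C / 2                 # hypothesis of the lemma

clipped_mean = sum(clip(v, C) * p for v, p in dist)
print(abs(clipped_mean - mean) <= 2 * var / C)   # lemma's conclusion
```

Here the exact shift of the mean under clipping is compared against 2 Var[X]/C, with all expectations computed in closed form.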

LEMMA B.2. For random variables X and Y, and λ ∈ [0, 1],

E[(λX + (1 − λ)Y)²] ≤ max{E[X²], E[Y²]}.

This implies by induction that the second moment of a convex combination of random variables is no more than the maximum of their second moments.

PROOF. By convexity of the quadratic function (Jensen's inequality) we have

E[(λX + (1 − λ)Y)²] ≤ E[λX² + (1 − λ)Y²] = λE[X²] + (1 − λ)E[Y²] ≤ max{E[X²], E[Y²]}.


B.1. Martingale and concentration lemmas

The Bernstein inequality, which holds for random variables Z_t, t ∈ [T], that are independent and such that for all t, E[Z_t] = 0, E[Z_t²] ≤ s, and |Z_t| ≤ V, states

log Prob{ Σ_{t∈[T]} Z_t ≥ λ } ≤ −λ²/(2(Ts + λV/3)).   (31)

Here we need a similar bound for random variables which are not independent, but form a martingale with respect to a certain filtration. Many concentration results have been proven for martingales, including, in all likelihood, the present lemma. However, for clarity and completeness, we will outline how the proof of the Bernstein inequality can be adapted to this setting.

Bernstein inequality can be adapted to this setting.

LEMMA B.3. Let {Z_t} be a martingale difference sequence with respect to filtration {S_t}, such that E[Z_t | S_1, …, S_t] = 0. Assume the filtration {S_t} is such that the values in S_t are determined using only those in S_{t−1}, and not any previous history, and so the joint probability distribution

Prob{S_1 = s_1, S_2 = s_2, …, S_T = s_T} = Π_{t∈[T−1]} Prob{S_{t+1} = s_{t+1} | S_t = s_t}.

In addition, assume for all t, E[Z_t² | S_1, …, S_t] ≤ s, and |Z_t| ≤ V. Then

log Prob{ Σ_{t∈[T]} Z_t ≥ λ } ≤ −λ²/(2(Ts + λV/3)).

PROOF. A key step in proving the Bernstein inequality is to show an upper bound on the exponential generating function E[exp(βZ)], where Z ≜ Σ_t Z_t, and β > 0 is a parameter to be chosen. This step is where the hypothesis of independence is applied. In our setting, we can show a similar upper bound on this expectation. Let E_t[·] denote expectation with respect to S_t, and E_{[T]} denote expectation with respect to S_t for t ∈ [T]. The expression above for the probability distribution implies that for any real-valued function f of state tuples S_t,

E_{[T]}[ Π_{t∈[T]} f(S_t) ]
= f(s_1) ∫_{s_2,…,s_T} [ Π_{t∈[T−1]} f(s_{t+1}) ][ Π_{t∈[T−1]} Prob{S_{t+1} = s_{t+1} | S_t = s_t} ]
= f(s_1) ∫_{s_2,…,s_{T−1}} [ Π_{t∈[T−2]} f(s_{t+1}) ][ Π_{t∈[T−2]} Prob{S_{t+1} = s_{t+1} | S_t = s_t} ] ∫_{s_T} f(s_T) Prob{S_T = s_T | S_{T−1} = s_{T−1}},

where the inner integral can be denoted as the conditional expectation E_T[f(S_T) | S_{T−1}]. By induction this is

f(s_1) ∫_{s_2} f(s_2) Prob{S_2 = s_2 | S_1 = s_1} ∫_{s_3} ⋯ ∫_{s_T} f(s_T) Prob{S_T = s_T | S_{T−1} = s_{T−1}} ⋯,

and by writing the constant f(s_1) as the expectation with respect to the constant S_0 = s_0, and using E_X[E_X[Y]] = E_X[Y] for any random variables X and Y, we can write this as

E_{[T]}[ Π_{t∈[T]} f(S_t) ] = E_{[T]}[ Π_{t∈[T]} E_t[f(S_t) | S_{t−1}] ].

For fixed i and a given β ∈ R, we take f(S_1) = 1, and f(S_t) ≜ exp(βZ_{t−1}), to obtain

E_{[T]}[ exp(β Σ_{t∈[T]} Z_t) ] = E_{[T]}[ Π_{t∈[T]} E_t[exp(βZ_t) | S_{t−1}] ].

Now for any random variable X with E[X] = 0, E[X²] ≤ s, and |X| ≤ V,

E[exp(βX)] ≤ exp( (s/V²)(e^{βV} − 1 − βV) ),

(as is shown and used for proving Bernstein's inequality in the independent case) and therefore

E_{[T]}[exp(βZ)] ≤ E_{[T]}[ Π_{t∈[T]} exp( (s/V²)(e^{βV} − 1 − βV) ) ] = exp( T(s/V²)(e^{βV} − 1 − βV) ),

where Z ≜ Σ_{t∈[T]} Z_t. This bound is the same as is obtained for independent Z_t, and so the remainder of the proof is exactly as in the proof for the independent case: Markov's inequality is applied to the random variable exp(βZ), obtaining

Prob{Z ≥ λ} ≤ exp(−βλ) E_{[T]}[exp(βZ)] ≤ exp( −βλ + T(s/V²)(e^{βV} − 1 − βV) ),

and an appropriate value β = (1/V) log(1 + λV/(Ts)) is chosen for minimizing the bound, yielding

Prob{Z ≥ λ} ≤ exp( −(Ts/V²)((1 + θ) log(1 + θ) − θ) ),

where θ ≜ λV/(Ts), and finally the inequality for θ ≥ 0 that (1 + θ) log(1 + θ) − θ ≥ (θ²/2)/(1 + θ/3) is applied.
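A quick Monte-Carlo sanity check of the Lemma B.3 bound is easy to set up. The dependent process below is our own construction: the step size of each martingale difference depends on a state that evolves with the filtration, with E[Z_t² | past] ≤ s and |Z_t| ≤ V throughout.

```python
import math, random

random.seed(3)
T, s, V = 100, 0.25, 1.0
lam = 8.0
trials = 20000
hits = 0
for _ in range(trials):
    state, Z = 0.5, 0.0
    for _ in range(T):
        sigma = math.sqrt(s) * state       # conditional std; sigma^2 <= s
        Z += sigma if random.random() < 0.5 else -sigma
        state = min(1.0, state + 0.01)     # state evolves along the filtration
    if Z >= lam:
        hits += 1

# Bernstein-type tail from Lemma B.3: exp(-lam^2 / (2(Ts + lam*V/3))).
bound = math.exp(-lam ** 2 / (2 * (T * s + lam * V / 3)))
print(hits / trials <= bound)
```

The empirical tail frequency falls well under the bound; the point of the lemma is that no independence across rounds was needed.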

B.2. Proof of lemmas used in main theorem

We restate and prove Lemmas 2.4, 2.5 and 2.6, in slightly more general form. In the following we only assume that v_t(i) = clip(ṽ_t(i), 1/η) is the clipping of a random variable ṽ_t(i) (hence the parameter C for Lemma B.1 is C = 1/η >> 1). The variance of ṽ_t(i) is at most one, Var[ṽ_t(i)] ≤ 1, and we denote μ_t(i) = E[ṽ_t(i)]. We also assume that the expectations of ṽ_t(i) are bounded by an absolute constant, |μ_t(i)| ≤ ρ, with 1 ≤ ρ ≤ 1/η. This constant is one for the perceptron application, but at most two for MEB. Note that since the variance of ṽ_t(i) is bounded by one, so is the variance of its clipping v_t(i).³

LEMMA B.4. For √(log n / T) ≤ η ≤ 1/ρ, with probability at least 1 − O(1/n),

max_i Σ_{t∈[T]} [v_t(i) − μ_t(i)] ≤ 6ηT.

³This follows from the fact that the second moment only decreases by the clipping operation, and the definition of variance as Var(v_t(i)) = min_z E[(v_t(i) − z)²]. We can use z = E[ṽ_t(i)], and hence the decrease in second moment suffices.


PROOF. Lemma B.1 implies that |E[v_t(i)] − μ_t(i)| ≤ 2η, since Var[ṽ_t(i)] ≤ 1.

We show that for a given i ∈ [n], with probability 1 − O(1/n²), Σ_{t∈[T]} [v_t(i) − E[v_t(i)]] ≤ 4ηT, and then apply the union bound over all i ∈ [n]. This, together with the above bound on |E[v_t(i)] − μ_t(i)|, implies the lemma via the triangle inequality.

Fixing i, let Z^i_t ≜ v_t(i) − E[v_t(i)], and consider the filtration given by

S_t ≜ (x_t, p_t, w_t, y_t, v_{t−1}, i_{t−1}, j_{t−1}, v_{t−1} − E[v_{t−1}]).

Using the notation E_t[·] = E[· | S_t], observe that

(1) for all t: E_t[(Z^i_t)²] = E_t[v_t(i)²] − E_t[v_t(i)]² = Var(v_t(i)) ≤ 1.

(2) |Z^i_t| ≤ 2/η. This holds since by construction |v_t(i)| ≤ 1/η, and hence

|Z^i_t| = |v_t(i) − E[v_t(i)]| ≤ |v_t(i)| + |E[v_t(i)]| ≤ 2/η.

Using these conditions, despite the fact that the Z^i_t are not independent, we can use Lemma B.3, and conclude that Z ≜ Σ_{t∈[T]} Z^i_t satisfies the Bernstein-type inequality with s = 1 and V = 2/η:

log Prob{Z ≥ λ} ≤ −λ²/(2(Ts + λV/3)) = −λ²/(2(T + 2λ/(3η))).

Hence for λ = 4ηT, we have

log Prob{Z ≥ 4ηT} ≤ −16η²T²/(2(T + 8T/3)) ≤ −2η²T.

For η ≥ √(log n / T), the above probability is at most e^{−2 log n} ≤ 1/n².

Lemma 2.5 can be restated in the following more general form:

LEMMA B.5. For √(log n / T) ≤ η ≤ 1/ρ, with probability at least 1 − O(1/n), it holds that

Σ_{t∈[T]} μ_t(i_t) − Σ_t p_t^⊤v_t ≤ 10ρηT.

It is a corollary of the following two lemmas:

LEMMA B.6. For η ≥ √(log n / T), with probability at least 1 − O(1/n),

| Σ_{t∈[T]} p_t^⊤v_t − Σ_t p_t^⊤μ_t | ≤ 4ηT.

PROOF. This lemma is proven in essentially the same manner as Lemma 2.4; we prove it below for completeness.

Lemma B.1 implies that |E[v_t(i)] − μ_t(i)| ≤ 2η, using Var[ṽ_t(i)] ≤ 1. Since p_t is a distribution, it follows that |E[p_t^⊤v_t] − p_t^⊤μ_t| ≤ 2η.

Let Z_t ≜ p_t^⊤v_t − E[p_t^⊤v_t] = Σ_i p_t(i)Z^i_t, where Z^i_t = v_t(i) − E[v_t(i)]. Consider the filtration given by

S_t ≜ (x_t, p_t, w_t, y_t, v_{t−1}, i_{t−1}, j_{t−1}, v_{t−1} − E[v_{t−1}]).

]);


Using the notation E_t[·] = E[· | S_t], the quantities |Z_t| and E_t[Z_t²] can be bounded as follows:

|Z_t| = | Σ_i p_t(i)Z^i_t | ≤ Σ_i p_t(i)|Z^i_t| ≤ 2η^{−1},

using |Z^i_t| ≤ 2η^{−1} as in Lemma 2.4. Also, using properties of variance, we have

E[Z_t²] = Var[p_t^⊤v_t] = Σ_i p_t(i)² Var(v_t(i)) ≤ max_i Var[v_t(i)] ≤ 1.

We can now apply the Bernstein-type inequality of Lemma B.3, and continue exactly as in Lemma 2.4.

LEMMA B.7. For √(log n / T) ≤ η ≤ 1/ρ, with probability at least 1 − O(1/n),

Σ_{t∈[T]} μ_t(i_t) − Σ_t p_t^⊤μ_t ≤ 6ρηT.

PROOF. Let Z_t ≜ μ_t(i_t) − p_t^⊤μ_t, where now μ_t is a constant vector and i_t is the random variable, and consider the filtration given by

S_t ≜ (x_t, p_t, w_t, y_t, v_{t−1}, i_{t−1}, j_{t−1}, Z_{t−1}).

The expectation of μ_t(i_t), conditioning on S_t, with respect to the random choice r(i_t), is p_t^⊤μ_t. Hence E_t[Z_t] = 0, where E_t[·] denotes E[· | S_t]. The parameters |Z_t| and E[Z_t²] can be bounded as follows:

|Z_t| ≤ |μ_t(i)| + |p_t^⊤μ_t| ≤ 2ρ,

E[Z_t²] = E[(μ_t(i) − p_t^⊤μ_t)²] ≤ 2E[μ_t(i)²] + 2(p_t^⊤μ_t)² ≤ 4ρ².

Applying Lemma B.3 to Z ≜ Σ_{t∈[T]} Z_t, with parameters s ≜ 4ρ² and V ≜ 2ρ, we obtain

log Prob{Z ≥ λ} ≤ −λ²/(2(4ρ²T + 2ρλ/3)).

Letting λ ≜ 6ρηT, we obtain

log Prob{Z ≥ 6ρηT} ≤ −36ρ²η²T²/(2(4ρ²T + 4ρ²ηT)) ≤ −2η²T ≤ −2 log n,

where the last inequality holds assuming η ≥ √(log n / T).

Finally, we prove Lemma 2.6 by a simple application of Markov's inequality:

LEMMA B.8. With probability at least 1 − 1/4 it holds that

Σ_t p_t^⊤v_t² ≤ 8ρ²T.

PROOF. By assumption, E[ṽ_t²(i)] = Var(ṽ_t(i)) + E[ṽ_t(i)]² ≤ 1 + ρ² ≤ 2ρ², and hence E[v_t(i)²] ≤ E[ṽ_t(i)²] ≤ 2ρ².

By linearity of expectation, we have E[Σ_t p_t^⊤v_t²] ≤ 2ρ²T, and since the random variables v_t² are non-negative, applying Markov's inequality yields the lemma.

Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.

0:48 Clarkson,Hazan and Woodruff

C. BOUNDED PRECISION

All algorithms in this paper can be implemented with bounded precision.

First we observe that approximation of both the training data and the vectors that are "played" does not increase the regret too much, for both settings we are working in.

LEMMA C.1. Given a sequence of functions f_1, …, f_T and another sequence f̃_1, …, f̃_T, all mapping R^d to R, such that |f̃_t(x) − f_t(x)| ≤ ε_f for all x ∈ B and t ∈ [T], suppose x_1, …, x_T ∈ B is a sequence of regret R against {f̃_t}, that is,

max_{x∈B} Σ_{t∈[T]} f̃_t(x) − Σ_{t∈[T]} f̃_t(x_t) ≤ R.

Now suppose x̃_1, …, x̃_T ∈ R^d is a sequence with |f_t(x̃_t) − f_t(x_t)| ≤ ε_x for all t ∈ [T]. Then

max_{x∈B} Σ_{t∈[T]} f_t(x) − Σ_{t∈[T]} f_t(x̃_t) ≤ R + T(ε_x + 2ε_f).

PROOF. For x ∈ B, we have Σ_{t∈[T]} f_t(x) ≤ Σ_{t∈[T]} f̃_t(x) + Tε_f, and

Σ_{t∈[T]} f_t(x̃_t) ≥ Σ_{t∈[T]} f_t(x_t) − Tε_x ≥ Σ_{t∈[T]} f̃_t(x_t) − Tε_x − Tε_f,

and the result follows by combining these inequalities.
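The inequality of Lemma C.1 can be spot-checked numerically. The instance below is our own: linear functions f_t(x) = c_t x on B = [−1, 1], perturbed coefficients standing in for the f̃_t, and perturbed plays x̃_t.

```python
import random

random.seed(4)
T = 50
c = [random.uniform(-1, 1) for _ in range(T)]          # f_t(x) = c[t] * x
eps_f, eps_x = 0.05, 0.02
ctil = [v + random.uniform(-eps_f, eps_f) for v in c]  # |f~_t - f_t| <= eps_f on B

# Some played sequence x_t in B (here: follow the sign of the running sum).
xs, run = [], 0.0
for v in ctil:
    xs.append(1.0 if run >= 0 else -1.0)
    run += v
# Perturbed plays: |f_t(x~_t) - f_t(x_t)| <= |c_t| * eps_x <= eps_x.
xtil = [x + random.uniform(-eps_x, eps_x) for x in xs]

R = abs(sum(ctil)) - sum(v * x for v, x in zip(ctil, xs))   # regret vs the f~_t
lhs = abs(sum(c)) - sum(v * x for v, x in zip(c, xtil))     # regret vs the f_t
print(lhs <= R + T * (eps_x + 2 * eps_f) + 1e-9)
```

For linear losses the comparator term max over B is just the absolute value of the coefficient sum, which keeps the check exact.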

That is, x_t is some sequence known to have small regret against the "training functions" f̃_t(x), which are approximations to the true functions of interest, and the x̃_t are approximations to these x_t. The lemma says that despite these approximations, the x̃_t sequence has controllable regret against the true functions.

This lemma is stated in more generality than we need: all functions considered here have the form f_t(x) = b_t + q_t^⊤x + ν‖x‖², where |b_t| ≤ 1, q_t ∈ B, and |ν| ≤ 1. Thus if f̃_t(x) = b̃_t + q̃_t^⊤x + ν‖x‖², then the first condition |f̃_t(x) − f_t(x)| ≤ ε_f holds when |b_t − b̃_t| + ‖q_t − q̃_t‖ ≤ ε_f. Also, the second condition |f_t(x̃_t) − f_t(x_t)| ≤ ε_x holds for such functions when ‖x̃_t − x_t‖ ≤ ε_x/3.

LEMMA C.2. Given a sequence of vectors q_1, …, q_T ∈ R^n with ‖q_t‖_∞ ≤ B for t ∈ [T], and a sequence q̃_1, …, q̃_T ∈ R^n such that ‖q̃_t − q_t‖_∞ ≤ ε_q for all t ∈ [T], suppose p_1, …, p_T ∈ Δ is a sequence of regret R against {q̃_t}, that is,

Σ_{t∈[T]} p_t^⊤q̃_t − min_{p∈Δ} Σ_{t∈[T]} p^⊤q̃_t ≤ R.

Now suppose p̃_1, …, p̃_T ∈ R^n is a sequence with ‖p̃_t − p_t‖_1 ≤ ε_p for all t ∈ [T]. Then

Σ_{t∈[T]} p̃_t^⊤q_t − min_{p∈Δ} Σ_{t∈[T]} p^⊤q_t ≤ R + T(Bε_p + 2ε_q).

PROOF. For p ∈ Δ we have Σ_{t∈[T]} p^⊤q_t ≥ Σ_{t∈[T]} p^⊤q̃_t − Tε_q, and

Σ_{t∈[T]} p̃_t^⊤q_t ≤ Σ_{t∈[T]} p_t^⊤q_t + TBε_p ≤ Σ_{t∈[T]} p_t^⊤q̃_t + TBε_p + Tε_q.

The proof follows by combining the inequalities.


Note that to have ‖p̃_t − p_t‖_1 ≤ ε_p, it is enough that the relative error of each entry of p̃_t is ε_p.

The use of q̃_t in place of q_t (for either of the two lemmas) will be helpful for our kernelized algorithms (§5), where computation of the norms ‖y_t‖ of the working vectors y_t is a bottleneck; the above two lemmas imply that it is enough to compute such norms to within relative error ε or so.

C.1. Bit Precision for Algorithm 1

First, the bit precision needed for the OGD part of the algorithm. Let ζ denote a sufficiently small constant fraction of ε, where the small constant is absolute. From Lemma C.1 and the following discussion, we need only use the rows A_i up to a precision that gives an approximation Ã_i that is within Euclidean distance ζ, and similarly for an approximation x̃_t of x_t. For the latter, in particular, we need only compute ‖y_t‖ to within relative error ζ. Thus a per-entry precision of ζ/√d is sufficient.

We need ‖x_t‖ for ℓ_2 sampling; arithmetic relative error ζ/√d in the sampling procedure gives an estimate ṽ_t(i) for which E[Aṽ_t] = Ax̂_t, where x̂_t is a vector within O(ζ) Euclidean distance of x_t. We can thus charge this error to the OGD analysis, where x̂_t is the x̃_t of Lemma C.1.

For the MW part of the algorithm, we observe that due to the clipping step, if the initial computation of ṽ_t(i), Line 9, is done with ζ/5 relative error, then the computed value is within ζ/5 additive error. Similar precision for the clipping implies that the computed value of v_t(i), which takes the place of q̃_t in Lemma C.2, is within ζ/5 of the exact version, corresponding to q_t in the lemma. Here B of the lemma, bounding ‖q_t‖_∞, is 1/η, due to the clipping.

It remains to determine the arithmetic relative error needed in the update step, Line 11, to keep the relative error of the computed value of p_t, or p̃_t of Lemma C.2, small enough. Indeed, if the relative error is a small enough constant fraction of ζ/T, then the relative error of all updates together can be ζ/3. Thus ε_p ≤ ζ/3 and ε_q ≤ ζ/3, and the added regret due to arithmetic error is at most ζT.

Summing up: the arithmetic precision needed is at most on the order of

−log min{ζ/√d, ζη, ζ/T} = O(log(nd/ε)),

to obtain a solution with additive εT/10 regret over the solution computed using exact computation. This implies an additional error of ε/10 to the computed solution, and thus changes only constant factors in the algorithm.

C.2. Bit Precision for Convex Quadratic Programming

From the remarks following Lemma C.1, the conditions of that lemma hold in the setting of convex quadratic programming in the simplex, assuming that every A_i ∈ B. Thus the discussion of §C.1 carries over, up to constants, with the simplification that computation of ‖y_t‖ is not needed.
