Sublinear Optimization for Machine Learning

Kenneth L. Clarkson, IBM Almaden Research Center
Elad Hazan, Technion - Israel Institute of Technology
David P. Woodruff, IBM Almaden Research Center
In this paper we describe and analyze sublinear-time approximation algorithms for some optimization problems arising in machine learning, such as training linear classifiers and finding minimum enclosing balls. Our algorithms can be extended to some kernelized versions of these problems, such as SVDD, hard margin SVM, and $L_2$-SVM, for which sublinear-time algorithms were not known before. These new algorithms use a combination of a novel sampling technique and a new multiplicative update algorithm. We give lower bounds which show the running times of many of our algorithms to be nearly best possible in the unit-cost RAM model.
1. INTRODUCTION
Linear classification is a fundamental problem of machine learning, in which positive and negative examples of a concept are represented in Euclidean space by their feature vectors, and we seek to find a hyperplane separating the two classes of vectors.
The Perceptron Algorithm for linear classification is one of the oldest algorithms studied in machine learning [Novikoff 1963; Minsky and Papert 1988]. It can be used to efficiently give a good approximate solution, if one exists, and has nice noise-stability properties which allow it to be used as a subroutine in many applications, such as learning with noise [Bylander 1994; Blum et al. 1998], boosting [Servedio 1999] and more general optimization [Dunagan and Vempala 2004]. In addition, it is extremely simple to implement: the algorithm starts with an arbitrary hyperplane, and iteratively finds a vector on which it errs, and moves in the direction of this vector by adding a multiple of it to the normal vector of the current hyperplane.
The standard implementation of the Perceptron Algorithm must iteratively find a "bad vector" which is classified incorrectly, that is, for which the inner product with the current normal vector has an incorrect sign. Our new algorithm is similar to the Perceptron Algorithm, in that it maintains a hyperplane and modifies it iteratively, according to the examples seen. However, instead of explicitly finding a bad vector, we run another, dual learning algorithm to learn the "most adversarial" distribution over the vectors, and use that distribution to generate an "expected bad" vector. Moreover, we do not compute the inner products with the current normal vector exactly, but instead estimate them using a fast sampling-based scheme.
Thus our update to the hyperplane uses a vector whose "badness" is determined quickly, but very crudely. We show that despite this, an approximate solution is still obtained in about the same number of iterations as the standard perceptron. So our algorithm is faster; notably, it can be executed in time sublinear in the size of the input data, and still have good output, with high probability. (Here we must make some reasonable assumptions about the way in which the data is stored, as discussed below.)
Part of this work was done while E. Hazan was at IBM Almaden Research Center. He is currently supported by Israel Science Foundation grant 810/11.
Problem | Previous time | Time here | Lower bound
linear classification | $\tilde O(\varepsilon^{-2} M)$ [Novikoff 1963] | $\tilde O(\varepsilon^{-2}(n+d))$ (§2) | $\Omega(\varepsilon^{-2}(n+d))$ (§6.1)
min. enc. ball (MEB) | $\tilde O(\varepsilon^{-1/2} M)$ [Saha and Vishwanathan 2009] | $\tilde O(\varepsilon^{-2} n + \varepsilon^{-1} d)$ (§3.1) | $\Omega(\varepsilon^{-2} n + \varepsilon^{-1} d)$ (§6.2)
QP in the simplex | $O(\varepsilon^{-1} M)$ [Frank and Wolfe 1956] | $\tilde O(\varepsilon^{-2} n + \varepsilon^{-1} d)$ (§3.3) |
Las Vegas versions | | additive $O(M)$ (Cor 2.11) | $\Omega(M)$ (§6.4)
kernelized MEB and QP | | factors $O(s^4)$ or $O(q)$ (§5) |
Fig. 1. Our results; the parameters are defined in the relevant sections.
This technique applies more generally than to the perceptron: we also obtain sublinear-time approximation algorithms for the related problems of finding an approximate Minimum Enclosing Ball (MEB) of a set of points, and training a Support Vector Machine (SVM), in the hard margin or $L_2$-SVM formulations.
We give lower bounds that imply that our algorithms for classification are best possible, up to polylogarithmic factors, in the unit-cost RAM model, while our bounds for MEB are best possible up to an $\tilde O(\varepsilon^{-1})$ factor. For most of these bounds, we give a family of inputs such that a single coordinate, randomly "planted" over a large collection of input vector coordinates, determines the output to such a degree that all coordinates in the collection must be examined for even a $2/3$ probability of success.
Our approach can be extended to give algorithms for the kernelized versions of these problems, for some popular kernels including the Gaussian and polynomial, and also easily gives Las Vegas results, where the output guarantees always hold, and only the running time is probabilistic.¹
Our main results are given in Figure 1, using the following notation: all the problems we consider have an $n \times d$ matrix $A$ as input, with $M$ nonzero entries, and with each row of $A$ having Euclidean length no more than one. The parameter $\varepsilon > 0$ is the additive error; for MEB, this can be a relative error, after a simple $O(M)$ preprocessing step. We use the asymptotic notation $\tilde O(f) = O(f \cdot \mathrm{polylog}\frac{nd}{\varepsilon})$. The parameter $\sigma$ is the margin of the problem instance, explained below. The parameters $s$ and $q$ determine the standard deviation of a Gaussian kernel, and the degree of a polynomial kernel, respectively.
The time bounds given for our algorithms, except the Las Vegas ones, are under the assumption of constant error probability; for output guarantees that hold with probability $1-\delta$, our bounds should be multiplied by $\log(n/\delta)$.
The time bounds also require the assumption that the input data is stored in such a way that a given entry $A_{i,j}$ can be recovered in constant time. This can be done by, for example, keeping each row $A_i$ of $A$ as a hash table. (Simply keeping the entries of the row in sorted order by column number is also sufficient, incurring an $O(\log d)$ overhead in running time for binary search.)
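As a concrete illustration of this storage assumption (ours, not part of the paper; the function names are illustrative), each sparse row can be kept as a Python dictionary, giving expected $O(1)$ access to any entry $A_{i,j}$, with absent keys treated as zero:

# Sketch (not from the paper): store each sparse row of A as a hash table
# so that a single entry A_{i,j} can be read in expected O(1) time.
from collections import defaultdict

def build_row_tables(entries):
    """entries: iterable of (i, j, value) triples with value != 0."""
    rows = defaultdict(dict)
    for i, j, value in entries:
        rows[i][j] = value
    return rows

def lookup(rows, i, j):
    # Missing entries are zero; present entries are returned in O(1) expected time.
    return rows[i].get(j, 0.0) if i in rows else 0.0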
¹ For MEB and the kernelized versions, we assume that the Euclidean norms of the relevant input vectors are known. Even with the addition of this linear-time step, all our algorithms improve on prior bounds, with the exception of MEB when $M = o(\varepsilon^{-3/2}(n+d))$.
Formal Description: Classification. In the linear classification problem, the learner is given a set of $n$ labeled examples in the form of $d$-dimensional vectors, comprising the input matrix $A$. The labels comprise a vector $y \in \{+1,-1\}^n$.
The goal is to find a separating hyperplane, that is, a normal vector $x$ in the unit Euclidean ball $B$ such that for all $i$, $y(i)\, A_i x \ge 0$; here $y(i)$ denotes the $i$-th coordinate of $y$. As mentioned, we will assume throughout that $A_i \in B$ for all $i \in [n]$, where generally $[m]$ denotes the set of integers $\{1,2,\ldots,m\}$.
As is standard, we may assume that the labels $y(i)$ are all $1$, by taking $A_i \leftarrow -A_i$ for any $i$ with $y(i) = -1$. The approximation version of linear classification is to find a vector $x_\varepsilon \in B$ that is an $\varepsilon$-approximate solution, that is,
$\forall i' \quad A_{i'} x_\varepsilon \ge \max_{x\in B}\min_i A_i x - \varepsilon.$  (1)
The optimum for this formulation is obtained when $\|x\| = 1$, except when no separating hyperplane exists, and then the optimum $x$ is the zero vector.
Note that $\min_i A_i x = \min_{p\in\Delta} p^\top A x$, where $\Delta \subseteq \mathbb{R}^n$ is the unit simplex $\{p \in \mathbb{R}^n : p_i \ge 0, \sum_i p_i = 1\}$. Thus we can regard the optimum as the outcome of a game to determine $p^\top A x$, between a minimizer choosing $p \in \Delta$, and a maximizer choosing $x \in B$, yielding
$\sigma \equiv \max_{x\in B}\min_{p\in\Delta} p^\top A x,$
where this optimum $\sigma$ is called the margin. From standard duality results, $\sigma$ is also the optimum of the dual problem
$\min_{p\in\Delta}\max_{x\in B} p^\top A x,$
and the optimum vectors $p^\star$ and $x^\star$ are the same for both problems.
The classical Perceptron algorithm returns an $\varepsilon$-approximate solution to this problem in $\frac{1}{\varepsilon^2}$ iterations, and total time $O(\varepsilon^{-2} M)$.² For given $\delta \in (0,1)$, our new algorithm takes $O(\varepsilon^{-2}(n+d)(\log n)\log(n/\delta))$ time to return an $\varepsilon$-approximate solution with probability at least $1-\delta$. Further, we show this is optimal in the unit-cost RAM model, up to poly-logarithmic factors.
Formal Description: Minimum Enclosing Ball (MEB). The MEB problem is to find the smallest Euclidean ball in $\mathbb{R}^d$ containing the rows of $A$. It is a special case of quadratic programming (QP) in the unit simplex, namely, to find $\min_{p\in\Delta} p^\top b + p^\top A A^\top p$, where $b$ is an $n$-vector. This relationship, and the generalization of our MEB algorithm to QP in the simplex, is discussed in §3.3; for more general background on QP in the simplex, and related problems, see for example [Clarkson 2008].
1.1. Related work
Perhaps the most closely related work is that of [Grigoriadis and Khachiyan 1995], who showed how to approximately solve a zero-sum game up to additive precision $\varepsilon$ in time $\tilde O(\varepsilon^{-2}(n+d))$, where the game matrix is $n \times d$. This problem is analogous to ours, and our algorithm is similar in structure to theirs, but where we minimize over $p \in \Delta$ and maximize over $x \in B$, their optimization has not only $p$ but also $x$ in a unit simplex.
Their algorithm (and ours) relies on sampling based on $x$ and $p$, to estimate inner products $x^\top v$ or $p^\top w$ for vectors $v$ and $w$ that are rows or columns of $A$. For a vector $p \in \Delta$, this estimation is easily done by returning $w_i$ with probability $p_i$.
² Strictly speaking, this is true only for $\varepsilon$ equal to the margin, denoted $\sigma$ and defined below. Yet a slight modification of the perceptron gives this running time for any small enough $\varepsilon > 0$.
For vectors $x \in B$, however, the natural estimation technique is to pick $i$ with probability $x_i^2$, and return $v_i/x_i$. The estimator from this $\ell_2$-sample is less well-behaved, since it is unbounded, and can have a high variance. While $\ell_2$-sampling has been used in streaming applications [Monemizadeh and Woodruff 2010], it has not previously found applications in optimization due to this high variance problem.
Indeed, it might seem surprising that sublinearity is at all possible, given that the correct classifier might be determined by very few examples, as shown in Figure 2. It thus seems necessary to go over all examples at least once, instead of looking at noisy estimates based on sampling.
Fig. 2. The optimum $x^\star$ is determined by the vectors near the horizontal axis.
However, as we show, in our setting there is a version of the fundamental Multiplicative Weights (MW) technique that can cope with unbounded updates, and for which the variance of $\ell_2$-sampling is manageable. In our version of MW, the multiplier associated with a value $z$ is quadratic in $z$, in contrast to the more standard multiplier that is exponential in $z$; while the latter is a fundamental building block in approximate optimization algorithms, as discussed in [Plotkin et al. 1991], in our setting such exponential updates can lead to a very expensive $d^{\Omega(1)}$ iterations.
We analyze MW from the perspective of on-line optimization, and show that our version of MW has low expected regret given only that the random updates have the variance bounds provable for $\ell_2$-sampling. We also use another technique from on-line optimization, a gradient descent variant which is better suited for the ball.
For the special case of zero-sum games in which the entries are all non-negative (this is equivalent to packing and covering linear programs), [Koufogiannakis and Young 2007] give a sublinear-time algorithm which returns a relative approximation in time $\tilde O(\varepsilon^{-2}(n+d))$. Our lower bounds show that a similar relative approximation bound for sublinear algorithms is impossible for general classification, and hence general linear programming.
2. LINEAR CLASSIFICATION AND THE PERCEPTRON
Before our algorithm, some reminders and further notation: $\Delta \subseteq \mathbb{R}^n$ is the unit simplex $\{p \in \mathbb{R}^n : p_i \ge 0, \sum_i p_i = 1\}$, $B \subseteq \mathbb{R}^d$ is the Euclidean unit ball, and the unsubscripted $\|x\|$ denotes the Euclidean norm $\|x\|_2$. The $n$-vector all of whose entries are one is denoted by $\mathbf{1}_n$.
The $i$-th row of the input matrix $A$ is denoted $A_i$, although a vector is a column vector unless otherwise indicated. The $i$-th coordinate of vector $v$ is denoted $v(i)$. For a vector $v$, we let $v^2$ denote the vector whose coordinates have $v^2(i) \equiv v(i)^2$ for all $i$.
2.1. The Sublinear Perceptron
Our sublinear perceptron algorithm is given as Algorithm 1. The algorithm maintains a vector $w_t \in \mathbb{R}^n$, with nonnegative coordinates, and also $p_t \in \Delta$, which is $w_t$ scaled to have unit $\ell_1$ norm. A vector $y_t \in \mathbb{R}^d$ is maintained also, and $x_t$, which is $y_t$ scaled to have Euclidean norm no larger than one. These normalizations are done on line 4.
In lines 5 and 6, the algorithm is updating $y_t$ by adding a row of $A$ randomly chosen using $p_t$. This is a randomized version of Online Gradient Descent (OGD); due to the random choice of $i_t$, $A_{i_t}$ is an unbiased estimator of $p_t^\top A$, which is the gradient of $p_t^\top A y$ with respect to $y$.
In lines 7 through 12, the algorithm is updating $w_t$ using a column $j_t$ of $A$ randomly chosen based on $x_t$, and also using the value $x_t(j_t)$. This is a version of the Multiplicative Weights (MW) technique for online optimization in the unit simplex, where $v_t$ is an unbiased estimator of $A x_t$, the gradient of $p^\top A x_t$ with respect to $p$.
Actually, $v_t$ is not unbiased, after the clip operation: for $z, V \in \mathbb{R}$, $\mathrm{clip}(z,V) \equiv \min\{V, \max\{-V, z\}\}$, and our analysis is helped by clipping the entries of $v_t$; we show that the resulting slight bias is not harmful.
As discussed in §1.1, the sampling used to choose $j_t$ (and update $p_t$) is $\ell_2$-sampling, and that for $i_t$ is $\ell_1$-sampling. These techniques, which can be regarded as special cases of an $\ell_p$-sampling technique, for $p \in [1,\infty)$, yield unbiased estimators of vector dot products. It is important for us also that $\ell_2$-sampling has a variance bound here; in particular, for each relevant $i$ and $t$,
$\mathrm{E}[v_t(i)^2] \le \|A_i\|^2 \|x_t\|^2 \le 1.$  (2)
Algorithm 1 Sublinear Perceptron
1: Input: $\varepsilon > 0$, $A \in \mathbb{R}^{n\times d}$ with $A_i \in B$ for $i \in [n]$.
2: Let $T \leftarrow 200^2\varepsilon^{-2}\log n$, $y_1 \leftarrow 0$, $w_1 \leftarrow \mathbf{1}_n$, $\eta \leftarrow \sqrt{\frac{\log n}{T}}$.
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$, $x_t \leftarrow \frac{y_t}{\max\{1, \|y_t\|\}}$.
5:   Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
6:   $y_{t+1} \leftarrow y_t + \frac{1}{\sqrt{2T}} A_{i_t}$
7:   Choose $j_t \in [d]$ by $j_t \leftarrow j$ with probability $x_t(j)^2/\|x_t\|^2$.
8:   for $i \in [n]$ do
9:     $\tilde v_t(i) \leftarrow A_i(j_t)\|x_t\|^2/x_t(j_t)$
10:    $v_t(i) \leftarrow \mathrm{clip}(\tilde v_t(i), 1/\eta)$
11:    $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
12:  end for
13: end for
14: return $\bar x = \frac{1}{T}\sum_t x_t$
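The following NumPy sketch mirrors Algorithm 1 (our transcription, not reference code; the constant in $T$ and the guard for the all-zero iterate are as written here, and the code favors readability over the sublinear access pattern — the inner loop touches one sampled column of $A$, the $O(n+d)$ per-iteration work described by the analysis).

# Sketch of Algorithm 1 (Sublinear Perceptron); our transcription, not reference code.
import numpy as np

def clip(z, v):
    return np.minimum(v, np.maximum(-v, z))

def sublinear_perceptron(A, eps, rng):
    n, d = A.shape
    T = int(np.ceil(200**2 * eps**-2 * np.log(n)))
    eta = np.sqrt(np.log(n) / T)
    y = np.zeros(d)
    w = np.ones(n)
    x_sum = np.zeros(d)
    for _ in range(T):
        p = w / w.sum()
        x = y / max(1.0, np.linalg.norm(y))
        x_sum += x
        i_t = rng.choice(n, p=p)                        # l1-sample a row from p_t
        y = y + A[i_t] / np.sqrt(2 * T)                 # randomized OGD step
        if np.any(x != 0):
            j_t = rng.choice(d, p=x**2 / np.dot(x, x))  # l2-sample a column from x_t
            v = clip(A[:, j_t] * np.dot(x, x) / x[j_t], 1.0 / eta)
        else:
            v = np.zeros(n)                             # before the first OGD step, x_t = 0
        w = w * (1.0 - eta * v + eta**2 * v**2)         # variance-tolerant MW update
    return x_sum / T

On a separable instance with margin $\sigma$ and $\varepsilon < \sigma$, the returned average should satisfy $\min_i A_i \bar x \ge \sigma - \varepsilon$ with constant probability, in the sense of Theorem 2.7 below.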
First we note the running time.
THEOREM 2.1. The sublinear perceptron takes $O(\varepsilon^{-2}\log n)$ iterations, with a total running time of $O(\varepsilon^{-2}(n+d)\log n)$.
PROOF. The algorithm iterates $T = O(\frac{\log n}{\varepsilon^2})$ times. Each iteration requires:
(1) One $\ell_2$ sample per iterate, which takes $O(d)$ time using known data structures.
(2) Sampling $i_t \in_R p_t$, which takes $O(n)$ time.
(3) The update of $x_t$ and $p_t$, which takes $O(n+d)$ time.
The total running time is $O(\varepsilon^{-2}(n+d)\log n)$.
Next we analyze the output quality. The proof uses new tools from regret minimization and sampling that are the building blocks of most of our upper bound results.
Let us first state the MW algorithm used in all our algorithms.
Definition 2.2 (MW algorithm). Consider a sequence of vectors $q_1,\ldots,q_T \in \mathbb{R}^n$. The Multiplicative Weights (MW) algorithm is as follows. Let $w_1 \leftarrow \mathbf{1}_n$, and for $t \ge 1$,
$p_t \leftarrow w_t/\|w_t\|_1,$  (3)
and for $0 < \eta \in \mathbb{R}$, for all $i \in [n]$,
$w_{t+1}(i) \leftarrow w_t(i)(1 - \eta q_t(i) + \eta^2 q_t(i)^2).$  (4)
The following is a key lemma, which proves a novel bound on the regret of the MW algorithm above, suitable for the case where the losses are random variables with bounded variance. As opposed to previous multiplicative-update algorithms, this is the only MW algorithm we are familiar with that does not require an upper bound on the losses/payoffs. The proof is deferred to after the main theorem and its proof.
LEMMA 2.3 (VARIANCE MW LEMMA). The MW algorithm satisfies (recall $v^2$ denotes the vector with $v^2(i) \equiv v(i)^2$)
$\sum_{t\in[T]} p_t^\top q_t \;\le\; \min_{i\in[n]} \sum_{t\in[T]} \max\{q_t(i), -\tfrac{1}{\eta}\} + \frac{\log n}{\eta} + \eta \sum_{t\in[T]} p_t^\top q_t^2.$
The following three lemmas give concentration bounds on our random variables from their expectations. The first two are based on standard martingale analysis, and the last is a simple Markov application.
LEMMA 2.4. For $\eta \le \sqrt{\frac{\log n}{T}}$, with probability at least $1 - O(1/n)$,
$\max_i \sum_{t\in[T]} [v_t(i) - A_i x_t] \le 4\eta T.$
LEMMA 2.5. For $\eta \le \sqrt{\frac{\log n}{T}}$, with probability at least $1 - O(1/n)$, it holds that
$\big|\sum_{t\in[T]} A_{i_t} x_t - \sum_t p_t^\top v_t\big| \le 10\eta T.$
LEMMA 2.6. With probability at least $1 - \frac{1}{4}$, it holds that $\sum_t p_t^\top v_t^2 \le 8T$.
THEOREM 2.7 (MAIN THEOREM). With probability $1/2$, the sublinear perceptron returns a solution $\bar x$ that is an $\varepsilon$-approximation.
PROOF. First we use the regret bounds for lazy gradient descent to lower bound $\sum_{t\in[T]} A_{i_t} x_t$, next we get an upper bound for that quantity using Lemma 2.3, and then we combine the two.
By definition, $A_i x^\star \ge \sigma$ for all $i \in [n]$, and so, using the bound of Lemma A.2,
$\sigma T \le \max_{x\in B} \sum_{t\in[T]} A_{i_t} x \le \sum_{t\in[T]} A_{i_t} x_t + 2\sqrt{2T},$  (5)
or rearranging,
$\sum_{t\in[T]} A_{i_t} x_t \ge \sigma T - 2\sqrt{2T}.$  (6)
Now we turn to the MW part of our algorithm. By the Weak Regret Lemma 2.3, and using the clipping of $v_t(i)$,
$\sum_{t\in[T]} p_t^\top v_t \le \min_{i\in[n]} \sum_{t\in[T]} v_t(i) + (\log n)/\eta + \eta \sum_{t\in[T]} p_t^\top v_t^2.$
By Lemma 2.4 above, with high probability, for any $i \in [n]$,
$\sum_{t\in[T]} A_i x_t \ge \sum_{t\in[T]} v_t(i) - 4\eta T,$
so that with high probability
$\sum_{t\in[T]} p_t^\top v_t \le \min_{i\in[n]} \sum_{t\in[T]} A_i x_t + (\log n)/\eta + \eta \sum_{t\in[T]} p_t^\top v_t^2 + 4\eta T.$  (7)
Combining (6) and (7) we get
$\min_{i\in[n]} \sum_{t\in[T]} A_i x_t \ge -(\log n)/\eta - \eta \sum_{t\in[T]} p_t^\top v_t^2 - 4\eta T + \sigma T - 2\sqrt{2T} - \big|\sum_{t\in[T]} p_t^\top v_t - \sum_{t\in[T]} A_{i_t} x_t\big|.$
By Lemmas 2.5 and 2.6 we have, with probability at least $\frac{3}{4} - O(\frac{1}{n}) \ge \frac{1}{2}$,
$\min_{i\in[n]} \sum_{t\in[T]} A_i x_t \ge -(\log n)/\eta - 8\eta T - 4\eta T + \sigma T - 2\sqrt{2T} - 10\eta T = \sigma T - \frac{\log n}{\eta} - 22\eta T - 2\sqrt{2T}.$
Dividing through by $T$, and using our choice of $\eta$, we have $\min_i A_i \bar x \ge \sigma - \varepsilon/2$ w.p. at least $1/2$, as claimed.
PROOF OF LEMMA 2.3. We first show an upper bound on $\log\|w_{T+1}\|_1$, then a lower bound, and then relate the two.
From (4) and (3) we have
$\|w_{t+1}\|_1 = \sum_{i\in[n]} w_{t+1}(i) = \sum_{i\in[n]} p_t(i)\|w_t\|_1 (1 - \eta q_t(i) + \eta^2 q_t(i)^2) = \|w_t\|_1 (1 - \eta\, p_t^\top q_t + \eta^2 p_t^\top q_t^2).$
This implies by induction on $t$, and using $1 + z \le \exp(z)$ for $z \in \mathbb{R}$, that
$\log\|w_{T+1}\|_1 = \log n + \sum_{t\in[T]} \log(1 - \eta\, p_t^\top q_t + \eta^2 p_t^\top q_t^2) \le \log n - \eta\sum_{t\in[T]} p_t^\top q_t + \eta^2\sum_{t\in[T]} p_t^\top q_t^2.$  (8)
Now for the lower bound. From (4) we have by induction on $t$ that
$w_{T+1}(i) = \prod_{t\in[T]} (1 - \eta q_t(i) + \eta^2 q_t(i)^2),$
and so
$\log\|w_{T+1}\|_1 = \log\Big[\sum_{i\in[n]}\prod_{t\in[T]} (1 - \eta q_t(i) + \eta^2 q_t(i)^2)\Big] \ge \log\Big[\max_{i\in[n]}\prod_{t\in[T]} (1 - \eta q_t(i) + \eta^2 q_t(i)^2)\Big]$
$= \max_{i\in[n]} \sum_{t\in[T]} \log(1 - \eta q_t(i) + \eta^2 q_t(i)^2) \ge \max_{i\in[n]} \sum_{t\in[T]} \min\{-\eta q_t(i), 1\},$
where the last inequality uses the fact that $1 + z + z^2 \ge \exp(\min\{z,1\})$ for all $z \in \mathbb{R}$, applied with $z = -\eta q_t(i)$.
Putting this together with the upper bound (8), we have
$\max_{i\in[n]} \sum_{t\in[T]} \min\{-\eta q_t(i), 1\} \le \log n - \eta\sum_{t\in[T]} p_t^\top q_t + \eta^2\sum_{t\in[T]} p_t^\top q_t^2.$
Changing sides,
$\eta\sum_{t\in[T]} p_t^\top q_t \le -\max_{i\in[n]} \sum_{t\in[T]} \min\{-\eta q_t(i), 1\} + \log n + \eta^2\sum_{t\in[T]} p_t^\top q_t^2 = \min_{i\in[n]} \sum_{t\in[T]} \max\{\eta q_t(i), -1\} + \log n + \eta^2\sum_{t\in[T]} p_t^\top q_t^2,$
and the lemma follows, dividing through by $\eta$.
COROLLARY 2.8 (DUAL SOLUTION). The vector $\bar p \equiv \sum_t e_{i_t}/T$ is, with probability $1/2$, an $O(\varepsilon)$-approximate dual solution.
PROOF. Observing in (5) that the middle expression $\max_{x\in B}\sum_{t\in[T]} A_{i_t} x$ is equal to $T\max_{x\in B} \bar p^\top A x$, we have $T\max_{x\in B} \bar p^\top A x \le \sum_{t\in[T]} A_{i_t} x_t + 2\sqrt{2T}$, or changing sides,
$\sum_{t\in[T]} A_{i_t} x_t \ge T\max_{x\in B} \bar p^\top A x - 2\sqrt{2T}.$
Recall from (7) that with high probability,
$\sum_{t\in[T]} p_t^\top v_t \le \min_{i\in[n]} \sum_{t\in[T]} A_i x_t + (\log n)/\eta + \eta \sum_{t\in[T]} p_t^\top v_t^2 + 4\eta T.$
Following the proof of the main theorem, we combine both inequalities and use Lemmas 2.5 and 2.6, so that with probability at least $\frac{1}{2}$:
$T\max_{x\in B} \bar p^\top A x \le \min_{i\in[n]} \sum_{t\in[T]} A_i x_t + (\log n)/\eta + \eta \sum_{t\in[T]} p_t^\top v_t^2 + O(\eta T) + 2\sqrt{2T} + \big|\sum_{t\in[T]} p_t^\top v_t - \sum_{t\in[T]} A_{i_t} x_t\big| \le \sigma T + O(\sqrt{T\log n}).$
Dividing through by $T$ we have, with probability at least $\frac{1}{2}$, that $\max_{x\in B} \bar p^\top A x \le \sigma + O(\varepsilon)$ for our choice of $T$ and $\eta$.
2.2. High Success Probability and Las Vegas
Given two vectors $u, v \in B$, we have seen that a single $\ell_2$-sample is an unbiased estimator of their inner product with variance at most one. Averaging $\frac{1}{\varepsilon^2}$ such samples reduces the variance to $\varepsilon^2$, which reduces the standard deviation to $\varepsilon$. Repeating $O(\log\frac{1}{\delta})$ such estimates, and taking the median, gives an estimator denoted $X_{\varepsilon,\delta}$, which satisfies, via a Chernoff bound:
$\Pr[|X_{\varepsilon,\delta} - v^\top u| > \varepsilon] \le \delta.$
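A minimal sketch (ours, not from the paper) of this median-of-means estimator, reusing the l2_sample_estimate sketch from §1.1; the repetition count 4·log(1/δ) is an illustrative choice, not a tuned constant:

# Sketch (ours): median-of-means estimator X_{eps,delta} for <u, v> with u, v in B.
import numpy as np

def dot_estimate(u, v, eps, delta, rng):
    reps = int(np.ceil(4 * np.log(1.0 / delta)))     # O(log(1/delta)) repetitions
    block = int(np.ceil(1.0 / eps**2))               # 1/eps^2 samples per average
    means = [np.mean([l2_sample_estimate(u, v, rng) for _ in range(block)])
             for _ in range(reps)]
    return np.median(means)                          # wrong by more than eps w.p. <= delta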
As an immediate corollary of this fact we obtain:
COROLLARY 2.9. There exists a randomized algorithm that, with probability $1-\delta$, successfully determines whether a given hyperplane with normal vector $x \in B$, together with an instance of linear classification and parameter $\sigma > 0$, is an $\varepsilon$-approximate solution. The algorithm runs in time $O(d + \frac{n}{\varepsilon^2}\log\frac{n}{\delta})$.
PROOF. Let $\delta' = \delta/n$. Generate the random variable $X_{\varepsilon,\delta'}$ for each inner-product pair $\langle x, A_i\rangle$, and return true if and only if $X_{\varepsilon,\delta'} \ge \sigma - \varepsilon$ for each pair. By the observation above, and taking a union bound over all $n$ inner products, with probability $1-\delta$ the estimate $X_{\varepsilon,\delta'}$ was $\varepsilon$-accurate for all inner-product pairs, and hence the algorithm returned a correct answer.
The running time includes preprocessing of $x$ in $O(d)$ time, and $n$ inner-product estimates, for a total of $O(d + \frac{n}{\varepsilon^2}\log\frac{n}{\delta})$.
Hence, we can amplify the success probability of Algorithm 1 to $1-\delta$ for any $\delta > 0$, albeit incurring additional poly-log factors in running time:
COROLLARY 2.10 (HIGH PROBABILITY). There exists a randomized algorithm that with probability $1-\delta$ returns an $\varepsilon$-approximate solution to the linear classification problem, and runs in expected time $O(\frac{n+d}{\varepsilon^2}\log\frac{n}{\delta})$.
PROOF. Run Algorithm 1 $\log_2\frac{1}{\delta}$ times to generate that many candidate solutions. By Theorem 2.7, at least one candidate solution is an $\varepsilon$-approximate solution with probability at least $1 - 2^{-\log_2\frac{1}{\delta}} = 1 - \delta$.
For each candidate solution apply the verification procedure above with success probability $1 - \delta^2 \ge 1 - \frac{\delta}{\log\frac{1}{\delta}}$, and all verifications will be correct, again with probability at least $1-\delta$. Hence, both events hold with probability at least $1-2\delta$. The result follows after adjusting constants.
The worst-case running time comes to $O(\frac{n+d}{\varepsilon^2}\log\frac{n}{\delta}\log\frac{1}{\delta})$. However, we can generate the candidate solutions and verify them one at a time, rather than all at once. The expected number of candidates we need to generate is constant.
It is also possible to obtain an algorithm that never errs:
COROLLARY 2.11 (LAS VEGAS VERSION). After $O(\varepsilon^{-2}\log n)$ iterations, the sublinear perceptron returns a solution that with probability $1/2$ can be verified in $O(M)$ time to be $\varepsilon$-approximate. Thus with expected $O(1)$ repetitions, and a total of expected $O(M + \varepsilon^{-2}(n+d)\log n)$ work, a verified $\varepsilon$-approximate solution can be found.
PROOF. We have $\|\bar p^\top A\| = \max_{x\in B} \bar p^\top A x \ge \sigma$, and so if
$\min_i A_i \bar x \ge \|\bar p^\top A\| - \varepsilon,$  (9)
then $\bar x$ is an $\varepsilon$-approximate solution; moreover, $\bar x$ will pass this test if it and $\bar p$ are $(\varepsilon/2)$-approximate solutions.
Thus, running the algorithm for a constant factor more iterations, so that with probability $1/2$, $\bar x$ and $\bar p$ are both $(\varepsilon/2)$-approximate solutions, it can be verified that both are $\varepsilon$-approximate solutions.
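The exact check in (9) needs only one pass over the nonzero entries of $A$; a small sketch (ours), with dense NumPy arrays standing in for the sparse data:

# Sketch (ours): exact O(M) verification of test (9) for a candidate primal/dual pair.
import numpy as np

def verify_eps_approx(A, x_bar, p_bar, eps):
    primal = np.min(A @ x_bar)             # min_i A_i x_bar
    dual = np.linalg.norm(A.T @ p_bar)     # || p_bar^T A ||
    return primal >= dual - eps            # if True, x_bar is eps-approximate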
2.3. Further Optimizations
The regret of OGD as given in Lemma A.2 is smaller than that of the dual strategy of random MW. We can take advantage of this and improve the running time slightly, by replacing line 6 of the sublinear algorithm with the line shown below.
[6'] With probability $\frac{1}{\log T}$, let $y_{t+1} \leftarrow y_t + \frac{1}{2\sqrt{T}} A_{i_t}$ (else do nothing).
This has the effect of increasing the regret of the primal online algorithm by a $\log n$ factor, which does not hurt the number of iterations required to converge, since the overall regret is dominated by that of the MW algorithm.
Since the primal solution $x_t$ is not updated in every iteration, we improve the running time slightly to
$O(\varepsilon^{-2}\log n \,(n + d/(\log(1/\varepsilon) + \log\log n))).$
We use this technique to greater effect for the MEB problem, where it is discussed in more detail.
2.4. Implications in the PAC model
Consider the "separable" case of hyperplane learning, in which there exists a hyperplane classifying all data points correctly. It is well known that the concept class of hyperplanes in $d$ dimensions with margin $\sigma$ has effective dimension at most $\min\{d, \frac{1}{\sigma^2}\} + 1$.
Consider the case in which the margin is significant, i.e. $\frac{1}{\sigma^2} < d$. PAC learning theory implies that the number of examples needed to attain generalization error $\delta$ is $O(\frac{1}{\sigma^2\delta})$.
Using the method of online-to-batch conversion (see [Cesa-Bianchi et al. 2004]), and applying the online gradient descent algorithm, it is possible to obtain $\delta$ generalization error in $O(\frac{d}{\sigma^2\delta})$ time, by going over the data once and performing a gradient step on each example.
Our algorithm improves upon this running time bound as follows: we use the sublinear perceptron to compute a $\sigma/2$-approximation to the best hyperplane over the training data, where the number of examples is taken to be $n = O(\frac{1}{\sigma^2\delta})$ (in order to obtain $\delta$ generalization error). As shown previously, the total running time amounts to $\tilde O(\sigma^{-2}(n + d)) = \tilde O(\frac{1}{\sigma^4\delta} + \frac{d}{\sigma^2})$.
This improves upon standard methods by a factor of $\tilde O(\sigma^2 d)$, which is always an improvement by our initial assumption on $\sigma$ and $d$.
3. STRONGLY CONVEX PROBLEMS: MEB AND SVM
3.1. Minimum Enclosing Ball
In the Minimum Enclosing Ball problem the input consists of a matrix $A \in \mathbb{R}^{n\times d}$. The rows are interpreted as vectors and the problem is to find a vector $x \in \mathbb{R}^d$ such that
$x^\star \equiv \arg\min_{x\in\mathbb{R}^d} \max_{i\in[n]} \|x - A_i\|^2.$
We further assume for this problem that all vectors $A_i$ have Euclidean norm at most one. Denote by $\gamma = \max_{i\in[n]} \|x^\star - A_i\|^2$ the (squared) radius of the optimal ball, and we say that a solution is $\varepsilon$-approximate if the ball it generates has radius at most $\gamma + \varepsilon$.
As in the case of linear classification, to obtain tight running time bounds we use a primal-dual approach; the algorithm is given as Algorithm 2.
(This is a "conceptual" version of the algorithm: in the analysis of the running time, we use the fact that we can batch together the updates for $w_t$ over the iterations for which $x_t$ does not change.)
Algorithm 2 Sublinear Primal-Dual MEB
1: Input: $\varepsilon > 0$, $A \in \mathbb{R}^{n\times d}$ with $A_i \in B$ for $i \in [n]$ and $\|A_i\|$ known.
2: Let $T \leftarrow \Theta(\varepsilon^{-2}\log n)$, $y_1 \leftarrow 0$, $w_1 \leftarrow \mathbf{1}_n$, $\eta \leftarrow \sqrt{(\log n)/T}$, $\theta \leftarrow \frac{\log T}{\sqrt{T\log n}}$.
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$
5:   Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
6:   With probability $\theta$, update $y_{t+1} \leftarrow y_t + A_{i_t}$, $x_{t+1} \leftarrow \frac{y_{t+1}}{t}$ (else do nothing).
7:   Choose $j_t \in [d]$ by $j_t \leftarrow j$ with probability $x_t(j)^2/\|x_t\|^2$.
8:   for $i \in [n]$ do
9:     $\tilde v_t(i) \leftarrow -2A_i(j_t)\|x_t\|^2/x_t(j_t) + \|A_i\|^2 + \|x_t\|^2$.
10:    $v_t(i) \leftarrow \mathrm{clip}(\tilde v_t(i), \frac{1}{\eta})$.
11:    $w_{t+1}(i) \leftarrow w_t(i)(1 + \eta v_t(i) + \eta^2 v_t(i)^2)$.
12:  end for
13: end for
14: return $\bar x = \frac{1}{T}\sum_t x_t$
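A compact sketch of this algorithm follows (our transcription, not reference code). Two deviations from the listing above are deliberate and should be read as assumptions: the constant in $T$ is illustrative, and the iterate is kept as a plain running mean of the sampled rows (dividing by the number of updates performed) rather than by the iteration counter $t$.

# Sketch of Algorithm 2 (Sublinear Primal-Dual MEB); our transcription.
import numpy as np

def sublinear_meb(A, eps, rng):
    n, d = A.shape
    T = int(np.ceil(eps**-2 * np.log(n)))
    eta = np.sqrt(np.log(n) / T)
    update_prob = np.log(T) / np.sqrt(T * np.log(n))   # lazy OGD update probability
    sq_norms = (A * A).sum(axis=1)                      # ||A_i||^2, assumed known
    y, x, x_sum = np.zeros(d), np.zeros(d), np.zeros(d)
    w = np.ones(n)
    updates = 0
    for _ in range(T):
        p = w / w.sum()
        i_t = rng.choice(n, p=p)
        if rng.random() < update_prob:                  # x_t changes only rarely
            y = y + A[i_t]
            updates += 1
            x = y / updates                             # running mean of sampled rows
        x_sum += x
        xx = np.dot(x, x)
        if xx > 0:
            j_t = rng.choice(d, p=x**2 / xx)
            v_t = -2 * A[:, j_t] * xx / x[j_t] + sq_norms + xx   # estimates ||x_t - A_i||^2
        else:
            v_t = sq_norms.copy()
        v_t = np.clip(v_t, -1.0 / eta, 1.0 / eta)
        w = w * (1.0 + eta * v_t + eta**2 * v_t**2)     # MW raises the weight of far points
    return x_sum / T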
THEOREM 3.1. Algorithm 2 runs in $O(\frac{\log n}{\varepsilon^2})$ iterations, with a total expected running time of
$\tilde O\Big(\frac{n}{\varepsilon^2} + \frac{d}{\varepsilon}\Big),$
and with probability $1/2$, returns an $\varepsilon$-approximate solution.
PROOF. Except for the running time analysis, the proof of this theorem is very similar to that of Theorem 2.7, where we take advantage of a tighter regret bound for strictly convex loss functions in the case of MEB, for which the OGD algorithm with a learning rate of $\frac{1}{t}$ is known to obtain a tighter regret bound of $O(\log T)$ instead of $O(\sqrt{T})$. For presentation, we use asymptotic notation rather than computing the exact constants (as done for the linear classification problem).
Let $f_t(x) = \|x - A_{i_t}\|^2$. Notice that $\arg\min_{x\in B}\sum_{\tau=1}^t f_\tau(x) = \frac{1}{t}\sum_{\tau=1}^t A_{i_\tau}$. By Lemma A.5, applied with $f_t(x) = \|x - A_{i_t}\|^2$, with $G \le 2$ and $H = 2$, and $x^\star$ being the solution to the instance, we have
$\mathrm{E}_{\{c_t\}}\Big[\sum_t \|x_t - A_{i_t}\|^2\Big] \le \mathrm{E}_{\{c_t\}}\Big[\sum_t \|x^\star - A_{i_t}\|^2\Big] + \frac{4}{\theta}\log T \le \gamma T + \frac{4}{\theta}\log T,$  (10)
where $\gamma$ is the squared MEB radius. Here the expectation is taken only over the random coin tosses for updating $x_t$, denoted $c_t$, and holds for any outcome of the indices $i_t$ sampled from $p_t$ and the coordinates $j_t$ used for the $\ell_2$ sampling.
Now we turn to the MW part of our algorithm. By the Weak Regret Lemma 2.3, using the clipping of $v_t(i)$, and reversing inequalities to account for the change of sign, we have
$\sum_{t\in[T]} p_t^\top v_t \ge \max_{i\in[n]} \sum_{t\in[T]} v_t(i) - O\Big(\frac{\log n}{\eta} + \eta\sum_{t\in[T]} p_t^\top v_t^2\Big).$
Using Lemmas B.4 and B.5, with high probability
$\forall i \in [n]: \quad \sum_{t\in[T]} v_t(i) \ge \sum_{t\in[T]} \|A_i - x_t\|^2 - O(\eta T),$
$\Big|\sum_{t\in[T]} \|x_t - A_{i_t}\|^2 - \sum_t p_t^\top v_t\Big| = O(\eta T).$
Plugging these two facts into the previous inequality, we have w.h.p.
$\sum_{t\in[T]} \|x_t - A_{i_t}\|^2 \ge \max_{i\in[n]} \sum_{t\in[T]} \|A_i - x_t\|^2 - O\Big(\frac{\log n}{\eta} + \eta\sum_{t\in[T]} p_t^\top v_t^2 + \eta T\Big).$
This holds w.h.p. over the random choices of $\{i_t, j_t\}$, and irrespective of the coin tosses $\{c_t\}$. Hence, we can take expectations w.r.t. $\{c_t\}$, and obtain
$\mathrm{E}_{\{c_t\}}\Big[\sum_{t\in[T]} \|x_t - A_{i_t}\|^2\Big] \ge \mathrm{E}_{\{c_t\}}\Big[\max_{i\in[n]}\sum_{t\in[T]} \|A_i - x_t\|^2\Big] - O\Big(\frac{\log n}{\eta} + \eta\sum_{t\in[T]} p_t^\top v_t^2 + \eta T\Big).$  (11)
Combining with equation (10), we obtain that, w.h.p. over the random variables $\{i_t, j_t\}$,
$\gamma T + \frac{4}{\theta}\log T \ge \mathrm{E}_{\{c_t\}}\Big[\max_{i\in[n]}\sum_{t\in[T]} \|x_t - A_i\|^2\Big] - O\Big(\frac{\log n}{\eta} + \eta\sum_{t\in[T]} p_t^\top v_t^2 + \eta T\Big).$
Rearranging and using Lemma B.8, we have, w.p. at least $\frac{1}{2}$,
$\mathrm{E}_{\{c_t\}}\Big[\max_{i\in[n]}\sum_{t\in[T]} \|x_t - A_i\|^2\Big] \le O\Big(\gamma T + \frac{\log T}{\theta} + \frac{\log n}{\eta} + \eta T\Big).$
Dividing through by $T$ and applying Jensen's inequality, we have
$\mathrm{E}\big[\max_j \|\bar x - A_j\|^2\big] \le \frac{1}{T}\mathrm{E}\Big[\max_{i\in[n]}\sum_{t\in[T]} \|x_t - A_i\|^2\Big] \le O\Big(\gamma + \frac{\log T}{\theta T} + \frac{\log n}{\eta T} + \eta\Big).$
Optimizing over the values of $\eta$, $\theta$, and $T$, this implies that the expected error is $O(\varepsilon)$, and so using Markov's inequality, $\bar x$ is an $O(\varepsilon)$-approximate solution with probability at least $1/2$.
Running time. The algorithm above consists of $T = O(\frac{\log n}{\varepsilon^2})$ iterations. Naively, this would result in the same running time as for linear classification. Yet notice that $x_t$ changes only an expected $\theta T$ times, and only then do we perform an $O(d)$ operation. The expected number of iterations in which $x_t$ changes is $\theta T \le 16\varepsilon^{-1}\log T$, and so the running time is
$O(\varepsilon^{-1}(\log T)\, d + \frac{\log n}{\varepsilon^2}\, n) = \tilde O(\varepsilon^{-2} n + \varepsilon^{-1} d).$
The following corollary is a direct analogue of Corollary 2.8.
COROLLARY 3.2 (DUAL SOLUTION). The vector $\bar p \equiv \sum_t e_{i_t}/T$ is, with probability $1/2$, an $O(\varepsilon)$-approximate dual solution.
3.2. High Success Probability and Las Vegas
As for linear classification, we can amplify the success probability of Algorithm 2 to $1-\delta$ for any $\delta > 0$, albeit incurring additional poly-log factors in running time.
COROLLARY 3.3 (MEB HIGH PROBABILITY). There exists a randomized algorithm that with probability $1-\delta$ returns an $\varepsilon$-approximate solution to the MEB problem, and runs in expected time $\tilde O(\frac{n}{\varepsilon^2}\log\frac{n}{\varepsilon} + \frac{d}{\varepsilon}\log\frac{1}{\varepsilon})$. There is also a randomized algorithm that returns an $\varepsilon$-approximate solution in $\tilde O(M + \frac{n}{\varepsilon^2} + \frac{d}{\varepsilon})$ time.
PROOF. We can estimate the distance between two points in $B$ in $O(\varepsilon^{-2}\log(1/\delta))$ time, with error at most $\varepsilon$ and failure probability at most $\delta$, using the dot-product estimator described in §2.2. Therefore we can estimate the maximum distance of a given point to every input point in $O(n\varepsilon^{-2}\log(n/\delta))$ time, with error at most $\varepsilon$ and failure probability at most $\delta$. This distance is at least $\gamma - \varepsilon$, where $\gamma$ is the optimal radius attainable, w.p. $1-\delta$.
Because Algorithm 2 yields an $\varepsilon$-approximate dual solution with probability $1/2$, we can use this solution to verify that the radius of any possible solution to the farthest point is at least $\gamma - \varepsilon$.
So, to obtain a solution as described in the lemma statement, run Algorithm 2, and verify that it yields an $\varepsilon$-approximation, using this approximate dual solution; with probability $1/2$, this gives a verified $\varepsilon$-approximation. Keep trying until this succeeds, in an expected 2 trials.
For a Las Vegas algorithm, we simply apply the same scheme, but verify the distances exactly.
3.3. Convex Quadratic Programming in the Simplex
We can extend our approach to problems of the form
$\min_{p\in\Delta} p^\top b + p^\top A A^\top p,$  (12)
where $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n\times d}$, and $\Delta$ is, as usual, the unit simplex in $\mathbb{R}^n$. As is well known, and as we partially review below, this problem includes the MEB problem, margin estimation as for hard margin support vector machines, the $L_2$-SVM variant of support vector machines, the problem of finding the shortest vector in a polytope, and others.
Applying $\|v - x\|^2 = v^\top v + x^\top x - 2v^\top x \ge 0$ with $v \leftarrow A^\top p$, we have
$\max_{x\in\mathbb{R}^d} 2p^\top A x - \|x\|^2 = p^\top A A^\top p,$  (13)
with equality at $x = A^\top p$. Thus (12) can be written as
$\min_{p\in\Delta}\max_{x\in\mathbb{R}^d} p^\top(b + 2Ax - \mathbf{1}_n\|x\|^2).$  (14)
The Wolfe dual of this problem exchanges the max and min:
$\max_{x\in\mathbb{R}^d}\min_{p\in\Delta} p^\top(b + 2Ax - \mathbf{1}_n\|x\|^2).$  (15)
Since
$\min_{p\in\Delta} p^\top(b + 2Ax - \mathbf{1}_n\|x\|^2) = \min_i b(i) + 2A_i x - \|x\|^2,$  (16)
with equality when $p_{\hat i} = 0$ if $\hat i$ is not a minimizer, the dual can also be expressed as
$\max_{x\in\mathbb{R}^d}\min_i b(i) + 2A_i x - \|x\|^2.$  (17)
By the two relations (13) and (16) used to derive the dual problem from the primal, we have immediately the weak duality condition that the objective function value of the dual (17) is always no more than the objective function value of the primal (12). The strong duality condition, that the two problems take the same optimal value, also holds here; indeed, the optimum $x^\star$ also solves (13), and the optimal $p^\star$ also solves (16).
To generalize Algorithm 2, we make $v_t$ an unbiased estimator of $b + 2Ax_t - \mathbf{1}_n\|x_t\|^2$, and set $x_{t+1}$ to be the maximizer over $x$ of
$\sum_{\tau\in[t]} b(i_\tau) + 2A_{i_\tau} x - \|x\|^2;$
namely, as with MEB, $y_{t+1} \leftarrow \sum_{\tau\in[t]} A_{i_\tau}$, and $x_{t+1} \leftarrow y_{t+1}/t$. (We also make some sign changes to account for the max-min formulation here, versus the min-max formulation used for MEB above.) This allows the use of Lemma A.4 for essentially the same analysis as for MEB; the gradient bound $G$ and Hessian bound $H$ are both at most 2, again assuming that all $A_i \in B$.
MEB. When $b(i) \equiv -\|A_i\|^2$, we have
$\max_{x\in\mathbb{R}^d}\min_i b(i) + 2A_i x - \|x\|^2 = -\min_{x\in\mathbb{R}^d}\max_i \|A_i\|^2 - 2A_i x + \|x\|^2 = -\min_{x\in\mathbb{R}^d}\max_i \|x - A_i\|^2,$
the (negated) objective function for the MEB problem.
Margin Estimation. When $b \equiv 0$ in the primal problem (12), that problem is one of finding the shortest vector in the polytope $\{A^\top p \mid p \in \Delta\}$. Considering this case of the dual problem (17), for any given $x \in \mathbb{R}^d$ with $\min_i A_i x \le 0$, the value of $\lambda \in \mathbb{R}$ such that $\lambda x$ maximizes $\min_i 2A_i(\lambda x) - \|\lambda x\|^2$ is $\lambda = 0$. On the other hand, if $x$ is such that $\min_i A_i x > 0$, the maximizing value $\lambda$ is $\lambda = \min_i A_i x/\|x\|^2$, so that the solution of (17) also maximizes $\min_i (A_i x)^2/\|x\|^2$. The latter is the square of the margin $\sigma$, which as before is the minimum distance of the points $A_i$ to the hyperplane that is normal to $x$ and passes through the origin.
Adapting Algorithm 2 for margin estimation, and with the slight changes needed for its analysis, we have that there is an algorithm taking $\tilde O(n/\varepsilon^2 + d/\varepsilon)$ time that finds $x \in \mathbb{R}^d$ such that, for all $i \in [n]$,
$2A_i x - \|x\|^2 \ge \sigma^2 - \varepsilon.$
When $\sigma^2 \le \varepsilon$, we don't appear to gain any useful information. However, when $\sigma^2 > \varepsilon$, we have $\min_{i\in[n]} A_i x > 0$, and so, by appropriate scaling of $x$, we have $\hat x$ such that
$\hat\sigma^2 = \min_{i\in[n]} (A_i \hat x)^2/\|\hat x\|^2 = \min_{i\in[n]} 2A_i \hat x - \|\hat x\|^2 \ge \sigma^2 - \varepsilon,$
and so $\hat\sigma \ge \sigma - \varepsilon/\sigma$. That is, letting $\varepsilon \equiv \sigma\varepsilon'$, if $\varepsilon' \le \sigma$, there is an algorithm taking $\tilde O(n/(\sigma\varepsilon')^2 + d/(\sigma\varepsilon'))$ time that finds a solution $\hat x$ with $\hat\sigma \ge \sigma - \varepsilon'$.
4. A GENERIC SUBLINEAR PRIMAL-DUAL ALGORITHM
We note that our technique above can be applied more broadly to any constrained optimization problem for which low-regret algorithms exist and low-variance sampling can be applied efficiently; that is, consider the general problem with optimum $\sigma$:
$\max_{x\in K}\min_i c_i(x) = \sigma.$  (18)
Suppose that for the set $K$ and cost functions $c_i(x)$, there exists an iterative low-regret algorithm, denoted LRA, with regret $R(T) = o(T)$. Let $T_\varepsilon(\mathrm{LRA})$ be the smallest $T$ such that $\frac{R(T)}{T} \le \varepsilon$. We denote by $x_{t+1} \leftarrow \mathrm{LRA}(x_t, c)$ an invocation of this algorithm, when at state $x_t \in K$ and the cost function $c$ is observed.
Let $\mathrm{Sample}(x, c)$ be a procedure that returns an unbiased estimate of $c(x)$ with variance at most one, that runs in constant time. Further assume $|c_i(x)| \le 1$ for all $x \in K$, $i \in [n]$.
Algorithm 3 Generic Sublinear Primal-Dual Algorithm
1: Let $T \leftarrow \max\{T_\varepsilon(\mathrm{LRA}), \frac{\log n}{\varepsilon^2}\}$, $x_1 \leftarrow \mathrm{LRA}(\text{initial})$, $w_1 \leftarrow \mathbf{1}_n$, $\eta \leftarrow \frac{1}{100}\sqrt{\frac{\log n}{T}}$.
2: for $t = 1$ to $T$ do
3:   for $i \in [n]$ do
4:     Let $\tilde v_t(i) \leftarrow \mathrm{Sample}(x_t, c_i)$
5:     $v_t(i) \leftarrow \mathrm{clip}(\tilde v_t(i), 1/\eta)$
6:     $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
7:   end for
8:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$
9:   Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
10:  $x_t \leftarrow \mathrm{LRA}(x_{t-1}, c_{i_t})$
11: end for
12: return $\bar x = \frac{1}{T}\sum_t x_t$
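A sketch (ours) of the generic loop, with the low-regret algorithm and the sampling procedure supplied as callables; any concrete instantiation (OGD over the ball, MW over the simplex, an $\ell_2$-sampling Sample, and so on) can be plugged in. The ordering of the dual update relative to the primal step is a simplification of the listing above.

# Sketch (ours) of Algorithm 3: lra_step and sample are supplied by the caller.
import numpy as np

def generic_primal_dual(n, T, lra_init, lra_step, sample, rng):
    """lra_init() -> x_1 (NumPy array); lra_step(x, i) -> next iterate for cost c_i;
       sample(x, i) -> unbiased estimate of c_i(x) with variance <= 1."""
    eta = 0.01 * np.sqrt(np.log(n) / T)
    x = lra_init()
    w = np.ones(n)
    x_sum = np.zeros_like(x)
    for _ in range(T):
        p = w / w.sum()                                  # dual distribution p_t
        v = np.array([sample(x, i) for i in range(n)])   # noisy losses c_i(x_t)
        v = np.clip(v, -1.0 / eta, 1.0 / eta)
        w = w * (1.0 - eta * v + eta**2 * v**2)          # variance-tolerant MW update
        i_t = rng.choice(n, p=p)
        x = lra_step(x, i_t)                             # primal low-regret step on c_{i_t}
        x_sum += x
    return x_sum / T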
Applying the techniques of Section 2 we can obtain the following generic lemma.
LEMMA 4.1. The generic sublinear primal-dual algorithm returns a solution $\bar x$ that with probability at least $\frac{1}{2}$ is an $\varepsilon$-approximate solution in $\max\{T_\varepsilon(\mathrm{LRA}), \frac{\log n}{\varepsilon^2}\}$ iterations.
PROOF. First we use the regret bounds for LRA to lower bound $\sum_{t\in[T]} c_{i_t}(x_t)$, next we get an upper bound for that quantity using the Weak Regret Lemma, and then we combine the two in expectation.
By definition, $c_i(x^\star) \ge \sigma$ for all $i \in [n]$, and so, using the LRA regret guarantee,
$\sigma T \le \max_{x\in K}\sum_{t\in[T]} c_{i_t}(x) \le \sum_{t\in[T]} c_{i_t}(x_t) + R(T),$  (19)
or rearranging,
$\sum_{t\in[T]} c_{i_t}(x_t) \ge \sigma T - R(T).$  (20)
Now we turn to the MW part of our algorithm. By the Weak Regret Lemma 2.3, and using the clipping of $v_t(i)$,
$\sum_{t\in[T]} p_t^\top v_t \le \min_{i\in[n]}\sum_{t\in[T]} v_t(i) + (\log n)/\eta + \eta\sum_{t\in[T]} p_t^\top v_t^2.$
Using Lemma B.4 and Lemma B.5, since the procedure Sample is unbiased and has variance at most one, with high probability:
$\forall i \in [n], \quad \sum_{t\in[T]} v_t(i) \le \sum_{t\in[T]} c_i(x_t) + O(\eta T),$
$\Big|\sum_{t\in[T]} c_{i_t}(x_t) - \sum_t p_t^\top v_t\Big| = O(\eta T).$
Plugging these two facts into the previous inequality, we have w.h.p.
$\sum_{t\in[T]} c_{i_t}(x_t) \le \min_{i\in[n]}\sum_{t\in[T]} c_i(x_t) + O\Big(\frac{\log n}{\eta} + \eta\sum_{t\in[T]} p_t^\top v_t^2 + \eta T\Big).$  (21)
Combining (20) and (21) we get w.h.p.
$\min_{i\in[n]}\sum_{t\in[T]} c_i(x_t) \ge \sigma T - O\Big(\frac{\log n}{\eta} + \eta T + \eta\sum_{t\in[T]} p_t^\top v_t^2\Big) - R(T).$
And via Lemma B.8 we have, w.p. at least $\frac{1}{2}$, that
$\min_{i\in[n]}\sum_{t\in[T]} c_i(x_t) \ge \sigma T - O\Big(\frac{\log n}{\eta} + \eta T\Big) - R(T).$
Dividing through by $T$, and using our choice of $\eta$, we have $\min_i c_i(\bar x) \ge \sigma - \varepsilon/2$ w.p. at least $1/2$, as claimed.
High-probability results can be obtained using the same technique as for linear classification.
4.1. More applications
The generic algorithm above can be used to derive the result of [Grigoriadis and Khachiyan 1995] on sublinear approximation of zero-sum games with payoffs/losses bounded by one (up to poly-logarithmic factors in running time). A zero-sum game can be cast as the following min-max optimization problem:
$\min_{x\in\Delta_d}\max_{i\in[n]} A_i x.$
That is, the constraints are inner products with the rows of the game matrix. This is exactly the same as the linear classification problem, but the vectors $x$ are taken from the convex set $K$ which is the simplex — the set of all mixed strategies of the column player.
A low-regret algorithm for the simplex is the multiplicative weights algorithm, which attains regret $R(T) \le 2\sqrt{T\log n}$. The procedure $\mathrm{Sample}(x, A_i)$ to estimate the inner product $A_i x$ is much simpler than the one used for linear classification: we sample from the distribution $x$ and return $A_i(j)$ w.p. $x(j)$. This has correct expectation and variance bounded by one (in fact, the random variable is always bounded by one). Lemma 4.1 then implies:
COROLLARY 4.2. The sublinear primal-dual algorithm applied to zero-sum games returns a solution $\bar x$ that with probability at least $\frac{1}{2}$ is an $\varepsilon$-approximate solution in $O(\frac{\log n}{\varepsilon^2})$ iterations and total time $\tilde O(\frac{n+d}{\varepsilon^2})$.
Essentially any constrained optimization problem which has convex or linear constraints, and is over a simple convex body such as the ball or simplex, can be approximated in sublinear time using our method.
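For the zero-sum-game instantiation, the Sample procedure is just $\ell_1$-sampling from the column player's mixed strategy; a short sketch (ours):

# Sketch (ours): Sample(x, A_i) for zero-sum games, where x is a distribution over columns.
import numpy as np

def sample_payoff(x, A_row, rng):
    j = rng.choice(len(x), p=x)      # draw a column from the mixed strategy x
    return A_row[j]                  # unbiased for A_i x, and bounded by one in magnitude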
5. KERNELIZING THE SUBLINEAR ALGORITHMS
An important generalization of linear classifiers is that of kernel-based linear predictors (see e.g. [Schölkopf and Smola 2003]). Let $\psi: \mathbb{R}^d \to \mathcal{H}$ be a mapping of feature vectors into a reproducing kernel Hilbert space. In this setting, we seek a non-linear classifier given by $h \in \mathcal{H}$ so as to maximize the margin:
$\sigma \equiv \max_{h\in\mathcal{H}}\min_{i\in[n]} \langle h, \psi(A_i)\rangle.$
The kernels of interest are those for which we can compute inner products of the form $k(x,y) = \langle\psi(x), \psi(y)\rangle$ efficiently.
One popular kernel is the polynomial kernel, for which the corresponding Hilbert space is the set of polynomials over $\mathbb{R}^d$ of degree $q$. The mapping for this kernel is given by
$\forall S \subseteq [d], |S| \le q: \quad \psi(x)_S = \prod_{i\in S} x_i.$
That is, all monomials of degree at most $q$. The kernel function in this case is given by $k(x,y) = (x^\top y)^q$. Another useful kernel is the Gaussian kernel $k(x,y) = \exp(-\frac{\|x-y\|^2}{2s^2})$, where $s$ is a parameter. The mapping here is defined by the kernel function (see [Schölkopf and Smola 2003] for more details).
The kernel version of Algorithm 1 is shown as Algorithm 4. Note that $x_t$ and $y_t$ are members of $\mathcal{H}$, and not maintained explicitly, but rather are implicitly represented by the indices $i_\tau$. (And thus $\|y_t\|$ is the norm of $\mathcal{H}$, not $\mathbb{R}^d$.) Also, $\psi(A_i)$ is not computed. The needed kernel product $\langle x_t, \psi(A_i)\rangle$ is estimated by the procedure Kernel-L2-Sampling, using the implicit representations and specific properties of the kernel being used. In the regular sublinear algorithm, this inner product could be sufficiently well approximated in $O(1)$ time via $\ell_2$-sampling. As we show below, for many interesting kernels the time for Kernel-L2-Sampling is not much longer.
For the analog of Theorem 2.7 to apply, we need the expectation of the estimates $v_t(i)$ to be correct, with variance $O(1)$. By Lemma C.1, it is enough if the estimates $v_t(i)$ have an additive bias of $O(\varepsilon)$. Hence, we define the procedure Kernel-L2-Sampling to obtain such a not-too-biased estimator with variance at most one; first we show how to implement Kernel-L2-Sampling, assuming that there is an estimator $\tilde k(\cdot,\cdot)$ of the kernel $k(\cdot,\cdot)$ such that $\mathrm{E}[\tilde k(x,y)] = k(x,y)$ and $\mathrm{Var}(\tilde k(x,y)) \le 1$, and then we show how to implement such kernel estimators.
Algorithm 4 Sublinear Kernel Perceptron
1: Input: $\varepsilon > 0$, $A \in \mathbb{R}^{n\times d}$ with $A_i \in B$ for $i \in [n]$.
2: Let $T \leftarrow 200^2\varepsilon^{-2}\log n$, $y_1 \leftarrow 0$, $w_1 \leftarrow \mathbf{1}_n$, $\eta \leftarrow \frac{1}{100}\sqrt{\frac{\log n}{T}}$.
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$, $x_t \leftarrow \frac{y_t}{\max\{1, \|y_t\|\}}$.
5:   Choose $i_t \in [n]$ by $i_t \leftarrow i$ with probability $p_t(i)$.
6:   $y_{t+1} \leftarrow \sum_{\tau\in[t]} \psi(A_{i_\tau})/\sqrt{2T}$.
7:   for $i \in [n]$ do
8:     $\tilde v_t(i) \leftarrow$ Kernel-L2-Sampling$(x_t, \psi(A_i))$.  (estimating $\langle x_t, \psi(A_i)\rangle$)
9:     $v_t(i) \leftarrow \mathrm{clip}(\tilde v_t(i), 1/\eta)$.
10:    $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$.
11:  end for
12: end for
13: return $\bar x = \frac{1}{T}\sum_t x_t$
5.1. Implementing Kernel-L2-Sampling
Estimating $\|y_t\|$. A key step in Kernel-L2-Sampling is the estimation of $\|y_t\|$, which readily reduces to estimating
$Y_t \equiv 2T\|y_t\|^2/t^2 = \frac{1}{t^2}\sum_{\tau,\tau'\in[t]} k(A_{i_\tau}, A_{i_{\tau'}}),$
that is, the mean of the summands. Since we use $\max\{1, \|y_t\|\}$, we need not be concerned with small $\|y_t\|$, and it is enough that the additive bias in our estimate of $Y_t$ be at most $(\varepsilon/T)(2T/t^2)$ for $t \in [T]$, implying a bias for $\|y_t\|$ no more than $\varepsilon$. Since we need $1/\|y_t\|$ in the algorithm, it is not enough for estimates of $Y_t$ just to be good in mean and variance; we will find an estimator whose error bounds hold with high probability.
Our estimate $\tilde Y_t$ of $Y_t$ can first be considered assuming we only need to make an estimate for a single value of $t$.
Let $N_Y \leftarrow t^2\lceil(8/3)\log(1/\delta)T^2/(\varepsilon^2 t^2)\rceil$. To estimate $Y_t$, we compute, for each $\tau, \tau' \in [t]$, $n_t \leftarrow N_Y/t^2$ independent estimates
$X_{\tau,\tau',m} \leftarrow \mathrm{clip}(\tilde k(A_{i_\tau}, A_{i_{\tau'}}), T/\varepsilon), \quad \text{for } m \in [n_t],$
and our estimate is
$\tilde Y_t \leftarrow \sum_{\tau,\tau'\in[t],\, m\in[n_t]} X_{\tau,\tau',m} / N_Y.$
LEMMA 5.1. With probability at least $1-\delta$, $|Y_t - \tilde Y_t| \le \varepsilon/T$.
PROOF. We apply Bernstein's inequality (as in (31)) to the $N_Y$ random variables $X_{\tau,\tau',m} - \mathrm{E}[X_{\tau,\tau',m}]$, which have mean zero, variance at most one, and are at most $T/\varepsilon$ in magnitude. Bernstein's inequality implies, using $\mathrm{Var}[X_{\tau,\tau',m}] \le 1$,
$\log\Pr\Big\{\sum_{\tau,\tau'\in[t],\, m\in[n_t]} (X_{\tau,\tau',m} - \mathrm{E}[X_{\tau,\tau',m}]) > \lambda\Big\} \le -\lambda^2/(N_Y + (T/\varepsilon)\lambda/3),$
and putting $\lambda \leftarrow N_Y\varepsilon/T$ gives
$\log\Pr\{\tilde Y_t - \mathrm{E}[\tilde Y_t] > \varepsilon/T\} \le -N_Y^2(\varepsilon/T)^2/(N_Y + (T/\varepsilon)N_Y(\varepsilon/T)/3) \le -(8/3)\log(1/\delta)(3/4) \le -2\log(1/\delta).$
Similar reasoning for $-X_{\tau,\tau',m}$, and the union bound, implies the lemma.
To compute $\tilde Y_t$ for $t = 1,\ldots,T$, we can save some work by reusing estimates from one $t$ to the next. Now let $N_Y \leftarrow \lceil(8/3)\log(1/\delta)T^2/\varepsilon^2\rceil$. Compute $\tilde Y_1$ as above for $t = 1$, and let $\hat Y_1 \leftarrow \tilde Y_1$. For $t > 1$, let $n_t \leftarrow \lceil N_Y/t^2\rceil$, and let
$\hat Y_t \leftarrow \sum_{m\in[n_t]} X_{t,t,m}/n_t + \sum_{\tau\in[t-1],\, m\in[n_t]} (X_{t,\tau,m} + X_{\tau,t,m})/n_t,$
and return $\tilde Y_t \leftarrow \sum_{\tau\in[t]} \hat Y_\tau/t^2$.
Since for each $\tau$ and $\tau'$, the expected total contribution of all $X_{\tau,\tau',m}$ terms to $\tilde Y_t$ is $k(A_{i_\tau}, A_{i_{\tau'}})$, we have $\mathrm{E}[\tilde Y_t] = Y_t$. Moreover, the number of instances of $X_{\tau,\tau',m}$ averaged to compute $\tilde Y_t$ is always at least as large as the number used for the above "batch" version; it follows that the total variance of $\tilde Y_t$ is non-increasing in $t$, and therefore Lemma 5.1 holds also for the $\tilde Y_t$ computed stepwise.
Since the number of calls to $\tilde k(\cdot,\cdot)$ is $\sum_{t\in[T]}(1 + 2n_t) = O(N_Y)$, we have the following lemma.
LEMMA 5.2. The values $\tilde Y_t (t^2/2T) \approx \|y_t\|^2$, $t \in [T]$, can be estimated with $O(\log(1/\delta)T^2/\varepsilon^2)$ calls to $\tilde k(\cdot,\cdot)$, so that with probability at least $1-\delta$, $|\tilde Y_t(t^2/2T) - \|y_t\|^2| \le \varepsilon$. The values $\|y_t\|$, $t \in [T]$, can be computed exactly with $T^2$ calls to the exact kernel $k(\cdot,\cdot)$.
PROOF. This follows from the discussion above, applying the union bound over $t \in [T]$, and adjusting constants. The claim for exact computation is straightforward.
Given this procedure for estimating $\|y_t\|$, we can describe Kernel-L2-Sampling. Since $x_{t+1} = y_{t+1}/\max\{1, \|y_{t+1}\|\}$, we have
$\langle x_{t+1}, \psi(A_i)\rangle = \frac{1}{\max\{1, \|y_{t+1}\|\}\sqrt{2T}}\sum_{\tau\in[t]} \langle\psi(A_{i_\tau}), \psi(A_i)\rangle = \frac{1}{\max\{1, \|y_{t+1}\|\}\sqrt{2T}}\sum_{\tau\in[t]} k(A_{i_\tau}, A_i),$  (22)
so that the main remaining step is to estimate $\sum_{\tau\in[t]} k(A_{i_\tau}, A_i)$, for $i \in [n]$. Here we simply call $\tilde k(A_{i_\tau}, A_i)$ for each $\tau$. We save time, at the cost of $O(n)$ space, by saving the value of the sum for each $i \in [n]$, and updating it for the next $t$ with $n$ calls $\tilde k(A_{i_t}, A_i)$.
LEMMA 5.3. Let $L_k$ denote the expected time needed for one call to $\tilde k(\cdot,\cdot)$, and $T_k$ denote the time needed for one call to $k(\cdot,\cdot)$. Except for estimating $\|y_t\|$, Kernel-L2-Sampling can be computed in $nL_k$ expected time per iteration $t$. The resulting estimate has expectation within additive $O(\varepsilon)$ of $\langle x_t, \psi(A_i)\rangle$, and variance at most one. Thus Algorithm 4 runs in time $\tilde O(\frac{L_k n + d}{\varepsilon^2} + \min\{\frac{L_k}{\varepsilon^6}, \frac{T_k}{\varepsilon^4}\})$, and produces a solution with properties as in Algorithm 1.
PROOF. For Kernel-L2-Sampling it remains only to show that its variance is at most one, given that each $\tilde k(\cdot,\cdot)$ has variance at most one. We observe from (22) that $t$ independent estimates $\tilde k(\cdot,\cdot)$ are added together, and scaled by a value that is at most $1/\sqrt{2T}$. Since the variance of the sum is at most $t$, and the variance is scaled by a value no more than $1/2T$, the variance of Kernel-L2-Sampling is at most one. The only bias in the estimate is due to the estimation of $\|y_t\|$, which gives relative error of $\varepsilon$. For our kernels, $\|\psi(v)\| \le 1$ if $v \in B$, so the additive error of Kernel-L2-Sampling is $O(\varepsilon)$.
The analysis of Algorithm 4 then follows as for the un-kernelized perceptron; we neglect the time needed for preprocessing for the calls to $\tilde k(\cdot,\cdot)$, as it is dominated by other terms for the kernels we consider, and this is likely in general.
5.2. Implementing the Kernel Estimators
Using the lemma above we can derive corollaries for the Gaussian and polynomial kernels. More general kernels can be handled via the technique of [Cesa-Bianchi et al. 2010].
Polynomial kernels. For the polynomial kernel of degree $q$, estimating a single kernel product, i.e. $k(x,y) = k(A_i, A_j)$, where the norms of $x, y$ are at most one, takes $O(q)$ time, as follows. Recall that for the polynomial kernel, $k(x,y) = (x^\top y)^q$. To estimate this kernel we take the product of $q$ independent $\ell_2$-samples, yielding $\tilde k(x,y)$. Notice that the expectation of this estimator is exactly equal to the product of expectations, $\mathrm{E}[\tilde k(x,y)] = (x^\top y)^q$. The variance of this estimator is bounded via the product of second moments: $\mathrm{Var}(\tilde k(x,y)) \le (\|x\|\|y\|)^q \le 1$. Of course, calculating the inner product exactly takes $O(d\log q)$ time. We obtain:
COROLLARY 5.4. For the polynomial degree-$q$ kernel, Algorithm 4 runs in time
$\tilde O\Big(\frac{q(n+d)}{\varepsilon^2} + \min\Big\{\frac{d\log q}{\varepsilon^4}, \frac{q}{\varepsilon^6}\Big\}\Big).$
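A sketch (ours) of the degree-$q$ estimator just described, reusing the l2_sample_estimate sketch from §1.1:

# Sketch (ours): unbiased estimate of the polynomial kernel k(x, y) = (x^T y)^q.
import numpy as np

def poly_kernel_estimate(x, y, q, rng):
    est = 1.0
    for _ in range(q):                          # product of q independent estimates
        est *= l2_sample_estimate(x, y, rng)    # each is unbiased for x^T y
    return est                                  # E = (x^T y)^q; second moment <= (||x|| ||y||)^{2q}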
Gaussian kernels. To estimate the Gaussian kernel function, we assume that $\|x\|$ and $\|y\|$ are known and no more than $s/2$; thus to estimate
$k(x,y) = \exp(-\|x-y\|^2/(2s^2)) = \exp(-(\|x\|^2 + \|y\|^2)/(2s^2))\exp(x^\top y/s^2),$
we need to estimate $\exp(x^\top y/s^2)$. For $\exp(\alpha X) = \sum_{i\ge 0} \alpha^i X^i/i!$ with random $X$ and parameter $\alpha > 0$, we pick index $i$ with probability $\exp(-\alpha)\alpha^i/i!$ (that is, $i$ has a Poisson distribution) and return $\exp(\alpha)$ times the product of $i$ independent estimates of $X$. In our case we take $X$ to be the average of $c$ $\ell_2$-samples of $x^\top y$, and hence $\mathrm{E}[X] = x^\top y$ and $\mathrm{E}[X^2] \le (x^\top y)^2 + \frac{1}{c}$. The expectation of our kernel estimator is thus:
$\mathrm{E}[\tilde k(x,y)] = \mathrm{E}\Big[\sum_{i\ge 0} \frac{e^{-\alpha}\alpha^i}{i!}\, e^{\alpha}\, X_1\cdots X_i\Big] = \sum_{i\ge 0} \frac{\alpha^i}{i!}\prod_{j=1}^i \mathrm{E}[X] = \exp(\alpha\, x^\top y).$
The second moment of this estimator is bounded by:
$\mathrm{E}[\tilde k(x,y)^2] = \mathrm{E}\Big[\sum_{i\ge 0} \frac{e^{-\alpha}\alpha^i}{i!}\, e^{2\alpha}\,(X_1\cdots X_i)^2\Big] = e^{\alpha}\sum_{i\ge 0} \frac{\alpha^i}{i!}\prod_{j=1}^i \mathrm{E}[X^2] = e^{\alpha}\exp(\alpha\,\mathrm{E}[X^2]).$
Hence, we take $\alpha = c = \frac{1}{s^2}$. This gives a correct estimator in terms of expectation and constant variance. The variance can be further made smaller than one by taking the average of a constant number of estimators of the above type.
As for evaluation time, the expected size of the index $i$ is $\alpha = \frac{1}{s^2}$. Thus, we require in expectation $\alpha c = \frac{1}{s^4}$ $\ell_2$-samples.
We obtain:
COROLLARY 5.5. For the Gaussian kernel with parameter $s$, Algorithm 4 runs in time
$\tilde O\Big(\frac{n+d}{s^4\varepsilon^2} + \min\Big\{\frac{d}{\varepsilon^4}, \frac{1}{s^4\varepsilon^6}\Big\}\Big).$
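A sketch (ours) of the Poisson-based estimator for the Gaussian kernel described above, again reusing the l2_sample_estimate sketch from §1.1; the choice $\alpha = c = 1/s^2$ follows the text, and the rounding of $c$ to an integer is an implementation assumption.

# Sketch (ours): estimator for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 s^2)),
# assuming ||x|| and ||y|| are known (here recomputed for simplicity).
import numpy as np

def gaussian_kernel_estimate(x, y, s, rng):
    alpha = 1.0 / s**2
    c = max(1, int(np.ceil(1.0 / s**2)))             # l2-samples averaged per factor X
    i = rng.poisson(alpha)                           # truncation index ~ Poisson(alpha)
    prod = 1.0
    for _ in range(i):
        prod *= np.mean([l2_sample_estimate(x, y, rng) for _ in range(c)])
    scale = np.exp(-(np.dot(x, x) + np.dot(y, y)) / (2 * s**2))
    return scale * np.exp(alpha) * prod              # expectation equals k(x, y)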
5.3. Kernelizing the MEB and strictly convex problems
Analogously to Algorithm 4, we can define the kernel version of strongly convex problems, including MEB. The kernelized version of MEB is particularly efficient, since, as in Algorithm 2, the norm $\|y_t\|$ is never required. This means that the procedure Kernel-L2-Sampling can be computed in time $O(nL_k)$ per iteration, for a total running time of $O(L_k(\varepsilon^{-2} n + \varepsilon^{-1} d))$.
6. LOWER BOUNDS
All of our lower bounds are information-theoretic, meaning that any successful algorithm must read at least some number of entries of the input matrix $A$. Clearly this also lower bounds the time complexity of the algorithm in the unit-cost RAM model.
Some of our arguments use the following meta-theorem. Consider a $p \times q$ matrix $A$, where $p$ is an even integer. Consider the following random process. Let $W \ge q$. Let $a = 1 - 1/W$, and let $e_j$ denote the $j$-th standard $q$-dimensional unit vector. For each $i \in [p/2]$, choose a random $j \in [q]$ uniformly, and set $A_{i+p/2} \leftarrow A_i \leftarrow a e_j + b(\mathbf{1}_q - e_j)$, where $b$ is chosen so that $\|A_i\|_2 = 1$. We say that such an $A$ is a YES instance. With probability $1/2$, transform $A$ into a NO instance as follows: choose a random $i^\star \in [p/2]$ uniformly, and if $A_{i^\star} = a e_{j^\star} + b(\mathbf{1}_q - e_{j^\star})$ for a particular $j^\star \in [q]$, set $A_{i^\star + p/2} \leftarrow -a e_{j^\star} + b(\mathbf{1}_q - e_{j^\star})$.
Suppose there is a randomized algorithm reading at most $s$ positions of $A$ which distinguishes YES and NO instances with probability $\ge 2/3$, where the probability is over the algorithm's coin tosses and this distribution $\mu$ on YES and NO instances. By averaging, this implies a deterministic algorithm Alg reading at most $s$ positions of $A$ and distinguishing YES and NO instances with probability $\ge 2/3$, where the probability is taken only over $\mu$. We show the following meta-theorem with a standard argument.
THEOREM 6.1 (Meta-theorem). For any such algorithm Alg, $s = \Omega(pq)$.
This meta-theorem follows from the following folklore fact:
FACT 6.2. Consider the following random process. Initialize a length-$r$ array $A$ to an array of $r$ zeros. With probability $1/2$, choose a random position $i \in [r]$ and set $A[i] = 1$. With the remaining probability $1/2$, leave $A$ as the all-zero array. Then any algorithm which determines if $A$ is the all-zero array with probability $\ge 2/3$ must read $\Omega(r)$ entries of $A$.
Let us prove Theorem 6.1 using this fact:
PROOF. Consider the matrix $B \in \mathbb{R}^{(p/2)\times q}$ which is defined by subtracting the "bottom" half of the matrix from the top half, that is, $B_{i,j} = A_{i,j} - A_{i+p/2,j}$. Then $B$ is the all-zeros matrix, except that with probability $1/2$ there is one entry whose value is roughly two, and whose location is random and distributed uniformly. An algorithm distinguishing between YES and NO instances of $A$ in particular distinguishes between the two cases for $B$, which cannot be done without reading a linear number of entries.
In the proofs of Theorem 6.3, Corollary 6.4, and Theorem 6.6, it will be more convenient to use $M$ as an upper bound on the number of non-zero entries of $A$ rather than the exact number of non-zero entries. However, it should be understood that these theorems (and corollary) hold even when $M$ is exactly the number of non-zero entries of $A$.
To see this, our random matrices $A$ constructed in the proofs have at most $M$ non-zero entries. If this number $M'$ is strictly less than $M$, we arbitrarily replace $M - M'$ zero entries with the value $(nd)^{-C}$ for a large enough constant $C > 0$. Under our assumptions on the margin or the minimum enclosing ball radius of the points, the solution value changes by at most a factor of $(1 \pm (nd)^{1-C})$, which does not affect the proofs.
6.1. Classification
Recall that the margin $\sigma(A)$ of an $n \times d$ matrix $A$ is given by $\max_{x\in B}\min_i A_i x$. Since we assume that $\|A_i\|_2 \le 1$ for all $i$, we have that $\sigma(A) \le 1$.
6.1.1. Relative Error. We start with a theorem for relative-error algorithms.
THEOREM 6.3. Let $\alpha > 0$ be a sufficiently small constant. Let $\varepsilon$ and $\sigma(A)$ satisfy $\sigma(A)^{-2}\varepsilon^{-1} \le \alpha\min(n,d)$ and $\sigma(A) \le 1-\varepsilon$, with $\varepsilon$ also bounded above by a sufficiently small constant. Also assume that $M \ge 2(n+d)$, that $n \ge 2$, and that $d \ge 3$. Then any randomized algorithm which, with probability at least $2/3$, outputs a number in the interval $[\sigma(A) - \varepsilon\sigma(A), \sigma(A)]$ must read
$\Omega(\min(M, \sigma(A)^{-2}\varepsilon^{-1}(n+d)))$
entries of $A$. This holds even if $\|A_i\|_2 = 1$ for all rows $A_i$.
Notice that this yields a stronger theorem than assuming that both $n$ and $d$ are sufficiently large, since one of these values may be constant.
PROOF. We divide the analysis into cases: the case in which $d$ or $n$ is constant, and the case in which each is sufficiently large. Let $\beta \in [0, 1-\varepsilon]$ be a real number to be determined.
Case: $d$ or $n$ is constant. By our assumption that $\sigma(A)^{-2}\varepsilon^{-1} \le \alpha\min(n,d)$, the values $\sigma(A)$ and $\varepsilon$ are constant, and sufficiently large. Therefore we just need to show an $\Omega(\min(M, n+d))$ bound on the number of entries read. By the premise of the theorem, $M = \Omega(n+d)$, so we can just show an $\Omega(n+d)$ bound.
An
(d) bound.We give a randomized construction of an n d matrix A.
The first rowof Ais built as follows.Let A
1;1
 and A
1;2
0.Pick j

2 f3;4;:::;dg
uniformly at random,and let A
1;j

"
1=2
.For all remaining j 2 f3;4;:::;dg,assign
A
1;j
,where  1=d
3
.(The role of  is to make an entry slightly non-zero to prevent
an algorithm which has access to exactly the non-zero entries from skipping over it.)
Now using the conditions on ,we have
X kA
1
k
2
= 
2
+(d 3)
2
+"
2
 (1 ")
2
+d
2
+" 1 "+"
2
+
2
"
2
 1;
and so by letting A
1;2

p
1 X,we have kA
1
k = 1.
Now we let A
2
A
1
,with two exceptions:we let A
2;1
A
1;1
= ,and with
probability 1=2,we negate A
2;j
.Thus kA
2
k = 1 also.
For row i with i > 2,put A
i;1
(1 +"),A
i;2

q
1 A
2
i;1
,and all remaining entries
zero.
We have the following picture.
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:23
0
B
B
B
B
B
B
@
 (1 
2
(d 3)
2
"
2
)
1=2
    "
1=2
     
 (1 
2
(d 3)
2
"
2
)
1=2
     "
1=2
     
(1 +") (1 (1 +")
2

2
)
1=2
0    0
(1 +") (1 (1 +")
2

2
)
1=2
0    0
              
(1 +") (1 (1 +")
2

2
)
1=2
0    0
1
C
C
C
C
C
C
A
Observe that the the number of non-zero entries of the resulting matrix is 2n+2d4,
which satisfies the premise of the theorem.Moreover,all rows A
i
satisfy kA
i
k = 1.
Notice that if A
1;j
 = A
2;j
,then the margin of A is at most ,which follows by
observing that all but the first coordinate of A
1
and A
2
have opposite signs.
On the other hand,if A
1;j
 = A
2;j
,consider the vector y with y
1
1,y
j

p
",and
all other entries zero.Then for all i,A
i
y = (1 +"),and so the unit vector x y=kyk
has
A
i
x =
(1 +")
p
1 +"
= (1 +")
1=2
= (1 +
(")):
It follows that in this case the margin of A is at least (1 +
(")).Setting  = () and
rescaling"by a constant factor,it follows that these two cases can be distinguished
by an algorithm satisfying the premise of the theorem.By Fact 6.2,any algorithm
distinguishing these two cases with probability  2=3 must read
(d) entries of A.
An
(n) bound.We construct the n  d matrix A as follows.All but the first two
columns are 0.We set A
i;1
 and A
i;2

p
1 
2
for all i 2 [n].Next,with probability
1=2,we pick a randomrow i

,and negate A
i

;2
.We have the following picture.
0
B
B
B
B
B
B
B
B
@

p
1 
2
0    0
      0    0

p
1 
2
0    0
 
p
1 
2
0    0

p
1 
2
0    0
      0    0

p
1 
2
0    0
1
C
C
C
C
C
C
C
C
A
The number of non-zeros of the resulting matrix is 2n < M.Depending on the sign
of A
i

;2
,the margin of A is either 1 or .Setting  = (),an algorithm satisfying
the premise of the theorem can distinguish the two cases.By Fact 6.2,any algorithm
distinguishing these two cases with probability  2=3 must read
(n) entries of A.
Case:d and n are both sufficiently large.Suppose first that M =
((A)
2
"
1
(n+d))
for a sufficiently large constant in the
().Let s be an even integer in (
2
"
1
) and
with s < min(n;d)1.We will also choose a value  in ((A)).We can assume without
loss of generality that n and d are sufficiently large,and even.
An
(ns) bound.We set the d-th entry of each row of A to the value .We set all
entries in columns s +1 through d 1 to 0.We then choose the remaining entries of A
as follows.We apply Theorem6.1 with parameters p = n;q = s,and W = d
2
,obtaining
an n  s matrix B,where kB
i
k = 1 for all rows B
i
.Put B
0
B
p
1 
2
.We then set
A
i;j
B
0
i;j
for all i 2 [n] and j 2 [s].We have the following block structure for A.

B
p
1 
2
0
n(ds1)
1
n


Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:24 Clarkson,Hazan and Woodruff
Here 0
n(ds1)
is a matrix of all 0’s,of the given dimensions.Notice that kA
i
k = 1 for
all rows A
i
,and the number of non-zero entries is at most n(s +1),which is less than
the value M.
We claim that if B is a YES instance,then the margin of A is (1 +
(")).Indeed,
consider the unit vector x for which
x
j

8
>
>
<
>
>
:

"
s

"
2
4s

1=2
j 2 [s]
0 j 2 [s +1;d 1]
1 "=2 j = d
(23)
For any row A
i
,
A
i
x 

"
s

"
2
4s

1=2

p
1 
2
O

p
1 
2
d
2
!!
+

1 
"
2




"
s

"
2
4s

1=2

1  O

p
1 
2
d
2
!!
+ 
"
2
since
p
1 
2
 1 


"
s

1=2
(1 ) + 
"
2
O("
2

2
) since
r
"
s

1
d
2
= O("
2

2
)
If we set s = c
2
"
1
for c 2 (0;4),then
A
i
x   +
"
c
1=2


"
2
+
"
c
1=2

O("
2

2
) = (1 +
(")):(24)
On the other hand,if B is a NO instance,we claim that the margin of A is at most
(1+O("
2
)).By definition of a NOinstance,there are rows A
i
and A
j
of A which agree
except on a single column k,for which A
i;k
=
p
1 
2
O

1
2
d
2

while A
j;k
= A
i;k
.
It follows that the x which maximizes minfA
i
x;A
j
xg has x
k
= 0.But
P
k
0
6=k
A
2
i;k
0
=
1 (1 
2
) +O

1
d
2

= 
2
+O

1
d
2

.Since kxk  1,by the Cauchy-Schwarz inequality
A
i
x = A
j
x 


2
+O

1
d
2

1=2
  +O

"
2

= (1 +O("
2
));(25)
where the first inequality follows fromour bound 
2
"
1
= O(d).
Setting  = ((A)) and rescaling"by a constant factor,an algorithm satisfying
the premise of the theorem can distinguish the two cases,and so by Theorem 6.1,it
must read
(ns) =
((A)
2
"
1
n) entries of A.
An
(ds) bound.We first define rows s + 1 through n of our n  d input matrix
A.For i > s,put A
i;d
(1 +"),A
i;d1
(1 
2
(1 +")
2
)
1=2
,and all remaining entries
zero.
We now define rows 1 through s.Put A
i;d
 for all i 2 [s].Now we apply Theorem
6.1 with p = s,q = d2,and W = d
2
,obtaining an s(d2) matrix B,where kB
i
k = 1
for all rows B
i
.Put B
0
B
p
1 
2
,and set A
i;j
B
0
i;j
for all i 2 [s] and j 2 [d 2].We
have the following block structure for A.

B
p
1 
2
0
s
1
s

0
(ns)(d2)
1
ns
(1 
2
(1 +")
2
)
1=2
1
ns
(1 +")

Notice that kA
i
k = 1 for all rows A
i
,and the number of non-zero entries is at most
2n +sd < M.
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:25
If B is a YES instance,let x be as in Equation (23).Since the first s rows of A
agree with those in our proof of the
(ns) bound,then as shown in Equation (24),
A
i
x = (1 +
(")) for i 2 [s].Moreover,for i > s,since YES instances B are entry-wise
positive,we have
A
i
x >

1 
"
2

 (1 +") = (1 +
(")):
Hence,if B is a YES instance the margin is (1 +
(")).
Now suppose B is a NO instance.Then,as shown in Equation (25),for any x for
which kxk  1,we have A
i
x  (1 +O("
2
)) for i 2 [s].Hence,if B is a NO instance,the
margin is at most (1 +O("
2
)).
Setting  = ((A)) and rescaling"by a constant factor,an algorithm satisfying
the premise of the theorem can distinguish the two cases,and so by Theorem 6.1,it
must read
(ds) =
((A)
2
"
1
d) entries of A.
Finally,if M = O((n + d)(A)
2
"
1
),then we must show an
(M) bound.We
will use our previous construction for showing an
(ns) bound,but replace the value
of n there with n
0
,where n
0
is the largest integer for which n
0
s  M=2.We claim that
n
0
 1.To see this,by the premise of the theoremM  2(n +d).Moreover,s = ("
1
)
and"
1
 (n +d).For a small enough constant  > 0,s  (n +d)  M=2,as needed.
As the theorem statement concerns matrices with n rows,each of unit norm,we
must have an input A with n rows.To achieve this,we put A
i;d
= (1 +") and A
i;d1
=
(1  
2
(1 +")
2
)
1=2
for all i > n
0
.In all remaining entries in rows A
i
with i > n
0
,we
put the value 0.This ensures that kA
i
k = 1 for all i > n
0
,and it is easy to verify that
this does not change the margin of A.Hence,the lower bound is
(n
0
s) =
(M).Notice
that the number of non-zero entries is at most 2n+n
0
s  2M=3+M=3 = M,as needed.
This completes the proof.
6.1.2.Additive Error.Here we give a lower bound for the additive error case.We give
two different bounds,one when"< ,and one when" .Notice that   0 since we
may take the solution x = 0
d
.The following is a corollary of Theorem6.3.
COROLLARY 6.4.Let  > 0 be a sufficiently small constant.Let";(A) be such that
(A)
1
"
1
 min(n;d) and (A)  1 "=(A),where 0 <" 
0
 for a sufficiently
small constant 
0
> 0.Also assume that M  2(n + d),n  2,and d  3.Then any
randomized algorithm which,with probability at least 2=3,outputs a number in the
interval [ ";] must read

(min(M;
1
"
1
(n +d)))
entries of A.This holds even if kA
i
k = 1 for all rows A
i
.
PROOF.We simply set the value of"in Theorem 6.3 to"=.Notice that"is at
most a sufficiently small constant and the value 
2
"
1
in Theorem6.3 equals 
1
"
1
,
which is at most min(n;d) by the premise of the corollary,as needed to apply Theorem
6.3.
The following handles the case when"=
().
COROLLARY 6.5.Let  > 0 be a sufficiently small constant.Let";(A) be such
that"
2
 min(n;d),(A) +"<
1
p
2
,and"=
().Also assume that M  2(n +d),
n  2,and d  3.Then any randomized algorithmwhich,with probability at least 2=3,
outputs a number in the interval [ ";] must read

(min(M;"
2
(n +d)))
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:26 Clarkson,Hazan and Woodruff
entries of A.This holds even if kA
i
k = 1 for all rows A
i
.
PROOF.
The proof is very similar to that of Theorem 6.3,so we just outline the differences.
In the case that d or n is constant,we have the following families of hard instances:
An
(n) bound for constant d:
0
B
B
B
B
B
B
B
B
@


1 
2
(d 3)
2
2("+)
2

1=2
    
p
2("+)     
 

1 
2
(d 3)
2
2("+)
2

1=2
     
p
2("+)     
p
2("+)

1 2("+)
2

1=2
0    0
p
2("+)

1 2("+)
2

1=2
0    0
              
p
2("+)

1 2("+)
2

1=2
0    0
1
C
C
C
C
C
C
C
C
A
An
(d) bound for constant n:
0
B
B
B
B
B
B
B
B
@

p
1 
2
0    0
      0    0

p
1 
2
0    0
 
p
1 
2
0    0

p
1 
2
0    0
      0    0

p
1 
2
0    0
1
C
C
C
C
C
C
C
C
A
In these two cases,depending on the sign of the undetermined entry the margin
is either  or at least  +"(in the
(d) bound,it is  or 1,but we assume  +"<
1
p
2
).
It follows for  = (A),the algorithm of the corollary can distinguish these two cases,
for which the lower bounds follow fromthe proof of Theorem6.3.
For the case of n and d sufficiently large,we have the following families of hard
instances.In each case,the matrix B is obtained by invoking Theorem 6.1 with the
value of s = ("
2
).
An
(n"
2
) bound for n;d sufficiently large:

B
p
1 
2
0
n(ds1)
1
n


An
(d"
2
) bound for n;d sufficiently large:

B
p
1 
2
0
s
1
s

0
(ns)(d2)
1
ns
(1 ( +")
2
)
1=2
1
ns
( +")

In these two cases,by setting W = poly(nd) to be sufficiently large in Theorem 6.1,
depending on whether B is YES or a NO instance the margin is either at most  +
1
poly(nd)
or at least  +
p
1 
2
 2"(for an appropriate choice of s).For  < 1=
p
2,the
algorithm of the corollary can distinguish these two cases,and therefore needs
(ns)
time in the first case,and
(ds) time in the second.
The extension of the proofs to handle the case M = o((n +d)"
2
) is identical to that
given in the proof of Theorem6.3.
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:27
6.2.MinimumEnclosing Ball
We start by proving the following lower bound for estimating the squared MEB radius
to within an additive".In the next subsection we improve the
("
1
n) term in the
lower bound to
~

("
2
n) for algorithms that either additionally output a coreset,or
output a MEB center that is a convex combination of the input points.As our primal-
dual algorithm actually outputs a coreset,as well as a MEB center that is a convex
combination of the input points,those bounds apply to it.Our algorithm has both
of these properties though satisfying one or the other would be enough to apply the
lower bound.Together with the"
1
d bound given by the next theorem,these bounds
establish its optimality.
THEOREM 6.6.Let  > 0 be a sufficiently small constant.Assume"
1
 min(n;d)
and"is less than a sufficiently small constant.Also assume that M  2(n+d) and that
n  2.Then any randomized algorithm which,with probability at least 2=3,outputs a
number in the interval
h
min
x
max
i
kx A
i
k
2
";min
x
max
i
kx A
i
k
2
i
must read

(min(M;"
1
(n +d)))
entries of A.This holds even if kA
i
k = 1 for all rows A
i
.
PROOF.
As with classification,we divide the analysis into cases:the case in which d or n is
constant,and the case in which each is sufficiently large.
Case d or n is a constant.By our assumption that"
1
 min(n;d),"is a constant,
and sufficiently large.So we just need to show an
(min(M;n + d)) bound.By the
premise of the theorem,M  2(n +d),so we need only show an
(n +d) bound.
An
(d) bound.We construct an n d matrix A as follows.For i > 2,each row A
i
is
just the vector e
1
= (1;0;0;:::;0).
Let A
1;1
0,and initially assign  1=d to all remaining entries of A
1
.Choose a
randominteger j

2 [2;d],and assign A
1;j

p
1 (d 2)
2
.Note that kA
1
k = 1.
Let A
2
A
1
,and then with probability 1=2,negate A
2;j
.
Our matrix A is as follows.
0
B
B
B
B
B
@
0     
p
1 (d 2)
2
    
0      
p
1 (d 2)
2
    
1 0    0
1 0    0
1         
1 0    0
1
C
C
C
C
C
A
Observe that Ahas at most 2n+2d  M non-zero entries,and all rows satisfy kA
i
k = 1.
If A
1;j

= A
2;j

,then A
1
and A
2
forma diametral pair,and the MEB radius is 1.
On the other hand,if A
1;j

= A
2;j

,then consider the ball center x with x
1
x
j


1=
p
2,and all other entries zero.Then for all i > 2,kxA
i
k
2
=

1 
1
p
2

2
.On the other
hand,for i 2 f1;2g,we have
kx A
i
k
2

1
2
+(d 2)
2
+

1 
1
p
2

2
 2 
p
2 +
1
d
:
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:28 Clarkson,Hazan and Woodruff
It follows that for"satisfying the premise of the theorem,an algorithm satisfying
the premise of the theorem can distinguish the two cases.By Fact 6.2,any algorithm
distinguishing these two cases with probability  2=3 must read
(d) entries of A.
An
(n) bound.We construct the n  d matrix A as follows.Initially set all
rows A
i
e
1
= (1;0;0;:::;0).Then with probability 1=2 choose a randomi

2 [n],and
negate A
i

;1
.
We have the following picture.
0
B
B
B
B
B
B
@
1 0    0
           
1 0    0
1 0    0
1 0    0
   0    0
1 0    0
1
C
C
C
C
C
C
A
The number of non-zeros of the resulting matrix is n < M.In the case where there is
an entry of 1,the MEB radius of A is 1,but otherwise the MEB radius is 0.Hence,
an algorithm satisfying the premise of the theorem can distinguish the two cases.By
Fact 6.2,any algorithm distinguishing these two cases with probability  2=3 must
read
(n) entries of A.
Case:d and n are sufficiently large.Suppose first that M =
("
1
(n +d)) for a suf-
ficiently large constant in the
().Put s = ("
1
).We can assume without loss of
generality that n,d,and s are sufficiently large integers.We need the following simple
claim.
CLAIM 6.7.Given an instance of the minimum enclosing ball problem in T > t
dimensions on a matrix with rows fe
i
+
P
j2[t]nfig
e
j
g
t
i=1
for distinct standard unit
vectors e
i
and     0,the solution x =
P
t
i=1
(+(t 1))e
i
=t of cost ()
2
(11=t)
is optimal.
PROOF.We can subtract the point 1
T
from each of the points,and an optimal
solution y for the translated problemyields an optimal solution y+1
T
for the original
problem with the same cost.We can assume without loss of generality that T = t and
that e
1
;:::;e
t
are the t standard unit vectors in R
t
.Indeed,the value of each of the rows
on each of the remaining coordinates is 0.The cost of the point y

=
P
t
i=1
()e
i
=t in
the translated problemis
( )
2

1 
1
t

2
+(t 1) ( )
2
=t
2
= ( )
2

1 
1
t

:
On the other hand,for any point y,the cost with respect to row i is (    y
i
)
2
+
P
j6=i
( y
j
)
2
.By averaging and Cauchy-Schwarz,there is a row of cost at least
1
t

"
t
X
i=1
(  y
i
)
2
+(t 1)
t
X
i=1
y
2
i
#
= kyk
2
+( )
2

2( )kyk
1
t
 kyk
2
+( )
2

2( )kyk
p
t
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:29
Taking the derivative w.r.t.to kyk,this is minimized when kyk =

p
t
,for which the
cost is at least ( )
2
(1 1=t).
An
(ns) bound.We set the first s rows of A to e
1
;:::;e
s
.We set all entries outside
of the first s columns of A to 0.We choose the remaining n  s =
(n) rows of A by
applying Theorem 6.1 with parameters p = n  s;q = s,and W = 1=d.If A is a YES
instance,then by Claim6.7,there is a solution with cost (a b)
2
(1 1=s) = 1 (1=s).
On the other hand,if A is a NO instance,then for a given x,either kA
j
  xk
2
or
kA
p=2+j
 xk
2
is at least a
2
= 1 O(1=d).By setting s = ("
1
) appropriately,these
two cases differ by an additive",as needed.
An
(ds) bound.We choose A by applying Theorem 6.1 with parameters p = s;q = d,
and W = 1=d.If A is a YES instance,then by Claim 6.7,there is a solution of cost
at most (a  b)
2
(1  1=s) = 1  (1=s).On the other hand,if A is a NO instance,
then for a given x,either kA
j
  xk
2
or kA
p=2+j
  xk
2
is at least a
2
= 1  O(1=d).
As before,setting s = ("
1
) appropriately causes these cases to differ by an additive".
Finally,it remains to show an
(M) bound in case M = O("
1
(n + d)).We will
use our previous construction for showing an
(ns) bound,but replace the value of
n there with n
0
,where n
0
is the largest integer for which n
0
s  M=2.We claim that
n
0
 1.To see this,by the premise of the theoremM  2(n +d).Moreover,s = ("
1
)
and"
1
 (n +d).For a small enough constant  > 0,s  (n +d)  M=2,as needed.
As the theorem statement concerns matrices with n rows,each of unit norm,we
must have an input A with n rows.In this case,since the first rowof A is e
1
,which has
sparsity 1,we can simply set all remaining rows to the value of e
1
,without changing
the MEB solution.Hence,the lower bound is
(n
0
s) =
(M).Notice that the number
of non-zero entries is at most n +n
0
s  M=2 +M=2 = M,as needed.
This completes the proof.
6.3.An
~

(n"
2
) Bound for MinimumEnclosing Ball
6.3.1.Intuition.Before diving into the intricate lower bound of this section,we describe
a simple construction which lies at its core.Consider two distributions over arrays of
size d:the first distribution,,is uniformly distributed over all strings with exactly
3d
4
entries that are 1,and
d
4
entries that are 1.The second distribution ,is uniformly
distributed over all strings with exactly
3d
4
D entries that are 1,and
d
4
+D entries
that are 1,for D =
~
O(
p
d).
Let x   with probability
1
2
and x   with probability
1
2
.Consider the task of
deciding from which distribution x was sampled.In both cases,the distributions are
over the sphere of radius
p
d,so the normitself cannot be used to distinguish them.At
the heart of our construction lies the following fact:
FACT 6.8.Any algorithm that decides with probability 
3
4
the distribution that x
was sampled from,must read at least
~
(d) entries fromx.
We prove a version of this fact in the next sections.But first,let us explain the use of
this fact in the lower bound construction:We create an instance of MEBwhich contains
either n vectors similar to the first type,or alternatively n 1 vector of the first type
and an extra vector of the second type (with a small bias).To distinguish between the
two types of instances,an algorithm has no choice but to check all n vectors,and for
each invest O(d) work as per the above fact.In our parameter setting,we’ll choose
d =
~
O("
2
),attaining the lower bound of
~
O(nd) =
~
O(n"
2
) in terms of time complexity.
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:30 Clarkson,Hazan and Woodruff
To compute the difference in MEB center as n 7!1,note that by symmetry in the
first case the center will be of the form (a;a;:::;a),where the value a 2 R is chosen to
minimize the maximal distance:
arg min
a
f
3
4
(1 a)
2
+
1
4
(1 a)
2
g = arg min
a
fa
2
a +1g =
1
2
The second MEB center will be
arg min
a
f(
3
4

D
d
)(1 a)
2
+(
1
4
+
D
d
)(1 a)
2
g = arg min
a
fa
2
(1 
4D
d
)a +1g =
1
2

2D
d
Hence,the difference in MEB centers is on the order of
q
d (
D
d
)
2
= O(D
2
=d) = O(1).
However,the whole construction is scaled to fit in the unit ball,and hence the differ-
ence in MEB centers becomes
1
p
d
".Hence for an"approximation the algorithm
must distinguish between the two distributions,which in turn requires
("
2
) work.
6.3.2.Probabilistic Lemmas.For a set S of points in R
d
,let MEB(S) denote the smallest
ball that contains S.Let Radius(S) be the radius of MEB(S),and Center(S) the unique
center of MEB(S).
For our next lower bound,our bad instance will come from points on the hypercube
H
d
= f
1
p
d
;
1
p
d
g
d
.
Call a vertex of H
d
regular if it has
3d
4
coordinates equal to
1
p
d
and
d
4
coordinates
equal to 
1
p
d
.Call a vertex special if it has
3d
4
 12dD coordinates equal to
1
p
d
and
d
4
+12dD coordinates equal to 
1
p
d
,where D 
lnn
p
d
.
We will consider instances where all but one of the input rows A
i
are randomregular
points,and one row may or may not be a random special point.We will need some
lemmas about these points.
LEMMA 6.9.Let a denote a randomregular point,b a special point,and c denote the
point 1
d
=2
p
d = (
1
2
p
d
;
1
2
p
d
;:::;
1
2
p
d
).Then
kak
2
= kbk
2
= 1 (26)
kck
2
= a
>
c =
1
4
(27)
ka ck
2
=
3
4
(28)
b
>
c = E[a
>
b] =
1
4
12D (29)
PROOF.The normclaims are entirely straightforward,and we have
a
>
c =
1
2d

3d
4

1
2d

d
4
=
1
4
:
Also (28) follows by
ka ck
2
= kak
2
+kck
2
2a
>
c = 1 +
1
4
2
1
4
=
3
4
:
For (29),we have
b
>
c =
1
2d

3d
4
12dD


1
2d

d
4
+12dD

=
3
8
6D
1
8
6D =
1
4
12D;
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:31
and by linearity of expectation,
E[a
>
b] = d 
1
d


3
4


3
4
12D

+
1
4


1
4
+12D


3
4


1
4
+12D


1
4


3
4
12D

=
1
4
12D:
Next,we show that a
>
b is concentrated around its expectation (29).
LEMMA 6.10.Let a be a randomregular point,and b a special point.For d  8 ln
2
n,
Pr[a
>
b >
1
4
6D] 
1
n
3
,and Pr[a
>
b <
1
4
18D] 
1
n
3
.
PROOF.We will prove the first tail estimate,and then discuss the changes needed
to prove the second estimate.
We apply the upper tail of the following enhanced form of Hoeffding’s bound,which
holds for randomvariables with bounded correlation.
FACT 6.11.(Theorem 3.4 of [Panconesi and Srinivasan 1997] with their value of 
equal to 1) Let X
1
;:::;X
d
be given random variables with support f0;1g and let X =
P
d
j=1
X
j
.Let > 0 be arbitrary.If there exist independent randomvariables
^
X
1
;:::;
^
X
d
with
^
X =
P
d
j=1
^
X
j
and E[X]  E[
^
X] such that for all J  [d],
Pr [^
j2J
X
j
= 1] 
Y
j2J
Pr
h
^
X
j
= 1
i
;
then
Pr[X > (1 + ) E[
^
X]] 

e

(1 + )
1+

E[
^
X]
:
Define X
j
=
d
2


a
j
b
j
+
1
d

.Since a
j
b
j
2 f
1
p
d
;
1
p
d
g,the X
j
have support f0;1g.Let
^
X
1
;:::;
^
X
d
be i.i.d.variables with support f0;1g with E[
^
X
j
] = E[X
j
] for all j.
We claim that for all J  [d],Pr [^
j2J
X
j
= 1] 
Q
j2J
Pr
h
^
X
j
= 1
i
:By symmetry,it
suffices to prove it for J 2 f[1];[2];:::;[d]g.We prove it by induction.The base case
J = [1] follows since E[
^
X
j
] = E[X
j
].To prove the inequality for J = [`],` 2,assume
the inequality holds for [`1].Then,
Pr[^
j2[`]
X
j
= 1] = Pr[^
j2[`1]
X
j
= 1]  Pr[X
`
= 1 j ^
j2[`1]
X
j
= 1];
and by the inductive hypothesis,
Pr[^
j2[`1]
X
j
= 1] 
Y
j2[`1]
Pr
h
^
X
j
= 1
i
;
so to complete the induction it is enough to show
Pr[X
`
= 1 j ^
j2[`1]
X
j
= 1]  Pr[X
`
= 1]:(30)
Letting (a;b) be the number of coordinates j for which a
j
6= b
j
,we have
Pr[X
`
= 1] = 1 
E[(a;b)]
d
:
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:32 Clarkson,Hazan and Woodruff
If ^
j2[`1]
X
j
= 1 occurs,then the first`1 coordinates of a
j
and b
j
have the same sign,
and so
Pr[X
`
= 1 j ^
j2[`1]
X
j
= 1] = 1 
E[(a;b) j ^
j2[`1]
X
j
= 1]
d `+1
= 1 
E[(a;b)]
d `+1
;
which proves (30).
We will apply Fact 6.11 to bound Pr[a
>
b > r] for r =
1
4
6D.Since X =
d
2
a
>
b +
d
2
=
d
2
(1 +a
>
b),we have
X E[X]
E[X]
=
a
>
b E[a
>
b]
1 +E[a
>
b]
;
where we have used that (29) implies E[X] is positive (for large enough d),so we can
performthe division.So
X E[X]
E[X]

r E[a
>
b]
1 +E[a
>
b]
=
a
>
b E[a
>
b]
1 +E[a
>
b]

r E[a
>
b]
1 +E[ a
>
b]
=
a
>
b r
1 +E[a
>
b]
;
and so
Pr[a
>
b > r] = Pr

X E[X]
E[X]
>
r E[a
>
b]
1 +E[a
>
b]

:
By Fact 6.11,for =
rE[a
>
b]
1+E[a
>
b]
,we have for > 0,
Pr[a
>
b > r] 

e

(1 + )
1+

d(1+E[a
>
b])=2
:
By (29),r  E[a
>
b] = 6D,and 1  1 + E[a
>
b]  2,so 2 [3D;6D].It is well-known
(see Theorem 4.3 of [Motwani and Raghavan 1995]) that for 0 < < 2e  1,e


(1 + )
1+
e

2
=4
,and so

e

(1 + )
1+

d(1+E[a
>
b])=2
 exp



2
4
(d(1 +E[a
>
b])=2)

= exp(
2
d(1 +E[a
>
b])=8):
Since  3D and E[a
>
b] > 0,this is at most exp(D
2
d)  exp((lnn)
2
)  n
3
,for
large enough n,using the definition of D.
For the second tail estimate,we can apply the same argument to a and b,prov-
ing that Pr[a
>
b > r]  1=n
3
,where r  1=4 + 18D.We let X
j
be the f0;1g vari-
ables
d
2
(a
j
b
j
+
1
d
),with expected sum E[X] = 3d=8 + 6D.As above,Pr[a
>
b > r] =
Pr[
XE[X]
E[X]
> ],where 
rE[a
>
b]
1+E[a
>
b]
.Now
p
d is between 6 lnn and 8 lnn,so the same
relations apply as above,and the second tail estimate follows.
Note that since by (28) all regular points are distance
p
3=2 from c,that distance is
an upper bound for the the MEB radius of a collection of regular points.
The next lemmas give more properties of MEBs involving regular and special points,
under the assumption that the above concentration bounds on a
>
b hold for a given
special point b and all a in a collection of regular points.
That is,let S be a collection of randomregular points.Let E be the event that for all
a 2 S,18D  a
>
b 
1
4
 6D.By Lemma 6.10 and a union bound,
Pr[E]  1 
2
n
2
;
when S has at most n points.
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:33
The condition of event E applies not only to every point in S,but to every point in
the convex hull conv S.
LEMMA 6.12.For special point b and collection S of points a,if event E holds,then
for every a
S
2 conv S,18D  a
>
S
b 
1
4
 6D.
PROOF.Since a
S
2 conv S,we have a
S
=
P
a2S
p
a
a for some values p
a
with
P
a2S
p
a
= 1 and p
a
 0 for all a 2 S.Therefore,assuming E holds,
a
>
S
b =
"
X
a2S
p
a
a
#
>
b =
X
a2S
p
a
a
>
b 
X
a2S
p
a
(1=4 6D) = 1=4 6D;
and similarly a
>
S
b  1=4 18D.
LEMMA 6.13.Suppose b is a special point and S is a collection of regular points such
that event E holds.Then for any a
S
2 conv S,ka
S
 bk 
p
3
2
+ 6D.Since Center(S) 2
conv S,this bound applies to kCenter(S) bk as well.
PROOF.Let H be the hyperplane normal to c = 1
d
=2
p
d and containing c.Then
S  H,and so conv S  H,and since the minimum norm point in H is c,all points
a
S
2 conv S have ka
S
k
2
 kck
2
= 1=4.By the assumption that event E holds,and the
previous lemma,we have a
>
S
b 
1
4
6D.Using this fact,kbk = 1,and ka
S
k
2
 1=4,we
have
ka
S
bk
2
= ka
S
k
2
+kbk
2
2a
>
S
b

1
4
+1 2

1
4
6D

=
3
4
+12D;
and so ka
S
bk 
p
3
2
+6D provided D is smaller than a small constant.
LEMMA 6.14.Suppose a is a regular point,b is a special point,and a
>
b 
1
4
18D.
Then there is a point q 2 R
d
for which kq bk =
p
3
2
and kq ak 
p
3
2
+(D
2
),as D!0.
PROOF.As usual let c  1
d
=2
p
d and consider the point q at distance
p
3
2
from b on
the line segment
cb,so
q = c + 
b c
kb ck
= c + (b c);
where   1=kb ck and is a value in (D).Fromthe definition of q,
kq ak
2
= kqk
2
+kak
2
2a
>
q
= kck
2
+2 b
>
c 2 kck
2
+
2
+kak
2
2a
>
c 2 a
>
b +2 a
>
c:
Recall from (26) that kak = 1,from (27) that a
>
c = kck
2
=
1
4
,from (29) that b
>
c =
1=4 12D,and the assumption a
>
b  1=4 18D,we have
kq ak
2
= 1=4 +2 (1=4 12D) 2 (1=4) +
2
+1 2(1=4) 2 (1=4 18D) +2 (1=4)
= 3=4 +12 D+
2
 3=4 +(D
2
);
where the last inequality uses = (D) and  = (1).
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:34 Clarkson,Hazan and Woodruff
6.3.3.Main Theorem.Given an n  d matrix A together with the norms kA
i
k for all
rows A
i
,as well as the promise that all kA
i
k = O(1),the"-MEB-Coreset problem is
to output a subset S of
~
O("
1
) rows of A for which A
i
2 (1 +")  MEB(S).Our main
theoremin this section is the following.
THEOREM 6.15.If n"
1
 d and d =
~

("
2
),then any randomized algorithmwhich
with probability  4=5 solves"-MEB-Coreset must read
~

(n"
2
) entries of A for some
choice of its randomcoins.
We also define the following problem.Given an nd matrix Atogether with the norms
kA
i
k for all rows A
i
,as well as the promise that all kA
i
k = O(1),the"-MEB-Center
problemis to output a vector x 2 R
d
for which kA
i
xk  (1+") min
y2R
d max
i2[n]
kyA
i
k.
We also show the following.
THEOREM 6.16.If n"
2
 d and d =
~

("
2
),then any randomized algorithmwhich
with probability  4=5 solves"-MEB-Center by outputting a convex combination of the
rows A
i
must read
~

(n"
2
) entries of A for some choice of its randomcoins.
These theorems will follow from the same hardness construction,which we now de-
scribe.Put d = 8"
2
ln
2
n,which we assume is a sufficiently large power of 2.We also
assume n is even.We construct two families F and G of n d matrices A.
The family F consists of all A for which each of the n rows in A is a regular point.
The family G consists of all A for which exactly n 1 rows of A are regular points,
and one row of A is a special point.
(Recall that we say that a vertex of on H
d
is regular if it has exactly
3d
4
coordinates
equal to
1
p
d
.We say a point on H
d
is special if it has exactly d

3
4
12D

coordinates
equal to
1
p
d
,where D is
lnn
p
d
.)
Let  be the distribution on n  d matrices for which half of its mass is uniformly
distributed on matrices in F,while the remaining half is uniformly distributed on the
matrices in G.Let A  .We show that any randomized algorithm Alg which decides
whether A 2 F or A 2 G with probability at least 3=4 must read
~

(nd) entries of A
for some choice of its randomcoins.W.l.o.g.,we may assume that Alg is deterministic,
since we may average out its random coins,as we may fix its coin tosses that lead to
the largest success probability (over the choice of A).By symmetry and independence
of the rows,we can assume that in each row,Alg queries entries in order,that is,if Alg
makes s queries to a row A
i
,we can assume it queries A
i;1
;A
i;2
;:::;A
i;s
,and in that
order.
Let r = d=(C ln
2
n) for a sufficiently large constant C > 0.For a vector u 2 R
d
,let
pref(u) denote its first r coordinates.Let  be the distribution of pref(u) for a random
regular point u.Let 
0
be the distribution of pref(u) for a randomspecial point u.
LEMMA 6.17.(Statistical Difference Lemma) For C > 0 a sufficiently large constant,
k 
0
k
1

1
10
:
PROOF.We will apply the following fact twice,once to  and once to 
0
.
FACT 6.18.(special case of Theorem 4 of [Diaconis and Freedman 1980]) Suppose
an urn U contains d balls,each marked by one of two colors.Let H
U
r be the distribution
of r draws made at random without replacement from U,and M
U
r be the distribution
of r draws made at randomwith replacement.Then,
kH
U
k M
U
kk
1

4r
d
:
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:35
Let  be the distribution with support f
1
p
d
;
1
p
d
g with (
1
p
d
) =
3
4
and (
1
p
d
) =
1
4
.
Let  be the distribution with support f
1
p
d
;
1
p
d
g with (
1
p
d
) =
3
4
12D and (
1
p
d
) =
1
4
+12D.
Let 
r
be the joint distribution of r independent samples from,and similarly define

r
.Applying Fact 6.18 with r = 1=100D
2
,
k 
r
k
1

1
25dD
2
;
and
k
0

r
k
1

1
25dD
2
:
By the triangle inequality,
k k
1
 k 
r
k
1
+k
r

r
k
1
+k
r

0
k
1
 k
r

r
k
1
+
2
25dD
2
;
and so it remains to bound k
r

r
k
1
.To do this,we use Stein’s Lemma (see,e.g.,[Cover
and Thomas 1991],Section 12.8),which shows that for two coins with bias in [
(1);1

(1)],one needs (z
2
) independent coins tosses to distinguish the distributions with
constant probability,where z is the difference in their expectations.Here,z = 12D,and
so for constant C > 0 sufficiently large,for r = 1=CD
2
,it follows that k
r

r
k
1

1
20
.
We thus have
k 
0
k
1

1
20
+
2
25dD
2

1
10
;
where the last inequality uses dD
2
= (lnn)
2
!1.
We use Lemma 6.17 to prove the following.We assume that Alg outputs 1 if it decides
that A2 F,otherwise it outputs 0.
THEOREM 6.19.If Alg queries o(nr) entries of A,it cannot decide if A 2 F with
probability at least 3=4.
PROOF.We can think of A as being generated according to the following random
process.
(1) Choose an index i

2 [n] uniformly at random.
(2) Choose rows A
j
for j 2 [n] to be randomindependent regular points.
(3) With probability 1=2,do nothing.Otherwise,with the remaining probability 1=2,
replace A
i

with a randomspecial point.
(4) Output A.
Define the advantage adv(Alg) to be:
adv(Alg) 




Pr
A2
R
G
[Alg(A) = 1]  Pr
A2
R
F
[Alg(A) = 1]




:
To prove the theorem,it suffices to show adv(Alg) < 1=4.Let

A
i
 denote the rows of A,
excluding row i

,generated in step 2.By the description of the randomprocess above,
we have
adv(Alg) = E
i

;

A
i





Pr
special A
i

[Alg(A) = 1 j i

;

A
i
]  Pr
regular A
i

[Alg(A) = 1 j i

;

A
i
]




:
To analyze this quantity,we first condition on a certain event E(i;

A
i
) holding,which
will occur with probability 1  o(1),and allow us to discard the pairs (i;

A
i
) that do
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:36 Clarkson,Hazan and Woodruff
not satisfy the condition of the event.Intuitively,the event is just that for most regular
A
i
,algorithmAlg does not read more than r entries in A
i
.This holds with probability
1o(1),over the choice of i

and

A
i
,because all n rows of Aare i.i.d.,and so on average
Alg can only afford to read o(r) entries in each row.
More formally,we say a pair (i;

A
i

) is good if
Pr
regular A
i

[Alg queries at most r queries of A
i

j (i;

A
i

) = (i

;

A
i

)]:
Let E(i

;

A
i

) be the event that (i;

A
i

) is good.Then,Pr
i

;

A
i

[E(i

;

A
i

)] = 1o(1),and
we can upper bound the advantage by
E
i

;

A
i





Pr
special A
i

[Alg(A) = 1 j E(i

;

A
i
);i

;

A
i
]  Pr
regular A
i

[Alg(A) = 1 j E(i

;

A
i
);i

;

A
i
]




+o(1):
Consider the algorithm Alg
0
i
,which on input A,makes the same sequence of queries
to A as Alg unless it must query more than r positions of A
i

.In this case,it outputs
an arbitrary value in f0;1g,otherwise it outputs Alg(A).
CLAIM 6.20.




Pr
regular A
i

[Alg(A) = 1 j E(i

;

A
i
);i

;

A
i
]  Pr
regular A
i

[Alg
0
i

(A) = 1 j i

;

A
i
]




= o(1);
PROOF.Since E(i

;

A
i

) occurs,
Pr
regular A
i

[Alg makes at most r queries to A
i
 j i

;

A
i
] = 1 o(1):
This implies that




Pr
regular A
i

[Alg(A) = 1 j E(i

;

A
i
);i

;

A
i
]  Pr
regular A
i

[Alg
0
i

(A) = 1 j i

;

A
i
]




= o(1):
By Lemma 6.17,we have that




Pr
regular A
i

[Alg
0
i

(A) = 1 j i

;

A
i
]  Pr
special A
i

[Alg
0
i

(A) = 1 j i

;

A
i
]





1
10
:
Hence,by Claim6.20 and the triangle inequality,we have that




Pr
regular A
i

[Alg(A) = 1 j E(i

;

A
i
);i

;

A
i
]  Pr
special A
i

[Alg
0
i

(A) = 1 j i

;

A
i
]





1
10
+o(1):
To finish the proof,it suffices to show the following claim
CLAIM 6.21.




Pr
special A
i

[Alg(A) = 1 j E(i

;

A
i
);i

;

A
i
]  Pr
special A
i

[Alg
0
i

(A) = 1 j i

;

A
i
]





1
10
+o(1):
Indeed,if we show Claim 6.21,then by the triangle inequality we will have that
adv(Alg) 
1
5
+o(1) <
1
4
.
Proof of Claim6.21:Since E(i

;

A
i
) occurs,
Pr
regular A
i

[Alg makes at most r queries to A
i
 j i

;

A
i
] = 1 o(1):
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:37
Since  is the distribution of prefixes of regular points,this condition can be rewritten
as
Pr
u
[Alg makes at most r queries to the i

-th row j i

;

A
i
;pref(A
i
) = u] = 1 o(1):
By Lemma 6.17,we thus have,
Pr
u
0
[Alg makes at most r queries to the i

-th row j i

;

A
i

;pref(A
i

) = u] 
9
10
o(1):
Since 
0
is the distribution of prefixes of special points,this condition can be rewritten
as
Pr
special A
i

[Alg makes at most r queries to A
i
 j i

;

A
i
] 
9
10
o(1):
This implies that




Pr
special A
i

[Alg(A) = 1 j E(i

;

A
i

);i

;

A
i

]  Pr
special A
i

[Alg
0
i
(A) = 1 j i

;

A
i

]





1
10
+o(1):
This completes the proof of the theorem.
6.3.4.Proofs of Theorem 6.15 and 6.16.Next we show how Theorem 6.19 implies Theo-
rem6.15 and Theorem6.16,using the results on MEBs of regular and special points.
Proof of Theorem 6.15:We set the dimension d = 4  36 "
2
ln
2
(n  1).Let A
0
denote the set of regular rows of A.We condition on event E,namely,that every convex
combination p
T
A,where p 2 
n1
,satisfies p
T
A
0
b 
1
4
 6D.This event occurs with
probability at least 1  2n
2
.(We may neglect the difference between n and n  1 in
some expressions.)
It follows by Lemma 6.13 that if A2 G,then for every S  A
0
,
kCenter(S) bk 
p
3
2
+2":
By (28),Radius(A
0
) 
p
3
2
.It follows that any algorithm that,with probability at least
4=5,outputs a subset S of
~
O("
1
) rows of Afor which A
i
2 (1+")MEB(S) must include
the point b 2 S.
Given such an algorithm,by reading each of the
~
O("
1
) rows output,we can deter-
mine if A 2 F or A 2 G with an additional
~
O("
1
d) time.By Theorem 6.19,the total
time must be
~

(n"
2
).By assumption,n"
1
 d,and so any randomized algorithm
that solves"-MEB-Coreset with probability at least 4=5,can decide if A 2 F with
probability at least 4=5  2n
2
 3=4,and so it must read
~

(n"
2
) entries for some
choice of its randomcoins.
Proof of Theorem6.16:We again set the dimension d = 4  36 "
2
ln
2
(n1).Let A
0
denote the set of regular rows of A.We again condition on the event E.
By Lemma 6.13,if A2 G,then for every convex combination p
T
A
0
,
kp
T
A
0
bk 
p
3
2
+2";
and so the MEB radius returned by any algorithmthat outputs a convex combination
of rows of A
0
must be at least
p
3
2
+2".
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:38 Clarkson,Hazan and Woodruff
However,by (28),if A 2 F,then Radius(A) 
p
3
2
.On the other hand,by Lemma
6.14,if A2 G,then MEB-radius(A) 
p
3
2
+("
2
).
It follows that if A 2 G,then the convex combination p
T
A output by the algorithm
must have a non-zero coefficient multiplying the special point b.This,in particular,
implies that p
T
A is not on the affine hyperplane H with normal vector 1
d
containing
the point c = 1
d
=2
p
d.However,if A2 F,then any convex combination of the points is
on H.The output p
T
A of the algorithm is on H if and only if p
T
A1
d
=
p
d
2
,which can
be tested in O(d) time.
By Theorem 6.19,the total time must be
~

(n"
2
).By assumption,n"
2
 d,and
so any randomized algorithm that solves"-MEB-Center with probability  4=5 by
outputting a convex combination of rows can decide if A 2 F with probability at least
4=5  2n
2
 3=4,and so must read
~

(n"
2
) entries for some choice of its random
coins.
6.4.Las Vegas Algorithms
While our algorithms are Monte Carlo,meaning they err with small probability,it may
be desirable to obtain Las Vegas algorithms,i.e.,randomized algorithms that have low
expected time but never err.We show this cannot be done in sublinear time.
THEOREM 6.22.For the classification and minimumenclosing ball problems,there
is no Las Vegas algorithm that reads an expected o(M) entries of its input matrix and
solves the problem to within a one-sided additive error of at most 1=2.This holds even
if kA
i
k = 1 for all rows A
i
.
PROOF.
Suppose first that n  M.Consider n  d matrices A;B
1
;:::B
M
,where for each
C 2 fA;B
1
;:::;B
M
g,C
i;j
= 0 if either j > 1 or i > M.Also,A
i;1
= 1 for i 2 [M],while
for each j,B
j
1;i
= 1 if i 2 [M] n fjg,while B
j
1;j
= 1.With probability 1=2 the matrix
A is chosen,otherwise a matrix B
j
is chosen for a random j.Notice that whichever
case we are in,each of the first M rows of the input matrix has normequal to 1,while
all remaining rows have norm 0.It is easy to see that distinguishing these two cases
with probability  2=3 requires reading
(M) entries.As
(M) is a lower bound for
Monte Carlo algorithms,it is also a lower bound for Las Vegas algorithms.Moreover,
distinguishing these two cases is necessary,since if the problem is classification,if
C = A the margin is 1,otherwise it is 0,while if the problem is minimum enclosing
ball,if C = A the cost is 0,otherwise it is 1.
We now assume M > n.Let d
0
be the largest integer for which nd
0
< M.Here d
0
 1.
Let A be the n d
0
matrix,where A
i;j
=
1
p
d
0
for all i and j.The margin of A is 1,and
the minimumenclosing ball has radius 0.
Suppose there were an algorithm Alg on input A for which there is an assignment
to Alg’s randomtape r for which Alg reads at most nd
0
=4 of its entries.If there were no
such r,the expected running time of Alg is already
(nd
0
) =
(M).Let A
`
be a row of
A for which Alg reads at most d
0
=4 entries of A
`
given random tape r,and let S  [d
0
]
be the set of indices in A
`
read,where jSj  d
0
=4.Consider the n  d
0
matrix B for
which B
i;j
= A
i;j
for all i 6=`,while B
`;j
= A
`;j
for all j 2 S,and B
`;j
= A
`;j
for all
j 2 [d
0
] n S.Notice that all rows of A and B have norm1.
To bound the margin of B,consider any vector x of normat most 1.Then
(A
`
+B
`
)x  kxk  kA
`
+B
`
k  kA
`
+B
`
k:
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:39
A
`
+B
`
has at least 3d
0
=4 entries that are 0,while the non-zero entries all have value
2=
p
d
0
.Hence,kA
`
+B
`
k
2

d
0
4

4
d
0
= 1.It follows that either A
`
x or B
`
x is at most 1=2,
which bounds the margin of B.As Alg cannot distinguish A and B given randomtape
r,it cannot have one-sided additive error at most 1=2.
For minimumenclosing ball,notice that kA
`
B
`
k
2

1
4

3d
0
4

4
d
0

1
4
=
3
4
,which lower
bounds the cost of the minimum enclosing ball of B.As Alg cannot distinguish A and
B given randomtape r,it cannot have one-sided additive error at most 3=4.
7.CONCLUDING REMARKS
We have described a general method for sublinear optimization of constrained convex
programs,and showed applications to classical problems in machine learning such as
linear classification and minimum enclosing ball obtaining improvements in leading-
order terms over the state of the art.
In all our running times the dimension d can be replaced by the parameter S,which
is the maximum over the input rows A
i
of the number of nonzero entries in A
i
.Note
that d  S  M=n.Here we require the assumption that entries of any given row can
be recovered in O(S) time,which is compatible with keeping each row as a hash table
or (up to a logarithmic factor in run-time) in sorted order.
Acknowledgements.We thank Nati Srebro and an anonymous referee for helpful
comments on the relation between this work and PAC learning theory.
REFERENCES
BLUM,A.,FRIEZE,A.M.,KANNAN,R.,AND VEMPALA,S.1998.A polynomial-time algorithmfor learning
noisy linear threshold functions.Algorithmica 22,1/2,35–52.
BYLANDER,T.1994.Learning linear threshold functions in the presence of classification noise.In COLT
’94:Proceedings of the Seventh Annual Conference on Computational Learning Theory.ACM,NewYork,
NY,USA,340–347.
CESA-BIANCHI,N.,CONCONI,A.,AND GENTILE,C.2004.On the generalization ability of on-line learning
algorithms.IEEE Transactions on Information Theory 50,9,2050–2057.
CESA-BIANCHI,N.,SHALEV-SHWARTZ,S.,AND SHAMIR,O.2010.Online learning of noisy data with ker-
nels.In COLT,A.T.Kalai and M.Mohri,Eds.Omnipress,218–230.
CLARKSON,K.L.2008.Coresets,sparse greedy approximation,and the Frank-Wolfe algorithm.In SODA
’08:Proc.Nineteenth ACM-SIAMSymposiumon Discrete Algorithms.Society for Industrial and Applied
Mathematics,Philadelphia,PA,USA,922–931.
COVER,T.AND THOMAS,J.A.1991.Elements of Information Theory.Wiley Series in Telecommunications.
DIACONIS,P.AND FREEDMAN,D.1980.Finite exchangeable sequences.The Annals of Probability 8,745–
764.
DUNAGAN,J.AND VEMPALA,S.2004.A simple polynomial-time rescaling algorithm for solving linear
programs.In STOC ’04:Proceedings of the Thirty-Sixth Annual ACMSymposiumon the Theory of Com-
puting.ACM,New York,NY,USA,315–320.
FRANK,M.AND WOLFE,P.1956.An algorithmfor quadratic programming.Naval Research Logistics Quar-
terly 3,95–110.
GRIGORIADIS,M.D.AND KHACHIYAN,L.G.1995.A sublinear-time randomized approximation algorithm
for matrix games.Operations Research Letters 18,53–58.
HAZAN,E.2011.The convex optimization approach to regret minimization.Optimization for machine learn-
ing 1.
HAZAN,E.,AGARWAL,A.,AND KALE,S.2007.Logarithmic regret algorithms for online convex optimiza-
tion.Machine Learning 69,2-3,169–192.
KOUFOGIANNAKIS,C.AND YOUNG,N.E.2007.Beating simplex for fractional packing and covering linear
programs.In FOCS.IEEE Computer Society,494–504.
MINSKY,M.AND PAPERT,S.1988.Perceptrons:An introduction to computational geometry.MIT press
Cambridge,Mass.
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:40 Clarkson,Hazan and Woodruff
MONEMIZADEH,M.AND WOODRUFF,D.2010.1-pass relative error l
p
-sampling with applications.In SODA
’10:Proc.Twenty-First ACM-SIAMSymposiumon Discrete Algorithms.
MOTWANI,R.AND RAGHAVAN,P.1995.Randomized Algorithms.Cambridge University Press.
NOVIKOFF,A.B.1963.On convergence proofs for perceptrons.In Proceedings of the Symposium on the
Mathematical Theory of Automata.Vol.12.615–622.
PANCONESI,A.AND SRINIVASAN,A.1997.Randomized distributed edge coloring via an extension of the
chernoff-hoeffding bounds.SIAMJ.Comput.26,2,350–368.
PLOTKIN,S.A.,SHMOYS,D.B.,AND TARDOS,E.1991.Fast approximation algorithms for fractional pack-
ing and covering problems.In SFCS ’91:Proceedings of the 32nd Annual Symposiumon Foundations of
Computer Science.IEEE Computer Society,Washington,DC,USA,495–504.
SAHA,A.AND VISHWANATHAN,S.2009.Efficient approximation algorithms for minimumenclosing convex
shapes.arXiv:0909.1062v2.
SCH
¨
OLKOPF,B.AND SMOLA,A.J.2003.A short introduction to learning with kernels.Springer-Verlag New
York,Inc.,New York,NY,USA.
SERVEDIO,R.A.1999.On PAC learning using winnow,perceptron,and a perceptron-like algorithm.In
COLT ’99:Proceedings of the Twelfth Annual Conference on Computational Learning Theory.ACM,
New York,NY,USA,296–307.
ZINKEVICH,M.2003.Online convex programming and generalized infinitesimal gradient ascent.In Pro-
ceedings of the Twentieth International Conference on Machine Learning(ICML).928–936.
Received January 0000;revised January 0000;accepted January 0000
A.MAIN TOOLS
A.1.Tools fromonline learning
Online linear optimization.The following lemma is essentially due to [Zinkevich
2003]:
LEMMA A.1 (OGD).Consider a set of vectors q
1
;:::;q
T
2 R
d
such that kq
i
k
2
 c.
Let x
0
0,and ~x
t+1
x
t
+
1
p
T
q
t
;x
t+1

~x
t+1
maxf1;k~x
t+1
kg
.Then
max
x2B
T
X
t=1
q
>
t
x 
T
X
t=1
q
>
t
x
t
 2c
p
T:
This is true even if each q
t
is dependent on x
1
;:::;x
t1
.
PROOF.
Assume c = 1,generalization is by straightforward scaling.Let  =
1
p
T
.By definition
and for any kxk  1,
kx x
t+1
k
2
 kx  ~x
t+1
k
2
= kx x
t
q
t
k
2
= kx x
t
k
2
2q
>
t
(x x
t
) +
2
kq
t
k
2
:
Rearranging we obtain
q
>
t
(x x
t
) 
1
2
[kx x
t
k
2
kx x
t+1
k
2
] +=2:
Summing up over t = 1 to T yields
X
t
q
>
t
x 
X
t
q
>
t
x
t

1
2
kx x
1
k
2
+T=2 
2

+

2
T  2
p
T:
A simpler version of gradient descent,also essentially due to [Zinkevich 2003],is
given by:
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:41
LEMMA A.2 (LAZY PROJECTION OGD).Consider a set of vectors q
1
;:::;q
T
2 R
d
such that kq
i
k
2
 1.Let
x
t+1
arg min
x2B
(
t
X
=1
q
>

 x +
p
2Tkxk
2
2
)
Then
max
x2B
T
X
t=1
q
>
t
x 
T
X
t=1
q
>
t
x
t
 2
p
2T:
This is true even if each q
t
is dependent on x
1
;:::;x
t1
.
For a proof see Theorem 2.1 in [Hazan 2011],where we take R(x) = kxk
2
2
,and the
normof the linear cost functions is bounded by kq
t
k
2
 1,as is the diameter of K - the
ball in our case.Notice that the solution of the above optimization problemis simply:
x
t+1
=
y
t+1
maxf1;ky
t+1
kg
;y
t+1
=

P
t
=1
q

p
2T
Strongly convex loss functions.The following Lemma is essentially due to [Hazan
et al.2007].
For H 2 R with H > 0,a function f:R
d
!R is H-strongly convex in B if for all
x 2 B,all the eigenvalues of r
2
f(x) are at least H.
LEMMA A.3 (OGDSTRICTLYCONVEX).Consider a set of H-strongly convex func-
tions f
1
;:::;f
T
such that the normof their gradients is bounded over the unit ball B by
G  max
t
max
x2B
krf
t
(x)k.Let x
0
2 B,and ~x
t+1
x
t

1
t
rf
t
(x
t
);x
t+1

~x
t+1
maxf1;k~x
t+1
kg
.
Then
T
X
t=1
f
t
(x
t
)  min
kxk
2
1
T
X
t=1
f
t
(x) 
G
2
H
log T:
This is true even if each f
t
is dependent on x
1
;:::;x
t1
.
For the MEB application and its relatives it is easier to implement the lazy versions
in the streaming model.The following Lemma is the analogous tool we need:
LEMMA A.4.Consider a set of H-strongly convex functions f
1
;:::;f
T
such that the
norm of their gradients is bounded over the unit ball B by G  max
t
max
x2B
krf
t
(x)k.
Let
x
t+1
arg min
x2B
(
t
X
=1
f

(x)
)
Then
T
X
t=1
f
t
(x
t
) min
x2B
T
X
t=1
f
t
(x) 
2G
2
H
log T:
This is true even if each f
t
is dependent on x
1
;:::;x
t1
.
PROOF.
By Lemma 2.3 in [Hazan 2011] we have:
T
X
t=1
f
t
(x
t
)  min
kxk
2
1
T
X
t=1
f
t
(x) 
X
t
[f
t
(x
t
) f
t
(x
t+1
)]
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:42 Clarkson,Hazan and Woodruff
Denote by 
t
(x) =
P
t
=1
f

.Then by Taylor expansion at x
t+1
,there exists a z
t
2
[x
t+1
;x
t
] for which

t
(x
t
) = 
t
(x
t+1
) +(x
t
x
t+1
)
>
r
t
(x
t+1
) +
1
2
kx
t
x
t+1
k
2
z
t
 
t
(x
t+1
) +
1
2
kx
t
x
t+1
k
2
z
t
;
using the notation kyk
2
z
= y
>
r
2

t
(z)y.The inequality above is true because x
t+1
is a
minimumof 
t
over K.Thus,
kx
t
x
t+1
k
2
z
t
 2 
t
(x
t
) 2 
t
(x
t+1
)
= 2 (
t1
(x
t
) 
t1
(x
t+1
)) +2[f
t
(x
t
) f
t
(x
t+1
)]
 2[f
t
(x
t
) f
t
(x
t+1
)] optimality of x
t
 2rf
t
(x
t
)
>
(x
t
x
t+1
) convexity of f
t
:
By convexity and Cauchy-Schwarz:
f
t
(x
t
) f
t
(x
t+1
)  rf
t
(x
t
)(x
t
x
t+1
)  krf
t
(x
t
)k

z
t
kx
t
x
t+1
k
z
t
 krf
t
(x
t
)k

z
t
q
2 rf
t
(x
t
)
>
(x
t
x
t+1
)
Shifting sides and squaring,we get
f
t
(x
t
) f
t
(x
t+1
)  rf
t
(x
t
)(x
t
x
t+1
)  2krf
t
(x
t
)k
 2
z
t
Since f
t
are assumed to be H-strongly convex,we have k  k
z
 k  k
Ht
,and hence for
the dual norm,
f
t
(x
t
) f
t
(x
t+1
)  2krf
t
(x
t
)k
 2
z
t
 2
krf
t
(x
t
)k
2
2
Ht

2G
2
Ht
Summing over all iterations we get
T
X
t=1
f
t
(x
t
)  min
kxk
2
1
T
X
t=1
f
t
(x) 
X
t
[f
t
(x
t
) f
t
(x
t+1
)] 
X
t
2G
2
Ht

2G
2
H
log T
Combining sampling and regret minimization
LEMMA A.5.Consider a set of H-strongly convex functions f
1
;:::;f
T
such that the
normof their gradients is bounded over the unit ball by G  max
t
max
x2B
krf
t
(x)k.Let
y
t+1

8
>
<
>
:
arg min
x2B
n
P
t
=1
f

(x)
o
w.p.
y
t
o=w
Then for a fixed x

we have
E[
T
X
t=1
f
t
(y
t
) 
T
X
t=1
f
t
(x

)] 
1

2G
2
H
log T:
This is true even if each f
t
is dependent on y
1
;:::;y
t1
.
PROOF.
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
Sublinear Optimization for Machine Learning 0:43
Consider the sequence of functions
~
f
t
defined as
~
f
t

8
<
:
f
t

w.p.
0 o=w
Where 0 denotes the all-zero function.Then the algorithmfromLemma A.4 applied to
the functions
~
f
t
is exactly the algorithmwe apply above to the functions f
t
.Notice that
the functions
~
f
t
are
H

-strongly convex,and in addition their gradients are bounded by
G

.Hence applying Lemma A.4 we obtain
E[
T
X
t=1
f
t
(y
t
) 
T
X
t=1
f
t
(x

)] = E[
T
X
t=1
~
f
t
(x
t
) 
T
X
t=1
~
f
t
(x

)] 
1

2G
2
H
log T:
B.AUXILIARY LEMMAS
First,some simple lemmas about randomvariables.
LEMMA B.1.Let X be a randomvariable,let

X = clip(X;C) = minfC;maxfC;Xgg
and assume that j E[X]j  C=2 for some C > 0.Then


E[

X] E[X]



2Var[X]
C
:
PROOF.As a first step,note that for x > C we have x E[X]  C=2,so that
C(x C)  2(x E[X])(x C)  2(x E[X])
2
:
Hence,we obtain
E[X] E[

X] =
Z
x<C
(x +C)d
X
+
Z
x>C
(x C)d
X

Z
x>C
(x C)d
X

2
C
Z
x>C
(x E[X])
2
d
X

2
C
Var[X]:
Similarly one can prove that E[X] E[

X]  2Var[X]=C,and the result follows.
LEMMA B.2.For randomvariables X and Y,and  2 [0;1],
E[(X +(1 )Y )
2
]  maxfE[X
2
];E[Y
2
]g:
This implies by induction that the second moment of a convex combination of random
variables is no more than the maximumof their second moments.
PROOF.By convexity of the quadratic function (Jensen’s inequality) we have
E[(X +(1 )Y )
2
]  E[X
2
+(1 )Y
2
]
= E[X
2
] +(1 ) E[Y
2
]
 maxfE[X
2
];E[Y
2
]g:
Journal of the ACM,Vol.0,No.0,Article 0,Publication date:January 2012.
0:44 Clarkson,Hazan and Woodruff
B.1.Martingale and concentration lemmas
The Bernstein inequality,that holds for randomvariables Z
t
;t 2 [T] that are indepen-
dent,and such that for all t,E[Z
t
] = 0,E[Z
2
t
]  s,and jZ
t
j  V,states
log Probf
X
t2[T]
Z
t
 g  
2
=2(Ts +V=3) (31)
Here we need a similar bound for random variables which are not independent,but
form a martingale with respect to a certain filtration.Many concentration results
have been proven for Martingales,including somewhere,in all likelihood,the present
lemma.However,for clarity and completeness,we will outline how the proof of the
Bernstein inequality can be adapted to this setting.
LEMMA B.3.Let fZ
t
g be a martingale difference sequence with respect to filtration
fS
t
g,such that E[Z
t
jS
1
;:::;S
t
] = 0.Assume the filtration fS
t
g is such that the values
in S
t
are determined using only those in S
t1
,and not any previous history,and so the
joint probability distribution
ProbfS
1
= s
1
;S
2
= s
2
;:::;S
T
= s
t
g =
Y
t2[T1]
ProbfS
t+1
= s
t+1
j S
t
= s
t
g;
In addition,assume for all t,E[Z
2
t
jS
1
;:::;S
t
]  s,and jZ
t
j  V.Then
log Probf
X
t2T
Z
t
 g  
2
=2(Ts +V=3):
PROOF. A key step in proving the Bernstein inequality is to show an upper bound on the exponential generating function $\mathrm{E}[\exp(\gamma Z)]$, where $Z \triangleq \sum_t Z_t$ and $\gamma > 0$ is a parameter to be chosen. This step is where the hypothesis of independence is applied. In our setting we can show a similar upper bound on this expectation. Let $\mathrm{E}_t[\cdot]$ denote expectation with respect to $S_t$, and $\mathrm{E}_{[T]}$ denote expectation with respect to $S_t$ for $t \in [T]$. The above expression for the joint probability distribution implies that for any real-valued function $f$ of state tuples $S_t$,
\[
\mathrm{E}_{[T]}\Big[\prod_{t\in[T]} f(S_t)\Big]
= f(s_1)\int_{s_2,\ldots,s_T}\Big[\prod_{t\in[T-1]} f(s_{t+1})\Big]\Big[\prod_{t\in[T-1]}\mathrm{Prob}\{S_{t+1}=s_{t+1}\mid S_t=s_t\}\Big]
\]
\[
= f(s_1)\int_{s_2,\ldots,s_{T-1}}\Big[\prod_{t\in[T-2]} f(s_{t+1})\Big]\Big[\prod_{t\in[T-2]}\mathrm{Prob}\{S_{t+1}=s_{t+1}\mid S_t=s_t\}\Big]\int_{s_T} f(s_T)\,\mathrm{Prob}\{S_T=s_T\mid S_{T-1}=s_{T-1}\},
\]
where the inner integral can be denoted as the conditional expectation $\mathrm{E}_T[f(S_T)\mid S_{T-1}]$. By induction this is
\[
f(s_1)\int_{s_2} f(s_2)\,\mathrm{Prob}\{S_2=s_2\mid S_1=s_1\}\Big[\int_{s_3}\cdots\Big[\int_{s_T} f(s_T)\,\mathrm{Prob}\{S_T=s_T\mid S_{T-1}=s_{T-1}\}\Big]\cdots\Big],
\]
and by writing the constant $f(s_1)$ as an expectation with respect to the constant $S_0 = s_0$, and using $\mathrm{E}_X[\mathrm{E}_X[Y]] = \mathrm{E}_X[Y]$ for any random variables $X$ and $Y$, we can write this as
\[
\mathrm{E}_{[T]}\Big[\prod_{t\in[T]} f(S_t)\Big] = \mathrm{E}_{[T]}\Big[\prod_{t\in[T]}\mathrm{E}_t[f(S_t)\mid S_{t-1}]\Big].
\]
For a given $\gamma\in\mathbb{R}$ we take $f(S_1)=1$ and $f(S_t)\triangleq\exp(\gamma Z_{t-1})$, to obtain
\[
\mathrm{E}_{[T]}\Big[\exp\Big(\gamma\sum_{t\in[T]} Z_t\Big)\Big] = \mathrm{E}_{[T]}\Big[\prod_{t\in[T]}\mathrm{E}_t[\exp(\gamma Z_t)\mid S_{t-1}]\Big].
\]
Now for any random variable $X$ with $\mathrm{E}[X]=0$, $\mathrm{E}[X^2]\le s$, and $|X|\le V$,
\[
\mathrm{E}[\exp(\gamma X)] \le \exp\Big(\frac{s}{V^2}\big(e^{\gamma V}-1-\gamma V\big)\Big)
\]
(as is shown and used in proving Bernstein's inequality in the independent case), and therefore
\[
\mathrm{E}_{[T]}[\exp(\gamma Z)] \le \mathrm{E}_{[T]}\Big[\prod_{t\in[T]}\exp\Big(\frac{s}{V^2}\big(e^{\gamma V}-1-\gamma V\big)\Big)\Big] = \exp\Big(T\,\frac{s}{V^2}\big(e^{\gamma V}-1-\gamma V\big)\Big),
\]
where $Z\triangleq\sum_{t\in[T]} Z_t$. This bound is the same as the one obtained for independent $Z_t$, and so the remainder of the proof is exactly as in the independent case: Markov's inequality is applied to the random variable $\exp(\gamma Z)$, obtaining
\[
\mathrm{Prob}\{Z\ge\lambda\} \le \exp(-\gamma\lambda)\,\mathrm{E}_{[T]}[\exp(\gamma Z)] \le \exp\Big(-\gamma\lambda + T\,\frac{s}{V^2}\big(e^{\gamma V}-1-\gamma V\big)\Big),
\]
and the value $\gamma = \frac{1}{V}\log(1+\lambda V/Ts)$ is chosen to minimize the bound, yielding
\[
\mathrm{Prob}\{Z\ge\lambda\} \le \exp\Big(-\frac{Ts}{V^2}\big((1+\kappa)\log(1+\kappa)-\kappa\big)\Big),
\]
where $\kappa\triangleq\lambda V/Ts$; finally, the inequality $(1+\kappa)\log(1+\kappa)-\kappa \ge \frac{\kappa^2/2}{1+\kappa/3}$, valid for $\kappa\ge 0$, is applied.
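The proof above reduces to two scalar facts: the moment-generating-function bound for a single bounded, zero-mean variable, and the elementary inequality applied in the last step. The following sketch checks both numerically on a grid; it is only a sanity check under an arbitrarily chosen two-point distribution, not part of the argument.

```python
# Sanity check of the two scalar facts used above:
# (i)  E[exp(g*X)] <= exp((s/V^2)*(exp(g*V) - 1 - g*V)) for a bounded,
#      zero-mean X with E[X^2] <= s and |X| <= V;
# (ii) (1+k)*log(1+k) - k >= (k^2/2)/(1 + k/3) for k >= 0.
# The two-point distribution for X is an arbitrary illustrative choice.
import numpy as np

# (i) X = -0.5 w.p. 2/3 and X = 1 w.p. 1/3: mean 0, E[X^2] = 0.5, |X| <= 1.
vals, probs = np.array([-0.5, 1.0]), np.array([2/3, 1/3])
s, V = 0.5, 1.0
for g in np.linspace(0.1, 5.0, 50):
    mgf = np.sum(probs * np.exp(g * vals))
    bound = np.exp((s / V**2) * (np.exp(g * V) - 1 - g * V))
    assert mgf <= bound + 1e-12

# (ii) the elementary inequality applied in the last step of the proof
k = np.linspace(0.0, 50.0, 10001)
assert np.all((1 + k) * np.log1p(k) - k >= (k**2 / 2) / (1 + k / 3) - 1e-12)
print("both scalar ingredients hold on the tested grids")
```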
B.2. Proof of lemmas used in main theorem
We restate and prove Lemmas 2.4, 2.5, and 2.6 in slightly more general form. In the following we only assume that $v_t(i) = \mathrm{clip}(\tilde v_t(i), \tfrac{1}{\epsilon})$ is the clipping of a random variable $\tilde v_t(i)$ (hence the parameter $C$ of Lemma B.1 is $C = \tfrac{1}{\epsilon} \gg 1$). The variance of $\tilde v_t(i)$ is at most one, $\mathrm{Var}[\tilde v_t(i)] \le 1$, and we denote $\mu_t(i) = \mathrm{E}[\tilde v_t(i)]$. We also assume that the expectations of $\tilde v_t(i)$ are bounded by an absolute constant, $|\mu_t(i)| \le \rho$, with $1 \le \rho \le \tfrac{1}{\epsilon}$. This constant is one for the perceptron application, but at most two for MEB. Note that since the variance of $\tilde v_t(i)$ is bounded by one, so is the variance of its clipping $v_t(i)$.³
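The following sketch (illustrative only) makes the clipping claim concrete: clipping at $1/\epsilon$ cannot increase the second moment, and, since the mean lies inside the clipping interval, cannot increase the variance either. The distribution of $\tilde v$ below is an arbitrary choice, deliberately heavier-tailed than the variance-one assumption above so that the clipping is actually active.

```python
# Empirical illustration of the clipping claims: clipping to [-1/eps, 1/eps]
# cannot increase the second moment, and (since the mean lies inside the
# clipping interval) cannot increase the variance.  The distribution of v~
# is an arbitrary illustrative choice.
import numpy as np

def clip(v, c):
    """Truncate v entrywise to the interval [-c, c]."""
    return np.clip(v, -c, c)

rng = np.random.default_rng(1)
eps = 0.1
v_tilde = rng.normal(loc=0.8, scale=1.0, size=1_000_000)
v_tilde[rng.random(v_tilde.size) < 1e-3] *= 50.0     # rare heavy values
v = clip(v_tilde, 1.0 / eps)

print("second moment:", np.mean(v**2), "<=", np.mean(v_tilde**2))
print("variance     :", np.var(v), "<=", np.var(v_tilde))
assert np.mean(v**2) <= np.mean(v_tilde**2)
assert np.var(v) <= np.var(v_tilde)
```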
LEMMA B.4. For $\sqrt{\frac{\log n}{T}} \le \epsilon \le \frac{1}{\rho}$, with probability at least $1 - O(1/n)$,
\[
\max_i \sum_{t\in[T]} [v_t(i) - \mu_t(i)] \le 6\epsilon T.
\]
³ This follows from the fact that the second moment only decreases under the clipping operation, together with the definition of the variance as $\mathrm{Var}(v_t(i)) = \min_z \mathrm{E}[(v_t(i) - z)^2]$. We can use $z = \mathrm{E}[\tilde v_t(i)]$, and hence the decrease in second moment suffices.
PROOF. Lemma B.1 implies that $|\mathrm{E}[v_t(i)] - \mu_t(i)| \le 2\epsilon$, since $\mathrm{Var}[\tilde v_t(i)] \le 1$.
We show that for a given $i \in [n]$, with probability $1 - O(1/n^2)$, $\sum_{t\in[T]} [v_t(i) - \mathrm{E}[v_t(i)]] \le 4\epsilon T$, and then apply the union bound over all $i \in [n]$. Together with the above bound on $|\mathrm{E}[v_t(i)] - \mu_t(i)|$, this implies the lemma via the triangle inequality.
Fixing $i$, let $Z^i_t \triangleq v_t(i) - \mathrm{E}[v_t(i)]$, and consider the filtration given by
\[
S_t \triangleq (x_t, p_t, w_t, y_t, v_{t-1}, i_{t-1}, j_{t-1}, v_{t-1} - \mathrm{E}[v_{t-1}]).
\]
Using the notation $\mathrm{E}_t[\cdot] = \mathrm{E}[\cdot \mid S_t]$, observe that:
(1) For all $t$: $\mathrm{E}_t[(Z^i_t)^2] = \mathrm{E}_t[v_t(i)^2] - \mathrm{E}_t[v_t(i)]^2 = \mathrm{Var}(v_t(i)) \le 1$.
(2) $|Z^i_t| \le 2/\epsilon$. This holds since by construction $|v_t(i)| \le 1/\epsilon$, and hence
\[
|Z^i_t| = |v_t(i) - \mathrm{E}[v_t(i)]| \le |v_t(i)| + |\mathrm{E}[v_t(i)]| \le \frac{2}{\epsilon}.
\]
Using these conditions, despite the fact that the $Z^i_t$ are not independent, we can apply Lemma B.3 and conclude that $Z \triangleq \sum_{t\in[T]} Z^i_t$ satisfies the Bernstein-type inequality with $s = 1$ and $V = 2/\epsilon$:
\[
\log \mathrm{Prob}\{Z \ge \lambda\} \le -\frac{\lambda^2}{2(Ts + \lambda V/3)} = -\frac{\lambda^2}{2(T + 2\lambda/3\epsilon)}.
\]
Hence for $\lambda = 4\epsilon T$, we have
\[
\log \mathrm{Prob}\{Z \ge 4\epsilon T\} \le -\frac{16}{2(1 + 8/3)}\,\epsilon^2 T \le -2\epsilon^2 T.
\]
For $\epsilon \ge \sqrt{\frac{\log n}{T}}$, the above probability is at most $e^{-2\log n} = \frac{1}{n^2}$.
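A small simulation of the deviation bounded in Lemma B.4 is sketched below. It is illustrative only: the estimates are drawn independently (a special case of the martingale setting allowed by the lemma), with arbitrary distributions satisfying $|\mu_t(i)| \le 1$ and unit variance, clipped at $1/\epsilon$.

```python
# Illustrative simulation of the deviation bounded in Lemma B.4.
import numpy as np

rng = np.random.default_rng(2)
n, T, eps = 500, 2000, 0.06
assert eps >= np.sqrt(np.log(n) / T)           # hypothesis of the lemma

mu = rng.uniform(-1.0, 1.0, size=n)            # |mu_t(i)| <= rho = 1 (same for all t)
v_tilde = mu[None, :] + rng.standard_normal((T, n))   # variance one around mu
v = np.clip(v_tilde, -1.0 / eps, 1.0 / eps)    # v_t(i) = clip(v~_t(i), 1/eps)

deviation = np.max(np.sum(v - mu[None, :], axis=0))
print(f"max_i sum_t [v_t(i) - mu_t(i)] = {deviation:.1f}")
print(f"bound 6*eps*T                  = {6 * eps * T:.1f}")
```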
Lemma 2.5 can be restated in the following more general form:
LEMMA B.5. For $\sqrt{\frac{\log n}{T}} \le \epsilon \le \frac{1}{\rho}$, with probability at least $1 - O(1/n)$, it holds that
\[
\Big| \sum_{t\in[T]} \mu_t(i_t) - \sum_t p_t^\top v_t \Big| \le 10\rho\epsilon T.
\]
It is a corollary of the following two lemmas:
LEMMA B.6. For $\epsilon \ge \sqrt{\frac{\log n}{T}}$, with probability at least $1 - O(1/n)$,
\[
\Big| \sum_{t\in[T]} p_t^\top v_t - \sum_t p_t^\top \mu_t \Big| \le 4\epsilon T.
\]
PROOF. This lemma is proven in essentially the same manner as Lemma 2.4; we give the proof for completeness.
Lemma B.1 implies that $|\mathrm{E}[v_t(i)] - \mu_t(i)| \le 2\epsilon$, using $\mathrm{Var}[\tilde v_t(i)] \le 1$. Since $p_t$ is a distribution, it follows that $|\mathrm{E}[p_t^\top v_t] - p_t^\top \mu_t| \le 2\epsilon$.
Let $Z_t \triangleq p_t^\top v_t - \mathrm{E}[p_t^\top v_t] = \sum_i p_t(i) Z^i_t$, where $Z^i_t = v_t(i) - \mathrm{E}[v_t(i)]$. Consider the filtration given by
\[
S_t \triangleq (x_t, p_t, w_t, y_t, v_{t-1}, i_{t-1}, j_{t-1}, v_{t-1} - \mathrm{E}[v_{t-1}]).
\]
Using the notation $\mathrm{E}_t[\cdot] = \mathrm{E}[\cdot \mid S_t]$, the quantities $|Z_t|$ and $\mathrm{E}_t[Z_t^2]$ can be bounded as follows:
\[
|Z_t| = \Big|\sum_i p_t(i) Z^i_t\Big| \le \sum_i p_t(i)\,|Z^i_t| \le 2\epsilon^{-1},
\]
using $|Z^i_t| \le 2\epsilon^{-1}$ as in Lemma 2.4. Also, using properties of the variance, we have
\[
\mathrm{E}[Z_t^2] = \mathrm{Var}[p_t^\top v_t] = \sum_i p_t(i)^2\,\mathrm{Var}(v_t(i)) \le \max_i \mathrm{Var}[v_t(i)] \le 1.
\]
We can now apply the Bernstein-type inequality of Lemma B.3, and continue exactly as in Lemma 2.4.
LEMMA B.7. For $\sqrt{\frac{\log n}{T}} \le \epsilon \le \frac{1}{\rho}$, with probability at least $1 - O(1/n)$,
\[
\Big| \sum_{t\in[T]} \mu_t(i_t) - \sum_t p_t^\top \mu_t \Big| \le 6\rho\epsilon T.
\]
PROOF. Let $Z_t \triangleq \mu_t(i_t) - p_t^\top \mu_t$, where now $\mu_t$ is a constant vector and $i_t$ is the random variable, and consider the filtration given by
\[
S_t \triangleq (x_t, p_t, w_t, y_t, v_{t-1}, i_{t-1}, j_{t-1}, Z_{t-1}).
\]
The expectation of $\mu_t(i_t)$, conditioned on $S_t$, with respect to the random choice of $i_t$, is $p_t^\top \mu_t$. Hence $\mathrm{E}_t[Z_t] = 0$, where $\mathrm{E}_t[\cdot]$ denotes $\mathrm{E}[\cdot \mid S_t]$. The parameters $|Z_t|$ and $\mathrm{E}[Z_t^2]$ can be bounded as follows:
\[
|Z_t| \le |\mu_t(i_t)| + |p_t^\top \mu_t| \le 2\rho,
\]
\[
\mathrm{E}[Z_t^2] = \mathrm{E}[(\mu_t(i_t) - p_t^\top \mu_t)^2] \le 2\,\mathrm{E}[\mu_t(i_t)^2] + 2(p_t^\top \mu_t)^2 \le 4\rho^2.
\]
Applying Lemma B.3 to $Z \triangleq \sum_{t\in[T]} Z_t$, with parameters $s \triangleq 4\rho^2$ and $V \triangleq 2\rho$, we obtain
\[
\log \mathrm{Prob}\{Z \ge \lambda\} \le -\frac{\lambda^2}{2(4\rho^2 T + 2\rho\lambda/3)}.
\]
Letting $\lambda \triangleq 6\rho\epsilon T$, we obtain
\[
\log \mathrm{Prob}\{Z \ge 6\rho\epsilon T\} \le -\frac{36\rho^2\epsilon^2 T^2}{2(4\rho^2 T + 4\rho^2\epsilon T)} \le -2\epsilon^2 T \le -2\log n,
\]
where the middle inequality uses $\epsilon \le 1$, and the last inequality holds assuming $\epsilon \ge \sqrt{\frac{\log n}{T}}$.
Finally, we prove Lemma 2.6 by a simple application of Markov's inequality:
LEMMA B.8. With probability at least $1 - \frac{1}{4}$ it holds that $\sum_t p_t^\top v_t^2 \le 8\rho^2 T$ (here $v_t^2$ denotes the entrywise square of $v_t$).
PROOF. By assumption, $\mathrm{E}[\tilde v_t(i)^2] = \mathrm{Var}(\tilde v_t(i)) + \mathrm{E}[\tilde v_t(i)]^2 \le 1 + \rho^2 \le 2\rho^2$, and hence $\mathrm{E}[v_t(i)^2] \le \mathrm{E}[\tilde v_t(i)^2] \le 2\rho^2$.
By linearity of expectation, we have $\mathrm{E}[\sum_t p_t^\top v_t^2] \le 2\rho^2 T$, and since the random variables $v_t^2$ are non-negative, applying Markov's inequality yields the lemma.
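A tiny illustration of this Markov-inequality step follows; all distributions are arbitrary choices, and the empirical exceedance frequency is typically far below the $1/4$ guaranteed by the lemma.

```python
# Illustrative simulation of the Markov-inequality step in Lemma B.8:
# means bounded by rho and unit variance, so E[v_t(i)^2] <= rho^2 + 1 <= 2*rho^2.
import numpy as np

rng = np.random.default_rng(3)
n, T, rho, trials = 50, 200, 2.0, 500

p = rng.dirichlet(np.ones(n), size=T)          # arbitrary distributions p_t
mu = rng.uniform(-rho, rho, size=(T, n))       # fixed means with |mu_t(i)| <= rho
exceed = 0
for _ in range(trials):
    v = mu + rng.standard_normal((T, n))       # variance one around the means
    stat = np.sum(p * v**2)                    # sum_t p_t' (entrywise square of v_t)
    exceed += stat >= 8 * rho**2 * T
print("empirical exceedance frequency:", exceed / trials, " (Markov bound: 1/4)")
```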
C. BOUNDED PRECISION
All algorithms in this paper can be implemented with bounded precision.
First, we observe that approximating both the training data and the vectors that are “played” does not increase the regret too much, in either of the settings we work in.
LEMMA C.1. Given a sequence of functions $f_1, \ldots, f_T$ and another sequence $\tilde f_1, \ldots, \tilde f_T$, all mapping $\mathbb{R}^d$ to $\mathbb{R}$, such that $|\tilde f_t(x) - f_t(x)| \le \epsilon_f$ for all $x \in B$ and $t \in [T]$, suppose $x_1, \ldots, x_T \in B$ is a sequence of regret $R$ against $\{\tilde f_t\}$, that is,
\[
\max_{x\in B} \sum_{t\in[T]} \tilde f_t(x) - \sum_{t\in[T]} \tilde f_t(x_t) \le R.
\]
Now suppose $\tilde x_1, \ldots, \tilde x_T \in \mathbb{R}^d$ is a sequence with $|f_t(\tilde x_t) - f_t(x_t)| \le \epsilon_x$ for all $t \in [T]$. Then
\[
\max_{x\in B} \sum_{t\in[T]} f_t(x) - \sum_{t\in[T]} f_t(\tilde x_t) \le R + T(\epsilon_x + 2\epsilon_f).
\]
PROOF. For $x \in B$, we have $\sum_{t\in[T]} f_t(x) \le \sum_{t\in[T]} \tilde f_t(x) + T\epsilon_f$, and
\[
\sum_{t\in[T]} f_t(\tilde x_t) \ge \sum_{t\in[T]} f_t(x_t) - T\epsilon_x \ge \sum_{t\in[T]} \tilde f_t(x_t) - T\epsilon_x - T\epsilon_f,
\]
and the result follows by combining these inequalities.
That is, $x_t$ is a sequence known to have small regret against the “training functions” $\tilde f_t(x)$, which are approximations to the true functions of interest, and the $\tilde x_t$ are approximations to these $x_t$. The lemma says that despite these approximations, the sequence $\tilde x_t$ has controllable regret against the true functions.
This lemma is stated in more generality than we need: all functions considered here have the form $f_t(x) = b_t + q_t^\top x + \gamma\|x\|^2$, where $|b_t| \le 1$, $q_t \in B$, and $|\gamma| \le 1$. Thus if $\tilde f_t(x) = \tilde b_t + \tilde q_t^\top x + \gamma\|x\|^2$, then the first condition $|\tilde f_t(x) - f_t(x)| \le \epsilon_f$ holds when $|b_t - \tilde b_t| + \|q_t - \tilde q_t\| \le \epsilon_f$. Also, the second condition $|f_t(\tilde x_t) - f_t(x_t)| \le \epsilon_x$ holds for such functions when $\|\tilde x_t - x_t\| \le \epsilon_x/3$.
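The following sketch checks Lemma C.1 numerically for functions of exactly this quadratic form. The sequences $f_t$, $\tilde f_t$, $x_t$, $\tilde x_t$ are random illustrative choices, and the quantities $R$, $\epsilon_f$, $\epsilon_x$ are measured directly from them.

```python
# Numerical check of Lemma C.1 for f_t(x) = b_t + q_t'x + gamma*||x||^2 over
# the unit ball B.  All sequences are random illustrative choices.  With
# gamma >= 0, the maximum over B of a sum of such functions is attained on
# the boundary in the direction of the summed linear terms.
import numpy as np

rng = np.random.default_rng(4)
d, T, gamma = 5, 50, 0.7

def unit_ball(v):
    nrm = np.linalg.norm(v)
    return v if nrm <= 1.0 else v / nrm

def max_over_ball(b_sum, q_sum, coeff):
    # max over ||x|| <= 1 of  b_sum + q_sum'x + coeff*||x||^2  (coeff >= 0)
    return b_sum + np.linalg.norm(q_sum) + coeff

def f(bb, qq, xx):
    return bb + qq @ xx + gamma * (xx @ xx)

b = rng.uniform(-1, 1, size=T)
q = np.array([unit_ball(rng.standard_normal(d)) for _ in range(T)])
b_tld = b + rng.uniform(-0.01, 0.01, size=T)                      # perturbed b_t
q_tld = q + rng.uniform(-0.01, 0.01, size=(T, d)) / np.sqrt(d)    # perturbed q_t

x = np.array([unit_ball(rng.standard_normal(d)) for _ in range(T)])            # any x_t in B
x_tld = np.array([unit_ball(xi + 1e-3 * rng.standard_normal(d)) for xi in x])  # nearby x~_t

# Regret R of the x_t against the f~_t, and the perturbation sizes, measured exactly.
R = max_over_ball(b_tld.sum(), q_tld.sum(axis=0), T * gamma) \
    - sum(f(b_tld[t], q_tld[t], x[t]) for t in range(T))
eps_f = np.max(np.abs(b - b_tld) + np.linalg.norm(q - q_tld, axis=1))
eps_x = max(abs(f(b[t], q[t], x_tld[t]) - f(b[t], q[t], x[t])) for t in range(T))

lhs = max_over_ball(b.sum(), q.sum(axis=0), T * gamma) \
    - sum(f(b[t], q[t], x_tld[t]) for t in range(T))
print(f"regret vs. true f_t: {lhs:.4f} <= R + T*(eps_x + 2*eps_f) = {R + T*(eps_x + 2*eps_f):.4f}")
assert lhs <= R + T * (eps_x + 2 * eps_f) + 1e-9
```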
LEMMA C.2. Given a sequence of vectors $q_1, \ldots, q_T \in \mathbb{R}^n$ with $\|q_t\|_\infty \le B$ for $t \in [T]$, and a sequence $\tilde q_1, \ldots, \tilde q_T \in \mathbb{R}^n$ such that $\|\tilde q_t - q_t\|_\infty \le \epsilon_q$ for all $t \in [T]$, suppose $p_1, \ldots, p_T \in \Delta$ is a sequence of regret $R$ against $\{\tilde q_t\}$, that is,
\[
\sum_{t\in[T]} p_t^\top \tilde q_t - \min_{p\in\Delta} \sum_{t\in[T]} p^\top \tilde q_t \le R.
\]
Now suppose $\tilde p_1, \ldots, \tilde p_T \in \mathbb{R}^n$ is a sequence with $\|\tilde p_t - p_t\|_1 \le \epsilon_p$ for all $t \in [T]$. Then
\[
\sum_{t\in[T]} \tilde p_t^\top q_t - \min_{p\in\Delta} \sum_{t\in[T]} p^\top q_t \le R + T(B\epsilon_p + 2\epsilon_q).
\]
PROOF. For $p \in \Delta$ we have $\sum_{t\in[T]} p^\top q_t \ge \sum_{t\in[T]} p^\top \tilde q_t - T\epsilon_q$, and
\[
\sum_{t\in[T]} \tilde p_t^\top q_t \le \sum_{t\in[T]} p_t^\top q_t + TB\epsilon_p \le \sum_{t\in[T]} p_t^\top \tilde q_t + TB\epsilon_p + T\epsilon_q.
\]
The proof follows by combining these inequalities.
Note that to have $\|\tilde p_t - p_t\|_1 \le \epsilon_p$, it is enough that the relative error of each entry of $\tilde p_t$ is at most $\epsilon_p$.
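An analogous numerical check of Lemma C.2 over the simplex is sketched below. All sequences are random illustrative choices; the minimum over the simplex of $p^\top z$ is just the smallest coordinate of $z$, so both regrets are exactly computable.

```python
# Numerical check of Lemma C.2 over the simplex (illustrative sequences).
import numpy as np

rng = np.random.default_rng(5)
n, T, B = 20, 100, 5.0

q = rng.uniform(-B, B, size=(T, n))                      # ||q_t||_inf <= B
q_tld = q + rng.uniform(-0.05, 0.05, size=(T, n))        # perturbed q~_t
p = rng.dirichlet(np.ones(n), size=T)                    # distributions p_t
p_tld = p * (1 + rng.uniform(-0.01, 0.01, size=(T, n)))  # entrywise relative error

eps_q = np.max(np.abs(q_tld - q))                        # max_t ||q~_t - q_t||_inf
eps_p = np.max(np.sum(np.abs(p_tld - p), axis=1))        # max_t ||p~_t - p_t||_1

R = np.sum(p * q_tld) - q_tld.sum(axis=0).min()          # regret of p_t against q~_t
lhs = np.sum(p_tld * q) - q.sum(axis=0).min()            # regret of p~_t against q_t
print(f"{lhs:.4f} <= R + T*(B*eps_p + 2*eps_q) = {R + T*(B*eps_p + 2*eps_q):.4f}")
assert lhs <= R + T * (B * eps_p + 2 * eps_q) + 1e-9
```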
The use of $\tilde q_t$ in place of $q_t$ (for either of the two lemmas) will be helpful for our kernelized algorithms (§5), where computation of the norms $\|y_t\|$ of the working vectors $y_t$ is a bottleneck; the above two lemmas imply that it is enough to compute such norms to within relative error $\epsilon$ or so.
C.1. Bit Precision for Algorithm 1
First, consider the bit precision needed for the OGD part of the algorithm. Let $\epsilon'$ denote a sufficiently small constant fraction of $\epsilon$, where the small constant is absolute. From Lemma C.1 and the discussion following it, we need only use the rows $A_i$ up to a precision that gives an approximation $\tilde A_i$ within Euclidean distance $\epsilon'$, and similarly for an approximation $\tilde x_t$ of $x_t$. For the latter, in particular, we need only compute $\|y_t\|$ to within relative error $\epsilon'$. Thus a per-entry precision of $\epsilon'/\sqrt{d}$ is sufficient.
We need $\|x_t\|$ for $\ell_2$ sampling; arithmetic relative error $\epsilon'/\sqrt{d}$ in the sampling procedure gives an estimate $\tilde v_t(i)$ for which $\mathrm{E}[\tilde v_t] = A\hat x_t$, where $\hat x_t$ is a vector within $O(\epsilon')$ Euclidean distance of $x_t$. We can thus charge this error to the OGD analysis, where $\hat x_t$ plays the role of the $\tilde x_t$ of Lemma C.1.
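For concreteness, the following sketch shows an $\ell_2$-sampling estimator of the kind discussed above and the effect of rounding $x_t$ to per-entry precision $\epsilon'/\sqrt{d}$: the estimator remains unbiased, but for the rounded vector $\hat x_t$, whose distance from $x_t$ is $O(\epsilon')$. This is an illustration only, not the paper's Algorithm 1 (the clipping and the OGD/MW steps are omitted).

```python
# Sketch of an ell_2-sampling estimator of A_i . x and the effect of rounding
# x to per-entry precision eps'/sqrt(d).  Illustration only.
import numpy as np

rng = np.random.default_rng(6)
d, eps_prime = 100, 0.01

A_i = rng.standard_normal(d) / np.sqrt(d)      # a row of A, roughly unit norm
x = rng.standard_normal(d) / np.sqrt(d)        # current iterate, roughly unit norm

def l2_sample_estimate(row, xvec, rng, samples=200_000):
    # Sample j with probability x(j)^2/||x||^2 and average row[j]*||x||^2/x(j):
    # each term has expectation row . xvec and second moment ||row||^2 * ||xvec||^2.
    probs = xvec**2 / np.dot(xvec, xvec)
    js = rng.choice(len(xvec), size=samples, p=probs)
    return np.mean(row[js] * np.dot(xvec, xvec) / xvec[js])

grid = eps_prime / np.sqrt(d)
x_hat = np.round(x / grid) * grid              # per-entry rounding, ||x_hat - x|| <= eps'/2

print("exact A_i.x          :", A_i @ x)
print("ell_2-sample estimate:", l2_sample_estimate(A_i, x, rng))
print("exact A_i.x_hat      :", A_i @ x_hat, "(mean of the estimator after rounding)")
print("||x_hat - x||        :", np.linalg.norm(x_hat - x), "<= eps' =", eps_prime)
```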
For the MW part of the algorithm, we observe that, due to the clipping step, if the initial computation of $\tilde v_t(i)$ in Line 9 is done with relative error $\epsilon\epsilon'/5$, then the computed value is within additive error $\epsilon'/5$ (the clipped values have magnitude at most $1/\epsilon$). Similar precision for the clipping implies that the computed value of $v_t(i)$, which takes the place of $\tilde q_t$ in Lemma C.2, is within $\epsilon'/5$ of the exact version, corresponding to $q_t$ in the lemma. Here the $B$ of the lemma, bounding $\|q_t\|_\infty$, is $1/\epsilon$, due to the clipping.
It remains to determine the arithmetic relative error needed in the update step, Line 11, to keep the relative error of the computed value of $p_t$, that is, the $\tilde p_t$ of Lemma C.2, small enough. Indeed, if this relative error is a small enough constant fraction of $\epsilon\epsilon'/T$, then the relative error of all the updates together can be kept at most $\epsilon\epsilon'/3$. Thus $\epsilon_p \le \epsilon\epsilon'/3$ and $\epsilon_q \le \epsilon'/3$, and the added regret due to arithmetic error is at most $\epsilon' T$.
Summing up: the arithmetic precision needed is at most on the order of
\[
\log\big(1/\min\{\epsilon'/\sqrt{d},\ \epsilon\epsilon',\ \epsilon\epsilon'/T\}\big) = O(\log(nd/\epsilon))
\]
bits, to obtain a solution with additive regret at most $\epsilon T/10$ over the solution computed using exact arithmetic. This implies an additional error of $\epsilon/10$ in the computed solution, and thus changes only constant factors in the algorithm.
C.2. Bit Precision for Convex Quadratic Programming
From the remarks following Lemma C.1, the conditions of that lemma hold in the setting of convex quadratic programming over the simplex, assuming that every $A_i \in B$. Thus the discussion of §C.1 carries over, up to constants, with the simplification that the computation of $\|y_t\|$ is not needed.