Machine learning with quantum relative entropy

Koji Tsuda
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, Tübingen, 72076 Germany
E-mail: koji.tsuda@tuebingen.mpg.de
Abstract. Density matrices are a central tool in quantum physics, but they are also used in machine learning. A positive definite matrix called the kernel matrix is used to represent the similarities between examples. Positive definiteness assures that the examples are embedded in a Euclidean space. When a positive definite matrix is learned from data, one has to design an update rule that maintains the positive definiteness. Our update rule, called the matrix exponentiated gradient update, is motivated by the quantum relative entropy. Notably, the relative entropy is an instance of the Bregman divergences, which are asymmetric distance measures specifying theoretical properties of machine learning algorithms. Using the calculus commonly used in quantum physics, we prove an upper bound on the generalization error of online learning.
1. Introduction
Machine learning and quantum physics are quite different subjects that do not seem to be related to each other. In machine learning, the central aim is to develop a good algorithm to learn from data for predicting some important property of yet unseen objects. Figure 1 illustrates a pattern recognition task of predicting which category a shape belongs to (i.e., red or blue). Each shape is represented as a numerical vector. From a set of training examples, we need to learn a discriminant function f(x) which classifies x as positive if f(x) > 0, and otherwise as negative. Category labels of a set of test examples are predicted by the discriminant function. The two-class classification setting appears in many applications such as character recognition, speech recognition and gene expression profile analysis.
However, the recent rise of kernel methods [1] has brought the two subjects slightly closer. Kernel methods represent the similarity of n objects (e.g., shapes) as an n × n kernel matrix W. When the n objects include both training and test examples, all computations in learning and prediction are completely determined by W. By definition, W has to be positive semidefinite, W ⪰ 0; otherwise it causes a problem in learning (i.e., local minima in parameter optimization). Usually, the kernel matrix is determined a priori from the users' prior knowledge, but it is also possible to learn the matrix from data [2].
Learning is basically a parameter adjustment process. In batch learning, a set of training examples is given a priori, and the parameters are determined as the solution of an optimization problem. In online learning, on the other hand, we assume a stream of examples. At each trial, an example is given to the learning machine. A prediction for the example is computed based on the current parameters. Then the true value is revealed and a loss is incurred by comparing the prediction with the true value. Finally, the parameters are updated for better prediction in subsequent steps. The sum of losses over trials 1 to T is called the "total loss". A good learning algorithm has a good updating rule leading to small total loss. Gradient descent and exponentiated gradient descent are among the most popular algorithms [3]. However, they deal with numerical vectors, not positive semidefinite matrices.

Figure 1. A typical pattern recognition task.
Given density matrices W and W̃, i.e. W ⪰ 0, W̃ ⪰ 0, tr(W) = tr(W̃) = 1, the quantum relative entropy is defined as

Δ_F(W̃, W) = tr(W̃ log W̃ − W̃ log W).
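As an illustration, the divergence can be computed numerically via the eigendecompositions of the two density matrices. The following is a minimal NumPy sketch; the helper names are our own, not from the paper:

```python
import numpy as np

def logm_sym(A, eps=1e-12):
    """Matrix logarithm of a symmetric positive definite matrix
    via its eigendecomposition; eps guards against log(0)."""
    lam, V = np.linalg.eigh(A)
    return V @ (np.log(np.maximum(lam, eps))[:, None] * V.T)

def quantum_relative_entropy(W_tilde, W):
    """Delta_F(W_tilde, W) = tr(W_tilde log W_tilde - W_tilde log W)
    for density matrices (symmetric positive definite, trace one)."""
    return np.trace(W_tilde @ logm_sym(W_tilde) - W_tilde @ logm_sym(W))
```

The divergence vanishes when the two arguments coincide and is positive otherwise (Klein's inequality).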
This formulation was introduced by Umegaki [4] and is regarded as a natural extension of the von Neumann entropy [4]. As kernel matrices are positive semidefinite, the quantum relative entropy can be used for computing the distance between kernel matrices. In designing an online learning algorithm, a distance measure is necessary. In each trial, the parameters are shifted to reduce the loss on an example, but they should not be moved too much in order not to forget the previously learned result. The updating rule is determined by the trade-off between the loss and the distance to the previous parameters. Using the quantum relative entropy, an updating rule for kernel matrices, called the "matrix exponentiated gradient update" [2], can be derived. It has the advantage that positive definiteness is preserved after updates.
To evaluate the algorithm theoretically, it is possible to build an upper bound on the total loss [3]. For matrix parameters, it is not straightforward to generalize the bounds for vectors due to the lack of commutativity (i.e., AB ≠ BA in general). However, it turns out that the bound can still be derived via the Golden–Thompson inequality,

tr(exp(A + B)) ≤ tr(exp(A) exp(B)),  (1)

which holds for arbitrary symmetric matrices A and B. We also need the following basic inequality for symmetric matrices, which may be seen as a generalization of Jensen's inequality applied to the matrix exponential.
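The Golden–Thompson inequality is easy to check numerically for random symmetric matrices. A quick NumPy sketch (`expm_sym` is our own eigendecomposition-based matrix exponential, not part of any library):

```python
import numpy as np

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via eigh."""
    lam, V = np.linalg.eigh(M)
    return V @ (np.exp(lam)[:, None] * V.T)

def sym(M):
    return (M + M.T) / 2

rng = np.random.default_rng(0)
A = sym(rng.standard_normal((5, 5)))
B = sym(rng.standard_normal((5, 5)))

# Golden-Thompson: tr(exp(A + B)) <= tr(exp(A) exp(B))
lhs = np.trace(expm_sym(A + B))
rhs = np.trace(expm_sym(A) @ expm_sym(B))
assert lhs <= rhs + 1e-9
```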
Lemma 1. If a symmetric matrix A ∈ R^{d×d} satisfies 0 ≺ A ⪯ I, then exp(ρ₁A + ρ₂(I − A)) ⪯ exp(ρ₁)A + exp(ρ₂)(I − A) for finite ρ₁, ρ₂ ∈ R.
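Lemma 1 can likewise be sanity-checked numerically: for a random A with eigenvalues in (0, 1], the difference between the right- and left-hand sides should be positive semidefinite. A small sketch of our own:

```python
import numpy as np

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via eigh."""
    lam, V = np.linalg.eigh(M)
    return V @ (np.exp(lam)[:, None] * V.T)

rng = np.random.default_rng(0)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(rng.uniform(0.05, 1.0, d)) @ Q.T   # 0 < A <= I
I = np.eye(d)
rho1, rho2 = 1.3, -0.7

lhs = expm_sym(rho1 * A + rho2 * (I - A))
rhs = np.exp(rho1) * A + np.exp(rho2) * (I - A)
# Lemma 1 asserts rhs - lhs is positive semidefinite
assert np.linalg.eigvalsh(rhs - lhs).min() >= -1e-9
```

Since A and I − A share an eigenbasis, the lemma here reduces to the scalar convexity of exp applied eigenvalue by eigenvalue.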
The rest of this paper is organized as follows. Section 2 prepares mathematical definitions concerning kernels and Bregman divergences. In Section 3, the online kernel learning problem is defined and the bound on the total loss is proven. Section 4 presents an experiment illustrating the tightness of the bound. Section 5 concludes the paper with a discussion.
2. Preliminaries
2.1. Kernels
Kernel methods process data based on a pairwise similarity function called the kernel function. Given two objects x, x′ ∈ X, the kernel function is written as w(x, x′). The domain X can be numerical vectors, strings, graphs, etc. [5]. For example, in pattern recognition with support vector machines, the discriminant function is described as

f(x) = Σ_{i=1}^ℓ α_i w(x, x_i),

where x_1, ..., x_ℓ are training examples and α denotes the weight parameters. If and only if w is positive semidefinite, the whole domain X can be embedded in a Hilbert space such that the kernel is preserved as the inner product (Figure 2). In the following, denote by W the matrix of kernel values involving all training and test examples.

Figure 2. Given a space X endowed with a kernel, a distance can be defined between points of X mapped to the feature space F associated with the kernel. This distance can be computed without explicitly knowing the mapping φ thanks to the kernel trick.
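As a concrete (hypothetical) instance, with a Gaussian RBF kernel the discriminant function is just a weighted sum of kernel evaluations against the training examples. The function names and the kernel choice below are our own illustration, not from the paper:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel w(x, z); gamma is a width parameter."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def discriminant(x, train_X, alpha, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i w(x, x_i); sign(f) gives the predicted class."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, train_X))
```

For example, with a training point at the origin weighted +1 and one at (2, 0) weighted −1, query points near the origin receive a positive score.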
In most cases, the kernel function is given a priori. However, there is a body of literature [6, 7] on the "meta"-problem of learning kernels. Sometimes it is possible to measure the similarity between some pairs of objects, but not all. In protein sequence comparison, the similarity of close homologs can be measured reliably by sequence alignment, but it is difficult to come up with a reasonable similarity measure for distant sequences. In such cases, a series of measurements

y_t = tr(W X_t),

where X_t is a sparse matrix, is given, and W is estimated based on the measurements. In the estimation, we have to make sure that W is positive semidefinite. In [7], the matrix is parameterized as W = X X^⊤ and X is optimized, but this parameterization introduces non-uniqueness of the solution and non-convexity. It will be shown in the next section that our learning rule can keep positive definiteness without any correction steps.
2.2. Bregman divergence
In machine learning, Bregman divergences are an important tool for defining asymmetric distances between parameters [8]. The Kullback–Leibler divergence, the Hellinger distance and the Euclidean distance are instances of Bregman divergences. In this section, it is shown that the quantum relative entropy is also an instance of the Bregman divergences. We found that Petz [9] recently pointed out this fact and discussed the relationship with the relative operator entropy.

If F is a real-valued strictly convex differentiable function on the parameter domain and f(W) := ∇_W F(W), then the Bregman divergence between two parameters W̃ and W is defined as

Δ_F(W̃, W) = F(W̃) − F(W) − tr((W̃ − W) f(W)^⊤).

Since F is strictly convex, Δ_F(W̃, W) is also strictly convex in its first argument. Furthermore, the gradient in the first argument has the following simple form:

∇_{W̃} Δ_F(W̃, W) = f(W̃) − f(W),

since ∇_A tr(AB) = B^⊤. The quantum relative entropy is obtained when F(W) = tr(W log W − W). The strict convexity of this function is well known [4]. Furthermore, ∇_W F(W) = f(W) = log W.
If W = Σ_i λ_i v_i v_i^⊤ is our notation for the eigenvalue decomposition, we can rewrite the divergence as

Δ_F(W̃, W) = Σ_i λ̃_i log λ̃_i − Σ_{i,j} λ̃_i log λ_j (ṽ_i^⊤ v_j)².  (2)

This divergence quantifies the difference in the eigenvalues as well as the eigenvectors. When both eigensystems are the same (i.e., ṽ_i = v_i), the divergence becomes the usual relative entropy between the eigenvalues, Δ_F(W̃, W) = Σ_i λ̃_i log(λ̃_i/λ_i).
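Equation (2) can be verified against the direct trace form of the divergence. A NumPy sketch with our own helper names:

```python
import numpy as np

def logm_sym(A):
    """Matrix logarithm of a symmetric positive definite matrix."""
    lam, V = np.linalg.eigh(A)
    return V @ (np.log(lam)[:, None] * V.T)

def divergence_trace(Wt, W):
    """Delta_F via tr(Wt log Wt - Wt log W)."""
    return np.trace(Wt @ logm_sym(Wt) - Wt @ logm_sym(W))

def divergence_eig(Wt, W):
    """Delta_F via the eigenvalue/eigenvector form of Eq. (2)."""
    lt, Vt = np.linalg.eigh(Wt)
    l, V = np.linalg.eigh(W)
    overlap = (Vt.T @ V) ** 2                       # (v~_i^T v_j)^2
    return (np.sum(lt * np.log(lt))
            - np.sum((lt[:, None] * np.log(l)[None, :]) * overlap))

def density(seed, d=4):
    """A random symmetric positive definite trace-one matrix."""
    A = np.random.default_rng(seed).standard_normal((d, d))
    K = A @ A.T + np.eye(d)
    return K / np.trace(K)

Wt, W = density(1), density(2)
assert np.isclose(divergence_trace(Wt, W), divergence_eig(Wt, W))
```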
3. Online Learning of Kernels
In this section, we consider online learning, which proceeds in trials. In the most basic form, the online algorithm produces a parameter W_t at trial t and then incurs a loss L_t(W_t). In this paper the parameters are square matrices in R^{d×d}. In the refined form, several actions occur in each trial: the algorithm first receives an instance X_t in some instance domain X. It then produces a prediction ŷ_t for the instance X_t based on the algorithm's current parameter matrix W_t and receives a label y_t in some labeling domain Y. Finally it incurs a real-valued loss L(ŷ_t, y_t) and updates its parameter matrix to W_{t+1}.

For example, in Section 3.2 we will analyze an online algorithm that predicts with ŷ_t = tr(W_t X_t) and is based on the loss L_t(W_t) = L(ŷ_t, y_t) = (ŷ_t − y_t)².
3.1. Matrix Exponentiated Gradient Update
In this section we only discuss updates at a high level and only consider the basic form of the online algorithm. We assume that L_t(W) is convex in the parameter W (for all t) and that the gradient ∇_W L_t(W) is a well-defined matrix in R^{d×d}. In the update we aim to solve the following problem (see e.g. [3, 10]):

W_{t+1} = argmin_W Δ_F(W, W_t) + η L_t(W),  (3)

where the convex function F defines the Bregman divergence and η is a non-negative learning rate. The update balances two conflicting goals: staying close to the old parameter W_t (as quantified by the divergence) and achieving small loss on the current labeled instance. The learning rate becomes a trade-off parameter.
Setting the gradient with respect to W of the objective in the argmin to zero, we obtain

W_{t+1} = f^{−1}(f(W_t) − η ∇_W L_t(W_{t+1})).  (4)

If we assume that f and f^{−1} preserve symmetry, then constraining W in (3) to be symmetric¹ changes the update to

W_{t+1} = f^{−1}(f(W_t) − η sym(∇_W L_t(W_{t+1}))),  (5)

where sym(X) = (X + X^⊤)/2.

¹ Note that square matrices with real eigenvalues are not closed under addition.
The above implicit update is usually not solvable in closed form. A common way to avoid this problem [3] is to approximate ∇_W L_t(W_{t+1}) by ∇_W L_t(W_t), leading to the following explicit update for the constrained case:

W_{t+1} = f^{−1}(f(W_t) − η sym(∇_W L_t(W_t))).
In the case of the quantum relative entropy, the functions f(W) = log W and f^{−1}(Q) = exp Q clearly preserve symmetry. When using this divergence we arrive at the following (explicit) update:

W_{t+1} = exp(log W_t − η sym(∇_W L_t(W_t))),  (6)

where W_t is symmetric positive definite, ∇_W L_t(W_t) may be any square matrix, the whole exponent is symmetric, and the result W_{t+1} is symmetric positive definite. We call this update the Unnormalized Matrix Exponentiated Gradient Update. Note that f(W) = log W maps symmetric positive definite matrices to arbitrary symmetric matrices, and after adding a scaled symmetrized gradient, the function f^{−1}(Q) = exp Q maps the symmetric exponent back to a symmetric positive definite matrix.
When the parameters are constrained to trace one, we arrive at the Matrix Exponentiated Gradient (MEG) Update, which generalizes the Exponentiated Gradient (EG) update of [3] to non-diagonal matrices:

W_{t+1} = (1/Z_t) exp(log W_t − η sym(∇_W L_t(W_t))),  (7)

where Z_t = tr(exp(log W_t − η sym(∇_W L_t(W_t)))) is the normalizing constant.
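A single MEG step (7) can be sketched with eigendecomposition-based matrix log/exp. The helper names are our own; for the quadratic loss of Section 3.2 the gradient is 2(tr(W_t X_t) − y_t) X_t:

```python
import numpy as np

def eig_fun(M, fun):
    """Apply a scalar function to a symmetric matrix via eigh."""
    lam, V = np.linalg.eigh(M)
    return V @ (fun(lam)[:, None] * V.T)

def sym(M):
    return (M + M.T) / 2

def meg_update(W, grad, eta, eps=1e-12):
    """One Matrix Exponentiated Gradient step, Eq. (7):
    W_{t+1} = exp(log W_t - eta * sym(grad)) / Z_t."""
    logW = eig_fun(W, lambda lam: np.log(np.maximum(lam, eps)))
    Q = eig_fun(logW - eta * sym(grad), np.exp)
    return Q / np.trace(Q)   # normalization keeps tr(W_{t+1}) = 1
```

By construction the result is symmetric positive definite with unit trace, with no projection or correction step needed.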
3.2. Relative Loss Bounds
In this section we prove a certain type of relative loss bound for the MEG update which generalizes the analogous known bounds for the EG algorithm to the non-diagonal case.

For the sake of simplicity we now restrict ourselves to the case where the algorithm predicts with ŷ_t = tr(W_t X_t) and the loss function is quadratic: L_t(W_t) = L(ŷ_t, y_t) := (ŷ_t − y_t)².
We begin with the definitions needed for the relative loss bounds. Let S = (X_1, y_1), ..., (X_T, y_T) denote a sequence of examples, where the instance matrices X_t ∈ R^{d×d} and the labels y_t ∈ R. For any symmetric positive semidefinite matrix U with tr(U) = 1, define its total loss as L_U(S) = Σ_{t=1}^T (tr(U X_t) − y_t)². The total loss of the online algorithm is L_MEG(S) = Σ_{t=1}^T (tr(W_t X_t) − y_t)². We prove a bound on the relative loss L_MEG(S) − L_U(S) that holds for any comparator parameter U. The proof generalizes a similar bound for the Exponentiated Gradient update (Lemmas 5.8 and 5.9 of [3]). The relative loss bound is derived in two steps: Lemma 2 upper bounds the relative loss for an individual trial in terms of the progress towards the comparator parameter U (as measured by the divergence). In the second step, Lemma 3 sums the bounds for individual trials to obtain a bound for a whole sequence.
Lemma 2. Let W_t be any symmetric positive definite matrix. Let X_t be any symmetric matrix whose eigenvalues have a range of at most r, i.e. λ_max(X_t) − λ_min(X_t) ≤ r. Assume W_{t+1} is produced from W_t by the MEG update with learning rate η, and let U be any symmetric positive semidefinite matrix. Then for any b > 0 and a = η = 2b/(2 + r²b):

a (y_t − tr(W_t X_t))² − b (y_t − tr(U X_t))² ≤ Δ_F(U, W_t) − Δ_F(U, W_{t+1}),

i.e., the MEG loss (scaled by a) minus the loss of the comparator U (scaled by b) is bounded by the progress towards U.
The proof is given in the Appendix. In the proof, we use the Golden–Thompson inequality (1) and the approximation of the matrix exponential (Lemma 1).
Lemma 3. Let S be any sequence of examples with positive symmetric matrices as instances and real labels, and let r be an upper bound on the range of eigenvalues of each instance matrix of S. Let W_1 and U be arbitrary symmetric positive definite initial and comparison matrices, respectively. Then for any c such that η = 2c/(r²(2 + c)),

L_MEG(S) ≤ (1 + c/2) L_U(S) + (1/2 + 1/c) r² Δ_F(U, W_1).  (8)
Proof. For maximum tightness of Lemma 2, a should be chosen as a = η = 2b/(2 + r²b). Let b = c/r², and thus a = 2c/(r²(2 + c)). Then the inequality of Lemma 2 is rewritten as

(2c/(2 + c)) (y_t − tr(W_t X_t))² − c (y_t − tr(U X_t))² ≤ r² (Δ_F(U, W_t) − Δ_F(U, W_{t+1})).

Adding the bounds for t = 1, ..., T, we get

(2c/(2 + c)) L_MEG(S) − c L_U(S) ≤ r² (Δ_F(U, W_1) − Δ_F(U, W_{T+1})) ≤ r² Δ_F(U, W_1),

which is equivalent to (8).
Assuming L_U(S) ≤ L_max and Δ_F(U, W_1) ≤ d_max, the bound (8) is tightest when c = r √(2 d_max/L_max). With this choice of c, we have

L_MEG(S) − L_U(S) ≤ r √(2 L_max d_max) + (r²/2) Δ_F(U, W_1).

In particular, if W_1 = (1/d) I, then

Δ_F(U, W_1) = log d − Σ_i λ_i log(1/λ_i) ≤ log d.

Additionally, when L_max = 0, the total loss of the algorithm is bounded by (r² log d)/2.
Note that the MEG algorithm generalizes the EG algorithm of [3]. In the case of linear regression, the square of a product of dual norms appears in the bounds for the EG algorithm: ||u||₁² X_∞². Here u is a parameter vector and X_∞ is an upper bound on the infinity norm of the instance vectors x_t. Note the correspondence with the above bound (which generalizes the bounds for EG to the non-diagonal case): the one-norm of the parameter vector is replaced by the trace, and the infinity norm by the maximum range of the eigenvalues. In practice, the bound is often too loose to have any practical value. Nevertheless, the bound is valuable for assessing the behaviour of the algorithm on worst-case data with absolutely no regularity.
4. Experiments
In this section, our technique is applied to learning a kernel matrix from a set of distance measurements. This application is not online per se, but it nevertheless shows that the theoretical bounds can be reasonably tight on natural data.

When K is a d × d kernel matrix among d objects, the entry K_ij characterizes the similarity between objects i and j. In the feature space, K_ij corresponds to the inner product between objects i and j, and thus the Euclidean distance can be computed from the entries of the kernel matrix [1]. In some cases, the kernel matrix is not given explicitly, but only a set of distance measurements is available. The data are represented either as (i) quantitative distance values (e.g., the distance between i and j is 0.75), or (ii) qualitative evaluations (e.g., the distance between i and j is small) [7, 11]. Our task is to obtain a positive definite kernel matrix which fits the given distance data well.
Figure 3. Numerical results of online learning. (Left) Total loss against the number of iterations. The dashed line shows the loss bound. (Right) Classification error of the nearest neighbor classifier using the learned kernel. The dashed line shows the error of the target kernel.
In the experiment, we consider the online learning scenario in which only one distance example is shown to the learner at each time step. The distance example at time t is described as {a_t, b_t, y_t}, which indicates that the squared Euclidean distance between objects a_t and b_t is y_t. Let us define a time-developing sequence of kernel matrices as {W_t}_{t=1}^T, and the corresponding points in the feature space as {x_ti}_{i=1}^d (i.e., (W_t)_ab = x_ta^⊤ x_tb). Then, the total loss incurred by this sequence is

Σ_{t=1}^T (||x_{t a_t} − x_{t b_t}||² − y_t)² = Σ_{t=1}^T (tr(W_t X_t) − y_t)²,

where X_t is a symmetric matrix whose (a_t, a_t) and (b_t, b_t) elements are 0.5, whose (a_t, b_t) and (b_t, a_t) elements are −0.5, and whose other elements are all zero. We consider a controlled experiment in which the distance examples are created from a known target kernel matrix. We used a 52 × 52 kernel matrix among gyrB proteins of bacteria (d = 52). This data set contains three bacteria species (see [12] for details). Each distance example is created by randomly choosing one element of the target kernel. The initial parameter was set as W_1 = (1/d) I. When the comparison matrix U is set to the target matrix, L_U(S) = 0 and L_max = 0, because all the distance examples are derived from the target matrix. Therefore we chose the learning rate η = 2, which minimizes the relative loss bound of Lemma 3. The total loss of the kernel matrix sequence obtained by the matrix exponentiated gradient update is shown in Figure 3 (left). In the plot, we have also shown the relative loss bound. The bound seems to give a reasonably tight performance guarantee; it is about twice the actual total loss. To evaluate the learned kernel matrix, the prediction accuracy for the bacteria species by the nearest neighbor classifier was calculated (Figure 3, right), where the 52 proteins are randomly divided into 50% training and 50% testing data. The value shown in the plot is the test error averaged over 10 different divisions. It took a large number of iterations (∼ 2 × 10⁵) for the error rate to converge to the level of the target kernel. In practice one can often increase the learning rate for faster convergence, but here we chose the small rate suggested by our analysis to check the tightness of the bound.
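The experiment can be reproduced in spirit with a small synthetic target kernel. This is a sketch under assumptions: a random trace-one target stands in for the gyrB kernel, with a smaller d and fewer iterations; the helper names are our own:

```python
import numpy as np

def eig_fun(M, fun):
    """Apply a scalar function to a symmetric matrix via eigh."""
    lam, V = np.linalg.eigh(M)
    return V @ (fun(lam)[:, None] * V.T)

rng = np.random.default_rng(0)
d = 8
G = rng.standard_normal((d, d))          # stand-in for the gyrB kernel
target = G @ G.T
target /= np.trace(target)               # trace-one comparator, so L_U(S) = 0

W = np.eye(d) / d                        # W_1 = (1/d) I
eta = 2.0                                # rate suggested by Lemma 3 when L_max = 0
total_loss = 0.0
for t in range(10000):
    a, b = rng.choice(d, size=2, replace=False)
    X = np.zeros((d, d))                 # X_t encoding the distance example
    X[a, a] = X[b, b] = 0.5
    X[a, b] = X[b, a] = -0.5
    y = np.trace(target @ X)             # squared feature-space distance
    yhat = np.trace(W @ X)
    total_loss += (yhat - y) ** 2
    grad = 2 * (yhat - y) * X            # X is already symmetric
    logW = eig_fun(W, lambda lam: np.log(np.maximum(lam, 1e-12)))
    Q = eig_fun(logW - eta * grad, np.exp)
    W = Q / np.trace(Q)                  # MEG update, Eq. (7)
```

Here r = 1 (each X_t has eigenvalues 1 and 0), so the accumulated `total_loss` can be compared against the (r² log d)/2 level discussed above.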
5. Discussion
In this paper, we have shown that online learning algorithms can be derived from the quantum relative entropy, and that an upper bound on the total loss can be derived. The main difficulty in deriving the bound was the non-commutativity of matrices, and quantum-statistical calculus such as the Golden–Thompson inequality was very effective in overcoming it. Since the introduction of the quantum entropy to machine learning in [2], several follow-up studies have appeared. We dealt with a full-rank matrix W, but if W is low-rank, it can represent a subspace in a high-dimensional space. Based on this idea, online learning algorithms for principal component analysis and subspace Winnow were proposed [13, 14]. They use updates very similar to those shown in this paper, and analogous relative loss bounds can be derived. Recently, [15] proposed to use the matrix updates to solve certain classes of semidefinite programs with promising results. We hope such attempts lead to increased communication between quantum physics and machine learning.
Appendix: Proof of Lemma 2
Let δ_t = −2η(tr(X_t W_t) − y_t). Then the right-hand side of the inequality of Lemma 2 is rewritten as

Δ_F(U, W_t) − Δ_F(U, W_{t+1}) = δ_t tr(U X_t) − log tr(exp(log W_t + δ_t sym(X_t))).

Therefore, the inequality of Lemma 2 is equivalent to f ≤ 0, where

f = log tr(exp(log W_t + δ_t sym(X_t))) − δ_t tr(U X_t) + a(y_t − tr(W_t X_t))² − b(y_t − tr(U X_t))².

Let us bound the first term. Due to the Golden–Thompson inequality (1), we have

tr(exp(log W_t + δ_t sym(X_t))) ≤ tr(W_t exp(δ_t sym(X_t))).  (A.1)
The right-hand side can be rewritten using exp(δ_t sym(X_t)) = exp(r_0 δ_t) exp(δ_t (sym(X_t) − r_0 I)). Using Jensen's inequality for matrices (Lemma 1), we have

exp(δ_t (sym(X_t) − r_0 I)) ⪯ I − ((sym(X_t) − r_0 I)/r)(1 − exp(r δ_t)).

Here 0 ≺ A ⪯ I for A = (sym(X_t) − r_0 I)/r, because r_0 I ≺ sym(X_t) ⪯ (r_0 + r)I by assumption. Since W_t is strictly positive definite, tr(W_t B) ≤ tr(W_t C) if B ⪯ C. So the right-hand side of (A.1) can be bounded as

tr(W_t exp(δ_t sym(X_t))) ≤ exp(r_0 δ_t) (1 − ((tr(W_t X_t) − r_0)/r)(1 − exp(r δ_t))),

where we used the assumption tr(W_t) = 1. We now plug this upper bound of the first term of f back into f and obtain f ≤ g, where
g = r_0 δ_t + log(1 − ((tr(W_t X_t) − r_0)/r)(1 − exp(r δ_t))) − δ_t tr(U X_t) + a(y_t − tr(W_t X_t))² − b(y_t − tr(U X_t))².  (A.2)
Let us define z = tr(U X_t) and maximize the upper bound (A.2) with respect to z. Solving ∂g/∂z = 0, we have z = y_t − δ_t/(2b) = y_t + η(tr(X_t W_t) − y_t)/b. Substituting this into (A.2), we obtain the upper bound g ≤ h, where

h = 2η r_0 (y_t − tr(X_t W_t)) + log(1 − ((tr(X_t W_t) − r_0)/r)(1 − exp(2ηr(y_t − tr(X_t W_t))))) − 2η y_t (y_t − tr(X_t W_t)) + (a + η²/b)(y_t − tr(X_t W_t))².

Using the upper bound log(1 − q(1 − exp p)) ≤ pq + p²/8 in the second term, we have

h ≤ ((y_t − tr(X_t W_t))² / (2b)) ((2 + r²b)η² − 4bη + 2ab).

It remains to show q = (2 + r²b)η² − 4bη + 2ab ≤ 0. We easily see that q is minimized for η = 2b/(2 + r²b), and that for this value of η we have q ≤ 0 if and only if a ≤ 2b/(2 + r²b).
References
[1] Schölkopf B and Smola A J 2002 Learning with Kernels (Cambridge, MA: MIT Press)
[2] Tsuda K, Rätsch G and Warmuth M 2005 Journal of Machine Learning Research 6 995–1018
[3] Kivinen J and Warmuth M 1997 Information and Computation 132(1) 1–63
[4] Nielsen M and Chuang I 2000 Quantum Computation and Quantum Information (New York, NY: Cambridge University Press)
[5] Schölkopf B, Tsuda K and Vert J, eds 2004 Kernel Methods in Computational Biology (Cambridge, MA: MIT Press)
[6] Shalev-Shwartz S, Singer Y and Ng A 2004 in Proceedings of the 21st International Conference on Machine Learning (Banff, Canada) pp 94–100
[7] Xing E, Ng A, Jordan M and Russell S 2003 in S Becker, S Thrun and K Obermayer, eds, Advances in Neural Information Processing Systems 15 (Cambridge, MA: MIT Press) pp 505–512
[8] Bregman L 1965 Dokl. Akad. Nauk SSSR 165 487–490
[9] Petz D 2007 Acta Mathematica Hungarica 116 127–131
[10] Kivinen J and Warmuth M K 2001 Machine Learning 45(3) 301–329
[11] Tsuda K and Noble W 2004 Bioinformatics 20(Suppl. 1) i326–i333
[12] Tsuda K, Akaho S and Asai K 2003 Journal of Machine Learning Research 4 67–81
[13] Warmuth M and Kuzmin D 2007 in Proceedings of the 24th International Conference on Machine Learning (ICML 07) (Corvallis, OR) pp 465–472
[14] Warmuth M 2007 in Proceedings of the 24th International Conference on Machine Learning (Corvallis, OR) pp 999–1006
[15] Arora S and Kale S 2007 in Annual ACM Symposium on Theory of Computing (San Diego, CA) pp 227–236