Using Two-Person, Zero-Sum, Games To Investigate On-line Learning Algorithms.


CMPS242 Project Report

Alastair Fyfe




Introduction


This project report describes two investigations only partly related to one another. The first two sections focus on a game-theoretic framework introduced by Freund and Schapire [3],[4] as an alternative view of on-line learning and boosting algorithms. The last section explores the application of randomized algorithms to the disk spin-down problem [2].


One of the striking features of this quarter's review of machine learning algorithms is the variety of alternative techniques available for solving classification/prediction problems. The relative merits of competing algorithms are commonly assessed by (a) comparison of demonstrable error bounds and (b) performance on real-world data sets. Both approaches are important but have drawbacks. Proving an upper bound on error rate does not give any information about how far below this rate an algorithm will perform on actual data. Observing a difference in error rate on a particular real-world data set does not necessarily shed light on its cause: often it is hard to explain why one algorithm outperforms another. A third approach involves the construction of small, synthetic data sets which can maximize the error difference between competing algorithms. Careful analysis of how algorithms interact with such data sets may help to illuminate their differences.


The first section of this report describes a framework introduced by Freund and Schapire for repeatedly playing rounds of a two-person zero-sum game. Both the weighted-majority and boosting algorithms can be transformed into this framework. Because the minmax solution of a two-person zero-sum game can be computed via linear programming, this framework also establishes a link between on-line learning and approximate solutions of linear programming problems. The discussion in Freund and Schapire's papers is focused on the use of this framework as a novel device for investigating the theoretical properties of algorithms. In this project I wanted to investigate whether game matrices could also be used as data sets to characterize differences between algorithms and parameter settings. The results are given in section two.


The third section considers a different problem. Randomized algorithms can be used in on-line prediction problems such as deciding whether to block or spin a thread while waiting for a lock [1], or whether to shut down or keep spinning a disk drive once it becomes idle [2]. In [1] Karlin et al. showed that no randomized algorithm can achieve a cost ratio better than e/(e-1) ~ 1.58 over the optimal, omniscient algorithm, and also provided a simple randomized algorithm with a continuous probability density that achieves this bound. Interestingly, the bound can be closely approximated by using a multinomial over a small number of fixed values. The technique for choosing coefficients for this multinomial is discussed for both cost ratio and cost difference comparisons.


I) The Game-Theoretic Framework for Online Learning and Boosting.


In [3] and [4] Freund and Schapire set out a connection between on-line learning algorithms, game theory and boosting. The crux of their idea is that both on-line learning and boosting can be cast in the context of learning to play repeated rounds of a non-cooperative two-person game. Such games have well-characterized properties, including an optimal minmax solution which can be computed by linear programming.


In the framework introduced in [3], games are played as follows. The game is characterized by a payoff matrix M with entries in [0,1]. The row player, sometimes referred to as the learner, does not have prior knowledge of M and plays a mixed strategy P over the rows of M. The column player, sometimes referred to as the environment, knows M, knows the learner's current mixed strategy and plays a mixed strategy Q. The cost to the learner of playing P when the environment plays Q is

    Σ_i Σ_j P(i) M(i,j) Q(j) = P^T M Q

which is also represented as M(P,Q).


If the environment chooses a fixed strategy j, the cost to the learner is Σ_i P(i) M(i,j), which is also represented as M(P,j). The learner maintains weights on the rows of M which are used to calculate P. He starts by initializing all weights to 1. Play then proceeds in a sequence of rounds t = 1..T as follows:


1. the learner selects a mixed strategy P_t, computed as P_t(i) = w_t(i) / Σ_i w_t(i)

2. the environment selects a mixed column strategy Q_t

3. the learner is told M(i,Q_t) = Σ_j M(i,j) Q_t(j) for all i. This is the expected loss it would have incurred had it chosen to only play row i in the face of the environment's current mixed strategy Q_t. Interestingly, the learner is never told the actual contents of the game matrix, only the expected loss that results from application of the opponent's strategy.

4. the learner's total loss is incremented by M(P_t,Q_t)

5. the learner calculates new weights with a simple multiplicative update: w_{t+1}(i) = w_t(i) β^{M(i,Q_t)}, where β is a given parameter.
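As a concrete illustration, the round structure above can be sketched in Python (a minimal sketch: the environment here simply replays one fixed mixed strategy each round, and the β value and test strategies are illustrative, not taken from the experiments in this report):

```python
def play_rounds(M, beta, env_strategy, T):
    """Run T rounds of the multiplicative-weights protocol above.

    M            -- payoff matrix (list of rows) with entries in [0, 1]
    beta         -- update parameter in (0, 1)
    env_strategy -- the column player's mixed strategy Q (kept fixed here)
    T            -- number of rounds
    Returns the learner's total loss and final mixed strategy P.
    """
    n_rows = len(M)
    w = [1.0] * n_rows          # all weights start at 1
    total_loss = 0.0
    for _ in range(T):
        # step 1: learner's mixed strategy from normalized weights
        weight_sum = sum(w)
        P = [wi / weight_sum for wi in w]
        # step 2: environment plays Q_t (a fixed Q in this sketch)
        Q = env_strategy
        # step 3: learner observes M(i, Q_t) for every row i
        row_losses = [sum(M[i][j] * Q[j] for j in range(len(Q)))
                      for i in range(n_rows)]
        # step 4: total loss is incremented by M(P_t, Q_t)
        total_loss += sum(P[i] * row_losses[i] for i in range(n_rows))
        # step 5: multiplicative update w_{t+1}(i) = w_t(i) * beta**M(i, Q_t)
        w = [w[i] * beta ** row_losses[i] for i in range(n_rows)]
    return total_loss, P

# Rock, Paper, Scissors matrix from section II; against the uniform Q*,
# every row loses exactly 1/2 per round, so 100 rounds cost 50.
RPS = [[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]]
loss, P = play_rounds(RPS, beta=0.9, env_strategy=[1/3, 1/3, 1/3], T=100)
```

Against the uniform column strategy every row incurs the same expected loss, so the weights stay uniform and the update has no effect; asymmetric matrices like those in section II are needed to make the update visible.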


Freund and Schapire go on to show that the loss bounds established in [6] for the Weighted Majority algorithm transfer to this framework. By use of this algorithm, the learner's expected loss can be brought arbitrarily close to the best loss that would have been realized if the learner had known the environment's strategies Q_t for t = 1..T.


The standard framework for on-line learning can be reduced to this variant of repeated game playing with the following extensions. Along with M we are now given an instance space X, a hypothesis space H and a target concept c. M is taken to have |H| rows and |X| columns. The environment plays by selecting an instance x_t from X. The row player randomly selects a row i of M according to P_t and predicts with h_i(x_t), incurring loss M(i,x_t). The weight update rule is as given above. Loss bounds for on-line learning can be calculated using this reduction and, not surprisingly, they are the same as those originally obtained for WM.

II) Results


The game matrix for the game "Rock, Paper, Scissors" expressed with entries in [0,1] is

    1/2  1    0
    0    1/2  1
    1    0    1/2

The minmax strategies for the row and column players are P* = Q* = [1/3, 1/3, 1/3] and the expected loss to the learner, that is, the value of the game, is 1/2. In this form, the game provides no incentive for the row player to adopt a particular strategy. If the column player plays Q*, then MQ* is [1/2, 1/2, 1/2] and, regardless of how the row player selects p1, p2 and p3, P^T MQ* is p1/2 + p2/2 + p3/2 = 1/2(p1+p2+p3) = 1/2.
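This indifference property is easy to check numerically (a minimal pure-Python sketch of the computation above; the test strategy P is arbitrary):

```python
def mat_vec(M, v):
    """Product M v for a matrix (list of rows) and a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

# Rock, Paper, Scissors payoff matrix with entries in [0, 1]
M = [[0.5, 1.0, 0.0],
     [0.0, 0.5, 1.0],
     [1.0, 0.0, 0.5]]
Q_star = [1/3, 1/3, 1/3]

MQ = mat_vec(M, Q_star)     # -> [0.5, 0.5, 0.5]
P = [0.7, 0.2, 0.1]         # an arbitrary row strategy
value = dot(P, MQ)          # P^T M Q* = 0.5 for any P
```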


To obtain a game that motivates the row player towards a particular strategy, I used asymmetric game matrices of the form:

    1/2    1    2*x
    0      1/2  1
    1-x    0    1/2-2*x/3

For values of x in (0,1/2] the row player's minmax strategy and game value are shown below.


    x      p1        p2        p3     value*1000
    1/2    0         2/3       1/2    667
    1/4    11/30     2/15      1/2    567
    1/6    5/14      3/14      3/7    536
    1/8    58/165    41/165    2/5    524
    1/10   95/273    73/273    5/13   518
    1/12   47/136    19/68     3/8    514
    1/14   98/285    82/285    7/19   512
    1/16   260/759   223/759   4/11   510
    1/18   111/325   97/325    9/25   509
    1/20   415/1218  184/609   5/14   508


The minmax strategies and game value were calculated with the Maple "simplex" package. Simulations were run by having both players start with a uniform probability distribution and then iteratively update their weights. I had initially wanted to have the column player only play its maxmin strategy so that the effect of the learner's updates could be tracked more closely, but had difficulty relating the dual solution to the column player's probabilities.


Convergence of the observed loss to the game value was quick. The graph below shows the difference between the average observed loss and the game value of .567 as a function of the β parameter for the game with x = 1/4. Results for other values of x were similar.

The red and green circles correspond to 10 and 50 rounds of the game respectively.




Convergence of the probability distributions to the maxmin solutions was more of a problem. The relative entropy D(p||q), where p is the final mixture calculated by a player after n rounds of the game and q is the minmax solution calculated by linear programming, should also vary monotonically as a function of β. However, I was not able to confirm this with the results obtained.
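For reference, the convergence measure used here is straightforward to compute (a minimal sketch; the strategy vectors below are illustrative, not results from the experiments):

```python
from math import log

def relative_entropy(p, q):
    """D(p||q) = sum_i p(i) * log(p(i)/q(i)), in nats.

    Terms with p(i) == 0 contribute 0 by convention; q(i) must be
    nonzero wherever p(i) is nonzero.
    """
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q_star = [1/3, 1/3, 1/3]       # a minmax solution from the LP
p_final = [0.4, 0.35, 0.25]    # an illustrative learned mixture
d = relative_entropy(p_final, q_star)
```

D(p||q) is zero exactly when p = q and positive otherwise, which is what makes it a natural distance-to-solution measure here.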

III) Computing Mixture Coefficients for a Randomized Algorithm in Disk Spin-Down

This section investigates the use of randomized algorithms for the disk spin-down problem [2]. The following notation is used: i is the idle time, x is the selected time-out, s is the spin-down cost. The cost incurred by algorithm A if it selects a time-out of x when the observed idle time is i is:

    cost_A(x,i) = i      if i <= x
                  x + s  if i > x


A hypothetical algorithm that could look ahead and know i before choosing x could minimize its cost by choosing a time-out of 0 when i exceeds s and some x > i when i is less than or equal to s. Its cost would then be:

    opt(i) = i  if i <= s
             s  if i > s
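These two cost functions can be written directly in Python (a minimal sketch of the definitions above; the grid of idle times used for the check is arbitrary):

```python
def cost(x, i, s):
    """Cost of a time-out of x when the idle period turns out to be i.

    If the idle period ends before the time-out fires (i <= x) we pay
    the idle time; otherwise we wait x, spin down, and also pay s.
    """
    return i if i <= x else x + s

def opt(i, s):
    """Cost of the omniscient algorithm that knows i in advance."""
    return i if i <= s else s

# The deterministic choice x = s is 2-competitive: for i <= s its cost
# equals opt(i), and for i > s it pays s + s = 2 * opt(i).
s = 1.0
worst = max(cost(s, k / 100, s) / opt(k / 100, s) for k in range(1, 500))
```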


The quality of an algorithm can be assessed by considering either the ratio cost_A(i)/opt(i) or the difference cost_A(i) - opt(i). For example, the algorithm that simply sets x = s will never incur a cost greater than twice that of the optimal algorithm. For values of i < s its cost will be the same as opt(i). For values of i >= s its cost will be exactly twice opt(i).


Various classes of algorithms for choosing x have been studied. One class of algorithms maintains a weighted collection of sub-algorithms, or "experts", each of which nominates a single time-out. The master algorithm focuses on updating the weights of the experts to reflect past performance. An alternative class, called randomized algorithms, chooses the time-out at random from some probability distribution.


Karlin et al. [1] considered the problem of how long a thread should spin waiting for a lock before choosing to block and incurring the cost of a context switch, a problem very similar to disk spin-down. They showed that no randomized algorithm can hope for a cost ratio smaller than e/(e-1) ~ 1.58 relative to the optimal algorithm. They also showed that a randomized algorithm that selects its time-out from the density f(x) = e^(x/s)/(s(e-1)) will meet this bound. A plot of this density for a spin-down cost of 1 is shown below.
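The optimality of this density can be checked numerically: for any idle time i in (0, s], the expected cost under f works out to exactly i*e/(e-1), so the ratio to opt(i) is constant (a minimal sketch using a midpoint Riemann sum; the grid size is arbitrary):

```python
from math import e, exp

def expected_cost(i, s, n=100000):
    """E[cost(x, i)] when x is drawn from f(x) = e^(x/s)/(s*(e-1)) on [0, s].

    cost(x, i) is (x + s) if the time-out x fires before the idle
    period ends (x < i), and i otherwise.
    """
    total = 0.0
    dx = s / n
    for k in range(n):
        x = (k + 0.5) * dx                      # midpoint rule
        density = exp(x / s) / (s * (e - 1))
        c = (x + s) if x < i else i
        total += c * density * dx
    return total

s = 1.0
# each ratio below is ~ e/(e-1) ~ 1.582, independent of i
ratios = [expected_cost(i, s) / i for i in (0.25, 0.5, 0.75, 1.0)]
```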



An alternative randomized algorithm might pick its time-out by selecting one of several candidates according to some discrete probability distribution. Given the above result it is interesting to explore how well this alternative fares, in particular how the coefficients of the distribution should be chosen to minimize the worst-case cost ratio or cost difference.


To calculate cost_A(i) we must calculate the expectation over the algorithm's probability distribution for time-outs. Assume for simplicity that we are only using two time-outs, s and s/2. We will choose s/2 with probability p and s with probability 1-p; then

    cost_A(i) = p*cost(s/2,i) + (1-p)*cost(s,i)


For this case, a graph of the cost ratio, cost_A(i)/opt(i), as a function of p and i is shown below:


To obtain a more informative expression for cost_A(i) we can divide i over three ranges: 0 < i <= s/2, s/2 < i < s and s <= i. The resulting expressions for the algorithm's cost, cost ratio and cost difference are given in the following table:



    idle value range:          [0, s/2)       [s/2, s)               [s, ∞)
    cost_A(i)                  p*i+(1-p)*i    p*(s/2+s)+(1-p)*i      p*(s/2+s)+(1-p)*(s+s)
    max_i(cost_A(i)/opt(i))    1              1+2p                   (4-p)/2
    max_i(cost_A(i)-opt(i))    0              p*s                    (2*s-p*s)/2



From inspection of this table it is apparent that the maximum of the ratio cost_A(i)/opt(i) over the range s/2 <= i < s, 1+2p, is largest at p = 1, while the maximum over i >= s, (4-p)/2, is largest at p = 0. The value of p that minimizes the larger of these two constraints is given by the intersection of the two lines, 1+2*p = (4-p)/2. This occurs at p = 2/5 and the value of the ratio at this point is 9/5. This implies that a randomized algorithm that chooses a time-out of s/2 with probability 2/5 and a time-out of s with probability 3/5 will be 1.8-competitive, already significantly better than the 2-competitive algorithm above.
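The 9/5 bound is easy to confirm by brute force (a minimal sketch; cost and opt are the definitions from the start of this section, and the scan grid over idle times is arbitrary):

```python
def cost(x, i, s):
    """Cost of time-out x for idle time i with spin-down cost s."""
    return i if i <= x else x + s

def opt(i, s):
    """Cost of the omniscient algorithm."""
    return i if i <= s else s

def two_point_ratio(p, s=1.0, steps=2000):
    """Worst-case cost ratio of the randomized algorithm that plays s/2
    with probability p and s with probability 1-p, scanned over idle times."""
    worst = 0.0
    for k in range(1, steps + 1):
        i = 3 * s * k / steps          # scan i over (0, 3s]
        expected = p * cost(s / 2, i, s) + (1 - p) * cost(s, i, s)
        worst = max(worst, expected / opt(i, s))
    return worst

ratio = two_point_ratio(2 / 5)   # -> 9/5 at the optimal p = 2/5
```

Setting p = 0 recovers the deterministic x = s algorithm and its worst-case ratio of 2.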


Applying the same approach to the difference cost_A(i) - opt(i) gives a value of p = 2/3 at the intersection p*s = (2*s-p*s)/2, so that the value of the difference will be 2*s/3. One disadvantage to using the cost difference to compare the two algorithms is that it is not possible to eliminate a dependence on the spin-down cost.


General expressions for both the cost ratio and cost difference can be obtained for multiple equally spaced time-outs. Construction of these expressions can be illustrated by expanding the expected cost ratio for three time-outs. For this case there will be four idle time intervals to consider: [0,s/3), [s/3,2*s/3), [2*s/3,s), [s,∞). For idle times in the third of these intervals, the expected cost ratio over the randomized time-outs will be:

    ( p1*(s/3+s) + p2*(2*s/3+s) + p3*i ) / i


For a spin-down cost s <= 1, and i in [2*s/3,s), this ratio will be at a maximum when i takes the minimum point of the interval, i = 2*s/3.

Using p3 = 1-p1-p2 we can rewrite it as:

    p1*(s/3+s)/(2*s/3) + p2*(2*s/3+s)/(2*s/3) + 1-p1-p2

or

    1 + p1*(3+1-2)/2 + p2*(3+2-2)/2


or, more generally, as

    S(j) = 1 + (1/j) * Σ_{m=1..j} (k+m-j)*p_m

where k is the number of time-outs and j identifies the interval of idle-time values over which we are calculating the expectation (j = 1..k, with interval j running from j*s/k to (j+1)*s/k and interval k being [s,∞); the first interval, [0,s/k), always has ratio 1). To solve for the mixture probabilities we can find the intersection of the above hyperplanes: for example, setting S(1)=S(2)=S(3) for the three time-out case gives p1 = 9/37, p2 = 12/37 and p3 = 16/37.


The same approach can be used to solve for the cost difference cost_A(i) - opt(i), or

    ( p1*(s/3+s) + p2*(2*s/3+s) + p3*i ) - i


This difference will be at a maximum when i takes the minimum point in the interval, i = 2*s/3. So we can write it as

    p1*(s/3+s) + p2*(2*s/3+s) - p1*2*s/3 - p2*2*s/3

or

    p1*(s+3*s-2*s)/3 + p2*(2*s+3*s-2*s)/3

or

    p1*s*(3+1-2)/3 + p2*s*(3+2-2)/3

or, more generally, as

    SD(j) = (s/k) * Σ_{m=1..j} (k+m-j)*p_m

Once again we can obtain an optimal mixture by solving for the intersection of hyperplanes. For the three time-out case, setting SD(1)=SD(2)=SD(3) gives p1 = 9/16, p2 = 3/16, p3 = 1/4.
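Both mixtures can be recovered by solving the corresponding linear systems exactly. The sketch below assumes the general expressions S(j) = 1 + (1/j)·Σ_{m<=j}(k+m-j)p_m and SD(j) proportional to Σ_{m<=j}(k+m-j)p_m for k equally spaced time-outs (reconstructed to be consistent with the worked three time-out example); solve3 is a hypothetical helper written just for this 3x3 case:

```python
from fractions import Fraction

def solve3(A, b):
    """Solve a 3x3 linear system A p = b by Gauss-Jordan elimination
    over exact Fractions (hypothetical helper for this sketch)."""
    A = [row[:] for row in A]
    b = b[:]
    n = 3
    for col in range(n):
        # find a pivot row, swap it into place, eliminate every other row
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
                b[r] = b[r] - f * b[col]
    return [b[r] / A[r][r] for r in range(n)]

k = 3
F = Fraction

def coeffs(j):
    """Coefficients (k+m-j) of p_m for m = 1..j, padded with zeros."""
    return [F(k + m - j) if m <= j else F(0) for m in range(1, k + 1)]

# Cost ratio: S(j) = 1 + (1/j) * sum_{m<=j} (k+m-j) p_m.
# Impose S(1) = S(2), S(2) = S(3), p1 + p2 + p3 = 1.
s1, s2, s3 = ([c / j for c in coeffs(j)] for j in (1, 2, 3))
row12 = [a - c for a, c in zip(s1, s2)]
row23 = [a - c for a, c in zip(s2, s3)]
p_ratio = solve3([row12, row23, [F(1)] * 3], [F(0), F(0), F(1)])
# p_ratio -> [9/37, 12/37, 16/37]

# Cost difference: SD(j) is proportional to sum_{m<=j} (k+m-j) p_m,
# so the common factor s/k drops out of the equalities.
d1, d2, d3 = (coeffs(j) for j in (1, 2, 3))
row12 = [a - c for a, c in zip(d1, d2)]
row23 = [a - c for a, c in zip(d2, d3)]
p_diff = solve3([row12, row23, [F(1)] * 3], [F(0), F(0), F(1)])
# p_diff -> [9/16, 3/16, 1/4]
```

Exact rationals make it easy to confirm the closed-form mixtures quoted above rather than approximate them in floating point.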


A plot of the cost ratio obtained by using the optimal mixtures over equally spaced sets of 2, 3, 5, 10, 15 and 20 points in the interval [0,s] is shown below. Clearly there is little advantage to considering a greater number of points, as 20 points comes quite close to the theoretical limit.


IV) Additional Work.

As might be expected from a project of this scope, the work has raised more questions than it answered. With respect to the use of game matrices as data sets for exploring the properties of on-line algorithms, it would be interesting to understand the problems in convergence of the mixture distribution. Another topic, not addressed here, would be to extend this approach to the investigation of alternative boosting algorithms.


For the mixture calculation work, it would be interesting to compare the relative merits of the ratio-derived and difference-derived mixtures on actual data traces. It would also be interesting to explore alternatives to uniform spacing, such as "harmonic" spacing.


V) References:

[1] Karlin, A. R., Manasse, M. S., McGeoch, L. A., Owicki, S. 1990. "Competitive Randomized Algorithms for Non-Uniform Problems". In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 301-309.

[2] Sherrod, B. 1997. "A Dynamic Disk Spin-Down Technique For Mobile Computing". Master of Science Thesis, University of California, Santa Cruz.

[3] Freund, Y., Schapire, R. E. 1996. "Game Theory, On-line Prediction and Boosting". In Proceedings of the 9th Annual Conference on Computational Learning Theory, pp. 325-332.

[4] Freund, Y., Schapire, R. E. 1999. "Adaptive Game Playing Using Multiplicative Weights". Games and Economic Behavior 29:79-103.

[5] Breiman, L. 1997. "Prediction Games and Arcing Algorithms". Technical Report 504, Statistics Department, University of California, Berkeley.

[6] Littlestone, N., Warmuth, M. K. 1994. "The Weighted Majority Algorithm". Information and Computation 108:212-261.