Learning Blackjack with ANN (Artificial Neural Network)

ECE 539 Final Project

Ip Kei Sam
sam@cae.wisc.edu
ID: 9012828100

Abstract


Blackjack is a card game in which the player attempts to beat the dealer by holding a hand whose total is higher than the dealer's but no greater than 21. The probabilistic nature of Blackjack makes it an illustrative application for learning algorithms: the learning system needs to explore different strategies and favor those with a higher probability of winning. This project explores the use of an Artificial Neural Network (ANN) for learning Blackjack strategies, with reinforcement learning as the learning algorithm. Reinforcement learning is a process that maps situations to actions when the learner is not told which actions to take, but must discover which actions yield the highest reward by trial and error. The trained ANN is then used to play Blackjack without being explicitly taught the rules of the game. Furthermore, the efficiency of the ANN and the results of the learning algorithm are investigated and interpreted as different strategies for playing Blackjack and other random games.



Background

Initially, two cards are dealt to each player. The object of Blackjack is to draw cards from a deck of 52 cards toward a total value of 21. The player can choose from the following actions:

Stand: stay with the current hand and take no card.

Hit: add a card to the hand to bring the total card value closer to 21.

Double Down: when the player is holding 2 cards, the player can double his bet by hitting exactly one more card and standing after that.

Split: holding a pair of cards with the same value, the player can split his hand into two hands. The player may split up to 3 times in a game.



For simplicity, Double Down and Split are not considered in this project. The value of a hand is the sum of the values of its cards: each card from 2 to 10 is valued accordingly, J, Q, and K are valued at 10, and Aces count as either 1 or 11. Each player plays against the dealer, and the goal is to obtain a hand with a greater value than the dealer's hand but less than or equal to 21. A player may hit as many times as he wishes as long as his total does not go over 21. The player can also win by holding 5 cards with total points less than 21. The dealer hits when his hand is less than 17 and stands otherwise. When the player is dealt 21 points in the first 2 cards, he automatically wins his bet if the dealer is not dealt 21. If the dealer has blackjack (21 points), the game is over and the dealer wins all bets, or ties with any player who also has blackjack (21 points).

Figures 1 and 2 show a demo of the Matlab Blackjack program (blackjack.m). The player presses the Hit or Stand button each turn, and the dealer's and player's total points are calculated and shown at the bottom. The display also shows the balance remaining in the player's account and the amount the player has bet on this game. In this example, the player bet $0. The program exits when the player's balance is less than zero.


Figure 1: The initial state of the game; the first 2 cards are dealt to the player and to the dealer.



To measure the efficiency of the rules in Blackjack, I simulated the program playing against the dealer for 1000 games, with the dealer following the 17-point rule. Efficiency can be observed from the percentages of winning and drawn games. The comparison of the player's random moves versus the dealer's 17-point rule is shown in Figure 2a.




Figure 2: As the player chooses to stand, the dealer chooses to hit but gets busted. The player won.



Strategy                    Win %    Tie %
Player's random moves       31%      8%
Dealer's 17 points rule     61%      8%

Figure 2a: Efficiency of different strategies in Blackjack. Each strategy is played for 1000 games.


One of the goals of this project is to develop a better strategy with an ANN that beats the dealer's 17-point rule; that is, the new strategy should have a higher winning percentage. Different configurations of the MLP and different preprocessing of the input training data sets will also be experimented with later in this paper. Finally, the paper explores some of the Blackjack strategies interpreted from the experimental results.





Applying Reinforcement Learning to Blackjack


Reinforcement learning is a process that maps situations to actions such that the reward value is maximized. The learning algorithm decides which actions to take by finding, through trial and error, the actions that yield the highest reward. The actions taken affect the immediate rewards as well as the subsequent rewards. In Blackjack, given a set of dealer's and player's cards, if the probability of winning from each outcome is known, the player can always make the best decision (hit or stand) by taking the action that yields the highest winning probability in the next state.

For each final state, the probability of winning is either 1 (if the player wins/draws) or 0 (if the player loses). In this project, the initial winning probability of each intermediate state is set to 0.5, and the learning parameter α is also initialized to 0.5. The winning probability of each state is updated for the dealer and the player after each game. The winning probability of the previous state is moved closer to that of the current state based on this equation:

P(s) = P(s) + α*[P(s') - P(s)],     (1)

where α is the learning parameter, s is the current state and s' is the next state.
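As a concrete illustration, the following is a minimal MATLAB sketch of this update applied backward over the states visited in one game; the variable names (alpha, stateProbs) are illustrative and not taken from the project files.

    % Minimal sketch (illustrative, not the project's code): back up the final
    % reward through the states visited in one game using equation (1),
    % P(s) = P(s) + alpha*(P(s') - P(s)).
    alpha = 0.5;               % learning parameter
    stateProbs = [0.5 0.5 0];  % P of each visited state; the final state was lost, so 0
    for k = numel(stateProbs)-1:-1:1
        stateProbs(k) = stateProbs(k) + alpha*(stateProbs(k+1) - stateProbs(k));
    end
    disp(stateProbs)           % prints 0.3750 0.2500 0, matching the worked example below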

For example, Figure 3 shows the first 3 rows taken from the result table output when the Matlab program genlookuptable.m is simulated to play against the dealer using random decisions.

2.0000  5.0000   0       0  0   6.0000  6.0000   0       0       0   0.3700  1.0000  0
2.0000  5.0000   0       0  0   4.0000  6.0000  6.0000   0       0   0.2500  1.0000  0
2.0000  5.0000  10.0000  0  0   4.0000  6.0000  6.0000  7.0000   0   0       1.0000  1.0000

Figure 3: The result table (lpxx) from the Matlab program output after one game.


The first 5 columns represent the dealer's cards and the next 5 columns represent the player's cards; by the game rule, the dealer and the player can each have a maximum of 5 cards. The card values in each hand are sorted in ascending order before they are inserted into the table. Column 11 is the winning probability of the state. Columns 12 and 13 represent the action taken by the player, where [1 0] denotes a "hit", [0 1] denotes a "stand", and [1 1] denotes an end state where no action is required.

In the first row, the dealer has 2 and 5 (a total of 7) and the player has 6 and 6 (a total of 12). Based on a random decision, the player decides to hit. In the next round, the player draws a 4 for a total of 16, while the dealer draws a 10 for a total of 17. The player decides to hit again. This time the player is busted with a total of 23 in the final state (row 3). Since the player lost, this state is rewarded with P(3) = 0. With the learning parameter α = 0.5, the winning probability of taking a "hit" in the previous state is:

P(2) = P(2) + α*[P(3) - P(2)] = 0.5 + 0.5 * (0 - 0.5) = 0.2500

Similarly,

P(1) = P(1) + α*[P(2) - P(1)] = 0.5 + 0.5 * (0.25 - 0.5) = 0.3750

After playing a sequence of games, the result table contains the winning probability of all encountered intermediate states, together with the dealer's cards, the player's cards, and the player's taken actions.

When playing a new game, if the same card pattern is encountered, for example the one in row 2,

2.0000  5.0000  0  0  0  4.0000  6.0000  6.0000  0  0  0.2500  1.0000  0

the player will choose to "stand" this time, since taking a "hit" has a low winning probability. After the new game, the winning probability is again updated based on the game result. If the player loses again by hitting an additional card, the winning probability becomes:

P(2) = P(2) + α*[P(3) - P(2)] = 0.25 + 0.5 * (0 - 0.25) = 0.125

Otherwise, if the player wins this time, the probability is:

P(2) = P(2) + α*[P(3) - P(2)] = 0.25 + 0.5 * (1 - 0.25) = 0.625

The winning probability of each state converges after a large number of games; the more games are played, the more accurate the winning probabilities become.
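A minimal sketch of this table-driven decision is shown below; the helper name chooseAction and the fallback to a random move are illustrative assumptions, not code from the project files.

    % Illustrative sketch: look up the current sorted card pattern in the result
    % table lpxx (13 columns as described above) and act greedily on column 11.
    function action = chooseAction(lpxx, dealerCards, playerCards)
        % Build the 10-element key: sorted dealer cards then sorted player cards,
        % each padded with zeros up to 5 entries.
        key = [sort(dealerCards) zeros(1, 5-numel(dealerCards)) ...
               sort(playerCards) zeros(1, 5-numel(playerCards))];
        rows = find(ismember(lpxx(:, 1:10), key, 'rows'));
        if isempty(rows)
            action = randi(2);               % pattern not in the table: random move
            return
        end
        [~, best] = max(lpxx(rows, 11));     % column 11 = winning probability
        if lpxx(rows(best), 12) == 1 && lpxx(rows(best), 13) == 0
            action = 1;                      % [1 0] = hit
        else
            action = 2;                      % [0 1] = stand
        end
    end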


Originally, the dealer follows the 17-point rule of Blackjack. To make the dealer more intelligent, reinforcement learning can be applied to the dealer as well. Similarly, a result table is obtained for the dealer after each game. Consider the following dealer result table:

2.0000  5.0000   0       0  0   6.0000  6.0000   0       0  0   0.7500  1.0000  0
2.0000  5.0000  10.0000  0  0   4.0000  6.0000  6.0000   0  0   1.0000  1.0000  1.0000

In this game, the dealer was initially dealt 2 and 5 for a total of 7, while the player was dealt 6 and 6 for a total of 12. The dealer chose to hit, resulting in a total of 17 points in the final state. The player also decided to hit, but ended with a total of only 16 points. The dealer's P(2) = 1, so

P(1) = P(1) + α*[P(2) - P(1)] = 0.5 + 0.5 * (1 - 0.5) = 0.7500



When the same card pattern (as in row 1) is encountered again in a new game, the dealer will choose to hit again, since a "hit" leads to a state with a higher winning probability. Again, the probability values are updated based on the results after each new game.

The learning parameter α is initially set to 0.5 and decreases after each game according to this equation: α = α * (# of games remaining) / (total # of games).


Therefore, α becomes 0 when all test games are finished. If the player or the dealer always takes the action leading to the state with the highest winning probability, there will be intermediate states, still at the initial winning probability of 0.5, that always score lower than some other states and would never be encountered. Therefore, random moves are necessary in between in order to explore these states. The winning probability of a random move is not updated at the end of the game. The total number of games simulated in this program is 1000, and in each game 20% of the moves are randomly generated. Since α decreases after each game, the winning probability value of each state converges as the number of games played increases.
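The sketch below illustrates this schedule in MATLAB; the literal decay formula follows the equation above, and the 20% exploration fraction is the one stated in the text (the game itself is only indicated by comments).

    % Illustrative training-loop skeleton (not the project's actual loop):
    totalGames = 1000;
    alpha = 0.5;
    for g = 1:totalGames
        % Play one game: with probability 0.2 a move is random (exploration) and
        % its state is excluded from the update; otherwise the greedy move from
        % the result table is taken and updated with equation (1) afterwards.
        alpha = alpha * (totalGames - g) / totalGames;   % alpha reaches 0 at the end
    end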


Let's look at the previous example one more time:

2.0000  5.0000   0       0  0   6.0000  6.0000   0       0  0   0.7500  1.0000  0
2.0000  5.0000  10.0000  0  0   4.0000  6.0000  6.0000   0  0   1.0000  1.0000  1.0000

The P(1) value starts at the default value of 0.5; the first time this pattern is encountered, if the game is won, P(1) is updated to 0.7259. After running 5000 games in the simulation, this pattern is encountered 9 times, and the P(1) value converges to 0.76 as shown in Figure 4 below. This suggests that if the player has two 6's, taking a "hit" gives a better chance to win.




Figure 4: The convergence of the winning probability over 5000 games.


Let's look at another example. This time the player holds 8 and 10:

7.0000  8.0000   0       0  0   8.0000  10.0000   0       0  0   0.2500  1.0000  0
3.0000  7.0000  8.0000   0  0   8.0000  10.0000  5.0000   0  0   0       1.0000  1.0000

The P(1) value starts at the default value of 0.5; the first time this pattern is encountered, P(1) is updated to 0.25 when the hand is lost. After running 5000 games in the simulation, this pattern is encountered 11 times, and the P(1) value converges to 0.35 as shown in Figure 5. This suggests that if the player holds 8 and 10, taking a "hit" is very likely to lose. In this case, the player should only "hit" when the dealer's hand is higher than the player's hand.


Figure 5: The convergence of the winning probability over 5000 games.




Both experiments show the convergence of the winning probability value in each state as the learning parameter decreases after each game.




The update method based on equation (1) performs quite well in this experiment. As α is reduced over time, the method converges to the true winning probability value of each state given optimal actions by the player. Except for the random moves, each action taken (hit or stand) is in fact the optimal move against the dealer, since the method converges to an optimal policy for playing Blackjack. If α does not decrease to 0 at the end of training, this player also plays well against a dealer that changes its way of playing slowly over time. Applying reinforcement learning to the dealer will certainly beat the dealer's original 17-point rule.


Matlab Implementation

As the Matlab program (blackjackN.m) is simulated to play against the dealer, each time a new card pattern is encountered it is given a default probability value of 0.5. Its winning probability is updated after each game based on the final result and is added to the result table at the end. States that have not yet been encountered by the program are not available in the result table; when the program encounters a card pattern that is not found in the result table, it makes its decision by a random move. There are approximately 10^6 different states for the dealer's and player's card patterns. Many of these possible states do not need to be considered, as they go well beyond 21 points; for example, the state of holding five 10's is not possible, since the hand is busted at the third 10. Moreover, it is not possible to simulate all of these different states because of the limitation of the slow iterative operations in Matlab.
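A sketch of this find-or-insert behaviour is given below; the variable names (lpxx, key, actionBits) follow the table layout described earlier but are otherwise illustrative.

    % Illustrative sketch: an unseen card pattern enters the result table with the
    % default winning probability of 0.5; known patterns are re-used and their
    % column-11 value is updated after the game.
    row = find(ismember(lpxx(:, 1:10), key, 'rows'), 1);   % key = 10 sorted card values
    if isempty(row)
        lpxx(end+1, :) = [key 0.5 actionBits];             % actionBits = [1 0], [0 1] or [1 1]
    end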




Matlab Simulation

All cards in one hand are sorted before they are inserted into the table. The dealer and the player each have a table of their own, and each table is the input data for training the ANN, either from the dealer table or from the player table. Based on the card pattern, the corresponding winning probability is returned.

The Matlab program (genlookuptable.m) is simulated to play against the dealer, starting with initially empty dealer and player tables, to explore the different combinations of the dealer's and the player's cards. There were 15000 games simulated, from which 28273 entries were generated for the dealer's table and 32372 entries were generated for the player's table. In the end, the program had encountered less than 10% of all possible states; however, the states that were encountered are the most common ones. In fact, less than 10% of all different states are commonly encountered in playing Blackjack: for example, states with 4 or 5 cards in hand occur very infrequently, and having four or five tens in hand is impossible in Blackjack.


Both the dealer and the player learn to play better as the number of games played increases. To test the accuracy of both the dealer table and the player table, I set up another sequence of games to let the dealer play against the player using the generated lookup tables.

The following tables show the summary of the experiments. When applying reinforcement learning to the player to play against a dealer with the 17-point rule, the player shows a significant improvement in winning percentage: out of 1000 games, the player won 512 times.

Strategy                  Win %    Tie %
Player (with learning)    51.2%    7%




When applying reinforcement learning to the dealer (where the dealer does not follow the 17-point rule) to play against a player making random moves, the dealer improves its winning percentage to 67%.

Strategy                  Win %    Tie %
Dealer (with learning)    67%      5%


When applying reinforcement learning to both the dealer and the player to play against each other, the percentage of tie games increases, because they are more likely to be using the same set of strategies based on reinforcement learning.

Strategy                  Win %    Tie %
Player (with learning)    43%      12%
Dealer (with learning)    45%      12%


Finally, I played against the dealer that uses the dealer lookup table. I lost 11 games out of 20 and tied 1 game.

Strategy        Win %    Tie %
Human Player    45%      5%


Using the dealer and player tables greatly increases the number of games won by both the dealer and the player. Therefore the tables generated for the dealer and the player are good enough to use for the subsequent experiments.

Just by looking at the values in the lookup tables, the game strategy can be interpreted. For the player, if the total is higher than 14, in most cases the table suggests not to "hit"; similarly, it suggests not to "hit" above 15 for the dealer. Without knowing the rules of Blackjack, reinforcement learning, through the updates at the end of each game, is able to determine the threshold of 14 for the player and 15 for the dealer, even though this is a little more conservative than the dealer's 17-point rule. Watching the player play against the dealer when both use lookup tables, the dealer has a slightly higher winning percentage. This suggests that the dealer table, which says not to "hit" at 15, plays a little better than the player table, which says not to "hit" at 14.



Applying ANN to Blackjack

The next step is to prepare the input data for the ANN using the result lookup tables. Once the dealer and player lookup tables are ready, they can be converted into training data for the ANN. The dealer's cards and the player's cards from the lookup tables are the inputs to the ANN. The output contains two neurons, each corresponding to one action (hit or stand); the action taken depends on which output neuron has the higher value.
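Assuming the 13-column table layout described earlier, this conversion can be sketched in a few MATLAB lines (illustrative only; the project's own conversion script may differ):

    % Turn lookup-table rows into MLP training pairs: the card columns become the
    % input features and the two action columns become the two target outputs.
    X = lookup(:, 1:10);                    % dealer cards and player cards
    Y = lookup(:, 12:13);                   % [1 0] = hit, [0 1] = stand
    keep = ~(Y(:,1) == 1 & Y(:,2) == 1);    % drop [1 1] end states (no decision needed)
    X = X(keep, :);
    Y = Y(keep, :);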

This is a classification problem. To simplify it a little, I set up several MLP structures for different stages of the game to better classify the problem. The different MLPs needed for the dealer and the player, as well as the overall game flow chart, are shown in Figure 6 below.

Note that the rules of Blackjack (and the 17-point dealer rule) are not implemented in the project; all decisions are based on MLP outputs.


Figure 6: The overall flow chart of using the different MLPs in this project.




The first-level MLP has 4 inputs (2 cards from the dealer and 2 cards from the player in the initial state) and 2 outputs (hit or stand). The second-level MLP has 5 inputs (2 from the dealer and 3 from the player) and 2 outputs (hit/stand). The third-level MLP has 5 inputs (3 from the dealer and 2 from the player) and 2 outputs (hit/stand). The fourth-level MLP has 6 inputs (3 from the dealer and 3 from the player) and 2 outputs (hit/stand). Further levels (4 or 5 cards for the player or dealer) are not considered, as the probability of reaching a stage with 4 or 5 cards in hand becomes small and their structures become more complicated, which makes classification more difficult. When either the player or the dealer holds 4 or 5 cards, the action is determined either by a random move or by checking whether the dealer's total is less than the player's total, and vice versa.

Accordingly, the input lookup table is separated into 5 sub-tables (findMLP.m implements this function): the first sub-table picks up all rows that contain 2 dealer cards and 2 player cards (the initial state only); the second sub-table contains the states with 2 dealer cards and 3 player cards; the third sub-table holds the states with 3 dealer cards and 2 player cards; the fourth sub-table contains the states with 3 dealer and 3 player cards. The remaining rows are put into the fifth sub-table, which holds the states with 4 or 5 dealer or player cards. Each sub-table is saved as the input to one of the 4 MLP levels defined above, as sketched below. During the simulation, depending on how many cards the dealer and the player have, a different MLP structure is called to decide which action (hit/stand) to take. The "End" state represents the end of the game, where either the player won, the dealer won, or the game is a draw.
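The split can be sketched as follows (findMLP.m implements the real version; this fragment only illustrates the idea of counting the nonzero card columns):

    nd = sum(lookup(:, 1:5)  > 0, 2);    % number of dealer cards in each row
    np = sum(lookup(:, 6:10) > 0, 2);    % number of player cards in each row
    sub1 = lookup(nd == 2 & np == 2, :); % MLP level 1: 2 dealer, 2 player cards
    sub2 = lookup(nd == 2 & np == 3, :); % MLP level 2: 2 dealer, 3 player cards
    sub3 = lookup(nd == 3 & np == 2, :); % MLP level 3: 3 dealer, 2 player cards
    sub4 = lookup(nd == 3 & np == 3, :); % MLP level 4: 3 dealer, 3 player cards
    sub5 = lookup(nd >= 4 | np >= 4, :); % remaining states: 4 or 5 cards in a hand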


In the next step, the input data for each feature vector must be analyzed and preprocessed before training the different MLPs. As each column of the lookup table contains a card value, the range of each value is from 1 to 10. First, let's look at the MLP level 1 experiment.


MLP level 1: 4 inputs (2 from the dealer and 2 from the player) and 2 outputs.

The first and second feature vectors are the dealer's and the third and fourth are the player's. Since the cards are sorted before they are inserted into the table, the mean of the first (or third) feature vector is always less than the mean of the second (or fourth) feature vector.

The values are shown in the following table:

Feature vector #   Mean     Standard Deviation   Variance   Minimum   Maximum
1 (dealer)         4.5435   1.2907               1.194      1         10
2 (dealer)         7.2961   2.3353               1.282      1         10
3 (player)         3.8168   1.7196               1.674      1         10
4 (player)         6.7322   1.8596               1.374      1         10



Inspecting the feature vectors of the input data, the table shows that the mean of each feature differs from the others, with the means of features 1 and 3 lower than those of features 2 and 4. The minimum and maximum of the 4 feature vectors are the same. The first preprocessing step is feature vector normalization, followed by scaling all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 2 and 4 are more important with respect to their class (their standard deviations are the highest); the system can pay more attention to these feature vectors by putting more weight on them. This data set is nonlinear, and the classification rate is not satisfactory without preprocessing the data set. With preprocessing of the input data, it is possible to identify exactly which feature vectors play the more important roles in the classification phase, so that these feature vectors carry more weight.
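A hedged sketch of this preprocessing in MATLAB (the project uses its own utility scripts such as scale.m, so the exact steps may differ; the column-wise operations below assume MATLAB R2016b or later for implicit expansion):

    % Normalize each feature vector (column) and rescale every column to [-5, 5].
    Xn = (X - mean(X)) ./ std(X);                            % zero mean, unit std per column
    Xs = -5 + 10 * (Xn - min(Xn)) ./ (max(Xn) - min(Xn));    % map each column to [-5, 5]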


After preprocessing, the next step is to train the MLP with this set of training data. The Matlab program "bp2.m" implements the back-propagation algorithm that trains the MLP; based on the training data, it produces the weight matrices. Several MLP configurations were experimented with. The hyperbolic tangent activation function was used in the hidden layers, and the sigmoidal function was used in the output neurons. Different values of α, µ, epoch size, numbers of hidden layers, and numbers of neurons in the hidden layers were also experimented with.
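For reference, the decision of a trained network of this shape can be sketched as a simple forward pass (the weights W here are a hypothetical cell array of layer matrices; the real training and evaluation are done by bp2.m):

    % Illustrative forward pass: tanh in the hidden layers, logistic sigmoid at the
    % two output neurons, decision by the larger output.
    function action = mlpDecide(W, x)
        a = x(:);
        for l = 1:numel(W) - 1
            a = tanh(W{l} * [1; a]);                 % hidden layer (bias term prepended)
        end
        y = 1 ./ (1 + exp(-(W{end} * [1; a])));      % sigmoidal output neurons
        [~, action] = max(y);                        % 1 = hit, 2 = stand
    end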


MLP (level 1) configurations used in the experiment:

With normalization of feature vectors
The input feature vectors are scaled to the range -5 to 5
Learning rate alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding the input layer, total # of layers = 2, 3, 4
Max. training epochs: 1000
Number of neurons in each hidden layer = 5, 10, 50
MLP configurations: 4-5-2, 4-5-5-2, 4-5-5-5-2, 4-10-10-10-2, 4-50-50-50-2
Activation function for the hidden layers: hyperbolic tangent
Activation function for the output layer: sigmoidal
The training set is partitioned dynamically into training and tuning sets, with 20% of the training data reserved for tuning
# of epochs between convergence checks = 10, epoch size = 64


α     µ     MLP Configuration   Classification Rate
0.1   0     4-5-2               80.2%
0.1   0.8   4-5-2               89.5%
0.8   0     4-5-2               79.1%
0.8   0.8   4-5-2               78.9%
0.1   0     4-5-5-2             85.8%
0.1   0.8   4-5-5-2             83.2%
0.8   0     4-5-5-2             84.1%
0.8   0.8   4-5-5-2             84.0%
0.1   0     4-5-5-5-2           87.2%
0.1   0.8   4-5-5-5-2           86.5%
0.8   0     4-5-5-5-2           85.3%
0.8   0.8   4-5-5-5-2           82.8%
0.1   0     4-10-10-10-2        89.5%
0.1   0.8   4-10-10-10-2        87.0%
0.8   0     4-10-10-10-2        86.8%
0.8   0.8   4-10-10-10-2        84.8%
0.1   0     4-50-50-50-2        88.0%
0.1   0.8   4-50-50-50-2        86.8%
0.8   0     4-50-50-50-2        86.2%
0.8   0.8   4-50-50-50-2        85.5%

Table 1: Classification rates of different MLP (level 1) configurations, with the α and µ values shown in columns 1 and 2.



The highest classification rate is 89.5%. If the MLP is too large, it may be trapped in a
certain classification rate
that would

stop it from improving during the training process.

The best conf
iguration is α = 0.1, µ = 0, MLP Config
uration

4
-
10
-
10
-
10
-
2
.





MLP level 2: 5 inputs (2 from the dealer and 3 from the player) and 2 outputs.

This experiment setup is similar to the setup in MLP level 1. First the input feature vectors are preprocessed and analyzed. The characteristics of the 5 feature vectors are shown in the following table:

Feature vector #   Mean     Standard Deviation   Variance   Minimum   Maximum
1 (dealer)         5.2532   1.1824               1.214      1         10
2 (dealer)         7.7195   1.3353               1.552      1         10
3 (player)         4.8351   1.2261               1.365      1         10
4 (player)         5.2465   1.5152               1.477      1         10
5 (player)         7.1328   1.9893               1.791      1         10


Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The first preprocessing step is feature vector normalization, followed by scaling all feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 2 and 5 are more important with respect to their class (their standard deviations are the highest); the system can pay more attention to these feature vectors by putting more weight on them.


MLP (level 2) configurations used in the experiment:

With normalization of feature vectors
The input feature vectors are scaled to the range -5 to 5
Learning rate alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding the input layer, total # of layers = 2, 3, 4
Max. training epochs: 1000
Number of neurons in each hidden layer = 5, 10, 50
MLP configurations: 5-5-2, 5-5-5-2, 5-5-5-5-2, 5-10-10-10-2, 5-50-50-50-2
Activation function for the hidden layers: hyperbolic tangent
Activation function for the output layer: sigmoidal
The training set is partitioned dynamically into training and tuning sets, with 20% of the training data reserved for tuning
# of epochs between convergence checks = 10, epoch size = 64

α     µ     MLP Configuration   Classification Rate
0.1   0     5-5-2               78.3%
0.1   0.8   5-5-2               79.9%
0.8   0     5-5-2               79.5%
0.8   0.8   5-5-2               78.8%
0.1   0     5-5-5-2             84.2%
0.1   0.8   5-5-5-2             84.5%
0.8   0     5-5-5-2             83.8%
0.8   0.8   5-5-5-2             82.9%
0.1   0     5-5-5-5-2           86.6%
0.1   0.8   5-5-5-5-2           85.4%
0.8   0     5-5-5-5-2           84.9%
0.8   0.8   5-5-5-5-2           83.7%
0.1   0     5-10-10-10-2        90.1%
0.1   0.8   5-10-10-10-2        91.1%
0.8   0     5-10-10-10-2        90.7%
0.8   0.8   5-10-10-10-2        87.8%
0.1   0     5-50-50-50-2        90.7%
0.1   0.8   5-50-50-50-2        89.4%
0.8   0     5-50-50-50-2        89.6%
0.8   0.8   5-50-50-50-2        90.9%

Table 2: Classification rates of different MLP (level 2) configurations, with the α and µ values shown in columns 1 and 2.


The highest classification rate is 91.1%. If the MLP is too large, it may become trapped at a certain classification rate that stops it from improving during the training process. The best configuration is α = 0.1, µ = 0.8 with MLP configuration 5-10-10-10-2.


MLP level 3: 5 inputs (3 from the dealer and 2 from the player) and 2 outputs.

This experiment setup is similar to those of MLP levels 1 and 2. First the input feature vectors are preprocessed and analyzed. The characteristics of the 5 feature vectors are shown in the following table:




Feature vector #   Mean     Standard Deviation   Variance   Minimum   Maximum
1 (dealer)         5.0133   1.2214               1.1003     1         10
2 (dealer)         6.2891   1.3174               1.3212     1         10
3 (dealer)         6.8344   1.5291               1.2984     1         10
4 (player)         6.7214   1.0127               1.0137     1         10
5 (player)         7.1722   1.4319               1.2101     1         10


Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The first preprocessing step is feature vector normalization, followed by scaling all feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 3 and 5 are more important with respect to their class (their standard deviations are the highest); the system can pay more attention to these feature vectors by putting more weight on them.


MLP (level 3) configurations used in the experiment:

With normalization of feature vectors
The input feature vectors are scaled to the range -5 to 5
Learning rate alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding the input layer, total # of layers = 2, 3, 4
Max. training epochs: 1000
Number of neurons in each hidden layer = 5, 10, 50
MLP configurations: 5-5-2, 5-5-5-2, 5-5-5-5-2, 5-10-10-10-2, 5-50-50-50-2
Activation function for the hidden layers: hyperbolic tangent
Activation function for the output layer: sigmoidal
The training set is partitioned dynamically into training and tuning sets, with 20% of the training data reserved for tuning
# of epochs between convergence checks = 10, epoch size = 64

α     µ     MLP Configuration   Classification Rate
0.1   0     5-5-2               78.2%
0.1   0.8   5-5-2               78.8%
0.8   0     5-5-2               80.1%
0.8   0.8   5-5-2               79.4%
0.1   0     5-5-5-2             82.0%
0.1   0.8   5-5-5-2             84.5%
0.8   0     5-5-5-2             84.1%
0.8   0.8   5-5-5-2             83.9%
0.1   0     5-5-5-5-2           85.2%
0.1   0.8   5-5-5-5-2           83.9%
0.8   0     5-5-5-5-2           85.7%
0.8   0.8   5-5-5-5-2           84.5%
0.1   0     5-10-10-10-2        89.9%
0.1   0.8   5-10-10-10-2        91.7%
0.8   0     5-10-10-10-2        92.5%
0.8   0.8   5-10-10-10-2        90.8%
0.1   0     5-50-50-50-2        90.7%
0.1   0.8   5-50-50-50-2        90.1%
0.8   0     5-50-50-50-2        91.1%
0.8   0.8   5-50-50-50-2        89.8%

Table 3: Classification rates of different MLP (level 3) configurations, with the α and µ values shown in columns 1 and 2.


The highest classification rate is 92.5%. The best configuration is α = 0.8, µ = 0 with MLP configuration 5-10-10-10-2.


MLP level 4: 6 inputs (3 from the dealer and 3 from the player) and 2 outputs.

This experiment setup is similar to the previous experiments. First the input feature vectors are preprocessed and analyzed. The characteristics of the 6 feature vectors are shown in the following table:

Feature vector #   Mean     Standard Deviation   Variance   Minimum   Maximum
1 (dealer)         5.2710   1.0015               1.0121     1         10
2 (dealer)         6.5097   1.1232               1.1324     1         10
3 (dealer)         7.1321   1.2156               1.2533     1         10
4 (player)         6.0211   1.0471               1.0877     1         10
5 (player)         6.6872   1.2016               1.2001     1         10
6 (player)         7.0728   1.3549               1.3531     1         10


Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The first preprocessing step is feature vector normalization, followed by scaling all feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 3 and 6 are more important with respect to their class (their standard deviations are the highest); the system can pay more attention to these feature vectors by putting more weight on them.





MLP (level 4) configurations used in the experiment:

With normalization of feature vectors
The input feature vectors are scaled to the range -5 to 5
Learning rate alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding the input layer, total # of layers = 2, 3, 4
Max. training epochs: 1000
Number of neurons in each hidden layer = 8, 12, 50
MLP configurations: 6-8-2, 6-8-8-2, 6-8-8-8-2, 6-12-12-12-2, 6-50-50-50-2
Activation function for the hidden layers: hyperbolic tangent
Activation function for the output layer: sigmoidal
The training set is partitioned dynamically into training and tuning sets, with 20% of the training data reserved for tuning
# of epochs between convergence checks = 10, epoch size = 64

α     µ     MLP Configuration   Classification Rate
0.1   0     6-8-2               79.1%
0.1   0.8   6-8-2               80.9%
0.8   0     6-8-2               80.5%
0.8   0.8   6-8-2               80.1%
0.1   0     6-8-8-2             81.5%
0.1   0.8   6-8-8-2             82.2%
0.8   0     6-8-8-2             82.8%
0.8   0.8   6-8-8-2             81.4%
0.1   0     6-8-8-8-2           84.7%
0.1   0.8   6-8-8-8-2           85.2%
0.8   0     6-8-8-8-2           85.7%
0.8   0.8   6-8-8-8-2           84.6%
0.1   0     6-12-12-12-2        90.2%
0.1   0.8   6-12-12-12-2        89.3%
0.8   0     6-12-12-12-2        88.6%
0.8   0.8   6-12-12-12-2        86.9%
0.1   0     6-50-50-50-2        87.6%
0.1   0.8   6-50-50-50-2        88.3%
0.8   0     6-50-50-50-2        88.5%
0.8   0.8   6-50-50-50-2        87.8%

Table 4: Classification rates of different MLP (level 4) configurations, with the α and µ values shown in columns 1 and 2.

The highest classification rate is 90.2%. The best configuration is α = 0.1, µ = 0 with MLP configuration 6-12-12-12-2.

The previous experiments have tested the 4 MLP structures that will be used in this project.



After the best configuration for each MLP was defined, I picked out the cases that are most often misclassified by the MLP in order to improve the classification performance, and set up a misclassification table to store these cases. The MLP structures are integrated into the Blackjack game so that the program calls an MLP whenever a decision must be made. Before the MLP performs its classification, the program checks whether the specific dealer and player card pattern exists in the misclassification table; if it does, the action from the misclassification table is used instead of the MLP's decision. There are 610 entries in the misclassification table. For the MLP configuration 4-10-10-10-2, this improves the classification rate to 92.9%.
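The lookup-before-classify step can be sketched as follows (misTable, pattern, W and mlpDecide are assumed names for this illustration, not identifiers from the project files):

    % Check the misclassification table first; fall back to the MLP otherwise.
    idx = find(ismember(misTable(:, 1:10), pattern, 'rows'), 1);
    if ~isempty(idx)
        action = misTable(idx, 11);        % stored corrective action for this pattern
    else
        action = mlpDecide(W, pattern);    % MLP decision (see the forward-pass sketch above)
    end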


Here is a summary of the 4 MLP configurations:

Normalization of feature vectors, scaled to the range -5 to 5
Max. training epochs: 1000, epoch size = 64
Activation function (hidden layers) = hyperbolic tangent
Activation function (output layer) = sigmoidal

        α     µ     Configuration    Classification Rate
MLP1    0.1   0     4-10-10-10-2     89.5%
MLP2    0.1   0.8   5-10-10-10-2     91.1%
MLP3    0.8   0     5-10-10-10-2     92.5%
MLP4    0.1   0     6-12-12-12-2     90.2%

Matlab Simulation using MLP

The Matlab program BlackjackANN.m is implemented using the MLPs defined in the previous section. 1000 games were simulated in this experiment, with both the dealer and the player using MLPs to make decisions on their turns. The winning percentages of the dealer and the player are as follows.

When applying the MLP to the player to play against a dealer with the 17-point rule, the player achieves a winning percentage of 56.5%: out of 1000 games, the player won 565 times.



Strategy           Win %    Tie %
Player with MLP    56.5%    9%


When applying the MLP to the dealer (where the dealer does not follow the 17-point rule) to play against a player making random moves, the dealer achieves an outstanding winning percentage of 68.2%.

Strategy           Win %    Tie %
Dealer with MLP    68.2%    3%


When applying MLPs to both the dealer and the player to play against each other, the percentage of tie games increases, as they are more likely to use the same strategies, based on the dealer's and the player's input lookup tables. It also depends on the number of entries in the misclassification tables. In this experiment, the player performed a little better than the dealer, probably because the player's misclassification table identifies more misclassified patterns than the dealer's table does.

Strategy           Win %    Tie %
Player with MLP    54%      3%
Dealer with MLP    43%      3%


Finally, I had the opportunity to play against the system that I developed. Playing against the dealer with the original dealer lookup table, I was able to win 12 games out of 20, with 2 tie games. I then let the dealer use a better table (the player's original table) as the input to its MLP networks and played against the dealer again. This time it improved its performance by 1 game (and 1 tie game).

Strategy           Win %    Tie %
Human Player       60%      10%
Dealer with MLP    30%      10%


Using the MLP networks for the dealer and the player greatly increases the number of games won by both the dealer and the player. Of course, the human still plays better than the MLP in the end, as the training set for the MLP is very small compared to all the possible states a Blackjack game can have. If the number of entries in both the dealer's and the player's tables increases (so that most of the states in the game are covered), I believe the MLP can certainly play better than a human being. Building such a large network would require additional numerical analysis techniques (for example, how to make the lookup table smaller while containing more information) so that the MLP can produce a reasonable response in each turn.

Human beings are adept at using their own experience in playing games: the more games you play, the more familiar you become with the game through practice. This suggests an interesting learning experiment in which both the program and the human learn together at the same time, and we observe who actually learns faster. Therefore, I decided to implement a side version of Blackjack as an experiment: instead of having 21 points as the default goal, you can define your own target points in the game. The next section shows the details of this experiment.



Applying ANN to play user-defined Blackjack

In the normal Blackjack game, the goal of each game is to draw cards from a deck of 52 cards to a total value as close to 21 as possible. To make applying the ANN to Blackjack more interesting, I created a version of Blackjack where the user can define his own target points (for example, 30 points), so that the program and the human can learn how to play a new game at the same time. To make the implementation easier, I still let both the dealer and the player hold a maximum of 5 cards. If both the dealer and the player reach 5 cards without busting, the total points determine who the winner is.

Of course, a new set of input lookup tables is needed to train the dealer to play this new game. Since I am playing as the only player, I did not train the player's input lookup tables. When the target points change, all the existing game rules and strategies change as well. Without knowing the new rules and strategies of the game in advance, does the MLP network learn better and faster than the human player?

The Matlab program BlackjackANNx.m implements the user-defined Blackjack game. 20 games were set up between the human and the dealer. To avoid cheating, I did not calculate the probabilities of my plays in advance. After the 20 games, the experimental results are as follows:



30-point target game   Win %    Tie %
Human Player           40%      5%
Dealer with MLP        55%      5%


As this was my first time playing 30-point Blackjack, I won only 8 games out of 20. The winning percentages for the MLP and the human are close. Though the MLP has a slightly larger input table, the table only covers a small percentage of the possible states in this new game. The table above shows the result of the first time I played the new 30-point game; as I played more, I started to beat the dealer more often. As the number of games played increases, the learning rate of the human is definitely much higher than that of the program.

On the other hand, the percentage of tie games decreases as the total target points increase. There is only 1 tie game out of 20 because, as the number of states increases, the probability of the two players reaching the same state becomes smaller. As the total target points increase, the number of states also increases dramatically. Moreover, it is impossible for the program to capture all possible states, as the number of lookup table entries would grow dramatically; the winning probability of each state would only converge after millions of games and millions of entries in the lookup tables. This can be shown in an experiment similar to the earlier ones. With this tremendous number of states in the game, much depends on how lucky the simulation is in the first 10000 games at collecting input data for the lookup table and at capturing most of the common states. After simulating more games, with the lookup table as the input to the dealer's MLP, the MLP reached at best a 65% classification rate in the new game.


To simulate the program one more time, I let the simulated player use the previously trained dealer's lookup table as the input to the player's MLP network, while the dealer used a 25-point rule to play this new game. 1000 games were simulated and the experimental results are as follows:





30-point target game        Win %    Tie %
Player with MLP             50.2%    2.0%
Dealer with 25-point rule   47.6%    2.0%


Even though the player is using a small lookup table as input to its MLP, it can actually beat the dealer, who is using a fixed game rule, by 2.6%. The MLP network works best for highly random and dynamic games, where the game rules and strategies are hard to define and the game outputs are hard to predict exactly. Of course, it also depends on how lucky the system is when it first collects its input data.


Conclusion

This project explored the use of an Artificial Neural Network for learning Blackjack strategies without explicitly teaching the rules of the game, with reinforcement learning used to learn the strategies. From the experiments above, it can be concluded that the reinforcement learning algorithm does learn better strategies than fixed-rule strategies in playing Blackjack, as it dramatically improves the winning probability of each game. The reinforcement learning algorithm is able to suggest the best actions when the situation changes over time; it finds a mapping from different states to probability distributions over the various actions. Without knowing any rules of Blackjack, the ANN is able to perform well in playing the game, given a training data set that is large and accurate enough to cover most of the game states. The misclassification table is also useful in improving the classification rate by reducing the number of misclassifications; however, it requires some manual work to pick out entries from the lookup table, recognize the misclassified patterns, and put them into the misclassification table.



The highest classification rate obtained from the MLP is 81%. Of course, the input training table can also be repeatedly retrained with the correctly classified test data to improve the classification rate. Playing with the ANN achieves an average winning rate of 51%, a minimum of 39%, and a maximum of 65% when simulated over 1000 games. However, as the number of games increases, the number of un-encountered states, i.e. states that are not included in the lookup tables, also increases; the average winning rate then drops to 42%. Continued learning is important to produce good results: for example, the input training data for the MLP should be updated after every 10000 games played, or the test data from each game can be fed back into the training input table to increase the training set. As the number of games increases, the game strategies change over time; the larger the input training data, the better the performance will be.


Since the cards are dealt from a 52-card deck, the current hand actually depends on the previous hands. Taking advantage of this memory when playing Blackjack makes the search space much smaller and hence increases the winning probability of each state; including a card-counting algorithm in the program is therefore a desirable feature for future work. Other possibilities include: adding additional players to the game; training the ANN with a teacher to eliminate duplicate patterns (for example, 4 + 7 = 7 + 4 = 5 + 6 = ...) and to identify misclassified patterns; training the ANN against different experts so that it can pick up various game strategies, storing game tricks and strategies in a separate table for the ANN to look up when different states are encountered; exploring other learning methods such as Q-learning; accumulating a larger training input data set over time to build a better ANN; and extending the ANN to play other similar games such as Poker. The MLP network works best for highly random and dynamic games, where the game rules and strategies are hard to define and the game outputs are hard to predict exactly.



Appendix 1

Matlab Files

Blackjack.m        Allows the user to play Blackjack against the dealer.
BlackjackANN.m     Based on the MLP and input data, the ANN makes decisions to play against the dealer.
Genlookuptable.m   Simulates 10000 games between the dealer and the player, where the player makes random moves. The winning probability of each state is updated after each game. The move patterns and winning probability values are stored in the dealer and player lookup tables, which are used as the input to the MLP.
BlackjackN.m       Called by genlookuptable.m to play against the dealer based on random moves and to update the lookup tables.
Bp2.m, bpconfig2.m, Bpdisplay.m   Take the lookup tables as training data. Using the back-propagation algorithm, they output the weight matrix for the MLP and the classification rate of each setting.
Blackjackx.m       Regular Blackjack with user-defined target points, for example 30 points.
BlackjackANNx.m    The general form of Blackjack, which takes a user-defined target value (by default, 21). The learning algorithm is also applied to this program.
Cardhit.m          Updates the screen when the player or dealer hits a card.
Cardplot.m         Plots the dealt cards on the user screen.
Checkmoney.m       Checks the user balance; if it is less than zero, quits.
Dealerwin.m        Performs updates to the lookup tables if the dealer won.
findMLP.m          Separates the input lookup table into different sub-tables for the 4 MLPs.
Scale.m, Rsample.m, Randomize.m, Cvgtest.m, Actfun.m, Actfunp.m   Utility Matlab functions from the class website.
Shuffle.m          Shuffles the cards before they are dealt to the player and dealer.
Youwin.m           Performs updates to the lookup tables if the player won.




Appendix 2

References

De Wilde, Philippe (1995). Neural Network Models: An Analysis. Springer-Verlag.

M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," Proceedings of the Eleventh International Conference on Machine Learning, 1994, pp. 157-163. Morgan Kaufmann.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

B. Widrow, N. Gupta and S. Maitra, "Punish/Reward: Learning with a Critic in Adaptive Threshold Systems," IEEE Transactions on Systems, Man and Cybernetics, vol. 3, no. 5, pp. 455-465, 1973.

Haykin, Simon (1999). Neural Networks: A Comprehensive Foundation, 2nd edition, Prentice Hall, New Jersey.




Online Presentation and Matlab Files





The project report, online presentation and Matlab files are available for download at:

http://www.cae.wisc.edu/~sam/ece539