Learning Blackjack with ANN (Artificial Neural Network)
ECE 539 Final Project
Ip Kei Sam
sam@cae.wisc.edu
ID: 9012828100
Abstract
Blackjack is a card game in which the player attempts to beat the dealer by holding a hand whose total points are higher than the dealer's hand but less than or equal to 21. The probabilistic nature of Blackjack makes it an illustrative application for learning algorithms, since the learning system needs to explore different strategies to find those with a higher probability of winning. This project explores the use of an Artificial Neural Network (ANN) to learn Blackjack strategies, with a reinforcement learning algorithm used to learn them. Reinforcement learning is a process of mapping situations to actions in which the learner is not told which actions to take, but must discover, by trial and error, which actions yield the highest reward. The trained ANN will be used to play Blackjack without being explicitly taught the rules of the game. Furthermore, the efficiency of the ANN and the results of the learning algorithm will be investigated and interpreted as different strategies for playing Blackjack and other games of chance.
Background
Starting with two cards dealt to each player, the object of Blackjack is to draw cards from a deck of 52 cards toward a total value of 21. The player can choose from the following actions:
Stand: stay with the current hand and take no more cards.
Hit: add a card to the hand to bring the total card value closer to 21.
Double Down: when holding 2 cards, the player can double his bet by taking exactly one more card and standing after that.
Split: holding a pair of cards with the same value, the player can split his hand into two hands. The player may split up to 3 times in a game.
For simplicity, Double Down and Split will not be considered in this project.
The value of a hand is the sum of the values of the cards in the hand: each card from 2 to 10 is valued accordingly, J, Q, and K have a value of 10, and Aces can be either 1 or 11. Each player plays against the dealer, and the goal is to obtain a hand with a greater value than the dealer's hand but less than or equal to 21. A player may hit as many times as he wishes as long as his total does not go over 21. The player can also win by holding 4 or 5 cards with total points less than 21. The dealer hits when his hand is less than 17 and stands when it is 17 or greater. When the player is dealt 21 points in the first 2 cards, he automatically wins his bet if the dealer is not also dealt 21. If the dealer has blackjack (21 points), the game is over and the dealer wins all bets, or ties with any player who also has blackjack (21 points).
Figures 1 and 2 show a demo of the Matlab Blackjack (blackjack.m) program. The player presses the hit or stand button each turn, and the dealer's and player's total points are calculated and shown at the bottom. The display also shows the total balance remaining in the player's account and the amount the player has bet for this game. In this example, the player bet $0. The program exits when the player's balance falls below zero.
Figure 1: The initial state of the game; the first 2 cards have been dealt to the player and to the dealer.
To measure the efficiency of the rules in Blackjack, I simulated the program playing against the dealer for 1000 games, where the dealer follows the 17 point rule. The efficiency can be observed from the percentages of winning and drawn games. The comparison of the player's random moves versus the dealer's 17 point rule is shown in Figure 2a.
Figure 2: The player chooses to stand; the dealer chooses to hit but gets busted. The player won.
Strategy                    Win %   Tie %
Player's random moves       31%     8%
Dealer's 17 points rule     61%     8%

Figure 2a: Efficiency of different strategies in Blackjack. Each strategy is played for 1000 games.
One of the goals of this project is to develop a better strategy with an ANN that beats the dealer's 17 points rule, that is, a new strategy with a higher winning percentage. Different configurations of the MLP and different preprocessing of the input training data sets will also be experimented with later in this paper. Finally, the paper will explore some of the Blackjack strategies interpreted from the experiment results.
Applying Reinforcement Learning to Blackjack
Reinforcement learning is a process of mapping situations to actions such that the reward value is maximized. The learning algorithm decides which actions to take by finding, through trial and error, the actions that yield the highest reward. The actions taken affect not only the immediate rewards but the subsequent rewards as well.
In Blackjack, given a set of dealer's and player's cards, if the probability of winning from each outcome is known, the player can always make the best decision (hit or stand) by taking the action that yields the highest winning probability in the next state. For each final state, the probability of winning is either 1 (if the player wins/draws) or 0 (if the player loses).
In this project, the initial winning probability of each intermediate state is set to 0.5, and the learning parameter α is also initialized to 0.5. The winning probability of each state is updated for the dealer and the player after each game. The winning probability of the previous state moves closer to that of the current state based on this equation:

P(s) = P(s) + α*[P(s') − P(s)]    (1)

where α is the learning parameter, s is the current state, and s' is the next state.
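As an illustrative sketch (in Python, not the project's Matlab code), the update in equation (1) can be applied backward over the states visited in one game; the state keys and table layout here are hypothetical:

```python
# Sketch of the winning-probability update P(s) = P(s) + alpha * (P(s') - P(s)).
# Any state not yet in the table starts at the default winning probability 0.5.

def update_episode(P, states, final_reward, alpha=0.5):
    """Propagate the final reward (1 = win/draw, 0 = loss) back through
    the sequence of visited states, most recent first."""
    P[states[-1]] = final_reward          # final state: probability is 0 or 1
    for s, s_next in zip(reversed(states[:-1]), reversed(states[1:])):
        p = P.get(s, 0.5)
        P[s] = p + alpha * (P.get(s_next, 0.5) - p)
    return P

# The example from the text: three states, the player busts (reward 0).
P = update_episode({}, ["s1", "s2", "s3"], final_reward=0, alpha=0.5)
# P["s2"] = 0.5 + 0.5*(0 - 0.5) = 0.25;  P["s1"] = 0.5 + 0.5*(0.25 - 0.5) = 0.375
```

This reproduces the worked numbers in the example below (0.2500 and 0.3750).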
For example, figure 3 shows the first 3 rows taken from the result table output when the Matlab program genlookuptable.m is simulated to play against the dealer using random decisions.
2.0000  5.0000   0       0  0   6.0000  6.0000  0       0       0   0.3700  1.0000  0
2.0000  5.0000   0       0  0   4.0000  6.0000  6.0000  0       0   0.2500  1.0000  0
2.0000  5.0000  10.0000  0  0   4.0000  6.0000  6.0000  7.0000  0   0       1.0000  1.0000
Figure 3: the result table (lpxx) from the Matlab program output after one
game.
The first 5 columns represent the dealer's cards and the next 5 columns represent the player's cards. The dealer and player can each have a maximum of 5 cards by the game rules. The card values in each hand are sorted in ascending order before they are inserted into the table. Column 11 is the winning probability of each state. Columns 12 and 13 represent the action taken by the player, where [1 0] denotes a "hit", [0 1] denotes a "stand", and [1 1] denotes an end state where no action is required. In the first row, the dealer has 2 and 5 (with a total score of 7) and the player has 6 and 6 (with a total score of 12). Based on a random decision, the player decides to hit. The next round, the player gets a 4 for a total of 16, while the dealer gets a 10 for a total of 17. The player decides to hit again. This time the player is busted with a total score of 23 in the final state (row 3).
Since the player lost, this state is rewarded with P(3) = 0. The learning parameter α = 0.5. The winning probability of taking a "hit" in the previous state is:

P(2) = P(2) + α*[P(3) − P(2)] = 0.5 + 0.5 * (0 − 0.5) = 0.2500

Similarly,

P(1) = P(1) + α*[P(2) − P(1)] = 0.5 + 0.5 * (0.25 − 0.5) = 0.3750
After playing a sequence of games, the result table contains the winning probabilities of all intermediate states, the dealer's and player's cards, and the actions taken by the player.
When playing a new game, if the same card pattern is encountered, for example in row 2,

2.0000  5.0000  0  0  0   4.0000  6.0000  6.0000  0  0   0.2500  1.0000  0

the player will choose to "stand" this time, since taking a "hit" has a low winning probability. After the new game, the winning probability will again be updated based on the game result. If the player loses again by hitting an additional card, the winning probability becomes:

P(2) = P(2) + α*[P(3) − P(2)] = 0.25 + 0.5 * (0 − 0.25) = 0.125
Otherwise, if the player wins this time, the probability is:

P(2) = P(2) + α*[P(3) − P(2)] = 0.25 + 0.5 * (1 − 0.25) = 0.625

The winning probability of each state will converge after a large number of games. Of course, the more games are played, the more accurate the winning probabilities become.
Originally, the dealer follows the 17 point rule of Blackjack. To make the dealer more intelligent, reinforcement learning can be applied to the dealer as well. Similarly, a result table can be obtained for the dealer after each game. Let's take a look at the following dealer result table:

2.0000  5.0000   0       0  0   6.0000  6.0000  0       0  0   0.7500  1.0000  0
2.0000  5.0000  10.0000  0  0   4.0000  6.0000  6.0000  0  0   1.0000  1.0000  1.0000
In this game, the dealer was initially dealt 2 and 5 for a total of 7, while the player was dealt 6 and 6 for a total of 12. The dealer chose to hit, resulting in a total of 17 points in the final state. The player decided to hit as well, but ended with a total of only 16 points. The dealer's P(2) = 1, so

P(1) = P(1) + α*[P(2) − P(1)] = 0.5 + 0.5 * (1 − 0.5) = 0.7500
When the same card pattern (in row 1) is encountered again in a new game, the dealer will choose to hit again, since a "hit" results in a higher probability of winning. Again, the probability value will be updated based on the results after each new game.
The learning parameter α is initially set to 0.5 and decreases after each game based on this equation: α = α * (# of games remaining / total # of games).
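Reading the decay rule literally (applied to the current α after each game), the schedule can be sketched as follows; this is a hedged reconstruction, since the source does not show the exact code:

```python
def alphas(alpha0, total_games):
    """Alpha value used in game g (1-indexed), per the text's update
    alpha <- alpha * (games remaining / total games)."""
    alpha, out = alpha0, []
    for g in range(1, total_games + 1):
        out.append(alpha)
        alpha *= (total_games - g) / total_games   # remaining / total
    return out

alphas(0.5, 4)   # [0.5, 0.375, 0.1875, 0.046875]; next value would be 0
```

After the last game the remaining count is 0, so α reaches 0 exactly as the text states.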
Therefore, α becomes 0 once all test games are finished. If the player or the dealer always takes the action leading to the state with the highest winning probability, some intermediate states, whose winning probability stays at the initial value of 0.5 and is always lower than that of other states, would never be encountered. Random moves are therefore necessary in between in order to explore these states. The winning probabilities along random moves are not updated at the end of the game. The total number of games simulated in this program is 1000, and in each game 20% of the moves are randomly generated.
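This exploration policy resembles an ε-greedy rule with ε = 0.2; a hedged Python sketch (the successor-state arguments here are hypothetical placeholders, not the project's Matlab interface):

```python
import random

def choose_action(P, state_if_hit, state_if_stand, explore=0.2):
    """Pick hit/stand by comparing the winning probabilities of the two
    successor states; with probability `explore`, move at random instead.
    The second return value flags random moves, which are not updated."""
    if random.random() < explore:
        return random.choice(["hit", "stand"]), True
    p_hit = P.get(state_if_hit, 0.5)        # unseen states default to 0.5
    p_stand = P.get(state_if_stand, 0.5)
    return ("hit" if p_hit >= p_stand else "stand"), False

choose_action({"h": 0.8, "s": 0.3}, "h", "s", explore=0.0)   # ("hit", False)
```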
Since α decreases after each game, the winning probability value of each state converges as the number of games played increases.
Let's look at the previous example one more time:

2.0000  5.0000   0       0  0   6.0000  6.0000  0       0  0   0.7500  1.0000  0
2.0000  5.0000  10.0000  0  0   4.0000  6.0000  6.0000  0  0   1.0000  1.0000  1.0000
The P(1) value starts at the default value of 0.5; the first time this pattern is encountered, if the hand is won, P(1) is updated to 0.7259. After running 5000 games in the simulation, this pattern is encountered 9 times, and the P(1) value converges to 0.76, as shown in figure 4 below. This suggests that if the player has two 6's, taking a "hit" gives a better chance to win.
Figure 4: The convergence of the winning probability over 5000 games.
Let's look at another example. This time the player holds 8 and 10:

7.0000  8.0000  0       0  0   8.0000  10.0000  0       0  0   0.2500  1.0000  0
3.0000  7.0000  8.0000  0  0   8.0000  10.0000  5.0000  0  0   0       1.0000  1.0000
The P(1) value starts at the default value of 0.5; the first time this pattern is encountered, P(1) is updated to 0.25 when the player loses. After running 5000 games in the simulation, this pattern is encountered 11 times, and the P(1) value converges to 0.35, as shown in figure 5. This suggests that if the player holds 8 and 10, taking a "hit" is very likely to lose. In this case, the player should only "hit" when the dealer's hand is higher than the player's hand.
Figure 5: The convergence of the winning probability over 5000 games.
Both experiments show the convergence of the winning probability values in each state as the learning parameter decreases after each game.
Equation (1)
The update method based on equation (1) performs quite well in this experiment. As α is reduced over time, the method converges to the true winning probability of each state, given optimal actions by the player. Except for random moves, each action taken (hit or stand) is in fact the optimal move against the dealer, since the method converges to an optimal policy for playing Blackjack. If α does not decrease to 0 at the end of the training, then this player also plays well against a dealer that changes its way of playing slowly over time. Applying reinforcement learning to the dealer will certainly beat the dealer's original 17 point rule.
Matlab Implementation
As the Matlab program (blackjackN.m) plays against the dealer, each time a new card pattern is encountered it is given a default probability value of 0.5. Its winning probability is updated after each game based on the final result and added to the result table at the end. States that have not yet been encountered by the program are not available in the result table; when the program encounters a card pattern that is not found in the result table, it makes its decision with a random move.
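A sketch of this lookup-or-random decision (Python; the sorted-cards key and per-action probability entries are assumptions, since the project's actual Matlab table layout differs):

```python
import random

def decide(table, dealer_cards, player_cards):
    """Look up the winning probabilities recorded for hitting and standing in
    this state; fall back to a random move for unseen card patterns."""
    key = (tuple(sorted(dealer_cards)), tuple(sorted(player_cards)))
    entry = table.get(key)                 # e.g. {"hit": 0.25, "stand": 0.55}
    if entry is None:
        return random.choice(["hit", "stand"])
    return max(entry, key=entry.get)       # action with the higher probability

table = {((2, 5), (6, 6)): {"hit": 0.25, "stand": 0.55}}
decide(table, [5, 2], [6, 6])              # "stand", since 0.55 > 0.25
```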
There are approximately 10^6 different states for the dealer's and player's card patterns. Out of these possible states, many do not need to be considered, as they go well beyond 21 points. For example, a hand of five 10's is not possible, since the hand is already busted at three 10's. Moreover, it is not possible to simulate all of these different states because of the slow iterative operations in Matlab.
Matlab Simulation
All cards in one hand are sorted before they are inserted into the table. The dealer and the player each have a table of their own; each table serves as input data to the ANN for training, either from the dealer table or from the player table. Based on the card pattern, the corresponding winning probability is returned. The Matlab program (genlookuptable.m) plays against the dealer, starting with initially empty dealer and player tables, to explore the different combinations of the dealer's and the player's cards. 15000 games were simulated, generating 28273 entries for the dealer's table and 32372 entries for the player's table.
In the end, the program had encountered less than 10% of all possible states. However, the states that were encountered are the most common ones. In fact, less than 10% of all different states commonly occur in playing Blackjack. For example, states with 4 or 5 cards in hand occur much less frequently, and holding 4 or 5 tens is a state that is impossible in Blackjack.
Both the dealer and the player learn to play better as the number of games played increases. To test the accuracy of the dealer table and the player table, I set up another sequence of games in which the dealer plays against the player using the generated lookup tables. The following tables summarize the experiments. When applying reinforcement learning to the player against a dealer with the 17 point rule, the player shows a significant improvement in winning percentage: out of 1000 games, the player won 512 times.
Strategy                    Win %   Tie %
Player (with learning)      51.2%   7%
When applying reinforcement learning to the dealer (where the dealer does not follow the 17 points rule) against a player making random moves, the dealer improves its winning percentage to 67%.

Strategy                    Win %   Tie %
Dealer (with learning)      67%     5%
When applying reinforcement learning to both the dealer and the player against each other, the percentage of tie games increases, because the two sides are more likely to use the same set of strategies based on reinforcement learning.

Strategy                    Win %   Tie %
Player (with learning)      43%     12%
Dealer (with learning)      45%     12%
Finally, I played against the dealer using the dealer lookup table. I lost 11 games out of 20 and had 1 tie game.

Strategy                    Win %   Tie %
Human Player                45%     5%
Using the dealer and player tables greatly increased the number of games won by both the dealer and the player. Therefore the tables generated for the dealer and player are good enough to use for the subsequent experiments. Just by looking at the values in the lookup tables, the game strategy can be interpreted. For the player, if the total is higher than 14, in most cases the table suggests not to "hit"; similarly, for the dealer it suggests not to "hit" over 15. Without knowing the rules of Blackjack, reinforcement learning and the updates at the end of each game were able to determine the thresholds of 14 for the player and 15 for the dealer, even though these are a little more conservative than the dealer's 17 point rule. Watching the player play against the dealer when both use lookup tables, the dealer has a slightly higher winning percentage. This suggests that the dealer table, which advises not to "hit" at 15, plays a little better than the player table, which advises not to "hit" at 14.
Applying ANN to Blackjack
The next step is to prepare the input data for the ANN using the result lookup tables. Once the dealer and player lookup tables are ready, they can be converted into training data for the ANN. The dealer's and player's cards from the lookup tables are the inputs to the ANN. The output contains two neurons, each corresponding to one action (either hit or stand); the action taken depends on which output neuron has the higher value. This is a classification problem. To simplify it a little, I set up several MLP structures for the different stages of the game to better classify the problem. The different MLPs needed for the dealer and the player, as well as the overall game flow chart, are shown in figure 6 below.
Note that the rules of Blackjack (and the 17 point dealer rule) are not implemented in the project. All decisions are based on MLP outputs.
Figure 6: The overall flow chart of using the different MLPs in this project.
The first level MLP contains 4 inputs (2 cards from the dealer and 2 cards from the player in the initial state) and 2 outputs (either hit or stand). The second level MLP contains 5 inputs (2 from the dealer and 3 from the player) and 2 outputs (hit/stand). The third level MLP contains 5 inputs (3 from the dealer and 2 from the player) and 2 outputs (hit/stand). The fourth level MLP contains 6 inputs (3 from the dealer and 3 from the player) and 2 outputs (hit/stand). Further levels (4 or 5 cards for the player or dealer) are not considered, as reaching these stages of having 4 or 5 cards in hand becomes less likely, and the structures become more complicated, which leads to more difficulty in classification. When either the player or the dealer has 4 or 5 cards in one hand, the action is determined either by a random move or by checking whether the dealer's total is less than the player's total, and vice versa.
Accordingly, the input lookup table is separated into 5 sub-tables (findMLP.m implements this function): the first sub-table picks up all rows that contain 2 dealer cards and 2 player cards (the initial state only); the second sub-table contains the states with 2 dealer cards and 3 player cards; the third sub-table has the states with 3 dealer cards and 2 player cards; and the fourth sub-table contains the states with 3 dealer cards and 3 player cards. The remaining rows, states with 4 or 5 dealer or player cards, are put into the fifth sub-table. Each sub-table is saved as input to one of the 4 MLP levels defined above. During the simulation, depending on how many cards the dealer and the player have, a different MLP structure is called to decide which action (hit/stand) to take. The "End" state represents the end of the game, where either the player won, the dealer won, or the game was a draw.
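The split into sub-tables can be sketched as follows (Python; the 13-column row layout follows figure 3, everything else is an assumption about what findMLP.m does):

```python
def split_subtables(rows):
    """Bucket lookup-table rows by (dealer card count, player card count).
    Each row: 5 dealer card columns, 5 player card columns, a probability,
    and a 2-column action code; zeros mark unused card slots."""
    buckets = {(2, 2): [], (2, 3): [], (3, 2): [], (3, 3): [], "rest": []}
    for row in rows:
        n_dealer = sum(1 for v in row[0:5] if v != 0)   # dealer cards dealt
        n_player = sum(1 for v in row[5:10] if v != 0)  # player cards dealt
        buckets.get((n_dealer, n_player), buckets["rest"]).append(row)
    return buckets

row = [2, 5, 0, 0, 0, 6, 6, 0, 0, 0, 0.37, 1, 0]   # dealer 2,5; player 6,6
split_subtables([row])                              # the row lands in bucket (2, 2)
```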
In the next step, the input data of each feature vector must be analyzed and preprocessed before training the different MLPs. Each column of the lookup table contains a card value, so the range of each value is from 1 to 10.
First, let's look at the MLP level 1 experiment.

MLP level 1: 4 inputs (2 from the dealer and 2 from the player) and 2 outputs.
The first and second feature vectors are the dealer's, and the third and fourth are the player's. Since the cards are sorted before they are inserted into the table, the mean of the first (or third) feature vector is always less than the mean of the second (or fourth). The values are shown in the following table:
Feature vector #   Mean     Std. Dev.   Variance   Minimum   Maximum
1 (dealer)         4.5435   1.2907      1.194      1         10
2 (dealer)         7.2961   2.3353      1.282      1         10
3 (player)         3.8168   1.7196      1.674      1         10
4 (player)         6.7322   1.8596      1.374      1         10
Inspecting the feature vectors of the input data, the table shows that the mean of each feature differs from the others: the means of features 1 and 3 are lower than those of features 2 and 4, while the minimum and maximum of the 4 feature vectors are the same. The first preprocessing step is feature vector normalization, followed by scaling all feature vectors into the same range (−5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 2 and 4 are more important with respect to their class (their standard deviations are the highest); the system could pay more attention to these feature vectors by putting more weight on them. This data set is nonlinear, and the classification rate is not satisfactory without preprocessing the data set. With preprocessing of the input data, it is possible to identify exactly which feature vectors play the more important roles in the classification phase; these feature vectors would then carry more weight.
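The described preprocessing (normalize each feature vector, then scale into the range −5 to 5) might look like the following; this is a hedged reconstruction, not the project's Matlab code:

```python
def preprocess(columns, lo=-5.0, hi=5.0):
    """Normalize each feature column to zero mean / unit variance, then
    rescale every column linearly into the common range [lo, hi]."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        std = var ** 0.5 or 1.0                     # guard constant columns
        z = [(v - mean) / std for v in col]         # zero mean, unit variance
        zmin, zmax = min(z), max(z)
        span = (zmax - zmin) or 1.0
        out.append([lo + (v - zmin) * (hi - lo) / span for v in z])
    return out

cols = preprocess([[1, 2, 3, 10], [4, 4, 5, 9]])
# every column now spans -5 .. 5
```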
After preprocessing, the next step is to train the MLP with this set of training data. The Matlab program "bp2.m" implements the back propagation algorithm that trains the MLP. Several MLP configurations were experimented with. Based on the training data, the program produces the weight matrix. The hyperbolic tangent activation function was used in the hidden layers and the sigmoidal function was used in the output neurons. Different values of α and µ, epoch sizes, numbers of hidden layers, and numbers of neurons in the hidden layers were also experimented with.
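For illustration, a forward pass through such an MLP (tanh hidden layers, sigmoid output layer) can be sketched in Python with NumPy; the weights here are random placeholders, not trained values from bp2.m:

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass: tanh on every hidden layer, sigmoid on the output layer."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        a = 1.0 / (1.0 + np.exp(-z)) if i == len(weights) - 1 else np.tanh(z)
    return a

rng = np.random.default_rng(0)
sizes = [4, 10, 10, 10, 2]                     # the best level-1 configuration
Ws = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
out = forward(np.array([2.0, 5.0, 6.0, 6.0]), Ws, bs)
# out has 2 values in (0, 1); the larger one selects hit or stand
```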
MLP (level 1) configurations used in the experiment:

With normalization of the feature vectors; the input feature vectors are scaled to the range −5 to 5
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding the input layer, total # of layers = 2, 3, 4
Max. training epochs: 1000
Number of neurons in each hidden layer = 5, 10, 50
MLP configurations: 4-5-2, 4-5-5-2, 4-5-5-5-2, 4-10-10-10-2, 4-50-50-50-2
Activation function for the hidden layers: hyperbolic tangent
Activation function for the output layer: sigmoidal
The training set is partitioned dynamically into training and tuning sets, with 20% of the training data reserved for tuning
# of epochs between convergence checks = 10, epoch size = 64
α     µ     MLP Configuration   Classification Rate
0.1   0     4-5-2               80.2%
0.1   0.8   4-5-2               89.5%
0.8   0     4-5-2               79.1%
0.8   0.8   4-5-2               78.9%
0.1   0     4-5-5-2             85.8%
0.1   0.8   4-5-5-2             83.2%
0.8   0     4-5-5-2             84.1%
0.8   0.8   4-5-5-2             84.0%
0.1   0     4-5-5-5-2           87.2%
0.1   0.8   4-5-5-5-2           86.5%
0.8   0     4-5-5-5-2           85.3%
0.8   0.8   4-5-5-5-2           82.8%
0.1   0     4-10-10-10-2        89.5%
0.1   0.8   4-10-10-10-2        87.0%
0.8   0     4-10-10-10-2        86.8%
0.8   0.8   4-10-10-10-2        84.8%
0.1   0     4-50-50-50-2        88.0%
0.1   0.8   4-50-50-50-2        86.8%
0.8   0     4-50-50-50-2        86.2%
0.8   0.8   4-50-50-50-2        85.5%
Table 1: Classification rates of different MLP (level 1) configurations, with the α and µ values shown in columns 1 and 2.

The highest classification rate is 89.5%. If the MLP is too large, it may be trapped at a certain classification rate that stops it from improving during the training process. The best configuration is α = 0.1, µ = 0, MLP configuration 4-10-10-10-2.
MLP level 2: 5 inputs (2 from the dealer and 3 from the player) and 2 outputs.

This experiment setup is similar to that of MLP level 1. First the input feature vectors are preprocessed and analyzed. The characteristics of the 5 feature vectors are shown in the following table:
Feature vector #   Mean     Std. Dev.   Variance   Minimum   Maximum
1 (dealer)         5.2532   1.1824      1.214      1         10
2 (dealer)         7.7195   1.3353      1.552      1         10
3 (player)         4.8351   1.2261      1.365      1         10
4 (player)         5.2465   1.5152      1.477      1         10
5 (player)         7.1328   1.9893      1.791      1         10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The first preprocessing step is feature vector normalization, followed by scaling all feature vectors into the same range (−5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 2 and 5 are more important with respect to their class (their standard deviations are the highest); the system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 2) configurations used in the experiment:

With normalization of the feature vectors; the input feature vectors are scaled to the range −5 to 5
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding the input layer, total # of layers = 2, 3, 4
Max. training epochs: 1000
Number of neurons in each hidden layer = 5, 10, 50
MLP configurations: 5-5-2, 5-5-5-2, 5-5-5-5-2, 5-10-10-10-2, 5-50-50-50-2
Activation function for the hidden layers: hyperbolic tangent
Activation function for the output layer: sigmoidal
The training set is partitioned dynamically into training and tuning sets, with 20% of the training data reserved for tuning
# of epochs between convergence checks = 10, epoch size = 64
α     µ     MLP Configuration   Classification Rate
0.1   0     5-5-2               78.3%
0.1   0.8   5-5-2               79.9%
0.8   0     5-5-2               79.5%
0.8   0.8   5-5-2               78.8%
0.1   0     5-5-5-2             84.2%
0.1   0.8   5-5-5-2             84.5%
0.8   0     5-5-5-2             83.8%
0.8   0.8   5-5-5-2             82.9%
0.1   0     5-5-5-5-2           86.6%
0.1   0.8   5-5-5-5-2           85.4%
0.8   0     5-5-5-5-2           84.9%
0.8   0.8   5-5-5-5-2           83.7%
0.1   0     5-10-10-10-2        90.1%
0.1   0.8   5-10-10-10-2        91.1%
0.8   0     5-10-10-10-2        90.7%
0.8   0.8   5-10-10-10-2        87.8%
0.1   0     5-50-50-50-2        90.7%
0.1   0.8   5-50-50-50-2        89.4%
0.8   0     5-50-50-50-2        89.6%
0.8   0.8   5-50-50-50-2        90.9%
Table 2: Classification rates of different MLP (level 2) configurations, with the α and µ values shown in columns 1 and 2.

The highest classification rate is 91.1%. If the MLP is too large, it may be trapped at a certain classification rate that stops it from improving during the training process. The best configuration is α = 0.1, µ = 0.8, MLP configuration 5-10-10-10-2.
MLP level 3: 5 inputs (3 from the dealer and 2 from the player) and 2 outputs.

This experiment setup is similar to that of MLP levels 1 and 2. First the input feature vectors are preprocessed and analyzed. The characteristics of the 5 feature vectors are shown in the following table:
Feature vector #   Mean     Std. Dev.   Variance   Minimum   Maximum
1 (dealer)         5.0133   1.2214      1.1003     1         10
2 (dealer)         6.2891   1.3174      1.3212     1         10
3 (dealer)         6.8344   1.5291      1.2984     1         10
4 (player)         6.7214   1.0127      1.0137     1         10
5 (player)         7.1722   1.4319      1.2101     1         10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The first preprocessing step is feature vector normalization, followed by scaling all feature vectors into the same range (−5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 3 and 5 are more important with respect to their class (their standard deviations are the highest); the system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 3) configurations used in the experiment:

With normalization of the feature vectors; the input feature vectors are scaled to the range −5 to 5
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding the input layer, total # of layers = 2, 3, 4
Max. training epochs: 1000
Number of neurons in each hidden layer = 5, 10, 50
MLP configurations: 5-5-2, 5-5-5-2, 5-5-5-5-2, 5-10-10-10-2, 5-50-50-50-2
Activation function for the hidden layers: hyperbolic tangent
Activation function for the output layer: sigmoidal
The training set is partitioned dynamically into training and tuning sets, with 20% of the training data reserved for tuning
# of epochs between convergence checks = 10, epoch size = 64
α     µ     MLP Configuration   Classification Rate
0.1   0     5-5-2               78.2%
0.1   0.8   5-5-2               78.8%
0.8   0     5-5-2               80.1%
0.8   0.8   5-5-2               79.4%
0.1   0     5-5-5-2             82.0%
0.1   0.8   5-5-5-2             84.5%
0.8   0     5-5-5-2             84.1%
0.8   0.8   5-5-5-2             83.9%
0.1   0     5-5-5-5-2           85.2%
0.1   0.8   5-5-5-5-2           83.9%
0.8   0     5-5-5-5-2           85.7%
0.8   0.8   5-5-5-5-2           84.5%
0.1   0     5-10-10-10-2        89.9%
0.1   0.8   5-10-10-10-2        91.7%
0.8   0     5-10-10-10-2        92.5%
0.8   0.8   5-10-10-10-2        90.8%
0.1   0     5-50-50-50-2        90.7%
0.1   0.8   5-50-50-50-2        90.1%
0.8   0     5-50-50-50-2        91.1%
0.8   0.8   5-50-50-50-2        89.8%
Table 3: Classification rates of different MLP (level 3) configurations, with the α and µ values shown in columns 1 and 2.

The highest classification rate is 92.5%. The best configuration is α = 0.8, µ = 0, MLP configuration 5-10-10-10-2.
MLP level 4: 6 inputs (3 from the dealer and 3 from the player) and 2 outputs.

This experiment setup is similar to the previous experiments. First the input feature vectors are preprocessed and analyzed. The characteristics of the 6 feature vectors are shown in the following table:
Feature vector #   Mean     Std. Dev.   Variance   Minimum   Maximum
1 (dealer)         5.2710   1.0015      1.0121     1         10
2 (dealer)         6.5097   1.1232      1.1324     1         10
3 (dealer)         7.1321   1.2156      1.2533     1         10
4 (player)         6.0211   1.0471      1.0877     1         10
5 (player)         6.6872   1.2016      1.2001     1         10
6 (player)         7.0728   1.3549      1.3531     1         10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The first preprocessing step is feature vector normalization, followed by scaling all feature vectors into the same range (−5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 3 and 6 are more important with respect to their class (their standard deviations are the highest); the system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 4) configurations used in the experiment:

With normalization of the feature vectors; the input feature vectors are scaled to the range −5 to 5
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding the input layer, total # of layers = 2, 3, 4
Max. training epochs: 1000
Number of neurons in each hidden layer = 8, 12, 50
MLP configurations: 6-8-2, 6-8-8-2, 6-8-8-8-2, 6-12-12-12-2, 6-50-50-50-2
Activation function for the hidden layers: hyperbolic tangent
Activation function for the output layer: sigmoidal
The training set is partitioned dynamically into training and tuning sets, with 20% of the training data reserved for tuning
# of epochs between convergence checks = 10, epoch size = 64
α     µ     MLP Configuration   Classification Rate
0.1   0     6-8-2               79.1%
0.1   0.8   6-8-2               80.9%
0.8   0     6-8-2               80.5%
0.8   0.8   6-8-2               80.1%
0.1   0     6-8-8-2             81.5%
0.1   0.8   6-8-8-2             82.2%
0.8   0     6-8-8-2             82.8%
0.8   0.8   6-8-8-2             81.4%
0.1   0     6-8-8-8-2           84.7%
0.1   0.8   6-8-8-8-2           85.2%
0.8   0     6-8-8-8-2           85.7%
0.8   0.8   6-8-8-8-2           84.6%
0.1   0     6-12-12-12-2        90.2%
0.1   0.8   6-12-12-12-2        89.3%
0.8   0     6-12-12-12-2        88.6%
0.8   0.8   6-12-12-12-2        86.9%
0.1   0     6-50-50-50-2        87.6%
0.1   0.8   6-50-50-50-2        88.3%
0.8   0     6-50-50-50-2        88.5%
0.8   0.8   6-50-50-50-2        87.8%
Table 4: Classification rates of the different MLP (level 4) configurations, with the α and µ values shown in columns 1 and 2.

The highest classification rate is 90.2%. The best configuration is α = 0.1, µ = 0, MLP configuration 6-12-12-12-2.
The previous experiments have tested the 4 MLP structures that will be used in this project.
After the best configuration for each MLP was determined, I collected the cases that the MLPs most often misclassify in order to improve classification performance, and set up a misclassification table to store them. The MLP structures are built into the Blackjack game so that the program calls an MLP whenever it is time to make a decision. Before the MLP carries out its classification task, the program checks whether the specific dealer-and-player card pattern exists in the misclassification table; if it does, the action from the misclassification table is used instead of the MLP's decision. There are 610 entries in the misclassification table. For the MLP configuration 4-10-10-10-2, this raises the classification rate to 92.9%.
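The decision path just described, checking the misclassification table first and falling back on the MLP only when the pattern is absent, could be sketched as follows (the table entry and the `mlp_decide` stub are illustrative, not the actual project data or network):

```python
# Misclassification table: maps a (dealer cards, player cards) pattern that
# the MLP tends to get wrong to the action that should be taken instead.
misclass_table = {
    ((10,), (9, 7)): "stand",   # hypothetical entry; the real table has 610
}

def mlp_decide(dealer_cards, player_cards):
    """Stub standing in for the trained MLP's hit/stand decision."""
    return "hit" if sum(player_cards) < 17 else "stand"

def decide(dealer_cards, player_cards):
    key = (tuple(dealer_cards), tuple(player_cards))
    if key in misclass_table:
        # Known hard case: use the stored correction instead of the MLP.
        return misclass_table[key]
    return mlp_decide(dealer_cards, player_cards)

print(decide([10], [9, 7]))   # stand (table override)
print(decide([10], [5, 7]))   # hit   (MLP fallback)
```

The dictionary lookup is constant time, so consulting the 610-entry table adds essentially no cost to each decision.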
Here are summaries of the 4 different MLP configurations:

Normalization of feature vectors, scaled to the range −5 to 5
Max. training epochs: 1000, epoch size = 64
Activation function (hidden layers) = hyperbolic tangent
Activation function (output layer) = sigmoidal
        α     µ     Configuration     Classification Rate
MLP1    ?     0     4-10-10-10-2      89.5%
MLP2    ?     ?     5-?-?-?-2         91.1%
MLP3    0.8   0     5-10-10-10-2      92.5%
MLP4    0.1   0     6-12-12-12-2      90.2%

(Entries marked "?" are illegible in the original document.)
Matlab Simulation using MLP

The Matlab program BlackjackANN.m is implemented using the MLPs defined in the previous section. 1000 games were simulated in this experiment. Both the dealer and the player use an MLP to make decisions on their turns. The winning percentages of the dealer and the player are as follows:

When the MLP plays as the player against a dealer following the 17-point rule, the player has a winning percentage of 56.5%. Out of 1000 games, the player won 565 times.
Strategy          Win %   Tie %
Player with MLP   56.5%   9%
When the MLP plays as the dealer (with the dealer not following the 17-point rule) against a player making random moves, the dealer achieves an outstanding winning percentage of 68.2%.
Strategy          Win %   Tie %
Dealer with MLP   68.2%   3%
When MLPs play for both the dealer and the player against each other, the percentage of tie games increases, as the two sides are likely using much the same strategies, based on the dealer's and the player's input lookup tables. It also depends on the number of entries in the misclassification tables. In this experiment, the player performed slightly better than the dealer, likely because the player's misclassification table identifies more misclassified patterns than the dealer's table does.
Strategy          Win %   Tie %
Player with MLP   54%     3%
Dealer with MLP   43%     3%
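The 1000-game simulation loop behind these tables can be sketched as follows. Here `play_one_game` is a simplified stand-in for the Matlab game logic: both sides draw to a fixed 17-point threshold instead of consulting real MLP outputs, and a double bust counts as a tie.

```python
import random

def play_one_game(rng):
    """One simplified Blackjack hand: each side draws until reaching 17,
    then totals are compared (a total over 21 busts)."""
    def draw_hand(stop_at):
        total = 0
        while total < stop_at:
            total += rng.randint(1, 10)   # card values 1..10 for simplicity
        return total
    player, dealer = draw_hand(17), draw_hand(17)
    player_ok, dealer_ok = player <= 21, dealer <= 21
    if player_ok and (not dealer_ok or player > dealer):
        return "player"
    if dealer_ok and (not player_ok or dealer > player):
        return "dealer"
    return "tie"   # equal totals, or both sides busted

rng = random.Random(42)
tally = {"player": 0, "dealer": 0, "tie": 0}
for _ in range(1000):
    tally[play_one_game(rng)] += 1
print({k: round(v / 1000, 3) for k, v in tally.items()})
```

Swapping the threshold rule for MLP-driven decisions on each draw would reproduce the experiment structure reported above.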
Finally, I had the opportunity to play against the system I developed. Playing against the dealer with the original dealer lookup table, I was able to win 12 games out of 20, with 2 tie games. I therefore let the dealer use a better table (the player's original table) as the input to its MLP networks and played against it again. This time, the dealer improved its performance by 1 game (and 1 tie game).
Strategy          Win %   Tie %
Human Player      60%     10%
Dealer with MLP   30%     10%
Using MLP networks for the dealer and the player has greatly increased the number of games won by both. Of course, the human still plays better than the MLP in the end, because the training set for the MLP is very small compared to all the possible states a Blackjack game can have. If the entries in both the dealer's and the player's tables increase (so that they cover most of the states in the game), I believe the MLP can certainly play better than a human being. Building such a large network would require additional numerical analysis techniques (for example, making the lookup table smaller while containing more information), so that the MLP can produce a reasonable response in each turn.
Human beings are adept at drawing on their own experience in games: the more games you play, the more familiar you become with the game through practice. This suggests an interesting learning experiment: let both the program and the human learn a game together at the same time, and observe which one actually learns faster. I therefore implemented a variant of Blackjack as an experiment: instead of 21 points as the default goal, the target points of the game can be defined by the user. The next section shows the details of this experiment.
Applying ANN to play user-defined Blackjack

In the normal Blackjack game, the goal of each game is to draw cards from a deck of 52 cards to a total value as close to 21 as possible. To make applying ANN to Blackjack more interesting, I created a version of Blackjack where the user can define his own target points, for example 30 points, so that the program and the human can learn how to play a new game at the same time.
To make the implementation easier, I still let both the dealer and the player have a maximum of 5 cards in hand. If both the dealer and the player reach 5 cards without busting, the total points determine the winner. Of course, a new set of input lookup tables is needed to train the dealer to play this new game. Since I am the only player, I did not train the player input lookup tables.
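The modified rules, an arbitrary target instead of 21 with the higher non-busted total winning, can be sketched like this (a minimal Python illustration, not the Matlab implementation):

```python
def winner(player_cards, dealer_cards, target=30):
    """Decide a user-defined Blackjack hand: busting means exceeding
    `target`; if neither side busts, the higher total wins."""
    p, d = sum(player_cards), sum(dealer_cards)
    p_ok, d_ok = p <= target, d <= target
    if p_ok and (not d_ok or p > d):
        return "player"
    if d_ok and (not p_ok or d > p):
        return "dealer"
    return "tie"   # equal totals, or both sides busted

print(winner([10, 9, 8], [10, 10, 5], target=30))   # player (27 vs 25)
print(winner([10, 9, 8], [10, 10, 5], target=21))   # tie (both exceed 21)
```

The same hands give different outcomes under different targets, which is exactly why the existing strategies stop working when the target changes.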
When the target points change, all the existing game rules and strategies change as well. Without knowing the new rules and strategies of the game in advance, does the MLP network learn better and faster than the human player?
The Matlab program BlackjackANNx.m implements the user-defined Blackjack game. 20 games were set up between the human and the dealer. To avoid cheating, I did not calculate the probabilities of my plays in advance. After the 20 games, the experiment results are as follows:
30-point target game   Win %   Tie %
Human Player           40%     5%
Dealer with MLP        55%     5%
As this was my first time playing 30-point Blackjack, I won only 8 games out of 20. The winning percentages of the MLP and the human are close: although the MLP has a slightly larger input table, that table covers only a slightly larger fraction of the possible states in this new game. The table above shows the results of my first session with the new 30-point game. As I played more, I started to beat the dealer more often. As the number of games played increases, the learning rate of the human is clearly much higher than that of the program.
On the other hand, the percentage of tie games decreases as the target points increase. There was only 1 tie game out of 20 because, as the number of states grows, the probability of the two players reaching the same state becomes smaller. As the target points increase, the number of states increases dramatically.
Moreover, it is impossible for the program to capture all possible states, as the number of lookup-table entries would grow dramatically. The winning probability of each state would converge only after millions of games and millions of entries in the lookup tables.
This can be shown in an experiment similar to the earlier ones. With this tremendous number of states in the game, much depends on how lucky the simulation is in the first 10000 games used to collect the input data for the lookup table, and on whether it happens to capture most of the common states. After simulating more games, with the lookup table as the input to the dealer's MLP, the MLP achieved at best a 65% classification rate in the new game.
To run the simulation one more time, I let the simulated player use the previously trained dealer's lookup table as the input to the player's MLP network, while the dealer used a 25-point rule in this new game. 1000 games were simulated, with the following results:
30-point target game        Win %   Tie %
Player with MLP             50.2%   2.0%
Dealer with 25-point rule   47.6%   2.0%
Even though the player uses a small lookup table as input to its MLP, it beats the dealer, who follows a fixed game rule, by 2.6%. The MLP network works best for highly random and dynamic games, where the game rules and strategies are hard to define and the game outputs are hard to predict exactly. Of course, performance also depends on how lucky the system is when it first collects its input data.
Conclusion

This project explored methods of applying Artificial Neural Networks to Blackjack for learning strategies without explicitly teaching the rules of the game; a reinforcement learning algorithm was used to learn the strategies. From the experiments above, it can be concluded that the reinforcement learning algorithm learns better strategies for playing Blackjack than fixed-rule strategies, as it dramatically improves the winning probability of each game. A reinforcement learning algorithm is able to suggest the best actions as the situation changes over time: it finds a mapping from different states to probability distributions over the various actions. Without knowing any rules of Blackjack, the ANN is able to perform well in playing the game, given a training data set large and accurate enough to cover most of the game states. The misclassification table is also useful in improving the classification rate by reducing the number of misclassifications. However, it requires some manual work to pick entries from the lookup table, recognize the misclassification patterns, and put them into the misclassification table.
The highest classification rate obtained from the MLP is 81%. Of course, the input training table can also be repeatedly retrained with correctly classified test data to improve the classification rate. When simulated over 1000 games, play with the ANN achieved an average winning percentage of 51%, a minimum of 39%, and a maximum of 65%.
However, as the number of games increases, the number of unencountered states (states not included in the lookup tables) also increases, and the average winning percentage drops to 42%. Continued learning is important for good results: the training input to the MLP should be updated after every 10000 games played, or, for example, the test data from each game can be fed back into the training input table to enlarge the training set. As the number of games increases, the game strategies will change over time; the larger the input training data, the better the performance will be.
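The feedback scheme suggested above, appending each won game's decisions to the training table as correctly classified examples, could be sketched like this (the table contents and the `retrain` stub are illustrative only):

```python
# Training table of (card state, action) pairs used as MLP training input.
training_set = [((10, 6), "hit"), ((10, 9), "stand")]

def retrain(data):
    """Stub standing in for re-running back-propagation on the table."""
    return len(data)

def after_game(game_records, won):
    """Feed the decisions from a won game back into the training table,
    treating them as correctly classified examples, then retrain."""
    if won:
        training_set.extend(game_records)
    return retrain(training_set)

n = after_game([((9, 8), "stand"), ((5, 4), "hit")], won=True)
print(n)   # 4
```

In practice retraining after every single game would be wasteful; batching the updates (for example, every 10000 games, as the text suggests) amortizes the cost.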
Since the cards are dealt from a 52-card deck, the current hand actually depends on the previous hands. Taking advantage of this memory when playing Blackjack makes the search space much smaller and hence increases the winning probability of each state, so including a card-counting algorithm in the program is another desirable feature for future work. Other possibilities include: adding more players to the game; training the ANN with a teacher to eliminate duplicate patterns (for example, 4 + 7 = 7 + 4 = 5 + 6 = …) and to identify misclassified patterns; training the ANN against different experts so that it can pick up various game strategies; storing game tricks and strategies in a separate table for the ANN to look up when different states are encountered; exploring other learning methods such as Q-learning; accumulating a larger training input data set over time to build a better ANN; and extending the ANN to play other similar games such as Poker.
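One of the future-work items above, eliminating duplicate patterns such as 4 + 7 = 7 + 4 = 5 + 6, amounts to keying each hand by a canonical form so that equivalent hands share a single lookup-table entry. A minimal sketch, assuming (as in this project) that only the hand total matters to the strategy:

```python
def canonical(hand):
    """Key a hand by its total, so that 4+7, 7+4 and 5+6 all
    collapse into a single lookup-table entry."""
    return sum(hand)

table = {}
for hand in [(4, 7), (7, 4), (5, 6)]:
    table.setdefault(canonical(hand), []).append(hand)

print(list(table.keys()))   # [11]
print(len(table[11]))       # 3
```

Collapsing equivalent hands this way shrinks the lookup table and pools their game statistics, so each entry's winning probability converges with far fewer simulated games.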
The MLP network works best for highly random and dynamic games, where the game rules and the strategies are hard to define and the game outputs are hard to predict exactly.
Appendix 1
Matlab Files

Blackjack.m: Allows the user to play Blackjack against the dealer.
BlackjackANN.m: Based on the MLP and input data, the ANN makes decisions to play against the dealer.
Genlookuptable.m: Simulates 10000 games between the dealer and a player who makes random moves. The winning probability of each state is updated after each game. The move patterns and winning probability values are stored in the dealer and player lookup tables, which are used as the input to the MLP.
BlackjackN.m: Called by Genlookuptable.m to play against the dealer with random moves and update the lookup tables.
Bp2.m, bpconfig2.m, Bpdisplay.m: These programs take the lookup tables as training data. Using the back-propagation algorithm, they output the weight matrix for the MLP and the classification rate of each setting.
Blackjackx.m: Regular Blackjack with user-defined target points, for example 30 points.
BlackjackANNx.m: The general form of Blackjack, which takes a user-defined target value (by default, 21). The learning algorithm is also applied in this program.
Cardhit.m: Updates the screen when the player or dealer hits a card.
Cardplot.m: Plots dealt cards on the user screen.
Checkmoney.m: Checks the user balance; if it is less than zero, quits.
Dealerwin.m: Updates the lookup tables if the dealer won.
findMLP.m: Separates the input lookup table into sub-tables for the 4 MLPs.
Scale.m, Rsample.m, Randomize.m, Cvgtest.m, Actfun.m, Actfunp.m: Utility Matlab functions from the class website.
Shuffle.m: Shuffles the cards before they are dealt to the player and dealer.
Youwin.m: Updates the lookup tables if the player won.
Appendix 2
References

De Wilde, Philippe (1995). Neural Network Models: An Analysis. Springer-Verlag.
Littman, M.L. (1994). "Markov games as a framework for multi-agent reinforcement learning". Proceedings of the Eleventh International Conference on Machine Learning, pp. 157–163. Morgan Kaufmann.
Sutton, R.S. and Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Widrow, B., Gupta, N. and Maitra, S. (1973). "Punish/Reward: Learning with a Critic in Adaptive Threshold Systems". IEEE Transactions on Systems, Man and Cybernetics, vol. 3, no. 5, pp. 455–465.
Haykin, Simon (1999). Neural Networks: A Comprehensive Foundation, 2nd edition. Prentice Hall, New Jersey.
Online Presentation and Matlab Files
The project report, online presentation and Matlab files are available for download at:
http://www.cae.wisc.edu/~sam/ece539