Application of reinforcement learning
to the game of Othello
Nees Jan van Eck, Michiel van Wezel
Soft Computing Laboratory
장수형
Computers & Operations Research, no. 35, 2008
Introduction
•
Many decision making problems in real life
–
Depend not on an isolated decision but rather on a sequence of decisions
–
But traditional theory does not account for consciousness in WM
•
Markov decision processes(MDPs)
–
A well-known class of sequential decision making problems
–
Optimal decision in a given state
–
Widespread application
–
Some algorithms that are guaranteed to find optimal policies
•
Dynamic programming methods
•
Weak point
–
Many possible states
–
Require exact knowledge
–
Reinforcement learning algorithms
•
Machine learning, operations research, control theory, psychology, neuroscience
–
Applications ranging from robotics and control to industrial manufacturing and combinatorial search
Introduction
•
Othello
–
Well defined sequential decision making problem
–
Huge state space(approximately )
–
Easily measure performance
–
Experiments without the use of any knowledge provided by human experts
Reinforcement learning and sequential decision making problems
•
At each moment
–
Environment is in a certain state
–
The agent observes this state
–
The agent takes an action
–
The environment responds with a reward
•
The agent’s task is to learn to take optimal actions
–
Maximize the sum of immediate rewards and future rewards
–
This may require sacrificing immediate rewards to obtain a greater cumulative reward
RL and sequential decision making problems
•
γ: discount factor for future rewards (0 ≤ γ ≤ 1)
–
If γ = 0, only the immediate rewards are considered
–
As γ is set closer to 1, future rewards are given greater emphasis
r: reward
State-value function
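How the discount factor weights a sequence of rewards can be shown with a small sketch. The function name `discounted_return` and the reward sequence are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: no reward until a final reward of +1.
episode = [0.0, 0.0, 0.0, 1.0]
myopic = discounted_return(episode, gamma=0.0)      # only the first reward counts
farsighted = discounted_return(episode, gamma=1.0)  # all rewards count equally
halfway = discounted_return(episode, gamma=0.5)     # final reward weighted by 0.5**3
```

With a discount factor of 0 the delayed reward is invisible to the agent, which is why a game with rewards only at the end calls for a factor close to 1.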
Q-learning
•
A reinforcement learning algorithm that learns the values of a function Q(s, a) to find an optimal policy
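A minimal sketch of the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)], may help here. The dictionary-based table, the function name `q_update`, and the toy states are assumptions for illustration; the paper itself replaces the table with neural networks.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Value of the best action in the next state (0 if there are no next actions).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0
q[("s1", "left")] = 0.5
q[("s1", "right")] = 1.0
new_value = q_update(q, "s0", "go", reward=0.0, next_state="s1",
                     next_actions=["left", "right"], alpha=0.1, gamma=1.0)
```

Here the estimate for ("s0", "go") moves one tenth of the way from 0 towards the best next-state value of 1.0.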
Neural Network
Q-learning with neural network
Networks
•
A single network for all actions, or a distinct network for each action
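As a rough sketch of Q-learning with a network instead of a table, the class below uses a single network with one output per action. The layer sizes, the tanh hidden layer, the absence of bias terms, and the name `TinyQNetwork` are all assumptions for illustration and do not reproduce the architecture used in the paper.

```python
import math
import random


class TinyQNetwork:
    """One hidden tanh layer and one linear output per action (hypothetical sizes)."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_actions)]

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        q_values = [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]
        return hidden, q_values

    def train(self, x, action, target, lr):
        """One gradient step on 0.5 * (target - Q(x, action))**2."""
        hidden, q_values = self.forward(x)
        error = target - q_values[action]
        # Back-propagate through the output of the chosen action only.
        for j, h in enumerate(hidden):
            grad_hidden = error * self.w2[action][j] * (1.0 - h * h)
            self.w2[action][j] += lr * error * h
            for i, xi in enumerate(x):
                self.w1[j][i] += lr * grad_hidden * xi
        return error


net = TinyQNetwork(n_inputs=64, n_hidden=8, n_actions=64)
board = [0.0] * 64                      # flattened 8 by 8 board
board[27], board[36] = -1.0, -1.0       # opponent discs
board[28], board[35] = 1.0, 1.0         # player discs
_, before = net.forward(board)
for _ in range(200):
    # In Q-learning the target would be r + gamma * max Q(s', a'); 1.0 is a stand-in.
    net.train(board, action=19, target=1.0, lr=0.1)
_, after = net.forward(board)
```

Repeated updates pull the output for the chosen action towards the target, while the shared hidden layer lets the other outputs change as well.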
Action selection
•
Trade-off between exploration and exploitation
•
Exploiting
–
Select the action with the highest estimated Q-value
–
Obtain high reward
•
Exploring
–
Improve its knowledge of the Q-function
–
Make better action selections in the future
•
Softmax function
–
Probability model
–
Uses the Boltzmann distribution
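Softmax selection with a Boltzmann distribution can be sketched as follows: each action is chosen with probability proportional to exp(Q / T). The temperature values, the Q-value estimates, and the function names are assumptions for illustration.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Return P(a) = exp(Q(a)/T) / sum over b of exp(Q(b)/T)."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [0.2, 0.5, -0.1]  # hypothetical Q-value estimates
explore = boltzmann_probabilities(q_values, temperature=100.0)  # nearly uniform
exploit = boltzmann_probabilities(q_values, temperature=0.01)   # nearly greedy
chosen = select_action(q_values, temperature=1.0)
```

A high temperature favours exploration, and a low temperature favours exploitation of the action with the highest estimated Q-value.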
Othello
•
A two-player game
–
Zero-sum board game (competitive)
•
Fixed total reward
–
Perfect information
•
Imperfect-information games: poker, RTS games
–
The state space size is approximately
•
A game lasts at most 60 moves
–
8 by 8 board with 64 two-sided discs
–
Initially the board is empty except for the four central squares
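The board setup and the standard flipping rule can be sketched in a few functions. The 1 / -1 / 0 encoding, the row and column indexing, and all function names are assumptions for illustration.

```python
EMPTY, BLACK, WHITE = 0, 1, -1
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def initial_board():
    """8 by 8 board, empty except for the four central squares."""
    board = [[EMPTY] * 8 for _ in range(8)]
    board[3][3], board[4][4] = WHITE, WHITE
    board[3][4], board[4][3] = BLACK, BLACK
    return board


def flips(board, row, col, player):
    """Return the opponent discs that would be flipped by playing at (row, col)."""
    if board[row][col] != EMPTY:
        return []
    flipped = []
    for dr, dc in DIRECTIONS:
        line = []
        r, c = row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == -player:
            line.append((r, c))
            r, c = r + dr, c + dc
        # The run of opponent discs must be closed by one of the player's discs.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(line)
    return flipped


def legal_moves(board, player):
    """A move is legal if it flips at least one opponent disc."""
    return [(r, c) for r in range(8) for c in range(8) if flips(board, r, c, player)]


def play(board, row, col, player):
    """Return a new board after the player moves at (row, col)."""
    new_board = [line[:] for line in board]
    for r, c in flips(board, row, col, player) + [(row, col)]:
        new_board[r][c] = player
    return new_board


board = initial_board()
moves = legal_moves(board, BLACK)
after = play(board, 2, 3, BLACK)
```

From the initial position the first player has four legal moves, each flipping exactly one disc.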
Othello
Strategies
•
Three phases
–
Opening game, middle game
•
The goal is to position discs strategically on the board
•
Prefer discs that cannot be flipped
•
Corners and edges
–
End game
•
Maximizing one’s own discs while minimizing the opponent’s discs
Positional player
•
Does not learn
•
Plays according to the positional strategy
Opponent = -1
Player = 1
Unoccupied = 0
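A positional evaluation of this kind can be sketched as a weighted sum over squares, using the occupancy values 1, -1 and 0. The weights below (corners 100, other edge squares 10, interior squares 1) are hypothetical and are not the weight table used by the authors.

```python
def square_weight(row, col):
    """Hypothetical weight: corners high, other edge squares moderate, rest low."""
    on_row_edge = row in (0, 7)
    on_col_edge = col in (0, 7)
    if on_row_edge and on_col_edge:
        return 100
    if on_row_edge or on_col_edge:
        return 10
    return 1


def positional_value(board):
    """Sum of weight * occupancy, with occupancy 1 (player), -1 (opponent), 0 (empty)."""
    return sum(square_weight(r, c) * board[r][c]
               for r in range(8) for c in range(8))


board = [[0] * 8 for _ in range(8)]
board[0][0] = 1    # player holds a corner
board[0][3] = -1   # opponent holds an edge square
board[4][4] = -1   # opponent holds an interior square
value = positional_value(board)
```

Under these assumed weights the corner outweighs the opponent's two discs, so the position evaluates in the player's favour.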
Mobility player
•
Does not learn
•
Plays according to the mobility strategy
–
Mobility concept
•
Legal moves
–
Corner positions are of great importance
Terms of the evaluation function: number of corner squares occupied by the player, number of corner squares occupied by the opponent, the player’s mobility, the opponent’s mobility, and weight parameters
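One plausible way to combine these terms is a weighted sum of the corner difference and a normalized mobility difference. The exact form, the default weights, and the function name are assumptions; the evaluation function shown on the original slide is not recoverable here.

```python
def mobility_value(corners_player, corners_opponent,
                   mobility_player, mobility_opponent,
                   w_corner=10.0, w_mobility=1.0):
    """Assumed form: w_corner * corner difference + w_mobility * relative mobility."""
    corner_term = w_corner * (corners_player - corners_opponent)
    total_mobility = mobility_player + mobility_opponent
    # Guard against division by zero when neither side has a legal move.
    if total_mobility == 0:
        mobility_term = 0.0
    else:
        mobility_term = (w_mobility * (mobility_player - mobility_opponent)
                         / total_mobility)
    return corner_term + mobility_term


value = mobility_value(corners_player=1, corners_opponent=0,
                       mobility_player=6, mobility_opponent=2)
```

Mobility here means the number of legal moves available to a side, so the second term rewards positions that leave the opponent with few options.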
Q-learning player
•
Uses the Q-learning algorithm
•
The current state of the board
–
State of game
•
Reward is 0 until the end of the game
•
Upon completing the game
–
+1 for a win, -1 for a loss, and 0 for a draw
•
Aims to choose optimal actions leading to maximal reward
•
Learning rate is set to 0.1, discount factor is set to 1
–
Does not change during learning
–
Equal weight to immediate and future rewards
–
Cares only about winning, not about winning as fast as possible
–
Softmax action selection method
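The reward scheme can be sketched as a function of the final board, where the side with more discs wins. The 1 / -1 / 0 board encoding and the name `final_reward` are assumptions for illustration.

```python
def final_reward(board, player=1):
    """+1 for a win, -1 for a loss, 0 for a draw, from the player's point of view."""
    own = sum(row.count(player) for row in board)
    other = sum(row.count(-player) for row in board)
    if own > other:
        return 1
    if own < other:
        return -1
    return 0


# Hypothetical final positions: 40 discs against 24, and 32 against 32.
won = [[1] * 8 for _ in range(5)] + [[-1] * 8 for _ in range(3)]
drawn = [[1] * 8 for _ in range(4)] + [[-1] * 8 for _ in range(4)]
rewards = (final_reward(won), final_reward(won, player=-1), final_reward(drawn))
```

Because every earlier reward is 0 and the discount factor is 1, the return of a whole game equals this final reward regardless of how many moves the game took.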
Implementation of the Othello playing agents
Experiment & Result
•
15,000,000 games were played for training
•
Evaluated by playing 100 games against two benchmark players
–
Positional player, Mobility player
•
The Q-learning player finds it more difficult to play against the mobility player than against the positional player
Summary, conclusion and outlook
•
Summary
–
Reinforcement learning
–
Described Q-learning with neural networks
–
Othello has a huge state space
–
Applied Q-learning with neural networks to the game of Othello
•
Future research
–
Use of an adapted version of Q-learning
•
The minimax Q-learning described by Littman
–
Study the effects of the presentation of special board features
•
In order to simplify learning
–
Study potential applications of reinforcement learning
•
Operations research, management science
•
General MDP applications described by White
E.N.D