Application of reinforcement learning
to the game of Othello

Nees Jan van Eck, Michiel van Wezel

Soft Computing Laboratory







Computers & Operations Research,
no.
35, 2008

Introduction


Many decision making problems in real life


They depend not on an isolated decision but rather on a sequence of decisions


But traditional decision theory does not account for such sequences of decisions



Markov decision processes (MDPs)


A well-known class of sequential decision making problems


Optimal decision in a given state


Widespread application


Some algorithms that are guaranteed to find optimal policies


Dynamic programming methods


Weak points


Intractable when there are many possible states


Require exact knowledge of the environment


Reinforcement learning algorithms


Machine learning, operations research, control theory, psychology, neuroscience


Robotics, control in industrial manufacturing, combinatorial search



1

Introduction


Othello


Well-defined sequential decision making problem


Huge state space (approximately $10^{28}$ states)


Performance is easily measured


Experiments can run without any knowledge provided by humans




2

Reinforcement learning and sequential decision making problems


At each moment


Environment is in a certain state


The agent observes this state


The agent takes an action


The environment responds with a reward





The agent’s task is to learn to take optimal actions


Maximize the sum of immediate and future rewards


Sacrificing immediate rewards to obtain a greater cumulative reward


3
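
A minimal sketch of this observe-act-reward loop in Python (the `env` and `agent` objects and their methods are illustrative assumptions, not the paper's code):

```python
# Sketch of the agent-environment interaction loop described above.
# The Environment/Agent interfaces are assumptions for illustration.

def run_episode(env, agent):
    """Play one episode and return the total reward collected."""
    state = env.reset()                # environment starts in a certain state
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)      # agent observes the state and acts
        next_state, reward, done = env.step(action)  # environment responds
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
    return total_reward
```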

RL and sequential decision making problems



γ : Future discount factor


If γ = 0, only the immediate rewards are considered


As γ is set closer to 1, future rewards are given greater emphasis


r : Reward


State-value function: $V^{\pi}(s) = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \mid s_t = s\right]$


4

Q-learning


A reinforcement learning algorithm that learns the values of a function Q(s, a) in order to find an optimal policy

5
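
For reference, the standard one-step Q-learning update rule used to learn Q(s, a) is:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

where $\alpha$ is the learning rate and $\gamma$ the discount factor.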

Neural Network


6

Q-learning with neural network


7
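
A rough sketch of how the two combine: a small feedforward network estimates Q(s, ·), and each Q-learning step trains the network toward the TD target r + γ·max Q(s′, ·). This is a toy numpy implementation under my own naming, not the paper's code:

```python
import numpy as np

class QNetwork:
    """Toy one-hidden-layer MLP approximating Q(s, .) -- illustrative only."""
    def __init__(self, n_inputs, n_hidden, n_actions, lr=0.1):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.1, (n_inputs, n_hidden))
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_actions))
        self.lr = lr

    def predict(self, s):
        self.h = np.tanh(s @ self.W1)        # hidden activations (cached)
        return self.h @ self.W2              # one Q-value per action

    def update(self, s, a, target):
        q = self.predict(s)
        err = target - q[a]                  # TD error for the taken action
        # Gradient step on the squared error of output unit a only
        grad_W2 = np.outer(self.h, np.eye(self.W2.shape[1])[a]) * err
        grad_h = self.W2[:, a] * err
        grad_W1 = np.outer(s, (1 - self.h**2) * grad_h)
        self.W2 += self.lr * grad_W2
        self.W1 += self.lr * grad_W1

def q_learning_step(net, s, a, r, s_next, done, gamma=1.0):
    """One TD(0) update: target = r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * np.max(net.predict(s_next))
    net.update(s, a, target)
```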

Networks


Single and distinct networks

8
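
My reading of "single and distinct": one network with an output per move versus a separate network per move. The layer sizes below are placeholders, not the paper's settings; this reuses the hypothetical QNetwork from the previous sketch:

```python
import numpy as np

# (a) Single network: one MLP over the 64 board squares, one output per move.
single = QNetwork(n_inputs=64, n_hidden=50, n_actions=64)

# (b) Distinct networks: a separate small MLP per move, each with one output.
distinct = [QNetwork(n_inputs=64, n_hidden=50, n_actions=1) for _ in range(64)]

def q_values_single(board_vec):
    return single.predict(board_vec)          # all 64 Q-values at once

def q_values_distinct(board_vec):
    return np.array([net.predict(board_vec)[0] for net in distinct])
```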

Action selection


Trade-off between exploration and exploitation


Exploiting


Select the action with the highest estimated Q-value


Obtain high reward


Exploring


Improve its knowledge of the Q-function


Make better action selections in the future


Softmax function


Probability model


Uses the Boltzmann distribution


9
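
A sketch of softmax action selection with the Boltzmann distribution, P(a) ∝ exp(Q(s,a)/τ). The temperature τ tunes the exploration-exploitation trade-off; its value here is an assumption, not the paper's setting:

```python
import numpy as np

def softmax_action(q_values, tau=1.0):
    """Pick an action with Boltzmann probabilities P(a) = exp(Q/tau) / sum.

    High tau -> near-uniform exploration; tau -> 0 approaches greedy choice.
    """
    prefs = np.asarray(q_values) / tau
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(probs), p=probs)
```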

Othello


A two-player game


Zero-sum board game (competitive)


Fixed total reward


Perfect information


Imperfect information games: poker, RTS games


The state space size is approximately $10^{28}$


Its length is 60 moves at most


8-by-8 board using 64 two-sided discs


Initially the board is empty except for the central four squares



10

Othello


11

Strategies


Three phases


Opening game, middle game


The goal is to strategically position discs on the board


Discs in corners and on edges cannot be flipped


End game


Maximizing one’s own discs while minimizing the opponent’s discs




12

Positional player


Does not learn


Plays according to the positional strategy

13

Opponent = -1

Player = 1

Unoccupied = 0
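
With this encoding, the positional strategy reduces to a weighted sum over the board. A sketch with an illustrative weight table (corners high, corner-adjacent squares negative); these particular weights are an assumption, not necessarily the paper's:

```python
import numpy as np

# Square weights for the positional strategy: corners are most valuable,
# squares next to corners are dangerous. Illustrative values only.
WEIGHTS = np.array([
    [100, -25, 10, 5, 5, 10, -25, 100],
    [-25, -25,  2, 2, 2,  2, -25, -25],
    [ 10,   2,  5, 1, 1,  5,   2,  10],
    [  5,   2,  1, 2, 2,  1,   2,   5],
    [  5,   2,  1, 2, 2,  1,   2,   5],
    [ 10,   2,  5, 1, 1,  5,   2,  10],
    [-25, -25,  2, 2, 2,  2, -25, -25],
    [100, -25, 10, 5, 5, 10, -25, 100],
])

def positional_value(board):
    """board: 8x8 array with player = 1, opponent = -1, unoccupied = 0."""
    return float((WEIGHTS * board).sum())
```

The positional player then picks the legal move whose resulting board maximizes this value.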

Mobility player


Does not learn


Plays according to the mobility strategy


Mobility concept


Mobility = the number of legal moves available


Corner positions are of great importance






14


c_p : Number of corner squares occupied by the player


c_o : Number of corner squares occupied by the opponent


m_p : Player’s mobility


m_o : Opponent’s mobility


w : Weight parameter
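
Reading the legend above, a plausible form of the mobility evaluation is w · (corner difference) + (mobility difference). A sketch, where `legal_moves` and the value of `w` are assumptions rather than the paper's specifics:

```python
def mobility_value(board, player, legal_moves, w=10.0):
    """Mobility evaluation sketched from the slide's legend:
    w * (c_p - c_o) + (m_p - m_o). The weight w and the legal_moves()
    helper are assumptions, not taken from the paper."""
    corners = [board[0][0], board[0][7], board[7][0], board[7][7]]
    c_p = sum(1 for c in corners if c == player)
    c_o = sum(1 for c in corners if c == -player)
    m_p = len(legal_moves(board, player))
    m_o = len(legal_moves(board, -player))
    return w * (c_p - c_o) + (m_p - m_o)
```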

Q-learning player


Uses the Q-learning algorithm


The current state of the board is used as the state of the game


Reward is 0 until the end of the game


Upon completing the game


+1 for a win, -1 for a loss, and 0 for a draw


Aims to choose optimal actions leading to maximal reward


Learning rate is set to 0.1, discount factor is set to 1


Does not change during learning


Equal weight to immediate and future rewards


Only cares about winning, not about winning as fast as possible


Softmax action selection method





15
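
The reward scheme on this slide is simple to state in code (a hypothetical helper; the winner is decided by the final disc count):

```python
def game_reward(board, player, game_over):
    """Reward for the Q-learning player: 0 during play;
    +1 win / -1 loss / 0 draw at the end of the game."""
    if not game_over:
        return 0
    disc_diff = sum(row.count(player) - row.count(-player) for row in board)
    return 1 if disc_diff > 0 else (-1 if disc_diff < 0 else 0)
```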

Implementation of the Othello playing agents


16

Experiment & Result


15,000,000 games were played for training


Evaluated by playing 100 games against two benchmark players


Positional player, Mobility player


It is more difficult for the Q-learning player to play against the mobility player than against the positional player



17

Experiment & Result




18

Summary, conclusion and outlook


Summary


Reinforcement learning


Described Q-learning with a neural network


Othello has a huge state space


Applied Q-learning with a neural network to the game of Othello


Future research


Use of an adapted version of Q-learning


The minimax Q-learning algorithm described by Littman


Study the effects of the presentation of special board features


In order to simplify learning


Study potential applications of reinforcement learning


Operations research, management science


General MDP applications surveyed by White



19

E.N.D