Backgammon
Backgammon is an ancient game of skill with a chance element: players throw
two dice to determine the moves that are possible, then choose among them.
Each die roll permits the player to move one man that many points. Both rolls
must be used if possible. The same man may be moved twice. If both dice show
the same number, the player makes four moves of that number.
[Board diagram: White’s and Black’s inner tables, each side’s direction of
travel; at most 15 men may occupy any one point.]
A man may not land on a block - a point
occupied by two or more enemy men.
If a man lands on a point occupied by a single enemy man - a blot - that
blot is hit and removed. Removed men must start again, entering from an
imaginary off-board point at the player’s start point; no other move may be
made while a player has men off the board.
Players race to (a) get all their men into their inner table, then (b) bear
them off, moving them to an imaginary point beyond their end point.
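The landing rule is easy to capture in code. A minimal Python sketch
(representing a target point simply by the number of enemy men on it, an
assumption made here for illustration):

    def landing(enemy_men_on_point):
        """What happens when a man tries to land, given how many enemy
        men already occupy the target point."""
        if enemy_men_on_point >= 2:
            return "blocked"   # a block: the move is not allowed
        if enemy_men_on_point == 1:
            return "hit"       # a blot: legal, and the blot is removed
        return "open"          # empty (or own men): legal

    for n in (0, 1, 2, 5):
        print(n, "->", landing(n))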
Gambling, Sizes of win
Backgammon is played for stakes (matchsticks? money?) and for championships.
If a player wins before his opponent has borne off any men, he wins double (a “gammon”).
If in addition the opponent still has men in his first quadrant, he wins triple (a “backgammon”).
There is a “doubling cube”, bearing the numbers 2, 4, 8, 16, 32, 64.
• A player may “double” - offer to continue the game for twice the stake.
• If the opponent refuses, he resigns the game and loses the current stake.
• If he accepts, only he may offer the next double.
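The take-or-drop arithmetic can be sketched in a few lines of Python (a
standard simplification, not from the slides: it ignores gammons and the
value of holding the cube):

    def accept_double(p_win, stake=1):
        """Dropping loses the current stake for certain (-stake).
        Taking plays on for twice the stake, with expected value
        2*stake*p_win - 2*stake*(1 - p_win) = (4*p_win - 2)*stake.
        Taking is better whenever that is at least -stake,
        i.e. whenever p_win >= 0.25."""
        return (4 * p_win - 2) * stake >= -stake

    print(accept_double(0.30))   # True: take the double
    print(accept_double(0.20))   # False: drop and pay the single stake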
Search complexity of backgammon
On each turn, the two dice can fall in 21 distinct ways (15 mixed pairs plus 6 doubles).
Typically there are around 20 moves possible for any result of the dice.
Therefore the branching factor is approx 400.
It is estimated there are around 10^20 possible arrangements of the men.
• fewer than chess or Go;
• comparable with checkers or Othello;
• still much too large for exhaustive analysis.
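These counts are easy to verify; an illustrative Python snippet:

    from itertools import combinations_with_replacement

    # Distinct rolls of two dice: 6 doubles + 15 mixed pairs = 21.
    rolls = list(combinations_with_replacement(range(1, 7), 2))
    print(len(rolls))            # 21

    # A mixed pair such as (3, 5) occurs 2 ways in 36; a double only 1 way.
    prob = {r: (1 if r[0] == r[1] else 2) / 36 for r in rolls}
    print(sum(prob.values()))    # 1.0

    # With roughly 20 legal moves per roll, branching factor is about 400.
    print(len(rolls) * 20)       # 420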
Berliner’s BKG program
In 1979 Berliner’s BKG9.8 won a 7-game exhibition match against Luigi Villa,
the then World Champion. Caveats:
• 7 games is too short a series to be conclusive, since luck plays a part
• Villa had reason not to take the comic-robot program very seriously
• It was an exhibition match, not a championship
BKG was a pure no-search Type-B program: it evaluated moves at 1 ply.
Its hand-crafted evaluation method was called SNAC:
• Smoothness (avoiding a generalised form of the Horizon Effect)
• Non-linearity (the importance of features is inter-dependent, not just additive)
• Application Coefficients (the program computed higher-level features; the
evaluation function was not expressed directly in terms of the positions of men)
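To make the three ideas concrete, a hypothetical Python sketch (the feature
names, weights, and blending rule are invented for illustration; Berliner's
actual function was far more elaborate):

    def snac_style_eval(features):
        """SNAC-flavoured evaluation over higher-level features, each
        assumed precomputed from the raw position."""
        # Application coefficient: how "race-like" the position is, 0..1.
        a = features["raceness"]
        # Non-linear interaction: blockades and exposure matter only
        # while contact remains.
        contact = 2.0 * features["blockade"] - 3.0 * features["hit_prob"]
        race = 1.0 * features["pip_lead"]
        # Smooth blend between regimes, so the evaluation changes gradually
        # as the game's character changes (no horizon-style jumps).
        return (1 - a) * contact + a * race

    print(snac_style_eval(
        {"raceness": 0.7, "blockade": 0.6, "hit_prob": 0.2, "pip_lead": 0.4}))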
Tesauro’s TD-GAMMON
TD-Gammon was (primarily) a connectionist system.
It learnt move decisions by self-play, but had a hand-crafted doubling algorithm.
Its two main features were
1. Use of the “Method of Temporal Differences” to solve the (temporal) credit
assignment problem
2. Use of “Rollout analysis” to generate reliable statistics
TD-Gammon did (in the mid-1990s) play the then world champion, and other
world-class players, in substantial numbers of games, and showed near parity
with them.
MLP (Multi-Layer Perceptron)
[Diagram: a multi-layer perceptron with an input layer, one hidden layer,
and an output layer; all inputs connect to all hidden units, and all hidden
units connect to all outputs. Example input units encode the raw position,
e.g. w@3:1, w@3:2, w@3:3, w@3:4+ for one, two, three, or four-or-more White
men on point 3. The four outputs estimate the chances of win x2, win x1,
lose x1, lose x2.]
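A minimal NumPy sketch of such a network (the layer sizes follow Tesauro's
published description of TD-Gammon 1.0, about 198 inputs and 40 hidden units;
the weights and input here are random placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    N_IN, N_HID, N_OUT = 198, 40, 4   # inputs, hidden units, outcomes

    W1 = rng.normal(0, 0.1, (N_HID, N_IN))    # all inputs -> all hidden
    W2 = rng.normal(0, 0.1, (N_OUT, N_HID))   # all hidden -> all outputs

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x):
        """One pass: position encoding in, outcome estimates out."""
        hidden = sigmoid(W1 @ x)
        return sigmoid(W2 @ hidden)

    x = rng.random(N_IN)   # stand-in for an encoded board position
    print(forward(x))      # estimates for win x2, win x1, lose x1, lose x2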
Temporal Difference learning
TD learning applies when a series of choices leads to an eventual outcome.
As in reinforcement learning, over many trials,
• choices that lead to good outcomes should be preferred
• choices that lead to bad outcomes should be deprecated
The essence of TD learning is that
the valuation of a situation before a choice is made
should become more similar to
the valuation of the situation after the choice is made
Situations just before a win (or loss) should be assessed virtually as if they
themselves were a win (loss) - and so on back through all prior choices.
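A tabular Python sketch of the core update (TD-Gammon applied the same idea
to its network weights via TD(lambda); the tabular form is just for clarity):

    def td_update(V, s, s_next, reward, alpha=0.1):
        """TD(0): nudge the value of the earlier state toward the reward
        plus the value of the state that followed it."""
        target = reward + V.get(s_next, 0.0)
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))

    # Toy episode: "start" precedes "mid", which precedes a win (reward 1).
    V = {}
    td_update(V, "mid", "won", 1.0)     # states just before a win look like wins
    td_update(V, "start", "mid", 0.0)   # ...and the credit propagates backwards
    print(V)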
Learning from self play vs Tuition
Human experts are good, but fallible.
• They get tired, make mistakes, cannot communicate everything.
• Community opinions change over time
– Indeed, TD-Gammon rollouts have helped change expert opinion!
Very large numbers of games can be self-played by machine.
• After 200k training games, TD-Gammon showed “feature discovery”
• Successive versions with more training showed continuing improvement
vis-à-vis top-class players
• A danger of self-play: the random strategy at the start may be very bad
Early learning
A danger in learning by self play is that the random strategy at the beginning of
learning may be very bad.
• Games may last thousands of moves, rather than the typical 50-60
• Nothing of value might then be learnt.
TD-Gammon however did learn some elementary tactics and strategy from its
early games. The trend towards the end of the game, forced by the generally
forward movement of men, meant that even random games were brief.
By scoring every possible move and choosing the one with the highest expected
outcome, it learnt some basic concepts early on:
• hit the opponent; play safe; build blocks
Later learning could build on this to discover interesting evaluation features
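The selection rule itself is tiny; a sketch where apply_move and value stand
in for the game engine and the learner's current network (both names are
illustrative):

    def choose_move(position, legal_moves, apply_move, value):
        """Greedy 1-ply choice: score the position reached by each legal
        move and keep the one with the highest expected outcome."""
        return max(legal_moves, key=lambda m: value(apply_move(position, m)))

    # Toy usage with stand-in functions, for illustration only.
    best = choose_move(0, [-1, 2, 5],
                       apply_move=lambda p, m: p + m,
                       value=lambda p: p)
    print(best)   # 5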
Judgment and Calculation
Experts commenting on TD-Gammon’s play noted a contrast with chess programs:
• Chess programs excel at laborious calculation but show poor judgment
• TD-Gammon showed good positional judgment
Pure TD-learning from experience, using only a raw encoding of positions, enabled
TD-Gammon to match the performance of Tesauro’s earlier NeuroGammon.
Adding further inputs - giving access to precomputed tables of “strength of
blockade”, precomputed “probability of being hit” - brought TD-Gammon to
near parity with world championship level.
Rollout Analysis
TD-Gammon also performs the now widely trusted operation of “rollout analysis”:
when commentators and analysts review championship games, they sometimes
disagree about the merits of positions resulting from alternative possible moves.
A “rollout” can be used to determine which is better:
• from each resulting position, generate and play out 10,000 random dice
sequences, then pick the move that gives the best result
Even though the absolute error in the original scoring might be large, the relative
error would not be - faults in the evaluations of related positions are highly
correlated.
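In outline (a sketch; play_out stands in for playing one complete game with
random dice under the program's own move policy):

    import random

    def rollout_value(position, play_out, n=10_000, seed=0):
        """Average outcome of n play-outs from one candidate position."""
        rng = random.Random(seed)
        return sum(play_out(position, rng) for _ in range(n)) / n

    # Toy stand-in: a "position" is just its true win probability, and a
    # play-out is one noisy sample of it (win = 1, loss = 0).
    play_out = lambda p, rng: 1.0 if rng.random() < p else 0.0
    candidates = [0.52, 0.55, 0.48]   # positions after three candidate moves
    print(max(candidates, key=lambda p: rollout_value(p, play_out)))  # 0.55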
Rare success for TD learning
TD learning was very successfully applied in TD-Gammon; but it has not proved
so successful in other domains. Possible reasons for its success at backgammon:
• because of the nature of the rules of backgammon, even a random strategy
gives sequences brief enough to be learned from. Go could go on for ages.
• the randomness of the dice causes noise, meaning that the same strategy will
on different occasions explore different parts of the search space. The learner
will not play deterministically and get stuck on a poor strategy. (Noise
Injection has been found helpful in other applications)
• there are simple “linear” concepts that can be learned early (blots are bad,
blocks are good, bear off asap), and later learning can bootstrap from these.
References
Berliner, H. 1980. Performance Note: Backgammon computer program beats
world champion. Artificial Intelligence, 14, 205-220.
Tesauro, G. 1995. Temporal Difference Learning and TD-Gammon. Communications
of the ACM, 38(3), 58-68.