ReinforcementLearning_part2x


Understanding AlphaGo
Go Overview
• Originated in ancient China ~2,500 years ago
• Two-player game
• Goal: surround more territory than the opponent
• Played on a 19×19 grid board
• Playing pieces are called "stones"
• A turn = place a stone or pass
• The game ends when both players pass
Go Overview
Only two basic rules:
1. Capture rule: stones that have no liberties are captured and removed from the board.
2. Ko rule: a player is not allowed to make a move that returns the game to the previous position.
Go in a Reinforcement Learning Set-Up
• Environment states: S = the set of board configurations
• Actions: A = the legal moves (place a stone or pass)
• Transitions between states: determined by the moves of both players
• Reinforcement function:
  r(s) = 0 if s is not a terminal state, ±1 otherwise (+1 for a win, −1 for a loss)
Goal: find a policy that maximizes the expected total payoff.
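A minimal sketch of this set-up in Python (the board encoding, terminal flag, and win flag below are illustrative assumptions, not AlphaGo's actual interfaces):

```python
from typing import List, Tuple

# Hypothetical encodings for illustration only.
State = List[List[int]]    # 19x19 board: 0 = empty, 1 = black stone, -1 = white stone
Action = Tuple[int, int]   # (row, col) of the stone to place; passing handled separately

def reward(state: State, is_terminal: bool, won: bool) -> float:
    """r(s) = 0 for every non-terminal state; +1 / -1 only when the game ends."""
    if not is_terminal:
        return 0.0
    return 1.0 if won else -1.0
```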
Why is it hard for computers to play Go?
• The number of possible games is extremely high (~10^700)
• Brute-force exhaustive search is impossible
• Chess: b ≈ 35, d ≈ 80; Go: b ≈ 250, d ≈ 150
• Main challenges:
  • The branching factor is huge
  • Positions are hard to evaluate (no simple value function)
Training the Deep Neural Networks
The training pipeline:
• Human expert games, (state, action) pairs → P_σ (SL policy network) and P_π (rollout policy network)
• P_σ initializes P_ρ (RL policy network), which is then improved by self-play
• Self-play games provide (state, win/loss) pairs → V_θ (value network)
• All of these networks are combined inside Monte Carlo Tree Search
Training the Deep Neural Networks
[Diagram: the two policy networks and the value network architectures]
SL Policy Network: P_σ
• Trained on ~30 million (state, action) pairs from human expert games
• Goal: maximize the log likelihood of the expert's action
• Input: 19×19×48 (48 feature planes)
• Architecture: 12 convolutional + rectifier layers, followed by a softmax
• Output: a probability map over actions
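A rough PyTorch sketch of such a policy network and its supervised training step (filter counts and layer details are placeholders, not the exact AlphaGo configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SLPolicyNet(nn.Module):
    """Sketch of a convolutional policy network: 19x19x48 input -> one logit per board point."""
    def __init__(self, planes=48, filters=192, n_layers=12):
        super().__init__()
        convs = [nn.Conv2d(planes, filters, kernel_size=5, padding=2)]
        convs += [nn.Conv2d(filters, filters, kernel_size=3, padding=1) for _ in range(n_layers - 2)]
        self.convs = nn.ModuleList(convs)
        self.head = nn.Conv2d(filters, 1, kernel_size=1)   # final 1x1 conv -> move logits

    def forward(self, x):                                  # x: (batch, 48, 19, 19)
        for conv in self.convs:
            x = F.relu(conv(x))
        return self.head(x).flatten(1)                     # (batch, 361); softmax taken in the loss

def sl_step(net, optimizer, states, expert_moves):
    """Maximizing the log likelihood of the expert move == minimizing cross-entropy."""
    logits = net(states)
    loss = F.cross_entropy(logits, expert_moves)           # expert_moves: (batch,) indices in [0, 361)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```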
SL Policy Network: P_σ
Bigger networks → better accuracy but slower.
Move-prediction accuracy:
• AlphaGo (all input features): 57.0%
• AlphaGo (only raw board position): 55.7%
• Previous state of the art: 44.4%
Training the Rollout Policy Network P_π
• Similar in spirit to the SL policy P_σ
• Output: a probability map over actions
• Goal: maximize the log likelihood
• Input: not the full grid, but handcrafted local features
Comparison with the SL policy network:
• SL policy net P_σ (12 convolutional + rectifier layers + softmax): forwarding time ~3 milliseconds, accuracy 55.4%
• Rollout policy network P_π: forwarding time ~2 microseconds, accuracy 24.2%
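A small sketch of a rollout-style policy as a linear softmax over handcrafted per-move features (the feature extraction itself is game-specific and assumed here):

```python
import numpy as np

def rollout_policy(move_features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Linear softmax over handcrafted features; move_features is (num_legal_moves, num_features)."""
    scores = move_features @ weights
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def sample_rollout_move(move_features, weights, rng=None):
    """Sample a move; cheap enough to be called thousands of times per second in rollouts."""
    rng = rng or np.random.default_rng()
    probs = rollout_policy(move_features, weights)
    return int(rng.choice(len(probs), p=probs))
```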
Training the RL Policy Network P_ρ
• A refined version of the SL policy P_σ, with the same architecture: 19×19×48 input, 12 convolutional + rectifier layers, softmax over a probability map
• Weights are initialized to ρ = σ
• Trained by self-play: P_ρ plays against P_ρ⁻, where ρ⁻ is sampled from {ρ⁻ | ρ⁻ is an old version of ρ}
• Playing against old versions prevents overfitting to the current policy
• Weights are updated by stochastic gradient ascent (SGA) on the expected outcome
• The RL policy won more than 80% of its games against the SL policy
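A minimal REINFORCE-style sketch of the self-play update, assuming a policy network like the one sketched earlier; it is not the exact AlphaGo training code, but it shows the gradient-ascent-on-outcome idea:

```python
import torch
import torch.nn.functional as F

def rl_selfplay_update(policy_net, optimizer, states, actions, outcomes):
    """One policy-gradient (REINFORCE-style) update from self-play positions.

    states:   (T, 48, 19, 19) positions from games of the current policy vs. an older snapshot
    actions:  (T,) int64 indices of the moves that were actually played
    outcomes: (T,) float, +1 if the player to move went on to win that game, -1 otherwise
    """
    log_probs = F.log_softmax(policy_net(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(outcomes * chosen).mean()   # minimizing this is gradient ascent on expected outcome
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```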
Training the Value Network V_θ
• Position evaluation: approximates the optimal value function
• Input: a state (19×19×48 planes); output: a scalar, the probability of winning from that position
• Architecture: convolutional + rectifier layers followed by a fully connected layer producing a scalar
• Goal: minimize the MSE between the prediction and the game outcome
• Overfitting risk: positions within a single game are strongly correlated, so training samples are drawn from distinct self-play games
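A rough PyTorch sketch of such a value head and its MSE training step (layer sizes are placeholders, not the exact AlphaGo architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNet(nn.Module):
    """Sketch: small convolutional trunk + fully connected head -> scalar value in [-1, 1]."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        self.conv1 = nn.Conv2d(planes, filters, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(filters, filters, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(filters * 19 * 19, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, x):                                  # x: (batch, 48, 19, 19)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc1(x.flatten(1)))
        return torch.tanh(self.fc2(x)).squeeze(1)          # predicted outcome for each position

def value_step(net, optimizer, states, outcomes):
    """Minimize the MSE between the predicted value and the game outcome z (+1 / -1).
    Each (state, outcome) pair should come from a different game to limit overfitting."""
    loss = F.mse_loss(net(states), outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```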
Monte Carlo Tree Search
• Monte Carlo experiments: repeated random sampling used to obtain numerical results
• MCTS is a search method for making (approximately) optimal decisions in AI problems
• The strongest Go AIs before AlphaGo (Fuego, Pachi, Zen, and Crazy Stone) all relied on MCTS
Monte Carlo Tree Search
Each round of Monte Carlo tree search consists of four steps
1. Selection
2. Expansion
3. Simulation
4. Backpropagation
MCTS – Upper Confidence Bounds for Trees (UCT)
• Balances the exploration–exploitation tradeoff
• Kocsis, L. & Szepesvári, C., "Bandit based Monte-Carlo planning" (2006)
• Converges to the optimal solution
Selection picks the child i that maximizes
  UCT(i) = w_i / n_i  (exploitation)  +  C · √(ln t / n_i)  (exploration)
where
• w_i – number of wins after visiting node i
• n_i – number of times node i has been visited
• C – exploration parameter
• t – number of times the parent of node i has been visited
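A compact Python sketch of one MCTS round (selection, expansion, simulation, backpropagation) using the UCT rule above. The game-specific callbacks `legal_moves`, `apply_move`, `rollout`, and `is_terminal` are assumed placeholders, and perspective switching between the two players is omitted for brevity:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}              # move -> Node
        self.wins, self.visits = 0.0, 0

def uct(node, c=1.4):
    """UCT score of a child node: exploitation term + exploration bonus."""
    if node.visits == 0:
        return float("inf")
    return node.wins / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts_round(root, legal_moves, apply_move, rollout, is_terminal):
    """One round of MCTS with the four steps from the slides (game callbacks are assumptions)."""
    # 1. Selection: descend through fully expanded nodes using UCT.
    node = root
    while node.children and len(node.children) == len(legal_moves(node.state)):
        node = max(node.children.values(), key=uct)
    # 2. Expansion: add one untried child (unless the state is terminal).
    untried = [m for m in legal_moves(node.state) if m not in node.children]
    if untried and not is_terminal(node.state):
        move = random.choice(untried)
        child = Node(apply_move(node.state, move), parent=node)
        node.children[move] = child
        node = child
    # 3. Simulation: play a random rollout to a terminal state.
    result = rollout(node.state)        # e.g. 1.0 for a win, 0.0 for a loss (root's point of view)
    # 4. Backpropagation: update win/visit statistics up to the root.
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent
```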
AlphaGo MCTS
The search repeats four phases: selection, expansion, evaluation, and backpropagation.
• Each edge (s, a) of the search tree stores:
  • Q(s, a) – action value (average value of the subtree below the edge)
  • N(s, a) – visit count
  • P(s, a) – prior probability (from the SL policy network)
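A sketch of those per-edge statistics and a PUCT-style selection rule that prefers high-prior, rarely visited edges; the constant `c_puct` and the exact bonus form are assumptions of this sketch, not taken from the slides:

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics stored on one edge (s, a) of the search tree."""
    prior: float            # P(s, a): prior probability from the policy network
    visits: int = 0         # N(s, a): visit count
    value_sum: float = 0.0  # running sum of leaf evaluations below this edge

    @property
    def q(self) -> float:   # Q(s, a): average value of the subtree
        return self.value_sum / self.visits if self.visits else 0.0

def select_action(edges: dict, c_puct: float = 5.0):
    """Pick argmax over Q(s, a) + u(s, a); the exploration bonus u grows with the
    prior probability and shrinks as the edge accumulates visits."""
    total_visits = sum(e.visits for e in edges.values())
    def score(item):
        _, e = item
        u = c_puct * e.prior * math.sqrt(total_visits) / (1 + e.visits)
        return e.q + u
    return max(edges.items(), key=score)[0]
```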
AlphaGo MCTS – Evaluation
Leaf evaluation combines two signals:
1. The value network's prediction for the leaf position
2. The outcome of a fast random rollout played from the leaf until a terminal state
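A one-function sketch of that combination; `value_net` and `rollout_to_end` are assumed callables, and the equal 0.5/0.5 weighting is the mixing parameter reported for AlphaGo, not something stated on the slide:

```python
def evaluate_leaf(leaf_state, value_net, rollout_to_end, mix_lambda: float = 0.5):
    """Leaf evaluation: blend the value network's estimate with the result of a
    fast rollout played to the end of the game (mix_lambda = 0.5 weights them equally)."""
    v = value_net(leaf_state)        # 1. value network prediction for the leaf
    z = rollout_to_end(leaf_state)   # 2. +1 / -1 outcome of a random rollout to a terminal state
    return (1 - mix_lambda) * v + mix_lambda * z
```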
AlphaGo MCTS – Choosing the next move
• After the search, the move with the maximum visit count at the root is played
• Visit counts are less sensitive to outliers than the maximum action value
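In code, using the per-edge statistics sketched above (`root_edges` maps each candidate move to its Edge), this final choice is just an argmax over visit counts:

```python
def choose_move(root_edges: dict):
    """Play the most-visited move at the root rather than the one with the best Q value."""
    return max(root_edges.items(), key=lambda item: item[1].visits)[0]
```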
AlphaGo vs. Human Experts
• 5:0 against Fan Hui (European champion, October 2015)
• 4:1 against Lee Sedol (9-dan professional, March 2016)
Take Home
• A modular system: separately trained networks combined with Monte Carlo Tree Search
• Combines reinforcement learning and deep learning
• A fairly generic approach that is not limited to Go