Planning Under Uncertainty

573 Core Topics
[Course topic map: Knowledge Representation; Search / Problem Spaces; Inference (Logic-Based, Probabilistic); Supervised Learning; Reinforcement Learning; Planning; Agency]
Administrivia
• Reading for today’s class: ch. 17 through 17.3
• Reading for Thursday: ch. 21 (reinforcement learning)
• Problem Set
  Extension until Monday 11/10, 8am
  One additional problem
• Tues 11/11 – no class
• Thurs 11/13 – midterm
  In class
  Closed book
  May bring 1 sheet of 8.5x11” paper
Semantics
• Syntax: a description of the legal arrangements of symbols (def. “sentences”)
• Semantics: what the arrangement of symbols means in the world
[Diagram relating Sentences, Models, and the World: semantics maps sentences to models; representation and inference operate over sentences.]
Propositional Logic: SEMANTICS
• “Interpretation” (or “possible world”): an assignment of either T or F to each variable
  Think “function mapping from each P to T (or F)”
• Connectives: specifically, TRUTH TABLES – an assignment of T or F to each connective, given the truth values of its arguments
Truth table for P ∧ Q (rows: P, columns: Q):
        Q=T  Q=F
  P=T    T    F
  P=F    F    F
• Does PQ |= PQ ?
First Order Logic
• Syntax more complex
• Semantics more complex
Specifically, the mappings are more complex
(And the range of the mappings is more complex)
Models
• Depiction of one possible “real-world” model
Interpretations = Mappings
syntactic tokens → model elements
Depiction of one possible interpretation, assuming
Constants:
Richard John
Functions:
Leg(p,l)
Relations:
On(x,y) King(p)
Interpretations = Mappings
syntactic tokens → model elements
Another interpretation, same assumptions
Constants:
Richard John
Functions:
Leg(p,l)
Relations:
On(x,y) King(p)
Satisfiability, Validity, & Entailment
• S is valid if it is true in all interpretations
• S is satisfiable if it is true in some interpretation
• S is unsatisfiable if it is false in all interpretations
• S1 |= S2 (S1 entails S2) if, in every interpretation where S1 is true, S2 is also true
Previously, in 573…
Bayes
rules!
P(H | E) = P(E | H) P(H) / P(E)
Simple proof from def of conditional probability
Joint Distribution
• All you need to know: can answer any question
  Inference by enumeration
• But… exponential in both time & space
• Solution: exploit conditional independence
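
A minimal sketch (not from the slides) of inference by enumeration over an explicit joint distribution. The variable names and numbers are made up; the point is that the table has 2^n entries, which is the exponential blow-up the slide refers to.

```python
from itertools import product

# Toy joint over three Boolean variables (Burglary, Earthquake, Alarm), stored explicitly.
P_B, P_E = 0.05, 0.1
def p_alarm(b, e):           # made-up conditional probabilities
    return {(True, True): 0.9, (True, False): 0.8,
            (False, True): 0.2, (False, False): 0.01}[(b, e)]

joint = {}
for b, e, a in product([True, False], repeat=3):
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= p_alarm(b, e) if a else 1 - p_alarm(b, e)
    joint[(b, e, a)] = p            # 2^3 entries; 2^n in general

def query(target_index, evidence):
    """P(variable target_index = True | evidence), evidence = {index: value}."""
    def consistent(world):
        return all(world[i] == v for i, v in evidence.items())
    num = sum(p for w, p in joint.items() if consistent(w) and w[target_index])
    den = sum(p for w, p in joint.items() if consistent(w))
    return num / den

# P(Burglary=True | Alarm=True), by enumerating every entry of the joint table
print(query(0, {2: True}))
```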
Sample Bayes Net
[Figure: Bayes net with Burglary and Earthquake as parents of Alarm; Earthquake is also a parent of Radio; Alarm is the parent of Nbr1Calls and Nbr2Calls.
CPTs shown: Pr(B=t) = 0.05, Pr(B=f) = 0.95; Pr(A | E, B) for the four (E, B) combinations: 0.9 (0.1), 0.2 (0.8), 0.85 (0.15), 0.01 (0.99).]
Given Markov Blanket, X is
Independent of All Other Nodes
[Figure: example network including Earthquake and Alarm Sounded nodes]
MB(X) = Par(X) ∪ Children(X) ∪ Par(Children(X))
Inference in BNs
• We generally want to compute
  Pr(X), or
  Pr(X|E) where E is (conjunctive) evidence
• The graphical independence representation yields efficient inference, organized by network shape
• Exact algorithms: variable elimination (VE), junction trees
• Approximate: Markov chain sampling
Learning
• Parameter Estimation:
Maximum Likelihood (ML)
Maximum A Posteriori (MAP)
Bayesian
• Learning Parameters for a Bayesian Network
• Learning Structure of Bayesian Networks
• Naïve Bayes Models
• Hidden Variables (later)
Parameter Estimation Summary
  Estimate                        Prior     Hypothesis
  Maximum Likelihood Estimate     Uniform   The most likely
  Maximum A Posteriori Estimate   Any       The most likely
  Bayesian Estimate               Any       Weighted combination
Parameter Estimation and Bayesian Networks
Data (columns E, B, R, A, J, M; one row per example):
  E=T  B=F  R=T  A=T  J=F  M=T
  E=F  B=F  R=F  A=F  J=F  M=T
  E=F  B=T  R=F  A=T  J=T  M=T
  E=F  B=F  R=F  A=T  J=T  M=T
  E=F  B=T  R=F  A=F  J=F  M=F
Estimate the CPT entries: P(A|E,B) = ?   P(A|E,¬B) = ?   P(A|¬E,B) = ?   P(A|¬E,¬B) = ?
Prior: e.g. for P(A|¬E,B), Beta(2,3) + data (one row with A=T, one with A=F) = Beta(3,4)
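
A small sketch (not from the slides) of the three estimates from the summary table, applied to a single CPT entry such as P(A | ¬E, B) with the Beta(2,3) prior used above. Function and parameter names are illustrative.

```python
# Estimating a single Bernoulli parameter theta = P(A=true | parent values)
# from counts of matching data rows, under a Beta(alpha, beta) prior.

def estimates(n_true, n_false, alpha=2, beta=3):
    # Maximum likelihood: ignores the prior entirely
    ml = n_true / (n_true + n_false)
    # Posterior is Beta(alpha + n_true, beta + n_false)
    a, b = alpha + n_true, beta + n_false
    # MAP: mode of the posterior (defined for a, b > 1)
    map_est = (a - 1) / (a + b - 2)
    # Bayesian: posterior mean (a weighted combination over hypotheses)
    bayes = a / (a + b)
    return ml, map_est, bayes

# The slide's example: prior Beta(2,3), one row with A=T and one with A=F,
# giving posterior Beta(3,4).
print(estimates(1, 1))   # ML = 0.5, MAP = 0.4, Bayesian mean = 3/7 ≈ 0.43
```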
Structure Learning as Search
• Local Search
1. Start with some network structure
2. Try to make a change
(add or delete or reverse edge)
3. See if the new network is any better
• What should the initial state be?
Uniform prior over random networks?
Based on prior knowledge?
Empty network?
• How do we evaluate networks?
Naïve Bayes
[Figure: Naïve Bayes network – a Class Value node with feature children F1, F2, F3, …, FN-2, FN-1, FN]
Assume that features are conditionally independent given the class variable
Works well in practice
Forces probabilities towards 0 and 1
Naïve Bayes for Text
• P(spam | w1 … wn)
• Independence assumption?
[Figure: class node Spam? with word features apple, dictator, …, Nigeria as children]
Naïve Bayes for Text
• Modeled as generating a bag of words for
a document in a given category by
repeatedly sampling with replacement
from a vocabulary V = {w1, w2,…wm} based on
the probabilities P(wj | ci).
• Smooth probability estimates with Laplace
m-estimates assuming a uniform
distribution over all words (p = 1/|V|) and
m = |V|
Equivalent to a virtual sample of seeing each word in
each category exactly once.
Text Naïve Bayes Algorithm
(Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
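
A runnable sketch of the training procedure above (function and variable names are mine, not the slides'), with Laplace smoothing exactly as in the last line.

```python
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (word_list, category) pairs. Returns priors, word probs, vocabulary."""
    vocab = {w for words, _ in docs for w in words}
    categories = {c for _, c in docs}
    prior, cond = {}, {}
    for c in categories:
        docs_c = [words for words, cat in docs if cat == c]
        prior[c] = len(docs_c) / len(docs)            # P(ci) = |Di| / |D|
        counts = Counter(w for words in docs_c for w in words)
        n_c = sum(counts.values())                    # total word occurrences in Ti
        # Laplace smoothing: P(wj | ci) = (nij + 1) / (ni + |V|)
        cond[c] = {w: (counts[w] + 1) / (n_c + len(vocab)) for w in vocab}
    return prior, cond, vocab
```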
Text Naïve Bayes Algorithm
(Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:
  argmax_{ci ∈ C}  P(ci) ∏_{i=1..n} P(ai | ci)
where ai is the word occurring in the ith position in X
Naïve Bayes Time Complexity
• Training Time: O(|D| Ld + |C| |V|), where Ld is the average length of a document in D.
Assumes V and all Di , ni, and nij pre-computed in
O(|D|Ld) time during one pass through all of the data.
Generally just O(|D|Ld) since usually |C||V| < |D|Ld
• Test Time: O(|C| Lt)
where Lt is the average length of a test
document.
• Very efficient overall, linearly proportional to the
time needed to just read in all the data.
Easy to Implement
• But…
• If you do… it probably won’t work…
Probabilities: Important Detail!
• P(spam | E1 … En) = ∏i P(spam | Ei)
  Any more potential problems here?
• We are multiplying lots of small numbers
  Danger of underflow!
  0.5^57 ≈ 7 × 10^-18
• Solution? Use logs and add!
  p1 * p2 = e^(log(p1) + log(p2))
  Always keep in log form
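
A sketch of the test-time classification rule in log space, assuming the train_naive_bayes sketch shown after the training algorithm. Summing logs implements the argmax over P(ci) ∏ P(ai | ci) while avoiding the underflow just described; out-of-vocabulary words are simply skipped here.

```python
import math

def classify(doc_words, prior, cond, vocab):
    """Return argmax_c  log P(c) + sum_i log P(w_i | c)."""
    best_c, best_logp = None, float("-inf")
    for c in prior:
        logp = math.log(prior[c])
        for w in doc_words:
            if w in vocab:                    # ignore words never seen in training
                logp += math.log(cond[c][w])
        if logp > best_logp:
            best_c, best_logp = c, logp
    return best_c
```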
Naïve Bayes Posterior
Probabilities
• Classification results of naïve Bayes
I.e. the class with maximum posterior
probability…
Usually fairly accurate (?!?!?)
• However, due to the inadequacy of the
conditional independence assumption…
Actual posterior-probability estimates not
accurate.
Output probabilities generally very close to 0
or 1.
573 Core Topics
[Course topic map, as on the earlier slide: Knowledge Representation; Search / Problem Spaces; Inference (Logic-Based, Probabilistic); Supervised Learning; Reinforcement Learning; Planning; Agency – now highlighting Planning under uncertainty]

Planning Under Uncertainty
[Figure: an agent deciding “What action next?”, exchanging Percepts and Actions with its Environment; assumptions: Static environment, Instantaneous actions, Perfect percepts, Full observability, Stochastic actions]
Models of Planning
              Uncertainty
Observation   Deterministic   Disjunctive   Probabilistic
Complete      Classical       Contingent    MDP
Partial       ???             Contingent    POMDP
None          ???             Conformant    POMDP
Awkward
• Never any definition of what an MDP is!
• Missed the probabilistic STRIPS definition
  Should have done that before the 2-DBN
  All the assumptions about a Markov model could be simplified and eliminated – took too long!
Defn: Markov Model
Q: set of states
p: init prob distribution
A: transition probability distribution
E.g. Predict Web Behavior
When will visitor leave site?
Q: set of states
(Pages)
p: init prob distribution
(Likelihood of site entry point)
A: transition probability distribution
(User navigation model)
E.g. Predict Robot’s Behavior
Will it attack Dieter?
Q: set of states
p: init prob distribution
A: transition probability distribution
Probability Distribution, A
• Forward Causality
  The probability of s_t does not depend directly on values of future states.
• Probability of new state could depend on the history of states visited:
  Pr(s_t | s_{t-1}, s_{t-2}, …, s_0)
• Markovian Assumption
  Pr(s_t | s_{t-1}, s_{t-2}, …, s_0) = Pr(s_t | s_{t-1})
• Stationary Model Assumption
  Pr(s_t | s_{t-1}) = Pr(s_k | s_{k-1}) for all k.
Representing A
Q: set of states (s0, s1, …, s6)
p: init prob distribution
A: transition probabilities – how can we represent these?
[Figure: a |Q| × |Q| matrix with rows and columns s0 … s6; entry p12 is the probability of transitioning from s1 to s2. Each row must sum to 1.]
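
A minimal sketch (names and numbers are made up, not from the slides) of a Markov model as an initial distribution plus a transition matrix, with ancestral sampling of a state sequence.

```python
import random

# A tiny Markov model over three states.
states = ["s0", "s1", "s2"]
init   = [0.6, 0.3, 0.1]        # p: initial probability distribution
trans  = [                      # A: row i gives Pr(next | current = states[i]);
    [0.7, 0.2, 0.1],            # each row sums to 1
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]

def sample_chain(length):
    """Sample s_0 ... s_{length-1}; the next state depends only on the current one."""
    seq = [random.choices(range(len(states)), weights=init)[0]]
    for _ in range(length - 1):
        seq.append(random.choices(range(len(states)), weights=trans[seq[-1]])[0])
    return [states[i] for i in seq]

print(sample_chain(5))
```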
Factoring Q
• Represent Q simply as a set of states?
• Is there internal structure?
  Consider a robot domain
  What is the state space?
[Figure: states s0 … s6 drawn as an unstructured set]
A Factored domain
• Six Boolean variables:
  has_user_coffee (huc),
  has_robot_coffee (hrc),
  robot_is_wet (w),
  has_robot_umbrella (u),
  raining (r),
  robot_in_office (o)
• How many states? 2^6 = 64
Representing p Compactly
Q: set of states (s0, s1, …)
p: init prob distribution
  How represent this efficiently?
[Figure: a Bayes net over the state variables (hrc, w, u, r, …) representing the initial-state distribution]
With a Bayes net (of course!)
Representing A Compactly
Q: set of states
p: init prob distribution
A: transition probabilities
[Figure: the flat transition matrix now has a row and a column for every one of the 64 states (s0, s1, …); entry p12 is the probability of transitioning from s1 to s2.]
How big is the matrix version of A? 64 × 64 = 4096
2-Dynamic Bayesian Network
[Figure: two-slice DBN with variables huc, hrc, w, u, r, o at time T and at time T+1, with arcs from the T slice to the T+1 slice.
Values required per variable: huc 8, hrc 4, w 16, u 4, r 2, o 2.
Total values required to represent the transition probability table = 36, vs. 4096 required for a complete state-to-state probability table.]
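
A sketch of the idea behind the 36-vs-4096 comparison: a factored transition model stores one small CPT per next-state variable instead of one 64 × 64 matrix. The parent sets and probabilities below are hypothetical, not the ones on the slide.

```python
# Factored transition model: Pr(x' | parents(x')) for each state variable.
# Each entry maps a tuple of parent values (at time T) to Pr(variable = True at T+1).
transition_cpts = {
    # raining' depends only on raining (made-up numbers)
    "r": {"parents": ["r"], "table": {(True,): 0.8, (False,): 0.1}},
    # robot_is_wet' depends on wet, raining, umbrella (3 parents -> 8 rows, made-up numbers)
    "w": {"parents": ["w", "r", "u"],
          "table": {(w, r, u): (0.9 if w else (0.7 if (r and not u) else 0.05))
                    for w in (True, False) for r in (True, False) for u in (True, False)}},
}

def prob_true(var, state):
    """Pr(var = True at T+1 | state at T), via the variable's small CPT."""
    cpt = transition_cpts[var]
    key = tuple(state[p] for p in cpt["parents"])
    return cpt["table"][key]

state = {"w": False, "r": True, "u": False}
print(prob_true("w", state))   # 0.7 under the made-up numbers above
```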
Dynamic Bayesian Network
[Figure: DBN unrolled over time slices 0, …, T, T+1, with variables huc, hrc, w, u, r, o in every slice]
Also known as a Factored Markov Model
Defined formally as:
* Set of random vars
* BN for initial state
* 2-layer DBN for transitions
Observability
• Full Observability
• Partial Observability
• No Observability
Reward/cost
• Each action has an associated cost.
• Agent may accrue rewards at different stages. A reward may depend on:
  The current state
  The (current state, action) pair
  The (current state, action, next state) triplet
• Additivity assumption: costs and rewards are additive.
• Reward accumulated = R(s0) + R(s1) + R(s2) + …
Horizon
• Finite: plan till t stages.
  Reward = R(s0) + R(s1) + R(s2) + … + R(st)
• Infinite: the agent never dies.
  The reward R(s0) + R(s1) + R(s2) + … could be unbounded. What to do?
  Discounted reward: R(s0) + γR(s1) + γ²R(s2) + …
  (with γ < 1 and bounded per-step reward Rmax, this sum is bounded by Rmax/(1-γ))
  Average reward: lim_{n→∞} (1/n) Σi R(si)
Goal for an MDP
• Find a policy which:
  maximizes expected discounted reward
  over an infinite horizon
  for a fully observable
  Markov decision process.
• Why shouldn’t the planner find a plan?
• What is a policy?
Optimal value of a state
• Define V*(s), the `value of a state’, as the maximum expected discounted reward achievable from this state.
• Value of the state if we force it to do action “a” right now, but let it act optimally later:
  Q*(a,s) = R(s) + c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s')
• V* should satisfy the following equation:
  V*(s) = max_{a∈A} Q*(a,s)
        = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s') }
Value iteration
• Assign an arbitrary value to each state (or use an admissible heuristic).
• Iterate over the set of states and in each iteration improve the value function as follows:
  V_{t+1}(s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V_t(s') }
  (`Bellman Backup’)
• Stop the iteration appropriately. V_t approaches V* as t increases.
Bellman Backup
[Figure: state s with candidate actions a1, a2, a3; each action leads to successor states with values Vn; Q_{n+1}(s,a) is computed for each action and V_{n+1}(s) is the max over the actions.]
Stopping Condition
• ε-convergence: a value function is ε-optimal if the error (residue) at every state is less than ε.
  Residue(s) = |V_{t+1}(s) - V_t(s)|
  Stop when max_{s∈S} Residue(s) < ε
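
A compact sketch of value iteration with the residual stopping test above. For simplicity the MDP here omits the per-action cost c(a) and uses R(s) only; all names and numbers are illustrative.

```python
def value_iteration(states, actions, R, P, gamma=0.9, eps=1e-4):
    """P[s][a] is a list of (s_next, prob) pairs; returns an approximately optimal V."""
    V = {s: 0.0 for s in states}                      # arbitrary initial values
    while True:
        V_new, residue = {}, 0.0
        for s in states:
            # Bellman backup: V(s) = R(s) + gamma * max_a sum_s' Pr(s'|a,s) V(s')
            V_new[s] = R[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[s][a]) for a in actions)
            residue = max(residue, abs(V_new[s] - V[s]))
        V = V_new
        if residue < eps:                             # epsilon-convergence test
            return V

# Tiny 2-state example with made-up dynamics and rewards.
states, actions = ["s0", "s1"], ["stay", "go"]
R = {"s0": 0.0, "s1": 1.0}
P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.9), ("s0", 0.1)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
print(value_iteration(states, actions, R, P))
```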
Complexity of value iteration
• One iteration takes O(|S|2|A|) time.
• Number of iterations required :
poly(|S|,|A|,1/(1-γ))
• Overall, the algo is polynomial in state space!
• Thus exponential in number of state vars.
Computation of optimal policy
• Given the value function V*(s), for each
state, do Bellman backups and the action
which maximises the inner product term is
the optimal action.
• Optimal policy is stationary (time
independent) – intuitive for infinite horizon
case.
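
A small follow-on to the value-iteration sketch above: extracting the greedy (optimal) policy from V by doing one more backup per state and keeping the maximizing action.

```python
def extract_policy(states, actions, R, P, V, gamma=0.9):
    """Greedy policy: for each state, pick the action maximizing the one-step lookahead."""
    return {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[s][a]))
            for s in states}
```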
Policy evaluation
• Given a policy Π: S → A, find the value of each state when acting according to Π.
  V^Π(s) = R(s) + c(Π(s)) + γ Σ_{s'∈S} Pr(s' | Π(s), s) V^Π(s')
• This is a system of linear equations involving |S| variables.
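
A sketch of exact policy evaluation by solving that |S| × |S| linear system, (I - γ P_Π) V = R (again dropping the c(a) term; numpy assumed available, names illustrative).

```python
import numpy as np

def policy_evaluation(policy, R, P, gamma=0.9):
    """policy: dict state->action; P[s][a]: list of (s_next, prob); returns V^pi as a dict."""
    states = list(R)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((len(states), len(states)))
    for s in states:
        for s2, p in P[s][policy[s]]:
            P_pi[idx[s], idx[s2]] = p
    r = np.array([R[s] for s in states])
    # Solve (I - gamma * P_pi) V = R  --  |S| linear equations in |S| unknowns
    v = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, r)
    return dict(zip(states, v))
```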
Bellman’s principle of optimality
• A policy Π is optimal if V^Π(s) ≥ V^Π’(s) for all policies Π’ and all states s ∈ S.
• Rather than finding the optimal value
function, we can try and find the optimal
policy directly, by doing a policy space
search.
Policy iteration
• Start with any policy (Π0).
• Iterate:
  Policy evaluation: for each state find V^{Πi}(s).
  Policy improvement: for each state s, find the action a* that maximises Q^{Πi}(a,s).
    If Q^{Πi}(a*,s) > V^{Πi}(s) let Π_{i+1}(s) = a*, else let Π_{i+1}(s) = Πi(s)
• Stop when Π_{i+1} = Πi
• Converges faster (in fewer iterations) than value iteration, but the policy evaluation step is more expensive.
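
A sketch of policy iteration built on the two sketches above (policy_evaluation for the evaluation step, greedy one-step lookahead for the improvement step); as before, action costs are omitted.

```python
def policy_iteration(states, actions, R, P, gamma=0.9):
    """Alternate exact evaluation and greedy improvement until the policy stops changing."""
    policy = {s: actions[0] for s in states}          # start with an arbitrary policy
    while True:
        V = policy_evaluation(policy, R, P, gamma)    # policy evaluation step
        new_policy = {                                # policy improvement step
            s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[s][a]))
            for s in states}
        if new_policy == policy:
            return policy, V
        policy = new_policy
```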
Modified Policy iteration
• Rather than evaluating the actual value of the policy by solving a system of equations, approximate it by running a bounded number of value iteration steps with the policy held fixed.
RTDP iteration
• Start with initial belief and initialize value of
each belief as the heuristic value.
• For current belief
Save the action that minimises the current state
value in the current policy.
Update the value of the belief through Bellman
Backup.
• Apply the minimum action and then randomly
pick an observation.
• Go to next belief assuming that observation.
• Repeat until goal is achieved.
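
A sketch of an RTDP trial loop in the spirit of the slide, written over states rather than beliefs and using cost minimization with a heuristic initial value; all names are illustrative and convergence checks are omitted.

```python
import random

def rtdp(s0, goal, actions, cost, P, h, n_trials=100):
    """RTDP: repeated greedy trials from s0 to goal, with Bellman backups along the way.
    P[s][a]: list of (s_next, prob); cost(s, a): action cost; h(s): admissible heuristic."""
    V = {}                                             # values initialized lazily from h
    def value(s): return 0.0 if s == goal else V.get(s, h(s))
    def q(s, a): return cost(s, a) + sum(p * value(s2) for s2, p in P[s][a])
    policy = {}
    for _ in range(n_trials):
        s = s0
        while s != goal:
            a = min(actions, key=lambda act: q(s, act))  # greedy (cost-minimizing) action
            policy[s] = a
            V[s] = q(s, a)                               # Bellman backup at the current state
            s2s, ps = zip(*P[s][a])
            s = random.choices(s2s, weights=ps)[0]       # sample an outcome and move on
    return policy, V
```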
Fast RTDP convergence
• What are the advantages of RTDP?
• What are the disadvantages of RTDP?
How to speed up RTDP?
Other speedups
• Heuristics
• Aggregations
• Reachability Analysis
Going beyond full observability
• In execution phase, we are uncertain
where we are,
• but we have some idea of where we can be.
• A belief state = ?
Models of Planning
              Uncertainty
Observation   Deterministic   Disjunctive   Probabilistic
Complete      Classical       Contingent    MDP
Partial       ???             Contingent    POMDP
None          ???             Conformant    POMDP
Speedups
• Reachability Analysis
• More informed heuristic
Algorithms for search
• A*: works for sequential solutions.
• AO*: works for acyclic solutions.
• LAO*: works for cyclic solutions.
• RTDP: works for cyclic solutions.