Introduction - Subbarao Kambhampati
Download
Report
Transcript Introduction - Subbarao Kambhampati
CSE 571: Artificial Intelligence
Instructor: Subbarao Kambhampati
[email protected]
Homepage: http://rakaposhi.eas.asu.edu/cse571
Office Hours: 1-2pm M/W BY 560
Markov Decision Processes
An MDP has is a 4-tuple: <S, A, R, T>:
(finite) state set S (|S| = n)
(finite) action set A (|A| = m)
(Markov) transition function T(s,a,s’) = Pr(s’ | s,a)
Probability of going to state s’ after taking action a in state s
How many parameters does it take to represent?
bounded, real-valued (Markov) reward function R(s)
Immediate reward we get for being in state s
For example in a goal-based domain R(s) may equal 1 for goal
states and 0 for all others
Can be generalized to include action costs: R(s,a)
Can be generalized to be a stochastic function
Can easily generalize to countable or continuous state and
action spaces (but algorithms will be different)
2
CSE 571
• “Run it as a Graduate Level Follow-on to CSE
471”
• Broad objectives
– Deeper treatment of some of the 471 topics
– More emphasis on tracking current state of the art
– (possibly) Training for literature survey and
independent projects
Class Make-up
• 46 students (rapidly changing..)
– 15 PhD students (13 CS; 2 from outside)
– 22 MS students (21 CS)
–8
MCS students
– 1 UG student (Integrated MS?)
Class Survey
•
Have you taken an Intro to AI course (or AI-related courses) before? If so, where did you take it?
What text book did you use?
–
–
•
Look at http://rakaposhi.eas.asu.edu/cse571 for the list of topics covered in CSE571 offering the
last time I did it. Please list the topics from there that you interested in learning about. (You can
list them in the order of importance for you)
–
•
•
No obvious patterns
Do you have any special reason for taking this course? This could, for example, be related to your
ongoing research and or interest in specific topics.
As you can see from http://rakaposhi.eas.asu.edu/cse571 videos from my previous offering of
this course are available on Youtube. For the topics that we happen to repeat, I am considering
exploiting their presence by having you watch them before coming to class, and then use the
class period for discussions and problem solving. (Before you consider alerting ABOR about my
laziness, please note that this is actually going to be more work for me too..:) Let me know
whether you are in favor of such a practice or you would prefer traditional lecture style classes
for all topics.
–
•
Popular topics: MDPs; Belief-space Planning; Statistical Learning; Reinforcement Learning
List any other topics (not covered last time) that you would like to see covered
–
•
12/33 haven’t taken any AI course
I will assume everyone has Intro to AI background; If you don’t I will assume you will pick up topics as
needed (some of you have already asked me for overrides this way).
Alas Real Genius; but at least Job Security!
Do you want structured projects or quasi-self-defined semester projects for the class?
–
Best answer: “I am not sure, probably not”
Reading Material…Eclectic
• Chapters from R&N
• Chapters from other books
– POMDPS from Thrun/Burgard/Fox
– Templated Graphical models from
Koller &Friedman
– CSP/Tree-width stuff from Dechter
• Tutorial papers etc
“Grading”?
• 3 main ways
– Problem sets (with mini-projects); Mid-term; Possibly
final
– Participate in the class actively.
– Graduate level assessment (to be decided)
• Read assigned chapters/papers; submit reviews before the
class; take part in the discussion
• Learn/Present the state of the art in a sub-area of AI
– You will pick papers from IJCAI 2009 as a starting point
– http://ijcai.org/papers09/contents.php
• Work on a term project
– Can be in groups of two
(Static vs. Dynamic)
Environment
(perfect vs.
Imperfect)
(Full vs.
Partial satisfaction)
Goals
(Observable vs.
Partially Observable)
(Instantaneous vs.
Durative)
(Deterministic vs.
Stochastic)
What action next?
A: A Unified Brand-name-Free Introduction to Planning
Subbarao Kambhampati
Topics Covered in CSE471
•
•
Table of Contents (Full Version)
Preface (html); chapter map
Part I Artificial Intelligence
1 Introduction
2 Intelligent Agents
Part II Problem Solving
3 Solving Problems by Searching
4 Informed Search and Exploration
5 Constraint Satisfaction Problems
6 Adversarial Search
Part III Knowledge and Reasoning
7 Logical Agents
8 First-Order Logic
9 Inference in First-Order Logic
10 Knowledge Representation
Part IV Planning
11 Planning (pdf)
12 Planning and Acting in the Real
World
•
Part V Uncertain Knowledge and
Reasoning
13 Uncertainty
14 Probabilistic Reasoning
15 Probabilistic Reasoning Over Time
16 Making Simple Decisions
17 Making Complex Decisions
Part VI Learning
18 Learning from Observations
19 Knowledge in Learning
20 Statistical Learning Methods
21 Reinforcement Learning
Part VII Communicating, Perceiving,
and Acting
22 Communication
23 Probabilistic Language Processing
24 Perception
25 Robotics
Part VIII Conclusions
26 Philosophical Foundations
27 AI: Present and Future
Representation Mechanisms:
Logic (propositional; first order)
Probabilistic logic
Learning
the models
How the course topics stack up…
Search
Blind, Informed
SAT; Planning
Inference
Logical resolution
Bayesian inference
Pendulum Swings in AI
• Top-down vs. Bottom-up
• Ground vs. Lifted representation
– The longer I live the farther down the Chomsky
Hierarchy I seem to fall [Fernando Pereira]
• Pure Inference and Pure Learning vs.
Interleaved inference and learning
• Knowledge Engineering vs. Model Learning vs.
Data-driven Inference
• Human-aware vs. Stand-Alone
Class forum
CLASS OF 8/29
Agent Classification in Terms of State
Representations
Type
State representation
Focus
Atomic
States are indivisible;
No internal structure
Search on atomic states;
Propositional
(aka Factored)
States are made of state
variables that take values
(Propositional or Multivalued or Continuous)
Search+inference in
logical (prop logic) and
probabilistic (bayes nets)
representations
Relational
States describe the
objects in the world and
their inter-relations
Search+Inference in
predicate logic (or
relational prob. Models)
First-order
+functions over objects
Search+Inference in first
order logic (or first order
probabilistic models)
Markov Decision Processes
Atomic Model for stochastic
environments with generalized
rewards
•Based in part on slides by Alan Fern, Craig Boutilier and Daniel Weld
•Some slides from Mausam/Kolobov Tutorial; and a couple from Terran Lane
16
Atomic Model for Deterministic
Environments and Goals of Attainment
Deterministic worlds +
goals of attainment
Atomic model: Graph search
Propositional models: The
PDDL planning that we
discussed..
What is missing?
Rewards are only at the end (and
then you die).
What about “the Journey is the
reward” philosophy?
Dynamics are assumed to be
Deterministic
What about stochastic dynamics?
17
Atomic Model for stochastic
environments with generalized rewards
Stochastic worlds
+generalized rewards
An action can take you to
any of a set of states with
known probability
You get rewards for
visiting each state
Objective is to increase
your “cumulative”
reward…
What is the solution?
18
(Static vs. Dynamic)
Environment
(perfect vs.
Imperfect)
(Full vs.
Partial satisfaction)
Goals
(Observable vs.
Partially Observable)
(Instantaneous vs.
Durative)
(Deterministic vs.
Stochastic)
What action next?
A: A Unified Brand-name-Free Introduction to Planning
Subbarao Kambhampati
Optimal Policies depend on
horizon, rewards..
-
-
-
-
9/5/2012
25
Markov Decision Processes
An MDP has four components: S, A, R, T:
(finite) state set S (|S| = n)
(finite) action set A (|A| = m)
(Markov) transition function T(s,a,s’) = Pr(s’ | s,a)
Probability of going to state s’ after taking action a in state s
How many parameters does it take to represent?
bounded, real-valued (Markov) reward function R(s)
Immediate reward we get for being in state s
For example in a goal-based domain R(s) may equal 1 for goal
states and 0 for all others
Can be generalized to include action costs: R(s,a)
Can be generalized to be a stochastic function
Can easily generalize to countable or continuous state
and action spaces (but algorithms will be different)
26
Assumptions
First-Order Markovian dynamics (history independence)
Pr(St+1|At,St,At-1,St-1,..., S0) = Pr(St+1|At,St)
Next state only depends on current state and current action
First-Order Markovian reward process
Pr(Rt|At,St,At-1,St-1,..., S0) = Pr(Rt|At,St)
Reward only depends on current state and action
As described earlier we will assume reward is specified by a deterministic
function R(s)
i.e. Pr(Rt=R(St) | At,St) = 1
Stationary dynamics and reward
Pr(St+1|At,St) = Pr(Sk+1|Ak,Sk) for all t, k
The world dynamics do not depend on the absolute time
Full observability
Though we can’t predict exactly which state we will reach when we
execute an action, once it is realized, we know what it is
28
Policies (“plans” for MDPs)
Nonstationary policy [Even though we have
stationary dynamics and reward??]
π:S x T → A, where T is the non-negative integers
π(s,t) is action to do at state s with t stages-to-go
What if we want to keep acting indefinitely?
Stationary policy
π:S → A
π(s) is action to do at state s (regardless of time)
specifies a continuously reactive controller
If you are 20 and
are not a liberal, you
are heartless
If you are 40 and
not a conservative,
you are mindless
-Churchill
These assume or have these properties:
full observability
Why not just consider
history-independence
sequences of actions?
deterministic action choice
Why not just replan?
29
Value of a Policy
How good is a policy π?
How do we measure “accumulated” reward?
Value function V: S →ℝ associates value with each
state (or each state and time for non-stationary π)
Vπ(s) denotes value of policy at state s
Depends on immediate reward, but also what you achieve
subsequently by following π
An optimal policy is one that is no worse than any other
policy at any state
The goal of MDP planning is to compute an optimal
policy (method depends on how we define value)
30
Finite-Horizon Value Functions
We first consider maximizing total reward over a
finite horizon
Assumes the agent has n time steps to live
To act optimally, should the agent use a stationary
or non-stationary policy?
Put another way:
If you had only one week to live would you act the same
way as if you had fifty years to live?
31
Finite Horizon Problems
Value (utility) depends on stage-to-go
hence so should policy: nonstationary π(s,k)
k
V (s)is k-stage-to-go value function for π
expected total reward after executing π for k time steps (for k=0?)
k
V ( s ) E [ R | , s ]
k
t
t 0
k
E [ R( s t ) | a t ( s t , k t ), s s 0 ]
t 0
Here Rt and st are random variables denoting the reward
received and state at stage t respectively
32
Computing Finite-Horizon Value
Can use dynamic programming to compute
k
V (s)
Markov property is critical for this
(a) V0 (s) R(s),
s
(b) Vk ( s ) R( s )
s'T (s, (s, k ), s' ) V
immediate reward
k 1
( s' )
expected future payoff
with k-1 stages to go
π(s,k)
0.7
What is time complexity?
0.3
Vk
Vk-1
33
Bellman Backup
How can we compute optimal Vt+1(s) given optimal Vt ?
Vt
Compute
Expectations
Compute
Max
Vt+1(s)
s1
0.7
a1
0.3
s
0.4
a2
0.6
s2
s3
s4
Vt+1(s) = R(s)+max { 0.7 Vt (s1) + 0.3 Vt (s4)
0.4 Vt (s2) + 0.6 Vt(s3)
}
34
Value Iteration: Finite Horizon Case
Markov property allows exploitation of DP
principle for optimal policy construction
no need to enumerate |A|Tn possible policies
Value Iteration
V (s) R(s), s
0
Bellman backup
V ( s ) R( s ) max T ( s, a, s' ) V
s
'
a
k
* ( s, k ) arg max s' T ( s, a, s' ) V
k 1
k 1
(s' )
( s' )
a
Vk is optimal k-stage-to-go value function
Π*(s,k) is optimal k-stage-to-go policy
35
Value Iteration
Optimal value depends on stages-to-go
(independent in the infinite horizon case)
V3
V1
V2
V0
s1
0.7
0.7
0.7
0.4
0.4
0.4
0.6
0.6
0.6
0.3
0.3
0.3
V1(s4) = R(s4)+max { 0.7 V0 (s1) + 0.3 V0 (s4)
0.4 V0 (s2) + 0.6 V0(s3)
s2
s3
s4
}
36
Value Iteration
V3
V1
V2
V0
s1
0.7
0.7
0.7
0.4
0.4
0.4
0.6
0.6
0.6
0.3
0.3
P*(s4,t) = max {
0.3
s2
s3
s4
}
37
9/10/2012
38
Value Iteration
Note how DP is used
optimal soln to k-1 stage problem can be used without
modification as part of optimal soln to k-stage problem
Because of finite horizon, policy nonstationary
What is the computational complexity?
T iterations
At each iteration, each of n states, computes
expectation for |A| actions
Each expectation takes O(n) time
Total time complexity: O(T|A|n2)
Polynomial in number of states. Is this good?
39
Summary: Finite Horizon
Resulting policy is optimal
V * ( s) V ( s), , s, k
k
k
convince yourself of this
Note: optimal value function is unique, but
optimal policy is not
Many policies can have same value
40
Discounted Infinite Horizon MDPs
Defining value as total reward is problematic with
infinite horizons
many or all policies have infinite expected reward
some MDPs are ok (e.g., zero-cost absorbing states)
“Trick”: introduce discount factor 0 ≤ β < 1
future rewards discounted by β per time step
k
t t
V ( s) E [ R | , s ]
t 0
Note:
V ( s) E [ R
t 0
t
max
1
max
]
R
1
Motivation: economic? failure prob? convenience?
41
Notes: Discounted Infinite Horizon
Optimal policy maximizes value at each state
Optimal policies guaranteed to exist (Howard60)
Can restrict attention to stationary policies
I.e. there is always an optimal stationary policy
Why change action at state s at new time t?
We define
V * (s) V (s) for some optimal π
42
Computing an Optimal Value Function
Bellman equation for optimal value function
V * ( s) R( s ) β max T ( s, a, s' ) V *( s' )
s
'
a
Bellman proved this is always true
How can we compute the optimal value function?
The MAX operator makes the system non-linear, so the problem is
more difficult than policy evaluation
Notice that the optimal value function is a fixed-point
of the Bellman Backup operator B
B takes a value function as input and returns a new value function
B[V ]( s ) R( s ) β max T ( s, a, s ' ) V ( s ' )
s
'
a
43
Value Iteration
Can compute optimal policy using value
iteration, just like finite-horizon problems (just
include discount term)
V ( s) 0
0
V ( s) R( s) max T ( s, a, s' ) V
s
'
a
k
k 1
( s' )
Will converge to the optimal value function as k
gets large. Why?
44
Convergence
B[V] is a contraction operator on value functions
For any V and V’ we have || B[V] – B[V’] || ≤ β || V – V’ ||
Here ||V|| is the max-norm, which returns the maximum element of
the vector
So applying a Bellman backup to any two value functions causes
them to get closer together in the max-norm sense.
Convergence is assured
any V: || V* - B[V] || = || B[V*] – B[V] || ≤ β|| V* - V ||
so applying Bellman backup to any value function brings us closer
to V* by a factor β
thus, Bellman fixed point theorems ensure convergence in the limit
When to stop value iteration? when ||Vk - Vk-1||≤ ε
this ensures ||Vk – V*|| ≤ εβ /1-β
45
Contraction property proof sketch
Note that for any functions f and g
We can use this to show that
|B[V]-B[V’]| <= |V – V’|
f
B[V ]( s ) R( s ) β max T ( s, a, s' ) V ( s' )
s'
a
B[V ' ]( s ) R( s ) β max T ( s, a, s' ) V ' ( s' )
s'
a
g
Subtract one from other
( B[V ] B[V ' ])( s ) β[ max T ( s, a, s' ) V ( s' ) max T ( s, a, s' ) V ' ( s' )]
s'
s'
a
a
β max [ T ( s, a, s' ) (V ( s' ) V ' ( s' ))]
s'
a
46
How to Act
Given a Vk from value iteration that closely
approximates V*, what should we use as our
policy?
Use greedy policy:
greedy[V ]( s) arg max T ( s, a, s' ) V ( s' )
s'
k
k
a
Note that the value of greedy policy may not
be equal to Vk
Let VG be the value of the greedy policy? How
close is VG to V*?
47
How to Act
Given a Vk from value iteration that closely approximates
V*, what should we use as our policy?
Use greedy policy:
greedy[V ]( s) arg max T ( s, a, s' ) V ( s' )
s'
a
k
k
We can show that greedy is not too far from optimal if Vk is close
to V*
In particular, if Vk is within ε of V*, then VG within 2εβ /1-β
of V* (if ε is 0.001 and β is 0.9, we have 0.018)
Furthermore, there exists a finite ε s.t. greedy policy is
optimal
That is, even if value estimate is off, greedy policy is optimal
once it is close enough
48
Improvements to Value Iteration
Initialize with a good approximate value
function
Instead of R(s), consider something more like h(s)
Well defined only for SSPs
Asynchronous value iteration
Can use the already updated values of neighors to
update the current node
Prioritized sweeping
Can decide the order in which to update states
As long as each state is updated infinitely often,
it doesn’t matter if you don’t update them
What are good heuristics for Value iteration?
49
9/14 (make-up for 9/12)
Policy Evaluation for Infinite Horizon MDPS
Policy Iteration
Why it works
How it compares to Value Iteration
Indefinite Horizon MDPs
The Stochastic Shortest Path MDPs
With initial state
Value Iteration works; policy iteration?
Reinforcement Learning start
50
Policy Evaluation
Value equation for fixed policy
V ( s) R( s) β T ( s, ( s), s' ) V ( s' )
s'
Notice that this is stage-indepedent
How can we compute the value function for a
policy?
we are given R and Pr
simple linear system with n variables (each
variables is value of a state) and n constraints
(one value equation for each state)
Use linear algebra (e.g. matrix inverse)
51
Policy Iteration
Given fixed policy, can compute its value exactly:
V ( s) R( s) T ( s, ( s ), s ' ) V ( s ' )
s'
Policy iteration exploits this: iterates steps of policy
evaluation and policy improvement
1. Choose a random policy π
Policy improvement
2. Loop:
(a) Evaluate Vπ
(b) For each s in S, set ' ( s) arg max T ( s, a, s' ) V ( s' )
s'
a
(c) Replace π with π’
Until no improving action possible at any state
52
P.I. in action
Iteration 0
Policy
Value
The PI in action slides from Terran Lane’s Notes
P.I. in action
Iteration 1
Policy
Value
P.I. in action
Iteration 2
Policy
Value
P.I. in action
Iteration 3
Policy
Value
P.I. in action
Iteration 4
Policy
Value
P.I. in action
Iteration 5
Policy
Value
P.I. in action
Iteration 6: done
Policy
Value
Policy Iteration Notes
Each step of policy iteration is guaranteed to strictly improve the
policy at some state when improvement is possible
[Why? The same contraction property of Bellman Update.
Note that when you go from value to policy to value to policy,
you are effectively, doing a DP update on the value]
Convergence assured (Howard)
intuitively: no local maxima in value space, and each policy
must improve value; since finite number of policies, will
converge to optimal policy
Gives exact value of optimal policy
Complexity:
There are at most exp(n) policies, so PI is no worse than
exponential time in number of states
Empirically O(n) iterations are required
Still no polynomial bound on the number of PI iterations
(open problem)!
60
Improvements to Policy Iteration
Find the value of the policy approximately (by
value iteration) instead of exactly solving the
linear equations
Can run just a few iterations of the value iteration
This can be asynchronous, prioritized etc.
61
Value Iteration vs. Policy Iteration
Which is faster? VI or PI
It depends on the problem
VI takes more iterations than PI, but PI
requires more time on each iteration
PI must perform policy evaluation on each step
which involves solving a linear system
Can be done approximately
Also, VI can be done with asynchronous and
prioritized update fashion..
***Value Iteration is more robust—it’s
convergence is guaranteed for many more
types of MDPs..
62
Need for Indefinite Horizon MDPs
We have see
Finite horizon MDPs
Infinite horizon MDPs
In many cases, we neither have finite nor
infinite horizon, but rather some indefinite
horizon
Need to model MDPs without discount factor,
knowing only that the behavior sequences will be
finite
63
Stochastic Shortest-Path MDPs: Definition
Bertsekas, 1995
SSP MDP is a tuple <S, A, T, C, G>, where:
•
•
•
•
•
•
S is a finite state space
(D is an infinite sequence (1,2, …))
A is a finite action set
T: S x A x S [0, 1] is a stationary transition function
C: S x A x S R is a stationary cost function (= -R: S x A x S R)
G is a set of absorbing cost-free goal states
Under two conditions:
• There is a proper policy (reaches a goal with P= 1 from all states)
– No sink states allowed..
• Every improper policy incurs a cost of ∞ from every state from
which it does not reach the goal with P=1
64
[SSP slides from Mausam/Kolobov Tutorial]
SSP MDP Details
• In SSP, maximizing ELAU = minimizing exp. cost
• Every cost-minimizing policy is proper!
• Thus, an optimal policy = cheapest way to a goal
65
Not an SSP MDP Example
No cost-free
“loops” allowed!
C(s1, a2, s1) = 7.2
a2
C(s1, a1, s2) = 1
a1
S1
S2
a1
C(s2, a1, s1) = -1
C(sG, a2, sG) = 0
C(s2, a2, sG) = 1
T(s2, a2, sG) = 0.3
a2
a2
SG
C(s2, a2, s2) = -3
T(s2, a2, sG) = 0.7
a1
C(sG, a1, sG) = 0
No dead ends
allowed!
a1
C(s3, a1, s3) = 2.4
S3
a2
C(s3, a2, s3) = 0.8
66
SSP MDPs: Optimality Principle
For an SSP MDP, let:
Exp. Lin. Add. Utility
–
Vπ(s,t)
π
= Es,t[C1 + C2 + …] for all s, t
For every s,t, the
value of a policy is
well-defined!
Every policy either takes a
Then: finite exp. # of steps to reach
a goal, or has an infinite cost.
– V* exists, π* exists, both are stationary
– For all s:
V*(s) = mina in A [ ∑s’ in S T(s, a, s’) [ C(s, a, s’) + V*(s’) ] ]
π*(s) = argmina in A [ ∑s’ in S T(s, a, s’) [ C(s, a, s’) + V*(s’) ] ]
67
SSP and Other MDP Classes
IHDR
SSP
FH
• SSP is an “indefinite-horizon” MDP
• Can compile FH by considering a new state space that is <S,T>
(state at epoch)
• Can compile IHDR to SSP by introducing a “goal” node and adding
a probability transition from any state to goal state.
68
Algorithms for SSP
• Value Iteration works without change for SSPs
– (as long as they *are* SSPs—i.e., have proper policies
and infinite costs for all improper ones)
– Instead of Max operations, the Bellman update does
min operations
• Policy iteration works *iff* we start with a proper
policy (otherwise, it diverges)
– It is not often clear how to pick a proper policy though
SSPs and A* Search
• The SSP model, being based on absorbing goals and action
costs, is very close to the A* search model
– Identical if you have deterministic actions and start the SSP with
a specific initial state
– For this case, the optimal value function of SSP is the perfect
heuristic for the corresponding A* search
– An admissible heuristic for A* will be a “lower bound” on V* for
the SSP
• Start value iteration by initializing with an admissible heuristic!
• ..and since SSP theoretically subsumes Finite Horizon and
Infinite Horizon models, you get an effective bridge
between MDPs and A* search
• The bridge also allows us to “solve” MDPs using advances in
deterministic planning/A* search..
Summary of MDPs
• There are many MDPs, we looked at those that
– aim to maximize expected cumulative reward (as against,
say, average reward)
– Finite horizon, Infinite Horizon and SSP MDPs
• We looked at Atomic MDPs—states are atomic
• We looked at exact methods for solving MDPs
– Value Iteration (and improvements including asynchronous
and prioritized sweeping)
– Policy Iteration (and improvements including modified PI)
• We looked at connections to A* search
Other topics in MDPs
(that we will get back to)
• Approximate solutions for MDPs
– E.g. Online solutions based on determinizations
• Factored representations for MDPs
– States in terms of state variables
– Actions in terms of either Probabilistic STRIPS or
Dynamic Bayes Net representations
– Value and Reward functions in terms of decision
diagrams
Modeling Softgoal problems as
deterministic MDPs
• Consider the net-benefit problem, where actions have
costs, and goals have utilities, and we want a plan with
the highest net benefit
• How do we model this as MDP?
– (wrong idea): Make every state in which any subset of goals hold
into a sink state with reward equal to the cumulative sum of
utilities of the goals.
• Problem—what if achieving g1 g2 will necessarily lead you through
a state where g1 is already true?
– (correct version): Make a new fluent called “done” dummy action
called Done-Deal. It is applicable in any state and asserts the
fluent “done”. All “done” states are sink states. Their reward is
equal to sum of rewards of the individual states.