Why Probability?

Download Report

Transcript Why Probability?

Probabilistic Plan Recognition
Kathryn Blackmond Laskey
Department of Systems Engineering and Operations Research
George Mason University
Dagstuhl Seminar April 2011
The problem of plan recognition is to take
as input a sequence of actions performed
by an actor and to infer the goal pursued by
the actor and also to organize the action
sequence in terms of a plan structure
Schmidt, Sridharan and Goodson, 1978
…the problem of plan recognition is
largely a problem of inference under
conditions of uncertainty.
Charniak and Goldman, 1993
1
PPR in a Nutshell
P( plan | obs)

P(obs | plan)P( plan)
P(obs)

Thomas Bayes
(1702-1762)
Bayes, Thomas. An essay towards solving a problem
in the doctrine of chances. Philosophical Transactions
2
of the Royal Society of London, 53:370- 418, 1763.
Represent
• set of possible plans
• anticipated evidence for each plan
Specify
…or just directly specify P(plan|obs)
• prior probabilities for plans
• likelihood for evidence given plans
Infer plans using Bayes Rule
Why Probability?
• Theoretically well-founded representation for relative
plausibility of competing explanations
• Unified approach to inference and learning
• Combine engineered and learned knowledge
• Many general-purpose exact and approximate algorithms with
strong theoretical justification and practical success
• Good results on many interesting problems
• But…
– Inference and learning (exact and approximate) are NP hard
– Balancing tractability and expressiveness is a major research and
engineering challenge
3
Representing Plans and Observations
• Plan recognition requires a computational representation of
possible plans and observable evidence
– Goals
– Actions
• When executed in combination, actions are expected (with high
probability) to achieve the goal
– Preconditions / postconditions of actions
– Constraints
• Most notably, temporal ordering
– Observables
• Actions may or may not be directly observable
• Sometimes we observe effects of actions
– Hierarchical decomposition of the above
• For probabilistic plan recognition, we need to assign
probabilities to these elements
4
– Balance expressivity against tractability of inference & learning
Some Representations for PPR
• Bayesian networks
• Hidden Markov Models / Dynamic Bayesian Networks
• Plan Recognition Bayesian Networks / Probabilistic
Relational Models / Multi-Entity Bayesian Networks
• Bayesian Abductive Logic Programs
• Stochastic Grammars
• Conditional Random Fields
• Markov Logic Networks
Each of these formalisms can be thought of
as a way of representing a set of “possible
worlds” and defining a probability measure
on an algebra of subsets
5
Graphical Probability Models
• Factorize joint distribution into factors involving only a few
variables each
– Graph represents conditional independence assumptions
– Local distributions specify probability information for small groups
of related variables
– Factors are combined into joint distribution
G
• Drastically simplifies specification,
inference and learning
• 20 possible goals, 100 possible actions
A5
A1
A4
A2
A3
Naïve Bayes Model
– Fully general model  2.5x1031 probabilities
– “Naïve Bayes model”  19x20x100=38,000 probabilities
– If each goal has only 10 associated actions then “naïve Bayes
model”  19x10 = 190 probabilities
– Naïve Bayes inference scales as #variables x #states/variable
6
Bayesian Network (BN)
• Directed graph represents dependencies
n
• Joint distribution factors as Pr(X1,K , X n )  Pr(X i | X pa(i) )
i1
• Factored representation makes specification, inference and
learning tractable for interesting classes of problems
• Directed graph naturally represents causality
 via “do” operator
– Effects of intervention
– Explaining away
Regional
Domination (R)
Exercise (E)
Troops
Moving (T)
7
Invasion (I)
Pr(R,E,I,W,T,B,S) =
Pr(R)Pr(E)Pr(I|R)Pr(W|R)Pr(T|E,I)Pr(B|W)Pr(S|W)
Nuclear
Weapons (W)
Buy
Reactors (B)
Steal
Materials (S)
127 probabilities
 14 probabilities
Possible and Probable Worlds
• “Traditional or deductive logic admits only three attitudes to any
proposition: definite proof, disproof, or blank ignorance.” (Jeffreys)
• Semantics of classical logic is based on possible worlds
– Set of possible worlds defined by language, domain, and axioms
– In propositional logic, possible worlds assign truth values to atoms
(e.g., R  T; W  T; E  F)
• Probability theory
– Set of possible worlds is called the sample space
– Probability measure maps subsets to real numbers
– Probability axioms are a natural extension of classical propositional logic to
likelihood
• BN combines propositional logic with probability
Regional
Domination (R)
Exercise (E)
8
Troops
Moving (T)
Invasion (I)
Nuclear
Weapons (W)
Buy
Reactors (B)
Steal
Materials (S)
R
E
I
W
T
B
S
Pr
T
T
T
T
T
T
T
p1
T
T
T
T
T
T
F
p2
T
T
T
T
T
F
T
p3
…
Other Factored Representations
• Markov network: factorization specified by undirected
graph
– More natural for domains without natural causal direction
C indexes cliques in the graph
– Joint distribution factorizes as:
1
P(X1,K , X n )  C (X1C ,K X kC C )
Z C
xiC is ith variable in clique C
kC is size of clique C
Z is a normalization constant
• Chain graph: factorization specified by graph with both
directed and undirected edges
 • Representations to exploit context-specific
independence
– Probability trees
– Tree-structured parameterization for local distributions in a BN
9
Conditional Random Fields
• Bayesian networks are generative models
– Represent joint probability over plans and observations
– Realistic dependence models often yield intractable
inference
• Conditional (or discriminative) model directly
represents probability of plans given observations
– Can allow some dependencies to be relaxed
• CRFs are discriminative
– Undirected graph represents local dependencies
– Potential function represents strength of dependence
• A CRF is a family of MRFs (a mapping from
observations to potentials)
10
Inference in Graphical Models
• Exact inference
– E.g., Belief propagation, junction tree, bucket elimination, symbolic
probabilistic inference, cutset conditioning
– Exploit graph structure / factorization to simplify computation
– Infeasible for complex problems
• Approximate (deterministic)
– E.g., Loopy BP, variational Bayes
• Approximate (stochastic)
– E.g., Gibbs sampling, Metropolis-Hastings sampling, likelihood
weighting
• Combinations
– E.g., Bidyuk and Dechter (2007) – cutset sampling
11
Belief Propagation for Singly Connected BNs
•
•
•
Goal: compute probability
distribution of random
variable B given evidence
(assume B itself is not
known)
Random variables
“above” B
A1
Key idea: impact of belief
in B from evidence "above"
B and evidence "below" B
can be processed
separately
A2
A3
A4
A5
p
A6
?
Justification: B d-separates
“above” random variables
from “below” random
variables
D5
D6
Random
variables
“below” B
• This picture depicts the updating process for one node.
• The algorithm simultaneously updates beliefs for all the nodes.
• Loopy BP applies BP to network with loops; often results in
good approximation
12
p
D7
B
l
D1
D2
D3
D4
= evidence random variable
Likelihood Weighting (for BNs)
1. Proceed through non-evidence
variables in order consistent with
partial ordering induced by graph
A
C
– Sample variable according to its local
probability distribution
– Calculate weight proportional to
Pr(evidence | sampled values)
2. Repeat Step 1 until done
3. Estimate Pr(Variable=value) by
weighted sample frequency
B
D
G
E
H
F
K
J
L
13
Junction Tree Algorithm
1. Compile BN into junction tree
– Tree of clusters of nodes
– Has JT property: variable
belonging to 2 clusters must
belong to all clusters along path
connecting them
– Becomes part of the knowledge
representation
– Changes only if the graph
changes
2. Use local local messagepassing algorithm to propagate
beliefs in the junction tree
3. Query on any node or any
set of nodes in same cluster
can be computed from
cluster joint distribution
14
ABC
A
BCDEH
C
B
CDGEH
D
DEGHJ
G
E
H
DFGHJ
F
K
J
L
FGJK
JKL
Gibbs Sampling
A
1. Initialize
– Evidence variables assigned to
observed values
– Arbitrary value for other variables
C
2. Sample non-evidence nodes one at
a time:
– Sample with probability
Pr(variable | Markov blanket)
– Replace with newly sampled value
D
G
3. Repeat Step 2 until done
K
4. Estimate Pr(Variable=value) by
sample frequency
• Markov blanket
15
B
E
H
F
J
L
- In BN: parents, children, co-parents
- In MN: neighbors
• Variable is conditionally independent of
rest of network given its Markov blanket
Cutset Sampling (for BNs)
A
• Find a loop cutset
• Initialize cutset variables
• Do until done
C
– Propagate beliefs on non-cutset variables
– Do Gibbs iteration on cutset
• Estimate P(Variable=value) by
averaging probability over samples
• This is a kind of “RaoBlackwellization”
– Reduce variance of Monte Carlo estimator
by replacing a sampling step with an
exact computation with same expected
value
16
B
D=d
D
G
E
H
F
K
J=j
J
L
Variational Inference
• Method for approximating posterior distribution of
unobserved variables given observed variables
• Approximation finds distribution in family with
simpler functional form (e.g., remove some arcs in
graph) by minimizing a measure of distance from
true posterior
• Estimation via “variational EM”
– Alternate between “expectation” and “maximization”
steps
– Converges to local minimum of distance function
– Yields lower bound for marginal likelihood
• Often faster but less accurate than MC
17
Extending Expressive Power of BNs
• Propositional logic + probability is insufficiently expressive for
requirements of plan recognition
–
–
–
–
Repeated structure
Multiple interrelated entities (e.g., plans, actors, actions)
Type hierarchy and inheritance
Unbounded number of potentially relevant variables
• Some formalisms with greater expressive power:
18
– PBN (Plan recognition BN)
– PRM (Probabilistic
Relational Models)
– OOBN (Object-Oriented
Bayesian Networks)
– MEBN (Multi-Entity BN)
– Plates
– BALP (Bayesian Abductive
Logic Programs)
Charniak and Goldman (1993)
Example: Maritime Domain Awareness
Entities, attributes and relations
19
MDA Probabilistic Ontology
20
Built in UnBBayes-MEBN
MDA SSBN
Screenshot of situation-specific BN in UnBBayes-MEBN
21
(open-source tool for building & reasoning with PR-OWL ontologies)
Protégé Plugin for UnBBayes
22
Drag-and-Drop Mapping
23
Markov Logic Networks
• First-order knowledge base with weight
attached to formulas and clauses
• KB + individual constants  ground
Markov network containing variable for
each grounding of a formula in the KB
• Compact language for specifying large
Markov networks
24
MLN Example
(Richardson and Domingos, 2006)
25
CRFs for Chat Recognition
(Hsu, Lian and Jih, 2011)
•
•
•
•
Subscript indexes pairs of individuals
Yit represents chatting activity of pair
Xit represents observed acoustic features
Dependence structure:
– Within-pair temporal
dependence
– Between-pair concurrent
dependence
• Can be represented
as MLN
26
Possible and Probable FO
Worlds
• In first-order logic, a possible world (aka “structure”) assigns:
– Each constant symbol to a domain element (e.g., go3  obj23)
– Each n-ary function symbol to a function on n-tuples of domain
elements (e.g., (go-stp pln1)  obj23
– Each n-ary relation symbol to a set of n-tuples of domain
elements (e.g., inst  {(obj23, go-), (obj78, liquor-store),
(obj78, store) … }
• A first-order probabilistic logic assigns a probability measure
to first-order structures
– This is called “measure model” semantics (Gaifman,1964)
27
FOL + Probability: Issues
• Probability zero ≠ unsatisfiable
– E.g., every possible value of a continuous distribution has probability
zero
• FOL is undecidable; FOL + probability is not even semidecidable
– Example: IID sequence of coin tosses, 0 < P(H) < 1
• Given any finite sequence of prior tosses, both H and T are possible
• We cannot disprove any non-extreme probability distribution from a
finite sequence of tosses
• Wrong solution: “We will prevent you from expressing this
query because we cannot tractably compute the answer.”
• Better solution: “Represent the problem you really want to
solve, and then figure out a way to approximate the answer.”
– Think carefully about what the real problem is!
28
Knowledge Based Model Construction
• KBMC system contains :
– Base representation that represents goals, plans, actions, actors,
observables, constraints, etc.
– Model construction procedure that maps a context and/or query
into a target model
• At problem solving time
– Construct a problem-specific Bayesian network
– Process queries on constructed model using general-purpose
BN algorithm
• Advantages of expressive representation
29
–
–
–
–
–
Understandability
Maintainability
Knowledge reuse
Exploit repeated structure (representation, inference, learning)
Construct only as much of model as needed for query
Hypothesis Management
• Constructed BN rapidly becomes intractable,
especially in presence of existence and
association uncertainty
• What do we really need to represent?
• Heuristics help to avoid constructing (or prune)
very unlikely hypotheses (or variables with very
weak impact on conclusions)
– E.g., from only “John went to the airport” do not
nominate hypothesis that John intends to set off a
bomb
– But a security system needs to be on the alert for
prospective bombers!
30
Lifted Inference
• Constructed BN (propositionalized theory)
typically contains repeated structure
• Applying standard BN inference often results in
many repetitions of the identical computation
• Lifted inference algorithms detect such
repetitions
– “Lift” problem from ground to first-order level
– Perform computation only once
• Very active area of research
(Braz, et al., 2005)
31
Learning = Inference
… in theory, at least
(inst xi liquor-shopping)
(inst yj liquor-store)
(= (store-of xi) yj)
i=1,…,N
j=1,…,M
Plate model for parameter learning of
store-of local distribution
32
Representing Temporal Evolution
• Plans evolve in time
• HMM / DBN / PDBN replicate variables describing
temporally evolving situation
Position(k-1)
Position(k)
Mission
Destination
Report(k-1)
Shape
Report(k)
Hidden Markov Model (HMM)
unobservable evolving state + observable indicator
Position(k-1)
Position(k-1)
Position(k)
Position(k)
ApparentShape(k-1)
ApparentShape(k-1)
ApparentShape(k)
PosReport(k-1)
PosReport(k-1)
PosReport(k)
PosReport(k)
ShpReport(k-1)
ShpReport(k-1)
33
ApparentShape(k)
ShpReport(k-1)
ShpReport(k-1)
Dynamic Bayesian Network (DBN)
factored representation of state / observable
Partially Dynamic Bayesian Network (PDBN)
some variables not time-dependent
DBN Inference
• Any BN inference algorithm can be applied to a finitehorizon DBN
• Special-case inference algorithms exploit DBN structure
– “Rollup” algorithm marginalizes out past hidden states given past
observations to explicitly represent only a sliding window
– Viterbi algorithm finds most probable values of hidden states given
observations
– Forward-backward algorithm estimates marginal distributions for
hidden states given observations
• Exact inference is generally intractable
– Factored frontier algorithm approximates marginalization of past
hidden state for intractable DBNs
– Particle filter is a temporal variant of likelihood weighting with
resampling
34
• Beware of static nodes!
Particle Filter
initialization
Likelihood
weighting
Resampling
Resampling
Evolution
Likelihood
weighting
•
•
•
35
•
Maintains sample of weighted particles
Each particle is a single realization of all non-evidence nodes
Particle is weighted by likelihood of observation given particle
Particles are resampled with probability proportional to weight
From van der Merwe et al. (undated)
Particle Impoverishment
• Particles with large weights are sampled more
often, leading to low particle diversity
• This effect is counteracted by “spreading” effects
of process noise
• Impoverishment is very serious when:
– Observations are extremely unlikely
– Low “process noise” leads to long dwell times in
widely separated basins of attraction
• “In fact, for the case of very small process noise, all particles
will collapse to a single point within a few iterations.”
• “If the process noise is zero, then using a particle filter is not
entirely appropriate.” (Arulampulam et al., 2002)
36
Particle Filter with Static Nodes
• PF cannot recover from impoverishment of static node
• Some approaches:
– Estimate separate PF for each combination of static node
• Only if static node has small state space
– Regularized PF - artificial evolution of static node
• Ad hoc; no justification for amount of perturbation;
information loss over time
– Shrinkage (Liu & West)
• Combines ideas from artificial evolution & kernel smoothing
• Perturbation “shrinks” static node for each particle
toward weighted sample mean
– Perturbation holds variance of set of particles constant
– Correlation in disturbances compensates for information loss


  
X


– Resample-Move (Gilks & Berzuini)
37
• Metropolis-Hastings step corrects for particle impoverishment
• MH sampling of static node involves entire trajectory but is performed
less frequently as runs become longer



Stochastic Grammars
• Motivation: find representation that is sufficiently
expressive for plan recognition but more
tractable than general DBN inference
• A stochastic grammar is a set of stochastic
production rules for generating sequences of
actions (terminal symbols in the grammar)
• Modularity of production rules yields factored
joint distribution
38
Stochastic Grammar - Example
• Taken from Geib and
Goldman (2009)
• Plans are represented as
and/or tree with temporal
constraints
39
Stochastic Grammar - Inference
• Parsing algorithms can be applied to compute
restricted class of queries
– If plans can be represented in a given formalism then that
formalism’s inference algorithms can be applied to process
queries
– We are often interested in a broader class of queries than
traditional parsing algorithms can handle (e.g., we usually
have not observed all actions)
• Parse tree can be converted to DBN
– Enables answering a broader class of queries
– Can exploit structure of grammar to improve tractability of
inference
• Special-purpose algorithms exploit grammar structure
40
Where Do We Stand?
• Contributions of probabilistic methods
–
–
–
–
–
Useful way of thinking about problems
Unified approach to reasoning, parameter learning, structure learning
Principled combination of KE with learning
Can learn from small, moderate and large samples
Many general-purpose exact and approximate algorithms with strong
theoretical justification and practical success
– Good results (better than previous state of the art) on many interesting
problems
• Many challenging problems remain
– Exact learning and inference are intractable
– High-dimensional multi-modal distributions are just plain ugly
• All inference algorithms break down on the toughest cases
– Asymptotics doesn’t mean much when the long run is millions of years!
– With good engineering backed by solid theory, we will continue to make
progress
41
Bibliography (1 of 2)
Arulampalam, M. Maskell, S., Gordon, N. and Clapp, T. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian
Bayesian Tracking, IEEE Transactions on Signal Processing, 50 , pp. 174–188, 2002.
Bidyuk, B. and Dechter, R. "Cutset Sampling for Bayesian Networks", Journal of Artificial Intelligence Research 28, pages 148, 2007.
Braz, R., Amir, E. and Roth, D. Lifted First-Order Probabilistic Inference. Proceedings of the International Joint Conference on
Artificial Intelligence, 2005.
Bui, H., Venkatesh, S. and West, G. "Policy Recognition in the Abstract Hidden Markov Model", Artificial Intelligence, Journal
of Artificial Intelligence Research, Volume 17, pages 451-499, 2002.
Charniak, E. and Goldman, R. A Bayesian Model of Plan Recognition. Artificial Intelligence, 64: 53-79, 1993.
Charniak, E. and Goldman, R. A Probabilistic Model of Plan Recognition. Proceedings of the Ninth Conference on Artificial
Intelligence 1991.
Darwiche, A. Modeling and Reasoning with Bayesian Networks. Cambridge University Press. 2009.
Gaifman, H. Concerning measures in First-Order calculi. Israel Journal of Mathematics, 2, 1–18, 1964.
Geib, C.W. and Goldman, R.P. A Probabilistic Plan Recognition Algorithm Based on Plan Tree Grammars. Artificial
Intelligence 173, pp. 1101–1132, 2009.
Gilks, W.R. and Berzuini, C. Following a Moving Target—Monte Carlo Inference for Dynamic Bayesian Models,” Journal of the
Royal Statistical Society B, 63, pp. 127–146, 2001.
Hsu, J., Lian, C., and Jih, W. Probabilistic Models for Concurrent Chatting Activity Recognition. ACM Transactions on
Intelligent Systems and Technology, Vol. 2, No. 1, 2011.
Jensen, F., Bayesian Networks and Decision Graphs (2nd edition). Springer, 2007.
Korb, K. and Nicholson, A. Bayesian Artificial Intelligence. Chapman and Hall, 2003.
Koller, D., Friedman, N. Probabilistic Graphical Models. MIT Press, 2009.
Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling
sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001.
Laskey, K.B., MEBN: A Language for First-Order Bayesian Knowledge Bases, Artificial Intelligence, 172(2-3): 140-178, 2008
42
Bibliography (2 of 2)
Liao, L. Patterson, D. J. Fox, D. and Kautz, H. Learning and Inferring Transportation Routines. Artificial Intelligence, 2007.
Liao, L., Fox, D., AND Kautz, H. Hierarchical Conditional Random Fields for GPS-based Activity Recognition. In Springer
Tracts in Advanced Robotics. Springer, 2007.
Liu, J. and West, M., Combined Parameter and State Estimation in Simulation-Based Filtering,” in Sequential Monte Carlo
Methods in Practice, A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds. New York: Springer-Verlag, 2001.
Musso, C. Oudjane, N and LeGland, F. Improving Regularised Particle Filters, in Sequential Monte Carlo Methods in Practice,
A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds. New York: Springer-Verlag, 2001.
Neapolitan, R. Learning Bayesian Networks. Prentice Hall, 2003.
Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
Pynadath, D.V. and Wellman, M.P. Probabilistic State-Dependent Grammars for Plan Recognition. Proceedings of the
Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
Pynadath, D. V. and Wellman, M. P. Generalized queries on probabilistic context-free grammars. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(1):65–77, 1998.
Richardson, M. and Domingos, P., Markov Logic Networks. Machine Learning, 62, 107-136, 2006.
Schmidt, C., Sridharan, N., and Goodson, J., The plan recognition problem: An Intersection of psychology and Artificial
Intelligence, Artificial Intelligence 11 pp. 45–83, 1978.
van der Merwe, R., Doucet, A., de Freitas, N. and Wan, E. The Unscented Particle Filter, Adv. Neural Inform. Process. Syst.
2000.
van der Merwe, R., Doucet, A., de Freitas, N. and Wan, E. (undated) “The unscented particle filter,”
http://cslu.cse.ogi.edu/publications/ps/UPF_CSLU_talk.pdf
Wellman, M.P., J.S. Breese, and R.P. Goldman (1992) From knowledge bases to decision models. The Knowledge
Engineering Review, 7(1):35-53.
43