UIUC CS 497: Section EA
Lecture #7
Reasoning in Artificial Intelligence
Professor: Eyal Amir
Spring Semester 2004
(Based on slides by Gal Elidan (Hebrew U))
Last Time
• Probabilistic graphical models
• Exact reasoning
– Variable elimination
– Junction tree algorithm
• Applications (of Bayes Networks):
– Sensor networks, medical diagnosis,
computer diagnosis (MS Windows),
classification
Approximate Inference
• Large treewidth
– Large, highly connected graphical models
– Treewidth may be large (>40) even in sparse networks
• In many applications, approximations are sufficient
– Example: P(X = x|e) = 0.3183098861838
– Maybe P(X = x|e) ≈ 0.3 is a good enough approximation
– e.g., we take action only if P(X = x|e) > 0.5
Today: Approximate reasoning via sampling
1. Monte Carlo techniques
   1. Rejection sampling
   2. Likelihood weighting
   3. Importance sampling
2. Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings
3. Applications du jour: ?
Types of Approximations
Absolute error
• An estimate q of P(X = x | e) has absolute error ε if
P(X = x|e) − ε ≤ q ≤ P(X = x|e) + ε
equivalently
q − ε ≤ P(X = x|e) ≤ q + ε
[Figure: the interval [q − ε, q + ε], of width 2ε around q, on the probability axis from 0 to 1]
• Not always what we want: an error of 0.001 is
– Unacceptable if P(X = x | e) = 0.0001
– Overly precise if P(X = x | e) = 0.3
Types of Approximations
Relative error
• An estimate q of P(X = x | e) has relative error ε if
P(X = x|e)(1 − ε) ≤ q ≤ P(X = x|e)(1 + ε)
equivalently
q/(1+ε) ≤ P(X = x|e) ≤ q/(1−ε)
[Figure: the interval [q/(1+ε), q/(1−ε)] around q on the probability axis from 0 to 1]
• Sensitivity of the approximation depends on the actual value of the desired result
Complexity
• Recall, exact inference is NP-hard
• Is approximate inference any easier?
• Construction for exact inference:
– Input: a 3-SAT formula φ
– Output: a BN such that P(X = t) > 0 iff φ is satisfiable
Complexity: Relative Error
• Suppose that q is an ε-relative error estimate of P(X = t)
• If φ is not satisfiable, then
0 = P(X = t)(1 − ε) ≤ q ≤ P(X = t)(1 + ε) = 0
Thus, if q > 0, then φ is satisfiable
An immediate consequence:
Thm: Given φ, finding an ε-relative error approximation is NP-hard
Complexity: Absolute Error
• Thm: If ε < 0.5, then finding an estimate of P(X = x|e) with ε absolute error is NP-hard
Search Algorithms
Idea: search for high probability instances
• Suppose x[1], …, x[N] are instances with high mass
• We can approximate:
P(Y = y | e) ≈ Σ_i P(Y = y, e | x[i]) P(x[i]) / Σ_i P(e | x[i]) P(x[i])
• If x[i] is a complete instantiation, then P(e | x[i]) is 0 or 1
Search Algorithms (cont)
P(Y = y | e) ≈ Σ_i P(Y = y, e | x[i]) P(x[i]) / Σ_i P(e | x[i]) P(x[i])
• Instances that do not satisfy e do not play a role in the approximation
• We need to focus the search on instances that do satisfy e
• Clearly, in some cases this is hard (e.g., the construction from our NP-hardness result)
Stochastic Simulation
• Suppose we can sample instances
<x1,…,xn> according to P(X1,…,Xn)
• What is the probability that a random
sample <x1,…,xn> satisfies e?
– This is exactly P(e)
• We can view each sample as tossing a
biased coin with probability P(e) of “Heads”
Stochastic Sampling
• Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate
P(e) ≈ #Heads / N = Σ_i P(e | x[i]) / N
• The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability
• How many samples do we need to get a reliable estimate?
Use Chernoff's bound for binomial distributions
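One standard form of that bound (a Hoeffding/Chernoff-style statement; the slide does not spell it out): for the estimate p̂ = #Heads/N,

```latex
\Pr\bigl(\,|\hat{p} - P(e)| \ge \epsilon\,\bigr) \;\le\; 2e^{-2N\epsilon^{2}}
\qquad\Longrightarrow\qquad
N \;\ge\; \frac{\ln(2/\delta)}{2\epsilon^{2}}
```

samples suffice for absolute error ε with probability at least 1 − δ.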
Sampling a Bayesian Network
• If P(X1,…,Xn) is represented by a Bayesian
network, can we efficiently sample from it?
• Idea: sample according to structure of the
network
– Write distribution using the chain rule, and
then sample each variable given its parents
Logic sampling
[Figure, repeated over five animation frames: the alarm network — Burglary and Earthquake are root nodes, Alarm has parents B and E, Call has parent A, Radio has parent E.
CPTs: P(b) = 0.03; P(e) = 0.001; P(a|b,e) = 0.98, P(a|b,¬e) = 0.7, P(a|¬b,e) = 0.4, P(a|¬b,¬e) = 0.01; P(c|a) = 0.8, P(c|¬a) = 0.05; P(r|e) = 0.3, P(r|¬e) = 0.001]
• The sample row for B, E, A, C, R is filled in one variable per step, in topological order:
b → b,e → b,e,a → b,e,a,c → b,e,a,c,r
Logic Sampling
• Let X1, …, Xn be an order of the variables consistent with arc direction
• for i = 1, …, n do
– sample xi from P(Xi | pai)
– (Note: since Pai ⊆ {X1,…,Xi−1}, we have already assigned values to them)
• return x1, …, xn
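A minimal Python sketch of this procedure, using the alarm network and CPTs from the figure above; the dictionary encoding and function names are my own illustration, not from the slides.

```python
import random

# CPTs reconstructed from the figure: each maps a tuple of parent values
# (in the listed parent order) to P(variable = True | parents).
cpts = {
    "B": {(): 0.03},
    "E": {(): 0.001},
    "A": {(True, True): 0.98, (True, False): 0.7,
          (False, True): 0.4, (False, False): 0.01},  # parents (B, E)
    "C": {(True,): 0.8, (False,): 0.05},              # parent (A,)
    "R": {(True,): 0.3, (False,): 0.001},             # parent (E,)
}
parents = {"B": (), "E": (), "A": ("B", "E"), "C": ("A",), "R": ("E",)}
order = ["B", "E", "A", "C", "R"]  # topological, consistent with arc direction

def logic_sample():
    """Sample a complete instantiation, each variable given its parents."""
    x = {}
    for v in order:
        pa = tuple(x[p] for p in parents[v])  # parents are already assigned
        x[v] = random.random() < cpts[v][pa]
    return x

# Estimate P(e) as the fraction of samples satisfying e, e.g. e = {C = true}:
N = 100_000
print(sum(logic_sample()["C"] for _ in range(N)) / N)
```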
Logic Sampling
• Sampling a complete instance is linear in the number of variables
– Regardless of the structure of the network
• However, if P(e) is small, we need many samples to get a decent estimate
Can we sample from P(X1,…,Xn | e)?
• If evidence is at the roots of the network, easily
• If evidence is at the leaves of the network, we have a problem
– Our sampling method proceeds according to the order of nodes in the graph
• Note: we can use arc reversal to make evidence nodes roots
– In some networks, however, this will create exponentially large tables...
Likelihood Weighting
• Can we ensure that all of our samples satisfy e?
• One simple solution:
– When we need to sample a variable that is assigned a value by e, use the specified value
• For example, in the network X → Y, suppose we know Y = 1
– Sample X from P(X)
– Then take Y = 1
• Is this a sample from P(X,Y | Y = 1)?
Likelihood Weighting
• Problem: in the network X → Y, these are samples of X from the prior P(X), not from P(X | Y = 1)
• Solution:
– Penalize samples in which P(Y = 1|X) is small
• We now sample as follows:
– Let x[i] be a sample from P(X)
– Let w[i] = P(Y = 1 | X = x[i])
P(X = x | Y = 1) ≈ Σ_i w[i] P(X = x | x[i]) / Σ_i w[i]
Likelihood Weighting
• Why does this make sense?
• When N is large, we expect to sample N·P(X = x) samples with x[i] = x
• Thus,
Σ_{i: x[i] = x} w[i] ≈ N·P(X = x)·P(Y = 1 | X = x) = N·P(X = x, Y = 1)
• When we normalize, we get an approximation of the conditional probability
Likelihood Weighting
[Figure, repeated over five animation frames: the same alarm network and CPTs as before, now with evidence on Alarm (= a) and Radio (= r); the samples table gains a Weight column]
• The sample row for B, E, A, C, R is filled in as in logic sampling, except that the evidence variables A and R are fixed to their observed values, and each fixed variable multiplies its CPT entry into the weight:
b,e → b,e,a (weight 0.6) → b,e,a,c (weight 0.6) → b,e,a,c,r (weight 0.6 · 0.3)
Likelihood Weighting
• Let X1, …, Xn be an order of the variables consistent with arc direction
• w = 1
• for i = 1, …, n do
– if Xi = xi has been observed
• w ← w · P(Xi = xi | pai)
– else
• sample xi from P(Xi | pai)
• return x1, …, xn, and w
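A sketch of this procedure in Python, reusing the `cpts`, `parents`, and `order` encoding from the logic-sampling sketch above; the query helper is my own illustration.

```python
import random

def weighted_sample(evidence):
    """One likelihood-weighted sample: evidence variables are fixed to
    their observed values, and each contributes its CPT entry to w."""
    x, w = {}, 1.0
    for v in order:
        pa = tuple(x[p] for p in parents[v])
        p_true = cpts[v][pa]
        if v in evidence:
            x[v] = evidence[v]
            w *= p_true if evidence[v] else 1.0 - p_true
        else:
            x[v] = random.random() < p_true
    return x, w

def lw_query(var, val, evidence, n=100_000):
    """Estimate P(var = val | evidence) as a weight-normalized fraction."""
    num = den = 0.0
    for _ in range(n):
        x, w = weighted_sample(evidence)
        den += w
        num += w * (x[var] == val)
    return num / den

# e.g. P(B = true | C = true):
# print(lw_query("B", True, {"C": True}))
```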
Importance Sampling
• A method for evaluating the expectation of f under P(X), ⟨f⟩_P(X)
• Discrete: ⟨f⟩_P(X) = Σ_x f(x) P(x)
• Continuous: ⟨f⟩_P(X) = ∫ f(x) P(x) dx
• If we could sample from P:
⟨f⟩_P(X) ≈ (1/R) Σ_r f(x[r])
Importance Sampling
A general method for evaluating ⟨f⟩_P(X) when we cannot sample from P(X).
Idea: choose an approximating distribution Q(X) and sample from it
⟨f⟩_P(X) = ∫ f(x) P(x) dx = ∫ f(x) [P(x)/Q(x)] Q(x) dx = ⟨f(x) w(x)⟩_Q(X),  where w(x) = P(x)/Q(x)
Using this we can now sample from Q and then
⟨f⟩_P(X) ≈ (1/M) Σ_{m=1}^M f(x[m]) w(m)
(if we could generate samples from P(X), we would average f(x[m]) directly; now that we generate the samples from Q(X), each sample is weighted by w(m) = P(x[m])/Q(x[m]))
(Unnormalized) Importance Sampling
1. For m = 1:M
– Sample x[m] from Q(X)
– Calculate w(m) = P(x[m]) / Q(x[m])
2. Estimate the expectation of f(X) using
⟨f(x)⟩_P(X) ≈ (1/M) Σ_{m=1}^M f(x[m]) w(m)
Requirements:
• P(X) > 0 ⇒ Q(X) > 0 (don't ignore possible scenarios)
• Possible to calculate P(x), Q(x) for a specific X = x
• It is possible to sample from Q(X)
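A one-dimensional sketch of the unnormalized estimator; the particular P (standard normal), Q (uniform on [−5, 5]), and f are arbitrary choices for illustration.

```python
import math
import random

P = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # target density
Q = lambda x: 0.1        # uniform proposal density on [-5, 5]
f = lambda x: x * x      # estimate E_P[X^2] = 1

# (P's mass outside [-5, 5] is negligible, so the support requirement
# above holds for all practical purposes in this illustration.)
M = 200_000
total = 0.0
for _ in range(M):
    x = random.uniform(-5, 5)        # sample from Q
    total += f(x) * P(x) / Q(x)      # weight w(m) = P(x)/Q(x)
print(total / M)                     # should be close to 1
```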
Normalized Importance Sampling
Assume that we cannot evaluate P(X = x) but can evaluate P'(X = x) = αP(X = x)
(for example, we can evaluate P(X) but not P(X | e) in a Bayesian network)
We define w'(X) = P'(X)/Q(X). We can then evaluate:
⟨w'(X)⟩_Q(X) = Σ_x Q(x) P'(x)/Q(x) = Σ_x P'(x) = α
and then:
⟨f⟩_P(X) = ∫ f(x) P(x) dx = (1/α) ∫ f(x) [P'(x)/Q(x)] Q(x) dx = (1/α) ⟨f(X) w'(X)⟩_Q(X)
= ⟨f(X) w'(X)⟩_Q(X) / ⟨w'(X)⟩_Q(X)
In the last step we simply replace α with the above equation
Normalized Importance Sampling
We can now estimate the expectation of f(X) similarly to unnormalized importance sampling,
by sampling x[m] from Q(X) and then using
⟨f(x)⟩_P(X) ≈ Σ_{m=1}^M f(x[m]) w'(m) / Σ_{m=1}^M w'(m)
(hence the name "normalized")
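The normalized estimator in the same illustrative setting, now assuming we can only evaluate an unnormalized P'(x) = αP(x); here P' simply drops the Gaussian normalizing constant.

```python
import math
import random

Pprime = lambda x: math.exp(-x * x / 2)  # unnormalized target, alpha unknown
Q = lambda x: 0.1                        # uniform proposal on [-5, 5]
f = lambda x: x * x

M = 200_000
num = den = 0.0
for _ in range(M):
    x = random.uniform(-5, 5)
    w = Pprime(x) / Q(x)   # w'(m): unnormalized importance weight
    num += f(x) * w
    den += w               # the sum of weights estimates M * alpha
print(num / den)           # ~ E_P[X^2] = 1 (hence "normalized")
```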
Importance Sampling Weaknesses
• Important to choose a sampling distribution with heavy tails
– So as not to "miss" large values of f
• Many-dimensional importance sampling:
– The "typical set" of P may take a long time to find, unless Q is a good approximation to P
– Weights vary by factors exponential in N
• Similar problems arise for likelihood weighting
Today: Approximate Reasoning via Sampling
1. Monte Carlo techniques
   1. Rejection sampling
   2. Likelihood weighting
   3. Importance sampling
2. Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings
3. Applications du jour: ?
Stochastic Sampling
• Previously: independent samples to estimate P(X = x | e)
• Problem: difficult to sample from P(X1,…,Xn | e)
• We had to use likelihood weighting
– Introduces bias in the estimation
• In some cases, such as when the evidence is on leaves, these methods are inefficient
MCMC Methods
• Sampling methods that are based on Markov chains
– Markov Chain Monte Carlo (MCMC) methods
• Key ideas:
– The sampling process is a Markov chain
• The next sample depends on the previous one
– Can approximate any posterior distribution
• Next: review the theory of Markov chains
Markov Chains
• Suppose X1, X2, … take values in some set
– wlog, these values are 1, 2, ...
• A Markov chain is a process that corresponds to the network:
X1 → X2 → X3 → … → Xn → …
• To quantify the chain, we need to specify
– Initial probability: P(X1)
– Transition probability: P(Xt+1|Xt)
• A Markov chain has stationary transition probabilities if P(Xt+1|Xt) is the same for all times t
Irreducible Chains
• A state j is accessible from state i if there is an n such that P(Xn = j | X1 = i) > 0
– There is a positive probability of reaching j from i after some number of steps
• A chain is irreducible if every state is accessible from every state
Ergodic Chains
• A state i is positively recurrent if there is a finite expected time to get back to state i after being in state i
– If X has a finite number of states, it suffices that i is accessible from itself
• A chain is ergodic if it is irreducible and every state is positively recurrent
(A)periodic Chains
• A state i is periodic if there is an integer d > 1 such that P(Xn = i | X1 = i) = 0 whenever n is not divisible by d
• Intuition: state i may occur only every d steps
• A chain is aperiodic if it contains no periodic state
Stationary Probabilities
Thm:
• If a chain is ergodic and aperiodic, then the limit
lim_{n→∞} P(Xn | X1 = i)
exists and does not depend on i
• Moreover, let
P*(X = j) = lim_{n→∞} P(Xn = j | X1 = i);
then P*(X) is the unique probability satisfying
P*(X = j) = Σ_i P(Xt+1 = j | Xt = i) P*(X = i)
Stationary Probabilities
• The probability P*(X) is the stationary
probability of the process
• Regardless of the starting point, the
process will converge to this probability
• The rate of convergence depends on
properties of the transition probability
Sampling from the stationary
probability
• This theory suggests how to sample from
the stationary probability:
– Set X1 = i, for some random/arbitrary i
– For t = 1, 2, …, n
• Sample a value xt+1 for Xt+1 from
P(Xt+1|Xt=xt)
– return xn
• If n is large enough, then this is a sample
from P*(X)
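A tiny sketch of this recipe for a two-state chain with a known transition matrix; the chain, its stationary distribution (2/3, 1/3), and the choice of n are illustrative.

```python
import random

# Transition probabilities P(X_{t+1} | X_t) for states {0, 1};
# the stationary distribution of this particular chain is (2/3, 1/3).
T = [[0.9, 0.1],
     [0.2, 0.8]]

def sample_stationary(n=1000):
    x = random.randrange(2)                        # arbitrary start X1
    for _ in range(n):                             # simulate n transitions
        x = 0 if random.random() < T[x][0] else 1
    return x                                       # ~ P*(X) when n is large

draws = [sample_stationary() for _ in range(10_000)]
print(sum(d == 0 for d in draws) / len(draws))     # should be close to 2/3
```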
Designing Markov Chains
• How do we construct the right chain to sample from?
– Ensuring aperiodicity and irreducibility is usually easy
• The problem is ensuring the desired stationary probability
Designing Markov Chains
Key tool:
• If the transition probability satisfies
P(Xt+1 = j | Xt = i) / P(Xt+1 = i | Xt = j) = Q(X = j) / Q(X = i)   whenever P(Xt+1 = j | Xt = i) > 0
then P*(X) = Q(X)
• This gives a local criterion for checking that the chain will have the right stationary distribution
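To see why this local (detailed balance) criterion forces the right stationary distribution, one summation suffices; this step is not spelled out on the slide:

```latex
% Detailed balance: P(X_{t+1}=j \mid X_t=i)\,Q(i) = P(X_{t+1}=i \mid X_t=j)\,Q(j).
% Summing both sides over i:
\sum_i P(X_{t+1}=j \mid X_t=i)\,Q(i)
  \;=\; Q(j)\sum_i P(X_{t+1}=i \mid X_t=j)
  \;=\; Q(j)
```

so Q satisfies the stationarity equation above and, by uniqueness of P*, P*(X) = Q(X).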
MCMC Methods
• We can use these results to sample from
P(X1,…,Xn|e)
Idea:
• Construct an ergodic & aperiodic Markov
Chain such that
P*(X1,…,Xn) = P(X1,…,Xn|e)
• Simulate the chain n steps to get a sample
MCMC Methods
Notes:
• The Markov chain variable Y takes as values assignments to all variables that are consistent with the evidence
V(Y) = { (x1,…,xn) ∈ V(X1) × … × V(Xn) | x1,…,xn satisfy e }
• For simplicity, we will denote such a state using the vector of variables
Gibbs Sampler
• One of the simplest MCMC methods
• Each transition changes the state of one Xi
• The transition probability is defined by P itself, as a stochastic procedure:
– Input: a state x1,…,xn
– Choose i at random (uniform probability)
– Sample x'i from P(Xi | x1,…,xi−1, xi+1,…,xn, e)
– Let x'j = xj for all j ≠ i
– Return x'1,…,x'n
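A generic Python sketch of this transition for a small discrete model; computing the conditional by renormalizing an (unnormalized) joint is my own brute-force illustration — in a BN one uses the Markov-blanket formula on the later slides.

```python
import random

def gibbs_step(x, domains, joint):
    """One Gibbs transition: pick a coordinate i uniformly at random and
    resample it from its full conditional, obtained by renormalizing the
    (possibly unnormalized) joint over the candidate values of X_i."""
    i = random.randrange(len(x))
    weights = []
    for v in domains[i]:
        y = list(x)
        y[i] = v
        weights.append(joint(tuple(y)))
    x2 = list(x)
    x2[i] = random.choices(domains[i], weights=weights)[0]
    return tuple(x2)

# Example: two binary variables, unnormalized joint in ratio 4:2:2:1
joint = lambda x: {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}[x]
x, counts = (0, 0), {}
for _ in range(90_000):
    x = gibbs_step(x, [[0, 1], [0, 1]], joint)
    counts[x] = counts.get(x, 0) + 1
print({k: round(v / 90_000, 3) for k, v in sorted(counts.items())})
# expected roughly (0,0): 0.444, (0,1): 0.222, (1,0): 0.222, (1,1): 0.111
```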
Correctness of Gibbs Sampler
• How do we show correctness?
Correctness of Gibbs Sampler
• By the chain rule
P(x1,…,xi−1, xi, xi+1,…,xn | e) = P(x1,…,xi−1, xi+1,…,xn | e) P(xi | x1,…,xi−1, xi+1,…,xn, e)
• Thus, we get that the ratio
P(x1,…,xi−1, xi, xi+1,…,xn | e) / P(x1,…,xi−1, x'i, xi+1,…,xn | e)
equals the ratio of the transition probabilities
P(xi | x1,…,xi−1, xi+1,…,xn, e) / P(x'i | x1,…,xi−1, xi+1,…,xn, e)
• Since we choose i from the same distribution at each stage, this procedure satisfies the ratio criterion
Gibbs Sampling for Bayesian Networks
• Why is the Gibbs sampler "easy" in BNs?
• Recall that the Markov blanket of a variable separates it from the other variables in the network
– P(Xi | X1,…,Xi−1, Xi+1,…,Xn) = P(Xi | Mbi)
• This property allows us to use local computations to perform sampling in each transition
Gibbs Sampling in Bayesian Networks
• How do we evaluate P(Xi | x1,…,xi−1, xi+1,…,xn)?
• Let Y1, …, Yk be the children of Xi
– By definition of Mbi, the parents of each Yj are in Mbi ∪ {Xi}
• It is easy to show that
P(xi | Mbi) = P(xi | Pai) Π_j P(yj | pa_{yj}) / Σ_{x'i} [ P(x'i | Pai) Π_j P(yj | pa_{yj}) ]
(in the denominator, pa_{yj} uses x'i as the value of Xi)
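A sketch of this local computation using the `cpts`/`parents` encoding from the logic-sampling example earlier; the helper name is hypothetical, not from the slides.

```python
import math

def gibbs_conditional(var, x, cpts, parents):
    """P(var = True | Mb(var)) by the formula above: for each value of var,
    multiply var's own CPT entry by the CPT entries of var's children,
    then renormalize over the two candidate values."""
    children = [c for c, ps in parents.items() if var in ps]
    weights = []
    for val in (True, False):
        y = dict(x)
        y[var] = val
        def entry(v):  # CPT entry for v's value in y, given v's parents in y
            p_true = cpts[v][tuple(y[p] for p in parents[v])]
            return p_true if y[v] else 1.0 - p_true
        weights.append(entry(var) * math.prod(entry(c) for c in children))
    return weights[0] / (weights[0] + weights[1])
```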
Metropolis-Hastings
• More general than Gibbs (Gibbs is a special case of M-H)
• Proposal distribution: an arbitrary q(x'|x) that makes the chain ergodic and aperiodic (e.g., uniform)
• Transition to x' happens with probability
α(x'|x) = min(1, P(x')q(x|x') / P(x)q(x'|x))
• Useful when computing P(x) exactly is infeasible: only the ratio P(x')/P(x) is needed, so an unnormalized P suffices
• Requirement: q(x'|x) = 0 implies P(x') = 0 or q(x|x') = 0
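A sketch of the M-H step for a discrete target known only up to a normalizing constant; the triangular target and the symmetric random-walk proposal are illustrative choices (with a symmetric q, the q terms cancel).

```python
import random

p = lambda x: max(0.0, 10.0 - abs(x))   # unnormalized target on -9..9

def mh_step(x):
    x2 = x + random.choice([-1, 1])     # symmetric proposal: q(x'|x) = q(x|x')
    # Accept with probability min(1, p(x')q(x|x') / (p(x)q(x'|x))).
    if random.random() < min(1.0, p(x2) / p(x)):
        return x2
    return x                            # otherwise stay at x

x, samples = 0, []
for t in range(100_000):
    x = mh_step(x)
    if t >= 1_000:                      # discard a burn-in prefix
        samples.append(x)
print(sum(samples) / len(samples))      # ~ 0, the mean of the target
```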
Sampling Strategy
• How do we collect the samples?
Strategy I:
• Run the chain M times, each for N steps
– Each run starts from a different starting point
• Return the last state of each run
[Figure: M independent chains]
Sampling Strategy
Strategy II:
• Run one chain for a long time
• After some "burn-in" period, sample a point every fixed number of steps
[Figure: one chain with a "burn-in" prefix, yielding M samples]
Comparing Strategies
Strategy I:
– Better chance of "covering" the space of points, especially if the chain is slow to reach stationarity
– Have to perform "burn-in" steps for each chain
Strategy II:
– Perform "burn-in" only once
– Samples might be correlated (although only weakly)
Hybrid strategy:
– Run several chains, sample a few times from each
– Combines benefits of both strategies
Summary: Approximate Inference
• Monte Carlo (sampling with positive and negative error) methods:
– Pro: simplicity of implementation and theoretical guarantees of convergence
– Con: can be slow to converge, and convergence can be hard to diagnose
• Variational methods – your presentation
• Loopy belief propagation and generalized belief propagation – your presentation
Next Time
• Combining Probabilities with Relations and
Objects
THE END
Example: Naïve Bayesian Model
• A common model in early diagnosis:
– Symptoms are conditionally independent given the disease (or fault)
• Thus, if
– X1,…,Xp denote the symptoms exhibited by the patient (headache, high fever, etc.) and
– H denotes the hypothesis about the patient's health
• then, P(X1,…,Xp,H) = P(H) Π_{i=1}^p P(Xi | H)
Elimination on Trees
• Formally, for any tree, there is an elimination ordering with induced width = 1
Thm:
• Inference on trees is linear in the number of variables
Importance Sampling to LW
We want to compute P(Y = y | e) (X is the set of random variables in the network and Y is some subset we are interested in)
1) Define a mutilated Bayesian network B_{Z=z} to be a network where:
• all variables in Z are disconnected from their parents and are deterministically set to z
• all other variables remain unchanged
2) Choose Q to be B_{E=e}
(convince yourself that P'(x)/Q(x) is exactly the likelihood weight of the sample x)
3) Choose f(x) to be 1(Y[m] = y)/M
A Word of Caution
• Deterministic nodes:
– The chain is not ergodic in the simple sense
– M-H cannot be used