Probabilistic Inference
CS B553: ALGORITHMS FOR
OPTIMIZATION AND LEARNING
Continuous Probability Distributions and
Bayesian Networks with Continuous Variables
AGENDA
Continuous probability distributions
Common families:
The Gaussian distribution
Linear Gaussian Bayesian networks
CONTINUOUS PROBABILITY DISTRIBUTIONS
Let X be a random variable in R, P(X) be a
probability distribution over X
P(x) ≥ 0 for all x, “sums to 1”
Challenge: (most of the time) P(X=x) = 0 for any x
CDF AND PDF
Probability density function (pdf) f(x)
Nonnegative, ∫ f(x) dx = 1
Cumulative distribution function (cdf) g(x)
g(x) = P(X ≤ x)
g(-∞) = 0, g(∞) = 1, g(x) = ∫(-∞,x] f(y) dy, monotonic
f(x) = g’(x)
[Figure: example pdf f(x) and cdf g(x)]
Both cdfs and pdfs are complete representations of the probability space over X, but usually pdfs are more intuitive to work with.
CAVEATS
pdfs may exceed 1
Deterministic values, or ones taking on a few discrete values, can be represented in terms of the Dirac delta function δa(x) as a pdf (an improper function)
δa(x) = 0 if x ≠ a
δa(x) = ∞ if x = a
∫ δa(x) dx = 1
COMMON DISTRIBUTIONS
[Figure: pdfs of two uniform distributions U(a1,b1) and U(a2,b2)]
Uniform distribution U(a,b)
p(x) = 1/(b-a) if x ∈ [a,b], 0 otherwise
P(X ≤ x) = 0 if x < a, (x-a)/(b-a) if x ∈ [a,b], 1 otherwise
Gaussian (normal) distribution N(μ,σ)
μ = mean, σ = standard deviation
p(x) = (1/(√(2π)·σ)) e^(-(x-μ)²/(2σ²))
P(X ≤ x) not in closed form:
(1 + erf(x/√2))/2 for N(0,1)
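A small numerical sketch of these pdf/cdf formulas (Python; the function names are my own, not from the slides), using math.erf for the standard normal cdf:

```python
import math

def uniform_pdf(x, a, b):
    # p(x) = 1/(b-a) on [a,b], 0 elsewhere
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    # P(X <= x) = 0 below a, (x-a)/(b-a) on [a,b], 1 above b
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def gaussian_pdf(x, mu, sigma):
    # p(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def gaussian_cdf(x, mu, sigma):
    # No closed form; expressed via the error function erf
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(gaussian_cdf(0.0, 0.0, 1.0))  # 0.5: half the mass lies below the mean
```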
MULTIVARIATE CONTINUOUS
DISTRIBUTIONS
Consider c.d.f. g(x,y) = P(X ≤ x, Y ≤ y)
g(-∞,y) = 0, g(x,-∞) = 0
g(∞,∞) = 1
g(x,∞) = P(X ≤ x), g(∞,y) = P(Y ≤ y)
g monotonic
Its joint density is given by the p.d.f. f(x,y) iff
g(p,q) = ∫(-∞,p] ∫(-∞,q] f(x,y) dy dx
i.e. P(ax ≤ X ≤ bx, ay ≤ Y ≤ by) = ∫[ax,bx] ∫[ay,by] f(x,y) dy dx
MARGINALIZATION WORKS OVER PDFS
Marginalizing f(x,y) over y:
If h(x) = ∫(-∞,∞) f(x,y) dy, then h(x) is the p.d.f. of X (whose c.d.f. is P(X ≤ x))
Proof:
P(X ≤ a) = P(X ≤ a, Y ≤ ∞) = g(a,∞) = ∫(-∞,a] ∫(-∞,∞) f(x,y) dy dx
h(a) = d/da P(X ≤ a)
= d/da ∫(-∞,a] ∫(-∞,∞) f(x,y) dy dx (definition)
= ∫(-∞,∞) f(a,y) dy (fundamental theorem of calculus)
So, the joint density contains all information needed to
reconstruct the density of each individual variable
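A quick numerical check of this fact (Python sketch; the joint density and grid are illustrative assumptions): integrating a 2-D joint density over y recovers the 1-D marginal density of X.

```python
import math

def joint_pdf(x, y):
    # Illustrative joint: independent standard normals, f(x,y) = N(x;0,1) * N(y;0,1)
    return math.exp(-0.5 * (x * x + y * y)) / (2 * math.pi)

def marginal_pdf_x(x, y_lo=-8.0, y_hi=8.0, n=4000):
    # h(x) = integral of f(x,y) dy, approximated with a midpoint Riemann sum
    dy = (y_hi - y_lo) / n
    return sum(joint_pdf(x, y_lo + (i + 0.5) * dy) for i in range(n)) * dy

# Compare against the exact marginal N(0,1) at x = 0.7
exact = math.exp(-0.5 * 0.7 ** 2) / math.sqrt(2 * math.pi)
print(marginal_pdf_x(0.7), exact)  # the two values should agree closely
```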
CONDITIONAL DENSITIES
We might want to represent the density
P(Y|X=x)… but how?
Consider pdf p(x,y), and consider taking the limit of
P(a ≤ Y ≤ b | x-ε ≤ X ≤ x+ε) as ε → 0
Naively, P(a ≤ Y ≤ b | X=x) = P(a ≤ Y ≤ b, X=x)/P(X=x), but the
denominator is 0!
P(a ≤ Y ≤ b | x-ε ≤ X ≤ x+ε) = [ ∫a^b ∫(x-ε)^(x+ε) p(x,y) dx dy ] / [ ∫(x-ε)^(x+ε) p(x) dx ]
≈ [ ∫a^b 2ε·p(x,y) dy ] / [ 2ε·p(x) ]
= ∫a^b p(x,y)/p(x) dy
So p(x,y)/p(x) is the conditional density
TRANSFORMATIONS OF CONTINUOUS
RANDOM VARIABLES
Suppose we want to compute the distribution of
f(X), where X is a random variable distributed
w.r.t. pX(x)
Assume f is monotonic and invertible
Consider Y=f(X) a random variable
P(Y ≤ y) = ∫ I[f(x) ≤ y] pX(x) dx
= ∫(-∞, f⁻¹(y)] pX(x) dx = P(X ≤ f⁻¹(y))
pY(y) = d/dy P(Y ≤ y)
= d/dy P(X ≤ f⁻¹(y))
= pX(f⁻¹(y)) · d/dy f⁻¹(y) by chain rule
= pX(f⁻¹(y)) / f’(f⁻¹(y)) by inverse function derivative
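A sanity check of this change-of-variables formula (Python sketch; the choice f(x) = eˣ with X ~ N(0,1) is an illustrative assumption): a Monte Carlo estimate of P(1 ≤ Y ≤ 2) should match numerical integration of the derived density pY.

```python
import math
import random

def px(x):
    # density of X ~ N(0,1)
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def py(y):
    # Y = f(X) with f(x) = exp(x): f^-1(y) = log(y), f'(x) = exp(x)
    # pY(y) = pX(f^-1(y)) / f'(f^-1(y)) = pX(log y) / y
    return px(math.log(y)) / y

random.seed(0)
samples = [math.exp(random.gauss(0.0, 1.0)) for _ in range(200000)]
mc = sum(1 for s in samples if 1.0 <= s <= 2.0) / len(samples)

n, lo, hi = 2000, 1.0, 2.0
dy = (hi - lo) / n
quad = sum(py(lo + (i + 0.5) * dy) for i in range(n)) * dy
print(mc, quad)  # the two estimates should agree to a couple of decimal places
```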
NOTES:
In general, continuous multivariate distributions
are hard to handle exactly
But, there are specific classes that lead to
efficient exact inference techniques
In particular, Gaussians
Other distributions usually require resorting to
Monte Carlo approaches
MULTIVARIATE GAUSSIANS
Multivariate analog in N-D space: X ~ N(μ,Σ)
Mean (vector) μ, covariance (matrix) Σ
P(X=x) = (1/Z) e^(-½ (x-μ)ᵀ Σ⁻¹ (x-μ))
With Z = (2π)^(d/2) |Σ|^(1/2) a normalization factor
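A direct transcription of this density into code (Python/NumPy sketch; the 2-D mean and covariance are made-up example values):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x: exp(-1/2 (x-mu)^T Sigma^-1 (x-mu)) / Z."""
    d = len(mu)
    diff = x - mu
    Z = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^-1 (x-mu)
    return np.exp(-0.5 * quad) / Z

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -2.0]), mu, Sigma))  # density at the mean
```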
INDEPENDENCE IN GAUSSIANS
If X ~ N(μX,ΣX) and Y ~ N(μY,ΣY) are independent,
then
(X,Y) ~ N( [μX; μY], [ΣX, 0; 0, ΣY] )
Moreover, if X ~ N(μ,Σ), then Σij = 0 iff Xi and Xj are
independent
LINEAR TRANSFORMATIONS
Linear transformations of Gaussians
If X ~ N(μ,Σ) and Y = AX + b,
then Y ~ N(Aμ+b, AΣAᵀ)
In fact,
(X,Y) ~ N( [μ; Aμ+b], [Σ, ΣAᵀ; AΣ, AΣAᵀ] )
Consequence:
If X ~ N(μX,ΣX) and Y ~ N(μY,ΣY) are independent, and Z = X+Y,
then Z ~ N(μX+μY, ΣX+ΣY)
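An empirical check of the transformation rule Y ~ N(Aμ+b, AΣAᵀ) (Python/NumPy sketch; A, b, μ, Σ are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, -1.0],
              [0.5,  1.0]])
b = np.array([0.0, 3.0])

# Sample X ~ N(mu, Sigma) and push the samples through y = Ax + b
X = rng.multivariate_normal(mu, Sigma, size=200000)
Y = X @ A.T + b

print(Y.mean(axis=0), A @ mu + b)     # sample mean vs. A mu + b
print(np.cov(Y.T), A @ Sigma @ A.T)   # sample covariance vs. A Sigma A^T
```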
MARGINALIZATION AND CONDITIONING
If (X,Y) ~ N( [μX; μY], [ΣXX, ΣXY; ΣYX, ΣYY] ), then:
Marginalization
Summing out Y gives
X ~ N(μX, ΣXX)
Conditioning:
On observing Y=y, we have
X | Y=y ~ N(μX + ΣXY ΣYY⁻¹(y-μY), ΣXX - ΣXY ΣYY⁻¹ ΣYX)
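A sketch of the conditioning formula in code (Python/NumPy; the helper name and example numbers are my own):

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, y):
    """Parameters of X | Y=y for (X,Y) jointly Gaussian, per the formulas above."""
    K = Sxy @ np.linalg.inv(Syy)          # Sigma_XY Sigma_YY^-1
    mu_cond = mu_x + K @ (y - mu_y)       # conditional mean
    Sigma_cond = Sxx - K @ Sxy.T          # conditional covariance
    return mu_cond, Sigma_cond

# Illustrative 1-D/1-D example: strongly correlated X and Y
mu_x, mu_y = np.array([0.0]), np.array([0.0])
Sxx = np.array([[1.0]]); Syy = np.array([[1.0]]); Sxy = np.array([[0.8]])
print(condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, np.array([1.0])))
# -> mean 0.8, variance 0.36: observing Y pulls X toward y and shrinks its variance
```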
LINEAR GAUSSIAN MODELS
A conditional linear Gaussian model has:
P(Y|X=x) = N(μ0 + Ax, Σ0)
With parameters μ0, A, and Σ0
If X ~ N(μX,ΣX), then the joint distribution over (X,Y) is given by:
(X,Y) ~ N( [μX; μ0 + AμX], [ΣX, ΣXAᵀ; AΣX, AΣXAᵀ + Σ0] )
(Recall the linear transformation rule)
If X ~ N(μ,Σ) and Y = AX + b, then
Y ~ N(Aμ+b, AΣAᵀ)
CLG BAYESIAN NETWORKS
If all variables in a Bayesian network have Gaussian or CLG CPDs, inference can be done efficiently!
P(X1) = N(μ1,σ1)
P(X2) = N(μ2,σ2)
[Network: X1 → Y, X2 → Y, X1 → Z, Y → Z]
P(Y|x1,x2) = N(a·x1 + b·x2, σy)
P(Z|x1,y) = N(c + d·x1 + e·y, σz)
CANONICAL REPRESENTATION
All factors in a CLG Bayes net can be represented as
C(x;K,h,g) with
C(x;K,h,g) = exp(-½ xᵀKx + hᵀx + g)
Ex: if P(Y|x) = N(μ0 + Ax, Σ0) then
P(y|x) = 1/Z exp(-½ (y-Ax-μ0)ᵀ Σ0⁻¹ (y-Ax-μ0))
= 1/Z exp(-½ (y,x)ᵀ [I -A]ᵀ Σ0⁻¹ [I -A] (y,x) + μ0ᵀ Σ0⁻¹ [I -A] (y,x) - ½ μ0ᵀΣ0⁻¹μ0)
This is of form C((y,x);K,h,g) with
K = [I -A]ᵀ Σ0⁻¹ [I -A]
h = [I -A]ᵀ Σ0⁻¹ μ0
g = log(1/Z) - ½ μ0ᵀΣ0⁻¹μ0
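A sketch of these formulas in code (Python/NumPy; CanonicalFactor is a hypothetical helper, not from any particular library), converting a CLG CPD P(Y|x) = N(μ0 + Ax, Σ0) to canonical form over the stacked vector (y,x):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CanonicalFactor:
    K: np.ndarray   # precision-like matrix
    h: np.ndarray   # information vector
    g: float        # log of the constant term

def clg_to_canonical(mu0, A, Sigma0):
    """Canonical form of P(y|x) = N(mu0 + A x, Sigma0), over (y, x)."""
    d = len(mu0)
    S0inv = np.linalg.inv(Sigma0)
    B = np.hstack([np.eye(d), -A])        # [I  -A], so (y - A x - mu0) = B (y,x) - mu0
    K = B.T @ S0inv @ B
    h = B.T @ S0inv @ mu0
    logZ = 0.5 * d * np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(Sigma0))
    g = -logZ - 0.5 * mu0 @ S0inv @ mu0   # log(1/Z) - 1/2 mu0^T Sigma0^-1 mu0
    return CanonicalFactor(K, h, g)

# Example: scalar Y with one parent X, P(Y|x) = N(1 + 2x, 0.5)
f = clg_to_canonical(np.array([1.0]), np.array([[2.0]]), np.array([[0.5]]))
print(f.K, f.h, f.g)
```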
PRODUCT OPERATIONS
C(x;K1,h1,g1)C(x;K2,h2,g2) = C(x;K,h,g) with
K=K1+K2
h=h1+h2
g = g1+g2
If the scopes of the two factors are not equivalent, just
extend the K’s with 0 rows and columns, and h’s with 0
rows so that each row/column matches
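A sketch of the product operation (Python/NumPy, continuing the hypothetical CanonicalFactor from the previous sketch; for brevity it assumes both factors already share the same, identically ordered scope):

```python
def factor_product(f1, f2):
    """Product of two canonical factors over the same, aligned scope:
    K = K1 + K2, h = h1 + h2, g = g1 + g2."""
    return CanonicalFactor(f1.K + f2.K, f1.h + f2.h, f1.g + f2.g)

# Factors with different scopes would first be padded with zero rows/columns of K
# (and zero entries of h) so that their variable orderings match, as described above.
```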
SUM OPERATION
∫ C((x,y);K,h,g) dy = C(x;K’,h’,g’) with
K’ = KXX - KXY KYY⁻¹ KYX
h’ = hX - KXY KYY⁻¹ hY
g’ = g + ½ (log|2π KYY⁻¹| + hYᵀ KYY⁻¹ hY)
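And a sketch of the sum (marginalization) operation over the same hypothetical CanonicalFactor, where the first nx indices (X) are kept and the trailing block (Y) is integrated out:

```python
import numpy as np

def factor_marginalize(f, nx):
    """Integrate out the trailing variables of a canonical factor, keeping the first nx."""
    Kxx, Kxy = f.K[:nx, :nx], f.K[:nx, nx:]
    Kyx, Kyy = f.K[nx:, :nx], f.K[nx:, nx:]
    hx, hy = f.h[:nx], f.h[nx:]
    Kyy_inv = np.linalg.inv(Kyy)
    K_new = Kxx - Kxy @ Kyy_inv @ Kyx
    h_new = hx - Kxy @ Kyy_inv @ hy
    ny = Kyy.shape[0]
    # g' = g + 1/2 (log|2 pi Kyy^-1| + hy^T Kyy^-1 hy)
    g_new = f.g + 0.5 * (ny * np.log(2 * np.pi) - np.log(np.linalg.det(Kyy))
                         + hy @ Kyy_inv @ hy)
    return CanonicalFactor(K_new, h_new, g_new)
```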
Using these two operations we can implement
inference algorithms developed for discrete Bayes
nets:
Top-down inference, variable elimination (exact)
Belief propagation (approximate)
MONTE CARLO WITH GAUSSIANS
Assume sample X ~ N(0,1) is given as a primitive
RandN()
To sample X ~ N(μ,σ²), simply set x ← μ + σ·RandN()
How to generate a random multivariate Gaussian
variable N(μ,Σ)?
Take the Cholesky decomposition Σ⁻¹ = LLᵀ; L is invertible if
Σ is positive definite
Let y = Lᵀ(x-μ)
P(y) ∝ exp(-½ (y1² + … + yN²)) is isotropic, and each yi
is independent
Sample each component of y at random
Set x ← L⁻ᵀy + μ
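A sketch of this sampling procedure (Python/NumPy), following the slide's convention of factoring Σ⁻¹ = LLᵀ (an equivalent, more common variant factors Σ itself and sets x = Ly + μ):

```python
import numpy as np

def sample_mvn(mu, Sigma, rng):
    """Draw one sample from N(mu, Sigma) using the whitening construction above."""
    L = np.linalg.cholesky(np.linalg.inv(Sigma))  # Sigma^-1 = L L^T
    y = rng.standard_normal(len(mu))              # y ~ N(0, I): isotropic, independent components
    return np.linalg.solve(L.T, y) + mu           # x = L^-T y + mu  =>  Cov(x) = (L L^T)^-1 = Sigma

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
xs = np.array([sample_mvn(mu, Sigma, rng) for _ in range(50000)])
print(xs.mean(axis=0), np.cov(xs.T))  # should be close to mu and Sigma
```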
MONTE CARLO WITH LIKELIHOOD
WEIGHTING
Monte Carlo with rejection has probability 0 of
finding a continuous value given as evidence, so
likelihood weighting must be used
P(X) = N(μX,ΣX)
P(Y|x) = N(Ax + μY, ΣY)
[Network: X → Y, with Y = y observed]
Step 1: Sample x ~ N(μX,ΣX)
Step 2: Weight by P(y|x)
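A sketch of likelihood weighting for this two-node network (Python/NumPy; the specific parameter values are illustrative), estimating E[X | Y=y] from the weighted samples:

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma))))

rng = np.random.default_rng(0)
mu_X, Sigma_X = np.array([0.0]), np.array([[1.0]])
A, mu_Y, Sigma_Y = np.array([[1.0]]), np.array([0.0]), np.array([[0.5]])
y_obs = np.array([2.0])

samples, weights = [], []
for _ in range(20000):
    x = rng.multivariate_normal(mu_X, Sigma_X)             # Step 1: sample x ~ P(X)
    w = np.exp(mvn_logpdf(y_obs, A @ x + mu_Y, Sigma_Y))   # Step 2: weight by P(y|x)
    samples.append(x)
    weights.append(w)

samples, weights = np.array(samples), np.array(weights)
print((weights[:, None] * samples).sum(axis=0) / weights.sum())  # posterior mean estimate of X
```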
HYBRID NETWORKS
Hybrid networks combine both discrete and
continuous variables
Exact inference techniques are hard to apply
They result in Gaussian mixtures
NP-hard even in polytree networks
Monte Carlo techniques apply in a straightforward
way
Belief approximation can be applied (e.g.,
collapsing Gaussian mixtures to single
Gaussians)
ISSUES
Non-Gaussian distributions
Nonlinear dependencies
More in future lectures on particle filtering