9/10/07
Tutorial on Gaussian Processes
DAGS ’07
Jonathan Laserson and Ben Packer
Outline
Linear Regression
Bayesian Inference Solution
Gaussian Processes
Gaussian Process Solution
Kernels
Implications
Linear Regression
Task: Predict y given x
Data: (xi, yi) for i = 1, …, M, with xi ∈ Rd
Stack the inputs and targets:
X = [x1T ; x2T ; … ; xMT],   y = [y1, y2, …, yM]T
Linear Regression
Predicting Y given X
y = wTx + ε,  ε ~ N(0, ν2)
y | x, w ~ N(wTx, ν2)
wML = argmax P(y | X, w)
wLS = argmin Σi (wTxi − yi)2
With Gaussian noise, wML = wLS
Prediction: y* = wMLT x*
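A minimal numpy sketch of this maximum-likelihood / least-squares fit (the toy data, noise level, and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: M points in d dimensions, y_i = w^T x_i + Gaussian noise
M, d = 50, 3
X = rng.normal(size=(M, d))                     # rows are x_i^T
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=M)

# w_LS = argmin_w sum_i (w^T x_i - y_i)^2, which equals w_ML under Gaussian noise
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Point prediction at a new input x*
x_star = rng.normal(size=d)
y_star = w_ls @ x_star
```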
L2 Regularized Lin Reg
Predicting Y given X
y = wTx + ε,  ε ~ N(0, ν2)
w ~ N(0, σ2I)
wMAP = argmax P(y, w | X)
wRLS = argmin Σi (wTxi − yi)2 + (ν2/σ2) ||w||2
With a Gaussian prior on w, wMAP = wRLS
Prediction: y* = wMAPT x*
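A matching sketch for the MAP / regularized least-squares fit, assuming the noise variance ν2 and prior variance σ2 are known (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 50, 3
X = rng.normal(size=(M, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=M)

nu2 = 0.1 ** 2        # noise variance nu^2
sigma2 = 1.0          # prior variance sigma^2 on the weights
lam = nu2 / sigma2    # regularization weight implied by the Gaussian prior

# w_RLS = argmin_w sum_i (w^T x_i - y_i)^2 + lam ||w||^2  =  (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_star = rng.normal(size=d)
y_star = w_map @ x_star
```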
Bayesian Instead of MAP
Instead of using wMAP = argmax P(y,w|X) to predict
y*, why don’t we use entire distribution P(y,w|X)
to estimate P(y*|X,y,x*)?
We have P(y|w,X) and P(w)
Combine these to get P(y,w|X)
Marginalize to get P(y|X)
Same as P(y,y*|X,x*)
Joint Gaussian->Conditional Gaussian to get P(y*|y,X,x*)
Bayesian Inference
We have P(y|w,X) and P(w)
y | X, w ~ N(Xw, ν2I)
w ~ N(0, σ2I)
Combine these to get P(y,w|X)
w, y | X ~ N( [0; 0],  [ σ2I,  σ2XT ;  σ2X,  ν2I + σ2XXT ] )
Marginalize to get P(y|X)
Same as P(y,y*|X,x*)
Joint Gaussian->Conditional Gaussian
y | X ~ N(0, ν2I + σ2XXT)
y* | y, X, x* ~ N(μy*, σy*2)
Error bars!
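A sketch of this Bayesian prediction: build the joint Gaussian over (y, y*) implied by w ~ N(0, σ2I), then condition on the observed y to get the predictive mean and error bars (data and hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 50, 3
X = rng.normal(size=(M, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=M)

nu2, sigma2 = 0.1 ** 2, 1.0          # noise and weight-prior variances
x_star = rng.normal(size=d)

# Blocks of the joint Gaussian over (y, y*):
K_yy = nu2 * np.eye(M) + sigma2 * X @ X.T     # Cov(y, y)
k_star = sigma2 * X @ x_star                  # Cov(y, y*)
k_ss = sigma2 * x_star @ x_star + nu2         # Var(y*)

# Joint Gaussian -> Conditional Gaussian:
mu_star = k_star @ np.linalg.solve(K_yy, y)                  # predictive mean
var_star = k_ss - k_star @ np.linalg.solve(K_yy, k_star)     # predictive variance

print(f"y* = {mu_star:.3f} +/- {2 * np.sqrt(var_star):.3f}")   # error bars
```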
Gaussian Process
We saw a distribution over Y directly
y | X ~ N(0, ν2I + σ2XXT)
Why not start from here?
Instead of choosing a prior over w and defining fw(x), put your
prior over f directly
Since y = f(x) + noise, this induces a prior over y
Next… How to put a prior on f(x)
What is a random process?
It’s a prior over functions
A stochastic process is a collection of random variables, f(x),
indexed by x
It is specified by giving the joint probability of every finite
subset of variables f(x1), f(x2), …, f(xk)
In a consistent way!
What is a Gaussian process?
A stochastic process in which the joint probability of every finite subset f(x1), f(x2), …, f(xk) is a multivariate Gaussian
Enough to specify mean and covariance functions
μ(x) = E[f(x)]
C(x,x’) = E[ (f(x)- μ(x)) (f(x’)- μ(x’)) ]
f(x1), …, f(xk) ~ N( [μ(x1) … μ(xk)], K)
Ki,j = C(xi, xj)
For simplicity, we’ll assume μ(x) = 0.
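A small sketch of how this definition is used: pick any finite set of inputs, evaluate μ and C on them, and draw function values from the resulting multivariate Gaussian (the covariance function here is the Gaussian one that appears later in the slides; everything else is illustrative):

```python
import numpy as np

def mu(x):
    return 0.0                                  # mean function, mu(x) = 0 for simplicity

def C(x, xp):
    return np.exp(-0.5 * (x - xp) ** 2)         # an example covariance function

# Any finite set of inputs x_1..x_k gives a multivariate Gaussian over f(x_1)..f(x_k)
xs = np.linspace(-3, 3, 50)
m = np.array([mu(x) for x in xs])
K = np.array([[C(a, b) for b in xs] for a in xs])   # K_ij = C(x_i, x_j)

rng = np.random.default_rng(0)
f_sample = rng.multivariate_normal(m, K + 1e-9 * np.eye(len(xs)))  # one draw from the GP prior
```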
Back to Linear Regression
Recall: Want to put a prior directly on f
Can use a Gaussian Process to do this
How do we choose μ and C?
Use knowledge of prior over w
w ~ N(0, σ2I)
μ(x) = E[f(x)] = E[wTx] = E[wT]x = 0
C(x,x’) = E[ (f(x)- μ(x)) (f(x’)- μ(x’)) ]
= E[ f(x) f(x') ]
= xT E[wwT] x'
= xT (σ2I) x' = σ2 xTx'
(Can have f(x) = wTΦ(x) more generally)
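This derivation can be checked numerically: sample many weight vectors w ~ N(0, σ2I), average f(x)·f(x'), and compare with σ2xTx' (a quick illustrative check, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

d, sigma2 = 3, 2.0
x = rng.normal(size=d)
x_prime = rng.normal(size=d)

# Sample many w ~ N(0, sigma2 I) and average f(x) * f(x') with f(x) = w^T x
W = rng.normal(scale=np.sqrt(sigma2), size=(200_000, d))
mc_estimate = np.mean((W @ x) * (W @ x_prime))   # Monte Carlo estimate of E[f(x) f(x')]

print(mc_estimate)              # approximately ...
print(sigma2 * x @ x_prime)     # ... sigma^2 x^T x'
```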
Back to Linear Regression
μ(x) = 0
C(x,x’) = σ2xTx’
f ~ GP(μ,C)
It follows that
f(x1),f(x2),…,f(xk) ~ N(0, K)
y1,y2,…,yk ~ N(0,ν2I + K)
K = σ2XXT
Same as the regularized least squares (Bayesian linear regression) solution!
If we use a different C, we’ll have a different K
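A tiny check that the kernel-space view reproduces the weight-space covariance from the Bayesian-inference slides (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 5, 3
X = rng.normal(size=(M, d))
sigma2, nu2 = 1.0, 0.01

# Kernel-space view: K_ij = C(x_i, x_j) with the linear covariance C(x, x') = sigma^2 x^T x'
K = np.array([[sigma2 * xi @ xj for xj in X] for xi in X])

# Weight-space view from earlier: Cov(y | X) = nu^2 I + sigma^2 X X^T
assert np.allclose(nu2 * np.eye(M) + K, nu2 * np.eye(M) + sigma2 * X @ X.T)
```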
Kernels
If we use a different C, we'll have a different K
What do these look like?
Linear:   C(x,x') = σ2xTx'
Poly:     C(x,x') = (1 + xTx')2
Gaussian: C(x,x') = exp{-0.5 (x - x')2}
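The three covariance functions written out as code, plus a helper to build the Gram matrix K (the extra parameters such as degree and lengthscale are optional generalizations, not on the slides):

```python
import numpy as np

def linear(x, xp, sigma2=1.0):
    return sigma2 * np.dot(x, xp)                      # C(x, x') = sigma^2 x^T x'

def poly(x, xp, degree=2):
    return (1.0 + np.dot(x, xp)) ** degree             # C(x, x') = (1 + x^T x')^2

def gaussian(x, xp, lengthscale=1.0):
    diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return np.exp(-0.5 * np.dot(diff, diff) / lengthscale ** 2)   # C(x, x') = exp{-0.5 (x - x')^2}

def gram(kernel, X):
    """Gram matrix with K_ij = C(x_i, x_j)."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])
```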
End
y* = Σi αi C(x*, xi)
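Putting the pieces together, a sketch of GP regression with the Gaussian covariance: the predictive mean is a weighted sum Σi αi C(x*, xi), with the weights α obtained by solving against (K + ν2I) (data, noise level, and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D toy data
X = np.sort(rng.uniform(-3, 3, size=20))
y = np.sin(X) + 0.1 * rng.normal(size=20)
nu2 = 0.1 ** 2

def C(x, xp):
    return np.exp(-0.5 * (x - xp) ** 2)     # Gaussian covariance from the slides

K = C(X[:, None], X[None, :])               # K_ij = C(x_i, x_j)
alpha = np.linalg.solve(K + nu2 * np.eye(len(X)), y)

x_star = 0.7
k_star = C(x_star, X)                       # vector of C(x*, x_i)
mu_star = k_star @ alpha                    # y* = sum_i alpha_i C(x*, x_i)
var_star = C(x_star, x_star) + nu2 - k_star @ np.linalg.solve(K + nu2 * np.eye(len(X)), k_star)
```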
Learning a kernel
Parameterize a family of kernel functions using θ
Learn K using gradient of likelihood
y | X, θ ~ N(0, ν2I + Kθ)
log p(y | X, θ) = -0.5 log det(Kθ + ν2I) - 0.5 yT (Kθ + ν2I)^-1 y - 0.5 n log(2π)
∂/∂θ log p(y | X, θ) = 0.5 yT (Kθ + ν2I)^-1 (∂Kθ/∂θ) (Kθ + ν2I)^-1 y - 0.5 tr[ (Kθ + ν2I)^-1 (∂Kθ/∂θ) ]
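A sketch of learning a kernel hyperparameter by gradient ascent on this log marginal likelihood, using a Gaussian kernel with lengthscale θ = ell (the data, step size, and parameterization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D toy data
X = np.sort(rng.uniform(-3, 3, size=30))
y = np.sin(X) + 0.1 * rng.normal(size=30)
nu2 = 0.1 ** 2
n = len(X)

def log_lik_and_grad(ell):
    """Log marginal likelihood and its derivative w.r.t. the lengthscale ell."""
    D2 = (X[:, None] - X[None, :]) ** 2
    K = np.exp(-0.5 * D2 / ell ** 2)            # K_theta
    dK = K * D2 / ell ** 3                      # dK_theta / d ell
    Ky = K + nu2 * np.eye(n)
    Kinv_y = np.linalg.solve(Ky, y)
    ll = -0.5 * np.linalg.slogdet(Ky)[1] - 0.5 * y @ Kinv_y - 0.5 * n * np.log(2 * np.pi)
    grad = 0.5 * Kinv_y @ dK @ Kinv_y - 0.5 * np.trace(np.linalg.solve(Ky, dK))
    return ll, grad

# Plain gradient ascent on ell (no line search or safeguards, just a sketch)
ell = 0.3
for _ in range(200):
    ll, g = log_lik_and_grad(ell)
    ell += 0.01 * g

print(ell, ll)
```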
GP Graphical Model
Starting point
For details, see
Rasmussen’s NIPS 2006 Tutorial
Williams' Gaussian Processes paper
http://www.kyb.mpg.de/bs/people/carl/gpnt06.pdf
http://www.dai.ed.ac.uk/homes/ckiw/postscript/hbtnn.ps.gz
GPs for classification (approximation)
Sparse methods
Connection to SVMs
Your thoughts…