Tutorial on Gaussian Processes
DAGS ’07, 9/10/07
Jonathan Laserson and Ben Packer
Outline

- Linear Regression
- Bayesian Inference Solution
- Gaussian Processes
- Gaussian Process Solution
- Kernels
- Implications
Linear Regression

Task: Predict y given x

- Data: x_i ∈ R^d, y_i ∈ R, for i = 1, …, M
- Stack the examples as X = [x_1^T; …; x_M^T] (an M×d matrix) and y = [y_1, …, y_M]^T
Linear Regression

Predicting Y given X

- Model: y = w^T x + ε, with ε ~ N(0, ν²)
- Equivalently: y | x, w ~ N(w^T x, ν²)
- Maximum likelihood: w_ML = argmax_w P(y | X, w)
- Same as least squares: w_LS = argmin_w Σ_i (w^T x_i − y_i)²
- Prediction: y* = w_ML^T x*
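A minimal numerical sketch of the least-squares / maximum-likelihood solution above, using numpy's lstsq. The toy data, dimensions, and noise level are assumptions, not values from the talk:

```python
import numpy as np

# Hypothetical toy data: M examples in d dimensions (assumption, not from the talk).
rng = np.random.default_rng(0)
M, d = 50, 3
X = rng.normal(size=(M, d))
w_true = np.array([1.0, -2.0, 0.5])
nu = 0.1                                    # noise standard deviation
y = X @ w_true + nu * rng.normal(size=M)

# Maximum likelihood = least squares: w_ML = argmin_w sum_i (w^T x_i - y_i)^2
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

# Point prediction at a new input x*
x_star = rng.normal(size=d)
y_star = w_ml @ x_star
```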
L2 Regularized Lin Reg

Predicting Y given X

- Model: y = w^T x + ε, with ε ~ N(0, ν²)
- Prior on weights: w ~ N(0, σ²I)
- MAP estimate: w_MAP = argmax_w P(y, w | X)
- Same as regularized least squares: w_RLS = argmin_w Σ_i (w^T x_i − y_i)² + λ ||w||²
- Prediction: y* = w_MAP^T x*
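A short sketch of the MAP / regularized least-squares estimate above. With the Gaussian prior on w, the MAP weights solve ridge regression with λ = ν²/σ²; the toy data and hyperparameter values are assumptions:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized least squares: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data (assumption, not from the talk)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# MAP under w ~ N(0, sigma^2 I) and noise N(0, nu^2)  <=>  lambda = nu^2 / sigma^2
nu2, sigma2 = 0.01, 1.0
w_map = ridge_fit(X, y, lam=nu2 / sigma2)
y_star = w_map @ rng.normal(size=3)         # prediction at a new point
```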
Bayesian Instead of MAP

Instead of using w_MAP = argmax P(y, w | X) to predict y*, why not use the entire distribution P(y, w | X) to estimate P(y* | X, y, x*)?

- We have P(y | w, X) and P(w)
- Combine these to get P(y, w | X)
- Marginalize to get P(y | X)
- Same as P(y, y* | X, x*)
- Joint Gaussian -> conditional Gaussian to get P(y* | y, X, x*)
Bayesian Inference

- We have P(y | w, X) and P(w):
    y | X, w ~ N(Xw, ν²I)
    w ~ N(0, σ²I)
- Combine these to get P(y, w | X):
    (w, y) | X ~ N( [0, 0]^T, [[σ²I, σ²X^T], [σ²X, ν²I + σ²XX^T]] )
- Marginalize to get P(y | X):
    y | X ~ N(0, ν²I + σ²XX^T)
- Same as P(y, y* | X, x*) (just include the test point x* alongside X)
- Joint Gaussian -> conditional Gaussian:
    y* | y, X, x* ~ N(μ_y*, Σ_y*)
- Error bars!
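A minimal sketch of the conditioning step above: build the joint covariance of (y, y*), then condition on the observed y to get a predictive mean and variance (the error bars). The toy data and the values of σ², ν² are assumptions:

```python
import numpy as np

# Toy data (assumption, not from the talk)
rng = np.random.default_rng(0)
M, d = 30, 2
X = rng.normal(size=(M, d))
y = X @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=M)
sigma2, nu2 = 1.0, 0.01                     # prior and noise variances (assumed)
x_star = rng.normal(size=d)

# Joint covariance blocks: Cov(y) = nu^2 I + sigma^2 X X^T,
# Cov(y*, y) = sigma^2 x*^T X^T, Var(y*) = nu^2 + sigma^2 x*^T x*.
K_yy = nu2 * np.eye(M) + sigma2 * X @ X.T
k_sy = sigma2 * X @ x_star
k_ss = nu2 + sigma2 * x_star @ x_star

# Condition the joint Gaussian: y* | y, X, x* ~ N(mu_star, var_star)
mu_star = k_sy @ np.linalg.solve(K_yy, y)
var_star = k_ss - k_sy @ np.linalg.solve(K_yy, k_sy)   # error bars
```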
Gaussian Process

- We saw a distribution over y directly:
    y | X ~ N(0, ν²I + σ²XX^T)
- Why not start from here?
- Instead of choosing a prior over w and defining f_w(x), put your prior over f directly
- Since y = f(x) + noise, this induces a prior over y
- Next: how to put a prior on f(x)
What is a random process?

- It’s a prior over functions
- A stochastic process is a collection of random variables f(x), indexed by x
- It is specified by giving the joint probability of every finite subset of variables f(x1), f(x2), …, f(xk)
- In a consistent way!
What is a Gaussian process?

- It’s a prior over functions
- A stochastic process is a collection of random variables f(x), indexed by x
- It is specified by giving the joint probability of every finite subset of variables f(x1), f(x2), …, f(xk)
- In a consistent way!
- For a Gaussian process: the joint probability of f(x1), f(x2), …, f(xk) is a multivariate Gaussian
What is a Gaussian Process?

- It is specified by giving the joint probability of every finite subset of variables f(x1), f(x2), …, f(xk), in a consistent way
- The joint probability of f(x1), f(x2), …, f(xk) is a multivariate Gaussian
- Enough to specify mean and covariance functions:
    μ(x) = E[f(x)]
    C(x, x') = E[(f(x) − μ(x))(f(x') − μ(x'))]
- Then f(x1), …, f(xk) ~ N([μ(x1) … μ(xk)], K), with K_{i,j} = C(x_i, x_j)
- For simplicity, we’ll assume μ(x) = 0.
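A short sketch of how the definition above is used in practice: evaluate C on a finite set of inputs to form K, then draw a sample of f(x1), …, f(xk) from the multivariate Gaussian. The squared-exponential covariance here is an assumed example (it appears later as the "Gaussian" kernel):

```python
import numpy as np

def sample_gp_prior(xs, cov_fn, n_samples=3, jitter=1e-9):
    """Draw samples of (f(x1), ..., f(xk)) ~ N(0, K), where K[i,j] = C(x_i, x_j)."""
    K = np.array([[cov_fn(xi, xj) for xj in xs] for xi in xs])
    K += jitter * np.eye(len(xs))            # small diagonal term for numerical stability
    return np.random.default_rng(0).multivariate_normal(
        mean=np.zeros(len(xs)), cov=K, size=n_samples)

# Assumed squared-exponential covariance; each returned row is one sampled function.
se_cov = lambda x, xp: np.exp(-0.5 * (x - xp) ** 2)
xs = np.linspace(-3, 3, 50)
samples = sample_gp_prior(xs, se_cov)
```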
Back to Linear Regression

- Recall: we want to put a prior directly on f
- Can use a Gaussian Process to do this
- How do we choose μ and C? Use knowledge of the prior over w:
    w ~ N(0, σ²I)
    μ(x) = E[f(x)] = E[w^T x] = E[w^T] x = 0
    C(x, x') = E[(f(x) − μ(x))(f(x') − μ(x'))]
             = E[f(x) f(x')]
             = x^T E[w w^T] x'
             = x^T (σ²I) x' = σ² x^T x'
- (More generally, can have f(x) = w^T Φ(x))
Back to Linear Regression

- μ(x) = 0
- C(x, x') = σ² x^T x'
- f ~ GP(μ, C)
- It follows that:
    f(x1), f(x2), …, f(xk) ~ N(0, K)
    y1, y2, …, yk ~ N(0, ν²I + K), where K = σ² X X^T
- Same as the Bayesian (regularized least squares) solution!
- If we use a different C, we’ll have a different K
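A hedged sketch of the equivalence just stated: GP regression with the linear covariance C(x, x') = σ² x^T x' gives the same predictive mean as the regularized least-squares weights. Toy data and hyperparameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
M, d = 40, 3
X = rng.normal(size=(M, d))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=M)
sigma2, nu2 = 1.0, 0.01                     # assumed hyperparameters
x_star = rng.normal(size=d)

# Function-space (GP) view: K = sigma^2 X X^T, y ~ N(0, nu^2 I + K)
K = sigma2 * X @ X.T
k_star = sigma2 * X @ x_star
gp_mean = k_star @ np.linalg.solve(nu2 * np.eye(M) + K, y)

# Weight-space view: ridge regression with lambda = nu^2 / sigma^2
w_ridge = np.linalg.solve(X.T @ X + (nu2 / sigma2) * np.eye(d), X.T @ y)
ridge_mean = w_ridge @ x_star

# gp_mean and ridge_mean agree up to numerical precision
```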
Kernels

- If we use a different C, we’ll have a different K
- What do these look like?
    Linear: C(x, x') = σ² x^T x'
    Poly: C(x, x') = (1 + x^T x')²
    Gaussian: C(x, x') = exp{−0.5 (x − x')²}
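Minimal sketches of the three covariance (kernel) functions listed above, written for scalar inputs; σ² is an assumed prior-variance hyperparameter:

```python
import numpy as np

def linear_cov(x, xp, sigma2=1.0):
    """Linear kernel: sigma^2 * x^T x' (scalar inputs here)."""
    return sigma2 * x * xp

def poly_cov(x, xp):
    """Polynomial kernel: (1 + x^T x')^2."""
    return (1.0 + x * xp) ** 2

def gaussian_cov(x, xp):
    """Gaussian (squared-exponential) kernel: exp{-0.5 (x - x')^2}."""
    return np.exp(-0.5 * (x - xp) ** 2)

# Building the Gram matrix K works the same way for any of them:
xs = np.linspace(-2, 2, 5)
K = np.array([[gaussian_cov(xi, xj) for xj in xs] for xi in xs])
```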
End

[Plot residue: prediction at x* shown as a combination of kernel terms, Σ_i C(x*, x_i)]
Learning a kernel

- Parameterize a family of kernel functions by θ
- Learn K_θ using the gradient of the likelihood:
    y | X ~ N(0, ν²I + K_θ)
    log p(y | X, θ) = −0.5 log det(K_θ + ν²I) − 0.5 y^T (K_θ + ν²I)^{−1} y − 0.5 n log(2π)
    ∂/∂θ log p(y | X, θ) = [∂ log p / ∂K] · ∂K_θ/∂θ (chain rule)
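A hedged sketch of this idea: gradient ascent on the log marginal likelihood with respect to one kernel parameter θ (here, an assumed squared length-scale of the Gaussian kernel). The gradient expression used in the code is the standard one obtained by differentiating the log-likelihood above through K; the toy data and step size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=25))     # toy 1-D inputs (assumption)
y = np.sin(x) + 0.1 * rng.normal(size=25)
nu2 = 0.01                                   # noise variance (assumed known)

def kernel_and_grad(x, theta):
    """K[i,j] = exp(-0.5 (x_i - x_j)^2 / theta) and its derivative dK/dtheta."""
    sq = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * sq / theta)
    dK = K * 0.5 * sq / theta ** 2
    return K, dK

def log_lik_and_grad(theta):
    K, dK = kernel_and_grad(x, theta)
    Ky = K + nu2 * np.eye(len(x))
    Ky_inv = np.linalg.inv(Ky)
    alpha = Ky_inv @ y
    ll = (-0.5 * np.linalg.slogdet(Ky)[1]
          - 0.5 * y @ alpha
          - 0.5 * len(x) * np.log(2 * np.pi))
    # Chain rule through K:
    # d(log p)/d(theta) = 0.5 y^T Ky^-1 (dK/dtheta) Ky^-1 y - 0.5 tr(Ky^-1 dK/dtheta)
    grad = 0.5 * alpha @ dK @ alpha - 0.5 * np.trace(Ky_inv @ dK)
    return ll, grad

theta = 1.0
for _ in range(100):                         # simple gradient ascent on log-likelihood
    ll, g = log_lik_and_grad(theta)
    theta = max(theta + 0.01 * g, 1e-3)      # keep the length-scale positive
```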
GP Graphical Model
Starting point

For details, see:
- Rasmussen’s NIPS 2006 Tutorial: http://www.kyb.mpg.de/bs/people/carl/gpnt06.pdf
- Williams’s Gaussian Processes paper: http://www.dai.ed.ac.uk/homes/ckiw/postscript/hbtnn.ps.gz

Further topics:
- GPs for classification (approximation)
- Sparse methods
- Connection to SVMs
- Your thoughts…