PowerPoint Presentation - A Gaussian Process Tutorial


Gaussian Processes
Li An
[email protected]
The Plan
• Introduction to Gaussian Processes
• Revisit linear regression
• Linear regression updated by Gaussian Processes
• Gaussian Processes for Regression
• Conclusion
Why GPs?
• Here are some data points! What function did
they come from?
• I have no idea.
• Oh. Okay. Uh, you think this point is likely in
the function too?
• I have no idea.
Why GPs?
• You can’t get anywhere without making some assumptions
• GPs are a nice way of expressing this ‘prior on functions’ idea
• Can do a bunch of cool stuff
  • Regression
  • Classification
  • Optimization
• Unimodal
• Concentrated
• Easy to compute with
  • Sometimes
• Tons of crazy properties

Gaussian

The univariate Gaussian density:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
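As a quick numerical sanity check, the density above can be coded directly. This is a minimal sketch; the grid range and resolution are arbitrary choices:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# The density should integrate to 1; a Riemann sum on a fine grid confirms it.
x = np.linspace(-8.0, 8.0, 10001)
dx = x[1] - x[0]
print(gaussian_pdf(x).sum() * dx)  # ≈ 1.0
```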
Linear Regression Revisited
• Linear regression model: a combination of M fixed basis functions given by $\phi(x)$, so that
  $y(x) = \mathbf{w}^{T} \phi(x)$
• Prior distribution: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid 0, \alpha^{-1} I)$
• Given training data points $x_1, \ldots, x_n$, what is the joint distribution of $y(x_1), \ldots, y(x_n)$?
• $\mathbf{y}$ is the vector with elements $y_n = y(x_n)$; this vector is given by
  $\mathbf{y} = \Phi \mathbf{w}$
  where $\Phi$ is the design matrix with elements $\Phi_{nk} = \phi_k(x_n)$
Linear Regression Revisited
• Since $\mathbf{y} = \Phi \mathbf{w}$, $\mathbf{y}$ is a linear combination of Gaussian-distributed variables given by the elements of $\mathbf{w}$, hence is itself Gaussian.
• Find its mean and covariance:
  $\mathbb{E}[\mathbf{y}] = \Phi\, \mathbb{E}[\mathbf{w}] = 0$
  $\operatorname{cov}[\mathbf{y}] = \mathbb{E}[\mathbf{y}\mathbf{y}^{T}] = \Phi\, \mathbb{E}[\mathbf{w}\mathbf{w}^{T}]\, \Phi^{T} = \frac{1}{\alpha} \Phi \Phi^{T} = K$
  where $K$ is the Gram matrix with elements
  $K_{nm} = k(x_n, x_m) = \frac{1}{\alpha} \phi(x_n)^{T} \phi(x_m)$
  and $k(x, x')$ is the kernel function.
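The identity $\operatorname{cov}[\mathbf{y}] = \frac{1}{\alpha}\Phi\Phi^{T} = K$ can be checked numerically. A hedged sketch: the Gaussian-bump basis functions, their centers, and the value of $\alpha$ below are arbitrary illustrative choices, not from the slides.

```python
import numpy as np

alpha = 2.0                       # weight-prior precision: p(w) = N(w | 0, alpha^-1 I)
centers = np.linspace(-1, 1, 5)   # centers of M = 5 Gaussian-bump basis functions

def phi(x):
    """Design-matrix rows: phi_k(x_n) for each input x_n and basis index k."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / 0.1)

x = np.linspace(-1, 1, 8)
Phi = phi(x)                      # design matrix with elements Phi[n, k] = phi_k(x_n)

K_matrix = Phi @ Phi.T / alpha    # cov[y] = (1/alpha) Phi Phi^T

# Element-wise kernel form: K_nm = (1/alpha) phi(x_n)^T phi(x_m)
K_kernel = np.array([[Phi[n] @ Phi[m] / alpha for m in range(len(x))]
                     for n in range(len(x))])
print(np.allclose(K_matrix, K_kernel))  # True
```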
Definition of GP
• A Gaussian process is defined as a probability distribution over functions y(x), such that the set of values of y(x) evaluated at an arbitrary set of points $x_1, \ldots, x_n$ jointly has a Gaussian distribution.
• Probability distribution indexed by an arbitrary set
• Any finite subset of indices defines a multivariate Gaussian
distribution
• Input space X, for each x the distribution is a Gaussian, what
determines the GP is
• The mean function µ(x) = E(y(x))
• The covariance function (kernel) k(x,x')=E(y(x)y(x'))
• In most applications, we take µ(x)=0. Hence the prior is
represented by the kernel.
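The definition can be made concrete by sampling: pick any finite set of inputs, build the kernel matrix, and draw from the resulting multivariate Gaussian. A sketch assuming a squared-exponential kernel with an arbitrary length scale:

```python
import numpy as np

def k_rbf(a, b, ell=0.3):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * ell ** 2))

# Any finite set of inputs x_1, ..., x_n defines a multivariate Gaussian N(0, K).
x = np.linspace(0.0, 1.0, 50)
K = k_rbf(x, x)

rng = np.random.default_rng(1)
jitter = 1e-10 * np.eye(len(x))   # numerical stabilizer for the factorization
samples = rng.multivariate_normal(np.zeros(len(x)), K + jitter, size=3)
print(samples.shape)  # (3, 50)
```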
Linear regression updated by GP
• A specific case of a Gaussian Process
• It is defined by the linear regression model
  $y(x) = \mathbf{w}^{T} \phi(x)$
  with a weight prior
  $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid 0, \alpha^{-1} I)$;
  the kernel function is given by
  $k(x_n, x_m) = \frac{1}{\alpha} \phi(x_n)^{T} \phi(x_m)$
Kernel function
• We can also define the kernel function directly.
• The figure shows samples of functions drawn from Gaussian processes for two different choices of kernel function.
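Since the original figure is not reproduced in this transcript, here is a sketch of the idea it illustrates: samples drawn under two directly specified kernels. The squared-exponential and exponential (Ornstein–Uhlenbeck) kernels and the length scale 0.1 are assumed choices for illustration.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100)
d = np.abs(x[:, None] - x[None, :])

K_se = np.exp(-d ** 2 / (2.0 * 0.1 ** 2))  # squared-exponential: smooth samples
K_ou = np.exp(-d / 0.1)                    # exponential (OU): rough samples

rng = np.random.default_rng(2)
jitter = 1e-8 * np.eye(len(x))
f_smooth = rng.multivariate_normal(np.zeros(len(x)), K_se + jitter)
f_rough = rng.multivariate_normal(np.zeros(len(x)), K_ou + jitter)

# The kernel choice controls sample roughness: neighbouring increments are
# far smaller under the squared-exponential kernel than under the OU kernel.
print(np.mean(np.diff(f_smooth) ** 2) < np.mean(np.diff(f_rough) ** 2))  # True
```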
GP for Regression
Take account of the noise on the observed target values, which are given by
  $t_n = y_n + \epsilon_n$
where $y_n = y(x_n)$ and $\epsilon_n$ is a random noise variable.
Here we consider noise processes that have a Gaussian distribution, so that
  $p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})$
where $\beta$ is a hyperparameter representing the precision of the noise.
Because the noise is independent, the joint distribution of $\mathbf{t} = (t_1, \ldots, t_n)^{T}$ conditioned on $\mathbf{y} = (y_1, \ldots, y_n)^{T}$ is given by
  $p(\mathbf{t} \mid \mathbf{y}) = \mathcal{N}(\mathbf{t} \mid \mathbf{y}, \beta^{-1} I_n)$
GP for regression
• From the definition of GP, the marginal distribution p(y) is given by
  $p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid 0, K)$
• The marginal distribution of t is given by
  $p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \mathcal{N}(\mathbf{t} \mid 0, C)$
• where the covariance matrix C has elements
  $C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm}$
GP for Regression
• The sampling of data points t
GP for Regression
• We’ve used GP to build a model of the joint distribution over sets of data points
• Given training points $\mathbf{t} = (t_1, \ldots, t_n)^{T}$ and input values $x_1, \ldots, x_n$,
• Goal: predict $t_{n+1}$ for a new input $x_{n+1}$
• To find $p(t_{n+1} \mid \mathbf{t})$, we begin by writing down the joint distribution
  $p(\mathbf{t}_{n+1}) = \mathcal{N}(\mathbf{t}_{n+1} \mid 0, C_{n+1})$
  where $C_{n+1}$ is an $(n+1) \times (n+1)$ matrix,
  $C_{n+1} = \begin{pmatrix} C_n & \mathbf{k} \\ \mathbf{k}^{T} & c \end{pmatrix}$, where $C_n$ is an $n \times n$ matrix and $c = k(x_{n+1}, x_{n+1}) + \beta^{-1}$
GP for Regression
• The conditional distribution $p(t_{n+1} \mid \mathbf{t})$ is a Gaussian distribution with mean and covariance given by
  $m(x_{n+1}) = \mathbf{k}^{T} C_n^{-1} \mathbf{t}$
  $\sigma^2(x_{n+1}) = c - \mathbf{k}^{T} C_n^{-1} \mathbf{k}$
• These are the key results that define Gaussian process regression.
• The predictive distribution is a Gaussian whose mean and variance both depend on $x_{n+1}$
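The two key formulas translate to a few lines of linear algebra. A hedged sketch on a synthetic sine dataset; the kernel, length scale, $\beta$, and random seed are all illustrative assumptions:

```python
import numpy as np

def k_rbf(a, b, ell=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * ell ** 2))

rng = np.random.default_rng(4)
beta = 100.0
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(beta), len(x))

C_n = k_rbf(x, x) + np.eye(len(x)) / beta    # training covariance C_n
x_new = np.array([0.25])
k = k_rbf(x, x_new)[:, 0]                    # vector k with elements k(x_n, x_new)
c = k_rbf(x_new, x_new)[0, 0] + 1.0 / beta   # c = k(x_new, x_new) + beta^-1

mean = k @ np.linalg.solve(C_n, t)           # m(x_new)       = k^T C_n^-1 t
var = c - k @ np.linalg.solve(C_n, k)        # sigma^2(x_new) = c - k^T C_n^-1 k

print(float(mean), float(var))               # mean near sin(pi/2) = 1, var > 0
```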
An Example of GP Regression
GP for Regression
• The only restriction on the kernel is that the covariance matrix given by
  $C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm}$
  must be positive definite.
• GP regression will involve a matrix of size $n \times n$, which requires $O(n^3)$ computations.
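Both points are easy to demonstrate together: the Cholesky factorization used to solve with $C_n$ is the $O(n^3)$ step, and it only succeeds when the matrix is positive definite, so it doubles as the validity check. The exponential kernel below is an assumed example:

```python
import numpy as np

def is_valid_cov(C):
    """Cholesky (the O(n^3) step in GP regression) succeeds iff C is
    symmetric positive definite, so it doubles as the validity check."""
    try:
        np.linalg.cholesky(C)
        return True
    except np.linalg.LinAlgError:
        return False

x = np.linspace(0.0, 1.0, 30)
beta = 100.0
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)  # a valid kernel matrix

print(is_valid_cov(K + np.eye(len(x)) / beta))  # True: positive definite

bad = np.ones((30, 30))   # rank-1 matrix: only positive SEMI-definite
print(is_valid_cov(bad))  # False: Cholesky fails on a zero pivot
```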
Conclusion
• Distribution over functions
• Jointly have a Gaussian distribution
• Index set can be pretty much whatever
  • Reals
  • Real vectors
  • Graphs
  • Strings
  • …
• Most interesting structure is in k(x,x’), the ‘kernel.’
• Used for regression to predict the target for a new input
Questions
• Thank you!