ICS 178: Introduction to Machine Learning & Data Mining


ICS 178
Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 4: Least-Squares Regression
What have we done so far?
[Course-overview diagram: unsupervised vs. supervised methods, each split into parametric and non-parametric. Covered so far: non-parametric density estimation (Parzen windowing) and kNN classification. Future examples: k-means and logistic regression. Today: least-squares regression.]
Problem
[Scatter plot: number of manatee kills versus number of boats.]
Goal
• Given data, find a linear relationship between the variables.
• In 1 dimension we have data $\{X_n, Y_n\}$ (blue dots).
• We imagine vertical springs (red lines) between the data and a stiff rod (the line); imagine they can slide along the rod so they remain vertical.
• The springs have rest length 0, so they compete to pull the rod towards them.
• The relaxed solution is what we are after.
Cost Function
We measure the total squared length of all the springs:
$$\mathrm{Error} \;=\; \sum_{n=1}^{N} d_n^{\,2} \;=\; \sum_{n=1}^{N} \left(Y_n - aX_n - b\right)^2$$
We can now take the derivatives with respect to a and b and set them to zero. After some algebra (on the whiteboard) we find:
$$a^* \;=\; \frac{\dfrac{1}{N}\displaystyle\sum_{n=1}^{N} Y_n X_n \;-\; \left(\dfrac{1}{N}\displaystyle\sum_{n=1}^{N} Y_n\right)\left(\dfrac{1}{N}\displaystyle\sum_{n=1}^{N} X_n\right)}{\dfrac{1}{N}\displaystyle\sum_{n=1}^{N} X_n^{\,2} \;-\; \left(\dfrac{1}{N}\displaystyle\sum_{n=1}^{N} X_n\right)^{2}}$$

$$b^* \;=\; \frac{1}{N}\sum_{n=1}^{N} Y_n \;-\; a^*\,\frac{1}{N}\sum_{n=1}^{N} X_n$$
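A sketch of that whiteboard algebra: setting the partial derivatives of the error to zero gives two linear equations (the normal equations) in $a$ and $b$,

$$\frac{\partial\,\mathrm{Error}}{\partial b} = -2\sum_{n=1}^{N}\left(Y_n - aX_n - b\right) = 0
\;\;\Rightarrow\;\; b = \frac{1}{N}\sum_{n=1}^{N} Y_n \;-\; a\,\frac{1}{N}\sum_{n=1}^{N} X_n,$$

$$\frac{\partial\,\mathrm{Error}}{\partial a} = -2\sum_{n=1}^{N} X_n\left(Y_n - aX_n - b\right) = 0,$$

and substituting the first into the second and solving for $a$ yields the expression for $a^*$ above.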
More Variables
• More generally, we want to allow $D_x$ input variables and $D_y$ output variables.
• The cost is now:

$$\mathrm{Error} \;=\; \sum_{n=1}^{N} d_n^{\,2} \;=\; \sum_{n=1}^{N} \left\| Y_n - A X_n - b \right\|^{2}$$

• Minimizing with respect to A and b gives:

$$A^* \;=\; \left[\frac{1}{N}\sum_{n=1}^{N} Y_n X_n^{T} - \left(\frac{1}{N}\sum_{n=1}^{N} Y_n\right)\left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{T}\right]\left[\frac{1}{N}\sum_{n=1}^{N} X_n X_n^{T} - \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)\left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{T}\right]^{-1}$$

$$\phantom{A^*} \;=\; \left( E[YX^{T}] - E[Y]\,E[X]^{T} \right)\, \mathrm{COV}[X]^{-1}$$

$$b^* \;=\; \frac{1}{N}\sum_{n=1}^{N} Y_n \;-\; A^*\,\frac{1}{N}\sum_{n=1}^{N} X_n \;=\; E[Y] \;-\; A^*\,E[X]$$
In Matlab
function [A,b] = LSRegression(X,Y)
% Least-squares regression. X is Dx-by-N, Y is Dy-by-N (one data point per column).
[D,N] = size(X);
EX = sum(X,2)/N;             % E[X]
CovX = X*X'/N - EX*EX';      % COV[X] = E[X X'] - E[X] E[X]'
EY = sum(Y,2)/N;             % E[Y]
CovXY = Y*X'/N - EY*EX';     % E[Y X'] - E[Y] E[X]'
A = CovXY * inv(CovX);       % A* = (E[Y X'] - E[Y] E[X]') * COV[X]^{-1}
b = EY - A*EX;               % b* = E[Y] - A* E[X]
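A quick sanity check (a sketch only; Atrue, btrue and the noise level are made-up values for illustration): generate data from a known linear map plus Gaussian noise and verify that LSRegression approximately recovers it.

% Illustrative example (assumed values, not from the lecture):
N = 1000;
Atrue = [2 -1; 0.5 3];                            % hypothetical ground-truth map
btrue = [1; -2];                                  % hypothetical ground-truth offset
X = randn(2,N);                                   % Dx = 2 inputs
Y = Atrue*X + btrue*ones(1,N) + 0.1*randn(2,N);   % Dy = 2 noisy outputs
[A,b] = LSRegression(X,Y);                        % A,b should be close to Atrue,btrue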
Statistical Interpretation
• We can think of the problem as one where we are trying to find the probability distribution $P(Y|X)$.
• We can write $Y_n = A X_n + b + d_n$, where $d_n$ is the residual error pointing vertically from the line to the data point.
• $d_n$ is a random vector and we may assume it has a Gaussian distribution:

$$d_n \sim \mathcal{N}(0,\; \sigma^2 I)$$
Statistical Interpretation
$$d_n \;=\; Y_n - A X_n - b$$

$$Y_n \sim \mathcal{N}(A X_n + b,\; \sigma^2 I) \;\;\Longleftrightarrow\;\; d_n \sim \mathcal{N}(0,\; \sigma^2 I)$$

$$P(Y_n \mid X_n) \;=\; \frac{1}{(2\pi\sigma^2)^{D_y/2}}\,\exp\!\left[-\frac{1}{2\sigma^2}\left\|Y_n - A X_n - b\right\|^{2}\right]$$
• We can now maximize the probability of the data under the model by adapting the parameters A and b.
• If we use the negative log-probability we get:
$$C \;=\; \frac{1}{2\sigma^2}\sum_{n=1}^{N}\left\|Y_n - A X_n - b\right\|^{2} \;+\; N D_y \log\sigma \;+\; \text{const.}$$
• Looks familiar?
• We can also optimize for $\sigma$ (it won't affect A and b).
• This is called “maximum likelihood learning”.
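As a short sketch (under the isotropic Gaussian noise assumption and the cost C as written above), setting $\partial C/\partial\sigma = 0$ gives the maximum-likelihood noise level:

$$-\frac{1}{\sigma^{3}}\sum_{n=1}^{N}\left\|Y_n - A^* X_n - b^*\right\|^{2} \;+\; \frac{N D_y}{\sigma} \;=\; 0
\quad\Longrightarrow\quad
\sigma^{*2} \;=\; \frac{1}{N D_y}\sum_{n=1}^{N}\left\|Y_n - A^* X_n - b^*\right\|^{2},$$

i.e. the average squared residual per output dimension.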