ICS 178: Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 4: Least-Squares Regression
What have we done so far?
[Overview diagram: methods so far, organized by supervised vs. unsupervised and parametric vs. non-parametric]
• unsupervised: density estimation (Parzen windowing); future example: k-means
• supervised, classification: kNN; future example: logistic regression
• supervised, regression: today: least squares
Problem
[Plot: number of manatee kills versus number of boats]
Goal
• Given the data, find a linear relationship between the variables.
• In 1 dimension we have data {Xn, Yn} (blue dots).
• We imagine vertical springs (red lines) between the data points and a stiff rod (the line); imagine the springs can slide along the rod so they remain vertical.
• The springs have rest length 0, so they compete to pull the rod towards them.
• The relaxed solution is what we are after.
Cost Function
We measure the total squared length of all the springs:
$$\text{Error} = \sum_{n=1}^{N} d_n^2 = \sum_{n=1}^{N} (Y_n - aX_n - b)^2$$

We can now take derivatives with respect to a and b and set them to 0.
After some algebra (on the white board) we find:
$$a^{*} = \frac{\frac{1}{N}\sum_{n=1}^{N} Y_n X_n \;-\; \left(\frac{1}{N}\sum_{n=1}^{N} Y_n\right)\left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)}{\frac{1}{N}\sum_{n=1}^{N} X_n^2 \;-\; \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{2}}$$

$$b^{*} = \frac{1}{N}\sum_{n=1}^{N} Y_n \;-\; a^{*}\,\frac{1}{N}\sum_{n=1}^{N} X_n$$
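As a sanity check on these formulas, here is a minimal sketch of the 1D closed-form fit in Matlab/Octave (the function name LSRegression1D and the variable names are illustrative, not from the lecture):

function [a,b] = LSRegression1D(x,y)
% x and y are 1-by-N vectors of scalar inputs and outputs.
N   = length(x);
Ex  = sum(x)/N;                   % (1/N) sum_n X_n
Ey  = sum(y)/N;                   % (1/N) sum_n Y_n
Exy = sum(x.*y)/N;                % (1/N) sum_n X_n Y_n
Exx = sum(x.^2)/N;                % (1/N) sum_n X_n^2
a = (Exy - Ey*Ex)/(Exx - Ex^2);   % a* = Cov(X,Y)/Var(X)
b = Ey - a*Ex;                    % b* = E[Y] - a* E[X]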
More Variables
• More generally, we want to have Dx input variables and Dy output variables.
• The cost is now:

$$\text{Error} = \sum_{n=1}^{N} d_n^2 = \sum_{n=1}^{N} \|Y_n - AX_n - b\|^2$$

• Setting the derivatives with respect to A and b to zero, as before, gives:

$$A^{*} = \left[\frac{1}{N}\sum_{n=1}^{N} Y_n X_n^T - \left(\frac{1}{N}\sum_{n=1}^{N} Y_n\right)\!\left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{\!T}\right]\left[\frac{1}{N}\sum_{n=1}^{N} X_n X_n^T - \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)\!\left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{\!T}\right]^{-1} = \left(E[YX^T] - E[Y]E[X]^T\right)\mathrm{COV}[X]^{-1}$$

$$b^{*} = \frac{1}{N}\sum_{n=1}^{N} Y_n - A^{*}\,\frac{1}{N}\sum_{n=1}^{N} X_n = E[Y] - A^{*}E[X]$$
In Matlab
function [A,b] = LSRegression(X,Y)
% X is Dx-by-N (one data point per column), Y is Dy-by-N.
[D,N] = size(X);
EX = sum(X,2)/N;              % E[X]
CovX = X*X'/N - EX*EX';       % COV[X]
EY = sum(Y,2)/N;              % E[Y]
CovXY = Y*X'/N - EY*EX';      % E[Y X^T] - E[Y] E[X]^T
A = CovXY * inv(CovX);        % A* = (E[Y X^T] - E[Y] E[X]^T) COV[X]^{-1}
b = EY - A*EX;                % b* = E[Y] - A* E[X]
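A quick usage sketch for the function above, on synthetic data (the sizes, noise level, and variable names are my own choices, not from the lecture):

% Draw data from a known linear map plus Gaussian noise, then recover it.
Dx = 3; Dy = 2; N = 1000;
Atrue = randn(Dy,Dx); btrue = randn(Dy,1);
X = randn(Dx,N);
Y = Atrue*X + repmat(btrue,1,N) + 0.1*randn(Dy,N);
[A,b] = LSRegression(X,Y);    % A and b should be close to Atrue and btrue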
Statistical Interpretation
• We can think of the problem as one where we are trying to find the probability distribution P(Y|X).
• We can write: $Y_n = AX_n + b + d_n$, where $d_n$ is the residual error pointing vertically from the line to the data point.
• $d_n$ is a random vector and we may assume it has a Gaussian distribution:

$$d_n \sim \mathcal{N}(0, \sigma^2 I)$$
Statistical Interpretation
$$d_n = Y_n - AX_n - b \;\sim\; \mathcal{N}(0, \sigma^2 I)$$

$$\Rightarrow\quad P(Y_n \mid X_n) = \mathcal{N}(AX_n + b,\ \sigma^2 I) = \frac{1}{(2\pi\sigma^2)^{D_y/2}}\exp\!\left[-\frac{1}{2\sigma^2}\,\|Y_n - AX_n - b\|^2\right]$$
• We can now maximize the probability of the data under the model by adapting
the parameters A,b.
• If we use negative log-probability we get:
$$C = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\|Y_n - AX_n - b\|^2 + \text{const.}$$

where the constant comes from the log of the normalization term and does not depend on A and b.
• Look familiar?
• We can also optimize for σ (it won't affect A, b).
• This is called “maximum likelihood learning”.
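As an illustration of the last two points, here is a minimal Matlab/Octave sketch (continuing the synthetic-data example above, and assuming the isotropic Gaussian noise model) of the maximum-likelihood estimate of the noise variance from the residuals; the derivation step in the comment is my own, not from the lecture:

% Residuals d_n = Y_n - A*X_n - b for the fitted A, b.
R = Y - A*X - repmat(b,1,N);
Dy = size(Y,1);
% Setting dC/d(sigma^2) = 0 gives sigma^2* = (1/(N*Dy)) sum_n ||d_n||^2.
sigma2 = sum(R(:).^2)/(N*Dy);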