
ICS 178
Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 6: Logistic Regression
Logistic Regression
• This is also regression, but with targets Y ∈ {0, 1}. I.e., it is classification!
• We will fit a regression function to P(Y=1|X).
Linear regression: $Y_n = A X_n + b$

Logistic regression: $P(Y_n = 1 \mid X_n) = f(A X_n + b)$

Sigmoid function: $f(X) = \dfrac{1}{1 + \exp[-(A X + b)]}$

[Figure: left, a straight-line fit through the data; right, the sigmoid f(x) rising from the data points with Y=0 to the data points with Y=1.]
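As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of the model probability; the function name predict_proba is my own, while A, b, and X follow the slide's notation:

import numpy as np

def predict_proba(A, b, X):
    # P(Y_n = 1 | X_n) = f(A X_n + b), with sigmoid f(z) = 1 / (1 + exp(-z)).
    # X: (N, d) data matrix, A: (d,) weight vector, b: scalar offset.
    # Returns an (N,) vector of probabilities in (0, 1).
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))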
In 2 Dimensions
A,b determine
1) orientation
2) thickness (margin)
3) offset
of the decision surface.
[Figure: the sigmoid f(x) as a decision surface in 2 dimensions.]
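To make the three roles concrete (a short derivation, not on the original slide): the decision surface is where the sigmoid crosses 1/2,

$$f = \tfrac{1}{2} \iff A X + b = 0,$$

so the direction of $A$ fixes the orientation, the magnitude $\|A\|$ controls how quickly $f$ moves from 0 to 1 (the thickness of the margin), and $b$ shifts the surface away from the origin.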
Cost Function
• We want a different error measure that is better suited for 0/1 data.
• This can be derived from maximizing the probability of the data again.
$$P(Y_n = 1 \mid X_n) = f(A X_n + b)$$
$$P(Y_n = 0 \mid X_n) = 1 - f(A X_n + b)$$
$$P(Y_n \mid X_n) = f(X_n)^{Y_n} \, (1 - f(X_n))^{1 - Y_n}$$
$$\mathrm{Error} = -\sum_{n=1}^{N} \Big[ Y_n \log f(X_n) + (1 - Y_n) \log\big(1 - f(X_n)\big) \Big]$$
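As a sketch of this error in code (assuming NumPy; the helper name cross_entropy_error and the eps clipping are my own additions, the latter just to keep the logs finite):

import numpy as np

def cross_entropy_error(A, b, X, Y, eps=1e-12):
    # Error = -sum_n [ Y_n log f(X_n) + (1 - Y_n) log(1 - f(X_n)) ]
    # X: (N, d) data matrix, Y: (N,) labels in {0, 1}.
    f = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # P(Y_n = 1 | X_n) for every n
    f = np.clip(f, eps, 1.0 - eps)           # avoid log(0)
    return -np.sum(Y * np.log(f) + (1 - Y) * np.log(1 - f))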
Learning A,b
• Again, we take the derivatives of the Error with respect to the parameters.
• This time, however, we can’t solve them analytically, so we use gradient descent.
$$A \leftarrow A - \eta \, \frac{d\,\mathrm{Error}}{dA}, \qquad b \leftarrow b - \eta \, \frac{d\,\mathrm{Error}}{db}$$

(Here η is a small step size.)
Gradients for Logistic Regression
• After the math (on the white-board) we find:
$$\frac{\partial\,\mathrm{Error}}{\partial A} = -\sum_{n} \big[\, Y_n \big(1 - f(X_n)\big) - (1 - Y_n)\, f(X_n) \,\big] \, X_n^{T}$$
$$\frac{\partial\,\mathrm{Error}}{\partial b} = -\sum_{n} \big[\, Y_n \big(1 - f(X_n)\big) - (1 - Y_n)\, f(X_n) \,\big]$$
Note: the first term in each equation (multiplied by $Y_n$) only sums over data with Y=1,
while the second term (multiplied by $(1 - Y_n)$) only sums over data with Y=0.
Follow the gradient until the change in A, b falls below a small threshold (e.g. 1e-6).
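Putting the gradients and the update rule together, a minimal gradient-descent sketch (the step size eta = 0.1 and the iteration cap are assumptions, not from the slides; note that $Y_n(1 - f) - (1 - Y_n) f$ simplifies to $Y_n - f(X_n)$):

import numpy as np

def train_logistic(X, Y, eta=0.1, tol=1e-6, max_iters=100000):
    d = X.shape[1]
    A, b = np.zeros(d), 0.0
    for _ in range(max_iters):
        f = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # f(X_n) for all n
        r = Y - f                # Y_n(1 - f) - (1 - Y_n) f = Y_n - f(X_n)
        step_A = eta * (r @ X)   # equals -eta * dError/dA
        step_b = eta * np.sum(r) # equals -eta * dError/db
        A, b = A + step_A, b + step_b
        # stop once the change in A, b falls below the threshold
        if max(np.max(np.abs(step_A)), abs(step_b)) < tol:
            break
    return A, b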
Classification
• Once we have found the optimal values for A, b, we classify future data with:
$$Y_{\mathrm{new}} = \mathrm{round}\big( f(X_{\mathrm{new}}) \big)$$
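In code, this rule is a single comparison (a sketch, assuming the A, b returned by the training loop above; f >= 0.5 is used rather than np.round to make the tie-breaking explicit):

import numpy as np

def classify(A, b, X_new):
    # Y_new = round(f(X_new)): predict class 1 whenever P(Y = 1 | X_new) >= 0.5
    f = 1.0 / (1.0 + np.exp(-(X_new @ A + b)))
    return (f >= 0.5).astype(int)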
• Least squares and logistic regression are parametric methods, since
all the information in the data is stored in the parameters A, b; i.e.,
after learning you can toss out the data.
• Also, the decision surface is always linear; its complexity does not grow
with the amount of data.
• We have imposed our prior knowledge that the decision surface should be linear.
A Real Example
(collaboration with S. Cole)
• Fingerprints are matched against a database.
• Each match is scored.
• Using logistic regression we try to predict whether a future match is real or false.
• Human fingerprint examiners claim 100% accuracy. Is this true?
Exercise
• You have laid your hands on a dataset where each data point has a single attribute
and a class label (0 or 1). You train a logistic regression classifier.
A new data case is presented. What do you do to decide which class it falls in
(use an equation or pseudo-code)?
• How many parameters are there to tune for this problem?
Explain what these parameters mean in terms of the function P(Y=1|X).