ICS 178
Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 6: Logistic Regression
Logistic Regression
• This is also regression, but with targets $Y \in \{0,1\}$. I.e. it is classification!
• We will fit a regression function to P(Y=1|X).
Linear regression: $Y_n = A X_n + b$
Logistic regression: $P(Y_n = 1 \mid X_n) = f(A X_n + b)$, with the sigmoid
$f(X) = \dfrac{1}{1 + \exp[-(A X + b)]}$
[Figure: a straight-line fit (linear regression) beside an S-shaped fit (logistic regression)]
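A minimal NumPy sketch of this model (not from the slides; the example values for A, b, and X are hypothetical):

import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + exp(-z)), squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(X, A, b):
    # P(Y = 1 | X) = f(A X + b) for a single data-case X.
    return sigmoid(np.dot(A, X) + b)

A = np.array([1.5, -0.7])     # example weights
b = 0.2                       # example offset
X = np.array([0.4, 1.1])      # one 2-dimensional data-case
print(p_y1_given_x(X, A, b))  # prints a probability in (0, 1)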
Sigmoid function f(x)
$P(Y_n = 1 \mid X_n) = f(A X_n + b), \qquad f(X) = \dfrac{1}{1 + \exp[-(A X + b)]}$
[Figure: the sigmoid curve, with data-points with Y=1 along the top and data-points with Y=0 along the bottom]
In 2 Dimensions
A, b determine:
1) orientation
2) thickness (margin)
3) offset
of the decision surface.
[Figure: sigmoid f(x) plotted over the 2-D input plane]
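A detail worth making explicit here: the decision surface is where the sigmoid crosses 1/2,

$P(Y=1 \mid X) = \tfrac{1}{2} \;\Longleftrightarrow\; A X + b = 0,$

so the direction of $A$ sets the orientation, the magnitude $\|A\|$ sets how steeply the sigmoid rises (the thickness/margin), and $b$ sets the offset from the origin.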
Cost Function
• We want a different error measure that is better suited for 0/1 data.
• This can be derived from maximizing the probability of the data again.
$P(Y_n = 1 \mid X_n) = f(A X_n + b)$
$P(Y_n = 0 \mid X_n) = 1 - f(A X_n + b)$
$P(Y_n \mid X_n) = f(X_n)^{Y_n} \, (1 - f(X_n))^{1 - Y_n}$ (writing $f(X_n)$ as shorthand for $f(A X_n + b)$)
$\text{Error} = -\sum_{n=1}^{N} \left[ Y_n \log f(X_n) + (1 - Y_n) \log(1 - f(X_n)) \right]$
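A minimal NumPy sketch of this error (the clipping is an added numerical-safety assumption, not something on the slide):

import numpy as np

def error(Y, F):
    # Y: (N,) array of 0/1 labels; F: (N,) array of f(A X_n + b) values.
    # Error = -sum_n [ Y_n log f(X_n) + (1 - Y_n) log(1 - f(X_n)) ]
    F = np.clip(F, 1e-12, 1 - 1e-12)  # keep log() away from 0
    return -np.sum(Y * np.log(F) + (1 - Y) * np.log(1 - F))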
Learning A,b
• Again, we take the derivatives of the Error with respect to the parameters.
• This time, however, we can't solve them analytically, so we use gradient descent.
$A \leftarrow A - \eta \, \dfrac{d\,\text{Error}}{dA}$
$b \leftarrow b - \eta \, \dfrac{d\,\text{Error}}{db}$
(with step size $\eta$)
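In code form, one update step is just the following (a sketch: the step-size symbol appears to have been dropped in the transcript, and the gradient values below are placeholders until the next slide):

import numpy as np

eta = 0.05                          # step size (assumed value)
A, b = np.zeros(2), 0.0             # current parameters, 2-D example
dError_dA = np.array([0.3, -0.1])   # placeholder gradient of the Error
dError_db = 0.05                    # (computed on the next slide)

A = A - eta * dError_dA             # A <- A - eta * dError/dA
b = b - eta * dError_db             # b <- b - eta * dError/db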
Gradients for Logistic Regression
• After the math (on the white-board) we find:
$\dfrac{\partial\,\text{Error}}{\partial A} = \sum_n \left[ -Y_n \,(1 - f(X_n)) + (1 - Y_n)\, f(X_n) \right] X_n^T$
$\dfrac{\partial\,\text{Error}}{\partial b} = \sum_n \left[ -Y_n \,(1 - f(X_n)) + (1 - Y_n)\, f(X_n) \right]$
Note: the first term in each equation (multiplied by $Y_n$) only sums over data with Y=1, while the second term (multiplied by $(1-Y_n)$) only sums over data with Y=0.
Follow the gradient until the change in A, b falls below a small threshold (e.g. 1e-6).
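Putting the gradients and the stopping rule together, a hedged end-to-end sketch (the bracketed gradient term simplifies to $f(X_n) - Y_n$; the 1/N averaging, step size, and toy data below are added assumptions, not from the slides):

import numpy as np

def sigmoid(z):
    # Logistic sigmoid, f(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, Y, eta=0.05, tol=1e-6, max_iters=100000):
    # X: (N, d) array of inputs; Y: (N,) array of 0/1 labels.
    N, d = X.shape
    A, b = np.zeros(d), 0.0
    for _ in range(max_iters):
        F = sigmoid(X @ A + b)      # f(A X_n + b) for every n
        # -Y_n (1 - f_n) + (1 - Y_n) f_n simplifies to f_n - Y_n:
        coeff = F - Y
        dA = coeff @ X              # sum_n [...] X_n^T
        db = np.sum(coeff)          # sum_n [...]
        A_new = A - eta * dA / N    # 1/N averaging is an added stabilizer,
        b_new = b - eta * db / N    # not on the slide
        # Stop once the change in A, b falls below the small threshold.
        if np.max(np.abs(A_new - A)) < tol and abs(b_new - b) < tol:
            return A_new, b_new
        A, b = A_new, b_new
    return A, b

# Hypothetical 1-D example: class 1 tends to have larger x.
X = np.array([[0.1], [0.4], [0.35], [1.2], [1.5], [1.1]])
Y = np.array([0, 0, 0, 1, 1, 1])
A, b = fit_logistic(X, Y)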
Classification
• Once we have found the optimal values for A, b, we classify future data with:
$Y_{\text{new}} = \text{round}\left( f(X_{\text{new}}) \right)$
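In code (a sketch; thresholding at 0.5 is what rounding $f$ amounts to):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(X_new, A, b):
    # Y_new = round(f(A X_new + b)): class 1 iff P(Y=1 | X_new) >= 0.5.
    return int(sigmoid(np.dot(A, X_new) + b) >= 0.5)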
• Least squares and logistic regression are parametric methods, since all the information in the data is stored in the parameters A, b; i.e. after learning you can toss out the data.
• Also, the decision surface is always linear; its complexity does not grow with the amount of data.
• We have imposed our prior knowledge that the decision surface should be linear.
A Real Example
(collaboration with S. Cole)
• Fingerprints are matched against a database.
• Each match is scored.
• Using logistic regression we try to predict whether a future match is real or false.
• Human fingerprint examiners claim 100% accuracy. Is this true?
Exercise
• You have laid your hands on a dataset where each data-case has a single attribute
and a class label (0 or 1). You train a logistic regression classifier.
A new data-case is presented. What do you do to decide which class it falls in
(use an equation or pseudo-code)?
• How many parameters are there to tune for this problem?
Explain what these parameters mean in terms of the function P(Y=1|X).