Classical Assumptions
ECLT 5810
Brief introduction to Logistic Regression
Review of Ordinary Least Squares (OLS) Regression
A “curve fitting on data points” procedure
Achieved by minimizing the total squared
distance between the curve and the data
points
The model usually looks like
y = β0 + β1x1 + β2x2 + · · · + βnxn
Our analyses of such models usually cover the following (see the sketch after this list):
Whether the beta coefficients are significantly positive/negative/different from a certain value, with estimation errors considered (done by t-statistics on the beta estimates).
Whether the model has good explanatory power for the dependent variable, with estimation errors considered (done by an F-statistic on the R^2 measure).
The implications of the model, i.e., does y depend on x? To what extent? Are there any interaction effects? (Done by differentiation/differencing on the estimated model.)
Interval prediction of the dependent variable given the independent variables.
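As a concrete sketch of these analyses (in Python with synthetic data rather than SAS; the statsmodels library and all data values here are assumptions of the example, not part of the course material):

```python
import numpy as np
import statsmodels.api as sm

# synthetic data following y = 1 + 2*x1 - 0.5*x2 + error
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the β0 column
model = sm.OLS(y, X).fit()
print(model.summary())  # t-statistics on betas, F-statistic, R^2
```

The summary table carries exactly the quantities listed above: a t-statistic for each beta, the F-statistic on R^2, and the standard errors needed for interval prediction.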
Classical Assumptions for OLS
However, all of those analyses are done under the following assumptions.
A1 (Linear in Parameters)
y = β0 + β1x1 + β2x2 + · · · + error.
A2 (No perfect collinearity) No independent
variable is constant or a perfect linear
combination of the others
A1 and A2 can be fulfilled by choosing a suitable form of the equation.
A3 (Zero conditional mean of errors)
E(error_t | X) = 0, t = 1, 2, · · · (number of observations),
where X is the collection of all independent variables,
X = (x1, x2, · · ·, xn).
Under A1-A3 the OLS estimators are unbiased, i.e.
E(estimated βj) = βj for all j.
A4 (Homoskedasticity in errors)
Var(error_t | X) = σ^2 (i.e. independent of X), t = 1, 2, · · ·.
A5 (No serial correlation in errors)
Corr(error_t, error_s | X) = 0, for t ≠ s.
Under A1-A5, the OLS estimators are the minimum-variance linear unbiased estimators conditional on X.
A6 (Normality of errors) The errors are independently and identically distributed as N(0, σ^2).
Under A1-A6, the OLS estimators are normally distributed conditional on X, and the t-statistics on the parameters and the F-statistic on the R^2 can be used for statistical inference.
A3-A6 are usually assumed to be true unless
there is significant evidence/ reason against
them.
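A quick way to see the unbiasedness claim (E(estimated βj) = βj under A1-A3) is a small Monte Carlo simulation. This is a sketch with assumed true values β0 = 1 and β1 = 2, and errors drawn to satisfy A3-A6:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 2000
beta0, beta1 = 1.0, 2.0
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

estimates = []
for _ in range(reps):
    # A3-A6: errors are i.i.d. normal, mean zero, independent of X
    y = beta0 + beta1 * x + rng.normal(size=n)
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append(bhat)

print(np.mean(estimates, axis=0))  # close to (1.0, 2.0): unbiased
```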
Early models for classification
As our main target in data mining is to make predictions, the dependent variable is usually nominal/ordinal/binary in nature. Usually we use a binary y to represent this, i.e. y = 1 for yes and 0 for no.
An early model is the linear probability model, which regresses the binary y on the explanatory variables X. As y is binary, the predicted value usually falls around the range 0 to 1, so people used this model to predict the probability of an event.
However, such a model violates A3, A4 and A6. Also, the predicted value can fall outside the range 0 to 1, which makes the model much less useful (see the sketch below).
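A small sketch (synthetic data, all values assumed for illustration) showing the out-of-range problem of the linear probability model:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))       # true probabilities
y = (rng.uniform(size=n) < p).astype(float)  # binary outcome

# linear probability model: OLS of the binary y on x
X = np.column_stack([np.ones(n), x])
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ bhat
print(yhat.min(), yhat.max())  # typically spills outside [0, 1]
```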
The problem can be rectified by introducing a threshold such that, when the predicted y is greater than the threshold, we classify y as 1. This becomes the simplest neural network model, which will be introduced later.
However, what we then obtain is a decision rather than a probability, which might be useful in some cases. Also, the relation between the probability and the explanatory variables becomes less clear.
Statisticians invented logistic regression to solve the problem.
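The threshold rule itself is a one-liner; the predicted values below are hypothetical, standing in for linear-probability-model output:

```python
import numpy as np

yhat = np.array([-0.1, 0.3, 0.7, 1.2])  # hypothetical predicted values
decision = (yhat > 0.5).astype(int)     # classify as 1 above the threshold
print(decision)  # [0 0 1 1] -- a decision, not a probability
```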
Logistic Regression
The idea is to use a one-to-one mapping to map the probability from the range [0, 1] to the whole real line. Then there is no problem no matter what value the right-hand side takes.
Three common transformations/link functions (provided by SAS):
Logit: ln(p/(1-p))
(We call this the log odds)
Probit: the inverse normal CDF of p
(Recall: the normal table's mapping scheme)
Complementary log-log: ln(-ln(1-p))
The choice of link function depends on your purpose rather than on performance. They all perform about equally well, but the implications are a bit different.
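The three link functions are easy to compute directly (a Python sketch; the probit uses scipy's inverse normal CDF):

```python
import numpy as np
from scipy.stats import norm

p = np.array([0.1, 0.5, 0.9])

logit = np.log(p / (1 - p))       # log odds
probit = norm.ppf(p)              # inverse of the normal CDF
cloglog = np.log(-np.log(1 - p))  # complementary log-log

print(logit, probit, cloglog)  # each maps (0,1) onto the real line
```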
However, as the model is no longer in linear form, ordinary least squares cannot be used. Furthermore, if we put the binary y directly into the transformation, we get positive/negative infinity (demonstrated below).
We use the Maximum Likelihood Estimation (MLE) method instead, in which we choose the beta coefficients that maximize the probability of the data as we see it now.
MLE needs fewer assumptions than OLS, but much less inference can be made, especially for logistic regression.
Also, as both MLE and OLS use only one beta coefficient to describe the effect an explanatory variable brings about, data scaling/normalization is particularly important.
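The infinity problem is easy to demonstrate: applying the logit transform to the raw binary y gives ±infinity, which is why we model the probability instead.

```python
import numpy as np

y = np.array([0.0, 1.0])  # raw binary outcomes
with np.errstate(divide="ignore"):
    print(np.log(y / (1 - y)))  # [-inf  inf]: cannot be regressed on X
```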
Example on Logit
Assume we believe the relation between the probability p that an event is "yes" and an independent variable x can be described by the equation
ln(p(x)/(1-p(x))) = a + bx
Then
p(x) = exp(a+bx) / [1+exp(a+bx)]
If we have 4 data points (Yes,x1), (No,x2), (Yes,x3), (No,x4) and assume they are mutually independent, then the probability that we see these 4 data points is the product
p(x1)[1-p(x2)]p(x3)[1-p(x4)]
and MLE tries to maximize this by choosing suitable a and b.
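A sketch of this maximization in Python (the x values are assumed purely for illustration; scipy's general-purpose optimizer stands in for SAS's fitting routine, and we minimize the negative log-likelihood for numerical stability):

```python
import numpy as np
from scipy.optimize import minimize

# the four data points (Yes, x1), (No, x2), (Yes, x3), (No, x4)
x = np.array([1.0, 0.5, 0.1, -0.5])  # assumed x values
y = np.array([1, 0, 1, 0])           # Yes = 1, No = 0

def neg_log_likelihood(params):
    a, b = params
    p = np.exp(a + b * x) / (1 + np.exp(a + b * x))
    # likelihood = p(x1) * [1 - p(x2)] * p(x3) * [1 - p(x4)]
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x)  # the (a, b) maximizing the likelihood
```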
Reading the Report
Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (SBC):
(Compare to: F-test on adjusted R^2 for OLS)
- Both take smaller values for a higher maximized likelihood, and higher values when more explanatory variables are used (to penalize over-fitting).
- So a smaller value is preferred (though it is not the only consideration when choosing a model).
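Both criteria are simple functions of the maximized log-likelihood; a sketch with assumed values:

```python
import numpy as np

# hypothetical fit: maximized log-likelihood, k parameters, n observations
logL, k, n = -120.5, 3, 200

aic = -2 * logL + 2 * k          # Akaike's Information Criterion
sbc = -2 * logL + k * np.log(n)  # Schwarz's Bayesian Criterion
print(aic, sbc)  # smaller is better; both penalize extra parameters
```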
T-score
(Compare to: t-test on estimated betas for OLS)
It is the estimate divided by its standard error. We may treat it like a t-test as in OLS and construct a confidence interval for the betas, but in practice this works only asymptotically. We just take a large t-score as an indicator of a possibly significant effect; no formal hypothesis testing can be done.
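For example, with an assumed estimate and standard error taken from a report:

```python
estimate, std_err = 0.84, 0.31  # hypothetical report values

t_score = estimate / std_err     # large |t| hints at a real effect
ci = (estimate - 1.96 * std_err,
      estimate + 1.96 * std_err) # rough 95% interval (asymptotic only)
print(t_score, ci)
```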
Wald's Chi-square (Compare to: t-test for OLS)
We can treat an effect as significant if the tail probability is small enough (< 5%).
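The Wald chi-square for a single coefficient is the squared t-score, compared against a chi-square tail (a sketch with the same hypothetical values as above):

```python
from scipy.stats import chi2

estimate, std_err = 0.84, 0.31    # hypothetical report values
wald = (estimate / std_err) ** 2  # Wald chi-square with 1 degree of freedom
p_value = chi2.sf(wald, df=1)     # tail probability
print(wald, p_value)              # treat as significant if p_value < 0.05
```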
If we are using the model to predict the outcome rather than the probability of that outcome (the case when the criterion is set to minimize loss), the interpretation of the misclassification rate/profit and loss/ROC curve/lift chart is similar to that for decision trees.
Some scholars suggest the prediction interval for the probability P of the event given the independent variables be
P_est ± z_{1-a/2} [P_est(1-P_est)/n]^(1/2)
with z being the z-score from the normal table, a the significance level, and n the number of data points. But we do not have this in SAS.
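Since SAS does not provide it, the interval is easy to compute by hand (a sketch with assumed values):

```python
import numpy as np
from scipy.stats import norm

p_est, n_data, a = 0.3, 400, 0.05  # hypothetical estimate, sample size, level

z = norm.ppf(1 - a / 2)
half_width = z * np.sqrt(p_est * (1 - p_est) / n_data)
print(p_est - half_width, p_est + half_width)  # interval for the probability
```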
The interpretation of the model form is similar to OLS, using techniques like differentiation and differencing.
One common use: for a Logit model of the form
f(x) = ln(P(x)/(1-P(x))) = a + bx, x being binary,
we have f(1) = a + b and f(0) = a, so f(1) - f(0) = b and the odds ratio between x = 1 and x = 0 is exp(b). For small P(0) and P(1) the odds are approximately the probabilities, so ln(P(1)/P(0)) ≈ b, i.e.
P(1) ≈ exp(b) · P(0).
Hence P(1) is about exp(b) times as big as P(0). We can draw conclusions like "Having something (x) done increases the probability to exp(b) times that of not having it done" (checked numerically in the sketch below).
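A numerical check of this approximation, with assumed coefficients chosen so that both probabilities are small:

```python
import numpy as np

a, b = -3.0, 0.7  # hypothetical logit coefficients, x binary

def p(x):
    return np.exp(a + b * x) / (1 + np.exp(a + b * x))

print(p(1) / p(0), np.exp(b))  # close, since P(0) and P(1) are both small
```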