Logistic Regression


Logistic Regression
• KNN Ch. 14 (pp. 555-618)
• MINITAB User’s Guide
• SAS EM documentation

Regression Models with Binary Response Variable
In many applications the response variable has only two possible outcomes (0/1):
• In a study of liability insurance possession, using age of head of household, amount of liquid assets, and type of occupation of head of household as predictors, the response variable had two possible outcomes: household has liability insurance (=1), or household does not have liability insurance (=0)
• The financial status of a firm (sound status, headed toward insolvency) can be coded as 0/1
• Blood pressure status (high blood pressure, not high blood pressure) can be coded as 0/1

Meaning of the Response Function for Binary Outcomes
Consider the simple linear regression model
Yi = b0 + b1Xi + ei,   Yi = 0, 1
E{Yi} = b0 + b1Xi
In this case, the expected response E{Yi} has a special meaning. Consider Yi to be a Bernoulli random variable:

Yi   Probability
1    P(Yi = 1) = pi
0    P(Yi = 0) = 1 - pi

Meaning of the Response Function for Binary Outcomes
Using the definition of the expected value of a random variable,
E{Yi} = 1(pi) + 0(1 - pi) = pi = P(Yi = 1)
E{Yi} = b0 + b1Xi = pi
Therefore, the mean response E{Yi} is the probability that Yi = 1 when the level of the predictor variable is Xi.
[Figure: the linear response function E{Y} = b0 + b1X plotted against X, with E{Y} between 0 and 1]

Problems when Response Variable is Binary
1. Error terms are not normal:
• At each X level, the error cannot be normally distributed, since it takes only 2 possible values depending on whether Y is 0 or 1
2. Error variance is not constant:
• The error variance equals pi(1 - pi), which depends on X through E{Yi} = pi, and is therefore not constant
3. Constraints on the response function:
• We need a response function whose values stay between 0 and 1, and a straight line cannot guarantee that

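The sketch below is my own illustration, not part of the slides: it fits an ordinary least-squares line to simulated 0/1 data to make the three problems concrete. The simulated predictor and true probabilities are assumptions chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumed values, for illustration only): the true
# probability of Y = 1 rises linearly with X.
n = 200
x = rng.uniform(0, 10, n)
p_true = 0.05 + 0.09 * x                 # true P(Y = 1 | X = x)
y = rng.binomial(1, p_true)              # binary 0/1 response

# Fit Y on X by ordinary least squares (a straight-line response function).
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ b
residuals = y - fitted                   # Problem 1: at each X, the residual
                                         # takes only two possible values

# Problem 2: the error variance p(1 - p) changes with X.
print("residual variance for X < 2:    ", residuals[x < 2].var())
print("residual variance for 4 < X < 6:", residuals[(x > 4) & (x < 6)].var())

# Problem 3: nothing keeps a straight line inside [0, 1]; predicting just
# beyond the observed X range already gives an impossible "probability".
print("fitted value at X = 12:", np.array([1.0, 12.0]) @ b)
```
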
Link Functions
Cumulative distribution functions have a sigmoid shape, which makes them useful as response functions for a regression model with a binary outcome. The inverse of such a distribution function, applied to the probability, is called the link function.
We want to choose the link function that best fits our data. Goodness-of-fit statistics can be used to compare fits obtained with different link functions:

Name            Link Function                  Distribution   Mean                    Variance
logit           g(pi) = log(pi / (1 - pi))     logistic       0                       π²/3
normit/probit   g(pi) = Φ^-1(pi)               normal         0                       1
gompit          g(pi) = log(-log(1 - pi))      Gumbel         -γ (Euler's constant)   π²/6

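As a quick reference, here is a minimal sketch (not part of the original slides) of the three link functions from the table, using SciPy's standard normal quantile function for the probit:

```python
import numpy as np
from scipy.stats import norm

def logit(p):
    """Logit link: the log-odds of p (inverse of the logistic CDF)."""
    return np.log(p / (1 - p))

def probit(p):
    """Normit/probit link: the standard normal quantile of p."""
    return norm.ppf(p)

def gompit(p):
    """Gompit (complementary log-log) link, based on the Gumbel distribution."""
    return np.log(-np.log(1 - p))

p = np.array([0.1, 0.5, 0.9])
for name, g in [("logit", logit), ("probit", probit), ("gompit", gompit)]:
    print(name, g(p))
```

All three map probabilities in (0, 1) to the whole real line; the choice among them is usually judged by goodness of fit, as noted above.
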
Logistic Regression Assumption
Assumption: the logit transformation of the probabilities of the target value results in a linear relationship with the input variables.

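One informal way to eyeball this assumption is to bin a continuous input and look at the empirical logit of the target across bins: roughly linear binned logits support the assumption. The sketch below is my own illustration on simulated data, not part of the course material.

```python
import numpy as np
import pandas as pd

def empirical_logits(x, y, bins=10):
    """Bin a continuous input and compute the empirical logit of the
    binary target in each bin."""
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True).agg(x_mid=("x", "mean"),
                                                   p=("y", "mean"))
    # Small continuity correction so bins with p = 0 or 1 do not blow up.
    eps = 0.5 / len(df)
    grouped["logit"] = np.log((grouped["p"] + eps) / (1 - grouped["p"] + eps))
    return grouped[["x_mid", "logit"]]

# Example with simulated data (assumed, for illustration only).
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))
print(empirical_logits(x, y))
```
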
Linear versus Logistic Regression
Linear Regression:
• Target is an interval variable.
• Input variables have any measurement level.
• Predicted values are the mean of the target variable at the given values of the input variables.
Logistic Regression:
• Target is a discrete (binary or ordinal) variable.
• Input variables have any measurement level.
• Predicted values are the probability of a particular level(s) of the target variable at the given values of the input variables.

Interpretation of Parameter Estimates
The interpretation of the parameter estimates depends on
• The link function
• The reference event (1 or 0)
• The reference factor levels (for numerical factors, the reference level is the smallest value)
The logit link function provides the most natural interpretation of the estimated coefficients:
• The odds of the reference event are the ratio of P(event) to P(not event). The estimated coefficient of a predictor (factor or covariate) is the estimated change in the log of P(event)/P(not event) for each unit change in the predictor, assuming the other predictors remain constant (a small worked example follows below).

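To make the log-odds interpretation concrete, here is a tiny worked example; the coefficient value 0.7 is a made-up number, not an estimate from the HMEQ model.

```python
import math

b = 0.7  # hypothetical estimated coefficient on the logit (log-odds) scale

# A one-unit increase in the predictor adds b to log(P(event)/P(not event)),
# which multiplies the odds of the event by exp(b), other predictors held fixed.
odds_ratio = math.exp(b)
print(f"odds ratio per unit change: {odds_ratio:.2f}")  # about 2.01
```
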
Parametric Models
A parametric model expresses the expected response as a function g of the inputs x and a parameter vector w:
E(Y | X = x) = g(x; w)
In a generalized linear model, the expected response is obtained by applying an inverse link g^-1 to a linear combination of the inputs:
p(x) = E(Y | X = x) = g^-1(w0 + w1x1 + … + wpxp)
[Figure: generalized linear model fitted to training data]

Logistic Regression Models
The logistic regression model uses the logit link, so the log odds are a linear function of the inputs:
logit(p) = log(odds) = log(p / (1 - p)) = w0 + w1x1 + … + wpxp
[Figure: the fitted probability p plotted over the training data as a sigmoid curve rising from 0 to 1, passing through p = 0.5 where the linear predictor is 0]

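A minimal sketch of the model in code (the simulated data and the use of scikit-learn are my assumptions, not the course's tools), showing that the fitted probabilities are just the inverse logit, i.e. the sigmoid, of the linear predictor:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated training data (assumed, for illustration only).
X = rng.normal(size=(500, 2))
p_true = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0]))))
y = rng.binomial(1, p_true)

# Fit the logistic regression model: logit(p) = w0 + w1*x1 + w2*x2.
model = LogisticRegression().fit(X, y)
w0 = model.intercept_[0]
w = model.coef_[0]

# Recover probabilities by inverting the logit (sigmoid function).
logits = w0 + X @ w
p_hat = 1 / (1 + np.exp(-logits))

# Same numbers as the built-in predicted probabilities.
print(np.allclose(p_hat, model.predict_proba(X)[:, 1]))
```
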
Changing the Odds
log(p / (1 - p)) = w0 + w1x1 + … + wpxp
log(p´ / (1 - p´)) = w0 + w1(x1 + 1) + … + wpxp
Here p´ is the probability after increasing x1 by one unit. Subtracting the first equation from the second shows that the log odds change by w1, so the odds ratio is
(p´ / (1 - p´)) / (p / (1 - p)) = exp(w1)
[Figure: effect of a one-unit change in x1, shown on the training data]

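A quick numerical check of the odds-ratio identity above; the coefficient and input values are made up for illustration, not estimates from HMEQ:

```python
import numpy as np

# Hypothetical fitted coefficients and inputs (assumed for illustration).
w0, w1, w2 = -1.0, 0.8, -0.3
x1, x2 = 2.0, 5.0

def prob(x1, x2):
    """Fitted probability from the logistic model."""
    return 1 / (1 + np.exp(-(w0 + w1 * x1 + w2 * x2)))

p = prob(x1, x2)
p_prime = prob(x1 + 1, x2)          # increase x1 by one unit

odds = p / (1 - p)
odds_prime = p_prime / (1 - p_prime)

print(odds_prime / odds)            # odds ratio computed from the two probabilities
print(np.exp(w1))                   # identical to exp(w1)
```
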
Regression diagnostics – Residual Analysis
• To identify poorly fit factor/covariate patterns, use the Pearson residual, which measures the difference between the actual and the predicted observation.
• To identify factor/covariate patterns with a strong influence on the parameter estimates, use delta beta (based on Pearson residuals), which measures the changes in the coefficients when the j-th factor/covariate pattern is removed.
• To identify factor/covariate patterns with a large leverage, use the leverage (Hi) of the j-th factor/covariate pattern, a measure of how unusual the predictor values are.

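If you want to reproduce these diagnostics outside SAS EM, the sketch below is one way to do it on simulated data; statsmodels is my choice of tool (not the course's), and the delta beta shown here is a per-observation one-step approximation rather than the grouped factor/covariate-pattern version.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data (assumed, for illustration only).
X = sm.add_constant(rng.normal(size=(300, 2)))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([-0.5, 1.0, -1.5]))))

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
p_hat = fit.fittedvalues                       # fitted probabilities

# Pearson residuals: flag poorly fit observations.
pearson_resid = fit.resid_pearson

# Leverage: diagonal of the GLM hat matrix
# H = W^(1/2) X (X'WX)^(-1) X' W^(1/2), with W = diag(p(1 - p)).
Xw = X * np.sqrt(p_hat * (1 - p_hat))[:, None]
xtwx_inv = np.linalg.inv(Xw.T @ Xw)
leverage = np.sum((Xw @ xtwx_inv) * Xw, axis=1)

# Delta beta (one-step approximation): change in the coefficients
# when observation j is removed.
delta_beta = (X @ xtwx_inv) * ((y - p_hat) / (1 - leverage))[:, None]

print("largest |Pearson residual|:", np.abs(pearson_resid).max())
print("largest leverage:", leverage.max())
print("largest |delta beta|:", np.abs(delta_beta).max())
```
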
The Home Equity Loan Case
HMEQ Overview
• Determine who should be approved for a home equity loan.
• The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan.
• The input variables are variables such as the amount of the loan, the amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.

HMEQ
– The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).
– The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.

Original HMEQ data

Name      Model Role   Measurement Level   Description
BAD       Target       Binary              1 = defaulted on loan, 0 = paid back loan
REASON    Input        Binary              HomeImp = home improvement, DebtCon = debt consolidation
JOB       Input        Nominal             Six occupational categories
LOAN      Input        Interval            Amount of loan request
MORTDUE   Input        Interval            Amount due on existing mortgage
VALUE     Input        Interval            Value of current property
DEBTINC   Input        Interval            Debt-to-income ratio
YOJ       Input        Interval            Years at present job
DEROG     Input        Interval            Number of major derogatory reports
CLNO      Input        Interval            Number of trade lines
DELINQ    Input        Interval            Number of delinquent trade lines
CLAGE     Input        Interval            Age of oldest trade line in months
NINQ      Input        Interval            Number of recent credit inquiries

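For readers who want to explore the same data outside SAS EM, a hedged pandas sketch follows; the file name hmeq.csv is an assumption, since the slides access the data through Enterprise Miner.

```python
import pandas as pd

# File name is an assumption; the course accesses HMEQ inside SAS EM.
hmeq = pd.read_csv("hmeq.csv")

# Target and inputs as described in the table above.
target = "BAD"
inputs = ["REASON", "JOB", "LOAN", "MORTDUE", "VALUE", "DEBTINC",
          "YOJ", "DEROG", "CLNO", "DELINQ", "CLAGE", "NINQ"]

print(hmeq.shape)             # expect (5960, 13)
print(hmeq[target].mean())    # share of bad loans, roughly 0.20
print(hmeq[inputs].dtypes)
```
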
HMEQ: Modeling Goal
– The credit scoring model computes a probability of a given loan applicant defaulting on loan repayment. A threshold is selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.

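A minimal sketch of that decision rule; the 0.5 threshold is a placeholder, since in practice the threshold would be chosen from the costs of wrong approvals and rejections.

```python
import numpy as np

def recommend(prob_default, threshold=0.5):
    """Recommend rejection when the predicted probability of default
    exceeds the chosen threshold (the threshold value is a placeholder)."""
    return np.where(prob_default > threshold, "reject", "approve")

# Example with made-up predicted probabilities.
p = np.array([0.05, 0.30, 0.75, 0.92])
print(recommend(p, threshold=0.5))
```
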
HMEQ: two added variables
– For model comparison purposes, we added two variables:
• BEHAVIOR (good/bad), which precisely mirrors the 0/1 values in BAD, to see how we can perfectly predict BAD using insider information
• FLIPCOIN (Head/Tail), which is completely random, to see if we can predict BAD using random flips of a coin

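A hedged sketch of how two such columns could be constructed in pandas; the codings ("good"/"bad", "Head"/"Tail") follow the slide, while the construction itself is my assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def add_comparison_variables(hmeq: pd.DataFrame) -> pd.DataFrame:
    """Add the two comparison predictors to the HMEQ data frame."""
    out = hmeq.copy()
    # BEHAVIOR mirrors BAD exactly: a "perfect" insider predictor.
    out["BEHAVIOR"] = np.where(out["BAD"] == 1, "bad", "good")
    # FLIPCOIN is pure noise: a predictor carrying no real information.
    out["FLIPCOIN"] = rng.choice(["Head", "Tail"], size=len(out))
    return out
```
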
Introducing SAS Enterprise Miner v.5.3
Enterprise-grade (and expensive!) Data Mining package
Implemented Methodology:
– Sample-Explore-Modify-Model-Assess (SEMMA)
Available Modeling Tools:
– Logistic Regression
– Many others, such as Decision Trees, Neural Networks, Clustering, Market-Basket, etc.

Analysis of HMEQ in SAS EM
Three Logistic Regression nodes were added to the analysis diagram. In order to compare them, a Compare node was added.

SAS EM 4.3: A more accessible version
Accessible through base SAS at UNT CoB.
Start SAS 9.3. From the SAS menu bar, select Solutions > Analysis > Enterprise Miner.

Logistic Regression results (all predictors)
[SAS EM output not reproduced in this transcript]

Logistic Regression results (stepwise, final model)
[SAS EM output not reproduced in this transcript]

Interpretation of Odds Ratio results
Predictors that increase the probability of defaulting on the loan (odds ratio > 1):
• DEBTINC
• DELINQ
• DEROG
• NINQ
Predictors that decrease the probability of defaulting on the loan (odds ratio < 1):
• CLNO
• YOJ

Model Comparison
Perfect Regression is, well, perfect. In Baseline Regression, about 20% of the borrowers default regardless of the fitted value. Stepwise Regression falls somewhere between the other two models.