6-17 Logistic Regression
Chapter 6
Regression Algorithms in Data
Mining
Fit data
Time-series data: Forecast
Other data: Predict
Contents
Describes OLS (ordinary least squares)
regression and logistic regression
Describes linear discriminant analysis and
centroid discriminant analysis
Demonstrates techniques on small data sets
Reviews the real applications of each model
Shows the application of models to larger data
sets
6-2
Use in Data Mining
Telecommunications industry: customer turnover (churn)
One of the major analytic models for classification
problems.
Linear regression
The standard – ordinary least squares regression
Can use for discriminant analysis
Can apply stepwise regression
Nonlinear regression
More complex (but less reliable) data fitting
Logistic regression
When data are categorical (usually binary)
6-3
OLS Model
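$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$
where $Y$ is the dependent variable
$\beta_0$ is the intercept term
$\beta_1, \dots, \beta_n$ are the n coefficients for the independent variables
$\varepsilon$ is the error term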
6-4
OLS Regression
Uses intercept and slope coefficients ($\beta$) to
minimize the sum of squared error terms over all i
observations
Fits the data with a linear model
Time-series data:
Observations over past periods
Best fit line (in terms of minimizing sum of
squared errors)
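As a rough illustration of the best-fit line described above, the closed-form least-squares estimates for a single predictor can be sketched in Python. The period numbers and values below are made up for illustration and do not come from the slides; `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical time series: period number and observed value.
weeks = [1, 2, 3, 4, 5]
values = [5.0, 8.0, 11.0, 14.0, 17.0]  # constructed to lie exactly on 2 + 3*week

intercept, slope = fit_line(weeks, values)
print(f"value = {intercept:.3f} + {slope:.3f} * week")
```

Because the illustrative data lie exactly on a line, the fit recovers that line with zero error; real time-series data would leave nonzero residuals.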
6-5
Regression Output
R² : 0.987
Term        Coefficient   t       P
Intercept   0.642         0.286   0.776
Week        5.086         53.27   0
Requests = 0.642 + 5.086*Week
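A fitted equation like this one is used for forecasting by plugging in a future period. The short sketch below assumes only the two coefficients reported on the slide; the week number passed in is a hypothetical future period, and `forecast_requests` is an illustrative name.

```python
# Coefficients as reported in the regression output on the slide.
INTERCEPT = 0.642
WEEK_COEF = 5.086


def forecast_requests(week):
    """Forecast requests for a given week with the fitted linear model."""
    return INTERCEPT + WEEK_COEF * week


# Week 50 is a hypothetical future period chosen for illustration.
print(round(forecast_requests(50), 3))
```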
6-6
Example
R²
SSE
SST
6-7
Example
6-8
A graph of the time-series model
[Figure: (X1) Requests vs. (X2) Pred_lmreg_1]
6-9
Time-Series Forecast
[Figure: Time-series prediction]
6-10
Regression Tests
FIT:
SSE – sum of squared errors
Synonym: SSR – sum of squared residuals
R² – proportion of variance explained by model
Adjusted R² – adjusts calculation to penalize for
number of independent variables
Significance
F-test – test of overall model significance
t-test – test of significant difference between model
coefficient and zero
P – probability of a test statistic at least this extreme if the true coefficient were zero
(loosely, the chance the true coefficient is zero or on the other side of zero from the estimate)
See page 103
6-11
Regression Model Tests
SSE (sum of squared errors)
For each observation, subtract model value from
observed, square difference, total over all observations
By itself means nothing
Can compare across models (lower is better)
Can use to evaluate proportion of variance in data
explained by model
R²
Ratio of the explained sum of squares
(MSR) to the total sum of squares (SST)
SST = MSR + SSE, so R² = 1 - SSE/SST
0 ≤ R² ≤ 1
See page 104
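The SSE and R² definitions above can be sketched in a few lines of Python. The observed and predicted values are invented for illustration. The sketch computes R² as 1 - SSE/SST, which matches the explained-to-total ratio for an OLS fit that includes an intercept.

```python
def sum_squared_errors(observed, predicted):
    """SSE: squared difference between observed and model values, totaled."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def total_sum_of_squares(observed):
    """SST: squared deviation of each observation from the mean, totaled."""
    mean = sum(observed) / len(observed)
    return sum((o - mean) ** 2 for o in observed)


def r_squared(observed, predicted):
    """Proportion of variance explained: 1 - SSE/SST."""
    return 1 - sum_squared_errors(observed, predicted) / total_sum_of_squares(observed)


# Hypothetical observations and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

sse = sum_squared_errors(observed, predicted)
sst = total_sum_of_squares(observed)
r2 = r_squared(observed, predicted)
print(round(sse, 3), round(sst, 3), round(r2, 3))
```

SSE alone depends on the scale of the data, which is why it is only useful for comparing models fitted to the same data; dividing by SST removes that scale.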
6-12
Multiple Regression
Can include more than one independent variable
Trade-off:
Too many variables – many spurious, overlapping
information
Too few variables – miss important content
Adding variables never decreases R² (and almost always increases it)
Adjusted R² penalizes for additional independent
variables
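The penalty can be made concrete with the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1), for n observations and k independent variables. The sketch below applies it to the R² values reported for the hiring models in this deck. The sample size n = 20 is an assumption: it is not stated on the slides, but it is the value that reproduces the reported adjusted figures to within rounding.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


N_OBS = 20  # assumed sample size; not given in the source

full_model = adjusted_r2(0.252, N_OBS, k=5)     # five-variable hiring model
reduced_model = adjusted_r2(0.218, N_OBS, k=3)  # three-variable hiring model
print(round(full_model, 3), round(reduced_model, 3))
```

Under that assumption, dropping two weak variables lowers R² slightly but raises adjusted R², which is the trade-off the slide describes.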
6-13
Example: Hiring Data
Dependent Variable – Sales
Independent Variables:
Years of Education
College GPA
Age
Gender
College Degree
See pages 104-105
6-14
Regression Model
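Sales = 269025 - 17148*YrsEd - 7172*GPA + 4331*Age - 23581*Male + 31001*Degree
Variable    Coefficient   P
Intercept   269025
YrsEd       -17148        0.175
GPA         -7172         0.812
Age         +4331         0.116
Male        -23581        0.266
Degree      +31001        0.450
R² = 0.252   Adj R² = -0.015
Weak model; no coefficient is significant at the 0.10 level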
6-15
Improved Regression Model
Sales = 173284 - 9991*YrsEd + 3537*Age - 18730*Male
Variable    Coefficient   P
Intercept   173284
YrsEd       -9991         0.098*
Age         +3537         0.141
Male        -18730        0.328
R² = 0.218   Adj R² = 0.070
6-16
Logistic Regression
Data often ordinal or nominal
Regression based on continuous numbers
not appropriate
Need dummy variables
Binary – either are or are not
– LOGISTIC REGRESSION (probability of either
1 or 0)
Two or more categories
– DISCRIMINANT ANALYSIS (perform
regression for each outcome; pick the one that fits
best)
6-17
Logistic Regression
For dependent variables that are nominal or ordinal
Probability of acceptance of
case i to class j
Sigmoidal function
(in English, an S curve from 0 to 1)
$$P_j = \frac{1}{1 + e^{-\left(\beta_0 + \sum_i \beta_i x_i\right)}}$$
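The S-curve behavior of the logistic function is easy to see numerically. The sketch below assumes the standard form 1 / (1 + e^(-score)), where the score is the linear combination of intercept and weighted inputs. It branches on the sign of the score only to avoid floating-point overflow for very large magnitudes.

```python
import math


def logistic(score):
    """Map a linear score to a probability between 0 and 1."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For negative scores, use the algebraically equivalent form
    # e^score / (1 + e^score) so exp() never receives a large positive argument.
    e = math.exp(score)
    return e / (1.0 + e)


# Scores far below zero map near 0, zero maps to 0.5, far above zero maps near 1.
for s in (-6, -2, 0, 2, 6):
    print(s, round(logistic(s), 4))
```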
6-18
Insurance Claim Model
Fraud = 81.824 - 2.778*Age - 75.893*Male + 0.017*Claim - 36.648*Tickets + 6.914*Prior - 29.362*Atty Smith
Variable     Coefficient   P
Intercept    81.824
Age          -2.778        0.789
Male         -75.893       0.758
Claim        +0.017        0.757
Tickets      -36.648       0.824
Prior        +6.914        0.935
Atty Smith   -29.362       0.776
Can get probability by running score through logistic formula
See pages 107-109
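Running a score through the logistic formula might look like the sketch below. It rests on several assumptions: that the coefficients apply to unscaled inputs, that the score represents the log-odds of fraud, and that the logistic function has the standard form 1 / (1 + e^(-score)). The example case takes age, claim, tickets and prior claims from the first row of the fraud data in this deck; the `male` and `atty_smith` values are hypothetical, since that table does not list them.

```python
import math

# Coefficients as listed on the slide.
INTERCEPT = 81.824
COEFS = {
    "age": -2.778,
    "male": -75.893,
    "claim": 0.017,
    "tickets": -36.648,
    "prior": 6.914,
    "atty_smith": -29.362,
}


def logistic(score):
    """Numerically stable logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def fraud_score(case):
    """Linear score: intercept plus coefficient times value for each input."""
    return INTERCEPT + sum(COEFS[name] * value for name, value in case.items())


# male and atty_smith are hypothetical values chosen for illustration.
case = {"age": 52, "male": 0, "claim": 2000, "tickets": 0, "prior": 1, "atty_smith": 0}

z = fraud_score(case)
p = logistic(z)
print(round(z, 3), p)
```

With coefficients this large, most scores land far from zero, so the resulting probabilities sit very close to 0 or 1.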
6-19
Linear Discriminant Analysis
Group objects into a predetermined set of outcome
classes
Regression is one means of performing discriminant
analysis
2 groups: find cutoff for regression score
More than 2 groups: multiple cutoffs
6-20
Centroid Method (NOT regression)
Binary data
Divide training set into two groups by binary
outcome
Standardize data to remove scales
Identify means for each independent variable by
group (the CENTROID)
Calculate distance function
6-21
Fraud Data
Age   Claim   Tickets   Prior   Outcome
52    2000    0         1       OK
38    1800    0         0       OK
19    600     2         2       OK
21    5600    1         2       Fraud
41    4200    1         2       Fraud
6-22
Standardized & Sorted Fraud Data
Age     Claim   Tickets   Prior   Outcome
1       0.60    1         0.5     0
0.9     0.64    1         1       0
0       0.88    0         0       0
0.633   0.707   0.667     0.500   0   (centroid of outcome 0)
0.05    0       1         0       1
1       0.16    1         0       1
0.525   0.080   1.000     0.000   1   (centroid of outcome 1)
6-23
Distance Calculations
Variable   New    To 0            Value   To 1            Value
Age        0.50   (0.633-0.5)²    0.018   (0.525-0.5)²    0.001
Claim      0.30   (0.707-0.3)²    0.166   (0.08-0.3)²     0.048
Tickets    0      (0.667-0)²      0.445   (1-0)²          1.000
Prior      1      (0.5-1)²        0.250   (0-1)²          1.000
Totals                            0.879                   2.049
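The centroid calculation can be reproduced with a short Python sketch. It uses the standardized training rows and the new case from the slides, computes each group's centroid as the column means, and assigns the new case to the group with the smaller squared distance. Totals differ from the slide in the third decimal place because the slide rounds each centroid and term before summing.

```python
# Standardized rows from the slides: (age, claim, tickets, prior), split by outcome.
group_ok = [(1.0, 0.60, 1.0, 0.5), (0.9, 0.64, 1.0, 1.0), (0.0, 0.88, 0.0, 0.0)]
group_fraud = [(0.05, 0.0, 1.0, 0.0), (1.0, 0.16, 1.0, 0.0)]


def centroid(rows):
    """Mean of each variable across the rows of one group."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a, b):
    """Sum of squared differences between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


new_case = (0.50, 0.30, 0.0, 1.0)

d_ok = squared_distance(centroid(group_ok), new_case)
d_fraud = squared_distance(centroid(group_fraud), new_case)

# Assign the new case to the nearer centroid.
label = "OK" if d_ok < d_fraud else "Fraud"
print(round(d_ok, 3), round(d_fraud, 3), label)
```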
6-24
Discriminant Analysis with Regression
Standardized data, binary outcomes
Variable       Coefficient   P
Intercept      0.430         0.670
Age            -0.421        0.671
Gender         0.333         0.733
Claim          -0.648        0.469
Tickets        0.584         0.566
Prior Claims   -1.091        0.399
Attorney       0.573         0.607
R² = 0.804
Cutoff (average of the group averages): 0.429
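Classifying with a regression score and a cutoff could be sketched as follows. The coefficients and the cutoff are taken from the slide. Two things are assumptions: that scores above the cutoff map to outcome 1 (the fraud group, per the 0/1 coding in the standardized data), and the `gender` and `attorney` values of the example case, which are hypothetical. The other four inputs reuse the standardized new case from the distance calculations.

```python
# Coefficients and cutoff as listed on the slide (standardized inputs).
INTERCEPT = 0.430
COEFS = {
    "age": -0.421,
    "gender": 0.333,
    "claim": -0.648,
    "tickets": 0.584,
    "prior": -1.091,
    "attorney": 0.573,
}
CUTOFF = 0.429


def regression_score(case):
    """Linear regression score for one standardized case."""
    return INTERCEPT + sum(COEFS[name] * value for name, value in case.items())


def classify(case):
    """Assumed rule: score above the cutoff means outcome 1, otherwise outcome 0."""
    return 1 if regression_score(case) > CUTOFF else 0


# gender and attorney are hypothetical values chosen for illustration.
case = {"age": 0.50, "gender": 0.0, "claim": 0.30, "tickets": 0.0, "prior": 1.0, "attorney": 0.0}

s = regression_score(case)
outcome = classify(case)
print(round(s, 4), outcome)
```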
6-25
Case: Stepwise Regression
Stepwise Regression
Automatic selection of independent variables
Look at F scores of simple regressions
Add variable with greatest F statistic
Check partial F scores for adding each variable not in
model
Delete variables no longer significant
If no variables outside the model are significant, quit
Considered inferior to selection of variables by
experts
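The steps listed above can be sketched in plain Python. This is a minimal illustration, not a production routine. It assumes fixed F thresholds for entering and removing variables (the values 4.0 and 3.9 are arbitrary choices, not from the source), solves each regression through the normal equations, and runs on synthetic data in which only variables 0 and 2 actually drive the outcome.

```python
import random


def ols_sse(columns, y):
    """SSE of an OLS fit of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y).
    m = [[sum(rows[i][r] * rows[i][c] for i in range(n)) for c in range(p)]
         + [sum(rows[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(p):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [a - factor * b for a, b in zip(m[r], m[c])]
    beta = [m[r][p] / m[r][r] for r in range(p)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, rows[i]))) ** 2
               for i in range(n))


def partial_f(data, y, base, extra):
    """Partial F statistic for adding variable `extra` to the variables in `base`."""
    sse_reduced = ols_sse([data[j] for j in base], y)
    sse_full = ols_sse([data[j] for j in base + [extra]], y)
    df_resid = len(y) - len(base) - 2  # n minus (full-model variables + intercept)
    return (sse_reduced - sse_full) / (sse_full / df_resid)


def stepwise(data, y, f_enter=4.0, f_remove=3.9, max_steps=50):
    """Add the best outside variable, then drop any that lost significance."""
    selected = []
    for _ in range(max_steps):
        outside = [j for j in range(len(data)) if j not in selected]
        scores = {j: partial_f(data, y, selected, j) for j in outside}
        if not scores or max(scores.values()) < f_enter:
            break  # no variable outside the model is significant: quit
        selected.append(max(scores, key=scores.get))
        for j in list(selected):
            rest = [k for k in selected if k != j]
            if partial_f(data, y, rest, j) < f_remove:
                selected.remove(j)
    return selected


# Synthetic data: four candidate variables, only 0 and 2 matter.
random.seed(0)
n = 60
data = [[random.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [3 * data[0][i] - 2 * data[2][i] + random.gauss(0, 0.5) for i in range(n)]

selected = stepwise(data, y)
print(sorted(selected))
```

With an empty model, the partial F for a candidate equals the F statistic of its simple regression, so the first pass of the loop covers the "look at F scores of simple regressions" step.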
6-26
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association
Data on 244,000 credit card accounts
12-month period
1 percent default
Cost of granting loan that defaults almost $5,000
Cost of denying loan that would have paid about $50
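The two costs imply a break-even default probability under a simple expected-cost argument: deny when p × (cost of default) exceeds (1 - p) × (cost of denial). The sketch below works that out with the rounded figures from the slide. The source does not say that the study used this rule; it is shown only to illustrate how lopsided the costs are.

```python
# Rounded cost figures from the slide.
COST_DEFAULT = 5000.0  # granting a loan that defaults
COST_DENIAL = 50.0     # denying a loan that would have been repaid


def break_even_probability(cost_default, cost_denial):
    """Default probability at which granting and denying cost the same on average."""
    return cost_denial / (cost_default + cost_denial)


THRESHOLD = break_even_probability(COST_DEFAULT, COST_DENIAL)


def decide(p_default):
    """Illustrative rule: deny when the expected cost of granting is higher."""
    return "deny" if p_default > THRESHOLD else "grant"


print(round(THRESHOLD, 4), decide(0.05), decide(0.001))
```

The break-even point lands just under 1 percent, which is close to the default rate reported for the data set.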
6-27
Data Treatment
Divided observations into 5 groups
Used one for training
Any smaller would have problems due to insufficient
default cases
Used 80% of data for detailed testing
Regression performed better than C5 model
Even though C5 used costs, regression didn’t
6-28
Summary
Regression a basic classical model
Many forms
Logistic regression very useful in data mining
Often have binary outcomes
Also can use on categorical data
Can use for discriminant analysis
To classify
6-29