6-17 Logistic Regression


Chapter 6
Regression Algorithms in Data Mining
Fit data
Time-series data: Forecast
Other data: Predict
Contents
Describes OLS (ordinary least squares) regression and logistic regression
Describes linear discriminant analysis and centroid discriminant analysis
Demonstrates techniques on small data sets
Reviews real applications of each model
Shows the application of models to larger data sets
Use in Data Mining
Telecommunication industry: turnover (churn)
One of the major analytic models for classification problems
Linear regression
The standard – ordinary least squares regression
Can use for discriminant analysis
Can apply stepwise regression
Nonlinear regression
More complex (but less reliable) data fitting
Logistic regression
When data are categorical (usually binary)
OLS Model
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
where Y is the dependent variable
β0 is the intercept term
β1 ... βn are the n coefficients for the independent variables
ε is the error term
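A minimal sketch of fitting such a model in Python, assuming NumPy; the data values below are invented purely for illustration:

```python
import numpy as np

# Hypothetical data: two independent variables and one dependent variable
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.5],
              [4.0, 2.0],
              [5.0, 4.0]])
y = np.array([3.1, 4.0, 6.2, 6.8, 8.9])

# Add a column of ones so the first coefficient is the intercept (beta_0)
X_design = np.column_stack([np.ones(len(y)), X])

# Ordinary least squares: choose beta to minimize the sum of squared errors
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Intercept:", beta[0])
print("Coefficients:", beta[1:])
```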
OLS Regression
Uses intercept and slope coefficients (β) to minimize the squared error terms over all observations
Fits the data with a linear model
Time-series data: observations over past periods
Best-fit line (in terms of minimizing the sum of squared errors)
Regression Output
R2: 0.987

Term        Coefficient   t       P
Intercept   0.642         0.286   0.776
Week        5.086         53.27   0.000

Requests = 0.642 + 5.086*Week
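Output of this kind can be produced with the statsmodels package; a sketch with a made-up weekly Requests series (the actual data behind the slide are not reproduced here):

```python
import numpy as np
import statsmodels.api as sm

# Made-up weekly data roughly following the fitted line on the slide
rng = np.random.default_rng(0)
week = np.arange(1, 26)
requests = 0.642 + 5.086 * week + rng.normal(0, 2, size=week.size)

X = sm.add_constant(week)          # adds the intercept column
model = sm.OLS(requests, X).fit()  # ordinary least squares fit

print(model.params)    # [intercept, Week coefficient]
print(model.tvalues)   # t statistics for each term
print(model.pvalues)   # P values for each term
print(model.rsquared)  # R-squared
```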
Example
[Figure: worked example showing the SSE, SST, and R2 calculations]
Example
[Figure]
A graph of the time-series model
[Figure: (X1) Requests vs. (X2) Pred_lmreg_1]
Time-Series Forecast
[Figure: time-series prediction plot]
Regression Tests
FIT:
SSE – sum of squared errors
Synonym: SSR – sum of squared residuals
R2 – proportion of variance explained by the model
Adjusted R2 – adjusts the calculation to penalize for the number of independent variables
Significance:
F-test – test of overall model significance
t-test – test of significant difference between a model coefficient and zero
P – probability that the true coefficient is zero (or on the other side of zero from the estimate)
See page 103
Regression Model Tests
SSE (sum of squared errors)
For each observation, subtract the model value from the observed value, square the difference, and total over all observations
By itself means nothing
Can compare across models (lower is better)
Can be used to evaluate the proportion of variance in the data explained by the model
R2
Ratio of the explained sum of squares (MSR) to the total sum of squares (SST)
SST = MSR plus SSE
0 ≤ R2 ≤ 1
See page 104
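A short Python sketch of these calculations (NumPy assumed; the observed and predicted arrays are placeholders standing in for real data and model output):

```python
import numpy as np

# Placeholder observations and model predictions
observed = np.array([10.0, 12.0, 15.0, 13.0, 18.0])
predicted = np.array([9.5, 12.5, 14.0, 13.5, 17.5])

# SSE: subtract model value from observed, square, total over all observations
sse = np.sum((observed - predicted) ** 2)

# SST: total sum of squares around the mean of the dependent variable
sst = np.sum((observed - observed.mean()) ** 2)

# R-squared: proportion of variance explained by the model (SST = explained + SSE)
r2 = 1 - sse / sst

print(f"SSE={sse:.3f}  SST={sst:.3f}  R2={r2:.3f}")
```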
Multiple Regression
Can include more than one independent variable
Trade-off:
Too many variables – many spurious variables, overlapping information
Too few variables – miss important content
Adding variables will always increase R2
Adjusted R2 penalizes for additional independent variables
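A worked sketch of that penalty in Python; the helper function is illustrative, and n = 20 is an assumption chosen because it approximately reproduces the Adj R2 values reported on the following slides:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1), penalizing for k variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hiring-data figures from the following slides, with n = 20 assumed;
# small differences from the reported Adj R2 values come from rounding of R2.
print(adjusted_r2(0.252, 20, 5))  # about -0.015 (5-variable model)
print(adjusted_r2(0.218, 20, 3))  # about  0.071 vs. reported 0.070 (3-variable model)
```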
Example: Hiring Data
Dependent Variable – Sales
Independent Variables:
Years of Education
College GPA
Age
Gender
College Degree
See pages 104-105
Regression Model
Sales = 269025 - 17148*YrsEd - 7172*GPA + 4331*Age - 23581*Male + 31001*Degree

Variable   Coefficient   P
YrsEd      -17148        0.175
GPA        -7172         0.812
Age        +4331         0.116
Male       -23581        0.266
Degree     +31001        0.450

R2 = 0.252   Adj R2 = -0.015
Weak model; no variable significant at 0.10
Improved Regression Model
Sales = 173284 - 9991*YrsEd + 3537*Age - 18730*Male

Variable   Coefficient   P
YrsEd      -9991         0.098*
Age        +3537         0.141
Male       -18730        0.328

R2 = 0.218   Adj R2 = 0.070
Logistic Regression
Data often ordinal or nominal
Regression based on continuous numbers is not appropriate
Need dummy variables
Binary – either are or are not
– LOGISTIC REGRESSION (probability of either 1 or 0)
Two or more categories
– DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)
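As a small sketch of building dummy variables in Python (pandas assumed; the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical records with nominal fields
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Degree": ["Yes", "No", "Yes"],
    "Sales": [120000, 95000, 143000],
})

# Convert nominal columns to 0/1 dummy variables (drop_first avoids redundant columns)
dummies = pd.get_dummies(df, columns=["Gender", "Degree"], drop_first=True)
print(dummies)
```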
Logistic Regression
For dependent variables that are nominal or ordinal
Pj = probability of membership of case i in class j
Sigmoidal function (in English, an S curve from 0 to 1):

Pj = 1 / (1 + e^-(β0 + Σ βi xi))
Insurance Claim Model
Fraud = 81.824 - 2.778*Age - 75.893*Male + 0.017*Claim - 36.648*Tickets + 6.914*Prior - 29.362*AttySmith

Variable     Coefficient   P
Age          -2.778        0.789
Male         -75.893       0.758
Claim        +0.017        0.757
Tickets      -36.648       0.824
Prior        +6.914        0.935
Atty Smith   -29.362       0.776

Can get the probability by running the score through the logistic formula
See pages 107-109
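A sketch of that last step in Python, plugging the slide's coefficients into the logistic formula; the sample case values below are invented for illustration:

```python
import math

# Coefficients from the insurance claim model on this slide
INTERCEPT = 81.824
COEFS = {"Age": -2.778, "Male": -75.893, "Claim": 0.017,
         "Tickets": -36.648, "Prior": 6.914, "AttySmith": -29.362}

def fraud_probability(case: dict) -> float:
    """Run the linear score through the logistic formula to get a 0-1 probability."""
    score = INTERCEPT + sum(COEFS[name] * value for name, value in case.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical claimant (not from the book's data)
case = {"Age": 40, "Male": 1, "Claim": 2500, "Tickets": 1, "Prior": 0, "AttySmith": 0}
print(fraud_probability(case))
```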
Linear Discriminant Analysis
Group objects into a predetermined set of outcome classes
Regression is one means of performing discriminant analysis
2 groups: find a cutoff for the regression score
More than 2 groups: multiple cutoffs
Centroid Method (NOT regression)
Binary data
Divide training set into two groups by binary outcome
Standardize data to remove scales
Identify means for each independent variable by group (the CENTROID)
Calculate distance function
Fraud Data
Age   Claim   Tickets   Prior   Outcome
52    2000    0         1       OK
38    1800    0         0       OK
19    600     2         2       OK
21    5600    1         2       Fraud
41    4200    1         2       Fraud
Standardized & Sorted Fraud Data
Age     Claim   Tickets   Prior   Outcome
1       0.60    1         0.5     0
0.9     0.64    1         1       0
0       0.88    0         0       0
0.633   0.707   0.667     0.500   0   (centroid of group 0)
0.05    0       1         0       1
1       0.16    1         0       1
0.525   0.080   1.000     0.000   1   (centroid of group 1)
Distance Calculations
Variable   New    To 0                    To 1
Age        0.50   (0.633-0.5)² = 0.018    (0.525-0.5)² = 0.001
Claim      0.30   (0.707-0.3)² = 0.166    (0.08-0.3)² = 0.048
Tickets    0      (0.667-0)² = 0.445      (1-0)² = 1.000
Prior      1      (0.5-1)² = 0.250        (0-1)² = 1.000
Totals            0.879                   2.049
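The centroid and distance calculations above can be reproduced with a short Python sketch (NumPy assumed), using the standardized values from the previous slide:

```python
import numpy as np

# Standardized training data: columns are Age, Claim, Tickets, Prior
group_ok    = np.array([[1.0, 0.60, 1.0, 0.5],
                        [0.9, 0.64, 1.0, 1.0],
                        [0.0, 0.88, 0.0, 0.0]])
group_fraud = np.array([[0.05, 0.00, 1.0, 0.0],
                        [1.00, 0.16, 1.0, 0.0]])

# Centroid = mean of each independent variable within the group
centroid_ok = group_ok.mean(axis=0)        # ~ [0.633, 0.707, 0.667, 0.500]
centroid_fraud = group_fraud.mean(axis=0)  # ~ [0.525, 0.080, 1.000, 0.000]

# New (standardized) case from the slide
new_case = np.array([0.50, 0.30, 0.0, 1.0])

# Sum of squared differences to each centroid; classify to the nearer one
dist_ok = np.sum((centroid_ok - new_case) ** 2)        # ~ 0.879
dist_fraud = np.sum((centroid_fraud - new_case) ** 2)  # ~ 2.049
print("OK" if dist_ok < dist_fraud else "Fraud")
```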
Discriminant Analysis with Regression
Standardized data, binary outcomes

Variable       Coefficient   P
Intercept      0.430         0.670
Age            -0.421        0.671
Gender         0.333         0.733
Claim          -0.648        0.469
Tickets        0.584         0.566
Prior Claims   -1.091        0.399
Attorney       0.573         0.607

R2 = 0.804
Cutoff (average of the group averages): 0.429
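A sketch of using this regression as a classifier in Python; the coefficients and the 0.429 cutoff come from the slide, while the new case values and the assumption that scores at or above the cutoff indicate the outcome coded 1 (Fraud) are illustrative:

```python
# Coefficients from the regression on standardized data
# (order: Age, Gender, Claim, Tickets, Prior Claims, Attorney)
INTERCEPT = 0.430
COEFS = [-0.421, 0.333, -0.648, 0.584, -1.091, 0.573]
CUTOFF = 0.429  # average of the two group averages

def classify(case: list[float]) -> str:
    """Score the standardized case and compare against the cutoff."""
    score = INTERCEPT + sum(b * x for b, x in zip(COEFS, case))
    return "Fraud" if score >= CUTOFF else "OK"

# Hypothetical standardized case values, for illustration only
print(classify([0.50, 1.0, 0.30, 0.0, 1.0, 0.0]))
```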
Case: Stepwise Regression
Stepwise Regression
Automatic selection of independent variables:
Look at F scores of simple regressions
Add the variable with the greatest F statistic
Check partial F scores for adding each variable not yet in the model
Delete variables that are no longer significant
If no external variable is significant, quit
Considered inferior to selection of variables by experts
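A simplified sketch of the idea in Python, assuming statsmodels and a pandas DataFrame of candidate variables; it performs forward selection by p-value (equivalent to the partial F test when a single variable is added) and omits the deletion step for brevity. The function name and alpha threshold are illustrative, not from the source.

```python
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(df: pd.DataFrame, target: str, alpha: float = 0.05) -> list[str]:
    """Greedily add the candidate variable with the smallest p-value
    until no remaining variable is significant at the alpha level."""
    selected: list[str] = []
    candidates = [c for c in df.columns if c != target]
    while candidates:
        pvals = {}
        for var in candidates:
            X = sm.add_constant(df[selected + [var]])
            model = sm.OLS(df[target], X).fit()
            pvals[var] = model.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # no external variable is significant: quit
        selected.append(best)
        candidates.remove(best)
    return selected

# Example use (df and "Sales" are placeholders):
# selected = forward_stepwise(df, target="Sales", alpha=0.10)
```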
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association
Data on 244,000 credit card accounts
12-month period
1 percent default rate
Cost of granting a loan that defaults: almost $5,000
Cost of denying a loan that would have paid: about $50
Data Treatment
Divided observations into 5 groups
Used one for training
Any smaller would have caused problems due to insufficient default cases
Used 80% of the data for detailed testing
Regression performed better than the C5 model, even though C5 used costs and regression did not
Summary
Regression is a basic, classical model
Many forms
Logistic regression is very useful in data mining
Outcomes are often binary
Can also be used on categorical data
Can be used for discriminant analysis
To classify