Regression Models
• Fit data
• Time-series data: Forecast
• Other data: Predict
Use in Data Mining
• One of the major analytic models
– Linear regression
• The standard – ordinary least squares regression
• Can use for discriminant analysis
• Can apply stepwise regression
– Nonlinear regression
• More complex (but less reliable) data fitting
– Logistic regression
• When data are categorical (usually binary)
OLS (Ordinary Least Squares) Model
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

where Y is the dependent variable
      β0 is the intercept term
      β1 ... βn are the n coefficients for the independent variables
      ε is the error term
OLS Regression
• Uses intercept and slope coefficients (β) to minimize
the sum of squared error terms over all observations
• Fits the data with a linear model
• Time-series data:
– Observations over past periods
– Best-fit line (in terms of minimizing the sum of
squared errors; see the sketch below)
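A minimal sketch of fitting such a linear model by ordinary least squares with NumPy; the week numbers and request counts below are placeholders, not the textbook's data.

```python
import numpy as np

# Placeholder time series: week number and requests observed in that week.
weeks = np.arange(1.0, 11.0)
requests = np.array([6.1, 10.8, 15.7, 20.9, 26.2, 30.8, 36.1, 41.0, 45.9, 51.3])

# Design matrix with an intercept column; lstsq chooses the coefficients
# that minimize the sum of squared errors over all observations.
X = np.column_stack([np.ones_like(weeks), weeks])
(intercept, slope), *_ = np.linalg.lstsq(X, requests, rcond=None)
print(f"Requests = {intercept:.3f} + {slope:.3f} * Week")
```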
Regression Output (page 101)

R² = 0.987

Term        Coefficient   t        P
Intercept   0.642         0.286    0.776
Week        5.086         53.27    0

Requests = 0.642 + 5.086*Week
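A hedged sketch of how this kind of output can be produced with statsmodels (one possible tool); the weekly data below are synthetic, generated to resemble the slide's trend, not the book's dataset.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic weekly data shaped like the slide's example (not the book's data).
rng = np.random.default_rng(0)
weeks = np.arange(1, 53)
requests = 0.6 + 5.1 * weeks + rng.normal(scale=2.0, size=weeks.size)

X = sm.add_constant(weeks)          # intercept column + Week
fit = sm.OLS(requests, X).fit()

print(fit.params)                   # intercept and Week coefficients
print(fit.tvalues, fit.pvalues)     # t statistics and P values per coefficient
print(fit.rsquared)                 # R-squared of the model
```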
Time-Series Forecast

[Figure: "Regression Forecast" — observed Requests and fitted Model line plotted against Week (0–60); Requests range roughly 0–300]
Regression Tests
• FIT:
– SSE – sum of squared errors
• Synonym: SSR – sum of squared residuals
– R² – proportion of variance explained by the model
– Adjusted R² – adjusts the calculation to penalize for the
number of independent variables
• Significance
– F-test - test of overall model significance
– t-test - test of whether a model coefficient differs
significantly from zero
– P – probability of obtaining a coefficient this far from
zero if the true coefficient were zero
Regression Model Tests
• SSE (sum of squared errors)
– For each observation, subtract model value from
observed, square difference, total over all
observations
– By itself, the value means little (it is scale-dependent)
– Can compare across models (lower is better)
– Can use to evaluate proportion of variance in data
explained by model
• R²
– Ratio of the explained (regression) sum of squares, MSR,
to the total sum of squares, SST
• SST = MSR + SSE
– 0 ≤ R² ≤ 1 (see the sketch below)
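A minimal sketch of these quantities, assuming y holds the observed values, y_hat the model's fitted values, and k the number of independent variables (the variable names are mine, not the book's).

```python
import numpy as np

def fit_statistics(y, y_hat, k):
    """SSE, R-squared and adjusted R-squared for a fitted regression model."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)             # sum of squared errors (residuals)
    sst = np.sum((y - np.mean(y)) ** 2)        # total sum of squares
    msr = sst - sse                            # explained (regression) sum of squares
    r2 = msr / sst                             # proportion of variance explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes extra variables
    return sse, r2, adj_r2
```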
Multiple Regression
• Can include more than one independent
variable
– Trade-off:
• Too many variables – spurious relationships, overlapping
information
• Too few variables – miss important content
– Adding variables never decreases (and usually increases) R²
– Adjusted R² penalizes for additional
independent variables (formula below)
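For reference, the usual adjustment (the standard textbook formula, not notation specific to this book) is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables; with many variables and few observations the adjusted value can even go negative, as in the hiring model that follows.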
Example: Hiring Data
• Dependent Variable – Sales
• Independent Variables:
– Years of Education
– College GPA
– Age
– Gender
– College Degree
Regression Model
Sales =  269025
         - 17148 * YrsEd       P = 0.175
         -  7172 * GPA         P = 0.812
         +  4331 * Age         P = 0.116
         - 23581 * Male        P = 0.266
         + 31001 * Degree      P = 0.450

R² = 0.252        Adjusted R² = -0.015

• Weak model; no independent variable significant at 0.10
Improved Regression Model
Sales =  173284
         -  9991 * YrsEd       P = 0.098*
         +  3537 * Age         P = 0.141
         - 18730 * Male        P = 0.328

R² = 0.218        Adjusted R² = 0.070
Logistic Regression
• Data often ordinal or nominal
• Regression based on continuous numbers is not
appropriate
– Need dummy variables
• Binary – either are or are not
– LOGISTIC REGRESSION (probability of either 1 or 0)
• Two or more categories
– DISCRIMINANT ANALYSIS (perform a regression for each
outcome; pick the one that fits best)
Logistic Regression
• For dependent variables that are nominal or ordinal
• Probability of acceptance of case i to class j:

      Pj = 1 / (1 + e^-(β0 + Σi βi xi))

• Sigmoidal function
– (in English, an S curve from 0 to 1)
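A minimal sketch of that sigmoidal function; the coefficients and inputs below are illustrative placeholders, not values estimated from any of the book's data.

```python
import numpy as np

def logistic_probability(x, beta0, betas):
    """P = 1 / (1 + e^-(beta0 + sum_i beta_i * x_i)) — an S curve from 0 to 1."""
    score = beta0 + np.dot(betas, x)
    return 1.0 / (1.0 + np.exp(-score))

# Illustrative call with made-up coefficients.
print(logistic_probability(x=np.array([0.5, 1.0]), beta0=-1.0, betas=np.array([2.0, 0.3])))
```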
Insurance Claim Model
Fraud =  81.824
         -  2.778 * Age              P = 0.789
         - 75.893 * Male             P = 0.758
         +  0.017 * Claim            P = 0.757
         - 36.648 * Tickets          P = 0.824
         +  6.914 * Prior            P = 0.935
         - 29.362 * Attorney Smith   P = 0.776

Can get the probability by running the score through the
logistic formula (see the sketch below)
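For example, plugging a hypothetical claimant (the input values below are made up, not a case from the book) into the model above and running the score through the logistic formula:

```python
import math

# Hypothetical claimant — illustration only.
age, male, claim, tickets, prior, attorney_smith = 30, 1, 2500, 1, 0, 0

score = (81.824 - 2.778 * age - 75.893 * male + 0.017 * claim
         - 36.648 * tickets + 6.914 * prior - 29.362 * attorney_smith)
probability = 1.0 / (1.0 + math.exp(-score))    # logistic formula
print(score, probability)
```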
Linear Discriminant Analysis
• Group objects into predetermined set of
outcome classes
• Regression is one means of performing
discriminant analysis
– 2 groups: find cutoff for regression score
– More than 2 groups: multiple cutoffs
Centroid Method
(NOT regression)
• Binary data
• Divide training set into two groups by
binary outcome
– Standardize data to remove scales
• Identify the mean of each independent
variable by group (the CENTROID)
• Calculate a distance function (see the sketch below)
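A minimal sketch of the centroid method; min-max scaling is assumed for the "standardize to remove scales" step (the book's own standardized values may be derived differently), and the distance function is squared Euclidean distance, as in the worked example that follows.

```python
import numpy as np

def centroid_classify(X_train, y_train, x_new):
    """Assign x_new to the binary group (0 or 1) whose centroid is nearest."""
    # Standardize to remove scales (min-max scaling assumed here).
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    scale = lambda a: (a - lo) / (hi - lo)
    Xs, xs = scale(X_train), scale(x_new)

    # Centroid = mean of each independent variable by group;
    # distance = sum of squared differences to that centroid.
    distances = {g: float(np.sum((Xs[y_train == g].mean(axis=0) - xs) ** 2))
                 for g in (0, 1)}
    return min(distances, key=distances.get), distances
```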
Fraud Data
Age   Claim   Tickets   Prior   Outcome
52    2000    0         1       OK
38    1800    0         0       OK
19     600    2         2       OK
21    5600    1         2       Fraud
41    4200    1         2       Fraud
Standardized & Sorted Fraud Data
Age     Claim   Tickets   Prior   Outcome
1       0.60    1         0.5     0
0.9     0.64    1         1       0
0       0.88    0         0       0
0.633   0.707   0.667     0.500   0   (group 0 centroid)
0.05    0       1         0       1
1       0.16    1         0       1
0.525   0.080   1.000     0.000   1   (group 1 centroid)
Distance Calculations
Variable   New case   Distance to centroid 0       Distance to centroid 1
Age        0.50       (0.633-0.50)² = 0.018        (0.525-0.50)² = 0.001
Claim      0.30       (0.707-0.30)² = 0.166        (0.080-0.30)² = 0.048
Tickets    0          (0.667-0)²    = 0.445        (1.000-0)²    = 1.000
Prior      1          (0.500-1)²    = 0.250        (0.000-1)²    = 1.000
Totals                                0.879                        2.049

The new case is closer to the group-0 centroid (0.879 < 2.049),
so the centroid method assigns it to group 0
Discriminant Analysis with Regression
Standardized data, binary outcomes

Term           Coefficient   P
Intercept       0.430        0.670
Age            -0.421        0.671
Gender          0.333        0.733
Claim          -0.648        0.469
Tickets         0.584        0.566
Prior Claims   -1.091        0.399
Attorney        0.573        0.607

• R² = 0.804
• Cutoff (average of the two group averages): 0.429 (see the sketch below)
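A minimal sketch of discriminant analysis via regression for the two-group case: fit OLS to the 0/1 outcomes, take each group's average fitted score, and use the average of those two averages as the cutoff (the variable and function names are mine).

```python
import numpy as np

def regression_discriminant(X, y):
    """Two-group discriminant analysis using an OLS regression on 0/1 outcomes."""
    A = np.column_stack([np.ones(len(y)), X])         # intercept + predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    scores = A @ beta
    cutoff = (scores[y == 0].mean() + scores[y == 1].mean()) / 2.0
    classify = lambda x_new: int(beta[0] + np.dot(beta[1:], x_new) >= cutoff)
    return beta, cutoff, classify
```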
Case: Stepwise Regression
• Stepwise Regression
– Automatic selection of independent variables
• Look at F scores of simple regressions
• Add variable with greatest F statistic
• Check partial F scores for adding each variable not
in model
• Delete variables that are no longer significant
• If no variable outside the model is significant, stop (sketch below)
• Considered inferior to selection of
variables by experts
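A simplified sketch of the stepwise procedure just described, using partial F tests on OLS fits; the entry/removal significance thresholds and the helper names are my own assumptions, not the book's.

```python
import numpy as np
from scipy import stats

def sse_of(X, y, cols):
    """SSE of an OLS fit using the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

def partial_f(X, y, base, j):
    """Partial F statistic (and P value) for adding column j to the model 'base'."""
    n = len(y)
    sse_small, sse_big = sse_of(X, y, base), sse_of(X, y, base + [j])
    df = n - len(base) - 2                       # residual df of the larger model
    F = (sse_small - sse_big) / (sse_big / df)
    return F, 1.0 - stats.f.cdf(F, 1, df)

def stepwise(X, y, enter=0.05, remove=0.10):
    """Add the most significant candidate variable, then drop any variable
    whose partial F is no longer significant; repeat until nothing changes."""
    selected = []
    while True:
        candidates = [(partial_f(X, y, selected, j)[1], j)
                      for j in range(X.shape[1]) if j not in selected]
        added = [j for p, j in sorted(candidates) if p < enter][:1]
        selected += added
        dropped = [j for j in selected
                   if partial_f(X, y, [s for s in selected if s != j], j)[1] > remove]
        selected = [j for j in selected if j not in dropped]
        if not added and not dropped:
            return selected
```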
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association
• Data on 244,000 credit card accounts
– 12-month period
– 1 percent default
– Cost of granting a loan that defaults: almost
$5,000
– Cost of denying a loan that would have been repaid:
about $50
Data Treatment
• Divided observations into 5 groups
– Used one for training
– A smaller training set would have had too few
default cases
– Used the remaining 80% of the data for detailed testing
• Regression performed better than C5
model
– Even though C5 used costs, regression didn’t
Summary
• Regression is a basic classical model
– Many forms
• Logistic regression very useful in data
mining
– Often have binary outcomes
– Also can use on categorical data
• Can use for discriminant analysis
– To classify