score and score and score odds marketing
By Matt Bogard, M.S.
May 12, 2011
Single Variable Regression
Multivariable Regression
Logistic Regression
Data Mining vs. Classical Statistics
Decision Trees
Neural Networks
Can we describe this relationship with an equation for a line?
Fitting a line to the data gives us the equation (the regression equation).
How well does this line fit our data? How well does it describe the relationship between the variables (x and y)?
Interpretation & Inference
The goal then is to minimize the sum of squared residuals. That is, minimize:
∑ei² = ∑(yi − b0 − b1Xi)² with respect to b0 and b1.
This can be accomplished by taking the partial derivatives of ∑ei² with respect to each coefficient and setting them equal to zero:
∂∑ei²/∂b0 = 2∑(yi − b0 − b1Xi)(−1) = 0
∂∑ei²/∂b1 = 2∑(yi − b0 − b1Xi)(−Xi) = 0
Solving for b0 and b1 yields:
b0 = ybar − b1·Xbar
b1 = ∑((Xi − Xbar)(yi − ybar)) / ∑(Xi − Xbar)² = S(X,y)/SS(X)
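A minimal R sketch of these closed-form solutions (simulated data; variable names here are hypothetical), checked against lm():
set.seed(1)
x <- rnorm(50)
y <- 2 + 3*x + rnorm(50)                                         # simulated data
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # S(x,y)/SS(x)
b0 <- mean(y) - b1 * mean(x)                                     # ybar - b1*xbar
c(b0, b1)
coef(lm(y ~ x))  # matches the closed-form solution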
R² = SSR/SST
MSR = SSR/df(model)
MSE = SSE/df(error)
F = MSR/MSE
E(MSE) = σ²
E(MSR) = σ² + β²∑(x − xbar)²
If β = 0 then F ≈ σ²/σ² = 1
If β ≠ 0 then F ≈ [σ² + β²∑(x − xbar)²]/σ² > 1
Larger R² → better fit
Larger F → significance of β (the model)
VAR(bj) & SE(bj) (will discuss later)
Test H0: βj = β0 using t = (bj − β0)/SE(bj)
Note: if H0: βj = β0 = 0, then t = bj/SE(bj), and t² gives the same result as the F-test in a single-variable regression.
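Reusing the simulated x and y above, a quick R check of the t/F equivalence (a sketch, not part of the original slides):
fit <- lm(y ~ x)                            # single-variable regression
tval <- coef(summary(fit))["x", "t value"]  # t for the slope
Fval <- anova(fit)["x", "F value"]          # model F
c(tval^2, Fval)                             # t^2 equals F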
LIBNAME ADHOC 'file path';
/* GET DATA */
DATA LSD;
INPUT SCORE CONC;
CARDS;
78.93 1.17
58.20 2.97
67.47 3.26
37.47 4.69
45.65 5.83
32.92 6.00
29.97 6.41
;
RUN;
PROC REG DATA = LSD;
MODEL SCORE=CONC;
PLOT SCORE*CONC; /*PLOTS REGRESSION LINE FIT TO DATA*/
RUN; QUIT;
PROC GLM DATA = LSD;
MODEL SCORE=CONC;
RUN; QUIT;
y = b0 + b1X1 + b2X2 + e
Often viewed in the context of matrices, represented by y = Xb + e
b = (X′X)⁻¹X′y, the matrix analogue of S(X,y)/SS(X)
b <- solve( t(x) %*% x ) %*% ( t(x) %*% y )   # (X'X)^(-1) X'y in R
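A self-contained R sketch of the matrix solution (simulated data; names hypothetical), compared to lm():
set.seed(2)
X <- cbind(1, x1 = rnorm(30), x2 = rnorm(30))  # design matrix with intercept column
y <- X %*% c(1, 2, -1) + rnorm(30)             # simulated response
b <- solve(t(X) %*% X) %*% (t(X) %*% y)        # (X'X)^(-1) X'y
cbind(b, coef(lm(y ~ X[, -1])))                # both columns match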
(1) E(y|x) = Xβ ('we are estimating a linear approximation to the conditional expectation of y')
(2) E(e) = 0 ('white noise error terms')
(3) VAR(e) = σ²I ('constant variance': no heteroskedasticity and no serial correlation)
(4) Rank(X) = k ('no perfect multicollinearity')
Why are we concerned with the error terms?
Recall b = (X′X)⁻¹X′y, hence our b estimate does not depend on e, and E(b) = β.
VAR(b) = s²(X′X)⁻¹ where s² = MSE = e′e/(n − k) = ∑ei²/(n − k)
Note SE(bj) = √VAR(bj) and t = (bj − β0)/SE(bj)
Note F = MSR/MSE and E(MSE) = σ²
If we have σi² (error variance that changes across observations) instead of a constant σ², we run into issues related to hypothesis testing and making inferences.
Maybe rank(X) = k, but there is still some correlation between the X variables in the regression.
[Ballentine Venn diagram: blue = variation used to estimate b for x1; green = variation used to estimate b for x2; red = overlap due to correlation between x1 and x2. As corr(x1, x2) increases, the blue and green areas decrease and the red area increases (the circles overlap).]
Decreased information used to estimate the b's leads to increased variance in the estimates.
R²: the b's jointly can still explain variation in y.
Research: inferences about the specific relationship between X1 and y rely on SE(b), which is inflated by multicollinearity.
Forecasting/prediction: we are more comfortable with multicollinearity (Greene, 1990; Kennedy, 2003; Studenmund, 2001).
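A hedged R simulation (hypothetical setup, not from the slides) illustrating the point: as corr(x1, x2) rises, SE(b1) inflates even though the joint fit may remain good.
set.seed(3)
n <- 100
for (r in c(0, 0.5, 0.9, 0.99)) {
  x1 <- rnorm(n)
  x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)  # corr(x1, x2) is approximately r
  y  <- 1 + x1 + x2 + rnorm(n)
  se <- coef(summary(lm(y ~ x1 + x2)))["x1", "Std. Error"]
  cat("corr =", r, " SE(b1) =", round(se, 3), "\n")
}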
DATA REG;
INPUT INTEREST INFLATION INVESTMENT;
CARDS;
5.16 4.4 .161
5.87 5.15 .172
5.95 5.37 .158
4.88 4.99 .173
4.50 4.16 .195
6.44 5.75 .217
7.83 8.82 .199
6.25 9.31 .163
5.5 5.21 .195
5.46 5.83 .231
7.46 7.40 .257
10.28 8.64 .259
11.77 9.31 .225
13.42 9.44 .241
11.02 5.99 .204
;
RUN;
PROC REG DATA = REG;
MODEL INVESTMENT = INTEREST INFLATION/VIF ;
RUN; QUIT;
Ex: retention = Y/N
If y ∈ {0, 1}, then E[y|X] = Pi, a probability interpretation.
Problems with fitting this by OLS:
Estimated probabilities can fall outside (0,1).
e follows a binomial-type distribution with var(ei) = pi(1 − pi), which varies with X and violates the assumption of uniform variance.
Note however that, despite these theoretical concerns, OLS is used quite often without practical problems.
Example: Statistical Alternatives for Studying
College Student Retention: A Comparative
Analysis of Logit, Probit, and Linear
Regression. Dey & Astin. Research in Higher
Education Vol 34 No 5, 1993.
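A quick R illustration of the first problem (simulated data, not from the paper above): OLS on a binary y yields fitted 'probabilities' outside (0,1).
set.seed(4)
x <- rnorm(200, sd = 2)
y <- rbinom(200, 1, plogis(2 * x))  # binary outcome
lpm <- lm(y ~ x)                    # linear probability model
range(fitted(lpm))                  # typically extends below 0 and above 1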
probability(yi = 1 | x) = e^{Xβ} / (1 + e^{Xβ})
Choose β's to maximize the likelihood of the sample being observed: maximize the likelihood that the data come from a 'real world' characterized by one set of β's vs. another.
L(β) = ∏(y=1) e^{Xβ}/(1 + e^{Xβ}) × ∏(y=0) 1/(1 + e^{Xβ}),
the product of densities which give p(y = 1) and p(y = 0).
Take the ln of both sides and choose β to maximize → βMLE
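A minimal R sketch of the idea (simulated data; names hypothetical): maximize the log-likelihood numerically and compare to glm().
set.seed(5)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.5 * x))
negll <- function(b) {                       # negative log-likelihood
  p <- plogis(b[1] + b[2] * x)               # P(y = 1 | x)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
optim(c(0, 0), negll)$par                    # beta_MLE
coef(glm(y ~ x, family = binomial))          # essentially the same estimates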
We are NOT minimizing sums of squares and not fitting a line to data, so there is no R².
Changes in log-likelihood are compared for full vs. restricted models to provide measures of 'deviance'.
Deviance is used for fit statistics such as AIC, the chi-square test, and pseudo-R².
Pseudo-R² is based on ratios of deviance for full vs. restricted models and is not directly comparable to R² from OLS.
from Applied Choice Analysis.
Hensher, Rose & Greene. 2005
Classification accuracy:
% correct predictions
% correct 1's
% correct 0's
β = the change in the log odds of y given a one-unit change in X
e^β = the odds ratio: a one-unit increase in X multiplies the odds of y by e^β (e.g., β = 0.69 gives an odds ratio of about 2, a doubling of the odds).
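In R (continuing the simulated glm() sketch above), the odds ratios are just the exponentiated coefficients:
fit <- glm(y ~ x, family = binomial)
exp(coef(fit))  # e^beta: multiplicative change in the odds per one-unit change in x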
PROC LOGISTIC
ODS GRAPHICS ON;
ODS HTML;
PROC LOGISTIC DATA = ADHOC.LOGIT PLOTS = ROC OUTMODEL = MODEL1;
MODEL CLASS = X1 X2 / RSQ LACKFIT;
SCORE OUT = SCORE1 FITSTAT;
RUN;
QUIT;
"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown."
"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."
From Statistical Modeling: The Two Cultures. Statistical Science 2001, Vol. 16, No. 3, 199–231. Leo Breiman.
Classical Statistics: focus is on hypothesis testing of causes and effects and interpretability of models. Model choice is based on parameter significance and in-sample goodness-of-fit.
Examples: Regression, Logit/Probit, Duration Models, Discriminant Analysis.
Machine Learning: focus is on predictive accuracy even in the face of lack of interpretability of models. Model choice is based on cross-validation of predictive accuracy using partitioned data sets.
Examples: Classification and Regression Trees, Neural Nets, K-Nearest Neighbors, Association Rules, Cluster Analysis.
Test error: 'prediction error over an independent test sample'.
It is a function of the bias and variance a model exhibits across multiple data sets.
There is a bias-variance trade-off related to model complexity.
Partition data into training, validation, and test samples (if data is sufficient).
Other methods: k-fold cross-validation, random forests, ensemble models.
Choose inputs (and model specification) that optimize model performance on the test and validation data, as in the sketch below.
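A minimal base-R sketch of k-fold cross-validation (simulated data; names hypothetical):
set.seed(6)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 3 * d$x + rnorm(100)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(d)))   # random fold assignment
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = d[fold != i, ])        # train on k-1 folds
  mean((d$y[fold == i] - predict(fit, d[fold == i, ]))^2)  # error on held-out fold
})
mean(cv_mse)  # cross-validated estimate of prediction error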
“Tree-based methods partition the feature space
into a set of rectangles, and then fit a simple model
(like a constant) in each one.” (Trevor Hastie,
Robert Tibshirani & Jerome Friedman, 2009)
Each split creates a cross-tabulation, and the split is evaluated with a chi-square test, e.g.:
Pearson's Chi-squared test
data: tab1
X-squared = 52.3918, df = 1, p-value = 4.546e-13
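A sketch of the kind of R call behind that output (tab1 and its counts here are hypothetical, not the original data):
tab1 <- matrix(c(85, 15, 40, 60), nrow = 2,
               dimnames = list(split = c("left", "right"), class = c("0", "1")))
chisq.test(tab1, correct = FALSE)  # Pearson's chi-squared test on the 2x2 split table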
A nonlinear model of complex relationships composed of multiple 'hidden' layers (similar to composite functions):
Y = f(g(h(x))), or
x → hidden layers → Y
ACTIVATION FUNCTION: formula used for transforming values from inputs and the outputs in a neural network.
COMBINATION FUNCTION: formula used for combining transformed values from activation functions in neural networks.
HIDDEN LAYER: the layer between input and output layers in a neural network.
RADIAL BASIS FUNCTION: a combination function that is based on the Euclidean distance between inputs and weights.
Hidden layer:
h1 = logit(w10 + w11·x1 + w12·x2)
h2 = logit(w20 + w21·x1 + w22·x2)
h3 = logit(w30 + w31·x1 + w32·x2)
h4 = logit(w40 + w41·x1 + w42·x2)
Output layer:
Y = W0 + W1·h1 + W2·h2 + W3·h3 + W4·h4
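A direct R translation of this forward pass (weights and inputs are hypothetical; plogis() is the logistic transform the slides label 'logit'):
set.seed(7)
x1 <- 0.5; x2 <- -1                            # example inputs
w <- matrix(rnorm(12), nrow = 4)               # hidden weights: rows h1..h4, cols w_j0, w_j1, w_j2
h <- plogis(w[, 1] + w[, 2]*x1 + w[, 3]*x2)    # hidden layer activations h1..h4
W <- rnorm(5)                                  # output weights W0..W4
Y <- W[1] + sum(W[-1] * h)                     # linear output layer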
There is no 'theoretically sound' criterion for architecture selection in terms of the number of hidden units and hidden layers.
The AutoNeural node 'automates' some of these choices to a limited extent.
At SAS Global Forum 2011 there was a presentation utilizing genetic algorithms.
Neural networks don't address model selection; inputs are typically pre-filtered via decision tree and regression nodes.
Interpretation is a challenge: finance companies employ them for marketing purposes but don't use them in areas subject to litigation (e.g., loan approvals).
References
Berry C, Hannenhalli S, Leipzig J, Bushman FD. 2006. Selection of Target Sites for Mobile DNA Integration in the Human Genome. PLoS Comput Biol 2(11): e157. doi:10.1371/journal.pcbi.0020157 (supporting information Text S1).
Greene, William H. Econometric Analysis. 1990.
Kennedy. A Guide to Econometrics. 5th Ed. 2003.
Dey & Astin. Statistical Alternatives for Studying College Student Retention: A Comparative Analysis of Logit, Probit, and Linear Regression. Research in Higher Education, Vol. 34, No. 5, 1993.
Breiman, Leo. Statistical Modeling: The Two Cultures. Statistical Science 2001, Vol. 16, No. 3, 199–231.
Hensher, Rose & Greene. Applied Choice Analysis. 2005.
Hastie, Trevor, Robert Tibshirani & Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Ed. 2009.
Goldberger, Arthur S. A Course in Econometrics. 1991.
SAS Enterprise Miner
R Statistical Package http://www.r-project.org/