003_assumptions_GLM_poissonx

download report

Transcript 003_assumptions_GLM_poissonx

WiFi password:
525-244-426-914
Assumptions
Assumptions
assumptions about the predictors
Absence of
Collinearity
No influential
data points
Normality of Errors
Homoskedasticity
of Errors
Independence
assumptions
about the
residuals
Assumptions
Absence of
Collinearity
No influential
data points
Normality of Errors
Homoskedasticity
of Errors
Independence
Assumptions
Absence of
Collinearity
No influential
data points
Normality of Errors
Homoskedasticity
of Errors
Independence
Collinearity…
… generally occurs when predictors are
correlated
(however, it may also occur in more complex
ways, through multi-collinearity)
Demo
Absence of Collinearity
Baayen
(2008: 182)
Collinearity
Baayen
(2008: 182)
“If collinearity is ignored, one is likely
to end up with a confusing statistical
analysis in which nothing is significant,
but where dropping one covariate can
make the others significant, or even
change the sign of estimated
parameters.”
(Zuur, Ieno & Elphick, 2010: 9)
Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common
statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.
You check collinearity through
variance inflation factors
library(car)
vif(xmdl)
Values >10 are commonly regarded as
dangerous; however, values substantially larger
than 1 are dangerous
I would definitely start worrying at around >4
Model comparison with separate
models of collinear predictors
xmdl1 <- lm(y ~ A)
xmdl2 <- lm(y ~ B)
AIC(xmdl1)
AIC(xmdl2)
Akaike’s information
criterion
If relative importance of (potentially) collinear
predictors is of prime interest…
Random forests:
myforest = cforest(...,
controls = data.controls)
my_varimp = varimp(myforest,
conditional = T)
check Stephanie Shih’s tutorials
Assumptions
Absence of
Collinearity
No influential
data points
Normality of Errors
Homoskedasticity
of Errors
Independence
Assumptions
Absence of
Collinearity
No influential
data points
Normality of Errors
Homoskedasticity
of Errors
Independence
0
-5
y
5
p = 0.37
-5
0
x
5
0
-5
y
5
p = 0.000529
-5
0
x
5
Influence diagnostics
•
•
•
•
•
•
DFFit
DFBeta
Leverage
Cook’s distance
Standardized residuals
Studentized residuals
… and more!
Code for doing DFBetas yourself
Perform leave-one-out influence diagnostics
General structure of code:
all.betas = c()
for(i in 1:nrow(xdata)){
xmdl = lm( … , xdata[-i,])
all.betas = c(all.betas,
coef(xmdl)["slope_of_interest"])
}
Influence diagnostics abuse
Assumptions
Absence of
Collinearity
No influential
data points
Normality of Errors
Homoskedasticity
of Errors
Independence
Assumptions
Absence of
Collinearity
No influential
data points
Normality of Errors
Homoskedasticity
of Errors
Independence
Q-Q plots
qqnorm(residuals(xmdl));qqline(residuals(xmdl))
1
0
-1
-2
Sample Quantiles
2
3
Normal Q-Q Plot
-2
-1
0
1
Theoretical Quantiles
2
Assumptions
Absence of
Collinearity
No influential
data points
Normality of Errors
Homoskedasticity
of Errors
Independence
Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common
statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.
Residual plot
(plot of residuals against fitted values)
-500
0
500
1000
This is really bad!!!
-1000
residuals(xmdl)
1500
2000
plot(fitted(xmdl),residuals(xmdl))
500
1000
fitted(xmdl)
1500
2000
Learning to interpret residual plots by
simulating random data
You can type these two lines into R again and again to
train your eye!:
## Good
par(mfrow=c(3,3))
for(i in 1:9)plot(1:50,rnorm(50))
## Weak non-constant variance
par(mfrow=c(3,3))
for(i in 1:9)
{plot(1:50,sqrt((1:50))*rnorm(50))}
Faraway, J. (2005). Linear models with R. Boca Raton: Chapman & Hall/CRC Press.
Learning to interpret residual plots by
simulating random data
## Strong non-constant variance
par(mfrow=c(3,3))
for(i in 1:9)plot(1:50,(1:50)*rnorm(50))
## Non-linearity
par(mfrow=c(3,3))
for(i in
1:9)plot(1:50,cos((1:50)*pi/25)+rnorm(50))
Faraway, J. (2005). Linear models with R. Boca Raton: Chapman & Hall/CRC Press.
Emphasis of graphical tools
• For now, forget about formal tests of deviations
from normality and homogeneity; graphical
methods are generally considered superior
(Montgomery & Peck, 1992; Draper & Smith, 1998; Quinn & Keough, 2002; Läänä, 2009; Zuur
et al., 2009)
• Problems with formal tests:
–
–
–
–
Type II errors
hard cut-offs if used in a frequentist fashion
less information about the data and the model
have assumptions themselves
Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common
statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.
If you have continuous data that
exhibits heteroskedasticity…
… you can perform nonlinear transformations
(e.g., log transform)
… there are several variants of regression that
can help you out (generalized least squares with
gls(); White-Huber covariance matrices using
bptest() and coeftest(); bootstrapping
using the “boot” package etc.)
“A priori violations”
In the following cases, your data violates the
normality and homoskedasticity assumption on
a priori grounds:
(1) count data
(2) binary data
“A priori violations”
In the following cases, your data violates the
normality and homoskedasticity assumption on
a priori grounds:
(1) count data  Poisson regression
(2) binary data
“A priori violations”
In the following cases, your data violates the
normality and homoskedasticity assumption on
a priori grounds:
(1) count data  Poisson regression
(2) binary data  logistic regression
General
Linear
Model
Generalized
Linear
Model
Generalized Linear Models:
Ingredients
An error distribution (normal, Poisson, binomial)
A linear predictor (LP)
A link function (identity, log, logit)
Generalized Linear Models:
Two important types
Poisson regression: a generalized linear model
with Poisson error structure and log link
function
Logistic regression: a generalized linear model
with binomial error structure and logit link
function
Relative Frequency
The Poisson Distribution
Mean = Variance
Mean = 1
Mean = 5
Mean = 15
0
5
10
15
Count
20
25
30
Hissing Koreans
Winter & Grawunder (2012)
Winter, B., & Grawunder, S. (2012). The phonetic profile of Korean formality. Journal of Phonetics, 40, 808-815.
Rates can be misleading
N = Rate
Time
16/s vs. 0/s
could be 1
millisecond or
10 years
No. of Speech Errors
10
8
6
4
2
0
0.0
0.1
0.2
0.3
Blood Alcohol Level
0.4
No. of Speech Errors
10
8
6
4
2
0
0.0
0.1
0.2
0.3
Blood Alcohol Level
0.4
No. of Speech Errors
10
8
6
4
2
0
0.0
0.1
0.2
0.3
Blood Alcohol Level
0.4
The basic LM formula
m = b0 + b1 * x
The basic GLM formula for Poisson
log(m ) = b0 + b1 * x
the link function
the linear
predictor
The basic LM formula
m = b0 + b1 * x
The basic GLM formula for Poisson
m =e
b0 +b1*x
linear predictor
Poisson model output
ex
log
values
e
x
exponentiate
predicted
mean rate
Poisson model in R
xmdl =
glm(…,xdata,family=“poisson”)
When using predict, you have to additionally
specify whether you want predictions in LP
space or response space
preds = predict.glm(xmdl,
newdata=mydata,
type=“response”,se.fit=T)
Relative Frequency
The Poisson Distribution
Mean = Variance
Mean = 1
Mean = 5
Mean = 15
0
5
10
15
Count
20
25
30
The Poisson Distribution
use negative binomial
regression
Mean = Variance
if variance > mean, then
you are dealing with
overdispersion
library(MASS)
xmdl.nb = glm.nb(…)
Overdispersion test
xmdl.nb = glm.nb(…)
library(pscl)
odTest(xmdl.nb)
Generalized Linear Models:
Two important types
Poisson regression: a generalized linear model
with Poisson error structure and log link
function
Logistic regression: a generalized linear model
with binomial error structure and logit link
function
Speech Error Yes/No
yes
no
0.0
0.1
0.2
0.3
Blood Alcohol Level
0.4
Speech Error Yes/No
yes
no
0.0
0.1
0.2
0.3
Blood Alcohol Level
0.4
Speech Error Yes/No
yes
no
0.0
0.1
0.2
0.3
Blood Alcohol Level
0.4
Speech Error Yes/No
yes
no
0.0
0.1
0.2
0.3
Blood Alcohol Level
0.4
The basic GLM formula for Poisson regression
log(m ) = b0 + b1 * x
The basic GLM formula for logistic regression
p(y)
log(
) = b0 + b1 * x
1- p(y)
the logit link function
x
æ p(x) ö
log ç
÷
è 1- p(x) ø
e
x
1+ e
= logit
function
= inverse
logit function
plogis()
Odds and log odds examples
Probability
Odds
Log odds (= “logits”)
0.1
0.2
0.3
0.4
0.111
0.25
0.428
0.667
-2.197
-1.386
-0.847
-0.405
0.5
0.6
0.7
1
1.5
2.33
0
0.405
0.847
0.8
4
1.386
0.9
9
2.197
Snijders & Bosker (1999: 212)
Estimate Std. Error z value Pr(>|z|)
(Intercept)
-3.643
1.123 -3.244 0.001179 **
alc
16.118
4.856
3.319 0.000903 ***
for probabilities:
transform the entire
LP with the logistic
function
plogis()