#### Transcript 003_assumptions_GLM_poissonx

##### Assumptions

Assumptions about the predictors and about the residuals:

- Absence of collinearity
- No influential data points
- Normality of errors
- Homoskedasticity of errors
- Independence

##### Absence of collinearity

Collinearity generally occurs when predictors are correlated (however, it may also arise in more complex ways, through multicollinearity); see Baayen (2008: 182).

> "If collinearity is ignored, one is likely to end up with a confusing statistical analysis in which nothing is significant, but where dropping one covariate can make the others significant, or even change the sign of estimated parameters." (Zuur, Ieno & Elphick, 2010: 9)

Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3-14.

You check collinearity through variance inflation factors:

```r
library(car)
vif(xmdl)
```

Values >10 are commonly regarded as dangerous; however, values substantially larger than 1 can already be a problem. I would definitely start worrying at around >4.
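As a quick, hedged illustration of how VIFs behave (the variable names below are made up for this sketch): simulating two strongly correlated predictors drives their VIFs far above the worry threshold.

```r
## Sketch: VIFs for simulated collinear predictors (made-up names)
library(car)
set.seed(1)
A <- rnorm(100)
B <- A + rnorm(100, sd = 0.3)   # B is strongly correlated with A
y <- A + rnorm(100)
xmdl <- lm(y ~ A + B)
vif(xmdl)                        # both values come out well above 4
```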
##### Model comparison with separate models of collinear predictors

Fit separate models for the collinear predictors and compare them with Akaike's information criterion (AIC):

```r
xmdl1 <- lm(y ~ A)
xmdl2 <- lm(y ~ B)
AIC(xmdl1)
AIC(xmdl2)
```

If the relative importance of (potentially) collinear predictors is of prime interest, consider random forests:

```r
library(party)
# 'data.controls' is a cforest controls object defined elsewhere
myforest = cforest(..., controls = data.controls)
my_varimp = varimp(myforest, conditional = T)
```

(Check Stephanie Shih's tutorials.)

##### No influential data points

[Figure: two scatterplots of y against x, illustrating how a single influential data point can change a regression from p = 0.37 to p = 0.000529.]

Influence diagnostics:

- DFFit
- DFBeta
- Leverage
- Cook's distance
- Standardized residuals
- Studentized residuals
- … and more!

##### Code for doing DFBetas yourself

Perform leave-one-out influence diagnostics. General structure of the code:

```r
all.betas = c()
for (i in 1:nrow(xdata)) {
  # refit the model with row i left out (fill in your model formula)
  xmdl = lm( … , xdata[-i, ])
  # store the coefficient of interest from each refit
  all.betas = c(all.betas, coef(xmdl)["slope_of_interest"])
}
```

Beware of influence diagnostics abuse (e.g., dropping data points simply because they are inconvenient).

##### Normality of errors: Q-Q plots

```r
qqnorm(residuals(xmdl)); qqline(residuals(xmdl))
```

[Figure: Normal Q-Q plot of the model residuals; sample quantiles against theoretical quantiles.]

##### Homoskedasticity of errors: residual plots

Residual plot (plot of residuals against fitted values):

```r
plot(fitted(xmdl), residuals(xmdl))
```

[Figure: residuals against fitted values for a model that clearly violates homoskedasticity — "This is really bad!!!"]

##### Learning to interpret residual plots by simulating random data

You can type these lines into R again and again to train your eye:

```r
## Good
par(mfrow = c(3, 3))
for (i in 1:9) plot(1:50, rnorm(50))

## Weak non-constant variance
par(mfrow = c(3, 3))
for (i in 1:9) plot(1:50, sqrt(1:50) * rnorm(50))

## Strong non-constant variance
par(mfrow = c(3, 3))
for (i in 1:9) plot(1:50, (1:50) * rnorm(50))

## Non-linearity
par(mfrow = c(3, 3))
for (i in 1:9) plot(1:50, cos((1:50) * pi / 25) + rnorm(50))
```

Faraway, J. (2005). Linear models with R. Boca Raton: Chapman & Hall/CRC Press.

##### Emphasis of graphical tools

For now, forget about formal tests of deviations from normality and homogeneity; graphical methods are generally considered superior (Montgomery & Peck, 1992; Draper & Smith, 1998; Quinn & Keough, 2002; Läärä, 2009; Zuur et al., 2009).

Problems with formal tests:

- Type II errors
- hard cut-offs if used in a frequentist fashion
- less information about the data and the model
- they have assumptions themselves

(Zuur, Ieno & Elphick, 2010)

##### If you have continuous data that exhibits heteroskedasticity…

- … you can perform nonlinear transformations (e.g., log transform)
- … there are several variants of regression that can help you out: generalized least squares with gls(); testing for heteroskedasticity with bptest() and White-Huber (robust) covariance matrices with coeftest(); bootstrapping using the "boot" package; etc.

##### "A priori violations"

In the following cases, your data violates the normality and homoskedasticity assumptions on a priori grounds:

1. count data → Poisson regression
2. binary data → logistic regression

This is the step from the General Linear Model to the Generalized Linear Model.

##### Generalized Linear Models: ingredients

- an error distribution (normal, Poisson, binomial)
- a linear predictor (LP)
- a link function (identity, log, logit)

##### Generalized Linear Models: two important types

- Poisson regression: a generalized linear model with Poisson error structure and log link function
- Logistic regression: a generalized linear model with binomial error structure and logit link function

##### The Poisson distribution

[Figure: relative frequency of counts 0–30 under Poisson distributions with means 1, 5, and 15. For the Poisson distribution, mean = variance.]

##### Hissing Koreans

Winter, B., & Grawunder, S. (2012). The phonetic profile of Korean formality. Journal of Phonetics, 40, 808-815.

##### Rates can be misleading

Rate = N / Time

A rate of 16/s vs. 0/s could be based on 1 millisecond or on 10 years of observation.

[Figure: number of speech errors (0–10) plotted against blood alcohol level (0.0–0.4).]

##### The basic LM formula

m = b0 + b1 * x

##### The basic GLM formula for Poisson regression

log(m) = b0 + b1 * x

Here, log() is the link function and b0 + b1 * x is the linear predictor. Equivalently:

m = e^(b0 + b1 * x)

##### Poisson model output

The model output is in log space; exponentiate (e^x) to get the predicted mean rate.

##### Poisson model in R

```r
xmdl = glm(…, xdata, family = "poisson")
```

When using predict, you have to additionally specify whether you want predictions in LP space or response space:

```r
preds = predict.glm(xmdl, newdata = mydata, type = "response", se.fit = T)
```

##### Overdispersion

The Poisson distribution has mean = variance. If variance > mean, you are dealing with overdispersion — use negative binomial regression:

```r
library(MASS)
xmdl.nb = glm.nb(…)
```

Overdispersion test:

```r
library(pscl)
odTest(xmdl.nb)
```
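Putting these pieces together, here is a minimal, hedged sketch (simulated data; variable names are made up) that fits a Poisson model, gets predictions in response space, and runs the overdispersion test:

```r
## Sketch: Poisson regression on simulated speech-error counts
set.seed(1)
xdata <- data.frame(alc = runif(100, 0, 0.4))        # blood alcohol levels
xdata$errors <- rpois(100, exp(0.5 + 4 * xdata$alc)) # Poisson counts

xmdl <- glm(errors ~ alc, xdata, family = "poisson")
summary(xmdl)  # estimates are in log space

# Predicted mean rates (response space), with standard errors:
predict.glm(xmdl, newdata = data.frame(alc = c(0, 0.2, 0.4)),
            type = "response", se.fit = TRUE)

# Overdispersion check via a negative binomial fit:
library(MASS); library(pscl)
xmdl.nb <- glm.nb(errors ~ alc, data = xdata)
odTest(xmdl.nb)  # should be non-significant here: the data are truly Poisson
```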
##### Logistic regression

Recall the second important type of generalized linear model: logistic regression, with binomial error structure and logit link function.

[Figure: speech error (yes/no) plotted against blood alcohol level (0.0–0.4).]

##### The basic GLM formula for logistic regression

For Poisson regression: log(m) = b0 + b1 * x

For logistic regression:

log(p(y) / (1 - p(y))) = b0 + b1 * x

where log(p / (1 - p)) is the logit link function:

- logit function: log(p(x) / (1 - p(x)))
- inverse logit function: e^x / (1 + e^x), available in R as plogis()

##### Odds and log odds examples

| Probability | Odds  | Log odds (= "logits") |
|-------------|-------|-----------------------|
| 0.1         | 0.111 | -2.197                |
| 0.2         | 0.25  | -1.386                |
| 0.3         | 0.428 | -0.847                |
| 0.4         | 0.667 | -0.405                |
| 0.5         | 1     | 0                     |
| 0.6         | 1.5   | 0.405                 |
| 0.7         | 2.33  | 0.847                 |
| 0.8         | 4     | 1.386                 |
| 0.9         | 9     | 2.197                 |

Snijders & Bosker (1999: 212)

##### Logistic model output

```
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -3.643      1.123  -3.244 0.001179 **
alc           16.118      4.856   3.319 0.000903 ***
```

For probabilities: transform the entire LP with the logistic function plogis().
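To make that last step concrete with the coefficients printed above: the linear predictor at a given blood alcohol level is b0 + b1 * alc, and plogis() turns it into a probability.

```r
## Using the coefficients from the output above
b0 <- -3.643
b1 <- 16.118
plogis(b0 + b1 * 0.3)  # predicted probability of a speech error at alc = 0.3, ~0.77
plogis(b0)             # at alc = 0, ~0.03
```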
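And a minimal end-to-end sketch of logistic regression on simulated data (all names and numbers made up), mirroring the Poisson example earlier:

```r
## Sketch: logistic regression on simulated yes/no speech errors
set.seed(1)
xdata <- data.frame(alc = runif(100, 0, 0.4))
xdata$error <- rbinom(100, 1, plogis(-3.5 + 15 * xdata$alc))

xmdl <- glm(error ~ alc, xdata, family = "binomial")
summary(xmdl)  # estimates are in logit (log odds) space

# Predicted probabilities (response space):
predict.glm(xmdl, newdata = data.frame(alc = c(0.1, 0.2, 0.3)),
            type = "response")
```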