Statistics 2: generalized linear models
General linear model:
Y ~ a + b1* x1 + … + bn* xn + ε
There are many cases where general linear models are no longer valid:
• Loss of a) normality of errors, b) homogeneity of variance, c) linearity
Sometimes these can be fixed by transforming the response (e.g. log or sqrt of Y)
or by adding to the model a) interactions or b) polynomial terms (e.g. I(x^2))
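The three fixes mentioned above can be sketched in R as follows. This is a minimal illustration on simulated data; the data frame `d`, the variables `x1`, `x2`, `y` and the model names are hypothetical and not part of the course material.

```r
# Simulated data (hypothetical): y depends multiplicatively on x1 and x2,
# so log(y) is linear in the predictors
set.seed(1)
d <- data.frame(x1 = runif(100, 1, 10), x2 = runif(100, 1, 10))
d$y <- exp(0.5 + 0.2 * d$x1 + 0.1 * d$x2 + rnorm(100, sd = 0.3))

# Fix 1: transform the response
m_log <- lm(log(y) ~ x1 + x2, data = d)

# Fix 2: add an interaction (x1 * x2 expands to x1 + x2 + x1:x2)
m_int <- lm(log(y) ~ x1 * x2, data = d)

# Fix 3: add a polynomial term; I() keeps ^ as arithmetic inside a formula
m_poly <- lm(log(y) ~ x1 + I(x1^2) + x2, data = d)

coef(m_log)
```

Residual plots, e.g. `plot(m_log)`, are the usual way to check whether a transformation has actually restored normality and homogeneity of variance.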
Generalized linear models
• These broaden the concept of linear models
  – Non-normal error distributions
    • Poisson responses (counts)
    • Binomial responses (number of successes in n trials)
    • Also categorical data (multinomial response)
    • Exponential or Weibull responses (waiting times)
  – Transformation of the response to the linear scale with a link function
New model elements
Generalized linear model:
f(Y) ~ a + b1* x1 + … + bn* xn + ε
• f: link function
• ε: error term, whose variance is no longer homogeneous but depends on the mean
Link function
• The response (more precisely, its expected value μ) is transformed to the linear scale
f(Y) = a + b1*x1 + … + bn*xn
• To obtain a prediction on the original scale, apply the inverse link
Y = f^-1(a + b1*x1 + … + bn*xn)
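The link and its inverse can be inspected directly from an R family object. The sketch below uses the logit link as an example; the probabilities in `p` are arbitrary illustration values.

```r
# A family object carries both the link function and its inverse
fam <- binomial(link = "logit")

p <- c(0.1, 0.5, 0.9)        # values on the response (probability) scale
eta <- fam$linkfun(p)        # to the linear scale: log(p / (1 - p))
p_back <- fam$linkinv(eta)   # back to the response scale: 1 / (1 + exp(-eta))

eta
p_back
```

For a fitted model, `predict(model, type = "link")` returns values on the linear scale and `predict(model, type = "response")` applies the inverse link.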
Common choices for link and variance function
Distribution   Link                       Variance
Normal         Identity: μ                constant (~1)
Binomial       Logit: log(μ / (1 - μ))    μ(1 - μ)
Poisson        Logarithm: log(μ)          μ
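The variance functions in the table are also stored in R's family objects, which gives a quick way to check them. The values in `mu` are arbitrary illustration values.

```r
mu <- c(0.2, 0.5, 0.8)   # example means

v_norm  <- gaussian()$variance(mu)   # constant variance function
v_binom <- binomial()$variance(mu)   # mu * (1 - mu)
v_pois  <- poisson()$variance(mu)    # mu

rbind(v_norm, v_binom, v_pois)
```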
Generalized linear models in R
glm(formula, family, data)
For example:
family = binomial(link = "logit")
family = poisson(link = "log")
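As a sanity check of the `glm()` interface, a generalized linear model with a normal error distribution and identity link should reproduce the ordinary linear model. A small sketch on simulated data (the data frame `d` and its variables are hypothetical):

```r
set.seed(2)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)

m_lm  <- lm(y ~ x, data = d)
m_glm <- glm(y ~ x, family = gaussian(link = "identity"), data = d)

# The two sets of coefficient estimates agree up to numerical tolerance
all.equal(coef(m_lm), coef(m_glm), tolerance = 1e-6)
```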
Binomial data: logistic regression
• Binomial data give the number of successes in n independent trials
  – Tossing a coin ten times
  – Exposing n patients to a treatment and seeing how many survive
• The expected value to be predicted (p) is the probability of success
  – The logit link function transforms this from (0, 1) to (-Inf, Inf)
• Two options for coding the response:
  – A vector of values 0 or 1: failure or success
  – A matrix with two columns: the first gives the number of successes, the second the number of failures
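The two response codings can be compared on the same data. The following is a sketch with simulated survival data; the variable names (`dose`, `survived`, `succ`, `fail`) and the true coefficients are made up for illustration. Both codings should lead to the same coefficient estimates.

```r
set.seed(3)
# Hypothetical experiment: 20 patients at each of 5 dose levels
d <- data.frame(dose = rep(1:5, each = 20))
d$survived <- rbinom(nrow(d), size = 1, prob = plogis(-2 + 0.8 * d$dose))

# Option 1: response is a 0/1 vector, one row per patient
m1 <- glm(survived ~ dose, family = binomial(link = "logit"), data = d)

# Option 2: response is a two-column matrix (successes, failures),
# one row per dose level
agg <- data.frame(
  dose = 1:5,
  succ = as.vector(tapply(d$survived, d$dose, sum)),
  fail = as.vector(tapply(1 - d$survived, d$dose, sum))
)
m2 <- glm(cbind(succ, fail) ~ dose, family = binomial(link = "logit"), data = agg)

coef(m1)
coef(m2)

# Predicted survival probability at dose 3 (inverse link applied)
p3 <- predict(m1, newdata = data.frame(dose = 3), type = "response")
```

The coefficients match, but the residual deviance and degrees of freedom differ between the two codings because the number of rows differs.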
DEMO 1
Count data: Poisson regression
• Poisson-distributed data are count data
  – Number of fish caught during a fishing trip
  – Number of species observed in a given area
• The link function is log and the model predicts the expected count
• The variance increases with the expected value
• Two special features:
  – Offset variables: ‘observation time’
  – Over/under-dispersion: the variance in the data can be smaller or larger than that assumed by the variance function -> can be handled with a dispersion parameter
    • This can also occur with binomial data
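Both special features can be sketched in R. The fishing data below are simulated, and the names `hours`, `temp` and `catch` are hypothetical. Using the quasi-Poisson family to estimate the dispersion parameter is one common option, assumed here rather than taken from the slides.

```r
set.seed(4)
n <- 200
hours <- runif(n, 1, 8)   # observation time per trip (hypothetical)
temp  <- rnorm(n)         # a hypothetical covariate
catch <- rpois(n, lambda = hours * exp(0.3 + 0.5 * temp))

# Offset: log(hours) enters the linear predictor with its coefficient fixed
# at 1, so the model describes the catch rate per hour
m_pois <- glm(catch ~ temp + offset(log(hours)), family = poisson(link = "log"))

# Rough dispersion check: Pearson chi-square / residual degrees of freedom
# (values near 1 are consistent with the Poisson variance assumption)
disp <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)

# Quasi-Poisson: same coefficients, standard errors scaled by the dispersion
m_quasi <- glm(catch ~ temp + offset(log(hours)), family = quasipoisson(link = "log"))
summary(m_quasi)$dispersion
```

The analogous choice for over- or under-dispersed binomial data is `family = quasibinomial`.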
Measures of fit in glm
• Proportion of deviance explained
  – Problem: even a perfect model does not explain all deviance
• Pseudo R2
  – Nagelkerke’s R2
    1 - exp(-2/n * (logLik(model) - logLik(NULLModel)))
    NULLModel is a model with intercept only.
    n = number of observations
    Does NOT work for quasi-distributions!
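Both measures can be computed from fitted `glm` objects. This is a sketch on simulated count data; the object names are hypothetical. Note that many references reserve the name Nagelkerke's R2 for the value above divided by its maximum, 1 - exp(2/n * logLik(NULLModel)); that rescaled version is included as `r2_scaled`.

```r
set.seed(5)
x <- rnorm(100)
y <- rpois(100, lambda = exp(0.5 + 0.7 * x))

model      <- glm(y ~ x, family = poisson)
null_model <- glm(y ~ 1, family = poisson)   # intercept only

# Proportion of deviance explained
dev_expl <- 1 - model$deviance / model$null.deviance

# Pseudo R2 from the log-likelihoods, as in the formula above
n   <- nobs(model)
ll1 <- as.numeric(logLik(model))
ll0 <- as.numeric(logLik(null_model))
r2  <- 1 - exp(-2 / n * (ll1 - ll0))

# Rescaled version: divide by the maximum attainable value
r2_scaled <- r2 / (1 - exp(2 / n * ll0))

# Quasi-families define no likelihood, so logLik() is NA and the
# pseudo R2 cannot be computed this way
m_quasi  <- glm(y ~ x, family = quasipoisson)
ll_quasi <- as.numeric(logLik(m_quasi))
```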
DEMO 2 & exercises