Introduction to Statistics

Download Report

Transcript Introduction to Statistics

Statistical Modelling
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Crawley, MJ (2005) Statistics: An Introduction
Using R. Wiley.
• Gonick, L., and Woollcott Smith (1993) A Cartoon
Guide to Statistics. HarperResource (for fun).
Statistical Modelling
• Suppose you have a response variable, y, that varies
when independent factors or measurements, x1, x2, …,
xN, vary. (One covariate can be written x.)
• What you want is a model that predicts the value of y as
a function of the xi.
• Statistical models are written:
–
–
–
–
g( h(y) ~ f(xi) )
Where g() describes the model—lm() or aov() is usual.
Where h() describes how to transform the response variable.
Where f(xi) describes what covariates you want in (and out)
of the model. To add a covariate, you use +, to remove it, -.
Fitting Models to Data is what R is
designed to do
• There are five kinds of models
– The saturated model—one parameter per data point—you’re
drawing the response line through all the data points. This tells
you nothing.
– The maximal model—containing all factors, interactions, and
covariates of interest that can be fit given the available data. (You
need at least three data points for fitting each covariate in your
model.)
– The current model, usually smaller than the maximal model.
– The minimum adequate model—smaller than the maximal model,
but not significantly smaller.
– The null model—one parameter, the overall mean response, ymean.
(y~1)
Definitions
• Covariate—an explanatory variable that is possibly predictive
of the outcome under study.
• Factor—a covariate that takes a finite number of values, in no
specific order. A boolean value is a factor.
• Continuous explanatory variable—a numerical covariate. It
may be restricted to integer values, to represent an ordering.
• Interaction—a covariate that involves more than one
explanatory variable.
• Power—a covariate that involves a polynomial of degree 2 or
greater in one or more continuous explanatory variables.
General Process of Fitting
1.
2.
3.
4.
5.
Fit the maximal model. Note down the residual deviance.
Possibly check for overdispersion (advanced topic).
Begin model simplification. Use update -, to remove the
least significant terms first. Start with the highest-order
interactions and powers.
If the resulting increase in deviance is not significant, use
the reduced model and continue
If the increase in deviance is significant, go back to the
unreduced model and look further.
Repeat steps 3 and 4 until only significant terms remain.
Parsimony Suggests
•
•
•
•
•
•
•
Less parameters
Less explanatory variables
A linear model
A model without a hump (a power > 1)
A model without interactions
A model with easily measured variables
A model that reflects how the process operates.
Actions you can take
• Remove non-significant interactions, higher order
terms, explanatory variables
• Group together factor levels (advanced topic)
• In ANCOVA (mixed models with continuous variables
and factors), set slopes to zero if possible
• Rescale (advanced) if necessary to give
– constancy of variance
– approximately normal errors
– additivity
• Don't go to extremes, though. “If you torture the data,
it will confess.”
Model Formulae
• response ~ explanatory variables
– The right hand side describes the variables, their interactions,
and their non-linearity—it isn’t arithmetic!
– To include something in the model: + something
– To remove something from the model: - something
– Interactions are written * (or A:B)
• y ~ A + B + A:B is the same as y~A*B)
– Nesting: / (A/B is A+A:B or A+B%in%A)
• y ~ A/B
– Conditioning is written |
• y~x|z
Expanded Forms
•
•
•
•
•
•
•
•
A*B*C is all the interactions up to A:B:C
A/B/C is A+B%in%A+C%in%B%in%A
(A+B+C)^3 is A*B*C
(A+B+C)^2 is A*B*C - A:B:C
poly(x,n) is a polynomial regression of degree n
I(formula) means the formula ‘as written’ in R.
1 labels the intercept
Error(A/B/C) can be specified
Use of update
• “model <- lm(y~A*B)”
• “model2 <- update(model, ~.-A:B)” to get rid of
the interaction.
– model2 is now lm(y~A+B)
Transforms (Advanced)
• Sometimes you need to directly transform the left hand side
to get constant variance. This looks like:
– lm(log(y)~ I(1/x) + sqrt(z))
• If the left hand side has constant variance already, you can
use a ‘link’, defined by family=something in the model
formula, and a general linear model (glm)
• Families available:
–
–
–
–
Normal (bell-shaped curve, default)
poisson (count data)
binomial (proportions or binary data)
Gamma (special)
Factor/Parameter Transforms
• Used to make the shape of variables closer to
‘normal’, and with constant variance.
• More advanced topic, will be mentioned in the
summary lecture. If it turns out you need to do
this, I’ll provide consulting support (X3227).
Types of Models
• lm(y~x)—linear model (x is a continuous variable)
• aov(y~x)—analysis of variance (x is a factor)
• aov(y~x+z)—analysis of covariance if x and z include a
factor and a continuous explanatory variable.
• glm(y~x)—general linear model (like lm but advanced)
– Non-constant variance
– Non-normal errors
– Options include family=poisson, binomial, Gamma, Normal
• gam(y~x)—generalised additive model (complex, advanced)
• also lme(y~x), nls(y~x), nlme(y~x), loess(y~x), tree(y~x)
Modelling Demonstration
• A demonstration will be given in the analysis of
covariance slides.
Model Checking (demoed later)
• Plot the residuals against
–
–
–
–
fitted values
explanatory values
sequence of data collection
standard normal deviates
• Demo of mcheck
mcheck
mcheck <- function (obj, ... ) {
rs<-obj$resid
fv<-obj$fitted
par(mfrow=c(1,2))
plot(fv,rs,xlab="Fitted values",ylab="Residuals")
abline(h=0, lty=2)
qqnorm(rs,xlab="Normal scores",ylab="Ordered residuals")
qqline(rs,lty=2)
par(mfrow=c(1,1))
invisible(NULL)
}
Observations
• Order of factor deletion will matter
– Delete the high order interactions (A:B), and highorder terms (I(x^2)) first.
– Then delete the remaining terms in decreasing order
of importance.
• The test you use is anova(), because you’re
comparing two models.
• Be pragmatic.