R workshop #2 - University of British Columbia

Download Report

Transcript R workshop #2 - University of British Columbia

Workshop in R & GLMs: #2
Diane Srivastava
University of British Columbia
[email protected]
Start by loading your Lakedata_06 dataset:
diane<-read.table(file.choose(),sep=";",header=TRUE)
Dataframes
Two ways to identify a column (called "treatment") in
your dataframe (called "diane"):
diane$treatment
OR
attach(diane); treatment
At end of session, remember to: detach(diane)
Summary statistics
length (x)
mean (x)
var (x)
cor (x,y)
sum (x)
summary (x)
minimum, maximum, mean, median, quartiles
What is the correlation between two variables
in your dataset?
Factors
• A factor has several discrete levels (e.g. control,
herbicide)
• If a vector contains text, R automatically assumes it
is a factor.
• To manually convert numeric vector to a factor:
x <- as.factor(x)
• To check if your vector is a factor, and what the
levels are:
is.factor(x) ; levels(x)
Exercise
Make lake area into a factor called AreaFactor:
Area 0 to 5 ha: small
Area 5.1 to 10: medium
Area > 10 ha: large
You will need to:
1. Tell R how long AreaFactor will be.
AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"
2. Assign cells in AreaFactor to each of the 3 levels
3. Make AreaFactor into a factor, then check that it is a factor
Exercise
Make lake area into a factor called AreaFactor:
Area 0 to 5 ha: small
Area 5.1 to 10: medium
Area > 10 ha: large
You will need to:
1. Tell R how long AreaFactor will be.
AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"
2. Assign cells in AreaFactor to each of the 3 levels
AreaFactor[Area<5.1]<-“small"; AreaFactor[Area>10]<-“large"
3. Make AreaFactor into a factor, then check that it is a factor
AreaFactor<-as.factor(AreaFactor); is.factor(AreaFactor)
Linear regression
ALT+126
model <- lm (y ~ x, data = diane)
invent a
name for
your
model
insert your
dataframe
name here
linear
model
insert
your
y vector
name
here
insert
your
x vector
name
here
Linear regression
model <- lm (Species ~ Elevation, data = diane)
summary (model)
Call:
lm(formula = Species ~ Elevation, data = diane)
Residuals:
Min
1Q Median
3Q
Max
-7.29467 -2.75041 -0.04947 1.83054 15.00270
Coefficients:
Estimate
Std. Error
t value
(Intercept) 9.421568
2.426983
3.882
Elevation -0.002609
0.003663
-0.712
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pr(>|t|)
0.000551 ***
0.482070
Residual standard error: 4.811 on 29 degrees of freedom
Multiple R-Squared: 0.01719, Adjusted R-squared: -0.0167
F-statistic: 0.5071 on 1 and 29 DF, p-value: 0.4821
Linear regression
model2 <- lm (Species ~ AreaFactor, data = diane)
summary (model2)
Coefficients:
(Intercept)
AreaFactormedium
AreaFactorsmall
Estimate
11.833
-2.405
-8.288
Std. Error
1.441
1.723
1.792
t value Pr(>|t|)
8.210
6.17e-09 ***
-1.396 0.174
-4.626 7.72e-05 ***
Large has mean species richness of 11.8
Medium has mean species richness of 11.8 - 2.4 = 9.4
mean(Species[AreaFactor=="medio"])
Small has a mean species richness of 11.8 - 8.3 = 3.5
ANOVA
model2 <- lm (Species ~ AreaFactor, data = diane)
anova (model2)
Analysis of Variance Table
Response: Species
Df
AreaFactor
2
Sum Sq Mean Sq F value Pr(>F)
333.85 166.92 13.393 8.297e-05 ***
Residuals
348.99 12.46
28
F tests in regression
model3 <- lm (Species ~ Elevation + pH, data = diane)
anova (model, model3)
Model 1: Species ~ Elevation
Model 2: Species ~ Elevation + pH
1
2
Res.Df RSS Df Sum of Sq
29
671.10
28
502.06 1 169.05
F 1, 28 = 9.43
F
Pr(>F)
9.4279 0.004715 **
Exercise
• Fit the model: Species~pH
• Fit the model: Species~pH+pH2
("pH2" is just pH2)
• Use the ANOVA command to decide whether species
richness is a linear or quadratic function of pH
Distributions: not so normal!
• Review assumptions for parametric stats
(ANOVA, regressions)
• Why don’t transformations always work?
• Introduce non-normal distributions
Tests before ANOVA, t-tests
• Normality
• Constant variances
Tests for normality: exercise
data<-c(rep(0:6,c(42,30,10,5,5,4,4)));data
How many datapoints are there?
Tests for normality: exercise
• Shapiro-Wilks (if sig, not normal)
shapiro.test (data)
If error message, make sure the stats
package is loaded, then try again:
library(stats); shapiro.test (data)
Tests for normality: exercise
• Kolmogorov-Smirnov (if sig, not
normal)
ks.test(data,”pnorm”,mean(data),sd=sqrt(
var(data)))
Tests for normality: exercise
• Quantile-quantile plot (if wavers
substantially off 1:1 line, not normal)
par(pty="s")
opens up a single plot window
qqnorm(data); qqline(data)
Tests for normality: exercise
4
3
2
1
0
Sample Quantiles
5
6
Normal Q-Q Plot
-2
-1
0
1
Theoretical Quantiles
2
Tests for normality: exercise
If the distribution isn´t normal, what is it?
30
20
10
0
barplot(freq.data)
40
freq.data<-table(data); freq.data
0
1
2
3
4
5
6
Non-normal distributions
• Poisson (count data, left-skewed, variance =
mean)
• Negative binomial (count data, left-skewed,
variance >> mean)
• Binomial (binary or proportion data, leftskewed, variance constrained by 0,1)
• Gamma (variance increases as square of
mean, often used for survival data)
Exercise
model2 <- lm (Species ~ AreaFactor, data = diane)
anova (model2)
1. Test for normality of residuals
resid2<- resid (model2)
...you do the rest!
2. Test for homogeneity of variances
summary (lm (abs (resid2) ~ AreaFactor))
Regression diagnostics
1. Residuals are normally distributed
2. Absolute value of residuals do not change
with predicted value (homoscedastcity)
3. Residuals show no pattern with predicted
values (i.e. the function “fits”)
4. No datapoint has undue leverage on the
model.
Regression diagnostics
model3 <- lm (Species ~ Elevation + pH, data = diane)
par(mfrow=c(2,2)); plot(model3)
1. Residuals are normally
distributed
Std. deviance resid.
• Straight “Normal Q-Q plot”
Theoretical Quantiles
2. Absolute residuals do not
change with predicted values
Residuals
MALO
BUENO
Fitted values
Sqrt (abs (SD resid.))
• No fan shape in Residuals vs fitted plot
• No upward (or downward) trend in
Scale-location plot
Fitted values
Examples of neutral and fan-shapes
3. Residuals show no pattern
Curved residual
plots result
from fitting a
straight line to
non-linear data
(e.g. quadratic)
4. No unusual leverage
Cook’s distance > 1
indicates a point
with undue
leverage (large
change in model fit
when removed)
Transformations
Try transforming your y-variable to
improve the regression diagnostic plot
• replace Species with log(Species)
• replace Species with sqrt(Species)
Poisson distribution
• Frequency data
• Lots of zero (or minimum value)
data
• Variance increases with the mean
What do you do with
Poisson data?
1. Correct for correlation between mean and
variance by log-transforming y (but log (0)
is undefined!!)
2. Use non-parametric statistics (but low
power)
3. Use a “generalized linear model”
specifying a Poisson distribution
The problem: Hard to transform
data to satisfy all requirements!
Tarea: Janka example
Janka dataset: Asks if Janka hardness
values are a good estimate of timber
density? N=36