Advanced regression analysis

download report

Transcript Advanced regression analysis

NUMERICAL ANALYSIS OF
BIOLOGICAL AND
ENVIRONMENTAL DATA
Lecture 5
Advanced (= Modern)
Regression Analysis
John Birks
ADVANCED REGRESSION ANALYSIS =
MODERN REGRESSION ANALYSIS
Generalised Linear Models (GLM)
-What are GLMs?
-A simple GLM
-Advantages of GLM
-Structure of GLM
Error function
Linear predictor
Link function
-Parameter estimation
-Minimal adequate model
-Concept of deviance
-Model building
-Model notation
-Examples of models
-Model criticism
Classification
Locally weighted regression (LOWESS)
Spline functions
Generalised additive models (GAM)
Classification and regression trees (CART)
Examples of modern techniques
Artificial neural networks
Software
GENERALISED LINEAR MODELS
Brew, J.S. & Maddy, D. 1995. Generalised linear modelling. In Statistical Modelling
of Quaternary Science Data (eds. D. Maddy & J.S. Brew) – Quaternary Research
Association Technical Guide 5
Crawley, M.J. 1993. GLIM for ecologists – Blackwell
Crawley, M.J. 2002. Statistical computing: An introduction to data analysis using
S-PLUS – Wiley
Crawley, M.J. 2005. Statistics. An introduction using R - Wiley
Crosbie, S.F. & Hinch, G.N. 1985. New Zealand J. Agr. Research 28, 19-29
Faraway, J.J. 2004. Linear Models with R. Chapman & Hall/CRC.
Faraway, J.J. 2006. Extending the Linear Model with R – Chapman & Hall/CRC
Fox, J. 2002. An R and S-PLUS companion to applied regression – Sage.
McCullagh, P. & Nelder, J.A. 1989. Generalized Linear Models – Chapman & Hall
Nicholls, A.O. 1989. Biological Conservation 50, 51-75
O’Brian, L. 1992. Introducing Quantitative Geography. Measurement, Methods and
Generalized Linear Models - Routledge
WHAT ARE GENERALISED LINEAR MODELS?
Not a straight-line relationship between response variable and predictor variable.
Linear model is an equation that contains mathematical variables, parameters
and random variables that is LINEAR in the parameters and the random
variables.
y = a + bx
y = a + bx + cx2 = a + bx + cz
(x2 = z)
y = a + bex = a + bz
where z = exponential (x)
Some non-linear models can be linearised by transformation
y = exp (a + bx)
Logny = a + bx
Michaelis-Menten equation
y
ax
1  bx
1
1
ab
y
x
Take reciprocals
Linear models are not necessarily
straight-line models:
Inverse polynomials:
a) polynomial (y =1 + x – x2/15)
a) The Michaelis-Menten or Holling
functional response equation;
b) exponential (y = 3 + 0.1ex)
b) The n-shaped curve 1/y = a + bx + c/x
These are linear models!
Some models are intrinsically non-linear:
b
y a
cx
hyperbolic function
y  a1  becx 
asymptotic exponential
No transformation can linearise them in all parameters
EXAMPLES OF GENERALISED LINEAR
FUNCTIONS
A SIMPLE GENERALISED LINEAR MODEL
Primary aim - provide a mathematical expression for use in
description, interpretation, prediction, or reconstruction involving
the relationship between variables
y = a + bx
Want to find linear combinations of predictor (= explanatory or
independent) (x) variables which best predict the response variable
(y).
y  a  bx  
Systematic
component
Error
component
Influences estimates of
a and b
Five steps:
1. Identification of response (y) and predictor (x) variables.
2. Identification of model equation.
3. Choice of appropriate error function for response variable.
4. Appropriate model parameter estimation procedures.
5. Appropriate model evaluation procedures.
R
ADVANTAGES OF GLM
1: Error function can follow several distributions, not just normal
distribution.
Errors may be: strongly skewed
kurtotic
strictly bounded (0/1, proportions, %)
cannot lead to negative fitted values (counts)
2: Linear combination of the x variables, LINEAR PREDICTOR  (‘eta’)
may be used to predict y through a non-linear intermediary
function, so-called LINK FUNCTION. Use of a non-linear link
function allows the model to use response and predictor variables
that are measured on different scales by effectively mapping the
linear predictor onto the scale of the response variable.
3: Common framework for regression and ANOVA.
4: Can handle many problems that look non-linear.
5: Not necessary to transform data since the regression is transformed
through the link function.
STRUCTURE OF GENERALISED LINEAR MODEL
(1) ERROR FUNCTION
Poisson
Binomial
Gamma
Exponential
count data
proportions, 1/0
data with constant coefficient of variation
data on time to death (survival analysis)
CHARACTERISTICS OF COMMON GLM PROBABILITY DISTRIBUTIONS
Probability
Range of y
Variance function
Gaussian
- to 
1
Poisson
0 (1) 

Binomial
0 (1) n
 (1 - /)
Gamma
0 to 
2
Inverse Gaussian
0 to 
3
Choice depends on range of y and on the proportional relationship
between variance and expected value .
Some members of the exponential family
of probability distributions
ECOLOGICALLY MEANINGFUL ERROR
DISTRIBUTIONS
Normal errors rarely adequate in ecology, but GLM offer
ecologically meaningful alternatives.
• Poisson. Counts: integers, non-negative, variance
increases with mean.
• Binomial. Observed proportions from a total: integers,
non-negative, have a maximum value, variance largest
at  = 0.5.
• Gamma. Concentrations: non-negative real values,
standard deviation increases with mean, many near-zero
values and some high peaks.
J. Oksanen (2002)
(2) LINEAR PREDICTOR
m
   xijj
j 1
unknown parameters
LINEAR STRUCTURE
predictor variables
To determine fit of a given model, linear predictor is needed for
each value of response variable and then compares predicted
value with a transformed value of y, the transformation to be
applied specified by LINK FUNCTION. The fitted value is
computed by applying the inverse of the link function to get back
to the original scale of measurement of y.
Log-link - Fitted values are anti-log of linear predictor
Reciprocal link - Fitted values are reciprocal of linear predictor
(3) LINK FUNCTION
Link function relates the mean value of y to its linear predictor (η).
η = g(μ)
where g(·) is link function and μ are fitted values of y.
y = predictable component
+ error component
y=
+
Linear predictor is sum of terms for each of the parameters and
value of  is obtained by transforming value of y by link function and
obtaining predicted value of y as inverse link function.
μ = g-1(η)
Can combine link function and linear predictor to form basic or core
equation of GLM.
y = g-1 (η) + ε
Error component
OR
g(y) = η + ε
Link function
Linear predictor
The link functions used by GLIM. The canonical link function for normal errors
is the identity link, for Poission errors the log link, for binomial errors the logit
link and for gamma errors the reciprocal link.
Symbol
Link function
Formula
Use
Regression or
I
Identity
η=μ
ANOVA with
normal errors
Count data with
L
Log
η = log μ
Poission errors
Proportion data
  
 z
G
Logit
η = log
with binomial errors
n - 
R
Reciprocal
P
Probit
C
S
E
η=
1

η = Φ-1 (μ/n)
Complementary
η = log[–log(1-μ/n)
log-log
Square root η = √μ
Exponent
η = μ** number
Link Function
Identity
Log
Power p
Logit
Probit
Definition
η=μ
η = ln μ
η = μP
η = ln[μ/(1- μ)]
η = Φ-1 (μ)
Continuous data
with gamma errors
Proportion data in
bioassays
Proportion data in
dilution assays
Count data
Power functions
Range of fitted value
-∞ to ∞
0 to ∞
0 to ∞
0 to 1
0 to 1
Some common link functions in generalised linear models
Link
GLIM notation
Link function
($LINK=)
(η=)
Identity
I
μ
Logarithmic
L
log μ
Logit
G
log (μ/n –μ)
Probit
P
Φ-1 (μ/n)
Square root
Exponent
Reciprocal
S
E
R
√μ
μ**{number}
1/μ
Notes: the following are defaut configurations set automatically by GLIM if $LINK
is omitted:
Error
Normal
Poisson
Binomial
Gamma
Implied $LINK
Identity
Logarithmic
Logit
Reciprocal
ENSURE FITTED
VALUES STAY WITHIN
REASONABLE
BOUNDS
Data Type
Error Functions Link Functions
Continuous interval Normal
Identity, Power family
Continuous ratio
Logarithmic, Reciprocal, Power
family
Gamma
Inverse Gaussian
Count
Poisson
Logarithmic
Count
Binomial
Logit, Probit, Complementary log-log
Binary
Binomial
Logit, Probit, Complementary log-log
Category
Multinomial
Logit, Probit, Complementary log-log
Ordered category
Multinomial
Logit, Probit, Complementary log-log
Common combinations of Error Functions and Link Functions
TYPES OF GLM ANALYSIS
Examples of some error distributions and link functions
Type of analysis
Response
Explanatory
variable
variable
Regression
Continuous
Continuous
ANOVA
Continuous
Factor
ANCOVA
Continuous
Both continuous
and factor
Regression
Continuous
Continuous
Contingency
Count
Factor
table
Proportions
Proportion
Continuous
Probit
Proportion
Continuous
(dose)
Survival
Binary
Factor
(alive or dead)
Survival
Time to death
Continuous
Examples of generalised linear models
Method
Link function
Linear regression
Identity
ANOVA
Identity
ANOVA (random effects) Identity
Log-linear model:
symmetric
Poisson
asymmetric
Binomial or Multinomial
Binomial or multinomial
Logit regression
Binomial or multinomial
Link
function
Identity
Identity
Log
Error
distribution
Normal
Normal
Gamma
Reciprocal
Log
Gamma
Poisson
Logit
Probit
Binomial
Binomial
Complementary
log-log
Reciprocal
Binomial
Error distribution
Normal
Normal
Gamma
Logarithmic
Logit
Logit
Probit regression
Probit
Source: After O’Brian (1983) and O’Brian and Wrigley (1984)
Exponential
GENERALISED LINEAR MODELS – A SUMMARY
Mathematical extensions of linear models that do not force
data into unnatural scales. Thereby allow for non-linearity
and non-constant variance structures in the data.
Based on an assumed relationship (link function) between
the mean of the response variable and the linear
combination of the predictor variables.
Data can be assumed to be from several families of
probability distributions – normal, binomial, Poisson,
gamma, etc – which better fit the non-normal error
structures of most real-life data.
More flexible and better suited for analysing real-life data
than 'conventional' regression techniques.
PARAMETER ESTIMATION
Given error function and link function can now formulate linear predictor term.
Need to be able to estimate its parameters and find linear predictor that
minimises the deviance.
Normal distribution, least-squares algorithm appropriate. Other error functions
need maximum likelihood estimation.
In maximum likelihood, aim is find parameter values that give ‘best fit’ to the
data. Best in ML considers:
1. data on response variable y
2. model specification
3. parameter estimates
Need to find the MINIMAL ADEQUATE MODEL to describe the data.
‘BEST’ model is that producing the minimal residual deviance subject to the
constraint that all the parameters in the model are statistically significant.
Model should be minimal because of principle of parsimony and adequate
because there is no point in retaining an inadequate model that does not describe
a significant part of the variation in the data.
NO ONE MODEL, many possible models may be adequate. Need to find MINIMAL
ADEQUATE MODEL.
PRINCIPLE OF PARSIMONY (Ockham’s Razor)
1. Models should have as few parameters as possible.
2. Linear models are to be preferred to non-linear models.
3. Models relying on few assumptions are to be preferred to models
with many assumptions.
4. Models should be simplified until they are minimal adequate.
5. Simple explanations are to be preferred to complex ones.
Maximum likelihood estimation, given the data, model, link, and error
functions, provides values for the parameters by finding iteratively
the parameter values in the model that would make the data most
likely, i.e. to find the parameter values that maximise the
likelihood of the data being observed.
Depends not only on the data but on the model specification.
CONCEPT OF DEVIANCE
Deviance - measure of the goodness of fit
Fitted values are most unlikely to match the observed data perfectly. Size of
discrepancy between model and data is a measure of the inadequacy of the
model. DEVIANCE is measure of discrepancy. Twice the log likelihood of the
observed data under a specified model.
Its value is defined relative to an arbitrary constant, so that only differences in
DEVIANCE (i.e. ratios of likelihoods) have any useful meaning.
CONSTANT is deviance for FULL MODEL – parameter for each observation – is zero.
Discrepancy of fit is proportional to twice the difference between the maximum
log likelihood achievable and that attained using a particular model.
OTHER OUTPUT FOR GLM
Parameter estimates, standard errors, t-values
Standardised parameter estimates (estimates/se)
Fitted values
Covariance matrix for parameter estimates
Standardised residuals
REF
CALCULATION OF DEVIANCE
REF
The formulae used by GLIM in calculating deviance, where y is the data and μ
is the fitted value under the model in question (the grand mean in the simplest
case); note that, for the grand mean, the term Σ(y – μ) = 0 in the Poisson
deviance, and so this reduces to 2Σyln(y/μ); in the binomial deviance, n is the
sample size (the binomial denominator), out of which y successes were
obtained.
REF
Error structure
Deviance
Normal
(y - )2
Poisson
2[y ln (y/) – (y - )]
Binomial
2{y ln (y/) + (n – y) ln [(n – y)/(n - )]}
Gamma
2[-ln (y/) + (y - )/]
Inverse Gaussian
(y - )2/(2y)
REF
MODEL BUILDING
Aim is to find minimal adequate model and use deviance as
principal criterion for assessing different models.
GENERAL LINEAR MODELS
•
•
•
•
Common framework for Regression Analysis and ANOVA
Goodness of fit: Sum of Squares (SS)
Least squares estimation
Degrees of freedom (df) = {Number of observations} minus {number of
parameters}, or df = n – p
•
Statistical testing: Compare two models with different number (p and m)
of estimated parameters
SSdiff  SSp  SSm
p2m  SSdiff
dfdiff  df p  df m  p  m
Fp m,n p 
SSdiff / dfdiff
SSp /n  p 
REGRESSION ANALYSIS
Is the regression coefficient significant?
μ = b0
df = N – 1
SS0
μ = b0 + b1x
df = N – 2
SSA
F1,N 2
SS - SS /1

SS /N  2
0
A
A
ANOVA (Analysis of variance)
Are the class means equal?
B
μ = b0
df = N – 1
SS0
μ = b0 + b1B + b2C
df = N – 3
SSA
A
C
SS - SS /1
F2,N 3 
SS /N  3
0
A
A
CLASS
A
B
C
1
1
0
0
2
0
1
0
3
0
0
1
In GLM we have DEVIANCE RATIO TEST
To consider if model A is a significant improvement over model B,
we use:

DA  DB / dfA  dfB
F
DB / dfB

(Deviancedifference)/df difference
DevianceB/dfB
F corresponding to  =0.05
df1 = dfA – dfB
df2 = dfB
Value greater than tabulated value of F would indicate
model A is a significant improvement over model B.
REF
LEAST SQUARES AND MAXIMUM LIKELIHOOD

REF

 xi - μi  2
2 
LL  i
log
2
πσ

2

2
σ


Least squares maximize Normal log-likelihood
Other error distributions can be used in analogous way
Deviance is based on log-likelihood, and has the same distribution
- Deviance = 0: Observed and fitted values are equal (= ‘deviation’)
- Deviance is always positive
Log-likelihood, Sum of Squares and Deviance follow Chi-Squared distribution
Scaled Chi-Squared distribution follows F distribution
REF
REF
STATISTICAL TESTING IN GLM
Deviance: Same distribution as Sum of Squares
- Chi-squared: Model fits
- F test: Scaled deviance
Tests exactly like general linear models
Expected value of deviance = degrees of freedom
Overdispersion: Model does not fit
- Deviance > degrees of freedom
Deviance must be scaled
- Divide by overdispersion coefficient (D/df)
- Use F test (scaling automatic)
GOODNESS OF FIT AND MODEL INFERENCE
• Deviance: Measure of goodness of fit
– Derived from the error function: Residual sum of squares in
Normal error
– Distributed approximately like x2
• Residual degrees of freedom: Each fitted parameter uses one
degree of freedom and (probably) reduces the deviance.
• Inference: Compare change in deviance against change in degrees
of freedom
• Overdispersion: Deviance larger than expected under strict
likelihood model
• Use F–statistic in place of x2.
J. Oksanen (2002)
MODEL BUILDING
The aim of the exercise is to determine the minimal adequate model in which
all the parameters are significantly different from zero. This is achieved by a
step-wise process of model simplification, beginning with the full model, then
proceeding by the elimination of non-significant terms, and the retention of
significant terms.
Full model
One parameter for each and every data point
Fit: perfect
Degrees of freedom: none
Explanatory power of model: none
Maximal model
Contains all (p ) factors, interactions and covariates that might be of interest
Many of the model’s terms may be insignificant
Degrees of freedom: n – p – 1
Minimal adequate model A simplified model with 0  p 1  p terms
All the terms are significant (if there are no significant terms, then MAM = null model)
Degrees of freedom: n – p’ – 1
Null model
Explanatory power: r 2 = SSR/SST
A single parameter, the grand mean
Fit: none
Degrees of freedom: n – 1
Explanatory power of model: none
MINIMAL ADEQUATE MODEL
Adequate model is statistically as acceptable as the most complex model
Start with all explanatory variables in the model: Full model
Try all models and accept the minimal adequate model
Minimal adequate model is
- Adequate itself
- Has no adequate submodels
If you are lucky, you have only one adequate model which is minimal as
well
If the full model has Sum of Squares SSf with p parameters, the tested
model is adequate if its SSr satisfies:
SSr / SSf > 1+p Fα,p,n-p-1/(n-p-1)
α is the risk level adjusted for number of parameters,
e.g. α = 1-0.05p
The steps involved in model simplification. There are no hard and fast rules,
and this is only a guide to one sensible way of approaching the problem of
model simplification.
Step
1
Procedure
Fit the maximal model
Explanation
Fit all the factors, interactions and covariates of
interest. Note the residual deviance. Check for
overdispersion (Poisson or binomial errors), and rescale if
necessary
2
Begin model simplification
3
If the deletion causes an
insignificant increase in
deviance
Inspect the parameter estimates
Remove the least significant terms first starting with
the highest order interactions
Leave the term out of the model
4
If the deletion causes a
significant increase in
deviance
Inspect the parameter estimates
Remove the least significant term remaining in the model
Put the term back into the model
These are the statistically significant terms as assessed by
deletion from the maximal model
5
Keep removing terms from
the model
Repeat steps 3 or 4 until the model contains nothing
but significant terms
The resulting model is the minimal adequate model
If none of the parameters is significant, then the null
model is the minimal adequate model
EXAMPLE OF FINDING MINIMAL ADEQUATE MODEL
Effect of altitude on sulphur concentration in terricolous lichens
Explanatory variables
- ALT: Altitude (m)
- SPE: Species (Cetraria nivalis, Hypogymnia physodes)
- EXP: Exposition (E, W)
- FJE: Fjell (three alternatives)
Parameters
- n = 72, p – 1 = 23, df = 48, α = 1 – .0523 = 0.693
Minimal adequate model:
- RSSr/RSSf = 1 + 23 · 0.819 / 48 = 1.392
Model
alt*spe*exp*fje
alt*spe*exp
alt*spe*fje
spe*exp*fje
alt*exp*fje
RSSr/RSSf
1
1.978
1.284
1.352
8.697
Full
Reject
Adequate
Adequate
Reject
TOOLS FOR FINDING MINIMAL ADEQUATE
MODEL OR PARSIMONY
AIC - Akiake information criterion (or penalised log likelihood)
BIC - Bayes information criterion
AIC = -2 x log likelihood + 2(parameters + 1)
(1 is added for the estimated variance, an additional parameter)
BIC = -2 x log likelihood + logen(parameters + 1)
R
More parameters in the model, better the fit but less and less
explanatory power.
Trade-off between goodness of fit and the number of
parameters. AIC and BIC penalise any superfluous parameters
by adding 2p (AIC) or logen times p (BIC) to the deviance.
AIC applies a relatively light penalty for lack of parsimony.
BIC applies a heavier penalty for lack of parsimony.
Select the model that gives the lowest AIC and/or BIC.
R
MODEL NOTATION
REF
Variables X and Y
REF
Factors A,B,C with levels i, j, k (categorical variables)
Model formula involves parameters being added to model, one for each variable and (n – 1) for each n
level factor.
Proportions of a given lithology (A – factor) may depend on depth (X – variable) and site (B – factor).
Additive model
Linear predictor
A+B+X
  K  i   j  x
constant
parameter
Parameter with appropriate factor level
What if proportion of a given lithology A may depend on depth and site in such a way that the effect of
depth is different at different sites.
Interaction term between
main effects of B and X
Model
AB X BX
  K  i   j  i X
For each lithology factor level
Interaction term between two factors A and B is A.B and introduces a new factor ()ij for each
combination of factor levels.
Interaction term between two variables X and Y (X.Y) is equal to new variable Z = (XY).
Multiple interactions:
A.B.C = A + B + C + A.B + A.C + B.C + A.B.C
REF
EXAMPLES OF GLMs
TAYLOR (1980): California precipitation – 30 localities
(1) TOTAL DEVIANCE (SS) =
Altitude, latitude,
distance to coast
PPTN
Error function
Link function
Quantitative variable
predictors
Response variable
normal
identity   
(2)
ALT
t
3.36
*
+
LAT
+
8012 29 df
DIST
+ constant 3202 26 df
–3.93
*
4.34
*
–3.51
(estimate/se)
*
Deviance difference
 0.4  r 2  0.60
(3) ADD BINARY DUMMY VARIABLES for RAIN-SHADOW
EFFECT
t
ALT
2
*
+
LAT
5.23
*
+
DIST
-3.1
*
+
RS
3.63
*
2098 25df
Deviance difference
 0.26  r 2  0.74
(a) Location of California weather stations;
(b) Map of regression residuals; (c) Map of
regression residuals from second analysis.
Pine and spruce needle damage and SO2 emissions
Predicted damages and their 95% confidence
limits against sulphur concentration of Scots pine
needles. The regression model was fitted with
different levels (heights of the peaks) for the
transects and using observed shoot lengths as
offset; the lines shown correspond to transect 1
and 1cm shoot length
Diatom – pH responses
The Gaussian response curve for the abundance
value (y) of a taxon against an environmental
variable (x) (u = optimum or mode; t = tolerance; c
= maximum).
Gaussian Logit Model
yk(x) =
ck exp
-1/2(x -uk)2 / tk 2
1  ck exp
-1/2(x -uk)2 / tk 2
yk(x) is expected proportional abundance of taxon k as a
function of x (pH)
Generalised linear model
 p 
 b0 + b1x + b2x2
 1 p 
log = 
where p is shorthand for yk(x)
Gaussian response function:
GLM estimation
 ( x  u)2 
μ = h exp   2t 2 


log μ = b0 + b1x + b2x2
• Gaussian response function can be written as a generalized linear model
(which is easy to fit)
- Linear predictor: explanatory variables x and x2
- Link function log (or logit)
- Error Poisson (or Binomial)
• The original Gaussian response parameters can be found by
u = -b1/2b2
t=
 1 2b2
h = exp(b0 - b12 / 4b2)
OPTIMUM
TOLERANCE
HEIGHT
Results of fitting Gaussian logit, linear logit and null models to the
SWAP 167-lake training set and lake-water pH
225 taxa
No. of taxa
Non-converging
1
Gaussian unimodal curves with maxima (b2 < 0)
88
Linear logit sigmoidal curves
78
Gaussian unimodal curves with minima (b2> 0)
5
No pattern
53
Significant Gaussian logit model
88
Significant linear logit model
78
Non-significant fit to pH
58
SEVERAL GRADIENTS
• Gaussian response can be fitted to several gradients: Bell-shaped
models
J. Oksanen (2002)
INTERACTIONS IN GAUSSIAN RESPONSES
• No interactions: responses parallel to the gradients
• Interactions: the optimum on one gradient depends on the
other
J. Oksanen (2002)
ASYMMETRIC RESPONSES
β – function
parameters determining
shape
Y =  (x – a)α
constant
GLM
(b – x)γ
lower and
upper limit of
env. var x
Log (γ) = Log () + αLog (x – a) + γLog (b – x)
with log (x - a) and log (b - x) as explanatory variables and a log link
function
 and  define location of mode, skewness of response, and kurtosis of
response
response is zero at a and b
 is a scaling parameter
R
SELECTION OF RESPONSE MODELS
Huisman, Olff & Fresco - Hierarchical models of species-environment
responses.
Huisman, Olff & Fresco (1993) – J. Veg. Sci. 4, 37–46
Plateau
III
Environmental gradient x
HOF
Oksanen & Minchin (2002) - Ecol. Modelling 157, 119-129
HOF MODELS
Huisman-Olff-Fresco: A set of five
hierarchic models with different
shapes.
Model
Parameters
V
Skewed
a
b
c
d
IV
Symmetric a
b
c
b
III
Plateau
a
b
c

II
Monotone
a
b
0
0
I
Flat
a
0
0
0
J. Oksanen (2002)
V yM 1
abx
1 e
1
1 ecdx
1
1
1  eabx 1  ecbx
II y  M 1
1  eabx
I yM 1
1  ea
IV y  M
M = sample total
(e.g. 100)
4 parameters to estimate
3 parameters to estimate
2 parameters
1 parameter
Hierarchical model means that a simpler model has (1) fewer
parameters than the complex model and (2) can be derived by
simplifying a more complex model by deleting one or more parameters.
y = expected value which is dependent on the known values of the
environmental gradient x, maximum possible value (M), and
parameters a, b, c, and d.
Model
Parameters
 parameters
V
Skewed
a
b
c
d
4
IV
Symmetric
a
b
c
b
3
II
Monotonic
a
b
0
0
2
I
Flat, null
a
0
0
0
1
[III
Plateau
a
b
c

3]
HOF
HOF
- fits most complex model V first by maximum likelihood, then
IV, II and I (backward elimination)
- calculates deviance, if drop in deviance greater than 3.84,
extra parameter is significant at p < 0.05 (2 distribution).
- if data are overdispersed (deviance > degrees of freedom),
cannot use 2 test. Must use F-test.
F1,n  p 
Dp 1  Dp
Dp /n  p 
- model is simplified as long as the removed parameters are not
significant at p < 0.05
- can specify Poisson or binomial error function
HOF - estimation is stopped when first significant term is found
- evaluate how many taxa have significant fits to models V, IV, II
and I
- adopt the simplest model which cannot be simplified without a
significant change in deviance
- as model III (plateau) has the same number of parameters as
model IV, not fitted routinely. If model IV is rejected in favour of
model V, the latter is compared against model III and model
simplification is continued
HOF
HOF: INFERENCES ON RESPONSE SHAPES
• Alternative models differ only in
response shape
• Selection of most parsimonious
model with statistical criteria
• 'Shape' is a parametric concept,
and parametric HOF models may
be the best way of analysing
differences in response shapes.
Most parsimonious HOF models on
altitude gradient in Mt. Field,
Tasmania.
J. Oksanen (2002)
MODEL CRITICISM
Faraway 2005, 2006
1. All models are wrong
2. Some models are better than others
3. The correct model can never be known with certainty
4. The simpler the model, the better it is
In GLM may have mis-specified model, error structure, or
link function.
MODEL CRITICISM
Plot residuals against fitted values
For non-Normal models: Use Anscombe or Pearson residuals
Normality: Plot ordered residuals against a Normal deviate
Any pattern: Something wrong
Bent residual belt:
Wrong systematic part
Wrong link function
Wrong or missing explanatory variables
Widening residual belt: Wrong error function
Leverage values show the influential observations
Influential observations: small residuals
Leverage > 2p/N is high
EXAMPLE OF MODEL CRITICISM
CLASSIFICATION
What has this to do with regression analysis?
What is classification as distinct from clustering and
partitioning (Lecture 3) (= unsupervised pattern
recognition)?
Classification involves multivariate data that fall
into two or more a priori groups, so-called
supervised pattern recognition
Range of questions that can be asked of such data.
1. Do the groups involved have different mean vectors for
the available measurements?
Multivariate equivalent of familiar univariate t-test,
Hotelling’s T2 and multivariate analysis of variance.
Linear discriminant analysis (2 groups) or multiple
discriminant analysis (3 or more groups) (also known as
canonical variates analysis).
2. For grouped multivariate data, it is possible to use
the measurements to construct a classification rule
derived from the original observations (training set)
that will allow new individuals having the same set of
measurements but no known group identity to be
allocated to a group or classified in such a way that
misclassifications are minimised.
A.H. Fielding (2007) Cluster and classification
techniques for the biosciences. Cambridge University
Press
Can formulate this classification problem as a
regression problem
Response Variable
Predictor Variable
Class 1
Class 2
x1, x2, x3, … xm
1
0
x1, x2, x3, … xm
1
0
x1, x2, x3, … xm
1
0
x1, x2, x3, … xm
0
1
x1, x2, x3, … xm
0
1
x1, x2, x3, … xm
0
1
x1, x2, x3, … xm
Regression with 0/1 response variable(s) and predictor
variables
DISCRIMINANT FUNCTION FOR SEXING FULMARINE
PETRELS FROM EXTERNAL MEASUREMENTS
(Van Franketer & ter Braak (1993) The Auk, 110: 492-502)
Lack plumage characters by which sexes can be recognised.
Problems of geographic variation in size and shape.
Approach:
1.
A generalised discriminant function from data from sexed birds of a
number of different populations
2.
Population – specific cut points without reference to sexed birds
Measurements
Five species of fulmarine petrels
HL – head
length
Antarctic petrel Northern fulmar
Cape petrel
Southern fulmar
Snow petrel
CL – bill length
BD – bill depth
TL – tarsus
length
STEPWISE MULTIPLE REGRESSION
Ranks characters according to their discriminative power, provides
estimates for constant and regression coefficient b1 (character
weight) for each character.
For convenience, omit constant and divide the coefficient by the
first-ranked character.
Discriminant score = m1 + w2m2 + ..... + wnmn
where mi = bi/b1
Cut point – mid-point between ♂ and ♀ mean scores.
Reliability tests
1. Self-test
- how well are the two sexes discriminated? Ignores
bias, over-optimistic
2. Cross-test
- divide randomly into training set and test set
3. Jack-knife (or leave-one-out – LOO)
- use all but one bird, predict it, repeat for all birds.
Use n-1 samples. Best reliability test.
Small data-sets - self-test
OVERESTIMATE
- cross-test
UNDERESTIMATE
- jack-knife
RELIABLE
MULTISAMPLE DISCRIMINANT ANALYSIS
If samples of sexed birds in different populations are small but different
populations have similar morphology (i.e. shape) useful to estimate
GENERALISED DISCRIMINANT from combined samples.
1.
2.
Cut-point established with reference to sex
(determined by dissection)
WITH SEX
Cut-point without reference to sex
NO SEX
Decompose mixtures of distributions into their underlying
components. Maximum likelihood solution based on assumption of
two univariate normal distributions with unequal variances.
Expectation – maximisation (EM) algorithm to estimate means 1 and
2 and variances 1 and 2 of the normals.
Cut point is where the two normal densities intersect.
xs = (22 - 12)-1 {122 - 212 + 12 [(1 - 2)2
+ (12 - 22) log n 12/22]0.5}
LOCALLY WEIGHTED REGRESSION
Cleveland, W.S. 1979. J. Amer. Stat. Association 74, 829-836
Cleveland, W.S. 1993. Visualizing Data. AT & T Bell Laboratories
Cleveland, W.S. 1994. The Elements of Graphing Data. AT & T
Bell Laboratories
Crawley, M.J. 2002. Statistical Computing – an introduction to
data analysis using S-PLUS. Wiley
Efron, B. & Tibshirani, R. 1981. Science 253, 390-395
Trexler, J.C. & Travis, J. 1993. Ecology 74, 1629-1637
LOCALLY WEIGHTED REGRESSION
W. S. Cleveland
LOWESS
or
LOESS
Locally weighted
regression scatterplot
smoothing
May be unreasonable to expect a single functional relationship between Y and X throughout
range of X.
(Running averages for time-series – smooth by average of yt-1, y, yt+1 or add weights to yt-1,
y, yt+1)
LOESS - more general
1.
2.
3.
4.
5.
6.
7.
Decide how ‘smooth’ the fitted relationship should be.
Each observation given a weight depending on distance to observation x1 for adjacent
points considered.
Fit simple linear regression for adjacent points using weighted least squares.
Repeat for all observations.
Calculate residuals (difference between observed and fitted y).
Estimate robustness weights based on residuals, so that well-fitted points have high
weight.
Repeat LOESS procedure but with new weights based on robustness weights and
distance weights.
Repeat for different degree of smoothness, to find ‘optimal’ smoother.
R
(A) Survival rate (angularly transformed) of tadpoles in a single enclosure plotted as a function of the average
body mass of the survivors in the enclosure. Data from Travis (1983). Line indicates the normal least-squares
regression. (B) Residuals from the linear regression depicted in Part A plotted as a function of the independent
variable, average body mass.
(A) Data from above with a line depicting a least-squares quadratic model. (B) Data from above with a
line depicting a LOWESS regression model with f = 0.67. (C) Data from above with a line depicting a
LOWESS regression model with f = 0.33.
How the Loess smoother works. The shaded region indicates the window of
values around the target value (arrow). A weighted linear regression (broken
line) is computed, using weights given by the “tricube” function (dotted line).
Repeating the process for all target values gives the solid curve.
An air pollutant, ozone, is
graphed against wind speed.
From the graph we can see that
ozone tends to decrease as
wind speed increases, but
judging whether the pattern is
linear or nonlinear is difficult.
Loess, a method for smoothing data, is
used to compute a curve summarizing
the dependence of ozone on wind
speed. With the curve superposed, we
can now see that the dependence of
ozone on wind speed is nonlinear.
α = “bandwidth”
parameter
0.3-0.5
λ = polynomial
order of fitted
local regression
model
The three loess curves have three
different values of the smoothing
parameter, α. From the bottom
panel to the top the values are 0.1,
0.3 and 0.6. The value of λ is 2.
Three loess fits are shown. From
the bottom panel to the top, the
two parameters, α and λ, are the
following:
0.1 and 1; 0.3 and 1 and 0.3 and 2.
REF
LOESS – STATISTICAL ASPECTS
REF
Can express its complexity by the number of degrees of freedom (DF)
taken from the data by the fitted model = equivalent number of
parameters.
As LOESS produces fitted values of the response variable, can calculate
variability in the response values accounted for by the LOESS fitted model
and compare it with the residual sum of squares.
As we have the DF of the fitted model, can calculate residual DF and
calculate sum of squares per one degree of freedom (corresponding to the
mean square in an ANOVA table for a classical regression model).
Thus we can compare different LOESS models using an ANOVA approach of
regression and residual sum of squares or deviance. Can also use
generalised cross-validation to find ‘optimal’ LOESS model.
REF
REF
‘In any specific application of LOESS,
the choice of the two parameters  and
 must be based on a combination of
judgement and trial and error. There is
no substitute for the latter’
Cleveland (1993)
SPLINE FUNCTIONS
Faraway 2006, Crawley 2002
Given data of x and y variables on the same n objects, can
connect these points with a smooth, continuous line – spline
function.
Named from the flexible drafting spline made from a narrow
piece of wood or plastic that can be bent to conform to an
irregular shape.
Splines are not analytical functions and they are not statistical
models like regressions.
Purely arbitrary and have no real theoretical basis except the
theory that defines the characteristics of the lines themselves.
Extremely useful for interpolation for smoothing in two or three
dimensions.
Splines are piecewise polynomials that are constrained to have continuous
derivatives at the joints or knots between the pieces or segments.
Cubic spline consists of cubic polynomials which are functions of the form:
y  1  2 x  3 x 2  4 x 3
The curve defined by a cubic polynomial can pass exactly through four
points. For a set of observations with n > 4, need to use a succession of
polynomial segments.
To ensure that there are no abrupt changes in slope or curvature between
successive segments, the polynomial function is not fitted to four points
but only to two. This allows using additional constraints to ensure that the
resulting spline has continuous first derivatives between segments (the
slope of the line will be the same on either side of a joint) and continuous
second derivatives (the rate of change in the slope of the line will not
change across a joint).
A spline of degree n will have continuous derivatives across the points up
to order n – 1.
R
REF
Mathematical Explanation
REF
A smoothly joined piecewise polynomial of degree n.
t1, t2, …, tn are a set of n values in the interval a, b
so that a < t1  t2  …  tn  b.
Cubic spline is a function g such that on each of the intervals
(a, t1), (t1, t2), …, (tn, b) is a cubic polynomial
and
the polynomial pieces fit together at the points t1 in such a way
that g itself and its first and second derivatives are continuous
at each ti and hence on the whole a, b.
The points ti are called knots.
REF
REF
Commonly used type is cubic spline for the smoothed
estimation of the function f in the model
y = f(x) + 
where y = response variable
x = explanatory variable
and
 = error with expected value of zero.
Simple Use of Splines
Basic scatter plot of yield
against irrigation
LOWESS fitted
Curve-fitting is trade off between smoothness and roughness.
Concept of degrees of freedom serves as penalty.
Want smoothest graph that describes relationship between y and
x that has the lowest penalty in terms of degrees of freedom.
2 degrees of freedom
(slope and intercept)
linear fit
n degrees of freedom
3 df no hump
4 df hint of hump
6 df clear hump
Which to use?
Parsimony favours an asymptote (3 df)
over a hump (4 or 6 df).
Need more data to test between
asymptote and hump.
Splines are arbitrary smoothers.
S-PLUS
R
REF
How is a spline fitted? Involves differential calculus.
REF
Construct required estimator for the following minimisation
problem, namely find f to minimise
 y
n
i 1


 f(xi )    f (u) du
2
i

11
2

where primes represent differentiation.
The first term is residual sum of squares which is used as a distance
function between data and estimator.
The second term penalises roughness of the function.
Parameter   0 is a smoothing parameter (degrees of freedom) that
controls trade-off between smoothness of the curve and the bias of
the estimator.
Solution to the minimisation problem is a cubic polynomial between
successive x-values with continuous first and second derivatives at
the observation points.
REF
Uses of Splines
1. Interpolation for smoothing
2. Regression analysis including generalised
additive models (GAM)
Ecological example
van Dobben & ter Braak 1999 Lichenologist
31: 27-39
Lichens and Air Pollution in the Netherlands
1216 groups of 6 tree species in eight 750 km2 areas.
104 lichen species, 65 in 10 or more tree groups.
Pollution data SO2, NO2, and NH3
high correlation between SO2 and NO2 (r = 0.49)
Four models fitted for each of the 65 lichen species
1) Abundance =
b0  b1SO2  b2NO2  b3 NH3  b4 diameter  b5coast (c j treespecies j)
j
where coast = distance to coast
diameter = tree diameter
cj = regression coefficient for dummy (1/0) variable for
tree species j
2) Non-zero abundance =
b0  b1SO2  b2NO2  b3 NH3  b4 diameter  b5coast (c j treespecies j)
j
3) Logit (1/0)
 p 
  b0  b1SO 2  b2NO2  b 3 NH3  b4 diameter
log 
(
1

p
)


 b5coast  (c j tree species j)
j
4) Logit with splines
 p 
  b0  b1SPL q(SO2 )  b2NO2  b 3 NH3  b4 diameter
log 
(
1

p
)


 b5coast (c j tree species j)
j
where SPLq = spline function with q degrees of freedom (q = 1, 2, or 4)
In this context q = 2 allows fitting of a unimodal response, q = 4
bimodal response.
Find q by increasing q stepwise and stop if the resulting increase in fit
is not significant at 1% level based on deviance test.
van Dobben & ter Braak, 1999
van Dobben & ter Braak, 1999
Most species had monotonic response (df = 1).
Nearly all species sensitive to SO2
about 50% for NO2
33% for NH3
Because of high correlation between SO2 and NO2, excluded the NO2
term when fitting for SO2.
For NO2, fitted the SO2 term first.
The 'true' sensitivity to SO2 may therefore be lower than modelled.
NH3 uncorrelated with SO2 and NO2. Ecological effect is not through
toxicity but through its effect on bark pH. Causes a shift from
acidophilic to acidiphobous species.
GENERALISED ADDITIVE MODELS (GAM)
Semi-parametric extension of generalised linear models GLM:
intercept or constant
GLM
predictor variables
p
gEy      x      j x j
j 1
link function
modelled abundance of
response variable y
regression coefficients or
model parameters
e.g. Ordinary least-squares regression - identity link, normal error distribution
Ey =  +  jxj
e.g. 2-dimensional Gaussian logit regression - logit link, binomial error distribution
 p 
    1 x1  2 x12  3 x 2   4 x 22
Logit(p)  Log 
1  p 
Requires a priori statistical model, e.g. Gaussian logit model, β-response model, etc.
What if the response is bimodal, is badly skewed, or is more complex than a priori model?
GLM may not be flexible enough to approximate the true response adequately. GLM are
model-driven.
GAM
intercept or constant
predictor variables
p
gEy     fx     fj x j 
j 1
link function
modelled abundance
of response variable y
unspecified smoothing functions
estimated from data using smoothers
to give maximum explanatory power
fj are unspecified smoothing functions estimated from the data using techniques
developed for smoothing scatter plots, e.g. loess, cubic splines.
Data determine shape of response curve rather than being limited by the shapes
available in parametric GLM. Can detect bimodality and extreme skewness.
Regression surface g(Ey) for taxon y is expressed as a sum of the functions for each
variable xj so each has an additive effect, hence GAM.
GAM are data-driven, the resulting fitted values do not come from an a priori model.
Still some statistical framework with link functions and error specification
Need to specify the type of smoother and their complexity in terms of their degrees of
freedom.
R
GENERALISED ADDITIVE MODELS
Efron, B. & Tibshirani, R. 1991 Science 253, 390-395
Yee, T.W. & Mitchell, N.D. 1991 J. Vegetation Science 2, 587-602
Guisan, A. et al. 2002 Ecological Modelling 157, 89-100
Hastie, T.J. & Tibshirani, R. 1990 Generalized Additive Models.
Chapman & Hall
Wood, S.N. 2006. Generalized Additive Models. An introduction
with R. Chapman & Hall/CRC.
GENERALIZED ADDITIVE MODELS (GAM)
• Generalized from GLM; linear
predictor replaced with smooth
predictor
• Smoothing by regression splines
or other smoothers
• Degree of smoothing controlled
by degrees of freedom;
analogous to number of
parameters in GLM
• Everything else like GLM
• Enormous potential use in
ecology
J. Oksanen (2002)
PRINCIPLE OF PARSIMONY IN STATISTICAL MODELLING
“No more causes or factors should be assumed than are necessary to account for
the facts”.
i.e. the simplest model desirable but with maximum explanatory power.
Compromise between simple and complex models.
In GLM, we evaluate role of individual predictors by looking at the magnitude,
sign and likely statistical contribution of the estimated regression coefficients. Fit
most complex model first, and then backward eliminate variables until retain
simplest but with good explanatory power.
In GAM, we can look at fitted smoothers to investigate how the influence of a
particular predictor varies along the range of its possible values. Smoothers can
be chosen to have different levels of detail that are characterised by the
effective number of degrees of freedom used in the fitting of the smoother.
This concept, shared with regression analysis where the individual model terms
correspond to one degree of freedom, allows in conjunction with the concept of
residual deviance explained, to evaluate significance of the variability explained
by the fitted additive models and to make decisions about the significance of any
model improvements by extending from constant  linear GLM  GAM S(2) 
GAM S(3)  GAM S(4).
Simple leave-one-out (jack-knife) to estimate realistic root mean square error of
prediction (RMSEP).
DEGREES OF FREEDOM
The width of a smoothing window (span) = Degrees of Freedom
J. Oksanen (2002)
SWISS MODERN POLLEN AND CLIMATE
MULTIPLE GRADIENTS
• Each gradient is fitted separately
• Interpretation easy: Only the
individual main effects shown and
analysed
• Possible to select good parametric
shapes
• Thin-plate splines: Same
smoothness in all directions and
no attempt at making responses
parallel to axes
J. Oksanen (2002)
INTERACTIONS BETWEEN PREDICTORS ON RESPONSE
OF TAXON
Two explanatory variables show interaction of effects if the effect of one variable depends
on the value of the other.
Test for interaction by extending our regression equation with product terms, i.e.
p
gEy       j x j  q x j x k
GLM
j 1
p
gEy      fj x j   fq x j x k
GAM
j 1
Then test using F-test on deviance if contribution of interaction is significant.
Three groups of taxa based on most appropriate pairs of predictors.
(1) January temperature - Annual pptn
Significant interaction (p≤ 0.05) Pinus cembra
(2)
July temperature - Annual pptn
Significant interaction
Alnus, Pinus sylv
Selaginella selaginoides
Fagus, Carpinus bet
Botrychium
Non-significant interaction
Non-significant interaction
Gramineae (p =0.05)
Betula, Salix
Fraxinus excelsior
Artemisia
Populus
Platanus
(3)
January temperature - July temperature
Non-significant interaction
Quercus, Alnus viridis
INTERACTIONS
GAM are designed to show clearly the main effects in GAM
plots 'Equivalent kernel' is parallel to the axes
"
"
J. Oksanen (2002)
ANDREAEA NIVALIS
Summary of regression statistics for Andreaea nivalis. Included are model type
selected, p-value, and regression coefficient (R) for continuous variables.
Continuous variables; s( ,λ) = smooth spline function (GAM) where λ sets the
width of the neineighbourhood, poly ( ,n) = GLM of the n’th order polynomial.
Categorical variables; Linear = reflects a linear continuous function, Quad =
reflects a quadratic continuous function. Not significant = n.s., not available = n.a.
Variables
Altitude
Slope
Aspect
Rock pH
Rock Ca
Water pH
Soil pH
Soil Ca
LOI
Ri
Substratum
Flushing
Cracks
Weathering
Shelter
Undulation
Concavity
Snow-persistence
Phyllitite
Interactions Flush + Shelter
Flush + Snow
Shelter + Snow
Ri + Flush
Model
s( ,3)
s( ,2)
P(.)
<0.001
<0.001
n.s.
poly( ,2)
<0.001
s( ,2)
<0.001
poly( ,2)
<0.001
n.a.
n.a.
n.a.
poly( ,3)
<0.001
Linear
<0.001
Quad
<0.001
Linear
<0.001
Linear
<0.001
Quad
<0.001
Linear
<0.001
Linear
<0.001
Quad
<0.001
Linear
<0.001
Quad + Quad
<0.001
Quad + Quad
<0.001
Quad + Quad
<0.001
poly( ,1) + Quad <0.001
R
0.7
0.21
Ri + Shelter
poly( ,1) + Quad <0.001
0.69
Ri + Snow
poly( ,1) + Quad <0.001
0.61
Rock pH + Rock Ca
Rock pH + Water pH
Rock Ca + Water pH
s( ,3) + s( ,3)
s( ,2) + s( ,3)
s( ,3) + s( ,3)
0.67
0.38
0.47
<0.001
<0.001
<0.001
0.82
0.81
0.42
0.8
0.79
The response function for A. nivalis
on the continuous environmental
variables: (a) altitude, (b) slope, (c)
rock pH, (d) rock Ca, (e) water pH
and (f) radiation index. = number of
occurrences at each level of the
gradient.
Histogram of A. nivalis on each
categorical variable: (a) substratum,
(b) flushing, (c) cracks, (d)
weathering, (e) shelter, (f)
undulation, (g) concavity, (h) snowpersistence, and (i) phyllite.
Response surface of A. nivalis on: (a) shelter and
flushing, (b) flushing and radiation, (c) snowpersistence + flushing, and (d) snow-persistence +
shelter.
Response surface of A. nivalis on: (a) rock Ca
and rock pH, (b) water pH and rock pH, and (c)
water pH + rock Ca.
GENERALISED ADDITIVE MODELS – A SUMMARY
GAMs are semi-parametric extensions of GLMs.
Only underlying assumptions are that the functions are additive and
that the components are smooth. Like GLM, uses a link function to
establish a relationship between the mean of the response variable
and a 'smoothed' function of the predictor variable(s).
Strength is ability to deal with highly non-linear and non-monotonic
relationships between the response variable and the set of predictor
variables.
Data-driven rather than model-driven (as in GLM). Data determine the
nature of the relationship between response and predictor variables.
Can handle non-linear data structures. Very useful exploratory tool.
A CONTINUUM OF REGRESSION MODELS
Simple Linear Regression  Multiple Linear Regression > GLM > GAM
SLR and MLR -
most restrictive in terms of assumptions but are
most used (and misused!)
GLM -
fairly general but still model-based
GAM -
most general as data-based
CLASSIFICATION AND REGRESSION
TREES (CART)
Breiman, L., Friedman, J., Ohlson, R. & Stone, C. 1984.
Classification and regression trees – Wadsworth
De'Ath, G. 2002 Ecology 83, 1105-1117
De'Ath, G. & Fabricus, K.E. 2000 Ecology 81, 3178-3192
Efron, B. & Tibshirani, R. 1991. Science 253, 390-395
Michaelsen, J. et al. 1994. J. Vegetation Science 5, 673-686
Also known as decision trees
Decision Trees
Like a species identification key. Class labels are assigned
to objects by following a path through a series of simple
rules or questions, the answers to which determine the
next direction through the path.
Decision tree is a supervised learning algorithm which
must be provided with a training set that contains objects
with class labels.
Looks like a cluster analysis dendrogram or partitioning
diagram but these are from unsupervised methods that
take no account of pre-assigned class labels.
Example (three species of Iris)
If petal length < 2.09 cm
If petal width < 1.64 cm
If neither
Fielding 2007
Iris setosa
Iris versicolor
Iris virginica
As axis-parallel splits
CART PROBLEM:
Experiment on cause of duodenal ulcers, one of 56 model nucleophiles
were given to each of 745 rats. Each rat subsequently autopsied to check
for development of duodenal ulcer and outcome scored as 1, 2 or 3
severity.
535 class 1, 90 class 2, 120 class 3 outcomes
Which of 67 characteristics of these compounds was associated with
development of duodenal ulcers?
CART aims to use a set of predictor variables to estimate the means of one
or more response variables. A binary tree is constructed by repeatedly
splitting the data set into subsets. Each individual split is based on a single
predictor variable and is chosen to minimise the variability of the response
variables in each of the resulting subsets.
The tree begins with the full data set and ends with a series of terminal
nodes. Within each terminal node, the means of the response variables are
taken as predictors for future observations.
Closer to ANOVA than regression in that data are divided into a discrete
number of subsets based on categorical predictors and predictions are
determined by subset means.
R
Must define two criteria:
1.
A measure of impurity or inhomogeneity.
2.
Rule for selecting optimum tree.
Produce a very large tree and then prune it into successively smaller trees.
Skill of each tree is determined by cross-validation. Divide the full data
into subsets, drop one subset, grow the tree on the remaining data and
test it on the omitted subset.
Measure of impurity
Univariate response variable error sum of squares, i.e. one-way ANOVA at each
split and selecting the predictor variable which minimises the error sum of
squares in the two descendent nodes.
Categorical responses (classes 1, 2, 3) – assign classes to terminal node using
majority rule, assign the class that is most numerous in the node.
At each node of the tree a question is asked – data points for which the answer
is yes are assigned to the left branch.
May be less desirable to misclassify animal with a severe ulcer. Introduce a
higher penalty to errors for class 3.
CART tree. Classification tree from the CART
analysis of data on duodenal ulcers. At each
node of the tree, a question is asked; data
points for which the answer is “yes” are
assigned to the left branch and other data
points are assigned to the right branch
Misclassification 1
39.6%
class
2
56.7%
3
18.3%
R
CLASSIFICATION AND REGRESSION TREES
– A SUMMARY
Explain variation of single response variable by one or more explanatory or
predictor variables.
Response variable can be quantitative (regression trees) or categorical
(classification trees).
Predictor variables can be categorical and/or quantitative.
Trees constructed by repeated splitting of data, defined by a simple rule based
on single predictor variable.
At each split, data partitioned into two mutually exclusive groups, each of which
is as homogeneous as possible. Splitting procedure is then applied to each group
separately.
Aim is to partition the response into homogeneous groups but to keep the tree as
small and as simple as possible.
Usually create an overlarge tree first, pruned back to desired size by crossvalidation.
Each group typically characterised by either the distribution (categorical
response) or mean value (quantitative response) of the response variable, group
size, and the predictor variables that define it.
SPLITTING PROCEDURES
Way that predictor variables are used to form splits depends on their type.
1.
Categorical variable with two levels (e.g. small, large), only one split is
possible, with each level defining a group.
2.
Categorical variables with more than two levels, any combination of levels
can be used to form a split. With k levels, there are 2k-1 –1 possible splits.
3.
Quantitative predictor variables, a split is defined by values less than and
greater than some chosen value. Only the rank order of quantitative
variables determines a split, and for u unique values there are u-1 possible
splits.
From all possible splits of predictor variables, select the one that maximises the
homogeneity of the two resulting groups. Homogeneity can be defined in many
ways, depending on the type of response variable.
Trees drawn graphically, with root node representing the undivided data at the
top, and the branches and leaves (each leaf representing a final group) beneath.
Can also show summary statistics of nodes and distributional plots.
ECOLOGICAL EXAMPLE
Regression tree
(5 point abundance)
Regression tree analysis of the abundance
of the soft coral species Asterospicularia
laurare rated on a 0-5 scale; only values
0-3 were observed. The explanatory variables were shelf position (inner, mid,
outer), site location (back, flank, front,
channel), and depth (m). Each of the
three splits (nonterminal nodes) is
labelled with the variable and its values
that determine the split. For each of the
four leaves (terminal nodes), the
distribution of the observed values of A.
laurae is shown in a histogram. Each
node is labelled with the mean rating and
number of observations in the group
(italic, in parantheses). A. laurae is least
abundant on inner- and mid-reefs (mean
rating = 0-038) and most abundant on
front outer-reefs at depths  3m (1.49).
The tree explained 49.2% of the total ss,
and the vertical depth of each split is
proportional to the variation explained.
Classification tree
( +/ - )
Classification tree on
the presence-absence
of A. laurae. Each
leaf is labelled
(classified) according
to whether A. laurae
is pre-dominantly
present or absent, the
proportions of
observations in that
class, and the number
of observations in the
group (italic, in
parentheses). The
misclassification rate
of the model was 9%,
compared to 15% for
the null model
(guessing with the
majority, in this case
the 85% of absences).
Splits minimise sum-of-squares within groups in regression tree; splits are based on
proportions of presence and absence in the classification tree.
CART can be used for (i) description and summarisation of data and (ii) prediction
purposes for new data.
Can identify the environmental conditions under which a taxon is particularly
R
abundant (regression tree) or particularly frequent (classification tree).
Regression trees explaining the abundances of
the soft coral taxa Efflatounaria, Sinularia
spp., and Sinularia flexibilis in terms of the
four spatial variables (shelf position, location,
reef type, and depth) and four physical
variables (sediment, visibility, waves, and
slope). At the bottom of the cross-validation
plots (a, d, g), the bar charts show the
relative proportions of trees of each size
selected under the 1-SE rule (grey) and
minimum rules (white) from a series of 50
cross-validations. For Efflatournaria (a), a
five-leaf tree is most likely by either the 1-SE
or the minimum rule. For Sinularia spp. (d),
five- to eight-leaf trees have support, and for
S. flexibilis (g), five- to nine-leaf trees have
support. Cross-validation plots (a, d, g),
representative of the modal choice for each
taxa according to the 1-SE rule, are also
shown. For all three taxa, a five-leaf tree was
selected (c, f, i). The shaded ellipses enclose
nodes pruned from the full trees (b, e, h),
each of which accounted for > 99% of the total
ss.
COMPARISON OF CART AND GLMs
ANOVA is powerful technique but as number of predictor variables and
complexity of data increase (interactions, unbalanced designs, empty
cells), ANOVA and GLMs become less effective.
CARTs are simpler and less sensitive to unbalanced designs and zerovalues. Splits represent an optimum set of one-degree-of-freedom
contrasts.
Simple, easy to interpret, and graphical.
These CART advantages increase as number of predictor variables and
complexity increase.
'Data mining' tool.
MULTIVARIATE REGRESSION TREES
De'Ath, G. 2002 Ecology 83, 1105-1117
Natural extension of univariate regression trees. Considers multivariate
response, not single response.
Replace univariate response by multivariate assemblage response and redefine
the impurity of a node by summing the univariate impurity measure over the
multivariate response.
Extend univariate sum-of-squares impurity criterion to multivariate sum-ofsquares about the multivariate mean. Sum of squared Euclidean distances (SSD)
of samples about the node centroid.
Each split minimises the SSD of samples from the centroids of the nodes to which
they belong. Maximises the SSD between node centroids (cf. k-means clustering).
This minimises SSD between all pairs of samples within nodes and maximises SSD
between all pairs of samples in different nodes.
Each tree leaf can be characterised by multivariate mean of its samples, number
of samples at the leaf, and the predictor values that define it.
Forms clusters of sites by repeated splitting of data, each split defined by simple
rule based on environmental values. Splits chosen to minimise the dissimilarity of
sites within node.
R
MULTIVARIATE REGRESSION TREES (cont)
MRT is a form of constrained clustering, with constraints set by the
predictor variables and their values
MRT can be extended to dissimilarity measures other than squared
Euclidean distance (distance-based MRT)
Hunting spider data (12 species, 28 samples, six environmental variables)
Four-leaf tree split just on water content and abundances of fallen twigs at
sample sites explains 78.8% of the species variance
Tree size selected by cross-validation. Four-leaf tree has lowest estimated
prediction error
Can identify indicator species using Dufrêne & Legendre (1967) INDVAL
approach
Tabulate explained variance at each split for each species
MULTIVARIATE REGRESSION TREES (cont)
Useful for providing view of species–environment relationship by:
1.
Displaying the tree
2.
Tabulating variation at the tree splits
3.
Identifying indicator species to characterise groups (INDVAL)
4.
Displaying group means, species, and samples
5.
Comparing tree groupings with clusters from non-constrained
hierarchical and non-hierarchical cluster analyses
MULTIVARIATE REGRESSION TREES (cont)
Advantages
1. Absence of model assumptions (e.g. response models), resulting
in greater robustness
2. Invariance to monotonic transformations of predictor variables
3. Prediction of species abundances from environmental variables
4. Emphasises local structure and interactions whereas constrained
ordinations consider global structure
5. Outperforms or matches redundancy analysis and canonical
correspondence analysis in explaining and predicting species
composition
6. MRT is one tree, need m univariate regression trees for m species
7. More regression-based approach than simple discriminants and
TWINSPAN
OTHER NEWER TYPES OF CLASSIFICATION
AND REGRESSION TREE TECHNIQUES
Rapidly growing area of activity in data-mining – analysis of
large heterogeneous data
New approaches
1. Bagging Trees
2. Random Forests
3. Multivariate adaptive regression splines (MARS)
Brief introduction to each, so that you are aware of their
existence and what their strengths and limitations are.
BAGGING TREES
Part of the output error in a simple regression tree (RT) can be
due to the specific choice of the data set.
If create data sets by resampling with replacement (i.e.
bootstrapping) and grow regression trees without pruning or
averaging, the variance of the output error is reduced.
In bootstrapping. on average 37% of the data is excluded. The
included data are replicated so that the sample is full size.
Portion of data in sample is ‘in-bag’ data, the rest is ‘out-of-bag’
data.
Out-of-bag data used not to build or prune tree but to provide
better estimates of node error.
Requires 30-100 trees. Difficult or impossible to examine them
all. Usually find consistent results, so one RT is adequate. Often
averaged.
R
In addition to bagging, there is boosting or boosted
trees (De’ath 2007 Ecology 88: 243-251)
In boosted trees, bias is reduced by repeatedly readjusting weights of the training samples. Used
primarily for classifying data with large sample sizes
rather than for regression.
R
RANDOM FORESTS (RF)
Designed to produce accurate predictions that do not overfit the
data. Similar to BT in that bootstrap samples are drawn to
construct multiple trees.
Difference from BT is that each tree is grown with a randomised
set of predictors, hence name ‘random’ forests.
Large number of trees (500-2000) are grown, hence a ‘forest’ of
trees.
Number of predictors used to find the best split at each node is a
randomly chosen subset of the total number of predictors.
As with BT, trees are grown to maximum size without pruning.
Aggregation is by averaging trees.
Out-of-bag samples can be used to derive an unbiased error rate
and variable importance, eliminating the need for an
independent test-set.
As many trees are grown, there is limited generalisation
error (true error as opposed to training error only). Thus
no overfitting is possible, hence good for prediction.
By growing each tree to maximise size without pruning
and selecting only the best split between a random subset at each node, RF tries to maintain some predictive
ability while inducing diversity among trees.
Random prediction selection diminishes correlation
between unpruned trees and keeps bias low. Using an
ensemble of unpruned trees, variance is also reduced.
Another advantage is that predicted output depends
only on one user-selected parameter, the number of
predictors to be chosen at each node.
RF seem more of a ‘black box’ than BT because
cannot see individual trees. RF give general metrics to
aid interpretation, especially the importance of
predictor variables in prediction. Can evaluate how
much worse the prediction would be if that predictor
were permuted randomly.
In contrast to artificial neural networks that are very
much a very ‘black box’, RF are perhaps a ‘grey box’.
R
MULTIVARIATE ADAPTIVE REGRESSION
SPLINES (MARS)
Builds flexible regression models by using basic functions to
fit separate splines to distinct intervals of the predictor
variables.
The variables to use and the end-points of the intervals are
found by an exhaustive search procedure using basic
functions. Differs from classical splines where the knots are
pre-determined and evenly spaced.
Basic functions are similar to principal components and
express the relationship of the predictors to the response
variable.
MARS finds the locations and number of required knots in a
forward/backward stepwise fashion.
Model is overfitted by generating more knots than
needed, and the knots that contribute least to the
overall fit are removed.
MARS have advantage over RT in the RT’s
discontinuous branching at tree nodes is replaced
with continuous smooth functions that are guided by
the nature of the data.
MARS better at detecting global and linear data structure
as output is smoother and not as coarse-grained and
discontinuous as in RT.
MARS limitations are:
1. basic functions may be excessively guided by the local
nature of the data, resulting in inappropriate results
2. selecting the correct values for the parameters can be
cumbersome and may need multiple trial-and-error
steps
3. does not lend itself well to modelling speciesenvironment relationships
R
Comparison of regression-tree approaches
1. Method
RT
Recusively partitions data based on a single, best
predictor to form a binary tree. Creates a series of
decision rules based on the predictor variables.
BT
Creates multiple boot-strapped regression trees
without pruning and averages the outputs.
RF
Similar to BT except that each tree is grown with a
randomised subset of predictors. Typically 500-2000
trees are grown and results aggregated by averaging.
MARS Builds localised regression models by fitting separate
splines using basic functions to distinct intervals of
predictor variables.
2. Strengths
RT
Better than conventional linear techniques in allowing
for interactions and non-linearities when there are
many predictors. Easy to interpret and can map
predictors with greatest influence.
BT
Very effective in reducing variance and error in highdimensional data. Data not used (out-of-bag data)
used to provide reliable error estimates
RF
Growing large numbers of trees does not overfit the
data, and random predictor selection keeps bias low.
Provides good (?best) models for prediction.
MARS Because splitting rules are replaced by continuous
smooth functions, MARS better at detecting global
and linear data structures. Output is smoother and
less coarse-grained.
3. Limitations
RT
Linear function highly approximated and output tree
can be highly variant to small data perturbations.
BT
Because large numbers of trees (30-50) are averaged,
interpretation of results not easy. Bias component of
the error is marginally better than single RT
RF
At least a ‘grey box’ or a pale black box compared to
BT. Can be very demanding in computing resources
and time.
MARS Tends to be excessively guided by the local nature of
the data, making predictions with new data unstable.
Selecting values for input parameters can be
cumbersome.
MAJOR FEATURES OF CLASSIFICATION AND
REGRESSION TREES OF ECOLOGICAL DATA
1. Ability to use different types of response variables
(continuous, categorical, +/-)
2. Capacity for interactive exploration, description, and
prediction
3. Invariance to monotonic transformations of predictor
variables
4. Easy graphical interpretation of complex results
involving interactions
5. Model selection by cross-validation
6. Good procedures for handling missing values in both
the response and the predictor variables
CLASSIFICATION (= DISCRIMINANT ANALYSIS)
AND CLASSIFICATION AND REGRESSION TREES
Recursive partition of data on the basis of set of predictor variables
(in discriminant analysis a priori groups or classes, 1/0 variables).
Find the best combination of one variable and its split threshold value
that separates the entire sample into two groups that are internally
homogeneous as possible with respect to species composition.
Lindbladh et al. 2002. American Journal of Botany 89: 1459-1476
Picea pollen in eastern North America.
Three species
P. rubens
P. mariana
P. glauca
R
R
Lindbladh et al. (2002)
Lindbladh et al. (2002)
Cross-validation of classification tree
(419 grains in training set, 103 grains in test set)
Binary trees - Picea glauca
vs rest
Picea mariana
vs rest
Picea rubens
vs rest
In identification can have several outcomes
e.g. not identifiable at all
unequivocally P. rubens
P. rubens or P. mariana, etc.
Can now see which grains can be equivocally identified in
test set, how many are unidentifiable, etc. Assessment of
inability to be identified correctly.
Test set (%)
P. glauca P. mariana
Correct (100, 010, 001)
Equivocal (101, 110, 011, 111)
Unidentifiable (000)
P. rubens
79.3
70.0
75.9
0.0
2.7
2.5
20.7
27.3
21.6
Unidentifiable about the same for each species, worst in P.
mariana.
Applications
to fossil data
EXAMPLES OF MODERN REGRESSION ANALYSIS
Relationship of the frequency of Fagus sylvatica to altitude and
annual precipitation
Leps & Smilauer (2000)
LINEAR
SECOND
ORDER
POLYNOMIAL
LINEAR LEAST
SQUARES
REGRESSION
Two linear models describing,
separately, the dependency of
Fagus frequency upon altitude
and annual precipitation
Negative
predictions
SECOND
ORDER
POLYNOMIAL
GLM POISSON
LOG LINK
FUNCTION
GLM
SECOND ORDER
POLYNOMIAL
Shape of two generalized linear
models describing, separately,
the dependency of Fagus
frequency upon altitude and
annual precipitation
Leps & Smilauer 2000
3df
1df
Poisson
log link
function
spline
smoother
GAM
Two generalized additive
models fitting, separately,
the dependence of the
Fagus frequency on altitude
and the annual precipitation
amount
ALTITUDE + PRECIPITATION
4df
Confidence
intervals
3df
Generalized additive
model of dependence of
Fagus frequency on both
altitude and the annual
precipitation. The linear
marks at the bottom of
the two plots indicate
position of individual
observations
Leps & Smilauer 2000
Comparison of three response surfaces modelling the
frequency of Fagus using altitude and annual
precipitation as predictors and using (from left to right)
GLM, GAM, and loess smoother.
Leps & Smilauer 2000
Regression tree
Altitude, precipitation, degree days
Fagus
frequency
The final regression tree
Leps & Smilauer 2000
Correlation between rate of sea-level change and frequency of explosive
volcanism in the Mediterranean.
V.J. McGuire, R.J. Howarth, C.R. Firth, A.R. Solow, A.D. Pullens, S.J. Saunders, I.S. Stuart,
J.C Vita-Finzi (1997) Nature 389: 473-476
Location map of principal volcanic centres and
provinces active in the Mediterranean region during
the late Quaternary, and the distribution of
boreholes from which deep-sea cores were
extracted. The Roman Province includes the Vulsini,
Vico, Sebatini and Albani centres; the Campanian
Province includes Campi Flegrei, Somma Vesuvius
and Ischia.
Cumulative plot of ordered event times
(representing the tephra-layer occurrence) versus
time. The dashed line corresponds to a median
repose period of 1.05kyr. Three anomalous
episodes of increased tephra-layer emplacement
between 8 and 15, 34 and 38, and 55 and 61 kyr
BP are also shown, having median repose periods
(time to next tephra-producing event) of 0.35,
0.45 and 0.80 kyr respectively.
95%
95%
Changes in mean sea level. a. Estimated
change in mean sea level (MSL) as a function
of age (kyr) based on data from Barbados and
Pacific cores. A smooth curve has been fitted
to the Barbados data and region of overlap of
the two data sets using the non-parametric
locally weighted regression smoother LOWESS
technique. The sparse data of Shackleton for
the period 80 kyr ago have been fitted with a
smooth cubic spline curve. Ages of dated
tephra layers in deep-sea cores are shown by
crosses. b. Rate of change of MSL with time,
based on 0.25-kyr intervals. Ages of dated
tephra layers in deep sea cores are shown by
crosses.
Variation of repose times as a function of rate of change of mean sea
level with time. These data are based on a bin width of 1.5d(MSL)/dt,
and are summarized by box plots. Box width is proportional to the
number of values in each bin; the base, horizontal dividing line, and
top of each box show the 25th, 50th, (median) and 75th percentiles. In
a few cases the median coincides with the base or top of the box;
whiskers extend out to the most extreme values lying with 1.5 of the
interquartile range beyond the ends of the box. Isolated data points
are shown individually. The bold, solid curve is a weighted LOWESSsmoothed fit to the medians and indicates a clear decrease in repose
period with rate of change of MSL, either upwards, (positive) or
downwards (negative). We note that the maximum repose period is
offset from the zero point on the rate of change axis, implying a time
lag in the response of the volcanic systems to a given rate of change
in the sea-level record. the dashed lines show the median line (line
labelled ‘1’) and empirical 95% (‘2’) and 99% (‘3’) confidence
envelopes for the binning and LOWESS curve-fitting process applied to
1,000 sets of 81 repose times drawn randomly from the empirical
cumulative distribution of the observed repose times. No systematic
variation of repose period with rate of change of MSL is apparent in
the simulated data.
ARTIFICIAL NEURAL NETWORKS
Branch of artificial intelligence - ability to “learn”.
Attempt to emulate the human brain with 1.5 x 1010
neurons each with 10 to 104 connections or synapses.
Learn some target values or vectors from a set of
associated input signals through a set of iterative
adjustments of set of parameters. Minimise error
between network and desired output following some
learning rule. Mimic biological neuron.
Regression, calibration, discriminant analysis
(= classification)
NEURAL NETWORKS – BASIC REFERENCES
Abdi, H. et al. 1999 Neural Networks. Sage Publications
Eberhart, R.C. & Dobbins, R.W. (eds.) 1990 Neural Network PC
Tools. Academic Press
Faraway, J.J. 2006 Extending the Linear Model with R. Chapman
& Hall/CRC (chapter 14)
Lek, S. & Guégan, J.P. 1999 Ecological Modelling 120, 65-73
Lek, S. & Guégan, J.P. 2000 Artificial Neuronal Networks.
Application to Ecology and Evolution. Springer
(a) A diagram showing the general
architecture of a three-layer back
propagation network with five neurons in
the input layer, three neurons in the hidden
layer, and two neurons in the output layer.
Each neuron in the hidden and output layers
receives weighted signals from each neuron
in the previous layer. (b) A diagram showing
a single neuron in a back propagation
network. In forward propagation, the
incoming signals from the neurons of the
previous layer (p) are multiplied with the
weights of the connections (w) and
summed. The bias (b) is then added, and
the resulting sum is filtered through the
transfer function to produce the activity (a)
of the neuron. This is sent on to the next
layer or, in the case of the last layer,
represents the output. (c) A linear transfer
function (left) and a sigmoidal transfer
function (right)
All neurons are associated with:
TRANSFER FUNCTION
Summed signals, filters and sends them on
BIAS TERM
Measure of ‘importance’ like regression coefficients
Weight terms contain knowledge of memory of network.
Training involves incremental adjustment of weights to find an optimal mapping
between input and output vectors. Need training set with corresponding inputs and
outputs.
Optimisation ‘template comparison’ in which differences between actual output and
desired output are used as optimisation criterion.
Training -1
-2
forward propagation
back propagation
repeated until the differences between target output and computed output reach a
preset threshold.
FORWARD PROPAGATION
Each input vector is propagated through network while being modified and filtered by the
weights of the connections and by the transfer functions of the neurons. All incoming
signals are multiplied with connective weights and summed and filtered through transfer
function. Resulting activity of neuron is then used as input to next layer. Done once when
running an already trained network.
BACK PROPAGATION
Difference or ‘error’ between output vector resulting from forward propagation step and
desired target vector is computed. Used to incrementally adjust the weights between
output layer and the last of the hidden layers according to a learning algorithm based on a
gradient-descent method.
For each layer, going backwards through the network, the values used for adjusting the
weights are the error terms in the immediately succeeding layer. Size of incremental
adjustments determined by learning rule set to 0–1.
Too high a learning rate may result in a network that may never converge. Too low a rate
may result in excessively slow learning.
Various ways of doing back propagation:
Nguyen - Widrow initialisation
Momentum
Adaptive learning
Success of NN model best determined by some cross-validation.
Predictor or training set and independent test set leave-one-out (jack-knifing).
Statistics
MSE
RMSE
Estimated
r2
MSEP
RMSEP
r2 cv
Predicted cross-validation
Description of the data used
The data used in this study come from the US National Eutrophication Survey (NES) as published by Omernik (1977).
They consist of 927 tributary sites that drained watersheds not affected by point-source pollution. For each
tributary site, the NES collected parameters for each subdrainage area: area, land use percentage (7 categories),
geology, slope, pH, precipitation, flow and animal density. Moreover, mean nutrient concentrations of total phosphorus, ortho-phosphate and nitrogen were measured in the corresponding tributaries and the export coefficients
were calculated. These data were discussed by Omernik (op. Cit.). In the present study, we consider as independent variables: the percentage of the subwatershed areas under forest (FOR), agriculture (AGR), other categories
(OTH) (defined as the difference between total watershed area and forest plus agriculture area), animal density
(ANI), average annual precipitation (PRE) and flow (FLO). Concentration of total phosphorus (CTP), concentration
of ortho-phosphorus (COP), export of total phosphorus (ETP) and export of ortho-phosphorus (EOP) were used as
dependent variables.
Independent variables presented large ranges corresponding to the large geographical variations in climate, soil
characteristics and land use within the US territory (see table below). Dependent variables also presented large
ranges with extremely high values. We can only hypothesize the existence of some local particularities, or the
hidden effects of point source pollution not considered in the original data.
Statistical parameters of the variables studied
(Q1, Q3: first and third quartile; CV: coefficient of variation)
FOR AGR
OTH
PRE
FLO
ANI
CTP
COP
3
-1
-1
-1
(%)
(%)
(%)
(cm)
(m s ) (animal (mg.l ) (mg.l )
ETP
-1
EOP
-1
(mg.l )
(mg.l )
-2
Minimum
Maximum
Range
Mean
Median
Standard
deviation
CV (%)
Q1
Q3
0
100
100
47.11
48.3
0
100
100
35.37
26.9
0
100
100
17.52
7.7
13
263
250
105.86
102
0.001
14.65
14.65
0.51
0.25
km )
0
233.5
233.5
22.28
16.1
32.35
68.67
16.1
76.2
32.61
92.2
3.6
62.5
22.84
130.35
2.7
23.1
41.57
39.27
81
27
0.99
195.35
0.09
0.55
25.06
112.46
1.4
34.1
PREDICTORS
0.005
1.12
1.12
0.07
0.04
0.005
0.58
0.57
0.03
0.01
0
321.7
321.7
17.81
11.7
0
117.7
117.7
7.65
4.7
0.09
126.05
0.02
0.09
0.05
157.28
0.008
0.03
22.33
125.38
6.4
22.3
10.62
138.77
2.8
8.7
RESPONSES
Representation of the structure of the neural network
used. F1: input layer neurons, F2: hidden layer
neurons, F3: output layer neurons. FOR: % forest;
AGR: % agricultural zone; OTH: % other than forest
and agriculture; PRE: precipitation; FLO: flow; ANI:
animal density; Ŷ: estimated dependent variable.
The mean and confidence interval
of correlation coefficients as a
function of the number of hiddennodes. The number of iterations is
500. For every network structure
the mean prediction performance
and the confidence interval are
calculated by the five different
runs.
Mean Square of Errors (MSE) as a function of
the number of training iterations. The number
of hidden units is 5.
Lek et al. (1996)
Correlation coefficients (R) for the test of five test sets concerning
four dependent variables
Dependent variables N° test
1
2
CTP
3
total P
4
5
COP
ortho-P
ETP
export total P
EOP
export ortho-P
R Mean Standard deviation
0.73
0.751
0.751 0.745
0.011
0.756
0.739
1
2
3
4
5
0.719
0.746
0.751 0.75
0.792
0.744
0.026
1
2
3
4
5
0.739
0.79
0.789 0.756
0.767
0.745
0.024
1
2
3
4
5
0.739
0.79
0.789 0.766
0.767
0.745
0.024
tests
PREDICTION OF BROWN TROUT SPAWNING SITES
Habitat variables to study brown trout reproduction
(from Delacoste et al., 1993)
Variable
Type
Characteristics
Wi
i
Wetted width (m2 )
ASSG
SV
GRA
FWi
i
i
i
i
Area with suitable spawning gravel for trout
per linear meter of river (m2 /linear meter)
Surface velocity (m/s)
Water gradient (%)
Flow / width m2 / s/m)
D
i
Mean depth (m)
SDD
i
Standard deviation of depth (m)
BV
i
Bottom velocity (m/s)
SDBV
i
Standard deviation of bottom velocity (m/s)
VD
i
Mean speed / mean depth (m/s/m)
R/M
d
Density of trout redds per linear metre of
stream-bed (redds/m)
i: independent, d: dependent; these independent variables are non-correlated
except SV and BV, R = 0.76.
Structure of the neural network used in this study. F1: input layer of neurons
comprising as many neurons as variables at the entry of the system; F2: hidden
layer of neurons whose number is determined empirically; F3: output layer of
neurons with a single neuron corresponding to a single dependent variable.
Lek et al. (1996)
PREDICTION OF BROWN TROUT SPAWNING SITES (R/M)
(R/M = density of trout redds per metre of stream-bed)
205 sites
No transfor Transform
Full model R2
Stepwise linear regression 4 vars
0.44
0.62
Multiple linear regression 10 vars
0.47
0.63
Neural network
Neural network
0.81
0.93
0.74
0.96
4 vars
10 vars
Neural network modelling: variation of the
correlation coefficient between observed
and estimated values according to the
number of neurons of the hidden layer
(average value and standard deviation).
Neural network modelling: variation of
the sum squared of errors and the
correlation coefficient between observed
and estimated values according to the
number of iterations.
Correlation graphs between observed
and estimated values of R/M by
different models: (a) multiple
regression(MR) with transformed
variables; (b) multiple regression
(MR) with non-transformed variables;
(c) neural network with four
independent variables (NN4) with
transformed variables; (d) neural
network with four independent
variables (NN4) with nontransformed variables; (e) neural
network with all the independent
variables (NN10) with transformed
variables; (f) neural network with all
the independent variables (NN10)
with non-transformed variables.
Lek et al. (1996)
Relationship between residuals and the estimated and
observed values of R/M for transformed variable models: a, b:
MR: c, d: NN4; e, f: NN10
Lek et al. (1996)
Cross
Validation
Results of the NN and MR models on random training set
fractions (3/4) of the records) and test set fractions (the
remaining 1/4 records)
No. test NN
MR
R_learn R_test R_learn R_test
1
0.892 0.862 0.685
0.487
2
0.914 0.888 0.685
0.628
3
0.904 0.906
0.69
0.626
4
0.883 0.867 0.688
0.566
5
0.905 0.906 0.669
0.74
Mean
0.9
0.886 0.684
0.609
R_learn: correlation coefficient between estimated and observed
values of the training sets; R_test: correlation coefficient of the
test sets.
0.81
0.79
0.468
0.371
R2
Contribution profile of each independent variable to the prediction
of R / M by NN (five variables are only represented here)
Lek et al. (1996)
CLASSIFICATION (= DISCRIMINANT ANALYSIS)
AND ARTIFICIAL NEURAL NETWORKS
Artificial neural networks
Input vectors
Output vectors
>1 Predictor
1 or more Responses
Regression
>1 Variable
2 or more Classes
(or 1/0 Responses)
Discriminant
analysis
DISCRIMINANT ANALYSIS BY NEURAL
NETWORKS
Malmgren & Nordlund (1996) Paleoceanography 11, 503–512
Four distinct volcanic ash zones in late Quaternary sediments of Norwegian Sea.
Zone A B C D
Basaltic and Rhyolithic types
8 classes x 9 variables (Na2O, MgO, Al2O3, SiO, K2O, CaO, TiO2, MnO, FeO)
183 samples
R
(A). Diagram showing the general
architecture of a 3-layer back
propagation network with five
elements in the input layer, three
neurons in the hidden layer, and two
neurons in the output layer. Each
neuron in the hidden and output
layers receives weighted signals
from the neurons in the previous
layer. (B) Diagram showing the
elements of a single neuron in a
back propagation network. In
forward propagation, the incoming
signals from the neurons of the
previous layer (p) are multiplied
with the weights of the connections
(w) and summed. The bias (b) is
then added, and the resulting sum is
filtered through the transfer
function to produce the activity (a)
of the neuron. This is sent on to the
next layer or, in the case of the last
layer, represents the output. (C) A
linear transfer function (left) and a
sigmoidal transfer function (right).
4 zones
ABCD
2 types
Rhyolite
Basalt
Configuration of grains referable to the 4 late Quaternary volcanic ash zones, A through D, in the Norwegian
sea described by Sjøholm et al [1991] along first and second canonical variate axes. The canonical variate
analysis is based on the geochemical composition of the individual ash particles (nine chemical elements were
analyzed: Na2O, MgO, Al2O3, SiO2, K2O, CaO, TiO2, MnO, and FeO). Two types of grains, basaltic and rhyolithic,
were distinguished within each zone. This plane, accounting for 98% of the variability among group mean
vectors in nine-dimensional space (the first axis represents 95%), distinguishes basaltic and rhyolithic grains.
Apart from basaltic grains from zone C, which may be differentiated from such grains from other zones, grains
of the same type are clearly overlapping with regard to the geochemical composition among the zones.
Malmgren & Nordlund (1996)
Changes in error rate (percentages of misclassifications in the test set) for a three-layer
back propagation network with increasing number of neurons when applied to training-test
set 1 (80:20% training test partition). Error rates were determined for an incremental series
of 3, 6, 9, …., 33 neurons in the hidden layer. Error rates were computed as average rates
based on ten independent trials with different initial random weights and biases. The error
rates represent the minimum error obtained for runs of 300, 600, 900, and up to 9000
epochs. The minimum error rate (9.2%) was obtained for a configuration with 24 neurons in
the hidden layer, although there is a major reduction already at nine neurons.
Malmgren & Nordlund (1996)
Changes in error rate (percentages of misclassifications) in the training set with increasing
number of epochs in the first out of ten trials in training set 1. This network had 24 neurons
in the hidden layer, and the network error was monitored over 30 subsequent intervals of
300 training epochs each. During training, the error rate in the training set decreased from
18.5% after 300 epochs to a minimum of 2.1% after 7500 epochs. The minimum error rate in
the test set (10.8%) was reached after 3300 epochs.
Malmgren & Nordlund (1996)
CRITERION OF NEURAL NETWORK SUCCESS
OTHER TECHNIQUES USED
ERROR RATE of predictions in independent test set
that is not part of the training set.
Linear discriminant analysis (LDA)
Cross-validation 5 random test sets
Training set
37 particles
146 particles
Error rate of misclassification (%) for each test set
Average rate of misclassification (%) for five test
sets
k-nearest neighbour
technique) (=KNN)
(= modern analog
Soft independent modelling of close
analogy (SIMCA)
(close to PLS with classes)
NETWORK CONFIGURATION & NUMBER OF
TRAINING CYCLES
CONCLUSIONS
24 neurons
Average error rate NN network 9.2%
Training set
– minimum in error rate
7500 cycles
Test set
– minimum in error rate
(10.8%) 3300 cycles
i.e. 33.6 out of 37 particles correctly
classified
LDA 38.4%
K-NN 30.8%
SIMCA 28.7%
Error rates (percentages of misclassifications in the test sets) for each of the five independent training-test
set partitions (80% training set and 20% test set members) and average error rates over the five partitions
for a three-layer back propagation (BP) neural network, linear network, linear discriminant analysis, the knearest neighbours technique (k-NN) and SIMCA. Neural network results are based on ten independent trials
with different initial conditions. Error rates for each test set are represented by the average of the
minimum error rates obtained during each of the ten trials, and the fivefold average error rates are the
averages of the minimum error rates for the various partitions.
Error rates in each of five training-test set partitions, fivefold average error rates
in the test sets, and 95% confidence intervals for the fivefold average error rates
for the techniques discussed in this paper.
Neural N
The fivefold average error rates were determined as the average error rates over five
independent training and test sets using 80% training and 20% test partitions. Error rates for
the neural networks are averages of ten trials for each training-test set partition using
different initial conditions ((random initial weights and biases). The minimum fivefold error
rate for the back propagation (BP) network was obtained using 24 neurons in the hidden
layer. Apart from regular error rates for soft independent modelling of class analogy (SIMCA
1), the total error rates for misclassified observations that could be referable to one or
several other groups are reported under SIMCA 2. LDA represents linear discriminant analysis
and k-NN, k-nearest neighbour.
Average error rates (percentages) for basaltic and
rhyolithic particles in ash zones A through D
As before, error average error rates over five experiments based on
80% training set members and 20% test set members. N is the range of
sample sizes in these experiments.
As in the use of ANN in regression, problems of overfitting and over-training and reliable model testing
occur. n-fold cross-validation needed with an
independent test set (10% of observations), an
optimisation data set (10%), and a training or learning
set (80%). Repeated n-times (usually 10).
ANN a computationally slow way of implementing twoor many-group discriminant analysis. No obvious
advantages.
Allows use of 'mixed' data about groups (e.g. continuous,
ordinal, qualitative, presence/absence). But can use
mixed data in canonical analysis of principal coordinates if use the Gower coefficient for mixed data.
NEURAL NETWORK APPLICATIONS
Malmgren & Nordlund, 1996. Palaeoceanography 11, 305-512 (volcanic ash
discriminant analysis)
Malmgren & Nordlund, 1997. Palaeo, Paleao, Palaeo 136, 359-373 (surface
temperature reconstructions)
Lek et al., 1996. Ecological Modelling 90, 39-52 (trout & habitat regression)
Lek et al., 1996. Acta Oecologia 17, 43-53 (phosphorus & land-use)
Mastrorillo et al., 1997. Freshwater Biology 38, 237-246 (fish +/- and habitats)
Borggaard & Theoberg, 1992. Annl. Chem. 64, 545-551 (near infra-red spectra)
Whitehead et al., 1997. Hydrobiologia 349, 47-57 (blue-green algae)
Poff, Tokar & Johnson, 1996. Limnology & Oceanography 41, 857-863 (stream
hydrology)
Guegan et al., 1998. Nature 391, 382-384 (fish diversity)
Racca et al., 2001. J. Paleolimnology 26, 411-422 (diatoms & pH)
Malmgren et al., 2001. Paleoceanography 16, 520-530 (forams & sea temperatures)
Peyron & de Vernal, 2001. J. Quaternary Science 16, 699-709 (dinoflagellates & sea
temperatures)
REGRESSION MODELS AS PREDICTIVE TOOLS
Olden & Jackson (2002) Freshwater Biology 47, 1976-1995
Compared logistic regression analysis, linear discriminant analysis,
classification trees, and artificial neural networks to predict:
(1) presence/absence of 27 fish species as a function of 13 habitat features
in 286 temperate lakes
(2) Monte Carlo simulated presence/absence data with a range of
deterministic, linear, and non-linear species responses (30 samples x
500 times)
(Regression models mainly concerned with descriptive and explanatory
roles.)
Criteria of prediction performance:
(i)
Percentage of lakes where presence or absence of species correctly
classified
(ii) Ability to predict species presence (sensitivity) correctly
(iii) Ability to predict species absence (specificity) correctly
RESULTS
(i) Real data
(ii) Simulated data
(iii) On average, neural networks outperformed the other methods
but for species presence/absence all methods showed
moderate to excellent success.
(Correct classification 80-85%, specificity 70-75%, sensitivity
35-75%)
Neural networks consistently had best performance.
(iv) Simulated non-linear data – neural networks (98% correct) and
classification trees (89% correct) greatly outperformed other
methods.
(v) Simulated linear data. All methods good (92-100% correct).
Classification trees and neural networks have the advantage
that they can model both linear and non-linear responses;
linear discriminant analysis poor with non-linear data; logistic
regression surprisingly poor with non-linear data.
A Warning About Artificial Neural Network Software
Telford et al. 2004 Palaeoceanography 19, PA4014, doi:
10.1029/2004PA001072
ANN are algorithms that by mimicking biological neural
networks have the ability to learn by example. Learn by
iteratively adjusting a large set of parameters (originally
set at random values) to minimise the error between the
predicted output and actual input.
If trained for too long, ANNs can over-fit the data by
learning particular features of the data rather than
learning the general rules.
Need to have
(1) modelling data-set
(2) independent optimisation data-set
and, when training and optimisation are done,
(3) independent test-set
Not all software makes the distinction between (2) and (3),
and some use (2) as a test-set.
When a truly independent test-set is used, ANN does not
out-perform more 'classical' methods.
Not always clear from published studies what was done.
Be cautious when reading about the fantastic performance
of ANN
ANN CONCLUSIONS
ANN are, if used carefully, a flexible class of non-linear regression
models. By adding more hidden layers, can control complexity of model
from relatively simple models to models with complex structure.
Seem attractive because they require less expertise to use compared to
GLM, etc. BUT users must pay attention to basic statistical issues of
transformations, scalings, outliers, influential points, and minimal
adequate models.
May be good for prediction but bad for understanding. The ANN weights
are almost un-interpretable. ANN usually introduce complex
interactions that often do not reflect reality.
Easy to over-fit, giving over-optimistic predictions.
No statistical theory for inference, diagnostics, or model selection.
ANN are, at best, a tool; not a rigorous method with underlying theory.
SOFTWARE FOR ADVANCED OR MODERN
REGRESSION ANALYSIS
Generalised Linear Models
SYSTAT
GLIM
GENSTAT
S-PLUS
R
Locally Weighted Regression &
Splines
SYSTAT
GENSTAT
S-PLUS
R
Generalised Additive Models
GAIM
GENSTAT
S-PLUS
R
Classification & Regression Trees
SYSTAT
S-PLUS
CART
R
Neural Networks
MATLAB Neural Network
Toolbox
S-PLUS (Functions)
NGO Neuro-Genetic Optimiser
R (Libraries)