ORDINATION TECHNIQUES IN ENVIRONMENTAL BIOLOGY
Download
Report
Transcript ORDINATION TECHNIQUES IN ENVIRONMENTAL BIOLOGY
ORDINATION TECHNIQUES IN
ENVIRONMENTAL BIOLOGY PROGRESS, PROBLEMS, AND PITFALLS
H.J.B. Birks
University of Bergen
and
University College London
CONTENTS
INTRODUCTION
Definitions
Data
Types of Ordination
Historical Perspective
PROGRESS
Introduction
Underlying Response Models
Indirect Gradient Analysis
Direct Gradient Analysis
PROBLEMS
PITFALLS
CONCLUSIONS
INTRODUCTION
Definitions
Ordination - "process of reducing the dimensionality (i.e., the number
of variables) of multivariate data by deriving a small number of
new variables ('latent variables', 'composite variables', ordination
axes) that contain much of the information in the original data.
- the reduced data set is often most useful for investigating
possible structure in the observations."
B.S. Everitt (1998)
- "the arrangement of samples or sites along gradients on the basis
of their species composition or environmental attributes.
Ordination is the mathematical expression of the continuum
concept in ecology. Gradient analysis is often treated as a
synonym by plant ecologists."
M.O. Hill (1998)
End result is a low-dimensional (usually 2 dimensions) plot in which
sites are represented by points in two-dimensional space in such
a way that points close together in the plot correspond to sites
that are similar in species composition, and points that are far
apart correspond to sites that are dissimilar in species
composition. Plot is a graphical summary of the data.
Ordination multidimensional scaling, component analysis, latentstructure analysis
Environmental biology - encompasses ecology, environmental
monitoring, and palaeoecology. Basically, differences in time
scale and temporal resolution only.
Data
Data for ordination typically consist of a matrix of values specifying
the abundance or presence of species in sites. Environmental data for
the same sites sometimes also available.
A convenient way to envisage the structure of the data is of two
matrices Y and X stacked one beside the other
C = [Y|X] = [yij|xik] (i = 1, ......, n; j = 1, ......, m;
k = 1, ......, q)
Rows of the matrix represent sites. The first block of m columns
represents species, yij is the abundance of species j in site i. The
second block of q columns represents environmental variables; xik is
the value of the kth environmental variable in site i.
Given data on species (Y) and environment (X), there are two major
approaches to ordination:
Indirect gradient analysis - analyse Y only, and then involve X
Direct gradient analysis - analyse Y and X simultaneously
Types of Ordination
Historical Perspective
1901
- Pearson develops PCA as a regression technique.
1927
- Spearman applies factor analysis to psychology.
1930
- Ramensky uses an informal ordination technique and introduces the term
'ordnung' into ecology.
1954
- D.W. Goodall introduces PCA into ecology and proposes the term 'ordination'.
1970
- R.H. Whittaker develops theoretical foundations of gradient analysis,
especially unimodal species responses and turnover along environmental
gradients.
1971
- K.R. Gabriel develops biplot graphical display.
1973
- M.O. Hill re-invents correspondence analysis and introduces CA (as 'reciprocal
averaging') into ecology.
1986
- Cajo ter Braak invents canonical correspondence analysis (CCA) and released
CANOCO software.
1988
- Cajo ter Braak and Colin Prentice's "A theory of gradient analysis" (Advances in
Ecological Research 18; 271-317) that unifies indirect and direct gradient
analysis and highlights the importance of underlying species response models.
1998,
2002
- Cajo ter Braak and Petr Šmilauer CANOCO 4 & 4.5 software and manual.
Mark Hill
Cajo ter Braak
Petr Šmilauer
1987
1987
2002
2003
PROGRESS
Introduction
Indirect gradient analysis - analyse Y data only
Direct gradient analysis - analyse Y and X data together
Aims of Indirect Gradient Analysis
1.
Summarise multivariate data in a convenient low-dimensional way.
Dimension-reduction technique.
2.
Uncover the fundamental underlying structure of data. Assume that
there is underlying LATENT structure. Occurrences of all species are
determined by a few unknown environmental variables, LATENT
VARIABLES, according to a simple response model. In ordination trying to
recover and identify that underlying structure.
Reasons for using Indirect Gradient Analysis
1.
Species compositions easier to determine than full range of
environmental conditions. Many possible environmental variables. Which
are important?
2.
Overall composition is often a good reflection of overall environment.
3.
Overall composition often of greater concern than individual species.
Global, holistic picture, in contrast to regression which gives a local,
individualist reductionist view.
Constrained canonical ordination or direct gradient analysis stands
between indirect gradient analysis and regression. Many species, many
environmental variables.
Underlying Response Models
A straight line displays the linear
relation between the abundance
value (y) of a species and an
environmental variable (x), fitted
to artificial data (●). (a =
intercept; b = slope or regression
coefficient).
A Gaussian curve displays a
unimodal relation between the
abundance value (y) of a species
and an environmental variable
(x). (u = optimum or mode; t =
tolerance; c = maximum =
exp(a)).
Indirect gradient analysis can be viewed as being like regression
analysis but with the MAJOR difference that in ordination the
explanatory variables are not known environmental variables but are
theoretical ‘latent’ variables.
Constructed so that they ‘best’ explain the species data.
As in regression, each species is a response variable but in contrast to
regression, consider all response variables simultaneously.
____________________
PRINCIPAL COMPONENTS ANALYSIS
CORRESPONDENCE ANALYSIS
PCA – linear response model
CA – unimodal response model
PCA
CA
& relative DCA
Indirect Gradient Analysis
Principal components analysis
In regression, fit a particular environmental variable to all the species.
Might then repeat for a different environmental variable. For some
species, one variable may fit better and for other species, another
variable may fit better. Judge the goodness-of-fit (explanatory power)
of an environmental variable by the total regression sum-of-squares.
What is the best possible fit that is theoretically obtainable within the
constraints of the linear response model?
Defines the ordination problem - to construct the single hypothetical
variable that gives the best fit to the species data according to the
linear response model. This hypothetical environmental variable is the
LATENT VARIABLE or the first ordination axis.
Principal components analysis provides the solution to this linear
ordination problem in any number of dimensions.
PCA results most conveniently presented as BIPLOTS
Correlation (=covariance) biplot scaling
Species scores sum of squares = λ
Site scores scaled to unit sum of squares
Distance biplot scaling
Site scores sum of squares = λ
Species scores scaled to unit sum of squares
Emphasis on species
Emphasis on sites
Other important questions in PCA
1. Data transformations.
2. Data standardisations (covariance or correlation matrix).
3. Can position 'unknown' samples (e.g., fossil samples) into PCA of
'known' modern samples.
4. How many axes to retain for interpretation? Comparison with
broken-stick model surprisingly reliable.
5. Interpretation indirect e.g., if environmental data available,
overlay or regress variables on PCA axes 1 and 2. If no
environmental data, interpret on basis of species biology and
ecology.
Correspondence analysis
Invented independently numerous times:
1. Correspondence Analysis: Weighted Principal Components
with Chi-squared metric.
2. Optimal or Dual Scaling: Find site and species scores so that (i)
all species occurring in one site are as similar as possible, but
(ii) species at different sites are as different as possible, and
(iii) sites are dispersed as widely as possible relative to species
scores.
3. Reciprocal Averaging: species scores are weighted averages of
site scores, and simultaneously, site scores are weighted
averages of species scores.
Assume we have five species (presences and absences) along a
moisture or pH gradient. Can estimate by Gaussian logit regression
the moisture optimum of each species.
Gaussian logit curve fitted by logit regression of the presences ( at p = 1) and
absences ( at p = 0) of a species on acidity (pH).
Simple weighted averaging is a good approximation to Gaussian logit
regression under a wide range of conditions.
Artificial example of
unimodal response
curves of five species
(A-E) with respect to
standardised variables,
showing different
degrees of separation of
the species curves.
a: moisture b: First axis
of CA c: First axis of CA
folded in this middle
and the response curves
of the species lowered
by a factor of about 2.
Sites are shown as dots
at y = 1 if Species D is
present and at y = 0 if
Species D is absent.
As a measure of how well moisture explains the species data, can use
the dispersion or spread of the species scores or optima. If the
dispersion is large, moisture separates the species curves and explains
the species data well, under the assumption of a unimodal response
model. If the dispersion is small, then moisture is a poor explanatory
variable.
Is there an environmental variable that would explain the species
data better? CA is the technique that will construct the theoretical
latent variable that will best explain the species data under the
assumption of unimodal species responses i.e., will find the latent
variable that will maximise the dispersion of the species scores. The
theoretical variable is the first CA axis.
A second CA axis and further CA axes can be constructed so that they
also maximise the dispersion of the species scores but subject to the
constraint of being uncorrelated with previous CA axes.
CA can be applied not only to presence-absence data but also to
abundance data. Involves two-way iterative weighted averaging
algorithm by starting from arbitary initial values for sites or from
arbitary initial (indicator) values for species. Calculate new species
scores by weighted averaging of site scores, calculate new site scores
by weighted averaging of sample scores, continue until convergence.
On convergence, the values are the site and species scores on CA axis
1, and have the maximum dispersion of species scores on CA axis 1.
Eigenvalue of axis 1 is the maximised dispersion of the species scores
on the CA axis and is a measure of the importance of the ordination
axis.
Remarkable feature of correspondence analysis is that it turns out
to be the solution to a wide range of seemingly different problems.
(Hill, 1974)
CA axes can be shown to minimise the ratio of within-species
variance of the site scores to the between-species variance, i.e., a
CA axis finds the latent variable that separates the species niches
or optima as well as possible. Important property of CA in its
constrained or canonical form, canonical correspondence analysis.
M.O. Hill (1973) J. Ecology 61, 237-249
M.O. Hill (1974) Applied Statistics 23, 340-354
Other important questions in CA
1. Hill's scaling of scores in multiples of one standard deviation or
'TURNOVER'. Sites differing by 4 SD tend to have no species in
common.
2. Often need to detrend as CA axis 2 may be an artifact, resulting
in an arch in CA axis 2. Commonest cause is that there is only one
dominant gradient and the second axis may simply be the first
axis folded so that axis 2 is a quadratic function of axis 1, axis 3
a cubic function of axis 1, and so on.
3. Present CA results as biplots with emphasis on species or on sites
or symmetric scaling (Gabriel, 2002), or scaled in Hill's turnover
standard deviation units, depending on research aims.
4. Rare species usually have 'extreme' scores - delete them,
downweight them, or do not plot them.
5. Interpretation indirect, as in PCA.
When to use PCA or CA/DCA?
PCA – linear response model
CA/DCA – unimodal response model
How to know which to use?
Gradient lengths important.
If short, good statistical reasons to use LINEAR methods.
If long, linear methods become less effective, UNIMODAL methods become
more effective.
Range 1.5–3.0 standard deviations both are effective.
In practice:
Do a DCA first and establish gradient length.
If less than 2 SD, responses are monotonic. Use PCA.
If more than 2 SD, use CA or DCA.
When to use CA or DCA more difficult.
Ideally use CA (fewer assumptions) but if arch is present, use DCA.
Hypothetical occurrence of species A-J over an environmental gradient. The length of the
gradient is expressed in SD units. Broken lines describe fitted occurrences of species. If
sampling takes place over a gradient range 1.5 SD, this means that occurrences of most
species are best described by a linear model. If sampling takes place over a gradient range
3 SD, occurrences of most species are best described by a unimodal model.
Methods of indirect gradient analysis or unconstrained ordination of a
multivariate set of data, Y
Name of method
(acronyms, synonyms)
Distance
measure
preserved
Relationship of
ordination axes
with original
variables
Criterion for drawing ordination axes
Principal Components
Analysis (PCA)
Euclidean
distance
Linear
Finds axis that maximises the total
variance (or, equivalently, that
minimises the total residual variation)
Correspondence Analysis
(CA, reciprocal
averaging, dual scaling)
Chi-square
distance
Unimodal
(approximately
Gaussian)
Finds axis that maximises dispersion
of species scores (which are
themselves weighted averages of site
scores)
Principal Coordinates
Analysis (PCO, PCoA,
metric multidimensional
scaling, classical scaling,
Torgerson scaling)
Any chosen
distance or
dissimilarity
measure
Unknown;
depends on
distance measure
chosen
Euclidean distances in new fulldimensional space are equal to
original distances (or dissimilarities)
Nonmetric
Multidimensional Scaling
(MDS, NMDA, NMDSCAL)
Any chosen
distance or
dissimilarity
measure
Unknown;
depends on
distance measure
chosen
The number of dimensions for the new
space is chosen a priori (reduced).
Euclidean distances in new space are
monotonically related to original
distances
Direct Gradient Analysis
Introduction
Env. vars
Direct GA
Species
Indirect GA
Primary data in gradient analysis
Abundances
or
+/variables
Values
Classes
Response variables
Y
Predictor or explanatory
variables
X
Two-step approach of indirect gradient analysis
PCA or CA followed by regression
Standard approach from 1954 to about 1985
Limitations:
(1) environmental variable studied may turn out to be
poorly related to the first few ordination axes.
(2) may only be related to 'residual' minor directions of
variation in species data.
(3) remaining variation can be substantial, especially in
large data sets with many zero values.
(4) a strong relation of the environmental variables with,
say, axis 5 or 6 can easily be overlooked and unnoticed.
Limitations overcome by canonical or constrained ordination techniques
= multivariate direct gradient analysis.
Direct gradient analysis or canonical ordination techniques
Ordination and regression in one technique
Search for a weighted sum of environmental variables that fits the
species best, i.e. that gives the maximum regression sum of squares
Ordination diagram
1) patterns of variation in the species data
2) main relationships between species and each environmental
variable
Redundancy analysis
constrained or canonical PCA
Canonical correspondence analysis (CCA) constrained CA
(Detrended CCA)
constrained DCA
Axes constrained to be linear combinations of environmental variables.
In effect PCA or CA with one extra step:
Do a multiple regression of site scores on the environmental variables
and take as new site scores the fitted values of this regression.
Multivariate regression of Y on X.
Artificial example of unimodal response curves of five species (A-E) with respect to
standardised environmental variables showing different degrees of separation of
the species curves
moisture
linear combination of
moisture and phosphate
CCA linear combination
a: Moisture
b: Linear combination of moisture and phosphate, chosen a priori
c: Best linear combination of environmental variables, chosen by CCA.
Sites are shown as dots, at y = 1 if Species D is present and at y = 0 if Species D
is absent.
Combinations of environmental variables
e.g.
3 x moisture + 2 x phosphate
e.g.
all possible linear combinations
xj co c1 z 1 j c 2 z 2 j c 3 z 3 j .....
zj = environmental variable at site j
c = weights
xj = resulting ‘compound’ environmental variable
CCA selects linear combination of environmental variables that maximises
dispersion of species scores, i.e. chooses the best weights (ci) of the
environmental variables.
C.J.F. ter Braak (1986) Ecology 67, 1167-1179
Alternating regression algorithms
- CA
- DCA
- CCA
Algorithms for (A) Correspondence Analysis, (B) Detrended Correspondence Analysis, and
(C) Canonical Correspondence Analysis, diagrammed as flowcharts. LC scores are the
linear combination site scores, and WA scores are the weighted averaging scores.
CCA axis is a linear combination of the environmental variables
supplied and it is the 'best' in the sense that it minimises the size of
the species niches (minimises the ratio of within-species variance
to total variance), and maximises the dispersion of the species
scores or optima.
Canonical or constrained correspondence analysis
Ordinary correspondence analysis gives:
1. Site scores which may be regarded as reflecting the underlying
gradients.
2. Species scores which may be regarded as the location of
species optima in the space spanned by site scores.
Canonical or constrained correspondence analysis gives in addition:
3. Environmental scores which define the gradient space.
These optimise the interpretability of the results.
CCA of the Dune Meadow Data. a:
Ordination diagram with environmental variables represented by arrows.
The c scale applies to environmental
variables, the u scale to species and
sites. the types of management are
also shown by closed squares at the
centroids of the meadows of the
corresponding types of management.
a
b
b: Inferred ranking of the
species along the variable
amount of manure, based on
the biplot interpretation of Part
a of this figure.
CCA: Biplots and triplots
• You may have in a same figure
• WA scores of species
• WA or LC scores of sites
• Biplot arrows or class centroids of environmental variables
• In full space, the length of an environmental vector is 1: When projected onto
ordination space
• Length tells the strength of the variable
• Direction shows the gradient
• For every arrow, there is an equal arrow to the opposite direction,
decreasing direction of the gradient
• Project sample points onto a biplot arrow to get the expected value
• Class variables coded as dummy variables
• Plotted as class centroids
• Class centroids are weighted averages
• LC score shows the class centroid, WA scores the dispersion of the
centroid
Redundancy analysis - constrained or canonical PCA
Short (< 2 SD) compositional gradients
Linear or monotonic responses
Reduced-rank regression
PCA of y with respect to x
Two-block mode C PLS
PCA of instrumental variables
Rao (1964)
PCA
- best hypothetical latent variable is the one that gives the
smallest total residual sum of squares
RDA
- selects linear combination of environmental variables that gives
smallest total residual sum of squares
C.J.F. ter Braak (1994) Ecoscience 1, 127–140 Canonical community
ordination Part I: Basic theory and linear methods
RDA ordination diagram of the Dune Meadow Data with environmental variables represented
as arrows. The scale of the diagram is: 1 unit in the plot corresponds to 1 unit for the sites,
to 0.067 units for the species and to 0.4 units for the environmental variables.
Partial constrained ordinations (partial CCA, RDA, etc)
e.g.
pollution effects
seasonal effects COVARIABLES
Eliminate (partial out) effect of covariables. Relate
residual variation to pollution variables.
Replace environmental variables by their residuals
obtained by regressing each pollution variable on the
covariables.
Partial CCA
Natural variation due to sampling season and due to
gradient from fresh to brackish water partialled out
by partial CCA.
Variation due to pollution could now be assumed.
Ordination diagram of a partial canonical
correspondence analysis of diatom species (A) in
dykes with as explanatory variables 24 variablesof-interest (arrows) and 2 covariables (chloride
concentration and season). The diagram is
symmetrically scaled [23] and shows selected
species and standardised variables and, instead of
individual dykes, centroids (•) of dyke clusters.
The variables-of-interest shown are: BOD =
biological oxygen demand, Ca = calcium, Fe =
ferrous compounds, N = Kjeldahl-nitrogen, O2 =
oxygen, P = ortho-phosphate, Si= siliciumcompunds, WIDTH = dyke width, and soil types
(CLAY, PEAT). All variables except BOD, WIDTH,
CLAY and PEAT were transformed to logarithms
because of their skew distribution. The diatoms
shown are: Ach hun = Achnanthes hungarica, Ach
min = A. minutissima, Aph cas= Amphora
castellata Giffen, Aph lyb = A. lybica, Aph ven =
A. veneta, Coc pla = Cocconeis placentulata, Eun
lun = Eunotia lunaris, Eun pec = E. pectinalis, Gei
oli = Gomphoneis olivaceum, Gom par =
Gomphonema parvulum, Mel jur = Melosira
jürgensii, Nav acc = Navicula accomoda, Nav cus =
N. cuspidata, Nav dis = N. diserta, Nav exi = N.
exilis, Nav gre = N. gregaria, Nav per = N.
permitis, Nav sem = N. seminulum, Nav sub= N.
subminuscula,Nit amp = Nitzschia amphibia, Nit
bre = N. bremensis v. brunsvigensis, Nit dis = N.
dissipata, Nit pal = N. palea, Rho cur =
Rhoicosphenia curvata.
(Adapted from H. Smit, province of Zuid Holland,
in prep)
Partial ordination analysis (partial PCA, CA, DCA)
There can be many causes of variation in ecological data. Not all are of major
interest. In partial ordination, can ‘factor out’ influence from causes not of
primary interest. Directly analogous to partial correlation or partial regression.
Can have partial ordination (indirect gradient analysis) and partial constrained
ordination (direct gradient analysis). Variables to be factored out are
‘COVARIABLES’ or ‘COVARIATES’ or ‘CONCOMITANT VARIABLES’. Examples are:
1) Differences between observers.
2) Time of observation.
3) Between-plot variation when interest is temporal trends within repeatedly
sampled plots.
4) Uninteresting gradients, e.g. elevation when interest is on grazing effects.
5) Temporal or spatial dependence, e.g. stratigraphical depth, transect position, x
and y co-ordinates. Help remove autocorrelation and make objects more
independent.
6) Collecting habitat – outflow, shore, lake centre.
7) Everything – partial out effects of all factors to see residual variation in data.
Given ecological knowledge of sites and/or species, can try to interpret residual
variation. May indicate environmental variables not measured, may be largely
random, etc.
Overview of major ordination techniques
Response
Linear
Unimodal
Ordination
PCA
CA, DCA
Constrained
ordination
RDA
CCA, DCCA
Partial ordination
Partial PCA
Partial CA,
partial DCA
Partial constrained
ordination
Partial RDA
Partial CCA,
partial DCCA
Partitioning variance or variation
ANOVA
total SS = regression SS + residual SS
Two-way ANOVA
between group (factor 1) + between
treatments (factor 2) + interactions + error
component
Borcard et al. (1992) Ecology 73, 1045–1055
Variance or variation decomposition into 4 components
Important to consider groups of environmental variables relevant
at same level of ecological relevance (e.g. micro-scale, specieslevel, assemblage-level, etc.).
Total inertia = total variance
Sum canonical eigenvalues =
Explained variance
Unexplained variance = T – E
1.164
0.663
57%
57%
43%
What of explained variance component?
Soil variables (pH, Ca, LOI)
Land-use variables (e.g. grazing, mowing)
Not independent
Do CCA/RDA using
1)
2)
3)
4)
canonical eigenvalues
canonical eigenvalues
Land-use covariables
Soil covariables
Soil variables only
Land-use variables only
Partial analysis Soil
Partial analysis Land-use
0.521
0.503
0.160
0.142
a) Soil variation independent of land-use (3)
b) Land-use structured (covarying) soil variation (1–3)
c) Land-use independent of soil (4)
Total explained variance
d) Unexplained
a
unique
b
c
unique
covariance
0.160
0.361
0.142
d
unexplained
13.7%
31%
12.2%
56.9%
43.1%
Variance partitioning or decomposition with three or more
sets of predictor (explanatory) variables
Qinghong & Bråkenheim (1995) Water, Air and Soil Pollution 85: 1587–1592
Three sets of predictors – Climate (C), Geography (G) and Deposition of
Pollutants (D)
Series of RDA and partial RDA
Predictors
Covariables
G+C+D
D
G+C
G+C
G+C
D
D
Joint effect
DG+C=0.784-0.132=0.679-0.027=0.652
C
D+G
G+D
G+D
C
C
Joint effect
CD+G=0.737-0.106=0.706-0.074=0.631
Sum of canonical
0.811
0.027
0.784
0.132
0.679
0.106
0.706
0.074
0.737
0.811
0.811
0.812
0.811
Predictors
G
D+C
D+C
G
Covariables
D+G
G
-
Sum of canonical
0.034
0.811
0.777
0.228
0.811
0.538
Joint effect
GD+C=0.777-0.228=0.538-0.034=0.549
Canonical eigenvalues
All predictors
0.811
Pure deposition
0.027
PD
Pure climate
0.106
PC
Pure geography
0.034
PG
Joint G + C
0.132
Joint G + D
0.074
Joint D + C
0.228
Unexplained variance 1 – 0.811 = 0.189
C
D
Covariance
terms
PD
CD
CD
DG
CG
CDG
DG
CDG
PC
CG
PG
G
Total explained variance 0.811 consists of:
Common climate + deposition
0.095
Unique climate PC
0.106
Common deposition + geography
0.013
Unique geography PG
0.034
Common climate + geography
0.008
Unique deposition PD
0.027
Common climate + geography +
deposition
0.544
Unexplained variance
0.189
See also Qinghong Liu (1997) Environmetrics 8: 75–85
Anderson & Gribble (1998) Australian J. Ecology 23: 158-167
Total variation:
1) random variation
2) unique variation from a specific predictor variable or set of predictor
variables
3) common variation contributed by all predictor variables considered together
and in all possible combinations
Usually only interpretable with 2 or 3 'subsets' of predictors.
In CCA and RDA, the constraints are linear. If levels of the environmental
variables are not uncorrelated (orthogonal), may find negative 'components of
variation'.
Statistical testing of constrained ordination results
Statistical significance of species-environmental relationships. Monte
Carlo permutation tests.
Randomly permute the environmental data, relate to species data
‘random data set’. Calculate eigenvalue and sum of all canonical
eigenvalues (trace). Repeat many times (99).
If species react to the environmental variables, observed test statistic
(1 or trace) for observed data should be larger than most (e.g. 95%) of
test statistics calculated from random data. If observed value is in top
5% highest values, conclude species are significantly related to the
environmental variables.
Statistical significance of constraining variables
• CCA or RDA maximise correlation with
constraining variables and
eigenvalues.
• Permutation tests can be used to
assess statistical significance:
- Permute rows of environmental
data.
- Repeat CCA or RDA with permuted
data many times.
- If observed higher than (most)
permutations, it is regarded as
statistically significant.
Monte Carlo permutation tests in CCA and RDA as
implemented in CANOCO
1. Unrestricted
2. Restricted - time series and line transects
- spatial layout on rectangular grid
- split-plot design
- multifactorial analysis of variance
Model-based permutations - when covariables are present, what is
permuted are the residuals of the regression of Y on the
covariables ('reduced model') or the residuals of the regression of
Y on X and the covariables ('full model').
An ecological example of CCA
Example data: quantitative
and qualitative environmental
variables (a) and qualitative
covariables (b) recorded at 40
sites along two tributaries
from the Hierden stream (sd:
standard deviation, min:
minimum, max: maximum).
Aquatic macro-fauna data
}Ordinal
4 classes
3 classes
7 binary class
variables
Remove effect of
seasonal variation
Ranking environmental variables in importance by their marginal (left) and conditional (right) effects of
the macrofauna in the example data-set, as obtained by forward selection. (1 = fit = eigenvalue with
variable j only; a = additional fit = increase in eigenvalue; cum (a) = cumulative total of eigenvalues a; P =
significance level of the effect, as obtained with a Monte Carlo permutation test under the null model with
199 random permutations; - additional variables tested; veg. = vegetation). Seasonal variation is partialled
out by taking the month class variables as covariables.
Marginal effects (forward:
j Variable
1
1 Shrubs (1/0)
0.25
2 Source distance 0.22
3 EC
0.20
4 Discharge
0.17
5 Tot. cover of veg.0.16
6 Shading
0.15
7 Soil grain size
0.14
8 Stream width
0.14
9 High weedy veg. 0.14
10 Cover bank veg. 0.13
- U vs L stream
0.22
step 1)
P
(0.01)
(0.01)
(0.01)
(0.01)
(0.01)
(0.01)
(0.02)
(0.05)
(0.08)
(0.11)
(0.01)
Each variable is the only
environmental variable
MARGINAL EFFETS i.e. ignoring all other variables
Conditional effects (forward: continued)
j Variable
a
P
1 Shrubs (1/0)
0.25 (0.01)
2 Source distance STEP 2 0.19 (0.01)
3 Discharge
STEP 3 0.19 (0.01)
4 EC
STEP 4 0.14 (0.03)
Cum (a)
0.25
0.44
0.63
0.75
- Cover emergent veg.
- Cover bank veg.
- Soil grain size
-
0.11
0.11
0.10
- U vs L stream
(0.10)
(0.12)
(0.13)
0.09 (0.26)
EXTRA FIT
Change in eigenvalue if particular variable selected
CONDITIONAL EFFECTS
given other selected variables
Species-conditional triplot based on a canonical
correspondence analysis of the example macroinvertebrate data displaying 13% of the inertia
(=weighted variance) in the abundances and 69%
of the variance in the weighted averages and
class totals of species with respect to the
environmental variables. The eigenvalue of axis 1
(horizontally) and axis 2 (vertically) are 0.35 and
0.17 respectively; the eigenvalue of the axis 3
(not displayed) is 0.13. Sites are labelled with
stream code (U, L) and are ranked by distance
from the source (rank number within the stream).
Species (triangles) are weighted averages of site
scores (circles). Quantitative environmental
variables are indicated by arrows. The class
variable shrub is indicated by the square points
labelled Shrub and No shrub. The scale marks
along the axes apply to the quantitative
environmental variables; the species scores, site
scores and class scores were multiplied by 0.4 to
fit in the coordinate system. Only selected
species are displayed which have N2>4 and a
small N2-adjusted root mean square tolerance for
the first two axes. The species names are
abbreviated to the part in italics as follows
Ceratopogonidae, Dendrocoelum lacteum, Dryops
luridus, Erpobdella testacea, Glossiphonia
complanata, Haliplus lineatocollis, Helodidae,
Micropsectra atrofasciata, Micropsectra fusca,
Micropterna sequax, Prodiamesa olivacea,
Stictochironomus sp.
Other aspects of CCA/RDA
1. Very robust, major assumption of CCA is that species
responses are approximately unimodal.
2. Can add in 'unknown' or passive samples into CCA or RDA
space (e.g., fossil samples positioned into modern CCA
space).
3. Unlike canonical correlation analysis, RDA (and CCA) can
handle data sets where the number of variables (species
and environmental variables ) >> number of sites.
4. Can calculate range of ordination diagnostics, comparable
to regression diagnositics.
Fossils samples in modern CCA space
Distance-based redundancy analysis
DISTPCOA
Pierre Legendre & Marti Anderson (1999) Ecol. Monogr. 69: 1-24.
RDA but with any distance coefficient
RDA - Euclidean distance Absolute abundances
Quantity dominated
CCA - chi-square metric
Relative abundances
Shape/composition dominated
Does it matter?
Total biomass or cover and species composition
Varying e.g. ridge snow bed gradient
Other dissimilarities
Bray & Curtis
Jaccard +/Gower mixed data
non-Euclidean
non-Euclidean
non-Euclidean
semi-metric
semi-metric
semi-metric
Basic idea
Reduce site x site DC matrix (any DC) to principal co-ordinates (principal coordinates analysis, classical scaling, metric scaling – Torgerson, Gower) but with
correction for negative eigenvalues to preserve distances.
PCoA – embeds the Euclidean part of DC matrix, rest are negative eigenvalues for
which no real axes exist. These correspond to variation in distance matrix, which
cannot be represented in Euclidean space.
Correction for negative eigenvalues
2
d 'ij (d ij 2c1 ) 0.5 for i j
where c1 is equal to absolute value of largest negative eigenvalue of matrix used
in PCoA 1
1 2
aij d ij
D
2
ij aij ai a j a..
1
Use all principal co-ordinate site scores (n - 1 or m, whichever is less) as
RESPONSE (species) data in RDA. Use dummy variables for experimental design as
predictors in X in RDA.
Now under framework of RDA and battery of permutation tests, can analyse
structured experiments but WHOLE ASSEMBLAGE (cf. MANOVA but where m >n).
Now can test null hypothesis (as in MANOVA) that assemblages from different
treatments are no more different than would be expected due to random chance
at a given level of probability. BUT unlike non-parametric tests (ANOSIM, Mantel
tests), can test for interactions between factors in multivariate data but using any
DC (not only Euclidean as in ANOVA/MANOVA). Using permutation tests means we
do not have to worry about multivariate normality or homogeneity of covariance
matrices within groups, or abundance of zero values as in ecological data.
DISTPCoA
www.fas.umontreal.ca/biol/legendre
Raw data
(replicates x species)
Distance matrix
(Bray-Curtis, etc)
Principal coordinate analysis
(PCoA)
Test of one factor
in a single-factor model
Matrix Y
(replicates x
principal coordinates)
Redundancy analysis (RDA)
F# statistic
Test of interaction term
in multifactorial model
Matrix Y
(replicates x
principal coordinates)
Partial redundancy analysis
(partial RDA)
F# statistic
Correction for
negative eigenvalues
Matrix X
(dummy variables
for the factor)
Test of F# by permutation
Matrix X
(dummy
variables
for the
interaction)
Matrix XC
(dummy
variables
for the main
effects)
Test of F# by permutation
under the full model
Canonical analysis of principal co-ordinates
Anderson, M.J. & Willis, T.J. (2003) Ecology 84: 511-525
CAP – www.stat.auckland.ac.nz/~mja
CAP - canonical analysis of principal co-ordinates based on any symmetric
distance matrix including permutation tests.
Y response variables (n x m)
X predictor variables (n x q) (1/0 or continuous variables)
Performs canonical analysis of effects of X on Y on the basis of any distance
measure of choice and uses permutations of the observations to assess
statistical significance.
If X contains 1/0 coding of an ANOVA model (design matrix), result is a
generalised discriminant analysis. If X contains one or more predictor
variables, result is a generalised canonical correlation analysis.
Summary of constrained ordination methods
Methods of constrained ordination relating response variables, Y (species abundance variables)
with predictor variables, X (such as quantitative environmental variables or qualitative
variables that identify factors or groups as in ANOVA).
Name of methods (acronyms,
synonyms)
Distance
measure
preserved
Relationship of ordination
axes with original variables
Takes into account
correlation structure
Redundancy Analysis (RDA)
Euclidean
distance
Linear with X, linear with
fitted values, Y = X(X'X)-1 X'Y
... among variables in X, but
not among variables in Y
Canonical Correspondence
Analysis (CCA)
Chi-square
distance
Linear with X, approx
unimodal with Y, linear with
fitted values, Y*
... among variables in X, but
not among variables in Y
Canonical Analysis of
Principal Coordinates (CAP;
Generalized Discriminant
Analysis)
Any chosen
distance or
dissimilarity
Linear with X, linear with
Qm; unknown with Y
(depends on distance
measure)
... among variables in X, and
among principal coordinates
Qm
Canonical Correlation
Analysis (CCorA, COR)
Mahalanobis
distance
Linear with X, linear with Y
... among variables in X, and
among variables in Y
Canonical Discriminant
Analysis (CDA; Canonical
Variate Analysis CVA;
Discriminant Function
Analysis, DFA)
Mahalanobis
distance
Linear with X, linear with Y
... among variables in X, and
among variables in Y
Principal response curves (PRC)
van der Brink, P. & ter Braak, C.J.F. (1999) Environmental Toxicology &
Chemistry 18: 138-148
van der Brink, P. & ter Braak, C.J.F. (1998) Aquatic Ecology 32: 163-178
PRC is a means of analysing repeated measurement designs and of testing
and displaying optimal treatment effects that change across time.
Based on RDA (= reduced rank regression) that is adjusted for changes across
time in the control treatment. Allows focus on time-dependent treatment
effects. Plot resulting principal component against time in PRC diagram.
Developed in ecotoxicology; also used in repeated measures in experimental
ecology and in descriptive ecology where spatial replication is substituted
for temporal replication.
Highlights differences in measurement end-points betweeen treatments and
the reference control.
PRC model
Yotk + bk cdt + d(i)tk
Yd(i)tk
=
where,
Yd(i)tk
= abundance counts of taxon k at time t in replicate i of treatment d
Yotk
= mean abundance of taxon k in controls (o) at time t
cdt
= principal response of treatment d at time t (PRC)
bk
= weight of species k with respect to cdt
d(i)tk
= error term with mean of zero and variance 2k
Modelling the abundance of particular species as a sum of three terms, mean
abundance in control, a treatment effect, and an error term.
Data input - species data (often log transformed) for different treatments at
different times
- predictor variables of dummy variables (1/0) to indicate all
combinations of treatment and sampling time ('indicator variables')
- covariables of dummy variables to indicate sampling time
Do partial RDA with responses, predictors, and covariables and delete all
predictor variables that represent the control. This ensures that the treatment
effects are expressed as deviations from the control.
PRC plots
One curve for each treatment expressed as deviation from the control. Species weights
(bk) allow species interpretation. Higher the weight, more the actual species response is
likely to follow the PRC pattern, because the response pattern = bk cdt. Taxa with high
negative weight are inferred to show opposite pattern. Taxa with near zero weight show
no response.
Significance of PRC can be tested by Monte Carlo permutation of the whole time series
within each treatment.
Can use the second RDA axis to generate a second PRC diagram to rank 2 model.
PRC and analysis of monitoring data
PRC usually used with experimental data. Can be used with
(bio)monitoring data.
Samples at several dates at several sites of a river, some upstream
of a sewage treatment plant (STP) (300 m, 100 m), in the STP
outlet, and some downstream (100 m, 1 km). 795 samples, 5 sites,
1994-2002.
PRC using sampling month as covariable, product of sampling
month and site as explanatory variables. Used STP outlet as the
reference site.
Of total variance, 24% could be attributed to between-month
variation. 57% of all variance could be allocated to between-site
differences, the remaining 19% to within-month variation.
See biggest differences for the two upstream sites, with lower NO x, total N, conductivity,
salinity, total P, and temperature and higher values of turbidity and faecal coliforms. STP
outlet leads to increases in N, P, temperature, etc. Downstream values decrease but are not
as low as upstream sites. STP successfully reduces faecal coliforms as their values are higher
in the upstream sites due to pollution.
PRC:
Filters out mean abundance pattern across time in the control.
Focuses on deviation between treatment and control. PRC displays
major patterns in those deviations and provides good summary of
response curves of individual taxa.
PRC helps to highlight 'signal' from 'noise' in ecological data in
replicated experimental studies.
Simplified RDA - simplified by representing the time trajectory for
the controls as a horizontal line and taking the control as the
reference to which other treatments are compared.
PRC gives simple representation of how treatment effects develop
over time at the assemblage level.
A palaeoecological example of RDA
Laacher See volcanic ash
A F Lotter & H J B Birks
1993
J Quat Sci 8, 263 - 276
11000 BP
? Any impact on terrestrial and aquatic systems
Also:
H J B Birks & A F Lotter
1994
J Paleolimnology 11, 313 - 922
A F Lotter et al
1995
J Paleolimnology 14, 23 - 47
Map showing the location
of Laacher See (red star),
as well as the location of
the sites investigated
(blue circle). Numbers
indicate the amount of
Laacher See Tephra
deposition in millimetres
(modified from van den
Bogaard, 1983).
Data
Terrestrial pollen and spores (9, 31 taxa)
Aquatic pollen and spores (6, 8 taxa)
Diatoms (42,54 taxa)
RESPONSE VARIABLES
% data
Biozone
(Allerød, Allerød/Younger Dryas, Younger Dryas)
+/-
Lithology
(gyttja, clay/gyttja)
+/-
Depth
("age")
Continuous
Ash
Exponential decay process
Continuous
Exp x-t
211 years
Time AL
YD
EXPLANATORY VARIABLES
= 0.5
NUMERICAL ANALYSIS
x = 100
(Partial) redundancy analysis
t = time
Restricted (stratigraphical) Monte
Carlo permutation tests
Variance partitioning
Log-ratio centring because of %
data
RESULTS OF (PARTIAL) RESUNDANCY ANALYSIS OF THE BIOSTRATIGRAPHICAL DATA SETS AT
ROTMEER (RO-6) AND HIRSCHENMOOR (HI-1) UNDER DIFFERENT MODELS OF EXPLANATORY
VARIABLES AND COVARIABLES.
Entries are significance levels as assessed by restricted Monte Carlo permutation tests (n = 99)
Data Set
Site
Explanatory
variables
Covariables
Terrestrial
pollen
RO-6
-
0.01a
-
0.01a
0.10
0.01a
RO-6
Depth + biozone +
ash + lithology
Depth + biozone +
ash + lithology
Ash
Aquatic
pollen &
spores
0.01a
Depth + biozone
0.09ns
0.48ns
0.16ns
HI-1
Ash
Depth + biozone
0.28ns
0.13ns
0.01a
RO-6
Ash + lithology
Depth + biozone
-
0.88ns
0.17ns
HI-1
Ash + lithology
Depth + biozone
-
0.10ns
0.01a
RO-6
Ash
-
0.53ns
0.08ns
HI-1
Ash
-
0.10ns
0.19ns
RO-6
Ash + lithology +
ash*lithology
Ash + lithology +
ash*lithology
Depth + biozone
+ lithology
Depth + biozone
+ lithology
Depth + biozone
-
0.25ns
0.03b
Depth + biozone
-
0.12ns
0.05b
HI-1
HI-1
a
p 0.01 b 0.01 < p 0.05
Diatoms
0.01a
Unique ash effect
(no lithology)
Unique ash +
lithology effect
Unique ash effect
(lithology
considered)
Unique ash +
lithology +
(ash*lithology)
interaction effect
Other recent developments in gradient analysis methods
based on weighted averaging
1. Weighted averaging partial least squares regression and calibration
Predict one or more environmental variables (e.g., lake pH) from
biological data (e.g., diatoms).
ter Braak and Juggins (1993) Hydrobiologia 269: 485-502
2. Canonical correspondence analysis partial least squares regression
Predict biological assemblages from many environmental variables
ter Braak and Verdonschot (1995) Aquatic Sciences 57: 255-289
3. Co-correspondence analysis
Relate two biological data sets (e.g., vascular plants and
invertebrates) to identify patterns common to both.
ter Braak and Schaffers (2003) Ecology (in press)
PROBLEMS
1. Data type and choice of ordination method
Besides gradient length (standard deviations), data type is also
important in selecting ordination method.
Absolute abundances
Unconstrained PCA (linear)
Constrained
RDA (linear)
Constrained
(PRC) (linear)
Relative abundances
(Compositional differences)
CA, DCA (unimodal)
CCA, DCCA (unimodal)
-
PCA/RDA are weighted summations; CA/CCA are weighted averages,
hence the difference between modelling absolute values (PCA/RDA) or
relative values (CA/CCA).
Cannot currently model satisfactorily absolute abundances over long
gradients. Need to partition the data into smaller gradients first (e.g.
TWINSPAN).
Long gradients and many rare species
What to do with data containing many zero values and long gradients, and to avoid
the inherent problems of the Chi-squared distance metric implicit in CA and CCA?
These problems are:
1.
Implicit use of relative abundances
2.
A difference between abundance values for a common species contributes
less to the distance than the same difference for a rare species, so rare
species may have an unduly large influence on the analysis
Possible solutions:
1.
Delete rare species
2.
Empirical downweighting of rare species as in CANOCO
3.
Data transformations that preserve the Euclidean distances and balance
rare and common species
P. Legendre & E.D. Gallagher (2001) Oecologia 129: 271-280
Software:
www.fas.umontreal.ca/biol/legendre
(1) Chord distance
m
D12
j 1
m
y1 j
m
y
j 1
1
(2) Chi-square metric D
12
j 1 y
2
1j
y2 j
m
y
j 1
2
2j
y1 j y2 j
j y1 y2
2
2
where y+j is the column (species) sum for species j, and y1+
is the row (sample) sum for sample 1
(3) Species profiles
y1 j y2 j
D12
y2
j 1 y1
m
y1 j
(4) Hellinger distance D12
j 1
y1
m
2
y2 j
y2
2
(These are also relevant transformations for minimum-variance cluster analysis and k-means
cluster analysis that minimise a least-squares function. These transformations result in Euclidean
metrics that can be represented in Euclidean space and that preserve sum of squares.)
2. Selecting environmental variables in
constrained ordination analysis (e.g., CCA, RDA)
1) The fewer the environmental variables, stronger the constraints are.
2) With q (number of samples – 1) environmental variables, the analysis is
unconstrained.
3) Want to try to find MINIMAL ADEQUATE SET of environmental variables that
explain the species data about as well as the FULL SET.
4) Automatic selection (e.g. forward selection) can be dangerous:
a) Several sets can be almost equally good. Automatic selection finds one but
may not be the best.
b) Selection order may change the result and ecologically important variables
may not be selected.
c) Small changes in the data can change the selected variables. Difficult to
draw reliable conclusions about relative importance of variables. Omission of a
variable does not mean it is not ecologically important.
5) If we are lucky, there may only be one minimal adequate model but we cannot
assume that there is only one such model.
3. Ignoring knowledge about species
Use
(and
Y
species data
X
environmental data
Z
covariable data)
All matching for all n samples
But, we may have previous knowledge about our m species (e.g., lifehistory strategies, ecological indicator values, growth rates, heights,
life-forms, etc). At present this information is not used.
Need for 3 matrix approach to CCA and RDA.
Doledec, S. et al. (1996) Environmental & Ecological Statistics 3: 143166
4. Data coding
If the abundance of species is measured by biomass, then the
analysis will only really pay attention to the trees and not the
herbs.
Useful to rescale data or re-define species variables. For example,
reduce the data to presences and absences of particular attributes,
such as the attribute 'present with cover of 25%' (cf. PSEUDOSPECIES
in TWINSPAN).
In much field plant ecological data, our experience is that almost
all quantitative information is contained in three attributes
"present", "present in moderate quantity", and "present in large
quantity".
PITFALLS
1. Biplot scaling
Data-tables in an ecological study on species
environmental relations. Primary data are the
sub-table 1 of abundance values of species and
the sub-tables 4 and 7 of values and class labels
of quantitative and qualitative environmental
variables (env. var), respectively. The primary
data are input for canonical correspondence
analysis (CCA). The other sub-tables contain
derived (secondary) data, as the arrows indicate,
named after the (dis)similarity coefficient they
contain. The coefficients shown in the figure are
optimal when species-environmental relations are
unimodal. The CA ordination diagram represents
these sub-tables, with emphasis on sub-tables 5
(weighted averages of species with respect to
quantitative environmental variables), 8 (totals
of species in classes of qualitative environmental
variables) and 1 (with fitted, as opposed to
observed, abundance values of species).
The sub-tables 6, 9, and 10 contain correlations among quantitative environmental variables, means of the
quantitative environmental variables in each of the classes of the qualitative variables and chi-square
distances among the classes, respectively. (Chi-sq = Chi-square; Aver = Averages; Rel = Relative)
-1: focus on sites
Hill's scaling
Interpreta
-tion
2: focus on species
biplot scaling of CCA
1. species x sites
Rel. abundances
CENTROID
Fitted rel. abund.
BIPLOT rule or
CENTROID rule
2. species x species
-
UNKNOWN
-square distances
DISTANCE rule
3. sites x sites
Turnover distances
DISTANCE
-square distances*
DISTANCE rule
4. sites x env. vars
-
UNKNOWN
Values of env.vars
BIPLOT rule
5. species x env. vars
Weighted averages
BIPLOT
Weighted averages
BIPLOT rule
6. env.vars x env.
vars
Effects
? BIPLOT
Correlations
BIPLOT rule
7. sites x env. classes
Membership
CENTROID
Membership
CENTROID rule
8. species x env. cls.
Rel. total abund.
CENTROID
Rel. total abund.
CENTROID rule
9. env.vars x env.
classes
-
UNKNOWN
Mean values of env. BIPLOT rule
vars
10. env. classes x
env. classes.
Turnover distances
DISTANCE
-square distances*
Scaling
Interpretation
Quantitative env. vars
Qualitative env. vars
(Italics = fitted by weighted least-squares)
* if 1 2
DISTANCE rule
Interpretation of CCA plots
Centroid principle
Distance principle
Biplot principle (of relative abundances)
Small eigenvalues, short (< 4 SD) gradients – Biplot principle
Large eigenvalues (> 0.40), long (> 4 SD) gradients – Centroid
and Distance principles and some Biplot principles
The centroid and distance principles may approximate biplot
principle if gradients are short and eigenvalues small.
Differences are least important if 12
2. Choice of environmental variables in
constrained ordinations
1) Choice can greatly influence the results. Fewer the environmental variables, the more
constrained the ordination is.
2) Possible to have one only – can evaluate its explanatory power.
3) Can remove superfluous variables if they are confusing or difficult to interpret. Can
often remove large number without any marked effect. Post-hoc removal of variables is
not valid in a hypothesis-testing analysis.
4) Linear combinations – environmental variables cannot be linear combinations of other
variables. If a variable is a linear combination of other variables, singular matrix results.
Examples:
- total cations, Ca, Mg, Na, K, etc. Delete total cations
- % clay, % silt, % sand
- dummy variables (granite or limestone or basalt)
5) Transformation of environmental data – how do we scale environmental variables in such
a way that vegetation ‘perceives’ the environment? Need educated guesses.
Log transformation usually sensible – 1 unit difference in N or P is probably more
important at low concentrations than at high concentrations.
As statistical significance is assessed by randomisation tests, no need to transform data to
fulfil statistical assumptions.
Transformations useful to dampen influence of outliers.
Environmental data automatically standardised in RDA and CCA.
6) Dummy variables – factors such as bedrock type, land-use history, management, etc,
usually described by categorical or class variables. 1 if belongs to class, 0 if it does not.
For every categorical variable with K categories, only need K – 1 dummy variables e.g.
Granite Limestone Basalt
Gabbro
Plot 1
1
0
0
0
2
0
1
0
0
3
1
0
0
0
4
0
1
0
0
5
0
0
1
0
6
0
0
0
1
7) Circular data – some variables are circular (e.g. aspect) and large values are very close to
small values. Aspect – transform to trigonometric functions.
northness = cosine (aspect)
eastness = sine (aspect)
Northness will be near 1 if aspect is generally northward and –1 if southward. Close to 0
if west or east.
Day of year – usually not a problem unless dealing with sampling over whole year. Can
create ‘winterness’ and ‘springness’ variables as for aspect.
8) Vegetation-derived variables – maximum height, total biomass, total cover, light
penetration, % open ground can all be ‘environmental’ variables. Such variables SHOULD
NOT BE USED in hypothesis testing, as danger of circular reasoning.
9) Interaction terms – e.g. elevation * precipitation. Easy to implement, difficult to
interpret. If elevation and precipitation interact to influence species composition,
easy to make term but the ecological meaning of where in environmental space the
sites or species are is unclear. Huge number of possibilities N variables ½ N (N – 1)
possible interactions. 5 variables 10 interactions.
AVOID quadratic terms [e.g. pH * pH (pH2) (cf. multiple regression and polynomial
terms)]. Can create an ARCH effect or warpage of ordination space.
Try to avoid interaction terms except in clearly defined hypothesis-testing studies
where the null hypothesis is that ‘variables c and d do not interact together to
influence the species composition’.
For interaction to be significant, eigenvalue 1 of the analysis with product term should
be considerably greater than 1 when there is no product term and the t-value
associated with the product term should be greater than 2 in absolute value.
Avoid product variables to avoid ‘data dredging’.
3. Monte Carlo permutation test results
1. Valid even without random samples
2. Relatively easy to take account of particular features of data
3. Can use 'non-standard' test statistics
Tell us if a certain pattern could or could not be caused/arisen by
chance. Completely specific to data set.
'Non-parametric' does not mean 'no assumptions'.
Validity of permutation results depends on validity of permutation
types for particular data-type - time-series or line transects, spatial
grids, repeated measures, split plots. All require particular types of
permutations.
In restricted permutations, may be a limited number of possible
permutations for a particular test. Increasing number of
permutations in such cases does not 'improve' the p-value!
4. Forward selection procedures in CCA and RDA
Used to
(i) Find minimal adequate set of explanatory variables that explain
the species data as well as the full data
(ii) Rank the environmental variables in importance
(iii) Evaluate the statistical significance of the effects on the species
of a particular environmental variable unconditionally or
conditionally on the effects of other environmental variables.
When applied repeatedly and in a step-wise fashion, shares the
shortcomings of all regression selection procedures in that the overall
size of the test in not controlled. In practice, too many variables are
judged significant . The tests are too tolerant overall. Great care and
patience are needed to find the 'minimal adequate model(s)'.
5. General limitations of ordination in
environmental biology
M.O. Hill (1988) Bull. Soc. Roy. Bot. Belg. 121, 134–41
"Ordination is a rather artificial technique. The idea that the world
consists of a series of environmental gradients, along which we should
place our vegetation samples, is attractive. But this remains an
artificial view of vegetation. In the end the behaviour of vegetation
should be interpreted in terms of its structure, the autoecology of its
species and, above all, the time factor. At this level, trends become
unimportant and multivariate analysis is perhaps irrelevant.
Ordination is useful to provide a first description but it cannot provide
deeper biological insights."
CONCLUSIONS
What can be done with modern ordination techniques in a combined
exploratory and confirmatory way?
Hallgren, Palmer, & Milberg (1999) Journal of Ecology 87, 1037-1051
2000+ plots in cereal and oil-seed crops in Sweden 1970-1994
Data Diving With Cross-validation:
AN INVESTIGATION OF BROADSCALE GRADIENTS IN SWEDISH
WEED COMMUNITIES
E. Hallgren, M.W. Palmer & P.
Milberg (1999), Journal of Ecology
87:1037-1051
Flow chart for the sequence of
analyses employed in the study.
Solid lines represent the flow of
data and dashed lines the flow of
ideas or analyses.
Map of Sweden with
the geographical
regions (A-H)
indicated.
Environmental variables used in the analyses
Nominal variables
Sowing season (autumn or spring)
Geographical regions: Swedish counties (A-H)
Soil types (1, sandy soil; 2, fine sand soil; 3,
silty soil; 4, loamy soil; 5, silty clay loam; 6,
heavy to very heavy clay soil; 7, organogenic
soil)
Crop species (barley, wheat, rye, oats, turnip
rape, rape; categorised according to season
of sowing)
Interval-scale variables
Year of trial (1970-1994)
Organic content (%; seven catgories)
Continuous variables
Nitrogen fetilization (N ha-1)
Crop yield (kg ha-1)
Techniques used
1. Detrended correspondence analysis - reveal gradients in data
2. Canonical correspondence analysis - Monte Carlo permutation tests
on "trace" statistic
3. Partial DCA - any interpretable species patterns beyond the effects
of measured environmental variables
4. Partial CCA - can one set of variables explain variation in species
composition not explained by a second set of variables
5. Stepwise CCA
Exploratory data set
1000 plots
Confirmatory data set
1000 plots
Display data set
2359 plots
Autumn sown
Spring sown
Variance partitioning
Partitioning of the explainable variable variation among the four groups of variables. TU
is the variation described by T but not explained by U. TU is the variation jointly
described by T and U. UT is the variation described by U but not by T.
S = soil type; C = crop species; Y = year; G = geographical region
Percentage of
explainable variation
T
U
T|U
TU
U|T
Percentage of
explainable variation
Other
Autumn
T
U
T|U
TU
U|T
Other
Spring
S
G
28.6
9.4
33.9
28.2
S
G
34.7
10.8
36.3
18.2
C
SG
24.5
7.2
64.6
3.7
C
SG
14.3
4.6
77.1
1.0
Y
CSG
3.7*
2.4
93.9
-
Y
CSG
4.0*
4.2
91.9
-
S
C
32.7
5.3
26.4
35.6
S
C
43.4
2.0
16.9
37.6
S
Y
36.4
1.6
4.5
57.5
S
Y
45.3
0.17
8.0
46.6
C
G
28.2
3.5
39.7
28.6
C
G
14.9
4.0
43.1
38.0
C
Y
29.0
2.7
3.3
65.0
C
Y
14.9
4.0
4.2
77.0
G
Y
42.5
0.78
5.3
51.5
G
Y
46.1
1.0
7.1
45.8
Year – low
Soil – high
Geography - high
Crop speices - high
Display phase
Autumn sown
Spring sown
Autumn sown
Spring sown
CONCLUSIONS
1. Besides analysing 'classical' ecological data (species x sites;
environmental variables x sites), ordination methods such as CCA
and RDA with associated Monte Carlo permutation tests provide a
means of examining multivariate experimental and monitoring
data.
2. CCA and RDA are, in reality, reduced rank multivariate regression
techniques with Y (many species) and X (environmental or predictor
variables, design matrix in MANOVA) and Z (covariables defining, for
example, plot design).
3. Ordination methods have progressed from basic two-dimensional
plots (still very valuable if constructed correctly) to statistical
testing of specific hypotheses about impacts, effects, etc. A semigraphical approach to MANOVA but without MANOVA's crippling
assumptions.
4. Due to recent progress, there is now a unified theory about choice
of methods for particular data sets and research questions. Still
several critical problems and potential pitfalls.
Outline of ordination techniques presented
here. DCA (detrended correspondence
analysis) was applied for the determination
of the length of gradient (LG). LG is
important for choosing between ordination
based on a linear or on a unimodal response
model. In cases where LG < 2.5 SD,
ordination based on linear response models
is considered to be the most appropriate.
PCA (principal components analysis)
visualizes variation in species data in
relation to best fitting theoretical variables.
Environmental variables explaining this
visualized variation are deduced
afterwards, hence, indirectly. RDA
(redundancy analysis) visualizes variation in
species data directly in relation to
quantified environmental variables. Before
analysis, covariables may be introduced in
RDA to compensate for systematic
differences in experimental units. After
RDA, a permutation test can be used to
examine the significance of effects.
5. Ordination methods, both for indirect and direct gradient
analysis, are now an important part of environmental biology.
Current and potential applications are very great in ecology,
monitoring, palaeoecology, limnology, ecotoxicology, analysis of
genetic diversity, biogeography, behavioural ecology, restoration
ecology, taxonomy, etc.
The use of
ordination prior to
about 1988
Andrew Lang 1844-1912.
He uses statistics as a
drunken man uses lampposts – for support
rather than illumination.
From MacKay, 1977, and
reproduced through the
courtesy of the Institute of
Physics.
Post-1988
Pre-1988
Cartoon illustrating statistical zap and shotgun approaches