NUMERICAL ANALYSIS OF
BIOLOGICAL AND
ENVIRONMENTAL DATA
Lecture 12
Overview
John Birks
OVERVIEW
• Topics covered
• Exploratory data analysis
• Clustering
• Gradient analysis
• Hypothesis testing
• Principle of parsimony in data analysis
• Possible future developments
• Conventional
• Less conventional
• Some applications
• Volcanic tephras
• Scotland’s most famous product
• Integrated analyses
• Problems of percentage compositional data
• Log-ratios
• Chameleons of CA and CCA
• Software availability
• Web sites
• Final comments
EXPLORATORY DATA ANALYSIS
Essential first step
Feel for the data – ranges, need for transformations, rogue or outlying
observations
NEVER FORGET THE GRAPH
CLUSTERING
Can be useful for some purposes – basic description, summarisation of large data
sets. Fraught with problems and difficulties – choice of dissimilarity coefficient (DC),
choice of clustering method, difficulties of validation and evaluation.
Good general-purpose methods: TWINSPAN, ORBACLAN, COINSPAN
GRADIENT ANALYSIS
Regression, calibration, ordination, constrained ordination, discriminant analysis
and canonical variates analysis, analysis of stratigraphical and spatial data.
HYPOTHESIS TESTING
Randomisation tests, Monte Carlo permutation tests.
Cajo ter Braak
1987 Wageningen
Classification of gradient analysis techniques by type of problem, response
model and method of estimation.
Type of problem | Linear response model: least-squares estimation | Unimodal response model: maximum likelihood estimation | Unimodal response model: weighted averaging estimation
Regression | Multiple regression | Gaussian regression | Weighted averaging of site scores (WA)
Calibration | Linear calibration; ‘inverse regression’ | Gaussian calibration | Weighted averaging of species scores (WA)
Ordination | Principal components analysis (PCA) | Gaussian ordination | Correspondence analysis (CA); detrended CA (DCA)
Constrained ordination (a) | Redundancy analysis (RDA) (d) | Gaussian canonical ordination | Canonical CA (CCA); detrended CCA (DCCA)
Partial ordination (b) | Partial components analysis | Partial Gaussian ordination | Partial CA; partial DCA
Partial constrained ordination (c) | Partial redundancy analysis | Partial Gaussian canonical ordination | Partial CCA; partial detrended CCA
(a) Constrained multivariate regression
(b) Ordination after regression on covariables
(c) Constrained ordination after regression on covariables = constrained partial multivariate regression
(d) “Reduced-rank regression” = “PCA of y with respect to x”
A straight line displays the linear relation between the abundance
value (y) of a species and an environmental variable (x), fitted to
artificial data (a = intercept; b = slope or regression coefficient).
A unimodal relation between the abundance value (y) of a species and an
environmental variable (x) (u = optimum or mode; t = tolerance; c =
maximum).
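The two response models in the captions above can be written down directly. The sketch below assumes the unimodal curve takes the Gaussian form y = c·exp(−(x − u)² / (2t²)), which matches the parameters named in the caption (u, t, c) but is not stated explicitly on the slide; the function names and the parameter values are illustrative only.

```python
import numpy as np

def linear_response(x, a, b):
    """Linear model: a = intercept, b = slope (regression coefficient)."""
    return a + b * np.asarray(x, dtype=float)

def gaussian_response(x, c, u, t):
    """Assumed Gaussian unimodal model: c = maximum, u = optimum, t = tolerance."""
    x = np.asarray(x, dtype=float)
    return c * np.exp(-0.5 * ((x - u) / t) ** 2)

# Illustrative gradient and parameter values (not from the lecture).
x = np.linspace(0.0, 10.0, 101)
y_lin = linear_response(x, a=2.0, b=1.5)
y_uni = gaussian_response(x, c=40.0, u=5.0, t=1.5)
```

Under this form the abundance falls to about 61% of its maximum one tolerance unit away from the optimum, which is why the tolerance t acts as a measure of niche width.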
GRADIENT ANALYSIS
Linear-based or unimodal-based methods?
Critical question, not a matter of personal preference
If gradients are short, sound statistical reasons to use linear methods – Gaussian-based methods break down, edge effects in CA and related techniques become
serious, biplot interpretations easy.
If gradients are long, linear methods become ineffective (‘horseshoe’ effect).
How to estimate gradient length?
Regression | Hierarchical series of response models GLM and HOF
Calibration | GLM, DCCA (single x variable)
Ordination | DCA (detrending by segments, non-linear rescaling)
Constrained ordination | DCCA (detrending by segments, non-linear rescaling)
Partial ordination | Partial DCA (detrending by segments, non-linear rescaling)
Partial constrained ordination | Partial DCCA (detrending by segments, non-linear rescaling)
HYPOTHESIS TESTING
Monte Carlo permutation tests and randomisation tests
Distribution free, do not require normality of error distribution
Do require INDEPENDENCE or EXCHANGEABILITY
Validity of permutation test results depends on the validity of the type of
permutation for the data set at hand.
Completely randomised observations, completely random permutation is
appropriate = randomisation test.
Randomised block design – permutation must be conditioned on blocks, e.g. if type
of farm is declared as a covariable and randomisation is conditioned on it,
permutations are restricted to within farm.
Time series or line transect – restricted permutations and data kept in order.
Spatial data on grid – restricted permutations and data kept in position.
Repeated measurements – BACI
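A restricted permutation for a randomised block design can be sketched as follows. This is a minimal illustration, not the scheme implemented in any particular package: the helper names, the choice of the absolute correlation as test statistic, and the toy data are all assumptions made for the example.

```python
import numpy as np

def permute_within_blocks(values, blocks, rng):
    """Shuffle values only among observations that share a block label."""
    values = np.asarray(values)
    blocks = np.asarray(blocks)
    permuted = values.copy()
    for b in np.unique(blocks):
        idx = np.flatnonzero(blocks == b)
        permuted[idx] = values[rng.permutation(idx)]
    return permuted

def blocked_permutation_test(x, y, blocks, n_perm=999, seed=0):
    """Monte Carlo p-value for |corr(x, y)| with permutations kept within blocks."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        y_perm = permute_within_blocks(y, blocks, rng)
        if abs(np.corrcoef(x, y_perm)[0, 1]) >= observed:
            count += 1
    return observed, count / (n_perm + 1)

# Toy data: three farms (blocks) with four plots each (hypothetical values).
farms = np.repeat([0, 1, 2], 4)
x = np.array([1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 5.1, 6.2, 6.8, 8.1, 1.0, 2.2, 2.9, 4.1])
statistic, p_value = blocked_permutation_test(x, y, farms)
```

For time series or line transects the same idea applies with a different restriction, for example cyclic shifts that keep the observations in order, rather than free shuffling inside blocks.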
PRINCIPLE OF PARSIMONY IN DATA ANALYSIS
William of Occam (Ockham), 14th century English nominalist philosopher. Insisted
that given a set of equally good explanations for a given phenomenon, the
explanation to be favoured is the SIMPLEST EXPLANATION.
Strong appeal to common sense.
Entities should not be multiplied without necessity.
It is vain to do with more what can be done with less.
An explanation of the facts should be no more complicated than necessary.
Among competing hypotheses or models, favour the simplest one that is consistent
with the data.
‘Shaved’ explanations to the minimum.
In data analysis:
1) Models should have as few parameters as possible.
2) Linear models should be preferred to non-linear models.
3) Models relying on few assumptions should be preferred to those relying on many.
4) Models should be simplified/pared down until they are MINIMAL ADEQUATE.
5) Simple explanations should be preferred to complex explanations.
RELEVANCE OF PRINCIPLE OF PARSIMONY TO DATA ANALYSIS
MINIMAL ADEQUATE MODEL (MAM)
- as statistically acceptable as the most complex model
- only contains significant parameters
- high explanatory power
- large number of degrees of freedom
- may not be one MAM
CLUSTERING
- prefer simple cluster analysis methods (few assumptions,
simple values of α, β, γ)
- intuitively sensible
REGRESSION
- GAM – GLM
- In GAM, simplest smoothers to be used
- In GLM, model simplification to find MAM (e.g. AIC)
CALIBRATION
- minimum number of components for lowest RMSEP in PLS
or WA-PLS
ORDINATION
- retain smallest number of statistically significant axes
(broken stick test)
- retain ‘signal’ at expense of noise
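The broken stick test mentioned above compares each axis's share of the total variance with the share expected if the variance were divided at random among the axes. A minimal sketch, assuming the usual broken-stick expectation b_k = (1/p) Σ_{j=k..p} 1/j for p axes; the function names and the eigenvalues are illustrative.

```python
import numpy as np

def broken_stick(p):
    """Expected proportions of variance for p axes under the broken-stick model."""
    return np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])

def n_axes_to_retain(eigenvalues):
    """Count leading axes whose observed share exceeds the broken-stick share."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    observed = ev / ev.sum()
    expected = broken_stick(len(ev))
    n = 0
    for obs, exp in zip(observed, expected):
        if obs <= exp:
            break
        n += 1
    return n

# Hypothetical eigenvalues from an ordination with four axes.
retained = n_axes_to_retain([5.0, 2.0, 0.6, 0.4])
```

Here only the first axis is retained: it carries 62.5% of the variance against an expectation of about 52%, while the second carries 25% against about 27%.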
PARTIAL ORDINATION
remove effects of ‘nuisance variables’ (covariables or concomitant
variables) by partialling out their effects
ordination of residuals
retain smallest number of statistically significant axes (broken stick test)
‘signal’ at expense of ‘noise’ and ‘nuisance variables’
CONSTRAINED ORDINATION
most powerful if the number of predictor variables is small compared to
number of samples. Constraints are strong, arch effects avoided, no
need for detrending, outlier effects minimised
minimal adequate model (forward selection, VIF, variable selection, AIC)
only retain statistically significant axes
PARTIAL CONSTRAINED ORDINATION
as above + partial ordination
STRATIGRAPHICAL DATA ANALYSIS
only retain statistically significant zones
simplify data to major axes or gradients of variation
CHOICE BETWEEN INDIRECT & DIRECT GRADIENT ANALYSIS
Indirect gradient analysis – two steps
Direct gradient analysis – one combined step
If relevant environmental data are to hand, direct approach is likely to be more
effective and simpler than indirect approach. Generally achieve a simpler model
from direct gradient analysis.
CHOICE BETWEEN REGRESSION & CONSTRAINED
ORDINATION
Both regression procedures! One Y or many Y.
Depends on purpose – is it an advantage to analyse all species simultaneously or
individually?
Community assemblage or individual taxa?
CONSTRAINED ORDINATION | REGRESSION
HOLISTIC | INDIVIDUALISTIC
COMMON GRADIENTS | SEPARATE GRADIENTS
QUICK, SIMPLE | SLOW, COMPLEX, DEMANDING
LITTLE THEORY | MUCH THEORY (GLM)
EXPLORATORY | MORE CONFIRMATORY, IN DEPTH
LIMITING FACTORS
Research questions
Hypotheses to be tested and evaluated
Data quality
TYPES OF GRADIENT ANALYSIS METHODS
BASED ON WEIGHTED AVERAGING
Community data - incidences (1/0) or abundances (≥ 0)
of species at sites.
Environmental data - quantitative and/or qualitative
(1/0) variables at same sites.
Use weighted averages of species scores (appropriate
for unimodal biological data) and linear combinations
(weighted sums) of environmental variables
(appropriate for linear environmental data)
Method | Abbreviation | Response variables (y) | Predictors (x) | Lecture
Correspondence analysis | CA (also DCA) | Community data | - | 6
Canonical correspondence analysis | CCA (also DCCA) | Community data | Environmental variables | 7
CCA partial least squares | CCA-PLS | Community data | Many environmental variables | 11
Weighted averaging calibration | WA | Environmental variable | Community data | 8
WA partial least squares | WA-PLS | Environmental variable(s) | Community data | 8
Co-correspondence analysis | CO-CA | Community data | Community data | 11
Also partial CA, partial DCA, partial CCA, partial DCCA.
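The weighted-averaging idea behind these methods can be illustrated with the simplest case, WA regression followed by WA calibration. This is a bare sketch on a made-up 3-site, 2-species table; the function names are hypothetical and the deshrinking step used in real WA calibration is deliberately omitted.

```python
import numpy as np

def wa_optima(Y, x):
    """WA regression: species optima = abundance-weighted averages of site values."""
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    return (Y * x[:, None]).sum(axis=0) / Y.sum(axis=0)

def wa_infer(Y, optima):
    """WA calibration: site values = abundance-weighted averages of species optima."""
    Y = np.asarray(Y, dtype=float)
    return (Y * optima[None, :]).sum(axis=1) / Y.sum(axis=1)

# Rows are sites, columns are species; x is the environmental variable (e.g. pH).
Y = np.array([[10.0, 0.0],
              [5.0, 5.0],
              [0.0, 10.0]])
x = np.array([4.0, 5.0, 6.0])

optima = wa_optima(Y, x)       # species 1: 13/3, species 2: 17/3
inferred = wa_infer(Y, optima)  # 13/3, 5, 17/3
```

The inferred values span a narrower range (about 4.33 to 5.67) than the observed ones (4 to 6) because averages are taken twice, which is the shrinkage that a deshrinking regression is meant to correct.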
POSSIBLE FUTURE DEVELOPMENTS - CONVENTIONAL
Lecture | Topic | Possible developments
2 | Exploratory data analysis | Model specific ‘outlier’ detection; interactive graphics
3 | Clustering | COINSPAN; better randomisation tests; CART; latent class analysis
4, 5 | Regression analysis | GLM and GAM framework; evaluation by cross-validation. Give up SS, deviance, t, etc!
6 | Indirect gradient analysis | ? quest for the ‘ideal’ ordination method; 2-matrix CA and PCA
7 | Direct gradient analysis | 3-matrix CCA and RDA (biology, environment, species attributes); multi-component variance partitioning; vector-based reduced rank models with GAMs
8 | Calibration and reconstruction | WAPLS; non-linear deshrinking; ? ML; mixed response models; chemometrics; Bayesian framework; more consideration of spatial autocorrelation
9 | Classification | ? give up classical methods; use permutation tests; classification and regression trees and random forests
10 | Stratigraphical and spatial data | ? more consideration of temporal and spatial autocorrelation
11 | Hypothesis testing | More realistic permutation tests (restrictions); better p estimation
NEURAL NETWORKS – THE LESS CONVENTIONAL
DATA ANALYSIS APPROACH IN THE FUTURE?
Back propagation neural network – layers containing neurons
input vector
input layer
hidden layer
output layer
output vector
Clearly can have different types of input and output vectors, e.g.
INPUT VECTORS | OUTPUT VECTORS | Equivalent to
> 1 Predictor | 1 or more Responses | Regression
> 1 ‘Responses’ | 1 or more ‘Predictors’ | Inverse regression or calibration
> 1 Variables | 2 or more Classes | Discriminant analysis
CALIBRATION (INVERSE REGRESSION) AND
ENVIRONMENTAL RECONSTRUCTIONS
Malmgren & Nordlund (1997) Palaeo-3 136, 359–373
Planktonic foraminifera 54 core-top samples
Summer water and winter water temperatures
Core E48–22
Extends to oxygen stage 9
320,000 years
Compared neural network as a calibration tool with:
Imbrie & Kipp principal component regression
Modern analog technique (MAT)
2-block PLS (SIMCA)
WA-PLS
CRITERION FOR NETWORK SUCCESS
Cross-validation leave-one-out
Estimate RMSE (average error rate in training set)
RMSEP (predictions based on leave-one-out cross-validation)
3 neurons, 600–700 cycles
Method | RMSEP Summer (°C) | RMSEP Winter (°C) | r (summer) | r (winter)
Neural network | 0.71 | 0.76 | 0.99 | 0.98
PLS | 1.01 | 1.05 | 0.98 | 0.97
MAT | 1.26 | 1.14 | 0.97 | 0.96
Imbrie & Kipp | 1.22 | 1.05 | 0.97 | 0.96
WA-PLS | 1.04 | 0.86 | 0.97 | 0.96
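The distinction between apparent RMSE and leave-one-out RMSEP can be sketched in a few lines. The example below uses ordinary least squares as a stand-in for the calibration method (an assumption made to keep the sketch short; the studies above used neural networks, MAT, PLS and WA-PLS) and simulated data with 54 samples.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """Least-squares fit with intercept, then predictions for X_new."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def apparent_rmse(X, y):
    """Error when the model predicts the same samples it was fitted to."""
    return float(np.sqrt(np.mean((y - fit_predict(X, y, X)) ** 2)))

def loo_rmsep(X, y):
    """Error when each sample is predicted from a model fitted without it."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        errors[i] = y[i] - fit_predict(X[keep], y[keep], X[i:i + 1])[0]
    return float(np.sqrt(np.mean(errors ** 2)))

# Simulated training set: 54 samples, 5 predictors (values are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(54, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.0]) + rng.normal(scale=0.5, size=54)

rmse = apparent_rmse(X, y)
rmsep = loo_rmsep(X, y)
```

For least squares the leave-one-out error can never be smaller than the apparent error, and the gap between the two gives a first impression of how much the model has fitted noise.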
Changes in root-mean-square errors (RMSE) for S in relation to number of training epochs for
3-layer BP neural networks with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. The
networks were trained over 50 intervals of 100 epochs each (in total of 5,000 epochs). As
expected, the RMSEs decrease as training proceeds. The minimum RMSE, 0.3539, was
obtained after training a network with 10 neurons in the hidden layer over 5,000 epochs.
Similar results were obtained also for W (not shown in diagram).
Changes in root-mean-square
errors of prediction (RMSEP) for
S with increasing number of
training epochs in a 3-layer back
propagation neural network with
1, 2, 3, 4, 5, and 10 neurons in
the hidden layer. These error
rates were determined using the
Leave-One-Out technique,
implying training of the networks
over 54 sets consisting of 53
observations each, with one
observation left out for later
testing. The lowest RMSEPs for
both S and W, 0.7176 and
0.7636, respectively, were
obtained for a configuration with
3 neurons (only the results for S
are shown in the diagram). Note
that set-ups with 1, 2, and 3
neurons gave lower RMSEPs than
for 4, 5, and 10 neurons.
Summer
Winter
Relationships between observed and predicted S and W using a 3-layer
BP neural networks with 3 neurons in the hidden layer. Lines are
linear regression lines. The product-moment correlation coefficients
(r) are shown in the lower right hand corners.
Prediction errors for different network configurations: root-mean-square errors
for the differences between observed and predicted S and W using a 3-layer BP
neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer.
No. neurons | S: RMSEP | S: No. epochs | W: RMSEP | W: No. epochs
1 | 0.8779 | 500 | 0.8796 | 300
2 | 0.7850 | 1800 | 0.9013 | 700
3 | 0.7176 | 600 | 0.7636 | 700
4 | 1.0621 | 700 | 0.8776 | 700
5 | 1.0032 | 2200 | 0.9206 | 3600
10 | 1.2108 | 500 | 0.9332 | 3000
Root-mean-square errors of prediction (RMSEP) are based on the Leave-One-Out
technique in which each of the 54 observations in the data set is left out one at a
time and the network is trained on the remaining observations. The trained
network is then used to predict the excluded observation. The network was run
over 50 intervals of 100 epochs each, and the error rates were recorded after each
interval.
Prediction error for different methods: Root-mean-square errors of
prediction (RMSEP) for S and W obtained from a 3-layer BP network,
Imbrie-Kipp Transfer Functions (IKTF), the Modern Analog Technique
(MAT), and Soft Modelling of Class Analogy (SIMCA)
Method | S | W
BP network | 0.7136 | 0.7636
IKTF | 1.2224 | 1.0550
MAT | 1.2610 | 1.1346
SIMCA (PLS) | 1.0058 | 1.0501
WA-PLS | 1.0419 | 0.8560
Predictions were made using the Leave-One-Out technique
Predictions of S and W in core E48-22 from southern Indian Ocean based on a BP
network, compared to the oxygen isotope (δ18O of Globorotalia truncatulinoides)
curve presented by Williams (1976) for the uppermost 440 cm of the core. The
cross-correlation coefficients for the relationships between δ18O and the
predicted S and W are –0.68 and –0.71, respectively, for zero lags (p<0.001).
Interglacial isotope stages 1, 5, 7, and 9, as interpreted here, are indicated
in the diagram.
Problems with ANN implementation and cross-validation
Easy to over-fit the model.
Leave-one-out cross-validation is not a stringent test
as ANN will continue to train and optimise its
network to the one sample left out. Need a training
set (ca. 80%) and an optimisation (or selection set)
(ca. 10%) to select the ANN model with the lowest
prediction error AND an independent test set (ca.
10%) whose prediction error is calculated using the
model selected by the optimisation set.
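The three-way split described above can be sketched as follows. Polynomial degree is used here as a stand-in for network complexity (an assumption made to keep the example short and self-contained), and the function name `three_way_split`, the simulated data, and the candidate degrees are all illustrative.

```python
import numpy as np

def three_way_split(n_samples, frac_train=0.8, frac_select=0.1, seed=0):
    """Randomly split sample indices into training, selection and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(frac_train * n_samples))
    n_select = int(round(frac_select * n_samples))
    return (order[:n_train],
            order[n_train:n_train + n_select],
            order[n_train + n_select:])

# Simulated data (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 200)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=200)
train, select, test = three_way_split(len(x))

def rmse(degree, fit_idx, eval_idx):
    """Fit a polynomial on fit_idx and report RMSE on eval_idx."""
    coef = np.polyfit(x[fit_idx], y[fit_idx], degree)
    resid = y[eval_idx] - np.polyval(coef, x[eval_idx])
    return float(np.sqrt(np.mean(resid ** 2)))

degrees = range(1, 8)
# Model choice uses the selection set only ...
best_degree = min(degrees, key=lambda d: rmse(d, train, select))
# ... and the reported error comes from the untouched test set.
test_rmsep = rmse(best_degree, train, test)
```

The point of the design is that the test set plays no part in either fitting or model choice, so its error is not flattered by the selection step.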
Telford et al. (2004) Palaeoceanography 19
947 Atlantic foraminifera data.
Split randomly 100 times into training set (747 samples),
optimisation set (100 samples), and test set (100 samples).
Median RMSEP (ºC) | ANN | MAT
Training set | 0.72 | 0.94
Optimisation set | 0.94 | 0.94
Test set | 1.11 | 1.02
No advantage in the hours of ANN computing when cross-validated rigorously. ANN appears to be a very complicated
(and slow) way of doing a MAT!
May not be so good after all!
DIATOMS AND NEURAL NETWORKS
Descriptive statistics for the SWAP diatom-pH data set
No. of samples | 167
No. of taxa | 267
% no. of +ve values in data | 18.47
Total inertia | 3.39

 | Min. | Median | Mean | Max. | S.D. | Range
N2 for samples | 5.13 | 28.58 | 29.22 | 57.18 | |
N2 for taxa | 1 | 14.99 | 23.76 | 120.86 | |
pH | 4.33 | 5.27 | 5.56 | 7.25 | 0.77 | 2.92
Figure: Artificial neural network convergence, SWAP data-set (167 lakes). Yves Prairie & Julien Racca (2002)
Figure: Jack-knife predicted pH against observed pH, SWAP data-set (167 lakes). Yves Prairie & Julien Racca (2002)
Figure: pH reconstruction by ANN and WA-PLS (RLGH core). Yves Prairie & Julien Racca (2002)
SKELETONISATION ALGORITHM
Pruning algorithm comparable to BACKWARD ELIMINATION in regression
models
1. Measure relevance Pi for each taxon i
Pi = E(without i) – E(with i), where E = RMSE
2. Train network with all taxa using back-propagation
3. Compute relevance Pi based on error propagation and weights
4. Delete the taxon with smallest estimated relevance Pi [Did this in 5% classes of
importance]
5. Re-train the network to a minimum again [After deleting a taxon,
the values of the remaining taxa are not re-calculated, so the
input data are always the same original relative abundance values]
Racca et al. (2003)
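The pruning loop above has the same shape as backward elimination, which the following sketch makes concrete. It is not the skeletonisation algorithm itself: a least-squares model stands in for the neural network, and the relevance Pi is obtained by refitting without each taxon rather than estimated from error propagation and weights. All names and data are hypothetical.

```python
import numpy as np

def rmse_of_fit(X, y):
    """Apparent RMSE of a least-squares model with intercept (stand-in for the ANN)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((y - A @ coef) ** 2)))

def prune_taxa(X, y, n_keep):
    """Repeatedly delete the taxon whose removal increases the error least."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        e_with = rmse_of_fit(X[:, active], y)
        relevance = {}
        for i in active:
            others = [j for j in active if j != i]
            # P_i = E(without i) - E(with i)
            relevance[i] = rmse_of_fit(X[:, others], y) - e_with
        least_relevant = min(relevance, key=relevance.get)
        active.remove(least_relevant)  # original abundances are left untouched
    return active

# Simulated data: only taxa 0 and 3 carry information about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=100)
kept = prune_taxa(X, y, n_keep=2)
```

As in the procedure described above, the columns of the remaining taxa are never rescaled after a deletion, so each refit sees the same original values.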
Figure: ANN functionality of taxa in relation to N2.
Figure: Leave-one-out predicted pH, ANN.
Figure: ROUND LOCH OF GLENHEAD reconstructions: 0%, 30%, 60%, and 85% pruned ANN; all taxa WA; all taxa ML.
Table: General characteristics of the 37 most functional taxa for
calibration based on ANN modelling approach.
Table: Summary statistics of the SWAP diatom pH inference models according to
the classes of taxa included based on the Skeletonisation procedure
(apparent and cross-validation statistics).
Ideally apparent RMSE should be a reliable measure of the actual predictive
ability of a model, and the difference between apparent and cross-validated RMSE
indicates the extent to which the model has overfitted the data
Examples of the
recently published
diatom-based
inference models in
palaeolimnology
used.
CURSE OF
DIMENSIONALITY
related to ratio of
number of taxa to
number of lakes, as this
ratio determines the
ratio of the dimensional
space in which the
function is determined
to the number of
observations for which
the function is
determined.
MAXIMUM ROBUSTNESS – ratio of taxa : lakes as small as possible
(1) increase the number of lakes
(2) decrease the number of taxa
“Neural networks have the potential for data analysis and represent a viable alternative to
more conventional data-analytical methods”.
Malmgren & Nordlund (1997)
Advantages:
1) Mixed linear and non-linear responses.
2) Good empirical performance.
3) Wide applicability.
4) Many predictors and many ‘responses’.
Disadvantages:
1) Very much a black box.
2) Conceptually complex.
3) Little underlying theory.
4) Easy to misuse and report erroneous model performance statistics.
PATTERN RECOGNITION
Unsupervised (cluster analysis, indirect gradient analysis) or supervised (discriminant
analysis, direct gradient analysis)
Statistical theory
Linear methods
Discriminants & Decision Theory
Neural network
BELIEF NETWORKS
Non-parametric methods
CART trees
Nearest-neighbour
K-NN
LDA
VOLCANIC TEPHRAS IN N.W.EUROPE OF LATEGLACIAL AND EARLY HOLOCENE AGE
Vedde Ash (Rhyolitic type): mid Younger Dryas, ca 10600 14C yrs BP. Kråkenes, Norway; several other sites in W Norway; Borrobol, Scotland; Tynaspirit, Scotland; Whitrig, Scotland
Vedde (Basaltic type): Kråkenes, W Norway
Borrobol: Lower LG Interstadial, ca 12500 14C yrs BP. Borrobol, Scotland; Tynaspirit, Scotland; Whitrig, Scotland
Saksunarvatn: early Holocene, ca 9000 14C yrs BP = 9930 – 10010 cal yr. Faeroes; Kråkenes, W Norway; Dallican Water, Shetland
Variables: SiO2, TiO2, Al2O3, FeO, MnO, MgO, CaO, Na2O, K2O
“The way in which correlation by tephrochronology may revolutionise approaches to
reconstructing the sequence of events in the N.E.Atlantic region...” Lowe & Turney (1997)
Figure: SiO2, Al2O3, MgO, TiO2, CaO, FeO, K2O, and Na2O plotted for each tephra group (labelled V, VB, B, S).
λ1 = 0.988 (32.9%), λ2 = 0.841 (28%)
CANONICAL VARIATES ANALYSIS
(= multiple discriminant analysis)
Figure: CVA – group means (Saksunarvatn, Vedde, Borrobol, Vedde Basaltic).
Figure: CVA – individual samples (Borrobol, Saksunarvatn, Vedde Basaltic, Vedde Norway, Vedde Scotland; one group contains Vedde Scotland + a few Vedde Norway).
Figure: CVA – biplot of variables.
Figure: Minimum-variance cluster analysis, √% data = chord distance; cophenetic correlation 0.955. Groups: Vedde Norway, Vedde Scotland, Borrobol, Saksunarvatn, Vedde Basaltic.
Figure: PCA of √% data, all samples. λ1 = 0.96 (95.9%), λ2 = 0.016 (1.6%); total 97.4%. Groups: Vedde Norway, Vedde Scotland, Saksunarvatn, Borrobol, Vedde Basaltic.
“Tephrochronology offers the potential of overcoming problems of
correlation because ash layers provide time-parallel markers and
therefore precise comparisons between sequences”
“The geochemical
signature of each ash is
unmistakable”
Lowe & Turney (1997)
Turney et al. (1997)
SCOTLAND'S MOST FAMOUS PRODUCT
Lapointe & Legendre (1994)
Applied Statistics 43, 237-257
Dendrogram representing the
minimum variance
hierarchical classification of
single-malt Scotch whiskies:
two scales are provided at
the top of the graph - the
number of groups formed by
cutting the dendrogram
vertically at the given points
and the fusion distances of
the hierarchical classification
(represented by vertical
segments in the
dendrogram); the vertical
order of the whiskies is
partly arbitrary - swapping
the branches of a
dendrogram does not change
the corresponding
cophenetic matrix (the 12
groups detailed in Appendix A
are labelled A-L here)
Map of Scotland showing the positions of the Scottish distilleries, divided
into 11 groups (symbols) in the regional classification of single-malt
whiskies (Appendix B) (the six Speyside groups are deferred to Fig. 3):
distillery names are represented by four-letter abbreviations (see Fig. 3);
the names of regions and of some major cities are also indicated - notice
that two Scotches in the present study come from the Springbank distillery;
Springbank pertains to the western group whereas Longrow is a member of the
Islay group.
Map of the Speyside
region showing six
of the 11 groups
(symbols) of Scotch
distilleries of the
regional
classification of
single-malt whiskies
(Appendix B) (the
names of regions
and of some major
cities are also
indicated) and
abbreviations and
full names of the
distilleries.
Looked at spatially constrained classification and
constrained ordination (RDA)
Looked at similarities between results based on:
Colour
Nose
Body
Palate
Finish
All give consistent results. Can use one to predict the other, except for finish.
TEST OF CONGRUENCE AMONG DISTANCE
MATRICES (CADM)
Legendre & Lapointe (2005)
5 data sets:
1 - colour (14 variables +/-)
2 - nose (12 variables +/-)
3 - body (8 variables +/-)
4 - palate (15 variables +/-)
5 - finish (19 variables +/-)
(1 - Jaccard coefficient)½ to give 5 distance matrices
Overall CADM test - null hypothesis (H0) of incongruence rejected (p = 0.0001)
Compare
1 with 2-5 - H0 rejected
2 with 1, 3-5 - H0 rejected
3 with 1, 2, 4, 5 - H0 rejected
4 with 1-3, 5 - H0 rejected
5 with 1-4 - H0 not rejected
Mantel test (2 matrices)
Finish not related to Colour, Nose, Body or Palate.
Principal co-ordinates analysis of Mantel-test
statistics. Axis 1 = 28.7%, axis 2 = 26.3%.
Why is FINISH so different?
It is important!
How were the whiskies tested by the tasters?
Did they swallow or spit?
If the latter, the finish variables may not be fully detected.
ONLY WHEN SWALLOWING CAN ONE TOTALLY CAPTURE THE
AFTERTASTE.
But,
“some professional blenders work only with their nose, not
finding it necessary to let the whisky pass their lips”.
SINGLE MALTS MUST BE SWALLOWED!
INTEGRATED ANALYSES OF BIOLOGICAL
AND ENVIRONMENTAL DATA
For nature conservation and management purposes,
useful to have an overview of the natural zonation of
the area as a whole. Such zonation should:
1. Have characteristic or indicator species or life-forms
2. Correspond to a circumscribed range of environments
3. Have some geographical coherence
Requires integrated analysis of biological and
environmental data.
INDIRECT CLUSTERING APPROACH
Biological data → clusters (e.g. TWINSPAN) → biological clusters
Biological clusters + environmental data → e.g. DISCRIM, canonical variates analysis, RIVPACS
cf. Indirect gradient analysis
Biological data → PCA or CA → regression with environmental data
DIRECT CLUSTERING APPROACH
Biological data + Environmental data → Clusters or Zones
1. Latent class analysis with biological data as +/- or counts following
binomial or Poisson distribution and environmental data following, after
log transformation, normal distribution.
ter Braak et al. (2003) Ecological Modelling 160: 235-248
2. CCA, RDA, or DCCA of biological and environmental data combined in
multivariate direct gradient analysis, followed by minimum-variance cluster
analysis (Ward's method) or k-means minimum-variance cluster analysis.
Estimate characteristic species for each cluster.
Carey et al. (1995) J. Ecology 83: 833-845. Biogeographical zonation of
Scotland.
Characteristic species of biogeographical zones
3. Principal co-ordinates analysis of mixed (biological and environmental)
data using Gower's (1971) coefficient.
$$ s_{ij} = \frac{\sum_{k=1}^{m} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{m} w_{ijk}} $$
where s_{ijk} is the similarity between sites i and j as measured by the
variable k and w_{ijk} is typically 1 or 0 depending on whether or not the
comparison is valid for variable k. Weights of zero are assigned when k is
unknown for one or both sites or to binary variables to exclude negative
matches. For binary variables s_{ij} is the Jaccard coefficient. For
categorical data the component similarity s_{ijk} is one when the two sites
have the same value and zero otherwise. For quantitative data
$$ s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k} $$
where R_k is the range of variable k
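AN EXAMPLE
 | Altitude | Moisture | Limestone | Sheep | Age
Site 1 | 120 | 1 | - | - | 1
Site 2 | 150 | 2 | + | - | 2
Site 3 | 110 | 3 | + | + | 3
$$ s_{12} = \frac{1 \times (1 - 30/40) + 1 \times 0 + 1 \times 0 + 0 \times 1 + 1 \times 0}{1 + 1 + 1 + 0 + 1} = 0.0625 $$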
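The worked value s12 = 0.0625 can be reproduced with a short function. The sketch below assumes that Moisture and Age are treated as categorical variables (the slide's component similarities of zero for 1 versus 2 imply this, but it is not stated), that Limestone and Sheep are binary, and that the range of Altitude is 150 − 110 = 40. The function name and argument layout are hypothetical.

```python
import math

def gower_similarity(a, b, kinds, ranges):
    """Gower's coefficient between two sites described by mixed variables."""
    numerator = 0.0
    denominator = 0.0
    for k, kind in enumerate(kinds):
        if a[k] is None or b[k] is None:
            continue  # unknown value: weight 0
        if kind == "quantitative":
            s = 1.0 - abs(a[k] - b[k]) / ranges[k]
        elif kind == "binary":
            if not a[k] and not b[k]:
                continue  # negative match: weight 0
            s = 1.0 if (a[k] and b[k]) else 0.0
        else:  # categorical
            s = 1.0 if a[k] == b[k] else 0.0
        numerator += s
        denominator += 1.0
    return numerator / denominator if denominator > 0 else math.nan

kinds = ["quantitative", "categorical", "binary", "binary", "categorical"]
ranges = {0: 40.0}  # range of Altitude over the three sites
site1 = (120.0, 1, False, False, 1)
site2 = (150.0, 2, True, False, 2)
site3 = (110.0, 3, True, True, 3)

s12 = gower_similarity(site1, site2, kinds, ranges)
```

The joint absence of sheep at sites 1 and 2 drops out of both numerator and denominator, so the similarity is 0.25 divided by 4 rather than by 5.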
Clusters can then be defined using the principal co-ordinate axes
scores in a minimum-variance cluster analysis or a partitioning of the
sites on the basis of the ordination scores.
4. Constrained indicator species analysis (COINSPAN)
Carleton, T.J. et al. (1996) J. Vegetation Science 7: 125-130
Like TWINSPAN (biological data only) but uses CCA first axis
instead of CA first axis (as in TWINSPAN) as the basis for ordering
samples prior to creating dichotomies.
The resulting clustering is based on CCA axis 1, a linear
combination of environmental variables that maximises the
dispersion of species scores.
COINSPAN clustering thus integrates biology and environment
together. Surprisingly little used - has considerable potential.
PROBLEMS OF PERCENTAGE
(COMPOSITIONAL) DATA
Jackson D.A. (1997) Ecology 78, 929–940
Simulated data (SIM)
200 observations x 5 variables
Different means and variances
 | x1 | x2 | x3 | x4 | x5
Mean | 30 | 60 | 60 | 120 | 120
Variance | 16 | 16 | 64 | 64 | 4096
Correlations between all variables = 0
Transformed into percentages
Raw data – BASIS
Transformed data – PERCENTAGE or PROPORTIONS – COMPOSITION
Bivariate casement plots of the basis (lower triangular matrix) and composition (upper triangular matrix) for the simulated data SIM. The basis relationships are independently generated,
and correlations approximate zero. Note the strong linear relationships in the composition
arising due to the constant-sum constraint, i.e. matrix closure. S1-S5 represent variables.
Frequency distributions of the bivariate correlations for SIM obtained under randomization. Each plot corresponds to the correlation between two variables from the basis (lower triangular matrix) or the composition
(upper triangular matrix) used in the previous figure. The basis matrix was randomized within each column, the
composition recalculated, and the correlation recalculated. Each plot is a frequency distribution of the
correlations obtained from 10 000 randomized matrices.
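The closure effect is easy to reproduce. The sketch below simulates five independent variables with the means and variances given for SIM and converts them to percentages. The distribution is an assumption: gamma variates are used so that all values stay positive, since the slides do not say how SIM was generated, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
means = np.array([30.0, 60.0, 60.0, 120.0, 120.0])
variances = np.array([16.0, 16.0, 64.0, 64.0, 4096.0])

# Gamma parameters chosen to match each mean and variance (assumed distribution).
shape = means ** 2 / variances
scale = variances / means
basis = rng.gamma(shape, scale, size=(200, 5))

# Closure: express each observation as percentages of its row total.
composition = 100.0 * basis / basis.sum(axis=1, keepdims=True)

r_basis = np.corrcoef(basis, rowvar=False)
r_composition = np.corrcoef(composition, rowvar=False)
```

The basis correlations stay close to zero, whereas in the composition the high-variance variable x5 drives the row totals and so induces strong negative correlations with the other four variables and positive correlations among them.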
Table: Eigenvector coefficients from a principal component analysis of the correlation
matrix of SIM. Results from a PCA of the basis and the composition are presented.
Figure: Scree plots of the eigenvalues for each component from the (a)
simulated data (SIM) and (b) herbivorous zooplankton data (ZOO).
The solid line represents the eigenvalues from the basis data (i.e.
non-standardised), and the dashed line represents the eigenvalues from
the compositional data (i.e. proportions).
Figure: Scatterplots of the first two components from a principal component analysis
of SIM using the (a) basis and (b) composition in calculating the correlation
matrix. Letters refer to the points positioned at the ends of axes 1 and 2.
CLUSTER ANALYSIS
UPGMA cluster analysis based on a correlation matrix of the variables (S1-S5
and H1-H5) from: (a) the basis data of the simulation data (SIM); (b) the
compositional data of SIM; (c) the basis data of the zooplankton data (ZOO);
and (d) the compositional data of ZOO.
POSSIBLE SOLUTIONS
1) CENTRED LOG RATIO Aitchison (1986)
All variables are retained in analysis but are standardised by dividing each
variable by a denominator based on a geometric composite of all variables.
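PCA of the covariance matrix
$$ Y_{ij} = \operatorname{cov}\left[\log\frac{x_i}{g(x)},\ \log\frac{x_j}{g(x)}\right], \quad i, j = 1, \dots, m $$
where g(x) is the geometric mean of the variables, i.e.
$$ g(x) = \left(\prod_{i=1}^{m} x_i\right)^{1/m} $$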
Advantages:
1. All variables are retained.
2. Pairwise relationships are the same regardless of using basis or compositional
data.
Problems:
1. With SIM, correlations still very strong! (0.412, 0.843, -0.799, -0.906)
2. Zero values have unidentified log-ratio value. Replace zero values by small
value.
3. Matrix is singular, so only m-1 components.
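A minimal sketch of the centred log-ratio transform shows two of the points listed above: the result is the same whether basis or compositional data are used, and the covariance matrix is singular. The function name `clr` and the simulated data are illustrative, and the data contain no zeros, so the zero-replacement problem does not arise here.

```python
import numpy as np

def clr(X):
    """Centred log-ratio: log of each value divided by its row geometric mean."""
    log_x = np.log(np.asarray(X, dtype=float))
    return log_x - log_x.mean(axis=1, keepdims=True)

# Simulated strictly positive basis data: 50 samples, m = 4 variables.
rng = np.random.default_rng(0)
basis = rng.lognormal(mean=3.0, sigma=0.5, size=(50, 4))
composition = basis / basis.sum(axis=1, keepdims=True)

Z = clr(composition)
cov = np.cov(Z, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)  # ascending order
```

Each transformed row sums to zero, so one eigenvalue of the covariance matrix is numerically zero and at most m − 1 components can be extracted.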
2) CORRESPONDENCE ANALYSIS
Only considers proportional relationships between variables;
unaffected by using basis or compositional data.
CA/DCA/CCA
– focuses on relative abundances
PCA/RDA
– focuses on absolute abundance
If an environmental variable influences total biomass, but leaves
the species composition unchanged, the variable will be
important in PCA/RDA but not at all important in CA/DCA/CCA.
One approach:
- analyse total biomass separately by regression
- analyse species composition by CCA
Analyses are fully complementary.
(PCA/RDA would probably give results close to the regression
analysis).
UNRESOLVED QUESTION SINCE 1986 IN CA/CCA
How can CA and CCA
1. Model unimodal function (c.f. WA as approximate Gaussian ML
regression)
and
2. Be linear with fit
$$ y_{ik} = \frac{y_{i+}\, y_{+k}}{y_{++}} \left(1 + b_{k1} x_{i1} + \dots\right) $$
Partial answer
CA and CCA model compositional data (proportions)
This compares with Aitchison's log-ratio model and the
polytomous GLM which are linear in centred logs but unimodal
in the original data.
THE TWO FACES OF CORRESPONDENCE ANALYSIS
AND CANONICAL CORRESPONDENCE ANALYSIS
CA and CCA are methods for analysing unimodal data.
CA and CCA are CHAMELEONS
1) Unimodal methods
2) Linear methods
CCA can be derived as a weighted form of reduced rank regression = redundancy analysis =
principal component analysis with respect to instrumental variables. The key element is that
the relative abundance is a linear function of the environmental variables (relative here
means relative to sample total and species total).
As unimodality and compositional data often go hand in hand, common element is that CCA
models compositional (i.e. relative) abundance data instead of the absolute abundance data.
ECOLOGICAL TERMS
CCA (and CA) model relative abundances and take sample size for granted. Usually the diversity of a sample increases with its size. CCA and CA take that aspect of α-diversity for
granted and focus, instead, on the β-diversity (dissimilarity between sites). If the trend in
α-diversity coincides with β-diversity (e.g. species disappear one by one along a gradient), CA
and CCA can extract such trends.
In unimodal context, species scores are weighted averages of sample scores and vice versa.
In linear context, species scores are derived from a weighted linear regression of transformed
species data on to the sample scores.
Linear context most useful when gradient length is < 3SD. Unimodal context most useful
when gradient length is > 4SD. For intermediate lengths, either context may be useful.
Can transform unimodal model into linear model by ‘take logarithms and double centre’
(for data with no zeroes).
If data contain zeroes, no explicit linearising data transformation because we cannot take
logarithms. In CA and CCA, a transformation is implicit that is close to the exact
transformation.
EXACT
$$ \log y^{*}_{ik} \quad \text{with} \quad y^{*}_{ik} = \frac{y_{ik}\, g}{g_i\, g_k} $$
where g_i and g_k are geometric averages across rows and columns,
respectively, and g is the overall geometric average
CA/CCA
$$ y^{*}_{ik} = \frac{y_{ik}\, y_{++}}{y_{i+}\, y_{+k}} $$
where y_{i+} and y_{+k} are the abundance totals across species in site i
and across samples for species k, and y_{++} is the overall total
INHERIT THEIR TWO FACES FROM MODELS OF COMPOSITIONAL DATA.
DATA TYPE AND CHOICE OF ORDINATION METHOD
Besides gradient length (standard deviations), data type is also
important in selecting ordination method.
                 Absolute abundance    Relative abundance
                                       (compositional differences)
Unconstrained    PCA (linear)          CA, DCA (unimodal)
Constrained      RDA (linear)          CCA, DCCA (unimodal)
Constrained      (PRC) (linear)        -
PCA/RDA are weighted summations; CA/CCA are weighted averages,
hence the difference between modelling absolute values (PCA/RDA) and
relative values (CA/CCA).
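A small numerical sketch of why a weighted summation responds to absolute abundance while a weighted average does not. The centring and standardisation used in real PCA/RDA are left out, and the function names, abundances, and species scores are made up for illustration.

```python
def weighted_sum(abundances, species_scores):
    """Linear-method style site score: sum of abundance times species score."""
    return sum(y * b for y, b in zip(abundances, species_scores))


def weighted_average(abundances, species_scores):
    """Unimodal-method style site score: the same sum divided by the site total."""
    return weighted_sum(abundances, species_scores) / sum(abundances)


species_scores = [-1.0, 0.0, 2.0]           # hypothetical species scores
sample = [10, 30, 60]                       # abundances in one sample
doubled = [2 * y for y in sample]           # same composition, twice the total

print(weighted_sum(sample, species_scores), weighted_sum(doubled, species_scores))
print(weighted_average(sample, species_scores), weighted_average(doubled, species_scores))
```

Doubling every abundance doubles the weighted sum but leaves the weighted average unchanged, so only the composition of the sample affects the weighted average.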
Cannot currently model absolute abundances satisfactorily over long
gradients. Need to partition the data into shorter gradients first (e.g.
with TWINSPAN).
SPECIES ABSENCES IN DATA SETS
Besides removing the absolute abundance effect, CA,
DCA, CCA, and partial CCA (and WA and WA-PLS) do
not consider species absences or zero values in the
biological data.
Zero values:
• Show real absence?
• Reflect incomplete sampling?
• Chance?
Is this an advantage or disadvantage?
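To make concrete what "do not consider species absences" means, here is a sketch of a weighted-averaging (WA) optimum, the abundance-weighted mean of an environmental variable over the sites. The function name and the numbers are hypothetical.

```python
def wa_optimum(abundances, env_values):
    """Abundance-weighted mean of an environmental variable for one species."""
    total = sum(abundances)
    return sum(y * x for y, x in zip(abundances, env_values)) / total


# One species recorded at three sites along a hypothetical pH gradient.
abund = [2, 6, 2]
ph = [5.0, 6.0, 7.0]

# The same data plus two extra sites where the species is absent.
abund_with_absences = abund + [0, 0]
ph_with_absences = ph + [3.0, 9.5]

print(wa_optimum(abund, ph))
print(wa_optimum(abund_with_absences, ph_with_absences))
```

Sites with zero abundance carry zero weight, so the estimate is the same whether or not they are included, whatever the reason for the zero.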
SOFTWARE AVAILABILITY
CANOCO & CANODRAW
MicroComputer Power
111 Clover Lane
ITHACA, NY 14850
USA
([email protected])
http://www.microcomputerpower.com
MAT, ZONE, WINTRAN, C2
Steve Juggins
Geography Department
University of Newcastle
NEWCASTLE UPON TYNE
NE1 7RH
[email protected]
http://www.campus.ncl.ac.uk/staff/Stephen.Juggins/
HOF
Jari Oksanen
Department of Biology
University of Oulu
OULU
Finland
([email protected])
http://cc.oulu.fi/~jarioksa/
TWINSPAN (Mark Hill), DISCRIM (Cajo ter Braak), TWINGRP, RATEPOL, SPLIT, etc
John Birks
Department of Biology
University of Bergen
Allégaten 41
N-5007 BERGEN
Norway
([email protected])
QUERIES
[email protected]
John Birks, Department of Biology, University of Bergen, Allégaten 41, N-5007
Bergen, Norway
Fax: (+47) 55 58 96 67
[email protected]
Gavin Simpson, Environmental Change Research Centre, University College
London, Gower Street, London, WC1E 6BT, UK
http://www.homepages.ucl.ac.uk/~ucfagls/ncourse/
VALUABLE WEB SITES FOR NUMERICAL
ECOLOGISTS AND PALAEOECOLOGISTS
www.okstate.edu/artsci/botany/ordinate
Mike Palmer's ordination site with masses of documentation,
explanatory notes, links, details of software, etc.
www.canoco.com
Cajo ter Braak's site about CANOCO and with answers to many
frequently asked questions (FAQ)
www.microcomputerpower.com
Richard Furnas' site about CANOCO and related software availability
and ordering
www.canodraw.com
Petr Šmilauer's site about CANODRAW and CANOCO and related
software
http://regent.bf.jcu.cz/maed
Details of Petr Šmilauer and Jan Lepš' course and data on multivariate
analysis of ecological data.
WEB SITES continued
http://cc.oulu.fi/~jarioksa/
Jari Oksanen's site with his R vegan package, lecture notes, programs
(e.g. HOF), documentation, comments, FAQ, and much more
http://www.bio.umontreal.ca/legendre/indexEnglish.html
Pierre Legendre's site with details of publications, software, activities,
etc.
http://www.bio.umontreal.ca/Casgrain/en/labo/index.html
Software from Pierre Legendre's lab
http://labdsv.nr.usu.edu/
Dave Roberts' site about quantitative vegetation ecology with lecture
notes, software details, etc.
www.nku.edu/~boycer/fso/
Rick Boyce's site about fuzzy set ordination
http://cran.r-project.org/
R website
WEB SITES continued
www.stat.auckland.ac.nz/~mja/
Marti Anderson's site with new software, details of publications,
activities, etc.
www.campus.ncl.ac.uk/staff/Stephen.Juggins
Steve Juggins' site for C2, WinTran, ZONE, etc.
www.chrono.qub.ac.uk/psimpoll/psimpoll.html
Keith Bennett's site for palaeoecological software, notes, etc.
www.chrono.qub.ac.uk/inqua
Keith Bennett's site of INQUA Data Analysis Sub-Commission software,
newsletters, etc.
www.env.duke.edu/landscape/classes/env358/env358.html
Dean Urban's site with excellent lecture notes on Multivariate Methods
for Environmental Applications
FINAL COMMENTS
Numerical Analysis of Biological Data
Basic building-blocks and concepts and the resulting numerical methods
Continuum concept
Niches
Weighted averaging
'Communities'
CA/DCA
TWINSPAN
Cluster analysis
Metric scaling
Non-metric scaling
Indicator-species analysis INDVAL
Numerical Analysis of Environmental Data
Basic building-blocks and concepts and the resulting numerical methods
GLM
Regression models
Multiple regression
Gradients
Correlation & covariance
+
PCA
Linear combinations
PLS
Crossvalidation
RDA
Permutation tests
Cluster analysis
Procrustes rotation
Co-inertia analysis
Linear discriminant analysis, canonical correlation analysis
Numerical Analysis of Biological and Environmental Data
Basic building-blocks and concepts and the resulting numerical methods
GLM & GAM
Regression models
Niches & Gradients
Weighted averaging
Multiple regression
+
CA/DCA
TWINSPAN
WA
WA-PLS
Crossvalidation
CCA-PLS
Co-CA
Co-inertia analysis
DISCRIM
PLS
COINSPAN
Permutation tests
Cluster analysis
CCA
Distance-based PCoA
Canonical analysis of principal co-ordinates (CAP)
Multiple discriminant analysis
Andrew Lang 1844-1912.
He uses statistics as a drunken man uses lampposts – for support rather than illumination.
From MacKay, 1977, and reproduced through the courtesy of the Institute of Physics.
Statistics are for illumination!
Sketches illustrating statistical zap and shotgun
THE PEOPLE WHO HAVE MADE THE
STATISTICAL ZAP POSSIBLE
Mark O. Hill
Marti J. Anderson
Cajo J.F. ter Braak
Richard Telford
Steve Juggins
Pierre Legendre
Gavin Simpson