NUMERICAL ANALYSIS OF
BIOLOGICAL AND
ENVIRONMENTAL DATA
Lecture 1
Introduction
John Birks
TEACHING OF THE COURSE
Course Leader: Gavin Simpson (UCL)
Lectures 1, 4, 5, 8, 10, 12: John Birks (Bergen & UCL)
Lectures 2, 3, 6, 7, 9, 11: Gavin Simpson (UCL)
Practicals 1-10: Gavin Simpson (UCL)
Course administration: Adam Young (UCL)
INTRODUCTION
Book list
Level of course
Aims of course
What are multivariate data?
What is multivariate data analysis?
Aims of multivariate data analysis
Why do multivariate data analysis?
Terminology
Types of variables
Geometrical models and concept of similarity
(dissimilarity or distance)
Computing
Course topics
LEVEL OF THE COURSE
Approach from practical biological and geological
viewpoint, not statistical theory viewpoint.
Assume no background in matrix algebra,
eigenanalysis, or statistical theory.
Emphasis on techniques that are ecologically realistic
and useful and that are computationally feasible.
“Truths which can be proved can also be
known by faith. The proofs are difficult and
can only be understood by the learned; but
faith is necessary also to the young, and to
those who, from practical preoccupations,
have not the leisure to learn. For them,
revelation suffices.”
Bertrand Russell 1946
The History of Western Philosophy
“It cannot be too strongly emphasised that a long
mathematical argument can be fully understood on first
reading only when it is very elementary indeed, relative
to the reader’s mathematical knowledge. If one wants
only the gist of it, he may read such material once only,
but otherwise he may expect to read it at least once
again. Serious reading of mathematics is best done sitting
bolt upright on a hard chair at a desk. Pencil and paper
are indispensable.”
L Savage 1972
The Foundations of Statistics.
BUT:
“A journey of a thousand miles begins with a single step”
Lao Tsu
STATUS OF MULTIVARIATE NUMERICAL DATA
ANALYSIS
Basic mathematics of correlation, regression, analysis of variance,
eigenanalysis, randomisation etc. not new, worked out in the 1920s-1930s.
Arithmetic manipulations and calculations involved so numerous and so
time consuming; virtually impossible to work with anything other than
smallest data-sets on hand calculator or early computer.
Development of numerical data analysis closely linked to development of
computers.
Now possible to do in seconds what would have taken hours, days, even
weeks.
Increased availability of computer program packages has advantages and
disadvantages.
Advantages      Disadvantages
• fast          • too fast
• painless      • too easy
• simple        • too simple
Need to understand a technique
well before one can critically
evaluate results. Sound
interpretation requires a good
understanding of the technique.
AIMS
Provide an introductory understanding of the most appropriate
methods for the numerical analysis of complex multivariate
biological and environmental data. Recent maturation of
methods.
Provide introduction to what these methods do and do not
do.
Provide some guidance as to when and when not to use
particular methods.
Provide an outline of major assumptions, limitations,
strengths, and weaknesses of different methods.
Indicate to you when to seek expert advice.
Encourage numerical thinking (ideas, reasons, potentialities
behind the techniques). Not so concerned here with numerical
arithmetic (the numerical manipulations involved).
Syllabus for Edgeworth’s
1892
Newmarch Lectures,
University College London
ON THE USES AND METHODS OF
STATISTICS
By Professor F. Y. Edgeworth, M. A., D. C. L.
I. FIRST PRINCIPLES
The extent of the subject here treated is that which is denoted by two leading definitions of statistics, viz: the study
of numerical statements relating to society, and the theory of means. The subject may be divided according as the
element of induction is more or less prevalent. First come general directions as to the acquisition of data; e.g., that
figures should be accurate, and terms unambiguous. Examples of the violation of these rules; together with other
precepts and cautions. Use of relative figures (per head, per cent, &c.). Analysis of the data.
References: Conférences sur la Statistique (Rozier Editeur), 1891; Pidgin, Practical Statistics, 1888; Giffen,
International Statistical Comparisons, Economic Journal, June, 1892.
II. GRAPHICAL METHODS
The Cartesian system of co-ordinates. Integration and interpolation. Case where several dependent variables (i.e.
diseases from different causes) are referred to one independent variable (i.e. the time). The case of one variable
dependent on two independent variables is properly represented by a surface; but curves of level and variously
coloured planes are more convenient. Methods of expressing variation of a quantity relative to its initial, or average,
value. Miscellaneous devices for exhibiting numerical relations to the eye.
References: Marey, La Méthode Graphique, 1885; Favaro, Leçons de Statique Graphique (translated into French by
Terrier), Ch. V. with appendix by the translator. Levasseur, La Statistique Graphique, Journal of the Statistical
Society, Jubilee vol., 1885; Marshall, The Graphic Method of Statistics, Ibid; Cheysson, Les Cartogrammes à
teintes graduées, Journal de la Société de Statistique de Paris, 1887; Scribner’s Statistical Atlas of the United
States; Longstaff, Studies in Statistics, 1891.
III. THE DOCTRINE OF AVERAGES
The general idea of a mean comprehends innumerable species, of which the most important are, the Arithmetic
Mean, the Median, the Greatest Ordinate (or centre of greatest condensation) and the Geometric Mean. A cross
division is between simple and weighted means. Concrete instances of these varieties. Subtle distinction between
so-called objective and subjective means. Peculiar prestige attaches to the means of which the constituents are
grouped according to the Probability Curve, or law of error. A priori demonstration, and empirical verification, that
this form arises under certain conditions.
References: Venn, Logic of Chance, Third Edition, 1888, chap, xviii., and xix.; On….Averages. Journal of the
Statistical Society, 1891; Galton, Statistics by inter-comparison, Philosophical Magazine, 1875; Bertillon, Moyenne,
Dictionnaire Encyclopédique des Science Médicales; Edgeworth, On the Choice of Means, Phil. Mag., 1887, On
the empirical proof of the law of error, Ib., 1887.
IV. TYPES AND CORRELATIONS
The ‘mean man’ has for stature, length of cubit, height of knee, &c, the respective means of the statures, lengths,
&c., of a greater number of men. Reply of the objection that such a combination of partial means may not form a
possible whole. Relation between the deviation of one organ or attribute, e.g. length of cubit, from its mean; as
established by Mr. Galton, and illustrated by Mr. H. Dickson. Abridged method of ascertaining the co-efficient
which expresses the correlation between three attributes, e.g. stature, length of cubit and height of knee. The
formula for the most probable attribute, e.g. stature corresponding to assigned values of two other attributes, e.g.
length of cubit and height of knee, may be ascertained either from three simple correlations, between stature and
cubit, stature and height of knee, cubit and height of knee; or by observations special to the case of three variables.
Correlation between any number of attributes.
References: Quetelet, Anthropométrie; Galton, Family Likeness in Stature, Proceedings of the Royal Society, 1886;
Co-relations and their measurements Ibid. 1888; Weldon, Correlated Variations, Ibid, 1892.
V. THE STATISTICAL PART OF INDUCTIVE LOGIC
Passing Insurance and other direct applications of statistics, we come to the investigation of causes. The inductive
method to which statistics lends itself, the Method of Agreement, is liable to the fallacy Post hoc propter hoc; of
which numerous examples occur. The Method of Concomitant variations is facilitated by the use of parallel curves.
The Method of Residues is exemplified when in comparing the death rates of different classes, we make allowance
for their different ages; and in similar cases.
References: Mill, Logic; Giffen, Essays on Finance, and Article in June No. of Economic Journal; Humphreys,
Value of death rates as a test of Sanitary conditions, Journal of the Statistical Society, 1874, Class Mortality
Statistics, Ibid, 1887.
VI. THE ELIMINATION OF CHANCE
One case of the Method of Residues, for which there exists a technical apparatus, is where the agency allowed for
consists of those “fleeting causes” called chance. The simple method of eliminating chance, described by Mill
(Logic, iii, xviii, 4) and the higher method derived from the theory of error. The latter method is particularly
applicable where the deviation from the average value of a ratio – e.g. that between male and female births –
follows the analogy of the simpler games of chance. In other cases the higher theory affords rather regulative ideas
than exact conclusions; in this respect, comparable to the use of the mathematical theory of economics.
References: Westergaard, Grundzüge der Theorie der Statistik, 1891; Duesing, Das geschlechtverhaltniss in
Preussen, 1890; Edgeworth, Methods of Statistics, Journal of the Statistical Society, Jubilee vol., 1885.
[The lectures were presented on six consecutive Wednesdays
at 5:00 P.M., beginning 11 May 1892, admission free.]
AIMS
At its best, statistical analysis sharpens thinking about data, reveals new
patterns, prompts creative thinking, and stimulates productive discussions in
multi-disciplinary research groups. For many scientists, these positive possibilities
of statistics are over-shadowed by negatives; abstruse assumptions, emphasis of
things one can’t do, and convoluted logic based on hypothesis rejection. One
colleague’s reaction to this Special Feature (on statistical analysis of ecosystem
studies) was that “statistics is the scientific equivalent of a trip to the dentist.”
This view is probably widespread. It leads to insufficient awareness of
the fact that statistics, like ecology, is a vital, evolving discipline with ever-changing capabilities.
At the end of the semester, could my students fully understand all of the
statistical methods used in a typical issue of Ecology? Probably not, but they did
have the foundation to consider the methods if authors clearly described their
approach. Statistics can still mislead students, but students are less apt to see all
statistics as lies and more apt to constructively criticise questionable methods.
They can dissect any approach by applying the conceptual terms used throughout
the semester. Students leave the course believing that statistics does, after all,
have relevance, and that it is more accessible than they believed at the beginning
of the semester.
July 18, 1998.
Plot 6 (quadrats)
(Rt. Bank, c 300 m S of mouth of Steepbank R., 40m inland)

Species                           #11  #12  #13  #14  #15  #16  #17  #18  #19  #20
Equisetum pratense                  4    -    1    2    -    7   10   13   18   17
Rubus pubescens                    11    4   13   18    4    7   17    -   13    2
R. strigosus                        1    8    1    2   19    8    3    5    2    8
Cornus stolonifera                  6    -    -    1    -    -    1    1    -    1
C. canadensis                       -    -    2    -   12    -    -    1    -    -
Rosa acicularis                     2    2    1    6   11    2    1    -    3    3
Galium boreale                      -    -   12    3   22    -    2    -    1    -
Ribes oxyacanthoides                -    1    -    4   15    -    -    8    -    3
R. triste                           2    9   13    2    -    4   10    6   16    9
Mitella nuda                        -    6    -    -    1    9    -   16   25   19
Mertensia nudicaulis                -   11    6   10    -    2   10    4    1   12
Aralia nudicaulis                   4    -    6    1    3    -    -    1    -    1
Viburnum edule                      2   15    5    6    -    7    4    5    3    4
Calamagrostis canescens             3    3    -    1    1    6   11    8    4    4
Populus balsamifera (seedling)      2    1    -    1    1    2    2    -    1    -
Prunus virginiana (seedling)        -    -    1    -    -    -    -    -    1    -
Populus tremuloides (seedling)      -    -    1    -    1    -    -    1    -    -
Actaea rubra                        -    -    1    -    1    -    -    -    -    1
Circaea alpina                      4    -    1   18    1    3    -    -    2   11
Thalictrum venulosum                3    -    -    -    -    1    1    -    -    -
Matteuccia struthiopteris           -    -    -    -    -    -    -    -    -    2
NO. OF SPECIES                     12   10   14   14   12   12   12   12   13   14

A typical page from a field notebook. This one records observations on the ground vegetation in
Populus balsamifera woodland in the flood plain of the Athabasca River, Alberta.
TYPES OF MULTIVARIATE DATA
Field                      Object (n)                 Variable (m)
Botany (plant ecology)     Quadrat, Relevé, Plot      Plant species
Archaeology                Sites                      Artefacts
Geology                    Samples                    Particle-size classes
Chemistry                  Stream sediments           Trace elements
Zoology                    Geographical localities    Morphometric characters
Pollen analysis            Sediment samples           Pollen types
Diatom analysis            Sediment samples           Diatom types
Palaeontology              Rock samples               Fossil taxa
...                        ...                        ...
Features in common –
MANY OBJECTS n
MANY VARIABLES m
CAN BE ARRANGED IN DATA MATRIX
of SAMPLES or OBJECTS x VARIABLES
DATA MATRIX
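Samples (n samples) form the columns; variables (m vars) form the rows.

Variables     Samples (columns)
(rows)        1      2      3      4      ...    n
1             x11    *      *      *      ...    x1n
2             *      *      *      *
3             *      *      *      *
4             *      *      *      *
...           ...
m             xm1                                xmn

Matrix X with n columns and m rows. By the usual rows-by-columns convention this is an m x n matrix, of order (m x n).

\[
X = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix}
\]

Subscripts: $x_{21}$ is the element in row two, column one; $x_{ik}$ is the element in row $i$, column $k$.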
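As a minimal sketch of the subscript convention, a small data matrix can be held as a list of rows in Python. The values and the helper name `element` are hypothetical, chosen only to illustrate indexing by row i and column k; they are not taken from the course material.

```python
# Hypothetical data matrix: m = 2 variables (rows) by n = 3 samples (columns).
# The numbers are invented purely to illustrate the indexing.
X = [
    [4.0, 0.0, 1.0],    # variable 1 across samples 1..3
    [11.0, 4.0, 13.0],  # variable 2 across samples 1..3
]

m = len(X)     # number of rows (variables)
n = len(X[0])  # number of columns (samples)


def element(matrix, i, k):
    """Return x_ik using 1-based row i and column k, as on the slide."""
    return matrix[i - 1][k - 1]


x21 = element(X, 2, 1)  # element in row two, column one
```

Python lists are 0-based, so the helper subtracts one from each subscript to match the 1-based notation used for matrices.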
FEATURES OF MULTIVARIATE DATA
Complex
Show:
Noise
Redundancy
Internal relationships
Outliers
Some information in the data is only
indirectly interpretable
BIOLOGICAL DATA
- many species
- +/–, quantitative, often %, many zero values, skewed
- non-linear responses to environment
- non-normal
ENVIRONMENTAL DATA
- fewer variables
- +/–, ranks, quantitative
- linear inter-relationships, often high correlations, some redundancy
STATISTICS AND DATA ANALYSIS
1. Hypothesis testing ‘confirmatory data analysis’ (CDA).
2. Model building: explanatory; empirical [statistical]
Pielou (1981) Quart. Rev. Biol.
“Models are often displayed with little or no effort to link them with the
real world. As a result the whole body of knowledge and theory has
grown top-heavy with models... Models are not useless but too much
should not be expected of them. Modelling is only a part, and a
subordinate part, of research.”
3. Hypothesis generation ‘exploratory data analysis’ (EDA). Detective work.
CDA & EDA - different aims, philosophies, methods
“We need both exploratory and confirmatory”.
J W Tukey 1980
[Figure: two flow diagrams.
EXPLORATORY DATA ANALYSIS, box labels: Real world ‘facts’; Observations, Measurements; Data; Data analysis; Patterns; ‘Information’; Hypotheses.
CONFIRMATORY DATA ANALYSIS, box labels: Hypotheses; Real world ‘facts’; Observations, Measurements; Data; Statistical testing; Hypothesis testing; Decisions; Theory.]
[Figure: diagram with box labels: Biological Data Y; Exploratory data analysis; Description; Underlying statistical model (e.g. linear or unimodal response); Additional (e.g. environmental data) X; Confirmatory data analysis; Testable ‘null hypothesis’; Rejected hypotheses.]
[Figure: flow diagram with labels: Observation; induction; Theory/Paradigm; deduction; Scientific H0, Scientific HA; deduction; Prediction; Conceptual design of study, choice of format (experimental, non-experimental) and classes of data; Statistical H0, Statistical HA; Sampling or experimental design; Data collection; Analysis; Evaluate statistical H0, HA; Evaluate prediction; Evaluate scientific H0, HA; Evaluate theory/paradigm.]
The Popperian hypothetico-deductive method, after Underwood and others.
H0 = null hypothesis
HA = alternative hypothesis
EXPLORATORY DATA ANALYSIS
- How can I optimally describe or explain variation in data set?
- Samples can be collected in many ways, including subjective sampling.
- ‘Data-fishing’ permissible, post-hoc analyses, explanations, hypotheses, narrative okay.
- P-values only a rough guide.
- Stepwise techniques (e.g. forward selection) useful and valid.
- Main purpose is to find ‘pattern’ or ‘structure’ in nature. Inherently subjective, personal activity. Interpretations not repeatable.
CONFIRMATORY DATA ANALYSIS
- Can I reject the null hypothesis that the species are unrelated to a particular environmental factor or set of factors?
- Samples must be representative of universe of interest – random, stratified random, systematic.
- Analysis must be planned a priori.
- P-values meaningful.
- Stepwise techniques not strictly valid.
- Main purpose is to test hypotheses about patterns. Inherently analytical and rigorous. Interpretations repeatable.
A WELL-DESIGNED MODERN ECOLOGICAL
STUDY COMBINES BOTH.
1) Two-phase study
- Initial phase is exploratory, perhaps
involving subjectively located plots or
previous data to generate hypotheses.
- Second phase is confirmatory,
collection of new data from defined
sampling scheme, planned data
analysis.
2) Split-sampling
- Large data set (>100 objects),
randomly split into two (75/25) –
exploratory set and confirmatory set.
- Generate hypotheses from
exploratory set (allow data fishing);
test hypotheses with confirmatory set.
- Rarely done in ecology.
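The split-sampling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_sample`, the plot labels, and the fixed seed are hypothetical, and the 75/25 proportion simply follows the figure quoted on the slide.

```python
import random


def split_sample(objects, exploratory_fraction=0.75, seed=None):
    """Randomly split objects into an exploratory and a confirmatory set."""
    rng = random.Random(seed)  # local generator so the split is reproducible
    shuffled = list(objects)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * exploratory_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 120 objects (more than the suggested 100).
plots = [f"plot_{i}" for i in range(1, 121)]
exploratory, confirmatory = split_sample(plots, seed=1)
```

Hypotheses would then be generated from `exploratory` only, and `confirmatory` would be left untouched until the planned tests are run.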
Data diving with cross-validation: an investigation of broadscale gradients in Swedish weed communities.
ERIK HALLGREN, MICHAEL W. PALMER and PER MILBERG.
Journal of Ecology, 1999, 87, 1037-1051.
[Figure: flow chart with box labels: Full data set; Remove observations with missing data; Clean data set; Random split; Exploratory data set; Hypotheses; Choice of variables; Ideas for more analysis; Some previously removed data; Confirmatory data set; Hypothesis tests; Combined data set; Analyses for display; RESULTS.]
Flow chart for the sequence of analyses. Solid lines represent the
flow of data and dashed lines the flow of analysis.
EUROPEAN FOOD
(From A Survey of Europe Today, The Reader’s Digest Association Ltd.) Percentage of all
households with various foods in house at time of questionnaire. Foods by countries.
GC ground coffee
IC instant coffee
TB tea or tea bags
SS sugarless sugar
BP packaged biscuits
SP soup (packages)
ST soup (tinned)
IP instant potatoes
FF frozen fish
VF frozen vegetables
AF fresh apples
OF fresh oranges
FT tinned fruit
JS jam (shop)
CG garlic clove
BR butter
ME margarine
OO olive, corn oil
YT yoghurt
CD crispbread
Country   GC  IC  TB  SS  BP  SP  ST  IP  FF  VF  AF  OF  FT  JS  CG  BR  ME  OO  YT  CD
D         90  49  88  19  57  51  19  21  27  21  81  75  44  71  22  91  85  74  30  26
I         82  10  60   2  55  41   3   2   4   2  67  71   9  46  80  66  24  94   5  18
F         88  42  63   4  76  53  11  23  11   5  87  84  40  45  88  94  47  36  57   3
NL        96  62  98  32  62  67  43   7  14  14  83  89  61  81  16  31  97  13  53  15
B         94  38  48  11  74  37  25   9  13  12  76  76  42  57  29  84  80  83  20   5
L         97  61  86  28  79  73  12   7  26  23  85  94  83  20  91  94  94  84  31  24
GB        27  86  99  22  91  55  76  17  20  24  76  68  89  91  11  95  94  57  11  28
P         72  26  77   2  22  34   1   5  20   3  22  51   8  16  89  65  78  92   6   9
A         55  31  61  15  29  33   1   5  15  11  49  42  14  41  51  51  72  28  13  11
CH        73  72  85  25  31  69  10  17  19  15  79  70  46  61  64  82  48  61  48  30
DK        96  17  92  35  66  32  32  11  51  42  81  72  50  64  11  92  91  30  11  34
N         96  17  83  13  62  51   4  17  30  15  61  72  34  51  11  63  94  28   2  62
IRL       13  52  99  11  80  75  18   2   5   3  57  52  46  89   5  97  25  31   3   9

Three countries have one blank cell in the source table. Their 19 surviving values are listed in food order; which food is blank cannot be recovered from this transcript.
S         97 13 93 31 43 43 39 54 45 56 78 53 75 9 68 32 48 2 93
SF        98 12 84 20 64 27 10 8 18 12 50 57 22 37 15 96 94 17 64
E         70 40 40 62 43 2 14 23 7 59 77 30 38 86 44 51 91 16 13
Classification
Dendrogram showing the results of minimum variance agglomerative cluster
analysis of the 16 European countries for the 20 food variables listed in the table.
Key:
Countries: A Austria, B Belgium, CH Switzerland, D West Germany, DK Denmark, E Spain, F France, GB
Great Britain, I Italy, IRL Ireland, L Luxembourg, N Norway, NL Holland, P Portugal, S
Sweden, SF Finland
Ordination
Correspondence analysis of percentages of households in 16
European countries having each of 20 types of food.
Minimum spanning tree fitted to the full 15-dimensional correspondence
analysis solution superimposed on a rotated plot of countries from
previous figure.
Percentages of people employed in nine different industry groups in Europe. (AGR = agriculture, MIN = mining, MAN = manufacturing, PS = power supplies, CON = construction, SER = service industries, FIN = finance, SPS = social and personal services, TC = transport and communications).

Country            AGR   MIN   MAN    PS   CON   SER   FIN   SPS    TC
Belgium            3.3   0.9  27.6   0.9   8.2  19.1   6.2  26.6   7.2
Denmark            9.2   0.1  21.8   0.6   8.3  14.6   6.5  32.2   7.1
France            10.8   0.8  27.5   0.9   8.9  16.8     6  22.6   5.7
W. Germany         6.7   1.3  35.8   0.9   7.3  14.4     5  22.3   6.1
Ireland           23.2     1  20.7   1.3   7.5  16.8   2.8  20.8   6.1
Italy             15.9   0.6  27.6   0.5    10  18.1   1.6  20.1   5.7
Luxembourg         7.7   3.1  30.8   0.8   9.2  18.5   4.6  19.2   6.2
Netherlands        6.3   0.1  22.5     1   9.9    18   6.8  28.5   6.8
UK                 2.7   1.4  30.2   1.4   6.9  16.9   5.7  28.3   6.4
Austria           12.7   1.1  30.2   1.4     9  16.8   4.9  16.8     7
Finland             13   0.4  25.9   1.3   7.4  14.7   5.5  24.3   7.6
Greece            41.4   0.6  17.6   0.6   8.1  11.5   2.4    11   6.7
Norway               9   0.5  22.4   0.8   8.6  16.9   4.7  27.6   9.4
Portugal          27.8   0.3  24.5   0.6   8.4  13.3   2.7  16.7   5.7
Spain             22.9   0.8  28.5   0.7  11.5   9.7   8.5  11.8   5.5
Sweden             6.1   0.4  25.9   0.8   7.2  14.4     6  32.4   6.8
Switzerland        7.7   0.2  37.8   0.8   9.5  17.5   5.3  15.4   5.7
Turkey            66.8   0.7   7.9   0.1   2.8   5.2   1.1  11.9   3.2
Bulgaria          23.6   1.9  32.3   0.6   7.9     8   0.7  18.2   6.7
Czechoslovakia    16.5   2.9  35.5   1.2   8.7   9.2   0.9  17.9     7
E. Germany         4.2   2.9  41.2   1.3   7.6  11.2   1.2  22.1   8.4
Hungary           21.7   3.1  29.6   1.9   8.2   9.4   0.9  17.2     8
Poland            31.1   2.5  25.7   0.9   8.4   7.5   0.9  16.1   6.9
Romania           34.7   2.1  30.1   0.6   8.7   5.9   1.3  11.7     5
USSR              23.7   1.4  25.8   0.6   9.2   6.1   0.5  23.6   9.3
Yugoslavia        48.7   1.5  16.8   1.1   4.9   6.4  11.3   5.3     4

Source: Euromonitor (1979, pp. 76-7) with the percentage employed in
finance in Spain reduced from 14.7 to the more reasonable figure of 8.5
[Two figures, each titled: Correspondence analysis]
WHY DO MULTIVARIATE DATA ANALYSIS?
1: Data simplification and data reduction - “signal from noise”
2: Detect features that might otherwise escape attention.
3: Hypothesis generation and prediction.
4: Data exploration as aid to further data collection.
5: Communication of results of complex data. Ease of display of complex data.
6: Aids communication and forces us to be explicit.
“The more orthodox amongst us should at least reflect that
many of the same imperfections are implicit in our own
cerebrations and welcome the exposure which numbers bring to
the muddle which words may obscure”.
D Walker (1972)
7: Tackle problems not otherwise soluble. Hopefully better science.
8: Fun!
“General impressions are never to be trusted.
Unfortunately when they are of long standing they
become fixed rules of life, and assume a prescriptive
right not to be questioned. Consequently those who are
not accustomed to original inquiry entertain a hatred
and a horror of statistics. They cannot endure the idea
of submitting their sacred impressions to cold-blooded
verification. But it is the triumph of scientific men to
rise superior to their superstitions, to desire tests by
which the value of their beliefs may be ascertained,
and to feel sufficiently masters of themselves to discard
contemptuously whatever may be found untrue.”
Francis Galton
Quoted from Quotes, Damned Quotes and...
compiled by J Bibby Edinburgh: John Bibby (Books)
TERMINOLOGY
Sample, object, individual “sampling unit”
                       Statistician     Others
Single unit            Sampling unit    Sample
Collection of units    Sample           Sample set
Variable, character, attribute
Algorithms, methods, models, programs
Classification, clustering, partitioning, scaling, gradient analysis
[assignment, identification, discrimination]
[dissection]
Objective, repeatable
TYPES OF VARIABLES
1) Numeric, quantitative, continuous variables
2) Nominal and ordinal variables (qualitative multistate)
Nominal “disordered multistate” (e.g. red, white, blue)
Ordinal “ordered multistate” (e.g. dry, moist, wet)
3) Binary or dichotomous variables +/– (e.g. male, female)
4) Conditionally present variables
e.g. 3 species - A, B, C
Only A & B have petals: A pink petals, B white petals
Petal state (nominal disordered):
                A    B    C
Pink petals     +    -    -
White petals    -    +    -
No petals       -    -    +
5) Mixed data – see Lecture 12
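The petal example (species A with pink petals, B with white petals, C with none) amounts to expanding one nominal multistate variable into a set of +/– columns, one per state. A minimal sketch, assuming a plain dictionary of species states and a hypothetical helper named `dummy_code`:

```python
def dummy_code(states):
    """Expand one nominal variable into presence/absence flags, one per state.

    `states` maps each object to its single state; the result maps each object
    to a dict of {state: True/False}, i.e. the +/- table on the slide.
    """
    levels = sorted(set(states.values()))
    return {obj: {level: (state == level) for level in levels}
            for obj, state in states.items()}


# The three species from the slide and their petal states.
petals = {"A": "pink petals", "B": "white petals", "C": "no petals"}
coded = dummy_code(petals)
```

Each species ends up with exactly one `True` flag, which is what distinguishes this coding from a set of independent binary variables.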
GEOMETRICAL MODELS
Pollen data - 2 pollen types x 15 samples
Depths are in centimetres, and the units for pollen frequencies may be
either in grains counted or percentages.

Sample   Depth   Type A   Type B
1          0       10       50
2         10       12       42
3         20       15       47
4         30       17       38
5         40       18       43
6         50       22       37
7         60       23       35
8         70       26       26
9         80       35       23
10        90       37       22
11       100       43       18
12       110       38       17
13       120       47       15
14       130       42       12
15       140       50       10
Adam (1970)
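In the geometric model each sample becomes a point whose coordinates are its pollen values, so the 15 samples form 15 points in a two-dimensional space with axes Type A and Type B. A minimal sketch using the values from the table; Euclidean distance is used here as one common choice of distance, which is an assumption and not something the slide specifies.

```python
import math

# Pollen values for samples 1..15, copied from the table (Adam 1970).
type_a = [10, 12, 15, 17, 18, 22, 23, 26, 35, 37, 43, 38, 47, 42, 50]
type_b = [50, 42, 47, 38, 43, 37, 35, 26, 23, 22, 18, 17, 15, 12, 10]

# Each sample is a point (Type A, Type B) in the two-dimensional space.
points = list(zip(type_a, type_b))


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d_adjacent = euclidean(points[0], points[1])   # samples 1 and 2
d_extremes = euclidean(points[0], points[14])  # samples 1 and 15
```

Samples 1 and 2 lie close together in this space, whereas samples 1 and 15 lie far apart, which matches the reversal in dominance between the two pollen types down the core.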
ALTERNATE REPRESENTATIONS OF THE POLLEN DATA
(a) Palynological representation
(b) Geometrical representation
In (a) the data are plotted as a standard diagram, and in (b) they
are plotted using the geometric model. Units along the axes may be
either pollen counts or percentages.
Adam (1970)
Geometrical model of a vegetation space
containing 52 records (stands).
A: A cluster within the cloud of points
(stands) occupying vegetation space.
B: 3-dimensional abstract vegetation
space: each dimension represents an
element (e.g. proportion of a certain
species) in the analysis (X Y Z axes).
A, the results of a classification approach
(here attempted after ordination) in which
similar individuals are grouped and
considered as a single cell or unit.
B, the results of an ordination approach in
which similar stands nevertheless retain
their unique properties and thus no
information is lost (X1 Y1 Z1 axes).
N. B. Abstract space has no connection with
real space from which the records were
initially collected.
Concept of Similarity, Dissimilarity, Distance and Proximity
$s_{ij}$ – how similar object $i$ is to object $j$
Proximity measure → DC or SC
Dissimilarity = Distance
_________________________________
Convert $s_{ij} \rightarrow d_{ij}$
$s_{ij} = C - d_{ij}$, where $C$ is a constant
$d_{ij} = \sqrt{1 - s_{ij}}$
$d_{ij} = (1 - s_{ij})$
$s_{ij} = 1 / (1 + d_{ij})$
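Common conversions between a similarity s and a dissimilarity d, such as d = C - s, d = sqrt(1 - s) and s = 1/(1 + d), can be sketched as small Python functions. The function names are hypothetical, and the first two conversions assume a similarity scaled so that it does not exceed the constant (1 by default).

```python
import math


def d_linear(s, c=1.0):
    """Dissimilarity as the complement of similarity: d = C - s."""
    return c - s


def d_sqrt(s):
    """Dissimilarity as the square root of the complement: d = sqrt(1 - s)."""
    return math.sqrt(1.0 - s)


def s_from_d(d):
    """Similarity from an unbounded distance: s = 1 / (1 + d)."""
    return 1.0 / (1.0 + d)


# A similarity of 0.75 under the two similarity-to-distance conversions.
print(d_linear(0.75), d_sqrt(0.75))
```

The last conversion maps a distance of zero to a similarity of one and shrinks towards zero as the distance grows, so it is suitable for distances with no upper bound.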
COMPUTING
In the 10 practicals, mainly use R, a freely available, open-source statistical-computing environment, rather than specific commercial
packages such as MINITAB or SYSTAT.
Relatively steep learning curve but worth it.
Recommend Fox (2002) An R and S-PLUS companion to applied
regression (Sage), Crawley (2005) Statistics – An introduction
using R (Wiley), Crawley (2007) The R Book (Wiley), Everitt
(2005) An R and S-PLUS companion to multivariate analysis
(Springer), and Verzani (2005) Using R for introductory
statistics (Chapman Hall/CRC) as excellent guides.
Will also use specialised software for specific methods (e.g.
TWINSPAN, CANOCO and CANODRAW, C2, ZONE, etc.)
Computing practicals are an integral and essential part of the
course.
COURSE TOPICS
Topic                                                  Lecture          Practical
Introduction                                           Lecture 1        -
Exploratory Data Analysis                              Lecture 2        Practical 1
Cluster Analysis                                       Lecture 3        Practical 2
Regression Analysis                                    Lectures 4 & 5   Practicals 3 & 4
Ordination (Indirect Gradient Analysis)                Lecture 6        Practical 5
Constrained Ordination (Direct Gradient Analysis)      Lecture 7        Practical 6
Calibration and Environmental Reconstructions          Lecture 8        Practical 7
Classification                                         Lecture 9        Practical 8
Analysis of Stratigraphical and Spatial Data           Lecture 10       Practical 9
Hypothesis Testing                                     Lecture 11       Practical 10
Overview and Future Developments                       Lecture 12       -
COURSE POWERPOINTS
In some of the lectures, some of the slides are
rather technical.
They are included for the sake of completeness on
the topic under discussion.
They are for reference only and are marked REF.