Spread of Giant Hogweed (Heracleum mantegazzianum) in the UK

Statistical analysis of species-level data
on distribution and traits
Adam Butler
Biomathematics & Statistics Scotland (BioSS)
Tartu University, October 2007
The ALARM project
• Assessing Large-scale risks to biodiversity with tested methods
• Project of the 6th framework programme of the European Union
Key objectives
• Develop an integrated risk assessment for biodiversity in
terrestrial and freshwater ecosystems at the European scale
• Focus on four key pressures – climate change, invasive
species, chemical pollution, pollinator loss – and their
interactions
• Contribute to the dissemination of scientific knowledge and to
the development of evidence-based policy
BioSS
“BioSS undertakes research, consultancy and training in
mathematics and statistics as applied to agriculture, the
environment, food and health…”
• It is a publicly funded organisation based in Scotland, employing
approximately 40 people
• BioSS is a partner in ALARM, with three staff currently working
on the project: Glenn Marion, Stijn Bierman, Adam Butler
• Our role involves a mix of research/consultancy and training
Research-consultancy
BioSS have two key themes of work within ALARM…
Statistical analysis of species-level data on distribution & traits
Quantifying uncertainty in complex mechanistic models
…this talk focuses mainly on the first of these
Species level data
Species atlas data
• Presence/absence of species for each cell on a regular grid
• Florkart: Germany, vascular plants
• National Biodiversity Network: UK
Traits data
• Physiological & genetic traits
• Biolflor: Germany, vascular plants
Invasive success
• Date of arrival, establishment or naturalisation
• Local or national
Environmental data
• Land use, climate
• Future projections under different scenarios of socio-economic change
• These data are observational, rather than experimental…
Advantages
• Available at large spatial scales for a wide range of species
• A basis for predicting impacts of long-term environmental change
Limitations
• Can be used to infer correlative relationships, but not causal ones
• Analysis must be model-based rather than design-based
Our work in this area
• Spatial distribution of individual species
• Spread of invasive species across time and space
• Spatial distribution of trait compositions
• Trait-based prediction of invasive success
Spatial distribution
of individual species
Galium pumilum in Germany
• Atlas data: presence/absence
• Data derive from individual
records, but much heterogeneity
in recording effort
• Data aggregated over time &
space, with the aim of reducing or
removing this heterogeneity
Contact: Stijn Bierman
Reference:
Bierman, S.M., Wilson, I.J., Elston, D.A., Marion, G., Butler, A. & Kühn, I. (in preparation) Bayesian image
restoration techniques to analyze species atlas data with spatially varying non-detection probabilities.
Logistic regression
Interested in exploring relationships
between environment & distribution
[Figure: map of mean annual temperature (°C), 1960-1990; legend classes <10, 10-12, 12-14, 14-16, 16-18, >18]
Data on presence/absence are binary:
yi = 1 if present, yi = 0 if absent
so we need to use logistic regression, rather than standard linear regression
Logistic regression:
yi ~ Binomial(pi, 1)
log(pi / (1 – pi)) = a + bxi, where xi denotes the explanatory variable(s)
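As a minimal sketch of the logistic regression above, the following fits a and b by Newton-Raphson on synthetic presence/absence data. The temperature values, the "true" coefficients and the function name fit_logistic are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit log(p / (1 - p)) = a + b * x by Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])    # intercept column + covariate
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # current fitted probabilities
        w = p * (1.0 - p)                        # Bernoulli variances
        score = X.T @ (y - p)                    # gradient of the log-likelihood
        info = X.T @ (X * w[:, None])            # information matrix
        beta = beta + np.linalg.solve(info, score)
    return beta

# Synthetic data (assumed values, for illustration only)
rng = np.random.default_rng(0)
temperature = rng.uniform(8.0, 20.0, size=2000)
true_a, true_b = -6.0, 0.4
p_true = 1.0 / (1.0 + np.exp(-(true_a + true_b * temperature)))
presence = rng.binomial(1, p_true)               # yi = 1 if present, 0 if absent

a_hat, b_hat = fit_logistic(temperature, presence)
print(a_hat, b_hat)
```

With a few thousand cells the estimates should land close to the values used to simulate the data.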
The “climate envelope” approach
• Use logistic regression to infer the current relationship between
climate and presence/absence, and assume that this
relationship will continue to hold in future
• Use climate predictions and the regression model to estimate
the probability of presence in future years
• Predict presence if Probability(presence) > threshold
• Ignores dynamics of spatial spread, ignores adaptation,
and assumes that climate is the primary determinant of range
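The prediction step of the climate envelope approach can be sketched as follows. The coefficients, the projected temperatures and the threshold of 0.5 are made-up values chosen for illustration.

```python
import numpy as np

def predict_presence(a, b, future_temperature, threshold=0.5):
    """Apply a fitted logistic relationship to projected climate values."""
    eta = a + b * np.asarray(future_temperature, dtype=float)
    prob = 1.0 / (1.0 + np.exp(-eta))            # Probability(presence)
    return prob, prob > threshold                # predicted presence where prob exceeds threshold

# Hypothetical fitted coefficients and projected temperatures
prob, predicted = predict_presence(a=-6.0, b=0.4,
                                   future_temperature=[10.0, 15.0, 20.0])
print(prob, predicted)
```

The choice of threshold is itself a modelling decision, and the predictions inherit every limitation listed above.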
Residual spatial autocorrelation
• Standard logistic regression ignores the effects of residual
spatial autocorrelation – nearby cells will tend to have a similar
response (either presence or absence), and this similarity persists
even after accounting for known environmental variables
• Possible sources:
Distribution depends on a variable about which we have no data
Species not in equilibrium, so range is expanding or contracting
• Spatial autocorrelation leads us to underestimate uncertainty,
and can lead to bias in estimates of environmental effects
The autologistic model
• There are many methods to deal with spatial autocorrelation
– one of the simplest is to use an autologistic model:
yi ~ Binomial(pi,1)
log(pi / (1 – pi)) = a + bxi + c(yA + yB + yL + yR)/4
where A, B, L and R are the cells immediately
above, below, to the left and to the right of cell i:
    A
  L i R
    B
• Different neighborhoods & weights can be used
• The parameter c measures the strength of autocorrelation
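A minimal sketch of the autologistic linear predictor on a regular grid is given below. Treating cells outside the grid as absences (zero padding) is an assumption made here for simplicity, and the function names and parameter values are illustrative.

```python
import numpy as np

def neighbour_mean(y):
    """Mean of the four neighbours (above, below, left, right) of each cell.

    Cells outside the grid are treated as absences, which is an assumption.
    """
    padded = np.pad(y.astype(float), 1, mode="constant", constant_values=0.0)
    above = padded[:-2, 1:-1]
    below = padded[2:, 1:-1]
    left = padded[1:-1, :-2]
    right = padded[1:-1, 2:]
    return (above + below + left + right) / 4.0

def autologistic_probability(y, x, a, b, c):
    """p_i from log(p_i / (1 - p_i)) = a + b * x_i + c * (neighbour mean)."""
    eta = a + b * x + c * neighbour_mean(y)
    return 1.0 / (1.0 + np.exp(-eta))

# Small made-up presence/absence grid with a constant covariate
y = np.array([[0, 1, 0],
              [1, 1, 0],
              [0, 0, 0]])
x = np.zeros((3, 3))
p = autologistic_probability(y, x, a=0.0, b=1.0, c=2.0)
print(p)
```

This only evaluates the conditional probabilities for given parameters; estimating a, b and c requires a fitting method on top of it.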
Non-detection
If the species is recorded present (yi = 1), we can be pretty confident
it is actually present
If the species is recorded absent (yi = 0) then it could either be
genuinely absent (true absence) or just undetected (false absence)
If we have an additional source of reliable data on detection, then we
can modify the analysis to correct for the effects of false absences
Proxy data for detection effort
[Figure: (a) number of grid cells by control group (1-9); (b) map of detection effort, from lowest to highest]
Edge effects?
Dealing with non-detection
Introduce a new variable…
zi = 1 if actually present in cell i, 0 otherwise
…and let
pi = probability of non-detection = Prob(yi = 0 given zi = 1)
Assume a relationship between pi and our proxy variable,
and estimate the value of pi
We can thereby estimate Prob(zi = 1 given yi = 0)
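The last step follows from Bayes' rule. The sketch below assumes that recorded presences are always genuine (no false presences) and takes the prior probability of presence as given; in the full model that prior would come from the regression. The function and argument names are hypothetical.

```python
def prob_present_given_recorded_absent(prior_presence, non_detection):
    """Prob(z = 1 given y = 0) by Bayes' rule.

    prior_presence: Prob(z = 1) before seeing the record
    non_detection:  Prob(y = 0 given z = 1)
    Assumes Prob(y = 0 given z = 0) = 1, i.e. no false presences.
    """
    numerator = non_detection * prior_presence
    denominator = numerator + (1.0 - prior_presence)
    return numerator / denominator

# Example with made-up numbers: prior 0.3, non-detection probability 0.5
print(prob_present_given_recorded_absent(0.3, 0.5))
```

A recorded absence lowers the probability of presence below the prior, and lowers it more strongly when non-detection is unlikely.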
Estimated current distribution of Galium pumilum
[Figure: maps (a) and (b) of Galium pumilum; legend: probability of presence, from <0.05 to 1]
Combining things
• The model for non-detection can be combined with the
autologistic model, to create an approach that accounts for
spatial dependence, false absences and non-normality
• This results in a complicated model, so we cannot estimate
the parameters using standard approaches such as least
squares or maximum likelihood – instead, we use Bayesian
methods…
Bayesian inference
• Statistical analyses involve modelling and inference
• The two major approaches to statistical inference are classical
(frequentist) and Bayesian
• There are deep philosophical differences between them, but
they often yield similar answers in practice
• It is usually best to concentrate on finding an appropriate model
first, before worrying about which method of inference to use
• Before the advent of modern computing and Markov chain
Monte Carlo, Bayesian methods were difficult to use
• Today, it is possible to fit complicated and realistic models
using Bayesian methods, many of which would be difficult or
impossible to fit using frequentist methods
• The development of a powerful piece of software (WinBUGS)
has opened up the methods to non-statisticians
• This largely explains why they have become very popular in
ecology and other disciplines
Spread of invasive species
• For invasive species we may also have data on arrival,
establishment or naturalization, at either national or local levels
• We can, with care, use such data to draw inferences about the
spatio-temporal spread of a species across a landscape
• Allows us to assess the risks associated with future expansion
• Key issue: environmental heterogeneity (land use & climate)
Contact: Glenn Marion
Reference: Cook, A., Marion, G., Butler, A. and Gibson, G. (2007) Bayesian inference for the spatiotemporal invasion of alien species. Bulletin of Mathematical Biology, 69(6), 2005-2025.
Spread of Giant Hogweed (Heracleum mantegazzianum) in the UK
Data: National Biodiversity Network
[Figure: sequence of maps showing the recorded distribution by 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990 and 2000]
Statistical framework
1. Dispersal rate from cell i to cell j depends only on distance d
2. Arrival rate = Sum of dispersal rates from all currently occupied cells
3. Colonization rate = Arrival rate * Suitability
Decolonization is assumed not to occur
Model for dispersal
Power law function
Dispersal rate = 2d-2
 = 1/2
 = decay parameter
 = 1/10
Truncated at 150km
+ background rate
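The three-step framework and the truncated dispersal model can be sketched as below. The kernel d**(-2 * decay) is an assumed illustrative power law, not necessarily the exact form used in the talk, and the coordinates, suitabilities and background rate are made-up values.

```python
import numpy as np

def dispersal_rate(d, decay, max_distance=150.0):
    """Assumed power-law kernel d**(-2 * decay), truncated at max_distance (km)."""
    d = np.asarray(d, dtype=float)
    rate = np.zeros_like(d)
    inside = (d > 0) & (d <= max_distance)       # no self-dispersal, no long jumps
    rate[inside] = d[inside] ** (-2.0 * decay)
    return rate

def colonization_rates(coords, occupied, suitability, decay, background=0.0):
    """Colonization rate = arrival rate * suitability, for every cell."""
    coords = np.asarray(coords, dtype=float)
    occupied = np.asarray(occupied, dtype=bool)
    suitability = np.asarray(suitability, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # dist[i, j]: distance between cells i and j
    kernel = dispersal_rate(dist, decay)
    arrival = kernel[occupied].sum(axis=0) + background   # sum over occupied source cells
    rate = arrival * suitability
    rate[occupied] = 0.0                         # occupied cells cannot be colonized again
    return rate

# Three hypothetical cells on a line (coordinates in km); only the first is occupied
coords = [[0.0, 0.0], [10.0, 0.0], [200.0, 0.0]]
rate = colonization_rates(coords, occupied=[True, False, False],
                          suitability=[1.0, 0.5, 1.0], decay=0.5, background=0.001)
print(rate)
```

The distant cell lies beyond the truncation distance, so it can only be reached through the background rate.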
Suitability: homogeneous landscape
For a particular grid cell, i…
Suitabilityi = ρ * proportion of cell i that is land
ρ = unknown parameter
Interpretation of ρ is quite ambiguous
Suitability: heterogeneous landscape
Suitabilityi = (Σk ρk * Landik) * exp(β * Temperaturei + γ * Altitudei)
Landik = proportion of cell i with land use k
ρ1,…,ρ10 and the remaining coefficients: unknown parameters
Land use, climate & altitude currently treated
as constant over time
Land use classes (k):
1 sea, 2 coastal, 3 arable, 4 broadleaf, 5 built, 6 conifer, 7 improved grassland, 8 open water, 9 semi-natural, 10 upland
Inference
• The parameters and colonisation history are unknown
• We adopt a Bayesian approach, which involves treating both
sets of quantities as random
• Plausible values are simulated using a computer-intensive
algorithm known as Markov chain Monte Carlo
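To illustrate the idea of Markov chain Monte Carlo, here is a generic random-walk Metropolis sampler for a far simpler problem: one unknown probability with binomial data and a uniform prior. It is not the sampler used for the invasion model, and the data (7 successes in 20 trials) are made up.

```python
import numpy as np

def log_posterior(p, successes, trials):
    """Log posterior (up to a constant) for a binomial probability, uniform prior."""
    if p <= 0.0 or p >= 1.0:
        return -np.inf
    return successes * np.log(p) + (trials - successes) * np.log(1.0 - p)

def metropolis(successes, trials, n_samples=20000, step=0.1, seed=1):
    """Random-walk Metropolis: propose a nearby value, accept or reject."""
    rng = np.random.default_rng(seed)
    p = 0.5
    current = log_posterior(p, successes, trials)
    samples = np.empty(n_samples)
    for t in range(n_samples):
        proposal = p + step * rng.normal()
        candidate = log_posterior(proposal, successes, trials)
        if np.log(rng.uniform()) < candidate - current:
            p, current = proposal, candidate     # accept the proposal
        samples[t] = p                           # otherwise keep the current value
    return samples

samples = metropolis(successes=7, trials=20)
posterior_mean = samples[2000:].mean()           # discard the first 2000 draws as burn-in
print(posterior_mean)
```

For this toy problem the posterior is a Beta(8, 14) distribution with mean 8/22, which gives a direct check on the simulated values.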
[Figure: posterior mean colonization suitability, and posterior mean colonization probability (10 year prediction)]
[Figure: cumulative rate of colonization, for the homogeneous landscape and with landscape heterogeneity]
Spatial distribution
of functional traits
• Rather than the distribution of a particular species, we might be
interested in spatial patterns in the properties of species
groups or whole ecosystems
• In particular, we might be interested in the proportion of species
having a particular qualitative trait – e.g. main pollen vector
(insect, selfing, wind)
Contact: Stijn Bierman
Reference: Kühn, I., Bierman, S.M., Durka, W. & Klotz, S. (2006) Relating geographical variation in
pollination types to environmental and spatial factors using novel statistical methods. New Phytologist,
172(1), 127-139.
Main pollen vectors in German flora
[Figure: maps of Germany showing (a) % insect, (b) % wind and (c) % selfing pollination, with (d) altitude and (e) wind speed; proportions of pollination types and wind speed shaded from low to high; altitude classes from 0 - <75 to >= 2100]
Compositional data
• Data on trait proportions are compositional – they must sum to one
• Standard methods (linear regression, principal components etc.) are invalid
• Instead of analysing the raw data, we can analyse the log-ratios, e.g.
log(selfing / wind) = log(selfing) – log(wind)
log(insect / wind) = log(insect) – log(wind)
[Figure: ternary plots with axes insect, selfing and wind]
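The log-ratio idea can be sketched as an additive log-ratio transform with wind as the reference category, together with its inverse. The two compositions are made-up values, and the function names are illustrative.

```python
import numpy as np

def additive_log_ratio(proportions):
    """Log-ratios relative to the last column (here: wind)."""
    proportions = np.asarray(proportions, dtype=float)
    return np.log(proportions[:, :-1]) - np.log(proportions[:, [-1]])

def inverse_additive_log_ratio(log_ratios):
    """Map log-ratios back to proportions that sum to one."""
    log_ratios = np.asarray(log_ratios, dtype=float)
    expanded = np.column_stack([np.exp(log_ratios), np.ones(len(log_ratios))])
    return expanded / expanded.sum(axis=1, keepdims=True)

# Columns: insect, selfing, wind (hypothetical proportions for two grid cells)
composition = np.array([[0.6, 0.3, 0.1],
                        [0.5, 0.2, 0.3]])
log_ratios = additive_log_ratio(composition)      # unconstrained values, one row per cell
recovered = inverse_additive_log_ratio(log_ratios)
print(log_ratios)
print(recovered)
```

The log-ratios are unconstrained, so standard regression methods can be applied to them; proportions of exactly zero would need separate handling because their logarithm is undefined.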
Spatial compositions
• The multivariate conditional autoregressive (MCAR) model
describes spatial dependence in multivariate data
• By applying it to the log-ratios, we can get a model for
compositional data on a spatial grid
• This allows us to estimate the relationship between trait
proportions and environmental variables, whilst accounting
for residual spatial dependence
Trait-based prediction of
invasive success
• Part of ALARM involves assessing the risks associated with
invasive species
• Atlas data provide us with a measure of invasive success
• Data on traits can be used as explanatory variables
• We can use regression modelling again – but now modelling
across species, rather than across spatial locations
Contact: Adam Butler
Reference: Work in progress: Butler, A., Küster, I., Kühn, I., Bierman, S.M. and Marion, G.
Response variables:
• Number of cells occupied (count data), or…
• Whether or not species is naturalised (binary data)
…which are both crude measures of invasive success
Explanatory variables:
• A whole range of biological traits
e.g. size, ploidy, length of flowering season, type of
reproduction
• Some are continuous, others binary or ordinal
Statistical modelling
yj = number of cells occupied by species j
xj = trait data for species j
There are (at least) two possible models:
Log-normal: log yj ~ Normal(a + bxj, σ²)
Binomial: yj ~ Binomial(pj, M), where log(pj / (1 - pj)) = a + bxj
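As a rough sketch of the log-normal model, a, b and the residual variance can be estimated by least squares on log yj. The single continuous trait, the simulated cell numbers and the parameter values below are assumptions for illustration, and the sketch ignores phylogenetic dependence.

```python
import numpy as np

# Synthetic data: one continuous trait per species (made-up values)
rng = np.random.default_rng(42)
n_species = 500
trait = rng.normal(size=n_species)
log_cells = 3.0 + 0.8 * trait + rng.normal(scale=0.5, size=n_species)
cells = np.exp(log_cells)                         # stand-in for number of cells occupied

# Least-squares fit of log(y_j) = a + b * x_j + error
X = np.column_stack([np.ones(n_species), trait])
coef, *_ = np.linalg.lstsq(X, np.log(cells), rcond=None)
a_hat, b_hat = coef

# Unbiased estimate of the residual variance
residuals = np.log(cells) - X @ coef
sigma2_hat = residuals @ residuals / (n_species - 2)
print(a_hat, b_hat, sigma2_hat)
```

With several traits, X simply gains one column per trait.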
Coping with phylogeny
• Phylogenetic dependence is a key issue - species that have a
similar evolutionary history will tend to have similar spatial
distributions, even after accounting for trait effects
• Sources and impacts are similar to those associated with
spatial dependence – and the statistical methods to deal with
the dependence are also similar…
• One important difference – we often (as in Biolflor) know the
phylogenetic tree but not the branch lengths, & measure
distance to be the number of branches (“patristic distance”)
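Counting the branches between two species in a rooted tree can be sketched as follows. The tree is stored as a child-to-parent mapping, and the species, genus and family names are hypothetical.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def branch_count_distance(a, b, parent):
    """Number of branches on the path between nodes a and b."""
    path_a = path_to_root(a, parent)
    steps_from_a = {node: steps for steps, node in enumerate(path_a)}
    for steps_from_b, node in enumerate(path_to_root(b, parent)):
        if node in steps_from_a:                  # first shared ancestor
            return steps_from_a[node] + steps_from_b
    raise ValueError("nodes are not in the same tree")

# Hypothetical tree: two genera within one family
parent = {
    "species_1": "genus_A",
    "species_2": "genus_A",
    "species_3": "genus_B",
    "genus_A": "family",
    "genus_B": "family",
}
print(branch_count_distance("species_1", "species_2", parent))
print(branch_count_distance("species_1", "species_3", parent))
```

Species in the same genus are two branches apart here, while species in different genera of the same family are four branches apart.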
Prediction
• We want to predict how many grid cells an existing or newly
introduced plant species will occupy in Germany after, say,
50 years, based on the traits of that species
• We can answer this using national data on residence time
e.g. time since first introduction or naturalisation
• We include residence time as another explanatory variable
within the regression model
Examples of predictions
Using BiolFlor/Florkart data for German neophyte plant species
Missing data
• Data on residence times are typically sparse
e.g. only available for 36% of neophyte plant species in Biolflor
• We need to impute these values,
and to account for the uncertainty
introduced by imputation
• Easiest to do this in a Bayesian
framework, using WinBUGS
Key statistical themes
• Environmental heterogeneity & environmental change
• Spatial & phylogenetic dependence
• Use of data on traits and invasive history
• False absences in atlas data, missing data on traits
• Model-based approach to statistics, use of Bayesian inference
Role of the statistician
• Develop strong collaborative relationships with scientists
• Contribute to formulation of appropriate models
• Ensure that assumptions are clear and explicit, and that
sensitivity of results to these assumptions is explored
• Advise & assist with software and computational issues
• Undertake methodological research to develop new statistical
techniques, when appropriate
Our other work for ALARM
• Statistical analysis of output from complex mechanistic models –
climate uncertainty in the LPJ Dynamic Global Vegetation Model
• Qualitative risk assessment – threats to European bee populations
• Use of statistical downscaling to generate projections of
European land use at a fine spatial scale
• Developing a training course on statistical methods for
environmental risk assessment
Simulations of global vegetation carbon for
the 20th & 21st centuries using LPJ-DGVM
Individual simulations from LPJ using
observed climate (black) & using
9 different climate models
Probabilistic projection of future
change, based on
statistical model averaging
Median
2.5% quantile