Extremes_MLURI - Biomathematics and Statistics Scotland

Extreme values
Seminar at MLURI, January 2008
Adam Butler
Biomathematics & Statistics Scotland
1. Motivation
What is EVT?
Applications
Current research
Flooding, Budapest, 2002
Graham Berry
http://en.wikipedia.org/wiki/Image:Floods_in_Budapest_2002.jpg
What is the probability that the flood defenses of Budapest
will be overtopped during 2008?
Northern Rock branch, London, 2007
Alex Gunningham
http://en.wikipedia.org/wiki/Image:1378965141_7817eb7212_o.jpg
What is the probability of today’s value of the Dow Jones
index being at least 9.5% lower than yesterday’s?
Log daily return = log(value today / value yesterday)
Value drops by 9.5% ⇒ log daily return = log(0.905) ≈ –0.10
Q. On this particular day, what is the chance of getting a
log daily return of less than –0.10?
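The conversion from a percentage drop to a log daily return can be checked in a few lines. This is a minimal sketch: the function name `log_daily_return` and the index values 100 and 90.5 are illustrative, not taken from the Dow Jones data.

```python
import math

def log_daily_return(value_today, value_yesterday):
    """Log daily return = log(value today / value yesterday)."""
    return math.log(value_today / value_yesterday)

# A 9.5% drop: today's value is 90.5% of yesterday's (illustrative values).
ldr = log_daily_return(90.5, 100.0)
print(round(ldr, 2))  # -0.1
```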
Dow Jones Data for the period 1996-2000
To answer this question we clearly need to extrapolate,
since –0.1 is well outside the range of the data…
Extrapolation should be avoided whenever possible,
but in many real-life problems it is unavoidable
So how should we go about estimating this probability?
We could assume that the data are normally distributed…
P(X < –0.1) ≈ 10^-20
…but the extreme values that have been observed don’t
play much of a role when we estimate the parameters
(e.g. the mean and variance)
Hence, our chosen model (e.g. the normal distribution)
might do badly in describing their properties…
Empirical:
P(X < –0.05) ≈ 0.002
Normal:
P(X < –0.05) ≈ 0.000001
…and, worse still, extrapolations beyond the range of the
data often differ radically between models that provide
a very similar fit to the bulk of the data
For example, we might decide to fit a Cauchy rather than a
normal distribution…
Cauchy:
P(X < –0.1) ≈ 0.02
Normal:
P(X < –0.1) ≈ 10^-20
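How strongly the extrapolated tail probability depends on the chosen model can be illustrated with a short calculation. This is only a sketch: the parameter values below are hypothetical and were not fitted to the Dow Jones data, so the output shows the order-of-magnitude gap rather than reproducing the figures above.

```python
import math

def normal_lower_tail(x, mean, sd):
    """P(X < x) for a normal distribution; erfc keeps precision far in the tail."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def cauchy_lower_tail(x, location, scale):
    """P(X < x) for a Cauchy distribution."""
    return 0.5 + math.atan((x - location) / scale) / math.pi

# Hypothetical parameter values, chosen only for illustration.
p_normal = normal_lower_tail(-0.1, mean=0.0, sd=0.01)
p_cauchy = cauchy_lower_tail(-0.1, location=0.0, scale=0.006)
print(f"normal: {p_normal:.1e}, Cauchy: {p_cauchy:.3f}")
```

Both distributions are symmetric and bell-shaped near zero, yet their answers for the same extreme event differ by many orders of magnitude.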
We need an alternative statistical approach that is more
robust, in the sense that it does not require us to
make strong and untestable assumptions about
the process that is generating our data
This is the motivation for EVT – Extreme Value Theory
2. What is EVT?
General characteristics of an “EVT” problem
• We are interested in a process that can be quantified,
and for which we have some data
• …and we want to use this data to say something about
the probability that a rare or extreme event will occur
• We will usually be interested in events that are beyond
the range of the data, i.e. we want to extrapolate
To deal with such problems, we begin from the principle
that our inferences should only be based on the most
extreme data that we have actually observed
⇒ i.e. we should throw away almost all of the data
Extreme value theory (EVT) then provides us with
some simple and robust models that can then be used to
describe the properties of these extreme data
Q. What is the
probability of
getting more than
100mm of rain on
any given day?
We might decide to only use data for days
with 25mm or more of rainfall…
Histogram of data above a threshold of 25mm
Threshold exceedance = Value - Threshold
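The definition of a threshold exceedance translates directly into code. A minimal sketch, assuming the data are held in a plain list; the rainfall values and the helper name `threshold_exceedances` are hypothetical.

```python
def threshold_exceedances(values, threshold):
    """Return (value - threshold) for every value strictly above the threshold."""
    return [v - threshold for v in values if v > threshold]

# Hypothetical daily rainfall totals in mm.
rain_mm = [0.0, 3.2, 27.5, 12.0, 41.0, 25.0, 60.3]
exceedances = threshold_exceedances(rain_mm, threshold=25.0)
print(exceedances)
```

Only three of the seven days contribute to the analysis; the rest are discarded.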
The GPD model
• A good statistical model for threshold exceedances is
the GPD (Generalised Pareto Distribution)
• The distribution function is of the form
F(x) = 1 - (1 + \xi x / \sigma)^{-1/\xi}
• There are two parameters, a scale parameter σ and a
shape parameter ξ, which need to be estimated
GPD model fitted to threshold exceedances
Threshold = u = 25mm
σ and ξ estimated by maximum
likelihood to be 7.70 and 0.108
P(X > 100) estimated to be
0.0000209 (once per 131 years)
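The arithmetic behind these figures can be sketched as follows. The fitted values (threshold 25mm, scale 7.70, shape 0.108) and the estimate 0.0000209 are from the slide; the decomposition P(X > 100) = P(X > 25) × P(X > 100 | X > 25) and the use of 365.25 days per year are assumptions made here, and the threshold exceedance probability is only implied by them, not stated in the talk.

```python
import math

def gpd_survival(x, sigma, xi):
    """P(exceedance > x) under a GPD with scale sigma and shape xi, for x >= 0."""
    if xi == 0.0:
        return math.exp(-x / sigma)      # exponential limit
    base = 1.0 + xi * x / sigma
    if base <= 0.0:
        return 0.0                       # beyond the upper end point (xi < 0)
    return base ** (-1.0 / xi)

u, sigma, xi = 25.0, 7.70, 0.108         # values quoted on the slide
p_cond = gpd_survival(100.0 - u, sigma, xi)   # P(X > 100 | X > 25)

p_100 = 0.0000209                        # daily probability quoted on the slide
p_u_implied = p_100 / p_cond             # implied P(X > 25), under the assumption above
return_period_years = 1.0 / (p_100 * 365.25)

print(f"P(X > 100 | X > 25) = {p_cond:.5f}")
print(f"implied P(X > 25)   = {p_u_implied:.4f}")
print(f"return period       = {return_period_years:.0f} years")
```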
• But why is the GPD a good model to use? The
mathematical justification is given by asymptotic theory
• The theory says that, for almost any random variable X,
the exceedances of a high threshold u will tend towards
following the GPD model as u tends towards infinity
• In practice, we use a threshold that is high but still finite:
we rely on the fact that if this level is sufficiently high
then the asymptotic result will still be approximately true
When choosing a threshold, we need to balance
• Precision: If the threshold is low then our results will
tend to be more certain than if it is high
• Bias: extreme value methods will only be valid when
the threshold is sufficiently high
We can do this in a partly subjective way using
parameter stability plots
Parameter stability plot for shape parameter, ξ
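The numbers behind a parameter stability plot can be produced by re-estimating the shape parameter over a range of thresholds. The sketch below swaps in method-of-moments estimators for the maximum likelihood fit used in the talk, purely to stay within the Python standard library; the simulated exponential data (true shape 0) and all function names are hypothetical.

```python
import random
import statistics

def gpd_moment_estimates(exceedances):
    """Method-of-moments estimates (sigma, xi) for a GPD; only valid for xi < 0.5."""
    m = statistics.fmean(exceedances)
    v = statistics.variance(exceedances)
    ratio = m * m / v                    # equals 1 - 2*xi in the population
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (ratio + 1.0)
    return sigma, xi

def shape_stability(values, thresholds):
    """Shape estimate at each candidate threshold: (threshold, n, xi_hat)."""
    rows = []
    for u in thresholds:
        exc = [v - u for v in values if v > u]
        _, xi = gpd_moment_estimates(exc)
        rows.append((u, len(exc), xi))
    return rows

random.seed(1)
data = [random.expovariate(1.0) for _ in range(20000)]   # simulated, true xi = 0
table = shape_stability(data, thresholds=[0.0, 0.5, 1.0, 1.5])
for u, n, xi in table:
    print(f"u={u:.1f}  n={n:5d}  xi_hat={xi:+.3f}")
```

The estimates should stay roughly constant above any threshold at which the GPD is adequate, while the shrinking sample size n shows the loss of precision as the threshold rises.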
The GEV model
• Another approach involves analysing block maxima
• For example, if we have hourly sea level data then we
may choose to analyse only the largest value that occurs
each year: the annual maximum value
• The same method can also be used to analyse minima
• A good statistical model for block maxima is the GEV
(Generalised Extreme Value Distribution)
• The distribution function is of the form
G(x) = \exp\{-[1 + \xi (x - \mu) / \sigma]^{-1/\xi}\}
• There are three parameters - a location parameter μ, a
scale parameter σ, and a shape parameter ξ - which
need to be estimated
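The GEV distribution function is simple to evaluate directly. A minimal sketch, with the limiting case of zero shape (the Gumbel distribution) handled separately; the parameter values are hypothetical and the function name `gev_cdf` is not from any package mentioned in the talk.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function with location mu, scale sigma and shape xi."""
    z = (x - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-z))   # Gumbel limit
    base = 1.0 + xi * z
    if base <= 0.0:
        # Outside the support: below the lower end point (xi > 0)
        # or above the upper end point (xi < 0).
        return 0.0 if xi > 0.0 else 1.0
    return math.exp(-base ** (-1.0 / xi))

# Hypothetical parameters for an annual maximum.
mu, sigma, xi = 100.0, 10.0, 0.1
p_exceed = 1.0 - gev_cdf(130.0, mu, sigma, xi)
print(f"P(annual maximum > 130) = {p_exceed:.3f}")
```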
The r-largest model
• The GEV model uses only one value per block
• An extension of this model involves using the r largest
values per block, where r is greater than one
• e.g. We might model the 20 highest sea levels per year
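Extracting block maxima, or the r largest values per block, is a simple grouping step. A sketch assuming the data arrive as (block label, value) pairs; the observations and the helper name are hypothetical.

```python
import heapq

def r_largest_per_block(observations, r):
    """Map each block label to its r largest values, in descending order."""
    blocks = {}
    for block, value in observations:
        blocks.setdefault(block, []).append(value)
    return {block: heapq.nlargest(r, values) for block, values in blocks.items()}

# Hypothetical (year, sea level in metres) observations.
obs = [(2001, 1.2), (2001, 2.9), (2001, 2.1),
       (2002, 3.4), (2002, 0.8), (2002, 2.2)]
annual_maxima = r_largest_per_block(obs, r=1)   # input for a GEV fit
two_largest = r_largest_per_block(obs, r=2)     # input for an r-largest fit
print(annual_maxima)
print(two_largest)
```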
The shape parameter
• All of the extreme value models contain a common
parameter ξ that determines the shape of the distribution
• The extremes of a short tailed distribution (one with a
finite upper limit) have a negative shape parameter (ξ < 0) & the
extremes of a heavy tailed distribution have a positive shape (ξ > 0)
• The extreme values of a normal distribution have ξ = 0
GPD: impact of the shape parameter, ξ (curves shown for ξ = 0, ξ = 1 and ξ = -0.5)
Covariates
• The properties of extreme values may depend on time,
location, or other covariates (explanatory variables)
• We can easily build these covariates into our extreme
value models, in a similar way that we would build them
into a regression model or GLM
• The key difference is that in a GLM we only build
covariates into the mean, whereas in EV models we
might build them into any of the three parameters
Venice sea level data – linear trend in location parameter
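A linear trend in the location parameter means replacing the constant location with beta0 + beta1 × t inside the GEV likelihood. The sketch below writes out the negative log-likelihood for that model; the parameter names, the toy data and the function name are hypothetical, and it is not the code used for the Venice analysis.

```python
import math

def gev_trend_nll(params, years, maxima):
    """Negative log-likelihood for a GEV whose location is beta0 + beta1 * t."""
    beta0, beta1, sigma, xi = params
    if sigma <= 0.0:
        return math.inf
    total = 0.0
    for t, x in zip(years, maxima):
        z = (x - (beta0 + beta1 * t)) / sigma
        if xi == 0.0:
            total += math.log(sigma) + z + math.exp(-z)   # Gumbel case
        else:
            base = 1.0 + xi * z
            if base <= 0.0:
                return math.inf          # observation outside the support
            total += (math.log(sigma)
                      + (1.0 + 1.0 / xi) * math.log(base)
                      + base ** (-1.0 / xi))
    return total

# Toy annual maxima with an upward drift (hypothetical values).
years = [0, 1, 2, 3, 4, 5]
maxima = [100.0, 103.0, 101.0, 106.0, 108.0, 109.0]
nll_trend = gev_trend_nll((100.0, 2.0, 2.0, 0.0), years, maxima)
nll_flat = gev_trend_nll((104.5, 0.0, 2.0, 0.0), years, maxima)
print(nll_trend, nll_flat)
```

In practice the four parameters would be estimated by passing this function to a numerical optimiser, and the trend assessed by comparing the minimised value with that of the model in which beta1 is fixed at zero.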
More advanced statistical modelling
• Methods to deal with clustering: e.g. declustering
algorithms, estimation of the extremal index
• Semiparametric modelling: allow trends to vary
smoothly over time, using local likelihood or smoothing
splines
• Bayesian methods: allow for the incorporation of prior
information, and for the construction of relatively
complicated hierarchical models
Example of semiparametric modelling:
estimated trends in storm surge levels at Dover
Software
• Add-on packages are available for R (extRemes, ismev,
evir, evd, evdbayes), Splus (EVIS, S+FinMetrics) and
Matlab (EVIM, EXTREMES)
• The extremes toolkit provides a user-friendly interface -
www.isse.ucar.edu/extremevalues/evtk.html
• Some methods are also available in Genstat
• Stand-alone commercial software: Xtremes, HYFRAN
Should I be using EVT?
Advantages
• Robust
Relies on weak assumptions
Avoids bias
• Theoretically sound
Justified by asymptotic theory
• Quick & relatively easy to use
• Honest
…about the uncertainties involved
in making statements about very
rare events
Disadvantages
• Inefficient
Most of the data are thrown away
…we may over-estimate uncertainty
…relies on having a large sample size
• Asymptotics
The theory only holds exactly for
infinitely extreme events
Difficult to extend to multivariate case
• Data quality
Sensitive to errors in extreme data
3. Applications
Environmental sciences
• EVT is widely used by scientists working in hydrology,
climatology, oceanography and fire science
• It is also used for operational purposes in flood risk
assessment and civil engineering
• Particular interest in studying the impact of climate
change upon extreme events – e.g.
MICE project (www.cru.uea.ac.uk/projects/mice)
WASA project: Waves & Storms in the NE Atlantic
Thames Barrier, London
Source: Roger Haworth
http://en.wikipedia.org/wiki/Image:Thames_Barrier_059184.jpg
Risk assessment and design
• Extreme value problems in hydrology and coastal
engineering are often phrased in terms of return levels
• N-year return level: the level that is exceeded with
probability 1/N in a particular year – definition applies to
nonstationary processes too, but interpretation is harder
• e.g. Thames Barrier: “…was originally designed to
protect London against a flood level with a return period
of 1000 years in the year 2030…” (Wikipedia)
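For a stationary process whose annual maxima follow a GEV distribution, the N-year return level is the level z satisfying G(z) = 1 - 1/N, which can be solved in closed form. A minimal sketch with hypothetical parameter values; these are not estimates for the Thames or any other site.

```python
import math

def gev_return_level(n_years, mu, sigma, xi):
    """Level exceeded with probability 1/n_years in any one year (stationary GEV)."""
    y = -math.log(1.0 - 1.0 / n_years)
    if xi == 0.0:
        return mu - sigma * math.log(y)              # Gumbel case
    return mu - (sigma / xi) * (1.0 - y ** (-xi))

# Hypothetical GEV parameters for annual maximum water level (metres).
mu, sigma, xi = 3.0, 0.5, -0.1
for n in (10, 100, 1000):
    print(f"{n:5d}-year return level: {gev_return_level(n, mu, sigma, xi):.2f} m")
```

With a negative shape parameter the return levels approach a finite upper limit as N grows; with a positive shape they grow without bound.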
Biology
• Biologists are also often interested in studying the
properties of extreme or rare events, but rarely use EVT
• Some likely reasons –
• Relatively small sample sizes (compared to e.g. hydrology)
• Extreme events not so easily defined in quantitative terms
• New applications are likely to arise from the increasing
use of large datasets (e.g. in genetics), and from an
increased focus on quantitative risk assessment
Genetics
A major application of EVT is in sequence alignment, and
extreme value models are used by BLAST and FASTA
“…a sequence alignment is a way of arranging the primary sequences of
DNA, RNA, or protein to identify regions of similarity that may be a
consequence of functional, structural, or evolutionary relationships
between the sequences…” (Wikipedia)
Compare a sequence against a vast database of known sequences:
1. define a similarity score
2. search for the best match within the database
3. use EVT to evaluate the significance of this match
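Step 3 can be sketched using the Gumbel-type approximation usually described for BLAST (Karlin-Altschul statistics), in which the expected number of chance matches with score at least S is E = K·m·n·exp(-λS) and the p-value is 1 - exp(-E). The talk does not give this formula; the values of λ and K below are illustrative only, since in practice they depend on the scoring system, and the sequence lengths are hypothetical.

```python
import math

def alignment_p_value(score, query_len, db_len, lam, k):
    """P(best chance score >= score) under the Gumbel (extreme value) approximation."""
    expected_hits = k * query_len * db_len * math.exp(-lam * score)   # E-value
    return -math.expm1(-expected_hits)   # 1 - exp(-E), computed stably for small E

# Illustrative values only: lam and k are assumptions, not taken from the talk.
lam, k = 0.267, 0.041
for s in (60, 80, 100):
    p = alignment_p_value(s, query_len=250, db_len=100_000_000, lam=lam, k=k)
    print(f"score {s:3d}: p = {p:.3g}")
```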
Ecology
Review papers by Gaines & Denny (1993) and Katz et al.
(2005) focus on disturbance – studying the extremes of
environmental processes that are known to lead to
ecological disturbance
e.g. sediment rates, fire sizes, frost days
They also consider longevity & survival – i.e. studying the
maximum lifespan or size of an individual
Bumblebee on Echinacea purpurea
http://en.wikipedia.org/wiki/Image:Zigouillonbourdon.jpg
Possible new applications in ecology
• Dispersal & spread: spatial spread (of diseases, pollen,
invasive species) known to be influenced by long-range
dispersal; can EVT be used to analyse dispersal data?
• Population dynamics: estimating the probability of
extinction or explosion of a population
• Ecological modelling: study the properties of extreme
events simulated by complex process-based ecological
models – e.g. mass extinction events
Other areas where EVT is used
• Finance and insurance: in particular, calculation of
Value at Risk (www.gloriamundi.org)
• Telecommunications: e.g. estimation of very large file
sizes in internet traffic
• Sport science: trends in record times for athletics
…and many, many more…
4. Current research
Extreme value theory remains an area of active
methodological research, with two key strands:
1) Improving the practical utility of existing extreme
value methods by making use of recent
developments in statistics and computing
e.g. Bayesian extremes
2) Developing methods for multivariate extremes – this
involves much theoretical work
Multivariate extremes
• Standard (“univariate”) extreme value methods
concentrate on the extremes of a single random variable
• Multivariate extreme value theory studies how the
values of different variables are related at extreme levels
• The different random variables may relate to genuinely
different processes (e.g. tide and waves) or to the same
process at different locations (spatial extremes)
Some applications:
1) Calculating the risk that there will be a fall in the
overall value of a portfolio of investments
2) Assessing regional flood risk e.g. estimating the
probability that a severe flood will occur at one or more
locations within a region
3) Evaluating the probability that two atmospheric
pollutants will simultaneously reach hazardous levels
Two random variables X1 and X2 may either be…
• Asymptotically dependent:
The most extreme values of X2 tend to occur when X1 is also extreme
• Asymptotically independent:
The most extreme values of X2 tend not to coincide with the most
extreme values of X1, and vice-versa
• There is a rich mathematical theory about asymptotic
dependence, providing us with statistical models that we
can use if we are prepared to make this assumption
• There is little theory about asymptotic independence,
and practical techniques for dealing with data that exhibit
this have only been developed in the past 5-10 years
• There are relatively few practical techniques for testing
whether your data exhibit asymptotic (in)dependence
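One commonly used exploratory diagnostic, not described in the talk itself, is the quantity chi(u) = P(X2 exceeds its u-quantile | X1 exceeds its u-quantile): as u approaches 1 it tends to zero under asymptotic independence and to a positive limit under asymptotic dependence. The sketch below is a simple empirical estimate on simulated data; the function name and the data are hypothetical.

```python
import random

def empirical_chi(x1, x2, u):
    """Estimate P(X2 above its u-quantile | X1 above its u-quantile)."""
    n = len(x1)
    k = int(u * n)                       # number of points treated as non-extreme
    if not 0 < k < n:
        raise ValueError("u must leave at least one point on each side")
    t1 = sorted(x1)[k - 1]               # empirical u-quantile of X1
    t2 = sorted(x2)[k - 1]               # empirical u-quantile of X2
    above1 = sum(1 for a in x1 if a > t1)
    both = sum(1 for a, b in zip(x1, x2) if a > t1 and b > t2)
    return both / above1

random.seed(0)
x1 = [random.gauss(0.0, 1.0) for _ in range(5000)]
x2 = [random.gauss(0.0, 1.0) for _ in range(5000)]   # independent of x1

for u in (0.90, 0.95, 0.98):
    print(f"u={u:.2f}  independent: {empirical_chi(x1, x2, u):.3f}  "
          f"identical: {empirical_chi(x1, x1, u):.3f}")
```

For independent variables the estimate is close to 1 - u and shrinks as u increases, whereas for perfectly dependent variables it stays at 1. With real data the estimates become very noisy at high u, which is part of why distinguishing the two cases is hard.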
Thank you for listening!
Email: [email protected]
Phone: 0131 650 4896