Statistical descriptors

Download Report

Transcript Statistical descriptors

Review of statistics and frequency analysis
Academic year 2009 - 2010
Associate Professor: Dr. P.H.A.J.M. van Gelder
TU Delft, Faculty of Civil Engineering and Geosciences
UNESCO-IHE, Guest Lecturer
Year 2009 Lectures on:
October 27th 8.45 – 10.30h
October 28th 15.45 – 17.30h
November 5th 15.45 – 17.30h
November 6th 8.45 – 10.30h
A selection of these slides will be
presented during the course
contact for students
Dr.ir. P.H.A.J.M. van Gelder room: 3.87 ext: 86544
Overall outline of the course
Review of statistics and frequency
analysis
Data analysis, random variables,
classification, stat. moments, frequency
distributions; samples, populations and
probability models; parameter
estimation and confidence intervals.
Introduction to this Course
1 + 1 lecture periods (basic statistics), 1 + 1
lecture periods on frequency analysis and a
written exam
Dialog instead of monolog (ask questions!)
Complete power point presentations can be
found on Van Gelder’s website
Brief self-introduction:
–
–
–
–
Background (name, country, university education etc.)
Interests, experiences with statistics?
What do you hope to get out of this part of the course?
Who has experienced hydrological extremes?
Outline for today
• Introduction on natural hazards and
probabilistic design
• Data sets for river– and coastal engineers
• Theory on:
– Probability and events
– Random (or stochastic) variables
– Transformations
– Multivariate distributions
Number of Floods worldwide
Source: Dartmouth Flood Observatory, 2003
Regional Distribution of Large Floods
1999-2002
1985-1988
Source: Dartmouth Flood
Observatory, 2003
Reasons for Concern
Source: Smith et al 2001, TAR IPCC WG II
Does Global Warming Lead to an Intensification of the
Hydrological Cycle and More Extremes Events?
More flood-producing rain (but regional differences)
Longer rainless periods and higher evaporation demand
of the atmosphere
There are indications for:
 Changing regimes, flood timing
 More frequent floods, land slides etc.
 No reduction of droughts, but decreased summer low flows
e.g. in western Europe
Effects of changing Mean and Variance of
Precipitation on Stream Flow
(i.e. non-linear processes, thresholds, etc.)
(from Middelkoop 2005, after Arnell 1996)
Frequency analyses can be done for …
High flows
 Flood peak discharge
 Flood volume
 Occurrence of high flows in certain periods/months
Low flows
 Minimum low flow discharge
 Runoff deficits
Groundwater levels, groundwater fluctuations etc.
Estimated cost of damage caused by natural hazards
Simulated variables (model outputs)
Daily rainfall, rainfall intensities etc.
….
Let’s start with examples of flood peak discharges
How do we measure floods?
(Aus: Hornberger et al., 1998)
Rating curve
(Aus: Hornberger et al., 1998)
Application of rating curve to measured
water levels at a gauge
(Aus: Hornberger et al., 1998)
…. that is not always that easy!
(The big flood in the HJ Andrews 1996)
Continuous measurement of discharges
(incl. floods and low flows)
(Aus: Hornberger et al., 1998)
Floods in 2002; river Elbe,
city of Dresden, Germany
Flood defence structures
Storm Surge Barrier Oosterschelde (NL)
Maeslantkering - storm surge barrier (NL)
Variability in annual precipitation (NL)
Precipitation in the Netherlands
1100
Precipitation [mm]
1000
900
800
700
600
500
1940
1950
1960
1970
Year
1980
1990
2000
Variability in daily river
discharges (Meuse river, NL)
Daily Discharges of the River Maas at Lith (+ Mean, Minimum and Maximum per year))
2500
3
River Discharge [m /s]
2000
1500
1000
500
0
0
1000
2000
3000
Day Nr
4000
5000
Example: Evaluating of hydrological extremes
(flood runoff); Case in Austria 2005
600
500
Min (71-02)
Max(71-02)
Mittel (71-02)
2005
2005: max. ca. 515 m3/s
[m3/s]
400
300
1999: max. ca. 230 m3/s
200
100
0
1.Jän
1.Feb
1.Mär
1.Apr
1.Mai
1.Jun
1.Jul
1.Aug
1.Sep
1.Okt
1.Nov
1.Dez
Example: Evaluating of floods in a regional
context (Case study: Austria 2005)
Ongoing EU projects on Flood Risks
EFFS: A European Flood Forecasting System
– to develop a prototype of a European flood
forecasting system for 4-10 days in advance,
which could provide daily information on potential
floods for large rivers, and flash floods in small
basins.
SPHERE: Systematic, Palaeoflood and Historical data
for the improvement of flood Risk Estimation
– to develop a new approach which complements
hydrologic modelling and the application of
historical and paleoflood hydrology to increase the
temporal framework of the largest floods over time
spans from decades to millennia; in order to
improve extreme flood occurrences.
THARMIT: Torrent Hazard Control in the European Alps
– to develop practical tools and methodologies for
hazard assessment, prevention and mitigation,
and to devise methods for saving and monitoring
potentially dangerous areas.
CARPE DIEM: Critical assessment of Available Radar
Precipitation Estimation techniques and Development
of Innovative approaches for Environmental
Management.
– to improve real-time estimation of radar rainfall
fields for flood forecasting, by coupling multiparameter polarisation radar data and NWP, and
exploiting NWP results in order to improve the
interpretation of radar observations.
IMPACT: Investigation of Extreme Flood Processes and
Uncertainty
– to investigate extreme flood and defense failure
processes, their risk and uncertainty. Will consider
dam breach formation, sediment movement, flood
propagation and predictive models, within an
overall framework of flood risk management.
GLACIORISK: Survey and Prevention of Extreme
Glaciological Hazards in European Mountainous
Regions
– to develop scientific studies for detection, survey
and prevention of glacial disasters in order to
save lives and reduce damages.
SAFERELNET: Risk assessment of natural hazards in
Europe
MITCH: Mitigation of Climate Induced Hazards
– dealing with the mitigation of natural hazards with
a meteorological cause, in order to assist planning
and management. The main focus will be on flood
forecasting and warning, but it will also include
other flood related hazards, such as landslips and
debris flow, and longer term climate hazards, such
as drought, and the possible impact of climate
change on the frequency and magnitude hazards
ADC-RBM: Advanced Study Course in River Basin
Modelling for Flood Risk Mitigation - June 2002
FLOODMAN: Near real-time flood forecasting, warning
and management system based on satellite radar
images, hydrological and hydraulic models and in-situ
data
– near real-time monitoring of flood extent using
spaceborne SAR, optical data & in-situ
measurements, hydrological and hydraulic model
data. The result will be an expert decision system
for monitoring, management and forecast of floods
in selected areas in Europe. The monitoring will
also be used to update the hydrological/hydraulic
models and thereby improving the quality of flood
forecasts.
FLOODSITE: The FLOODsite project covers the
physical, environmental, ecological and socioeconomic aspects of floods from rivers, estuaries
and the sea. The project is arranged into seven
themes covering:
Risk analysis – hazard sources, pathways and
vulnerability of receptors.
Risk management – pre-flood measures and flood
emergency management.
Technological integration – decision support and
uncertainty.
Pilot applications – for river, estuary and coastal sites.
Training and knowledge uptake – guidance for
professionals, public information and educational
material.
Networking, review and assessment.
Co-ordination and management.
Extreme Events;
Two very realistic simulations
1. River dike failure in the Netherlands
2. Asteroid impact in the Atlantic Ocean
The role Statistics
in Water Engineering
Data analysis and statistics
(later also modeling etc.)
Statistics of hydrological variables
for decision support
Properties of
the hydrological
system
Statistical Analysis !!
We need Information about …
Water balance: P = R + ET + dS/dt
Variability and heterogeneity of hydrological
variables (groundwater levels, precip. patterns etc.)
Hydrological extremes:
x-year
flood
droughts
floods
Scenarios for:
 Land use change
 Climate chance
 Different water management strategies
ETC.
4 types of error occur when measuring a
hydrological variable
1. Operation and function errors: malfunction of measuring
instrument, personal (human) error.
2. Random error: caused by numerous minor impacts partly
independent from each other. If repeated frequently, the
values fluctuate around the true value.
3. Constant systematic error: inherent in any kind of
equipment (e.g., wrong installation of instruments, wrongly
indicated zero point, incorrect rating curve etc.); constant
in respect of time, but may vary according to the
measuring range.
4. Variable systematic error: usually caused by insufficient
control during the measuring period; mostly the origin in
the instrument (e.g., drift” of the device, growing of plants
at the location of measurement etc.). Can be avoided
through continuous comparison of the measurement and
repeated calibration of the instruments.
 Systematic errors can not be reduced by increasing the
number of measurements, if equipment and measuring
conditions remain the same!
Structural design principles
Old methods:
– determine a worst case load
– determine a worst case strength
– determine the geometry of the structure
Disadvantages of old method
Unknown how safe the structure is
No insight in contribution of different individual
failure mechanisms
No insight in importance of different input
parameters
Uncertainties in variables cannot be taken into
account
Uncertainties in the physical models cannot be
taken into account
Failure mechanisms of a dike
Design of a structure
Random boundary conditions
Fault tree with AND and OR
flood defence fails
OR
flood defence collapses
AND
piping develops
inspection fails
overtopping
Mathematics of AND and OR
In case of an AND-gate, you should
multiply the probabilities
In case of an OR-gate, you should add
the probabilities (and substract the
multiplication of the two probabilities)
Important condition:
This is only true when both mechanisms
are fully (statistically) independent
Example of dependence
Modern wave run-up formula is:
R  1.6 H
tan 

H /L
2
L  gT /( 2 )
(for shallow water, last equation is somewhat different and implicit)
Example of dependence (2)
So the answer depends on H and T
But in a single wave field, T = f(H), for
example:T = 3.9 * H0.376
This can be modelled as:
T = A * HB, in which both A and B are
stochastic variables with a mean and
standard deviation
Example of dependence (3)
But this is only true in case of a single
wave field (wind waves OR swell
waves)
When there are more wave fields H and T
are NOT statistically independent
There is no good model for run-up due to
double peaked spectra, but there is an
approximation by Van der Meer
Example of dependence (4)
Run-up larger than allowed
OR
Run-up due to
wind waves
Run-up due to
swell waves
Run-up due to
double peaked spectrum
Computation with
statistics of
wind waves
Computation with
statistics of
swell waves
Computation with the
approxiamtion of
VanderMeer
Two approaches
First approach:
start at bottom and calculate the
probability of failure according to normal
design practice
second approach:
start at top and assign probability to
failure mechanisms
Two approaches (2)
Usually with a start at bottom, you do not
reach at the required overall failure
probability
Usually with a start at top, you cannot
construct some elements
So in practice, you have to make a
mixture
Sensitivity analysis
What is the effect of 10% change in input
on the output ?
This determines how important is an input
parameter
End of introduction
Data availability
Internet offers a huge source of past - and
real time data
The theory will be explained with
examples and data sets taken from river
engineering:
The Global Runoff Data Centre
http://grdc.bafg.de/servlet/is/910/
Mediterranean Hydrological Cycle Observing System
(Med-HYCOS project)
http://medhycos.mpl.ird.fr
UNESCO International Hydrological Programme
http://webworld.unesco.org/water/ihp/db/
The Global River Discharge Database” (RIVDIS)
http://www.rivdis.sr.unh.edu
Local Websites
Ministry of Water Resources of China
http://www.mwr.gov.ch/english/index.asp
Ministry of Water Resources of India
http://mowr.gov.in
Water Commission of India
http://cwc.nic.in
These sites were useful for obtaining basic
information about river basins, not so useful
in downloading discharge data
Apart from websites, data is also published in
National Water Resources Books
Datasets for river engineers
The longest period of observation is recorded for the
river Nemunas at Smalinninkai: 1812-2003 (LT). The
majority of European rivers have observation records
dating from the period 1910-1920 and continuing to
1999-2004.
On most Asian rivers water discharges have been
observed since the period 1930-1940, although the
river Bia at Biisk (Russia) has a record 108 year
period of observation (1895 to 2003). The shortest
period of observation is found on the Indian rivers
(1939 -1979), the Chinese rivers (1930-1985) and the
Iranian rivers (1963-1985).
Your data for the exercise is available at:
http://www.citg.tudelft.nl/live/pagina.jsp?
id=418a276e-b63e-4cec-a6fe763feb04f984&lang=en
Some snap shots
Data for coastal engineers
www.oceanor.no/
www.knmi.nl/onderzk/oceano/waves/era40/lice
nse.cgi
www.globalwavestatisticsonline.com/
http://www.golfklimaat.nl
http://www.actuelewaterdata.nl
http://www.hydraulicengineering.tudelft.nl/public
/gelder/paper56-data3.zip
Hm0, H1/3,HTE3, Tm02, TH1/3, Th0,
wind direction, wind speed, water
level , surge
Important data source for Dutch data
Some snap shots
real time wave data; significant wave heights
wave periods
Water levels gauges in Mid West Netherlands
Water levels and astronomical tide
The wave data on the wave climate site is available in data files per
year (now : 1979 - 2002). The files contain wave data in the
following format :
19880221 0100 90 4 84 12 52 66 351 10 -6 3433402
19880221 0400 76 3 73 15 53 65 339 -95 -2 3433402
19880221 0700 67 3 66 10 42 53 346 -31 -1 3433402
19880221 1000 67 3 64 12 40 48 344 51 -8 3433402
19880221 1300 66 3 65 11 36 44 308 -3 0 3433402
19880221 1600 75 3 73 12 40 47 298 -93 6 3433402
19880221 1900 91 4 79 13 41 48 325
0 3 3433402
19880221 2200 86 4 80 11 41 47 334 95 0 3433402
19880222 0100 84 5 81 13 42 50 323 38 4 0400402
…….. etc.
The files are arranged as follows :
Column nr.
Name
unit
1
Date
[yyyymmdd]
2
time
[hhmm] MET !
3
wave height Hm0
[cm]
4
accuracy wave height Hm0 (standard deviation)
[cm]
5
wave height H1/3
[cm]
6
wave height HTE3
[cm]
7
wave period Tm02
[0.1 s]
8
wave period TH1/3
[0.1 s]
9
wave direction Th0
[gr], nautical[1]
10
water level
[cm] NAP/MSL
11
surge
[cm]
12
code number which indicates the origin of the given value
[-]
[1]
Nautical degrees : (from) North = 0 degrees, (from) East = 90, South = 180, West = 270 and
North again = 360.
Probability
P(A) = probability of event A
Mathematical definition
Frequentistic definition
Mathematical definition
Axioms:
1. P(A)  0
2. P() = 1
3. P(A or B) = P(A) + P(B)
(if A and B are independent)
Frequentistic definition
P(A) = N(A) / N
in which:
N(A)
N
number of experiments leading to A
total number of experiments
example: probability that a consumer product fails
within 1 year after production
example interpretation
P(A) = n(A) / N
P
n(A)
N
probability
number of outcomes in experiment A
total number of outcomes
x
x
x
x
x
x
x
x
P(A) = 4 / 24 = 1 / 6
x
x
x
x
x
x
x A x
x
x
x
x
x
x
x
x
example dice
1
x
2
x
P(x=4)
= 1/6
P(x  5) = 2/6
P(x even) = 3/6
3
x
4
x
5
x
6
x
Some history
1650Pascal / Fermat
1750
Bernouilli / Bayes
1850
Venn / Boole
1920
1960
1970
Von Mises
Savage / Lindley
decision making
Benjamin / Cornell decision making
1960’s and onwards
‘Years ago a statistician might have
claimed that statistics deals with the
processing of data;
today statisticians will be more likely to
say that statistics is concerned with
decision making in the face of
uncertainty.’
probability calculation
calculation of a probability from other
probabilities
Joint events
Union
A or B
B

A
B
Cross section
A and B
Implication

A
B
A

A in B
Denial
A not
A

Union
P(A or B)
x
x
x
x
x
x
A
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x B
x
P(A or B) = P(A) + P(B) - P(A en B)
13/24
= 6/24 + 9/24 - 2/24
?
Cross section
P(A and B)
x
x
x
x
x
x
A
x
x
x
x
x
x
x
x
x
x
x
x
x
x
P(A and B) = nAB / n
= (nA / n) * (nAB / nA)
= P(A) * P(B | A)
= 6/24 * 2/6 = 2/24
x
x
x B
x
Conditional probability
P(A | B) =
probability of A given the fact that event
B has occured
P(A and B) = P(B) P(A | B)
P(A | B) = P(A and B) / P(B)
Conditional probability
P(rain in Delft on sept. 18, 2024)?
P(rain in Delft on 9/18-2024| rain in
Amsterdam on 9/18-2024)?
P(rain in Delft on 9/18-2024|rain in Cape
Town on 9/18-2024)?
example: dice
1
x
2
x
P( x  2 | x  even ) 
4
x
5
x
P( x  2 en x  even ) 1/ 6

 1/ 3
P( x  even )
1/ 2
P( x  2 | x  oneven ) 
P( x  2 | x  even ) 
3
x
P( x  2 en x  oneven )
0
P( x  oneven )
P( x  2 en x  even ) 1/ 6

 1/ 3
P( x  even )
1/ 2
6
x
Independence
A and B are independent if
PA | B  PA 
In that case:
PA en B  PA  PB
PA of B  PA   PB  PA  PB
Important rules
Theorem of total probability
PA   PA | B PB  PA | Bniet  PBniet 
PA    PA | Bi  PBi  als Bi elkaar uitsluiten
i
Generalisation to continuous integral “in
which the uncertainty is integrated out”
Theorem of Bayes
PB | A  PA 
PA | B 
PB
example: quiz dilemma
Car in A, B or C
A
B
C
U: chooses A
QM: Good that you didn’t choose B,
because it is empty. Would you still like
to switch to C?
Quiz-dilemma
Theorem of Bayes:
Pinf o | A  PA  1/ 2 * 1/ 3
PA | inf o 

 1/ 3
Pinf o
Pinf o
PC | inf o 
Pinf o | C PC 1* 1/ 3

 2/3
Pinf o
Pinf o
Yes, switch to C!
Notes: P(info)=0.5 because there can be a car in B or not.
P(info|C)=1, because if we have information on C and B(info), we know that A
should contain the car with 100% certainty
A clever student is invited to write a simulation programme to find out if this is
indeed true
Solution of Jeroen van den Bos
Development of 2 Matlab scripts
Quiz_noinfo.m (choose a box and check if the car is
there with no information from the QM)
Quiz_info.m (choose a box and update your choice
when the QM gives his information on another box)
Quiz_noinfo.m
clear;
N = 2000; NoOfBoxes = 3;
NoSuccess=0;
for i = 1:N,
Box_Car = fix(1+rand(1)*(NoOfBoxes)); %car is put randomly in a box
Box_Guess = fix(1+rand(1)*(NoOfBoxes)); %random choice of a box
NoSuccess = NoSuccess + (Box_Car == Box_Guess); %if guess is right increase # of
succesful attempts
Fr_Success(i)=NoSuccess/i; %frequency of success after i attemps
end;
P_Success = Fr_Success(N) %final result
% output
% -----plot(1:N,Fr_Success,'.',[0 N],[1 1]/NoOfBoxes)
axis([0 N 0 1])
legend('Simulation result',['P = 1/' num2str(NoOfBoxes)])
Quiz_info.m
clear;
N = 2000; NoOfBoxes = 3;
NoSuccess_Stay=0;
NoSuccess_Switch = 0;
for i = 1:N,
Box_Car = fix(1+rand(1)*(NoOfBoxes)); %car is put randomly in a box
Box_1stGuess = fix(1+rand(1)*(NoOfBoxes)); %random choice of a box
k = 0; %Select possible empty boxes
for j = 1:NoOfBoxes,
if (Box_Car ~= j) & (Box_1stGuess ~= j) %empty box cannot be 'car' or 'guess'
k = k + 1;
Empty_Boxes(k) = j; %vector of empty boxes
end;
end;
Box_Empty = Empty_Boxes(fix(1+rand(1)*k)); %Choose randomly from empty boxes
k = 0; %Select possible alternatives
for j = 1:NoOfBoxes,
if (Box_Empty ~= j) & (Box_1stGuess ~= j) %alt box cannot be 'empty' or 'guess'
k = k + 1;
Alternative_Boxes(k) = j; %vector of alternative boxes
end;
end;
Box_Alternative = Alternative_Boxes(fix(1+rand(1)*k)); %Choose randomly from alternative boxes
NoSuccess_Stay = NoSuccess_Stay + (Box_Car == Box_1stGuess); %if 1st guess is right increase # of succesful attempts 'stay' strategy
NoSuccess_Switch = NoSuccess_Switch + (Box_Car == Box_Alternative); %if alt. guess is right increase # of succesful attempts 'switch' strategy
Fr_Success_Stay(i)=NoSuccess_Stay/i; %frequency of success after i attemps
Fr_Success_Switch(i)=NoSuccess_Switch/i; %frequency of success after i attemps
end;
P_Success_Stay = Fr_Success_Stay(N) %final result
P_Success_Switch = Fr_Success_Switch(N) %final result
% output
% -----plot(1:N,Fr_Success_Stay,'r.',1:N,Fr_Success_Switch,'b.',[0 N],[1 1]*P_Success_Stay,'r--',[0 N],[1 1]*P_Success_Switch,'b--')
axis([0 N 0 1])
%legend('Simulation result Stay','Simulation result Switch',['P_{Stay} = ' num2str(P_Success_Stay)],['P_{Switch} =',
num2str(P_Success_Switch)],'location','EastOutside')
legend(['"Stay" stragegy (P_{success} = ' num2str(P_Success_Stay) ')'],['"Switch" strategy (P_{success} = ' num2str(P_Success_Switch) ')']);
No info
1
Simulation result
P = 1/3
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
200
400
600
800
1000
1200
1400
1600
1800
2000
With info from QM
1
"Stay" stragegy (P success = 0.3295)
"Switch" strategy (P success = 0.6705)
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
200
400
600
800
1000
1200
1400
1600
1800
2000
Updating process
If new information becomes available,
new estimates can be made about the
failure probability of a system
stochastic variables
stochastic variables
•
•
•
•
•
What is a stochastic variable?
probability distributions
Fast characteristics
Distribution types
Two stochastic variables
stochastic variable
Quantity with uncertainty:
– Natural variation
– Lack of statististical data
– Schematizations
example:
– strength of concrete
– outcome of a dice
– temperature in Delft on September 16th, 2014
Relation with events
uncertainty can be expressed with probabilities
• probability that stochastic variable X
–
–
–
–
–
is smaller than x
larger than x
equal to x
is in the interval [x, x+ x]
etc.
probability distribution
probability distribution function = probability
P(X):
FX() = P(X)
1
0.8
0.6
X
F ()
stochast
0.4
dummy
0.2
0

probability density
0.5
0.45
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
-3
-2
-1
0
1
2
3
4
5

This is a probability density function
probability density
Differentiation of F to :
fX() = dFX() / d
f = probability density function
fX() d = P( < X  +d
1
P(X 
0.6
X
F ()
0.8
0.4
0.2
0

0.5
fX()
0.4
P( < X  d 
0.3
0.2
0.1
0
 d 
1
0.6
X
F ()
0.8
0.4
0.2
0
P(X 

0.5
fX()
0.4
0.3
0.2
0.1
0

Discrete and continuous
discrete variabele:
0.4
1
0.35
0.9
0.3
0.8
FX(x)
pX(x)
0.25
0.7
0.2
0.6
0.15
0.5
0.1
0.4
0.05
0
0
1
2
3
4
x
5
0.3
6
0
1
2
3
x
4
5
6
continue variabele:
1
0.4
0.8
fX(x)
FX(x)
0.5
0.6
0.3
0.2
0.4
0.1
0.2
0
-4
-2
0
x
2
4
kansdichtheid
6
0
-4
-2
0
x
2
4
(cumulatieve)
kansverdeling
6
Complementary probability
distribution
Complementary (cumulative distribution function (ccdf) of variable S
P(S > x) = 1 - P(S  x) = 1-FS(x)
tail of the distribution
P{S > Sd} = Ptarget
Sd
Fast characteristics
0.5
0.3
X
f (x)
0.4
0.2
sX
0.1
0
-4
mX
sX
-2
0
mX
2
4
x
6
mean
standard deviation, indication for spreading
0.7
0.6
0.5
fX(x) 0.4
0.3
0.2
sX
0.1
0
0
1
2
mX
3
4
x
5
Mean  maximum (mode)
Median is the value m for which P(X<m)=50%

• Mean
• Variance
m X   x fX x  dx


s   x  m X  fX x  dx
2
2
X

• Standard deviation
sX
• Variation coefficient
VX 
sX
mX
distribution types
•
•
•
•
•
•
•
Uniform distribution
Normal distribution
Lognormal distribution
Gumbel distribution
Weibull distribution
Gamma distribution
….
Uniform distribution
fX()
area = total probability = 1
1/(b-a)
a
mean
b
m = (a+b)/2
Standard deviation s = (b-a)/12

Matlab demonstration
Generate numbers from a Uniform distr.
Make a histogram (observe the variability
around its mean)
Calculate the mean value and standard
deviation
They should converge to 0.5 and 0.2887
(1/sqrt(12))
This is indeed confirmed by the simulation
Normal distribution
strength of timber
0.06
mX
probability density function
normal distribution
0.05
mean = 37 N/mm2
standard deviation = 8.6 N/mm2
probability density
0.04
sX
0.03
0.02
0.01
0
-10
0
10
20
30
40
50
60
70
bending strength (N/mm2)
80
90
Normal distribution in CDF domain
sterkte houtsoort
1
0.9
kans buigsterkte <= x
0.8
0.7
0.6
0.5
kansverdelingsfunctie
normale verdeling
mu = 37 N/mm2
sigma = 8.6 N/mm2
0.4
0.3
0.2
0.1
0
0
10
20
30
40
50
2
x (N/mm )
60
70
80
Normal distribution in linearised CDF
domain
sterkte houtsoort
0.997
Normal Probability Plot
kans buigsterkte <= x
0.99
0.98
0.95
0.90
0.75
0.50
0.25
0.10
0.05
0.02
0.01
0.003
15
20
25
30
35
40
2
x (N/mm )
45
50
55
60
normal distribution
probability density: probability distribution:
f X   
1
e
s 2
1   m 
 

2 s 
2
FX y  
y
 f X  d

in which:
m
mean
s
standard deviation (s > 0)

dummy variable (- <  < )
standard normal distribution
normal distributed variable X:
X  mX  s X u
or
u
X  mX
sX
standard normal distributed variable u:
mu  0
su  1
probability density:
f u       
1
2
1
 2
e 2
probability distribution:
Fu y     
y
   d

table
This table can be used in both directions:
1. Given an x value, what is the
corresponding exceedence probability
2. Given a probability, what is the
corresponding x value
Note that the table only describes the right
hand tail of the standard normal
distribution. The left hand tail can be
obtained by symmetry around the point (0,
0.5).
For ordinary normal distributions, always
scale back to a standard normal distribution
(by subtracting the mean value, and
dividing by the standard deviation)
normal distribution
Why so popular?
Central limit theorem
Sum of many variables (i.i.d.) is (almost)
normally distributed.
Y = X1 + X2 + X3 + X4 + ….
i.i.d. = independent identically distributed
Normal Distributions
A continuous rv X is said to have a
normal distribution with parameters
m and s , where    m   and
0  s , if the pdf of X is
1
 ( x  m )2 /(2s 2 )
f ( x) 
e
s 2
  x  
Standard Normal Distributions
The normal distribution with parameter
values m  0 and s  1 is called a
standard normal distribution. The
random variable is denoted by Z. The
pdf is
1
 z2 / 2
f ( z;0,1) 
e
  z  
s 2
The cdf is
z
( z )  P( Z  z ) 


f ( y;0,1)dy
Standard Normal Cumulative Areas
Shaded area = (z )
Standard
normal
curve
0
z
Standard Normal Distribution
Let Z be the standard normal variable.
Find (from table)
a. P( Z  0.85)
Area to the left of 0.85 = 0.8023
b. P(Z > 1.32)
1  P( Z  1.32)  0.0934
c. P(2.1  Z  1.78)
Find the area to the left of 1.78 then
subtract the area to the left of –2.1.
= P( Z  1.78)  P( Z  2.1)
= 0.9625 – 0.0179
= 0.9446
z Notation
z will denote the value on the
measurement axis for which the area
under the z curve lies to the right of z .
Shaded area
 P(Z  z )  
0
z
Ex. Let Z be the standard normal variable. Find z if
a. P(Z < z) = 0.9278.
Look at the table and find an entry
= 0.9278 then read back to find
z = 1.46.
b. P(–z < Z < z) = 0.8132
P(z < Z < –z ) = 2P(0 < Z < z)
= 2[P(z < Z ) – ½]
= 2P(z < Z ) – 1 = 0.8132
P(z < Z ) = 0.9066
z = 1.32
Nonstandard Normal Distributions
If X has a normal distribution with
mean m and standard deviation s , then
Z
X m
s
has a standard normal distribution.
Normal Curve
Approximate percentage of area within
given standard deviations (empirical
rule).
99.7%
95%
68%
Ex. Let X be a normal random variable
with m  80 and s  20.
Find P( X  65).
65  80 

P  X  65   P  Z 

20 

 P  Z  .75
= 0.2266
Ex. A particular rash shown up at an
elementary school. It has been
determined that the length of time that the
rash will last is normally distributed with
m  6 days and s  1.5 days.
Find the probability that for a student
selected at random, the rash will last for
between 3.75 and 9 days.
96
 3.75  6
P  3.75  X  9   P 
Z

1.5 
 1.5
 P  1.5  Z  2
= 0.9772 – 0.0668
= 0.9104
Percentiles of an Arbitrary Normal
Distribution
(100p)th percentile
(100 p)th for 

m


s


m
,
s
for normal 

standard normal 
Lognormal distribution
0.7
0.6
0.5
fX() 0.4
0.3
0.2
sX
0.1
0
0
1
2
mX
3
4

5
Lognormal distribution
y
fX() : lognormal
fY(y) : normal
y = ln  or
 = exp(y)

If X is lognormal distributed, than
Y = ln(X) is normal distributed
Lognormal distribution
X lognormal distributed  Y = ln(X) normal distributed
probability density function for X:
 ln   m 2 
Y
f X   
exp 

2
 s Y 2
2s Y


1
 0
in which mY and sY parameters of the lognormal distribution:
mY
mean value of Y (not of X !!)
sY
standard deviation of Y (not of X !!)
Lognormal distribution
X lognormal distributed  Y = ln(X) normal distributed
2
m X  exp( m Y  1 s Y
)
2
2
s X  m X exp(s Y
) 1
m Y  ln m X  1 s Y2
2
2
 sX
s Y  ln 1 
 m2
X


  ln( 1  V 2 )
X


Lognormal distribution
As a consequence of the central limit
theorem:
Product of many variables is (almost)
lognormal distributed
y  x1 x 2 ...x n
log y  log x1  log x 2  ...  log x n
so log y (almost) normal distributed.
Definition: log y normal  y lognormal
Example (salaries in a country are LN distr.
Asymptotic distributions
Normal
X  Y1  Y2  Y3  ...
Lognormal
X  Y1 * Y2 * Y3 * ...
Weibull
X  minY1, Y2, Y3 ,...
Gumbel
X  max Y1, Y2, Y3 ,...
Asymptotic distr. return in the 5th year course CT5310
by Vrijling and Van Gelder
Discrete distributions
Exponential Distribution
A continuous rv X has an exponential
distribution with parameter  if the pdf is
 e  x x  0
f ( x;  )  
0

otherwise
Mean and Variance
The mean and variance of a random
variable X having the exponential
distribution
m
1

s 
2
1

2
Applications of the Exponential
Distribution
Suppose that the number of events
occurring in any time interval of length t
has a Poisson distribution with parameter  t
and that the numbers of occurrences in
nonoverlapping intervals are independent
of one another. Then the distribution of
elapsed time between the occurrences of
two successive events is exponential with
parameter    .
Two stochasts
Two stochasts
joint probability density
Kansdichtheid
0.2
0.15
0.1
0.05
0
2
1
2
1
0
0
-1
y
-1
-2
-2
x
fXY(,h)
fY(h)
fX()

h
Contour lines of the joint density
Two stochasts
Relation with events
fXY  ,h   P  X    
andh
 Y  h  h  /  h
FXY  ,h   PX   en Y  h 

h
h

Also here
f
density
F
(cumulative) distribution
Example
• Length
kansdichtheid (1/m)
3
2.5
2
1.5
1
0.5
0
1.2
1.4
1.6
1.8
2
2.2
2.4
2.6
lengte (m)
kansdichtheid (1/kg)
• Weight
0.05
0.04
0.03
0.02
0.01
0
50
60
70
80
gewicht (kg)
90
100
110
Corresponding contour plot?
110
100
weight (kg)
90
80
70
60
50
1.4
1.6
1.8
length (m)
2
2.2
Scatter plot results of a large survey
health investigation
110
1000 observations
100
weight (kg)
90
80
70
60
50
1.4
1.5
1.6
1.7 1.8 1.9
length (m)
2
2.1
2.2
Dependence
0
110
100
50
60
1.4
1.6
1.4
1.6
1.8
2
2.2
1.8
2
2.2
3
2.5
densiity (1/m)
0.01
60
70
2
1.5
1
0.5
0
0.02
70
length (m)
weight (kg)
80
90
90
80
lengte (m)
weight (kg)
100
50
0.03
110
0.04
0.05
density (1/kg)
Characteristics
mX, mY
sX, sY
Dependence
covXY
rXY = covXY / sX sY
covariance or
correlation, between -1 and 1
Covariance
Cov(X,Y)=E((X-EX)(Y-EY))
calculation example on black board
Correlation
rho = 0
rho = 0.3
rho = 0.7
rho = 0.9
rho = 1
rho = -0.9
Correlations between wave height
and wave period (data from
golfklimaat.nl website)
Engineering: structural reliability
Reliability:
Probability that structure falls apart
The smaller the probability, the larger the
reliability
Risk = probability x consequences
Structure:
Strength
Load
Falls apart if strength < load
Introduction
Design value - principle
Probability density function (pdf) of the load S: fS(x)
P{S > Sd} = Ptarget
s
m
Sd
Introduction
Cumulative distribution function
1
0.6
X
F ()
0.8
cdf
0.4
0.2
0
P(X 

0.5
f X()
0.4
0.3
pdf
0.2
0.1
0

Introduction
Design value - principle
Cumulative probability distribution function (cdf) of the lo
P{S  Sd} = 1-Ptarget
Sd
Introduction
Design value - principle
Complementary cumulative distribution function (ccdf) of th
= 1-FS(x)
P{S > Sd} = Ptarget
P{S > Sd} = Ptarget
Sd
Sd
Introduction
Design value load Sd
Design value load Sd or quantile is defined as:
P{S > Sd} = Ptarget during reference period Tref
Target probability Ptarget :
Depends on consequences of structural failure
Is specified in building codes
Typical:
Ptarget = 10-4 - 10-1 (structural collapse)
Tref = 15 - 100 years
Introduction
Peaks over Treshold analysis
for quantile estimation
Let X1; X2; . . . ; Xn be a
series of independent
random observations
of a random variable X
with the distribution
function F(x). To model
the upper tail of F(x),
consider k
exceedances of X over
a threshold u and let
Y1; Y2; . . . ; Yk denote
the excesses (or
peaks), i.e. Yi=Xi-u.
Extreme value statistics
If we know the distribution of a random variable
(for instance, monthly water level, daily wave
height, etc), how does the distribution of the
maximum of n random variables behave?
Careful for inhomogenities
Stationary Time Series
Exhibits stationarity in that it fluctuates around a constant long
run mean
Has a finite variance that is time invariant
Has a theoretical covariance between values of yt that depends
only on the difference apart in time
E ( yt )  m


Var ( y t )  E ( y t  m)  (0)
2
Cov( y t , y t  )  E( y t  m)( y t   m)  ()
Stationary time series
WHITE NOISE PROCESS
Xt = ut
0.6
0.4
0.2
0
-0.2
-0.4
-0.6
ut ~ IID(0, σ2 )
White Noise
Examples of non-stationary series
Share Prices
2.5
Exchange Rate
.7
2.25
.65
2
1.75
.6
1.5
.55
1.25
.5
1
.45
.75
.5
.4
.25
.35
0
50
100
150
200
250
300
350
400
450
0
9.3
Income
9.2
9.1
9
8.9
8.8
8.7
1960
1965
1970
1975
50
100
150
200
250
Unit Root Tests
How do you find out if a series is
stationary or not?
Order of Integration of a Series
A series which is stationary after being differenced once
is said to be integrated of order 1 and is denoted by I(1).
In general a series which is stationary after being
differenced d times is said to be integrated of order d,
denoted I(d). A series, which is stationary without
differencing, is said to be I(0)
Yt  b0 Yt  1  e t  I (1)
 Yt  Yt  Yt  1  b0 e t  I (0)
Informal Procedures to identify non-stationary processes
(1) Eye ball the data (a) Constant mean?
12
RW2
10
8
6
4
2
0
0
50
100
150
200
250
300
350
400
450
500
(b) Constant variance?
var
200
150
100
50
0
-50
-100
-150
-200
0
50
100
150
200
250
300
350
400
450
500
Statistical Tests for stationarity: Simple t-test
Set up AR(1) process with drift (b0)
Yt = b0 + b1Yt-1 + et
et ~ iid(0,σ2)
Simple approach is to estimate eqn (1) using OLS and examine
estimated b1
Use a t-test with null Ho: b1 = 1 (non-stationary)
against alternative Ha: b1 < 1 (stationary).
Test Statistic: TS = (b1 – 1) / (Std. Err.(b1))
reject null hypothesis when test statistic is large negative
- 5% critical value is -1.65
(1)
Distribution of a maximum of random
variables
Therefore...
Extreme value distribution of a uniform
distribution
From an ‘operational point of view’
rather than conceptual point of view Matlab
code ct53100.m
for j=1:100,
n=12;
for i=1:n,
x(i)=5*rand(1);
end
y(j)=max(x);
end
3
2
1
Frequency
Maximum water level [m]
4
0
2
4
6
Time period i
8
10
12
4
3
2
3
60
2
40
1
0
1
2
3
Monthly water level [m]
4
0
5
20
40
60
80
100
2
2.5
3
3.5
4
4.5
Annual maximum Water level [m]
5
Cumulative frequency
5
4
3
2
1
0
20
5
Cumulative frequency
5
Year j
Frequency
Water level [m]
5
0
2
4
6
8
Sorted monthly water level [m]
10
12
4
3
2
0
20
40
60
80
Sorted annual water level [m]
100
Back to the operational viewpoint
- Change the number of observations
- Change the distribution type
- From minimum to maximum
- etc.
From a visual point of view
Statistics wind:
30
30
Bron: KNMI (Schiphol, 1964)
Source:
V (m/s)
Weibull
20
20
10
10
0
0.2
0.1
0
0
kansdichtheid
probability
density
0
0.2
0.4
0.6
0.8
tijd (1/1/64 - 31/12/64)
time
Mean
: 5 m/s
Standard deviation
: 2.5 m/s
Introduction
1
visual point of view
instantaneous
maximum 1 year
maximum 50 years
5 m/s
20 m/s 28 m/s
wind speed
Introduction
visual point of view
Procedure for minima of r.v.’s
The extreme value distribution
Distribution type
Distribution parameters
Quantile values (design values)
Example: wind loads
Approach 1: extreme value distributions
The extreme value distribution
Limit theorem (Fisher-Tippet):
Maximum of many random variables has distribution:
Reverse Weibull (bounded maximum) or
Gumbel or
Frechet (bounded minimum)
regardless of parent distribution
Conditions:
Random variables are independent
Random variables have the same parent distribution
Approach 1: extreme value distributions
Extreme value distributions
Frechet
(concave)
Reverse Weibull
(convex)
Gumbel
(straight)
Approach 1: extreme value distributions
Generalized Extreme Value
distribution
All three extreme value distributions are
special cases of the Generalized Extreme
Value distribution (GEV):
 = 0 Gumbel
 > 0 Frechet
 < 0 Reverse Weibull
(EV type I for maxima)
(EV type II for maxima)
(EV type III for maxima)
Approach 1: extreme value distributions
Domain of attraction
Parent distribution
Asymptotic distribution
type of maximum
(domain of attraction)
Reverse Weibull
uniform, beta
(short tail)
Normal, exponential,
Gumbel
gamma, lognormal, Weibull
Pareto, Cauchy, Student-t
(fat tail)
Frechet
Approach 1: extreme value distributions
Example: wind load
Example:
Wind load
Maximum over 50 years
Quantile value at Ptarget = 0.15 (design
load)
Steps:
– Determine extreme value distribution type
– Determine distribution parameters
– Calculate requested quantile (design load)
Approach 1: extreme value distributions
Distribution type
Statistics: parent distribution is approx. Weibull
EV-theory: domain of attraction is Gumbel
Plot: monthly maxima of hourly averaged wind speeds
Slightly convex?
Approach 1: extreme value distributions
Gumbel probability plot
CDF on Gumbel probability paper:
Reverse Weibull-like deviation
(poor convergence)
Approach 1: extreme value distributions
Wind speed annual maxima Schiphol
Approach 1: extreme value distributions
Wind pressure monthly maxima
Schiphol
pressure:
q = 0.5 r U2
with:
r air density
U wind speed
Approach 1: extreme value distributions
Transformations
Presenting large datasets
In a histogram
On probability paper
Classify your data
order the n observations
Number of classes: 1 + 1.33 ln(n)
All classes have preferably the same
width
Histogram
Wave height (cm’s):
25 45 35 25 30 70
20 45 65 30 40 40
35 45 55 35 32 37
28 45 49 39 40 60
29 34 47 35 45 49
35 45 34 28 34 54
48 38 32 39 45 58
Histogram
#classes: 1 + 1.33 ln(42) ≈ 6
highest - lowest = 70 - 20 = 50
class width about 50 / 6 = 8 (we take 5 to
choose a round number)
Histogram 4
class
(unit=5)
17,5 - 27,5
27,5 - 32,5
32,5 - 37,5
37,5 - 42,5
42,5 - 47,5
47,5 - 57,5
57,5 - 77,5
frequency freq/width
3
7
9
6
8
5
4
3/2
7/1
9/1
6/1
8/1
5/2
4/4
Table 2.1 Groundwater chemistry from Noord-Branband, The Netherlands
Case study:
Groundwater
chemistry data set
(from;
Y. Zhou, 2006:
Hydrogeostatistics.
UNESCO-IHE
lecture note.)
X
(m)
133875
158975
156675
145325
143200
148238
165225
171500
163875
168463
177075
184550
182500
194688
137412
146000
157725
158600
143290
150705
167325
179563
173640
162600
171335
182675
192750
182065
146050
155915
163750
168200
177325
169813
82245
85975
93840
115480
117915
119620
127360
135725
121645
125395
76360
84060
94950
96000
104570
101830
113238
Y
EC
(m)
(mS/m)
407525 11.60
413725 83.20
419725 72.10
401775 39.70
410625 33.10
411875 67.90
422650 40.10
414450 24.00
407250 14.30
402625 49.80
405600 19.50
415063 40.30
403638 65.40
404850 62.80
391325 70.10
394525 26.40
399378 11.60
393350 88.40
376185 36.50
382290 14.20
392060 27.70
395513 68.20
388415 98.50
378238 15.70
380745 12.10
393300 25.20
397975 53.00
382505 23.20
367045 93.00
372595 63.90
368850
9.90
365138 70.80
372138 13.50
362163 33.20
404445 87.10
401085 62.30
406235 45.90
420695 112.60
407045 76.00
402050 26.50
419875 115.10
416615 62.90
408175 30.50
401095 39.00
392710 16.00
393685 35.30
391050 23.60
383625 63.30
391570 51.10
398440 51.60
399300 61.20
Ca
(mg/l)
9.63
159.00
135.00
51.20
59.30
74.20
67.70
3.84
17.70
48.20
8.34
52.70
63.40
84.10
83.80
49.20
15.70
167.00
23.30
6.69
45.30
51.30
172.00
11.20
2.12
11.90
55.50
35.50
82.10
69.00
7.63
64.80
18.30
10.40
34.50
109.00
41.50
147.00
73.30
18.40
121.00
125.00
54.90
52.90
3.68
30.10
4.31
66.10
15.00
33.30
60.00
Cl
(mg/l)
19.70
46.50
37.40
19.50
14.60
85.00
22.10
15.40
11.70
52.30
11.10
31.40
46.50
31.40
36.20
6.67
10.50
88.30
54.50
10.50
18.30
58.90
109.00
6.82
4.72
12.60
18.80
7.45
43.70
38.60
4.93
112.00
18.50
15.30
160.00
56.30
90.70
181.00
60.20
18.10
184.00
17.80
16.30
48.90
37.20
61.40
20.00
30.00
60.90
20.10
73.30
HCO3
(mg/l)
3.00
290.00
260.00
11.00
207.00
252.00
125.00
3.00
44.00
147.00
3.00
9.10
3.00
9.10
13.00
173.00
61.00
371.00
3.00
3.00
127.00
3.00
450.00
3.00
3.00
3.00
3.00
77.00
3.00
3.00
3.00
56.00
57.00
3.00
293.00
349.00
130.00
433.00
232.00
3.00
467.00
305.00
159.00
65.00
29.00
3.00
3.00
3.00
3.00
3.00
74.00
K
(mg/l)
1.72
1.58
4.00
8.28
1.16
3.73
1.13
1.67
0.77
4.54
9.21
2.78
25.81
16.37
10.99
1.24
0.51
1.47
10.40
4.09
0.72
43.23
1.77
2.57
1.55
2.93
24.71
1.15
41.72
10.99
1.70
2.42
1.51
1.83
2.73
2.16
4.21
2.71
42.34
3.33
3.17
0.32
1.05
1.66
2.18
2.10
2.32
44.80
4.18
13.39
19.38
Mg
(mg/l)
0.88
14.23
13.29
7.50
4.66
3.87
3.45
2.08
2.06
13.15
6.25
5.63
15.67
9.82
22.27
4.22
1.83
6.58
2.68
6.06
5.57
15.59
19.10
0.81
2.27
7.02
10.38
5.70
21.87
21.53
2.15
14.02
1.64
2.46
3.32
12.57
9.11
13.30
19.69
7.49
9.94
11.59
4.55
4.53
2.13
8.63
6.13
13.07
6.67
15.59
11.81
NO3
Na
(mg/l) (mg/l)
0.28
9.90
0.28 23.26
1.81 18.02
13.20 13.97
0.28 10.88
0.28 78.83
0.28 10.41
9.35
8.63
0.28
7.84
0.28 40.53
2.17 12.75
0.28 21.81
31.60 20.15
45.50 22.87
57.20 17.48
0.28
7.46
0.28
8.20
0.28 31.94
7.29 38.94
0.28
5.49
0.28
6.93
0.28 32.86
0.28 47.29
0.34
7.52
0.28
4.67
9.31
7.56
40.50 19.79
0.28
6.13
87.30 23.15
44.40 21.83
1.28
6.75
0.28 59.13
0.28 10.69
0.28 15.75
0.28 150.70
0.28 19.78
0.28 18.86
0.28 49.11
0.28 49.06
0.28 11.38
0.28 147.70
0.28 11.49
0.28
7.62
0.28 21.33
0.28 22.38
0.28 25.51
2.30 11.43
16.30 13.18
0.28 26.06
0.28 12.54
0.28 45.19
SO4
(mg/l)
21.20
189.00
149.00
112.00
2.10
44.50
83.10
61.20
19.60
81.10
61.00
150.00
137.00
94.60
83.50
1.80
2.50
76.90
57.70
41.40
21.40
242.00
60.40
45.90
27.10
62.70
81.20
49.10
90.60
105.00
31.10
166.00
2.10
72.10
2.20
1.70
2.10
1.00
144.00
86.50
11.10
105.00
16.40
74.00
2.20
79.20
76.60
207.00
80.90
173.00
151.00
Frequency tables
Step 1: Range of data
R = xmax - xmin
R(Cl) = 184 - 4.72 = 179.28 mg/l
Frequency tables
Step 2: Number of class intervals, m
Class number
6 < m < 25
m = 1 + 1.33 ln(N)
Example Cl: m should be around 6-7 class
Class width x:
x > R/m
Example Cl: 15 > 179.28/13=14
Class limits xj- (lower limit) and xj+ (upper limit):
xj- = x0 + (j-1)*Δx < values in class j < xj- + Δx =
xj+
Frequency tables
Step 3: Number of measurements per class
nj:
absolute frequency
fj=nj/n: relative frequency
Frequency tables
Step 4: Creating frequency table
Class interval
Absolute frequency
Relative frequency
0 < Cl < 15
15 < Cl < 30
30 < Cl < 45
45 < Cl < 60
60 < Cl < 75
75 < Cl < 90
90 < Cl < 105
105 < Cl < 120
120 < Cl < 135
135 < Cl < 150
150 < Cl < 165
165 < Cl < 180
180 < Cl < 195
11
15
8
7
4
2
1
2
0
0
1
0
2
20.80%
28.30%
15.10%
13.20%
7.50%
3.80%
1.90%
3.80%
0
0
1.90%
0
3.80%
Creating of a Histogram
Step 5: Absolute and relative frequencies
Abso lute fre quency histog ram
16
Absolute frequency
14
12
10
8
6
4
2
0
0
15
30
45
60
75
90
105
120
135
150
165
180
195
165
180
195
Cl (mg/l)
Re lative freque nc y histo gram
30
Relative frequency (%)
25
20
15
10
5
0
0
15
30
45
60
75
90
105
Cl (mg/l)
120
135
150
Frequency tables
Step 6: Cumulative frequency table
Upper class limit
Absolute frequency
Relative frequency
Cl < 15
Cl < 30
Cl < 45
Cl < 60
Cl < 75
Cl < 90
Cl < 105
Cl < 120
Cl < 135
Cl < 150
Cl < 165
Cl < 180
Cl < 195
11
26
34
41
45
47
48
50
50
50
51
51
53
20.80%
49.10%
64.20%
77.40%
84.90%
88.70%
90.60%
94.40%
94.40%
94.40%
96.30%
96.30%
100.00%
Frequency distribution
Step 7: Cumulative frequency distribution curve
Cumulative fre que ncy dis tribution
100
Cumulative frequency (%)
90
80
70
60
50
40
30
20
10
0
0
20
40
60
80
100
Cl (mg/l)
120
140
160
180
200
Some typical frequency distributions
P o s it iv e s ke we d dis t ribut io n
Relat ive f requency ( %)
Relative f requency (% )
Sy m m e t ric a l o r be ll- s ha pe d d is t ribut io n
X v a r i a b le
X v a ri a b l e
N e g a t iv e s ke we d dis t ribut io n
Relative f requency (% )
Relat ive f requency ( %)
B im o d a l dis t ribu t io n
X v a ri a b l e
X v a r i a b le
Statistical descriptors
Descriptors of central tendency
Mode: the value with largest frequency; average
value of measurements of the class with the
largest frequency
For Cl: the mode is 19.4 mg/l
Re lative freque nc y histo gram
30
25
Relative frequency (%)
Not applicable for
distributions with
several peaks
20
15
10
5
0
0
15
30
45
60
75
90
105
Cl (mg/l)
120
135
150
165
180
195
Statistical descriptors
Descriptors of central tendency
Median: the value corresponds to 50% of cumulative
frequency, the value of mid measurement for odd
number samples or the average of two mid
measurements for even number samples
For Cl:
median = 31.4 mg/l
Cumulative fre que ncy dis tribution
100
Insensitive to the tails or
outsiders of the distribution,
preferable for data sets with
exceptional values
Cumulative frequency (%)
90
80
70
60
50
40
30
20
10
0
0
20
40
60
80
100
Cl (mg/l)
120
140
160
180
20
Statistical descriptors
1. Descriptors of central tendency
Quartiles: split the data into quarters
– Lower quartile: 25% cumulative frequency
For Cl: lower quartile = 16.3 mg/l
– Upper quartile: 75% cumulative frequency
For Cl: upper quartile = 56.3 mg/l
In practice also the 1%, 5%,10%,
90%, 95% and 99% values are used
(e.g. discharge data).
Cl Rank
184
1
181
2
160
3
112
4
109
5
90.7
6
88.3
7
Percent
100.00%
98.00%
96.10%
94.20%
92.30%
90.30%
88.40%
85
8
86.50%
73.3
9
84.60%
61.4
10
82.60%
60.9
11
80.70%
60.2
12
78.80%
58.9
13
76.90%
56.3
14
75.00%
54.5
15
73.00%
52.3
16
71.10%
48.9
17
69.20%
46.5
18
65.30%
46.5
18
65.30%
43.7
20
63.40%
39.2
38.6
37.4
21
22
23
61.50%
59.60%
57.60%
37.2
24
55.70%
36.2
25
53.80%
31.4
26
50.00%
31.4
26
50.00%
30
28
48.00%
22.1
29
46.10%
20.9
30
44.20%
20.1
31
42.30%
20
32
40.30%
19.7
33
38.40%
19.5
34
36.50%
18.8
35
34.60%
18.5
36
32.60%
18.3
18.1
17.8
16.3
15.4
15.3
14.6
12.6
11.7
11.1
10.5
10.5
7.45
6.82
6.67
4.93
4.72
37
38
39
40
41
42
43
44
45
46
47
47
49
50
51
52
53
30.70%
28.80%
26.90%
25.00%
23.00%
21.10%
19.20%
17.30%
15.30%
13.40%
9.60%
9.60%
7.60%
5.70%
3.80%
1.90%
.00%
Statistical descriptors
1. Descriptors of central tendency
Arithmetic mean: average value of measurements
1 N
x =  xi
n i=1
For Cl: arithmetic mean = 43.72 mg/l
• More representative of the sample, sensitive to outsiders in a
small sample.
• Most distributions are sufficiently characterised by the mean
and the variance.
Statistical descriptors
1. Descriptors of central tendency
Geometric mean
x g = n x1* x2 * ...* xn
1 n
log ( x g ) =  log ( xi )
n i=1
For Cl: geometric mean = 29.53 mg/l
Not applicable for negative values. Often hydrogeological
variables are not symmetrical, but the log transformations are
symmetrical. Then geometric mean is applicable.
The radius of a grain with main axes a, b, and c is characterised
best by the third root of the a*b*c.
Statistical descriptors
1. Descriptors of central tendency
Harmonic mean
xh =
1
1 n 1

n i=1 xi
For Cl: harmonic mean = 20.08 mg/l
Appropriate for phenomenon where small values are more
important (e.g. hydraulic conductance; see lecture notes form
Zhou, 2006)
Statistical descriptors
Descriptors
Summary of properties
--------------------------------------------------------------------------------------------------------------------Mode
indication of abundant values, isolated property,
not applicable for distributions with several peaks.
--------------------------------------------------------------------------------------------------------------------Median
insensitive to the tails of the distribution, preferable
for data sets with exceptional values.
--------------------------------------------------------------------------------------------------------------------Arithmetic Mean more representative of the sample, sensitive to exceptional
values in a small sample. Most distributions are sufficiently
characterised by the mean and the variance.
--------------------------------------------------------------------------------------------------------------------Geometric mean not applicable for negative values. The radius of a grain with
main axes a, b, and c is characterized best by the third root
of the product a b c.
--------------------------------------------------------------------------------------------------------------------Harmonic mean more appreciate for phenomenon where small values are
more important.
Statistical descriptors
Relations between central tendency descriptors
xh < x g < x
The harmonic mean is smaller than the
geometric mean, and the geometric mean is,
in turn, smaller than the arithmetic mean.
They are equal only if x1 = x2 = ... = xn.
x = median
If the frequency distribution is symmetrical.
They are not equal when the distribution is
not symmetrical.
log ( x g ) = log (x)
The mean of the log x -distribution is equal to
the logarithm of the geometric mean of x.
Statistical descriptors
2. Descriptors of dispersion (variation)
Sample variance
s
2=
1 n
2
(
x
)
 xi
n i=1
s
2=
1 n 2 2
 xi - x
n i=1
For Cl: variance = 1768.72 [mg/l]2
Standard deviation: the square root of the
variance
For Cl: standard deviation = 42.06 mg/l
Statistical descriptors
2. Descriptors of dispersion (variation)
50
Frequency (100%)
40
30
20
10
0
-8
-7
-6
-5
-4
-3
-2
-1
0
1
2
Me an = 0
S tandard devi ati on = 1
S tandard devi ati on = 2
3
4
5
6
7
8
Statistical descriptors
2. Descriptors of dispersion (variation)
Coefficient of variation
s
Cv =
x
For Cl: coefficient of variation = 0.96
Useful to compare the variations of two or more
data sets.
Statistical descriptors
3. Descriptors of asymmetry
Coefficients of skewness
x - mode
P1 =
Moments or
s
product
moments!
m3
a3 = 3/2
m2
3( x - median)
P2 =
s
3=
2
n
a3
(n - 1)(n - 2)
For Cl: coefficient of skewness α3 =
1.97
Statistical descriptors
Moments or product moments
Central moments
1 n
k
=
(
x
)
 xi
mk
n i=1
Moments are statistical descriptors of a data set used for:
– 1st moment: mean or expected value, µx (“central tendency”)
– 2nd moment: variance, σ2x ; standard deviation σx is square root of
variance (“spread around the central value”)
– 3rd moment: skewness, γx
(“measure of symmetry”)
– 4th moment: kurtosis, κx
(“peakedness of central portion of distribution”)
Statistical descriptors
3. Descriptors of asymmetry
Symmetrical distribution
3
=0
X variable
Mode = Mean = Median
Skewned to the right
3
Mode
Mean
Median
X variale
>0
Skewned to the left
3
<0
X variale
Mean Mode
Median
Statistical descriptors
4. Descriptor of flatness/’peakedness’
Coefficient of kurtosis
a4 =
m4
m22
n3
4=
a4
(n - 1)(n - 2)(n - 3)
For Cl: coefficient of Kurtosis = 7.04
α4=3: for Normal distribution
α4>3: steeper than Normal distribution
α4<3: flatter than Normal distribution
Normal probability paper
Sort the data from small to large
Assign each observation to i/N+1 in which
i the order number and N the total
number of data points
deseasonalised daily mean run-off
Daily Mean Run-Off Anomalies at Achleiten Danube
River
POT
year
Flood frequency analysis (peak flows)
Annual maximum series (more common)
 One can miss a large event if more than one per year; but continuous
and easy to process
 Often used for estimating extremes in long records (>10 years)
Partial duration series (“Peaks-Over-Threshold, POT”)
 Definition of the threshold is tricky and requires experience
 Often used for short records (<10 years)
Daily discharge (m3/s)
150
at Stah
125
100
threshold
75
50
25
0
25-Mar-60
2-Dec-73
11-Aug-87
19-Apr-01
Flood frequency
analysis
(peak flows):
Annual max.
series
vs.
partial duration
series
(Davie, 2002)
Annual max. series vs. partial
duration series (POT)
Langbein showed the following relationship
(Chow 1964):
1/T = 1- e-(1/Tp)
T : return period using annual max. series
Tp: return period using partial duration series
Differences get smaller for larger return periods
(less than 1% difference for a 10-year
recurrence interval)!
Assumptions of frequency analysis
All data points are correct and precisely measured
 Be aware of the uncertainty of peak flow data (uncertainty and errors come
later in this course!)
Independent events: peak flows are not part of the same event
 Carefully check the data set; plot the whole record, in particular all events of
the POT series
 Problems with events at the transition of the year (31 Dec – 1 Jan) in humid
temperate or some tropical climates
Random sample: Every value in the population has equal chance of being
included in the sample
The hydrological regime has remained static during the complete time period
of the record
 No land use change, no climate change, no changes in the river channels, no
change in the flood water management etc. in the catchment (often not the
case for long records!)
All floods originate from the same statistical population (homogeneity)
 Different flood generating mechanisms (e.g. rain storms, snow melt, snowon-ice etc.) might cause floods with different frequencies/recurrence intervals
Describing the frequency mathematically:
Probability Distribution Function
Typically defined in either of two forms:
• Probability density function (PDF)
• Cumulative distribution function (CDF)
CDF
PDF
Discrete
PX  x  p( x)  px
x
F ( x)   f (i )
i 0
b
Continuous pa  x  b   f ( x)dx
a
x
F ( x) 
 f (m )dm

Basics (examples using measured flow, Q)
Probability of exceedence, P(X): probability that the flow Q is
greater or equal X; P(X)ε[0,1]
Relative frequency, F(X): probability of flow Q being less than a
value X; F(X)ε[0,1]. Can be read from a cumulative
probability curve, but be careful with the selected class
intervals.
Average recurrence interval or return period, T(X): statistical
term meaning the chance of exceedence once every T
years over a long record (time step is usually one year).
– Not exactly the number of years that are between certain size
events!
– More the average number of years, in which flow is greater than X!
– No regularity or periodicity in occurrences of exceedences
(assumption)
P(X) = 1–F(X)
T(X) = 1/P(X) = 1/(1-F(x))
Relative frequency F(X)
Probability of exceedance P(X)
PDF
Probability of exceedance P(X)
Relative frequency F(X)
CDF
Recurrence intervals for design
purposes (flood protection) in Germany
Class 1
 Settlements, urban areas, important infrastructure:
50-100 years
Class 2
 Single buildings, not always inhabit neighborhoods:
25-50 years
Class 3
 Farm land, intensively used: 10-25 years
Class 4
 Farm land; 5-10 years
What about large dams, nuclear power plants etc.?
PMF
(according to DIN 19700, part 99)
Exceedence probability for a specified number of
time intervals (see Box C-4, page 561 in Dingman, 2002)
Examples: Exceedence probability and return
period (based on Box C-4 in Dingman, 2002)
What is the probability that a flood greater or equal a 100-year flood will occur
next year?

P(X) = 1/T(X) = 0.01
What is the probability that we will not have a flood that is greater or equal the
50-year flood next year?

F(X) = 1-P(X) = 1-0.02 = 0.98
What is the probability that we will not have a flood that is greater or equal the
20-year flood in the next 5 years?

F(X) = (1-P(X))n = 0.955 = 0.774
What is the probability that the next exceedence of the 100-year flood will occur
in the 10th year from now on?

p = (1-0.01)9 x 0.01 = 0.00916
What is the probability that the 100-year flood will be exceeded at least once in
the next 40 years?

p = 1-(1-0.01)40 = 0.331
What is the probability that the 50-year flood will be exceeded twice in a row
(two independent events in one year), and how many 50-year floods can
be expected on averages in 1000 years?

p = 0.02 x 0.02 = 0.0004; and on average 20 floods in 1000 years.
=
(
)
-
=
=
(Bedient and Huber, 2002)
But how do we estimate P(X) and F(X) from data?
Example: Flood frequency analysis
Daily discharge (m3/s)
150
at Stah
125
100
75
50
25
0
25-Mar-60
2-Dec-73
11-Aug-87
19-Apr-01
Example: Annual max. series for the river Wye (1971-97)
(not Normal-distributed!)
(Davie, 2002)
Plotting position – Weibull formula
Rank the annual maximum series data from low to high
(independent data, the year of occurrence is irrelevant)
Calculate F(X) with the rank r and N total data points (i.e.
length of record: N years)
F(X) = r/(N+1)
– For example: The largest value of a 25 year record would plot at
a recurrence interval of 26 years.
– F(X) can never reach 1
– If you rank from high to low P(X) = 1-F(X) is calculated
Example: Annual max. series for the river Wye (1971-97)
(not Normal-distributed!)
(Davie, 2002)
[Please note:
F(X)=p
in the Workshop
course note]
Gringorten formula
F(X) = (r-0.44) / (N+0.12)
Difference to Weibull formula is often not great
Use is often down to personal preferences
Empirical constants (0.44 and 0.12) are valid for
Gumbel distribution
Comparison of Weibull and Gringorten
formulae (Davie, 2002)
Example: Annual max. series for the river Wye (1971-97)
(not Normal-distributed!)
(Davie, 2002)
Example: Annual max. series for the river Wye (1971-97)
(not Normal-distributed!)
Reliability
is good!
(Davie, 2002)
Extrapolation beyond the data set
Weibull or Gringorten formulae only good for flood frequency
estimations for flows within the measured record, and
even unreliable near either limiting value
For extrapolation the fit of a probability distribution is needed.
Estimate the parameters through:
1. Method of moments (widely used)
2. Method of L-moments (less widely, used, quite complex)
3. Method of maximum likelihood (not widely used)
An alternative is a graphical approach to fit the distribution
(subjective approach)
Choice of the distribution function often based on personal
preferences (but always take the distribution that fits your
data best in a particular region), but there are sometimes
guidelines (depend on the region)
Extreme values are usually not normally distributed, however,
mean annual flows in humid areas are often normally
distributed
Method of moments – Example:
Gumbel distribution
Product moments are statistical descriptors of a data set (characterize
the probability distribution):
– 1st moment: mean or expected value, µx (“central tendency”)
– 2nd moment: variance, σ2x ; standard deviation σx is square root of
variance (“spread around the central value”)
– 3rd moment: skewness, γx
(“measure of symmetry”)
– 4th moment: kurtosis, κx
(“peakedness of central portion of distribution”)
– Coefficient of variation: measure of spread
CV = σx/µx
γx < 0
(L-moments are used for small sample sizes;
see Dingman (2002), Appendix C)
γx > 0
Method of moments – Gumbel distribution
F(X) = exp(-exp(-b(X-a)))
a = mean(Q) - 0.5772/b
b = π/(σQ 60.5)
F(X) leads to P(X) and T(X) for a certain size of flow X
Re-arranging the formulae leads to the size of flow for a
given recurrence interval:
X = a – 1/b ln ln(T(X)/(T(X)-1))
Example: for the 50-year flood, you need to compute the natural logarithm of
50/49 and then the natural logarithm of this result. The parameters a and b
are estimated from the sample.
Rule of thumb: Do not extrapolate recurrence intervals
beyond twice the length of your stream flow record
Example: Annual max. series for the river Wye
(1971-97); values required for the Gumbel formula
Mean (Q)
21.21
Standard deviation (σQ)
6.91
a value
18.11
b value
0.19
Applying the method of moments and Gumbel formula to the
data gives some interesting results. The values used in the formula
are shown in the table above and can be easily computed. When
the formula is applied to find the flow values for an average
recurrence interval of 50 years it is calculated as 39.1 m3/s. This is
less than the largest flow during the record which under the Weibull
formula has an average recurrence interval of 29 years. This
discrepancy is due to the method of moments formula treating the
highest flow as an extreme outlier. If we invert the formula we can
calculate that a flood with a flow of 48.87 m3/s (the largest on
record) has an average recurrence interval of around three
hundred years.
(Davie, 2002)
Distributions often used in hydrology (1/2)
(Dingman, 2002)
Distributions often used in hydrology (2/2)
(Dingman, 2002)
Use of common distributions in hydrology
Flood frequency analysis
Most commonly applied are the Exponential (EXP), Log-Normal (LN), LogPearson 3 (LP3) and Generalised Extreme Value (GEV) distributions.
In practice
The choice of probability distribution may be dictated by mathematical
convenience or by familiarity with a certain distribution (“personal bias”).
Sample estimators must be adopted in order to obtain estimates of the statistics
for determining the distribution parameters. In some cases, more than one
distribution may fit the available data equally well.
General three-step procedure
1) A suitable form of standard frequency distribution is chosen to represent the
observations;
2) the chosen distribution is fitted to the data by determining values for its
parameters; and
3) the required quantiles are computed from the fitted cumulative distribution
function (CDF).
Distributions often used in hydrology
1. Normal distribution (ND)
Probably the most important distribution, but often not
useful for hydrological extremes
PDF:
2


1
1
x

c



f ( x) 
exp   
 2  a  
a 2


a: scale parameter = standard deviation, σ
c: location parameter = mean, μ
Standard Normal Distribution (S-ND)
y
CDF:
G( y) 


(compare lecture of Dr. Zhou)
1
u2
exp(  )du
2
2
f ( y) 
y
e
 y2 / 2
2
xm
s
Location parameters
A probability distribution is characterized by location and scale parameters.
Location parameter equal to zero and scale parameter equal to one (StandardND) vs. ND with a location parameter of 10 and a scale parameter of 1.
Scale Parameter
The next plot has a scale parameter of 3 (location parameter is zero). The
effect is that the graph is ‘stretched out’ .
Probability Distribution Functions
2.
The Lognormal distribution
y = ln(x)
If y follows the Normal distribution, x follows the
Log-normal distribution.
Lognormal probability density function:
2
ln
(x)
m
1
1
y
) ]
f(x)=
exp[- (
2
x s y 2
sy
= 0 elsewhere
when x > 0
Lognormal distribution function
0.6
0.5
Frequency
0.4
0.3
0.2
0.1
0
0
1
2
3
4
X variale
Standard deviation = 1
Standard deviation = 2
Lognormal distribution is skewed to the right!
5
Effects of shape parameters for the
lognormal distribution
PDF
CDF
Processing Log-normal Distribution Function
Histogram indicates positive skewness
distribution
Plot of cumulative frequency on Lognormal
probability paper shows a straight line
Take logarithm transformation of data
y = ln(x)
Calculate sample mean and standard
deviation of logarithmic values
Carry out analysis on logarithmic values
3. Pearson type III distribution
(Often used for flood frequencies)
PDF:
f ( x) 
1
 xc
b 1
(
x

c
)
exp


b
a (b)
a 

a - scale parameter
c - location parameter
b - shape parameter

Г( ) – Gamma function: (b)   t b1e t dt
0
 commonly fitted to the logarithms of floods (socalled log-Pearson type III distribution)
4. Gamma Distribution (when c=0)
Effects of shape parameter, gamma
PDF
CDF
5. Exponential distribution
Particularly useful when applying partial duration
series
PDF:
f ( x)   exp  x
With a mean of 1/λ, a variance of 1/λ2, and a skewness of 2.
Standard Exponential Distribution
CDF:
G ( y )  1  exp( y )
Plots of exponential distribution
CDF
PDF
Example: Exponential distribution applied to
storm interval times (from Bedient & Huber 2002)
6. General extreme value (GEV) distribution
CDF:
  k ( x  c) 1/ k 
F ( x1 )  exp  1 


a
 
 
c - location parameter
k - shape parameter
for k  0
a - scale parameter
k=0 Extreme value type I (EV1) (Gumbel)
k<0 Extreme value type II (EV2)
k>0 Extreme value type III (EV3)
closely related to the Weibull distribution
Comparison of the three types of
GEV distributions (PDF)
x
Type II, k<0
Type I, k=0
Type III, k>0
x=c
y1
0
If the sample for which frequency distribution is required exhibits
skewness, a three-parameter distribution is useful (e.g. GEV).
General Extreme Value distribution
Type I (= Gumbel distribution)
-> widely used for annual maximum series!
(Note: little different in description (use of parameters) than
in the example of the river Wye, from Davie (2002))
CDF:

 x1  c 
F ( x1 )  exp  exp  

a 


PDF:
 x1  c
1
 x1  c 
F ( x1 )  exp 
 exp  

a
a
a 


CDF:
G( y1 )  exp exp  y1 
(standardized)
x1  c
y1 
a
Plots of Gumbel distribution
CDF
PDF
How good is the fit of the distribution
function?
• Graphic check: visual check of the plotted graph
(How good are the observations reproduced by the fitted
PDF/CDF?)
• Mathematical check: statistical test to determine the
goodness fit
- chi-square (χ2 ):
PDF
(needs much data, depends on classified intervals)
- Kolmogorov-Smirnov (K-S) tests:
CDF
 Not necessary to divide the data into intervals; thus error associated
with the number and size of intervals is avoided.
 Good if n>35 and even better if n>50.
 Quick and easy, but only one value is considered)
- Unfortunately, often several distributions provide acceptable fits to
the available data (no identification of the “true” or “best”
distribution); confidence limits are too large
Graphical method
(Example of the river Wye)
Is this fit suitable for the
whole data set?
Frequency of flows less than a value X. The F(X) values on the
x-axis have undergone a transformation to fit the Gumbel distribution;
called ‘reduced variate’ (cf. Workshop in Hydrology).
Comparison of different PDFs
(Significant differences in particular for the extremes!)
Example: Kolmogorov-Smirnov (K-S) tests
(according to Schoenwiese 2000)
Dn  1 max FX ( xi )  S n ( xi )
n

P( Dn  Dn )  1  
(see sketch on
black board)
FX(xi) - CDF of the assumed distribution
Sn(xi) - CDF of the observed ordered sample
If Dn ≤ the tabulated value (see below) Dnα, the assumed
distribution is acceptable at the significance level α
(n: sample size).
α
Dnα
0.20
1.073/n0.5
0.10
1.224/n0.5
0.05
1.358 /n0.5
0.01
1.628/n0.5
0.001
1.040/n0.5
Variability of quantile estimates –
confidence limits
Sources of errors
• assumption of a particular distribution (cannot be quantified)
• sampling errors in estimation of the parameters of the
distribution (quantifiable through standard errors)
Confidence limits (CL): 100(1-α)% confidence intervals
QT : quantile
CL(QT )  QT  t1 / 2 SEq
Cs
Standard error: SEq 
n
C : constant (see Workshop page 13;
course notes from Hall page 31f)
σ : standard deviation from a
sample of size n
t1-α/2 : value of Student’s tdistribution for a 97.5% level of
confidence (two-tailed test) and
(n-1) degrees of freedom; tabulated
Example: Lognormal with 90%-confidence limits
(Bedient & Huber 2002)
Example: Estimation of confidence limits
(according to Schoenwiese 2000)
The mean annual temperature at the Hohenpeissenberg, Germany, for
the period 1954-1970 (n=17; Normal-distributed C=1) is 6.24 0C with a
standard deviation of 0.73 0C.
CL(QT )  QT  t1 / 2 SEq
The confidence limits (α = 5%) can be calculated as:
CL (mean annual temperature in 0C) = 6.24 ± t97.5% (0.73/170.5) 0C
= 6.24 ± 0.38 0C
The mean of the annual temperature is at significance level of 95%
in the interval of 6.24 ± 0.38 0C, thus between 5.86 and 6.62 0C.
Please note, if the records lengths would have been 120 years
(= n) with the same standard deviation, the interval would be
6.24 ± 0.13 0C.
A few remarks on:
Low flow frequency analysis
Data required: annual minimum series
Problem of independent events; do not split the year in the
middle of low period (i.e. low flow periods can be long)
Often zero-values (e.g. in arid climates or cold climates)
There is finite limit on how low a low flow can be (no
negative flows!)
 Different statistical treatment of the data
 Fit an exponential distribution rather than, for instance, a lognormal distribution
 Other often used distributions are the Weibull, Gumbel, Pearson
Type III, and log-normal distributions
A few remarks on:
Low flow frequency analysis
Figure 7.15 Two probability density
functions. The usual log-normal
distribution (solid line) is contrasted
with the truncated log-normal
distribution (dashed line) that is
possible with low flows (where the
minimum flow can equal zero).
Figure 7.16 Probability values
(calculated from the Weibull
sorting formula) plotted on a log
scale against values of annual
minimum flow (hypothetical
values).
Application in the Rur river (7-1)
Station Stah (Germany)
2245 km2
area
record 1953 to 2001
Daily discharge (m3/s)
150
at Stah
125
100
75
50
25
0
25-Mar-60
2-Dec-73
11-Aug-87
Prepared by Tu Min (2004)
19-Apr-01
Application in the Rur river (7-2)
Annual flood peaks (1954 - 2001)
3
Discharge (m /s)
the water years
200
(Nov - Oct)
Stah on the Roer
150
100
50
0
1950
1960
1970
1980
1990
2000
Homogeneity
(change point)
3
Statistical tests
Discharge (m /s)
200
Stah on the Roer
150
88
100
50
0
1950
1980
66
1960
1970
1980
1990
2000
Application in the Rur river (7-3)
Flood frequency analysis
Example – Normal distribution
3
water year peak Q (m /s) rank (m) Q in order p=(m-0.375)/(N+0.25)
t
Xcal
UCL
LCL
1954
65
1
28
0.013
-2.228
12.70
1955
76
2
31
0.034
-1.829
24.05
1956
66
3
34
0.054
-1.604
30.47
1957
59
4
34
0.075
-1.439
35.17
1958
71
5
36
0.096
-1.306
38.96
1959
65
6
44
0.117
-1.192
42.19
28.24
37.65
43.04
47.03
50.26
53.04
-2.83
10.44
17.90
23.32
27.66
31.33
150
Normal
Peak (m /s)
100
3
N = 48
μ = 76.1
σ = 28.5
t97.5% = 2.011
50
0
-3
-2
-1
0
1
Standard variate
Observed
Fitted
95% CL
Series4
2
3
Application in the Rur river (7-4)
Flood frequency analysis
Example – LN distribution
Y  ln( X )
5.5
N = 48
5.0
Ln (Q)
μy = 4.3
LN
4.5
4.0
σy= 0.4
t97.5% = 2.011
3.5
3.0
-3
-2
-1
0
1
Standard variate
Observed
Fitted
95% CL
Series4
2
3
Application in the Rur river (7-5)
Flood frequency analysis
Example – LN3 distribution
N = 48
Y  ln( X   )

Xmin = 27.6
Xmax = 139.3
μy = 5.0
LN3
5.0
σy= 0.2
t97.5% = 2.011
x(min)  x(max)  2 xmed
5.5
Ln (Q+73)
Xmed = 73.1
x(min) x(max)  x 2 med
4.5
-3
-2
-1
0
1
Standard variate
Observed
Fitted
95% CL
Series4
2
3
Application in the Rur river (7-6)
Flood frequency analysis
N
48
Example – Gumbel distribution
μ
76.1
2
σ
α
ζ
water year
810.9
22.2
63.3
peak Q (m3/s) Rank (m) Q in order
Distribution fitting
Fi
Yi
Xest
baised var(Xest)
LCL
UCL
1954
65
1
28
0.012
-1.494
30.2
34.2
18.4
41.9
1955
76
2
31
0.032
-1.232
36.0
26.7
25.6
46.3
1956
66
3
34
0.053
-1.076
39.4
22.9
29.8
49.1
1957
59
4
34
0.074
-0.957
42.1
20.4
33.0
51.2
1958
71
5
36
0.095
-0.857
44.3
18.6
35.6
53.0
200
Discharge (m 3/s)
200
Gumbel (1954-2001)
3
100
Observed
Fitted
95% CL
Series4
50
0
-2
-1
100
50
0
200
0
1
2
3
Reduced Gumbel variate
4
Discharge (m 3/s)
Peak (m /s)
150
5
1980-2001
150
-2
-1
0
1
2
3
Gumbel variate
4
5
0
1
2
3
Reduced Gumbel variate
4
5
Reduced
1974-1979
150
100
50
0
-2
-1
Application in the Rur river (7-7)
Flood frequency analysis
Example – Gumbel distribution
Magnitude of T-year flood

 1 
QT  ay  c y   ln  ln 1  
 T 

22.2
22.7
20.4
63.3
52.9
76.2
N-1 =
t97.5% =
47
1.50
2.25
3.20
3.90
4.31
4.60
96.6
113.3
134.3
150.0
159.0
165.5
86.9
104.0
125.5
141.4
150.7
157.3
106.9
122.2
141.6
156.0
164.4
170.3
40.4
73.7
133.9
191.7
230.4
260.1
160
2.01
Return
Reduced
1954-2001 1958-1979 1980-2001 Var (Est) LCL
Period (yr) variate (yi)
5
10
25
50
75
100
200
83.8
96.0
111.1
122.1
128.5
133.0
Discharge (m 3/s)
α=
ζ=
58-79
80-01
Magnitude
of T-year flood
54-01
LCL(54-01)
UCL (54-01)
UCL
109.4
130.5
157.6
177.8
189.5
197.9
120
80
1
2
3
4
Reduced Gumbel variate
5
Take home messages
Frequency analysis, in particular of hydrological extremes,
is prerequisite for sustainable water resources
management
Annual max. series or partial duration series depends on
the length of the record
Understanding of probability, recurrence interval and risk
Knowledge of often used statistical distributions
Calculation of confidence intervals
Test of the goodness of fit of a PDF or CDF
Specialty of low flow values
Closure
•
•
•
•
•
•
probability and event
stochastic variables, cont. and discrete
Transformations
Joint distributions
Linear regression
Parameter estimation (with Bestfit)
Closure
The role of statistics in hydrology and water
resources is what?
You have now knowledge of
 Frequency tables,
 Histograms,
 Distributions functions,
 statistical descriptors, and
 Standard Normal Distribution
distribution.
and
Log-Normal
Are you now well prepared for the two
assignments?
Assignment 1: Basic Statistics
Use the data set from Van Gelder’s website
1. Calculate the range and estimate a reasonable number of intervals as
well as class limits.
2. Calculate the relative and absolute frequencies.
3. Make a histogram and cumulative frequency distribution, hand drawn
on linear paper (figures). Interpret this briefly (one sentence!)
4. Calculate the median, mode, and arithmetic mean.
5. Calculate the variance, standard deviation and coefficient of variation.
6. Calculate the skewness and kurtosis. Interpret the results briefly (one
sentence!)
7. Plot the data as a graph on normal probability paper (figure). Is the
Normal-Distribution suitable for that data set? Compare the mean value
and standard deviation from your graph with your results in question
4+5.
Deliver your printed report to Pieter van Gelder on November 5th
at 15.45h at the start of the lectures
Assignment 2: Frequency Analysis
Download your dataset from Van Gelder’s
website
Determine the PDF with the lowest Chi-Square
value in Bestfit
Include in your report a plot of the observations
and optimal fit
Extrapolate the fitted Exponential distribution to
a 10^-3 /yr quantile
Calculate the 95% Confidence Bounds around
the 10^-3 /yr quantile for the Exponential fit
Deliver your printed report to Pieter van Gelder on
November 6th at 8.45h at the start of the lectures
Assignment 3: Transformations of
distributions
Download your parameters a and b from Van Gelder’s website
Generate 28 random numbers from the Uniform distribution with
lowerbound a and upperbound b
Plot your data in a histogram and draw the PDF of the uniform
distribution in the same plot
Generate 100 sets of 28 random numbers from the above Uniform
distribution and take from each set the maximum number
Plot these 100 maxima in a histogram, derive the theoretical
distribution function for these 100 maxima and draw the PDF in
the same plot
Assume that the above 100 numbers are monthly maximum wind
speeds. Transform your wind speeds to wind pressures and plot
your wind pressures in a cumulative distribution plot.
Deliver your printed report to Pieter van Gelder on
November 13th at 17.00h at the reception of UNESCO-IHE
Final mark for Review of statistics and
frequency analysis (module 1)
Weight factor of all computer exercises is
0.5 in your final mark
Weight factor of written test is 0.5 in your
final mark