Random Variables, Probability Distributions and Moments


Concepts in Probability, Statistics
and Stochastic Modeling
• Loucks et al., 2005, Chapter 7
Learning Objective
• Be able to use probability and statistics to quantify
uncertainty and natural variability in physical
quantities
How to Express a Distribution
• Cumulative distribution
• Probability density
• Probability plot
• Equation
Which method conveys the
information best to you?
Carl Friedrich Gauß, immortalized
A random variable X is a variable whose
outcomes (values) are governed by the
laws of chance.
Probability density function
$$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx$$
[Figure: probability density f(x) versus x]
Cumulative distribution function
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt, \qquad f(x) = \frac{dF}{dx}$$
[Figure: cumulative distribution F(x) versus x]
Continuous and Discrete Random
Variables
From: Loucks, D. P., E. van Beek, J. R. Stedinger, J. P. M. Dijkman and M. T. Villars, (2005), Water Resources Systems Planning and
Management: An Introduction to Methods, Models and Applications, UNESCO, Paris, 676 p, http://hdl.handle.net/1813/2804
Generating a random variable from a
given distribution
[Figure: uniform CDF F(u) versus u and target CDF F(x) versus x; a value U on the F axis maps to X on the x axis]
1. Generate U from a uniform distribution between 0 and 1
2. Solve for $X = F^{-1}(U)$
$F^{-1}(U)$ is randomly distributed with
CDF F(x)
Basis
$P(X < x) = P(U < F(x)) = P(F^{-1}(U) < x)$
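The two steps above can be sketched in Python. This is a minimal illustration, assuming an exponential target distribution, chosen here only because its CDF $F(x) = 1 - e^{-rx}$ inverts in closed form. The function name, the rate value and the seed are hypothetical.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw one exponential variate by inverting F(x) = 1 - exp(-rate * x)."""
    u = rng.random()                    # step 1: U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate    # step 2: X = F^{-1}(U)

rng = random.Random(42)   # fixed seed so the sketch is repeatable
rate = 0.5                # hypothetical rate parameter
draws = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)        # should land near the theoretical mean 1 / rate
```

For distributions without a closed-form inverse, the same idea applies with a numerical solve for X in step 2.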
Generating a pseudo-random number
• There is a lot of lore about this. Refer to: Press, W. H., B. P.
Flannery, S. A. Teukolsky and W. T. Vetterling, (1988), Numerical
Recipes in C: The Art of Scientific Computing, Cambridge
University Press, New York, 735 p.
• Congruential method
$r_{next} = \text{remainder of } [(r_{prev} \times a + c) \div m]$
• Each r is an integer random number between 0 and m-1. Dividing by
(m-1) gives a number between 0 and 1. The sequence repeats after at most m
numbers. Numerical Recipes gives "good" choices for a, c and m.
• R has built-in functions to generate random numbers: runif for the uniform
distribution, and others for other distributions, e.g. rnorm, rgamma.
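A minimal sketch of the congruential method in Python. The default constants a = 1664525, c = 1013904223 and m = 2^32 are a widely quoted set usually attributed to Numerical Recipes. They are used here only as an illustration, and the function names are hypothetical.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield integers in 0..m-1 from r_next = (r_prev * a + c) mod m."""
    r = seed
    while True:
        r = (r * a + c) % m
        yield r

def uniform_stream(seed, m=2**32):
    """Scale the integer stream to [0, 1] by dividing by (m - 1)."""
    for r in lcg(seed, m=m):
        yield r / (m - 1)

gen = uniform_stream(seed=12345)
first_five = [next(gen) for _ in range(5)]
print(first_five)
```

For real work, a library generator (runif in R, or the random module in Python) is the safer choice. The sketch only shows the recurrence.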
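Moments of Random Variables

Mean
Population: $\mu = \int_{-\infty}^{\infty} x f(x)\,dx$
Sample: $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$

Expectation
Population: $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
Sample: $\hat{E}(X) = \frac{1}{N}\sum_{i=1}^{N} X_i$

Expectation operator
Population: $E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$
Sample: $\hat{E}(g(X)) = \frac{1}{N}\sum_{i=1}^{N} g(X_i)$

Variance
Population: $\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = E([X - E(X)]^2)$
Sample: $S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2$

Skewness
Population: $\gamma = \frac{1}{\sigma^3}\int_{-\infty}^{\infty} (x-\mu)^3 f(x)\,dx = E([X - E(X)]^3)/\sigma^3$
Sample: $\hat{\gamma} = \frac{\frac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})^3}{S^3}$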
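The sample estimators above can be computed directly. This is a minimal sketch that follows the slide's conventions (N-1 in the variance, 1/N in the third moment). The data values and the function name are hypothetical.

```python
import math

def sample_moments(xs):
    """Return (mean, variance, skewness) using the sample formulas above."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)         # S^2
    s = math.sqrt(var)
    skew = (sum((x - mean) ** 3 for x in xs) / n) / s ** 3   # gamma-hat
    return mean, var, skew

values = [2.0, 4.0, 4.0, 5.0, 10.0]   # hypothetical data
mean, var, skew = sample_moments(values)
print(mean, var, skew)
```

Other skewness estimators apply small-sample bias corrections. This sketch does not.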
L-Moments
$\lambda_2 = \tfrac{1}{2} E[X_{(2|2)} - X_{(1|2)}]$
Probability weighted moments
L-moment estimators
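The slide's estimator formulas did not survive extraction, so the following is a sketch under stated assumptions. It uses the commonly used unbiased probability weighted moment estimators b0 and b1, with the first two L-moments taken as l1 = b0 and l2 = 2*b1 - b0. As a check on the definition of $\lambda_2$ as half the expected difference between the larger and smaller of two draws, it also computes half the mean absolute difference over all pairs. All names and data are hypothetical.

```python
from itertools import combinations

def sample_l_moments(data):
    """Return (l1, l2) from the probability weighted moments b0 and b1."""
    xs = sorted(data)
    n = len(xs)
    b0 = sum(xs) / n
    # j is the rank from smallest (1) to largest (n)
    b1 = sum((j - 1) / (n - 1) * x for j, x in enumerate(xs, start=1)) / n
    return b0, 2.0 * b1 - b0

def half_mean_abs_difference(data):
    """Average of |a - b| / 2 over all distinct pairs of observations."""
    pairs = list(combinations(data, 2))
    return 0.5 * sum(abs(a - b) for a, b in pairs) / len(pairs)

values = [2.0, 4.0, 4.0, 5.0, 10.0]   # hypothetical data
l1, l2 = sample_l_moments(values)
l2_check = half_mean_abs_difference(values)
print(l1, l2, l2_check)
```

The two routes to l2 agree, which illustrates why L-moments are less sensitive to extreme values than moments built from squared or cubed deviations.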
L-Moment Diagrams
From: Loucks, D. P., E. van Beek, J. R. Stedinger, J. P. M. Dijkman and M. T. Villars, (2005), Water Resources Systems Planning and
Management: An Introduction to Methods, Models and Applications, UNESCO, Paris, 676 p, http://hdl.handle.net/1813/2804
From: Salas, J. D., J. W. Delleur, V. Yevjevich and W. L. Lane, (1980), Applied Modeling of
Hydrologic Time Series, Water Resources Publications, Littleton, Colorado, 484 p.
Fitting a probability distribution to data
Hillsborough River at Zephyr Hills, September flows
$\bar{x}$ = 8621 mgal, S = 8194 mgal, n = 31
[Figure: density of September flows, 0 to 35000 mgal]
Method of Moments
• Using the sample moments as the estimate
for the population parameters
$\hat{E}(X) = \bar{x}; \quad \widehat{\mathrm{Var}}(X) = S^2$
Method of Moments
Gamma distribution
$$f(x) = \frac{\beta^{\alpha} x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}$$
$\hat{\beta} = \bar{x}/S^2 = 1.3 \times 10^{-4}$
$\hat{\alpha} = \bar{x}^2/S^2 = 1.1$
[Figure: fitted gamma density versus flow, 0 to 35000 mgal]
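The moment-matching step for the gamma distribution can be sketched as follows. It assumes the rate parameterization, with mean α/β and variance α/β². The inputs are the rounded sample mean and standard deviation quoted for the Hillsborough River flows, and the function name is hypothetical.

```python
def gamma_method_of_moments(mean, std):
    """Match mean = alpha / beta and variance = alpha / beta**2."""
    beta = mean / std ** 2
    alpha = mean * beta      # equivalently mean**2 / std**2
    return alpha, beta

alpha_hat, beta_hat = gamma_method_of_moments(8621.0, 8194.0)
print(round(alpha_hat, 2), f"{beta_hat:.2e}")
```

With these inputs the estimates come out at roughly α ≈ 1.1 and β ≈ 1.3 × 10^-4 per mgal.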
Method of Moments
Log-Normal distribution
$$f(x) = \frac{1}{x\,\sigma_y\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln(x) - \mu_y}{\sigma_y}\right)^2\right]$$
$CV = S/\bar{x}$
$\hat{\sigma}_y^2 = \ln(CV^2 + 1) = 0.643$
$\hat{\mu}_y = \ln(\bar{x}\,\exp(-\tfrac{1}{2}\hat{\sigma}_y^2)) = 8.29$
[Figure: fitted log-normal density versus flow, 0 to 35000 mgal]
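A sketch of the log-normal moment formulas, using the rounded sample mean (8621 mgal) and standard deviation (8194 mgal) quoted for the Hillsborough River flows. The function name is hypothetical. With these rounded inputs the variance of the logs comes out near 0.64, in line with the slide. The log-mean formula evaluates to about 8.74 rather than the 8.29 shown, so the slide value was presumably obtained from different inputs or a different calculation.

```python
import math

def lognormal_method_of_moments(mean, std):
    """Return (mu_y, sigma2_y), the mean and variance of ln(X)."""
    cv = std / mean
    sigma2_y = math.log(cv ** 2 + 1.0)
    mu_y = math.log(mean) - 0.5 * sigma2_y   # same as ln(mean * exp(-sigma2_y / 2))
    return mu_y, sigma2_y

mu_y_hat, sigma2_y_hat = lognormal_method_of_moments(8621.0, 8194.0)
print(round(sigma2_y_hat, 3), round(mu_y_hat, 2))
```

A quick consistency check is that exp(mu_y + sigma2_y / 2) recovers the sample mean.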
Method of Maximum Likelihood
• “Back into” the estimate by assuming the
parameters we are trying to estimate from the data
are known.
• How likely are the sample values we have, given a
certain set of parameter values?
• We can express this as the joint density of the
random sample given the parameter value.
$$f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f_X(x_i \mid \theta)$$
• After we obtain the data (random sample), we use
the joint density to define the likelihood function.
$$L(\theta \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i \mid \theta)$$
Likelihood
$L(\theta) = \prod_{i} f_X(x_i \mid \theta)$
ln(L) = -312 (for log-normal)
ln(L) = -311 (for gamma)
[Figure: fitted densities versus flow, 0 to 35000 mgal]
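Because the likelihood is a product of many small densities, it is usually evaluated as a sum of logs. The sketch below computes ln(L) for the gamma and log-normal densities used earlier. The flow values and parameter values are hypothetical stand-ins, since the original 31 observations are not reproduced in the slides, so the printed numbers are not expected to match -311 or -312.

```python
import math

def gamma_log_likelihood(xs, alpha, beta):
    """ln L for f(x) = beta^alpha x^(alpha-1) exp(-beta x) / Gamma(alpha)."""
    return sum(alpha * math.log(beta) + (alpha - 1.0) * math.log(x)
               - beta * x - math.lgamma(alpha) for x in xs)

def lognormal_log_likelihood(xs, mu_y, sigma_y):
    """ln L for the log-normal density with log-mean mu_y and log-sd sigma_y."""
    return sum(-math.log(x * sigma_y * math.sqrt(2.0 * math.pi))
               - 0.5 * ((math.log(x) - mu_y) / sigma_y) ** 2 for x in xs)

sample = [1200.0, 3400.0, 5600.0, 9100.0, 21000.0]   # hypothetical flows, mgal
print(gamma_log_likelihood(sample, alpha=1.1, beta=1.3e-4))
print(lognormal_log_likelihood(sample, mu_y=8.7, sigma_y=0.8))
```

Maximum likelihood estimation then searches for the parameter values that make this sum as large as possible.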
Normalization
• Much statistical theory relies on the central limit
theorem and so applies to normal distributions
• Where the data are not normally distributed,
normalizing transformations are used
– Log
– Box-Cox (log is a special case of Box-Cox)
Box-Cox Normalization
The Box-Cox family of transformations includes the
logarithmic transformation as a special case (λ = 0). It is
defined as:
$z = (x^{\lambda} - 1)/\lambda; \quad \lambda \ne 0$
$z = \ln(x); \quad \lambda = 0$
where z is the transformed data, x is the original data and λ
is the transformation parameter.
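The definition translates directly into code. This is a minimal sketch. The function name and the sample values are hypothetical, and the positivity check reflects the fact that ln(x) and non-integer powers need x > 0.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(x)            # special case: log transform
    return (x ** lam - 1.0) / lam

values = [1200.0, 3400.0, 5600.0]     # hypothetical flows
print([box_cox(x, 0.0) for x in values])
print([box_cox(x, -0.14) for x in values])
```

As λ approaches 0 the general formula converges to ln(x), which is why the log transform is treated as a member of the family.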
Box-Cox Normalization
So… the log looked OK (λ = 0). Is that what we really
want?
Let’s skip the derivations for now and look at the answer
for our three proposed methods.
Determining Transformation
Parameters
• Trial and error: apply a series of trial
lambda values and evaluate a statistic.
• PPCC (Filliben’s statistic): $R^2$ of the best-fit
line of the Q-Q plot
• Kolmogorov-Smirnov (KS) test (any
distribution): p-value
• Shapiro-Wilk test for normality: p-value
Quantiles
Rank the data: $x_1, x_2, x_3, \ldots, x_n$
$p_i = \mathrm{prob}(X \le x_i) = \frac{i}{n+1}$
Theoretical distribution, e.g. standard normal
[Figure: F(y) versus y for the theoretical distribution, with $p_i$ mapped to $q_i$]
$q_i$ is the distribution-specific theoretical
quantile associated with ranked data value $x_i$
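A sketch of the ranking and quantile lookup, assuming the standard normal as the theoretical distribution. It uses NormalDist.inv_cdf from the Python standard library. The data values and the function name are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_pairs(data):
    """Return (x_i, p_i, q_i) for the ranked data, with p_i = i / (n + 1)."""
    xs = sorted(data)                      # rank the data, smallest first
    n = len(xs)
    std_normal = NormalDist()              # mean 0, standard deviation 1
    pairs = []
    for i, x in enumerate(xs, start=1):
        p = i / (n + 1)
        q = std_normal.inv_cdf(p)          # theoretical quantile for p_i
        pairs.append((x, p, q))
    return pairs

values = [5600.0, 1200.0, 21000.0, 3400.0, 9100.0]   # hypothetical flows
pairs = normal_quantile_pairs(values)
for x, p, q in pairs:
    print(f"{x:8.0f}  p={p:.3f}  q={q:+.3f}")
```

Plotting x against q gives the Q-Q plot. Other plotting-position formulas exist, and i/(n+1) is the one used on the slide.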
Quantile-Quantile Plots
[Figure: normal Q-Q plot for raw flows, sample quantiles $x_i$ versus theoretical quantiles $q_i$]
[Figure: normal Q-Q plot for log-transformed flows, sample quantiles $\ln(x_i)$ versus theoretical quantiles $q_i$]
Need transformation to make the raw flows
normally distributed.
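The straightness of a Q-Q plot can be summarized by the correlation between the ranked data and the theoretical quantiles. The sketch below compares raw and log-transformed values on that basis. It assumes plotting positions i/(n+1) and standard normal quantiles. The data are hypothetical right-skewed values, not the flows shown in the figures.

```python
import math
from statistics import NormalDist

def ppcc(values):
    """Correlation of ordered values with normal quantiles at p_i = i / (n + 1)."""
    xs = sorted(values)
    n = len(xs)
    qs = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    mx, mq = sum(xs) / n, sum(qs) / n
    sxq = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxq / math.sqrt(sxx * sqq)

# hypothetical right-skewed flows
flows = [900.0, 1500.0, 2100.0, 3000.0, 4200.0, 6000.0, 9000.0, 14000.0, 25000.0]
r_raw = ppcc(flows)
r_log = ppcc([math.log(x) for x in flows])
print(round(r_raw, 3), round(r_log, 3))
```

A value closer to 1 means the points fall closer to a straight line. Squaring the correlation gives the R² of the best-fit line.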
Example: Determining
Transformation Parameters
• Alafia River historical monthly flows
• Evaluate using all three criteria
• Test a range of lambda values from -2 to 2
by 0.1 for Filliben’s and KS
• Test a range of lambda values from -1 to 1
by 0.1 for Shapiro-Wilk (larger lambda values
produced errors).
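The trial-and-error search can be sketched as a grid scan, here using the probability plot correlation as the criterion. This is an illustration under stated assumptions, not the analysis behind the plots: the flows are hypothetical, the plotting positions are i/(n+1), and the helper names are invented for the example.

```python
import math
from statistics import NormalDist

def box_cox(x, lam):
    """Box-Cox transform of a positive value; lam == 0 gives ln(x)."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def ppcc(values):
    """Correlation of ordered values with normal quantiles at p_i = i / (n + 1)."""
    xs = sorted(values)
    n = len(xs)
    qs = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    mx, mq = sum(xs) / n, sum(qs) / n
    sxq = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxq / math.sqrt(sxx * sqq)

def best_lambda(data, lambdas):
    """Return (best_score, best_lam) over the trial lambda values."""
    return max((ppcc([box_cox(x, lam) for x in data]), lam) for lam in lambdas)

lambdas = [k / 10 for k in range(-20, 21)]   # -2.0, -1.9, ..., 2.0
flows = [900.0, 1500.0, 2100.0, 3000.0, 4200.0, 6000.0, 9000.0, 14000.0, 25000.0]
best_r, best_lam = best_lambda(flows, lambdas)
print(best_lam, round(best_r, 4))
```

Building the grid from integers (k / 10) avoids accumulating floating-point error, so the λ = 0 case is hit exactly.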
Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using PPCC
[Figure: Filliben's statistic versus Box-Cox lambda value; optimal lambda = -0.14]
This is close to 0: λ = -0.14
Kolmogorov-Smirnov Test
• Specifically, it computes the largest
difference between the target CDF $F_X(x)$
and the observed CDF, $F^*(x)$.
• The test statistic $D_2$ is:
$$D_2 = \max_{i=1,\ldots,n} \left| F^*(X_{(i)}) - F_X(X_{(i)}) \right| = \max_{i=1,\ldots,n} \left| \frac{i}{n} - F_X(X_{(i)}) \right|$$
where $X_{(i)}$ is the ith smallest observed value in
the random sample of size n.
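The statistic as written above can be computed in a few lines. This sketch follows the slide's formula literally, comparing i/n with the target CDF at each ranked value. Many implementations also compare against (i-1)/n, and that refinement is omitted here. The sample values are hypothetical, and the standard normal is assumed as the target.

```python
from statistics import NormalDist

def ks_statistic(data, cdf):
    """Largest |i/n - F(x_(i))| over the ranked sample, smallest value first."""
    xs = sorted(data)
    n = len(xs)
    return max(abs(i / n - cdf(x)) for i, x in enumerate(xs, start=1))

sample = [-1.1, -0.4, 0.1, 0.6, 1.7]       # hypothetical standardized values
d = ks_statistic(sample, NormalDist().cdf)
print(round(d, 3))
```

Any target CDF can be passed in as the second argument, which is why the test applies to any distribution.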
Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using Kolmogorov-Smirnov (KS) Statistic
[Figure: KS p-value versus Box-Cox lambda value; optimal lambda = -0.39]
This is not as close to 0: λ = -0.39
http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/wilkshap.htm
Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using Shapiro-Wilk Statistic
[Figure: Shapiro-Wilk p-value versus Box-Cox lambda value; optimal lambda = -0.14]
This is close to 0: λ = -0.14.
Same as PPCC.