How is new knowledge discovered?

Statistics in WR: Lecture 1
• Key Themes
– Knowledge discovery in hydrology
– Introduction to probability and statistics
– Definition of random variables
• Reading: Helsel and Hirsch, Chapter 1
How is new knowledge discovered?
After completing the Handbook of
Hydrology in 1993, I asked myself the
question: how is new knowledge
discovered in hydrology?
I concluded:
• By deduction from
existing knowledge
• By experiment in a
laboratory
• By observation of the
natural environment
Deduction – Isaac Newton
• Deduction is the classical path
of mathematical physics
– Given a set of axioms
– Then by a logical process
– Derive a new principle or
equation
• In hydrology, the Saint-Venant
equations for open channel
flow and Richards' equation
for unsaturated flow in soils
were derived in this way.
Three laws of motion and law of gravitation
http://en.wikipedia.org/wiki/Isaac_Newton
(1687)
Experiment – Louis Pasteur
• Experiment is the
classical path of
laboratory science – a
simplified view of the
natural world is
replicated under
controlled conditions
• In hydrology, Darcy’s law
for flow in a porous
medium was found this
way.
Pasteur showed that microorganisms cause disease &
developed vaccines
Foundations of scientific medicine
http://en.wikipedia.org/wiki/Louis_Pasteur
Observation – Charles Darwin
• Observation – direct
viewing and
characterization of
patterns and phenomena
in the natural
environment
• In hydrology, Horton
discovered stream scaling
laws by interpretation of
stream maps
Published Nov 24, 1859
Most accessible book of great
scientific imagination ever written
Mean Annual Flow
[Figure: Mean Annual Flow, Colorado River at Austin (1929-2008). Y-axis: Discharge (cfs), 0 to 8000; x-axis: year, 1920 to 2020.]
Is there a relation between flow
and water quality?
[Figure: Mean Annual Flow, Colorado River at Austin (1929-2008). Y-axis: Discharge (cfs), 0 to 8000; x-axis: year, 1920 to 2020.]
[Figure: Total Nitrogen in water, Colorado River at Austin. Y-axis: Total Nitrogen (mg/l), 0 to 3.5; x-axis: date, Jun-68 to Oct-95.]
Are Annual Flows Correlated?
[Figure: Correlation of Annual Flows (Colorado River at Austin). Scatter plot of Last Year's Discharge (cfs) against This Year's Discharge (cfs), both axes 0 to 8000.]
CE 397 Statistics in Water
Resources, Lecture 2, 2009
David R. Maidment
Dept of Civil Engineering
University of Texas at Austin
Key Themes
• Statistics
– Parametric and non-parametric approach
• Data Visualization
• Distribution of data and the distribution of
statistics of those data
• Reading: Helsel and Hirsch p. 17-51 (Sections 2.1 to 2.3)
• Slides from Helsel and Hirsch (2002) “Techniques of water resources investigations of the USGS”, Book 4, Chapter A3.
Characteristics of Water Resources Data
• Lower bound of zero
• Presence of “outliers”
• Positive skewness
• Non-normal distribution of data
• Data measured with thresholds (e.g. detection limits)
• Seasonal and diurnal patterns
• Autocorrelation – consecutive measurements are not independent
• Dependence on other uncontrolled variables, e.g. chemical concentration is related to discharge
Normal Distribution
From Helsel and Hirsch (2002)
Lognormal Distribution
From Helsel and Hirsch (2002)
Method of Moments
From Helsel and Hirsch (2002)
Statistical measures
• Location (Central
Tendency)
– Mean
– Median
– Geometric mean
• Skewness (Symmetry)
– Coefficient of skewness
• Kurtosis (Flatness)
– Coefficient of kurtosis
• Spread (Dispersion)
– Variance
– Standard deviation
– Interquartile range
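The measures listed above can be computed with Python's standard library. This is a minimal sketch: the function name `describe` and the sample values are hypothetical, and the skewness and kurtosis use the simple moment ratios without small-sample corrections, which may differ from the estimators in Helsel and Hirsch.

```python
import statistics

def describe(x):
    """Return measures of location, spread, skewness and kurtosis."""
    n = len(x)
    mean = statistics.fmean(x)
    s = statistics.stdev(x)                      # sample std dev (n - 1)
    q1, _, q3 = statistics.quantiles(x, n=4)     # quartiles
    # Central moments (divide by n, no small-sample correction)
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return {
        "mean": mean,
        "median": statistics.median(x),
        "geometric_mean": statistics.geometric_mean(x),
        "variance": s ** 2,
        "std_dev": s,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }

# Hypothetical, positively skewed sample (not data from the lecture)
sample = [1.0, 2.0, 4.0, 8.0, 16.0]
summary = describe(sample)
```

For positively skewed data such as this sample, the mean exceeds the median and the geometric mean, which is the typical pattern for streamflow records.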
Histogram
From Helsel and Hirsch (2002)
Annual Streamflow for the
Licking River at Catawba,
Kentucky 03253500
Quantile Plot
From Helsel and Hirsch (2002)
Plotting positions
i = rank of the data, with i = 1 the lowest
n = number of data
p = cumulative probability or “quantile” of the data
value (its percentile value)
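The plotting position formula itself is not preserved in this transcript. As a sketch, the general form p = (i - a)/(n + 1 - 2a) covers the common choices: a = 0.4 gives the Cunnane formula (i - 0.4)/(n + 0.2), believed to be the one adopted by Helsel and Hirsch, and a = 0 gives the Weibull formula i/(n + 1). The function name and the default constant are assumptions.

```python
def plotting_positions(values, a=0.4):
    """Return (value, p) pairs with p = (i - a) / (n + 1 - 2a).

    i is the rank (1 = lowest) and n the number of data.
    a = 0.4 is the Cunnane formula; a = 0 is the Weibull formula.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i - a) / (n + 1 - 2 * a))
            for i, v in enumerate(ordered, start=1)]

# Hypothetical annual flows (cfs), for illustration only
flows = [2100.0, 950.0, 3400.0, 1800.0]
cunnane = plotting_positions(flows)
weibull = plotting_positions(flows, a=0.0)
```

Both formulas keep p strictly between 0 and 1, so the largest observation is never assigned a cumulative probability of exactly 1.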
Normal Distribution Quantile Plot
From Helsel and Hirsch (2002)
Probability Plot with Normal Quantiles
(Z values)
$q = \bar{q} + z\,s_q$
From Helsel and Hirsch (2002)
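A probability plot pairs each ranked observation q with the standard normal quantile z of its plotting position. If the data are normal, the points fall near the line q = mean + z * s_q. The sketch below assumes the Cunnane plotting position (a = 0.4); the function name and sample values are hypothetical.

```python
from statistics import NormalDist, fmean, stdev

def normal_probability_plot(values, a=0.4):
    """Return (z, observed q, fitted q) rows for a normal probability plot."""
    ordered = sorted(values)
    n = len(ordered)
    mean, s_q = fmean(ordered), stdev(ordered)
    rows = []
    for i, q in enumerate(ordered, start=1):
        p = (i - a) / (n + 1 - 2 * a)      # plotting position (assumed Cunnane)
        z = NormalDist().inv_cdf(p)        # standard normal quantile of p
        rows.append((z, q, mean + z * s_q))
    return rows

# Hypothetical annual flows (cfs), for illustration only
flows = [1200.0, 1900.0, 2500.0, 3100.0, 4800.0]
rows = normal_probability_plot(flows)
```

Plotting the second column against the first and overlaying the third column shows how far the data depart from a normal distribution.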
Annual Flows From HydroExcel
Annual Flows produced
using Pivot Tables in
Excel
Estimating the mean annual discharge
[Figure: Licking River at Catawba, Kentucky, 1934-2008 (75 years of data). Running estimate of the mean with Mean + Se and Mean - Se bands; y-axis: Mean Discharge (cfs), 0 to 6000; x-axis: Number of years of Data, 0 to 90.]
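The Mean, Mean + Se and Mean - Se curves can be sketched as follows, assuming Se is the standard error of the mean, s/sqrt(n), computed from the first n years of record. The flows here are synthetic lognormal values, not the Licking River data, and the function name is hypothetical.

```python
import math
import random
import statistics

def running_mean_with_se(flows):
    """For n = 2..N return (n, mean, mean - Se, mean + Se), Se = s / sqrt(n)."""
    rows = []
    for n in range(2, len(flows) + 1):
        sample = flows[:n]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        rows.append((n, mean, mean - se, mean + se))
    return rows

# Synthetic record of 75 annual flows (assumption, for illustration only)
rng = random.Random(1)
flows = [rng.lognormvariate(8.0, 0.4) for _ in range(75)]
rows = running_mean_with_se(flows)
```

The s/sqrt(n) formula assumes independent years. If annual flows are autocorrelated, the true uncertainty of the mean is larger than this band suggests.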
CE 397 Statistics in Water
Resources, Lecture 3, 2009
Key Themes
• Using HydroExcel for accessing water
resources data using web services
• Descriptive statistics and histograms using
Excel Analysis Toolpak
• Reading: Chapter 11 of Applied Hydrology by
Chow, Maidment and Mays
CE 397 Statistics in Water
Resources, Lecture 4, 2009
Key Themes
• Frequency and probability functions
• Fitting methods
• Typical distributions
• Reading: Chapter 4 of Helsel and Hirsch pp. 97-116 on Hypothesis tests
Method of Moments
Maximum Likelihood
CE 397 Statistics in Water
Resources, Lecture 5, 2009
Key Themes
• Using Excel to fit frequency and probability
distributions
• Chi-square test and probability plotting
• Beginning hypothesis testing
• Reading: Chapter 3 of Helsel and Hirsch pp. 65-97 on Describing Uncertainty
• Slides from Helsel and Hirsch Chap. 4
Statistics in Water Resources, Lecture 6
• Key Themes
– t-distribution for cases where the standard deviation is unknown
– Hypothesis testing
– Comparing two sets of data to see if they are different
• Reading: Helsel and Hirsch, Chapter 6, Matched Pair Tests
Chi-Square Distribution
http://en.wikipedia.org/wiki/Chi-square_distribution
t-, z and ChiSquare
Source: http://en.wikipedia.org/wiki/Student's_t-distribution
Normal and t-distributions
[Figure: density of the standard normal distribution compared with t-distributions for ν = 1, 2, 3, 5, 10 and 30]
Standard Normal and Student - t
• Standard Normal z
– X1, … , Xn are independently and normally distributed with mean μ and standard deviation σ, and $\bar{X}_n$ is their sample mean
– then $Z = (\bar{X}_n - \mu)/(\sigma/\sqrt{n})$ is normally distributed with mean 0 and std dev 1
• Student’s t-distribution
– Applies to the case where the true standard deviation σ is unknown and is replaced by its sample estimate Sn
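The two statistics differ only in which standard deviation is used. A minimal sketch, assuming the standard forms z = (mean - μ)/(σ/sqrt(n)) and t = (mean - μ)/(Sn/sqrt(n)) with n - 1 degrees of freedom; the function names and the sample are hypothetical.

```python
import math
import statistics

def z_statistic(sample, mu, sigma):
    """z when the true standard deviation sigma is known."""
    n = len(sample)
    return (statistics.fmean(sample) - mu) / (sigma / math.sqrt(n))

def t_statistic(sample, mu):
    """t and its degrees of freedom when sigma is replaced by Sn."""
    n = len(sample)
    s_n = statistics.stdev(sample)     # sample estimate of sigma (n - 1)
    t = (statistics.fmean(sample) - mu) / (s_n / math.sqrt(n))
    return t, n - 1

# Hypothetical sample, tested against a hypothesised mean of 10
sample = [9.0, 10.0, 11.0, 12.0, 13.0]
z = z_statistic(sample, mu=10.0, sigma=2.0)
t, dof = t_statistic(sample, mu=10.0)
```

As the degrees of freedom grow, Sn approaches σ and the t-distribution approaches the standard normal.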
The p-value is the probability of obtaining a value of the test statistic at least as extreme as the one observed, if the null hypothesis (Ho) is true
If the p-value is smaller than the significance level α (e.g. 0.05 or 0.025) then reject Ho
If the p-value is larger than α then do not reject Ho
One-sided test
Two-sided test
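The decision rule can be illustrated for a z statistic, whose p-value comes directly from the standard normal distribution. This is a sketch only: the function names and the `alternative` labels are hypothetical, and a t-test would use the t-distribution instead.

```python
from statistics import NormalDist

def z_test_p_value(z, alternative="two-sided"):
    """p-value of a z statistic for a one-sided or two-sided test."""
    cdf = NormalDist().cdf
    if alternative == "greater":        # one-sided, upper tail
        return 1.0 - cdf(z)
    if alternative == "less":           # one-sided, lower tail
        return cdf(z)
    return 2.0 * (1.0 - cdf(abs(z)))    # two-sided, both tails

def decide(p_value, alpha=0.05):
    """Reject Ho when the p-value is smaller than alpha."""
    return "reject Ho" if p_value < alpha else "do not reject Ho"

p_two = z_test_p_value(1.96)
p_one = z_test_p_value(1.96, alternative="greater")
```

For the same z, the two-sided p-value is twice the one-sided value, which is why α = 0.05 two-sided corresponds to 0.025 in each tail.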
Statistics in WR: Lecture 7
• Key Themes
– Statistics for populations and samples
– Suspended sediment sampling
– Testing for differences in means and variances
• Reading: Helsel and Hirsch Chapter 8, Correlation
Estimators of the Variance
Maximum Likelihood Estimate for Population variance
Unbiased estimate from a sample
http://en.wikipedia.org/wiki/Variance
Bias in the Variance
Common sense would suggest applying the population
formula to the sample as well. That estimate is biased
because the sample mean is generally somewhat closer
to the observations in the sample than the population
mean is. The sample mean is by definition in the
middle of the sample, while the population mean may
even lie outside the sample. The deviations from the
sample mean will therefore often be smaller than the
deviations from the population mean, so if the same
formula is applied to both, the variance estimate will
on average be somewhat smaller in the sample than in
the population.
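The bias can be checked with a small Monte Carlo experiment. The sketch below draws many samples of size n from a standard normal population (true variance 1) and averages the two estimators: dividing by n gives about (n - 1)/n, while dividing by n - 1 gives about 1. The function name, sample size, number of trials and seed are arbitrary assumptions.

```python
import random
import statistics

def average_variance_estimates(n=5, trials=10000, seed=0):
    """Average the divide-by-n and divide-by-(n - 1) variance estimates."""
    rng = random.Random(seed)
    biased_sum = 0.0
    unbiased_sum = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        biased_sum += statistics.pvariance(x)    # population formula, / n
        unbiased_sum += statistics.variance(x)   # sample formula, / (n - 1)
    return biased_sum / trials, unbiased_sum / trials

biased, unbiased = average_variance_estimates()
```

With n = 5 the population formula underestimates the variance by about 20 percent on average, and the shortfall shrinks as n grows.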
Suspended Sediment Sampling
http://pubs.usgs.gov/sir/2005/5077/
T-test with same variances
T-test with different variances
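The formulas from these two slides are not preserved in the transcript. As a sketch, the standard textbook forms are the pooled-variance t statistic for equal variances and the Welch statistic, with Welch-Satterthwaite degrees of freedom, for different variances. The function names and data are hypothetical, and the slides may have used a different notation.

```python
import math
import statistics

def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
    diff = statistics.fmean(x) - statistics.fmean(y)
    return diff / math.sqrt(sp2 * (1 / n1 + 1 / n2)), n1 + n2 - 2

def welch_t(x, y):
    """Two-sample t statistic allowing different variances."""
    n1, n2 = len(x), len(y)
    a = statistics.variance(x) / n1
    b = statistics.variance(y) / n2
    diff = statistics.fmean(x) - statistics.fmean(y)
    dof = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return diff / math.sqrt(a + b), dof

# Hypothetical samples with different spreads
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t_pooled, dof_pooled = pooled_t(x, y)
t_welch, dof_welch = welch_t(x, y)
```

With equal sample sizes the two t values coincide, but the Welch test uses fewer degrees of freedom, which makes it more conservative when the variances differ.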
Statistics in WR: Lecture 8
• Key Themes
– Replication in Monte Carlo experiments
– Testing paired differences and analysis of variance
– Correlation
• Reading: Helsel and Hirsch Chapter 9, Simple Regression
Statistics of Mean of Replicated Series
[Figure: Variance of Replicates of Cumulative mean of 1000 uniform(0,1) random variables, plotted with its Theoretical Value; y-axis: Variance, 0 to 1.2E-04; x-axis: Number of Replicates, 0 to 100.]
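The theoretical value follows from Var(U) = 1/12 for a uniform(0,1) variable, so the mean of 1000 independent values has variance (1/12)/1000, about 8.33E-05. The sketch below repeats the experiment with 100 replicates; the function name and seed are arbitrary assumptions, and the original was presumably done in a spreadsheet.

```python
import random
import statistics

def variance_of_replicate_means(n_values=1000, n_replicates=100, seed=0):
    """Variance across replicates of the mean of n_values uniform(0,1) draws."""
    rng = random.Random(seed)
    means = [statistics.fmean([rng.random() for _ in range(n_values)])
             for _ in range(n_replicates)]
    return statistics.variance(means)

theoretical = (1.0 / 12.0) / 1000      # Var(U) / n for uniform(0,1)
estimated = variance_of_replicate_means()
```

The estimate scatters around the theoretical value and settles toward it as the number of replicates increases.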
Patterns of data that all have correlation between x and y of 0.7
Monotonic nonlinear correlation
Linear correlation
Non-monotonic correlation
Statistics in WR: Lecture 9
• Key Themes
– Using SAS to compute cross-correlation between two data series
– Using Excel to compute autocorrelation of a single data series
– Correlation length and the influence of the data interval on it
– Lagged cross-correlation between rainfall and flow
• Reading: Helsel and Hirsch Chapter 12, Trend Analysis
Correlation
• Correlation (or cross-correlation) measures
the association between two sets of data (x, y)
• Autocorrelation measures the correlation of a
dataset with lagged or displaced values of itself
(either in time or space), e.g. x(t) with x(t – L)
where L is the lag time
• Lagged cross-correlation measures the
association between one series y(t), and
lagged values of another series x(t – L)
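The three definitions above can be sketched with one helper that pairs y(t) with x(t - L) and computes the Pearson correlation of the overlapping values. This pairwise form is an assumption; time-series packages often normalise by the full-series mean and variance instead. Function names and data are hypothetical.

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_cross_correlation(x, y, lag):
    """Correlation of y(t) with x(t - lag), for lag >= 0."""
    if lag == 0:
        return correlation(x, y)
    return correlation(x[:-lag], y[lag:])

def autocorrelation(x, lag):
    """Correlation of x(t) with x(t - lag)."""
    return lagged_cross_correlation(x, x, lag)

# Hypothetical rainfall, with flow responding two time steps later
rain = [0.0, 5.0, 0.0, 0.0, 8.0, 0.0, 2.0, 0.0, 0.0, 6.0, 0.0, 0.0]
flow = [1.0, 1.0] + rain[:-2]
r_lag0 = lagged_cross_correlation(rain, flow, 0)
r_lag2 = lagged_cross_correlation(rain, flow, 2)
```

Scanning the lag and picking the largest correlation gives a rough estimate of the response time between rainfall and flow.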
Statistics in WR: Lecture 10
• Key Themes
– Trend analysis using Simple Linear Regression
– Characterization of outliers
– Multiple Linear Regression
• Reading: Helsel and Hirsch Chapter 11, Multiple Linear Regression
• Slides are from Helsel and Hirsch, Chapter 9
H&H p.222
Regression Formulas
H&H p.226
Regression Formulas
H&H p.227
Statistics in WR: Lecture 11
• Key Themes
– Simple Linear Regression
– Derivation of the normal equations
– Multiple Linear Regression
• Reading: Helsel and Hirsch Chapter 7, Comparing several independent groups
• Reading: Barnett, Environmental Statistics, Chapter 10, Time series methods
• Slides are from Helsel and Hirsch, Chapter 9
Regression
Assumptions
Formulas used in the derivation of the normal equations
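The slide's formulas are not preserved here. A sketch of the standard least-squares derivation for simple linear regression follows, with the notation b0 (intercept) and b1 (slope) as an assumption. Setting both partial derivatives of the sum of squared errors to zero yields the two normal equations.

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  \;\Rightarrow\; n\,b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \\
\frac{\partial S}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0
  \;\Rightarrow\; b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
  \qquad b_0 = \bar{y} - b_1 \bar{x}
\end{align*}
```

The first normal equation shows that the fitted line passes through the point of means, and the second that the residuals are uncorrelated with x.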
(1a) Plot the Data: TDS vs LogQ
(2) Interpret Regression Statistics
A good set of Residuals
Multiple Linear Regression
Simple vs Complex regression models
F-distribution
http://en.wikipedia.org/wiki/F-test
“If U is a Chisquare random variable with m degrees of freedom, V is a
Chisquare random variable with n degrees of freedom, and if U and V
are independent, then the ratio (U/m)/(V/n) has an F-distribution with
(m, n) degrees of freedom.” Haan, Statistical Methods in Hydrology,
p.122
The values of the F-statistic are tabulated at:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm
Statistics in WR: Lecture 12
• Key Themes
– Regression y|x and x|y
– Adjusted R2
– Time series and seasonal variations
R2 and Adjusted R2
$R^2 = 1 - \frac{SSE}{SS_y} = 0.903154347$
$\mathrm{Adj}R^2 = 1 - \frac{SSE/(n-p)}{SS_y/(n-1)} = 0.89854265$

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.950344
R Square              0.903154
Adjusted R Square     0.898543
Standard Error        159033.1
Observations          23

ANOVA
                    df    SS            MS             F          Significance F
Regression          1     4.95309E+12   4.95309E+12    195.8399   4.07E-12
Residual (error)    21    5.31122E+11   25291521454
Total (y)           22    5.48421E+12
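The summary statistics can be recomputed from the sums of squares in the ANOVA table. In this sketch n = 23 observations and p = 2 fitted coefficients (intercept and slope), which matches the 21 residual degrees of freedom; the function name is hypothetical, and small differences from the spreadsheet arise because the table values are rounded.

```python
def regression_summary(ss_regression, ss_error, n, p):
    """Return R2, adjusted R2 and F from the ANOVA sums of squares."""
    ss_total = ss_regression + ss_error                  # SSy
    r2 = 1.0 - ss_error / ss_total
    adj_r2 = 1.0 - (ss_error / (n - p)) / (ss_total / (n - 1))
    ms_regression = ss_regression / (p - 1)
    ms_error = ss_error / (n - p)
    return r2, adj_r2, ms_regression / ms_error

# Values from the SUMMARY OUTPUT ANOVA table
r2, adj_r2, f_stat = regression_summary(4.95309e12, 5.31122e11, n=23, p=2)
```

Adjusted R2 is always below R2 because it penalises the error sum of squares for the degrees of freedom used by the fitted coefficients.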
Time Series Trend: Tide Levels at San Diego
[Figure: tide levels at San Diego plotted against date (axis labelled Jan-00 to Dec-14) with a fitted linear trend line y = 2E-05x + 2.2869; y-axis from -1 to 4.]
http://tidesandcurrents.noaa.gov/sltrends/sltrends_station.shtml?stnid=9410170%20San%20Diego,%20CA
One harmonic
Five harmonics
http://en.wikipedia.org/wiki/Fourier_series
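The difference between one and five harmonics can be sketched by fitting a truncated Fourier series to a periodic signal. The code assumes equally spaced samples covering exactly one period and fewer harmonics than half the number of samples; the function names and the synthetic 24-point signal are hypothetical.

```python
import math

def fourier_coefficients(y, n_harmonics):
    """Return the mean and [(a_k, b_k)] for k = 1..n_harmonics."""
    n = len(y)
    mean = sum(y) / n
    coeffs = []
    for k in range(1, n_harmonics + 1):
        w = 2 * math.pi * k / n
        a_k = 2.0 / n * sum(v * math.cos(w * t) for t, v in enumerate(y))
        b_k = 2.0 / n * sum(v * math.sin(w * t) for t, v in enumerate(y))
        coeffs.append((a_k, b_k))
    return mean, coeffs

def fourier_fit(mean, coeffs, n):
    """Evaluate the truncated series at t = 0..n-1."""
    return [mean + sum(a * math.cos(2 * math.pi * k * t / n)
                       + b * math.sin(2 * math.pi * k * t / n)
                       for k, (a, b) in enumerate(coeffs, start=1))
            for t in range(n)]

def rmse(y, fit):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, fit)) / len(y))

# Synthetic diurnal signal: 24 hourly values with two harmonics
n = 24
y = [10 + 3 * math.cos(2 * math.pi * t / n) + math.sin(4 * math.pi * t / n)
     for t in range(n)]
mean1, c1 = fourier_coefficients(y, 1)
mean5, c5 = fourier_coefficients(y, 5)
fit1 = fourier_fit(mean1, c1, n)
fit5 = fourier_fit(mean5, c5, n)
```

One harmonic captures only the dominant daily swing, while adding harmonics lets the series follow asymmetric or multi-peaked cycles.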
Statistics in WR: Lecture 13
• Key Themes
– ANOVA for sediment data
– Fourier series for diurnal cycles
– Fourier series for seasonal cycles
Analysis of Variance (ANOVA)
Assumptions
There are several variants (one factor, two factor, two factor with replication).
We will deal just with One Factor ANOVA
Single Factor ANOVA
Single Factor ANOVA
ANOVA Formulas
Single Factor ANOVA
Groups of Sediment Load Data (Ex3)
[Figure: groups of sediment load data on an axis starting at Zero, with annotated values 3.5 x 10^6, 5.5 x 10^6 and 480,000.]
USGS1 Mean: 218,000 Ton/yr
TWDB Mean: 189,000 Ton/yr
Overall Mean: 183,000 Ton/yr
USGS2 Mean: 97,000 Ton/yr
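Single factor ANOVA compares the variation between group means with the variation within groups. The sketch below uses the standard sums of squares and the F ratio MS_between / MS_within; the function name and the three small groups are hypothetical and are not the sediment load data.

```python
import statistics

def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and F for one factor."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = statistics.fmean(all_values)
    # Between groups: group size times squared deviation of each group mean
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    # Within groups: squared deviations from each group's own mean
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_between": df_between, "df_within": df_within, "F": f_stat}

# Hypothetical groups, for illustration only
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
result = one_way_anova(groups)
```

A large F means the group means differ by more than the within-group scatter would explain, and it is compared with the F-distribution for (k - 1, n - k) degrees of freedom.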