PPT - Water on the Web
Statistics for Water Science
Module 17.1: Descriptive Statistics
Module 17: Statistics
Statistics
A branch of mathematics dealing with the
collection, analysis, interpretation and
presentation of masses of numerical data
Descriptive Statistics (Lecture 17.1)
Basic description of a variable
Exploratory Data Analysis (Lecture 17.2)
Techniques for understanding data
Hypothesis Testing (Lecture 17.3)
Asks the question – is X different from Y?
Developed by: Host
Updated 2/2004:
U5-m17-s2
Descriptive statistics
Describe basic characteristics of a population of numbers
Central Tendency or “Middleness”
Means, medians and others
Variance or “spread” of data
Standard Deviation
The range of data
Min, Max and Percentiles
Simple graphical representations of data
Precision, accuracy and bias
Precision: Tendency to have values closely clustered around the mean
Accuracy: Tendency of an estimator to predict the value it was intended to estimate
Bias: A systematic error in prediction
Adapted from Ratti and Garton (1994)
[Figure: panels of curling rocks labeled Biased, Unbiased, Accurate, Not Accurate, Precise, Not Precise]
The yellow curling rocks represent means from repeated samples
Green dots are the mean value
Spread is analogous to the standard error
Finding the middle: The arithmetic mean
Between 1998 and 2002, the Ice Lake RUSS unit collected 2,120 temperature readings at depths of 1-4 m
What is the average June temperature?
[Figure: histogram of Surface Temperature; x-axis: Temperature, y-axis: # of Observations]
Finding the middle: The arithmetic mean
Not too hard - Add ’em up, divide by n
Sum of temperatures = 39179.3
# of Observations = 2120
Mean = 39179.3 / 2120 = 18.48 C
[Figure: histogram of Surface Temperature; x-axis: Temperature, y-axis: # of Observations]
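As a quick check of the arithmetic, here is a minimal Python sketch that uses only the two totals reported on the slide. The raw readings are not reproduced here, and the variable names are illustrative.

```python
# Totals as reported on the slide; the individual readings are not available here.
sum_of_temperatures = 39179.3   # sum of all readings, in degrees C
n_observations = 2120           # number of readings

# Arithmetic mean: add them up, divide by n.
mean_temperature = sum_of_temperatures / n_observations
print(f"{mean_temperature:.2f} C")   # 18.48 C
```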
Expressing variability: Standard deviation (SD)
Note that there is ‘scatter’ around the mean
The Standard Deviation quantifies how wide or
narrow this scatter is:
For this data set, the SD is 2.34 C
Mean and SD are often combined: 18.48 +/- 2.34 C
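To make "how wide or narrow this scatter is" concrete, the sketch below computes a standard deviation step by step. The readings are hypothetical (they are not the Ice Lake data), and it assumes the sample form of the SD (dividing by n - 1), which is what Excel's STDEV also uses.

```python
import math
import statistics

# Hypothetical temperature readings in degrees C (NOT the Ice Lake data).
readings = [16.0, 17.5, 18.0, 19.0, 21.5]

n = len(readings)
mean = sum(readings) / n

# Sample standard deviation: root of the summed squared deviations over (n - 1).
squared_deviations = [(x - mean) ** 2 for x in readings]
sd = math.sqrt(sum(squared_deviations) / (n - 1))

# The standard library gives the same answer.
assert math.isclose(sd, statistics.stdev(readings))

print(f"{mean:.2f} +/- {sd:.2f} C")   # 18.40 +/- 2.04 C
```

The result is in the same units as the readings, which is why mean and SD can be reported together.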
Comparing data sets
Let’s consider a second data set, shown in
blue. This is the mean seasonal temperature in
the lower reaches of the lake (8-13 m)
n = 3097
Comparing data sets
Two things to note:
It’s a lot colder at the bottom of the lake!
The temperatures are much less variable – why?
Means and standard deviations for
epilimnetic and hypolimnetic temperatures
          Mean    SD
Surface   18.48   2.34
Bottom    5.96    0.85
Standard deviation: Fun facts
The SD is always in the same units as the mean
For roughly bell-shaped (normal) data, about 68% of the values fall within +/- 1
SD of the mean and about 95% within +/- 2 SD
If the SD is larger than the mean (e.g. 20 +/- 24),
your data is pretty flaky
Definition of flaky – the data are so widely
scattered that the mean is, well, meaningless.
In this case, use some other measure of
middleness, such as the geometric mean or
median
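The 68% and 95% figures can be checked by simulation. The sketch below draws normally distributed values (an assumption; real lake data need not be normal) and counts how many land within 1 and 2 SD of the mean. The mean and SD are borrowed from the surface-temperature example purely for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated, normally distributed values; NOT real measurements.
values = [random.gauss(18.48, 2.34) for _ in range(100_000)]

mean = statistics.fmean(values)
sd = statistics.stdev(values)

def fraction_within(k):
    """Fraction of values within k standard deviations of the mean."""
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)

print(f"within 1 SD: {fraction_within(1):.1%}")
print(f"within 2 SD: {fraction_within(2):.1%}")
```

For strongly skewed data the same count can come out very differently, which is one reason to fall back on the median or geometric mean.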
Using geometric means: Fecal coliform example
What about data that are not well behaved?
Fecal coliform counts are often used by
management agencies as an indicator of water
quality
For non-contact water recreation (boating and fishing),
Colorado Public Health states that the fecal coliform count
shall not exceed 2000 fecal coliforms per 100 mL
(based on the geometric mean of representative samples)
The problem
Fecal coliform counts can range over several
orders of magnitude.
For such data, the geometric mean is a more
appropriate indicator of central tendency.
Sample            F. coli. counts
1                 160
2                 700
3                 60
7                 12000
Arithmetic Mean   3230
Boulder Creek Longitudinal Fecal Coliform Profiles for July, 2000
The geometric mean
Multiply ’em together, take the nth root
To be honest, this is a pain without a good
calculator, but there’s a shortcut…
Geometric mean = $\sqrt[4]{160 \times 700 \times 60 \times 12000}$
The geometric mean: The easy way
Take the logarithm of each data point (easy)
Sample   F. coli. counts   Log(10)
1        160               2.20
2        700               2.85
3        60                1.78
7        12000             4.08
The geometric mean
Take the logarithm of each data point (easy)
Average the log values (easier)
Sample   F. coli. counts   Log
1        160               2.20
2        700               2.85
3        60                1.78
7        12000             4.08
Average                    2.727
The geometric mean
Take the logarithm of each data point (easy)
Average the log values (easier)
Calculate the antilog (sounds hard, is easy)
Sample   F. coli. counts   Log
1        160               2.20
2        700               2.85
3        60                1.78
7        12000             4.08
Average                    2.727
Antilog = 10^2.727 ≈ 533
The geometric mean is about 533 cells/100 ml
Lower than the state regulatory standard of 2000 cells/100 ml
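The three steps (log, average, antilog) can be sketched in Python using the four counts listed in the table. This is only an illustration of the shortcut; the variable names are not from the original slides.

```python
import math

# The four fecal coliform counts from the table (per 100 mL).
counts = [160, 700, 60, 12000]

# Step 1: take the base-10 logarithm of each data point.
logs = [math.log10(c) for c in counts]

# Step 2: average the log values.
mean_log = sum(logs) / len(logs)

# Step 3: take the antilog of the average.
geometric_mean = 10 ** mean_log

# Cross-check against the direct definition: nth root of the product.
direct = math.prod(counts) ** (1 / len(counts))
assert math.isclose(geometric_mean, direct)

print(f"geometric mean: {geometric_mean:.1f} per 100 mL")   # 532.9
```

Computed from these four counts, the result sits well below both the arithmetic mean of 3230 and the 2000 per 100 mL standard.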
Fun facts about geometric means
The geometric mean is never greater than the arithmetic
mean (the two are equal only when all values are identical).
The ‘shortcut’ calculation works with either natural logs
or base 10 logs.
The geometric mean tends to dampen the effect of very
low or very high values, and is useful when values
range from 10 to 10,000 over a given period.
Excel has a GEOMEAN function. Life is good.
Use of the geometric mean is a standard for most
wastewater discharge and beach monitoring programs:
Beach standards are typically 200 counts/100 ml.
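Two of these facts are easy to confirm in code. The sketch below is an illustration only: the helper `geomean` is a hypothetical name, and the data are the four counts from the fecal coliform example. It compares base-10 and natural-log versions of the shortcut and checks the result against Python's built-in `statistics.geometric_mean` (available from Python 3.8).

```python
import math
import statistics

counts = [160, 700, 60, 12000]  # the four counts from the example above

def geomean(values, log=math.log10, antilog=lambda x: 10 ** x):
    """Geometric mean via the log shortcut, for positive values only."""
    return antilog(sum(log(v) for v in values) / len(values))

gm_base10 = geomean(counts)
gm_natural = geomean(counts, log=math.log, antilog=math.exp)

# Either log base gives the same geometric mean.
assert math.isclose(gm_base10, gm_natural)

# The standard library has a built-in, much like Excel's GEOMEAN.
assert math.isclose(gm_base10, statistics.geometric_mean(counts))

# The geometric mean does not exceed the arithmetic mean.
assert gm_base10 <= statistics.mean(counts)
```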
Descriptive statistics: Min, Max, and Median
Ice Lake   Mean    SD     Min    Max    Median
Surface    19.59   2.28   12.1   27.1   18.2
Bottom     5.96    0.85   4.3    9.0    5.9
When to use medians: Stream turbidity levels
Background:
• Turbidity in streams makes the water appear cloudy (muddy), mostly
from suspended sediments. It’s bad for fish, their eggs and their food
(bugs) – particularly cold water species such as brook trout.
• Minnesota Water Pollution Rules set a Chronic Standard of 10 NTU, the highest level to which these organisms can be exposed indefinitely
without causing chronic toxicity (see Notes for reference website).
• Tischer Creek is a trout stream in Duluth, MN with a nearly continuous
turbidity record in summer/fall 2002. Let’s look at a 30 day period in
midsummer and decide what the level of exposure was for the fish.
Medians: the middlemost value
Prevents being misled by a few very small or very large values
Consider salaries within a hypothetical company
Which is the more appropriate measure of a typical salary?
Position         Salary
CEO              $350,000
Middle manager   $88,000
Worker 1         $24,000
Worker 2         $22,000
Worker 3         $18,000
Mean             $100,400
Median           $24,000
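The salary comparison can be reproduced with a few lines of Python. The five salaries below are read from the flattened table, and treating the unattached 24,000 as Worker 1's salary is an assumption.

```python
import statistics

# Salaries as read from the flattened table; assigning 24,000 to
# Worker 1 is an assumption.
salaries = {
    "CEO": 350_000,
    "Middle manager": 88_000,
    "Worker 1": 24_000,
    "Worker 2": 22_000,
    "Worker 3": 18_000,
}
values = list(salaries.values())

mean_salary = statistics.mean(values)
median_salary = statistics.median(values)

print(f"mean:   ${mean_salary:,.0f}")     # mean:   $100,400
print(f"median: ${median_salary:,.0f}")   # median: $24,000

# How many employees earn less than the mean?
below_mean = sum(1 for s in values if s < mean_salary)
print(f"{below_mean} of {len(values)} salaries fall below the mean")
```

Four of the five salaries fall below the mean, so the median is the better description of a typical salary here.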
Medians: a real world example
Tischer Creek: July 13 - Aug 12, 2002
Summary (30 d: 7/13 - 8/12/02), turbidity in NTU
Mean                      13.1
Standard Error            0.93
Median                    1.0
Mode                      0.0
Standard Deviation        48.0
Sample Variance           2301.1
Kurtosis                  153.9
Skewness                  9.6
Range                     1017.2
Minimum                   0
Maximum                   1017.2
Sum                       35061.7
Count                     2679
Confidence Level(95.0%)   1.82
[Figure: Tischer Creek turbidity, 13 Jul 02 - 12 Aug 02, ~30 days straddling the late July storm; x-axis: Date 2002, y-axis: Turbidity (NTUs)]
Mean +/- s.e.: 13.1 +/- 0.9
Median: 1
Range: 0.0 - 1017
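Several of the summary values are tied together by simple formulas. The sketch below recomputes the mean, standard deviation, standard error and an approximate 95% half-width from the reported Sum, Count and Sample Variance. The 1.96 multiplier is a large-sample normal approximation and an assumption here; the spreadsheet tool that produced the summary may use a slightly different multiplier.

```python
import math

# Values copied from the summary above (turbidity in NTU).
total = 35061.7        # Sum
count = 2679           # Count
variance = 2301.1      # Sample Variance

mean = total / count                      # Mean
sd = math.sqrt(variance)                  # Standard Deviation
standard_error = sd / math.sqrt(count)    # Standard Error of the mean

# Approximate 95% confidence half-width (normal multiplier 1.96 assumed).
half_width_95 = 1.96 * standard_error

print(f"mean = {mean:.1f}, SD = {sd:.1f}, SE = {standard_error:.2f}, "
      f"95% half-width = {half_width_95:.2f}")
```

The standard error (0.93) is far smaller than the standard deviation (48.0) because it is divided by the square root of 2679 observations; the two should not be confused when reporting "mean +/-" values.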
Frequency Distribution: Jul 13- Aug 12
Tischer Creek – Summer 2002
[Figure: frequency histogram of turbidity; x-axis: Turbidity (NTUs), y-axis: Frequency]
Note that these data are highly skewed, with >80% of the values in the 20-40 NTU range
There is one value of 1017 NTU, no valid reason to delete it.
Stream Data Visualization
Tischer Creek – Summer 2002 Storm Period
Another plot of Tischer from midsummer 2002
Means vs Medians: Which represents the data better?
The mean of 13 NTU for the 30 day period suggests that
the chronic toxicity standard was violated
The standard deviation was high (48 NTU) relative to the
mean, so the coefficient of variation
was a whopping 369%: CV = (48/13)*100
Although the range was high, from 0 to 1017 NTU, “most of
the time” the stream ran clear with values <<10 NTU. The mode
(most common value) was in fact 0
The median value (the 50th percentile) was 1.0 NTU and perhaps best
characterizes the state of turbidity in the stream and the
level of exposure of the fish.
Determining chronic exposure values for “flashy” data is not
trivial
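The coefficient of variation is just the SD expressed as a percentage of the mean. A minimal sketch follows; the function name is illustrative, and the inputs are the rounded values quoted above plus the surface-temperature values from earlier in the module for contrast.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

# Tischer Creek turbidity, rounded values (NTU).
cv_turbidity = coefficient_of_variation(sd=48, mean=13)
print(f"turbidity CV   = {cv_turbidity:.0f}%")     # 369%

# Ice Lake surface temperature (degrees C), for comparison.
cv_temperature = coefficient_of_variation(sd=2.34, mean=18.48)
print(f"temperature CV = {cv_temperature:.1f}%")   # 12.7%
```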
Excel functions for descriptive statistics:
Format: =STATISTIC(datarange)
Mean                 =AVERAGE()
Median               =MEDIAN()
Standard Deviation   =STDEV()
Minimum              =MIN()
Maximum              =MAX()
Geometric mean       =GEOMEAN()
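For readers working outside a spreadsheet, the same six statistics are available in Python's standard library. The sketch below uses a small hypothetical data range; the comments name the corresponding spreadsheet function.

```python
import statistics

# Hypothetical data range, standing in for a column of spreadsheet cells.
data = [2.0, 4.0, 4.0, 8.0, 16.0]

summary = {
    "Mean": statistics.mean(data),                      # AVERAGE
    "Median": statistics.median(data),                  # MEDIAN
    "Standard Deviation": statistics.stdev(data),       # STDEV (sample, n - 1)
    "Minimum": min(data),                               # MIN
    "Maximum": max(data),                               # MAX
    "Geometric mean": statistics.geometric_mean(data),  # GEOMEAN
}

for name, value in summary.items():
    print(f"{name:<20}{value:.2f}")
```

`statistics.geometric_mean` requires Python 3.8 or later and, like the log shortcut, only works for positive values.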
Upcoming: How can we tell if two
populations of numbers are different?