Denver-water-talk - Civil, Environmental and Architectural


K-Nearest Neighbor Resampling
Technique
(Weather Generation and Water
Quality Applications)
Balaji Rajagopalan
Somkiat Apipattanavis & Erin Towler
Department of Civil, Environmental and
Architectural Engineering
University of Colorado
Boulder, CO
Denver Water
February 2007
“Translation” of Climate Info
• Users most interested in sectoral outcomes
(streamflows, crop yields, risk of disease X)
[Diagram: a climate forecast / projection and historical data pass through a translation step to give synthetic series (e.g., 28.5, 23.1, 29.1, 25.8, … 12.4, 10.2, 11.4, 9.7), which drive process models to produce a frequency distribution of outcomes]
Why Simulation?
• Limited historical data
– cannot capture the full range of variability
– Selecting a (single or a set of) historical years from the record – with
equal chance.
Unconditional bootstrap, Index Sequential Method
• Need – tool to generate ‘scenarios’ that capture the historical
statistical properties
• Several statistical techniques are available
(e.g., time series techniques, Monte Carlo techniques, etc.)
– These are cumbersome, restrictive (in their assumptions)
• Re-sampling techniques are simple and robust
– Unconditional and Conditional bootstrap, K-nearest neighbor (K-NN)
bootstrap offer attractive alternatives.
Re-sampling Techniques
• Drawing cards from a well shuffled deck
– Selecting a (single or a set of) historical years from the record –
with equal chance.
Unconditional bootstrap, Index Sequential Method
• Drawing cards from a biased deck
– Selecting a (single or a set of) historical years with unequal
chance.
E.g., selecting only El Nino years
Conditional bootstrap
• K-Nearest Neighbor Bootstrap – “pattern matching”
– Select ‘K’ nearest neighbors (e.g., years) to the current ‘feature’
– Select one of the K neighbors at random
– Repeat to produce an ensemble
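The three re-sampling schemes above can be sketched in a few lines of Python. This is only an illustrative sketch: the record (`years`, `annual_value`), the El Nino flag, and all function names are made up for the example and are not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical record: 50 historical years, one value per year, and a made-up
# El Nino flag (illustration only, not real climate data).
years = np.arange(1951, 2001)
annual_value = rng.gamma(shape=4.0, scale=25.0, size=years.size)
is_el_nino = (years % 4 == 0)

def unconditional_bootstrap(n_years, n_draws, rng):
    """Well-shuffled deck: every historical year has an equal chance."""
    return rng.integers(0, n_years, size=n_draws)

def conditional_bootstrap(mask, n_draws, rng):
    """Biased deck: draw only from years that satisfy a condition."""
    pool = np.flatnonzero(mask)
    return rng.choice(pool, size=n_draws, replace=True)

def knn_bootstrap(values, feature, k, n_draws, rng):
    """Pattern matching: find the k years closest to the current feature,
    then pick one of them at random for each ensemble member."""
    nearest = np.argsort(np.abs(values - feature))[:k]
    return rng.choice(nearest, size=n_draws, replace=True)

uncond = years[unconditional_bootstrap(years.size, 100, rng)]
cond = years[conditional_bootstrap(is_el_nino, 100, rng)]
knn = years[knn_bootstrap(annual_value, feature=annual_value[-1],
                          k=7, n_draws=100, rng=rng)]
```

Each call returns 100 resampled years, i.e., one ensemble; the only difference between the schemes is the pool the draw is made from.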
Examples
• Ensemble Weather Generation
– Scenario generation
– Forecast
Argentina - Pampas Region
• Water Quality Modeling
(Boulder Water Utility)
Probability of Dry and Wet Days
Dry day      Wet day
0.60 (pd)    0.40 (pw)
Two Step Weather Generator
Transition Prob (pij)
             Dry day      Wet day
Dry day      0.70 (pdd)   0.30 (pdw)
Wet day      0.80 (pwd)   0.20 (pww)
[Figure: Generated Precipitation State time series (0/1 per day) and a table of historical daily values by year (rows) and day of January and February (columns)]
• Estimate Transition (wet to dry, etc.) Probabilities of the Markov Chain order-1 from historical data – for each month
• Generate Precipitation State time series using Markov Chain
• Suppose we need weather simulation for January 5th - January 4th is a wet day
• Get Neighbors from a 7-day window (7*50) centered on January 4th
• Screen days using the Precipitation state [(1,0), days in blue] – i.e., “Potential Neighbors”
• Calculate the distances between weather variables of current day feature vector and the potential neighbors
• Select the K-nearest neighbors, $k = \sqrt{n}$
• Assign them weights $K[j(i)] = \dfrac{1/j}{\sum_{j=1}^{k} 1/j}$
• Pick a day from k-NN using the weight function – say, Jan 1st 1953
• The simulated weather for Jan 5th is Jan 2nd 1953.
• Repeat
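A minimal Python sketch of the two steps (Markov chain for the wet/dry state, then K-NN resampling of the next day's weather) might look like the following. It is an assumption-laden illustration, not the authors' code: the synthetic `precip` and `tmax` arrays, the function names, the wet-day threshold of zero, the standardized Euclidean distance, and the use of a single month are all choices made for the example.

```python
import numpy as np

def estimate_transitions(states):
    """Order-1 Markov matrix P[i, j] = Pr(next = j | current = i); 0 dry, 1 wet."""
    counts = np.zeros((2, 2))
    for row in states:                       # one row per historical year
        np.add.at(counts, (row[:-1], row[1:]), 1)
    return counts / counts.sum(axis=1, keepdims=True)

def simulate_states(P, n_days, first_state, rng):
    """Step 1: generate the precipitation state series from the Markov chain."""
    s = np.empty(n_days, dtype=int)
    s[0] = first_state
    for t in range(1, n_days):
        s[t] = rng.random() < P[s[t - 1], 1]
    return s

def knn_successor(precip, tmax, day, next_state, feature, rng, half_window=3):
    """Step 2: resample the weather for day + 1 given today's feature vector."""
    wet = (precip > 0).astype(int)
    cur_state = int(feature[0] > 0)
    lo = max(day - half_window, 0)
    hi = min(day + half_window, precip.shape[1] - 2)    # successor must exist
    # Potential neighbors: days in the window with the same state transition.
    yy, dd = np.nonzero((wet[:, lo:hi + 1] == cur_state)
                        & (wet[:, lo + 1:hi + 2] == next_state))
    dd = dd + lo
    if yy.size == 0:
        raise ValueError("no potential neighbors for this state pair")
    cand = np.column_stack([precip[yy, dd], tmax[yy, dd]])
    scale = cand.std(axis=0)
    scale[scale == 0] = 1.0
    dist = np.sqrt((((cand - feature) / scale) ** 2).sum(axis=1))
    k = max(1, int(np.sqrt(yy.size)))                   # k = sqrt(n)
    nearest = np.argsort(dist)[:k]
    w = 1.0 / np.arange(1, k + 1)                       # weights proportional to 1/j
    pick = nearest[rng.choice(k, p=w / w.sum())]
    return precip[yy[pick], dd[pick] + 1], tmax[yy[pick], dd[pick] + 1]

# Synthetic "historical" record for one month: 50 years x 31 days (made up).
rng = np.random.default_rng(42)
n_years, n_days = 50, 31
precip = np.where(rng.random((n_years, n_days)) < 0.3,
                  rng.gamma(2.0, 5.0, (n_years, n_days)), 0.0)
tmax = rng.normal(28.0, 3.0, (n_years, n_days))
wet = (precip > 0).astype(int)

P = estimate_transitions(wet)
y0 = rng.integers(n_years)                              # start from a historical day
states = simulate_states(P, n_days, wet[y0, 0], rng)
sim_precip, sim_tmax = np.empty(n_days), np.empty(n_days)
sim_precip[0], sim_tmax[0] = precip[y0, 0], tmax[y0, 0]
for t in range(n_days - 1):
    feature = np.array([sim_precip[t], sim_tmax[t]])
    sim_precip[t + 1], sim_tmax[t + 1] = knn_successor(
        precip, tmax, t, states[t + 1], feature, rng)
```

Because each resampled day is the historical successor of a neighbor with the required state transition, the simulated wet/dry sequence follows the Markov chain while the weather values stay jointly consistent across variables.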
Single Site Simulation
• Pergamino, Argentina
– Daily weather variables 1931-2003
• Precipitation
• Max. Temperature
• Min. Temperature
• 100 simulations of 73 year length (as
length of record)
• Statistics of simulated and historical data
are compared
Spell Properties
Pergamino, Argentina
wet and dry spell statistics
Moments (wet month - Jan)
Moments (dry month - July)
Conditional K-NN Re-sampling
• Conditioned on IRI seasonal
forecast
• Get the prediction
(A:N:B=40:35:25)
• Divide historical (seasonal)
total into 3 tercile categories
• Bootstrap 40, 35 and 25
sample of historical years from
wet, normal and dry categories
• Apply the two-step weather
generator on this sample.
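As a rough sketch of the conditioning step, the historical years can be split into terciles of the seasonal total and then bootstrapped in proportion to the forecast. The seasonal totals below are synthetic, and the function name and arguments are hypothetical; the 40:35:25 split is the A:N:B forecast quoted above.

```python
import numpy as np

def conditional_year_sample(seasonal_total, n_above, n_normal, n_below, rng):
    """Bootstrap year indices from the wet, normal and dry tercile categories."""
    lower, upper = np.quantile(seasonal_total, [1 / 3, 2 / 3])
    dry = np.flatnonzero(seasonal_total <= lower)
    wet = np.flatnonzero(seasonal_total > upper)
    normal = np.flatnonzero((seasonal_total > lower) & (seasonal_total <= upper))
    return np.concatenate([
        rng.choice(wet, size=n_above, replace=True),
        rng.choice(normal, size=n_normal, replace=True),
        rng.choice(dry, size=n_below, replace=True),
    ])

rng = np.random.default_rng(7)
seasonal_total = rng.gamma(shape=5.0, scale=60.0, size=73)   # made-up totals
sample = conditional_year_sample(seasonal_total, 40, 35, 25, rng)
upper = np.quantile(seasonal_total, 2 / 3)
```

The resulting 100 year indices would then define the pool of historical data on which the two-step generator is run.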
Conditional Weather Generation (results)
Multi-site extension
• Same procedure as single site is used but
– Calculate the Average time series – “single site virtual
weather data”
– Apply the two-step generator
– Select the weather at all the locations on the picked
day – to obtain multi-site simulation
• Stations in the Pampas region,
Argentina
• Pergamino
• Junin
• Nueve de Julio
wet and dry spell Statistics
Multisite Case
Pergamino, Argentina
Basic Distribution Properties
Spatial Correlation
Motivation
Finished water must comply with a given
regulation
[Diagram: Influent Water Quality → Water Treatment Plant → Finished Water Quality]
• TOC
• TSUVA
• Alkalinity
• pH
• Turbidity
• Temperature
Motivation
Uncertainty helps us to understand the
risk of non-compliance with a given
regulation
[Figure: probability density functions of the input and output (sw_avg) distributions on either side of the WTP, with the output split into Comply and Non-Compliance regions]
Data Set
Information Collection Rule (ICR)
• Monitoring effort mandated by USEPA
• Large public water systems
• Water quality and operating data
- Disinfection by-products (DBPs) and
microorganisms to support rulemakings
• Most comprehensive view of large
drinking water systems to date
Data Set
ICR
• 18 months (Jul. 1997 – Dec. 1998)
• 458 continental US locations
Data Set
ICR Database
• Water Quality
– Influent
– Intermediate
– Finished
– Distribution system
• Chemical Additions
Characterize Variability
Influent water quality
has significant
variability due to
- climate
- geology
- water management
practices
Source Water
• TOC
• TSUVA
• Alkalinity
• pH
• Turbidity
• Temperature
• Total Hardness
Variability
• Examine influent water quality for
surface waters (SWs)
– Spatial variability
– Temporal variability
• Focus on total organic carbon (TOC)
– TOC is a precursor in formation of DBPs
– Methods extend to other water quality
parameters
Variability
Spatial Variability
• Local polynomial approach
$TOC_{annual\ average} = f(Latitude, Longitude)$
• Find best K and P combination
• Contour estimates
Variability
Spatial Variability
SW Average Annual TOC (mg/L)
$\alpha = 0.30$, $P = 2$
Variability
Spatial Variability
Similar spatial patterns found for
• Finished water TOC (lower)
• Distribution system DBPs
– TTHM (total trihalomethanes)
– HAA5 (five haloacetic acids)
Variability
Spatial Variability
Spatial patterns consistent with
previous research for other
influent water quality variables
• Alkalinity
• Bromide
Variability
Temporal Variability
[Figure: Influent TOC (mg/L) by month (J to D), 1998, City of Boulder’s Betasso Water Treatment Plant (CO)]
Variability
Temporal Variability
• Some locations exhibited
seasonal trends, others did not
• Month to month variations should
be considered
Variability
• Inherent variability
in water quality
contributes to
uncertainty
• How can we
quantify
uncertainty?
Quantify Uncertainty
Simulate “ensembles” of influent
water quality (Monte Carlo)
Observed data:
$\begin{bmatrix} TOC_1 & \dots & TOC_{12} \end{bmatrix}$
Ensembles:
$\begin{bmatrix} TOC_{S1\_1} & \dots & TOC_{S1\_12} \\ \dots & \dots & \dots \\ TOC_{S100\_1} & \dots & TOC_{S100\_12} \end{bmatrix}$
Quantify
Traditional Method
• Fit a probability
density function
(pdf) to the data
-Normal,
Lognormal, etc.
• Simulate from
pdf
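For comparison with the bootstrap approach, the traditional route can be sketched as follows. The twelve `observed_toc` values are invented for the example, and fitting a lognormal by the mean and standard deviation of the logs is just one simple way to do the fit.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical monthly TOC observations (mg/L); not data from the ICR.
observed_toc = np.array([2.1, 2.0, 2.3, 3.4, 4.2, 3.9,
                         3.0, 2.6, 2.4, 2.2, 2.1, 2.0])

# Fit a lognormal pdf: estimate the mean and standard deviation of log(TOC).
log_toc = np.log(observed_toc)
mu, sigma = log_toc.mean(), log_toc.std(ddof=1)

# Simulate an ensemble of 100 values from the fitted pdf.
ensemble = rng.lognormal(mean=mu, sigma=sigma, size=100)
```

With only a handful of observations the fitted parameters are poorly constrained, which is part of what motivates skipping the pdf altogether.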
Normal
Lognormal
Quantify
Limitations
- What if the pdf is not a good fit?
[Figure: Histogram of May (density, 0 to 4e-04, against May values from 1000 to 6000)]
- What if you don’t have enough data to
make the pdf?
ex. 18 months/location in ICR database
Quantify
Space-Time Bootstrapping Method
• Skip fitting a pdf to the data
• Simulate by bootstrapping
• Randomly sample data with replacement
• Expand bootstrapping pool to include
“similar” locations (nearest neighbors)
• What is limited in time is available in space
Quantify
• Find nearest neighbors (locations) in
terms of a feature vector that includes
variables of interest
• Feature vector includes:
- Average Annual Concentration
- Latitude
- Longitude
$FeatureVector = (TOC_{average}, Lat, Lon)$
Quantify
Average annual concentration helps find neighbors that are similar but
may not be geographically nearby.
Geographically close, but not good “neighbors” for bootstrapping
Average annual TOC (mg/L) for Ohio surface
Quantify
• Sample monthly TOC values based on
feature vector
• Conditional probability
$f(TOC_{monthly} \mid FeatureVector)$
$FeatureVector = (TOC_{average}, Lat, Lon)$
Quantify
Simulation Algorithm
1) User inputs their location and their average annual TOC concentration
$x_{user} = \begin{bmatrix} TOC_{user} \\ Lat_{user} \\ Lon_{user} \end{bmatrix}$
2) The ICR database is queried for all eligible entries
$x_{ICR} = \begin{bmatrix} TOC_1 & Lat_1 & Lon_1 \\ \dots & \dots & \dots \\ TOC_i & Lat_i & Lon_i \\ \dots & \dots & \dots \\ TOC_m & Lat_m & Lon_m \end{bmatrix}$
Quantify
Algorithm- cont.
3) Calculate distances, d, between the $x_{user}$ vector and the $x_{ICR}$ vector
“$d = x_{user} - x_{ICR}$”
Quantify
Algorithm- cont.
3) Calculate distances using weighted
Mahalanobis equation
$d_i = \left(W \cdot (x_{user} - x_{ICR\_i})^T\right) S^{-1} \left(W^T \cdot (x_{user} - x_{ICR\_i})\right)$
Quantify
Algorithm- cont.
$d_i = \left(W \cdot (x_{user} - x_{ICR\_i})^T\right) S^{-1} \left(W^T \cdot (x_{user} - x_{ICR\_i})\right)$
Remove the weights (W) and the covariance matrix (S) and it is the
Euclidean distance
Quantify
Algorithm- cont.
$d_i = \left(W \cdot (x_{user} - x_{ICR\_i})^T\right) S^{-1} \left(W^T \cdot (x_{user} - x_{ICR\_i})\right)$
By including S, the covariance matrix, the components of the feature
vector do not have to be scaled (Davis 1986)
Quantify
Algorithm- cont.
$d_i = \left(W \cdot (x_{user} - x_i)^T\right) S^{-1} \left(W^T \cdot (x_{user} - x_i)\right)$
Weights are assigned as
$W = \begin{bmatrix} W_{TOC} & W_{Lat} & W_{Lon} \end{bmatrix}$
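One way to read the weighted distance is sketched below in Python. Several things here are assumptions rather than details from the talk: the weights are applied element by element to the difference vector, S is estimated from the queried ICR entries, and the `x_icr` table is random filler standing in for real ICR records.

```python
import numpy as np

def weighted_mahalanobis(x_user, x_icr, weights):
    """Distance from the user vector to every ICR row (TOC, Lat, Lon).

    Assumes element-wise weighting of the difference vector and a covariance
    matrix S estimated from the ICR rows themselves.
    """
    diff = (x_user - x_icr) * weights            # shape (m, 3)
    S_inv = np.linalg.pinv(np.cov(x_icr, rowvar=False))
    # Quadratic form diff_i * S^-1 * diff_i^T for every row i.
    return np.einsum("ij,jk,ik->i", diff, S_inv, diff)

# Hypothetical ICR table: m locations with (average TOC, latitude, longitude).
rng = np.random.default_rng(11)
m = 30
x_icr = np.column_stack([rng.gamma(3.0, 1.0, m),
                         rng.uniform(25, 49, m),
                         rng.uniform(-124, -67, m)])
x_user = x_icr[0].copy()                         # pretend the user is location 0

d_all = weighted_mahalanobis(x_user, x_icr, np.array([1.0, 1.0, 1.0]))
d_toc = weighted_mahalanobis(x_user, x_icr, np.array([1.0, 0.0, 0.0]))
nearest = np.argsort(d_all)[:5]                  # indices of the 5 nearest neighbors
```

Setting a weight to zero removes that component from the distance, so `d_toc` ranks locations by TOC similarity alone, regardless of where they are.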
Quantify
Weights offer flexibility in neighbor
selection
Panel   W_TOC   W_Lat   W_Lon
(a)     0       1       0
(b)     0       0       1
(c)     1       0       0
(d)     1       1       1
Quantify
Algorithm- cont.
4) Obtain observed monthly data for each nearest neighbor
$x_{NN} = \begin{bmatrix} TOC_{1\_Jan} & \dots & TOC_{1\_Dec} \\ \dots & \dots & \dots \\ TOC_{i\_Jan} & \dots & TOC_{i\_Dec} \\ \dots & \dots & \dots \\ TOC_{k\_Jan} & \dots & TOC_{k\_Dec} \end{bmatrix}$
Quantify
Algorithm- cont.
5) Bootstrap xNN using a weight function
$p_j = \dfrac{1/j}{\sum_{i=1}^{k} 1/i}$
Increases likelihood of picking
nearer neighbors
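A short sketch of the weighted bootstrap is given below. The neighbor matrix `x_nn` is random filler, the function names are hypothetical, and drawing a neighbor independently for each month is an assumption made to keep the example simple.

```python
import numpy as np

def neighbor_weights(k):
    """p_j = (1/j) / sum_{i=1..k} (1/i), for neighbors ranked j = 1..k."""
    w = 1.0 / np.arange(1, k + 1)
    return w / w.sum()

def bootstrap_monthly(x_nn, n_sim, rng):
    """Resample monthly values from the neighbor matrix.

    x_nn has shape (k, 12), rows ordered from nearest to farthest neighbor.
    Returns an (n_sim, 12) ensemble.
    """
    k = x_nn.shape[0]
    picks = rng.choice(k, size=(n_sim, 12), p=neighbor_weights(k))
    return x_nn[picks, np.arange(12)]            # keep each value in its own month

rng = np.random.default_rng(5)
k = 6
x_nn = rng.gamma(3.0, 1.0, size=(k, 12))          # made-up monthly TOC (mg/L)
ensemble = bootstrap_monthly(x_nn, 100, rng)
```

With k = 3 the weights are 6/11, 3/11 and 2/11, so the nearest neighbor is picked a little more than half the time.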
Quantify
Apply algorithm to quantify uncertainty
in influent TOC concentration
City of Boulder’s Betasso Water Treatment Plant (CO)
Boulder
SWs only, N = 334
Quantify
Identify nearest neighbors
- Include Boulder in pool for bootstrapping
Red dot is the Boulder plant being simulated
Empty black dots are the “neighbors” to be bootstrapped
$W_{TOC} = 1$, $W_{Lat} = 1$, $W_{Lon} = 1$
Quantify
Box plot each monthly bootstrap
ensemble (100 values)
[Box plot legend: 95th percentile, 75th percentile, median, 25th percentile, 5th percentile, outliers]
Quantify
Uncertainty quantified for Boulder
• Simulates seasonal trends
• Provides rich variety of uncertainty
[Figure: influent TOC (mg/L) by month (J to D) and annual (Ann); 1998 shown]
Quantify
Overlay recent data
• Simulations capture recent data
[Figure: influent TOC (mg/L) by month (J to D) and annual (Ann); 1997, 1998, 2003, 2004 and 2005 shown]
Quantify
Portable Across Locations
City of Birmingham’s Carson Filter Plant (AL)
[Figure: influent TOC (mg/L) by month (J to D) and annual (Ann); 1998 shown]
Quantify
Portable Across Locations
City of Birmingham’s Carson Filter Plant (AL)
[Figure: influent TOC (mg/L) by month (J to D) and annual (Ann)]
Quantify
Portable Across Locations
City of Birmingham’s Carson Filter Plant (AL)
[Figure: influent TOC (mg/L) by month (J to D) and annual (Ann); 1997, 1998, 2003, 2004 and 2005 shown]
Quantify
Applies to Other Variables
New Jersey American Water Swimming River
Treatment Plant (NJ)
[Figure: influent alkalinity (as mg/L CaCO3) by month (J to D) and annual (Ann); 1998 observations (obs_1998) shown]
Quantify
Applies to Other Variables
New Jersey American Water Swimming River
Treatment Plant (NJ)
[Figure: influent alkalinity (as mg/L CaCO3) by month (J to D) and annual (Ann); 1998 observations (obs_1998) shown]
Quantify
Applies to Other Variables
New Jersey American Water Swimming River
Treatment Plant (NJ)
[Figure: influent alkalinity (as mg/L CaCO3) by month (J to D) and annual (Ann); 1997, 1998, 2002, 2003, 2004 and 2005 shown]
Summary & Conclusions
• K-NN resampling technique provides a simple
and robust alternative to generating
‘scenarios’.
– Quantify Uncertainty
– Ensemble forecast
• Very general – can be easily applied to a
variety of situations.
– Weather generation
– Water Quality
– Streamflow (Colorado River Basin)
• Can readily be extended to generate
‘scenarios’ under climate change or
decadal variability
– Modify the ‘feature vector’ to include the climate variability
information
• Rajagopalan and Lall (1999); Yates et al. (2003); Apipattanavis et al. (2007),
all papers in Water Resources Research
Acknowledgements
AwwaRF project 3115
“Decision Tool to Help Utilities Develop Simultaneous
Compliance Strategies”
Utilities
City of Boulder’s Betasso Water Treatment Plant (CO)
City of Birmingham’s Carson Filter Plant (AL)
New Jersey American Water Swimming River Treatment
Plant (NJ)
Greater Cincinnati (OH) Water Works Richard Miller
Water Treatment Plant
Questions
“It is better to be
roughly right than
precisely wrong.”
-John Maynard Keynes (1883-1946)