
Data-Driven Modeling and
Machine Learning Regression
Approach in Water Resource
Systems
CEE 6410
Water Resources Systems Analysis
Data-driven Models

Find relationships between the system state
variables without explicit knowledge of the
physical behavior of the system.

Examples: The unit hydrograph method,
statistical models (ARMA, ARIMA) and
machine learning (ML) models.
Why data-driven modeling and machine
learning in water resource systems?

Some highly complex processes in water resource systems are difficult to
understand and simulate using a physically based approach.
Example: The Lower Sevier River Basin System, Utah
Why data-driven modeling and machine
learning in water resource systems?

Physically based modeling is limited by the lack of
required data and the expense of data acquisition.

Data-driven models (Machine Learning) as an
alternative.

Machine Learning models replicate the expected
response of a system.
Example of ML Uses

Anomaly detection: Identification of unusual data records (outliers, pattern changes, data deviations) in weather or hydrological time series variables.

Association rule learning: Discovery of relationships (dependencies) between variables from different sources for a given phenomenon, e.g., identification of critical weather variables, vegetation cover, and urban development information to explain the change of lake water levels in time.

Clustering: Detection of groups and structures in the data that are alike, without using known structures or relationships in the data. For example, detection of areas with similar weather-hydrological patterns in the Western US.

Classification: Discovery of structures in the data to identify patterns among them. For example, identification of vegetation cover in aerial or satellite images.

Regression: Identification of a mathematical expression or equation that models the data with the least error, e.g., prediction of water flow in rivers based on weather parameters and local geographic conditions.

Summarization: Compact representation of data (visualization and reporting), e.g., reduction of Landsat TM/ETM+ satellite bands from 7 to 3 using Principal Component Analysis.
Supervised vs. Unsupervised learning
Supervised learning: relate attributes to a target by
discovering patterns in the data.
These patterns are used to predict values of the target
in future data.
Unsupervised learning: the data have no target attribute.
Explore the data to find intrinsic structures in them.
Supervised vs. Unsupervised learning
Data-Driven Algorithm | Supervised Learning | Unsupervised Learning
Artificial Neural Networks (ANN) | Classification, Regression, Association Rule Learning | Clustering
Support Vector Machines (SVM) | Classification, Regression, Association Rule Learning | Clustering, Anomaly Detection
Random Forest (RF) | Classification, Regression, Association Rule Learning | Clustering
Relevance Vector Machines (RVM) | Classification, Regression, Association Rule Learning | Clustering
Classification And Regression Trees (CART) | Classification, Regression |
Linear Discriminant Analysis | Classification |
Procedure

1. Objective
2. Data Retrieval & Analysis
3. Input-Output Selection
4. Learning Machine Calibration
5. Comparison & Robustness Analysis
Analysis – Supervised Learning

Machine Learning Approach:

• Input inclusion (Curse of Dimensionality)
• Generalization (Overfitting)
• Impact of unseen data (Robustness)
• Performance comparison (vs. another similar algorithm)
Analysis - Regression
Nash coefficient of efficiency (η or E):
• Similar to the coefficient of determination (r²)
• Range: −∞ to 1
• Non-dimensional units

$$E = 1 - \frac{\sum_{n=1}^{N}\left(t_n - t_n^*\right)^2}{\sum_{n=1}^{N}\left(t_n - t_{av}\right)^2}$$

Root mean square error (RMSE), same units as the model response:

$$RMSE = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(t_n - t_n^*\right)^2}$$

where:
t : observed output
t* : predicted output
tav : observed average output
N : number of observations.
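A minimal Python sketch of these two metrics (function names, the NumPy dependency, and the example values are illustrative; the slides do not prescribe an implementation):

```python
import numpy as np

def nse(t_obs, t_pred):
    """Nash-Sutcliffe coefficient of efficiency E (range -inf to 1)."""
    t_obs, t_pred = np.asarray(t_obs), np.asarray(t_pred)
    return 1.0 - np.sum((t_obs - t_pred) ** 2) / np.sum((t_obs - t_obs.mean()) ** 2)

def rmse(t_obs, t_pred):
    """Root mean square error, same units as the model response."""
    t_obs, t_pred = np.asarray(t_obs), np.asarray(t_pred)
    return np.sqrt(np.mean((t_obs - t_pred) ** 2))

# Hypothetical observed vs. predicted streamflows
obs = [10.0, 12.5, 9.8, 14.2]
pred = [9.5, 13.0, 10.1, 13.8]
print(nse(obs, pred), rmse(obs, pred))
```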
Analysis - Classification
Confusion matrix
• Helps evaluate classifier performance on a class-by-class basis

Kappa Coefficient: robust measurement of classification accuracy

$$K = \frac{N\sum_{i=1}^{n} x_{ii} - \sum_{i=1}^{n}\left(x_{i+} \times x_{+i}\right)}{N^2 - \sum_{i=1}^{n}\left(x_{i+} \times x_{+i}\right)}$$

where:
n = number of classes,
xii = No. of observations on the diagonal of the confusion matrix corresponding to row i and column i,
xi+ and x+i = marginal totals of row i and column i, respectively,
N = No. of instances.
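A short sketch of this Kappa computation from a confusion matrix (the function name and example matrix are illustrative, not from the slides):

```python
import numpy as np

def kappa(confusion):
    """Kappa coefficient from an n x n confusion matrix."""
    x = np.asarray(confusion, dtype=float)
    N = x.sum()                                       # total number of instances
    diag = np.trace(x)                                # sum of x_ii
    chance = np.sum(x.sum(axis=1) * x.sum(axis=0))    # sum of x_i+ * x_+i
    return (N * diag - chance) / (N ** 2 - chance)

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted)
print(kappa([[50, 2, 3], [4, 40, 6], [1, 5, 45]]))
```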
A Neural Network Model:
Bayesian Multilayer Perceptron for
Regression & Classification
Bayesian Multilayer Perceptron (MLP)
ANN algorithm that uses the Bayesian Inference Method (b):
$$[y_1, y_2, \ldots, y_n] = W^{II}\tanh\left(W^{I}[x] + b^{I}\right) + b^{II}$$

Where:
y1, y2, …, yn = simultaneous results from the algorithm,
WI, WII, bI, bII = model weights and biases,
[x] = inputs.
MLP:
• Also used with success in simulation and forecasting of soil moisture, reservoir
management, groundwater conditions, etc.
• Probabilistic approach, noise effect minimization, error prediction bars, etc.
(b) Implemented by Nabney (2005)
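A minimal NumPy sketch of this forward pass (the weight shapes and random values are illustrative only; the slides reference Nabney's implementation, not this one):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: y = W2 * tanh(W1 x + b1) + b2."""
    h = np.tanh(W1 @ x + b1)   # hidden-layer activations
    return W2 @ h + b2         # simultaneous outputs y1..yn

# Illustrative shapes: 3 inputs, 5 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
print(mlp_forward(np.array([0.2, -1.0, 0.5]), W1, b1, W2, b2))
```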
Bayesian Multilayer Perceptron (BMLP)

Using a dataset D = [x(n), t(n)] with n = 1…N, the training of the
parameters [WI, WII, bI, bII] is performed by minimizing the
Overall Error Function E (Bishop, 2007):

$$E = \beta E_D + \alpha E_W = \frac{\beta}{2}\sum_{n=1}^{N}\left(t^{(n)} - y^{(n)}\right)^2 + \frac{\alpha}{2}\sum_{i=1}^{W} w_i^2$$

Where:
ED: data error function,
EW: penalization term,
W: number of weights and biases in the BMLP, and
α and β: Bayesian hyper-parameters.
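A sketch of this objective in NumPy (assuming the weights and biases have been flattened into a single vector; purely illustrative):

```python
import numpy as np

def overall_error(t, y, weights, alpha, beta):
    """E = beta*E_D + alpha*E_W, with E_D = 0.5*sum((t - y)^2) and E_W = 0.5*sum(w^2)."""
    E_D = 0.5 * np.sum((np.asarray(t) - np.asarray(y)) ** 2)  # data error term
    E_W = 0.5 * np.sum(np.asarray(weights) ** 2)              # weight penalty term
    return beta * E_D + alpha * E_W
```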
Bayesian Multilayer Perceptron (BMLP)
For regression tasks, the Bayesian Inference allows the
prediction y(n) and the variance of the predictions σy2, once the
distribution of W has been estimated by maximizing the
likelihood for α and β (Bishop, 2007).
1
1
σ β g H g
2
y
T
The output variance has two sources; the first source arises
from the intrinsic noise in the output values ; and the second
source comes from the posterior distribution of the BMLP
weights. The output standard deviation vector σy can be
interpreted as the error bar for confidence interval estimation
(Bishop, 2007).
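A two-line NumPy sketch of this variance estimate, assuming the gradient g and Hessian H have already been computed:

```python
import numpy as np

def predictive_variance(beta, g, H):
    """sigma_y^2 = 1/beta + g^T H^{-1} g (solve a linear system instead of inverting H)."""
    return 1.0 / beta + g @ np.linalg.solve(H, g)
```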
Bayesian Multilayer Perceptron (BMLP)
For classification tasks, the Bayesian Inference method allows
for the estimation of the probability of a given class from the input
variables using a logistic sigmoid function (Nabney, 2002):

$$y = \frac{1}{1 + \exp(-a)}, \qquad a = \sum_{i=1}^{m} w_i x_i + b$$
BMLP classification - example

[Figure: classification example with predicted probability contours from 0.1 to 0.9; the 0.5 contour marks the decision boundary.]
BMLP regression - example
Relevance Vector
Machine for Regression
“Entities should not be multiplied unnecessarily”
William of Ockham
“Models should be no more complex than is
sufficient to explain the data” Michael E. Tipping
Relevance Vector Machine for
Regression
Developed by Tipping [2001]
Given a training set of input-target pairs $\{x_n, t_n\}_{n=1}^{N}$,
where N is the number of observations,
the target vector can be written as:

$$\mathbf{t} = \mathbf{y} + \boldsymbol{\varepsilon} = \Phi\mathbf{w} + \boldsymbol{\varepsilon}$$

w is the weight vector,
Φ is a "design" matrix built from a kernel K(xn, xm), and
the error ε is assumed to be zero-mean Gaussian, with
variance σ².
A likelihood distribution of the complete data set:

$$p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = \left(2\pi\sigma^2\right)^{-N/2}\exp\left(-\frac{\lVert\mathbf{t} - \mathbf{y}\rVert^2}{2\sigma^2}\right)$$
There is a danger that the maximum likelihood estimation of w and σ²
will suffer from severe over-fitting.

Imposition of an additional penalty term to the likelihood:

$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = (2\pi)^{-M/2}\prod_{m=1}^{M}\alpha_m^{1/2}\exp\left(-\frac{\alpha_m w_m^2}{2}\right)$$
This prior is ultimately responsible for the sparsity properties of
the RVM
The posterior parameter distribution conditioned on the data:

$$p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2) = \frac{p(\mathbf{t} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \boldsymbol{\alpha})}{p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)}$$
The posterior probability is assigned to values which are both
probable under the prior and "which explain the data" (Tipping,
2004).
The optimal values for many of the parameters are
infinite; therefore, the posterior distributions
of the associated weights concentrate at zero and the
corresponding inputs are "irrelevant".
The non-zero elements are "The Relevance Vectors".
RVM approximations to "sinc"
function
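A compact NumPy sketch of the RVM re-estimation loop in the spirit of Tipping (2001), fitted to noisy sinc data as in the example above; the kernel width, iteration count, and pruning threshold are illustrative choices, not values from the slides:

```python
import numpy as np

def rbf_design(x, centers, r=2.0):
    """Design matrix Phi: a bias column plus one RBF kernel column per center."""
    K = np.exp(-((x[:, None] - centers[None, :]) ** 2) / r ** 2)
    return np.hstack([np.ones((len(x), 1)), K])

def rvm_fit(Phi, t, n_iter=200, alpha_cap=1e9):
    """Sparse Bayesian regression: iteratively re-estimate alpha_m and sigma^2."""
    N, M = Phi.shape
    alpha = np.ones(M)           # one precision hyper-parameter per weight
    sigma2 = 0.1 * np.var(t)     # initial noise-variance guess
    for _ in range(n_iter):
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))  # posterior cov.
        mu = Sigma @ Phi.T @ t / sigma2                               # posterior mean
        gamma = 1.0 - alpha * np.diag(Sigma)          # well-determined factors
        alpha = np.minimum(gamma / (mu ** 2 + 1e-12), alpha_cap)
        sigma2 = np.sum((t - Phi @ mu) ** 2) / max(N - gamma.sum(), 1e-12)
    return mu, alpha

# Noisy samples of sinc(x) = sin(x)/x, with kernels centered on the inputs
x = np.linspace(-10, 10, 100)
t = np.sinc(x / np.pi) + 0.1 * np.random.default_rng(1).normal(size=x.size)
mu, alpha = rvm_fit(rbf_design(x, x), t)
print("relevance vectors:", np.sum(alpha < 1e6))  # weights that survived pruning
```

Weights whose optimal alpha diverges are pruned; the few basis functions left are the relevance vectors that reconstruct the sinc curve.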
Generalization and Robustness
Analysis
Overfitting
 Calibrate the ML models with one training data set
and evaluate their performance with a
different, unseen test data set.
 ML applications are often ill-posed problems:
t = f(x),
where a small variation in x may cause large changes in t.

Model Robustness
Bootstrap Method
 Each bootstrap sample is created by randomly sampling with replacement from the
whole training data set.
 A robust model is one that shows narrow confidence
bounds in the bootstrap histogram, as illustrated below:
[Figure: bootstrap histograms of RMSE for Model A and Model B across test events R1d, R1d+1, R2d, and R2d+1. Model B shows narrower confidence bounds, so Model B is more robust.]
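A minimal sketch of this bootstrap robustness check (the `fit` routine, arrays, and sample counts are placeholders; any calibration procedure returning a predictor would do):

```python
import numpy as np

def bootstrap_rmse(fit, x_train, t_train, x_test, t_test, n_boot=500, seed=0):
    """RMSE distribution from refitting on bootstrap resamples of the training set."""
    rng = np.random.default_rng(seed)
    n = len(t_train)                 # x_train, t_train, x_test, t_test: NumPy arrays
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        model = fit(x_train[idx], t_train[idx])   # recalibrate on the resample
        err = model(x_test) - t_test
        scores[b] = np.sqrt(np.mean(err ** 2))
    return scores  # plot as a histogram; a narrow spread indicates a robust model
```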