SEPA Presentation 09/07/2009 - Centre for Intelligent Environmental

Decision Support Tools for
River Quality Management
Martin Paisley, David Trigg and William Walley
Centre for Intelligent Environmental Systems,
Faculty of Computing, Engineering & Technology,
Staffordshire University
Contents

Background
  Our Aims
  Our Approach
The River Pollution Diagnostic System (RPDS)
  Pattern Recognition
  Data exploration, diagnosis and classification
The River Pollution Bayesian Belief Network (RPBBN)
  Plausible Reasoning
  Diagnosis, prognosis and scenario-testing
Summary
© 2009 David Trigg
Our Aims

Maximise the benefit gained from existing databases and information, and increase objectivity.
Exploit the available technology to create sophisticated, flexible, multi-purpose tools.
Make the technology easy to use.
Provide expert support to those who need it, to help them do their jobs.
Our Approach

Our initial studies with expert ecologist H.A. Hawkes led to the goal of trying to capture expertise.
Expert systems is the branch of Artificial Intelligence (AI) that attempts to capture expertise in a computer-based system.
Study of an expert is required to reveal:
  what they do,
  how they do it; and
  what information and mental processes they use.
The Expert Ecologist

Our early research discovered that the expert ecologist tended to use two complementary techniques.
  Memory (pattern matching) – “I’ve seen this before, it was due to …”
  Scientific knowledge (plausible reasoning) – based on their knowledge of the system and the available evidence, they are able to reason about the likely state of other elements of the system.
We set out to replicate these processes and produce software that would allow people to gain easy access to ‘expert’ interpretation.
The Modelling Tools

After more than a decade of research in this field, the modelling techniques we currently use are:
  our own clustering and visualisation system, known as MIR-Max (Mutual Information & Regression Maximisation), for pattern matching; and
  Bayesian Belief Networks (BBN) for plausible reasoning.
These techniques were used to produce the models on which our decision support software is based.
What the tools provide

Visualisation and exploration of large, complex datasets. (RPDS)
Classification of samples. (RPDS)
Diagnosis of potential pressures. (RPDS & RPBBN)
Prediction of biology from environmental and chemical parameters. (RPBBN)
Scenario testing – impact of changing sample parameters. (RPBBN)
Pattern Recognition
Pattern Recognition – What is it?

Recognition of patterns – a pattern implies multiple attributes, so this is a multivariate technique.
Classification of a new pattern (thing) as being of a particular type, based on its similarity to a set of attributes indicative of that type.
The success of pattern recognition relies on having the appropriate distinguishing features.
  Enough features to clearly discriminate.
  An appropriate set of features – orthogonal/uncorrelated.
Pattern Recognition – Why do it?

A method of managing information – reduces multiple instances to a single type or kind.
Classification of situations allows us to cope with novel but similar situations.
Exploitation of existing ‘information’.
Once something is identified as being of a type, its ‘unknown’ attributes can be inferred.
Pattern Recognition - Clustering

To create a model, we first need to cluster the training samples.
The training samples contain data on both the training/clustering variables and additional ‘information’ variables (those that are to be predicted).
In the case of RPDS, the training variables are the biology and the information variables are the chemical and other stress parameters.
Pattern Recognition - Clustering
Set of samples .. grouped into ‘clusters’ .. to provide templates/types in the model
Pattern Recognition - Classification

Classification involves matching a new sample with an existing cluster, based on the training variables.
In this example the closest match for the new sample is cluster ‘A’.
This is the ‘classification’ of the new sample. The quality of the cluster is assigned to the new sample.
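As a rough illustration of the matching step on this slide, the sketch below assigns a new sample to the cluster whose template is closest. Euclidean distance and mean-abundance templates are assumptions made for the example and are not necessarily the similarity measure RPDS uses; the cluster names, abundance values and quality classes are invented.

```python
import math

# Hypothetical cluster templates: mean abundance of three taxa per cluster,
# plus the quality class attached to each cluster. All values are invented.
TEMPLATES = {
    "A": {"profile": [3.0, 2.0, 0.0], "quality": "good"},
    "B": {"profile": [1.0, 1.0, 2.0], "quality": "moderate"},
    "C": {"profile": [0.0, 0.0, 4.0], "quality": "poor"},
}


def classify(sample, templates=TEMPLATES):
    """Return (cluster name, quality class) of the template nearest to sample.

    Euclidean distance is an assumption made for this sketch.
    """
    nearest = min(
        templates,
        key=lambda name: math.dist(sample, templates[name]["profile"]),
    )
    return nearest, templates[nearest]["quality"]


new_sample = [3.0, 1.0, 0.0]
cluster, quality = classify(new_sample)
print(cluster, quality)
```

Only the training variables (here, the taxa abundances) take part in the match; the quality class is simply read off the winning cluster.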
Pattern Recognition - Diagnosis

The diagnosis is derived from the values of the information variables (the blue bars) in the training samples grouped in the cluster.
The predicted values are derived from the training samples in the cluster. These values are usually a statistic such as the mean, the median or a percentile.
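The summary statistics mentioned on this slide can be sketched as follows. The ammonia values are invented, and the percentile uses linear interpolation between ordered values, which is one of several common conventions and an assumption here rather than the method RPDS uses.

```python
import statistics

# Hypothetical ammonia concentrations (mg/l) recorded for the training
# samples grouped in one cluster. The values are invented.
cluster_ammonia = [2.0, 3.0, 3.0, 5.0, 7.0]


def percentile(values, p):
    """Percentile by linear interpolation between ordered values (0 <= p <= 100)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarise(values):
    """Summary statistics that could be reported as the predicted value."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": percentile(values, 90),
    }


summary = summarise(cluster_ammonia)
print(summary)
```

Reporting a high percentile alongside the mean or median gives a sense of the worst conditions seen among the samples in the cluster, not just the typical ones.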
Visualisation

Classification can appear to be a black-box system.
Visualisation is a useful tool.
  It opens the model up for inspection.
  It helps to understand and validate the model.
  It helps to explore the data and discover new relationships.
To help visualisation, clusters can be ‘ordered’ in a map.
Ordering

The sole purpose of ordering is to help visualise the data and the cluster model, no more, no less.
The process involves arranging the clusters in a space/map, usually based on similarity. Similar clusters are placed close together, dissimilar ones far apart.
Our algorithm, R-Max, uses the correlation coefficient r between distances in data space and the corresponding distances in output space.
Data Visualisation - Ordering Clusters

[Figure: clusters i and j shown in data space (axes x, y, z), separated by distance d, and in the map (axes X, Y), separated by distance D]
d = distance in data space
D = distance between clusters in map
R-Max aims to maximise the correlation r between d and D
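The quantity that R-Max aims to maximise can be sketched as below: the Pearson correlation r between the pairwise distances d in data space and the pairwise distances D in the map. This only evaluates a given layout; how R-Max searches for a good layout is not described in these slides and is not shown. Euclidean distance, the cluster centres and both layouts are assumptions made for the example.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every pair of points, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ordering_score(data_points, map_points):
    """Correlation r between data-space distances d and map distances D."""
    d = pairwise_distances(data_points)
    D = pairwise_distances(map_points)
    return pearson_r(d, D)


# Three hypothetical cluster centres in a 3-D data space.
clusters = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

# Two candidate map layouts for the same clusters (2-D map coordinates).
faithful_layout = [(0.0, 0.0), (2.0, 0.0), (6.0, 0.0)]
scrambled_layout = [(0.0, 0.0), (6.0, 0.0), (2.0, 0.0)]

print(ordering_score(clusters, faithful_layout))
print(ordering_score(clusters, scrambled_layout))
```

A layout that keeps similar clusters close together scores near 1, while a layout that separates them scores lower, so the score can be used to compare candidate maps.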
Pattern Recognition - Ordering

[Figure: cluster templates/types … destination map … clusters ordered by similarity]
Pattern Recognition - Visualisation

Maps can be colour-coded to show the value of any chosen feature across all of the clusters.
‘Feature maps’ and ‘templates’ form the basis of RPDS visualisation.
RPDS 3.0

Primary uses are:
  Data exploration – the visual element added to the clustered/organised data allows existing relationships in the data to be verified (model validation) and new ones to be identified (data mining).
  Classification – assignment of a sample to a cluster allows an estimated quality class to be defined.
  Diagnosis – the ‘known’ stress information associated with other samples in the cluster can help diagnose potential problems.
RPDS 3.0 - Data Exploration
RPDS 3.0 - Classification
RPDS 3.0 - Diagnosis
RPDS 3.0 - Comparison
Plausible Reasoning
Reasoning

Reasoning:
  Thinking that is coherent and logical.
  A set of cognitive processes by which an individual may infer a conclusion from an assortment of evidence or from statements of principles.
  Goal-directed thought that involves manipulating information to draw conclusions of various kinds.
  The use of available information, combined with existing knowledge, to derive conclusions for a particular purpose.
Reasoning with Uncertainty

If reasoning is ‘coherent and logical’, how can it deal with unknowns, conflicting information and uncertainty?
The ability to quantify uncertainty helps to resolve conflicts and provides ‘lubrication’ for the reasoning process.
In humans this takes the form of beliefs.
Probability theory provides a mathematical method of handling uncertainty.
Probability Theory

Probability theory is robust and proven to be mathematically sound.
It provides a method for representing and manipulating uncertainty.
It is one of the principal methods used for handling uncertainty in computer-based systems.
Bayesian Belief Networks (BBN) are currently the most popular method for creating probabilistic systems.
Bayesian Belief Networks

A BBN consists of two elements: a causal network and a set of probability matrices.
A causal network is a graph of nodes (variables) and directed edges (relationships).
The network defines the relationships between all the variables in a domain.
The causal variables are often referred to as ‘parents’ and the effect variables as ‘children’.
The network can be defined through data analysis, but this is probably best done by an expert.
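One simple way to hold a causal network in code is a mapping from each variable to its parents, as sketched below. The variable names and links are invented for illustration and are not taken from RPBBN. The acyclicity check reflects the standard requirement that a Bayesian network is a directed graph without cycles.

```python
# Hypothetical causal network, stored as child -> list of parents.
# Variable names are invented for illustration only.
PARENTS = {
    "organic_pollution": [],
    "dissolved_oxygen": ["organic_pollution"],
    "mayfly_present": ["dissolved_oxygen"],
    "worm_abundance": ["organic_pollution", "dissolved_oxygen"],
}


def children_of(variable, parents=PARENTS):
    """Variables that have `variable` as a direct parent (its 'children')."""
    return sorted(child for child, ps in parents.items() if variable in ps)


def is_acyclic(parents=PARENTS):
    """Check that the directed graph has no cycles, as a BBN requires."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return True
        if node in visiting:
            return False  # reached a node already on the current path
        visiting.add(node)
        ok = all(visit(p) for p in parents[node])
        visiting.discard(node)
        done.add(node)
        return ok

    return all(visit(node) for node in parents)


print(children_of("organic_pollution"))
print(is_acyclic())
```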
Causal Network
Probability Matrix

The probability matrices encode the relationships between variables.
A probability is required for every combination of parent and child states.
The number of combinations grows geometrically with the number of parents, meaning that the derivation of the probabilities is often better achieved via data analysis.
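A short calculation shows how quickly the matrices grow. The sketch below counts the probabilities needed for one child variable; the state counts are invented for illustration.

```python
import math


def cpt_size(child_states, parent_state_counts):
    """Number of probabilities in one child's conditional probability matrix.

    One probability is needed for each child state under every combination
    of parent states.
    """
    return child_states * math.prod(parent_state_counts)


# A hypothetical child variable with 3 states and a growing list of
# parents that each have 4 states.
for n_parents in range(5):
    print(n_parents, cpt_size(3, [4] * n_parents))
```

With four such parents the matrix already needs 768 values, which is more than an expert could reasonably be asked to estimate by hand.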
Outputs - Predictions

The outputs of the system are the likelihoods of each of the states of the variables occurring.
The whole system is updated every time evidence is entered, regardless of where it occurs.
The most common way to represent the values is through a bar chart, where the bars depict the likelihood of each state.
[Figure labels: Variable Name, State Labels, Probability Bars, Probability Values (0 - 100)]
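The way evidence updates beliefs in either direction can be shown with a minimal two-variable sketch. The network (pollution as parent, mayfly as child), the state labels and every probability are invented for illustration; a real BBN such as RPBBN has many more variables and would use a general inference algorithm instead of this hand-written Bayes' rule.

```python
# Hypothetical two-variable network: pollution -> mayfly.
# All probabilities are invented for illustration.
PRIOR_POLLUTION = {"low": 0.7, "high": 0.3}

# P(mayfly state | pollution state): one row per parent state.
MAYFLY_GIVEN_POLLUTION = {
    "low": {"present": 0.8, "absent": 0.2},
    "high": {"present": 0.1, "absent": 0.9},
}


def predict_mayfly(pollution_belief):
    """Prognosis: belief in each mayfly state given a belief about pollution."""
    result = {"present": 0.0, "absent": 0.0}
    for p_state, p_prob in pollution_belief.items():
        for m_state, m_prob in MAYFLY_GIVEN_POLLUTION[p_state].items():
            result[m_state] += p_prob * m_prob
    return result


def diagnose_pollution(mayfly_observed):
    """Diagnosis: posterior belief in pollution after observing the mayfly state."""
    joint = {
        p_state: PRIOR_POLLUTION[p_state]
        * MAYFLY_GIVEN_POLLUTION[p_state][mayfly_observed]
        for p_state in PRIOR_POLLUTION
    }
    total = sum(joint.values())
    return {p_state: value / total for p_state, value in joint.items()}


def as_percentages(belief):
    """Scale probabilities to the 0 - 100 range used on the probability bars."""
    return {state: round(100 * prob, 1) for state, prob in belief.items()}


print(as_percentages(predict_mayfly(PRIOR_POLLUTION)))
print(as_percentages(diagnose_pollution("absent")))
print(as_percentages(predict_mayfly({"low": 1.0, "high": 0.0})))
```

The first call shows the beliefs before any evidence is entered, the second is a diagnosis from a biological observation, and the third is a simple scenario test in which pollution is set to low.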
RPBBN 2.0

Primary uses are:
  Prediction of concentrations of common ‘chemical’ pollutants from biological sample data.
  Scenario testing: prediction of the new biological community and biological assessment ‘scores’ resulting from the modification of changeable environmental and chemical parameters for a site.
RPBBN 2.0 - Prediction
RPBBN 2.0 - Scenario Testing
Summary

RPDS organises the EA dataset, allowing exploration and analysis, and provides the ability to classify new samples and diagnose potential problems.
RPBBN allows prediction of the states of variables in a system based on any available evidence, making it useful for diagnosis, prognosis and scenario testing.
Together these tools can help decision makers identify potential problems, suggest areas for further investigation, help develop programmes of remedial action and define targets.
Summary

The models are based primarily on data analysis, making them more objective than expert opinion.
The systems are robust and consistent in their operation.
The software is easily reproduced and distributed, meaning that the valuable expertise it holds can easily be spread throughout an organisation.
The Future

River Quality - include more geographic information and move from site to river basin management.
Improvement in algorithms, incorporation of sample bias and improved confidence measures.
Major revision of software – potentially rewritten as a web-based application.