BNs for water quality prediction in Sydney Harbour
Using Bayesian Networks for Water Quality Prediction in Sydney Harbour
Ann Nicholson
Shannon Watson, Honours 2003
Charles Twardy, Research Fellow
School of Computer Science and Software Engineering
Monash University
1
Overview
Representing uncertainty
Introduction to Bayesian Networks
» Syntax, semantics, examples
The knowledge engineering process
Sydney Harbour Water Quality Project 2003
Summary of other BN research
2
Sources of Uncertainty
Ignorance
Inexact observations
Non-determinism
AI representations
» Probability theory
» Dempster-Shafer
» Fuzzy logic
3
Probability theory for representing uncertainty
Assigns a numerical degree of belief between 0 and 1 to facts
» e.g. “it will rain today” is T/F.
» P(“it will rain today”) = 0.2 is a prior (unconditional) probability
Posterior probability (conditional)
» P(“it will rain today” | “rain is forecast”) = 0.8
Bayes’ Rule: P(H|E) = P(E|H) × P(H) / P(E)
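A minimal sketch of Bayes' Rule for the rain example. The slide gives only the prior (0.2) and the posterior (0.8); the two likelihoods below are hypothetical values chosen so that the rule reproduces that posterior, and the function name is illustrative.

```python
def bayes_posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Return P(H|E) from the prior P(H) and the two likelihoods."""
    # Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|not H)P(not H)
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e


# H = "it will rain today", E = "rain is forecast".
# P(H) = 0.2 is from the slide; 0.8 and 0.05 are assumed likelihoods.
posterior = bayes_posterior(prior_h=0.2, p_e_given_h=0.8, p_e_given_not_h=0.05)
print(round(posterior, 3))
```

With these assumed likelihoods, P(E) works out to 0.2 and the posterior to 0.8.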
4
Bayesian networks
A Bayesian Network (BN) represents a probability
distribution graphically (directed acyclic graphs)
Nodes: random variables,
» R: “it is raining”, discrete values T/F
» T: temperature, continuous or discrete variable
» C: colour, discrete values {red,blue,green}
Arcs indicate conditional dependencies between
variables
P(A,S,T) can be decomposed to P(A)P(S|A)P(T|A)
5
Bayesian networks
Conditional Probability Distribution (CPD)
– Associated with each variable
– probability of each state given parent states
Flu: “Jane has the flu”
» P(Flu=T) = 0.05
Te: “Jane has a high temp”
» P(Te=High|Flu=T) = 0.4
» P(Te=High|Flu=F) = 0.01
» Arc Flu → Te models causal relationship
Th: “Thermometer temp reading”
» P(Th=High|Te=H) = 0.95
» P(Th=High|Te=L) = 0.1
» Arc Te → Th models possible sensor error
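The three-node Flu, Te, Th network is small enough to query by brute-force enumeration of the joint distribution. The sketch below uses the probabilities on this slide; the function names are illustrative, and real BN packages use more efficient inference algorithms.

```python
from itertools import product

P_FLU = 0.05                            # P(Flu=T)
P_TE_HIGH = {True: 0.4, False: 0.01}    # P(Te=High | Flu=T / Flu=F)
P_TH_HIGH = {True: 0.95, False: 0.1}    # P(Th=High | Te=High / Te=Low)


def joint(flu, te_high, th_high):
    """P(Flu, Te, Th) = P(Flu) P(Te|Flu) P(Th|Te), following the arcs."""
    p = P_FLU if flu else 1 - P_FLU
    p *= P_TE_HIGH[flu] if te_high else 1 - P_TE_HIGH[flu]
    p *= P_TH_HIGH[te_high] if th_high else 1 - P_TH_HIGH[te_high]
    return p


def posterior_flu_given_th(th_high):
    """Diagnostic query P(Flu=T | Th), summing out the hidden node Te."""
    numerator = sum(joint(True, te, th_high) for te in (True, False))
    evidence = sum(joint(flu, te, th_high)
                   for flu, te in product((True, False), repeat=2))
    return numerator / evidence


print(round(posterior_flu_given_th(True), 3))
```

A high thermometer reading raises the belief in flu from the prior of 0.05 to roughly 0.176.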
6
BN inference
Evidence: observation of specific state
Task: compute the posterior probabilities for query
node(s) given evidence.
Figure: four example networks showing diagnostic, predictive, intercausal and mixed inference (nodes: Flu, TB, Y, Te, Th)
7
BN software
Commercial packages: Netica, Hugin, Analytica (all with demo versions)
Free software: Smile, Genie, JavaBayes
See Appendix B, Korb & Nicholson, 2004
Example running Netica software
8
Decision networks
Extension to basic BN for decision making
» Decision nodes
» Utility nodes
EU(Action) = Σ_o p(o|Action,E) × U(o)
» choose action with highest expected utility
Example
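The example shown on the slide is not in the transcript. As a stand-in, here is a minimal sketch of choosing an action by expected utility. The actions, outcomes, probabilities and utilities are all hypothetical, invented only to illustrate the calculation.

```python
def expected_utility(outcome_probs, utilities):
    """EU(Action) = sum over outcomes o of p(o | Action, E) * U(o)."""
    return sum(p * utilities[o] for o, p in outcome_probs.items())


def best_action(outcome_probs_by_action, utilities):
    """Return the action with the highest expected utility."""
    return max(outcome_probs_by_action,
               key=lambda a: expected_utility(outcome_probs_by_action[a], utilities))


# Hypothetical utilities U(o) and outcome distributions p(o | Action, E).
utilities = {"illness": -100.0, "safe_swim": 10.0, "no_swim": 0.0}
outcome_probs_by_action = {
    "keep_beach_open": {"illness": 0.3, "safe_swim": 0.7},
    "close_beach": {"no_swim": 1.0},
}

for action, probs in outcome_probs_by_action.items():
    print(action, expected_utility(probs, utilities))
print(best_action(outcome_probs_by_action, utilities))
```

In a decision network the outcome probabilities would come from BN inference given the evidence E, rather than being typed in by hand.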
9
Elicitation from experts
Variables
» important variables? values/states?
Structure
» causal relationships?
» dependencies/independencies?
Parameters (probabilities)
» quantify relationships and interactions?
Preferences (utilities)
10
Expert Elicitation Process
These stages are done iteratively
Stops when further expert input is no longer
cost effective
Process is difficult and time consuming.
Current BN tools
» inference engine
» GUI
Figure: domain expert, BN expert and BN tools
Next generation of BN tools?
11
Knowledge discovery
There is much interest in automated methods for learning BNs from data
» parameters, structure (causal discovery)
Computationally complex problem, so current
methods have practical limitations
» e.g. limit number of states, require variable
ordering constraints, do not specify all arc
directions
Evaluation methods
12
Knowledge Engineering
for Bayesian Networks (KEBN)
1. Building the BN
» variables, structure, parameters, preferences
» combination of expert elicitation and knowledge discovery
2. Validation/Evaluation
» case-based, sensitivity analysis, accuracy testing
3. Field Testing
» alpha/beta testing, acceptance testing
4. Industrial Use
» collection of statistics
5. Refinement
» Updating procedures, regression testing
13
The KEBN process
14
Quantitative KE process
15
Water Quality for Sydney Harbour
Water Quality for
recreational use
Beachwatch /
Harbourwatch Programs
Bacteria samples used
as pollution indicators
Many variables influence bacterial levels: rainfall, tide, wind, sunlight, temperature, pH, etc.
16
Past studies
Hose et al. used a multidimensional scaling model of Sydney Harbour
» low predictive accuracy; unable to handle the noisy bacteria samples; explained 63% of bacteria variability (Port Jackson)
Ashbolt and Bruno:
» agree with Hose et al, + wind effects, sunlight hours, tide
Crowther et al (UK):
» rainfall, tide, sampling times, sunshine, wind
» Explained 53% of bacteria variability
Other models developed by the USEPA to model
estuaries are:
» QUAL2E – Steady-state receiving water model
» WASP – Time Varying dispersion model
» EFDC – 3D hydrodynamic model
EPA in Sydney interested in a model applying the
causal knowledge of the domain
17
EPA Guidelines
Rainfall: T = Today, Y = Yesterday, DBY = Day Before Yesterday
IF T>4 THEN pollution Likely
ELSE IF T 4 AND Y 4 AND DBY 4 THEN Unlikely
ELSE IF T 4 AND Y 4 AND DBY 4 THEN Unlikely for 24h flushing, but Likely for 48h flushing
ELSE Likely for all other results
18
Stages of Project
2003 Honours project:
» Preparation of EPA data (rainfall only)
» Hand-craft simple networks for rainfall data
» Comparison of hand-crafted networks with range of learners (using Weka software)
2003/04 Summer Vacation project:
» Using CaMML to learn BN on extended data set
19
EPA Data
Database 1:
» E.coli, Enterococci (cfu/100mL), thresholds 150 & 35.
» 60 water samples each year since 1994 at 27 sites in Sydney Harbour.
» Enterococci, E.coli, Raining, Sunny, Drain running, temperature, time of sample, direction of sampling run, date, site name, beach code
Database 2:
» Rainfall readings (mm) at 40 locations around
Sydney
20
Data Preparation
New file format:
Date BeachCode Entc Ecoli D1 D2 D3 D4 D5 D6
D1 = rainfall on day of collection
D6 = rainfall 5 days previously
Rainfall data had many missing entries
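A minimal sketch of how the D1..D6 rainfall columns could be derived for one sample. It assumes the rainfall readings are held in a dictionary keyed by date and that a missing reading is kept as `None`; both the data layout and the function name are assumptions, not the project's actual preparation code.

```python
from datetime import date, timedelta


def rainfall_features(sample_date, daily_rain_mm, days=6):
    """Return [D1, ..., D6]: D1 is rainfall on the collection day,
    D6 is rainfall 5 days previously. Missing readings stay None."""
    return [daily_rain_mm.get(sample_date - timedelta(days=k))
            for k in range(days)]


# Hypothetical readings (mm) with gaps, as in the real rainfall data.
rain = {
    date(2003, 1, 10): 5.0,
    date(2003, 1, 9): 0.0,
    date(2003, 1, 7): 12.5,
}
print(rainfall_features(date(2003, 1, 10), rain))
```

Keeping gaps explicit as `None` leaves the choice of how to treat missing entries (drop the case, impute, or pass as missing evidence) to a later step.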
21
Rainfall BNs
Hand-crafted BNs to predict bacteria using
rainfall only
Started with deterministic BN that
implemented EPA guidelines
Looked at varying number of previous days
rainfall for predicting bacteria
Investigated various discretisations of
variables
22
EPA Guidelines as BN
23
Davidson BN: 1 day rainfall
24
Davidson BN: 6 days rainfall
25
Evaluation
Split data 50-50 training/testing
10-fold cross-validation
Measures: Predictive Accuracy & Information Reward
Also looked at ROC curves (correct classification vs
false positives)
Using Weka: Java environment for machine learning
tools and techniques
Small data: 4 beaches: Chinamans, Edwards,
Balmoral (all middle harbour), Clifton (Port Jackson)
Using 6 days rainfall averaged from all rain gauges
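For readers unfamiliar with 10-fold cross-validation, the sketch below shows how cases can be partitioned so that each case is used for testing exactly once. It is a plain-Python illustration with a hypothetical function name, not Weka's implementation, and it assigns cases to folds in order without shuffling or stratification.

```python
def k_fold_splits(n_cases, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_cases))
    for fold in range(k):
        test = indices[fold::k]          # every k-th case, offset by the fold number
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


splits = list(k_fold_splits(25, k=10))
for train, test in splits[:2]:
    print(len(train), test)
```

A learner is trained on each `train` set and scored on the matching `test` set, and the scores are averaged over the folds.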
26
Predictive accuracy
For each joint observation in the sample:
» Add any available evidence for the other nodes
» Update the network
» Use the value with highest probability as the predicted value
» Compare the predicted value with the actual value
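The procedure above can be sketched as follows, assuming the network update has already been done and its result for each case is available as a dictionary of posterior probabilities for the target node. The state names and numbers are hypothetical.

```python
def predictive_accuracy(posteriors, actual_values):
    """Fraction of cases where the most probable state equals the actual state."""
    correct = 0
    for distribution, actual in zip(posteriors, actual_values):
        predicted = max(distribution, key=distribution.get)  # highest-probability state
        if predicted == actual:
            correct += 1
    return correct / len(actual_values)


# Hypothetical posteriors for a two-state bacteria node, one per test case.
posteriors = [
    {"low": 0.7, "high": 0.3},
    {"low": 0.4, "high": 0.6},
    {"low": 0.9, "high": 0.1},
]
actual_values = ["low", "high", "high"]
print(predictive_accuracy(posteriors, actual_values))
```

Note that this measure ignores how confident each prediction was: a posterior of 0.51 and one of 0.99 count the same.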
27
Information Reward
Rewards calibration of probabilities
Zero reward for just reporting priors
Unbounded below for a bad prediction
Bounded above by a maximum that depends
on priors
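IR = 0
For each state i:
  If i == correct state
    IR += log ( p'[i] / p[i] )
  else
    IR += log ( (1 - p'[i]) / (1 - p[i]) )
where p[i] is the prior probability of state i and p'[i] is the probability the model assigns to it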
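A minimal Python sketch of an information reward with the properties listed on this slide. It assumes the reward for the true state is log(p'/p) and for every other state log((1 - p')/(1 - p)), where p is the prior and p' the model's probability; that form, the base-2 logarithm, and the function names are assumptions chosen to match the listed properties, and priors are assumed to lie strictly between 0 and 1.

```python
import math


def _log2(x):
    """log2 that returns -inf at zero, so a certain-but-wrong prediction is unbounded below."""
    return math.log2(x) if x > 0 else float("-inf")


def information_reward(priors, predicted, true_state):
    """Information reward for one case, summed over the states of the target node."""
    reward = 0.0
    for state, prior in priors.items():
        p = predicted[state]
        if state == true_state:
            reward += _log2(p / prior)
        else:
            reward += _log2((1 - p) / (1 - prior))
    return reward


priors = {"high": 0.25, "low": 0.75}   # hypothetical priors
print(information_reward(priors, priors, "high"))                      # reporting the priors
print(information_reward(priors, {"high": 0.5, "low": 0.5}, "high"))   # better than prior
print(information_reward(priors, {"high": 0.0, "low": 1.0}, "high"))   # certain and wrong
```

Averaging this quantity over all test cases gives a single score per learner; unlike predictive accuracy, it penalises over-confident wrong predictions.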
28
Evaluation: Weka learners
Naïve Bayes
J48 (version of C4.5)
CaMML: causal BN learner, using MML metric
AODE
TAN
Logistic
“Davidson” BN – 6 days previous rainfall
» With and without adaptation of parameters (case learning)
“Guidelines” BN – 3 days previous rainfall
» Deterministic rule
» With adaptation of parameters (case learning)
29
Results
Learner            Pred Accuracy   Info Reward
Prior              0.758            0
Naïve Bayes        0.760           -0.729
J48                0.791            0.125
CaMML              0.764            0.122
AODE               0.769            0.128
TAN                0.775           -1.459
Logistic           0.787            0.128
Davidson           0.757           -0.272
Davidson CL        0.776            0.033
Guidelines (det)   0.530           -2.318
Guidelines CL      0.776            0.058
30
Results: ROC Curves
31
Results: area under ROC Curves
Learner          Area under ROC
Perfect          0.999
AODE             0.733
Logistic         0.729
CaMML            0.718
J48              0.689
Naïve            0.679
Davidson CL      0.645
Guidelines CL    0.643
Guidelines       0.637
Davidson         0.620
TAN              0.561
Prior            0.496
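For reference, an area under the ROC curve can be computed from a learner's predicted probabilities by sweeping a threshold and integrating with the trapezoid rule. The sketch below is a generic illustration with hypothetical function names and data, not the Weka routine used to produce the figures above; it assumes labels are 1 for a pollution event and 0 otherwise, with at least one of each.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) points, one per distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every case tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical predicted probabilities of "high bacteria" and true labels.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 1, 0, 0]
print(area_under_curve(roc_points(scores, labels)))
```

A learner that gives every case the same score produces the diagonal and an area of 0.5, which is close to the value reported for the Prior.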
32
Results: ROC Curves
For ~20% false-positive, can get ~60% of events
For ~45% false-positive, can get ~75% of events
For ~60% false-positive, can get ~80% of events
Implications?
» Using current guidelines, accepting a ~45% false-positive rate gives a ~60% hit rate
» Can either keep that false-positive rate and get an extra ~15% of events
» Or keep the same hit rate at about half the false-positive rate
33
Example of CaMML BN
34
Future Directions?
35
36
Early BN-related projects
DBNs for discrete monitoring (PhD, 1992)
Approximate BN inference algorithms based
on a mutual information measure for
relevance (with Nathalie Jitnah, 1996-1999)
Plan recognition: DBNs for predicting users'
actions and goals in an adventure game (with
David Albrecht, Ingrid Zukerman, 1997-2000)
DBNs for ambulation monitoring and fall
diagnosis (with biomedical engineering, 1996-2000)
Bayesian Poker (with Kevin Korb, 1996-2003)
37
Knowledge Engineering with BNs
Seabreeze prediction: joint project with
Bureau of Meteorology
» Comparison of existing simple rule, expert elicited
BN, and BNs from Tetrad-II and CaMML
ITS for decimal misconceptions
Methodology and tools to support knowledge
engineering process
» Matilda: visualisation of d-separation
» Support for sensitivity analysis
Written a textbook:
» Bayesian Artificial Intelligence, Kevin B. Korb and
Ann E. Nicholson, Chapman & Hall / CRC, 2004.
www.csse.monash.edu.au/bai/book
38
Current BN-related projects
BNs for Epidemiology (with Kevin Korb, Charles Twardy)
» ARC Discovery Grant, 2004
» Looking at Coronary Heart Disease data sets
» Learning hybrid networks: continuous and discrete variables.
BNs for supporting meteorological forecasting process
(DSS’2004) (with PhD student Tal Boneh, K. Korb, BoM)
» Building domain ontology (in Protege) from expert elicitation
» Automatically generating BN fragments
» Case studies: Fog, hailstorms, rainfall.
Ecological risk assessment
» Goulburn Water, native fish abundance
» Sydney Harbour Water Quality
39
Open Research Questions
Methodology for combining expert elicitation
and automated methods
» expert knowledge used to guide search
» automated methods provide alternatives to be
presented to experts
Evaluation measures and methods
» may be domain dependent
Improved tools to support elicitation
» Reduce reliance on BN expert
» e.g. visualisation of d-separation
Industry adoption of BN technology
40