
Machine Learning to Explore Fish Species Interaction in the Northern Gulf of St Lawrence
Dr Allan Tucker
Centre for Intelligent Data Analysis
Brunel University
West London
UK
Talk Outline
• Introduce myself and research group
• Introduce Machine Learning
• Describe Bayesian network models
• Document some preliminary results on fish population data
• Conclusions

Who Am I?
• Research Lecturer at Brunel University, West London
• Member of Centre for IDA (est 1994)

What is the Centre for IDA?
• Over 25 members (academics, postdocs, and PhDs) with diverse backgrounds (e.g. maths, statistics, computing, biology, engineering)
• Over 140 journal publications & a dozen research council grants since 2001
• Many collaborating partners in UK, Europe, China and USA
• Bi-annual symposia in Europe

Some Previous Work in
Machine Learning and Temporal Analysis
• Oil Refinery Models
• Medical Data: Retinal (Visual Field)
• Forecasting, Explanation, Screening
• Bioinformatics: Gene Clusters, Gene Regulatory Networks
Part 1
What is Machine Learning?
What is Machine Learning?
(and why not statistics?)
• Data oriented
• Extracting useful info from data
• As automated as possible
• Useful when lots of data and little theory
• Making predictions about the future
What Can we do with ML?
• Classification and Clustering
• Feature Selection
• Prediction and Forecasting
• Identifying Structure in Data

E.g. Classification
• Given some labelled data (supervised)
• Build a “model” to allow us to classify other unlabelled data
• e.g. a doctor diagnosing a patient based upon previous cases

Classification e.g. medical
Scatterplot of patients
• 2 variables: measurement of expression of 2 genes
[Figure: scatterplot with axes NM_013720 and NM_008695; points labelled Diseased or Control]
Classification
How do we classify them?
Nearest Neighbour / Linear / Complex Fn?

[Figure: scatterplot with axes NM_013720 and NM_008695; points labelled Diseased or Control]
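The "Nearest Neighbour" option mentioned on this slide can be sketched as follows. This is a minimal illustration, not the method used in the talk: the function name `knn_predict` and the expression values are made up for the example and are not the data from the scatterplot.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    train_x = np.asarray(train_x, dtype=float)
    dists = np.linalg.norm(train_x - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_y)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Made-up expression values for two genes (illustrative only).
train_x = [[-0.6, 0.10], [-0.5, 0.15], [-0.7, 0.05],
           [-0.1, -0.20], [0.0, -0.15], [-0.2, -0.25]]
train_y = ["Diseased", "Diseased", "Diseased", "Control", "Control", "Control"]

prediction = knn_predict(train_x, train_y, query=[-0.55, 0.12], k=3)
print(prediction)
```

A linear or more complex decision function would replace the distance vote with a fitted boundary; the nearest neighbour rule needs no fitting step at all.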
Classification
Trivial case with Cod and Shrimp Data
[Figure: scatterplot of Cod against Shrimp; points labelled Pre 1990 or Post 1990]
The Data

• Northern Gulf (region a)
• Two ships (Needler and Hammond) combined by normalising according to overlap year
• Multivariate Spatial Time Series (short)
• Missing Data
Background
• Northern Gulf considered to be one ecosystem / fish community
• Quite heavily fished until about 1990
• Most fish populations have collapsed since
• Some say that the community has moved to an alternative stable state and is unlikely to come back to a cod-dominated community without some chance event beyond human control
• Lots of speculation:
  • cold water
  • large increases in population of predators
• Examine nature and strength of interactions between species in the two periods
• Ask “what if?” questions:
  • For other parts of the community to recover, would we need cod to have X strength of interaction with Y number of other species?

ML for Northern Gulf Data

• Network building: knowledge and data of interactions
• Feature Selection for Classification of relevant species to the cod collapse
• State Space / Dynamic models for predicting populations
• Hidden variable analysis

Part 2
Bayesian Networks for
Machine Learning
Bayesian Networks
• Method to model a domain using probabilities
• Easily interpreted by non-statisticians
• Can be used to combine existing knowledge with data
• Essentially use independence assumptions to model the joint distribution of a domain

Bayesian Networks

• Simple 2 variable Joint Distribution P(Collapse1, Collapse2):

               Species2    ¬ Species2
  Species1       0.89         0.01
  ¬ Species1     0.03         0.07

• Can use it to ask many useful questions
• But requires k^N probabilities (N variables, each with k states)
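As a small worked example, the queries that a joint distribution supports can be computed directly from the four numbers on this slide. The variable names and the helper `joint_table_size` are introduced here for illustration only.

```python
import numpy as np

# Rows: Species1 / not Species1; columns: Species2 / not Species2 (values from the slide).
joint = np.array([[0.89, 0.01],
                  [0.03, 0.07]])

p_s1 = joint[0].sum()                # marginal P(Species1)
p_s2 = joint[:, 0].sum()             # marginal P(Species2)
p_s1_given_s2 = joint[0, 0] / p_s2   # conditional P(Species1 | Species2)

def joint_table_size(k, n):
    """Number of entries in a full joint table over n variables with k states each."""
    return k ** n

print(p_s1, p_s2, p_s1_given_s2)
print(joint_table_size(2, 20))
```

The table grows exponentially with the number of variables, which is why a full joint distribution quickly becomes impractical for a community with many species.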

Bayesian Network for Toy Domain
Structure: SpeciesA → SpeciesC ← SpeciesB; SpeciesC → SpeciesD; SpeciesC → SpeciesE

P(A) = .001
P(B) = .002

A  B  P(C)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

C  P(D)
T  .70
F  .01

C  P(E)
T  .90
F  .05
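A query on this toy network can be answered by brute-force enumeration, sketched below. The probabilities are those on the slide; the structure (A and B are parents of C, and C is the parent of D and E) is read off from which variables each table is conditioned on, and the function names are hypothetical. Real BN software uses more efficient inference than summing over every combination.

```python
from itertools import product

P_A = 0.001
P_B = 0.002
P_C = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(C=T | A, B)
P_D = {True: 0.70, False: 0.01}                      # P(D=T | C)
P_E = {True: 0.90, False: 0.05}                      # P(E=T | C)

def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """Joint probability as the product of the five local tables."""
    return (bern(P_A, a) * bern(P_B, b) * bern(P_C[(a, b)], c)
            * bern(P_D[c], d) * bern(P_E[c], e))

def posterior_a(d, e):
    """P(A = True | D = d, E = e), summing out B and C."""
    num = sum(joint(True, b, c, d, e) for b, c in product([True, False], repeat=2))
    den = sum(joint(a, b, c, d, e) for a, b, c in product([True, False], repeat=3))
    return num / den

print(round(posterior_a(True, True), 3))
```

The network needs 10 numbers here instead of the 31 free entries of the full joint table over five binary variables, which is the saving the independence assumptions buy.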
Bayesian Networks

• Bayesian Network Demo: [Species_Net]
• Use algorithms to learn structure and parameters from data
• Or build by hand (priors)
• Also continuous nodes (density functions)

Informative Priors
• To build BNs we can also use prior structures and probabilities
• These are then updated with data
• Priors are usually uniform (equal probability)
• Informative priors are used to incorporate existing knowledge into BNs
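For a single probability in a BN table, updating a prior with data can be sketched with a Beta prior on a binary outcome. This is an assumed, simplified illustration: the counts, the prior strengths and the function name `posterior_mean` are hypothetical and do not come from the talk.

```python
def posterior_mean(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Bernoulli parameter under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

# Hypothetical counts: an event was observed in 3 of 4 surveyed years.
uniform = posterior_mean(3, 1)                              # Beta(1, 1), the uniform prior
informed = posterior_mean(3, 1, prior_a=1.0, prior_b=9.0)   # expert expects roughly 0.1

print(uniform, informed)
```

With only four observations the informative prior still dominates the estimate; as more data arrive, the two estimates converge.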

Bayesian Networks for Classification
& Feature Selection
• Node that represents the class label attached to the data

Dynamic Bayesian Networks for
Forecasting
• Nodes represent variables at distinct time slices
• Links between nodes over time
• Can be used to forecast into the future
[Species_Dynamic_Net]
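The idea of links between time slices can be illustrated with a first-order linear autoregression, which may be read as the simplest two-slice DBN with linear-Gaussian nodes and every cross-time link present. This is a stand-in sketch, not the DBN learning used in the talk; the transition matrix and series are made up.

```python
import numpy as np

def fit_transition(series):
    """Least-squares estimate of W in x[t+1] ~ x[t] @ W for a (time, variables) array."""
    past, future = series[:-1], series[1:]
    W, *_ = np.linalg.lstsq(past, future, rcond=None)
    return W

def forecast(last, W, steps):
    """Roll the fitted transition forward `steps` time slices from the state `last`."""
    out, x = [], np.asarray(last, dtype=float)
    for _ in range(steps):
        x = x @ W
        out.append(x)
    return np.array(out)

# Made-up two-variable series generated from a known transition (illustrative only).
W_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
series = [np.array([1.0, 2.0])]
for _ in range(10):
    series.append(series[-1] @ W_true)
series = np.array(series)

W_hat = fit_transition(series)
print(forecast(series[-1], W_hat, steps=3))
```

Entries of the fitted matrix play the role of link strengths from each variable at time t to each variable at time t+1; a learnt DBN would keep only a subset of these links.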

Hidden Markov Models
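• Like a DBN but with hidden nodes:
[Diagram: hidden nodes H(T-1) → H(T); each hidden node H emits the observation O at the same time slice, O(T-1) and O(T)]
• Often used to model sequences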
Typical Algorithms for HMMs
• Given an observed sequence and a model, how do we compute the probability of the sequence given the model?
• Given the observed sequence and the model, how do we choose an optimal hidden state sequence?
• How do we adjust the model parameters to maximise the probability of the observed sequence given the model?
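The first two questions are commonly answered with the forward algorithm and the Viterbi algorithm, sketched below for a discrete HMM. The two-state model and its probabilities are hypothetical values chosen for illustration, not parameters from the talk. The third question is usually handled with Baum-Welch (EM), which is not shown here.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Probability of the observed sequence given the model (forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi(pi, A, B, obs):
    """Most probable hidden state sequence for the observations."""
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A)   # scores[i, j]: best path ending in i, then i -> j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(B[:, o])
    path = [int(delta.argmax())]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    return path[::-1]

# Hypothetical model: hidden state 0 = "cod-dominated", 1 = "shrimp-dominated";
# observation 0 = "high cod catch", 1 = "low cod catch".
pi = np.array([0.8, 0.2])                 # initial state probabilities
A = np.array([[0.9, 0.1],                 # A[i, j] = P(state j at t+1 | state i at t)
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],                 # B[i, o] = P(observation o | state i)
              [0.3, 0.7]])
obs = [0, 0, 1, 1, 1]

print(forward(pi, A, B, obs))
print(viterbi(pi, A, B, obs))
```

For long sequences the forward recursion is normally run in log space or with rescaling to avoid numerical underflow.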

Summary
• Different learning tasks can be used to solve real world problems
• Machine Learning techniques useful when lots of data and lots of gaps in knowledge
• Bayesian Networks: probabilistic framework that can perform most key ML tasks
• Also transparent & can incorporate expert knowledge

Part 3
Some Preliminary Results on
Northern Gulf Data
Expert Knowledge
• Ask marine biologists to generate matrices of expected relationships
• Can be used to compare against models learnt from data
• Also to be used as priors to improve model quality

Results: Expert networks
Results: Data networks
(BN from correlation)

85% conf. imputed from 70% data
[Figure: learnt network; node labels: Witch Flounder, (Eel pout / Ocean Sun Fish), Cod, Haddock, (Silver Hake), (Lumpfish), Shrimp, (Atlantic soft pout / Bristlemouths)]
Warning: data quality, spurious relations
Example DBN
Let’s look at an example DBN
[NGulfDynamic - range]
• Structure encoded by knowledge
• Updated by data
• Explore with queries
• Supported by previous knowledge:
“In the Northern Gulf of St. Lawrence, cod (code 438) and redfish (792, 793, 794, 795, 796) collapsed to very low levels in the mid 1990s. Subsequently the shrimp (8111) increased greatly in biomass so one will see this signal in the data. It is hypothesised that these are exclusive community states where you never get high abundance of both at the same time owing to predatory interactions.”
Feature Selection
• Given that we know that from 1990 the cod population collapsed
• Can we apply Feature Selection to see what species characterise this collapse?
• [Learn BN and apply CV]
Results 7: Feature Selection with Bootstrap
[Figure: Filter method using Log Likelihood; species codes ranked along one axis, log likelihood on the other (ticks from -35 to -47)]
[Figure: Wrapper method using BNs; species codes ranked along one axis, values from 0 to 0.8 on the other; Redfish annotated]
Leading species codes in the two rankings, as extracted: 890, 447, 441, 449, 90, ... and 441, 447, 890, 12, 90, 449, ...
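The idea of ranking species with a filter score and checking stability with the bootstrap can be sketched as follows. Note the swap: the talk's filter uses log likelihood and its wrapper uses BNs, whereas this sketch uses a much simpler standardised mean difference as the score. All names and the tiny data set are hypothetical.

```python
import numpy as np

def separation_score(values, labels):
    """Absolute difference in class means divided by the overall standard deviation."""
    a, b = values[labels == 0], values[labels == 1]
    return abs(a.mean() - b.mean()) / (values.std() + 1e-12)

def bootstrap_selection_frequency(X, y, top_k=1, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each feature ranks in the top_k."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros(d)
    done = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        if len(np.unique(y[idx])) < 2:        # skip resamples missing a class
            continue
        scores = [separation_score(X[idx, j], y[idx]) for j in range(d)]
        counts[np.argsort(scores)[::-1][:top_k]] += 1
        done += 1
    return counts / max(done, 1)

# Made-up data: 20 "years", class 0 = before the collapse, class 1 = after.
y = np.array([0] * 10 + [1] * 10)
X = np.column_stack([
    y * 5.0 + np.tile([0.0, 0.1], 10),   # feature 0 tracks the class label
    np.tile([0.0, 1.0], 10),             # feature 1 is unrelated to the class
])

freq = bootstrap_selection_frequency(X, y, top_k=1)
print(freq)
```

Features that are selected in most resamples are less likely to be artefacts of a few unusual years, which matters for a short time series.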
Results: Feature Selection
Change in correlation of interactions between cod and high ranking species before and after 1990:
[Figure: bar chart of pre 1990 correlation and post 1990 correlation (axis from -0.8 to 0.8); species labels in extracted order: white hake, thorny skate, sea raven, haddock, white hake, silver hake, witch flounder, redfish*, shrimp*]
Dynamic Models
• Given that the data is a time-series
• Can we build dynamic models to forecast future states?
• Can we use HMM to classify the time-series?

Multivariate Time Series
• N Gulf is a process measured over time
• Autocorrelation Function, ACF (here cod)
• Cross Correlation Function, CCF (here hake to cod)
[Figure: ACF; correlation against time lag (0 to 14)]
[Figure: CCF; correlation against time lag (-6 to 6)]
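The autocorrelation and cross correlation at a given lag can be sketched as below. This version simply takes the Pearson correlation of the overlapping parts of the two series, which is an assumption made for brevity; standard ACF and CCF estimators normalise by the full-series mean and variance instead. The series are synthetic.

```python
import numpy as np

def cross_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] for an integer lag (may be negative)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if lag < 0:
        return cross_correlation(y, x, -lag)
    if lag == 0:
        a, b = x, y
    else:
        a, b = x[:-lag], y[lag:]
    return float(np.corrcoef(a, b)[0, 1])

def autocorrelation(x, lag):
    """Correlation of a series with a lagged copy of itself."""
    return cross_correlation(x, x, lag)

# Synthetic example: y repeats x two time steps later.
x = np.sin(np.arange(30) / 3.0)
y = np.roll(x, 2)

print(autocorrelation(x, 1))
print(cross_correlation(x, y, 2))
```

A peak in the cross correlation at a positive lag suggests that changes in the first series tend to precede changes in the second, which is how a threshold such as CCF > 0.3 with a maximum lag can be used to propose candidate links.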
Results 3: Fitting Dynamic Models
HMM Expert with CCF > 0.3 (maxlag = 5)
[Figure: two time-series panels (x axis 0 to 25) and one scatter panel (axes -2 to 2)]
LSS = 8.3237
Results 3: Fitting Dynamic Models
Learning DBN from CCF data
[Figure: two time-series panels (x axis 0 to 25) and one scatter panel (axes -2 to 2)]
LSS = 5.0106
Fluctuation: Early Indicator of Collapse?
Results 4: Examining DBN Net
Data only Dynamic Links:
[Figure: network with nodes Hakes, Redfish, Cod, Haddock, Witch Flounder, White Hake, Shrimp, Thorny Skate]
Results 5: Fitting Dynamic Models
Learning DBN from Expert biased CCF
data CCF > 0.5 (maxlag=5)
[Figure: two time-series panels (x axis 0 to 25) and one scatter panel (axes -2 to 2)]
LSS = 6.1326
Results 6: Examining DBN Net
Data Biased Expert Dynamic Links:
[Figure: network with nodes Cod, Herring, Witch Flounder, Mackerel / Capelin]
Results 7: Linear Dynamic System
Instead of a discrete hidden state, a continuous hidden variable:
[Figure: inferred continuous variable (axis from -2 to 6) against time step (0 to 25); annotated points: 1984, 1987 (white fur ban), 1991, 1997 (white fur hunt)]
Could be interpreted as a measure of fishing? Predator population (e.g. seals)? Water temperature?
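A linear dynamic system with one continuous hidden variable can be sketched with a scalar Kalman filter. The random-walk state model, the noise variances and the function name are assumptions made for this illustration; the talk does not specify the model that was fitted.

```python
import numpy as np

def kalman_filter_1d(obs, process_var=0.1, obs_var=1.0, init_mean=0.0, init_var=1.0):
    """Filtered means of a scalar random-walk hidden state observed with Gaussian noise."""
    mean, var = init_mean, init_var
    means = []
    for z in obs:
        var = var + process_var            # predict: the hidden state drifts as a random walk
        gain = var / (var + obs_var)       # Kalman gain: how much to trust the new observation
        mean = mean + gain * (z - mean)    # update the estimate towards the observation
        var = (1.0 - gain) * var
        means.append(mean)
    return np.array(means)

# Made-up observations that settle at a constant level.
observations = np.full(50, 2.0)
hidden_estimate = kalman_filter_1d(observations)
print(hidden_estimate[-1])
```

The filtered trajectory is the kind of curve that is then inspected against known events to suggest what the hidden variable might represent.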
Conclusions
• Hopefully conveyed the broad idea of machine learning
• Shown how it can be used to help analyse data like fish population data
• Potentially applicable to other data studied here at MLI

Potential Projects
1. Spatio-Temporal Analysis: Use Spatio-Temporal BNs to model fish stock data. Nodes would represent species in specific “regions”
2. Combining Expert Knowledge and Data for improved Prediction
3. Looking for Un/Stable States and the factors that influence them
4. Functional Analysis of Data from Multiple Locations
E.g. Spatial Analysis
• Spatial Bayesian Network Analysis
• [NGulfCodSpatial]

E.g. Functional Models
• Functional Models to assimilate data from different oceans...

Acknowledgements:
Daniel Duplisea
Panayiota Apostolaki
Any Questions?