Transcript BRT.pres.Villeneuve

Boosted Regression Trees
A method to explore biology-environment relationships
Sophie Mormede, Matt Pinkerton
National Institute of Water and Atmospheric
Research, Wellington, NZ
May 2010
Two main uses of BRT
• to investigate the ecological dependence
of a species on the environment
• to determine "habitat preference" in order
to extrapolate patchy biological data to a
larger domain
An example
• WHAT: Predict toothfish and bycatch species
distributions over the Ross Sea (88.1 & 88.2A–B)
• WHY:
– layers for bioregionalisation
– input to systematic conservation planning
– to investigate overlap of TOA and prey species
– to consider potential changes in species distribution
under climate change scenarios
– to help in estimating biomass from the small number
of research trawls (WGR)
• HOW: GLM / GAM (not very satisfactory), BRT,
Generalised Dissimilarity Modelling (GDM), …
Project outcomes so far
• Predictions seem to make sense, and so do the
confidence intervals
• Quality of depth data is critical (use GEBCO08,
modified with fishing depth)
• Still need to validate models on a different
area (88.2E?, Kerguelen?)
BRT – what is it all about then?
• Regression Tree:
– Recursive binary splits
– Stopping criterion
– Allows interactions natively if wanted (tree complexity)
• Boosting = forward stagewise model fitting:
– Fit a truncated tree (1-10 splits)
– Compute the fitted values and residuals
– Fit and add a new tree to the residuals, repeating
many times (number of trees > 1000)
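The loop described above can be sketched in a few lines. This is a minimal illustration only, assuming a single predictor, squared-error loss and one-split trees ("stumps"); the function names, the learning rate and the toy data are hypothetical and not taken from the presentation.

```python
# Minimal sketch of boosting with one-split regression trees ("stumps").
# Assumptions: one predictor, squared-error loss, made-up data.

def fit_stump(x, r):
    """Find the threshold on x that best splits the residuals r (least squares)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return t, ml, mr


def predict_stump(stump, xi):
    """Return the mean of the side of the split that xi falls on."""
    t, ml, mr = stump
    return ml if xi <= t else mr


def boost(x, y, n_trees=200, learning_rate=0.1):
    """Forward stagewise fitting: each new stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    fitted = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        fitted = [fi + learning_rate * predict_stump(stump, xi)
                  for fi, xi in zip(fitted, x)]
    return base, learning_rate, stumps


def predict(model, xi):
    """Sum the shrunken contributions of all stumps on top of the base value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


# Toy step-shaped response (made-up numbers).
x = list(range(10))
y = [1.0] * 5 + [3.0] * 5
model = boost(x, y)
```

Each stump only removes a fraction (the learning rate) of the remaining residual, which is why many trees are needed.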
More about BRT
• Boosting with stochasticity:
– At each step a random proportion of the dataset
(the bag fraction) is selected for fitting, which
improves model performance
• Cross validation (CV):
– To avoid overfitting, test the model on withheld
parts of the data; this also estimates the degree of overfitting
• You can bootstrap BRTs (I used 1000
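As a rough sketch of the two sampling ideas on this slide, the helpers below draw a bag-fraction subset for one boosting step and split rows into cross-validation folds. The function names, the bag fraction of 0.5 and the 10 folds are illustrative assumptions, not settings reported in the presentation.

```python
import random

# Hypothetical helpers illustrating bag-fraction sampling and k-fold CV splits.

def bag_sample(n_rows, bag_fraction, rng):
    """Row indices for one boosting step: a random subset drawn without replacement."""
    k = max(1, int(round(bag_fraction * n_rows)))
    return sorted(rng.sample(range(n_rows), k))


def kfold_indices(n_rows, n_folds, rng):
    """Shuffle the row indices and deal them into n_folds withheld folds."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return [sorted(idx[i::n_folds]) for i in range(n_folds)]


rng = random.Random(0)
bag = bag_sample(100, 0.5, rng)      # rows used to fit one tree
folds = kfold_indices(100, 10, rng)  # each fold is withheld once

for test_rows in folds:
    withheld = set(test_rows)
    train_rows = [i for i in range(100) if i not in withheld]
    # Fit on train_rows and score on test_rows here (model fitting omitted).
```

The gap between the fit on the training rows and the fit on the withheld rows gives an indication of how much the model overfits.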
Pros of BRT
• Copes with NAs
• Copes with non-normally distributed
environmental variables (no transforms)
• Copes with outliers
• Allows multiple levels of interactions
• Less likely to overfit than GLM
• 20-30% improvement of fits compared with
• Runs in R
Cons of BRT
• Cons of BRT
– Does not give smooth / monotonic responses
– Still some overfitting – need to be careful
– Slow when using bootstrapping
• Cons of any prediction method
– Only as good as the environmental layers
– Predict only in the domain we have data for
(need to mask other areas)
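One simple way to mask areas outside the sampled domain is a per-variable range check, sketched below. This is an assumed approach for illustration, not necessarily the masking used in the project; the variable names and values are made up.

```python
# Sketch: mask predictions for grid cells whose environment lies outside
# the range covered by the sampled sites (per-variable min/max envelope).

def envelope(samples):
    """Min and max of each environmental variable over the sampled sites."""
    names = samples[0].keys()
    return {n: (min(s[n] for s in samples), max(s[n] for s in samples))
            for n in names}


def in_domain(cell, env):
    """True if every variable of the cell falls inside the sampled range."""
    return all(lo <= cell[n] <= hi for n, (lo, hi) in env.items())


def mask_predictions(cells, predictions, env):
    """Replace predictions outside the sampled domain with None."""
    return [p if in_domain(c, env) else None
            for c, p in zip(cells, predictions)]


# Made-up sampled sites and prediction cells.
samples = [
    {"depth": 500, "temp": 1.0},
    {"depth": 1500, "temp": -0.5},
    {"depth": 1000, "temp": 0.2},
]
env = envelope(samples)
cells = [{"depth": 800, "temp": 0.0}, {"depth": 3000, "temp": 0.1}]
masked = mask_predictions(cells, [0.4, 0.9], env)
```

The second cell is deeper than any sampled site, so its prediction is masked.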
BRT process
• Optimise BRT setup (which variables, how
many interactions, based on deviance)
• Run full models and bootstraps
• Run reduced models with only variables
that were significant
• Bootstrap predictions based on reduced
model, and calculate CI
• Plot
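The bootstrap and CI steps can be sketched with a percentile interval. The sketch below uses a trivial stand-in model (the sample mean) in place of a fitted BRT so that it stays self-contained; the function names, the data, the 1000 resamples and the 95% level are illustrative assumptions.

```python
import random

# Sketch of a percentile bootstrap confidence interval for a prediction.
# mean_model is a stand-in for "refit the reduced BRT and predict".

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_ci(data, fit_and_predict, n_boot=1000, alpha=0.05, seed=0):
    """Resample the data with replacement, refit, and take percentile bounds."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        preds.append(fit_and_predict(resample))
    preds.sort()
    return percentile(preds, alpha / 2), percentile(preds, 1 - alpha / 2)


def mean_model(sample):
    """Stand-in model: predicts the mean of the training sample."""
    return sum(sample) / len(sample)


# Made-up catch rates (fish per hook).
catch_rates = [0.10, 0.25, 0.05, 0.40, 0.30, 0.15, 0.20, 0.35]
lower, upper = bootstrap_ci(catch_rates, mean_model)
```

With a real BRT each resample requires a full refit, which is why bootstrapping is slow.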
Back to the example
environmental variables we used
• Bathymetry (GEBCO 2008, modified for fishing depth)
• Summer chlorophyll a (remote sensing)
• Ice15 and ice85 (satellite data) – not used
• Rugosity (GEBCO08)
• Near bottom current speed, temperature and
salinity (HIGEM circulation model)
• Use only variables that make biological sense!
Response variables
• For each species, predict proportion of
hooks that caught a fish
– Akin to binomial per hook
• Transform to normalise data
– Y = arcsin [ sqrt (fish per hook) ]
• Predict with BRT using a Gaussian error distribution
• Also fit a binomial (presence/absence) model for all
species but toothfish (only 5% null catch)
• Could also do fish per line
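The transform above and its inverse can be written as follows. The function names are hypothetical; the clamping in the back-transform is an added assumption, since a Gaussian model can predict values slightly outside the range of the transform.

```python
import math

# Arcsine square-root transform for proportions, and its inverse.

def arcsine_sqrt(p):
    """Y = arcsin(sqrt(p)) for a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))


def back_transform(y):
    """Invert the transform: p = sin(y)^2, with y clamped to [0, pi/2]."""
    y = min(max(y, 0.0), math.pi / 2)
    return math.sin(y) ** 2


# Example: a quarter of the hooks caught a fish.
y_quarter = arcsine_sqrt(0.25)
p_quarter = back_transform(y_quarter)
```

Predictions made on the transformed scale need the back-transform before they can be read as fish per hook.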
Example - TOA prediction
preliminary results
Another example – Oithona similis
Pinkerton et al. (2010)
CPR database
Oithona similis
The most abundant animal in the world?
Last example – species richness
Leathwick et al. (2006)
Other methods to consider
Generalised Dissimilarity Modelling
• Generalised Dissimilarity Modelling (GDM):
Multivariate response variable
• Pros
– predict communities based on environmental
variables (multiple species analysed)
– Classification part of the process
• Cons
– No bootstrapping
– How many species??
• Classifications (clusters): separates areas
based on layers (environment, biology etc)
• Options
– Use biology layers from BRT?
– Use environmental layers too? (double-dipping?)
– Use GDM directly for predictions and
• Number of classes…