Bayesian Model Averaging
Isabelle Bichindaritz
University of Washington
Institute of Technology
Tacoma, WA, USA
Ecole des Hautes Etudes
en Santé Publique
Département Infobiostat
Rennes, France
Purpose of this Talk
Once upon a time …
There was biology (~1800), and
There were computers (~1920)
Of their common interests was born bioinformatics (~1979) …
Question:
How can CBR contribute to bioinformatics research?
An example from microarray data analysis
ICCBR '10
NCBI, 2004
Bioinformatics Challenges
Frequent tasks in bioinformatics
Similarity search in genetic sequences
Microarray data analysis
Macromolecule shape prediction
Evolutionary tree construction
Gene regulatory network mining
Microarray data analysis
Microarrays are made from a collection of purified DNAs. A drop
of each type of DNA in solution is placed onto a specially prepared glass microscope slide by an arraying machine.
Please note that …
… the human genome contains about 30,000 genes.
… a microarray can contain thousands or tens of thousands of
relatively short oligonucleotides of known sequence.
The end product of a comparative hybridization
experiment is a scanned array image.
Microarray applications
Determine relative DNA levels associated with a huge number
of known and predicted genes in a single experiment.
The most attractive application of microarrays is in the
study of differential gene expression in disease.
The up- or down-regulation of gene activity can be either
the cause of the pathophysiology or the result of the
disease.
The expression level of every single gene is measured accurately.
Sensitivity: very high; can detect the presence of one transcript
in one-tenth of a cell.
Data mining challenges
Volume of data (gigabytes, number of features)
Characteristics of data (specific constraints)
Domain specific knowledge (expert interpretation)
BMA-CBR System
[Diagram: BMA-CBR pipeline]
Gene expression level dataset --> application of feature selection algorithm
Discrete sample output: supervised machine learning and model construction through classification --> diagnosis
Continuous sample output: supervised machine learning and model construction through prediction --> survival analysis
The BMA-CBR system performs feature selection through
BMA before using CBR for microarray data classification
and prediction (survival analysis).
Introduction and motivation of variable selection
What is Bayesian Model Averaging (BMA)?
One approach: the iterative BMA algorithm
Application 1: Chronic Myeloid Leukemia (CML)
Application 2: Survival analysis
Presentation of CBR
Bayesian Model Averaging
Feature selection
Used to select a subset of relevant features for building robust
learning models in machine learning.
Often used in supervised learning.
Select relevant features from the training set (for which class
labels are known).
Apply the selected features in the test set.
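The train/test discipline described above can be sketched in a few lines of Python. This is a minimal illustration, not the BMA procedure itself: the scoring rule (absolute difference of class means), the function name `select_top_features`, and the toy data are all assumptions made for the example.

```python
import numpy as np

def select_top_features(X_train, y_train, k):
    """Return indices of the k features whose two class means differ most.

    The score is a stand-in chosen for illustration (an assumption),
    not the BMA-based selection described in this talk.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    mean0 = X_train[y_train == 0].mean(axis=0)
    mean1 = X_train[y_train == 1].mean(axis=0)
    score = np.abs(mean1 - mean0)
    # argsort is ascending, so take the last k and reverse for best-first order
    return np.argsort(score)[-k:][::-1]

# Toy data (hypothetical): 4 training samples, 3 features.
X_train = np.array([[1.0, 5.0, 0.0],
                    [1.1, 4.0, 0.2],
                    [0.9, 5.5, 3.0],
                    [1.0, 4.5, 3.2]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[1.0, 5.0, 2.9]])

# Features are chosen using the labelled training set only ...
selected = select_top_features(X_train, y_train, k=1)
# ... and the same column indices are then applied to the test set.
X_train_sel = X_train[:, selected]
X_test_sel = X_test[:, selected]
```

The point of the sketch is that the test samples never influence which columns are kept.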
Feature selection
A minimal set of relevant genes for future prediction or assay
development
Typical variable selection methods – one variable at a
time
Examples:
T-test
Ratio of between-group to within-group sum of squares (BSS/WSS)
[Dudoit et al. 2001]
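As a rough illustration of a one-gene-at-a-time score, the BSS/WSS ratio can be computed per gene as follows. The function name `bss_wss_ratio` and the toy matrix are assumptions for this sketch; the formula is the usual between-group over within-group sum of squares.

```python
import numpy as np

def bss_wss_ratio(X, y):
    """Per-gene ratio of between-group to within-group sum of squares.

    X: samples x genes matrix, y: class label per sample.
    A gene that is constant within every class has WSS = 0; this sketch
    does not guard against that case.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for label in np.unique(y):
        group = X[y == label]
        group_mean = group.mean(axis=0)
        # Each sample in the group contributes (group mean - overall mean)^2
        bss += group.shape[0] * (group_mean - overall_mean) ** 2
        # Spread of the samples around their own group mean
        wss += ((group - group_mean) ** 2).sum(axis=0)
    return bss / wss

# Toy data (hypothetical): 4 samples, 2 genes, 2 classes.
X = np.array([[1.0, 4.0],
              [3.0, 6.0],
              [7.0, 5.0],
              [9.0, 7.0]])
y = np.array([0, 0, 1, 1])

ratios = bss_wss_ratio(X, y)
ranking = np.argsort(-ratios)  # gene indices, most discriminative first
```

Each gene is scored in isolation here, which is exactly the limitation the multivariate approach later in the talk is meant to address.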
Multivariate gene selection
Our goal: consider multiple genes simultaneously,
exploiting the interdependence between genes
to reduce the number of relevant genes
Bayesian Model Averaging (BMA) [Raftery 1995],
[Hoeting et al. 1999]
A multivariate variable selection technique.
Typical model selection approaches select a model and then
proceed as if the selected model had generated the data -->
overconfident inferences
Advantages of BMA:
Fewer selected genes
Can be generalized to any number of classes
Posterior probabilities for selected genes and selected models
BMA
Average over predictions from several models
What do we need?
Prediction with a given model k --> logistic regression
How to choose a set of “good” models? --> variable selection
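Averaging over predictions can be sketched as a posterior-weighted mean of the class probabilities from several logistic regression models. Everything specific below is hypothetical: the dictionary layout, the function names, the coefficients, and the posterior model probabilities are made up to show the mechanics only.

```python
import numpy as np

def logistic_predict(x, intercept, coefs, genes):
    """P(class = 1 | x) under one logistic regression model on a gene subset."""
    z = intercept + np.dot(coefs, x[genes])
    return 1.0 / (1.0 + np.exp(-z))

def bma_predict(x, models):
    """Average the model predictions, weighted by posterior model probability."""
    total = sum(m["posterior"] for m in models)
    weighted = sum(
        m["posterior"] * logistic_predict(x, m["intercept"], m["coefs"], m["genes"])
        for m in models
    )
    return weighted / total

# Two hypothetical models over different gene subsets.
models = [
    {"posterior": 0.75, "genes": [0], "intercept": 0.0, "coefs": np.array([1.0])},
    {"posterior": 0.25, "genes": [0, 1], "intercept": 0.0, "coefs": np.array([1.0, 1.0])},
]
x = np.array([0.0, 2.0])  # expression levels of two genes for one sample

p = bma_predict(x, models)
```

A gene that appears only in low-probability models contributes little to the final prediction, which is how the averaging tempers overconfidence in any single model.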
What models to average over?
All possible models --> way too many!!
E.g. 2^30 ~ 1 billion, 2^50 ~ 10^15, etc.
The BMA solution:
1. “Leaps and bounds” [Furnival and Wilson 1974]: when
#variables (genes) <= 30, we can efficiently produce a
reduced set of good models (branch and bound).
2. Cut down the # models?
Discard models that are much less likely than the best
model.
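The second step, discarding models much less likely than the best one, might look like the sketch below. It assumes two things the slide does not state: that posterior model probabilities are approximated from BIC values as exp(-BIC/2), and that "much less likely" means more than 20 times less likely than the best model. The function name `occams_window` and the BIC values are illustrative.

```python
import numpy as np

# Size of the full model space for 30 candidate genes (every subset is a model).
n_models = 2 ** 30  # 1073741824

def occams_window(bics, ratio=20.0):
    """Keep models whose approximate posterior is within `ratio` of the best.

    Assumption: posterior model probability is proportional to exp(-BIC / 2).
    Returns a boolean mask and renormalised weights (zero for dropped models).
    """
    bics = np.asarray(bics, dtype=float)
    # Subtract the minimum BIC so the best model has relative weight 1
    rel = np.exp(-0.5 * (bics - bics.min()))
    keep = rel >= 1.0 / ratio
    weights = np.where(keep, rel, 0.0)
    return keep, weights / weights.sum()

# Hypothetical BIC values for three candidate models.
keep, weights = occams_window([100.0, 102.0, 120.0])
```

With these toy values the third model is far outside the window and receives zero weight, while the remaining two share the probability mass.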
Iterative BMA algorithm [Yeung, Bumgarner, Raftery
2005]
Pre-processing step: Rank genes using BSS/WSS ratio.
Initial step:
Repeat until all genes are processed:
Output: selected genes and models with their
posterior probabilities
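The transcript does not preserve the details of the initial and repeated steps, so the following is only one plausible reading of a sliding-window scheme, not the authors' implementation. Assumptions: BMA is run on a window of the top-ranked genes, genes whose posterior probability falls below a threshold are dropped, and the window is refilled from the ranked list; if nothing falls below the threshold, the weakest gene is dropped to make room. The window size of 30, the 1% threshold, and the `run_bma` callable are all hypothetical placeholders.

```python
def iterative_bma(ranked_genes, run_bma, window_size=30, threshold=0.01):
    """Sliding-window gene selection (illustrative sketch, see assumptions above).

    ranked_genes: gene names, best-ranked first (e.g. by BSS/WSS).
    run_bma: hypothetical callable mapping a list of genes to a dict
             {gene: posterior probability that the gene is relevant}.
    """
    window = list(ranked_genes[:window_size])
    next_idx = len(window)
    probs = run_bma(window)
    while next_idx < len(ranked_genes):
        kept = [g for g in window if probs[g] >= threshold]
        if len(kept) == len(window):
            # Nothing fell below the threshold: drop the weakest gene
            weakest = min(window, key=lambda g: probs[g])
            kept = [g for g in window if g != weakest]
        # Refill the window with the next genes from the ranked list
        n_new = window_size - len(kept)
        new = list(ranked_genes[next_idx:next_idx + n_new])
        next_idx += len(new)
        window = kept + new
        probs = run_bma(window)
    return {g: probs[g] for g in window if probs[g] >= threshold}

# Stub standing in for a real BMA run: fixed, made-up posterior probabilities.
fixed = {"g1": 0.9, "g2": 0.0, "g3": 0.0, "g4": 0.7, "g5": 0.0, "g6": 0.0}
stub_bma = lambda genes: {g: fixed[g] for g in genes}

selected = iterative_bma(["g1", "g2", "g3", "g4", "g5", "g6"], stub_bma, window_size=3)
```

In a real run the posterior probabilities would change with the contents of the window; the stub ignores that and exists only to exercise the loop.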
Application 1: Classification of progression of
chronic myeloid leukemia (CML)
Motivation: New Candidates for Prognostic
studies in CML
Progression of CML
CML usually presents in chronic phase (CP), but in the absence
of effective therapy, CP CML invariably transforms to
accelerated phase (AP) disease, and then to an acute
leukemia, blast crisis (BC).
BC is highly resistant to treatment, and all treatments are more
successful when administered during CP.
Imatinib is most effective in early CP patients, who have excellent
survival (86% at 7 years).
Currently there are limited clinical markers and no molecular
tests that can predict the “clock” of CML progression for
individual patients at the time of CP diagnosis, making it
difficult to adapt therapy to the risk level of each patient.
Why predictors for CML progression?
Identification of CML progression biomarkers
Genes associated with CML progression
BMA selected genes using microarray data
Selected 6 genes over 21 models
Repeat CV 100 times
Average Brier Score = 0.21
Average prediction accuracy = 99.17%
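For readers unfamiliar with the Brier score, it measures the squared gap between predicted class probabilities and the 0/1 outcomes; lower is better. The slide does not say whether the reported figure is summed or averaged over test samples, so the sketch below offers both, using made-up predictions; the function name and the 0.5 decision threshold are assumptions.

```python
import numpy as np

def brier_score(y_true, p_pred, average=False):
    """Squared error between 0/1 outcomes and predicted probabilities.

    Returns the sum over samples by default, or the mean if average=True.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    squared = (y_true - p_pred) ** 2
    return float(squared.mean() if average else squared.sum())

# Hypothetical held-out labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.1]

brier_sum = brier_score(y_true, p_pred)
brier_mean = brier_score(y_true, p_pred, average=True)
# Accuracy after thresholding the probabilities at 0.5
accuracy = float(np.mean((np.asarray(p_pred) >= 0.5) == np.asarray(y_true)))
```

Unlike accuracy, the Brier score penalises a correct but hesitant prediction (0.6 above) more than a correct and confident one (0.9).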
PCR data: CP-early vs CP-late
Summary: CML data
BMA applied to a microarray dataset consisting of patient samples
in different phases of CML identified 6 signature genes (ART4,
DDX47, IGSF2, LTB4R, SCARB1, SLC25A3).
The gene signature was validated using quantitative PCR: the 6-gene
signature is highly predictive of CP-early vs CP-late.
What is next?
To identify biologically meaningful biomarkers for CML
progression and response to therapy.
Biomarkers that are functionally related (connected in an
underlying network) to known reference genes.
Application 2: Survival analysis
Results: Breast cancer data
Results: Breast cancer data - Annest, Bumgarner,
Raftery, Yeung. BMC Bioinformatics 2009
CBR
Classification task
Similarity measure
Weights provided by BMA for selected features
Classification task
Choose the class for which the average similarity score is
highest
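The classification rule can be sketched as follows. The transcript does not preserve the similarity formula, so the one used here (a weighted mean of range-normalised per-gene similarities) is an assumption, as are the function names, the weights standing in for BMA posterior probabilities, and the toy case base.

```python
import numpy as np

def weighted_similarity(a, b, weights, ranges):
    """Weighted mean of per-gene similarities, each clipped to [0, 1]."""
    local = np.clip(1.0 - np.abs(a - b) / ranges, 0.0, 1.0)
    return float(np.dot(weights, local) / weights.sum())

def classify(query, cases, labels, weights):
    """Return the class with the highest average similarity to the query."""
    cases = np.asarray(cases, dtype=float)
    query = np.asarray(query, dtype=float)
    weights = np.asarray(weights, dtype=float)
    ranges = cases.max(axis=0) - cases.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid dividing by zero for constant genes
    sims = np.array([weighted_similarity(query, c, weights, ranges) for c in cases])
    labels_arr = np.asarray(labels)
    avg = {lab: float(sims[labels_arr == lab].mean()) for lab in sorted(set(labels))}
    return max(avg, key=avg.get), avg

# Hypothetical case base: 4 stored cases described by 2 selected genes.
cases = [[1.0, 0.0], [2.0, 1.0], [9.0, 10.0], [10.0, 9.0]]
labels = ["A", "A", "B", "B"]
weights = [0.9, 0.1]  # stand-ins for BMA posterior probabilities of the genes

predicted, avg = classify([1.5, 8.0], cases, labels, weights)
```

In this toy query the second gene alone would point to class B, but the heavily weighted first gene dominates and the query is assigned to class A.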
Survival analysis task
Similarity measure
Weights provided by BMA for selected features
Survival analysis task
Choose the class for which the average similarity score is
highest
Evaluation / Classification
Dataset      Total Number of Samples   # Training Samples   # Validation Samples   Number of Genes
Leukemia 2   72                        38                   34                     3051
Leukemia 3   72                        38                   34                     3051

Dataset      # classes   BMA-CBR                       iterativeBMA                  Other published results
Leukemia 2   2           #genes = 20, #errors = 1/34   #genes = 20, #errors = 2/34   #genes = 5, #errors = 1/34
Leukemia 3   3           #genes = 15, #errors = 1/34   #genes = 15, #errors = 1/34   #genes ~ 40, #errors = 1/34
Evaluation / Prediction
Dataset         Total Number of Samples   # Training Samples   # Validation Samples   Number of Genes
DLBCL           240                       160                  80                     7,399
Breast Cancer   295                       61                   234                    4,919

Dataset         BMA-CBR                            iterativeBMA                       Best Other Published Results
DLBCL           #genes = 25, p-value = 0.00121     #genes = 25, p-value = 0.00139     #genes = 17, p-value = 0.00124
Breast Cancer   #genes = 15, p-value = 2.14e-10    #genes = 15, p-value = 3.38e-10    #genes = 5, p-value = 3.12e-05
Conclusion
The combination of BMA and CBR gives classification and prediction
results that match or improve on iterativeBMA alone on the datasets evaluated.
It provides promising results for the application of CBR
to bioinformatics tasks and data.
Future developments
Refine risk classes into more than two risk groups.
Refine CBR algorithm.
Test on additional datasets.
Provide automatic interpretation of the classification / prediction
both for gene selection and for case-based reasoning.