
It is only the beginning:
Putting microarrays into context
Matthias E. Futschik
Institute for Theoretical Biology
Humboldt-University, Berlin, Germany
Hvar summer school, 2004
The Whole Picture
Protein Functions
Metabolites
Protein Structures
?
Chromosomal Location
DNA
Medical expert
knowledge
Microarrays
Networks of Genes
Gene expression is regulated
by complex genetic networks
with a variety of interactions on
different levels (DNA, RNA, protein),
on many different
time scales (seconds to years)
and at various locations
(nucleus, cytoplasm, tissue).
Models:
Boolean networks
Bayesian networks
Differential equations
Ontologies: Categorising and labelling objects
Representation as graph:
•Terms as nodes
•Edges as rules
•Transitivity rule
•Parent and child nodes
Bard and Rhee,
Nature Reviews Genetics, 2004
Ontology: restricted vocabulary with structuring rules
describing relationships between terms
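The graph representation above can be sketched in a few lines. This is a minimal illustration, assuming a tiny hypothetical set of terms linked by parent edges; the term names and edges are made up for the example and are not taken from an actual ontology release. The transitivity rule means that whatever is annotated to a child term also holds for all of its ancestors.

```python
# Hypothetical mini-ontology: each term maps to its parent terms.
PARENTS = {
    "kinase activity": ["enzyme activity"],
    "enzyme activity": ["molecular function"],
    "receptor activity": ["molecular function"],
    "molecular function": [],
}

def ancestors(term, parents=PARENTS):
    """Return all terms reachable from `term` by following parent edges."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

print(sorted(ancestors("kinase activity")))
```

A gene annotated with the child term would, under this reading, also count for every term returned by `ancestors`.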
Gene Ontology
Consists of three
independent
ontologies:
• molecular function e.g.
enzyme
• biological process e.g.
signal transduction
• cellular component
Gene sets / clusters can
easily be analysed based
on Gene Ontology terms
Mapping of gene expression to
chromosomal location
Significance analysis
of chromosomal location of differential
gene expression (SW620 vs SW480)
The p-value for finding at least k from a
total of s significant differentially
expressed genes within a cytoband
window is
P = 1 - \sum_{i=0}^{k-1} \frac{\binom{s}{i} \binom{g-s}{n-i}}{\binom{g}{n}}
where g is the total number of genes with cytoband
location and n the total number of genes within the
cytoband window.
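The p-value described above is the upper tail of a hypergeometric distribution. A minimal sketch, assuming the standard hypergeometric form with k, s, g and n as defined above; the function name is hypothetical and the numbers in the call are invented for illustration.

```python
from math import comb

def cytoband_p_value(k, s, g, n):
    """Probability that at least k of the s significant genes fall into a
    window of n genes, given g genes with cytoband location in total."""
    # Sum the probabilities of finding 0 .. k-1 significant genes in the window.
    lower_tail = sum(
        comb(s, i) * comb(g - s, n - i) for i in range(min(k, n + 1))
    )
    return 1.0 - lower_tail / comb(g, n)

# Hypothetical example: 1000 located genes, 50 of them significant,
# a window holding 20 genes, 4 of which are significant.
print(cytoband_p_value(k=4, s=50, g=1000, n=20))
```

Because many windows are tested along the chromosomes, the resulting p-values would normally need a correction for multiple testing.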
Relating number of gene copies and gene
expression I
Pollack et al.,
PNAS, 2001
• Study of chromosomal
abnormalities in breast
cancer
• Use of genomic DNA
and cDNA arrays
• Hotspots of increased
number of gene copies
Relating number of gene copies and gene
expression II
Correlation of gene
copy number and
transcriptional levels
detected
Correlation of mRNA and protein abundance
Ideker et al, Science, 2001
• Study of yeast galactose-utilisation pathway
• Use of microarrays, quantitative proteomics
and databases of protein interactions
• Dissection of transcriptional and posttranslational control
New genes and interactions
in GAL pathway were found
Average correlation of
0.63 between
transcript and
protein levels
Genome, Transcriptome and Translatome
Greenbaum et al, Genome
Research, 2001
• Interrelating genome,
transcriptome and
translatome
• Similar composition based
on functional categories of
translatome and
transcriptome
• Differing composition of
genome
Linking expression to drug effectiveness
Relevance networks:
Butte et al, PNAS, 2000
• Correlation between growth inhibition by drugs and
gene expression for NCI60 cell lines
• Gene expression based on Affymetrix chips (7245 genes)
• 5000 anticancer agents
• Significance testing based on randomisation
• Significant link between LCP1 and NSC 624044
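Significance testing by randomisation can be sketched as a permutation test on the correlation between one gene's expression and one drug's growth inhibition across cell lines. This is a minimal sketch, assuming Pearson correlation as the measure; the function names, the number of permutations and the toy data are hypothetical and not taken from the study.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Fraction of label shufflings with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: expression and growth inhibition in 10 cell lines.
expression = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
inhibition = [2 * v + 1 for v in expression]
print(permutation_p_value(expression, inhibition))
```

With thousands of genes and agents, the number of gene-drug pairs is very large, so the threshold on the correlation has to be chosen with the number of comparisons in mind.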
Combining gene expression data with
clinical parameters
Diffuse large B-cell Lymphoma (DLBCL)
• Most common lymphoid malignancy in adults
• Treatment by multi-agent chemotherapy
• In case of a relapse: bone marrow transplantation
• Clinical course of DLBCL is widely variable: only 40% of
treatments are successful
=> Accurate outcome prediction is crucial for stratifying
patients for intensified therapy
Case study: DLBCL
Current prognostic model: International Prognostic Index (IPI)
Alternative: Microarray-based prediction of treatment outcome
DLBCL study by Shipp et al.
(Nature Medicine, 2002, 8(1):68-74)
• expression profiles of 58 patients using Hu6800 Affymetrix
chips (corresponding to ca. 6800 genes)
• Prediction accuracy of outcome using leave-one-out procedure:
kNN: 70.7%; WV: 75.9%; SVM: 77.6%
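The leave-one-out procedure mentioned above can be sketched as follows: each sample is held out once, the classifier is built on the remaining samples, and the held-out sample is predicted. This is a minimal sketch using a 1-nearest-neighbour classifier; the function names and the toy data are hypothetical and not the study's setup.

```python
def nearest_neighbour_label(train, query):
    """1-nearest-neighbour prediction with squared Euclidean distance."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    _, label = min(train, key=lambda item: dist(item[0], query))
    return label

def leave_one_out_accuracy(samples):
    """samples: list of (feature_vector, label) pairs."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # hold out sample i
        if nearest_neighbour_label(train, features) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical toy data: two expression values per sample.
toy = [((0.1, 0.2), "cured"), ((0.2, 0.1), "cured"), ((0.0, 0.3), "cured"),
       ((0.9, 0.8), "fatal"), ((1.0, 0.9), "fatal"), ((0.8, 1.0), "fatal")]
print(leave_one_out_accuracy(toy))
```

In a real analysis, gene selection would also have to be repeated inside each held-out round; selecting genes once on all samples gives an optimistic accuracy estimate.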
DLBCL outcome prediction is challenging!
Sammon's Mapping of top 22 genes
ranked by signal-to-noise: Large
overlap between classes with ‘cured’
and ‘fatal’ outcome.
Low correlation of gene expression with classes:
Only 3 genes with correlation coefficient > 0.4
vs. leukemia study by Golub et al.: 263 genes
vs. colon cancer study by Alon et al.: 215 genes
Limitation of microarray approach: Only mRNA abundance is measured.
However, many different factors (patient and tumour related)
determine outcome of therapy: Integration might be necessary!
Prognostic models for DLBCL
Clinical predictor:
• IPI based on five risk factors (age, tumour stage, patient’s
performance, number of extranodal sites, LDH concentration)
• Survival rate determined in clinical study:
Low risk: 73%, low-intermediate: 51%, intermediate-high: 42%,
high: 26%
• Conversion of IPI into Bayesian classifier using survival rates as
conditional probabilities P:
e.g. Sample belongs to class ‘cured’ if P(‘cured’|IPI)> P(‘fatal’|IPI)
=> Overall accuracy of 73.2%.
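The conversion of the IPI into a classifier can be sketched directly from the survival rates listed above. This is a minimal sketch, assuming only two mutually exclusive outcomes so that P('fatal'|IPI) = 1 - P('cured'|IPI); the function and variable names are hypothetical.

```python
# Survival rates per IPI risk group as listed above, used as P('cured' | IPI).
P_CURED = {
    "low": 0.73,
    "low-intermediate": 0.51,
    "intermediate-high": 0.42,
    "high": 0.26,
}

def ipi_predict(risk_group):
    """Predict the class with the larger conditional probability."""
    p_cured = P_CURED[risk_group]
    p_fatal = 1.0 - p_cured  # assumption: only two possible outcomes
    return "cured" if p_cured > p_fatal else "fatal"

for group in P_CURED:
    print(group, ipi_predict(group))
```

Under this rule the two lower risk groups are assigned to 'cured' and the two higher risk groups to 'fatal'.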
Prognostic models for DLBCL
Microarray-based predictor:
• Identifies clusters by unsupervised learning
• Supervised classification
• EFuNN as five-layered neural network
• Based on 17 genes using signal-to-noise criterion
• Accuracy using leave-one-out: 78.5%
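Gene selection by the signal-to-noise criterion can be sketched as below. The slide does not give the formula; the sketch assumes the definition commonly used for this ranking, (mean_a - mean_b) / (sd_a + sd_b) per gene, with sample standard deviations. The function names and the expression values are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene (assumed definition)."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def top_genes(expr_a, expr_b, n_top):
    """expr_a, expr_b: dicts mapping gene -> expression values in each class.
    Returns the n_top genes with the largest absolute signal-to-noise."""
    scores = {g: signal_to_noise(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:n_top]

# Hypothetical expression values for three genes in two outcome classes.
cured = {"g1": [1, 2, 3], "g2": [5, 6, 7], "g3": [1, 2, 3]}
fatal = {"g1": [1, 2, 3], "g2": [1, 2, 3], "g3": [2, 3, 4]}
print(top_genes(cured, fatal, n_top=2))
```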
Independence of predictors
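Set theory:
C1: samples correctly classified by the microarray-based predictor
C2: samples correctly classified by the IPI-based predictor
For 19 of 56 samples the predictors are complementary (8 samples
correctly classified only by the IPI-based predictor,
11 only by the microarray-based predictor)
=> Upper threshold for a combined predictor: C1 ∪ C2 = 52 out of 56 samples (92.9%)
[Venn diagram: 11 samples only in C1, 33 in both C1 and C2, 8 only in C2]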
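The set-theoretic argument can be retraced with the counts from the diagram. This is a small sketch, assuming the 33 in the overlap is the number of samples classified correctly by both predictors; the variable names are hypothetical.

```python
total = 56            # samples in the data set
only_microarray = 11  # correct only with the microarray-based predictor
only_ipi = 8          # correct only with the IPI-based predictor
both = 33             # correct with both predictors (assumed reading of the overlap)

microarray_correct = both + only_microarray          # samples in C1
ipi_correct = both + only_ipi                        # samples in C2
either_correct = both + only_microarray + only_ipi   # samples in the union

print(f"microarray-based: {microarray_correct}/{total}")
print(f"IPI-based:        {ipi_correct}/{total}")
print(f"upper threshold:  {either_correct}/{total}")
```

The union gives the ceiling of 52 of 56 samples: a combined predictor that always picked the correct one of the two would still miss the 4 samples that neither predictor classifies correctly.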
Mutual Information
x, y ∈ {0, 1}: class predictions (cured/fatal) of the microarray-based and the IPI-based predictor
P(x), Q(y): probabilities of the microarray-based and the IPI-based predictions
R(x,y): joint probability of the predictions by the microarray-based and IPI-based predictors
I = Σ_{x,y} R(x,y) log2( R(x,y) / [P(x) Q(y)] ) ≈ 0.05
=> Microarray-based and IPI-based predictor statistically independent!
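The mutual information formula above can be evaluated from a 2x2 table of joint probabilities. This is a minimal sketch; the function name is hypothetical and the two example tables are invented reference cases, not the study's data. For two binary predictors, I lies between 0 bits (independent predictions) and 1 bit (identical, balanced predictions), which puts the reported value of about 0.05 close to the independent end.

```python
from math import log2

def mutual_information(joint):
    """joint[x][y]: joint probability R(x, y) of the two predictions."""
    p = [sum(row) for row in joint]         # marginal P(x)
    q = [sum(col) for col in zip(*joint)]   # marginal Q(y)
    return sum(
        r * log2(r / (p[x] * q[y]))
        for x, row in enumerate(joint)
        for y, r in enumerate(row)
        if r > 0  # terms with R(x, y) = 0 contribute nothing
    )

# Hypothetical reference cases.
independent = [[0.25, 0.25], [0.25, 0.25]]
identical = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent), mutual_information(identical))
```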
Hierarchical modular decision system
Three-layered hierarchical model:
• Predictor module layer consisting of independently trained predictors (Bayesian classifier based on the IPI, EFuNN)
• Class unit layer integrating the predictions of the single predictors (weighted sum in class units for Class A and Class B)
• Decision layer producing the final combined prediction (Class A/Class B)
[Diagram: predictor module layer (Bayesian classifier on IPI, EFuNN), class unit layer (Class A, Class B) and decision layer (combined prediction), connected by weighted links]
Model parameters: α, β1,β2
Training: error backpropagation with parallel training
of neural network
Validation: leave-one-out
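The integration by weighted sum in the class units can be illustrated as follows. The slides do not spell out how α, β1 and β2 enter the model, so this is only one plausible reading and not the published method: each class unit is assumed to mix the two predictor outputs with a class-specific weight β, and the decision layer is assumed to pick the class unit with the larger value. The role of α is not specified and is left out. All names and input values are hypothetical.

```python
def combine(micro, ipi, beta):
    """micro, ipi: dicts mapping class -> predictor output in [0, 1].
    beta: dict mapping class -> weight of the microarray-based module
    (assumed meaning; the IPI-based module gets weight 1 - beta)."""
    units = {
        c: beta[c] * micro[c] + (1.0 - beta[c]) * ipi[c]
        for c in micro
    }
    decision = max(units, key=units.get)  # assumed decision rule: largest unit wins
    return decision, units

# Hypothetical predictor outputs for one sample.
micro_out = {"A": 0.9, "B": 0.1}
ipi_out = {"A": 0.26, "B": 0.74}
print(combine(micro_out, ipi_out, beta={"A": 0.8, "B": 0.75}))
```

With β close to 1 the combined prediction follows the microarray-based module, with β close to 0 it follows the IPI-based module, so the weights express how much each predictor is trusted.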
Improved prediction by integration
Significantly improved accuracy of modular hierarchical
system (parameter values: α = 0.4, β1 = 0.8, β2 = 0.75)
Model                 Accuracy
Hierarchical model    87.5%
EFuNN                 78.5%
IPI                   73.2%
Constructive and destructive
interference:
Both the microarray-based and the clinical predictor are
necessary for improvement
Identification of areas of expertise
Stratification of data set by IPI category
β = β1 = β2
[Plot axis labels: IPI, Microarray]
=> Data stratification can be used to detect areas of expertise
e.g. IPI risk groups low, low-intermediate and intermediate-high
for microarray-based classifier
=> Identification by data stratification can indicate limits of models
e.g. IPI risk group high for microarray-based classifier
M.E. Futschik et al., Prediction of clinical behaviour and treatment for
cancers, OMJ Applied Bioinformatics, 2003
The way out of the microarray cave
Thank you and goodbye!