A Statistical Framework for
Expression-Based Molecular
Classification
Elizabeth Garrett
Sidney Kimmel Cancer Center
Johns Hopkins University
Molecular Classification of Cancer
• Goals
– Short term:
• To use gene expression array data to identify and
hypothesize subtypes of cancer
• To discover new cancer classes that are interpretable
and amenable to further biological analysis
• To translate classes into clinical tools
– Long term:
• To eventually refine individualized prognosis and
therapy
Outline of Talk
• Molecular Classifications
– the role of statistics in molecular classification
– defining a molecular profile
• Modeling latent classes: POE (Probability
of Expression)
– Bayesian mixture models
– visualization tools
• “Mining” using latent classes
• Using POE to combine across platforms
Botstein-Brown style of visualizing gene expression data
(Garber et al., PNAS 2001)
The fine print
Motivating Datasets
• Unclassified cancer samples: Are the gene expression
patterns informative about subclasses?
– Ductal breast cancers
– Adenocarcinomas of the lung
– Diffuse large B-cell lymphoma
• Related tissues: Are subtypes associated with prognosis?
– Normal tissues and cancer tissues
– Outcome data (e.g. survival, recurrence, response)
• Genes: Are hypothesized genes associated with cancer
types?
– Functional information
– Custom array
General Approach of
POE (Probability of Expression)
• Define a reference expression value:
– “normal” vs. over expressed vs. under expressed
– unsupervised in nature
• Use scale-independent measures of expression
– allows combination of data across platforms
– incorporates measurement errors
• Choose molecular profile that predicts cancer class based
on a small number of genes
– yields clinical implications
– choose genes using combination of statistical and
biological evidence
• Caveat: NOT intended for gene clustering and not for manual
clustering of genes
Molecular Profiles
(based on 3 genes A, B, and C)
27 = 3^3 possible profiles

              Gene A   Gene B   Gene C
Profile 1       -1       -1       -1
Profile 2       -1       -1        0
Profile 3       -1       -1        1
Profile 4       -1        0       -1
Profile 5       -1        0        0
Profile 6       -1        0        1
….              ….       ….       ….
Profile 24       1        0        1
Profile 25       1        1       -1
Profile 26       1        1        0
Profile 27       1        1        1
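The profile table can be generated mechanically. The sketch below is only an illustration, assuming each gene takes one of the three coded states -1, 0, 1 and that profiles are numbered in the order shown on the slide; the function name `enumerate_profiles` is hypothetical.

```python
from itertools import product

# Coded expression states: -1 = under expressed, 0 = normal, 1 = over expressed.
LEVELS = (-1, 0, 1)


def enumerate_profiles(n_genes):
    """Return every possible profile for n_genes genes, in lexicographic order."""
    return list(product(LEVELS, repeat=n_genes))


profiles = enumerate_profiles(3)
for number, profile in enumerate(profiles, start=1):
    print(f"Profile {number}: {profile}")
```

With three genes this lists 3^3 = 27 profiles. The count grows as 3^n, which is one reason to base a profile on a small number of genes.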
Mixture of Normal and Two Uniform Distributions
Empirical Density of Expression Levels in One Gene Across 203 Lung Samples
(Bhattacharjee, PNAS 2001)
Latent Expression Classes
• Notation:
$e_{gt} = -1$: gene g has abnormally low expression in tumor t
$e_{gt} = 0$: gene g has normal expression in tumor t
$e_{gt} = 1$: gene g has abnormally high expression in tumor t
• Modeling observed gene expression, $a_{gt}$:
$a_{gt} \mid (e_{gt} = e) \sim f_{e,g}(\cdot), \quad e \in \{-1, 0, 1\}$
• For gene g, the proportions of differentially expressed
tumors in the population of unclassified tumors are
$\pi_g^+ = P(e_{gt} = 1)$
$\pi_g^- = P(e_{gt} = -1)$
Probability Scale for Expression Data
$$p_{gt}^+ = P(e_{gt} = 1 \mid a_{gt}, \pi_g^+, \pi_g^-, f_{1,g}, f_{0,g}) = \frac{\pi_g^+ f_{1,g}(a_{gt})}{\pi_g^+ f_{1,g}(a_{gt}) + (1 - \pi_g^+ - \pi_g^-) f_{0,g}(a_{gt})}$$
Interpretation: The probability that gene g in tumor t is over
expressed given observed expression and the model parameters
$$p_{gt}^- = P(e_{gt} = -1 \mid a_{gt}, \pi_g^+, \pi_g^-, f_{-1,g}, f_{0,g}) = \frac{\pi_g^- f_{-1,g}(a_{gt})}{\pi_g^- f_{-1,g}(a_{gt}) + (1 - \pi_g^+ - \pi_g^-) f_{0,g}(a_{gt})}$$
Interpretation: The probability that gene g in tumor t is under
expressed given observed expression and the model parameters
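As a rough illustration of the probability scale, the sketch below evaluates both probabilities for a single observation. It assumes the Normal/Uniform mixture named in the talk, with a uniform on each side of a common center and a normal around it. The function and argument names are hypothetical, and `sigma` is treated as a standard deviation, which is an assumption. The denominator is written with all three mixture terms. Because the two uniform components do not overlap here, it reduces to the two-term form shown on the slide wherever the numerator is nonzero.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def poe_probabilities(a, center, sigma, kappa_minus, kappa_plus, pi_minus, pi_plus):
    """Return (p_plus, p_minus) for one observed expression value a.

    Assumed component densities (illustrative reading of the model):
      under expressed: uniform on [center - kappa_minus, center)
      normal:          normal with mean center and sd sigma
      over expressed:  uniform on (center, center + kappa_plus]
    """
    f_minus = 1.0 / kappa_minus if center - kappa_minus <= a < center else 0.0
    f_zero = normal_pdf(a, center, sigma)
    f_plus = 1.0 / kappa_plus if center < a <= center + kappa_plus else 0.0

    denominator = (
        pi_plus * f_plus
        + pi_minus * f_minus
        + (1.0 - pi_plus - pi_minus) * f_zero
    )
    return pi_plus * f_plus / denominator, pi_minus * f_minus / denominator


# Toy values chosen only for illustration.
p_plus, p_minus = poe_probabilities(
    a=3.0, center=0.0, sigma=1.0,
    kappa_minus=5.0, kappa_plus=5.0, pi_minus=0.1, pi_plus=0.1,
)
print(p_plus, p_minus)
```

An observation three standard deviations above the center gets a high probability of over expression and zero probability of under expression, since it lies outside the lower uniform's support.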
Distributional Assumptions
Samples: Normal/Uniform mixture
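$f_{-1,g}(\cdot) = U(-\kappa_g^- + \alpha_t + \mu_g,\ \alpha_t + \mu_g)$
$f_{0,g}(\cdot) = N(\alpha_t + \mu_g,\ \sigma_g)$
$f_{1,g}(\cdot) = U(\alpha_t + \mu_g,\ \alpha_t + \mu_g + \kappa_g^+)$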
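Genes: Second stage model
$\mu_g \mid \theta_\mu, \tau_\mu \sim N(\theta_\mu, \tau_\mu)$
$\sigma_g^{-2} \mid \gamma_\sigma, \lambda_\sigma \sim G(\gamma_\sigma, \lambda_\sigma)$
$\kappa_g^- \mid \theta_\kappa^- \sim E(\theta_\kappa^-)$
$\kappa_g^+ \mid \theta_\kappa^+ \sim E(\theta_\kappa^+)$
$\mathrm{logit}(\pi_g^-) \mid \theta_\pi^-, \tau_\pi^- \sim N(\theta_\pi^-, \tau_\pi^-)$
$\mathrm{logit}(\pi_g^+) \mid \theta_\pi^+, \tau_\pi^+ \sim N(\theta_\pi^+, \tau_\pi^+)$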
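To make the sample-level mixture concrete, here is a small simulation sketch for a single gene. It is not the authors' code. It assumes the sample effect is folded into one `center` value, that `sigma` is a standard deviation, and that the mixing proportions are fixed rather than drawn from the second-stage model. All names are hypothetical.

```python
import random


def simulate_expression(n_tumors, center, sigma, kappa_minus, kappa_plus,
                        pi_minus, pi_plus, seed=0):
    """Draw latent classes and expression values from a Normal/Uniform mixture."""
    rng = random.Random(seed)
    classes, values = [], []
    for _ in range(n_tumors):
        u = rng.random()
        if u < pi_minus:
            # Under expressed: uniform below the center.
            e, a = -1, rng.uniform(center - kappa_minus, center)
        elif u < pi_minus + pi_plus:
            # Over expressed: uniform above the center.
            e, a = 1, rng.uniform(center, center + kappa_plus)
        else:
            # Normal expression: Gaussian noise around the center.
            e, a = 0, rng.gauss(center, sigma)
        classes.append(e)
        values.append(a)
    return classes, values


# 203 simulated tumors, matching only the sample size quoted for the lung data.
classes, values = simulate_expression(
    n_tumors=203, center=0.0, sigma=1.0,
    kappa_minus=6.0, kappa_plus=6.0, pi_minus=0.05, pi_plus=0.15,
)
print(sum(e == 1 for e in classes), "simulated tumors over expressed")
```

A histogram of `values` should show a central normal bump with flat shoulders on either side, the shape the mixture is meant to capture.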
Original Scale
After Transformation
Harvard Lung Cancer Data (Bhattacharjee, PNAS, 2001)
MCMC Estimation Approach
• Relatively straightforward
• A couple of comments:
– Data augmentation using unknown expression variables
egt. Sampling of ’s unconditional on e’s
[ | ]
*
[e|| , ]
*
[ | , e]
*
– Starting conditions are critical. K-means clustering
(k=2 or 3) useful for picking starting centers and spread
– Constrain $\min(\kappa_g^+, \kappa_g^-) > k\,\sigma_g$
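One possible way to turn the K-means suggestion into starting values for a single gene is sketched below. It is an illustration only: the talk does not say how the clusters are mapped to parameters, so taking the middle cluster as the normal component and the data range as the uniform spreads is an assumption, and `kmeans_1d` and `starting_values` are hypothetical helpers.

```python
import statistics


def kmeans_1d(values, k=3, n_iter=100):
    """Plain one-dimensional k-means with quantile-based initial centers."""
    data = sorted(values)
    centers = [data[int((i + 0.5) * len(data) / k)] for i in range(k)]

    def assign(current):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - current[j]))
            clusters[nearest].append(x)
        return clusters

    for _ in range(n_iter):
        clusters = assign(centers)
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(centers)


def starting_values(values):
    """Map a k=3 clustering to rough starting values (assumed mapping)."""
    centers, clusters = kmeans_1d(values, k=3)
    middle = clusters[1]  # assumed to be the normally expressed group
    center = centers[1]
    sigma = statistics.pstdev(middle) if len(middle) > 1 else 1.0
    return {
        "center": center,
        "sigma": sigma,
        "kappa_minus": center - min(values),  # spread below the center
        "kappa_plus": max(values) - center,   # spread above the center
    }


toy = [-4.0, -3.5, -0.2, -0.1, 0.0, 0.1, 0.2, 3.0, 3.6]
start = starting_values(toy)
print(start)
```

Starting the sampler from values like these places the normal component on the bulk of the data, which is the kind of initialization the slide recommends.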
Denoising Expression Data
$E(\text{signal}_{gt} \mid a_{gt}, \omega) = \mu_g + \alpha_t + (p_{gt}^+ + p_{gt}^-)(a_{gt} - \mu_g - \alpha_t)$
Provides “cleaner” version of the original
expression level data.
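The denoising step can be read as shrinking each observation toward its center by the probability that the gene is normally expressed. The sketch below assumes that reading, with the gene and sample effects folded into one `center`. The function name is hypothetical.

```python
def denoise(a, center, p_plus, p_minus):
    """Keep the deviation from the center in proportion to the probability
    that the gene is differentially expressed (assumed reading)."""
    return center + (p_plus + p_minus) * (a - center)


# A value that is probably over expressed keeps most of its deviation,
# while a value that looks normal is pulled back to the center.
print(denoise(3.0, 0.0, p_plus=0.85, p_minus=0.0))
print(denoise(0.4, 0.0, p_plus=0.0, p_minus=0.0))
```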
Mining for Genes
• Two quantities of interest in looking for and
grouping genes.
• Probability that gene g follows a specified pattern:
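$$P(e_{g1}, \ldots, e_{gT} \mid \omega) = \prod_t (p_{gt}^+)^{I(e_{gt} = 1)} \, (p_{gt}^-)^{I(e_{gt} = -1)} \, (1 - p_{gt}^+ - p_{gt}^-)^{I(e_{gt} = 0)}$$
with $\omega$ the full set of model parameters.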
• Probability that all genes in set G0 have the same
pattern across samples
$$q(G_0) = \prod_t \prod_{g, g' \in G_0} \left( p_{gt}^+ p_{g't}^+ + p_{gt}^- p_{g't}^- + (1 - p_{gt}^+ - p_{gt}^-)(1 - p_{g't}^+ - p_{g't}^-) \right)$$
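Both mining quantities are products of per-tumor probabilities, so they are simple to compute once the over and under expression probabilities are available. The sketch below is illustrative only. It assumes the product in q runs over unordered pairs of distinct genes, and the function names and toy probabilities are made up.

```python
from itertools import combinations


def pattern_probability(pattern, p_plus, p_minus):
    """Probability that one gene follows a given pattern of -1/0/1 across tumors."""
    prob = 1.0
    for e, pp, pm in zip(pattern, p_plus, p_minus):
        if e == 1:
            prob *= pp
        elif e == -1:
            prob *= pm
        else:
            prob *= 1.0 - pp - pm
    return prob


def agreement_score(p_plus_by_gene, p_minus_by_gene):
    """Product over tumors and gene pairs of the probability that both genes
    of a pair are in the same expression class (assumed reading of q)."""
    genes = sorted(p_plus_by_gene)
    n_tumors = len(p_plus_by_gene[genes[0]])
    score = 1.0
    for t in range(n_tumors):
        for g, h in combinations(genes, 2):
            pg, mg = p_plus_by_gene[g][t], p_minus_by_gene[g][t]
            ph, mh = p_plus_by_gene[h][t], p_minus_by_gene[h][t]
            score *= pg * ph + mg * mh + (1.0 - pg - mg) * (1.0 - ph - mh)
    return score


# Toy probabilities for two hypothetical genes across two tumors.
p_plus = {"gene_a": [0.9, 0.1], "gene_b": [0.8, 0.0]}
p_minus = {"gene_a": [0.0, 0.8], "gene_b": [0.1, 0.9]}

print(pattern_probability((1, -1), p_plus["gene_a"], p_minus["gene_a"]))
print(agreement_score(p_plus, p_minus))
```

For many tumors these products become very small, so a practical version would probably work with sums of logarithms instead.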
Identifying Gene Groups
• Preselect proportions of over and under
expressed genes (e.g. 20% under, 5% over)
• Select genes consistent with proportions via
$P(e_{g1}, \ldots, e_{gT} \mid \omega)$
• Choose genes which are similar in expression
pattern to add to group via $q(G_0)$.
• Look at “mining” plot to identify genes
which are sensible (biologically).
5% underexpressed, 15% overexpressed, 4 sets
Molecular Profiles
Combining Across Platforms
• Example: Stanford, Harvard, Michigan
lung cancer datasets
• Publicly available
• Different platforms: Affymetrix, cDNA
glass slides
• POE rescales to probability metric
• With some caveats, can combine data
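Once every study is expressed on the probability scale, pooling becomes a matter of lining up genes. The sketch below shows one simple way this might be done, assuming each study has already been converted to per-sample (p_plus, p_minus) pairs and that genes are matched by a shared identifier. The study names, gene names, and numbers are invented for illustration, and the caveats the talk mentions are not addressed here.

```python
def combine_studies(studies):
    """Pool samples across studies, keeping only genes measured in all of them.

    studies: dict of study name -> dict of gene -> list of (p_plus, p_minus),
             one pair per sample in that study.
    Returns (combined, origin), where origin records each pooled sample's study.
    """
    shared = set.intersection(*(set(genes) for genes in studies.values()))
    combined = {gene: [] for gene in sorted(shared)}
    origin = []
    for name in sorted(studies):
        n_samples = len(next(iter(studies[name].values())))
        origin.extend([name] * n_samples)
        for gene in combined:
            combined[gene].extend(studies[name][gene])
    return combined, origin


# Made-up example: two studies on different platforms sharing one gene.
studies = {
    "study_a": {"gene_1": [(0.9, 0.0), (0.1, 0.2)], "gene_2": [(0.0, 0.7), (0.3, 0.0)]},
    "study_b": {"gene_1": [(0.5, 0.1)], "gene_3": [(0.2, 0.2)]},
}
combined, origin = combine_studies(studies)
print(combined)
print(origin)
```

Keeping the `origin` labels makes it possible to check afterwards whether samples group by study rather than by biology.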
• Statistics: G. Parmigiani, E. Garrett
• Arrays, Biology: E. Gabrielson, R. Anbazhagan
• http://astor.som.jhmi.edu/poe
• G. Parmigiani, E. Garrett, R. Anbazhagan, E. Gabrielson.
A statistical framework for expression-based molecular
classification in cancer. JRSS, in press.
• E. Garrett, G. Parmigiani. POE: Statistical Methods for
Qualitative Analysis of Gene Expression. In The Analysis
of Gene Expression Data: Methods and Software (eds. G
Parmigiani, E. Garrett, R. Irizarry, S. Zeger). To appear
2003.