A Statistical Framework for Expression-Based Molecular Classification
Elizabeth Garrett
Sidney Kimmel Cancer Center
Johns Hopkins University
Molecular Classification of Cancer
• Goals
– Short term:
• To use gene expression array data to identify and
hypothesize subtypes of cancer
• To discover new cancer classes that are interpretable
and amenable to further biological analysis
• To translate classes into clinical tools
– Long term:
• To eventually refine individualized prognosis and
therapy
Outline of Talk
• Molecular Classifications
– the role of statistics in molecular classification
– defining a molecular profile
• Modeling latent classes: POE (Probability
of Expression)
– Bayesian mixture models
– visualization tools
• “Mining” using latent classes
• Using POE to combine across platforms
Botstein-Brown style of visualizing gene expression data (Garber et al., PNAS 2001)
The fine print
Motivating Datasets
• Unclassified cancer samples: Are the gene expression
patterns informative about subclasses?
– Ductal breast cancers
– Adenocarcinomas of the lung
– Diffuse large B-cell lymphoma
• Related tissues: Are subtypes associated with prognosis?
– Normal tissues and cancer tissues
– Outcome data (e.g. survival, recurrence, response)
• Genes: Are hypothesized genes associated with cancer
types?
– Functional information
– Custom array
General Approach of
POE (Probability of Expression)
• Define a reference expression value:
– “normal” vs. over expressed vs. under expressed
– unsupervised in nature
• Use scale-independent measures of expression
– allows combination of data across platforms
– incorporates measurement errors
• Choose molecular profile that predicts cancer class based
on a small number of genes
– yields clinical implications
– choose genes using combination of statistical and
biological evidence
• Caveat: NOT intended for gene clustering and not for manual
clustering of genes
Molecular Profiles
(based on 3 genes A, B, and C)
27 = 3^3 possible profiles
              Gene A   Gene B   Gene C
Profile 1       -1       -1       -1
Profile 2       -1       -1        0
Profile 3       -1       -1        1
Profile 4       -1        0       -1
Profile 5       -1        0        0
Profile 6       -1        0        1
….              ….       ….       ….
Profile 24       1        0        1
Profile 25       1        1       -1
Profile 26       1        1        0
Profile 27       1        1        1
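The profile table can be generated mechanically. The sketch below is illustrative only: the function name `enumerate_profiles` and the dictionary layout are assumptions, not part of the talk. It lists every assignment of -1, 0, or 1 to the genes in the same order as the table.

```python
from itertools import product

# Expression classes: -1 = under expressed, 0 = normal, 1 = over expressed.
LEVELS = (-1, 0, 1)


def enumerate_profiles(genes):
    """Return every assignment of a level to each gene, in table order."""
    return [dict(zip(genes, combo)) for combo in product(LEVELS, repeat=len(genes))]


profiles = enumerate_profiles(["A", "B", "C"])
print(len(profiles))   # number of profiles for 3 genes
print(profiles[23])    # Profile 24 in the table (list index starts at 0)
```

The count grows as 3^n in the number of genes n, which is one reason to base a profile on a small number of genes.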
Mixture of Normal and Two Uniform Distributions
Empirical Density of Expression Levels in One Gene Across 203 Lung Samples (Bhattacharjee, PNAS 2001)
Latent Expression Classes
• Notation:
– $e_{gt} = -1$: gene g has abnormally low expression in tumor t
– $e_{gt} = 0$: gene g has normal expression in tumor t
– $e_{gt} = 1$: gene g has abnormally high expression in tumor t
• Modeling observed gene expression, $a_{gt}$:
$a_{gt} \mid (e_{gt} = e) \sim f_{e,g}(\cdot), \quad e \in \{-1, 0, 1\}$
• For gene g, the proportions of differentially expressed
tumors in the population of unclassified tumors are
$\pi_g^{+} = P(e_{gt} = 1)$
$\pi_g^{-} = P(e_{gt} = -1)$
Probability Scale for Expression Data
$p_{gt}^{+} = P(e_{gt} = 1 \mid a_{gt}, \pi_g^{+}, \pi_g^{-}, f_{1,g}, f_{0,g}) = \dfrac{\pi_g^{+} f_{1,g}(a_{gt})}{\pi_g^{+} f_{1,g}(a_{gt}) + (1 - \pi_g^{+} - \pi_g^{-}) f_{0,g}(a_{gt})}$
Interpretation: The probability that gene g in tumor t is over
expressed given observed expression and the model parameters
$p_{gt}^{-} = P(e_{gt} = -1 \mid a_{gt}, \pi_g^{+}, \pi_g^{-}, f_{-1,g}, f_{0,g}) = \dfrac{\pi_g^{-} f_{-1,g}(a_{gt})}{\pi_g^{-} f_{-1,g}(a_{gt}) + (1 - \pi_g^{+} - \pi_g^{-}) f_{0,g}(a_{gt})}$
Interpretation: The probability that gene g in tumor t is under
expressed given observed expression and the model parameters
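As a rough illustration of the probability scale, the sketch below evaluates both probabilities for a single observed value. It assumes the normal/uniform mixture used in this talk, with a normal density for the "normal" class and uniform densities on either side of its center. The function names, argument names, and numeric values are hypothetical, and the parameters are treated as known rather than estimated.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def _ratio(numerator, other):
    """numerator / (numerator + other), returning 0 if both terms underflow to 0."""
    total = numerator + other
    return numerator / total if total > 0.0 else 0.0


def poe_probabilities(a, center, sd, kappa_minus, kappa_plus, pi_minus, pi_plus):
    """Return (p_minus, p_plus) for one observed expression value a.

    center is the mean of the normal class for this gene and sample;
    kappa_minus and kappa_plus are the widths of the two uniform components.
    """
    # Weighted density of the normal (not differentially expressed) class.
    w0 = (1.0 - pi_minus - pi_plus) * normal_pdf(a, center, sd)
    # The uniform components have disjoint supports below and above the center,
    # so each probability only involves its own uniform term and the normal term.
    w_plus = pi_plus / kappa_plus if center < a <= center + kappa_plus else 0.0
    w_minus = pi_minus / kappa_minus if center - kappa_minus <= a < center else 0.0
    return _ratio(w_minus, w0), _ratio(w_plus, w0)


# Hypothetical gene: 5% of tumors under expressed, 15% over expressed.
params = dict(center=0.0, sd=1.0, kappa_minus=10.0, kappa_plus=10.0,
              pi_minus=0.05, pi_plus=0.15)
for value in (-3.0, 0.5, 3.0):
    print(value, poe_probabilities(value, **params))
```

Values far from the center get a probability near 1 on the matching side, and values near the center get probabilities near 0 on both sides.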
Distributional Assumptions
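Samples: Normal/Uniform mixture
$f_{-1,g}(\cdot) = U(-\kappa_g^{-} + \alpha_t + \mu_g,\ \alpha_t + \mu_g)$
$f_{0,g}(\cdot) = N(\alpha_t + \mu_g,\ \sigma_g)$
$f_{1,g}(\cdot) = U(\alpha_t + \mu_g,\ \alpha_t + \mu_g + \kappa_g^{+})$
Genes: Second stage model
$\mu_g \mid \theta_\mu, \tau_\mu \sim N(\theta_\mu, \tau_\mu)$
$\sigma_g^{-2} \mid \gamma, \lambda \sim G(\gamma, \lambda)$
$\kappa_g^{+} \mid \theta_\kappa^{+} \sim E(\theta_\kappa^{+})$
$\kappa_g^{-} \mid \theta_\kappa^{-} \sim E(\theta_\kappa^{-})$
$\mathrm{logit}(\pi_g^{+}) \mid \theta_\pi^{+}, \tau_\pi^{+} \sim N(\theta_\pi^{+}, \tau_\pi^{+})$
$\mathrm{logit}(\pi_g^{-}) \mid \theta_\pi^{-}, \tau_\pi^{-} \sim N(\theta_\pi^{-}, \tau_\pi^{-})$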
Original Scale
After Transformation
Harvard Lung Cancer Data (Bhattacharjee, PNAS, 2001)
MCMC Estimation Approach
• Relatively straightforward
• A couple comments:
– Data augmentation using unknown expression variables
egt. Sampling of ’s unconditional on e’s
[ |  ]
*
[e|| ,  ]
*
[ | , e]
*
– Starting conditions are critical. K-means clustering
(k=2 or 3) useful for picking starting centers and spread
– Constrain $\min(\kappa_g^{+}, \kappa_g^{-}) > k\sigma_g$
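One way to obtain starting centers and spread for a single gene is a plain one-dimensional k-means. The sketch below uses Lloyd's algorithm with quantile-based initialization. The function name, the initialization rule, and the toy data are assumptions, and the talk does not say which k-means variant was used.

```python
def kmeans_1d(values, k, n_iter=50):
    """Lloyd's algorithm in one dimension; returns (centers, spreads)."""
    data = sorted(values)
    # Start the centers at evenly spaced quantiles of the data.
    centers = [data[int((i + 0.5) * len(data) / k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # Range of each cluster as a crude measure of spread.
    spreads = [max(c) - min(c) if c else 0.0 for c in clusters]
    return centers, spreads


# Toy expression values for one gene: a low group, a central group, a high group.
toy = [-5.1, -4.8, -0.2, 0.0, 0.1, 0.3, -0.1, 4.9, 5.2, 5.0]
centers, spreads = kmeans_1d(toy, k=3)
print(centers)
print(spreads)
```

With k=3 the middle center is a natural starting value for the mean of the normal class, and the distances to the outer clusters suggest starting widths for the uniform components.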
Denoising Expression Data
E(gt | a gt ,  )   g   t  ( pgt  pgt )(a gt   g   t )
Provides “cleaner” version of the original
expression level data.
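The denoising step can be read as shrinkage toward the center of the normal class. The sketch below assumes the weight on the residual is the total probability of differential expression (over plus under), which is an interpretation of the slide's formula rather than a confirmed detail. The function and argument names are hypothetical.

```python
def denoise(a, center, p_minus, p_plus):
    """Shrink an observed value toward the normal-class center.

    center plays the role of the gene effect plus the sample effect.
    The residual (a - center) is kept in proportion to the probability that
    the gene is differentially expressed in this sample (assumed weight:
    p_plus + p_minus).
    """
    return center + (p_plus + p_minus) * (a - center)


# A value that is very likely over expressed is left almost unchanged.
print(denoise(3.0, center=0.0, p_minus=0.0, p_plus=0.9))
# A value that is very likely normal is pulled close to the center.
print(denoise(0.4, center=0.0, p_minus=0.0, p_plus=0.05))
```

Under this reading, fluctuations inside the normal class are flattened while clearly abnormal values survive.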
Mining for Genes
• Two quantities of interest in looking for and
grouping genes.
• Probability that gene g follows a specified pattern:
$P(e_{g1}, \ldots, e_{gT} \mid \omega) = \prod_t (p_{gt}^{+})^{I(e_{gt} = 1)} (p_{gt}^{-})^{I(e_{gt} = -1)} (1 - p_{gt}^{+} - p_{gt}^{-})^{I(e_{gt} = 0)}$
• Probability that all genes in set G0 have the same
pattern across samples
$q(G_0) = \prod_t \prod_{g, g' \in G_0} \left( p_{gt}^{+} p_{g't}^{+} + p_{gt}^{-} p_{g't}^{-} + (1 - p_{gt}^{+} - p_{gt}^{-})(1 - p_{g't}^{+} - p_{g't}^{-}) \right)$
Identifying Gene Groups
• Preselect proportions of over and under
expressed genes (e.g. 20% under, 5% over)
• Select genes consistent with proportions via
$P(e_{g1}, \ldots, e_{gT} \mid \omega)$
• Choose genes which are similar in expression
pattern to add to group via q(G0).
• Look at “mining” plot to identify genes
which are sensible (biologically).
5% underexpressed, 15% overexpressed, 4 sets
Molecular Profiles
Combining Across Platforms
• Example: Stanford, Harvard, Michigan
lung cancer datasets
• Publicly available
• Different platforms: Affymetrix, cDNA
glass slides
• POE rescales to probability metric
• With some caveats, can combine data
• Statistics: G. Parmigiani, E. Garrett
• Arrays, Biology: E. Gabrielson, R. Anbazhagan
• http://astor.som.jhmi.edu/poe
• G. Parmigiani, E. Garrett, R. Anbazhagan, E. Gabrielson.
A statistical framework for expression-based molecular
classification in cancer. JRSS, in press.
• E. Garrett, G. Parmigiani. POE: Statistical Methods for
Qualitative Analysis of Gene Expression. In The Analysis
of Gene Expression Data: Methods and Software (eds. G
Parmigiani, E. Garrett, R. Irizarry, S. Zeger). To appear
2003.