Distinguishing clinical subgroups


Application of Class Discovery
and Class Prediction Methods
to Microarray Data
Kellie J. Archer, Ph.D.
Assistant Professor
Department of Biostatistics
Basis of Cancer Diagnosis
• Pathologist makes an interpretation
based upon a compendium of knowledge
which may include
– Morphological appearance of the tumor
– Histochemistry
– Immunophenotyping
– Cytogenetic analysis
– etc.
Clinically Distinct DLBCL Subgroups
Improved Cancer Diagnosis:
Identify sub-classes
• Divide morphologically similar tumors into
different groups based on response.
• Application of microarrays: Characterize
molecular variations among tumors by
monitoring gene expression
• Goal: microarrays will lead to more reliable
tumor classification and sub-classification
(therefore, more appropriate treatments will
be administered resulting in improved
outcomes)
Distinguishing two types of acute
leukemia (AML vs. ALL)
• Golub, T.R. et al 1999. Molecular
classification of cancer: class discovery
and class prediction by gene expression
monitoring. Science 286: 531-537.
• http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi
(near bottom of page)
Distinguishing AML vs. ALL
• 38 BM samples (27 childhood ALL, 11 adult
AML) were hybridized to Affymetrix
GeneChips
– GeneChip included 6,817 human genes.
– Affymetrix MAS 4.0 software was used to
perform image analysis.
– MAS 4.0 Average Difference expression summary
method was applied to the probe level data to
obtain probe set expression summaries.
– Scaling factor was used to normalize the
GeneChips.
– Samples were required to meet quality control
criteria.
Distinguishing AML vs. ALL
• Class comparison
– Neighborhood analysis
• Class prediction
– Weighted voting
Class Discovery:
Distinguishing AML vs. ALL
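• The mean of a random variable X is a
measure of central location of the
density of X; it is estimated by the sample mean, $\hat{\mu} = \bar{X}$.
• The variance of a random variable is a
measure of spread or dispersion of the
density of X.
• $\mathrm{Var}(X) = E[(X - \mu)^2]$, estimated from a sample by $\sum_i (X_i - \bar{X})^2 / (n - 1)$
• Standard deviation $= \sqrt{\mathrm{Var}(X)} = \sigma(X)$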
Class Discovery:
Distinguishing AML vs. ALL
• For each gene, compute the log of the
expression values. For a given gene g,
For ALL
Let $\mu_1(g)$ represent the mean log expression value;
Let $\sigma_1(g)$ represent the stdev log expression value.
For AML
Let $\mu_2(g)$ represent the mean log expression value;
Let $\sigma_2(g)$ represent the stdev log expression value.
Class Discovery:
Distinguishing AML vs. ALL
Illustration using
ALL AML example.xls
Class Discovery:
Distinguishing AML vs. ALL
• For each gene, compute a relative class
separation (quasi-correlation measure) as
follows
$$P(g,c) = \frac{\mu_1(g) - \mu_2(g)}{\sigma_1(g) + \sigma_2(g)}$$
• Define neighborhoods of radius r about
classes 1 and 2 such that P(g,c) > r or
P(g,c) < -r. r was chosen to be 0.3
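The separation measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which log base or which standard deviation estimator was used, so base-10 logs and the sample (n-1) standard deviation are assumed here, and the function name `class_separation` is hypothetical. The sixteen values are the single-gene ALL and AML expression values used in the worked examples in these slides.

```python
import math
import statistics

def class_separation(all_values, aml_values):
    """Return P(g,c) = (mu1 - mu2) / (sigma1 + sigma2) on log expression values.

    Assumptions: base-10 logs and sample (n-1) standard deviations.
    """
    log_all = [math.log10(x) for x in all_values]
    log_aml = [math.log10(x) for x in aml_values]
    mu1, mu2 = statistics.mean(log_all), statistics.mean(log_aml)
    sigma1, sigma2 = statistics.stdev(log_all), statistics.stdev(log_aml)
    return (mu1 - mu2) / (sigma1 + sigma2)

# Expression values for one gene (8 ALL chips, 8 AML chips).
all_expr = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7]
aml_expr = [1974.6, 2027.6, 1914.8, 1955.8, 1963.0, 2025.5, 1865.1, 1922.4]

p_gc = class_separation(all_expr, aml_expr)
# A gene falls in a neighborhood of radius r = 0.3 when |P(g,c)| > r.
in_neighborhood = abs(p_gc) > 0.3
print(round(p_gc, 3), in_neighborhood)
```

Because the statistic is a ratio of a log difference to log-scale spreads, the choice of log base cancels out; swapping the two classes only flips the sign.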
Aside
• This differs from Pearson’s correlation and is
therefore not confined to the [-1,1] interval
$$\rho_{1,2} = \frac{\sum_{i=1}^{G} (g_{i1} - \bar{g}_{.1})(g_{i2} - \bar{g}_{.2})}{\sqrt{\sum_{i=1}^{G} (g_{i1} - \bar{g}_{.1})^2 \sum_{i=1}^{G} (g_{i2} - \bar{g}_{.2})^2}}$$
Aside
Illustration using
Correlation.xls
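To make the contrast with Pearson's correlation concrete, the sketch below computes both quantities for one made-up gene. The six log expression values are hypothetical, and correlating the gene with a 0/1 class indicator is an assumption chosen for illustration, not necessarily the comparison made in the spreadsheet. The function names are likewise hypothetical.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def separation(class1, class2):
    """Relative class separation (mu1 - mu2) / (sigma1 + sigma2)."""
    return (statistics.mean(class1) - statistics.mean(class2)) / (
        statistics.stdev(class1) + statistics.stdev(class2))

# Hypothetical log expression values for two well-separated classes.
class1 = [9.8, 10.0, 10.2]
class2 = [4.9, 5.0, 5.1]
indicator = [1, 1, 1, 0, 0, 0]  # 1 = class 1, 0 = class 2

rho = pearson(class1 + class2, indicator)
p_gc = separation(class1, class2)
print(rho, p_gc)
```

With tight, well-separated classes the correlation stays at or below 1, while the separation measure grows without bound as the within-class spreads shrink.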
Class Discovery:
Distinguishing AML vs. ALL
• A permutation test was used to
calculate whether the observed number
of genes in a neighborhood was
significantly higher than expected.
Permutation based methods
• Permutation based adjusted p-values
– Under the complete null, the joint
distribution of the test statistics can be
estimated by permuting the columns of the
gene expression matrix
– Permuting entire columns creates a
situation in which membership to the Class
1 and Class 2 groups is independent of gene
expression but preserves the dependence
structure between genes
Permutation based methods
Example Permutations (gene g on chip i; columns 1, 2, 3, …, B are permuted diagnosis labels)
Chip  Expression  Diagnosis  1    2    3    …  B
1     2013.7      ALL        AML  ALL  ALL  …  AML
2     2141.9      ALL        ALL  AML  ALL  …  AML
3     2040.2      ALL        ALL  ALL  AML  …  ALL
4     1973.3      ALL        ALL  ALL  ALL  …  ALL
5     2162.2      ALL        ALL  ALL  ALL  …  AML
6     1994.8      ALL        ALL  ALL  ALL  …  ALL
7     1913.3      ALL        ALL  ALL  ALL  …  AML
8     2068.7      ALL        ALL  ALL  ALL  …  ALL
9     1974.6      AML        ALL  ALL  ALL  …  AML
10    2027.6      AML        AML  AML  AML  …  AML
11    1914.8      AML        AML  AML  AML  …  AML
12    1955.8      AML        AML  AML  AML  …  ALL
13    1963        AML        AML  AML  AML  …  ALL
14    2025.5      AML        AML  AML  AML  …  ALL
15    1865.1      AML        AML  AML  AML  …  AML
16    1922.4      AML        AML  AML  AML  …  ALL
Permutation based methods
• Permutation algorithm for the bth
permutation, b=1,…,B
– 1) Permute the n class labels (columns) of the data matrix X
– 2) Compute the relative class separation
$P(g_1,c)_b, \ldots, P(g_p,c)_b$ for each gene $g_i$.
• The permutation distribution of the
relative class separation P(g,c) for gene
$g_i$, i=1,…,p, is given by the empirical
distribution of $P(g_i,c)_1, \ldots, P(g_i,c)_B$.
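The permutation steps above can be sketched for a single gene as follows. This is a simplification under stated assumptions: the slides describe permuting whole columns of the expression matrix and counting genes per neighborhood, whereas this sketch shuffles the labels for one gene and reports the fraction of permutations whose separation is at least as extreme as the observed one. The function names, the base-10 log, the fixed seed, and B = 2000 are all illustrative choices.

```python
import math
import random
import statistics

def class_separation(class1, class2):
    """(mu1 - mu2) / (sigma1 + sigma2), using sample standard deviations."""
    return (statistics.mean(class1) - statistics.mean(class2)) / (
        statistics.stdev(class1) + statistics.stdev(class2))

def permutation_distribution(values, labels, n_perm, seed=0):
    """Return the separation statistic under n_perm random label shuffles."""
    rng = random.Random(seed)
    shuffled = list(labels)
    stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # step 1: permute the class labels
        class1 = [v for v, lab in zip(values, shuffled) if lab == "ALL"]
        class2 = [v for v, lab in zip(values, shuffled) if lab == "AML"]
        stats.append(class_separation(class1, class2))  # step 2
    return stats

# One gene on 16 chips: the first 8 are ALL, the last 8 are AML.
expression = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7,
              1974.6, 2027.6, 1914.8, 1955.8, 1963.0, 2025.5, 1865.1, 1922.4]
labels = ["ALL"] * 8 + ["AML"] * 8
log_expr = [math.log10(x) for x in expression]

observed = class_separation(log_expr[:8], log_expr[8:])
B = 2000
perm_stats = permutation_distribution(log_expr, labels, B)
# Fraction of permutations at least as extreme as the observed statistic.
p_value = sum(abs(s) >= abs(observed) for s in perm_stats) / B
print(round(observed, 3), p_value)
```

Shuffling only the labels keeps the expression values fixed, which is what makes class membership independent of expression under the null.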
Distinguishing AML vs. ALL
• Class comparisons using neighborhood
analysis revealed that approximately 1,100
genes were more highly correlated with class
(AML or ALL) than would be expected by
chance.
Class Prediction:
Distinguishing AML vs. ALL
• For a set of informative genes, each expression value $x_i$
votes for either ALL or AML, depending on whether its
expression value is closer to $\mu_{ALL}$ or $\mu_{AML}$
$$v_i = x_i - \frac{\mu_{AML} + \mu_{ALL}}{2}$$
– Let $\mu_{ALL}$ represent the mean expression value for ALL
– Let $\mu_{AML}$ represent the mean expression value for AML
• Informative genes were the n/2 genes with the largest
P(g,c) and the n/2 genes with the smallest P(g,c)
• Golub et al. chose n = 50
Class Prediction:
Distinguishing AML vs. ALL
• $w_i$ is a weighting factor that reflects how well
the gene is correlated with the class distinction;
$w_i v_i$ is the weighted vote
• For each sample, the weighted votes for each
class are summed to get $V_{ALL}$ and $V_{AML}$
• The sample is assigned to the class with the
higher total, provided the Prediction
Strength (PS) > 0.3, where
$$PS = \frac{V_{win} - V_{lose}}{V_{win} + V_{lose}}$$
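A minimal sketch of the weighted-voting rule, under stated assumptions: each gene is a hypothetical `(mu_all, mu_aml, w)` tuple, the vote goes to the class whose mean is closer to the sample's expression value, the size of the vote is taken as |w·v|, and a sample with PS at or below the threshold is reported as "uncertain". The three genes and the sample values are made up for illustration; `weighted_vote` is not a function from any library.

```python
def weighted_vote(sample, genes, ps_threshold=0.3):
    """Classify one sample by weighted voting.

    sample: expression values, one per informative gene.
    genes:  (mu_all, mu_aml, w) tuples, with w = P(g,c) for that gene.
    Returns (label, prediction_strength, vote_totals).
    """
    totals = {"ALL": 0.0, "AML": 0.0}
    for x, (mu_all, mu_aml, w) in zip(sample, genes):
        v = x - (mu_aml + mu_all) / 2  # distance from the class midpoint
        # The vote goes to the class whose mean is closer to x.
        voted = "ALL" if abs(x - mu_all) < abs(x - mu_aml) else "AML"
        totals[voted] += abs(w * v)
    winner = max(totals, key=totals.get)
    v_win, v_lose = totals[winner], min(totals.values())
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total > 0 else 0.0
    label = winner if ps > ps_threshold else "uncertain"
    return label, ps, totals

# Three hypothetical informative genes and one hypothetical sample.
genes = [(3.0, 2.0, 1.2), (2.5, 3.5, -0.9), (3.2, 2.8, 0.5)]
sample = [2.9, 2.6, 2.85]

label, ps, totals = weighted_vote(sample, genes)
print(label, round(ps, 3), totals)
```

Here two genes vote ALL and one votes AML, so the prediction strength is well above 0.3 and the sample is called ALL; lowering the margin between the totals would push the call to "uncertain".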
Class Prediction:
Distinguishing AML vs. ALL
Worksheet layout: two tables of 25 informative genes each (Gene g = 1, 2, 3, 4, 5, …, 25), with columns w = P(g,c), v, and w*v
– Genes with P(g,c) > 0.3: Sum(w*v) = $V_{ALL}$
– Genes with P(g,c) < -0.3: Sum(w*v) = $V_{AML}$
Class Prediction:
Distinguishing AML vs. ALL
• Checking model adequacy
– Cross-validation of training dataset
– Applied model to an independent dataset of
34 samples
Class Discovery
• Determine whether the samples can be
divided based only on gene expression
without regard to the class labels
– Self-organizing maps
Hypothesis Testing
• The hypothesis that two means $\mu_1$ and $\mu_2$
are equal is called a null hypothesis,
commonly abbreviated H0.
• This is typically written as H0: $\mu_1 = \mu_2$
• Its antithesis is the alternative
hypothesis, HA: $\mu_1 \neq \mu_2$
Hypothesis Testing
• A statistical test of hypothesis is a
procedure for assessing the
compatibility of the data with the null
hypothesis.
– The data are considered compatible with H0
if any discrepancy from H0 could readily be
due to chance (i.e., sampling error).
– Data judged to be incompatible with H0 are
taken as evidence in favor of HA.
Hypothesis Testing
• If the sample means calculated are
identical, we would suspect the null
hypothesis is true.
• Even if the null hypothesis is true, we
do not really expect the sample means
to be identically equal because of
sampling variability.
• We would feel comfortable concluding
H0 is true if the difference in
the sample means does not exceed a
couple of standard errors.
T-test
• In testing H0: $\mu_1 = \mu_2$ against HA: $\mu_1 \neq \mu_2$ note that
we could have restated the null hypothesis as
H0: $\mu_1 - \mu_2 = 0$ and HA: $\mu_1 - \mu_2 \neq 0$
• To carry out the t-test, the first step is to compute the
test statistic and then compare the result to a t-distribution with the appropriate degrees of freedom (df)
$$t_g = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{SE_{\bar{y}_1 - \bar{y}_2}} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
$$df = \frac{\left(SE_1^2 + SE_2^2\right)^2}{\frac{SE_1^4}{n_1 - 1} + \frac{SE_2^4}{n_2 - 1}}, \qquad SE_k^2 = \frac{s_k^2}{n_k}$$
T-test
• Data must be independent random samples
from their respective populations
• Sample size should either be large or, in the
case of small sample sizes, the population
distributions must be approximately normally
distributed.
• When assumptions are not met, nonparametric alternatives are available
(Wilcoxon Rank Sum/Mann-Whitney Test)
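As a rough illustration of the nonparametric alternative, the sketch below computes the Mann-Whitney U statistics by direct pair counting for the single-gene ALL and AML values used in these slides. It stops at the statistic and does not compute a p-value; the function name `mann_whitney_u` is hypothetical, and counting ties as one half is the usual convention assumed here.

```python
def mann_whitney_u(sample1, sample2):
    """Return (U1, U2): U1 counts pairs with sample1 > sample2, ties as 0.5."""
    u1 = 0.0
    for a in sample1:
        for b in sample2:
            if a > b:
                u1 += 1.0
            elif a == b:
                u1 += 0.5
    u2 = len(sample1) * len(sample2) - u1
    return u1, u2

all_expr = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7]
aml_expr = [1974.6, 2027.6, 1914.8, 1955.8, 1963.0, 2025.5, 1865.1, 1922.4]

u_all, u_aml = mann_whitney_u(all_expr, aml_expr)
print(u_all, u_aml)
```

The test uses only the ordering of the values, so it does not rely on the normality assumption. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used to obtain the p-value as well.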
T-test: Probe set 208680_at
Sample number   ALL        AML
1               2013.7     1974.6
2               2141.9     2027.6
3               2040.2     1914.8
4               1973.3     1955.8
5               2162.2     1963.0
6               1994.8     2025.5
7               1913.3     1865.1
8               2068.7     1922.4
ȳ (mean)        2038.5     1956.1
s² (variance)   7051.284   3062.991
n               8          8
T-test: Probe set 208680_at
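$$t_g = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{SE_{\bar{y}_1 - \bar{y}_2}} = \frac{2038.5 - 1956.1 - 0}{\sqrt{\frac{7051.3}{8} + \frac{3062.99}{8}}} = 2.317$$
$$df = \frac{\left(\frac{7051.3}{8} + \frac{3062.99}{8}\right)^2}{\frac{(7051.3/8)^2}{8 - 1} + \frac{(3062.99/8)^2}{8 - 1}} = 12.116$$
P=0.039
[Figure: t-distribution density; x-axis t from -3 to 3, y-axis Probability from 0.0 to 0.4]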
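The worked example for probe set 208680_at can be checked with a short script. This is a sketch of the unequal-variance (Welch) form of the t statistic and its approximate degrees of freedom as written on the slides; the function name `welch_t` is hypothetical, and the p-value is left out because the Python standard library has no t-distribution CDF.

```python
import math
import statistics

def welch_t(y1, y2):
    """Return (t, df) for the two-sample t-test with unequal variances."""
    n1, n2 = len(y1), len(y2)
    se1_sq = statistics.variance(y1) / n1  # s1^2 / n1
    se2_sq = statistics.variance(y2) / n2  # s2^2 / n2
    t = (statistics.mean(y1) - statistics.mean(y2)) / math.sqrt(se1_sq + se2_sq)
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
    return t, df

all_expr = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7]
aml_expr = [1974.6, 2027.6, 1914.8, 1955.8, 1963.0, 2025.5, 1865.1, 1922.4]

t_g, df = welch_t(all_expr, aml_expr)
print(round(t_g, 3), round(df, 3))
```

The result should agree with the slide values up to rounding of the means and variances. For the two-sided p-value, a routine such as `scipy.stats.ttest_ind(all_expr, aml_expr, equal_var=False)` could be used instead of a hand-rolled CDF.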