Gene pair selection - Department of Applied Mathematics & Statistics

Download Report

Transcript Gene pair selection - Department of Applied Mathematics & Statistics

Classifying Gene Expression
Profiles from Pairwise
mRNA Comparisons
Yan Qi
Biomedical Engineering Department
10/3/2006
Outline
The problem of Molecular
classification
 The TSP classifier
 Results on three Cancer datasets
 Extensions of the TSP classifier

Molecular Classification


Objective: predict class labels, e.g. cancer
subtypes, disease states using gene
expression profiles
Data: G x n matrix, G is number of genes, n
is number of samples, each column is a gene
expression profile
Mathematical formulation




Gene expression profile: X = ( X1, X2, …, XG)
Binary Class label: Y = 1 or Y = 2
Classifier: A mapping f from X to Y
Training dataset:



A = G x n matrix, n = n1+n2
Y: G x 1 vector where n1 entries are 1, n2 entries
are 2
Learning: find a mapping from A to f that
minimizes generalization error e( f )  P( f ( X )  Y )
Challenges to standard
learning methods

Statistical dilemma: n << G



Examples
Consequence: Over-fitting hence poor
generalizability
Practical issue: complex f and DB


Example: ANN, SVM, random forests
Consequence: results are hard to interpret
biologically, inefficient in diagnostic settings
The TSP classifier:
Motivation and strategy



Rank-based scoring exploits the expression
levels of genes relative to each other and
obtains invariance to normalization.
Reduce model complexity by making the
classifier parameter free
Select informative gene pairs by proper
LOOCV and construct intuitive and
biologically interpretable classification rule by
voting.
TSP classifier:
rank-based score



Idea: replace expression value by genes’
ranks within profiles.
Feature: Z ij  1{ X  X } ,1  i  j  G,
Score: ij  P( X i  X j | class 1)  P( X i  X j | class 2)
i

where
N ij1
n1
j

N ij2
n2
N ijk  # of samples in class k that satisfy X i <X j
Gene pair selection


TSP: (i , j )  arg max ( i , j )  ij
The number of TSP and sample size
*


*
A few when sample size is not too small ( >102 ).
Many when sample size is small ( < 102 ).
 Example: myocardial tissue gene expression profiles
 G = 22283; n1=12, n2=10;
 2460 statistically significant TSPs
Classification with one gene pair


Let (i* , j* ) be a unique TSP
Suppose
N i1* j*
n1

N i2* j*
n2
 1 if X i*  X j*
 TSP classifier: Yˆ  fTSP ( X )  
 2 if X i*  X j*

Error on training set: e  P( X i2  X 2j )  P( X i1  X 1j )
 P( X i2  X 2j )  1  P( X i1  X 1j )
 1   ij
An Example
Classification with multiple
gene pairs – majority vote



Seek a mapping from outputs of multiple single TSP
classifiers to a final prediction.
Let Z1  1{ X  X }; Z 2  1{ X  X }; Z3  1{ X  X }
Output from f ( Zi ) represent a vote from TSP i.
Final prediction = class that receives the majority vote
from { f ( Z1 ), f ( Z 2 ), f ( Z3 )}
i1
j1
i2
j2
i3
j3
Assume the features P( Zi  1| k ) are conditionally
independent given the class and equal, a Naïve Bayes
classifier is equivalent to using majority vote
Classification with multiple TSPs:
majority vote & Naïve Bayes classifier
3

Assume
Where
Let
P( Z1 , Z 2 , Z 3 | k )   P( Z i | k )
i 1
Z1  1{ X i1 X j1} ; Z 2  1{ X i 2  X j 2 }; Z3  1{ X i 3  X j 3 }
 P( Z i |1) 
P( Z1 , Z 2 , Z3 |1)
L  log
  log 

P( Z1 , Z 2 , Z3 | 2) i
P
(
Z
|
2)
i



Naïve Bayes Classifier: Yˆ  g ( X )  
1 if L>0
2 if L<0

Each TSP contribute equally if
P( Z i |1)
 p for i=1,2,3
P( Z i | 2)
Loop of cross-validation




Leave-one-out CV
Estimated accuracy: 1 - e/n where e is total
number of errors in cross-validation
Only the TSPs are determined by CV,
unbiased error estimate.
More complicated models, e.g. ANN and
Decision tree need to include both model
topology and parameters in CV loop, more
likely to be biased.
Three bench-mark cancer
datasets



Determine lymphnode status in breast tumor
samples ( West et al. 2001, G=7129, n = 49 )
Classify leukemia subtypes ( Golub et al
1999, G= 7129, n=72)
Distinguish prostate tumors from normal
samples ( Singh et al. 2002, G = 12600,
n=102)
Statistical significance of the score is
evaluated by permutation test
• Repeat e.g. 1000 times
• Keep feature matrix A
• Randomize class labels by keeping
n1 and n2 unchanged
• Get top score Zmax
• Get histogram of Zmax
p  P( Z max  Z real )
The top scoring gene pairs for
the three cancer studies
What does TSP represent?

ALL
AML
|ALL-AML|
X1
1349
1063
286
X2
1396
1057
339
X3
382
3258
2876
X4
868
162
706
X5
174
861
687
X2-X3
1014
-2201
3215
X1-X3
967
-2195
3162
X4-X5
684
-699
1383
Change weak predictors into strong predictors?


Change reference e.g. gene 3 as reference for gene 1 and gene 2
Combine two markers e.g. x4 and X5 might be individual markers, one for each
class, x4|-|X5 ?
Performance of TSP classifier
compared with previous studies
• Breast cancer: k-nearest neighbors (8-26); DLDA (8-19); DQDA (11-26)
logitboost (9-21); random forests (6-20); SVM (7-29)
• Leukemia: correlation analysis + weighted vote (85% on test set and 95%
on CV set.
• Prostate cancer: k-nearest neighbour to genes chosen by t-statistic
Discussion and extensions




Multiple class classification
Unique TSP: when there are >>1 TSPs, which is the
most informative?
kTSP: there might be many pairs of genes with
informative ordering, combine this information for
more accurate classification?
Normalization invariance: a suitable method to
integrate heterogeneous microarray datasets where
experimental conditions and normalization schemes
differ.