
An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification
by Carlotta Domeniconi and Hong Chai
Outline
• Introduction to microarray data
• Problem description
• Related work
• Our methods
• Experimental analysis
• Results
• Conclusions and future work
Microarray
• Measures gene expression levels across
different conditions, times or tissue samples
• Gene expression levels inform cell activity
and disease status
• Microarray data can distinguish between tumor
types, define new subtypes, predict
prognostic outcomes, identify candidate drugs,
assess drug toxicity, etc.
Microarray Data
• A matrix of measurements: rows are gene expression
levels; columns are samples/conditions.
Example – Lymphoma Dataset
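As a concrete illustration of this layout, here is a minimal Python sketch (the numbers are made up, not actual Lymphoma values):

```python
import numpy as np

# Slide orientation: rows = genes, columns = samples/conditions.
expr = np.array([
    [2.1, 1.9, 0.3, 0.2],   # gene 1 measured across 4 samples
    [0.1, 0.2, 1.8, 2.0],   # gene 2
    [1.0, 1.1, 0.9, 1.2],   # gene 3 (roughly constant: uninformative)
])
X = expr.T                   # most ML libraries expect samples x features
print(X.shape)               # (4 samples, 3 genes)
```

The later sketches all assume this transposed samples x genes orientation for X.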
Microarray data analysis
• Clustering is applied to genes, to identify
genes with similar functions or genes that participate
in similar biological processes, or to
samples, to find potential tumor subclasses.
• Classification builds a model that predicts the
class of new samples; it has diagnostic value.
Classification Problem
• Large number of genes (features) - may
contain up to 20,000 features.
• Small number of experiments (samples) -
at most a few hundred, and usually fewer than 100.
• The need to identify "marker genes" that classify
tissue types (e.g., to diagnose cancer) motivates feature selection.
Our Focus
• Binary classification and feature selection
methods have been studied extensively; the multi-class
case has received little attention.
• In practice, many microarray datasets have
more than two categories of samples.
• We focus on multi-class gene ranking and
selection.
Related Work
Some criteria used in feature ranking:
• Correlation coefficient
• Information gain
• Chi-squared
• SVM-RFE
Notation
• Given C classes
• m observations (samples or patients)
• n feature measurements (gene expressions): $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$
• class labels $y \in \{1, \ldots, C\}$
Correlation Coefficient
• Two-class problem: $y \in \{-1, +1\}$
• Ranking criterion defined in Golub:
$$w_j = \frac{\mu_j^{+} - \mu_j^{-}}{\sigma_j^{+} + \sigma_j^{-}}$$
• where $\mu_j^{\pm}$ and $\sigma_j^{\pm}$ are the mean and standard deviation
along dimension j in the + and - classes; a large
$|w_j|$ indicates a discriminant feature
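A minimal sketch of this score in Python (variable names are ours; X is a samples x genes matrix as above):

```python
import numpy as np

def golub_score(X, y):
    """Golub correlation w_j = (mu_j+ - mu_j-) / (sigma_j+ + sigma_j-),
    computed per gene for a two-class problem with labels in {-1, +1}."""
    pos, neg = X[y == +1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

# Toy usage on synthetic data (not from the paper): rank 5 genes, 6 samples
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
y = np.array([+1, +1, +1, -1, -1, -1])
w = golub_score(X, y)
print(np.argsort(-np.abs(w)))   # gene indices, most discriminant first
```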
Fisher's Score
• Fisher's criterion score in Pavlidis:
$$w_j = \frac{(\mu_j^{+} - \mu_j^{-})^2}{(\sigma_j^{+})^2 + (\sigma_j^{-})^2}$$
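The corresponding sketch (same conventions and caveats as the Golub score above):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher criterion w_j = (mu_j+ - mu_j-)^2 / ((sigma_j+)^2 + (sigma_j-)^2)
    per gene, for two-class labels y in {-1, +1}."""
    pos, neg = X[y == +1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    return num / (pos.var(axis=0) + neg.var(axis=0))
```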
Assumptions of the above methods
• Features are analyzed in isolation; correlations are
not considered.
• Assumption: features are independent of each other.
• Implication: redundant genes are selected into the
top subset.
Information Gain
• A measure of the effectiveness of a feature in classifying the
training data.
• Expected reduction in entropy caused by partitioning the data
according to this feature.
$$I(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} \, E(S_v)$$
• where V(A) is the set of all possible values of feature A, and $S_v$ is
the subset of S for which feature A has value v
Information Gain
• E(S) is the entropy of the entire set S:
$$E(S) = -\sum_{i=1}^{C} \frac{|C_i|}{|S|} \log_2 \frac{|C_i|}{|S|}$$
• where $|C_i|$ is the number of training samples in
class $C_i$, and |S| is the cardinality of the entire set S.
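A minimal sketch of both quantities (our function names; the feature is assumed to be already discrete):

```python
import numpy as np

def entropy(labels):
    """E(S) = -sum_i (|C_i|/|S|) log2(|C_i|/|S|)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(values, labels):
    """I(S, A) = E(S) - sum_{v in V(A)} (|S_v|/|S|) E(S_v)."""
    gain = entropy(labels)
    for v in np.unique(values):
        mask = values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy usage: a 2-valued feature that separates class 0 from classes 1 and 2
values = np.array([0, 0, 0, 1, 1, 1])
labels = np.array([0, 0, 0, 1, 1, 2])
print(information_gain(values, labels))   # about 1.0 bit
```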
Chi-squared
• Measures features individually
• Continuous-valued features are discretized into
intervals
• Form a matrix A, where $A_{ij}$ is the number of
samples of class $C_i$ within the j-th interval
• Let $C_{Ij}$ be the number of samples in the j-th
interval
Chi-squared
• The expected frequency of $A_{ij}$ is
$$E_{ij} = C_{Ij} \, |C_i| \, / \, m$$
• The Chi-squared statistic of a feature is defined as
$$\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$$
• where I is the number of intervals. The larger the statistic,
the more informative the feature is.
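A minimal sketch for one already-discretized feature (our naming):

```python
import numpy as np

def chi_squared(intervals, labels):
    """Chi-squared statistic of one discretized feature.
    A[i, j] counts class-i samples in interval j; E_ij = C_Ij * |C_i| / m."""
    _, ci = np.unique(labels, return_inverse=True)      # class index per sample
    _, bj = np.unique(intervals, return_inverse=True)   # interval index per sample
    A = np.zeros((ci.max() + 1, bj.max() + 1))
    np.add.at(A, (ci, bj), 1)                           # build the contingency table
    E = np.outer(A.sum(axis=1), A.sum(axis=0)) / len(labels)
    return ((A - E) ** 2 / E).sum()
```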
SVM-RFE
• Recursive Feature Elimination using SVM
• In the linear SVM model on the full feature set,
$$\mathrm{sign}(w \cdot x + b),$$
w is a vector of weights (one per feature), x is an
input instance, and b a threshold.
• If $w_i = 0$, feature $X_i$ does not influence
classification and can be eliminated from the set
of features.
SVM-RFE
1. Train a linear SVM on the full feature set to obtain w.
2. Sort features in descending order of weight; a
percentage of the lowest-ranked features is eliminated.
3. A new linear SVM is built using the remaining set
of features. Repeat the process.
4. The best feature subset is chosen.
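A minimal sketch of this loop (our variable names; aggregating squared one-vs-one weights per feature in the multi-class case is our heuristic assumption, not stated in the slides):

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep=10, drop_frac=0.1):
    """SVM-RFE sketch: fit a linear SVM, drop the lowest-weight fraction of
    the surviving features, refit, and repeat until n_keep features remain."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        svm = SVC(kernel="linear").fit(X[:, remaining], y)
        scores = (svm.coef_ ** 2).sum(axis=0)       # per-feature weight magnitude
        n_drop = min(max(1, int(drop_frac * len(remaining))),
                     len(remaining) - n_keep)       # never drop below n_keep
        remaining = remaining[np.sort(np.argsort(scores)[n_drop:])]
    return remaining                                # surviving feature indices
```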
Other criteria
• The Brown-Forsythe, Cochran, and Welch test
statistics, used in Chen et al.
(extensions of the t-statistic used in the two-class
classification problem)
• PCA
(disadvantage: new dimensions are formed, and none of
the original features can be discarded, so marker
genes cannot be identified)
Our Ranking Methods
• BScatter
• MinMax
• bSum
• bMax
• bMin
• Comb
Notation
• For each class i and each feature j, we define the
mean value of feature j for class $C_i$:
$$\mu_{j,i} = \frac{1}{|C_i|} \sum_{x \in C_i} x_j$$
• Define the total mean along feature j:
$$\mu_j = \frac{1}{m} \sum_{x} x_j$$
Notation
• Define the between-class scatter along feature j:
$$B_j = \sum_{i=1}^{C} |C_i| \, (\mu_{j,i} - \mu_j)^2$$
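These quantities are straightforward to compute; a minimal sketch (our function name, assuming a samples x genes matrix X and integer labels y):

```python
import numpy as np

def class_stats(X, y):
    """Per-feature class means mu_{j,i}, total mean mu_j, and between-class
    scatter B_j = sum_i |C_i| (mu_{j,i} - mu_j)^2."""
    classes = np.unique(y)
    mu_ji = np.vstack([X[y == c].mean(axis=0) for c in classes])  # (C, n)
    mu_j = X.mean(axis=0)                                         # (n,)
    sizes = np.array([(y == c).sum() for c in classes])           # |C_i|
    B_j = (sizes[:, None] * (mu_ji - mu_j) ** 2).sum(axis=0)      # (n,)
    return mu_ji, mu_j, B_j
```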
Function 1: BScatter
• Fisher discriminant analysis for multiple classes under
feature independence assumption. It credits the largest score
to the feature that maximizes the ratio of the between-class
scatter to the within-class scatter
$$\mathrm{BScatter}_j = \frac{B_j}{\sum_{i=1}^{C} \sigma_{j,i}}$$
• where $\sigma_{j,i}$ is the standard deviation of class i along feature j
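A self-contained sketch of Function 1 (our naming, same X/y conventions as above):

```python
import numpy as np

def bscatter(X, y):
    """BScatter_j = B_j / sum_i sigma_{j,i} for each feature j."""
    classes = np.unique(y)
    mu_j = X.mean(axis=0)
    B_j = sum((y == c).sum() * (X[y == c].mean(axis=0) - mu_j) ** 2
              for c in classes)                       # between-class scatter
    sigma_sum = sum(X[y == c].std(axis=0) for c in classes)
    return B_j / sigma_sum
```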
Function 2: MinMax
• Favors features along which the farthest mean-class
difference is large and the within-class variance is small
(a combined code sketch for Functions 2 through 6 follows Function 6):
$$\mathrm{MinMax}_j = \frac{\mu_{j,\max} - \mu_{j,\min}}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Function 3: bSum
• For each feature j, we sort the C values $\mu_{j,i}$ in
non-decreasing order: $\mu_{j,1} \le \mu_{j,2} \le \cdots \le \mu_{j,C}$
• Define $b_{j,l} = \mu_{j,l+1} - \mu_{j,l}$
• bSum rewards features with large distances
between adjacent mean class values:
$$\mathrm{bSum}_j = \frac{\sum_{l=1}^{C-1} b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Function 4: bMax
• Rewards features j with a large maximum
between-neighbor-class mean difference:
$$\mathrm{bMax}_j = \frac{\max_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Function 5: bMin
• Favors features whose smallest between-neighbor-class
mean difference is large:
$$\mathrm{bMin}_j = \frac{\min_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Function 6: Comb
• Considers a score function which combines
MinMax and bMin
$$\mathrm{Comb}_j = \frac{\min_l (b_{j,l}) \, (\mu_{j,\max} - \mu_{j,\min})}{\sum_{i=1}^{C} \sigma_{j,i}}$$
Datasets
Dataset    Samples  Genes  Classes
MLL        72       12582  3
Lymphoma   88       4026   6
Yeast      80       5775   3
NCI60      61       1155   8

• MLL: available at http://research.nhgri.nih.gov/microarray/Supplement
• Lymphoma: samples per class are 46 in DLBCL, 11 in CLL, 9 in FL
(malignant classes), 11 in ABB, 6 in RAT, and 6 in TCL (normal samples);
available at http://llmpp.nih.gov/lymphoma
• Yeast: available at http://rana.lbl.gov/
Experiment Design
• Gene expression values scaled to [-1, 1]
• Compared 9 feature selection methods
(the 6 proposed scores, Chi-squared, Information Gain, and
SVM-RFE)
• Obtained subsets of top-ranked genes to train SVM classifiers
(3 kernel functions: linear, 2nd-degree polynomial, Gaussian;
soft-margin parameter in [1, 100]; Gaussian kernel parameter in [0.001, 2])
• Leave-one-out cross-validation due to small sample size
• One-vs-one multi-class classification implemented with
LIBSVM
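A minimal sketch of this evaluation loop (our naming; sklearn's SVC wraps LIBSVM and trains one-vs-one multi-class classifiers, matching the setup above):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loocv_accuracy(X, y, scores, n_genes, C=1.0, kernel="linear"):
    """Leave-one-out accuracy of an SVM trained on the top-ranked genes.
    As in the slides, genes are ranked once on the full data, which (as the
    authors note under Limitations) induces a selection bias."""
    top = np.argsort(-scores)[:n_genes]       # indices of the top-ranked genes
    hits = 0
    for train, test in LeaveOneOut().split(X):
        clf = SVC(C=C, kernel=kernel).fit(X[train][:, top], y[train])
        hits += int(clf.predict(X[test][:, top])[0] == y[test][0])
    return hits / len(y)
```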
Results – MLL Dataset
Results – Lymphoma Dataset
Conclusions
• SVM classification benefits from gene selection;
• Gene ranking with correlation coefficients gives
higher accuracy than SVM-RFE in low
dimensions on most datasets. The best-performing
correlation score varies from problem to problem;
• Although SVM-RFE shows excellent
performance in general, there is no clear winner.
The performance of feature selection methods
seems to be problem-dependent;
Conclusions
• For a given classification model, different gene selection
methods reach the best performance for different feature
set sizes;
• Very high accuracy was achieved on all the data sets
studied here. In many cases perfect accuracy (based on
leave-one-out error) was achieved;
• The NCI60 dataset [17] shows lower accuracy values. This
dataset has the largest number of classes (eight) and
smaller sample sizes per class. SVM-RFE handles this case
well, achieving 96.72% accuracy with 100 selected genes
and a linear kernel. The gap in accuracy between SVM-RFE
and the other gene ranking methods is highest for this
dataset (ca. 11.5%).
Limitations & Future Work
• The selection of features over the whole training
set induces a bias in the results. We will study
how to assess and correct this bias in future
experiments.
• We will take into consideration the correlation
between pairs of selected features; the ranking
methods will be modified so that correlations among
selected genes stay below a certain threshold.
• We will evaluate the top-ranked genes from our study
against marker genes identified in other studies.