Transcript 1 -against

Classifying Lymphoma
Dataset Using Multi-class
Support Vector Machines
INFS-795 Advanced Data Mining
Prof. Domeniconi
Presented by Hong Chai
1
Agenda
(1) Lymphoma Dataset Description
(2) Data Preprocessing
- Formatting
- Dealing with Missing Values
- Gene Selections
(3) Multi-class SVM Classification
- 1-against-all
- 1-against-1
(4) Tools
(5) References
2
Lymphoma Dataset
• Alizadeh et al.(2000), Distinct Types of Diffuse Large Bcell Lymphoma Identified by Gene Expression Profiling
• Publicly available at http://llmpp.nih.gov/lymphoma/
• In microarray data,
Expression profiling of
genes are measured in
rows
Samples are columns
3
Lymphoma Dataset
• 96 samples of lymphocytes (instances)
• 4026 human genes (features)
• 9 classes of lymphoma:
DLBCL, GCB, NIL, ABB, RAT, TCL, FL, RBB, CLL
• Glimpse of data
DLCL
-0042
DLCL0031
DLCL
-0036
CLL
-60
CLL68
FL-10
FL-11
GCB
NILIgM
CLL65
GENE2406X
0.75
0.12
0.28
0.58
0.37
-1.2
0.37
-0.6
0.37
0.12
GENE3689X
0.01
-0.4
-0.6
0.37
0.09
0.78
-0.8
2.45
-0.45
0.37
GENE3133X
1.43
-0.8
0.37
-0.5
0.37
1.15
0.37
-1.29
1.26
0.67
GENE1008X
0.05
0.64
0.37
0.12
0.63
0.35
0.12
0.37
-2.29
-0.8
4
Lymphoma Dataset
5
Goal
Task: classification
• Assign each patient sample to one of 9 categories, e.g. Diffuse
Large B-cell Lymphoma (DLBCL) or Chronic Lymphocytic Leukemia
(CLL).
• Microarray data classification: an alternative to current malignancies
classification that relies on morphological or clinical variables
Medical implications
• Precise categorization of cancers; more relevant diagnosis
• More accurate assignment of cases to high risk or low risk
categories
• more targeted therapies
• Improved predictability of outcome.
6
Data Preprocessing
Missing Values Imputation
•
•
•
•
•
3% of gene expression profiles data are missing
1980 of the 4026 genes have missing values
49.1% of genes (features) involved
Some of these genes may be highly informative for
classification
Need to deal with missing values before applying to
SVM
7
Missing Value Approaches
• Instance or feature deletion
- if dataset large enough & does not distort distribution
• Replace with a randomly drawn observed value
- proved to work well (http://globain/cse/psu.edu/courses/spring2003/3-norm-val.pdf)
• EM algorithm
• Global mode or mean substitution
- will distort distribution
• Local mode or mean substitution with KNN algorithm
(Prof. Domeniconi)
8
Local Mean Imputation (KNN)
1.
Partition the data set D into two sets.
• Let the first set, Dm, contain instances with missing value(s).
• The other set, Dc, contains instances with complete values.
2.
For each instance vector x  Dm
• Divide the vector into observed and missing parts as x = [xo; xm].
• Calculate the distance between xo and every instance y  Dc,
using only those features that are observed in x.
• From the K closest y’s (instances in Dc), calculate the mean of
the feature for which x has missing value(s). Make substitution
with this local mean.
(Note: for nominal features use mode. n/a in microarray data)
9
Data Preprocessing
Feature Selection: Motivations
- Number of features large, instances small
- Reduce dimensionality to overcome overfitting
- A small number of discriminant “marker” genes
may characterize different cancer classes
Example: Guyon et al. identified 2 genes that yield zero leaveone-out error in the leukemia dataset, 4 genes in the
colon cancer dataset that give 98% accuracy.
(Guyon et al. Gene Selection for Cancer Classification using SVM, 2002)
10
Feature Selection
Discriminant Score Ranking
• Which gene is more informative in the 2-class
case:
+
Gene 1
+
Gene 2
11
Separation Score
• Gene 1 more discriminant. Criteria:
- Large difference of μ+ and μ- Small variance within each class
• Score function
F(gj) = | (μj+ - μj-) / (σj+ + σj-) |
12
Separation Score
• In multi-class cases, rank genes that are
discriminant among multiple classes
C1
C2
Δ
C3
• A gene may functionally relates to several
cancer classes such as C1 and C2
13
Separation Score
• Proposing an adapted score function
For each gene j
Calculate μi in each class Ci
Sort μi in descending order
Find a cutoff point with largest diff(μi, μj)
 μexp-cutoff-left
σ+  σexp-cutoff-left
μ-  μexp-cutoff-right
σ-  σexp-cutoff-right
μ+
F(gj) = | (μj+ - μj-) / (σj+ + σj-) |
Rank genes by F(gj)
Select top genes via threshold
14
Separation Score
Disadvantage:
• Does not yield more compact gene sets;
still abundant
• Does not consider mutual information
between genes
15
Feature Selection
Recursive Feature Elimination/SVM
1. In the linear SVM model on the full feature set
Sign (w•x + b)
w is a vector of weights for each feature, x is an
input instance, and b a threshold.
If wi = 0, feature Xi does not influence classification
and can be eliminated from the set of features.
16
RFE/SVM
2. When w is computed for the full feature set,
sort features according in descending order of
weights. The lower half is eliminated.
3. A new linear SVM is built using the new set of
features. Repeat the process until the set of
predictors is non-divisible by two.
4. The best feature subset is chosen.
17
Feature Selection
• PCA comment: not common in microarray
data.
• Disadvantage: none of original inputs can
be discarded
• We want to retain a minimum subset of
informative genes to achieve best
classification performance.
18
Multi-class SVM
19
Multi-class SVM Approaches
1-against-all
• Each of the SVMs separates a single class from all
remaining classes (Cortes and Vapnik, 1995)
1-against-1
• Pair-wise. k(k-1)/2, k Y SVMs are trained. Each SVM
separates a pair of classes (Fridman, 1996)
Performance similar in some experiments (Nakajima, 2000)
Time complexity similar: k evaluation in 1-all, k-1 in 1-1
20
1 -against- All
• Or “one-against-rest”, a tree algorithm
• Decomposed to a collection of binary classifications
• k decision functions, one for each class
(wk)T • x+bk, k Y
• The kth classifier constructs a hyperplane between class
n and the k-1 other classes
Class of x = argmaxi{(wi)T • (x)+bi}
21
1 -against- 1
• k(k-1)/2 classifiers where each one is trained on data
from two classes
• For training data from ith and jth classes, run binary
classification
• Voting strategy: If
Sign(wij)T •  x+bij)
says x is in class i, then add 1 to class i. Else to class j.
• Assign x to class with largest vote (Max wins)
22
Kernels to Experiment
• Polynomial kernels
K(Xi, Xj)=(XiXj+1)^d
• Gaussian Kernels
K(Xi, Xj)=e^(-|| Xi - Xj ||/σ^2)
23
SVM Tools - Weka
Data Preprocessing
• To ARFF format
• Import file
24
SVM Tools - Weka
Feature Selection using SVM
• Select Attribute
• SVMAttributeEval
25
SVM Tools - Weka
Multi-class classifier
• Classify
• Meta
• MultiClassClassifier
(Handles multi-class
datasets with 2-class
classifiers)
26
SVM Tools - Weka
• Multi-class SVM
• Classify
• Functions
• SMO
(Weka’s SVM)
27
SVM Tools - Weka
• Multi-class SVM Options
• Method
1-against-1
1-against-all
• Kernel options
not found
28
Multi-class SVM Tools
Other Tools include
• SVMTorch (1-against-all)
• LibSVM (1-against-1)
• LightSVM
29
References
• Alizadeh et al. Distinct types of diffuse large B-cell lymphoma
identified by gene expression profiling, 1999
• Cristianini, An Introduction to Support Vector Machines, 2000
• Dor et al, Scoring Genes for Relevance, 2000
• Franc and Hlavac, Multi-class Support Vector Machines
• Furey et al. Support vector machine classification and validation of
cancer tissue samples using microarray expression data, 2000
• Guyon et al. Gene Selection for Cancer Classification using Support
Vector Machines, 2002
• Selikoff, The SVM-Tree Algorithm, A New Method for Handling Multiclass SVM, 2003
• Shipp et al. Diffuse Large B-cell lymphoma outcome prediction by
gene expression profiling and supervised machine learning, 2002
• Weston, Multi-class Support Vector Machines, Technical Report,
1998
30