Using SVM for Expression Micro-array Data Mining
Data Mining Final Project
Chong Shou
Apr. 17, 2007
Expression Micro-arrays
• Data:
  • An n × m data matrix
  • n: number of genes under investigation
  • m = 79: number of conditions under which gene expression levels are measured
• Expression level: expression under a given condition is compared to a reference expression level; the value is positive if the gene is up-regulated and negative if it is down-regulated
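A minimal sketch of a signed expression level. The slides do not say how the comparison to the reference is computed; a base-2 log ratio is assumed here because it is a common choice that is positive for up-regulation and negative for down-regulation. The intensities and the name `log_ratio` are hypothetical.

```python
import math


def log_ratio(measured, reference):
    """Return log2(measured / reference).

    Assumption: the expression level is a log ratio. The result is
    positive when the gene is up-regulated relative to the reference
    and negative when it is down-regulated.
    """
    if measured <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(measured / reference)


# Hypothetical intensities for one gene under three conditions.
reference = 100.0
measured = [400.0, 100.0, 25.0]
levels = [log_ratio(m, reference) for m in measured]
print(levels)  # [2.0, 0.0, -2.0]
```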
Expression Micro-arrays
• Genes that work together in a biological process are believed to have similar expression patterns
• Genes with related functions should therefore show similar patterns
• This information can be used to infer (predict) the functions of unknown genes
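One way to make "similar expression patterns" concrete is a correlation between two genes' expression profiles. The sketch below assumes Pearson correlation as the similarity measure, which the slides do not specify; the three profiles are made-up values.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (norm_x * norm_y)


# Hypothetical profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_a, larger amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]  # opposite trend

print(pearson(gene_a, gene_b))  # close to +1
print(pearson(gene_a, gene_c))  # close to -1
```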
Supervised Learning
• Biological knowledge gathered from experiments is used to assign selected genes to a set of classes; these labeled genes form the training set
• Training set: 2467 genes
• Test set: 3754 genes
GO (Gene Ontology)
• GO is used as the functional annotation for the selected genes
• Classes:
  • Respiration
  • TCA cycle (biochemical process that produces energy)
  • Histone (protein that helps package DNA)
  • Ribosome (protein complex that assembles amino acids into proteins)
  • Proteolysis (process that breaks down proteins)
  • Meiosis (process that produces reproductive cells)
Advantages of Using SVM
• Classical supervised learning method
• Able to use a variety of similarity functions (kernels)
• Able to handle data with very high dimensionality (here, 79 conditions per gene)
Details in SVM Parameters
• Four kernel functions
  • Linear
    • $K(X, Y) = \langle X, Y \rangle + 1$
  • Polynomial
    • $K(X, Y) = (\langle X, Y \rangle + 1)^2$
    • $K(X, Y) = (\langle X, Y \rangle + 1)^3$
  • Radial
    • $K(X, Y) = \exp(-\sigma \lVert X - Y \rVert^2)$
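The four kernels above can be sketched in plain Python. The function names and the example vectors are illustrative, and the default `sigma=1.0` is an assumption, since the slides give no value for σ.

```python
import math


def dot(x, y):
    """Inner product <X, Y> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=1):
    """(<X, Y> + 1)^degree; degree=1 is the linear kernel from the slide."""
    return (dot(x, y) + 1.0) ** degree


def radial_kernel(x, y, sigma=1.0):
    """exp(-sigma * ||X - Y||^2), using the slide's parameterization."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


# Hypothetical two-condition expression vectors.
x = [1.0, 2.0]
y = [3.0, -1.0]

print(polynomial_kernel(x, y, degree=1))  # linear
print(polynomial_kernel(x, y, degree=2))
print(polynomial_kernel(x, y, degree=3))
print(radial_kernel(x, y, sigma=0.1))
```

Treating the linear kernel as the degree-1 case keeps one function for three of the four kernels.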
Kernel Matrix Modification
• A kernel function is used to compute the kernel matrix for SVM training
  • $K_{ij} = K(X_i, X_j)$
• Problem
  • Positive samples: only a few genes belong to any one of the six classes
  • Negative samples: most genes have no specific classification
  • With so few positives, the SVM tends to treat them as noise and is prone to misclassify them
Kernel Matrix Modification (cont)
• Add to the diagonal of the kernel matrix a constant whose magnitude depends on the class of the data point, which controls the misclassifications
  • Positive samples: $K_{ii} \leftarrow K_{ii} + \lambda (n^{+}/N)$
  • Negative samples: $K_{ii} \leftarrow K_{ii} + \lambda (n^{-}/N)$
  • $\lambda = 0.1$
  • $n^{+}$: number of positive samples
  • $n^{-}$: number of negative samples
  • $N$: total number of samples
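A minimal sketch of the diagonal modification: build the kernel matrix, then add the class-dependent constant to each diagonal entry. The linear kernel, the four samples, the +1/-1 label encoding, and the function names are illustrative assumptions; only the update rule and λ = 0.1 come from the slide.

```python
def linear_kernel(x, y):
    """<X, Y> + 1, the linear kernel listed earlier in the slides."""
    return sum(a * b for a, b in zip(x, y)) + 1.0


def kernel_matrix(samples, kernel):
    """K[i][j] = kernel(samples[i], samples[j])."""
    n = len(samples)
    return [[kernel(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def add_class_dependent_diagonal(K, labels, lam=0.1):
    """Return a copy of K with lam * (n+/N) or lam * (n-/N) added to K[i][i].

    labels[i] is +1 for a positive sample and -1 for a negative one
    (an encoding assumed for this sketch).
    """
    n_total = len(labels)
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = n_total - n_pos
    modified = [row[:] for row in K]
    for i, label in enumerate(labels):
        fraction = n_pos / n_total if label == 1 else n_neg / n_total
        modified[i][i] += lam * fraction
    return modified


# Hypothetical data: one positive gene and three negative genes.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
labels = [1, -1, -1, -1]

K = kernel_matrix(samples, linear_kernel)
K_mod = add_class_dependent_diagonal(K, labels, lam=0.1)
print(K_mod[0][0] - K[0][0])  # about 0.1 * 1/4
print(K_mod[1][1] - K[1][1])  # about 0.1 * 3/4
```

Off-diagonal entries are left unchanged, so only each sample's self-similarity is adjusted.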
Other Supervised Learning Methods
• Classification tree
  • rpart
  • moc
• k-Nearest Neighbors
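For comparison with the SVM, a k-nearest-neighbors classifier can be sketched in a few lines. Euclidean distance, k = 3, majority voting, and the toy data are assumptions; the slides do not describe how k-NN was configured.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training samples by Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-condition expression profiles with class labels.
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = ["negative", "negative", "negative", "ribosome", "ribosome", "ribosome"]

print(knn_predict(train_x, train_y, [0.2, 0.1]))  # negative
print(knn_predict(train_x, train_y, [5.5, 5.2]))  # ribosome
```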
Model Evaluation
• Confusion matrix
  • FP, FN, TP, TN
• Cost function
  • $C_M = FP_M + 2 \cdot FN_M$
  • $M$: learning method
  • FN carries the larger weight
• Savings function
  • $S_M = C_N - C_M$
  • $C_N$: cost if all samples are labeled negative
  • Models with large $S_M$ are preferred
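The cost and savings functions can be worked through with a small example. The counts below are hypothetical. Labeling every sample negative gives FP = 0 and FN equal to the number of true positives, so the baseline cost is twice the number of positives.

```python
def cost(fp, fn):
    """Cost of a method: false positives plus twice the false negatives."""
    return fp + 2 * fn


def savings(fp, fn, n_positive):
    """Savings relative to labeling every sample negative.

    The all-negative baseline makes no false positives and misses
    every positive, so its cost is cost(0, n_positive).
    """
    return cost(0, n_positive) - cost(fp, fn)


# Hypothetical result for one class with 20 true positives:
# the method makes 3 false positives and misses 5 positives.
print(cost(fp=3, fn=5))                    # 13
print(savings(fp=3, fn=5, n_positive=20))  # 40 - 13 = 27
```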
Prediction
• Use the SVM on the test set to predict gene function
• Compare the classification results with GO
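One way to compare predictions with GO annotations for a single class is to tally the four confusion-matrix counts. This is a sketch under assumed conventions: the label vectors are made up, and 1 means the gene is annotated to (or predicted to be in) the class.

```python
def confusion_counts(predicted, actual):
    """Return (TP, FP, FN, TN) for binary labels, where 1 marks class membership."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    return tp, fp, fn, tn


# Hypothetical predictions and GO-derived labels for six test genes.
predicted = [1, 1, 0, 0, 1, 0]
go_labels = [1, 0, 0, 1, 1, 0]

print(confusion_counts(predicted, go_labels))  # (2, 1, 1, 2)
```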
Some Problems
• Large kernel matrix: 2467 × 2467 for the training set, which requires a long running time on a PC
• The “predict” function in R
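For a rough sense of scale of the 2467 × 2467 training matrix mentioned above, the arithmetic can be sketched as follows. The 8 bytes per entry (double precision) and the count of about one multiply-add per condition per kernel entry are assumptions, not measurements from the project.

```python
n_genes = 2467        # training-set size from the slides
n_conditions = 79     # features per gene from the slides
bytes_per_entry = 8   # assumption: double-precision floats

# Full matrix, and the unique entries if symmetry K[i][j] == K[j][i] is used.
entries = n_genes * n_genes
unique_entries = n_genes * (n_genes + 1) // 2

# Storage for the full matrix, in MiB.
size_mib = entries * bytes_per_entry / 2**20

# Rough work estimate: one inner product over all conditions per unique entry.
multiply_adds = unique_entries * n_conditions

print(entries, unique_entries, round(size_mib, 1), multiply_adds)
```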
Questions?