Using SVM for Expression
Micro-array Data Mining
—— Data Mining Final Project
Chong Shou
Apr.17, 2007
Expression Micro-arrays
Data:
An n × m data matrix
n: number of genes under investigation
m = 79: conditions under which gene expression levels are measured
Expression level: expression under a certain condition is compared to a reference expression level; positive if up-regulated, negative if down-regulated
Expression Micro-arrays
It is believed that genes working
together in certain biological
processes should have similar
expression patterns
We should be able to find similar expression patterns among genes with related functions
We can use this information to infer (predict) the functions of unknown genes
Supervised Learning
We use biological knowledge gathered from experiments to label selected genes with a set of classes as our training set
Training set: 2467 genes
Test set: 3754 genes
GO (Gene Ontology)
Use GO as functional annotation for
selected genes
Classes:
Respiration
TCA cycle (biochemical process that produces energy)
Histone (protein that helps package DNA)
Ribosome (protein complex that assembles amino acids into proteins)
Proteolysis (process that degrades proteins)
Meiosis (process that produces reproductive cells)
Advantages of Using SVM
A classical supervised learning method
Able to use a variety of distance
functions (kernels)
Able to handle data with extremely
high dimensions: 79 conditions
Details in SVM Parameters
Four kernel functions
Linear
K(X, Y) = <X, Y> + 1
Polynomial
K(X, Y) = (<X, Y> + 1)^2
K(X, Y) = (<X, Y> + 1)^3
Radial
K(X, Y) = exp(-σ ||X - Y||^2)
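The four kernels above can be sketched as plain functions; the value of σ for the radial kernel is not given in the slides, so it is left as a free parameter here:

```python
import math

def linear_kernel(x, y):
    # K(X, Y) = <X, Y> + 1
    return sum(a * b for a, b in zip(x, y)) + 1

def polynomial_kernel(x, y, degree):
    # K(X, Y) = (<X, Y> + 1)^degree; degree is 2 or 3 on the slide
    return (sum(a * b for a, b in zip(x, y)) + 1) ** degree

def radial_kernel(x, y, sigma):
    # K(X, Y) = exp(-sigma * ||X - Y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)
```

Each gene is a length-79 vector of expression values, so `x` and `y` here stand for two such expression profiles.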
Kernel Matrix Modification
Use the kernel functions to compute the kernel matrix for SVM training:
K(i, j) = K(X_i, X_j)
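Given any kernel function, the kernel matrix is filled entry by entry; a minimal sketch, shown here with the linear kernel:

```python
def linear_kernel(x, y):
    # K(X, Y) = <X, Y> + 1
    return sum(a * b for a, b in zip(x, y)) + 1

def kernel_matrix(samples, kernel):
    # K[i][j] = kernel(X_i, X_j) for every pair of samples
    n = len(samples)
    return [[kernel(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]
```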
Problem
Positive samples: only a few genes are assigned to the six classes
Negative samples: most genes have no specific classification
The few positive samples risk being treated as noise, making the classifier prone to misclassifying them
Kernel Matrix Modification (cont)
Add to the diagonal of the kernel matrix a constant whose magnitude depends on the class of the data point, thereby controlling misclassification
Positive samples: K_ii = K_ii + λ(n+/N)
Negative samples: K_ii = K_ii + λ(n-/N)
λ= 0.1
n+: number of positive samples
n-: number of negative samples
N: total number of samples
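The diagonal shift above can be applied in place; labels here are assumed to be +1/−1, and λ = 0.1 as on the slide:

```python
def modify_diagonal(K, labels, lam=0.1):
    # Add lam * (class fraction) to each diagonal entry:
    # K_ii += lam * n+/N for positive samples,
    # K_ii += lam * n-/N for negative samples
    N = len(labels)
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = N - n_pos
    for i, y in enumerate(labels):
        frac = n_pos / N if y > 0 else n_neg / N
        K[i][i] += lam * frac
    return K
```

Because positives are rare, n+/N is small, so positive samples get a smaller diagonal boost and the SVM is penalized more for misclassifying them.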
Other Supervised Learning Methods
Classification tree
rpart
moc
k-Nearest Neighbors
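As a point of comparison, k-nearest neighbors needs no training beyond storing the labeled genes; a minimal sketch with Euclidean distance (the slides do not specify k or the distance metric):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Label x by majority vote among its k nearest training points
    dists = sorted(
        (math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```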
Model Evaluation
Confusion Matrix
FP, FN, TP, TN
Cost Function
C(M) = FP(M) + 2 * FN(M)
M: learning method
FN is given the larger weight
Savings Function
S(M) = C(N) - C(M)
C(N): cost if all samples are labeled negative
Models with large S(M) are preferred
Prediction
Use SVM on testing set to predict
gene function
Compare the classification result with
GO
Some Problems
Large kernel matrix: 2467 × 2467 for the training set, which
requires a long running time on a PC
“predict” function in R
Questions?