Knowledge-based Analysis of Microarray Gene Expression Data
Download
Report
Transcript Knowledge-based Analysis of Microarray Gene Expression Data
Knowledge-based Analysis of
Microarray Gene Expression Data
using Support Vector Machines
Michael P. S. Brown, William Noble Grundy,
David Lin, Nello Cristianini, Charles Sugnet,
Terrence S. Furey, Manuel Ares, Jr. David
Haussler
Proceedings of the National Academy
of Sciences. 2000
Overview
Objective: Classify genes based on
functionality
Observation: Genes of similar function yield
similar expression pattern in microarray
hybridization experiments
Method: Use SVM to build classifiers, using
microarray gene expression data.
Previous Methods
Most current methods employ
unsupervised learning methods (at the
time of the publication)
Genes are grouped using clustering
algorithms based on a distance measure
Hierarchical clustering
Self-organizing maps
DNA Microarray Data
Each data point represents the ratio of expression
levels of a particular gene in an experimental condition
and a reference condition
n genes on a single chip
m experiments performed
The results is an n by m matrix of expression-level ratios
m experiments
n genes
m-element
expression vector
for a single gene
DNA Microarray Data
Normalized logarithmic ratio
For gene X, in experience i, define:
• Ei is the expression level in the experiment
• Ri is the expression level in the reference state
• Xi=(x1, x2,..., xn) is the normalized logarithmic ratio
• Xi is positive when the gene is induced (turned up)
• Xi is negative when the gene is repressed (turned down)
Support Vector Machines
Searches for a hyperplane that
Maximizes the margin
Minimizes the violation of the margin
* Edda Leopold† and Jörg Kindermann
Linear Inseparability
What if data points are not linearly
separable?
* Andrew W. Moore
Linear Inseparability
Map the data
to higherdimension
space
* Andrew W. Moore
Linear Inseparability
Problems with mapping data to higherdimension space
1.
Overfitting
•
2.
SVM chooses the maximum margin, and deals
well with overfitting
High computational cost
•
SVM kernels only involve dot products between
points (cheap!)
SVM Kernels
K(X, Y) is function that calculates a
measure of similarity between X and Y
Dot product
• K(X,Y) = X.Y
• Simplest kernel. Linear hyperplane
Degree d polynomials
• K(X,Y) = (X.Y + 1)d
Gaussian
• K(X,Y) = exp(-|X - Y|2/22)
Experimental Dataset
Expression data from the budding yeast
Six functional classes
2467 genes (n)
79 experiments (m)
Dataset available on Stanford web site
From the Munich Information Centre for Protein Sequences
Yeast Genome Database
Class definitions come from biochemical and genetic studies
Training data:
positive labels: set of genes that have a common function
Negative labels: set of genes known not to be a member of this
function class
Experimental Design
Compare the performance of
SVM (with degree 1 kernel, i.e. linear))
SVM (with degree 2 kernel)
SVM (with degree 3 kernel)
SVM (Gaussian)
Parzen Windows
Fisher’s Linear Discriminate
C4.5 Decision Trees
MOC1 Decision Trees
Experimental Design
Define the cost of method M
Cost of each method is compared to:
C(M) = fp(M) + 2.fn(M)
False negatives are weighted higher because the
number of true negatives is larger
C(N) = cost of classifying everything as negative
Cost saving of method M is :
S(M) = C(N) - C(M)
Experimental Results
SVM
(d=1)
TCA
Resp
Ribo
Prot
Hist
HTH
6
31
224
35
18
-56
SVM
(d=2)
9
39
229
48
18
-3
SVM
SVM
Parzen
Fisher's
(d=3) (Gauss) Windows LD
C4.5 MOC1
12
11
6
5
-7
-1
38
33
18
30
8
-4
229
226
220
217 169
164
51
52
39
39
33
26
18
18
14
16
16
10
-1
0
-14
-14
-2
-6
SVMs outperform other methods
All classifiers fail to recognize the HTH protein
this is expected
Members of this class are not “similarly regulated”
Consistently Misclassified Genes
20 genes are consistently misclassified by
4 SVM kernels, in different experiments
Difference between the expression data and
definitions based on protein structures.
Many of the false positives are known to be
important for the functional class (even though they
are not included as part of the class)