Knowledge-based Analysis of Microarray Gene Expression Data

Transcript Knowledge-based Analysis of Microarray Gene Expression Data

Knowledge-based Analysis of
Microarray Gene Expression Data
using Support Vector Machines
Michael P. S. Brown, William Noble Grundy,
David Lin, Nello Cristianini, Charles Sugnet,
Terrence S. Furey, Manuel Ares, Jr. David
Haussler
Proceedings of the National Academy
of Sciences. 2000
Overview

Objective: Classify genes based on
functionality

Observation: Genes of similar function yield
similar expression pattern in microarray
hybridization experiments

Method: Use SVM to build classifiers, using
microarray gene expression data.
Previous Methods

Most current methods employ
unsupervised learning methods (at the
time of the publication)

Genes are grouped using clustering
algorithms based on a distance measure
Hierarchical clustering
 Self-organizing maps

DNA Microarray Data

Each data point represents the ratio of expression
levels of a particular gene in an experimental condition
and a reference condition



n genes on a single chip
m experiments performed
The results is an n by m matrix of expression-level ratios
m experiments
n genes
m-element
expression vector
for a single gene
DNA Microarray Data

Normalized logarithmic ratio

For gene X, in experience i, define:
• Ei is the expression level in the experiment
• Ri is the expression level in the reference state
• Xi=(x1, x2,..., xn) is the normalized logarithmic ratio
• Xi is positive when the gene is induced (turned up)
• Xi is negative when the gene is repressed (turned down)
Support Vector Machines

Searches for a hyperplane that


Maximizes the margin
Minimizes the violation of the margin
* Edda Leopold† and Jörg Kindermann
Linear Inseparability

What if data points are not linearly
separable?
* Andrew W. Moore
Linear Inseparability

Map the data
to higherdimension
space
* Andrew W. Moore
Linear Inseparability

Problems with mapping data to higherdimension space
1.
Overfitting
•
2.
SVM chooses the maximum margin, and deals
well with overfitting
High computational cost
•
SVM kernels only involve dot products between
points (cheap!)
SVM Kernels

K(X, Y) is function that calculates a
measure of similarity between X and Y

Dot product
• K(X,Y) = X.Y
• Simplest kernel. Linear hyperplane

Degree d polynomials
• K(X,Y) = (X.Y + 1)d

Gaussian
• K(X,Y) = exp(-|X - Y|2/22)
Experimental Dataset

Expression data from the budding yeast




Six functional classes



2467 genes (n)
79 experiments (m)
Dataset available on Stanford web site
From the Munich Information Centre for Protein Sequences
Yeast Genome Database
Class definitions come from biochemical and genetic studies
Training data:


positive labels: set of genes that have a common function
Negative labels: set of genes known not to be a member of this
function class
Experimental Design

Compare the performance of








SVM (with degree 1 kernel, i.e. linear))
SVM (with degree 2 kernel)
SVM (with degree 3 kernel)
SVM (Gaussian)
Parzen Windows
Fisher’s Linear Discriminate
C4.5 Decision Trees
MOC1 Decision Trees
Experimental Design

Define the cost of method M



Cost of each method is compared to:


C(M) = fp(M) + 2.fn(M)
False negatives are weighted higher because the
number of true negatives is larger
C(N) = cost of classifying everything as negative
Cost saving of method M is :

S(M) = C(N) - C(M)
Experimental Results
SVM
(d=1)
TCA
Resp
Ribo
Prot
Hist
HTH


6
31
224
35
18
-56
SVM
(d=2)
9
39
229
48
18
-3
SVM
SVM
Parzen
Fisher's
(d=3) (Gauss) Windows LD
C4.5 MOC1
12
11
6
5
-7
-1
38
33
18
30
8
-4
229
226
220
217 169
164
51
52
39
39
33
26
18
18
14
16
16
10
-1
0
-14
-14
-2
-6
SVMs outperform other methods
All classifiers fail to recognize the HTH protein


this is expected
Members of this class are not “similarly regulated”
Consistently Misclassified Genes

20 genes are consistently misclassified by
4 SVM kernels, in different experiments

Difference between the expression data and
definitions based on protein structures.

Many of the false positives are known to be
important for the functional class (even though they
are not included as part of the class)

Knowledge-based Analysis of Microarray Gene Expression Data

Transcript Knowledge-based Analysis of Microarray Gene Expression Data

Directory