
Supervised gene expression data analysis
using SVMs and MLPs
Giorgio Valentini
e-mail: [email protected]
Outline
A real problem: Lymphoma gene expression
data analysis by machine learning methods:
• Diagnosis of tumors using a supervised approach
• Discovering groups of genes related to
carcinogenic processes
• Discovering subgroups of diseases using gene
expression data.
DNA microarray
DNA hybridization microarrays supply information about gene expression by
measuring the mRNA levels of large numbers of genes in a cell.
They offer a snapshot of the overall functional status of a cell:
virtually all differences in cell type or state are reflected in
changes in the mRNA levels of many genes.
DNA microarrays have been used in mutational analyses, genetic mapping
studies, genome-wide monitoring of gene expression, pharmacogenomics,
and metabolic pathway analysis.
A DNA microarray image (E. coli)
• Each spot corresponds to the expression level of a particular gene
• Red spots correspond to over-expressed genes
• Green spots to under-expressed genes
• Yellow spots correspond to intermediate levels of gene expression
Analyzing microarray data by machine learning methods
The large amount of gene expression data requires machine learning methods
to analyze and extract significant knowledge from DNA microarray data.

Unsupervised approach
No or limited a priori knowledge. Clustering algorithms are used to
group together similar expression patterns:
• grouping sets of genes
• grouping different cells or different functional states of the cell.
Examples: hierarchical clustering, fuzzy or possibilistic clustering,
self-organizing maps.

Supervised approach
“A priori” biological and medical knowledge about the problem domain.
Learning algorithms with labeled examples are used to associate gene
expression data with classes:
• separating normal from cancerous tissues
• classifying different classes of cells on a functional basis
• predicting the functional class of unknown genes.
Examples: multi-layer perceptrons, support vector machines, decision
trees, ensembles of classifiers.
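The contrast between the two approaches can be sketched in a few lines. This is a toy illustration on synthetic data — the dataset, dimensions and labels are invented and are not the lymphoma data discussed later:

```python
# Unsupervised vs. supervised analysis of synthetic "expression" data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# 40 samples x 200 genes: two populations with shifted mean expression
normal = rng.normal(0.0, 1.0, size=(20, 200))
tumour = rng.normal(0.8, 1.0, size=(20, 200))
X = np.vstack([normal, tumour])
y = np.array([0] * 20 + [1] * 20)   # labels are used only in the supervised case

# Unsupervised: group samples with no labels at all
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: learn a decision rule from the labeled examples
clf = SVC(kernel="linear").fit(X, y)
print(clf.score(X, y))  # training accuracy
```

The clustering step receives only the expression matrix, while the classifier also exploits the diagnosis labels — which is exactly the distinction drawn above.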
A real problem:
A gene expression analysis of lymphoma

Biological problems and the machine learning methods applied to them:

1. Separating cancerous and normal tissues using the overall
information available.
   → Support Vector Machines (SVM): linear, RBF and polynomial kernels;
     Multi-Layer Perceptron (MLP); Linear Perceptron (LP).

2. Identifying groups of genes specifically related to the expression
of two different tumour phenotypes through expression signatures.
   → Two-step method: a priori knowledge and unsupervised methods to
     select “candidate” subgroups; SVM or MLP identify the most
     correlated subgroups.
The data
Data from a specialized DNA microarray, named "Lymphochip", developed
at the Stanford University School of Medicine:
• 4026 different genes, preferentially expressed in lymphoid cells or
with known roles in processes important in immunology or cancer
→ high-dimensional data
• 96 tissue samples from normal and cancerous populations of human
lymphocytes → small sample size
A challenging machine learning problem.
Types of lymphoma
Three main classes of lymphoma:
• Diffuse Large B-Cell Lymphoma (DLBCL)
• Follicular Lymphoma (FL)
• Chronic Lymphocytic Leukemia (CLL)
plus Transformed Cell Lines (TCL) and normal lymphoid tissues.

Type of tissue          Number of samples
Normal lymphoid cells   24
DLBCL                   46
FL                       9
CLL                     11
TCL                      6
Visualizing data with Tree View
The first problem:
Separating normal from cancerous tissues

Our first task consists in distinguishing cancerous from normal tissues
using the overall information available, i.e. all the gene expression
data. From a machine learning standpoint it is a dichotomic (two-class)
problem.

Data characteristics:
• Small sample size
• High dimensionality
• Missing values
• Noise

Main applicative goal: supporting functional-molecular diagnosis of
tumors and polygenic diseases.
Supervised approaches to molecular
classification of diseases
Several supervised methods have been applied to the analysis of
cDNA microarrays and high density oligonucleotide chips:
• Decision trees
• Fisher linear discriminant
• Linear discriminant analysis
• Multi-Layer Perceptrons
• Parzen windows
• Nearest-Neighbours classifiers
• Support Vector Machines
Proposed by different authors:
Golub et al. (1999), Pavlidis et al. (2001), Khan et al. (2001),
Furey et al. (2000), Ramaswamy et al. (2001), Yeang et al. (2001),
Dudoit et al. (2002).
Why use Support Vector Machines?

“General” motivations:
• SVMs are two-class classifiers theoretically founded on Vapnik's
Statistical Learning Theory.
• They act as linear classifiers in a high-dimensional feature space
obtained by a projection of the original input space.
• The resulting classifier is in general non-linear in the input space.
• SVMs achieve good generalization performance by maximizing the margin
between the classes.
• The SVM learning algorithm has no local minima.

“Specific” motivations:
• Kernels are well-suited to working with high-dimensional data.
• Small sample sizes require algorithms with good generalization
capabilities.
• Automatic diagnosis of tumors requires high sensitivity and very
effective classifiers.
• SVMs can identify mis-labeled data (i.e. incorrect diagnoses).
• We could design specific kernels to incorporate “a priori” knowledge
about the problem.
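As a small illustration of the last point, a kernel incorporating prior knowledge could, for instance, up-weight the genes believed a priori to be relevant. The data, weights and gene indices below are invented for illustration only:

```python
# A hypothetical prior-weighted linear kernel plugged into an SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 50))                 # 30 samples x 50 "genes"
y = (X[:, :5].sum(axis=1) > 0).astype(int)    # only the first 5 genes matter

w = np.ones(50)
w[:5] = 5.0  # prior belief (invented): the first five genes are important

def weighted_linear_kernel(A, B):
    # K(a, b) = sum_i w_i * a_i * b_i -- a weighted dot product
    return (A * w) @ B.T

clf = SVC(kernel=weighted_linear_kernel).fit(X, y)
print(clf.score(X, y))  # training accuracy
```

scikit-learn's `SVC` accepts any callable returning the Gram matrix, so domain knowledge can enter the classifier entirely through the kernel, without changing the learning algorithm.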
SVM to classify cancerous and normal cells

We consider 3 standard SVM kernels:
• Dot-product (linear)
• Polynomial
• Gaussian
varying the values of the kernel parameters and the regularization
factor C.

We compare them with:
• MLP
• LP
varying the number of hidden units and the backpropagation parameters.

Estimation of the generalization error through:
• 10-fold cross-validation
• leave-one-out
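The evaluation protocol for the three kernels can be sketched as follows. The dataset here is synthetic (generated to mimic the 96-sample, high-dimensional setting); the MLP and LP comparisons would follow the same pattern:

```python
# Comparing SVM kernels with 10-fold cross-validation and leave-one-out.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

# Synthetic stand-in for the 96 tissue samples (dimensions illustrative)
X, y = make_classification(n_samples=96, n_features=200, n_informative=20,
                           random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 2}),
                       ("rbf", {"gamma": "scale"})]:
    clf = SVC(kernel=kernel, C=1.0, **params)
    cv10 = cross_val_score(clf, X, y, cv=10).mean()          # 10-fold CV
    loo = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()  # leave-one-out
    print(f"{kernel:6s}  10-fold err {1 - cv10:.3f}  LOO err {1 - loo:.3f}")
```

With 96 samples, leave-one-out amounts to 96 train/test splits per model, which is why it closely tracks the 10-fold estimate reported on the next slide.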
Results

Learning machine model   Gen. error   St. dev.   Prec.   Sens.
SVM-linear                   1.04        3.16     98.63   100.0
SVM-poly                     4.17        5.46     94.74   100.0
SVM-RBF                     25.00        4.48     75.00   100.0
MLP                          2.08        4.45     98.61    98.61
LP                           9.38       10.24     95.65    91.66

• 10-fold cross-validation ≈ leave-one-out estimation of the error
• SVM-linear achieves the best results.
• High sensitivity, no matter what type of kernel function is used.
• Radial basis SVM shows a high misclassification rate and a high
estimated VC dimension.
ROC analysis
• The ROC curve of the SVM-linear is ideal.
• The polynomial SVM also achieves a reasonably good ROC curve.
• The SVM-RBF shows a diagonal ROC curve: the highest sensitivity is
achieved only when it completely fails to correctly detect normal cells.
• The ROC curve of the MLP is also nearly optimal.
• The linear perceptron shows a worse ROC curve, but with reasonable
values lying in the highest and leftmost part of the ROC plane.
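A ROC curve like those discussed above is obtained by sweeping a threshold over the classifier's decision values. A minimal sketch on synthetic data (not the Lymphochip set):

```python
# Computing a ROC curve from an SVM's decision values.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=96, n_features=200, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="linear").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)   # signed margin for each test sample

fpr, tpr, thresholds = roc_curve(y_te, scores)
print(f"AUC = {auc(fpr, tpr):.3f}")    # an 'ideal' curve has AUC = 1.0
```

A diagonal curve (AUC ≈ 0.5), as observed for the SVM-RBF, means the decision values carry essentially no ranking information.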
Summary of the results on the first problem

Using hierarchical clustering, 14.6% of the examples are misclassified
(Alizadeh, 2000), against 1.04% for the SVM, 2.08% for the MLP and
9.38% for the LP.
Supervised methods exploit a priori biological knowledge (i.e. labeled
data), while clustering methods use only the gene expression data to
group together different tissues, without any labels.
Linear SVMs achieve the best results, but the MLP and the 2nd-degree
polynomial SVM also show a relatively low generalization error.
Linear SVMs and MLPs can be used to build classifiers with high
sensitivity and a low rate of false positives.
These results must be considered with caution because the available
data set is too small to infer general statements about the performance
of the proposed learning machines.
The second problem:
Identifying DLBCL subgroups

It starts from a hypothesis of Alizadeh et al. about the existence of
two distinct functional types of lymphoma inside DLBCL. Actually, we
consider two problems:

1. Validation of Alizadeh's hypothesis
• They identified two molecularly distinct subgroups of DLBCL:
germinal centre B-like (GCB-like) and activated B-like (AB-like) cells.
• These two classes correspond to patients with very different
prognoses.

2. Finding the groups of genes most related to this separation
Different subsets of genes could be responsible for the distinction
between these two DLBCL subgroups: the expression signatures
Proliferation, T-cell, Lymphnode and GCB (Lossos, 2000).
A feature selection approach based on “a priori” knowledge

Finding the most correlated genes involves an exponential number of
gene combinations (2^n − 1), where n is usually of the order of
thousands. We need greedy algorithms and heuristic methods.
Can we exploit “a priori” biological knowledge about the problem?
A heuristic method (1)

A two-stage approach:
I. Select groups of coordinately expressed genes.
II. Identify among them the ones most correlated to the disease.

• We do not consider single genes.
• We consider only groups of coordinately expressed genes.
A heuristic method (2)

I. Selecting groups of coordinately expressed genes:
• Use “a priori” biological and medical knowledge about groups of
genes with known or suspected roles in carcinogenic processes, and/or
• Use unsupervised methods such as clustering algorithms to identify
coordinately expressed sets of genes.

II. Identifying the subgroups of genes most related to the disease:
1. Train a set of classifiers using only the subgroups of genes
selected in the first stage.
2. Evaluate and rank the performance of the trained classifiers.
3. Select the subgroups for which the corresponding classifiers
achieve the best ranking.
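Stage II above can be sketched as follows: one classifier is trained per candidate subgroup and the subgroups are ranked by cross-validated error. The data and the gene-index assignments of the four signatures are invented for illustration:

```python
# Ranking candidate gene subgroups by the CV error of a classifier
# trained on each subgroup alone.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=400, n_informative=15,
                           random_state=0)

# Hypothetical "signatures": each is a set of column indices into X
subgroups = {
    "Proliferation": list(range(0, 100)),
    "T-cell":        list(range(100, 200)),
    "Lymphnode":     list(range(200, 300)),
    "GCB":           list(range(300, 400)),
}

errors = {}
for name, genes in subgroups.items():
    clf = SVC(kernel="linear")
    acc = cross_val_score(clf, X[:, genes], y, cv=10).mean()
    errors[name] = 1.0 - acc   # estimated generalization error

# The subgroup whose classifier ranks best is taken as the most correlated
ranking = sorted(errors, key=errors.get)
print(ranking)
```

On the real data this ranking is what singles out the GCB signature, as reported in the results below; here the winner depends on the synthetic data.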
Applying the heuristic method

1. Selecting “candidate” subgroups of genes:
We used biological knowledge and hierarchical clustering algorithms to
select four subgroups:
• Proliferation: sets of genes involved in the biological process of
proliferation
• T-cell: genes preferentially expressed in T-cells
• Lymphnode: sets of genes normally expressed in lymph nodes
• GCB: genes that distinguish germinal centre B-cells from other
stages in B-cell ontogeny

2. Identifying the subgroups of genes most related to the GCB-like /
AB-like separation:
• Training of SVM, MLP and LP classifiers using each subgroup of genes
and all the subgroups together (All): 5 classification tasks
• Leave-one-out estimation used with Gaussian, polynomial and linear
SVMs
• 10-fold cross-validation with Gaussian, polynomial and linear SVMs,
MLP and LP.
Results

GCB signature
Learning machine model   Gen. error   St. dev.   Prec.   Sens.
SVM-linear                  10.50       11.16     90.00   90.00
SVM-poly                     8.70       14.54     96.67   88.33
SVM-RBF                      4.50        9.55    100.0    90.00
MLP                          8.70       10.50     90.90   90.90
LP                           8.70       10.50     90.90   90.90

All signatures
Learning machine model   Gen. error   St. dev.   Prec.   Sens.
SVM-linear                  15.00       11.16     85.00   85.00
SVM-poly                    14.00       18.97     93.33   76.67
SVM-RBF                     10.00       10.54    100.00   76.67
MLP                          8.70       13.28     95.00   86.36
LP                          10.87       14.28     86.96   90.90
The second problem: summary
• The results support the hypothesis of Alizadeh about the
existence of two distinct subgroups in DLBCL.
• The heuristic method identifies the GCB signature as a
cluster of coordinately expressed genes related to the
separation between the GCB-like and AB-like DLBCL
subgroups.
Developments

Integrating “a priori” biological knowledge, supervised machine
learning methods and unsupervised clustering methods:

I. Methods to discover subclasses of tumors on a molecular basis
- Refinements of the proposed heuristic method, using clustering
algorithms with semi-automatic selection of the number of significant
subgroups of genes.

II. Methods to identify small subsets of genes correlated to tumors
- Greedy algorithms based on mutual information measures.

Expected impact:
• Stratifying patients into molecularly relevant categories, enhancing
the discrimination power and precision of clinical trials.
• New perspectives on the development of new cancer therapeutics based
on a molecular understanding of the cancer phenotype.
• Discovery of new subclasses of tumors.
• Enhancing biological knowledge about tumoral processes.
• Automatic diagnosis of tumors using DNA microchips.
SVM for gene expression data analysis: a bibliography

• M. Brown et al. Knowledge-based analysis of microarray gene
expression data by using support vector machines. PNAS,
97(1):262-267, 2000.
• T.S. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and
D. Haussler. Support vector machine classification and validation of
cancer tissue samples using microarray expression data.
Bioinformatics, 16(10):906-914, 2000.
• P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional
classification from heterogeneous data. In Fifth International
Conference on Computational Molecular Biology, ACM, Montreal, Canada,
2001.
• C. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R. Rifkin,
M. Angelo, M. Reich, E. Lander, J. Mesirov, and T. Golub. Molecular
classification of multiple tumor types. In ISMB 2001, Proceedings of
the 9th International Conference on Intelligent Systems for Molecular
Biology, pages 316-322, Copenhagen, Denmark. Oxford University Press,
2001.
• S. Ramaswamy et al. Multiclass cancer diagnosis using tumor gene
expression signatures. PNAS, 98(26):15149-15154, 2001.
• I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for
cancer classification using support vector machines. Machine Learning,
46(1/3):389-422, 2002.
• J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and
B. Scholkopf. Feature selection and transduction for prediction of
molecular bioactivity for drug design. Bioinformatics, 1(1), 2002.
• G. Valentini. Gene expression data analysis of human lymphoma using
support vector machines and output coding ensembles. Artificial
Intelligence in Medicine, 26(3):283-306, 2002.
• G. Valentini, M. Muselli, and F. Ruffino. Bagged ensembles of SVMs
for gene expression data analysis. The IEEE-INNS-ENNS International
Joint Conference on Neural Networks, Portland, USA, 2003.