Transcript Document

Classification problems
with heterogeneous
information sources
N.U.S. - January 13, 2006
Gert Lanckriet ([email protected])
U.C. San Diego
Motivation
• Statistical machine learning
– Blends statistics, computer science, signal
processing, optimization
– Involves solving large-scale data analysis
problems
• autonomously
• in tandem with a human
• Challenges:
– Massive scale of data sets
– On-line issues
– Diversity of information sources describing data
Example: web-related applications
• Data point = web page
• Sources of information about the webpage:
– Content:
•
•
•
•
Text
Images
Structure
Sounds
– Relation to other webpages: links  network
– Users (log data):
• click behavior
• origin
Example: web-related applications
• Data point = web page
• Sources of information about the webpage:
– Content:
•
•
•
•
Text
Images
Structure
Sounds
– Relation to other webpages: links  network
– Users (log data):
• click behavior
• origin
Information in diverse (heterogeneous) formats
Example: bioinformatics
mRNA
expression data
hydrophobicity data
protein-protein
interaction data
sequence data
(gene, protein)
upstream region data
(TF binding sites)
Overview
• Kernel methods
• Classification problems
• Kernel methods with heterogeneous
information
• Classification with heterogeneous
information (SDP)
• Applications in computational biology
Overview
• Kernel methods
• Classification problems
• Kernel methods with heterogeneous
information
• Classification with heterogeneous
information (SDP)
• Applications in computational biology
Kernel-based learning
Data
Embed data
Linear algorithm
x1
xn
SVM, MPM, PCA, CCA, FDA…
• if data described by numerical vectors:
embedding ~ (non-linear) transformation
 non-linear versions of linear algorithms
Kernel-based learning
Data
Embed data
Linear algorithm
x1
xn
SVM, MPM, PCA, CCA, FDA…
• embedding can be defined for non-vector
data
Kernel-based learning
Embed data
IMPLICITLY: Inner product measures similarity
j
i
K
Property: Any symmetric positive definite
matrix specifies a kernel matrix & every kernel
matrix is symmetric positive definite
Kernel-based learning
Data
x1
xn
Embed data
Kernel-based learning
Data
Embed data
Linear algorithm
x1
K
xn
SVM, MPM, PCA, CCA, FDA…
Kernel design
Kernel algorithm
Kernel methods
• Unifying learning framework
– connections to statistics, convex optimization,
functional analysis
– different data analysis problems can be
formulated within this framework
•
•
•
•
Classification
Clustering
Regression
Dimensionality reduction
• Many successful applications
Kernel methods
• Unifying learning framework
– connections to statistics, convex optimization,
functional analysis
– different data analysis problems can be
formulated within this framework
• Many successful applications
–
–
–
–
–
hand-writing recognition
text classification
analysis of micro-array data
face detection
time series prediction
Binary classification
y1= -1
• Training data: {(xi,yi)}i=1...n
– xi : description ith object
– yi 2 {-1,+1} : label
•
•
•
•
•
HEART
URINE
DNA
BLOOD
SCAN
y2= +1
•
•
•
•
•
HEART
URINE
DNA
BLOOD
SCAN
x1
• Problem: design a classification rule such that,
given a new x, it predicts y with minimal
probability of error
x2
Binary classification
•
Find hyperplane
that separates the two
classes
•
•
•
•
•
x1
Classification Rule:
HEART
URINE
DNA
BLOOD
SCAN
•
•
•
•
•
HEART
URINE
DNA
BLOOD
SCAN
x2
Maximal margin classification
• Maximize margin:
– Position hyperplane
between two classes
– Such that 2-norm
distance to closest point
from each class is
maximized
Maximal margin classification
•
If not linearly separable:
–
–
Allow some errors
Try to maximize margin for
data points with no error
Maximal margin classification:
training algorithm
max margin
min error
error slack
correctly classified
Maximal margin classification
• Training: convex optimization problem (QP)
• Dual problem:
Maximal margin classification
• Training: convex optimization problem (QP)
• Dual problem:
• Optimality condition:
Maximal margin classification
• Training:
• Classification rule: classify new data point x:
Maximal margin classification
• Training:
• Classification rule: classify new data point x:
Kernel-based classification
Data
Embed data
x1
Linear
classification
algorithm
xn
K
Kernel design
Support vector machine
(SVM)
Kernel algorithm
Overview
• Kernel methods
• Classification problems
• Kernel methods with heterogeneous
information
• Classification with heterogeneous
information (SDP)
• Applications in computational biology
Kernel methods with heterogeneous info
• Data points: proteins
• Information sources:
j
i
K
Kernel methods with heterogeneous info
• Data points: proteins
• Information sources:
K
Kernel methods with heterogeneous data
• Proposed approach
– First focus on every single source j of information
individually
– Extract relevant information from source j into Kj
– Design algorithm to learn the optimal K, by
“mixing” any number of kernel matrices Kj, for a
given learning problem
Kernel methods with heterogeneous data
1
2
K
Kernel methods with heterogeneous data
• Proposed approach
1
– First focus on every single source k of information
individually
– Extract relevant information from source j into Kj
Focus on kernel design for specific types of information
2
– Design algorithm that learns the optimal K, by
“mixing” any number of kernel matrices Kj, for a
given learning problem
Homogeneous, standardized input
Flexibility
Can ignore information irrelevant for learning task
Kernel design: classical vector data
• Data matrix:
– each row corresponds to a
gene (data point)
– each column corresponds
to an experiment (mRNA
expression level)
• Each gene: described by
vector of numbers
Kernel design: classical vector data
• Inner product :
• Normalized inner product :
Similar
Dissimilar
Kernel design: classical vector data
• A more advanced similarity measurement for
vector data: Gaussian kernel
• Corresponds to highly non-linear embedding
Kernel design: classical vector data
Kernel design: strings
• Data points: proteins
• Described by variable-length, discrete
strings (amino acid sequences)
protein 1
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
protein 2
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI
• Kernel design: derive valid similarity
measure, based on non-vector information
Kernel design: strings
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKALVLDTDVSNGVKEYMENSLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVKYVNTFKEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI
more similar
String kernels
>ICYA_JAKSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLDYCMENSAEPEQSLACQCLVRTPEVDDEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI
less similar
Kernel design: graph
• Data points: vertices
• Information: connectivity
described by graph
• Diffusion kernel: establishes
similarities between vertices of a
graph, based on the connectivity
information
– based upon a random walk
– efficiently accounts for all paths
connecting two vertices, weighted by
path lengths
Kernel methods with heterogeneous data
1
2
?
K
Learning the kernel matrix
?
Any symmetric positive definite
matrix specifies a kernel matrix
Positive semidefinite matrices
form a convex cone
Learn K from the convex cone of
positive-semidefinite matrices…
K
?
Define cost function to assess
the quality of a kernel matrix
Restrict to convex cost
functions
… according to a convex
quality measure
Learning the kernel matrix
?
Learn K from the convex cone of
positive-semidefinite matrices…
K
?
… according to a convex
quality measure
Semidefinite Programming (SDP):
deals with optimizing convex cost functions
over the convex cone of positive semidefinite
matrices (or a convex subset of it)
Classification with multiple kernels
?
K
?
… according to a convex
Learn K from the convex cone of
positive-semidefinite matrices (or a quality measure
convex subset) …
Integrate constructed kernels
Large margin classifier (SVM)
Classification with multiple kernels
?
K
?
… according to a convex
Learn K from the convex cone of
positive-semidefinite matrices (or a quality measure
convex subset) …
Integrate constructed kernels
learn a linear combination
Large margin classifier (SVM)
Classification with multiple kernels
?
K
?
… according to a convex
Learn K from the convex cone of
positive-semidefinite matrices (or a quality measure
convex subset) …
Integrate constructed kernels
Large margin classifier (SVM)
learn a linear combination
maximize the margin
Classification with multiple kernels
• SVM, one kernel, dual formulation
• SVM, multiple kernels, dual formulation
Convex (pointwise max of set of convex functions)
Semidefinite programming problem
Classification with multiple kernels
• SVM, one kernel, dual formulation
• SVM, multiple kernels, dual formulation
Need to reformulate this in standard SDP format
Classification with multiple kernels
Integrate constructed kernels
Large margin classifier (SVM)
learn a linear mix
maximize the margin
SDP (standard form)
Classification with multiple kernels
Integrate constructed kernels
Large margin classifier (SVM)
learn a linear mix
maximize the margin
Theoretical performance guarantees
Applications in computational biology
• Yeast membrane protein prediction
• Yeast protein function prediction
Yeast Membrane Protein Prediction
• Membrane proteins:
– anchor in various cellular
membranes
– serve important
communicative functions
across the membrane
– important drug targets
• About 30% of the proteins
are membrane proteins
Yeast Membrane Protein Prediction
• Protein sequences: SW scores
• Protein sequences: BLAST scores
• E-values of Pfam domains
• Protein-protein interactions
Diffusion
• mRNA expression profiles
Gaussian
• Hydropathy profile
Yeast Membrane Protein Prediction
• Protein sequences: SW scores
• Protein sequences: BLAST scores
• E-values of Pfam domains
• Protein-protein interactions
• mRNA expression profiles
• Hydropathy profile
K
Yeast Membrane Protein Prediction
Yeast Protein Function Prediction
• Five different types of data:
–
–
–
–
–
Pfam domains
genetic interactions (CYGD)
physical interactions (CYGD)
protein-protein interaction (TAP)
mRNA expression profiles
• Compare our approach to approach using Markov
Random Fields (Deng et al.)
– using the five types of data
– also reporting improved accuracy compared to using any
single data type
Yeast Protein Function Prediction
MRF
SDP/SVM
(binary)
SDP/SVM
(enriched)
Conclusion
• Computational and statistical framework to
integrate data from heterogeneous information
sources
–
–
–
–
flexible and unified approach
within kernel methodology
specifically: classification problems
resulting formulation: semidefinite programming
• Applications show classification performance can
be enhanced by integrating diverse genome-wide
information sources