Remote Homology Detection with String Kernel and Network Diffusion

Download Report

Transcript Remote Homology Detection with String Kernel and Network Diffusion

Inferring Protein Structure with
Discriminative Learning and
Network Diffusion
Rui Kuang
Department of Computer Science
Center for Computational Learning Systems
Columbia University
Thesis defense, August 16th 2006
Agenda
 Biological Background
 Protein Classification with String Kernels
 Domain Identification and Boundary Detection
 Protein Ranking with Network Diffusion
 Conserved Motifs between Remote Homologs
 Conclusion & Future Research
Background
Structural classification
Domain segmentation
Protein ranking
Motif discovery
Conclusion & Future Research
What are proteins?
Proteins – encoded by genes
 A protein (polypeptide chain) is a sequence of
amino acid residues
 Derived from Greek word proteios meaning
“of the first rank” by Jöns J. Berzelius in 1838
Amino Acid
Polypeptide Chain
(Picture courtesy of Branden & Tooze )
Why study proteins?
 Proteins play crucial functional roles in all
biological processes: enzymatic catalysis,
signaling messengers, structural elements…
 Function depends on unique 3-D structure.
 Easy to obtain protein sequences, difficult to
determine structure.
NLAFALSELDRITAQ
LKLPRHVEEEAARL
YREAVRKGLIRGRS
IESVMAACVYAACR
LLKVPRTLDEIADIA
RVDKKEIGRSYRFI
ARNLNLTPKKLF…
Sequence
fold
determine
Structure
(Picture courtesy of Branden & Tooze )
Function
Sequence Space and Structure Space
Sequence (>2,500,000)
•Homologous proteins share >30%
sequence identity, which suggests
strong structural similarity.
•Remote homologous proteins share
similar structure but low sequence
similarity.
Structure (38,086 known in PDB):
discrete groups of folds with unclear
boundaries (by 8/14/2006)
Remote Homology Detection
 Remote homology : remote evolutionary relationship 
conserved structure/function, low sequence similarity
<10%
sequence
identity
ADTIVAVELDTYPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGKVG
TAHIIYNSVDKRLSAVVSYPNADSATVSYDVDLDNVLPEWVRVGLSAS
TGLYKETNTILSWSFTSKLKSNSTHETNALHFMFNQFSKDQKDLILQG
DATTGTDGNLELTRVSSNGPQGSSVGRALFYAPVHIWESSAVVASFEA
TFTFLIKSPDSHPADGIAFFISNIDSSIPSGSTGRLLGLFPDAN
MSLLPVPYTEAASLSTGSTVTIKGRPLVCFLNEPYLQVDFHTEMKEES
DIVFHFQVCFGRRVVMNSREYGAWKQQVESKNMPFQDGQEFLSISVLP
DKYQVMVNGQSSYTFDHRIKPEAVKMVQVWRDISLTKFNVSYLKR
 It is often not possible to detect statistically significant
sequence alignment between remote homologs
DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA
D +
WEI+ +++ + ++ SGS+G +++G +
+VA+K LK
E
+ F
EV
DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF
Protein Domains
 Proteins often consist of several independent domains
 fold autonomically
 often function differently
 represent fundamental
structural, functional and
evolutionary units
 Example: a two-domain
protein


3-layer(aba) sandwich at
the N-terminal
a mainly alpha in an
orthogonal bundle at the
C-terminal
Inferring protein structure/function
from sequence similarity
 For newly sequenced genomes, often homology detection
can only identify less than a half of the genes.
 Remote homology detection and domain segmentation are
crucial steps for studying genes with no close homology.
Query sequence
Boundary identification
Domain
Family 11
Domain
Domain Family 2
Domain Family 3
Domain Family n-2
Domain Family n-1
Domain 2
Domain 3
Sequenc
e
Alignmen
Remote
homology detection
t
 We want to correctly segment protein sequences into domains
Domain Family n
(domain boundary identification) and associate them with their
Domain database structural/functional class (remote homology
corresponding
detection).
Protein Structural
Classification
Background
Structural classification
Domain segmentation
Protein ranking
Motif discovery
Conclusion & Future Research
 Protein classification is the prediction of the structural or
functional class of a protein from its primary sequence.
 SCOP: Structural Classification of Proteins
 Known domain structures are organized in a hierarchy:
family, superfamily and fold.
SCOP
Superfamily : low sequence
similarity but functional
features suggest probable
common evolutionary origin
Family : Sequence identity >
30% or functions and
structures are very similar
Remote Homology Detection in
Protein Classification
 Remote homologs: sequences that belong
to the same super-family but not the same
family.
SCOP
Fold
Super-family
Negative
Negative
Training Set Test Set
Family
Positive
Training Set
Positive
Test Set
Support Vector Machine (SVM)
Classifiers
 SVM: Large margin-based
discriminative learning
approach.
 Find a hyperplane to
separate positive data
from negative data and
also maximize the margin.
 A Quadratic Programming
problem only depends on
the inner product between
data points.
Kernels for Discrete Objects
 Kernel trick: To train an SVM, can use kernel rather
than explicit feature map
 Can define kernels for sequences, graphs, other
discrete objects:

{ sequences }
RN
Kernel value is inner product in feature space:
K(x, y) =  (x), (y) 
 Original string kernels (Watkins, Haussler, Lodhi et al.)
require quadratic time in sequence length, O(|x| |y|), to
compute each kernel value K(x, y).
 We introduce fast novel string kernels with linear time
complexity.
Profile Kernel and its Family Tree
 Three generations



Spectrum Kernels
Mismatch Kernels
Profile Kernels
 Faster computation, i.e., linear computation
time in sequence length.
 Profile-based string kernels take advantage
of abundant unlabeled data to capture
homologous/evolutionary information for
remote homology detection.
Spectrum Kernel
Feature map indexed by all possible k-length subsequences
(“k-mers”) from alphabet  of amino acids, || = 20
Q1:AKQDYYYYE
AKQ
KQD
QDY
DYY
YYY
YYY
YYE
Feature Space
Q2:DYYEIAKQE
(AAA-YYY)
1 AKQ 1
DYY
1 DYY 1
YYE
0 EIA 1
YEI
0 IAK 1
EIA
1 KQD 0
IAK
0 KQE 1
1 QDY 0 K-mers captureAKQ
some positionsimilarity,
KQE
0 YEI 1 independent local
1 YYE 1 but they don’t effectively model
2 YYY 0 evolutionary divergence.
K(Q1,Q2)= <(…1…1…0…0…1…0…1…0…1…2),(…1…1…1…1…0…1…0…1…1…0)> =3
Leslie, Eskin and Noble, PSB 2002
Mismatch Kernel
 For k-mer s, the mismatch neighborhood
N(k,m)(s) is the set of all k-mers t within m
mismatches from s
 Size of mismatch neighborhood is O(||mkm)
AKQ
CKQ
DKQ
AKQ
…
…
AAQ
AKY
( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 )
AAQ
AKY
CKQ
DKQ
Arbitrary mismatch does not
model the mutation probability
between amino acids.
Leslie, Eskin, Weston and Noble, NIPS 2002
Profile Kernel
[Kuang & Leslie, JBCB 2005 and CSB 2004]
 Profile kernel: specialized to protein sequences,
probabilistic profiles to capture homology information
 Semi-supervised approach: profiles are estimated using
unlabeled data (about 2.5 million proteins )
 E.g. PSI-BLAST profiles: estimated by iteratively
aligning database homologs to query sequence.
 Profiles are build from multiple sequence alignment to
model the positional mutation probability.
QUERY
LKLLRFLGSGAFGEVYEGQLKTE....DSEEPQRVAIKSLRK.......
HOMOLOG1
IIMHNKLGGGQYGDVYEGYWK........RHDCTIAVKALK........
HOMOLOG2
LTLGKPLGEGCFGQVVMAEAVGIDK.DKPKEAVTVAVKMLKDD......
HOMOLOG3
IVLKWELGEGAFGKVFLAECHNLL...PEQDKMLVAVKALK........
A
C
D
…
Y
L K L
3 -2 1
-1 0 2
-1 0 0
… … …
2 -3 -3
…
…
…
…
…
…
Profile-based k-mer Map
P( x)  p j (b), b  , j  1 x 
 Use profile
to
define position-dependent mutation neighborhoods:
 E.g. k=3, =5 and a profile of negative log probabilities
A
C
D
…
K
…
Q
…
Y
AKQ
A
1
5
4
…
4
…
3
…
2
K
3
4
4
…
1
…
4
…
4
Q
4
1
4
…
4
…
1
…
3
…
…
…
…
…
…
…
…
…
…
M k ,  P x j  1 : j  k  
b b b :  log  p
1 2
k
i
j i
bi    
AKQ
YKQ
(2+1+1<)
YKC
AKQ
AKC
(1+1+1<)
(1+1+2<)
(2+1+1<)
( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 )
AKC
AKQ
YKC
YKQ
Efficient Computing with Trie
 Use trie data structure to organize lexical traversal of all instances of k-
mers in the training profiles.
 Scales linearly with length, O(km_max+1||m_max(|x|+|y|)), where m_max is
maximum number of mismatches that occur in any mutation neighborhood.
 E.g. k=3, =5
Sequence y
Sequence x
A
C
D
…
Q
…
Y
A
1
3
3
…
3
…
2
Q
3
2
2
…
1
…
1
K
2
1
1
…
2
…
3
…
A
…
Q
…
C
D
…
…
… x: 1+1+1<  x: 1+1+1 < 
…
y: .5+.6+2 <  y: .5+.6+4 > 
…
A
C
D
…
Q
…
Y
… A Q Y
….5 2 1
… 2 1 2
… 2 1 4
… … … … …
… 2 .6 2
… … … … …
… 3 3 3
Update K(x, y) by adding contribution for
feature AQC but not AQD
…
…
…
…
…
…
Inexact Matching Kernels
[Leslie & Kuang, JLMR 2004, KMCB 2004 & COLT 2003]
 Gappy kernels
 For g-mer s, g > k, the gapped match set G(g,k)(s) consists
of all k-mers t that occur in s with up to (g - k) gaps
 Wildcard kernels
 Introduce wildcard character “”, define feature space
indexed by k-mers from {}, allowing up to m wildcards
 Substitution kernels
 Use substitution matrices to obtain P(a|b), substitution
probabilities for residues a, b
 The mutation neighborhood M(k,)(s) is the set of all kmers t such that
- i=1…k log P(si|ti) < 
Experiments
 SCOP 1.59 benchmark with 54 experiments
 Train PSI-BLAST profiles on NR database
 Comparison against PSI-BLAST and recent SVM-
based methods:




PSI-BLAST rank: use training sequence as query and
rank testing sequences with PSI-BLAST e-value
eMotif Kernel (Ben-Hur et al., 2003): features are
known protein motifs, stored using trie
SVM-pairwise (Liao & Noble, 2002): feature vectors of
pairwise alignment scores (e.g. PSI-BLAST scores)
Cluster Kernel (Weston et al., 2003): Implicitly average
the feature vectors for sequences in the PSI-BLAST
neighborhood of input sequence
Results
Results (Cont.)
Kernels
ROC
ROC50
PSI-BLAST
0.743
0.293
eMotif
0.711
0.247
Mismatch(5,1)
0.875
0.416
Gappy(6,4)
0.851
0.387
Substitution(4,6.0)
0.876
0.441
Wildcard(5,2,1.0)
0.881
0.447
SVM-Pairwise
0.866
0.533
Cluster
0.923
0.699
Profile(5,7.5)-5 Iteration
0.984
0.874
Identify Protein Domains
and Domain Boundaries
Background
Structural classification
Domain segmentation
Protein ranking
Motif discovery
Conclusion & Future Research
 SVM-based remote homology detection methods do not rely
on sequence alignment.
 To learn the domain segmentation, we use our SVM classifiers
as domain recognizers and find the optimal segmentation
giving the maximum sum of the classification scores.
SCOP
Super-families:
Domain recognizers: SVM1
SVM2
SVM3
Query sequence
Boundaries
SVM4
Algorithms for finding optimal
segmentation
 Assuming we know the number of domains on a protein, we
look for the optimal segmentation with the maximum sum of
classification scores with dynamic programming.
 No gaps: T(1, j )  F ( S1, j ), for 1  j  N
T( c , j )


inf , when j c

max 1 k  j ( F ( Sk , j )T( c 1,k ) ), otherwise
Allowing gaps (can also be solved as a LP problem):
T(1, j )  max 1 k l  j F ( S k ,l ), all 1  j  N
T( c , j )

inf , when j c

max 1 k l  j ( F ( Sk ,l )T( c 1,k ) ), otherwise
: segment between position i and position j of sequence S
F ( s ) : the best classification score of segment s
T( c , j ) : the maximum sum of classification scores from c segments on S1,j
Si, j
Toy Example of Dynamic
Programming
c=1
1
4
2
2
1
1
c=2
-INF
2
5
8
6
7
c=3
-INF
-INF
3
6
9
13
Algorithms for finding optimal
segmentation (unconstrained number of domains)
 The regions across boundaries are less classifiable than
other regions within one domain.
 Use dynamic programming to alternate between domain
regions and boundary regions.
P0  0, G0  0
Pj  max 1i  j (lseg ( Si , j )  Gi 1 )
G j  max 1i  j (l gap ( Si , j )  Pi 1 ),
Si, j
: segment between position i and position j of sequence S
lseg ( s ) : confidence score of assigning s as domain region
l gap ( s ) : confidence score of assigning s as boundary region
boundary region
domain region
Experiments
 Datasets:
 25 SCOP folds: 2678 training domains and 471 test
chains (189 multi-domain proteins).
 40 SCOP super-families: 1917 training domains and
375 test chains (131 multi-domain proteins).
 Baseline approach:
 Align test proteins against PSI-BLAST profiles of the
training domains, and use the best aligned regions as
domain regions.
 Evaluation:
 Domain label: significant overlap between the true
domain and the predicted domain.
 Boundaries: both the predicted start and end positions
should be close to the true ones.
Experiments (Cont.)
At least 75% percent positional overlap between the true
domain and the prediction.
Dataset
Fold Dataset
(414 domains)
Super-family Dataset
(288 domains)
Boundary
distance
25
50
25
50
PSI-BLAST
8.7%(36)
11.4%(47)
5.6%(16)
15.3%(44)
DP_nogap
16.4%(68)
26.1%(108)
21.5%(62)
26.7%(77)
DP_gap
30.7%(127)
35.8%(148)
25.0%(72)
30.2%(87)
DP_alter
18.8%(78)
25.6%(106)
17.7%(51)
26.4%(76)
Protein Ranking
Background
Structural classification
Domain segmentation
Protein ranking
Motif discovery
Conclusion & Future Research
 Protein ranking: search protein database for sequences that
share an evolutionary or functional relationship with a given
query sequence.
 Standard protein ranking algorithm: pairwise alignment-based
algorithm, PSI-BLAST, can easily detect close homologs.
 Pairwise alignment-based algorithms are not effective for remote
homolog detection.
Query
Homologous protein
Remote homolog
Other labeled protein
From Local Similarity to
Global Structure (RankProp)
Weston, Elisseff, Zhou, Leslie, Noble, PNAS 2004
Noble, Kuang, Leslie, Weston, FEBS 2005
Weston, Kuang, Leslie, Noble, BMC Bioinformatics 2005
6
2
5
7
3
Query
1
Homologous protein
Remote homolog
Other labeled protein
Unlabeled protein
8
4
1
2
3
6
7
8
4
5
Correct
Ranking
Ranking based
on local similarity
1
2
3
4
5
6
7
8
 Cluster assumption: proteins with structural or functional relation
tend to be in the same cluster in the network.
 Diffusion on the protein similarity network to capture cluster structure.
RankProp (Cont.)
 Capture global structure with diffusion.
 Protein similarity network:
Graph nodes: protein sequences in the database
 Directed edges: weighted by PSI-BLAST e-value
 Initial ranking score at each node: the similarity to the
query sequence
 Iterative diffusion operation: Yt 1  Y0  KYt
Y0 : Initial ranking score
Yt : Ranking score at step t
K : Normalized connectivity matrix
 : a parameter for balancing the initial ranking score
and propagation

MotifProp
[Kuang, Weston, Noble, Leslie
Bioinformatics, 2005]
 Motivated by HITS algorithm for




page ranking and NLP algorithms
Protein-motif network
 Nodes: proteins and motifs
 Edges: whether a motif is
contained in a protein
Motifs: patterns/models built on
protein segments conserved
during evolution.
Often characterize structural/
function properties of a protein.
Examples: eMOTIF, PROSITE, Kmers, BLOCKS…
Query
…FYPGKGHTEDNIVVWLPQYNILVGGCLVKSTSAKDLGNVADAYVNEWSTSIENVLKRYRNINAVVPGHGEVG…
Motif Database
MotifProp (Cont.)
 MotifProp can identify motif-rich regions derived from
motif ranking to help interpret diffusion algorithm.
 Low computational cost: protein-motif network is fast
to build.
 Motifs serve as bridges connecting homologous/remote
homologous proteins.
Query
Motif vertices
Protein vertices
In MotifProp, protein nodes
and motif nodes enforce their
similarity to the query
sequence through propagation.
Diffusion in Protein-motif Network
MotifProp:
~
Normalize affinity matrix H to H
Initialize P and F with the initial activation value
Iterate until converge (
  (0,1)
For all i, Pi  P  (1   )
0
)
~
H
Fj
ij
j
For all i, Fi  F  (1   )
0
~
H'
ij
Pj
Query
j
Motif vertices
Protein vertices
Experiments
 7329 sequences (4246 for training and 3083 for testing) of <95%
identity from SCOP 1.59 plus 100,000 proteins from Swiss-Prot.
 Motif sets: 4-mers, PROSITE and eMOTIF.
Experiments (Cont.)
Algorithm
ROC1
ROC10
ROC50
Sequential MotifProp
0.640
0.663
0.688
k-mer MotifProp
0.621
0.648
0.679
RankProp
0.592
0.667
0.725
PROSITE MotifProp
0.600
0.643
0.664
eMOTIF MotifProp
0.527
0.612
0.666
PSI-BLAST
0.594
0.616
0.641
Conserved Motifs between
Remote Homologs
Background
Structural classification
Domain segmentation
Protein ranking
Motif discovery
Conclusion & Future Research
 We can derive weights of k-mer features from SVM
classifiers trained with profile kernel.
 MotifProp provides activation values on the k-mer
features after propagation.
 Both the SVM weights and Motif activation values can
be mapped back to protein sequences to identify
conserved structural/functional motifs.
 Positional contribution to classification score:
S x j  1 : j  k   Px j  1 : j  k  ,  ,
where Δ is the SVM weights or MotifProp activation
values on k-mer features.
Mapping Discriminative Regions to
Structure (Profile Kernel)
Ecoli
MarA
protein
(1bl0)
 In examined examples,
discriminative motif regions
correspond to conserved
structural features of the protein
superfamily
 Example: Homeodomain-like
protein superfamily.
Motif Rich Regions (MotifProp)


Motif-rich regions on chain B of arsenite oxidase protein from the ISP
protein super-family.
The PDB annotation and motif-rich regions are given.

The 3D protein structure with motif-rich regions in yellow.
Conclusions &
Contributions
Background
Structural classification
Domain segmentation
Protein ranking
Motif discovery & angle prediction
Conclusion & Future Research
 Profile-based string kernels exploit compact





representation of homology information for better
detection of remote homologs.
Dynamic Programming-based approach improves
multi-label domain classification and domain
boundary detection over PSI-BLAST alignment-based
approach.
MotifProp improves protein ranking over PSI-BLAST
by network diffusion on protein-motif network.
Interpretation of profile-SVM classifier and MotifProp
by motif regions: conserved structural components.
Fast kernels for inexact string matching.
Classifiers for protein backbone angle prediction (not
presented).
SVM-FOLD Web Server
Future Research
 Protein function inference by structural genomics
and proteomics


Identify functional properties of protein structures with
kernel methods, e.g. prediction of protein functional
sites and structure-based identification of protein-protein
interaction sites.
Protein function inference from proteomics, e.g. protein
function prediction based on protein-protein interaction
patterns and protein structures.
 Protein structure prediction

Unified prediction of protein backbone and side chain
positions (Phi-Psi angles and rotamers) with energybased cost function.
Acknowledgements: Committee
 Tony Jebara
Dept. of Computer Science, Columbia University
 Christina Leslie (advisor)
Center for Computational Learning Systems & C2B2,
Columbia University
 Kathleen Mckeown (chair)
Dept. of Computer Science, Columbia University
 William Stafford Noble
Dept. of Genome Science & Dept. of Computer Science,
University of Washington
 Rocco Servedio
Dept. of Computer Science, Columbia University
 Jason Weston
Machine Learning Group, NEC Labs (USA)
Acknowledgements: Collaborators










An-Suei Yang (Genome Research Center, Academia Sinica of Taiwan)
Dengyong Zhou (Machine Learning Group, Microsoft)
Yoav Freund (Computer Science Department, UCSD)
Eugene Ie (Computer Science Department, UCSD)
Ke Wang (Computer Science Department, Columbia University)
Wei Chu (Center For Computational Learning Systems, Columbia University)
Kai Wang (Biomedical Informatics Department, Columbia University)
Iain Melvin (Machine Learning Group, NEC)
Girish Yao (Computer Science Department, Columbia University)
Lan Xu (Molecular Biology Department, The Scripps Research Institute)
Publications
 Structural classification:
 Profile kernels (JBCB 2005 and CSB 2004)
 Inexact marching kernels (JMLR 2004, COLT 2003 & KMCB 2004)
 Protein ranking:
 RankProp (FEBS 2005 and BMC Bioinformatics 2005)
 MotifProp (Bioinformatics 2005)
 Protein local structure prediction
 Kernel methods based on sliding-window (Bioinformatics
2004)

Structured output learning (ongoing research)
 Protein domain segmentation (In preparation)
Protein backbone angle prediction
Conformational
Phi-Psi Angles
States
……
3-D structure
(Φ1,Ψ1)
A
(Φ2,Ψ2)
A
(Φ3,Ψ3)
A
(Φ4,Ψ4)
Discretization of
Phi-Psi angles
G
(Φ5,Ψ5)
B
(Φ6,Ψ6)
B
(Φ7,Ψ7)
B
(Φ8,Ψ8)
B
……
B
…
Sliding-window SVM approach
[Kuang, Leslie & Yang 2004]
Encode each position independently with sequence information
Conformational
within a length-k window.
States
A
A
A:-3 –4 –4 –4 –3 –4…..
A
A:0 –1 –1 3 –4 3 4 1…..
B
B:0 –1 2 1 –3 4 0 –1……
B
B:-2 –3 –4 –5 –2 4……
B
B:0 –3 –1 –2 –4 –1……
B
G
G
E
……
To
SVM
B
B
B
B
B
Smoothing: use
predictions to train a
second sets of SVMs
Experiments
 Datasets: 697 sequences of 97,365 amino acids with
sequence identity < 25 % from PDB (PDB_SELECT25).
 Comparison against:
 LSBSP1: query against local structure-based sequence profile database.
 HMMSTR: Hidden Markov Model based on local structural motifs.
RankProp in Genome Browser
Regularization Framework
 Closed form solution of MotifProp
Y *  (1   )( I  W ) 1Y 0
 Related to the regularization framework in Zhou et. al.
NIPS 2003
1
1
Q(Y * )   Wij || Yi *  Y j* || 
2 i , j 1

n
2
, Where
•
•
n
*
0
||
Y

Y
 i i ||
i 1
0
*




P
P
*
0
Initial Ranking : Y  
 Final Ranking : Y   * 
F 
0
F 
 
 
Normalized Affinity Matrix :
 0 H~ 

W   ~
 H' 0 
Discussion: Alignment VS K-mers
 Rangwala and Karypis et. al. 2005 achieved further
improvement on a previous benchmark dataset on SCOP
1.53 with kernels defined on profile-profile alignment.
 Proteins are documents with no punctuation and there is
no dictionary!!!

Optimal local alignment detects the  Length k subsequences summarize local
matches
most conserved paragraph.

Sensitive for detecting homologous  Can detect discontinuous and disordered
conserved regions between remote homologs
proteins.
Good for remote homologs with one  Can achieve fast computation
relatively long conserved region.
 Well defined k-mer feature space for applying
learning algorithms
 Alignment is easily interpretable.

Extracting Discriminative Motif
Regions
 SVM training determines support vector sequence
profiles and their weights: (P(xi), i)
 SVM decision hyperplane normal vector:
w = i yi i (P(xi))
 Positional contribution to classification score:
S x j  1 : j  k   Px j  1 : j  k  , w  
 Averaged positional score for positive sequences:
S avg x j  
 S x j  k  q : j  1  q
q 1k
Map Motif Rich Regions
Map final motif activation values to query sequence to
find conserved structural components between remote
homologs.
Determination of Protein Structures
 X-ray crystallography
The interaction of x-rays with electrons arranged in a crystal can
produce electron-density map, which can be interpreted to an
atomic model. Crystal is very hard to grow.
 Nuclear magnetic resonance (NMR)
Some atomic nuclei have a magnetic spin. Probed the molecule
by radio frequency and get the distances between atoms. Only
applicable to small molecules.
Protein structure prediction
 Comparative modeling:
Where there is a clear sequence relationship
between the target structure and one or more known
structures.
 Fold recognition ('threading'):
No sequence homology with known structures. Find
consistent folds (remote homology detection).
 Ab initio structure prediction(‘de novo’):
Deriving structures, approximate or otherwise, from
sequence.
RankProp
Weston, Elisseff, Zhou, Leslie, Noble, PNAS 2004
Noble, Kuang, Leslie, Weston, FEBS 2005
Weston, Kuang, Leslie, Noble, BMC Bioinformatics 2005
 Protein similarity network:
 Graph nodes: protein sequences in the database


Directed edges: exp(-Sij/σ), where Sij is the PSIBLAST e-value between ith protein and jth protein.
Initial ranking score at each node: the similarity to
the query sequence
 Iterative diffusion operation: Yt+1=Y0+αKYt
Y0: Initial ranking score
Yt: Ranking score at step t
K: Normalized connectivity matrix
α: (0,1) a parameter for balancing the initial ranking score
and propagation
Diffusion in protein similarity network
 Remote homology can be detected by diffusion
from common neighbors in the cluster.
 Protein network is expensive to build
 Hard to interpret the ranking
Extracting Discriminative Motif
Regions
 Sort positional
scores: about 40%50% of positions in
positive training
sequences
contribute 90% of
classification score
 Peaky positional
plots 
discriminative motifs
Inferring Protein Structure with Machine
Learning
 Machine learning builds
statistical models from data to
learn underlining principles
automatically.
 Machine Learning techniques
are promising for inferring
protein structure: large
amount of data but little
theory.
 Solved structures in protein
databank (PDB) provide
valuable knowledge about
protein folding patterns.
Number of total
Structures
New Structures added
in a year.
Sequence Alignment for Matching
Proteins
 Smith-Waterman algorithm finds the optimal local
alignment with maximum substitution scores
between two sequences by dynamic programming
Alignment-based Algorithms
 Smith-Waterman algorithm: find optimal local
alignments by dynamic programming
 BLAST & PSI-BLAST [Altschul 1997]: fast
approximations of Smith-waterman algorithm. Only
extend alignment from short identical stretches
potentially contained in true matches. Profiles are built
to search database iteratively.
 SAM-98 [Karplus 1999]: HMM based approach. Build
HMM for the query and target sequences. Rank
sequences by likelihood computed from HMMs.
HITS Algorithm for Page Ranking
 Good hubs: web pages with many pointers to related pages.
 Good authorities: web pages pointed to by hubs.
 Recursive updating enforces good hubs and good authorities.
Let N be the set of edges
Initialize Hub[A] and Aut[A] to 1
Iterate until converge
For all A, Aut[A]=∑(B,A) NHub[B]
For all A, Hub[A]=∑(A,B) NAut[B]
Normalize Aut and Hub
Hubs
Authorities
Kleinber, 1998