Transcript Lawrence
Challenges in Creating
an Automated Protein
Structure Metaserver
Lawrence Wisne - CS 273
The Problem
Given a set of servers running prediction algorithms and
their results on test data, is it possible to automate the
choice of a “best” server for unknown sequences?
There does not yet exist a structure prediction algorithm
which gives consistently accurate results
While not providing any new answers, an algorithm which
can successfully answer the question above could add a
great degree of consistency to structure predictions
Solution Outline
Download the results of the CASP6 competition
Isolate a small subset S of structure prediction servers
such that the worst result for the CASP6 target sequences
is minimized, given the correct choice of server
Link each amino acid target sequence with the server that
gives the best result for that sequence
Isolate the similar characteristics of the sequences that are
linked with each server
Note that this requires that he number of results which
are optimally linked with each server in S is large enough
that characteristics of these sequences can be observed
Picking a Set of optimal servers
To evaluate the quality of a given server’s prediction of
sequence i, we use a relative property, not an absolute one
the ranking Rsi of each the prediction of server S on
sequence i among CASP6 participants
Ideally, we would pick our set of servers S such that
i(minsS(Rsi)) is minimized.
This is difficult even if we decide |R|, as there are |S|
choose |R| possible subsets
A decent approximation
To approximate an optimal subset S, use the following
greedy algorithm:
For each server, find the number of targets for which the
server ranked within the top t, for some threshold t
Add the server with the largest count to S, and remove
from consideration the targets for which that server was
in the top t
Repeat until S reaches the desired size
Using this algorithm, with t=5 and |S|=6, the worst result, given
correct server prediction, had a rank of 13, and the mean rank was
~2
Linking Servers with Targets
Now, we can link each target sequence with the
server in our subset S that produces optimal
results
In the case of t=5 and |S|=6, the smallest group of
targets linked with a server was of size 8, and the
largest was of size 32
A Reduced (but still very
difficult) Problem
Find the common characteristics of a set of input
strings which represent amino acids
The main methods attempt were Machine Learning
and Clustering
The machine learning approach
Given a training set and a set of features that are
present in each member of the set, weigh the
features such that future input instances will be
solved optimally
Sounds great, but what are the “features” of a
string?
The Clustering Approach
OK, so why can’t we just group the strings according to
some characteristic?
Some kind of edit distance metric (ie: Smith-Waterman)
may sound good in principle, but there are problems:
Alphabet too large
String size too varied
Most metrics are thrown off by size differences, and
normalization has its problems as well
Scoring
Patterns are (at best) very subtle
So, where do we go from here?
It is very possible that a way to solve the reformulated
problem does exist
Better domain-specific knowledge may be necessary
To create a richer set of features for learning, we can use the
various properties of amino acids to replace the alphabet
Alternately, it may be possible to alter the
match/mismatch scores to account for physical
properties
More sample cases would be very helpful
The size differentials in the strings made certain metrics almost
useless
More cases = the possibility of only comparing like-sized strings