Improve Protein Disorder Prediction Using Homology
Instructor: Dr. Slobodan Vucetic
Student: Kang Peng
Protein Disorder Prediction
What is a protein?
A protein is a chain of amino acids (AAs) drawn from 20 standard types, so
a protein can be represented as a string over a 20-letter alphabet. A
protein usually has a 3D structure, which is important to its function.
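As a small illustration of the string representation, the sketch below checks that a sequence uses only the 20 standard one-letter codes and computes its amino acid composition. The function names and the example sequence are made up for illustration and are not part of the original project.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_standard_protein(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard letters."""
    return bool(seq) and set(seq.upper()) <= set(AMINO_ACIDS)


def composition(seq: str) -> dict:
    """Return the fraction of each amino acid that occurs in seq."""
    counts = Counter(seq.upper())
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS if counts[aa]}


example = "MKTAYIAKQR"  # made-up sequence, for illustration only
print(is_standard_protein(example))
print(composition(example)["K"])
```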
What is a disordered protein?
A disordered protein is a protein in which part or all of the chain has NO
identified 3D structure.
Can protein disorder be predicted?
The predictor developed by Dr. Vucetic predicts protein disorder with an
accuracy of 82.6%
Current dataset used to train the disorder predictor
• 145 proteins with CONFIRMED long disordered regions
• 130 proteins that are totally ordered
The Objective
Improve disorder prediction using
homologous/similar sequences
What are homologous/similar sequences?
Proteins that may derive from the same ancestor. They tend to have
SIMILAR amino acid sequences
Where to find homologous/similar sequences?
For a given protein (its amino acid sequence), homologous/similar
sequences can be found using the NCBI BLAST
Web server (http://www.ncbi.nlm.nih.gov/BLAST/)
The hypothesis
Homologous/similar sequences may have similar structures and, in
particular, similar disordered regions. So, we can use similar sequences
to enhance the training set
Methodology
To enhance the training set using homologous sequences:
• Find homologous sequences that have segments similar to the disordered proteins in the original dataset
• Remove sequences that are too similar to the original sequences
• Label these segments as disordered
• Train disorder predictors with the new data
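The labeling step above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name and the half-open `[start, end)` convention are hypothetical, and leaving residues outside the aligned segment unlabeled (`None`) is an assumption, since the slides do not say how the rest of a homologous sequence is treated.

```python
from typing import List, Optional


def label_homolog(seq_len: int, start: int, end: int) -> List[Optional[int]]:
    """Per-residue labels for one homologous sequence.

    Residues in the half-open range [start, end), i.e. the segment aligned
    to a known disordered segment, get label 1 (disordered). All other
    residues get None (unlabeled), which is an assumption of this sketch.
    """
    if not 0 <= start < end <= seq_len:
        raise ValueError("aligned range must lie inside the sequence")
    return [1 if start <= i < end else None for i in range(seq_len)]


labels = label_homolog(seq_len=8, start=2, end=5)
print(labels)  # [None, None, 1, 1, 1, None, None, None]
```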
Get Homologous Sequences
Each disordered segment in the original dataset
is sent to the NCBI BLAST Web server
Done automatically by a Visual Basic program
Search against the non-redundant database
(nr), returning sequences with E-value < 10
6380 sequences found
Discard sequences that are too similar to the
original sequences
A total of 444 sequences remain, corresponding to 55 original disordered
sequences
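The filtering described above can be sketched as below. The E-value cutoff of 10 comes from the slide; everything else is an assumption for illustration. In particular, the slides do not state how "too similar" was measured, so the `identity` field and the `MAX_IDENTITY` threshold are hypothetical, as are the `Hit` record and the made-up hits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    subject_id: str   # identifier of the sequence returned by BLAST
    query_id: str     # identifier of the original disordered sequence
    e_value: float    # E-value reported for the alignment
    identity: float   # fraction of identical aligned residues (0..1)


E_VALUE_CUTOFF = 10.0  # from the slide: keep hits with E-value < 10
MAX_IDENTITY = 0.9     # hypothetical threshold for "too similar"


def filter_hits(hits: List[Hit]) -> List[Hit]:
    """Keep hits that are significant enough but not near-duplicates."""
    return [h for h in hits
            if h.e_value < E_VALUE_CUTOFF and h.identity < MAX_IDENTITY]


# Made-up hits, for illustration only.
hits = [
    Hit("s1", "q1", 1e-20, 0.99),  # dropped: too similar to the query
    Hit("s2", "q1", 1e-5, 0.55),   # kept
    Hit("s3", "q2", 25.0, 0.30),   # dropped: E-value too high
    Hit("s4", "q2", 0.5, 0.40),    # kept
]
kept = filter_hits(hits)
print([h.subject_id for h in kept])
print(len({h.query_id for h in kept}))  # original sequences with new homologs
```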
Which BLAST to use?
Standard BLAST
We may need a scoring matrix specially developed for disordered
protein alignments
PSI-BLAST
It is adaptive and can build a scoring matrix based on the
results of the previous iteration. So, the choice of the initial scoring
matrix is not very important
Current Experiment
PSI-BLAST with initial matrix BLOSUM62, using the result
of the 1st iteration
Train Disorder Predictor
Group sequences into families
Group newly found sequences according to the original sequences
they are similar to. So, there are 145 families in total (only 55 families
contain new sequences)
Neural Network + Bagging
Randomly sample a BALANCED training set and train a NN on
it. Repeat 10 times and use majority voting to combine the 10 NNs
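A minimal sketch of balanced bagging with majority voting follows. It is not the original implementation: a simple one-feature threshold learner stands in for the neural network, sampling with replacement and the tie rule (a 5-5 vote counts as order) are assumptions, and the data are made up.

```python
import random
from statistics import mean


def balanced_sample(examples, rng):
    """Draw the same number of disorder (1) and order (0) examples.

    Sampling is with replacement, which is an assumption of this sketch.
    """
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    n = min(len(pos), len(neg))
    return rng.choices(pos, k=n) + rng.choices(neg, k=n)


def train_threshold(sample):
    """Stand-in for the NN: threshold at the midpoint of the class means."""
    mean0 = mean(x for x, y in sample if y == 0)
    mean1 = mean(x for x, y in sample if y == 1)
    threshold = (mean0 + mean1) / 2
    sign = 1 if mean1 > mean0 else -1
    return lambda x: int((x - threshold) * sign > 0)


def train_bagged(examples, train_fn, n_models=10, seed=0):
    """Train n_models learners, each on its own balanced random sample."""
    rng = random.Random(seed)
    return [train_fn(balanced_sample(examples, rng)) for _ in range(n_models)]


def predict_majority(models, x):
    """Majority vote; a tie is resolved as 0 (order) in this sketch."""
    votes = sum(m(x) for m in models)
    return int(2 * votes > len(models))


# Made-up, imbalanced data: 30 ordered and 10 disordered examples.
data = [(i * 0.1, 0) for i in range(30)] + [(5 + i * 0.1, 1) for i in range(10)]
models = train_bagged(data, train_threshold)
print(predict_majority(models, 0.5), predict_majority(models, 5.5))
```

Passing the learner in as `train_fn` keeps the sampling and voting logic independent of the base model, so a real network could be substituted for the threshold learner.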
Cross-Validation
Randomly divide the sequences into groups; use 1 group as the testing set
and randomly sample the training set from the remaining groups
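The cross-validation split can be sketched as below. One assumption is made loudly here: whole families are assigned to a single group, so that a sequence and its homologs never end up on both sides of a split. The slides describe grouping into families and random division into groups but do not state this explicitly. The function names and the family assignments are hypothetical.

```python
import random
from collections import defaultdict


def family_folds(family_of, n_folds, seed=0):
    """Split sequence ids into n_folds groups, keeping each family whole.

    family_of maps a sequence id to its family id.
    """
    families = defaultdict(list)
    for seq_id, fam in family_of.items():
        families[fam].append(seq_id)
    fam_ids = sorted(families)
    random.Random(seed).shuffle(fam_ids)
    folds = [[] for _ in range(n_folds)]
    for i, fam in enumerate(fam_ids):
        folds[i % n_folds].extend(families[fam])
    return folds


def cv_splits(folds):
    """Yield (training_pool, test_set) pairs, one per fold."""
    for k, test in enumerate(folds):
        pool = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield pool, test


# Made-up family assignments, for illustration only.
family_of = {"d1": "F1", "h1a": "F1", "h1b": "F1",
             "d2": "F2", "d3": "F3", "h3a": "F3", "o1": "F4"}
folds = family_folds(family_of, n_folds=2)
for pool, test in cv_splits(folds):
    print(sorted(test), "tested against a pool of", len(pool))
```

The training set for each split would then be drawn from `pool`, for example with a balanced random sample.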
Results
The classification accuracies:
(a) Without Homologous Sequences
Experiment  Disorder  Order   All
1           76.10     89.75   82.92
2           75.02     89.61   82.32
3           74.90     90.29   82.60
4           74.08     89.64   81.86
5           73.63     89.56   81.59
6           75.05     89.47   82.26
7           75.07     89.90   82.48
8           74.80     88.72   81.76
9           74.85     90.07   82.46
10          74.61     89.51   82.06
Avg         74.81     89.65   82.23
Std         0.65      0.42    0.41
(b) With Homologous Sequences
Experiment  Disorder  Order   All
1           79.59     90.30   84.94
2           80.10     89.77   84.93
3           80.09     89.09   84.59
4           80.32     89.50   84.91
5           77.99     89.42   83.71
6           79.11     89.82   84.46
7           79.81     89.13   84.47
8           78.20     90.40   84.30
9           80.23     89.70   84.97
10          78.74     89.77   84.26
Avg         79.42     89.69   84.55
Std         0.86      0.43    0.40
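The Avg and Std rows can be checked from the ten per-experiment values. The sketch below does this for the Disorder column of both tables. Using the sample standard deviation (n - 1 in the denominator) is an assumption about how Std was computed; it does reproduce the reported values for this column.

```python
from statistics import mean, stdev

# Disorder accuracies (%) for experiments 1-10, copied from the tables.
without_hom = [76.10, 75.02, 74.90, 74.08, 73.63,
               75.05, 75.07, 74.80, 74.85, 74.61]
with_hom = [79.59, 80.10, 80.09, 80.32, 77.99,
            79.11, 79.81, 78.20, 80.23, 78.74]

for name, column in [("without", without_hom), ("with", with_hom)]:
    # stdev() is the sample standard deviation (divides by n - 1).
    print(name, round(mean(column), 2), round(stdev(column), 2))

# Gain on the disordered class, in percentage points.
gain = round(mean(with_hom) - mean(without_hom), 2)
print(gain)
```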
Conclusion
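After adding homologous sequences to the
training set, overall prediction accuracy increases by
about 2 percentage points (82.23% to 84.55%). The gain
comes from the disordered class (74.81% to 79.42%);
accuracy on ordered proteins is essentially unchanged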
Thank You!