Improve Protein Disorder Prediction Using Homology
Instructor: Dr. Slobodan Vucetic
Student: Kang Peng
Protein Disorder Prediction
• What is a protein?
A protein is a chain of amino acids (AAs) drawn from 20 standard types,
so it can be represented as a string over a 20-letter alphabet. A protein
usually folds into a 3D structure, which is important to its function.
• What is a disordered protein?
A disordered protein is one in which part or all of the chain has NO
identified 3D structure.
• Can protein disorder be predicted?
The predictor developed by Dr. Vucetic predicts protein disorder with
an accuracy of 82.6%.
• Current dataset used to train the disorder predictor
  • 145 proteins with a CONFIRMED long disordered region
  • 130 proteins that are completely ordered
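
The string representation described on this slide can be sketched in Python. This is only an illustration, not code from the project: the alphabet is the standard 20-residue one-letter code, while the example sequence, the per-residue label string ('D' for disordered, 'O' for ordered), and both function names are hypothetical.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_protein(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only the 20 standard residues."""
    return bool(sequence) and set(sequence.upper()) <= AMINO_ACIDS


def disordered_segments(labels: str) -> list[tuple[int, int]]:
    """Return (start, end) index pairs (end exclusive) for each run of 'D' labels."""
    segments = []
    start = None
    for i, label in enumerate(labels):
        if label == "D" and start is None:
            start = i                      # a disordered run begins here
        elif label != "D" and start is not None:
            segments.append((start, i))    # the run ended just before i
            start = None
    if start is not None:                  # run extends to the end of the chain
        segments.append((start, len(labels)))
    return segments


# Hypothetical 12-residue chain whose positions 4..8 (0-based) are disordered.
sequence = "MKTAYIAKQRQI"
labels = "OOOODDDDDOOO"
print(is_valid_protein(sequence))
print(disordered_segments(labels))
```

Keeping the labels as a string of the same length as the sequence makes it easy to cut out disordered segments by index.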
The Objective
Improve disorder prediction using homologous/similar sequences
• What are homologous/similar sequences?
Proteins that may derive from the same ancestor. They tend to have
SIMILAR amino acid sequences.
• Where to find homologous/similar sequences?
For a given protein (its amino acid sequence), homologous/similar
sequences can be found using the NCBI BLAST Web server
(http://www.ncbi.nlm.nih.gov/BLAST/).
• The hypothesis
Homologous/similar sequences may have similar structures and,
therefore, similar disordered regions. So we can use similar sequences
to enlarge the training set.
Methodology
To enhance the training set using homologous sequences:
• Find homologous sequences that have segments similar to the
disordered proteins in the original dataset
• Remove sequences that are too similar to the original sequences
• Label these segments as disordered
• Train disorder predictors with the new data
Get Homologous Sequences
• Each disordered segment in the original dataset is sent to the
NCBI BLAST Web server
Done automatically by a Visual Basic program
• Search against the non-redundant database (nr) and return
sequences with E-value < 10
6380 sequences found
• Discard sequences that are too similar to the original sequences
444 sequences left in total, corresponding to 55 original disordered
sequences
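
The two filtering steps on this slide can be sketched as follows. The E-value cutoff of 10 comes from the slide. The slides do not say what counts as "too similar", so the 90% identity threshold below is an assumption chosen only for illustration, as are the `Hit` record and its field names.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit for a query segment (field names are illustrative)."""
    subject_id: str
    evalue: float
    percent_identity: float  # identity to the original segment, 0-100


def filter_hits(hits, max_evalue=10.0, max_identity=90.0):
    """Keep hits under the E-value cutoff that are not near-duplicates of the query.

    max_evalue=10 follows the slide; max_identity=90 is an ASSUMED threshold,
    because the slides do not define "too similar".
    """
    return [
        h for h in hits
        if h.evalue < max_evalue and h.percent_identity < max_identity
    ]


hits = [
    Hit("seqA", 1e-30, 99.0),  # near-duplicate of the query: discarded
    Hit("seqB", 2e-5, 62.0),   # kept
    Hit("seqC", 25.0, 30.0),   # E-value too high: discarded
]
kept = filter_hits(hits)
print([h.subject_id for h in kept])
```

Dropping near-duplicates keeps the added sequences from simply repeating the original training examples.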
Which BLAST to use?
• Standard BLAST
We may need a scoring matrix specially developed for disordered
protein alignments
• PSI-BLAST
It is adaptive and builds its scoring matrix from the results of the
previous iteration, so the choice of initial scoring matrix is not
very important
• Current Experiment
PSI-BLAST with initial matrix BLOSUM62, using the result of the
1st iteration
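
For readers who want to try a comparable search locally, the settings above map onto the NCBI BLAST+ `psiblast` tool roughly as sketched here. This is not how the experiment was run (the searches went through the NCBI Web server): a local BLAST+ installation, a local copy of the nr database, and the file names are all assumptions. The helper only builds the command line and does not execute it.

```python
import shlex


def psiblast_command(query_fasta, out_path, db="nr", evalue=10, iterations=1):
    """Build (but do not run) a BLAST+ psiblast command line.

    Assumes BLAST+ is installed and `db` names a locally available database.
    """
    return [
        "psiblast",
        "-query", query_fasta,              # FASTA file with one disordered segment
        "-db", db,                          # non-redundant protein database
        "-matrix", "BLOSUM62",              # initial scoring matrix
        "-evalue", str(evalue),             # report hits with E-value below this
        "-num_iterations", str(iterations), # 1 = stop after the first iteration
        "-outfmt", "6",                     # tabular output
        "-out", out_path,
    ]


# Hypothetical file names, for illustration only.
cmd = psiblast_command("segment.fasta", "hits.tsv")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the assumed files and database exist.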
Train Disorder Predictor
• Group sequences into families
Group newly found sequences according to the original sequence they
are similar to. There are 145 families in total (only 55 families
contain new sequences)
• Neural Network + Bagging
Randomly sample a BALANCED training set and train a NN on it.
Repeat 10 times and combine the 10 NNs by majority voting
• Cross-Validation
Randomly divide the sequences into groups, use 1 group as the test
set, and randomly sample the training set from the remaining groups
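
The balanced sampling and majority voting described above can be sketched in plain Python. The neural networks themselves are left out: the "models" below are stand-in callables. Balancing by down-sampling the larger class and breaking ties in favor of disorder are both assumptions, since the slides do not specify either, and all function names are hypothetical.

```python
import random
from collections import Counter


def balanced_sample(examples, rng):
    """Draw equal numbers of disorder (1) and order (0) examples.

    `examples` is a list of (features, label) pairs. The larger class is
    down-sampled to the size of the smaller one (an assumed way to balance).
    """
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    n = min(len(pos), len(neg))
    sample = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(sample)
    return sample


def majority_vote(predictions):
    """Combine 0/1 votes from several models; ties go to disorder (1), by assumption."""
    counts = Counter(predictions)
    return 1 if counts[1] >= counts[0] else 0


def bagged_predict(models, x):
    """Apply every trained model to x and combine the votes."""
    return majority_vote([m(x) for m in models])


rng = random.Random(0)
data = [([i], 1) for i in range(5)] + [([i], 0) for i in range(20)]
sample = balanced_sample(data, rng)
print(Counter(label for _, label in sample))  # 5 examples of each class

# Stand-in "models": simple callables instead of trained neural networks.
models = [lambda x: 1] * 6 + [lambda x: 0] * 4
print(bagged_predict(models, [0.3]))          # 6 of 10 models vote disorder
```

With an even number of voters such as 10, a tie-breaking rule is needed; the choice made here is arbitrary.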
Results
The classification accuracies (%):

(a) Without Homologous Sequences

Experiment   Disorder   Order    All
1            76.10      89.75    82.92
2            75.02      89.61    82.32
3            74.90      90.29    82.60
4            74.08      89.64    81.86
5            73.63      89.56    81.59
6            75.05      89.47    82.26
7            75.07      89.90    82.48
8            74.80      88.72    81.76
9            74.85      90.07    82.46
10           74.61      89.51    82.06
Avg          74.81      89.65    82.23
Std           0.65       0.42     0.41

(b) With Homologous Sequences

Experiment   Disorder   Order    All
1            79.59      90.30    84.94
2            80.10      89.77    84.93
3            80.09      89.09    84.59
4            80.32      89.50    84.91
5            77.99      89.42    83.71
6            79.11      89.82    84.46
7            79.81      89.13    84.47
8            78.20      90.40    84.30
9            80.23      89.70    84.97
10           78.74      89.77    84.26
Avg          79.42      89.69    84.55
Std           0.86       0.43     0.40
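
The Avg and Std rows for the "All" column can be re-derived from the ten experiments, as in the short check below. The values are copied from the results above. Using the sample standard deviation is an assumption, but it is the one that reproduces the reported figures.

```python
from statistics import mean, stdev

# "All" accuracies (%) for experiments 1-10, copied from the results.
without_homologs = [82.92, 82.32, 82.60, 81.86, 81.59, 82.26, 82.48, 81.76, 82.46, 82.06]
with_homologs = [84.94, 84.93, 84.59, 84.91, 83.71, 84.46, 84.47, 84.30, 84.97, 84.26]

for name, values in [("without", without_homologs), ("with", with_homologs)]:
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    print(f"{name}: avg={mean(values):.2f} std={stdev(values):.2f}")

# Difference between the two average accuracies, in percentage points.
gain = mean(with_homologs) - mean(without_homologs)
print(f"gain: {gain:.2f} percentage points")
```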
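Conclusion
• After adding homologous sequences to the training set, overall
prediction accuracy increases by about 2 percentage points
(82.23% to 84.55%). The gain comes from the disorder class
(74.81% to 79.42%); accuracy on the order class is essentially
unchanged (89.65% to 89.69%)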
Thank You!