Investigating mRNAs of intrinsically disordered proteins
Harini Gopalakrishnan
Advisor: Dr. Predrag Radivojac
Basic Facts –mRNA
1. mRNA: messenger ribonucleic acid
2. Nucleic acid polymer consisting of nucleotide monomers with the
bases adenine, guanine, cytosine and uracil
3. Three important types of RNA
• rRNA (ribosomal RNA)
• tRNA (transfer RNA)
• mRNA (messenger RNA)
Basic Facts –mRNA (contd)
Encodes and carries information from DNA to protein synthesis
http://en.wikipedia.org/wiki/Image:Mature_mRNA.png
Basic Facts-mRNA (contd)
What is the significance of mRNA folding?
Secondary structures have been used to explain
• Translational control
• Regulatory function in the cell, especially of the non-coding mRNA
What are the different folding algorithms?
• Energy minimization
• Base pair maximization
• Covariation
E.g. Mfold, Vienna Package
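To make the base pair maximization idea concrete, here is a minimal sketch of a Nussinov-style dynamic program. It only counts the largest number of nested pairs (Watson-Crick plus G-U) and is not what Mfold or the Vienna Package compute, since both rely on energy minimization. The function name `max_base_pairs` and the `min_loop` parameter are illustrative assumptions.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style dynamic program).

    min_loop is the assumed minimum number of unpaired bases inside a hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base j stays unpaired
            score = best[i][j - 1]
            # Case 2: base j pairs with some k, splitting the interval in two
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_base_pairs("GGGAAAUCC")` finds the three stacked pairs of a small hairpin. Energy-based predictors weight such pairs by thermodynamic parameters instead of counting them equally.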
Basic Facts-Disordered Protein
What is a disordered protein?
• Lacks a well-defined three-dimensional structure
• Conserved between species in composition and sequence
• Contains regions of low sequence complexity
• Amino acid composition biased away from bulky hydrophobic
residues
What is the significance of disordered proteins?
They take part in the regulation of transcription and translation,
cellular signal transduction, protein phosphorylation, the storage of
small molecules, and the regulation of the self-assembly of large
multiprotein complexes such as the bacterial flagellum and the
ribosome.
Basic Facts-Disordered Protein (contd)
What is its role in diseases?
Famous (or infamous?) disordered proteins in diseases
- alpha-synuclein
- p53
- proteins in HPVs linked to Ovarian Cancer
What are the different predictors that are used?
(all based on amino acid sequence inputs)
VL2, VSL2, PONDR, VLXT
Image Courtesy: http://www.disprot.org
Snapshot from Previous Studies
• Third codon base and stability
• Speed of translation and protein secondary structures
(alpha helices and beta sheets)
• The three bases in the codon
1st base - Biosynthetic pathway
2nd base - Residue hydrophobicity
3rd base - Helix- or beta-strand-forming potential of the amino acid
In a Nutshell
• Check whether nucleotide composition differs between the mRNAs of
ordered and disordered proteins.
• Check whether the stability of the RNA fold helps differentiate
proteins in the two categories.
• This work is different because no previous study has linked protein
disorder to mRNA composition and stability.
• Establishing such a correlation would open new avenues
for studying how protein structure can be inferred directly
from its precursor, the mRNA.
Hypothesis
• There should be a codon bias between the mRNA sequences of
ordered and disordered proteins
• There should be a difference in folding energy (stability) between
the mRNAs of ordered and disordered proteins
• There is a correlation between the age of codons and disordered
proteins
Central dogma
Method
• Data Collection
• Implementation
• Analysis
• Future Work
Data Collection
One of the most important phases, as the significance of the whole analysis
rests on the quality of the datasets selected for both categories of proteins.
Dataset
• Predicted Dataset (from disorder predictors)
• True Dataset (experimentally verified)
After experimenting with various other databases, proteins were finally
taken from unigene90, DisProt and PDB.
Disorder was predicted using VSL2B.
Data Collection (contd)
Once we have the proteins of interest, we use UniProt to
mine the web for each protein and its corresponding mRNA, based
on their UniGene ID.
Problem!
• Introns
• Poly-A tails, which need to be removed
We need a clean dataset in order to study codon usage and
nucleotide composition.
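As a rough illustration of the poly-A clean-up step, the sketch below strips a trailing run of A's from an mRNA sequence. The function name `trim_poly_a` and the `min_run` threshold of 10 are assumptions made for this example, not values taken from the study. Intron removal is not attempted here, since it needs a protein-to-nucleotide alignment.

```python
import re

def trim_poly_a(seq, min_run=10):
    """Remove a trailing poly-A tail of at least min_run bases (assumed threshold)."""
    seq = seq.upper()
    # Anchor the run of A's to the end of the sequence
    match = re.search(r"A{%d,}$" % min_run, seq)
    return seq[:match.start()] if match else seq
```

Note that a purely pattern-based trim also removes genuine A's that happen to sit directly in front of the tail, so the result is best checked against the protein sequence.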
Solution - Alignment
BLAST
• Proved efficient when aligning the ordered proteins
• Extremely inefficient when aligning protein vs.
mRNA for the disordered set of proteins
• Disordered proteins have more low-complexity regions
WISE
• Software from EMBL-EBI to align protein vs.
nucleotide data
• Uses hidden Markov model methods to make gene
predictions and hence identifies introns
• Extremely efficient and provided high-quality
datasets
Data Collection-Final input
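Statistics
[Chart: number of sequences in the Predicted Order, Predicted Disorder,
True Order and True Disorder datasets. Values shown: 81, 96, 343, 151;
the pairing of values with labels is not preserved in this transcript.]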
Method -Overview
Two main characteristics of mRNA were analyzed:
Nucleotide composition of mRNA
• Codon usage
• Nucleotide composition
RNA folding energy and base pair analysis using Mfold
• Number of base pairs formed
• Total minimum free energy per RNA fold between
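The two composition measures listed above can be sketched in a few lines of Python. This is only an illustration: the study itself used Perl and MATLAB, and the function names `nucleotide_composition` and `codon_usage` are made up for this example. The input is assumed to be a clean, in-frame coding sequence.

```python
from collections import Counter

def nucleotide_composition(seq):
    """Fraction of A, C, G and T in a sequence (U is treated as T)."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}

def codon_usage(cds):
    """Count codons in an in-frame coding sequence; a trailing partial codon is ignored."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))
```

Averaging `nucleotide_composition` over all sequences in a dataset gives per-nucleotide fractions of the kind reported in the results tables.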
Methods
Mfold Snapshot
Mfold -Overview
What is Mfold?
An RNA secondary structure prediction algorithm
by M. Zuker and N. Markham
How does it work?
It is based on nearest-neighbor thermodynamic
rules, in which free energies are assigned to loops
rather than to base pairs. It predicts the optimal
structure by minimizing the overall free energy of
the fold, including the coaxial stacking
of helices.
What does it output?
Several output files are produced for every optimal and
suboptimal fold within the allowable energy range.
The energy dot plot (on the right) is one
important component of the output.
Method
Tools Employed
• Parsing and web mining done in Perl
• Analysis and graphs done in MATLAB
• Reporting and graphs done in Excel
• Disorder prediction from mRNA inputs done in MATLAB
using an SVM
Results
Nucleotide Composition
True Dataset (DT = true disorder, OT = true order)
Nucleotide   DT      OT      P-value (DT, OT)
A            0.275   0.267   1.06E-02
C            0.270   0.247   5.04E-17
G            0.271   0.259   4.55E-05
T            0.183   0.226   5.21E-57
Predicted Dataset (DP = predicted disorder, OP = predicted order)
Nucleotide   DP      OP      P-value (DP, OP)
A            0.267   0.239   0.0067
C            0.275   0.256   0
G            0.291   0.250   0
T            0.166   0.255   0
Analysis based on the Composition of mRNA
Analysis of Codon Age
[Figure: codons of each amino acid, arranged from old to new]
14 out of 18 amino acids have the disorder-promoting codon
as the older one.
2 amino acids (M and W) are neutral, as they have only one
codon each.
Base Composition
Preferential selection of codons with “g” or “c” as the third base
Base Composition
Predicted Dataset
        Third Base                 Second Base                First Base
Base    OP   DP   Total  %         OP   DP   Total  %         OP   DP   Total  %
G       4    9    13     26.07     9    5    14     -4.88     10   6    16     -3.57
C       4    12   16     38.57     10   6    16     -3.57     8    8    16     10.48
T       14   2    16     -31.67    9    6    15     -0.71     7    5    12     0.83
A       13   1    14     -32.98    7    7    14     9.17      10   5    15     -7.74
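The slides do not define the % column, but its values are consistent with one reading: the share of DP codons carrying a given base minus the share of OP codons carrying it, in percentage points. The sketch below recomputes the third-base column under that assumption. The function name `share_difference` is illustrative, and the counts are copied from the table above.

```python
# Third-base counts from the table: base -> (OP codons, DP codons)
third_base = {"G": (4, 9), "C": (4, 12), "T": (14, 2), "A": (13, 1)}

def share_difference(counts):
    """DP share minus OP share for each base, in percentage points (assumed definition)."""
    op_total = sum(op for op, _ in counts.values())
    dp_total = sum(dp for _, dp in counts.values())
    return {
        base: round(100 * dp / dp_total - 100 * op / op_total, 2)
        for base, (op, dp) in counts.items()
    }

diffs = share_difference(third_base)
print(diffs)
```

Under this reading, positive values mark bases that are over-represented at that codon position among DP codons, which matches the stated preference for “g” or “c” as the third base.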
Statistical Verification
Third Base
Base   Order    Disorder   T-test     R Test
g      33949    11475      4.49E-48   0.3263
c      26576    8594       1.17E-15   0.4424
t      23488    6404       3.03E-11   0.0308
a      31324    7721       2.40E-64   0.3548
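The slides do not say which t-test variant was used or what the unit of observation was. As a hedged sketch, the following computes Welch's unequal-variance t statistic for two samples, for instance the per-sequence fraction of a given third base in the ordered and disordered sets. The function name `welch_t` and the choice of Welch's form are assumptions.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (each needs >= 2 values)."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    # Standard error of the difference in means, without pooling variances
    std_err = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / std_err
```

Where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.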
Energy of Folding and Base Pair
Energy of Folding
Predicted Dataset
                                OP        DP         P-value
Average Minimum Energy (kcal)   -2230     -2487.27   7.08E-03
Average Energy (kcal)           -2170     -2428.29   6.93E-03
Average Length                  677.57    679.35     0.87
Energy of Folding and Base Pair
Base Pair Analysis
Summary-Nucleotide Analysis
                       OP        DP       P-value (OP vs. DP)
Average Length         1005.77   732.9    --
Average Bases          0.062     0.050    0.0063
Bonding ability of A   0.118     0.118    0.2367
Bonding ability of C   0.133     0.08     3.33e-06
Bonding ability of G   0.151     0.10     7.72e-08
Bonding ability of T   0.146     0.14     0.81
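The slides do not define "bonding ability". One plausible reading, assumed here, is the fraction of each nucleotide that ends up base-paired in the predicted fold. The sketch below computes that fraction from a structure written in dot-bracket notation. This input format is an assumption for illustration, as is the function name `paired_fraction_by_base`.

```python
def paired_fraction_by_base(seq, structure):
    """Fraction of each nucleotide that is paired, given a dot-bracket structure."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("U", "T")
    totals = {base: 0 for base in "ACGT"}
    paired = {base: 0 for base in "ACGT"}
    for base, mark in zip(seq, structure):
        if base in totals:
            totals[base] += 1
            # '(' and ')' mark the two partners of a base pair
            if mark in "()":
                paired[base] += 1
    return {b: (paired[b] / totals[b] if totals[b] else 0.0) for b in "ACGT"}
```

For a small hairpin, `paired_fraction_by_base("GGGAAATCC", "(((...)))")` reports every G, C and T as paired and every A as unpaired.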
Energy of Folding and Base Pair
Sequence Entropy Plot
Future Work
Predictions
Aim: to predict disorder from mRNA based on all of the above
information
Using Support Vector Machines (SVMs), with features based on
• Codon composition
• Age of codons
• Base composition
Accuracies have been good and promising
Acknowledgments
Dr. Predrag Radivojac
Dr. Haixu Tang
Dr. Vladimir Uversky
Amrita Mohan
Linda Hostetter
Informatics faculty and staff
My various Course Professors
Friends and Fellow Students
References
1. http://helix.nih.gov/docs/online/mfold/node3.html
2. Jan C. Biro. Nucleic acid chaperons: a theory of an RNA-assisted protein folding. Theoretical Biology and Medical Modelling 2005, 2:35.
3. T. A. Thanaraj and P. Argos. Protein secondary structural types are differentially coded on messenger RNA. Protein Sci. 1996, 5:1973-1983.
4. Taylor FJR, Coates D. 1989. The code within codons. Biosystems 22:177-187.
5. Brunak S, Engelbrecht J, Kesmir C. 1994. Correlation between protein secondary structure and the mRNA nucleotide sequence. In: Protein Structure by Distance Analysis. Amsterdam: IOS Press. pp 327-334.
6. H. Jane Dyson and Peter E. Wright. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005 Mar; 6(3):197-208.
7. Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, and Obradovic Z. Intrinsic disorder and protein function.
8. Tompa P. Intrinsically unstructured proteins evolve by repeat expansion. Bioessays 2003 Sep; 25(9):847-55.
9. Svetlana A. Shabalina, Aleksey Y. Ogurtsov, and Nikolay A. Spiridonov. A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res. 2006; 34(8):2428-2437.
10. Edward N. Trifonov. Theory of Early Molecular Evolution. Landes Bioscience 2006.
11. E. N. Trifonov. Consensus temporal order of amino acids and evolution of the triplet code. Gene 2000; 261:139-151.
12. Predrag Radivojac, Zoran Obradovic, David K. Smith, Guang Zhu, Slobodan Vucetic, Celeste J. Brown, J. David Lawson and A. Keith Dunker. Protein flexibility and intrinsic disorder. Protein Science (2004), 13:71-80.
13. N. R. Markham and M. Zuker. UNAFold: software for nucleic acid folding and hybridizing. Methods in Molecular Biology: Bioinformatics. Totowa, NJ: Humana Press, in press.
14. Peng K, Radivojac P, Vucetic S, Dunker AK, and Obradovic Z. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 7:208, 2006.
15. Gene Ontology: tool for the unification of biology. Nature Genet. (2000) 25:25-29.
16. Brooks D, Singh M, Fresco JR. Selection influences the proteomic usage of a majority of amino acid
17. Vucetic S, Obradovic Z, Vacic V, Radivojac P, Peng K, Iakoucheva LM, Cortese MS, Lawson JD, Brown CJ, Sikes JG, Newton CD, and Dunker AK. 2005. DisProt: a database of protein disorder. Bioinformatics 21:137-140.