molecule's structure prediction
Outline
RNA
- RNA folding
- Dynamic programming for RNA secondary structure prediction
Protein
- Secondary Structure Prediction
- Homology Modeling
- Protein Threading
- ab-initio
RNA Basics
RNA bases: A, C, G, U
Canonical Base Pairs
- A-U (2 hydrogen bonds)
- G-C (3 hydrogen bonds, more stable)
- G-U (“wobble” pairing)
Each base can pair with only one other base.
Image: http://www.bioalgorithms.info/
RNA Secondary Structure
Pseudoknot
Stem
Interior Loop
Single-Stranded
Bulge Loop
Junction (Multiloop)
Hairpin loop
Image: Wuchty
RNA secondary structure representation
Circular representation: Bacillus subtilis RNase P RNA
RNA secondary structure representation
DotPlot representation of the same Bacillus subtilis RNA folding: a dot is placed to represent a base pair
RNA secondary structure definition
An RNA sequence is represented as:
R = r1, r2, r3, …, rn
(ri is the i-th nucleotide).
Each ri belongs to the set {A, C, G, U}.
A secondary structure on R is a set S of ordered pairs, written as i•j,
1≤i<j≤n, satisfying:
Computing RNA secondary structure

Working hypothesis:
The native secondary structure of an RNA molecule is the one with the minimum free energy

Restrictions:
- No knots: no two base pairs (ri,rj) and (rk,rl) with i<k<j<l
- No close base pairs: every pair (ri,rj) must satisfy j - i > 3
- Base pairs: A-U, C-G and G-U
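The three restrictions can be checked mechanically. The following is a minimal sketch, assuming a structure is given as a list of 1-based (i, j) pairs; the function name `is_valid_structure` and the `min_gap` parameter are illustrative and not from the slides. It also enforces that each base pairs with at most one other base.

```python
# Allowed base pairs from the slide: A-U, C-G and G-U (order does not matter).
ALLOWED = {frozenset("AU"), frozenset("CG"), frozenset("GU")}

def is_valid_structure(seq, pairs, min_gap=3):
    """Return True if `pairs` (1-based (i, j) tuples, i < j) obey the restrictions."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(seq)):
            return False
        if j - i <= min_gap:                      # "close" base pair
            return False
        if frozenset((seq[i - 1], seq[j - 1])) not in ALLOWED:
            return False
        if i in used or j in used:                # a base pairs at most once
            return False
        used.update((i, j))
    # No knots: reject any two pairs with i < k < j < l.
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True

seq = "AUACCCUGUGGUAU"
nested = [(1, 14), (2, 13), (3, 12), (4, 11), (5, 10)]
crossing = [(1, 7), (4, 8)]
print(is_valid_structure(seq, nested), is_valid_structure(seq, crossing))
```

The quadratic knot check keeps the sketch short; it is adequate for sequences of the size used in these slides.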
Computing RNA secondary structure

Tinoco-Uhlenbeck postulate:

Assumption: the free energy of each base pair is
independent of all the other pairs and the loop
structures

Consequence: the total free energy of an RNA is the
sum of all of the base pair free energies
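Written as a formula, the consequence reads as follows. This is only a restatement of the postulate, assuming e(ri,rj) denotes the free energy of the base pair joining ri and rj and S is the set of base pairs i•j of the structure:

```latex
\[
E(S) = \sum_{i \bullet j \,\in\, S} e(r_i, r_j)
\]
```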
Independent Base Pairs Approach

Use solution for smaller strings to find solutions
for larger strings

This is precisely the basic principle behind
dynamic programming algorithms!
RNA folding: Dynamic Programming
Notation:
- e(ri,rj): free energy of a base pair joining ri and rj
- Bij: secondary structure of the RNA strand from base ri to base rj. Its energy is E(Bij)
- S(i,j): optimal free energy associated with segment ri…rj
S(i,j) = max over all structures Bij of -E(Bij)
RNA folding: Dynamic Programming
There are only four possible ways that a secondary structure of
nested base pairs can be constructed on an RNA strand from position i to j:
1. i is unpaired, added on to
a structure for i+1…j
S(i,j) = S(i+1,j)
2. j is unpaired, added on to
a structure for i…j-1
S(i,j) = S(i,j-1)
RNA folding: Dynamic Programming
3. i, j paired with each other, added on to
a structure for i+1…j-1
S(i,j) = S(i+1,j-1)+e(ri,rj)
4. i, j paired, but not to each other;
the structure for i…j adds together
structures for 2 sub regions,
i…k and k+1…j
S(i,j) = max {S(i,k)+S(k+1,j)} over i<k<j
RNA folding: Dynamic Programming
Since there are only four cases, the optimal score S(i,j) is just the
maximum of the four possibilities:
\[
S(i,j) = \max
\begin{cases}
S(i+1,j) & r_i \text{ unpaired} \\
S(i,j-1) & r_j \text{ unpaired} \\
S(i+1,j-1) + e(r_i,r_j) & i, j \text{ base pair} \\
\max_{i<k<j} \{ S(i,k) + S(k+1,j) \} & i, j \text{ paired, but not to each other}
\end{cases}
\]
To compute this efficiently, we need to make sure that the scores for
the smaller sub-regions have already been calculated
Dynamic Programming !!
RNA folding: Dynamic Programming
Notes:
S(i,j) = 0 if j-i < 4: do not allow “close” base pairs
Reasonable values of e are -3, -2, and -1 kcal/mol
for GC, AU and GU, respectively. In the DP procedure,
we use 3, 2, 1 (or replace max with min)
Build the upper triangular part of the DP matrix:
- start with the diagonal: all 0
- work outward on larger and larger regions
- end with S(1,n)
Traceback starts with S(1,n) and finds the optimal path that leads there.
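The fill order described in these notes can be sketched in Python. This is a minimal sketch, not a reference implementation: the names `PAIR_SCORE` and `fill_matrix` are illustrative, indices are 0-based (so S(1,n) is `S[0][n-1]`), and the scores 3, 2, 1 are the negated energies from the notes.

```python
# Scores are the negated pair energies from the notes (GC 3, AU 2, GU 1).
PAIR_SCORE = {"GC": 3, "CG": 3, "AU": 2, "UA": 2, "GU": 1, "UG": 1}

def fill_matrix(seq, min_sep=4):
    """Fill the upper triangle of S, working outward from the diagonal."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]          # S(i,j) = 0 whenever j - i < min_sep
    for span in range(min_sep, n):           # larger and larger regions
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])        # ri or rj unpaired
            e = PAIR_SCORE.get(seq[i] + seq[j])
            if e is not None:                            # i, j base pair
                best = max(best, S[i + 1][j - 1] + e)
            for k in range(i + 1, j):                    # two sub-regions
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S

S = fill_matrix("AUACCCUGUGGUAU")
print(S[0][13])  # S(1,14) in the 1-based notation of the slides: 12
```

Iterating over `span` rather than over rows guarantees that every smaller sub-region is already scored when it is needed. The inner loop over k makes the whole procedure cubic in the sequence length.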
Initialisation:
Sequence (rows i, columns j): A U A C C C U G U G G U A U
No close base pairs: S(i,j) = 0 for j - i < 4, so the main diagonal and the three diagonals above it are all 0.
Propagation:
Sequence (rows i, columns j): A U A C C C U G U G G U A U
C5…U9:
- C5 unpaired: S(6,9) = 0
- U9 unpaired: S(5,8) = 0
- C5-U9 paired: S(6,8)+e(C,U) = 0
- C5 paired, U9 paired: S(5,6)+S(7,9) = 0, S(5,7)+S(8,9) = 0
So S(5,9) = 0.
Matrix so far (diagonal j - i = 4): S(3,7) = 2, S(4,8) = 3, S(6,10) = 3, S(7,11) = 1, S(8,12) = 1, S(9,13) = 2, S(10,14) = 1; every other entry computed so far is 0.
Propagation:
Sequence (rows i, columns j): A U A C C C U G U G G U A U
C5…G11:
- C5 unpaired: S(6,11) = 3
- G11 unpaired: S(5,10) = 3
- C5-G11 paired: S(6,10)+e(C,G) = 6
- C5 paired, G11 paired: S(5,6)+S(7,11) = 1, S(5,7)+S(8,11) = 0, S(5,8)+S(9,11) = 0, S(5,9)+S(10,11) = 0
So S(5,11) = 6.
Matrix so far: all entries with j - i ≤ 5 are filled, and the diagonal j - i = 6 is filled down to S(5,11).
Propagation:
Completed matrix S(i,j) (rows i, columns j):
i\j   1  2  3  4  5  6  7  8  9 10 11 12 13 14
      A  U  A  C  C  C  U  G  U  G  G  U  A  U
 1 A  0  0  0  0  0  0  2  3  5  6  6  8 10 12
 2 U     0  0  0  0  0  2  3  5  6  6  8 10 10
 3 A        0  0  0  0  2  3  5  5  6  8  8  8
 4 C           0  0  0  0  3  3  3  6  6  6  6
 5 C              0  0  0  0  0  3  6  6  6  6
 6 C                 0  0  0  0  3  3  3  3  3
 7 U                    0  0  0  0  1  1  3  3
 8 G                       0  0  0  0  1  2  2
 9 U                          0  0  0  0  2  2
10 G                             0  0  0  0  1
11 G                                0  0  0  0
12 U                                   0  0  0
13 A                                      0  0
14 U                                         0
Traceback:
Starts from S(1,14) = 12 in the completed matrix (same values as in the preceding matrix) and follows the optimal path back.
FINAL PREDICTION
AUACCCUGUGGUAU
Stem: A1-U14, U2-A13, A3-U12, C4-G11, C5-G10
Hairpin loop: C6 U7 G8 U9
Total free energy: -12 kcal/mol
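The traceback and the energy total can be reproduced with a short Python sketch. The function names are illustrative, and the tie-breaking order (try the i, j base pair first) is an assumption made here. Other optimal structures with the same score exist for this sequence, and a different order would return one of those.

```python
PAIR_SCORE = {"GC": 3, "CG": 3, "AU": 2, "UA": 2, "GU": 1, "UG": 1}

def fill_matrix(seq, min_sep=4):
    """Same recurrence as in the slides, 0-based indices."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for span in range(min_sep, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])
            e = PAIR_SCORE.get(seq[i] + seq[j])
            if e is not None:
                best = max(best, S[i + 1][j - 1] + e)
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S

def traceback(seq, S, min_sep=4):
    """Return one optimal set of base pairs as sorted 1-based (i, j) tuples."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i < min_sep:
            continue                      # region too short to hold a pair
        e = PAIR_SCORE.get(seq[i] + seq[j])
        if e is not None and S[i][j] == S[i + 1][j - 1] + e:
            pairs.append((i + 1, j + 1))  # i, j base pair
            stack.append((i + 1, j - 1))
        elif S[i][j] == S[i + 1][j]:
            stack.append((i + 1, j))      # ri unpaired
        elif S[i][j] == S[i][j - 1]:
            stack.append((i, j - 1))      # rj unpaired
        else:
            for k in range(i + 1, j):     # split into two sub-regions
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return sorted(pairs)

seq = "AUACCCUGUGGUAU"
pairs = traceback(seq, fill_matrix(seq))
energy = -sum(PAIR_SCORE[seq[i - 1] + seq[j - 1]] for i, j in pairs)
print(pairs, energy)
```

With this tie-breaking order the sketch returns the five stacked pairs of the predicted hairpin, and the energy is the negated sum of their scores.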
Protein structure prediction
The sequence-structure gap
Plot: number of known Sequences and known Structures per year, 1980 to 2005 (two y-axes, 0 to 2,000,000 and 0 to 200,000)
The gap is getting bigger
The protein folding problem

- The information for 3D structures is coded in the protein sequence
- Proteins fold into their native structure in seconds
Secondary Structure Prediction
Given a primary sequence
ADSGHYRFASGFTYKKMNCTEAA
what secondary structure will it adopt?
Backbone
A polypeptide chain. The R1 side chains identify the component amino acids.
Atoms inside each quadrilateral are on the same plane, which can rotate according
to angles φ and ψ.
Protein structure
Secondary Structure Prediction Methods
- Chou-Fasman / GOR Method: based on amino acid frequencies
- Machine learning methods: PHDsec and PSIpred
Chou and Fasman (1974)
The propensity of an amino acid to be part of a certain secondary structure (e.g. proline has a low propensity of being in an alpha helix or beta sheet, so it is a breaker)
Name            P(a)   P(b)   P(turn)
Alanine          142     83      66
Arginine          98     93      95
Aspartic Acid    101     54     146
Asparagine        67     89     156
Cysteine          70    119     119
Glutamic Acid    151     37      74
Glutamine        111    110      98
Glycine           57     75     156
Histidine        100     87      95
Isoleucine       108    160      47
Leucine          121    130      59
Lysine           114     74     101
Methionine       145    105      60
Phenylalanine    113    138      60
Proline           57     55     152
Serine            77     75     143
Threonine         83    119      96
Tryptophan       108    137      96
Tyrosine          69    147     114
Valine           106    170      50
Success rate of 50%
Secondary Structure Method Improvements
‘Sliding window’ approach
- Most alpha helices are ~12 residues long
- Most beta strands are ~6 residues long
- Look at all windows and calculate a score for each window. If the score is above a threshold, predict that this is an alpha helix/beta sheet
TGTAGPOLKCHIQWMLPLKK
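A minimal Python sketch of the sliding-window idea, using the P(a) column of the Chou-Fasman table keyed by one-letter amino acid codes. The averaging score, the threshold of 100, the window size in the example and the function names are assumptions for illustration. The original Chou-Fasman rules are more elaborate.

```python
# Helix propensities P(a) from the table above, keyed by one-letter code.
HELIX_PROPENSITY = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}

def window_scores(seq, table, window):
    """Average propensity of every window of the given length."""
    return [
        sum(table[aa] for aa in seq[start:start + window]) / window
        for start in range(len(seq) - window + 1)
    ]

def predict(seq, table, window, threshold=100):
    """Mark with 'H' every residue covered by a window scoring above threshold."""
    flags = [False] * len(seq)
    for start, score in enumerate(window_scores(seq, table, window)):
        if score > threshold:
            for pos in range(start, start + window):
                flags[pos] = True
    return "".join("H" if f else "-" for f in flags)

print(predict("AAAAAAGGGGGG", HELIX_PROPENSITY, window=6))
```

Passing the P(b) column as `table` with a shorter window would give the analogous beta-strand call.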
Improvements since 1980’s
- Adding information from conservation in MSA
- Smarter algorithms (e.g. machine learning)
Success rate: 75%-80%
Machine learning approach for predicting Secondary Structure (PHD, PSIpred)
Step 1: Generating a multiple sequence alignment. The query is searched against SwissProt and aligned with the matching subject sequences.
Step 2: Additional sequences are added using a profile. We end up with an MSA which represents the protein family.
Step 3: The sequence profile of the protein family is compared (by machine learning methods) to sequences with known secondary structure.
Neural Network architecture used in BetaTPred2
Predicting protein 3d structure
Goal: 3d structure from 1d sequence
- An existing fold: fold recognition, homology modeling
- A new fold: ab-initio
Homology Modeling
- Simplest, reliable approach
- Basis: proteins with similar sequences tend to fold into similar structures
- It has been observed that even proteins with 25% sequence identity fold into similar structures
- Does not work for remote homologs (< 25% pairwise identity)
Homology Modeling
Given:
- A query sequence Q
- A database of known protein structures
Find protein P such that P has high sequence similarity to Q
Return P’s structure as an approximation to Q’s structure
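The template search can be sketched as follows, assuming the query has already been aligned to each candidate and that similarity is measured as percent identity over aligned, gap-free columns. Both function names, the input layout and the example sequences are hypothetical. The 25% cut-off is the one quoted on the slides.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)

def pick_template(alignments, min_identity=25.0):
    """alignments: {template_name: (aligned_query, aligned_template)}.

    Returns the name of the template with the highest identity above
    min_identity, or None if no candidate passes the cut-off.
    """
    best_name, best_identity = None, min_identity
    for name, (query, template) in alignments.items():
        identity = percent_identity(query, template)
        if identity > best_identity:
            best_name, best_identity = name, identity
    return best_name

candidates = {
    "templateA": ("ACDE", "ACDF"),   # 3 of 4 columns identical
    "templateB": ("ACDE", "WWWW"),   # no identical columns
}
print(pick_template(candidates))
```

Returning `None` when nothing clears the cut-off mirrors the point that homology modeling does not work for remote homologs.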
Homology modeling needs three items of input:
- The sequence of a protein with unknown 3D structure, the "target sequence."
- A 3D “template”: a structure having the highest sequence identity with the target sequence (>25% sequence identity)
- A sequence alignment between the target sequence and the template sequence
Fold recognition = Protein Threading
Which of the known folds is likely to be similar to the (unknown) fold of a new protein when only its amino-acid sequence is known?
Protein Threading
- The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB
MTYKLILN …. NGVDGEWTYTE
- Energy function: knowledge (or statistics) based rather than physics based
  - Should be able to distinguish correct structural folds from incorrect structural folds
  - Should be able to distinguish correct sequence-fold alignment from incorrect sequence-fold alignments
Protein Threading
Basic premise
- The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)
- Statistics from Protein Data Bank (~2,000 structures): 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
- Chances for a protein to have a structural fold that already exists in PDB are quite good.
Protein Threading
Basic components:
- Structure database
- Energy function
- Sequence-structure alignment algorithm
- Prediction reliability assessment
ab-initio folding
Goal: Predict structure from “first principles”
Requires:
- A free energy function, sufficiently close to the “true potential”
- A method for searching the conformational space
Advantages:
- Works for novel folds
- Shows that we understand the process
Disadvantages:
- Applicable to short sequences only
Qian et al. (Nature: 2007)
used distributed computing*
to predict the 3D structure of
a protein from its amino-acid
sequence. Here, their
predicted structure (grey) of
a protein is overlaid with the
experimentally determined
crystal structure (color) of
that protein. The agreement
between the two is excellent.
*70,000 home computers for
about two years.
Overall Approach
Protein Sequence -> Database Searching -> Multiple Sequence Alignment
Homologue in PDB?
- Yes: Homology Modelling -> 3-D Protein Model
- No: Secondary Structure Prediction -> Fold Recognition
Predicted Fold?
- Yes: Sequence-Structure Alignment -> 3-D Protein Model
- No: Ab-initio Structure Prediction -> 3-D Protein Model
Thank you for
learning with me!
