PPT presentation
Download
Report
Transcript PPT presentation
Methods of Protein Structure Alignment
David Hoksza
Charles University in Prague
Department of Software Engineering
Czech Republic
Presentation Outline
Biological background
Protein databases
Protein structure
Similarity measures
Algorithms
DATAKON 2008
2
Terminology
DNA (deoxyribonucleic acid)
sequence of nucleotides (A, C, G, T)
double-helix
RNA (ribonucleic acid)
single-helix sequence of nucleotides (A, C,
G, U)
messenger RNA (mRNA)
transfer RNA (tRNA)
ribosomal RNA (rRNA)
…
Proteins
molecules
translated from mRNA in ribosomes
sequence of amino acids (20 AAs)
coded by codon (triplet of nucleotides)
genetic code
central dogma
DNA → RNA → protein
transcription
translation
DATAKON 2008
3
Protein Similarity
Interaction of proteins determines biological
functions
Function of protein derived from its three
dimensional structure
similar proteins (many common amino acids on
“appropriate” places) have similar structure
→ similar proteins have similar functions
similar proteins have a common ancestor
Identifying protein sequence
→ finding similar proteins
→ getting clue to the function
DATAKON 2008
4
Protein Databases
Finding similar proteins
even among different species
Prominent non-structural databases
GenBank
EMBL (European Molecular Biology Laboratory Data)
DDBJ (DNA Data Bank of Japan)
UniProt
not moderated
moderated
Swissprot + trEMBL (translated EMBL) + PIR (Protein Information Resource)
Prominent structural databases
PDB (Protein Data Bank)
SCOP (Structural Classification of Proteins)
ASTRAL Compendium (for Sequence and Structure Analysis)
DATAKON 2008
5
Databases Growth
DATAKON 2008
6
Databases Growth (PDB)
DATAKON 2008
7
Levels of complexity of
protein structure
Primary structure
linear sequences of amino acids
Secondary structure
local three-dimensional segments which
are folded into specific repeated
structures
Tertiary structure
alpha helices, beta sheets (strands)
the atomic coordinates - spatial relations
among the secondary structure elements
Quaternary structure
multiple polypeptide chains
DATAKON 2008
8
Protein structure
Amino-acids differ in their side chains (R-groups)
Connection – peptide bonds
Protein sequence → sequence of rigid planes
Degrees of freedom
Planes
R-groups
3D conformation
described by dihedral
angles
Only α-carbons usually
considered
DATAKON 2008
9
Protein structure cont.
SSE – Secondary Structural
Elements
Repetitive structures arising
by H bonds
Alpha helices
Beta sheets (strands)
ith amino-acids is connected
to (i+4)th amino-acid
φ and ψ angles are constant
peptide units per turn is 3,6
multiple strands connected to
each other by H bonds
parallel/antiparallel
Motifs
Combinations (second form)
of SSEs
beta ribbon, beta-barrel, betahairpin, helix-loop-helix,
greek key, …
DATAKON 2008
10
Similarity measures
RMSD – Root Mean Square deviation/difference/distance
Summarizes partial distances of aligned residue pairs
Evaluates quality of a matching (superposition)
cRMSD (core RMSD)
dRMSD (distance RMSD)
intra-residue
disregarding outliers
1 n A
x (i ) x B (i )
n i 1
cRMSd
dRMSd
2
n
n
1
(d ijA d ijB ) 2 , d ij x(i ) x( j )
n(n 1) i 1 j i 1
elastic similarity score
inter-residue
(i, j )
| d ijA d ijB |
d
*
ij
w(d ij* ) , w(x) e
x2
α
fragmented dRMSD
aimed to recognition of similar substructures
(i, j )
local
DATAKON 2008
1
(2l 1) 2
l
l
(d
a l bl
A
i a , j b
d iB a , j b ) 2
11
Algorithms
Goals
Alignment
Classification
direct similarity
indirect similarity
Methods
Incremental extension
Dynamic programming
extending initial partial alignment
dynamic programming matrix of (usually) distances
Indexing
using features to be indexed by
trees
geometric hashing
DATAKON 2008
12
Algorithms - Incremental Extension
DALI
Elastic similarity score
Matrix of inter-residual
distances
Similar proteins ≡ similar interresidual distances ≡ similar
distance matrices
Contatct pattern (CP) submatrix of fixed size
(hexapeptides)
Similar pairs of CPs are stored
and one is used as a seed
Monte-Carlo optimization is
used for extend the already
created alignment
CE
AFP – Aligned Fragment Pair (constant length
portions – local structures)
Fragmented dRMSD
Joining of AFPs based on three different distance
measures
Several path are computed and best of them is
optimized by Smith-Waterman (on the distance
matrix)
DATAKON 2008
13
Algorithms – Dynamic Programming
SAP
PROSUP
1.
2.
3.
4.
Double dynamic programming
View – vector of distances to other resiudes
Between pairs of views, optimal alignments (Smith-Waterman) are computed which are used to fill
up final DP matrix → final alignment
Identification of seed fragments
Expand seed fragments to initial alignments
Apply DP (Needleman-Wunsch) to refine initial alignments
Evaluate refined alignments
STRUCTAL
Evaluated as the best structural algorithm in Kolodny, R., Koehl, P., Levitt, M.: Comprehensive evaluation of
protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 346 (2005), 1173-88.
TALI
Initial alignment based on several rules (match beginning of structures, match ends, match pairs
based on sequence alignment, …)
Refine the alignments by DP (Needleman-Wunsch)
Exposure weighting
Position dependent gap penalties
…
Incorporates torsion angles → mutual distances form DP matrix → Smith-Waterman
FATCAT
Twists – proteins not understood as rigid bodies
Minimization of twists to turn one structure into another one → DP used to connect AFPs
(introducing a twist is penalized)
DATAKON 2008
14
Algorithms – Indexing - Trees
PSI
Represenatation
Features
amino-acids → centers of
masses
defining local neighborhood
for each SSE
features (9 dimensions)
SSE approximation
PSIST
sliding window of size w
distance
angle
Suffix-tree
distances in triplets
angles between pairs in
triplets
R*-tree
DATAKON 2008
15
Algorithms – Indexing - Hashing
ProGreSS
Combines structure and
sequence
Features (sliding window of
size w)
DATAKON 2008
torsion angle
curvature
SSE type
Pairwise comparison
line in scoring matrix (PAM,
BLOSUM, ..) → scoring vector
→ chaining scoring vectors →
dim = 20w
Haar wavelet transormation
→ dim = dq
normalization → [0,1] dq space
SSE type
Theory of differential
geometry
Structure ≡ 3D spline
Features
curvature (dim = w)
torsion angles (dim = w)
Haar wavelet transormation
→ dim = dt
normalization → [0,1] dt space
sequence
structure
CTSS
Smith-Waterman
scoring matrix
based on features
For each pair
16
End
DATAKON 2008
17
PDB record
13
HEADER HYDROLASE(O-GLYCOSYL)
25-JAN-94 149L HELIX 10 H10 PRO A 143 THR A 155 1
SHEET 1 A 3 TYR A 18 LYS A 19 0
TITLE CONSERVATION OF SOLVENT-BINDING SITES …
SHEET 2 A 3 TYR A 25 ILE A 27 -1 N THR A 26 O TYR A 18
..
..
REMARK 1
SHEET 3 A 3 HIS A 31 THR A 34 -1 N HIS A 31 O ILE A 27
REMARK 1 REFERENCE 1
TURN 1 T1 ASP A 20 GLY A 23
..
REMARK 1 AUTH M.MATSUMURA,W.J.BECKTEL,…
ATOM
1 N MET A 1
29.360 -4.880 38.742 1.00 65.91
N
..
ATOM
2 CA MET A 1
29.892 -6.057 38.096 1.00 60.68
C
REMARK 2 RESOLUTION. 2.60 ANGSTROMS.
ATOM
3 C MET A 1
30.674 -5.673 36.863 1.00 56.33
C
REMARK 3
..
REMARK 3 REFINEMENT.
ATOM 302 CG PRO A 37
51.531 -30.219 18.738 1.00 78.60
C
ATOM 303 CD PRO A 37
52.005 -28.775 18.641 1.00 78.61
C
REMARK 3 PROGRAM : TNT
ATOM 304 N SER A 38
53.483 -28.405 22.129 1.00 70.92
N
REMARK 3 AUTHORS : TRONRUD,TEN EYCK…
ATOM 305 CA SER A 38
54.604 -28.517 23.043 1.00 67.86
C
...
..
SEQRES 1 A 164 MET ASN LEU PHE GLU MET LEU ARG … ATOM 1309 OXT LEU A 164
25.719 -18.888 43.195 1.00 25.30
O
SEQRES 2 A 164 ARG LEU LYS ILE TYR LYS ASP THR … TER 1310
LEU A 164
..
..
HELIX 1 H1 LEU A 3 GLU A 11 1
9 END
HELIX
..
2 H2 LEU A 39 ILE A 50 1
12
DATAKON 2008
18
Similarity Measures (primary structure)
two strings of amino-acids
hamming distance
sequences of equal length
number of non-identical positions
Alignment
S1 = NGHLILLE
S2 = HGALGLLE
x
x
HD(S1, S2) = 3
x
edit distance
minimal number of operations insert/update/delete to convert one sequence to
the other
Alignment
S1 = NPHGIIMGLAE
S2 = HGALGLLE
x
x
x
x
x
x
x
ED(S1, S2) = 8
weighted edit distance
takes into account probability of updating one letter to the other
scoring (substitution) matrices
x
PAM, BLOSUM, …
different costs for opening/extending a gap
global/local alignment
Needleman-Wunsch
Smith-Waterman
DATAKON 2008
19