PPT presentation

Transcript PPT presentation

Methods of Protein Structure Alignment
David Hoksza
Charles University in Prague
Department of Software Engineering
Czech Republic
Presentation Outline

Biological background

Protein databases

Protein structure

Similarity measures

Algorithms
DATAKON 2008
2
Terminology

DNA (deoxyribonucleic acid)



sequence of nucleotides (A, C, G, T)
double-helix
RNA (ribonucleic acid)

single-helix sequence of nucleotides (A, C,
G, U)
messenger RNA (mRNA)
transfer RNA (tRNA)
ribosomal RNA (rRNA)
…





Proteins




molecules
translated from mRNA in ribosomes
sequence of amino acids (20 AAs)
coded by codon (triplet of nucleotides)
genetic code

central dogma

DNA → RNA → protein
transcription
translation
DATAKON 2008
3
Protein Similarity


Interaction of proteins determines biological
functions
Function of protein derived from its three
dimensional structure

similar proteins (many common amino acids on
“appropriate” places) have similar structure



→ similar proteins have similar functions
similar proteins have a common ancestor
Identifying protein sequence
 → finding similar proteins
 → getting clue to the function
DATAKON 2008
4
Protein Databases

Finding similar proteins


even among different species
Prominent non-structural databases

GenBank
EMBL (European Molecular Biology Laboratory Data)
DDBJ (DNA Data Bank of Japan)

UniProt




not moderated
moderated
Swissprot + trEMBL (translated EMBL) + PIR (Protein Information Resource)
Prominent structural databases



PDB (Protein Data Bank)
SCOP (Structural Classification of Proteins)
ASTRAL Compendium (for Sequence and Structure Analysis)
DATAKON 2008
5
Databases Growth
DATAKON 2008
6
Databases Growth (PDB)
DATAKON 2008
7
Levels of complexity of
protein structure

Primary structure


linear sequences of amino acids
Secondary structure

local three-dimensional segments which
are folded into specific repeated
structures


Tertiary structure


alpha helices, beta sheets (strands)
the atomic coordinates - spatial relations
among the secondary structure elements
Quaternary structure

multiple polypeptide chains
DATAKON 2008
8
Protein structure



Amino-acids differ in their side chains (R-groups)
Connection – peptide bonds
Protein sequence → sequence of rigid planes

Degrees of freedom




Planes
R-groups
3D conformation
described by dihedral
angles
Only α-carbons usually
considered
DATAKON 2008
9
Protein structure cont.

SSE – Secondary Structural
Elements


Repetitive structures arising
by H bonds
Alpha helices




Beta sheets (strands)



ith amino-acids is connected
to (i+4)th amino-acid
φ and ψ angles are constant
peptide units per turn is 3,6
multiple strands connected to
each other by H bonds
parallel/antiparallel
Motifs


Combinations (second form)
of SSEs
beta ribbon, beta-barrel, betahairpin, helix-loop-helix,
greek key, …
DATAKON 2008
10
Similarity measures

RMSD – Root Mean Square deviation/difference/distance



Summarizes partial distances of aligned residue pairs
Evaluates quality of a matching (superposition)
cRMSD (core RMSD)


dRMSD (distance RMSD)


intra-residue
disregarding outliers
1 n A
x (i )  x B (i )

n i 1
cRMSd 
dRMSd 
2
n
n
1
(d ijA  d ijB ) 2 , d ij  x(i )  x( j )


n(n  1) i 1 j i 1
elastic similarity score


inter-residue
(i, j ) 
| d ijA  d ijB |
d
*
ij
w(d ij* ) , w(x)  e

x2
α
fragmented dRMSD

aimed to recognition of similar substructures
(i, j )
local
DATAKON 2008
1

(2l  1) 2
l
l
  (d
a l bl
A
i  a , j b
 d iB a , j b ) 2
11
Algorithms

Goals

Alignment


Classification


direct similarity
indirect similarity
Methods

Incremental extension


Dynamic programming


extending initial partial alignment
dynamic programming matrix of (usually) distances
Indexing

using features to be indexed by


trees
geometric hashing
DATAKON 2008
12
Algorithms - Incremental Extension

DALI







Elastic similarity score
Matrix of inter-residual
distances
Similar proteins ≡ similar interresidual distances ≡ similar
distance matrices
Contatct pattern (CP) submatrix of fixed size
(hexapeptides)
Similar pairs of CPs are stored
and one is used as a seed
Monte-Carlo optimization is
used for extend the already
created alignment
CE




AFP – Aligned Fragment Pair (constant length
portions – local structures)
Fragmented dRMSD
Joining of AFPs based on three different distance
measures
Several path are computed and best of them is
optimized by Smith-Waterman (on the distance
matrix)
DATAKON 2008
13
Algorithms – Dynamic Programming

SAP




PROSUP
1.
2.
3.
4.

Double dynamic programming
View – vector of distances to other resiudes
Between pairs of views, optimal alignments (Smith-Waterman) are computed which are used to fill
up final DP matrix → final alignment
Identification of seed fragments
Expand seed fragments to initial alignments
Apply DP (Needleman-Wunsch) to refine initial alignments
Evaluate refined alignments
STRUCTAL

Evaluated as the best structural algorithm in Kolodny, R., Koehl, P., Levitt, M.: Comprehensive evaluation of
protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 346 (2005), 1173-88.






TALI


Initial alignment based on several rules (match beginning of structures, match ends, match pairs
based on sequence alignment, …)
Refine the alignments by DP (Needleman-Wunsch)
Exposure weighting
Position dependent gap penalties
…
Incorporates torsion angles → mutual distances form DP matrix → Smith-Waterman
FATCAT


Twists – proteins not understood as rigid bodies
Minimization of twists to turn one structure into another one → DP used to connect AFPs
(introducing a twist is penalized)
DATAKON 2008
14
Algorithms – Indexing - Trees

PSI


Represenatation



Features

amino-acids → centers of
masses
defining local neighborhood
for each SSE
features (9 dimensions)



SSE approximation


PSIST
sliding window of size w



distance
angle
Suffix-tree
distances in triplets
angles between pairs in
triplets
R*-tree
DATAKON 2008
15
Algorithms – Indexing - Hashing

ProGreSS



Combines structure and
sequence
Features (sliding window of
size w)











DATAKON 2008
torsion angle
curvature
SSE type
Pairwise comparison


line in scoring matrix (PAM,
BLOSUM, ..) → scoring vector
→ chaining scoring vectors →
dim = 20w
Haar wavelet transormation
→ dim = dq
normalization → [0,1] dq space
SSE type
Theory of differential
geometry
Structure ≡ 3D spline
Features

curvature (dim = w)
torsion angles (dim = w)
Haar wavelet transormation
→ dim = dt
normalization → [0,1] dt space
sequence



structure


CTSS
Smith-Waterman
scoring matrix


based on features
For each pair
16
End
DATAKON 2008
17
PDB record
13
HEADER HYDROLASE(O-GLYCOSYL)
25-JAN-94 149L HELIX 10 H10 PRO A 143 THR A 155 1
SHEET 1 A 3 TYR A 18 LYS A 19 0
TITLE CONSERVATION OF SOLVENT-BINDING SITES …
SHEET 2 A 3 TYR A 25 ILE A 27 -1 N THR A 26 O TYR A 18
..
..
REMARK 1
SHEET 3 A 3 HIS A 31 THR A 34 -1 N HIS A 31 O ILE A 27
REMARK 1 REFERENCE 1
TURN 1 T1 ASP A 20 GLY A 23
..
REMARK 1 AUTH M.MATSUMURA,W.J.BECKTEL,…
ATOM
1 N MET A 1
29.360 -4.880 38.742 1.00 65.91
N
..
ATOM
2 CA MET A 1
29.892 -6.057 38.096 1.00 60.68
C
REMARK 2 RESOLUTION. 2.60 ANGSTROMS.
ATOM
3 C MET A 1
30.674 -5.673 36.863 1.00 56.33
C
REMARK 3
..
REMARK 3 REFINEMENT.
ATOM 302 CG PRO A 37
51.531 -30.219 18.738 1.00 78.60
C
ATOM 303 CD PRO A 37
52.005 -28.775 18.641 1.00 78.61
C
REMARK 3 PROGRAM : TNT
ATOM 304 N SER A 38
53.483 -28.405 22.129 1.00 70.92
N
REMARK 3 AUTHORS : TRONRUD,TEN EYCK…
ATOM 305 CA SER A 38
54.604 -28.517 23.043 1.00 67.86
C
...
..
SEQRES 1 A 164 MET ASN LEU PHE GLU MET LEU ARG … ATOM 1309 OXT LEU A 164
25.719 -18.888 43.195 1.00 25.30
O
SEQRES 2 A 164 ARG LEU LYS ILE TYR LYS ASP THR … TER 1310
LEU A 164
..
..
HELIX 1 H1 LEU A 3 GLU A 11 1
9 END
HELIX
..
2 H2 LEU A 39 ILE A 50 1
12
DATAKON 2008
18
Similarity Measures (primary structure)

two strings of amino-acids

hamming distance



sequences of equal length
number of non-identical positions
Alignment
S1 = NGHLILLE
S2 = HGALGLLE
x
x
HD(S1, S2) = 3
x
edit distance

minimal number of operations insert/update/delete to convert one sequence to
the other
Alignment
S1 = NPHGIIMGLAE
S2 = HGALGLLE

x
x
x
x
x
x
x
ED(S1, S2) = 8
weighted edit distance


takes into account probability of updating one letter to the other
scoring (substitution) matrices



x
PAM, BLOSUM, …
different costs for opening/extending a gap
global/local alignment


Needleman-Wunsch
Smith-Waterman
DATAKON 2008
19

PPT presentation

Transcript PPT presentation

Directory