2 a - Atelier de BioInformatique

Download Report

Transcript 2 a - Atelier de BioInformatique

IFBM 2004
Finding similarities in protein
structures:
a string approach
• Atelier de Bio-Informatique (ABI) - Université Paris VI
Searching for strict repeated patterns : KMR (1972)
k
k
2k
k
k
2k
- Occurrences of strictly repeated 2k-length
patterns are built from intersections of
sets of strictly repeated k-length patterns
which lie side-by-side
This leads to an O(n.log(kmax)) algorithm for
finding kmax-length repeated patterns
Flexible patterns : KMRC
• A pattern is no more defined as a succession of
symbols, instead it is a succession of cliques of
symbols
c1
a b c
c2
S
= caaaabaaacb
sC
2
2
= 21111111121
Here m is a repeated flexible pattern of length 3.
m = c1-c1-c2 at position 4 and 8
Note : several patterns may exist at the same position : here
c1-c1-c1 has also an occurrence in position 4
Flexible patterns : KMRC
• As in KMR, the 2-k length patterns are built from klength patterns.
• Here, a pattern is a clique of (similar) patterns, and
at one position in the string there may exist several
cliques of patterns.
The algorithm is now O(n.log(k).gk) (g being the mean degeneracy,
i.e. the mean number of cliques a symbol belongs to)
In biology, identity is trivial, similarity is interesting…
Flexible patterns in sequences (KMRC)
• This algorithm may be used to find flexible patterns
in several protein sequences (multiple alignment by
blocks):
p1
p2
p3
p4
p5
• Cliques of “symbols” define similarity, e.g :
- different overlapping sets of amino acids clustered by
their properties (e.g. hydrophobic, hydrophilic, small,
large, polar, charged, etc…)
- different overlapping sets of amino acids clustered by
setting a threshold value on their score in a similarity
matrix (e.g. BLOSUM or PAM)
PAM250
Threshold -> cliques
Similarity
is not
transitive !
Flexible patterns in structures (KMRC)
• Finding 3D structural patterns in several protein
structures
• Structures must be described as strings of symbols,
and similar structures must be composed of similar
symbols
• -> use of discretized internal coordinates (angles) as
an alphabet :
f,y
or
a,t angles
Internal coordinates a,t
Internal coordinates to symbols
a
discretization
...a21-a02-a13-a22-…-a19-a15-a20-a12-…
...a03-a21-a14-a21-…-a17-a16-a03-a22-…
Absolute need of similarity (KMRC), not identity !
Flexible patterns : KMRC
Finding flexible patterns of a -symbols :
Here, cliques of “symbols” are a angle overlapping sets
Similarity is a critical point (identity would miss
structural features).
-180°
0°
{a10 , a20 , a30 , a40 , a50 }
180°
{a20 , a30 , a40 , a50 , a60 }
Flexible patterns : KMRC
3 Cytochromes
P450:
TERP,BM3,CAM
PMWIATKHADVMQIG
VTRYLSSQRLIKEAC
GHWIATRGQLIREAY
PTHTAYRGLTLNWFQPASIRKLEENIRRIAQASVQR
KNWKKAHNILLPSFSQQAMKGYHAMMVDIAVQLVQK
PEQRQFRALANQVVGMPVVDKLENRIQELACSLIES
CDFMTDCALYYPLHVVMTALGVP
IEVPEDMTRLTLDTIGLCGFNYR
CNFTEDYAEPFPIRIFMLLAGLP
EDDEPLMLKLTQDF
ITSMVRALDEAMNK
EDIPHLKYLTDQMT
FHETIATFYDYFNGFTVDRRS
FQEDIKVMNDLVDKIIADRKA
FAEAKEALYDYLIPIIEQRRQ
CPKDDVMSLLAN
EQSDDLLTHMLN
KPGTDAISIVAN
Similarity : nature of elements OR relations between them
Bach BWV846
Similarity: series of notes with same pitch ?
-> transposed series rather…
Bach BWV846
-> Similarity of relations between elements
Bach BWV846
Relational patterns (KMRR)
• Now, a similar pattern is not defined by a
succession of similar symbols, but instead by
succession of elements that share the same
relationships between them.
r13
r12
Pattern m =
r13
r23
r12
r23
r13
r12 r23
Example of relations = “to be higher to”,
“to be lower to”, “to be equal to”,…
One step further : Flexible relational patterns
(KMRR)
• The relations between elements do not need to be
the same, they just need to be similar
rc
ra
ra
CR1
rb
rb
rb
rc
ra
Pattern m =
ra
CR2
CR1 CR1
CR2
Relational cliques
Cliques of relations = {“to be higher”, “to be equal”},
{“to be lower”, “to be equal”}…
An example: application to 3D structures:
relations are defined on discrete internal distances between points:
r(i,j) = rk if and only if rk ≤ dist(i,j) < rk + ∆
The relations r(i,j) and r(i’,j’) are considered as similar
if they belong to the same subset {rk, rk+1, rk+2}, i.e. if
| r(i, j) - r(i’, j’) | ≤ 2
This implies for euclidian distances :
| dist(i,j) - dist(i’,j’) | < 3∆
r1 r2 r3
rk rk+1 rk+2
d(i,j)
d(i ’,j’)
Application to 3D structures:
Relational cliques: 1 2 3 4 5 6 7 8 9 10 ….
(defined on distances)
Application to
3D structures:
Ex: multiple structural
alignment of 4 pectate and
pectin lyases:
1PCL,1IDJ,2BSP,1PLU
1PCL:
1IDJ:
2BSP:
1PLU:
AEWDAAVIDNSTNVWVDHVT
WGGDAITLDDCDLVWIDHVT
SQYDNITINGGTHIWIDHCT
KDGDMIRVDDSPNVWVDHNE
1PCL:
1IDJ:
2BSP:
1PLU:
LRVTFHNNVFDRVTERAPRV
DLVTMKGNYIYHTSGRSPKV
LKITLHHNRYKNIVQKAPRV
RNITYHHNYYNDVNARLPLQ
1PCL:
1IDJ:
2BSP:
1PLU:
TERAPRVRFGSIHAYNNVYL
GRSPKVQDNTLLHCVNNYFY
VQKAPRVRFGQVHVYNNYYE
NARLPLQRGGLVHAYNNLYT
1PCL:
1IDJ:
2BSP:
1PLU:
AQTMTSSLATSINNNAGYGK
SASAYTSVASRVVANAGQGN
SIDASANVKSNVINQAGAGK
SPVSAQCVKDKLPGYAGVGK
Structural similarity search
in databases: YAKUSA
database (PDB)
query
fast structural
scanning
Structural similarity search in databases: YAKUSA
Structures encoded with a angles
For each database protein:
1- find structural similar seeds with the query
2 - extend seeds to longer structural matches
Then rank the structural hits
Seeds of length k
All overlapping k-patterns of the query -> automaton
a1a1a2a1a2
Dictionary/Automaton
a1
a1
a1
a1
a2
a2
a1
a2
a2
a2
a2
Leaf->one seed
Advantage: no moving backward in the database string; less
moving backward in the patterns
Query seeds automaton
Dictionary/Automaton
with degeneracy with similar patterns
a'1
a1
dc(ai,a’i) < d
a2 a'2
a'1
a' 2
a2 a' 1
a1
a1
a1
a2
a'1
a2
'
a'2 a2 a 2
a'2
a1
a'1
a2
a1
a1
a'1
Sdc(ai, a’i) < D
a'2
a'1
a'2
a1
a2
a'1
a'2
a2
a'2 a
2
Matching seeds to SHSP
- (many seeds, giving only several SHSP)
seed
query
...a03-a21-a14-a21-a21-a22-a24-a17-a16-a03-a22-…
database ...a -a -a -a -a -a -a -a -a -a -a -…
21
02
11
24
22
23
23
19
14
20
12
SHSP
- SHSP=maximal scoring region around the
seed, the scores being based upon the angular
differences
- probability associated with SHSP (MTD: pair
approximation of Markov model)
SHSPs to results…

Database
structures
found,
ranked by
score
40 seconds for scanning 11000 PDB structures
http://bioserv.rpbs.jussieu.fr/Yakusa
Example of output:
query
(Deoxyribonuclease I)
database
(Heat-Shock protein 70)
SHSPs
http://bioserv.rpbs.jussieu.fr/Yakusa
Multiple structural alignment: « m-diagonals »
First step: finding pair “diagonals” (at the a angle level)
selection
of best 2diagonals
Index in first protein
Multiple structural alignment: « m-diagonals »
Second step: combination of 2-diagonals into m-diagonals
(in m dimensions).
Protein 1
Protein 2
Protein 3
m-diagonal in m dimensions :
here, the 3-diagonal is the
combination of three 2diagonals of dimension 2.
Multiple structural alignment: « m-diagonals »
Second step: combination of 2-diagonals into m-diagonals
(in m dimensions).
Protein 1
2
5
3
Protein 3
Protein 2
m-diagonal in m dimensions :
here, the 3-diagonal is the
combination of three 2diagonals of dimension 2.
Multiple structural alignment: « m-diagonals »
Second step: combination of 2-diagonals into m-diagonals
(in m dimensions).
“column” graphes :
- a residu is a node
- if two residues are in a 2diagonal, they are connected
by a link.
Protein 1
Protein 2
Protein 3
2
5
3
Selection of best m-diagonals (most connected ones)
Example:
13 cytochromes P450
- In blue : non aligned
parts
- Other colors :
m-diagonals
Other method :
« Gibbs sampling »
« Taboo search »
Kmrc+gok
Cytochromes P450
Marie-France Sagot
Alain Viari
Henry Soldano
Pascale Jean
Nadia Pisanti
Mathilde Carpentier
relational
patterns
Yakusa
m-diagonals
Gibbs-taboo
Sophie Brouillet
Near future: classification of structural cores in PDB…
V1
a
a
b
a
a
b
a
k-length
1 2
stacks
Look at +k
P1
3
4
5
6
7
a 75421
b 6 3
Q2
Put position
in set stack
_a
_b
1 4;6 3
2 5;
k=1
P2
2k-length
stacks
aa
ba
ab
4 1
6 3
5 2
k=2
V2
aa ab ba aa ab ba
1
2
3
4
5
6
7