A Similar Fragments Merging Approach to Learn Automata on Proteins

Goulven KERBELLEC & François COSTE
IRISA / INRIA Rennes
Outline of the talk

Protein families signatures
Similar Fragment Merging Approach (Protomata-L)
    Characterization
        Similar Fragment Pairs (SFPs)
        Ordering the SFPs
    Generalization
        Merging of SFP in an automaton
        Gap generalization
        Identification of Physico-chemical properties
Experiments
Protein families

Common function & common topology (3D structure)

Amino acid alphabet:
{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}

Protein sequence:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHY
PIKSNQTTGAVQDNVKVSLAFGLSI…

Protein data set:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHY
PIKSNQTTGAVQDNVKVSLAFGLSI…
>AQP2_RAT
MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWAS
SPPSVLQIAVAFGLGIGILVQALGH…
>AQP3_MOUSE
MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLIL
VMFGCGSVAQVVLSRGTHGGFLT…
Characterization of a protein family
ZBT11 ...Csi..CgrtLpklyslriHmlk..H...
ZBT10 ...Cdi..CgklFtrrehvkrHslv..H...
ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H...
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
[Figure: Zinc Finger Pattern. Two C and two H residues bound to a central Zn ion; x marks the other positions.]
Expressivity classes of patterns

Class   Example
A       T-C-T-T-G-A
B       D-R-C-C-x(2)-H-D-x-C
C       G-G-G-T-F-[ILV]-[ST]-[ILV]
D       V-x-P-x(2)-[RQ]-x(4)-G-x(2)-L-[LM]
E       G-C-x(1,3)-C-P-x(8,10)-C-C
F       C-x(2,4)-C-x(3)-[ILVFYC]-x(8)-H-x(3,5)-H
G       D-T-A-G-Q-E-*-L-V-G-N-K
H       D-T-A-G-[NQ]-*-L-V-G-N-[KEH]
I       D-T-A-x(2,5)-G-[NQ]-*-L-V-G-N-[KEH]
J       Regular Expression / Automaton

Pattern sources placed on this scale: PROSITE, PRATT, TEIRESIAS, PROTOMATA-L
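
To make the pattern classes concrete, here is a minimal sketch that translates the PROSITE-style notation of classes A to F into a Python regular expression. The function name `prosite_to_regex` and the synthetic test sequence are illustrative assumptions, not part of the talk; the `*` elements of classes G to I are deliberately not handled and raise an error.

```python
import re

# One pattern element: a residue, a bracketed class, or the wildcard x,
# optionally followed by a repetition such as (3) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate a class A-F pattern into a regular expression string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = match.groups()
        # "." matches any character, which is broader than the 20 amino acids.
        atom = "." if core == "x" else core
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(atom + repeat)
    return "".join(parts)

zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")

# Synthetic sequence built to contain one occurrence (not a real protein).
synthetic = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
hit = re.search(zinc_finger, synthetic)
```

Class J (regular expression / automaton) is the most expressive level: anything written in the lower classes can be rewritten as such an expression, which is what the translation above relies on.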
Characterization
Similar Fragment Pairs



Significantly similar fragment pairs (SFPs)
Natural selection
Important area characterization
Data set D:
Ordering the SFPs

Problem :

Solution : ordering the SFPs by scoring each SFP


S(f1,f2)= ?
3 different scoring functions:

dialign: Sd
support: Ss
implication: Si
Dialign Score

Sd ( f1 , f2 ) = - log P ( L , Sim )



L = |f1| = |f2|
Sim = sum of the individual similarity values (BLOSUM62 similarity)
P = probability that a random SFP of the same length L has the same similarity Sim
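
The shape of this score can be illustrated with a deliberately simplified sketch. It makes three assumptions that are not in the talk: similarity is plain identity (1 for a match, 0 otherwise) instead of BLOSUM62, residues are drawn uniformly from the 20 amino acids, and P is taken as the probability of reaching at least the observed similarity. Under these assumptions Sim follows a binomial law, so P can be computed exactly. The function names are hypothetical.

```python
import math

def tail_probability(length: int, sim: int, p_match: float = 1 / 20) -> float:
    """P(at least `sim` matching positions) for a random pair of the given length."""
    return sum(
        math.comb(length, k) * p_match**k * (1 - p_match) ** (length - k)
        for k in range(sim, length + 1)
    )

def dialign_like_score(f1: str, f2: str) -> float:
    """-log P(L, Sim) with identity similarity (toy stand-in for BLOSUM62)."""
    if len(f1) != len(f2):
        raise ValueError("the two fragments of an SFP must have the same length")
    length = len(f1)
    sim = sum(a == b for a, b in zip(f1, f2))
    return -math.log(tail_probability(length, sim))

score_identical = dialign_like_score("EIK", "EIK")
score_similar = dialign_like_score("EIK", "EVK")
score_unrelated = dialign_like_score("AAA", "CCC")
```

The rarer the observed similarity is under the random model, the higher the score, so identical fragments rank above partially matching ones and unrelated fragments score close to zero.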
Support Score

Taking into account the representativeness of SFP
<f1,f2> is supported by a fragment f with respect to the triangular inequality:
Sd(f,f1) + Sd(f,f2) ≥ Sd(f1,f2)

Ss (f1,f2,D) = Number of sequences supporting <f1,f2>
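
A minimal sketch of this count, under stated assumptions: the comparison in the triangular inequality is taken to be "greater than or equal", a sequence supports the pair as soon as one of its fragments of the same length satisfies the inequality, and the pairwise score is passed in as a parameter. The toy score used below (number of identical positions) only stands in for Sd; all names are hypothetical.

```python
from typing import Callable, Iterable

def identity_score(a: str, b: str) -> int:
    """Toy stand-in for Sd: number of identical positions."""
    return sum(x == y for x, y in zip(a, b))

def support(
    f1: str,
    f2: str,
    sequences: Iterable[str],
    score: Callable[[str, str], float] = identity_score,
) -> int:
    """Number of sequences containing a fragment f that supports <f1,f2>."""
    length = len(f1)
    target = score(f1, f2)
    count = 0
    for seq in sequences:
        # Slide a window of the same length as the SFP over the sequence.
        fragments = (seq[i:i + length] for i in range(len(seq) - length + 1))
        if any(score(f, f1) + score(f, f2) >= target for f in fragments):
            count += 1
    return count

data_set = ["MASEIKLFW", "MGYEVKYRV", "MWWWWWWWW", "AAELKAA"]
ss = support("EIK", "EVK", data_set)
```

Here the first two sequences contain the fragments themselves, the fourth contains the close fragment ELK, and the third contains nothing similar, so three sequences support the pair.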
Implication Score



Taking into account a counter-example set N
Discriminative fragments
Lerman index:
Si(f1,f2,D,N) = ( -P( Ss(f1,f2,N) ) + P( Ss(f1,f2,D) ) x P(N) ) / ( P( Ss(f1,f2,D) ) x |N| )
with P(X) = |X| / ( |D| + |N| )
Generalization
From protein data sets to automata
MASEIKLFW
MGYEVKYRV

M A S E I K L F W
M G Y E V K Y R V

Merging SFPs
MASEIKLFW
MGYEVKYRV

M A S \             / L F W
        E [I,V] K
M G Y /             \ Y R V

New sequences accepted after merging:
MASEVKLFW
MASEIKYRV
MASEVKYRV
MGYEIKYRV
MGYEVKLFW
MGYEIKLFW
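
The merging step on this example can be sketched as follows. This is an illustrative reconstruction, not the Protomata-L implementation: each sequence becomes a linear path of states, the states spanned by the fragments EIK and EVK are identified position by position, and the accepted words are then enumerated. A simple renaming is enough for a single SFP; merging several overlapping SFPs would need a proper union-find. All function names are hypothetical.

```python
def build_mca(seqs):
    """One linear path per sequence; states are (sequence index, position)."""
    return {((k, i), aa, (k, i + 1)) for k, s in enumerate(seqs) for i, aa in enumerate(s)}

def merge_sfp(transitions, first, second, length):
    """Merge the states of the second fragment into those of the first one."""
    (k1, s1), (k2, s2) = first, second
    rename = {(k2, s2 + o): (k1, s1 + o) for o in range(length + 1)}
    return {(rename.get(src, src), aa, rename.get(dst, dst)) for src, aa, dst in transitions}

def accepted_words(transitions, initials, finals):
    """Enumerate all words of an acyclic automaton."""
    outgoing = {}
    for src, aa, dst in transitions:
        outgoing.setdefault(src, []).append((aa, dst))
    words = set()

    def walk(state, prefix):
        if state in finals:
            words.add(prefix)
        for aa, dst in outgoing.get(state, []):
            walk(dst, prefix + aa)

    for state in initials:
        walk(state, "")
    return words

seqs = ["MASEIKLFW", "MGYEVKYRV"]
mca = build_mca(seqs)
# The SFP <EIK, EVK> starts at position 3 in both sequences and has length 3.
merged = merge_sfp(mca, first=(0, 3), second=(1, 3), length=3)
initials = {(k, 0) for k in range(len(seqs))}
finals = {(k, len(s)) for k, s in enumerate(seqs)}
words = accepted_words(merged, initials, finals)
```

Before merging, the automaton accepts only the two training sequences; after merging, it accepts the eight combinations of both prefixes, [I,V], and both suffixes.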
Protein Sequence Data Set
List of SFPs
Ordered List of SFPs
MCA
MERGING
Automaton / Regular Expression
Gap Generalization


Merging non-representative transitions onto themselves
Treat them as "gaps"
Identification of Physico-chemical
properties




Similar Fragments ~ potential function area
Amino acids share the same position
Physicochemical property at play
=> Generalization from a group (of amino acids) to a Taylor group
Example:
I,V -> aliphatic -> I,L,V
C -> C
I,Q,W,P -> no information -> x

Before generalization: [I,V] C [I,Q,W,P]
After generalization: [I,L,V] C x
x = {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
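
The example above can be mimicked with a simplified sketch that uses plain set inclusion: an observed set of residues is expanded to the smallest listed group that contains it, and to the wildcard x when no group fits. This is a swapped-in simplification, not the likelihood ratio test used in the talk. Only the aliphatic group {I,L,V} comes from the slides; the other groups in the table are illustrative assumptions, and the names are hypothetical.

```python
ALPHABET = frozenset("ARNDCQEGHILKMFPSTWYV")

# Only "aliphatic" is taken from the slides; the rest is an illustrative table.
GROUPS = {
    "aliphatic": frozenset("ILV"),
    "aromatic": frozenset("FWYH"),
    "positive": frozenset("HKR"),
    "negative": frozenset("DE"),
}

def generalize(observed):
    """Return (label, residues) for the smallest listed group covering `observed`."""
    observed = frozenset(observed)
    if len(observed) == 1:
        # A single residue is kept as is (C stays C).
        return "single residue", observed
    candidates = [(len(group), name, group) for name, group in GROUPS.items() if observed <= group]
    if candidates:
        _, name, group = min(candidates)
        return name, group
    # No group explains the observed residues: treat the position as x.
    return "x", ALPHABET

generalized = [generalize(position) for position in ("IV", "C", "IQWP")]
```

Set inclusion ignores how often each residue was observed, which is exactly the information a statistical test on the multiset of residues can exploit.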
Likelihood ratio test


A likelihood ratio test decides whether the multiset A has been generated
according to a physico-chemical group G or not.
Given a threshold, we test the expansion of A to G
and reject it when LR(G/A) falls below that threshold.
Experiments
MIP: the Major Intrinsic Protein Family
Family: MIP
Subfamilies: AQP, Glpf, Gla
Data sets
Water-specific
Set « W+» (24 seq)
Set « W-» (16 seq)
Set « E» (79 seq)
Set « M» (44 seq)
identity<90%
Set « T » (159 seq)
Set « U » (911 seq)
UNIPROT
MIP in SWISS-PROT
Set « C» (49 seq)
Blast(1<e<100) not MIP
Experiments

First Common Fragment on a Family



MIP family
Positive set
Comparison with pattern discovery tools




Teiresias
Pratt
Protomata-L (short pattern)
Water-specific Characterization



MIP sub-families
Positive and negative sets
Leave-one-out cross-validation

Protomata-L (short to long pattern)
First Common Fragment
Automaton

Results of 4 patterns scanned on Swiss-Prot protein Database
Target set: Set « T » (159 seq)
Learning set: Set « M » (44 seq)
From short automata to long automata

Previous experiment




only the first SFPs of the ordered list of SFPs
short automaton
first common fragment automaton
Next experiment



larger cut-offs in the list of SFPs
Protomata-L is able to create longer automata with more
common subparts
Long patterns are close to the topology (3D structure) of
the family
Water-specific characterization

Leave-one-out cross-validation

Learning set


W+ \ Si : Positive learning set
W- \ Sj : Negative learning set
Set « W+» (24 seq)
Set « W-» (16 seq)

Test set


{ Si U Sj }
Control set

Set T
Set « C» (49 seq)

Implication score
Leave-one-out cross-validation
Error Correcting Cost

The error-correcting cost of a sequence S represents the
distance (BLOSUM similarity) between S and the closest
sequence accepted by the automaton A.
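
A minimal sketch of this idea, under two simplifying assumptions that are not in the talk: the automaton is replaced by the finite set of words it accepts, and the BLOSUM-based distance is replaced by a unit-cost edit distance. The language used is the small example {MAS, MGY} E [I,V] K {LFW, YRV}; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (substitution, insertion, deletion)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]

def error_correcting_cost(sequence: str, language) -> int:
    """Distance between the sequence and the closest accepted word."""
    return min(edit_distance(sequence, word) for word in language)

# Words accepted by the small example automaton.
LANGUAGE = {
    prefix + "E" + middle + "K" + suffix
    for prefix in ("MAS", "MGY")
    for middle in "IV"
    for suffix in ("LFW", "YRV")
}

cost_accepted = error_correcting_cost("MASEIKLFW", LANGUAGE)
cost_one_substitution = error_correcting_cost("MASELKLFW", LANGUAGE)
cost_one_deletion = error_correcting_cost("MASEKLFW", LANGUAGE)
```

A cost of zero means the sequence is accepted as is; small positive costs identify sequences that are close to the family without matching the automaton exactly.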

Distribution of sequences with long automata (size approx. 100)
Leave-one-out cross-validation
With Error Correcting Cost
Conclusion & Perspective



Good characterization of protein family using automata (-> HMM structure)
No need for a multiple alignment
Greedy data-driven algorithm
Important subparts localization
Physico-chemical identification and generalization
Counter-example sets
Knowledge can be brought into automata (-> 2D structure)
Questions ?
Demo
Protomata-L's Approach

First Common Fragment
Protomata-L's Approach

To get a more precise automaton
Data set (Protein sequences)
Pairs of fragments
EXTRACTION
Initial Automaton (MCA)
SORT
MERGING
IDENTIFICATION OF PHYSICOCHEMICAL GROUPS
IDENTIFICATION OF « GAPS »
Structural discrimination
Generalization of an Aquaporins automaton
Aromatic
Hydrophobic
Non-informative
Physico-chemical properties identification
Likelihood ratio test
Aliphatic
x
Small