A Similar Fragments Merging Approach to Learn Automata on Proteins
Goulven KERBELLEC & François COSTE
IRISA / INRIA Rennes
Outline of the talk
Protein families signatures
Similar Fragment Merging Approach (Protomata-L)
Characterization
Generalization
Similar Fragment Pairs (SFPs)
Ordering the SFPs
Merging of SFP in an automaton
Gap generalization
Identification of Physico-chemical properties
Experiments
Protein families
Amino acid alphabet: {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
Protein sequence:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHY
PIKSNQTTGAVQDNVKVSLAFGLSI…
Protein data set:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHY
PIKSNQTTGAVQDNVKVSLAFGLSI…
>AQP2_RAT
MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWAS
SPPSVLQIAVAFGLGIGILVQALGH…
>AQP3_MOUSE
MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLIL
VMFGCGSVAQVVLSRGTHGGFLT…
Common function & common topology (3D structure)
Characterization of a protein family
ZBT11 ...Csi..CgrtLpklyslriHmlk..H...
ZBT10 ...Cdi..CgklFtrrehvkrHslv..H...
ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H...
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
[Figure: Zinc Finger Pattern. Two C residues and two H residues, separated by x positions, bind a Zn ion.]
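The PROSITE pattern above can be read as a regular expression. The following is a minimal sketch, assuming only the syntax elements used up to class F (single residues, `x`, `[...]` classes and `(n)` / `(n,m)` repetitions); the function name `prosite_to_regex` is hypothetical and the `*` wildcard of classes G to I is not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Separate the residue part from an optional "(n)" or "(n,m)" suffix.
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":          # x stands for any amino acid
            core = "."
        if low is None:
            parts.append(core)
        elif high is None:
            parts.append(f"{core}{{{low}}}")
        else:
            parts.append(f"{core}{{{low},{high}}}")
    return "".join(parts)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
zinc_regex = prosite_to_regex(ZINC_FINGER)

# Fragments of ZBT11, ZBT10 and ZBT34 from the alignment, gaps removed.
fragments = ["CSICGRTLPKLYSLRIHMLKH", "CDICGKLFTRREHVKRHSLVH", "CKFCGKKYTRKDQLEYHIRGH"]
matches = [re.search(zinc_regex, f) is not None for f in fragments]
```

All three aligned fragments are matched by the resulting expression, which illustrates that such a pattern is a restricted form of regular expression.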
Expressivity classes of patterns
Class   Example
A       T-C-T-T-G-A
B       D-R-C-C-x(2)-H-D-x-C
C       G-G-G-T-F-[ILV]-[ST]-[ILV]
D       V-x-P-x(2)-[RQ]-x(4)-G-x(2)-L-[LM]
E       G-C-x(1,3)-C-P-x(8,10)-C-C
F       C-x(2,4)-C-x(3)-[ILVFYC]-x(8)-H-x(3,5)-H
G       D-T-A-G-Q-E-*-L-V-G-N-K
H       D-T-A-G-[NQ]-*-L-V-G-N-[KEH]
I       D-T-A-x(2,5)-G-[NQ]-*-L-V-G-N-[KEH]
J       Regular Expression / Automaton
Tools: PROSITE, PRATT, TEIRESIAS, PROTOMATA-L
Characterization
Similar Fragment Pairs
Significantly similar fragment pairs (SFPs)
Natural selection
Important area characterization
Data set D:
Ordering the SFPs
Problem :
Solution : ordering the SFPs by scoring each SFP
S(f1,f2) = ?
3 different scoring functions:
dialign Sd
support Ss
implication Si
Dialign Score
Sd(f1,f2) = - log P(L, Sim)
L = |f1| = |f2|
Sim = sum of the individual similarity values (BLOSUM62 similarity)
P = probability that a random SFP of the same length L
has a similarity sum of at least Sim
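The score can be illustrated with a small exact computation. This is only a sketch under loudly stated assumptions: residues are drawn uniformly from the 20-letter alphabet, natural logarithms are used, and `toy_similarity` (1 for identical residues, 0 otherwise) is a hypothetical stand-in for a real BLOSUM62 lookup. P is taken as the probability of reaching at least the observed similarity sum.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def toy_similarity(a: str, b: str) -> int:
    """Stand-in for a BLOSUM62 lookup: 1 for identical residues, else 0."""
    return 1 if a == b else 0

def tail_probability(length: int, sim: int, similarity=toy_similarity) -> float:
    """P(L, Sim): probability that a random fragment pair of this length
    reaches a similarity sum of at least `sim` (uniform residues assumed)."""
    # Score distribution of a single random residue pair.
    single = defaultdict(float)
    for a, b in product(ALPHABET, repeat=2):
        single[similarity(a, b)] += 1.0 / len(ALPHABET) ** 2
    # Convolve it `length` times to obtain the distribution of the sum.
    dist = {0: 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for total, p in dist.items():
            for s, q in single.items():
                nxt[total + s] += p * q
        dist = nxt
    return sum(p for total, p in dist.items() if total >= sim)

def dialign_score(f1: str, f2: str, similarity=toy_similarity) -> float:
    """Sd(f1, f2) = -log P(L, Sim) for two fragments of equal length."""
    if len(f1) != len(f2):
        raise ValueError("fragments must have the same length")
    sim = sum(similarity(a, b) for a, b in zip(f1, f2))
    return -math.log(tail_probability(len(f1), sim, similarity))
```

Passing a different `similarity` function, for instance one backed by a substitution matrix, changes only the per-position score distribution; the convolution step stays the same.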
Support Score
Taking into account the representativeness of SFP
<f1,f2> is supported by a fragment f with respect to
the triangular inequality:
Sd(f,f1) + Sd(f,f2) ≥ Sd(f1,f2)
Ss(f1,f2,D) = number of sequences of D supporting <f1,f2>
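A minimal sketch of the support count, assuming that a sequence supports the pair as soon as one of its fragments f of the same length satisfies Sd(f,f1) + Sd(f,f2) ≥ Sd(f1,f2). The `identity_score` function is a hypothetical toy replacement for Sd, and the function names are illustrative only.

```python
def identity_score(a: str, b: str) -> float:
    """Toy stand-in for Sd: number of positions with identical residues."""
    return float(sum(x == y for x, y in zip(a, b)))

def supports(sequence: str, f1: str, f2: str, score=identity_score) -> bool:
    """True if some fragment f of `sequence` satisfies
    score(f, f1) + score(f, f2) >= score(f1, f2)."""
    n = len(f1)
    target = score(f1, f2)
    return any(
        score(sequence[i:i + n], f1) + score(sequence[i:i + n], f2) >= target
        for i in range(len(sequence) - n + 1)
    )

def support_score(f1: str, f2: str, dataset, score=identity_score) -> int:
    """Ss(f1, f2, D): number of sequences of the data set supporting <f1, f2>."""
    return sum(supports(seq, f1, f2, score) for seq in dataset)

D = ["MASEIKLFW", "MGYEVKYRV", "AAEIKAAAA", "GGGGGGGGG"]
ss = support_score("EIK", "EVK", D)
```

In this sketch the two sequences that contain f1 and f2 always count as supporters; whether they should be excluded is a design choice the slides do not settle.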
Implication Score
Taking into account a counter-example set N
Discriminative fragments
Lerman index:
Si(f1,f2,D,N) = ( -P(Ss(f1,f2,N)) + P(Ss(f1,f2,D)) x P(N) ) / ( P(Ss(f1,f2,D)) x |N| )
with P(X) = |X| / (|D| + |N|)
Generalization
From protein data sets to automata
MASEIKLFW
MGYEVKYRV
[Figure: initial automaton with one linear path per sequence, M-A-S-E-I-K-L-F-W and M-G-Y-E-V-K-Y-R-V]
Merging SFPs
[Figure: after merging the SFP <EIK, EVK>, both paths share E-[I,V]-K; M-A-S and M-G-Y lead into it, L-F-W and Y-R-V lead out of it]
Sequences newly accepted by the merged automaton:
MASEVKLFW
MASEIKYRV
MASEVKYRV
MGYEIKYRV
MGYEVKLFW
MGYEIKLFW
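The merging step on the two example sequences can be sketched as follows. This is an illustrative sketch, not the Protomata-L implementation: the automaton is a plain set of transitions, the function names are hypothetical, an SFP is given as two (sequence index, start position) pairs plus a length, and conflicts between successive merges are not handled.

```python
from itertools import count

def build_mca(sequences):
    """Initial automaton: one private path per sequence from a shared start state 0."""
    transitions, finals, state_of = set(), set(), {}
    fresh = count(1)
    for i, seq in enumerate(sequences):
        state_of[(i, 0)] = 0
        for pos, symbol in enumerate(seq):
            state_of[(i, pos + 1)] = next(fresh)
            transitions.add((state_of[(i, pos)], symbol, state_of[(i, pos + 1)]))
        finals.add(state_of[(i, len(seq))])
    return transitions, finals, state_of

def merge_sfp(transitions, finals, state_of, sfp):
    """Merge the states aligned by an SFP ((i, start_i), (j, start_j), length)."""
    (i, si), (j, sj), length = sfp
    rename = {state_of[(j, sj + k)]: state_of[(i, si + k)] for k in range(length + 1)}

    def r(state):
        return rename.get(state, state)

    return ({(r(a), x, r(b)) for a, x, b in transitions},
            {r(s) for s in finals},
            {key: r(s) for key, s in state_of.items()})

def language(transitions, finals, start=0):
    """Enumerate the accepted sequences (assumes an acyclic automaton)."""
    out = {}
    for a, x, b in transitions:
        out.setdefault(a, []).append((x, b))
    accepted, stack = set(), [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in finals:
            accepted.add(prefix)
        stack.extend((b, prefix + x) for x, b in out.get(state, []))
    return accepted

mca = build_mca(["MASEIKLFW", "MGYEVKYRV"])
merged = merge_sfp(*mca, ((0, 3), (1, 3), 3))   # SFP <EIK, EVK>
generalized = language(merged[0], merged[1])
```

Before merging, the automaton accepts only the two training sequences; after merging the fragment pair it accepts eight sequences, the two originals plus six new combinations.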
Protein sequence data set -> list of SFPs -> ordered list of SFPs
MCA + ordered list of SFPs -> MERGING -> automaton / regular expression
Gap Generalization
Non-representative transitions are merged onto themselves
and treated as "gaps"
Identification of Physico-chemical properties
Similar fragments ~ potential functional area
Amino acids sharing the same position
Physico-chemical property at play
=> Generalization from a group (of amino acids) to a Taylor group
Observed group   Property         Generalized to
[I,V]            aliphatic        [I,L,V]
C                -                C
[I,Q,W,P]        no information   x
x = any amino acid of {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
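The three cases of the example can be reproduced with a simple set-inclusion rule. This is a deliberately simplified sketch: `GROUPS` is a partial, illustrative subset of Taylor's groups, `generalize` is a hypothetical helper, and the rule (smallest listed group containing all observed residues) is an assumption rather than the statistical decision used by the method.

```python
# Partial, illustrative subset of Taylor's physico-chemical groups.
GROUPS = {
    "aliphatic": frozenset("ILV"),
    "aromatic": frozenset("FHWY"),
    "negative": frozenset("DE"),
    "positive": frozenset("HKR"),
    "charged": frozenset("DEHKR"),
}

def generalize(observed) -> str:
    """Return the symbol labelling a position after generalization."""
    observed = frozenset(observed)
    if len(observed) == 1:
        return next(iter(observed))          # a single residue is kept as is
    candidates = [(len(g), name) for name, g in GROUPS.items() if observed <= g]
    if not candidates:
        return "x"                           # no group fits: any amino acid
    _, name = min(candidates)                # smallest group containing the set
    return "[" + ",".join(sorted(GROUPS[name])) + "]"

labels = [generalize("IV"), generalize("C"), generalize("IQWP")]
```

With these groups, [I,V] is widened to the aliphatic group [I,L,V], C is kept, and [I,Q,W,P] becomes x.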
Likelihood ratio test
To decide whether the multi-set A has been generated
according to a physico-chemical group G, we use a
likelihood ratio test.
Given a threshold, we test the expansion of A to G
and reject it when LRG/A is below the threshold
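The exact form of the ratio is not given in the text, so the following is only one possible sketch. It assumes that, under the group hypothesis, residues are drawn uniformly from G, and that the alternative is the maximum-likelihood multinomial estimated from A itself; both choices, and the function names, are assumptions made for illustration.

```python
import math
from collections import Counter

def likelihood_ratio(observed, group) -> float:
    """Assumed ratio: likelihood of A under a uniform draw from G divided by
    the likelihood of A under its own empirical frequencies (always <= 1)."""
    counts = Counter(observed)
    n = sum(counts.values())
    if any(residue not in group for residue in counts):
        return 0.0                           # A cannot have been generated by G
    log_group = n * math.log(1.0 / len(group))
    log_empirical = sum(c * math.log(c / n) for c in counts.values())
    return math.exp(log_group - log_empirical)

def accept_expansion(observed, group, threshold: float) -> bool:
    """Expand A to G unless the ratio falls below the threshold."""
    return likelihood_ratio(observed, group) >= threshold

ALIPHATIC = frozenset("ILV")
mixed = accept_expansion("IVIV", ALIPHATIC, threshold=0.1)
conserved = accept_expansion("I" * 10, ALIPHATIC, threshold=0.1)
```

Under these assumptions, a position showing I and V in equal numbers is expanded to the aliphatic group, while a position showing I ten times is kept as it is, since a strictly conserved residue is poorly explained by a uniform draw from the group.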
Experiments
MIP: the Major Intrinsic Protein family
Family: MIP
Subfamilies: AQP, Glpf, Gla
Data sets:
Water-specific: Set « W+ » (24 seq), Set « W- » (16 seq)
Set « E » (79 seq)
Set « M » (44 seq): identity<90%
Set « T » (159 seq): MIP in SWISS-PROT
Set « U » (911 seq): UNIPROT
Set « C » (49 seq): Blast(1<e<100) not MIP
Experiments
First Common Fragment on a Family
MIP family
Positive set
Comparison with pattern discovery tools
Teiresias
Pratt
Protomata-L (short pattern)
Water-specific Characterization
MIP sub-families
Positive and negative sets
Leave-one-out cross-validation
Protomata-L (short to long pattern)
First Common Fragment Automaton
Results of 4 patterns scanned on the Swiss-Prot protein database
Target set: Set « T » (159 seq)
Learning set: Set « M » (44 seq)
From short automata to long automata
Previous experiment
only the first SFPs of the ordered list of SFPs
short automaton
first common fragment automaton
Next experiment
larger cut-offs in the list of SFPs
Protomata-L is able to create longer automata with more
common subparts
Long patterns are close to the topology (3D structure) of
the family
Water-specific characterization
Leave-one-out cross-validation
Learning set
W+ \ Si : Positive learning set
W- \ Sj : Negative learning set
Set « W+» (24 seq)
Set « W-» (16 seq)
Test set
{ Si U Sj }
Control set
Set T
Set « C» (49 seq)
Implication score
Leave-one-out cross-validation
Error Correcting Cost
The error correcting cost of a sequence S represents the
distance (BLOSUM similarity) between S and the closest
sequence accepted by the automaton A.
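A minimal sketch of such a cost, assuming substitutions only (no insertions or deletions) and a hypothetical 0/1 `substitution_cost` in place of a BLOSUM-derived distance. The automaton is given as a dictionary mapping each state to a list of (symbol, target state) pairs; this representation and the function names are illustrative choices.

```python
import math

def substitution_cost(a: str, b: str) -> float:
    """Toy stand-in for a BLOSUM-derived distance: 0 if identical, else 1."""
    return 0.0 if a == b else 1.0

def error_correcting_cost(sequence, transitions, start, finals,
                          cost=substitution_cost) -> float:
    """Cheapest total substitution cost turning `sequence` into a sequence
    accepted by the automaton (infinite if no accepted sequence has that length)."""
    best = {start: 0.0}                      # state -> cheapest cost so far
    for residue in sequence:
        nxt = {}
        for state, c in best.items():
            for symbol, target in transitions.get(state, []):
                candidate = c + cost(residue, symbol)
                if candidate < nxt.get(target, math.inf):
                    nxt[target] = candidate
        best = nxt
    return min((c for s, c in best.items() if s in finals), default=math.inf)

# Small automaton accepting E-[I,V]-K.
automaton = {0: [("E", 1)], 1: [("I", 2), ("V", 2)], 2: [("K", 3)]}
costs = [error_correcting_cost(s, automaton, 0, {3}) for s in ("EIK", "ELK", "ALR")]
```

Keeping only the cheapest cost per state at each position makes the computation linear in the sequence length times the number of transitions, instead of enumerating every accepted sequence.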
Distribution of sequences with long automata (size approx. 100)
Leave-one-out cross-validation with error correcting cost
Conclusion & Perspective
Good characterization of protein families using automata
(-> HMM structure)
No need for a multiple alignment
Greedy data-driven algorithm
Localization of important subparts
Physico-chemical identification and generalization
Counter-example sets
Additional knowledge can be brought into automata
(-> 2D structure)
Questions ?
Demo
Protomata-L's Approach
First Common Fragment
Protomata-L's Approach
To get a more precise automaton
Data set (protein sequences)
-> EXTRACTION of pairs of fragments
-> SORT
-> MERGING in the initial automaton (MCA)
-> IDENTIFICATION OF PHYSICOCHEMICAL GROUPS
-> IDENTIFICATION OF « GAPS »
Structural discrimination
Generalization of an Aquaporins automaton
Aromatic
Hydrophobic
Non-informative
Physico-chemical properties identification
Likelihood ratio test
Aliphatic
x
Small