Document

Transcript Document

Combinatorial pattern discovery in
biological sequences: the
TEIRESIAS algorithm
Bioinformatics 1998
Authors: Isidore Rigoutsos
Aris Floratos
Speaker: Minghua ZHANG
March. 12, 2003
1
Content






2
Background
Problem definition
Algorithm
Performance
Future work
Conclusion
Background

The development of biology provides us with a lot of
information previously unknown.
–


DNA sequences, protein sequences
One way to make use of the information: discovery
patterns in biological sequences.
Application:
–
Classification:


–
3
Given a protein family, if we find a pattern P of the family;
Later when a new protein comes in, if the protein matches P, if is
very likely the new protein also belongs to the same family.
Function prediction.
Problem Definition

: the alphabet of the characters
–

Pattern:
–
–
–
4
E.g: for DNA sequences, ={A,C,G,T}.
A sequence of characters in , and wild-cards (“.”);
Wild-card cannot be at the beginning or end of the
sequence.
E.g: A..G is a pattern; AG. is not.
Problem Definition (cont’d)

Given a pattern P
–
–
Its length |P| = No. of characters and wild-cards in P.
char(P) = No. of characters in P.

–
Its subpattern: a substring of P, which itself is a pattern.

–
A.G is a subpattern of A.G.T; while A.G. is not.
Its language G(P) = { p | p is a pattern obtained by replacing
all wild-cards in P with characters in  }

5
E.g. |A.G.T|=5, char(A.G.T)=3.
E.g. G(A.G)={ AAG, ACG, AGG, ATG }.
Problem Definition (cont’d)

Given a character sequence s, we say it matches a
pattern P in position y
–
–
–

Given a sequence database D={s1, s2, … , sn}
–
–

support(P) = No. of sequences in D that match P.
List(P) = {(x,y) | sx matches P in position y}
Frequent
–
6
if a substring of s is in G(P); and
the substring begins from offset y in s.
E.g: if S=ACGATCA, P=CG.T, then s matches P in position 2.
If support (P) > = K
Problem Definition (cont’d)

<L, W> pattern
–
–
L, W are 2 integers, 0<L<W.
P is an <L, W> pattern

–
7
 subpattern Q of P with char(Q)=L, we have |Q| < = W.
E.g: A.G.C.T is a <3, 5> pattern, A..G.C.T is not a
<3,5> pattern.
Problem Definition (cont’d)

Pattern P’ is more specific than P
–
P’ is obtained from P by


–
–

E.g: P=A..G, P1=AC.G, or P2=A..GT.
|List(P’)| <= |List(P)|
Given a set of patterns, a pattern P is maximal if
–
p’, p’ is more specific than p

–
8
Replacing some wild-cards to characters, and/or
Appending a string to the left/right of P
we have |List(p’)| < |List(p)|;
E.g: P=A.GT, P’=ACGT, if List(P)=List(P’)={(1,1), (2,2), (3,1)},
then P is not maximal.
Problem Definition (cont’d)

Problem
–
Inputs



–
Outputs

9
A character sequence database D;
Support threshold K;
parameters L, W.
{ p | p is an <L, W> pattern & p is frequent & p is maximal &
char(p)>=L}.
Algorithm

Two parts:
–
1. Find elementary patterns.

–
10
Frequent <L, W> patterns having exactly L characters
2. Find required patterns.
Algorithm part 1

Find the set of elementary patterns EP:
EP=empty;
Extend(pattern P)
Scan D;
1.
If char(P)=L {add P to EP; return; }
For all c in 
2.
For i=0 to W- |P| - L + char(P)
3.
P’ = P + “i dots”
Extend(c)
4.
for all (x,y) in List(P)
Algorithm
5.
if support(c) >= K
if (y+ |P| + i ) < | SX | {
6.
a = SX[ y + |P| + i ];
7.
update List(P’ a) }
8.
For all c in 
9.
if List( P’c ) > = K
10.
Extend ( P’c )
Function
11
Algorithm part 2

Terminologies:
–
Given a pattern P with char(P) >=L,



12
Prefix(P): its subpattern containing the first L-1 characters.
Suffix(P): its subpattern containing the last L-1 characters.
E.g: L=3, prefix(A.G.TG)=A.G, suffix(A.G.TG)=TG
Algorithm part 2 (cont’d)

Terminologies:
–
Convolution




If suffix(P) = prefix(Q), then
conv(P, Q) = PQ’ (Q=prefix(Q)Q’)
E.g: P=AC.G, Q=C.GT, then Q’=T, conv(P,Q)=AC.GT
If R=conv(P,Q), then
–
13
List(R) = {(x,y) | (x,y) in List(P) & (x, y+ |P| - |suffix(P)|) in
List(Q) }
Algorithm part 2 (cont’d)

Terminologies:
–
Partial orders on a set of patterns

prefix_wise less than (<pf)
–
Align two patterns along their left-most columns
– Find the first column with a “.” and a character
– The pattern with character in the column is smaller.
 E.g:
A. GT
AGT.C
So AGT.C <pf A.GT
14
Algorithm part 2 (cont’d)

Terminologies:
–
Partial orders on a set of patterns

suffix_wise less than (<sf)

E.g:
A .GT
AGT. C
So A.GT <sf AGT.C
15
Algorithm part 2 (cont’d)

Works on a stack
–
At first, push all elementary patterns into the stack.

–
Deal with the top pattern one by one

16
Prefix-wise less patterns are pushed later, so that they are
closer to the top of the stack.
until the stack is empty
Algorithm part 2 (cont’d)

P is on top of the stack
–
Extend P to the right



find elementary patterns whose prefix is the same as suffix(P).
convolution P with the patterns one by one with prefix-wise less
pattern first.
If R=conv(P,Q).
–
If |List( R ) | = |List(P)|, pop stack.
– If support(R) >= K and R is maximal: push R, starts over again.
–
–
17
Extend P to the left.
Pop P from stack. If P is maximal, report it.
Performance

Experiments:
–
effectiveness


–
efficient

18
real data
successfully find patterns previously known by biologist.
synthetic data
Performance (cont’d)

Core histones H3 and H4
–
–
–
–
Database: 13 proteins from H3 family, 7 from H4.
L=3, W=35
Find 9 patterns in all 20 proteins
For those patterns with the largest No. of characters
Patterns
19
SwissProt
NCBI
H3(33)
H4(20)
H3(81)
H4(59)
G…………………..T.I……..V..I……..R
27
19
79
58
K.A……..GGVK
24
20
77
59
E……V…E………..V….K………G
27
19
76
58
Performance (cont’d)

Synthetic data generation
–
Use the generator in
http://www.expasy.ch/sprot/randseq.html to get

–
Obtain 20 derivative proteins from P

20
A random protein P of 400 amino acids
With a X% similarity to P
Performance (cont’d)

Synthetic data
–
–
–
–
X=40, 50, 60, 70, 80, 90%
L=3, W=10 (15)
K=12, 14, 16, 18, 20
Results:


21
The algorithm is efficient.
The running time is almost linear on the actual size of the
output.
Performance (cont’d)

Algorithm analysis:
–
Bad factors:

–
ACGT and GTA will generate ACGTA. However, CGTA may be
infrequent. More candidates are generated.
Good factors:

By use partial order of <pf and <sf, some more specific patterns
will be generate before less specific patterns.
–

When R is generated from P, if |List( R )| = |List(P)|, P will be
popped out of the stack.
–
22
E.g: ACG is considered before AC.T  ACGTG is generated before
AC.TG  If AC.TG is not maximal due to ACGTG, it will not be
pushed into stack for further candidate generations
Reduce no. of candidates
Performance (cont’d)

A simple comparison
–
–
Compared with an Apriori-like algorithm, which finds
all patterns, not only maximal ones.
X=90%, L=3, W=10, K=16
Algorithm
TEIRESIAS
Aprior-like
candidates
patterns output
Time (s)
No. of pushes
185,179
5,855
10.75
21,569
13,972
2,659,681
2,029,867
62.82
-
-
( Running time does not include time to output patterns )
23
No. of pops
Future work

Find more flexible patterns
–
Pattern with ambiguous characters

–
E.g. A[CG]T means a pattern, which can match both ACT
and AGT. ([IALYD2000])
Pattern with flexible gaps

Gap: one or several consecutive wild-cards
–

E.g: In A..G, “..” is a gap of length 2
Flexible gap:
–
A-x(3,5)-G;
– A-*-G.
24
Conclusion




25
TEIRESIAS can find patterns composed of
characters and wild-cards.
It can successfully find out patterns from
biological sequences.
Its running time is roughly linear to the size of
the output.
Future research: discover more flexible
patterns.
Reference



26
[IA1998] Combinatorial pattern discovery in biological
sequences: the TEIRESIAS algorithm. Isidore
Rigoutsos, etal. Bioinformatics, Vol 14, No. 1, 1998, 5567.
[IALYD2000] The emergence of pattern discovery
techniques in computational biology. Isidore Rigoutsos,
etal. Metabolic Engineering 2, 159-177 (2000)
[AI] On the time complexity of the TEIRESIAS
algorithm. Aris Floratos, etal. IBM research report.
27

Document

Transcript Document

Directory