Transcript ppt
SPLASH: Structural Pattern
Localization Analysis by
Sequential Histograms
A. Califano, IBM TJ Watson
Presented by Tao Tao
April 14th, 2004
Motif: A functional regain of a DNA or
protein sequence
Amino Acids
sequences
How to discover the functional regains automatically?
Automatic Motif discovery
-
Problem
Use A, B, C, … stands for different amino acids
A protein sequence: ABABAABCDBAA…
Motifs are certain patterns in sequences
for example: ABCA
Previous Methods: small scale discovery
Several sequences similar functions alignment
Can we use data mining to generate motifs
candidates first?
Automatically discover motifs: What
properties should a motif have?
It has a specific function conservative
frequent appearing in sequences
Evolution likely not continually identical
For example: ABCBABABA
AB--ABAB-
string matching, suffix tree …
AB-BAB-B-
how?
Formal problem definition
Input: A string of characters: S=s1s2,…,sL
Output: A frequent pattern: (∑ U ●)*
●: a wild card to match a single character,
∑: a full character
* : repeat arbitrary times
Note: NO arbitrary-length gap.
ABCD, AED are different
Regular Expression: to describe a certain
type of patterns
| or : A|B means A or B
● wild card to match any characters
A●B means: AAB, ABB, ACB, …
* to repeat any times (including 0 times)
(AB)* means null, AB, ABAB, ABABAB, …
+ to repeat any times (not including 0 times)
…
Any requirements for output patterns?
Can wild card be anywhere? Do we need
some constraints on wild cards?
What means “frequent”?
How long should a qualified pattern be?
Can wild card be anywhere?
A pattern can have ●: for example A●BA●●B
But, A●●●●●●●●●●●●●●●●BBA ??
Probably, it cannot be too “sparse”…
Naïve solution: no more than n ●
But, for example n=5
A●●●●●B : 5 ●
A●BB●●A●●B●●A : 7●
Density: how “sparse” do we allow?
Given a pattern P, any length l0 region in P must
have k0 full characters
Example: l0 = 5, k0 = 3
s1s2s3s4s5s6s7s8s9……
Two ● at most
Density: how “sparse” do we allow?
Given a pattern P, any length l0 region in P must
have k0 full characters
Example: l0 = 5, k0 = 3
s1s2s3s4s5s6s7s8s9……
Two ● at most
Density: how “sparse” do we allow?
Given a pattern P, any length l0 region in P must
have k0 full characters
Example: l0 = 5, k0 = 3
s1s2s3s4s5s6s7s8s9……
Two ● at most
Density: how “sparse” do we allow?
Given a pattern P, any length l0 region in P must
have k0 full characters
Example: l0 = 5, k0 = 3
s1s2s3s4s5s6s7s8s9……
A●●ABB●A √
BA●●A●BB
X
Frequency and length
At least, the patterns have K0 full characters
repeating J0 times
Example J0 =3
ABCBABABA √
ABCBABABA X
Example K0 =3
ABCBABABA X
ABCBABABA √
Summary of parameters for a pattern
Sequence S, and its length L
Pattern P, K full character, appears J times
Length constraints: K ≥ K0
Frequency constraint: J ≥ J0
Density constraint: l0 , k0
Apriori property
A constraint has a-priori property means:
If a set violates this constraint, any its superset will
violate this constraint as well.
For example max(S) < 5
Frequency constraint has a-priori property!
For example, BA●A●BB appears less than J0 times, any
its super patterns CANNOT appears more than J0 times!
A whole picture of the algorithm
-
-
To form longer pattern only from short qualified patterns.
First, to generate candidates/seed (length l0): every seed
should repeat at least J0 times
To generate longer patterns from short patterns, iteratively
1. Two patterns are together
2. Longer patterns repeat at least J0
……
Generate the seeds: enumerating …
-
To generate seeds (shortest patterns) first
ABAABBCBACBDB… J0=4
A: 4, B: 6, C:2, D: 1
Are length 1 seeds too short? How long could those
seeds be?
Too long: enumerating costs too much time
Too short: maybe not efficient, also not consider the
density constraints
Maybe, we should start from the patterns with length
l0 .
How to generate seeds with length l0 ?
Give l0 and k0, and character sets ABC…
Enumerating all possible patterns with length l0
Scan the sequence the count the frequency
For example, l0=3, k0=2, ABC
AAA, AAB, AAC, ABA, …
AA●, AB●, AC●, …
A●A, A●B, A●C, …
…
Can we do it more efficiently?
Give l0 and k0,
1: full character 0:wild card
Enumerating all possible patterns by 1 and 0?
Example l0=5, k0=3, to find comb
11111, 11110, 11101, 11100, 11011,
11010, 11001, 10111, 10110, 10101
10011, 01111, 01110, 01101, 01011,
00111
How to use comb? For example 10101
A B AA B A B B A B A B A B B AA B
A●A●B
How to use comb? For example 10101
A B AA B A B B A B A B A B B AA B
B●A●A
How to use comb? For example 10101
A B AA B A B B A B A B A B B AA B
A●B●B
How to use comb? For example 10101
A B AA B A B B A B A B A B B AA B
A●A●B 3, B●A●A 2, A●B●B 2
B●B●A 2, B●B●B 2, A●A●A 1
A●B●A 1, B●A●B 1
J0 = 3? only A●A●B left
By the same way, use others combs to generate
other seeds, different combs won’t generate the
same patterns
How to get long patterns?
Long pattern two patterns could be
merged need short patterns and their
locations
-
Pattern: A●B●●C {A:0,B:2,C:5}
Locus: the locations where a pattern occurs:
Patten AB in string ABBCABAB
Its locus {0, 4, 6}
-
Append operation: to connect two small
patterns to a longer pattern
Patten S1: A●●B●C and S2: B●D●
S1S2: A●●B●CB●D●
conditional on: Their locus have intersection
S1 locus: {1, 20, 32, 57 …} -6
S2 locus: {7, 13, 38, 63 … } {1,7,32,57,…}
S1S2 locus: {1,32,57,…}
Add: to make the patterns more “dense”
Patten A●●B●C●●●D and ●●●B●CE
A●●B●CE●●D
on the conditions:
Their locus (with shifting) have intersection
Significance: whether it can be generated
by randomly sampling ?
hypothesis: A pattern is not randomly generated
Given: character set: {A,B,C,D,E}
sequence length: L
A pattern: A●BA●AA●B
Its frequency j
Probability to generate this pattern j times pure
randomly?
Statistical significance
Pure random sampling, the frequency should
satisfy normal distribution
Z score, (A-E[A])/σA --- normalized into
N(0,1)
Experiments
-
-
Two questions to answer.
How efficient is this algorithm?
How effective is this algorithm?
Baseline algorithm
PRATT(EBI), MEME(UCSD)
Efficiency
PRATT
SPLASH
Effectiveness
Search against SWISS-PROT Rel. 36,
578 GPCR proteins returned, only 4 false positive
MEME cannot find it, PRATT program crashed
Conclusions
Deterministic algorithm: It can discover all
patterns satisfying the requirements
Efficient and scalable: It beats PRATT and
MEME. More scalable …
Effective: It can discover useful patterns.
Problems
All problems that A-priori algorithm could
have: too many results, cannot really avoid
worse-case exponential …
Doesn’t really consider the 3D structure of
proteins
The software crashes sometimes