P - Computer Science and Engineering
Download
Report
Transcript P - Computer Science and Engineering
Efficient Algorithms
for Locating Maximum Average
Consecutive Substrings
Jie Zheng
Department of Computer Science
UC, Riverside
Outline
•
•
•
•
Problem Definition
Applications to Molecular Biology
Two Existing Algorithms
Open Problems
Definition of Problem
• Given a sequence of real numbers,
A = <a1, a2, …, an>, and a positive integer
L ≤ n, the goal is to find a consecutive substring
of A of length at least L such that the average
of the numbers in the subsequence is
maximized.
Applications in Biology
•
•
•
•
Locating GC-rich Regions
Post-Processing Sequence Alignments
Annotating Multiple Sequence Alignments
Computing Ungapped Local Alignments with
Length Constraints
Two Existing Algorithms
• An O(nlogL)-time Algorithm
(Yaw Ling Lin, Tao Jiang, Kun-Mao Chao, 2001)
• A Linear Time Algorithm for Binary
Strings
(Hsueh-I Lu, 2002)
The O(nlogL)-time Algorithm
(Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao,2001)
Basic Scheme:
• Finding good partner of each element,
i.e. for element ai, locate aj, such that the
segment <ai,…, aj> has maximum average
among all substrings starting from ai.
• Choose the <ai,…, aj> with the maximum
average among the n candidates.
Important Concepts
• Right-Skew Sequence
A sequence A = <a1, a2, …, an> is right-skew
if and only if the average of any prefix
< a1, a2, …, aj> is always less than or equal
to the average of the remaining suffix
subsequence <aj+1, aj+2,…, an>.
Important Concept
• Decreasingly Right-Skew Partition
A partition A=A1A2…Ak is decreasingly rightskew if each segment Ai of the partition is
right-skew and μ(Ai) > μ(Aj) for any i < j.
Big Picture of Right-Skew Partition
A
B
C
D
Intuition:
1. If A is chosen, B must also be
2. If C is not chosen, D can not be, either.
Lemma 7(Huang):
The Maximum Average Substring
Can not be longer than 2L-1
• Proof If C is the maximum average substring
with length ≥2L, let C= AB, where |A|≥L and
|B|≥L, then the average of A or B is no less than
that of C.
Say μ(A) > μ(B), then μ(A) > μ(AB)
Main Idea of the O(nlogL)Algorithm
1. Compute the decreasingly right-skew
partition in O(n) time.
2. Finding the good partner for each index
costs O(logL) time.
Compute the decreasingly rightskew partition
1. Lemma 5: Every real sequence
A=<a1,a2,…,an> has a unique decreasingly
right-skew partition.
2. Lemma 6: All right-skew pointers for a
length n sequence can be computed in O(n)
amortized time.
Compute the right-skew pointers
4
9
4
30
9
4
4
30
9
9
30
30
5
5
5
5
Find good partner in O(logL)
• Lemma 9(Bitonic): Let P be a real sequence,
and A1A2…Am the decreasingly right-skew
partition of a sequence A. Suppose that
μ(PA1…Ak) = max{μ(PA1…Ai)|0≤i≤m}
Then μ(PA1…Ai) > μ(Ai+1) if and only if i≥k.
What does Lemma 9 tell us?
•
Locating good partner can be done with
binary search!
To find k so that
μ(PA1…Ak) = max{μ(PA1…Ai)|0 ≤ i ≤ m}
We guess i and make it closer to k:
1. μ(PA1…Ai) >μ(Ai+1) implies i ≥ k
2. μ(PA1…Ai) ≤μ(Ai+1) implies i < k
Big Picture of Locating Good Partners
L
L
L
1
2
2
1
3
1
Date Structure for Binary Search
logL Pointer-Jumping Tables
•
j (k) denotes the right end-point of the kth rightskew segment .
• p (0)[i] = p[i], where p[i] is right-skew pointer for i,
p (k+1) [i] = min{p (k) [p (k) [i]+1], n}. 1 k
logL
•
The precomputation of the jumping tables takes at
most O(nlogL) time.
The Time Complexity
•
•
•
Totally n phases
Each phase costs O(logL)
Overall: O(nlogL)-time
Crying Out
for A Linear Time
Algorithm!!
A Linear-Time Algorithm for Binary Strings
(Hsueh-I Lu, 2002)
•
•
Build upon the previous algorithm
Improvements:
Considering an upper bound on the number of
right-skew segments
- Working simultaneously on the right-skew
partitions of forward and reverse strings
- Utilizing Properties of Binary Strings
-
Basic Scheme
Let B = log3n and b = (loglogn)3
1. Choose O(n/ logn) indices i of S such that if g(i)-i
B holds for any of such i, then g(i) can be found in
O(logn) time.
2. Choose O(n/ loglogn) indices i of S such that if B
g(i) – i b holds for any of such i, then g(i) can be
found in O(loglogn) time.
3. Find g(i) for all indices i such that g(i) – i b.
Denotations
• A right-skew decompostion of any substring S [p, q]
is a nonempty set of i indices i1,i2,…, il so that
S[i1,i2], S[i2, i3],…, S[il-1, il] are decreasingly rightskew partition of S.
• Let DS (i, j) denote the right-skew decomposition of
S[i, j]
• If P = {p1, p2 ,…, pk, pk+1}, where p1< p2<…<pk+1,
then
k
DS ( P ) DS ( pi , pi 1 )
i 1
An Intuitive Observation
• Right-skew pointers cannot cross
A
B
C
1. By definition of right-skew segment: μ(A) μ(B) μ(C)
Thus μ(A+B) μ(C).
2. By definition of decreasingly right-skew partition:
μ(A+B) > μ(C). Contradiction.
The Big Picture of Right-Skew
Decomposition
Lemma 3: from the big picture
• If j DS (P), then DS (j, n) DS (P).
Lemma 3 tells us that if j belongs to the rightskew decomposition of some set of indices,
then its good partner will also be. Thus, we
only need to search for its good partner among
a limited number of indices.
Lemma 4: |DS(i, j)| = O((j - i)2/3)
(It holds for binary strings only)
• Define: A right-skew substring determined by DS(i, j)
is the undividable right-skew segment.
• A right-skew substring S[p, q] is long (short) if q - p
l1/3 (q - p < l1/3)
• Prove lemma 4 by showing that the number of long
and short right-skew substrings for a binary string is
O((j - i)2/3).
Phase 1: g(i) - i B; gR(j) - j
B
Define:
Pshort = {p | p mod B 0 and 0 p < n} {n}
• We have |DS (Pshort)| = O (n/logn)
• In this phase, we take care of index i such that
i and g(i), i.e. good partner of i, are both in
DS (Pshort) DR (Pshort)
Phase 2: L + b < g(i) – i L+B
L + b < gR(j) – j L+B
Define:
Ptiny = {p | p mod b 0 and 0 p < n} {n}
• We have |DS (Ptiny)| = O (n/loglogn)
• In this phase, we take care of index i such that
i and g(i), i.e. good partner of i, are both in
DS (Ptiny) DR (Ptiny)
Phase 3: g(i)-i L+b, gR(j)-j L+b
• We set up a table M whose (x,y) entry contains the
index z, such that:
If C is a binary string of L+b bits, x is the number of ‘1’ in the
first L bits of C; y is the binary string consisting of the last b
bits of C; z is the good partner of index 0 in C.
Because b is relatively small, the number of possible value for
x and y is linear
• Looking up the table M, we can cope with the leftover case in O(n)-time.
Open Problems:
How to extend the linear time
algorithm for binary strings to
arbitrary strings.
INTERESTED?
Contact:
Jie Zheng
Department of Computer Science
Surge Building # 350
UC, Riverside
E-mail: [email protected]