Transcript Document
An Optimal Algorithm for
MAX-SUM SEGMENT
& Its Applications in Bioinformatics
Tsai-Hung Fan, Shufen Lee, HsuehI Lu Tsung-Shan Tsou, Tsai-Cheng
Wang, and Adam Yao
Applications to biomolecular
sequence analysis
Locating
GC-rich regions
Computing ungapped local alignments with
length constraints
Locating GC-rich regions with length
constraints
w(G) = w(C) = 1-p
w(A) = w(T) = -p
Find the maximum sum of regions.
The maximum sum without length constrains can
be computed by the following recurrence:
Computing ungapped local
alignments with length constraints
A
G
T …… A
C
T
A
……
G
O( m n )
The idea of left-negative
Lemma
1. Every sequence of real numbers
can be uniquely partitioned into minimal leftnegative segments.
Lemma 2. We can find all left-negative
pointers for a length n sequence in O(n)
time.
The left-negative pointers
L
U
i
j
An on-line algorithm
following slides come from Hsueh-I Lu’s
Website.
The
Geometric representation
height of j-th point
=
prefix-sum(j)
S = 2 -1 1 1 -2 3 -2
1
2
j
Sum of S[i, j] = prefix-sum(j) –
prefix-sum(i – 1)
For
each ending index j,
Maximizing the sum of S[i, j] is equivalent to
minimizing prefix-sum(i – 1).
Lowest point in [j-U, j-L]
valley(j)
j –U
Feasible index
set F(j) for index
j.
j –L
j
valley(j)
Clearly,
for each ending index j, 1+valley(j)
has to be the best starting index.
In other words, 1+valley(j) maximizes the
sum of S[i, j] over all feasible starting indices
i.
As a result
The
MAX-SUM SEGMENT problem is
linear-time reducible to computing valley(j)
for all indices j.
More specifically, the solution to MAX-SUM
SEGMENT is simply the segment
S[1+valley(j), j] with maximum sum over all
indices j.
Computing valley(j) for all
indices j in O(n) time
Case 1: U is ineffective (e.g.,U = n)
Case 2: U is effective
Case 1: U is ineffective
F(j)
j
j+1
F(j + 1)
j+2
F(j + 2)
F(j + 1) = F(j) ∪ {j-L+1}
if prefix-sum( j–L+1) < prefix-sum(valley(j))
let valley(j + 1) = j – L + 1;
otherwise,
let valley(j + 1) = valley(j);
O(1) time to compute
valley(j + 1) from
valley(j)
O(n) time to
compute
valley(j) for all j
O(n) time to
compute a maxsum segment.
Case 2: U is effective
F(j)
j
F(j + 1)
F(j + 2)
j+1
j+1
We need a data structure D(j) for
the indices in F(j)
The
specification:
valley(j) can be obtained from D(j) in O(1) time.
D(j + 1) can be updated from D(j) in amortized
O(1) time.
A solution
D(j) be a list of indices X(1), X(2)… such
that
Let
X(1) is the lowest point in F(j) = [j –U, j–L];
That is, X(1) = valley(j).
X(2) is the lowest point in [X(1)+1, j–L];
X(3) is the lowest point in [X(2)+1, j–L];
…
j-U
j-L
D[j] = X
1
2
3
4
Spec. #1
valley(j)
can be obtained from D(j) in O(1)
time?
Yes! Just read the first number in D(j).
Spec. #2
D(j+1)
can be updated from D(j) in
amortized O(1) time?
Yes! Two steps:
(1) If valley(j) is not in F(j+1), we just delete the
first number from the list.
(2) Scan the list of indices in D(j) from right to
left to see where j-U+1 fits in.
time complexity
The overall time complexity is
O(n), although some iterations
may take more than constant time.
On-line processing
In
the j-th iteration, when S[j] is received,
prefix-sum(j) can be computed on the fly
valley(j) can be computed “interatively”,
So,
our algorithm can process the input
sequence in an online manner.
the required working space is O(U – L + 1).
The End