a) t - Technische Fakultät
Download
Report
Transcript a) t - Technische Fakultät
Counting Suffix Arrays and Strings
18 May 2006
Technische Fakultät
Universität Bielefeld
Germany
Klaus-Bernd Schürmann
Jens Stoye
Suffix Array Data Structure
Text to be indexed:
T C T
T C T C T
T C T C
1
4
9 10 11 12 13
2
3
5
6
7
8
$
Suffix Array – lexicographically sorted list of all suffixes:
13
12
10
5
7
2
11
9
4
6
1
8
3
Dagstuhl, May 2006 - Jens Stoye
-
$
C$
CTC$
CTCTTCTC$
CTTCTC$
CTTCTCTTCTC$
TC$
TCTC $
TCTCTTCTC$
TCTTCTC$
TCTTCTCTTCTC$
TTCTC$
TTCTCTTCTC$
Slide 2
Overview
1. Classify strings sharing same suffix array
2. Counting strings sharing same suffix array
3. Counting suffix arrays
Lower bound suffix array compression
4. Summation identities
Dagstuhl, May 2006 - Jens Stoye
Slide 3
1. Classify Strings for Suffix Array
t - string of length n,
P - permutation of {1,..., n},
R - inverse of P.
Theorem:
P is the suffix array of t if and only if
for all i {1,...,n}
a) t[P[i]] t[P[i+1]] and
b) t[P[i]] = t[P[i+1]] R[P[i]+1] R[P[i+1]+1]
same as
b) R[P[i]+1] > R[P[i+1]+1] t[P[i]] < t[P[i+1]]
Dagstuhl, May 2006 - Jens Stoye
Slide 4
1. Classify Strings for Suffix Array
a) t[P[i]] t[P[i+1]] and
b) R[P[i]+1] > R[P[i+1]+1] t[P[i]] < t[P[i+1]]
Text to be indexed: t
= A A B C B
1
i
1
2
3
4
5
2
3
4
5
P[i] t[P[i]]
1
A ABCB
2
A BCB
5
B
3
B CB
4
CB
R
Dagstuhl, May 2006 - Jens Stoye
+-descent
Slide 5
1. Classify Strings for Suffix Array
Equivalences between strings
Text to be indexed: t
= A A B C B
1
2
3
4
5
t2 = A A C D C
1
i
1
2
3
4
5
2
3
4
5
t3 = A B D E D
1
2
3
4
5
P[i] t[P[i]]
t2[P[i]]
t3[P[i]]
1
A ABCB
A ACDC
A BDED
2
A BCB
A CDC
B DED
5
B
C
D
3
B CB
C DC
D ED
4
CB
DC
ED
(order-equivalent)
Dagstuhl, May 2006 - Jens Stoye
(order-distinct)
Slide 6
2. Counting Strings for Suffix Array
Text to be indexed: t
= A A B C B
1
2
3
4
5
t2 = A A C D C
1
i
1
2
3
4
5
2
3
P[i] t[P[i]] t2[P[i]]
1
A +0= A
2
A +0= A
5
B +1= C
3
B +1= C
4
C +1= D
Base string
Dagstuhl, May 2006 - Jens Stoye
4
5
t3 = A B D E D
1
2
3
4
5
t[P[i]] t3[P[i]]
A +0= A
A +1= B
B +2= D
B +2= D
C +2= E
Non-decreasing sequences
Slide 7
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents.
Number of strings over alphabet of size a for P
= Number of non-decreasing sequences over
a-d elements
n a d 1
a d 1
Dagstuhl, May 2006 - Jens Stoye
Slide 8
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents.
Number of strings composed of exactly k
distinct characters for P is
n d 1
k d 1
Dagstuhl, May 2006 - Jens Stoye
Slide 9
2. Counting Strings for Suffix Array
Number of strings over alphabet size 20 for suffix arrays
of length n with 10 R+-descents:
n
Strings composed of Strings composed
up to 20 characters of all 20 characters
5
2,002
0
10
92,378
0
15
1,307,504
0
20
10,015,005
1
25
52,451,256
2,002
30
211,915,132
92,378
35
708,930,508
1,307,504
Dagstuhl, May 2006 - Jens Stoye
Slide 10
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents
Number of order-distinct strings over
alphabet of size a is
n d 1
k d 1 k d 1
a
Number of order-distinct strings where all k
distinct characters must appear is
n d 1
k d 1
Dagstuhl, May 2006 - Jens Stoye
Slide 11
3. Counting Suffix Arrays
Definition:
Let P permutation of {1,..., n}.
Position i{1,...,n-1} is a permutation descent
if P[i] > P[i+1].
Definition:
The Eulerian number
n
d
gives the number of
permutations of {1,...,n} with exactly d
permutation descents.
Dagstuhl, May 2006 - Jens Stoye
Slide 12
3. Counting Suffix Arrays
Well-known fact:
Recursive enumeration of Eulerian numbers
a)
b)
c)
n
1,
0
n
0 for n d, and
d
n
n 1
n 1
(d 1)
(n d )
d
d
d 1
Dagstuhl, May 2006 - Jens Stoye
Slide 13
3. Counting Suffix Arrays
Definition:
Let A(n,d) be the number of permutations of
length n with d R+-descents.
Observation:
a) A(n,0) = 1
b) A(n,d) = 0 for n d
c) see next
Dagstuhl, May 2006 - Jens Stoye
Slide 14
3. Counting Suffix Arrays
Text to be indexed:
t
= A A B C B
1
2
3
4
5
At = A A A B C B
1
i
1
2
3
4
5
Pt[i] t[P[i]]
1
A ABCB
2
A BCB
5
B
3
B CB
4
CB
2
3
4
5 6
i
1
2
3
4
5
6
PAt[i] At[P[i]]
1
A AABCB
2
A ABCB
3
A BCB
6
B
4
B CB
5
CB
(d+1) possible positions without additional R+-descent
Dagstuhl, May 2006 - Jens Stoye
Slide 15
3. Counting Suffix Arrays
Text to be indexed:
t
= A A B C B
1
2
3
4
5
Bt = B A A B C B
1
i
1
2
3
4
5
Pt[i] t[P[i]]
1
A ABCB
2
A BCB
5
B
3
B CB
4
CB
2
3
4
5 6
i
1
2
3
4
5
6
PBt[i] Bt[P[i]]
2
A ABCB
3
A BCB
6
B
1
B AABCB
4
B CB
5
CB
(d+1) possible positions without additional R+-descent
Dagstuhl, May 2006 - Jens Stoye
Slide 16
3. Counting Suffix Arrays
Together:
a) A(n,0) = 1,
b) A(n,d) = 0 for n d, and
c) A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1)
Theorem:
The number A(n,d) of permutations of length n
with d R+-descents is the Eulerian number
Dagstuhl, May 2006 - Jens Stoye
n
d
.
Slide 17
3. Counting Suffix Arrays
The number of distinct suffix arrays of length
n for strings over alphabet of size a:
a 1
d 0
n
d
Lower bound for compressibility of suffix
arrays in the Kolmogorov sense:
a 1
log
d 0
Dagstuhl, May 2006 - Jens Stoye
n
d
Slide 18
3. Counting Suffix Arrays
Number of distinct suffix arrays of length n for strings
over alphabet of size 20:
n
String count (20n) Suffix array count
4
160,000
24
6
6.4 107
720
8
2.6 1010
40,320
10
1.0 1013
3.6 106
12
4.1 1015
4.8 108
14
1.6 1018
8.7 1010
16
6.6 1020
2.1 1013
18
2.6 1023
6.4 1015
Dagstuhl, May 2006 - Jens Stoye
Slide 19
3. Counting Suffix Arrays
Number of distinct suffix arrays of length n for strings
over alphabet of size 4:
n
String count (4n) Suffix array count
4
256
24
6
4,096
662
8
65,536
20,160
10
1,048,576
504,046
12
16,777,216
10,670,040
14
268,435,456
202,964,470
16
4,294,967,296
3,614,083,520
18
68,719,476,736
61,786,015,150
Dagstuhl, May 2006 - Jens Stoye
Slide 20
4. Summation Identities
Worpitzki‘s identity
by summing up the number of strings of
length n for each suffix array:
a 1
a
n
d 0
n
d
n a i
n a d 1
a d 1 i i n
Summation rule for Eulerian numbers to
generate the Stirling numbers of second kind:
n i
n k 1 n n d 1
k!
k d 0 d k d 1 i i n k
Dagstuhl, May 2006 - Jens Stoye
Slide 21
Summary
Constructive proofs to count strings sharing
the same suffix array
Constructive proof to count distinct suffix
arrays yielding lower bound for suffix array
compression
Constructive proofs for Worpitzki‘s identity
and the summation rule of Eulerian numbers
to count Stirling numbers of second kind
Dagstuhl, May 2006 - Jens Stoye
Slide 22
Outlook
Efficient enumeration algorithm for suffix
arrays
Compressed suffix arrays for fast querying in
bioinformatics applications
Average case analysis under non-uniform
model
Dagstuhl, May 2006 - Jens Stoye
Slide 23
Thank you for
your attention!
18 May 2006
Technische Fakultät
Universität Bielefeld
Germany
Klaus-Bernd Schürmann
Jens Stoye