a) t - Technische Fakultät

Download Report

Transcript a) t - Technische Fakultät

Counting Suffix Arrays and Strings
18 May 2006
Technische Fakultät
Universität Bielefeld
Germany
Klaus-Bernd Schürmann
Jens Stoye
Suffix Array Data Structure
Text to be indexed:
T C T
T C T C T
T C T C
1
4
9 10 11 12 13
2
3
5
6
7
8
$
Suffix Array – lexicographically sorted list of all suffixes:
13
12
10
5
7
2
11
9
4
6
1
8
3
Dagstuhl, May 2006 - Jens Stoye
-
$
C$
CTC$
CTCTTCTC$
CTTCTC$
CTTCTCTTCTC$
TC$
TCTC $
TCTCTTCTC$
TCTTCTC$
TCTTCTCTTCTC$
TTCTC$
TTCTCTTCTC$
Slide 2
Overview
1. Classify strings sharing same suffix array
2. Counting strings sharing same suffix array
3. Counting suffix arrays
 Lower bound suffix array compression
4. Summation identities
Dagstuhl, May 2006 - Jens Stoye
Slide 3
1. Classify Strings for Suffix Array
t - string of length n,
P - permutation of {1,..., n},
R - inverse of P.
Theorem:
P is the suffix array of t if and only if
for all i {1,...,n}
a) t[P[i]]  t[P[i+1]] and
b) t[P[i]] = t[P[i+1]]  R[P[i]+1]  R[P[i+1]+1]
same as
b) R[P[i]+1] > R[P[i+1]+1]  t[P[i]] < t[P[i+1]]
Dagstuhl, May 2006 - Jens Stoye
Slide 4
1. Classify Strings for Suffix Array
a) t[P[i]]  t[P[i+1]] and
b) R[P[i]+1] > R[P[i+1]+1]  t[P[i]] < t[P[i+1]]
Text to be indexed: t
= A A B C B
1
i
1
2
3
4
5
2
3
4
5
P[i] t[P[i]]
1
A ABCB
2
A BCB
5
B
3
B CB
4
CB
R
Dagstuhl, May 2006 - Jens Stoye
+-descent
Slide 5
1. Classify Strings for Suffix Array
Equivalences between strings
Text to be indexed: t
= A A B C B
1
2
3
4
5
t2 = A A C D C
1
i
1
2
3
4
5
2
3
4
5
t3 = A B D E D
1
2
3
4
5
P[i] t[P[i]]
t2[P[i]]
t3[P[i]]
1
A ABCB
A ACDC
A BDED
2
A BCB
A CDC
B DED
5
B
C
D
3
B CB
C DC
D ED
4
CB
DC
ED
(order-equivalent)
Dagstuhl, May 2006 - Jens Stoye
(order-distinct)
Slide 6
2. Counting Strings for Suffix Array
Text to be indexed: t
= A A B C B
1
2
3
4
5
t2 = A A C D C
1
i
1
2
3
4
5
2
3
P[i] t[P[i]] t2[P[i]]
1
A +0= A
2
A +0= A
5
B +1= C
3
B +1= C
4
C +1= D
Base string
Dagstuhl, May 2006 - Jens Stoye
4
5
t3 = A B D E D
1
2
3
4
5
t[P[i]] t3[P[i]]
A +0= A
A +1= B
B +2= D
B +2= D
C +2= E
Non-decreasing sequences
Slide 7
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents.
Number of strings over alphabet of size a for P
= Number of non-decreasing sequences over
a-d elements
 n  a  d  1


 a  d 1 
Dagstuhl, May 2006 - Jens Stoye
Slide 8
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents.
Number of strings composed of exactly k
distinct characters for P is
 n  d  1


 k  d  1
Dagstuhl, May 2006 - Jens Stoye
Slide 9
2. Counting Strings for Suffix Array
 Number of strings over alphabet size 20 for suffix arrays
of length n with 10 R+-descents:
n
Strings composed of Strings composed
up to 20 characters of all 20 characters
5
2,002
0
10
92,378
0
15
1,307,504
0
20
10,015,005
1
25
52,451,256
2,002
30
211,915,132
92,378
35
708,930,508
1,307,504
Dagstuhl, May 2006 - Jens Stoye
Slide 10
2. Counting Strings for Suffix Array
Suffix array P of length n with d R+-descents
 Number of order-distinct strings over
alphabet of size a is
 n  d  1



k  d 1  k  d  1
a
 Number of order-distinct strings where all k
distinct characters must appear is
 n  d  1


 k  d  1
Dagstuhl, May 2006 - Jens Stoye
Slide 11
3. Counting Suffix Arrays
Definition:
Let P permutation of {1,..., n}.
Position i{1,...,n-1} is a permutation descent
if P[i] > P[i+1].
Definition:
The Eulerian number
n
d
gives the number of
permutations of {1,...,n} with exactly d
permutation descents.
Dagstuhl, May 2006 - Jens Stoye
Slide 12
3. Counting Suffix Arrays
Well-known fact:
Recursive enumeration of Eulerian numbers
a)
b)
c)
n
 1,
0
n
 0 for n  d, and
d
n
n 1
n 1
 (d  1)
 (n  d )
d
d
d 1
Dagstuhl, May 2006 - Jens Stoye
Slide 13
3. Counting Suffix Arrays
Definition:
Let A(n,d) be the number of permutations of
length n with d R+-descents.
Observation:
a) A(n,0) = 1
b) A(n,d) = 0 for n  d
c) see next
Dagstuhl, May 2006 - Jens Stoye
Slide 14
3. Counting Suffix Arrays
Text to be indexed:
t
= A A B C B
1
2
3
4
5
At = A A A B C B
1
i
1
2
3
4
5
Pt[i] t[P[i]]
1
A ABCB
2
A BCB
5
B
3
B CB
4
CB
2
3
4
5 6
i
1
2
3
4
5
6
PAt[i] At[P[i]]
1
A AABCB
2
A ABCB
3
A BCB
6
B
4
B CB
5
CB
(d+1) possible positions without additional R+-descent
Dagstuhl, May 2006 - Jens Stoye
Slide 15
3. Counting Suffix Arrays
Text to be indexed:
t
= A A B C B
1
2
3
4
5
Bt = B A A B C B
1
i
1
2
3
4
5
Pt[i] t[P[i]]
1
A ABCB
2
A BCB
5
B
3
B CB
4
CB
2
3
4
5 6
i
1
2
3
4
5
6
PBt[i] Bt[P[i]]
2
A ABCB
3
A BCB
6
B
1
B AABCB
4
B CB
5
CB
(d+1) possible positions without additional R+-descent
Dagstuhl, May 2006 - Jens Stoye
Slide 16
3. Counting Suffix Arrays
Together:
a) A(n,0) = 1,
b) A(n,d) = 0 for n  d, and
c) A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1)
Theorem:
The number A(n,d) of permutations of length n
with d R+-descents is the Eulerian number
Dagstuhl, May 2006 - Jens Stoye
n
d
.
Slide 17
3. Counting Suffix Arrays
 The number of distinct suffix arrays of length
n for strings over alphabet of size a:
a 1

d 0
n
d
 Lower bound for compressibility of suffix
arrays in the Kolmogorov sense:
a 1
log

d 0
Dagstuhl, May 2006 - Jens Stoye
n
d
Slide 18
3. Counting Suffix Arrays
 Number of distinct suffix arrays of length n for strings
over alphabet of size 20:
n
String count (20n) Suffix array count
4
160,000
24
6
6.4  107
720
8
2.6  1010
40,320
10
1.0  1013
3.6  106
12
4.1  1015
4.8  108
14
1.6  1018
8.7  1010
16
6.6  1020
2.1  1013
18
2.6  1023
6.4  1015
Dagstuhl, May 2006 - Jens Stoye
Slide 19
3. Counting Suffix Arrays
 Number of distinct suffix arrays of length n for strings
over alphabet of size 4:
n
String count (4n) Suffix array count
4
256
24
6
4,096
662
8
65,536
20,160
10
1,048,576
504,046
12
16,777,216
10,670,040
14
268,435,456
202,964,470
16
4,294,967,296
3,614,083,520
18
68,719,476,736
61,786,015,150
Dagstuhl, May 2006 - Jens Stoye
Slide 20
4. Summation Identities
 Worpitzki‘s identity
by summing up the number of strings of
length n for each suffix array:
a 1
a 
n
d 0
n
d
n a  i
 n  a  d  1

  


 a  d 1  i i  n 
 Summation rule for Eulerian numbers to
generate the Stirling numbers of second kind:
n  i 
n  k 1 n  n  d  1

  


k!    
k  d 0 d  k  d  1 i i  n  k 
Dagstuhl, May 2006 - Jens Stoye
Slide 21
Summary
 Constructive proofs to count strings sharing
the same suffix array
 Constructive proof to count distinct suffix
arrays yielding lower bound for suffix array
compression
 Constructive proofs for Worpitzki‘s identity
and the summation rule of Eulerian numbers
to count Stirling numbers of second kind
Dagstuhl, May 2006 - Jens Stoye
Slide 22
Outlook
 Efficient enumeration algorithm for suffix
arrays
 Compressed suffix arrays for fast querying in
bioinformatics applications
 Average case analysis under non-uniform
model
Dagstuhl, May 2006 - Jens Stoye
Slide 23
Thank you for
your attention!
18 May 2006
Technische Fakultät
Universität Bielefeld
Germany
Klaus-Bernd Schürmann
Jens Stoye