Ukkonen`s Algorithm

Download Report

Transcript Ukkonen`s Algorithm

Suffix Trees
Construction and Applications
João Carreira
2008
Outline
Why Suffix Trees?
 Definition
 Ukkonen's Algorithm (construction)
 Applications

Why Suffix Trees?
Why Suffix Trees?

Asymptotically fast.
Why Suffix Trees?
Asymptotically fast.
 The basis of state of the art data structures.

Why Suffix Trees?
Asymptotically fast.
 The basis of state of the art data structures.
 You don't need a Phd to use them.

Why Suffix Trees?
Asymptotically fast.
 The basis of state of the art data structures.
 You don't need a Phd to use them.
 Challenging.

Why Suffix Trees?
Asymptotically fast.
 The basis of state of the art data structures.
 You don't need a Phd to use them.
 Challenging.
 Expose interesting algorithmic ideas.

Definition
Suffix Tree for an m-character string:

m leaves numbered 1 to m
Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
 edge-label vs node-label

Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
 edge-label vs node-label
 each internal node has at least two children

Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
 edge-label vs node-label
 each internal node has at least two children
 the label of the leaf j is S[ j..m ]

Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
 edge-label vs node-label
 each internal node has at least two children
 the label of the leaf j is S[ j..m ]
 no two edges out of the same node can have edge-labels
beginning with the same character

Definition Example
String: xabxac
Length (m): 6 characters
Number of Leaves: 6
Node 5 label: ac
Implicit vs Explicit

What if we have “axabx” ?
Ukkonen's Algorithm
suffix tree construction
Ukkonen's Algorithm
suffix tree construction
Text: S[ 1..m ]
 m phases
 phase j is divided into j extensions:

In extension j of phase i + 1:
 find the end of the path from the root labeled with substring S[ j..i ]
 extend the substring by adding the character S(i + 1) to its end
Extension Rules

Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.
Extension Rules

Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path
continues from the end of β.
Extension Rules

Rule 3: Some path from the end of β starts with S(i + 1), so we do nothing.
Ukkonen's Algorithm
suffix tree construction
Complexity:
Ukkonen's Algorithm
suffix tree construction
Complexity:

m phases
Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
 phase j -> j extensions

Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
 phase j -> j extensions
 find the end of the path of substring β:
O(|β|) = O(m)

Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
 phase j -> j extensions
 find the end of the path of substring β:
O(|β|) = O(m)


each extension: O(1)
Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
 phase j -> j extensions
 find the end of the path of substring β:
O(|β|) = O(m)


each extension: O(1)
3
O(m )
“First make it run, then make it run fast.”
Brian Kernighan
Suffix Links
Definition:
For an internal node v with path-label xα, if there is another node s(v), with
path-label α, then a pointer from v to s(v) is called a suffix link.

Suffix Links
Lemma:
If a new internal node v with path label xα is added to the current tree in extension
j of some phase, then either the path labeled α already ends at an internal node
or an internal at the end of the string α will be created in the next extension
of the same phase.

If Rule 2 applies:
Suffix Links
Lemma:
If a new internal node v with path label xα is added to the current tree in extension
j of some phase, then either the path labeled α already ends at an internal node
or an internal at the end of the string α will be created in the next extension
of the same phase.

If Rule 2 applies:

S[ j..i ] continues with c ≠ S(i + 1)
Suffix Links
Lemma:
If a new internal node v with path label xα is added to the current tree in extension
j of some phase, then either the path labeled α already ends at an internal node
or an internal at the end of the string α will be created in the next extension
of the same phase.

If Rule 2 applies:
S[ j..i ] continues with c ≠ S(i + 1)
 S[ j + 1..i ] continues with c.

Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j – 1..i ].
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j – 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following the path for string λ.
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j – 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i+1) is in the tree.
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j – 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i+1) is in the tree.
4. If a new internal w was created in extension j – 1 (by rule 2), then string α must
end at node s(w), the end node for the suffix link from w. Create the suffix link
(w, s(w)) from w to s(w).
Node Depth
The node-depth of v is at most one greater than the node depth of s(v).
xß
xß
ß
xα
xλ
xα
α
λ
equal node-depth: 3
xλ
Node depth: 4
ß
α
λ
Node depth: 3
Skip/count Trick
γ number of characters in an edge
 “Directly implemented” edge traversal: O(|γ|)

Skip/count Trick
γ number of characters in an edge
 “Directly implemented” edge traversal: O(|γ|)


“Jump” from node to node.

K = number of nodes in a path

Time to traverse a path: O(|K|)
Ukkonen's Algorithm
Using the skip/count trick:
 any phase of Ukkonen's algorithm takes O(m) time.
Proof:
Ukkonen's Algorithm
Using the skip/count trick:
 any phase of Ukkonen's algorithm takes O(m) time.
Proof:

There are i + 1 ≤ m extensions in phase i + 1
Ukkonen's Algorithm
Using the skip/count trick:
 any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
 In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.

Ukkonen's Algorithm
Using the skip/count trick:
 any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
 In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
 The up-walk decreases the current node-depth by at most one.

Ukkonen's Algorithm
Using the skip/count trick:
 any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
 In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
 The up-walk decreases the current node-depth by at most one.
 Each suffix link traversal decreases the node-depth by at most another one.

Ukkonen's Algorithm
Using the skip/count trick:
 any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
 In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
 The up-walk decreases the current node-depth by at most one.
 Each suffix link traversal decreases the node-depth by at most another one.
 Each down-walk moves to a node of greater depth.

Ukkonen's Algorithm
Using the skip/count trick:
 any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
 In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
 The up-walk decreases the current node-depth by at most one.
 Each suffix link traversal decreases the node-depth by at most another one.
 Each down-walk moves to a node of greater depth.
 Over the entire phase the node-depth is decremented at most 2m times.

Ukkonen's Algorithm
Using the skip/count trick:
 any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
 In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
 The up-walk decreases the current node-depth by at most one.
 Each suffix link traversal decreases the node-depth by at most another one.
 Each down-walk moves to a node of greater depth.
 Over the entire phase the node-depth is decremented at most 2m times.
 No node can have depth greater than m, so the total increment to current node-depth
(down walks) is bounded by 3m over the entire phase.

Ukkonen's Algorithm
m phases
 1 phase: O(m)

Ukkonen's Algorithm
m phases
 1 phase: O(m)

2
O(m )
“First make it run fast, then make it run faster.”
João Carreira
Edge-Label Compression

A string with m characters has m suffixes.

If edge labels are represented with characters, O(m2) space is needed.
Edge-Label Compression

A string with m characters has m suffixes.

If edge labels are represented with characters, O(m2) space is needed.
To achieve O(m) space, each edge-label:
(p, q)
Two more tricks...
Rule 3 is a show stopper
If rule 3 applies in extension j, it will also apply in all further
extensions until the end of the phase.
Why?
Rule 3 is a show stopper
If rule 3 applies in extension j, it will also apply in all further
extensions until the end of the phase.
Why?

When rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and
so the path labeled S[ j + 1..i ] does also, and rule 3 again applies in extensions j+1...i+1.
Rule 3 is a show stopper

End any phase i +1 the first time rule 3 applies.

The remaining extensions are said to be done implicitly.
Once a leaf always a leaf
Leaf created => always a leaf in all successive trees.
 No mechanism for extending a leaf edge beyond its current leaf.

Once there is a leaf labeled j, extension rule 1 will always apply to extension j
in any sucessive phase.

Once a leaf always a leaf
Leaf created => always a leaf in all successive trees.
 No mechanism for extending a leaf edge beyond its current leaf.

Once there is a leaf labeled j, extension rule 1 will always apply to extension j
in any sucessive phase.

Leaf Edge Label:
(p, e)
Single Phase Algorithm
In each phase i:
Single Phase Algorithm
During construction:
Implicit to Explicit
One last phase to add character $:
O(m)
Suffix Trees are a Swiss Knife
Applications
Exact String Matching:
Applications
Exact String Matching:
Preprocessing: O(m)
Search: O(n + k)
Three ocurrences of string aw.
Applications
And much more..
Longest common substring
O(n)
 Longest repeated substring
O(n)
 Longest palindrome
O(n)
 Most frequently occurring substrings of a minimum length
O(n)
 Shortest substrings occurring only once
O(n)
 Lempel-Ziv decomposition
O(n)

“Biology easily has 500 years of exciting problems to work on.”
Donald Knuth
web.ist.utl.pt/joao.carreira
Questions?