Ukkonen`s Algorithm
Download
Report
Transcript Ukkonen`s Algorithm
Suffix Trees
Construction and Applications
João Carreira
2008
Outline
Why Suffix Trees?
Definition
Ukkonen's Algorithm (construction)
Applications
Why Suffix Trees?
Why Suffix Trees?
Asymptotically fast.
Why Suffix Trees?
Asymptotically fast.
The basis of state of the art data structures.
Why Suffix Trees?
Asymptotically fast.
The basis of state of the art data structures.
You don't need a Phd to use them.
Why Suffix Trees?
Asymptotically fast.
The basis of state of the art data structures.
You don't need a Phd to use them.
Challenging.
Why Suffix Trees?
Asymptotically fast.
The basis of state of the art data structures.
You don't need a Phd to use them.
Challenging.
Expose interesting algorithmic ideas.
Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
edge-label vs node-label
Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
edge-label vs node-label
each internal node has at least two children
Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
edge-label vs node-label
each internal node has at least two children
the label of the leaf j is S[ j..m ]
Definition
Suffix Tree for an m-character string:
m leaves numbered 1 to m
edge-label vs node-label
each internal node has at least two children
the label of the leaf j is S[ j..m ]
no two edges out of the same node can have edge-labels
beginning with the same character
Definition Example
String: xabxac
Length (m): 6 characters
Number of Leaves: 6
Node 5 label: ac
Implicit vs Explicit
What if we have “axabx” ?
Ukkonen's Algorithm
suffix tree construction
Ukkonen's Algorithm
suffix tree construction
Text: S[ 1..m ]
m phases
phase j is divided into j extensions:
In extension j of phase i + 1:
find the end of the path from the root labeled with substring S[ j..i ]
extend the substring by adding the character S(i + 1) to its end
Extension Rules
Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.
Extension Rules
Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path
continues from the end of β.
Extension Rules
Rule 3: Some path from the end of β starts with S(i + 1), so we do nothing.
Ukkonen's Algorithm
suffix tree construction
Complexity:
Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
phase j -> j extensions
Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
phase j -> j extensions
find the end of the path of substring β:
O(|β|) = O(m)
Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
phase j -> j extensions
find the end of the path of substring β:
O(|β|) = O(m)
each extension: O(1)
Ukkonen's Algorithm
suffix tree construction
Complexity:
m phases
phase j -> j extensions
find the end of the path of substring β:
O(|β|) = O(m)
each extension: O(1)
3
O(m )
“First make it run, then make it run fast.”
Brian Kernighan
Suffix Links
Definition:
For an internal node v with path-label xα, if there is another node s(v), with
path-label α, then a pointer from v to s(v) is called a suffix link.
Suffix Links
Lemma:
If a new internal node v with path label xα is added to the current tree in extension
j of some phase, then either the path labeled α already ends at an internal node
or an internal at the end of the string α will be created in the next extension
of the same phase.
If Rule 2 applies:
Suffix Links
Lemma:
If a new internal node v with path label xα is added to the current tree in extension
j of some phase, then either the path labeled α already ends at an internal node
or an internal at the end of the string α will be created in the next extension
of the same phase.
If Rule 2 applies:
S[ j..i ] continues with c ≠ S(i + 1)
Suffix Links
Lemma:
If a new internal node v with path label xα is added to the current tree in extension
j of some phase, then either the path labeled α already ends at an internal node
or an internal at the end of the string α will be created in the next extension
of the same phase.
If Rule 2 applies:
S[ j..i ] continues with c ≠ S(i + 1)
S[ j + 1..i ] continues with c.
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j – 1..i ].
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j – 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following the path for string λ.
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j – 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i+1) is in the tree.
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link
from it or is the root. Let λ denote the string between v and the end of S[ j – 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i+1) is in the tree.
4. If a new internal w was created in extension j – 1 (by rule 2), then string α must
end at node s(w), the end node for the suffix link from w. Create the suffix link
(w, s(w)) from w to s(w).
Node Depth
The node-depth of v is at most one greater than the node depth of s(v).
xß
xß
ß
xα
xλ
xα
α
λ
equal node-depth: 3
xλ
Node depth: 4
ß
α
λ
Node depth: 3
Skip/count Trick
γ number of characters in an edge
“Directly implemented” edge traversal: O(|γ|)
Skip/count Trick
γ number of characters in an edge
“Directly implemented” edge traversal: O(|γ|)
“Jump” from node to node.
K = number of nodes in a path
Time to traverse a path: O(|K|)
Ukkonen's Algorithm
Using the skip/count trick:
any phase of Ukkonen's algorithm takes O(m) time.
Proof:
Ukkonen's Algorithm
Using the skip/count trick:
any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
Ukkonen's Algorithm
Using the skip/count trick:
any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
Ukkonen's Algorithm
Using the skip/count trick:
any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
The up-walk decreases the current node-depth by at most one.
Ukkonen's Algorithm
Using the skip/count trick:
any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
The up-walk decreases the current node-depth by at most one.
Each suffix link traversal decreases the node-depth by at most another one.
Ukkonen's Algorithm
Using the skip/count trick:
any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
The up-walk decreases the current node-depth by at most one.
Each suffix link traversal decreases the node-depth by at most another one.
Each down-walk moves to a node of greater depth.
Ukkonen's Algorithm
Using the skip/count trick:
any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
The up-walk decreases the current node-depth by at most one.
Each suffix link traversal decreases the node-depth by at most another one.
Each down-walk moves to a node of greater depth.
Over the entire phase the node-depth is decremented at most 2m times.
Ukkonen's Algorithm
Using the skip/count trick:
any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1
In a single extension, the algorithm walks up at most one edge, traverses one suffix link,
walks down some number of nodes, applies the extension rules and may add a suffix link.
The up-walk decreases the current node-depth by at most one.
Each suffix link traversal decreases the node-depth by at most another one.
Each down-walk moves to a node of greater depth.
Over the entire phase the node-depth is decremented at most 2m times.
No node can have depth greater than m, so the total increment to current node-depth
(down walks) is bounded by 3m over the entire phase.
Ukkonen's Algorithm
m phases
1 phase: O(m)
Ukkonen's Algorithm
m phases
1 phase: O(m)
2
O(m )
“First make it run fast, then make it run faster.”
João Carreira
Edge-Label Compression
A string with m characters has m suffixes.
If edge labels are represented with characters, O(m2) space is needed.
Edge-Label Compression
A string with m characters has m suffixes.
If edge labels are represented with characters, O(m2) space is needed.
To achieve O(m) space, each edge-label:
(p, q)
Two more tricks...
Rule 3 is a show stopper
If rule 3 applies in extension j, it will also apply in all further
extensions until the end of the phase.
Why?
Rule 3 is a show stopper
If rule 3 applies in extension j, it will also apply in all further
extensions until the end of the phase.
Why?
When rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and
so the path labeled S[ j + 1..i ] does also, and rule 3 again applies in extensions j+1...i+1.
Rule 3 is a show stopper
End any phase i +1 the first time rule 3 applies.
The remaining extensions are said to be done implicitly.
Once a leaf always a leaf
Leaf created => always a leaf in all successive trees.
No mechanism for extending a leaf edge beyond its current leaf.
Once there is a leaf labeled j, extension rule 1 will always apply to extension j
in any sucessive phase.
Once a leaf always a leaf
Leaf created => always a leaf in all successive trees.
No mechanism for extending a leaf edge beyond its current leaf.
Once there is a leaf labeled j, extension rule 1 will always apply to extension j
in any sucessive phase.
Leaf Edge Label:
(p, e)
Single Phase Algorithm
In each phase i:
Single Phase Algorithm
During construction:
Implicit to Explicit
One last phase to add character $:
O(m)
Suffix Trees are a Swiss Knife
Applications
Exact String Matching:
Applications
Exact String Matching:
Preprocessing: O(m)
Search: O(n + k)
Three ocurrences of string aw.
Applications
And much more..
Longest common substring
O(n)
Longest repeated substring
O(n)
Longest palindrome
O(n)
Most frequently occurring substrings of a minimum length
O(n)
Shortest substrings occurring only once
O(n)
Lempel-Ziv decomposition
O(n)
“Biology easily has 500 years of exciting problems to work on.”
Donald Knuth
web.ist.utl.pt/joao.carreira
Questions?