Trie - Euler

Download Report

Transcript Trie - Euler

Tries
An N-Way tree with unusual
properties
Copyright 2004-2006 Curt Hill
The word
• Derived from the middle of retrieve
• Pronounced either like “try” or
“tree”
• Pronouncing as “try” is less
confusing, so this is what we will use
• However, before looking at a trie we
must consider a radix sort
Copyright 2004-2006 Curt Hill
Radix sort
• In the olden days we had card decks
• We typically put a sequence number
in the last 8 columns
• If the deck every got shuffled we
took it to operations and they could
resort it based on that sequence
• They used a machine called a card
sorter
• It had one input and 12 output bins
– A card had 12 rows
Copyright 2004-2006 Curt Hill
It worked like this:
• The operator set which single
column it would sort
• The deck was read in and put into
one of 12 slots based on the value in
that column
• If two cards were the same then
they hit the slot in the same order
that they were originally in
• Most sorts do not preserve the input
order in equal keys
Copyright 2004-2006 Curt Hill
How should we sort a deck?
• Sort first or sort last?
• Sort first
– Most of us would sort on the first
character and then have 10 decks
– Then sort each of those decks into 100
decks
– Then sort each of those decks into
1000 decks
– etc
Copyright 2004-2006 Curt Hill
Sort Last
• Sort last – only works because the card
sorter maintains order for equal keys
• Sort on the last digit
– Recombine the decks into one but based on
their slot order
– Next sort on next to last digit
• Recombine the decks into one but based on their
slot order
– Keep doing this until you run out of digits
Copyright 2004-2006 Curt Hill
Example
• Consider the following data:
• 434,214,123,432,124,431,223
• Three passes based on three digits
• Sort on last digit
– 431, 432, 123, 223, 434, 214,124
• Sort on middle digit
– 214, 123, 223, 124, 431, 432, 434
• Sort on first digit
– 123, 124, 214, 223, 431, 432, 434
Copyright 2004-2006 Curt Hill
Quaint?
• The importance of the radix sort at
this point is that it deals with the key
as a sequence of digits rather than a
unified whole
• A trie will do the same
• It will also combine tree searches
and subscripting
Copyright 2004-2006 Curt Hill
Subscripting as a search
• Pros
– The advantage of a vector or an array is that
subscripting is extremely quick
– The advantage of a binary or B tree is that the
key can have any form
– Hashing attempts to make a key from things
that are not a key
• Cons
– Vector/arrays only allow integer subscripts
– Trees O(log n) search are much slower than
O(C) searches of arrays
– Hash tables maul the sorted order
Copyright 2004-2006 Curt Hill
Trie Again
• The trie is an attempt to bring
subscripting back to searching
• The key concept is to think of a
string, not as a single indivisible
item, but as a sequence of
characters
– String is the most general key
• Works well for dense keys
Copyright 2004-2006 Curt Hill
The organization of the trie
• A trie is a multiway tree where no
search occurs on the keys
– Instead a subscript evaluation
• Suppose that we have a string of 5
digits for a key
– Each node will contain 10 possibilities –
one for each digit
• Therefore the trie is 10-way tree
• The root node has one subtree for
each digit Copyright 2004-2006 Curt Hill
Root and three of 10
descendents
9 8 7 6 5 4 3 2 1 0
9 8 7 6 5 4 3 2 1 0
9 8 7 6 5 4 3 2 1 0
9 8 7 6 5 4 3 2 1 0
Copyright 2004-2006 Curt Hill
Notes
• If the key is constant length
– Only the leaves have any data
– The path to leaf is the key
– The digits are not actually stored in the
nodes
– Just the pointers to subnodes
• If the key is not constant length then
each node has the data
corresponding to that key
• Subscripting and pointer
dereferencing is used
Copyright 2004-2006 Curt Hill
Searching
• We use the first digit of the search
key to find the proper subtree
• Each subtree of the root, then uses
the second letter as the basis to find
the correct subtree
• At level N then the Nth letter of the
word is used as a subscript of the
node to find the subtree
• The depth of the tree is the length of
the longest key
Copyright 2004-2006 Curt Hill
Example
• Consider a structure that contains
English words of which there are
more than a million
• Each entry points to a definition and
other stuff
– Perhaps a file ID if the whole structure
does not fit in memory
• Twenty six entries in root
• Contains words:
– a, an, and, am, any
Copyright 2004-2006 Curt Hill
Four levels of Another Trie
Level 0 has
zero length
items
Level 1 has
length one items
a b c d … x y z NULL
a b … m n … z
a b c d … x y z
am
a
a b c d … x y z
a b c d … x y z
Level 2 has
length two items
an
and
a b c d … x y z
Copyright 2004-2006 Curt Hill
any
Notes
• One pointer for each descendent
– The data for the pointers is not needed
since no comparisons are done
• One data item for the word currently
constructed
– Not the word itself (that is the key) but
the data corresponding to that key
– I will not show insertion/deletion etc.
because it follows naturally from the
structure and previous experience on
trees
Copyright 2004-2006 Curt Hill
Searching Trees
• How does this Trie compare with
other tree searches?
• Searching a binary tree is a log2
operation
– Each node examination splits the tree
into (hopefully) two nearly equal pieces
– Log2 of 1 million is about 20 (19.9)
Copyright 2004-2006 Curt Hill
BTree
• Searching a BTree is a harder to determine
operation
• Suppose that N=4
• Each node should be between 4 and 8 keys
• Then the search should be log6 operation
• Each node examination splits the tree into
approximately six nearly equal pieces
• Log6 of 1 million is 7.71
• However, in the traversing from root to leaf in
these 7 searches we also searched 7 or 8 nodes
each of which had approximately 6 items in them
• Hence we had some sort of inspection of about
50 items
Copyright 2004-2006 Curt Hill
Trie
• Searching a trie using characters is a
log26 operation
– Each node examination splits the tree into 26
pieces but be sure they are not equal,
especially if they use English letter frequency
• If a product id with sequential
numbers/letters then an equal distribution
is possible
– Log26 of 1 million is 4.24
– Hence the tree is much shallower
– The root to leaf search did four subscript
evaluations rather than four searches
Copyright 2004-2006 Curt Hill
Balance
• There is none unless the keys
accidentally cause balance
• Balance is something that can be
done when searching but not
subscripting
Copyright 2004-2006 Curt Hill
Space utilization
• Tries are preferred only for dense
keys
• What is a dense key?
• A key where adjacent keys are
relatively close to each other
• The key space has few holes in it
• English words are usually sparse
Copyright 2004-2006 Curt Hill
Example
• In a 55,000 entry dictionary
"gunfire" and "gunlock" are
adjacent
– These have the first three characters
the same but how many permutations
are between them?
– There are five letters between the f and
l
– There should be 5 * 26 * 26 * 26 pseudo
words of length 7 between them which
is greater than 87,000
Copyright 2004-2006 Curt Hill
Example Again
• This neglects other gunf and gunl words
as well as differing lengths
– Social security numbers, telephone numbers,
product numbers are much more likely to be
dense
• If a node of 26 items only uses three of
them then it is pretty wasteful of space
even if the subscripting is quick
• In the dictionary example, the second
level has a number of letter combinations
that do not exist
– Most two consonant pairs do not exist, except
as abbreviations
– bb, bc, bf Copyright 2004-2006 Curt Hill
Tree Nodes
• We seldom want to take a trie to the
bitter end unless we have a
manageable key
–
–
–
–
Key must be dense
Must be evenly distributed key
Such a key is often numeric
Combinations must occur with equal
frequency
• A hybrid tree has a mixture of
formats
Copyright 2004-2006 Curt Hill
First Example
• License plate numbers in ND have three
letters and three digits
– A binary tree would need a height of 24
– A BTree of N = 4 would still need 9 levels
• For a trie the top three levels would have
letters and the bottom three digits
–
–
–
–
Would only be six levels deep
The quickest possible search
Two distinct forms of nodes
One form of hybrid
Copyright 2004-2006 Curt Hill
Hybrids again
• The more common practice is to use
the trie for a small number of levels
and then switch to another data
structure
• The top structure is usually a trie for
its high fanout
• The bottom structure is usually a
binary tree for memory structures
and a BTree for disk structures, but
other things may also be used
Copyright 2004-2006 Curt Hill
Demonstration Notes
• What follows is a trie program with
some unusual features
– Consider these before observing code
Copyright 2004-2006 Curt Hill
Preprocessor commands
• This code has preprocessor conditional
compilations in it
• It tailors the Trie to either accept a key that is
only digits or only uppercase letters
• There is either a definition of TRIE_LETTERS or
not
– The value of this is not important, not even given, but
the question is it defined or not
• Since the preprocessor is finished before the
compiler starts we can generate the compiler
input
• In this case we end up with two different versions
• Great amounts of similarity but still different
• Notice it also extends to the main C++ file
Copyright 2004-2006 Curt Hill
Key density
• What happens when a word like
"AARDVARK" is inserted into an
empty trie?
– This is the problem of a sparse key
• What would happen to binary or B
tree that had the same situation?
Copyright 2004-2006 Curt Hill
Subscripting into the node
• In both find and insert the search is trivial
• Extract the letter, adjust by the beginning
of the alphabet and use as a subscript
• This is simple because we only allow a key
to be a string of uppercase letters
• What would we do if we needed to allow
any characters allowed in a word?
– Such as hyphen, space, apostrophe or digits
– Not all characters are allowed, just some
Copyright 2004-2006 Curt Hill
The what if
• This complicates the lookup and slows the
simple lookup
• The procedure would be something like
this:
– If the character is a letter, adjust as always
– If the character is a digit, adjust by
subtracting the zero and adding 26
– Else do a case on the character and merely
assign the subscript directly
• The more characters the more costly and
the less simple this search will be
• This lack of simplicity will hinder the
speed of the trie and make it less
desirable, based on the probability of the
characters
Copyright 2004-2006 Curt Hill
Iterator
• Notice the handshaking that has to go on
between the Trie and Trie_iterator class
– Both have to tell the other what is going on
• This could be a recursive routine, but
uses a stack and loop instead
• Notice the word of a Trie node needs to
be displayed first
• Every leaf contains an array of NULL
pointers and the pointer to the data item
Copyright 2004-2006 Curt Hill