Tonga Institute of Higher Education
Design and Analysis of Algorithms
IT 254
Lecture 4:
Data Structures
Data Structures
• Data structures are important for many reasons
in computer science. The way data is stored can
determine how fast you find it and what you can
do with it
• The data structures we will focus on are hash
tables, binary search trees and Red-Black trees
• These data structures have a special name:
– Dynamic Sets
• They are all able to do the following operations:
– Search(S, k), FindMin(S), FindMax(S), Insert(S,x),
Remove(S,x)
Binary Search Trees
• Binary Search Trees (BSTs) are an
important data structure in dynamic sets
• Each piece of data in the tree will contain:
– key: an identifying field that allows ordering
– left: pointer to a left child (may be NULL)
– right: pointer to a right child (may be NULL)
– p: pointer to a parent node (NULL for root)
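As a concrete sketch, the node described above could be declared like this in C (the names are illustrative, not from the lecture):

    /* One BST node, with the four fields described above. */
    struct Node {
        int key;              /* identifying field that allows ordering */
        struct Node *left;    /* left child (may be NULL)               */
        struct Node *right;   /* right child (may be NULL)              */
        struct Node *parent;  /* parent node (NULL for the root)        */
    };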
Binary Search Trees
• BST property:
key[left(x)] ≤ key[x] ≤ key[right(x)]
• Example:
(Diagram: an example BST containing the keys 15, 11, 9, 16, 13, 17.)
Inorder TreeWalk
• What does the following code do?
TreeWalk(x)
    if (x != NULL)
        TreeWalk(left[x]);
        print(x);
        TreeWalk(right[x]);
• Answer: prints elements in sorted (increasing)
order
• This is called an inorder tree walk
– Preorder tree walk: print root, then left, then right
– Postorder tree walk: print left, then right, then
root
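A minimal C sketch of the walks, reusing the Node struct from the earlier sketch:

    #include <stdio.h>

    struct Node { int key; struct Node *left, *right, *parent; };

    /* Inorder: left subtree, node, right subtree -> sorted order. */
    void InorderWalk(struct Node *x) {
        if (x == NULL) return;
        InorderWalk(x->left);
        printf("%d ", x->key);
        InorderWalk(x->right);
    }

    /* Preorder prints the node first; postorder would print it last. */
    void PreorderWalk(struct Node *x) {
        if (x == NULL) return;
        printf("%d ", x->key);
        PreorderWalk(x->left);
        PreorderWalk(x->right);
    }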
Searching a BST
• We can search for an item in a tree by using a
key and a pointer to a node
• The search will return an element with that key
or NULL if not found:
TreeSearch(x, k)
    if (x == NULL or k == key[x])
        return x;
    if (k < key[x])
        return TreeSearch(left[x], k);
    else
        return TreeSearch(right[x], k);
More BST Searching
• Here’s another function that does the same:
TreeSearch(x, k)
    while (x != NULL and k != key[x])
        if (k < key[x])
            x = left[x];
        else
            x = right[x];
    return x;
• Which of these two functions is more efficient?
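Both versions do the same key comparisons, so they are equally good asymptotically; the iterative one simply avoids the overhead of recursive calls. A hedged C sketch of the iterative search, again assuming the Node struct from above:

    #include <stddef.h>

    struct Node { int key; struct Node *left, *right, *parent; };

    /* Walk down from x, going left or right by the BST property;
     * returns the matching node, or NULL if k is not in the tree. */
    struct Node *TreeSearch(struct Node *x, int k) {
        while (x != NULL && k != x->key)
            x = (k < x->key) ? x->left : x->right;
        return x;
    }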
BST Insert()
• To insert an element into the tree, we just need
to make sure that after inserting the tree still
holds the BST property
• How can we insert and maintain the property
– Use the search algorithm from Searching
– Because we will not find the item, we will get a NULL
value. Insert the item where the NULL is
– Use a “trailing pointer” to keep track of where you
came from
BST Insert
• Like searching, TreeInsert
begins at the root and
goes downward.
• The pointer x traces the
path and pointer y is the
parent of x
• Lines 3-9 move x and y
down the tree, using the
BST property, until x is
NULL
• The NULL is the place we
want to put the item "z"
• Lines 10-16 set the
pointer for item "z"
 0: TreeInsert(Tree, z)
 1:     y = NULL
 2:     x = root[Tree]
 3:     while (x != NULL) {
 4:         y = x
 5:         if z->key < x->key
 6:             x = x->left
 7:         else
 8:             x = x->right
 9:     }
10: z->parent = y
11: if y == NULL
12:     root[Tree] = z
13: else if z->key < y->key
14:     y->left = z
15: else
16:     y->right = z
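The same routine as a hedged C sketch; passing the root by address and clearing z's child pointers are details assumed here, not shown on the slide:

    #include <stddef.h>

    struct Node { int key; struct Node *left, *right, *parent; };

    /* Insert node z, keeping the BST property. */
    void TreeInsert(struct Node **root, struct Node *z) {
        struct Node *y = NULL;       /* trailing pointer: parent of x */
        struct Node *x = *root;
        while (x != NULL) {          /* walk down to a NULL slot      */
            y = x;
            x = (z->key < x->key) ? x->left : x->right;
        }
        z->parent = y;
        z->left = z->right = NULL;
        if (y == NULL)
            *root = z;               /* tree was empty                */
        else if (z->key < y->key)
            y->left = z;
        else
            y->right = z;
    }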
Running Time of Search/Insert
• What is the running time of TreeSearch() or
TreeInsert()?
– Answer: O(h), where h = height of tree
• What is the height of a binary search tree?
– Answer: worst case: h = O(n) when tree is just a
linear string of left or right children
– Most of the time, it will be roughly 2^h = # of items,
so if there are 64 items, 2^h = 64 and h = 6
• For now, we'll just say running times in terms of "h"
• Later we’ll see how to make sure h is always O(lg n)
Sorting with BSTs
• An algorithm for sorting an array A of length n:
BSTSort(A)
    for i=1 to n
        TreeInsert(A[i]);
    InorderTreeWalk(root);
• What is the running time of this?
– Worst case?
– Average case?
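Putting the earlier sketches together, a complete runnable C version of BSTSort on the input 3 1 8 2 6 7 5 used on the next slide (error handling omitted):

    #include <stdio.h>
    #include <stdlib.h>

    struct Node { int key; struct Node *left, *right, *parent; };

    void TreeInsert(struct Node **root, struct Node *z) {
        struct Node *y = NULL, *x = *root;
        while (x != NULL) { y = x; x = (z->key < x->key) ? x->left : x->right; }
        z->parent = y; z->left = z->right = NULL;
        if (y == NULL) *root = z;
        else if (z->key < y->key) y->left = z;
        else y->right = z;
    }

    void InorderWalk(struct Node *x) {
        if (x == NULL) return;
        InorderWalk(x->left); printf("%d ", x->key); InorderWalk(x->right);
    }

    int main(void) {
        int a[] = {3, 1, 8, 2, 6, 7, 5};
        struct Node *root = NULL;
        for (int i = 0; i < 7; i++) {
            struct Node *z = malloc(sizeof *z);
            z->key = a[i];
            TreeInsert(&root, z);   /* n inserts: O(n * h) total */
        }
        InorderWalk(root);          /* prints 1 2 3 5 6 7 8      */
        printf("\n");
        return 0;
    }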
Sorting with BSTs
• Average case analysis
– It’s a form of quicksort!
for i=1 to n
    TreeInsert(A[i]);
InorderTreeWalk(root);
(Diagram: inserting 3 1 8 2 6 7 5: the root 3 partitions the
remaining keys into {1, 2} and {8, 6, 7, 5}, exactly as a
quicksort pivot would, and each subtree is partitioned the
same way.)
Sorting BSTs
• The same partitions are done as with
quicksort, but in a different order
– In the previous example
• Everything was compared to 3 once
• Then those items < 3 were compared to 1 once
• Etc.
– This is basically the same type of comparisons
that are done in quicksort, just in a different
order.
– Since it does the same comparisons as
quicksort, BST sort is also O(n lg n) on average.
Deletion in BST
• Deletion is a bit tricky
• 3 cases:
– x has no children:
• Remove x
– x has one child:
• Take out x, move
child into x’s spot
– x has two children:
• Swap x with successor
• Perform case 1 or 2 to delete it
(Diagram: an example tree on the keys F, B, H, A, D, C, K;
exercise: delete K, H or B.)
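One standard way to code the three cases is with a small Transplant helper, as in CLRS; this is a sketch of that formulation, not necessarily the lecture's exact method:

    #include <stddef.h>

    struct Node { int key; struct Node *left, *right, *parent; };

    /* Smallest key in x's subtree: follow left pointers. */
    struct Node *TreeMinimum(struct Node *x) {
        while (x->left != NULL) x = x->left;
        return x;
    }

    /* Replace the subtree rooted at u with the one rooted at v. */
    void Transplant(struct Node **root, struct Node *u, struct Node *v) {
        if (u->parent == NULL)          *root = v;
        else if (u == u->parent->left)  u->parent->left = v;
        else                            u->parent->right = v;
        if (v != NULL) v->parent = u->parent;
    }

    /* Cases 1 and 2 are a single Transplant; case 3 swaps in
     * x's successor (the minimum of the right subtree).       */
    void TreeDelete(struct Node **root, struct Node *x) {
        if (x->left == NULL)        Transplant(root, x, x->right);
        else if (x->right == NULL)  Transplant(root, x, x->left);
        else {
            struct Node *s = TreeMinimum(x->right);   /* successor */
            if (s->parent != x) {
                Transplant(root, s, s->right);
                s->right = x->right;  s->right->parent = s;
            }
            Transplant(root, x, s);
            s->left = x->left;  s->left->parent = s;
        }
    }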
Red-Black Trees
• Red-Black trees
– A special form of the binary tree.
– The difference is the RB Tree has a property
that says the tree will always be "balanced"
– Balanced means that all operations will
happen in O(lg n) time in the worst case
– Nodes in the tree get different colors
– Using the colors to help organize the tree, we
can make sure the height of the tree is always
O(lg n)
Red-Black Trees
• Our goals for Red-Black Trees
– First: describe the properties of red-black
trees
– Then: prove that the properties guarantee
that h = O(lg n)
– Finally: describe operations that are done with
red-black trees
Red Black Properties
1. Every node is either red or black
2. Every leaf (NULL pointer) is black
• Note: this means every “real” node has 2 children
3. If a node is red, both children are black
• Note: so there can’t be 2 consecutive red nodes on a path
4. Every path from node to descendent leaf
contains the same number of black nodes
5. The root is always black
The black-height: # black nodes on path to leaf
Now every node has 5 items:
(color, key, left, right and parent)
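The properties can be checked mechanically. A hedged C sketch of a checker for properties 3 and 4 (the RBNode struct and the convention that a node's own color counts toward its black-height are choices of this sketch):

    #include <stddef.h>

    enum Color { RED, BLACK };
    struct RBNode { enum Color color; int key;
                    struct RBNode *left, *right, *parent; };

    /* Returns the black-height of x's subtree, or -1 if property 3
     * (no red node with a red child) or property 4 (equal black
     * counts on every path) is violated.  NULL is a black leaf.  */
    int CheckRB(struct RBNode *x) {
        if (x == NULL) return 0;
        if (x->color == RED &&
            ((x->left  && x->left->color  == RED) ||
             (x->right && x->right->color == RED)))
            return -1;
        int lh = CheckRB(x->left), rh = CheckRB(x->right);
        if (lh < 0 || rh < 0 || lh != rh) return -1;
        return lh + (x->color == BLACK ? 1 : 0);
    }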
Height of Red-Black trees
• Theorem: A red-black tree with "n" nodes
has height that is at most 2 lg(n + 1)
• This is a good thing to know. It says that
our operations will never have to go
farther down the tree than 2 lg(n + 1)
levels, which is O(lg n).
• We will try to prove this is true by using
induction.
Proving Red-Black Height
• First we will show that the subtree below
any node "x" contains at least 2^bh(x) - 1
nodes.
• To prove this we will use induction:
– Base Case: If the height of x is 0, then x is a
leaf, so bh(x) = 0, and the count 2^0 - 1 = 0
is correct (there are no nodes below a node
of height 0)
Proving Red-Black Height
• Inductive Case:
– Consider a node x that has:
• A positive height
• 2 children
– Each child has a black-height of either:
• bh(x) (if the child is red) or bh(x) - 1 (if it is black)
– The height of a child of x must be less than the height
of x itself, so we can apply the induction hypothesis:
• Each child's subtree has at least 2^(bh(x)-1) - 1 nodes
• So both children plus "x" give at least
– (2^(bh(x)-1) - 1) + (2^(bh(x)-1) - 1) + 1 = 2^bh(x) - 1
• This proves the claim about how many nodes there are
Proving Red-Black Height
• Now we are ready to prove that the height is at
most 2 lg (n+1)
– Property 3 says at least half the nodes on any path
from the root to a leaf must be black. So the
black-height of the root must be at least h/2
– So we can now say: n ≥ 2^(h/2) - 1 (because we plug
in h/2 for bh(x) in the claim from the last slide)
– Rearranging:
• n + 1 ≥ 2^(h/2)
• lg(n + 1) ≥ h/2
• h ≤ 2 lg(n + 1)
– This is what we wanted to prove: the height is
always O(lg n).
– All our operations will now be guaranteed to be O(lg
n) as well (like search, insert, delete…)
Red-Black Trees Example
• Example Red Black tree
• Is it colored correctly?
(Diagram: an example tree with keys 7, 5, 9, 12; the five
red-black properties are repeated on the slide for reference.)
Inserting in Red-Black Trees
• Insert 8
– Where does it go?
– What color is it?
(Diagram: the tree 7, 5, 9, 12; the red-black properties are
repeated for reference.)
Inserting in Red Black
• Insert 8
(Diagram: 8 inserted as the left child of 9 in the tree
7, 5, 9, 12.)
Insert in Red Black
• How about inserting 11?
• What color?
– It can’t be black! (that would add an extra black
node to just one path, breaking property 4)
– It can’t be red! (its parent is red, breaking
property 3)
(Diagram: 11 inserted below 12 in the tree 7, 5, 9, 8, 12;
the red-black properties are repeated for reference.)
Insert in Red-Black
• Insert 11
– Where does it go?
– What color?
– Solution:
Change the colors in the tree
(Diagram: the recolored tree 7, 5, 9, 8, 12, 11; the
red-black properties are repeated for reference.)
Inserting Problems
• Insert 10
– Where does it go?
– What color?
(Diagram: 10 inserted below 11 in the tree 7, 5, 9, 8, 12, 11;
the red-black properties are repeated for reference.)
Insertion Problems
• Insert 10
– Where does it go?
– What color?
• Answer: no color!
• Tree is too imbalanced
• Must change tree structure
to allow re-coloring
– Goal: to change the tree in
O(lg n) time so that all properties hold
(Diagram: the tree 7, 5, 9, 8, 12, 11, 10, which no
recoloring alone can fix.)
Red Black Rotation
• Our basic operation for changing the tree
is called rotation:
(Diagram: rightRotate(y) turns a tree with root y, left child x
and subtrees A, B, C into one with root x and right child y;
leftRotate(x) is the inverse.)
• Trees are rotated by changing the pointers
in the tree
Rotation
• What does rightRotate() do?
• Answer: It changes a lot of pointers
– x keeps its left child
– y keeps its right child
– x’s right child becomes y’s left child
– x’s and y’s parents change
• What is the running time: O(1).
(Diagram: the same rightRotate(y) picture as on the previous
slide.)
Rotation Example
• Goal: To rotate left about 9:
(Diagram: the tree 7, 5, 9, 8, 12, 11 before the rotation.)
Rotation Example
• Rotate left around 9
(Diagram: after rotating left around 9, 12 becomes 7's right
child with 9 as its left child; 9 keeps 8 on its left and
adopts 11 on its right.)
Rotation Code
Left Rotate Example
(Diagram: leftRotate(x) moves y up into x's position; subtree B
switches from y's left side to x's right side.)
LeftRotate(Tree, x)
    y = x->right                 // we only know x at start, so get y
    x->right = y->left           // start the rotation: B becomes x's right child
    if y->left != NULL           // if B is not null ...
        y->left->parent = x      // ... make x the parent of B
    y->parent = x->parent        // y takes over x's parent
    if x->parent == NULL         // if x was the root node ...
        root[Tree] = y           // ... y becomes the new root
    else if x == x->parent->left
        x->parent->left = y      // if x was a left child, y takes its place
    else
        x->parent->right = y     // otherwise y becomes the right child
    y->left = x                  // now x is the child of y
    x->parent = y                // now x's parent is y
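RightRotate is the exact mirror image (every "left" and "right" swapped). A C sketch, where the Node struct and the root-passed-by-address style are assumptions of the sketch:

    #include <stddef.h>

    struct Node { int key; struct Node *left, *right, *parent; };

    void RightRotate(struct Node **root, struct Node *y) {
        struct Node *x = y->left;
        y->left = x->right;            /* subtree B moves from x to y */
        if (x->right != NULL)
            x->right->parent = y;
        x->parent = y->parent;         /* x takes over y's position   */
        if (y->parent == NULL)
            *root = x;
        else if (y == y->parent->left)
            y->parent->left = x;
        else
            y->parent->right = x;
        x->right = y;                  /* y becomes x's right child   */
        y->parent = x;
    }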
Red Black Insertion
• The idea of Insertion is to:
– Insert x into the tree as usual, then color x red
– The only red-black property that might be violated is
property 3: if x’s parent is also red, we have two red
nodes in a row
– If so, move the violation up the tree (recoloring and
rotating) until a place is found where it can be fixed,
as the sketch below shows
– Total time will be O(lg n)
– Actually writing the code is complicated because there
are many cases to be careful of, e.g. when x’s uncle is
a NULL leaf (which counts as black)
– Deletion is also O(lg n), and is also quite complicated
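To make "move the violation up the tree" concrete, here is a sketch of the standard fixup loop in the CLRS style. It assumes the RBNode struct from earlier and RBNode versions of LeftRotate/RightRotate like the sketches above; it is the textbook method, not necessarily the lecture's own code:

    /* z has just been inserted and colored red. */
    void RBInsertFixup(struct RBNode **root, struct RBNode *z) {
        while (z->parent != NULL && z->parent->color == RED) {
            struct RBNode *g = z->parent->parent;    /* grandparent */
            if (z->parent == g->left) {
                struct RBNode *u = g->right;         /* uncle */
                if (u != NULL && u->color == RED) {
                    /* Case 1: red uncle -- recolor, move violation up. */
                    z->parent->color = BLACK;  u->color = BLACK;
                    g->color = RED;  z = g;
                } else {
                    if (z == z->parent->right) {
                        /* Case 2: rotate so the two reds line up. */
                        z = z->parent;
                        LeftRotate(root, z);
                    }
                    /* Case 3: recolor and rotate the grandparent. */
                    z->parent->color = BLACK;
                    g->color = RED;
                    RightRotate(root, g);
                }
            } else {   /* mirror image: parent is a right child */
                struct RBNode *u = g->left;
                if (u != NULL && u->color == RED) {
                    z->parent->color = BLACK;  u->color = BLACK;
                    g->color = RED;  z = g;
                } else {
                    if (z == z->parent->left) {
                        z = z->parent;
                        RightRotate(root, z);
                    }
                    z->parent->color = BLACK;
                    g->color = RED;
                    LeftRotate(root, g);
                }
            }
        }
        (*root)->color = BLACK;   /* restore property 5 */
    }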
Hashing and Hash Tables
• Sometimes, people just need a few operations: Insert,
Search, Delete. A hash table is a good solution, because
searching is usually O(1)
• At its most basic level, a hash table is a data structure
like an array. Data is stored in the array at specific
indexes. The indexes are chosen by a hash function
• A hash function is a mapping between the input set of
data and a set of integers
• This allows for getting the data in O(1) time.
• You just take the thing you are looking for, apply the
hash function, and see if it is at the right place in the
array
Hashing and Hash Tables
• Definition: a Hash will take a value and
convert it to an integer, so that it can be
put in an array
• With hash tables, there is always the
chance that two elements will hash to the
same integer value.
• This is called a collision, and it must be
handled specially so that neither element
is lost.
What can we do with hash tables?
• One idea: A dictionary
– Goal: We want to be able to lookup any word in a
dictionary very fast.
– Problem: If the words are in an unsorted array, we
must check each word until we find the correct one;
even a sorted array or a balanced tree still needs
O(lg n) comparisons per lookup
– Solution: We can use a hash table to map words to
integers and store them at a place in an array.
– We choose a hash function that changes letters to
numbers.
– Then when we need to look something up, we use
the hash function, check that place in the array and
see if it exists.
Hash Tables
• To use mathematical language:
– Given a table T and a record x, with key and
any other data, we need to support:
• Insert (T, x)
• Delete (T, x)
• Search(T, x)
– We want these to be fast, but don’t care
about sorting the records
• The structure we can use is a hash table
– Supports all the above in O(1) expected time!
A Simple Hash: Direct Addressing
• Suppose:
– The range of keys is 0..m
– And all keys are different
• The idea:
– Set up an array T[0..m] in which
• T[i] = x     if x ∈ T and key[x] = i
• T[i] = NULL  otherwise
– This is called a direct-address table
• Operations take O(1) time!
• So what’s the problem?
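Before the problem is revealed, note how small the code is; a C sketch assuming keys in the range 0..99 (the bound is illustrative):

    #include <stddef.h>

    #define M 100                     /* keys are 0..M-1 */

    struct Record { int key; /* ... other data ... */ };

    struct Record *T[M];              /* all slots start out NULL */

    void Insert(struct Record *x) { T[x->key] = x; }    /* O(1) */
    void Delete(struct Record *x) { T[x->key] = NULL; } /* O(1) */
    struct Record *Search(int k)  { return T[k]; }      /* O(1) */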
Direct Addressing Problems
• Direct addressing works well when the range m
of keys is small
• But what if the keys are 32-bit integers?
– Problem 1: a direct-address table would need
2^32 entries, more than 4 billion
– Problem 2: even if memory is not an issue, the time
to initialize the elements to NULL may be really long
• Solution: map keys to smaller range 0..n where
n<m
• This mapping is called a hash function
Hash tables
Now the only problem is: what happens if two keys go to the
same place in the array (called a collision)?
We can fix collisions with different policies. Examples:
Chaining
Open Addressing
(Diagram: a hash function h maps the actual keys K = {k1..k5}
from the universe U into slots 0..m-1 of table T; h(k2) = h(k5)
shows a collision.)
Open Addressing
• Basic idea of open addressing to resolve collisions
– On an insert: if index is full, try another index, …, until an open
index is found (called probing)
– To search, follow same sequence of probing as would be used
when inserting the element
• If you reach the element with correct key, return it
• If you reach a NULL pointer, element is not in table
• Good for fixed sets
– Best when elements are only added, never deleted
– Deleting leaves NULL holes in the probe sequence, so a
later search may stop early and miss an element
• Tables don’t need to be much bigger than n
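A hedged C sketch of the simplest probing policy, linear probing; the prime table size 101 and non-negative integer keys are assumptions of the sketch:

    #define M 101                       /* table size (a prime) */
    #define EMPTY (-1)

    int T[M];                           /* call Init() before use */

    void Init(void) { for (int i = 0; i < M; i++) T[i] = EMPTY; }

    /* Probe h(k), h(k)+1, ... until an open slot is found.
     * Returns the index used, or -1 if the table is full.  */
    int Insert(int k) {
        for (int i = 0; i < M; i++) {
            int j = (k % M + i) % M;
            if (T[j] == EMPTY) { T[j] = k; return j; }
        }
        return -1;
    }

    /* Search follows the same probe sequence; reaching an EMPTY
     * slot means absent (valid only if we never delete).       */
    int Search(int k) {
        for (int i = 0; i < M; i++) {
            int j = (k % M + i) % M;
            if (T[j] == k)     return j;
            if (T[j] == EMPTY) return -1;
        }
        return -1;
    }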
Chaining
• Chaining will put elements in a list if there
is a collision
(Diagram: keys from U are hashed into table slots; keys that
collide share a slot and are chained together in a linked list.)
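A hedged C sketch of a chained table; the table size 101 and integer keys are illustrative choices:

    #include <stdlib.h>

    #define M 101                        /* number of slots (a prime) */

    struct Entry { int key; struct Entry *next; };
    struct Entry *T[M];                  /* each slot heads a linked list */

    int Hash(int k) { return k % M; }    /* assumes non-negative keys */

    /* Insert at the head of the slot's chain: O(1). */
    void Insert(int k) {
        struct Entry *e = malloc(sizeof *e);
        e->key = k;
        e->next = T[Hash(k)];
        T[Hash(k)] = e;
    }

    /* Search walks only that slot's chain: O(1 + chain length). */
    struct Entry *Search(int k) {
        for (struct Entry *e = T[Hash(k)]; e != NULL; e = e->next)
            if (e->key == k) return e;
        return NULL;
    }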
Analysis of Chaining
• Insert, delete and search are simple: apply the hash
function to the key, then walk down that slot's list
• To analyze:
– Assume that each key in the table is equally likely to be hashed
to any slot
• Given n keys and m slots in the table:
– Then n/m = average # keys per slot
• How long does it take for a search that does not find
something?
– You need to apply the hash function and search the whole list
– O(1+n/m)
• How long for a search that does find something? (on average)
– O(1 + (n/m)/2) = O(1 + n/m)
Analysis of Chaining
• So the cost of searching = O(1 + n/m)
• If the number of keys is about the same
as the number of indexes in the table,
what is n/m?
• Answer: n/m = O(1)
– In other words, we can make the average
cost of searching constant if we make n/m
constant
Making a hash function
• Choosing a good hash function will be
important, because it will determine how many
collisions you have
• One idea: The Division Method:
• Hash(k) = k mod m
– In words: hash integer k into a table with m slots
using the slot given by the remainder of k divided by
m
• This is a decent function, but if m is a power of
2 or power of 10, there might be a lot of
collisions.
• Solution: pick table size m = prime number not
too close to a power of 2 (or 10)
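In code the division method is a single line; the prime 701 below is just an example of such a table size:

    /* Division method: slot = k mod m, with m a prime that is
     * not too close to a power of 2 (or 10).                  */
    #define M 701

    unsigned Hash(unsigned k) { return k % M; }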
Hash Functions: Polynomial Hash Codes
• A modulus hash function might be good for whole
numbers, but what if you have strings?
• A polynomial hash code can take into account
strings and the position of each character within a
string to produce a relatively unique hash code
• If we choose a constant "a" > 1 and use the
following scheme:
– Take each character, convert it to its ASCII code, and
use those codes as the coefficients "xi"
– F(x) = x0·a^(k-1) + x1·a^(k-2) + … + x(k-2)·a + x(k-1)
• This produces a number with few collisions for the
most part, but if the string is really long (more than
about 31 characters) the powers of "a" overflow the
integer range, so in practice the value is reduced as
it is computed
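A C sketch that evaluates F(x) with Horner's rule; the constant a = 33 is an illustrative choice, and unsigned arithmetic is used so that long strings wrap around instead of overflowing:

    #define A 33u   /* the constant "a" > 1 */

    /* h = x0*a^(k-1) + x1*a^(k-2) + ... + x(k-1), where the
     * xi are the character codes of the string.             */
    unsigned PolyHash(const char *s) {
        unsigned h = 0;
        while (*s)
            h = h * A + (unsigned char)*s++;   /* Horner's rule */
        return h;
    }

The value is then reduced to a table index, e.g. PolyHash(word) % m.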
Using Hash Functions
• Very rarely will you need to make a
Hashtable or hash function.
• Instead, most languages (C++, Java and
Visual Basic, for example) have built-in
hash tables that can be used easily
• Remember, Hashtables are useful
because they can quickly insert, remove
and find elements.
Summary
• Data structures allow a programmer to control
and manipulate their data in different ways that
can improve performance and solve problems.
• We looked at dynamic sets: binary search
trees, red-black trees and hash tables. You
should be familiar with how all three of them
work
• You should also know the running times of these
data structures and be aware of when you
should use them.