Slides - SRU Computer Science


SORTING AND SEARCHING
• Sorting
• Algorithm Analysis
• Searching
Sorting
• Algorithm Analysis
• Sorting and searching are among the most
frequent operations
• Elementary Sorting
– bubble sort
– selection sort
– insertion sort
• Merge Sort
• Algorithm analysis is important –
performance matters! Can observe and/or
analyze, then tune or revise algorithm.
Sorting and Searching · 2 of 44 · November 13, 2003 / November 8, 2005
Analysis of Algorithms
• Computing resources consumed
– running time
– memory space
• Implementation of algorithm
– machine (UltraSPARC 10, Mac, PC,...)
– language (Java, C++,...)
• For given input, time and space used
depend on implementation.
• Size of input data, denoted N, e.g.,
– number of elements to sort
– number of nodes in tree to be visited
• Worst-case time complexity T(N)
– maximum running time of algorithm over
all possible inputs of size N
Big-Oh Notation - OrderOf()
• How to abstract from implementation?
• Big-Oh notation
• O(N) means each element is accessed once
– N elements * 1 access/element = N accesses
• O(N^2) means each element is accessed N times
– N elements * N accesses/element = N^2 accesses
• Only consider "asymptotic behavior," i.e., when N >> 1
(N is much greater than 1)
– N is unimportant compared to N^2
• Disregard constant factors:
– newer machine might run all programs twice as fast
– line of code in one language may be several lines
in another
• Remember, only the largest N expression without
constants matters:
– 3N^2 is O(N^2)
– 1 + 2 + ... + N = N(N+1)/2 is O(N^2)
– N/2 is O(N)
– 4N^2 + 2N + 1 is O(N^2)
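To see concretely why the smaller terms stop mattering, we can tabulate the ratio of 4N^2 + 2N + 1 to its leading term (a quick Python sketch; the function name is ours):

```python
def f(n):
    # the full cost expression from the slide: 4N^2 + 2N + 1
    return 4 * n**2 + 2 * n + 1

# as N grows, f(N) / N^2 approaches the constant factor 4 --
# the lower-order terms 2N + 1 become negligible, so f(N) is O(N^2)
for n in (10, 100, 10_000):
    print(n, f(n) / n**2)
```

For N = 10 the ratio is 4.21; by N = 10,000 it is already 4.0002.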
f(N) on linear graph paper
[Figure: f(N) on linear axes for f = log N, N, N log N, N^2, 2^N;
N from 20 to 140, f(N) from 0 to 180.]
f(N) on log-log graph paper
[Figure: the same functions log N, N, N log N, N^2, 2^N on log-log
axes; N from 10^1 to 10^7, f(N) up to 10^9.]
• x-axis: log N;
y-axis: log f(N)
• the diagram of cf(N) is obtained by
“shifting” the diagram of f(N) up by log c.
Example: Bubble Sort
• Iterate through sequence, compare each
element to right neighbor.
• Exchange adjacent elements if necessary.
• Keep passing through sequence until no
exchanges are required (up to N times).
• Each pass causes largest element to
bubble into place: 1st pass, largest; 2nd
pass, 2nd largest, ...
Before a pass:         49  2 36 55  4 72 23
Middle of first pass:   2 36 49 55  4 72 23
After one pass:         2 36 49  4 55 23 72
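The passes above can be written out as runnable code — a Python sketch of the bubble sort the slide describes, including the early-exit flag:

```python
def bubble_sort(a):
    """Sort list a in place, stopping once a pass makes no exchange."""
    i = len(a)
    swapped = True
    while i > 1 and swapped:
        swapped = False
        for j in range(1, i):
            if a[j - 1] > a[j]:              # exchange adjacent elements
                a[j - 1], a[j] = a[j], a[j - 1]
                swapped = True
        i -= 1                               # largest element has bubbled into place
    return a

print(bubble_sort([49, 2, 36, 55, 4, 72, 23]))   # [2, 4, 23, 36, 49, 55, 72]
```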
Worst-Case Time, Bubble Sort
Instruction                              # executions
i = N;                                   1
sorted = false;                          1
while ((i > 1) && (!sorted)) {           N
    sorted = true;                       (N - 1)
    for (int j = 2; j <= i; j++) {       (N-1)+(N-2)+ ... +2+1 = N(N-1)/2
        if (a[j-1] > a[j]) {
            temp = a[j-1];               // exchange a[j-1] and a[j]
            a[j-1] = a[j];
            a[j] = temp;
            sorted = false;
        }
    }
    i--;                                 (N - 1)
}

– N is number of objects in sequence
– a[j] is shorthand for accessing the jth element in the sequence
Worst-case analysis:
• while-loop is iterated N-1 times
• iteration i executes 2 + 6(i - 1) instructions
Total: 2 + N + 2(N-1) + 6[(N-1) + ... + 2 + 1]
= 3N + 3N(N-1) = 3N^2 = O(N^2)
Insertion Sort
• Like inserting a new card into a partially sorted
hand; a little less brute-force than bubble sort
– add one element a[i] at a time
– find its proper position j+1 by shifting its left
neighbors a[i-1], a[i-2], ..., a[j+1] one place to the
right, until a[j] <= a[i]
– move a[i] into the vacated a[j+1]

 2  4 36 55  9 72 23     inserting a[i] = 9
 2  4  _ 36 55 72 23     55 and 36 shifted right; gap at j+1
 2  4  9 36 55 72 23     9 moved into a[j+1]
• After iteration i<n, a[1] ... a[i] are in sorted
order, but not necessarily in final position
Time Complexity of Insertion Sort
Pseudocode implementation
for (int i = 2; i <= n; i++) {
    int j;
    for (j = i - 1; (j > 0) && (a[j] > a[i]); j--) {
        move a[j] forward;
    }
    move a[i] to a[j+1];
}
Analysis
• Most executed instructions are ones for
move in inner for-loop.
• Such instructions are executed at most 1 +
2 + ... + (N-2) + (N-1) times.
• Time complexity: O(N2) worst-case;
constants do not matter for Big-Oh.
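A runnable Python version of the pseudocode; note it first saves a[i] in a temporary, since shifting the neighbors right would otherwise overwrite it:

```python
def insertion_sort(a):
    """Insert a[i] into the sorted prefix a[0..i-1], one element at a time."""
    for i in range(1, len(a)):
        key = a[i]                  # save a[i]; shifting overwrites its slot
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]         # shift larger left neighbors right
            j -= 1
        a[j + 1] = key              # drop a[i] into the vacated position
    return a
```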
Selection Sort
• Find smallest element and put it in a[1].
• Find 2nd smallest element and put it in a[2].
• etc. Less data movement (no bubbling)
Pseudocode:
for (int i = 1; i < n; i++) {
    find minimum element a[min] in
    subsequence a[i..n]
    exchange a[min] and a[i]
}
a:   2  4 36 55  5 72 23
           i     min
• After iteration i, a[1] ... a[i] are in final
position.
Time Complexity of Selection Sort
for (int i = 1; i < n; i++) {
    int min = i;
    for (int j = i + 1; j <= n; j++) {
        if (a[j] < a[min]) {
            min = j;
        }
    }
    temp = a[min];
    a[min] = a[i];
    a[i] = temp;
}
Worst-Case Analysis
• Most executed instructions are those in the inner
for-loop (the if)
• Each such instruction is executed (N-1) +
(N-2) + ... + 2 + 1 times
• Time complexity: O(N^2)
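The same loops in runnable Python (0-based indices rather than the slide's 1-based ones):

```python
def selection_sort(a):
    """Swap the minimum of a[i:] into position i; at most n-1 exchanges."""
    n = len(a)
    for i in range(n - 1):
        m = i
        for j in range(i + 1, n):   # the most-executed instruction: the if
            if a[j] < a[m]:
                m = j
        a[i], a[m] = a[m], a[i]     # one exchange per outer iteration
    return a
```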
Comparison of Elementary Sorting
Algorithms
                 Selection   Insertion   Bubble
Comparisons
  Best             n^2/2         n          n
  Average          n^2/2       n^2/4      n^2/2
  Worst            n^2/2       n^2/2      n^2/2
Movements
  Best               0           0          0
  Average            n         n^2/4      n^2/2
  Worst              n         n^2/2      n^2/2

Note: smaller terms omitted
Merge Sort
• Divide-and-Conquer algorithm
• Time complexity: O(N log N)
• Simple recursive formulation
• Stable -- preserves initial order of equal
keys
Merging Two Sorted Lists
A (N elements):    1  5  9 25
B (M elements):    2  3 17
C (M+N elements):  1  2  3  5  9 17 25
• Time: O(M + N)
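A Python sketch of the merge: two scan indices, and each element of A and B is copied exactly once, hence O(M + N):

```python
def merge(a, b):
    """Merge sorted lists a and b into a new sorted list in O(M + N)."""
    c = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:            # <= keeps equal keys in order (stable)
            c.append(a[i]); i += 1
        else:
            c.append(b[j]); j += 1
    c.extend(a[i:])                 # at most one of these two is non-empty
    c.extend(b[j:])
    return c

print(merge([1, 5, 9, 25], [2, 3, 17]))   # [1, 2, 3, 5, 9, 17, 25]
```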
Outline of Recursive (Top Down)
Merge Sort
• Partition sequence into two sub-sequences
of N/2 elements.
• Recursively sort each sub-sequence.
• Merge the sorted sub-sequences.
Recursive Merge Sort
• listSequence is sequence to sort.
• first and last are smallest and largest
indices of sequence.
public class Sorts {
    // other code here
    public void mergeSort(ItemSequence listSequence,
                          int first, int last) {
        if (first < last) {
            int middle = (first + last) / 2;
            mergeSort(listSequence, first, middle);
            mergeSort(listSequence, middle + 1, last);
            merge(listSequence, first, middle, last);
        }
    }
}
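The same recursion in runnable Python (the slide's ItemSequence becomes a plain list; the helper merges via a temporary buffer, and its name is ours):

```python
def merge_sort(a, first, last):
    """Recursively sort a[first..last] in place (inclusive indices)."""
    if first < last:
        middle = (first + last) // 2
        merge_sort(a, first, middle)          # sort left half
        merge_sort(a, middle + 1, last)       # sort right half
        _merge(a, first, middle, last)        # merge the sorted halves

def _merge(a, first, middle, last):
    """Merge sorted runs a[first..middle] and a[middle+1..last] via a buffer."""
    buf = []
    i, j = first, middle + 1
    while i <= middle and j <= last:
        if a[i] <= a[j]:
            buf.append(a[i]); i += 1
        else:
            buf.append(a[j]); j += 1
    buf.extend(a[i:middle + 1])
    buf.extend(a[j:last + 1])
    a[first:last + 1] = buf                   # copy merged result back
```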
Bottom - Up Merge Sort
7 2 | 3 5 | 8 4 | 1 6      initial runs of size 1
2 7 | 3 5 | 4 8 | 1 6      after merging pairs into runs of size 2
2 3 5 7  |  1 4 6 8        after merging into runs of size 4
1 2 3 4 5 6 7 8            after the final merge
This should remind you of ... a tree!!!
Bottom - Up Merge Sort
for k = 1, 2, 4, 8, ... , N/2 {
merge all pairs of consecutive
sub-sequences of size k into sorted
sub-sequences of size 2k.
}
• Number of iterations is log2N
– if there are 1000 elements in the list, only
10 iterations are required!!!
• Each iteration (one merging pass over all the pairs) takes O(N) time.
• Total time T(N) = O( N log2 N ).
• Sequential Data Access:
– merge sort accesses data sequentially
– particularly useful for sorting linked lists
and data sorted on disk
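The loop above can be sketched in Python; this is our iterative rendering of the pseudocode, merging runs of size k = 1, 2, 4, ... in place:

```python
def bottom_up_merge_sort(a):
    """Merge consecutive runs of size k into runs of size 2k, doubling k."""
    n = len(a)
    k = 1
    while k < n:                                  # log2(N) passes
        for lo in range(0, n - k, 2 * k):
            left = a[lo:lo + k]                   # sorted run of size k
            right = a[lo + k:lo + 2 * k]          # next run (may be shorter)
            merged = []
            i = j = 0
            while i < len(left) and j < len(right):
                if left[i] <= right[j]:
                    merged.append(left[i]); i += 1
                else:
                    merged.append(right[j]); j += 1
            merged.extend(left[i:]); merged.extend(right[j:])
            a[lo:lo + len(merged)] = merged       # run of size 2k is sorted
        k *= 2
    return a
```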
Merging Guidelines
• Use buffer sequence (temporary copy) to
put merged sub-sequences into.
• Scan each half of original sequence.
• Compare two elements from two different
halves of sequence.
• Continue until you have reached end of
both halves of the sequences.
Recurrence Relations for Time
Complexity
• Recurrence relation (or just recurrence):
– recursive definition of function
– often used to express time complexity of
recursive algorithm
• Example: ƒ(1) = 1
ƒ(N) = ƒ(N - 1) + N for N >= 2
• Solution: (by "unfolding")
ƒ(N) = ƒ(N - 1) + N
ƒ(N) = ƒ(N - 2) + (N - 1) + N
ƒ(N) = ƒ(N - 3) + (N - 2) + (N - 1) + N
...
ƒ(N) = 1 + 2 + 3 + ... + (N - 2) + (N - 1) + N
ƒ(N) = N(N + 1)/2 = O(N^2)
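The closed form can be checked directly against the recursive definition (a quick Python sketch):

```python
def f(n):
    # the recurrence: f(1) = 1, f(N) = f(N - 1) + N
    return 1 if n == 1 else f(n - 1) + n

# the unfolded solution N(N + 1)/2 agrees for every N we try
assert all(f(n) == n * (n + 1) // 2 for n in range(1, 60))
print(f(10))   # 55
```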
Example: Towers of Hanoi
• Goal: transfer all N disks from peg A to peg C.
• Rules:
– move one disk at a time
– never place larger disk above smaller one
• Recursive solution:
– transfer N - 1 disks from A to B
– move largest disk from A to C
– transfer N - 1 disks from B to C
• Total number of moves:
– T(N) = 2 T(N - 1) + 1
– T(1) = 1
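The recursive solution, written out in Python; counting the generated moves matches the count T(N) = 2^N - 1:

```python
def hanoi(n, src, aux, dst, moves):
    """Append the (from, to) moves that transfer n disks from src to dst."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)   # transfer n-1 disks src -> aux
    moves.append((src, dst))             # move largest disk src -> dst
    hanoi(n - 1, aux, src, dst, moves)   # transfer n-1 disks aux -> dst

moves = []
hanoi(4, 'A', 'B', 'C', moves)
print(len(moves))   # 2**4 - 1 = 15
```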
Solution of the Time Complexity
Recurrence for Towers of Hanoi
• Recurrence relation:
– T(N) = 2 T(N - 1) + 1
– T(1) = 1
• Solution by unfolding:
– T(N) = 2(2T(N - 2) + 1) + 1
       = 4T(N - 2) + 2 + 1
       = 4(2T(N - 3) + 1) + 2 + 1
       = 8T(N - 3) + 4 + 2 + 1
       ...
       = 2^i T(N - i) + 2^(i-1) + 2^(i-2) + ... + 2^1 + 2^0
– the expansion stops when i = N - 1:
  T(N) = 2^(N-1) + 2^(N-2) + 2^(N-3) + ... + 2^1 + 2^0
– this is a geometric sum, so we have:
  T(N) = 2^N - 1 = O(2^N)
Time Complexity of Recursive Merge
Sort
T(1) = 1
T(N) = 2·T(N/2) + N   for N >= 2
       (2 recursive sorts, plus N for the merge)

Solution by unfolding:
(1) T(N) = 2·T(N/2) + N
(2)      = 2·[2·T(N/4) + N/2] + N = 4·T(N/4) + N + N
(3)      = 8·T(N/8) + N + N + N
...
(i)      = 2^i · T(N/2^i) + N + N + ... + N   (i copies of N)

• Stops when N/2^i = 1, i.e., i = log2 N:
T(N) = 2^(log2 N) · T(1) + N·log2 N = N + N·log2 N = O(N log2 N)
Sorting and Searching
• Part II -- Searching
• Using:
– sequences (arrays, vectors)
– linked lists
– trees
– hash tables
Searching In Different Structures
• Searching is one of the most common and
fundamental computer tasks. We learned
how to sort so that we could search faster.
input key → search function → output value(s) or object

1. Sequences:    [key1|value][key2|value][key3|value][key4|value] ...
2. Linked Lists: [key1|value] → [key2|value] → ...
3. Binary Trees: nodes of [key|value] with left/right children
Simple Searches in Sequences
• Searching in unordered sequence:
– dumb linear search
– Time = N when duplicates must be found, or
in the worst case
– Time = N/2 for average case in successful
searches
– Time = N for unsuccessful search
• Searching in ordered sequence:
– use binary search (e.g., telephone book)
– average and worst-case time = log2N for successful
and unsuccessful searches
– duplicates can be found by checking adjacent
elements
• Using binary search on ordered sequence
thus offers logarithmic rather than linear runtime.
• But we don’t always use sequences to store
our data
– insertion and deletion require data movement!
– linked lists obviate most data movement at cost of
extra pointers
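The binary search mentioned above can be sketched in Python; each probe halves the remaining range, giving the log2 N time the slide quotes:

```python
def binary_search(a, key):
    """Return an index of key in sorted list a, or -1 if absent."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # probe the middle, halve the range
        if a[mid] == key:
            return mid
        elif a[mid] < key:
            lo = mid + 1              # key can only be in the right half
        else:
            hi = mid - 1              # key can only be in the left half
    return -1
```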
Simple Searches in Linked Lists
• Searching unordered linked list
head → [key1|value] → [key2|value] → [key3|value] → tail
• Linear search; can’t do binary search.
• Time = N if duplicates must be found, and
for worst case.
• Time = N / 2 on average for successful
searches.
Simple Searches in Linked Lists (cont.)
• Searching ordered linked list
head → [Kate|...] → [Mike|...] → [Sara|...] → tail
– time = N / 2 for average case of successful,
unsuccessful, and duplicate searches
• Easy-to-implement dynamic data structure
at price of increased running time:
– O(N) >> O(log2N)
Binary Trees
• Offer advantages of both sequences and
linked lists
– logarithmic search time for mostly balanced
trees
– proportional to depth of tree
– log2N on average
– easy inserting
• What’s the catch?
– extra reference in each node
– somewhat more complicated deletions as
shown in the Trees lecture
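A minimal binary search tree sketch (class and function names are ours): search walks a single root-to-leaf path, so its cost is proportional to the depth of the tree, and insertion follows the same easy walk:

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def bst_insert(root, key, value):
    """Insert by walking down as in search; no data movement needed."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = bst_insert(root.left, key, value)
    elif key > root.key:
        root.right = bst_insert(root.right, key, value)
    return root

def bst_search(node, key):
    """Follow one path from the root; cost proportional to tree depth."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None
```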
n-ary Trees
• Anything faster than binary tree?
• n-ary tree (n references per node)!
– searches in logn N time
• But
– difference between logs is small because log
(corresponds to tree depth) grows slowly, e.g.,
log2 10^6 ≈ 20; log10 10^6 = 6
– requires more complex node processing
– not in common use (in this form)
• Is there a scheme that offers advantages of
both log2 and logn?
Hybrid Trees
• n-ary at root, binary thereafter.
[Diagram: alphanumeric keys; the root is n-ary with entries A, B, C,
D, E, F, G, ...; each entry points to a binary tree holding all keys
with that first letter (e.g., the A entry leads to the binary tree of
all keys whose first letter = A).]
Most Efficient Data Structures
• Most efficient search would be 1-level tree
where we could access each key immediately:
[key1|value 1] [key2|value 2] .... [key n|value n]
• Implement this dream-tree as a sequence; then,
if each key were an integer in the range of the
sequence size, use it as an index. Check the
contents of the sequence at that location and done!
• Sounds perfect, right?
Symbol Table Example
• Consider stupid old BASIC with
one-character variable names.
• These correspond to 8-bit integer keys in
range (65 - 90, 97 - 122) using ASCII code
• Value is link to contents (value) of identifier.
ASCII value for 'A' = 01000001 = 65

Symbol Table:
  65  → value for 'A'
  ...
  90  → value for 'Z'
  97  → value for 'a'
  ...
  122 → value for 'z'
• But does this generalize to realistic size
keys and tables?
Problems with Different-Sized Keys
• Creating table to look up CS15 students
based on ID number would be tremendous
waste of space
– ID number is one letter followed by five
digits (e.g., D00011), so there are 26 * 10^5
combinations!
– do not want to allocate 2,600,000 words for
no more than 100 students (1 word = 4
bytes)
– sequence would be rather sparse...
• What about using their social security
number?
– would need to allocate 10^9 words, about
4 gigabytes, for no more than 100 students!
• Thus, two major problems:
– how can we deal with arbitrarily long keys,
both numeric and alpha-numeric?
– how can we build a small, dense sequence
that we can index into to find keys and
values?
Introduction to Hashing
• Hashing refers to deriving sequence index
from arbitrarily large key using a
hash function.
• Index leads to value or object. Therefore,
two-step process.
key → hash function → index → hash table[index] → value
Introduction to Hashing (cont.)
• Hash table typically holds several hundred
to several thousand entries.
Sequence of links to instances of the class Student:
Hash('Mike') = 1, Hash('Sara') = 3, Hash('Kate') = 5
  1 → Mike
  2 → null
  3 → Sara
  4 → null
  5 → Kate
  ...
  N
Collisions
• Problem: Normally have more keys than
entries in our table. Therefore inevitable that
two keys hash to same position…
– e.g., Hash(‘Monica’) = 4
– and, Hash(‘Scott’) = 4
• Called collision and is the rule, not exception.
• But sequences won’t let us store multiple
elements at one location!
• This did look too good to be true...
Handling Collisions
• Since by design, we can’t avoid collisions,
we use buckets to catch extra entries.
• Consider stupid hash that returns integer
value of first letter of each name.
• Each entry in table could be reference to
bucket
– implement bucket with unsorted linked list
– then we only need to search within (small)
linked lists
– called “chaining”
– for longer lists, we could sort bucket, or use
binary trees; hybrid tree!
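A chained hash table can be sketched as follows in Python (class and method names are ours); each slot holds a small unsorted list that is searched linearly:

```python
class ChainedHashTable:
    """Hash table with chaining: each slot is a small unsorted bucket list."""
    def __init__(self, size=101):              # prime table size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # bring key into table range

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                       # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))            # collision: chain onto the list

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None
```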
[Diagram: table entries (65 .. 90, by first letter) each point to a
bucket implemented as an unsorted linked list; "Mike", "Sara", and
"Kate" hang off the lists for their first letters.]
Linear Buckets
• Alternatively, hash table entry could point to
(maximum sized) sequence for each bucket
and overflow into adjacent bucket if needed.
[Diagram: hash table entries point to fixed-size sequence buckets:
Kate | Mike, Monica | Sara, Scott]
• Wastes space.
• May have to search multiple sequence
buckets.
Building a Good Hash Function
• Good hash functions
– take into account all information in key
– fill out hash table as uniformly as possible
• Thus, a function that uses only the first character
(or any single character) is a terrible hash function.
– Not many Q's or Z's, lots of A's, M's, etc.
• % (mod) provides a simple method of ensuring
that any integer can be brought within the desired
range (pick a prime as table size for best
results).
• Thus, we could take a string, chop it into 4-letter
sections, then take the value of the 32 bits that
make up each 4-letter section and XOR them
together, then % that result by the table size.
• Almost any reasonable function that uses all
bits will do, so choose a fast one!
– example: hashValue =
(key.GetValue()) % 101;
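The 4-letter chopping scheme can be sketched as follows (the function name and byte-packing details are our assumptions; the slide only outlines the idea):

```python
def hash_string(key, table_size=101):
    """Chop the string into 4-character chunks, XOR the chunks'
    packed values together, then bring the result into range with %."""
    h = 0
    for start in range(0, len(key), 4):
        chunk = key[start:start + 4]
        value = 0
        for ch in chunk:
            value = (value << 8) | ord(ch)   # pack up to 4 bytes into an int
        h ^= value                           # XOR the chunk into the hash
    return h % table_size                    # 101 is prime, as suggested
```

Unlike the first-character hash, this uses every character of the key.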