Transcript Lecture13KS

Sorting: Implementation
15-211
Fundamental Data Structures and
Algorithms
Klaus Sutner
February 24, 2004
Announcements
- Homework #5
- Midterm: March 4
- Review: March 2
Today
- Recall Sorting
- Implementation Issues
- Average case RT for quicksort
- Timing Results
Total Recall:
Sorting Algorithms
The Bible
Robert Sedgewick
Algorithms in C
Parts 1-4
Fundamentals, Data Structures, Sorting, Searching
Addison-Wesley 1998
Multiple Keys
We could use a special comparator function (this
would require a special function for each
combination of keys).
It is often easier to
- first sort by name
- stable sort by year
Done!
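The two-pass trick can be sketched as follows. This is a minimal illustration, assuming hypothetical (name, year) records; the names and years are made up. It relies on the fact that Python's built-in `sorted` is stable.

```python
from operator import itemgetter

# Hypothetical records (name, year); the data is invented for illustration.
records = [("Dana", 2003), ("Alex", 2004), ("Casey", 2003), ("Blake", 2004)]

# Pass 1: sort by the secondary key (name).
by_name = sorted(records, key=itemgetter(0))

# Pass 2: stable sort by the primary key (year).
# Records with equal years keep the name order from pass 1.
by_year_then_name = sorted(by_name, key=itemgetter(1))
```

The result is ordered by year, with ties broken by name, without writing a combined comparator.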
Sorting Review
Several simple, quadratic algorithms (worst case
and average).
- Bubble Sort
- Selection Sort
- Insertion Sort
Only Insertion Sort is of practical interest: its running
time is linear in n plus the number of inversions of the
input sequence.
Constants small.
Also stable.
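A small sketch of why inversions matter: in the version below every shift of an element removes exactly one inversion, so the shift count equals the inversion count. The helper `count_inversions` is a brute-force check added for illustration only.

```python
def insertion_sort(a):
    """Sort list a in place and return the number of element shifts."""
    shifts = 0
    for i in range(1, len(a)):
        x = a[i]
        j = i
        # Strict '>' never moves x past an equal key, which keeps the sort stable.
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]
            j -= 1
            shifts += 1
        a[j] = x
    return shifts


def count_inversions(a):
    """Brute-force count of pairs i < j with a[i] > a[j]."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i] > a[j])
```

On an already sorted input the inner loop never runs, so the work is just the n - 1 outer iterations.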
Sorting Review
Asymptotically optimal O(n log n) algorithms
(worst case and average).
- Merge Sort
- Heap Sort
Merge Sort purely sequential and stable.
But requires extra memory: 2n + O(log n).
Quick Sort
Overall fastest.
In place.
BUT:
Worst case quadratic.
Not stable.
Implementation details messy.
Picking An Algorithm
First Question: Is the input short?
Short means something like n < 500.
In this case Insertion Sort is probably the best
choice.
Don't bother with asymptotically faster methods.
Picking An Algorithm
Second Question: Does the input have special
properties?
E.g., if the number of inversions is small, Insertion
Sort may be the best choice.
Or linear sorting methods may be appropriate.
Otherwise: Quick Sort
Large inputs, comparison based method, stability
not required (recall our stabilizer trick, though).
Quick Sort is worst case quadratic, why should it
be the default candidate?
On average, Quick Sort is O(n log n), and the
constants are quite small.
Average ???
Average case analysis requires a probability
distribution on the inputs: we have to average the
running times.
$t(n) = \sum_{x} p_x \, t(x)$
where the sum is over all instances x of size n and
$p_x$ is the probability of getting instance x.
Often simply assume uniform distribution: every
instance (of a certain size) is equally likely.
A Computation
Can we write down a recurrence equation?
Can we solve the equation?
At least approximately?
Is the solution (if any) practically relevant?
(see handout from last time)
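For small n the average can be computed directly from its definition and compared with a recurrence. The sketch below assumes the uniform distribution over permutations of n distinct keys and a simple quicksort that pivots on the first element and counts key comparisons. The recurrence C(n) = (n - 1) + (2/n) * (C(0) + ... + C(n-1)) is the standard one for this model; whether it matches the handout exactly is an assumption.

```python
from fractions import Fraction
from itertools import permutations


def comparisons(seq):
    """Key comparisons made by a quicksort that pivots on the first element."""
    if len(seq) <= 1:
        return 0
    p, rest = seq[0], seq[1:]
    left = [x for x in rest if x < p]
    right = [x for x in rest if x > p]
    return len(rest) + comparisons(left) + comparisons(right)


def average_by_enumeration(n):
    """Average of comparisons(x) over all n! instances, each with probability 1/n!."""
    perms = list(permutations(range(n)))
    return Fraction(sum(comparisons(list(x)) for x in perms), len(perms))


def average_by_recurrence(n):
    """C(0) = 0 and C(m) = (m - 1) + (2/m) * (C(0) + ... + C(m-1))."""
    c = [Fraction(0)]
    for m in range(1, n + 1):
        c.append((m - 1) + Fraction(2, m) * sum(c))
    return c[n]
```

Enumeration is only feasible for tiny n, since there are n! instances, which is exactly why a recurrence is needed.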
Implementing Quick Sort
Pivot Selection
Ideally, the pivot should be the median.
Much too slow to be of practical value.
Instead either
- pick the pivot at random, or
- take the median of a small sample.
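Both pivot strategies can be sketched in a few lines. The function names are hypothetical, bounds are inclusive, and the sample of size three (first, middle, last) is one common choice rather than something the slides prescribe.

```python
import random


def random_pivot_index(a, lo, hi):
    """Index of a uniformly random element of a[lo..hi] (inclusive)."""
    return random.randint(lo, hi)


def median_of_three_index(a, lo, hi):
    """Index of the median of a[lo], a[mid], a[hi]."""
    mid = (lo + hi) // 2
    candidates = sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])
    return candidates[1][1]
```

For the block 85 24 63 50 17 31 96 45 the sample is 85, 50, 45, so the median-of-three pivot is 50 at index 3.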
Partitioning
Partitioning is easy if we use extra scratch space.
But we would like to partition in place.
Need to move elements within the same given
block of the big array.
Basic idea: use two pointers, sweep across block
from left and right till an out-of-place element is
encountered. Swap them.
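1. Doing quicksort in place
[Figure: in-place partitioning of a block with pivot 50; L and R mark the left and right pointers.]
  85 24 63 50 17 31 96 45    start
  85 24 63 45 17 31 96 50    pivot 50 swapped to the right end
  31 24 63 45 17 85 96 50    L stops at 85, R stops at 31; swap
  31 24 17 45 63 85 96 50    L stops at 63, R stops at 17; swap
  31 24 17 45 50 85 96 63    L (at 63) and R (at 45) have crossed; swap L with the pivot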
Pseudo Code
p = A[hi];   // pivot, already moved to the right end of the block
i = lo - 1; j = hi;
while( true ) {
    while( A[++i] < p );
    while( p < A[--j] ) if( j == lo ) break;
    if( i >= j ) break;
    swap( i, j );
}
swap( i, hi );
return i;
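A runnable rendering of the partitioning step, as a sketch. It assumes inclusive bounds and that the pivot already sits at `a[hi]`; the pivot doubles as a sentinel for the left-to-right scan.

```python
def partition(a, lo, hi):
    """Partition a[lo..hi] (inclusive) around the pivot a[hi]; return its final index."""
    p = a[hi]
    i, j = lo - 1, hi
    while True:
        i += 1
        while a[i] < p:              # stops at a[hi] == p at the latest
            i += 1
        j -= 1
        while j > lo and p < a[j]:   # explicit bound check on the left end
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]
    a[i], a[hi] = a[hi], a[i]        # move the pivot into its final position
    return i
```

Applied to 85 24 63 45 17 31 96 50 it returns index 4 and leaves 31 24 17 45 50 85 96 63. Both scans stop on keys equal to the pivot.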
Getting Out
Using Quick Sort on very short arrays is a bad
idea: the overhead becomes too large.
So, when the block becomes short we should exit
Quick Sort and switch to Insertion Sort.
But not locally:
quicksort( A, lo, hi ) {
    if( hi - lo < magic_number )
        insertionsort( A, lo, hi );
    else …
Getting Out
Just do nothing when the block is short.
Then do one global cleanup with insertion sort.
quicksort( A, 0, n );
insertionsort( A, 0, n );
The cleanup pass is linear: every element is already inside its
short block, so there are fewer than n * magic_number inversions.
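The "do nothing, then clean up once" scheme might look like the sketch below. `CUTOFF` stands in for the magic number and its value is a placeholder. Using the middle element as pivot is an assumption made to keep the example short. Bounds are inclusive, so the top-level call uses `len(a) - 1`.

```python
CUTOFF = 10  # hypothetical magic number; tune it by measurement


def partition(a, lo, hi):
    """Partition a[lo..hi] around the middle element; return the pivot's final index."""
    mid = (lo + hi) // 2
    a[mid], a[hi] = a[hi], a[mid]    # assumption: middle element as pivot
    p = a[hi]
    i, j = lo - 1, hi
    while True:
        i += 1
        while a[i] < p:
            i += 1
        j -= 1
        while j > lo and p < a[j]:
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]
    a[i], a[hi] = a[hi], a[i]
    return i


def quicksort_coarse(a, lo, hi):
    """Quick Sort that simply returns on short blocks, leaving them unsorted."""
    if hi - lo < CUTOFF:
        return
    i = partition(a, lo, hi)
    quicksort_coarse(a, lo, i - 1)
    quicksort_coarse(a, i + 1, hi)


def insertion_sort(a):
    """Plain in-place Insertion Sort, used as the single global cleanup pass."""
    for i in range(1, len(a)):
        x, j = a[i], i
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]
            j -= 1
        a[j] = x


def hybrid_sort(a):
    quicksort_coarse(a, 0, len(a) - 1)
    insertion_sort(a)
```

After `quicksort_coarse` every element is at most `CUTOFF` positions from its final place, so the cleanup does O(n * CUTOFF) work.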
Magic Number
The best way to determine the magic number is to
run real-world tests.
It seems that for current architectures, some value
in the range 5 to 20 will work best.
Equal Elements
Note that ideally pivoting should produce three
sub-blocks:
left:    < p
middle:  == p
right:   > p
Then the recursion could ignore the middle part,
possibly omitting many elements.
Equal Elements
Three natural strategies:
Both pointers stop.
Only one pointer stops.
Neither pointer stops.
Fact: The first strategy works best overall.
Equal Elements
There are clever implementations that partition
into three sub-blocks.
This is amazingly hard to get both right and fast.
Try it!
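As a baseline for experiments, here is a sketch of the simplest three-way scheme, a single left-to-right sweep in the style of the "Dutch national flag" partition. It is not one of the clever fast implementations alluded to above; it only shows what a correct three-block partition has to deliver. The names `partition3` and `quicksort3` are hypothetical.

```python
def partition3(a, lo, hi):
    """Rearrange a[lo..hi] into  < p | == p | > p  with p = a[lo].

    Returns (lt, gt) such that a[lt..gt] holds exactly the keys equal to p.
    """
    p = a[lo]
    lt, i, gt = lo, lo + 1, hi
    while i <= gt:
        if a[i] < p:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > p:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1              # a[i] is now unexamined, so i stays put
        else:
            i += 1
    return lt, gt


def quicksort3(a, lo=0, hi=None):
    """Quick Sort that skips the whole middle block in the recursion."""
    if hi is None:
        hi = len(a) - 1
    if hi <= lo:
        return
    lt, gt = partition3(a, lo, hi)
    quicksort3(a, lo, lt - 1)
    quicksort3(a, gt + 1, hi)
```

With many duplicate keys the middle block is large and the recursion shrinks quickly, but this sweep does more swaps than the two-pointer partition when keys are distinct.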
Application:
Quick Select
Selection (Order Statistics)
A classical problem: given a list, find the k-th
element in the ordered list.
The brute-force approach sorts the whole list first,
and thus produces more information than required.
Can we get away with less than n log n work (in a
comparison based world)?
Easy Cases
Needless to say, when k is small there are easy
answers.
- Scan the array and keep track of the k smallest.
- Use a Selection Sort approach.
But how about general k?
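The Selection Sort approach for small k can be sketched like this. It assumes 0-based ranks (k = 0 is the minimum) and works on a copy; the function name is hypothetical.

```python
def kth_smallest_by_selection(a, k):
    """Return the element of rank k (0-based) using k + 1 Selection Sort passes.

    Each pass scans the unsorted tail once, so the cost is O(n * (k + 1)).
    """
    b = list(a)
    for i in range(k + 1):
        m = min(range(i, len(b)), key=b.__getitem__)  # index of the smallest remaining key
        b[i], b[m] = b[m], b[i]
    return b[k]
```

This is fine while k is a small constant, but for k near n/2 it degrades to quadratic time.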
Selection and Partitioning
qselect( A, lo, hi, k ) {
if( hi <= lo ) return;
i = partition( A, lo, hi );
if( i > k ) qselect( A, lo, i-1, k );
if( i < k ) qselect( A, i+1, hi, k );
}
This looks like a typo.
What’s really going on here?
Quick Select
What should we expect as running time?
As usual, if there is a ghost in the machine, it
could force quadratic behavior.
But on average this algorithm is linear.
Don’t get any ideas about using this to find the
median in the pivoting step of Quick Sort!
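One way to read the qselect pseudo code: after partitioning, the pivot sits at its final index i, so only the side that contains index k needs more work, and the answer ends up in A[k]. The sketch below assumes 0-based ranks, inclusive bounds, and a random pivot. It uses a loop in place of the tail recursion.

```python
import random


def partition(a, lo, hi):
    """Partition a[lo..hi] around a randomly chosen pivot; return its final index."""
    r = random.randint(lo, hi)
    a[r], a[hi] = a[hi], a[r]
    p = a[hi]
    i, j = lo - 1, hi
    while True:
        i += 1
        while a[i] < p:
            i += 1
        j -= 1
        while j > lo and p < a[j]:
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]
    a[i], a[hi] = a[hi], a[i]
    return i


def qselect(a, lo, hi, k):
    """Rearrange a[lo..hi] so that a[k] holds the element of rank k."""
    while hi > lo:
        i = partition(a, lo, hi)
        if i > k:
            hi = i - 1       # rank k lies in the left block
        elif i < k:
            lo = i + 1       # rank k lies in the right block
        else:
            return           # the pivot itself has rank k
```

Unlike Quick Sort, only one of the two blocks is processed after each partition, which is where the linear average comes from.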
Some Timing Results
The Real World
Beyond asymptotic analysis, it is always a good
idea to do some real world testing.
Construct a small test-bed:
- automate testing
- flexible but simple
- organize the data in a useful way
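A minimal test-bed along these lines, as a sketch. The function names, the best-of-three policy, and the random-integer inputs are all assumptions. Each run also checks the result, so a fast but wrong sort does not slip through.

```python
import random
import time


def time_sort(sort_fn, data, repeats=3):
    """Best-of-`repeats` wall-clock time of sort_fn on fresh copies of data."""
    expected = sorted(data)
    best = float("inf")
    for _ in range(repeats):
        copy = list(data)
        start = time.perf_counter()
        sort_fn(copy)
        best = min(best, time.perf_counter() - start)
        assert copy == expected, "sort_fn produced a wrong result"
    return best


def run_testbed(sort_fns, sizes, seed=0):
    """Return rows (name, n, seconds) for every algorithm and input size."""
    rng = random.Random(seed)        # fixed seed makes runs repeatable
    rows = []
    for n in sizes:
        data = [rng.randrange(n) for _ in range(n)]
        for name, fn in sort_fns.items():
            rows.append((name, n, time_sort(fn, data)))
    return rows
```

For example, `run_testbed({"builtin": list.sort}, [1000, 10000])` yields one row per algorithm and size, ready to be printed as a table or plotted.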