CSE 326: Data Structures
Lecture 23
Spring Quarter 2001
Sorting, Part 1
David Kaplan
davek@cs
Outline
• Sorting: The Problem Space
• Sorting by Comparison
– Lower bound for comparison sorts
– Insertion Sort
– Heap Sort
– Merge Sort
– Quick Sort
• External Sorting
• Comparison of Sorting by Comparison
Sorting: The Problem Space
• General problem
Given a set of N orderable items, put them in order
• Without (significant) loss of generality, assume:
– Items are integers
– Ordering is ≤
(Most sorting problems map to the above in linear time.)
Lower Bound
for Sorting by Comparison
• Sorting by Comparison
– Only information available to us is the set of N items to be sorted
– Only operation available to us is pairwise comparison between 2 items
• What is the best running time we can possibly achieve?
Decision Tree Analysis
of Sorting by Comparison
[Figure: decision tree for sorting three items A, B, C. Each internal node is labeled with the facts known so far, each edge with the result of one comparison (e.g. A<B or B<A), and each leaf with a final ordering, e.g. A,B,C or C,B,A.]
Internal node, with facts known so far
Leaf node, with ordering of A,B,C
Edge, with result of one comparison
Max depth of decision tree
• How many permutations are there of N numbers?
• How many leaves does the tree have?
• What’s the shallowest tree with a given number of leaves?
• What is therefore the worst running time (number of
comparisons) by the best possible sorting algorithm?
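The three questions above chain together: N items have N! permutations, so the decision tree needs N! leaves, and a binary tree with N! leaves has depth at least ceil(log2(N!)). A minimal Python sketch of that bound (the function name is mine, not from the slides):

```python
import math

def comparison_lower_bound(n: int) -> int:
    """Worst-case comparisons any comparison sort must make on n items:
    the decision tree has n! leaves, so depth >= ceil(log2(n!))."""
    return math.ceil(math.log2(math.factorial(n)))

for n in (3, 10, 100):
    print(n, comparison_lower_bound(n))
```

For n = 3 this gives 3 comparisons, matching the depth of the tree in the figure above.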
Lower Bound for log(n!)
Stirling’s approximation:
n! ≈ √(2πn) · (n/e)^n

log(n!) ≈ log( √(2πn) · (n/e)^n )
        = log(√(2πn)) + n · log(n/e)
        = Θ(n log n)
Insertion Sort
Basic idea
After kth pass, ensure that first k+1 elements are sorted
On kth pass, swap (k+1)th element to left as necessary
Start:        7 2 8 3 5 9 6
After Pass 1: 2 7 8 3 5 9 6
After Pass 2: 2 7 8 3 5 9 6
After Pass 3: 2 3 7 8 5 9 6
What if array is initially sorted?
What if array is initially reverse sorted?
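The passes above can be sketched as a short, runnable insertion sort (a minimal Python version; the slides themselves give no code):

```python
def insertion_sort(a):
    """Sort list a in place; after pass k, a[0..k] is sorted."""
    for k in range(1, len(a)):
        j = k
        # Swap the new element leftward until it is no smaller
        # than its left neighbor.
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return a

print(insertion_sort([7, 2, 8, 3, 5, 9, 6]))  # [2, 3, 5, 6, 7, 8, 9]
```

On an already-sorted array the inner while loop never fires (one comparison per pass, O(N) total); on a reverse-sorted array every pass swaps all the way to the front, giving O(N^2).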
Why Insertion Sort is Slow
• Inversion: a pair (i,j) such that i<j but Array[i] > Array[j]
• Array of size N can have Θ(N^2) inversions
– average number of inversions in a random set of elements is N(N-1)/4
• Insertion Sort only swaps adjacent elements
– each swap removes only 1 inversion!
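To make the inversion count concrete, here is a brute-force O(N^2) counter (illustrative only; the helper name is mine):

```python
from itertools import combinations

def count_inversions(a):
    """Count pairs (i, j) with i < j but a[i] > a[j]."""
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if a[i] > a[j])

# A reverse-sorted array of size N has the maximum N(N-1)/2 inversions:
print(count_inversions([5, 4, 3, 2, 1]))  # 10
```

Since insertion sort removes one inversion per swap, it must do at least N(N-1)/4 swaps on average, hence Θ(N^2) average running time.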
HeapSort
Sorting via Priority Queue (Heap)
[Figure: two heap examples. Worst case: heap built from 87, 44, 756, 13, 18, 801, 27, 23. Best case: heap built from 35, 8, 13, 18, 23, 27.]
Basic idea: Shove items into a priority queue, take them out
smallest to largest.
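The basic idea can be sketched with Python's heapq module standing in for the course's binary-heap priority queue:

```python
import heapq

def heap_sort(items):
    """Shove all items into a heap, then pop them out smallest-first."""
    heap = list(items)
    heapq.heapify(heap)  # build the heap in O(N)
    # N pops at O(log N) each gives O(N log N) total.
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heap_sort([87, 44, 756, 13, 18, 801, 27, 23]))
```

Note the pops dominate: regardless of input order, each deleteMin percolates down, so HeapSort is Θ(N log N) in both the best and worst case.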
MergeSort
MergeSort(Table[1..n])
  Split Table in half
  Recursively sort each half
  Merge two sorted halves together

Merge(T1[1..n], T2[1..n])
  i1 = 1, i2 = 1
  While i1 ≤ n and i2 ≤ n
    If T1[i1] < T2[i2]
      Next is T1[i1]
      i1++
    Else
      Next is T2[i2]
      i2++
    End If
  End While
  Copy whatever remains of the other table

(Analogy: merging two lanes of cars by key = aggressiveness of driver. Most aggressive goes first.)
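The pseudocode above translates almost line-for-line into Python; this sketch includes the remainder-copy step the merge loop needs once one table is exhausted:

```python
def merge(t1, t2):
    """Merge two sorted lists into one sorted list."""
    out, i1, i2 = [], 0, 0
    while i1 < len(t1) and i2 < len(t2):
        if t1[i1] < t2[i2]:
            out.append(t1[i1]); i1 += 1
        else:
            out.append(t2[i2]); i2 += 1
    # One list may have leftovers; they are already sorted, so append them.
    out.extend(t1[i1:])
    out.extend(t2[i2:])
    return out

def merge_sort(t):
    if len(t) <= 1:
        return t
    mid = len(t) // 2
    return merge(merge_sort(t[:mid]), merge_sort(t[mid:]))

print(merge_sort([7, 2, 8, 3, 5, 9, 6]))  # [2, 3, 5, 6, 7, 8, 9]
```

The extra output list is the "extra-memory problem" mentioned in the summary: merging needs O(N) scratch space.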
MergeSort Analysis
• Running Time
– Worst case?
– Best case?
– Average case?
• Other considerations besides running time?
QuickSort
[Figure: items arranged around pivots 15, 28, and 47, with everything to the left of each pivot less than it.]
Basic idea: Pick a “pivot”. Divide into less-than & greater-than pivot.
Sort each side recursively.
QuickSort Partition
Pick pivot (7):                 7 | 2 8 3 5 9 6
Partition with cursors:         scan inward from both ends
2 goes to less-than:            7 | 2 < | 8 3 5 9 6 >
6, 8 swap less/greater-than:    7 | 2 6 < | 3 5 9 | 8 >
3, 5 less-than; 9 greater-than: 7 | 2 6 3 5 < | 9 8 >
Partition done. Recursively sort each side: 2 6 3 5 and 9 8
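The partition idea can be sketched in Python. For brevity this version builds new lists rather than partitioning in place with cursors, so it shows the recursion but not the constant-space trick:

```python
def quick_sort(a):
    """Pick a pivot, split into less-than / greater-or-equal, recurse."""
    if len(a) <= 1:
        return a
    pivot = a[0]  # first element as pivot, as in the walkthrough above
    less    = [x for x in a[1:] if x < pivot]
    greater = [x for x in a[1:] if x >= pivot]
    return quick_sort(less) + [pivot] + quick_sort(greater)

print(quick_sort([7, 2, 8, 3, 5, 9, 6]))  # [2, 3, 5, 6, 7, 8, 9]
```

Taking the first element as pivot is exactly what triggers the worst case analyzed next: on sorted input, every partition is maximally lopsided.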
Analyzing QuickSort
• Picking pivot: constant time
• Partitioning: linear time
• Recursion: time for sorting left partition (say of size i) + time for right (size N-i-1)
T(1) = b
T(N) = T(i) + T(N-i-1) + cN
where i is the number of elements smaller than the pivot
QuickSort
Worst case
Pivot is always smallest element.
T(N) = T(i) + T(N-i-1) + cN
T(N) = T(N-1) + cN
     = T(N-2) + c(N-1) + cN
     = T(N-k) + c · Σ_{i=0}^{k-1} (N - i)
     = O(N^2)
Optimizing QuickSort
• Choosing the Pivot
– Randomly choose pivot
• Good theoretically and practically, but call to random number
generator can be expensive
– Pick pivot cleverly
• “Median-of-3” rule takes Median(first, middle, last). Works
well in practice.
• Cutoff
– Use simpler sorting technique below a certain problem
size (Weiss suggests using insertion sort, with a cutoff
limit of 5-20)
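A minimal median-of-3 helper might look like the following (a hypothetical sketch; the three-element sort is O(1), so pivot choice stays constant time):

```python
def median_of_3(a, lo, hi):
    """Return the index of the median of a[lo], a[mid], a[hi]."""
    mid = (lo + hi) // 2
    # Sort the three (value, index) pairs and take the middle one.
    trio = sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])
    return trio[1][1]

a = [9, 1, 5, 3, 7]
print(median_of_3(a, 0, len(a) - 1))  # index 4 (value 7)
```

On already-sorted input the median of first, middle, and last is the true median, so this rule turns QuickSort's classic worst case into its best case.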
QuickSort
Best Case
Pivot is always middle element.
T(N) = T(i) + T(N-i-1) + cN
T(N) = 2T(N/2 - 1) + cN
     < 2T(N/2) + cN
     < 4T(N/4) + c(2·(N/2) + N)
     < 8T(N/8) + cN(1 + 1 + 1)
     ...
     < kT(N/k) + cN · log(k) = O(N log N)
QuickSort
Average Case
• Assume all size partitions equally likely, with
probability 1/N
T ( N )  T (i )  T ( N  i  1)  cN
average value of T(i) or T(N-i-1) is (1/ N ) j 0 T ( j )
N 1


T ( N )  (2 / N ) j 0 T ( j )  cN
N 1
 O ( N log N )
details: Weiss pg 278-279
External Sorting
• When you just ain’t got enough RAM …
– e.g. Sort 10 billion numbers with 1 MB of RAM.
– Databases need to be very good at this
• MergeSort Good for Something!
– Basis for most external sorting routines
– Can sort any number of records using a tiny amount of
main memory
• in extreme case, only need to keep 2 records in memory at any
one time!
External MergeSort
• Split input into two tapes
• Each group of 1 record is sorted by definition, so merge groups of 1 into groups of 2, again split between two tapes
• Merge groups of 2 into groups of 4
• Repeat until data entirely sorted
log N passes
Better External MergeSort
• Suppose main memory can hold M records.
• Initially read in groups of M records and sort them (e.g.
with QuickSort).
• Number of passes reduced to log(N/M)
• k-way mergesort reduces number of passes to logk(N/M)
– Requires 2k output devices (e.g. mag tapes)
But wait, there’s more …
• Polyphase merge does a k-way mergesort using only k+1
output devices (plus kth-order Fibonacci numbers!)
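For the k-way merge step, Python's heapq.merge illustrates the idea: it streams k sorted runs, holding only one element per run in memory at a time (lists here stand in for tapes):

```python
import heapq

# Three sorted runs, as produced by the initial in-memory sorting pass.
runs = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]

# k-way merge: a small heap of k "front" elements picks the next output,
# so memory use is O(k), independent of total data size.
merged = list(heapq.merge(*runs))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

With k-way merging each pass multiplies the run length by k instead of 2, which is where the log_k(N/M) pass count comes from.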
Sorting by Comparison
Summary
• Sorting algorithms that only compare adjacent elements are Θ(N^2) worst case – but may be Θ(N) best case
• HeapSort - Θ(N log N) both best and worst case
– Suffers from two test-ops per data move
• MergeSort - Θ(N log N) running time
– Suffers from extra-memory problem
• QuickSort - Θ(N^2) worst case, Θ(N log N) best and average case
– In practice, median-of-3 almost always gets us Θ(N log N)
– Big win comes from {sorting in place, one test-op, few swaps}!
• Any comparison-based sorting algorithm is Ω(N log N)
• External sorting: MergeSort with Θ(log(N/M)) passes