Transcript Lecture 14

CS503: Fourteenth Lecture, Fall 2008
Amortized Analysis, Sets
Michael Barnathan
A Preliminary Note
• The withdrawal deadline is Nov. 4.
• THAT IS ONE WEEK FROM TODAY.
• It’s also election day, so if you wish to both
vote and withdraw from a course, you may
wish to plan ahead.
Here’s what we’ll be learning:
• Theory:
– Amortized Analysis
– More complex recurrences.
• Data Structures:
– Disjoint sets and unions (very quick overview).
• Java:
– Sets and Multisets.
Traditional Asymptotic Analysis
• Looks at the behavior of one operation.
– One insertion takes O(n) time…
– One search takes O(log n) time…
– One, one, one.
• If every operation takes the same amount of time, this is
perfectly fine. We can figure out the cost of the sequence.
– What is the total complexity of n operations, each taking time
proportional to O(n)?
• However, this is not always the case.
– What about Vectors, which increase in size when they are filled?
– Each insertion at the end takes O(1) time, until the array is full,
upon which the next insertion takes O(n) time.
Vector Doubling in Traditional Analysis
• Suppose we perform n insertions on a vector that employs the
doubling strategy.
• In traditional analysis, every operation has the same cost. So
what is the worst case cost of insertion into an array?
– O(n), because in the worst case, we double.
– This is despite the fact that most insertions take O(1) time, because
the majority do not double.
• We perform n insertions, each taking O(n) time.
• What is the bound upon the complexity?
– n * O(n) = O(n^2).
• This is clearly not a tight bound.
Amortization
• Amortized analysis analyzes worst-case performance over a
sequence of operations.
• It is not an average case analysis; it is an “average of worst
cases”.
• Going back to Vector insertion:
– If we perform 6 insertions into a Vector of size 5, 5 of those insertions
will take 1 unit of time. The sixth will take 6 units. (Both in the worst
case)
– Since all 6 insertions will take 11 units of time, one insertion
contributes roughly 2 time units, not on average, but in the worst case.
– 2 is a constant. We would expect constant time behavior on insert.
– An individual insertion may take longer (the sixth insertion takes 6
units of time, for example), but it will make up for it by preparing
subsequent insertions to run quickly (by doubling the array).
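To make the Vector example concrete, here is a minimal sketch of a doubling array in Java. The class name GrowableIntArray and the convention that add() returns its cost in "time units" (1 per write, 1 per element copied) are assumptions made for illustration; java.util.Vector does not report costs this way.

```java
import java.util.Arrays;

/** Hypothetical doubling array that reports the cost of each insertion in time units. */
class GrowableIntArray {
    private int[] data;
    private int size = 0;

    GrowableIntArray(int initialCapacity) {
        data = new int[Math.max(1, initialCapacity)];
    }

    /** Appends a value and returns the cost: 1 unit, plus 1 unit per element copied. */
    int add(int value) {
        int cost = 1;                       // writing the new element
        if (size == data.length) {          // array is full: double it
            cost += size;                   // copying every existing element
            data = Arrays.copyOf(data, 2 * data.length);
        }
        data[size++] = value;
        return cost;
    }

    int size() { return size; }
    int capacity() { return data.length; }

    public static void main(String[] args) {
        GrowableIntArray v = new GrowableIntArray(5);
        int total = 0;
        for (int i = 1; i <= 6; i++) {
            total += v.add(i);
        }
        // Five insertions cost 1 unit each; the sixth copies 5 elements and writes 1.
        System.out.println("total units = " + total); // prints "total units = 11"
    }
}
```

Dividing the 11 units by the 6 insertions gives just under 2 units per insertion, which is the "roughly 2" quoted above.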
Methods of Amortization
• There are three commonly employed amortized analysis
techniques.
• From least to most formal:
– The Aggregate Method:
• Count the number of time units across a sequence of operations and
divide by the number of operations.
– The Accounting Method:
• Each operation “deposits” time, which is then used to “pay for” expensive
operations.
– The Potential Method:
• A “potential function” φ is defined based on the change in state brought
about by each operation and the difference in potential is added to the
total cost (this difference may be negative).
• We won’t go into too much detail on this method.
• Each method has its limitations.
The Aggregate Method
• This is the simplest method of analysis.
• Simply add the worst-case cost of each operation in a
sequence up, then divide by the total number of operations in
the sequence:

\[ \text{AmortizedCost} = \frac{\sum_{i=1}^{n} \text{Cost}(i)}{n} \]
• The cost of each operation is very often defined
asymptotically, not as a number.
• But that’s OK; O(n) means “a linear function of n”.
• So O(n) / n = O(1), O(n^2) / n = O(n), and so forth.
The Aggregate Method – Example:
Data:  1     2     3
Cost:  O(1)  O(1)  O(1)
So far, the amortized cost is [n * O(1)] / n = O(1).
Data:  1     2     3     4     5     6     7
Cost:  O(1)  O(1)  O(1)  O(n)  O(1)  O(1)  O(1)
The fourth insertion doubles the array, an O(n) operation.
Now the amortized cost is [(n-1) * O(1) + O(n)] / n = O(1) + O(1) = O(1).
Caveats
• The lack of formalism in the aggregate method
has some consequences.
• Specifically, when using the aggregate
method, be careful with your asymptotics!
• O(n) at the 4th insertion is very different from
O(n) at the 32768th insertion!
• It is thus sometimes useful to define the
elementary cost of inserting without doubling
as simply “1” and to use exact numbers.
Again, with numbers.
Data:  1  2  3
Cost:  1  1  1
So far, the amortized cost is [3 * 1] / 3 = 1.
Data:  1  2  3  4  5  6  7
Cost:  1  1  1  4  1  1  1
The fourth insertion doubles the array, an O(n) operation.
Now the amortized cost is [6 * 1 + 4] / 7 = 10 / 7 ≈ 1.43.
How it Converges
• It turns out that this is always constant-time, no matter how
many insertions you do:
• For a sequence of n operations, the algorithm will double at n,
n/2, n/4, n/8, …
• The number of elements in the array at each double is the cost
of that doubling step (because it’s O(n)).
• So the cost is defined by a convergent series:
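\[ \text{Cost} = \underbrace{n}_{\text{cost of insertion}} + \underbrace{n + \frac{n}{2} + \frac{n}{4} + \frac{n}{8} + \cdots}_{\text{cost of doubling}} \]
\[ \frac{\text{Cost}}{n} = 1 + \left(1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots\right) \le 3 \]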
• So, at worst, it will take you thrice as long to use a doubling
array as a preallocated one.
• 3 * O(1) = O(1); this is still constant-time.
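The convergence claim can be checked numerically. The following is a rough sketch, assuming an array that starts at capacity 1, doubles whenever it is full, and charges 1 unit per write and 1 unit per element copied; the class and method names are made up for this example.

```java
/** Aggregate-method check: total cost of n appends to a doubling array of initial capacity 1. */
class AggregateCost {

    /** Returns the total time units for n appends (1 per write, 1 per element copied). */
    static long totalCost(int n) {
        long total = 0;
        long capacity = 1;
        for (long size = 0; size < n; size++) {
            if (size == capacity) {   // full: copy `size` elements into a doubled array
                total += size;
                capacity *= 2;
            }
            total += 1;               // the insertion itself
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 6, 1_000, 1_000_000}) {
            long total = totalCost(n);
            System.out.printf("n=%d total=%d amortized=%.3f%n", n, total, (double) total / n);
        }
    }
}
```

Under these assumptions the total is n plus the sum of the powers of 2 below n, so the amortized figure printed for each n stays below 3.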
The Accounting Method
• The accounting method begins by assigning each elementary
operation a cost of $1.
– The cost your analysis returns, of course, is then in terms of how long
those operations take.
• Each operation will then pay for:
– The actual cost of the operation, and
– The future cost of keeping that element maintained (for example,
copying it in the array).
• We save in advance so we have something to “spend” when
we double.
• We call the saved money “the bank”.
– The bank balance never goes negative; there are no subprime loans.
• This is somewhat difficult because it requires us to look ahead
to see what happens when we double.
What’s the cost?
• Each element we insert costs $1 immediately.
• When doubling:
– We will have to move each element to a new array; each
move costs $1.
– We will have to create a new element for each existing
element (because we’re doubling the size). Something is
eventually going to fill this spot as well. This will cost $1.
• So the total cost is $3 per insertion.
– $1 for now, $2 for the future.
• This is the same answer we received using the
aggregate method.
• But it requires careful inspection to arrive at.
Does it work?
• Remember, the bank must never go negative.
• Doubling costs $n+1: $n to copy the n
elements, $1 for the insertion that
immediately follows.
• Each insertion pays $3 and costs $1, so $2
goes into the bank at each non-doubling step.
• And each doubling costs $n+1.
Yes, it does.
i        1    2*   3*   4    5*   6
size     1    2    4    4    8    8
Deposit  $3   $3   $3   $3   $3   $3
Cost     $1   $2   $3   $1   $5   $1
Profit   $2   $1   $0   $2   -$2  $2
Bank     $2   $3   $3   $5   $3   $5
Columns marked * (red on the slide) are insertions that cause the array to double.
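The bank balances in the table can be reproduced with a few lines of Java. This is only a sketch under the table's assumptions: the array starts at capacity 1, every insertion deposits $3, a plain insertion costs $1, and an insertion that doubles the array also pays $1 per element copied. The class name AccountingBank is hypothetical.

```java
import java.util.Arrays;

/** Accounting-method check: each insertion deposits $3 and pays its actual cost. */
class AccountingBank {
    static final int DEPOSIT = 3;

    /** Returns the bank balance after each of n insertions into an array of initial capacity 1. */
    static long[] balances(int n) {
        long[] bank = new long[n];
        long balance = 0;
        long capacity = 1;
        for (int i = 1; i <= n; i++) {
            long cost = 1;                 // $1 for the insertion itself
            if (i - 1 == capacity) {       // array is full before this insertion
                cost += i - 1;             // $1 per element copied
                capacity *= 2;
            }
            balance += DEPOSIT - cost;
            bank[i - 1] = balance;
        }
        return bank;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(balances(6))); // prints [2, 3, 3, 5, 3, 5]
    }
}
```

Running balances() for larger n is a quick way to convince yourself that the balance never drops below zero with a $3 deposit.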
Potential Method
• Instead of “saving” and paying the cost later, the
potential method measures the “potential difference”
between two adjacent operations.
• This is defined by a potential function φ.
– φ(0) = 0.
– φ(i) ≥ 0 for all i.
• The amortized cost of operation i is determined by the
actual cost plus the difference in potential:
– ac_i = c_i + [φ(i) - φ(i-1)]
• The total cost is the sum of these individual costs:
\[ \sum_{i=1}^{n} \bigl( c_i + [\phi(i) - \phi(i-1)] \bigr) = \sum_{i=1}^{n} c_i + [\phi(n) - \phi(0)] \]
Potential Method Example
• For the array doubling problem, φ(i) = 2i - 2^ceil(lg(i)).
• That is, the difference between the amount that the
array size would double to (2i) and the least power of 2
greater than or equal to the current size of the array.
• If i-1 is a power of 2, insertion i doubles the array, so the actual
cost is i and the amortized cost is i + 2 - (i - 1) = 1 + 2 = 3.
• If i-1 is not a power of 2, ceil(lg(i)) = ceil(lg(i-1)), the
power-of-2 terms cancel, and the amortized cost is 1 + 2 - 0 = 3.
• So we get the same answer as in the other methods.
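The two cases can be verified mechanically. The sketch below assumes φ(i) = 2i - 2^ceil(lg(i)) with φ(0) = 0, and that insertion i costs i units when i-1 is a power of 2 (a doubling) and 1 unit otherwise. The names are illustrative only.

```java
/** Potential-method check for the doubling array, using phi(i) = 2i - 2^ceil(lg i). */
class PotentialMethod {

    /** Potential after i insertions; phi(0) is defined to be 0. */
    static long phi(long i) {
        if (i == 0) return 0;
        long high = Long.highestOneBit(i);            // greatest power of 2 <= i
        long ceilPow = (high == i) ? i : high << 1;   // least power of 2 >= i
        return 2 * i - ceilPow;
    }

    /** Actual cost of insertion i: copy i-1 elements if i-1 is a power of 2, then write 1. */
    static long actualCost(long i) {
        long prev = i - 1;
        boolean doubles = prev > 0 && Long.bitCount(prev) == 1;
        return doubles ? i : 1;
    }

    /** Amortized cost: actual cost plus the change in potential. */
    static long amortizedCost(long i) {
        return actualCost(i) + phi(i) - phi(i - 1);
    }

    public static void main(String[] args) {
        for (long i = 1; i <= 9; i++) {
            System.out.println("i=" + i + " actual=" + actualCost(i)
                    + " phi=" + phi(i) + " amortized=" + amortizedCost(i));
        }
    }
}
```

With these definitions every insertion from the second onward has amortized cost 3; the very first comes out as 2, because φ(0) is fixed at 0 rather than given by the formula.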
Sets
• A set is a data structure that can be used to
store records in which the key is the same as
the value, keeping all elements in the set
unique.
• There are two types in Java:
– TreeSets: Sorted Unique Containers.
– HashSets: Unsorted Unique Containers.
• A multiset, or bag, is like a set, but without
the uniqueness constraint.
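The Java standard library has no multiset class of its own, so one common workaround is a map from each element to its number of occurrences. The following is a minimal sketch of that idea; the class name Bag and its methods are hypothetical, not a library API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical multiset ("bag") built on a HashMap from element to occurrence count. */
class Bag<T> {
    private final Map<T, Integer> counts = new HashMap<>();
    private int size = 0;

    /** Adds one occurrence; duplicates are allowed, unlike in a Set. */
    void add(T element) {
        counts.merge(element, 1, Integer::sum);
        size++;
    }

    /** Removes one occurrence; returns false if the element is absent. */
    boolean remove(T element) {
        Integer c = counts.get(element);
        if (c == null) return false;
        if (c == 1) counts.remove(element); else counts.put(element, c - 1);
        size--;
        return true;
    }

    int count(T element) { return counts.getOrDefault(element, 0); }
    int size() { return size; }

    public static void main(String[] args) {
        Bag<String> bag = new Bag<>();
        bag.add("a");
        bag.add("a");
        bag.add("b");
        System.out.println(bag.count("a") + " " + bag.size()); // prints "2 3"
    }
}
```

Swapping the HashMap for a TreeMap would give a sorted bag, mirroring the HashSet/TreeSet distinction.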
Special Properties
• Elements in a set class are guaranteed unique.
• Attempting to insert an element that already exists will not modify the set
at all and will cause add() to return false.
• Sets can be split and merged.
– You can get the entire set of elements greater than or less than a target, for
example.
– Or you can merge two disjoint sets together.
• This is called a union operation.
• TreeSets are implemented using balanced Binary Search Trees (red-black
trees) in Java, providing O(log n) insertion, access, and deletion and
guaranteeing sorted order (remember to implement Comparable in your
classes, or supply a Comparator).
• HashSets are implemented using hash tables, providing average-case O(1)
insertion, access, and deletion, but not guaranteeing sorted order.
• Sets are thus appropriate data structures to use for operations such as
picking out the unique words in a book; a TreeSet can also output them in
sorted order.
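As a small illustration of the unique-words use case, here is a sketch using java.util.TreeSet and java.util.HashSet. The helper name uniqueSorted and the choice to lowercase the text and split on non-word characters are assumptions for the example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class UniqueWords {

    /** Returns the distinct words of the text in sorted order. */
    static TreeSet<String> uniqueSorted(String text) {
        TreeSet<String> words = new TreeSet<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);   // duplicates are silently ignored
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        System.out.println(seen.add("cat"));  // true: newly inserted
        System.out.println(seen.add("cat"));  // false: already present, set unchanged
        System.out.println(uniqueSorted("The cat and the hat")); // [and, cat, hat, the]
    }
}
```

A HashSet would collect the same unique words, but iterating over it gives no particular order.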
Methods
• Has some of the usual ones: add(), remove(), size(),
addAll()…
• But TreeSet also has some exotic ones that return elements or
subsets greater than or less than an element:
– higher(Object): Returns the least element > Object.
– lower(Object): Returns the greatest element < Object.
– floor(Object): Returns the greatest element ≤ Object.
– ceiling(Object): Returns the least element ≥ Object.
– headSet(Object): Returns the whole set < Object.
– tailSet(Object): Returns the whole set ≥ Object.
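A short example of these navigation methods on a java.util.TreeSet follows. The sample values are arbitrary. Note that in java.util.TreeSet the single-argument tailSet(e) includes e itself, while headSet(e) excludes it, and the element-returning methods return null when no such element exists.

```java
import java.util.Arrays;
import java.util.TreeSet;

class NavigationDemo {

    /** Builds a small sorted set to query. */
    static TreeSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(10, 20, 30, 40));
    }

    public static void main(String[] args) {
        TreeSet<Integer> s = sample();
        System.out.println(s.higher(20));   // 30
        System.out.println(s.lower(20));    // 10
        System.out.println(s.floor(25));    // 20
        System.out.println(s.ceiling(25));  // 30
        System.out.println(s.headSet(30));  // [10, 20]
        System.out.println(s.tailSet(30));  // [30, 40]
        System.out.println(s.higher(40));   // null
    }
}
```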
Special Set Operations
• Sets are usually represented as trees.
• The union algorithm merges two sets by attaching the smaller
set/tree to the larger tree. This is determined by the set’s
rank.
• The rank of the set, also defined as the Horton-Strahler
number of its tree, is as follows:
– A set with one element has a rank of 0.
– The result of a union between two sets of the same rank r is a set of
rank r+1.
• The optimal find algorithm utilizes a strategy called path
compression, which traverses up the tree and makes each
node it encounters a child of the root. The amortized running
time of this approach is O(α(n)), where α(n) is the inverse
Ackermann function. This function grows extraordinarily
slowly, so it effectively runs in amortized constant-time.
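Union by rank and path compression can be sketched in Java as follows. This is one common array-based formulation, assuming the elements are the integers 0 to n-1; the class name DisjointSets is made up, and Java's standard library does not provide this structure.

```java
/** Sketch of disjoint sets with union by rank and path compression. */
class DisjointSets {
    private final int[] parent;
    private final int[] rank;

    DisjointSets(int n) {
        parent = new int[n];
        rank = new int[n];               // every single-element set starts at rank 0
        for (int i = 0; i < n; i++) {
            parent[i] = i;               // each element is its own root
        }
    }

    /** Returns the root of x's set, making every node on the path a child of the root. */
    int find(int x) {
        int root = x;
        while (parent[root] != root) {
            root = parent[root];
        }
        while (parent[x] != root) {      // second pass: path compression
            int next = parent[x];
            parent[x] = root;
            x = next;
        }
        return root;
    }

    /** Merges the sets containing a and b; returns false if they were already one set. */
    boolean union(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) return false;
        if (rank[ra] < rank[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;                 // attach the lower-rank tree under the other root
        if (rank[ra] == rank[rb]) {
            rank[ra]++;                  // equal ranks: the result has rank r+1
        }
        return true;
    }

    public static void main(String[] args) {
        DisjointSets ds = new DisjointSets(6);
        ds.union(0, 1);
        ds.union(2, 3);
        ds.union(1, 3);
        System.out.println(ds.find(0) == ds.find(2)); // true
        System.out.println(ds.find(4) == ds.find(0)); // false
    }
}
```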
Recurrences, part 2.
• If we have a recurrence of the form T(n) =
a*T(n/b) + f(n), the solution can be found
using the Master Method.
• However, what if we have something like T(n)
= T(n/3) + T(n/4) + O(1)?
• Then we need to use a different method.
Solving Complex Recurrences
• Isolate the recursive terms:
– T(n) = T(n/3) + T(n/4), dropping the O(1) term.
– This part of the recurrence is called a homogeneous recurrence.
• Guess a general form solution for this part of the recurrence.
– Generally, these recurrences will be polynomial, so guess c*n^a.
• Plug your guess into the homogeneous recurrence:
– c*n^a = c*(n/3)^a + c*(n/4)^a.
• Solve for a (or at least get a bound on a):
– c*n^a = c*n^a/3^a + c*n^a/4^a.
– 1 = 1/3^a + 1/4^a.
– Does a = 1 work? 1 > 1/3 + 1/4, so it’s too high.
– a = .5 is too small, but close.
– a = .56 works (approximately).
• So the solution is O(n^.56).
• If the solution were of the same order as the driving function, we would
still need to multiply by log n.
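Rather than guessing values of a by hand, the equation 1 = 1/3^a + 1/4^a can be solved numerically. The sketch below uses bisection, which is a swapped-in technique rather than the trial-and-error approach of the slide; the class and method names are hypothetical.

```java
/** Numerically solves 1 = 1/3^a + 1/4^a, the exponent equation for T(n) = T(n/3) + T(n/4). */
class RecurrenceExponent {

    /** g(a) = 3^-a + 4^-a - 1; strictly decreasing in a, positive at 0, negative at 1. */
    static double g(double a) {
        return Math.pow(3, -a) + Math.pow(4, -a) - 1;
    }

    /** Bisection on [0, 1]: keep the half-interval on which g changes sign. */
    static double solve() {
        double lo = 0;
        double hi = 1;
        for (int iter = 0; iter < 60; iter++) {
            double mid = (lo + hi) / 2;
            if (g(mid) > 0) {
                lo = mid;   // guess too small: the right-hand side still exceeds 1
            } else {
                hi = mid;   // guess too large
            }
        }
        return (lo + hi) / 2;
    }

    public static void main(String[] args) {
        System.out.printf("a = %.4f%n", solve());
    }
}
```

The result agrees with the hand estimate of about .56.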
Performance on a Sequence
• We covered amortized analysis and sets today,
plus a bit on recurrences.
• Next time, we will discuss graphs – the root
data structure from which most others derive.
• We will also have a somewhat theoretical
assignment on analyzing the performance of
hashes next time.
• The lesson:
– Plan for the future. Plan your current actions to
make your future efforts easier.