This material in not in your text (except as exercises)

Download Report

Transcript This material in not in your text (except as exercises)

This material in not in your text (except as exercises)
• Sequence Comparisons
– Problems in molecular biology involve finding
the minimum number of edit steps which are
required to change one string into another.
– Three types of edit steps: insert, delete,
replace.
– Example: abbc babb
– abbc  bbc  bbb  babb (3 steps)
– abbc  babbc  babb (2 steps)
– We are trying to minimize the number of steps.
Idea: look at making just one position right. Find all
the ways you could use.
– Count how long each would take and recursively figure total
cost.
– Orderly way of limiting the exponential number of
combinations to think about.
– For ease in coding, we make the last character right (rather
than any other). (Then I can tell the routine the beginning
address and number of positions without having the string
begin in a different location. In C it would be no problem, but
other languages use a different technique.)
There are four possibilities (pick the cheapest)
1. If we delete an, we need to change A(n-1) to
B(m). The cost is C(n,m) = C(n-1,m) + 1
C(n,m) is the cost of changing the first n of str1
to the first m of str2.
2. If we insert a new value at the end of A(n) to
match bm, we would still have to change A(n)
to B(m-1). The cost is C(n,m) = C(n,m-1) + 1
3. If we replace an with bm, we still have to
change A(n-1) to B(m-1). The cost is C(n,m)
= C(n-1,m-1) + 1
4. If we match an with bm, we still have to change
A(n-1) to B(m-1). The cost is C(n,m) = C(n1,m-1)
– We have turned one problem into three
problems - just slightly smaller.
– Bad situation - unless we can reuse results.
Dynamic Programming.
– We store the results of C(i,j) for i = 1,n and j =
1,m.
– If we need to reconstruct how we would
achieve the change, we store both the cost and
an indication of which set of subproblems was
used.
M(i,j) which indicates which of the four decisions lead to the best result.
– Complexity: O(mn) - but needs O(mn)
space as well.
– Consider changing do to redo:
– Consider changing mane to mean:
Longest Increasing Subsequence
Find the longest increasing subsequence in a sequence of distinct
integers.
Idea 1. Given a sequence of size less than m, can find the longest
sequence of it. (Induction)
The problem is that we don't know how to increase the length.
Case 1: It either can be added to the longest subsequence or not
Case 2: It is possible that it can be added to a non-selected
subsequence (creating a sequence of equal length - but having a
smaller ending point)
Case 3: It can be added to a non-selected sub-sequence creating a
sequence of smaller length but successors make it a good choice.
Example: 5 1 10 2 20 30 40 4 5 6 7 8 9 10 11
Idea 2. Given a sequence of size string <
m, we know how to find all the longest
increasing subsequences.
– Hard. There are many, and we need it
for all lengths.
Idea 3 Given a sequence of size < m, can
find the longest subsequence with the
smallest ending point.
– We might have to create a smaller
subsequence, before we create a longer
one.
Idea 4. Given a sequence of size <m, can
find the best increasing sum (BIS) for every
length (k < m-1).
– For each new item in the sequence,
when we add it to the sequence of
length 3 will it be better than the current
sequence of length 4?
For s= 1 to n (or recursively the other
way)
For k = s downto 1 until find correct spot
If BIS(k) > As and BIS(k-1) < As
BIS(k) = As
Actually, we don't need the sequential search as
can do a binary search.
5 1 10 2 12 8 15 18 45 6 7 3 8 9
Length BIS
1
1
2
2
3
3
4
7
5
8
6
9
– To output the sequence would be difficult as
you don't know where the sequence is. You
would have to reconstruct.
Probabilistic Algorithms
– Suppose we wanted to find a number that is greater than
the median (the number for which half are bigger).
– We could sort them - O(n log n) and then select one.
– We could find the biggest - but stop looking half way
through. O(n/2)
– Cannot guarantee one in the upper half in less than n/2
comparisons.
– What if you just wanted good odds?
– Pick two numbers, pick the larger one. What is
probability it is in the lower half?
There are four possibilities:
–
both are lower
–
the first is lower the other higher.
–
the first is higher the other lower
–
both are higher.
We will be right 75% of the time! We only lose if
both are in the lowest half.
Select k elements and pick the biggest,
the probability is 1 - 1/2k . Good odds controlled odds.
– Termed a Monte Carlo algorithm. It may
– give the wrong result with very small
probability.
Another type of probabilistic algorithm is one
that never gives a wrong result, but its running
time is not guaranteed.
– Termed Las Vegas algorithm as you are
guaranteed success if you try long enough.
A coloring Problem: Las Vegas Style
– Let S be a set with n elements. (n only effects
complexity not algorithm)
– Let S1, S2... Sk be a collection of
– distinct (in some way different)
– subsets of S, each containing exactly r elements such
that k 2r-2 . (Use this bound below)
– Color each element of S with one of two colors (red or
blue) such that each
– subset Si contains at least one red and one blue
element.
Idea
– Try coloring them randomly and they just
checking to see if you happen to
– win. Checking is fast, as you can quit
checking each subset when you see
– one of each. You can quit checking the
collection when any single color subset is
found.
– What is the probability that all items in a set
are red? 1/2r
– as equal probability that each color is
assigned and r items in the set.
What is the probability that any one of the collection
is all red?
– k/2r
– Since we are looking for the {\it or} of a set of
probabilities, we add.
– k is bound by 2r-2 so k*1/2r <= 1/4
– The probability of all blue or all red in a single
set is one half. (double probability of all red)
– If our random coloring fails, we simply try
again until success.
– Our expected number of tries is 2.
Finding a Majority
– Let E be a sequence of integers
x1,x2,x3, ... xn The multiplicity of x in E
is the number of times x appears in E. A
number z
– is a majority in E if its multiplicity is
greater than n/2.
– Problem: given a sequence of numbers,
find the majority in the sequence or
determine that none exists.
– For example, suppose there is an
election. Candidates are represented as
integers. Votes are represented as a list
of candidate numbers.
– We are assuming no limit of the number
of possible candidates.
Ideas
1. sort the list O(n log n)
2. If have a balanced tree of candidate names
would be n log c (where c is number of
candidates) Note, if we don’t know how many
candidates, we can’t give them indices.
3. See if median (kth largest algorithm) occurs
more than n/2 times. O(n)
4. Take a small sample. Find the majority - then
count how many times it occurs in the whole
list. This is a probabilistic approach.
5. Make one pass - Discard elements that won’t
affect majority.
Note:
– if xi  xj and we remove both of them,
then the majority in the old list is the
majority in the new list.
– If xi is a majority there are m x_i s
out of n, where m > n/2. If we remove
two elements, (m-1 > (n-2)/2).
– The converse is not true. If there is no
majority, removing two may make
something a majority in the smaller list:
1,2,4,5,5.
Thus, our algorithm will find a possible
majority.
– Algorithm: find two unequal elements. Delete them.
Find the majority in the smaller list. Then see if it is a
majority in the original list.
– How do we remove elements? It is easy. We scan the
list in order.
– We are looking for a pair to eliminate.
– Let i be the current position. All the items before xi
which have not been eliminated have the same value.
All you really need to keep is the number of times,
Occurs, this candidate, C value occurs (which has not
been deleted).
For example:
List:
1 4 6 3 4 4 4 2 9 0 2 4 1 4 2 2 3 2 4 2
Occurs:
X X 1 X 1 2 3 2 1 X 1 X 1 X 1 2 1 2 1 2
Candidate: 1
6
4 4 4 4 4 ? 2 ? 1 ? 2 2 2 2 2 2
2 is a candidate, but is not a majority in the whole
list.
– Complexity: n-1 compares to find a candidate.
n-1 compares to test if it is a majority.
So why do this over other ways? Simple to code.
No different in terms of complexity, but
interesting to think about.