Clustering for Accuracy, Performance, and Alternative

Download Report

Transcript Clustering for Accuracy, Performance, and Alternative

Data Structures and
Algorithms
XKCD
2
…Outline
•
•
•
•
Analyzing algorithms
Designing Algorithms
Profiling
Heuristics
– Ex) hash-based sequence alignment
3
Insertion Sort Pseudocode
(review conventions: arrays, indentation, loops, logic, etc.)
• Input: array A[0..n-1]
• Insertion-Sort(A)
“for” loop convention
iterative or counting
– 1 for j = 1 to length[A]-1
– 2 key = A[ j]
– 3 //insert A[ j] into the sorted sequence A[0..j-1]
– 4 i = j-1
“while” loop convention
– 5 while i > -1 and A[i] > key do “while” expression is true
–6
A[i+1] = A[i]
–7
i = i -1
Cormen, Intro to Algs.
– 8 A[i+1] = key
4
Insertion Sort
1)
Design algorithm (as opposed to
Bubble Sort)
2) Implement algorithm
Left of key is sorted
Right of key is unsorted
1 for j = 1 to length[A]-1
2 key = A[ j]
3
//insert A[ j] into the
sorted sequence A[0..j-1]
4 i = j-1
5 while i > -1 and A[i] > key
6
A[i+1] = A[i]
7
i = i -1
8 A[i+1] = key
5
Analyzing Algorithms
• predicting resources that an algorithm
requires
– memory
– communication bandwidth
– logic gates
– computational time (most often measured)
• In other words, how many “steps” does
Insertion Sort take to complete???
6
Analyzing Insertion Sort
• time taken by Insertion Sort depends on
input: 1000 takes longer than 3 numbers
• can take different amounts of time to sort 2
input sequences of the same size -- Why?
• in general, the time taken by an algorithm
grows with the size of the input
• describe the running time of a program as
function of the size of input
7
Analyzing Insertion Sort
• running time
– function of number of steps executed
– assume a constant amount of time is required
to execute each line of pseudocode
8
Analyzing Insertion Sort
• Insertion-Sort(A)
– 1 for j = 1 to length[A]-1
– 2 key = A[ j]
c2
– 3 //insert A[ j]
– 4 i = j-1
– 5 while i > -1 and A[i] > key
–6
A[i+1] = A[i]
–7
i = i -1
– 8 A[i+1] = key
–  nj=0 tj = n(n+1)/2 -1
cost
c1
n-1
0
c4
c5
c6
c7
c8
time
n
n-1
 nj=0 tj
 nj=0 tj -1
 nj=0 tj -1
n-1
Algorithms, Cormen
9
Analyzing Insertion Sort
• T(n) = c1n + c2(n-1) +c4(n-1) + c5 (n(n+1)/2 -1) +
c6(n(n-1)/2) +c7(n(n-1)/2) + c8(n-1)
• = (c5/2 + c6/2 + c7/2)n2 + (c1+c2+c4+c5/2 - c6/2 c7/2 +c8)n - (c2 +c4+c5+c8)
• = k1n2 + k2n -k3
• (n2) = asymptotic upper and lower bound
• O(n2) = asymptotic upper bound
• Typical complexities
– O(1) < O(logn) < O(n) < O(nlogn) < O(n2) < O(n3) < O(2n)
Linear time
10
Insertion Sort Observations
• numbers are sorted “in place”
– numbers are rearranged within the array, with at most a constant
number of them stored outside of the array at any time
• Run time depends on input
– the number of operations for the following 3 sets will vary greatly
due to level of “pre-sortedness”
– a descending sorted order will actually take more operations
than a random ordering
– algorithm complexity analysis allows us to place upper and lower
asymptotic bounds for comparison
3 5 6 7 9 8 10 15 20 30 69
3 5 6 7 9 8 10 15 1 12 20
20 12 15 10 9 8 7 6 5 3
1
11
Insertion Sort Profile
A=321
Step
j
key
i
A[i]
A[i+1]
1
1
2
1
2
3
1
2
0
3
2
4
0 > -1 and 3 > 2  true
5
1
2
0
3
3
2
-1
Undef
3
1 for j = 1 to length[A]-1
2 key = A[ j]
3
i = j-1
4 while i > -1 and A[i] >
key
5
A[i+1] = A[i]
6
i = i -1
7 A[i+1] = key
A=331
6
1
4
-1 > -1 and Undef > 2  false
7
1
A=231
2
-1
Undef
2
12
What about naïve.pl?
snt[] = array of subject nucleotides
qnt[] = array of query nucleotieds
for i = 0 to length(subject) – length(query)
j=0
while (snt[i + j ] == qnt[j])
j=j+1
if (j == length (query))
found sequence at position i
end
c1
c2
c3 * n
c4 * n
c5 * n * n
c6 * n * n
c7 * n * n
c8 * n * n
O( n2)
13
Definitions
procedural programming languages – tend
to be action oriented (as opposed to
Object Oriented Programming – OOP)
• subroutine: a collection of high-level
programming language operations
procedure: (Pascal – did not return a
value)
• function: (Pascal – did return a value)
14
Machine Instructions: At the lowest level, every
program consists of primitive machine instructions.
move.L D0, 20004
Language Statements: High-level languages
consist of statements that perform one or more
machine instructions.
$i = $k + 9;
Subroutines: Subroutines consist of groups of
language statements.
$sequence = &print_formated_sequence(@qnts,$i);
Programs: Programs consist of groups of
subroutines
C: a s/w engineering approach, Darnell
15
Subroutines
• programs are developed with layers of functions
– lower-level functions perform simple operations
– higher-level functions are created from lower-level functions
• analogous to abbreviations for long and complicated sets of
commands
• defined once, but invoked many times
–
–
–
–
ease of change
modular and re-usable
enhanced reliability (complicated tasks broken into simpler ones)
improved readability
• with low-level details of algorithm compartmentalized, an algorithm may be
easier to read, understand, and modify
– good rule of thumb – if your subroutine spans more than 1 printed page,
I would expect at least 1 bug
16
Bioinformatics example
• Optimal sequence alignment (allowing for
gaps and substitutions in either query or
subject sequence)
17
Heuristics
• What do you do when faced with an NPcomplete problem, or problem size where
algorithm takes too long?
• Example – want to compare 2 genomes (brute
force)
– naïve.pl O( n2)
– 3*10^9*3*10^9 * 1*10-9 S = 9* 10^9 S = 285 years
• Alternative
– hash k-tuples of nucleotides to a number, and
compare numbers
18
Hash-Based Alignment
base-10 numbers
5805 = 5*10^3 + 8*10^2 + 0*10^1+5*10^0
k=8
ATGCCTGGGCT
A=0, C=1, G=2, and T=3 (base 4 number)
ATGCCTGG = 0*4^7+3*4^6+2*4^5+1*4^4+1*4^3+3*4^2+2*4^1+2*4^0
= 14714
Now we can compare “chunks” of sequence much faster - speed
increase by factor of 8
Can pre-compute hashes for entire genome, and only compare hashes
Premise for popular alignment tools – BLAST, BLAT,and UIcluster
19
Heuristics
• Usually a trade off
• In sequence hashing example
– accuracy is traded for speed
– you cannot match/find sequences shorter
than 8 nucleotides
• How do you find optimal k-tuple?
– depends on question
– empirically
20
End
21
Overly simple example of
compartmentalizing
Count the number of nucleotides in a file.
open file
while there is more sequence
read a nucleotide
increment count
print nt count
close file
22
Another Example
(divide and conquer)
• Find the average intron size for all human
genes
•
•
•
•
•
Get human genome
Get genes
Find indices of exons/introns
Size = index2 - index1
Tabulate and average
23
Recursion
• recursion – partially consists or is defined
in terms of itself
• examples
– mirrors
– video camera of television
– factorial function for non-negative integers
• n! =
– a) 0! = 1
– b) if n >0, then n! = n(n-1)!
• 3! = 3(2)! = 3(2(1!)) = 3(2(1(0!))) = 3(2(1(1)))= 6
24
Recursion
• power is in ability to define an infinite set
of objects by a finite statement
• tool for expressing a program recursively
is the subroutine (procedure/function)
– directly recursive: subroutine P contains
reference to itself
– indirectly recursive: P contains reference to
another subroutine Q, which contains a (direct
or indirect) reference to P
25
#!/usr/bin/perl
#
# simple example perl program to calculate
# the factorial of a number using (gasp) "Recursion"
#
# BUG found
print "Enter integer number to determine factorial:";
$i=<STDIN>; # get number
chomp($i); # remove "newline"
$i = int $i; # removes any decimals
if($i < 0)
{
die("Error: input must be positive integer");
}
$j = &Fact($i);
print "($i)!=";
print "$j\n";
#end of program
######################
sub Fact()
{
my $num = shift;
# How can $num be N, and then N-1, then N-2, etc.????
#print "num = $num\n";
$new_num = $num-1;
if($new_num == 0)
{
return(1);
}
else
{
$fact = $num * &Fact($new_num);
}
return($fact);
}
26
Program Iterations or Profile
n=5
[tabraun@texas fact]$ ./fact-test.pl
Enter integer number to determine factorial:5
num = 5
num = 4
num = 3
num = 2
num = 1
(5)!=120
27
Recursion
• Cut and paste (*) here
• Cut and paste (Cut and paste (*) here) here
• Cut and paste (Cut and paste (Cut and paste (*) here) here) here
• … Etc.
28
Variable Scope
• global variables: variables that are accessible/visible
from any part of a program
• local variables: accessible to a limited portion of the
program
– ensures that variables are not unintentionally manipulated
• perl
– variables are always global unless you specify otherwise
my $variable_name specifies a “local” variable
• scope usually refers to “blocks of code”
– for loop
– while loop from insertion sort
• example – scope.pl
29
#!/usr/bin/perl
$i = 5;
print "i=$i\n";
{
print "i=$i\n";
$i = 3;
print "i=$i\n";
}
print "i=$i\n";
30
Can we do better than Insertion
Sort O(n2)?
• Merge-Sort(A,p,r)
1 if p < r
2 q = |(p+r)/2|
3 Merge-Sort(A,p,q)
4 Merge-Sort(A,q+1,r)
5 Merge(A,p,q,r)
6 return
Divide and conquer example -- sorted array of
length 1 is already sorted.
31
Merge-Sort Split Steps
52461326
5246
1326
52
5
46
2
4
6
1
13
26
3
2
6
Merge-sort changes the problem from one of sorting numbers, to one of simply
combining stacks of numbers that are already sorted. At the leaf level of this
graph, individual numbers are already sorted (because a single number by
32
itself is sorted).
Merge-Sort Analysis
O(nlog2n) -- can we do better???
33
Algorithmic Concepts
• Greedy algorithms
– always makes the choice that looks “best” at
the moment.
– makes a locally optimal choice in the hope
that this choice will lead to a globally optimal
solution
– do not always yield optimal solutions
34
Knapsack Problem
• A thief finds n items
– item i is worth vi dollars, and weighs wi
pounds, where vi and wi are integers
– thief wants to maximize value, but is limited to
W pounds
– What items should thief take?
35
Knapsack
36
NP-Completeness and NP-Hard
• polynomial-time algorithms
– naïve, insertion-sort, merge-sort, fact
– O(nk)
• Can all problems be solved in polynomial time?
–
–
–
–
no
this class of problem is called “NP-Complete”
these problems are intractable
valuable to know when a problem is NP-Complete so
that you do not waste time attempting to develop a
solution
– approach is to look for approximation of solution
37
NP-Complete example
• Traveling-salesman problem
– a salesman must visit N cities
– wants to visit every city exactly once
– wants to minimize travel distance
38
Dynamic Programming
• like divide-and-conquer, DP solves problems by combining
the solutions of subproblems (“Progamming” refers to a
tabular method, NOT writing computer code.)
• D and C generally have independent subproblems
• DP is most applicable when subproblems are not
independent - i.e., D and C does more work than necessary,
repeatedly solving the common subsubproblems
• DP solves each subsubproblem once, and saves the results
in a table.
• work is avoided since the answer does not have to be
recomputed every time the subsubproblem is encountered
• Example
– global sequence alignment -- Smith-Waterman
39
Assignment: Debugging naïve.pl
•
•
•
Due:
bug exists in algorithm
find input scenario where algorithm breaks
•
•
•
•
Assignment 1
Obtain naïve.pl (web)
Execute it (with your perl)
Alter input to determine bug
1.
Submit a version of the program with the 2 input lines that illustrate
the bug
@snt = (A, A, …. );
@qnt = (A, T, ….);
2.
Describe a solution by inserting comments into the program
Submit your altered and commented program to icon
40