Clustering for Accuracy, Performance, and Alternative

Download Report

Transcript Clustering for Accuracy, Performance, and Alternative

Data Structures and
Algorithms
2
Outline
• What is a data structure
• Examples
– elementary data structures
– hash tables
• Computer capabilities
• What is an algorithm
• Pseudocode/examples
– naïve alignment (and debugging)
3
Data Structures
• Informal definition: an organization of
information, usually in computer memory,
to improve or simplify algorithm
performance. Associated data structure
algorithms typically exist to maintain the
properties of data structures (search,
insert, delete, push, pop, etc.)
4
Data Structures
• Elementary data structures
– arrays
•
•
•
•
linear replication of a data type
useful for holding related items of identical type
multi-dimensional
conceptually, naturally maps to computer memory
– Abstractions -- stacks and queues
5
Arrays – allocation of space
An array of chars (bytes):
0 1 2 3 4 5 6 7 8 9
A A A T G C T G A T
0 1 2 3 4 5 6 7 8 9
1 1 1 1 1 1 2 2 2 2
An array of integers:
0
1
2
3
4
5
100 200 250 269 300 11
6
12
7
13
8
1
9
15
6
Languages
In high level language such as C, data types are declared:
int a, b, c;
c = a+b;
Perl:
$c=$a+$b;
Note that Perl does not require the specification of data
type (however, as we will see later, this is useful for rapid
prototyping, but can also be conducive to programming
mistakes)
7
Examples of C/Perl arrays
C:
char Dna[6];
char tissue[4][6];
Dna[0] = 'A';
Dna[1] = 'A';
strcpy(tissue[0],"liver");
strcpy(tissue[1],"kidney");
Perl:
@Dna = (A, A, A, T, C, G);
@tissue = (“liver”, “kidney”, “heart”, “brain”);
$tissue[0]=“liver”;
$tissue[1]=“kidney”;
8
Java
char myArray[];
// Note how the type declaration is decoupled from the memory allocation
myArray = new char[10];
myArray[0]='A';
9
Stack Example
•
•
•
•
•
push 15
push 6
push 9
push 2
pop returns 2
• LIFO
10
Queue Example
•
•
•
•
Enqueue(15)
Enqueue(6)
Enqueue(9)
Enqueue(2)
• DeQueue returns 15
• FIFO
11
Linked List
• A linked list is a data structure in which the
objects are arranged in a linear order,
however, the order is encoded within the
data structure itself by a “pointer” (as
opposed to array indices).
• “dynamic”
• “sparse”
12
Linked List
13
Hash Table or Associative Array
• A hash table is similar to an array, in
that it is a linear collection of data types,
with individual elements selected by
some index value (key). Unlike arrays,
the index values (keys) are arbitrary.
• “hash function” maps keys to elements
• do not have to search for values, but
there is overhead of “hash function”
• O(1) to examine an arbitrary position
14
Array VS Hash
Keys
Values
15
Hash Table Example
%aminos = ( "TTT", "F", # Key Value pairs
"TTC", "F",
"TTA", "L",
"TTG", "L",
"CTT", "L",
"CTC", "L",
"CTA", "L",
"CTG", "L",
"ATT", "I",
"ATC", "I",
"ATA", "I",
"ATG", "M",
"GTT", "V",
"GTC", "V",
"GTA", "V",
"GTG", "V",
"TCT", "S",
"TCC", "S",
"TCA", "S",
"TCG", "S“)
16
Objects or Records
• Complicated
extensions
• drug_target
–
–
–
–
–
study_id
clone
date
gene_identity
id_technique
–
–
–
–
–
–
–
–
–
–
cell_source
pathology
special_conditions
regulation
confirmation_diff_expr
ession
ocular_expr_profile
cytogenetics
genotyping_status
priority
reference
17
Data Structures and Abstraction
Applications
Objects
Communication
and Data Sharing
API
Objects
Data
18
What computers/software can and
cannot do
• Can
– simple (a=a+1)
– fast (1 instruction in 1*10-9 s)
– repetitive
• Cannot
– associate (a cloud looks like Mickey Mouse)
– vision
– however, we can define sets of rules that can stratify
(becomes very complicated and difficult)
– algorithms (computers) are black and white, and the
world is gray
19
Algorithms
• Informally, an algorithm is any well-defined
computational procedure that takes some value, or
set of values, as input and produces some value, or
set of values, as output.
– finite set of steps, one or more operations per
step
– An algorithm is correct if, for every input
instance, it halts with the correct output.
– Example: sorting
• Input: A sequence of n numbers (a1, a2, ….,an).
• Output: A permutation (reordering) (a1’, a2’, …,an’) of
the input sequence such that a1’<=a2’<=…<=an’.
20
Algorithms
• How to validate?
– Mathematically prove (usually impractical)
– Case base proving/testing
• How to devise?
– mimic a human procedure
– follow a template
– create
• How to analyze?
– complexity analysis
– profiling
21
Pseudocode
• An abstract, informal representation of
algorithms as computational operations
that is similar to C, Pascal, Perl (or other
programming languages).
• Examples:
– naïve sequence search/alignment
– insertion sort (sort a hand of cards)
22
Naïve Alignment
• ATC
• AAATCG
NO
• ATC
• AAATCG
NO
• ATC
• AAATCG
YES
23
Algorithms-- naïve alignment -first try
•
Example – naïve sequence search and alignment
– align some small number (10 nucleotides) -- called the "query" to some large
number (3 billion nts) -- called the "subject"
– 10 s with BLAT (uses significantly more efficient algorithm)
snt[] = array of subject nucleotides
qnt[] = array of query nucleotides
for i = 0 to length(query)
#i will be index for query sequence
j=0
while (snt[i + j ] == qnt[j])
# but here, j is index for query sequence???
j=j+1
if (j == length (query))
found sequence at position i
end
query = ATC
subject = AAATCG
24
Algorithms- Refinement
snt[] = array of subject nucleotides
qnt[] = array of query nucleotides
for i = 0 to length(subject) – length(query)
j=0
while (snt[i + j ] == qnt[j])
j=j+1
if (j == length (query))
found sequence at position i
end
query = ATC
subject = AAATCG
Modern machine could do this, but what if query, subject are 100 nucleotides, and
30 billion?
This can be done, but it will not scale to 100 seconds, because you can no longer
hold 30 billion nucleotides in memory.
You will have to swap portions of the 30 billion back to disk, and read in a new
portion
This overhead will adversely affect the performance of the algorithm
25
j=1
Naïve Alignment
j=0
• ATC
• AAATCG
NO
i=0
j=0
• ATC
• AAATCG
NO
snt[] = array of subject nucleotides
qnt[] = array of query nucleotides
for i = 0 to length(subject) – length(query)
j=0
while (snt[i + j ] == qnt[j])
j=j+1
if (j == length (query))
found sequence at position i
end
i=1
•
ATC
• AAATCG
YES
26
End
27