Computational Questions
Download
Report
Transcript Computational Questions
Computational Questions
Bioinformatics
Where CS and Biology Meet
Bioinformatics: Applications of CS to
the life sciences
What are the computational issues?
Storage and retrieval of genetic data, data
mining, tools
Analysis of genetic data: similarities,
differences, structure
Processing experimental data
Problem Solving in
Computer Science
Program: Sequence of instructions that
perform a particular task
Task (problem) expressed as: Given
data (input), produce results (output)
From problems to programs
Formulate the problem
Develop and verify an algorithm
Write and test the program
Algorithm Analysis
Algorithm: Conceptual/theoretical form
of a program
What is analyzed?
Correctness: does it solve the problem?
Complexity: how much resources (time
and memory) does it consume?
Tradeoffs: sometimes, we need to sacrifice
correctness for efficiency
Example 1: Searching for an
Element in a List
Problem formulation:
Input: sorted list L of n elements (e.g.,
names) and a target element x
Output: the position of the target element
if it exists in the list
Possible algorithms
Linear search
Binary search
Linear Search
Algorithm:
For each element in L (from the first to the
last element), compare it with x and return
the position if equal
Time complexity:
Up to n comparisons performed
On the average, n/2 comparisons
Runs in linear time (proportional to the list
size n)
Binary Search
Algorithm:
Compare middle element of the list with x,
return the position if equal; if not, reduce
the list to either the lower half or the upper
half of original list; repeat the process
Time complexity
Up to log2n comparisons performed
Runs in logarithmic time
Linear vs Logarithmic Time
n
n/2
log n
10
5
4
100
50
7
1000
500
10
1,000,000
500,000
20
2,000,000
1,000,000
21
Comparing Running Times
Exercise: tabulate values of the following
run-time functions for different values of n
Functions:
log n (logarithmic)
n (linear)
n2 (quadratic)
n3 (cubic)
2n (exponential)
n!
Example 2: Substring Search
Problem formulation:
Example:
Input: Strings s and t of characters
Output: If s is a substring of t, its position
in t
Input: s = “ctct”, t = “agtctcttctaac”,
Output: 4
Algorithm? Time Complexity?
Example 3: Traveling
Salesman
Problem Formulation:
Input: n cities, distances between cities
Output: shortest tour of all cities
Algorithm:
Consider all permutations of the cities,
compute total distances for each
permutation, select the minimum among
all total distances
Exponential Algorithms and
Intractable Problems
The Traveling Salesman problem is an
example of an intractable (“NP-complete”)
problem
Characterized by:
The existence of a correct exponential algorithm
No known polynomial algorithm
Exponential algorithm is impractical. Now
what?
Heuristics
There are polynomial algorithms for
intractable problems that do not always yield
the correct answer
Example: Start with any city, go to the
nearest unvisited city, repeat process
Not always correct. Counterexample?
Selection of nearest city is called a heuristic
Compromise: Can prove some statements on
the (incorrect) algorithm and that may be
enough in practice
Back to Bioinformatics:
Some Objectives
Formulate problems relevant to biology
Devise/understand algorithms for these
problems
Computer scientists and biologists need to
talk more
Computer scientists have a tendency to make
(often unreasonable) assumptions
Biologists may place too much faith on results
returned by automated systems
Overview: Selected Problems
in Bioinformatics
Sequence alignment
Phylogeny
Dealing with experimental results
BLAST Search
Blast Results
DNA Sequence Databases
Data representation, integrity, accuracy
Search and scoring methods
Meaning and reliability of results
e.g., how does BLAST
(Basic Local Alignment Search Tool)
respond to random data?
Sequence Alignment Problem
Given two nucleotide sequence, obtain
an optimal alignment between the
sequences
Example:
AT-C-TGAT
-TGCAT-A-
Dynamic Programming
T
A
T
C
T
G
A
T
G
C
A
T
A
Phylogeny
Construction of phylogenetic trees
based on genomic distance
Problems to be solved:
Determining genomic distance
Tree construction from the distances
Determining Genomic Distance
Given two genomes, determine the
number of mutations necessary to
obtain one from the other
Common distance model (least number
of mutations)
Mutation on the genome level:
rearrangement (sorting!) operations on
permutations
Sorting Permutations and a
Graph Theoretic Model
0
3
5
6
7
2
1
4
8
9
0
1
2
7
6
5
3
4
8
9
Phylogenetic Tree
Reconstruction
Given a set of species and genomic
distances between the species,
construct a phylogenetic tree that is
(most) consistent with the distances
Problem shown to be NP-complete
This means we should try some heuristics
Phylogenetic Tree
Mouse
Monkey
Human
Experimental Results
Image or data directly drawn from a
device
Need to make objective, discrete
conclusions
e.g., microarray, scanner
e.g., pixel intensity vs. gene expression
Need to handle errors and imperfections
Microarray Image
Image Analysis to Aid
Microarray Experiments
Automatically locating the grid of spots
Use Fourier transforms to compute periods
and offsets
Extracting intensity
Refine spot sample to collect significant,
normalized data
Make conclusions on genetic function
Summary
Bioinformatics: a perfect opportunity for
interdisciplinary research within the
sciences
Academics from the different
backgrounds need to study, discuss,
debate with each other