Computational Questions

Download Report

Transcript Computational Questions

Computational Questions
Bioinformatics
Where CS and Biology Meet


Bioinformatics: Applications of CS to
the life sciences
What are the computational issues?



Storage and retrieval of genetic data, data
mining, tools
Analysis of genetic data: similarities,
differences, structure
Processing experimental data
Problem Solving in
Computer Science



Program: Sequence of instructions that
perform a particular task
Task (problem) expressed as: Given
data (input), produce results (output)
From problems to programs



Formulate the problem
Develop and verify an algorithm
Write and test the program
Algorithm Analysis


Algorithm: Conceptual/theoretical form
of a program
What is analyzed?



Correctness: does it solve the problem?
Complexity: how much resources (time
and memory) does it consume?
Tradeoffs: sometimes, we need to sacrifice
correctness for efficiency
Example 1: Searching for an
Element in a List

Problem formulation:



Input: sorted list L of n elements (e.g.,
names) and a target element x
Output: the position of the target element
if it exists in the list
Possible algorithms


Linear search
Binary search
Linear Search

Algorithm:


For each element in L (from the first to the
last element), compare it with x and return
the position if equal
Time complexity:



Up to n comparisons performed
On the average, n/2 comparisons
Runs in linear time (proportional to the list
size n)
Binary Search

Algorithm:


Compare middle element of the list with x,
return the position if equal; if not, reduce
the list to either the lower half or the upper
half of original list; repeat the process
Time complexity


Up to log2n comparisons performed
Runs in logarithmic time
Linear vs Logarithmic Time
n
n/2
log n
10
5
4
100
50
7
1000
500
10
1,000,000
500,000
20
2,000,000
1,000,000
21
Comparing Running Times


Exercise: tabulate values of the following
run-time functions for different values of n
Functions:






log n (logarithmic)
n (linear)
n2 (quadratic)
n3 (cubic)
2n (exponential)
n!
Example 2: Substring Search

Problem formulation:



Example:



Input: Strings s and t of characters
Output: If s is a substring of t, its position
in t
Input: s = “ctct”, t = “agtctcttctaac”,
Output: 4
Algorithm? Time Complexity?
Example 3: Traveling
Salesman

Problem Formulation:



Input: n cities, distances between cities
Output: shortest tour of all cities
Algorithm:

Consider all permutations of the cities,
compute total distances for each
permutation, select the minimum among
all total distances
Exponential Algorithms and
Intractable Problems


The Traveling Salesman problem is an
example of an intractable (“NP-complete”)
problem
Characterized by:



The existence of a correct exponential algorithm
No known polynomial algorithm
Exponential algorithm is impractical. Now
what?
Heuristics


There are polynomial algorithms for
intractable problems that do not always yield
the correct answer
Example: Start with any city, go to the
nearest unvisited city, repeat process



Not always correct. Counterexample?
Selection of nearest city is called a heuristic
Compromise: Can prove some statements on
the (incorrect) algorithm and that may be
enough in practice
Back to Bioinformatics:
Some Objectives



Formulate problems relevant to biology
Devise/understand algorithms for these
problems
Computer scientists and biologists need to
talk more


Computer scientists have a tendency to make
(often unreasonable) assumptions
Biologists may place too much faith on results
returned by automated systems
Overview: Selected Problems
in Bioinformatics



Sequence alignment
Phylogeny
Dealing with experimental results
BLAST Search
Blast Results
DNA Sequence Databases



Data representation, integrity, accuracy
Search and scoring methods
Meaning and reliability of results

e.g., how does BLAST
(Basic Local Alignment Search Tool)
respond to random data?
Sequence Alignment Problem


Given two nucleotide sequence, obtain
an optimal alignment between the
sequences
Example:
AT-C-TGAT
-TGCAT-A-
Dynamic Programming
T
A
T
C
T
G
A
T
G
C
A
T
A
Phylogeny


Construction of phylogenetic trees
based on genomic distance
Problems to be solved:


Determining genomic distance
Tree construction from the distances
Determining Genomic Distance



Given two genomes, determine the
number of mutations necessary to
obtain one from the other
Common distance model (least number
of mutations)
Mutation on the genome level:
rearrangement (sorting!) operations on
permutations
Sorting Permutations and a
Graph Theoretic Model
0
3
5
6
7
2
1
4
8
9
0
1
2
7
6
5
3
4
8
9
Phylogenetic Tree
Reconstruction


Given a set of species and genomic
distances between the species,
construct a phylogenetic tree that is
(most) consistent with the distances
Problem shown to be NP-complete

This means we should try some heuristics
Phylogenetic Tree
Mouse
Monkey
Human
Experimental Results

Image or data directly drawn from a
device


Need to make objective, discrete
conclusions


e.g., microarray, scanner
e.g., pixel intensity vs. gene expression
Need to handle errors and imperfections
Microarray Image
Image Analysis to Aid
Microarray Experiments

Automatically locating the grid of spots


Use Fourier transforms to compute periods
and offsets
Extracting intensity


Refine spot sample to collect significant,
normalized data
Make conclusions on genetic function
Summary


Bioinformatics: a perfect opportunity for
interdisciplinary research within the
sciences
Academics from the different
backgrounds need to study, discuss,
debate with each other