presentation source

Download Report

Transcript presentation source

Complexity Issues in Bioinformatics
Winfried Just
Department of Mathematics and
Quantitative Biology Institute
Ohio University
Biology’s dilemma: There is too
much to know about living things
Roughly 1.5 million species of organisms have been
described and given scientific names to date. Some
biologists estimate that the total number of all living
species may be several times higher. It is impossible to
learn everything about all these organisms. Biologists
solve the dilemma by focusing on some species, so-called
model organisms, and trying to find out as much as they
can about these model organisms.
Some important model organisms
Mammals: Homo sapiens, Chimpanzee, mouse, rat
Fish: Zebrafish, Pufferfish
Insects: Fruitfly (Drosophila melanogaster)
Roundworms: Ceanorhabditis elegans
Protista: Malaria parasite (Plasmodium falciparum)
Fungi: Yeast (Saccharomyces cerevisiae, S. pombe)
Plants: Thale cress (Arabidopsis thaliana), corn, rice
Bacteria: Escherichia coli, salmonella
Archea: Methanococcus janaschii
Let’s find out everything about
some species
What would it mean to learn everything about a given
species? All available evidence indicates that the complete
blueprint for making an organism is encoded in the
organism’s genome. Chemically, the genome consists of
one or several DNA molecules. These are long strings
composed of pairs of nucleotides. There are only four
different nucleotides, denoted by A, C, G, T. The
information about how to make the organism is encoded
by the order in which the nucleotides appear.
Some genome sizes
 HIV2 virus
 Mycoplasma genitalis
 Haemophilus influenzae
 Saccharomyces cerevisiae
 Caenorhabditis elegans
 Drosophila melanogaster
 Homo sapiens
 Some amphibians
 Amoeba dubia
9671 bp
5.8 · 105 bp
1.83 · 106 bp
1.21 · 107 bp
108 bp
1.65 · 108 bp
3.14 · 109 bp
8 · 1010 bp
6.7 · 1011 bp
Sequencing Genomes
Contemporary technology makes it possible to completely
sequence entire genomes, that is, determine the sequence
of A’s, C’s, G’s, and T’s in the organism’s genome. The
first virus was sequenced in the 1980’s, the first
bacterium (Haemophilus influenzae) in 1995, the first
multicellular organism (Caenorhabditis elegans) in 1998.
The rough draft of the human genome was announced in
June 2000.
Where to store all these data?
Some of the sequence data are stored in proprietary data
bases, but most of them are stored in the public data base
Genbank and can be accessed via the World Wide Web.
In fact, most relevant journals require proof of submission
to Genbank before an article discussing sequence data will
be published. A notable exception was the publication of
Celera’s announcement in Science.
What’s in the databases?
As of February 20, 2000, Genbank contained
5,861,088,510 bp of information. There were about 600
completely sequenced viruses, 19 completely sequenced
bacteria, 6 completely sequenced archaea, and 3
complete genomes of eukaryotes: S. cerevisiaea (baker’s
yeast), C. elegans (a roundworm), and Drosophila
melanogaster (fruitfly).
What’s in the databases?
As of November 23, 2000, Genbank contained
10,853,673,034 bp of information. There were about 600
completely sequenced viruses, 29 completely sequenced
bacteria, 8 completely sequenced archaea, and 3
complete genomes of eukariotes: S. cerevisiaea (baker’s
yeast), C. elegans (a roundworm), and Drosophila
melanogaster (fruitfly). The genome of Arabidopsis (thale
cress) was near completion, and a first draft of the human
genome had been completed.
What’s in the databases?
As of March 18, 2002, Genbank contained
20,197,497,568 bp of information. There were about 700
completely sequenced viruses, 63 completely sequenced
bacteria, 13 completely sequenced archaea, and 5
complete genomes of eukaryotes: S. cerevisiaea, S. pombe
(two yeasts), C. elegans (a roundworm), Drosophila
melanogaster (fruitfly) and Arabidopsis thaliana (thale
cress), as well as a draft of the human genome.
Time-complexity of algorithms
Suppose you have an algorithm that takes discrete objects
(such as DNA sequences) as inputs. We say that the
worst-case time complexity of this algorithm is O(f(n)) if
there is a constant M such that on each input of size n the
algorithm produces an output after running for at most
M ·f(n) steps. We say that the average-case time
complexity of this algorithm is O(f(n)) if there is a constant
M such that the expected number of steps on inputs of size
n is bounded by M ·f(n).
Consequences of database
growth
For many algorithms in bioinformatics, the whole data set
in Genbank is the input. Let us make the conservative
estimate that the size of this input doubles every nine
months. Moore’s Law says that computer speed doubles
every 18 months. Thus if you have an algorithm with
running time O(n2) that takes Genbank as input and that
takes one hour to run today on the fastest available
computer, it will take eight hours to run on the fastest
available computer 18 months from now.
How to overcome the problem?
 Designing faster and faster algorithms will remain a
constant challenge in bioinformatics.
 If the worst-case running time of an algorithm is too
long, its average running time may still make it suitable
for your purposes.
 If your problem is an optimization problem, one can
work with approximation algorithms that are not
guaranteed to find the best solution, but a pretty good
one.
 One can use heuristic algorithms that have no
performance guarantee whatsoever, but in most cases
appear to find pretty good solutions.
You have sequenced your genome
- what do you do with it?
This is known as genome analysis or sequence analysis.
At present, most of bioinformatics is concerned with
sequence analysis. Here are some of the questions
studied in sequence analysis:
 gene finding
 protein 3D structure prediction
 gene function prediction
 prediction of important sites in proteins
 reconstruction of phylogenetic trees
From genes to proteins
Proteins are the workhorses of biochemistry. Practically
all chemical reactions in a cell are catalyzed by proteins
and many proteins have diverse other functions. From the
chemical point of view, proteins are long chains of
chemicals called amino acids. There are 20 amino acids
that are the building blocks for most proteins in most
organisms. Amino acids are coded by triplets of
nucleotides, which are also called codons. A stretch of a
genome that codes for a given protein is called a gene.
What did the gene finding
algorithms find?
 Mycoplasma genitalis (bacterium)
500 Genes
 Escherichia coli (bacterium)
4,500 Genes
 Saccharomyces cerevisiae (yeast)
6,000 Genes
 Caenorhabditis elegans (worm)
19,000 Genes
 Drosophila Melanogaster (fruitfly)
13,500 Genes
 Arabidopsis thaliana (thale cress)
25,500 Genes
 Homo sapiens (Human)
24,000-40,000 Genes
 Oryza sativa japonica (rice)
32,000-50,000 Genes
 Oryza sativa indica (rice)
45,000-56,000 Genes
So we know the genes - do we
know everything?
Far from it. The next two questions are:
 Given a single gene, how does it function in the
biology of an organism?
 How do various genes interact?
Prediction of gene function
Suppose you have identified a gene. What is its role in the
biochemistry of its organism? Sequence databases can
help us in formulating reasonable hypotheses.
 Search the database for genes with similar nucleotide
sequences in other organisms.
 If the functions of the most similar genes are known and
if they tend to be the same function (e.g., “codes
enzyme involved in glucose metabolism”), then it is
reasonable to conjecture that your gene also codes an
enzyme involved in glucose metabolism.
How to detect similarity of
genes?
DNA and protein molecules evolve mostly by three
processes: point mutations (exchange of a single letter for
another), insertions, and deletions. If two genes have
evolved from a common ancestral gene, then it should be
possible to detect the similarity by inserting gaps into the
two sequences that correspond to insertions and deletions
so that the nucleotides derived from common ancestral
nucleotides end up on top of each other. This process is
known as sequence alignment. The columns of the
alignment should show significantly more nucleotide
identities than would be expected by chance.
Example of an alignment
ACT- GGTTCAGTC CG
| |
| | |
| | | |
AGTAGGTA - -GACCG
How to score an alignment
Alignments are scored column by column. There is usually
a reward for character matches, a small penalty for
character mismatches, and a large penalty for gaps.
These penalties should be such that the highest scoring
alignment is the most likely one to reflect the true
evolutionary relationship of the loci. Two sequences can
be considered “similar” if their best alignment has a
sufficiently high score. The Smith-Waterman algorithm
finds the best scoring local alignment of two sequences of
length n and m respectively in running time O(n ·m).
Prediction of gene function:
homology searches
Given a nucleotide or DNA sequence, searching the data
base(s) for similar sequences is known as “homology
searches”. This is essentially the task of finding all (local)
alignments between the data base of length n and a query
sequence of length m. The most popular software tool for
performing these searches is called BLAST; therefore
biologists often speak of “BLAST searches”. It is a
heuristic algorithm with average running time O(n),
independently of the length of the query sequence.
Prediction of important sites in
proteins
Not all parts of a protein are equally important; the
function of most of its amino acids is often just to maintain
an appropriate 3D structure, and mutations of those less
crucial amino acids often don't have much effect.
However, most proteins have crucial parts such as
binding sites. Any mutations occurring at binding sites
tend to be lethal and will be weeded out by evolution.
How to predict binding sites from
sequence data:
 Get a collection of proteins of similar amino acid
sequences and analogous biochemical function from
your database.
 Align these sequences amino acid by amino acid.
 Check which regions of the protein are highly conserved
in the course of evolution.
 The binding site should be in one of the highly
conserved regions.
Example of a multiple alignment
A T G
|
A C G
|
- C G
|
- T G
- - T T C G
|
A A T C C A
|
A A T C C T
|
A G C A C T
G A C
|
G - C
|
A A C
|
A A C
T
T
C
C
How to score multiple sequence
alignments?
The most popular way of scoring multiple sequence
alignments, the so-called SP-score (sum of pairs score),
works by computing the scores of all the induced pairwise
alignments and summing them up. The pairwise
alignments are being scored by a fixed scoring scheme
that rewards or penalizes matches, mismatches, and gaps.
Computational complexity of
multiple sequence alignment
The best scoring alignment should be the most likely
hypothesis about the evolutionarily correct alignment.
Finding the best scoring alignment for a given scoring
scheme is called the multiple sequence alignment problem.
The size of an instance is given by the number k of
sequences to be aligned and the length n of the longest
sequence. For any fixed k the problem can be solved
in time O(nk) by a variant of the Smith-Waterman
algorithm. Can we do any better?
The class P
A decision problem is in the class P if there is an algorithm
that always gives the correct answer and has worst-case
running time of O(na) for a fixed a, where n is the size of
the input. An optimization problem is in the class P if the
corresponding decision problem of determining for every
input and given number c whether there is a solution with
score at least c is in the class P.
The class NP
A decision problem of the form “there exists an object J
such that for the given input I property(I,J) holds”
is in the class NP if there is an algorithm that for any
given input I and given object J determines whether
property(I,J) holds and has worst-case running
time of
O(na) for a fixed a, where n is the size of the input. Note
that the decision problem corresponding to the multiple
sequence alignment problem is in the class NP.
The P = NP problem
It is not known whether the classes P and NP are identical.
This is probably the biggest open problem in mathematics.
There are literally hundreds of so-called NP-hard problems.
These are problems in the class NP of which it is know that
if any one of these problems is in the class P, then the
classes P and NP coincide. But nobody has ever found a
polynomial-time algorithm for any of the NP-hard
problems. This is strong empirical evidence for the
conjecture that P is not equal NP, but no proof exists.
Computational complexity of
multiple sequence alignment
With respect to the number of sequences k, the multiple
sequence alignment problem is NP-hard. This was shown
around 1993 by Wang and Jiang. Unfortunately, their
proof uses a biologically totally irrelevant scoring scheme.
It was not clear until recently whether the problem might
become tractable for scoring schemes that are actually
used by biologists, or if we limit the number of gaps that
can be inserted. It was also unknown whether there might
be always a PTAS (polynomial time approximation
scheme).
Polynomial-time approximation
schemes
An optimization problem is said to have a PTAS if for every
ε > 0 there exists an algorithm A(ε) that finds a solution
that is within a factor of 1 - ε of optimum and whose
running time is O(na(ε)). NP-hard optimization problems
may still have a PTAS if a(ε) increases to infinity as ε
decreases to zero.
Computational complexity of
multiple sequence alignment
Theorem 1: (Just) For every scoring scheme used by
biologists, the multiple sequence alignment problem is
NP-hard. This remains true even if the number and size
of gaps that can be inserted into each sequence is
restricted in the most severe way possible.
Theorem 2: (Just) There exists a scoring scheme for
which the multiple alignment problem has no PTAS, even
if the number and size of gaps that can be inserted into
each sequence is most severely restricted.
A few words about the proof
Theorems 1 and 2 were proved by reducing the MAX-CUT
problem for graphs to the multiple sequence alignment
problem. That is, given a simple graph G, a multiple
sequence alignment problem is constructed in such a way
that from a (nearly) optimal solution of the sequence
alignment problem a cut in the graph G of (nearly)
maximal size can be reconstructed in polynomial time.
An open question
A scoring scheme is metric if the mismatch penalties
satisfy the triangle inequality. Biologically meaningful
scoring schemes have to be metric in this sense.
However, the scoring scheme used in the proof of
Theorem 2 does not have this property.
Problem: (Jiang, Kearney and Li, 1999) Is there a PTAS
for the multiple sequence alignment problem for metric
scoring schemes?
An partial answer
Theorem 3: (Just, Della Vedova) The restriction of the
multiple sequence alignment problem to alignments that
insert at most L spaces into each sequence has a PTAS
that works for all metric scoring schemes.
This theorem was proved by conceptualizing the multiple
sequence alignment problem as a facility location problem.
The practice of multiple
sequence alignment
The most frequently used multiple sequence alignment
algorithm used in practice is CLUSTAL. This is a heuristic
algorithm for which no performance guarantee is known.
Using genomic data for
reconstruction of phylogenies
A phylogenetic tree depicts the branching pattern in the
evolution of contemporary species from their common
ancestor. Given the sequences of homologous genes
(i.e., genes derived from a common ancestral gene), one
can try to reconstruct the phylogenetic tree for these
species by looking at the amount of evolutionary change
that has occurred at the molecular level and estimating the
times at which any two of these species diverged.
Reconstructing phylogenies:
How to get started
Most reconstructions of phylogenetic trees from molecular
sequences start from a multiple alignment of these
sequences. The quality of the phylogenetic tree thus
constructed crucially depends on the quality of the multiple
sequence alignment.
Methods for reconstruction of
phylogenies
 Distance methods
 Maximum parsimony
 Maximum likelihood
 Bayesian Analysis
Methods for reconstruction of
phylogenies
 Maximum parsimony: Given a tree, count the number of
character changes that must have occurred during
evolution if the tree is correct. The most parsimonious
tree is the one that postulates the fewest such changes
among all possible trees.
 Maximum likelihood: Given a tree and a model of
molecular evolution, compute the probability that the
sequences actually observed in the extant species have
evolved. The maximum likelihood tree is the one for
which this probability is largest among all possible trees.
The computational challenge for
reconstruction of phylogenies
Maximum parsimony and maximum likelihood require us to
consider all possible trees. Given n species, there are
(2n-3)!/[2n-2(n-2)!] distinct possible (rooted) trees for
depicting the phylogenetic relationship between these
species. For n = 10, this number is 34,459,425;
for n = 15 it is already equal to 213,458,046,676,875,
for n = 21 we get half a mole of trees.
Again, we are confronted with a problem whose most
obvious solution requires exponential running time.
Reconstruction of phylogenies:
A success story
There are two basic kinds of free-living organisms:
prokaryotes, that do not have a cell nucleus, and
eukaryotes, which do. Prokaryotes fall into two major
groups: Eubacteria and Archaea. Phenotypically,
eubacteria and archaea are very similar to each other.
However, it has been demonstrated by using molecular
data that archaea are more closely related to eukaryotes
than to eubacteria, and thus it appears that the
evolutionary branching between archaea and eubacteria
occurred before the branching of archaea and eukaryotes.
Gene interactions: Collecting
gene expression data
All cells of a multicellular organism have the same set of
genes. What accounts for the differences in various cell
types and function is which of the genes are being
expressed (switched on) at a given time in a given cell.
A relatively new technology, called gene chips or
microarrays makes it possible to monitor, for tens of
thousands of genes simultaneously, the differences in gene
expression levels between two different experimental
conditions.
Gene interactions: Interpeting
gene expression data
Once gene expression data have been collected, it is
possible to identify clusters of genes that have similar
expression profiles, that is, are up- or downregulated
under the same experimental conditions. One then
conjectures that genes with similar expression profiles
have similar functions, for example, are involved in the
same biochemical pathways. Such conjectures can serve
as powerful guides for setting up experiments to confirm
the biochemical role of groups of genes.
Interpeting gene expression data:
A mathematical challenge
Gene expression data sets are peculiar in the sense that
we typically have very few experiments (5-10 perhaps)
and a large number (tens of thousands) monitored genes.
It seems inevitable that some genes will show similar
expression profiles just by random accident.
Question: How can we tell spurious clusters of genes
from biologically meaningful ones?
Gene expression profiles:
A success story
Cancer patients with the same clinical picture often
respond to very different types of treatment. Gene
expression profiles of groups of cancer patients have
revealed that what looks to the clinician as the same
disease can sometimes be one of several diseases at the
biochemical level. The latter can be distinguished by
characteristic expression profiles of certain groups of
genes. Once the biochemical nature of the disease has
been established, treatment can be tailored to the type of
disease a patient actually has.