Mathematics and the Genome
Winfried Just
Department of Mathematics and
Quantitative Biology Institute
Ohio University
This talk is dedicated to the memory of Dr. Pawel Zbierski,
one of the great teachers in my life.
Biology’s dilemma: There is too
much to know about living things
Roughly 1.5 million species of organisms have been
described and given scientific names to date. Some
biologists estimate that the total number of all living
species may be several times higher. It is impossible to
learn everything about all these organisms. Biologists
solve the dilemma by focusing on some species, so-called
model organisms, and trying to find out as much as they
can about these model organisms.
Some important model organisms
Mammals: Homo sapiens, Chimpanzee, mouse, rat
Fish: Zebrafish, Pufferfish
Insects: Fruitfly (Drosophila melanogaster)
Roundworms: Caenorhabditis elegans
Protista: Malaria parasite (Plasmodium falciparum)
Fungi: Yeast (Saccharomyces cerevisiae, S. pombe)
Plants: Thale cress (Arabidopsis thaliana), corn, rice
Bacteria: Escherichia coli, Salmonella
Archaea: Methanococcus jannaschii
Let’s find out everything about
some species
What would it mean to learn everything about a given
species? All available evidence indicates that the complete
blueprint for making an organism is encoded in the
organism’s genome. Chemically, the genome consists of
one or several DNA molecules. These are long strings
composed of pairs of nucleotides. There are only four
different nucleotides, denoted by A, C, G, T. The
information about how to make the organism is encoded
by the order in which the nucleotides appear.
Some genome sizes
• HIV-2 virus: 9671 bp
• Mycoplasma genitalium: 5.8 · 10^5 bp
• Haemophilus influenzae: 1.83 · 10^6 bp
• Saccharomyces cerevisiae: 1.21 · 10^7 bp
• Caenorhabditis elegans: 10^8 bp
• Drosophila melanogaster: 1.65 · 10^8 bp
• Homo sapiens: 3.14 · 10^9 bp
• Some amphibians: 8 · 10^10 bp
• Amoeba dubia: 6.7 · 10^11 bp
Sequencing Genomes
Contemporary technology makes it possible to completely
sequence entire genomes, that is, determine the sequence
of A’s, C’s, G’s, and T’s in the organism’s genome. The
first virus was sequenced in the 1970s, the first
bacterium (Haemophilus influenzae) in 1995, the first
multicellular organism (Caenorhabditis elegans) in 1998.
The rough draft of the human genome was announced in
June 2000.
How rough is the draft of the
human genome?
• Announced June 2000
• Covers about 95% of the genome
• Contains more than 100,000 gaps
• Public version: started 1990, based on the genome of one person
• Celera version: started 1998, based on the genomes of five persons
Where to store all these data?
Some of the sequence data are stored in proprietary data
bases, but most of them are stored in the public data base
Genbank and can be accessed via the World Wide Web.
In fact, most relevant journals require proof of submission
to Genbank before an article discussing sequence data will
be published. A notable exception was the publication of
Celera’s announcement in Science.
What’s in the databases?
As of February 20, 2000, Genbank contained
5,861,088,510 bp of information. There were about 600
completely sequenced viruses, 19 completely sequenced
bacteria, 6 completely sequenced archaea, and 3
complete genomes of eukaryotes: S. cerevisiae (baker’s
yeast), C. elegans (a roundworm), and Drosophila
melanogaster (fruitfly).
What’s in the databases?
As of November 23, 2000, Genbank contained
10,853,673,034 bp of information. There were about 600
completely sequenced viruses, 29 completely sequenced
bacteria, 8 completely sequenced archaea, and 3
complete genomes of eukaryotes: S. cerevisiae (baker’s
yeast), C. elegans (a roundworm), and Drosophila
melanogaster (fruitfly). The genome of Arabidopsis (thale
cress) was near completion, and a first draft of the human
genome had been completed.
What’s in the databases?
As of March 18, 2002, Genbank contained
20,197,497,568 bp of information. There were about 700
completely sequenced viruses, 63 completely sequenced
bacteria, 13 completely sequenced archaea, and 5
complete genomes of eukaryotes: S. cerevisiae, S. pombe
(two yeasts), C. elegans (a roundworm), Drosophila
melanogaster (fruitfly) and Arabidopsis thaliana (thale
cress), as well as a draft of the human genome.
First mathematical challenge:
Sequencing large genomes
Currently, much of the sequencing process is automated.
However, contemporary sequencing machines can only
sequence stretches of DNA that are a few hundred base
pairs long at a time. The process of assembling these
stretches of sequence into a whole genome poses some
interesting mathematical problems.
First mathematical challenge:
Sequencing large genomes
For example, the publicly financed Human Genome Project
used an approach called genome mapping to facilitate
sequence assembly. Most of the HGP's time was in
fact spent on constructing the scaffold of this map.
In contrast, Celera Genomics allegedly used an approach
called shotgun sequencing that works by randomly cutting
up the genome into small stretches, sequencing them, and
then using a clever algorithm to assemble the whole
genome. There was much debate over the feasibility of
the latter approach, but it apparently worked.
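To make the assembly idea concrete, here is a minimal sketch of a greedy overlap-merge procedure. It is only an illustration under strong assumptions (error-free reads, no repeats, reads all from one strand) and is not the algorithm Celera used; the function names and the toy reads are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads with the largest overlap.

    Returns the remaining contigs once no pair overlaps by min_len or more.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads
        # Glue read j onto read i, keeping the shared stretch only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Toy example: three overlapping stretches of a 15-letter "genome".
toy_reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
print(greedy_assemble(toy_reads))  # ['ATGGCGTACGTTAGC']
```

Real genomes contain long repeated regions, which is one reason the greedy strategy can fail and why the feasibility of whole-genome shotgun assembly was debated.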
You have sequenced your genome
- what do you do with it?
This is known as genome analysis or sequence analysis.
At present, most of bioinformatics is concerned with
sequence analysis. Here are some of the questions
studied in sequence analysis:
• gene finding
• protein 3D structure prediction
• gene function prediction
• prediction of important sites in proteins
• reconstruction of phylogenetic trees
How the genome controls the
organism
The genome controls the making and workings of an
organism by telling the cell which proteins to manufacture
under which conditions. Proteins are the workhorses of
biochemistry and play a variety of roles. Most notably,
many proteins are enzymes that catalyze specific
chemical reactions. Nearly all biochemical reactions are
catalyzed by enzymes.
A gene is a stretch of DNA that codes a given protein.
Where are the genes?
The objective of gene finding is to identify the regions of
DNA that are genes. Ideally, we want to make statements
like: “Positions 28,354 through 29,536 of this genome code
a protein.” Once we have identified a gene, it is easy to
translate the DNA code into the sequence of amino acids
that make up the corresponding protein.
The mathematical challenge here is to identify patterns in
DNA that reliably indicate where a gene starts and ends,
especially in eukaryotes.
Hidden Markov Models for gene
finding (caricature)
Most current gene finding programs are based on Hidden
Markov Models. These work as follows: assume (wrongly)
that the DNA-sequence has been generated randomly by a
Markov model that can be in one of two states: “gene” or
“intergenic region.” Each state has a characteristic
probability of “emitting” a given nucleotide, and has a
characteristic (low) probability of switching to the other
state. The observer sees the sequence of emissions
(nucleotides), but which state emitted a given
nucleotide is hidden from the observer.
Hidden Markov Models for gene
finding (caricature)
Now the observer wants to infer the actual sequence of
states of the Markov model that caused the observed
emissions. This sequence is called the path through the
hidden Markov model. The (posterior) probability of any
given path is easy to calculate, and it is computationally
inexpensive to infer the most likely path for a given
sequence of emissions (using the so-called Viterbi
algorithm). This path gives some hypothesis for the
location of the genes. It is also easy to calculate
probabilities that a predicted gene is actually a gene
(under the assumptions of the model).
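The caricature model can be written down in a few lines. The sketch below runs the Viterbi algorithm on a two-state model; all probabilities (start, switching, and emission values) are invented for illustration and are not taken from any real gene finder.

```python
import math

STATES = ("gene", "intergenic")

# All numbers below are made-up illustrative values.
START = {"gene": 0.5, "intergenic": 0.5}
TRANS = {
    "gene": {"gene": 0.9, "intergenic": 0.1},
    "intergenic": {"gene": 0.1, "intergenic": 0.9},
}
EMIT = {
    "gene": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq: str) -> list[str]:
    """Return the most likely state path for a non-empty nucleotide string."""
    # score[s] = log-probability of the best path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for x in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][x])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


path = viterbi("ATATAT" + "GCGCGCGC" + "ATATAT")
print("".join("G" if s == "gene" else "-" for s in path))
```

Working with logarithms avoids numerical underflow on long sequences. The running time grows linearly with the sequence length, which is why inferring the most likely path is computationally inexpensive.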
Hidden Markov Models for gene
finding - the real picture
In reality, the situation is much more complicated. Coding
regions of genes are not characterized by frequencies of
single nucleotides, but of triplets and hexamers of
nucleotides. Additional information, such as signals that
indicate the beginning or end of a gene or a splice site,
is also used. Additional difficulties arise because of:
• existence of six possible reading frames
• existence of introns in eukaryotes
• variable codon usage frequencies in different species
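The six reading frames come from three possible starting offsets on each of the two DNA strands. A small sketch that lists them follows; the function names and the frame labels "+1" to "-3" are a convention chosen for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def reading_frames(dna: str) -> dict[str, list[str]]:
    """Split a DNA string into codons for each of the six reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Stop before any incomplete codon at the end of the sequence.
            usable = len(seq) - (len(seq) - offset) % 3
            codons = [seq[i:i + 3] for i in range(offset, usable, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for name, codons in reading_frames("ATGGCCTAAG").items():
    print(name, " ".join(codons))
```

A gene finder has to consider all six frames, because a gene may lie on either strand and start at any offset.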
A big mathematical challenge
The underlying assumption of Hidden Markov Models that
DNA sequences are emitted by a Markov Model is
obviously far removed from biological reality. So the
question is: How can we construct gene finding tools that
are based on biologically more meaningful assumptions?
This has practical consequences for evaluating the
probability that a predicted gene is actually a gene, or
estimating the fraction of actual genes that have been
identified as such by a given gene-finding algorithm.
What did the Hidden Markov
Models find?
• Mycoplasma genitalium (bacterium): 500 genes
• Escherichia coli (bacterium): 4,500 genes
• Saccharomyces cerevisiae (yeast): 6,000 genes
• Caenorhabditis elegans (worm): 19,000 genes
• Drosophila melanogaster (fruitfly): 13,500 genes
• Arabidopsis thaliana (thale cress): 25,500 genes
• Homo sapiens (human): 24,000-40,000 genes
• Oryza sativa japonica (rice): 32,000-50,000 genes
• Oryza sativa indica (rice): 45,000-56,000 genes
So we know the genes - do we
know everything?
Far from it. The next two questions are:
• Given a single gene, how does it function in the
biology of an organism?
• How do various genes interact?
From genes to proteins
From the chemical point of view, proteins are long chains
of chemicals called amino acids. There are 20 amino acids
used to make most proteins in most organisms. Amino
acids are coded by triplets of nucleotides, which are also
called codons.
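The codon-to-amino-acid mapping can be sketched as a lookup table. The example below builds the standard genetic code (amino acids in one-letter notation, "*" for a stop codon) and translates a DNA string codon by codon. It assumes the input is the coding strand, already in the correct reading frame and free of introns; the function name is made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G
# at each of the three positions; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna: str) -> str:
    """Translate a coding sequence until the first stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGAAATGGTGA"))  # MKW
```

Since 64 codons code for only 20 amino acids plus the stop signal, most amino acids are coded by several different codons.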
Protein structure prediction
When a protein is manufactured in the cell, it assumes a
characteristic 3D structure or fold. It is very costly to
determine the 3D structure of a protein experimentally (by
NMR or X-ray crystallography). It would be much cheaper
if we could predict the 3D structure of a protein directly
from its sequence of amino acids that is coded in the
genome. This is known as the protein folding problem.
Many approaches have been proposed to develop
algorithms for solving this problem; so far results are
mixed.
Protein structure prediction
In theory, it is possible to predict a protein fold ab initio,
that is from first principles. However, the task is beyond
the capabilities of current supercomputers. Recently IBM
announced plans to develop a new petaflop
supercomputer (10^15 floating point operations per second)
called “Blue Gene” that will be designed specifically with
the task of ab initio protein fold prediction in mind. It
should be just powerful enough for the task if no
unexpected complications arise.
Prediction of gene function
Suppose you have identified a gene. What is its role in the
biochemistry of its organism? Sequence databases can
help us in formulating reasonable hypotheses.
• Search the database for genes with similar nucleotide
sequences in other organisms.
• If the functions of the most similar genes are known and
if they tend to be the same function (e.g., “codes
enzyme involved in glucose metabolism”), then it is
reasonable to conjecture that your gene also codes an
enzyme involved in glucose metabolism.
Prediction of gene function:
homology searches
Searching the database(s) for sequences similar to a given
nucleotide (DNA) sequence is known as a “homology
search”. The most popular software tool for performing
these searches is called BLAST; therefore biologists often
speak of “BLAST searches”. There are two interesting
problems here:
• How to measure “similarity” of two sequences.
• How much similarity constitutes evidence of biologically
meaningful homology as opposed to random chance?
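One possible answer to both questions can be sketched as follows. Similarity is measured by a global alignment score computed by dynamic programming, and chance is estimated by re-scoring against shuffled copies of the second sequence. This is not how BLAST works internally (BLAST uses fast local-alignment heuristics and its own statistics); the scoring values and function names are assumptions made for this example.

```python
import random


def alignment_score(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def chance_fraction(a: str, b: str, trials: int = 200, seed: int = 0) -> float:
    """Fraction of shuffled versions of b that score at least as well as b."""
    rng = random.Random(seed)
    observed = alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    return hits / trials


print(alignment_score("GATTACA", "GATTACA"))
print(chance_fraction("GATTACAGATTACA", "GATTACAGATCACA"))
```

A small fraction suggests that the observed similarity is unlikely to be an accident of nucleotide composition alone. Shuffling keeps the composition of the sequence fixed, which is one simple choice of null model among several.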
Prediction of important sites in
proteins
Not all parts of a protein are equally important; the
function of most of its amino acids is often just to maintain
an appropriate 3D structure, and mutations of those less
crucial amino acids often don't have much effect.
However, most proteins have crucial parts such as
binding sites. Any mutations occurring at binding sites
tend to be lethal and will be weeded out by evolution.
How to predict binding sites from
sequence data:
• Get a collection of proteins of similar amino acid
sequences and analogous biochemical function from
your database.
• Align these sequences amino acid by amino acid.
• Check which regions of the protein are highly conserved
in the course of evolution.
• The binding site should be in one of the highly
conserved regions.
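The last two steps of this recipe can be sketched in code, assuming the sequences have already been aligned (gaps written as "-"). The conservation measure used here, the fraction of sequences sharing the most common amino acid in a column, is just one simple choice; the three short sequences are invented toy data.

```python
from collections import Counter


def conservation(aligned: list[str]) -> list[float]:
    """Per-column fraction of sequences sharing the most common residue."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")  # ignore gaps
        top = max(counts.values()) if counts else 0
        scores.append(top / len(column))
    return scores


def conserved_regions(aligned: list[str], threshold: float = 1.0,
                      min_len: int = 3) -> list[tuple[int, int]]:
    """Half-open (start, end) column ranges with conservation >= threshold."""
    scores = conservation(aligned) + [0.0]  # sentinel closes the last run
    regions, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


toy_alignment = ["MKTAHDGSL", "MRTAHDGTL", "MKSAHDGAL"]  # made-up sequences
print(conserved_regions(toy_alignment))  # [(3, 7)]
```

In this toy alignment only the stretch "AHDG" is identical in all three sequences, so it would be the candidate region to examine for a binding site.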
Using genomic data for
reconstruction of phylogenies
A phylogenetic tree depicts the branching pattern in the
evolution of contemporary species from their common
ancestor. Given the sequences of homologous genes
(i.e., genes derived from a common ancestral gene), one
can try to reconstruct the phylogenetic tree for these
species by looking at the amount of evolutionary change
that has occurred at the molecular level and estimating the
times at which any two of these species diverged.
Methods for reconstruction of
phylogenies
• Distance methods
• Maximum parsimony
• Maximum likelihood
• Bayesian Analysis
A big mathematical challenge is to devise fast and reliable
algorithms for phylogenetic reconstruction for large sets of
species, especially using Maximum Likelihood or Bayesian
Analysis.
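As an illustration of the first step of a distance method, the sketch below computes pairwise evolutionary distances between aligned DNA sequences. It uses the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites; this is one standard choice among many substitution models. The species labels and sequences are made-up toy data.

```python
import math
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p: float) -> float:
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        return math.inf  # sequences look saturated; distance is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Jukes-Cantor distance for every unordered pair of named sequences."""
    return {
        (m, n): jukes_cantor(p_distance(seqs[m], seqs[n]))
        for m, n in combinations(seqs, 2)
    }


toy = {  # invented sequences, for illustration only
    "species_a": "ACGTACGTAC",
    "species_b": "ACGTACGTAT",
    "species_c": "ACGAACTTAT",
}
dist = distance_matrix(toy)
print(min(dist, key=dist.get))  # ('species_a', 'species_b')
```

A tree-building procedure such as neighbor joining would then start from this matrix, typically grouping the closest pair first. The correction matters because several substitutions at the same site are observed as at most one difference.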
Reconstruction of phylogenies:
A success story
There are two basic kinds of free-living organisms:
prokaryotes, which do not have a cell nucleus, and
eukaryotes, which do. Prokaryotes fall into two major
groups: Eubacteria and Archaea. Phenotypically,
eubacteria and archaea are very similar to each other.
However, it has been demonstrated by using molecular
data that archaea are more closely related to eukaryotes
than to eubacteria, and thus it appears that the
evolutionary branching between archaea and eubacteria
occurred before the branching of archaea and eukaryotes.
Gene interactions: Collecting
gene expression data
All cells of a multicellular organism have the same set of
genes. What accounts for the differences in various cell
types and function is which of the genes are being
expressed (switched on) at a given time in a given cell.
A relatively new technology, called gene chips or
microarrays, makes it possible to monitor, for tens of
thousands of genes simultaneously, the differences in gene
expression levels between two different experimental
conditions.
Gene interactions: Interpreting
gene expression data
Once gene expression data have been collected, it is
possible to identify clusters of genes that have similar
expression profiles, that is, are up- or downregulated
under the same experimental conditions. One then
conjectures that genes with similar expression profiles
have similar functions, for example, are involved in the
same biochemical pathways. Such conjectures can serve
as powerful guides for setting up experiments to confirm
the biochemical role of groups of genes.
Interpreting gene expression data:
A mathematical challenge
Gene expression data sets are peculiar in the sense that
we typically have very few experiments (5-10 perhaps)
and a large number (tens of thousands) of monitored genes.
It seems inevitable that some genes will show similar
expression profiles just by random accident.
Question: How can we tell spurious clusters of genes
from biologically meaningful ones?
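A small simulation illustrates the problem. The sketch below generates purely random expression profiles, so no gene is related to any other, and counts how many pairs of genes nevertheless look strongly correlated. The Gaussian noise model, the correlation cutoff of 0.9, and the function names are assumptions chosen for this example.

```python
import random
from itertools import combinations


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def chance_pair_fraction(n_genes: int, n_experiments: int,
                         cutoff: float = 0.9, seed: int = 0) -> float:
    """Fraction of random gene pairs whose correlation exceeds the cutoff."""
    rng = random.Random(seed)
    profiles = [
        [rng.gauss(0.0, 1.0) for _ in range(n_experiments)]
        for _ in range(n_genes)
    ]
    pairs = list(combinations(profiles, 2))
    hits = sum(pearson(x, y) > cutoff for x, y in pairs)
    return hits / len(pairs)


for n_exp in (5, 10, 50):
    print(n_exp, chance_pair_fraction(200, n_exp))
```

With only a handful of experiments a noticeable share of unrelated pairs passes the cutoff, and with tens of thousands of genes this adds up to a very large number of spurious pairs. The share shrinks quickly as the number of experiments grows.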
Gene expression profiles:
A success story
Cancer patients with the same clinical picture often
respond to very different types of treatment. Gene
expression profiles of groups of cancer patients have
revealed that what looks to the clinician as the same
disease can sometimes be one of several diseases at the
biochemical level. The latter can be distinguished by
characteristic expression profiles of certain groups of
genes. Once the biochemical nature of the disease has
been established, treatment can be tailored to the type of
disease a patient actually has.
The databases of the future
Genomic data bases like Genbank are just the beginning.
In the near future we will see:
• Gene expression data banks
• SNP (single nucleotide polymorphism) data banks
• proteomics data banks
• data banks of biochemical pathways
• …
Setting up these data banks and making intelligent use of
them will require new mathematical tools, to be developed
by the next generation of mathematicians.