ppt - Chair of Computational Biology

Download Report

Transcript ppt - Chair of Computational Biology

Gene finding
Material of this lecture taken from
- chapter 8, DW Mount „Bioinformatics“
- C. Mathé et al. Nucleic Acids Research 30, 4103 (2002)
Recent years have seen an explosion of the number of completed genome
sequences.
The Genome Online Database lists
166 published completed genomes in the public databases
415 ongoing prokaryotic sequencing projects
360 ongoing eukaryotic sequencing projects.
However, annotation (biological interpretation) is hardly keeping pace with the
new flow of data :-(
9. Lecture WS 2003/04
Bioinformatics III
1
Gene finding
- sequence motifs useful for gene finding
- prokaryotic gene finding
- Hidden Markov Models (HMM)
- eukaryotic gene finding
- popular gene finding programs, comparison
9. Lecture WS 2003/04
Bioinformatics III
2
Introduction
The simplest method of finding DNA sequences that encode proteins is to
search for open reading frames (ORFs).
In each sequence, there are 6 possible open reading frames:
3 ORFs starting at positions 1, 2, and 3, and going in the 5‘ to 3‘ direction
and 3 ORFs starting at positions 1, 2, and 3, and going in the 5‘ to 3‘ direction of
the complementary sequence.
In prokaryotic genomes, DNA sequences encoding proteins are transcribed into
mRNA, and the mRNA is usually directly translated into proteins without
significant modification.
Therefore, the longest ORF running from the first available Met codon (AUG) on
the mRNA to the next stop codon in the same reading frame, is usually a good
prediction of the protein-encoding regions.
9. Lecture WS 2003/04
Bioinformatics III
3
Methods
Obtain new
genomic DNA
sequence.
Translate in all
6 ORFs and compare
to protein database.
Perform database
similarity search of
EST database of same
organism, or of cDNA
sequences, if available.
9. Lecture WS 2003/04
Bioinformatics III
Use gene prediction
program to locate genes.
Analyze regulatory
sequences in the gene.
4
Statistical methods vs. rule-based methods
Statistical methods
secondary structure prediction:
neural networks
AGADIR
Rule-based methods
apply rules from stereochemistry
tertiary structure prediction:
homology modelling, threading
ab initio predictions
gene prediction:
Hidden Markov Models
neural networks
apply rules for gene structure
biological function:
sequence analysis
structural alignment (3D structures)
9. Lecture WS 2003/04
predict binding sites + mechanism
from 3D structure if available
Bioinformatics III
5
Statistical methods vs. rule-based methods
Advantages
statistical methods:
powerful if based on large data basis
Disadvantages
weak if data basis is too small
may be quick to apply
automatic training procedures
no insight in underlying principles
rule-based methods:
powerful if rules are strongly followed
in biological there are rules and
many special cases
insight in underlying principles
rules may only become obvious after
problem is solved by statistical method
may be very expensive
(e.g. ab initio 3rd structure prediction)
may be very quick
9. Lecture WS 2003/04
Bioinformatics III
6
Extrinsic and intrinsic methods
Most approaches now combine
(a) homology methods = „extrinsic methods“ with
(b) gene prediction methods = „intrinsic methods“
Only about half of all genes can be found by homology to other known genes or
proteins (this value is of course increasing as more genomes get sequenced and
more cDNA/EST sequences get available).
In order to determine the 50% of remaining genes, one has to turn to predictive
methods.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
7
Prokaryotic genes
Advantages:
there are no introns
the intergenic regions are small
Disadvantages:
the genes may often overlap each other
the translation starts are difficult to predict.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
8
Extrinsic Content Sensors
Extrinsic content sensors identify similarities with protein/DNA database entry:
FASTA, BLAST ...
Obvious weakness: nothing will be found if the database does not contain a
sufficiently similar sequence.
Even when a good similarity is found, the limits of the regions of similarity, which
should indicate exons, are not always very precise. Small exons are easily missed.
EST/cDNA data allow to identify exons. On the other hand, EST data gives only
limited information on the gene structure.
Advantage: a single match is enough to detect the presence of a gene.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
9
Eukaryotic genomes
Transcription of protein-encoding regions is initiated at specific promoter
sequences, and followed by removal of noncoding sequence (introns)
from pre-mRNA by a splicing mechanism.
3 types of posttranscriptional events influence the translation of mRNA into
protein and the accuracy of gene prediction:
(1) species-dependent codon usage
(2) tissue-dependent splice variations
(3) mRNA may be edited.
9. Lecture WS 2003/04
Bioinformatics III
10
Intrinsic Content Sensors for eukaryotic genomes
Characterize „coding“ regions:
- nucleotide composition
- G+C content (introns are more A/T rich than exons, especially in plants)
- codon composition
- hexamer frequency (this was found to be the most discriminating variable
between coding and non-coding sequences)
- base occurrence periodicity ...
Hexamer frequence, or, more generally, the k-mer composition of coding
sequences is the main search tools in many packages.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
11
Repeated sequence elements in eukaryotic genes
Eukaryotic DNA is wrapped around histone-protein complexes called
nucleosomes. Therefore, some of the base pairs in the major or minor grooves
of DNA face the nucleosome surface and others face outside.
Repeated patterns of sequence have been found in the introns and exons by
hidden Markov model (HMM) analysis.
Patterns appear with a periodicity of 10,
JMB 263, 503 (1996) Baldi et al.
9. Lecture WS 2003/04
Bioinformatics III
12
Markov model
A Markov model is a stochastic model which assumes that the probability of
appearance of a given base (A, T, C, or G) at a given position depends only on the k
previous nucleotides.
k is called the order of the Markov model.
Such a model is defined by the conditional probabilities P(X|k previous nucleotides),
where X = A, T, G, or C.
In order to build a Markov model, a learning set of sequences on which these
probabilities will be estimated is required.
Given a sequence and a Markov model, one can then very simply compute the
probability that this sequence has been generated according to this model.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
13
Markov models
The simplest Markov models are homogenuous zero order Markov models which
assume that each base occurs independently with a given frequency.
Such simple models were previously often used for non-coding regions.
Modern programs like GeneMark also use higher order models to represent introns
and intergenic regions.
More complex three-periodic Markov models have been introduced to
characterize coding sequences. Coding regions are defined by three Markov
models, one for each position inside a codon.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
14
Markov models
The larger the order of a Markov model, the finer it can characterize dependencies
between adjacent nucleotides.
However, a model of order k requires a large number of coding sequences to be
reliably estimated.
Therefore, most programs like GeneMark or Genscan rely on a three-periodic
Markov model of order five (thus exploiting hexamer composition) or less to
characterize coding sequences.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
15
Hidden Markov Models
An HMM is a graph of connected states, each
state potentially able to “emit” a series of
observations. The process evolves in some
dimension, often time, though not necessarily.
The model is parameterized with probabilities
governing the state at a time t + 1, given that
one knows the previous states. Markov
assumptions are used to truncate the
dependency of having to know the entire history
of states up to this point in order to assess the
next state: Instead, only one step back is
required. As the process evolves in time through
the states, each state can potentially emit
observations, which are regarded as a stream of
observations over time. These models are often
illustrated graphically as shown right, with the
states being circles and transitions as arrows
between the states.
9. Lecture WS 2003/04
Bioinformatics III
16
Signal sensors
The basic and natural approach to finding a signal that may represent the
presence of a functional site is to search for a match with a consensus
sequence e.g. for promoter regions (TATA box), or the ribosomal binding site on
the mRNA.
This consensus could be determined from a multiple alignment of functionally
related documented sequences. Programs SPLICEVIEW and
SplicePredictor.
A more flexible representation of signals is offered by the so-called positional
weight matrices (PWMs): indicate the probability that a given base appears at
each position of the signal (again computed from a multiple alignment of
functionally related sequences).
One can say that a PWM is defined by one classical zero order Markov model
per position.
The PWM weights can also be optimized by a neural network method.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
17
Signal sensors
In order to capture possible dependencies between adjacent positions of a
signal, one may use higher order Markov models called weight array models
(WAM). These methods assume a fixed length signal.
Hidden Markov models further allow for insertions and deletions.
Most existing programs use such models to represent and detect
- splice sites
- branch points
- correct intron/exon boundaries
and other motives like
- poly(A) sites (in 3‘-UTRs)
- promoters
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
18
Predict eukaryotic gene structures
One doesn‘t want to only search for independent exons, but instead identify the
whole complex structures of genes!
Each consistent pair of detected signals (translation starts and stops, spice sites)
defines a potential gene region (intron, exon or coding part of an exon).
Since all these potential gene regions can be used to build a gene model, the
number of potential gene models grows exponentially with the number of
predicted exons!
In practice, „correct“ gene structures must satisfy a set of properties
(i) there are no overlapping exons
(ii) coding exons must be frame compatible
(iii) merging two successive coding exons will not generate an in-frame stop at
the junction.
Mathé et al. Nucl. Acids. Res.
30, 4103 (2002)
The number of candidates remains, however, exponential.
9. Lecture WS 2003/04
Bioinformatics III
19
Testing the reliability of an ORF prediction
(1) Observation of unusual type of sequence variation found in ORFs:
every 3rd base tends to be the same one much more often than by chance alone
(Fickett 1982).
This property is due to nonrandom use of codons in ORFs and is true for any
ORF, regardless of the species.
(2) Determine whether the codons in the ORF correspond to those used in other
genes of the same organism (codon usage statistic).
(3) Translate ORFs into amino acid sequence and compare that to database of
protein sequences. If good hits are found, confidence in new predicted ORF
rises.
9. Lecture WS 2003/04
Bioinformatics III
20
Neural Network: GRAIL II
Grail II provides analyses of protein-coding regions, poly(A) sites, and promoters
constructs gene models,
predicts encoded protein sequences
provides database searching capabilities.
(1) create list of most likely
exon candidates
(2) evaluate candidates by
neural network
Uberbacher, Mural. PNAS, 88, 11261 (1991)
9. Lecture WS 2003/04
Bioinformatics III
21
Glimmer 2.0
Glimmer 2.0 is an extension. Probabilities may be computed for any of the k
bases preceeding a particular base b,
not just those adjacent to b.
Glimmer 1.0 uses an interpolated Markov
model (IMM) where the search depth k is
optimized for every tuple.
Sample ICM (interpolated context
model) decomposition tree. The root
position 12 has maximum mutual
information with the final base position
13. Each child of the root represents
the subset of windows with the
indicated nucleotide value at position
12, and indicates the maximum mutual
information position for that subset.
Each node is similarly decomposed
into children. Note that children of a
single node may represent different
base positions.
Salzberg et al. NAR, 27, 4636 (1999)
9. Lecture WS 2003/04
Bioinformatics III
22
Accuracy of Glimmer 2.0
Glimmer 2.0's accuracy for ten complete bacterial and archaeal genomes. The
majority of the genes missed were very short, either below the minimum or very
close to it. The default settings produce additional gene predictions ranging
from 7-20% of the total, many of which are likely to be false positives, but some
of which may be genuine. The additional prediction rate drops quickly if the
minimum gene length is set to be greater than 90bp.
9. Lecture WS 2003/04
Bioinformatics III
23
Accuracy of Glimmer 2.0
A better measure of accuracy is to consider only "confirmed genes," which we
define as genes that have a significant database match to a gene in another
organism. The table below shows these statistics on 10 genomes for both
Glimmer 1.0 and 2.0. All results were obtained by a very simple training
procedure: Glimmer was trained by first extracting all non-overlapping open
reading frames over 500bp. The trained model was then used to find genes in the
complete genome.
9. Lecture WS 2003/04
Bioinformatics III
24
Glimmer 1.0 vs. Glimmer 2.0
Numbers of genes confirmed by database matches found exclusively by
GLIMMER 1.0, by GLIMMER 2.0, and by both systems.
9. Lecture WS 2003/04
Bioinformatics III
25
Newly identified genes
Genes in M.tuberculosis found
automatically by GLIMMER
2.0 with homology to protein
sequences from other
organisms.
Salzberg et al. NAR, 27, 4636 (1999)
9. Lecture WS 2003/04
Bioinformatics III
26
Problematic gene start prediction
Detecting a gene as a protein-coding
ORF with an ‘open’ start still does
not provide full information for gene
annotation. Although several
procedures for gene start prediction
accuracy have been described,
verification of the actual accuracy of
these methods has been hampered
by an insufficient number of
experimentally validated translation
starts and, therefore, a deficit of
reliable data for training and testing.
In the absence of a reliable computer
procedure for gene start prediction,
the rule of the ‘longest ORF’ was
frequently applied to annotate
complete microbial genomes with
gene start assigned to the 5'-most
ATG codon (see Table).
9. Lecture WS 2003/04
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
Bioinformatics III
Salzberg et al. NAR, 27, 4636 (1999)
27
GeneMarkS
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
28
Gene prediction accuracy
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
29
Performance of GeneMarkS on eukaryotic genomes
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
30
Comparable performance of Glimmer and GeneMarksS
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
31
GeneMarkS
(A) In the process of GeneMarkS training
there is no division of the coding sequence
into two clusters. However, in applying the
GeneMark.hmm 2.0 program, the model of
coding region derived by GeneMarkS can be
used as the Typical model along with a
heuristic model used as the Atypical model .
(B) In this simplified diagram of hidden state
transitions in GeneMark.hmm 2.0, the state
‘gene’ represents a sequence composed of an
RBS plus a spacer plus the protein-coding
sequence (CDS). Gene overlaps encompass
all possible types of superpositions: overlap of
genes on the same strand (as observed in
operons), overlap of genes on opposite
strands, overlap of coding region with RBS,
and so on.
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
32
Positional frequency pattern
Sequence logo representing the RBS positional
frequency pattern detected by GeneMarkS in the
analysis of B.subtilis genomic data. The total
height of the four letters in each position
indicates the position specific information
content, while the height of each letter is
proportional to the nucleotide frequency.
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
33
Spacer length
(B) Graph of probability
distribution of spacer length,
the sequence between the
RBS sequence and the gene
start.
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
34
Gene prediction evaluation
Venn diagram showing
group relationships
between the GenBank
annotation and sets of
genes detected by
GeneMark.hmm 2.0 and
Glimmer 2.02 for the
B.subtilis genome (A) and
the E.coli genome (B).
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
35
Ribosome binding sites
Distributions of log-odds scores of RBS sites, as
detected by GeneMarkS, in sets of overlapping and
non-overlapping of genes of (A) B.subtilis, (B) E.coli
and (C) M.jannaschii. As can be seen, the overlapping
genes, which are likely to be located inside operons,
frequently have strong RBS sites. Still, most strong
sites of ribosome binding precede the non-overlapping
genes (stand alone genes and genes leading operons).
This tendency is much more apparent in the case of
the archaeal genome of M.jannaschii than in the E.coli
and B.subtilis genomes
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
36
RBS motif
Sequence logo representing the RBS motif
observed in a subset of upstream
sequences of the A.fulgidus genome. This
subset consisted of 50 nt long upstream
sequences overlapping the 3' end of the
preceding gene. The consensus of this
motif is complementary to a section of the
A.fulgidus 16S rRNA.
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
37
Spacer length
Distributions of spacer length for
two species with strong RBS
patterns, B.subtilis and E.coli
(solid and dashed lines,
respectively), and one species
with a strong eukaryotic
promoter-like pattern, A.fulgidus
(dotted line).
The promoter-like pattern of
A.fulgidus is localized much
further upstream of the start
codon than the RBS patterns of
B.subtilis and E.coli.
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
38
Distribution of spacer lengths
(A) Distribution of spacer lengths observed
in the B.subtilis genome for two different
types of possible RBS hexamers:
AGGAGG and AGGTGA. Multiple
alignment allows these hexamers to be
superimposed. In actual upstream
sequences, these hexamers tend to
occupy different locations relative to the
start codon. This preference may be
involved in the precise positioning of the
ribosome at the translation initiation site
when the 16S rRNA binds to mRNA. The
more frequent hexamer was observed on
average at a further distance from the
gene start than the rare hexamer.
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
39
Distribution of spacer lengths
(B) Distribution of spacer lengths observed
in the M.thermoautotrophicum genome for
two different types of RBS hexamers:
GGAGGT and GGTGAT. Properties of
these hexamers are similar to the two
hexamers observed in the B.subtilis
genome (A), except that more frequent
hexamer is now found on average at a
closer distance to the gene start than the
rare hexamer.
Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)
9. Lecture WS 2003/04
Bioinformatics III
40
TIGR: GlimmerM, Exonomy and Unveil
Genefinder programs for eukaryotes are far from perfect, often predicting genes
or exons which do not exist, failing to predict those that do exist or generating
predictions having one or more incorrect exon boundaries.
Also, different genefinders trained for the same organism often produce different
predictions.
The latest publication by The Institute of Genome Research (TIGR) therefore
suggests 3 genefinders that are all based on one or more types of Markov
models: GlimmerM, Exonomy, Unveil.
Majoros et al. Nucl. Acids. Res. 31, 3601 (2003)
9. Lecture WS 2003/04
Bioinformatics III
41
TIGR: GlimmerM, Exonomy and Unveil
Topologies of
Unveil
283-state HMM
Exonomy
23-state GHMM
Majoros et al. Nucl. Acids. Res. 31, 3601 (2003)
9. Lecture WS 2003/04
Bioinformatics III
42
Methods don’t work everywhere
An example in which Exonomy
produces the correct gene model.
An example in which GlimmerM
produces the correct gene model.
An example in which Unveil
produces the correct gene model
(as does Genscan).
Majoros et al. Nucl. Acids. Res. 31, 3601 (2003)
9. Lecture WS 2003/04
Bioinformatics III
43
Summary
The results of gene prediction should be taken with caution; although the results
are becoming increasingly more reliable, they do remain only predictions.
They are very useful for speeding up gene discovery and knowledge mining.
Biological expertise remains necessary in order to confirm the existence of a virtual
protein and to find or prove its biological function.
Therefore comparative genome approaches are becoming more and more
important where programs can scan for homologies with expressed sequences
(EST or cDNA sequence data).
More work is now focussing on detecting RNA-coding genes.
Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
9. Lecture WS 2003/04
Bioinformatics III
44