Gene prediction

Download Report

Transcript Gene prediction

Gene prediction
Gene Prediction: Computational Challenge
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg
ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc
ggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcgg
ctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccg
atgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg
cggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcat
gcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagct
gggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgca
tgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggcta
tgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcg
gctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatga
caatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggc
tatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgcta
agctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa
tgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgca
tgcggctatgctaagctcatgcgg
Gene Prediction: Computational Challenge
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg
ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc
ggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcgg
ctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccg
atgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg
cggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcat
gcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagct
gggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgca
tgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggcta
tgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcg
gctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatga
caatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggc
tatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgcta
agctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa
tgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgca
tgcggctatgctaagctcatgcgg
Gene Prediction: Computational Challenge
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg
ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc
ggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcgg
ctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccg
atgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg
cggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcat
gcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagct
gggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgca
tgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggcta
tgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcg
gctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatga
caatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggc
tatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgcta
agctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa
tgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg
gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgca
tgcggctatgctaagctcatgcgg
Gene!
Gene Prediction Analogy
• Newspaper written in unknown language
– Certain pages contain encoded message, say 99 letters on page
7, 30 on page 12 and 63 on page 15.
• How do you recognize the message? You could probably
distinguish between the ads and the story (ads contain the “$” sign
often)
• Statistics-based approach to Gene Prediction tries to make similar
distinctions between exons and introns.
Statistical Approach: Metaphor in Unknown Language
Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’)
and numerical symbols could you distinguish between a story
and the stock report in a foreign newspaper?
Two Approaches to Gene
Prediction
• Statistical: coding segments (exons) have typical
sequences on either end and use different subwords than
non-coding segments (introns).
• Similarity-based: many human genes are similar to
genes in mice, chicken, or even bacteria. Therefore,
already known mouse, chicken, and bacterial genes may
help to find human genes.
Similarity-Based Approach: Metaphor in Different Languages
If you could compare the day’s news in English, side-by-side
to the same news in a foreign language, some similarities
may become apparent
Annotation of
Genomic Sequence
• Given the sequence of an organism’s genome,
we would like to be able to identify:
–
–
–
–
–
Genes
Exon boundaries & splice sites primary goals
Beginning and end of translation
Alternative splicings
secondary goals
Regulatory elements (e.g. promoters)
• The only certain way to do this is experimentally,
but it is time consuming and expensive.
Computational methods can achieve reasonable
accuracy quickly, and help direct experimental
approaches.
Prokaryotic Gene Structure
Promoter
CDS
Terminator
Genomic DNA
transcription
mRNA
 Most
bacterial promoters contain the Shine-Delgarno signal, at
about -10 that has the consensus sequence: 5'-TATAAT-3'.
 The terminator: a signal at the end of the coding sequence that
terminates the transcription of RNA
 The coding sequence is composed of nucleotide triplets. Each
triplet codes for an amino acid. The AAs are the building blocks
of proteins.
Pieces of a (Eukaryotic) Gene
(on the genome)
~ 1-100 Mbp
5’
3’
3’
5’
~ 1-1000 kbp
5’ …
…
3’
… 3’
…
5’
exons (cds & utr) / introns
(~ 102-103 bp)
(~ 102-105 bp)
promoter (~103 bp)
enhancers (~101-102 bp)
Polyadenylation
site
other regulatory sequences
(~ 101-102 bp)
What is Computational Gene Finding?
Given an uncharacterized DNA sequence, find out:
–
–
–
–
–
–
Which region codes for a protein?
Which DNA strand is used to encode the gene?
Which reading frame is used in that strand?
Where does the gene starts and ends?
Where are the exon-intron boundaries in eukaryotes?
(optionally) Where are the regulatory sequences for that
gene?
Prokaryotic Vs. Eukaryotic Gene
Finding
Prokaryotes:
Eukaryotes:
• small genomes 0.5 – 10·106 bp
• high coding density (>90%)
• no introns
• large genomes 107 – 1010 bp
• low coding density (<50%)
• intron/exon structure
– Gene identification relatively easy,
with success rate ~ 99%
Problems:
• overlapping ORFs
• short genes
• finding TSS and promoters
– Gene identification a complex
problem, gene level accuracy
~50%
Problems:
• many
What is it about genes that we
can measure (and model)?
• Most of our knowledge is biased towards
protein-coding characteristics
– ORF (Open Reading Frame): a sequence defined by inframe AUG and stop codon, which in turn defines a
putative amino acid sequence.
– Codon Usage: most frequently measured by CAI (Codon
Adaptation Index)
• Other phenomena
– Nucleotide frequencies and correlations:
• value and structure
– Functional sites:
• splice sites, promoters, UTRs, polyadenylation sites
General Things to Remember about
(Protein-coding) Gene Prediction Software
• It is, in general, organism-specific
• It works best on genes that are reasonably similar
to something seen previously
• It finds protein coding regions far better than noncoding regions
• In the absence of external (direct) information,
alternative forms will not be identified
• It is imperfect! (It’s biology, after all…)
Gene Finding: Different
Approaches
• Similarity-based methods (extrinsic) - use similarity to annotated
sequences:
– proteins
– cDNAs
– ESTs
• Comparative genomics - Aligning genomic sequences from
different species
• Ab initio gene-finding (intrinsic)
• Integrated approaches
Similarity-based methods
• Based on sequence conservation due to functional
constraints
• Use local alignment tools (Smith-Waterman algo,
BLAST, FASTA) to search protein, cDNA, and EST
databases
• Will not identify genes that code for proteins not
already in databases (can identify ~50% new genes)
• Limits of the regions of similarity not well defined
Comparative Genomics
• Based on the assumption that coding sequences are
more conserved than non-coding
• Two approaches:
– intra-genomic (gene families)
– inter-genomic (cross-species)
• Alignment of homologous regions
• Difficult to define limits of higher similarity
• Difficult to find optimal evolutionary distance (pattern
of conservation differ between loci)
Summary for Extrinsic
Approaches
Strengths:
• Rely on accumulated pre-existing biological data,
thus should produce biologically relevant predictions
Weaknesses:
• Limited to pre-existing biological data
• Errors in databases
• Difficult to find limits of similarity
Ab initio Gene Finding
Input: A DNA string over the alphabet {A,C,G,T}
Output: An annotation of the string showing for every nucleotide
whether it is coding or non-coding
AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG
Gene finder
AAAGC ATG CAT TTA ACG A GT GCATC AG GA CTC CAT ACG TAA TGCCG
•Using only sequence information
•Identifying only coding exons of protein-coding genes (transcription
start site, 5’ and 3’ UTRs are ignored)
•Integrates coding statistics with signal detection
A eukaryotic gene
• This is the human p53 tumor suppressor gene on chromosome 17.
• Genscan is one of the most popular gene prediction algorithms.
Introns
Final exon
3’ untranslated
region
Initial exon
Internal exons
This particular gene lies on the reverse strand.
Observations
• Given (walk, shop, clean)
– What is the probability of this sequence of
observations? (is he really still at home, or
did he skip the country)
– What was the most likely sequence of
rainy/sunny days?
Signals vs contents
• In gene finding, a small pattern within the
genomic DNA is referred to as a signal,
whereas a region of genomic DNA is a
content.
• Examples of signals: splice sites, starts and
ends of transcription or translation, branch
points, transcription factor binding sites
• Examples of contents: exons, introns, UTRs,
promoter regions
The CpG island problem
• Methylation in human genome
– “CG” -> “TG” happens in most places
except “start regions” of genes and within
genes
– CpG islands = 100-1,000 bases before a
gene starts
• Question
– Given a long sequence, how would we find
the CpG islands in it?
Promoters
• Promoters are DNA segments upstream of
transcripts that initiate transcription
Promoter
5’
3’
• Promoter attracts RNA Polymerase to the
transcription start site
Splice signals (mice): GT , AG
Splice site detection
Donor site
5’
3’
Position
%
A
C
G
T
-8 … -2 -1
26
26
25
23
…
…
…
…
0
1
2
… 17
60 9 0 1 54 … 21
15 5 0 1 2 … 27
12 78 99 0 41 … 27
13 8 1 98 3 … 25
Real splice sites
• Real splice sites show some conservation at
positions beyond the first two.
• We can add additional arrows to model these
states.
weblogo.berkeley.edu
Ribosomal Binding Site
Prior knowledge
•
•
•
•
The translated region must have a length that is a multiple of 3.
Some codons are more common than others.
Exons are usually shorter than introns.
The translated region begins with a start signal and ends with a
stop codon.
• 5’ splice sites (exon to intron) are usually GT;
• 3’ splice sites (intron to exon) are usually AG.
• The distribution of nucleotides and dinucleotides is usually
different in introns and exons.
Gene Prediction and Motifs
• Upstream regions of genes often contain motifs that can be used for
gene prediction
ATG
-35
-10
0
TTCCAA TATACT
Pribnow Box
10
GGAGG
Ribosomal binding site
Transcription start site
STOP
Positional dependence
• In this data, every time a “G”
appears in position 1, an “A”
appears in position 3.
• Conversely, an “A” in
position 1 always occurs with
a “T” in position 3.
ACTG
ACTT
GCAC
ACTT
ACTA
GCAT
ACTA
ACTT
Example of (Positional) Weight Matrix
•
Computed by measuring the frequency of every element of every position of the site
(weight)
TACGAT
1 2 3 4 5 6
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
•
•
A
G
0 6 0 3 4 0
0 0 1 0 1 0
1 0 0 3 0 0
T
5 0 5 0 1 6
C
Score for any putative site is the sum of the matrix values (converted in probabilities)
for that sequence (log-likelihood score)
Disadvantages:
– cut-off value required
– assumes independence between adjacent bases
Conditional probability
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
• What is the probability of observing an “A” at position 2,
given that we observed a “C” at the previous position?
Conditional probability
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
• What is the probability of observing an “A” at position 2,
given that we observed a “C” at the previous position?
• Answer: total number of CA’s divided by total number of
C’s in position 1.
• 3/11 = 27%
• Probability of observing CA = 3/18 = 17%.
Conditional probability
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
• What is the probability of observing a “G” at position 3,
given that we observed a “C” at the previous position?
Conditional probability
GCG
CAG
CCG
GCG
CCG
CCG
GCG
CCT
CCG
GGG
CGG
GCG
AGG
CAG
CCT
CAT
CCT
GCG
• What is the probability of observing a “G” at position 3,
given that we observed a “C” at the previous position?
• Answer: 9/12 = 75%.
Promoter Structure in Prokaryotes (E.Coli)
Transcription starts at offset
0.
• Pribnow Box (-10)
• Gilbert Box (-30)
• Ribosomal Binding Site
(+10)
n
3
Open Reading Frames (ORFs)
• Detect potential coding regions by looking at ORFs
– A genome of length n is comprised of (n/3) codons
– Stop codons break genome into segments between
consecutive Stop codons
– The subsegments of these that start from the Start codon
(ATG) are ORFs
• ORFs in different frames may overlap
ATG
TGA
Genomic Sequence
Open reading frame
Long vs.Short ORFs
• Long open reading frames may be a gene. At random, we should
expect one stop codon every (64/3) ~= 21 codons. However, genes
are usually much longer than this
• A basic approach is to scan for ORFs whose length exceeds certain
threshold. This is naïve because some genes (e.g. some neural and
immune system genes) are relatively short
Testing ORFs: Codon Usage
• Create a 64-element hash table and count the frequencies of
codons in an ORF
• Amino acids typically have more than one codon, but in nature
certain codons are more in use
• Uneven use of the codons may characterize a real gene
• This compensate for pitfalls of the ORF length test
Open Reading Frames in Bacteria
• Without introns, look for long open reading frame (start codon
ATG, … , stop codon TAA, TAG, TGA)
• Short genes are missed (<300 nucleotides)
• Shadow genes (overlapping open reading frames on opposite
DNA strands) are hard to detect
• Some genes start with UUG, AUA, UUA and CUG for start
codon
• Some genes use TGA to create selenocysteine and it is not a
stop codon
Coding Statistics
• Unequal usage of codons in the coding regions is a universal
feature of the genomes
– uneven usage of amino acids in existing proteins
– uneven usage of synonymous codons (correlates with the
abundance of corresponding tRNAs)
• We can use this feature to differentiate between coding and
non-coding regions of the genome
• Coding statistics - a function that for a given DNA sequence
computes a likelihood that the sequence is coding for a protein
Coding Statistics
• Many different ones
–
–
–
–
–
–
codon usage
hexamer usage
GC content
compositional bias between codon positions
nucleotide periodicity
…
Codon Usage in Human Genome
Codon Usage in Mouse Genome
AA codon
Ser TCG
Ser TCA
Ser TCT
Ser TCC
Ser AGT
Ser AGC
/1000
4.31
11.44
15.70
17.92
12.25
19.54
frac
0.05
0.14
0.19
0.22
0.15
0.24
Pro
Pro
Pro
Pro
6.33
17.10
18.31
18.42
0.11
0.28
0.30
0.31
CCG
CCA
CCT
CCC
AA codon
Leu CTG
Leu CTA
Leu CTT
Leu CTC
/1000
39.95
7.89
12.97
20.04
frac
0.40
0.08
0.13
0.20
Ala
Ala
Ala
Ala
GCG
GCA
GCT
GCC
6.72
15.80
20.12
26.51
0.10
0.23
0.29
0.38
Gln
Gln
CAG
CAA
34.18
11.51
0.75
0.25
Codon Usage and Likelihood Ratio
• An ORF is more “believable” than another if it has more “likely”
codons
• Do sliding window calculations to find ORFs that have the “likely”
codon usage
• Allows for higher precision in identifying true ORFs; much better
than merely testing for length.
• However, average vertebrate exon length is 130 nucleotides, which
is often too small to produce reliable peaks in the likelihood ratio
• Further improvement: in-frame hexamer count (frequencies of
pairs of consecutive codons)
Splicing Signals
• Try to recognize location of splicing signals at exon-intron
junctions. This has yielded a weakly conserved donor splice site
and acceptor splice site
• Profiles for sites are still weak, and lends the problem to the
Hidden Markov Model (HMM) approaches, which capture the
statistical dependencies between sites
Donor and Acceptor Sites: GT and AG dinucleotides
•
•
The beginning and end of exons are signaled by donor and acceptor sites
that usually have GT and AC dinucleotides
Detecting these sites is difficult, because GT and AC appear very often
Donor
Site
GT
exon 1
Acceptor
Site
AC
exon 2
A more realistic (and complex)
HMM model for Gene
Prediction (Genie)
Assessing performance: Sensitivity & Specificity
• Testing of predictions is performed on sequences where the
gene structure is known
• Sensitivity is the fraction of known genes (or bases or exons)
correctly predicted
– “Am I finding the things that I’m supposed to find”
• Specificity is the fraction of predicted genes (or bases or
exons) that correspond to true genes
– “What fraction of my predictions are true?”
• In general, increasing one decreases the other
Measures of Prediction Accuracy,
Part 1
Nucleotide level accuracy
TN
FN
TP
FP
TN
FN
TP
FN TN
=
number of correct exons
number of actual exons
=
number of correct exons
number of predicted exons
REALITY
PREDICTION
Sensitivity
Specificity
Measures of Prediction Accuracy,
Part 2
Exon level accuracy
WRONG
EXON
REALITY
PREDICTION
CORRECT
EXON
MISSING
EXON
Graphic View of Specificity and Sensitivity
Sn 
TruePositive
TruePositive

AllTrue
TruePositive  FalseNegative
Sp 
TruePositive
TruePositive

AllPositiv e TruePositive  FalsePosit ive
Quantifying the tradeoff: Correlation Coefficient


TP TN   FP FN 
CC 
 AN PP  AP PN 
AN  TN  FP; AP  TP  FN ;
PP  TP  FP; PN  TN  FN
Examples of Gene Finders
FGENES – linear DF for content and signal sensors and DP for
finding optimal combination of exons
GeneMark – HMMs enhanced with ribosomal binding site
recognition
Genie – neural networks for splicing, HMMs for coding sensors,
overall structure modeled by HMM
Genscan – WM, WA and decision trees as signal sensors,
HMMs for content sensors, overall HMM
HMMgene – HMM trained using conditional maximum likelihood
Morgan – decision trees for exon classification, also Markov
Models
Ab initio Gene Finding is
Difficult
• Genes are separated by large intergenic regions
• Genes are not continuous, but split in a number of
(small) coding exons, separated by (larger) noncoding introns
– in humans coding sequence comprise only a few
percents of the genome and an average of 5% of
each gene
• Sequence signals that are essential for elucidation of
a gene structure are degenerate and highly
unspecific
• Alternative splicing
Problems with Ab initio Gene
Finding
• No biological evidence
• In long genomic sequences many false
positive predictions
• Prediction accuracy high, but not
sufficient
Integrated Approaches for Gene
Finding
• Programs that integrate results of similarity searches
with ab initio techniques (GenomeScan, FGENESH+,
Procrustes)
• Programs that use synteny between organisms
(ROSETTA, SLAM)
• Integration of programs predicting different elements
of a gene (EuGène)
• Combining predictions from several gene finding
programs (combination of experts)
Combining Programs’
Predictions
• Set of methods used and they way they
are integrated differs between individual
programs
• Different programs often predict
different elements of an actual gene
they could complement each other
yielding better prediction
Related Work
• This approach was suggested by several authors
• Burset and Guigó (1996)
– Investigated correlation between 9 gene-finding programs
– 99% of exons predicted by all programs were correct
– 1% of exons completely missed by all programs
• Murakami and Tagaki (1998)
– Five methods for combining the prediction by 4 gene-finding
programs
– Nucleotide level accuracy measures improved by 3-5% in
comparison with the best single
AND and OR Methods
exon 1
exon 2
union
intersection
Combining Genscan and
HMMgene
• High prediction accuracy as well as reliability of their
exon probability made them the best candidates for
our study
Genscan
111
624
91
• Genscan predicted 77% of exons correctly,
HMMgene 75%, both 87%
HMMgene