
CMSC 828N lecture notes:
Eukaryotic Gene Finding with
Generalized HMMs
Mihaela Pertea and Steven Salzberg
Center for Bioinformatics and
Computational Biology, University of
Maryland
Eukaryotic Gene Finding Goals
• Given an uncharacterized DNA sequence, find out:
  – Which regions code for proteins?
  – Which DNA strand is used to encode each gene?
  – Where does the gene start and end?
  – Where are the exon-intron boundaries in eukaryotes?
• Overall accuracy is usually below 50%
Gene Finding: Different Approaches
• Similarity-based methods. These use similarity to annotated
sequences like proteins, cDNAs, or ESTs (e.g. Procrustes,
GeneWise).
• Ab initio gene-finding. These don’t use external evidence to
predict sequence structure (e.g. GlimmerHMM, GeneZilla,
Genscan, SNAP).
• Comparative (homology) based gene finders. These align
genomic sequences from different species and use the
alignments to guide the gene predictions (e.g. TWAIN, SLAM,
TWINSCAN, SGP-2).
• Integrated approaches. These combine multiple forms of
evidence, such as the predictions of other gene finders (e.g.
Jigsaw, EuGène, Gaze)
Why ab initio gene prediction?
Ab initio gene finders can predict novel genes not
clearly homologous to any previously known gene.
Identifying Signals In DNA with a Signal Sensor
We slide a fixed-length model or “window” along the DNA and
evaluate score(signal) at each point:
Signal sensor
…ACTGATGCGCGATTAGAGTCATGGCGATGCATCTAGCTAGCTATATCGCGTAGCTAGCTAGCTGATCTACTATCGTAGC…
When the score is greater than some threshold (determined
empirically to result in a desired sensitivity), we remember this
position as being the potential site of a signal.
The most common signal sensor is the Weight Matrix:
Start-site weight matrix (positions -2 and -1 before the ATG, the invariant A, T, G, and positions +1 and +2 after it):

Pos:   -2     -1      A      T      G     +1     +2
A     31%    18%    100%     0%     0%    19%    24%
T     28%    32%      0%   100%     0%    20%    18%
C     21%    24%      0%     0%     0%    29%    26%
G     20%    26%      0%     0%   100%    32%    32%
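As an illustration, here is a minimal sliding-window scorer for this weight matrix; the threshold value and the helper names are ours, a sketch rather than GlimmerHMM's actual code:

```python
# Minimal sketch of a weight-matrix signal sensor (illustrative only).
# The matrix encodes the start-site PWM from the slide; the ATG core is fixed.

WINDOW = [  # P(base) at each of the 7 positions (-2, -1, A, T, G, +1, +2)
    {"A": 0.31, "T": 0.28, "C": 0.21, "G": 0.20},
    {"A": 0.18, "T": 0.32, "C": 0.24, "G": 0.26},
    {"A": 1.00, "T": 0.00, "C": 0.00, "G": 0.00},
    {"A": 0.00, "T": 1.00, "C": 0.00, "G": 0.00},
    {"A": 0.00, "T": 0.00, "C": 0.00, "G": 1.00},
    {"A": 0.19, "T": 0.20, "C": 0.29, "G": 0.32},
    {"A": 0.24, "T": 0.18, "C": 0.26, "G": 0.32},
]

def pwm_score(window: str) -> float:
    """Product of position-specific base probabilities for one window."""
    score = 1.0
    for pos, base in zip(WINDOW, window):
        score *= pos.get(base, 0.0)
    return score

def scan(dna: str, threshold: float):
    """Slide the fixed-length window along the DNA; report candidate sites."""
    w = len(WINDOW)
    for i in range(len(dna) - w + 1):
        s = pwm_score(dna[i:i + w])
        if s > threshold:   # threshold chosen empirically for desired sensitivity
            yield i, s

for pos, s in scan("ACTGATGCGCGATTAGAGTCATGG", 1e-4):
    print(pos, s)
```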
Start and stop codon scoring
Score all potential start/stop codons within a window of length 19.

Example context around a start codon:  CATCCACCATGGAGAA
Kozak consensus:                       CCACCATGG
The probability of generating the sequence X = x_1 x_2 ... x_n is given by:

    P(X) = p^(1)(x_1) · ∏_{i=2}^{n} p^(i)(x_i | x_{i-1})        (1)

(a WAM model, or inhomogeneous Markov model)
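A small sketch of estimating and applying such a position-specific model (WAM); the toy training windows are made up for illustration:

```python
from collections import defaultdict

# Minimal WAM (weight array model) sketch: position-specific first-order
# Markov probabilities p_i(x_i | x_{i-1}) estimated from aligned signal
# windows. The training windows below are illustrative, not real data.

def train_wam(windows):
    """Estimate p_1(x_1) and p_i(x_i | x_{i-1}) from aligned windows."""
    n = len(windows[0])
    first = defaultdict(float)
    cond = [defaultdict(lambda: defaultdict(float)) for _ in range(n)]
    for w in windows:
        first[w[0]] += 1
        for i in range(1, n):
            cond[i][w[i - 1]][w[i]] += 1
    total = sum(first.values())
    p1 = {b: c / total for b, c in first.items()}
    pi = []
    for i in range(n):
        tab = {}
        for prev, counts in cond[i].items():
            t = sum(counts.values())
            tab[prev] = {b: c / t for b, c in counts.items()}
        pi.append(tab)
    return p1, pi

def wam_prob(x, p1, pi):
    """P(X) = p_1(x_1) * prod_{i>=2} p_i(x_i | x_{i-1})."""
    p = p1.get(x[0], 0.0)
    for i in range(1, len(x)):
        p *= pi[i].get(x[i - 1], {}).get(x[i], 0.0)
    return p

windows = ["CCACCATGG", "CCACCATGG", "CAACCATGG", "GCACCATGG"]
p1, pi = train_wam(windows)
print(wam_prob("CCACCATGG", p1, pi))
```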
Splice Site Scoring
Donor/Acceptor sites at location k:
DS(k) = Scomb(k,16) + (Scod(k-80) - Snc(k-80)) + (Snc(k+2) - Scod(k+2))
AS(k) = Scomb(k,24) + (Snc(k-80) - Scod(k-80)) + (Scod(k+2) - Snc(k+2))

where Scomb(k,i) = score computed by the Markov model/MDD method using a window of i bases, and Scod/nc(j) = score of the coding/noncoding Markov model for an 80 bp window starting at j.
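The combination itself is plain arithmetic over the component log scores; a sketch, with Scomb, Scod, and Snc passed in as stand-in functions (the dummy values below are purely illustrative):

```python
# Sketch of the donor/acceptor score combination above. The component
# scorers are stand-ins; in GlimmerHMM they come from the Markov/MDD
# signal model and the 80 bp coding/noncoding Markov models.

def donor_score(k, s_comb, s_cod, s_nc):
    """DS(k) = Scomb(k,16) + (Scod(k-80)-Snc(k-80)) + (Snc(k+2)-Scod(k+2))"""
    return s_comb(k, 16) + (s_cod(k - 80) - s_nc(k - 80)) + (s_nc(k + 2) - s_cod(k + 2))

def acceptor_score(k, s_comb, s_cod, s_nc):
    """AS(k) = Scomb(k,24) + (Snc(k-80)-Scod(k-80)) + (Scod(k+2)-Snc(k+2))"""
    return s_comb(k, 24) + (s_nc(k - 80) - s_cod(k - 80)) + (s_cod(k + 2) - s_nc(k + 2))

# dummy log-score functions, purely for illustration
print(donor_score(200,
                  s_comb=lambda k, w: -1.0,
                  s_cod=lambda j: -100.0,
                  s_nc=lambda j: -110.0))   # -1 + 10 - 10 = -1.0
```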
Coding Statistics
• Unequal usage of codons in coding regions is a universal feature of genomes
• We can use this feature to differentiate between coding and noncoding regions of the genome
• A coding statistic is a function that, for a given DNA sequence, computes the likelihood that the sequence codes for a protein
• Many different coding statistics exist (codon usage, hexamer usage, GC content, Markov chains, IMMs, ICMs)
3-periodic ICMs
A three-periodic ICM uses three ICMs in succession to evaluate the
different codon positions, which have different statistics:
[Figure: three ICMs evaluate the three codon positions, e.g. P[C|M0] from ICM0, P[G|M1] from ICM1, P[A|M2] from ICM2, over the sequence ATC GAT CGA TCA GCT TAT CGC ATC]

The three ICMs correspond to the three phases. Every base is evaluated in every phase, and the score for a given stretch of (putative) coding DNA is obtained by multiplying the phase-specific probabilities in a mod 3 fashion:

    ∏_{i=0}^{L-1} P_{f(i) mod 3}(x_i)

where f(i) is the codon position (phase) of base x_i.
GlimmerHMM uses 3-periodic ICMs for coding and homogeneous
(non-periodic) ICMs for noncoding DNA.
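A minimal sketch of the mod-3 scoring pattern; simple per-phase base probabilities stand in for real interpolated context models, and the numbers are made up:

```python
import math

# Sketch of 3-periodic scoring. GlimmerHMM uses interpolated context models
# (ICMs); here plain per-phase base probabilities stand in, just to show the
# "multiply phase-specific probabilities mod 3" pattern.

phase_model = [  # phase_model[p][base] ~ P(base | codon position p), dummies
    {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},  # codon position 0
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # codon position 1
    {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},  # codon position 2
]

def three_periodic_logscore(seq: str, frame: int = 0) -> float:
    """Sum of log P_{(frame+i) mod 3}(x_i) over the putative coding stretch."""
    total = 0.0
    for i, base in enumerate(seq):
        total += math.log(phase_model[(frame + i) % 3][base])
    return total

print(three_periodic_logscore("ATCGATCGATCAGCTTATCGCATC"))
```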
The Advantages of Periodicity and Interpolation
HMMs and Gene Structure
• Nucleotides {A,C,G,T} are the observables
• Different states generate nucleotides at different frequencies
A simple HMM for unspliced genes:
[Figure: HMM with intergenic, start-codon (A, T, G), coding, and stop-codon (T, A, A) states generating:
AAAGC ATG CAT TTA ACG AGA GCA CAA GGG CTC TAA TGCCG]
• The sequence of states is an annotation of the generated string: each nucleotide is generated by an intergenic, start/stop, or coding state
Recall: “Pure” HMMs
An HMM is a stochastic machine M = (Q, Σ, Pt, Pe) consisting of the following:
• a finite set of states, Q = {q0, q1, ..., qm}
• a finite alphabet Σ = {s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ → [0,1], i.e., Pe(sj | qi)
An Example
M1 = ({q0,q1,q2}, {Y,R}, Pt, Pe)

Pt = {(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}
Pe = {(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}

[State diagram: q0 --100%--> q1; q1 --80%--> q1, --15%--> q2, --5%--> q0; q2 --70%--> q2, --30%--> q1; q1 emits Y = 100%, q2 emits R = 100%]
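A short simulation of M1, directly encoding the Pt and Pe tables above (a sketch; ending the run when the machine returns to q0 is our assumption):

```python
import random

# Simulating the example HMM M1 from the slide.

Pt = {"q0": [("q1", 1.0)],
      "q1": [("q1", 0.8), ("q2", 0.15), ("q0", 0.05)],
      "q2": [("q2", 0.7), ("q1", 0.3)]}
Pe = {"q1": [("Y", 1.0)], "q2": [("R", 1.0)]}  # q0 is a silent start state

def sample(n: int) -> str:
    state, out = "q0", []
    while len(out) < n:
        states, probs = zip(*Pt[state])
        state = random.choices(states, probs)[0]
        if state == "q0":          # returning to q0 ends the run (assumption)
            break
        syms, eprobs = zip(*Pe[state])
        out.append(random.choices(syms, eprobs)[0])
    return "".join(out)

print(sample(20))   # e.g. "YYYYYRRRYYYY..."
```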
HMMs & Geometric Feature Lengths

    P(x_0 ... x_{d-1} | q) = ( ∏_{i=0}^{d-1} Pe(x_i | q) ) · p^{d-1} (1 - p)

The factor p^{d-1}(1 - p) is a geometric distribution over feature (e.g. exon) lengths, where p is the state's self-transition probability.
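For example, a coding state with self-transition probability p = 0.9 implies P(d) = 0.9^(d-1) × 0.1 and a mean feature length of 1/(1 - p) = 10 bases, with the distribution decaying monotonically from d = 1; real exon-length histograms peak well above one base, which is one motivation for the GHMMs introduced next.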
Generalized Hidden Markov Models
Advantages:
* Submodel abstraction
* Architectural simplicity
* State duration modeling
Disadvantages:
* Decoding complexity
Generalized HMMs
A GHMM is a stochastic machine M = (Q, Σ, Pt, Pe, Pd) consisting of the following:
• a finite set of states, Q = {q0, q1, ..., qm}
• a finite alphabet Σ = {s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ*×ℕ → [0,1], i.e., Pe(sj | qi, dj)
• a duration distribution Pd : Q×ℕ → [0,1], i.e., Pd(dj | qi)
Key Differences
• each state now emits an entire subsequence rather than just one symbol
• feature lengths are now explicitly modeled, rather than implicitly geometric
• emission probabilities can now be modeled by any arbitrary probabilistic model
• there tend to be far fewer states => simplicity & ease of modification
Ref: Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in
DNA. ISMB '96.
Recall: Decoding with an HMM
    φ* = argmax_φ P(φ | S)
       = argmax_φ P(φ, S) / P(S)
       = argmax_φ P(φ, S)
       = argmax_φ P(S | φ) P(φ)

    P(φ) = ∏_{i=0}^{L} Pt(y_{i+1} | y_i)            (transition prob.)
    P(S | φ) = ∏_{i=0}^{L-1} Pe(x_i | y_{i+1})      (emission prob.)

    φ* = argmax_φ Pt(q0 | y_L) ∏_{i=0}^{L-1} Pe(x_i | y_{i+1}) Pt(y_{i+1} | y_i)
Decoding with a GHMM
    φ* = argmax_φ P(φ | S)
       = argmax_φ P(φ, S) / P(S)
       = argmax_φ P(φ, S)
       = argmax_φ P(S | φ) P(φ)

    P(φ) = ∏_{i=1}^{|φ|-2} Pt(y_{i+1} | y_i) Pd(d_i | y_i)     (transition & duration prob.)
    P(S | φ) = ∏_{i=0}^{|φ|-2} Pe(S_i | y_i, d_i)              (emission prob.)

    φ* = argmax_φ ∏_{i=0}^{|φ|-2} Pe(S_i | y_i, d_i) Pt(y_{i+1} | y_i) Pd(d_i | y_i)
Gene Prediction with a GHMM
Given a sequence S, we would like to determine the parse  of that
sequence which segments the DNA into the most likely exon/intron
structure:
[Figure: the parse φ segments the sequence S into a prediction of exon 1, exon 2, exon 3:
AGCTAGCAGTCGATCATGGCATTATCGGCCGTAGTACGTAGCAGTAGCTAGTAGCAGTCGATAGTAGCATTATCGGCCGTAGCTACGTAGCGTAGCTC]
The parse  consists of the coordinates of the predicted exons, and
corresponds to the precise sequence of states during the operation
of the GHMM (and their duration, which equals the number of
symbols each state emits).
This is the same as in an HMM except that in the HMM each state
emits bases with fixed probability, whereas in the GHMM each
state emits an entire feature such as an exon or intron.
GHMMs Summary
• GHMMs generalize HMMs by allowing each state to emit a
subsequence rather than just a single symbol
• Whereas HMMs model all feature lengths using a geometric
distribution, coding features can be modeled using an arbitrary
length distribution in a GHMM
• Emission models within a GHMM can be any arbitrary
probabilistic model (“submodel abstraction”), such as a neural
network or decision tree
• GHMMs tend to have many fewer states => simplicity &
modularity
GlimmerHMM architecture
[Figure: GlimmerHMM state diagram. Four exon types (Init Exon, internal exons Exon0/Exon1/Exon2, Term Exon, Exon Sngl) and phase-specific introns I0/I1/I2 appear on the forward (+) strand and are mirrored on the backward (-) strand, connected through a central Intergenic state.]
• Uses a GHMM to model gene structure (explicit length modeling)
• WAM and MDD for splice sites
• ICMs for exons, introns and intergenic regions
• Different model parameters for regions with different GC content
• Can emit a graph of high-scoring ORFs
Key steps in the GHMM Dynamic Programming Algorithm
• Scan left to right
• At each signal, look backward (left)
  – Find all compatible signals
  – Take the MAX score
  – Repeat for all reading frames
Key steps in the GHMM Dynamic Programming Algorithm
[Diagram: at a GT donor site, look back at all previous compatible signals (ATG start codons and AG acceptor sites)]
Key steps in the GHMM Dynamic Programming Algorithm
[Diagram: an exon linking an acceptor AG to a donor GT]
– Retrieve the score of the best parse up to the previous site
– Compute the score of the exon linking AG to GT
  – Use a Markov chain or other methods
– Look up the probability of the exon length
– Multiply the probabilities (or add logs)
Key steps in the GHMM Dynamic Programming Algorithm
[Diagram: at a GT donor, take the MAX over all previous compatible sites (AG acceptors and ATG starts)]
Store for each frame:
– MAX score
– Reading frame
– Pointer backward
GHMM Dynamic Programming Algorithm: Introns
[Diagram: many candidate GT donors upstream of an AG acceptor]
Huge number of potential signals: how far back to look?
GHMM Dynamic Programming Algorithm: Introns
[Diagram: an intron linking a donor GT to an acceptor AG]
– Limit look-back with a maximum intron length
  – Or, use other techniques
– Compute the score of the intron linking GT to AG
  – Score the donor site with the donor site model
  – Score the intron with a Markov chain
  – Score the acceptor with the acceptor site model
– Look up the probability of the intron length
– Multiply the probabilities (or add logs), as in the sketch below
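A minimal Python sketch of this look-back step; the scoring functions, the MAX_INTRON bound, and the toy coordinates are illustrative stand-ins, not GlimmerHMM's actual models (which also track reading frame and the exon states):

```python
import math

# Sketch of the signal-based DP step for introns: at each acceptor (AG),
# scan previous donors (GT) within the maximum intron length and keep the
# best-scoring link, plus a backward pointer for traceback.

MAX_INTRON = 10000   # illustrative look-back limit

def link_introns(donors, acceptors, best_score, donor_s, intron_s, acceptor_s, len_logp):
    """best_score[d] = log score of the best parse ending at donor position d."""
    result = {}
    for a in acceptors:
        best, arg = float("-inf"), None
        for d in donors:
            if d >= a or a - d > MAX_INTRON:      # limit look-back
                continue
            s = (best_score[d] + donor_s(d) + intron_s(d, a)
                 + acceptor_s(a) + len_logp(a - d))
            if s > best:
                best, arg = s, d                  # MAX over all previous sites
        result[a] = (best, arg)                   # score + pointer backward
    return result

# toy run with dummy models
res = link_introns(
    donors=[100, 250], acceptors=[600],
    best_score={100: -50.0, 250: -45.0},
    donor_s=lambda d: -2.0,
    intron_s=lambda d, a: -0.01 * (a - d),        # per-base intron log score
    acceptor_s=lambda a: -2.0,
    len_logp=lambda n: -math.log(MAX_INTRON))
print(res)   # the best link for the acceptor at 600 comes from the donor at 250
```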
Training the Gene Finder
θ = (Pt, Pe, Pd)
Training for GHMMs

Maximum-likelihood estimation over a labeled training set T of (sequence, parse) pairs:

    θ_MLE = argmax_θ ∏_{(S,φ)∈T} P(S, φ | θ)
          = argmax_θ ∏_{(S,φ)∈T} ∏_{y_i∈φ} Pe(S_i | y_i, d_i) Pt(y_i | y_{i-1}) Pd(d_i | y_i)
          = argmax_θ ∏_{(S,φ)∈T} ∏_{y_i∈φ} Pt(y_i | y_{i-1}) Pd(d_i | y_i) ∏_{j=0}^{|S_i|-1} Pe(x_j | y_i)

Pt: estimate via labeled training data:  a_{i,j} = A_{i,j} / Σ_{h=0}^{|Q|-1} A_{i,h}
Pd: construct a histogram of observed feature lengths
Pe: estimate via labeled training data:  e_{i,k} = E_{i,k} / Σ_{h=0}^{|Σ|-1} E_{i,h}
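In code, these estimates reduce to counting over the labeled parses; a minimal sketch, with an illustrative parse representation (state label plus emitted subsequence):

```python
from collections import Counter, defaultdict

# MLE training sketch: with labeled parses, the a_{i,j} and e_{i,k} ratios
# above are just normalized counts, and Pd is a length histogram.

def train(parses):
    """parses: one list of (state, emitted_subsequence) pairs per training gene."""
    trans = Counter()                # A[i,j]: observed i -> j transition counts
    emit = defaultdict(Counter)      # E[i,k]: per-state symbol counts
    lengths = defaultdict(Counter)   # per-state histogram of feature lengths
    for parse in parses:
        for (y_prev, _), (y, _) in zip(parse, parse[1:]):
            trans[(y_prev, y)] += 1
        for y, subseq in parse:
            emit[y].update(subseq)
            lengths[y][len(subseq)] += 1
    # normalize the counts into Pt, Pe and Pd estimates
    out_totals = Counter()
    for (i, _), c in trans.items():
        out_totals[i] += c
    pt = {(i, j): c / out_totals[i] for (i, j), c in trans.items()}
    pe = {y: {s: c / sum(cnt.values()) for s, c in cnt.items()} for y, cnt in emit.items()}
    pd = {y: {d: c / sum(h.values()) for d, c in h.items()} for y, h in lengths.items()}
    return pt, pe, pd

parses = [[("intergenic", "AAAGC"), ("exon", "ATGCATTAA"), ("intergenic", "TGCCG")]]
print(train(parses))
```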
Gene Finding in the Dark:
Dealing with Small Sample Sizes
• parameter mismatching: train on a close relative
• use a comparative gene finder trained on a close relative
• use BLAST to find conserved genes & curate them, use as training set
• augment the training set with genes from related organisms, use weighting
• manufacture artificial training data
  – long ORFs
• be sensitive to sample sizes during training by reducing the number of parameters (to reduce overtraining)
  – fewer states (1 vs. 4 exon states, intron=intergenic)
  – lower-order models
  – pseudocounts
  – smoothing (esp. for length distributions)
Evaluation of Gene Finding Programs
Nucleotide level accuracy
[Figure: REALITY vs. PREDICTION tracks aligned base by base; each nucleotide is counted as a TP, TN, FP, or FN]

Sensitivity: Sn = TP / (TP + FN)
Precision:   Pr = TP / (TP + FP)
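A direct translation of these two formulas, using 0/1 coding labels per base (the example labels are illustrative):

```python
# Nucleotide-level sensitivity and precision from per-base coding labels
# (a minimal sketch; 1 = coding, 0 = noncoding).

def nucleotide_accuracy(reality, prediction):
    tp = sum(1 for r, p in zip(reality, prediction) if r and p)
    fn = sum(1 for r, p in zip(reality, prediction) if r and not p)
    fp = sum(1 for r, p in zip(reality, prediction) if not r and p)
    sn = tp / (tp + fn) if tp + fn else 0.0
    pr = tp / (tp + fp) if tp + fp else 0.0
    return sn, pr

reality    = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
prediction = [0, 1, 1, 1, 1, 0, 0, 0, 1, 1]
print(nucleotide_accuracy(reality, prediction))   # (0.8333..., 0.8333...)
```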
More Measures of Prediction Accuracy
Exon level accuracy
[Figure: REALITY vs. PREDICTION exon tracks illustrating wrong, correct, and missing exons]

ExonSn = TE / AE = (number of correct exons) / (number of actual exons)
ExonPr = TE / PE = (number of correct exons) / (number of predicted exons)
GlimmerHMM on human genes
(circa 2002)
              Nuc    Nuc    Nuc    Exon   Exon   Exon   Exact
              Sens   Prec   Acc    Sens   Prec   Acc    Genes
GlimmerHMM    86%    72%    79%    72%    62%    67%    17%
Genscan       86%    68%    77%    69%    60%    65%    13%
GlimmerHMM's performance compared to Genscan on 963 human RefSeq genes selected randomly from all 24 chromosomes, non-overlapping with the training set. The test set contains 1000 bp of untranslated sequence on either side (5' or 3') of the coding portion of each gene.
GlimmerHMM on other species
                          Nucleotide Level    Exon Level     Correctly         Size of
                          Sn       Pr         Sn      Pr     Predicted Genes   test set
Arabidopsis thaliana      97%      99%        84%     89%    60%               809 genes
Cryptococcus neoformans   96%      99%        86%     88%    53%               350 genes
Coccidioides posadasii    99%      99%        84%     86%    60%               503 genes
Oryza sativa              95%      98%        77%     80%    37%               1323 genes
GlimmerHMM has also been trained on: Aspergillus fumigatus, Entamoeba
histolytica, Toxoplasma gondii, Brugia malayi, Trichomonas vaginalis, and many
others.
Ab initio gene finding in the model plant
Arabidopsis thaliana (circa 2004)
Arabidopsis thaliana test results

              Nucleotide          Exon                Gene
              Sn    Pr    Acc     Sn    Pr    Acc     Sn    Pr    Acc
GlimmerHMM    97    99    98      84    89    86.5    60    61    60.5
SNAP          96    99    97.5    83    85    84      60    57    58.5
Genscan+      93    99    96      74    81    77.5    35    35    35
• All three programs were tested on a test data set of 809 genes, which did not overlap with the training data set of GlimmerHMM.
• All genes were confirmed by full-length Arabidopsis cDNAs and carefully inspected to remove homologues.