Ab_initio_predition_tools - Compgenomics2010

GeneMark
• Developed in 1993 at the Georgia Institute of Technology as the first gene-finding tool.
• Uses Markov chain models, built from dicodon statistics, to represent coding and noncoding reading frames.
Shortcomings:
•Inability to find exact gene boundaries.
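The dicodon idea can be illustrated with a fifth-order Markov chain, where each base is predicted from the five bases before it (six bases, one dicodon, per count). The following is a minimal sketch under stated assumptions, not GeneMark's implementation: it uses one chain per class rather than frame-specific models, and the function names, pseudocount, and toy training sequences are hypothetical.

```python
import math
from collections import defaultdict

ORDER = 5  # a dicodon is 6 bases: 5 bases of context plus the base being predicted

def train_markov(seqs, order=ORDER, pseudo=1.0):
    """Return a function logp(context, base) estimated from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def logp(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudo  # pseudocounts avoid log(0)
        return math.log((row.get(base, 0.0) + pseudo) / total)

    return logp

def log_odds(seq, coding_logp, noncoding_logp, order=ORDER):
    """Positive values favour the coding model, negative the noncoding one."""
    return sum(
        coding_logp(seq[i - order:i], seq[i]) - noncoding_logp(seq[i - order:i], seq[i])
        for i in range(order, len(seq))
    )

# Toy training data (hypothetical, far too small for real use).
coding = train_markov(["GCT" * 20])
noncoding = train_markov(["AT" * 30])
```

A window is then called coding when `log_odds(window, coding, noncoding)` is positive. Because the score is computed over a window, it says little about where exactly a gene starts or ends, which is the shortcoming noted above.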
GeneMark.hmm
• Improves gene prediction by locating gene boundaries more accurately.
• Incorporates a hidden Markov model (HMM) with duration.
• The aim of gene finding is to recover the true functional sequence X from the anonymous DNA sequence S.
S = {b1, b2, ..., bL}
where bi stands for one of the nucleotides A, G, C, T and L is the length of the sequence.
X = {x1, x2, ..., xL}
where xi
= 0 if the nucleotide is in a noncoding region
= 1 if the nucleotide is in a coding region
= 2 if the nucleotide is in a coding region on the complementary strand
• The probability that the functional sequence X underlies the sequence S is calculated as P(X|S) = P(x1, x2, ..., xL | b1, b2, ..., bL).
• The Viterbi algorithm then finds the functional sequence X* for which P(X*|S) is the largest among all possible values of X.
• A ribosome binding site (RBS) model was also added to improve the accuracy of predicted translational start sites.
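The search for X* can be sketched with the textbook Viterbi recursion over the three labels 0, 1, and 2. This is a simplified illustration only: it uses a plain HMM without the duration component that GeneMark.hmm relies on, and every start, transition, and emission probability below is an invented toy value.

```python
import math

STATES = (0, 1, 2)  # 0 = noncoding, 1 = coding, 2 = coding on the complementary strand

# Hypothetical toy parameters, chosen only to make the example readable.
START = {0: 0.8, 1: 0.1, 2: 0.1}
TRANS = {r: {s: (0.8 if r == s else 0.1) for s in STATES} for r in STATES}
EMIT = {
    0: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    1: {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    2: {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
}

def viterbi(seq, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable label path X* for seq, computed in log space."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in STATES}]
    back = []
    for b in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: best[-1][r] + math.log(trans[r][s]))
            col[s] = best[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][b])
            ptr[s] = prev
        best.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Calling `viterbi("ATATGGGGGGATAT")` returns one label per nucleotide, i.e. a candidate X for the given S.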
Gene overlaps are quite common, even in prokaryotic genomes.
The RBS feature addresses this problem by defining a 5-position nucleotide matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated.
It uses the consensus sequence AGGAG to search upstream of any alternative start codons for genes predicted by the HMM.
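A position-specific nucleotide matrix of this kind can be sketched as follows. The aligned sites here are a handful of made-up examples around the AGGAG consensus, not the 325 annotated E. coli signals, and the function names, pseudocount, and uniform background are assumptions for illustration.

```python
import math

BASES = "ACGT"

def build_matrix(sites, pseudo=1.0):
    """Per-position nucleotide probabilities from equal-length aligned sites."""
    matrix = []
    for j in range(len(sites[0])):
        column = [s[j] for s in sites]
        total = len(column) + 4 * pseudo
        matrix.append({b: (column.count(b) + pseudo) / total for b in BASES})
    return matrix

def best_rbs(upstream, matrix, background=0.25):
    """Return (score, offset) of the best-scoring window in the upstream region."""
    width = len(matrix)
    best = (-math.inf, -1)
    for i in range(len(upstream) - width + 1):
        score = sum(
            math.log(matrix[j][upstream[i + j]] / background) for j in range(width)
        )
        best = max(best, (score, i))
    return best

# Hypothetical aligned RBS sites clustered around the AGGAG consensus.
SITES = ["AGGAG", "AGGAG", "AGGAA", "GGGAG", "AGGAG"]
MATRIX = build_matrix(SITES)
```

Running `best_rbs` on the region upstream of each candidate start codon gives one score per candidate, so alternative starts of the same predicted gene can be compared.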
GLIMMER
• Maintained by Steven Salzberg and Art Delcher at the University of Maryland, College Park.
• The first gene finder to use interpolated Markov models (IMMs).
• Predictions are based on a variable context (oligomers of variable length).
• More flexible than fixed-order Markov models.
•Three versions:
Glimmer 1 (1997)
Glimmer 2 (1999)
Glimmer 3 (2007)*
Principle:
An IMM combines probabilities based on the 0, 1, ..., k previous bases; in this case k = 8 is used. The highest orders are relied on only for oligomers that occur frequently; for rarely occurring oligomers, 5th order or lower may be used instead.
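The interpolation can be sketched as a recursion that starts from the order-0 estimate and blends in each longer context according to how often that context was seen. The weighting rule below (count divided by a fixed threshold, capped at 1) is a deliberate simplification and not Glimmer's actual weighting scheme; the threshold value and function names are assumptions.

```python
from collections import defaultdict

def count_contexts(seqs, max_order=8):
    """counts[context][base] for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(len(s)):
            for k in range(min(i, max_order) + 1):
                counts[s[i - k:i]][s[i]] += 1
    return counts

def imm_prob(counts, context, base, max_order=8, threshold=400):
    """Interpolated P(base | context) using up to max_order preceding bases."""
    context = context[-max_order:] if max_order > 0 else ""
    prob = 0.25  # uniform fallback when nothing has been observed
    for k in range(len(context) + 1):
        ctx = context[len(context) - k:]  # the k bases immediately before `base`
        row = counts.get(ctx)
        n = sum(row.values()) if row else 0
        if n == 0:
            break  # longer contexts were never observed
        lam = min(1.0, n / threshold)  # simplified weight: trust frequent contexts more
        prob = lam * (row.get(base, 0) / n) + (1.0 - lam) * prob
    return prob
```

With this rule a context seen at least `threshold` times overrides the shorter ones completely, while a rare context only nudges the lower-order estimate, which mirrors the behaviour described above.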
Glimmer 2
Glimmer 2 adds the concept of the interpolated context model (ICM) to the IMM.
ICM: more flexible; it can choose any base in the variable context (not only the ones adjacent to bk+1) to determine the probability of the base of interest (bk+1).
Based on codon bias in translation.
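One way to pick the context position that says the most about bk+1 is to rank positions by their mutual information with that base. The sketch below shows only this selection step on a toy set of windows; it assumes mutual information as the criterion and omits the rest of an ICM (the recursive splitting of the training windows and the interpolation). The function names and example windows are hypothetical.

```python
import math
from collections import Counter

def mutual_information(windows, pos):
    """Mutual information (bits) between the base at `pos` and each window's last base."""
    n = len(windows)
    joint = Counter((w[pos], w[-1]) for w in windows)
    left = Counter(w[pos] for w in windows)
    last = Counter(w[-1] for w in windows)
    return sum(
        (c / n) * math.log2(c * n / (left[x] * last[y]))
        for (x, y), c in joint.items()
    )

def most_informative_position(windows):
    """Index of the context position that best predicts the last base."""
    width = len(windows[0]) - 1  # the last base is the one being predicted
    return max(range(width), key=lambda p: mutual_information(windows, p))

# Toy windows in which the FIRST base, not the adjacent one, determines the last base.
WINDOWS = ["AGTA", "CGTC", "GTAG", "TTAT", "ACGA", "CCAC"]
```

For these windows `most_informative_position(WINDOWS)` selects the position furthest from the predicted base, which a fixed-order model conditioned only on adjacent bases could not exploit as directly.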
Add-on features:
- Increased sensitivity.
Glimmer 3:
• Overcomes the shortcomings of previous versions by summing the RBS score, the IMM coding potential, and a score for the start codon, which depends on the relative frequency of each possible start codon in the same training set used for RBS determination.
• The algorithm scores every ORF (open reading frame) with the IMM in reverse, from the stop codon back to the start codon; the probability of each base is conditioned on a context window on its 3' side, and the score is the sum of the log-likelihoods of the bases contained in the ORF.
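The reverse scoring and the summed gene score can be sketched as below. This is an illustration under stated assumptions rather than Glimmer 3's code: `logp` stands in for a trained IMM, the context length of 2 is arbitrary, and the start codon counts and RBS score are hypothetical inputs.

```python
import math

def reverse_orf_score(orf, logp, order=2):
    """Sum of log-likelihoods, walking from the stop codon back to the start codon."""
    total = 0.0
    for i in range(len(orf) - 1, -1, -1):
        context = orf[i + 1:i + 1 + order]  # bases on the 3' side of position i
        total += logp(context, orf[i])
    return total

def start_codon_score(codon, start_counts):
    """Log relative frequency of this start codon in the training set."""
    return math.log(start_counts[codon] / sum(start_counts.values()))

def gene_score(orf, logp, rbs_score, start_counts, order=2):
    """RBS score + coding potential (reverse-scored) + start codon score."""
    return (
        rbs_score
        + reverse_orf_score(orf, logp, order)
        + start_codon_score(orf[:3], start_counts)
    )

def uniform_logp(context, base):
    """Placeholder model: every base equally likely regardless of context."""
    return math.log(0.25)

# Hypothetical start codon counts from a training set.
START_COUNTS = {"ATG": 80, "GTG": 15, "TTG": 5}
```

Because scoring begins at the stop codon, which all candidate starts of one reading frame share, the running total up to each candidate start can be reused when comparing alternative start sites.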
Highlights:
• Increased sensitivity together with increased specificity.
• Fewer overlapping predictions.
• Separating sequences from separate genomes.
• Better prediction for GC-rich genomes.