Gene Finding With A Hidden Markov Model
Of Genomic Structure and Evolution
Jakob Skou Pedersen and Jotun Hein
Deepak Verghese
CS 6890
Existing methods:
• GPHMM
• Conserved Exon Method
• Two-step GLASS and ROSETTA
• TWINSCAN, which extends GENSCAN
• etc.
These methods:
• Do not exploit all the information in the evolutionary pattern
• Are not easily extended to multiple genome sequences
Evolutionary Hidden Markov Model (EHMM)
A Probabilistic model of both Genome Structure and Evolution
Composed of:
1. Hidden Markov Model (HMM)
2. Phylogenetic Tree
• Can handle any number of sequences in an alignment
• Can have the properties of higher-order HMMs
• Can handle variability in the sequences along the alignment
• State-of-the-art evolutionary models can be incorporated later
• Evolutionary events between different genomes are not treated independently
SCOPE
• Not to compete with existing gene-finding methods on performance, but to illustrate the power of this approach.
• Relies on a pre-produced alignment.
MARKOV CHAINS
• A set of states
• The transitions from one state to all other states, including itself, are governed by a probability distribution
• First-order Markov chain: the transition probabilities depend solely on the current state
• n-th order Markov chain: the transition probabilities depend on the n most recent states
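The definition above can be sketched as a short simulation. This is a minimal illustration only: the two state names and all probabilities below are made-up values, not parameters from the paper.

```python
import random

def sample_chain(states, initial, transitions, length, rng):
    """Sample a state path (length >= 1) from a first-order Markov chain."""
    # The first state is drawn from the initial distribution.
    path = rng.choices(states, weights=[initial[s] for s in states])
    while len(path) < length:
        current = path[-1]
        # First order: the next state depends only on the current state.
        weights = [transitions[current][s] for s in states]
        path.append(rng.choices(states, weights=weights)[0])
    return path

# Hypothetical two-state chain; every number here is an assumption.
STATES = ["intergenic", "exon"]
INITIAL = {"intergenic": 0.9, "exon": 0.1}
TRANSITIONS = {
    "intergenic": {"intergenic": 0.95, "exon": 0.05},
    "exon": {"intergenic": 0.10, "exon": 0.90},
}

path = sample_chain(STATES, INITIAL, TRANSITIONS, 20, random.Random(0))
print(path)
```

Each row of TRANSITIONS sums to 1, which is what "governed by a probability distribution" means in practice.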
HIDDEN MARKOV MODEL
5 Components
• A set of states
• Matrix of transition probabilities (A)
• Alphabet of symbols (C)
• Set of emission distributions (e)
• Initial state distribution (B)
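The five components can be written down as a small data structure. The sketch below is an assumption-laden toy: the class name, the two state names, and all probabilities are hypothetical, and it emits single nucleotides rather than alignment columns.

```python
import random
from dataclasses import dataclass

@dataclass
class HMM:
    states: list       # the set of states
    alphabet: list     # C: the symbols that can be emitted
    transitions: dict  # A: transitions[k][l] = P(next state l | state k)
    emissions: dict    # e: emissions[k][c] = P(symbol c | state k)
    initial: dict      # B: initial[k] = P(first state is k)

    def sample(self, length, rng):
        """Generate a hidden state path and the symbols it emits."""
        path, symbols = [], []
        weights = [self.initial[k] for k in self.states]
        for _ in range(length):
            state = rng.choices(self.states, weights=weights)[0]
            emit = [self.emissions[state][c] for c in self.alphabet]
            path.append(state)
            symbols.append(rng.choices(self.alphabet, weights=emit)[0])
            # The next state is drawn from the row of A for the current state.
            weights = [self.transitions[state][l] for l in self.states]
        return path, "".join(symbols)

# Hypothetical toy model; all numbers are assumptions for illustration.
model = HMM(
    states=["AT-rich", "GC-rich"],
    alphabet=["A", "C", "G", "T"],
    transitions={"AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
                 "GC-rich": {"AT-rich": 0.2, "GC-rich": 0.8}},
    emissions={"AT-rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
               "GC-rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}},
    initial={"AT-rich": 0.5, "GC-rich": 0.5},
)

hidden_path, sequence = model.sample(15, random.Random(1))
print(sequence)
```

Only `sequence` would be observed in practice; `hidden_path` is the part that has to be inferred.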
Example of hidden Markov model
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Why the name Hidden?
There is no 1:1 correspondence between states and symbols.
Components
• State k
• Emits symbols (observables) from C
• PROBABILISTIC MODEL
Emission distribution e
Initial state distribution B
Transition probabilities A
Path Π
Different paths are possible for the same sequence
In the EHMM, the emission distribution e of state k is specified by:
• Evolutionary model E_k
• Phylogenetic tree T
PHYLOGENETIC TREES
Motivation:
The problem of explaining the evolutionary history of today's species
In phylogenetic trees:
• Leaves represent present-day species
• Character states of inner nodes are missing data
• Interior nodes represent hypothesized ancestors
• The lengths of the branches of a tree represent the amount of evolutionary difference
Evolution is often modeled by continuous-time Markov chains.
Here, evolution along the branches of the phylogenetic tree is modeled by E_k.
Transition probability matrix: P_k(t)
For a branch of length t: P_k(t) = exp(t Q_k), where Q_k is the rate matrix of E_k.
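The relation P_k(t) = exp(t Q_k) can be checked numerically. The sketch below assumes a Jukes-Cantor rate matrix purely because it has a known closed form; the slides do not say which rate matrices were used, and the helper names and the rate 0.1 are hypothetical.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, squarings=10, terms=12):
    """exp(m) by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes small^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling
    return result

def jukes_cantor_rate_matrix(alpha):
    """Assumed toy rate matrix Q: every substitution has rate alpha."""
    return [[-3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]

def transition_matrix(rate_matrix, t):
    """P(t) = exp(t Q) for a branch of length t."""
    return mat_exp([[t * q for q in row] for row in rate_matrix])

ALPHA, T = 0.1, 2.0
P = transition_matrix(jukes_cantor_rate_matrix(ALPHA), T)

# Known closed form for Jukes-Cantor, used only as a cross-check.
same = 0.25 + 0.75 * math.exp(-4 * ALPHA * T)
diff = 0.25 - 0.25 * math.exp(-4 * ALPHA * T)
print(P[0][0], same)
print(P[0][1], diff)
```

Each row of P sums to 1, and as t grows every row approaches the equilibrium distribution (here uniform).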
Increasing the number of sequences increases the amount of evolutionary information.
THE ALIGNMENT COLUMN CORRESPONDS TO THE STATE OF
EVOLUTION AT THE LEAVES OF THE PHYLOGENETIC TREE
THE PROBABILITY OF GENERATING AN ALIGNMENT COLUMN IN STATE K
EQUALS THE PROBABILITY OF OBSERVING A GIVEN CHARACTER PATTERN
ON THE LEAVES OF T, GIVEN E_k
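The probability of a character pattern on the leaves of a tree is commonly computed with Felsenstein's pruning algorithm, which sums over the unknown characters at the interior nodes. The slides do not name the algorithm used, so treat the following as a sketch under assumptions: a Jukes-Cantor nucleotide model with a uniform equilibrium distribution, and a made-up three-species tree with made-up branch lengths.

```python
import math

NUCLEOTIDES = "ACGT"

def jc_prob(a, b, t, alpha=1.0):
    """Jukes-Cantor probability of a -> b along a branch of length t."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def partial_likelihoods(node, column):
    """Return {x: P(leaves below node | node carries character x)}."""
    if isinstance(node, str):  # a leaf, identified by its sequence name
        observed = column[node]
        if observed == "-":    # a gap is treated as missing data
            return {x: 1.0 for x in NUCLEOTIDES}
        return {x: float(x == observed) for x in NUCLEOTIDES}
    result = {x: 1.0 for x in NUCLEOTIDES}
    for child, branch_length in node:  # interior node: list of (child, length)
        below = partial_likelihoods(child, column)
        for x in NUCLEOTIDES:
            result[x] *= sum(jc_prob(x, y, branch_length) * below[y]
                             for y in NUCLEOTIDES)
    return result

def column_probability(tree, column):
    """P(alignment column | tree), with a uniform root distribution."""
    root = partial_likelihoods(tree, column)
    return sum(0.25 * root[x] for x in NUCLEOTIDES)

# Hypothetical tree ((human:0.1, mouse:0.3):0.05, chicken:0.6).
TREE = [([("human", 0.1), ("mouse", 0.3)], 0.05), ("chicken", 0.6)]

p = column_probability(TREE, {"human": "A", "mouse": "A", "chicken": "G"})
print(p)
```

In an EHMM, each state k would plug its own evolutionary model into a calculation of this kind to obtain its emission probability for a column.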
Phylogenetic tree of the entries of the 3 alignment columns
• Codon-based evolutionary model used to calculate the emission probability of the columns of A
• Nucleotide-based evolutionary model used to calculate the emission probability of column B
• Emission probability of C is obtained from the equilibrium distribution of the relevant evolutionary model
Parameter Estimation
Parameters of the HMM are estimated by a combination of:
• Baum-Welch
• Powell
Evolutionary model E is divided into:
• E_equ
• E_evo
Initial state distribution B can be estimated by Baum-Welch, but it is generally set to 0.00001 for all states except the intergenic state.
The expectation step of Baum-Welch estimates:
• the expected number of nucleotides emitted from each state
• the expected number of state transitions
• the expected number of times each state is used
Powell's method, another optimization method, estimates:
• E_evo
• phylogenetic tree T
The Baum-Welch method is used to estimate:
• E_equ
• A
Therefore, the likelihood of an alignment x given a parameterization θ of the EHMM can be found by the equation
P(x | θ) = Σ_Π P(x, Π | θ)
Here we are summing over all possible paths Π.
This can be done in time linear in the alignment length by dynamic programming.
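The standard dynamic program for summing over all paths in an HMM is the forward algorithm. The sketch below is a toy under assumptions: it uses single symbols rather than alignment columns, the model parameters are made up, and the emission is passed in as a function so that a tree-based column probability could be substituted. A brute-force sum over every path is included only to show that both give the same number.

```python
from itertools import product

def forward_likelihood(observations, states, initial, transitions, emission):
    """P(x | model), summed over all paths; needs at least one observation."""
    forward = {k: initial[k] * emission(k, observations[0]) for k in states}
    for symbol in observations[1:]:
        # One pass per position: O(len(x) * |states|^2) work in total.
        forward = {
            l: emission(l, symbol)
               * sum(forward[k] * transitions[k][l] for k in states)
            for l in states
        }
    return sum(forward.values())

def brute_force_likelihood(observations, states, initial, transitions, emission):
    """Same quantity by explicit enumeration of all |states|^len(x) paths."""
    total = 0.0
    for path in product(states, repeat=len(observations)):
        p = initial[path[0]] * emission(path[0], observations[0])
        for i in range(1, len(path)):
            p *= transitions[path[i - 1]][path[i]]
            p *= emission(path[i], observations[i])
        total += p
    return total

# Hypothetical toy parameters; all values are assumptions.
STATES = ["intergenic", "exon"]
INITIAL = {"intergenic": 0.8, "exon": 0.2}
TRANSITIONS = {"intergenic": {"intergenic": 0.9, "exon": 0.1},
               "exon": {"intergenic": 0.2, "exon": 0.8}}
TABLE = {"intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
         "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

def emission(state, symbol):
    return TABLE[state][symbol]

x = "ACGGTCA"
fast = forward_likelihood(x, STATES, INITIAL, TRANSITIONS, emission)
slow = brute_force_likelihood(x, STATES, INITIAL, TRANSITIONS, emission)
print(fast, slow)
```

For long alignments the raw probabilities underflow, so a practical version would rescale the forward values or work in log space.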
EHMM is fully probabilistic and can be used to
simulate data and find genes.
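One common way to find genes with an HMM is Viterbi decoding, which returns the single most probable state path for the observed data. The slides do not say which decoding was used, so this is an illustrative sketch only; the state names and probabilities are hypothetical, and single symbols stand in for alignment columns.

```python
import math

def viterbi(observations, states, initial, transitions, emissions):
    """Most probable state path for a non-empty input, in log space."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    score = {k: log(initial[k]) + log(emissions[k][observations[0]])
             for k in states}
    back = []  # back[i][l] = best predecessor of state l at position i + 1
    for symbol in observations[1:]:
        pointers, new_score = {}, {}
        for l in states:
            best = max(states, key=lambda k: score[k] + log(transitions[k][l]))
            pointers[l] = best
            new_score[l] = (score[best] + log(transitions[best][l])
                            + log(emissions[l][symbol]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(states, key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy parameters; all values are assumptions.
STATES = ["noncoding", "coding"]
INITIAL = {"noncoding": 0.5, "coding": 0.5}
TRANSITIONS = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
               "coding": {"noncoding": 0.1, "coding": 0.9}}
EMISSIONS = {"noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
             "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

OBS = "AATTGGCCGG"
decoded = viterbi(OBS, STATES, INITIAL, TRANSITIONS, EMISSIONS)
print(decoded)
```

Runs of the "coding" label in the decoded path play the role of predicted genes in this toy setting.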
EUKARYOTIC GENOME MODEL
• Can be used to generate alignments.
• The reduced model produces only inner exons.
Results
Benefits of modeling evolution with an EHMM,
using a data set of orthologous mouse/human gene pairs
The benefit will depend on the divergence between
the sequences compared
A key parameter for modeling the difference
between exons and introns is the dN/dS ratio.
Moreover, we see that the evolutionary model shows a distinct difference
between the intergenic/intron state and the codon state
Evaluations were performed on both single and aligned sequences
Graphical Representation
The simple model used here is not comparable to state-of-the-art methods
Any number of aligned sequences can be handled
Extensions of the model
• GENSCAN can be extended into an EHMM
• Splice site finders
• Models of ribosome binding site and promoter regions
• Non-geometric length distributions of exons
• Pseudo higher-order EHMMs can be constructed
• Extending the pair-HMM idea to multiple sequences
Disadvantages in present model
• The existing framework does not model gaps but treats them as missing data.
• Optimal data for the EHMM is a multiple alignment of full-length genomes.
• The challenge in constructing the alignment is to reduce the noise-to-signal ratio.
BUT ………..