ppt - University of Illinois Urbana

Download Report

Transcript ppt - University of Illinois Urbana

Applications of Hidden Markov Models
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
March 6, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
1
Today’s Lecture
• HMM Applications
– Profile HMMs (Classification)
– HMMs for Multiple Sequence Alignment (Pattern
discovery)
– HMMs for Gene Finding (Segmentation)
• Special issues in HMMs
– Local Maximas
– Model construction
– Weighting training sequences
2
HMM Applications
•
Classification (e.g., Profile HMMs)
– Build an HMM for each class (profile HMMs)
– Classify a sequence using Bayes rule
•
Multiple sequence alignment
– Build an HMM based on a set of sequences
– Decode each sequence to find a multiple alignment
•
Segmentation (e.g., gene finding)
– Use different states to model different regions
– Decode a sequence to reveal the region boundaries
3
HMMs for Classification
E.g., Protein families
C {C1 ,..., Ck }
p( X | C ) p(C )
p(C | X ) 
p( X )
C*  arg max C p( X | C ) p(C )
Assign a family to X
p(X|C) is modeled by a profile HMM built specifically for C
Assuming example sequences are available for C
4
HMMs for Multiple Alignment
• Given a set of sequences S={X1, …,Xk}
• Train an HMM, e.g., using Baum-Welch (finding
the HMM that maximizes the probability of S)
• Decode each sequence Xi
• Assemble the Viterbi paths to form a multiple
alignment (insertions are uncertain)
5
HMM-based Gene Finding
•
•
•
•
Design two types of states
– “Within Gene” States
– “Outside Gene” States
Use known genes to estimate the HMM
Decode a new sequence to reveal which part is a gene
Example software:
–
–
–
–
–
GENSCAN (Burge 1997)
FGENESH (Solovyev 1997)
HMMgene (Krogh 1997)
GENIE (Kulp 1996)
GENMARK (Borodovsky & McIninch 1993)
– VEIL (Henderson, Salzberg, & Fasman 1997)
6
VEIL: Viterbi Exon-Intron Locator
Exon HMM Model
Upstream
Start Codon
3’ Splice Site
Exon
Intron
Stop Codon
5’ Splice Site
Downstream
5’ Poly-A Site
VEIL Architecture
• Enter: start codon or intron (3’ Splice Site)
• Exit: 5’ Splice site or three stop codons
(taa, tag, tga)
(Slide from N. F. Samatova’s lecture)
7
GenScan
Architecture
•
•
It is based on Generalized HMM
(GHMM)
Model both strands at once
– Other models: Predict on one
strand first, then on the other
strand
•
•
•
– Avoids prediction of
overlapping genes on the two
strands (rare)
Each state may output a string of
symbols (according to some
probability distribution).
Explicit intron/exon length
modeling
Special sensors for Cap-site and
TATA-box
8
Fig. 3, Burge and Karlin 1997
Special Issues
• Local maxima
• Optimal model construction
• Weighting training sequences
9
Solutions to the Local Maxima Problem
• Repeat with different initializations
• Start with the most reasonable initial model
• Simulated annealing (slow down the
convergence speed)
10
Local Maxima: Illustration
Local maxima
Bad starting point
Global maxima
Good starting point
11
Optimal Model Construction
p ( X | HMM ) p ( HMM )
p ( HMM | X ) 
p( X )
HMM * arg max HMM p( HMM | X )
 arg max HMM p ( X | HMM ) p ( HMM )
Bayesian model selection:
P(HMM) should prefer simpler models
12
Sequence Weighting
• Avoid over-counting similar sequences from
the same organisms
• Typically compute a weight for a sequence
based on an evolutionary tree
• Many ways to incorporate the weights, e.g.,
– Unequal likelihood
– Unequal weight contribution in parameter
estimation
13
HMMs in Real Applications
• SAM-T98 Tutorial:
– http://www.cse.ucsc.edu/research/compbio/ismb
99.tutorial.html
• Pfam
– http://www.sanger.ac.uk/Software/Pfam/
14