Hidden Markov Models in Bioinformatics


Example Domains: Gene Finding & Protein Family Modeling
5 Second Overview

Today’s goal: introduce HMMs as general tools in bioinformatics.

- I will use the problem of Gene Finding as an example of an “ideal” HMM problem domain.
- I will use the problem of Protein Family Modeling as an example of a clever way to fit HMMs to a problem.

Learning Objectives

When I’m done you should know:

1. When is an HMM a good fit for a problem space?
2. What materials are needed before work can begin with an HMM?
3. What are the advantages and disadvantages of using HMMs?
Outline

- HMMs as Statistical Models
- The example tasks at a glance
- Good problems for HMMs
- HMM Advantages
- HMM Disadvantages
- Gene Finding Examples
Statistical Models

Definition: any mathematical construct that attempts to parameterize a random process.

Example: a normal distribution.

Four aspects to consider for any statistical model (sketched below for the normal-distribution example):

- Assumptions
- Parameters
- Estimation
- Usage

HMMs are just a little more complicated…
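To make those four aspects concrete for the simpler normal-distribution example, here is a minimal Python sketch; the sample values and the function name normal_pdf are my own illustration, not material from the talk:

```python
import math

# Assumption: the samples are independent draws from a single normal distribution.
samples = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0]  # hypothetical measurements

# Estimation: fit the two parameters (mean and standard deviation) from the data.
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
std = math.sqrt(var)

# Usage: evaluate the probability density of a new observation under the model.
def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(normal_pdf(5.05, mean, std))
```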
HMM Assumptions

- Observations are ordered.
- The random process can be represented by a stochastic finite state machine with emitting states.
HMM Parameters

Using the weather example: modeling daily weather for a year, i.e. an observation sequence like Ra Ra Su Su Su Ra…

- Lots of parameters: one for each table entry.
- The parameters are represented in two tables (a concrete sketch follows below):
  - One table for emissions
  - One table for transitions
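As a concrete version of the weather example, the two parameter tables (plus initial-state probabilities) might look like the following; the hidden states, symbols, and probability values are hypothetical placeholders, not numbers from the talk:

```python
# Hypothetical hidden states and observation symbols for the weather example.
states = ["WetSeason", "DrySeason"]   # hidden states (illustrative)
symbols = ["Ra", "Su"]                # observed symbols: Rainy, Sunny

# Transition table: one parameter per (state, next_state) entry.
transitions = {
    "WetSeason": {"WetSeason": 0.8, "DrySeason": 0.2},
    "DrySeason": {"WetSeason": 0.3, "DrySeason": 0.7},
}

# Emission table: one parameter per (state, symbol) entry.
emissions = {
    "WetSeason": {"Ra": 0.9, "Su": 0.1},
    "DrySeason": {"Ra": 0.2, "Su": 0.8},
}

# Initial-state probabilities (also parameters of the model).
initial = {"WetSeason": 0.6, "DrySeason": 0.4}
```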
HMM Estimation

- Called training, it falls under machine learning.
- Feed an architecture (given in advance) a set of observation sequences.
- The training process iteratively alters the model’s parameters to fit the training set (a minimal sketch follows below).
- The trained model will assign the training sequences high probability.
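A minimal sketch of the simplest training case, assuming the training sequences come with their hidden-state labels so estimation reduces to normalized counting; for unlabeled data, Baum-Welch (EM) iterates a soft version of these same counts. The data and variable names here are hypothetical:

```python
from collections import defaultdict

# Hypothetical labeled training sequence: (observation, hidden state) per day.
labeled_seq = [("Ra", "Wet"), ("Ra", "Wet"), ("Su", "Dry"), ("Su", "Dry"), ("Ra", "Wet")]

trans_counts = defaultdict(lambda: defaultdict(float))
emit_counts = defaultdict(lambda: defaultdict(float))

# Count transitions between consecutive hidden states and emissions from each state.
for (obs, state), (_, next_state) in zip(labeled_seq, labeled_seq[1:]):
    trans_counts[state][next_state] += 1
    emit_counts[state][obs] += 1
emit_counts[labeled_seq[-1][1]][labeled_seq[-1][0]] += 1  # emission of the final observation

def normalize(counts):
    """Turn raw counts into row-normalized probability tables."""
    return {s: {k: v / sum(row.values()) for k, v in row.items()} for s, row in counts.items()}

transitions = normalize(trans_counts)
emissions = normalize(emit_counts)
print(transitions)
print(emissions)
```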
HMM Usage

Two major tasks (both sketched in code below):

- Evaluate the probability of an observation sequence given the model (the Forward algorithm).
- Find the most likely path through the model for a given observation sequence (the Viterbi algorithm).
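Here is a minimal sketch of both tasks for the kind of dictionary-based tables shown in the parameter sketch above; the function names and table layout are my own, and real gene finders use scaled or log-space versions of these recurrences:

```python
def forward(obs, states, initial, transitions, emissions):
    """Forward algorithm: total probability of the observation sequence under the model."""
    alpha = {s: initial[s] * emissions[s][obs[0]] for s in states}
    for sym in obs[1:]:
        alpha = {s: sum(alpha[p] * transitions[p][s] for p in states) * emissions[s][sym]
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states, initial, transitions, emissions):
    """Viterbi algorithm: most likely hidden-state path for the observation sequence."""
    delta = {s: initial[s] * emissions[s][obs[0]] for s in states}
    paths = {s: [s] for s in states}
    for sym in obs[1:]:
        new_delta, new_paths = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] * transitions[p][s])
            new_delta[s] = delta[best_prev] * transitions[best_prev][s] * emissions[s][sym]
            new_paths[s] = paths[best_prev] + [s]
        delta, paths = new_delta, new_paths
    best = max(states, key=lambda s: delta[s])
    return paths[best], delta[best]

# Example usage, with the hypothetical weather tables from the parameter sketch:
#   forward(["Ra", "Ra", "Su"], states, initial, transitions, emissions)
#   viterbi(["Ra", "Ra", "Su"], states, initial, transitions, emissions)
```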
Gene Finding (An Ideal HMM Domain)

Our objective:

- To find the coding and non-coding regions of an unlabeled string of DNA nucleotides.

Our motivation:

- Assist in the annotation of genomic data produced by genome sequencing methods.
- Gain insight into the mechanisms involved in transcription, splicing, and other processes.
Gene Finding Terminology

A string of DNA nucleotides containing a gene will have separate regions:

- Introns: non-coding regions within a gene.
- Exons: coding regions.

These regions are separated by functional sites:

- Start and stop codons.
- Splice sites: acceptors and donors.
Gene Finding Challenges

- We need the correct reading frame: introns can interrupt an exon in mid-codon.
- There is no hard and fast rule for identifying donor and acceptor splice sites: the signals are very weak.
Protein Family Modeling (A clever fit of HMMs)

I have a protein sequence.

- What family is it in?
- Can you give me a quick alignment to the other members of the family?
- These amino acids here: do they match the family’s consensus positions, or are they inserts?

Profile HMM

- Square: match (consensus) state.
- Diamond: insert state (notice the loops).
- Circle: delete state, which allows you to “jump” a match (see the topology sketch below).
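A rough sketch of that match/insert/delete topology as plain Python, assuming the conventional profile-HMM layout of M, I, and D states per consensus column; the function and state names are illustrative, not code from the talk:

```python
def profile_hmm_topology(num_columns):
    """Allowed transitions of a small profile HMM, as a dict of state -> successors.

    M = match (consensus) state, I = insert state (note the self-loop),
    D = delete state (lets you "jump" over a match). The naming is conventional;
    the concrete layout here is only an illustration.
    """
    edges = {}
    for k in range(num_columns + 1):          # "M0" acts as Begin, "M{L+1}" as End
        nxt = k + 1
        advance = [f"M{nxt}"] + ([f"D{nxt}"] if nxt <= num_columns else [])
        edges[f"M{k}"] = advance + [f"I{k}"]
        edges[f"I{k}"] = advance + [f"I{k}"]  # the self-loop drawn as a loop on the diamond
        if k >= 1:
            edges[f"D{k}"] = advance + [f"I{k}"]
    return edges

print(profile_hmm_topology(3)["I1"])  # ['M2', 'D2', 'I1'] -- insert states can repeat
print(profile_hmm_topology(3)["D1"])  # ['M2', 'D2', 'I1'] -- delete states skip a match column
```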
What makes a good HMM problem space?

Characteristics:

- Classification problems. There are two main types of output from an HMM:
  - Scoring of sequences (Protein Family Modeling)
  - Labeling of observations within a sequence (Gene Finding)
HMM Problem Characteristics Continued

- The observations in a sequence should have a clear and meaningful order: unordered observations will not map easily to states.
- It is beneficial, but not necessary, for the observations to follow some sort of grammar:
  - It makes it easier to design an architecture.
  - Examples: Gene Finding, Protein Family Modeling.
HMM Requirements

So you’ve decided you want to build an HMM; here’s what you need:

- An architecture
  - Probably the hardest part.
  - Should be biologically sound and easy to interpret.
- A well-defined success measure
  - Necessary for any form of machine learning.
HMM Requirements Continued

- Training data
  - Labeled or unlabeled, depending on the task: you do not always need a labeled training set to do observation labeling, but it helps.
  - The amount of training data needed is:
    - Directly proportional to the number of free parameters in the model.
    - Inversely proportional to the size of the training sequences.
Why HMMs might be a good fit for Gene Finding

- Classification: classifying observations within a sequence.
- Order: a DNA sequence is a set of ordered observations.
- Grammar / architecture: the grammatical structure of a gene (exons and introns separated by functional sites) gives us the beginnings of our architecture.
- Success measure: the number of complete exons correctly labeled (one reading of this measure is sketched below).
- Training data: available from various genome annotation projects.
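As a hedged sketch of what that success measure might look like in code; the interval representation and the function name exon_accuracy are my own, one plausible reading of “complete exons correctly labeled” rather than a standard definition:

```python
def exon_accuracy(predicted_exons, true_exons):
    """Fraction of true exons whose start and end were both labeled correctly.

    Exons are represented here as (start, end) coordinate pairs; only an exon
    predicted with exactly the right boundaries counts as correct.
    """
    predicted = set(predicted_exons)
    correct = sum(1 for exon in true_exons if exon in predicted)
    return correct / len(true_exons) if true_exons else 1.0

# Hypothetical example: two of the three annotated exons were found exactly.
print(exon_accuracy([(100, 190), (300, 420)], [(100, 190), (300, 420), (600, 720)]))
```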
Why HMMs can be made to fit Protein Family Modeling

- Classification: what model fits a sequence best?
- Order: an amino acid sequence is well ordered.
- Grammar: any two matches can be separated by a series of inserts and deletes… okay, maybe the word “grammar” is a bit of a stretch.
- Success measure: how many sequences can we correctly label after training?
HMM Advantages

Statistical grounding:

- Statisticians are comfortable with the theory behind hidden Markov models.
- Freedom to manipulate the training and verification processes.
- Mathematical / theoretical analysis of the results and processes.
- HMMs are still very powerful modeling tools, far more powerful than many statistical methods.
HMM Advantages continued

Modularity:

- HMMs can be combined into larger HMMs.

Transparency of the model:

- Assuming an architecture with a good design:
  - People can read the model and make sense of it.
  - The model itself can help increase understanding.
HMM Advantages continued

Incorporation of prior knowledge:

- Incorporate prior knowledge into the architecture.
- Initialize the model close to something believed to be correct.
- Use prior knowledge to constrain the training process.
How does Gene Finding make use of HMM advantages?

- Statistics: many systems alter the training process to better suit their success measure.
- Modularity: almost all systems use a combination of models, each individually trained for a particular gene region.
- Prior knowledge: a fair amount of prior biological knowledge is built into each architecture.
HMM Disadvantages

Markov chains:

- States are supposed to be independent: in the chain … → x → y → …, P(y) must be independent of P(x), and vice versa.
- This usually isn’t true for biological sequences.
- You can get around it when the relationships are local.
- Not good for RNA folding problems, where the important dependencies are long-range.
HMM Disadvantages continued

…and then there are the standard machine learning problems:

- Watch out for local maxima: the model may not converge to a truly optimal parameter set for a given training set.
- Avoid over-fitting: you’re only as good as your training set, and more training is not always good.
HMM Disadvantages continued

Speed!

- Almost everything one does with an HMM involves “enumerating all possible paths through the model”.
- There are efficient ways to do this, but it is still slow in comparison to other methods.
HMM Gene Finders: VEIL

- A straight HMM gene finder.
- Takes advantage of grammatical structure and modular design.
- Uses many states that can only emit one symbol to get around state independence.
HMM Gene Finders: HMMGene

- Uses an extended HMM called a CHMM (an HMM with classes).
- Takes full advantage of being able to modify the statistical algorithms.
- Uses high-order states.
- Trains everything at once.
HMM Gene Finders: Genie

- Uses a generalized HMM (GHMM).
- Edges in the model are complete HMMs.
- States can be any arbitrary program; here they are neural networks specially designed for signal finding.
Conclusions

HMMs have problems where they excel, and problems where they do not. You should consider using one if:

- The problem can be phrased as classification.
- The observations are ordered.
- The observations follow some sort of grammatical structure (optional).
Conclusions

Advantages:

- Statistics
- Modularity
- Transparency
- Prior knowledge

Disadvantages:

- State independence
- Over-fitting
- Local maxima
- Speed
Some final words…

Lots of problems can be phrased as classification problems:

- Homology search: build a model from a few close homologs, and use the model to search for more distant homologs.
- Sequence alignment: align all of these sequences to the model that represents their family.
Any Questions?