ppt - University of Illinois Urbana

Download Report

Transcript ppt - University of Illinois Urbana

Profile HMMs
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Nov. 10, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
1
Protein Families
•
•
Protein Family - a group of proteins that share a
common function and/or structure, that are potentially
derived from a common ancestor (set of homologous
proteins)
Characterizing a Family - Compare the sequence and
structure patterns of the family members to reveal
shared characteristics that potentially describe
common biological properties
– Multiple sequence alignment
– Motif/Domain - sequence and/or structure patterns
common to protein family members (a trait)
– Profile is a way to represent a family of proteins
2
Profile of a Protein Family
Aligned DNA sequences can be represented by a
4 ·n profile matrix reflecting the frequencies
of nucleotides in every aligned position.
Protein family can be represented by a 20·n profile
representing frequencies of amino acids.
3
Profiles and HMMs
• A 20*n profile PWM corresponds to a simple
HMM with n sequentially linked match states
M1,…,Mn
• More complex profiles involve modifications
to this simple HMM by adding additional
states to model insertions and deletions
4
The Simplest Profile HMM
• Linear HMM with “match” states
Begin
M1
....
Mj
....
Mk
End
Initial state distribution: P(M1)=1.0
Transition probabilities: p(Mi->Mj)=1.0
Output probabilities: p(o|Mi) is from the PWM
• Problem: No gap is allowed, but gap is quite
common in protein motifs
5
Adding Insert States
• Allow one gap between Mi and Mi+1
• Allow one or more gaps between Mi and Mi+1
• Allow zero or more gaps between Mi and Mi+1
Ii
Begin
Mi
Additional parameters:
- Output probability for Ii: p(o|Ii)
- Transition probabilities involving Ii
End
• Problem: How to model deletions?
6
Adding Delete States
•
•
•
How can we allow deletion of Mi?
How can we allow deletion of Mi and Mi+1?
How can we allow arbitrary deletions?
Dj
Begin
Mj
End
Additional parameters:
- State transitions involving Dj’s
- What about output probabilities?
7
Putting All Together
Dj
Ij
Begin
•
•
Mj
End
Parameters can be estimated with slight modification
in the estimation formulas
How many parameters are there in a profile HMM with
k match states?
8
Build an HMM from Multiple Alignments
Step 1: Decide the consensus region
Step 2: Decide which columns are “match” states and
which are “insert” states.
Step 3: Gaps in the match states are marked as “delete”
Step 3: Build an HMM using supervised training
HBA_HUMAN
HBB_HUMAN
MYG_PHYCA
GLB3_CHITP
GLB5_PETMA
LGB2_LUPLU
GLB1_GLYDI
...VGA--HAGEY...
...V----NVDEV...
...VEA--DVAGH...
...VKG------D...
...VYS--TYETS...
...FNA--NIPKH...
...IAGADNGAGV...
*** *****
Match
Insert
Delete
P(v|M1)=?
P(M1->M2)=?
9
Estimation of Probabilities
•
Maximum Likelihood estimator
– P(X|Y)=count(X;Y)/count(Y)
•
– E.g., p(v|M1)=“count of v in M1 column”/ “all amino acids
in M1 column”
Bayesian estimator (with conjugate prior)
– P(X|Y)=(count(X;Y)+pseudocount(X;Y)
)/(count(Y)+totalpseudocount)
– E.g., p(v|M1)=(“count of v in M1 column”+1)/ (“all amino
acids in M1 column”+20) (Laplace smoothing)
P(v|M1)=6/27
P(M1->M2)=7/10
10
Optimal Model Construction
• How do we decide which columns to be
labeled as “match” or “insert”?
• Many approaches
– Heuristic approach: Columns with low entropy
are more likely “match” (threshold-based)
– Bayesian model selection: Choose a model that
fits the data the best and is consistent with our
prior
11
Profile HMM for Local Alignment
•
•
Output prob. in flanking states use background values
p(o|Background).
Looping prob. close to 1, e.g. (1- ) for some small .
Dj
Ij
Mj
Begin
End
Q
Q
12
Repeat Alignments
•
•
Join the two background states
Allow multiple matches
Dj
Ij
Mj
Begin
Q
End
13
Usages of Profile HMMs
•
Detecting potential membership in a family
– Matching a sequence to the profile HMMs
– Score a sequence S by p(S|HMM)/p(S|Random)
• Return top k best matching profile HMMs for a given
sequence
• Given an HMM, find additional sequences in the family
•
Aligning a sequence to an existing family
– Decoding the sequence using Viterbi
– Using the state transition path to align the sequence
with the existing sequences in the family
•
Many other uses….
14
What You Should Know
• Architecture of a typical profile HMM
• How HMMs can be constructed appropriately
to model matching, insertion, and deletion of
amino acids
• How to estimate a profile HMM based on
multiple sequence alignment
• How a regular profile HMM can be modified to
allow multiple matches and local alignment
15