Seq_pattern_II

PLPTH 890 Introduction to Genomic Bioinformatics
Lecture 17
Protein Domain Analysis Using
Hidden Markov Models
Liangjiang (LJ) Wang
March 10, 2005
Outline
• Basic concepts and biological problems.
• Search for protein domains:
– The Pfam database,
– Other domain/motif databases.
• Protein domain modeling:
– Hidden Markov Models (HMM),
– Construction of the Pfam protein domain
models using HMMER.
Biological Problem #1
You identified a new gene, which might be
involved in a very interesting biological
process. BLAST search in GenBank
resulted in a few homologous sequences
with unknown function. What else can you
do to understand the function of the gene
product and/or to localize the possible
conserved domain in the protein?
Biological Problem #2
Suppose there is a novel gene identified in
mammals, C. elegans and Drosophila, but
not yet in plants. This gene is involved in
an interesting biological process (e.g.,
apoptosis). You are interested in finding the
orthologous gene in Arabidopsis. However,
BLAST search using each of the known
sequences failed to identify an Arabidopsis
homologue. What else can you try?
Orthologs, Paralogs and Homologs
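[Figure: an ancestral organism with genes X and Y; speciation leads to organisms A and B, and duplication of Y gives Ya and Yb. Gene labels: X1, X2, Ya, Yb.]
• X1 and X2 are orthologs with the same function.
• Paralogs Ya and Yb may have different but related functions.
• Homologs: the collective term covering both orthologs and paralogs.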
Protein Domains
Domains represent evolutionarily conserved amino
acid sequences carrying functional and structural
information of a protein. Domain analysis helps
understand the biological function of a gene product.
[Figure: example domain, bZIP.]
Protein Domain Analysis Using HMM
[Figure: workflow. A multiple sequence alignment is turned into hidden Markov models with HMMER; the models are then used to search your sequence set, shown as FASTA records such as >TC50726 and >UFO_ARATH.]
Comparison of Search Approaches
Method      Sensitivity   Speed
BLAST       Low           Very fast
HMM         High          Fast
Threading   Very high     Very slow
The Pfam Database
• Pfam is a database of multiple alignments and
hidden Markov models (HMMs) of common
conserved protein domains.
• The alignments use a non-redundant protein
set composed of SWISS-PROT and TrEMBL.
• Pfam consists of parts A and B. Pfam-A
contains curated domain families with high-quality alignments. Pfam-B contains families
that were generated automatically by
clustering the remaining sequences after
removal of Pfam-A domains.
• Pfam is available at http://pfam.wustl.edu/.
Other Domain/Motif Databases
• ProDom: http://www.toulouse.inra.fr/prodom.html;
contains domain families automatically generated
from SWISS-PROT and TrEMBL (the source of Pfam-B).
• SMART: Simple Modular Architecture Research
Tool; available at http://smart.embl-heidelberg.de/;
contains domain families that are widely represented
among nuclear, signaling and extracellular proteins.
• TIGRFAMs: http://www.tigr.org/TIGRFAMs; a
collection of manually curated protein families
represented by hidden Markov models; contains models
of full-length proteins and shorter protein regions.
More Domain/Motif Databases
• PROSITE: http://www.expasy.org/prosite/; consists
of biologically significant sites, patterns and profiles;
uses regular expressions to represent most patterns.
• PRINTS:
http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/;
a collection of protein fingerprints (conserved motifs,
ungapped alignments), which may be used to assign
new sequences to known protein families.
• Blocks: http://blocks.fhcrc.org/; consists of short
ungapped alignments corresponding to the most
highly conserved regions of proteins.
Even More Domain/Motif Databases
• InterPro: http://www.ebi.ac.uk/interpro; an
integrated and curated collection of protein families,
domains and motifs from PROSITE, Pfam, PRINTS,
ProDom, SMART and TIGRFAMs.
• CDD:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd;
contains domains derived from Pfam, SMART and
models curated at NCBI.
• 3Dee: http://www.compbio.dundee.ac.uk/3Dee/;
contains structural domain definitions for all protein
chains in the Protein Data Bank (PDB); clustered by
both sequence and structural similarity.
Why So Many Domain/Motif Databases?
• Different representations of patterns:
– PROSITE: regular expression.
– ProDom: multiple alignment and consensus.
– Pfam: multiple alignment and HMM.
• Different approaches or focuses:
– SMART: focused on signaling proteins.
– PRINTS and Blocks: highly conserved segments.
– 3Dee: structural domain definitions.
• “Meta-sites” (databases of databases):
– InterPro: an integrated collection, derived from
several domain/motif databases.
Protein Domain Modeling
• Machine learning concepts.
• Hidden Markov Models (HMM).
• HMMER (a software tool for constructing
HMMs and searching with them).
• Construction of the Pfam protein domain
models.
Machine Learning
• The study of computer algorithms that
automatically improve performance through
experience.
• In practice, this means: we have a set of
examples from which we want to extract
some rules (regularities) using computers.
• Two types of machine learning:
– Supervised: learn with a teacher (using a set
of input-output training examples).
– Unsupervised: let the machine explore the
data space and find some interesting patterns.
Learning from Examples
• Learning refers to the process in which a
model is generalized (induced) from given
examples (training dataset).
• Error-correction learning: for each of the
given examples, a computer program
– makes a prediction based on what was
already learned (i.e., model parameters).
– compares the prediction with the given output
to calculate the error.
– adjusts the model parameters in some way
(learning algorithm) to minimize the error.
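The predict, compare, adjust loop above can be sketched with a one-parameter model. This is only an illustration: the model (y = weight × x), the learning rate, and the toy examples are assumptions chosen for the sketch, not part of the lecture.

```python
def train(examples, learning_rate=0.05, epochs=100):
    """Error-correction learning for the toy model y = weight * x."""
    weight = 0.0  # model parameter, starts with no knowledge
    for _ in range(epochs):
        for x, target in examples:
            prediction = weight * x            # predict from what was learned so far
            error = target - prediction        # compare with the given output
            weight += learning_rate * error * x  # adjust to reduce the error
    return weight


# Toy training set generated from y = 2x (hypothetical data).
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
learned_weight = train(examples)
print(learned_weight)
```

With these examples the weight moves toward 2, the value used to generate the toy data.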
Common Pitfalls - Training Dataset
[Figure: data instances sampled from the data space. Panels: too few examples (overfitting); sampling problem (“garbage in, garbage out”); good.]
Hidden Markov Model (HMM)
• A class of probabilistic models that are
generally applicable to time series or linear
sequences.
• Widely used in speech recognition since the
early 1970s. David Haussler’s group at UC
Santa Cruz introduced HMMs for biological
sequence profiles in 1994.
• An HMM turns a multiple alignment into a
position-specific scoring system that can be
used to search for remotely homologous
sequences.
The Occasionally Dishonest Casino Problem
The casino has two dice: a fair die and a loaded die.
They use the fair die most of the time, but
occasionally (P = 0.05) switch to the loaded die and
may switch back to the fair die with probability 0.1.
The loaded die has probability 0.5 of a six and
probability 0.1 for each of the numbers one to five.
The fair die has probability 1/6 (about 0.167) for each number.
Rolls (symbols):     521462536316562646465251
Die (state path):    FFFFFFFFLLLLLLLLLLFFFFFF
• The state sequence or path is hidden (hence "hidden" Markov model).
• Transition probabilities: P(L|F) = 0.05; P(F|F) = 0.95.
• Emission probabilities: P(6|L) = 0.5; P(6|F) = 0.167.
An HMM for the Casino Problem
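[Figure: two-state HMM with states Fair and Loaded.]
Emission probabilities:
  Roll     1     2     3     4     5     6
  Fair     1/6   1/6   1/6   1/6   1/6   1/6
  Loaded   1/10  1/10  1/10  1/10  1/10  1/2
Transition probabilities:
  Fair → Fair: 0.95      Fair → Loaded: 0.05
  Loaded → Loaded: 0.9   Loaded → Fair: 0.1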
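To make the model concrete, the sketch below generates rolls from the casino HMM using the emission and transition probabilities given in the lecture. The starting state (fair), the seed, and the function name `simulate` are assumptions made for the sketch.

```python
import random

# Transition and emission probabilities from the casino example.
TRANSITIONS = {"F": {"F": 0.95, "L": 0.05}, "L": {"L": 0.9, "F": 0.1}}
EMISSIONS = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 0.1 for face in "12345"}, "6": 0.5},
}


def simulate(n_rolls, seed=0, start_state="F"):
    """Return (rolls, path): the observed symbols and the hidden state path."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    state = start_state        # assumption: the casino starts with the fair die
    rolls, path = [], []
    for _ in range(n_rolls):
        faces = list(EMISSIONS[state])
        weights = list(EMISSIONS[state].values())
        rolls.append(rng.choices(faces, weights=weights)[0])  # emit a symbol
        path.append(state)
        targets = list(TRANSITIONS[state])
        probs = list(TRANSITIONS[state].values())
        state = rng.choices(targets, weights=probs)[0]        # move to next state
    return "".join(rolls), "".join(path)


rolls, path = simulate(24)
print(rolls)
print(path)
```

An observer sees only the first printed line; the second line is the hidden path that an HMM analysis tries to recover.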
An HMM for 5’ Splice Site Recognition
(Eddy, 2004)
States:
E – Exon
5 – 5’ splice site
I – Intron
An observation (nucleotide sequence) corresponds
to a state path (or paths) through the HMM.
Finding the Best Hidden State Path
(Eddy, 2004)
The probability P of a state path, given the model and
an observation (sequence), is the product of all the
emission and transition probabilities along the path.
Calculating the Probability of a State Path
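$$\ln P = \ln(1.0 \times 0.25 \times (0.9 \times 0.25)^{17} \times 0.1 \times 0.95 \times 1.0 \times 0.4 \times 0.9 \times 0.4 \times 0.9 \times 0.4 \times 0.9 \times 0.1 \times 0.9 \times 0.4 \times 0.9 \times 0.1 \times 0.9 \times 0.4 \times 0.1) = -41.22$$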
How to Model a Protein Domain?
Consider a two-state HMM:
Is there a domain X (Yes/No)?
A.A. (symbols):       EDQILIKARNTEAARRSRVIANYL
DomX? (state path):   NNNNNNNNYYYYYYYYYYNNNNNN
Is this sufficient for modeling a protein domain? No.
• How to represent the position-dependent amino acid
distribution?
• What about insertions and deletions?
Seq1  KGIQEF--GADWYKVAK--NVGNKSPEQCILRFLQ
Seq2  ALVKKHGQG-EWKTIAS--NLNNRTEQQCQHRWLR
Seq3  SGVRKYGEG-NWSKILLHYKFNNRTSVMLKDRWRT
An HMM for Protein Domain Recognition
(Eddy, 1996)
States:
M - match
D - delete
I - insert
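One way to see how match, delete and insert states relate to an alignment is to label the columns of the three-sequence alignment from the previous slide. The rule used here (a column with at most 50% gaps is a match column) is a simple heuristic assumed for this sketch, not necessarily the rule used to build any particular model, and the function names are illustrative.

```python
ALIGNMENT = {
    "Seq1": "KGIQEF--GADWYKVAK--NVGNKSPEQCILRFLQ",
    "Seq2": "ALVKKHGQG-EWKTIAS--NLNNRTEQQCQHRWLR",
    "Seq3": "SGVRKYGEG-NWSKILLHYKFNNRTSVMLKDRWRT",
}


def match_columns(seqs, max_gap_fraction=0.5):
    """Mark each column True (match) or False (insert) by its gap fraction."""
    n = len(seqs)
    return [
        sum(s[i] == "-" for s in seqs) / n <= max_gap_fraction
        for i in range(len(seqs[0]))
    ]


def state_path(aligned, is_match):
    """Translate one aligned sequence into a string of M, D and I states."""
    path = []
    for residue, match in zip(aligned, is_match):
        if match:
            # A gap in a match column means the sequence skips that position.
            path.append("D" if residue == "-" else "M")
        elif residue != "-":
            # A residue in a non-match column is an insertion.
            path.append("I")
    return "".join(path)


is_match = match_columns(list(ALIGNMENT.values()))
for name, aligned in ALIGNMENT.items():
    print(name, state_path(aligned, is_match))
```

Under this rule, Seq1 uses two delete states for its first gap and one insert state for the extra residue before the conserved W, while the HY residues of Seq3 fall into insert states.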
HMM Parameterization (Training)
• HMM parameters are estimated from the
multiple sequence alignment.
– Basic: maximum likelihood estimation.
– Advanced: the MAP construction algorithm.
(See Durbin et al., Biological Sequence Analysis, p.107-124)
• A high-quality alignment is essential for
model construction. This includes selecting
the sequences and manually editing the
multiple sequence alignment generated by
the ClustalW program.
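As a minimal sketch of the basic maximum likelihood step, the emission probabilities of a column can be estimated as the observed residue frequencies in that column. The example reuses the three-sequence alignment shown earlier in the lecture. Real model construction also decides which columns become match states and usually adds pseudocounts or priors, which this sketch leaves out; the function name is illustrative.

```python
from collections import Counter

SEQS = [
    "KGIQEF--GADWYKVAK--NVGNKSPEQCILRFLQ",
    "ALVKKHGQG-EWKTIAS--NLNNRTEQQCQHRWLR",
    "SGVRKYGEG-NWSKILLHYKFNNRTSVMLKDRWRT",
]


def column_frequencies(seqs):
    """Maximum likelihood emission estimates: residue frequency per column."""
    freqs = []
    for column in zip(*seqs):
        counts = Counter(r for r in column if r != "-")  # gaps are not emissions
        total = sum(counts.values())
        freqs.append({r: c / total for r, c in counts.items()} if total else {})
    return freqs


freqs = column_frequencies(SEQS)
print(freqs[0])   # variable first column
print(freqs[11])  # conserved tryptophan column
```

With only three sequences, any residue that was not observed gets probability zero, which is one reason a good seed alignment and some form of smoothing matter.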
Scoring a Sequence with an HMM
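• The main task is to find the hidden state path with the
highest probability, given the model and an
observation (sequence).
– The Viterbi algorithm (dynamic programming) finds
this most probable path.
– The forward algorithm sums over all paths to give
the total probability of the sequence.
– The backward algorithm is the mirror-image recursion;
combined with the forward algorithm it gives
posterior state probabilities.
(See Durbin et al., Biological Sequence Analysis, p.55-61)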
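The Viterbi recursion can be sketched on the two-state casino model from earlier in the lecture. The transition and emission values come from that example; the equal start probabilities and the function name `viterbi` are assumptions, and the function expects at least one roll.

```python
import math

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumption: start probabilities are not given
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}


def viterbi(rolls):
    """Return (best state path, its log probability) for a string of rolls."""
    # best[s]: log probability of the best path that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # back[t][s]: best previous state when state s is used next
    for roll in rolls[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (
                best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
            )
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path)), max(best.values())


decoded, log_prob = viterbi("521462536316562646465251")
print(decoded)
```

Working in log space keeps the scores from underflowing; a protein profile HMM uses the same recursion with match, insert and delete states in place of the two dice.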
HMM versus Position Weight Matrix (PWM)
• Advantages:
– An HMM has position-dependent amino acid
distributions, which are represented as emission
probabilities at each match state. (also PWM)
– Insertion/deletion gap penalties are handled using
transition probabilities. (Usually not with PWM)
– The possible dependence of an amino acid on its
preceding neighbor can be represented using the
transition probabilities. (Not with PWM)
• Problems:
– Long-range interactions between amino acids.
– Requirement of multiple sequence alignments.
HMMER
• A software package for constructing and
searching HMMs.
• Source code and binary distribution for
various platforms (UNIX, Linux and
Macintosh PowerPC) are available at
http://hmmer.wustl.edu/. Follow the detailed
User’s Guide for software installation.
• Multiple sequence alignment: ClustalW or
ClustalX (with Windows interface), available
at ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/.
• Sequences in FASTA format.
HMMER Programs
• hmmbuild: build a model from a multiple
sequence alignment.
• hmmalign: align multiple sequences to an HMM.
• hmmcalibrate: determine appropriate statistical
significance parameters for an HMM prior to
database searches.
• hmmsearch: search a sequence database with
an HMM.
• hmmpfam: search an HMM database with one
or more sequences.
• hmmconvert and hmmindex.
Construction of the Pfam HMMs
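1. Family definition (from PROSITE, literature).
2. Seed alignment (representative, stable), made with ClustalW and manual editing.
3. HMM profile, built from the seed alignment with hmmbuild.
4. Full alignment (complete, volatile), produced with hmmalign.
If the HMM doesn’t find all members, the seed alignment is revised and the steps are repeated.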
A Solution to Problem #2
1. Collect known sequences from the literature.
2. Do a multiple alignment (ClustalX, editing).
3. Create an HMM profile using hmmbuild.
4. Search an Arabidopsis sequence dataset using the HMM and hmmsearch.
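The HMMER steps above can be sketched as a short command plan. The file names are hypothetical, and the argument order shown (model file first for hmmbuild and hmmsearch) is stated from memory of the HMMER usage text, so check it against the User’s Guide or each program’s `-h` output before running anything. The sketch only builds and prints the commands; it does not require HMMER to be installed.

```python
def hmmer_commands(alignment, hmm_file, sequence_db):
    """Build the command lines for a build, calibrate and search run."""
    return [
        ["hmmbuild", hmm_file, alignment],     # model from the edited alignment
        ["hmmcalibrate", hmm_file],            # significance parameters for the model
        ["hmmsearch", hmm_file, sequence_db],  # search the sequence dataset
    ]


# Hypothetical file names, used only for illustration.
commands = hmmer_commands("seed_alignment.aln", "domain.hmm", "arabidopsis.fasta")
for command in commands:
    print(" ".join(command))
    # To execute for real, one could call subprocess.run(command, check=True).
```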
Other Tools for Protein Pattern Analysis
• SignalP:
– For predicting signal peptide and cleavage site.
– Available at http://www.cbs.dtu.dk/services/SignalP/.
• PSORT:
– For predicting protein localization sites in cells.
– Available at http://psort.nibb.ac.jp/.
• TMHMM:
– For predicting transmembrane segments.
– Available at http://www.cbs.dtu.dk/services/TMHMM/.
Summary
• Hidden Markov Models (HMMs) are well suited
to representing protein domains.
• Since HMMs are constructed from aligned
sequence families, HMM search is often
more sensitive than BLAST for detecting
remotely related homologues.
• Resources are available for modeling and
searching for protein domains/motifs.
PROSITE vs. Perl RegExp
PDOC00269 (Heat shock hsp70 signature)
PROSITE: [IV]-D-L-G-T-[ST]-x-[SC]
Perl:    [IV]DLGT[ST]\w[SC]
PDOC50884 (Part of Zinc finger Dof-type signature)
PROSITE: C-x(2)-C-x(7)-[CS]-x(13)-C-x(2)-C
Perl:    C\w{2}C\w{7}[CS]\w{13}C\w{2}C
PDOC00081 (Part of Cytochrome P450 signature)
PROSITE: [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C
Perl:    [FW][SGNH]\w[GD][^F][RKHPT][^P]C
PDOC00036 (Part of bZIP domain signature)
PROSITE: [KR]-x(1,3)-[RKSAQ]-N-{VL}-x-[SAQ](2)-{L}
Perl:    [KR]\w{1,3}[RKSAQ]N[^VL]\w[SAQ]{2}[^L]