Specialized HMM Databases


Developing and Using Special Purpose Hidden Markov Model Databases
Martin Gollery
Associate Director of Bioinformatics
University of Nevada, Reno
[email protected]
Today’s Tutorial
• Instructor: Martin Gollery
• Associate Director of Bioinformatics,
University of Nevada, Reno
• Consultant to several organizations
• Formerly with TimeLogic
• Developed several HMM databases
Hidden Markov Models
• What HMMs are
• Which HMM programs are commonly used
• What HMM databases are available
• Why you would use one DB over another
• Integrated Resources: InterPro and more
• How you can build your own HMM DB
• Problems with building your own
• Live demonstration
Hidden Markov Models: What are they, anyway?
• Statistical description of a protein family's
consensus sequence
• Conserved regions receive highest scores
• Can be seen as a Finite State Machine
Representation of Family Members
• yciH     KDGII
• ZyciH    KDGVI
• VCA0570  KDGDI
• HI1225   KNGII
• sll0546  KEDCV

Residue frequency at each alignment position (blank = not observed):

         1     2     3     4     5
  C                      0.2
  D          0.6   0.2   0.2
  E          0.2
  G                0.8
  I                      0.4   0.8
  K    1.0
  N          0.2
  V                      0.2   0.2
Representation of gaps in Family Members
• yciH     KDGII
• ZyciH    KDGVI
• VCA0570  KDGDI
• HI1225   KNGII
• sll0546  KED-V

Residue frequency at each alignment position (blank = not observed, "-" = gap):

         1     2     3     4     5
  C
  D          0.6   0.2   0.2
  E          0.2
  G                0.8
  I                      0.4   0.8
  K    1.0
  N          0.2
  V                      0.2   0.2
  -                      0.2
For Maximum Sensitivity
[Figure: the residue frequency table from the previous slide, repeated]
No residue at any position should have a zero
probability, even if it was not seen in the training data.
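The zero-probability rule above can be sketched in a few lines of Python. This is only an illustration on the five-sequence example alignment, and it uses a simple add-one (Laplace) pseudocount, which is an assumption made for brevity; real HMM builders typically use more refined priors such as the Dirichlet mixtures of Sjolander et al. The function name `emission_probs` and the `pseudocount` parameter are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probs(column, pseudocount=1.0):
    """Per-residue probabilities for one alignment column.

    Gap characters are ignored. Every amino acid gets `pseudocount`
    added to its observed count, so none ends up at zero.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

# The gapped example alignment from the slides above.
alignment = ["KDGII", "KDGVI", "KDGDI", "KNGII", "KED-V"]
columns = ["".join(seq[i] for seq in alignment) for i in range(5)]
profile = [emission_probs(col) for col in columns]

# Column 1 is all K: K becomes (5 + 1) / (5 + 20) = 0.24, and an
# unseen residue such as W becomes 1 / 25 = 0.04 instead of 0.
print(round(profile[0]["K"], 2), round(profile[0]["W"], 2))
```

With only five training sequences an add-one pseudocount flattens the profile heavily; a smaller `pseudocount` keeps the observed residues dominant while still avoiding zeros.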
Start with an MSA…
• CLUSTAL W (1.7) multiple sequence alignment

yciH        KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG
ZyciH       KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG
VCA0570     KDGDIEIQGDVRDQLKTLLESKGHKVKLAGG
HI1225      KNGIIEIQGEKRDLLKQLLEQKGFKVKLSGG
sll0546     KEDCVEIQGDQREKILAYLLKQGYKAKISGG
PA4840      KDGVVEIQGEHVELLIDELLKRGFKAKKSGG
AF0914      KNGVIELQGNHVNRVKELLIKKGFNPERIKT
            *:. :*:**:  : :   *  :* : :
Hidden Markov Models
HMMER2.0
NAME  example2
DESC  Small example for demonstration purposes
LENG  31
ALPH  Amino
COM   hmmbuild example2 example2.aln
NSEQ  7
DATE  Wed Jan 08 13:33:06 2003
HMM        A      C      D      E      F      G      H      I      K …
    1  -3217  -3413  -3082  -2664  -4291  -3257  -2104  -4231   3883 …
    2  -1938  -3859   2747   1592  -4024  -1857  -1206  -3953  -1455 …
    3  -2160  -3144   1834   -953  -4284   3247  -2013  -4362  -2365 …
    4  -1255   2750    436  -2789  -1273  -2972  -2049   1510  -2543 …
    5  -2035  -1558  -4660  -4320  -2085  -4409  -4229   3081  -4224 …
    6  -3264  -3765  -1447   3822  -4535  -2948  -2636  -4814  -2810 …
    7  -2423  -1951  -4843  -4395  -1156  -4544  -3680   3291  -4151 …
    8  -3220  -3396  -2530  -2667  -3851  -3171  -2735  -4442  -2277 …
    9  -3196  -3194  -3915  -4259  -4867   3789  -4005  -5414  -4591 …
   10  -1923  -3837   2743   2134  -4005  -1854  -1196  -3929  -1434 …
   11   -999  -2164   -952   -353  -2483  -1909   3321  -2139   1730 …
   12  -1629  -1909  -2827  -2102  -2279  -2588  -1442  -1012   -488 …
Emission Probabilities
• What is the likelihood that sequence X was
emitted by HMM Y?
• Likelihood is calculated by multiplying the
probability of each residue at each position
and each of the transition probabilities
(in practice, by adding their log scores)
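A minimal sketch of this calculation, assuming an ungapped three-position toy model, may make the idea concrete. All numbers here are hypothetical (they are not taken from the example2 model), the background is assumed uniform over the 20 amino acids, and only match states are modeled, so this is far simpler than a real profile HMM.

```python
import math

# Toy match-state emission probabilities for a 3-position model
# (hypothetical numbers, already smoothed so nothing is zero).
EMISSIONS = [
    {"K": 0.90, "D": 0.05, "G": 0.05},
    {"K": 0.05, "D": 0.80, "G": 0.15},
    {"K": 0.05, "D": 0.15, "G": 0.80},
]
MATCH_TO_MATCH = 0.95   # assumed probability of staying on the match path
BACKGROUND = 1.0 / 20   # assumed uniform background frequency

def log_odds_score(seq):
    """Score an ungapped sequence in bits.

    Adds log2(emission / background) for each position and
    log2(transition) for each match-to-match step.
    """
    if len(seq) != len(EMISSIONS):
        raise ValueError("this sketch only handles ungapped, full-length matches")
    score = 0.0
    for i, residue in enumerate(seq):
        score += math.log2(EMISSIONS[i][residue] / BACKGROUND)
        if i > 0:
            score += math.log2(MATCH_TO_MATCH)
    return score

# The consensus-like sequence should outscore a scrambled one.
print(log_odds_score("KDG") > log_odds_score("GDK"))
```

Working in log space turns the product of many small probabilities into a sum, which is why saved models store integer log-odds scores rather than raw probabilities.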
Plan7 from Outer Space
(Well, from St. Louis, anyway!)
HMMs vs. BLAST
• Position specific scoring vs. general matrix
• Example:
– dDGVIvIddDKRDLLKSLiEAKkMKVKLAGG (lowercase = mismatch)
– KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG
– The pair has 80% BLAST similarity, but the mismatches fall on
highly conserved positions
• Scoring emphasizes important locations
• Clearer score cutoffs
• However, it is MUCH slower!
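The example can be checked directly. The sketch below compares the two sequences from this slide, then uses the seven-sequence CLUSTAL W alignment shown earlier to find the columns where every family member agrees. Variable names are illustrative only.

```python
query = "dDGVIvIddDKRDLLKSLiEAKkMKVKLAGG".upper()
family_member = "KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG"

# The seven-sequence example alignment from the MSA slide.
msa = [
    "KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG",
    "KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG",
    "KDGDIEIQGDVRDQLKTLLESKGHKVKLAGG",
    "KNGIIEIQGEKRDLLKQLLEQKGFKVKLSGG",
    "KEDCVEIQGDQREKILAYLLKQGYKAKISGG",
    "KDGVVEIQGEHVELLIDELLKRGFKAKKSGG",
    "KNGVIELQGNHVNRVKELLIKKGFNPERIKT",
]

# 0-based positions where the query differs from the family member.
mismatches = [i for i, (a, b) in enumerate(zip(query, family_member)) if a != b]
identity = 1 - len(mismatches) / len(family_member)

# Columns where all seven family members carry the same residue.
conserved = {i for i in range(len(msa[0])) if len({s[i] for s in msa}) == 1}

hits_on_conserved = [i + 1 for i in mismatches if i in conserved]
print(f"{len(family_member) - len(mismatches)}/{len(family_member)} identical")
print("mismatches on fully conserved columns (1-based):", hits_on_conserved)
```

Every one of the six mismatches lands on a column that is invariant across the family, which a general substitution matrix cannot know but a position-specific model penalizes heavily.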
HMM programs
• HMMer - Sean Eddy, Wash U
• SAM - Haussler, UCSC
• Wise tools - Birney, EBI
• SledgeHMMer - Subramaniam, SDSC
• Meta-MEME - Noble & Bailey
• PSI-BLAST - NCBI
• SPSpfam - Southwest Parallel Software
• Ldhmmer - Logical Depth
• DeCypherHMM - TimeLogic
What exactly do you want?
• Are you searching thousands of sequences with
one or a few models?
• Use hmmsearch
• Searching a few sequences with thousands of
models?
• Use hmmpfam
• Thousands of sequences vs. Thousands of models?
• Use an accelerator, if you do it very often
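The rule of thumb above can be written as a small helper that returns the HMMER 2 command line to run. This is a sketch under assumptions: the `many` cutoff of 1000 is an arbitrary illustration rather than a HMMER setting, and the file names in the usage lines are hypothetical. Both programs take the HMM file first and the sequence file second.

```python
def choose_hmmer_command(n_sequences, n_models, hmm_file, seq_file, many=1000):
    """Pick the HMMER 2 search program suggested by the job shape.

    Returns an argument list for subprocess.run, or None when both
    sides are large and an accelerator is worth considering instead.
    """
    if n_sequences >= many and n_models >= many:
        return None  # thousands vs. thousands: not a plain CPU job
    program = "hmmsearch" if n_sequences >= n_models else "hmmpfam"
    return [program, hmm_file, seq_file]

# One kinase model against a whole proteome: hmmsearch.
print(choose_hmmer_command(50000, 1, "kinase.hmm", "proteome.fasta"))
# Three query proteins against all of Pfam: hmmpfam.
print(choose_hmmer_command(3, 7868, "Pfam_ls", "query.fasta"))
```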
HMM databases
• PFAM
• TIGRFAM
• Superfamily
• SMART
• Panther
• PRED-GPCR
HMM databases at the CFB
• COGfam
• KinFam
• HydroHMMer
• NVfam-pro
• NVfam-arc
• NVfam-fun
• NVfam-pln
PFAM
• From Sanger, WashU, KI, INRA
• Version 17 has 7868 families
• Most widely used HMM database
• Good annotation team
PFAM
• PFAM-A is hand curated
• From high quality multiple alignments
• PFAM-B is built automatically from ProDom
• Generated using the Domainer algorithm
• ProDom is built from SP/TREMBL
PFAM
• Pfam-ls = global alignments (the match must span the
whole model)
• Pfam-fs = local alignments, so that matches
may include only part of the model
• Both the -ls and -fs versions are local
with respect to the sequence
PFAM
• Note ‘type’ annotation
• Labeled TP
• Family
• Domain
• Repeat
• Motif
TIGRFAMs
• Available at (www.tigr.org/TIGRFAMs/)
• Organized by functional role
• Equivalogs: a set of homologous proteins
that are conserved with respect to function
since their last common ancestor
• Equivalog domains: domains of conserved
function
TIGRFAMs
• 2453 models in release 4.1
• Complementary to PFAM, so run both
• Part of the Comprehensive Microbial
Resource (CMR)
TIGRFAMs
TIGRfam and PFAM alignments for Pyruvate carboxylase. The
thin line represents the sequence. The bars represent hit
regions.
SuperFamily
• By Julian Gough, formerly MRC, now Riken GSC
• www.supfam.org
• Provides structural (and hence implied functional)
assignments to protein sequences at the
superfamily level
• Built from SCOP (Structural Classification of
Proteins) database, which is built from PDB
• Available in HMMer, SAM, and PSI-BLAST
formats
SuperFamily
• 1447 SCOP Superfamilies
• Each represented by a group of HMMs
• Over 8500 models total
• Table provides comparison to GO, InterPro, PFAM
SMART
• Simple Modular Architecture Research Tool
• Version 3.4 contains 654 HMMs
• Emphasis on mobile eukaryotic domains
• smart.embl-heidelberg.de
• Annotated with respect to phyletic
distributions, functional class, tertiary
structures and functionally important
residues
SMART
• Use for signaling domains or extracellular
domains
• Normal and Genomic mode
PRED-GPCR
• Papasaikas et al., U of Athens
• 265 HMMs in 67 GPCR families
• Based on TiPs Pharmacological classification
• Filters with CAST
• Signatures regularly updated
• Entire system redone each year
PRED-GPCR webserver
Panther
• Protein ANalysis THrough Evolutionary Relationships
• Family and subfamily: families are evolutionarily related
proteins; subfamilies are related proteins with the same
function
• Molecular function: the function of the protein by itself or
with directly interacting proteins at a biochemical level,
e.g. a protein kinase
• Biological process: the function of the protein in the
context of a larger network of proteins that interact to
accomplish a process at the level of the cell or organism,
e.g. mitosis.
• Pathway: similar to biological process, but a pathway also
explicitly specifies the relationships between the
interacting molecules.
Panther
• (Thomas et al., Genome Research 2003; Mi
et al. NAR 2005)
• 6683 protein families
• 31,705 functionally distinct protein
subfamilies.
Panther
• Due to the size, searches could be slow
• First, BLAST against consensus seqs
• Then, search against models represented by
those hits
• With an accelerator, you don’t have to do
that…
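The two-stage search above can be sketched as a filter: the fast BLAST pass against consensus sequences produces a shortlist, and only the shortlisted models are handed to the slow HMM search. The sketch assumes the BLAST output has already been parsed into (family, E-value) pairs; the family identifiers and the `evalue_cutoff` default are hypothetical.

```python
def models_to_search(consensus_hits, evalue_cutoff=1e-3):
    """Stage 1 of the triage.

    `consensus_hits` is a list of (family_id, e_value) pairs from a
    BLAST search against the family consensus sequences. Returns the
    sorted, de-duplicated families that pass the cutoff.
    """
    return sorted({fam for fam, evalue in consensus_hits if evalue <= evalue_cutoff})

hits = [
    ("PTHR00001", 2e-40),
    ("PTHR00002", 0.5),      # too weak, dropped
    ("PTHR00001", 1e-12),    # duplicate family, kept once
    ("PTHR00003", 1e-5),
]
shortlist = models_to_search(hits)
print(shortlist)  # only these models go on to the HMM search
```

A loose cutoff keeps the prefilter from discarding distant family members that only the HMM would score well, at the cost of searching more models.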
Panther
• So, how does it perform?
• I took 3451 Arabidopsis proteins with no hit
to PFAM, Superfamily, SMART or
TIGRfam
• Ran it against Panther
• Found 160 significant hits!
COG-HMMs
• Clusters of Orthologous Groups of proteins
• www.ncbi.nlm.nih.gov/cog/
• Each COG is from at least 3 lineages
• Ancient conserved domain
• 4873 alignments available
• Alignments from NCBI, HMMs from me at
[email protected]
CDD
• Conserved Domain Database (NCBI)
• Psi-BLAST profiles are similar to HMMs
• 10991 PSSMs: SMART + COG + KOG + Pfam + CD
• Runs with RPS-BLAST
• Much faster searches
KinFam
• KinFam models represent 53 different classes of
protein kinases (PKs)
• Assigns Kinase Class and Group
• Based on Hanks’ classification scheme
• Database is small, so searches are fast
KinFam
• Categorizes Kinase data
• Available for download from
bioinformatics.unr.edu
RANK  SCORE   QF  TARGET|ACCESSION  E_VALUE   DESCRIPTION
1     852.93  1   KinFam||ptkgrp15  9.3e-256  Fibroblast GF recept
2     479.14  1   KinFam||ptkgrp14  3.1e-143  Platelet derived GF
3     423.33  1   KinFam||ptkother  1.9e-126  Other membrane-span
HydroHMMer
• HydroHMMer finds LEAs (late embryogenesis abundant
proteins) and other hydrophilin classes
• Small target size makes for very fast
searches
NVFAMs
• HMMs reflect the training data
• Specific training sets provide better results
• So… use Archaeal data to study Archaea,
Fungal data to study Fungi, etc.
• Designed for use with PFAM, not as a
standalone database
• Recent redesign, name change
NVFAMs
• NVFAM-pro used to study E. faecalis
• Demonstrated higher scores, better alignments
• However, PFAM had more total hits
• P. falciparum used as negative control
• PFAM showed better scores and alignments, as predicted
• Automated design by Garrett Taylor; scripts are
available!
• Contact me for input, collaboration, or help to
build your own
Which database to use?
One Comparison Test (Your results may vary…)
• Compare 563 I. pini sequences to COGhmm, PFAM,
PFAMfrag, SMART, TIGRfam, TIGRfamfrag,
Superfamily
• COGs- 9
• PFAM- 22
• PFAMfrag- 57
• SMART- 4
• Superfamily- 30
• TIGRfam- 6
• TIGRfamfrag- 12
Integrated Resources
• InterProScan
• MAGPIE
• PANAL
• Make your own!
InterPro
• Database built from PFAM, Prints, Prosite,
SuperFamily, ProDom, SMART,
TIGRFAMs, PANTHER, PIRsf, Gene3D &
SP/TrEMBL
• Version 10.0
• Nearly 12,000 entries
• http://www.ebi.ac.uk/interpro/
• InterProScan can be installed locally
InterProScan
• Splits up big jobs & reassembles them
• Works with SGE, PBS, LSF
• A free analysis pipeline!
• Provides GO mappings
• Written in PERL, so it’s easy to modify
• Average 4 min. per NT sequence per CPU
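As a rough planning aid, the 4-minutes-per-sequence figure can be turned into a wall-clock estimate. The helper below is a back-of-the-envelope sketch: it assumes the job splits perfectly across CPUs and ignores queueing and I/O overhead, so real runs will take longer. The function name is hypothetical.

```python
def interproscan_hours(n_sequences, n_cpus, minutes_per_sequence=4.0):
    """Estimated wall-clock hours for an InterProScan run,
    assuming ideal scaling across `n_cpus`."""
    if n_cpus < 1:
        raise ValueError("need at least one CPU")
    return n_sequences * minutes_per_sequence / n_cpus / 60.0

# 563 sequences on an 8-CPU cluster: 563 * 4 / 8 = 281.5 minutes.
print(f"{interproscan_hours(563, 8):.1f} hours")
```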
InterPro
InterPro release 10.0 contains 11972 entries,
representing 3079 domains, 8597 families, 228
repeats, 27 active sites, 21 binding sites and
20 post-translational modification sites. Overall,
there are 7521179 InterPro hits from 1466570
UniProt protein sequences. A complete list is
available from the ftp site.
DATABASE             VERSION  ENTRIES
SWISS-PROT           46.5     180652
PRINTS               37.0     1850
TrEMBL               29.5     1689375
Pfam                 17.0     7868
PROSITE patterns     18.45    1800
PROSITE preprofiles  N/A      120
ProDom               2004.1   1522
InterPro             10.0     11972
SMART                4.0      663
TIGRFAMs             4.1      2454
PIRSF                2.52     962
PANTHER              5.0      438
SUPERFAMILY          1.65     1160
Gene3D               3.0      117
GO Classification    N/A      18705
Modifying InterProScan
• Two ways to Add your own HMM database
to InterProScan:
• Modify PERL scripts
• Concatenate your models onto PFAM
• Similarly, if you are looking for a specific
target, delete all the rest to speed up
searches
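Both tricks, appending your own models and trimming the library to a few targets, come down to simple text handling on the HMM flat file. The sketch below assumes the HMMER 2 ASCII format, in which each model record ends with a line containing only `//` and carries a `NAME` line. The three toy records are heavily truncated stand-ins, not real models, and the helper names are hypothetical.

```python
def split_models(hmm_text):
    """Split a HMMER 2 ASCII flat file into one string per model."""
    records, current = [], []
    for line in hmm_text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":       # end-of-record marker
            records.append("".join(current))
            current = []
    return records

def model_name(record):
    """Return the value of the record's NAME line, or None if absent."""
    for line in record.splitlines():
        if line.startswith("NAME"):
            parts = line.split(None, 1)
            return parts[1].strip() if len(parts) > 1 else None
    return None

def keep_only(hmm_text, wanted):
    """Return a flat file containing only the models named in `wanted`."""
    return "".join(r for r in split_models(hmm_text) if model_name(r) in wanted)

pfam = "HMMER2.0\nNAME  7tm_1\n//\nHMMER2.0\nNAME  Pkinase\n//\n"
mine = "HMMER2.0\nNAME  example2\n//\n"

combined = pfam + mine                                   # add your own models
subset = keep_only(combined, {"Pkinase", "example2"})    # trim for speed
print([model_name(r) for r in split_models(subset)])
```

This only applies to the ASCII form of the library; a binary or indexed copy would have to be regenerated after editing.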
PANAL
• Simultaneously searches several targets
• Produces a nice graphical overview
• Databases:
– PFAM
– SMART
– TIGRFAM
– Prosite
– PRINTS
– BLOCKS
PANAL
MAGPIE
• BLOCKS
• NCBI public non-redundant DNA and protein
• NCBI EST databases
• NCBI Conserved Domain Database (CDD)
• Protein Identification Resource SuperFamilies
• PFAM
• ProDom
• SCOP SuperFamilies
• SMART
• TIGRFam
• ProSite
MAGPIE
• Gives a putative description of the gene
• Database search result ranking based on user
defined tool precedence and score thresholds.
• A single graphical summary of the various search
results
• Links to the database source entries
MAGPIE
• Gene taxonomic distribution information
• Reporting of similar sequences in the dataset
based on hits to similar database entries
• Annotated metabolic pathway diagrams
• Gene Ontology (GO) term assignments
MAGPIE
Terry Gaasterland et al. Genome Res. 2000; 10: 502-510
Building Your Own HMM Database
• Why do it?
• Greater Specificity
• Represent your training set
• Faster searches
• Focus on the particular aspects that you want
[Flowchart]
• Start: Your Data vs. PFAM (HMMsearch), or Your Data vs. Public DB (BLAST)
• Cluster Sequences
• Discard Singletons
• Build Multiple Sequence Alignments
• Check Alignments
• HMMbuild
• HMMcalibrate
• Add Description Line
• Annotate
First, search against a target…
Select the hits for the model
Build the Multiple Sequence
Alignment
Run HMMbuild to make the
model
Iterate Search to Add more distant Members
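The build steps can be strung together as a small planner that discards singleton clusters and emits the HMMER 2 commands for the rest. This is a sketch, not the scripts mentioned in this tutorial: the cluster names and file naming scheme are hypothetical, and the alignment file for each cluster is assumed to have been built and checked beforehand with whatever aligner you prefer.

```python
def build_plan(clusters, min_members=2):
    """Turn sequence clusters into HMMER 2 command lines.

    `clusters` maps a cluster name to its member sequence IDs.
    Clusters smaller than `min_members` (singletons by default) are
    dropped. Each surviving cluster gets an hmmbuild command reading
    <name>.aln and an hmmcalibrate command on the resulting model.
    """
    plan = []
    for name, members in sorted(clusters.items()):
        if len(members) < min_members:
            continue  # discard singletons
        aln, hmm = f"{name}.aln", f"{name}.hmm"
        plan.append(["hmmbuild", "-n", name, hmm, aln])  # -n sets the model name
        plan.append(["hmmcalibrate", hmm])
    return plan

clusters = {"famA": ["s1", "s2", "s3"], "famB": ["s4"]}
for command in build_plan(clusters):
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.run`; searching the new model against a sequence database and adding the new hits to the alignment gives the iteration step.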
Design Decisions:
• Local or global models?
• Which sequence weighting scheme?
• What type of Prior?
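For the first two decisions, the choice shows up as hmmbuild options. The mapping below is a sketch based on the HMMER 2 hmmbuild options as I recall them (default = ls style, `-f` = fs style, `-g` = single global, `-s` = single local, plus the `--w...` weighting switches); treat the flag names as assumptions and confirm them with `hmmbuild -h` on your installation. The dictionary keys and function name are hypothetical, and priors are left out.

```python
# Alignment mode -> hmmbuild flags (assumed HMMER 2 option names).
MODE_FLAGS = {
    "ls": [],        # default: global w.r.t. model, multiple hits
    "fs": ["-f"],    # local (fragment) matches, multiple hits
    "s": ["-g"],     # single global alignment
    "sw": ["-s"],    # single local alignment
}

# Sequence weighting scheme -> hmmbuild flag (assumed option names).
WEIGHT_FLAGS = {
    "gsc": "--wgsc",
    "blosum": "--wblosum",
    "pb": "--wpb",
    "none": "--wnone",
}

def hmmbuild_args(hmm_file, aln_file, mode="ls", weighting="gsc"):
    """Assemble an hmmbuild command line for the chosen design."""
    return ["hmmbuild", *MODE_FLAGS[mode], WEIGHT_FLAGS[weighting], hmm_file, aln_file]

print(" ".join(hmmbuild_args("fam.hmm", "fam.aln", mode="fs", weighting="blosum")))
```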
Calibration
• Hmmcalibrate
• Improves scoring
• Compares to random data
• Can be done on each model, or on the entire collection
Calibration
• Very time consuming on CPU, not on
researcher
• No acceleration available
• Not necessary with SAM
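Why calibration improves scoring can be seen from the statistics. Calibration scores the model against random sequences and fits an extreme value (Gumbel) distribution to those scores; the fitted parameters then convert a raw score into a P-value and an E-value. The sketch below shows only that conversion. The `mu` and `lam` values in the usage lines are made up for illustration, not taken from any real model.

```python
import math

def evd_pvalue(score, mu, lam):
    """P(S >= score) under a Gumbel distribution with location `mu`
    and scale parameter `lam`: 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))

def evalue(score, mu, lam, n_targets):
    """Expected number of chance hits scoring at least `score`
    when `n_targets` sequences are searched."""
    return n_targets * evd_pvalue(score, mu, lam)

mu, lam = -20.0, 0.25          # hypothetical fitted parameters
for s in (-20.0, 0.0, 40.0):
    print(s, evalue(s, mu, lam, n_targets=100000))
```

Without calibration the same raw score cannot be compared fairly across models of different length and composition, which is what the fitted parameters correct for.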
Meme and Meta-Meme
• Meme discovers motifs in a group of related
DNA or protein sequences
• Motifs contain no gaps- split in two instead
Meta-meme
• Meta-meme takes meme motifs & related
seqs as input
• Combines motifs into HMMs
• Regions between motifs are modeled
imprecisely
• Reduction in parameter space
• Accurate models with fewer training seqs
Meta-meme
• mhmm: Build a motif-based HMM from
Meme motifs.
• mhmms: Search a sequence database using
a motif-based HMM
• mhmmscan: Like mhmms, but allows long
seqs and multiple matches.
Using RPS-BLAST
• Start with PSI-BLAST using -C
• Prepare files with makemat and copymat
• Compile target
• Annotate
• Search with RPS-BLAST
IMPALA
• Also uses profiles database
• Alignments generated by Smith-Waterman
instead of word hit initiated
• 10-100x slower, but might be more sensitive than RPS-BLAST
SPEED
• PVM version of HMMer is available, MPI is on
the way (?)
• Other Solutions- use PSSM’s?
• SPSpfam can speed searches 3-60X
• SledgeHMMer claims 10X Speedup
• Accelerators
• Target Triage
SPSpfam
• From Southwest Parallel Software
• Optimized HMMer code
• Up to 60X faster
• Works well on cluster
• Uses binary Pfam, so you can’t drop it into InterProScan
• This may change soon
HMM Accelerators
• Can provide speedup of 100’s-1000’s X
• TimeLogic is the only commercial one left
• HokieGene from Virginia Tech
• StarBridge - No HMMs yet
• Others coming soon
• An open-source project is in the works: BioFPGA
HMMs on the Web
• SAM
http://www.cse.ucsc.edu/research/compbio/
• HMMer http://hmmer.wustl.edu/
• Several other HMMer servers…
• SledgeHMMer.sdsc.edu is the only unlimited
webserver; most restrict you to one sequence
at a time.
Resources
• Online Applications:
• HMMer http://hmmer.wustl.edu/
• SAM-T02
http://www.soe.ucsc.edu/research/compbio/
HMM-apps/HMM-applications.html
• Pfam http://pfam.wustl.edu/
• SledgeHMMer sledgehmmer.sdsc.edu
• Meta-MEME http://metameme.sdsc.edu/
• PANAL http://web.ahc.umn.edu/panal/
Resources
• Commercial vendors of HMM systems
• SPSpfam (www.spsoft.com)
• Ldhmmer (www.logicaldepth.com)
• DeCypherHMM (www.timelogic.com)
References
• S. Altschul, et al. Basic Local Alignment Search Tool. JMB, 215:403-410, 1990.
• C. Barrett, et al. Scoring hidden Markov models. CABIOS, 13(2):191-199, 1997.
• S. R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755-63, 1998.
• W. N. Grundy, et al. Meta-MEME: Motif-based hidden Markov models of protein families. CABIOS, 13(4):397-406, 1997.
• M. Gribskov, et al. Profile analysis: Detection of distantly related proteins. PNAS, 84:4355-4358, July 1987.
• S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915-10919, November 1992.
• S. Henikoff and J. G. Henikoff. Position-based sequence weights. JMB, 243(4):574-578, November 1994.
• Jeffrey D., et al. Kestrel: A programmable array for sequence analysis. In Application-Specific Array Processors, pages 25-34, Los Alamitos, CA, July 1996. IEEE Computer Society.
• R. Hughey and A. Krogh. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. CABIOS, 12(2):95-107, 1996.
• T. Hubbard, et al. SCOP: a structural classification of proteins database. NAR, 25(1):236-9, January 1997.
• L. Holm and C. Sander. Dali/FSSP classification of three-dimensional protein folds. NAR, 25:231-234, 1 Jan 1997.
• K. Karplus, et al. Predicting protein structure using only sequence information. Proteins: Structure, Function, and Genetics.
• K. Karplus, et al. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846-856, 1998.
• A. Krogh, et al. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235:1501-1531, February 1994.
• K. Karplus, et al. Predicting protein structure using hidden Markov models. Proteins: Structure, Function, and Genetics, Suppl. 1:134-139, 1997.
• C. A. Orengo, et al. CATH - a hierarchic classification of protein domain structures. Structure, 5(8):1093-108, August 1997.
• J. Park, et al. Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. JMB, 284(4):1201-1210.
• E. L. L. Sonnhammer, et al. Pfam: A comprehensive database of protein families. Proteins, 28:405-420, 1997.
• K. Sjolander, et al. Dirichlet mixtures: A method for improving detection of weak but significant protein sequence homology. CABIOS, 12(4):327-345, August 1996.
• R. Schneider and C. Sander. The HSSP database of protein structure-sequence alignments. NAR, 24(1):201-205, 1 Jan 1996.
• G. Chukkapalli, C. Guda, and S. Subramaniam. SledgeHMMER: A web server for batch searching Pfam database. Nucleic Acids Res., 32:W542-544.
• A. A. Schaffer, Y. I. Wolf, C. P. Ponting, E. V. Koonin, L. Aravind, and S. F. Altschul. IMPALA: Matching a Protein Sequence Against a Collection of PSI-BLAST-Constructed Position-Specific Score Matrices. Bioinformatics.
• P. K. Papasaikas, P. G. Bagos, Z. I. Litou, V. J. Promponas, and S. J. Hamodrakas. PRED-GPCR: GPCR recognition and family classification server. Nucleic Acids Research, 32(Web Server issue):W380-W382, 2004; doi:10.1093/nar/gkh431
• K. A. T. Silverstein, A. Kilian, J. L. Freeman, and E. F. Retzel. PANAL: an integrated resource for Protein sequence ANALysis. Bioinformatics, 16:1157-1158, 2000.
Thanks!
• Garrett Taylor, Brian Beck, Taliah Mittler,
Barrett Abel, John Cushman, Lee Weber
• Contact me at- [email protected]
• Bioinformatics.unr.edu