Hidden Markov Modeling,
Multiple Alignments
and Structure
Bioinformatic Modeling Techniques
Student: Patricia Pearl
The basic notion of a hidden Markov model was
covered during the class lectures and in our midterm.
Tonight we'll discuss more about its
history,
development,
and future.
There was a time
when scientists started to think about
using hidden Markov models
for multiple protein alignments.
When was that?
Which professional field was using it already?
This is the bibliographic reference for the article that
protein scientists used when they got started.
Rabiner, L. R.
“A tutorial on hidden Markov models and selected
applications in speech recognition.”
Proceedings of the IEEE, 77 (2), 257-286. 1989.
Building on this sophisticated work, a group
of scientists at the University of California at Santa Cruz
drew an analogy between computer speech
recognition and protein multiple alignments.
How did they make the analogy between
speech recognition and
multiple protein and DNA alignments?
                                     Speech Recognition             Multiple Alignments
Alphabet                             phonemes                       amino acids
Observation                          words or strings of phonemes   primary sequence
Good model assigns high              sounds that are real words     sequences in the set
probability to
The paper they published is:
Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D.
“Hidden Markov Models in Computational Biology:
Applications to Protein Modeling.”
Journal of Molecular Biology, 1994, 235:1501-1531.
Sean Eddy was a student at UCSC then. In a 1996 article,
he describes the paper referenced above as:
“The paper that introduced the use of HMM methods for protein
and DNA sequence profiles.”
The software was then developed separately by two groups of
scientists and graduate students:
the University of California at Santa Cruz, and
Washington University in St. Louis, Missouri,
by UCSC’s former student Sean Eddy and his
research group.
Many other researchers in the field work outside these two labs.
Two suites of software have been developed. Their
differences are non-trivial.
SAM at UCSC
Sequence Alignment and Modeling
System.
HMMER at Washington University in St. Louis.
Both suites can be downloaded. SAM requires UNIX;
HMMER runs on many systems.
As emphasized in lecture, the advantage of the HMM
approach is that it does not guess at gap penalties, nor at
amino acid or state probabilities. It bases those values on actual data:
Bayesian probabilities estimated from the observed sequences.
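To make "values based on actual data" concrete, here is a minimal sketch in Python. It is not SAM or HMMER code. The toy alignment, the function names, and the add-one pseudocount are all assumptions chosen for illustration. It estimates per-column amino acid probabilities and gap fractions directly from an alignment, then checks that a family-like sequence scores higher than an unrelated one.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy alignment (not real data); "-" marks a gap.
TOY_ALIGNMENT = ["ACDE", "ACDF", "AC-E"]


def column_emissions(alignment, pseudocount=1.0):
    """Estimate P(amino acid | column) from observed counts.

    A flat pseudocount (an assumption here) keeps unseen residues
    from getting probability zero.
    """
    emissions = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return emissions


def gap_fractions(alignment):
    """Fraction of sequences with a gap in each column."""
    n = len(alignment)
    return [
        sum(seq[col] == "-" for seq in alignment) / n
        for col in range(len(alignment[0]))
    ]


def log_score(sequence, emissions):
    """Log probability of an ungapped sequence, one residue per column."""
    return sum(math.log(emissions[i][aa]) for i, aa in enumerate(sequence))


emissions = column_emissions(TOY_ALIGNMENT)
gaps = gap_fractions(TOY_ALIGNMENT)
print("gap fraction per column:", gaps)
print("family-like score:", log_score("ACDE", emissions))
print("unrelated score:  ", log_score("WWWW", emissions))
```

A real profile HMM also learns transition probabilities between match, insert, and delete states; the gap fractions here only hint at where those would come from.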
SAM at UCSC
Sequence Alignment and Modeling System.
<http://www.cse.ucsc.edu/research/compbio/>
Their software is based on HMMs.
They also use a mathematical approach called
Dirichlet mixtures to improve detection of weak
homologies and to derive hidden Markov models
for protein families.
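As a rough illustration of the Dirichlet mixture idea, the sketch below computes a posterior-mean estimate of residue probabilities for one alignment column. It is not SAM's code. The four-letter alphabet, the two mixture components, and their weights are invented for the example; real mixtures cover all 20 amino acids and are trained on large alignment collections.

```python
import math


def log_evidence(counts, alpha):
    """log P(counts | one Dirichlet component), dropping the multinomial
    coefficient, which is identical for every component."""
    n, a = sum(counts), sum(alpha)
    result = math.lgamma(a) - math.lgamma(n + a)
    for c, al in zip(counts, alpha):
        result += math.lgamma(c + al) - math.lgamma(al)
    return result


def mixture_estimate(counts, weights, alphas):
    """Return (residue probabilities, component posteriors) for a column."""
    log_post = [
        math.log(q) + log_evidence(counts, al)
        for q, al in zip(weights, alphas)
    ]
    top = max(log_post)  # subtract the max for numerical stability
    post = [math.exp(lp - top) for lp in log_post]
    z = sum(post)
    post = [p / z for p in post]

    n = sum(counts)
    probs = [
        sum(p * (counts[i] + al[i]) / (n + sum(al))
            for p, al in zip(post, alphas))
        for i in range(len(counts))
    ]
    return probs, post


# Hypothetical two-component mixture over a toy 4-letter alphabet.
WEIGHTS = [0.5, 0.5]
ALPHAS = [
    [5.0, 0.5, 0.5, 0.5],  # component favoring letter 0
    [0.5, 0.5, 5.0, 5.0],  # component favoring letters 2 and 3
]

# A column in which letter 0 was observed three times.
probs, post = mixture_estimate([3, 0, 0, 0], WEIGHTS, ALPHAS)
print("component posteriors:", post)
print("estimated probabilities:", probs)
```

Even with only three observations, the component that favors letter 0 dominates, so the estimate leans on prior knowledge about which residues tend to occur together. That is the sense in which mixtures help with weak homologies and small families.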
HMMER at Washington University in St. Louis
Sean Eddy’s Lab Home Page
http://www.genetics.wustl.edu/eddy/publications/
This page and related pages have many articles that are available
to download.
URL for User’s Guide
http://www.psc.edu/general/software/packages/hmmer/manual/main.html
If we had HMMER installed at Brandeis for us, we could all
use it with the help of this manual.
HMMER
One approach that Sean Eddy has taken to improve
HMMER is a technique from computational physical
chemistry and x-ray diffraction protein crystallography called
simulated annealing. The probability values of the fundamental
recursive HMM algorithm are varied by an exponential
factor taken from the Boltzmann formula for physical entropy.
S = kb ln Ω
The Boltzmann constant, kb, is multiplied by t, for temperature.
The run starts at a high temperature t, which is then decreased. The “kt” is used in
an exponent: P^(1/kt). Eddy reports that it improves accuracy.
(Eddy, S., 1995)
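The effect of the P^(1/kt) exponent can be seen in a small Python sketch. This is not HMMER's implementation. The starting temperature, the decay factor, and the three example probabilities are assumptions. Each probability is raised to the power 1/kt and the results are renormalized. At high kt the choices are nearly equally likely; as kt falls, the most probable choice dominates.

```python
import random


def temper(probs, kt):
    """Raise each probability to 1/kt and renormalize to sum to 1."""
    powered = [p ** (1.0 / kt) for p in probs]
    z = sum(powered)
    return [p / z for p in powered]


def cooling_schedule(kt_start, decay, steps):
    """Geometric cooling: kt_start, kt_start * decay, ... (an assumed schedule)."""
    return [kt_start * decay ** i for i in range(steps)]


# Hypothetical probabilities of three candidate alignment choices.
PROBS = [0.6, 0.3, 0.1]

rng = random.Random(0)  # fixed seed so the run is repeatable
for kt in cooling_schedule(kt_start=10.0, decay=0.5, steps=8):
    tempered = temper(PROBS, kt)
    # Sample one choice according to the tempered distribution.
    choice = rng.choices(range(len(PROBS)), weights=tempered, k=1)[0]
    rounded = [round(p, 3) for p in tempered]
    print(f"kt={kt:7.3f}  tempered={rounded}  sampled={choice}")
```

Sampling widely while kt is high and narrowing as it drops is what lets the search escape poor local optima before settling on a high-probability solution.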
Many people are extending the HMM approach
to RNA sequences. It is worth briefly
describing a recent paper that makes extensive use of
largely hand-made RNA alignments, using both primary
sequence and RNA secondary structure. It produces
evidence toward resolving a problem in systematic
(evolutionary) biology.
With HMMER, or similar software, applied to RNA
alignments, much of this work may become easier in the future
and yield measurable probabilistic statistics.
“However, accurate alignment is only
possible for proteins of known
structure – at least for an identifiable
core of residues that comprises the
secondary structure elements and
active site of the molecule.”
S. Eddy (1995), quoting Chothia and Lesk (1986)
[Diagram: two competing hypotheses for which group shares a common ancestor with the bird. Anatomical evidence (and more) points to the crocodile; rRNA multiple alignments made without secondary structure point to the mammal.]
              10        20        30        40
      ----|----|----|----|----|----|----|----|
Seq1  A-CC-----GC--------GA--CUUG--GA-CC-CG--G
Seq2  A-CC-----GU--------GA--CUUG--GA-CC-CG--G
Seq3  AACCCCGGUGUAGGGGGAAGAACCUUGAUGAACCUCGAUG
Seq4  AACCCCGGUGCAGGGGGAAGAACCUUCAUGAACCUCGAUG
Figure 1. The problem of aligning short and long sequences.
Sequences 1 and 2 are like the reptilian and bird ribosomal 18s RNA.
Sequences 3 and 4 are like mammals.
Reference: Xia, X., Xie, Z., Kjer, K.M. “18S ribosomal RNA and tetrapod phylogeny.”
Systematic Biology. Washington: Jun 2003. Vol 52, Iss.3; pg 283.
Phylogenetic tree
From: Xia et al., 2003
They produced several phylogenetic trees, using different
methods, with the careful manual alignments that took
secondary structure into account. In all, the birds are
closer to the crocodiles than to the mammals.
“Our research indicates that the previous discrepancy of phylogenetic
results between the 18S rRNA gene and other genes is caused
mainly by:
1.) misalignment of sequences
2.) the inappropriate use of the frequency parameters
3.) poor sequence quality.
When the sequences are aligned with the aid of the secondary
structure of the 18S rRNA molecule and when the frequency parameters
are estimated either from all sites or from the variable domains where
substitutions have occurred, the 18S rRNA sequences no longer support
the grouping of the avian species with the mammalian species.”
Xia, X., et al., 2003
If there were more time, this presentation would also
include discussions of PSI-BLAST and of SuperFam.
PSI-BLAST is a BLAST program at NCBI that builds position-specific
scoring matrices (profiles, which are closely related to profile HMMs)
from multiple alignments.
<http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html>
a tutorial
<http://www.ncbi.nlm.nih.gov/BLAST/>
the site
SuperFam is a relatively new website. It applies the HMM approach to 59
genomes, and also uses all the publicly available solved structures
from those genomes.
<http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/>
The head scientist of SuperFam, Prof. Cyrus Chothia,
also supervised a web site called SCOP, or Structural
Classification of Proteins. You might find it interesting that all of the
protein structures that are “solved” are actually organized and classified.
<http://scop.mrc-lmb.cam.ac.uk/scop/>
Bibliography
Eddy, S.R. “Multiple alignment using hidden Markov models.” Proc.
Int. Conf. Intell. Syst. Mol Biol. 1995;3:114-120.
Eddy, S.R. “Hidden Markov Models.” Curr Opin Struct Biol. 1996
Jun;6(3):361-5. Review.
Eddy, S.R., “Profile hidden Markov models.” Bioinformatics, 1998;
14(9): 755-763. Review.
Gough, J., and Chothia, C., “SUPERFAMILY: HMMs representing all
proteins of known structure. SCOP sequence searches, alignments
and genome assignments.” Nucleic Acids Research, 2002, Vol 30:1.
Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D. “Hidden
Markov models in computational biology: Applications to protein
modeling.” Journal of Molecular Biology, 235:1501-1531, February
1994.
Rabiner, L. R. “A tutorial on hidden Markov models and selected
applications in speech recognition.”
Proceedings of the IEEE, 77 (2), 257-286. 1989.
Xia, X., Xie, Z., Kjer, K.M. “18S ribosomal RNA and tetrapod
phylogeny.” Systematic Biology. Washington: Jun 2003.
Vol. 52, Iss. 3; pg 283.