Hidden Markov Models
• Probabilistic model of a multiple
sequence alignment.
• No indel penalties are needed
• Experimentally derived information can
be incorporated
• Parameters are adjusted to represent
observed variation.
• Requires at least 20 sequences
The Evolution of a Sequence
• Over long periods of time a sequence will
acquire random mutations.
– These mutations may result in a new amino acid
at a given position, the deletion of an amino acid,
or the introduction of a new one.
– Over VERY long periods of time two sequences
may diverge so much that their relationship cannot
be seen through the direct comparison of
their sequences.
Hidden Markov Models
• Pair-wise methods rely on direct comparisons
between two sequences.
• In order to overcome the differences in the
sequences, a third sequence is introduced, which
serves as an intermediate.
• A high hit between the first and third sequences, as
well as a high hit between the second and third
sequences, implies a relationship between the first
and second sequences: a transitive relationship.
Introducing the HMM
• The intermediate sequence is kind of
like a missing link.
• The intermediate sequence does not
have to be a real sequence.
• The intermediate sequence becomes
the HMM.
Introducing the HMM
• The HMM is a mix of all the sequences
that went into its making.
• The score of a sequence against the
HMM shows how well the HMM serves
as an intermediate of the sequence.
– How likely it is to be related to all the other
sequences, which the HMM represents.
Match State with no Indels
MSGL
MTNL
B → M1 → M2 → M3 → M4 → E
Each arrow indicates a transition probability.
In this case it is 1 for each step.
Match State with no Indels
MSGL
MTNL
B → M1 → M2 → M3 → M4 → E
Emission probabilities: M1: M=1; M2: S=0.5, T=0.5
Each state also has a probability for the residue at that position.
Typically want to incorporate a small probability
for all other amino acids.
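The emission probabilities above can be sketched in a few lines. This is a minimal illustration, assuming a gap-free model with all transitions equal to 1; the function names and the flat pseudocount of 0.1 are hypothetical choices, not part of the original slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_emissions(column, pseudocount=0.0):
    """Residue probabilities for one alignment column, with an optional flat pseudocount."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def build_model(alignment, pseudocount=0.0):
    """One emission table per column (match states M1..Mn); transitions are all 1."""
    return [column_emissions(col, pseudocount) for col in zip(*alignment)]

def sequence_probability(model, seq):
    """Probability that the gap-free model emits seq (product of the emissions)."""
    if len(seq) != len(model):
        return 0.0
    p = 1.0
    for emissions, aa in zip(model, seq):
        p *= emissions[aa]
    return p

alignment = ["MSGL", "MTNL"]
model = build_model(alignment)                     # raw frequencies
smoothed = build_model(alignment, pseudocount=0.1)  # hypothetical pseudocount

print(model[1]["S"], model[1]["T"])           # 0.5 0.5
print(sequence_probability(model, "MSGL"))    # 0.25
print(sequence_probability(model, "MAGL"))    # 0.0
print(sequence_probability(smoothed, "MAGL") > 0)  # True
```

Without the pseudocount, any residue not seen in a column gives the whole sequence a probability of 0, which is why a small probability for all other amino acids is normally added.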
Permit insertion states
MS.GL
MT.NL
MSANI
Match states: B → M1 → M2 → M3 → M4 → E
Insert states: I0, I1, I2, I3, I4
Transition probabilities may not be 1
Permit insertion states
MS..GL
MT..NL
MSA.NI
MTARNL
Match states: B → M1 → M2 → M3 → M4 → E
Insert states: I0, I1, I2, I3, I4
MS..GL--
MT..NLAG
MSA.NIAG
MTARNLAG
DELETE PERMITS INCORPORATION OF
LAST TWO SITES OF SEQ1
Match states: B, M1 to M7, E
Insert states: I0 to I7
Delete states: D1 to D7
The bottom row of states are the main states (M)
•These model the columns of the alignment
The second row of diamond-shaped states are called the insert states (I)
•These are used to model the highly variable regions in the alignment.
The top row of circles are the delete states (D)
•These are silent or null states: because they do not match any residues, they simply
allow the skipping over of main states.
States in the figure: B, M1 to M6, I0 to I6, D1 to D6, E
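How one aligned sequence maps onto main, insert, and delete states can be sketched as follows. This is an illustration under stated assumptions: "." marks a gap in an insert column, "-" marks a skipped main state, the first row is assumed to end in "--" (the last two sites missing), and the function name state_path is hypothetical.

```python
def state_path(row, match_columns):
    """Map one aligned row onto the states B, M/I/D, E of a profile HMM."""
    path = ["B"]
    k = 0  # number of main (match) columns passed so far
    for residue, is_match in zip(row, match_columns):
        if is_match:
            k += 1
            # A residue uses the main state; "-" skips it through the delete state.
            path.append(f"D{k}" if residue == "-" else f"M{k}")
        elif residue != ".":
            # A residue in an insert column uses the insert state after main state k.
            path.append(f"I{k}")
    path.append("E")
    return path

# Alignment as read from the slides (trailing "--" on the first row is an assumption).
alignment = ["MS..GL--", "MT..NLAG", "MSA.NIAG", "MTARNLAG"]

# Columns containing "." in any row are treated as insert columns.
match_columns = [all(row[i] != "." for row in alignment)
                 for i in range(len(alignment[0]))]

for row in alignment:
    print(row, " ".join(state_path(row, match_columns)))
# First row: B M1 M2 M3 M4 D5 D6 E
```

With this choice of columns the model has six main states, and the delete states D5 and D6 let the shorter first sequence pass through the model without matching the last two sites.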
Dirichlet Mixtures
• Additional information to expand
potential amino acids in individual sites.
• Observed frequency of amino acids
seen in certain chemical environments
– aromatic
– acidic
– basic
– neutral
– polar
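One way to see how a Dirichlet mixture expands the amino acids allowed at a site is the sketch below. It assumes the standard posterior-mean estimate for a mixture of Dirichlet priors; the four-residue alphabet, the two components, and their parameters are hypothetical toy values, not real published mixtures.

```python
import math

def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def mixture_estimate(counts, components):
    """Posterior mean residue probabilities under a Dirichlet mixture prior.

    counts: observed counts per residue in one alignment column.
    components: list of (weight, alpha) pairs, one alpha entry per residue.
    """
    # Posterior weight of each component is proportional to
    # weight * B(counts + alpha) / B(alpha).
    log_w = []
    for weight, alpha in components:
        posterior = [n + a for n, a in zip(counts, alpha)]
        log_w.append(math.log(weight) + log_beta(posterior) - log_beta(alpha))
    top = max(log_w)
    w = [math.exp(x - top) for x in log_w]
    norm = sum(w)
    w = [x / norm for x in w]

    # Mix the posterior means of the individual components.
    total = sum(counts)
    probs = [0.0] * len(counts)
    for wk, (_, alpha) in zip(w, components):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += wk * (n + a) / denom
    return probs

# Toy alphabet (hypothetical): D and E are acidic, K and R are basic.
residues = ["D", "E", "K", "R"]
components = [
    (0.5, [2.0, 2.0, 0.1, 0.1]),  # "acidic environment" component
    (0.5, [0.1, 0.1, 2.0, 2.0]),  # "basic environment" component
]

probs = mixture_estimate([3, 0, 0, 0], components)  # only D was observed
for aa, p in zip(residues, probs):
    print(aa, round(p, 3))
```

In this toy case only D was observed, yet E receives far more probability than K or R, because the observed counts point to the acidic component.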
STRUCTURES
α helix
β sheet
coils
turns
Structures are used to build domains: the Legos of evolution
Rotation around the peptide bond
Ramachandran plot for glycine (axes: phi and psi angles)
Areas not permitted
for other amino acids
Introduction to Protein Structure, Branden and Tooze
Garland Publishing Co.1991 p.13
From: http://bioweb.ncsa.uiuc.edu/~bioph254/Class-slides/Lect12/figure13.html
Longitudinal and Transverse image of alpha helix
From: http://bioweb.ncsa.uiuc.edu/~bioph254/Class-slides/Lect12/figure14.html
Turn connecting two helices
Introduction to Protein Structure, Branden and Tooze
Garland Publishing Co.1991 p. 17
Hemoglobin - ribbon representation
Proline
• Because of its
structure, proline is
typically excluded
from α helices
except in the first
three positions at
the amino end.
β Structure
β strand - single run of amino acids in β
conformation
β sheet - multiple β strands which are hydrogen
bonded to yield a sheet-like structure.
β bulge - disruption of normal hydrogen bonding in
a β sheet by amino acid(s) that will not fit into the
sheet, for example proline
β sheets - Parallel
Introduction to Protein Structure, Branden and Tooze
Garland Publishing Co.1991 p.17.
β sheet - longitudinal
and transverse view.
Side chains stick “out”
http://bioweb.ncsa.uiuc.edu/~bioph254/Class-slides/Lect12/figure22.html
Superoxide dismutase - β sheet
Six classes of structure
• Class α - bundled α helices connected by
loops.
• Class β - sandwich or barrel composed
entirely of β sheets, typically anti-parallel.
• Class α / β - mainly parallel β sheets with
intervening α helices.
• Class α + β - segregated α helices and
anti-parallel β sheets
• Multi-domain
• Membrane proteins
CD8 - all β
Thioredoxin - α / β
Endonuclease - Class α + β
Rhodopsin - 7TM protein
Common Hairpin Loop between two β Strands
Introduction to Protein Structure, Branden and Tooze
Garland Publishing Co.1991 p. 17
• Turn - short, regular loop.
– Difference in frequency of amino acids at
positions 1-4 of the turn.
• Coils (not coiled coil)
– Random turns or irregular structure.
Disulfide bridges
• Crosslink of two cysteine residues.
• Distance between the two sulfur atoms is about 2
Angstroms.
Coiled coil - two α helices bundled side by side
From: http://catt.poly.edu/~jps/coilcoil.html
Heptad positions a and d are internal; the remaining amino acids are solvent exposed
From: http://catt.poly.edu/~jps/coilcoil.html
Coiled Coil
• Two or more adjacent α
helices
Prediction of potential Coiled coil domain in Groucho
Potential Residues involved in Coiled Coil
MMFPQSRHSGSSHLPQQLKFTTSDSCDRIKDEFQLLQAQYHSL
KLECDKLASEKSEMQRHYVMYYEMSYGLNIEMHKQAEIVKR
LNGICAQVLPYLSQEHQQQVLGAIERAKQVTAPELNSIIRQQL
QAHQLSQLQALALPLTPLPVGLQPPSLPAVSAGTGLLSLSALG
SQTHLSKEDKNGHDGDTHQEDDGEKSD
Triple helix coiled coil - built from α helices
Backbone of triple coiled coil
E. coli Nucleotide exchange factor
Domains
• Single domain proteins
– Epidermal growth factor
– Serine proteases - trypsin
• Multi-domain proteins - Factor IX: one Ca2+
binding, two EGF, and one protease domain.
• Permit building of novel functions by
swapping of domains
Factor IX Domain Structure
Ca - EGF - EGF - CT
Ca - Calcium binding domain
EGF - Epidermal growth factor domain
CT - Chymotrypsin domain
Chou - Fasman Prediction of
Secondary Structure
• Based upon analysis of known
structures (1974).
• Frequency of occurrence of each
amino acid in:
– α helix
– β strand
– turn
Chou - Fasman Prediction
• List is then analyzed for stretches of amino
acids that have a common tendency to form a
given secondary structure.
• Extend until a region of high probability for
either a turn or a region with a low probability of
both α and β is encountered.
• Window is typically <10 residues
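The window scan described above can be sketched as follows. This is a simplified illustration only: the propensity values are hypothetical placeholders (not the published table), the window of 6 and threshold of 1.05 are assumed, and the full method's nucleation and extension rules are reduced to merging overlapping high-scoring windows.

```python
# Illustrative helix propensities: hypothetical values, NOT the published table.
HELIX_PROPENSITY = {
    "E": 1.5, "A": 1.4, "M": 1.4, "L": 1.2, "K": 1.2,
    "G": 0.6, "P": 0.6, "S": 0.8, "T": 0.8, "N": 0.7,
}

def propensity(aa):
    """Look up a residue; residues missing from the toy table count as neutral."""
    return HELIX_PROPENSITY.get(aa, 1.0)

def helix_regions(seq, window=6, threshold=1.05):
    """Return half-open (start, end) spans where the windowed mean propensity is high."""
    regions = []
    for start in range(len(seq) - window + 1):
        mean = sum(propensity(aa) for aa in seq[start:start + window]) / window
        if mean >= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlapping or adjacent windows extend the current region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions

seq = "GPGPGP" + "AEELLKAEEL" + "GPGPGP"
print(helix_regions(seq))  # [(4, 18)]
```

The predicted span runs a couple of residues past the helix-favoring stretch on each side, a reminder that window methods place the ends of an element only approximately.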
GOR prediction
• Similar to Chou - Fasman
– More recent (1988) tabulation of amino
acid preferences.
– Uses a larger window: 17 residues
More Recent Prediction
Programs
• Make use of a library of 3D structures to
predict structure.
• Most use a Neural Net approach for
prediction.
• Examples
– Nnpredict
– PredictProtein
Neural Net
• Programs “trained” on structures.
• Window - each position within the window
is predicted based upon what was learned in training.
• Rules are also applied (e.g., an alpha helix is at least 4 AA long)
Network layers: Input (sequence window) → Hidden → Output (α, β, coil)
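The input layer can be pictured as a one-hot encoding of the residues in the window. The sketch below is an assumption about how such an encoding might look; the window size of 13, the extra padding slot, and the name encode_window are hypothetical and are not taken from any specific program.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(seq, center, window=13):
    """One-hot encode the residues in a window centered on one position.

    Each window position gets 21 inputs: 20 amino acids plus one slot
    that flags positions hanging off either end of the sequence.
    """
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * (len(AMINO_ACIDS) + 1)
        if 0 <= pos < len(seq):
            one_hot[AMINO_ACIDS.index(seq[pos])] = 1
        else:
            one_hot[-1] = 1  # padding beyond the sequence ends
        vector.extend(one_hot)
    return vector

inputs = encode_window("MSGL", center=0, window=3)
print(len(inputs))  # 63 inputs: 3 window positions x 21 slots
```

A trained network would take this vector as input and output scores for α, β, and coil at the central residue; the window then slides one position along the sequence.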
PredictProtein
• Uses an alignment approach.
• Submitted sequence is compared to
database and alignment is generated
• Profile is generated for further database
searching.
• Alignment is then used for prediction of
secondary structure.
• Confidence predicted - based upon
number of residues of given type at a
given position in the alignment
Kyte and Doolittle Hydropathy
• Average of the hydropathy index over a window around each
residue.
• Example of hydropathy index:
• F +2.8
• R -4.5
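The windowed average can be sketched as below. The scale is the commonly tabulated Kyte and Doolittle hydropathy index (only the F and R values appear in the slides); the default window of 9 and the function name are assumptions made for illustration.

```python
# Kyte and Doolittle hydropathy index, as commonly tabulated.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of each full window, reported for the window's central residue."""
    half = window // 2
    profile = []
    for center in range(half, len(seq) - half):
        segment = seq[center - half:center + half + 1]
        profile.append(sum(KD[aa] for aa in segment) / window)
    return profile

profile = hydropathy_profile("MQRVNMIMAESPGLITICLLGYLLSAECTVFLDHENANKILNRPKR")
print(len(profile), round(max(profile), 2))
```

Positive values mark hydrophobic stretches and negative values mark hydrophilic ones. A window near 19 residues is commonly used when scanning for membrane-spanning segments.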
Transmembrane Domain
• Characteristics make them easier to predict:
– α helix structure
– Hydrophobic amino acids
– 19 or more amino acids long
– A charged residue will typically have an opposing
charge for neutralization.
• Difficulty in predicting the ends of transmembrane
domains.
Caveat
• Local secondary structure can be
influenced by tertiary structure.
• Identical string of residues can be an α
helix in one protein but a β strand in
another protein.
3D structural prediction
>gi|14769656|ref|XP_010270.4| coagulation factor IX [Homo sapiens]
MQRVNMIMAESPGLITICLLGYLLSAECTVFLDHENANKILNRPKRYNSGKLEEFVQGNLERECMEEKCSFEEAREVFEN
TERTTEFWKQYVDGDQCESNPCLNGGSCKDDINSYECWCPFGFEGKNCELDVTCNIKNGRCEQFCKNSADNKVVCSCTEG
YRLAENQKSCEPAVPFPCGRVSVSQTSKLTRAETVFPDVDYVNSTEAETILDNITQSTQSFNDFTRVVGGEDAKPGQFPW
QVVLNGKVDAFCGGSIVNEKWIVTAAHCVETGVKITVVAGEHNIEETEHTEQKRNVIRIIPHHNYNAAINKYNHDIALLE
LDEPLVLNSYVTPICIADKEYTNIFLKFGSGYVSGWGRVFHKGRSALVLQYLRVPLVDRATCLRSTKFTIYNNMFCAGFH
EGGRDSCQGDSGGPHVTEVEGTSFLTGIISWGEECAMKGKYGIYTKVSRYVNWIKEKTKLT
Pfam
Protein Information Resource
KFHU
Tertiary Structure
• Still challenging
• Focus upon core structure for prediction
– Hydrophobic interactions
that stabilize structure.
Approach
• Determine “fit” of a query sequence to
library of known structures.
– Threading- examine compatibility of amino
acid side groups with known structures
– Two approaches:
• Environmental template
• Contact potential
Environmental Template
• Each amino acid in known core
evaluated for:
– secondary structure
– area of side chain buried
– types of nearby AA side chains
Arginine - basic AA
Isoleucine
The two have different propensities to be in a hydrophobic environment.
A buried charge might be accommodated by a nearby opposite charge.
Environmental
• Query sequence is submitted to
previously analyzed database of
structures
– How well does your sequence fit these
protein cores?
Contact Potential
• Number and closeness between each
AA pair determined.
• Query sequence examined to determine
if potential AA interactions match those
of known cores.
Structural Profile
• Structural position specific scoring matrix
• Identify which amino acids fit into a specific
position in the core of each known structure
– each position is assigned to one of the 18 classes
of structural environment
– scores reflect suitability of AA for that position
– log odds matrix
• Use profile to examine query sequence
Z score
• Many programs return an E value or a Z score
• Z score: the number of standard
deviations from the mean score for all
sequences.
• The higher the Z score, the more
significant the match to the model; a typical good
score is >5.
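The Z score calculation can be sketched as below. The background scores are made-up numbers for illustration, and using the population standard deviation is an assumption; a given program may define its background distribution differently.

```python
import statistics

def z_score(query_score, background_scores):
    """Number of standard deviations the query lies above the background mean."""
    mean = statistics.mean(background_scores)
    sd = statistics.pstdev(background_scores)  # population standard deviation (assumed)
    if sd == 0:
        raise ValueError("background scores have no spread")
    return (query_score - mean) / sd

# Made-up scores of other sequences against the same model (mean 5, sd 2).
background = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_score(15, background))  # 5.0
```

A query scoring 15 against this background sits 5 standard deviations above the mean, right at the >5 level mentioned above.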