phdcourse_motif2 - Center for Biological Sequence Analysis

Download Report

Transcript phdcourse_motif2 - Center for Biological Sequence Analysis

Morten Nielsen,
CBS, BioCentrum,
DTU
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence motifs, information
content, logos, and HMM’s
• Multiple alignments and sequence motifs
• Weight matrices and consensus sequence
– Sequence weighting
– Low (pseudo) counts
• Information content
– Sequence logos
– Mutual information
• Example from the real world
• HMM’s and profile HMM’s
– TMHMM (trans-membrane protein)
– Gene finding
• Links to HMM packages
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Outline
• Core
• Consensus
sequence
• Weight matrices
• Problems
– Sequence weights
– Low counts
----------MLEFVVEADLPGIKA-----------------MLEFVVEFALPGIKA-----------------MLEFVVEFDLPGIAA--------------------YLQDSDPDSFQD----------GSDTITLPCRMKQFINMWQE------------RNQEERLLADLMQNYDPNLR----------------YDPNLRPAERDSDVVNVSLK---------------NVSLKLTLTNLISLNEREEA------EREEALTTNVWIEMQWCDYR------------------WCDYRLRWDPRDYEGLWVLR----LWVLRVPSTMVWRPDIVLEN----------------------IVLENNVDGVFEVALYCNVL-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW---*********
FVVEFDLPG
Consensus
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Multiple alignment and
sequence motifs
----------MLEFVVEADLPGIKA-----------------MLEFVVEFALPGIKA-----------------MLEFVVEFDLPGIAA--------------------YLQDSDPDSFQD----------GSDTITLPCRMKQFINMWQE------------RNQEERLLADLMQNYDPNLR----------------YDPNLRPAERDSDVVNVSLK---------------NVSLKLTLTNLISLNEREEA------EREEALTTNVWIEMQWCDYR------------------WCDYRLRWDPRDYEGLWVLR----LWVLRVPSTMVWRPDIVLEN----------------------IVLENNVDGVFEVALYCNVL-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW---*********
}
Homologous sequences
Weight = 1/n (1/3)
Consensus sequence
YRQELDPLV
Previous
FVVEFDLPG
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequences weighting 1 Clustering
• waa’ = 1/rs
• r: Number of different aa in a column
• s: Number occurrences
• Normalize so S waa= 1 for each column
• Sequence weight is sum of waa
F: r=7 (FYMLPVW), s=4
w’=1/28, w = 0.055
Y: s=3, w`=1/21, w = 0.073
M,P,W: s=1, w’=1/7, w = 0.218
L,V: s=2, w’=1/14, w = 0.109
FVVEADLPG
FVVEFALPG
FVVEFDLPG
YLQDSDPDS
MKQFINMWQ
LMQNYDPNL
PAERDSDVV
LKLTLTNLI
VWIEMQWCD
YRLRWDPRD
WRPDIVLEN
VLENNVDGV
YCNVLVSPD
FRSACSISV
w
0.37
0.43
0.32
0.59
0.90
0.68
0.75
0.85
0.84
0.51
0.71
0.59
0.71
0.75
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequences weighting 2 (Henikoff & Henikoff)
P1
--------MLEFVVEADLPGIKA---------------MLEFVVEFALPGIKA---------------MLEFVVEFDLPGIAA------------------YLQDSDPDSFQD--------GSDTITLPCRMKQFINMWQE----------RNQEERLLADLMQNYDPNLR--------------YDPNLRPAERDSDVVNVSLK-------------NVSLKLTLTNLISLNEREEA----EREEALTTNVWIEMQWCDYR----------------WCDYRLRWDPRDYEGLWVLR--LWVLRVPSTMVWRPDIVLEN--------------------IVLENNVDGVFEVALYCNVL-----------YCNVLVSPDGCIYWLPPAIF
-------PPAIFRSACSISVTYFPFDW---*********
• Limited number of data
• Poor sampling of
sequence space
• I is not found at position
P1. Does this mean
that I is forbidden?
• No! Use Blosum matrix
to estimate pseudo
frequency of I
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Low count correction
• Every time for instance
L/V is observed, I is
also likely to occur
• Estimate low (pseudo)
count correction using
this approach
• As more data are
included the pseudo
count correction
becomes less important
Blosum62 substitution frequencies
#
I
L
V
L 0.1154 0.3755 0.0962
V 0.1646 0.1303 0.2689
NL = 2, NV=2, Neff=12 =>
fI = (2*0.1154 + 2*0.1646)/12
= 0.05
pI* = (Neff * pI + b * fI)/(Neff+b)
= (12*0 + 10*0.05)/(12+10)
= 0.02
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Low count correction using
Blosum matrices
• Information and entropy
– Conserved amino acid regions contain high degree of
information (high order == low entropy)
– Variable amino acid regions contain low degree of
information (low order == high entropy)
• Shannon information
D = log2(N) + S pi log2 pi (for proteins N=20, DNA N=4)
• Conserved residue pA=1, pi<>A=0,
D = log2(N) ( = 4.3 for proteins)
• Variable region pA=0.05, pC=0.05, ..,
D=0
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Information content
MHC class II
Logo from 10 sequences
• Height of a column equal
to D
• Relative height of a letter
is pA
• Highly useful tool to
visualize sequence
motifs
High information
position
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence logo
• Information content
D = S pi log2 (pi/qi)
A R N D
2 1 1 1
8 19 1 1
3 2 7 2
8 13 13 14
4 1 7 7
5 2 8 23
2 1 7 13
3 7 7 8
3 2 7 19
Frequency matrix
C Q E
1 1 1
7 2 2
1 17 13
1 2 13
7 1 2
1 6 3
1 1 2
7 1 7
1 6 2
G
1
2
2
2
2
2
2
8
8
H I L K M F P S T W Y V
1 4 16 1 6 15 7 1 2 7 18 13
1 3 15 13 6 2 1 2 2 7 1 8
1 8 14 3 1 1 7 7 2 0 1 8
1 2 3 3 1 7 1 3 7 0 1 7
1 13 15 2 6 6 1 7 2 7 7 4
1 3 3 2 1 1 1 13 8 0 1 18
1 8 14 2 6 1 20 7 2 7 1 3
1 2 8 2 1 1 13 7 2 7 1 7
1 9 9 2 1 1 1 7 2 0 1 18
• Shannon, qi = 1/N = 0.05
D = S pi log2 (pi) - S pi log2 (1/N)
= log2 N - S pi log2 (pi)
• Kullback-Leibler, qi = background frequency
– V/L/A more frequent than for instance C/H/W
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
More on logos
P1
I(i,j) = Saai Saaj P(aai, aaj) *
log[P(aai, aaj)/P(aai)*P(aaj)]
P(G1) = 2/9 = 0.22, ..
P(V6) = 4/9 = 0.44,..
P(G1,V6) = 2/9 = 0.22,
P(G1)*P(V6) = 8/81 = 0.10
log(0.22/0.10) > 0
P6
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Mutual information
313 binding peptides
313 random peptides
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Mutual information
• Mutual information between anchor positions 2 and 9
and other residues low
– At pos 2 we know that L,M,T,V and I are the most frequent
amino acids.
– At pos 9 V,L,I and A are most frequent
– 313 Rammensee + Buus pep
• P(L2) = 0.51, P(V9)=0.48, P(L2,V9) = 0.23
• P(L2,V9)/(P(L2)*P(V9) )=0.23/0.24 = 1.0
• Knowing that we have L at position 2 does not tell us
which one of V,L or I is placed on position 9 => NO
mutual information
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Mutual information at anchor
position is low
• Estimate amino acid frequencies from alignment inc.
sequence weighting and pseudo counts
• Now a weight matrix is given as
Wij = log(pij/qj)
• Here i is a position in the motif, and j an amino acid.
qj is the background frequency for amino acid j.
• W is a L x 20 matrix, L is motif length
• Score sequences to weight matrix by looking up and
adding L values from matrix
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Weight matrices
• 10 peptides from
MHCpep database
• Bind to the MHC
complex
• Relevant for immune
system recognition
• Estimate sequence
motif and weight matrix
• Evaluate on 528
peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example from real life
• Raw sequence counting
– No sequence weighting
– No pseudo count
– Prediction accuracy 0.45
• Sequence weighting
– No pseudo count
– Prediction accuracy 0.5
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example (cont.)
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Sequence weighting
and pseudo count
– Prediction accuracy
0.60
• Motif found on all
data (485)
– Prediction accuracy
0.79
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example (cont.)
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Weight matrices do not deal with insertions
and deletions
• In alignments, this is done in an ad-hoc
manner by optimization of the two gap
penalties for first gap and gap extension
• HMM is a natural frame work where
insertions/deletions are dealt with explicitly
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Hidden Markov Models
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Core of alignment
• Example from A. Krogh
• Core region defines the
number of states in the
HMM (red)
• Insertion and deletion
statistics is derived from
the non-core part of the
alignment (blue)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 matches. A, 2xC, T, G
• 5 transitions in gap region
• C out, G out
• A-C, C-T, T out
• Out transition 3/5
• Stay transition 2/5
.4
.2
.4
.2
.2
A
C
G
T
.6
.6
A
C
G
T
.8
.2
A
1. C
G
T
ACA---ATG
A
.8 1. C
.2
G
T
A
.8
.4
C
.2
1
A
A
1. C
1. C
G
T
0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2
G
T
.2
.8
= 3.3x10-2
G
T
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM construction
.8
.2
ACA---ATG
TCAACTATC
0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2
= 3.3x10-2
0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8
0.0075x10-2
ACAC--AGC = 1.2x10-2
AGA---ATC = 3.3x10-2
ACCG--ATC = 0.59x10-2
Consensus:
ACAC--ATC = 4.7x10-2, ACA---ATC = 13.1x10-2
Exceptional:
TGCT--AGG = 0.0023x10-2
=
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Align sequence to HMM
• Score depends
strongly on length
• Null model is a
random model. For
length L the score is
0.25L
• Log-odd score for
sequence S
Log( P(S)/0.25L)
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
Note!
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = -0.97
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Align sequence to HMM Null model
• In the case of un-gapped alignments
HMM’s become simple weight matrices
• It still might be useful to use a HMM tool
package to estimate a weight matrix
– Sequence weighting
– Pseudo counts
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM’s and weight matrices
• Alignments based on conventional scoring
matrices (BLOSUM62) scores all positions in
a sequence in an equal manner
• Some position are highly conserved, some
are highly flexible (more than what is
described in the BLOSUM matrix)
• Profile HMM’s are ideal suited to describe
such position specific variations
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Profile HMM’s
Sequence profiles
• Alignment of 1PLC._ to 1GYC.A
• Blast e-value > 1000
• Profile alignment
– Align 1PLC._ against Swiss-prot
– Make position specific weight matrix from
alignment
– Use this matrix to align 1PLC._ against 1GYC.A
• E-value > 10-22. Rmsd=3.3
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)
Query: 3
Sbjct: 26
Query: 57
Sbjct: 80
VLLGADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
V+ G
F
+
G++
N+
+
+G + +
VVNG------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79
MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
A G +F
G + ++
G+ G
V
AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd=3.3
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example continued
EM55_HUMAN
CSKP_HUMAN
KAPB_MOUSE
NRC2_NEUCR
WWQGRVEGSSKESAGLIPSPELQEWRVASMAQSAP--SEAPSCSPFGKKKK-YKDKYLAK
WWQGKLENSKNGTAGLIPSPELQEWRVACIAMEKTKQEQQASCTWFGKKKKQYKDKYLAK
-----PENLLIDHQGYIQVTDFGFAKRVKG----------------------------------PENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKSCIANF
EM55_HUMAN
CSKP_HUMAN
KAPB_MOUSE
NRC2_NEUCR
HSSIFDQLDVVSYEEVVRLPAFKRKTLVLIGASGVGRSHIKNALLSQNPEKFVYPVPYTT
HNAVFDQLDLVTYEEVVKLPAFKRKTLVLLGAHGVGRRHIKNTLITKHPDRFAYPIPHTT
RTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSG
RTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFANILRE
EM55_HUMAN
CSKP_HUMAN
KAPB_MOUSE
NRC2_NEUCR
RPPRKSEEDGKEYHFISTEEMTRNISANEFLEFGSYQGNMFGTKFETVHQIHKQNKIAIL
RPPKKDEENGKNYYFVSHDQMMQDISNNEYLEYGSHEDAMYGTKLETIRKIHEQGLIAIL
KVRFPSHF-----SSDLKDLLRNLLQVDLTKRFGNLKNGVSDIKTHKWFATTDWIAIYQR
DIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLG-ARAGASDIKTHPFFRTTQWALI--R
EM55_HUMAN
CSKP_HUMAN
KAPB_MOUSE
NRC2_NEUCR
NNGVDETLKKLQEAFDQACSSPQWVPVSWVY
NNEIDETIRHLEEAVELVCTAPQWVPVSWVY
EKCGKEFCEF--------------------ENAVDPFEEFNSVTLHHDGDEEYHSDAYEKR
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Profile HMM’s
All M/D pairs must
be visited once
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Profile HMM’s
(Sonnhammer, von Heijne, and Krogh)
Model TM length
distribution.
Power of HMM.
Difficult in alignment.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
TMHMM (trans-membrane
HMM)
x
Inter-genic
region
xxxxxxxxATGccc
Region around
start codon
ccc
cccTAAxxxxxxxx
Coding
region
Region around
stop codon
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Combination of HMM’s Gene finding
•
•
•
HMMER (http://hmmer.wustl.edu/)
– S.R. Eddy, WashU St. Louis. Freely available.
SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz.
Freely available to academia, nominal license fee for commercial users.
META-MEME (http://metameme.sdsc.edu/)
– William Noble Grundy, UC San Diego. Freely available. Combines
features of PSSM search and profile HMM search.
• NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
– Freely available to academia, nominal license fee for commercial users.
– Allows HMM architecture construction.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM packages
hmmbuild --gapmax 0.0
--fast A2.hmmer A2.fsa
>HLA-A.0201 16 Example_for_Ligand
SLLPAIVEL
>HLA-A.0201 16 Example_for_Ligand
YLLPAIVHI
>HLA-A.0201 16 Example_for_Ligand
TLWVDPYEV
>HLA-A.0201 16 Example_for_Ligand
SXPSGGXGV
>HLA-A.0201 16 Example_for_Ligand
GLVPFLVSV
hmmbuild - build a hidden Markov model from an alignment
HMMER 2.2g (August 2001)
-----------------------------------Alignment file: A2.fsa
File format: a2m
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: Fast/ad hoc (gapmax 0.0)
Null model used: (default)
Sequence weighting method: G/S/C tree weights
-------------------------------Alignment: #1
Number of sequences: 232
Number of columns: 9
Determining effective sequence number ... done. [192]
Weighting sequences heuristically
... done.
Constructing model architecture
... done.
Converting counts to probabilities
... done.
Setting model name, etc.
... done. [A2.fasta]
Constructed a profile HMM (length 9)
Average score:
-6.42 bits
Minimum score: -15.47 bits
Maximum score:
-0.84 bits
Std. deviation:
2.72 bits
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Simple Hmmer command