Sequence information, Logos and HMMs - CBS

Download Report

Transcript Sequence information, Logos and HMMs - CBS

Morten Nielsen,
CBS, BioCentrum,
DTU
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence information, logos
and Hidden Markov Models
• Information and entropy
– Conserved amino acid regions contain high degree of
information (high order == low entropy)
– Variable amino acid regions contain low degree of
information (low order == high entropy)
• Shannon information
D = log2(N) + S pi log2 pi (for proteins N=20, DNA N=4)
• Conserved residue pA=1, pi<>A=0,
D = log2(N) ( = 4.3 for proteins)
• Variable region pA=0.05, pC=0.05, ..,
D=0
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Information content
MHC class II
Logo from 10 sequences
• Height of a column equal
to D
• Relative height of a letter
is pA
• Highly useful tool to
visualize sequence
motifs
High information
positions
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence logo
• Description of binding motif
• Example
PA = 6/10
PG = 2/10
PT = PK = 1/10
PC = PD = …PV = 0
• Problems
– Few data
– Data redundancy/duplication
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence information
Raw sequence counting
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence information
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence weighting
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Sequence weighting
and pseudo count
• Motif found on more
data
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Pseudo counts
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
•
cp files from
/usr/opt/www/pub/CBS/researchgroups/immunology/intro/HMM/exercise
•
Make weight matrix and logos using
– pep2mat -swt 2 -wlc 0 data > mat
– mat2logo mat
– ghostview logo.ps
•
Include sequence weighting
– pep2mat -swt 0 -wlc 0 data > mat
– make and view logo
– Try the other sequence weighting scheme (clustering) -swt 1. What
difference does this make?
•
Include pseudo counts
– pep2mat data > mat
– make and view logo
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
…and now you
• Estimate amino acid frequencies from alignment
including sequence weighting and pseudo counts
• Construct a weight matrix as
Wij = log(pij/qj)
• Here i is a position in the motif, and j an amino acid.
qj is the prior frequency for amino acid j.
• W is a L x 20 matrix, L is motif length
• Score sequences to weight matrix by looking up and
adding L values from matrix
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Weight matrices
• Use the program seq2hmm to evaluate the prediction accuracy
of your weight matrix
– seq2hmm -hmm mat -xs eval.set | grep -v # | args 2,3 | xycorr
– What is going on here?
• By leaving out the -xs option you can generate the scores at
each position in the sequence. This is often useful for Neural
Network training
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Weight matrix predictions
• Complexity of
problem
– Peptides of different
length
– Weak motif signal
• Alignment crucial
• Gibbs Monte Carlo
sampler
DRB1*0401 peptides
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQV
KPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTIE
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
MHC class II prediction
Alignment by Gibbs sampler
E = Si,j pij * log( p`ij/qi )
Maximize E using MC
• Random change in offset
• Random shift on box
position
• Accept moves to higher E
always
• Accept moves to lower E
with decreasing probability
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQV
KPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTI
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Gibbs sample algorithm
• The file clasII.fsa is a FASTA file containing 50 classII epitopes
• gibbss_mc -iw -w 1,0,0,1,0,1,0,0,1 -m gibbs.mat classII.fsa
– The options -iw and -w 1,0,0,1,0,1,0,0,1 increase matrix weight on
important anchor positions in binding motif
– Make and view logo
• Use the matrix to predict classII epitopes
– cl2pred -mat gibbs.mat classII.eval.dat | grep -v # | args 4,5 | xycorr
– Do you understand what is going on in this command?
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Gibbs sampler exercise
• Weight matrices do not deal with insertions
and deletions
• In alignments, this is done in an ad-hoc
manner by optimization of the two gap
penalties for first gap and gap extension
• HMM is a natural frame work where
insertions/deletions are dealt with explicitly
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Hidden Markov Models
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Core of alignment
• Example from A. Krogh
• Core region defines the
number of states in the
HMM (red)
• Insertion and deletion
statistics are derived
from the non-core part
of the alignment (black)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 matches. A, 2xC, T, G
• 5 transitions in gap region
• C out, G out
• A-C, C-T, T out
• Out transition 3/5
• Stay transition 2/5
.4
.2
.4
.2
.2
A
C
G
T
.6
.6
A
C
G
T
.8
.2
A
1. C
G
T
ACA---ATG
A
.8 1. C
.2
G
T
A
.8
.4
C
.2
G
T
1
A
A
1. C
1. C
G
T
0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2
.2
.8
G
T
= 3.3x10-2
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM construction
.8
.2
ACA---ATG
TCAACTATC
0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2
= 3.3x10-2
0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8
0.0075x10-2
ACAC--AGC = 1.2x10-2
AGA---ATC = 3.3x10-2
ACCG--ATC = 0.59x10-2
Consensus:
ACAC--ATC = 4.7x10-2, ACA---ATC = 13.1x10-2
Exceptional:
TGCT--AGG = 0.0023x10-2
=
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Align sequence to HMM
• Score depends strongly
on length
• Null model is a random
model. For length L the
score is 0.25L
• Log-odds score for
sequence S
Log( P(S)/0.25L)
• Positive score means
more likely than Null
model
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
Note!
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = -0.97
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Align sequence to HMM Null model