Prediction of T cell epitopes using artificial neural networks
Morten Nielsen, CBS, BioCentrum, DTU
Objectives
• How to train a neural network to predict peptide MHC class I binding
• Understand why NNs perform the best
– Higher order sequence information
• The wisdom of the crowd!
– Why enlightened despotism does not work, even for neural networks
Outline
• MHC class I epitopes
– Why MHC binding?
• How to predict MHC binding?
– Information content
– Weight matrices
– Neural networks
• Neural network theory
– Sequence encoding
• Examples
Prediction of HLA binding specificity
Simple motifs
– Allowed/non-allowed amino acids
Extended motifs
– Amino acid preferences (SYFPEITHI)
– Anchor/preferred/other amino acids
Hidden Markov models
– Peptide statistics from sequence alignment (previous talk)
Neural networks
– Can take sequence correlations into account
SYFPEITHI predictions
Extended motifs based on peptides from the literature
and peptides eluted from cells expressing specific HLAs
( i.e., binding peptides)
Scoring scheme is not readily accessible.
Positions defined as anchor or auxiliary anchor positions
are weighted differently (higher)
The final score is the sum of the scores at each position
Predictions can be made for several HLA-A, -B and -DRB1
alleles, as well as some mice K, D and L alleles.
BIMAS
Matrix made from peptides with a measured T1/2 for the MHC-peptide complex.
The matrices are available on the website.
The final score is the product of the scores at each position in the matrix, multiplied by a constant (different for each MHC) to give a prediction of the T1/2.
Predictions can be obtained for several HLA-A, -B and -C alleles, mouse K, D and L alleles, and a single cattle MHC.
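The two schemes differ only in how the per-position contributions are combined: SYFPEITHI sums them, BIMAS multiplies them. A minimal sketch, with a made-up toy matrix (the values below are illustrative placeholders, not the actual SYFPEITHI or BIMAS tables):

```python
# Position-specific scoring sketch. TOY_MATRIX maps
# (position, amino acid) -> score; values are invented for
# illustration, NOT the real SYFPEITHI or BIMAS tables.
TOY_MATRIX = {
    (1, "L"): 3.0, (1, "M"): 2.0,   # P2-style anchor preference
    (8, "V"): 2.5, (8, "L"): 2.0,   # C-terminal anchor preference
    (4, "W"): 0.5,
}

def syfpeithi_style(peptide):
    # SYFPEITHI: final score is the SUM of the per-position scores
    return sum(TOY_MATRIX.get((i, aa), 0.0) for i, aa in enumerate(peptide))

def bimas_style(peptide, constant=1.0):
    # BIMAS: final score is the PRODUCT of the per-position scores,
    # multiplied by an MHC-specific constant, predicting the T1/2
    score = constant
    for i, aa in enumerate(peptide):
        score *= TOY_MATRIX.get((i, aa), 1.0)
    return score

print(syfpeithi_style("ILKEPVHGV"))  # additive score
print(bimas_style("ILKEPVHGV"))      # multiplicative T1/2 estimate
```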
How to predict
The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
– Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
Artificial neural networks (ANN) are ideally suited to take such correlations into account.
Higher order sequence correlations
Neural networks can learn higher order correlations!
– What does this mean?
Say that the peptide needs one, and only one, large amino acid at positions P3 and P4 to fill the binding cleft. How would you formulate this to test whether a peptide can bind?

S S => 0
L S => 1
S L => 1
L L => 0

No linear function can learn this (XOR) pattern.
Neural network learning higher order correlations
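A single hidden layer is enough to solve it. Below is a minimal sketch (not the actual NetMHC training code): a tiny feed-forward network trained by backpropagation learns the "one and only one large residue" rule above, encoding S=0 and L=1.

```python
# Minimal sketch: a 2-4-1 feed-forward network trained by
# backpropagation learns the XOR pattern that no linear model can fit.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # P3, P4
y = np.array([[0], [1], [1], [0]], dtype=float)              # binds?

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # output layer

lr = 1.0
for _ in range(20000):
    h = sigmoid(X @ W1 + b1)             # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)  # backprop of squared error
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # approaches [0, 1, 1, 0]
```

A network with zero hidden neurons is exactly the linear case and never gets below 25% error on this pattern; the hidden layer is what buys higher order correlations.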
Mutual information
• How is mutual information calculated?
• Information content was calculated as
    I = \sum_a p_a \log\left(\frac{p_a}{q_a}\right)
  which gives the information in a single position.
• A similar relation gives the mutual information between two positions:
    I = \sum_{a,b} p_{ab} \log\left(\frac{p_{ab}}{p_a \, p_b}\right)
Mutual information. Example
Knowing that you have G at P1 allows you to make an educated guess about what you will find at P6:
P(V6) = 4/9, but P(V6|G1) = 1.0!

Peptides (compare position P1 with position P6):
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS

Plugging into I = \sum_{a,b} p_{ab} \log(p_{ab} / (p_a p_b)):
P(G1) = 2/9 = 0.22
P(V6) = 4/9 = 0.44
P(G1,V6) = 2/9 = 0.22
P(G1) * P(V6) = 8/81 = 0.10
log(0.22/0.10) > 0, so this pair contributes positive mutual information.
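The calculation is easy to verify numerically. A minimal sketch using the nine peptides above (log base 2 is an arbitrary choice here; the slide's formula leaves the base unspecified):

```python
# Mutual information between two peptide positions, estimated from
# the nine example peptides.
from collections import Counter
from math import log2

peptides = ["ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV",
            "GLSPTVWLS", "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV",
            "FLPSDFFPS"]

def mutual_information(peps, i, j):
    n = len(peps)
    pa = Counter(p[i] for p in peps)           # marginal at position i
    pb = Counter(p[j] for p in peps)           # marginal at position j
    pab = Counter((p[i], p[j]) for p in peps)  # joint distribution
    return sum((c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

# P1 vs P6 (0-based indices 0 and 5), the pair from the example
print(mutual_information(peptides, 0, 5))
```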
Mutual information
[Plot: mutual information per position pair, computed from 313 binding peptides and 313 random peptides]
Neural network training
[Slide background: the training set, several hundred 9mer peptides such as SLLPAIVEL, LLDVPTAAV, HLIDYLVTS, ILFGHENRV, LERPGGNEI, PLDGEYFTL, GILGFVFTL, FLPSDFFPS, ...]
• Sequence encoding
– Sparse
– Blosum
– Hidden Markov model
• Network ensembles
– Cross validated training
– Benefit from ensembles
Sequence encoding
• How to represent a peptide amino acid sequence to the neural network?
• Sparse encoding (all amino acids are equally dissimilar)
• Blosum encoding (encodes similarities between the different amino acids)
• Weight matrix (encodes the position-specific amino acid preference of the HLA binding motif)
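As a concrete illustration, a minimal sketch of sparse (one-hot) encoding; Blosum encoding would replace each one-hot row with the residue's row of a BLOSUM substitution matrix (omitted here to keep the sketch self-contained):

```python
# Sparse (one-hot) peptide encoding: every residue becomes a
# 20-dimensional vector, so all amino acids are equally dissimilar.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def sparse_encode(peptide):
    encoded = []
    for aa in peptide:
        one_hot = [0.0] * len(ALPHABET)
        one_hot[ALPHABET.index(aa)] = 1.0
        encoded.extend(one_hot)
    return encoded  # a 9mer becomes a 180-dimensional input vector

print(len(sparse_encode("ILKEPVHGV")))  # 180
```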
Evaluation of prediction accuracy
[Bar chart comparing Motif, PSSM, HMM, sparse and BLOSUM encodings]

Method    Pearson   AROC
Motif     0.76      0.92
Hmm       0.80      0.95
Sparse    0.88      0.97
BLOSUM    0.91      0.97
Neural network training. Cross validation

Split the data into five partitions of 20% each.
Train on 4/5 of the data, test on the remaining 1/5
=> produces five different neural networks, each with a different prediction focus.
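A minimal sketch of the 5-fold split; the training routine itself (`train_network` below) is a hypothetical stand-in for whatever learner is used:

```python
# 5-fold cross validation: each pass holds out one ~20% partition
# for testing and trains on the remaining 80%.
def five_fold_splits(data):
    folds = [data[i::5] for i in range(5)]  # five ~20% partitions
    for k in range(5):
        test = folds[k]
        train = [x for i, f in enumerate(folds) if i != k for x in f]
        yield train, test

# networks = [train_network(train, test)
#             for train, test in five_fold_splits(peptide_data)]
# -> 5 networks, each with a different prediction focus
```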
Neural network training curve
Stop training at the maximum test set performance, where the network is most capable of generalizing.
Network ensembles
The Wisdom of Crowds

The Wisdom of Crowds: Why the Many are Smarter than the Few. James Surowiecki.

"One day in the fall of 1906, the British scientist Francis Galton left his home and headed for a country fair… He believed that only a very few people had the characteristics necessary to keep societies healthy. He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. … Galton came across a weight-judging competition… Eight hundred people tried their luck. They were a diverse lot: butchers, farmers, clerks and many other non-experts… The crowd had guessed … 1,197 pounds; the ox weighed 1,198."
Network ensembles
• No single network, with one particular architecture and sequence encoding scheme, will consistently perform the best
• Enlightened despotism also fails for neural network predictions
– For some peptides, BLOSUM encoding with a four-neuron hidden layer best predicts the peptide/MHC binding; for other peptides, a sparse-encoded network with zero hidden neurons performs the best
– Wisdom of the crowd
• Never use just one neural network
• Use network ensembles, as sketched below
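A minimal sketch of ensemble prediction, where each `net` is a hypothetical trained predictor (for example, one per cross-validation fold, encoding scheme and architecture):

```python
# Ensemble prediction: average the scores of all networks rather
# than trusting any single "despot" network.
def ensemble_predict(networks, encoded_peptide):
    scores = [net(encoded_peptide) for net in networks]
    return sum(scores) / len(scores)
```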
Evaluation of prediction accuracy

Method    Pearson   AROC
Motif     0.76      0.92
Hmm       0.80      0.95
Sparse    0.88      0.97
BLOSUM    0.91      0.97
ENS       0.92      0.98
ENS: Ensemble of neural networks trained using sparse,
Blosum, and weight matrix sequence encoding
T cell epitope identification
Lauemøller et al., Reviews in Immunogenetics, 2001
NetMHC-3.0 update
• IEDB + more proprietary data
• Higher accuracy for existing ANNs
• More human alleles
• Non-human alleles (mice + primates)
• Prediction of 8mer binding peptides for some alleles
• Prediction of 10- and 11mer peptides for all alleles
• Output to spreadsheet
NetMHC Output
[Screenshot of NetMHC prediction output]
Prediction of 10- and 11mers using 9mer prediction tools
Approach: for each peptide of length L, create six pseudo 9mer peptides by deleting a sliding window of L-9 residues, always keeping positions 1, 2, 3 and the C-terminal position.
Example:
MLPQWESNTL =>
MLPWESNTL
MLPQESNTL
MLPQWSNTL
MLPQWENTL
MLPQWESTL
MLPQWESNL
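The construction is mechanical; a minimal sketch that reproduces the example above:

```python
# Pseudo-peptide construction: delete a sliding window of L-9
# residues, always keeping positions 1, 2, 3 and the C-terminal
# residue, yielding six 9mers.
def pseudo_9mers(peptide):
    window = len(peptide) - 9
    return [peptide[:start] + peptide[start + window:]
            for start in range(3, 9)]  # window starts after P3

assert pseudo_9mers("MLPQWESNTL") == [
    "MLPWESNTL", "MLPQESNTL", "MLPQWSNTL",
    "MLPQWENTL", "MLPQWESTL", "MLPQWESNL",
]
```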
Prediction of 10- and 11mers using 9mer prediction tools
Final prediction = average of the six log scores:
(0.477 + 0.405 + 0.564 + 0.505 + 0.559 + 0.521) / 6 = 0.505
Affinity: exp(log(50000) * (1 - 0.505)) = 211.5 nM
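A minimal sketch of that score-to-affinity conversion; 50,000 nM is the upper bound used in this transformation:

```python
# Convert an averaged log score back to a binding affinity in nM.
import math

def score_to_affinity_nM(score, max_ic50=50000.0):
    return math.exp(math.log(max_ic50) * (1.0 - score))

scores = [0.477, 0.405, 0.564, 0.505, 0.559, 0.521]
avg = sum(scores) / len(scores)   # = 0.505
print(score_to_affinity_nM(avg))  # ~211.5 nM
```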
Prediction using ANN trained on 10mer peptides
Prediction of 10- and 11mers using 9mer prediction tools
[Bar chart: Pearson correlation coefficient of the 9mer approximation versus the ANN trained directly on 10mer peptides]
Examples. Hepatitis C virus. Epitope predictions
Hotspots
SARS T cell epitope identification
Peptide binding affinity: A01 predicted peptides offered to rA*0101
[Bar chart: peptide affinity (KD, µM) for peptides A1-6929 to A1-6943]
Peptides tested: 15/15 (100%)
Binders (KD < 500 nM): 14/15 (93%)
A0201
Peptide binding affinity: A02 supertype predicted peptides offered to rA*0201 (molecule used: rA0201/human b2m)
[Bar chart: peptide affinity (KD, µM) for peptides A2-6944 to A2-6958]
Binders: 12/15

A0301 and A1101
Peptide binding affinity: A03 predicted peptides offered to rA*0301 and to rA*1101
[Bar charts: peptide affinity (KD, µM) for peptides A3-6959 to A3-6973]
Binders: 13/15 and 14/15

B5801
Peptide binding affinity: B58 predicted peptides offered to rB*5801
[Bar chart: peptide affinity (KD, µM) for peptides B58-7035 to B58-7049]
Binders: 11/15
More SARS CTL epitopes
B0702
Peptide binding affinity: B7 predicted peptides offered to rB*0702
[Bar chart: peptide affinity (KD, µM) for peptides B7-6989 to B7-7003]
Binders: 10/15

B1501
Peptide binding affinity: B62 predicted peptides offered to rB*1501
[Bar chart: peptide affinity (KD, µM) for peptides B62-7050 to B62-7064]
Binders: 12/14
Vaccine design. Polytope optimization
• Successful immunization can be obtained only if the epitopes encoded by the polytope are correctly processed and presented.
• Cleavage by the proteasome in the cytosol, translocation into the ER by the TAP complex, as well as binding to MHC class I should be taken into account in an integrative manner.
• The design of a polytope can be done in an effective way by modifying the sequential order of the different epitopes, and by inserting specific amino acids, as linkers between the epitopes, that favor optimal cleavage and transport by the TAP complex.
Vaccine design. Polytope construction
[Diagram: polytope starting configuration; epitopes joined by linkers from NH2 (M) to COOH, annotated with C-terminal cleavage, cleavage within epitopes, and new epitopes created at the junctions]
Immunological Bioinformatics, The MIT Press.
Polytope optimization algorithm
• Optimization of four measures:
1. The number of poor C-terminal cleavage sites of epitopes (predicted cleavage < 0.9)
2. The number of internal cleavage sites (within-epitope cleavages with a prediction larger than the predicted C-terminal cleavage)
3. The number of new epitopes (number of processed and presented epitopes in the fusion regions spanning the epitopes)
4. The length of the linker regions inserted between epitopes
• The optimization seeks to minimize the above four terms by use of Monte Carlo Metropolis simulations [Metropolis et al., 1953], as sketched below
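A minimal Metropolis sketch for reordering epitopes. The `cost` function is a hypothetical stand-in for the four predicted measures above, which in practice would come from the actual cleavage, TAP and MHC predictors:

```python
# Monte Carlo Metropolis optimization of the epitope order.
import math
import random

def metropolis_optimize(epitopes, cost, steps=10000, temperature=1.0):
    random.seed(0)
    order = list(epitopes)
    energy = cost(order)
    best, best_energy = order[:], energy
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)
        candidate = order[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]  # swap
        cand_energy = cost(candidate)
        # Accept downhill moves always; accept uphill moves with
        # Boltzmann probability exp(-dE/T) [Metropolis et al., 1953].
        delta = cand_energy - energy
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            order, energy = candidate, cand_energy
            if energy < best_energy:
                best, best_energy = order[:], energy
    return best, best_energy
```

In the real setting one would also mutate the linker residues, not only swap epitopes, and weight the four cost terms against each other.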
Polytope optimal configuration
Immunological Bioinformatics, The MIT Press.
Summary
• MHC class I binding can be predicted very accurately using ANNs
• Higher order sequence correlations are important for peptide:MHC-I binding
• ANNs can be trained without overfitting
– Using multiple sequence encoding schemes
– Wisdom of the crowd
• Optimization can generate polytopes with a high likelihood of antigen presentation