HLA-mediated evolution in HIV-1: implications for T-cell


Graphical models for HIV
vaccine design
David Heckerman, Jonathan Carlson, Jennifer Listgarten, & Carl Kadie
Microsoft Research
The need for an HIV vaccine
- One of the deadliest pandemics in recorded history (10,000 die per day)
- Treatments are reasonably effective, but expensive and catastrophic if doses are missed (i.e., essentially useless for developing countries)
Overview
- HIV 101
- Two problems in HIV vaccine design that can be addressed by graphical models
- Weakness in graphical model learning
- Solution (based on False Discovery Rate)
HIV Life Cycle
Two arms of Adaptive Immune Response
- Humoral arm (antibodies): Recognize, neutralize and respond to free floating virus particles
- Cellular arm (killer T cells): Identify and destroy already infected cells
Central question in vaccine design:
Can our immune system stop HIV?
- Humoral arm (antibodies): Have been trying for 20 years without success
- Cellular arm (killer T cells): Is this system strong enough to stop or help stop HIV?
Cellular Arm Details
[Figure labels: Host Cell, Different viral protein fragments, Epitope, Killer T-cell]
How effective is this mechanism on HIV?
[Figure labels: HLA Molecule, Epitope]
HIV mutates rapidly
RT
If the virus contains an epitope targeted by the
immune system, at least one amino acid in or
near the epitope will change to escape attack
Rapid mutation + selection pressure = detectable footprint
- If there is strong selection pressure from the immune system, then HIV’s mutations should be correlated with the HLA types of the human host
- E.g., if host has B57 HLA, then expect to see HIV mutations in or near B57 epitopes
First approach
Science 2002; 296:1439-43
Moderate number
of associations
found, but didn’t
correlate well with
known epitopes…
something wrong?
First approach
Patient  HLA        Sequence
Pt1      HLA=B57    MGARASVLRGEKLDRWEKIRLRPGGKKQYRLKHIVWASRELERFALN...
Pt2      HLA<>B57   MGARASILRGGKLDKWEKIRLRPGGKKKYRLKHLVWASRELERFALN...
Pt3      HLA=B57    MGASASILKGEKLDRWEKIRLRPGGKKSYKLKHIVWASRELERFALN...
Pt4      HLA<>B57   MGARASVLRGGKLDKWEKIRLRPGGKKCYMLKHLVWASRELERFALN...
Pt5      HLA<>B57   MGARASVLRGEKLDKWEKIRLRPGGKKQYKLKHIVWASRELDRFALN...
Pt6      HLA<>B57   MGARASILRGENLDKWEKIRLRPGGKKCYMIKHIVWASRELERFALN...
Pt7      HLA=B57    MGARASILIGEKLDRWEKIRLRPGGRKRYMLKHLVWASRELERFALN...
Pt8      HLA=B57    MGARASVLRGEKLDRWEKIRLRPGGKKTYMLKHIVWASRELERFALN...

     B57   Not B57
R    4     0
K    0     4

p = 0.03 (Fisher’s exact test)
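The p-value on this slide can be checked by hand. The following is a minimal sketch of a two-sided Fisher's exact test for a 2x2 table, written with the standard library only; the function name `fisher_exact_two_sided` is an illustrative choice, not something from the talk.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Counts from the slide: rows are amino acids (R, K), columns are (B57, not B57).
p_value = fisher_exact_two_sided(4, 0, 0, 4)
print(round(p_value, 2))  # 0.03
```

With only eight patients, the two perfectly separated tables account for 2 of the 70 possible arrangements, which is where 0.03 comes from.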
This approach ignores the phylogeny
Slide from LANL
Problem: Simple method ignores the phylogenetic structure of the data
[Figure: phylogenetic tree whose leaves show the AA at a given position (X or Y) and HLA status; the four X leaves have the HLA, the four Y leaves do not]

     HLA   not HLA
X    4     0
Y    0     4

p = 0.03 (Fisher’s exact test)

[Figure: the same counts on a tree whose internal nodes are labeled X and Y]
Here, the tree helps to explain the data. Simple method overestimates the correlation.

[Figure: the same counts on a different tree, shown vs. a tree whose leaves alternate between X (has HLA) and Y (not HLA)]
Now, the data is surprising in light of the tree. Simple method underestimates the correlation.
A graphical model approach
Science 2007
Construct a phylogeny using standard methods (Felsenstein, 1981)
For each HLA allele, position, and amino acid at that position…
- Graphical model 1: aa described by a (given) phylogeny alone
- Graphical model 2: aa driven by phylogeny and HLA pressure
- The better model 2 explains the data, the more likely an association exists
Phylogenetic tree
[Figure: phylogenetic tree; leaves are observed sequences, branches have length t]
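Model 1: Explained by phylogeny alone
For position, amino acid:
$\pi = p(X)$
$\lambda$ = rate of mutation
ML parameters learned with EM
[Figure: phylogenetic tree with observed amino acids at the leaves and branches of length $t$; inset plot of $p(X \mid \text{not } X)$ against $t$, labeled $\pi$ and $1 - e^{-\lambda t}$]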
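The Model 1 slide plots p(X | not X) against branch length t, with labels that read as π (the stationary probability of X) and 1 − e^{−λt} (λ being the mutation rate). A rough sketch of that curve follows, assuming the two labels multiply, i.e. p(X | not X) = π(1 − e^{−λt}); that product form is an assumption read off the plot labels, and the parameter values are hypothetical.

```python
import math

def p_x_given_not_x(pi, lam, t):
    """Assumed form of the slide's curve: probability that a branch of
    length t ends in amino acid X given that it started in another state."""
    return pi * (1.0 - math.exp(-lam * t))

# Hypothetical parameter values, for illustration only.
pi, lam = 0.3, 2.0
for t in (0.0, 0.5, 5.0):
    print(t, round(p_x_given_not_x(pi, lam, t), 4))
```

Under this reading the probability starts at 0 for a zero-length branch and approaches π as the branch grows, so long branches forget the parent state.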
Model 2: Explained by phylogeny and HLA
For position, amino acid, and HLA:
$\pi = p(X)$
$\lambda$ = rate of mutation
[Figure: phylogenetic tree with observed amino acids at the leaves and HLA nodes]
Many possible associations to investigate
[Figure: associations indexed by aa, HLA, and position within HIV]
What biologists don’t want
- $\log L(\text{data} \mid \theta, \text{model 2}) - \log L(\text{data} \mid \theta, \text{model 1})$
- p-value, Bonferroni
- Bayesian posterior probability that the association is present
What biologists want
- A set of the most likely associations such that only X% are spurious (i.e., due to chance)
False Discovery Rate (FDR)
Benjamini and Hochberg, 1995
- A set of the most likely associations such that only X% are spurious (i.e., due to chance)
S(t): Number of associations found with p-value < t
F(t): Number of those associations that are spurious
$$\mathrm{FDR}(t) = E\left[\frac{F(t)}{S(t)}\right]$$
False Discovery Rate (FDR)
Storey and Tibshirani 2003
$$\mathrm{FDR}(t) = E\left[\frac{F(t)}{S(t)}\right] \approx \frac{E(F(t))}{E(S(t))} \approx \frac{E_0(F(t))}{S(t)}$$
$E_0(F(t))$: Expected number of associations under a null distribution
$S(t)$: Number of associations found with real data
E.g.:
- Cutoff = 0.001
- 40 associations on real data with p-value < cutoff
- Average of 8 associations given null with p-value < cutoff
- FDR(0.001) = 8/40 = 0.2
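The arithmetic in this example can be written as a small helper. This is a minimal sketch of the ratio estimator only; the function name `estimate_fdr` and the individual null counts are hypothetical, chosen so that they average 8 as in the example.

```python
def estimate_fdr(null_counts, real_count):
    """Estimate FDR(t) as the mean number of associations found on null
    data sets divided by the number found on the real data."""
    if real_count == 0:
        return 0.0  # convention chosen here: no discoveries, no false discoveries
    expected_null = sum(null_counts) / len(null_counts)
    return expected_null / real_count

# Hypothetical counts from five null data sets; their average is 8.
null_counts = [6, 9, 8, 10, 7]
print(estimate_fdr(null_counts, real_count=40))  # 0.2
```

Repeating the calculation over a range of cutoffs gives the FDR as a function of t, from which a cutoff matching the desired X% can be picked.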
Creating null data via permutation
PatientID  HLA
1          A*0201=0
2          A*0201=1
3          A*0201=1
4          A*0201=0
5          A*0201=1
6          A*0201=0
…          …

PatientID  AA
1          N=1
2          N=0
3          N=0
4          N=1
5          N=1
6          N=0
…          …
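One way to read the two tables above is that the HLA column is shuffled across patients while the AA column stays in place. Here is a minimal sketch of that idea using the six rows shown; the function name `permuted_null` and the fixed seed are illustrative assumptions, and the actual study may permute in a more constrained way.

```python
import random

# The six rows shown on the slide (1 = present, 0 = absent).
hla = [0, 1, 1, 0, 1, 0]  # A*0201 status for patients 1..6
aa = [1, 0, 0, 1, 1, 0]   # whether amino acid N is observed, patients 1..6

def permuted_null(hla_labels, rng):
    """Return a copy of the HLA column with labels shuffled across patients.
    The AA column is untouched, so any remaining HLA-AA link is due to chance."""
    shuffled = list(hla_labels)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)  # fixed seed so the sketch is repeatable
null_hla = permuted_null(hla, rng)
print(null_hla, aa)
```

Shuffling keeps the number of A*0201 carriers the same, so each null data set has the same marginal counts as the real data.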
FDR applied to synthetic data
Generate synthetic data
Apply the approach (varying the p-value cutoff)
Measure false positives
[Figure: actual FDR plotted against estimated FDR; both axes run from 0 to 0.8]
FDR applied to real data
Initial study (Science 2002; 296:1439-43)
Using ~100 patients, reported 80 significant associations, but most didn’t correlate with known epitopes
Follow up (Science 2007; 315:1583-86)
Reanalysis with phylogenetic correction revealed only 8 significant associations at FDR=0.2, 6 of which were experimentally verified
How many true associations are
missing?
Generate synthetic data (varying sample size)
Apply the approach
Determine false-negative rate as a function of sample
size
Associations missed
[Figure: fraction of associations missed (0 to 1) vs. number of patients (0 to 800), with the Perth (Science) cohort marked]
FDR applied to real data
More data (PLoS Pathogens, in press)
protein  N    #codons  #associations
PR       531  99       9
RT       532  400      31
Nef      686  206      136
80% have experimental support at FDR=0.2
x2
Other insights
- HIV has both good and bad epitopes; bad epitopes are decoys, wasting the energy of the immune system (Nature Medicine, 2006)
- HIV vaccine should contain good epitopes and avoid the bad ones
Can we find the epitopes?
- Epitope = (peptide, HLA)
- Finding the peptides:
  - Look in regions where HIV doesn’t mutate
  - Look in the vicinity of HLA-HIV associations
- Finding the HLA alleles:
  - This work
  - Important, because we need to find epitopes that broadly cover alleles from a given population
Find the epitope HLA alleles
With Christian Brander and Nicole Frahm, Mass General
[Figure: peptides (RAIEAQQHL, …) tested against patients Pt1, Pt2, Pt3, Pt4, …, PtN]
- If a patient’s blood reacts with a peptide, then it is very likely that the peptide is an epitope for at least one of the patient’s six HLA types
- From observations for many patients, tease out the responsible HLA type(s)
Example
Peptide: RAIEAQQHL
…

Patient  HLA types
Pt1      A02  A30  B53  B58  C04  C05
Pt2      A24  A24  B13  B40  C03  C07
Pt3      A30  A68  B07  B14  C07  C08
Pt4      A02  A25  B13  B58  C03  C07
PtN

B58 is responsible
Complications…
- More than one HLA can be responsible for a given peptide. How to give partial credit?
- False negatives (lack of exposure, bad chemistry)
- False positives (cross reactivity, MHC-II activity)
[Figure: HLA types A, B, and C shown for reacting patients and for non-reacting patients]
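Graphical models to the rescue
[Figure: bipartite graph with HLA nodes HLA1, HLA2, HLA3, …, HLAI as parents and reaction nodes react1, react2, react3, …, reactJ as children, combined by noisy OR]
arc from HLA to react iff (reacting peptide, HLA) is an epitope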
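The noisy-OR node named on this slide can be illustrated with a few lines of code. This is a minimal sketch of the standard noisy-OR formula, not the authors' implementation; the link probabilities and the `leak` term (one possible way to absorb false positives) are hypothetical.

```python
def noisy_or(parent_present, link_probs, leak=0.0):
    """P(react = 1) under a noisy-OR: each present parent HLA independently
    triggers the reaction with its own link probability; `leak` covers
    reactions with no modeled cause."""
    p_no_reaction = 1.0 - leak
    for present, q in zip(parent_present, link_probs):
        if present:
            p_no_reaction *= 1.0 - q
    return 1.0 - p_no_reaction

# Hypothetical numbers: two candidate HLA parents for one peptide.
link_probs = [0.8, 0.5]
print(noisy_or([True, False], link_probs))
print(noisy_or([True, True], link_probs))
print(noisy_or([False, False], link_probs, leak=0.05))
```

Because each parent acts independently, a patient carrying two responsible HLA types is more likely to react than a patient carrying one, which is one way to share credit among several HLAs. Link probabilities below 1 leave room for false negatives.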
What biologists want
- A list of the most likely epitopes (arcs in the graph) such that X% are spurious
FDR applied to arcs in a DAG model
Listgarten and Heckerman, 2007 (UAI)
Assumption: Node parameters are independent and complete data
(so we can learn the parents of each node independently)
Assumption: Node order is known
Input: Data & structure learning algorithm a
Output: Estimate of fraction of arcs that are spurious
FDR = E(# of arcs under the null) / number of arcs on real data
1. For each node, create a null distribution by permuting the data associated with the node; count the number of arcs resulting from the application of algorithm a
2. Repeat step 1 many times, yielding an expected number of arcs under the null
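The two steps above can be sketched as follows. This is an illustrative toy, not the UAI 2007 implementation: `learn_parents` is a hypothetical stand-in for structure-learning algorithm a (it keeps a parent when the two columns agree often enough), and the data, threshold, and permutation count are made up.

```python
import random

def learn_parents(child, candidates, threshold=0.8):
    """Hypothetical stand-in for algorithm a: keep a candidate parent when
    it agrees with the child in at least `threshold` of the samples."""
    n = len(child)
    return [name for name, col in candidates.items()
            if sum(x == y for x, y in zip(col, child)) / n >= threshold]

def arc_fdr(data, order, n_perm=200, seed=0):
    """Estimate the fraction of learned arcs that are spurious:
    E(# arcs under the null) / # arcs on the real data."""
    rng = random.Random(seed)

    def count_arcs(permute):
        total = 0
        for i, node in enumerate(order):
            child = list(data[node])
            if permute:
                rng.shuffle(child)  # break any link between the node and its parents
            # Known node order: only earlier nodes may be parents.
            candidates = {p: data[p] for p in order[:i]}
            total += len(learn_parents(child, candidates))
        return total

    real = count_arcs(permute=False)
    null_mean = sum(count_arcs(permute=True) for _ in range(n_perm)) / n_perm
    return (null_mean / real if real else 0.0), real

# Toy binary data: b copies a, c is unrelated to both.
data = {
    "a": [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    "b": [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    "c": [1, 1, 0, 0, 1, 0, 0, 1, 1, 0],
}
fdr, n_arcs = arc_fdr(data, order=["a", "b", "c"])
print(n_arcs, round(fdr, 3))
```

Each node's parents are learned separately, which relies on the slide's assumptions of independent node parameters, complete data, and a known node order.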
Results on synthetic data
More details at UAI
Results on real data
- 169 HIV epitopes known prior to study
- 118 additional epitopes found using FDR=0.3 (7/7 verified with expensive tests)
- Up to 6 HLAs can pair with a given epitope, many more than previously thought (good news for vaccine design)
Conclusions
Graphical models are helping the design of an HIV vaccine
- HIV is highly vulnerable to the cellular arm of the immune system
- We’re finding the epitopes to include in a vaccine
The search for an HIV vaccine has led to improvements in graphical modeling
- The utility of graphical models is greatly improved when we can estimate the number of spurious arcs
- FDR is a method for doing so
Tools and source code online (microsoft.com/science)
Acknowledgments
The algorithms:
Bette Korber
Tanmoy Bhattacharya
The data:
Zabrina Brumme
Chanson Brumme
Bruce Walker
Richard Harrigan
Christian Brander
Nicole Frahm
Corey Moore
Simon Mallal