Marth-HapAnal-2005


A coalescent computational platform to predict
strength of association for clinical samples
Genomic studies and the HapMap
March 15-18, 2005
Oxford, United Kingdom
Gabor T. Marth
Department of Biology, Boston College
Focal questions about the HapMap
1. Required marker density
2. How to quantify the strength of allelic association in a genome region
3. How to choose tagging SNPs
4. How general the answers to these questions are among different human populations
[Figure panels: CEPH European samples; Yoruban samples]
Across samples from a single population?
(random 60-chromosome subsets of 120 CEPH
chromosomes from 60 independent individuals)
Possible consequence for marker performance
Markers selected based on
the allele structure of the
HapMap reference samples…
… may not work well in another
set of samples such as those
used for a clinical study.
How to assess sample-to-sample variability?
1. Understand fundamental characteristics of a given genome region, e.g. estimate the local recombination rate from the data (McVean et al. Science 2004)
2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly
3. A desirable alternative would be to generate such additional sets by computational means
Towards a marker selection tool
1. select markers (tag
SNPs) with standard
methods
2. generate computational
samples
3. test the performance of
markers across consecutive
sets of computational
samples
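The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the platform's implementation: haplotypes are assumed to be equal-length strings of '0'/'1' alleles, marker performance is assumed to be measured as the fraction of non-tag markers with r2 at or above a threshold to some tag, and the function names and the 0.8 threshold are hypothetical choices made for the example.

```python
def r2(haps, i, j):
    """Squared allelic correlation between biallelic markers i and j."""
    n = len(haps)
    p_i = sum(h[i] == "1" for h in haps) / n
    p_j = sum(h[j] == "1" for h in haps) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:  # a monomorphic marker carries no association signal
        return 0.0
    return (p_ij - p_i * p_j) ** 2 / denom


def fraction_captured(haps, tags, threshold=0.8):
    """Fraction of non-tag markers with r2 >= threshold to at least one tag."""
    others = [m for m in range(len(haps[0])) if m not in tags]
    if not others:
        return 1.0
    hit = sum(any(r2(haps, m, t) >= threshold for t in tags) for m in others)
    return hit / len(others)


def tag_performance(sample_sets, tags, threshold=0.8):
    """Evaluate the same tag set on each (computational) sample set."""
    return [fraction_captured(s, tags, threshold) for s in sample_sets]


# Toy data: two sample sets of four 3-marker haplotypes each.
set_a = ["000", "000", "111", "111"]   # all markers perfectly associated
set_b = ["000", "010", "101", "111"]   # marker 1 decoupled from marker 0
performance = tag_performance([set_a, set_b], tags=[0])
print(performance)
```

In the toy data the single tag captures both other markers in the first set but only one of two in the second, which is the kind of set-to-set difference the tool is meant to expose.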
Generating additional computational haplotypes
1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population.
2. Enforce data-relevance by requiring that the first set reproduces the observed haplotype structure of the HapMap reference samples. Calculate the “degree of relevance” as the data likelihood (the probability that the genealogy produces the observed haplotypes).
3. Use the second haplotype set, induced by the same mutations, as our computational samples.
4. In subsequent statistics, weight each such set in proportion to the data likelihood calculated in step 2.
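The weighting in step 4 amounts to a likelihood-weighted average of whatever statistic is computed on each computational set. A minimal sketch, assuming the statistic is a single number per set (for example r2 between one pair of markers); `weighted_mean` and all numbers below are hypothetical and only illustrate the arithmetic.

```python
def weighted_mean(values, likelihoods):
    """Average a per-set statistic, weighting each set by its data likelihood."""
    if len(values) != len(likelihoods):
        raise ValueError("need one likelihood per value")
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one set must have positive likelihood")
    return sum(v * w for v, w in zip(values, likelihoods)) / total


# Hypothetical numbers: r2 between two markers in three computational sets,
# and the data likelihood of the genealogy behind each set (step 2).
r2_per_set = [0.9, 0.5, 0.7]
likelihoods = [0.2, 0.2, 0.4]
estimate = weighted_mean(r2_per_set, likelihoods)
print(estimate)
```

Only the ratios of the likelihoods matter here, so they do not need to be normalized beforehand.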
Generating computational samples
[Figure labels: M (markers), N (samples)]
Problem: The efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling), efficient generation of such genealogies is an unsolved problem.
We propose a method to generate “approximate” M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K).
Approximating M-site haplotypes as
composites of overlapping K-site haplotypes
1. generate K-site sets
2. build M-site composites
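The two steps can be sketched as follows. The slides do not say how haplotypes from neighboring K-site sets are paired, so this sketch makes explicit assumptions: neighboring windows share K-1 markers, haplotypes are strings of '0'/'1' alleles, and each haplotype is extended with a randomly chosen, not yet used haplotype from the next set that agrees at the shared markers. All function names are hypothetical.

```python
import random
from collections import defaultdict


def k_site_windows(num_markers, k):
    """(start, end) indices of consecutive windows that overlap by k-1 markers."""
    if not 2 <= k <= num_markers:
        raise ValueError("need 2 <= k <= num_markers")
    return [(s, s + k) for s in range(num_markers - k + 1)]


def join_neighbors(left, right, overlap, rng):
    """Extend each haplotype in `left` with an unused haplotype from `right`
    that carries the same alleles at the `overlap` shared markers."""
    pool = defaultdict(list)
    for hap in right:
        pool[hap[:overlap]].append(hap)
    for haps in pool.values():
        rng.shuffle(haps)            # random pairing within each overlap class
    joined = []
    for hap in left:
        key = hap[-overlap:]
        if not pool[key]:
            raise ValueError(f"no unused haplotype matches overlap {key!r}")
        joined.append(hap + pool[key].pop()[overlap:])
    return joined


def build_composite(k_site_sets, k, rng):
    """Chain neighboring K-site sets into one set of M-site haplotypes."""
    composite = list(k_site_sets[0])
    for nxt in k_site_sets[1:]:
        composite = join_neighbors(composite, nxt, k - 1, rng)
    return composite


# Toy example: M = 4 markers, K = 3, so two windows of three markers each.
print(k_site_windows(4, 3))
sets = [["000", "000", "011", "111"],    # markers 0-2
        ["000", "001", "110", "111"]]    # markers 1-3
print(build_composite(sets, k=3, rng=random.Random(0)))
```

The join only succeeds when neighboring sets carry the same allele counts at their shared markers; the sketch raises an error otherwise instead of guessing a pairing.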
Piecing together neighboring K-site sets
[Figure: haplotype-count histograms (y-axis 0 to 20) over the 3-site haplotypes "000" to "111" for neighboring K-site sets, with haplotypes linked at the overlapping markers]
The hope is that the constraint at overlapping markers preserves long-range marker association.
Building composite haplotypes
[Figure sequence: haplotype-count histograms (y-axis 0 to 20) over the 3-site haplotypes "000" to "111", showing composite haplotypes being built from neighboring K-site sets]
Initial results: 3-site composite haplotypes
30 CEPH HapMap reference
individuals (60 chr)
a typical 3-site composite
3-site composite vs. data
[Figure: scatter plot of r2 (3-site composite) against r2 (data), both axes from 0 to 1]
3-site composites: the “best case”
[Figure: scatter plot of r2 ("exact" 3-site composite) against r2 (data), both axes from 0 to 1; labels: “short-range”, “long-range”]
The “best-case” 3-site scenario: a composite of exact 3-site sub-haplotypes
Variability across sets
The purpose of the composite haplotype sets …
… is to model sample variance across consecutive data sets.
But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-site composites are used.
4-site composite haplotypes
[Figure: scatter plot of r2 (4-site composite #2) against r2 (data), both axes from 0 to 1; panel label: 4-site composite]
“Best-case” 4-site composites
[Figure: scatter plot of r2 ("exact" 4-site composite) against r2 (data), both axes from 0 to 1]
Composite of exact 4-site sub-haplotypes
Variability across 4-site composites
[Figure: two scatter plots with axes from 0 to 1; axis labels: r2 (4-site composite #1), r2 (4-site composite #5), r2 (data #1), r2 (data #2)]
… is comparable to the variability across data sets.
Technical/algorithmic improvements
1. un-phased genotypes
2. markers with unknown ancestral state
3. dealing with uninformative markers
4. taking into account local recombination rate
[Figures: un-phased genotypes (AC)(CG)(AT)(CT) with one candidate phasing A G A C / C C T T marked "?"; alleles A and C; binary haplotype strings such as 01101000010101110]
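Item 1 matters because un-phased genotypes are compatible with many haplotype pairs. The sketch below enumerates them for the four genotypes shown on the slide. It only illustrates the size of the ambiguity; `phasings` is a hypothetical helper, and nothing here is claimed to be how the platform would handle un-phased data.

```python
from itertools import product


def phasings(genotypes):
    """Enumerate the distinct haplotype pairs consistent with un-phased genotypes.

    Each genotype is a 2-character string of alleles, e.g. "AC" or "CC".
    """
    pairs = set()
    for choice in product(*[(g, g[::-1]) for g in genotypes]):
        hap1 = "".join(c[0] for c in choice)
        hap2 = "".join(c[1] for c in choice)
        pairs.add(tuple(sorted((hap1, hap2))))   # the pair is unordered
    return sorted(pairs)


pairs = phasings(["AC", "CG", "AT", "CT"])
print(len(pairs))          # 8 = 2 ** (4 - 1) for four heterozygous sites
for hap1, hap2 in pairs:
    print(hap1, hap2)
```

Each additional heterozygous site doubles the number of candidate pairs, which is why phase has to be treated explicitly rather than enumerated for longer regions.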
Software engineering aspects: efficiency
Currently, we run fresh Coalescent simulations at each K-site (several hours per region). This discards most Coalescent genealogies as irrelevant.
The total number of genotyped SNPs is ~1 million, so there are ~1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these. Haplotype sets resulting from matches can be loaded into, stored in, and retrieved from a database efficiently.
4 HapMap populations x 1 million K-sites x 1,000 computational sets x 50 bytes = 2 x 10^11 bytes, i.e. about 200 Gigabytes
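The storage estimate is a straight product and can be checked directly. The variable names are hypothetical; the four factors are the ones given on the slide.

```python
populations = 4
k_sites = 1_000_000          # one K-site per genotyped SNP
sets_per_k_site = 1_000      # computational sets kept per K-site
bytes_per_set = 50

total_bytes = populations * k_sites * sets_per_k_site * bytes_per_set
print(total_bytes / 1e9, "GB")      # decimal gigabytes
print(total_bytes / 2**30, "GiB")   # binary gibibytes
```

The product is exactly 2 x 10^11 bytes, which is 200 decimal gigabytes and somewhat less than 200 when counted in binary gibibytes.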
Acknowledgements
Eric Tsung
Aaron Quinlan
Ike Unsal
Eva Czabarka (Dept. Mathematics, William & Mary)
Testing markers with composite sets
[Figure: two scatter plots, r2 (4-site composite #1) against r2 (data) and r2 (4-site composite #2) against r2 (data), all axes from 0 to 1]
Using the HapMap
1. genotype a set of
reference samples
2. compute strength of association
3. select a smaller set of markers that capture most of the
information present in the complete set of markers
4. use these markers in clinical studies
Allele structure varies among populations
CEPH European samples
Yoruban samples
Data probability for composite haplotypes
Pr(composite) = Pr(K-site1) · Pr(K-site1 ~ K-site2) · Pr(K-site2) · Pr(K-site2 ~ K-site3) · Pr(K-site3)
(motivation from composite likelihood methods for recombination rate
estimation e.g. by Hudson, Clark, Wall)
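For more than a handful of K-sites the product above underflows quickly, so it is natural to evaluate it in log space. A minimal sketch, assuming the per-set probabilities Pr(K-site i) and the joining terms Pr(K-site i ~ K-site i+1) are already available as numbers; how the joining terms are computed is not specified on the slide, and `composite_log_prob` and the values below are hypothetical.

```python
import math


def composite_log_prob(site_probs, join_probs):
    """log Pr(composite) for n K-site sets and the n-1 joins between neighbors."""
    if len(join_probs) != len(site_probs) - 1:
        raise ValueError("need exactly one join term per neighboring pair")
    return (sum(math.log(p) for p in site_probs)
            + sum(math.log(p) for p in join_probs))


# Hypothetical probabilities for three K-site sets and their two joins.
log_p = composite_log_prob([0.01, 0.02, 0.005], [0.5, 0.25])
print(log_p, math.exp(log_p))
```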
Generating K-site haplotypes
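K = 3, 4
[Figure: haplotype-count histograms (y-axis 0 to 20) over the 3-site haplotypes "000" to "111", comparing the reference data with sets from matching Coalescent genealogies]
1 match per 100 – 10,000 Coalescent genealogies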
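The matching step is a form of rejection sampling: simulate a pair of sets, keep the second set only if the first reproduces the reference haplotype counts. A minimal sketch under stated assumptions: "match" is taken to mean identical haplotype counts, and `simulate` stands for any callable that returns a (first set, second set) pair from one genealogy. Both `matching_sets` and the stub generator below are hypothetical; the stub is not a Coalescent simulator and only makes the example deterministic.

```python
from collections import Counter


def matching_sets(simulate, reference, max_tries):
    """Collect the second set of every simulated pair whose first set has the
    same haplotype counts as the reference data."""
    target = Counter(reference)
    accepted = []
    for _ in range(max_tries):
        first, second = simulate()
        if Counter(first) == target:     # order of haplotypes is irrelevant
            accepted.append(second)
    return accepted


# Stub in place of a real simulator: three pre-made (first, second) pairs.
pairs = iter([
    (["000", "111"], ["000", "011"]),    # first set matches the reference
    (["111", "000"], ["111", "110"]),    # matches, different order
    (["000", "000"], ["001", "000"]),    # does not match
])
accepted = matching_sets(lambda: next(pairs), reference=["000", "111"], max_tries=3)
print(accepted)
```

With an acceptance rate of one in 100 to 10,000 genealogies, most of the run time goes into simulations that are thrown away, which is the inefficiency the K-site approach tries to contain.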
Example: CFTR gene
Hinds et al. Science, 2005
4-site composite haplotypes
HapMap data
4-site composite #1
4-site composite #2
4-site composites vs. data
[Figure: two scatter plots, r2 (4-site composite) against r2 (data) and r2 (4-site composite #2) against r2 (data), all axes from 0 to 1]
Why should this work?
Tease apart two questions: (1) to what degree K-site composites preserve long-range correlations between markers (really, the quality of the approximation) and (2) the variability across different sets (what we are interested in).
Example: 4-site approximation
[Figure panels: 4-site composite #1, #2, #3, #4]