The signatures of natural selection in the HapMap data

Signals of natural selection in the
HapMap project data
The International HapMap Consortium
Gil McVean
Department of Statistics, Oxford University
The International HapMap Project
• To facilitate the design and analysis of association studies
• A genome-wide map of genetic variation across 270 individuals from four populations
– CEPH families from Utah
– Yoruba from Nigeria
– Han Chinese from Beijing
– Japanese from the Tokyo region
• Phase I collected data on approximately 1.2 million SNPs
• Phase II increases SNP density to more than one per kb
• All data publicly available at www.hapmap.org
Looking for selection
• A genome-wide map of variation can also be used to hunt for regions of the genome where natural selection has acted
– Selective sweeps
– Balancing selection
– Local adaptation
• Why?
– Interest
– Functional polymorphism
– The signal of selection we observe tells us about the genetic architecture of traits
Methods for mapping selection
• Model-based
– Compare genetic variation to a ‘neutral’ model
• Purely empirical
– Consider the ‘most extreme’ genomic regions
• ‘Calibrated’
– Compare to the (very few) examples of proven selective importance
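The purely empirical approach can be sketched as ranking one summary statistic per genomic window and flagging the extreme tail. This is a minimal illustration only: the function names, the window scores and the 1% cutoff are assumptions made for the example, not the consortium's pipeline.

```python
def empirical_tail(scores, fraction=0.01):
    """Return the indices of the highest-scoring windows.

    scores   -- one summary statistic per genomic window (hypothetical input)
    fraction -- share of windows to flag, e.g. 0.01 for the top 1%
    """
    if not scores:
        return []
    # Always flag at least one window, even for very short inputs.
    n_flag = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_flag])


def empirical_rank(scores, value):
    """Fraction of windows whose score is at least as large as `value`."""
    return sum(1 for s in scores if s >= value) / len(scores)
```

The 'calibrated' approach then amounts to asking where a known selected locus falls, for example calling `empirical_rank` with the score of the window containing LCT.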
In what way are selected regions unusual?
(in the HapMap data)
HLA
17q21 inversion
Lactase
Duffy
HLA and resistance to infectious disease
HLA
The HLA region shows extremely high levels of polymorphism
17q21 inversion and reproductive success
The inversion has multiple (66) SNPs in perfect association (r² = 1)
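To make 'perfect association' concrete, r² between two biallelic SNPs can be computed from phased haplotypes, and perfect proxies are the SNPs with r² = 1 to a focal SNP. A minimal sketch, assuming haplotypes coded 0/1; the function names and the tolerance are illustrative choices.

```python
def r_squared(snp_a, snp_b):
    """r^2 between two biallelic SNPs, each a list of 0/1 alleles
    given in the same chromosome order."""
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("SNPs must be typed on the same, non-empty sample")
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # r^2 is undefined if either SNP is monomorphic
    d = p_ab - p_a * p_b     # the linkage disequilibrium coefficient D
    return d * d / denom


def count_perfect_proxies(snps, focal, tol=1e-9):
    """Number of other SNPs in perfect association (r^2 = 1) with snps[focal]."""
    return sum(
        1
        for j, other in enumerate(snps)
        if j != focal and r_squared(snps[focal], other) >= 1 - tol
    )
```

Note that r² = 1 holds both when two SNPs carry identical alleles and when the labels are exactly swapped, since the statistic does not depend on which allele is coded as 1.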
LCT and lactase persistence
The LCT gene shows an extended haplotype structure in European populations
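Extended haplotype structure is commonly summarised by extended haplotype homozygosity (EHH), taken here as the probability that two randomly chosen chromosomes carrying the core allele are identical from the core out to a given marker. A minimal sketch under that definition; the data layout (one list of alleles per chromosome) and the function name are assumptions.

```python
from collections import Counter


def ehh(haplotypes, core_index, core_allele, end_index):
    """EHH from the core SNP out to end_index (inclusive).

    haplotypes -- list of phased chromosomes, each a sequence of alleles
    """
    lo, hi = sorted((core_index, end_index))
    # Restrict to chromosomes that carry the core allele.
    carriers = [
        tuple(h[lo:hi + 1]) for h in haplotypes if h[core_index] == core_allele
    ]
    n = len(carriers)
    if n < 2:
        return float("nan")  # need at least one pair of chromosomes
    # Count pairs of carriers that are identical over the whole interval.
    identical_pairs = sum(c * (c - 1) // 2 for c in Counter(carriers).values())
    return identical_pairs / (n * (n - 1) // 2)
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break up the haplotype; a recently favoured allele is expected to show a slower decay than other alleles of similar frequency.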
The Duffy locus and resistance to Plasmodium vivax
The FY gene shows extreme population differentiation
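Population differentiation at a SNP is often measured with F_ST. The slides do not say which estimator was used, so the sketch below uses the simplest textbook form, F_ST = (H_T − H_S) / H_T, with equal population weights and no sample-size correction; the function name and inputs are illustrative.

```python
def fst(freqs):
    """Wright's F_ST at one biallelic SNP.

    freqs -- allele frequency of the same allele in each population
    """
    k = len(freqs)
    p_bar = sum(freqs) / k
    # Expected heterozygosity in the pooled population.
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return float("nan")  # SNP is monomorphic across all populations
    # Mean expected heterozygosity within populations.
    h_within = sum(2 * p * (1 - p) for p in freqs) / k
    return (h_total - h_within) / h_total
```

An allele fixed in one population and absent from another gives `fst([1.0, 0.0]) == 1.0`, the extreme of the distribution; identical frequencies everywhere give 0.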
Different selective histories leave different
footprints in genetic variation
How much of the genome looks as ‘unusual’
as these selected loci?
Heterozygosity as extreme as HLA
HLA
Sets of perfect proxies as extreme as the 17q21 inversion
Inversion
EHH as extreme as LCT
Lactase
Differentiation as extreme as the Duffy locus (NB not FY*O)
Duffy
For three of the four cases, the selected locus is at the
very extreme of the genome-wide
distribution
What can we learn from the unusual, but
less extreme cases?
Heterozygosity across the genome
[Figure legend: top 1%, top 5%, top 10%, bottom 10%, bottom 5%, bottom 1%]
Elevated heterozygosity on 8p
[Figure panels: chromosome 6 (MHC); chromosome 8 (8p23 inversion)]
Distribution of long runs of perfect proxies
[Figure legend: ≥ 50 SNPs, 20–50 SNPs, 10–20 SNPs]
17q21 Inversion
An inversion on the X chromosome?
Distribution of EHH
[Figure legend: top 0.1%, top 1%, top 10%]
A selective sweep on chromosome 5?
Distribution of differentiation
[Figure legend: top 0.1%, top 1%, top 10%]
SLC24A5
Lamason et al (Science 2005)
Unusual regions of the genome suggest
interesting biology
BUT
The hypothesis of historical selection is
fundamentally untestable
What hypothesis can we test?
Signals of selection should tend to occur
near regions of known functional
importance
i.e. genes
Are genes over-represented in regions of high heterozygosity?
Are genes over-represented in regions of high proxy number?
Are genes over-represented in regions of high EHH?
Are genes over-represented in regions of high differentiation?
Only differentiation shows a tendency for an
increased density of ‘selection’ near genes
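One way to ask whether genes are over-represented among extreme regions is to compare the share of extreme windows that overlap genes with the share among all windows. The slides do not state which test was used; the sketch below assumes a fold-enrichment with a one-sided hypergeometric p-value, and all names and counts are hypothetical.

```python
from math import comb


def gene_enrichment(n_windows, n_genic, n_extreme, n_extreme_genic):
    """Fold enrichment of genic windows among extreme windows, with a
    one-sided hypergeometric p-value for seeing at least this many.

    n_windows       -- total number of genomic windows
    n_genic         -- windows overlapping a gene
    n_extreme       -- windows in the extreme tail of the statistic
    n_extreme_genic -- extreme windows that overlap a gene
    """
    expected = n_extreme * n_genic / n_windows
    fold = n_extreme_genic / expected
    total = comb(n_windows, n_extreme)
    upper = min(n_genic, n_extreme)
    p_value = sum(
        comb(n_genic, k) * comb(n_windows - n_genic, n_extreme - k)
        for k in range(n_extreme_genic, upper + 1)
    ) / total
    return fold, p_value
```

The p-value treats windows as independent, which neighbouring windows in linkage disequilibrium are not, so it should be read as a rough guide at best.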
The wild speculation
Selection on standing variation
• Why should we see an excess of one type of signal of adaptive evolution near genes, but not another?
• Perhaps the signals are sensitive to assumptions about how selection occurs?
• EHH methods will be most powerful for identifying selection on a single, novel mutation
• Differentiation will pick out cases where an already polymorphic mutation, present on multiple haplotype backgrounds, becomes favoured in one geographic region
• Perhaps most selection has been on standing variation?
Acknowledgements
• The International HapMap Consortium
• Oxford Statistics
– Peter Donnelly, Simon Myers, Chris Spencer, Raphaelle Chaix
• Funding agencies
– NIH, TSC, The Wellcome Trust, BBSRC, the Fyssen Foundation
Distribution of Fay and Wu’s H statistic
[Figure legend: bottom 0.1%, bottom 1%, bottom 10%]
Distribution of Tajima’s D statistic
[Figure legend: top 1%, top 5%, top 10%, bottom 10%, bottom 5%, bottom 1%]
Tajima D (negative)
Fay and Wu H (negative)
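Both statistics can be computed from the derived-allele counts in a window, which is why the ancestral state of each SNP matters. A minimal sketch using the standard formulas for Tajima's D and the unnormalised Fay and Wu's H (θπ − θH); the function name and input layout are assumptions, and no correction for SNP ascertainment is attempted.

```python
from math import sqrt


def neutrality_stats(derived_counts, n):
    """Return (Tajima's D, Fay and Wu's H) for one genomic window.

    derived_counts -- derived-allele count at each SNP in the window
    n              -- number of sampled chromosomes
    """
    # Keep only sites that are segregating in this sample.
    sites = [i for i in derived_counts if 0 < i < n]
    s = len(sites)
    if s == 0:
        return float("nan"), float("nan")

    pairs = n * (n - 1) / 2
    # theta_pi: mean number of pairwise differences.
    theta_pi = sum(i * (n - i) for i in sites) / pairs
    # theta_H: gives most weight to high-frequency derived alleles.
    theta_h = sum(i * i for i in sites) / pairs
    fay_wu_h = theta_pi - theta_h

    # Constants from Tajima's variance formula.
    a1 = sum(1 / k for k in range(1, n))
    a2 = sum(1 / k**2 for k in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return float("nan"), fay_wu_h
    tajima_d = (theta_pi - s / a1) / sqrt(variance)
    return tajima_d, fay_wu_h
```

An excess of rare variants pushes D negative, and an excess of high-frequency derived alleles pushes H negative; both patterns are expected after a selective sweep, which is why the negative tails are the ones of interest here.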
Numbers of SNPs

Chromosome  #SNPs in      #SNPs QC’ed, polymorphic,     Proportion  Chromosome   Approx. SNP
            common files  and with ancestral inferred   converted   length (bp)  spacing (kb)
1           75850         64107                         0.8451813   246043912    3.84
2           82565         74829                         0.9063041   243407499    3.25
3           59417         52523                         0.8839726   199282781    3.79
4           53219         47878                         0.8996411   191710711    4.00
5           53324         48504                         0.9096092   180825316    3.73
6           61829         55344                         0.8951139   170902878    3.09
7           42588         35240                         0.8274631   158542415    4.50
8           65506         60306                         0.920618    146305419    2.43
9           51906         47285                         0.9109737   136326725    2.88
10          46073         41185                         0.8939075   135035657    3.28
11          41299         36687                         0.8883266   134481573    3.67
12          38433         34895                         0.9079437   132017602    3.78
13          33757         30779                         0.9117813   113025098    3.67
14          27143         24487                         0.9021479   105260053    4.30
15          24615         22124                         0.8988015   100133324    4.53
16          23400         20779                         0.8879915   89915381     4.33
17          23235         20576                         0.8855606   81724082     3.97
18          35931         33137                         0.9222398   76114138     2.30
19          16505         14246                         0.8631324   63788762     4.48
20          19275         15700                         0.8145266   63686957     4.06
21          17933         16281                         0.9078793   46956357     2.88
22          17244         15196                         0.8812341   49375569     3.25
X PAR 1     408           5                             0.0122549   2689596      537.92
X non PAR   53594         41682                         0.7777363   150671647    3.61
X PAR 2     45            0                             0           328507       NA
Totals      965094        853775                        0.8846548   3018551959   3.54
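The last columns of the SNP table appear to be derived from the others: the converted fraction is the number of QC’ed SNPs divided by the number in the common files, and the spacing is the chromosome length divided by the number of QC’ed SNPs, expressed in kb. A small sketch that reproduces the chromosome 1 and totals rows under that reading; the function names are illustrative.

```python
def converted_fraction(n_common, n_qc):
    """Share of SNPs in the common files that pass QC, are polymorphic
    and have an inferred ancestral allele."""
    return n_qc / n_common


def snp_spacing_kb(length_bp, n_qc):
    """Average spacing between retained SNPs, in kilobases."""
    if n_qc == 0:
        return None  # shown as NA in the table (X PAR 2)
    return length_bp / n_qc / 1000


# Chromosome 1 row of the table.
chr1_fraction = converted_fraction(75850, 64107)
chr1_spacing = snp_spacing_kb(246043912, 64107)

# Totals row of the table.
total_fraction = converted_fraction(965094, 853775)
total_spacing = snp_spacing_kb(3018551959, 853775)
```

Rounded to the table's precision, these give 0.845 and 3.84 kb for chromosome 1, and 0.885 and 3.54 kb genome-wide.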