Targets of recent positive
selection in Indian
populations
Irene Gallego Romero
Leverhulme Centre for Human
Evolutionary Studies
Department of Biological Anthropology
The Indian subcontinent
• Probably inhabited by H. sapiens since ~50,000 YBP
(coastal route out of Africa; mtDNA and Y-chromosome data)
• Drastic population expansion ~35,000 YBP
• Decidedly not a single panmictic population,
highly stratified and fragmented
– linguistics, geography, sociocultural practices.
• Very high incidence of T2D and obesity
(predicted highest worldwide by 2030)
• Underrepresented in genomic diversity panels
All of which means…
• There has been ample time for ‘recent’
evolutionary adaptations to arise
• These adaptations have generally gone
unexamined
– Most work on India to date has focused on
population history, and has been carried
out on mtDNA and the Y chromosome
Allelic trajectories under selection
Bamshad & Wooding, Nat Rev Gen, 2003
Selective sweeps and haplotypes
Nielsen et al, Nat Rev Gen, 2007
Selective sweeps and haplotypes
All we are looking for are haplotypes that are
unusually long given their frequency in the
sample.
Bamshad & Wooding, Nat Rev Gen, 2003
Quantifying selective sweeps
• EHH: probability that two randomly chosen chromosomes
carrying the same allele at a chosen ‘core’ SNP are identical
across the whole interval from the core out to a given distance
• Other related metrics:
– iHS: compares the integral under the EHH curve for the
ancestral and derived alleles, so it depends on knowing
the ancestral state
– XP-EHH: cross-population EHH, compares
population pairs, detects selection acting in
one population but not the other
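As a rough illustration of the EHH definition above, here is a minimal Python sketch. It is not the C++ code used for the analysis; the function name `ehh`, the string encoding of haplotypes, and the toy data are all assumptions made for the example. It counts the fraction of pairs of core-allele carriers that are identical over the interval from the core SNP to a given marker.

```python
from collections import Counter

def ehh(haplotypes, core, core_allele, upto):
    """EHH between marker index `core` and marker index `upto` (inclusive),
    among chromosomes carrying `core_allele` at the core SNP.
    Hypothetical helper: haplotypes are equal-length sequences of alleles."""
    carriers = [h for h in haplotypes if h[core] == core_allele]
    n = len(carriers)
    if n < 2:
        return float("nan")  # no pairs to compare
    lo, hi = sorted((core, upto))
    # Group carriers by their exact allele sequence over the interval.
    counts = Counter(tuple(h[lo:hi + 1]) for h in carriers)
    identical_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return identical_pairs / (n * (n - 1) // 2)

# Toy phased data (made up): 5 chromosomes, 4 SNPs, alleles coded '0'/'1'.
haps = ["0110", "0111", "0110", "1010", "0100"]

# EHH for the '1' allele at the second SNP, moving rightwards.
decay = [ehh(haps, 1, "1", j) for j in (1, 2, 3)]
```

EHH starts at 1 at the core and decays with distance; a haplotype that stays near 1 for an unusually long stretch given its frequency is the signal of interest.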
Sample composition
• 156 Indian samples
– 31 populations
• 836 further samples
– HGDP-CEPH and our own data
– Old World, Oceania
– Split into 8 geographic
groups/40 populations
• Illumina 650K, 610K
chips (~550,000
autosomal SNPs)
India in a global context: FST
Computational challenges
• Phasing:
– Inferring haplotypes from genotypes
• Calculating test statistics:
– iHS and XP-EHH
• Data post-processing:
– ~550,000 data points per population per
statistic
– SNPs to genes/genomic regions
Phasing
• Likelihood-based methods
• 550,000 SNPs per individual, ~1,000
individuals
• Phasing chromosome 2 (densest, ~50,000
SNPs) can take over a week
• Computationally intensive, and requires a
lot of disk space for storing iterations, so
cannot use CamGrid
– use elephant.bio.cam.ac.uk, simultaneously run
multiple chromosomes
– < 2 weeks to phase all autosomal chromosomes
Computing XP-EHH and iHS
• Compute a value for each statistic for each
SNP for each population or population
pair (~10 per test)
– >5,000,000 data points for each statistic
• Not computationally intensive, small files
– easily run on CamGrid (each chromosome
separately)
– 4-5 hours to analyse a single population
• C++ code
Data processing
• Data sets this big suffer from high false
discovery rates
• Multiple testing corrections can be too
stringent
• Need to reduce the number of data points
– windowing approach:
• Break the genome into non-overlapping,
contiguous 200 kb windows, test significance at
that level
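The windowing step can be sketched as follows. This is a minimal Python illustration, not the hand-written R code used for the analysis. The slides do not say how a window is scored, so the summary used here (the fraction of SNPs in each window whose absolute score exceeds a cutoff of 2.0) is an assumption, as are the function name and the toy data.

```python
from collections import defaultdict

WINDOW_BP = 200_000  # non-overlapping, contiguous 200 kb windows

def window_fractions(snps, threshold=2.0, window_bp=WINDOW_BP):
    """Summarise per-SNP scores by window.
    snps: iterable of (chromosome, position_bp, score).
    Returns {(chromosome, window_index): (n_snps, fraction_extreme)}.
    The threshold and the fraction summary are assumptions for illustration."""
    totals = defaultdict(int)
    extreme = defaultdict(int)
    for chrom, pos, score in snps:
        key = (chrom, pos // window_bp)  # integer division picks the window
        totals[key] += 1
        if abs(score) > threshold:
            extreme[key] += 1
    return {key: (n, extreme[key] / n) for key, n in totals.items()}

# Toy input (made up): chromosome, position in bp, iHS-like score.
toy_snps = [
    ("2", 10_000, 0.3),
    ("2", 150_000, 2.5),
    ("2", 250_000, -2.2),
    ("2", 390_000, -3.0),
    ("3", 5_000, 1.0),
]
windows = window_fractions(toy_snps)
```

Windows could then be ranked on the summary value, ideally within bins of similar SNP count, since sparsely covered windows give noisier fractions.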
Windowing
• Done using R
– Hand-written code, no extra packages
– Requires large amounts of RAM (> 10 GB), so not
suitable for CamGrid
– Again, use elephant
– Roughly 2 hours per population
• From 550,000 SNPs to 13,274 windows
– Spanning ~20,000 genes
– How to tease out biological meaning?
Separate signals in North and
South India
From SNPs to genes and beyond
• Selection acts on phenotypes, not genes
• Mining of ontologies and other databases
– Gene Ontology terms, Mammalian Phenotype terms,
other annotations
– (not actually done by high throughput methods, but I
know better by now)
– Although it still requires a lot of manual curation
• Map biological function to windows, test for
overrepresentation of categories relative to
expectations
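One common way to test for overrepresentation of a category is a hypergeometric upper-tail test. The slides do not say which test was actually used, so the following Python sketch is an assumption, and the function name and the numbers in the example are made up.

```python
from math import comb

def hypergeom_upper_tail(total, annotated, drawn, hits):
    """P(X >= hits) when `drawn` windows are sampled without replacement
    from `total` windows, of which `annotated` carry the category.
    Hypothetical helper; exact integer arithmetic via math.comb."""
    max_hits = min(annotated, drawn)
    numerator = sum(
        comb(annotated, k) * comb(total - annotated, drawn - k)
        for k in range(hits, max_hits + 1)
    )
    return numerator / comb(total, drawn)

# Made-up example: 10 windows in total, 4 annotated with a term,
# 3 windows in the candidate set, 2 of them annotated.
p_value = hypergeom_upper_tail(total=10, annotated=4, drawn=3, hits=2)
```

With many categories tested at once, the resulting p-values would still need a multiple-testing correction.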
A lot of hours later…
Acknowledgements
• Toomas Kivisild, Katie Siddle (LCHES)
• Jenny Barna
• Mait Metspalu, Georgi Hudjashov,
Gyaneshwer Chaubey (University of
Tartu)
• Joe Pickrell (University of Chicago)
• Richard Lempicki (NIH)
Other genome-wide statistics
• Genome-wide FST and HS are both computed
with simple R scripts
– Hand-written code
– ~5 minutes per population
– The slowest bit is reading the data in
– Use elephant.bio.cam.ac.uk
• AAF spectrum slopes are a bit more involved
– To correct for sample size effects, resample every
locus 1,000 times from its own allelic distribution
– ~1 hour per population; requires a lot of RAM; done in R
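The resampling idea can be sketched in Python as below. This is only one plausible reading of the slide, not the R code used: it assumes each locus is redrawn at a common target sample size from a binomial with the locus's own observed ancestral allele frequency. The function name, the target size, and the example counts are hypothetical.

```python
import random

def resample_locus(anc_count, n_chrom, target_n, reps=1000, rng=None):
    """Resample one locus `reps` times from its own allelic distribution.
    anc_count / n_chrom is the observed ancestral allele frequency;
    each replicate draws `target_n` chromosomes at that frequency.
    Returns the list of resampled ancestral allele frequencies.
    Assumption: binomial resampling to a common sample size."""
    rng = rng or random.Random()
    p = anc_count / n_chrom
    freqs = []
    for _ in range(reps):
        drawn = sum(1 for _ in range(target_n) if rng.random() < p)
        freqs.append(drawn / target_n)
    return freqs

# Made-up locus: ancestral allele seen on 18 of 60 chromosomes,
# resampled down to 20 chromosomes, with a fixed seed for repeatability.
replicates = resample_locus(18, 60, target_n=20, rng=random.Random(1))
mean_freq = sum(replicates) / len(replicates)
```

Doing this for every locus in every population puts all populations on the same sample-size footing before the frequency spectrum is binned and a slope is fitted.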
Ancestral allele frequency slopes