Edinburgh 2006 - Wellcome Trust Centre for Human Genetics
Genome-wide genetic association
of complex traits in outbred mice
William Valdar, Leah C. Solberg, Dominique Gauguier, Stephanie
Burnett, Paul Klenerman, William O. Cookson, Martin Taylor, J.
Nicholas P. Rawlins, Richard Mott, Jonathan Flint.
Genetic Traits
• Quantitative (height, weight)
• Dichotomous (affected/unaffected)
• Factorial (blood group)
• Mendelian - controlled by single gene
(cystic fibrosis)
• Complex – controlled by multiple
genes*environment (diabetes, asthma)
Quantitative Trait Loci
[Figure: chromosome and genes]
QTL: Quantitative Trait Locus
QTG: Quantitative Trait Gene
QTN: Quantitative Trait Nucleotide
Association Studies: Map in Humans or Animal Models?
Humans
• Disease studied directly
• Population and environment stratification
• Very many SNPs (1,000,000?) required
• Hard to detect trait loci – very large sample sizes (5,000-10,000) required to detect loci of small effect
• Potentially very high mapping resolution – single gene
• Very expensive
Animal Models
• Animal model required
• Population and environment controlled
• Fewer SNPs required (~100-10,000)
• Easy to detect QTL with ~500 animals
• Poorer mapping resolution – 1Mb (10 genes)
• Relatively inexpensive
Mosaic Crosses
[Diagram labels: Inbred founders; G3, mixing; GN, chopping up; F20, inbreeding; F2, diallele; Heterogeneous Stock, Advanced Intercross, Random Outbreds; Recombinant Inbred Lines]
Sizes of Behavioural QTL in rodents (% of total phenotypic variance)
[Histogram: number (y-axis, 0-30) against effect size in % var (x-axis, 1-59)]
Effect size of cloned genes
[Histogram: number (y-axis, 0-4) against effect size in % var (x-axis, 1-59)]
Mapping Resolution
• F2 crosses
– Powerful at detecting QTL
– Poor at Localisation – 20cM
– Too few recombinants
• Increase number of recombinants:
– more animals
– more generations in cross
Heterogeneous Stocks
• cross 8 inbred strains for >10 generations
0.25 cM
Multiple Phenotype QTL
Experiment
Multiple Phenotypes measured on
a Heterogeneous Stock
• 2000 HS mice (Northport, Bob Hitzeman)
84 families
40th generation
• 150 traits measured on each animal
– Standardised phenotyping protocol
– Covariates Recorded
• Experimenter
• Time/Date
• Litter
– Microchipping
Phenotypes
• Anxiety (Conditioned and Unconditioned Tests)
• Asthma (Plethysmography)
• Diabetes (Glucose Tolerance Test)
• Haematology
• Immunology
• Biochemistry
• Wound Healing (Ear Punch)
• Gene Expression
• ….others….
High throughput phenotyping facility
Neophobia
Fear Potentiated Startle
Ovalbumin sensitization
Plethysmograph
Intraperitoneal Glucose Tolerance Test
Ears
Genotyping
• 15360 SNPs genotyped by Illumina
– 2000 HS mice
– 300 HS parents
– 8 inbred HS founders
– 500 other inbreds
• www.well.ox.ac.uk/mouse/snp.selector
• 13459 SNPs successful
• 99.8% accuracy (parent-offspring)
Distribution of Marker Spacing
[Histogram: number of markers (y-axis, 0-1200) against distance in Mb (x-axis, 0-5)]
Mean interval (kb): 204
SD: 231
Max interval: 11328 (chromosome X)
Min interval: 0 (9 markers)
LD Decay with distance
[Plot: R squared (y-axis, 0-0.9) against distance in Mb (x-axis, 0-25), one series per chromosome, Chr 1 to Chr 19 and Chr X]
99.2% of marker pairs on different autosomes have R² < 0.05.
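The R squared plotted here is the squared correlation between the alleles at two markers. A minimal sketch of that calculation follows; it assumes phased haplotypes coded 0/1, and the function name and toy data are illustrative rather than taken from the study.

```python
def r_squared(hap_a, hap_b):
    """Squared correlation (LD r^2) between two biallelic markers.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per haplotype.
    Assumes phase is known, which is a simplification.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                                  # allele frequency, marker A
    p_b = sum(hap_b) / n                                  # allele frequency, marker B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n   # haplotype frequency
    d = p_ab - p_a * p_b                                  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                                        # a monomorphic marker carries no LD
    return d * d / denom

# Toy haplotypes (hypothetical data)
marker_1 = [0, 0, 1, 1, 0, 1, 0, 1]
marker_2 = [0, 0, 1, 1, 0, 1, 0, 1]   # identical to marker_1: complete LD
marker_3 = [0, 1, 0, 1, 0, 1, 0, 1]   # partially correlated with marker_1

print(r_squared(marker_1, marker_2))
print(r_squared(marker_1, marker_3))
```

Averaging this quantity over marker pairs binned by physical distance would give a decay curve of the kind shown on the slide.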
Genetic Drift in HS
• 40 generations of breeding
• Allele frequency in founders will drift
• 8% of genome fixed
Allele Frequency in Founders | Allele Frequency in HS
12.5 | 14.99
25 | 23.23
37.5 | 29.77
50 | 31.45
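To illustrate how drift moves allele frequencies away from their founder values and fixes some loci, here is a rough sketch using the textbook Wright-Fisher model. The population size of 100 chromosomes is a hypothetical choice, and the idealised random mating is not the actual HS breeding scheme, so the numbers it produces are not expected to match the slide.

```python
import random

def drift(freq, n_chrom, generations, rng):
    """Wright-Fisher drift of one allele; returns its frequency at the end.

    n_chrom is the number of chromosomes sampled each generation
    (a hypothetical value here, not the real HS population size).
    """
    for _ in range(generations):
        count = sum(1 for _ in range(n_chrom) if rng.random() < freq)
        freq = count / n_chrom
    return freq

rng = random.Random(1)
founder_freq = 0.125     # allele carried by one of the 8 founders
final = [drift(founder_freq, 100, 40, rng) for _ in range(200)]
fixed = sum(1 for f in final if f in (0.0, 1.0)) / len(final)

print("mean frequency after 40 generations:", sum(final) / len(final))
print("fraction of simulated loci fixed:", fixed)
```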
Analysis
• Automated analysis pipeline
– R HAPPY package
– Single Marker Association
• Each phenotype analysed independently
– Transformed to Normality, outliers removed
– Tailored set of covariates
– Linear models for most phenotypes
– Survival models for latency phenotypes
Twisted Pair Analysis of Heterogeneous Stock
[Figure: chromosome with markers and their alleles (1 or 2)]
• Want to predict ancestral strain from genotype
• We know the alleles in the founder strains
• Single marker association lacks power, can’t distinguish all strains
• Multipoint analysis – combine data from neighbouring markers
Twisted Pair Analysis of Heterogeneous Stock
[Figure: chromosome with markers and their alleles (1 or 2)]
• Hidden Markov model HAPPY
• Hidden states = ancestral strains
• Observed states = genotypes
• Unknown phase of genotypes
• Analyse both chromosomes simultaneously
• Twisted pair of HMMs
• Mott et al 2000 PNAS
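The idea of a hidden Markov model with ancestral strains as hidden states can be sketched with a much simplified forward-backward pass. This is not the HAPPY implementation: it assumes a single phased haplotype rather than a twisted pair of unphased chromosomes, a constant switch probability between adjacent markers, and a fixed genotyping error rate. All names and parameter values below are illustrative.

```python
def ancestry_posterior(founder_alleles, observed, switch=0.1, error=0.01):
    """Posterior probability of each founder at each marker for ONE haplotype.

    founder_alleles[s][m]: allele (1 or 2) of founder s at marker m.
    observed[m]: allele seen on the haplotype at marker m.
    switch, error: assumed recombination and genotyping-error probabilities.
    """
    n_s, n_m = len(founder_alleles), len(observed)

    def emit(s, m):
        return 1 - error if founder_alleles[s][m] == observed[m] else error

    def step(prev_s, s):
        # stay on the same founder, or switch to a founder chosen uniformly
        return (1 - switch if prev_s == s else 0.0) + switch / n_s

    # Forward pass (no rescaling, so only suitable for short marker runs)
    fwd = [[emit(s, 0) / n_s for s in range(n_s)]]
    for m in range(1, n_m):
        fwd.append([emit(s, m) * sum(fwd[-1][r] * step(r, s) for r in range(n_s))
                    for s in range(n_s)])
    # Backward pass
    bwd = [[1.0] * n_s for _ in range(n_m)]
    for m in range(n_m - 2, -1, -1):
        bwd[m] = [sum(step(s, r) * emit(r, m + 1) * bwd[m + 1][r] for r in range(n_s))
                  for s in range(n_s)]
    # Combine and normalise at each marker
    post = []
    for m in range(n_m):
        w = [fwd[m][s] * bwd[m][s] for s in range(n_s)]
        total = sum(w)
        post.append([x / total for x in w])
    return post

# Three hypothetical founders typed at four markers
founders = [[1, 1, 2, 1],
            [1, 2, 2, 2],
            [2, 1, 1, 1]]
observed = [1, 2, 2, 2]          # matches founder index 1 at every marker
post = ancestry_posterior(founders, observed)
for m, row in enumerate(post):
    print(m, [round(p, 3) for p in row])
```

Note that the first marker alone cannot separate founders 0 and 1; it is the neighbouring markers that resolve the ancestry, which is the point of the multipoint analysis.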
Testing for a QTL
• $p_{iL}(s,t)$ = Prob(animal i is descended from strains s,t at locus L)
• $p_{iL}(s,t)$ calculated by HMM using
– genotype data
– founder strains’ alleles
• Phenotype is modelled
$E(y_i) = \sum_{s,t} p_{iL}(s,t)\,T(s,t) + \mu_i$
$\mathrm{Var}(y_i) = \sigma^2$
• Test for no QTL at locus L
– $H_0$: T(s,t) are all same
– ANOVA partial F test
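Two pieces of this test can be sketched briefly: turning the strain-pair probabilities into regressors, and forming the partial F statistic from the residual sums of squares of the nested fits. The sketch below uses the additive restriction T(s,t) = a(s) + a(t), which is an assumption made for brevity; the function names and the numbers in the example are hypothetical.

```python
def additive_dosage(pair_probs, n_strains):
    """Expected number of chromosomes (0 to 2) inherited from each founder.

    pair_probs[(s, t)] = p_iL(s, t) for one animal at one locus.
    Under the additive restriction these dosages are the regressors.
    """
    dose = [0.0] * n_strains
    for (s, t), p in pair_probs.items():
        dose[s] += p
        dose[t] += p
    return dose

def partial_f(rss_null, rss_full, extra_params, resid_df):
    """Partial F statistic comparing a null model with a larger nested model.

    extra_params: parameters added by the larger model.
    resid_df: residual degrees of freedom of the larger model.
    """
    return ((rss_null - rss_full) / extra_params) / (rss_full / resid_df)

# One animal, three founders (hypothetical probabilities)
probs = {(0, 0): 0.5, (0, 1): 0.25, (1, 2): 0.25}
print(additive_dosage(probs, 3))

# Hypothetical fit: 8 founders give 7 free additive parameters
print(partial_f(rss_null=120.0, rss_full=100.0, extra_params=7, resid_df=1900))
```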
Genome Scan
• Additive and dominance models
• Record all peaks that exceed the 5% genome-wide significance threshold
– Threshold based on 200 permutations
– 9000 preliminary candidate QTL found
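A permutation threshold of this kind can be sketched as follows: shuffle the phenotype, rerun the scan, record the largest statistic, and take the 95th percentile of those maxima. For brevity the scan statistic here is the squared correlation with a single-marker genotype, a stand-in for the F test on HAPPY probabilities; the data are simulated and all names are illustrative.

```python
import random

def max_stat(pheno, genotypes):
    """Largest squared correlation between the phenotype and any marker."""
    n = len(pheno)
    my = sum(pheno) / n
    syy = sum((y - my) ** 2 for y in pheno)
    best = 0.0
    for g in genotypes:
        mg = sum(g) / n
        sgg = sum((x - mg) ** 2 for x in g)
        sgy = sum((x - mg) * (y - my) for x, y in zip(g, pheno))
        if sgg > 0 and syy > 0:
            best = max(best, sgy * sgy / (sgg * syy))
    return best

def permutation_threshold(pheno, genotypes, n_perm=200, alpha=0.05, seed=0):
    """Genome-wide threshold: the (1 - alpha) quantile of permuted scan maxima."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_perm):
        shuffled = pheno[:]
        rng.shuffle(shuffled)          # breaks any genotype-phenotype link
        maxima.append(max_stat(shuffled, genotypes))
    maxima.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return maxima[index]

# Simulated data: 30 markers, 60 animals, no true QTL
rng = random.Random(42)
genotypes = [[rng.randint(0, 2) for _ in range(60)] for _ in range(30)]
pheno = [rng.gauss(0, 1) for _ in range(60)]
threshold = permutation_threshold(pheno, genotypes)
print("5% genome-wide threshold:", threshold)
```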
Results
Many peaks
mean red cell volume
How to select peaks: a
simulated example
Simulate 7 x 5% QTLs
(ie, 35% genetic effect)
+ 20% shared
environment effect
+ 45% noise
= 100% variance
Simulated example: 1D scan
Peaks from 1D scan
phenotype ~ covariates + ?
1D scan: condition on 1 peak
phenotype ~ covariates + peak 1 + ?
1D scan: condition on 2 peaks
phenotype ~ covariates + peak 1 + peak 2 + ?
1D scan: condition on 3 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + ?
1D scan: condition on 4 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + peak 4 + ?
1D scan: condition on 5 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + ?
1D scan: condition on 6 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + ?
1D scan: condition on 7 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + ?
1D scan: condition on 8 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + peak 8 + ?
1D scan: condition on 9 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + peak 8 + peak 9 + ?
1D scan: condition on 10 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + peak 8 + peak 9 + peak 10 + ?
1D scan: condition on 11 peaks
phenotype ~ covariates + peak 1 + peak 2 + peak 3 + peak 4 + peak 5 + peak 6 + peak 7 + peak 8 + peak 9 + peak 10 + peak 11 + ?
Peaks chosen by forward
selection
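The sequence of conditional scans amounts to forward selection: at each step, add the peak that most reduces the residual sum of squares given the peaks already in the model. A minimal sketch is below. It fits ordinary least squares by sequential orthogonalisation, handles only an intercept as covariate, and stops using an assumed rule (a new term must explain at least 5% of total variance); the study's own stopping criterion is not stated on the slides.

```python
def centre(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residualise(v, basis):
    """Remove from v its projection on each (mutually orthogonal) basis vector."""
    for b in basis:
        coef = dot(v, b) / dot(b, b)
        v = [x - coef * y for x, y in zip(v, b)]
    return v

def forward_select(pheno, predictors, min_gain=0.05, max_terms=11):
    """Greedy forward selection; returns indices of chosen predictors in order."""
    y = centre(pheno)
    total = dot(y, y)
    basis, chosen = [], []
    while len(chosen) < max_terms:
        best = None
        for j, x in enumerate(predictors):
            if j in chosen:
                continue
            xr = residualise(centre(x), basis)
            sxx = dot(xr, xr)
            if sxx < 1e-12:
                continue                       # collinear with terms already chosen
            gain = dot(y, xr) ** 2 / sxx       # drop in residual sum of squares
            if best is None or gain > best[0]:
                best = (gain, j, xr)
        if best is None or best[0] < min_gain * total:
            break                              # assumed stopping rule
        gain, j, xr = best
        chosen.append(j)
        basis.append(xr)
        y = residualise(y, [xr])
    return chosen

# Toy data: phenotype = 10 + 3 * peak 0 + 1 * peak 2, with peak 1 irrelevant
predictors = [[1, 1, 1, 1, -1, -1, -1, -1],
              [1, 1, -1, -1, 1, 1, -1, -1],
              [1, -1, 1, -1, 1, -1, 1, -1]]
pheno = [14, 12, 14, 12, 8, 6, 8, 6]
print(forward_select(pheno, predictors))
```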
Bootstrap sampling
10 subjects: 1 2 3 4 5 6 7 8 9 10
sample with replacement
bootstrap sample from 10 subjects: 1 2 2 3 5 5 6 7 7 9
Forward selection on a
bootstrap sample
Bootstrap evidence mounts
up…
In 700 bootstraps…
Bootstrap Posterior Probability
(BPP)
Model averaging by bootstrap
aggregation
• Choosing only one model:
– very data-dependent, arbitrary
– can’t get all the true QTLs in one model
• Bootstrap aggregation averages over models
– true QTLs get included more often than false ones
• References:
– Broman & Speed (2002)
– Hackett et al (2001)
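Bootstrap aggregation can be sketched as a loop: resample the subjects with replacement, rerun model selection on each resample, and report how often each locus is chosen. In the sketch below `select` is a placeholder for whatever selection routine is used (for example forward selection), and the toy selector and locus names are hypothetical. The default of 700 resamples follows the count mentioned in the slides.

```python
import random
from collections import Counter

def bootstrap_inclusion(n_subjects, select, n_boot=700, seed=0):
    """Fraction of bootstrap samples in which each locus is selected.

    select(sample) takes a list of subject indices (drawn with replacement)
    and returns the loci chosen by model selection on that sample.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        sample = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        counts.update(set(select(sample)))     # count each locus once per sample
    return {locus: c / n_boot for locus, c in counts.items()}

def toy_select(sample):
    """Hypothetical selector: locus_A is always found, locus_B only
    when subject 0 happens to be in the resample."""
    chosen = ["locus_A"]
    if 0 in sample:
        chosen.append("locus_B")
    return chosen

bpp = bootstrap_inclusion(10, toy_select)
print(bpp)
```

A locus whose detection hinges on a few subjects gets a lower inclusion frequency than one supported by the whole sample, which is the sense in which true QTLs are expected to be included more often than false ones.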
Results
We identified 843 QTLs for 97 phenotypes with BPP greater than 0.25, of which, on the basis of simulations, we expect 590 to be genuine.
Performance of multiple QTL modelling
BPP threshold | Number of QTL | Proportion of detected QTLs that are true | Proportion of true QTLs detected | Expected number of false QTLs per genome scan
0.05 | 3127 | 0.26 | 0.90 | 0.47
0.10 | 2119 | 0.41 | 0.90 | 0.43
0.20 | 1105 | 0.63 | 0.89 | 0.32
0.25 | 843 | 0.70 | 0.89 | 0.25
0.30 | 633 | 0.75 | 0.89 | 0.21
0.40 | 364 | 0.82 | 0.89 | 0.17
0.50 | 251 | 0.85 | 0.88 | 0.12
0.75 | 58 | 0.91 | 0.87 | 0.04
0.90 | 30 | 0.91 | 0.85 | 0.03
1.00 | 13 | 0.96 | 0.60 | 0.00
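The figure of about 590 genuine QTLs follows from multiplying the number of QTL detected at a threshold by the proportion of detected QTLs that are true. A short sketch of that arithmetic, using a few rows of the performance table (the function name is illustrative):

```python
def expected_genuine(n_qtl, proportion_true):
    """Expected number of genuine QTL among those detected at a threshold."""
    return n_qtl * proportion_true

# (BPP threshold, number of QTL, proportion of detected QTLs that are true)
rows = [(0.05, 3127, 0.26), (0.25, 843, 0.70), (0.50, 251, 0.85), (1.00, 13, 0.96)]
for threshold, n_qtl, prop in rows:
    genuine = expected_genuine(n_qtl, prop)
    print(f"threshold {threshold:.2f}: {n_qtl} detected, about {genuine:.0f} expected genuine")
```

Raising the threshold trades the number of QTL reported against the fraction of them expected to be real.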
Where to find the results
http://gscan.well.ox.ac.uk/
Distribution of effect sizes
[Histogram: number of QTL (y-axis, 0-180) against effect size of QTL in % var (x-axis, 0-25)]
Resolution
[Histogram: 95% confidence intervals, in megabases]
Number of Genes per Locus
[Histogram: number of QTLs (y-axis, 0-140) against number of genes (x-axis, 0-100)]
Results summary
• 843 peaks found with BPP > 0.25
• 8.7 peaks per phenotype on average
• Based on simulation, we expect ~590 to be genuine.
• Mean 95% CI width 2.78 Mb
• Mean number of genes under each 95% CI is 28.9
Results
• ~7 jointly significant QTL per phenotype
• 95% Confidence Interval ~ 2 Mb
• ~50% of QTL have a significant nonadditive component
• Only 3 phenotypes were explained by
single major QTL
– Most phenotypes are complex
Distribution of QTL Effects
Mean effect size 2.7%
[Histogram: number of QTL (y-axis, 0-180) against effect size of QTL in % var (x-axis, 0-25)]
%Variance Explained
[% Additive Genetic Variance calculated using 3-generation pedigree data,
not genotypes]
Coat colour genes
Coat colour | Gene | Chr. | Position (Mb) | HS Mapping Position
albino | Tyr | 7 | 149 | 148.8 - 150.6
agouti | Asip | 2 | 310.14 | 309.6 - 310.2
brown | Tyrp | 4 | 158.4 | 158.2 - 159
dilute | Myo5a | 9 | 150.8 | 150.8 - 151.2
A known QTL: HDL
Wang et al, 2003
HS mapping
New QTLs: two examples
• Ear Punch Hole Area Regrowth
– wound healing
• Cue Conditioning Freeze.During.Tone
– measure of fear
Cue Conditioning
• Freeze.During.Tone: huge effect, small number of genes
[Figure: chr15]
cntn1: Contactin precursor (Neural cell surface protein)
What do we want?
• Biological:
– Joint QTL that contain the functional genes and lead to their identification
– But genetic mapping finds the variants, not the genes
• Statistical:
– Multi-locus QTL selection algorithms that predict the phenotype
of new animals accurately
– Model-Averaging: no best choice?
– Ghost QTL
• Are statistical QTL algorithms consistent?
– Do they find the biological QTL given a large enough sample
size?
– Simulations of multiple QTL models indicate mapping accuracy
declines as complexity increases [Valdar et al 2006 Genetics in press]
Work of many hands
Carmen Arboleda-Hitas
Amarjit Bhomra
Stephanie Burnett
Peter Burns
Richard Copley
Stuart Davidson
Simon Fiddy
Jonathan Flint
Polinka Hernandez
Sue Miller
Richard Mott
Chela Nunez
Gemma Peachey
Sagiv Shifman
Leah Solberg
Amy Taylor
Martin Taylor
William Valdar
Binnaz Yalcin
Dave Bannerman
Shoumo Bhattacharya
Bill Cookson
Rob Deacon
Dominique Gauguier
Doug Higgs
Tertius Hough
Paul Klenerman
Nick Rawlins
Jennifer Taylor
Chris Holmes
Project funded by
The Wellcome Trust, UK
Data are publicly available
• http://gscan.well.ox.ac.uk
Gene x Environment
Gene x Sex
• Repeat analysis looking for QTLs that
interact with
– Gender
– Litter number
– Season, Month, etc
– Experimenter
• Compare models
$E(y) = \mu + \text{locus} + \text{env}$
$E(y) = \mu + \text{locus} * \text{env}$
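Comparing the additive model with the interaction model is a nested-model F test. The sketch below is a simplification: it treats the locus as a single numeric genotype score and the environment as a 0/1 covariate, so the interaction adds one parameter, whereas the real analysis uses strain probabilities and multi-level factors. The least-squares helpers and the data are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def rss(y, columns):
    """Residual sum of squares of y regressed on the given design columns."""
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * q for p, q in zip(c, y)) for c in columns]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, columns)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def interaction_f(y, locus, env):
    """F statistic for adding a locus-by-environment term to the additive model."""
    ones = [1.0] * len(y)
    product = [l * e for l, e in zip(locus, env)]
    rss_add = rss(y, [ones, locus, env])               # mu + locus + env
    rss_int = rss(y, [ones, locus, env, product])      # mu + locus * env
    resid_df = len(y) - 4
    return (rss_add - rss_int) / (rss_int / resid_df)

# Hypothetical data: the locus effect is much larger when env = 1
locus = [0, 1, 2] * 4
env = [0] * 6 + [1] * 6
noise = [0.1, 0.1, 0.1, -0.1, -0.1, -0.1] * 2
y = [1 + 0.5 * l + 2 * l * e + n for l, e, n in zip(locus, env, noise)]
f_stat = interaction_f(y, locus, env)
print("F for gene x environment term:", f_stat)
```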
Gene x Environment
• 431 jointly significant GxE QTLs
– 27 gene x experimenter
– 81 gene x litter number
– 67 gene x age
– 105 gene x study day
– 151 gene x season
• 13% of variation is GxE
• 25 GxE QTLs overlapped with original joint QTL
– defined as lying within 4Mb of the peak position
• 42 GxSex QTLs
Gene Expression Data
(with Binnaz Yalcin, Jennifer Taylor)
• Illumina 40k chip
• Livers, Lungs (Brains)
– 190 HS
– HS founders
Phenotype-gene expression
correlation
• Liver gene expression in 180 HS mice
Slc4a7
Testing for Functional Variants
• Is a SNP functional for a trait?
• Is a functional assay measured in
founders related to a trait?
– Gene expression
– DNA-Protein binding
Testing for non-Functional Variants
• Is a SNP’s pattern of variation inconsistent
with the QTL’s pattern of action ?
• Is a functional assay’s distribution
inconsistent with the QTL’s pattern of
action ?
Merge Analysis
Yalcin et al 2005 Genetics
• Require sequence of HS founders
– Determine all variants and their strain
distribution patterns (SDP)
• Don’t genotype every variant in the HS
– Instead predict genotypes in HS at all variants
based on a sparse skeleton of genotypes
Merge Analysis
• A variant v will partition the HS founder
strains into 2 or more groups, depending
on its strain distribution pattern (SDP)
• If v is functional for the trait then the strain
effects at the QTL must be identical for
strains with the same allele.
– so if merging founders according to v’s SDP
destroys significance then we reject v
Merge Analysis
Model Comparison
• $p_{iL}(s,t)$ = Prob(animal i is descended from strains s,t at locus L)
• Replace strains s,t by merged pseudo-strains g,h
– Add together probabilities for strains with the same allele
– Phenotypic effect of merged strains g,h is F(g,h)
• $v_{iL}(g,h)$ = Prob(animal i is descended from merged strains g,h at locus L)
• Compare fits of nested models
unmerged: $E(y_i) = \sum_{s,t} p_{iL}(s,t)\,T(s,t) + \mu_i$
merged: $E(y_i) = \sum_{g,h} v_{iL}(g,h)\,F(g,h) + \mu_i$
null: $E(y_i) = \mu_i$
• Require no significant difference between merged and unmerged models,
– and for both to be significant compared to null model
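The merging step, adding together probabilities for strains that share an allele, can be sketched in a few lines. The sketch treats strain pairs as unordered, which is an assumption, and the founder labels, strain distribution pattern and probabilities are hypothetical.

```python
from collections import defaultdict

def merge_probabilities(pair_probs, sdp):
    """Collapse strain-pair probabilities onto allele groups.

    pair_probs[(s, t)] = p_iL(s, t) for one animal at one locus.
    sdp[s] = allele carried by founder s (the variant's strain distribution pattern).
    Returns v_iL(g, h), keyed by sorted (unordered) pairs of allele groups.
    """
    merged = defaultdict(float)
    for (s, t), p in pair_probs.items():
        key = tuple(sorted((sdp[s], sdp[t])))
        merged[key] += p
    return dict(merged)

# Four hypothetical founders; the variant splits them into two groups
sdp = {"A": 0, "B": 0, "C": 1, "D": 1}
pair_probs = {("A", "B"): 0.5, ("A", "C"): 0.25, ("C", "D"): 0.25}
print(merge_probabilities(pair_probs, sdp))
```

The merged probabilities then replace the strain-pair probabilities as regressors, giving the smaller of the two nested models being compared.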
Merge Analysis
Open Field Activity, Chr 1
Merge Analysis
rgs18
Functional Merge Analysis
• Measure functional assay on HS founders
– $F_L(s)$ is value at locus L on founder s
– e.g. gene expression
• Expected value in HS is
$E(f_i) = \sum_{s,t} p_{iL}(s,t)\,[F_L(s) + F_L(t)]$, assuming additivity
• If assay is related to phenotype y then
$E(y_i) = \theta\,E(f_i) + \mu_i$
• Compare nested models (thanks to Chris Holmes)
unmerged: $E(y_i) = \sum_{s,t} p_{iL}(s,t)\,T(s,t) + \mu_i$
merged: $E(y_i) = \theta \sum_{s,t} p_{iL}(s,t)\,[F_L(s) + F_L(t)] + \mu_i$
null: $E(y_i) = \mu_i$
• Require no significant difference between merged and unmerged models,
– and for both to be significant compared to null model
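The expected assay value for an HS animal is a probability-weighted sum of founder values. A minimal sketch, assuming additivity as on the slide; the founder labels, assay values and probabilities are hypothetical.

```python
def expected_assay(pair_probs, founder_value):
    """Expected assay value for one animal at one locus, assuming additivity.

    pair_probs[(s, t)] = probability the animal descends from strains s, t.
    founder_value[s] = assay value (e.g. gene expression) measured on founder s.
    """
    return sum(p * (founder_value[s] + founder_value[t])
               for (s, t), p in pair_probs.items())

founder_value = {"A": 1.0, "B": 2.0, "C": 4.0}
pair_probs = {("A", "A"): 0.5, ("A", "B"): 0.25, ("B", "C"): 0.25}
print(expected_assay(pair_probs, founder_value))
```

This single predicted value per animal is what the merged model regresses the phenotype on, so it uses one parameter where the unmerged model uses one per strain pair.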
Using Gene Expression in HS founders
[Figure: multi-panel plots; axis and legend labels: exp.log(Pr>F), locus.log(Pr>F), model difference logp. Panel titles: Freeze, Explore, Anx, Context, Biochem.Tot.Cholesterol, Biochem.Tot.Protein, Biochem.Sodium, Biochem.Phosphorous, Biochem.Urea, Biochem.Triglycerides, Biochem.LDL, Biochem.HDL, Weight.GrowthSlope, Weight.GrowthRanSlope]
Future Work
Extensions to basic model
• Generalised linear models
• Multivariate data
• Mixture Models, EM (Chris Holmes)
• Family Effects, Variance Components, REML (Peter Visscher, Allan McRae)
• Gene Annotation Data (Kate Elliot)
• Multiple QTL models
• Epistasis
• Pleiotropy