Transcript part2

Zoology 2005
Part 2
Richard Mott
Inbred Mouse Strain Haplotype
Structure
• When the genomes of a pair of inbred
strains are compared,
– we find a mosaic of segments of identity and
difference (Wade et al, Nature 2002).
– A QTL segregating between the strains must
lie in a region of sequence difference.
• What happens when we compare more
than two strains simultaneously?
No Simple Haplotype Block Mosaic
Yalcin et al 2004 PNAS
…But a Tree Mosaic
In-silico Mapping
• Simple idea
– Collect phenotypes across a set of inbred
strains
– Genotype the strains (ONCE)
– Look for phenotype-genotype correlation
– Works well for simple Mendelian traits (eg
coat colour)
– Suggested as a panacea for QTL mapping
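The phenotype-genotype correlation step can be sketched as a per-marker one-way ANOVA across strains. This is only an illustration: the function names, the use of an F statistic as the score, and the toy data are assumptions, not part of the lecture.

```python
from statistics import mean

def anova_f(phenotypes, alleles):
    """One-way ANOVA F statistic: strain phenotypes grouped by marker allele."""
    groups = {}
    for y, a in zip(phenotypes, alleles):
        groups.setdefault(a, []).append(y)
    k, n = len(groups), len(phenotypes)
    if k < 2 or n <= k:
        return 0.0  # marker is not informative in this strain set
    grand = mean(phenotypes)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    within = sum((y - mean(g)) ** 2 for g in groups.values() for y in g)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def in_silico_scan(phenotypes, genotypes):
    """genotypes maps marker name -> list of alleles, one per strain."""
    return {m: anova_f(phenotypes, a) for m, a in genotypes.items()}

# Hypothetical toy data: one phenotype value per strain, six strains, two markers.
phen = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
geno = {
    "m1": ["A", "A", "A", "B", "B", "B"],  # splits strains into low and high
    "m2": ["A", "B", "A", "B", "A", "B"],  # unrelated to the phenotype
}
scores = in_silico_scan(phen, geno)
best = max(scores, key=scores.get)
```

With only six strains the scan has very little power, which is the point made on the next slide.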
In-silico Mapping
Problems
• Less well-suited for complex traits
• Number of strains required grows quickly with
the complexity of the trait. Suggested at least
100 strains required, possibly more if epistasis is
present
• Requires high-density genotype/sequence data to
ensure identity-by-state = identity-by-descent
• May be very useful for the dissection of a QTL
previously identified in a F2 cross (look for
patterns of sequence difference)
Recombinant Inbred Lines
• Panels of inbred lines descended from
pairs of inbred strains
• Genomes are inbred mosaics of the
founders
• Lines only need be genotyped once
• Similar to in-silico mapping except
– identity-by-descent=identity-by-state
– Coarser recombination structure
– ?lower resolution mapping?
BXD chromosome 4
Testing if a variant is functional
without genotyping it
(Yalcin et al, Genetics 2005)
• Requirements:
– A Heterogeneous Stock, genotyped at a
skeleton of markers
– The genome sequences of the progenitor
strains
– A statistical test
Merge Analysis
• Each polymorphism groups together the
founders according to their alleles
• If the polymorphism is functional, then a model
in which the phenotypic strain effects are
estimated after merging the strains together
should be as good as a model where each strain
can have an independent effect.
• Compare the fit of “merged” and “unmerged”
genetic models to test if the variant is functional.
• If the fit of the merged model is poor then that
variant can be eliminated.
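The comparison of "merged" and "unmerged" models can be sketched as a nested-model F statistic. This is a simplified illustration, assuming each animal's founder strain at the locus is known outright (in an HS it would only be estimated) and that there are no covariates. The function names and toy data are hypothetical.

```python
from statistics import mean

def rss_by_group(y, labels):
    """Residual sum of squares when each group gets its own mean."""
    groups = {}
    for yi, g in zip(y, labels):
        groups.setdefault(g, []).append(yi)
    rss = sum((yi - mean(v)) ** 2 for v in groups.values() for yi in v)
    return rss, len(groups)

def merge_f(y, founder, allele_of):
    """F statistic comparing the merged model (one effect per allele)
    with the unmerged model (one effect per founder strain)."""
    rss_full, k_full = rss_by_group(y, founder)
    merged_labels = [allele_of[f] for f in founder]
    rss_merged, k_merged = rss_by_group(y, merged_labels)
    df1 = k_full - k_merged          # parameters saved by merging
    df2 = len(y) - k_full            # residual degrees of freedom
    if df1 <= 0 or df2 <= 0 or rss_full == 0:
        raise ValueError("models are not comparable on this data")
    return ((rss_merged - rss_full) / df1) / (rss_full / df2)

# Hypothetical toy data: founders A and B are low, C and D are high.
founder = ["A", "A", "B", "B", "C", "C", "D", "D"]
y = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]

# Variant 1 groups {A,B} against {C,D}; variant 2 groups {A,C} against {B,D}.
variant1 = {"A": 0, "B": 0, "C": 1, "D": 1}
variant2 = {"A": 0, "C": 0, "B": 1, "D": 1}

f_consistent = merge_f(y, founder, variant1)    # small: merging loses little
f_inconsistent = merge_f(y, founder, variant2)  # large: merged fit is poor
```

A large F means merging by that variant's alleles fits much worse than independent strain effects, so the variant can be eliminated.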
How can we show a gene under a
QTL peak affects the trait?
• Genetic Mapping identifies Functional
Variants, not Genes
• Could be a control element affecting some
other gene
Quantitative Complementation
[Figure: KO = 0, Low = 50, High = 100, wt = 30; two differences d are marked, with D = d - d]
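The quantity D can be read as a difference of differences. The sketch below assumes the usual quantitative complementation design, in which the High and Low QTL alleles are each measured over the knockout (KO) allele and over the wild-type (wt) allele. The cell means are hypothetical and are not the values shown on the slide.

```python
def complementation_contrast(means):
    """D = (High - Low over KO) - (High - Low over wt)."""
    d_ko = means[("High", "KO")] - means[("Low", "KO")]
    d_wt = means[("High", "wt")] - means[("Low", "wt")]
    return d_ko - d_wt

# Hypothetical mean phenotypes for the four genotype classes.
means = {
    ("High", "KO"): 80.0,
    ("Low", "KO"): 40.0,
    ("High", "wt"): 70.0,
    ("Low", "wt"): 60.0,
}
D = complementation_contrast(means)
```

A D close to zero would mean the strain difference is the same on both backgrounds. In practice D would be tested against its sampling error, for example as a strain-by-allele interaction term.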
Using Functional Information to
Confirm Genes
• Further experiments
– further bioinformatics, eg networks, functional
annotation (GO, KEGG)
– candidate gene sequencing
– gene expression analyses (eQTL) of
• founder strains
• HS
Mouse/human sequence
comparison
Enhancer reporter assays
[Diagram: enhancer, promoter, luciferase reporter constructs]
Enhancer elements affect promoter expression
[Figure: fold expression difference (C57/AJ) by position (2404529 to 4107860) in NIH3T3, ND7 and Neuro2a cells; y-axis 0 to 140]
Large-Scale Genetic Mapping
• Using a Heterogeneous Stock
• Multiple Phenotypes collected in parallel
Predictions (from simulation of an
HS population)
• In a population of 1,000 HS animals:
– Genome-wide power to detect 5% QTL ~ 0.92
– Resolution < 2 Mb
Study design
• 2,000 mice
• 15,000 diallelic markers
• More than 100 phenotypes
– each mouse subject to a battery of tests
spread over weeks 5-9 of the animal’s life
– more (post-mortem) phenotypes being added
Phenotypes
• Open Field Arena: Total Activity, Center Time, Latency, Boli
• Context Freezing: Minutes freezing
• Elevated Plus Maze: Closed Arm Distance, Open Arm Distance, Closed Arm Time, Open Arm Time, Open Arm Latency
• FPS: PreTrain Tone + Startle, PreTrain Startle, Test Tone + Startle, Test Startle
• Food Neophobia: FN Latency
• "Home cage" activity: Total beam breaks
• Burrowing Phenotypes: Pellets Burrowed
• Intraperitoneal glucose tolerance test: Glucose at time t, Insulin at time t
• Cue Conditioning: Minutes freezing at cue
• Plethysmography: PenH Difference, Respiratory rate, Tidal volume, Inspiratory time, Expiratory time
• Haematology: HCT, HGB, MCV, PLT, RBC, WBC
• Biochemistry: Sodium, Potassium, Urea, Creatinine, Calcium, HDL, LDL, Triglycerides, Phosphorus, Chloride, Cholesterol, AST, ALT, ALP, Haemolysis
• Obesity: Body length, BMI
• Immunology: PerB220, PerCD4inCD3, PerCD8inCD3
Covariates
• For each phenotype, we recorded
covariates, eg,
– experimenter
– time of day
– apparatus (eg, Shock Chamber 3)
Data collection
• All animals microchipped
• Automated data checking, processing and uploading
• All data uploaded into the Integrated Genotyping System (IGS) database
Genotypes from Illumina
• Genotyped and phenotyped 2,000
offspring
• Genotyped 300 parents
• Pedigree analysis shows genotyping was
99.99% accurate
• 11,558 markers polymorphic in HS
QTL mapping
• Models
– HAPPY and single marker association
• Fitting framework
– Linear regression of (transformed) phenotypes
– Survival analysis for latency data
– Logit-based models for categorical data
• Significant covariates incorporated into the null model, eg
Null = Startle ~ TestChamber + BodyWeight + Year + Age + Hour + Gender
Additive = Null + additive genetic info for locus
Full = Null + full genetic info for locus
QTL mapping
• Significance tests
– partial F-test (linear models), Chi-square /
LRT (others)
• Significance thresholds
– different for each phenotype
– have to take into account LD
• fit distribution to scores of permuted data
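The partial F-test for the linear-model case can be sketched as a comparison of residual sums of squares between a null and an additive model. The sketch makes several assumptions: the null model is cut down to an intercept plus body weight, "additive genetic info" is stood in for by an allele dosage (0, 1, 2) at one diallelic marker, and the data are invented. The least-squares solver is a plain normal-equations routine written for the example, not the software used in the study.

```python
def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        if abs(A[c][c]) < 1e-12:
            raise ValueError("singular design matrix")
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def partial_f(X_null, X_alt, y):
    """Partial F statistic for nested linear models, with its degrees of freedom."""
    rss0, rss1 = ols_rss(X_null, y), ols_rss(X_alt, y)
    df1 = len(X_alt[0]) - len(X_null[0])
    df2 = len(y) - len(X_alt[0])
    return ((rss0 - rss1) / df1) / (rss1 / df2), df1, df2

# Hypothetical data: 8 animals, body weight covariate, allele dosage at one locus.
weight = [20, 22, 21, 23, 20, 22, 21, 23]
dosage = [0, 0, 1, 1, 2, 2, 0, 1]
noise = [0.05, -0.05, 0.02, -0.02, 0.03, -0.03, 0.01, -0.01]
y = [1 + 0.1 * w + 0.5 * d + e for w, d, e in zip(weight, dosage, noise)]

X_null = [[1.0, w] for w in weight]                       # Null
X_add = [[1.0, w, d] for w, d in zip(weight, dosage)]     # Null + additive term
F, df1, df2 = partial_f(X_null, X_add, y)
```

The statistic would then be converted to a P-value using the F distribution with (df1, df2) degrees of freedom.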
E-values
• We set score thresholds using ideas from sequence databank search
programs such as BLAST
• The E-value of a threshold is the number of times you would expect
to see a false positive exceed the threshold in a genome scan
• Applying the Bonferroni correction to the number of marker intervals
is too severe because LD makes neighbouring scores correlated.
• Permutation analyses indicate that the most significant score expected
by chance among all ~12000 marker intervals behaves as if it were
drawn from M~4000 independent tests.
• Hence a nominal P-value of p corresponds to an E-value of pM
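The conversion between nominal P-values and E-values is simple arithmetic, sketched below. M = 4000 and the ~12000 marker intervals are the figures quoted on the slide. The function names are illustrative only.

```python
M_EFFECTIVE = 4000    # effective number of independent tests (from the slide)
N_INTERVALS = 12000   # approximate number of marker intervals (from the slide)

def e_value(p, m=M_EFFECTIVE):
    """Expected number of false positives per genome scan at nominal P-value p."""
    return p * m

def nominal_p_for(e, m=M_EFFECTIVE):
    """Nominal P-value needed to reach a target E-value."""
    return e / m

p_needed = nominal_p_for(0.01)        # threshold using M effective tests
p_bonferroni = 0.01 / N_INTERVALS     # stricter threshold from naive Bonferroni
```

The Bonferroni threshold is three times smaller than the one based on M, which is the sense in which it is too severe.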
Problems
Our population includes both siblings and unrelated animals
• We have ignored this distinction
And therefore we risk:
1. Confounding environmental family effects with genetic family effects
2. Allowing ghost peaks due to linkage disequilibrium between markers
within a sibship
Our solution so far:
(1) Investigating the effect of environmental factors and building
covariates into the model
(2) Identifying peaks by a multiple conditional fit
Multiple Peak Fitting
Forward Selection
• For each phenotype’s genome scan:
– Make list of all peaks > genome-wide threshold T
– Fit most significant peak, P1
– Go through list of peaks, refitting each one conditional upon the
most significant peak.
– Add the most significant remaining peak, P2
– Continue refitting remaining peaks P3 , P4 … and adding them
into model until the most significant remaining peak < T
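The forward selection loop above can be sketched as follows. The scoring function is a stand-in: in a real analysis it would refit the peak with the already selected peaks in the model, whereas here it is a hypothetical lookup in which a ghost peak loses its signal once the true peak is included.

```python
def forward_select(peaks, score, threshold):
    """Add peaks one at a time, each scored conditional on those already chosen,
    until the best remaining peak falls below the genome-wide threshold."""
    selected = []
    remaining = list(peaks)
    while remaining:
        scores = {p: score(p, selected) for p in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scores (for example, -log10 P) from a single-locus genome scan.
marginal = {"P1": 9.0, "ghost": 7.0, "P2": 6.0}
ghost_of = {"ghost": "P1"}  # "ghost" is only significant through LD with P1

def toy_score(peak, selected):
    """Stand-in for a conditional refit: a ghost peak collapses once its source is fitted."""
    if ghost_of.get(peak) in selected:
        return 1.0
    return marginal[peak]

selected = forward_select(marginal, toy_score, threshold=5.0)
```

In this toy case the ghost peak exceeds the threshold on its own but is dropped once P1 is in the model.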
Peaks found by multiple conditional fit
[Figure: multiple conditional fit (using additive model only); axis label: number of phenotypes]
Database for scans
Additive model
Full model
E-value thresholds
• additive only
• E<0.01 is about the same as a genome-wide corrected p<0.01.
Database for scans (zoom in; covariates)
QTL Mapping: Validation
• Coat colour
• Detection of known QTLs
Coat colour genes
Coat colour   Gene    Chr.   Position (Mb)   HS Mapping Position
albino        Tyr     7      149             148.8 - 150.6
agouti        Asip    2      310.14          309.6 - 310.2
brown         Tyrp    4      158.4           158.2 - 159
dilute        Myo5a   9      150.8           150.8 - 151.2
A known QTL: HDL
Wang et al, 2003
HS mapping
High Resolution QTLs
Phenotype           Chrom   Mb        Method               Ref               HS position
Cue freezing        3       70-83     Genome tagged mice   Liu 2003          71-73
Obesity             2       142-168   Congenic             Demant 2004       150-153
10 week body mass   1       156-160   Progeny testing      Christians 2004   154.5-156
Emotionality        1       143-148   HS                   Mott 2000         143-144.5
Emotionality        10      123-127   HS                   Mott 2000         121.5-122.7
Emotionality        12      54-57     HS                   Mott 2000         55.5-56.5
Emotionality        15      64-77     HS                   Turri 1999        63.5-66
New QTLs: two examples
• Freeze.During.Tone (from Cue Conditioning behavioural experiment): 1 peak
• % of CD4 in CD3 cells (immunology assay): 10 peaks
Cue Conditioning
Freezing
• Freezing in response to a conditioned stimulus (a tone)
Cue Conditioning
• Freeze.During.Tone: huge effect, small number of genes (chr15)
• cntn1: Contactin precursor (neural cell surface protein)
% CD4 cells in CD3 cells
• huge effect but lots of genes
% CD4 in CD3 (under peak)
All QTLs
• 608 peaks
• Median interval is 938,936 bp …
• … or about 9 genes per peak
Summary
• The HS project so far has
– phenotyped 2,500 HS mice
– genotyped 2,300 mice
– mapped over 140 phenotypes
– identified more than 600 potential QTLs
Confirming gene candidates
• Increased mapping resolution through
– including epistasis
– multivariate
– GxE
– pleiotropy
– sex effects
• Further experiments
– further bioinformatics, eg networks, functional annotation (GO,
KEGG)
– candidate gene sequencing
– gene expression analyses (eQTL) of
• founder strains
• HS
Confirming gene candidates:
epistasis
Single marker association
of pairwise epistasis
Work of many hands
Carmen Arboleda-Hitas
Amarjit Bhomra
Peter Burns
Richard Copley
Stuart Davidson
Simon Fiddy
Jonathan Flint
Polinka Hernandez
Sue Miller
Richard Mott
Chela Nunez
Gemma Peachey
Sagiv Shifman
Leah Solberg
Amy Taylor
Martin Taylor
Jordana Tzenova-Bell
William Valdar
Binnaz Yalcin
Dave Bannerman
Shoumo Bhattacharya
Bill Cookson
Rob Deacon
Dominique Gauguier
Doug Higgs
Tertius Hough
Paul Klenerman
Nick Rawlins
Project funded by
The Wellcome Trust, UK