Gen660_Lecture8B_QTL_2014

QTL mapping
Simple Mendelian traits are
caused by a single locus, and
come in the ‘all-or-none’
flavor.
A Quantitative Trait is one in
which many loci contribute.
The phenotype can therefore
vary in a ‘quantitative’
manner.
Ades 2008, NHGRI
Modified from Mike White slides, 2010
Goals of QTL mapping
To identify the loci that
contribute to phenotypic
variation
1. Cross two parents with
extreme phenotypes
2. Score the progeny for the
phenotype
3. Genotype the progeny at
markers across the genome
4. Associate the observed
phenotypic variation with the
underlying genetic variation
5. Ultimate goal: identify causal
polymorphisms that explain the
phenotypic variation
Backcross
[Figure: backcross scheme; phenotype: drug tolerance (80% and 20% viability)]
Usually have at least 100 individuals
Broman and Sen 2009
Intercross
[Figure: intercross scheme; phenotype: drug tolerance (80% and 20% viability)]
Broman and Sen 2009
Backcross vs. Intercross
• An intercross recovers all three possible genotypes (AA, BB, AB). This allows
detection of dominance with both alleles and provides estimates of the degree of
dominance.
• A backcross has more power to detect QTL with fewer individuals.
• A backcross may be the only possible scheme when crossing two different
species.
Genetic map: specific markers spaced across the genome
Markers can be:
• SNPs at particular loci
• Variable-length repeats
e.g. ALU repeats
• ALL polymorphisms
(if have whole genomes)
Ideally, markers should
be spaced every 10-20 cM
and span the whole genome
Genotype data: Determine allele at all markers in each F2
Phenotype data
Test which markers correlate with the phenotype
1. Missing Data Problem
Use marker data to infer intervening genotypes
2. Model Selection Problem
How do the QTL across the genome combine with the covariates to
generate the phenotype?
Broman and Sen 2009
Test which markers correlate with the phenotype
Marker regression: simple t-test (or ANOVA) at each marker
Marker 1: no QTL
Marker 2: significant QTL
(population means are different)
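As a rough illustration of the marker-by-marker test described above, the sketch below computes a pooled-variance two-sample t statistic at a single backcross marker. The function name, the 'AA'/'AB' genotype coding, and the toy phenotype values are made up for illustration. A real analysis would also convert t to a p-value using the t distribution.

```python
import math
import statistics

def marker_t_statistic(genotypes, phenotypes):
    """Pooled-variance t statistic comparing the phenotype means of the
    two genotype classes ('AA' vs 'AB') at one backcross marker."""
    group_aa = [y for g, y in zip(genotypes, phenotypes) if g == "AA"]
    group_ab = [y for g, y in zip(genotypes, phenotypes) if g == "AB"]
    n_aa, n_ab = len(group_aa), len(group_ab)
    # Pooled variance assumes both genotype classes share one residual variance.
    pooled_var = ((n_aa - 1) * statistics.variance(group_aa)
                  + (n_ab - 1) * statistics.variance(group_ab)) / (n_aa + n_ab - 2)
    std_err = math.sqrt(pooled_var * (1 / n_aa + 1 / n_ab))
    t = (statistics.mean(group_aa) - statistics.mean(group_ab)) / std_err
    return t, n_aa + n_ab - 2  # t statistic and degrees of freedom

# Hypothetical toy data: % viability for 8 backcross progeny.
phenotypes = [20, 25, 22, 78, 80, 75, 24, 79]
marker1 = ["AA", "AB", "AA", "AB", "AA", "AB", "AB", "AA"]  # unrelated to phenotype
marker2 = ["AA", "AA", "AA", "AB", "AB", "AB", "AA", "AB"]  # tracks the phenotype

t1, df1 = marker_t_statistic(marker1, phenotypes)
t2, df2 = marker_t_statistic(marker2, phenotypes)
```

With this toy data, marker1 gives a t statistic close to zero while marker2 gives a very large one in absolute value, which mirrors the "no QTL" versus "significant QTL" contrast.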
Marker regression
Advantages:
• Simple test – standard t-test/ANOVA
• Covariates (e.g. Gender, Environment) are easy to incorporate
• No genetic map necessary, since test is done separately on each marker
Disadvantages:
• Any individuals with missing marker data must be omitted from analysis
• Does not effectively consider positions between markers
• Does not test for genetic interactions (e.g. epistasis)
• The effect size of the QTL (i.e. power to detect QTL) is reduced by incomplete
linkage to the marker
• Difficult to pinpoint QTL position, since only the marker positions are considered
Interval mapping
• Lander and Botstein 1989
• In addition to examining phenotype-genotype associations at markers, look for
associations between markers by inferring the genotype at intervening positions
• The methods for calculating genotype probabilities between markers typically use
hidden Markov models to account for additional factors, such as genotyping errors
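To make the idea of inferring a genotype between markers concrete, here is a minimal sketch for a backcross. It is a deliberate simplification, not the hidden Markov model approach mentioned above: it assumes no genotyping errors, no missing data, and no crossover interference (the Haldane map function). The function names and the 'AA'/'AB' coding are illustrative assumptions.

```python
import math

def haldane_recomb_fraction(distance_cm):
    """Haldane map function: map distance in cM -> recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def prob_qtl_is_aa(left_geno, right_geno, left_to_qtl_cm, qtl_to_right_cm):
    """P(genotype at a position between two markers is 'AA'), given the
    genotypes ('AA' or 'AB') at the flanking markers in a backcross."""
    r_lq = haldane_recomb_fraction(left_to_qtl_cm)
    r_qr = haldane_recomb_fraction(qtl_to_right_cm)
    r_lr = haldane_recomb_fraction(left_to_qtl_cm + qtl_to_right_cm)
    # P(position is AA | left marker): no recombination if left is AA.
    p_q_given_left = (1 - r_lq) if left_geno == "AA" else r_lq
    # P(right marker genotype | position is AA).
    p_right_given_q = (1 - r_qr) if right_geno == "AA" else r_qr
    # P(right marker genotype | left marker genotype).
    p_right_given_left = (1 - r_lr) if left_geno == right_geno else r_lr
    # Bayes' rule, using independence of recombination in the two intervals.
    return p_q_given_left * p_right_given_q / p_right_given_left

# Position halfway between two markers that are 10 cM apart.
p_both_aa = prob_qtl_is_aa("AA", "AA", 5, 5)       # flanking markers agree
p_recombinant = prob_qtl_is_aa("AA", "AB", 5, 5)   # flanking markers differ
```

When the flanking markers agree, the intervening genotype is almost certainly the same. When they differ and the position is at the midpoint, the two genotypes are equally likely.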
[Figure: interval mapping example. Broman and Sen 2009]
Interval mapping
Advantages:
• Takes account of missing genotype information – all individuals are included
• Can scan for QTL at locations in between markers
• QTL effects are better estimated
Disadvantages:
• More computation time required
• Still only a single-QTL model – cannot separate linked QTL or examine for
interactions among QTL
LOD scores
• Measure of the strength of evidence for the presence of a QTL
at each marker location
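LOD(λ) = log10 likelihood ratio comparing the hypothesis of a QTL at position λ
versus that of no QTL
$$\mathrm{LOD}(\lambda) = \log_{10} \left\{ \frac{\Pr(y \mid \text{QTL at } \lambda,\ \mu_{AA\lambda},\ \mu_{AB\lambda},\ \sigma_{\lambda})}{\Pr(y \mid \text{no QTL},\ \mu,\ \sigma)} \right\}$$
LOD 3 means that the TOP model (QTL at λ) is 10^3 times more likely than
the BOTTOM model (no QTL)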
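The likelihood ratio above can be sketched numerically for the simplest case: a fully genotyped marker with normally distributed residuals and a common variance. Under those assumptions, the maximized likelihood ratio reduces to LOD = (n/2) · log10(RSS_null / RSS_QTL), where RSS is the residual sum of squares. The function name and toy data below are made up. Positions between markers would need the genotype-probability machinery of interval mapping, which this sketch omits.

```python
import math

def lod_at_marker(genotypes, phenotypes):
    """LOD score at a fully genotyped marker, assuming normal residuals
    with a common variance: (n/2) * log10(RSS_null / RSS_qtl)."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    # Null model: one mean for everyone.
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    # QTL model: a separate mean for each genotype class.
    rss_qtl = 0.0
    for geno in set(genotypes):
        group = [y for g, y in zip(genotypes, phenotypes) if g == geno]
        group_mean = sum(group) / len(group)
        rss_qtl += sum((y - group_mean) ** 2 for y in group)
    return (n / 2.0) * math.log10(rss_null / rss_qtl)

# Hypothetical toy data: % viability for 8 backcross progeny.
phenotypes = [20, 25, 22, 78, 80, 75, 24, 79]
marker1 = ["AA", "AB", "AA", "AB", "AA", "AB", "AB", "AA"]  # unrelated to phenotype
marker2 = ["AA", "AA", "AA", "AB", "AB", "AB", "AA", "AB"]  # tracks the phenotype

lod1 = lod_at_marker(marker1, phenotypes)
lod2 = lod_at_marker(marker2, phenotypes)
```

Here lod1 is essentially zero (splitting by genotype explains nothing), while lod2 is well above 3.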
LOD curves
How do you know which peaks are really significant?
LOD threshold
• Consider the null hypothesis that there are no QTLs genome-wide
[Figure: panels labeled "one location" and "genome-wide"]
1. Randomize the phenotype labels relative to the genotypes
2. Conduct interval mapping and determine what the maximum LOD score is
genome-wide
3. Repeat a large number of times (1000-10,000) to generate a null distribution
of maximum LOD scores
Broman and Sen 2009
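The permutation procedure can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a marker-by-marker LOD (normal model, no interval mapping), simulated unlinked markers, and only 200 permutations to keep it fast. All function names, the simulated effect at marker index 3, and the random seeds are hypothetical.

```python
import math
import random

def lod_at_marker(genotypes, phenotypes):
    """Marker LOD under a normal model: (n/2) * log10(RSS_null / RSS_qtl)."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    rss_qtl = 0.0
    for geno in set(genotypes):
        group = [y for g, y in zip(genotypes, phenotypes) if g == geno]
        group_mean = sum(group) / len(group)
        rss_qtl += sum((y - group_mean) ** 2 for y in group)
    return (n / 2.0) * math.log10(rss_null / rss_qtl)

def max_lod(marker_genotypes, phenotypes):
    """Maximum LOD across all markers (the genome-wide maximum)."""
    return max(lod_at_marker(g, phenotypes) for g in marker_genotypes)

def null_max_lods(marker_genotypes, phenotypes, n_perm=1000, seed=0):
    """Sorted null distribution of genome-wide maximum LOD scores."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        null.append(max_lod(marker_genotypes, shuffled))
    return sorted(null)

def lod_threshold(sorted_null, gwer):
    """LOD cutoff exceeded by roughly a fraction `gwer` of null scans."""
    index = min(len(sorted_null) - 1, int((1.0 - gwer) * len(sorted_null)))
    return sorted_null[index]

# Simulated backcross: 100 individuals, 20 unlinked markers, one real QTL.
sim = random.Random(1)
n_ind, n_markers = 100, 20
markers = [[sim.choice(["AA", "AB"]) for _ in range(n_ind)]
           for _ in range(n_markers)]
phenos = [(10.0 if markers[3][i] == "AB" else 0.0) + sim.gauss(0.0, 5.0)
          for i in range(n_ind)]

null = null_max_lods(markers, phenos, n_perm=200)
threshold_10 = lod_threshold(null, 0.10)
threshold_05 = lod_threshold(null, 0.05)
observed = max_lod(markers, phenos)
```

Taking the maximum over the whole genome in each permutation is what makes the threshold genome-wide rather than per-location.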
LOD threshold
• 1000 permutations
10% ‘Genome-wide Error Rate’ (GWER) = LOD 3.19
(means that, if there were no QTL, 10% of genome scans would still show a peak above this LOD cutoff by random chance)
5% GWER = LOD 3.52
• Boundary of the peak is often taken as the points where the LOD curve drops to (Max LOD – 1.5)
(or Max LOD – 1.8 for an intercross)
• Often these regions are very large & encompass many (hundreds of) genes
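The peak-boundary rule can be sketched as a small helper that walks outward from the peak until the LOD curve falls below (Max LOD – drop). The function name and the LOD values are made up for illustration. This version returns the outermost grid positions still above the cutoff, whereas some software reports a slightly wider interval.

```python
def lod_support_interval(positions_cm, lods, drop=1.5):
    """Contiguous region around the LOD peak where LOD >= max LOD - drop."""
    peak = max(range(len(lods)), key=lods.__getitem__)
    cutoff = lods[peak] - drop
    left = peak
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1
    return positions_cm[left], positions_cm[right]

# Hypothetical LOD curve evaluated every 10 cM along one chromosome.
positions = [0, 10, 20, 30, 40, 50, 60]
lod_curve = [0.5, 1.2, 3.0, 4.2, 3.1, 1.0, 0.4]

interval_backcross = lod_support_interval(positions, lod_curve, drop=1.5)
interval_intercross = lod_support_interval(positions, lod_curve, drop=1.8)
```

Even on this toy curve the interval spans 20 cM, which illustrates why such regions often contain many genes.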
Lessons from QTL mapping studies about Genetic Architecture
* Often have a few big-effect QTL and many modifier QTL
with small effects on the phenotype
  Need lots of power (good phenotypic measurements and many
  individuals) to detect QTLs with small effects
* Recombination in F2s can reveal negative effects segregating in the
parents
  e.g. can find a resistant-parent allele associated with sensitivity
  Mackay review: often have loci with complementary effects found nearby
* Effects of an allele can be context dependent
  Environment-specific effects: Gene x Environment (GxE) interactions
  Genomic context: epistatic (i.e. gene-gene) interactions are likely very
  common … but difficult to detect
An alternative approach: Genome Wide Association Studies (GWAS)
Here the phenotypes and genotypes come from many
different individuals from a population
Identify SNPs that are significantly associated with the trait
across many individuals
[Figure: schematic with panels labeled Genotypes for 65 strains, Phenotypes for 65 strains, Population Structure, Phylogenetic Relatedness, and Random Error]
[Figure legend, strain categories: Laboratory, Clinical, Wine, Baking, Bio Fuel, Other fermentation, Oak, Nature]
[Figure labels, strains: BY4741, S288c, W303, FL100, SK1, Y55, YJM975, YJM981, YJM978, 322134S, 273614N, YJM789, 378604X, YJM326, YJM428, YJM653, YJM320, YJM421, YJM451, YS9, YS2, YS4, CLIB215, CLIB324, JAY291, CBS7960, DBVPG1788, DBVPG1106, DBVPG1373, DBVPG6765, L-1374, L-1528, RM11_1A, BC187, YIIc17_E5, WE372, T73, NCYC110, DBVPG6044, Y12, K11, Y9, DBVPG6040, NCYC361, DBVPG1853, CLIB382, UC5, PW5, YPS163, YPS606, YPS128, NC-02, YPS1009, T7, UWOPS05-227.2, UWOPS05-217.3, UWOPS03-461.4, Y10, IL-01, YJM269, M22, I14, MUSH, LEP, CRB, UWOPS87-2421, UWOPS83-787.3]
Typically use a mixed linear model to test for significance
y = μ + a + other stuff + Error
  y: phenotypic variance
  μ: phenotypic mean
  a: additive genetic effects across all involved genes
  Error: random error
Identify SNPs that are significantly associated with the trait
[Figure: phenotype (y-axis) plotted by genotype (x-axis: AA vs. TT)]
A very important control for both types of mapping:
controlling for covariates
Sometimes a SNP can appear correlated with phenotypic variation … but
the correlation can be due to some other feature that co-varies with both the SNP and the phenotype
The clearest example: population structure
Other examples:
- gender of the individuals
- shared environments for subgroups
- an example from our yeast studies:
ploidy differences when some F2s are haploid
and some are diploid
Example: S. cerevisiae strains (Liti et al. 2009)
[Figure: phenotype (y-axis) plotted by genotype (x-axis: AA vs. TT), with vineyard strains and oak strains as separate groups]
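Confounding by population structure can be illustrated with a small simulation. In the sketch below, a SNP has no effect on the phenotype at all, but its allele frequency and the mean phenotype both differ between two populations. All numbers (allele frequencies, means, 500 strains per population) are invented for illustration. Comparing genotypes within each population is used here as a crude stand-in for including population structure as a covariate in the model.

```python
import random
import statistics

rng = random.Random(42)

# (population, frequency of the TT genotype, mean phenotype): all hypothetical.
populations = [("vineyard", 0.9, 12.0), ("oak", 0.1, 4.0)]

strains = []
for population, freq_tt, mean_pheno in populations:
    for _ in range(500):
        genotype = "TT" if rng.random() < freq_tt else "AA"
        # Phenotype depends on the population only, NOT on the genotype.
        phenotype = rng.gauss(mean_pheno, 1.0)
        strains.append((population, genotype, phenotype))

def genotype_effect(rows):
    """Difference in mean phenotype between TT and AA strains."""
    tt = [y for _, g, y in rows if g == "TT"]
    aa = [y for _, g, y in rows if g == "AA"]
    return statistics.mean(tt) - statistics.mean(aa)

# Ignoring population: the SNP looks strongly associated with the trait.
naive_effect = genotype_effect(strains)

# Within each population: the apparent effect largely disappears.
within_effect = {
    name: genotype_effect([row for row in strains if row[0] == name])
    for name, _, _ in populations
}
```

The naive comparison shows a large difference between TT and AA strains, while the within-population differences are close to zero, so the association is driven by population membership rather than by the SNP.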
Mixed linear model identifies SNPs with a significant p-value.
Often plot the –log10(p) across the genome (Manhattan plot)
Again, the p-value cutoff comes from permutations
(randomize the strain-phenotype labels and perform mapping
on randomized data 10,000 times)
How to find the causative SNP/polymorphism in giant regions?
Often very challenging to find which SNP(s) or polymorphisms
(copy-number differences, rearrangements, etc) are causal
Some strategies people use:
- Look at what’s known about the genes in the peak
CAUTION: very easy to get led by what ‘seems likely’
- Look at signatures of selection within the population
e.g. differences in FST
- Look for derived alleles
- Look for coding changes, genes in the region with severe expression
differences
- Combine with other data
e.g. other mapping studies (QTL + GWAS), genomic datasets