Population Structure, Association
Studies, and QTLs
Stat 115/215
Structure Algorithm
• One of the most widely-used programs in
population genetics (original paper cited >9,000
times since 2000)
– Pritchard, Stephens and Donnelly (2000). Inference of
Population Structure Using Multilocus Genotype Data,
Genetics. 155:945-959.
• Very flexible model can determine:
– The most likely number of uniform groups
(populations, K)
– The genomic composition of each individual
(admixture coefficients)
– Possible population of origin
A simple model of population
structure
• Individuals in our sample represent a
mixture of K (unknown) ancestral
populations.
• Each population is characterized by
(unknown) allele frequencies at each locus.
• Within populations, markers are in Hardy-Weinberg and linkage equilibrium.
The model
• Let A1, A2, …, AK represent the (unknown) allele
frequencies in each subpopulation
• Let Z1, Z2, … , Zm represent the (unknown)
subpopulation of origin of the sampled individuals –
they are indicators
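• Assuming HWE and LE within subpopulations, the
likelihood of an individual’s genotypes at various
loci in subpopulation k is given by the product of
the relevant allele frequencies:
$$P(G_i \mid Z_i = k, A) = \prod_{l} P_l(G_{il} \mid A_k)$$
where $P_l(G_{il} \mid A_k)$ is the probability of individual $i$'s genotype at locus $l$ given the allele frequencies $A_k$.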
More details
• The probability of observing a genotype at locus
l by chance in a population is a function of the
allele frequencies:
– $P_l = p_i^2$ for homozygous loci
– $P_l = 2 p_i p_j$ for heterozygous loci
• Assuming no linkage among the markers,
we have the product form as in the previous
page.
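As a minimal sketch of the product form, the following Python computes the log-likelihood of one individual's multilocus genotype in a single population. The function names, the allele labels and the two-locus example frequencies are hypothetical, chosen only for illustration.

```python
import math

def locus_prob(genotype, freqs):
    """P_l for one locus: p_i^2 if homozygous, 2 * p_i * p_j if heterozygous."""
    a, b = genotype
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]

def genotype_log_likelihood(genotypes, pop_freqs):
    """Sum of log P_l over loci (HWE within loci, no linkage across loci)."""
    return sum(math.log(locus_prob(g, f)) for g, f in zip(genotypes, pop_freqs))

# Hypothetical allele frequencies at two loci in one population.
pop_freqs = [{"A": 0.7, "a": 0.3}, {"B": 0.5, "b": 0.5}]
# One individual: homozygous A/A at locus 1, heterozygous B/b at locus 2.
individual = [("A", "A"), ("B", "b")]

loglik = genotype_log_likelihood(individual, pop_freqs)
likelihood = math.exp(loglik)  # 0.7^2 * (2 * 0.5 * 0.5)
```

Working on the log scale avoids underflow when there are many loci; an allele with frequency exactly zero would need special handling in this sketch.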
Heuristics
• If we knew the population allele frequencies
in advance, then it would be easy to assign
individuals (using Bayes rule).
$$\Pr(Z_i = k \mid G_i, A_1, \dots, A_K) = \frac{P(G_i \mid Z_i = k, A)\,P(Z_i = k \mid A)}{\sum_{j=1}^{K} P(G_i \mid Z_i = j, A)\,P(Z_i = j \mid A)}$$
• If we knew the individual assignments, it
would be easy to estimate frequencies.
• In practice, we don’t know either of these,
but we have the Gibbs sampler!
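The first heuristic (assignment with known frequencies) can be sketched in a few lines of Python. This is an illustration only: the function names, the uniform default prior and the example frequencies are assumptions, not part of the Structure program.

```python
def genotype_likelihood(genotypes, pop_freqs):
    """P(G_i | Z_i = k, A): product over loci of p^2 or 2*p*q (HWE and LE)."""
    lik = 1.0
    for (a, b), f in zip(genotypes, pop_freqs):
        lik *= f[a] ** 2 if a == b else 2 * f[a] * f[b]
    return lik

def assignment_posterior(genotypes, freqs_by_pop, prior=None):
    """Bayes rule: posterior probability of each population of origin."""
    K = len(freqs_by_pop)
    prior = prior or [1.0 / K] * K  # assumed uniform prior P(Z_i = k | A)
    joint = [genotype_likelihood(genotypes, freqs_by_pop[k]) * prior[k]
             for k in range(K)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical allele frequencies for K = 2 populations at two loci.
freqs_by_pop = [
    [{"A": 0.9, "a": 0.1}, {"B": 0.8, "b": 0.2}],  # population 1
    [{"A": 0.2, "a": 0.8}, {"B": 0.3, "b": 0.7}],  # population 2
]
posterior = assignment_posterior([("A", "A"), ("B", "B")], freqs_by_pop)
```

An individual carrying alleles that are common in population 1 and rare in population 2 receives nearly all of its posterior mass on population 1.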
MCMC algorithm (for fixed K)
• Start with random assignment of individuals
to populations
– Step 1: Gene frequencies in each population are
estimated based on the individuals that are
assigned to it.
– Step 2: Individuals are assigned to populations
based on gene frequencies in each population.
• And this is repeated...
• Estimation of K performed separately
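The two alternating steps can be sketched as a toy Gibbs sampler. This is not the Structure implementation: it assumes the no-admixture model, biallelic loci coded as 0/1/2 copies of a reference allele, a Beta(1, 1) prior on allele frequencies and a uniform prior over populations. The names `gibbs_structure` and `simulate` are hypothetical.

```python
import math
import random

def gibbs_structure(genos, K, n_iter=200, seed=0):
    """Toy sampler for fixed K; genos[i][l] = copies (0/1/2) of the reference allele."""
    rng = random.Random(seed)
    n, L = len(genos), len(genos[0])
    z = [rng.randrange(K) for _ in range(n)]  # random initial assignment
    freqs = []
    for _ in range(n_iter):
        # Step 1: sample allele frequencies given the current assignments.
        freqs = []
        for k in range(K):
            members = [genos[i] for i in range(n) if z[i] == k]
            row = []
            for l in range(L):
                ref = sum(g[l] for g in members)
                alt = 2 * len(members) - ref
                p = rng.betavariate(1 + ref, 1 + alt)
                row.append(min(max(p, 1e-9), 1 - 1e-9))  # keep logs finite
            freqs.append(row)
        # Step 2: sample assignments given the allele frequencies.
        for i in range(n):
            logp = []
            for k in range(K):
                lp = 0.0
                for l in range(L):
                    g, p = genos[i][l], freqs[k][l]
                    # Binomial coefficient omitted: it is the same for every k.
                    lp += g * math.log(p) + (2 - g) * math.log(1 - p)
                logp.append(lp)
            m = max(logp)
            weights = [math.exp(v - m) for v in logp]
            z[i] = rng.choices(range(K), weights=weights)[0]
    return z, freqs

# Simulated data: two groups of 20 individuals, 30 loci, frequencies 0.9 vs 0.1.
data_rng = random.Random(1)

def simulate(n, L, p):
    return [[sum(data_rng.random() < p for _ in range(2)) for _ in range(L)]
            for _ in range(n)]

genos = simulate(20, 30, 0.9) + simulate(20, 30, 0.1)
z, freqs = gibbs_structure(genos, K=2)
```

The sketch returns only the final draw; in practice one would discard a burn-in period and average over many draws. The cluster labels themselves are arbitrary, so only the grouping is meaningful.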
Admixed individuals are mosaics of ancestral populations
Two basic models
Inferred from human populations
More details
Alternative approach
• Structure is very computationally intensive
• Often no clear best-supported K-value
• Alternative is to use traditional multivariate
statistics to find uniform groups
• Principal Components Analysis is the most
commonly used algorithm
• EIGENSOFT (PCA, Patterson et al., 2006;
PLoS Genetics 2:e190)
Principal Component Analysis
• Efficient way to summarize multivariate data like
genotypes
• Each axis passes through maximum variation in
data, explains a component of the variation
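A minimal sketch of PCA on a genotype matrix, using NumPy's SVD. It assumes complete 0/1/2 genotype data and uses one common per-SNP standardization (centering by twice the allele frequency and scaling by the binomial standard deviation); EIGENSOFT's exact normalization and its significance tests are not reproduced here. The function name and the simulated populations are hypothetical.

```python
import numpy as np

def genotype_pca(G, n_components=2):
    """G: (individuals x SNPs) array of 0/1/2 allele counts, no missing data."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                      # allele frequency per SNP
    keep = (p > 0) & (p < 1)                      # drop monomorphic SNPs
    G, p = G[:, keep], p[keep]
    X = (G - 2 * p) / np.sqrt(2 * p * (1 - p))    # center and scale each SNP
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # coordinates on each axis
    explained = S ** 2 / np.sum(S ** 2)               # share of variation per axis
    return scores, explained[:n_components]

# Simulated example: two populations with different allele frequencies.
rng = np.random.default_rng(0)
pop1 = rng.binomial(2, 0.8, size=(30, 200))
pop2 = rng.binomial(2, 0.2, size=(30, 200))
scores, explained = genotype_pca(np.vstack([pop1, pop2]))
```

With strongly differentiated populations, the first axis separates the two groups and accounts for the largest share of the variation.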
Human population assignment with SNPs
• Assayed 500,000 SNP genotypes for 3,192 Europeans
• Used Principal Components Analysis to ordinate samples in space
• High correspondence between sample ordination and geographic
origin of samples
• Individuals assigned to populations of origin with high accuracy
Genetic Association Tests
• Review of typical approach: chi-square test
– 2x3 table (or 2x2 table)
2x3 table (genotype counts):
            AA     Aa     aa     Total
Cases       n11    n12    n13    n1.
Controls    n21    n22    n23    n2.
Total       n.1    n.2    n.3    n..
2x2 table (allele counts):
            A      a      Total
Cases       n11    n12    n1.
Controls    n21    n22    n2.
Total       n.1    n.2    n..
– Alternatively, we can do a logistic regression
$$\log \frac{P(Y = 1)}{P(Y = 0)} = a + bX$$
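As an illustration of the logistic regression alternative, the sketch below fits $a$ and $b$ by Newton-Raphson with NumPy. The function name and the allele counts are hypothetical. For a binary predictor, the fitted slope $b$ equals the log odds ratio of the corresponding 2x2 table, which gives a simple check.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit log[P(Y=1)/P(Y=0)] = a + b*x by Newton-Raphson; returns (a, b)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])     # intercept and predictor
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted P(Y = 1)
        W = mu * (1.0 - mu)
        grad = X.T @ (y - mu)                     # score vector
        hess = X.T @ (X * W[:, None])             # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta[0], beta[1]

# Hypothetical allele-level 2x2 table: rows = cases/controls, columns = A/a.
n11, n12, n21, n22 = 60, 40, 45, 55
x = np.array([1] * n11 + [0] * n12 + [1] * n21 + [0] * n22)  # X = 1 for allele A
y = np.array([1] * (n11 + n12) + [0] * (n21 + n22))          # Y = 1 for cases
a, b = fit_logistic(x, y)
log_odds_ratio = np.log(n11 * n22 / (n12 * n21))
```

The regression form also allows X to be coded 0/1/2 (copies of an allele) and extra covariates to be added, which the contingency-table test does not.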
Genetic Models and Underlying Hypotheses
Genotypic Model
Genotype           AA     Aa     aa
Genotypic Value    μAA    μAa    μaa
Genotypic value is the expected phenotypic value of a particular genotype.
Hypothesis: all 3 genotypes have different effects
AA vs. Aa vs. aa
Genetic Models and Underlying Hypotheses
Dominant Model
Genotype           AA     Aa     aa
Genotypic Value    μA-    μA-    μaa
Hypothesis: the genetic effects of AA and Aa
are the same (assuming A is the minor allele)
AA and Aa vs. aa
Genetic Models and Underlying Hypotheses
Recessive Model
Genotype           AA     Aa     aa
Genotypic Value    μAA    μa-    μa-
Hypothesis: the genetic effects of Aa
and aa are the same (A is the minor
allele)
AA vs. Aa and aa
Genetic Models and Underlying Hypotheses
Allelic Model
Genotype           AA     Aa        aa
Genotypic Value    2μA    μA + μa   2μa
Hypothesis: the genetic effects of allele A
and allele a are different
A vs. a
Pearson’s Chi-squared Test
Genotypic Model:
Null Hypothesis: Independence
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA     Aa     aa
cases       nAA    nAa    naa
controls    mAA    mAa    maa
df = 2
Pearson’s Chi-squared Test
Dominant Model:
Null Hypothesis: Independence
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA+Aa        aa
cases       nAA + nAa    naa
controls    mAA + mAa    maa
df = 1
Pearson’s Chi-squared Test
Recessive Model:
Null Hypothesis: Independence
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA     Aa+aa
cases       nAA    nAa + naa
controls    mAA    mAa + maa
df = 1
Pearson’s Chi-squared Test
Allelic Model:
Null Hypothesis: Independence
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            A             a
cases       2nAA + nAa    nAa + 2naa
controls    2mAA + mAa    mAa + 2maa
df = 1
Test Statistic
Chi-squared Test Statistic:
$$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$
O is the observed cell count
E is the expected cell count under the null
hypothesis of independence:
$$E = \frac{\text{row total} \times \text{column total}}{N}$$
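The four tests can be sketched with the Python standard library alone. The function names and the example counts are hypothetical. The p-values use the closed-form chi-squared tail probabilities for 1 and 2 degrees of freedom, so no statistics package is needed.

```python
import math

def chi2_stat(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # E = row total * column total / N
            stat += (obs - exp) ** 2 / exp
    return stat

def chi2_pvalue(stat, df):
    """Upper-tail chi-squared probability (closed forms for df = 1 and df = 2)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")

def association_tests(cases, controls):
    """cases, controls: genotype counts (AA, Aa, aa). Returns {model: (stat, p)}."""
    nAA, nAa, naa = cases
    mAA, mAa, maa = controls
    tables = {
        "genotypic": ([[nAA, nAa, naa], [mAA, mAa, maa]], 2),
        "dominant": ([[nAA + nAa, naa], [mAA + mAa, maa]], 1),
        "recessive": ([[nAA, nAa + naa], [mAA, mAa + maa]], 1),
        "allelic": ([[2 * nAA + nAa, nAa + 2 * naa],
                     [2 * mAA + mAa, mAa + 2 * maa]], 1),
    }
    results = {}
    for name, (table, df) in tables.items():
        stat = chi2_stat(table)
        results[name] = (stat, chi2_pvalue(stat, df))
    return results

# Hypothetical genotype counts (AA, Aa, aa).
results = association_tests(cases=(30, 50, 20), controls=(20, 50, 30))
```

A cell with expected count zero would cause a division by zero here; such tables, and tables with small expected counts in general, are better handled by an exact test.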
Other Options
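Fisher’s Exact Test:
When the sample size is small, the asymptotic approximation of
the null distribution is no longer valid. By performing Fisher’s
exact test, the exact significance of the deviation from the null
hypothesis can be calculated.
For a 2 by 2 table with cell counts
  a    b
  c    d
and n = a + b + c + d, the probability of the observed table given the margins is
$$p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$
and the exact p-value is the sum of these probabilities over all tables with the same margins that are at least as extreme as the one observed.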
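A short sketch of the exact test for a 2 by 2 table, using only the standard library. The function names are hypothetical, and the two-sided rule used here (sum the probabilities of all tables no more probable than the observed one) is one common convention, stated as an assumption.

```python
from math import comb

def table_prob(a, b, c, d):
    """Hypergeometric probability of a 2x2 table with fixed margins."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables no more probable than the observed one."""
    p_obs = table_prob(a, b, c, d)
    r1, r2, c1 = a + b, c + d, a + c              # row totals and first column total
    p_value = 0.0
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        p = table_prob(x, r1 - x, c1 - x, r2 - (c1 - x))
        if p <= p_obs * (1 + 1e-9):               # tolerance for floating-point ties
            p_value += p
    return p_value

# Small hypothetical table: a = 3, b = 1, c = 1, d = 3.
p_single = table_prob(3, 1, 1, 3)
p_two_sided = fisher_exact_two_sided(3, 1, 1, 3)
```

The binomial-coefficient form in `table_prob` is algebraically the same as the factorial expression for the probability of a single table.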
Association Tool
PLINK:
http://pngu.mgh.harvard.edu/~purcell/plink/
Case-control, TDT, quantitative traits.
Mapping Quantitative Traits
• Examples: weight, height, blood pressure,
BMI, mRNA expression of a gene, etc.
• Example: F2 intercross mice
Quantitative traits (phenotypes)
133 females from our earlier (NOD × B6) × (NOD × B6) cross
Trait 4 is the log count of a particular white blood cell type.
Another representation of a trait distribution
Note the equivalent of dominance in our trait distributions.
A second example
Note the approximate additivity in our trait distributions here.
Trait distributions: a classical view
In general, we look for a difference in the phenotype distributions of the parental strains before deciding that it is worthwhile to seek genes associated with a trait. But even if there is little difference, there may be many such genes. Our trait 4 is a case like this.
Data and goals
Data
Phenotypes: $y_i$ = trait value for mouse i
Genotype: $x_{ij}$ = 1/0 if mouse i is A/H at marker j (backcross);
need two dummy variables for intercross
Genetic map: Locations of markers
Goals
• Identify the (or at least one) genomic region, called a quantitative
trait locus (QTL), that contributes to variation in the trait
• Form confidence intervals for the QTL location
• Estimate QTL effects
Models: Genotype → Phenotype
• Let y = phenotype,
g = whole genome genotype
• Imagine a small number of QTLs with genotypes
$g_1, \dots, g_p$ ($2^p$ or $3^p$ distinct genotypes for BC, IC resp.).
• We assume
$E(y \mid g) = \mu(g_1, \dots, g_p)$, $\mathrm{var}(y \mid g) = \sigma^2(g_1, \dots, g_p)$
Models: Genotype → Phenotype, ctd
• Homoscedasticity (constant variance)
$\sigma^2(g_1, \dots, g_p) = \sigma^2$ (constant)
• Normality of residual variation
$y \mid g \sim N(\mu_g, \sigma^2)$
• Additivity:
$\mu(g_1, \dots, g_p) = \mu + \sum_j \Delta_j g_j$ ($g_j = 0/1$ for BC)
• Epistasis: Any deviations from additivity.
Additivity, or non-additivity (BC)
Additivity or non-additivity: F2
The simplest method: ANOVA
• Split mice into groups
according to genotype
at a marker
• Do a t-test/ANOVA
• Repeat for each marker
• Adjust for multiplicity
LOD score = log10 likelihood ratio, comparing single-QTL
model to the “no QTL anywhere” model.
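The marker-by-marker scan can be sketched as follows. Under normal errors with a common variance, the log10 likelihood ratio at a fully genotyped marker reduces to (n/2) log10(RSS0/RSS1), where RSS0 is the residual sum of squares around one overall mean and RSS1 is the residual sum of squares around the genotype-group means. The function name and the toy data are hypothetical.

```python
import math

def marker_lod(y, x):
    """LOD at one marker: separate genotype-group means vs one common mean."""
    n = len(y)
    grand = sum(y) / n
    rss0 = sum((v - grand) ** 2 for v in y)        # "no QTL" model
    rss1 = 0.0
    for g in set(x):                               # one mean per genotype group
        grp = [v for v, xi in zip(y, x) if xi == g]
        m = sum(grp) / len(grp)
        rss1 += sum((v - m) ** 2 for v in grp)
    return (n / 2.0) * math.log10(rss0 / rss1)

# Toy backcross data: six mice, two markers coded 0/1.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
markers = [
    [0, 0, 0, 1, 1, 1],   # marker 1: genotype tracks the phenotype
    [0, 1, 0, 1, 0, 1],   # marker 2: weak relationship
]
lods = [marker_lod(y, x) for x in markers]
```

The same function works for an intercross with three genotype groups. The multiplicity adjustment (for example, a permutation-based genome-wide threshold) is not shown.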
Interval mapping (IM)
• Lander & Botstein (1989)
• Take account of missing genotype data (uses the HMM)
• Interpolates between markers
• Maximum likelihood under a mixture model
Interval mapping, cont
• Imagine that there is a single QTL, at position z between two
(flanking) markers
• Let $q_i$ = genotype of mouse i at the QTL, and assume
$y_i \mid q_i \sim \text{Normal}(\mu_{q_i}, \sigma^2)$
• We won’t know $q_i$, but we can calculate
$p_{ig} = \Pr(q_i = g \mid \text{marker data})$
• Then, $y_i$, given the marker data, follows a mixture of normal
distributions, with known mixing proportions (the $p_{ig}$).
• Use an EM algorithm to get MLEs of $\theta = (\mu_A, \mu_H, \mu_B, \sigma)$.
• Measure the evidence for a QTL via the LOD score, which is the log10
likelihood ratio comparing the hypothesis of a single QTL at position z
to the hypothesis of no QTL anywhere.
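The EM step at a single putative QTL position can be sketched as below. The sketch takes the genotype probabilities as given; computing them from flanking markers (the HMM step) is not shown. The function names, starting values, iteration count and toy data are assumptions made for illustration.

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_interval_lod(y, probs, n_iter=200):
    """y: phenotypes; probs[i][g] = Pr(q_i = g | marker data) at position z.

    Returns (LOD, genotype means, sigma)."""
    n, G = len(y), len(probs[0])
    mean0 = sum(y) / n
    sigma0 = math.sqrt(sum((v - mean0) ** 2 for v in y) / n)   # MLE under "no QTL"
    # Start from means weighted by the marker-based genotype probabilities.
    mu = [sum(p[g] * v for p, v in zip(probs, y)) / sum(p[g] for p in probs)
          for g in range(G)]
    sigma = sigma0
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities given phenotype and markers.
        w = []
        for v, p in zip(y, probs):
            num = [p[g] * normal_pdf(v, mu[g], sigma) for g in range(G)]
            tot = sum(num)
            w.append([u / tot for u in num])
        # M-step: weighted means and pooled residual standard deviation.
        mu = [sum(wi[g] * v for wi, v in zip(w, y)) / sum(wi[g] for wi in w)
              for g in range(G)]
        sigma = math.sqrt(sum(wi[g] * (v - mu[g]) ** 2
                              for wi, v in zip(w, y) for g in range(G)) / n)
    ll1 = sum(math.log(sum(p[g] * normal_pdf(v, mu[g], sigma) for g in range(G)))
              for v, p in zip(y, probs))
    ll0 = sum(math.log(normal_pdf(v, mean0, sigma0)) for v in y)
    return (ll1 - ll0) / math.log(10), mu, sigma

# Toy backcross example with partially informative genotype probabilities.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
probs = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3
lod, mu, sigma = em_interval_lod(y, probs)
```

When the genotype probabilities are all 0 or 1 (the position coincides with a fully typed marker), the mixture collapses and the LOD equals the single-marker value (n/2) log10(RSS0/RSS1). Repeating the calculation over a grid of positions z gives the LOD curve.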
Epistasis, interactions, etc
• How to find interactions?
– Stepwise regression
– BEAM (Zhang and Liu 2007)
Naïve Bayes model
[Diagram: node Y with an edge to each of X1, X2, X3, ..., Xm]
Augmented Naïve Bayes
[Diagram: Y with the markers partitioned into Group 0 (X01, X02), Group 1 (X11, X12, X13), Group 21 (X2.11, X2.12, X2.13) and Group 22 (X2.21, X2.22)]
Variable Selection with Interaction
Let $Y \in \mathbb{R}$ be a univariate response variable and $X \in \mathbb{R}^p$
be a vector of p continuous predictor variables
$Y = X_1 \times X_2 + \epsilon$, $\epsilon \sim N(0, \sigma^2)$, $X \sim \mathrm{MVN}(0, I_p)$
Suppose $p = 1000$. How to find $X_1$ and $X_2$?
One-step forward selection: ∼500,000 interaction terms
Is there any marginal relationship between Y and $X_1$?
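A quick simulation illustrates the question. Since $\mathrm{Cov}(Y, X_1) = E[X_1^2 X_2] = 0$ under this model, the marginal correlation between Y and $X_1$ is zero in expectation, while Y is clearly correlated with the product $X_1 X_2$. The sample size, the seed and $\sigma = 1$ below are assumed values for illustration.

```python
import math
import numpy as np

p = 1000
n_pairs = math.comb(p, 2)   # number of candidate pairwise interaction terms

rng = np.random.default_rng(0)
n = 20000                                  # assumed sample size
X = rng.standard_normal((n, 2))            # only X1 and X2 are needed for this check
eps = rng.standard_normal(n)               # sigma = 1 is an assumed value
Y = X[:, 0] * X[:, 1] + eps

corr_marginal = np.corrcoef(Y, X[:, 0])[0, 1]                 # Y vs X1
corr_interaction = np.corrcoef(Y, X[:, 0] * X[:, 1])[0, 1]    # Y vs X1 * X2
```

A scan based on marginal correlations would therefore miss $X_1$ and $X_2$ entirely, and a search over all pairs has to consider `n_pairs` terms.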
[Figures: plots of y against x1 and x2, annotated with $\hat\sigma^2_{(1)} = 2.24$, $\hat\sigma^2_{(2)} = 0.97$, $\hat\sigma^2_{(3)} = 0.42$]
Acknowledgment
• Terry Speed (some of the slides)
• Karl Broman (U of Wisconsin)
• Steven P. DiFazio (West Virginia U)