CSE291: Personal genomics for bioinformaticians
Class meetings: TR 3:30-4:50 MCGIL 2315
Office hours: M 3:00-5:00, W 4:00-5:00 CSE 4216
Contact: [email protected]
Today’s schedule:
• 3:30-4:10 Intro to GWAS
• 4:10-4:15 Break
• 4:15-4:45 Controlling for confounders (and more)
• 4:45-4:50 Go over PS3
Announcements:
• PS3 out. Time to work on it Thursday
• Journal club discussion Thursday
• Please fill out evaluations on Dr. Bloss’s lecture
Complex Traits
• Tuesday, January 31
• Intro to GWAS
• Controlling for confounding
• Thursday, February 2
• Risk prediction + paper discussion (Martin et al., Kong et al.)
• Time to work on PS3
• Tuesday, February 7
• Missing heritability
• Discuss project assignment (+midterm evals)
• Thursday, February 9 (PS3 due)
• Scaling GWAS to millions of people + paper discussion
Introduction to GWAS
CSE291: Personal Genomics for
Bioinformaticians
01/31/17
Outline
• Intro to complex traits
• Quantitative traits
• Case/control studies
• Controlling for confounders
• GWAS best practices
• PS3 overview
Intro to complex traits
Mendelian vs. complex traits
Mendelian -> one gene, one trait
Example: cystic fibrosis (CFTR)
Complex -> multiple genes, one (or more) trait(s)
Examples:
height, cholesterol (quantitative);
schizophrenia, diabetes, Crohn’s disease (case/control)
What is a genome-wide association study? (GWAS)
[Figure: genotypes (AA, AG, GG) at one SNP for twelve individuals, split into controls (e.g. no diabetes) and cases (e.g. diabetes)]
Allele frequencies:
            P(A)    P(G)
CONTROLS    0.66    0.33
CASES       0.33    0.66
GWAS challenges
Sounds easy! But…
• Requires diligent QC to avoid technical artifacts
• Multiple hypothesis correction (performing millions of
tests!)
• Controlling for confounding factors (e.g. population
structure)
• Correlation does not imply causality, hard to pinpoint the
causal variant
• Doesn’t directly give any biological insight
• Requires huge sample sizes to have power to detect
Is GWAS relevant to personal genomics?
Interpreting one genome requires tens of thousands of genomes
- Daniel MacArthur
vs.
• Risk prediction (although not yet very good for
most traits)
• Extract biological insight, could lead to new
therapies
• Learn which traits are controlled primarily by
genetic vs. other (environmental) components
Heritability defined
Your phenotype (P) is the sum of effects of your genetics
(G) and your environment (E):
$P = G + E$
Looking at a population, we’re interested in phenotypic
variance:
$V_P = V_G + V_E + V_{GE}$
Heritability describes how much of the variation in the
phenotype is explained by genetics:
$h^2 = V_G / V_P$
* Broad sense (all genetic effects) vs. narrow sense (only additive)
Measuring heritability – twin studies
$h^2 = 2(r_{MZ} - r_{DZ})$
Examples:
$h^2$(height) = 2(0.92 - 0.47) = 0.9
$h^2$(blood pressure) = 2(0.59 - 0.29) = 0.6
$h^2$(Mental/behavioral disorder – alcohol) = 2(0.55 - 0.30) = 0.5
Polderman et al. Nature Genetics 201
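The twin-study estimate is simple arithmetic on two correlations. A minimal sketch, assuming the two inputs are the trait correlations between monozygotic and dizygotic twin pairs; the function name `falconer_h2` and the dictionary layout are illustrative and not from the slides:

```python
def falconer_h2(r_mz, r_dz):
    """Twin-study heritability estimate: h2 = 2 * (r_MZ - r_DZ)."""
    return 2 * (r_mz - r_dz)


# Twin correlations (r_MZ, r_DZ) as quoted in the examples
twin_correlations = {
    "height": (0.92, 0.47),
    "blood pressure": (0.59, 0.29),
    "alcohol-related disorder": (0.55, 0.30),
}

# Round to two decimals to hide floating-point noise
h2_estimates = {
    trait: round(falconer_h2(r_mz, r_dz), 2)
    for trait, (r_mz, r_dz) in twin_correlations.items()
}
print(h2_estimates)
```

The rounded values match the 0.9, 0.6 and 0.5 quoted in the examples.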
Quantitative traits
Example: height
80% heritable
697 variants significantly associated with height
Testing association with a single SNP
Height    Genotype
10        AG
10        GG
11        AG
12        GG
12        GG
13        AG
14        AA
15        AG
16        AG
16        AA
[Figure: height plotted against genotype]
Known:
Y: Phenotype (usually standardized to N(0,1))
X: Genotype (0, 1, or 2 minor alleles), e.g. AA (0), AG (1), GG (2)
$Y = \beta X + \varepsilon$
Infer:
β: Effect size (direction and magnitude)
ε: Error term (e.g. environmental, technical, noise)
P-value: probability of an estimate of β at least this extreme if the true β=0
If Var(Y)=1 and Var(X)=1, $\beta^2$ gives the fraction of variance
explained
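The single-SNP test can be tried on the ten-sample height table. This is a sketch under stated assumptions, not the course's reference solution: genotypes are coded as the number of G alleles, the fit includes an intercept, heights are left unstandardized, and the p-value comes from a permutation test rather than the usual t-test. All function names are illustrative.

```python
import random

heights = [10, 10, 11, 12, 12, 13, 14, 15, 16, 16]
genotypes = ["AG", "GG", "AG", "GG", "GG", "AG", "AA", "AG", "AG", "AA"]

# Code each genotype as the number of G alleles: AA=0, AG=1, GG=2
x = [g.count("G") for g in genotypes]


def _sums(x, y):
    """Return centered cross-products Sxy, Sxx, Syy."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy, sxx, syy


def slope(x, y):
    """Least-squares effect size (beta) of y on x, with an intercept."""
    sxy, sxx, _ = _sums(x, y)
    return sxy / sxx


def variance_explained(x, y):
    """Squared correlation; equals beta^2 when X and Y both have variance 1."""
    sxy, sxx, syy = _sums(x, y)
    return sxy ** 2 / (sxx * syy)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Fraction of phenotype shuffles with |beta| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


beta = slope(x, heights)
r2 = variance_explained(x, heights)
p_perm = permutation_pvalue(x, heights)
print(beta, r2, p_perm)
```

In this toy table the estimated effect is negative: each additional G allele is associated with lower height.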
Extending the analysis genome-wide
$Y = \beta_1 X_1 + \varepsilon_1$
$Y = \beta_2 X_2 + \varepsilon_2$
$Y = \beta_3 X_3 + \varepsilon_3$
…
$Y = \beta_n X_n + \varepsilon_n$
n = number of SNPs tested
Millions of tests -> huge multiple hypothesis correction
• With 1.5 million SNPs and a p-value threshold of p=0.05, we expect
about 75K significant hits by chance alone!
• Most common solutions:
  • Bonferroni correction with threshold $p = 5 \times 10^{-8}$
  • Control the false discovery rate
  • Permutation approaches
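The arithmetic behind these numbers can be checked directly. A minimal sketch: the first two functions reproduce the 75K figure and show that 5e-8 equals 0.05 divided by one million tests (one million is an assumption made here for illustration). The third is a plain implementation of the Benjamini-Hochberg procedure, one common way to control the false discovery rate. The toy p-values are made up.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of null tests passing the threshold by chance."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvalues, fdr=0.05):
    """Return the indices of tests rejected at the given false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    # Find the largest rank k whose p-value is at most (k / m) * fdr
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            n_reject = rank
    return sorted(order[:n_reject])


n_expected = expected_false_positives(1_500_000, 0.05)
threshold = bonferroni_threshold(1_000_000)

# Made-up p-values for illustration only
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(toy_pvalues)
print(n_expected, threshold, rejected)
```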
The GIANT Consortium
• Genome-wide data + height for 253,288 individuals
• 697 variants at genome-wide significance clustered in 423 loci
• These explain only one-fifth of the heritability!
• Hypothesize thousands of variants involved in height
Wood et al. Nature Genetics 2014
Table from DNA.land
GWAS visualization – Manhattan Plots
Linkage disequilibrium results in “skyscrapers”
Wood et al. Nature Genetics 2014
Make sure to LOOK at your data!
Longevity GWAS Example:
GWAS visualization – QQ Plots
[Figure: QQ plot; x-axis: expected p-value under the null (-log10); y-axis: observed p-value (-log10)]
Q-Q plots: are two distributions roughly the same?
X-axis (Expected values):
• Generate p-values from the “null distribution” (often the
uniform distribution over [0, 1]).
• Sort and take –log10 of each
Y-axis (Observed values):
• Sort observed p-values from your study and take –log10 of each
Plot observed vs. expected
On the diagonal: no different than expected by chance
Above the diagonal: signal!
Way below or way above: indicates problem
QQ plot helpful code
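import numpy as np
import matplotlib.pyplot as plt

n = 1000  # placeholder: use the number of p-values in your study
# Generate n values from uniform distribution (expected p-values under the null)
X = np.random.uniform(low=0, high=1, size=n)
# Take -log10 of an array
X_log10 = -1 * np.log10(X)
# Sort an array
X_sorted = np.sort(X_log10)
# Placeholder for observed data: replace with -log10 of your study's p-values
Y_sorted = np.sort(-1 * np.log10(np.random.uniform(low=0, high=1, size=n)))
# Scatter plot of observed vs. expected
plt.scatter(X_sorted, Y_sorted)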
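The same idea can be written without random draws. This sketch swaps in evenly spaced quantiles of the uniform distribution for the expected p-values, which is a different choice from the random sampling used in the helper code. The function name `qq_points` is illustrative, and observed p-values are assumed to be strictly greater than 0.

```python
import math


def qq_points(observed_pvalues):
    """Return (expected, observed) pairs of sorted -log10 p-values."""
    n = len(observed_pvalues)
    # Expected p-values under the null: evenly spaced quantiles of Uniform(0, 1)
    expected = [(i + 0.5) / n for i in range(n)]
    exp_log = sorted(-math.log10(p) for p in expected)
    obs_log = sorted(-math.log10(p) for p in observed_pvalues)
    return list(zip(exp_log, obs_log))


# Perfectly null-like input: every point falls on the diagonal
null_like = [(i + 0.5) / 100 for i in range(100)]
points = qq_points(null_like)
print(points[:3])
```

Passing each pair to a scatter plot gives the QQ plot; points above the diagonal correspond to observed p-values smaller than expected.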
Curses and blessings of linkage disequilibrium
Blessings:
• Test a small subset of SNPs, impute missing ones, and test all for
association
• Even if we have never genotyped the true causal variant, GWAS may
identify it if it is in LD with nearby SNPs
Curses:
• Association != causality! Rarely is the top SNP association a causal
variant
• Often multiple variants in perfect LD, impossible to determine the
causal variant
The SIGMA Type 2 Diabetes Consortium
Case/control studies
Example: schizophrenia
Heritability: 80%
108 genome-wide significant loci
Explain ~2-3% of heritability
Schizophrenia Working Group of the Psychiatric Genomics Consortium
Testing a single SNP – χ2 test
[Figure: genotypes (TT, TG, GG) at one SNP for controls (e.g. no diabetes) and cases (e.g. diabetes)]
Genotype counts:
            TT    TG    GG
CONTROLS    3     2     1
CASES       1     2     3
Testing a single SNP – χ2 test
Genotype counts
            TT    TG    GG    Total
CONTROLS    r0    r1    r2    R
CASES       s0    s1    s2    S
TOTAL       n0    n1    n2    N

Observed allele counts
            T         G         Total
CONTROLS    2r0+r1    r1+2r2    2R
CASES       2s0+s1    s1+2s2    2S
TOTAL       2n0+n1    n1+2n2    2N

Expected allele counts
            T                G                Total
CONTROLS    (2n0+n1)(R/N)    (n1+2n2)(R/N)    2R
CASES       (2n0+n1)(S/N)    (n1+2n2)(S/N)    2S
TOTAL       2n0+n1           n1+2n2           2N

Chi-square test:
$\sum_i (\mathrm{obs}_i - \mathrm{exp}_i)^2 / \mathrm{exp}_i \sim \chi^2$ with 1 degree of freedom -> p-value
e.g. plink --assoc
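The allelic test can be worked through on the toy genotype counts (controls 3/2/1, cases 1/2/3 for TT/TG/GG). This is an illustrative sketch and not what plink runs internally; the function name is hypothetical. For 1 degree of freedom the chi-square tail probability equals erfc(sqrt(chi2 / 2)), so only the standard library is needed.

```python
import math


def allelic_chi_square(controls, cases):
    """controls, cases: genotype counts (TT, TG, GG). Returns (chi2, p_value)."""
    r0, r1, r2 = controls
    s0, s1, s2 = cases
    observed = [
        [2 * r0 + r1, r1 + 2 * r2],  # controls: T alleles, G alleles
        [2 * s0 + s1, s1 + 2 * s2],  # cases: T alleles, G alleles
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if allele and case/control status are independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = allelic_chi_square(controls=(3, 2, 1), cases=(1, 2, 3))
print(chi2, p_value)
```

With only twelve individuals the statistic is 8/3 and the p-value is about 0.10, so this toy example is not significant on its own.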
Odds ratio
Observed allele counts
            T           G
CONTROLS    A=2r0+r1    C=r1+2r2
CASES       B=2s0+s1    D=s1+2s2
Odds: P(event occurs)/P(event doesn’t occur) =
P(event occurs)/(1-P(event occurs))
Odds that T occurs in a case: B/A
Odds that G occurs in a case: D/C
Odds ratio:
(B/A)/(D/C) = BC/AD
OR=1 no association
OR>1 T increases risk
OR<1 T decreases risk
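A short sketch of the odds ratio using the same A, B, C, D definitions, applied to the toy genotype counts from the χ2 slide (controls 3/2/1, cases 1/2/3 for TT/TG/GG). The function name is illustrative, and the sketch assumes none of the allele counts is zero.

```python
def allelic_odds_ratio(controls, cases):
    """controls, cases: genotype counts (TT, TG, GG). Returns the odds ratio for T."""
    r0, r1, r2 = controls
    s0, s1, s2 = cases
    a = 2 * r0 + r1  # T alleles in controls
    c = r1 + 2 * r2  # G alleles in controls
    b = 2 * s0 + s1  # T alleles in cases
    d = s1 + 2 * s2  # G alleles in cases
    return (b * c) / (a * d)


odds_ratio = allelic_odds_ratio(controls=(3, 2, 1), cases=(1, 2, 3))
print(odds_ratio)  # 0.25
```

Here OR<1, so in this toy data the T allele is associated with lower risk.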
Testing a single SNP – Logistic regression
$Y_i$: phenotype for sample i ($Y_i = 1$ for case, $Y_i = 0$ for control)
$X_i$: genotype for sample i (0, 1 or 2 for TT, TG, GG)
$p_i = E[Y_i \mid X_i]$: expected phenotype given genotype of sample i
$\mathrm{logit}(p_i) = \log_e[p_i / (1 - p_i)]$
$\mathrm{logit}(p_i) \sim \beta_0 + \beta_1 X_i + \beta_2 C_i + \beta_3 D_i + \dots$
Test: is $\beta_1$ significantly different from 0?
$e^{\beta_1}$ gives the estimated odds ratio
Advantage: adding covariates (here $C_i$, $D_i$, …)
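A minimal sketch of the single-SNP logistic regression without covariates, fitted by Newton's method on the toy counts from the χ2 slide (controls 3/2/1, cases 1/2/3 for TT/TG/GG). The function name, the fixed iteration count and the Wald test for β1 are choices made here for illustration. In practice a statistics package would be used, and covariates would add columns to the model.

```python
import math


def fit_logistic(x, y, n_iter=25):
    """Fit logit(p) = b0 + b1 * x by Newton's method; return (b0, b1, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error from the information matrix of the last iteration,
    # which is at convergence for this toy data
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Genotypes coded as number of G alleles (TT=0, TG=1, GG=2)
x = [0, 0, 0, 1, 1, 2] + [0, 1, 1, 2, 2, 2]
y = [0] * 6 + [1] * 6  # 0 = control, 1 = case

b0, b1, se_b1 = fit_logistic(x, y)
odds_ratio = math.exp(b1)
# Two-sided Wald p-value for the null hypothesis b1 = 0
p_value = math.erfc(abs(b1 / se_b1) / math.sqrt(2))
print(b1, odds_ratio, p_value)
```

Because X counts G alleles here, the fitted odds ratio is per additional G allele. For this toy data it comes out at 3.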
Controlling for confounders
What is a confounder?
[Diagram: a confounder influences both the independent variable and the outcome]
Confounders can:
• Introduce bias (giving false positives)
• Increase variance (giving false negatives)
Common confounders in GWAS
[Diagram: a confounder (e.g. Northern European ancestry) influences both genotype and height]
Many possible confounders in GWAS!
• Age
• Sex
• Diet
• Lifestyle
• Environment
• Ethnicity
GWAS for height on Europeans without correcting for ancestry gives…
LCT (the lactose intolerance gene!) as the top hit
• Northern Europeans are taller on average
• Northern Europeans have higher frequency of lactose tolerance than
southern Europeans
This locus is likely not causal for height!
Detecting evidence of confounding – the art of QQ
[Figure: QQ plot; x-axis: Expected (-log10 p); y-axis: Observed (-log10 p)]
• Follows diagonal -> null distribution
• Above the diagonal -> population structure or other confounding
(or highly polygenic signal)
• Strong departure from the null -> true hits?
• Below the diagonal -> could be underpowered?
Controlling for population stratification – take 1
Height (Y)    Genotype (X)    Population (P)    P (coded)
10            AG              CEU               0
10            GG              YRI               1
11            AG              CEU               0
12            GG              CEU               0
12            GG              CEU               0
13            AG              YRI               1
14            AA              YRI               1
15            AG              CEU               0
16            AG              CEU               0
16            AA              YRI               1
$Y = \beta X + \gamma_1 P$
• Relies on self-reported ancestry and discrete population groups
• Only captures high-level population structure. Doesn’t accurately
capture admixture or cryptic relatedness
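The effect of adding the population covariate can be seen on the ten-sample table. This sketch assumes the model also contains an intercept. It uses the fact that adjusting for a categorical covariate is equivalent to centering X and Y within each group and then regressing the residuals. Genotypes are coded as the number of G alleles; the function names are illustrative.

```python
from collections import defaultdict

heights = [10, 10, 11, 12, 12, 13, 14, 15, 16, 16]
genotypes = ["AG", "GG", "AG", "GG", "GG", "AG", "AA", "AG", "AG", "AA"]
populations = ["CEU", "YRI", "CEU", "CEU", "CEU", "YRI", "YRI", "CEU", "CEU", "YRI"]

# Code each genotype as the number of G alleles: AA=0, AG=1, GG=2
x = [g.count("G") for g in genotypes]


def center_within_groups(values, groups):
    """Subtract each group's mean, i.e. regress out a categorical covariate."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def slope_through_origin(x, y):
    """Least-squares slope for already-centered data."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Unadjusted: center on the overall mean (a single group)
everyone = ["all"] * len(x)
beta_naive = slope_through_origin(
    center_within_groups(x, everyone), center_within_groups(heights, everyone)
)

# Adjusted for population: center within CEU and YRI separately
beta_adjusted = slope_through_origin(
    center_within_groups(x, populations), center_within_groups(heights, populations)
)
print(beta_naive, beta_adjusted)
```

The two estimates differ (about -1.82 unadjusted and -1.98 adjusted), which shows that the covariate changes the inferred effect even in a tiny example.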
Controlling for population stratification – take II
Height (Y)    Genotype (X)    PC1     PC2     …    PCn
10            AG              0.1     -0.1    …    -0.1
10            GG              0.3     0.2     …    0.1
11            AG              -0.4    -0.1    …    -0.5
12            GG              0.1     0.1     …    0.1
12            GG              0.2     0.5     …    0.6
13            AG              0.3     -0.3    …    0.2
14            AA              0.4     0.2     …    -0.2
15            AG              -0.2    -0.1    …    -0.3
16            AG              0.5     0.6     …    0.7
16            AA              -0.5    -0.4    …    0.2
$Y = \beta X + \gamma_1 PC_1 + \gamma_2 PC_2 + \dots + \gamma_n PC_n$
• How many PCs? Typically ~10. See theory in Price et al.
PLoS 2006
• Advantages/disadvantages in taking too many/too few PCs?
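A sketch of the PC-adjusted regression on simulated data. Everything here is hypothetical: the two populations, their allele frequencies and the phenotype are generated at random, and the phenotype depends on population only, so any SNP association is due to stratification. PCs are taken from an SVD of the standardized genotype matrix, one common way to compute them, and are not necessarily what a given GWAS tool does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_snps, n_pcs = 100, 200, 2

# Simulated data: two populations with different allele frequencies
freq_a = rng.uniform(0.1, 0.9, size=n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=n_snps), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
]).astype(float)
population = np.repeat([0, 1], n_per_pop)

# Phenotype depends on population only (no true genetic effect)
y = 1.0 * population + rng.normal(0, 1, size=2 * n_per_pop)

# Drop monomorphic SNPs, standardize each SNP, then take the top PCs via SVD
geno = geno[:, geno.std(axis=0) > 0]
z = (geno - geno.mean(axis=0)) / geno.std(axis=0)
u, s, _ = np.linalg.svd(z, full_matrices=False)
pcs = u[:, :n_pcs] * s[:n_pcs]


def snp_effect(x, y, covariates=None):
    """Least-squares coefficient on x in y ~ 1 + x (+ covariates)."""
    columns = [np.ones_like(y), x]
    if covariates is not None:
        columns.extend(covariates.T)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


# Test the SNP whose allele frequency differs most between the populations
freq_diff = np.abs(geno[:n_per_pop].mean(axis=0) - geno[n_per_pop:].mean(axis=0))
snp = int(np.argmax(freq_diff))
beta_naive = snp_effect(geno[:, snp], y)
beta_adjusted = snp_effect(geno[:, snp], y, covariates=pcs)

# How well the first PC tracks population membership
pc1_pop_corr = abs(np.corrcoef(pcs[:, 0], population)[0, 1])
print(beta_naive, beta_adjusted, pc1_pop_corr)
```

In a simulation like this the first PC should largely separate the two populations, and the adjusted estimate would typically move toward zero. The exact numbers depend on the random seed.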
GWAS best practices
Pipeline overview
Reed et al. 2015
Extracting biological insight – genome-wide
Finucane, et al. Biorxiv 201
Extracting biological insight – per locus
• Statistical fine mapping
• Incorporate other “functional data”
• Experimental validation
The SIGMA Type 2 Diabetes Consortium
PS3 Overview
Problem 1: A standard GWAS
Problem 2: Controlling for population structure
Problem 3: Eye color prediction
Walsh et al. 2011
Journal club papers for Thursday (Risk prediction):
• “Human demographic history impacts genetic risk prediction across diverse populations”
• “Selection against variants in the genome associated with educational attainment”