Lecture-Day4- Genome-Wide Association Studies

Genome-Wide Association Studies
Caitlin Collins, Thibaut Jombart
MRC Centre for Outbreak Analysis and Modelling
Imperial College London
Genetic data analysis using
30-10-2014
Outline
• Introduction to GWAS
• Study design
o GWAS design
o Issues and considerations in GWAS
• Testing for association
o Univariate methods
o Multivariate methods
• Penalized regression methods
• Factorial methods
Genomics & GWAS
The genomics revolution
• Sequencing technology
o 1977 – Sanger
o 1995 – 1st bacterial genomes
• < 10,000 bases per day per machine
o 2003 – 1st human genome
• > 10,000,000,000,000 bases per day per machine
• GWAS publications
o 2005 – 1st GWAS
• Age-related macular degeneration
o 2014 – 1,991 publications
• 14,342 associations
A few GWAS discoveries…
So what is GWAS?
• Genome-Wide Association Study
o Looking for SNPs… associated with a phenotype.
• Purpose:
o Explain
• Understanding
• Mechanisms
• Therapeutics
o Predict
• Intervention
• Prevention
• Understanding not required
Association
• Definition
o Any relationship between two measured quantities that renders them statistically dependent.
• Heritability
o The proportion of variance explained by genetics
o P = G + E + G*E
• Heritability > 0
[Figure: genotype matrix of n individuals (Controls, Cases) by p SNPs]
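The decomposition P = G + E + G*E can be restated in variance terms. This is a sketch only: it assumes G and E are uncorrelated (not stated on the slide) and reads "heritability" in the broad sense, written here as $H^2$.

```latex
\begin{align}
\operatorname{Var}(P) &= \operatorname{Var}(G) + \operatorname{Var}(E) + \operatorname{Var}(G \times E) \\
H^2 &= \frac{\operatorname{Var}(G)}{\operatorname{Var}(P)}
\end{align}
```

Under this reading, "Heritability > 0" means Var(G) > 0: at least some phenotypic variance is attributable to genetic differences, so there is something for an association study to find.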
Why?
• Environment, Gene-Environment interactions
• Complex traits, small effects, rare variants
• Gene expression levels
• GWAS methodology?
Study Design
GWAS design
• Case-Control
o Well-defined “case”
o Known heritability
• Variations
o Quantitative phenotypic data
• E.g. height, biomarker concentrations
o Explicit models
• E.g. dominant or recessive
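The "explicit models" above amount to different numeric codings of the same genotype. A minimal sketch, assuming biallelic SNPs scored as a minor-allele count of 0, 1 or 2; the function name and the "additive" default are illustrative choices, not taken from the slides.

```python
def encode_genotype(n_minor, model="additive"):
    """Code a genotype from its minor-allele count (0, 1 or 2)."""
    if n_minor not in (0, 1, 2):
        raise ValueError("minor-allele count must be 0, 1 or 2")
    if model == "additive":
        return n_minor              # each copy of the allele adds to the effect
    if model == "dominant":
        return int(n_minor >= 1)    # one copy is enough
    if model == "recessive":
        return int(n_minor == 2)    # two copies are required
    raise ValueError(f"unknown model: {model}")


# Hypothetical genotypes for five individuals at one SNP
genotypes = [0, 1, 2, 1, 0]
dominant = [encode_genotype(g, "dominant") for g in genotypes]
recessive = [encode_genotype(g, "recessive") for g in genotypes]
print(dominant, recessive)
```

The coded vector then takes the place of the raw genotype in whichever association test is used.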
Issues & Considerations
• Data quality
o 1% rule
• Controlling for confounding
o Sex, age, health profile
o Correlation with other variables
• Population stratification**
• Linkage disequilibrium**
Population stratification
• Definition
o “Population stratification” = population structure
o Systematic difference in allele frequencies between subpopulations…
• … possibly due to different ancestry
• Problem
o Violates assumed population homogeneity, independent observations
• → Confounding, spurious associations
o Case population more likely to be related than Control population
• → Over-estimation of significance of associations
Population stratification II
• Solutions
o Visualise
• Phylogenetics
• PCA
o Correct
• Genomic Control
• Regression on Principal Components of PCA
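Genomic control, one of the corrections listed above, can be sketched in a few lines. It assumes each SNP has already been given a 1-d.f. chi-squared association statistic. The inflation factor lambda is the ratio of the observed median statistic to the median expected under no association (about 0.4549). The input values are hypothetical.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-squared distribution with 1 d.f.


def genomic_control(chi2_stats):
    """Return the inflation factor lambda and the deflated statistics."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # by convention the statistics are never inflated
    return lam, [s / lam for s in chi2_stats]


# Hypothetical per-SNP chi-squared statistics from a stratified sample
stats = [0.2, 0.5, 0.9, 1.4, 2.3, 12.0]
lam, adjusted = genomic_control(stats)
print(lam, adjusted)
```

A lambda well above 1 suggests that the test statistics are inflated genome-wide, which is the signature of stratification rather than of a few true associations.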
Linkage disequilibrium (LD)
• Definition
o Alleles at separate loci are NOT independent of each other
• Problem?
o Too much LD is a problem
• → noise >> signal
o Some (predictable) LD can be beneficial
• → enables use of “marker” SNPs
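The definition above (alleles at two loci not being independent) is usually quantified by D and r². A minimal sketch, assuming phased haplotypes with alleles coded 0/1 at two biallelic loci; the data are made up for illustration.

```python
def ld_stats(haplotypes):
    """Return (D, r^2) for a list of (allele_A, allele_B) haplotypes coded 0/1.

    Both loci must be polymorphic, otherwise r^2 is undefined.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # departure from independence
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Perfect LD: allele 1 at locus A always travels with allele 1 at locus B
linked = [(1, 1)] * 4 + [(0, 0)] * 4
# No LD: all four haplotypes equally frequent
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

print(ld_stats(linked))
print(ld_stats(unlinked))
```

An r² close to 1 is the "predictable" LD that lets a genotyped marker SNP stand in for an untyped neighbour.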
Testing for Association
Methods for association testing
• Standard GWAS
o Univariate methods
• Incorporating interactions
o Multivariate methods
• Penalized regression methods (LASSO)
• Factorial methods (DAPC-based FS)
Univariate methods
• Approach
o Individual test statistics
o Correction for multiple testing
• Variations
o Testing
• Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA
• Gold Standard: Fisher’s exact test
o Correcting
• Bonferroni
• Gold Standard: FDR
[Figure: genotype matrix of n individuals (Controls, Cases) by p SNPs]
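The two steps above (one test per SNP, then a correction across SNPs) can be sketched with the standard library alone. The sketch assumes 2×2 allele-count tables per SNP and takes "FDR" to mean the Benjamini-Hochberg procedure, which the slides do not specify. The counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical allele counts per SNP:
# (cases minor, cases major, controls minor, controls major)
tables = [(30, 10, 12, 28), (20, 20, 19, 21), (25, 15, 18, 22)]
pvals = [fisher_exact_two_sided(*t) for t in tables]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive and so becomes very strict as the number of SNPs grows. The FDR approach instead controls the expected proportion of false positives among the SNPs declared significant.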
Univariate – Strengths & weaknesses
Strengths
• Straightforward
• Computationally fast
• Conservative
• Easy to interpret
Weaknesses
• Multivariate system, univariate framework
• Effect size of individual SNPs may be too small
• Marginal effects of individual SNPs ≠ combined effects
What about interactions?
Interactions
• Epistasis
o “Deviation from linearity under a general linear model”
$\mu_i = \delta + \beta_1 x_i + \beta_2 y_i$ vs. $\mu_i = \delta + \beta_1 x_i + \beta_2 y_i + \gamma x_i y_i$
• With p predictors, there are:
• $\binom{p}{k} \approx \frac{p^k}{k!}$ k-way interactions
• p = 1,000,000 → about $5 \times 10^{11}$ pairwise (k = 2) interactions
o That’s 500 BILLION possible pair-wise interactions!
• Need some way to limit the number of pairwise interactions considered…
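The size of the interaction search space can be checked directly. A small sketch comparing the exact binomial coefficient with the p^k / k! approximation; the values of p are illustrative.

```python
from math import comb, factorial


def n_interactions(p, k):
    """Exact number of k-way interactions among p predictors."""
    return comb(p, k)


def n_interactions_approx(p, k):
    """Large-p approximation p^k / k!."""
    return p**k / factorial(k)


for p in (10**6, 10**7):
    print(p, n_interactions(p, 2), n_interactions_approx(p, 2))
```

One million SNPs give roughly 5 × 10^11 pairs and ten million give roughly 5 × 10^13, so an exhaustive pairwise scan grows quadratically with the number of markers.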
Multivariate methods
• Neural Networks
o Genetic programming optimized neural networks
o Parametric decreasing method
• Penalized Regression
o LASSO penalized regression
o The elastic net
o Ridge regression
• Bayesian Approaches
o Bayesian partitioning
o Bayesian Logistic Regression with Stochastic Search Variable Selection
o Bayesian Epistasis Association Mapping
• Factorial Methods
o Multi-factor dimensionality reduction method
o Sparse-PCA
o Supervised-PCA
o DAPC-based FS (snpzip)
o Odds-ratio-based MDR
• Logic Trees
o Logic feature selection
o Monte Carlo Logic Regression
o Logic regression
o Modified Logic Regression-Gene Expression Programming
o Genetic Programming for Association Studies
• Non-parametric Methods
o Random forests
o Restricted partitioning method
o Combinatorial partitioning method
o Set association approach
Multivariate methods
• Penalized regression methods
o LASSO penalized regression
• Factorial methods
o DAPC-based feature selection
Penalized regression methods
• Approach
o Regression models multivariate association
o Shrinkage estimation → feature selection
• Variations
o LASSO, Ridge, Elastic net, Logic regression
• Gold Standard: LASSO penalized regression
LASSO penalized regression
• Regression
o Generalized linear model (“glm”)
• Penalization
o L1 norm
o Coefficients → 0
o Feature selection!
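How an L1 penalty drives coefficients to exactly zero can be seen in a small coordinate-descent sketch. It makes simplifying assumptions that are not on the slide: a Gaussian (linear) model rather than a general GLM, centred data, and a fixed penalty `lam`. The objective minimised is (1/2n) Σ (y_i − Σ_j x_ij β_j)² + lam Σ_j |β_j|. The toy data are hypothetical.

```python
from itertools import product


def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """LASSO for a linear model by cyclic coordinate descent (no intercept returned)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]   # centred SNPs
    y_mean = sum(y) / n
    resid = [yi - y_mean for yi in y]          # residuals when all betas are 0
    beta = [0.0] * p
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue                       # monomorphic SNP: nothing to fit
            # Correlation of SNP j with the partial residual (SNP j's effect added back)
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_ss[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_beta
    return beta


# Toy data: 3 binary markers in a full factorial layout; only marker 0 affects y
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [2.0 * row[0] for row in X]
beta = lasso_cd(X, y, lam=0.1)
print(beta)
```

The soft-thresholding step is what turns shrinkage into selection: any marker whose correlation with the residual falls below the penalty is assigned a coefficient of exactly zero and drops out of the model.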
LASSO – Strengths & weaknesses
Strengths
• Stability
• Interpretability
• Likely to accurately select the most influential predictors
• Sparsity
Weaknesses
• Multicollinearity
• Not designed for high-p
• Computationally intensive
• Calibration of penalty parameters
o User-defined → variability
• Sparsity
Factorial methods
• Approach
o Place all variables (SNPs) in a multivariate space
o Identify discriminant axis → best separation
o Select variables with the highest contributions to that axis
• Variations
o Supervised-PCA, Sparse-PCA, DA, DAPC-based FS
o Our focus: DAPC with feature selection (snpzip)
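The three steps of the approach can be sketched with NumPy. This is not the snpzip implementation: it is a simplified two-group version that assumes PCA by SVD of the centred genotype matrix, a linear discriminant fitted on the first `n_pca` components, and contributions defined as squared loadings normalised to sum to one. The function name and simulated data are hypothetical.

```python
import numpy as np


def dapc_contributions(X, y, n_pca):
    """Per-SNP contributions to a two-group discriminant axis (simplified sketch)."""
    Xc = X - X.mean(axis=0)                       # centre each SNP column
    # Step 1: place SNPs in a multivariate space (PCA via SVD)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_pca].T                              # p x n_pca principal axes
    Z = Xc @ V                                    # n x n_pca PC scores
    # Step 2: discriminant axis in PC space, w proportional to Sw^{-1} (m1 - m0)
    Z0, Z1 = Z[y == 0], Z[y == 1]
    Sw = (np.cov(Z0, rowvar=False) * (len(Z0) - 1)
          + np.cov(Z1, rowvar=False) * (len(Z1) - 1))   # within-group scatter
    w = np.linalg.solve(np.atleast_2d(Sw), Z1.mean(axis=0) - Z0.mean(axis=0))
    # Step 3: map the axis back to SNP space and normalise squared loadings
    loadings = V @ w
    return loadings**2 / np.sum(loadings**2)


# Simulated data: 100 individuals, 20 SNPs; SNPs 0 and 1 differ between groups
rng = np.random.default_rng(0)
n, p = 100, 20
y = np.repeat([0, 1], n // 2)
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X[y == 1, 0] = rng.binomial(2, 0.9, size=n // 2)
X[y == 1, 1] = rng.binomial(2, 0.9, size=n // 2)

contrib = dapc_contributions(X, y, n_pca=5)
print(np.argsort(contrib)[::-1][:5])              # SNPs ranked by contribution
```

Retaining only the leading principal components keeps the discriminant step well-posed when there are far more SNPs than individuals, but the result then depends on how many components are kept.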
DAPC-based feature selection
[Figure: alleles a–e scored across individuals (Diseased “cases” vs. Healthy “controls”); density of individuals along the discriminant axis; bar plot of each allele’s contribution to the discriminant axis (scale 0 to 0.5)]
DAPC-based feature selection
• Where should we draw the line?
o → Hierarchical clustering
[Figure: bar plot of contribution to discriminant axis for alleles a–e (scale 0 to 0.4), with the selection threshold marked “?”]
Hierarchical clustering (FS)
[Figure: bar plot of contribution to discriminant axis for alleles a–e (scale 0 to 0.5), with the alleles selected by hierarchical clustering]
Hooray!
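One way to "draw the line" by clustering is sketched below. It assumes single-linkage clustering cut into two groups, which for one-dimensional contributions is equivalent to cutting at the largest gap between sorted values. The linkage choice is an assumption and the contribution values are hypothetical, so this illustrates the idea rather than the snpzip procedure.

```python
def select_by_largest_gap(contrib):
    """Split contributions into two clusters at the largest gap; return the top cluster.

    `contrib` maps a variable name to its contribution to the discriminant axis.
    """
    names = sorted(contrib, key=contrib.get, reverse=True)     # highest first
    values = [contrib[name] for name in names]
    gaps = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    cut = gaps.index(max(gaps))                                # widest gap
    return sorted(names[: cut + 1])


# Hypothetical contributions for alleles a-e
contrib_example = {"a": 0.05, "b": 0.40, "c": 0.10, "d": 0.38, "e": 0.07}
selected = select_by_largest_gap(contrib_example)
print(selected)
```

Because the cut is found from the data rather than fixed in advance, the number of selected variables changes from one dataset to the next.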
DAPC – Strengths & weaknesses
Strengths
• More likely to catch all relevant SNPs (signal)
• Computationally quick
• Good exploratory tool
• Redundancy > sparsity
Weaknesses
• Sensitive to n.pca
• N.snps.selected varies
• No “p-value”
• Redundancy > sparsity
Conclusions
• Study design
o GWAS design
o Issues and considerations in GWAS
• Testing for association
o Univariate methods
o Multivariate methods
• Penalized regression methods
• Factorial methods
Thanks for listening!
Questions?