Platform: Affymetrix GeneChip System

Disease gene mapping in the post-genomic age
Pak C. Sham
Department of Psychiatry & Genome Research Centre, HKU
Bioinformatic and Comparative Genome Analysis Course
HKU-Pasteur Research Centre - Hong Kong, China
August 17 - August 29, 2009
Overview
Introduction
Linkage
Model-based Linkage Analysis
Model-free Linkage Analysis
Population Association Analysis
Case-Control Association Analysis
Family-based Association Analysis
Linkage Disequilibrium
Genome-wide Association Studies
Data Quality Checks of Genetic Data
Population stratification
Levels of Statistical Significance
Statistical Power Calculation
Quantile-Quantile Plots
Gene-Gene and Gene-Environment Interactions
Meta-analysis
Human Diseases
Inherited “Mendelian” diseases
- Huntington’s disease
- Cystic fibrosis
- Haemophilia
Complex diseases
- Cancers
- Ischaemic heart disease
- Diabetes
- Depression
Health-related quantitative traits
- Body mass index
- Blood pressure
- Blood sugar level
- Neuroticism
Classical Genetics
Mendel: the gene as the unit of biological inheritance
Inference of genetic effects
- Mendelian segregation ratios
- Twin studies (CMZ > CDZ)
- Adoption studies
Estimation of genetic parameters
- Gene frequency, penetrance, mutation rate
- Heritability: narrow and broad
Based only on the pattern of disease occurrence in
families and in the population
- Not direct analysis of DNA
DNA Sequence Variation
Single nucleotide polymorphisms
Structural Variations
Types of Genetic Traits
Monogenic diseases (e.g. Huntington’s disease, cystic fibrosis)
Polygenic inheritance:
- Quantitative traits (e.g. blood pressure, IQ)
- Complex disorders (e.g. type 2 diabetes)
Levels of Genetic Analysis
Heritability
- Based on genetic sharing between family members
- Determines the relative contributions of genetic and
environmental variation to trait variation in the population
Gene finding
- Based on measuring genetic markers (naturally occurring
sequence variants) and associating these with disease or
trait within families (LINKAGE) or in an entire population
(ASSOCIATION)
- Identifies genes that contain sequence variation
influencing disease risk, course, outcome and response to
different treatments
Gene function
- Involves molecular genetics, cell biology and animal models
- Identifies mechanisms through which sequence variation
influences cell function and therefore disease-related traits
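Mendelian Double Backcross
Pure lines: AA/BB × aa/bb (gametes AB and ab)
F1: Aa/Bb, the informative parent (doubly heterozygous, phase-known), backcrossed to aa/bb (gametes ab only)
Offspring, by gamete received from the informative parent:
- Parental types: AB gives Aa/Bb, ab gives aa/bb
- Recombinants: Ab gives Aa/bb, aB gives aa/Bb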
Recombination Fraction
For two loci (e.g. A and B), the recombination fraction (θ) is the
proportion of gametes that are recombinant
For two loci on different chromosomes, θ = ½
For two loci on the same chromosome (syntenic), θ < ½ (approaching ½ when the loci are far apart)
As the distance between two syntenic loci → 0, θ → 0
Crossing Over in Meiosis
Cross-over points
The recombination fraction between two loci is the probability of
having an odd number of cross-over points between the two loci
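A minimal sketch of the "odd number of cross-overs" statement. It assumes that cross-overs fall along the interval as a Poisson process with no interference (Haldane's model, an assumption not made on the slide), so a genetic distance of d Morgans gives θ = ½(1 − e^(−2d)). The function names are illustrative only.

```python
import math
import random

def haldane_theta(d_morgans):
    """Recombination fraction under Haldane's no-interference model."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def poisson_draw(rng, mean):
    """Draw a Poisson-distributed count (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulated_theta(d_morgans, n_gametes=100_000, seed=1):
    """Proportion of simulated gametes with an odd number of cross-overs."""
    rng = random.Random(seed)
    odd = sum(poisson_draw(rng, d_morgans) % 2 for _ in range(n_gametes))
    return odd / n_gametes

# Compare the closed form with simulation for a few map distances (Morgans).
for d in (0.01, 0.1, 0.5, 2.0):
    print(d, round(haldane_theta(d), 4), round(simulated_theta(d), 4))
```

Under this model θ is close to the map distance for tightly linked loci and levels off just below ½ for distant loci.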
Genetic Markers
Linkage can be used to map disease-related genes if we can
measure naturally occurring DNA sequence variations
(polymorphisms) at known chromosomal locations.
The determination of the alleles present at a polymorphic
marker locus is called genotyping
Classical genetic markers
- Mendelian disorders
- Blood groups
- HLA antigens
Molecular genetic markers
- Restriction fragment length polymorphisms (RFLPs)
- Variable-length short-sequence repeats, SSRs (e.g. CACACA…)
- Single-nucleotide polymorphisms, SNPs (e.g. C/T)
Disease Pedigrees
[Figure: disease pedigree showing each member's marker genotype (e.g. A1A2, A1A3, A1A4); marker allele A1 cosegregates with the dominant disease]
Problem: genotypes at the disease locus are not directly
measured, although they can be partially deduced
from disease status
Single Major Locus Model
Mendelian dominant disease
- DD, Dd: Risk = 1, dd: Risk = 0
Mendelian recessive disease
- DD: Risk = 1, Dd, dd: Risk = 0
Generalisation (the SML model)
- DD: Risk = f2, Dd: Risk = f1, dd: Risk = f0
- f2, f1, f0 are called penetrances
Additional parameters
- Disease allele (D) frequency q
Statistics Diversion
Likelihood
- Probability of data as a function of the value of
an unknown parameter
Example
- An urn contains red and blue marbles in the
proportions p and 1-p respectively, where p is
unknown
- Of 10 marbles randomly drawn from the urn with
replacement, 7 are red
- Likelihood function = $p^{7}(1-p)^{3}$
- Maximum Likelihood Estimate of p: 0.7
- Likelihood ratio test of hypothesis H0: p = 0.5
$$T = 2\left[\ln\left(0.7^{7} \times 0.3^{3}\right) - \ln\left(0.5^{10}\right)\right] \approx 1.65$$
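The urn example can be checked numerically. The sketch below finds the maximum likelihood estimate by a simple grid search (a choice made here for transparency, not a method stated on the slide) and then forms the likelihood ratio statistic for H0: p = 0.5.

```python
import math

def log_likelihood(p, n_red=7, n_blue=3):
    """Log-likelihood of p for 7 red and 3 blue draws (constant term omitted)."""
    return n_red * math.log(p) + n_blue * math.log(1.0 - p)

# Grid search over p = 0.001, 0.002, ..., 0.999
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

# Likelihood ratio test statistic for H0: p = 0.5
lrt = 2.0 * (log_likelihood(p_hat) - log_likelihood(0.5))
print(p_hat, round(lrt, 2))
```

Referred to a chi-square distribution with 1 degree of freedom (5% critical value 3.84), a statistic of about 1.65 does not reject H0.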
LOD Scores
Given a SML model with specified parameter values, the
likelihood of the pedigree data (disease and marker
status) can be calculated numerically (using the Elston-Stewart
or Lander-Green algorithms) for a range of
chromosomal locations of the disease locus
The lod score at a particular location is defined as the
common logarithm of the ratio of the likelihood at that location (among
the markers) to the likelihood at a very distant location (unlinked
to the markers)
$$\mathrm{lod}(x) = \log_{10} \frac{L(X = x)}{L(X = \infty)}$$
A lod score of 3 or more at a chromosomal location is by
convention considered significant evidence for declaring
linkage at that location. For Mendelian diseases this
criterion has an empirical false positive rate of ~ 2%
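For intuition only, a sketch of the simplest possible lod score: n phase-known, fully informative meioses of which k are recombinant between a marker and the disease locus. This is a simplifying assumption made here; real pedigree likelihoods need the Elston-Stewart or Lander-Green algorithms, which this sketch does not implement. The function name is hypothetical.

```python
import math

def two_point_lod(theta, n_meioses, n_recombinants):
    """Lod score at recombination fraction theta for phase-known meioses."""
    k, n = n_recombinants, n_meioses
    linked = theta ** k * (1.0 - theta) ** (n - k)   # likelihood at theta
    unlinked = 0.5 ** n                               # likelihood at theta = 1/2
    if linked == 0.0:
        return float("-inf")
    return math.log10(linked / unlinked)

# 10 meioses with 1 recombinant: scan theta from 0 to 0.5
thetas = [i / 100 for i in range(0, 51)]
best_theta = max(thetas, key=lambda t: two_point_lod(t, 10, 1))
print(best_theta, round(two_point_lod(best_theta, 10, 1), 2))

# 10 meioses with no recombinants, evaluated at theta = 0
print(round(two_point_lod(0.0, 10, 0), 2))
```

With no recombinants, each informative meiosis adds log10(2) ≈ 0.3 to the lod score, so about 10 such meioses are needed to reach the conventional threshold of 3.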
Computer Programs
Elston-Stewart algorithm
- Large pedigrees, few markers
- LINKAGE
Lander-Green algorithm
- Small pedigrees, many markers
- MERLIN
Simulations (MCMC)
- Large pedigrees, many markers
- SIMWALK (very slow)
Identity by Descent (IBD)
Two alleles are IBD if they are descended from the same
allele of a common ancestor in the pedigree
[Figure: pedigree with marker genotypes (AB, AC, BC) illustrating alleles shared identical by descent]
Affected Sib Pair (ASP) Method
At any locus the number of alleles shared IBD by a
sib pair can be
- 0 (25%)
- 1 (50%)
- 2 (25%)
For sib pairs where both sibs are affected, the IBD
distribution at a marker locus linked to the disease locus
is distorted such that fewer pairs have IBD 0 and more
pairs have IBD 2
The ASP method is based on a test of whether IBD sharing
is increased at each marker locus
ASP is considered “model-free” or “non-parametric” as
it does not require the assumption of an SML model
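One common form of the ASP test is the "mean test", sketched here under the assumption that the IBD state of every pair is known without ambiguity (in practice it is estimated from marker data). Under no linkage the number of alleles shared per pair has mean 1 and variance 0.5, so the total can be standardized and referred to a normal distribution. The counts in the example are hypothetical.

```python
import math
from statistics import NormalDist

def asp_mean_test(n_ibd0, n_ibd1, n_ibd2):
    """One-sided z-test for excess IBD sharing in affected sib pairs."""
    n_pairs = n_ibd0 + n_ibd1 + n_ibd2
    shared = n_ibd1 + 2 * n_ibd2            # total alleles shared IBD
    expected = 1.0 * n_pairs                # null mean: 0(.25) + 1(.5) + 2(.25)
    sd = math.sqrt(0.5 * n_pairs)           # null variance per pair is 0.5
    z = (shared - expected) / sd
    p_one_sided = 1.0 - NormalDist().cdf(z)
    return z, p_one_sided

# Hypothetical data: 100 affected sib pairs with IBD 0/1/2 counts 15/50/35
z, p = asp_mean_test(15, 50, 35)
print(round(z, 2), p)
```

The test is one-sided because linkage can only increase sharing among affected pairs.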
Generalized NPL methods
The IBD concept can be applied to types of
relative pairs other than sib pairs
Each type of relative pair has a
characteristic distribution of IBD sharing in
the absence of linkage
For marker loci linked to a disease locus,
the IBD sharing of affected relative pairs will
be elevated
Non-parametric linkage (NPL) tests are
based on the detection of an elevated level
of IBD sharing at a marker locus in all the
affected pairs in general pedigrees
Statistics Diversion
Many test statistics are calculated by summing up the
contributions from many observation units (e.g.
families)
The frequency distribution of such a statistic is often
approximately normal (central limit theorem)
If the standard deviation of such a statistic can be
calculated, then this can be used to “standardize” the
statistic (after subtracting its expected value under the
null hypothesis) so that it has a standard deviation of 1
The resulting test statistic can be referred to a
standard normal distribution to obtain a p-value
Many tests in genetic analysis (including NPL tests)
are of this form
Computer Programs
Lander-Green algorithm for IBD
calculation, followed by calculation of
NPL test statistics from IBD sharing of
affected pairs
- Genehunter
- Merlin
Disease Association
Two variables are statistically associated if their values in
the population are not independent but related to each
other
In the case of a disease and a genetic variant (allele), they
are associated if individuals with different numbers of
copies of the variant have different risks of disease
Hypothetical Example
Genotype    Risk    Risk Ratio
DD          0.09    9
Dd          0.03    3
dd          0.01    1
Prospective Design
Collect a cohort of individuals from the population
(random sampling, stratified sampling, cluster
sampling etc.)
Collect DNA and measure genotypes at genetic
markers (plus other relevant covariates, e.g. age, sex,
ethnicity, environmental exposures)
Follow up the cohort for the occurrence of the disease
of interest
Perform statistical analysis for association between
marker genotype and disease
- Hypothesis test (typically chi-square tests)
- Parameter estimation (typically odds ratios)
- Statistical modelling (e.g. logistic regression) required
if the analysis is to include other loci or covariates in
addition to the locus of interest
Statistics Diversion
Odds ratio
Odds = Risk / (1-Risk)
Odds Ratio = Odds (exposed) / Odds (unexposed)
Example
Genotype    Risk    Odds      Odds Ratio
DD          0.09    0.0989    9.79
Dd          0.03    0.0309    3.06
dd          0.01    0.0101    1
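The odds and odds ratios for the three genotype risks in the example can be recomputed directly from the definitions, taking dd as the reference genotype.

```python
risks = {"DD": 0.09, "Dd": 0.03, "dd": 0.01}

def odds(risk):
    """Convert a risk (probability) to odds."""
    return risk / (1.0 - risk)

baseline = odds(risks["dd"])  # dd is the reference genotype
odds_ratios = {g: odds(r) / baseline for g, r in risks.items()}

for genotype, risk in risks.items():
    print(genotype, risk, round(odds(risk), 4), round(odds_ratios[genotype], 2))
```

Because the disease is uncommon at every genotype, the odds ratios (9.79 and 3.06) are close to the risk ratios (9 and 3).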
Statistics Diversion
Simple odds ratio (OR) analysis
              Disease    Healthy
“B” allele    a          b
“b” allele    c          d
OR = (ad)/(bc)
SD (ln OR) = √(1/a + 1/b + 1/c + 1/d)
Chi-square test statistic = (ln OR / SD)²
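A short sketch of the 2 × 2 calculation, using the usual large-sample standard error of ln OR, √(1/a + 1/b + 1/c + 1/d), and the statistic (ln OR / SE)², which has 1 degree of freedom. The allele counts are hypothetical.

```python
import math

def odds_ratio_test(a, b, c, d):
    """Odds ratio, SE of ln(OR) and 1-df chi-square for a 2 x 2 table."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    chi_square = (math.log(odds_ratio) / se_log_or) ** 2
    return odds_ratio, se_log_or, chi_square

# Hypothetical counts: B allele 60 diseased / 40 healthy, b allele 40 / 60
or_hat, se, chi2 = odds_ratio_test(60, 40, 40, 60)
lower = math.exp(math.log(or_hat) - 1.96 * se)
upper = math.exp(math.log(or_hat) + 1.96 * se)
print(or_hat, round(chi2, 2), (round(lower, 2), round(upper, 2)))
```

The same standard error gives an approximate 95% confidence interval for the OR, as in the last lines of the sketch.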
Statistics Diversion
Logistic Regression
The log-odds of disease is modelled as a linear combination
of predictor variables
Predictor variables include marker genotype of interest plus
other covariates (e.g. age, sex, ethnicity)
Regression coefficients represent log odds ratios and are
estimated by maximum likelihood
Significance tests can be performed for a single predictor
variable or jointly for a group of predictors
There are automated methods for model selection (e.g.
forward stepwise, backward stepwise)
Available in most popular statistical packages
Codings for Genetic Models
What values to assign to different genotypes?
Genotype    Dominant    Recessive    Additive
DD          1           1            1
Dd          1           0            0.5
dd          0           0            0
Can include any 2 in a model, but including all 3 will lead
to “collinearity” and an error message
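The collinearity arises because the additive code is an exact linear combination of the other two: additive = ½ (dominant + recessive). A small sketch with a hypothetical sample of genotypes:

```python
CODINGS = {
    "dominant":  {"DD": 1.0, "Dd": 1.0, "dd": 0.0},
    "recessive": {"DD": 1.0, "Dd": 0.0, "dd": 0.0},
    "additive":  {"DD": 1.0, "Dd": 0.5, "dd": 0.0},
}

def encode(genotypes, model):
    """Turn a list of genotype labels into a numeric predictor column."""
    return [CODINGS[model][g] for g in genotypes]

sample = ["DD", "Dd", "dd", "Dd", "dd"]   # hypothetical genotypes
dominant = encode(sample, "dominant")
recessive = encode(sample, "recessive")
additive = encode(sample, "additive")

# The additive column carries no information beyond the other two columns.
combined = [0.5 * (d + r) for d, r in zip(dominant, recessive)]
print(additive == combined)
```

Any two of the three columns (plus an intercept) already describe the three genotype means, which is why a third adds nothing.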
Retrospective Design
Prospective design is inefficient for rare diseases
In a retrospective (case-control) design, a sample of
affected individuals (cases) and a sample of unaffected
individuals (controls) from the population are collected
Collect DNA and measure genotypes at genetic markers
(plus other relevant covariates, e.g. age, sex, ethnicity,
environmental exposures)
Perform statistical analysis for association between
marker genotype and disease
- Hypothesis test (typically chi-square tests)
- Parameter estimation (typically odds ratios)
- Statistical modelling (e.g. logistic regression) required
if the analysis is to include other loci or covariates in
addition to the locus of interest
Statistics Diversion
Simple odds ratio (OR) analysis
            “B” allele    “b” allele
Cases       a             b
Controls    c             d
OR = (ad)/(bc)
SD (ln OR) = √(1/a + 1/b + 1/c + 1/d)
Chi-square test statistic = (ln OR / SD)²
Same calculations as for prospective study
Statistics Diversion
In a statistical test
- False positive result = Type 1 error
- False negative result = Type 2 error
- Type 1 error rate is controlled by setting the p-value
threshold for declaring the test significant
- Probability of detecting a true effect is defined as
statistical power (= 1 - Type 2 error rate)
- Study should be designed to achieve adequate
statistical power (e.g. 80%)
- Statistical power may depend on parameters whose
values are unknown (e.g. effect size of genotypes as
measured by odds ratio). This calls for consideration of
a range of plausible assumptions.
Selection of Controls
The controls should be from the same population as the
cases
Some minor mismatching between cases and controls
can be handled by stratified analysis or statistical
adjustment (e.g. in logistic regression model)
For fixed number of cases, statistical power increases
with increasing number of controls
For fixed total sample size, a balanced design (equal
number of cases and controls) is optimal for maximizing
statistical power
Hidden Population Stratification
In mixed populations of diverse origin, case-control studies
run the risk of mismatch between cases and controls, and
this can lead to false positive and false negative
association results
Example
- Peptic ulcer and O blood group are both more common in
“Orientals” than Europeans
- In a mixed population, “Orientals” are likely to be over-represented in cases with peptic ulcer
- Cases with peptic ulcer would have a higher frequency of
O blood group than randomly drawn controls from the
population
- If ethnic origin is known then this can be entered in
the statistical model for adjustment. If not, then this could
lead to a false positive association between peptic ulcer
and the O blood group
Transmission/Disequilibrium Test
Using family members as controls should ensure matching between cases
and controls and avoid the problem of population stratification
The TDT uses the non-transmitted alleles of the parents of cases as controls
Example: parents Bb × Bb, affected child BB
              T    NT
Paternal      B    b
Maternal      B    b
Overall table (counts over heterozygous parents)
              T    NT
“B” allele    b    c
“b” allele    c    b
TDT = (b-c)²/(b+c)    OR = b/c    Var ln(OR) = 1/b + 1/c
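A sketch of the TDT calculation, where b is the number of heterozygous parents transmitting allele B and c the number transmitting allele b. The statistic is referred to a chi-square distribution with 1 degree of freedom, whose upper tail equals erfc(√(x/2)). The transmission counts are hypothetical.

```python
import math

def tdt(b, c):
    """TDT chi-square, p-value, odds ratio and Var(ln OR) from transmission counts."""
    chi_square = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi_square / 2.0))   # chi-square, 1 df
    odds_ratio = b / c
    var_log_or = 1 / b + 1 / c
    return chi_square, p_value, odds_ratio, var_log_or

# Hypothetical data: 100 heterozygous parents, B transmitted 60 times
chi2, p, or_hat, var_ln_or = tdt(60, 40)
print(chi2, round(p, 4), or_hat, round(var_ln_or, 4))
```

Homozygous parents do not enter the calculation because their transmissions carry no information about preferential transmission.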
Generalizations of TDT
To multiallelic markers (ETDT)
To siblings instead of parents (STDT)
To general pedigrees (PTDT, FBAT)
To quantitative instead of disease phenotypes
(QTDT)
TDT Designs
Disadvantages
- Parents may be difficult to recruit for late-onset diseases
- Less statistical power for the same number of subjects
genotyped (3 subjects in a TDT trio give roughly the power of 2 in a balanced case-control design)
- Asymmetry between transmitted and non-transmitted
alleles leads to a propensity to artifacts (e.g. from missing and
wrong genotype calls)
BUT
- Opportunity to examine de novo mutations and parent-of-origin effects
Linkage disequilibrium (LD)
LD is defined as association between alleles at closely
linked loci
Example
“A” allele frequency 0.1
“B” allele frequency 0.2
Expected frequency of “AB” haplotype (if independent) = 0.1 × 0.2 = 0.02
Observed frequency of “AB” haplotype = 0.1
i.e. Positive LD between alleles “A” and “B”
Measures of LD
d = f(AB) - f(A)f(B)
d’ = d / Max(d), given f(A) and f(B)
r² = d² / [f(A)f(a)f(B)f(b)], i.e. squared correlation
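The three LD measures can be computed for the example frequencies. The sketch assumes the usual definition of Max(d): for positive d it is the smaller of f(A)f(b) and f(a)f(B), for negative d the smaller of f(A)f(B) and f(a)f(b). The function name is illustrative.

```python
def ld_measures(f_a, f_b, f_ab):
    """Return (d, d', r^2) from allele frequencies f(A), f(B) and haplotype f(AB)."""
    d = f_ab - f_a * f_b
    if d >= 0:
        d_max = min(f_a * (1 - f_b), (1 - f_a) * f_b)
    else:
        d_max = min(f_a * f_b, (1 - f_a) * (1 - f_b))
    d_prime = d / d_max if d_max > 0 else 0.0      # signed d'
    r_squared = d * d / (f_a * (1 - f_a) * f_b * (1 - f_b))
    return d, d_prime, r_squared

# Slide example: f(A) = 0.1, f(B) = 0.2, observed f(AB) = 0.1
d, d_prime, r2 = ld_measures(0.1, 0.2, 0.1)
print(round(d, 3), round(d_prime, 3), round(r2, 3))
```

In this example every A allele sits on a B-carrying haplotype, so d’ is at its maximum even though r² is well below 1; the two measures answer different questions.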
Haplotype Frequency Estimation
Genotype data tell us how many copies of each allele are
present in an individual, not how these are arranged into
haplotypes (i.e. which alleles were inherited together from the same parent)
Example
Genotype: AaBb
Possible haplotype combinations AB/ab or Ab/aB
Therefore haplotype frequency estimation requires a
statistical procedure rather than simple counting
Alternative methods
EM algorithm (maximum likelihood): e.g. EH, EH+
Bayesian algorithms, e.g. PHASE
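A minimal sketch of the EM idea for two biallelic loci, assuming random mating (Hardy-Weinberg proportions) so that haplotype pairs are independent. Only double heterozygotes (AaBb) are ambiguous; the E-step splits them between AB/ab and Ab/aB in proportion to the current haplotype frequencies, and the M-step recounts. This illustrates the principle and is not the implementation used by EH or PHASE; the data layout is a choice made here.

```python
def em_haplotype_frequencies(counts, n_iter=200):
    """counts[i][j] = number of individuals with i copies of A and j copies of B."""
    # Haplotype counts that are known with certainty (everything except AaBb).
    fixed = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    for i in range(3):
        for j in range(3):
            n = counts[i][j]
            if (i, j) == (1, 1):
                continue
            if i != 1:                      # AA or aa at the first locus
                first = "A" if i == 2 else "a"
                fixed[first + "B"] += n * j
                fixed[first + "b"] += n * (2 - j)
            else:                           # Aa with BB or bb
                second = "B" if j == 2 else "b"
                fixed["A" + second] += n
                fixed["a" + second] += n

    n_double_het = counts[1][1]
    total = 2.0 * sum(sum(row) for row in counts)
    freq = {h: 0.25 for h in fixed}         # starting values
    for _ in range(n_iter):
        # E-step: probability that an AaBb individual is AB/ab rather than Ab/aB
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: expected haplotype counts divided by number of chromosomes
        expected = dict(fixed)
        expected["AB"] += n_double_het * w
        expected["ab"] += n_double_het * w
        expected["Ab"] += n_double_het * (1 - w)
        expected["aB"] += n_double_het * (1 - w)
        freq = {h: expected[h] / total for h in expected}
    return freq

# Hypothetical sample: 10 AABB, 10 AaBb and 10 aabb individuals
example = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
print(em_haplotype_frequencies(example))
```

In the hypothetical sample no unambiguous genotype carries Ab or aB, so the iterations assign the double heterozygotes almost entirely to AB/ab.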
Systematic Association Mapping
[Figure: marker A in LD with causal variant G; association with disease is direct (through G itself) or indirect (through A)]
From a set of SNPs in high LD with each other, select only one (the tag
SNP) to genotype (others can be imputed, e.g. IMPUTE, MACH, PLINK)
A systematic association analysis of a gene, a chromosome region, or the
entire genome can be achieved by the appropriate choice of tag SNPs
This requires a catalogue of SNPs in the genome and their LD
relationships; this information is provided by the HapMap
Tagging 90% of all common SNPs in the human genome requires
genotyping ~ 1 M SNPs (e.g. Illumina 1M or Affy 6.0) in European and
Asian populations (Genome-Wide Association Studies, GWAS)
[Figure: published GWAS findings over time, by quarter from 2005 to 2008, with RET and NRG1 marked]
Updated from Manolio, Brooks, Collins.
J Clin Invest 2008; 118:1590-625
GWAS: Current Experience
A few SNPs were detectable
with small sample sizes
(100s) but the majority
required very large sample
sizes (1000s)
[Figure: frequency distribution of effect sizes for the risk alleles of 92 validated SNPs identified from GWAS on 16 disorders. Wray et al, 2008]
“Data Cleaning”
Large-scale genetic data allow systematic checks of data integrity and
removal of problematic subjects or SNPs from further analyses
Subject (sample) checks
- Genotype call rate (low suggests poor DNA quality)
- Autosomal heterozygosity (low suggests poor DNA quality, high
suggests contamination)
- X heterozygosity (to check against recorded sex; inconsistency
suggests sample mix-up)
- Inconsistencies between recorded family relationships and those
inferred from genotype data
SNP checks
- Low call rate
- Departure from Hardy-Weinberg proportions
- Allele frequency (consistency with reference values, e.g. HapMap)
- LD relationships (consistency with reference values, e.g. HapMap)
- Consistency of genotypes of SNP pairs in strong LD
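As one example of a SNP check, a sketch of the chi-square test for Hardy-Weinberg proportions from genotype counts (1 degree of freedom). It assumes both alleles are observed; for SNPs with small genotype counts an exact test is often preferred, which this sketch does not cover. The counts are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions for one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of the first allele
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

print(hwe_chi_square(25, 50, 25))   # genotype counts in exact HW proportions
print(hwe_chi_square(40, 20, 40))   # marked deficit of heterozygotes
```

A strong heterozygote deficit or excess at a SNP is more often a genotype-calling problem than biology, which is why this check is used to filter SNPs.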
Example: Gender Check
[Figure: Sty 306 X chromosome heterozygosity (F, axis from -0.4 to 1.2) plotted against DNA No. (0 to 2000), for samples labelled male and samples labelled female]
Safeguards against stratification
Genomic Control (GC) method
- Find median of chi-square test statistics of all SNPs
- Calculate an inflation factor (IF)
- Use IF to adjust down all chi-square statistics
Extracting and adjusting for population structure
- Bayesian approach: STRUCTURE
- Maximum likelihood latent class analysis: L-POP
- Principal components analysis, EIGENSTRAT
- Multidimensional scaling, PLINK
- Hierarchical complete-linkage clustering, PLINK
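The three Genomic Control steps can be sketched as follows. The inflation factor is taken as the observed median chi-square divided by the median of a 1-df chi-square distribution (about 0.455). Not adjusting when the factor falls below 1 is a common convention assumed here, not something stated on the slide. The statistics in the example are hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df (about 0.455)
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def genomic_control(chi_squares):
    """Return the inflation factor and the adjusted 1-df chi-square statistics."""
    inflation = max(1.0, median(chi_squares) / CHI2_1DF_MEDIAN)
    adjusted = [x / inflation for x in chi_squares]
    return inflation, adjusted

# Hypothetical statistics whose median is about twice the null median
inflation, adjusted = genomic_control([0.2, 0.91, 0.91, 3.0, 5.0])
print(round(inflation, 2), [round(x, 2) for x in adjusted])
```

Using the median rather than the mean keeps the estimate robust to the small number of SNPs with genuine association.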
Empirical assessment of ancestry
[Figure: CEPH/European, Yoruba, Han Chinese and Japanese samples, ~2K SNPs]
[Figure: entire Phase I HapMap]
[Figure: Han Chinese and Japanese samples, ~10K SNPs]
Sample Checks: MDS Plot
Multiple Testing Correction
Many tests for which H0 is true will have small p-values, e.g. in
100,000 tests, 5,000 are expected to have p < 0.05 by
chance
Therefore the threshold for declaring statistical significance
needs to be set at a more stringent level
Methods
- Bonferroni: set α = 0.05 / n, for n tests
- “Bayesian”: fix α to give the desired false positive rate
given certain prior assumptions about proportion of SNPs
with true association and the distribution of effect sizes
- False Discovery Rate (FDR): use the empirical distribution
of test statistics or p-values to set appropriate α to give
the desired proportion of false positives among the tests declared significant
For GWAS, it is usual to set α at 5 × 10⁻⁸, regardless of the
number of SNPs tested
Statistics Diversion
Benjamini & Hochberg (1995) Procedure:
1. Set FDR (e.g. to 0.05)
2. Rank the tests in ascending order of p-value, giving
p1 ≤ p2 ≤ … ≤ pr ≤ … ≤ pm
3. Then find the test with the highest rank, r, for which
the p-value, pr, is less than or equal to (r/m) × FDR
4. Declare the tests of rank 1, 2, …, r as significant
A minor modification is to replace m by m0
B & H FDR Method
FDR = 0.05
Rank    P-value    (Rank/n)×FDR    Reject H0?
1       .001       .005            1
2       .010       .010            1
3       .165       .015            0
4       .205       .020            0
5       .396       .025            0
6       .450       .030            0
7       .641       .035            0
8       .781       .040            0
9       .901       .045            0
10      .953       .050            0
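The Benjamini & Hochberg steps can be written as a short function and applied to the ten p-values in the table. The function name and return format are choices made here.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where H0 is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    last_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            last_rank = rank                               # highest rank passing
    rejected = [False] * m
    for index in order[:last_rank]:
        rejected[index] = True
    return rejected

p_values = [.001, .010, .165, .205, .396, .450, .641, .781, .901, .953]
rejected = benjamini_hochberg(p_values, fdr=0.05)
print(rejected)
```

Note that every test up to the highest passing rank is declared significant, even if an individual p-value among them exceeds its own threshold.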
Power Calculations
Help design studies that are likely to succeed
- Determine the minimum sample size necessary to
achieve the desired level of statistical power (usually
> 80%), for a given effect size
- Determine the minimum effect size that can be
detected with adequate statistical power, for a fixed
sample size
Usually obligatory for grant applications
Steps in Power Calculation
Specify
- Study design (e.g. case-control)
- Statistical test
Assume hypothetical values for 2 of the 3 parameters:
- Sample size
- Effect size (including effect frequency)
- Statistical power
Calculate the remaining parameter
Make informative plots:
- Sample size against effect size, for fixed power
- Power versus sample size, for fixed effect size
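A rough sketch of these steps for a case-control allelic test, computing power from sample size and effect size. It rests on several assumptions made here: a multiplicative allelic odds ratio, the control allele frequency standing in for the population frequency, the two alleles of each person treated as independent, and a two-sided normal approximation. It is not the method of any particular power program, and the function name and parameter values are hypothetical.

```python
import math
from statistics import NormalDist

def allelic_power(p_control, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allele-frequency comparison."""
    odds_case = odds_ratio * p_control / (1.0 - p_control)
    p_case = odds_case / (1.0 + odds_case)
    # Each individual contributes two alleles
    variance = (p_case * (1 - p_case) / (2 * n_cases)
                + p_control * (1 - p_control) / (2 * n_controls))
    expected_z = abs(p_case - p_control) / math.sqrt(variance)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf(expected_z - z_crit)

# Power versus sample size for a fixed effect size (OR = 1.2, frequency 0.3)
for n in (1000, 2000, 5000, 10000):
    print(n, round(allelic_power(0.3, 1.2, n, n), 3))
```

Repeating the loop over a range of odds ratios instead of sample sizes gives the other plot suggested above.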
Program for Power Calculation
Genetic Power Calculator (on-line interactive tool)
Contains modules for power or sample size calculations
for a variety of genetic study designs, e.g.
Quantitative trait linkage analysis
Quantitative trait association analysis
Case-Control association analysis
Ways to Enhance Power
Increase sample size
Increase marker density
Increase accuracy of phenotype measurements
Increase accuracy of genotyping
Rigorous quality control and error checking
Collect and adjust for environmental covariates
Appropriate treatment of heterogeneity
Appropriate treatment of population substructure
Select most “extreme” individuals for study
Optimal statistical test that extracts maximum information
QQ Plots
An effective way of visualizing the overall pattern of p-values from a large-scale systematic association
analysis, especially GWAS
Plots observed -log(p) (conventionally base 10), ranked in magnitude, against
their expected values according to the null hypothesis
(i.e. p-values uniformly distributed between 0 and 1)
If the null hypothesis is true for all SNPs and the tests
are behaving appropriately then the plot should follow
a straight line at 45° from the origin
Deviations from the null line may suggest
- The presence of true association!
- Misbehaviour of statistical tests due to a variety of
reasons
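A sketch of how the plotted points can be computed, leaving the drawing to any plotting tool. The expected value of the i-th smallest of n null p-values is approximated here by (i − 0.5)/n, one of several plotting positions in common use and an assumption of this sketch; base-10 logarithms are used.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p), in ascending order."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((rank - 0.5) / n) for rank in range(1, n + 1))
    return list(zip(expected, observed))

# Hypothetical p-values from 8 tests
points = qq_points([0.91, 0.02, 0.47, 0.33, 0.68, 0.15, 0.80, 0.0004])
for expected, observed in points:
    print(round(expected, 2), round(observed, 2))
```

Points that rise above the diagonal only at the upper end suggest a few true associations; a departure along the whole range points instead to inflation of the test statistics.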
Example QQ Plot: Hirschsprung’s Disease
[Figure: two QQ plots, without EIGENSTRAT correction and with EIGENSTRAT correction]
Statistical Interaction
If the effect of factor B (on risk of disease) is different
depending on whether factor A is also present, then the
factors A and B are said to display statistical interaction
Example:
OR of allele “B” in smokers = 3
OR of allele “B” in non-smokers = 1
Interactions can be GG (epistasis), GE, or EE
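One simple way to quantify such an interaction is to compare the allele's odds ratio between the two strata. The sketch below takes a 2 × 2 table (a, b, c, d) per stratum, with OR = ad/(bc), and tests the difference in ln OR with a normal approximation. The counts are hypothetical and chosen to give ORs of 3 and 1.

```python
import math

def log_or_and_variance(a, b, c, d):
    """ln(OR) and its large-sample variance for one 2 x 2 table."""
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def interaction_test(table_exposed, table_unexposed):
    """Ratio of stratum-specific ORs and a z statistic for their difference."""
    log_or1, var1 = log_or_and_variance(*table_exposed)
    log_or0, var0 = log_or_and_variance(*table_unexposed)
    ratio_of_ors = math.exp(log_or1 - log_or0)
    z = (log_or1 - log_or0) / math.sqrt(var1 + var0)
    return ratio_of_ors, z

smokers = (90, 30, 60, 60)        # OR = (90*60)/(30*60) = 3
non_smokers = (50, 50, 50, 50)    # OR = 1
ratio, z = interaction_test(smokers, non_smokers)
print(round(ratio, 2), round(z, 2))
```

The ratio of the two ORs plays the role that the exponentiated product-term coefficient plays in a logistic regression model.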
Gene-Environment Interaction:
Male conduct disorder
[Figure: % conduct disorder (0 to 100) by child maltreatment (low, mild, severe), shown separately for low and high MAOA activity]
Caspi et al., 2002 (Science)
Analysis of Interactions
Statistical modelling
- Inclusion of product term in logistic regression model
Set association
- Stepwise approach to find best subset of SNPs
among a large number of SNPs in a gene or pathway
Multifactor Dimensionality Reduction (MDR)
- Model-free approach to reduce high-dimensional
genotype data into high- and low-risk classes
Canonical correlation
- Considers interactions between 2 sets of SNPs
Meta-analysis
Combination of results from multiple studies in order to detect
effects that are too small to detect in the individual studies
Methodology
Consistent phenotype definition
Shared SNPs (possibly by imputation)
Consistent coding of SNP data
Combine estimates, weighted by sample size or inverse variance
Combine p-values, Fisher’s method and variants
Checks for publication bias (not so much a concern for GWAS)
Heterogeneity tests
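A sketch of two of the combination methods listed: inverse-variance (fixed-effect) pooling of per-study estimates such as ln OR, with Cochran's Q as a heterogeneity statistic, and Fisher's method for combining p-values. The fixed-effect assumption and the example numbers are choices made here. Fisher's statistic, −2 Σ ln p, follows a chi-square distribution with 2k degrees of freedom under H0, and for even degrees of freedom its tail probability has the closed form used below.

```python
import math
from statistics import NormalDist

def fixed_effect_meta(estimates, standard_errors):
    """Inverse-variance pooled estimate, its SE, two-sided p-value and Cochran's Q."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(pooled / pooled_se)))
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    return pooled, pooled_se, p_value, q

def fisher_combined_p(p_values):
    """Fisher's method: combined p-value from k independent p-values."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)      # test statistic / 2
    # Upper tail of chi-square with 2k df: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k))

# Hypothetical ln(OR) estimates and standard errors from three studies
print(fixed_effect_meta([0.18, 0.25, 0.10], [0.10, 0.12, 0.08]))
print(fisher_combined_p([0.04, 0.10, 0.03]))
```

Cochran's Q is referred to a chi-square distribution with (number of studies − 1) degrees of freedom; a large value argues against pooling under a fixed-effect model.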
THANK YOU
