Transcript Association

Strategies for gene identification
in complex traits
--- Association studies ---
What is an association study?
Objective: Is there a statistical relation?
• Genomic variation at one or more sites
• Phenotypic variation
- Presence/Absence of a disease
- Levels of a disease-related trait
Principle:
Compares 2 groups that are expected to differ in
their prevalence of disease-susceptibility alleles
Analytical Issues in Genetic
Association Studies
• Sampling Design
• Markers (typed; Map density)
• Unit of Analysis
• Statistical testing
Linkage disequilibrium between 2
tightly linked loci
                    Marker 2
                    A2      a2
Marker 1    A1      A1A2    A1a2    f(A1)
            a1      a1A2    a1a2    f(a1)
                    f(A2)   f(a2)
Allelic association: f(i,j) ≠ f(i) x f(j)
⇒ Haplotype frequency ≠ product of allele frequencies
LD decays with time/generations and genetic distance
(recombination)
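The definition and the decay statement above can be sketched in a few lines of Python. The function names and the example frequencies are illustrative assumptions, not part of the slides. The decay formula D_t = D_0 (1 - theta)^t is the standard expectation under random mating in a large population, with theta the recombination fraction between the two loci.

```python
def ld_coefficient(f_hap: float, f_a1: float, f_a2: float) -> float:
    """D = f(A1A2) - f(A1) * f(A2); D = 0 means no allelic association."""
    return f_hap - f_a1 * f_a2


def ld_after_generations(d0: float, theta: float, generations: int) -> float:
    """Expected D after t generations: D_t = D_0 * (1 - theta) ** t."""
    return d0 * (1.0 - theta) ** generations


# Hypothetical frequencies, for illustration only.
d0 = ld_coefficient(f_hap=0.40, f_a1=0.50, f_a2=0.50)

# LD decays faster when the loci are further apart (larger theta).
for theta in (0.001, 0.01, 0.1):
    print(theta, ld_after_generations(d0, theta, generations=100))
```

Tightly linked loci (small theta) keep most of their initial LD over many generations, which is what makes indirect association possible.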
Measures of allelic association
D’ (Lewontin’s); r2 (correlation)
0 ≤ r2 ≤ |D’| ≤ 1
D’ ~ recombinational events in the genomic region
r2 ~ The 2 SNPs carry the same information
D’ can be high while r2 is not
D’ = 1; r2 = 1 (only 2 haplotypes observed)
        A2      a2
A1      A1A2    -       f(A1)
a1      -       a1a2    f(a1)
        f(A2)   f(a2)
D’ = 1; r2 < 1 (3 haplotypes observed)
        A2      a2
A1      A1A2    -       f(A1)
a1      a1A2    a1a2    f(a1)
        f(A2)   f(a2)
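The two situations above can be checked numerically. The sketch below uses the usual textbook definitions (D’ = |D| / Dmax, r2 = D^2 / product of the four allele frequencies). The function name and the haplotype frequencies are hypothetical, chosen only to reproduce the two patterns.

```python
def ld_measures(f11: float, f12: float, f21: float, f22: float):
    """Return (D, D', r2) from the four haplotype frequencies:
    f11 = f(A1A2), f12 = f(A1a2), f21 = f(a1A2), f22 = f(a1a2)."""
    p1 = f11 + f12            # f(A1)
    q1 = f11 + f21            # f(A2)
    p2, q2 = 1.0 - p1, 1.0 - q1
    d = f11 - p1 * q1
    # Largest |D| attainable given the allele frequencies.
    d_max = min(p1 * q2, p2 * q1) if d >= 0 else min(p1 * q1, p2 * q2)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p1 * p2 * q1 * q2
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Two haplotypes only (A1A2, a1a2): both measures reach their maximum.
print(ld_measures(0.3, 0.0, 0.0, 0.7))

# Three haplotypes (A1a2 missing): D' stays at 1 but r2 drops.
print(ld_measures(0.2, 0.0, 0.3, 0.5))
```

D’ stays at 1 as long as one of the four haplotypes is absent, whereas r2 reaches 1 only when the two markers have the same allele frequencies and are perfect proxies for each other.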
Power in Population-based vs.
Family-based Analysis
TDT
• Genotype 3 Subjects/Family
• Phenotype 1 Subject/Family
• Increased power with multiple affected sibs
• Generally immune to population stratification
• Family structure provides some error-checking and haplotype information
• Full trios may not be available
Case-Control
• Genotype 2 Subjects to equal one trio
• Phenotype 2 Subjects to equal one trio
• Increased power with 3:1 controls:cases
• Susceptible to population stratification
Most common forms of markers
• Repeated sequences of 2, 3 or 4 nucleotides (Microsatellites, STRs)
– reasonably frequent in genome
– highly polymorphic/informative ⇒ useful in linkage analysis
– few disease susceptibility gene variants are likely STRs
• Single Nucleotide Polymorphisms (SNPs): “one” letter of
the code is altered
– very frequent in genome (1/500 to 1/1000 base pairs)
– Exonic SNPs may or may not cause an amino acid change
– many disease susceptibility gene variants are likely SNPs
Unit of Analysis in Genetic Association
Studies
• Allele vs. Genotype
– Dominance can be considered in genotype analysis
– Extra degree of freedom in genotype analysis
– Not clear which is optimal
• Single SNP vs. Haplotype
– Haplotypes capture evolutionary history
– Need for haplotype imputation
– Single SNP optimal if functional SNP is included
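To make the allele vs. genotype distinction concrete, here is a minimal sketch of both tests on the same case-control genotype counts. The function names and the counts are hypothetical. The allelic test counts each subject's two alleles as independent observations, which assumes Hardy-Weinberg equilibrium. The p-value formulas are the closed forms of the chi-square survival function for 1 and 2 degrees of freedom.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows).
    Assumes every row and column total is non-zero."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def genotype_test(cases, controls):
    """2 x 3 test on genotype counts (AA, Aa, aa): 2 degrees of freedom."""
    stat = chi_square([list(cases), list(controls)])
    return stat, math.exp(-stat / 2.0)


def allele_test(cases, controls):
    """2 x 2 test on allele counts (A, a): 1 degree of freedom."""
    def alleles(genotypes):
        n_aa_major, n_het, n_aa_minor = genotypes
        return [2 * n_aa_major + n_het, 2 * n_aa_minor + n_het]

    stat = chi_square([alleles(cases), alleles(controls)])
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical genotype counts (AA, Aa, aa), for illustration only.
cases, controls = (500, 400, 100), (600, 340, 60)
print("genotype test:", genotype_test(cases, controls))
print("allele test:  ", allele_test(cases, controls))
```

The genotype test can pick up dominant or recessive patterns at the cost of one extra degree of freedom. The allele test spends a single degree of freedom on an additive effect.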
What are we hoping from
a genetic association study?
Situation of Interest:
⇒ Trait variation is influenced by
• The typed variant (Marker = Causal Variant): Direct Association
OR
• A second variant (Marker in LD with Causal variant): Indirect Association
Likelihood of detecting a true
association?
• Genetic effects of the causal allele on trait
susceptibility/variation (Relative Risk & allele
frequency)
• LD between the marker and the causal variant
(Marker map & LD patterns in the genomic
region of the causal variant)
Detectable Genetic effects (1)
[Figure: Variant with modest effects (OR~1.6): Power (0% to 100%) as a function of allele frequency (f=0.05, f=0.10), under different nominal P-values (5%, 1%, 0.1%, 0.01%); N=2,000 (1,000 cases + 1,000 controls)]
Detectable Genetic effects (2)
[Figure: Rare (f=0.02) causal variant: Power (0% to 100%) as a function of OR (OR=1.5, OR=2.0), under different nominal P-values (5%, 1%, 0.1%, 0.01%); N=2,000 (1,000 cases + 1,000 controls)]
Detectable Genetic effects?
⇒ Association is powerful to detect causal variants
that are
- Common (>10%) with relatively modest effects (RR)
- Less common (~5%) but with substantial effects
(RR>2)
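The dependence of power on allele frequency, odds ratio and nominal P-value can be approximated with a simple calculation. The sketch below is an assumption-laden illustration, not the method behind the slide figures. It uses a two-sided normal approximation for a comparison of allele frequencies, takes the control allele frequency as f, and derives the case frequency from an allelic odds ratio. The function name and parameters are hypothetical.

```python
from statistics import NormalDist


def allelic_power(odds_ratio: float, f: float, n_cases: int,
                  n_controls: int, alpha: float) -> float:
    """Approximate power of a two-sided allelic test at level alpha.

    f is the risk-allele frequency in controls; the frequency in cases
    follows from the allelic odds ratio. The contribution of the opposite
    tail is ignored, so the value is slightly low when the effect is tiny.
    """
    norm = NormalDist()
    p_controls = f
    p_cases = odds_ratio * f / (1.0 - f + odds_ratio * f)
    # Each subject contributes two alleles.
    se = (p_cases * (1.0 - p_cases) / (2 * n_cases)
          + p_controls * (1.0 - p_controls) / (2 * n_controls)) ** 0.5
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    return norm.cdf(abs(p_cases - p_controls) / se - z_crit)


# Same settings as the slides: 1,000 cases + 1,000 controls.
for alpha in (0.05, 0.01, 0.001, 0.0001):
    print(alpha,
          round(allelic_power(1.6, 0.05, 1000, 1000, alpha), 3),
          round(allelic_power(1.6, 0.10, 1000, 1000, alpha), 3))
```

Under these assumptions power falls as the nominal P-value becomes stricter and rises with both allele frequency and odds ratio, which matches the qualitative message of the two figures.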
Likelihood of detecting a true association?
Direct association: r2=1; indirect association: 0<r2<1; no LD: r2=0
• For a given Power, required N increases with 1/r2
r2 =    1       0.8     0.5     0.20    0
N =     1,000   1,250   2,000   5,000   ∞
• For a given N, Power goes from maximal (r2=1) to nil (r2=0)
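The 1/r2 scaling of sample size is simple arithmetic. The sketch below reproduces the values on the slide, starting from N=1,000 for a directly typed causal variant. The function name is an illustrative assumption.

```python
import math


def required_n(n_direct: int, r2: float) -> float:
    """Sample size needed at a marker with LD r2 to the causal variant,
    to match the power of n_direct subjects typed at the causal variant."""
    if r2 <= 0.0:
        return math.inf          # no LD: no sample size is enough
    return round(n_direct / r2)


for r2 in (1.0, 0.8, 0.5, 0.2, 0.0):
    print(r2, required_n(1000, r2))
```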
Hot spots and Haplotype blocks
• LD is variable: recombination does not
occur with equal probability at all points in
the genome; there are « hot » and
« cold » spots
• Recently, it has been suggested that the
genome falls into « blocks », with little
haplotype diversity within blocks: Mean
block size seems to be about
~14kb in Caucasians, and
~8 kb in Africans
Detectable Causal Variants?
• Causal polymorphism is known and typed (direct
association) or
• There are markers that are highly correlated to
the causal variant:
- The causal locus lies in a « cold » spot (« LD
blocks »)
- The « best » map density to be used will
depend on the LD patterns of the region
⇒ Implications for statistical significance (multiple-testing correction)
Human Genome
• The human genome consists of about 3 x 10^9
base pairs (3-6 x 10^6 SNPs) and contains
about 25,000 genes
• Much of the DNA is either in introns or in
intergenic regions
⇒ Trait variation: A few hundred
(functional) variants may make a meaningful
contribution to variation in any single
phenotype
⇒ Prior probability that a variant selected at
random will influence a given trait is very low
Genetic variants to be typed?
--- Choices have to be made ---
Two complementary approaches:
• Functional: incorporates assessments of
the likely functional effect of variation
within a gene or region of interest.
• Tagging: exploits presence of LD in many
parts of the genome.
[Figure: Significance of association with AD, for SNPs
immediately surrounding APOE (<100 kb); Martin et
al., AJHG, 2000]
Selection of variants: Functional approach
Target polymorphisms which are themselves putative
causal variants.
Critical issues:
• Identification of candidate polymorphisms
– Beyond mutations altering the amino acid sequence (nsSNPs),
little is known about the potential effect of non-coding
sequence on gene regulation & expression
• MAF of functional variants is skewed (MAF<5%)
⇒ Power to detect uncommon variants with modest
effects?
⇒ Potential to be the most powerful (Direct
association) design, but may be limited to the
discovery of some of the genetic causes of
Selection of variants: Indirect Association
The polymorphism is a surrogate for the
causal variant
But it is necessary to type several surrounding
markers to have a high chance of picking
up the indirect association
Questions: Do we need to type all markers in
the region? Can we reduce genotyping
costs & multi-test burden without
decreasing « too much » the power?
Tagging approaches
Type a subset of variants that captures a high
amount of the information in common
regional haplotypes
Various strategies (SNP & haplotype tagging),
but there is still debate as to the best methods
[Johnson et al. Nat Genet, 200]
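One simple way to pick such a subset is a greedy selection on pairwise r2. The sketch below is only an illustration of the idea, not any published tagging method or the algorithm of a particular tool. It repeatedly picks the SNP that captures the most still-untagged SNPs at or above an r2 threshold. The function name, data layout and SNP names are hypothetical.

```python
def greedy_tags(r2, threshold=0.8):
    """Greedy tag-SNP selection.

    r2 maps each SNP name to a dict of pairwise r2 values with other SNPs
    (missing pairs are treated as r2 = 0). Returns the list of chosen tags.
    """
    untagged = set(r2)
    tags = []

    def captured(snp):
        # SNPs still untagged that this candidate would cover (itself included).
        return {t for t in untagged
                if t == snp or r2[snp].get(t, 0.0) >= threshold}

    while untagged:
        # sorted() makes ties resolve deterministically (first name wins).
        best = max(sorted(r2), key=lambda s: len(captured(s)))
        tags.append(best)
        untagged -= captured(best)
    return tags


# Hypothetical pairwise r2 values for three SNPs.
pairwise = {
    "s1": {"s2": 0.9, "s3": 0.1},
    "s2": {"s1": 0.9, "s3": 0.2},
    "s3": {"s1": 0.1, "s2": 0.2},
}
print(greedy_tags(pairwise, threshold=0.8))
```

Lowering the threshold reduces the number of tags, and therefore genotyping cost and multi-test burden, at the price of weaker proxies for the untyped variants.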
[Figure: Power as a function of average spacing of tags (kb), for tags picked at r2 = 1, 0.8, 0.5 and 0.3 and for random markers; De Bakker, Nat Genet, Nov 2005]
⇒ A marker map density of ~1 tagSNP/5kb
(r2>0.8) captures >80% of common variation
Tagging approach: Limits
• Less powerful than direct studies,
• There cannot be a definite negative result, since
we cannot exclude the possibility that a causal
variant exists but is not picked up by the markers
chosen,
• Intrinsic biological merit of tagSNPs as markers
for complex trait susceptibility variants?
⇒ « Common disease, common variant »
hypothesis
Supported by the few variants consistently shown
to be associated with common diseases: APOE &
Alzheimer; Macular degeneration &
In practical terms, an observed statistical
association will be due to …
1. Direct association: The allele itself is functional and
directly affects the expression of the phenotype
2. Indirect association: The allele is in linkage
disequilibrium with an allele at another locus that directly
affects the expression of the phenotype
3. The finding could be due to chance or artifact,
e.g., confounding or selection bias
⇒ Study design aims to maximize detection of
“true” findings while controlling
(minimizing) rate of “false” findings
“False” Association findings
1. Chance: measured by the nominal P value of
the test, i.e., the probability that a typed
marker is found associated when H0 (no
association) is true.
⇒ Multi-test problem: The rate of “false”
findings of a given experiment increases with
the number of markers tested.
• Solutions
– Simulation: Empirical p-values
– Replication and/or use of multi-phase designs
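The multi-test problem and the empirical p-value solution can be illustrated with a few helper functions. This is a minimal sketch under stated assumptions: the family-wise error formula assumes independent tests, the Bonferroni threshold is shown as one common correction (the slides do not name it), and `null_stats` stands for test statistics obtained from hypothetical permutations of case/control labels.

```python
def expected_false_positives(n_markers: int, alpha: float) -> float:
    """Expected number of markers passing alpha when H0 holds everywhere."""
    return n_markers * alpha


def family_wise_error_rate(n_markers: int, alpha: float) -> float:
    """Probability of at least one false finding, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_markers


def bonferroni_threshold(n_markers: int, fwer: float = 0.05) -> float:
    """Per-marker threshold that keeps the family-wise error rate near fwer."""
    return fwer / n_markers


def empirical_p_value(observed: float, null_stats) -> float:
    """Share of simulated null statistics at least as large as the observed
    one, with the usual +1 correction so the estimate is never zero."""
    hits = sum(1 for s in null_stats if s >= observed)
    return (1 + hits) / (1 + len(null_stats))


for m in (1, 100, 500_000):
    print(m, expected_false_positives(m, 0.05), bonferroni_threshold(m))
```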
Multi-phase designs
Are efficient to reduce the multi-test problem
For example:
1. 2,000 cases + 2,000 controls with
500,000 SNP chip
2. Further 2,000 + 2,000 for best 100,000
SNPs
3. Further 4,000 + 4,000 for best 10,000
SNPs
• Computation of the characteristics of
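The genotyping saving of the three-phase example above is easy to quantify. The sketch below only counts genotypes. It says nothing about power or error rates, and the comparison with a single-stage study typing every subject on the full chip is an assumption added for illustration.

```python
# (subjects typed, SNPs typed) for each phase of the example:
# cases + controls are pooled into one subject count.
PHASES = [
    (2_000 + 2_000, 500_000),
    (2_000 + 2_000, 100_000),
    (4_000 + 4_000, 10_000),
]


def staged_genotypes(phases) -> int:
    """Total genotypes produced when each phase types only its own SNP set."""
    return sum(subjects * snps for subjects, snps in phases)


def single_stage_genotypes(phases) -> int:
    """Genotypes needed if all subjects were typed on the first-phase SNP set."""
    total_subjects = sum(subjects for subjects, _ in phases)
    return total_subjects * phases[0][1]


staged = staged_genotypes(PHASES)
single = single_stage_genotypes(PHASES)
print(staged, single, staged / single)
```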
“False” Association findings
2. Artifact (confounding, selection bias, population
stratification, genotyping errors): affects the
probability of a “chance” finding
⇒ The significance of a finding is no longer
controlled by the nominal P-value.
• Solutions
- Careful matching of cases & controls
- use homogeneous populations
- use family-based controls
- use genomic control or other similar methods
- use QC methods for scoring genotyping errors
(Clayton et al., Nat Genet, 2005)
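As an illustration of the genomic control idea listed above, the sketch below estimates an inflation factor from the median of 1-df chi-square statistics at markers assumed to be unassociated, and rescales all statistics by it. It is a simplified rendering of the usual procedure. The function names and the example statistics are hypothetical, and the constant is the median of a chi-square distribution with 1 degree of freedom.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def inflation_factor(chi2_stats) -> float:
    """Lambda: observed median over expected median, bounded below by 1."""
    return max(1.0, median(chi2_stats) / CHI2_1DF_MEDIAN)


def genomic_control(chi2_stats):
    """Deflate every 1-df chi-square statistic by the inflation factor."""
    lam = inflation_factor(chi2_stats)
    return [s / lam for s in chi2_stats]


# Hypothetical statistics from markers assumed to be null.
stats = [0.2, 0.6, 0.91, 1.3, 4.0]
print(inflation_factor(stats))
print(genomic_control(stats))
```

A factor well above 1 suggests that stratification or another artifact is inflating the test statistics across the genome.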
Prospects for whole-genome screens:
Estimated numbers of «common» SNPs (MAF>5%)
• Direct studies of nsSNPs: ~30,000 - 50,000 SNPs
• Indirect studies of genes: ~300,000 - 500,000 SNPs
• «Nearly» whole genome: 500,000 - 1,000,000
• Whole genome: ~2,000,000 - 4,000,000
Choice of markers
• Optimal choice of markers requires detailed
mapping of LD, e.g. based on HapMap data
• Truly optimal solutions are computationally intensive.
Current chip designers are using single-marker r2 cluster-based algorithms
Choices of markers
have to be made
• The strategy used to define the subsets
of variants to be typed has a substantial
effect on the power & the quality of the
study.
• Greater understanding of genomic
variation has allowed more logical choices.
Nonetheless, variant selection is always a
pragmatic compromise.
Research key questions
• Are common human diseases due to common
variants or multiple rare variants?
• Will rare or common SNPs be better candidates
for a particular disease?
• Can large differences between populations in
the frequency of an allele be merely due to
chance?