Transcript Slide 1

Introduction to Gene-Finding:
Linkage and Association
Danielle Dick, Sarah Medland, (Ben
Neale)
Aim of QTL mapping…
LOCALIZE and then IDENTIFY a locus that
regulates a trait (QTL)
• Locus: Nucleotide or sequence of nucleotides with variation
in the population, with different variants associated with
different trait levels.
Location and Identification
• Linkage
• localize region of the genome where a QTL that
regulates the trait is likely to be harboured
• Family-specific phenomenon: Affected individuals
in a family share the same ancestral predisposing
DNA segment at a given QTL
Location and Identification
• Association
• identify a QTL that regulates the trait
• Population-specific phenomenon: Affected
individuals in a population share the same
ancestral predisposing DNA segment at a given
QTL
Linkage
Overview
Progress of the Human Genome Project
Human Chromosome 4
Genetic markers
(DNA polymorphisms)
ATGCTTGCCACGCE
ATGCTTCTTGCCATGCE
Microsatellite Markers
can be di(2), tri(3), or tetra (4)
nucleotide repeats
ATGCTTGCCACGCE
Single Nucleotide Polymorphism
ATGCTTGCCATGCE
DNA polymorphisms

Can occur in gene, but be silent

Can change gene product (protein)


Can regulate gene product



Alter amino acid sequence (a lot or a little)
Upregulate or downregulate protein production
Turn off or on gene
Can occur in noncoding region

This happens most often!
Mutations
How do we map genes?

Deviation from Mendel’s Independent
Assortment Law


Aa & Bb = ¼ AB, ¼ Ab, ¼ aB, ¼ ab
We’re looking for variation from this
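A minimal sketch of what "variation from this" looks like, assuming a double heterozygote with known phase AB/ab and a recombination fraction `theta` between the two loci (the function name and the phase assumption are illustrative, not from the slides):

```python
def gamete_frequencies(theta):
    """Expected gamete frequencies for an AB/ab double heterozygote.

    theta is the recombination fraction between the two loci:
    0.5 corresponds to independent assortment, 0 to complete linkage.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    parental = (1.0 - theta) / 2.0   # AB and ab (non-recombinant)
    recombinant = theta / 2.0        # Ab and aB (recombinant)
    return {"AB": parental, "Ab": recombinant, "aB": recombinant, "ab": parental}


unlinked = gamete_frequencies(0.5)   # Mendel's 1/4 : 1/4 : 1/4 : 1/4
linked = gamete_frequencies(0.1)     # excess of parental gametes
```

With `theta = 0.5` every gamete has frequency ¼; any smaller value shifts frequency toward the parental gametes, which is the deviation linkage mapping looks for.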
Recombination
Recombination

Another way of introducing genetic diversity

Allows us to map genes!

Crossovers more likely to occur between genes
that are further away; likelihood of a
recombination event is proportional to the
distance


Interference – tend not to see 2 crossovers in a small
area
Alleles that are very close together are more
likely to stay together, don’t assort independently
Linkage Mapping (is a marker “linked”
to the disease gene)

Collect families with affected individuals

Genome Scan - Test markers evenly spaced
across the entire genome (~every 10cM, ~400
markers)

Lod score (“log of the odds”) – what are the odds
of observing the family marker data if the marker
is linked to the disease (less recombination than
expected) compared to if the marker is not linked
to the disease
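As a rough illustration of the Lod score, the sketch below assumes the simplest case: phase-known, fully informative meioses in which recombinants can be counted directly. Real linkage programs work from pedigree likelihoods; the function name and example counts are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """log10 of L(theta) / L(theta = 0.5) for counted recombinants.

    Assumes phase-known, fully informative meioses (a simplification).
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1.0 - theta))
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


# Hypothetical data: 1 recombinant observed in 10 meioses.
example_lod = lod_score(recombinants=1, meioses=10, theta=0.1)
null_lod = lod_score(recombinants=1, meioses=10, theta=0.5)
```

Lod scores from independent families are added together, which is why collecting more families can move an inconclusive result past a threshold.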
Thomas Hunt Morgan – discoverer of linkage
Linkage = Co-segregation
A3A4
A1A2
A1A3
A1A2
A1A4
A2A4
A3A4
A2A3
A3A2
Marker allele A1
cosegregates with
dominant disease
Lod scores

>3.0 evidence for linkage

<-2.0 can rule out linkage

In between – inconclusive, collect
more families
Linkage = Co-segregation
[Pedigree figure repeated: marker genotypes A3A4, A1A2, A1A3, A1A2, A1A4, A2A4, A3A4, A2A3, A3A2]
•Parametric Linkage used very successfully to map disease genes for Mendelian disorders
•Problematic for complex disorders: requires disease model, penetrance, assumes gene of major effect, phenotypic precision
Nonparametric Linkage

Based on allele-sharing

More appropriate for phenotypes with
multiple genes of small effect, environment,
no disease model assumed

Basic unit of data: affected relative (often
sibling) pairs
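IDENTITY BY DESCENT
[Grid: number of alleles shared IBD for each of the 16 equally likely (1/4 x 1/4) combinations of parental alleles received by Sib 1 (columns) and Sib 2 (rows)]
            Sib 1
Sib 2    2   1   1   0
         1   2   0   1
         1   0   2   1
         0   1   1   2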
4/16 = 1/4 sibs share BOTH parental alleles IBD = 2
8/16 = 1/2 sibs share ONE parental allele IBD = 1
4/16 = 1/4 sibs share NO parental alleles IBD = 0
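The 1/4, 1/2, 1/4 split can be checked by enumerating the 16 transmission combinations. This is a small sketch; the allele labels `P1`, `P2`, `M1`, `M2` are illustrative placeholders for the two paternal and two maternal alleles.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Each sib receives one of two paternal and one of two maternal alleles.
transmissions = list(product(("P1", "P2"), ("M1", "M2")))  # 4 outcomes per sib

counts = Counter()
for sib1, sib2 in product(transmissions, repeat=2):        # 16 combinations
    # Count parental alleles that both sibs received (0, 1 or 2).
    ibd = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
    counts[ibd] += 1

ibd_probabilities = {k: Fraction(v, 16) for k, v in sorted(counts.items())}
```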
Genotypic similarity between relatives
IBS
Alleles shared Identical By State “look the same”, may have the
same DNA sequence but they are not necessarily derived from a
known common ancestor - focus for association
IBD
Alleles shared Identical By Descent are a copy of the
same ancestor allele - focus for linkage
[Pedigree figure: parents M1 Q1 / M2 Q2 and M3 Q3 / M3 Q4; sibs M1 Q1 / M3 Q3 and M1 Q1 / M3 Q4; at the marker the sibs are IBS = 2, IBD = 1]
Genotypic similarity – basic principles

Loci that are close together are more likely to be
inherited together than loci that are further apart

Loci are likely to be inherited in context – ie with their
surrounding loci

Because of this, knowing that an allele at a locus is transmitted
from a common ancestor is more informative than simply
observing that it is the same allele

Critical to have parental data when possible
Linkage Markers…
For disease traits (affected/unaffected)
Affected sib pairs selected
[Bar chart: number of sib pairs (0 to 1000) in the IBD = 2, IBD = 1 and IBD = 0 groups, for the expected distribution and for markers 1, 2 and 3; labelled values 127 and 310]
For continuous measures
Unselected sib pairs
[Plot: correlation between sibs (0.00 to 1.00) for the IBD = 0, IBD = 1 and IBD = 2 groups]
So how does all this fit into Mx?
IDENTITY BY DESCENT
[Grid shown again: IBD sharing (0, 1 or 2) for the 16 combinations of parental alleles received by Sib 1 and Sib 2]
4/16 = 1/4 sibs share BOTH parental alleles IBD = 2
8/16 = 1/2 sibs share ONE parental allele IBD = 1
4/16 = 1/4 sibs share NO parental alleles IBD = 0

In biometrical modeling A is correlated at 1 for
MZ twins and .5 for DZ twins

.5 is the average genome-wide sharing of genes
between full siblings (DZ twin relationship)
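[Path diagram: ACE twin model for T1 and T2 with latent factors A, C and E (variance 1) and paths a, c, e; A correlated 1 or .5 between twins, C correlated 1]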

In linkage analysis we will be estimating an
additional variance component Q

For each locus under analysis the coefficient of
sharing for this parameter will vary for each pair of
siblings

The coefficient will be the estimated proportion of alleles
that the pair of siblings share identical by descent at the
locus (pi-hat, π̂)
[Path diagram: ACE model extended with a QTL factor Q for PTwin1 and PTwin2; paths q, a, c, e; Q correlated π̂ between twins; A correlated MZ=1.0 DZ=0.5; C correlated MZ & DZ = 1.0]
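Reading the path diagram with the usual tracing rules gives the expected covariance between the two siblings' phenotypes. The sketch below is an assumption about how the model is parameterised (standardised latent factors, paths q, a, c, e); the numeric values are made up for illustration.

```python
def expected_sib_covariance(pi_hat, q, a, c, additive_sharing=0.5):
    """Expected phenotypic covariance for a sib pair at one locus.

    pi_hat            estimated proportion of alleles shared IBD at the locus
    additive_sharing  genome-wide A correlation (0.5 for DZ/full sibs, 1 for MZ)
    """
    return pi_hat * q ** 2 + additive_sharing * a ** 2 + c ** 2


def expected_variance(q, a, c, e):
    """Expected phenotypic variance under the same model."""
    return q ** 2 + a ** 2 + c ** 2 + e ** 2


# Hypothetical path coefficients.
cov_ibd2 = expected_sib_covariance(pi_hat=1.0, q=0.5, a=0.6, c=0.3)
cov_ibd0 = expected_sib_covariance(pi_hat=0.0, q=0.5, a=0.6, c=0.3)
```

If q is non-zero, pairs with higher pi-hat at the locus are expected to be more alike, which is the signal the linkage test picks up.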
Break down of time spent during a
linkage/association study
[Pie chart, Linkage: cleaning & preparing genotype data; running linkage analyses; estimating significance]
How do we do this?
1.Genotyping data.
Microsatellite data


Ideally positioned at equal genetic distances
across chromosome
Mostly di/tri nucleotide repeats
http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp
Microsatellite data


Raw data consists of allele lengths/calls (bp)
Different primers give different lengths

So to compare data you MUST know which
primers were used
http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp
Binning

Raw allele lengths are converted to allele
numbers or lengths

Example: D1S1646 tri-nucleotide repeat, size
range 130-150

Logically: Work with binned lengths
Commonly: Assign allele 1 to 130 allele, 2 to 133 allele …
Commercially: Allele numbers often assigned based on
reference populations (CEPH). So if the first CEPH allele
was 136 that would be assigned 1, and 130 & 133 would be
assigned the next free allele numbers
Conclusions: whenever possible start from the RAW allele
size and work with allele length
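A minimal sketch of binning, assuming the tri-nucleotide example above (smallest allele 130 bp, 3 bp steps) and the "commonly used" numbering that starts at 1. The function name and the 1 bp tolerance are illustrative choices, not part of any genotyping package.

```python
def bin_allele(raw_length, smallest=130, repeat_unit=3, tolerance=1.0):
    """Snap a raw allele length (bp) to the nearest expected repeat length.

    Returns (binned_length, allele_number), numbering the smallest allele 1.
    Raises ValueError when the call is too far from any expected length.
    """
    steps = round((raw_length - smallest) / repeat_unit)
    binned = smallest + steps * repeat_unit
    if steps < 0 or abs(raw_length - binned) > tolerance:
        raise ValueError(f"ambiguous allele call: {raw_length}")
    return binned, steps + 1


calls = [bin_allele(x) for x in (130.2, 132.8, 136.0)]
```

Keeping the binned length alongside the allele number makes it possible to merge data sets later, provided the same primers were used.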
Error checking

After binning check for errors




Family relationships (GRR, Rel-pair)
Mendelian Errors (Sib-pair)
Double Recombinants (MENDEL, ASPEX,
ALLEGRO)
An iterative process
‘Clean’ data

ped file

Family, individual, father, mother, sex, dummy,
genotypes
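To make the column order concrete, here is a small sketch that reads one ped-file line. It assumes whitespace-separated fields with each genotype given as two allele columns; the example line and the dictionary keys are hypothetical, and real files should be checked against the documentation of the program that will read them.

```python
def parse_ped_line(line):
    """Split one ped-file line into its fixed columns and genotype pairs."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least six columns")
    family, individual, father, mother, sex, dummy = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("genotype columns must come in pairs")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "family": family,
        "individual": individual,
        "father": father,
        "mother": mother,
        "sex": sex,
        "dummy": dummy,
        "genotypes": genotypes,
    }


# Hypothetical line: family 1, individual 3, parents 1 and 2, two markers.
record = parse_ped_line("1 3 1 2 2 0 130 133 2 4")
```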
Estimating genotypic sharing…

The ped file is used with ‘map’ files to obtain
estimates of genotypic sharing between
relatives at each of the locations under
analysis
Estimating genotypic sharing…
Merlin will give you probabilities of sharing 0, 1, 2 alleles for every pair of individuals.
Estimating genotypic sharing…

Output
Why aren’t P0, P1, P2 exact
for everyone?
-missing parental genotypes
-low informativeness at marker
[Pedigree figure: genotypes 1/2, 1/2, 2/2, 2/2]
[Path diagram: ACE model extended with Q for PTwin1 and PTwin2; paths q, a, c, e; Q correlated π̂; A correlated MZ=1.0 DZ=0.5; C correlated MZ & DZ = 1.0]
Genotypic similarity between relatives
IBD
Alleles shared Identical By Descent are a copy of the
same ancestor allele
Pairs of siblings may share 0, 1 or 2 alleles IBD
The estimated proportion of alleles a pair of relatives
shares IBD is known as pi-hat
π̂ = p(IBD2) + .5 * p(IBD1)
[Pedigree figure: parents M1 Q1 / M2 Q2 and M3 Q3 / M3 Q4; sibs M1 Q1 / M3 Q3 and M1 Q1 / M3 Q4; IBS = 2, IBD = 1]
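Pi-hat follows directly from the three sharing probabilities. A minimal sketch, with `p0`, `p1`, `p2` standing for the probabilities of sharing 0, 1 and 2 alleles IBD (the example values are illustrative, not taken from any output file):

```python
def pi_hat(p0, p1, p2):
    """Estimated proportion of alleles shared IBD: p(IBD2) + 0.5 * p(IBD1)."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("IBD probabilities must sum to 1")
    return p2 + 0.5 * p1


uninformative = pi_hat(0.25, 0.50, 0.25)   # no marker information: prior sharing
certain_ibd1 = pi_hat(0.0, 1.0, 0.0)
certain_ibd2 = pi_hat(0.0, 0.0, 1.0)
```

Note that an uninformative pair and a pair known to share exactly one allele both give 0.5, so pi-hat alone does not show how certain the estimate is.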
Estimating genotypic sharing…

Output
π̂ = p(IBD2) + .5 * p(IBD1)
π̂ = ?
π̂ = ?
Distribution of pi-hat
Adult Dutch DZ pairs: distribution of pi-hat (π̂) at 65 cM on chromosome 19
[Histogram of PIHAT65 from 0.00 to 1.00: Mean = .45, Std. Dev = .30, N = 117]
π̂ < 0.25: IBD=0 group
π̂ > 0.75: IBD=2 group
others: IBD=1 group
pi65cat= (0,1,2)
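The grouping rule above can be written as a small function. This is only a sketch of the cut-offs stated on the slide; the function name and example values are hypothetical.

```python
def ibd_category(pi_hat):
    """Assign a pair to the IBD = 0, 1 or 2 group from its pi-hat."""
    if not 0.0 <= pi_hat <= 1.0:
        raise ValueError("pi_hat must lie between 0 and 1")
    if pi_hat < 0.25:
        return 0
    if pi_hat > 0.75:
        return 2
    return 1


categories = [ibd_category(x) for x in (0.06, 0.25, 0.50, 0.75, 0.94)]
```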
Linkage Analyses

Advantage


Systematically scan the genome
Disadvantages



Not very powerful
Need hundreds – thousands of family members
Broad peaks
Lod scores
1 cM ≈ 1 Mb (rough genome-wide average)
1 Mb = 1000 kb
1 kb = 1000 bp
1 cM ≈ 1,000,000 bp
Strategy
1. Ascertain families with multiple affecteds
2. Linkage analyses to identify chromosomal regions
[Plot: Lod scores (0 to 2.5) by position (0 to 160 cM) for Wave 1, Wave 2 and Combined samples]
Increased allele-sharing among affecteds
within a family
3. Association analyses to identify specific genes
Gene A
Gene B
Gene C

BREAK
Linkage vs. Association

Linkage analyses look for relationship between a
marker and disease within a family (could be
different marker in each family)

Association analyses look for relationship between
a marker and disease between families (must be
same marker in all families)
Allelic Association:
Extension of linkage to the population
3/5
3/6
2/6
5/6
3/5
3/2
2/6
5/2
Both families are ‘linked’ with the marker, but a different
allele is involved
Allelic Association
Extension of linkage to the population
3/5
3/6
2/6
5/6
3/6
3/2
2/4
6/2
4/6
6/6
All families are ‘linked’ with the marker
Allele 6 is ‘associated’ with disease
2/6
6/6
Localization

Linkage analysis yields broad chromosome
regions harbouring many genes



Resolution comes from recombination events (meioses)
in families assessed
‘Good’ in terms of needing few markers, ‘poor’ in terms
of finding specific variants involved
Association analysis yields fine-scale resolution
of genetic variants


Resolution comes from ancestral recombination events
‘Good’ in terms of finding specific variants, ‘poor’ in
terms of needing many markers
Allelic Association
Three Common Forms
• Direct Association
• Mutant or ‘susceptible’ polymorphism
• Allele of interest is itself involved in phenotype
• Indirect Association
• Allele itself is not involved, but a nearby correlated
marker changes phenotype
• Spurious association
• Apparent association not related to genetic aetiology
(most common outcome…)
Indirect and Direct Allelic Association
Direct Association [diagram: D marked *]
Measure disease relevance (*) directly, ignoring correlated
markers nearby
Indirect Association & LD [diagram: M1, M2, … Mn around D]
Assess trait effects on D via correlated markers (Mi) rather
than susceptibility/etiologic variants.
Semantic distinction between
Linkage Disequilibrium: correlation between (any) markers in population
Allelic Association: correlation between marker allele and trait
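LD between two biallelic markers is usually summarised from haplotype and allele frequencies. The sketch below uses the standard definitions of D, D' and r²; the slides do not give these formulas, and the function name and example frequencies are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for alleles A and B at two biallelic markers.

    p_ab  frequency of the A-B haplotype
    p_a   frequency of allele A at marker 1
    p_b   frequency of allele B at marker 2
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d ** 2 / denominator if denominator > 0 else 0.0
    return d, d_prime, r_squared


d, d_prime, r_squared = ld_measures(p_ab=0.4, p_a=0.5, p_b=0.5)
```

D' reaches 1 whenever one of the four haplotypes is absent, while r² reaches 1 only when the two markers carry the same information, which is why r² is the more direct guide to how well one marker can stand in for another.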
Decay of Linkage Disequilibrium
Reich et al., Nature 2001
Average Levels of LD along chromosomes
[Plot: D' (0.00 to 1.00) by position (0 to 30 Mb) on Chr22 for CEPH, W.Eur and Estonian samples]
Dawson et al., Nature 2002
Characterizing Patterns of Linkage Disequilibrium
Average LD decay vs physical distance
Mean trends along chromosomes
[Plot: D' (0.00 to 1.00) by position (0 to 30 Mb)]
Haplotype Blocks
Linkage Disequilibrium Maps & Allelic
Association
[Diagram: markers 1, 2, 3, … n in LD with D]
Primary Aim of LD maps: Use relationships amongst background
markers (M1, M2, M3, …Mn) to learn something about D for association
studies
Something =
* Efficient association study design by reduced genotyping
* Predict approx location (fine-map) disease loci
* Assess complexity of local regions
* Attempt to quantify/predict underlying (unobserved)
patterns
···
Deliverables: Sets of haplotype tagging SNPs
Building Haplotype Maps for Gene-finding
1. Human Genome Project
→ Good for consensus, not good for individual differences
Sept 01
Feb 02
April 04
2. Identify genetic variants
→ Anonymous with respect to traits.
April 1999 – Dec 01
3. Assay genetic variants
→ Verify polymorphisms, catalogue correlations amongst sites
→ Anonymous with respect to traits
Oct 2002 - present
Oct 04
Haplotype Tagging for Efficient Genotyping
Cardon & Abecasis, TIG 2003
• Some genetic variants within haplotype blocks give redundant information
• A subset of variants, ‘htSNPs’, can be used to ‘tag’ the conserved haplotypes with little loss of
information (Johnson et al., Nat Genet, 2001)
• … Initial detection of htSNPs should facilitate future genetic association studies
HapMap Strategy

Samples


Four populations, small samples
Genotyping



5 kb initial density across genome (600K
markers)
Subsequent focus on low LD regions
Recent NIH RFA for deeper coverage
Hapmap validating millions of SNPs.
Are they the right SNPs?
Distribution of allele frequencies in public markers is biased toward common alleles
[Bar chart: population frequency (0 to 0.6) by minor allele frequency bin (1-10%, 11-20%, 21-30%, 31-40%, 41-50%), expected frequency in population vs frequency of public markers]
Updated with phase 2: more similar to expectation
Phillips et al. Nat Genet 2003
Summary of Role of Linkage
Disequilibrium on Association Studies

Marker characterization is becoming extensive and
genotyping throughput is high

Tagging studies will yield panels for immediate use


Need to be clear about assumptions/aims of each panel
Density of eventual HapMap will probably cover much of
genome in high LD, but not all
Challenges


Just having more markers doesn’t mean that success rate will improve
Expectations of association success via LD are too high.
Two types of association studies


Case-control
Family-based
Allelic Association
Controls
Cases
6/6
6/2
3/5
3/4
3/6
2/4
3/2
5/6
3/6
4/6
6/6
2/6
5/2
Allele 6 is ‘associated’ with disease
2/6
Main Blame
Primary Concern with Case-Control Analyses
Population stratification
Analysis of mixed samples having different allele frequencies
is a primary concern in human genetics, as it leads to false
evidence for allelic association.
Population Stratification

Leads to spurious association

Requirements:



Group differences in allele frequencies AND
Group differences in outcome
In epidemiology, this is a classic matching
problem, with genetics as a confounding variable
Population Stratification
Sample ‘A’      M      m      Freq.
Affected        50     50     .10
Unaffected      450    450    .90
Freq.           .50    .50
χ²(1) is n.s.
+
Sample ‘B’      M      m      Freq.
Affected        1      9      .01
Unaffected      99     891    .99
Freq.           .10    .90
χ²(1) is n.s.
=
Combined        M      m      Freq.
Affected        51     59     .055
Unaffected      549    1341   .945
Freq.           .30    .70
χ²(1) = 14.84, p < 0.001
Spurious Association
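The spurious result can be reproduced from the allele counts above. This sketch uses a plain Pearson chi-square for a 2 x 2 table (1 df, no continuity correction); the helper name is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df) and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value


# Counts ordered as (affected M, affected m, unaffected M, unaffected m).
sample_a = (50, 50, 450, 450)
sample_b = (1, 9, 99, 891)
combined = tuple(x + y for x, y in zip(sample_a, sample_b))

chi2_a, p_a = chi_square_2x2(*sample_a)
chi2_b, p_b = chi_square_2x2(*sample_b)
chi2_combined, p_combined = chi_square_2x2(*combined)
```

Neither sample shows any association on its own; the signal appears only after pooling two groups that differ in both allele frequency and disease rate.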
Family-based association methods
TDT – Transmission Disequilibrium Test
1/2
3/3
2/3
•50/50 chance the 2 is transmitted
•Looking for overtransmission of a particular allele
across affected individuals (undertransmission to unaffecteds)
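Over-transmission can be tested with the usual TDT statistic, which compares how often heterozygous parents transmit an allele to affected offspring against how often they do not. A minimal sketch with hypothetical counts; the function name is illustrative.

```python
import math


def tdt_statistic(transmitted, not_transmitted):
    """McNemar-type TDT chi-square (1 df) and p-value.

    Both counts refer to one allele in heterozygous parents of affected
    offspring; under no linkage/association each is expected 50% of the time.
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative transmissions")
    chi2 = (transmitted - not_transmitted) ** 2 / total
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value


# Hypothetical: allele 2 transmitted 60 times and not transmitted 40 times.
tdt_chi2, tdt_p = tdt_statistic(60, 40)
```

Because only transmissions within families are compared, differences in allele frequency between subpopulations do not enter the statistic.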
TDT Advantages/Disadvantages
Advantages
Robust to stratification
Genotyping error detectable via Mendelian inconsistencies
Estimates of haplotypes possible
Disadvantages
Detection/elimination of genotyping errors causes bias (Gordon et al., 2001)
Uses only heterozygous parents
Inefficient for genotyping
3 individuals yield 2 founders: 1/3 information not used
Can be difficult/impossible to collect
Late-onset disorders, psychiatric conditions, pharmacogenetic applications
Association studies < 2000: TDT
• TDT virtually ubiquitous over past decade
Grant, manuscript referees & editors mandated design
• View of case/control association studies greatly
diminished due to perceived role of stratification
Association Studies 2000+ :
Return to population
• Case/controls, using extra genotyping
• +families, when available
Detecting and Controlling for
Population Stratification with Genetic Markers
Idea
• Take advantage of availability of large N genetic markers
• Use case/control design
• Genotype genetic markers across genome
(Number depends on different factors)
• Look if any evidence for background population substructure
exists and account for it
Two types of association studies

Case-control



Adv: more powerful
Disadv: population stratification
limited by case/control definition
Family-based


Adv: population stratification not a problem
Disadv: less powerful, hard to collect parents for some
phenotypes
Association Analyses vs Linkage

Advantage
More powerful
Disadvantage
Not systematic (in the past)
Now!
Genome wide association scans
Current Association Study Challenges
1) Genome-wide screen or candidate gene
Genome-wide screen



Hypothesis-free
High-cost: large genotyping
requirements
Multiple-testing issues

Possible many false
positives, fewer misses
Candidate gene



Hypothesis-driven
Low-cost: small genotyping
requirements
Multiple-testing less
important

Possible many misses,
fewer false positives
Current Association Study Challenges
2) What constitutes a replication?
GOLD Standard for association studies
Replicating association results in different laboratories is often seen
as most compelling piece of evidence for ‘true’ finding
But…. in any sample, we measure
Multiple traits
Multiple genes
Multiple markers in genes
and we analyse all this using multiple statistical tests
What is a true replication?
What is a true replication?
Replication Outcome → Explanation
• Association to same trait, but different gene → Genetic heterogeneity
• Association to same trait, same gene, different SNPs (or haplotypes) → Allelic heterogeneity
• Association to same trait, same gene, same SNP – but in opposite direction (protective ↔ disease) → Allelic heterogeneity/pop differences
• Association to different, but correlated phenotype(s) → Phenotypic heterogeneity
• No association at all → Sample size too small
Current Association Study Challenges
3) Do we have the best set of genetic markers
There exist 6+ million putative SNPs in the
public domain. Are they the right markers?
Allele frequency distribution is biased toward common alleles
[Bar chart: population frequency (0 to 0.6) by minor allele frequency bin (1-10%, 11-20%, 21-30%, 31-40%, 41-50%), expected frequency in population vs frequency of public markers]
Current Association Study Challenges
3) Do we have the best set of genetic markers
Tabor et al, Nat Rev Genet 2003
Greatest power comes from markers that
match allele freq with trait loci
Disease Allele    Marker Allele Frequency
Frequency         0.1      0.3      0.5      0.7      0.9
0.1               248      626      1306     2893     10830
0.3               1018     238      466      996      3651
0.5               2874     702      267      556      2002
0.7               9169     2299     925      337      1187
0.9               73783    18908    7933     3229     616
λs = 1.5, α = 5 x 10⁻⁸, Spielman TDT
(Müller-Myhsok and Abel, 1997)
Current Association Study Challenges
4) Integrating the sampling, LD and genetic effects
Questions that don’t stand alone:
How much LD is needed to detect complex disease genes?
What effect size is big enough to be detected?
How common (rare) must a disease variant(s) be to be identifiable?
What marker allele frequency threshold should be used to find complex disease
genes?
Complexity of System
•In any indirect association study, we measure marker alleles
that are correlated with trait variants…
We do not measure the trait variants themselves
•But, for study design and power, we concern ourselves with
frequencies and effect sizes at the trait locus….
This can only lead to underpowered studies and
inflated expectations
•We should concern ourselves with the apparent effect size
at the marker, which results from
1) difference in frequency of marker and trait alleles
2) LD between the marker and trait loci
3) effect size of trait allele
Practical Implications of Allele Frequencies

‘Strongest argument for using common markers is
not CD-CV (common disease, common variant). It is practical:
For small effects, common markers are the only ones
for which sufficient sample sizes can be collected’
There are situations where indirect association
analysis will not work


Discrepant marker/disease freqs, low LD, heterogeneity, …
Linkage approach may be only genetics approach in these cases
At present, no way to know when association will/will
not work

Balance with linkage
Current Association Study Challenges
5) How to analyse the data

Allele based test?
2 alleles → 1 df
E(Y) = a + bX
X = 0/1 for presence/absence
Genotype-based test?
3 genotypes → 2 df
E(Y) = a + b1A + b2D
A = 0/1 additive (hom); D = 0/1 dom (het)
Haplotype-based test?
For M markers, 2^M possible haplotypes → 2^M - 1 df
E(Y) = a + bH
H coded for haplotype effects
Multilocus test?
Epistasis, G x E interactions, many possibilities
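As one reading of the genotype-based coding, the sketch below treats A as an indicator for homozygotes of a chosen allele and D as an indicator for heterozygotes, so that E(Y) = a + b1A + b2D has one parameter per genotype group and can be fitted from group means. That interpretation, the function name and the toy data are all assumptions.

```python
from statistics import mean


def genotype_model(genotypes, phenotypes, risk_allele="2"):
    """Fit E(Y) = a + b1*A + b2*D from the three genotype-group means.

    A = 1 for homozygotes of risk_allele, D = 1 for heterozygotes;
    both are 0 for the reference homozygote.
    """
    groups = {0: [], 1: [], 2: []}
    for (allele1, allele2), y in zip(genotypes, phenotypes):
        copies = int(allele1 == risk_allele) + int(allele2 == risk_allele)
        groups[copies].append(y)
    if any(not values for values in groups.values()):
        raise ValueError("all three genotype groups must be observed")
    a = mean(groups[0])         # reference homozygote mean
    b1 = mean(groups[2]) - a    # homozygote effect (A)
    b2 = mean(groups[1]) - a    # heterozygote effect (D)
    return a, b1, b2


toy_genotypes = [("1", "1"), ("1", "1"), ("1", "2"), ("2", "1"), ("2", "2"), ("2", "2")]
toy_phenotypes = [1.0, 3.0, 4.0, 6.0, 9.0, 11.0]
a, b1, b2 = genotype_model(toy_genotypes, toy_phenotypes)
```

The two slopes are what give the genotype test its 2 df; forcing b2 to be half of b1 would reduce it to a 1 df additive test.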
Current Association Study Challenges
6) Multiple Testing

Candidate genes: a few tests (probably correlated)

Linkage regions: 100’s – 1000’s tests (some correlated)

Whole genome association: 100,000s – 1,000,000s tests (many
correlated)

What to do?
• Bonferroni (conservative)
• False discovery rate?
• Permutations?
….Area of active research
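Two of the options listed can be sketched in a few lines: a Bonferroni threshold and the Benjamini-Hochberg step-up procedure for the false discovery rate. Both assume nothing about the correlation between tests beyond what the procedures themselves tolerate, and the p-values below are made up for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Indices of tests declared significant at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            cutoff_rank = rank      # keep the largest rank that passes
    return sorted(order[:cutoff_rank])


toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonferroni_hits = [i for i, p in enumerate(toy_p_values)
                   if p <= bonferroni_threshold(0.05, len(toy_p_values))]
fdr_hits = benjamini_hochberg(toy_p_values, q=0.05)
genome_wide = bonferroni_threshold(0.05, 1_000_000)
```

With a million tests the Bonferroni threshold is 5 x 10⁻⁸, which shows why whole genome association needs very large samples.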
Despite challenges: upcoming association
studies hold some promise

Availability of millions of genetic markers
Genotyping costs decreasing rapidly

Cost per SNP: 2001 ($0.25) → 2003 ($0.10) → 2004
($0.01)

Background LD patterns being characterized

International HapMap and other projects
Genome Wide Association Studies
(GWAS) Underway:

Genetic Analysis Information Network (GAIN)
• Psoriasis, ADHD, Schizophrenia, Bipolar Disorder, Depression,
Type 1 Diabetes

Wellcome Trust Case Control Consortium
• Bipolar Disorder, Coronary Artery Disease, Crohn’s disease,
Rheumatoid Arthritis, Type 1 Diabetes, Type 2 Diabetes

Genes, Environment, & Health Initiative (Gene/Environment
Association Studies: GENEVA)
• Addiction, Diabetes, Heart Disease, Oral Clefts, Maternal
Metabolism and Birth Weight, Lung Cancer, Pre-Term Birth,
Dental Caries

Genes, Environment, & Development Initiative (GEDI)