Analysis of whole genome
association studies in pedigreed
populations
Goutam Sahana
Genetics and Biotechnology
Faculty of Agricultural Sciences
Aarhus University, 8830 Tjele, Denmark
Concept of mapping
Identification of the genetic variant underlying
disease susceptibility or a trait value
Evidence for the
location of the gene
= Causal variant
Approaches to Mapping
1. Candidate gene studies
     Association
     Resequencing approaches
2. Genome-wide studies
     Linkage analysis
     Genome-wide association studies (linkage disequilibrium, LD mapping)
Linkage mapping
 Look for marker alleles that are correlated
with the phenotype within a pedigree
  Different alleles can be associated with the
trait in different pedigrees
Association mapping
 Marker alleles are correlated with a trait
on a population level
 Can detect association by looking at
unrelated individuals from a population
 Does not necessarily imply that markers
are linked to (are close to) genes
influencing the trait.
Linkage vs. association
[Figure: effect size plotted against frequency of the causal variant, with regions labelled "Linkage analysis", "Association study", "Unlikely to exist" and "Very difficult". Modified from D. Altshuler]
Linkage vs. association
Potential advantage                                                      Linkage   Association
No prior information regarding gene function required                    +         +
Localization to small genomic region                                     -         +
Not susceptible to effects of stratification                             +         -/+
Sufficient power to detect common alleles of modest effect (MAF > 5%)    -/+       +
Ability to detect rare alleles (MAF < 1%)                                +         -
Tools for analysis available                                             +         +/-
Hirschhorn & Daly, Nature Rev. Genet. 2005
Allelic Association
 Direct Association
 Allele of interest is itself involved in phenotype
 Indirect Association
  Allele itself is not involved, but is in LD with
the functional variant
 Spurious association
 Confounding factors (e.g., population
stratification)
Linkage disequilibrium
  Non-random association between alleles at
different loci. Loci are in LD if alleles are
present on haplotypes in different
proportions than expected based on allele
frequencies
 Two alleles that are in LD are occurring
together more often than would be
expected by chance
Linkage disequilibrium
Locus A: alleles A and a; frequencies pA and pa
Locus B: alleles B and b; frequencies pB and pb
Possible haplotypes:      AB      Ab      aB      ab
Expected frequencies:     pApB    pApb    papB    papb
Observed frequencies:     pAB     pAb     paB     pab
D = pAB - pApB ≠ 0
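The definition of D above can be illustrated with a short calculation. This is a minimal sketch: the function name `ld_measures` and the example frequencies are illustrative assumptions, and the normalised measures D' and r² are standard additions not defined on this slide. Both loci are assumed to be polymorphic.

```python
def ld_measures(p_AB, p_A, p_B):
    """Return (D, D', r^2) for two biallelic loci.

    p_AB is the observed frequency of haplotype AB;
    p_A and p_B are the allele frequencies of A and B.
    Assumes both loci are polymorphic (0 < p_A, p_B < 1).
    """
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    # D: observed minus expected haplotype frequency
    D = p_AB - p_A * p_B
    # D' rescales D by its maximum possible magnitude given the allele frequencies
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = D / d_max if d_max > 0 else 0.0
    # r^2: squared correlation between alleles at the two loci
    r2 = D ** 2 / (p_A * p_a * p_B * p_b)
    return D, d_prime, r2


# Hypothetical example: both alleles at frequency 0.5, haplotype AB seen at 0.4
D, d_prime, r2 = ld_measures(p_AB=0.4, p_A=0.5, p_B=0.5)
print(round(D, 3), round(d_prime, 3), round(r2, 3))
```

With these made-up frequencies the haplotype AB is more common than the 0.25 expected under independence, so D is positive.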
LD variation across genome
 The extent of LD is highly variable across
the genome
 The determinants of LD are not fully
understood.
  Factors that are believed to influence LD
     Genetic drift
     Population growth
     Admixture or migration
     Selection
     Variable recombination rates
Haplotype
[Figure: unphased genotypes at six loci (Locus1 to Locus6) are resolved into two haplotypes by identification of phase, using software such as PHASE or BEAGLE]
Haplotype-based analysis
 Increased ability to identify regions that
are shared identical by descent among
affected individuals
  Haplotypes may be the causative ‘composite
allele’ rather than a particular nucleotide
at a particular SNP
  Haplotype analysis is meaningful only if
the SNPs are themselves in LD
Monogenic versus complex traits
Monogenic trait
 Mutation in single gene is both necessary
and sufficient to produce the phenotype or
to cause the disease
 The impact of the gene on genetic risk is
the same in all families
 Follow clear segregation pattern in families
 Typically rare in population
Complex trait
 Multiple genes lead to genetic predisposition
to a phenotype
 Pedigree reveals no Mendelian pattern
 Any particular gene mutation is neither
sufficient nor necessary to explain the
phenotype
 Environment has major contribution
  We study the relative impact of individual
genes on the phenotype
Some examples
Disease               Mendelian/Complex   No. of genes   Incidence (in 100,000)
Cystic fibrosis       M                   1              40
Huntington disease    M                   1              5-10
Diabetes, type 2      C                   ?              10,000-20,000
Alzheimer             C                   ?              20,000
Schizophrenia         C                   ?              1000
Quantitative Trait
A biological trait that shows continuous
variation rather than falling into distinct
categories
Quantitative trait locus (QTL): a genetic locus
that is associated with variation in such a
quantitative trait
Assessing genetic contributions
to complex traits
 Continuous characters (wt, blood pressure)
 Heritability: Proportion of observed variance in
phenotype explained by genetic factors
 Discrete characters (disease)
 Relative risk ratio: λ= risk to relative of an
affected individual/risk in general population
 λ encompasses all genetic and environmental
effects, not just those due to any single locus
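The two summary measures above reduce to simple ratios. The sketch below uses made-up variances and risks purely for illustration; the function names are assumptions, not part of the original slides.

```python
def heritability(var_genetic, var_phenotypic):
    """Proportion of observed phenotypic variance explained by genetic factors."""
    return var_genetic / var_phenotypic


def relative_risk_ratio(risk_to_relative, population_risk):
    """lambda = risk to a relative of an affected individual / risk in the general population."""
    return risk_to_relative / population_risk


# Hypothetical numbers, for illustration only
h2 = heritability(var_genetic=30.0, var_phenotypic=100.0)
lam = relative_risk_ratio(risk_to_relative=0.10, population_risk=0.01)
print(h2, lam)
```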
Factors that influence identification
of allelic association
  Effect size
  Linkage disequilibrium
  Disease and marker allele frequencies
  Sample size
Reviewed by Zondervan & Cardon, Nature Rev. Genet. 2004
Sample size
Disease allele freq.   Marker allele freq.   Odds ratio 3.0   Odds ratio 2.0   Odds ratio 1.3
0.2                    0.2                   150              360              2900
0.2                    0.5                   430              1250             11,000
0.05                   0.2                   1170             4150             40,000
0.05                   0.5                   4200             15,000           160,000
No. of cases = no. of controls; D' = 0.7; power 80%; α = 0.001
Zondervan & Cardon (Nature Rev. Genet. 2004)
Population stratification
Consider two case/control samples, genotyped at a
marker with alleles M and m

Sample A      M       m       Freq.
Affected      50      50      0.10
Unaffected    450     450     0.90
Freq.         0.50    0.50
χ² NS

Sample B      M       m       Freq.
Affected      1       9       0.01
Unaffected    99      891     0.99
Freq.         0.10    0.90
χ² NS
Population stratification
Samples A and B combined:

A + B         M       m       Freq.
Affected      51      59      0.055
Unaffected    549     1341    0.945
Freq.         0.30    0.70
χ² = 14.8, P < 0.001
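The stratification example can be checked numerically. The sketch below is a plain Pearson χ² for a 2x2 table, with no continuity correction, applied to samples A, B and their pooled counts; the helper name `chi2_2x2` is an illustrative assumption.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Counts in the order (affected M, affected m, unaffected M, unaffected m)
sample_a = (50, 50, 450, 450)
sample_b = (1, 9, 99, 891)
pooled = tuple(x + y for x, y in zip(sample_a, sample_b))

chi2_a = chi2_2x2(*sample_a)
chi2_b = chi2_2x2(*sample_b)
chi2_pooled = chi2_2x2(*pooled)
print(chi2_a, chi2_b, round(chi2_pooled, 1))
```

Neither sample shows any association on its own, yet pooling two samples that differ in both allele frequency and disease prevalence produces a large test statistic.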
Dealing with population structure
  Genomic control (Devlin and Roeder, 1999)
     Stratification inflates the distribution of the test statistic by a factor λ
     λ is estimated from the data, using unlinked ‘null’ markers
     Test statistics are then adjusted by λ
[Figure: distribution of χ² at unlinked ‘null’ markers and at the test locus, relative to E(χ²), without stratification and with stratification]
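A minimal sketch of the genomic control adjustment follows. It assumes 1-df χ² statistics, estimates λ as the median of the null-marker statistics divided by the median of a χ² distribution with 1 df (about 0.455), and never lets λ fall below 1. The function name, the clamp at 1 and the example values are illustrative assumptions.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(null_stats, test_stats):
    """Estimate the inflation factor from null markers and deflate the test statistics.

    null_stats: 1-df chi-square statistics at unlinked 'null' markers
    test_stats: 1-df chi-square statistics at the loci being tested
    """
    lam = median(null_stats) / CHI2_1DF_MEDIAN
    lam = max(1.0, lam)  # assumption: do not inflate statistics when lambda < 1
    adjusted = [s / lam for s in test_stats]
    return lam, adjusted


# Hypothetical null-marker statistics showing inflation
null_stats = [0.2, 0.5, 0.9, 1.4, 3.0]
lam, adjusted = genomic_control(null_stats, test_stats=[14.8])
print(round(lam, 3), [round(s, 2) for s in adjusted])
```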
Dealing with population structure
 Structured association (Pritchard et al., 2000)
 Discover structure from set of unlinked markers,
i.e. assign probabilities of ancestry from k
populations to each individual, and then control
for it.
Association analysis approaches
 Case–control studies
  Marker allele frequencies are determined in a group of
affected individuals and compared with allele
frequencies in a control population
 Family based methods
 Based on unequal transmission of alleles from parents to
a single affected child in each family. Associations are
summed over many unrelated families
Case-Control studies: χ² test

Genotypes (2x3 contingency table)
          11      12      22      Total
Case      n11     n12     n22     N
Ctrl      m11     m12     m22     M
Total     T11     T12     T22     N+M

Alleles (2x2 contingency table)
          1       2       Total
Case      n1      n2      2N
Ctrl      m1      m2      2M
Total     T1      T2      2(N+M)

Test of independence:
χ² = Σ (O-E)²/E with 2 or 1 df
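The two tables above can be tested with the same routine. This sketch computes the Pearson χ² for any table with non-zero margins, and derives allele counts from genotype counts (each 11 genotype contributes two copies of allele 1, each 12 genotype one copy of each). Function names and counts are hypothetical.

```python
def chi2_independence(table):
    """Pearson chi-square test of independence; returns (statistic, degrees of freedom).

    Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def allele_counts(genotype_counts):
    """Convert counts of genotypes (11, 12, 22) into counts of alleles (1, 2)."""
    n11, n12, n22 = genotype_counts
    return [2 * n11 + n12, 2 * n22 + n12]


# Hypothetical genotype counts (11, 12, 22)
cases = [30, 50, 20]
controls = [20, 50, 30]

genotype_stat, genotype_df = chi2_independence([cases, controls])
allele_stat, allele_df = chi2_independence(
    [allele_counts(cases), allele_counts(controls)]
)
print(genotype_stat, genotype_df, allele_stat, allele_df)
```

The genotype test uses 2 df and the allele test 1 df, matching the slide.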
Family based tests
 Genotypes from independent family trios
where the child is affected
 Use the non-transmitted genotypes or
alleles as internal controls to the
transmitted ones
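Family-based association studies
[Figure: trio in which the parents have genotypes 12 and 34 and the affected child has genotype 14; the transmitted alleles form genotype 14 and the non-transmitted alleles form the control genotype 23]
Is an allele transmitted more often than
it is not transmitted to affected offspring?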
TDT: Transmission Disequilibrium Test
[Figure: trio with genotypes G/g, G/G and G/g]
                  Non-transmitted
Transmitted       G       g
G                 a       b
g                 c       d
TDT_G = (T_G - NT_G)²/(T_G + NT_G)
      = (b-c)²/(b+c) ~ χ² with 1 df
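The TDT statistic above uses only the off-diagonal cells b and c, that is, transmissions from heterozygous parents. A minimal sketch with hypothetical counts follows; the p-value uses the identity that the upper tail of a χ² with 1 df equals erfc(√(x/2)).

```python
from math import erfc, sqrt


def tdt(b, c):
    """Transmission disequilibrium test.

    b: heterozygous parents transmitting G and not g
    c: heterozygous parents transmitting g and not G
    Returns (chi-square statistic with 1 df, p-value).
    """
    stat = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-square with 1 df
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: G transmitted 60 times, not transmitted 40 times
stat, p_value = tdt(b=60, c=40)
print(stat, round(p_value, 4))
```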
TDT: Transmission Disequilibrium
Test
 Multiallelic markers
 ETDT (Sham & Curtis, 1995)
 Missing parent genotypes
  TRANSMIT (Clayton, 1999)
 Haplotypes
 TDTHAP (Clayton & Jones, 1999)
 Sibs
 TDT/STDT (Spielman & Ewens, 1998)
 Pedigrees
 PBAT (Martin et al, 2000)
 Quantitative traits
 QTDT (Abecasis et al. 2000)
Some limitations
  Subjects: random or structured families
 Parents not available
 Difficult when there are very many genes
individually of small effect
 Environmental influence may obscure
genetic effects
 Genetic heterogeneity underlying disease
phenotype
  Hidden (unaccounted) relationships
Rare allele
Single family is segregating
[Figure: haplotypes carrying alleles A/a and B/b, passed on to offspring group I and offspring group II]
Complex pedigree & quantitative traits
Complex pedigree
 Non-independence among pedigree
members
  Polygenic relationship alone is not sufficient
  Association analysis should account for the
point-wise relationship among individuals
     Identical-by-descent probabilities
Methods
  Combined linkage and LD
  Generalized linear models
  Mixed-model (Yu et al. 2006)
  Bayesian approach
Combined linkage and LD
Phenotype= Fixed factors + Polygene + Haplotype
• Polygene – the whole relationship in pedigree is used
• Identical-by-descent coefficients were estimated for
point-wise relationship
Phase determination - GDQTL
QTL mapping - DMU
QTL for Clinical Mastitis in cattle
[Figure: LRT profiles (0 to 16) against chromosome position in Morgan (0.0 to 1.0), built up over three slides: LA alone, then LA with LD, then LA, LD and combined LD/LA]
Simulation
  100 half-sib families (dairy cattle pedigree)
  2000 progeny
  5 chromosomes, 100 cM each
  5000 SNPs
  15 QTL (1 QTL at 10%, 4 QTL at 5%, 10 QTL at 2%)
     50% of the genetic variance
     Heritability: 30%
Generalized linear models
Phenotype= Sire-family + genotype
Software – TASSEL
http://www2.maizegenetics.net/index.php?page=bioinformatics/tassel
Generalized linear models
[Figure: -ln(p) (0 to 120) against genome position (0 to 5), shown on three successive slides]
Mixed-model (Yu et al. 2006)
Phenotype = Fixed factors + SNP + Population + Polygene
• SNP: 0, 1, 2
• Population: STRUCTURE
• Polygene: Relationship
SAS mixed model (Gael Pressoir)
Mixed-model
[Figure: -ln(p) (0 to 120) against genome position (0 to 5), shown on three successive slides]
Bayesian approach
Phenotype= Fixed factors + Polygene + Allele or
Haplotype
• All markers are fitted simultaneously, searching for the
marker combination that explains the trait variation
• Avoids multiple testing
Software – iBays (Janss LLG, 2007)
Bayesian approach
[Figure: posterior probability (0 to 1) against genome position (0 to 5), shown on two successive slides]
Multiple testing
 Performing one test at an alpha level of 0.05
implies 5% chance of rejecting a true null
hypothesis (false positive)
  Performing 100 tests at α = 0.05 when all
100 H0 are true, we expect 5 of the tests to
give FP results
  Pr(at least one FP) = 1 - Pr(no FP) = 1 - (0.95)^100
= 0.994 (if the tests are independent)
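The arithmetic above generalises to any number of independent tests. A minimal sketch (function names are illustrative):

```python
def expected_false_positives(alpha, m):
    """Expected number of false positives among m tests when every H0 is true."""
    return alpha * m


def prob_at_least_one_fp(alpha, m):
    """Pr(at least one false positive) among m independent tests when every H0 is true."""
    return 1.0 - (1.0 - alpha) ** m


expected_fp = expected_false_positives(0.05, 100)
fwer_100 = prob_at_least_one_fp(0.05, 100)
print(expected_fp, round(fwer_100, 3))
```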
Multiple testing
 Bonferroni correction
     Rejection level of each test is α_i = α/m
 Permutation test
 False discovery rate (FDR)
 What proportion of rejections are when H0 is
true?
 Of all the times you reject H0 how often is H0
true?
 q value (Storey et al. PNAS 2003)
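To make the contrast between Bonferroni and FDR control concrete, the sketch below applies both to a hypothetical list of p-values. Note that the FDR procedure shown is the Benjamini-Hochberg step-up rule, used here as a simple stand-in; it is not the Storey q-value method cited on the slide. Function names and p-values are illustrative assumptions.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max smallest p-values
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from 10 tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(n_bonferroni, n_bh)
```

In this made-up example the FDR rule rejects one more hypothesis than Bonferroni, reflecting its less stringent error criterion.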
Summary
  4 methods
     LD and linkage
     GLM
     Mixed-model
     Bayesian approach
Project team
Goutam Sahana
Bernt Guldbrandtsen
Luc Janss
Mogens Sandø Lund