Basics of association analysis


Association analysis
Genetics for Computer Scientists
15.3.-19.3.2004
Biomedicum & Department of Computer Science, Helsinki
Päivi Onkamo
Lecture outline
• Genetic association analysis
• Allelic association
• χ2 test
• Linkage disequilibrium (LD) process
• Formulation of the computational problem for LD mapping
• Limitations of the LD mapping
• Approaches. For example: HPM
Genetic association analysis
• Search for significant correlations between gene variants and phenotype
• For example, Locus A for SLE: 100 cases and 100 controls genotyped

            Affected   Unaffected
Allele 1    79         46
Allele 2    21         54

• Allele 1 seems to be associated, based on sheer numbers, but how sure can one be about it?
Allelic association = an allele is associated with a trait

            Affected   Healthy   Total
Allele 1    79         46        125
Allele 2    21         54        75
Total       100        100       200
• The idea is to compare the observed frequencies to the frequencies expected under the hypothesis of no association between alleles and the occurrence of the disease (independence between variables)
• Test statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
where
• $o_i$ is the observed class frequency for class i, and $e_i$ the frequency expected under H0 of no association
• k is the number of classes in the table
• Degrees of freedom for the test: df = (r-1)(s-1), where r and s are the numbers of rows and columns in the table
Expected frequencies (observed in parentheses)

            Affected    Healthy     Total
Allele 1    62.5 (79)   62.5 (46)   125
Allele 2    37.5 (21)   37.5 (54)   75
Total       100         100         200

$$\chi^2 = \sum_{i,j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(79 - 62.5)^2}{62.5} + \frac{(46 - 62.5)^2}{62.5} + \frac{(21 - 37.5)^2}{37.5} + \frac{(54 - 37.5)^2}{37.5} = 23.23$$

df = 1, p << 0.001
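The calculation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are illustrative, and the p-value shortcut applies only to df = 1, where a χ2 variable is the square of a standard normal variable.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p-value, valid for df = 1 only: P(Z^2 > stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))

# Rows: allele 1, allele 2; columns: affected, healthy (counts from the slide)
observed = [[79, 46], [21, 54]]
stat = chi_square_statistic(observed)
p = p_value_df1(stat)
print(round(stat, 2), p < 0.001)  # 23.23 True
```

The statistic function works for any r x s table with non-zero margins; only the p-value helper is restricted to the 2 x 2 case.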
χ2 critical values by degrees of freedom (df) and upper-tail probability

df    0.995    0.950    0.100     0.050     0.025     0.010     0.005
1     0.000    0.004    2.706     3.842     5.024     6.635     7.879
2     0.010    0.103    4.605     5.992     7.378     9.210     10.597
3     0.072    0.352    6.251     7.815     9.348     11.345    12.838
4     0.207    0.711    7.779     9.488     11.143    13.277    14.860
5     0.412    1.146    9.236     11.071    12.833    15.086    16.750
6     0.676    1.635    10.645    12.592    14.449    16.812    18.548
7     0.989    2.167    12.017    14.067    16.013    18.475    20.278
8     1.344    2.733    13.362    15.507    17.535    20.090    21.955
9     1.735    3.325    14.684    16.919    19.023    21.666    23.589
10    2.156    3.940    15.987    18.307    20.483    23.209    25.188
11    2.603    4.575    17.275    19.675    21.920    24.725    26.757
12    3.074    5.226    18.549    21.026    23.337    26.217    28.300
13    3.565    5.892    19.812    22.362    24.736    27.688    29.819
14    4.075    6.571    21.064    23.685    26.119    29.141    31.319
15    4.601    7.261    22.307    24.996    27.488    30.578    32.801
16    5.142    7.962    23.542    26.296    28.845    32.000    34.267
17    5.697    8.672    24.769    27.587    30.191    33.409    35.718
18    6.265    9.390    25.989    28.869    31.526    34.805    37.156
19    6.844    10.117   27.204    30.144    32.852    36.191    38.582
20    7.434    10.851   28.412    31.410    34.170    37.566    39.997
21    8.034    11.591   29.615    32.671    35.479    38.932    41.401
22    8.643    12.338   30.813    33.924    36.781    40.289    42.796
23    9.260    13.091   32.007    35.172    38.076    41.638    44.181
24    9.886    13.848   33.196    36.415    39.364    42.980    45.558
25    10.520   14.611   34.382    37.652    40.646    44.314    46.928
26    11.160   15.379   35.563    38.885    41.923    45.642    48.290
27    11.808   16.151   36.741    40.113    43.195    46.963    49.645
28    12.461   16.928   37.916    41.337    44.461    48.278    50.994
29    13.121   17.708   39.087    42.557    45.722    49.588    52.335
30    13.787   18.493   40.256    43.773    46.979    50.892    53.672
40    20.707   26.509   51.805    55.758    59.342    63.691    66.766
50    27.991   34.764   63.167    67.505    71.420    76.154    79.490
60    35.534   43.188   74.397    79.082    83.298    88.379    91.952
70    43.28    51.74    85.53     90.53     95.02     100.43    104.21
80    51.17    60.39    96.58     101.88    106.63    112.33    116.32
90    59.20    69.13    107.57    113.15    118.14    124.12    128.30
Interpretation of the test results
• The p-value is low enough that H0 can be rejected: the probability that the observed frequencies would differ this much (or even more) from the expected ones by chance alone is < 0.001
• Critical values and p-values: χ2 tables (Appendix), internet resources, etc.
• Genetic association is a population-level correlation between some known genetic variant and a trait: an allele is overrepresented in affected individuals
• From a genetic point of view, an association does not imply a causal relationship
• Often, a gene is not a direct cause of the disease, but is in LD with a causative gene
Linkage disequilibrium (LD)
• Closely located genes often show linkage disequilibrium with each other: Locus 1 with alleles A and a, and locus 2 with alleles B and b, at a distance of a few centiMorgans from each other
• At equilibrium, the frequency of the AB haplotype should equal the product of the allele frequencies of A and B: $p_{AB} = p_A p_B$. If this holds, then $p_{Ab} = p_A p_b$, $p_{aB} = p_a p_B$ and $p_{ab} = p_a p_b$ as well. Any deviation from these values implies LD.
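As a numerical illustration of the deviation from equilibrium, the sketch below computes D = p_AB - p_A p_B and the normalized measure D' (D divided by its largest attainable magnitude given the allele frequencies). The frequencies are hypothetical, chosen only for illustration, and the function name is not from the slides.

```python
def ld_measures(p_AB, p_A, p_B):
    """Return (D, D') from the AB haplotype frequency and the allele frequencies of A and B."""
    d = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    # D' keeps the sign of D here; it is often reported as an absolute value
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime

# Hypothetical frequencies: equilibrium would give p_AB = 0.5 * 0.5 = 0.25
d_eq, d_prime_eq = ld_measures(0.25, 0.5, 0.5)   # no LD
d_ld, d_prime_ld = ld_measures(0.40, 0.5, 0.5)   # AB haplotype is too common
print(d_eq, d_prime_eq)
print(round(d_ld, 3), round(d_prime_ld, 3))  # 0.15 0.6
```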
Linkage disequilibrium (LD)
• LD follows from the fact that closely located genes are transmitted as a "block" which only rarely breaks up in meioses
• An example:
– Locus 1 – marker gene
– Locus 2 – disease locus, with allele b as a dominant susceptibility allele with 100% penetrance
An example
• When association is evaluated, Locus 1 also seems associated, even though it has nothing to do with the disease: the association is observed just due to LD
LD mapping – utilizing founder effect
• A new disease mutation arose n generations ago in a relatively small, isolated population
• The original ancestral haplotype slowly decays as a function of generations
• In the last generation, only small stretches of the founder haplotype can be observed in the disease-associated chromosomes
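The decay can be made concrete under a simple assumption that the slides do not state explicitly: if the recombination fraction between a marker and the mutation is θ per meiosis, the chance that the two are still on the same ancestral segment after n generations is (1 - θ)^n. A minimal sketch with hypothetical values of θ and n:

```python
def intact_probability(theta, generations):
    """Probability that no recombination has separated marker and mutation
    in `generations` successive meioses, each with recombination fraction `theta`."""
    return (1.0 - theta) ** generations

# Hypothetical values: markers at roughly 0.1, 1 and 5 cM, mutation 20 generations old
for theta in (0.001, 0.01, 0.05):
    print(theta, round(intact_probability(theta, 20), 3))
```

Closer markers keep the founder allele far longer, which is why only short stretches of the founder haplotype survive around the mutation.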
LD mapping: Utilizing founder effect
Data: Searching for a needle in a haystack
[Figure: table of genotype data. Column labels: "Disease gene", "Disease status", SNP1, S2, ... Rows carry a status code (a, c or ?) and marker alleles coded 1 or 2, with ? where a value is not available.]
• The task is to find either an allele or an allele string (haplotype) which is overrepresented in disease-associated chromosomes
– markers may vary: SNPs, microsatellites
– populations vary: the strength of marker-to-marker LD
• Many approaches:
– "old-fashioned" allele association with some simple test (problem: multiple testing)
– TDT; modelling of LD process: Bayesian, EM algorithm, integrated linkage & LD
Limitations of the LD mapping
• The relationship between the distance between the markers and the strength of LD: theoretical curve
[Figure: Linkage disequilibrium (D') for the African American (red) and European (blue) populations binned in 5 kb classes after removing all SNPs with minor allele frequencies less than 20%. 3429 SNPs were included. Source: http://www.fhcrc.org/labs/kruglyak/PGA/pga.html]
Limitations: LD is a random process
• LD is a continuous process, which is created and decreased by several factors:
– genetic drift
– population structure
– natural selection
– new mutations
– founder effect
→ limits the accuracy of association mapping
Research challenges …
• Haplotyping methods needed as
prerequisite for association/LD methods
• …or, searching association directly from
genotype data (without the haplotyping
stage)
• Better methods for measurement of the
association (and/or the effects of the
genes)
• Taking disease models into consideration
A methodological project:
Haplotype Pattern Mining (HPM)
AJHG 67:133-145, 2000
• Search the haplotype data for recurrent
patterns with no pre-specified sequence
• Patterns may contain gaps, taking into
consideration missing and erroneous data
• The patterns are evaluated for their strength
of association
• Markerwise ‘score’ of association is
calculated

Algorithm
1. Find a set of associated haplotype patterns
– number of gaps allowed (2)
– maximum gap length (1 marker)
– maximum pattern length (7 markers)
– association threshold (χ2 = 9)
2. Score loci based on the patterns
• Evaluate significance by permutation tests
• Extendable to quantitative traits
• Extendable to multiple genes
Example: a set of associated patterns
Marker  P1   P2   P3   P4   P5   P6   P7   P8   P9   Score
01      2    2    2    2    1    *    *    2    2    5
02      1    1    1    1    *    *    2    1    1    6
03      2    2    2    *    1    1    1    1    1    7
04      2    2    2    2    2    2    2    2    *    7
05      2    2    *    1    2    2    *    *    *    6
06      *    1    1    *    *    1    *    *    *    3
07      *    *    1    *    *    2    *    *    *    2
08      *    *    *    *    *    *    *    *    *    0
χ2      9.6  9.2  8.9  8.1  7.4  7.1  7.1  6.9  6.8
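The slide does not spell out how the marker score is computed. The sketch below assumes that a marker's score is the number of patterns whose span (from the first to the last fixed allele, gaps included) covers that marker. Under that assumption, counting patterns P1 to P7 reproduces the Score values in the example; leaving out P8 and P9 is an inference from the numbers, not something the slide states. The pattern strings are read from the example table, with "*" for an unspecified position.

```python
# Patterns over markers 01..08, read column by column from the example table
PATTERNS = {
    "P1": "21222***", "P2": "212221**", "P3": "2122*11*",
    "P4": "21*21***", "P5": "1*122***", "P6": "**12212*",
    "P7": "*212****", "P8": "2112****", "P9": "211*****",
}

def span(pattern):
    """Indices of the first and last fixed (non-'*') positions of a pattern."""
    fixed = [i for i, ch in enumerate(pattern) if ch != "*"]
    return fixed[0], fixed[-1]

def marker_scores(patterns, n_markers=8):
    """Assumed scoring rule: count the patterns whose span covers each marker."""
    scores = [0] * n_markers
    for pattern in patterns:
        first, last = span(pattern)
        for marker in range(first, last + 1):
            scores[marker] += 1
    return scores

top7 = [PATTERNS[f"P{i}"] for i in range(1, 8)]
print(marker_scores(top7))  # [5, 6, 7, 7, 6, 3, 2, 0]
```

Gaps count toward the score: P3 has no fixed allele at marker 05 but still covers it, which is what brings that marker's score to 6.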
Pattern selection
• The set of potential patterns is large.
• Depth-first search for all potential patterns
• Search parameters limit search space:
– number of gaps
– maximum gap length
– maximum pattern length
– association threshold
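A depth-first search under these four parameters can be sketched as follows. This is an illustrative toy, not the published HPM implementation: the data, the function names, the 2 x 2 χ2 used as the association measure, and the rule of keeping only patterns overrepresented in cases are all assumptions made for the example.

```python
def matches(pattern, haplotype):
    """pattern maps marker index -> allele; markers not listed are "don't care"."""
    return all(haplotype[i] == allele for i, allele in pattern.items())

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]; 0 if a margin is empty."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else (a + b + c + d) * (a * d - b * c) ** 2 / denom

def association(pattern, cases, controls):
    """Return (number of matching chromosomes, chi-square if overrepresented in cases)."""
    a = sum(matches(pattern, h) for h in cases)
    b = sum(matches(pattern, h) for h in controls)
    stat = chi2_2x2(a, b, len(cases) - a, len(controls) - b)
    overrepresented = a * len(controls) > b * len(cases)
    return a + b, (stat if overrepresented else 0.0)

def find_patterns(cases, controls, alleles=(1, 2), max_gaps=2,
                  max_gap_len=1, max_len=7, threshold=9.0):
    n_markers = len(cases[0])
    found = []

    def extend(pattern, first, last, gaps):
        support, stat = association(pattern, cases, controls)
        if support == 0:
            return  # nothing matches, so no longer pattern can match either
        if stat >= threshold:
            found.append((dict(pattern), stat))
        for skip in range(max_gap_len + 1):       # skip > 0 opens a gap
            nxt = last + 1 + skip
            new_gaps = gaps + (1 if skip else 0)
            if nxt >= n_markers or nxt - first + 1 > max_len or new_gaps > max_gaps:
                continue
            for allele in alleles:
                pattern[nxt] = allele
                extend(pattern, first, nxt, new_gaps)   # go deeper first
                del pattern[nxt]

    for start in range(n_markers):
        for allele in alleles:
            extend({start: allele}, start, start, 0)
    return found

# Hypothetical data: a haplotype with allele 2 at markers 1 and 3 is enriched in cases
carrier, other = (1, 2, 1, 2, 1, 1), (1, 1, 1, 1, 1, 1)
cases = [carrier] * 16 + [other] * 4
controls = [carrier] * 4 + [other] * 16
found = find_patterns(cases, controls)
```

Each pattern is reached along exactly one path (its fixed positions in left-to-right order), so no pattern is evaluated twice. The default parameter values mirror those listed on the algorithm slide.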
Score and localization: an example
Permutation tests
• Random permutation of the status fields of the chromosomes
• 10,000 permutations
• HPM and marker scores recalculated for each permuted data set
• Proportion of permuted data sets in which score > true score → empirical p-value
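The permutation procedure can be sketched generically. This is a minimal illustration, not the HPM code: the score function, the status labels "a" and "c", and the data are hypothetical, and any statistic that depends on the status labels could be plugged in.

```python
import random

def empirical_p_value(statuses, score_fn, n_perm=10_000, seed=0):
    """Proportion of status permutations whose score exceeds the observed score."""
    rng = random.Random(seed)
    observed = score_fn(statuses)
    shuffled = list(statuses)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)             # permute the status fields only
        if score_fn(shuffled) > observed:
            exceed += 1
    return exceed / n_perm

# Hypothetical data: one marker allele per chromosome, in a fixed order
alleles = [1] * 10 + [2] * 10

def carriers_among_affected(status_list):
    """Toy score: number of affected ('a') chromosomes carrying allele 1."""
    return sum(1 for s, al in zip(status_list, alleles) if s == "a" and al == 1)

perfect = ["a"] * 10 + ["c"] * 10                     # allele 1 only in affected
weak = ["a"] * 6 + ["c"] * 4 + ["a"] * 4 + ["c"] * 6  # 6 of 10 affected carry allele 1

p_perfect = empirical_p_value(perfect, carriers_among_affected, n_perm=1000)
p_weak = empirical_p_value(weak, carriers_among_affected, n_perm=1000)
print(p_perfect)  # 0.0: the observed score (10) is the largest possible
```

Only the status labels are shuffled; the genotype data stay fixed, so the LD structure between markers is preserved in every permuted data set.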
Permutation surface (A=7.5 %). The solid line is
the observed frequency.
Localization power with simulated SNP data (density 3
SNPs per 1 cM). Isolated population with a 500-year
history was simulated. Disease model was monogenic with
disease allele frequency varying from 2.5-10 % in the
affecteds. 12.5 % of data was missing. Sample size 100
cases and 100 controls.
Benefits & drawbacks
• Non-parametric, yet efficient approach; no
disease model specification is needed +
• Powerful even with weak genetic effects
and small data sets +
• Robust to genotyping errors, mutations,
missing data +
• Allows for gaps in haplotypes +
• Flexible: easily extended to different types
of markers, environmental covariates, and
quantitative measurements +
• Optimal pattern search parameters may need to be specified case-wise –
• No rigid statistical theory background –
• Requires dense enough map to find the area where DS gene is in LD with nearby markers –
• Search of the susceptibility gene:
1. With good luck, and information from gene banks, pick up the correct candidate gene
2. The genetic region with a positive linkage signal is saturated with markers, and this data is then searched for a secondary correlation: correlation of marker allele(s) with the actual disease mutation (LD)



• Improved statistical methods to detect LD
– Terwilliger (1995)
– Devlin, Risch, Roeder (1996)
– McPeek and Strahs (1999)
– Service, Lang et al. (1999)
• Statistical power of association test statistics
– Long, Langley (1999)
• Review on statistical approaches to gene mapping
– Ott, Hoh (2000)
