Intro to Computational Genetics
Workshop in Bioinformatics
Eran Halperin
The Human Genome Project
“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”
“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”
“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”
Washington, DC
June 26, 2000
The Vision of Personalized Medicine
Genetic and epigenetic variants +
measurable environmental/behavioral factors would
be used for a personalized treatment and diagnosis
Example: Warfarin
An anticoagulant drug,
useful in the prevention
of thrombosis.
Warfarin was originally
used as rat poison.
Optimal dose varies
across the population
Genetic variants in the VKORC1 and CYP2C9 genes affect
the optimal dose for each individual.
Association Studies
Genetic variants such as Single Nucleotide Polymorphisms
(SNPs) and Copy Number Variants (CNVs) are tested for
association with the trait.
SNP = Single Nucleotide Polymorphism
Usually SNPs are bi-allelic.
Where should we look?
Cases:
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Associated SNP: the G/T site at position 45, where most cases carry G and most controls carry T.
Controls:
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
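One way to answer "where should we look?" is to scan every column of the aligned sequences above and compare allele frequencies between cases and controls. The sketch below does this with a deliberately crude score, the absolute difference in the frequency of a reference allele. The function name `allele_frequency_gaps` and the choice of the first case sequence as the reference are assumptions made for illustration, not part of the original slides.

```python
# Aligned sequences copied from the slide.
cases = [
    "AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC",
    "AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC",
    "AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC",
    "AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC",
    "AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC",
    "AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC",
    "AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC",
    "AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC",
]
controls = [
    "AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC",
    "AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC",
    "AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC",
    "AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC",
    "AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC",
    "AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC",
    "AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC",
    "AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC",
]


def allele_frequency_gaps(case_seqs, control_seqs):
    """For each column, |freq in cases - freq in controls| of a reference allele.

    The reference allele is taken (arbitrarily) from the first case sequence.
    """
    gaps = []
    for column in range(len(case_seqs[0])):
        ref = case_seqs[0][column]
        f_cases = sum(s[column] == ref for s in case_seqs) / len(case_seqs)
        f_controls = sum(s[column] == ref for s in control_seqs) / len(control_seqs)
        gaps.append(abs(f_cases - f_controls))
    return gaps


gaps = allele_frequency_gaps(cases, controls)
best_column = max(range(len(gaps)), key=gaps.__getitem__)
print("largest gap at position", best_column + 1, "gap =", gaps[best_column])
```

A frequency gap is only a ranking heuristic. A real study would replace it with a proper test statistic and a multiple-testing correction.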
Published Genome-Wide Associations through 6/2009:
439 published GWA at p < 5 x 10^-8
NHGRI GWA Catalog
www.genome.gov/GWAStudies
Complex disease
Genetic factors and environmental factors both contribute to a complex disease.
Multiple genes may affect the disease.
Therefore, the effect of every single gene
may be negligible.
How does it work?
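• For every SNP we can construct a
contingency table of allele counts:

            A    G    Total
Cases       a    b    n
Controls    c    d    n

$n = a + b = c + d$
$p_1 = a / n, \quad p_2 = c / n$
$p = \frac{p_1 + p_2}{2}$
$T = \frac{n (p_1 - p_2)^2}{2 p (1 - p)}$

Under the null hypothesis of no association, T approximately follows a chi-square distribution with one degree of freedom.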
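As a worked illustration of the contingency-table test, the sketch below computes T = n (p1 - p2)^2 / (2 p (1 - p)) for a 2x2 allele-count table with equal row totals, and converts it to a p-value using the chi-square distribution with one degree of freedom. The function names and the example counts (7, 1, 1, 7) are hypothetical and chosen only for illustration.

```python
import math


def association_statistic(a, b, c, d):
    """T for a 2x2 allele table: cases (a, b), controls (c, d).

    Assumes both rows have the same total n, as in the table on the slide.
    Raises ZeroDivisionError if one allele is absent from the whole sample.
    """
    n = a + b
    if n != c + d:
        raise ValueError("this form of T assumes equal totals in cases and controls")
    p1 = a / n               # allele frequency in cases
    p2 = c / n               # allele frequency in controls
    p = (p1 + p2) / 2        # pooled allele frequency
    return n * (p1 - p2) ** 2 / (2 * p * (1 - p))


def chi_square_1df_p_value(t):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2))


# Hypothetical counts: 7 vs 1 in cases, 1 vs 7 in controls.
t = association_statistic(7, 1, 1, 7)
p_value = chi_square_1df_p_value(t)
print(t, p_value)
```

With only eight chromosomes per group the chi-square approximation is rough. An exact test would be the safer choice at this sample size.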
Results: Manhattan Plots
The curse of dimensionality –
corrections of multiple testing
• In a typical Genome-Wide Association Study
(GWAS), we test millions of SNPs.
• If we set the p-value threshold for each test to
be 0.05, by chance we will “find” about 5% of
the SNPs to be associated with the disease.
• This needs to be corrected.
Bonferroni Correction
• If the number of tests is n, we set the
threshold to be 0.05/n.
• A very conservative test. If the tests are
independent then it is reasonable to use it. If
the tests are correlated this could be bad:
– Example: If all SNPs are identical, then we lose a
lot of power; the false positive rate reduces, but so
does the power.
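The arithmetic behind the correction can be sketched as follows, assuming independent tests. The helper names are hypothetical. With one million tests, the Bonferroni threshold for a family-wise level of 0.05 is 0.05 / 1,000,000 = 5 x 10^-8.

```python
def expected_false_positives(num_tests, per_test_alpha):
    """Expected number of null SNPs passing the threshold by chance."""
    return num_tests * per_test_alpha


def bonferroni_threshold(num_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate <= family_alpha."""
    return family_alpha / num_tests


def family_wise_error_rate(num_tests, per_test_alpha):
    """P(at least one false positive), assuming the tests are independent."""
    return 1.0 - (1.0 - per_test_alpha) ** num_tests


num_snps = 1_000_000
naive = expected_false_positives(num_snps, 0.05)
threshold = bonferroni_threshold(num_snps)
fwer = family_wise_error_rate(num_snps, threshold)
print(naive, threshold, fwer)
```

If the tests are strongly correlated, the effective number of independent tests is smaller than `num_snps`, which is why the correction becomes overly conservative in that case.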
Data
The International HapMap Project: an international
consortium that aims to genotype 270
individuals from four different
populations.
HUJI 2006
- Launched in 2002.
- First phase (2005):
~1 million SNPs for 270 individuals from four populations
- Second phase (2007):
~3.1 million SNPs for 270 individuals from four populations
- Third phase (ongoing):
> 1 million SNPs for 1115 individuals across 11 populations
Other Data Sources
• Human Genome Diversity Project
– 50 populations, 1000 individuals, 650k SNPs
• POPRES
– 6000 individuals (controls)
• ENCODE Project
– Resequencing, discovery of new SNPs
• 1000 Genomes project
• dbGaP
Haplotypes
• Can 1,000,000 SNPs tell us everything?
• No, but they can still tell us a lot about
the rest of the genome.
– SNPs in physical proximity are correlated.
– A sequence of alleles along a chromosome
is called a haplotype.
Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
LD structure
Phasing - haplotype inference
Haplotypes (one on the mother chromosome, one on the father chromosome):
ATCCGA
AGACGC
Genotype:
A  {T,G}  {C,A}  C  G  {A,C}
• Cost effective genotyping technology
gives genotypes and not haplotypes.
Possible phases:
ATACGA / AGCCGC
AGACGA / ATCCGC
….
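To make the ambiguity concrete, the sketch below enumerates every haplotype pair that is consistent with an unphased genotype. The function names `genotype_of` and `possible_phases` are hypothetical. A genotype with k heterozygous sites has 2^(k-1) distinct phasings for k >= 1, so the six-site example from the slide, with three heterozygous sites, has four.

```python
from itertools import product


def genotype_of(hap_a, hap_b):
    """Unphased genotype: the unordered set of alleles at each site."""
    return [frozenset((x, y)) for x, y in zip(hap_a, hap_b)]


def possible_phases(genotype):
    """All unordered haplotype pairs consistent with the genotype."""
    het_sites = [i for i, alleles in enumerate(genotype) if len(alleles) == 2]
    phases = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        flips = dict(zip(het_sites, choice))
        first, second = [], []
        for i, alleles in enumerate(genotype):
            ordered = sorted(alleles)
            if len(ordered) == 1:
                # Homozygous site: both haplotypes carry the same allele.
                first.append(ordered[0])
                second.append(ordered[0])
            else:
                # Heterozygous site: choose which haplotype gets which allele.
                x, y = ordered if flips[i] == 0 else ordered[::-1]
                first.append(x)
                second.append(y)
        phases.add(frozenset(("".join(first), "".join(second))))
    return phases


genotype = genotype_of("ATCCGA", "AGACGC")
phases = possible_phases(genotype)
for pair in sorted(sorted(p) for p in phases):
    print(" / ".join(pair))
```

Exhaustive enumeration is only practical for short windows. Statistical phasing methods exist precisely because the number of candidates doubles with every additional heterozygous site.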
Inferring Haplotypes From Trios
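Genotype coding: 0 and 1 are the two homozygous genotypes, 2 is heterozygous; "?" marks an allele that is not yet resolved.

            Genotype   Haplotype pair, resolved step by step
Parent 1    122112     1??11? / 1??11?  →  10?11? / 11?11?  →  10011? / 11111?
Parent 2    210022     ?100?? / ?100??  →  1100?? / 0100??  →  11000? / 01001?
Child       120222     1?0??? / 1?0???  →  100??? / 110???  →  10011? / 11000?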
Assumption: No recombination
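The trio logic can be sketched site by site, assuming no recombination and Mendelian-consistent genotypes. The coding used here (0 and 1 for the two homozygous genotypes, 2 for heterozygous) is inferred from the slide's example, and the function names are hypothetical. A site stays unresolved ("?") only when all three individuals are heterozygous.

```python
def phase_site(g1, g2, gc):
    """Resolve one site of a trio.

    g1, g2, gc: genotypes of parent 1, parent 2 and the child (0, 1 or 2).
    Returns (t1, u1, t2, u2): the transmitted and untransmitted alleles of
    parent 1, then of parent 2, as characters '0', '1' or '?'.
    """
    if gc != 2:
        t1 = t2 = gc                 # homozygous child: both parents passed gc
    elif g1 != 2:
        t1, t2 = g1, 1 - g1          # parent 1 homozygous fixes both alleles
    elif g2 != 2:
        t1, t2 = 1 - g2, g2          # parent 2 homozygous fixes both alleles
    else:
        return ("?", "?", "?", "?")  # everyone heterozygous: ambiguous
    u1 = g1 if g1 != 2 else 1 - t1   # a heterozygous parent keeps the other allele
    u2 = g2 if g2 != 2 else 1 - t2
    return (str(t1), str(u1), str(t2), str(u2))


def phase_trio(parent1, parent2, child):
    """Phase genotype strings such as '122112'; returns four haplotype strings."""
    columns = [phase_site(int(a), int(b), int(c))
               for a, b, c in zip(parent1, parent2, child)]
    return tuple("".join(col[k] for col in columns) for k in range(4))


t1, u1, t2, u2 = phase_trio("122112", "210022", "120222")
print("parent 1:", t1, "/", u1)
print("parent 2:", t2, "/", u2)
print("child:   ", t1, "/", t2)
```

The child's two haplotypes are exactly the transmitted haplotypes of the two parents, which is what the no-recombination assumption buys.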
Population Substructure
• Imagine that all the cases are collected
from Africa, and all the controls are from
Europe.
– Many association signals are going to be
found
– The vast majority of them are false. Why?
• Allele frequencies differ between populations because of different evolutionary forces: drift, selection, mutation,
migration, population bottlenecks.
Natural Selection
• Example: being lactose tolerant is
advantageous in northern Europe,
hence there is positive selection in the
LCT gene, resulting in different allele
frequencies in LCT across populations.
Genetic Drift
• Even without selection, the allele frequencies in the population
are not fixed across time.
• Consider the following case:
– We assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are
mating randomly in the population.
– We assume a constant population size, no mutation, no selection
Genetic Drift: The Wright-Fisher Model
Generation 1: allele frequency 1/9
Generation 2: allele frequency 1/9
Generation 3: allele frequency 1/9
Generation 4: allele frequency 1/3
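A minimal simulation of the Wright-Fisher model, under the assumptions listed earlier (random mating, constant population size, no mutation, no selection): each copy in the next generation picks its parent copy uniformly at random, so the allele count is a binomial draw. The function name, the seed, and the number of generations are arbitrary choices for illustration. The starting point of 1 allele copy out of 9 mirrors the slides.

```python
import random


def wright_fisher(num_copies, initial_count, generations, rng):
    """Simulate allele frequencies over time in a population of fixed size.

    num_copies: number of chromosome copies in every generation.
    initial_count: copies carrying the allele in the first generation.
    Returns a list of frequencies of length generations + 1.
    """
    counts = [initial_count]
    for _ in range(generations):
        frequency = counts[-1] / num_copies
        # Each new copy inherits the allele with probability `frequency`.
        next_count = sum(rng.random() < frequency for _ in range(num_copies))
        counts.append(next_count)
    return [count / num_copies for count in counts]


rng = random.Random(1)
trajectory = wright_fisher(num_copies=9, initial_count=1, generations=20, rng=rng)
print(trajectory)
```

Frequencies 0 and 1 are absorbing: once the allele is lost or fixed it stays that way, since the model has no mutation. Small populations reach those states quickly, which is how isolated populations end up with different allele frequencies.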
Ancestral population → migration → genetic drift → different allele frequencies
Population Substructure
• Imagine that all the cases are collected
from Africa, and all the controls are from
Europe.
– Many association signals are going to be
found
– The vast majority of them are false.
• What can we do about it?
Jakobsson et al., Nature 451: 998-1003
Principal Component Analysis
• Dimensionality reduction
• Based on linear algebra
• Intuition: find the ‘most important’
features of the data
• One dimension: plot the data on the
line for which the ‘spread’ is maximized.
• In our case, we want to look at two
dimensions at a time.
• The original data has many dimensions
– each SNP corresponds to one
dimension.
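The one-dimensional case can be sketched without any external library: center each SNP column, then use power iteration to find the direction along which the projected data has maximal spread. This is a toy illustration, not how the slides' plots were produced. The function name and the small genotype matrix (rows are individuals, columns are SNPs coded as 0, 1 or 2 copies of an allele) are made up. In practice a linear algebra library would be used instead.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (direction, scores) for the first principal component.

    rows: list of equal-length numeric lists, one per individual.
    Uses power iteration on X^T X, where X is the column-centered data.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]

    def project(v):
        return [sum(x * vj for x, vj in zip(row, v)) for row in centered]

    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(m)]
    for _ in range(iterations):
        scores = project(v)
        # w = X^T (X v)
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, project(v)


# Hypothetical genotypes: the first three and last three individuals
# come from two populations with different allele frequencies.
genotypes = [
    [2, 2, 0, 1],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
    [0, 0, 2, 1],
    [0, 1, 2, 0],
    [0, 0, 2, 1],
]
direction, scores = first_principal_component(genotypes)
print(direction)
print(scores)
```

The sign of a principal component is arbitrary, so only the separation between the two groups of scores is meaningful, not which group is positive.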
HapMap Populations
MKK, LWK, YRI, GIH, ASW, CHD, CEU, MEX, JPT, CHB, TSI
HapMap PCA 1-2
HapMap PCA 1-3
HapMap PCA 1,2,4
Ancestry Inference:
• To what extent can population structure be detected from SNP
data?
• What can we learn from these inferences?
Novembre et al., 2008
Ancestry inference in recently admixed populations
Puerto Rican Population (GALA study, E. Burchard)
[Figure: percent racial admixture (0% to 100%) for individual subjects 1-90, split into European, African, and Native American ancestry]
Recombination Events
Copy 1 and Copy 2 are the parent's two chromosome copies.
The child chromosome is a mosaic of the two, with
probability r_i for recombination in position i.
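The recombination picture can be sketched as a walk along the chromosome: start on one of the parent's two copies and switch to the other copy at position i with probability r_i. The function name and the example rates below are hypothetical, and the recombination events at different positions are treated as independent.

```python
import random


def make_child_chromosome(copy1, copy2, rates, rng):
    """Build the chromosome a parent transmits to a child.

    copy1, copy2: the parent's two copies, as equal-length strings.
    rates: rates[i] is the probability of a recombination between
           positions i - 1 and i (rates[0] is unused).
    """
    copies = (copy1, copy2)
    current = rng.choice((0, 1))          # which copy we start on
    child = []
    for i in range(len(copy1)):
        if i > 0 and rng.random() < rates[i]:
            current = 1 - current         # recombination: switch copies
        child.append(copies[current][i])
    return "".join(child)


rng = random.Random(0)
child = make_child_chromosome("ATCCGA", "AGACGC", [0.3] * 6, rng)
print(child)
```

Repeating this step over many generations breaks the chromosome into ever shorter segments inherited from each original copy, which is what the admixture model relies on.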
Recently Admixed Populations
[Figures: the admixed population after generation 1, after generation 2, and after generation 10]
Model notation:
W: recombination indicators
g: generations
Z: ancestral states
r: recombination rate
X: alleles
α: admixture fraction
p, q: allele frequencies
Chromosome
Overall Accuracy
Applications:
• Population genetics (admixture events, recombination events,
selection forces, migration patterns)
• Potential applications in personalized medicine
• Finding new associations (through admixture mapping)
Admixture Mapping