Ascertainment Bias

Introduction to Genetic
Epidemiology
HRM 728 - 2015
Course Coordinator: Dr. Sonia Anand
Course Dataset Assistant: Binod
Course Outline
• 14 classes
• Mid-Term Assignment: 16-October-2015
• Help Session/Analytical Questions using
PLINK – Nov 20, 2015
• Final Exam – Dec 4, 2015
• Final Assignment-Independent Study
Presentation - Dec 11, 2015
Student Evaluation
• Class Attendance/Participation: 15%
• Mid-Term Assignment: 25% (5-page, single-spaced
scholarly summary; topic preapproved by Dr. Anand)
• Final Exam: 25%
• Independent Study: 35% including class
presentation
Seminar 1
• Key Concepts in Genetic Epidemiology
– What does genetic epidemiology mean to
you?
Epidemiology
Biology
Statistics
~50 years
1865: Mendel discovers laws of genetics
1900: Rediscovery of Mendel’s genetics
1944: DNA identified as hereditary material
1953: DNA structure
1960’s: Genetic code
1977: Advent of DNA sequencing
1975-79: First human genes isolated
1986: DNA sequencing automated
1990: Human genome project officially begins
1995: First whole genome
1999: First human chromosome
2003: ‘Finished’ human genome sequence
The Human genome project
The Human genome project promised to
revolutionise medicine and explain every
base of our DNA.
Large MEDICAL GENETICS focus
Identify variation in
the genome that is
disease causing
Determine how individual
genes play a role in health
and disease
The 2 Human Genome Projects
PUBLIC - Watson/Collins
• Human Genome Project
• Officially launched in 1990
• Worldwide effort - both academic and government institutions
• Assemble the genome using maps
• 1996 Bermuda accord
PRIVATE - Craig Venter
• 1998 Celera Genomics
• Aim to sequence the human genome in 3 years
• ‘Shotgun’ approach - no use of maps for assembly
• Data release NOT to follow Bermuda principles
The Human genome project
It cost about 3 billion dollars and took 13 years to complete (1990–2003),
2 fewer than the 15 initially predicted.
• Currently 3.2 Gb
• Approx 200 Mb still in progress
– Heterochromatin
– Repetitive
• Most recent human genome uploaded February 2009
How Are Traits Transmitted from Parents to Offspring?
•Gregor Mendel’s experiments showed that genes are passed from
parents to offspring
–Each parent carries two genes that control a trait
–Each parent contributes one copy from each pair
–Pairs of genes separate from each other during the formation of egg and
sperm (meiosis)
–When egg and sperm fuse during fertilization, genes from mother and father
become a new gene pair
•Genes are contained on chromosomes
–Chromosomes are found in the nucleus of human cells and other higher
organisms
–Meiosis separates chromosome pairs during formation of egg and sperm
Concept of Heritability
• Proportion of a trait's total variance that is
attributable to genetic factors in a particular
population
• Trait: Quantitative trait or continuous trait – i.e.
height
• “Attributable to”
“caused by”
• If everyone in the population had the same genotype
at the relevant loci, genetic factors would not
play a role in the “variance” in the trait and
heritability = zero. Likewise, if everyone had the same
environmental exposure, environment would not
contribute to the variance
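The variance-ratio definition of heritability can be illustrated with a small simulation. This is only a sketch under simple assumptions: the trait is modelled as an independent genetic component plus an environmental component, and the helper name `simulate_heritability`, the standard deviations and the sample size are all hypothetical choices, not values from the course.

```python
import random
import statistics

def simulate_heritability(genetic_sd, environment_sd, n=20000, seed=1):
    """Simulate trait = G + E and return Var(G) / Var(trait)."""
    rng = random.Random(seed)
    genetic = [rng.gauss(0.0, genetic_sd) for _ in range(n)]
    environment = [rng.gauss(0.0, environment_sd) for _ in range(n)]
    trait = [g + e for g, e in zip(genetic, environment)]
    return statistics.pvariance(genetic) / statistics.pvariance(trait)

# Hypothetical population with both genetic and environmental variation.
h2_mixed = simulate_heritability(genetic_sd=6.0, environment_sd=3.0)

# Hypothetical population in which everyone is genetically identical:
# no genetic variance, so heritability is zero.
h2_no_genetic_variation = simulate_heritability(genetic_sd=0.0, environment_sd=3.0)

print(round(h2_mixed, 2), h2_no_genetic_variation)
```

Because heritability is a ratio of variances, the same genes can give a different value in a population with a different spread of environments, which is why the definition is tied to "a particular population".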
Hardy-Weinberg Law of
Population Genetics
• Assume random mating in a population
• In a two allele system, homozygosity and
heterozygosity balance out
• Allele and genotype frequencies will
remain the same if:
– Organisms reproduce
– Allele frequencies are the same in both sexes
– Loci must segregate independently
– Mating is random with respect to genotype
Hardy-Weinberg Law of
Population Genetics
$p^2 + 2pq + q^2 = 1$
$p + q = 1$
Frequency of alleles in population:
p = frequency of the dominant allele
q = frequency of the recessive allele
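As a worked example of the Hardy-Weinberg relation p² + 2pq + q² = 1 with p + q = 1, the sketch below estimates p from genotype counts and returns the expected genotype frequencies. The function names and the genotype counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def allele_freq_from_counts(n_AA, n_Aa, n_aa):
    """Frequency p of allele A from observed genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    return (2 * n_AA + n_Aa) / total_alleles

def hardy_weinberg(p):
    """Expected genotype frequencies (AA, Aa, aa) = (p^2, 2pq, q^2)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q

# Hypothetical sample of 1000 individuals.
p = allele_freq_from_counts(n_AA=360, n_Aa=480, n_aa=160)
freq_AA, freq_Aa, freq_aa = hardy_weinberg(p)
print(p, freq_AA, freq_Aa, freq_aa)
```

Comparing observed genotype counts with these expected frequencies is a common quality-control check in association studies.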
GENETIC EPIDEMIOLOGY
Flow of research
Disease characteristics: Descriptive epidemiology
Familial clustering: Family aggregation studies
Genetic or environmental: Twin/adoption/half-sibling/migrant studies
Mode of inheritance: Segregation analysis
Disease susceptibility loci: Linkage analysis
Disease susceptibility markers: Association studies
Why do we care about variations?
• underlie phenotypic differences
• cause inherited diseases
• allow tracking ancestral human history
Human Genome
• ~30,000 genes
• 3 billion base pairs in the human genome
• 15 million SNPs in human genome
• Human Diversity = 0.5%
• Far less than other animals like the chimp
(because humans are younger)
• Patterns of Linkage Disequilibrium (LD)
informative about population histories
October 2004
SNPs
• SNPs are more common variants (> 5%)
• Most mutations will disappear but some will
achieve higher frequencies due either to random
genetic drift or to selective pressure
• Base substitution through a non-repaired error
that occurs during DNA replication
• Low mutation rate: $10^{-8}$ substitutions per base pair
per generation
• Majority of SNPs are inherited - not de novo
mutations
SNP persistence is influenced by 2
forces
• 1) Random Genetic Drift – random sampling of
different alleles with each generation (because
only a small fraction of gametes pass on to the
next generation); eventually FIXATION occurs
when an allele reaches 100% or 0%
• 2) Natural Selection – Affects the probability that
a SNP is passed to the next generation: it ↑ the
speed of fixation if the SNP confers a fitness advantage
(positive selection), removes new deleterious
variants from the gene pool (negative selection), or
results in balancing selection
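Random genetic drift can be sketched with a simple simulation in which each generation's gametes are sampled at random from the previous generation's allele frequency. This is an illustrative Wright-Fisher-style model with no selection or mutation; the function name `wright_fisher`, the population size and the seed are hypothetical choices.

```python
import random

def wright_fisher(n_individuals, start_freq, max_generations=10000, seed=0):
    """Track an allele frequency under pure drift until it is lost or fixed."""
    rng = random.Random(seed)
    n_gametes = 2 * n_individuals  # diploid population
    freq = start_freq
    trajectory = [freq]
    for _ in range(max_generations):
        if freq in (0.0, 1.0):
            break  # fixation: the allele has reached 0% or 100%
        # Each gamete in the next generation carries the allele with probability freq.
        copies = sum(1 for _ in range(n_gametes) if rng.random() < freq)
        freq = copies / n_gametes
        trajectory.append(freq)
    return trajectory

trajectory = wright_fisher(n_individuals=50, start_freq=0.5)
print(len(trajectory) - 1, "generations simulated; final frequency", trajectory[-1])
```

Re-running with a larger `n_individuals` typically takes many more generations to reach fixation, which is one way to see why drift matters most in small populations.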
Linkage Disequilibrium
• Chromosomes are mosaics
• Patterns of LD informative about
population histories and depend on:
– Recombination rate
– Mutation rate
– Population Size
– Natural selection
Conrad Nature Genetics 2006
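To make LD concrete, the sketch below computes two commonly used pairwise measures, D and r², from haplotype and allele frequencies. These measures are not named on the slide, and the function name `ld_stats` and the frequencies are hypothetical values for illustration.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # deviation from independence
    return d, d * d / denom

# Hypothetical loci in LD: the A-B haplotype is more common than expected (0.25).
d_linked, r2_linked = ld_stats(p_ab=0.40, p_a=0.5, p_b=0.5)

# Hypothetical loci in linkage equilibrium: haplotype frequency equals the product.
d_free, r2_free = ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5)

print(d_linked, r2_linked, d_free, r2_free)
```

Recombination lowers D over generations, which is why the recombination rate appears among the factors that shape LD patterns.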
Progress in Genetics
• 1866 Gregor Mendel suggested traits were
inherited
• 1869-Friedrich Miescher isolated DNA
• 1953 Double Helix Structure of DNA – Watson,
Crick, Rosalind Franklin
• 1975 - Sanger Sequencing – “1st Generation”
• 2003 –Human Genome “Crack the Code”
• International Hap Map Project
• Automated Sequencing
• 1000 Genomes
2nd generation sequencing
Genome wide annotation of functional elements made easy!
Background on the 1000
Genomes Project
• International collaboration
• Sequence whole genome of approximately 2000
individuals from ~ 20 populations
• Central goal is to describe most of the genetic variation
that occurs at a population frequency greater than 1%
• Help scientists:
– Identify genetic variation with high resolution
– Improved imputation
– Novel genotype-phenotype associations
– Causal variants
– More accurately study evolutionary process & racial
differences
The 1000 Genomes Project Consortium (2012).
An integrated map of genetic variation from 1,092 human genomes Nature DOI: 10.1038/nature11632
Population-specific genetic
variation at high resolution
• Observe and identify population-specific genetic
variation
• Novel SNPs are rare and more likely to be
observed in one ethnic group
• Need good coverage in multiple populations
• Identification of such variants can help develop new
population-specific arrays, minimizing ascertainment
bias that currently exists as most are derived from
Europeans
Imputation to GWAS
• Provide resource to aid imputation of missing
genotypes in association studies
• From the pilot study, authors found that each
signal was in LD with 56 variants, on average
• 19% of the time a coding variant was present in this LD
– Shows that 1000 Genomes can be used to find variants
that could be functional corresponding to GWAS hits
Identification of causal
variants
• Precise causative genes are difficult to identify as
GWAS focus on LD / genomic regions
• Deep sequencing studies can help find novel or
rare functional variants
• Re-sequencing studies support this approach in
uncovering rarer variants with larger effects and
functional causes with disease (Nejentsev 2009)
From the Pilot phase
• Describes genomes from 1,092 individuals
representing 14 populations across Europe, Africa,
Asia, and the Americas
1000 Genomes
• [Figure] The fraction of variants identified across the
project that are found in only one population
(white line), are restricted to a single
ancestry-based group (solid colour), are found
in all groups (solid black line) and all populations
(dotted black line)
1000 Genomes
• Most common variants were almost always
present in all 14 populations
• Degree of rare variants differed greatly
From Genetics to
Genomics
Genetics
• Disease
• Single Gene Disorders
• Mutations/One Gene
• High Disease Risk
• Environment Role +/-
• “Genetic Services”
Genomics
• Information
• All Diseases
• Variation/Multi Genes
• Low Disease Risk
• Environment Role ++
• Gene-Environment Interactions
Common Complex Diseases
• Condition such as CVD is common
• Includes closely related but not identical
manifestations – angina, unstable angina, MI
• Multiple genes have small effects - RR of 1.2 to
1.5 – affect multiple “risk factors” or intermediate
phenotypes
• Causative genotype may be the more common
genotype (unlike monogenic disorders)
What are we trying to study?
"It's a classic scientific paradox
— we know a genotype and
we know a phenotype, but
there's a black box in
between"
Genetic Association Studies
[Figure: pathway from SNP Variation through Gene Expression, Protein
Synthesis, Post-Translational Changes and Protein Expression to Disease,
with Other Risk Factors]
Genetic Association Studies
[Figure: the same pathway from SNP Variation through Gene Expression, Protein
Synthesis, Post-Translational Changes and Protein Expression to Disease,
with Other Risk Factors and Environmental Exposure]
Indirect and Direct Allelic Association
Direct Association (D *)
– Measure disease relevance (*) directly, ignoring correlated
markers nearby
Indirect Association & LD (M1, M2, ..., Mn around D)
– Assess trait effects on D via correlated markers (Mi) rather
than susceptibility/etiologic variants.
Semantic distinction between
– Linkage Disequilibrium: correlation between (any) markers in population
– Allelic Association: correlation between marker allele and trait
Wacholder, 2002 (www)
Population Stratification
Marchini, 2004 (www)
Models of gene–environment interactions
Hunter, 2005 (www)
Sample size requirement for gene-environment
interaction studies
Hunter, 2005 (www)
An example of a gene-environment interaction
In Alzheimer disease, the risk of cognitive decline as measured by TICS test is
particularly high in APOE4 carriers who have untreated hypertension
(APOE4+/HT+).
Hunter, 2005 (www)
Ascertainment Bias
• Case-control studies are particularly prone to ascertainment
bias: unlike in a population-based study, cases and controls can be
enriched for the factors on which investigators wish to focus
(in the case of diabetes, hyperglycaemia)
• For TCF7L2 (rs7903146), the T-allele can appear to be associated
with lower BMI in control samples. Although the T-allele causes
hyperglycaemia, the controls are selected to be normoglycaemic,
which leads to an accumulation of T-allele carriers with higher
physical activity levels or lower BMI
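The selection effect described above can be reproduced in a toy simulation. The sketch assumes that the T-allele and BMI are independent in the whole population, that both raise glucose, and that controls are everyone below a glucose cut-off. All numbers (carrier frequency, effect sizes, the 5.5 cut-off) and function names are hypothetical and are not estimates for TCF7L2.

```python
import random
import statistics

def simulate_population(n=20000, seed=42):
    """Return a list of (is_carrier, bmi, glucose) tuples."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        carrier = rng.random() < 0.5   # T-allele carrier status
        bmi = rng.gauss(27.0, 4.0)     # BMI is independent of genotype
        # Glucose rises with the T-allele and with BMI (hypothetical effects).
        glucose = 5.0 + 0.5 * carrier + 0.1 * (bmi - 27.0) + rng.gauss(0.0, 0.5)
        people.append((carrier, bmi, glucose))
    return people

def mean_bmi_difference(people):
    """Mean BMI of carriers minus mean BMI of non-carriers."""
    carriers = [bmi for carrier, bmi, _ in people if carrier]
    non_carriers = [bmi for carrier, bmi, _ in people if not carrier]
    return statistics.mean(carriers) - statistics.mean(non_carriers)

population = simulate_population()
# Controls are ascertained as normoglycaemic (glucose below the cut-off).
controls = [person for person in population if person[2] < 5.5]

diff_population = mean_bmi_difference(population)
diff_controls = mean_bmi_difference(controls)
print(round(diff_population, 2), round(diff_controls, 2))
```

In the full population the carrier and non-carrier BMI means are nearly equal, while among the selected controls carriers have a clearly lower mean BMI: a carrier can only stay below the glucose cut-off if something else, here BMI, pulls glucose down.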
Future Directions: Beyond DNA & RNA
“Omic” approach | Technology | Number estimated in humans
Genomics | Single nucleotide polymorphisms (SNPs) | ~10,000,000
Transcriptomics | Microarrays of gene transcripts (RNA) | ~20,000
Proteomics | Protein arrays of specific protein products | ~100,000
Metabolomics | Metabolic profiles | 1000 – 10,000 metabolites
*adapted from Ginsburg G, et al. J Am Coll Cardiol. 2005;46:1615-1627.
Height and Risk of
Coronary Artery Disease
Paper by Gertler et al. from 1951 reported that individuals who
suffered from a myocardial infarction before the age of 40 were
on average 5 cm (2.9%) shorter than a healthy control population
Gertler MM, Garn SM, White PD
The Journal of the American Medical
Association 1951
Short stature is associated
with coronary heart disease: a
systematic review of the
literature and a meta-analysis.
Paajanen TA, Oksala NKJ,
Kuukasjärvi P, Karhunen PJ
European Heart Journal 2010
Methods
• Selection of studies for review:
– Systematic reviews, meta-analyses, randomized clinical
trials, clinical trials, and cohort or case-control studies
with at least 200 subjects
– Height dichotomized into short and tall groups
– Outcome defined as diagnosis of angina pectoris,
ischaemic heart disease (IHD) or heart disease without
MI, acute MI, or history of MI, coronary artery occlusion
equal to or more than 50%, revascularization or
percutaneous transluminal coronary angioplasty (PTCA),
as well as all-cause mortality, CVD mortality, or CHD
mortality
• Meta-analysis:
– I-squared test for heterogeneity of data
– ORs and RRs from all studies converted to RRs for
shorter group
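The I-squared heterogeneity statistic mentioned above can be computed from study effect estimates and their standard errors. The sketch below uses the usual definition I² = max(0, (Q − df) / Q) × 100%, where Q is Cochran's Q; the function name `i_squared` and the log relative risks are hypothetical and are not the values from the review.

```python
def i_squared(effects, std_errors):
    """Percentage of variation across studies attributable to heterogeneity."""
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

# Hypothetical log relative risks from three studies with equal precision.
i2_heterogeneous = i_squared([0.1, 0.5, 0.9], [0.1, 0.1, 0.1])
i2_homogeneous = i_squared([0.3, 0.3, 0.3], [0.1, 0.2, 0.3])
print(i2_heterogeneous, i2_homogeneous)
```

A high I² suggests that the studies are estimating different underlying effects, which is one reason to prefer a random-effects model when pooling.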
Results
• Average cut-off for shorter group was 160.5 cm and cut-off
for taller group was 173.9 cm, with different ranges
for men and women
• Combined RR for shorter group to experience CHD was
1.46 (95% CI 1.37–1.55)
• Combined RR for all-cause mortality for short men was
1.37 (1.29–1.46) and for short women 1.55 (1.41–1.70)
• Combined RR for all types of cardiovascular (CVD)
deaths among men and women was 1.55 (95% CI 1.37–1.74)
• Overall, short stature represents ~1.5 times increased
risk of CHD morbidity and mortality compared against
tall stature
New Approach to crack the question
Using a genetic approach to explore the association
between height and CAD risk helps remove some of
the lifestyle and environmental confounders present in
epidemiological studies
• Background:
– 180 single-nucleotide polymorphisms (SNPs)
were found to be significantly associated with
height (GIANT study in Europeans, n=183,727)
• Aims:
– Assess combined effect of 180 height-associated SNPs on CAD risk
– Assess effect of these SNPs on CAD risk factors
(e.g. blood pressure, LDL, etc.)
– Identify any biological pathways mediating this
association
Nelson NEJM 2015
Study Population
• Summary association statistics extracted from 3 meta-analyses
of GWAS case-control studies of CAD:
• Coronary Artery Disease Genomewide Replication and
Meta-Analysis (CARDIoGRAM) Consortium
– 21977 cases, 62289 controls
– All 180 SNP variants
• Coronary Artery Disease (C4D) Consortium
– 17766 cases, 17115 controls
– All 180 SNP variants
• Metabochip
– Combined CARDIoGRAM+C4D Consortium for cohorts
not included in previous meta-analyses
– 25323 cases, 48979 controls
– 112 SNP variants
Nelson NEJM 2015
Advantages of genetic approach in this study over traditional
epidemiologic approach:
- Genetic determinants of height are not confounded by
lifestyle (e.g. nutrition) or environmental (e.g.
socioeconomic status) factors
- Allows tracing of genetic pathways to identify potential
mechanisms driving association
Limitations:
- Lifestyle and environmental choices/events can be a direct
consequence of height
Height-Associated Variants and CAD Methods
• Using:
– β1 = effect size of association between variant
and height (GIANT study)
– β2 = effect size of association between variant
and CAD (CARDIoGRAM, C4D, and Metabochip studies)
• To calculate:
– β3 = effect size of association between height
and CAD mediated through variant
– β3 is the odds ratio for CAD per 1-standard
deviation increase in genetically determined
height
Height-Associated Variants and CAD Methods
• Association between individual SNPs with
height (β1) and between individual SNPs with
CAD (β2) is very small
• Thus, β3 values for individual SNPs are
centered around 1.0 and generally
non-significant
• To determine complete association between
height and CAD, we combined β3 values from
all SNPs using inverse-variance-weighted
random-effects meta-analysis
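The two steps above (a per-SNP β3, then pooling across SNPs) can be sketched as follows. The slides do not give the formula for β3; this sketch assumes the standard ratio estimator β3 = β2 / β1 with β2 on the log-odds scale and a first-order standard error, and it uses the DerSimonian-Laird method as one common form of inverse-variance-weighted random-effects pooling. The function names and the three SNP triples are hypothetical.

```python
import math

def ratio_estimate(beta_height, beta_cad, se_cad):
    """Assumed per-SNP estimate beta3 = beta2 / beta1 (log odds per SD of height)."""
    beta3 = beta_cad / beta_height
    se3 = se_cad / abs(beta_height)  # first-order approximation, ignores error in beta1
    return beta3, se3

def random_effects_pool(estimates, std_errors):
    """DerSimonian-Laird inverse-variance-weighted random-effects pooling."""
    w = [1.0 / se ** 2 for se in std_errors]
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-SNP variance, truncated at zero.
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star))

# Hypothetical (beta1, beta2, standard error of beta2) for three SNPs.
snps = [(0.05, -0.006, 0.004), (0.03, -0.005, 0.004), (0.04, -0.004, 0.005)]
per_snp = [ratio_estimate(b1, b2, se2) for b1, b2, se2 in snps]
pooled_log_or, pooled_se = random_effects_pool(
    [b3 for b3, _ in per_snp], [se3 for _, se3 in per_snp]
)
odds_ratio = math.exp(pooled_log_or)
print(round(odds_ratio, 3), round(pooled_se, 3))
```

Each per-SNP estimate is noisy because β1 and β2 are small, so the pooled estimate across many SNPs is far more precise than any single one.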
Height-Associated Variants and CAD - Results
• Combined association between height-associated SNPs
and CAD was significant
(OR=0.88, 95% CI = 0.82 to 0.95, p<0.001)
• 13.5% increase in CAD risk per 1-standard
deviation (SD) decrease in height
• Most individual β3 values centered around 1.0
and non-significant, but a few values were
significant (p<0.05)
– 3 out of 180 SNPs remained significant after
Bonferroni correction
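The arithmetic linking these results can be checked directly. An OR of 0.88 per 1-SD increase in height corresponds to an OR of 1/0.88 per 1-SD decrease, and a Bonferroni correction across 180 SNPs divides the significance level by 180. The sketch assumes a nominal level of 0.05; with the rounded OR the increase comes out near 13.6%, close to the 13.5% quoted, which presumably reflects the unrounded estimate.

```python
# Odds ratio for CAD per 1-SD increase in genetically determined height.
or_per_sd_increase = 0.88

# Reversing the direction of the comparison inverts the odds ratio.
or_per_sd_decrease = 1.0 / or_per_sd_increase
percent_increase = (or_per_sd_decrease - 1.0) * 100.0

# Bonferroni threshold for 180 tests at an assumed nominal level of 0.05.
n_snps = 180
bonferroni_threshold = 0.05 / n_snps

print(f"OR per 1-SD decrease: {or_per_sd_decrease:.3f}")
print(f"Increase in odds: {percent_increase:.1f}%")
print(f"Bonferroni threshold: {bonferroni_threshold:.6f}")
```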
Genetic Risk Score Analysis - Methods
• Subgroup of CAD cohorts had genomewide
individual-level genotype data available (8240 cases,
10009 controls)
• Weighted analysis of genetic risk scores to evaluate
effect of increasing number of height-associated
variants on CAD risk
• Genetic risk score:
– Value from 0 to 2 for each SNP obtained by
multiplying sum of posterior probabilities for
height-increasing allele with effect size of allele on height
– Values totalled across all SNPs for each individual
– Individuals ranked and divided into quartiles
– Logistic regression on quartiles to estimate
combined odds ratio for CAD
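The scoring steps above can be sketched on simulated data. This is a simplified illustration, not the study's analysis: it uses hard-call allele counts (0, 1 or 2) in place of posterior probabilities, a crude odds ratio from case and control counts in place of logistic regression, and simulated individuals in which a higher score lowers CAD risk by construction. All names, effect sizes and sample sizes are hypothetical.

```python
import math
import random
import statistics

def genetic_risk_score(dosages, effect_sizes):
    """Sum over SNPs of (height-increasing allele dosage) x (effect size on height)."""
    return sum(d * b for d, b in zip(dosages, effect_sizes))

def quartile_labels(scores):
    """Rank individuals by score and label them 1 (lowest) to 4 (highest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, idx in enumerate(order):
        labels[idx] = 1 + (4 * rank) // len(scores)
    return labels

def odds_ratio_vs_first(labels, is_case, quartile):
    """Crude odds ratio for CAD in the given quartile relative to quartile 1."""
    def odds(q):
        cases = sum(1 for lab, c in zip(labels, is_case) if lab == q and c)
        controls = sum(1 for lab, c in zip(labels, is_case) if lab == q and not c)
        return cases / controls
    return odds(quartile) / odds(1)

rng = random.Random(7)
n_people, n_snps = 4000, 20
effect_sizes = [rng.uniform(0.02, 0.08) for _ in range(n_snps)]
scores = []
for _ in range(n_people):
    # Two independent allele draws per SNP give a dosage of 0, 1 or 2.
    dosages = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n_snps)]
    scores.append(genetic_risk_score(dosages, effect_sizes))

mean_score, sd_score = statistics.mean(scores), statistics.pstdev(scores)
is_case = []
for s in scores:
    z = (s - mean_score) / sd_score
    p_case = 1.0 / (1.0 + math.exp(0.5 * z))  # simulated: taller score, lower risk
    is_case.append(rng.random() < p_case)

labels = quartile_labels(scores)
or_q4 = odds_ratio_vs_first(labels, is_case, 4)
print(round(or_q4, 2))
```

Weighting each allele by its effect on height means that SNPs with larger height effects move an individual further along the score than SNPs with small effects.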
Genetic Risk Score Analysis - Results
• Increased number of height-raising alleles
associated with reduced risk of CAD
• Odds ratios for each quartile:
– Quartile 2 vs. Quartile 1 = 0.90 (95% CI = 0.83 to
0.98, p=0.02)
– Quartile 3 vs. Quartile 1 = 0.88 (95% CI = 0.81 to
0.96, p=0.003)
– Quartile 4 vs. Quartile 1 = 0.74 (95% CI = 0.68 to
0.80, p<0.001)
• Quartile 4 includes individuals with highest
number of height-raising alleles, Quartile 3
has individuals with second most, etc.
What if SNPs for Height are also associated with
CAD risk factors?
Height-Associated Variants and CAD Risk Factors
• Obtained estimates of effect sizes for 180 height variants on
CAD risk factors based on meta-analyses of genomewide
association studies:
– Systolic blood pressure (n=69899)
– Diastolic blood pressure (n=69909)
– Mean arterial pressure (n=29182)
– Pulse pressure (n=74079)
– LDL cholesterol level (n=95454)
– HDL cholesterol level (n=99900)
– Triglyceride level (n=96598)
– Type 2 diabetes (34840 cases, 114981 controls)
– Glucose (n=96496)
– Log-transformed plasma insulin (n=85573)
– Smoking quantity (n=41150)
• β3 values calculated for association of height with CAD risk
factors (similar to how they were calculated for overall CAD
risk)
Height-Associated Variants and CAD Risk
Factors
• β3 values represent change in measurement unit of variable
per 1-standard deviation change in height
• Only LDL cholesterol level (β3 = -0.06, 95% CI = -0.09 to -0.04,
p<0.001) and triglyceride level (β3 = -0.05, 95% CI = -0.08 to
-0.03, p<0.001) had significant associations with
height-associated SNPs
• 19% of association between genetically determined height and
CAD explained by effect of height on LDL cholesterol
• 12% of association between genetically determined height and
CAD explained by effect of height on triglyceride level
Conclusions
• Association between genetically determined decrease in height
(sum of 180 height-associated SNPs) and increased risk of CAD
(13.5% increase in CAD risk per 1-SD decrease in height)
– 2.3% of this association explained by effect of height on LDL
levels (inverse relationship)
– 1.9% of this association explained by effect of height on
triglyceride levels (inverse relationship)
• Genetically determined height was associated with CAD risk in
men but not in women, in contrast with findings from
epidemiological studies suggesting an association in both
genders
• Height-associated SNPs were not significantly associated with
BMI, suggesting pathway independent of obesity
