W11: Epidemiology File

Transcript W11: Epidemiology File

BIO508
Epidemiological Data:
integration with ‘omics data
Wednesday, April 4, 2012
Aditi Hazra
Research Question
• Public Health Problem: an estimated 57,650 women are
diagnosed with in situ breast cancer in the US each year.
• How can we identify the perturbed networks
in non-invasive breast lesions that predict
the likelihood of a subsequent invasive
breast cancer?
Research Question, cont.
• Hypothesis
• Who will you study? How many samples?
• Which assay?
• Interpretation of results
• Prioritization of biomarkers for validation
Outline
• Study Design
• Data Collection
• Data Analysis
• Data Integration
• Data Storage
• Discussion
Study Design
Types of Epidemiologic Studies
• Observational Studies
• Clinical Trials
• Randomized Clinical Trials (RCTs)
Designing a Study
[Table: observation space, cohort type, additional sampling, design, and estimate]
• Persons: closed cohort with fixed follow-up
– Sampling non-cases: traditional case-control, estimates the odds ratio
– Sampling all persons: case-base, estimates the risk ratio
• Person-time: closed cohort with changing exposure, or open cohort
– Matching: on person or on time
– Sampling the pool of person-time (density sampling): modern case-control, estimates the rate ratio
– Sampling non-cases in the risk set, matched on time: case-control, estimates the rate ratio
– Sub-cohort reused in risk sets, matched on time: case-cohort, estimates the rate ratio
– Cases only, matched on time: case-crossover, estimates the rate ratio
• Cohort with changing baseline incidence
– Sampling non-cases: case-time-control, estimates the rate ratio
Study Designs
Rakyan et al. 2011
Nurses’ Health Study Cohort
[Timeline figure]
• 1976: 121,700 participants
• 1989: blood sub-cohort (DNA), 32,826 participants
• 1993: tumor block collection
• 1999: blood and urine
• Nested case-control study
Nested case-control study
• Nested case-control study of breast
cancer in the NHS
• Controls are matched to cases on:
– Age at diagnosis
– Time, season, and year of blood collection
– Post-menopausal hormone use
– Ethnicity
Implementing a Study
• Case definition
• Exposure definition
• Covariate definition
• Covariate adjustment
• Modeling of determinants of
– Outcome: dependent variable of interest
– Exposure
• Unmeasured confounders
Confounding/Bias
• Confounding is systematic bias
• Derives from characteristics of the source
population
• Presupposes innate differences in risk between
individuals
[Diagram: confounder C is associated with both exposure E and disease D]
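A minimal sketch of how a confounder C can distort an exposure-disease association. The counts are hypothetical and chosen so that exposure has no effect within either stratum of C; the function names are illustrative. The Mantel-Haenszel estimator is used here as one standard way to pool stratum-specific odds ratios.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata of a confounder."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts (a, b, c, d) for each level of the confounder C.
strata = [
    (80, 120, 20, 30),  # C present: exposure common, disease risk 40%
    (5, 45, 45, 405),   # C absent: exposure rare, disease risk 10%
]

# Collapsing over C gives the crude (unadjusted) table.
crude_table = [sum(column) for column in zip(*strata)]
crude_or = odds_ratio(*crude_table)
adjusted_or = mantel_haenszel_or(strata)

print(f"crude OR = {crude_or:.2f}, adjusted OR = {adjusted_or:.2f}")
```

In this constructed example the crude odds ratio is well above 1 even though the stratum-specific odds ratios are both 1, which is the signature of confounding by C.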
Sources of bias
• Sample effects
• Batch effects
• Population stratification (PCA)
• Non-random missingness
• Reproducibility of variant calls
Study Power
• Sample size
– number of cases required for the study
• Magnitude of effect
– smallest difference between the strength of
association detectable
• Significance level (Type I error)
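The three ingredients above can be combined in a rough sample size calculation. The sketch below assumes an unmatched case-control comparison of exposure prevalence with equal numbers of cases and controls, and uses the usual normal approximation for two proportions; the function name and the example prevalences are hypothetical, not taken from the lecture.

```python
import math
from statistics import NormalDist


def cases_required(p_controls, p_cases, alpha=0.05, power=0.80):
    """Approximate number of cases (with an equal number of controls)
    needed to detect a difference in exposure prevalence.

    Two-sided test at significance level alpha, normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for Type I error
    z_power = z.inv_cdf(power)          # quantile for the desired power
    variance = p_controls * (1 - p_controls) + p_cases * (1 - p_cases)
    effect = p_cases - p_controls
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical exposure prevalences: 20% in controls, 30% or 25% in cases.
n_large_effect = cases_required(0.20, 0.30)
n_small_effect = cases_required(0.20, 0.25)
print(n_large_effect, n_small_effect)
```

Halving the detectable difference roughly quadruples the required number of cases, which is why the smallest effect of interest should be fixed before the study starts.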
Sample Size
Effect of increasing n upon power
Effect of increasing n at 80% power
Design of Sequencing
Association Studies
• Key parameters for power and sample
size calculations
– % of causal variants (β ≠ 0)
– Effect sizes: β as a decreasing function of
MAF
– Incidence of disease
Distribution of NGS coverage depth vs.
distribution estimated by the model
Genetic Epidemiology
Volume 35, Issue 4, pages
269-277, 2 MAR 2011
Power for NGS
Genetic Epidemiology
Volume 35, Issue 4, pages
269-277, 2 MAR 2011
Power to detect a rare variant
Genetic Epidemiology
Volume 35, Issue 4, pages
269-277, 2 MAR 2011
Power to detect
Genetic Epidemiology
Volume 35, Issue 4, pages 269-277, 2 MAR 2011
Slide courtesy of Xihong Lin
Power Calculator
• Itsik Pe’er
http://www.cs.columbia.edu/~itsik/OPERA/
Data Collection
Tissue heterogeneity
Receptor Status in Breast Cancer
ER
PGR
HER-2
Targeting HER2
Nahta R et al. (2006) Mechanisms of Disease: understanding resistance to HER2-targeted therapy in human breast cancer
Nat Clin Pract Oncol 3: 269–280 doi:10.1038/ncponc0509
“Breast Tumor Intrinsic” Subtype Classification
Hierarchical cluster analysis using this ‘intrinsic gene
list’ revealed the existence of 5 molecular subtypes of
breast cancer:
1. Luminal A
2. Luminal B
3. Normal breast-like
4. HER2
5. Basal-like
Sorlie T et al. Proc Natl Acad Sci USA 2001
Sorlie T et al. Proc Natl Acad Sci USA 2003
Sotiriou C et al. Proc Natl Acad Sci USA 2003
Hu Z et al. BMC Genom 2006
Molecular Subtypes of BC
• Basal-like:
– ER-negative
– PR-negative
– HER2-negative
• Luminal A:
– ER-positive
– histologically low-grade
• Luminal B:
– ER-positive
– low levels of hormone receptors
– high-grade
• HER2-positive:
– amplification of the ERBB2 gene
– other genes of the ERBB2 amplicon
[Figure: expression cluster panels labeled Luminal Subtype A, Luminal Subtype B, ERBB2+, Basal Subtype, and Normal Breast-like]
Assays
• Sample Preparation
• Feasibility
• Assay comparison
• Technical replicates for reproducibility and
batch-to-batch variation
• Biological inferences
Data Analysis
Why is QC essential for NGS?
• Sequencing is expensive
• Analyzing sequencing data is expensive
– CPU time
– Storage of raw sequence data
• Two aspects of sequencing QC
– Run/Lane QC
– Library QC
Raw DNA Sequence QC
• Quality control on all sequence data/library
is essential
– Variation in local content
– Proportion of GC
– Extent of segment duplication
– R packages:
• htSeqTools
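As a rough illustration of two of the metrics listed above, the sketch below computes the GC proportion of a read and the fraction of exactly duplicated reads. It is plain Python over hypothetical read strings and does not reproduce the htSeqTools API, which is an R package.

```python
from collections import Counter


def gc_fraction(read):
    """Proportion of G and C among the called (A, C, G, T) bases."""
    called = [base for base in read.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def duplicate_fraction(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads)


# Hypothetical reads; N marks an uncalled base and is ignored for GC.
reads = ["ACGTACGT", "GGGCCCAT", "ACGTACGT", "ATATATNN"]

gc_per_read = [gc_fraction(read) for read in reads]
dup = duplicate_fraction(reads)
print(gc_per_read, dup)
```

Real pipelines work on FASTQ or BAM files and usually define duplicates by alignment position rather than by identical sequence; the string comparison here is a simplification.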
QC, cont.
• Bisulfite sequencing
• Gene expression
• Sample QC
– Frozen tissue vs. fixed tissue
Genotype Concordance
• How do you know that the data are from the
right individual?
• Check
– Concordance with known genotypes
(fingerprint panel)
– Compare to other lanes of the same library
– Barcode libraries
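The first check can be sketched as a simple concordance rate between sequencing calls and a fingerprint panel. The SNP identifiers, genotypes, and function names below are hypothetical placeholders; genotypes are compared without regard to allele order.

```python
def normalize(genotype):
    """Make allele order irrelevant, so 'GA' and 'AG' compare equal."""
    return "".join(sorted(genotype.upper()))


def genotype_concordance(called, fingerprint):
    """Fraction of fingerprint SNPs with a matching sequencing call.

    SNPs without a sequencing call are skipped; returns None when
    no SNP is shared between the two sources.
    """
    shared = [snp for snp in fingerprint if snp in called]
    if not shared:
        return None
    matches = sum(
        normalize(called[snp]) == normalize(fingerprint[snp]) for snp in shared
    )
    return matches / len(shared)


# Hypothetical fingerprint panel and sequencing calls for one sample.
fingerprint = {"snp1": "AG", "snp2": "CC", "snp3": "TT", "snp4": "AC"}
called = {"snp1": "GA", "snp2": "CC", "snp3": "CT"}

concordance = genotype_concordance(called, fingerprint)
print(f"{concordance:.2f}")
```

A low concordance would suggest a sample swap or contamination; the cutoff for flagging a sample is a study-specific choice that the slides do not state.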
Correlation of Replicates
• The Good: cell lines, 35 ng/µl
• The Bad: mid-concentration tumors, 20-40 ng/µl
• The Ugly: high-concentration tumors, ≥40 ng/µl
Unpublished data
Batch-to-batch Variation
ANOVA model for batch correction:
$y_i = \mu + \alpha_{\text{plate}} + \beta_{\text{chip}} + \varepsilon_i$
Slide courtesy of Levi Waldron
• 12 plates
• 12 chips per plate
• 8 samples per chip
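A deliberately simplified sketch of the idea behind batch correction: estimate each batch's offset from the grand mean and subtract it. It handles a single batch factor (plate only) rather than the plate-plus-chip model named above, and the measurements and names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def remove_batch_offsets(values, batches):
    """Subtract each batch's mean offset from the grand mean."""
    grand_mean = mean(values)
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    # Offset of each batch relative to the grand mean.
    offsets = {b: mean(vs) - grand_mean for b, vs in by_batch.items()}
    return [value - offsets[batch] for value, batch in zip(values, batches)]


# Hypothetical measurements from two plates with a systematic shift.
values = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
batches = ["plate1"] * 3 + ["plate2"] * 3

corrected = remove_batch_offsets(values, batches)
print(corrected)
```

This only makes sense when biological groups are balanced across batches; if cases and controls sit on different plates, centering by plate also removes the signal of interest.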
QC Methods
• Reproducibility
• Sample QC
• Correction for batch-to-batch variation
• Q-Q plots
• Normalization (if appropriate)
• Filter
Q-Q Plot
Hazra et al. HMG 2009
Adjusting for tumor
heterogeneity and subtype
• Adjust for molecular subtype of the tumor
• Adjust for cellular component as a
covariate in your statistical model
– PCA
PCA for tumor heterogeneity
• Optimal linear transformation:
– maximize the variance by projecting the data on to
new axes in order of the principal components
• Components are orthogonal (mutually
uncorrelated)
• Few PCs may capture most
variation in original data
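A minimal sketch of the first principal component, using power iteration on the sample covariance matrix in plain Python. The data and function name are hypothetical, and production analyses would normally use a linear algebra library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, fraction of total variance) for PC1."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    total_variance = sum(cov[j][j] for j in range(p))
    if total_variance == 0:
        raise ValueError("data have no variance")
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the start vector is not orthogonal to the top component.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("start vector lies in the null space; try another")
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return v, eigenvalue / total_variance


# Hypothetical samples with two strongly correlated features.
samples = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
direction, explained = first_principal_component(samples)
print(direction, explained)
```

Projecting each centered sample onto `direction` gives a per-sample score that could be entered as a covariate in a regression model, in the spirit of the adjustment described above.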
Regression Analysis
• Linear regression
– continuous outcomes
• Logistic regression
• Cox proportional hazards
Multiple Logistic Regression
• Dichotomous outcome
• Test: Chi-square test, Mantel-Haenszel
test
$\log\left(\frac{p_i}{1 - p_i}\right) = \log(\text{Odds}) = \beta_0 + \beta_1 \cdot \text{gene\_variant}_i + \beta_2 \cdot \text{covariate}_i$
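To make the coefficients concrete, the sketch below turns the log-odds model into a predicted probability and shows that exponentiating the gene-variant coefficient gives the odds ratio for carriers versus non-carriers at a fixed covariate value. The coefficient values are hypothetical, not estimates from any study.

```python
import math


def predicted_probability(beta0, beta1, beta2, gene_variant, covariate):
    """Invert the logit: p = 1 / (1 + exp(-log_odds))."""
    log_odds = beta0 + beta1 * gene_variant + beta2 * covariate
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds(p):
    """Odds corresponding to a probability."""
    return p / (1.0 - p)


# Hypothetical coefficients: the variant raises the odds by 50%.
beta0, beta1, beta2 = -2.0, math.log(1.5), 0.03

p_carrier = predicted_probability(beta0, beta1, beta2, gene_variant=1, covariate=50)
p_noncarrier = predicted_probability(beta0, beta1, beta2, gene_variant=0, covariate=50)

odds_ratio_from_beta = math.exp(beta1)
odds_ratio_from_probs = odds(p_carrier) / odds(p_noncarrier)
print(odds_ratio_from_beta, odds_ratio_from_probs)
```

The two odds ratios agree whatever covariate value is chosen, which is what "adjusted for the covariate" means in this model.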
Interpretation of Results
• Multiple testing correction
• Generalizability
Correction for Multiple Testing
• Bonferroni Correction
• False Discovery Rate (FDR)
– Expected proportion of false positives among
all significant tests
• Permutation Testing
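One common way to control the FDR is the Benjamini-Hochberg step-up procedure, sketched below. The slides do not say which FDR procedure is intended, so this choice is an assumption, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the test is declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k with p_(k) <= k * fdr / m.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * fdr / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values from 10 tests, in no particular order.
p_values = [0.041, 0.001, 0.216, 0.008, 0.039, 0.042, 0.060, 0.074, 0.205, 0.212]

rejected = benjamini_hochberg(p_values, fdr=0.05)
n_bonferroni = sum(p <= 0.05 / len(p_values) for p in p_values)
print(sum(rejected), n_bonferroni)
```

With these values the FDR procedure declares two tests significant and Bonferroni only one, illustrating that Bonferroni is the more conservative of the two.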
Bonferroni Correction
• Conservative correction
• For n tests performed
– corrected significance threshold = α/n
• Example: What is the corrected p-value
for:
– GWAS with 1M SNPs
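The example can be worked directly. The sketch assumes a family-wise significance level of α = 0.05, which the slide does not state; the function name is illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


# Assumed alpha = 0.05; 1M SNPs as in the example.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(threshold)
```

Under that assumption the per-SNP threshold is 5 × 10^-8, the value commonly quoted as genome-wide significance.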
Data Integration
Impact of Genome Sequencing?
Rapid Publication on April 2, 2012. Sci. Transl. Med.
DOI: 10.1126/scitranslmed.3003380
Research Article: The Predictive Capacity of Personal Genome
Sequencing
Winning the War: Science Parkour
Bert Vogelstein and Kenneth W. Kinzler
Data Integration
• Integration of high-throughput ‘omics:
– measurements from tumor tissue
– paired blood specimens
• will enable
– presymptomatic diagnosis, stratification of
disease, assessment of disease progression,
evaluation of patient response, and
identification of recurrences
Systems approach to medicine
• Fundamental principle:
– Disease arises as a consequence of one or
more disease-perturbed networks in cells of
the relevant organ
Regional association plot for
ZNF365 and MD
• Mammographic density (MD)
is one of the strongest risk
factors for BC
• Common variants in ZNF365
have been associated with BC
• Meta-analysis of 5 GWAS of
%MD reported an association
with rs10995190 in ZNF365
(combined P = 9.6 × 10^-10)
Lindstrom et al. 2011 Nature Genetics
Study Aims
Available Resources
Prospectively collected
lifestyle factors in the
Nurses’ Health Study
(e.g. BMI, alcohol intake,
folate intake, multivitamin
use, parity, family history
of breast cancer, etc)
Personalized Genomics
(GWAS data and eQTL data
in the NHS)
Missing Link: Epigenomic Signatures
Epigenetic signatures of environmental and
lifestyle breast cancer risk factors
Methylome signatures in paired fixed basal-like breast tumor tissue, fixed normal breast
tissue, and blood samples
Personalized
Medicine
Tumorigenic Process
• Genomic regions have been implicated in
multiple cancer phenotypes
– 8q24
– TERT/CLPTM1L locus
• Gene(s) could demonstrate a modest change
in expression/function due to:
– a SNP
– increased carcinogenic exposure
– may undergo further inactivation due to
epigenetic changes
Syapse Demo for NGS
Digital Lab Notebook
Syapse
• www.syapse.com
Resources
• The Cancer Genome Atlas
• http://cancergenome.nih.gov/
References
• Rothman KJ, Greenland S, Lash TL. Modern
Epidemiology, 3rd ed. Philadelphia: Lippincott Williams &
Wilkins, 2008
• Sampson et al. Genetic Epidemiology
Volume 35, Issue 4, pages 269-277, 2 MAR 2011
Research Question
• Public Health Problem: an estimated 57,650 women are
diagnosed with in situ breast cancer in the US each year.
• How can we identify the perturbed networks
in non-invasive breast lesions that predict
the likelihood of a subsequent invasive
breast cancer?
Discussion
1. Specific Aim and Hypothesis
2. Sample Type
3. Assay Technology
4. Study Design
5. Significance Level for answer to 1
6. Quality Assurance and Control
Evaluating Impact of a Study
Proposal
• Significance
• Investigator
• Innovation
• Approach
• Environment
