Nutrition Seminar Presentation 7-18-2014


A review of quality control and preprocessing measures for the Illumina
450K BeadChip
Randa Stringer
Supervisor: Dr. Guillaume Paré
Steps for Review






Sample Quality
Probe Quality
Background correction
Normalization
Cellular composition
Batch effects
Array Design
> 485,000 CpG sites
 Covers 99% of RefSeq genes
 Average of 17 sites per gene



Distributed across promoter, 5’
UTR, first exon, gene body, and 3’
UTR
Covers 96% of known CpG
islands
Sample Quality

Reported vs. predicted sex



Use DNA methylation to predict sex
Minfi – getSex function
If yMed - xMed is less than the cutoff, the sample is predicted female;
otherwise male.
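The decision rule above can be sketched in a few lines. This is only an illustration of the yMed - xMed comparison, not minfi's implementation: the function name predict_sex, the toy intensities, and the default cutoff of -2 on a log2 scale are assumptions made for the example.

```python
from statistics import median

def predict_sex(x_log2_intensities, y_log2_intensities, cutoff=-2.0):
    """Apply the slide's rule to log2 intensities of chrX and chrY probes.

    The cutoff default is an assumption for this sketch.
    """
    x_med = median(x_log2_intensities)
    y_med = median(y_log2_intensities)
    # yMed - xMed below the cutoff -> female, otherwise male.
    return "F" if (y_med - x_med) < cutoff else "M"

# Toy values (hypothetical): weak Y signal vs. comparable X and Y signal.
female_like = predict_sex([13.0, 13.2, 12.9], [9.1, 9.4, 9.0])
male_like = predict_sex([12.4, 12.6, 12.5], [12.8, 13.0, 12.9])
print(female_like, male_like)
```

Samples whose predicted sex disagrees with the reported sex would then be flagged for review.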
Sample detection cut-offs

Threshold of failed probes in a sample (usually < 0.05 or
0.01)
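One reading of this cut-off is sketched below: a probe "fails" in a sample when its detection p-value exceeds a threshold, and a sample is flagged when too large a fraction of its probes fail. The function names, the toy p-values, and both default thresholds are assumptions for illustration; the slide does not fix which value applies to which step.

```python
def failed_fraction(detection_pvalues, p_cutoff=0.01):
    """Fraction of probes in one sample whose detection p-value exceeds p_cutoff."""
    failed = sum(1 for p in detection_pvalues if p > p_cutoff)
    return failed / len(detection_pvalues)

def flag_samples(pvalues_by_sample, p_cutoff=0.01, max_failed_fraction=0.05):
    """Return names of samples with too many failed probes (assumed thresholds)."""
    return [
        name
        for name, pvals in pvalues_by_sample.items()
        if failed_fraction(pvals, p_cutoff) > max_failed_fraction
    ]

# Hypothetical detection p-values for two samples of 20 probes each.
toy = {
    "S1": [0.001] * 18 + [0.2, 0.3],  # 10% of probes fail
    "S2": [0.001] * 20,               # no probes fail
}
flagged = flag_samples(toy)
print(flagged)
```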
Probe Quality






Probe detection cut-offs
Bead count ( > 3 )
Remove probes on sex chromosomes
Probes containing SNPs (MAF > 1%)
Cross-reactive probes
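The probe-level filters listed on this slide can be combined into a single keep/drop test. The sketch below is a simplified illustration: the Probe record, its field names, and the toy probes are hypothetical, the detection p-value cutoff of 0.01 is an assumption, and a real analysis would evaluate detection and bead count across samples rather than once per probe.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    name: str
    chrom: str
    detection_p: float
    bead_count: int
    snp_maf: float        # highest MAF of any SNP in the probe, 0.0 if none
    cross_reactive: bool

def keep_probe(p, p_cutoff=0.01, bead_cutoff=3, maf_cutoff=0.01):
    """Keep a probe only if it passes every filter on the slide."""
    return (
        p.detection_p <= p_cutoff            # detected above background
        and p.bead_count > bead_cutoff       # enough beads (slide: > 3)
        and p.chrom not in ("chrX", "chrY")  # not on a sex chromosome
        and p.snp_maf <= maf_cutoff          # no SNP with MAF > 1%
        and not p.cross_reactive             # not cross-reactive
    )

# Hypothetical probes, each failing at most one filter.
probes = [
    Probe("probe_ok", "chr1", 0.001, 8, 0.0, False),
    Probe("probe_lowbeads", "chr2", 0.001, 2, 0.0, False),
    Probe("probe_sex", "chrX", 0.001, 10, 0.0, False),
    Probe("probe_snp", "chr3", 0.001, 10, 0.12, False),
    Probe("probe_cross", "chr4", 0.001, 10, 0.0, True),
    Probe("probe_undetected", "chr5", 0.2, 10, 0.0, False),
]
kept = [p.name for p in probes if keep_probe(p)]
print(kept)
```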
Background Correction

Background subtraction method


Available in GenomeStudio
Background calculated from negative control
probes is subtracted from all probes (separately for
each channel [red vs. green])
(GenomeStudio Methylation Module v1.8 User Guide)
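A minimal sketch of per-channel background subtraction follows. It is not GenomeStudio's code: the summary statistic used to turn negative control intensities into a single background value (the mean here) and the flooring of negative results at zero are both assumptions, and the intensities are invented.

```python
from statistics import mean

def background_subtract(intensities, negative_controls, summary=mean):
    """Subtract a background level estimated from negative control probes.

    `summary` (mean by default) is an assumed choice of statistic;
    results are floored at zero, which is also an assumption.
    """
    background = summary(negative_controls)
    return [max(x - background, 0.0) for x in intensities]

# Each colour channel is corrected separately with its own controls.
red = background_subtract([1200.0, 150.0, 80.0], [90.0, 110.0, 100.0])
green = background_subtract([900.0, 60.0], [40.0, 60.0])
print(red, green)
```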
Normalization
Goal: reduce non-biological variation
 Equalizes probe intensity and signal
distributions across arrays and between colour
channels
 New challenges with DNA methylation vs.
gene expression techniques



Systematic/technical variation
Novel probe design
Normalization for Illumina 450K

Problem: 2-type probe design
Infinium I Probe
2 different probes per CpG
Infinium II Probe
Single base extension at CpG
Maksimovic et al. Genome Biology 2012
CpG Content



Infinium II ≤ 3
Infinium I ≥ 3
Compressed β value
distribution in InfII
Solution: scale
Infinium II probes to
InfI probes
Maksimovic et al. Genome Biology 2012
Normalization to Internal Controls

Illumina GenomeStudio


Probe intensity multiplied by constant normalization factor
(NF)
NF calculated as average of controls in a reference sample
(GenomeStudio Methylation Module v1.8 User Guide)

Doesn’t account for the InfI vs InfII probe issues
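One plausible reading of this scaling is sketched below: each sample's intensities are multiplied by a constant that brings its control probe average to the level of the reference sample. The exact definition of NF in GenomeStudio is not given on the slide, so the ratio used here, the function names, and the numbers are assumptions.

```python
from statistics import mean

def normalization_factor(reference_controls, sample_controls):
    """Assumed NF: reference control average divided by sample control average."""
    return mean(reference_controls) / mean(sample_controls)

def normalize_to_controls(intensities, reference_controls, sample_controls):
    """Multiply every probe intensity in the sample by the constant NF."""
    nf = normalization_factor(reference_controls, sample_controls)
    return [x * nf for x in intensities]

# Hypothetical control intensities: this sample is half as bright as the reference.
scaled = normalize_to_controls(
    [100.0, 250.0],
    reference_controls=[1000.0, 1200.0, 1100.0],
    sample_controls=[500.0, 600.0, 550.0],
)
print(scaled)
```

Because every probe in a sample is scaled by the same constant, the relative difference between InfI and InfII distributions is left unchanged.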
Peak-Based Correction (PBC)
Raw

Uses peak summits to
correct β values




Convert β to M values
Determine peaks for I and II
probes with kernel density
estimation
Rescale M values by peak
summits
Rescale these corrected M
values to the I range and
convert back to β values
PBC
Dedeurwaerder et al. Epigenomics 2011
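The first and last steps of PBC rely on converting between β and M values. The sketch below uses the standard logit relationship M = log2(β / (1 - β)); it covers only the conversion, not the peak detection or rescaling, and the function names are chosen for the example.

```python
import math

def beta_to_m(beta):
    """Convert a β value in (0, 1) to an M value: log2(β / (1 - β))."""
    return math.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Convert an M value back to a β value: 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)

# A β of 0.5 corresponds to M = 0; the two functions are inverses.
half = beta_to_m(0.5)
round_trip = m_to_beta(beta_to_m(0.8))
print(half, round_trip)
```

β values of exactly 0 or 1 have no finite M value, so they would need to be offset or excluded before conversion.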
Subset Quantile Normalization (SQN)

Modeled after SQN
methods in expression





Probes separated and poor
detection removed
‘Anchors’ (RQs) chosen
from InfI probes
Target quantiles are
estimated for InfI and II
InfI and II normalized to
their RQs
Dataset is rebuilt
Touleimat and Tost, Epigenomics, 2012
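The core operation in the steps above is mapping a set of probe values onto target quantiles. The sketch below shows only that rank-based mapping, with linear interpolation when the two sets differ in size; the choice of anchor probes and the estimation of the target quantiles are not covered, and the function name and toy values are assumptions.

```python
def quantile_normalize_to_target(values, target_quantiles):
    """Replace each value with the target quantile at its rank.

    Ties are broken by input order. Linear interpolation is used when
    the number of values differs from the number of target quantiles.
    """
    n = len(values)
    m = len(target_quantiles)
    target = sorted(target_quantiles)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Position of this rank on the target grid, in [0, m - 1].
        pos = rank * (m - 1) / (n - 1) if n > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        out[i] = target[lo] * (1 - frac) + target[hi] * frac
    return out

# Toy example: three probe values forced onto three reference quantiles.
normalized = quantile_normalize_to_target([0.3, 0.1, 0.2], [0.0, 0.5, 1.0])
print(normalized)
```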
SQN Cont’d
No normalization
RQs by ‘relation to CpG’
Unique RQs
RQs by ‘relation to gene sequence’
Maksimovic et al. Genome Biology 2012
Subset Within-Array Normalization (SWAN)

Allows InfI and InfII probes
to be normalized together




Subset of N InfI and InfII
probes chosen based on
underlying CpG content
Separate methylated and
unmethylated channels
Mean intensity for each of
3N calculated
InfI and II probes adjusted
separately by linear
interpolation
Maksimovic et al. Genome Biology 2012
Beta-MIxture Quantile normalization (BMIQ)

Novel normalization method



Fit a 3-state (U/H/M) beta mixture model to InfI and
InfII probes separately
Transform InfII U and M
probes using the inverse of
the cumulative beta
distribution estimated from
the respective InfI probes
For InfII H probes, perform a
dilation transformation to fit
the data into the gap
Teschendorff et al. Bioinformatics 2012
START Data
Raw Data
SWAN Normalized
Cellular Composition
Adapted from Correa-Rocha et al. Pediatric Research 2012
Estimations by Houseman
Houseman et al. BMC Bioinformatics 2012
Batch Effects

Can be assessed using principal component analysis
or variations on singular value decomposition (e.g.
sva)

ComBat method uses a parametric or nonparametric empirical Bayes framework to adjust for a
known source of batch effects
Singular Value Decomposition (START)
Questions & Discussion