W08: Microarrays Data (Aedin)
Microarrays
(Slides mostly thanks to Curtis)
Aedín Culhane
Genotype -> Phenotype
Gene Expression 101
Crick's Central Dogma
Phenotype
Genome sequence is important, but genes have their effect through expression
Assay of [mRNA] captures the dynamic cellular activity and informs:
• What is going on in the cell at a certain point in time?
• On the large-scale genetic level, what accounts for differences between phenotypes?
History of Microarrays
1995: First cDNA microarrays (45 Arabidopsis genes)
1996: 864 yeast genes
1996: 1000 human cancer genes
1999: 7,000 genes on arrays
2003: Genome on 2 arrays
2004: Whole genome (coding) arrays
2005: Exon, tiling arrays
Next Generation Sequencing
Early microarray (18,000 probes)
Illumina BeadArray (~1M probes)
Affymetrix GeneChip (~1M probes)
Nimblegen Maskless Array (~2M probes)
Agilent microarray (~100K probes)
Impact of Microarrays (for Patients)
Dec 2004 - First microarray test for treatment decision approved by FDA
Affymetrix's AmpliChip Cytochrome P450 Genotyping Test: identifies variations in 2 genes affecting response to a wide variety of drugs.
Feb 2007 - FDA cleared MammaPrint for breast cancer
…
Oncotype DX, Blueprint… myriad of tests
Genomics applications in clinical trials rising. ~20% of U.S. clinical trials use some sort of genomics approach, with the highest percentage in oncology trials.
What can microarrays tell us?
• What genes are involved in specific
biological processes (e.g. stress response)
• Assumption = guilt by association (similar
expression pattern => same pathway)
• Tumor classification for treatment guidance
& outcome prediction
Public Microarray Data
Statistics March 2012
ArrayExpress
• 28,448 Studies, 815,057 profiles
(gxa - 3170 studies, 84165 profiles)
GEO
• 28,941 Studies (715,074 profiles)
Microarrays
www.ncbi.nlm.nih.gov/geo
www.ebi.ac.uk/microarray-as/ae
www.ebi.ac.uk/gxa
insilico.ulb.ac.be/
smd.stanford.edu/
biogps.gnf.org
Microarray Data Flow
(Flow diagram; boxes listed in extraction order)
Microarray experiment
Image Analysis
Unsupervised Analysis – clustering
Database
Supervised Analysis
Data Selection & Missing value estimation
Normalization & Centering
Networks & Data Integration
Data Matrix
Decomposition techniques
Goal of a microarray study
• Detect number of RNA molecules
• Actually measure fluorescence intensity
of spot
INDIRECT MEASUREMENT
Normalisation aims to reduce systematic
noise introduced in measurement
Raw data are not mRNA concentrations
• tissue contamination
• RNA degradation
• amplification efficiency
• reverse transcription efficiency
• Hybridization efficiency and specificity
• clone identification and mapping
• PCR yield, contamination
• spotting efficiency
• DNA support binding
• other array manufacturing related issues
• image segmentation
• signal quantification
• “background” correction
Quality control:
Noise and reliable signal
Probe level
Array level
Gene level
Arrays 1 ... n
Probe level: quality of the expression measurement of one spot on one particular array
Array level: quality of the expression measurement on one particular glass slide
Gene level: quality of the expression measurement of one probe across all arrays
Probe and Array level problems vary by platform type
Microarray technologies
• Two color / Two channel
– Spotted arrays (cDNA or oligo)
• Robotic microspotting, full-length genes or short oligos
– Bubble jet / Ink jet arrays (e.g. Agilent)
• Oligos (25-60 nts) built directly on arrays (in situ synthesis)
• Single channel
– Affymetrix GeneChips
• Photolithography with masks (from computer industry)
• Each gene represented by many n-mers
– Nimblegen
• Photolithography with micromirrors
– Illumina
• Addressable oligo-coated beads in microwells
Different Protocols, Different Platforms
Spotted Array Platform Specific
– “In house” printing effects
– Regional effects within and between print-tips
– Need regional plate and print-tip lowess normalisation
(Figure: spotting pin quality decline, shown after delivery of 3x10^5 spots and after delivery of 5x10^5 spots; PCR plates. H. Sueltmann DKFZ/MGA)
Affymetrix Platform Specific
– Probe level effect. Need a gene expression measure from the 11 probes in a probeset
Probe-response calibration
position- and sequence-specific effects w_i(s_i):
$$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i)$$
Naef et al., Phys Rev E 68 (2003)
(Plot axes: w_i vs. i)
Image Analysis: Spot identification
• The signal of the spots is quantified.
Histogram of pixel
intensities of a single spot
„Donuts“
Mean / Median / Mode / 75% quantile
Local background
GenePix
QuantArray
ScanAlyse
Spatial Defects
Chromosomal rearrangements can affect gene expression
(Figure: rpl20aΔ/rpl20aΔ, Chromosome XV; partial chromosome change. Data from Hughes et al. (2000))
– Expression microarray: gene expression (log10 ratio) vs. gene index along chromosome
– CGH microarray: genomic DNA content (log10 ratio) vs. gene index along chromosome
Chromosomal rearrangements (cont.)
Undetected rearrangements can result in incorrect biological conclusions!
(Figure: Hughes et al. (2000) Nature Genetics)
– Expression microarray: gene expression (log ratio)
– CGH microarray: genomic DNA content (log ratio)
Data Normalization: Definition
• Normalization is an attempt to compensate for
systematic bias in data
• Normalization attempts to remove the impact of
non-biological influences on biological data:
– Balance fluorescent intensities of the two dyes
– Adjust for differences in experimental conditions (between replicate gene expression experiments)
– Probe-specific intensities for Affymetrix data (between arrays)
• Normalization allows you to compare data from
one experiment to another (after removing
experiment-specific biases)
Normalization: Effects on Intensity
Non-normalized
Normalized
Same mRNA hybridized in both channels
Normalization methods
• Use all genes on array because only a few
genes are differentially expressed
• Constantly expressed genes (housekeeping
genes)
• Spiked controls of known concentrations
(experimental)
• Spatial normalization
Gene expression data
                 mRNA Samples
Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...
Entry (i, j) = gene-expression level or ratio for gene i in mRNA sample j
M = Log2(red intensity / green intensity), or Function (PM, MM) of MAS, dchip or RMA
A = average: log2(red intensity), log2(green intensity), or Function (PM, MM) of MAS, dchip or RMA
Log2(ratio) measures treat up- and down-regulated genes equally
Log(ratio) Histogram
(Figure: histogram of Log(ratio) values; x-axis Log(ratio) from -2 to 2, y-axis Frequency from 0 to 3000)
log2(1) = 0
log2(2) = 1
log2(1/2) = -1
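The symmetry of log2 ratios can be checked with a small sketch. The intensity values are hypothetical and chosen only so the ratios are exactly 2, 1/2 and 1; the function name is an illustration, not part of any package mentioned in these slides.

```python
import math

def log2_ratio(red, green):
    """Return log2(red / green) for two positive intensities."""
    return math.log2(red / green)

# Hypothetical intensities (not from the slides)
up = log2_ratio(2000.0, 1000.0)         # two-fold up
down = log2_ratio(1000.0, 2000.0)       # two-fold down
unchanged = log2_ratio(1500.0, 1500.0)  # no change

# On the raw scale the same changes are 2.0 and 0.5, which are not
# symmetric around the "no change" value of 1.0.
print(up, down, unchanged)
```

On the log scale a two-fold increase and a two-fold decrease sit at the same distance from zero, which is why the histogram is roughly symmetric.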
Scatterplot
Data
Message: look at your data on log-scale!
Data (log scale)
Median centering
One of the simplest strategies is to bring all "centers" of the array data to the same level.
Assumption: the majority of genes are unchanged between conditions.
Divide all expression measurements of each array by the median.
Log Signal, centered at 0
Median is more robust to outliers than the mean.
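A minimal sketch of median centering, assuming each array is a plain list of positive intensities. Dividing by the median on the raw scale is the same as subtracting the log median on the log scale, so the code works on log2 values directly. The array names and numbers are hypothetical.

```python
import math
import statistics

def median_center_log2(intensities):
    """Log2-transform one array's intensities and subtract the array median."""
    logs = [math.log2(x) for x in intensities]
    med = statistics.median(logs)
    return [v - med for v in logs]

# Two hypothetical arrays; array2 is array1 at half the overall brightness
arrays = {
    "array1": [100.0, 200.0, 400.0, 800.0, 6400.0],
    "array2": [50.0, 100.0, 200.0, 400.0, 3200.0],
}
centered = {name: median_center_log2(vals) for name, vals in arrays.items()}

for name, vals in centered.items():
    print(name, [round(v, 2) for v in vals])
```

After centering, the global brightness difference between the two arrays disappears, and the single large value in each array does not shift the center.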
Problem of median-centering
Median-Centering is a global Method. It does not adjust for local effects,
intensity dependent effects, print-tip effects, etc.
(Figure, left: scatterplot of log-Signals after Median-centering; axes Log Green vs. Log Red)
(Figure, right: M-A Plot of the same data)
M = Log Red - Log Green
A = (Log Green + Log Red) / 2
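The M and A definitions above translate directly into code. This is a sketch using base-2 logs, which the slides use elsewhere for ratios; the intensities are hypothetical.

```python
import math

def ma_transform(red, green):
    """Return a list of (M, A) pairs from paired red/green intensities.

    M = log2(red) - log2(green)        (log ratio)
    A = (log2(green) + log2(red)) / 2  (average log intensity)
    """
    pairs = []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        pairs.append((log_r - log_g, (log_g + log_r) / 2.0))
    return pairs

# Hypothetical spot intensities
red = [1024.0, 256.0]
green = [256.0, 256.0]
ma = ma_transform(red, green)
print(ma)
```

Plotting M against A is a 45-degree rotation of the Log Red vs. Log Green scatterplot, which makes intensity-dependent bias easier to see.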
Normalize to scaling factor
Normalized to the 75th percentile
Not influenced by outliers
Still too much below the line
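Scaling to the 75th percentile could look like the sketch below. The percentile helper uses linear interpolation, which is an assumption; the slides do not say how the percentile was computed. The arrays and the target value are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def scale_to_percentile(values, target, q=75.0):
    """Multiply every value so that the q-th percentile equals `target`."""
    factor = target / percentile(values, q)
    return [v * factor for v in values]

# Hypothetical arrays; array_b is twice as bright and has one extreme outlier
array_a = [1, 2, 3, 4, 5]
array_b = [2, 4, 6, 8, 1000]
scaled_a = scale_to_percentile(array_a, target=100.0)
scaled_b = scale_to_percentile(array_b, target=100.0)
print(scaled_a)
print(scaled_b)
```

The outlier in the second array does not affect the scaling factor, but a single global factor still cannot correct intensity-dependent curvature.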
Quantile Normalisation
Distribution of intensities across every slide is forced to be the same.
Outliers are not tolerated.
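A minimal sketch of quantile normalisation, assuming every array has the same number of probes and no missing values. Ties are broken by position rather than averaged, which is a simplification and not necessarily what any particular package does. The data are hypothetical.

```python
def quantile_normalize(arrays):
    """Force every array to share the same distribution of values.

    arrays: list of equal-length lists, one list per array.
    Returns new lists in the original probe order.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        result.append(out)
    return result

# Two hypothetical arrays with three probes each
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(raw)
print(normalized)
```

Each probe keeps its rank within its own array, but the values it can take are replaced by the shared reference distribution, so an extreme value on one array is pulled to the common top value.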
Lowess normalization
Locally Weighted Scatterplot Smoothing
M = Log Red - Log Green
A = (Log Green + Log Red) / 2
Use the local estimate to bend the banana straight
Straightens the banana!
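The idea of fitting a local trend of M on A and subtracting it can be sketched as below. This is a simplified, non-robust local linear smoother with tricube weights, written only for illustration; it omits the robustness iterations of full lowess and is not the implementation used by any package named in these slides. The function name, the `frac` parameter and the banana-shaped test data are assumptions.

```python
def lowess_fit(x, y, frac=0.5):
    """Smoothed y at each x from tricube-weighted local linear fits."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [max(0.0, 1.0 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical M-A data with a curved ("banana") intensity-dependent bias
a_vals = [6.0 + 0.5 * i for i in range(21)]
m_vals = [0.05 * (a - 11.0) ** 2 - 0.3 for a in a_vals]

trend = lowess_fit(a_vals, m_vals)
m_norm = [m - t for m, t in zip(m_vals, trend)]  # normalised log ratios
print(round(max(abs(v) for v in m_vals), 3), round(max(abs(v) for v in m_norm), 3))
```

Subtracting the fitted trend from M leaves residuals scattered around zero across the whole range of A, which is what "straightening the banana" refers to.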
Data centering
• Important when reference levels and other
spot-specific parameters aren’t of interest
• Centering adjusts values of each gene to
reflect their variation from some property of
the series of observed values (e.g. median)
• If reference is important (e.g. reference is
sample at time point 0), centering may not
make sense
The missing value problem
• Microarrays can have systematic or random
missing values
• Some algorithms can’t deal with missing
values
• Instead of hoping missing values won’t bias
the analysis, better to estimate them
accurately
Accurate estimation important for analysis
(Figure panels)
– Complete data set
– Data set with 30% entries missing and filled with zeros (zero values appear black)
– Data set with missing values estimated by KNNimpute algorithm
KNNimpute Algorithm
• Idea: use genes with similar expression
profiles to estimate missing values
Before imputation (missing value in column j):
2 |   | 5 | 7 | 3 | 1   Gene X
2 | 4 | 5 | 7 | 3 | 2   Gene B
3 | 5 | 6 | 7 | 3 | 2   Gene C
After imputation:
2 |4.3| 5 | 7 | 3 | 1   Gene X
2 | 4 | 5 | 7 | 3 | 2   Gene B
3 | 5 | 6 | 7 | 3 | 2   Gene C
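A minimal sketch of the nearest-neighbour idea, using the Gene X, Gene B and Gene C profiles from the slide. The weighting scheme (inverse Euclidean distance over the observed columns) is an assumption; the slide does not state the exact weights, so the estimate may differ slightly from the 4.3 shown there.

```python
import math

def knn_impute(target, others, k=2):
    """Fill None entries in `target` from the k most similar complete profiles."""
    observed = [i for i, v in enumerate(target) if v is not None]
    scored = []
    for row in others:
        # Euclidean distance over the columns where the target is observed
        d = math.sqrt(sum((target[i] - row[i]) ** 2 for i in observed))
        scored.append((d, row))
    scored.sort(key=lambda t: t[0])
    neighbours = scored[:k]
    # Assumed weighting: inverse distance (small constant avoids division by zero)
    weights = [1.0 / (d + 1e-9) for d, _ in neighbours]
    filled = list(target)
    for j, v in enumerate(target):
        if v is None:
            total = sum(w * row[j] for w, (_, row) in zip(weights, neighbours))
            filled[j] = total / sum(weights)
    return filled

gene_x = [2, None, 5, 7, 3, 1]
gene_b = [2, 4, 5, 7, 3, 2]
gene_c = [3, 5, 6, 7, 3, 2]
imputed = knn_impute(gene_x, [gene_b, gene_c], k=2)
print(round(imputed[1], 2))
```

The estimate falls between Gene B's value of 4 and Gene C's value of 5, closer to Gene B because its profile is more similar to Gene X.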
Comparison of Normalization methods applied to Affy data
• Model of tumor recurrence
• Pooled 5 mice to get sufficient mRNA
• Compare Primary v Recurrent
• Affymetrix MOE430A and B chips
– MOE430A - 22690 probesets
– MOE430B - 22575 probesets
(Figure: Primary vs. Recurrent scatterplots for Raw Data, Mas5.0, VSN and gcRMA)
Red: >2 fold difference in gcRMA normalised data
Are these methods always valid?
• Mas5.0, RMA, gcRMA and vsn
– all assume that the sum of RNA is constant (same number of genes up and down)
• THIS IS NOT ALWAYS TRUE
– knockout of Pol II
– Blocking methylation/translation etc
Normalising to an external set of genes
• Housekeeping
– Not a good idea
• Li & Wong
– Transform using non-linear smooth curves
– Uses rank invariant probes
– Available in dChip and Bioconductor (Affy package)
– Cheng Li & Wing Hung Wong (2001a) PNAS 98, 31-36
• Spike in Controls
– External RNA
– van de Peppel et al., (2003) EMBO Rep. 4(4):387-93.
Colon Cancer Data
• Fresh-frozen human colorectal tumours.
• N=6
– Whole tumour N=3
– Parenchymal fraction (LCM dissected)
• On Affymetrix U133plus2 chips
– 54675 probesets
Normalisation Approaches!
(Figure panels: MAS 5.0, RMA, Li & Wong)
Many normalisation methods
Need to consider the best one for your experimental design
Most normalisation methods assume the sum of mRNAs is equal
Summary I
• Raw data are not mRNA concentrations
• We need to check data quality on different
levels
– Probe level
– Array level (all probes on one array)
– Gene level (one gene on many arrays)
• Always log your data
• Normalize your data to avoid systematic
(non-biological) effects
• Lowess normalization straightens banana
Microarray Usage Beyond Simple Expression
Microarray Design
• First spotted arrays contained PCR products of
entire ORFs (yeast), or PCR’d cDNA clone inserts
(human).
– significant cross-hybridization potential, and unable to
distinguish which strand was expressed.
• Typical Affymetrix arrays use multiple probe pairs
(~10 per transcript), with match and mismatch
probes. Exon arrays use 10 probes per exon, no
mismatch probes.
Microarray Design II
• Trend for all array designs is to use
oligonucleotides.
– more specific
– able to distinguish expressed strand (assuming you do
the molecular biology correctly).
• Oligonucleotides selected depend on application
– Tiling arrays
• gene discovery, full genome coverage
– Exon junction arrays
– Promoters, etc.
• Customized microarrays for specific genes
Beyond mRNA
Crick's Central Dogma
Phenotype
That sounds simple…
Really?
Beyond Expression: CGH
• Comparative Genomic Hybridization (CGH)
– genomic DNA used rather than mRNA
– quick detection of aneuploidy, amplifications, deletions
(not rearrangements)
But the Simple Model of Transcription is …
… is much more complex!!
Gene Expression
ncRNA
TFBS
DNA – SNP/CNV
Methylation
Proteins
Nucleic Acids Res. 2011 January; 39(Database issue): D1005–D1010.
Tiling Arrays
(Figure: overlapping probes tiled across the sequence ATGGCTACGTTCATGCAT)
Measuring competitive hybridization efficiency can indicate areas with small polymorphisms
Tiling Arrays for SNP Detection
High Throughput Sequencing
• Deep sequencing is a tiling array with very
high precision!
Direct to Consumer (DTC)
• Increasing DTC Tests
– 23andMe - customized Illumina chip
– FamilyTreeDNA - Illumina OmniExpress
– Complete Genomics - WGS
– deCODEme - Illumina Human 1M
– Navigenics - Affymetrix Genome-Wide SNP 6.0
– SeqWright - Affymetrix Genome-Wide SNP 6.0
– Gene Essence - Affymetrix Genome-Wide SNP 6.0
– Ancestry.com
Source http://www.snpedia.com/index.php/Testing
Social Web – Future Impact?
• Consumer-led research
• Social Network based drug trials (Lithium in
ALS - PatientsLikeMe)