M08: Microarray Data File

Transcript M08: Microarray Data File

Microarray data
Curtis Huttenhower
Slides courtesy of:
Amy Caudy (Princeton)
Gavin Sherlock (Stanford)
Matt Hibbs (Jackson Labs)
Florian Markowetz (Cancer Research UK)
Olga Troyanskaya (Princeton)
Harvard School of Public Health
Department of Biostatistics
03-25-13
All I Really Need to Know about Biology
I Learned from One PowerPoint Slide
Cell
Stimulus
DNA -> (Transcription) -> mRNA -> (Translation) -> Proteins
Why microarray analysis: the questions
• Large-scale study of biological processes
• What is going on in the cell at a certain point
in time?
• On the large-scale genetic level, what
accounts for differences between
phenotypes?
• Sequence important, but genes have effect
through expression
Microarray technologies
• Two color / Two channel
– Spotted arrays (cDNA or oligo)
• Robotic microspotting, full-length genes or short oligos
– Bubble jet / Ink jet arrays (e.g. Agilent)
• Oligos (25-60 nts) built directly on arrays (in situ synthesis)
• Single channel
– Affymetrix GeneChips
• Photolithography with masks (from computer industry)
• Each gene represented by many n-mers
– Nimblegen
• Photolithography with micromirrors
– Illumina
• Addressable oligo-coated beads in microwells
Early microarray
(18,000 probes)
Illumina BeadArray
(~1M probes)
Affymetrix GeneChip
(~1M probes)
Nimblegen Maskless Array
(~2M probes)
Agilent microarray
(~100K probes)
Microarrays
[Figure: two-color microarray workflow]
Control mRNA sample (normal cells) and experimental mRNA sample (experimental condition X)
-> Label one sample with red dye, the other with green dye
-> Mix
-> Hybridize to a slide spotted with gene sequences
-> Scan
Resulting data (Genes x Conditions):
Gene    X      Y      Z
A       2      1.5    1
B       0.2    -0.1   0.15
C       -1     -1.5   -2
D       0      3      0.5
Microarray Outputs
Measure amounts of green and
red dye on each spot
Represent level of expression as a
log ratio between these amounts
Raw Image from Spellman et al., 98
Extracting Data
Cy3     Cy5     Cy5/Cy3   log2(Cy5/Cy3)
200     10000   50.00     5.64
4800    4800    1.00      0.00
9000    300     0.03      -4.91
[Figure: matrix of log ratios, Genes x Experiments]
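The log-ratio arithmetic for the three example spots can be checked with a short script. This is a minimal sketch; the function name `log_ratio` and the tuple layout are illustrative choices, not part of the original slides.

```python
import math

# (Cy3, Cy5) intensities of the three example spots on the slide.
spots = [(200, 10000), (4800, 4800), (9000, 300)]

def log_ratio(cy3, cy5):
    """Return (Cy5/Cy3, log2(Cy5/Cy3)) for one spot."""
    ratio = cy5 / cy3
    return ratio, math.log2(ratio)

results = [log_ratio(cy3, cy5) for cy3, cy5 in spots]
for (cy3, cy5), (ratio, lr) in zip(spots, results):
    # Columns: Cy3, Cy5, ratio, log2 ratio
    print(f"{cy3:>6} {cy5:>6} {ratio:8.2f} {lr:8.2f}")
```

A 50-fold increase and a 30-fold decrease land at roughly +5.6 and -4.9, which is why the log scale treats up- and down-regulation symmetrically.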
From experiment to data
Microarray Data Flow
Microarray experiment -> Image Analysis -> Database -> Data Selection & Missing value estimation -> Normalization & Centering -> Data Matrix
The Data Matrix feeds:
• Unsupervised Analysis: clustering
• Supervised Analysis
• Decomposition techniques
• Networks & Data Integration
Microarrays
www.ncbi.nlm.nih.gov/geo
www.ebi.ac.uk/microarray-as/ae
biogps.gnf.org
www.ebi.ac.uk/gxa
What can microarrays tell us?
• What genes are involved in specific
biological processes (e.g. stress response)
• Assumption = guilt by association (similar
expression pattern => same pathway)
• Tumor classification for treatment guidance
& outcome prediction
Comparing Two Classes
This comparison will tell us about the differences
in gene expression between a breast tumor and
normal breast tissue
Breast tumor
(test sample)
Normal breast tissue
(reference sample)
Comparing between classes
How to compare breast tumor and ovarian tumor?
Breast tumor
Ovarian tumor
Normal breast
Normal ovary
Using a common reference
Breast tumor
Ovarian tumor
Normal breast
Normal ovary
Using a common reference
sample can allow indirect
comparison between many
different samples.
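The indirect comparison works because the reference cancels out on the log scale: log(A/B) = log(A/Ref) - log(B/Ref). A minimal sketch follows, using made-up intensities for one gene; the numbers and the function name `indirect_log_ratio` are hypothetical.

```python
import math

def indirect_log_ratio(m_a_vs_ref, m_b_vs_ref):
    """Compare samples A and B through a shared reference.

    Both arguments are log2(sample / reference) for the same gene.
    """
    return m_a_vs_ref - m_b_vs_ref

# Hypothetical intensities for one gene (not real data).
reference, breast_tumor, ovarian_tumor = 400.0, 800.0, 200.0

m_breast = math.log2(breast_tumor / reference)
m_ovarian = math.log2(ovarian_tumor / reference)

indirect = indirect_log_ratio(m_breast, m_ovarian)
direct = math.log2(breast_tumor / ovarian_tumor)
print(indirect, direct)
```

In this idealized, noise-free case the indirect and direct values agree. With real arrays each of the two measurements carries its own noise, so the indirect estimate is noisier than a direct one.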
Dye-flip controls
Breast tumor
Reference
Reference
Breast tumor
Using a dye-flip design
allows you to deal with
labeling bias.
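One way to see why a dye flip helps: if the labeling bias is assumed to add the same offset to log2(red/green) on both arrays (an assumption made for this sketch, not a claim from the slides), it cancels when the two arrays are combined. The names below are illustrative.

```python
def dye_flip_estimate(m_forward, m_flipped):
    """Combine a dye-flip pair of log ratios.

    m_forward: log2(red/green) with the tumor labeled red.
    m_flipped: log2(red/green) with the tumor labeled green.
    A dye bias that is identical on both arrays cancels in the difference.
    """
    return (m_forward - m_flipped) / 2

# Hypothetical values: true tumor/reference log ratio 1.0, dye bias 0.3.
true_log_ratio, dye_bias = 1.0, 0.3
m_forward = true_log_ratio + dye_bias    # tumor in red channel
m_flipped = -true_log_ratio + dye_bias   # tumor in green channel

estimate = dye_flip_estimate(m_forward, m_flipped)
print(estimate)
```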
Dye-Flip Reference Design
Samples: A, B, C, D, E
• Examining five samples with
a dye-flip design
– Requires 10 microarrays
– Measures each sample twice
– Measures the reference 10
times
Advantages of Common
Reference Design
• Easily extensible
• Simple interpretation of all results
• Requires less RNA per experimental
sample.
• Less sensitive to bad RNA samples
Shortcomings of common
reference design
• Reference sample is measured over and
over again
• Biological samples are measured only once
or twice
• All comparisons are indirect, relying on the
reference sample
Raw data are not mRNA
concentrations
• tissue contamination
• RNA degradation
• amplification efficiency
• reverse transcription efficiency
• hybridization efficiency and specificity
• clone identification and mapping
• PCR yield, contamination
• spotting efficiency
• DNA support binding
• other array manufacturing related issues
• image segmentation
• signal quantification
• “background” correction
Quality control:
Noise and reliable signal
Probe level
Array level
Gene level
Arrays 1 ... n
Probe level: quality of the expression measurement of one spot
on one particular array
Array level: quality of the expression measurement on one
particular glass slide
Gene level: quality of the expression measurement of one probe
across all arrays
Data Filtering
• Goals:
– Extract only experiment/gene subsets of interests
– Extract only “accurate” data points
• Various filtering criteria:
– Manual
– Fluorescence distribution
– Level of expression in each channel
• Filters can be combined using logical operators
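Combining filters with logical operators might look like the sketch below. The spot records, field names, and the foreground/background threshold of 2.0 are all hypothetical and chosen only for illustration.

```python
# Hypothetical spot records; "fg" and "bg" hold (red, green) intensities.
spots = [
    {"gene": "A", "fg": (1200, 900), "bg": (100, 120), "flag": None},
    {"gene": "B", "fg": (150, 140), "bg": (100, 120), "flag": None},
    {"gene": "C", "fg": (2000, 1800), "bg": (110, 100), "flag": "X"},
    {"gene": "D", "fg": (800, 60), "bg": (90, 50), "flag": None},
]

def passes_manual(spot):
    """Manual filter: keep spots that were not flagged by hand."""
    return spot["flag"] is None

def passes_brightness(spot, min_ratio=2.0):
    """Expression-level filter: foreground/background in each channel."""
    return all(fg / bg >= min_ratio for fg, bg in zip(spot["fg"], spot["bg"]))

# Filters combined with a logical AND.
kept = [s["gene"] for s in spots if passes_manual(s) and passes_brightness(s)]
print(kept)
```

Here only gene A survives: B and D are too dim in at least one channel, and C was flagged manually.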
Why worry?: Spots with low regression
Challenge – How can we differentiate between data and noise in
microarray spot images?
Spot identification
Individual spots are recognized; size and shape might be
adjusted per spot (automatically, with fine adjustments by
hand).
Additional manual flagging of bad (X) or non-present (NA)
spots
NA
X
poor spot quality
good spot quality
Different Spot identification methods: Fixed circles, circles with variable size,
arbitrary spot shape (morphological opening)
Spot identification
• The signal of the spots is quantified.
Histogram of pixel
intensities of a single spot
„Donuts“
Mean / Median / Mode / 75% quantile
Local background
Software: GenePix, QuantArray, ScanAlyze
Spatial Defects
Probe-level quality control
• Individual spots printed on the slide
• Sources:
– faulty printing, uneven distribution, contamination with debris,
magnitude of signal relative to noise, poorly measured spots;
• Visual inspection:
– hairs, dust, scratches, air bubbles, dark regions, regions with haze
• Spot quality:
– Brightness: foreground/background ratio
– Uniformity: variation in pixel intensities and ratios of intensities within
a spot
– Morphology: area, perimeter, circularity.
– Spot Size: number of foreground pixels
• Action:
– set measurements to NA (missing values)
– local normalization procedures which account for regional
idiosyncrasies.
– use weights for measurements to indicate reliability in later analysis.
Array level quality control
• Problems:
– array fabrication defect
– problem with RNA extraction
– failed labeling reaction
– poor hybridization conditions
– faulty scanner
• Quality measures:
– Percentage of spots with no signal (~30% excluded spots)
– Range of intensities
– (Av. Foreground)/(Av. Background) > 3 in both channels
– Distribution of spot signal area
– Amount of adjustment needed: signals have to be substantially
changed to make slides comparable.
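Two of these quality measures are easy to compute per array. The sketch below assumes a hypothetical data layout (tuples of foreground and background intensities, `None` for excluded spots) and reads the "~30%" figure as an upper limit on excluded spots, which is an interpretation rather than something the slide states.

```python
def array_qc(spots, max_excluded=0.30, min_fg_bg=3.0):
    """Summarize one array.

    `spots` holds (fg_red, bg_red, fg_green, bg_green) tuples,
    or None for spots with no usable signal.
    """
    measured = [s for s in spots if s is not None]
    excluded_fraction = 1 - len(measured) / len(spots)
    if not measured:
        return {"excluded_fraction": excluded_fraction, "ok": False}

    def mean(values):
        return sum(values) / len(values)

    # (Av. Foreground) / (Av. Background), per channel.
    red_ratio = mean([s[0] for s in measured]) / mean([s[1] for s in measured])
    green_ratio = mean([s[2] for s in measured]) / mean([s[3] for s in measured])
    ok = (excluded_fraction <= max_excluded
          and red_ratio > min_fg_bg
          and green_ratio > min_fg_bg)
    return {"excluded_fraction": excluded_fraction,
            "red_ratio": red_ratio, "green_ratio": green_ratio, "ok": ok}

# Hypothetical four-spot array with one excluded spot.
report = array_qc([(1000, 100, 900, 150), (500, 100, 600, 150),
                   None, (900, 100, 300, 150)])
print(report)
```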
Gene-level quality control
Gene g
• Poor hybridization in the reference channel may introduce bias in the fold change
• Some probes will not hybridize well to the target RNA
• Printing problems, such that all spots of a given inventory well have poor quality
• A well may be of bad quality (contamination)
• Genes with a consistently low signal in the reference channel are suspicious
Chromosomal rearrangements
can affect gene expression
[Figure: rpl20aΔ/rpl20aΔ, Chromosome XV, showing a partial chromosome change. Expression microarray: Gene Expression (log10 ratio) vs. gene index along chrom. CGH microarray: Genomic DNA content (log10 ratio) vs. gene index along chrom.]
(data from Hughes et al. (2000))
Chromosomal rearrangements (cont.)
Undetected rearrangements can result in incorrect biological conclusions!
[Figure: Expression microarray, Gene expression (log ratio), compared with CGH microarray, Genomic DNA content (log ratio)]
(Hughes et al. (2000) Nature Genetics)
Data Normalization: Definition
• Normalization is an attempt to compensate for
systematic bias in data
• Normalization attempts to remove the impact of
non-biological influences on biological data:
– Balance fluorescent intensities of the two dyes
– Adjust for differences in experimental conditions (between
replicate gene expression experiments)
– Probe-specific intensities for Affymetrix data (between arrays)
• Normalization allows you to compare data from
one experiment to another (after removing
experiment-specific biases)
Normalization: Effects on Intensity
Non-normalized
Normalized
Same mRNA hybridized in both channels
Normalization methods
• Use all genes on array because only a few
genes are differentially expressed
• Constantly expressed genes (housekeeping
genes)
• Spiked controls of known concentrations
(experimental)
• Spatial normalization
Gene expression data
mRNA Samples
Gene   sample1  sample2  sample3  sample4  sample5  …
1       0.46     0.30     0.80     1.51     0.90    ...
2      -0.10     0.49     0.24     0.06     0.46    ...
3       0.15     0.74     0.04     0.10     0.20    ...
4      -0.45    -1.03    -0.79    -0.56    -0.32    ...
5      -0.06     1.06     1.35     1.09    -1.09    ...
Each entry: gene-expression level or ratio for gene i in mRNA sample j
M = Log2(red intensity / green intensity), or Function (PM, MM) of MAS, dchip or RMA
A = average of log2(red intensity) and log2(green intensity), or Function (PM, MM) of MAS, dchip or RMA
Scatterplot
Data
Message: look at your data on log-scale!
Data (log scale)
MA Plot
M = log2(R/G)
A = 1/2 log2(RG)
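The MA transform is a rotation of the log-log scatterplot: M is the log ratio and A the average log intensity. A minimal sketch, with the function name `ma_values` chosen for illustration:

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from raw red and green intensities."""
    m = math.log2(red / green)        # M = log2(R/G) = log2 R - log2 G
    a = 0.5 * math.log2(red * green)  # A = 1/2 log2(RG) = (log2 R + log2 G) / 2
    return m, a

m, a = ma_values(10000, 200)
print(m, a)
```

Plotting M against A for every spot makes intensity-dependent bias visible as a trend away from the horizontal line M = 0.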
Median centering
One of the simplest strategies is to bring the „centers“ of all arrays to
the same level.
Assumption: the majority of genes are unchanged between
conditions.
Divide all expression
measurements of
each array by the
array's median (on the log scale: subtract the median log signal).
Log Signal, centered at 0
Median is more robust to outliers than the mean.
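On log-transformed data, dividing by the median becomes subtracting the median log value, which centers the array at 0. A minimal sketch with made-up log ratios; it assumes the array has no missing values.

```python
import statistics

def median_center(log_values):
    """Shift one array's log signals so that their median is 0."""
    med = statistics.median(log_values)
    return [x - med for x in log_values]

# Hypothetical log ratios for one array; 3.0 plays the role of an outlier.
array_log_ratios = [0.5, 1.5, 1.0, 3.0, 0.0]
centered = median_center(array_log_ratios)
print(centered)
```

The outlier at 3.0 does not move the median, whereas it would pull the mean upward and shift every centered value with it.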
Problem of median-centering
Median centering is a global method. It does not adjust for local effects,
intensity-dependent effects, print-tip effects, etc.
[Figure: scatterplot of log signals after median centering (Log Red vs. Log Green), and M-A plot of the same data, with M = Log Red - Log Green and A = (Log Green + Log Red) / 2]
Lowess normalization
[Figure: M-A plot with a local estimate of the trend; M = Log Red - Log Green, A = (Log Green + Log Red) / 2]
Use the estimate to bend
the banana straight
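The idea of "bending the banana straight" can be sketched as: estimate the local trend of M as a function of A, then subtract it. The code below is a simplified stand-in for lowess, a single pass of tricube-weighted local linear regression without the robustness iterations of the full algorithm. The function names and the `frac` default are illustrative assumptions.

```python
def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def local_linear_fit(a_vals, m_vals, a0, frac=0.5):
    """Weighted linear fit of M on A near a0, evaluated at a0."""
    n = len(a_vals)
    k = max(2, int(round(frac * n)))
    # Bandwidth: distance from a0 to its k-th nearest neighbour.
    h = sorted(abs(a - a0) for a in a_vals)[k - 1] or 1e-12
    w = [tricube((a - a0) / h) for a in a_vals]
    sw = sum(w)
    if sw == 0:
        return sum(m_vals) / n
    mean_a = sum(wi * a for wi, a in zip(w, a_vals)) / sw
    mean_m = sum(wi * m for wi, m in zip(w, m_vals)) / sw
    var_a = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_vals))
    if var_a == 0:
        return mean_m
    cov = sum(wi * (a - mean_a) * (m - mean_m)
              for wi, a, m in zip(w, a_vals, m_vals))
    return mean_m + (cov / var_a) * (a0 - mean_a)

def local_trend_normalize(a_vals, m_vals, frac=0.5):
    """Subtract the local estimate of M at each spot's own A value."""
    fitted = [local_linear_fit(a_vals, m_vals, a0, frac) for a0 in a_vals]
    return [m - f for m, f in zip(m_vals, fitted)]

# Hypothetical curved ("banana") trend in M along A.
a_vals = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
m_curved = [0.1 * (a - 10.0) ** 2 for a in a_vals]
m_normalized = local_trend_normalize(a_vals, m_curved)
print(max(abs(m) for m in m_curved), max(abs(m) for m in m_normalized))
```

For real analyses an established lowess implementation is the safer choice; this version loops over all spots for every fit and is meant only to show the mechanics.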
Data centering
• Important when reference levels and other
spot-specific parameters aren’t of interest
• Centering adjusts values of each gene to
reflect their variation from some property of
the series of observed values (e.g. median)
• If reference is important (e.g. reference is
sample at time point 0), centering may not
make sense
The missing value problem
• Microarrays can have systematic or random
missing values
• Some algorithms can’t deal with missing
values
• Instead of hoping missing values won’t bias
the analysis, better to estimate them
accurately
Accurate estimation important for analysis
Complete data set
Data set with 30% entries
missing and filled with
zeros (zero values appear
black)
Data set with missing
values estimated by
KNNimpute algorithm
KNNimpute Algorithm
• Idea: use genes with similar expression
profiles to estimate missing values
Column j (second entry) is missing for Gene X:
2 |   | 5 | 7 | 3 | 1 Gene X
2 | 4 | 5 | 7 | 3 | 2 Gene B
3 | 5 | 6 | 7 | 3 | 2 Gene C
After estimation from the similar genes B and C:
2 |4.3| 5 | 7 | 3 | 1 Gene X
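The example above can be sketched in a few lines. This version assumes Euclidean distance over the observed columns and inverse-distance weights; the exact weighting used to obtain the slide's 4.3 is not given, so the result here is close to but not identical with that figure. Neighbor rows are assumed to be complete.

```python
import math

def knn_impute(target, neighbors, k=2):
    """Fill None entries of `target` using the k most similar complete rows."""
    observed = [i for i, v in enumerate(target) if v is not None]

    def distance(row):
        # Euclidean distance over the columns where the target is observed.
        return math.sqrt(sum((target[i] - row[i]) ** 2 for i in observed))

    nearest = sorted(neighbors, key=distance)[:k]
    # Assumed weighting: inverse distance, so closer genes count more.
    weights = [1.0 / max(distance(row), 1e-9) for row in nearest]
    filled = list(target)
    for j, value in enumerate(target):
        if value is None:
            filled[j] = (sum(w * row[j] for w, row in zip(weights, nearest))
                         / sum(weights))
    return filled

gene_x = [2, None, 5, 7, 3, 1]
gene_b = [2, 4, 5, 7, 3, 2]
gene_c = [3, 5, 6, 7, 3, 2]

filled = knn_impute(gene_x, [gene_b, gene_c])
print(filled)
```

The estimate falls between gene B's 4 and gene C's 5 and sits closer to 4, because gene B's profile is the more similar one (about 4.37 with these weights).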
Summary I
• Raw data are not mRNA concentrations
• We need to check data quality on different
levels
– Probe level
– Array level (all probes on one array)
– Gene level (one gene on many arrays)
• Always log your data
• Normalize your data to avoid systematic
(non-biological) effects
• Lowess normalization straightens banana
Microarray Usage Beyond
Simple Expression
Microarray Design
• First spotted arrays contained PCR products of
entire ORFs (yeast), or PCR’d cDNA clone inserts
(human).
– significant cross-hybridization potential, and unable to
distinguish which strand was expressed.
• Typical Affymetrix arrays use multiple probe pairs
(~10 per transcript), with match and mismatch
probes. Exon arrays use 10 probes per exon, no
mismatch probes.
Microarray Design II
• Trend for all array designs is to use
oligonucleotides.
– more specific
– able to distinguish expressed strand (assuming you do
the molecular biology correctly).
• Choice of oligonucleotides depends on application
– Tiling arrays
• gene discovery, full genome coverage
– Exon junction arrays
– Promoters, etc.
• Customized microarrays for specific genes
Co-regulated genes
are co-expressed
Expression profiles of 53 genes in the S. cerevisiae genome that contain
an exact match to an MCB box in their promoters (profiles normalized by
mean & variance).
[Figure: promoter diagram showing Transcription Factor, RNA polymerase, ATG, Open Reading Frame]
Cliften et al. Science 301 2003
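Normalizing a profile by mean and variance is commonly done as a z-score, which lets genes with different absolute levels be compared by shape. A minimal sketch; the use of the population standard deviation is an assumption, since the slide does not say which variant was used.

```python
import statistics

def standardize(profile):
    """Rescale one gene's profile to mean 0 and standard deviation 1."""
    mu = statistics.mean(profile)
    sd = statistics.pstdev(profile)  # population SD (assumed)
    if sd == 0:
        return [0.0 for _ in profile]  # flat profile: nothing to scale
    return [(x - mu) / sd for x in profile]

# Hypothetical expression profile of one gene across five conditions.
z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print(z)
```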
Integration of expression with
sequence for motif discovery
• Identify sequence motifs or motif
combinations common to each group of coexpressed genes
ACGCGT
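A first step toward motif discovery is simply counting a candidate motif in the promoters of a co-expressed group. The sketch below scans for the MCB box ACGCGT; the promoter sequences and gene names are invented for illustration. Because ACGCGT is its own reverse complement, scanning one strand is enough for this particular motif.

```python
MOTIF = "ACGCGT"

def count_motif(sequence, motif=MOTIF):
    """Count occurrences (overlaps allowed) of `motif` in `sequence`."""
    sequence = sequence.upper()
    width = len(motif)
    return sum(1 for i in range(len(sequence) - width + 1)
               if sequence[i:i + width] == motif)

# Hypothetical promoter sequences, not real yeast promoters.
promoters = {
    "geneA": "TTACGCGTAAACGCGTCC",
    "geneB": "GGGATTTACCA",
    "geneC": "acgcgtTTT",
}

counts = {gene: count_motif(seq) for gene, seq in promoters.items()}
with_motif = [gene for gene, n in counts.items() if n > 0]
print(counts, with_motif)
```

Comparing how often the motif appears in the cluster versus in all promoters is where the statistics come in.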
Regulatory motif discovery from
Gene Expression data
• Identify sets of co-regulated genes from
microarrays
– Unsupervised analysis - clustering
– Supervised analysis
• Identify common motifs in regulatory regions
of co-regulated genes
– Combinatorial methods (statistics!)
– Probabilistic methods (EM, Gibbs Sampling – a
special case of MCMC)
Beyond Expression: CGH
• Comparative Genomic Hybridization (CGH)
– genomic DNA used rather than mRNA
– quick detection of aneuploidy, amplifications, deletions
(not rearrangements)
Tiling Arrays
[Figure: overlapping probes tiled along the sequence ATGGCTACGTTCATGCAT]
Measuring competitive hybridization efficiency can indicate areas with
small polymorphisms
Tiling Arrays for SNP Detection
High Throughput Sequencing
• Deep sequencing is a tiling array with very
high precision!
Summary
• Basic microarray pipeline:
– Data production and collection into DB
– Filtering, Normalization, Missing value imputation
– Analysis
• Unsupervised
– Clustering: Hierarchical, K-means, etc.
– Other: Search, SVD/PCA, etc.
• Supervised
– SVMs, differential expression
– Evaluation
• Statistical: low variability, large separation, etc.
• Functional: enriched biological processes