bioinformatics - Department of Medical Biophysics

Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Lecture 7
Microarrays I: Data Pre-Processing
MBP1010
Dr. Paul C. Boutros
Winter 2014
DEPARTMENT OF
MEDICAL BIOPHYSICS
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material
originally developed by Drs. Raphael Gottardo,
Sohrab Shah, Boris Steipe and others
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)
Lecture 7: Microarrays Part I
bioinformatics.ca
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
Topics For This Week
• Microarrays Overview
• Attendance
• Microarray Data Pre-Processing
• Next-Generation Sequencing (Redux)
Example #1
You are conducting a study of osteosarcomas using mouse models. You
are studying transgenic animals with deletion of a tumour suppressor
(TS), or with amplification of an oncogene (OG). You consider the
penetrance of tumours in a set of 8 different mouse strains.
Your hypothesis: some mouse strains are more tumour-prone than
others when OG is amplified. You measure weeks to tumour formation, or
X if no tumour forms. Animals are followed for 6 weeks. Only OG animals
are considered.
Strain 1 (weeks) Strain 2 (weeks)
4
6
5
2
5
3
Strain 3 (weeks)
X
3
4
Strain 4 (weeks)
X
X
2
Strain 5 (weeks) Strain 6 (weeks)
X
6
5
6
5
6
Strain 7 (weeks)
X
X
6
Strain 8 (weeks)
1
1
2
Example #2
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice naturally susceptible to these tumours at ~20%
penetrance. You are studying two transgenic lines, one with deletion of a
tumour suppressor (TS), the other with amplification of an oncogene
(OG). Tumour penetrance in these is 100%.
Your hypothesis: You now wonder whether tumour size differs by the sex of
the animal. You suspect that tumour size differs between lines, but that
the difference is confounded by sex. Your data:
Line       Tumour size (cm3), sex in parentheses
TS         3.9 (M), 7.1 (M), 3.1 (F), 4.4 (F), 5.0 (F)
OG         5.2 (M), 1.9 (M), 5.0 (M), 6.1 (F), 4.5 (F), 4.8 (F)
Wildtype   1.1 (M), 1.5 (M), 2.1 (M), 2.5 (M), 0.3 (F), 2.2 (F)
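A reasonable first look at Example #2 is to tabulate mean tumour size by line, and then by line and sex, before fitting any model. The sketch below uses only the values given on the slide; the variable and function names are illustrative. A fuller analysis would go on to fit a model with both line and sex terms (for instance a two-way ANOVA), which is not shown here.

```python
from statistics import mean

# Tumour sizes (cm3) from Example #2, stored as (size, sex) pairs.
tumours = {
    "TS": [(3.9, "M"), (7.1, "M"), (3.1, "F"), (4.4, "F"), (5.0, "F")],
    "OG": [(5.2, "M"), (1.9, "M"), (5.0, "M"),
           (6.1, "F"), (4.5, "F"), (4.8, "F")],
    "Wildtype": [(1.1, "M"), (1.5, "M"), (2.1, "M"), (2.5, "M"),
                 (0.3, "F"), (2.2, "F")],
}

def mean_by_line(data):
    """Mean tumour size per line, ignoring sex."""
    return {line: mean(size for size, _ in rows) for line, rows in data.items()}

def mean_by_line_and_sex(data):
    """Mean tumour size per (line, sex) stratum."""
    out = {}
    for line, rows in data.items():
        for sex in ("M", "F"):
            sizes = [size for size, s in rows if s == sex]
            if sizes:
                out[(line, sex)] = mean(sizes)
    return out

line_means = mean_by_line(tumours)
strata_means = mean_by_line_and_sex(tumours)
for key, value in sorted(strata_means.items()):
    print(key, round(value, 2))
```

Comparing the per-line means with the per-stratum means shows whether an apparent difference between lines could be driven by the sex composition of each group.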
Let’s start off with a question
What do expression
microarrays actually
measure?
What is a Microarray?
“A DNA microarray is a multiplex technology consisting
of thousands of oligonucleotide spots, each containing
picomoles of a specific DNA sequence.”
• Used to quantitate mRNA or DNA
• Many applications:
• mRNA or DNA levels
• SNP identification
• ChIP-on-Chip
Hypotheses
• Microarrays are usually hypothesis-generating:
• They highlight specific genes or features that are particularly
interesting for follow-up experiments
• There are many interesting exceptions
• Biomarkers
• Pathway analyses
• This does not reduce the importance of experimental
design
• the low statistical power of array studies makes good design
even more important and very challenging
Input Samples
The nature of the sample is critical:
* Unfrozen vs. Frozen vs. FFPE
* Total RNA vs. poly-A RNA vs. other subsets
Microarray Basics
Imagine a one-spot microarray…
Target DNA is labeled, hybridized, and washed. Finally, scan the chip.
[Figure labels: Target, Chip, Feature, Probe]
These Are Spotted Arrays
Spotted arrays are printed onto a series of glass slides by a robot
with needle-heads.
They produce a characteristic gridding pattern and
almost always use two samples
simultaneously (two-colour).
Other Types of Arrays
• Inkjet Arrays
• Photolithographically generated arrays
• Bead arrays
• Protein/cell/lipid-arrays
• More “niche” applications
• Not discussed here
InkJet Arrays
In 1999, HP spun off its life-science and
measurement division into Agilent
Technologies.
The new company wanted to determine if printer technology could
be harnessed to generate microarrays.
Inkjet Array Manufacture Involves Sequential
Nucleotide Addition
Photolithographic Arrays
• Produced using the techniques developed for manufacturing
transistors.
• Mostly pioneered by the company Affymetrix, although
other suppliers exist (e.g. Nimblegen)
• We will be working with Affymetrix data later, so we will
walk through the platform in significant detail
The Glass Matrix
Silanization
Addition of Linker molecule
Photolithographic Synthesis
Photolithographic mask
Deprotection
Nucleotide Addition
Capping Agents
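The mask, deprotection, and nucleotide-addition slides describe a cycle: a mask exposes selected features to light, those features are deprotected, and one nucleotide is coupled to them. A minimal sketch of how a mask schedule could be derived from a set of target probe sequences follows. The function names, the fixed A, C, G, T cycling order, and the toy probes are assumptions for illustration, not the manufacturer's actual procedure, and capping of failed couplings is not modelled.

```python
def synthesis_schedule(probes, cycle="ACGT"):
    """Return a list of (base, mask) steps that builds every probe.

    mask is the list of feature indices exposed to light (deprotected)
    before `base` is washed over the chip.
    """
    for probe in probes:
        if set(probe) - set(cycle):
            raise ValueError(f"unexpected base in probe {probe!r}")
    progress = [0] * len(probes)  # number of bases already added per feature
    steps = []
    while any(done < len(p) for done, p in zip(progress, probes)):
        for base in cycle:
            # Expose only the features whose next required base is `base`.
            mask = [i for i, p in enumerate(probes)
                    if progress[i] < len(p) and p[progress[i]] == base]
            if mask:  # no mask is needed when no feature wants this base
                steps.append((base, mask))
                for i in mask:
                    progress[i] += 1
    return steps

def apply_schedule(steps, n_features):
    """Replay a schedule and return the sequence grown at each feature."""
    chip = [""] * n_features
    for base, mask in steps:
        for i in mask:
            chip[i] += base
    return chip

probes = ["ACG", "CAT"]  # hypothetical 3-mer probes on a two-feature chip
steps = synthesis_schedule(probes)
for base, mask in steps:
    print(f"add {base} to features {mask}")
```

Each step needs its own mask, which is why the number of synthesis steps, rather than the number of features, drives the cost of changing a design.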
Final Chip
Wafer
Feature
Chip
RNA Wash
An Affymetrix Microarray
Self-Assembling Bead-Arrays
• Produced by Illumina
• 3 μm silicon beads, randomly placed
• Coated with ~10^5 identical 25 bp probes
• Probes have identifying barcode (address) sequences
[Figure labels: labeled cDNA, bead, address, probe]
Comparing Array Platforms
Platform       Price   Oligos     Data Quality   Bioinformatics Research
Spotted cDNA   $       variable   +              +++
Affymetrix     $$$     25 bp      +++            +++
Inkjet         $$      ~70 bp     ++             ++
Bead Arrays    $$      ~25 bp     ++             +
I do not endorse specific platforms – they all
have their strengths and weaknesses
What Are Microarrays Used For?
Molecular
DNA
• DNA sequence (SNPs)
• DNA copy-number
• DNA capture (exome, ChIP)
• Tag quantitation (genetic screening)
RNA
• mRNA abundances
• Splicing (quantitate different isoforms)
• mRNA degradation rates (half-life)
• mRNA translation rates
• RNA capture (RIP)
Other
• Protein arrays
• Cell-based arrays
• Lipid arrays
What Are Microarrays Used For?
Biological
RNA
• mRNA abundances
• Splicing (quantitate different isoforms)
• mRNA degradation rates (half-life)
• mRNA translation rates
• RNA capture (RIP)
* Candidate Gene Identification
* Pathway Analysis
* Model Characterization
* Classifiers/Predictive Models
* Drug-Analysis (Dose/Time/Class)
* Integration Analysis
Attendance Break
Each Spot is a Probe
Input: spot table (Spot, Cy3, Cy5)
A) Remove Noise: Quantitation → Background → Spot Quality → Intra-Array → Inter-array
B) Extract Data: Significance Testing (Spot List) → Clustering → Integration → ?
Step #1: Image Quantitation
Why? Quantitative vs. Qualitative
How? Image Segmentation
Difficulty? +++
Research? +
Image Segmentation 101: Find Grids
1. Find Grids
2. Find Spots
3. Spot Outline
Image Segmentation 101: Find Spots
Key Step:
Integrate Signal
Across Array
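One common reading of "integrate signal across array" is to sum pixel intensities along each row and each column, then look for bands where the summed profile is high. The sketch below assumes that reading; the toy image, the midpoint threshold, and the function names are all illustrative choices rather than the method of any particular quantitation package.

```python
def axis_profiles(image):
    """Sum pixel intensities along each row and each column."""
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile

def find_bands(profile):
    """Return inclusive (start, end) index ranges where the profile
    exceeds the midpoint between its minimum and maximum."""
    threshold = (min(profile) + max(profile)) / 2
    bands, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            bands.append((start, i - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands

# Toy 7x7 image: background intensity 1, four 2x2 spots of intensity 9.
blank = [1] * 7
spots = [1, 9, 9, 1, 9, 9, 1]
image = [blank, spots, spots, blank, spots, spots, blank]

rows, cols = axis_profiles(image)
print("spot rows:", find_bands(rows))
print("spot columns:", find_bands(cols))
```

Crossing the row bands with the column bands gives candidate spot locations. The problems listed on the next slide (stray signal, missing spots, gross deformities) are exactly the cases where such a simple profile breaks down.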
Image Segmentation 101: Challenges
Problems:
Stray Signal
Missing Spots
Gross Deformities
Manual Validation
Research?
• Surprisingly, not much investigation
• This is probably a source of error in all studies
• Manual checking of spot-detection remains the norm
• Problematic as studies & arrays get larger
Step #2: Background Correction
Why? Remove Stray Signal
How? Model-based
Difficulty? ++++
Research? ++
Spot Segmentation
Signal
???
Background
So what do we get?
Background Intensity: BG
Foreground Intensity: FG
Isn’t it simple? Signal = FG - BG
NO! If BG > FG, the signal is negative. This happens for 0.1-2% of spots.
Why Might This happen?
In 2001 two papers showed that empty spots
have less signal than background
Unbound spots correspond to
low-expression genes
Background Intensity: BG
Foreground Intensity: FG
Thus unbound spots are
particularly prone to problems
So What to Do?
• Heavy-duty mathematical tools employed
• Three major models developed:
• Edwards: log-linear
• Smyth: normexp
• Kooperberg: Bayesian
The math is extremely advanced, so
we’ll skip that for now. Let’s summarize
the methods instead.
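To make the contrast concrete, here is a small sketch comparing naive subtraction with a normexp-style correction. It assumes the usual normal-plus-exponential convolution model (observed = exponential signal + normal background noise) and returns the expected signal given the observation. The parameter values are hand-picked for illustration; real implementations estimate them from the data, and this simplified version is not numerically robust for extremely negative inputs.

```python
import math

def naive_signal(fg, bg):
    """Plain subtraction; can be negative when BG > FG."""
    return fg - bg

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normexp_signal(x, mu, sigma, alpha):
    """E[S | X = x] for X = S + B, with S ~ Exponential(mean alpha)
    and B ~ Normal(mu, sigma^2). The result is always positive."""
    mu_sx = x - mu - sigma ** 2 / alpha
    z = mu_sx / sigma
    return mu_sx + sigma * normal_pdf(z) / normal_cdf(z)

# Hypothetical (FG, BG) pairs; the first spot has BG > FG.
spots = [(120.0, 150.0), (400.0, 150.0), (5000.0, 150.0)]
MU, SIGMA, ALPHA = 0.0, 50.0, 500.0  # assumed, not estimated from data

for fg, bg in spots:
    x = naive_signal(fg, bg)
    print(fg, bg, x, round(normexp_signal(x, MU, SIGMA, ALPHA), 1))
```

For bright spots the two approaches agree closely; they differ mainly for dim spots, where the model-based value stays positive and can safely be log-transformed.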
Comparison
Method       Speed       Accuracy
Edwards      Fast        Good
NormExp      Slow        Better
Kooperberg   Very Slow   Best
No strong criteria for selecting
between these algorithms.
Step 3: Spot Quality
Why? Identify artefacts
How? Unknown
Difficulty? +++++
Research? +
Spot-Weighting
A “perfect” spot is used normally in analysis
Weight = 1
A “poor” spot is given less consideration
0 < Weight < 1
Problem:
How the heck do we calculate weights?
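Computing the weights is the open problem, but using them is straightforward. As a hedged illustration, the sketch below combines replicate spots with a weighted mean; the replicate values and weights are invented, and the function name is hypothetical.

```python
def weighted_mean(values, weights):
    """Combine replicate spot values, giving poor spots less influence."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("weights must lie between 0 and 1")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total

# Three replicate log-intensities; the third spot is judged "poor".
values = [8.0, 8.2, 12.0]
weights = [1.0, 1.0, 0.1]

print(round(sum(values) / len(values), 2))        # unweighted mean
print(round(weighted_mean(values, weights), 2))   # down-weights the poor spot
```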
A Few Approaches
• Mean-Median Correlation
• Composite q-metrics → improve homotypic signal:noise
But both fail sometimes, seemingly randomly.
Do we really need this?
All from one
“good-quality”
array!
But I use Affymetrix!
(Or Agilent)
(Or Nimblegen)
(Or Other Commercial Supplier)
Okay, Let’s See Some Affy Data
Those Three Were From A Spike-In
Experiment Done by Affymetrix
Themselves!
Spot Quality is An Issue, Regardless of
Platform
Manual Flagging?
Two studies show error rates of 5-20%
Spot-Quality is a huge, unsolved problem.
Most investigators ignore it.
Most bioinformaticians struggle with it.
Then we ignore it too.
Step 4: Intra-Array Normalization
Why? Balance channels; remove spatial artifacts
How? Multiple robust algorithms
Difficulty? ++
Research? +++++
Within-Array Normalization
1. Spatial gradients
2. Channel-balancing
3. Intensity bias
Are red and green equal in our starting sample?
[Scatter plot of the two channel intensities; both axes run from 0 to 70000]
We Can Handle This!
Spatial Effects: Gaussian Spatial Smoothing
Intensity Effects: Loess Smoothing
Combination Effects: Robust Splines
All methods well-established
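As a rough illustration of intensity-dependent channel balancing, the sketch below converts two-channel intensities to M (log-ratio) and A (mean log-intensity) values and then centres M within intensity bins. Note the substitution: a binned median is used here as a crude stand-in for loess smoothing so that the example stays dependency-free. The function names and the number of bins are illustrative assumptions.

```python
import math
from statistics import median

def ma_values(red, green):
    """M = log2(R/G), A = mean of log2(R) and log2(G), per spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def binned_median_normalize(m, a, n_bins=2):
    """Subtract the median M within each bin of spots ordered by A.

    A crude stand-in for loess: the bias estimate is piecewise constant
    rather than a smooth curve.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    order = sorted(range(len(a)), key=lambda i: a[i])
    size = max(1, math.ceil(len(order) / n_bins))
    corrected = list(m)
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        centre = median(m[i] for i in idx)
        for i in idx:
            corrected[i] = m[i] - centre
    return corrected

# Toy array in which red is uniformly twice as bright as green.
red = [200.0, 400.0, 2000.0, 4000.0]
green = [100.0, 200.0, 1000.0, 2000.0]
m, a = ma_values(red, green)
print(binned_median_normalize(m, a))
```

After correction the log-ratios are centred on zero at every intensity level, which is the assumption behind asking whether red and green are equal in the starting sample.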
Step 5: Inter-Array Normalization
Why? Balance arrays
How? Multiple robust algorithms
Difficulty? +
Research? +++++
Balancing Arrays
Problem:
Pipette error can lead to differential loading of sample
between arrays
Solution:
Scale arrays
Extremely easy to handle
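A minimal sketch of array scaling, assuming the simplest approach: multiply every intensity on an array by a constant so that all arrays share the same median. The choice of the median (rather than, say, the mean) and the target value are assumptions made for illustration.

```python
from statistics import median

def scale_to_common_median(arrays):
    """Rescale each array so that all arrays have the same median."""
    medians = [median(array) for array in arrays]
    if any(m <= 0 for m in medians):
        raise ValueError("array medians must be positive")
    target = sum(medians) / len(medians)  # common median after scaling
    return [[value * target / m for value in array]
            for array, m in zip(arrays, medians)]

# Hypothetical case: array 2 was loaded with twice as much sample.
arrays = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
scaled = scale_to_common_median(arrays)
print(scaled)
```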
Scaling Has a Major Effect
[Density plots of p(I) against Intensity, before and after scaling]
Significance Testing
Why? Find spots that change
How? Statistical tests
Difficulty? +++
Research? +++++
Significance Testing Questions
1. Are these two groups different?
2. Do these two things synergize?
3. Does treatment affect patient outcome?
4. Can we predict clinical features?
In our practical session we will focus on #1
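For question #1, one possible per-gene test statistic is Welch's t, which compares two group means without assuming equal variances. The slides do not name a specific test, so this choice is an assumption; the expression values below are invented, and turning the statistic into a p-value (and correcting for multiple testing) is left out of the sketch.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (n >= 2 each)."""
    var_a, var_b = variance(group_a), variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    if se == 0:
        raise ValueError("both groups have zero variance")
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical log2 expression of one gene in two groups of three arrays.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
print(round(welch_t(control, treated), 3))
```

On a real array this statistic would be computed once per probe or gene, which is what makes the multiple-testing problem of the next lecture unavoidable.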
Clustering
Why? Finding patterns in the data
How? Unsupervised machine-learning
Difficulty? +
Research? +++++
What is Clustering?
• Clustering: finding patterns in data
• Each “pattern” is a cluster
• A (small) branch of “machine learning”
• A (very) overused part of bioinformatics
Anatomy of a Clustergram
Dendrogram
Heatmap
How is Clustering Done (Simple)?
[Scatter plot of genes by response to Stimulus #1 and Stimulus #2, annotated with: Cluster, Intra-cluster distance, Inter-cluster distance, Outliers]
Clustering aims to MINIMIZE intra-cluster and MAXIMIZE inter-cluster distance
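The two quantities in that objective can be computed directly for any proposed grouping. The sketch below assumes Euclidean distance and uses the mean pairwise distance as the summary; both are illustrative choices, and the toy gene coordinates (response to Stimulus #1, response to Stimulus #2) are invented.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(clusters):
    """Mean distance between pairs of points in the same cluster."""
    dists = [math.dist(p, q)
             for points in clusters.values()
             for p, q in combinations(points, 2)]
    return sum(dists) / len(dists)

def mean_inter_cluster_distance(clusters):
    """Mean distance between pairs of points in different clusters."""
    dists = [math.dist(p, q)
             for (_, a), (_, b) in combinations(clusters.items(), 2)
             for p in a for q in b]
    return sum(dists) / len(dists)

# Toy genes plotted by (Stimulus #1 response, Stimulus #2 response).
good = {"low": [(0, 0), (0, 1), (1, 0)], "high": [(5, 5), (5, 6), (6, 5)]}
bad = {"x": [(0, 0), (5, 5), (1, 0)], "y": [(0, 1), (5, 6), (6, 5)]}

for name, clusters in (("good", good), ("bad", bad)):
    print(name,
          round(mean_intra_cluster_distance(clusters), 2),
          round(mean_inter_cluster_distance(clusters), 2))
```

The "good" assignment has a small intra-cluster and a large inter-cluster distance; the "bad" assignment of the same points does not, which is the criterion clustering algorithms try to optimize.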
Why is Clustering Used?
I. Data visualization
II. To predict class assignment
III. To identify co-regulation
IV. Quality Control
Example: Predicting Gene Function
• Most genes have NO functional annotation
• 1,500 / 7,000 yeast genes
• 12,000 / 20,000 human genes
• Can we automatically estimate their function based on
their patterns of expression?
Solution: Clustering of Expression Profiles
[Figure: clustered expression profiles across tissues (Hughes et al., Cell 2000)]
Abuses of Clustering?
• Clustering pre-selected data
• Clustering after significance analysis is only for visualization
• Detecting differential expression
• Clustering cannot replace significance-testing
• No assessment of chance
• How likely is a given pattern to be observed by chance alone?
Statistics exist to test this!
Summary Point #1:
Microarray data is analyzed with a pipeline
of sequential algorithms.
This pipeline defines the standard
workflow for microarray experiments.
Summary Point #2:
This is an active research area.
Summary Point #3:
These basic steps hold true for all
microarray platforms and types.
What Is BioConductor?
“Bioconductor is an open source, open development software
project to provide tools for the analysis and comprehension of
high-throughput genomic data.”
- BioConductor website
The vast majority of our analyses will use BioConductor
code, but there are clearly non-BioConductor approaches.
I’ve outlined the general workflow.
Each technology and application has its
own unique characteristics to consider.
Let’s Define an Affymetrix-Specific Workflow
Input: one-channel array
Quantitation: done according to Affymetrix defaults with minimal user intervention
Background
Spot Quality: typically ignored
Intra-Array and Inter-array: single-channel array, so one simultaneous normalization procedure
Significance Testing
Spot List
Clustering
Integration
?
Let’s Collapse This a Bit And
Re-Phrase Things
[Workflow diagram: .CEL Files, Background, Normalization, ProbeSet Annotation, Statistics, Spot List, Clustering, Integration, ?]
Arrays Can Become Outdated
• Gene definitions change
• The reference genome sequence gets finished
• Novel splice variants are found
• Errors are made in the initial design and remain present
in all arrays made
The Mask Production Makes Affymetrix Designs
Expensive To Change
Photolithographic mask
But… there are multiple probes per gene
We Can Change Those Mappings!
Hybridized
Chip
Let’s go Back to Pre-Processing
What exactly is pre-processing (aka
normalization)?
Why do we do it?
Sources of Technical Noise
Where does technical noise come from?
More Sources of Technical Noise
Any step in the experimental pipeline can
introduce artifactual noise
• Array design
• Array manufacturing
• Sample quality
• Sample identity → sequence effects?
• Sample processing
• Hybridization conditions → ozone?
• Scanner settings
Pre-Processing tries to remove these systematic effects
Important Note
Pre-processing is never a substitute for
good experimental design. This is not a
course on statistical design, but a few
basic principles should be mentioned.
Biological replicates are
preferable to technical
replicates.
Always try to balance
experimental groups.
If processing samples identically is not possible, include
controls for processing-effects.
Introducing Two Major Affymetrix Pre-Processing Methods
• The two most commonly used methods are:
• RMA = Robust Multi-array Average
• MAS5 = Microarray Suite version 5
• MAS5 has strengths & weaknesses
• Sacrifices precision for accuracy
• Can easily be used in clinical settings
• RMA has strengths & weaknesses
• Sacrifices accuracy for precision
• Challenging to integrate multiple studies
• Reduces variance (critical for small-n studies)
• Both are well accepted by journals and reviewers, perhaps RMA a bit more so.
We’ll talk about some of the mathematics later on in this course.
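As a small preview, one reason RMA reduces variance is that its normalization step forces every array onto the same intensity distribution (quantile normalization). This detail is not stated on the slide and is offered here as background. The sketch below is a simplified version with invented data; ties are broken by position rather than averaged, and the other RMA steps (background correction and probe-set summarization) are omitted.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted
    values across arrays, assigned back according to each value's rank."""
    n = len(arrays[0])
    if any(len(array) != n for array in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(array) for array in arrays]
    # Mean intensity at each rank, taken across all arrays.
    rank_means = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]
    result = []
    for array in arrays:
        order = sorted(range(n), key=lambda i: array[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = rank_means[rank]
        result.append(normalized)
    return result

# Two hypothetical arrays measuring the same three probes.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

Because the reference distribution depends on which arrays are processed together, values change when arrays are added, which is one way to see why integrating multiple studies is challenging.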