Transcript Microarrays

Pabio590B – week 1
Microarrays
 Overview
 Design & hybridization
 Data analysis
Overview
 Affix/synthesize probes of
known sequence to chip
 Hybridize with labeled sample
 Quantify level of hybridization
to each probe
 Normalization
 Statistics
 Clustering & more
Experiments you might do
Measure RNA expression
Changes in gene expression over time / lifecycle
Compare differences between tissues/cell types
Comparisons between species/strains/conditions
Whole genome transcript mapping (tiling arrays)
Measure DNA content
Presence or absence of region
Copy number via Comparative Genomic Hybridization
SNP Genotyping/Re-sequencing
Other
ChIP on chip arrays
RIP on chip
Microarray Design
RNA Expression Chip Designs
Expression Array:
- N probes per gene of interest
- Trade-off between accuracy and number of features
Tiling array:
- Place probe of X nt every Y bases
- Biased vs unbiased
[Figure: tiling layout with labels 20 nt, 50 nt, 50 nt and a 70 nt tiling window]
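The "probe of X nt every Y bases" rule can be sketched in a few lines. This is an illustration only: the function name, the toy sequence, and the probe length and step are hypothetical and not taken from any real array design.

```python
def tiling_probes(sequence, probe_len, step):
    """Return (start, probe) pairs: one probe of probe_len nt every step bases."""
    probes = []
    for start in range(0, len(sequence) - probe_len + 1, step):
        probes.append((start, sequence[start:start + probe_len]))
    return probes


# Hypothetical 12 nt region tiled with 4 nt probes every 3 bases (overlapping tiles).
region = "ACGTACGTACGT"
for start, probe in tiling_probes(region, probe_len=4, step=3):
    print(start, probe)
```

A step smaller than the probe length gives overlapping probes; a larger step leaves gaps between probes.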
Probe considerations
 Number of probes per region of interest
 Specificity of probes
 Distance between probes (tiling)
 Mismatch probes (Affymetrix)
Hybridization
Two-color vs One-color
 Two-color
• Two samples on each slide
• Cy3 - green - 532 nm
• Cy5 - red - 635 nm
 One-color
• One sample per slide
• Cy3
 No significant difference in accuracy or
reproducibility
Designs for Two-color Array
Each column below is one array (slide).
Experiment Replicates
  Cy3:  WT   WT   WT
  Cy5:  Mu   Mu   Mu
Dye Swaps
  Cy3:  WT   Mu   WT   Mu
  Cy5:  Mu   WT   Mu   WT
Biological Replicates
  Cy3:  WT1  WT2  WT3
  Cy5:  Mu1  Mu2  Mu3
Common Reference
  Cy3:  ref  ref  ref  ref  ref  ref
  Cy5:  A    B    C    D    E    F
Round Robin
  Cy3:  A    B    C    D    E    F
  Cy5:  B    C    D    E    F    A
Data Normalization
Within-Array Normalization
Lowess Normalization
[Figure: Cy3/Cy5 vs. signal intensity, before and after Lowess normalization]
Between-Array Normalization
 RNA Spike-in
 Random Probes
 Median Scaling
 Quantile Scaling
Median and quantile normalization are predicated upon
the arrays in question having the same distribution. That
is to say, you can use those methods only if you can
safely assume that the bulk of genes have the same
expression across the arrays.
Quantile Normalization
[Figure: arrays before and after quantile normalization]
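A minimal sketch of median scaling and quantile normalization, assuming each array is a plain list of raw (not log-transformed) probe intensities in the same probe order. The function names and toy values are hypothetical; real packages handle ties, missing values and log-scale data more carefully.

```python
from statistics import median


def median_scale(arrays):
    """Scale each array so its median equals the mean of all array medians."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Rank each probe within its own array (ties broken by position).
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Three hypothetical arrays, four probes each.
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
scaled = median_scale(arrays)
qn = quantile_normalize(arrays)
```

After quantile normalization every array contains exactly the same set of values; only the assignment of values to probes differs, which is why the "same distribution" assumption matters.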
Statistical Analysis
Some Advice About Statistics
 Don’t get too hung up on p-values [or any other stat].
 Ultimately what matters is biological relevance and
external knowledge and other heterogeneous
measures (related functions, pathways, other data
types) that are not easily measured by statistics alone.
 P-values should help you evaluate the strength of the
evidence, rather than being used as an absolute
yardstick of significance.
 Statistical significance is not necessarily the same as
biological relevance and vice-versa.
John Quackenbush
Is this gene differentially expressed
between the two conditions?
[Figure: probe signal for Sample A vs. Sample B]
To rephrase the question
Is the mean probe value different between Samples A & B?
• Null Hypothesis = H0 = means are the same
• Alternate Hypothesis = Ha = means are different
What affects our ability to test
the hypothesis?
 Difference in means
 Number of sample points
 Standard deviations of sample
The T-statistic
 Directly proportional to difference in means
 Inversely proportional to standard deviation
 Increases with sample size (proportional to its square root)
The T-test calculates how likely a T-statistic at least this extreme is,
given the null hypothesis that the means are
actually the same.
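As a sketch of how the three factors combine, here is one common form of the statistic (Welch's version, which allows unequal variances). The choice of Welch's form and the toy probe values are assumptions; the slides do not say which variant is meant.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's t-statistic for two groups of probe values."""
    # Standard error shrinks as sample sizes grow and as variances shrink.
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical probe signals for one gene in two samples.
sample_a = [1.0, 2.0, 3.0]
sample_b = [4.0, 5.0, 6.0]
print(t_statistic(sample_a, sample_b))
```

The numerator is the difference in means; the denominator grows with the standard deviations and shrinks with the square root of the sample size.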
T-statistic and P-values
 P-values can be determined from theoretical
distributions or permutation testing
• Theoretical distributions rely on a set of assumptions that
array experiments do not necessarily follow
• Permutation tests rely on fewer assumptions (mainly that samples are interchangeable under the null hypothesis)
Permutation Testing
[Figure: probe signal for Gene A and Gene B in Group 1 and Group 2, shown for the original data, Permutation 1 and Permutation 2]
1) Permute n times by random shuffling
2) Calculate T-statistic for each permutation
3) Estimate the p-value as the fraction of permuted T-statistics at least as extreme as the original
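The three steps can be sketched as follows. This is an illustration under assumptions: a two-sided test, Welch's form of the T-statistic, random shuffling with a fixed seed, and hypothetical probe values. The "+1" in the final ratio is a common convention that avoids reporting a p-value of exactly zero.

```python
import random
from math import copysign, inf, sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's t-statistic; returns +/-inf if both groups have zero variance."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    difference = mean(a) - mean(b)
    if standard_error == 0:
        return 0.0 if difference == 0 else copysign(inf, difference)
    return difference / standard_error


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided p-value from random reshuffling of group labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                                # 1) permute
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(t_statistic(perm_a, perm_b)) >= observed:   # 2) T-statistic
            hits += 1
    return (hits + 1) / (n_permutations + 1)               # 3) probability


# Hypothetical probe signals for one gene in two groups.
group_1 = [10.1, 9.8, 10.4, 10.0, 9.9]
group_2 = [12.0, 11.7, 12.3, 11.9, 12.2]
p = permutation_p_value(group_1, group_2)
print(p)
```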
Interpreting P-values
 T-test tests the null hypothesis that sample
means are equal
 Gene X has p-value of 5% from T-test
 If Gene X is NOT differentially expressed, there is a 5% chance of seeing a difference at least this large
 This is not the same as a 95% chance that Gene X is differentially expressed
 α = False Positive Rate = 5%
T-Test Refinements
 Equal vs unequal variance of samples
 Equal vs unequal sample size
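 Dependent vs independent samples
CAVEAT:
As sample sizes get smaller, the validity of p-values
calculated via permutation diminishes, because few distinct
permutations exist. For example, with 3 vs 3 samples there are
only 20 distinct groupings, so the smallest possible two-sided
p-value is 0.1.
Microarrays typically have few probes per gene, so sample
size is small.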
Multiple Testing Problem
 If there is a 5% chance of false positives in
one experiment, what happens when we
are testing 10,000 genes?
• The majority of those genes are not
differentially expressed, but
• a 5% p-value cutoff means we expect about 500 false positives.
Family-Wise Error Rate (FWER)
FWER is the probability of making one or more false
discoveries (type I errors) among all the hypotheses
when performing multiple pair-wise tests.
 One comparison: FWER = α
 10,000 comparisons: FWER ~ 1.0
That means that when making 10,000
comparisons at α = 0.05 you are almost sure to make at
least one error.
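A short sketch of where the "~ 1.0" comes from, assuming the tests are independent (an assumption; genes on an array are often correlated). The function name is hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    # Chance that no test gives a false positive is (1 - alpha) ** n_tests.
    return 1 - (1 - alpha) ** n_tests


for n in (1, 10, 100, 10000):
    print(n, family_wise_error_rate(0.05, n))
```

The rate climbs quickly: it is already above 0.4 at 10 tests and indistinguishable from 1 at 10,000.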
Bonferroni Correction
What if you want to keep the FWER at 5%?
• 0.05 / 10,000 = 0.000005 = 5e-6
• Only those genes with T-test p-value of < 5e-6 are
called differentially expressed
• Leads to experiment-wide α of 0.05
The Standard Bonferroni correction is
considered very conservative
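The correction amounts to dividing the cutoff by the number of tests. A minimal sketch with hypothetical p-values and a hypothetical function name:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Four hypothetical genes: the cutoff becomes 0.05 / 4 = 0.0125.
p_values = [0.001, 0.015, 0.04, 0.3]
print(bonferroni_significant(p_values))
```

Only the first gene passes here, even though three of the four would pass an uncorrected 0.05 cutoff.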
Adjusted Bonferroni
 Rank all genes by ascending order of p-value
 Compare the smallest p-value against a
threshold of α / N (0.05/10,000)
 Compare the second smallest p-value against a
threshold of α / (N-1)
 Etc…
The Adjusted Bonferroni correction is
less conservative
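The ranked procedure can be sketched as below. It follows what is usually called the Holm step-down method; treating "Adjusted Bonferroni" as Holm's method, and stopping at the first gene that fails, are assumptions. The p-values are hypothetical.

```python
def adjusted_bonferroni_significant(p_values, alpha=0.05):
    """Step-down test: the i-th smallest p-value is compared to alpha / (N - i)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] < alpha / (n - rank):
            significant[i] = True
        else:
            break  # once one gene fails, all larger p-values fail too
    return significant


# Thresholds for four genes: 0.0125, 0.0167, 0.025, 0.05.
p_values = [0.001, 0.015, 0.04, 0.3]
print(adjusted_bonferroni_significant(p_values))
```

With these values the second gene (p = 0.015) is accepted, whereas a standard Bonferroni cutoff of 0.0125 would reject it.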
False Discovery Rate
 Measures the expected proportion of false positives
amongst “discovered” genes
 Factors affecting FDR:
• Proportion of actual differentially expressed genes
• Distribution of the true differences
• Measurement variability
• Sample size
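The slides do not name a procedure for controlling FDR. One widely used choice is the Benjamini-Hochberg step-up method, sketched here as an assumption rather than as the method intended above. The function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag genes so that the expected false discovery rate is at most fdr."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value is <= (k / n) * fdr.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / n * fdr:
            cutoff_rank = rank
    significant = [False] * n
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant


# Thresholds for four genes: 0.0125, 0.025, 0.0375, 0.05.
p_values = [0.001, 0.03, 0.035, 0.3]
print(benjamini_hochberg(p_values))
```

Note the step-up behaviour: the second gene (0.03) misses its own threshold of 0.025 but is still called, because the third gene (0.035) passes its threshold of 0.0375.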
Analysis of Variance (ANOVA)
 Microarray testing across ≥ 3 conditions
 Is a gene expressed equally across all
conditions?
 F-ratio for given gene X:
(variability across conditions) / (variability within conditions)
 Calculate p-value
• Look up probability of F-ratio
• Determine probability by permutation testing
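A minimal sketch of the one-way F-ratio for a single gene, with between-condition variability in the numerator and within-condition variability in the denominator. The function name and probe values are hypothetical.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F-ratio: between-condition over within-condition variability."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variability of condition means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of measurements around their own condition mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical probe values for one gene under three conditions.
conditions = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(f_ratio(conditions))
```

A large F-ratio means the condition means differ by more than the noise within conditions would explain.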
Significance Analysis of
Microarrays (SAM)
 Gene-specific T-tests
 Computes statistic (dj) for each gene j
• measures the relationship between gene expression and a
response variable
• describes and groups the data based on experimental conditions
• uses non-parametric statistics
• repeated permutations are used to determine FDR
 Accounts for correlations in genes and avoids
parametric assumptions about the (normal vs
non-normal) distribution of individual genes
Clustering
Why do clustering?
 Identify groups of possibly co-regulated genes
(e.g. so you can look for common sequence motifs)
 Identify typical temporal or spatial gene
expression patterns (e.g. cell-cycle data)
 Arrange a set of genes in a linear order that is
at least not totally meaningless
Can also cluster experiments
 Quality control
• detect bad/outlying experiments
 Identify or categorize classes of
biological samples
• sorting by tumor sub-type
How do you cluster?
 Define a distance measure
 Group genes (or experiments) based on
that measure
Objects are placed into
groups. Objects within a
group are more similar to
each other than objects
across groups.
In some cases groups are
hierarchically organized
based on the intra-group
similarity
Distance Metrics
Pair    Correlation   Euclidean distance
X,Y      1            4
X,Z     -1            2.83
X,W      1            1.41
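A small sketch of the two measures. The expression profiles below are hypothetical and are not the X, Y, Z, W profiles from the figure; they are chosen so that one pair has the same shape at a different level and another has the opposite shape.

```python
from math import sqrt
from statistics import mean


def euclidean_distance(x, y):
    """Straight-line distance: sensitive to magnitude and direction."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation: sensitive to the shape of the profile only."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return covariance / spread


# Hypothetical profiles over three conditions.
gene_x = [1.0, 2.0, 3.0]
gene_y = [3.0, 4.0, 5.0]  # same shape as gene_x, higher level
gene_z = [3.0, 2.0, 1.0]  # opposite shape

print(pearson_correlation(gene_x, gene_y), euclidean_distance(gene_x, gene_y))
print(pearson_correlation(gene_x, gene_z), euclidean_distance(gene_x, gene_z))
```

Here the perfectly correlated pair is farther apart in Euclidean terms than the anti-correlated pair, which is the point of comparing the two metrics.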
Clustering considerations
Correlation clustering
• Direction only
• ≥ 3 conditions
Euclidean clustering
• Magnitude & direction
• ≥ 2 conditions
Array data is noisy, so you probably
need multiple data points per condition
Clustering methods
• Hierarchical
• Partitional
• Other
Hierarchical clustering
Agglomerative, bottom-up method
 Initial state
- each item is a cluster
 Iterate
- join two most similar clusters
 Stop
- when number of clusters
reaches user-defined value
Linkage methods
Ways to determine cluster similarity
Single Link:
Similarity of two
most similar
members
Complete Link:
Similarity of two
least similar
members
Average Link:
Average similarity
of all members
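A minimal sketch of bottom-up clustering with the three linkage rules, assuming Euclidean distance (so "most similar" means smallest distance) and a user-defined number of clusters as the stopping rule. The function name and points are hypothetical, and the brute-force search is only suitable for small inputs.

```python
from math import dist  # Python 3.8+


def agglomerative(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each item starts as a cluster

    def cluster_distance(c1, c2):
        distances = [dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":    # two most similar members
            return min(distances)
        if linkage == "complete":  # two least similar members
            return max(distances)
        return sum(distances) / len(distances)  # average over all member pairs

    while len(clusters) > n_clusters:
        pairs = [(cluster_distance(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)  # closest pair of clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical 2-D expression profiles; indices 0-1 and 2-3 sit close together.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerative(points, n_clusters=3, linkage="single"))
```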
Comparing linkage methods
[Figure: results of single, complete and average linkage compared]
Partitional (K-means) clustering
Partitional, iterative method
1) Partition data into K random clusters
2) Calculate centroid of each cluster
3) Reassign each point to the nearest centroid
4) GOTO step 2 until assignments stop changing
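The loop can be sketched as follows: start from a random partition, compute centroids, reassign points, and repeat until nothing changes. Euclidean distance, the fixed seed, the iteration cap and the re-seeding of empty clusters are all assumptions; the function name and points are hypothetical.

```python
import random
from math import dist  # Python 3.8+


def k_means(points, k, max_iterations=100, seed=0):
    """Return (assignment, centroids) for points given as coordinate tuples."""
    rng = random.Random(seed)
    # Random initial partition into k clusters.
    assignment = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_iterations):
        # Centroid of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if not members:
                members = [rng.choice(points)]  # assumption: re-seed an empty cluster
            centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        # Assign each point to the nearest centroid.
        new_assignment = [min(range(k), key=lambda c: dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
    return assignment, centroids


# Hypothetical 2-D profiles forming two loose groups.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (9.0, 9.5), (9.7, 9.1), (9.2, 10.0)]
assignment, centroids = k_means(points, k=2)
print(assignment)
```

Because the start is random, different seeds can give different cluster labels, and on harder data different clusters.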
Other methods
 Support Vector Machines (SVM)
 K-nearest Neighbor (KNN)
 Self Organizing Maps (SOM)
 Self Organizing Tree Algorithm (SOTA)
 Cluster Affinity Search Technique (CAST)
 QT Cluster (QTC)
 Discriminant Analysis Classifier (DAM)
 Principal Component Analysis (PCA)
 Etc.
Warnings and Limitations
 Clusters are like statistics
Ideally they mirror reality, but they should only be taken
seriously in conjunction with confirmatory data from other
sources.
 Clustering software clusters things
If you tell it to find 4 clusters, it will find 4 clusters in
anything!
 Garbage In, Garbage Out
Clustering typically relies on a set of input parameters that
can be hard to evaluate except by empirically examining
the outputs they produce.
Cluster Interpretation - EASE
(Expression Analysis Systematic Explorer)
Population Size: 40 genes
Cluster size: 12 genes
10 genes, shown in green, have a common
biological theme and 8 occur within the cluster
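To gauge how surprising "8 of 10 themed genes inside a 12-gene cluster" is, one can compute a hypergeometric tail probability, as sketched below. This plain hypergeometric test is an assumption for illustration; EASE reports its own, more conservative score, which is not reproduced here. The function name is hypothetical.

```python
from math import comb


def enrichment_p_value(population, category_total, cluster_size, category_in_cluster):
    """Chance of seeing at least this many category genes in a random cluster."""
    max_hits = min(category_total, cluster_size)
    favourable = sum(
        comb(category_total, k) * comb(population - category_total, cluster_size - k)
        for k in range(category_in_cluster, max_hits + 1)
    )
    return favourable / comb(population, cluster_size)


# Numbers from the example: 40 genes, 10 with the theme, 12 in the cluster, 8 overlap.
p = enrichment_p_value(population=40, category_total=10,
                       cluster_size=12, category_in_cluster=8)
print(p)
```

By chance one would expect about 3 themed genes in a 12-gene cluster (12 x 10 / 40), so 8 is a strong over-representation.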
Microarray Analysis Software
TIGR MEV
Limma
SAM
EDGE
• These software packages are free and open-source
• Each has different strengths/weaknesses and makes
different assumptions about your data
$$ Analysis Platforms
GeneSifter
Rosetta Resolver
BioDiscovery
Microarray Data Sources
 Gene Expression Omnibus (NCBI)
 ArrayExpress (EBI)
 Stanford Microarray Database
 Yale Microarray Database
Microarray Data Standards
 Microarray Gene Expression Data Society
(MGED)
• MIAME
• MAGE-OM
• MAGE-ML
 RNA Abundance Database (RAD)
• Integrating data from various types of
expression experiments