W08: Microarray Analysis File
Microarray analysis
Curtis Huttenhower
Slides courtesy of:
Amy Caudy (Princeton)
Gavin Sherlock (Stanford)
Matt Hibbs (Jackson Labs)
Florian Markowetz (Cancer Research UK)
Olga Troyanskaya (Princeton)
Harvard School of Public Health
Department of Biostatistics
03-23-11
Visualizing Data
[Figure: heatmap and line plot of log-ratio expression for yeast genes (MAK16/YAL025C, ACH1/YBL015W, HSP26/YBR072W, HSP30/YCR021C, NHP2/YDL208W, and others) across a growth time course from OD 0.26 to OD 7.30]
Visualizing Data (cont.)
Expression During Sporulation
[Figure: log ratio vs. time (hours, 0–10) for ~50 gene expression series during sporulation]
Supervised analysis
= learning from examples, classification
– We have already seen groups of healthy and
sick people. Now let’s diagnose the next person
walking into the hospital.
– We know that these genes have function X (and
these others don’t). Let’s find more genes with
function X.
– We know many gene-pairs that are functionally
related (and many more that are not). Let’s
extend the number of known related gene pairs.
Known structure in the data needs to be
generalized to new data.
Un-supervised analysis
= clustering
– Are there groups of genes that behave similarly
in all conditions?
– Disease X is very heterogeneous. Can we
identify more specific sub-classes for more
targeted treatment?
No structure is known. We first need to find
it. Exploratory analysis.
Supervised analysis
Calvin, I still don’t know the
difference between cats
and dogs …
Oh, now I get it!!
Class 1: cats
Don’t worry!
I’ll show you once
more:
Class 2: dogs
Un-supervised analysis
Calvin, I still don’t know the
difference between cats
and dogs …
I don’t know it either.
Let’s try to figure it out
together …
Unsupervised analysis: clustering
What is clustering?
• Reordering of gene (or experiment)
expression vectors in the dataset so that
similar patterns are next to each other (or in
separate groups)
• Identify subsets of genes (or experiments)
that are related by some measure
Why cluster?
[Figure: genes × conditions expression matrix]
• Dimensionality reduction: datasets are too large to be able to get information out without reorganizing the data
• “Guilt by association” => if unknown gene i is similar in expression to known gene j, maybe they are involved in the same/related pathway
Clustering Techniques
• Algorithm (Method)
– Hierarchical
– K-means
– Self Organizing Maps
– QT-Clustering
– NNN
– …
• Distance Metric
– Euclidean (L2)
– Pearson Correlation
– Spearman Correlation
– Manhattan (L1)
– Kendall’s τ
– …
Distance Metrics
• Choice of distance measure is important for most clustering
techniques
• Pair-wise metrics – compare vectors of numbers
– e.g. genes x & y, each with n measurements
Euclidean Distance
Pearson Correlation
Spearman Correlation
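The three metrics above can be computed directly; a minimal sketch using NumPy/SciPy (the vectors x and y are invented example expression profiles, not data from the slides):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# two hypothetical expression vectors (n measurements each)
x = np.array([0.5, 1.2, 2.0, 3.1, 4.2])
y = np.array([1.0, 2.5, 3.9, 6.0, 8.5])

# Euclidean (L2) distance: straight-line distance between the vectors
euclid = np.linalg.norm(x - y)

# Pearson correlation: linear association; 1 - r is a common distance
r, _ = pearsonr(x, y)

# Spearman correlation: Pearson applied to the ranks,
# so it is robust to any monotone rescaling of the data
rho, _ = spearmanr(x, y)
```

Note the difference in behavior: scaling one gene's profile leaves both correlations unchanged but changes the Euclidean distance, which is why the metric choice matters.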
Distance Metrics (cont.)
[Figure: scatterplot examples contrasting Euclidean Distance, Pearson Correlation, and Spearman Correlation]
Hierarchical clustering
• Imposes (pair-wise) hierarchical structure on
all of the data
• Often good for visualization
• Basic Method (agglomerative):
1. Calculate all pair-wise distances
2. Join the closest pair
3. Calculate the joined pair’s distance to all others
4. Repeat from 2 until all joined
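The four steps above fit in a few lines of Python; this toy version uses single linkage and stops once k clusters remain (the function name and stopping rule are mine, added for illustration):

```python
import numpy as np

def agglomerate(points, k):
    """Naive agglomerative clustering:
    start with every point in its own cluster, repeatedly
    join the closest pair, stop when k clusters remain."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # join the closest pair
    return clusters
```

Recomputing every pair-wise distance each round is O(n³) overall; real implementations cache a distance matrix and update only the merged pair's row.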
Hierarchical clustering
Single Linkage Clustering
Nearest Neighbor: the distance between two clusters is the distance between their closest members.
[Figure: two clusters of points]
This method produces long chains which form straggly clusters.
Complete Linkage Clustering
Uses the Furthest Neighbor: the distance between two clusters is the distance between their most distant members.
[Figure: two clusters of points]
This method tends to produce very tight clusters of similar patterns.
Average Linkage Clustering
The distance between two clusters is the average of all pair-wise distances between their members (only shown for two cases in the figure).
[Figure: two clusters of points; the red and blue ‘+’ signs mark the centroids of the two clusters]
Centroid Linkage Clustering
The distance between two clusters is the distance between their centroids.
[Figure: two clusters of points; the red and blue ‘+’ signs mark the centroids of the two clusters]
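All four linkage criteria are available through a single SciPy call; on two well-separated blobs they agree, though on elongated or noisy data they can differ (the data here are synthetic, made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two well-separated 2-D blobs of 10 points each
data = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
                  rng.normal(5.0, 0.3, (10, 2))])

results = {}
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(data, method=method)  # the full merge tree
    # cut the tree into exactly 2 clusters
    results[method] = fcluster(Z, t=2, criterion="maxclust")
```

On this easy example every linkage recovers the two blobs; single linkage is the one most likely to chain through noise on harder data.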
Hierarchical clustering: problems
• Hard to define distinct clusters
• Genes assigned to clusters on the basis of all
experiments
• Optimizing node ordering hard (finding the optimal
solution is NP-hard)
• Can be driven by one strong cluster – a problem for gene expression because data in row space is often highly correlated
K-means Clustering
• Groups genes into a pre-defined number of
independent clusters
• Basic algorithm:
1. Define k = number of clusters
2. Randomly initialize each cluster with a seed (often
with a random gene)
3. Assign each gene to the cluster with the most
similar seed
4. Recalculate all cluster seeds as means (or
medians) of genes assigned to the cluster
5. Repeat 3 & 4 until convergence
(e.g. No genes move, means don’t change much, etc.)
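Steps 1–5 map directly onto a short NumPy implementation (a sketch with invented toy data; real tools add multiple restarts and empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: seed each cluster with a randomly chosen data point
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # step 3: assign each point to the cluster with the nearest seed
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recalculate each seed as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(k)])
        # step 5: stop when the means no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Because the seeds are random (step 2), different runs can converge to different local optima, which is one reason practitioners run K-means several times and keep the best solution.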
K-means example
K-means: problems
• Have to set k ahead of time
– Ways to choose “optimal” k: minimize within-cluster variation compared to random or held-out data
• You’ll get k clusters whether they exist or not
• Each gene only belongs to exactly 1 cluster
• One cluster has no influence on the others
(one dimensional clustering)
• Genes assigned to clusters on the basis of
all experiments
Can a gene belong to N clusters?
• Fuzzy clustering: each gene’s relationship to
a cluster is probabilistic
• Gene can belong to many clusters
• More biologically realistic,
but harder to get to work well/fast
• Harder to interpret
[Figure: a gene with membership 0.85 in one cluster and 0.15 in another]
Advanced clustering methods
• Fuzzy clustering
• Clustering with resampling
• Biclustering
• Clustering based on physical properties (clusters defined by “attraction of points”)
Clustering Tools
• TIGR MeV
– http://www.tm4.org/mev.html
• Sleipnir
– http://huttenhower.sph.harvard.edu/sleipnir
• Cluster & JavaTreeView
– http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
– http://jtreeview.sourceforge.net/
• CLICK & EXPANDER
– http://www.cs.tau.ac.il/~rshamir/expander/expander.html
Never underestimate the power of Excel in conjunction with Python!
Exercise Caution
• Typically, when constructing a microarray
dataset, certain filters are applied to retain
only the ‘interesting’ genes.
• Clustering imposes an ordering/cluster
structure on the genes, whether one exists or
not.
• This is accentuated by filtering out genes.
An Example
Bryan, 2004
The Result of Filtering
Groups in the data, which didn’t exist with the full dataset,
suddenly appear!
Bryan, 2004
Cluster Evaluation
• Mathematical consistency
– Compare coherency of clusters to background
• Look for functional consistency in clusters
– Requires a gold standard, often based on GO,
MIPS, etc.
– ROC curves/AUC or precision/recall
• Evaluate likelihood of enrichment in clusters
– Hypergeometric distribution, etc.
– Several tools available
Hypergeometric Distribution
• Probability of observing x or more genes in a
cluster of n genes with a common annotation
p-value = Σ (from j = x to n) [ C(M, j) · C(N−M, n−j) ] / C(N, n)
– N = total number of genes in genome
– M = number of genes with annotation
– n = number of genes in cluster
– x = number of genes in cluster with annotation
• Multiple hypothesis correction required if testing
multiple functions (Bonferroni, FDR, etc.)
• Additional genes in clusters with strong enrichment
may be related
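The tail probability above is available directly in SciPy; a sketch (the gene counts are invented for illustration):

```python
from scipy.stats import hypergeom

def enrichment_p(N, M, n, x):
    """P(X >= x): chance of seeing x or more annotated genes in a
    cluster of n, drawn from a genome of N genes of which M are annotated.
    sf(x - 1, ...) gives the upper tail including x itself."""
    return hypergeom.sf(x - 1, N, M, n)

# toy numbers: genome of 20 genes, 5 annotated,
# cluster of 5 genes of which 3 carry the annotation
p = enrichment_p(N=20, M=5, n=5, x=3)

# a Bonferroni correction would multiply p by the number of
# functions tested; FDR procedures are the common alternative
```

Here p equals the slide's sum C(5,3)·C(15,2)/C(20,5) + C(5,4)·C(15,1)/C(20,5) + C(5,5)·C(15,0)/C(20,5).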
GO term Enrichment Tools
• SGD’s & Princeton’s GoTermFinder
– http://www.yeastgenome.org
– http://go.princeton.edu
• GOLEM (http://function.princeton.edu/GOLEM)
• HIDRA (http://function.princeton.edu/hidra/)
Sealfon et al., 2006
Predictive Power
Precision/Recall of various data types
• Examine ability of data to recapitulate
known biology - not limited to
microarrays
• GO biological process gold standard
• Analyze both overall and specific
functional predictive power of datasets
http://function.princeton.edu/GRIFn
Advanced analysis methods –
a very brief overview
More Unsupervised Methods
• Search-based approaches
– Starting with a query gene/condition, find most
related group
• Singular Value Decomposition (SVD) &
Principal Component Analysis (PCA)
– Decomposition of the data matrix into “patterns”, “weights”, and “contributions”
– The real names are “principal components”, “singular values”, and “left/right eigenvectors”
– Used to remove noise, reduce dimensionality,
identify common/dominant signals
SVD (& PCA)
• SVD is the method, PCA is performing SVD on
centered data
• Projects data into another orthonormal basis
• New basis ordered by variance explained
X = U · S · Vᵗ
– X: original data matrix
– S: diagonal matrix of singular values
– U and Vᵗ: the singular vectors, interpretable as “eigen-conditions” and “eigen-genes”
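In NumPy this decomposition is a single call; a sketch on a made-up genes × conditions matrix dominated by one shared pattern:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy data: 50 genes x 8 conditions = one shared pattern + small noise
pattern = rng.normal(size=8)
weights = rng.normal(size=50)
X = np.outer(weights, pattern) + 0.01 * rng.normal(size=(50, 8))

# centering the columns first makes this PCA rather than plain SVD
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# variance explained by each component, ordered largest first
var_explained = s**2 / np.sum(s**2)

# rank-1 reconstruction keeps the dominant signal, drops the noise
X1 = s[0] * np.outer(U[:, 0], Vt[0])
```

Because the singular values come out sorted, truncating after the first few components is how SVD is used for denoising and dimensionality reduction.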
SVD
Supervised methods – a very
brief introduction
Supervised methods
• Clustering, PCA etc are unsupervised
methods
• Supervised method – any method that
“learns” a way to perform an operation
based on examples of problems with known
solutions (i.e. has “supervision”)
• => have to know what we are looking for
when using supervised methods
Supervised vs. Unsupervised
• Unsupervised methods can find novel profile
groupings
• Supervised methods take known groupings
and create rules for reliably assigning genes
or conditions into these groups
Hierarchical clustering of lung cancers
Supervised analysis: setup
• Training set
– Data: microarrays
– Labels: for each one we know if it falls into our
class of interest or not (binary classification)
• New data (test data)
– Data for which we don’t have labels.
– E.g. genes without known function
• Goal: Generalization ability
– Build a classifier from the training data that is
good at predicting the right class for the new
data.
One microarray, one dot
Think of a space with #genes dimensions (yes, it’s hard to picture for more than 3).
Each microarray corresponds to a point in this space.
If gene expression is similar under some conditions, the points will be close to each other.
If gene expression overall is very different, the points will be far away.
[Figure: scatterplot with axes “Expression of gene 1” and “Expression of gene 2”]
Which line separates best?
[Figure: four candidate separating lines, labeled A–D]
No sharp knife, but a …
Support Vector Machines
Maximal margin
separating hyperplane
Datapoints closest
to separating
hyperplane
= support vectors
How well did we do?
Training error: how well
do we do on the data we
trained the classifier on?
But how well will we do in
the future, on new data?
Test error: How well does
the classifier generalize?
Same classifier (= line)
New data from same classes
The classifier will usually perform
worse than before:
Test error > training error
Cross-validation
Train a classifier and test it:
– Training error: measured on the Train portion
– Test error: measured on the held-out Test portion
[Figure: dataset split into Train and Test]
K-fold Cross-validation
Here for K = 3:
Step 1: Train | Train | Test
Step 2: Train | Test | Train
Step 3: Test | Train | Train
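The three steps above generalize to any K; a self-contained sketch using a toy nearest-mean classifier (the classifier and data are mine, added only to make the folds concrete):

```python
import numpy as np

def k_fold_cv(X, y, k=3, seed=0):
    """Split the data into k folds; each fold serves once as the
    test set while the remaining folds form the training set."""
    idx = np.random.default_rng(seed).permutation(len(X))
    accuracies = []
    for fold in np.array_split(idx, k):
        test = np.zeros(len(X), dtype=bool)
        test[fold] = True
        # "train": per-class mean computed from training points only
        means = {c: X[~test & (y == c)].mean(axis=0)
                 for c in np.unique(y)}
        # "test": assign each held-out point to the nearest class mean
        preds = [min(means, key=lambda c: np.linalg.norm(xi - means[c]))
                 for xi in X[test]]
        accuracies.append(np.mean(np.array(preds) == y[test]))
    return np.mean(accuracies)
```

The key discipline the code enforces is that the class means never see the held-out fold, so the averaged accuracy estimates the test error rather than the training error.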
Other supervised methods
• Bayesian networks
• Neural networks
• Linear discriminant analysis (LDA)
• Logistic regression
• Boosting
• Decision trees
Summary II
• Supervised and un-supervised learning
… are needed everywhere in biology and medicine
• Microarrays = points in high-dimensional spaces
• Classifiers = lines (hyperplanes) in these spaces
• Support Vector Machines use maximal margin
hyperplanes as classifiers
• Classifier performance: Test error > training error
• Cross-validation is the right way to evaluate
classifier performance
Identifying biomarkers
Class discovery
Class comparison
Class prediction
The problem
• Have samples in two groups A and B
• Want to identify biomarker genes between A and B
• Challenges:
– Data are not normally distributed
– Microarray data are often noisy
[Figure: expression heatmaps for sample groups A and B]
• Goal:
robust and reliable methods for identification of
biomarker genes
Hierarchical clustering of lung cancers
Patient survival for
Adenocarcinoma subgroups
[Figure: Kaplan–Meier curves, cumulative survival vs. time (months, 0–60), for adenocarcinoma Groups 1–3; p = 0.002 for Gr. 1 vs. Gr. 3]
Nonparametric t-test
• Want to pick genes with:
– Maximal difference in mean expression between samples
– Minimal variance of expression within samples
[Figure: expression distributions for sample groups A and B]
Nonparametric t-test
group 1: n₁ samples, with average expression X̄₁
group 2: n₂ samples, with average expression X̄₂
t statistic:
t = (X̄₁ − X̄₂) / s(X̄₁−X̄₂), where s(X̄₁−X̄₂) = √(s₁²/n₁ + s₂²/n₂)
p-value (from column permutations):
p = count(|t_perm| ≥ |t_obs|) / count(permutations)
[Figure: heatmap color bar from low expression to high expression]
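The statistic and its permutation p-value can be sketched directly (the two expression vectors are invented; real analyses loop this over every gene):

```python
import numpy as np

def t_stat(a, b):
    # t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

def permutation_p(a, b, n_perm=2000, seed=0):
    """p = fraction of label permutations whose |t| >= |t_observed|."""
    rng = np.random.default_rng(seed)
    t_obs = abs(t_stat(a, b))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute the group labels
        hits += abs(t_stat(pooled[:len(a)], pooled[len(a):])) >= t_obs
    return hits / n_perm

# toy gene: clearly higher in group 1 than in group 2
a = np.array([5.1, 5.3, 4.9, 5.2, 5.0])
b = np.array([1.0, 1.2, 0.9, 1.1, 1.0])
p = permutation_p(a, b)
```

Because the null distribution is built from the data itself, no normality assumption is needed, which is the point of calling this a nonparametric t-test.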
Wilcoxon rank-sum test
• Tests for a difference in location (e.g. medians) between two samples
• Uses rank data
• Good for non-normal data
Original data → Ranks
gene 1: 2 | 0 | 3 | 5 | 9 → 2 | 1 | 3 | 4 | 5
gene 2: 0 | 1 | 2 | 3 | 5 → 1 | 2 | 3 | 4 | 5
gene 3: 7 | 4 | 3 | 2 | 1 → 5 | 4 | 3 | 2 | 1
• Identifies genes with skewed distribution of ranks
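SciPy implements this test as ranksums; a sketch on invented expression values for one gene measured in two sample groups:

```python
import numpy as np
from scipy.stats import ranksums

# toy gene: expression in two sample groups
group_a = np.array([2.1, 1.9, 2.3, 2.0, 2.2, 2.4, 1.8])
group_b = np.array([4.0, 4.2, 3.9, 4.1, 4.3, 3.8, 4.4])

# ranksums pools the values, ranks them, and compares the
# rank sums of the two groups (Wilcoxon rank-sum / Mann-Whitney)
stat, p = ranksums(group_b, group_a)
```

Because only ranks enter the statistic, a single extreme outlier shifts the result far less than it would shift a t-test on the raw values.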
Microarrays: MeV
www.tm4.org/mev
Analysis summary
• Unsupervised methods –
– Clustering
• Hierarchical clustering
• K-means clustering
– Decomposition
• SVD/PCA
• Supervised methods
– Require examples with known answers
• Need both positive and negative examples
– Support Vector Machines
• Biomarker identification
– Nonparametric t-test
– Rank sum test