More Microarray Analysis: Unsupervised Approaches Matt Hibbs Troyanskaya Lab

Download Report

Transcript More Microarray Analysis: Unsupervised Approaches Matt Hibbs Troyanskaya Lab

More Microarray Analysis:
Unsupervised Approaches
Matt Hibbs
Troyanskaya Lab
Outline
• Gene Expression vs. DNA applications
• A little more normalization (missing values)
• Unsupervised Analysis
– Basic Clustering
– Statistical Enrichment
– PCA/SVD
– Advanced Clustering
– Search-based Approaches
Expression / DNA
• Some similar concepts to analysis, but often
very different goals
• Expression – clustering, guilt by association,
functional enrichment
• DNA – signal processing, spatial
relationships, motif finding
• Visualized differently (Heat maps vs.
karyoscope)
The missing value problem
• Microarrays can have systematic or random
missing values
• Some algorithms can’t deal with missing
values (PCA/SVD in particular)
• Instead of hoping missing values won’t bias
the analysis, better to estimate them
accurately
Spatial Defects
KNN Impute
• Idea: use genes with similar expression
profiles to estimate missing values
2 | | 5 | 7 | 3 | 1 Gene X
2 |4.3| 5 | 7 | 3 | 1 Gene X
2 | 4 | 5 | 7 | 3 | 2 Gene A
2 | 4 | 5 | 7 | 3 | 2 Gene A
8 | 9 | 2 | 1 | 4 | 9 Gene B
8 | 9 | 2 | 1 | 4 | 9 Gene B
3 | 5 | 6 | 7 | 3 | 2 Gene C
3 | 5 | 6 | 7 | 3 | 2 Gene C
Imputation affects downstream analysis
Complete data set
Data set with 30%
entries missing and
filled with zeros (zero
values appear black)
Data set with missing
values estimated by
KNNimpute algorithm
Unsupervised Analysis
• Supervised techniques great if you have
starting information (e.g. labels)
– But, we often we don’t know enough beforehand
to apply these methods
• Unsupervised techniques are exploratory
– Let the data organize itself, then try to find
biological meaning
– Approaches to understand whole data
– Visualization often helpful
Clustering
• Let the data organize itself
• Reordering of genes (or conditions) in the
dataset so that similar patterns are next to
each other (or in separate groups)
• Identify subsets of genes (or experiments)
that are related by some measure
Quick Example
Genes
Conditions
Why cluster?
• “Guilt by association” – if unknown gene X is
similar in expression to known genes A and
B, maybe they are involved in the
same/related pathway
• Visualization: datasets are too large to be
able to get information out without
reorganizing the data
Clustering Techniques
• Algorithm (Method)
–
–
–
–
–
–
–
–
Hierarchical
K-means
Self Organizing Maps
QT-Clustering
NNN
.
.
.
• Distance Metric
–
–
–
–
–
–
–
–
Euclidean (L2)
Pearson Correlation
Spearman Correlation
Manhattan (L1)
Kendall’s t
.
.
.
Distance Metrics
• Choice of distance measure is important for most clustering
techniques
• Pair-wise metrics – compare vectors of numbers
– e.g. genes x & y, ea. with n measurements
Euclidean Distance
Pearson Correlation
Spearman Correlation
Distance Metrics
Euclidean Distance
Spearman Correlation
Pearson Correlation
Hierarchical clustering
• Imposes (pair-wise) hierarchical structure on
all of the data
• Often good for visualization
• Basic Method (agglomerative):
1.
2.
3.
4.
Calculate all pair-wise distances
Join the closest pair
Calculate pair’s distance to all others
Repeat from 2 until all joined
Hierarchical clustering
Hierarchical clustering
Hierarchical clustering
Hierarchical clustering
Hierarchical clustering
Hierarchical clustering
HC – Interior Distances
• Three typical variants to calculate interior
distances within the tree
– Average linkage: mean/median over all possible
pair-wise values
– Single linkage: minimum pair-wise distance
– Complete linkage: maximum pair-wise distance
Hierarchical clustering: problems
• Hard to define distinct clusters
• Genes assigned to clusters on the basis of all
experiments
• Optimizing node ordering hard (finding the optimal
solution is NP-hard)
• Can be driven by one strong cluster – a problem
for gene expression b/c data in row space is often
highly correlated
HC: Real Example
• Demo in JavaTreeView & HIDRA
– Spellman et al., 1998: yeast alpha-factor sync
cell cycle timecourse
HC: Another Example
• Expression of tumors hierarchically clustered
• Expression groups by clinical class
Garber et al.
K-means Clustering
• Groups genes into a pre-defined number of
independent clusters
• Basic algorithm:
1. Define k = number of clusters
2. Randomly initialize each cluster with a seed (often
with a random gene)
3. Assign each gene to the cluster with the most
similar seed
4. Recalculate all cluster seeds as means (or
medians) of genes assigned to the cluster
5. Repeat 3 & 4 until convergence
(e.g. No genes move, means don’t change much, etc.)
K-means example
K-means example
K-means example
K-means: problems
• Have to set k ahead of time
– Ways to choose “optimal” k: minimize withincluster variation compared to random data or
held out data
• Each gene only belongs to exactly 1 cluster
• One cluster has no influence on the others
(one dimensional clustering)
• Genes assigned to clusters on the basis of
all experiments
K-means: Real Example
• Demo in TIGR MeV
– Spellman et al. alpha-factor cell cycle
Clustering “Tweaks”
• Fuzzy clustering – allows genes to be
“partially” in different clusters
• Dependent clusters – consider betweencluster distances as well as within-cluster
• Bi-clustering – look for patterns across
subsets of conditions
– Very hard problem (NP-complete)
– Practical solutions use heuristics/simplifications
that may affect biological interpretation
Cluster Evaluation
• Mathematical consistency
– Compare coherency of clusters to background
• Look for functional consistency in clusters
– Requires a gold standard, often based on GO,
MIPS, etc.
• Evaluate likelihood of enrichment in clusters
– Hypergeometric distribution, etc.
– Several tools available
Gene Ontology
• Organization of curated biological knowledge
– 3 branches: biological process, molecular function, cellular component
Hypergeometric Distribution
• Probability of observing x or more genes in a
cluster of n genes with a common annotation
–
–
–
–
N = total number of genes in genome
M = number of genes with annotation
n = number of genes in cluster
x = number of genes in cluster with annotation
• Multiple hypothesis correction required if testing
multiple functions (Bonferroni, FDR, etc.)
• Additional genes in clusters with strong enrichment
may be related
GO term Enrichment Tools
• SGD’s & Princeton’s GoTermFinder
– http://go.princeton.edu
• GOLEM (http://function.princeton.edu/GOLEM)
• HIDRA
Sealfon et al., 2006
More Unsupervised Methods
• Search-based approaches
– Starting with a query gene/condition, find most
related group
• Singular Value Decomposition (SVD) &
Principal Component Analysis (PCA)
– Decomposition of data matrix into “patterns”
“weights” and “contributions”
– Real names are “principal components”
“singular values” and “left/right eigenvectors”
– Used to remove noise, reduce dimensionality,
identify common/dominant signals
SVD (& PCA)
• SVD is the method, PCA is performing SVD on
centered data
• Projects data into another orthonormal basis
• New basis ordered by variance explained

Vt
=
X
U
Singular “Eigen-genes”
values
Original
“Eigen-conditions”
Data matrix
SVD
SVD
SVD: Real Example
• Demo in TIGR MeV
– Spellman et al., 1998 cell cycle time courses
• alpha-factor sync
• cdc15 sync
DNA arrays / Sequence-based Analysis
• Methods so far focused on expression data
• Other uses of microarrays often sequence
based: CGH, ChIP-chip, SNP scanner
– Data has important, inherent order
– Most analysis methods developed from signal
processing techniques (e.g. sound)
– View data in chromosomal order (karyoscope)
• Tools: JavaTreeView, IGB, Chippy
CGH Example
• Demo in JavaTreeView
Aneuploidy affects expression too
rpl20aD/ rpl20aD, Chromosome XV
(data from Hughes et al. (2000))
Software Tools
•
•
•
•
•
•
JavaTreeView – viz, karyoscope
HIDRA – viz, mult. datasets, search
Cluster (Eisen lab) – clustering
TIGR MeV – clustering, viz
IGB – Affy’s CGH browser
ChIPpy – ChIP-chip analysis
Summary
• Unsupervised Analysis
– Let the data organize itself, find patterns
– Clustering: Distance Metric + Algorithm
– SVD/PCA – auto find dominant patterns
• Impute missing values (KNN)
• CGH – Karyoscope view
• Questions?