CSCE590/822 Data Mining Principles and Applications
CSCE555 Bioinformatics
14 Cluster analysis for
microarray data
Lecture
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page:
http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Outline
Why microarray data analysis
Understanding Microarray Data
Clustering in microarray analysis
Project proposals and Group
Functional Genomics – Why?
You can’t learn everything from DNA sequence
Levels of organization contain different information
Want to know how a whole cell works
May not know a priori which pieces of information are
going to be the most valuable to solve a particular
problem
Need to know both analytical tools and biology
◦ What are the great questions?
◦ What techniques are available to answer them?
◦ What computational/statistical techniques can get you what you
want?
Start with whole genome, annotated so you know
where most of the genes are
◦ Make probes to 3’ ends of genes generally
Microarrays
(Figure: microarray workflow — extract RNA and incorporate label → combine → hybridize → scanned image)
Types of Microarrays
For measuring mRNA
◦ Affymetrix, Nimblegen, Illumina (short oligos made
on the slide)
◦ Oligonucleotides (short DNA – typically 70-mers)
For identifying where proteins bind DNA and
non-coding or other RNAs
◦ Tiling arrays
For identifying SNPs
◦ http://www.illumina.com/
There are even protein arrays!
Why microarrays?
You can get a great deal of data –
multiplexed, uses very little reagent, can
use standard detection devices
Similar techniques are used for working
with RNA and DNA and similar
attachment chemistries
Can carry out enzyme reactions on the
arrays for sequencing and other analyses
Microarray experiments have multiple steps; each step introduces some variability:
Cell harvest
Slide Manufacture
RNA preparation
Slide Post-processing
Labeling reaction
Pre-hybridization
Hybridization
Washing
Scanning
Data Analysis
What questions are you asking?
First of all – you are measuring mRNA
abundance – so questions are about gene
expression at the mRNA level
What are the levels of all mRNAs?
How do they change?
Are there interesting changes or groups
that provide insight into some process or
state?
Design and analysis
Experimental design is critical
Analysis tools:
◦ Hierarchical clustering
◦ Gene lists based on expression
◦ Time course analysis
◦ Statistics
◦ Gene Ontology
◦ Pathway analysis
◦ Classifiers – to identify biomarkers
Aim of clustering: Group objects according
to their similarity
Cluster:
a set of objects
that are similar
to each other
and separated
from the other
objects.
Example: green/
red data points
were generated
from two different
normal distributions
Clustering microarray data
Genes and
experiments/samples are
given as the row and
column vectors of a gene
expression data matrix.
Clustering may be
applied either to genes
or experiments
(regarded as vectors in
Rp or Rn).
(Figure: gene expression data matrix with p genes as rows and n experiments as columns)
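Clustering genes versus clustering samples is just a matter of which way the matrix is read — a minimal Python sketch (the 3×4 toy matrix is hypothetical):

```python
# Sketch: the same clustering machinery applies to genes (rows) or to
# experiments/samples (columns) -- clustering samples simply means
# clustering the transposed matrix. Toy 3x4 matrix (hypothetical values).
X = [
    [2.1, 2.0, 0.1, 0.2],  # gene 1 across 4 experiments
    [1.9, 2.2, 0.3, 0.1],  # gene 2
    [0.2, 0.1, 2.0, 1.8],  # gene 3
]

def transpose(m):
    return [list(col) for col in zip(*m)]

genes = X                # p = 3 vectors in R^n (n = 4)
samples = transpose(X)   # n = 4 vectors in R^p (p = 3)
print(len(genes), len(samples))  # 3 4
```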
Why cluster genes?
Identify groups of possibly co-regulated
genes (e.g. in conjunction with sequence
data).
Identify typical temporal or spatial gene
expression patterns (e.g. cell cycle data).
Arrange a set of genes in a linear order
that is at least not totally meaningless.
Why cluster experiments/samples?
Quality control: Detect experimental
artifacts/bad hybridizations
Check whether samples are grouped according
to known categories (though this might be
better addressed using a supervised approach:
statistical tests, classification)
Identify new classes of biological samples (e.g.
tumor subtypes)
Alizadeh et al., Nature 403:503-11, 2000
Cluster analysis
Generally, cluster analysis is based on two
ingredients:
Distance measure: Quantification of
(dis-)similarity of objects.
Cluster algorithm: A procedure to
group objects. Aim: small within-cluster
distances, large between-cluster distances.
Some distance measures
Given vectors x = (x1, …, xn), y = (y1, …, yn)
Euclidean distance: $d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan distance: $d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Correlation distance: $d_C(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Which distance measure to use?
Example: x = (1, 1, 1.5, 1.5), y = (2.5, 2.5, 3.5, 3.5) = 2x + 0.5, z = (1.5, 1.5, 1, 1).
(Figure: the three vectors plotted against index 1–4.)
dC(x, y) = 0, dC(x, z) = 2.
dE(x, z) = 1, dE(x, y) ≈ 3.54.
• The choice of distance measure should be based on the application area. What sort of similarities would you like to detect?
• Correlation distance dC measures trends/relative differences: dC(x, y) = dC(ax + b, y) if a > 0.
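The distances above can be checked on this example with a few lines of plain Python (a sketch, standard library only):

```python
import math

def d_euclidean(x, y):
    # sqrt of the sum of squared componentwise differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def d_manhattan(x, y):
    # sum of absolute componentwise differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def d_correlation(x, y):
    # 1 - Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return 1 - num / den

x = (1, 1, 1.5, 1.5)
y = (2.5, 2.5, 3.5, 3.5)   # = 2x + 0.5
z = (1.5, 1.5, 1, 1)

print(d_correlation(x, y))          # 0.0  (y is a positive affine transform of x)
print(d_correlation(x, z))          # 2.0  (z is perfectly anti-correlated with x)
print(d_euclidean(x, z))            # 1.0
print(round(d_euclidean(x, y), 2))  # 3.54
```

Note how the correlation distance declares x and y identical while the Euclidean distance puts them far apart — exactly the trade-off the slide describes.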
Which distance measure to use?
Euclidean and Manhattan distance both measure
absolute differences between vectors.
Manhattan distance is more robust against
outliers.
May apply standardization to the
observations:
Subtract mean and divide by standard
deviation:
$\hat{x}_i = \frac{x_i - \bar{x}}{\hat{\sigma}_x}$
After standardization, Euclidean and correlation distance are equivalent:
$d_E(x_1, x_2)^2 = 2n \, d_C(x_1, x_2)$
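A quick numerical check of this equivalence — a sketch assuming the population (divide-by-n) standard deviation in the standardization:

```python
import math, random

def standardize(v):
    # subtract the mean, divide by the population standard deviation
    n = len(v)
    m = sum(v) / n
    sd = math.sqrt(sum((vi - m) ** 2 for vi in v) / n)
    return [(vi - m) / sd for vi in v]

def d_euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def d_correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return 1 - num / den

random.seed(0)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
s1, s2 = standardize(x1), standardize(x2)

lhs = d_euclidean(s1, s2) ** 2          # squared Euclidean distance after standardization
rhs = 2 * n * d_correlation(x1, x2)     # 2n times the correlation distance
print(abs(lhs - rhs) < 1e-9)  # True
```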
K-means clustering
Input: N objects given as data points in Rp
Specify the number k of clusters.
Initialize k cluster centers. Iterate until convergence:
- Assign each object to the cluster with the closest
center (wrt Euclidean distance).
- The centroids/mean vectors of the obtained clusters
are taken as new cluster centers.
K-means can be seen as an optimization problem:
Minimize the sum of squared within-cluster distances,
$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=C(j)=k} d_E(x_i, x_j)^2$
Results depend on the initialization. Use several starting points and choose the “best” solution (with minimal W(C)).
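The procedure above can be sketched as a minimal Lloyd-style K-means with random restarts (the toy points are hypothetical); as the quality score it uses the standard within-cluster sum of squares to the centroids, which is closely related to the pairwise objective W(C):

```python
import random

def kmeans(points, k, n_starts=10, n_iter=100, seed=0):
    """Minimal K-means sketch with random restarts; returns the run with
    the smallest within-cluster sum of squares to the centroids."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centers = [list(p) for p in rng.sample(points, k)]
        for _ in range(n_iter):
            # assignment step: each point goes to its closest center
            labels = [min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
                      for p in points]
            # update step: centroids of the clusters become the new centers
            new_centers = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)]
                                   if members else centers[c])
            if new_centers == centers:
                break  # converged
            centers = new_centers
        w = sum(sum((a - b) ** 2 for a, b in zip(p, centers[l]))
                for p, l in zip(points, labels))
        if best is None or w < best[0]:
            best = (w, labels, centers)
    return best

pts = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.8), (4.9, 5.2)]
w, labels, centers = kmeans(pts, k=2)
print(labels)  # the two point groups receive two distinct labels
```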
K-means/PAM: How to choose K (the
number of clusters)?
There is no easy answer.
Many heuristic approaches try to
compare the quality of clustering results
for different values of K (for an overview
see Dudoit/Fridlyand 2002).
The problem can be better addressed in
model-based clustering, where each
cluster represents a probability
distribution, and a likelihood-based
framework can be used.
Self-organizing maps (SOMs): a neural network approach
K = r*s clusters are arranged as
nodes of a two-dimensional grid.
Nodes represent cluster
centers/prototype vectors.
This makes it possible to represent similarity
between clusters.
Algorithm:
Initialize nodes at random
positions. Iterate:
- Randomly pick one data point
(gene) x.
- Move nodes towards x, the
closest node most, remote nodes
(in terms of the grid) less.
Decrease the amount of movement as the
number of iterations grows.
from Tamayo et al. 1999
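The update loop described above can be sketched as follows; the grid size, learning-rate and neighbourhood-radius schedules are illustrative assumptions, not the settings of Tamayo et al.:

```python
import math, random

def som(data, rows, cols, n_iter=3000, seed=0):
    """Minimal self-organizing map sketch: rows*cols nodes on a 2-D grid,
    each holding a prototype vector in the data space."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    protos = [[rng.random() for _ in range(dim)] for _ in grid]
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1 - frac)                        # movement shrinks over time
        radius = max(rows, cols) * (1 - frac) + 0.5  # so does the neighbourhood
        x = rng.choice(data)                         # randomly pick one data point
        # best-matching unit: node whose prototype is closest to x
        bmu = min(range(len(protos)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(x, protos[i])))
        # move all nodes toward x; nodes near the BMU on the grid move most
        for i, (r, c) in enumerate(grid):
            g = math.hypot(r - grid[bmu][0], c - grid[bmu][1])  # grid distance
            h = math.exp(-g * g / (2 * radius * radius))
            protos[i] = [p + lr * h * (a - p) for p, a in zip(protos[i], x)]
    return protos

# Hypothetical 2-D data: two tight clusters near (0.1, 0.1) and (0.9, 0.9)
rng = random.Random(1)
data = ([(rng.gauss(0.1, 0.03), rng.gauss(0.1, 0.03)) for _ in range(30)] +
        [(rng.gauss(0.9, 0.03), rng.gauss(0.9, 0.03)) for _ in range(30)])
protos = som(data, rows=2, cols=2)
```

After training, some prototypes should sit near each cluster, with grid-adjacent nodes holding similar prototypes.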
Self-organizing maps
from Tamayo
et al. 1999
(yeast cell
cycle data)
Agglomerative hierarchical clustering
Bottom-up algorithm (top-down
(divisive) methods are less common).
Start with the objects as clusters.
In each iteration, merge the two clusters
with the minimal distance from each
other - until you are left with a single
cluster comprising all objects.
But what is the distance between two
clusters?
Distances between clusters used for
hierarchical clustering
Calculation of the distance between two clusters is
based on the pairwise distances between members of
the clusters.
Complete linkage: largest distance
Average linkage: average distance
Single linkage:
smallest distance
Complete linkage gives preference to compact/spherical
clusters. Single linkage can produce long stretched clusters.
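The three linkage rules, and the merge loop they plug into, can be sketched in plain Python (a naive cubic-time version for illustration; the toy points are hypothetical):

```python
import math

def linkage_distance(c1, c2, dist, method):
    """Cluster-to-cluster distance from the pairwise member distances."""
    pairs = [dist(a, b) for a in c1 for b in c2]
    if method == "single":
        return min(pairs)             # smallest pairwise distance
    if method == "complete":
        return max(pairs)             # largest pairwise distance
    return sum(pairs) / len(pairs)    # average linkage

def agglomerate(points, dist, method="complete"):
    """Naive bottom-up clustering: start with singletons and repeatedly
    merge the two closest clusters; records each merge and its height."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], dist, method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

euclid = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
merges = agglomerate([(0, 0), (0, 1), (4, 0), (4, 1)], euclid)
print([round(m[2], 3) for m in merges])  # [1.0, 1.0, 4.123]
```

The recorded merge heights are exactly what a dendrogram plots on its vertical axis.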
Hierarchical clustering
The height of a node in
the dendrogram
represents the distance of
the two children clusters.
Loss of information: n
objects have n(n-1)/2
pairwise distances, tree
has n-1 inner nodes.
The ordering of the leaves
is not uniquely defined by
the dendrogram: flipping
the children of each inner
node gives 2^(n-1)
possible orderings.
Golub data: different types of leukemia.
Clustering based on the 150 genes with
highest variance across all samples.
Alternative: direct visualization of
similarity/distance matrices
Array batch 1
Array batch 2
Useful if one wants to investigate a specific factor (advantage: no loss of
information). Sort experiments according to that factor.
Clustering of time course data
Suppose we have expression data from different
time points t1, …, tn, and want to identify typical
temporal expression profiles by clustering the
genes.
Usual clustering methods/distance measures
don’t take the ordering of the time points into
account – the result would be the same if the
time points were permuted.
Simple modification: Consider the difference
xi(j+1) – xij between consecutive timepoints as an
additional observation yij. Then apply a clustering
algorithm such as K-means to the augmented
data matrix.
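The augmentation described above is a one-liner; a sketch (the toy profiles are hypothetical):

```python
def augment_with_differences(X):
    """Append the consecutive-timepoint differences x_{i,j+1} - x_{i,j}
    to each gene's expression profile, so that clustering also sees the
    shape of the time course, not just the levels."""
    return [row + [row[j + 1] - row[j] for j in range(len(row) - 1)]
            for row in X]

X = [
    [1.0, 2.0, 4.0],   # steadily increasing gene
    [4.0, 2.0, 1.0],   # steadily decreasing gene
]
print(augment_with_differences(X))
# [[1.0, 2.0, 4.0, 1.0, 2.0], [4.0, 2.0, 1.0, -2.0, -1.0]]
```

K-means (or any other clustering) can then be run on the augmented rows unchanged.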
Biclustering
Usual clustering algorithms are based
on global similarities of rows or
columns of an expression data matrix.
But the similarity of the expression
profiles of a group of genes may be
restricted to certain experimental
conditions.
Goal of biclustering: identify
“homogeneous” submatrices.
Difficulties: computational complexity,
assessing the statistical significance of
results
Example: Tanay et al. 2002.
The role of feature selection
Sometimes, people first select genes that
appear to be differentially expressed
between groups of samples. Then they
cluster the samples based on the
expression levels of these genes. Is it
remarkable if the samples then cluster
into the two groups?
No, this doesn’t prove anything, because
the genes were selected with respect to
the two groups! Such effects can even be
obtained with a matrix of i.i.d. random
numbers.
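This selection effect is easy to reproduce on pure noise — a sketch with an i.i.d. Gaussian matrix and two arbitrary sample groups (all sizes are illustrative):

```python
import random

random.seed(0)
n_genes, n_samples = 1000, 20
groups = [0] * 10 + [1] * 10          # two arbitrary sample groups
# pure noise: i.i.d. standard normal "expression" values
X = [[random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_genes)]

def group_diff(row):
    # absolute difference of the group means for one gene
    m0 = sum(v for v, g in zip(row, groups) if g == 0) / 10
    m1 = sum(v for v, g in zip(row, groups) if g == 1) / 10
    return abs(m0 - m1)

# "select" the 50 genes that look most differential between the groups
selected = sorted(X, key=group_diff, reverse=True)[:50]
avg_all = sum(group_diff(r) for r in X) / n_genes
avg_sel = sum(group_diff(r) for r in selected) / 50
print(avg_sel > 2 * avg_all)  # the selected genes look far more "differential"
```

Clustering the samples on `selected` would now cleanly separate the two arbitrary groups, even though the matrix contains no real structure.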
Conclusions
Clustering has been very popular in the analysis
of microarray data. However, many typical
questions in microarray studies (e.g. identifying
expression signatures of tumor types) are
better addressed by other methods
(classification, statistical tests).
Clustering algorithms are easy to apply (you
cannot really do it wrong), and they are useful
for exploratory analysis. But it is difficult to
assess the validity/significance of the results.
Even “random” data with no structure can yield
clusters or exhibit interesting looking patterns.
Final Project Proposal
Aims: investigate a bioinformatics problem
Group: one or two people
Proposal: one page
◦ Problem description
◦ Data sources
◦ Algorithms
◦ Expected results
Assessment: presentation and final report
Acknowledgement
Anja von Heydebreck