
More on Microarrays
Chitta Baral
Arizona State University
Some case studies
• Yeast cell cycle
– Goal: Find groups of yeast genes whose expression profiles are similar
over a 24-hour period.
– Approach: Obtain gene expression measurements using Affymetrix S98
genome microarrays for a synchronized sample of yeast cells over a 24-hour
period, sampling the total RNA population at 30-minute intervals.
• Each of the 48 time points was sampled twice (for duplicate measurements):
T1 … T48 and replicates T1′ … T48′.
• Drug intervention study
– Goal: Characterize the effect(s) of drug X three hours after it is introduced
into normal adult mice, as reflected in the expression levels of liver cell genes.
– Approach: Gene expression profiles of normal adult mice liver cells that
are not treated with drug X are used as the control state.
• Call the pre-intervention or control state A, and the post-intervention state B.
• For replicate measurements, liver samples were obtained from M_A adult mice
without drug X application, and from another M_B adult mice after drug X was
applied.
Some potential questions when
trying to cluster
• What uncategorized genes have an expression pattern
similar to these well-characterized genes?
• How different is the pattern of expression of gene X from
other genes?
• What genes closely share a pattern of expression with
gene X?
• What category of function might gene X belong to?
• What are all the pairs of genes that closely share
patterns of expression?
• Are there subtypes of disease X discernible by tissue
gene expression?
• What tissue is this sample tissue closest to?
Questions – cont.
• What are the different patterns of gene expression?
• Which genes have a pattern that may have been a result
of the influence of gene X?
• What are all the gene-gene interactions present among
these tissue samples?
• Which genes best differentiate these two groups of
tissues?
• Which gene-gene interactions best differentiate these
two groups of tissue samples?
• DIFFERENT ALGORITHMS ARE BETTER SUITED TO
ANSWERING SOME OF THESE QUESTIONS THAN
OTHERS.
Bioinformatics algorithms and
some known uses -- Unsupervised
• Feature determination: Determining genes with interesting
properties, without looking for a particular pattern determined a
priori.
– Principal component analysis: determine the genes that explain the majority
of the variance in the data set (see the sketch after this list).
• Cluster determination: Determine groups of genes or samples with
similar patterns of gene expression.
– Nearest neighbour clustering: the number of clusters is decided first, the
clusters are calculated, and each gene is assigned to a single cluster.
• Self-organizing maps
– Tamayo et al. used it to functionally cluster genes into various patterned time
courses in HL-60 cell macrophage differentiation.
– Törönen et al. used a hierarchical SOM to cluster yeast genes responsible for the
diauxic shift.
• K-means clustering
– Soukas et al. used it to cluster genes involved in leptin signalling.
– Tavazoie et al. used it to cluster genes with common regulatory sequences.
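To make the PCA bullet above concrete, the sketch below applies principal component analysis to a gene-by-sample expression matrix and ranks genes by how strongly they load on the first component. The random matrix, the gene labels, and the use of a single component are illustrative assumptions, not data from the case studies.

```python
import numpy as np

# Hypothetical expression matrix: rows = genes, columns = samples (e.g., time points).
rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 48))                 # 100 genes x 48 samples (illustrative)
gene_ids = [f"gene_{i}" for i in range(expr.shape[0])]

# Center each gene's profile and take the SVD of the centered matrix.
centered = expr - expr.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of total variance explained by each principal component.
explained = (S ** 2) / np.sum(S ** 2)
print("Variance explained by PC1:", round(float(explained[0]), 3))

# Genes with the largest |loading| on PC1 contribute most to the dominant pattern.
top = np.argsort(-np.abs(U[:, 0]))[:10]
print("Top genes on PC1:", [gene_ids[i] for i in top])
```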
k-means clustering: basic idea
• Input: n objects (or points) and a number k
• Output: a set of k clusters that minimizes the squared-error
criterion (sum of squared errors E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p − m_i|²,
where m_i is the mean of cluster C_i).
• Algorithm – complexity is O(nkt), where t = #iterations
– Arbitrarily choose k objects as the initial cluster centers
– Repeat
– (Re)assign each object to the cluster to which the object is the
most similar based on the mean value of the objects in the
cluster.
– Update the cluster means (i.e., calculate the mean value of the
objects in each cluster).
– Until no change.
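A minimal NumPy sketch of the k-means loop just described: arbitrary initial centers, reassignment of each object to the nearest mean, mean update, and termination when assignments no longer change. The toy 2-D points and the choice k = 3 are illustrative assumptions.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on an (n_objects x n_features) array; returns labels and means."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster whose mean is closest.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # until no change
        labels = new_labels
        # Update the cluster means (empty clusters keep their old center).
        centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Illustrative use on random 2-D points (not real expression data).
pts = np.random.default_rng(1).normal(size=(60, 2))
labels, means = kmeans(pts, k=3)
sse = float(((pts - means[labels]) ** 2).sum())   # the squared-error criterion above
print(np.bincount(labels, minlength=3), round(sse, 2))
```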
Pluses and minuses of k-means
• Pluses: Low complexity
• Minuses
– Mean of a cluster may not be easy to define (data with
categorical attributes)
– Necessity of specifying k
– Not suitable for discovering clusters of non-convex shape or of
very different sizes
– Sensitive to noise and outlier data points (a small number of
such data can substantially influence the mean value)
– Some of the above objections (especially the last one) can be
overcome by the k-medoid algorithm.
• Instead of the mean value of the objects in a cluster as a reference
point, the medoid can be used, which is the most centrally located
object in a cluster.
K-Medoid clustering
• Input: Set of objects (often points in a multi-dimensional space)
• Output: These objects clustered into k clusters
• Algorithm: complexity O(n²) when k ≪ n, per iteration
– Select arbitrarily k representative objects.
– Mark these objects as selected and mark the remaining objects as non-selected.
– Repeat
– For all selected objects O_i do (complexity O(k(n−k)n))
• For all non-selected objects O_h compute C_ih, where C_ih denotes the total
cost of swapping i with h, i.e., C_ih = Σ_j C_jih, where C_jih = d_hj − d_ij,
and d_ij denotes d(O_i, O_j), the distance between O_i and O_j.
– Select i_min, h_min such that C_imin,hmin = min_{i,h} C_ih (complexity O(k(n−k)))
– If C_imin,hmin < 0 then mark O_imin as non-selected and O_hmin as selected.
– Until no change
– The selected objects now define the clusters.
• A non-selected object O_j belongs to the cluster represented by a selected object O_i
if d(O_j, O_i) = min_{O_e} d(O_j, O_e), where the minimum is taken over all selected
objects O_e.
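A compact sketch of the swap-based k-medoid idea above, working from a precomputed distance matrix. For simplicity it evaluates each candidate swap by recomputing the total distance of all objects to their nearest medoid, rather than through the incremental C_jih bookkeeping on the slide; the random points and k = 2 are illustrative assumptions.

```python
import numpy as np

def total_cost(dist, medoids):
    """Sum over all objects of the distance to the nearest selected medoid."""
    return dist[:, medoids].min(axis=1).sum()

def k_medoids(dist, k, seed=0):
    """Swap-based k-medoid clustering on an (n x n) distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial representatives
    while True:
        current = total_cost(dist, medoids)
        best_delta, best_swap = 0.0, None
        # Consider swapping every selected object O_i with every non-selected object O_h.
        for i_pos in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids[:i_pos] + [h] + medoids[i_pos + 1:]
                delta = total_cost(dist, candidate) - current   # analogue of C_ih
                if delta < best_delta:
                    best_delta, best_swap = delta, (i_pos, h)
        if best_swap is None:          # no swap with negative cost: stop
            break
        medoids[best_swap[0]] = best_swap[1]
    # Each non-selected object belongs to the cluster of its nearest medoid.
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Illustrative use on Euclidean distances between random 2-D points.
pts = np.random.default_rng(2).normal(size=(40, 2))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
meds, labels = k_medoids(dist, k=2)
print("medoid indices:", meds)
```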
Self organizing maps
• A neural network algorithm that has been used for a wide variety of
applications, mostly for engineering problems but also for data
analysis.
• SOM can be used at the same time both to reduce the amount of
data by clustering, and for projecting the data nonlinearly onto a
lower-dimensional display.
• SOM vs k-means
– In the SOM, the distance of each input from all of the reference vectors
(not just the closest one) is taken into account, weighted by the
neighborhood kernel h. Thus, the SOM functions as a conventional
clustering algorithm when the width of the neighborhood kernel is zero.
– Whereas in the k-means clustering algorithm the number k of clusters
should be chosen according to the number of clusters in the data, in the
SOM the number of reference vectors can be chosen to be much larger,
irrespective of the number of clusters. The cluster structures then
become visible on the special displays.
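A minimal sketch of a one-dimensional self-organizing map, to make the neighborhood-kernel point concrete: each input updates not only its best-matching reference vector but also neighboring ones, weighted by a Gaussian kernel h whose width shrinks over training; as the width approaches zero only the winner moves, which is the k-means-like limit noted above. The grid size, learning rate, and width schedule are illustrative assumptions, not a prescribed SOM configuration.

```python
import numpy as np

def train_som(data, n_units=10, n_epochs=20, lr=0.3, sigma=2.0, seed=0):
    """Train a 1-D SOM whose reference vectors sit on a line of n_units nodes."""
    rng = np.random.default_rng(seed)
    # Initialize the reference vectors to randomly chosen data points.
    weights = data[rng.choice(len(data), size=n_units, replace=False)].copy()
    grid = np.arange(n_units)                       # node coordinates on the 1-D map
    for epoch in range(n_epochs):
        # Shrink the neighborhood kernel width as training proceeds.
        width = sigma * (1.0 - epoch / n_epochs) + 1e-3
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # best-matching unit
            # Gaussian neighborhood kernel h: units near the winner update more strongly.
            h = np.exp(-((grid - bmu) ** 2) / (2 * width ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights

# Illustrative use: map random 2-D "expression profiles" onto 10 map units.
data = np.random.default_rng(3).normal(size=(200, 2))
refs = train_som(data)
print(refs.shape)   # (10, 2) reference vectors
```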
Bioinformatics algorithms and some
known uses – Unsupervised; cont.
• Cluster determination (cont.)
– Agglomerative clustering: bottom-up method, where each gene
starts in its own cluster and clusters are successively merged
into larger ones (see the sketch at the end of this section)
• Dendrograms: groups are defined as sub-trees in a
phylogenetic-type tree created by a comprehensive pair-wise
dissimilarity measure.
• 2-D Dendrograms
– Divisive or partitional clustering: top-down method,
where large clusters are successively broken down
into smaller ones, until each sub-cluster contains only
one object (gene)
• Dendrograms and 2-D Dendrograms.
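To make the agglomerative/dendrogram idea above concrete (referenced in the agglomerative-clustering bullet), the sketch below builds a bottom-up hierarchical clustering of hypothetical gene profiles from a pairwise dissimilarity measure using SciPy. The correlation-based distance, average linkage, and the cut into four clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical expression matrix: rows = genes, columns = conditions.
expr = np.random.default_rng(4).normal(size=(30, 12))

# Comprehensive pairwise dissimilarity: 1 - Pearson correlation between gene profiles.
dissim = pdist(expr, metric="correlation")

# Bottom-up (agglomerative) clustering; the linkage matrix encodes a dendrogram.
tree = linkage(dissim, method="average")

# Groups correspond to sub-trees; here the tree is cut into 4 clusters.
labels = fcluster(tree, t=4, criterion="maxclust")
print(np.bincount(labels)[1:])   # sizes of the 4 clusters
```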