6-Clustering_and_Stats(Ch16)


Analyzing Expression Data:
Clustering and Stats
Chapter 16
Goals
• We’ve measured the expression of genes or
proteins using the technologies discussed
previously.
• What can we do with that information?
– Identify significant differences in expression
– Identify similar patterns of expression
(clustering)
Analysis steps
1. Data normalization
2. Statistical Analysis
3. Cluster Analysis
I. Data Normalization
• Why normalize?
– Removes systematic errors
– Makes the data easier to analyze statistically
Sources of Error
• Measurements always contain errors.
– Systematic (oops)
– Random (noise!)
• Subtracting the background level can remove some
systematic error
– Using the ratio in two-channel experiments does this
– Subtracting the overall average intensity can be used with one-channel data.
• Taking averages over replicates of the experiment reduces
the random error.
• Advanced error models are mentioned on p. 628 and
covered in “Further Reading”.
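As a rough sketch of the corrections above (the arrays fg, bg, and replicates are hypothetical example data, not from the text):

```python
import numpy as np

# Hypothetical raw data: foreground and local background intensities for
# each spot, plus several replicate measurements of the same experiment.
fg = np.array([1200.0, 850.0, 4300.0, 300.0])
bg = np.array([150.0, 140.0, 160.0, 155.0])
replicates = np.array([[1050.0, 710.0, 4140.0, 145.0],
                       [ 990.0, 745.0, 4310.0, 150.0],
                       [1100.0, 690.0, 4050.0, 160.0]])

# Subtracting the background removes part of the systematic error.
corrected = fg - bg

# Averaging over replicates of the experiment reduces the random error.
mean_signal = replicates.mean(axis=0)
print(corrected, mean_signal)
```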
Expression data usually not
Gaussian (normal)
• Many statistical tests
assume that the data is
normally distributed.
• Expression microarray
spot intensity data (for
example) is not.
• Intensity ratio data (two-channel) is not normal
either.
• Both range from 0 to infinity,
whereas normally distributed data
is symmetric about its mean.
Taking the logarithm helps
normalize expression ratio data
• The expression ratio
plotted versus the
expression level
(geometric mean) in both
channels.
• Plotting the log ratio vs.
the log expression level
gives data that is centered
around y=0 and fairly
“normal looking”.
Taking the log of the expression
ratio “fixes” the left tail
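A minimal sketch of this transformation (the red and green intensity arrays are hypothetical): the log2 ratio is symmetric around 0, so two-fold up- and down-regulation get equal and opposite values.

```python
import numpy as np

# Hypothetical two-channel spot intensities.
red = np.array([2400.0, 610.0, 1500.0, 9200.0])
green = np.array([1200.0, 1220.0, 1450.0, 2300.0])

ratio = red / green          # ranges from 0 to infinity, skewed
log_ratio = np.log2(ratio)   # roughly symmetric around 0

# A two-fold increase and a two-fold decrease are now +1 and -1,
# instead of 2 and 0.5, which makes the distribution more normal-looking.
print(log_ratio)
```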
LOWESS Normalization
• Sometimes there is still a bias that depends on the
expression level.
• This can be removed by a type of regression called
“Locally Weighted Scatterplot Smoothing”.
• This computes and subtracts the mean locally for various
values of expression level (RG).
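One way to sketch this, using the lowess smoother from statsmodels (the intensity data here are simulated, purely for illustration): fit the local trend of the log ratio against the log expression level, then subtract it.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical two-channel intensities R and G for 500 spots.
rng = np.random.default_rng(0)
R = rng.lognormal(mean=7.0, sigma=1.0, size=500)
G = rng.lognormal(mean=7.0, sigma=1.0, size=500)

M = np.log2(R / G)           # log ratio
A = 0.5 * np.log2(R * G)     # log expression level (geometric mean)

# Fit the local mean of M as a function of A and subtract it, so the
# corrected log ratios are centred on zero at every expression level.
trend = lowess(M, A, frac=0.3, return_sorted=False)
M_normalized = M - trend
```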
II. Statistical Analysis
• Determining what differences in expression
are statistically significant
• Controlling false positives
When are two measurements
significantly different?
• We want to say that an expression ratio is significant if it is big enough
(>1) or small enough (<1).
• A two-fold ratio (for example) is only significant if the variances of the
underlying measurements are sufficiently small.
• The significance is related to the area of the overlap of the underlying
distributions.
The Z-test
• If the data is approximately normal, convert it to a Z-score:
– Z = (X − μ) / (σ / √n)
– X can be the mean log expression ratio over the repeats; μ is then 0
– σ is the sample standard deviation; n is the number of repeats
• The Z-score is distributed N(0,1) (standard normal).
• The significance level is the area in the tail(s) of the
standard normal distribution.
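A sketch of this calculation with SciPy (the log_ratios array is hypothetical replicate data; the null hypothesis is μ = 0):

```python
import numpy as np
from scipy import stats

# Hypothetical log2 expression ratios from repeated measurements of one gene.
log_ratios = np.array([0.9, 1.2, 0.7, 1.1, 0.8])

mu0 = 0.0                          # null hypothesis: no change in expression
sigma = log_ratios.std(ddof=1)     # sample standard deviation
n = len(log_ratios)

z = (log_ratios.mean() - mu0) / (sigma / np.sqrt(n))

# Two-tailed significance level: area in both tails of N(0, 1).
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)
```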
The t-test
• The t-test makes fewer assumptions about the data
than the Z-test
• It can be applied to compare two average
measurements which can have
– Different variances
– Different numbers of observations
• You compute the t-statistic (see pages 654-655)
and then look up the significance level of the
Student's t distribution in a table.
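A sketch with SciPy's two-sample t-test (Welch's variant, which allows unequal variances and unequal numbers of observations, as described above); the two arrays are hypothetical replicate measurements under two conditions:

```python
import numpy as np
from scipy import stats

# Hypothetical log expression levels of one gene under two conditions,
# with different numbers of replicates.
control = np.array([7.1, 7.4, 6.9, 7.2])
treated = np.array([8.0, 8.3, 7.9, 8.5, 8.1])

# equal_var=False gives Welch's t-test, which does not assume that the
# two groups have the same variance.
t_stat, p_value = stats.ttest_ind(control, treated, equal_var=False)
print(t_stat, p_value)
```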
III. Cluster Analysis
• Similar expression patterns
– Groups of genes/proteins with similar
expression profiles
• Similar expression sub-patterns
– Groups of genes/proteins with similar
expression profiles in a subset of conditions
• Different clustering methods
• Assessing the value of clusters
Example: Gene Expression
Profiles
• Expression level of a gene
is measured at different
time points after treating
cells.
• Many different expression
profiles are possible.
– No effect
– Immediate increase or
decrease
– Delayed increase or
decrease
– Transient increase or
decrease
Clustering by Eye
• n genes or proteins
• m different samples (or conditions)
• Represent a gene as a point:
– X = <x1, x2, …, xm>
• If m is 1 or 2 (or even 3) you can plot the points
and look for clusters of genes with similar
expression.
– But what if m is bigger than 3?
– Need to reduce the dimensionality: PCA
Reducing the Dimensionality of Data:
Principal Components Analysis
• PCA linearly maps each point to
a small set of dimensions
(components).
– The principal components are
dimensions that capture the
maximum variation in the data.
• The principal components
capture most of the important
information in the data
(usually).
• Plotting each point’s values in
two of the principal component
dimensions allows us to see
clusters.
[Figure: PCA plot of 2-D gel data]
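As a sketch with scikit-learn (the expression matrix X is hypothetical, one row per gene and one column per sample), projecting onto the first two principal components gives coordinates that can be plotted to look for clusters:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical expression matrix: n genes (rows) x m samples (columns).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))

pca = PCA(n_components=2)
Y = pca.fit_transform(X)           # each gene mapped to 2 dimensions

# Fraction of the total variation captured by the first two components.
print(pca.explained_variance_ratio_)
# Y[:, 0] and Y[:, 1] can now be scatter-plotted to look for clusters.
```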
PCA: An Illustration
Yeast Cell Cycle Gene Expression
• The singular value decomposition (SVD) of a matrix X is
– X = U Σ V^T
• The mapped value of X is
– Y = X V
• The rows of Y give the mapping of each gene.
– Mapped gene i: Yi = <y1, y2, …, ym>
(PNAS, 2000)
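The same projection can be written directly with the SVD, as a sketch with NumPy (X is a hypothetical n-by-m expression matrix; the rows of Vt are the principal directions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))          # n genes x m samples
Xc = X - X.mean(axis=0)                # centre each column

# X = U diag(S) V^T ;  NumPy returns V^T as `Vt`.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project each gene onto the principal directions: Y = X V = U diag(S).
Y = Xc @ Vt.T                          # row i = mapped gene i: <y1, ..., ym>
print(Y[:3, :2])                       # first two components of three genes
```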
Clustering Using Statistics
• Algorithm identifies
groups.
– Example: similar
expression profiles
• Distance measure
between pairs of
points is needed.
Distance Measures Between Pairs
of Points
• In order to cluster the points (genes or conditions),
we need some concept of which points are “close”
to each other.
• So we need a measure of distance (or, conversely,
similarity) between two rows (or columns) of our
n-by-m matrix.
• We can then compute all the pair-wise distances
between rows (or columns).
Standard Distance Measures
• Euclidean Distance
• Pearson Correlation Coefficient
• Mahalanobis Distance
Euclidean Distance
• Standard, everyday distance
– Treats all dimensions equally
– If some genes vary more than others (have higher
variance), they influence the distance more.
Mahalanobis Distance
• The “normalized” Euclidean distance
• Scales each dimension by the variance in that dimension.
– This is useful if the genes tend to vary much more in one sample
than in others, since it reduces the effect of that sample on the
distances.
Pearson Correlation Coefficient
• Distances are small when
two genes have similar
patterns of change even if
the sizes of the changes are
different.
• This is accomplished by
scaling by the sample
variance of the gene’s
expression levels under
different conditions.
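A sketch of the three measures with SciPy (the expression matrix is hypothetical); the Pearson-based distance is small when two profiles rise and fall together, whatever their scale:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(3)
# Hypothetical expression matrix: 50 genes x 5 conditions.
X = rng.normal(size=(50, 5))
g1, g2 = X[0], X[1]

# Euclidean: treats every condition equally.
euclid = distance.euclidean(g1, g2)

# Pearson-based: distance.correlation returns 1 - Pearson r, so it is
# small when two profiles have similar patterns of change.
pearson_dist = distance.correlation(g1, g2)

# Mahalanobis: scales each condition by the (co)variance estimated from
# all genes, reducing the effect of conditions with large spread.
VI = np.linalg.pinv(np.cov(X, rowvar=False))
mahal = distance.mahalanobis(g1, g2, VI)

print(euclid, pearson_dist, mahal)
```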
Choice of Distance Matters
• Hierarchical clustering
(dendrogram) of
tissues.
– Corresponds to
clustering the columns
of the matrix.
• Branches are different
(cancer B/C vs A/B).
Clustering Algorithms
• Hierarchical Clustering
• K-means clustering
• Self-organizing maps and trees
Hierarchical Clustering
• Algorithms progressively merge clusters or split clusters.
– The merging criterion can be single-linkage or complete-linkage.
• Produce dendrograms
– Can be interpreted at different thresholds.
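A sketch with SciPy's agglomerative clustering (the expression matrix is hypothetical); changing the method argument switches between single and complete linkage, and cutting the dendrogram at different thresholds gives different cluster sets:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))            # 30 genes x 6 conditions

# Agglomerative clustering: start with every gene in its own cluster and
# progressively merge the closest pair.  method='single' or 'complete'
# selects the linkage (merging) criterion.
Z = linkage(X, method='complete', metric='euclidean')

# Interpret the dendrogram at a chosen distance threshold.
labels = fcluster(Z, t=4.0, criterion='distance')
print(labels)

# dendrogram(Z) draws the tree when a plotting backend is available.
```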
Types of Linkage
• A. Single Linkage
• B. Complete Linkage
• C. Centroid Method
K-means Clustering
• Related to Expectation Maximization
• You specify the number of clusters
• Iteratively moves the means of the clusters to
maximize the likelihood (minimize total error).
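A sketch with scikit-learn (hypothetical expression matrix; the number of clusters must be specified in advance, as noted above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))           # 100 genes x 6 conditions

# K-means alternates between assigning each gene to its nearest cluster
# mean and moving each mean to the centre of its assigned genes, which
# iteratively reduces the total within-cluster error.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels[:10])
print(km.cluster_centers_.shape)        # one mean profile per cluster
```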