Statistics Tools in GeneSpring
The Center for Bioinformatics
UNC at Chapel Hill
Jianping Jin Ph.D.
Bioinformatics Scientist
Phone: (919)843-6015
E-mail: [email protected]
Fax: (919)966-6821
What Can GeneSpring Do?
• Works with both Affymetrix and two-color
data.
• Views data graphically (classification,
graph, tree, scatter plot, Venn diagram …)
• Performs statistical analyses.
• Annotates genes (updating from GenBank,
LocusLink, Unigene; biochemical
pathways).
• ……
What statistical analyses does GS do?
• Clustering:
• k-means (non-hierarchical)
• Self-organizing map
• Gene trees (hierarchical dendrograms).
• Principal component analysis
• T-test analyses (p-values)
• Like a known gene or average of genes
• Like a pattern drawn with the mouse
• Genes with high confidence
• Genes with relative expression in certain ranges
• Pathway analysis finding genes that fit in a certain place in a
pathway.
• Sequence analysis to automatically find regulatory sequences.
• Automatic functional annotation of sub-trees in dendrograms.
•…
Tree Clustering
1. Standard correlation
2. Smooth correlation
3. Change correlation
4. Upregulated correlation
5. Pearson correlation
6. Spearman correlation
7. Spearman confidence
8. Two-sided Spearman confidence
9. Distance
Notation for the Formulas
• Result: the result of the calculation for genes A
and B.
• n: the number of samples being correlated
over.
• a: the vector (a_1, a_2, a_3, ..., a_n) of expression
values for gene A.
• b: the vector (b_1, b_2, b_3, ..., b_n) of expression
values for gene B.
• a.b = a_1 b_1 + a_2 b_2 + ... + a_n b_n
• |a| = square root(a.a)
Standard Correlation
• Equation: a.b/(|a||b|)
• also called “Pearson correlation around
zero”.
• Measures the angular separation of
expression vectors for genes A & B.
• Answers the question “do the peaks match
up?”
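A minimal Python sketch of the standard correlation, written directly from the equation a.b/(|a||b|). The function names and the sample vectors are illustrative only and are not taken from GeneSpring.

```python
import math

def dot(a, b):
    """a.b = a_1*b_1 + a_2*b_2 + ... + a_n*b_n"""
    return sum(x * y for x, y in zip(a, b))

def standard_correlation(a, b):
    """a.b / (|a||b|): cosine of the angle between two expression vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two profiles whose peaks line up (b is a scaled copy of a).
peaks_match = standard_correlation([1.0, 4.0, 1.0], [2.0, 8.0, 2.0])

# Two profiles with no overlap at all (perpendicular vectors).
peaks_differ = standard_correlation([1.0, 0.0], [0.0, 1.0])
```

Here peaks_match comes out at 1 (up to rounding) and peaks_differ at 0, which is the "do the peaks match up?" reading of the measure.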
Pearson Correlation
• Equation: A.B / ( | A || B | )
• Very similar to the Std correlation, except it
measures the angle b/w the expression vectors for
genes A & B around the mean of the
expression vectors.
• A = vector a with the mean of all elements of a
subtracted from each element.
• Do the same for b to make a vector B.
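A short Python sketch of the Pearson correlation as described on this slide: centre each vector on its own mean, then apply the standard-correlation formula to the centred vectors A and B. The sign convention used for centring does not change the result, because both vectors are treated the same way. Function and variable names are illustrative.

```python
import math

def pearson_correlation(a, b):
    """A.B / (|A||B|), where A and B are a and b centred on their means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    A = [x - mean_a for x in a]
    B = [y - mean_b for y in b]
    dot_ab = sum(x * y for x, y in zip(A, B))
    norm_a = math.sqrt(sum(x * x for x in A))
    norm_b = math.sqrt(sum(y * y for y in B))
    return dot_ab / (norm_a * norm_b)

# Same shape at a different baseline: Pearson treats these as identical.
same_shape = pearson_correlation([1.0, 2.0, 3.0], [11.0, 12.0, 13.0])

# Mirror-image profiles.
opposite_shape = pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A constant vector has zero length after centring, so this sketch would divide by zero for it; real data would need a guard for that case.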
Spearman Confidence
• r = the value of the Spearman correlation,
SC = 1-(probability you would get a value
of r or higher by chance)
• A measure of similarity, not a correlation
• High SC value if there is a high Spearman corr. & a
low p-value.
• Takes account of the number of subexperiments in your experiment set.
Two-sided Spearman Confidence
• A measure of similarity, very similar to the
Spearman conf.
• Two-sided test of whether the Spearman corr. is
either significantly greater than or less than zero.
• “What genes behave similarly/opposite to a
specific gene?”
• Probably not good for k-means/tree clustering.
• 1-(probability you would get a Spearman
correlation of |r| or higher, or -|r| or lower, by
chance).
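The two confidence measures can be illustrated with a small Python sketch. The slides do not say how GeneSpring computes the "by chance" probability, so this sketch assumes an exact permutation test (every reordering of one gene's ranks is treated as equally likely), which is only practical for a handful of samples. All function names are hypothetical.

```python
import itertools
import math

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    numerator = sum(p * q for p, q in zip(dx, dy))
    return numerator / math.sqrt(sum(p * p for p in dx) * sum(q * q for q in dy))

def spearman_confidences(a, b):
    """Return (r, one-sided confidence, two-sided confidence).

    Assumption: the chance probability is taken over all permutations
    of b's ranks, not from whatever approximation GeneSpring uses.
    """
    rank_a, rank_b = ranks(a), ranks(b)
    r = pearson(rank_a, rank_b)  # Spearman corr. = Pearson corr. of the ranks
    tolerance = 1e-12
    total = one_sided = two_sided = 0
    for permuted in itertools.permutations(rank_b):
        r_perm = pearson(rank_a, permuted)
        total += 1
        if r_perm >= r - tolerance:
            one_sided += 1
        if abs(r_perm) >= abs(r) - tolerance:
            two_sided += 1
    return r, 1 - one_sided / total, 1 - two_sided / total

r, sc_one_sided, sc_two_sided = spearman_confidences(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 9.0, 20.0]
)
```

With five samples in perfect rank agreement, only 1 of the 120 orderings reaches r = 1, and 2 of 120 reach |r| = 1, so the two-sided confidence is slightly lower. The same r with fewer samples would give a lower confidence, which is how the measure takes the number of subexperiments into account.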
Distance
• A measurement of dissimilarity, not a
correlation at all.
• Euclidean dist. b/w the expression profiles
(values for each point in N-dimensional
space) of genes A & B.
• Distance = |a-b| / square root(N), where N is
the number of expt. points.
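The distance formula is short enough to write out directly. A Python sketch follows; the function name and sample values are illustrative.

```python
import math

def distance(a, b):
    """|a - b| / sqrt(N): Euclidean distance scaled by the number of points."""
    n = len(a)
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return euclidean / math.sqrt(n)

# Two flat profiles that differ by 2 at every one of 4 points.
d = distance([1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0])

# Identical profiles are at distance 0 (smaller means more similar).
d_same = distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Dividing by the square root of N makes the value a root-mean-square difference per point, so d is 2 here regardless of how many points the profiles have.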
Special Case Correlations
• Smooth correlation, Change correlation and
Upregulated correlation.
• All three are modified versions of the Std.
correlation.
• Only make sense when the data are in a sequence,
such as “before”/“after”, a time series, or a
drug series.
Smooth Correlation
• Make a new vector A from a by
interpolating the avg. of each consecutive
pair of elements of a.
• Insert this new value b/w the old values
• Do this for each pair of elements that would
be connected by a line in the graph screen.
• Do the same to make a vector B from b.
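A Python sketch of how the smoothed vector could be built. It assumes every consecutive pair of points is connected by a line in the graph; the function name is hypothetical.

```python
def smooth_vector(a):
    """Insert the average of each consecutive pair between the old values."""
    if not a:
        return []
    smoothed = []
    for left, right in zip(a, a[1:]):
        smoothed.append(left)
        smoothed.append((left + right) / 2)  # interpolated value
    smoothed.append(a[-1])
    return smoothed

A = smooth_vector([1.0, 3.0, 2.0])
```

For the input above, A is [1.0, 2.0, 3.0, 2.5, 2.0]. The smooth correlation is then the standard correlation applied to the smoothed vectors A and B.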
Change Correlation
• The opposite of what the Smooth corr. looks
for: considers only the chg. in expression level b/w
adjacent points.
• Similar to the Std corr., but uses an arc
tangent transformation of the ratio b/w adjacent
pairs of points to create the expr. vector.
Less sensitive to outliers than using the ratio
directly.
• The value created b/w two values a_i and
a_{i+1} is atan(a_{i+1}/a_i) - π/4.
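The transformation can be sketched in Python as follows, assuming all expression values are positive so that the ratio is defined. The function name is hypothetical.

```python
import math

def change_vector(a):
    """atan(a_{i+1} / a_i) - pi/4 for each pair of adjacent points."""
    return [math.atan(nxt / cur) - math.pi / 4 for cur, nxt in zip(a, a[1:])]

# No change, then a doubling.
A = change_vector([1.0, 1.0, 2.0])

# A very large jump stays bounded, which is the point of using atan.
outlier = change_vector([1.0, 1000.0])[0]
```

Since atan(1) = π/4, an unchanged level maps to 0, increases map to positive values and decreases to negative ones. However large the ratio, the transformed value stays below π/4, so a single outlier cannot dominate the correlation.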
Upregulated Correlation
• Very similar to the Chg. Corr., but it only
considers positive changes. All negative
values for the arc tangent are set to zero.
• Make a new vector A from a by looking at
the change b/w each pair of elements of a.
• The value created b/w two values a_i and
a_{i+1} is max(atan(a_{i+1}/a_i) - π/4, 0).
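A Python sketch of the upregulated transformation, again assuming positive expression values. It differs from the change transformation only in clamping negative values to zero. The function name is hypothetical.

```python
import math

def upregulated_vector(a):
    """max(atan(a_{i+1} / a_i) - pi/4, 0) for each pair of adjacent points."""
    return [
        max(math.atan(nxt / cur) - math.pi / 4, 0.0)
        for cur, nxt in zip(a, a[1:])
    ]

# A rise followed by a fall: only the rise survives.
A = upregulated_vector([1.0, 2.0, 1.0])
```

The first element of A is positive (the doubling) and the second is 0.0 (the decrease is discarded).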
Algorithm to Build Gene Tree
1. Determine if there is only one gene or subtree
left. If yes, go to step five.
2. Find the two closest genes/subtrees.
3. Merge these two into one subtree.
4. Return to step one.
5. Merge together branches where the distance
between sub-branches is less than the separation
ratio, subject to considering genes with less than
the minimum distance apart.
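The merging loop (the first four steps) can be sketched in Python. This is only an illustration: it uses the Distance measure as the similarity, it assumes the distance between two subtrees is the average distance over all gene pairs (the slides do not say how GeneSpring measures this), and it leaves out the final branch-merging step that depends on the separation ratio and minimum distance. All names are hypothetical.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def leaves(node):
    """A node is either a gene name (str) or a pair of sub-nodes."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def subtree_distance(n1, n2, profiles):
    # Assumption: average distance over all gene pairs across the two subtrees.
    pairs = [(g, h) for g in leaves(n1) for h in leaves(n2)]
    return sum(distance(profiles[g], profiles[h]) for g, h in pairs) / len(pairs)

def build_gene_tree(profiles):
    """profiles: non-empty dict mapping gene name -> expression vector."""
    nodes = list(profiles)
    while len(nodes) > 1:  # stop when one subtree is left
        # Find the two closest genes/subtrees.
        i, j = min(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda ij: subtree_distance(nodes[ij[0]], nodes[ij[1]], profiles),
        )
        # Merge them into one subtree, then repeat.
        merged = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

tree = build_gene_tree({
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.0, 2.0, 3.1],
    "g3": [9.0, 9.0, 9.0],
})
```

For these three made-up genes, g1 and g2 are merged first and g3 joins last, so tree is ("g3", ("g1", "g2")).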
Algorithm to Build Tree
• The minimum distance: how far down the
tree discrete branches are depicted. Higher
number, more genes in a group, less
specific.
• The separation ratio: the correlation diff. b/w
groups of clustered genes. B/w 0 and 1.
Increasing separation increases the
branchiness of the tree.
Principal Components Analysis
• Not a clustering method.
• PCs are the most abundant building blocks: a
set of expression patterns.
• The 1st PC is obtained by finding the linear
combination of expr. patterns that accounts for the most
variability in the data, and so on for later PCs.
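One way to see what "the linear combination that accounts for the most variability" means is to compute a first principal component by hand. The Python sketch below uses power iteration on the covariance matrix, a simple textbook method chosen here for brevity; it is not a description of how GeneSpring computes PCA. Rows are genes, columns are conditions, and all names are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Unit vector along the direction of greatest variance in the data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Arbitrary starting vector; repeated multiplication by the covariance
    # matrix pulls it towards the dominant eigenvector.
    v = [1.0 + 0.1 * j for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Made-up data in which both conditions rise and fall together.
pc1 = first_principal_component([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
```

For this data pc1 weights both conditions equally (about 0.707 each), reflecting that all of the variability lies along the "both go up together" pattern. The sketch does not handle data with zero variance or a starting vector that happens to be perpendicular to the first PC.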
k-Means Clustering
• Divides genes into a user-defined # (k) of
equal-sized groups, based on their
expression patterns.
• Creates centroids at the avg. location of
each group of genes
• With each iteration, genes are reassigned to
the group with closest centroid
• After all of the genes have been reassigned,
the location of the centroids is recalculated.
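The iteration described above can be sketched in Python. The sketch assumes Euclidean distance as the similarity measure and a round-robin split for the initial equal-sized groups; neither choice is stated in the slides, and all names are hypothetical.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(profiles, k, max_iterations=100):
    """Return the group index (0..k-1) assigned to each profile."""
    # Initial split into k roughly equal-sized groups (round-robin).
    assignment = [i % k for i in range(len(profiles))]
    for _ in range(max_iterations):
        # Centroids sit at the average location of each group's genes.
        centroids = []
        for g in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == g]
            centroids.append(centroid(members) if members else None)
        # Reassign every gene to the group with the closest centroid.
        new_assignment = [
            min(
                (g for g in range(k) if centroids[g] is not None),
                key=lambda g: euclidean(p, centroids[g]),
            )
            for p in profiles
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
    return assignment

groups = k_means([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]], k=2)
```

The initial split here pairs each low profile with a high one; after one round of reassignment the two low profiles share one group and the two high profiles the other, so groups is [0, 0, 1, 1].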
Self-Organizing Maps
• Similar to k-means clustering.
• Shows the relationship b/w groups in a 2-D map.
• Best represents the variability of the data,
while still maintaining similarity b/w
adjacent nodes, e.g. point 1,2 is one unit
away from 1,3.
What does t-test mean in GS?
• Replicates: one-sample Student’s t-test
• Comparisons for 2 groups: Student’s two-sample
t-test.
• Comparisons for multiple groups: one-way
analysis of variance (ANOVA).
• Filtering genes: based on a one-sample t-test of the
mean expression level across replicates vs. a
reference value (Expression Percentage
Restriction)
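For reference, the t statistics behind the one-sample and two-sample tests can be written out in a few lines of Python. This sketch computes only the statistic; turning it into a p-value needs the t distribution, which is omitted here. The two-sample version uses the pooled-variance (Student's) form. Function names, the reference value and the sample data are illustrative.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def one_sample_t(xs, reference):
    """t for the mean of replicate values against a reference value."""
    standard_error = math.sqrt(sample_variance(xs) / len(xs))
    return (mean(xs) - reference) / standard_error

def two_sample_t(xs, ys):
    """Student's two-sample t with pooled variance."""
    n1, n2 = len(xs), len(ys)
    pooled = (
        (n1 - 1) * sample_variance(xs) + (n2 - 1) * sample_variance(ys)
    ) / (n1 + n2 - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Replicates of one gene tested against a hypothetical reference of 1.0.
t_one = one_sample_t([1.0, 2.0, 3.0], reference=1.0)

# The same gene in two groups of samples.
t_two = two_sample_t([1.0, 2.0, 3.0], [3.0, 4.0, 5.0])
```

A larger absolute t means the difference is large relative to the replicate-to-replicate noise, which is what a low p-value reflects.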
Filter Genes Analysis Tools
• Global Error Model: filters out genes with
large std deviations or error values.
• Raw data filtering: gets rid of genes too
close to the background.
• Sample to sample comparison: fold cmp.
among different samples.
• Statistical Group cmp.: filters out genes that do not
vary significantly across different groups.
• Data File Restriction: based on other fields (
P/S call, +/- pairs).
Statistical Group Comparison
• Finds genes with a statistically significant difference in the
mean expression levels across all groups.
• For two groups: Student’s two-sample t-test.
• For multiple groups: ANOVA
• Non-parametric cmp.: for each gene, the rank
order is used for analysis. Wilcoxon two-sample
test (Mann-Whitney U test), the Kruskal-Wallis
test for multiple groups.
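As an illustration of a rank-based comparison, the Mann-Whitney U statistic for two groups can be computed by counting, over all pairs, how often a value from the first group exceeds a value from the second. This gives the same U as the rank-sum formulation. The sketch computes only the statistic, not a p-value, and the function name is hypothetical.

```python
def mann_whitney_u(xs, ys):
    """Number of (x, y) pairs with x > y, counting ties as one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Complete separation: every value in the first group is larger.
u_high = mann_whitney_u([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])

# The same comparison seen from the other group.
u_low = mann_whitney_u([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])
```

Only the ordering of the values matters, so an extreme outlier has no more influence than any other highest-ranked value. The two U values always add up to the number of pairs (here 9).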
Data Normalization
• In two-color experiments, normalizing vs. the
control channel (green) for each gene.
• Normalize each sample to itself or to a positive
control. Make diff. samples comparable to one
another.
• Normalizing each gene to itself: removes the
differing intensity scales from multiple expt
readings (highly recommended if not using a two-color experiment).
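A Python sketch of per-sample and per-gene normalization on a small matrix (rows are genes, columns are samples). It assumes the median is used as the normalizing constant in both cases; the slides do not state which statistic GeneSpring uses. Names and data are illustrative.

```python
import statistics

def normalize_per_sample(matrix):
    """Divide each column (sample) by that sample's median."""
    n_samples = len(matrix[0])
    medians = [
        statistics.median(row[j] for row in matrix) for j in range(n_samples)
    ]
    return [[row[j] / medians[j] for j in range(n_samples)] for row in matrix]

def normalize_per_gene(matrix):
    """Divide each row (gene) by that gene's median across samples."""
    return [[v / statistics.median(row) for v in row] for row in matrix]

# Made-up raw intensities: the second sample is twice as bright overall.
raw = [
    [100.0, 200.0],
    [300.0, 600.0],
    [500.0, 1000.0],
]
by_sample = normalize_per_sample(raw)
by_gene = normalize_per_gene(by_sample)
```

After per-sample normalization the two columns are identical, so the brightness difference between samples is gone. Per-gene normalization then puts every gene on a scale centred on 1, whatever its absolute intensity.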
NCI-60 cell lines
DrugActivity_AT