Gene Expression
Data Analyses (3)
Trupti Joshi
Computer Science Department
317 Engineering Building North
E-mail: [email protected]
573-884-3528(O)
Lecture Outline -1

Statistical significance vs. biological relevance
Statistical methods
 Two-sample statistical tests
  Parametric: t test (paired and unpaired)
  Non-parametric:
   Mann-Whitney test for independent samples
   Wilcoxon signed-rank test for paired data
 Multivariate statistics
  One-way vs. two-way analysis of variance (ANOVA)
  Kruskal-Wallis
 Multiple comparison corrections
  Bonferroni correction
  False discovery rate
Lecture Outline -2

Data interpretation
Selection of software
 Image analysis
  Imagene
 Statistical analysis
  GeneSpring
  SAM
  ArrayStat
Why Statistical Analysis?

Rank results by confidence using significance metrics (e.g. p-value)

Estimate the rates of false positives (Type I errors) and false negatives (Type II errors)

Achieve the desired balance of sensitivity and specificity

Leaves a certain amount of flexibility (and arbitrariness) when interpreting the significance metrics generated by a test
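
The trade-off between sensitivity and specificity can be made concrete with a small worked example. The sketch below is illustrative only: the function names and all gene counts are hypothetical and do not come from the lecture.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truly changed genes that the test detects."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of unchanged genes that the test correctly leaves alone."""
    return true_neg / (true_neg + false_pos)


# Made-up outcome of one test at one p-value cutoff:
# 1000 genes on the array, 100 of which truly change.
true_pos, false_neg = 80, 20     # the 100 truly changed genes
false_pos, true_neg = 40, 860    # the 900 unchanged genes

print(f"sensitivity = {sensitivity(true_pos, false_neg):.3f}")
print(f"specificity = {specificity(true_neg, false_pos):.3f}")
```

Loosening the cutoff would usually raise sensitivity (fewer Type II errors) at the cost of specificity (more Type I errors), which is the balance the slide refers to.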
Statistical Significance vs.
Biological Relevance
Normal Distribution
 Central peak: mean
 Symmetrical
Parametric Analysis

Test the hypothesis that one or more treatments have no effect on the mean and variance of a chosen variable

Assumes the data (e.g. yield) follow a normal distribution

Disadvantage: the results are unreliable if the data are not normally distributed
Non-parametric Analysis

Uses the ranks of numerical data rather than the data themselves

Uses information about the relative sizes of observations, without making any assumptions about the means and variances of the populations being tested

Can be used for any data set, whatever its distribution

Disadvantage: if the data are normally distributed, it is less powerful than parametric analysis
T test
 Paired t test:
  The two groups must be the same size, because each observation in one group is matched with one in the other
  Compares the same organism before and after treatment (e.g. before and after heat shock)
 Unpaired t test:
  The two groups do not need to be the same size
  Compares organisms that received the treatment with organisms that did not
How to Perform T test
Paired T-test
Un-Paired T-test
T-test example
Paired T test
Unpaired T test
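
The worked formulas from these slides did not survive in the transcript, so the following is a minimal sketch of the standard textbook t statistics, not necessarily the exact form shown in the lecture. The unpaired version assumes equal variances (pooled variance); the function names and the expression values are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(before, after):
    """t statistic and degrees of freedom for matched pairs."""
    if len(before) != len(after):
        raise ValueError("paired samples must be the same size")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def unpaired_t(x, y):
    """Pooled-variance t statistic and df for two independent groups."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical expression values.
before = [10, 12, 9, 11]          # same organisms before heat shock
after = [12, 15, 10, 14]          # ... and after heat shock
control = [10, 12, 9, 11]         # untreated organisms
treated = [14, 15, 13, 16, 12]    # treated organisms (different group size)

t_paired, df_paired = paired_t(before, after)
t_unpaired, df_unpaired = unpaired_t(control, treated)
print(f"paired:   t = {t_paired:.3f}, df = {df_paired}")
print(f"unpaired: t = {t_unpaired:.3f}, df = {df_unpaired}")
```

The statistic is then compared with a t distribution with the stated degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` and `scipy.stats.ttest_ind` return the p-value directly.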
Mann-Whitney Test
 Use if the samples are not normally distributed
 Similar to the unpaired t test, but non-parametric
 Uses the ranks of the numerical values instead of their means and variances

Mann-Whitney Test-example
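
The example from this slide is missing from the transcript. As a stand-in, here is a small sketch of how the U statistic can be computed from pooled ranks. The helper names and the data are hypothetical, and the sketch stops at the statistic rather than computing a p-value.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y); the smaller value is the usual test statistic."""
    ranks = average_ranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    return u_x, n1 * n2 - u_x


# Made-up measurements from two independent groups.
x = [1.1, 2.7, 0.9, 1.7]
y = [3.2, 2.9, 4.1, 2.5, 3.8]
u_x, u_y = mann_whitney_u(x, y)
print(f"U_x = {u_x}, U_y = {u_y}, U = {min(u_x, u_y)}")
```

Only the ordering of the values matters here: replacing 4.1 with 41 would leave U unchanged. With SciPy installed, `scipy.stats.mannwhitneyu` also returns a p-value.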
Wilcoxon Signed-Rank Test

Use if the samples are not normally distributed
Similar to the paired t test, but non-parametric
Rank the absolute differences between the paired arrays
If the difference within a pair is 0, that pair is not used
If two pairs have the same absolute difference, each is given the average of the ranks they would otherwise occupy
Compare the test statistic with a Wilcoxon table
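
The ranking rules above (drop zero differences, average the ranks of ties) can be sketched as follows. This is an illustration under simple assumptions, with a hypothetical function name and made-up data; it returns the rank sums only and leaves the table lookup to the reader.

```python
def signed_rank_statistic(before, after):
    """Return (W_plus, W_minus, n), where n counts the non-zero differences."""
    # Pairs whose difference is 0 are not used.
    diffs = [a - b for a, b in zip(after, before) if a != b]
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank_of(value):
        # Average of the 1-based positions that `value` occupies in abs_sorted.
        first = abs_sorted.index(value) + 1
        count = abs_sorted.count(value)
        return first + (count - 1) / 2

    w_plus = sum(rank_of(abs(d)) for d in diffs if d > 0)
    w_minus = sum(rank_of(abs(d)) for d in diffs if d < 0)
    return w_plus, w_minus, len(diffs)


# Hypothetical paired measurements (e.g. the same genes on two arrays).
before = [10, 12, 9, 11, 8, 13]
after = [12, 15, 9, 10, 11, 15]

w_plus, w_minus, n = signed_rank_statistic(before, after)
print(f"W+ = {w_plus}, W- = {w_minus}, n = {n}, W = {min(w_plus, w_minus)}")
```

The smaller of the two rank sums, together with n, is what is looked up in a Wilcoxon table. `scipy.stats.wilcoxon` offers a ready-made version that also reports a p-value.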
ANOVA (Analysis of Variance)
A parametric test
 Assumes a normal distribution
 The variances of the groups must be equal
 The data points in each group must come from independent samples
 With only two groups, ANOVA is equivalent to the unpaired, equal-variance t test

Perform ANOVA
 Two estimates of variance are taken:
  Within groups: based on the standard deviation of each group
  Among groups: based on the variability between the means of the groups
 The test statistic F is the ratio of the among-group estimate to the within-group estimate
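
The two variance estimates can be sketched in a few lines. This is a minimal illustration of the standard one-way ANOVA F ratio, not the lecture's own worked example; the function name and the data are hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Among-group variability: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: spread of the values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Made-up measurements for three treatment groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova(groups)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F means the group means differ more than the within-group scatter would explain. `scipy.stats.f_oneway` computes the same statistic along with its p-value.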
One-Way ANOVA
One-Way ANOVA-example
Two-Way ANOVA
Two-Way ANOVA-example
Kruskal-Wallis

Non-parametric equivalent of one-way ANOVA
The test statistic is compared with a chi-square distribution with k-1 degrees of freedom, where k is the number of groups
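
As a rough illustration, the Kruskal-Wallis statistic H can be computed from the rank sums of the pooled data. The sketch below assumes there are no tied values (it raises an error otherwise) and uses a hypothetical function name and made-up data.

```python
def kruskal_wallis_h(groups):
    """Return (H, df) for a list of groups, assuming no tied values."""
    pooled = sorted(x for g in groups for x in g)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle tied values")
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # 1-based ranks
    n_total = len(pooled)

    # Sum over groups of (rank sum squared / group size).
    rank_term = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    return h, len(groups) - 1


# Made-up measurements for k = 3 groups.
groups = [[1.2, 2.1, 3.3], [2.5, 3.8, 4.4], [5.1, 6.7, 7.9]]
h_stat, df = kruskal_wallis_h(groups)

# 5.991 is the 0.05 critical value of chi-square with 2 degrees of freedom.
print(f"H = {h_stat:.3f}, df = {df}, significant at 0.05: {h_stat > 5.991}")
```

For data with ties, or to get a p-value for any k, `scipy.stats.kruskal` is the more practical choice.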
Multiple Comparison Corrections

As the number of comparisons increases, the number of results declared significant also increases.

The number of false positives (Type I errors) may increase as well.

To fix this problem, some sort of adjustment of the p-values or α-levels is needed.

[Table: Increased likelihood of generating a Type I error by performing multiple pair-wise comparisons. k = the number of groups; K = the number of comparisons that are necessary; each subsequent column represents the chosen level of significance.]
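
The table itself is not in the transcript, but the kind of numbers it would contain can be reproduced with a short sketch. It assumes that K counts all pair-wise comparisons among k groups and that the comparisons are independent, which is a simplification; the function names are hypothetical.

```python
def pairwise_comparisons(k):
    """Number of pair-wise comparisons K among k groups."""
    return k * (k - 1) // 2


def familywise_error(alpha, n_comparisons):
    """Chance of at least one Type I error, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


print(" k   K   alpha=0.05  alpha=0.01")
for k in (2, 3, 5, 10):
    big_k = pairwise_comparisons(k)
    row = [familywise_error(alpha, big_k) for alpha in (0.05, 0.01)]
    print(f"{k:2d}  {big_k:2d}   {row[0]:.3f}       {row[1]:.3f}")
```

With five groups there are already ten comparisons, and at a per-comparison level of 0.05 the chance of at least one false positive is about 40%.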
Bonferroni Correction

The cut-off level of significance is divided by the number of comparisons being made.

Instead of testing each hypothesis at level α, test each at level α/m, where m is the number of hypotheses.

Good for a small number of comparisons

May be too conservative
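
A minimal sketch of the rule, with a hypothetical function name and made-up p-values:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag each hypothesis whose p-value is at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Made-up p-values for m = 5 hypotheses; the per-test level is 0.05 / 5 = 0.01.
p_values = [0.001, 0.008, 0.012, 0.041, 0.2]
rejected = bonferroni_reject(p_values)
print(rejected)
```

Two p-values (0.012 and 0.041) would pass an uncorrected 0.05 cutoff but fail the corrected one. With thousands of genes on an array the per-test level becomes very small, which is why the correction may be too conservative.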
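False Discovery Rate

Multiple testing corrections such as Bonferroni control Prob(V ≥ 1), where V is the number of falsely rejected null hypotheses

When the number of tests m is huge, this control is so strict that missed true effects (Type II errors) are likely to occur

Better to control the false discovery rate

Intuitive definition of false discovery rate: the expected proportion of false discoveries among all discoveries, FDR = E[V / R], where R is the total number of rejected hypotheses

Compared to Bonferroni:
 Bonferroni: fix the error rate, then estimate the rejection region
 FDR: fix the rejection region, then estimate the error rate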
Two Algorithms for FDR

Benjamini and Hochberg:
 The rate at which false discoveries occur
 Fix a cutoff α*, and then derive a decision rule that achieves FDR ≤ α*

Storey:
 The rate at which discoveries are false
 Fix a decision rule, and then estimate the FDR associated with using this decision rule
 Estimate m0, the number of true null hypotheses
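
The two approaches can be contrasted in a short sketch. The first function is the Benjamini-Hochberg step-up rule. The second is a simplified point estimate in the spirit of Storey's method, using a single fixed tuning value `lam` (an assumption of this sketch; the paper discusses how to choose it). Function names and p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Fix the error rate alpha, derive which hypotheses to reject."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value is at or below (rank / m) * alpha.
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject


def storey_fdr(p_values, threshold, lam=0.5):
    """Fix the rule 'reject if p <= threshold', estimate m0 and the FDR."""
    m = len(p_values)
    # p-values above lam are assumed to come mostly from true nulls.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1 - lam)))
    n_rejected = max(1, sum(p <= threshold for p in p_values))
    m0 = pi0 * m
    return m0, min(1.0, m0 * threshold / n_rejected)


# Made-up p-values for m = 10 genes.
p_values = [0.001, 0.004, 0.008, 0.012, 0.03, 0.2, 0.45, 0.6, 0.75, 0.9]

reject = benjamini_hochberg(p_values, alpha=0.05)
m0, fdr = storey_fdr(p_values, threshold=0.05)
print(f"Benjamini-Hochberg rejects {sum(reject)} hypotheses")
print(f"Storey: estimated m0 = {m0:.1f}, estimated FDR at p <= 0.05 is {fdr:.3f}")
```

Real microarray data have thousands of p-values, so the estimates are far more stable than in this ten-gene toy example.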
How to Interpret Expression Profiling Data

Overlay functional information and allow biological context to help decide what is of interest and what is not
Use computational methods (classification, clustering, promoter prediction, etc.)
Data mining tools
 Public identifiers: GenBank, Swiss-Prot, Gene Ontology (GO)
 Databases: LocusLink, HomoloGene, RefSeq, UniGene, etc.
 GeneFAS (Digbio), GenePath (Digbio), NetAffx, etc.
Gene Ontology (GO)

The most commonly used public-domain source of gene classification

Provides controlled vocabulary hierarchies for
 molecular function
 biological process
 cellular component
GO
Current GO annotation

http://www.geneontology.org/GO.current.annotations.shtml

More than 30 species are listed
Image Analysis

More than 20 software packages are listed at
http://ihome.cuhk.edu.hk/~b400559/arraysoft_image.html

Imagene (BioDiscovery, Inc.)
Imagene Analysis
Flagging Spot
Defining Thresholds for
Empty Spots
GeneSpring

GeneSpring (Silicon Genetics)
 Broadly used
 Nice user interface
 Data normalization (Lowess, etc.)
 Powerful ANOVA statistical analysis
  t-test/1-way ANOVA test
  2-way ANOVA tests
  1-way post-hoc tests for reliably identifying differentially expressed genes
 Incorporation of different analysis tools
  Clustering
  Visual filtering
  Pathway viewing
  Scripting
ANOVA in GeneSpring (I)

Use Tools -> Statistical Analysis -> test type: parametric, assume variance equal or parametric, don't assume variance equal.

Technical replicates are on different slides + biological replicates (e.g. as in the case of one-color arrays)

GeneSpring does not distinguish between technical replicates and biological replicates
ANOVA in GeneSpring (II)

Use Tools -> Statistical Analysis -> test type: parametric, assume variance equal or parametric, don't assume variance equal.
The on-chip variance is ignored.
Technical replicates are spotted on a chip (i.e. on-chip replicates) + biological replicates
e.g. you have 3 sets of on-chip replicates x 2 biological replicates for group A, and the same setup for group B.
 GeneSpring first averages the on-chip replicates within each biological replicate, giving one average for replicate #1 and another for replicate #2. GeneSpring then uses these two averages per group to compute the ANOVA. The df is 2-1 for each group.
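
The averaging step described above can be sketched as follows. This is not GeneSpring's code, only an illustration of the idea under the 3 on-chip x 2 biological replicate setup; the data layout, names and values are all hypothetical.

```python
from statistics import mean, variance

# data[group] is a list of biological replicates;
# each biological replicate holds its 3 on-chip measurements.
data = {
    "A": [[2.0, 2.2, 2.4], [3.0, 3.1, 3.2]],
    "B": [[4.0, 4.1, 4.2], [5.0, 5.3, 5.6]],
}


def collapse_on_chip(group):
    """Average the on-chip replicates inside each biological replicate."""
    return [mean(chip_values) for chip_values in group]


summary = {}
for name, group in data.items():
    averages = collapse_on_chip(group)
    summary[name] = {
        "averages": averages,              # one value per biological replicate
        "variance": variance(averages),    # on-chip spread no longer contributes
        "df": len(averages) - 1,           # 2 - 1 = 1
    }

for name, stats in summary.items():
    print(name, stats)
```

After the collapse each group is represented by just two numbers, so the spread among the on-chip spots plays no part in the variance that enters the test.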
ANOVA in GeneSpring (III)

Use Tools -> Statistical Analysis -> test type: parametric, use all available error measurements.
In this case, both the on-chip and the biological replicate information are used.
Technical replicates are spotted on a chip (i.e. on-chip replicates) + biological replicates
e.g. you have 3 sets of on-chip replicates x 2 biological replicates for group A, and the same setup for group B.
 GeneSpring takes both on-chip and biological variance into account when calculating the ANOVA. The degrees of freedom also account for both types of replicates. The equation for the degrees of freedom is quite complex, because GeneSpring takes the standard errors of the on-chip and biological replicates into consideration, so that different levels of variation between technical and biological replicates are accounted for.
Error correction

P-value cutoff/false discovery rate: 0.05

Multiple testing correction: too conservative. Use None

Post-hoc testing: used for 3 or more conditions. Shows the pairs of conditions between which significant changes are detected.
Significance Analysis of Microarrays (SAM)

From Stanford (http://www-stat.stanford.edu/~tibs/SAM/)
Correlates gene expression data to a wide variety of clinical parameters, including treatment, diagnosis categories, survival time and time trends
Provides an estimate of the False Discovery Rate for multiple testing
Automatic imputation of missing data via a nearest neighbor algorithm
Can deal with blocked designs, for example, when treatments are applied within different batches of arrays
Convenient Excel Add-in
Works with data from both cDNA and oligo microarrays. Can also be applied to protein expression data and SNP chip data.
Genes are web-linked to the Stanford SOURCE database
ArrayStat (Imaging Research)

Accepts data from MS Excel and in text format
Novel and standard random error estimation methods
Performs powerful statistics on as few as two replicates
“Outlier” detection and removal
Shows the number of replicates in the results
Flexible normalization within an experiment and across experiments
False positive corrections
Dependent and independent statistical tests of expression changes
Statistical power analysis to minimize false negatives
Reading Assignments
Suggested reading:
 GeneChip Expression Analysis. Affymetrix, Inc.
 John D. Storey. 2002. A direct approach to false discovery rates. J. R. Statist. Soc. B, 64, part 3, 479-498.