CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics

Lecture 16: Identifying Differentially Expressed
Genes from Microarray Data
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008
www.cse.sc.edu.
Outline
• The problem: identifying differentially expressed
genes
• Statistical methods: t-test
• Non-parametric methods: rank product
• Summary

The Biological Problem: Identify
Differentially Expressed Genes
[Figure: "No treatment" and "Treatment" samples compared, posing two questions: Which genes are involved? Which pathways will be affected?]
Identify differentially expressed
genes
One of the core goals of microarray data analysis is
to identify which of the genes show good evidence
of being differentially expressed (DE). This goal has two parts.
1. The first is to select a statistic which will rank the
genes in order of evidence for differential
expression, from strongest to weakest
evidence.
2. The second is to choose a critical value for the
ranking statistic above which any value is
considered to be significant.
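The two parts can be illustrated with a minimal sketch. The gene names, the scores, and the cut-off below are all hypothetical, chosen only to show the ranking step followed by the thresholding step.

```python
# Hypothetical per-gene ranking statistics (for example, absolute t statistics).
# Gene names and values are illustrative assumptions, not real data.
scores = {"geneA": 5.2, "geneB": 0.7, "geneC": 3.1, "geneD": 1.9}

# Assumed critical value, picked only for illustration.
critical_value = 3.0

# Part 1: rank genes from strongest to weakest evidence.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Part 2: keep genes whose statistic exceeds the critical value.
significant = [gene for gene, score in ranked if score > critical_value]

print(ranked)
print(significant)
```

How the critical value should be chosen is the harder question; the later slides on significance address it.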
k-fold change
1. Differential expression is measured by the ratio of
expression levels between two samples.
2. Genes with ratios above a fixed cut-off k, that is,
those whose expression underwent a k-fold
change, were said to be differentially expressed.
3. This test is not a statistical test, and there is no
associated value that can indicate the level of
confidence in the designation of genes as
differentially expressed or not differentially
expressed.
k-fold change
4. Replication is essential in experimental
design because it allows an estimate of
variability.
5. The ability to assess such variability allows
identification of biologically reproducible
changes in gene expression levels.
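A small sketch of why the fold-change rule alone can mislead. The expression values are made up for illustration, and the function names `fold_change` and `is_k_fold` are hypothetical helpers, not part of any package. Both genes show exactly a 2-fold change in mean expression, but only replication reveals that the second one is far noisier.

```python
from statistics import mean, stdev

def fold_change(treated, control):
    """Ratio of mean expression levels (treated / control)."""
    return mean(treated) / mean(control)

def is_k_fold(treated, control, k=2.0):
    """True if expression changed at least k-fold in either direction."""
    ratio = fold_change(treated, control)
    return ratio >= k or ratio <= 1.0 / k

# Hypothetical expression values, three replicates per condition.
gene_a = {"control": [100.0, 110.0, 90.0], "treated": [210.0, 190.0, 200.0]}
gene_b = {"control": [100.0, 20.0, 180.0], "treated": [400.0, 50.0, 150.0]}

for name, gene in (("gene_a", gene_a), ("gene_b", gene_b)):
    ratio = fold_change(gene["treated"], gene["control"])
    passes = is_k_fold(gene["treated"], gene["control"])
    spread = stdev(gene["treated"])  # replicate variability in the treated group
    print(name, ratio, passes, round(spread, 1))
```

The fold-change rule treats the two genes identically; the replicate standard deviations do not.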
Standard statistical tests
1. More typically, researchers now rely on
variants of common statistical tests.
2. These generally involve two parts:
calculating a test statistic and determining
the significance of the observed statistic.
3. A standard statistical test for detecting
significant change between repeated
measurements of a variable in two groups
is the t-test.
4. This can be generalized to multiple groups
via the ANOVA F statistic.
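As a sketch of the test-statistic half of this recipe, the following computes Student's two-sample t statistic (the equal-variance, pooled form; this choice is an assumption, since the slides do not say which variant is meant) and the one-way ANOVA F statistic for a single gene. The measurements are invented for illustration. With exactly two groups, F equals t squared, which shows how F generalizes t.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(x, y):
    """Student's two-sample t statistic using a pooled variance estimate."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1.0 / nx + 1.0 / ny))

def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical log-expression values for one gene, four replicates per group.
control = [7.1, 6.9, 7.3, 7.0]
treated = [8.2, 8.0, 8.5, 8.1]

t = t_statistic(treated, control)
f = f_statistic([treated, control])
print(t, f)
```

This covers only the first part, the statistic itself; determining its significance is a separate step.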
Standard statistical tests
1. For most practical cases, computing a standard t or
F statistic is appropriate, although referring to the t
or F distributions to determine significance is often
not.
2. The main hazard in using such methods occurs
when there are too few replicates to obtain an
accurate estimate of experimental variances. In
such cases, modeling methods that use pooled
variance estimates may be helpful.
Standard statistical tests
1. Regardless of the test statistic used, one must
determine its significance.
2. Standard interpretations of t-like tests assume
that the data are sampled from normal
populations with equal variances.
3. Expression data may fail to satisfy either or both
of these constraints.
Standard statistical tests
The use of non-parametric rank-based statistics is also
common, via both
1. traditional statistical methods and
2. ad hoc ones designed specifically for microarray
data.
RankProd: a non-parametric method to detect
differentially regulated genes in replicated experiments
• What does it do? What is the method implemented in the
package?
RankProd uses the so-called rank product non-parametric method
(Breitling et al., 2004) to identify up-regulated or down-regulated
genes under one condition against another condition.
Rank Product is a non-parametric statistic which detects items that are
consistently highly ranked in a number of lists, for example genes that
are consistently found among the most strongly upregulated genes in a
number of replicate experiments.
• How does it compare to other methods for a similar purpose?
(1) originates from an analysis of biological reasoning; easy to understand
(2) fast, simple and robust to outliers (suitable for noisy data)
(3) provides statistical significance for each gene and allows for the control of
the overall significance (e.g., false discovery rate)
(4) provides a straightforward way for cross-platform meta-analysis (integrates
data generated at different laboratories/under different environments into
one study, and achieves increased power)
Rank Product

Calculate RP:

Calculate significance
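The formulas on this slide did not survive extraction. As a rough sketch only, the code below uses one commonly stated form of the rank product, the geometric mean of a gene's ranks across k replicates, and estimates significance by permuting values within each replicate. Both the exact formula and the permutation scheme are assumptions here, and the function names (`ranks_descending`, `rank_product`, `rp_p_values`) and the fold-change values are hypothetical; this is not the RankProd package's implementation.

```python
import random
from math import prod

def ranks_descending(values):
    """Rank 1 goes to the largest value (ties are not handled in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def rank_product(fold_changes):
    """fold_changes[r][g] is the log fold change of gene g in replicate r.
    Returns the geometric mean of each gene's ranks across replicates."""
    k = len(fold_changes)
    n_genes = len(fold_changes[0])
    per_replicate = [ranks_descending(rep) for rep in fold_changes]
    return [prod(per_replicate[r][g] for r in range(k)) ** (1.0 / k)
            for g in range(n_genes)]

def rp_p_values(fold_changes, n_perm=200, seed=0):
    """Fraction of permuted rank products at least as small as the observed one."""
    rng = random.Random(seed)
    observed = rank_product(fold_changes)
    null = []
    for _ in range(n_perm):
        # Shuffle each replicate independently to break gene identity.
        shuffled = [rng.sample(rep, len(rep)) for rep in fold_changes]
        null.extend(rank_product(shuffled))
    return [sum(v <= rp for v in null) / len(null) for rp in observed]

# Hypothetical log fold changes: 3 replicates x 5 genes; gene 0 is always on top.
fold_changes = [
    [2.5, 0.1, -0.3, 0.4, -1.0],
    [2.1, -0.2, 0.3, 0.0, -0.8],
    [2.8, 0.5, -0.1, 0.2, -1.2],
]
rp = rank_product(fold_changes)
p = rp_p_values(fold_changes)
print(rp)
print(p)
```

A small rank product means a gene sits near the top of every replicate's list, so the smallest values carry the strongest evidence of up-regulation.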
Permutation tests for calculating
significance levels
Permutation tests, generally carried out by repeatedly
scrambling the samples’ class labels and computing t
statistics for all genes in the scrambled data, best capture the
unknown structure of the data.
Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of
microarrays applied to the ionizing radiation response. Proc.
Natl Acad. Sci. USA 98, 5116-5121 (2001).
Golub, T.R. et al. Molecular classification of cancer: class
discovery and class prediction by gene expression
monitoring. Science 286, 531-537 (1999).
Dudoit, S., Yang, Y.-H., Callow, M.J. & Speed, T.P. Statistical
methods for identifying differentially expressed genes in
replicated cDNA microarray experiments. Technical Report
578 (Department of Statistics, University of California at
Berkeley, Berkeley, CA, 2000).
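The label-scrambling procedure described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Welch's form of the t statistic (an assumption; the slide does not name a variant), the expression matrix and the function name `permutation_p_values` are hypothetical, and no multiple-testing correction is applied.

```python
import random
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Two-sample t statistic without assuming equal variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def abs_t_per_gene(expr, labels):
    """Absolute t statistic of every gene for a given class labelling."""
    stats = []
    for row in expr:
        group1 = [v for v, lab in zip(row, labels) if lab == 1]
        group0 = [v for v, lab in zip(row, labels) if lab == 0]
        stats.append(abs(welch_t(group1, group0)))
    return stats

def permutation_p_values(expr, labels, n_perm=1000, seed=0):
    """Scramble class labels repeatedly and recompute t for all genes each time."""
    rng = random.Random(seed)
    observed = abs_t_per_gene(expr, labels)
    hits = [0] * len(expr)
    scrambled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(scrambled)  # one scrambled labelling shared by all genes
        permuted = abs_t_per_gene(expr, scrambled)
        for g, (perm_stat, obs_stat) in enumerate(zip(permuted, observed)):
            if perm_stat >= obs_stat:
                hits[g] += 1
    # Add one to numerator and denominator so no p-value is exactly zero.
    return [(h + 1) / (n_perm + 1) for h in hits]

# Hypothetical data: 2 genes x 8 samples (4 untreated, then 4 treated).
expr = [
    [5.1, 4.9, 5.3, 5.0, 7.9, 8.2, 8.0, 8.4],  # clear shift between groups
    [6.0, 6.4, 5.8, 6.3, 6.1, 6.2, 5.9, 6.5],  # no obvious shift
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p_values = permutation_p_values(expr, labels)
print(p_values)
```

Because the null distribution comes from the data themselves, no normality or equal-variance assumption is needed to read off significance.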
Summary
• The problem: identify differentially
expressed genes from microarray data
• How to identify: t-test and rank product
• How to evaluate significance of identified
genes
