Microarray statistical validation and functional annotation


Microarrays


DNA microarray technology is a high-throughput
method for gaining information on gene function.
Microarray technology is based on the
availability of gene sequences arrayed on a
solid surface, and it allows parallel
expression analysis of thousands of genes.

Microarrays can be a valuable tool
– to define transcriptional signatures linked to
a pathological condition
– to work out molecular mechanisms tightly
linked to transcription

Since our current knowledge of gene
function in higher eukaryotes is quite limited
– Microarray analysis frequently does not give
a final answer to a biological problem but
allows the discovery of new research paths
that let us explore it from a different
perspective

A gold standard methodology to
identify, with high sensitivity and
precision, “biologically meaningful”
differentially expressed genes is not
yet available.
– Therefore, various approaches are under
development to optimize the extraction of
data linked to the “biology” of the problem
under study.

The principal steps of a microarray
analysis are:
– Gene intensity measurements and data
normalization.
– Statistical validation of differential expression.
– Functional data mining.


Statistical validation usually requires the user to select
statistical significance parameters.
For example:
– SAM (Significance Analysis of Microarrays) requires
the input of a “delta” value, which sets the threshold of
false positives in the validated dataset.


If the statistical validation is too stringent,
biologically meaningful genes can be lost, making it more
difficult to work out functional correlations between the
differentially expressed genes.
If the statistical validation is too loose,
the increase in false positives creates background noise
from which it is difficult to extract reliable functional
correlations between the differentially expressed genes.
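
The stringency trade-off can be illustrated with a small sketch. This is not SAM itself: it assumes a simplified SAM-like statistic, (mean difference) / (standard error + s0), and estimates the expected number of false calls by averaging over all relabellings of the samples (the observed labelling included). The function names, the value of s0 and the toy data are hypothetical and chosen only for illustration.

```python
import itertools
import math
import statistics


def relative_difference(a, b, s0=0.1):
    """SAM-like statistic: mean difference over (standard error + s0)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / (std_err + s0)


def called_genes(data, idx_a, threshold, s0=0.1):
    """Genes whose |statistic| reaches the threshold for one labelling."""
    called = []
    for gene, values in data.items():
        a = [values[i] for i in idx_a]
        b = [v for i, v in enumerate(values) if i not in idx_a]
        if abs(relative_difference(a, b, s0)) >= threshold:
            called.append(gene)
    return called


def stringency_report(data, n_a, threshold, s0=0.1):
    """Return (genes called, mean number called over all relabellings).

    data maps gene -> list of values; the first n_a values are group A.
    """
    n = len(next(iter(data.values())))
    observed = called_genes(data, tuple(range(n_a)), threshold, s0)
    permuted = [len(called_genes(data, idx, threshold, s0))
                for idx in itertools.combinations(range(n), n_a)]
    return observed, statistics.mean(permuted)


# Hypothetical toy data: 3 samples per group.
toy = {
    "gene_strong": [10.0, 10.2, 9.8, 5.0, 5.1, 4.9],
    "gene_weak": [6.0, 6.5, 5.5, 5.0, 5.6, 5.2],
    "gene_flat": [7.0, 7.1, 6.9, 7.0, 7.2, 6.8],
}
for threshold in (1.0, 4.0):
    genes, expected_false = stringency_report(toy, 3, threshold)
    print(threshold, genes, expected_false)
```

Raising the threshold can only shrink both the list of called genes and the permutation-based estimate of false calls, which is the trade-off described above.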


Statistical validation requires the user to select
statistical significance parameters.
For example:
– SAM (Significance Analysis of Microarrays)
requires the definition of a “delta” value, which
sets the threshold of false positives in the
validated dataset.
– When Fisher’s test is used, defining a
threshold value is even harder.

It is important to remark that:
– Statistical validation does not always imply the
selection of the most “biologically”
meaningful dataset

Therefore we are trying to integrate
“biologically” important parameters, such as
Gene Ontology, in the statistical
validation.

Gene Ontology (GO) is a dynamic controlled
vocabulary that can be applied to all
organisms even as knowledge of gene and
protein roles in cells is accumulating and
changing.
– GO might help to link differentially expressed
genes to specific functional classes.


Molecular Function:
the tasks performed by individual gene products;
examples are transcription factor and DNA helicase.

Biological Process:
broad biological goals, such as mitosis or purine metabolism,
that are accomplished by ordered assemblies of molecular
functions

Cellular Component:
subcellular structures, locations, and macromolecular
complexes; examples include nucleus, telomere, and origin
recognition complex

It has recently been shown that:
– There is a strong instability in the size and
overlap of the gene lists that result from
varying gene selection methods.
(Hosack et al, Genome Biology 2003, 4:P4)
The percentage of genes overlapping
in any two lists was highly variable,
and ranged from 7% to 60%.
(Hosack et al, Genome Biology 2003, 4:P4)
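
A pairwise overlap of this kind can be computed as in the minimal sketch below. How the cited work defines the overlap percentage is not stated here, so the definition used (shared genes as a percentage of the smaller list) is an assumption, and the function names and gene lists are hypothetical.

```python
from itertools import combinations


def percent_overlap(list_a, list_b):
    """Shared genes as a percentage of the smaller list (assumed definition)."""
    a, b = set(list_a), set(list_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def pairwise_overlaps(gene_lists):
    """gene_lists maps a selection-method name to its list of gene ids."""
    return {
        (m1, m2): percent_overlap(gene_lists[m1], gene_lists[m2])
        for m1, m2 in combinations(sorted(gene_lists), 2)
    }


# Hypothetical gene lists from three selection methods.
example_lists = {
    "method_1": ["g1", "g2", "g3", "g4"],
    "method_2": ["g3", "g4", "g5", "g6"],
    "method_3": ["g4", "g7"],
}
for pair, pct in pairwise_overlaps(example_lists).items():
    print(pair, round(pct, 1))
```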

In spite of this striking variation:
– The top five biological themes
linked to the data sets are the same.
– This evidence suggests that the conversion
of genes to themes allows the "biological
result" of the experiment to be determined
despite substantial differences in gene list
content resulting from the use of various
normalization, gene intensity and statistical
selection methods.
(Hosack et al, Genome Biology 2003, 4:P4)

Integrating GO in statistical validation:
– The GO classes present in the data set under
statistical validation are counted.
– SAM analyses are performed using various delta parameters.
– The GO classes present in the statistically validated subsets
are counted.
– The enrichment of GO classes in the SAM
validated sets is evaluated using a binomial test corrected for
Type I errors.
– A score for each GO class is generated as log2(p-value * % HITs).
– The SAM subset showing the best compromise between
number of enriched GO classes and number of HITs for each
class is selected for further studies.
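
The counting, enrichment test and scoring steps listed above can be sketched as follows. Several details are assumptions, since the slides do not specify them: the binomial test is taken as one-sided (upper tail), the Type I error correction is taken to be Bonferroni, and "% HITs" is read as the percentage of a class's genes in the starting data set that are found in the validated subset. All function names and the toy annotation are hypothetical.

```python
import math
from collections import Counter


def count_classes(annotation):
    """annotation maps gene -> set of GO classes; count genes per class."""
    counts = Counter()
    for classes in annotation.values():
        counts.update(classes)
    return counts


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))


def go_scores(background, subset, alpha=0.05):
    """Score GO classes in a validated subset against the starting data set.

    The subset genes are assumed to be part of the background.
    """
    bg_counts = count_classes(background)
    sub_counts = count_classes(subset)
    n_tests = len(sub_counts)
    results = {}
    for go, hits in sub_counts.items():
        p_class = bg_counts[go] / len(background)
        p_raw = binomial_upper_tail(hits, len(subset), p_class)
        p_adj = min(1.0, p_raw * n_tests)      # Bonferroni (assumed)
        pct_hits = 100.0 * hits / bg_counts[go]  # "% HITs" (assumed meaning)
        product = p_adj * pct_hits
        score = math.log2(product) if product > 0 else float("-inf")
        results[go] = {"hits": hits, "p_adj": p_adj,
                       "enriched": p_adj <= alpha, "score": score}
    return results


# Hypothetical toy annotation: 20 genes, 4 of them in class GO:X.
background = {f"g{i}": {"GO:Y"} for i in range(20)}
for i in range(4):
    background[f"g{i}"] = {"GO:X", "GO:Y"}
subset = {g: background[g] for g in ("g0", "g1", "g2", "g3")}
for go, row in go_scores(background, subset).items():
    print(go, row)
```

Running `go_scores` once per SAM subset (one per delta value) gives the per-class counts and scores from which a subset can then be chosen; the rule for the "best compromise" is not specified in the slides and is left out of the sketch.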
[Figure: Concordant morphologic and gene expression data show that a vaccine freezes HER-2/neu preneoplastic lesions. Labels: 10 wks, 22 wks; atypical hyperplasia and in situ carcinomas; lobular carcinoma; cured mammary gland. Score axis: log2(p-value * %HITs). (Quaglino et al submitted)]

We observed that:
– simple statistical validation and statistical validation
mediated by GO class analysis show strong overlap.

However, some interesting differentially expressed
genes can only be detected using GO-mediated
statistical validation.
[Figure: expression heat map, colour scale from -3.0 to 3.0 (1:1 at the centre); column labels: wk 22nti, wk10nt j and wk 22 pbi, each over wk 2 prgi; panels a to e.
– Ig-linked immune response: common to simple statistical analysis and GO-mediated statistical validation
– Cell-linked immune response: specific to GO-mediated statistical validation]
We also observed that the previously described approach can be
used to improve data mining related to the transcriptional signature
present in co-regulated genes.
[Flowchart, steps (a) to (n). Boxes: Starting dataset (SD); SAM program; Run SAM with at least 3 different thresholds? (Yes/No); Subsets of SAM validated genes (SSVG); Consensus program; Alignment matrices (AMs); Patser program; min(AMs specific p-value); Filtering by AMs specific P-value; Patser program; Any AM is over-represented in SSVG? (No: Discard); Selected SSVG]