Transcript Slide 1

CAVEAT 1
MICROARRAY EXPERIMENTS ARE EXPENSIVE AND COMPLICATED.
MICROARRAY EXPERIMENTS ARE THE STARTING POINT FOR RESEARCH.
MICROARRAY EXPERIMENTS CANNOT BE THE FINAL GOAL OF A PROJECT.
CAVEAT 2
LISTS OF GENES DON’T GIVE BIOLOGICAL ANSWERS.
STATISTICS CAN BE COMPLETELY DETACHED FROM BIOLOGY.
THE AMOUNT OF RESULTS IS ALWAYS BIGGER THAN OUR IMAGINATION.
CAVEAT 3
WITH MICROARRAYS WE OBSERVE ONLY THE TRANSCRIPTOME.
WE CAN ONLY BUILD UP HYPOTHESES ABOUT THE GENOME AND PROTEOME.
CAREFUL AND EXTENSIVE ANNOTATION OF THE RESULTS IS NEEDED.
Dai M, et al
Nucleic Acids Res. 2005 Nov 10;33(20):e175.
PMID: 16284200
THE PROBLEM OF ANNOTATION
THE PROBLEM OF:
WHO: WHO ARE THEY?
WHAT: WHAT DO THEY DO?
WHERE: WHERE ARE THEY AND WHERE DO THEY WORK?
WHEN: WHEN DO THEY WORK?
HOW: HOW DO THEY WORK?
WHO
WE NEED TO GET ALL POSSIBLE INFORMATION ON THE GENES
WE GET FROM MICROARRAYS.
AVAILABLE TOOLS: Gene (formerly LocusLink), OMIM, PubMed
WHAT
THE FUNCTION OF MANY GENES IS ALREADY KNOWN.
AVAILABLE TOOLS: KEGG, GeneOntology (Biological Process, Molecular Function),
OMIM, PubMed.
WHERE
LOCATING THE GENES ON THE GENOME IS VERY IMPORTANT IN MANY
SITUATIONS
(--- a portion of a chromosome is strongly affected under a certain clinical condition)
(--- genes close to each other can be regulated by the same mechanisms).
AVAILABLE TOOLS: NCBI-Genome, EnsEMBL.
WHERE DO THE PRODUCTS OF THE GENES OPERATE IN THE CELL?
AVAILABLE TOOLS: KEGG, GeneOntology (Cellular Component), PubMed.
WHEN
UNDER WHICH CONDITIONS DOES THE EXPRESSION OF A GIVEN GENE CHANGE?
AVAILABLE TOOLS: PubMed, GEO
HOW
HOW DO GENES WORK?
AVAILABLE TOOLS: PubMed, OMIM, Gene, GeneOntology
THE SOCIAL LIFE OF THE GENES
DIFFERENT SOCIAL DIMENSIONS:
DNA LEVEL (GENOMIC POSITION)
RNA LEVEL (RNA PROCESSING)
PROTEIN LEVEL (INTERACTION OF PROTEINS)
Diverse Biological Roles
Consider a population of genes representing a
diverse set of biological roles or themes shown
below as different colors.
Many algorithms can be applied to expression data to
partition genes based on expression profiles over
multiple conditions.
Many of these techniques work solely on expression
data and disregard biological information.
Consider a particular cluster…
- What are some of the predominant
biological themes represented in the cluster
and how should significance be assigned to
a discovered biological theme?
Example:
Population Size: 40 genes
Cluster size: 12 genes
10 genes, shown in green, have a common
biological theme and 8 occur within the cluster.
Consider the Outcome
The frequency of the theme in the population is 10/40 = 25%
The frequency of the theme within the cluster is 8/12 = 67%
AND
* 80% of the genes related to the theme in the population
ended up within the relatively small cluster.
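The three percentages above can be re-derived from the four counts in the example. This is a minimal sketch: the variable names are illustrative and do not come from the slides.

```python
# Counts taken from the slide example.
population_size = 40       # all genes under consideration
cluster_size = 12          # genes in the cluster of interest
theme_in_population = 10   # genes annotated with the biological theme
theme_in_cluster = 8       # annotated genes that fall inside the cluster

# Frequency of the theme in the whole population and within the cluster.
freq_population = theme_in_population / population_size
freq_cluster = theme_in_cluster / cluster_size

# Share of all theme genes that ended up inside the cluster.
fraction_captured = theme_in_cluster / theme_in_population

print(f"theme frequency in population:   {freq_population:.0%}")
print(f"theme frequency in cluster:      {freq_cluster:.0%}")
print(f"theme genes captured by cluster: {fraction_captured:.0%}")
```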
Contingency Matrix
A 2x2 contingency matrix is typically used to
capture the relationships between cluster membership
and membership in a biological theme.
                 Cluster
                 in     out
Theme    in       8       2
         out      4      26
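The four cells of the matrix follow directly from the counts in the example (40 genes, a cluster of 12, 10 theme genes, 8 of them in the cluster). A small sketch of that bookkeeping, with a hypothetical helper name `contingency_matrix` and rows taken as theme (in, out) and columns as cluster (in, out):

```python
def contingency_matrix(population_size, cluster_size, theme_total, theme_in_cluster):
    """Return [[a, b], [c, d]]: rows = theme (in, out), columns = cluster (in, out)."""
    a = theme_in_cluster                 # theme in,  cluster in
    b = theme_total - theme_in_cluster   # theme in,  cluster out
    c = cluster_size - theme_in_cluster  # theme out, cluster in
    d = population_size - a - b - c      # theme out, cluster out
    if min(a, b, c, d) < 0:
        raise ValueError("counts are inconsistent with each other")
    return [[a, b], [c, d]]

matrix = contingency_matrix(40, 12, 10, 8)
print(matrix)  # [[8, 2], [4, 26]]
```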
Assigning Significance to the Findings
Fisher’s Exact Test permits us to determine whether there is a
non-random association between the two variables, expression-based
cluster membership and membership in a particular
biological theme.
                 Cluster
                 in     out
Theme    in       8       2
         out      4      26
( 2x2 contingency matrix )
p ≈ .0002
Hypergeometric Distribution

    a      b    |  a+b
    c      d    |  c+d
   ----------------
   a+c    b+d

The probability of any particular matrix occurring by random
selection, given no association between the two variables, is given
by the hypergeometric rule, where n = a + b + c + d:

\[
p = \frac{\dfrac{(a+c)!}{a!\,c!}\cdot\dfrac{(b+d)!}{b!\,d!}}{\dfrac{n!}{(a+b)!\,(c+d)!}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}
\]
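As a check on the hypergeometric rule, the probability of one particular matrix can be computed with binomial coefficients, which equal the factorial ratios in the formula. This is a sketch using only the Python standard library; the function name `table_probability` is illustrative.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of the 2x2 table [[a, b], [c, d]] with all margins fixed.

    Uses C(a+c, a) * C(b+d, b) / C(n, a+b), which is the same quantity as
    (a+b)!(c+d)!(a+c)!(b+d)! / (n! a! b! c! d!).
    """
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)

# The matrix from the example: 8 theme genes in the cluster, 2 outside,
# 4 non-theme genes in the cluster, 26 outside.
p_exact = table_probability(8, 2, 4, 26)
print(f"{p_exact:.7f}")  # 0.0002207
```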
Probability Computation
For our matrix,

    8     2
    4    26

we are interested not only in the probability of getting exactly
8 annotation hits in the cluster but in the probability
of having 8 or more hits. In this case the probabilities
of all possible matrices with 8 or more hits are summed:

    8     2         9     1        10     0
    4    26         3    27         2    28

.0002207 + 7.27x10^-6 + 7.79x10^-8 ≈ .000228
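The sum above can be reproduced by stepping through every matrix that keeps the same row and column totals while having 8 or more hits. The sketch below uses only the standard library; the names `table_probability` and `one_sided_p_value` are illustrative.

```python
from math import comb

def table_probability(a, b, c, d):
    """Hypergeometric probability of the table [[a, b], [c, d]], margins fixed."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)

def one_sided_p_value(a, b, c, d):
    """Sum the probabilities of all tables with the same margins and >= a hits."""
    p = 0.0
    # Moving one gene from b to a and one from c to d keeps every margin
    # unchanged; stop once b or c would become negative.
    while b >= 0 and c >= 0:
        p += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return p

p_value = one_sided_p_value(8, 2, 4, 26)
print(f"{p_value:.6f}")  # 0.000228
```

If SciPy is available, `scipy.stats.fisher_exact([[8, 2], [4, 26]], alternative="greater")` should report the same one-sided p-value.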