A stepwise procedure for conditional testing
of GO term overrepresentation
Constantin Georgescu
© Intelligent Systems and Bioinformatics Laboratory
The human genome
• The genome is the whole hereditary information of an organism: the instructions
that a living organism needs to grow and live.
• The instructions are encoded in the form of DNA molecules.
DNA encodes a detailed set of plans, like a blueprint, for building the different parts of a cell.
• The DNA resides in the nucleus of every cell, on 23 pairs of chromosomes.
• The DNA molecule forms a double helix, a string built with the four-letter
DNA alphabet A, C, T, G.
A DNA strand is made of letters that make words that make sentences called “genes”.
• Genes: segments of chromosomal DNA that encode and direct the
synthesis of a protein; proteins carry out most cellular functions.
• Sequenced by 2003; 2 meters of DNA; 3 billion bp; about 25,000 genes.
• About 97% is non-coding (“junk”) DNA.
Differential expression
• Cells: the fundamental working units of every living organism.
• Each cell contains a complete copy of the organism's genome.
• Cells are of many different types and states, e.g., blood, nerve, and skin
cells, dividing cells, cancerous cells, etc.
• What makes the cells different?
• Differential gene expression, i.e., when, where, and how much each
gene is expressed.
• On average, 40% of our genes are expressed at any given time.
Central dogma
• The expression of the genetic information stored in
the DNA molecule occurs in two stages:
– (i) transcription, during which DNA is transcribed
into mRNA;
– (ii) translation, during which mRNA is translated to
produce a protein.
DNA -> mRNA -> protein
• Other important aspects of gene regulation:
methylation, alternative splicing, etc.
Examining Gene Expression
• Understanding the function of a gene depends on knowing
when and in which cells it is expressed.
• Microarray chips (developed in the 1990s) allow examining
the expression of thousands of genes simultaneously.
• Microarray chips are glass slides spotted with many rows
containing tiny amounts of probe DNA, one spot for each of
thousands of genes.
• They measure the amount of mRNA transcribed from a gene in a
particular cell type through complementary binding.
• They provide rapid and sensitive tests in a variety of experimental
studies on different cell types: cancer cells versus normal
cells, liver cells versus kidney cells, etc.
A. RNA is isolated from cells
from two
samples (in this illustration, infected and
uninfected plant cells). B. The mRNA
from both samples is copied to a more
stable form, called cDNA, using reverse
transcriptase. C. At the same time, the
cDNA is labeled with fluorescent tags
(a different color tag for each sample).
D. The tagged cDNA is placed on the
microarray chip, where it binds to the
corresponding DNA that makes up the
genes that have been previously spotted
on the chip. E. The chip is placed in
a laser scanner, which identifies the
genes that hybridize to each sample
(uninfected=green; infected=red; and
both samples=yellow). F. The data are
displayed on a computer screen where
expression of the individual genes can
be identified.
Combining data across slides
Data on G genes for n hybridizations result in a
G x n gene-by-array data matrix:

         Array1  Array2  Array3  Array4  Array5  …
Gene1     0.46    0.30    0.80    1.51    0.90   …
Gene2    -0.10    0.49    0.24    0.06    0.46   …
Gene3     0.15    0.74    0.04    0.10    0.20   …
Gene4    -0.45   -1.03   -0.79   -0.56   -0.32   …
Gene5    -0.06    1.06    1.35    1.09   -1.09   …
…           …       …       …       …       …    …

Preprocessing -> normalization -> summarization -> testing =>
list of differentially expressed genes
Gene Groups
• Challenge: go from sequence to function, i.e.,
define the role of each gene and understand how
the genome functions as a whole.
• The complete genome sequence doesn’t tell us
much about how the organism functions as a
biological system.
• We need to study how different gene products
interact to produce various components.
• Most important activities are not the result of a
single molecule but depend on the coordinated
effects of multiple molecules.
Gene Ontology
• Common set of terms and descriptions for basic biological
functions, processes and entities. (A mechanism for representing a community’s domain
knowledge in a form accessible to humans and amenable to computation.)
• GO provides a restricted vocabulary and a clear description of the
relationships between terms.
• The Gene Ontology Consortium produces 3 independent ontologies:
- Biological Process: “biological objective to which the gene product contributes”;
accomplished via one or more ordered assemblies of molecular functions. Ex.: cell
growth; signal transduction (“almost a pathway”).
- Molecular Function: “biochemical activity or action of the gene product”. Ex.: “enzyme”,
“transporter”, “ligand”.
- Cellular Component: component of a cell that is part of some larger object or structure.
Ex.: chromosome, nucleus, ribosome.
Gene Ontology
• Organized as a DAG with many-to-many relationships
• Child terms are more specific than their parents
• “is a” / “part of” relationships
• Mapping of genes to GO terms is carried out separately (e.g., chip meta-data, GOA)
• Mapping is as specific as possible
• Annotations propagate up through the hierarchy
• “Across dependence”: one gene mapped to several GO terms
Gene set analysis
Given:
• a directed acyclic graph (the GO graph) and a set of items (genes) such that:
– each node in the graph contains some genes
– the parent of a node contains all the genes of its child
– a node can contain genes that are not found in its children
• a subset of genes that we call significant genes (differentially expressed genes)
Goal:
• find the nodes of the graph (biological functions) that best represent the
significant genes with respect to some scoring function (some test statistic)
Over-representation analysis (ORA) is based on Fisher’s exact (hypergeometric) test
- Most popular method: easy; exact; works for small sets; stable
- Implemented in GOstats, Onto-Express, GoMiner, Ontologizer, FatiGO,
MAPPFinder, …
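The setup above requires that a parent node contain all the genes of its children. A minimal Python sketch of how direct annotations could be pushed up a DAG to satisfy that rule is given here; the function name, the toy term names ("root", "a", "b", "c") and the gene labels are hypothetical and not taken from the talk.

```python
from collections import defaultdict

def propagate_annotations(parents, direct):
    """Return term -> set of genes after pushing each gene up to all ancestors.

    parents: dict mapping a term to the list of its parent terms (a DAG).
    direct:  dict mapping a term to the genes annotated directly to it.
    """
    full = defaultdict(set)

    def ancestors(term, seen):
        # Collect every term reachable by following parent links.
        for p in parents.get(term, []):
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    for term, genes in direct.items():
        full[term] |= set(genes)
        for anc in ancestors(term, set()):
            full[anc] |= set(genes)
    return dict(full)

# Toy DAG (hypothetical): "root" has children "a" and "b"; "c" is a child of both.
parents = {"a": ["root"], "b": ["root"], "c": ["a", "b"]}
direct = {"root": {"g0"}, "a": {"g1"}, "b": {"g2"}, "c": {"g3", "g4"}}
gene_sets = propagate_annotations(parents, direct)
```

After propagation every node contains the genes of all its descendants, while a node such as "a" still keeps genes ("g1") that none of its children have.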
Fisher’s exact test
The score for a GO term measures the departure from independence
between two properties:
A = {gene is in the list of significant genes}
B = {gene is found in the GO term}.
• Testing the independence of A and B in the corresponding 2x2
contingency table is Fisher’s exact test
[Khatri and Draghici, 2005]
Fisher’s exact test
To compute the significance of a gene set, we can use a
hypergeometric test:
• N genes are on the microarray
• Bio is a GO term
– M genes in Bio
– N − M genes not in Bio
• Let K be the number of significant genes
• What is the probability of having exactly x genes
of type Bio among the K significant genes?
This is the probability of getting exactly x by chance, which is not what
we want: the p-value is the probability of getting x or more.
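The two probabilities can be sketched in a few lines of Python, using the slide's symbols N, M, K and x. The point probability is the standard hypergeometric formula P(X = x) = C(M, x) C(N − M, K − x) / C(N, K), and the one-sided p-value sums it from x upward. The function names and the small numbers in the example are illustrative assumptions, not part of the talk.

```python
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(X = x): exactly x of the K significant genes fall in a term of size M."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

def hypergeom_pvalue(x, N, M, K):
    """P(X >= x): x or more significant genes in the term (one-sided enrichment)."""
    upper = min(M, K)
    return sum(hypergeom_pmf(i, N, M, K) for i in range(x, upper + 1))

# Illustrative toy numbers: 20 genes, a term of 5, 4 significant genes, 2 in the term.
p_exact = hypergeom_pmf(2, N=20, M=5, K=4)
p_tail = hypergeom_pvalue(2, N=20, M=5, K=4)
```

The tail probability is never smaller than the point probability, which is why reporting P(X = x) alone would overstate significance.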
Parent-Child method
• What is the proper N?
• Example with x = 10, M = 400, K = 40:
N = 1000 => pval = 0.98
N = 5000 => pval = 0.0009145082
• Need unspecific prefiltering (remove genes
not expressed in any sample)
• Remove genes not present in any GO term
• The Parent-Child method (Grossmann et al., 2006)
proposes
N = number of genes in the parent of the current GO term
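The sensitivity to N in the example above can be checked with a short Python sketch. It assumes the quoted p-values are one-sided hypergeometric tail probabilities P(X >= x); the function and variable names are hypothetical.

```python
from math import comb

def enrichment_pvalue(x, N, M, K):
    """One-sided hypergeometric p-value P(X >= x)."""
    total = comb(N, K)
    tail = sum(comb(M, i) * comb(N - M, K - i) for i in range(x, min(M, K) + 1))
    return tail / total

# Same term and same gene list, two different choices of the gene universe N.
x, M, K = 10, 400, 40
p_small_universe = enrichment_pvalue(x, 1000, M, K)
p_large_universe = enrichment_pvalue(x, 5000, M, K)
print(p_small_universe, p_large_universe)
```

With N = 1000 the expected count is 16, so observing 10 is unremarkable; with N = 5000 the expected count drops to 3.2 and the same observation looks strongly enriched.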
Complex test dependence
• Gene annotations propagate through the DAG
• Genes are annotated to multiple unrelated GO terms
(across dependence)
• Implicit propagation of GO term significance
• No reasonable p-value correction method available
Elim method
The main idea: test how enriched node x is
if we do not consider the genes from its
significant children (Alexa et al., 2006).
• The nodes are processed bottom-up. This ensures that all children of node x
are investigated before node x itself.
• The p-value for node x is computed using Fisher’s exact test.
• If node x is found significant, all the genes mapped to this node are
removed from all its ancestors.
• Elimw: uses a heuristic to soften gene removal.
• Essentially the Parent-Child method at the other end of the DAG.
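The bottom-up procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published implementation: the universe size and the number of significant genes are kept fixed, the cutoff alpha = 0.01 is arbitrary, and all names and the toy data are hypothetical.

```python
from math import comb

def fisher_pvalue(x, N, M, K):
    """One-sided hypergeometric p-value P(X >= x)."""
    return sum(comb(M, i) * comb(N - M, K - i) for i in range(x, min(M, K) + 1)) / comb(N, K)

def elim(children, gene_sets, significant, universe, alpha=0.01):
    """Bottom-up elim sketch. Returns term -> p-value.

    children:    term -> list of child terms (a DAG).
    gene_sets:   term -> set of genes (already propagated to ancestors).
    significant: set of differentially expressed genes.
    universe:    set of all genes under study.
    """
    def descendants(term):
        out = set()
        for c in children.get(term, []):
            out |= {c} | descendants(c)
        return out

    desc = {t: descendants(t) for t in gene_sets}
    removed = {t: set() for t in gene_sets}  # genes eliminated from each term
    pvalues = {}
    # A child always has fewer descendants than its parent, so this order is bottom-up.
    for term in sorted(gene_sets, key=lambda t: len(desc[t])):
        genes = gene_sets[term] - removed[term]
        x = len(genes & significant)
        p = fisher_pvalue(x, len(universe), len(genes), len(significant))
        pvalues[term] = p
        if p < alpha:
            # Remove the genes of a significant term from all of its ancestors.
            for other in gene_sets:
                if term in desc[other]:
                    removed[other] |= genes
    return pvalues

# Toy data (hypothetical): the parent's signal comes entirely from its child.
universe = {f"g{i}" for i in range(100)}
significant = {f"g{i}" for i in range(10)}
child_genes = {f"g{i}" for i in range(8)} | {"g10", "g11"}
parent_genes = child_genes | {f"g{i}" for i in range(12, 30)}
children = {"parent": ["child"]}
gene_sets = {"parent": parent_genes, "child": child_genes}
pvals = elim(children, gene_sets, significant, universe)
unconditional_parent = fisher_pvalue(8, 100, 28, 10)
```

In this toy case the parent would pass an unconditional Fisher test, but once the significant child's genes are removed it has no significant genes left and its p-value becomes 1.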
Step method
• First attempt: apply both kinds of conditioning (Parent-Child and elim).
Good ordering but (very) little test power
• Need to reduce conditioning as much as possible (to recover
test power) => stepwise feature selection
• Asymptotically: hypergeometric test -> binomial -> normal ->
chi-square -> likelihood ratio
(information criteria test)
• Feature selection uses AIC/BIC = f(information criteria):
AIC = IC - d; BIC = IC - d*log(N)/2
• Translate BIC back in terms of the hypergeometric test => Fisher
test with adaptive p-value threshold
• Develop closed-form solutions specific to this particular
situation for the difference in deviances of two models
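One possible reading of the "adaptive p-value threshold" is sketched below in Python. It rests on assumptions that the slide does not state: IC is taken to be the maximized log-likelihood, d the number of selected terms, one term is added per step, and the deviance difference is treated as chi-square with 1 degree of freedom. Under those assumptions, a term is accepted when the deviance difference exceeds 2 (AIC) or log(N) (BIC), and that cutoff maps back to a p-value threshold. All function names are hypothetical.

```python
from math import erfc, log, sqrt

def chi2_1df_sf(c):
    """P(chi-square with 1 d.f. > c), computed from the normal tail."""
    return erfc(sqrt(c / 2.0))

def aic_threshold():
    """p-value cutoff implied by an AIC penalty of 1 per added term (deviance > 2)."""
    return chi2_1df_sf(2.0)

def bic_threshold(N):
    """p-value cutoff implied by a BIC penalty of log(N)/2 per added term (deviance > log N)."""
    return chi2_1df_sf(log(N))

# The BIC cutoff tightens as the sample size N grows (N values are illustrative).
for n in (100, 1000, 10000):
    print(n, round(bic_threshold(n), 5))
```

Under this reading the threshold is determined by N rather than chosen by the analyst, which is one way to interpret "no need to choose a cutoff".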
Step methods
• Reduces to Parent-Child / elim for nodes at the
bottom / top of the GO DAG
• Adaptive threshold: no need to choose a cutoff for
the p-value
• Results in independent tests (which makes p-value
correction methods valid)
• Developed in terms of the hypergeometric test: fast,
applicable to small GO terms
Simulation results
1000 iterations; 3 nodes enriched 1/20 vs 1/100

                tpr    fpr    tsel    sel
enriched nodes  1.000  0.000  3000   3000
sigN            0.063  0.936  2832  44696
sigNc           0.078  0.921  1036  13237
Grossmann       0.110  0.889  2034  18416
selGlobGO       0.608  0.391  1262   2073
selectsGOi      0.516  0.483  1252   2422
selectsGOi2     0.289  0.710  1441   4978
selectsGOih     0.472  0.527  1261   2669
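The tpr and fpr columns appear to be derived from the two count columns: tpr matches tsel/sel and fpr matches 1 − tsel/sel, both truncated to three decimals. The Python check below tests that reading against every row. Interpreting tsel as the number of correct selections and sel as the total number of selections is an assumption, since the slide does not define the columns.

```python
# (method, tsel, sel, tpr, fpr) rows copied from the simulation table.
rows = [
    ("enriched nodes", 3000, 3000, "1.000", "0.000"),
    ("sigN", 2832, 44696, "0.063", "0.936"),
    ("sigNc", 1036, 13237, "0.078", "0.921"),
    ("Grossmann", 2034, 18416, "0.110", "0.889"),
    ("selGlobGO", 1262, 2073, "0.608", "0.391"),
    ("selectsGOi", 1252, 2422, "0.516", "0.483"),
    ("selectsGOi2", 1441, 4978, "0.289", "0.710"),
    ("selectsGOih", 1261, 2669, "0.472", "0.527"),
]

def truncated_ratio(num, den):
    """num/den truncated (not rounded) to three decimals, returned as a string."""
    thousandths = num * 1000 // den
    return f"{thousandths // 1000}.{thousandths % 1000:03d}"

# Collect every row where the assumed relationship does not reproduce the table.
mismatches = [
    name
    for name, tsel, sel, tpr, fpr in rows
    if truncated_ratio(tsel, sel) != tpr or truncated_ratio(sel - tsel, sel) != fpr
]
print(mismatches)
```

Integer arithmetic is used for the truncation so that the comparison does not depend on floating-point rounding.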
- Affymetrix U133 gene arrays were used.
- Explored the APC-induced gene expression in the lung of
baboons challenged with lethal doses of E. coli, at 8 hrs.
Expression pattern and biological significance of the
differentially expressed genes were explored using
Gene Ontology (GO) and pathway analysis.
- 6 samples (3 control, 3 lethal E. coli)
- 8700 expressed genes
- 294 differentially expressed genes (at 0.01 FDR)
- 44 BP GO terms (p < 0.01)
GOBPID      mark  Term
GO:0009607    10  response to biotic stimulus
GO:0010038     8  response to metal ion
GO:0006508     9  proteolysis
GO:0045185        maintenance of protein localization
GO:0019363        pyridine nucleotide biosynthesis
GO:0042327     8  positive regulation of phosphorylation
GO:0008624    10  induction of apoptosis by extracellular signals
Significant GO terms with Step

GOBPID      Pvalue  ExpCount  Count  Size  Term
GO:0009607  0.0000   16.0550     39   519  response to biotic stimulus
GO:0010038  0.0090    0.1547      2     5  response to metal ion
GO:0006508  0.0012    9.4969     20   307  proteolysis
GO:0045185  0.0040    0.3403      3    11  maintenance of protein localization
GO:0019363  0.0055    0.1237      2     4  pyridine nucleotide biosynthesis
GO:0042327  0.0180    0.2165      2     7  positive regulation of phosphorylation
GO:0008624  0.0291    0.6806      3    22  induction of apoptosis by extracellular signals

GOBPID      markGO  pvlw    W  pvlGsm  G  pvlGlb  S  pvlGih  I
GO:0009607      10  1.0000  0  0.0002  1  0.0000  1  0.0000  1
GO:0010038       8  0.0090  1  0.4762  0  0.0047  1  0.0053  1
GO:0006508       9  0.0890  0  0.0012  1  0.0008  1  0.0014  1
GO:0045185       0  1.0000  0  0.0084  1  0.0008  1  0.0010  1
GO:0019363       0  0.0055  1  0.0084  1  0.0029  1  0.0032  1
GO:0042327       8  0.1448  0  0.2000  0  0.0015  1  0.0016  1
GO:0008624      10  0.0291  0  0.0307  0  0.0059  1  0.0069  1

`GO:0009607` response to biotic stimulus
"A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a
result of a biotic stimulus, a stimulus caused or produced by a living organism."
`GO:0010038` response to metal ion
"A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a
result of a metal ion stimulus."
`GO:0006508` proteolysis
"The hydrolysis of a peptide bond or bonds within a protein."
`GO:0045185` maintenance of protein localization
"The processes by which a protein is maintained in a location and prevented from moving elsewhere. These include sequestration,
stabilization to prevent transport elsewhere and the active retrieval of proteins that do move away."
`GO:0019363` pyridine nucleotide biosynthesis
"The chemical reactions and pathways resulting in the formation of a pyridine nucleotide, a nucleotide characterized by a pyridine
derivative as a nitrogen base."
`GO:0042327` positive regulation of phosphorylation
"Any process that activates or increases the frequency, rate or extent of addition of phosphoric groups to a molecule."
`GO:0008624` induction of apoptosis by extracellular signals
"Any process induced by extracellular signals that directly activates any of the steps required for cell death by apoptosis."
First Connected Component
GOBPID      Term                                             mark  I
GO:0030218  erythrocyte differentiation                         3  0
GO:0030099  myeloid cell differentiation                        9  0
GO:0006857  oligopeptide transport                              4  0
GO:0015833  peptide transport                                   0  0
GO:0045185  maintenance of protein localization                 0  1
GO:0006621  protein retention in ER                             5  0
GO:0019363  pyridine nucleotide biosynthesis                    0  1
GO:0007259  JAK-STAT cascade                                   10  0
GO:0018108  peptidyl-tyrosine phosphorylation                   0  0
GO:0042327  positive regulation of phosphorylation              8  1
GO:0008624  induction of apoptosis by extracellular signals    10  1

GOBPID      Term
GO:0006508  proteolysis
GO:0006511  ubiquitin-dependent protein catabolism
GO:0006568  tryptophan metabolism
GO:0006569  tryptophan catabolism
GO:0006576  biogenic amine metabolism
GO:0006586  indolalkylamine metabolism
GO:0006725  aromatic compound metabolism
GO:0009056  catabolism
GO:0009072  aromatic amino acid family metabolism
GO:0009074  aromatic amino acid family catabolism
GO:0019439  aromatic compound catabolism
GO:0019941  modification-dependent protein catabolism
GO:0030163  protein catabolism
GO:0042219  amino acid derivative catabolism
GO:0042402  biogenic amine catabolism
GO:0042430  indole and derivative metabolism
GO:0042434  indole derivative metabolism
GO:0042436  indole derivative catabolism
GO:0043285  biopolymer catabolism
GO:0043632  modification-dependent macromolecule catabolism
GO:0046218  indolalkylamine catabolism
Selection with Bayesian network
[Figure: Bayesian network over GO term nodes (labeled by GO ID) and a DIFF node]
The acute lymphoblastic leukemia (ALL) microarray
dataset of Chiaretti et al. (2004)
Differential gene expression between B-cell ALL with the
BCR/ABL fusion (37 samples) and cytogenetically normal
(NEG) B-cell ALL (42 samples)
The BCR/ABL fusion (Dudoit 2006)
A number of recent articles have investigated the prognostic relevance of the BCR/ABL fusion
in adult ALL of the B-cell lineage (Gleissner et al., 2002). The BCR/ABL fusion is the
molecular analogue of the Philadelphia chromosome, one of the most frequent cytogenetic
abnormalities in human leukemias. This t(9;22) translocation leads to a head-to-tail fusion of
the v-abl Abelson murine leukemia viral oncogene homolog 1 (ABL1) from chromosome
9 with the 5’ half of the breakpoint cluster region (BCR) on chromosome 22 (Figure 4). The
ABL1 proto-oncogene encodes a cytoplasmic and nuclear protein tyrosine kinase that has been
implicated in processes of cell differentiation, cell division, cell adhesion, and stress response.
Although the BCR/ABL fusion protein, encoded by sequences from both the ABL1 and BCR
genes, has been extensively studied, the function of the normal product of the BCR gene is
not clear. The BCR/ABL proto-oncogene has been found to be highly-expressed in chronic
myeloid leukemia (CML) and acute myeloid leukemia (AML) cells (Mukhopadhyay et al.,
2002). (See Figure 4 in Dudoit paper)
$`GO:0007155` cell adhesion
The attachment of a cell, either to another cell or to an
underlying substrate such as the extracellular matrix, via
cell adhesion molecules.
$`GO:0007154` cell communication
Any process that mediates interactions between a cell
and its surroundings. Encompasses interactions such
as signaling or attachment between one cell and another
cell, between a cell and an extracellular matrix, or
between a cell and any other aspect of its environment.
$`GO:0008283` cell proliferation
The multiplication or reproduction of cells, resulting in
the rapid expansion of a cell population.
$`GO:0007165` signal transduction
The cascade of processes by which a signal interacts
with a receptor, causing a change in the level or activity
of a second messenger or other downstream target,
and ultimately effecting a change in the functioning of the cell.
$`GO:0007166` cell surface receptor linked signal
transduction
Any series of molecular signals initiated by the binding of an
extracellular ligand to a receptor on the surface of the target
cell.
BCR vs NEG, ALL file

GOBPID      Pvalue  ExpCount  Count  Size  Term
GO:0007155  0.0000    6.8198     19   114  cell adhesion
GO:0008283  0.0856   11.1271     16   186  cell proliferation
GO:0007154  0.0001   42.4743     65   710  cell communication
GO:0007165  0.0002   40.4404     61   676  signal transduction
GO:0007166  0.0006   12.7423     25   213  cell surface receptor linked signal transduction
GO:0043067  0.0093    8.4949     16   142  regulation of programmed cell death
GO:0042981  0.0093    8.4949     16   142  regulation of apoptosis
GO:0048519  0.0021   17.7076     30   296  negative regulation of biological process
GO:0043118  0.0023   16.2120     28   271  negative regulation of physiological process
GO:0051243  0.0041   16.0924     27   269  negative regulation of cellular physiological process
GO:0048523  0.0083   16.9299     27   283  negative regulation of cellular process
GO:0009653  0.0005    8.4350     19   141  morphogenesis
GO:0007275  0.0008   21.4166     36   358  development
GO:0000902  0.0019    4.6662     12    78  cellular morphogenesis
GO:0007420  0.0019    0.2991      3     5  brain development
GO:0048731  0.0025    4.1876     11    70  system development
GO:0007399  0.0025    4.1876     11    70  nervous system development
GO:0031175  0.0042    0.7179      4    12  neurite development
GO:0009887  0.0051    2.7519      8    46  organ morphogenesis
GO:0048513  0.0066    7.4779     15   125  organ development
GO:0048468  0.0067    1.2563      5    21  cell development
GO:0048666  0.0077    0.8375      4    14  neuron development
GO:0007611  0.0036    0.1196      2     2  learning and/or memory
GO:0030036  0.0097    3.0510      8    51  actin cytoskeleton organization and biogenesis

GOBPID      pvlw    W  pvlGsm  G  pvlGlb  S  pvlGih  Ih
GO:0007155  0.0002  1  0.0000  1  0.0114  1  0.0002  1
GO:0008283  0.2002  0  0.0382  0  0.0012  1  0.0032  1
GO:0007154  0.0153  0  0.0000  1  0.0000  1  0.0000  1
GO:0007165  0.3324  0  0.8081  0  0.7354  0  0.9084  0
GO:0007166  0.5743  0  0.0656  0  0.7487  0  0.6235  0
GO:0043067  1.0000  0  0.0080  1  0.1348  0  0.1414  0
GO:0042981  0.3622  0  0.1605  0  0.1348  0  0.1414  0
GO:0048519  1.0000  0  0.0033  1  0.1178  0  0.0703  0
GO:0043118  0.4735  0  0.0014  1  0.2156  0  0.1252  0
GO:0051243  0.2914  0  0.0025  1  0.2096  0  0.1211  0
GO:0048523  0.6374  0  0.0166  0  0.2096  0  0.1211  0
GO:0009653  1.0000  0  0.0613  0  0.1760  0  0.0105  0
GO:0007275  0.8927  0  0.0008  1  0.0055  1  0.0069  0
GO:0000902  0.6335  0  0.0038  1  0.1496  0  0.0215  0
GO:0007420  0.0019  1  0.0103  0  0.0350  0  0.0099  0
GO:0048731  1.0000  0  0.0672  0  0.2145  0  0.0241  0
GO:0007399  0.3264  0  1.0000  0  0.2145  0  0.0241  0
GO:0031175  1.0000  0  0.4945  0  0.2947  0  0.1582  0
GO:0009887  0.0082  1  0.1553  0  0.0666  0  0.0084  0
GO:0048513  0.8719  0  0.2363  0  0.1892  0  0.0115  0
GO:0048468  1.0000  0  0.0443  0  0.1343  0  0.0416  0
GO:0048666  1.0000  0  0.3801  0  0.2947  0  0.1582  0
GO:0007611  0.0036  1  0.0077  1  1.0000  0  1.0000  0
GO:0030036  0.0073  1  0.7184  0  0.0680  0  0.1294  0

- The methods disagree on whether or not to include development
- Cell proliferation is not significant initially, but very significant after conditioning
REFERENCES
1. http://www.learner.org/channel/courses/biology/support/1_genom.pdf
2. Tarca, A.L., Romero, R., Draghici, S. (2006) Analysis of microarray experiments of gene expression profiling. American Journal of Obstetrics and Gynecology, 195(2), 373–388.
3. Khatri, P., Draghici, S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21(18), 3587–3595.
4. Alexa, A. et al. (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22(13).
5. Grossmann, S., Bauer, S., Robinson, P.N., Vingron, M. (2006) An improved statistic for detecting over-represented Gene Ontology annotations in gene sets. Lecture Notes in Computer Science 3909, pp. 85–98.
6. Drăghici, S. et al. (2003) Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res., 31, 3775–3781.
7. Falcon, S., Gentleman, R. (2007) Using GOstats to test gene lists for GO term association. Bioinformatics, 23, January 15, 2007.
8. Zhu, H. et al. (2007) Genomic and structural analysis of the protective effects of activated protein C in a baboon model of E. coli sepsis. ISTH 2007 Congress.
9. Chiaretti, S. et al. (2004) Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103, 2771–2778.
10. Dudoit, S. Multiple Tests of Association with Biological Annotation Metadata.