Transcript Bauer forum
Large scale genomic data mining
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
10-23-09
Mining Biological Data
~100 GB
How can we ask and answer
specific biomedical questions
using thousands of
genome-scale datasets?
More than 100GB
Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics
A Definition of Functional Genomics
Genomic data
Gene
↓
Function
Gene
↓
Gene
Prior knowledge
Data
↓
Function
Function
↓
Function
MEFIT: A Framework for
Functional Genomics
Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…
Unrelated Gene Pairs
BRCA2 SOX2 0.1
RAD51 FOXP2 0.2
ACTR1 H6PD 0.15
…
[Figure: MEFIT; frequency of related and unrelated gene pairs plotted against correlation, from low correlation to high correlation]
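The idea on this slide, that a dataset is informative when related and unrelated gene pairs show different correlation distributions, can be sketched as a small Bayesian calculation. This is only an illustration, not MEFIT's actual implementation: the bin edges, pseudocount, prior, and function names are assumptions, and the six training pairs are the example values from the slide.

```python
from bisect import bisect_right

BIN_EDGES = [0.25, 0.5, 0.75]  # assumed: four correlation bins, low to high


def bin_of(corr):
    """Index of the correlation bin that `corr` falls into."""
    return bisect_right(BIN_EDGES, corr)


def bin_frequencies(correlations, pseudocount=1.0):
    """Smoothed frequency of each correlation bin for one class of gene pairs."""
    counts = [pseudocount] * (len(BIN_EDGES) + 1)
    for c in correlations:
        counts[bin_of(c)] += 1
    total = sum(counts)
    return [n / total for n in counts]


def posterior_related(corr, related, unrelated, prior=0.5):
    """P(pair is related | observed correlation), by Bayes' rule."""
    p_rel = bin_frequencies(related)[bin_of(corr)] * prior
    p_unrel = bin_frequencies(unrelated)[bin_of(corr)] * (1 - prior)
    return p_rel / (p_rel + p_unrel)


related = [0.9, 0.8, 0.85]    # BRCA1-BRCA2, BRCA1-RAD51, RAD51-TP53
unrelated = [0.1, 0.2, 0.15]  # BRCA2-SOX2, RAD51-FOXP2, ACTR1-H6PD

high = posterior_related(0.88, related, unrelated)  # a highly correlated pair
low = posterior_related(0.12, related, unrelated)   # a weakly correlated pair
```

With several datasets, one such likelihood table per dataset could be multiplied together under an independence assumption, which is the usual naive Bayes simplification.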
MEFIT: A Framework for
Functional Genomics
Functional Relationship
Biological Context: Functional area, Tissue, Disease, …
Datasets: Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998
Functional Interaction Networks
Global interaction network
Currently have data from
30,000 human experimental results,
15,000 expression conditions +
15,000 diverse others, analyzed for
200 biological functions and
150 diseases
MEFIT
Autophagy network
Vacuolar transport
network
Translation network
Predicting Gene Function
Predicted relationships
between genes
Low
Confidence
High
Confidence
These edges provide
a measure of how
likely a gene is to
specifically participate
in the process of
interest.
Cell cycle genes
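A minimal sketch of how such edges could be turned into a per-gene score: average a gene's edge confidence to the known process genes and divide by its average confidence to every other gene, so that nonspecific, highly connected genes are not rewarded. The ratio form, the toy network, and all gene names and weights below are hypothetical and are not the scoring used in the talk.

```python
def process_score(gene, process_genes, weights):
    """Mean edge weight from `gene` to the process genes, relative to its
    mean edge weight to all other genes (an assumed scoring rule).
    Missing edges count as weight 0."""
    def w(a, b):
        return weights.get(frozenset((a, b)), 0.0)

    all_genes = {g for pair in weights for g in pair}
    others = all_genes - {gene}
    members = set(process_genes) - {gene}
    to_process = sum(w(gene, g) for g in members) / len(members)
    background = sum(w(gene, g) for g in others) / len(others)
    return to_process / background if background > 0 else 0.0


# Hypothetical edge confidences between toy genes.
weights = {
    frozenset(("CC_A", "CC_B")): 0.9,
    frozenset(("CC_A", "CAND_1")): 0.8,
    frozenset(("CC_B", "CAND_1")): 0.7,
    frozenset(("CAND_1", "OTHER")): 0.1,
    frozenset(("CAND_2", "CC_A")): 0.1,
    frozenset(("CAND_2", "OTHER")): 0.6,
    frozenset(("CC_B", "OTHER")): 0.2,
}
cell_cycle = ["CC_A", "CC_B"]  # stand-ins for known cell cycle genes

score_1 = process_score("CAND_1", cell_cycle, weights)
score_2 = process_score("CAND_2", cell_cycle, weights)
```

In this toy network CAND_1 scores above 1 (more connected to the process than to the background) and CAND_2 scores below 1.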
Comprehensive Validation of
Computational Predictions
With David Hess, Amy Caudy
Genomic data Prior knowledge
Computational Predictions of Gene Function
SPELL
bioPIXIE
MEFIT
Hibbs et al 2007
Myers et al 2005
Retraining
New known functions for
correctly predicted genes
Genes predicted to function in
mitochondrion organization
and biogenesis
Laboratory Experiments: growth curves, petite frequency, confocal microscopy
Evaluating the Performance of
Computational Predictions
Genes involved in mitochondrion organization and biogenesis
Original GO Annotations: 106
Under-annotations: 135
Novel Confirmations, First Iteration: 82
Novel Confirmations, Second Iteration: 17
340 total: >3x previously known genes in ~5 person-months
[Figure: the same breakdown of genes involved in mitochondrion organization and biogenesis, with confirmed under-annotations shown separately; values shown: 95, 40, 80, 17, 106]
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
Functional Associations Between Contexts
Predicted relationships
between genes
Low
Confidence
High
Confidence
The average strength
of these relationships
indicates how cohesive
a process is.
Cell cycle genes
Functional Associations Between Contexts
Predicted relationships
between genes
Low
Confidence
High
Confidence
The average strength of these
relationships indicates how
associated two processes are.
Cell cycle genes
DNA replication genes
Functional mapping:
Scoring functional associations
How can we formalize
these relationships?
Any sets of genes G1 and G2
in a network can be compared
using four measures:
• Edges between their genes
• Edges within each set
• The background edges
incident to each set
• The baseline of all edges
in the network
Stronger connections between
the sets increase association.
\[ FA_{G_1,G_2} = \frac{\mathrm{between}(G_1, G_2)}{\mathrm{background}(G_1, G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1, G_2)} \]
Stronger within self-connections or nonspecific
background connections decrease association.
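The score above can be computed directly once the four measures are pinned down. The sketch below assumes each measure is a mean edge weight (between the two sets, within either set, over edges incident to either set, and over all edges), that the network is fully connected, and that each set has at least two genes; these definitions and the toy network are assumptions for illustration.

```python
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fa_score(g1, g2, weights):
    """FA = (between / background) * (baseline / within) for disjoint gene sets."""
    def w(a, b):
        return weights[frozenset((a, b))]

    genes = sorted({g for pair in weights for g in pair})
    union = set(g1) | set(g2)
    between = mean(w(a, b) for a in g1 for b in g2)
    within = mean(w(a, b) for s in (g1, g2) for a, b in combinations(s, 2))
    background = mean(w(a, b) for a, b in combinations(genes, 2)
                      if a in union or b in union)
    baseline = mean(w(a, b) for a, b in combinations(genes, 2))
    return (between / background) * (baseline / within)


def toy_network(between_weight):
    """Six hypothetical genes; two cohesive pairs joined by `between_weight`."""
    weights = {frozenset(p): 0.1 for p in combinations("abcdef", 2)}
    for pair in ("ab", "cd"):
        weights[frozenset(pair)] = 0.5
    for a in "ab":
        for b in "cd":
            weights[frozenset((a, b))] = between_weight
    return weights


fa_linked = fa_score(["a", "b"], ["c", "d"], toy_network(0.5))
fa_unlinked = fa_score(["a", "b"], ["c", "d"], toy_network(0.1))
```

Raising the between-set weights raises the score, while the strong within-set edges in the denominator keep two merely cohesive sets from looking associated.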
Functional mapping:
Bootstrap p-values
• Scoring functional associations is great… how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
[Figure: histograms of FAs for random sets of 1, 5, 10, and 50 genes]
Null distribution is approximately normal with mean 1.
Standard deviation is asymptotic in the sizes of both gene sets; the estimate \hat{\sigma}_{FA}(G_i, G_j) is fit as a function of |G_i| and |G_j| with terms A, B, and C.
[Figure: null distribution σs for one graph, plotted against |G1| and |G2|]
\[ P(FA_{G_1,G_2} \geq x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x) \]
Maps FA scores to p-values for any gene sets and underlying graph.
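The bootstrap described here can be sketched in a few lines: draw many random gene-set pairs of the sizes of interest, fit a normal distribution to their scores, and read the p-value off its upper tail. The random network, the stand-in score (mean between-set weight over the baseline, rather than the full FA score), and the draw count are assumptions for illustration.

```python
import random
from itertools import combinations
from statistics import NormalDist, mean, stdev


def null_distribution(score, genes, size1, size2, draws=500, seed=0):
    """Fit a normal null to the scores of randomly drawn, disjoint gene sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(draws):
        sample = rng.sample(genes, size1 + size2)
        scores.append(score(sample[:size1], sample[size1:]))
    return NormalDist(mean(scores), stdev(scores))


def p_value(observed, null):
    """P(score >= observed) under the fitted normal null: 1 - CDF(observed)."""
    return 1.0 - null.cdf(observed)


# Hypothetical network: 40 genes with uniformly random edge weights.
rng = random.Random(1)
genes = [f"g{i}" for i in range(40)]
weights = {frozenset(p): rng.random() for p in combinations(genes, 2)}
baseline = mean(weights.values())


def between_over_baseline(g1, g2):
    """Stand-in association score: mean between-set weight over the baseline."""
    return mean(weights[frozenset((a, b))] for a in g1 for b in g2) / baseline


null = null_distribution(between_over_baseline, genes, 5, 5)
p_typical = p_value(null.mean, null)
p_strong = p_value(null.mean + 3 * null.stdev, null)
```

In practice one null would be fitted per pair of set sizes (or the standard deviation interpolated across sizes), so that scores from sets of different sizes become comparable.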
Functional Associations Between Processes
[Figure: functional map of processes. Nodes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Aldehyde Metabolism, Cell Redox Homeostasis, Peptide Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Protein Processing, Negative Regulation of Protein Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance. Example gene callouts: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …]
Edges: Associations between processes (moderately strong to very strong)
Nodes: Cohesiveness of processes (below baseline; baseline (genomic background); very cohesive)
Borders: Data coverage of processes (sparsely covered to well covered)
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
Data integration summarizes an
impossibly huge amount of
experimental data into an
impossibly huge number of
predictions; what next?
How can a biologist take
advantage of all this data to
study his/her favorite
gene/pathway/disease without
losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
HEFalMp: Predicting human gene function
HEFalMp: Predicting human genetic interactions
HEFalMp: Analyzing human genomic data
HEFalMp: Understanding human disease
Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy
5½ of 7 predictions
currently confirmed
Luciferase (negative control)
ATG5 (positive control)
Predicted novel autophagy proteins: LAMP2, RAB11A
Conditions: not starved vs. starved (autophagic)
Current Work: Molecular
Mechanisms in a Colon Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Nurses’ Health Study
Health Professionals Follow-Up Study
LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level
~3,100 gastrointestinal subjects
~2,100 cancer mutation tests
~1,200 LINE-1 methylation
~3,800 tissue samples
~1,450 colon cancer samples
~1,150 CpG island methylation
~700 TMA immunohistochemistry
~775 gene expression
DASL Gene Expression
• Gene expression analysis from paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida
Colon Cancer:
LINE-1 methylation levels
With Shuji Ogino, Charlie Fuchs
Lower LINE-1 methylation associates
with poor colon cancer prognosis.
LINE-1 methylation varies
remarkably between individuals…
…but it is highly correlated
within individuals.
[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject (Ogino et al, 2008); Methylation %, Tumor #1 versus Methylation %, Tumor #2, both axes 30 to 80; ρ = 0.718, p < 0.01]
What does it all mean??
What is the biological mechanism linking LINE-1 methylation to colon cancer?
Is anything different about these outliers?
This suggests a genetic effect.
This suggests a copy number variation.
This suggests linkage to a cancer-related pathway.
Colon Cancer:
LINE-1 methylation levels
Preliminary Data
• Six genes differentially expressed even using naïve methods
• One uncharacterized, one oncogene, three malignancy, one histone
• 1/3 are from a family with known variable GI expression, prognostic value
• 2/3 fall in same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a set of transmembrane receptors/channels
• Better analysis pulls out mostly one-carbon metabolism and a few more signaling pathways (neurotransmitters??)
Check back in a couple of months!
What is the biological mechanism linking LINE-1 methylation to colon cancer?
Next Steps:
Microbial Communities
• Data integration is off to a great start in humans
– Complex communities of distinct cell types
– Very sparse prior knowledge
• Concentrated in a few specific areas
– Variation across populations
– Critical to understand mechanisms of disease
Next Steps:
Microbial Communities
• What about microbial communities?
– Complex communities of distinct species/strains
– Very sparse prior knowledge
• Concentrated in a few specific species/strains
– Variation across populations
– Critical to understand mechanisms of disease
Next Steps:
Functional Metagenomics
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Another data integration problem
– Must include datasets from multiple organisms
• Another context-specificity problem
– Now “context” can also mean “species”
• What questions can we answer?
– How do human microflora interact with diabetes,
obesity, oral health, antibiotics, aging, …
– What’s shared within community X?
What’s different? What’s unique?
– What’s perturbed in disease state Y?
One organism, or many? Host interactions?
– Current methods annotate ~50% of synthetic data,
<5% of environmental data
Next Steps:
Microbial Communities
~120 available
expression
datasets
~70 species
• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
Functional Maps for
Functional Metagenomics
[Figure: genes YG1–YG17 mapped to orthologous groups KO1–KO9 (KO1: YG1, YG2, YG3; KO2: YG4; KO3: YG6; …), with groups linked to genes from other genomes (ECG1, ECG2; PAG1; ECG3, PAG2; …)]
Validating Orthology-Based
Functional Mapping
Does unweighted data integration predict functional relationships?
What is the effect of “projecting” through an orthologous space?
[Figure: for each question, GO and KEGG panels plotting log(Precision/Random) against Recall for individual datasets and unsupervised integration]
Validating Orthology-Based
Functional Mapping
[Figure: yeast genes YG1–YG17 split into a holdout set (uncharacterized “genome”) and random subsets (characterized “genomes”)]
Validating Orthology-Based
Functional Mapping
Can subsets of the yeast genome predict a heldout subset’s functional maps?
Can subsets of the yeast genome predict a heldout subset’s interactome?
[Figure: GO and KEGG panels for each question; values shown: 0.68, 0.48, 0.30, 0.37, 0.40, 0.39, 0.25, 0.27, 0.43, 0.39]
What have we learned?
• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case
Functional Maps for
Functional Metagenomics
Now, what happens if you do this for characterized microbes?
• ~20 (somewhat) well-characterized species
• 1-35 datasets each
• Integrate within species
• Evaluate using KEGG
• Then cross-validate by holding out species
[Figure: KEGG; log(Precision/Random) against Recall for unsupervised integrations]
Next Steps:
Missing Methodology, Mining
• Most machine learning algorithms are
optimized for one of two cases:
– Small, dense data
– Large, sparse data
• HEFalMp integrates ~300M records using
~1K features, relatively few of which are
missing, in ~200 contexts
Simple models, efficient algorithms
Next Steps:
Missing Methodology, Models
[Diagram, built up over three slides: (1) a Functional Relationship node with Dataset #1, Dataset #2, Dataset #3, …; (2) the same with a Biological Context node added; (3) the context expanded into Regulation, Cross-Species Orthology, Cellular Processes, Developmental Stage, Tissue/Cell Lineage, Disease State, Types of Interactions, …]
This is clearly not a sustainable system;
novel large-scale hierarchical modeling is needed
to capture the complex biology of metazoan and
metagenomic interaction networks.
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
• Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: improves on Bayes Net Toolbox by ~22x in memory usage and up to >100x in runtime.
[Table: original processing time versus current processing time; values shown: 8 hours, 30 years, 1 minute, 2 months, 18 hours, 2-3 hours]
Outline
1. Methodology: Algorithms for mining genome-scale datasets
• Bayesian system for genomic data integration
• Sleipnir software for efficient large scale data mining
• Functional mapping to statistically summarize large data collections
2. Applications: Human molecular data and clinical cancer cohorts
• HEFalMp system for human data analysis and integration
• Six confirmed predictions in autophagy
• Ongoing analysis of LINE-1 methylation in colon cancer
3. Next steps: Methods for microbial communities and functional metagenomics
• Data integration applied to microbial communities and functional metagenomics
• Efficient machine learning for large, dense feature spaces
Thanks!
Olga Troyanskaya
Matt Hibbs
Chad Myers
David Hess
Edo Airoldi
Florian Markowetz
Hilary Coller
Erin Haley
Tsheko Mutungu
Shuji Ogino
Charlie Fuchs
Interested? We’re looking
for students and postdocs!
Biostatistics Department
http://huttenhower.sph.harvard.edu
http://function.princeton.edu/hefalmp
http://function.princeton.edu/sleipnir
Colon Cancer:
Immunohistochemistry
[Table: genes (AKT1, AURKA, CCND1, …) by conditions (Tumor #1, Tumor #2, …, Tumor #700), with IHC quantities as entries; example values shown: 0, 0, 25, 11, 5, 0, 55, 0, 30, …]
The world’s smallest, cheapest microarray!
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Colon Cancer:
Immunohistochemistry
~700 Tumor Samples
[Figure: IHC pseudoexpression (0 to 80) in LINE-1 hypomethylated outliers (low) versus LINE-1 methylation “normal” tumors, for STAT3, EPAS1, VDR, JCVT, HIF1A, CTNNB1, CDKN1B, SIRT1, AURKA, KDM1, MAPK, PTGER2, CDX2, HDAC3, DNMT1, ESR2, PPARG, AKT1, CDK8, PRKAA1, CTSB, MTOR, PTEN, TP53, CCND1, STMN1]
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Colon Cancer:
Mining Microarrays
~650 datasets
~15,000 expression conditions
~24,000 genes
26 genes in signature
[Figure: log2( Low / Normal ), from -0.6 to 1.2, for the 26 signature genes (STAT3 through STMN1); conditions ordered from most like our 26-gene LINE-1 differential methylation signature to least like the signature]
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
Colon Cancer:
Mining Microarrays
[Figure: conditions (X, Y, Z, A, B, C, D, E) from Dataset 1 and Dataset 2, ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature]
“The goal of GSEA is to determine whether members of a gene [data] set S tend to occur toward the top (or bottom) of the list L.”
Subramanian et al, 2005
Datasets shown:
• Bleomycin effect on mutagen-sensitive lymphoblastoid cells
• Folic acid deficiency effect on colon cancer cells
• Normal tissue of diverse types
• Muscle function and aging
• Bladder tumor stage classification
• Non-diseased lung tissue
What CNV-linked genes are differentially expressed in these datasets?
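The quoted GSEA idea, testing whether members of a set cluster toward the top or bottom of a ranked list, can be illustrated with an unweighted running sum. This is a simplified Kolmogorov-Smirnov-style sketch, not the full GSEA procedure (no correlation weighting and no permutation-based significance); the condition names and set memberships are hypothetical.

```python
def enrichment_score(ranked, members):
    """Largest deviation from zero of an unweighted running sum over `ranked`.

    The sum steps up at each item in `members` and down otherwise, scaled so
    that it starts and ends at zero. Assumes at least one member and one
    non-member appear in `ranked`.
    """
    members = set(members) & set(ranked)
    n_hit = len(members)
    n_miss = len(ranked) - n_hit
    running, best = 0.0, 0.0
    for item in ranked:
        running += 1.0 / n_hit if item in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Conditions ranked from most like to least like a signature (hypothetical).
ranked = ["X", "Y", "Z", "A", "B", "C", "D", "E"]
top_heavy = enrichment_score(ranked, ["X", "Y", "Z"])
bottom_heavy = enrichment_score(ranked, ["C", "D", "E"])
```

A score near +1 means the set's members sit at the top of the list and a score near -1 means they sit at the bottom; a set scattered evenly through the list stays close to zero.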
Colon Cancer:
Mining Microarrays
[Figure: genes (X, Y, Z, A, B, C, D, E) in CNV 1 and CNV 2, ranked from most upregulated in significantly enriched datasets to most downregulated]
“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.”
Subramanian et al, 2005
• PSGs (11 genes on 19q13.3)
• PCDHs (~50 genes on 5q31.3)
• Misc. ~12 genes on 16p13.3
• ?
Iafrate et al, 2005
Colon Cancer:
Mining Microarrays
Pregnancy specific β glycoproteins
“PSG9 is not found in the nonpregnant adult except in association with cancer, and it appears to be an early molecular event associated with colorectal cancer.”
Salahshor et al, 2005: Differential gene expression profile reveals deregulation of pregnancy specific β1 glycoprotein 9 early during colorectal carcinogenesis
Iafrate et al, 2005
Colon Cancer:
Generating a Hypothesis
Pregnancy specific β glycoproteins
Colon Cancer:
Using All the Data
What’s the state of the data?
• Extremely hypomethylated colon cancer carries a significantly poor prognosis
• In our cohort, these ~20 tumors are weakly enriched for a protein activity signature
based on IHC
• The expression datasets most enriched for the same signature represent mainly GI
cancer and chemotherapy conditions
• The PSG gene family is upregulated in these datasets and is linked to a known CNV
• HEFalMp associates the PSGs with cancer based on correlation with known colorectal
cancer genes in a variety of expression datasets
What is the biological mechanism linking LINE-1 methylation to colon cancer?
  Get back to me in a couple of months…
What does the IHC data tell us about LINE-1 hypomethylation?
  Nothing definite – yet.
Can existing microarrays amplify the LINE-1 hypomethylation signal?
  Yes (caveat investigator)
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
  GI cancers and chemotherapy
What CNV-linked genes are differentially expressed in these datasets?
  Pregnancy specific β glycoproteins
Human Regulatory Networks
Quiescence: reversible exit from the cell cycle
G0
[Figure: expression of 6,829 genes in clusters I–X across serum starvation (1, 2, 4, 8, 24, 96 hrs) and serum re-stimulation (1, 2, 4, 8, 24, 48 hrs). Motifs from FIRE (Elemento et al. 2007): Elk-1, YY1, NF-Y, Sp1. Cluster annotations: Development, RNA processing, Cell cycle, Metabolism, Protein localization, Cholesterol, Development]
• Of only five regulators found, four have generic cell cycle/proliferation targets
• Just five basic regulators for ~7,000 genes?
• These motifs only appear upstream of ~half of the genes
Regulatory Modules:
Expression Biclusters + Sequence Motifs
Bicluster: Coregulated subset of genes and conditions
[Figure: expression matrix of genes CRG1–CRG4 and RND1–RND8 across conditions 1–8, shown first with genes unordered, then with genes sorted (CRG1–CRG4 first), then with conditions also reordered (1, 3, 4, 7, 2, 5, 6, 8)]
…any dataset can contain many overlapping biclusters…
…any gene or condition can participate in multiple biclusters…
…do all that, and simultaneously find (under)enriched sequence motifs!
COALESCE: Combinatorial Algorithm for
Expression and Sequence-based Cluster Extraction
Inputs:
• Gene Expression
• DNA Sequence (upstream flank, 5’ UTR, 3’ UTR, downstream flank)
• Evolutionary Conservation
• Nucleosome Positions
Steps:
• Create a new module
• Feature selection (tests for differential expression/frequency):
  – Identify conditions where genes coexpress
  – Identify motifs enriched in genes’ sequences
• Bayesian integration: select genes based on conditions and motifs
• Subtract mean from all data
Regulatory modules
• Coregulated genes
• Conditions where they’re coregulated
• Putative regulating motifs
COALESCE: Selecting
Coexpressed Conditions
• For each gene expression condition…
– Compare distributions of values for
• Genes in the module versus
• Genes not in the module
– If significantly different, include the condition
Preserving data structure:
• If multiple conditions derive from the same
dataset, can be included/excluded as a unit
• For example, time course vs. deletion collection
• Test using multivariate z-test
• Precalculate covariance matrix; still very efficient
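The per-condition test can be sketched with a simple two-sample z-test. This is a simplified, single-condition stand-in for the multivariate z-test mentioned above (it ignores covariance between conditions from the same dataset); the significance threshold, condition names, and expression values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def condition_p_value(in_module, out_module):
    """Two-sided z-test p-value comparing mean expression in vs. out of a module."""
    se = sqrt(variance(in_module) / len(in_module)
              + variance(out_module) / len(out_module))
    z = (mean(in_module) - mean(out_module)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def select_conditions(expression, module, alpha=0.05):
    """Return conditions whose values differ between module and other genes.

    `expression` maps condition name -> {gene: value}.
    """
    selected = []
    for condition, values in expression.items():
        inside = [v for g, v in values.items() if g in module]
        outside = [v for g, v in values.items() if g not in module]
        if condition_p_value(inside, outside) < alpha:
            selected.append(condition)
    return selected


genes = [f"g{i}" for i in range(10)]
module = set(genes[:4])  # hypothetical developing module
expression = {
    "heat": dict(zip(genes, [2.1, 1.9, 2.2, 2.0, 0.1, -0.2, 0.0, 0.2, -0.1, 0.1])),
    "control": dict(zip(genes, [0.1, -0.1, 0.2, -0.2, 0.0, 0.1, -0.1, 0.2, -0.2, 0.05])),
}
selected = select_conditions(expression, module)
```

Only the "heat" condition separates the module genes from the rest, so it is the only one selected.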
COALESCE: Selecting
Significant Motifs
• COALESCE looks for three kinds of motifs:
– K-mers (e.g. ACGACGT)
– Reverse complement pairs (e.g. ACGACAT | ATGTCGT)
– Probabilistic Suffix Trees (PSTs)
[Figure: example PST over the nucleotides A, C, G, T]
• For every possible motif…
– Compare distributions of values for
• Genes in the module versus
• Genes not in the module
– If significantly different, include the motif
• This can distinguish flanks from UTRs
• Fast!
• Efficient enough to search coding sequence (e.g. exons/introns)
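The first two motif types are easy to make concrete. The sketch below counts k-mers and treats a motif and its reverse complement as one feature, then compares the motif's frequency in module and non-module sequences; the sequences are hypothetical, and the significance test applied to these frequencies is left out.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence, e.g. ACGACAT -> ATGTCGT."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count every overlapping k-mer in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def motif_frequency(sequences, motif):
    """Mean occurrences per sequence of `motif` or its reverse complement."""
    pair = {motif, reverse_complement(motif)}
    k = len(motif)
    total = sum(sum(kmer_counts(s, k)[m] for m in pair) for s in sequences)
    return total / len(sequences)


# Hypothetical upstream sequences for genes in and out of a module.
in_module = ["TTACGACATGG", "CCATGTCGTAA"]
out_module = ["TTTTTTTTTTT", "GGGGCCCCGGG"]

freq_in = motif_frequency(in_module, "ACGACAT")
freq_out = motif_frequency(out_module, "ACGACAT")
```

Here the motif appears once per module sequence (once on each strand) and never outside the module.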
COALESCE: Selecting
Probable Genes
• For each gene in the genome…
For each significant condition…
For each significant motif…
What’s the probability the gene came from the module’s distribution?
What’s the probability that it came from outside the module?
The probability of a gene being in the module given some data…
\[ P(g \in M \mid D) = \frac{P(D \mid g \in M)\, P(g \in M)}{P(D \mid g \in M)\, P(g \in M) + P(D \mid g \notin M)\, P(g \notin M)} \]
Prior is used to stabilize module convergence; genes already in the module are more likely to stay there next iteration.
Distributions of each feature in and out of the developing module are observed from the data.
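Bayes' rule for module membership can be written out directly. The sketch below assumes normal distributions for each feature and independence between features (a naive Bayes simplification); the feature names, distribution parameters, and priors are hypothetical.

```python
from math import prod
from statistics import NormalDist


def membership_posterior(observations, inside, outside, prior):
    """P(g in M | D) from per-feature likelihoods inside and outside the module.

    `observations` maps feature -> the gene's value; `inside` and `outside`
    map feature -> NormalDist fitted to genes in or out of the module.
    """
    like_in = prod(inside[f].pdf(x) for f, x in observations.items())
    like_out = prod(outside[f].pdf(x) for f, x in observations.items())
    numerator = like_in * prior
    return numerator / (numerator + like_out * (1.0 - prior))


# Hypothetical feature distributions for one condition and one motif.
inside = {"cond1": NormalDist(2.0, 1.0), "motif1": NormalDist(3.0, 1.0)}
outside = {"cond1": NormalDist(0.0, 1.0), "motif1": NormalDist(0.5, 1.0)}

# A gene whose values sit exactly between the two distributions:
ambiguous = {"cond1": 1.0, "motif1": 1.75}
p_new = membership_posterior(ambiguous, inside, outside, prior=0.5)
p_already_in = membership_posterior(ambiguous, inside, outside, prior=0.9)

# A gene that looks like the module on both features:
p_clear = membership_posterior({"cond1": 2.0, "motif1": 3.0}, inside, outside, prior=0.5)
```

When the data are uninformative the posterior equals the prior, which is how a higher prior for current members keeps them in the module between iterations.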
COALESCE: Integrating
Additional Data Types
Nucleosome placement (N), Evolutionary conservation (C)
• Can be included as additional datasets and feature selected just like expression conditions/motifs.
      N     C
G1    2.5   0.0
G2    0.6   0.5
G3    1.2   0.9
…     …     …
• Or can be used as a prior or weight on the values of individual motifs.
[Figure: example sequence TCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATG]
COALESCE Results:
S. cerevisiae Modules
The haystack: ~6,000 genes, ~2,200 conditions
A needle: 100 genes, 80 conditions
COALESCE Results:
Yeast TF/Target Accuracy
[Figure: Z-Score (-0.3 to 1.3) for COALESCE, cMonkey, FIRE, and Weeder across yeast transcription factors Bas1p, Hap4p, Met32p, Cup2p, Met31p, Zap1p, Upc2p, Mbp1p, Hsf1p, Gln3p, Hap3p, Gcn4p, Uga3p, Gis1p, Hap5p]
COALESCE Results:
Yeast Clustering Accuracy
• ~2,200 yeast conditions
– Recapitulation of known biology from Gene Ontology
C. elegans: Up in larvae, down in adults
– GATA in 5’ flank, miR-788 seed in 3’ UTR
M. musculus: Up in callosal and motor neurons
– ASCL1 in 5’ flank, unch. sequences underenriched in 3’ UTR
H. sapiens: Up in normal muscle, down in diabetic
– AAGGGGC (zf?) and enriched in 5’ flank
COALESCE: Coregulated
Quiescence Modules
Up during quiescence entry, down during quiescence exit
Down with let-7 exposure
Many known related (proliferation) motifs:
Pax4, Staf, NFKB1, Gfi, ESR1, Runx1, Su(H)
let-7 motifs predicted in 3’ UTR (UACCUC)
Down during quiescence entry,
enriched for transport/trafficking
Down during quiescence entry, up during quiescence exit,
down with adenoviral infection
miR-297 motif predicted in 3’ UTR (CACATAC)
Specific predicted uncharacterized reverse complement motif
Summary
• COALESCE algorithm for regulatory module
prediction
– Biclustering + putative de novo motifs
– Optimized for complex organisms (fast!)
• Large genomes, large data collections
– High accuracy, low false positives
– Leverage prior knowledge, multiple data types