Answering biological questions
using large genomic data collections
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
10-05-09
A Definition of
Computational Functional Genomics
Genomic data
Gene → Function
Gene → Gene
Prior knowledge
Data → Function
Function → Function
MEFIT: A Framework for
Functional Genomics
Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…
Unrelated Gene Pairs
BRCA2 SOX2 0.1
RAD51 FOXP2 0.2
ACTR1 H6PD 0.15
…
[Figure: MEFIT; frequency of related and unrelated gene pairs plotted against correlation, from low to high]
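The related/unrelated histograms above suggest a simple Bayesian reading: if related pairs tend to fall in high-correlation bins and unrelated pairs in low ones, the bin a new pair lands in tells us how likely it is to be functionally related. The following is a minimal sketch of that idea only, not MEFIT's actual implementation. The bin boundaries, the pseudocount, the 0.5 prior, and all function names are assumptions made for illustration; the training values are the toy numbers from the slide.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(edges, value)


def fit_bin_likelihoods(values, edges, pseudocount=1.0):
    """Estimate P(bin | class) from example correlations, with add-one smoothing."""
    counts = [pseudocount] * (len(edges) + 1)
    for v in values:
        counts[bin_index(v, edges)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def posterior_related(value, related_lik, unrelated_lik, edges, prior=0.5):
    """Bayes' rule: P(related | correlation bin), given an assumed prior."""
    b = bin_index(value, edges)
    numerator = related_lik[b] * prior
    denominator = numerator + unrelated_lik[b] * (1.0 - prior)
    return numerator / denominator


# Toy training values taken from the slide's example pairs.
related = [0.9, 0.8, 0.85]
unrelated = [0.1, 0.2, 0.15]
edges = [0.25, 0.5, 0.75]  # hypothetical boundaries giving four bins

rel_lik = fit_bin_likelihoods(related, edges)
unrel_lik = fit_bin_likelihoods(unrelated, edges)

high = posterior_related(0.88, rel_lik, unrel_lik, edges)  # highly correlated pair
low = posterior_related(0.12, rel_lik, unrel_lik, edges)   # weakly correlated pair
print(f"P(related | 0.88) = {high:.2f}, P(related | 0.12) = {low:.2f}")
```

With real data, one such pair of histograms would be learned per dataset, which is what lets each dataset contribute according to how informative it is.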
MEFIT: A Framework for
Functional Genomics
Functional Relationship
Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998
Biological Context
Functional area
Tissue
Disease
…
Functional Interaction Networks
Currently have data from 30,000 human experimental results
(15,000 expression conditions + 15,000 diverse others),
analyzed for 200 biological functions and 150 diseases
[Figure: MEFIT; global interaction network and process-specific networks: autophagy, vacuolar transport, translation]
Predicting Gene Function
These edges provide a measure of how likely a gene is to
specifically participate in the process of interest.
[Figure: predicted relationships between a gene and the cell cycle genes; edge confidence from low to high]
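One way to turn "how likely a gene is to specifically participate" into a number is to compare a gene's average edge confidence to the process genes against its average edge confidence to every gene. This is a hedged sketch under that assumption; the network layout (a dict keyed by gene pairs), the gene names, the weights, and the function names are all hypothetical and are not taken from the talk.

```python
def edge_weight(network, a, b):
    """Look up the undirected edge confidence between two genes (0.0 if absent)."""
    return network.get(frozenset((a, b)), 0.0)


def mean_weight_to(network, gene, targets):
    """Average edge confidence from `gene` to each target other than itself."""
    others = [t for t in targets if t != gene]
    if not others:
        return 0.0
    return sum(edge_weight(network, gene, t) for t in others) / len(others)


def participation_score(network, gene, process_genes, all_genes):
    """Ratio of in-process connectivity to genome-wide connectivity.

    Values above 1 mean the gene is tied to the process more strongly
    than to the genome at large (an assumed reading of "specifically").
    """
    in_process = mean_weight_to(network, gene, process_genes)
    background = mean_weight_to(network, gene, all_genes)
    return in_process / background if background > 0 else 0.0


# Hypothetical toy network: c1..c3 stand in for known process genes.
process = ["c1", "c2", "c3"]
genome = process + ["x", "y", "z"]
network = {
    frozenset(("x", "c1")): 0.9, frozenset(("x", "c2")): 0.8,
    frozenset(("x", "c3")): 0.7, frozenset(("x", "y")): 0.1,
    frozenset(("x", "z")): 0.1, frozenset(("y", "c1")): 0.2,
    frozenset(("y", "c2")): 0.1, frozenset(("y", "c3")): 0.3,
    frozenset(("y", "z")): 0.8,
}

score_x = participation_score(network, "x", process, genome)
score_y = participation_score(network, "y", process, genome)
print(f"x: {score_x:.2f}, y: {score_y:.2f}")
```

In this toy network, gene x scores above 1 and gene y below 1, so x would be the stronger candidate for the process.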
Functional Associations Between Contexts
The average strength of these relationships indicates how cohesive
a process is.
[Figure: predicted relationships among the cell cycle genes; edge confidence from low to high]
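The cohesiveness statement above amounts to averaging the edge confidences among the genes of one process. A minimal sketch follows, assuming the average is compared against the genome-wide average as a baseline; the normalization, the data layout, the gene names, and the weights are illustrative assumptions rather than the method used in the talk.

```python
from itertools import combinations


def mean_pairwise_weight(network, genes):
    """Average edge confidence over all unordered pairs drawn from `genes`."""
    pairs = list(combinations(sorted(set(genes)), 2))
    if not pairs:
        return 0.0
    return sum(network.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def cohesiveness(network, process_genes, all_genes):
    """Within-process average divided by the genome-wide average (the baseline)."""
    baseline = mean_pairwise_weight(network, all_genes)
    within = mean_pairwise_weight(network, process_genes)
    return within / baseline if baseline > 0 else 0.0


# Hypothetical toy network: every pair starts at a weak background confidence,
# then the three process genes are given strong edges to one another.
genome = ["a", "b", "c", "d", "e"]
process = ["a", "b", "c"]
network = {frozenset(p): 0.1 for p in combinations(genome, 2)}
network.update({
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.7,
})

within = mean_pairwise_weight(network, process)
ratio = cohesiveness(network, process, genome)
print(f"within-process mean = {within:.2f}, ratio to baseline = {ratio:.2f}")
```

A ratio near 1 would mean the process is no more tightly connected than a random gene set; larger values indicate a cohesive process.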
Functional Associations Between Contexts
The average strength of these relationships indicates how
associated two processes are.
[Figure: predicted relationships between the cell cycle genes and the DNA replication genes; edge confidence from low to high]
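The association between two processes can be sketched the same way: average the edge confidences over every pair with one gene in each process. The code below is an illustration under that reading only. The gene names and weights are hypothetical placeholders, and skipping a gene paired with itself (when a gene belongs to both processes) is an assumption.

```python
def between_process_association(network, process_a, process_b):
    """Average edge confidence over pairs with one gene in each process.

    A gene that belongs to both processes is never paired with itself.
    """
    weights = [
        network.get(frozenset((a, b)), 0.0)
        for a in process_a
        for b in process_b
        if a != b
    ]
    return sum(weights) / len(weights) if weights else 0.0


# Hypothetical toy network: cc* = cell cycle, dr* = DNA replication, o* = other.
cell_cycle = ["cc1", "cc2"]
dna_replication = ["dr1", "dr2"]
other = ["o1", "o2"]
network = {
    frozenset(("cc1", "dr1")): 0.7, frozenset(("cc1", "dr2")): 0.6,
    frozenset(("cc2", "dr1")): 0.8, frozenset(("cc2", "dr2")): 0.7,
    frozenset(("cc1", "o1")): 0.1, frozenset(("cc1", "o2")): 0.1,
    frozenset(("cc2", "o1")): 0.1, frozenset(("cc2", "o2")): 0.1,
}

strong = between_process_association(network, cell_cycle, dna_replication)
weak = between_process_association(network, cell_cycle, other)
print(f"cell cycle vs DNA replication: {strong:.2f}; vs other: {weak:.2f}")
```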
Functional Associations Between Processes
[Figure: network of processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Aldehyde Metabolism, Cell Redox Homeostasis, Peptide Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Protein Processing, Negative Regulation of Protein Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance]
Gene callouts: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline; baseline (genomic background); very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)
HEFalMp: Predicting human gene function
HEFalMp: Predicting human genetic interactions
HEFalMp: Analyzing human genomic data
HEFalMp: Understanding human disease
Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy: 5½ of 7 predictions currently confirmed
[Figure: Luciferase (negative control), ATG5 (positive control), and predicted novel autophagy proteins LAMP2 and RAB11A; not starved vs. starved (autophagic)]
Comprehensive Validation of
Computational Predictions
With David Hess, Amy Caudy
Genomic data + Prior knowledge
Computational Predictions of Gene Function: MEFIT, SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005)
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory Experiments: growth curves, petite frequency, confocal microscopy
Retraining: new known functions for correctly predicted genes
Evaluating the Performance of
Computational Predictions
Genes involved in mitochondrion organization and biogenesis
106 Original GO Annotations
135 Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
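The "340 total" and ">3x" figures follow directly from the four counts on the slide. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
# Counts as given on the slide.
counts = {
    "Original GO Annotations": 106,
    "Under-annotations": 135,
    "Novel Confirmations, First Iteration": 82,
    "Novel Confirmations, Second Iteration": 17,
}

total = sum(counts.values())
fold_increase = total / counts["Original GO Annotations"]
print(f"{total} total genes, {fold_increase:.1f}x the original annotations")
```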
Evaluating the Performance of
Computational Predictions
Genes involved in mitochondrion organization and biogenesis
Computational predictions from large collections of genomic data can be
accurate despite incomplete or misleading gold standards, and they
continue to improve as additional data are incorporated.
[Figure: values 95, 40, 80, 17, 106 as extracted; labels: Original GO Annotations, Under-annotations, Confirmed Under-annotations, Novel Confirmations (First Iteration, Second Iteration); 340 total: >3x previously known genes in ~5 person-months]
Functional Maps:
Focused Data Summarization
Data integration summarizes an impossibly huge amount of
experimental data into an impossibly huge number of
predictions; what next?
Functional Maps:
Focused Data Summarization
How can a researcher take advantage of all this data to
study his/her favorite gene/pathway/disease without
losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
Thanks!
Olga Troyanskaya
Matt Hibbs
Chad Myers
David Hess
Edo Airoldi
Florian Markowetz
Hilary Coller
Erin Haley
Tsheko Mutungu
Shuji Ogino
Charlie Fuchs
Interested? I’m accepting students and postdocs!
http://www.huttenhower.org
http://function.princeton.edu/hefalmp
Next Steps:
Microbial Communities
• Data integration is off to a great start in humans
– Complex communities of distinct cell types
– Very sparse prior knowledge
• Concentrated in a few specific areas
– Variation across populations
– Critical to understand mechanisms of disease
Next Steps:
Microbial Communities
• What about microbial communities?
– Complex communities of distinct species/strains
– Very sparse prior knowledge
• Concentrated in a few specific species/strains
– Variation across populations
– Critical to understand mechanisms of disease
Next Steps:
Microbial Communities
~120 available expression datasets
~70 species
• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
Next Steps:
Functional Metagenomics
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Another data integration problem
– Must include datasets from multiple organisms
• Another context-specificity problem
– Now “context” can also mean “species”
• What questions can we answer?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– What’s shared within community X? What’s different? What’s unique?
– What’s perturbed in disease state Y? One organism, or many? Host interactions?
– Current methods annotate ~50% of synthetic data, <5% of environmental data