Advancing Science with DNA Sequence
IMG terms and pathways
Natalia Ivanova
Iain Anderson
Thanos Lykidis
Nikos Kyrpides
Krishna Palaniappan
Amy Chen
Frank Korzeniewski
Yuri Grechkin
Ernest Szeto
Victor Markowitz
MGM Workshop
May 16, 2012
New: SEED subsystems, Transport DB, Phenotypes
Why so many?
What’s the difference?
Which one should I use?
Where it all comes from
• Experimental data: gene A in a genome X
  - catalyzes a reaction
  - interacts with another protein(s)
  - gene knock-out causes certain phenotype
  - …
• This information is recorded in a structured way:
  - ontologies (e.g. Gene Ontology)
  - pathway collections (metabolic and protein-protein interaction)
  - other (reasoning rules, like TIGR Genome Properties)
Modeling the data properly – why
nobody does that
[Diagram: gene, transcript, protein, enzyme, reaction, compounds, pathway, phenotype, evidence]
• Genes are connected to phenotypes via a multi-step
process, with many parameters
• We have very vague ideas about the steps/parameters for
the majority of genes/phenotypes
• If we design a relational database for gene/phenotype
connections, most tables will be empty
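The sparsity argument above can be illustrated with a minimal sketch. The record layout and the names `GenePhenotypeLink` and `filled_fraction` are hypothetical, chosen only to mirror the entities named on this slide; they are not an actual IMG schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class GenePhenotypeLink:
    """Hypothetical record: one gene-to-phenotype connection with its steps."""
    gene: str
    phenotype: str
    # Intermediate steps are optional because they are usually unknown.
    transcript: Optional[str] = None
    protein: Optional[str] = None
    enzyme: Optional[str] = None
    reaction: Optional[str] = None
    pathway: Optional[str] = None
    evidence: Optional[str] = None


def filled_fraction(record: GenePhenotypeLink) -> float:
    """Return the fraction of fields in the record that hold a value."""
    values = [getattr(record, f.name) for f in fields(record)]
    return sum(v is not None for v in values) / len(values)


# A typical case under this assumption: only the two end points are known.
typical = GenePhenotypeLink(gene="gene_A", phenotype="phenotype_P")
print(filled_fraction(typical))  # 2 of 8 fields filled
```

With only the gene and the phenotype known, three quarters of the record stays empty, which is the relational-table problem in miniature.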
What it looks like in real life –
KEGG vs MetaCyc
KEGG
http://www.genome.jp/kegg/
MetaCyc
http://metacyc.org/
Ammonia oxidation pathway in
KEGG
• Plus 4 more entries:
for 1.14.99.39
for each subunit
The same pathway/reaction in
MetaCyc
Similar problems to KEGG:
• multifunctional enzymes
• multisubunit enzymes
• differences in reaction recording
Even the MetaCyc record is still incomplete
• Which subunit has which cofactor?
• Type of Cu2+ cluster, type of Fe2+ cluster?
• One of the subunits is a cytochrome c, yet the enzyme is cytosolic?
• Does it require any help with maturation of metal clusters?
• Pseudomonas sp. PB16 was shown to have only 1 enzyme from the
pathway, hydroxylamine reductase. Does it have the entire pathway?
Even bigger mess: bioinformatics
inference
• Experimental data: gene A in a genome X
  - catalyzes a reaction
  - interacts with another protein(s)
  - gene knock-out causes certain phenotype
  - …
What about gene B in genome Y, which is similar to gene A?
“True or false?” game
• If the GenBank record says nothing about the gene B annotation protocol, the annotation must be correct
• If the GenBank record says the gene was manually annotated, the annotation must be correct
• If the GenBank record says gene B was manually annotated, and it has a bidirectional best BLAST hit to gene A with an e-value of 1.0e-5, the annotation must be correct
• …
Weaknesses
• Orthology detection: fails on many families that deviate from vertical transmission
• BLAST is agnostic of which amino acids are more important for protein function
• Using a consensus sequence (either as a PSSM or an HMM) with family-specific bit score cutoffs would be much better, but cannot be used in the current implementation of KEGG
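A minimal sketch of what family-specific bit score cutoffs could look like in practice. The `Hit` type, the `FAMILY_CUTOFFS` table and all scores are invented for illustration; in a real setting the scores would come from a profile search tool and the cutoffs would be curated per family.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical profile-search hit: a gene scored against a family model."""
    gene: str
    family: str
    bit_score: float


# Assumed, made-up cutoffs: each family gets its own threshold.
FAMILY_CUTOFFS: Dict[str, float] = {"family_1": 250.0, "family_2": 40.0}


def assign_families(hits: Iterable[Hit],
                    cutoffs: Dict[str, float]) -> Dict[str, List[str]]:
    """Keep a gene-family assignment only if the hit meets that family's cutoff."""
    assignments: Dict[str, List[str]] = {}
    for hit in hits:
        cutoff = cutoffs.get(hit.family)
        # Families without a curated cutoff are skipped rather than guessed.
        if cutoff is not None and hit.bit_score >= cutoff:
            assignments.setdefault(hit.gene, []).append(hit.family)
    return assignments


hits = [
    Hit("gene_B", "family_1", 120.0),  # below the family_1 cutoff: rejected
    Hit("gene_B", "family_2", 55.0),   # above the family_2 cutoff: accepted
    Hit("gene_C", "family_1", 300.0),  # above the family_1 cutoff: accepted
]
assignments = assign_families(hits, FAMILY_CUTOFFS)
print(assignments)
```

The point of the sketch is that a score of 120 is rejected for one family while a score of 55 is accepted for another, which a single global e-value threshold cannot express.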
Pathway collections: KEGG,
MetaCyc and others
Which particular set of interactions is a pathway? (i.e., how do we define pathway boundaries within the network?)
Ideal solution: pathway NR
• All pathway collections share a common
skeleton of reactions, which consist of
reactants (compounds)
• All reactions share the common base of
proteins annotated as catalysts
• Can we merge the information from
different collections, using the best features
of all of them?
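One way to picture the merge is to key every reaction by its compounds and pool the catalysts from each collection. This is only a sketch: the collection names, the dictionary layout and the function names are assumptions, and it presumes compound names have already been reconciled across collections, which the differences in reaction recording noted earlier make nontrivial.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

ReactionKey = Tuple[FrozenSet[str], FrozenSet[str]]


def reaction_key(substrates: Iterable[str], products: Iterable[str]) -> ReactionKey:
    """Identify a reaction by its compounds, ignoring the order they are listed in."""
    return (frozenset(substrates), frozenset(products))


def merge_collections(collections: Dict[str, List[dict]]) -> Dict[ReactionKey, dict]:
    """Merge reactions from several collections into one non-redundant set."""
    merged = defaultdict(lambda: {"sources": set(), "catalysts": set()})
    for source, reactions in collections.items():
        for rxn in reactions:
            key = reaction_key(rxn["substrates"], rxn["products"])
            merged[key]["sources"].add(source)               # where it was seen
            merged[key]["catalysts"].update(rxn["catalysts"])  # pooled proteins
    return dict(merged)


# Hypothetical input: two collections that share the reaction A -> B.
collections = {
    "collection_1": [
        {"substrates": ["A"], "products": ["B"], "catalysts": ["protein_1"]},
    ],
    "collection_2": [
        {"substrates": ["A"], "products": ["B"], "catalysts": ["protein_2"]},
        {"substrates": ["B"], "products": ["C"], "catalysts": []},
    ],
}
nr = merge_collections(collections)
print(len(nr))
```

The shared reaction appears once in the merged set, carrying both source names and both catalysts.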
IMG terms: 3 types
IMG terms of 3 types:
1. gene product
2. multi-subunit protein complex
3. modified protein
[Diagram: compounds A, B, C, D; reactions R1, R2 (spontaneous), R3 (spontaneous), R4 (chaperone)]
Proteins shown in the diagram:
• Enzyme (EC x.x.x.x), monomeric precursor
• Enzyme (EC x.x.x.x), monomeric, needs cofactor C
• Enzyme (EC x.x.x.x), heterotrimeric, subunit A precursor
• Enzyme (EC x.x.x.x), heterotrimeric, subunit A
• Enzyme (EC x.x.x.x), heterotrimeric, subunit B
• Enzyme (EC x.x.x.x), heterotrimeric, subunit C
• Enzyme (EC x.x.x.x), heterotrimeric, needs cofactor D
Callouts: “Not an IMG term!”; IMG term of the type “Gene product” (twice); IMG term of the type “Modified protein”; IMG term of the type “Protein complex”
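The three term types can be sketched as a small data structure. Everything here is a hypothetical illustration, not the IMG schema: the class names, the `components` field, and in particular the assumption that a modified protein is derived from a gene product and that a complex is built from subunits.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TermType(Enum):
    GENE_PRODUCT = "gene product"
    PROTEIN_COMPLEX = "multi-subunit protein complex"
    MODIFIED_PROTEIN = "modified protein"


@dataclass
class Term:
    """Hypothetical term: a name, one of the 3 types, and the terms it is built from."""
    name: str
    term_type: TermType
    components: List["Term"] = field(default_factory=list)


def gene_products(term: Term) -> List[str]:
    """Collect the names of all gene products underlying a term."""
    if term.term_type is TermType.GENE_PRODUCT:
        return [term.name]
    names: List[str] = []
    for component in term.components:
        names.extend(gene_products(component))
    return names


# Assumed example: a heterotrimer in which one subunit is a processed precursor.
precursor = Term("subunit 1 precursor", TermType.GENE_PRODUCT)
subunit_1 = Term("subunit 1", TermType.MODIFIED_PROTEIN, [precursor])
subunit_2 = Term("subunit 2", TermType.GENE_PRODUCT)
subunit_3 = Term("subunit 3", TermType.GENE_PRODUCT)
complex_term = Term("heterotrimeric enzyme", TermType.PROTEIN_COMPLEX,
                    [subunit_1, subunit_2, subunit_3])

print(gene_products(complex_term))
```

Walking the complex down to its gene products recovers the three genes that would need to be present in a genome for the complex to be asserted.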
Protein-protein interaction
pathways:
same model
You’ve been warned!