Advancing Science with DNA Sequence
IMG terms and pathways
Natalia Ivanova
Iain Anderson
Thanos Lykidis
Nikos Kyrpides
Krishna Palaniappan
Amy Chen
Frank Korzeniewski
Yuri Grechkin
Ernest Szeto
Victor Markowitz
MGM Workshop
February 1, 2012
New: SEED subsystems, Transport DB, Phenotypes
Why so many?
What’s the difference?
Which one should I use?
Where it all comes from
• Experimental data: gene A in genome X
  - catalyzes a reaction
  - interacts with other protein(s)
  - gene knock-out causes a certain phenotype
  - …
• This information is recorded in a structured way:
  - ontologies (e.g. Gene Ontology)
  - pathway collections (metabolic and protein-protein interaction)
  - other (reasoning rules, like TIGR Genome Properties)
Modeling the data properly – why
nobody does that
[Diagram of linked entities: gene, transcript, protein, enzyme, reaction, compounds, pathway, phenotype, evidence]
• Genes are connected to phenotypes via a multi-step
process, with many parameters
• We have very vague ideas about the steps/parameters for
the majority of genes/phenotypes
• If we design a relational database for gene/phenotype
connections, most tables will be empty
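
The "most tables will be empty" point can be illustrated with a minimal sketch. The schema below is hypothetical and deliberately oversimplified (one table per entity, all with the same columns); it is not the schema of IMG or of any real database. It only shows what happens when a typical gene is loaded with the little that is usually known about it.

```python
import sqlite3

# Hypothetical, simplified schema: one table per entity in the
# gene-to-phenotype chain. Table and column names are illustrative only.
TABLES = ["gene", "transcript", "protein", "enzyme", "reaction",
          "compound", "pathway", "phenotype", "evidence"]

def build_db():
    """Create an in-memory database with one empty table per entity."""
    conn = sqlite3.connect(":memory:")
    for name in TABLES:
        conn.execute(
            f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, gene_id TEXT, note TEXT)"
        )
    return conn

def empty_tables(conn):
    """Return the names of tables that contain no rows."""
    return [t for t in TABLES
            if conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] == 0]

conn = build_db()
# For a typical gene we know little more than that it exists and what
# protein it is predicted to encode.
conn.execute("INSERT INTO gene (gene_id, note) VALUES ('geneB', 'predicted ORF')")
conn.execute("INSERT INTO protein (gene_id, note) VALUES ('geneB', 'hypothetical protein')")

print(empty_tables(conn))
```

In this toy case seven of the nine tables stay empty, which is the practical argument against modeling every step explicitly.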
What it looks like in real life –
KEGG vs MetaCyc
KEGG
http://www.genome.jp/kegg/
MetaCyc
http://metacyc.org/
Ammonia oxidation pathway in
KEGG
The same pathway/reaction in
MetaCyc
Even the MetaCyc record is still
incomplete
• Which subunit has which
cofactor?
• Type of Cu2+ cluster,
type of Fe2+ cluster?
• One of the subunits is a
cytochrome c, yet the
enzyme is cytosolic?
• Does it require any help
with maturation of metal
clusters?
• Pseudomonas sp. PB16 was shown to have only 1 enzyme from the
pathway, hydroxylamine reductase. Does it have the entire pathway?
Even bigger mess: bioinformatics
inference
• Experimental data: gene A in genome X
  - catalyzes a reaction
  - interacts with other protein(s)
  - gene knock-out causes a certain phenotype
  - …
• What about gene B in genome Y, which is similar to gene A?
“True or false?” game
• If gene B was manually annotated, the
annotation must be correct
• If gene B was manually annotated, and it has
a bi-directional best BLAST hit to gene A
with e-value of 1.0e-5, the annotation must
be correct
• If gene B was manually annotated, has
>50% identity to gene A, and is found in the
same conserved chromosomal neighborhood
as gene A, the annotation must be correct
•…
Poorly done inference - MetaCyc
• Software called PathoLogic
• Parses annotated files, tries to find matches between EC
numbers/full product names/partial product names and
reactions in MetaCyc database
• Automatically infers pathway presence based on matches to
MetaCyc reactions
• Tries to find candidate genes for “missing” enzymes by
running BLAST with the genes assigned to this reaction in other
organisms
• Generates a lot of false positives: for example, it inferred the presence of the
ammonia oxidation pathway in Staphylococcus based on the
presence of 1 gene annotated as ammonia monooxygenase
in the GenBank file
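
The failure mode described above can be sketched in a few lines. This is not PathoLogic's actual algorithm; it is a deliberately naive name-matching rule, shown next to a hypothetical alternative that requires a minimum fraction of pathway steps to be matched. The pathway definition, the placeholder step names ("toy enzyme 2", "toy enzyme 3") and the 0.75 threshold are all assumptions made for illustration.

```python
def matched_steps(annotations, pathway_steps):
    """Return the pathway steps whose name occurs in any product annotation."""
    lowered = [a.lower() for a in annotations]
    return [s for s in pathway_steps if any(s.lower() in a for a in lowered)]

def infer_any_match(annotations, pathway_steps):
    """Naive rule: call the pathway present if at least one step matches."""
    return len(matched_steps(annotations, pathway_steps)) > 0

def infer_with_coverage(annotations, pathway_steps, min_fraction=0.75):
    """Stricter rule (hypothetical threshold): require most steps to match."""
    matched = matched_steps(annotations, pathway_steps)
    return len(matched) / len(pathway_steps) >= min_fraction

# Toy pathway: only the first step name comes from the example above;
# the other two are placeholders, not real enzymes.
toy_pathway = ["ammonia monooxygenase", "toy enzyme 2", "toy enzyme 3"]

# Toy genome with a single (possibly wrong) matching annotation.
toy_genome = ["Ammonia monooxygenase", "DNA polymerase III subunit alpha"]

print(infer_any_match(toy_genome, toy_pathway))      # naive rule
print(infer_with_coverage(toy_genome, toy_pathway))  # coverage rule
```

The naive rule reports the pathway as present from one annotation, while the coverage rule does not. A coverage threshold is only one possible safeguard and does nothing about wrong annotations in the input file.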
Better inference: KEGG
• Annotation is inferred
based on orthology,
defined as bi-directional
best BLAST hits, manually
refined based on
“Ortholog tables” and
chromosomal clusters
• Poorly documented, but
seems to generate far
fewer false positives than
PathoLogic
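
The core of the orthology definition, bi-directional best BLAST hits, can be sketched as follows. This covers only the automatic step, not KEGG's manual refinement, and it is not KEGG's code. The input format (tuples of query, subject, bit score) and the gene names are assumptions; in practice the tuples would come from parsing BLAST output for genome X against genome Y and the reverse.

```python
def best_hits(hits):
    """Map each query to its top-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def bidirectional_best_hits(x_vs_y, y_vs_x):
    """Return sorted (x_gene, y_gene) pairs that are each other's best hit."""
    fwd = best_hits(x_vs_y)
    rev = best_hits(y_vs_x)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)

# Toy hit lists with made-up gene names and scores.
x_vs_y = [("A1", "B1", 250.0), ("A1", "B2", 90.0), ("A2", "B2", 120.0)]
y_vs_x = [("B1", "A1", 248.0), ("B2", "A1", 95.0), ("B2", "A2", 80.0)]

print(bidirectional_best_hits(x_vs_y, y_vs_x))
```

Here A1 and B1 pair up, but A2 does not: its best hit B2 prefers A1 in the reverse direction. Cases like this are where manual refinement with ortholog tables and chromosomal clusters would come in.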
Even the best structured inference
is far from perfect
• Problem: neither BLAST nor Smith-Waterman
knows which amino acids are more
important for protein function than
others
• Using a consensus model (either a PSSM
or an HMM) with family-specific bit score
cutoffs would be much better
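
A toy example of why a position-specific model with a family-specific cutoff behaves differently from plain similarity. The PSSM, the motif, the default penalty and the cutoff below are all invented for illustration; real PSSMs and HMMs are built from curated alignments and scored with dedicated tools.

```python
# Toy PSSM for a hypothetical 4-residue motif: one dict of
# residue -> score per position. Conserved (functionally important)
# positions carry large scores, variable positions carry small ones.
PSSM = [
    {"H": 5.0},                       # conserved position
    {"A": 1.0, "G": 1.0},             # variable position
    {"C": 5.0},                       # conserved position
    {"L": 1.0, "I": 1.0, "V": 1.0},   # variable position
]
DEFAULT_SCORE = -2.0  # assumed penalty for any residue not listed

# Hypothetical family-specific cutoff.
FAMILY_CUTOFFS = {"toy_family": 10.0}

def pssm_score(window):
    """Sum the per-position scores for a window as long as the PSSM."""
    return sum(col.get(res, DEFAULT_SCORE) for col, res in zip(PSSM, window))

def best_window_score(seq):
    """Score every window of the sequence and return the best score."""
    n = len(PSSM)
    if len(seq) < n:
        return float("-inf")
    return max(pssm_score(seq[i:i + n]) for i in range(len(seq) - n + 1))

def is_member(seq, family="toy_family"):
    """Call membership only if the best score reaches the family cutoff."""
    return best_window_score(seq) >= FAMILY_CUTOFFS[family]

print(best_window_score("MHACLK"), is_member("MHACLK"))
print(best_window_score("MHAGLK"), is_member("MHAGLK"))
```

The two sequences differ by one residue, so a similarity search would rate them almost equally. The substitution falls on a conserved position, though, so the second sequence drops well below the family cutoff.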
Pathway collections: KEGG,
MetaCyc and others
Which particular set of interactions is a
pathway? (i.e., how do we define
pathway boundaries within the network?)
Ideal solution: pathway NR
• All pathway collections share a common
skeleton of reactions, which consist of
reactants (compounds)
• All reactions share the common base of
proteins annotated as catalysts
• Can we merge the information from
different collections, using the best features
of all of them?
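
One way to picture such a merge, reading "NR" as "non-redundant" (an assumption): key every reaction by its sets of reactants and products, then pool the catalysts and the source collections under that key. The sketch assumes compound identifiers have already been reconciled across collections, which in practice is a large part of the work. Collection names, compounds and catalysts are placeholders.

```python
from collections import defaultdict

def reaction_key(substrates, products):
    """Identify a reaction by its compounds, ignoring order within each side."""
    return (frozenset(substrates), frozenset(products))

def merge_collections(collections):
    """Merge reactions from several collections into one non-redundant set.

    collections: {collection_name: [ {"substrates": [...],
                                      "products": [...],
                                      "catalysts": [...]}, ... ]}
    """
    merged = defaultdict(lambda: {"catalysts": set(), "sources": set()})
    for name, reactions in collections.items():
        for rxn in reactions:
            key = reaction_key(rxn["substrates"], rxn["products"])
            merged[key]["catalysts"].update(rxn["catalysts"])
            merged[key]["sources"].add(name)
    return dict(merged)

# Toy input: two hypothetical collections that share the reaction A -> B.
toy_collections = {
    "collection_1": [
        {"substrates": ["A"], "products": ["B"], "catalysts": ["enzyme 1"]},
    ],
    "collection_2": [
        {"substrates": ["A"], "products": ["B"], "catalysts": ["enzyme 1, subunit A"]},
        {"substrates": ["B"], "products": ["C"], "catalysts": ["enzyme 2"]},
    ],
}

merged = merge_collections(toy_collections)
shared = merged[reaction_key(["A"], ["B"])]
print(len(merged), sorted(shared["sources"]), sorted(shared["catalysts"]))
```

The shared reaction ends up with catalyst descriptions from both collections, at different levels of detail. Reconciling those descriptions is the problem a common vocabulary of protein terms is meant to address.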
IMG terms: 3 types
• IMG terms of 3 types:
1. gene product
2. multi-subunit protein complex
3. modified protein
[Figure: example reaction network with compounds A, B, C, D and reactions R1, R2 (spontaneous), R3 (spontaneous), R4 (chaperone). A generic “Enzyme (EC x.x.x.x)” is marked “Not an IMG term!”. The remaining boxes show a monomeric enzyme (monomeric precursor; monomeric, needs cofactor C) and a heterotrimeric enzyme (subunit A precursor; subunit A; subunit B; subunit C; heterotrimeric, needs cofactor D), labeled as IMG terms of the types “Gene product”, “Protein complex” and “Modified protein”.]
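
The three term types can be sketched as a small data structure. This is not the IMG schema: the class names, the `components` field and the nesting below (a complex built from gene products, a modified protein built from a complex) are one hypothetical reading of the three types.

```python
from dataclasses import dataclass, field
from enum import Enum

class TermType(Enum):
    GENE_PRODUCT = "gene product"
    PROTEIN_COMPLEX = "multi-subunit protein complex"
    MODIFIED_PROTEIN = "modified protein"

@dataclass
class Term:
    name: str
    term_type: TermType
    # Child terms: subunits of a complex, or the unmodified form of a
    # modified protein. Empty for plain gene products.
    components: list = field(default_factory=list)

def gene_products(term):
    """Recursively collect the gene-product terms underneath a term."""
    if term.term_type is TermType.GENE_PRODUCT:
        return [term.name]
    names = []
    for child in term.components:
        names.extend(gene_products(child))
    return names

# Toy heterotrimeric enzyme with placeholder names.
subunits = [
    Term(f"Enzyme (EC x.x.x.x) heterotrimeric, subunit {s}", TermType.GENE_PRODUCT)
    for s in "ABC"
]
complex_term = Term("Enzyme (EC x.x.x.x) heterotrimeric",
                    TermType.PROTEIN_COMPLEX, subunits)
mature_term = Term("Enzyme (EC x.x.x.x) heterotrimeric, with cofactor D",
                   TermType.MODIFIED_PROTEIN, [complex_term])

print(gene_products(mature_term))
```

Under this reading, a bare "Enzyme (EC x.x.x.x)" with no subunit or modification information has no place in the structure, which is consistent with it not being an IMG term.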
Protein-protein interaction
pathways:
same model
You’ve been warned!