PathoLogic Pathway Predictor - Bioinformatics Research
PathoLogic Pathway Predictor
SRI International
Bioinformatics
Inference of Metabolic Pathways
[Diagram: PathoLogic Software integrates genome and pathway data to identify putative metabolic networks]
Input: Annotated Genomic Sequence (Genes/ORFs, DNA Sequences, Gene Products)
Input: Multi-organism Pathway Database (MetaCyc) (Pathways, Reactions, Compounds)
Output: Pathway/Genome Database (Pathways, Reactions, Compounds, Gene Products, Genes, Genomic Map)
PathoLogic Functionality
Initialize schema for new PGDB
Transform existing genome to PGDB form
Infer metabolic pathways and store in PGDB
Infer operons and store in PGDB
Assemble Overview diagram
Assist user with manual tasks
Assign enzymes to reactions they catalyze
Identify false-positive pathway predictions
Build protein complexes from monomers
Infer transport reactions
Fill pathway holes
PathoLogic Input/Output
Inputs:
List of all genetic elements
Enter using GUI or provide a file
Files containing annotation for each genetic element
Files containing DNA sequence for each genetic element
MetaCyc database
Output:
Pathway/genome database for the subject organism
Reports that summarize:
Evidence in the input genome for the presence of reference pathways
Reactions missing from inferred pathways
File Naming Conventions
One pair of sequence and annotation files for each genetic element
Sequence files: FASTA format, suffix .fsa or .fna
Annotation file:
Genbank format: suffix .gbk
PathoLogic format: suffix .pf
Typical Problems Using Genbank Files With PathoLogic
Wrong qualifier names used: read PathoLogic documentation!
Extraneous information in a given qualifier
Check results of trial parse carefully
GenBank File Format
Accepted feature types:
CDS, tRNA, rRNA, misc_RNA
Accepted qualifiers:
/locus_tag          Unique ID               [recm]
/gene               Gene name               [req]
/product                                    [req]
/EC_number                                  [recm]
/product_comment                            [opt]
/gene_comment                               [opt]
/alt_name           Synonyms                [opt]
/pseudo             Gene is a pseudogene    [opt]
/db_xref            DB:AccessionID          [opt]
/go_component, /go_function, /go_process    GO terms    [opt]
For multifunctional proteins, put each function in a separate /product line
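Since wrong qualifier names are listed above as a typical problem, a quick pre-check before the trial parse can help. The following is a minimal sketch, not part of Pathway Tools: it scans the text of a GenBank file and counts qualifiers on the accepted feature types that are not in the accepted list from this slide. The function name `unexpected_qualifiers` is hypothetical, and the sketch assumes the standard GenBank flat-file layout (feature keys indented 5 spaces, qualifiers starting with `/` on indented lines). A continuation line of a multi-line value that happens to begin with `/` would be miscounted.

```python
import re

# Feature types and qualifiers listed on the slide as accepted by PathoLogic.
ACCEPTED_FEATURES = {"CDS", "tRNA", "rRNA", "misc_RNA"}
ACCEPTED_QUALIFIERS = {
    "locus_tag", "gene", "product", "EC_number", "product_comment",
    "gene_comment", "alt_name", "pseudo", "db_xref",
    "go_component", "go_function", "go_process",
}

# Assumed layout: feature keys start in column 6, qualifiers are indented further.
FEATURE_RE = re.compile(r"^ {5}(\S+)\s+\S")
QUALIFIER_RE = re.compile(r"^\s+/([A-Za-z_]+)")


def unexpected_qualifiers(genbank_text):
    """Return {qualifier: count} for qualifiers outside the accepted list."""
    counts = {}
    in_features = False
    current_feature = None
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if line.startswith("ORIGIN"):
            in_features = False
        if not in_features:
            continue
        feature = FEATURE_RE.match(line)
        if feature:
            current_feature = feature.group(1)
            continue
        qualifier = QUALIFIER_RE.match(line)
        if qualifier and current_feature in ACCEPTED_FEATURES:
            name = qualifier.group(1)
            if name not in ACCEPTED_QUALIFIERS:
                counts[name] = counts.get(name, 0) + 1
    return counts
```

Qualifiers reported here are only candidates for renaming; the PathoLogic documentation and the trial parse output remain the authority.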
PathoLogic File Format
Each record starts with line containing an ID attribute
Tab delimited
Each record ends with a line containing //
One attribute-value pair is allowed per line
Use multiple FUNCTION lines for multifunctional proteins
Lines starting with ‘;’ are comment lines
Valid attributes are:
ID, NAME, SYNONYM
STARTBASE, ENDBASE, GENE-COMMENT
FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
DBLINK
GO
INTRON
PathoLogic File Format (example)
ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
PRODUCT-TYPE	P
GO	glutamate synthase (NADPH) activity [goid 0004355] [evidence IDA] [pmid 4565085]
//
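The format rules above (tab-delimited attribute-value pairs, one per line, records closed by `//`, comment lines starting with `;`) can be illustrated with a small reader. This is a minimal sketch for checking your own files, not the parser PathoLogic itself uses; the function name `parse_pathologic_file` is hypothetical.

```python
def parse_pathologic_file(text):
    """Parse PathoLogic-format text into a list of records.

    Each record is a dict mapping attribute name -> list of values, because
    attributes such as FUNCTION may appear on several lines of one record.
    """
    records = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        # Attribute and value are separated by the first tab on the line.
        attribute, _, value = line.partition("\t")
        current.setdefault(attribute.strip(), []).append(value.strip())
    if current:
        records.append(current)  # tolerate a missing final "//"
    return records
```

Keeping every value in a list means a multifunctional protein with several FUNCTION lines is preserved without special handling.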
Before you start: What to do when an error occurs
Most Navigator errors are automatically trapped; debugging information is saved to error.tmp file.
All other errors (including most PathoLogic errors) will cause software to drop into the Lisp debugger
Unix: Error message will show up in the original terminal window from which you started Pathway Tools.
Windows: Error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt
2 goals when an error occurs:
Try to continue working
Obtain enough information for a bug report to send to pathway-tools support team.
The Lisp Debugger
Sample error (details and number of restart actions differ
for each case)
Error: Received signal number 2 (Keyboard interrupt)
Restart actions (select using :continue):
0: continue computation
1: Return to command level
2: Pathway Tools version 10.0 top level
3: Exit Pathway Tools version 10.0
[1c] EC(2):
To generate debugging information (stack backtrace):
:zoom :count :all
To continue from error, find a restart that takes you to the
top level – in this case, number 2
:cont 2
To exit Pathway Tools:
:exit
How to report an error
Determine if problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
Send email to the Pathway Tools support team containing:
Pathway Tools version number and platform
Description of exactly what you were doing (which command
you invoked, what you typed, etc.) or instructions for how to
reproduce the problem
error.tmp file, if one was generated
If software breaks into the lisp debugger, the complete error
message and stack backtrace (obtained using the command
:zoom :count :all, as described on previous slide)
Using the PPP GUI to Create a Pathway/Genome Database
Input Project Information
Organism -> Create New
Creates directory structure for new PGDB
Creates and saves empty PGDB, populated only with objects
common to all PGDBs (schema classes, elements, etc.) and
data you entered in the form.
Offers to invoke Replicon Editor
Input Project Information
Enter Replicon Information
For each replicon
Name
Type: chromosome, plasmid, etc.
Circular?
Annotation file
Sequence file (optional)
Contigs (optional)
Links to other DBs (optional)
GUI-Based entry
Build->Specify Replicons
File-Based Entry
Create genetic-elements.dat file using template provided
GUI-Based Replicon Entry
Batch Entry of Replicon Info
File /<orgid>cyc/<version>/input/genetic-elements.dat:
ID	TEST-CHROM-1
NAME	Chromosome 1
TYPE	:CHRSM
CIRCULAR?	N
ANNOT-FILE	chrom1.pf
SEQ-FILE	chrom1.fsa
//
ID	TEST-CHROM-2
NAME	Chromosome 2
CIRCULAR?	N
ANNOT-FILE	/mydata/chrom2.gbk
SEQ-FILE	/mydata/chrom2.fna
//
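For genomes with many replicons, the file above can be generated rather than typed. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, attribute and value are assumed to be separated by a tab (as in the PathoLogic annotation format), and the suffix check simply applies the file naming conventions given earlier. The template shipped with Pathway Tools remains the reference for which attributes are allowed.

```python
import os

# Suffixes from the "File Naming Conventions" slide.
ANNOTATION_SUFFIXES = {".gbk", ".pf"}
SEQUENCE_SUFFIXES = {".fsa", ".fna"}


def _check_suffix(path, allowed):
    """Raise ValueError if a given file name does not end in an allowed suffix."""
    if path is None:
        return  # attribute not supplied (e.g. the optional sequence file)
    suffix = os.path.splitext(path)[1]
    if suffix not in allowed:
        raise ValueError(f"{path}: expected one of {sorted(allowed)}")


def format_genetic_elements(replicons):
    """Render a list of attribute dicts as genetic-elements.dat text."""
    lines = []
    for replicon in replicons:
        _check_suffix(replicon.get("ANNOT-FILE"), ANNOTATION_SUFFIXES)
        _check_suffix(replicon.get("SEQ-FILE"), SEQUENCE_SUFFIXES)
        for attribute, value in replicon.items():
            lines.append(f"{attribute}\t{value}")
        lines.append("//")  # record terminator
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_genetic_elements([
        {"ID": "TEST-CHROM-1", "NAME": "Chromosome 1", "TYPE": ":CHRSM",
         "CIRCULAR?": "N", "ANNOT-FILE": "chrom1.pf", "SEQ-FILE": "chrom1.fsa"},
    ]), end="")
```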
Specify Reference PGDB(s)
This step is optional, and most users will omit it
MetaCyc is always the primary reference PGDB
Specify additional reference PGDB if you have
your own curated PGDB which has:
Pathways and/or reactions that are not in MetaCyc
Manual functional assignments, with names similar to current
genome
There is no point specifying any of our PGDBs as
references, only your own curated PGDBs.
Building the PGDB
Trial Parse
Build -> Trial Parse
Check output to ensure numbers “look right”
Same number of gene start positions, end positions, names
Did my file contain EC numbers? Were they detected?
Did my file contain RNAs? Were they detected?
Fix any errors in input files
Build pathway/genome database
Build -> Automated Build
PathoLogic Parser Output
Automated Build
Parses input files
Creates objects for every gene and gene product
Uses EC numbers, GO annotations and name
matcher to match enzymes to reactions in
MetaCyc
Imports catalyzed reactions and compounds from MetaCyc
Generates list of likely enzymes that couldn’t be
assigned
Infers pathways likely to be present
Generates Cellular Overview Diagram (first pass)
Generates reports
Matching Enzymes to Reactions
Matches on full EC number (partial ECs ignored)
Matches on Molecular Function GO terms
If definition of GO term includes cross-reference either to an
EC number or to a MetaCyc reaction.
Matches on full enzyme name
Match is case-insensitive and removes the punctuation
characters “ -_(){}',:”
Also matches after removal of prefixes and suffixes such as:
“Putative”, “Hypothetical”, etc
alpha|beta|…|catalytic|inducible chain|subunit|component
Parenthetical gene name
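The name-matching rules above can be sketched as a normalization step applied to both the annotated name and the MetaCyc enzyme name before comparing them. This is an illustration only, not the matcher in Pathway Tools. Assumptions: the prefix list contains just the two examples from the slide, the suffix rule is read as a qualifier (alpha, beta, catalytic, inducible) followed by chain, subunit, or component, and a "parenthetical gene name" is taken to be a trailing bacterial-style name such as "(deoD)".

```python
import re

# Punctuation characters the slide says are removed before comparison.
PUNCTUATION = " -_(){}',:"
# Only the two prefixes named on the slide; the real list is longer ("etc").
PREFIXES = ("putative", "hypothetical")
# Assumed reading of: alpha|beta|...|catalytic|inducible chain|subunit|component
SUFFIX_RE = re.compile(
    r"\s+(alpha|beta|catalytic|inducible)\s+(chain|subunit|component)\s*$",
    re.IGNORECASE,
)
# Assumed shape of a trailing parenthetical gene name, e.g. "(deoD)".
PAREN_GENE_RE = re.compile(r"\s*\([a-z]{3}[A-Z0-9]?\)\s*$")


def normalize_enzyme_name(name):
    """Reduce an enzyme name to a comparison key."""
    text = PAREN_GENE_RE.sub("", name.strip())
    for prefix in PREFIXES:
        if text.lower().startswith(prefix + " "):
            text = text[len(prefix) + 1:]
    text = SUFFIX_RE.sub("", text)
    text = text.lower()  # case-insensitive match
    return "".join(ch for ch in text if ch not in PUNCTUATION)


def names_match(annotated_name, reference_name):
    """True if two names are equal after normalization."""
    return normalize_enzyme_name(annotated_name) == normalize_enzyme_name(reference_name)
```

For example, "Putative purine-nucleoside phosphorylase (deoD)" and "Purine nucleoside phosphorylase" reduce to the same key under these assumptions.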
Enzyme Name Matcher
For names that do not match, software identifies probable metabolic enzymes as those
Containing “ase”
Not containing keywords such as
“sensor kinase”
“topoisomerase”
“protein kinase”
“peptidase”
Etc
User should research unknown enzymes
MetaCyc, Swiss-Prot, PubMed
Stored in ORGIDcyc/VERSION/reports/name-matching-report.txt
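The probable-enzyme rule above amounts to a simple keyword filter. A minimal sketch follows; the function name is hypothetical, and the exclusion list holds only the four keywords shown on the slide, which ends in "Etc", so the list used by the software is presumably longer.

```python
# Keywords from the slide that mark an "-ase" name as non-metabolic.
# The slide's list is open-ended ("Etc"); this tuple is therefore incomplete.
EXCLUDED_KEYWORDS = (
    "sensor kinase",
    "topoisomerase",
    "protein kinase",
    "peptidase",
)


def is_probable_metabolic_enzyme(name):
    """Apply the slide's two tests: contains "ase", contains no excluded keyword."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)


if __name__ == "__main__":
    for candidate in ("acetate kinase", "sensor kinase", "hypothetical protein"):
        print(candidate, is_probable_metabolic_enzyme(candidate))
```

Names that pass this filter are only candidates; they still need the manual research described above.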
Automated Pathway Inference
All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion.
Algorithm errs on side of inclusivity: easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn’t.
Algorithm
Considerations taken into account when
deciding whether or not a pathway should
be inferred:
Is there a unique enzyme – an enzyme not involved in any
other pathway?
Does the organism fall in the expected taxonomic domain of
the pathway?
Is this pathway part of a variant set, and, if so, is there more
evidence for some other variant?
If there is no unique enzyme:
Is there evidence for more than one enzyme?
If a biosynthetic pathway, is there evidence for final reaction(s)?
If a degradation pathway, is there evidence for initial reaction(s)?
If an energy metabolism pathway, is there evidence for more than half the
reactions?
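Several of the considerations above can be phrased as yes/no questions about which reactions of a pathway have an enzyme in the organism. The sketch below only answers those questions; it does not combine them into a decision, because the slide does not say how the considerations are weighed. Everything here is an assumption made for illustration: the function name, the data layout (ordered reaction lists per pathway, a category label per pathway), and the simplification of treating "unique enzyme" as "a reaction with an enzyme that occurs in no other pathway". The taxonomic and variant-set considerations are not modeled.

```python
def pathway_considerations(pathway_id, pathways, categories, reactions_with_enzymes):
    """Answer the slide's evidence questions for one candidate pathway.

    pathways: {pathway_id: ordered list of reaction ids} (assumed non-empty lists)
    categories: {pathway_id: "biosynthesis" | "degradation" | "energy"}
    reactions_with_enzymes: set of reaction ids with an enzyme in the organism
    """
    reactions = pathways[pathway_id]
    if not reactions:
        raise ValueError("pathway has no reactions")
    present = [r for r in reactions if r in reactions_with_enzymes]

    # Reactions that also occur in some other pathway.
    elsewhere = set()
    for other_id, other_reactions in pathways.items():
        if other_id != pathway_id:
            elsewhere.update(other_reactions)

    answers = {
        "has_unique_enzyme": any(r not in elsewhere for r in present),
        "more_than_one_enzyme": len(present) > 1,
    }
    category = categories.get(pathway_id)
    if category == "biosynthesis":
        answers["final_reaction_present"] = reactions[-1] in reactions_with_enzymes
    elif category == "degradation":
        answers["initial_reaction_present"] = reactions[0] in reactions_with_enzymes
    elif category == "energy":
        answers["more_than_half_present"] = len(present) > len(reactions) / 2
    return answers
```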
Assigning Evidence Scores to Predicted Pathways
X|Y|Z denotes score for pathway P in organism O, where:
X = total number of reactions in P
Y = number of reactions in P for which there is evidence in O of a catalyzing enzyme
Z = number of Y reactions that are used in other pathways in O
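As a worked illustration of the X|Y|Z score, the sketch below computes it from plain Python collections. The function name and data layout are hypothetical: `pathways` stands for the pathways under consideration in O (pathway id to reaction ids), and `reactions_with_enzymes` is the set of reactions for which O has a catalyzing enzyme.

```python
def evidence_score(pathway_id, pathways, reactions_with_enzymes):
    """Return the X|Y|Z evidence score for one pathway as a string."""
    reactions = set(pathways[pathway_id])
    with_evidence = reactions & reactions_with_enzymes

    # Reactions that also belong to some other pathway in the organism.
    used_elsewhere = set()
    for other_id, other_reactions in pathways.items():
        if other_id != pathway_id:
            used_elsewhere.update(other_reactions)

    x = len(reactions)                       # total reactions in P
    y = len(with_evidence)                   # reactions with enzyme evidence in O
    z = len(with_evidence & used_elsewhere)  # of those, shared with other pathways
    return f"{x}|{y}|{z}"


if __name__ == "__main__":
    example = {"P1": ["r1", "r2", "r3"], "P2": ["r3", "r4"]}
    print(evidence_score("P1", example, {"r1", "r3"}))
```

In this toy input, P1 has three reactions, two with evidence (r1 and r3), and one of those (r3) is shared with P2. A pathway whose Y equals its Z has no evidence that is specific to it, which is the situation the pruning step looks for.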
Pathway Evidence Report
On Organism Summary Page in Navigator, button
“Generate Pathway Evidence Report”
Report saved as HTML file, view in browser
Hierarchical listing of all inferred pathways
“Pathway Glyph” shows evidence graphically
Steps with/without enzymes (green/black)
Steps that are unique to pathway (orange)
Steps filled by Pathway Hole Filler (blue)
Counts reactions in pathway, with evidence, in other
pathways
Lists other pathways that share reactions
Link to pathway in MetaCyc
Manual Pruning of Pathways
Use pathway evidence report
Coloring scheme aids in assessing pathway evidence
Phase I: Prune extra variant pathways
Rescore pathways, re-generate pathway evidence report
Phase II: Prune pathways unlikely to be present
No/few unique enzymes
Most pathway steps present because they are used in another pathway
Pathway very unlikely to be present in this organism
Nonspecific enzyme name assigned to a pathway step
Caveats
Cannot predict pathways not present in MetaCyc
Evidence for short pathways is hard to interpret
Since many reactions occur in multiple pathways, some false positives
Next generation pathway inference algorithm is work currently in progress!
Output from PPP
Pathway/genome database
Summary pages
Pathway evidence page
Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
Missing enzymes report
Directory tree containing sequence files, reports, etc.
Resulting Directory Structure
ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
    input
        organism.dat
        organism-init.dat
        genetic-elements.dat
        annotation files
        sequence files
    reports
        name-matching-report.txt
        trial-parse-report.txt
    kb
        ORGIDbase.ocelot
    data
        overview.graph
released -> VERSION
Manual Polishing
Refine -> Assign Probable Enzymes            Do this first
Refine -> Rescore Pathways                   Redo after assigning enzymes
Refine -> Create Protein Complexes           Can be done at any time
Refine -> Assign Modified Proteins           Can be done at any time
Refine -> Transport Identification Parser    Can be done at any time
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Update Overview                    Do this last, and repeat after any material changes to PGDB
Assign Probable Enzymes
How to find reactions for probable enzymes
First, verify that enzyme name describes a specific, metabolic function
Search for fragment of name in MetaCyc – you
may be able to find a match that PathoLogic
missed
Look up protein in UniProt or other DBs
Search for gene name in PGDB for related
organism (bear in mind that gene names are not
reliable indicators of function, so check carefully)
Search for function name in PubMed
Other…