General - Bioinformatics Research Group at SRI International
Overview of the Pathway Tools Software and Pathway/Genome Databases
Peter D. Karp
Bioinformatics Research Group
SRI International
[email protected]
Pathway/Genome Database
Integrating Genomic and Biochemical Data
Pathways
Reactions
Compounds
Proteins
Genes
Operons, Promoters, DNA Binding Sites
Chromosomes, Plasmids
CELL
Key Functionality
Pathway analysis
Prediction of pathways from genomes
Comparative pathway analysis
Ongoing curation of PGDBs
WWW publishing of PGDBs
Analysis of gene expression data
Tools and Datasets
Pathway/Genome Navigator: Visualize, Query and Analyze PGDBs
PathoLogic: Create PGDBs
Editors (Pathways, Genes): Update PGDBs
PGDB
PathoLogic Pathway Predictor
Set of Annotated Genes
MetaCyc
PGDB
Pathway
Prediction
Reports
New PGDB
Prediction of Pathways from Genomes
Pathway/Genome Database
Annotated Genome
Metabolic Network
List of Gene Products
PathoLogic
Pathways
List of Genes/ORFs
Reactions
DNA Sequence
Proteins
Genes
Genomic Map
Compounds
MetaCyc Overview
MetaCyc: Metabolic Encyclopedia
439 pathways, 1095 enzymes, 4217 reactions
173 E. coli pathways
Literature-based DB with extensive references and commentary
Pathways, reactions, enzymes, substrates
Editor in chief: Dr. Monica Riley
Pathway/Genome Navigator
Query and visualization tools for PGDBs
Metabolic pathways, reactions, compounds
Enzymes, transporters, transcription factors
Genome maps, genes, operons, promoters, DNA sites
Retrieve nucleotide and DNA sequences
Perform Blast searches
Runs as an application on Solaris, Windows
Runs as a WWW server on Solaris
Query and comparative analysis functions
Interactive Editing Tools
Pathway editor
Reaction editor
Gene editor
Enzyme editor
Compound editor
Transcription Unit Editor
Facilitate updates to PGDBs
Improved computational predictions
Literature-based data
Record citations, comments, evidence, history
Pathway Views of Expression Data
Import gene expression data
Compute expression ratios
Obtain pathway based visualizations of data
Numerical spectrum of expression values mapped to a color spectrum
Steps of overview painted with color corresponding to expression level(s) of genes that encode enzyme(s) for that step
Absolute or relative expression values
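The mapping from expression values to a color spectrum can be sketched as follows. This is an illustrative example only, not the Pathway Tools implementation: the blue-to-red spectrum, the log2 ratio, the default range of -2 to 2, and averaging over the genes of a step are all assumptions, and the function names are hypothetical.

```python
import math

def expression_ratio(sample: float, reference: float) -> float:
    """Log2 ratio of two expression values (assumes both are positive)."""
    return math.log2(sample / reference)

def value_to_color(value: float, lo: float, hi: float) -> str:
    """Linearly map a value in [lo, hi] onto a blue-to-red spectrum (#RRGGBB)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    t = min(1.0, max(0.0, (value - lo) / (hi - lo)))  # clamp to [0, 1]
    red = round(255 * t)
    blue = round(255 * (1.0 - t))
    return f"#{red:02x}00{blue:02x}"

def color_steps(step_genes, ratios, lo=-2.0, hi=2.0):
    """Color each overview step by the mean ratio of the genes encoding its enzymes.

    Averaging is an assumption; steps with no measured gene are left uncolored.
    """
    colors = {}
    for step, genes in step_genes.items():
        values = [ratios[g] for g in genes if g in ratios]
        if values:
            colors[step] = value_to_color(sum(values) / len(values), lo, hi)
    return colors
```

For absolute rather than relative values, the same `value_to_color` function could be called directly on the measured levels with a suitable range.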
Environment for Computational Exploration of Genomes
Powerful ontology opens many facets of the biology to computational exploration
Global characterization of metabolic network
Analysis of interface between transport and metabolism
Nutrient analysis of metabolic network
PathoLogic Pathway Predictor
Introduction
Description of PPP execution
Inputs to PPP
Using the GUI to create a pathway/genome database
Output from PPP
Caveats
PathoLogic Goals
Create the set of class frames that encode DB schema
Copied from MetaCyc
Create the appropriate set of instance frames
Genes, genetic elements, proteins created from input files
Substrates, reactions, and pathways are copied from the reference database
Interconnect frames in a manner that accurately reflects their semantic relationships
PathoLogic Input/Output
Inputs:
File listing genetic elements
http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
Files containing DNA sequence for each genetic element
Files containing annotation for each genetic element
MetaCyc database
Output:
Pathway/genome database for the subject organism
Directory tree for the subject organism
Reports that summarize:
Evidence contained in the input genome for the presence of reference pathways
Reactions missing from inferred pathways
Inputs to PathoLogic Pathway Predictor
genetic-elements.dat
Sequence files
GenBank file format
PathoLogic format
Directory Structure
genetic-elements.dat
ID	TEST-CHROM-1
NAME	Chromosome 1
TYPE	:CHRSM
CIRCULAR?	N
ANNOT-FILE	chrom1.pf
SEQ-FILE	chrom1.fsa
//
ID	TEST-CHROM-2
NAME	Chromosome 2
CIRCULAR?	N
ANNOT-FILE	/mydata/chrom2.gbk
SEQ-FILE	/mydata/chrom2.fna
//
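A file in this layout could be read with a few lines of code. The sketch below is not part of Pathway Tools; it assumes one attribute-value pair per line separated by whitespace, records terminated by a line containing only //, and the function name `parse_genetic_elements` is hypothetical.

```python
def parse_genetic_elements(text: str):
    """Split genetic-elements.dat text into a list of attribute dictionaries."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":            # end of one genetic element
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)  # attribute name, then the rest of the line
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:                      # tolerate a missing final //
        records.append(current)
    return records

SAMPLE = (
    "ID\tTEST-CHROM-1\nNAME\tChromosome 1\nTYPE\t:CHRSM\nCIRCULAR?\tN\n"
    "ANNOT-FILE\tchrom1.pf\nSEQ-FILE\tchrom1.fsa\n//\n"
    "ID\tTEST-CHROM-2\nNAME\tChromosome 2\nCIRCULAR?\tN\n"
    "ANNOT-FILE\t/mydata/chrom2.gbk\nSEQ-FILE\t/mydata/chrom2.fna\n//\n"
)

for element in parse_genetic_elements(SAMPLE):
    print(element["ID"], element["ANNOT-FILE"], element["SEQ-FILE"])
```

Each genetic element ends up as one dictionary, so a missing ANNOT-FILE or SEQ-FILE entry is easy to detect before running a trial parse.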
File Naming Conventions
One pair of sequence and annotation files for each genetic element
Sequence files: FASTA format
suffix fsa or fna
Annotation file:
GenBank format: suffix .gbk
PathoLogic format: suffix .pf
GenBank File Format
Accepted feature types:
CDS, tRNA, rRNA, misc_RNA
Accepted qualifiers:
/label            Unique ID [recm]
/gene             Gene name [req]
/product          [req]
/EC_number        [recm]
/product_comment  [opt]
/gene_comment     [opt]
/alt_name         Synonyms [opt]
For multifunctional proteins, put each function in a separate /product line
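Before a trial parse, a feature's type and qualifiers can be checked against the lists above. The following is a hypothetical pre-check, not a PathoLogic function. It assumes that [req] marks a required qualifier and that the [req]/[recm]/[opt] tags pair with the qualifiers in the order listed. How PathoLogic itself treats qualifiers outside the list is not stated here, so they are only reported.

```python
ACCEPTED_FEATURES = {"CDS", "tRNA", "rRNA", "misc_RNA"}

# Qualifier name -> status tag, as read from the slide (pairing is an assumption).
QUALIFIERS = {
    "label": "recm",
    "gene": "req",
    "product": "req",
    "EC_number": "recm",
    "product_comment": "opt",
    "gene_comment": "opt",
    "alt_name": "opt",
}

def check_feature(feature_type, qualifiers):
    """Return a list of problems for one feature; qualifiers maps name -> list of values."""
    problems = []
    if feature_type not in ACCEPTED_FEATURES:
        problems.append(f"feature type {feature_type!r} is not accepted")
    for name in qualifiers:
        if name not in QUALIFIERS:
            problems.append(f"qualifier /{name} is not in the accepted list")
    for name, status in QUALIFIERS.items():
        if status == "req" and name not in qualifiers:
            problems.append(f"missing required qualifier /{name}")
    return problems

# A multifunctional protein carries one /product value per function.
print(check_feature("CDS", {"gene": ["gltA"], "product": ["function 1", "function 2"]}))
```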
Typical Problems Using GenBank Files With PathoLogic
Wrong qualifier names used
Extraneous information in a given qualifier
Check results of trial parse carefully
PathoLogic File Format
Each record starts with line containing an ID attribute
Tab delimited
Each record ends with a line containing //
One attribute-value pair is allowed per line
Use multiple FUNCTION lines for multifunctional proteins
Lines starting with ‘;’ are comment lines
Valid attributes are:
ID, NAME, SYNONYM
STARTBASE, ENDBASE, GENE-COMMENT
FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
DBLINK
PathoLogic File Format
ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
PRODUCT-TYPE	P
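The format rules listed earlier (tab-delimited pairs, records opened by ID and closed by //, ';' comment lines, repeated FUNCTION lines) are enough to write a small reader. This is a sketch under those stated rules, not the PathoLogic parser; the function name `parse_pf` is hypothetical, and accepting a final record without a closing // is a leniency chosen here.

```python
VALID_ATTRIBUTES = {
    "ID", "NAME", "SYNONYM", "STARTBASE", "ENDBASE", "GENE-COMMENT",
    "FUNCTION", "PRODUCT-TYPE", "EC", "FUNCTION-COMMENT", "DBLINK",
}

def parse_pf(text: str):
    """Parse PathoLogic-format text into records mapping attribute -> list of values."""
    records, current = [], None
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith(";"):
            continue                      # blank line or comment line
        if line.strip() == "//":          # end of record
            if current is not None:
                records.append(current)
            current = None
            continue
        attribute, _, value = line.partition("\t")
        attribute = attribute.strip()
        if attribute not in VALID_ATTRIBUTES:
            raise ValueError(f"line {number}: unknown attribute {attribute!r}")
        if attribute == "ID":             # an ID line always opens a new record
            if current is not None:
                records.append(current)
            current = {}
        if current is None:
            raise ValueError(f"line {number}: record must start with ID")
        # Lists keep repeated attributes, e.g. several FUNCTION lines.
        current.setdefault(attribute, []).append(value.strip())
    if current is not None:
        records.append(current)
    return records
```

Storing every value in a list keeps multifunctional proteins intact: a record with two FUNCTION lines yields a two-element list under "FUNCTION".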
Using the PPP GUI to Create a Pathway/Genome Database
Input Project Information
Organism -> Create New
Trial Parse
Build -> Trial Parse
Build pathway/genome database
Build -> Automated Build
Manual polishing
Refine -> Resolve Ambiguous Name Matches
Refine -> Assign Modified Proteins
Refine -> Create Protein Complexes
Refine -> Run Consistency Checker
Refine -> Update Overview
PathoLogic Command Menus
Organism
Select
Create New
Save KB
Revert KB
Reinitialize KB
Exit
Build
Trial Parse
Automated Build
Refine
Resolve Ambiguous Name Matches
Assign Modified Proteins
Create Protein Complexes
Re-run Name Matcher
Rescore Pathways
Run Consistency Checker
Update Overview
Input Project Information
PathoLogic PP Parse Output
Enzyme Name to Reaction Mapping
Enzyme Name Matching Tool
Dictionary of enzyme names assembled from:
All metabolic reactions found in MetaCyc
Two files that map synonyms not found in MetaCyc to reaction names:
System file (pangea-enzyme-mappings.dat)
User-supplied file (local-enzyme-mappings.dat)
Location of sources:
$GPROOT/pathologic/$VERSION-NUMBER/data
Enzyme Name Matcher
Matches on full enzyme name
Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
Also matches after removal of prefixes and suffixes such as:
“Putative”, “Hypothetical”, etc
alpha|beta|…|catalytic|inducible chain|subunit|component
Parenthetical gene name
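The matching rules above can be approximated in a short sketch. It is not the PathoLogic name matcher: only the prefixes and suffix words spelled out on the slide are handled (the elided ones are not guessed), a trailing parenthetical is assumed to be the gene name, and the dictionary contents and reaction identifiers in the usage lines are placeholders.

```python
import re

PUNCTUATION = " -_(){}',:"
_STRIP = str.maketrans("", "", PUNCTUATION)
PREFIXES = {"putative", "hypothetical"}
SUFFIX_RE = re.compile(
    r"[\s,]*\b(alpha|beta|catalytic|inducible)\s+(chain|subunit|component)\s*$",
    re.IGNORECASE,
)
PAREN_RE = re.compile(r"\s*\(\w+\)\s*$")  # trailing "(geneName)"

def canonical(name: str) -> str:
    """Lower-case the name and delete the listed punctuation characters."""
    return name.lower().translate(_STRIP)

def candidate_keys(name: str):
    """Canonical key for the full name, then for the name minus prefixes/suffixes."""
    keys = [canonical(name)]
    stripped = PAREN_RE.sub("", name.strip())
    stripped = SUFFIX_RE.sub("", stripped)
    words = stripped.split()
    while words and words[0].lower().strip(",") in PREFIXES:
        words = words[1:]
    key = canonical(" ".join(words))
    if key not in keys:
        keys.append(key)
    return keys

def match(name: str, dictionary: dict):
    """Return the dictionary entry for the first candidate key found, else None."""
    for key in candidate_keys(name):
        if key in dictionary:
            return dictionary[key]
    return None

# Placeholder dictionary: canonical enzyme name -> hypothetical reaction id.
names = {canonical("purine-nucleoside phosphorylase"): "RXN-EXAMPLE-1"}
print(match("Putative purine nucleoside phosphorylase (deoD)", names))
```

Trying the full name first mirrors the order on the slide: the looser prefix- and suffix-stripped form is only consulted when the exact name is absent.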
Enzyme Name Matcher
For names that do not match, software identifies probable metabolic enzymes as those
Containing “ase”
Not containing keywords such as
“sensor kinase”
“topoisomerase”
“protein kinase”
“peptidase”
Etc
Research unknown enzymes
MetaCyc, Swiss-Prot, PIR, Medline, EMP
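The "probable metabolic enzyme" filter described above amounts to a substring test plus an exclusion list. A minimal sketch, assuming case-insensitive matching and using only the four keywords named on the slide (the real list is longer, as the "Etc" indicates); the function name is hypothetical.

```python
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")

def is_probable_metabolic_enzyme(name: str) -> bool:
    """True if the name contains "ase" and none of the excluded keywords."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)

unmatched = ["glutamate synthase", "DNA topoisomerase I", "hypothetical protein"]
to_research = [n for n in unmatched if is_probable_metabolic_enzyme(n)]
print(to_research)
```

Because this is a plain substring test, a name such as "phase-variable protein" would also pass, so the resulting list is a starting point for manual research rather than a final answer.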
Assigning Evidence Scores to Predicted Pathways
X|Y|Z denotes the score for pathway P in organism O, where:
X = total number of reactions in P
Y = number of reactions in P for which there is evidence in O of a catalyzing enzyme
Z = number of Y reactions that are used in other pathways in O
Not clear how to convert these scores into a probability of occurrence
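The three numbers can be computed from two simple inputs. The sketch below is illustrative, not PathoLogic code: it assumes each pathway is given as a set of reaction identifiers, that enzyme evidence is given as the set of reactions with at least one matched enzyme, and that the dictionary passed in stands for the pathways considered for O. All identifiers are made up.

```python
def evidence_score(pathway, pathway_reactions, reactions_with_enzymes):
    """Return (X, Y, Z) for one pathway.

    pathway_reactions: dict mapping pathway name -> set of reaction ids
    reactions_with_enzymes: set of reaction ids with enzyme evidence in the organism
    """
    reactions = pathway_reactions[pathway]
    present = reactions & reactions_with_enzymes          # the Y reactions
    shared = {
        r for r in present
        if any(r in other for name, other in pathway_reactions.items() if name != pathway)
    }                                                     # the Z reactions
    return len(reactions), len(present), len(shared)

pathways = {"P1": {"r1", "r2", "r3"}, "P2": {"r3", "r4"}}
evidence = {"r1", "r3"}
x, y, z = evidence_score("P1", pathways, evidence)
print(f"{x}|{y}|{z}")
```

In this toy case P1 scores 3|2|1: two of its three reactions have evidence, and one of those two (r3) is also used by P2, so only r1 counts as evidence specific to P1.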
Algorithm for Automated Pathway Pruning
A pathway will never be pruned if it contains a unique enzyme (an enzyme not present in any other pathway)
A pathway will be pruned if one of the following conditions holds:
Evidence is better for a different pathway in same variant set
Evidence for only one reaction in pathway, or
Its set of reactions present is a proper subset of the reactions present in some other pathway, and
If pathway is a biosynthetic pathway, final reaction(s) missing
If pathway is a degradation pathway, initial reaction(s) missing
If pathway is an energy metabolism pathway, more than half the reactions are missing
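One possible reading of these rules is sketched below. The grouping of the conditions is an interpretation: the pathway is pruned if (a) a variant has better evidence, or (b) only one reaction has evidence, or (c) its present reactions are a proper subset of another pathway's and the type-specific reactions are missing. Further assumptions: "better evidence" is measured by the count of present reactions, a "unique enzyme" is approximated by a present reaction found in no other pathway, and reactions are stored in pathway order. None of this is taken from the PathoLogic source.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Pathway:
    name: str
    kind: str                     # "biosynthesis", "degradation" or "energy"
    reactions: List[str]          # ordered from first to last step (assumption)
    variant_set: Optional[str] = None

def key_steps_missing(p: Pathway, present: Set[str]) -> bool:
    """Type-specific test on which reactions lack evidence."""
    if p.kind == "biosynthesis":
        return p.reactions[-1] not in present
    if p.kind == "degradation":
        return p.reactions[0] not in present
    if p.kind == "energy":
        missing = [r for r in p.reactions if r not in present]
        return len(missing) > len(p.reactions) / 2
    return False

def should_prune(p: Pathway, pathways: List[Pathway], present: Set[str]) -> bool:
    others = [q for q in pathways if q.name != p.name]
    p_present = set(p.reactions) & present
    other_reactions = set().union(*(q.reactions for q in others))
    if p_present - other_reactions:
        return False              # evidence unique to this pathway: never prune
    for q in others:              # (a) better evidence for a variant
        if p.variant_set is not None and q.variant_set == p.variant_set:
            if len(set(q.reactions) & present) > len(p_present):
                return True
    if len(p_present) == 1:       # (b) evidence for only one reaction
        return True
    is_subset = any(p_present < (set(q.reactions) & present) for q in others)
    return is_subset and key_steps_missing(p, present)   # (c)
```

The early return for unique evidence encodes the "never pruned" rule as taking priority over every pruning condition.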
Creating Protein Complexes
Complex Subunits Stoichiometries
Proteins as Reaction Substrates
Manual Pruning of Pathways
Use pathway evidence report
Coloring scheme aids in assessing pathway evidence
Phase I: Prune extra variant pathways
Rescore pathways, re-generate pathway evidence report
Phase II: Prune pathways unlikely to be present
No/few unique enzymes
Most pathway steps present because they are used in another pathway
Pathway very unlikely to be present in this organism
Overview Graph
Output from PPP
Pathway/genome database
Summary pages
Pathway evidence page
Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
Missing enzymes report
Directory tree containing sequence files, reports, etc.
Resulting Directory Structure
ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/
  input
    organism.dat
    organism-init.dat
    genetic-elements.dat
    annotation files
    sequence files
  reports
    name-matching-report.txt
    trial-parse-report.txt
  kb
    ORGIDbase.ocelot
  data
    overview.graph
released -> VERSION
Caveats
Cannot predict pathways not present in MetaCyc
Evidence for short pathways is hard to interpret
Since many reactions occur in lots of pathways, many false positives
The Pathway Tools Schema
Motivations for Understanding Schema
Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB
When writing complex Lisp queries to PGDBs, those queries must name classes and slots within the schema
A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity
Reference
Pathway Tools User’s Guide, Volume I
Appendix A: Guide to the Pathway Tools Schema
Web of Relationships for One Enzyme
TCA Cycle
Succinate + FAD = fumarate + FADH2
Enzymatic-reaction
Succinate dehydrogenase
Sdh-flavo
Sdh-Fe-S
Sdh-membrane-1
Sdh-membrane-2
sdhA
sdhB
sdhC
sdhD
Frame Data Model and Schema
Frame Data Model -- organizational principle for a DB
Object Displays
Schema
Gene slots
Polypeptide slots
Protein slots
Protein Complex slots
Reaction slots
Enzymatic Reaction slots
Frame Data Model
Knowledge base (KB, Database, DB)
Frames
Slots
Facets
Annotations
Knowledge Base
Collection of frames and their associated slots, values, facets, and annotations
Can be stored within
An Oracle DB
A disk file
A Pathway Tools binary program
Frames
Entities with which facts are associated
Kinds of frames:
Classes: Genes, Pathways, Biosynthetic Pathways
Instances (objects): trpA, TCA cycle
Classes:
Superclass(es)
Subclass(es)
Instance(s)
A symbolic frame name (id, key) uniquely identifies each frame
Slots
Encode attributes/properties of a frame
Integer, real number, string
Represent relationships between frames
The value of a slot is the identifier of another frame
Every slot is described by a “slot frame” in a KB that defines meta information about that slot
Slot Links
TCA Cycle
in-pathway
Succinate + FAD = fumarate + FADH2
reaction
Enzymatic-reaction
catalyzes
Succinate dehydrogenase
component-of
Sdh-flavo
Sdh-Fe-S
Sdh-membrane-1
Sdh-membrane-2
product
sdhA
sdhB
sdhC
sdhD
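The chain of slot links in this diagram can be followed programmatically. The sketch below uses plain Python dictionaries as a stand-in for a KB; it is not the Pathway Tools Lisp API. The frame names are the display labels from the slide rather than real frame identifiers, and the direction of each link (gene to pathway) is an assumption about how the diagram reads.

```python
# Toy knowledge base: frame name -> {slot name -> list of frame names}
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["Enzymatic-reaction"]},
    "Enzymatic-reaction": {"reaction": ["Succinate + FAD = fumarate + FADH2"]},
    "Succinate + FAD = fumarate + FADH2": {"in-pathway": ["TCA Cycle"]},
    "TCA Cycle": {},
}

def follow(kb, start, slots):
    """Follow a sequence of slots from one frame, returning the frames reached."""
    frontier = [start]
    for slot in slots:
        reached = []
        for frame in frontier:
            for value in kb.get(frame, {}).get(slot, []):
                if value not in reached:
                    reached.append(value)
        frontier = reached
    return frontier

GENE_TO_PATHWAY = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]
print(follow(KB, "sdhA", GENE_TO_PATHWAY))
```

A query that names the wrong slot at any step simply reaches no frames, which is why queries against a PGDB need the exact slot names from the schema.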
Slots
Number of values
Single valued
Multivalued: sets, bags
Slot values
Any LISP object: Integer, real, string, symbol (frame name), list
Slotunits define properties of slots: datatypes, classes,
constraints
Two slots are inverses if they encode opposite relationships
Slot Product in class Genes
Slot Gene in class Polypeptides
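The inverse relationship between Product and Gene can be illustrated with a small frame store that writes both directions at once. This is a sketch only; whether and how the underlying frame system keeps inverse slots synchronized is not described here, and the class and method names are hypothetical.

```python
INVERSES = {"Product": "Gene", "Gene": "Product"}

class ToyKB:
    """Minimal frame store: frame name -> slot name -> list of values (set semantics)."""

    def __init__(self):
        self.frames = {}

    def _add(self, frame, slot, value):
        values = self.frames.setdefault(frame, {}).setdefault(slot, [])
        if value not in values:          # multivalued slot treated as a set
            values.append(value)

    def add_value(self, frame, slot, value):
        """Store a value and, if the slot has an inverse, the reverse link too."""
        self._add(frame, slot, value)
        inverse = INVERSES.get(slot)
        if inverse is not None:
            self._add(value, inverse, frame)

    def get(self, frame, slot):
        return list(self.frames.get(frame, {}).get(slot, []))

kb = ToyKB()
kb.add_value("sdhA", "Product", "Sdh-flavo")
print(kb.get("Sdh-flavo", "Gene"))
```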
Representation of Function
TCA Cycle
EC#
Keq
Succinate + FAD = fumarate + FADH2
Enzymatic-reaction
Succinate dehydrogenase
Cofactors
Inhibitors
Molecular wt
pI
Sdh-flavo
Sdh-Fe-S
Sdh-membrane-1
Sdh-membrane-2
sdhA
sdhB
sdhC
sdhD
Left-end-position
Monofunctional Monomer
Pathway
Reaction
Enzymatic-reaction
Monomer
Gene
Bifunctional Monomer
Pathway
Reaction
Reaction
Enzymatic-reaction
Enzymatic-reaction
Monomer
Gene
Monofunctional Multimer
Pathway
Reaction
Enzymatic-reaction
Multimer
Monomer
Monomer
Monomer
Monomer
Gene
Gene
Gene
Gene
Pathway and Substrates
Reactant-1
Pathway
left
in-pathway
Reactant-2
Reaction
Product-1
Product-2
right
Reaction
Reaction
Reaction
Transcriptional Regulation
trp
apoTrpR
trpLEDCBA
Int005
site001
Int001
pro001
Int003
trpL
trpE
trpD
trpC
trpB
trpA
TrpR*trp
RpoSig70
Principal Classes
Class names are capitalized, plural
Genetic-Elements, with subclasses:
Chromosomes
Plasmids
Genes
Transcription-Units
RNAs
Proteins, with subclasses:
Polypeptides
Protein-Complexes
Principal Classes
Reactions, with subclasses:
Transport-Reactions
Enzymatic-Reactions
Pathways
Compounds-And-Elements
Slots in Multiple Classes
Common-Name
Synonyms
Names (computed as union of Common-Name, Synonyms)
Comment
Citations
DB-Links
Genes Slots
Chromosome
Left-End-Position
Right-End-Position
Centisome-Position
Transcription-Direction
Product
Proteins Slots
Molecular-Weight-Seq
Molecular-Weight-Exp
pI
Locations
Modified-Form
Unmodified-Form
Component-Of
Polypeptides Slots
Gene
Protein-Complexes Slots
Components
Reactions Slots
EC-Number
Left, Right
Substrates (computed as union of Left, Right)
DeltaG0
Keq
Spontaneous?
Species
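A computed slot such as Substrates can be pictured as a read-only property derived from stored slots. The sketch below is an analogy in Python, not the way Pathway Tools implements computed slots; the class is hypothetical and keeps only the Left and Right slots.

```python
class Reaction:
    """Toy reaction frame with stored Left/Right slots and a computed Substrates slot."""

    def __init__(self, left, right):
        self.left = list(left)
        self.right = list(right)

    @property
    def substrates(self):
        """Union of Left and Right, preserving first-seen order."""
        seen = []
        for compound in self.left + self.right:
            if compound not in seen:
                seen.append(compound)
        return seen

rxn = Reaction(["succinate", "FAD"], ["fumarate", "FADH2"])
print(rxn.substrates)
```

Because the value is derived on each access, it cannot drift out of date when Left or Right is edited.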
Enzymatic-Reactions Slots
Enzyme
Reaction
Activators
Inhibitors
Physiologically-Relevant
Cofactors
Prosthetic-Groups
Alternative-Substrates
Alternative-Cofactors
Editing Pathway/Genome Databases
Pathway Tools Paradigm
Separate database from user interface
Navigator provides one view of the DB
Editors provide an alternative view of the DB
Invoking the Editors
Right-Click on an Object Handle
Edit
Notes
Show
Shift-Middle-Click on an Object Handle
Saving Changes
The user must save changes explicitly with Save KB
To discard changes made since last save
Special -> KB -> Revert KB
Administering the Pathway Tools
Information Sources
Pathway Tools User’s Guide
aic-export/ecocyc/genopath/released/doc/userguide1.pdf
Appendix A: Guide to the Pathway Tools Schema
aic-export/ecocyc/genopath/released/doc/userguide2.pdf
Pathway Tools Web Site
http://bioinformatics.ai.sri.com/ptools/
Pathway Tools Tutorial
http://bioinformatics.ai.sri.com/ptools/tutorial/
Reporting Problems
E-mail to [email protected]
Include:
Error message
Result of :zoom :count :all
What version and platform you are running
What operation were you performing when the error occurred?