General - Bioinformatics Research Group at SRI International


Overview of the
Pathway Tools Software
and
Pathway/Genome Databases
Peter D. Karp
Bioinformatics Research Group
SRI International
[email protected]
Pathway/Genome Database
Integrating Genomic and Biochemical Data
Pathways
Reactions
Compounds
Proteins
Genes
Operons,
Promoters,
DNA Binding Sites
Chromosomes,
Plasmids
CELL
Key Functionality
 Pathway analysis
 Prediction of pathways from genomes
 Comparative pathway analysis
 Ongoing curation of PGDBs
 WWW publishing of PGDBs
 Analysis of gene expression data
Tools and Datasets
[Figure: a PGDB (Pathways, Genes) surrounded by three tools: Pathway/Genome Navigator (Visualize, Query and Analyze PGDBs), PathoLogic (Create PGDBs), and Editors (Update PGDBs).]
PathoLogic Pathway Predictor
[Figure: a Set of Annotated Genes and the MetaCyc PGDB are the inputs to Pathway Prediction; the outputs are Reports and a New PGDB.]
Prediction of Pathways from Genomes
[Figure: PathoLogic takes an Annotated Genome (Genomic Map, List of Genes/ORFs, List of Gene Products, DNA Sequence) and produces a Pathway/Genome Database. Labels on the database side: Metabolic Network, Pathways, Reactions, Compounds, Proteins, Genes.]
MetaCyc Overview
 Meta Metabolic Encyclopedia
 439 pathways, 1095 enzymes, 4217 reactions
 173 E. coli pathways
 Literature-based DB with extensive references and commentary
 Pathways, reactions, enzymes, substrates
 Editor in chief: Dr. Monica Riley
Pathway/Genome Navigator
 Query and visualization tools for PGDBs
 Metabolic pathways, reactions, compounds
 Enzymes, transporters, transcription factors
 Genome maps, genes, operons, promoters, DNA sites
 Retrieve nucleotide and DNA sequences
 Perform BLAST searches
 Runs as an application on Solaris, Windows
 Runs as a WWW server on Solaris
 Query and comparative analysis functions
Interactive Editing Tools
 Pathway editor
 Reaction editor
 Gene editor
 Enzyme editor
 Compound editor
 Transcription Unit Editor
 Facilitate updates to PGDBs
 Improved computational predictions
 Literature-based data
 Record citations, comments, evidence, history
Pathway Views of Expression Data
 Import gene expression data
 Compute expression ratios
 Obtain pathway-based visualizations of data
 Numerical spectrum of expression values mapped to a color spectrum
 Steps of overview painted with color corresponding to expression level(s) of genes that encode enzyme(s) for that step
 Absolute or relative expression values
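
The mapping from a numerical spectrum to a color spectrum can be sketched as follows. This is only an illustration, not the Pathway Tools implementation: the function names, the blue-to-red spectrum, the linear interpolation, and the use of the mean when several genes encode enzymes for one step are all assumptions.

```python
def expression_to_rgb(value, lo, hi):
    """Map a numeric expression value onto a blue (low) to red (high) spectrum."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Clamp the value into [lo, hi], then scale it to a fraction in 0..1.
    frac = (min(max(value, lo), hi) - lo) / (hi - lo)
    red = round(255 * frac)
    return (red, 0, 255 - red)


def color_steps(step_genes, expression, lo, hi):
    """Color each step by the mean expression of the genes behind it.

    step_genes: {step id: [gene ids]} (hypothetical structure)
    expression: {gene id: expression value or ratio}
    Steps with no measured gene are left uncolored.
    """
    colors = {}
    for step, genes in step_genes.items():
        values = [expression[g] for g in genes if g in expression]
        if values:
            colors[step] = expression_to_rgb(sum(values) / len(values), lo, hi)
    return colors
```

The same function works for absolute values or for ratios; only the choice of `lo` and `hi` changes.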
Environment for Computational Exploration of Genomes
 Powerful ontology opens many facets of the biology to computational exploration
 Global characterization of metabolic network
 Analysis of interface between transport and metabolism
 Nutrient analysis of metabolic network
PathoLogic Pathway Predictor (PPP)
 Introduction
 Description of PPP execution
 Inputs to PPP
 Using the GUI to create a pathway/genome database
 Output from PPP
 Caveats
PathoLogic Goals
 Create the set of class frames that encode DB schema
 Copied from MetaCyc
 Create the appropriate set of instance frames
 Genes, genetic elements, proteins created from input files
 Substrates, reactions, and pathways are copied from the reference database
 Interconnect frames in a manner that accurately reflects their semantic relationships
PathoLogic Input/Output
Inputs:
 File listing genetic elements
   http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
 Files containing DNA sequence for each genetic element
 Files containing annotation for each genetic element
 MetaCyc database
Output:
 Pathway/genome database for the subject organism
 Directory tree for the subject organism
 Reports that summarize:
   Evidence contained in the input genome for the presence of reference pathways
   Reactions missing from inferred pathways
Inputs to PathoLogic Pathway Predictor
 genetic-elements.dat
 Sequence files
 GenBank file format
 PathoLogic format
 Directory Structure
genetic-elements.dat
ID          TEST-CHROM-1
NAME        Chromosome 1
TYPE        :CHRSM
CIRCULAR?   N
ANNOT-FILE  chrom1.pf
SEQ-FILE    chrom1.fsa
//
ID          TEST-CHROM-2
NAME        Chromosome 2
CIRCULAR?   N
ANNOT-FILE  /mydata/chrom2.gbk
SEQ-FILE    /mydata/chrom2.fna
//
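
A file in this shape is easy to check before handing it to PathoLogic. The sketch below is a hypothetical helper, not part of Pathway Tools. It assumes each line holds one attribute name followed by whitespace and a value, and that every record ends with a `//` line, as in the example above.

```python
def parse_genetic_elements(text):
    """Parse genetic-elements.dat text into a list of {attribute: value} dicts."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # End of record: keep it only if it has content.
            if current:
                records.append(current)
            current = {}
            continue
        # Split on the first run of whitespace (tab or spaces).
        parts = line.split(None, 1)
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:
        # Tolerate a final record without a closing '//'.
        records.append(current)
    return records


def missing_files(records):
    """Return IDs of records lacking an ANNOT-FILE or SEQ-FILE entry."""
    return [r.get("ID", "?") for r in records
            if "ANNOT-FILE" not in r or "SEQ-FILE" not in r]
```

Running `missing_files` on the parsed records is a quick way to confirm the rule of one sequence file and one annotation file per genetic element.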
File Naming Conventions
 One pair of sequence and annotation files for each genetic element
 Sequence files: FASTA format
 suffix .fsa or .fna
 Annotation file:
 GenBank format: suffix .gbk
 PathoLogic format: suffix .pf
GenBank File Format
Accepted feature types:
 CDS, tRNA, rRNA, misc_RNA
Accepted qualifiers:
 /label            Unique ID [recm]
 /gene             Gene name [req]
 /product          [req]
 /EC_number        [recm]
 /product_comment  [opt]
 /gene_comment     [opt]
 /alt_name         Synonyms [opt]
For multifunctional proteins, put each function in a separate /product line
Typical Problems Using GenBank Files With PathoLogic
 Wrong qualifier names used
 Extraneous information in a given qualifier
 Check results of trial parse carefully
PathoLogic File Format
 Each record starts with a line containing an ID attribute
 Tab delimited
 Each record ends with a line containing //
 One attribute-value pair is allowed per line
   Use multiple FUNCTION lines for multifunctional proteins
 Lines starting with ‘;’ are comment lines
 Valid attributes are:
   ID, NAME, SYNONYM
   STARTBASE, ENDBASE, GENE-COMMENT
   FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
   DBLINK
PathoLogic File Format
ID            TP0734
NAME          deoD
STARTBASE     799084
ENDBASE       799785
FUNCTION      purine nucleoside phosphorylase
DBLINK        PID:g3323039
PRODUCT-TYPE  P
GENE-COMMENT  similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID            TP0735
NAME          gltA
STARTBASE     799867
ENDBASE       801423
FUNCTION      glutamate synthase
DBLINK        PID:g3323040
PRODUCT-TYPE  P
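
The format rules above (tab-delimited pairs, `//` terminators, `;` comments, repeated FUNCTION lines) translate into a small reader. The sketch below is an illustration only; the function name is hypothetical and PathoLogic's own parser may differ in detail. Every attribute maps to a list of values, so a multifunctional protein keeps all of its FUNCTION lines.

```python
VALID_ATTRIBUTES = {
    "ID", "NAME", "SYNONYM", "STARTBASE", "ENDBASE", "GENE-COMMENT",
    "FUNCTION", "PRODUCT-TYPE", "EC", "FUNCTION-COMMENT", "DBLINK",
}


def parse_pathologic(text):
    """Parse PathoLogic-format text into a list of {attribute: [values]} dicts."""
    records = []
    current = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip() or raw.startswith(";"):
            continue  # blank line or comment line
        if raw.strip() == "//":
            if current:
                records.append(current)
            current = {}
            continue
        attribute, tab, value = raw.partition("\t")
        attribute = attribute.strip()
        if not tab or attribute not in VALID_ATTRIBUTES:
            raise ValueError(f"line {number}: expected ATTRIBUTE<tab>value")
        # A record must open with its ID attribute.
        if not current and attribute != "ID":
            raise ValueError(f"line {number}: record must start with ID")
        current.setdefault(attribute, []).append(value.strip())
    if current:
        records.append(current)  # tolerate a missing final '//'
    return records
```

Rejecting unknown attribute names early mirrors the advice to check the trial parse carefully: a misspelled attribute is reported instead of being silently dropped.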
Using the PPP GUI to Create a Pathway/Genome Database
 Input Project Information
   Organism -> Create New
 Trial Parse
   Build -> Trial Parse
 Build pathway/genome database
   Build -> Automated Build
 Manual polishing
   Refine -> Resolve Ambiguous Name Matches
   Refine -> Assign Modified Proteins
   Refine -> Create Protein Complexes
   Refine -> Run Consistency Checker
   Refine -> Update Overview
PathoLogic Command Menus
Organism
 Select
 Create New
 Save KB
 Revert KB
 Reinitialize KB
 Exit
Build
 Trial Parse
 Automated Build
Refine
 Resolve Ambiguous Name Matches
 Assign Modified Proteins
 Create Protein Complexes
 Re-run Name Matcher
 Rescore Pathways
 Run Consistency Checker
 Update Overview
Input Project Information
PathoLogic PP Parse Output
Enzyme Name to Reaction Mapping
Enzyme Name Matching Tool
 Dictionary of enzyme names assembled from:
   All metabolic reactions found in MetaCyc
   Two files that map synonyms not found in MetaCyc to reaction names:
     System file (pangea-enzyme-mappings.dat)
     User-supplied file (local-enzyme-mappings.dat)
 Location of sources:
   $GPROOT/pathologic/$VERSION-NUMBER/data
Enzyme Name Matcher
 Matches on full enzyme name
 Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
 Also matches after removal of prefixes and suffixes such as:
   “Putative”, “Hypothetical”, etc.
   alpha|beta|…|catalytic|inducible chain|subunit|component
   Parenthetical gene name
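
The matching rules above can be sketched in a few lines. This is an assumption-laden illustration rather than the actual name matcher: the function names are hypothetical, the prefix and modifier lists contain only the words named on the slide (the slide's "etc." and "…" are not filled in), and a suffix is stripped only when a modifier is followed by chain, subunit, or component.

```python
import re

# Punctuation characters listed on the slide (note the leading space).
PUNCTUATION = " -_(){}',:"
PREFIX_RE = re.compile(r"^(?:putative|hypothetical)\s+", re.IGNORECASE)
SUFFIX_RE = re.compile(
    r"\s+(?:alpha|beta|catalytic|inducible)\s+(?:chain|subunit|component)$",
    re.IGNORECASE,
)
# Trailing parenthetical such as "(deoD)".
PAREN_RE = re.compile(r"\s*\([^()]*\)\s*$")


def normalize(name):
    """Lower-case the name and drop the listed punctuation characters."""
    return "".join(ch for ch in name.lower() if ch not in PUNCTUATION)


def candidate_keys(name):
    """Yield the full-name key first, then the key with affixes removed."""
    yield normalize(name)
    stripped = PAREN_RE.sub("", name).strip()
    stripped = PREFIX_RE.sub("", stripped)
    stripped = SUFFIX_RE.sub("", stripped).strip()
    if stripped != name:
        yield normalize(stripped)


def match_enzyme_name(name, dictionary):
    """Return the dictionary entry for the first key that matches, else None.

    dictionary maps normalized enzyme names to reaction identifiers.
    """
    for key in candidate_keys(name):
        if key in dictionary:
            return dictionary[key]
    return None
```

Trying the full name first means a dictionary entry that legitimately contains words such as "subunit" still wins over the stripped form.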
Enzyme Name Matcher
 For names that do not match, software identifies probable metabolic enzymes as those
   Containing “ase”
   Not containing keywords such as
     “sensor kinase”
     “topoisomerase”
     “protein kinase”
     “peptidase”
     etc.
 Research unknown enzymes
   MetaCyc, Swiss-Prot, PIR, Medline, EMP
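
The "probable metabolic enzyme" filter is a plain substring test, sketched below. The function name is hypothetical, and the exclusion list holds only the four keywords shown on the slide; the real list is longer (the slide ends with "etc.").

```python
# Keywords named on the slide; the full list used by the software is not given.
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_metabolic_enzyme(name):
    """Flag unmatched names that contain 'ase' and none of the excluded keywords."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)
```

Because the test is a substring match, a name such as "phase variation protein" is also flagged, which is one reason the flagged names still need manual research.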
Assigning Evidence Scores to Predicted Pathways
 X|Y|Z denotes the score for pathway P in organism O, where:
   X = total number of reactions in P
   Y = number of reactions in P for which there is evidence in O (an enzyme catalyzing the reaction)
   Z = number of Y reactions that are used in other pathways in O
 Not clear how to convert these scores into a probability of occurrence
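
Computing the three numbers is a matter of set arithmetic. The sketch below assumes reactions are identified by strings and that "evidence" simply means the reaction identifier appears in a set of reactions with an assigned enzyme; the function and parameter names are hypothetical.

```python
def evidence_score(pathway_reactions, reactions_with_evidence, other_pathways):
    """Return the X|Y|Z score string for one pathway.

    pathway_reactions: reaction ids in pathway P
    reactions_with_evidence: reaction ids with an enzyme in organism O
    other_pathways: iterable of reaction-id collections, one per other pathway
    """
    in_pathway = set(pathway_reactions)
    present = in_pathway & set(reactions_with_evidence)
    # Reactions that appear in at least one other pathway.
    shared = set().union(*other_pathways)
    x = len(in_pathway)
    y = len(present)
    z = len(present & shared)
    return f"{x}|{y}|{z}"
```

For example, a four-reaction pathway with evidence for two reactions, one of which also occurs in another pathway, scores `4|2|1`. The closer Z is to Y, the less the evidence says about this pathway in particular.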
Algorithm for Automated Pathway Pruning
 A pathway will never be pruned if it contains a unique enzyme, that is, an enzyme not present in any other pathway
 A pathway will be pruned if one of the following conditions holds:
   Evidence is better for a different pathway in same variant set
   Evidence for only one reaction in pathway, or
   Its set of reactions present is a proper subset of the reactions present in some other pathway, and
     If pathway is a biosynthetic pathway, final reaction(s) missing
     If pathway is a degradation pathway, initial reaction(s) missing
     If pathway is an energy metabolism pathway, more than half the reactions are missing
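
One way to read these rules as code is shown below. It is a sketch under several assumptions, not the PathoLogic algorithm itself: the `Pathway` class and its fields are hypothetical, "final" and "initial" reactions are simplified to the last and first reaction of a linear pathway, and the class-specific test is taken to apply only together with the proper-subset condition.

```python
from dataclasses import dataclass


@dataclass
class Pathway:
    name: str
    kind: str            # "biosynthesis", "degradation", or "energy"
    reactions: list      # reaction ids in pathway order
    present: set         # reaction ids with evidence in the organism
    has_unique_enzyme: bool = False
    better_variant_exists: bool = False


def class_condition(p):
    """Class-specific test that accompanies the proper-subset condition."""
    if p.kind == "biosynthesis":
        return p.reactions[-1] not in p.present
    if p.kind == "degradation":
        return p.reactions[0] not in p.present
    if p.kind == "energy":
        missing = [r for r in p.reactions if r not in p.present]
        return len(missing) > len(p.reactions) / 2
    return False


def should_prune(p, others):
    """Apply the pruning rules to pathway p, given the other candidate pathways."""
    if p.has_unique_enzyme:
        return False  # a unique enzyme always protects the pathway
    if p.better_variant_exists:
        return True
    if len(p.present) == 1:
        return True
    is_subset = any(p.present < other.present for other in others)
    return is_subset and class_condition(p)
```

Checking the unique-enzyme rule first reflects the wording "will never be pruned": it overrides every other condition.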
Creating Protein Complexes
Complex Subunits Stoichiometries
Proteins as Reaction Substrates
Manual Pruning of Pathways
 Use pathway evidence report
   Coloring scheme aids in assessing pathway evidence
 Phase I: Prune extra variant pathways
 Rescore pathways, re-generate pathway evidence report
 Phase II: Prune pathways unlikely to be present
   No/few unique enzymes
   Most pathway steps present because they are used in another pathway
   Pathway very unlikely to be present in this organism
Overview Graph
Output from PPP
 Pathway/genome database
 Summary pages
   Pathway evidence page
     Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
   Missing enzymes report
 Directory tree containing sequence files, reports, etc.
Resulting Directory Structure
ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/
 input
   organism.dat
   organism-init.dat
   genetic-elements.dat
   annotation files
   sequence files
 reports
   name-matching-report.txt
   trial-parse-report.txt
 kb
   ORGIDbase.ocelot
 data
   overview.graph
released -> VERSION
Caveats
 Cannot predict pathways not present in MetaCyc
 Evidence for short pathways is hard to interpret
 Since many reactions occur in lots of pathways, many false positives
The Pathway Tools Schema
Motivations for Understanding Schema
 Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB
 When writing complex Lisp queries to PGDBs, those queries must name classes and slots within the schema
 A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity
Reference
 Pathway Tools User’s Guide, Volume I
   Appendix A: Guide to the Pathway Tools Schema
Web of Relationships for One Enzyme
[Figure: linked objects for succinate dehydrogenase. TCA Cycle; reaction Succinate + FAD = fumarate + FADH2; Enzymatic-reaction; enzyme Succinate dehydrogenase; subunits Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; genes sdhA, sdhB, sdhC, sdhD.]
Frame Data Model and Schema
 Frame Data Model -- organizational principle for a DB
 Object Displays
 Schema
   Gene slots
   Polypeptide slots
   Protein slots
   Protein Complex slots
   Reaction slots
   Enzymatic Reaction slots
Frame Data Model
 Knowledge base (KB, Database, DB)
 Frames
 Slots
 Facets
 Annotations
Knowledge Base
 Collection of frames and their associated slots, values, facets, and annotations
 Can be stored within
   An Oracle DB
   A disk file
   A Pathway Tools binary program
Frames
 Entities with which facts are associated
 Kinds of frames:
   Classes: Genes, Pathways, Biosynthetic Pathways
   Instances (objects): trpA, TCA cycle
 Classes:
   Superclass(es)
   Subclass(es)
   Instance(s)
 A symbolic frame name (id, key) uniquely identifies each frame
Slots
 Encode attributes/properties of a frame
   Integer, real number, string
 Represent relationships between frames
   The value of a slot is the identifier of another frame
 Every slot is described by a “slot frame” in a KB that defines meta information about that slot
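Slot Links
[Figure: the succinate dehydrogenase web with slot names on the links. TCA Cycle and the reaction Succinate + FAD = fumarate + FADH2 are linked by in-pathway; the reaction and the Enzymatic-reaction by reaction; the Enzymatic-reaction and Succinate dehydrogenase by catalyzes; Succinate dehydrogenase and its subunits (Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2) by component-of; the subunits and the genes sdhA, sdhB, sdhC, sdhD by product.]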
Slots
 Number of values
   Single valued
   Multivalued: sets, bags
 Slot values
   Any LISP object: Integer, real, string, symbol (frame name), list
 Slotunits define properties of slots: datatypes, classes, constraints
 Two slots are inverses if they encode opposite relationships
   Slot Product in class Genes
   Slot Gene in class Polypeptides
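
A toy model makes the frame, slot, and inverse-slot ideas concrete. The Python sketch below is not the Pathway Tools API (which is Lisp-based); the class and method names are hypothetical, facets and annotations are omitted, and every slot is treated as multivalued.

```python
class KnowledgeBase:
    """Minimal frame store: frame id -> {slot name -> list of values}."""

    def __init__(self):
        self.frames = {}
        self.inverses = {}  # slot name -> name of its inverse slot

    def declare_inverse(self, slot, inverse):
        """Record that two slots encode opposite relationships."""
        self.inverses[slot] = inverse
        self.inverses[inverse] = slot

    def create_frame(self, frame_id):
        if frame_id in self.frames:
            raise ValueError(f"frame {frame_id!r} already exists")
        self.frames[frame_id] = {}

    def add_value(self, frame_id, slot, value):
        """Add a slot value; if the value names a frame, fill the inverse slot too."""
        values = self.frames[frame_id].setdefault(slot, [])
        if value not in values:
            values.append(value)
        inverse = self.inverses.get(slot)
        if inverse and value in self.frames:
            back = self.frames[value].setdefault(inverse, [])
            if frame_id not in back:
                back.append(frame_id)

    def get_values(self, frame_id, slot):
        return list(self.frames[frame_id].get(slot, []))
```

With `Product` and `Gene` declared as inverses, storing a polypeptide in a gene's `Product` slot is enough to make the gene appear in the polypeptide's `Gene` slot, which is how a web of frames can be walked in either direction.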
Representation of Function
[Figure: the succinate dehydrogenase web (TCA Cycle; Succinate + FAD = fumarate + FADH2; Enzymatic-reaction; Succinate dehydrogenase; Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; sdhA, sdhB, sdhC, sdhD) annotated with example attributes: EC#, Keq, Cofactors, Inhibitors, Molecular wt, pI, Left-end-position.]
Monofunctional Monomer
[Figure: one Pathway, one Reaction, one Enzymatic-reaction, one Monomer, one Gene.]
Bifunctional Monomer
[Figure: one Pathway, two Reactions, two Enzymatic-reactions, one Monomer, one Gene.]
Monofunctional Multimer
[Figure: one Pathway, one Reaction, one Enzymatic-reaction, one Multimer, four Monomers, four Genes.]
Pathway and Substrates
[Figure: a Pathway and four Reactions linked by in-pathway; for one Reaction, Reactant-1 and Reactant-2 are linked by left, and Product-1 and Product-2 by right.]
Transcriptional Regulation
[Figure: objects involved in regulation of the trp operon. Labels: trp, apoTrpR, TrpR*trp, RpoSig70, trpLEDCBA, pro001, site001, Int001, Int003, Int005, trpL, trpE, trpD, trpC, trpB, trpA.]
Principal Classes
 Class names are capitalized, plural
 Genetic-Elements, with subclasses:
   Chromosomes
   Plasmids
 Genes
 Transcription-Units
 RNAs
 Proteins, with subclasses:
   Polypeptides
   Protein-Complexes
 Reactions, with subclasses:
   Transport-Reactions
 Enzymatic-Reactions
 Pathways
 Compounds-And-Elements
Slots in Multiple Classes
 Common-Name
 Synonyms
 Names (computed as union of Common-Name, Synonyms)
 Comment
 Citations
 DB-Links
Genes Slots
 Chromosome
 Left-End-Position
 Right-End-Position
 Centisome-Position
 Transcription-Direction
 Product
Proteins Slots
 Molecular-Weight-Seq
 Molecular-Weight-Exp
 pI
 Locations
 Modified-Form
 Unmodified-Form
 Component-Of
Polypeptides Slots
 Gene
Protein-Complexes Slots
 Components
Reactions Slots
 EC-Number
 Left, Right
 Substrates (computed as union of Left, Right)
 DeltaG0
 Keq
 Spontaneous?
 Species
Enzymatic-Reactions Slots
 Enzyme
 Reaction
 Activators
 Inhibitors
 Physiologically-Relevant
 Cofactors
 Prosthetic-Groups
 Alternative-Substrates
 Alternative-Cofactors
Editing Pathway/Genome Databases
Pathway Tools Paradigm
 Separate database from user interface
 Navigator provides one view of the DB
 Editors provide an alternative view of the DB
Invoking the Editors
 Right-Click on an Object Handle
   Edit
   Notes
   Show
 Shift-Middle-Click on an Object Handle
Saving Changes
 The user must save changes explicitly with Save KB
 To discard changes made since last save
   Special -> KB -> Revert KB
Administering the Pathway Tools
Information Sources
 Pathway Tools User’s Guide
   aic-export/ecocyc/genopath/released/doc/userguide1.pdf
     Appendix A: Guide to the Pathway Tools Schema
   aic-export/ecocyc/genopath/released/doc/userguide2.pdf
 Pathway Tools Web Site
   http://bioinformatics.ai.sri.com/ptools/
 Pathway Tools Tutorial
   http://bioinformatics.ai.sri.com/ptools/tutorial/
Reporting Problems
 E-mail to [email protected]
 Include:
   Error message
   Result of :zoom :count :all
   What version and platform you are running
   What operation you were performing when the error occurred
