BioCyc - SRI International
The Ocelot Frame Knowledge Representation System
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
SRI International Bioinformatics
Frame Knowledge Representation Systems
Long history of development in the AI knowledge representation community
Distant cousin of object-oriented databases (convergent evolution)
Background reading on frame systems
P. Karp, “The design space of frame knowledge representation systems”
P. Karp, “Distinguishing Knowledge Bases and Data Bases: Who's on First and What's on Second”
http://www.ai.sri.com/pubs/files/236.pdf
http://www.ai.sri.com/pubs/files/1397.pdf
Ocelot Information
P.D. Karp et al., “A collaborative environment for authoring large knowledge bases,” J Intelligent Information Systems 13:155-94, 1999.
http://www.ai.sri.com/pkarp/pubs/99jiis.pdf
“Ocelot User’s Guide”
http://www.ai.sri.com/pkarp/ocelot/
Ocelot Data Model
Ocelot database
Aka DB, Knowledge Base, KB, PGDB
An Ocelot database is a collection of frames and slots
Ocelot Frames
Two kinds of frames:
Classes: Genes, Pathways, Biosynthetic Pathways
Instances (objects): trpA, TCA cycle
A symbolic frame name (id, key) uniquely identifies each frame
Examples: EG10223, TRP, Proteins
Classes have Superclass(es), Subclass(es), Instance(s)
Instances have one or more parent classes
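The frame model above can be illustrated with a minimal Python sketch. It is not Ocelot's API (Ocelot is implemented in Common Lisp). The `Frame` and `KB` names, their methods, and the `demo-pathway` instance are hypothetical, chosen only to show unique frame names, class/instance frames, and parent links.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame: either a class or an instance, keyed by a unique symbolic name."""
    name: str
    is_class: bool
    parents: set = field(default_factory=set)  # superclasses or parent classes
    slots: dict = field(default_factory=dict)  # slot name -> list of values


class KB:
    """A knowledge base: a collection of frames indexed by frame name."""

    def __init__(self):
        self.frames = {}

    def add(self, name, is_class, parents=()):
        if name in self.frames:
            raise ValueError(f"frame name {name!r} is already in use")
        for p in parents:
            if p not in self.frames or not self.frames[p].is_class:
                raise ValueError(f"parent {p!r} must be an existing class")
        if not is_class and not parents:
            raise ValueError("an instance needs at least one parent class")
        self.frames[name] = Frame(name, is_class, set(parents))
        return self.frames[name]

    def all_superclasses(self, name):
        """Every class reachable by following parent links upward."""
        seen, stack = set(), list(self.frames[name].parents)
        while stack:
            cls = stack.pop()
            if cls not in seen:
                seen.add(cls)
                stack.extend(self.frames[cls].parents)
        return seen

    def instances_of(self, cls):
        """Direct and indirect instances of a class, sorted by name."""
        return sorted(f.name for f in self.frames.values()
                      if not f.is_class and cls in self.all_superclasses(f.name))


kb = KB()
kb.add("Genes", is_class=True)
kb.add("Pathways", is_class=True)
kb.add("Biosynthetic-Pathways", is_class=True, parents=["Pathways"])
kb.add("trpA", is_class=False, parents=["Genes"])
kb.add("TCA-cycle", is_class=False, parents=["Pathways"])
kb.add("demo-pathway", is_class=False, parents=["Biosynthetic-Pathways"])  # hypothetical
print(kb.instances_of("Pathways"))
```

Because `instances_of` follows superclass links, an instance of Biosynthetic-Pathways is also reported as an instance of Pathways.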
Slots
Encode attributes and properties of a frame
Molecular weight, gene coordinates, comments
Represent relationships between frames
The value of a slot is the identifier of another frame
Slots
Number of values
Single valued
Multivalued: sets or lists
Slot values
Integer, real, string, symbol (frame name)
Every slot is described by a “slot frame” (slotunit) in a KB that defines meta information about that slot
Datatype, classes it pertains to, constraints
Enumerations
Two slots are inverses if they encode opposite relationships
Example: slot Product in class Genes and slot Gene in class Polypeptides
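A small Python sketch of how slot frames and inverse slots might behave. The metadata field names (`domain`, `datatype`, `inverse`, `single_valued`), the `put_value` helper, the frame name `trpA-polypeptide`, and the weight value are all assumptions made for illustration, not Ocelot's actual representation.

```python
from collections import defaultdict

# Slot frames ("slotunits"): metadata describing each slot.
# "domain" records the class a slot pertains to; this sketch does not enforce it.
SLOT_FRAMES = {
    "Product": {"domain": "Genes", "datatype": "frame", "inverse": "Gene"},
    "Gene": {"domain": "Polypeptides", "datatype": "frame", "inverse": "Product"},
    "Molecular-Weight": {"domain": "Polypeptides", "datatype": float,
                         "single_valued": True},
}

# frame name -> slot name -> list of values
values = defaultdict(lambda: defaultdict(list))


def put_value(frame, slot, value):
    """Add a value to a slot, checking its datatype and maintaining the inverse slot."""
    meta = SLOT_FRAMES[slot]
    datatype = meta["datatype"]
    if datatype != "frame" and not isinstance(value, datatype):
        raise TypeError(f"{slot} expects {datatype.__name__}, got {value!r}")
    if meta.get("single_valued"):
        values[frame][slot] = [value]          # replace the single value
    elif value not in values[frame][slot]:
        values[frame][slot].append(value)      # multivalued: behave like a set
    inverse = meta.get("inverse")
    if inverse and frame not in values[value][inverse]:
        values[value][inverse].append(frame)   # keep the opposite link in sync


put_value("trpA", "Product", "trpA-polypeptide")
put_value("trpA-polypeptide", "Molecular-Weight", 10.0)  # made-up number
print(values["trpA-polypeptide"]["Gene"])
```

Storing the inverse relationship in the slot frame means a single update to Product keeps Gene consistent without any caller-side bookkeeping.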
Ocelot Schema
Schema is stored within the DB
Schema is self documenting
Slot frames define metadata about slots
Schema evolution facilitated by
Easy addition/removal of slots, or alteration of slot datatypes
Flexible data formats that do not require dumping/reloading of data
New versions of Pathway Tools include a schema upgrade function
Updates schema to match that of new MetaCyc version
Transforms data into new schema
Figure showing multiple users tapping into one MySQL server
Ocelot Storage Subsystem
RDBMS KBs
RDBMS schema is independent of application schema
DBMS is submerged within Ocelot, invisible to users
Frames transferred from DBMS to Ocelot
On demand
By background prefetcher
Memory cache
Persistent disk cache speeds performance over the Internet
Ocelot Frame Faulting
When a frame is referenced by Pathway Tools
Look in Ocelot virtual memory
Look in disk cache
Look in RDBMS
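The lookup order above can be sketched in Python as a three-tier read path. The `FrameStore` class is hypothetical, plain dictionaries stand in for the disk cache and the RDBMS, and populating the caches on a miss is an assumption of this sketch rather than something the slides state.

```python
class FrameStore:
    """Three-tier lookup: memory, then disk cache, then RDBMS (all stand-ins)."""

    def __init__(self, disk_cache, rdbms):
        self.memory = {}              # frames already resident in memory
        self.disk_cache = disk_cache  # dict standing in for the persistent disk cache
        self.rdbms = rdbms            # dict standing in for the relational database
        self.log = []                 # records which tier served each request

    def get_frame(self, name):
        if name in self.memory:
            self.log.append((name, "memory"))
            return self.memory[name]
        if name in self.disk_cache:
            self.log.append((name, "disk"))
            frame = self.disk_cache[name]
        else:
            self.log.append((name, "rdbms"))
            frame = self.rdbms[name]       # raises KeyError if the frame is unknown
            self.disk_cache[name] = frame  # assumption: fill the disk cache on a fault
        self.memory[name] = frame          # assumption: keep the frame resident
        return frame


store = FrameStore(
    disk_cache={"trpA": {"name": "trpA"}},
    rdbms={"trpA": {"name": "trpA"}, "trpB": {"name": "trpB"}},
)
store.get_frame("trpB")  # faults all the way to the RDBMS
store.get_frame("trpB")  # now served from memory
store.get_frame("trpA")  # served from the disk cache
print(store.log)
```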
Ocelot RDBMS Transaction History
RDBMS KBs store complete transaction history
Stored as sequences of GFP operations executed by the user or by Pathway Tools
Right click -> Show -> Changes in pop-up window
Used to compute gene last-curated date
Can be used to open a PGDB in an earlier state
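A rough Python sketch of how a log of operations can yield both a last-curated date and an earlier KB state. The tuple layout, the user names, the dates, and the simplified operation names (loosely modeled on GFP-style names such as create-frame and put-slot-value) are illustrative assumptions, not the actual stored format.

```python
from datetime import datetime

# Each transaction: (timestamp, user, list of operations), in chronological order.
HISTORY = [
    (datetime(2020, 1, 5), "alice", [("create-frame", "trpA"),
                                     ("put-slot-value", "trpA", "Comment", "first draft")]),
    (datetime(2020, 3, 9), "bob", [("create-frame", "trpB")]),
    (datetime(2020, 6, 1), "alice", [("put-slot-value", "trpA", "Comment", "revised")]),
]


def replay(history, as_of):
    """Rebuild KB state by re-executing all operations logged up to `as_of`."""
    kb = {}
    for when, _user, ops in history:
        if when > as_of:
            break  # relies on the history being in chronological order
        for op in ops:
            if op[0] == "create-frame":
                kb[op[1]] = {}
            elif op[0] == "delete-frame":
                kb.pop(op[1], None)
            elif op[0] == "put-slot-value":
                _, frame, slot, value = op
                kb[frame][slot] = value
    return kb


def last_curated(history, frame):
    """Timestamp of the most recent transaction that touched `frame`, or None."""
    stamps = [when for when, _user, ops in history
              if any(op[1] == frame for op in ops)]
    return max(stamps, default=None)


print(replay(HISTORY, datetime(2020, 4, 1)))
print(last_curated(HISTORY, "trpA"))
```

Replaying a prefix of the log is what makes opening a KB in an earlier state possible without storing separate snapshots.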
Ocelot RDBMS Concurrency Control
When user A saves updates:
Ocelot queries all transactions that occurred since A last saved or since the start of A’s session
Ocelot compares the operations in those transactions with the updates made by A
If conflicts are found, save does not occur and conflicts are reported to the user
If no conflicts, save proceeds
Other user transactions are evaluated into A’s session (“Refresh”)
Ocelot Update Conflicts
Examples of conflicting updates:
User A deletes frame F ; User B modifies a value in a slot of F
User A changes MW of protein P from 3 to 4 ; User B changes MW of protein P from 3 to 5
Examples of updates that don’t conflict:
User A updates frame E ; User B updates frame F
User A updates the value of P.MW ; User B updates the value of P.pI
Users A and B both delete all values of P.MW
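The save-time check and the examples above can be approximated in Python as follows. This is a sketch under assumptions: the operation tuples, the `conflicts` rules, and the `try_save` helper are hypothetical and only encode the cases listed on the slide, not Ocelot's full conflict logic.

```python
def conflicts(op_a, op_b):
    """True if two operations from different users cannot both be applied."""
    if op_a == op_b:
        return False  # identical updates, e.g. both delete all values of P.MW
    kind_a, frame_a = op_a[0], op_a[1]
    kind_b, frame_b = op_b[0], op_b[1]
    if frame_a != frame_b:
        return False  # updates to different frames never collide here
    if "delete-frame" in (kind_a, kind_b):
        return True   # one user deleted a frame the other modified
    return op_a[2] == op_b[2]  # same slot of the same frame, different outcome


def try_save(my_ops, committed_since_last_save):
    """Return (saved, conflict_pairs): the save proceeds only if no pair conflicts."""
    found = [(a, b)
             for a in my_ops
             for txn in committed_since_last_save
             for b in txn
             if conflicts(a, b)]
    return (not found, found)


# Operations are (kind, frame[, slot, values]) tuples; values are made up.
a_sets_mw = ("put-values", "P", "MW", (4,))
b_sets_mw = ("put-values", "P", "MW", (5,))
b_sets_pi = ("put-values", "P", "pI", (6,))

print(try_save([a_sets_mw], [[b_sets_mw]]))  # conflict: save is refused
print(try_save([a_sets_mw], [[b_sets_pi]]))  # no conflict: save proceeds
```

Comparing at the granularity of frame and slot, rather than whole frames, is what lets two users edit different slots of the same protein without blocking each other.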
Revert KB Operation
Undoes all changes in current session
Pathway Tools / BioCyc Software/Database Bundles
Each downloadable Pathway Tools configuration contains a combination of PGDBs
Those PGDBs are loaded into Lisp virtual memory
Build process:
Start Common Lisp
Load all Pathway Tools compiled Lisp code into virtual memory
Load in all PGDBs for that configuration into virtual memory
Save virtual memory image as binary executable file
“Full BioCyc” or Tier 1+2+3 Configuration
507 PGDBs loaded into virtual memory
BioCyc at 10,000 Genomes
Scalability of current approach is limited
New approach: For full BioCyc, store PGDBs not in virtual memory but in Franz AllegroCache
AllegroCache is a Common Lisp object-oriented database
Implementation now in hand for Ocelot
We have done extensive performance testing
Performance looks good to 10,000 PGDBs