BioCyc - SRI International

The Ocelot Frame Knowledge Representation System
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
[email protected]
SRI International Bioinformatics
Frame Knowledge Representation Systems
• Long history of development in the AI knowledge representation community
• Distant cousin of object-oriented databases (convergent evolution)
• Background reading on frame systems
  - P. Karp, “The design space of frame knowledge representation systems”
    http://www.ai.sri.com/pubs/files/236.pdf
  - P. Karp, “Distinguishing Knowledge Bases and Data Bases: Who's on First and What's on Second”
    http://www.ai.sri.com/pubs/files/1397.pdf
Ocelot Information
• P.D. Karp et al., “A collaborative environment for authoring large knowledge bases,” J Intelligent Information Systems 13:155-94, 1999.
  http://www.ai.sri.com/pkarp/pubs/99jiis.pdf
• “Ocelot User’s Guide”
  http://www.ai.sri.com/pkarp/ocelot/
Ocelot Data Model
• Ocelot database
  - Also known as DB, knowledge base (KB), or PGDB (Pathway/Genome Database)
• An Ocelot database is a collection of frames and slots
Ocelot Frames
• Two kinds of frames:
  - Classes: Genes, Pathways, Biosynthetic Pathways
  - Instances (objects): trpA, TCA cycle
• A symbolic frame name (id, key) uniquely identifies each frame
  - Examples: EG10223, TRP, Proteins
• Classes have superclass(es), subclass(es), and instance(s)
• Instances have one or more parent classes
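The frame model above can be sketched in a few lines of Python. This is only an illustration, not Ocelot's implementation (Ocelot is part of the Lisp-based Pathway Tools). The `Frame` and `KB` classes, the instance names `gene-1` and `polypeptide-1`, the slot values, and the choice to make Polypeptides a subclass of Proteins are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame: a uniquely named class or instance that holds slot values."""
    name: str                                      # symbolic frame name (id, key)
    is_class: bool = False
    parents: list = field(default_factory=list)    # names of parent classes
    slots: dict = field(default_factory=dict)      # slot name -> list of values


class KB:
    """A toy knowledge base: a collection of frames indexed by name."""

    def __init__(self):
        self.frames = {}

    def add(self, frame):
        # Frame names must be unique within a KB.
        if frame.name in self.frames:
            raise ValueError(f"duplicate frame name: {frame.name}")
        self.frames[frame.name] = frame
        return frame

    def all_parents(self, name):
        """Return every class above a frame, following parent links upward."""
        seen = []
        stack = list(self.frames[name].parents)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                stack.extend(self.frames[parent].parents)
        return seen


kb = KB()
kb.add(Frame("Genes", is_class=True))
kb.add(Frame("Proteins", is_class=True))
# Assumed class hierarchy, for illustration only.
kb.add(Frame("Polypeptides", is_class=True, parents=["Proteins"]))
# Hypothetical instances; the slot values are placeholders.
kb.add(Frame("gene-1", parents=["Genes"], slots={"product": ["polypeptide-1"]}))
kb.add(Frame("polypeptide-1", parents=["Polypeptides"],
             slots={"molecular-weight": [28.7]}))
```

Calling `kb.all_parents("polypeptide-1")` walks from the instance up through its parent class to that class's superclasses. The `product` slot of `gene-1` holds the name of another frame, which is how a slot encodes a relationship.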

Slots
• Encode attributes and properties of a frame
  - Molecular weight, gene coordinates, comments
• Represent relationships between frames
  - The value of a slot is the identifier of another frame
Slots
• Number of values
  - Single valued
  - Multivalued: sets or lists
• Slot values
  - Integer, real, string, symbol (frame name)
• Every slot is described by a “slot frame” (slotunit) in a KB that defines meta information about that slot
  - Datatype, classes it pertains to, constraints
  - Enumerations
  - Two slots are inverses if they encode opposite relationships
    - Slot Product in class Genes
    - Slot Gene in class Polypeptides
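As a rough illustration of slot frames and inverse slots, the Python sketch below stores slot metadata in a dictionary and mirrors each update into the declared inverse slot. The `SLOT_UNITS` layout, the `put_value` helper, and the instance names are hypothetical. The slides say that Product and Gene are inverse slots, not how Ocelot keeps them consistent.

```python
# Slot frames (slot units): metadata describing each slot.
# The field names "domain", "inverse", and "multivalued" are assumptions.
SLOT_UNITS = {
    "product": {"domain": "Genes", "inverse": "gene", "multivalued": True},
    "gene": {"domain": "Polypeptides", "inverse": "product", "multivalued": True},
}

frames = {
    "gene-1": {},          # hypothetical instance of Genes
    "polypeptide-1": {},   # hypothetical instance of Polypeptides
}


def put_value(frame, slot, value):
    """Add a value to a slot and mirror it in the inverse slot, if one is declared."""
    values = frames[frame].setdefault(slot, [])
    if value not in values:
        values.append(value)
    inverse = SLOT_UNITS.get(slot, {}).get("inverse")
    if inverse is not None:
        back = frames[value].setdefault(inverse, [])
        if frame not in back:
            back.append(frame)


put_value("gene-1", "product", "polypeptide-1")
```

After the call, `gene-1` lists `polypeptide-1` in its `product` slot and `polypeptide-1` lists `gene-1` in its `gene` slot, so the relationship can be followed from either side.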
Ocelot Schema
• Schema is stored within the DB
• Schema is self documenting
  - Slot frames define metadata about slots
• Schema evolution facilitated by
  - Easy addition/removal of slots, or alteration of slot datatypes
  - Flexible data formats that do not require dumping/reloading of data
  - New versions of Pathway Tools include a schema upgrade function
    - Updates schema to match that of new MetaCyc version
    - Transforms data into new schema
[Figure: multiple users connecting to one MySQL server]
Ocelot Storage Subsystem
• RDBMS KBs (KBs stored in a relational DBMS)
• RDBMS schema is independent of application schema
• DBMS is submerged within Ocelot, invisible to users
• Frames transferred from DBMS to Ocelot
  - On demand
  - By background prefetcher
• Memory cache
• Persistent disk cache improves performance over the Internet
Ocelot Frame Faulting
• When a frame is referenced by Pathway Tools
  - Look in Ocelot virtual memory
  - Look in disk cache
  - Look in RDBMS
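The lookup order above can be sketched as a three-tier cache. This is a Python illustration under stated assumptions, not Ocelot's code. The `FrameStore` class, the use of plain dictionaries for the disk cache and the RDBMS, and the write-back into the disk cache are all hypothetical. The slides give only the order of the three lookups.

```python
class FrameStore:
    """Three-tier lookup: memory first, then disk cache, then the database."""

    def __init__(self, disk_cache, rdbms):
        self.memory = {}
        self.disk_cache = disk_cache   # stand-in for a persistent local cache
        self.rdbms = rdbms             # stand-in for the remote database
        self.log = []                  # records which tier served each request

    def get(self, name):
        if name in self.memory:
            self.log.append((name, "memory"))
            return self.memory[name]
        if name in self.disk_cache:
            self.log.append((name, "disk"))
            frame = self.disk_cache[name]
        else:
            self.log.append((name, "rdbms"))
            frame = self.rdbms[name]        # KeyError if the frame does not exist
            self.disk_cache[name] = frame   # assumed write-back to the disk cache
        self.memory[name] = frame
        return frame


# Frame contents are placeholders.
store = FrameStore(
    disk_cache={"TRP": {"note": "placeholder"}},
    rdbms={"TRP": {"note": "placeholder"}, "EG10223": {"note": "placeholder"}},
)
store.get("EG10223")   # not cached anywhere: fetched from the database
store.get("EG10223")   # now resident in memory
store.get("TRP")       # found in the disk cache
```

The `store.log` list shows which tier answered each request, which makes the fall-through order easy to inspect.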
Ocelot RDBMS Transaction History
• RDBMS KBs store complete transaction history
• Stored as sequences of GFP (Generic Frame Protocol) operations executed by the user or by Pathway Tools
• Right click -> Show -> Changes in pop-up window
• Used to compute gene last-curated date
• Can be used to open a PGDB in an earlier state
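To show why a complete operation log makes it possible to open a KB in an earlier state, here is a minimal Python sketch that replays a prefix of the log. The operation names, the tuple layout, and the `state_after` helper are assumptions for illustration. They are not the actual GFP operation set or Ocelot's storage format.

```python
import copy

# Each transaction is a list of operations. The operation names are illustrative.
LOG = [
    [("create-frame", "P"), ("put-slot-values", "P", "MW", [3])],
    [("put-slot-values", "P", "MW", [4])],
    [("delete-frame", "P")],
]


def apply_op(kb, op):
    """Apply one logged operation to a KB held as {frame: {slot: values}}."""
    kind = op[0]
    if kind == "create-frame":
        kb[op[1]] = {}
    elif kind == "put-slot-values":
        _, frame, slot, values = op
        kb[frame][slot] = list(values)
    elif kind == "delete-frame":
        del kb[op[1]]
    else:
        raise ValueError(f"unknown operation: {kind}")


def state_after(base, log, n_transactions):
    """Rebuild the KB as it stood after the first n_transactions were applied."""
    kb = copy.deepcopy(base)
    for transaction in log[:n_transactions]:
        for op in transaction:
            apply_op(kb, op)
    return kb
```

For example, `state_after({}, LOG, 2)` rebuilds the KB as it was before the frame was deleted in the third transaction.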
Ocelot RDBMS Concurrency Control
• When user A saves updates:
  - Ocelot queries all transactions that occurred since A last saved or since the start of A’s session
  - Ocelot compares the operations in those transactions with the updates made by A
  - If conflicts are found, save does not occur and conflicts are reported to the user
  - If no conflicts, save proceeds
  - Other user transactions are evaluated into A’s session (“Refresh”)
Ocelot Update Conflicts
• Examples of conflicting updates:
  - User A deletes frame F; User B modifies a slot value in frame F
  - User A changes MW of protein P from 3 to 4; User B changes MW of protein P from 3 to 5
• Examples of updates that don’t conflict:
  - User A updates frame E; User B updates frame F
  - User A updates the value of P.MW; User B updates the value of P.pI
  - Users A and B both delete all values of P.MW
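The examples above suggest a simple rule: two updates conflict when they touch the same frame and either one deletes it while the other modifies it, or both set the same slot to different values. The Python sketch below encodes that reading. It is an interpretation of the slide examples, not Ocelot's actual algorithm. The operation tuples and the `conflicts` and `try_save` functions are hypothetical, and treating two deletions of the same frame as compatible is an added assumption.

```python
# Operations are ("delete-frame", frame) or ("put-values", frame, slot, values).

def conflicts(mine, theirs):
    """Return pairs of operations that cannot both be applied consistently."""
    found = []
    for a in mine:
        for b in theirs:
            if a[1] != b[1]:
                continue                  # different frames: no conflict
            a_del = a[0] == "delete-frame"
            b_del = b[0] == "delete-frame"
            if a_del and b_del:
                continue                  # assumption: same outcome, no conflict
            if a_del or b_del:
                found.append((a, b))      # one deletes the frame, the other edits it
            elif a[2] == b[2] and a[3] != b[3]:
                found.append((a, b))      # same slot, different resulting values
    return found


def try_save(my_ops, transactions_since_last_save):
    """Allow the save only if no intervening transaction conflicts with my updates."""
    their_ops = [op for t in transactions_since_last_save for op in t]
    found = conflicts(my_ops, their_ops)
    return len(found) == 0, found
```

`try_save` returns a flag plus the list of conflicting pairs, which mirrors the slide's behavior of blocking the save and reporting the conflicts to the user.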
Revert KB Operation
• Undoes all changes in current session
Pathway Tools / BioCyc Software/Database Bundles
• Each downloadable Pathway Tools configuration contains a combination of PGDBs
• Those PGDBs are loaded into Lisp virtual memory
• Build process:
  - Start Common Lisp
  - Load all Pathway Tools compiled Lisp code into virtual memory
  - Load all PGDBs for that configuration into virtual memory
  - Save virtual memory image as binary executable file
“Full BioCyc” or Tier 1+2+3 Configuration
• 507 PGDBs loaded into virtual memory
BioCyc at 10,000 Genomes
• Scalability of current approach is limited
• New approach: For full BioCyc, store PGDBs not in virtual memory but in Franz AllegroCache
  - AllegroCache is a Common Lisp object-oriented database
  - Implementation now in hand for Ocelot
  - We have done extensive performance testing
  - Performance looks good to 10,000 PGDBs