Database Engines for Biosciences

Download Report

Transcript Database Engines for Biosciences

Database Engines for Bioscience
John Corwin
Avi Silberschatz
Swathi Yadlapalli
Yale University
Database Engines for Bioscience


The Yale Center for Medical Informatics, led
by Perry Miller, makes extensive use of
biomedical and bioscience databases.
Large database projects include
–
–
Trial/DB
SenseLab
Trial/DB

Trial/DB is an open-source clinical study data
management system
–
–
–
–
Manages patient data across multiple studies.
No limit on the number of patients per study.
No limit on the number of parameters that are
tracked in each study.
Currently used for a large number of clinical
studies at Yale and other universities.
SenseLab

Stores integrated, multidisciplinary models of
neurons and neural systems.
–
–
–
Cell properties (receptors, currents, and
transmitters)
Neurons
Networks of neurons (computational neuron
models)
Database Engines for Bioscience

What’s different about these databases?
–
–
–
–
Sparse data
Frequent schema changes
Row-level security policies
Lots of metadata
EAV

Entity-Attribute-Value (EAV)
–

Also known as Object-Attribute-Value
Data is stored in 3-column tables
–
–
–
Column 1 stores the object name
Column 2 stores the attribute name
Column 3 stores the attribute value
EAV – Example

Suppose we’re building a database of fruit
types
Entity
Apple
Apple
Banana
Attribute
Color
AvgWeight
AvgWeight
Value
Red
8.2
11.5
Banana
Apple
Color
Potassium
Yellow
159
EAV/CR

EAV/CR: extends EAV with classes and
relationships
–
–
Classes – the value field in each EAV triple may
contain complex, structured data
Relationships – values may refer to other objects
in the database, thus relationships in the
database are stored explicitly
Drawbacks of EAV and EAV/CR




Since all data is conceptually in a single table, the
only operations we can perform are filters and selfjoins
Support for regular queries must be re-implemented
on top of the EAV database
The performance of attribute-centered queries is
significantly worse than a conventional database
High storage overhead for dense data
Our work

Extend a conventional database engine to
better support the requirements of bioscience
databases.
–
–
–
Take advantage of existing tools
Better performance
Better space-efficiency
Schema modification



Use 1-column tables
Each attribute is stored in an individual table
The original table is formed by joining the 1-column
tables by tuple index
Fruit
Color
AvgWeight
Potassium
Apple
Red
8.2
159
Banana
Grape
Yellow
Green
11.5
0.2
400
3
Schema modification

Adding and Removing attributes
–
One-column tables can be added and removed without
affecting the existing tables
Fruit
Fruit
Fruit
Apple
Apple
Apple
Banana
Banana
Banana
Grape
Grape
Grape
Color
AvgWeight
Color AvgWeight
AvgWeightSodium Sodium
Potassium
Color
Potassium
Red
RedRed
8.2
8.2
Yellow 11.5 11.5
11.5
Yellow
Yellow
Green 0.2 0.2
0.2
Green
Green
8.2
15
3
2
15
159159
3400400
23 3
Schema modification

Query
–
Original table is formed by joining each onecolumn table by tuple index
Schema modification


To test this new storage mechanism, we
implemented it in PostgreSQL
PostgreSQL is an open-source relational database
engine
–
–
–
Based on the original Postgres engine developed at
Berkeley
Supports modern database features such as complex
queries, transactional integrity, and extensible functions and
data types
500,000 lines of C code
Dynamic Tables: Interface


Create an extension of the SQL syntax
All other table operations work normally on dynamic
tables – the implementation of dynamic tables is
transparent to the database user
CREATE DYNAMIC TABLE fruit_table (
fruit
string,
avgWeight
float,
potassium
int);
Dynamic Tables: Implementation

Original statement:
CREATE DYNAMIC TABLE t(c1 t1, c2 t2, ..., cn tn);

Create individual tables, for 1 ≤ i ≤ n:
CREATE TABLE dyn_t_i(ci ti);

Expose original table as a view:
CREATE VIEW t AS (SELECT c1, c2, ..., cn FROM
dyn_t_1, dyn_t_2, ..., dyn_t_n WHERE (c1.ctid =
c2.ctid) AND ... AND (cn-1.ctid = cn.ctid));
Dynamic Tables: Implementation


Override implementation of insert, update,
and delete
Insert: Given the query
INSERT INTO t VALUES (v1, ..., vn);
Translates to
INSERT INTO dyn_t_i VALUES (v_i);
for 1 ≤ i ≤ n
Dynamic Tables: Implementation


Update and Delete: take advantage of
PostgreSQL’s rule system to translate
operations to individual tables
Must maintain the invariant that each table
has the same number of entries
Sparse Data


Make the column index explicit
Change our one-column tables to twocolumn tables
–
–


Column 1: row-index of this tuple
Column 2: data value
Table is exposed by joining sub-tables by
explicit row index
Dense and sparse attributes can be mixed
within the same relation
Database Engines for Bioscience

Future work:
–
–
–
Row-level security
Metadata
Implement SenseLab database and compare
performance
Database Engines for Bioscience

Questions?