Macromolecular Structure Database group
Download
Report
Transcript Macromolecular Structure Database group
EMBL-EBI
Database
Replication - Distribution
EMBL-EBI
Relational public databases
EBI’s mission to provide freely accessible information
on the public domain
Data formats and technologies, should not contradict
to this policy
Adopt widely accepted, successful standards that are well known
and used
Free access not only in the information content, but in the supporting
technologies
Reasonable investment in resources and expertise by users so that
the data is accessible to a wider audience
But without a severe restriction to the benefits to the users
A trade-off situation, different users, different needs
Relational databases are an industry standard
Vendors have different implementations but there are underlying
formal standards
ANSI-SQL for query expression
ODBC, JDBC for API’s
EMBL-EBI
RDB’s versus flat files
Relational databases are flexible, powerful and
consistent
They are a lot more complex
They impose data organisation that can’t be easily
vertically partitioned
Organising and inter-exchanging data on a per-entry basis does not
come by default
Physical implementations are not standard
Remember the days (or imagine) flat files without a common
character encoding standard (without ASCII around)
Vendors support migration of other databases to their
own but not the other way-round
There is not a common vendor-independent exchange or dump
format
This is not trivial due to differences in implementation details and
extensions on the standards
EMBL-EBI
Why Replicate?
To take advantage of local hardware and CPU time –
some operations are simply not possible on-line
To avoid continuous dependency on network and EBI
resources
To extend or merge information with other databases or
data sources
To utilise the information in new innovative ways
To ensure confidentiality of research
EMBL-EBI
MSD replication options
We offer MSDSD in Oracle
With indexes pre-built
Implementation uses Oracle import-export
With frequent (weekly) incrementals so that new entries are
becoming available soon
Users need to have Oracle licence
We have more experience and offer better support
Or in mySQL
In compressed myIsam format without indexes
We give directly the mySQL data-files (they are platform and version
independent)
We don’t offer weekly increments but new full releases every few
months
We recommend the Oracle distribution for advanced users
But mySQL is great if they can’t afford Oracle
Or want to evaluate the MSDSD database
EMBL-EBI
Replication Components
Database copy on Sun Solaris
Schema export-import plus sql-loader files for creating
the database initially for Oracle on other platforms
Possibility to Import to Non Oracle databases (MySQL)
Periodic synchronisation with the MSD master database
using periodic incremental scripts for all Oracle
platforms
Use of two schemas, main search database and
incremental
EMBL-EBI
Incremental Data Export – Import
Why Incremental Updates
Implemented in server side JavaScript
Data is exported as Oracle Export files organised in marts
Data files on the FTP server
Aim for weekly updates
Mechanism flexible enough to adapt on different data mart
Combinations
Prerequisites: Rhino, Java, Oracle-JDBC driver, oracle-exportimport
The user has just to download and run the periodic incremental
import script of a data mart for his database
Database version, Data version, Data mart maintenance is
controlled via the administration tables through synchronisation
EMBL-EBI
Incremental Replication Mechanism
DATA MARTS
DATA MARTS
Increment log
Admin Tables
Admin Tables
JDBC
JDBC
crontab
crontab
PERIODIC
EXPORT SCRIPT
Oracle
Dump
Files
MSD Search Database
PERIODIC
IMPORT SCRIPT
Web-FTP
Service
Target Database
EMBL-EBI
Replication overview
Oracle Dictionary
JDBC metadata
Schema
Export
Schema creation
SQL scripts
Oracle
postgreSQL
mySQL
MSD in
Oracle
Import Export
Configuration
Structure
Source
database
Target
database
INSERT
statements
MSD in
mySQL
SELECT
statements
Data
Import
Data
Export
Java serialised
data files
EMBL-EBI
JDBC and Java
Java is one of the best environments regarding
portability
Java compiled machine code works directly on all platforms
Java serialisation is machine independent
JDBC standard is well defined and detailed
Maps database types to Java object types
Not all implementations are full in all details
JDBC offers metadata services
Easy to get information about schemas, tables and columns through
JDBC
Java offers data compression
Implementing a database vendor independent exportimport is trivial
Could not find one available so developed a simple
and flexible mechanism at MSD
EMBL-EBI
MSD cross-replication
Inputs JDBC metadata and Oracle dictionary
Exports schema creation scripts into SQL files
Gathers information from JDBC metadata and oracle dictionary
Takes care of type implementation details of the various databases
(maximum size of varchar etc)
Works with standard ANSI-SQL types only (not object-types, nested
tables, blobs etc)
Exports configuration files
Table, column names of target database can be different
Can export subsets of the data
Exports the data in compressed java serialised arrays
In data files or directly piped into the Import
mechanism
EMBL-EBI
Cross-replication details
Potentially for any relational database with ANSI-SQL
support
Has been tested for PostgreSQL, MS-Access, Mckoi (java RDB)
Flexible configuration
Target tables can be different different
The SELECT and INSERT statements are kept in configuration files
This is how merged (partitioned) tables where built
Includes support for incrementals
This option is still not used in production
The information in the data files can be examined offline
Foreign keys have to be disabled during the load
EMBL-EBI
Oracle versus mySQL
mySQL has several underlying database engines
InnoDB
Transactions & referential integrity
Not best performance, inefficient disk space usage
myIsam
Good performance but not foreign keys
myIsam compressed
Efficient I/O, good use of disk space but read-only
Can’t build indexes without uncompressing
Support for VLDB’s
Merged tables are similar to Oracle partitioning but implemented by
the user
Harder to simulate hash partitioning, range partitioning by default
Problems of using the indexes of the merged tables
Query optimiser of mySQL
Compared with Oracle seems primitive
EMBL-EBI
MSD mySQL experience
We used myIsam compressed tables without any
indexes
The configuration that required the less disk space
Faster to download
Once the data are local users can uncompress the data and build the
recommended or any other indexes locally
We used merged tables
To also avoid data files larger than 8GB
And for performance reasons
Character-sets - collation
Textual data in mySQL are by default case insensitive
Only some character collations allow a similar behaviour with Oracle
Other details
Table names are by default case sensitive (problem with windowsunix file systems)
Choosing the appropriate numeric type (Integer versus Numeric)
EMBL-EBI
Summary
MSD Search Database
Database Replication
Why Replicate
Replication Overview
Components of the Replication
Incremental Data Export – Import
Incremental Replication
Mechanism