Macromolecular Structure Database group

Transcript Macromolecular Structure Database group

EMBL-EBI
Database
Replication - Distribution
EMBL-EBI
Relational public databases
 EBI’s mission to provide freely accessible information
on the public domain
 Data formats and technologies, should not contradict
to this policy
Adopt widely accepted, successful standards that are well known
and used
Free access not only in the information content, but in the supporting
technologies
Reasonable investment in resources and expertise by users so that
the data is accessible to a wider audience
But without a severe restriction to the benefits to the users
A trade-off situation, different users, different needs
 Relational databases are an industry standard
Vendors have different implementations but there are underlying
formal standards
ANSI-SQL for query expression
ODBC, JDBC for API’s
EMBL-EBI
RDB’s versus flat files
 Relational databases are flexible, powerful and
consistent
 They are a lot more complex
 They impose data organisation that can’t be easily
vertically partitioned
 Organising and inter-exchanging data on a per-entry basis does not
come by default
 Physical implementations are not standard
 Remember the days (or imagine) flat files without a common
character encoding standard (without ASCII around)
 Vendors support migration of other databases to their
own but not the other way-round
 There is not a common vendor-independent exchange or dump
format
 This is not trivial due to differences in implementation details and
extensions on the standards
EMBL-EBI
Why Replicate?

To take advantage of local hardware and CPU time –
some operations are simply not possible on-line
 To avoid continuous dependency on network and EBI
resources
 To extend or merge information with other databases or
data sources
 To utilise the information in new innovative ways
 To ensure confidentiality of research
EMBL-EBI
MSD replication options
 We offer MSDSD in Oracle
With indexes pre-built
Implementation uses Oracle import-export
With frequent (weekly) incrementals so that new entries are
becoming available soon
Users need to have Oracle licence
We have more experience and offer better support
 Or in mySQL
In compressed myIsam format without indexes
We give directly the mySQL data-files (they are platform and version
independent)
We don’t offer weekly increments but new full releases every few
months
We recommend the Oracle distribution for advanced users
But mySQL is great if they can’t afford Oracle
Or want to evaluate the MSDSD database
EMBL-EBI
Replication Components
 Database copy on Sun Solaris
 Schema export-import plus sql-loader files for creating
the database initially for Oracle on other platforms
 Possibility to Import to Non Oracle databases (MySQL)
 Periodic synchronisation with the MSD master database
using periodic incremental scripts for all Oracle
platforms
 Use of two schemas, main search database and
incremental
EMBL-EBI
Incremental Data Export – Import
 Why Incremental Updates
 Implemented in server side JavaScript
 Data is exported as Oracle Export files organised in marts
 Data files on the FTP server
 Aim for weekly updates
 Mechanism flexible enough to adapt on different data mart
Combinations
 Prerequisites: Rhino, Java, Oracle-JDBC driver, oracle-exportimport
 The user has just to download and run the periodic incremental
import script of a data mart for his database
 Database version, Data version, Data mart maintenance is
controlled via the administration tables through synchronisation
EMBL-EBI
Incremental Replication Mechanism
DATA MARTS
DATA MARTS
Increment log
Admin Tables
Admin Tables
JDBC
JDBC
crontab
crontab
PERIODIC
EXPORT SCRIPT
Oracle
Dump
Files
MSD Search Database
PERIODIC
IMPORT SCRIPT
Web-FTP
Service
Target Database
EMBL-EBI
Replication overview
Oracle Dictionary
JDBC metadata
Schema
Export
Schema creation
SQL scripts
Oracle
postgreSQL
mySQL
MSD in
Oracle
Import Export
Configuration
Structure
Source
database
Target
database
INSERT
statements
MSD in
mySQL
SELECT
statements
Data
Import
Data
Export
Java serialised
data files
EMBL-EBI
JDBC and Java
 Java is one of the best environments regarding
portability
 Java compiled machine code works directly on all platforms
 Java serialisation is machine independent
 JDBC standard is well defined and detailed
 Maps database types to Java object types
 Not all implementations are full in all details
 JDBC offers metadata services
 Easy to get information about schemas, tables and columns through
JDBC
 Java offers data compression
 Implementing a database vendor independent exportimport is trivial
 Could not find one available so developed a simple
and flexible mechanism at MSD
EMBL-EBI
MSD cross-replication
 Inputs JDBC metadata and Oracle dictionary
 Exports schema creation scripts into SQL files
 Gathers information from JDBC metadata and oracle dictionary
 Takes care of type implementation details of the various databases
(maximum size of varchar etc)
 Works with standard ANSI-SQL types only (not object-types, nested
tables, blobs etc)
 Exports configuration files
 Table, column names of target database can be different
 Can export subsets of the data
 Exports the data in compressed java serialised arrays
 In data files or directly piped into the Import
mechanism
EMBL-EBI
Cross-replication details
 Potentially for any relational database with ANSI-SQL
support
 Has been tested for PostgreSQL, MS-Access, Mckoi (java RDB)
 Flexible configuration
 Target tables can be different different
 The SELECT and INSERT statements are kept in configuration files
 This is how merged (partitioned) tables where built
 Includes support for incrementals
 This option is still not used in production
 The information in the data files can be examined offline
 Foreign keys have to be disabled during the load
EMBL-EBI
Oracle versus mySQL
 mySQL has several underlying database engines
 InnoDB
Transactions & referential integrity
Not best performance, inefficient disk space usage
 myIsam
 Good performance but not foreign keys
 myIsam compressed
 Efficient I/O, good use of disk space but read-only
 Can’t build indexes without uncompressing
 Support for VLDB’s
 Merged tables are similar to Oracle partitioning but implemented by
the user
 Harder to simulate hash partitioning, range partitioning by default
 Problems of using the indexes of the merged tables
 Query optimiser of mySQL
 Compared with Oracle seems primitive
EMBL-EBI
MSD mySQL experience
 We used myIsam compressed tables without any
indexes
The configuration that required the less disk space
Faster to download
Once the data are local users can uncompress the data and build the
recommended or any other indexes locally
 We used merged tables
To also avoid data files larger than 8GB
And for performance reasons
 Character-sets - collation
 Textual data in mySQL are by default case insensitive
 Only some character collations allow a similar behaviour with Oracle
 Other details
 Table names are by default case sensitive (problem with windowsunix file systems)
Choosing the appropriate numeric type (Integer versus Numeric)
EMBL-EBI
Summary
 MSD Search Database
 Database Replication
 Why Replicate
 Replication Overview
 Components of the Replication
 Incremental Data Export – Import
 Incremental Replication
Mechanism

Macromolecular Structure Database group

Transcript Macromolecular Structure Database group

Directory