BioMart - Csc - Tieteen Tietotekniikan Keskus Oy
BioMart
Databases made easy
Richard Holland
European Bioinformatics Institute
Helsinki, September 2006
BioMart
• A joint project
– European Bioinformatics Institute (EBI)
– Cold Spring Harbor Laboratory (CSHL)
• Aim
– To develop a generic, query-oriented data
management system capable of integrating
distributed data sources.
Focus
• ‘Data mining’ or advanced search
– Creating custom datasets
– Querying multiple datasets
– Interactive
• Users
– People who provide database-based services
– ‘Power user’ biologists and bioinformaticians
Requirements
• User
– ‘One-stop shop’ for biological data
– Suitable for power biologists and bioinformaticians
– A set of interfaces that allow users to group and refine
biological data based upon many criteria
• Deployer
– ‘Out of the box’ installation
– Built-in query optimization
– Easy data federation
• Architecture
– Domain agnostic
– Distributed
– Platform independent
Advanced search GUIs
Single interface
Single access point
Queries across different databases
Dataset 1
Links
Dataset 2
Main features
• Domain agnostic
• Platform independent (MySQL, ORACLE,
Postgres)
• Scalable for big datasets
• Federated architecture
• Automated UI configuration
How does it work?
BioMart
[Figure: source data, BioMart software, data mart and XML meta data]
Federated architecture
Query Engine
Data model
[Figure: tables linked by primary keys (PK) and foreign keys (FK)]
Data model - ‘reversed star’
[Figure: main tables carrying keys PK1 and PK2, with dimension (dm) tables referencing them through foreign keys FK1 and FK2]
Data mart and dataset
Dataset
Data mart, dataset and virtual schema
BioMart abstractions
• Dataset
– A subset of data organized into 1 or more tables
• Attribute
– A single data point
– e.g. gene name
• Filter
– An operation on an attribute
– e.g. ‘Chromosome = 1’
Datasets, Attributes and Filters
Mart
Dataset
GENE
gene_id(PK)
gene_stable_id
gene_start
gene_chrom_end
chromosome
gene_display_id
description
Attribute
Filter
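The three abstractions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BioMart's own API: the class names, the `to_sql` helper and the single-table mapping are hypothetical, and the column names are borrowed from the GENE table on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """A single data point, mapped here to one column of one table (hypothetical model)."""
    name: str
    table: str
    column: str


@dataclass(frozen=True)
class Filter:
    """An operation on an attribute, e.g. chromosome = 1."""
    attribute: Attribute
    operator: str
    value: object


def to_sql(attributes, filters):
    """Render a parameterised SELECT over the table of the first attribute."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    table = attributes[0].table
    columns = ", ".join(a.column for a in attributes)
    sql = f"SELECT {columns} FROM {table}"
    if filters:
        conditions = " AND ".join(
            f"{f.attribute.column} {f.operator} ?" for f in filters
        )
        sql += f" WHERE {conditions}"
    return sql, [f.value for f in filters]


# Attributes and a filter modelled on the GENE table columns.
gene_name = Attribute("gene name", "gene", "gene_display_id")
chromosome = Attribute("chromosome", "gene", "chromosome")

sql, params = to_sql([gene_name], [Filter(chromosome, "=", "1")])
print(sql, params)
```

The point of the sketch is that the user only ever names attributes and filters; the mapping to tables and columns stays in the configuration.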
BioMart abstractions (cont)
• Link
– ‘common currency’ between two datasets
– e.g. accession
• Exportable
– Potential links to export
• Importable
– Potential links to import
Exportables, Importables and Links
Dataset 1
Links
Dataset 2
Exportables, Importables and Links
Dataset 1 (Exportable): name = uniprot_id, attributes = uniprot_ac
Links
Dataset 2 (Importable): name = uniprot_id, filters = uniprot_ac
Exportables, Importables and Links
Dataset 1 (Exportable): name = genomic_region, attributes = chr_name, chr_start, chr_end
Links
Dataset 2 (Importable): name = genomic_region, filters = chr_name (=), chr_start (>=), chr_end (<=)
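One way to read the genomic_region example: each row exported from Dataset 1 supplies the values for the matching filters of Dataset 2. The Python sketch below illustrates that idea only. The dictionary layout and the `link` function are assumptions made for illustration, not BioMart's configuration format or API, and the sample row is invented.

```python
def link(exportable, importable, source_rows):
    """Turn rows exported from dataset 1 into filter conditions for dataset 2."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names must match")
    if len(exportable["attributes"]) != len(importable["filters"]):
        raise ValueError("attributes and filters must pair up one to one")
    conditions = []
    for row in source_rows:
        # Pair the i-th exported attribute with the i-th imported filter.
        conditions.append([
            (filter_name, operator, row[attribute])
            for attribute, (filter_name, operator)
            in zip(exportable["attributes"], importable["filters"])
        ])
    return conditions


exportable = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}
importable = {
    "name": "genomic_region",
    "filters": [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")],
}

# Invented sample row standing in for a result from dataset 1.
rows = [{"chr_name": "21", "chr_start": 1000, "chr_end": 2000}]
conditions = link(exportable, importable, rows)
print(conditions)
```

The shared name is the ‘common currency’: two datasets can be linked whenever one exports and the other imports under the same name.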
Creating BioMart databases
Building BioMart databases
[Figure: source databases are turned into a mart by MartBuilder (transformation); MartEditor handles the XML configuration]
Schema transformation principles
• Central table
– Longest n:1, 1:1 path
• Dimension table
– Central transformation ‘around’ 1:n table.
– Link tables are decomposed into a set of 1:n first
MartBuilder Application
• Reads database metadata
• Transforms a source schema into
suggested datasets and lets you edit
the process
• Produces a set of SQL statements (DDL)
to run against the server to perform the
transformation
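To give a feel for the kind of DDL such a tool could emit, here is a minimal Python sketch. It is not MartBuilder's actual output: the `build_ddl` helper, the toy `gene` and `xref` source tables and their rows are all hypothetical, and SQLite stands in for the target server so the statements can be run as written. Only the `__main`, `__dm` and `_key` suffixes follow the naming used in this talk.

```python
import sqlite3


def build_ddl(dataset, source_table, key, columns, kind):
    """Build a CREATE TABLE ... AS SELECT statement for a main or dimension table."""
    if kind not in ("main", "dm"):
        raise ValueError("kind must be 'main' or 'dm'")
    target = f"{dataset}__{source_table}__{kind}"
    select = ", ".join([f"{key} AS {key}_key", *columns])
    return f"CREATE TABLE {target} AS SELECT {select} FROM {source_table}"


# One main table and one dimension table (a 1:n relation on gene_id).
statements = [
    build_ddl("demo", "gene", "gene_id", ["name", "chromosome"], "main"),
    build_ddl("demo", "xref", "gene_id", ["accession"], "dm"),
]

# Toy source schema with placeholder data, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, chromosome TEXT)")
conn.execute("CREATE TABLE xref (xref_id INTEGER PRIMARY KEY, gene_id INTEGER, accession TEXT)")
conn.execute("INSERT INTO gene VALUES (1, 'geneA', '1')")
conn.executemany("INSERT INTO xref VALUES (?, ?, ?)", [(1, 1, "ACC1"), (2, 1, "ACC2")])

# Run the generated DDL, then query the mart tables through the shared key.
for statement in statements:
    conn.execute(statement)

rows = conn.execute(
    "SELECT m.name, d.accession "
    "FROM demo__gene__main m JOIN demo__xref__dm d USING (gene_id_key) "
    "ORDER BY d.accession"
).fetchall()
print(rows)
```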
Dataset Configuration
• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Exportables, Importables
• Semantics
• Relational mapping
• User interface
• Linking datasets
• XML-based
Table naming convention
Naïve configuration
• Tables
– Meta tables: meta_content
– Data tables: dataset__content__type
• Data tables
– Main: __main
– Dimension: __dm
• Columns
– Key: _key
Naming convention examples
• Homo sapiens gene ensembl
– hsapiens_gene_ensembl__gene__main
– hsapiens_gene_ensembl__xref_hugo__dm
• Encode
– hsapiens_encode__encode__main
• Uniprot
– uniprot__protein__main
– uniprot__interpro__dm
• Uniprot sequence
– uniprot_sequence__sequence__main
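Because the convention is purely positional, a table name can be split back into its parts mechanically. The Python sketch below assumes the `dataset__content__type` pattern with a double underscore as separator; the `MartTable` type and `parse_table_name` function are hypothetical helpers, not part of BioMart.

```python
from typing import NamedTuple


class MartTable(NamedTuple):
    dataset: str
    content: str
    kind: str  # "main" or "dm"


def parse_table_name(name):
    """Split a data table name of the form dataset__content__type."""
    parts = name.split("__")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a dataset__content__type name: {name!r}")
    dataset, content, kind = parts
    if kind not in ("main", "dm"):
        raise ValueError(f"unknown table type: {kind!r}")
    return MartTable(dataset, content, kind)


for table in (
    "hsapiens_gene_ensembl__gene__main",
    "hsapiens_gene_ensembl__xref_hugo__dm",
    "uniprot__interpro__dm",
):
    print(parse_table_name(table))
```

Single underscores inside a part (as in `hsapiens_gene_ensembl` or `xref_hugo`) are left alone, which is why the double underscore works as an unambiguous separator.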
Dataset Configuration
[Figure: MartEditor and the dataset configuration XML]
Accessing BioMart databases
BioMart architecture
[Figure: BioMart architecture.
Retrieval: MartShell and MartExplorer (Java), MartView (Perl), all through the BioMart API.
Databases: public data, local or remote (Vega, SNP, MSD, UniProt, Ensembl), and myMart.
Building myMart: myDatabase, schema transformation (MartBuilder), XML configuration (MartEditor).]
MartView (current)
MartView (new 0_5)
MartExplorer
MartShell
Using = dataset
Get = attribute
Where = filter
MartShell (MQL)
● Uses Mart Query Language (MQL) to generate queries:
using <dataset> get <attributes> where <filters>
● Can join datasets together:
using Dataset1 get Attribute1 where Filter1=var1 as q;
using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc
MartShell examples
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb ...
MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only
AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGG
AA ....
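Since MQL statements follow the fixed `using ... get ... where ...` shape, they are easy to assemble from a script before being passed to `martshell.sh -E`. The Python helper below is a hypothetical convenience, not part of MartShell. Joining several attributes with commas is an assumption; the talk only shows single-attribute queries.

```python
def build_mql(dataset, attributes, filters=(), alias=None):
    """Assemble one MQL statement: using <dataset> get <attributes> where <filters>."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    # Assumption: multiple attributes are comma-separated.
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    if alias:
        # "as <alias>" names the result so a later query can refer to it.
        query += f" as {alias}"
    return query + ";"


# Rebuilds the stored MSD query from the MartShell examples.
mql = build_mql(
    "MSD.msd",
    ["pdb_id"],
    ["resolution_less < 1.5", "has_ec_info only"],
    alias="q",
)
print(mql)
```

Writing the generated statements to a file gives a script that can be run and piped exactly like `MQLscript.mql` in the talk.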
biomaRt
Taverna
DAS ProServer
BioMart deployers
• Large scale data federation (EBI)
• Optimising access to a large database
(Ensembl, WormBase)
• Connecting proprietary datasets to
public data (Pasteur, Unilever, Serono,
Sanofi-Aventis, DevGen etc.)
Hinxton example
EBI
SANGER
Uniprot
MSD
Ensembl
SNP
Vega
Sequence
WWW
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database
(Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data
(Pasteur, Unilever, Serono, Sanofi-Aventis,
DevGen etc.)
WormBase
Genes
Expression
Phenotypes
Variations
Literature
Ontologies
Sequence
Ensembl
Genes
Ontologies
Variations
Protein annotation
Disease
Homologies
Sequence
Array annotations
HapMap
Population
Frequencies
Inter-population comparisons
Gene annotation
ArrayExpress
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database
(Ensembl, WormBase)
• Federating third-party data with public
data (Pasteur, INRA, Bayer, Unilever,
Serono, Sanofi-Aventis, DevGen, Solexa
etc.)
In development
• CAPRISA
• RGD
• DICTYBASE
• PURDUE UNIVERSITY
• RZPD
Music Mart
BioMart model
• Already applied
– Ensembl
– Vega
– SNP
– Uniprot
– MSD
– ArrayExpress
– WormBase
– Gramene
– HapMap
– Variety of ‘in house’ projects (academic and industrial)
User restriction
[Figure: martUser with “default” and “advanced” XML configurations of a dataset]
Interface configuration
[Figure: interface XML configurations, “single-page web interface” and “wizard style web interface”, for a dataset]
Web services
[Figure: MartView and MartService exchange XML over port 80; local and remote marts are reached on port 3306, with one direct 3306 connection marked as blocked (X)]
Web services (cont)
MartService requests
• Registry XML
• Dataset information: name, type etc
• DatasetConfig XML
• Mart Query:
– API query object is converted to an XML representation on the
client and sent to the server.
– Query object is regenerated on the server and processed. Results
are sent back to the client as a simple tab-delimited HTML page.
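The round trip described above (query object to XML on the client, XML back to a query object on the server, tab-delimited rows back to the client) can be sketched with the Python standard library. The element and attribute names used here (`Query`, `Dataset`, `Attribute`, `Filter`, `name`, `value`) and all three helper functions are assumptions for illustration; the talk does not show the actual MartService wire format.

```python
import xml.etree.ElementTree as ET


def query_to_xml(dataset, attributes, filters):
    """Client side: serialise a query description to XML (hypothetical layout)."""
    query = ET.Element("Query")
    ds = ET.SubElement(query, "Dataset", name=dataset)
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=str(value))
    return ET.tostring(query, encoding="unicode")


def xml_to_query(xml_text):
    """Server side: regenerate the query description from the XML."""
    ds = ET.fromstring(xml_text).find("Dataset")
    attributes = [a.get("name") for a in ds.findall("Attribute")]
    filters = {f.get("name"): f.get("value") for f in ds.findall("Filter")}
    return ds.get("name"), attributes, filters


def parse_results(text):
    """Client side: split a tab-delimited result body into rows of fields."""
    return [line.split("\t") for line in text.splitlines() if line]


xml_text = query_to_xml(
    "hsapiens_gene_ensembl", ["gene_stable_id"], {"chromosome": "1"}
)
print(xml_text)
print(xml_to_query(xml_text))
```

Sending a plain XML document over HTTP means the client needs no direct database connection to the remote mart.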
Summary
• A generic data management system
– A set of easily configurable user interfaces
– Distributed data federation
– Query optimization
BioMart
• www.biomart.org
• Open source (LGPL)
• Public MySQL server
• ftp
• [email protected]
• [email protected]
Acknowledgments
• BioMart
– Arek Kasprzyk (EBI)
– Damian Smedley (EBI)
– Syed Haider (EBI)
– Gudmundur Thorisson (CSHL)
• Contributors
– Darin London (EBI)
– Will Spooner (CSHL)
– Damian Keefe (Ensembl)
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (Uniprot)
– Paul Donlon (Unilever)
– Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
– Benoit Ballester (Universite de la Mediterranee)
– Stephen Robinson (EBI)
– Asif Kibria (EBI)