BioMart
Federated Database Architecture
Arek Kasprzyk
EBI
9 June 2005
BioMart
• A joint project
– European Bioinformatics Institute (EBI)
– Cold Spring Harbor Laboratory (CSHL)
• Aim
– To develop a simple and scalable data management
system capable of integrating distributed data
sources.
Challenges
• Data sources
– Large
– Distributed
– Different data
Requirements
• User
– All data accessible through a single set of interfaces
– Suitable for power biologists and bioinformaticians
• Deployer
– ‘Out of the box’ installation
– Built in query optimization
– Easy data federation
• Architecture
– Distributed
– Domain agnostic
– Platform independent
Federated architecture
[Figure: diagram with labels User interfaces, BioMart Query Engine, Data mart, Data sources]
Data mart and dataset
[Figure: a Data mart containing a Dataset]
Data mart, dataset and schema
[Figure: labels Schema, Dataset Configuration, XML]
BioMart abstractions
• Dataset
– A subset of data organized into 1 or more tables
• Attribute
– A single data point
– e.g. gene name
• Filter
– An operation on an attribute
– e.g. ‘Chromosome = 1’
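The three abstractions can be illustrated with a minimal Python sketch. This is not BioMart code: the `Filter` class, the `query` function and the gene rows are hypothetical, and only the idea (a dataset is a set of rows, an attribute is a single data point, a filter is an operation on an attribute) comes from the slide.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Filter:
    """An operation on an attribute, e.g. chromosome = '1' (hypothetical class)."""
    attribute: str
    test: Callable[[Any], bool]


def query(dataset, attributes, filters):
    """Return the requested attributes of every row that passes all filters."""
    return [
        {name: row[name] for name in attributes}
        for row in dataset
        if all(f.test(row[f.attribute]) for f in filters)
    ]


# Made-up rows standing in for a gene dataset.
GENES = [
    {"gene_name": "geneA", "chromosome": "1", "description": "example kinase"},
    {"gene_name": "geneB", "chromosome": "2", "description": "example receptor"},
    {"gene_name": "geneC", "chromosome": "1", "description": "example ligase"},
]

# The slide's filter example: 'Chromosome = 1'.
on_chromosome_1 = Filter("chromosome", lambda value: value == "1")
result = query(GENES, ["gene_name"], [on_chromosome_1])
print(result)
```

Here the attribute list selects columns and the filter list restricts rows, which is the same split the later MQL keywords `get` and `where` express.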
Datasets, Attributes and Filters
[Figure: a Mart containing a Dataset, shown as table GENE with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description; labels Attribute and Filter point at the table]
Examples
• Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder
• Name, chromosome position, description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes
Data model
[Figures: three schema diagrams showing tables linked by primary keys (PK) and foreign keys (FK)]
Data model - ‘reversed star’
[Figure: schema diagram of a Dataset; labels include main tables (main1, 2) with primary keys PK1 and PK2, and dm tables carrying foreign keys FK1 and FK2]
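The figure itself is not recoverable in detail, so the following is only a sketch of the general idea behind a main table with dimension tables that point back to it through its key. It uses Python's standard `sqlite3` module; the table names, column names and rows are hypothetical and merely follow the `__main`, `__dm` and `_key` suffixes listed under the table naming convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: one row per gene, identified by a primary key.
    CREATE TABLE demo__gene__main (
        gene_key   INTEGER PRIMARY KEY,
        gene_name  TEXT,
        chromosome TEXT
    );
    -- Dimension table: many rows per gene, each carrying the main table's key.
    CREATE TABLE demo__xref__dm (
        gene_key  INTEGER REFERENCES demo__gene__main (gene_key),
        accession TEXT
    );
    INSERT INTO demo__gene__main VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
    INSERT INTO demo__xref__dm VALUES (1, 'ACC001'), (1, 'ACC002'), (2, 'ACC003');
""")

# A filter on the main table combined with an attribute from the dimension
# table needs a single join on the shared key.
rows = conn.execute("""
    SELECT m.gene_name, d.accession
    FROM demo__gene__main AS m
    JOIN demo__xref__dm AS d ON d.gene_key = m.gene_key
    WHERE m.chromosome = '1'
    ORDER BY d.accession
""").fetchall()
print(rows)
```

Because every dimension table hangs directly off the main table, a query never has to chain through intermediate tables, which is presumably what makes the layout attractive for built-in query optimization.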
Fixed schema transformation
[Figure: diagram with labels A, TA, B, TB, C]
BioMart abstractions
• Link
– ‘common currency’ between two datasets
– e.g. accession
• Exportable
– Potential links to export
• Importable
– Potential links to import
Exportables, Importables and Links
[Figure: Dataset 1 and Dataset 2 connected by Links]
Exportables, Importables and Links
• Dataset 1, Exportable: name = uniprot_id, attributes = uniprot_ac
• Dataset 2, Importable: name = uniprot_id, filters = uniprot_ac
Exportables, Importables and Links
• Dataset 1, Exportable: name = genomic_region, attributes = chr_name, chr_start, chr_end
• Dataset 2, Importable: name = genomic_region, filters = chr_name (=), chr_start (>=), chr_end (<=)
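The genomic_region example can be read as follows: the attribute values exported by Dataset 1 become filter values on Dataset 2, each compared with the operator given in brackets. The sketch below is an assumption about how such a link could be evaluated, not the BioMart implementation; the `link` function, the dictionary layout and the sample rows are hypothetical.

```python
import operator

# Exportable of dataset 1: the attributes it can hand to another dataset.
EXPORTABLE = {
    "name": "genomic_region",
    "attributes": ["chr_name", "chr_start", "chr_end"],
}

# Importable of dataset 2: the filters receiving those values, with operators.
IMPORTABLE = {
    "name": "genomic_region",
    "filters": [
        ("chr_name", operator.eq),
        ("chr_start", operator.ge),
        ("chr_end", operator.le),
    ],
}


def link(rows1, rows2, exportable, importable):
    """Return rows of dataset 2 that match at least one exported row of dataset 1."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names differ; no link")
    exported = [[row[a] for a in exportable["attributes"]] for row in rows1]
    return [
        row
        for row in rows2
        if any(
            all(op(row[column], value)
                for (column, op), value in zip(importable["filters"], values))
            for values in exported
        )
    ]


# Made-up data: one region from dataset 1, three features in dataset 2.
regions = [{"chr_name": "1", "chr_start": 100, "chr_end": 200}]
features = [
    {"id": "f1", "chr_name": "1", "chr_start": 120, "chr_end": 180},
    {"id": "f2", "chr_name": "1", "chr_start": 150, "chr_end": 250},
    {"id": "f3", "chr_name": "2", "chr_start": 120, "chr_end": 180},
]
linked_ids = [row["id"] for row in link(regions, features, EXPORTABLE, IMPORTABLE)]
print(linked_ids)
```

With these operators a feature is kept only when it lies on the same chromosome and entirely inside an exported region. The shared name (genomic_region) is what makes the two datasets linkable.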
Building BioMart databases
[Figure: Source databases are turned into a Mart by a Transformation step (MartBuilder); the Configuration step (MartEditor) produces XML]
Table naming convention
Naïve configuration
• Tables
– Meta tables: meta_content
– Data tables: dataset__content__type
• Data tables
– Main: __main
– Dimension: __dm
• Columns
– Key: _key
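A naming convention like this lets software discover a dataset's structure from table and column names alone. The following Python sketch shows one way to read the convention; the helper functions and the sample names are hypothetical, and it assumes that the dataset and content parts never contain a double underscore themselves.

```python
def parse_table_name(table):
    """Split a data table name of the form dataset__content__type."""
    parts = table.split("__")
    if len(parts) != 3:
        raise ValueError(f"not a data table name: {table!r}")
    dataset, content, suffix = parts
    kinds = {"main": "main", "dm": "dimension"}
    if suffix not in kinds:
        raise ValueError(f"unknown table type suffix: {suffix!r}")
    return {"dataset": dataset, "content": content, "kind": kinds[suffix]}


def key_columns(columns):
    """Return the columns whose names mark them as keys (suffix _key)."""
    return [column for column in columns if column.endswith("_key")]


main_info = parse_table_name("demo__gene__main")
dm_info = parse_table_name("demo__xref__dm")
keys = key_columns(["gene_key", "gene_name", "chromosome"])
print(main_info, dm_info, keys)
```

Tables sharing the same dataset prefix and a common key column can then be grouped into one dataset without any extra configuration, which is what the slide calls the naïve configuration.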
BioMart architecture
[Figure: BioMart architecture]
• Retrieval: MartExplorer and MartShell (Java), MartView (Perl), over the BioMart API
• Databases: public data, local or remote (Ensembl, Vega, SNP, UniProt, MSD), and myMart
• Building myMart from myDatabase: schema transformation (MartBuilder), configuration XML (MartEditor)
MartView, MartExplorer, MartShell
• Using = dataset
• Get = attribute
• Where = filter
Mart Query Language (MQL)
• Mart Query Language (MQL) syntax:
using <dataset> get <attributes> where <filters>
• Can join datasets together:
using Dataset1 get Attribute1 where Filter1=var1 as q;
using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
• Can script and pipe:
martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc
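Since MQL scripts are plain text, they can be generated by any program before being passed to `martshell.sh -E`. The Python helper below is a hypothetical sketch that only reproduces the syntax shown on the slide; `build_mql` is not part of BioMart, and the comma used to separate several attributes is an assumption, because the slide shows only one attribute per query.

```python
def build_mql(dataset, attributes, filters=(), alias=None):
    """Compose 'using <dataset> get <attributes> where <filters>' as text."""
    # Assumption: several attributes are separated by commas.
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        # Filters are combined with 'and', as in the slide's second query.
        query += " where " + " and ".join(filters)
    if alias:
        # 'as <name>' stores the result so that a later query can refer to it.
        query += f" as {alias}"
    return query


# Rebuild the two-dataset join from the slide.
first = build_mql("Dataset1", ["Attribute1"], ["Filter1=var1"], alias="q")
second = build_mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])
script = first + ";\n" + second
print(script)
```

The resulting text could be written to a file such as MQLscript.mql and run with the martshell.sh commands shown above.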
Third party software
• Bioconductor (biomaRt)
– BioMart schema
• Taverna
– BioMart Java library
• DAS ProServer
– BioMart Perl library
biomaRt
Taverna
ProServer
• No programming
• DAS requests and responses are defined by
Exportables and Importables and
configured with MartEditor
• DAS1
BioMart deployers
• Large scale data federation (EBI)
• Optimising access to a large database
(Ensembl, WormBase)
• Connecting proprietary datasets to
public data (Pasteur, Unilever, Serono,
Sanofi-Aventis, DevGen, etc.)
Hinxton example
[Figure: marts hosted at EBI and Sanger (UniProt, MSD, Ensembl, SNP, Vega, Sequence), accessed over the WWW]
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database
(Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to
public data (Pasteur, Unilever, Serono,
Sanofi-Aventis, DevGen, etc.)
WormBase
Ensembl
ArrayExpress
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database
(Ensembl, WormBase)
• Federating user data with public data
(Pasteur, INRA, Bayer, Unilever, Serono,
Sanofi-Aventis, DevGen, Solexa, etc.)
[Figure: federating SNP data from several sources into one mart]
• dbSNP: ‘Give me frequency data from dbSNP’
SNP1 T/A AL13929 963253 1
SNP2 C/T AL13929 963255 -1
SNP3 C/G AL13929 963258 1
…
• HapMap: ‘Give me genotype and frequency data from HapMap’
• Ensembl, RefSeq, AceView, Vega: ‘Give me SNP locations on gene/transcript’
• GMIA_SNP_mart_database: ‘Give me frequency, genotype, location on gene/transcript from dbSNP, HapMap, Ensembl, RefSeq, AceView and Vega’
• Interfaces: Java graphical user interface, WWW web browser
Genetics of Infectious and Autoimmune Diseases,
Pasteur Institute, INSERM U730, Paris, France.
… what next?
BioMart model
• Already applied
– Ensembl
– Vega
– SNP
– UniProt
– MSD
– ArrayExpress
– WormBase
– Variety of ‘in house’ projects
• In development
– HapMap
Summary
• BioMart interface
– Batch queries
– ‘Data mining’
– Large annotation
• BioMart software
– Set up your own database
– Make your database scalable and
responsive
– Federate with other data
Where are we?
• 0.2 released in February
• 0.3 to be released in June
– Platforms
• MySQL
• Oracle
• Postgres
Acknowledgments
• BioMart
– Damian Smedley (EBI)
– Darin London (EBI)
– Will Spooner (CSHL)
• Contributors
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (UniProt)
– Paul Donlon (Unilever)