Slides - Indico

EMBL/ELIXIR use-cases
for EGI/EUDAT
Tony Wildish
www.ebi.ac.uk
Data resources available from EMBL-EBI
• Genes, genomes & variation
– European Nucleotide Archive
– European Variation Archive
– European Genome-phenome Archive
– Ensembl
– Ensembl Genomes
– GWAS Catalog
– Metagenomics portal
– RNA Central
• Gene, protein & metabolite expression
– ArrayExpress
– Expression Atlas
– MetaboLights
– PRIDE
• Literature & ontologies
– Europe PubMed Central
– Gene Ontology
– Experimental Factor Ontology
• Protein sequences, families & motifs
– InterPro
– Pfam
– UniProt
• Molecular structures
– Protein Data Bank in Europe
– Electron Microscopy Data Bank
• Chemical biology
– ChEMBL
– ChEBI
• Reactions, interactions & pathways
– IntAct
– Reactome
– MetaboLights
• Systems
– BioModels
– Enzyme Portal
– BioSamples
ELIXIR: Driven by 4 scientific use-cases
• Marine Metagenomics
• Genomic & Phenotypic data for Crop and
Forest plants
• Rare Diseases
• Human Genetic Data
=> All scientific use cases require either private or public datasets to be replicated from the source or between analysis sites
Three types of metadata
• Content metadata
– Scientific/biological content, the value of the data
– Structure specific to the archive hosting the data
• File metadata
– File size, checksum, creation date
– Logical File Name (LFN) – filename relative to archive root
• Access metadata
– Physical File Name (PFN) - host, protocol/port, path to root
of archive, LFN
– Defined per site, per protocol (HTTP, FTP…)
– Site-specific, part of the fabric
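The split between file metadata and access metadata can be illustrated with a minimal Python sketch. The class names, field names, host names and the example checksum below are all hypothetical; the slide only states that a PFN combines host, protocol/port, path to the archive root, and the LFN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMetadata:
    """File metadata: the same wherever a replica is stored."""
    lfn: str       # Logical File Name, relative to the archive root
    size: int      # size in bytes
    checksum: str  # hex digest (placeholder value in this sketch)
    created: str   # creation date, ISO 8601


@dataclass(frozen=True)
class AccessPath:
    """Access metadata: defined per site and per protocol."""
    site: str
    protocol: str      # e.g. "https" or "ftp"
    host: str
    port: int
    archive_root: str  # path to the root of the archive at this site

    def pfn(self, lfn: str) -> str:
        """Build the Physical File Name for an LFN at this site."""
        parts = (self.archive_root.strip("/"), lfn.lstrip("/"))
        path = "/".join(p for p in parts if p)
        return f"{self.protocol}://{self.host}:{self.port}/{path}"


# Hypothetical file and access paths, for illustration only.
meta = FileMetadata("run42/sample.fastq.gz", 1024, "0" * 32, "2016-01-01")
site_a = AccessPath("site-a", "https", "data.site-a.example.org", 443, "/archive")
site_a_ftp = AccessPath("site-a", "ftp", "ftp.site-a.example.org", 21, "/pub/archive")

print(site_a.pfn(meta.lfn))
print(site_a_ftp.pfn(meta.lfn))
```

The LFN and the file metadata stay the same for every replica; only the access path changes per site and per protocol.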
Use-case characteristics
• Data volumes from tens to several hundreds of GB per month
– Human data likely to be largest volume/traffic
• Replication between a handful of sites
– Periodic updates to reference datasets => metadata
handling to describe datasets consistently
• Download smaller subsets for individual analyses
• End-users widely distributed, communities of all
sizes/scales
Use-case characteristics
• Content metadata replication not a target
– Complex, domain-specific, well established
– No clear gain in replicating it at this time
• Decouple dataset-description metadata from
file-location and transfer/access metadata
– Allow file-distribution to be explored and
understood without digging into details of what
the data is about
Abstract model
• Users browse metadata to discover or define datasets which are located at multiple sites
• Complexity comes from the dataset structure…
[Diagram labels: User interface; Content Metadata; Dataset definition; Source archive; Dataset; Dataset Version; File (LFN); Site A and Site B, each with an Access Path Catalogue (PFN) and Storage]
Example dataset structure: three
datasets each have their own files
#1 and #2 don’t overlap with each
other, but both overlap with #3
[Diagram: Dataset 1, Dataset 2 and Dataset 3, each drawn as a group of LFNs]
3 overlapping Datasets -> 5 non-overlapping Filesets
[Diagram: Dataset 1, Dataset 2 and Dataset 3 divided into Fileset 1 to Fileset 5]
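One way to derive non-overlapping filesets from overlapping datasets is to group files by the exact set of datasets they belong to. The sketch below is an illustration under that assumption, not necessarily the method behind the slide; the function name and the single-letter LFNs are hypothetical.

```python
from collections import defaultdict


def partition_into_filesets(datasets):
    """Group LFNs by the exact set of datasets they belong to.

    Each distinct membership signature becomes one fileset, so filesets
    never overlap and every dataset is a union of whole filesets.
    """
    membership = defaultdict(set)
    for name, lfns in datasets.items():
        for lfn in lfns:
            membership[lfn].add(name)

    filesets = defaultdict(set)
    for lfn, names in membership.items():
        filesets[frozenset(names)].add(lfn)
    return dict(filesets)


# Datasets 1 and 2 do not overlap with each other, but both overlap with 3.
datasets = {
    "Dataset 1": {"a", "b", "c", "d"},
    "Dataset 2": {"g", "h", "i", "j"},
    "Dataset 3": {"c", "d", "e", "f", "g"},
}
filesets = partition_into_filesets(datasets)

for names, lfns in filesets.items():
    print(sorted(names), sorted(lfns))
```

With these inputs the three datasets split into five filesets: files only in Dataset 1, files shared by Datasets 1 and 3, files only in Dataset 3, files shared by Datasets 2 and 3, and files only in Dataset 2.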
Filesets in releases are ‘closed’ (immutable)
As-yet unreleased filesets may be ‘open’ (mutable)
[Diagram labels: Dataset A; Dataset A Release 1; Dataset A Release 2; Dataset B; Fileset 1 (closed); Fileset 2 (closed); Fileset 3 (open); Fileset n; each fileset holds Files identified by LFN]
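The open/closed distinction can be sketched as a small class whose contents may change until it is closed. The `Fileset` class and the `release` helper are hypothetical names used only for illustration.

```python
class Fileset:
    """A set of LFNs that is mutable while open and immutable once closed."""

    def __init__(self, name):
        self.name = name
        self._lfns = set()
        self._closed = False

    @property
    def is_open(self):
        return not self._closed

    @property
    def lfns(self):
        # Return a read-only snapshot so callers cannot bypass add().
        return frozenset(self._lfns)

    def add(self, lfn):
        if self._closed:
            raise ValueError(f"fileset {self.name!r} is closed")
        self._lfns.add(lfn)

    def close(self):
        self._closed = True


def release(filesets):
    """Close every fileset that goes into a release."""
    for fileset in filesets:
        fileset.close()
    return tuple(filesets)


fs1 = Fileset("Fileset 1")
fs1.add("a.dat")
release_1 = release([fs1])   # Fileset 1 is now closed

fs3 = Fileset("Fileset 3")   # not yet released, so still open
fs3.add("c.dat")
```

A later release reuses the closed filesets unchanged and adds new ones, so a file list that has been released never changes underneath its users.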
[Diagram labels: Metadata/Dataset catalogue; Dataset; Dataset Version; Fileset; File (LFN); File/Fileset replica catalogue; Access protocol catalogue; Access Protocols; PFN; Topology, Infrastructure, Fabric…; Site A and Site B, each with Storage; …]
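The catalogues named in this model can be sketched as plain dictionaries, with a lookup that walks from a dataset version to filesets, then to replica sites, then to PFNs. All catalogue contents, site names, URLs and the `resolve` function are hypothetical; a real deployment would back each catalogue with its own service.

```python
# Dataset catalogue: (dataset, version) -> filesets
dataset_catalogue = {
    ("Dataset A", "Release 1"): ["fileset-1"],
    ("Dataset A", "Release 2"): ["fileset-1", "fileset-2"],
}
# Fileset contents: fileset -> LFNs
fileset_contents = {
    "fileset-1": ["a.dat", "b.dat"],
    "fileset-2": ["c.dat"],
}
# Replica catalogue: fileset -> sites holding a copy
replica_catalogue = {
    "fileset-1": ["site-a", "site-b"],
    "fileset-2": ["site-b"],
}
# Access protocol catalogue: (site, protocol) -> URL prefix of the archive root
access_catalogue = {
    ("site-a", "https"): "https://data.site-a.example.org/archive",
    ("site-b", "https"): "https://data.site-b.example.org/mirror",
    ("site-b", "ftp"): "ftp://ftp.site-b.example.org/pub",
}


def resolve(dataset, version, protocol):
    """Return {lfn: [pfn, ...]} for one dataset version over one protocol."""
    result = {}
    for fileset in dataset_catalogue[(dataset, version)]:
        for lfn in fileset_contents[fileset]:
            pfns = result.setdefault(lfn, [])
            for site in replica_catalogue.get(fileset, []):
                prefix = access_catalogue.get((site, protocol))
                if prefix is not None:
                    pfns.append(f"{prefix}/{lfn}")
    return result


https_pfns = resolve("Dataset A", "Release 2", "https")
ftp_pfns = resolve("Dataset A", "Release 2", "ftp")
```

Because the lookup never touches content metadata, file distribution can be examined without knowing what the data is about.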
Data stewardship
• Data stewardship requires data policies
– This site must maintain a copy of my data
– I want a copy of my data somewhere in EGI, but I don’t care
where
– This data can be deleted after 10 years
– …
• Infrastructure providers need access to those policies
– Can I delete my copy of the data? Does it matter if I lose it
accidentally?
– Is my copy ‘custodial’ (do I need to keep it backed up?)
– Does my copy have to be permanently online? Near-line?
• Policies belong in the replica catalogue
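A policy record attached to each replica could let a site answer these questions mechanically. The sketch below is only an assumption about how such a record might look: the `ReplicaPolicy` fields and the deletion rule in `may_delete` are illustrative, not part of the proposal.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ReplicaPolicy:
    """Policy for one replica, stored alongside it in the replica catalogue."""
    fileset: str
    site: str
    custodial: bool = False              # must this copy be kept and backed up?
    online: bool = True                  # False: near-line storage is acceptable
    delete_after: Optional[date] = None  # retention limit, if any


def may_delete(policy, today, other_replicas):
    """Illustrative rule: a copy may go once its retention period has
    expired, or if it is non-custodial and another replica exists."""
    if policy.delete_after is not None and today >= policy.delete_after:
        return True
    return not policy.custodial and other_replicas >= 1


# Hypothetical policies for two replicas of the same fileset.
custodial_copy = ReplicaPolicy("fileset-1", "site-a", custodial=True,
                               delete_after=date(2026, 1, 1))
cache_copy = ReplicaPolicy("fileset-1", "site-b")

print(may_delete(custodial_copy, date(2020, 1, 1), other_replicas=3))
print(may_delete(cache_copy, date(2020, 1, 1), other_replicas=1))
```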
Summary
• Three types of metadata for ELIXIR data management
– Content (value), file, access: manage separately
– Content metadata is/will be managed by ELIXIR
– File and access metadata need catalogues and tools
• Dataset/data-organization metadata is complex
– Not cleanly separable, overlapping, multi-scale…
– Need to explore real use-cases, understand details
• Proposed data-model can address needs
– Needs validation: prototype, deploy...