Identity management – life sciences perspective
Ugis Sarkans
European Bioinformatics Institute
European Bioinformatics Institute
• Outstation of the European Molecular Biology Laboratory
• International organisation created by treaty (cf CERN, ESA)
• EMBL-EBI has 400 Staff, €30 Million Budget, several million
users
• 15 year history of service provision and scientific excellence
• Sited at the Wellcome Trust Genome Campus Hinxton,
Cambridge, UK after European competition
[Figure: 2008 funding sources]
EMBL-EBI Mission
• To provide freely available data and bioinformatics
services to all facets of the scientific community in
ways that promote scientific progress
• To contribute to the advancement of biology through
basic investigator-driven research in bioinformatics
• To provide advanced bioinformatics training to
scientists at all levels, from PhD students to
independent investigators
• To help disseminate cutting-edge technologies to
industry
Comprehensive, universal, integrated…
• Life sciences
• Medicine
• Agriculture
• Pharmaceuticals
• Biotechnology
• Environment
• Bio-fuels
• Cosmeceuticals
• Nutraceuticals
• Consumer products
• Personal genomes
• Etc…
[Figure: EMBL-EBI data resources by domain]
• Literature and ontologies: CiteXplore, GO
• Genomes: Ensembl, Ensembl Genomes, EGA
• Nucleotide sequence: EMBL-Bank
• Proteomes: UniProt, PRIDE
• Gene expression: ArrayExpress
• Protein structure: PDBe
• Protein families, motifs and domains: InterPro
• Chemical entities: ChEBI, ChEMBL
• Protein interactions: IntAct
• Pathways: Reactome
• Systems: BioModels
Challenges facing information
infrastructure for life sciences
• Biomedical data is growing faster than Moore's law
• Data is generated in a geographically distributed manner,
but needs to be tightly integrated for interpretation
• Data analysis algorithms need to be applied to
combined datasets at the raw data level
• Human research subject data (clinical data) needs to
be integrated with bio-molecular data, raising
privacy issues and the need for highly controlled access
• Data analysis algorithms are becoming more
compute intensive, hence the need for parallelisation
Dynamic growth response
[Figure: Log(data volume) plotted against Time, comparing "Data to be stored" with "Available disk space"]
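As a rough illustration of the dynamic growth response plot: if both stored data and disk capacity grow exponentially, each is a straight line on a log axis, and the point where data overtakes capacity follows from the two doubling times. The sketch below assumes simple exponential growth; the function name and all numbers are hypothetical and not taken from the talk.

```python
import math


def crossover_time(data0, disk0, data_doubling, disk_doubling):
    """Time until stored data overtakes disk capacity, or None if it never does.

    On a log2 axis both curves are straight lines:
        log2(data(t)) = log2(data0) + t / data_doubling
        log2(disk(t)) = log2(disk0) + t / disk_doubling
    """
    slope_gap = 1.0 / data_doubling - 1.0 / disk_doubling
    if slope_gap <= 0:
        return None  # data grows no faster than disk, so the lines never cross
    headroom = math.log2(disk0 / data0)  # current spare capacity, in doublings
    return max(0.0, headroom / slope_gap)


# Hypothetical inputs: 4x spare capacity today, data doubling every year,
# disk capacity doubling every two years.
years = crossover_time(data0=1.0, disk0=4.0, data_doubling=1.0, disk_doubling=2.0)
print(years)  # 4.0
```

With these made-up inputs the spare capacity is used up after four time units, which is why a fixed purchasing plan cannot keep up with a faster-growing archive.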
What is Elixir?
• An EU Framework 7 Preparatory Phase Project
• Coordinated by Prof Janet Thornton, Director EMBL-EBI
• To construct a plan for the operation of a sustainable
infrastructure for biological information in Europe
• €4.5 million grant awarded May 2007, three year term
• 32 member consortium engaging many of Europe’s main
bioinformatics funding agencies and research institutes
• Deliverables are memoranda of understanding to fund the
implementation phase which could cost €500 million
• Interested parties should register as stake-holders via the
ELIXIR Website: www.elixir-europe.org
ESFRI
The European Strategy Forum on Research Infrastructures
• Created by the Commission in February 2002
• Adopted by the Competitiveness Council in April 2002
• Representatives of EU Member States, Associated States,
and one representative of the European Commission.
• Chairman: Prof Carlo Rizzuto (Sincrotrone Trieste S.c.p.A. ELETTRA, IT)
• To support a coherent approach to policy-making on
research infrastructures in Europe
• To act as an incubator for international negotiations about
concrete initiatives
European Roadmap for Research Infrastructures
• 35 ‘mature’ projects for new large scale Research
Infrastructures
• Based on an international peer review process
• Covers all scientific areas, regardless of possible location
• Likely to be realized in the next 10 to 20 years
• Supported by a relevant European partnership or
intergovernmental research organisation.
• Impact on science and technology development at
international level
• Support new ways of doing science in Europe
• Contribute to the enhancement of the European Research
Area
Roadmap projects summary.
• 6 Social Science & Humanities
• 8 Environmental Sciences
• 3 Energy
• 6 Biomedical and Life Sciences
• 7 Material Sciences
• 5 Astronomy, Astro-, Nuclear and Particle Physics
• 1 Computer and Data Treatment (transverse)
http://cordis.europa.eu/esfri/
Cost of 35 Mature ESFRI RI Projects
[Figure: capital cost by area, as labelled on the chart]
• Computing: £300M
• Social Science: (no figure given)
• Environment: £1,300
• Physics: £3,600
• Biomedical: £1,600
• Energy: £2,200
• Materials: £4,500
Total Capital Cost = €13,696 Million
The ten ESFRI BMS RI
ELIXIR Scientific & Technical Structure
BMS Support of the European Grand Challenges
ELIXIR will provide Infrastructure for
the other ESFRI BMS RI.
BioMedBridges
• Call 8 (Research) Topic 2.3.2 “Clustering the ESFRI BMS.”
• Coordinated by Janet Thornton
• To create the links between the ESFRI BMS RI
• €10.6M over 4 years, 21 participating organisations, 12 WP
• To “build bridges” between the infrastructures
• Deliverables are infrastructure components that will link
data from the different domains of the ESFRI BMS RI to
ELIXIR Core Datasets
• It is anticipated that these components will be incorporated
into ELIXIR Construction Phase
• ESFRI BMS RIs will be doing the work
• e-Infrastructure Advisory Panel: GÉANT, DANTE, EGI.eu,
PRACE
BioMedBridges Structure of Proposal
• WP1 Management
• WP2 Outreach and inreach
• WP3 ESFRI BMS Standards Description and Harmonization
• WP4 Technical integration
• WP5 Secure access
• Five Use Cases WP6 – WP10
– WP6 Interoperability of large scale image data sets from different biological scales
– WP7 PhenoBridge - crossing the species bridge between mouse and human
– WP8 Personalized Medicine - integrating complex data sets to understand disease
pathogenesis and improve biomarker and treatment selection
– WP9 From cells to molecules - integrating structural data
– WP10 Integrating disease related data and terminology from samples of different types
• WP11 Technology Watch
• WP12 Training
EMBL-EBI: Most important data collections
Genomes & Genes
1. Ensembl: Joint project with Sanger Institute - high-quality annotation of vertebrate genomes
2. Ensembl Genomes: Environment for genome data from other taxa
3. 1000 Genomes: Catalogue of human variation from major world populations
4. EGA*: European Genome-phenome Archive* – genotype, phenotype and sequences from individual subjects and controls
5. ENA: European Nucleotide Archive – all DNA & RNA, nextgen reads and traces
Transcription
6. ArrayExpress: Archive of transcriptomics and other functional genomics data
7. Expression Atlas: Differentially expressed genes in tissues, cells, disease states & treatments
Protein
8. UniProt: Archive of protein sequences and functional annotation
9. InterPro: Integrated resource for protein families, motifs and domains
10. PRIDE: Public data repository for proteomics data
11. PDBe: Protein and other macromolecular structure and function
Small molecules
12. ChEBI: Chemical entities of biological interest
13. ChEMBL: Bioactive compounds, drugs and drug-like molecules, properties and activities
Processes
14. IntAct: Public repository for molecular interaction data
15. Reactome: Biochemical pathways and reactions in human biology
16. BioModels: Mathematical models of cellular processes
Ontologies
17. GO: Gene Ontology, consistent descriptions of gene products
Scientific literature
18. CiteXplore: Bibliographic query system
* Requires authentication
Data supporting publication – typical lifecycle
[Figure: lifecycle diagram with the labels author, reviewer, submitted manuscript, published manuscript, restricted data, public data]
European Genome-phenome Archive (EGA)
• Primary archive for any data consented for research but not for fully public distribution
– all data must be de-identified and in accordance with the informed consent
– controlled access to the data
• Distributed access policy:
– Data Access Committee (DAC)
– data release policy – data access application and data access agreement
• EGA supports only data access decisions that are based on the original
informed consent
– authorized users have personal accounts in our system
– access to the data requires account password
– data decryption requires a separate key that must be requested and is sent
offline
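A minimal sketch of the access rule described on this slide, assuming each dataset belongs to exactly one Data Access Committee and that only that committee can approve access. All class, function and identifier names are hypothetical; this is not the EGA data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    dataset_id: str
    dac_id: str  # the Data Access Committee responsible for this dataset


@dataclass
class Account:
    username: str
    # (dac_id, dataset_id) pairs for which a DAC has approved an application
    approved: set = field(default_factory=set)


def grant(account: Account, dataset: Dataset, deciding_dac: str) -> None:
    """Record an approval; only the dataset's own DAC may decide."""
    if deciding_dac != dataset.dac_id:
        raise PermissionError("only the dataset's own DAC can grant access")
    account.approved.add((dataset.dac_id, dataset.dataset_id))


def may_download(account: Account, dataset: Dataset) -> bool:
    """Access follows the DAC decision, never a decision made by the archive."""
    return (dataset.dac_id, dataset.dataset_id) in account.approved


# Hypothetical identifiers for illustration only.
cohort = Dataset(dataset_id="dataset-A", dac_id="dac-1")
alice = Account(username="alice")
print(may_download(alice, cohort))  # False: no approval recorded yet
grant(alice, cohort, deciding_dac="dac-1")
print(may_download(alice, cohort))  # True
```

Keeping the decision with the DAC means the archive only stores and enforces approvals, which matches the distributed access policy in the bullets above.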
HSF - 20.1.2011
EGA works with Data Access Committees (DAC)
Mechanics of secure data access
[Diagram components: FTP Client, Secure Server, EGA secure layer]
(1) Request for whole file for download (with username/password)
(2) EGA verifies user and provides the list of authorized files.
(3) EGA provides archival encryption key and file path in the archive. This requires a secure
API to facilitate access into the EGA master database
(4) Requested BAM data decrypted, and re-encrypted using client key
(5) Secure Server responds to FTP requests directly; FTP client
downloads the custom-encrypted file
Note: Authentication of FTP clients is inherently insecure; we may
have to require FTPS compliant clients (RFC 4217)
Acknowledgements
• Andrew Lyall, ELIXIR project manager
• Paul Flicek, Ilkka Lappalainen, EGA
• Alvis Brazma, Functional Genomics,
BioMedBridges security