Transcript Document

Describing Bioinformatic Metadata at EBI
James Malone
[email protected]
Cross-Domain Data available from EBI
Literature and ontologies
Genomes
Protein sequence
DNA & RNA sequence
Protein structure
Gene expression
Chemical entities
Protein families, motifs and domains
Protein interactions
Pathways
Systems
The Sorts of Data we Serve
• We manage databases of biological data such as nucleic acid, protein sequences and macromolecular structures
• ENA: nucleotide sequencing information
• UniProt: protein sequence and functional information
• ArrayExpress: functional genomics data repository
• Ensembl: genome info for vertebrates and other eukaryotes
• InterPro: database of predictive protein "signatures"
• PDBe: data resource on biological macromolecular structures
Sorts of Metadata we need
• Low complexity – high volume (genome sequencing)
• High complexity – low volume (mouse phenotyping)
• 1000 Genomes: data volumes within an order of magnitude of physics data
• Provenance models
• Experimental variables
• Publication details
• Synonyms and domain-specific language
• Cross-domain mappings
• Metadata has existed and been captured for a while, e.g. InterPro IDs
Metadata: Minimum Information Standards
• Minimum Information Standards specify the minimum amount of metadata (and data) required to meet a specific aim (usually reporting data or submitting to a public repository)
• MIAME: Minimum Information About a Microarray Experiment
• MIARE: Minimum Information About an RNAi Experiment
• MIAPE: Minimum Information About a Proteomics Experiment
• MIFlowCyt: Minimum Information about a Flow Cytometry Experiment
• ISA: cross-domain experiment reporting
• Some public repositories require some degree of conformance, e.g. ArrayExpress – MIAME scoring
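A checklist-style standard can be turned into a simple completeness check on a submission. The sketch below is illustrative only: the field names and the equal-weight count are assumptions made for this example, not the MIAME specification and not the scoring ArrayExpress actually applies.

```python
# Hypothetical checklist: field names are illustrative, not the MIAME specification.
REQUIRED_FIELDS = [
    "raw_data",
    "processed_data",
    "sample_annotation",
    "experimental_design",
    "array_annotation",
    "protocols",
]


def checklist_score(submission: dict) -> tuple[int, list[str]]:
    """Return (number of required fields present, list of missing fields)."""
    # A field counts as missing if it is absent or empty.
    missing = [f for f in REQUIRED_FIELDS if not submission.get(f)]
    return len(REQUIRED_FIELDS) - len(missing), missing


# Made-up submission with three fields filled in and one left empty.
submission = {
    "raw_data": "files.zip",
    "processed_data": "matrix.txt",
    "sample_annotation": {"organism": "Homo sapiens"},
    "protocols": "",
}
score, missing = checklist_score(submission)
print(f"{score}/{len(REQUIRED_FIELDS)} fields present; missing: {missing}")
```

A repository could use a count like this to decide whether to accept a submission or to ask the submitter for the missing items.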
Ontologies
• A method of representing knowledge in which concepts are described both by their meaning and by their relationships to each other
• Increasingly important component for formalising metadata
• Thriving bio-ontology community
• e.g. Gene Ontology ‘project to standardise the representation of gene and gene product attributes’
• e.g. ChEBI ‘ontology of molecular entities focused on small chemical compounds’
• e.g. Ontology for Biomedical Investigations ‘ontology to describe experimental protocols from inception to analysis’
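To make "described by their relationship to each other" concrete, here is a minimal sketch of a toy ontology stored as is_a links, with a lookup that also returns samples annotated with more specific terms. The term names and sample identifiers are invented for illustration and are not drawn from any real ontology release.

```python
# Toy ontology: term -> list of direct parents via "is_a".
# Term names are illustrative, not taken from a real ontology.
IS_A = {
    "hepatocyte": ["epithelial cell"],
    "epithelial cell": ["cell"],
    "neuron": ["cell"],
    "cell": [],
}


def ancestors(term: str) -> set[str]:
    """All terms reachable from `term` by following is_a links."""
    seen: set[str] = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen


def annotated_with(term: str, annotations: dict[str, str]) -> list[str]:
    """Samples annotated with `term` itself or with any of its descendants."""
    return sorted(
        sample
        for sample, t in annotations.items()
        if t == term or term in ancestors(t)
    )


# Hypothetical sample annotations.
samples = {
    "sample_1": "hepatocyte",
    "sample_2": "neuron",
    "sample_3": "epithelial cell",
}
print(annotated_with("epithelial cell", samples))
```

Because the hierarchy is explicit, a query for a general term picks up data annotated at a finer level without the curator having to list every specific term.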
Metadata that is Interoperable
• Goal: an interoperable set of community reference ontologies
• Consumed by application ontologies for specific needs
• E.g. Experimental Factor Ontology @ www.ebi.ac.uk/efo
[Figure: reference ontologies shown: Cell Type Ontology, Disease Ontology, Relation Ontology, Chemical Entities of Biological Interest (ChEBI), Anatomy Reference Ontology, Various Species Anatomy Ontologies]
Applying Ontologies in Data Curation
@ www.ebi.ac.uk/gxa
Query for Cell adhesion genes in all ‘organism parts’
‘View on EFO’
Ontologically Modeling Sample Variables in Gene Expression Data
[email protected]
Strategies for Integrating Multi-Domain Data
• Consuming reference ontologies and mapping to multiple ontologies where overlap exists offers us maximum interoperability
[Figure: a QUERY over RDF triples linking Atlas, Amino Acid Ontology and Swiss-Prot]
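As a rough illustration of querying across sources expressed as triples, the sketch below stores statements as plain (subject, predicate, object) tuples and joins two sources on a shared identifier. All identifiers and predicate names are made up for this example, and a real deployment would use an RDF store and a query language rather than Python lists.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers and predicates below are invented for illustration.
atlas_triples = [
    ("gene:G1", "expressed_in", "efo:liver"),
    ("gene:G2", "expressed_in", "efo:brain"),
]
protein_triples = [
    ("gene:G1", "encodes", "protein:P1"),
    ("protein:P1", "contains_residue", "aa:cysteine"),
]

# Merging sources is just concatenation once they share identifiers.
graph = atlas_triples + protein_triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def proteins_expressed_in(tissue):
    """Join expression statements to protein statements via the gene identifier."""
    genes = [s for s, _, _ in match(graph, p="expressed_in", o=tissue)]
    return sorted(o for g in genes for _, _, o in match(graph, s=g, p="encodes"))


print(proteins_expressed_in("efo:liver"))
```

The join only works because both sources use the same gene identifier, which is the practical reason for mapping to shared reference ontologies.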
ELIXIR Report
• Data Integration & Interoperability Recommendations – Jul 2009
• ELIXIR should build a distributed data infrastructure based on a Service Oriented Architecture using Web Services (WS) technology
• Ontologies needed in areas of disease, anatomy and taxon
• Annotation systems for associating data to metadata
• Pan-domain coordination and funding for reporting standards
Current Challenges
• Literature – data gap
• Curation relatively slow, more advanced tooling required
• Ontologies not interoperable yet and more needed
• Bio-ontology funding
• New high-throughput methods
[Chart: Experiments and Assays over time, Oct. 03 to Jan. 10; axes run 0 to 12000 and 0 to 350000]
Challenges: Scaling
World-wide sequencing data production is now just an order of magnitude behind CERN
• Large Hadron Collider produces 15 petabytes per year from a single point source
• LHC grid is 140 computer centres in 33 countries, centered at CERN (Tier 0)
• Sequencing is producing data in hundreds of centers in dozens of countries with Tier 0 sites (EBI & NCBI)
• More than 150 terabytes of 1000 Genomes data in the Short Read Archive, and this represents more than half of all the data in the archive
Slide: Laura Clarke, EBI
Summary
• EBI uses a combination of metadata strategies
• Minimum Information standards are useful for reporting
• Ontologies provide a powerful method for describing domain knowledge
• Ontologies also allow community consensus to be built, as well as strategies for data integration
• ELIXIR suggests:
  • Infrastructures should be WS compatible
  • Annotation tools required
  • Pan-domain coordination is essential
Acknowledgements
• Ontology creation:
  • James Malone, Tomasz Adamusiak, Ele Holloway, Helen Parkinson, Jie Zheng (U Penn)
• Atlas GUI Development
  • Misha Kapushesky, Pasha Kurnosov, Anna Zhukova, Nikolay Kolesinkov
• External Review and anatomy:
  • Jonathan Bard, Jie Zheng
• ArrayExpress Production Staff
• EBI Rebholz Group (Whatizit text mining tool)
• Many source ontologies for terms and definitions esp. Disease Ontology, Cell Type Ontology, FMA, NCIT, OBI
• Funders: EC (Gen2Phen, FELICS, MUGEN, EMERALD, ENGAGE, SLING), EMBL, NIH
• Eric Neumann, Joanne Luciano and Alan Ruttenberg
• HCLS Group - Eric Prud'hommeaux and Scott Marshall
Developing an Ontology from the Application Up
[email protected]