ArrayExpress

Anatomy ontology evaluation @ ArrayExpress
Helen Parkinson, PhD
www.ebi.ac.uk/arrayexpress
EBI is an Outstation of the European Molecular Biology Laboratory.
Content
• ArrayExpress use cases
• Fuzzy matching of ontology terms
• Data driven ontology building
• Wish list
ArrayExpress: Overview
[Diagram. Labels: Submit; Hybs; Public/Private (experiment queries); Re-annotate; Summarize; Genes; Public Only (gene queries); ATLAS (cross-experiment/species queries)]
Fuzzy matching of ontology terms – why?
• Clean up ArrayExpress OE and synonym tables
• OE based integration
• Constrain OEs on data entry/validation
• Improved searches in repository/DW web interface
• Data integration across species, experiments and experimental designs
• Automated mapping of free text to ontology terms for data import
Phonetic Matching
• Precompute phonetic encodings of all terms in the
ontology
• Match each target term by comparing these encodings
• Soundex: Robert Russell and Margaret Odell (1918), famously
described by Donald Knuth
• Double Metaphone: Lawrence Philips (2000)
• Metaphone: Lawrence Philips (1990)
• Most matches are single
• Highest success rate
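
The precompute-then-compare idea above can be sketched with Soundex, the simplest of the three encodings. This is a minimal illustration, not the ArrayExpress implementation: the function names, the word-by-word encoding of multi-word terms and the sample terms are all assumptions made for the example.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word):
    """Return the 4-character Soundex code of one word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = []
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(c, "")  # vowels (and unmapped characters) get no digit
        if code and code != prev:
            encoded.append(code)
        prev = code
    return (letters[0] + "".join(encoded) + "000")[:4]


def phonetic_key(term):
    """Encode a multi-word term word by word (an assumption of this sketch)."""
    return tuple(soundex(w) for w in term.split())


def build_index(ontology_terms):
    """Precompute the encoding of every ontology term once."""
    index = defaultdict(list)
    for term in ontology_terms:
        index[phonetic_key(term)].append(term)
    return index


def match(target, index):
    """Return all ontology terms whose encoding equals the target's encoding."""
    return index.get(phonetic_key(target), [])


# Hypothetical ontology terms and free-text inputs.
index = build_index(["kidney", "liver", "cerebral cortex"])
print(match("kidny", index))           # a misspelling that still matches
print(match("cerebral cortx", index))  # multi-word term
print(match("leaf", index))            # no candidate in this toy ontology
```

A lookup that returns one term corresponds to a "single" match and one that returns several to a "multiple" match; whether a match is acceptable still needs a curator's judgement.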
Algorithm comparisons
[Stacked bar chart: SAEL vs. AE OrganismPart. Y-axis: 0% to 100%. Bars: Soundex, Metaphone, Double Metaphone, Levenshtein. Legend: valid, single_okay, single_bad, multiple_okay, multiple_bad, none]
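
Levenshtein is the one non-phonetic algorithm in the comparison. The sketch below shows edit-distance matching and a rough version of the outcome categories in the legend. The `max_distance` threshold, the function names and the reading of "valid" as an exact label match are assumptions; the okay/bad split presumably came from manual review and is not modelled here.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify(target, ontology_terms, max_distance=2):
    """Label a free-text term as valid, single, multiple or none."""
    wanted = target.strip().lower()
    exact = [t for t in ontology_terms if t.lower() == wanted]
    if exact:
        return "valid", exact
    close = [t for t in ontology_terms
             if levenshtein(wanted, t.lower()) <= max_distance]
    if not close:
        return "none", []
    return ("single" if len(close) == 1 else "multiple"), close


# Hypothetical ontology terms.
terms = ["heart", "liver", "hair"]
print(classify("Liver", terms))
print(classify("hart", terms))                  # two candidates within distance 2
print(classify("hart", terms, max_distance=1))  # a tighter threshold leaves one
print(classify("rhizome", terms))
```

The "hart" case shows the trade-off behind the chart: a looser threshold recovers more misspellings but produces more multiple matches that need review.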
Percent matches using automated mapping
[Bar chart: percentage of terms matched by automated mapping (y-axis 0 to 35) for CARO, SAEL and MIAA]
Failures to match
• Species (or Kingdom)-specific terms (e.g. plant anatomy)
• Conflated terms (e.g. diseased cell types)
• Compound terms (e.g. "cerebral cortex and
hypothalamus")
• Genuinely missing terms
• Esoteric terms less of a priority
• Most trivial misspellings, however, were matched
• Dirty input data
Implications
• Need more terms in some commonly-used ontologies
• Synonyms are important
• generating less noise
• better coverage
• Choice of ontology can limit expressivity - this will be
frustrating to biologists
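
To illustrate why synonyms matter for coverage, here is a small sketch of a lookup table that resolves both preferred labels and synonyms to the same term. The identifiers (`EX:0001`, `EX:0002`), the labels, the synonyms and the function names are hypothetical and chosen only for the example.

```python
def normalise(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def build_label_index(terms):
    """Map every preferred label and synonym to the IDs of the terms using it."""
    index = {}
    for term_id, (label, synonyms) in terms.items():
        for name in (label, *synonyms):
            index.setdefault(normalise(name), set()).add(term_id)
    return index


def lookup(text, index):
    """Return the sorted term IDs whose label or synonym equals the text."""
    return sorted(index.get(normalise(text), ()))


# Hypothetical terms: ID -> (preferred label, synonyms).
terms = {
    "EX:0001": ("kidney", ["renal organ"]),
    "EX:0002": ("blood", ["whole blood", "peripheral blood"]),
}
index = build_label_index(terms)
print(lookup("Peripheral  Blood", index))  # resolved through a synonym
print(lookup("spleen", index))             # not covered by this toy ontology
```

Without the synonym entries the first lookup would fail outright or fall through to fuzzy matching, which is where the extra noise comes from.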
Why?
• Clean up ArrayExpress OE and synonym tables
• Add accessions/DB links to these tables
• Constrain OEs on data entry/validation
• Improved searches in repository/DW web interface
• Generate suggestions for new OE terms
• Evaluate domain coverage by a given ontology
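
The last two goals can be approximated with a simple count: the share of annotations that hit an ontology label gives a coverage figure, and the most frequent unmatched values are candidates for new terms. This is a sketch under stated assumptions; exact, case-insensitive matching stands in for whatever matching method is chosen, and the function name and sample data are hypothetical.

```python
from collections import Counter


def evaluate_coverage(annotations, ontology_labels):
    """Return (percent of annotations matched, unmatched values by frequency)."""
    known = {label.strip().lower() for label in ontology_labels}
    counts = Counter(a.strip().lower() for a in annotations)
    total = sum(counts.values())
    matched = sum(n for value, n in counts.items() if value in known)
    coverage = 100.0 * matched / total if total else 0.0
    missing = Counter({v: n for v, n in counts.items() if v not in known})
    return coverage, missing.most_common()


# Hypothetical free-text annotations and ontology labels.
annotations = ["liver", "Liver", "root", "root", "leaf", "kidney"]
coverage, suggestions = evaluate_coverage(annotations, ["liver", "kidney"])
print(f"{coverage:.1f}% of annotations matched")
print(suggestions)  # unmatched values, most frequent first
```

Running the same annotations against several ontologies gives a comparison of the kind shown in the percent-matches chart.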
Developing the Ontology
• Define Scope: ArrayExpress already has some useful
structure given the current database plus rich source of
use cases and competency questions.
• Build: Ontology Capture: Identify key concepts and
relationships within our domain and give explicit
definitions to these features:
• Middle-out approach – specify core of basic terms then specialise
and generalise as required
• Mappings – text mining approach to do initial semi-automated
mappings to external resources for rapid coverage
• Manual mapping for data warehouse data, and selected data sets
06.04.2016 ArrayExpress Ontology Development and Future Directions
Capture to Code: Definitions and Hierarchy
Semantic Roadmap
• Position of the ArrayExpress Experimental Factor
Ontology in the ‘bigger picture’
• Key is orthogonal coverage, reuse of existing resources
and shared frameworks
[Diagram: the AE Ontology shown alongside NCI, Cell Type Ontology, Disease Ontology, Chemical Entities of Biological Interest (ChEBI), Common Anatomy Reference Ontology, and various species anatomy ontologies]
Wish list
• NOT to build our own anatomy ontology
• CARO extension
• CARO evaluation
• Mapping CARO to relevant multi-species ontologies
• Application of CARO to ArrayExpress data
• Use of CARO in ArrayExpress tools
Acknowledgments
• Anna Farne
• Ele Holloway
• James Malone
• Margus Lukk
• Helen Parkinson
• Tim Rayner
• Faisal Rezwan
• Eleanor Williams
• Mengyao Zhao
• Holly Zheng
ArrayExpress Production Team