Icbo_2011_coriell

Download Report

Transcript Icbo_2011_coriell

Rapid Development of an Ontology of
Coriell Cell Lines
Chao Pang, Tomasz Adamusiak, Helen Parkinson and
James Malone
[email protected]
Overview
•
•
•
•
•
Motivation – our EBI use cases, big stuff in little time
Methods – normalization, axiomatisation, automation
Results – large flat ontology, adding structure
Desiderata
Future work
Master headline
Motivation
• Lot of large (often online) catalogues
• Infeasible to manually ontologise all of them (and cost is
not easy to justify)
Master headline
High quality artefacts take time
$money$
75k classes
FMA
30 person years
33k classes
Gene
Ontology
Since 1999
3k classes
OBI
Since 2006
Funders
Master headline
Motivation
• Would like to include these in BioSample Database for
curating data, querying, browsing
• Ontology use has proven to add value (e.g. GXA)
• RDF representations of data for integration
Master headline
Methods: A Model for Cell Lines
•
•
Agreed with cell line ontology folk
With addition of is_model_for, to reflect use of cell lines as models for
particular diseases (they can be used as models for diseases other
than the original disease borne by the subject)
Master headline
Methods: Programmatic Engineering
•
•
Lexical concept recognition (2) - Map the terms to existing ontologies
by either class label or synonyms in order to get identifiers (e.g. label –
Homo sapiens and synonym – human)
Produce a list of classes from reference ontologies (3) e.g.
•
•
•
Map organism terms: NCBI taxonomy ontology
Map disease terms: Disease ontology
Map these to an upper ontology and axiomatic structure (4) cell
line model on previous slide
Master headline
Methods: Our “raw” data
• Coriell database dump
• Extract information for cell line, cell type, anatomy,
organism, sex and disease. (27002 cell lines)
Lexical Concept Recognition
• Our mapper tool was used, adapted the Levenshtein distance
algorithm <http://www.ebi.ac.uk/efo/tools>
• Mapping result:
•
•
•
•
•
93 organism - NCBI taxonomy ontology
61 anatomy – MAT (species neutral), FMA and other anatomy
23 cell type – cell ontology
337 disease - Disease ontology (DO) (via OMIM)
Check the mapping file manually
• Inconsistency e.g. fibroma as anatomy
• Unmapped terms
• Unclear information e.g. buttock-thigh
Using OWL-API to build ontology automatically
Example: import class cervix from EFO
It has parent class organism part
It has class relations
part_of some female reproductive system
Mapping files
Reference ontologies
Step one:
Adding classes
buildingOntology.jav
Coriell cell line protocol
Spreadsheet.txt
Step two:
Adding restriction
Hela cell line
Coriell cell line Ontology
Master headline
Adding Structure to Normalized Hierarchy
Query: human cancer cell line
Coriell cell line ontology
Master headline
Explanation of query: human cancer cell line
Cancer
Cell line
any cell type
Master headline
any organism part
human
Results In Brief
• Large (~28,000) cell line ontology, lots of axiomatisation
for the axes mentioned (disease, organism, cell type,
anatomy some extra bits like gender)
• Running code took ~5 minutes
• Flat asserted hierarchy under cell line; structure is all
inferred using axioms as per Rector Normalization
• Makes more flexible to change and also querying more
powerful
Master headline
BioPortal Analysis by Matt Horridge et al
• http://bio-ontologies.knowledgeblog.org/135
• “the Coriell Cell Line Ontology, which contains over 45,000
non-trivial entailments, and an average of 4 justifications
per entailment, peaking out at 65 justifications for one
particular entailment“
• Coriell is quite rich axiomatically – this is what we would
expect given the normalization method
• Suggestion rich ontologies can be created rapdily
Master headline
Discussion
•
Good bits:
•
•
•
•
•
Adding new classes was simple once code was done
Re-running the code also simple if a part of pattern was changed
Can be reused for different ontology with similar sort of input
Axiomatisation gives richer querying and renders implicit knowledge
explicit
Harder bits:
• Some data clean up necessary before using the code (SQL dumps can
be messy)
• Eliminating errors from concept matching (though expansion on
synonyms helped)
• Textual definitions for all classes
• Size of the ontology created a few issues, e.g. BioPortal not able to
browse, most reasoners can not complete on the ontology
Master headline
Conclusions
• To build very big ontologies we need either
• Lots of resource (i.e. money)
• Lots of people (crowd-sourcing)
• Good tools and methods
• Each having pros and cons
• Lots of money requires, well, lots of money
• Lots of people requires community buy in and ‘trust’ (so
audit trail crucial) – we’re not always good at that
• Automation can introduce errors and require some curation
Master headline
Desiderata: Querying and External URIs
• Easier to use interfaces, users like it simple
• One of top user feedbacks is “we’d like it to work like
Google does”
• Users also like putting URIs into a webpage and getting
something sensible back i.e.:
Master headline
Future Work
• Creating text definitions
• Will follow up on work by Stevens R, Malone, J et al (2011)* for
creating natural language definitions from OWL
• Text mining to extract missing disease information from
textual descriptions from original catalogue
• Begin curating data for BioSample Database at EBI using
ontology
• Integrate Coriell ontology with rest of process from sample
to analysis, e.g. software ontology
http://theswo.sourceforge.net
*Robert Stevens, James Malone, Sandra Williams, Richard Power, Alan Third
(2011) Automating generation of textual class definitions from OWL to English.
Journal of Biomedical Semantics, 2011 May 17, Vol. 2 Suppl 2:S5.
Master headline
Acknowledgements
• Thank to Chao Pang, Helen Parkinson,Tomasz
Adamusiak, Paulo Silva and all people from functional
genomics production team.
• Gen2phen
• The Coriell Institute for Medical Research
• Alan Ruttenberg and Science Commons for providing the
Coriell SQL dump
• Lynn Schriml and colleagues from the Disease Ontology
for OMIM mappings
• Sirarat Sarntivijai, Oliver He, Alexander Diehl and Terry
Meehan for discussions on the cell line model.
Master headline