Mapping Domain Thesauri to the CIDOC CRM for Semantic

Mapping domain thesauri to the CRM
to assist the semantic interoperability
of data archives
Doug Tudhope
Hypermedia Research Unit
University of Glamorgan
CIDOC CRM SIG Workshop, Imperial College, 2006
Presentation
• FACET Project with Science Museum
– Thesaurus-based query expansion
with NMSI Collections database
– Semantic expansion
– Web Demonstrator
Extend to heterogeneous datasets and terminology systems
• DELOS pilot project demonstrator
– English Heritage upper ontology based on CRM
– Mapping English Heritage thesaurus and database to CRM
– Current work
FACET - Faceted Access to Cultural
hEritage Terminology
Aims:
• Integration of thesaurus into the interface
• Semantic expansion taking advantage of facet structure
http://www.comp.glam.ac.uk/~FACET/
FACET Collaborators
• Research Council Funding: EPSRC 3 years
• National Museum of Science and Industry (NMSI):
– National Railway Museum and Science Museum Collections Database
• J. Paul Getty Trust
– Art and Architecture Thesaurus (AAT)
• Museum Documentation Association (MDA)
– Railway Thesaurus
• Canadian Heritage Information Network (CHIN)
– Advisors
NRM Collection
examples of free text object descriptor fields
• Chair, London Midland & Scottish Railway, straight wooden back initials carved on back, green leatherette seat.
• Chair, Railway Clearing House, Curved back with blue leather inset & blue leather seat. R. C.H. carved on back
• Chair, M.S. & L.R., Straight back, blue leather seat with M.S. & L.R. carved across back
• Armchair, Pullman, green plush, fringed from Pullman section.
• Carver chair, Oak with oval brocade seat. Prince of Wales crest on back from Royal Saloon of 1876
• Armchair, Upholstered in blue maquette with curved, buttoned back & scroll arms. Wooden legs
• Occasional table, Oak with drawer, ornately carved. From Royal Saloon of 1876
• Set of 4 chairs, High-backed carver chairs upholstered in floral maquette
• Clock, made by Jno Walker, 250 Regent Street. Metal face/Roman numerals. Carved wooden square case. 20"x18"x10"
Indexed Example from NRM Collection
ID: 1975-7309
Description: Armchair, Upholstered in blue moquette with curved, buttoned back & scroll arms. Wooden legs
Item name(s): armchairs (AAT Hierarchy: Furnishings)

Part     Aspect    Term           (AAT Hierarchy)
overall  physical  upholstering   Processes & techniques
overall  material  moquette       Materials
overall  colour    blue           Color
legs     material  wood           Materials
back     shape     curved         Physical attributes
back     physical  buttoning      Processes & techniques
arms     shape     scrolled arms  Components
Types of Knowledge Organisation System (KOS)
adapted from Zeng & Salaba: FRBR Workshop, OCLC 2005
• Relationship Groups: Ontologies, Semantic networks, Thesauri
• Classification & Categorization: Classification schemes, Taxonomies, Categorization schemes, Subject Headings
• Term Lists: Synonym Rings, Authority Files, Glossaries/Dictionaries, Gazetteers, Pick lists
Axis: from natural language to controlled language
Semantic Expansion
Expanding over thesaurus semantic relationships
allows the system to play an active role
• Ranking of matching results by semantic closeness
• Query Expansion (automatic/interactive)
• Augmented Browsing tools
Underpinning technologies:
• Measures of distance over the semantic index space
• Multi-concept Matching Function
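A minimal sketch of how a distance measure over thesaurus relationships might drive expansion and ranking. The thesaurus fragment, the traversal costs per relationship type and the threshold are all illustrative assumptions, not the values used in FACET; the search is a standard shortest-path traversal, swapped in here for whatever the project actually implemented.

```python
import heapq

# Hypothetical thesaurus fragment: term -> list of (relationship, related term).
THESAURUS = {
    "chairs": [("NT", "armchairs"), ("NT", "carver chairs"), ("BT", "seating furniture")],
    "armchairs": [("BT", "chairs")],
    "carver chairs": [("BT", "chairs")],
    "seating furniture": [("NT", "chairs"), ("NT", "benches")],
    "benches": [("BT", "seating furniture")],
}

# Assumed traversal cost per relationship type (illustrative values only).
COSTS = {"NT": 1.0, "BT": 1.5, "RT": 2.0}


def expand(term, threshold=3.0):
    """Return {term: closeness} for every term reachable within `threshold`.

    Closeness is 1.0 for the start term and falls linearly to 0.0 at the threshold.
    """
    best = {term: 0.0}
    queue = [(0.0, term)]
    while queue:
        dist, current = heapq.heappop(queue)
        if dist > best.get(current, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for rel, neighbour in THESAURUS.get(current, []):
            new_dist = dist + COSTS[rel]
            if new_dist <= threshold and new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    return {t: 1.0 - d / threshold for t, d in best.items()}


def ranked_expansion(term, threshold=3.0):
    """Expanded terms sorted by decreasing semantic closeness."""
    return sorted(expand(term, threshold).items(), key=lambda item: (-item[1], item[0]))
```

Raising the threshold widens the expansion; the same closeness values could order matching results or populate a browsing tool.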
Faceted Knowledge Organisation Systems
Faceted classifications based on primary division
into fundamental, high-level categories (facets)
Compound descriptors (multi-concept headings) are synthesised
by combination of terms from limited number of fundamental facets
In constructing AAT, adjectival noun phrases very common:
e.g. painted oak furniture
“Rather than enumerate the nearly infinite number of object and
subject descriptions needed by thesaurus users, the AAT decided to
pursue the building blocks of these descriptors in the form of a faceted
vocabulary”
(Guide to Indexing and Cataloging with the Art & Architecture Thesaurus)
Matching Problem
“The major problem lies in developing a system whereby individual parts of
subject headings containing multiple AAT terms are broken apart, individually
exploded hierarchically, and then reintegrated to answer a query with
relevance”
(Toni Petersen, AAT Director)
Query: mahogany, dark yellow, brocading, Edwardian, armchair
Descriptor: oak, light yellow, crests, ovals, brocade, Victorian, Carver chair
Potentially extra / missing / partially and non-matching terms
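One way to picture a multi-concept matching function for the query and descriptor above: score each query term against its closest descriptor term, then average. This is a sketch under stated assumptions. The pairwise closeness values are invented for illustration (in practice they would come from thesaurus traversal), and the averaging rule is one simple choice, not necessarily the FACET function.

```python
# Assumed pairwise closeness values in [0, 1]; illustrative, not FACET output.
CLOSENESS = {
    ("mahogany", "oak"): 0.6,
    ("dark yellow", "light yellow"): 0.7,
    ("brocading", "brocade"): 0.5,
    ("Edwardian", "Victorian"): 0.6,
    ("armchair", "Carver chair"): 0.8,
}


def closeness(a, b):
    """Symmetric lookup; identical terms score 1.0, unknown pairs 0.0."""
    if a == b:
        return 1.0
    return CLOSENESS.get((a, b), CLOSENESS.get((b, a), 0.0))


def match_score(query, descriptor):
    """Average, over query terms, of the best closeness to any descriptor term.

    Extra descriptor terms are ignored; query terms with no counterpart score 0.
    """
    if not query:
        return 0.0
    best_per_term = [
        max((closeness(q, d) for d in descriptor), default=0.0) for q in query
    ]
    return sum(best_per_term) / len(query)


query = ["mahogany", "dark yellow", "brocading", "Edwardian", "armchair"]
descriptor = ["oak", "light yellow", "crests", "ovals", "brocade", "Victorian", "Carver chair"]
score = match_score(query, descriptor)
```

Here the extra descriptor terms (crests, ovals) do not lower the score, while every query term finds only a partial match, so the item would rank below an exact match but above an unrelated one.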
System Architecture
• Application interfaces
– Compiled VB client interface and web browser interface
• Application data objects
– Persistent XML data: queries, parameters etc.
– Query and matching functions
– Expansion engine (and data structure)
• Data access components
– Database interaction module
– Active-X Data Objects (ADO)
• Database
– Transact SQL Stored Procedures
– SQL Server Databases: collections & thesaurus
FACET standalone system
http://www.comp.glam.ac.uk/~facet/webdemo/
[email protected]
FACET Web Demonstrator
• Illustrates thesaurus based expansion and faceted search
• Intended as an exploration of FACET research outcomes
via dynamically generated Web components
rather than a complete final interface
• Based on custom API for thesaurus programmatic access
• Browser-based interface (ASP application), using a combination
of server-side scripting and compiled components
http://www.comp.glam.ac.uk/~FACET/webdemo/
http://jodi.tamu.edu/Articles/v04/i04/Binding/
FACET Web Demonstrator
Semantic Query Expansion
Some lessons learned
• Results show potential of faceted KOS for
– Query expansion with semantically ranked results
– Real-time implementation of multi-concept matching function
– Semantic expansion as a browsing tool
– Potential to combine with statistical and linguistic techniques
How to generalise?
Need for
• Common KOS representations and APIs
• Semantic mapping between different databases and KOS
Semantic Interoperability
• NMSI’s different museums and collections
held in a single collections database
• Easy to express connections between thesaurus hierarchies
and DB fields
What if searching across different DBs and KOS?
• E.g. English Heritage (EH): a single organisation
but a wide range of unconnected DBs and vocabularies
Mapping domain thesauri to the CRM
to assist the semantic interoperability
of data archives
• DELOS NoE mini-project on Ontology-driven interoperability
for Cluster on Knowledge Extraction & Semantic Interoperability
• Proof of concept demonstrator for exploring retrieval potential of
mapping domain KOS to upper ontology (CIDOC CRM)
• In collaboration particularly with FORTH, University of Lund and
English Heritage (Keith and Sarah May)
• Investigate integration of datasets - for assisting archaeological
search and information extraction
Background
• Current EH situation one of fragmented datasets and
applications, with different terminology systems
• Interpretation may not consist of same terms as context
• Searchers from different scientific perspectives may not use
same terminology
• Need for integrative metadata framework
EH have designed an upper ontology based on CRM standard
• Work to date focused on modelling
Databases not meaningfully connected
• Even simply expressed queries currently difficult to answer,
due to lack of tools for cross database searching
"Specialists could only talk to [field] archaeologists
and not talk to each other".
(from discussion with a palaeoenvironmental archaeologist)
Wider questions arising from science analysis by finds
specialists often referred back to field archaeologist
since databases documenting different scientific aspects
not meaningfully connected
DELOS pilot project datasets
• English Heritage (and EH Data Services Unit) supplying various
databases and controlled vocabularies.
• Starting by connecting the new Environmental Archaeology Thesaurus
and (part of) the Environmental Archaeology Bibliography to EH-CRM
Environmental Archaeology Thesaurus
Scope Notes Extract (i)
• Altered by Animals
SN: Modification or damage by an animal
RT: Worked (use where modification is by humans in ASPECT)
• Anoxic
SN: Material preserved by exclusion of oxygen usually due to saturation with water which inhibits decay by micro-organisms
Non-preferred term: Waterlogged
• Burnt
SN: Use for material that has been burnt
• Calcined
SN: Material burnt at a high temperature (above 700 degrees centigrade) leaving only the mineral component.
Non-preferred term: cremated
BT: Burnt
RT: Cremation
• Charred
SN: Material that has been burnt and at least in part reduced to carbon as a result of burning in a reducing atmosphere below 500 degrees C.
Non-preferred term: Carbonised
BT: Burnt
• Silicified
SN: Use for material that has been burnt at high temperatures in a good air supply such that only silica component remains
BT: Burnt
……
• Mineral Replaced
SN: Replacement of organic material by minerals, including calcium carbonate and calcium phosphate
Non-preferred term: Mineralised, Fossilised
• Mineral Preserved
SN: Preservation of material by the toxic effect of corrosion products in the immediate vicinity, or within,
Environmental Archaeology Thesaurus
Scope Notes Extract (ii)
• Arthropods
SN: Use for remains of arthropods in general, including woodlice, spiders, insects etc. Please note crustaceans have been included under this category.
BT: Invertebrates
NT: Cladocerans, Crustaceans (Decapods), Insects, Mites, Ostracods
• Cladocerans
SN: Group of fresh water crustaceans which include the water fleas (Daphnia ssp.) the egg cases (ephippia) of which are found in archaeological deposits (EH Guidelines for Environmental Archaeology)
BT: Arthropods
• Crustaceans (Decapods)
SN: Use for the remains of shrimps, prawns, crabs and lobsters
BT: Arthropods
• Insects
SN: Use for the remains of any part of an insect (MDA Object Thesaurus)
Non-preferred term: Beetles, Coleoptera
• Mites
SN: Related to spiders. Use for ticks and true mites. Mites are widely present in archaeological deposits but are rarely studied in detail as they are difficult to identify (Kenward, forthcoming)
BT: Arthropods
• Ostracods
SN: Small crustaceans ranging in size from 0.2mm to 30mm and possessing a bivalve carapace or ‘shell’. They live in salt-water, brackish and freshwater and are used to help to reconstruct aquatic conditions e.g. pollution, degree of salinity
BT: Arthropods
EAT-Draft scope notesv6.doc
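The scope-note entries above follow the usual thesaurus pattern (preferred term, SN, BT, RT, non-preferred terms). A minimal sketch of how a few of them could be held in a program, so that a searcher's entry term resolves to the preferred term and narrower terms can be derived from the BT links. The class and function names are hypothetical; the terms are taken from extract (i).

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred: str
    scope_note: str = ""
    broader: list = field(default_factory=list)        # BT
    related: list = field(default_factory=list)        # RT
    non_preferred: list = field(default_factory=list)  # entry terms


CONCEPTS = [
    Concept("Burnt", "Use for material that has been burnt"),
    Concept("Calcined", broader=["Burnt"], related=["Cremation"], non_preferred=["cremated"]),
    Concept("Charred", broader=["Burnt"], non_preferred=["Carbonised"]),
    Concept("Silicified", broader=["Burnt"]),
    Concept("Anoxic", non_preferred=["Waterlogged"]),
]

# Case-insensitive index from any label (preferred or not) to its concept.
INDEX = {}
for concept in CONCEPTS:
    for label in [concept.preferred, *concept.non_preferred]:
        INDEX[label.lower()] = concept


def resolve(label):
    """Map a searcher's term to the preferred term, or None if unknown."""
    concept = INDEX.get(label.lower())
    return concept.preferred if concept else None


def narrower(preferred):
    """Derive NT links by inverting the recorded BT links."""
    return sorted(c.preferred for c in CONCEPTS if preferred in c.broader)
```

For example, `resolve("waterlogged")` leads to the preferred term Anoxic, and `narrower("Burnt")` collects Calcined, Charred and Silicified.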
EH extension to CRM
• Currently in pdf file
• Need to represent in machine readable format
Example of CRM - Thesaurus connection
(by EH collaborators)
• FlotationSampleResidueType – EH_E0067
CRM entity
E55: Type
Classification of flot and/or residue contents
• Mapping:
Use Arch Science Thesaurus Terms:
Object type, Material type, Modification state, Aspect
Example CRM - Thesaurus connection 2
• ContextSampleType – EHE0053
• CRM entity E55: Type
• Derived from the Environmental guidelines list
Samples taken will be of a particular type depending upon the
technique that will be used to analyse them.
• For Specialist Scientific Sampling it would be appropriate to use
Archaeological Science Thesaurus terms for “Investigative
Techniques”, but for samples taken by non-specialists the
investigative technique may not be known at the point of
sampling.
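The two connections above could be recorded in a simple machine-readable form along the following lines. This is only a sketch: the record layout and function names are assumptions, the identifiers are copied as written on the slides, and a real encoding of the EH-CRM would more likely use RDF or a similar ontology language.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EntityMapping:
    code: str               # EH-CRM identifier as written on the slides
    name: str
    crm_entity: str         # CRM class the EH entity specialises
    thesaurus_facets: list  # thesaurus hierarchies that supply values


MAPPINGS = [
    EntityMapping(
        "EH_E0067", "FlotationSampleResidueType", "E55 Type",
        ["Object type", "Material type", "Modification state", "Aspect"],
    ),
    # Applies to specialist scientific sampling only; for non-specialist
    # samples the technique may be unknown at the point of sampling.
    EntityMapping("EHE0053", "ContextSampleType", "E55 Type", ["Investigative Techniques"]),
]


def to_json(mappings):
    """Serialise the mappings so other tools can read them."""
    return json.dumps([asdict(m) for m in mappings], indent=2)


def facets_for(name, mappings=MAPPINGS):
    """Thesaurus facets connected to the named EH-CRM entity, if any."""
    for m in mappings:
        if m.name == name:
            return m.thesaurus_facets
    return []
```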
Current Work - Proof of concept demonstrator
• Express EH-CRM in machine-readable form
• Add connections for databases and thesauri to EH-CRM
Demonstrator – first steps
• Express user information need in terms of EH-CRM
• Identify database and thesaurus entities (if any)
from extended EH-CRM
• Drive search from this information
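The first steps listed above might be sketched as follows: take an information need expressed as an EH-CRM entity plus a term, look up the database fields and thesaurus connected to that entity, and produce a search plan. The database name, field name and link table are hypothetical stand-ins, since the slides do not give the actual fields or the serialised EH-CRM, and pairing Burnt with this entity is an assumption for illustration.

```python
# Hypothetical links from an EH-CRM entity to database fields and a thesaurus.
ENTITY_LINKS = {
    "FlotationSampleResidueType": {
        "fields": [("env_arch_bibliography", "material_state")],
        "thesaurus": "Environmental Archaeology Thesaurus",
    },
}

# Narrower terms as they appear in the scope-notes extract.
NARROWER = {"Burnt": ["Calcined", "Charred", "Silicified"]}


def plan_search(entity, term):
    """Turn (EH-CRM entity, term) into one search spec per linked database field."""
    link = ENTITY_LINKS.get(entity)
    if link is None:
        return []  # no database or thesaurus is connected to this entity
    terms = [term, *NARROWER.get(term, [])]  # expand with narrower terms
    return [
        {"database": db, "field": fld, "terms": terms, "vocabulary": link["thesaurus"]}
        for db, fld in link["fields"]
    ]
```

Each returned spec could then be translated into a query against the named database; an entity with no recorded connections yields an empty plan.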
Next steps
• Involve other EH databases and vocabularies
• Connect very different datasets,
for example species taxonomies via plant names
• Extend to associated grey literature
(and FRBR indexed documents)
Contact Information
Doug Tudhope
School of Computing
University of Glamorgan
Pontypridd CF37 1DL
Wales, UK
[email protected]
http://www.comp.glam.ac.uk/pages/staff/dstudhope
References
Binding C., Tudhope D. 2004. KOS at your Service: Programmatic Access to Knowledge
Organisation Systems. JoDI 4(4), http://jodi.tamu.edu/Articles/v04/i04/Binding/
CIDOC CRM http://cidoc.ics.forth.gr/
DELOS Network of Excellence http://www.delos.info/
DELOS Knowledge Extraction & Semantic Interoperability http://delos-wp5.ukoln.ac.uk/
FACET Case Study, DigiCult Thematic Issue 6: Resource Discovery Technologies for the
Heritage Sector, http://www.digicult.info/pages/Themiss.php [pdf]
FACET Web demonstrator http://www.comp.glam.ac.uk/~FACET/webdemo/