
Developing an Ontology-based
Metadata Management System for
Heterogeneous Clinical Databases
By Quddus Chong
Winter 2002
Outline
Towards a clinical data warehouse
 Integrating heterogeneous data sources
 Clinical abstractions as Ontologies
 Managing database metadata
 The data mediator approach
 Using Protégé-2000

Towards a Clinical Data Warehouse


Clinical data warehousing applies data warehousing concepts so that clinical data about a large patient population can be analyzed for clinical quality management and medical research.
In a data warehouse environment, data has the following properties:
Data is organized by subject, or domain-level concept, rather than by function.
Data from various operational systems is integrated, by definition or by content.
Data is archived in non-volatile storage to allow temporal analysis.
Data is recorded with a temporal dimension (e.g. a timestamp).
Data is optimized for decision making (DSS) or analysis (OLAP).
Integrating Heterogeneous Data Sources



The main challenge in integrating data from heterogeneous
sources is in resolving schema and data conflicts.
Approaches to this problem include using a federated
database architecture, or providing a multi-database
interface. These approaches are geared more towards
providing query access to the data sources than towards
supporting analysis.
Types of data integration:
Physical integration – convert records from heterogeneous data sources into a common format (e.g. XML).
Logical integration – relate all data to a common process model (e.g. a medical service like ‘diagnose patient’ or ‘analyze outcomes’).
Semantic integration – allow cross-referencing of data, and possibly inference over it, with regard to a common metadata standard or ontology (e.g. HL7 RIM, DAML+OIL).
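As a minimal sketch of physical integration, the snippet below wraps records from two differently shaped sources in one common XML format. The source names, field names, values, and the `record_to_xml` helper are all hypothetical and chosen only for illustration; they do not come from the project.

```python
import csv
import io
import xml.etree.ElementTree as ET


def record_to_xml(source: str, record: dict) -> ET.Element:
    """Wrap one source record in a common <record> element."""
    elem = ET.Element("record", {"source": source})
    for name, value in record.items():
        child = ET.SubElement(elem, "field", {"name": name})
        child.text = str(value)
    return elem


# Hypothetical flat-file source: comma-separated text with a header row.
flat_file = io.StringIO("patient_id,weight_kg\n17,70.5\n")
flat_records = list(csv.DictReader(flat_file))

# Hypothetical relational source: a row already fetched as a mapping.
relational_records = [{"patient_id": 17, "systolic_bp": 120}]

# Both kinds of record end up under one root in the same format.
root = ET.Element("records")
for rec in flat_records:
    root.append(record_to_xml("flat_file", rec))
for rec in relational_records:
    root.append(record_to_xml("relational", rec))

xml_text = ET.tostring(root, encoding="unicode")
```

Note that this only unifies the format. The two records still use different attribute names and units, which is the gap that logical and semantic integration are meant to close.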
Clinical abstractions as Ontologies



An ontology is an explicit specification of the conceptualization of a domain. Information models (such as the HL7 RIM) and standardized vocabularies (such as UMLS) can be part of an ontology. An ontology is a core component of a Knowledge-Based System.
In the clinical research field, ontologies have been used in computerized guideline modeling. This allows the development of applications that provide recommendations (e.g. indications for the use of surgical procedures), identify deviations in practice, and offer screening services (e.g. evaluating patient eligibility).
Benefits of using ontologies include:
Facilitating sharing between systems and reuse of knowledge
Aiding new knowledge acquisition
Improving the verification and validation of knowledge-based systems
Managing database metadata




Metadata is a detailed description of the instance data: the format and characteristics of the populated instance data. Which instances and values are described depends on the requirements and role of the metadata recipient.
Metadata is used in locating information, interpreting information, and integrating/transforming data.
Maintaining a well-organized and up-to-date collection of the organization’s metadata is a great step towards improving overall data quality and usage. However, this task is complicated by the varying quality and formats of the metadata available (or not) from the heterogeneous data sources, and by inconsistency in how existing metadata is updated.
A common metadata architecture is essential to keeping data manageable.
The Data Mediator approach


In this project, we will attempt to develop an extensible and adaptable architecture that integrates heterogeneous data sources into a data warehouse environment using an ontology-based data mediator approach.
The components of this architecture include:
 Knowledge base – stores the ontology; consists of:
  The abstraction model – domain-level concepts
  The database description model – metadata record of data sources
  The mappings model – how data elements relate to attributes in the abstraction model
  The transformations model – metadata of available methods to transform data elements from one data source to another
 Data mediators – provide each data source with an interface to the warehouse and resolve data conflicts between differing representations; the necessary classes are generated from the ontology.
 Data warehouse – provides access to integrated data for analysis and decision-making.
Patient model (adapted from SMI Dharma model)

The patient-data information model defines the classes and attributes of patient data for an Electronic Patient Record (EPR).
The patient-data model consists of:
a Patient class whose instances hold demographic information about specific patients
a Note_Entry class that describes qualitative observations about patients
a Numeric_Entry class that represents results of quantitative measurements
an Adverse_Event class that models adverse reactions to specific substances
a Condition class that represents medical conditions that persist over time
two intervention classes, Medication and Procedure, that model drugs and other medical procedures that have been recommended, authorized, or used
The defining characteristic of entities in the patient-data model is
that they are assertions about demographic and clinical conditions
of specific patients.
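The class list above can be sketched as plain Python dataclasses. Only the class names come from the patient-data model; every attribute, the `Assertion` and `Intervention` base classes, and the sample values are assumptions made for illustration, not part of the Dharma model as presented here.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Patient:
    """Demographic information about a specific patient (attributes assumed)."""
    patient_id: str
    birth_date: date
    sex: str


@dataclass
class Assertion:
    """Assumed base class: every entry is an assertion about one patient."""
    patient_id: str
    recorded_at: datetime


@dataclass
class Note_Entry(Assertion):
    """Qualitative observation about a patient."""
    text: str


@dataclass
class Numeric_Entry(Assertion):
    """Result of a quantitative measurement."""
    measurement: str
    value: float
    unit: str


@dataclass
class Adverse_Event(Assertion):
    """Adverse reaction to a specific substance."""
    substance: str
    reaction: str


@dataclass
class Condition(Assertion):
    """Medical condition that persists over time."""
    name: str
    onset: date


@dataclass
class Intervention(Assertion):
    """Assumed common base for the two intervention classes."""
    name: str
    status: str  # "recommended", "authorized", or "used"


@dataclass
class Medication(Intervention):
    """Drug intervention."""


@dataclass
class Procedure(Intervention):
    """Other medical procedure."""


# Hypothetical sample data, for illustration only.
patient = Patient("p17", date(1950, 4, 2), "F")
entries = [
    Numeric_Entry("p17", datetime(2002, 1, 15, 9, 30), "systolic_bp", 120.0, "mmHg"),
    Condition("p17", datetime(2002, 1, 15, 9, 35), "hypertension", date(2001, 6, 1)),
    Medication("p17", datetime(2002, 1, 15, 9, 40), "example_drug", "authorized"),
]
about_patient = [e for e in entries if e.patient_id == patient.patient_id]
```

The class names keep the underscores used in the model (rather than Python's usual CamelCase) so they can be traced back to the list above.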
Database metadata model (adapted from Critchlow et al.)




The metadata model here contains the information needed for the data integration process.
The database description model contains language-independent class definitions that closely mirror the physical layout of a source database. In our prototype model, the database description is simply a class containing a set of database entries. A model is provided for two distinct entry types: field-entries (from flat-file data sources) and column-entries (from relational data sources). Entries are essentially instances of the attribute class.
Modeling the database metadata as an ontology provides flexibility when trying to describe heterogeneous data sources. For instance, the model can be easily extended to describe native XML databases.
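A rough sketch of such a database description follows, with one class per entry type. All attribute names are assumptions. To show how column-entries might be filled in automatically, the example reads the catalog of an in-memory SQLite database; SQLite is used only because it ships with Python and is not one of the systems named in this presentation.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """One attribute of a data source (attribute names are assumed)."""
    name: str
    data_type: str


@dataclass
class FieldEntry(Entry):
    """Entry from a flat-file source."""
    position: int  # ordinal position of the field within a record


@dataclass
class ColumnEntry(Entry):
    """Entry from a relational source."""
    table: str
    nullable: bool


@dataclass
class DatabaseDescription:
    """A named set of entries describing one data source."""
    source_name: str
    entries: List[Entry] = field(default_factory=list)


def describe_sqlite(conn: sqlite3.Connection, source_name: str) -> DatabaseDescription:
    """Build a description from SQLite's own catalog."""
    desc = DatabaseDescription(source_name)
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        # The table name comes from the catalog itself, not from user input.
        for _cid, name, col_type, notnull, _default, _pk in conn.execute(
            f"PRAGMA table_info({table})"
        ):
            desc.entries.append(ColumnEntry(name, col_type, table, not notnull))
    return desc


# Hypothetical relational source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lab_result (patient_id INTEGER NOT NULL, glucose_mg_dl REAL)"
)
relational_description = describe_sqlite(conn, "source_db_1")

# Hypothetical flat-file source, described by hand.
flat_description = DatabaseDescription(
    "source_file_1",
    [FieldEntry("patient_id", "text", 0), FieldEntry("weight_kg", "text", 1)],
)
```

Extending the model to another kind of source would then amount to adding another `Entry` subclass, which is the flexibility the ontology-based description is meant to offer.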
How the models are used in data integration:
 The source database attributes are mapped to the appropriate
abstraction characteristic through mappings. When an abstraction
defines multiple representations for the same characteristic attribute,
transformation functions are defined to convert between them.
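The interplay of mappings and transformations can be sketched as follows. The example assumes a single abstraction characteristic, `body_weight`, with two representations (kilograms and pounds). The source names, attribute names, and the `mediate` function are hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Mapping:
    """Relates one source attribute to an abstraction characteristic."""
    source: str
    attribute: str
    characteristic: str
    representation: str


# Hypothetical mappings for two sources that record weight differently.
MAPPINGS = [
    Mapping("source_db_1", "weight_kg", "body_weight", "kg"),
    Mapping("source_db_2", "wt_lb", "body_weight", "lb"),
]

LB_TO_KG = 0.45359237  # exact by definition of the international pound

# Transformation functions keyed by (characteristic, from, to).
TRANSFORMATIONS: Dict[Tuple[str, str, str], Callable[[float], float]] = {
    ("body_weight", "lb", "kg"): lambda lb: lb * LB_TO_KG,
    ("body_weight", "kg", "lb"): lambda kg: kg / LB_TO_KG,
}

# Representation the warehouse expects for each characteristic (assumed).
TARGET_REPRESENTATION = {"body_weight": "kg"}


def mediate(source: str, row: dict) -> dict:
    """Translate one source row into abstraction-level characteristics."""
    result = {}
    for m in MAPPINGS:
        if m.source != source or m.attribute not in row:
            continue
        value = row[m.attribute]
        target = TARGET_REPRESENTATION[m.characteristic]
        if m.representation != target:
            # Representations differ, so look up the transformation function.
            convert = TRANSFORMATIONS[(m.characteristic, m.representation, target)]
            value = convert(value)
        result[m.characteristic] = value
    return result


from_db_1 = mediate("source_db_1", {"weight_kg": 70.0})
from_db_2 = mediate("source_db_2", {"wt_lb": 100.0})
```

Keeping mappings and transformations as data rather than hard-coded logic is what would let a mediator be generated from the ontology instead of written by hand for each source.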
A prototype architecture
[Diagram: components and annotations recovered from the slide]
Source db 1 (Relational DBMS, e.g. MySQL), connected through Mediator Interface 1
Source db 2 (Object-Relational DBMS, e.g. PostgreSQL), connected through Mediator Interface 2
Ontology Server, holding Abstractions, Data Descriptions, Data Mappings and Transformation Descriptions
Warehouse Mediator
Target db (Data Warehouse environment, e.g. SQL Server)
*possible use of JDBC metadata to obtain db descriptions
*alternatively, a common metadata exchange standard such as XMI could be used
*XML data binding could be used to generate APIs for data validation or transformation
*ontologies can be created and modified via Protégé-2000 tool; underlying format is RDF
*abstraction model in the ontology is extensible to any domain
*key goal: develop the ontology server as a component, use EJB or .NET
*possible use of XSLT to perform data transformations
Using Protégé-2000


Protégé-2000 is an experimental knowledge-acquisition tool, written in Java, that allows users to import, export and create their own ontologies.
The tool itself is extensible; a programming development kit with instructions is available for creating plug-ins:
‘tabs’ – user interface between an ontology model in Protégé and another knowledge-based application.
‘slot-widget’ – user interface for viewing and acquiring slot values for new instances.
backend plug-ins – specify the mechanism that Protégé-2000 will use to store the ontology.
Screenshot: Creating the classes and slots of an ontology
Screenshot: Viewing the newly created ontology model
References



Pedersen T. B., Jensen C. S., “Research Issues in Clinical Data Warehousing”, In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pp. 43-52, July 1998 (available online: http://citeseer.nj.nec.com/pedersen98research.html)
Critchlow T., Ganesh M., Musick R., “Meta-Data Based Mediator Generation”, In Proceedings of the 3rd IFCIS Conference on Cooperative Information Systems, August 1998 (available online: http://citeseer.nj.nec.com/critchlow98metadata.html)
Tu S. et al., “A Flexible Approach to Guideline Modeling”, AMIA Annual Symposium, 1999 (available online: http://smiweb.stanford.edu/pubs/SMI_Abstracts/SMI-1999-0793.html)