What is Ontology? - School of Computing

Towards Mutual Understanding:
Ontologies, Ontology Matching,
and their Applications
Jingshan Huang
Assistant Professor
School of Computer and Information Sciences
University of South Alabama
http://cis.usouthal.edu/~huang/
CIS Department @ UO
Eugene, OR
May 21, 2010
Presentation Outline
• Research Motivation
• Learning-Based Ontology Matching – SOCCER
• Ongoing Research
• Summary
Research Motivation – Overview
• Information from heterogeneous sources has different semantics
– Long (English)
– Long (Chinese Pinyin) -> 龙 -> dragon
• Integrating the information from heterogeneous sources must make use of all available clues, including syntax, semantics, context, and pragmatics
• Ontologies are a formal model to encode semantics
• Ontological techniques are critical in semantic integration
Quick Facts
• What is Ontology?
– a computational model of some domain of the world
– describes the semantics of the terms used in the domain
– often captured in the form of a DAG (directed acyclic graph)
– a finite set of concepts + properties + relationships
• What is Ontology Heterogeneity?
– an inherent characteristic of ontologies developed by different parties for the same (or
similar) domains
– the heterogeneous semantics may occur in different ways
(1) different terms could be used for the same concept;
(2) an identical term could be adopted for different concepts;
(3) properties and relationships could be different
“translation” is far from good enough, not even close…
• What is Ontology Matching?
– a.k.a. “Ontology Alignment” or “Ontology Mapping”
– the process of determining correspondences between concepts from heterogeneous
ontologies
– involving many different relationships, e.g., equivalentWith, subClassOf, superClassOf,
and siblings
Heterogeneity in Ontologies – A Simple Example
Formal definition of ontologies
• A knowledge representation model of some portion of the world
• It reflects its designers’ conceptual views
Ontology = Concepts + Relationships + Constraints
1. Concept – a category, e.g., “President”
2. Property – maps between concepts and data types, e.g., “gender” of “President”
3. Relationship – maps between concepts, e.g., “President” is a subClassOf “People”
4. Constraint – on properties or relationships, e.g., “gender”: range = “male”
Concept semantics: name + properties + relationships
[Figure labels: President, sex, Person, female or male]
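A minimal sketch of the definition above, in Python. The class name `Concept` and its field names are illustrative assumptions, not part of the talk. The first instance uses the terms quoted on the slide. The second uses the labels that survive from the slide's figure, and how those labels are grouped into one concept is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology concept: name + properties + relationships."""
    name: str
    # property name -> allowed range (the constraint on that property)
    properties: dict[str, set[str]] = field(default_factory=dict)
    # names of direct superclasses (subClassOf relationships)
    superclasses: list[str] = field(default_factory=list)


# One designer's view, using the terms quoted on the slide.
president_a = Concept("President", {"gender": {"male"}}, ["People"])
# A second view, using the figure labels (grouping assumed).
president_b = Concept("President", {"sex": {"female", "male"}}, ["Person"])

# Same concept name, yet no property name or superclass name is shared.
shared_properties = president_a.properties.keys() & president_b.properties.keys()
shared_superclasses = set(president_a.superclasses) & set(president_b.superclasses)
print(shared_properties, shared_superclasses)
```

Both intersections come out empty, which is the heterogeneity the slide illustrates: a purely string-based comparison of properties and superclasses finds nothing in common even though both designers mean the same concept.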
Heterogeneity in Ontologies – Running Example
The Semantic Web
Heterogeneity in Ontologies – Running Example (cont.)
1. Typing “professor university” into Swoogle returns 129 different results
2. All created and maintained by ontology professionals
Research Motivation – Summary
• Semantic integration is important in Computer
Science and Information Technology
• Ontologies are the foundation for semantic
integration; at the same time, they are
inherently heterogeneous
• The only way out – match/align ontologies so that
their different semantics can be understood
Ontology matching is far from being
solved despite its importance and the
number of researchers that have
investigated it
Classification For Current Algorithms
1. Rule-Based Matching
– Consider schema information alone
– Specify a set of rules
– Apply them to schema information
2. Learning-Based Matching
– Consider both schema and instances
– Apply different machine learning
techniques
Pros and Cons for Current Approaches
1. Rule-Based Matching
– Is relatively fast (pro)
– Ignores instance information (con)
– Uses ad hoc predefined weights (con)
concept semantics: name + properties + relationships
2. Learning-Based Matching
– Obtains extra clues from instances (pro)
– Runs longer (con)
– Has difficulty in getting sufficient instances (con)
Presentation Outline
• Research Motivation
• Learning-Based Ontology Matching – SOCCER
• Ongoing Research
• Summary
SOCCER (Similar Ontology Concept
ClustERing) – a learning-based algorithm
• Challenges and main idea
• Details
• Evaluation
Problems with Existing Matching Algorithms
• Rule-Based Matching
– Ignores instance information
– Requires ad hoc predefined weights
• Learning-Based Matching
– Runs longer
– Has difficulty in getting sufficient instances
Try to:
1. Adopt machine learning techniques to avoid ad hoc
predefined weights
2. Base learning on schema information alone to avoid the
difficulty in getting sufficient instances
The goal:
To find equivalent concept pairs among different ontologies, which is the first
and most critical step in semantic integration
Challenges
It is very difficult for machines to learn how to match
ontologies when given schema information alone
1. Diversities in terminology
2. Diversities in relationships
Current learning-based algorithms all make use of instances
to some degree
Anecdotally, instances usually have much less variety than
schemas
Main Idea of SOCCER
• Equivalent concepts from different ontologies
tend to stay “closer” to each other in a
clustering space with structural dimensions
• Each cluster contains a number of concepts that
are from different ontologies and are equivalent
to each other
• SOCCER aims at finding such clusters by
exploiting ontology schemas alone
Details – Overview
• Build a three-dimensional vector for each
concept, corresponding to name, properties, and
relationships
• Calculate the similarity between pairwise
concepts
• Apply an agglomerative algorithm to generate
clusters
Therefore, SOCCER has two phases:
Phase I – weight learning
Phase II – clustering
SOCCER Phase I – learn weights (1)
Learning problem’s formal description
– Task T: match two ontologies
– Performance measure P: Precision, Recall, F-Measure, and Overall with regard to manual
matching
– Training experience E: a set of equivalent concept
pairs by manual matching
– Target function V: a pair of concepts
– Target function representation:
\[ \hat{V}(b) = \sum_{i=1}^{3} (w_i s_i) \]
SOCCER Phase I – learn weights (2)
• Hypothesis space: weight vector (w1, w2, w3)
• Learning objective: find the weight vector that best fits the training examples
• Training rule: delta rule
• Searching strategy: minimize the training error
SOCCER Phase I – learn weights (3)
• Similarity in concept names
\[ s_1 = 1 - \frac{d}{l} \]
d: edit distance between two strings
l: length of the longer string
• Similarity in concept properties
\[ s_2 = \frac{n}{m} \]
n: number of pairs of matched properties
m: smaller cardinality of lists p1 and p2
• Similarity in concept relationships (super/subClassOf)
s3: calculate the similarity values for pairwise
concepts in ancestor lists and choose the
maximum value
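The three similarity measures above can be sketched in Python as follows. The slide does not say how two properties are judged to be "matched" or how ancestors are compared. This sketch assumes both use the name similarity s1, with a hypothetical cut-off `threshold` for property matches. The function names are illustrative, not taken from SOCCER.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_similarity(n1: str, n2: str) -> float:
    """s1 = 1 - d / l, with l the length of the longer name."""
    longer = max(len(n1), len(n2))
    if longer == 0:
        return 1.0  # assumption: two empty names count as identical
    return 1.0 - edit_distance(n1, n2) / longer


def property_similarity(p1: list[str], p2: list[str], threshold: float = 0.8) -> float:
    """s2 = n / m: n matched property pairs, m the smaller list size."""
    m = min(len(p1), len(p2))
    if m == 0:
        return 0.0
    remaining = list(p2)
    n = 0
    for a in p1:
        for b in remaining:
            # Assumption: a pair matches when its name similarity is high enough.
            if name_similarity(a, b) >= threshold:
                n += 1
                remaining.remove(b)  # each property is matched at most once
                break
    return n / m


def relationship_similarity(anc1: list[str], anc2: list[str]) -> float:
    """s3 = maximum name similarity over all pairs of ancestors."""
    return max((name_similarity(a, b) for a in anc1 for b in anc2), default=0.0)
```

For example, `property_similarity(["name", "gender"], ["name", "sex", "age"])` matches only the `name` pair, so n = 1 and m = 2.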
SOCCER Phase I – learn weights (4)
• Overall similarity
\[ s = \sum_{i=1}^{3} (w_i s_i) \]
• Create a matrix M between O1 and O2 (n1 x n2)
1. cell[i, j] stores the similarity between the ith
concept in O1 and the jth concept in O2
2. wi’s are randomly initialized, and then
updated by the learning process
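A small Python sketch of the weighted sum and the n1 x n2 matrix. The per-pair triples (s1, s2, s3) in `feature_grid` are made-up toy values, and the function names are illustrative assumptions.

```python
import random


def overall_similarity(weights, features):
    """s = sum over i of w_i * s_i (name, properties, relationships)."""
    return sum(w * s for w, s in zip(weights, features))


def similarity_matrix(weights, feature_grid):
    """feature_grid[i][j] holds (s1, s2, s3) for concept i of O1 and concept j of O2."""
    return [[overall_similarity(weights, f) for f in row] for row in feature_grid]


# Hypothetical toy input: 2 concepts in O1, 3 concepts in O2.
feature_grid = [
    [(1.0, 0.5, 1.0), (0.2, 0.0, 0.3), (0.1, 0.0, 0.0)],
    [(0.3, 0.0, 0.2), (0.9, 1.0, 0.8), (0.2, 0.5, 0.1)],
]
rng = random.Random(0)                      # fixed seed so the run is repeatable
weights = [rng.random() for _ in range(3)]  # random initial weights, as on the slide
M = similarity_matrix(weights, feature_grid)
```

Keeping the three per-aspect similarities separate from the weights means the matrix can be recomputed cheaply each time the learning process updates the weights.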
SOCCER Phase I – learn weights (5)
Training error
\[ E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \]
\[ E(w) = \frac{1}{2} \sum_{d \in D} [(t_r - o_d) + (t_c - o_d)]^2 \]
Weight update rule
\[ \Delta w_i = \eta \sum_{d \in D} (t_d - o_d) s_{id} \]
\[ \Delta w_i = \eta \sum_{d \in D} [(t_r - o_d) + (t_c - o_d)] s_{id} \]
D: training example set
t_r: maximum value for row i
t_c: maximum value for column j
o_d: network output for a specific training example d
\eta: the learning rate
s_{id}: the s_i value for d
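One batch step of the row/column-maximum form of the update rule can be sketched in Python as below. It assumes the training examples are index pairs (i, j) of manually matched concepts and that the network output o_d is cell[i, j] of the current matrix. The toy grid and all function names are hypothetical.

```python
def similarity_matrix(weights, grid):
    """cell[i][j] = sum over k of w_k * s_k for the triple stored in grid[i][j]."""
    return [[sum(w * s for w, s in zip(weights, f)) for f in row] for row in grid]


def error_terms(weights, grid, pairs):
    """For each training pair d = (i, j): ((t_r - o_d) + (t_c - o_d), features)."""
    m = similarity_matrix(weights, grid)
    terms = []
    for i, j in pairs:
        o_d = m[i][j]
        t_r = max(m[i])                 # maximum value in row i
        t_c = max(row[j] for row in m)  # maximum value in column j
        terms.append(((t_r - o_d) + (t_c - o_d), grid[i][j]))
    return terms


def training_error(weights, grid, pairs):
    """E(w) = 1/2 * sum over d of [(t_r - o_d) + (t_c - o_d)]^2."""
    return 0.5 * sum(term ** 2 for term, _ in error_terms(weights, grid, pairs))


def update_weights(weights, grid, pairs, eta):
    """One batch step: w_i <- w_i + eta * sum over d of term_d * s_id."""
    new_weights = list(weights)
    for term, features in error_terms(weights, grid, pairs):
        for k, s_kd in enumerate(features):
            new_weights[k] += eta * term * s_kd
    return new_weights


# Hypothetical toy data: a 2 x 2 grid of (s1, s2, s3) triples. The diagonal
# cells are the manually matched (training) pairs.
grid = [
    [(0.2, 1.0, 1.0), (0.6, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.0, 1.0, 1.0)],
]
pairs = [(0, 0), (1, 1)]
weights = [1.0, 0.0, 0.0]  # start by trusting concept names only
before = training_error(weights, grid, pairs)
weights = update_weights(weights, grid, pairs, eta=0.5)
after = training_error(weights, grid, pairs)
print(before, after, weights)
```

In this toy case the first matched pair has dissimilar names but identical properties and ancestors, so the step shifts weight toward s2 and s3 and the training error drops. The sketch makes no claim about convergence in general.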
SOCCER Phase II – clustering (1)
• Apply the learned weights to recalculate similarity
matrices for pairwise ontologies
• Cluster similar concepts among a set of ontologies
Input: A set of ontologies and the corresponding matrices
1. Each concept forms a singleton cluster
2. Find two clusters, (a) and (b), with maximum similarity
3. If s[(a), (b)] > threshold, go to step 4; else go to step 7
4. Merge (a) and (b) into (a, b)
5. Update matrix: s[(a, b), (c)] = (s[(a), (c)] + s[(b), (c)])/2
6. Repeat steps 2 and 3
7. Output current clusters
The key is then to determine the threshold
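The seven steps above can be sketched in Python as follows. The slide does not say what similarity to use for two concepts from the same ontology, which have no matrix between them. This sketch assumes 0.0 for any missing pair. The concept ids, similarity values, and threshold are made up.

```python
from itertools import combinations


def cluster(concepts, sim, threshold):
    """Agglomerative clustering following the numbered steps.

    concepts: list of unique concept ids
    sim: dict mapping (a, b) -> similarity; missing pairs count as 0.0 (assumption)
    """
    clusters = [(c,) for c in concepts]                    # step 1: singletons
    s = {frozenset([(a,), (b,)]): v for (a, b), v in sim.items()}

    def get(x, y):
        return s.get(frozenset([x, y]), 0.0)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: get(*p))  # step 2
        if get(a, b) <= threshold:                         # step 3: stop merging
            break
        merged = a + b                                     # step 4
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:                                 # step 5: average update
            s[frozenset([merged, c])] = (get(a, c) + get(b, c)) / 2
        clusters.append(merged)                            # step 6: back to step 2
    return clusters                                        # step 7


# Hypothetical concepts from three ontologies O1, O2, O3.
concepts = ["O1:Person", "O2:People", "O3:Human", "O1:Course", "O2:Class"]
sim = {
    ("O1:Person", "O2:People"): 0.9,
    ("O1:Person", "O3:Human"): 0.8,
    ("O2:People", "O3:Human"): 0.7,
    ("O1:Course", "O2:Class"): 0.85,
    ("O1:Person", "O2:Class"): 0.1,
    ("O1:Course", "O2:People"): 0.1,
    ("O1:Course", "O3:Human"): 0.05,
    ("O2:Class", "O3:Human"): 0.05,
}
result = cluster(concepts, sim, threshold=0.5)
print(result)
```

With these values the three person-like concepts end up in one cluster and the two course-like concepts in another. Raising the threshold above 0.9 leaves every concept as a singleton, which is why the choice of threshold matters.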
SOCCER Phase II – clustering (2)
• Let the number of concepts in Oi be ni (i in [1, k])
• WLOG, suppose n1 is the largest one in ni’s
• Total number of clusters should be in
\[ \left[ n_1, \sum_{i=1}^{k} n_i \right] \]
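One reading of this bound is as follows. If no two concepts within a single ontology are equivalent, the largest ontology alone needs n1 clusters. If nothing is merged, every concept stays a singleton and there are as many clusters as concepts. A tiny Python sketch with hypothetical ontology sizes follows. The helper names are illustrative, and using the bound as a sanity check on a candidate threshold is an assumption rather than something the slide states.

```python
def cluster_count_bounds(sizes):
    """Return (lower, upper) = (max n_i, sum of n_i) for ontology sizes n_1..n_k."""
    return max(sizes), sum(sizes)


def is_plausible(num_clusters, sizes):
    """True when a clustering's size falls inside the expected interval."""
    lower, upper = cluster_count_bounds(sizes)
    return lower <= num_clusters <= upper


sizes = [12, 30, 25]  # hypothetical concept counts for three ontologies
lower, upper = cluster_count_bounds(sizes)
print(lower, upper)
```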
Evaluation Strategy
• The hypothesis: a set of clusters exists
across different ontologies
• Need to show:
1. Weight learning is correct
2. Resultant clusters are meaningful
Evaluation – test ontologies (1)
Test ontologies are eight independently developed, real-world ones
1. http://www.csd.abdn.ac.uk/~cmckenzi/playpen/rdf/akt_ontology_LITE.owl
2. http://www.mindswap.org/2004/SSSW04/aktive-portal-ontology-latest.owl
3. http://annotation.semanticweb.org/iswc/iswc.owl
4. http://www.mondeca.com/owl/moses/ita.owl
5. http://protege.stanford.edu/plugins/owl/owl-library/ka.owl
6. http://ontoware.org/frs/download.php/18/semiport.owl
7. http://www.mondeca.com/owl/moses/univ.owl
8. http://reliant.teknowledge.com/DAML/Mid-level-ontology.owl
Evaluation – test ontologies (2)
Characteristics of test ontologies
Evaluation – result (1)
Weight convergence
Evaluation – result (2)
Clustering result
Evaluation – Four Measures
• Precision p – percentage of correct predictions
over all predictions
• Recall r – percentage of correct predictions over
correct matching
• F-Measure f (= \( \frac{2rp}{r + p} \)) – a.k.a. Harmonic Mean,
avoids the bias from adopting Precision or
Recall alone
• Overall o (= \( r (2 - \frac{1}{p}) \)) – Post-Match Effort,
i.e., how much human effort is needed to
remove false matches and add missed ones
Evaluation – result (3)
Four measures
\[ p = \frac{257}{309} = 0.83 \]
\[ r = \frac{257}{257 + 86} = 0.75 \]
\[ f = \frac{2 \times 0.83 \times 0.75}{0.83 + 0.75} = 0.79 \]
\[ o = 0.75 \times \left( 2 - \frac{1}{0.83} \right) = 0.6 \]
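The arithmetic can be re-checked with a few lines of Python. The sketch assumes that 257 is the number of correct predictions, 309 the number of all predictions, and 86 the number of correct matches that were missed, which is how the fractions above read. The function and parameter names are illustrative.

```python
def four_measures(correct: int, predicted: int, missed: int):
    """Return (precision, recall, F-Measure, Overall) from raw counts."""
    p = correct / predicted           # correct predictions over all predictions
    r = correct / (correct + missed)  # correct predictions over correct matching
    f = 2 * r * p / (r + p)           # harmonic mean of p and r
    o = r * (2 - 1 / p)               # post-match effort
    return p, r, f, o


p, r, f, o = four_measures(correct=257, predicted=309, missed=86)
print(f"p={p:.2f} r={r:.2f} f={f:.2f} o={o:.2f}")
```

Computed from the unrounded counts, the four values round to 0.83, 0.75, 0.79, and 0.60, in agreement with the slide.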
SOCCER Summary
• SOCCER: A learning-based ontology matching
algorithm, and the first one based on ontology
schemas alone
• Our contributions:
1. An ANN technique was integrated so that the
weights for different semantic aspects can be
learned instead of being specified by a human
in advance
2. Moreover, the learning was carried out
on the ontology schemas alone,
which distinguishes SOCCER from most other learning-based algorithms
Presentation Outline
• Research Motivation
• Learning-Based Ontology Matching – SOCCER
• Ongoing Research
• Summary
Ongoing Research: Bioinformatics/Medical Informatics (1)
• An abundance of medical/biological digital data has promised a profound
impact in both the quality and rate of discovery and innovation
• Worldwide health scientists are producing, accessing, analyzing,
integrating, and storing massive amounts of digital medical data daily
• If we were able to effectively transfer and integrate data from all
possible resources, we would gain:
– A deeper understanding of all these data sets
– Better exposed knowledge
– Appropriate insights and actions that follow
• But…in many cases, the data users are not the data producers, and they
thus face challenges in harnessing data in unforeseen/unplanned ways
• Fortunately, ontological techniques can render help in this regard!
Ongoing Research: Bioinformatics/Medical Informatics (2)
• Ontological techniques have been widely applied to medical and
biological research
• The most successful example is the Gene Ontology (GO) project
– The GO’s aim: to standardize the representation of gene and gene product
attributes across species and databases
– Three ontologies in the GO: Cellular Component, Molecular Function, and
Biological Process
– The GO provides a controlled vocabulary of terms for describing gene
product characteristics and gene product annotation data
– It also provides tools to access and process such data
– The focus of the GO is to describe how gene products behave in a cellular
context
• Ontologies constructed under the auspices of the OBO (Open Biomedical
Ontologies) group exhibit great variety
• Semantic integration becomes an indispensable step in biological and
biomedical data mining
Ongoing Research: Bioinformatics/Medical Informatics (3)
An Experiment in Bio Data Mining
1. The characteristics of many biomedical ontologies: i) a rich set of super/subClassOf
relationships; ii) numeric strings adopted as concept names; and iii) little, if any, instance data
2. SOCCER suitably serves the goal of integrating semantics in computational biology
Ongoing Research: Digital Forensics (1)
• Challenges exist in Digital Forensics
– to maintain the integrity of evidence found by different parties
(usually from distributed geographic areas, or even with cultural
barriers)
– the accurate interpretation of evidence
– the trustworthy conclusion drawn thereafter
• Different parties are likely to adopt different formats and
metadata for storing evidence’s contents – due to different
people’s specific needs
• Seamless communication among different parties, along
with the knowledge sharing and reuse that follow, becomes a
non-trivial problem
Ongoing Research: Digital Forensics (2)
• Being a formal knowledge representation model, ontologies
may help us to handle the aforementioned challenges in
Digital Forensics
• But …
There is no such central ontology that is large enough to include
all concepts of interest to every individual criminal investigator
• Anyone can design ontologies according to his/her own
conceptual view; ontological heterogeneity is thus an
inherent feature
• That is, any individual party that needs a conceptual model
will have to provide its own particular
extensions – different from and incompatible with extensions
added by other parties
Ongoing Research: Digital Forensics (3)
• An agreed-upon, global, and “all-in-one” ontology is not a feasible
solution
• Different groups should maintain their own conceptual models, while
utilizing ontological techniques to synthesize their data with others’
models
• This way, it is possible to effectively decouple the evidence semantics
from its logical description and organization
Digital Investigation Evidence Acquisition Model Based on Ontology Matching
(DIEAOM) to facilitate:
(1) knowledge collection from disparate, heterogeneous evidence sources
(2) knowledge sharing and reuse
(3) decision support for criminal investigators
• The DIEAOM aims to synthesize vast amounts of evidence from different parties
by matching conceptual models
• Our goal is to benefit the current criminal investigation procedure with higher
automation, enhanced effectiveness, and better knowledge sharing and reuse
Other Research Opportunities (1)
Heterogeneous Knowledge Acquisition/Management
• Increasing growth in the scale, complexity, and diversity of
data has been witnessed in recent years
• In addition, the data are often used in ways not envisioned
by those who created them
• New techniques are thus needed to repurpose, transform,
and integrate multiple and uncoordinated data sources;
interoperability is the fundamental goal
• In order to better achieve interoperability among distributed
knowledge sources, accurate and effective semantic
integration is the first, critical step to handle the
heterogeneity in data
Other Research Opportunities (2)
Component-Based Software Engineering
• Engineered software is decomposed into functional or
logical components, with well-defined interfaces for
communication across components
• Reusability is an important feature of a high quality
component
• (Semi)automated methodology to annotate, discover,
compose, and execute the software components
• Semantic integration techniques are important and
fundamental in such automation processes
Other Research Opportunities (3)
Semantics-Enriched Image Knowledge Bases
• Create image knowledge bases by using ontologies to
semantically encode image features
• Semantic search allows users to make use of concept
search, instead of traditional keyword search
• It also paves the way for more advanced search
strategies
• Users can specialize or generalize a query with the
help of a concept hierarchy
• Queries can be formed using information from
ontologies
Presentation Outline
• Research Motivation
• Learning-Based Ontology Matching – SOCCER
• Ongoing Research
• Summary
Summary
• Information from heterogeneous sources has different
semantics, and semantic integration is necessary to make
better use of all available information
• As a formal knowledge representation model, ontologies
can render help in this regard
• SOCCER, the first learning-based approach that relies on
schemas alone, was developed to tackle the ontology-matching
problem, which is a critical component in
semantic integration
• Ontological techniques can be applied to many areas to
generate challenging interdisciplinary research topics
• Suggestions?
• Comments?
• Questions?
Thank you!!!