Adopting Ontologies for Multisource Identity Resolution

Adopting Ontologies for Multisource Identity Resolution
Milena Yankova, Horacio Saggion, Hamish Cunningham
Department of Computer Science,
The University of Sheffield
Overview
• Introduction
• Knowledge representation
• Usage of ontologies in identity resolution
• Case-study & Evaluation
• Conclusion and Further Work
Introduction
• Identity resolution aims at identifying newly presented facts and linking them to their previous mentions.
• Our main hypothesis is that
– variations of one and the same fact can be recognised,
– duplications removed, and
– their aggregation actually increases the correctness of fact extraction.
• We use an ontology as the internal and resulting knowledge representation formalism.
• It contains not only a representation of the domain, but also known entities and their properties.
Knowledge Representation via Ontologies
• Ontologies have been chosen because of their detailed entity descriptions, complemented with semantic information.
• The expected benefit of using a semantic representation is the ability to recognise not only the type/class of objects, but also the individual instances they refer to.
– For example, different appearances of “M&S” on different sources (e.g. web pages) are extracted and collected as a single instance to which all mentions point (see the sketch below).
• The semantic linkup of the identified objects guarantees a more detailed description than a simple syntactic representation.
• In this way it provides more details which, serving as evidence, can improve the accuracy of object comparison.
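As a concrete illustration of this semantic linkup, the Python sketch below shows several surface mentions from different sources accumulating on a single ontology instance; the instance URI, property names and values are hypothetical, not the MUSING data model.

    # Minimal sketch: mentions from different sources are linked to one
    # ontology instance, so all the evidence about "M&S" accumulates on a
    # single, richly described object. URIs and properties are made up.
    instance = {
        "uri": "musing:Company_12345",   # hypothetical instance URI
        "labels": set(),                 # surface forms seen so far
        "properties": {},                # aggregated attributes
    }

    mentions = [
        {"text": "M&S", "source": "http://uk.finance.yahoo.com", "sector": "Retail"},
        {"text": "Marks & Spencer", "source": "company-db", "postcode": "W2 1NW"},
        {"text": "MARKS AND SPENCER", "source": "news-page"},
    ]

    for m in mentions:
        instance["labels"].add(m["text"])
        for key, value in m.items():
            if key not in ("text", "source"):
                # each mention contributes extra evidence to the same instance
                instance["properties"].setdefault(key, value)

    print(instance["labels"])      # all variants now point to one instance
    print(instance["properties"])  # aggregated, more detailed description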
Sources of information
• In this application we have two sources of information (company profiles):
– A database of manually collected company details
– Profiles extracted from web pages
Mapping Databases to Ontologies
• The database schema is the data description that holds the meaning of the data.
• Binding databases to other knowledge representation formalisms, e.g. ontologies, requires deep understanding and domain expertise.
• It is usually done manually, producing a mapping between the particular database schema and a given ontology.
• We use company profiles stored in a MySQL Relational Database Management System, which has been manually mapped to the MUSING ontology using scripts (a sketch of such a script follows below).
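To make the manual mapping concrete, here is a minimal Python sketch of the kind of script this step could use; the column names, property names and URIs are assumptions for illustration, not the actual MUSING mapping scripts.

    # Sketch of a hand-written column-to-property mapping: each database
    # column is mapped to an ontology property, and each row becomes an
    # instance of musing:Company. Names below are hypothetical.
    COLUMN_TO_PROPERTY = {
        "company_name": "musing:hasName",
        "postcode":     "musing:hasPostcode",
        "sic_code":     "musing:hasIndustryCode",
    }

    def row_to_triples(row_id, row):
        """Turn one database row into ontology statements (subject, predicate, object)."""
        subject = f"musing:Company_{row_id}"
        triples = [(subject, "rdf:type", "musing:Company")]
        for column, value in row.items():
            prop = COLUMN_TO_PROPERTY.get(column)
            if prop is not None and value:
                triples.append((subject, prop, value))
        return triples

    # Example row as it might come back from the MySQL profiles table
    row = {"company_name": "MARKS & SPENCER", "postcode": "W2 1NW", "sic_code": "47190"}
    for t in row_to_triples(42, row):
        print(t)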
Information Extraction
Ontology-based Information Extraction
• Ontology-based information extraction aims at identifying in text concepts and instances from an underlying domain model specified in an ontology.
• The extraction prototype uses some default linguistic processors from GATE.
• Custom application rules for concept identification are specified in regular grammars implemented in the JAPE language.
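The prototype's JAPE grammars are not reproduced here; the Python sketch below only illustrates the underlying idea of ontology-aware lookup, where surface forms are matched against labels of known ontology instances so that each annotation carries both a class and an instance. The labels and URIs are made up.

    import re

    # hypothetical label -> (class, instance) index derived from the ontology
    LABEL_INDEX = {
        "marks & spencer": ("musing:Company", "musing:Company_12345"),
        "m&s":             ("musing:Company", "musing:Company_12345"),
        "sheffield":       ("protont:City",   "musing:City_77"),
    }

    def annotate(text):
        """Annotate every known label occurrence with its class and instance."""
        annotations = []
        for label, (cls, instance) in LABEL_INDEX.items():
            for match in re.finditer(re.escape(label), text, flags=re.IGNORECASE):
                annotations.append({
                    "span": match.span(),
                    "string": match.group(0),
                    "class": cls,          # ontology class of the mention
                    "instance": instance,  # individual the mention refers to
                })
        return annotations

    print(annotate("M&S (Marks & Spencer) reported results in Sheffield."))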
Ontologies in IDRF
• Our approach to the identity problem has been implemented as the Identity Resolution Framework (IDRF).
• It uses an ontology as its internal and resulting knowledge representation formalism.
• It is based on the PROTON ontology, which can be extended, e.g. for our particular domain of company profiling.
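As an illustration of such an extension, the rdflib sketch below adds a hypothetical musing:Company class beneath an upper-level PROTON concept; the namespace URIs, class and property names are assumptions, not the actual MUSING ontology.

    from rdflib import Graph, Namespace, RDF, RDFS

    PROTON = Namespace("http://proton.semanticweb.org/protont#")  # assumed URI
    MUSING = Namespace("http://example.org/musing#")               # hypothetical

    g = Graph()
    g.bind("protont", PROTON)
    g.bind("musing", MUSING)

    # musing:Company specialises an upper-level PROTON concept and gains
    # domain-specific properties used later for identity resolution
    g.add((MUSING.Company, RDF.type, RDFS.Class))
    g.add((MUSING.Company, RDFS.subClassOf, PROTON.Organization))
    g.add((MUSING.hasPostcode, RDF.type, RDF.Property))
    g.add((MUSING.hasPostcode, RDFS.domain, MUSING.Company))

    print(g.serialize(format="turtle"))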
Identity Class Models
• Execution of the IDRF is based on what we call Class Models, which handle the differences between the entity types represented as ontology classes.
• Each class model is expressed by a single formula based on first-order probabilistic logic.
• Each formula is manually composed by combining predicates with the usual logical connectives, i.e. “&” (and), “|” (or), “not” and “=>” (implication).
• Class models are used in two stages of the framework pipeline:
– during the retrieval of potential matching candidates from the ontology, applying strict criteria;
– during the actual comparison of potential matching pairs of entities, applying soft criteria.
• They are also evaluated differently depending on which component uses them.
Example of Class Model definition
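The original slide shows a concrete MUSING class model; it is not reproduced here, so the Python sketch below is only an illustrative stand-in in the same spirit, with made-up predicate names.

    # Informally, an illustrative (not the original) class model for
    # musing:Company could read:
    #   equal(x, y) <= sameName(x, y) & (samePostcode(x, y) | sameWebsite(x, y))
    # Encoded as a function over per-predicate similarity scores in [0, 1]:

    def company_class_model(predicates):
        """Combine per-predicate similarity scores (0..1) into one match score."""
        p_name     = predicates["sameName"]
        p_postcode = predicates["samePostcode"]
        p_website  = predicates["sameWebsite"]
        # "or" as probabilistic sum, "and" as product (see Evidence Collection)
        p_location = p_postcode + p_website - p_postcode * p_website
        return p_name * p_location

    print(company_class_model({"sameName": 0.9, "samePostcode": 1.0, "sameWebsite": 0.0}))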
Pre-filtering
• It restricts the full set of ontology instances to a reasonable number of candidates to which the source entity will be compared.
• In this stage the engine does not formally evaluate the class model/formula but composes a SeRQL or SQL query.
• The query embodies the model’s strong equivalence criteria.
Example for Pre-filtering Query
• Query for “MARKS & SPENCER” according to the class model for "musing:Company"
Evidence Collection (1)
• This component calculates the similarity between two objects based on their class model.
• The similarity is expressed by a probabilistic logic formula resulting in a real number from 0 to 1:
– “0” means that the given entities are totally different,
– “1” means that they are absolutely equivalent,
– any value between 0 and 1 is the probability that these entities are equivalent.
Evidence Collection (2)
• The value for each of the predicates in the formula is calculated according to the algorithm it represents.
– Predicate values are combined according to the logical connectives in the formula.
– In this setting the usual logical connectives are expressed as arithmetic expressions, e.g. a ∨ b = a + b − a·b (see the sketch below).
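A minimal sketch of this combination step, assuming conjunction as a product and negation as a complement; the slides only state the disjunction case explicitly, and the example values are made up.

    def p_and(a, b):
        return a * b            # conjunction as product (assumption)

    def p_or(a, b):
        return a + b - a * b    # disjunction as probabilistic sum (from the slide)

    def p_not(a):
        return 1.0 - a          # negation as complement (assumption)

    # e.g. similarity = sameName AND (samePostcode OR sameWebsite)
    similarity = p_and(0.9, p_or(1.0, 0.0))
    print(similarity)   # 0.9 -> fairly strong evidence that the entities match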
Data Integration
• It is the third stage of the identification process.
• It encodes the strength of the presented evidence for choosing the candidate favoured by the Class Model.
• The successful candidate must pass a threshold which balances the precision and recall of the application.
Decision Threshold
• A pre-set threshold determines whether to register a match as successful.
• We have used ROC curve analysis to set the threshold to 0.4, which gives the best performance in our application (a sketch of the decision follows below).
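A minimal sketch of the threshold decision, with a made-up candidate list; only the 0.4 threshold value comes from the slides.

    THRESHOLD = 0.4

    def decide(candidates):
        """candidates: list of (instance_uri, similarity score in [0, 1])."""
        if not candidates:
            return None
        best_uri, best_score = max(candidates, key=lambda c: c[1])
        # below the threshold the source entity is treated as a new instance
        return best_uri if best_score >= THRESHOLD else None

    print(decide([("musing:Company_12345", 0.82), ("musing:Company_999", 0.31)]))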
Case-study
• Our case-study is focused on company profiling.
• We have automatically extracted hundreds of company profiles from different web sites, e.g. http://uk.finance.yahoo.com
• Our database is populated with about 1.8M manually collected company profiles provided by http://www.marketlocation.com
• The evaluation has targeted a set of 310 extracted UK companies compared against the database.
Evaluation of the IDRF
• The accuracy of identity resolution is very promising (89% F-measure).
• Another experiment on automatically extracted vacancies shows similar results.
Evaluation of the IE
• The recall of automatically extracted company attributes improves from 92% to 97% after integration.
• Precision rises slightly from 70% to 73%.
Conclusion and future work
• IDRF is a general framework for identity resolution based on ontologies, adapted to ontology-based information extraction applications.
• Future work: how the uniqueness of the details and their number influence the identification process.
• Thank you Adam!
• Please don’t hesitate to send your questions to [email protected]