Semantic Middleware
• Investigating fundamental issues in entity/relationship extraction, disambiguation (matching & mapping) and annotation.
1. Entity Identification
2. Entity Disambiguation
3. Semantic Annotation
World Model
[Figure: annotation workflow]
Documents to annotate → Entity Identification / Metadata Creation → Multiple matches found during lookup?
    YES → Entity Disambiguation → Semantic Annotation of selected documents
    NO → Semantic Annotation of selected documents
Semantic Annotation of selected documents → Annotated Documents
Resources shown: Knowledge Base; Lexical Analysis, Natural Language Processing, additional linguistic resources: thesaurus, dictionary (synonyms, common variations)
Semantic Annotation
• Entities in a drug advisory annotated with
concepts and relationships from a Drug
Ontology
Excerpt of Drug Ontology
Sample Created Metadata
<Entity id="122805" class="DrugOntology#prescription_drug_brandname">
  Bextra
  <Relationship id="442134" class="DrugOntology#has_interaction">
    <Entity id="14280" class="DrugOntology#interaction_with_physical_condition">sulfa allergy</Entity>
  </Relationship>
</Entity>
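To show how a consumer might read this metadata, here is a minimal sketch that walks the nested Entity / Relationship elements and flattens them into (subject, relationship, object) triples. The function name `extract_triples` is an assumption for illustration, and the sample is embedded with straight quotes so that it parses as well-formed XML; the slides do not say how the metadata is actually consumed.

```python
import xml.etree.ElementTree as ET

# Sample metadata from the slide, with attribute quoting normalized.
METADATA = """
<Entity id="122805" class="DrugOntology#prescription_drug_brandname">
  Bextra
  <Relationship id="442134" class="DrugOntology#has_interaction">
    <Entity id="14280" class="DrugOntology#interaction_with_physical_condition">sulfa allergy</Entity>
  </Relationship>
</Entity>
"""


def extract_triples(xml_text):
    """Flatten nested Entity/Relationship elements into triples (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    triples = []

    def visit(entity):
        # The entity's surface text precedes its child elements.
        subject = (entity.text or "").strip()
        for rel in entity.findall("Relationship"):
            for target in rel.findall("Entity"):
                obj = (target.text or "").strip()
                triples.append((subject, rel.get("class"), obj))
                visit(target)  # the target may carry relationships of its own

    visit(root)
    return triples


triples = extract_triples(METADATA)
print(triples)
```

Each triple pairs the annotated surface text with the ontology relationship class, which is a convenient shape for loading into a triple store or a graph structure.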
Semantic Associations
• Identifying implicit semantic associations
between entities in the document
Annotated RSS Feed
Ontology
"Today, the Food and Drug Administration (FDA) is announcing that it has asked Pfizer, Inc. to voluntarily withdraw Bextra from the market. Pfizer has agreed to suspend sales and marketing of Bextra in the U.S., pending further discussions with the agency."
Grey and white circles indicate ontology nodes in
identified semantic associations. Lighter nodes lie in
less relevant associations.
Entities in the RSS feed that were extracted and annotated with concepts in an
ontology are shown in red.
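One simple way to think about an implicit semantic association is as a path through the ontology graph that connects two annotated entities. The sketch below finds the shortest such path with breadth-first search. It is only an illustration under stated assumptions: the slides do not describe the discovery or ranking algorithm, and apart from the has_interaction edge taken from the sample metadata, the edges in `EDGES` are hypothetical toy data.

```python
from collections import deque

# Toy ontology edges as (subject, relationship, object).
# Only the has_interaction edge comes from the sample metadata;
# the other two are hypothetical, for illustration only.
EDGES = [
    ("Bextra", "has_interaction", "sulfa allergy"),
    ("Pfizer", "manufactures", "Bextra"),
    ("FDA", "regulates", "Pfizer"),
]


def build_graph(edges):
    """Index edges in both directions so associations can be followed either way."""
    graph = {}
    for subject, rel, obj in edges:
        graph.setdefault(subject, []).append((rel, obj))
        graph.setdefault(obj, []).append(("inverse_of_" + rel, subject))
    return graph


def find_association(graph, start, goal):
    """Breadth-first search; returns the shortest path as (relationship, node) hops, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(rel, nxt)]))
    return None


graph = build_graph(EDGES)
path = find_association(graph, "FDA", "Bextra")
print(path)
```

A relevance ranking like the one suggested by the lighter and darker nodes in the figure would need an additional scoring step, which this sketch does not attempt.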
Disambiguation
• Functionality:
– Merging two databases / ontologies, where multiple references
point to the same logical entity
– Adding new instances to an ontology, where a similar entity
already exists and has to be merged with the new one
– Example: merging person instances recorded in a
government ontology with an incoming ChoicePoint
person entity.
Approaches
• Feature-based Similarity Approach
– Set-Theory Similarity Approach
– Information-Theory Similarity Approach
– Clustering Approach
– Hybrid Approach
• Relationship-based Similarity Approach
• Hybrid Similarity Approach
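As an illustration of the set-theory flavour of feature-based similarity, the sketch below treats each entity as a set of (attribute, value) features and scores the overlap with the Jaccard coefficient. Jaccard is an assumption chosen for brevity; the slides do not name a specific measure. The feature values are taken from the conflicting-instances example later in the deck.

```python
def jaccard(features_a, features_b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0  # two empty descriptions give no evidence either way
    return len(a & b) / len(a | b)


tim = {
    ("SSN", "889889889"),
    ("FirstName", "Tim"),
    ("LastName", "Robins"),
    ("address", "place23"),
}
timothy = {
    ("SSN", "889889889"),
    ("FirstName", "Timothy"),
    ("LastName", "Robinson"),
    ("address", "place23"),
}

score = jaccard(tim, timothy)
print(round(score, 3))
```

Exact-match features treat "Tim" and "Timothy" as completely different, which is one reason a purely set-based score is usually combined with string similarity or relationship evidence.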
Challenges
• Varying information content in entities
– Differences in schema
– Variations in representation
• Use of abbreviations, misspellings, different naming
conventions, representation formats changing over
time, etc.
• Insufficient information while merging two
entities
Exploiting relationships and other / previous reconciliation decisions
Schema: Person              Conflicting instances
                            Tim Robins      Timothy Wallace Robinson
-- SSN                      889889889       889889889
-- TelNumber                7065434567      7062123443
-- FirstName                Tim             Timothy
-- MiddleName               (empty)         Wallace
-- LastName                 Robins          Robinson
-- Generation               (empty)         (empty)
-- Marital Status           Single          Married
-- Applicant                (empty)         (empty)
-- dependent of             (empty)         (empty)
-- spouse of                (empty)         person12332
-- works for                PeopleSoft      Oracle
-- affiliated with          (empty)         (empty)
-- foreign influence event  event7823       event099
-- address                  place23         place23
• Nature of attribute indicates its relative importance – SSN given a high weight in disambiguating person entities
• String similarity metrics
• Recognized as a time-sensitive attribute
• Reconciling Oracle and PeopleSoft indicates the two person entities work for the same organization
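The ideas on this slide (attribute weights plus string similarity) can be sketched as a weighted score over the attributes that both entities fill in. This is a minimal illustration, not the presented algorithm: the weights in `WEIGHTS` are hypothetical, `difflib.SequenceMatcher` stands in for whichever string similarity metric was actually used, and time-sensitive attributes and relationship evidence (such as the Oracle / PeopleSoft reconciliation) are not modelled.

```python
from difflib import SequenceMatcher

# Hypothetical weights: SSN dominates, as the slide suggests; the rest are guesses.
WEIGHTS = {
    "SSN": 0.5,
    "LastName": 0.2,
    "FirstName": 0.15,
    "address": 0.1,
    "TelNumber": 0.05,
}


def string_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; a stand-in for any string metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def person_similarity(p, q, weights=WEIGHTS):
    """Weighted average of attribute similarities over attributes present in both."""
    total = 0.0
    used = 0.0
    for attr, weight in weights.items():
        a, b = p.get(attr), q.get(attr)
        if not a or not b:
            continue  # skip attributes missing on either side
        total += weight * string_similarity(a, b)
        used += weight
    return total / used if used else 0.0


tim = {"SSN": "889889889", "TelNumber": "7065434567", "FirstName": "Tim",
       "LastName": "Robins", "address": "place23"}
timothy = {"SSN": "889889889", "TelNumber": "7062123443", "FirstName": "Timothy",
           "MiddleName": "Wallace", "LastName": "Robinson", "address": "place23"}

score = person_similarity(tim, timothy)
print(round(score, 3))
```

Normalizing by the weights actually used keeps an entity with sparse attributes from being penalized merely for missing data, which relates to the "insufficient information" challenge noted earlier.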
Application using this disambiguation
algorithm
• Semantic Analytics on Social Networks: Experiences in
Addressing the Problem of Conflict of Interest Detection
Nominated for Best Paper Award at WWW 2006
• Disambiguate entities in a FOAF and DBLP
dataset
Schema
[Figure: DBLP and FOAF schema with sample instances]
dblp:Researcher and foaf:Person are each rdfs:subClassOf Person. All properties listed below take rdfs:literal values, except dblp:coauthor (between dblp:Researcher nodes) and foaf:knows (between foaf:Person nodes).
• dblp:Researcher: dblp:label, dblp:homepage, dblp:no_of_publications, dblp:no_of_co_authors, dblp:coauthor, dblp:iswcLocation, dblp:iswc_affiliation, dblp:iswc_type
• foaf:Person: foaf:label, foaf:firstName, foaf:surname, foaf:nickName, foaf:homepage, foaf:workplacepage, foaf:schoolpage, foaf:mbox, foaf:mbox_sha1sum, foaf:depiction, foaf:knows
Instance
• dblp:Researcher #2_1518: dblp:label "Amit Sheth"; dblp:homepage http://lsdis.cs.uga.edu/~amit/; dblp:no_of_publications and dblp:no_of_co_authors (values shown: 124, 134); dblp:coauthor links to dblp:Researcher #2_553, #2_1417, #2_324
• foaf:Person #4_38624: foaf:label "Amit Sheth"; foaf:nickName "Amit"; foaf:mbox_sha1sum 9c1dfd993ad7d1852e80ef8c87fac30e10776c0c; foaf:homepage http://lsdis.cs.uga.edu/~amit; foaf:workplacepage http://lsdis.cs.uga.edu, http://www.semagix.com; foaf:schoolpage http://www.bitsaa.org/, http://www.cse.ohio-state.edu/; foaf:knows links to foaf:Person #4_2629, #4_19269, #4_35126, #4_28045
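In the instance data, dblp:Researcher #2_1518 and foaf:Person #4_38624 share the label "Amit Sheth", and their homepage URLs differ only by a trailing slash. The sketch below shows one way such a pair could be matched by normalizing the URLs before comparing. The helper names and the rule "same label and same normalized homepage" are assumptions for illustration; the slides do not spell out the matching rules used in the application.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to (lowercased host, path without trailing slash)."""
    parts = urlsplit(url.strip())
    return (parts.netloc.lower(), parts.path.rstrip("/"))


def same_entity(a, b):
    """Hypothetical rule: labels match and homepages match after normalization."""
    same_label = a["label"].strip().lower() == b["label"].strip().lower()
    same_homepage = normalize_url(a["homepage"]) == normalize_url(b["homepage"])
    return same_label and same_homepage


dblp_researcher = {
    "id": "#2_1518",
    "label": "Amit Sheth",
    "homepage": "http://lsdis.cs.uga.edu/~amit/",
}
foaf_person = {
    "id": "#4_38624",
    "label": "Amit Sheth",
    "homepage": "http://lsdis.cs.uga.edu/~amit",
}

match = same_entity(dblp_researcher, foaf_person)
print(match)
```

A stricter or looser rule is easy to substitute, for example also comparing foaf:mbox_sha1sum when both sides provide an e-mail hash.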
Statistics of the population and the
results
Provenance
• When you see some data on the Web /
database / ontology, do you know
– where it came from?
– why it is there?
• Provenance is the lineage / history of a
piece of information
The need for provenance
• Reliability and Quality – Identify lineage, measure
credibility
• Justification and Audit – Not only when and how, but also
why certain derivations have been made
• Reusability and Reproducibility – Not only how data has been
produced, but also all the information necessary to reproduce
results
• Ownership, Security and Copyright – Provides a trusted
source from which we can determine who the information
belongs to, and when and how it was created
Challenges
• In recording and using provenance information
– Systematic annotation for recording provenance
– Using provenance involves propagating annotations
– Location and propagation rules for annotations
• Multiple granularity – at what level to annotate? The whole
database, the relations, the tuples or the data values?
Cost and feasibility of each?
– Formal annotations – machine-readable and executable
– Meaningful annotations require versioning
http://db.cis.upenn.edu/DL/fsttcs.pdf
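To make "propagating annotations" concrete, here is a minimal sketch of tuple-level provenance: every tuple carries a set of source names, and a join annotates each output tuple with the union of its inputs' annotations. The union rule, the function names and the sample rows are all assumptions for illustration; other granularities (whole relation, individual data values) and other propagation rules are equally possible.

```python
def annotate(rows, source):
    """Attach a provenance annotation (a set of source names) to every tuple."""
    return [(row, frozenset([source])) for row in rows]


def join(left, right, key):
    """Join on `key`; each output tuple carries the union of its inputs' annotations."""
    result = []
    for lrow, lprov in left:
        for rrow, rprov in right:
            if lrow[key] == rrow[key]:
                result.append(({**lrow, **rrow}, lprov | rprov))
    return result


def project(rows, columns):
    """Keep only `columns`; the tuple-level annotation passes through unchanged."""
    return [({c: row[c] for c in columns}, prov) for row, prov in rows]


# Hypothetical sample data from two sources.
names = annotate([{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}], "source_A")
scores = annotate([{"id": 1, "score": 10}], "source_B")

joined = join(names, scores, "id")
projected = project(joined, ["name", "score"])
print(projected)
```

Annotating at the tuple level is cheap to propagate but cannot say which source contributed which column; value-level annotations would answer that at a higher storage and bookkeeping cost.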