139 - UMBC ebiquity research group
Information Retrieval
and the Semantic Web
Tim Finin, James Mayfield, Anupam Joshi,
R. Scott Cost and Clay Fink
University of Maryland, Baltimore County
Johns Hopkins University, Applied Physics Lab
04 January 2004
UMBC
AN HONORS UNIVERSITY IN MARYLAND
DARPA contract F30602-00-0591 and NSF awards ITR-IIS-0326460
and ITR-IIS-0325464 provided partial research support for this work.
Introduction
and motivation
“XML is Lisp's bastard nephew, with uglier
syntax and no semantics. Yet XML is
poised to enable the creation of a Web of
data that dwarfs anything since the
Library at Alexandria.”
-- Philip Wadler, Et tu XML? The fall of
the relational empire, VLDB, Rome,
September 2001.
“The web has made people smarter. We
need to understand how to use it to
make machines smarter, too.”
-- Michael I. Jordan (UC Berkeley),
paraphrased from a talk at AAAI, July
2002
“The Semantic Web will globalize
KR, just as the WWW globalized
hypertext”
-- Tim Berners-Lee
“The multi-agent systems
paradigm and the web both
emerged around 1990. One has
succeeded beyond imagination
and the other has not yet made it
out of the lab.”
-- Anonymous, 2001
Vision and
Model
Vision
• Semantic markup (e.g., OWL) as markup
– Web documents are traditional HTML documents,
augmented with machine-readable semantic markup that
describes their content
• Inference and retrieval are tightly bound
– Inference over semantic markup improves retrieval and text
retrieval facilitates inference
• Agents should use the web like humans do
– Think of a query, encode to retrieve possibly relevant
documents, read some and extract knowledge, repeat until
objectives met
Why use IR techniques?
• We will want to retrieve over structured and
unstructured knowledge
– We should prepare for the appearance of text
documents with embedded SW markup
• We may want to get our SWDs into
conventional search engines, such as Google.
– Mature, scalable, low cost, deployed infrastructure
• IR techniques also have some unique
characteristics that may be very useful
– e.g., ranking matches, document similarity,
clustering, relevance feedback, etc.
Framework–Semantic Markup
[Figure: architecture diagram. Labels shown: agent, Local KB, Inference Engine, Semantic Web Query (statement to be proved), Semantic Markup Encoder (“swangler”), Encoded Markup, Web Search Engine, Ranked Pages, Filters, Semantic Markup Extractor, Semantic Markup.]
Framework–Incorporating Text
[Figure: the architecture diagram extended with text. Labels shown: Local KB, Inference Engine, Semantic Web Query (statement to be proved), Semantic Markup Encoder (“swangler”), Encoded Markup, Text Query, Web Search Engine, Ranked Pages, Filters, Text Extractor, Text, Semantic Markup.]
Harnessing Google
• Google started indexing RDF documents some
time in late 2003
• Can we take advantage of this?
• We’ve developed techniques to get some
structured data to be indexed by Google
• And then later retrieved
• Technique: give Google enhanced documents
with additional annotations containing Swangle
Terms ™
Swangle definition
swan·gle
Pronunciation: ‘swa[ng]-g&l
Function: transitive verb
Inflected Forms: swan·gled; swan·gling /-g(&-)li[ng]/
Etymology: Postmodern English, from C++ mangle,
Date: 20th century
1: to convert an RDF triple into one or more IR
indexing terms
2: to process a document or query so that its content-bearing
markup will be indexed by an IR system
Synonym: see tblify
- swan·gler /-g(&-)l&r/ noun
Swangling
• Swangling turns a SW triple into 7 word-like terms
– One for each non-empty subset of the three components with
the missing elements replaced by the special “don’t care”
URI
– Terms generated by a hashing function (e.g., SHA1)
• Swangling an RDF document means adding in triples
with swangle terms.
– This can be indexed and retrieved via conventional search
engines like Google
• Allows one to search for a SWD with a triple that
claims “Osama bin Laden is located at X”
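The seven-term scheme above can be sketched in Python. This is a minimal illustration, not the Swoogle implementation: the "don't care" URI below is a hypothetical placeholder, and truncating the SHA1 digest to 128 bits and base32-encoding it is an assumption chosen only because it yields 26-character terms.

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder: the slides do not give the actual "don't care" URI.
DONT_CARE = "http://example.org/swangle#dontCare"


def swangle_term(subject, predicate, obj):
    """Hash one (possibly wildcarded) triple pattern into a word-like term."""
    key = " ".join((subject, predicate, obj)).encode("utf-8")
    # Assumption: keep the first 128 bits of the SHA1 digest.
    digest = hashlib.sha1(key).digest()[:16]
    # Base32 gives upper-case letters and digits; drop the '=' padding.
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def swangle(subject, predicate, obj):
    """Return 7 terms, one per non-empty subset of the triple's components."""
    parts = (subject, predicate, obj)
    terms = []
    for keep_mask in product((True, False), repeat=3):
        if not any(keep_mask):
            continue  # skip the empty subset (all three components wildcarded)
        pattern = [p if keep else DONT_CARE for p, keep in zip(parts, keep_mask)]
        terms.append(swangle_term(*pattern))
    return terms
```

A query for "any triple with this predicate and object" would be swangled the same way, with the subject replaced by the "don't care" URI, so that it hashes to the same term as the matching document annotation.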
A Swangled Triple
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:s="http://swoogle.umbc.edu/ontologies/swangle.owl#">
  <s:SwangledTriple>
    <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
    <rdfs:comment>Swangled text for
      [http://www.xfront.com/owl/ontologies/camera/#Camera,
      http://www.w3.org/2000/01/rdf-schema#subClassOf,
      http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem]
    </rdfs:comment>
    <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
    <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
    <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
    <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
    <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
    <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
  </s:SwangledTriple>
</rdf:RDF>
What’s the point?
• We’d like to get our documents into Google
– Swangle terms look like words to Google and other search
engines.
• Cloaking obviates modifying the document
– Add rules to the web server so that, when a search spider
asks for document X the document swangled(X) is returned.
Caching makes this efficient
• A swangle term length of 7 may be an acceptable
length for a Semantic Web of 10^10 triples -- collision
prob for a triple ~ 2*10^-6.
• We could also use Swanglish – hashing each triple into
N of the 50K most common English words
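One way the Swanglish idea might work is sketched below. The slides do not say how the words are chosen, so the details here are assumptions: the function name `swanglish`, the use of SHA1, and taking four digest bytes per word are all illustrative, and the vocabulary would be supplied by the caller (for example, a list of the 50K most common English words).

```python
import hashlib


def swanglish(triple, vocabulary, n=3):
    """Map a triple to n words from a vocabulary, chosen by hashing the triple."""
    key = " ".join(triple).encode("utf-8")
    digest = hashlib.sha1(key).digest()  # 20 bytes
    if n * 4 > len(digest):
        raise ValueError("this sketch supports at most 5 words per triple")
    words = []
    for i in range(n):
        chunk = digest[4 * i:4 * i + 4]  # 4 bytes of the digest per word
        index = int.from_bytes(chunk, "big") % len(vocabulary)
        words.append(vocabulary[index])
    return words
```

Because the result consists of ordinary words, a search engine would index it with no special handling, at the cost of possible collisions with the same words appearing in normal text.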
OWLIR
Student Event Scenario
• UMBC sends out descriptions of ~50 events a week to students.
• Each student has a “standing query” used to route event
messages.
– A student only receives announcements of events matching his/her
interests and schedule.
• Use LMCO’s AeroText system to automatically add
DAML+OIL markup to event descriptions.
– Categorize text announcements into event types
– Identify key elements and add DAML markup
• Use JESS to reason over the markup, drawing ontology-supported inferences
Event Ontology
• A simple ontology for University events
• Includes classes, subclasses, properties, etc.
• Can include instance data, e.g., UMBC, NEC, Fairleigh Dickinson, etc.
[Figure: ontology diagram. Labels shown: EVENT, DATE, TIME, TRIP, MOVIE, SHOW, SPORT, TEAM, INDIVIDUAL, BASEBALL, BASKETBALL, CHESS, ATHLETICS, with properties Organizer, Event_Name, Event_Date, Start_Time, End_Time, Place. Key: Class, Instance Of, Subclass Of, Property, Property Association.]
OWLIR Architecture
[Figure: processing pipeline diagram. Labels shown: Text Event Descriptions; Classification; Event Categories (Talk, Sport, Movie, Trip, ...); Info Extraction (LMCO AeroText + Java Agents); Text + DAML; Jess (extract triples & reason); Text + triples; Convert triples to index terms; Text Index; SIRE Retrieve; Query User Interface; Must; OK; Must not; Expand Event Description; Inference on results; Results User Interface; Final Results.]
Swoogle
The web, like Gaul, is divided into three parts: the regular web (e.g. HTML), Semantic Web Ontologies (SWOs), and Semantic Web Instance files (SWIs).
SWD = SWO + SWI
[Figure: contents of the web. Labels shown: CGI scripts, HTML documents, Images, Audio files, Video files, SWOs, SWIs.]
[Figure: SWOOGLE 2 architecture. Stages: discovery, digest, analysis, service. Labels shown: The Web, Web Crawler, Candidate URLs, SWD Reader, SWD Cache, SWD Metadata, SWD Rank, SWD analyzer, IR analyzer, Ontology Dictionary, Swoogle Search, Swoogle Statistics, Web Server, Web Service, Human users, Intelligent Agents.]
Swoogle uses four kinds of crawlers to discover semantic web documents and several analysis agents to compute metadata and relations among documents and ontologies. Metadata is stored in a relational DBMS. Services are provided to people and agents.
A SWD’s rank is a function of its type (SWO/SWI) and the rank and types of the documents to which it’s related.
Swoogle provides services to people via a web interface and to agents as web services.
http://swoogle.umbc.edu/
Statistics as of November 2004
SWDs: 336,000
Ontologies: 4,200
Triples: 47,000,000
Classes: 95,000
Properties: 53,000
Individuals: 7,200,000
SWD IR Engine
Swoogle puts documents into a character ngram based IR engine to compute document similarity and do retrieval from queries.
Contributors include Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Jim Mayfield, Joel Sachs, Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding, and Drew Ogle. Partial research support was provided by DARPA contract F30602-00-0591 and by NSF awards NSF-ITR-IIS-0326460 and NSF-ITR-IDM-0219649. November 2004.
Concepts
• Document
– A Semantic Web Document (SWD) is an online document written in semantic
web languages (i.e. RDF and OWL).
In Swoogle, a document D is a valid SWD iff JENA* correctly parses D and
produces at least one triple.
*JENA is a Java framework for writing Semantic Web applications. http://www.hpl.hp.com/semweb/jena2.htm
– An ontology document (SWO) is a SWD that contains mostly term definitions
(i.e. classes and properties). It corresponds to the T-Box in Description Logic.
– An instance document (SWI or SWDB) is a SWD that contains mostly class
individuals. It corresponds to the A-Box in Description Logic.
• Term
– A term is a non-anonymous RDF resource which is the URI
reference of either a class or a property.
Example: foaf:Person rdf:type rdfs:Class
• Individual
– An individual refers to a non-anonymous RDF resource
which is the URI reference of a class member.
Example: http://.../foaf.rdf#finin rdf:type foaf:Person
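The document concepts above can be illustrated with a rough Python sketch over already-parsed triples. It is only one reading of "mostly": the function name `classify_swd`, the majority rule, the tie-break toward SWI, and the particular set of definitional types are assumptions, not Swoogle's actual criteria. Parsing (done by Jena in Swoogle) is outside the sketch.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumption: an rdf:type triple with one of these objects defines a term.
TERM_TYPES = {
    "http://www.w3.org/2000/01/rdf-schema#Class",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property",
    "http://www.w3.org/2002/07/owl#Class",
    "http://www.w3.org/2002/07/owl#ObjectProperty",
    "http://www.w3.org/2002/07/owl#DatatypeProperty",
}


def classify_swd(triples):
    """Return 'SWO', 'SWI', or None for a list of (s, p, o) string triples."""
    if not triples:
        return None  # no triples produced, so not a valid SWD
    definitions = 0
    individuals = 0
    for _subject, predicate, obj in triples:
        if predicate != RDF_TYPE:
            continue
        if obj in TERM_TYPES:
            definitions += 1  # defines a class or a property
        else:
            individuals += 1  # asserts membership in some class
    # Assumption: "mostly" means a strict majority; ties count as SWI.
    return "SWO" if definitions > individuals else "SWI"
```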
Demo
1. Find “Time” Ontology (Swoogle Search)
2. Digest “Time” Ontology
   • Document view
   • Term view
3. Find Term “Person” (Ontology Dictionary)
4. Digest Term “Person”
   • Class properties
   • (Instance) properties
5. Swoogle Statistics
Demo 1: Find “Time” Ontology
We can use a set of keywords to search for an ontology. For example,
“time, before, after” are basic concepts for a “Time” ontology.
Usage of Terms in SWD
[Figure: RDF graph across three documents. http://www.cs.umbc.edu/~finin/foaf.rdf and http://foo.com/foaf.rdf each describe an individual (e.g., http://foo.com/foaf.rdf#finin) with rdf:type foaf:Person and a foaf:mbox value. http://xmlns.com/foaf/1.0/ holds the definitions: foaf:Person (rdf:type rdfs:Class, rdfs:subClassOf wordNet:Agent) and foaf:mbox (rdf:type rdf:Property, rdfs:domain). Annotations: populated Class, populated Property, defined Class, defined Property, defined Individual.]
Demo 2(a): Digest “Time” Ontology (term view)
[Screenshot: terms listed include TimeZone, before, intAfter, ...]
Demo 2(b): Digest “Time” Ontology (document view)
Demo 3: Find Term “Person”
Not capitalized! URIref is case sensitive!
Demo 4: Digest Term “Person”
167 different properties
562 different properties
Demo 5: Swoogle Statistics
Swoogle IR Search
• This is work in progress, not yet fully integrated into
Swoogle
• Documents are put into an ngram IR engine (after
processing by Jena) in canonical XML form
– Each contiguous sequence of N characters is used as an
index term (e.g., N=5)
– Queries processed the same way
• Character ngrams work almost as well as words but
have some advantages
– No tokenization, so works well with artificial languages and
agglutinative languages
=> good for RDF!
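The indexing scheme above can be sketched in a few lines of Python. This is a toy illustration rather than the engine Swoogle uses: the function names are hypothetical, and ranking by the count of shared n-grams is an assumed, deliberately simple scoring rule.

```python
from collections import defaultdict


def char_ngrams(text, n=5):
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(documents, n=5):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=5):
    """Rank document ids by how many distinct query n-grams they contain."""
    scores = defaultdict(int)
    for gram in set(char_ngrams(query, n)):  # queries are processed the same way
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

No tokenizer appears anywhere in the sketch, which is the property the slide highlights: the text is treated as a plain character stream.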
Why character n-grams?
• Suppose we want to find ontologies for time
• We might use the following query
“time temporal interval point before after during day
month year eventually calendar clock duration end
begin zone”
• And have matches for documents with URIs like
– http://foo.com/timeont.owl#timeInterval
– http://foo.com/timeont.owl#CalendarClockInterval
– http://purl.org/upper/temporal/t13.owl#timeThing
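A small comparison, using the query and URIs from this slide, suggests why n-grams help here. The word tokenizer (split on non-alphanumeric characters) and the case folding are assumptions made for the sketch; they stand in for whatever a conventional word-based engine would do.

```python
import re

QUERY = ("time temporal interval point before after during day month year "
         "eventually calendar clock duration end begin zone")

URIS = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#CalendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
]


def char_ngrams(text, n=5):
    """Set of character n-grams, case-folded (the folding is an assumption)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def word_tokens(text):
    """Conventional tokens: lower-cased runs of letters and digits."""
    return {tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


def shared_ngrams(query, uri, n=5):
    """n-grams that occur in both the query and the URI."""
    return sorted(char_ngrams(query, n) & char_ngrams(uri, n))


def shared_words(query, uri):
    """Whole-word tokens that occur in both the query and the URI."""
    return sorted(word_tokens(query) & word_tokens(uri))


if __name__ == "__main__":
    for uri in URIS:
        print(uri)
        print("  shared words:  ", shared_words(QUERY, uri))
        print("  shared 5-grams:", shared_ngrams(QUERY, uri))
```

With whole-word tokens, a compound fragment such as timeInterval is a single token that matches none of the query words, while its 5-grams overlap with those of "interval".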
Another approach: URIs as words
• Remember: ontologies define vocabularies
• In OWL, URIs of classes and properties are the
words
• So, take a SWD, reduce to triples, extract the
URIs (with duplicates), discard URIs for blank
nodes, hash each URI to a token (use
MD5Hash), and index the document.
• Process queries in the same way
• Variation: include literal data (e.g., strings) too.
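The steps in this approach can be sketched as follows, assuming the SWD has already been reduced to triples of strings. The conventions for recognising blank nodes (a leading "_:") and literals (a leading double quote) follow N-Triples notation and are assumptions of the sketch, as is the function name `uri_tokens`.

```python
import hashlib


def is_blank_node(term):
    """Assumption: blank nodes are written '_:label', as in N-Triples."""
    return term.startswith("_:")


def is_literal(term):
    """Assumption: literals are written with a leading double quote."""
    return term.startswith('"')


def uri_tokens(triples, include_literals=False):
    """Turn triples into index tokens: one MD5 hex digest per URI occurrence.

    Duplicates are kept so that term frequency is preserved; blank nodes are
    always discarded; literals are hashed only if include_literals is set.
    """
    tokens = []
    for triple in triples:
        for term in triple:
            if is_blank_node(term):
                continue
            if is_literal(term) and not include_literals:
                continue
            tokens.append(hashlib.md5(term.encode("utf-8")).hexdigest())
    return tokens
```

The resulting token list can be handed to any word-based indexer, and a query built from URIs would be hashed with the same function before lookup.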
Conclusion
What we have done
• Developed Swoogle – a crawler based retrieval
system for SWDs
• Developed and implemented a technique to get
Google to index and retrieve SWDs
• Prototyped (twice) an ngram based IR engine
for SWDs
• Explored the integration of inference and
retrieval
• Used these in several demonstration systems