Swoogle Semantic Web search engine

Search Engines for Semantic Web Knowledge
Tim Finin
University of Maryland,
Baltimore County
Joint work with Li Ding, Anupam Joshi, Yun Peng, Pranam Kolari, Pavan
Reddivari, Sandor Dornbush, Rong Pan, Akshay Java, Joel Sachs, Scott
Cost and Vishal Doshi
UMBC
an Honors University in Maryland
 http://creativecommons.org/licenses/by-nc-sa/2.0/
This work was partially supported by DARPA contract F30602-97-1-0215, NSF
grants CCR007080 and IIS9875433 and grants from IBM, Fujitsu and HP.
This talk
• Motivation
• Semantic web 101
• Swoogle Semantic Web search engine
• Use cases and applications
• Conclusions
Once there were only a
few large computers
Then there were many,
All connected 24x7,
Internet
Cellular telephony
IRDA
802.11
Bluetooth
Ultra Wide Band
RFID
and more to come
Interoperating;
tcp/ip ftp smtp
rpc corba ssh
http html
xml
gif jpg mpg mp3
pdf
…
Access to the world’s knowledge
del.icio.us
Google has made us smarter
But what about our agents?
[Figure: agents exchanging “tell” and “register” messages]
Agents still have a very minimal
understanding of text and images.
This talk
• Motivation
• Semantic web 101
• Swoogle Semantic Web search engine
• Use cases and applications
• Conclusions
XML helps
“XML is Lisp's bastard nephew, with uglier syntax
and no semantics. Yet XML is poised to enable the
creation of a Web of data that dwarfs anything
since the Library at Alexandria.”
-- Philip Wadler, Et tu XML? The fall of
the relational empire, VLDB, Rome,
September 2001.
Semantic Web adds semantics
“The Semantic Web will globalize KR, just as the WWW globalized hypertext”
-- Tim Berners-Lee
Semantic Web 101
• RDF/XML
• rdf:RDF tag
• namespaces, ontologies
• Semantic graph, URIs as nodes & links
• triples
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:uni="http://ebiquity.umbc.edu/ontologies/uni/">
  <uni:Student>
    <foaf:name>Li Ding</foaf:name>
    <foaf:mbox rdf:resource="mailto:[email protected]"/>
  </uni:Student>
</rdf:RDF>
[Figure: the same data drawn as a graph, a node with rdf:type uni:Student and foaf:name "Li Ding"]
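To make the "triples" bullet concrete, here is a minimal sketch that reads a fragment like the one on this slide and lists its triples. It uses only Python's standard library and handles just the two constructs that appear here (a typed node element and simple property elements), so it is far from a full RDF/XML parser. The mbox address is a made-up placeholder, and the blank-node labels (_:b0, ...) are an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Same shape as the slide's example; the mbox address is a placeholder.
DOC = """<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:uni="http://ebiquity.umbc.edu/ontologies/uni/">
  <uni:Student>
    <foaf:name>Li Ding</foaf:name>
    <foaf:mbox rdf:resource="mailto:someone@example.org"/>
  </uni:Student>
</rdf:RDF>"""


def uri(tag):
    """Turn ElementTree's '{namespace}local' form into a plain URI."""
    return tag[1:].replace("}", "", 1) if tag.startswith("{") else tag


def triples(rdf_xml):
    """List (subject, predicate, object) triples for simple typed nodes."""
    root = ET.fromstring(rdf_xml.encode("utf-8"))
    out = []
    for i, node in enumerate(root):
        # A node element without rdf:about is treated as a blank node.
        subject = node.get(f"{{{RDF_NS}}}about", f"_:b{i}")
        out.append((subject, RDF_NS + "type", uri(node.tag)))
        for prop in node:
            resource = prop.get(f"{{{RDF_NS}}}resource")
            obj = resource if resource is not None else (prop.text or "")
            out.append((subject, uri(prop.tag), obj))
    return out


for triple in triples(DOC):
    print(triple)
```

Each printed tuple corresponds to one edge of the graph: the node's rdf:type, its foaf:name and its foaf:mbox.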
Where’s the semantics?
• URIs as “rigid designators”
• Conventions for URIs denoting things in the “real
world”
• Namespaces and URIs provide an unambiguous shared
vocabulary
• RDF, RDFS and OWL have semantics defined using
model theory and also axioms
• Ontologies allow agents to draw inferences
– uni:Student is a subclass of foaf:Person
– Every uni:Student has at least one uni:school, which must be
an instance of uni:School
– A foaf:Person with a uni:school is necessarily a uni:Student
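The three example axioms can be sketched as a tiny forward-chaining loop. This is an illustration only, not how an OWL reasoner is built: triples are plain tuples of prefixed names, the individuals ex:li and ex:umbc are hypothetical, and the "at least one uni:school" part of the second axiom is not modelled (only the requirement that a student's school is a uni:School).

```python
SUBCLASS_OF = {"uni:Student": "foaf:Person"}


def infer(triples):
    """Apply the slide's example axioms until no new triples appear."""
    facts = set(triples)
    changed = True
    while changed:
        new = set()
        for s, p, o in facts:
            # Axiom 1: uni:Student is a subclass of foaf:Person.
            if p == "rdf:type" and o in SUBCLASS_OF:
                new.add((s, "rdf:type", SUBCLASS_OF[o]))
            if p == "uni:school":
                # Axiom 3: a foaf:Person with a uni:school is a uni:Student.
                if (s, "rdf:type", "foaf:Person") in facts:
                    new.add((s, "rdf:type", "uni:Student"))
                # Axiom 2 (partly): a student's school is a uni:School.
                if (s, "rdf:type", "uni:Student") in facts:
                    new.add((o, "rdf:type", "uni:School"))
        changed = not new <= facts
        facts |= new
    return facts


known = {
    ("ex:li", "rdf:type", "foaf:Person"),
    ("ex:li", "uni:school", "ex:umbc"),
}
for fact in sorted(infer(known) - known):
    print(fact)
```

Starting from a person with a school, the loop derives that the person is a uni:Student and that the school is a uni:School, which is the kind of inference an agent could draw from the ontology.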
RDF/a
RDF/a is a W3C proposal for embedding RDF in XHTML documents.
An HTML document with RDF embedded:
<html xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <head><title>Jo Lambda's Home Page</title></head>
  <body>
    Hello. This is <span property="foaf:name">Jo Lambda</span>'s
    home page.
    <h2>Work</h2>
    If you want to contact me at work, you can either
    <a rel="foaf:mbox" href="mailto:[email protected]">email me</a>,
    or call <span property="foaf:phone">+1 777 888 9999</span>.
  </body>
</html>
The triples, in N3 notation:
<> foaf:name "Jo Lambda"^^rdf:XMLLiteral ;
   foaf:mbox <mailto:[email protected]> ;
   foaf:phone "+1 777 888 9999"^^rdf:XMLLiteral .
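As a rough illustration of how the embedded triples could be recovered from the page, the sketch below walks the markup with Python's standard html.parser and collects property= and rel= attributes. It is an assumption-laden simplification, not a conforming RDF/a processor: the subject is always the page itself (written <>), property elements are assumed to contain only text, and datatypes are ignored. The mailto address is a placeholder.

```python
from html.parser import HTMLParser


class RdfaSketch(HTMLParser):
    """Collect (subject, predicate, object) from property= and rel= attributes."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._open_property = None  # predicate whose text content is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "rel" in a and "href" in a:
            # rel/href pairs give a resource-valued triple directly.
            self.triples.append(("<>", a["rel"], a["href"]))
        elif "property" in a:
            self._open_property = a["property"]
            self._text = []

    def handle_data(self, data):
        if self._open_property is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first end tag closes the open property element.
        if self._open_property is not None:
            self.triples.append(("<>", self._open_property, "".join(self._text)))
            self._open_property = None


PAGE = """<html xmlns:foaf="http://xmlns.com/foaf/0.1/">
<body>
This is <span property="foaf:name">Jo Lambda</span>'s home page.
<a rel="foaf:mbox" href="mailto:jo@example.org">email me</a>,
or call <span property="foaf:phone">+1 777 888 9999</span>.
</body></html>"""

parser = RdfaSketch()
parser.feed(PAGE)
parser.close()
for triple in parser.triples:
    print(triple)
```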
But what about our agents?
[Figure: agents, their “tell” and “register” messages, and Swoogle]
A Google for knowledge on the Semantic Web
is needed by software agents and programs
This talk
• Motivation
• Semantic web 101
• Swoogle Semantic Web search engine
• Use cases and applications
• Conclusions
• http://swoogle.umbc.edu/
• Running since summer 2004
• 1.4M RDF documents, 250M RDF triples, 10K ontologies
• Semantic Web archive: many dynamic RDF documents
Swoogle Architecture
[Architecture diagram; labels recovered from the figure:]
• Discovery: SwoogleBot, Bounded Web Crawler, Google Crawler, candidate URLs
• Analysis: SWD classifier, Ranking
• Index: IR Indexer, SWD Indexer, …
• Search Services: Web Server (html, for humans), Web Service (rdf/xml, for machines)
• Stores: Semantic Web metadata, document cache
• Sources: the Web, the Semantic Web
• Legend: information flow, Swoogle’s web interface
A Hybrid Harvesting Framework
[Diagram; labels recovered from the figure:]
• Seed sources: manual submission, inductive learner, Swoogle sample dataset
• Seeds M: meta crawling (Google API call)
• Seeds H: bounded HTML crawling
• Seeds R: RDF crawling
• The crawlers crawl the Web
Performance – crawlers’ contribution
• High SWD ratio: 42% of URLs are confirmed as SWDs
• Consistent growth rate: 3000 SWDs per day
• RDF crawler: best harvesting method
• HTML crawler: best accuracy
• Meta crawler: best in detecting websites
[Bar chart: # of documents (0 to 1,500,000) by status (swd, nswd, failed, unpinged) for swoogle2, rdf crawler, meta crawler and html crawler]
This talk
• Motivation
• Swoogle overview
• Bots navigate the Semantic Web
• Ranking Semantic Web content
• Use cases and applications
• Conclusions
Applications and use cases
• Supporting Semantic Web developers
– Ontology designers, vocabulary discovery, who’s using my ontologies or data?, use analysis, errors, statistics, etc.
• Searching specialized collections
– Spire: aggregating observations and data from biologists
– InferenceWeb: searching over and enhancing proofs
– SemNews: Text Meaning of news stories
• Supporting SW tools
– Triple shop: finding data for SPARQL queries
Web-scale semantic web data access
[Diagram: an agent exchanging messages with a data access service that indexes RDF data from the Web]
(1) Agent searches vocabulary: ask (“person”)
(2) Service searches URIrefs in SW vocabulary: inform (“foaf:Person”)
(3) Agent composes query: ask (“?x rdf:type foaf:Person”)
(4) Service searches URLs in SWD index: inform (doc URLs)
(5) Agent fetches docs, populates RDF database, queries local RDF database
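The ask/inform exchange on this slide can be sketched end to end with in-memory stand-ins. Everything below is hypothetical: the two dictionaries play the role of the service's vocabulary and SWD indexes, WEB stands in for fetching and parsing documents, and the function names are invented for this illustration rather than taken from Swoogle's API.

```python
# Hypothetical stand-ins for the service's vocabulary and document indexes.
VOCABULARY = {"foaf:Person": "person", "foaf:name": "name"}
SWD_INDEX = {
    "foaf:Person": ["http://example.org/a.rdf", "http://example.org/b.rdf"],
}
# Hypothetical stand-in for fetching and parsing documents from the Web.
WEB = {
    "http://example.org/a.rdf": [("ex:li", "rdf:type", "foaf:Person")],
    "http://example.org/b.rdf": [
        ("ex:tim", "rdf:type", "foaf:Person"),
        ("ex:umbc", "rdf:type", "uni:School"),
    ],
}


def search_vocabulary(keyword):
    """ask("person") -> inform("foaf:Person")"""
    return sorted(t for t, label in VOCABULARY.items() if keyword.lower() in label)


def search_documents(term):
    """ask about a term -> inform(doc URLs)"""
    return SWD_INDEX.get(term, [])


def matches(pattern, triple):
    """A pattern element starting with '?' is a variable and matches anything."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


def answer(keyword):
    term = search_vocabulary(keyword)[0]   # search vocabulary
    pattern = ("?x", "rdf:type", term)     # compose query
    local_db = [t for url in search_documents(term) for t in WEB[url]]  # fetch, populate
    return sorted(t[0] for t in local_db if matches(pattern, t))        # query local DB


print(answer("person"))
```

The design point the sketch tries to capture is that the search service only returns terms and document URLs; the agent fetches the documents and runs the query against its own local store.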
UMBC Triple Shop
• Online SPARQL RDF query processing based on HP’s Joseki, with two features:
• Selectable level of inference
• Automatically finds SWDs for given queries using the Swoogle backend database
– Provides a dataset creation wizard and server-side dataset storage
– Lets users tag and share saved datasets
SPARQL: a query language for getting information from RDF graphs (datasets)
UMBC Triple Shop
Querying the Semantic Web is as easy
as shopping
(1) Go to http://sparql.cs.umbc.edu/
(2) You provide a SPARQL query and constraints on what
sources to use
(3) Swoogle finds and suggests documents with relevant
data, producing a dataset
(4) You specify the amount of reasoning to do, possibly
resulting in an enhanced dataset
(5) We run the query and give you the results
(6) You can also download the dataset or save it on the
server and give it tags
This talk
• Motivation
• Swoogle overview
• Bots navigate the Semantic Web
• Ranking Semantic Web content
• Use cases and applications
• Conclusions
Will it Scale? How?
Here’s a rough estimate of the data in RDF documents on the
semantic web based on Swoogle’s crawling
System/date   Terms      Documents   Individuals   Triples     Bytes
Swoogle2      1.5x10^5   3.5x10^5    7x10^6        5x10^7      7x10^9
Swoogle3      2x10^5     7x10^5      1.5x10^7      7.5x10^7    1x10^10
2006          1x10^6     5x10^7      5x10^7        5x10^9      5x10^11
2008          5x10^6     5x10^9      5x10^9        5x10^11     5x10^13
We think Swoogle’s centralized approach can be made to work
for the next few years if not longer.
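To get a feel for what these estimates imply, the short sketch below derives two ratios from the table: bytes per triple in each row, and the growth factor between two rows. The numbers are copied from the slide; the helper names are only for this illustration.

```python
# Estimates copied from the table on this slide.
ESTIMATES = {
    "Swoogle2": {"terms": 1.5e5, "documents": 3.5e5, "individuals": 7e6,
                 "triples": 5e7, "bytes": 7e9},
    "Swoogle3": {"terms": 2e5, "documents": 7e5, "individuals": 1.5e7,
                 "triples": 7.5e7, "bytes": 1e10},
    "2006": {"terms": 1e6, "documents": 5e7, "individuals": 5e7,
             "triples": 5e9, "bytes": 5e11},
    "2008": {"terms": 5e6, "documents": 5e9, "individuals": 5e9,
             "triples": 5e11, "bytes": 5e13},
}


def bytes_per_triple(row):
    """Average storage per triple implied by one row of the table."""
    return ESTIMATES[row]["bytes"] / ESTIMATES[row]["triples"]


def growth(start, end, column):
    """Factor by which a column grows between two rows."""
    return ESTIMATES[end][column] / ESTIMATES[start][column]


for row in ESTIMATES:
    print(f"{row}: {bytes_per_triple(row):.0f} bytes per triple")
print(f"Triples, Swoogle3 to 2008: x{growth('Swoogle3', '2008', 'triples'):,.0f}")
```

The rows work out to about 140 bytes per triple for Swoogle2 and 100 for the 2006 and 2008 projections, and the 2008 projection is several thousand times the Swoogle3 triple count.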
How much reasoning?
• SwoogleN (N<=3) does limited reasoning
– It’s expensive
– It’s not clear how much should be done
• More reasoning would benefit many use cases
– e.g., type hierarchy
• Recognizing specialized metadata
– E.g., that ontology A maps some terms from B to C
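As one example of the type-hierarchy reasoning mentioned above, a search for instances of a class could be expanded to include instances of its subclasses. The sketch below does this with a transitive walk over subclass links. It is an assumption about how such a feature might look, not a description of Swoogle; the class uni:GradStudent and the individuals are hypothetical.

```python
def subclasses(cls, subclass_of):
    """Return cls plus every class that is transitively a subclass of it."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for child, parents in subclass_of.items():
            if parent in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def instances_of(cls, triples, subclass_of):
    """Instances typed with cls or with any of its subclasses."""
    wanted = subclasses(cls, subclass_of)
    return sorted(s for s, p, o in triples if p == "rdf:type" and o in wanted)


# Hypothetical hierarchy: GradStudent is a Student, Student is a Person.
SUBCLASS_OF = {
    "uni:GradStudent": {"uni:Student"},
    "uni:Student": {"foaf:Person"},
}
DATA = [
    ("ex:li", "rdf:type", "uni:GradStudent"),
    ("ex:tim", "rdf:type", "foaf:Person"),
    ("ex:umbc", "rdf:type", "uni:School"),
]

print(instances_of("foaf:Person", DATA, SUBCLASS_OF))
```

Without the expansion, a query for foaf:Person would miss ex:li, who is only typed with a subclass; this is the cost/benefit trade-off the slide raises.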
Conclusion
• The web will contain the world’s knowledge in
forms accessible to people and computers
– We need better ways to discover, index, search and
reason over SW knowledge
• SW search engines address different tasks than
html search engines
– So they require different techniques and APIs
• Swoogle-like systems can help create consensus
ontologies and foster best practices
– Swoogle is for Semantic Web 1.0
– Semantic Web 2.0 will make different demands
For more information
http://ebiquity.umbc.edu/
Annotated
in OWL