CSCI8986 DBpedia

DBpedia: Querying Wikipedia like a Database
Nucleus for a web of Open Data
Songqing Liu
7/16/2015
CSCI8986: DBpedia
1
DBpedia is an effort to
• Extract structured information from Wikipedia
• Make this information available on the Web under an open license
• Interlink the DBpedia dataset with other datasets on the Web

Outline:
• Extracting Structured Information from Wikipedia
• DBpedia Dataset
• Accessing DBpedia Dataset over Web
• Use Cases:
  • Improving Wikipedia Search
  • Royalty-Free Data Source for other Applications
  • Nucleus for the Emerging Web of Data

•Title
•Abstract
•Infoboxes
•Geo-coordinates
•Categories
•Images
•Links
•Other languages
•Other wiki pages
•To the web
•Redirects
•Disambiguates
Extracting Structured Information from Wikipedia
• Wikipedia consists of
  • 12.379 million articles
  • In 275 languages (285 in total)
  • 35 million users
  • Monthly growth-rate: 4%
• Wikipedia articles contain structured information
  • Infoboxes, which use a template mechanism
  • Images depicting the article's topic
  • Categorization of the article
  • Links to external webpages
  • Intra-wiki links to other articles
  • Inter-language links to articles about the same topic in different languages

Overview of the components:
[Figure: Wikipedia dumps (article texts, DB tables) pass through extraction into the DBpedia datasets (articles, categories, infobox). The datasets are loaded into Virtuoso and MySQL and published via a SPARQL endpoint and Linked Data, which serve Web 2.0 mashups, traditional web browsers, Semantic Web browsers, the SNORQL browser, and the Query Builder.]

Infobox template:
Extracting Infobox Data (RDF)
• Webpage
  • http://en.wikipedia.org/wiki/Calgary
• DBpedia resource
  • http://dbpedia.org/page/Calgary
  • dbpedia:native_name "Calgary"
  • dbpedia:altitude "1048"
  • dbpedia:population_city "1096833"
  • dbpedia:population_metro "1214839"
  • Mayor_name dbpedia:Naheed_Nenshi
  • Governing_body dbpedia:Calgary_City_Council

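The mapping from infobox attributes to RDF statements can be sketched roughly as follows. This is a minimal illustration and not the actual DBpedia extraction code: the helper names `to_resource` and `infobox_to_triples` are hypothetical, and the rule that a `[[wiki link]]` value becomes a resource while everything else becomes a plain literal is a simplifying assumption.

```python
# Real DBpedia namespaces; how the extractor builds URIs internally is assumed here.
RESOURCE_NS = "http://dbpedia.org/resource/"
PROPERTY_NS = "http://dbpedia.org/property/"


def to_resource(title):
    """Turn a page title into a resource URI (spaces become underscores)."""
    return "<" + RESOURCE_NS + title.strip().replace(" ", "_") + ">"


def infobox_to_triples(page_title, infobox):
    """Map each infobox attribute to one N-Triples-style line (hypothetical helper)."""
    subject = to_resource(page_title)
    triples = []
    for key, value in infobox.items():
        predicate = "<" + PROPERTY_NS + key + ">"
        if value.startswith("[[") and value.endswith("]]"):
            # A wiki link such as [[Naheed Nenshi]] points at another article,
            # so the object becomes a resource rather than a literal.
            target = value[2:-2].split("|")[0]
            obj = to_resource(target)
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        triples.append(subject + " " + predicate + " " + obj + " .")
    return triples


# Attribute values taken from the Calgary slide; the wiki-link markup is assumed.
calgary = {
    "native_name": "Calgary",
    "population_city": "1096833",
    "Mayor_name": "[[Naheed Nenshi]]",
}
for line in infobox_to_triples("Calgary", calgary):
    print(line)
```

Each infobox row yields exactly one triple whose subject is the article itself, which is why infobox data converts to RDF so directly.
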
Question
Extracted information
• Short and long abstracts in different languages
dbpedia:Calgary
    dbpedia:abstract "Calgary is the largest ..."@en ;
    dbpedia:abstract "Calgary ist eine Stadt ..."@de .
• Categorization information
dbpedia:Calgary
    skos:subject dbpedia:Category_Cities_in_Alberta ;
    skos:subject dbpedia:Host_cities_Olympic_Games .
• Links to the original Wikipedia articles, pictures and relevant external web pages
dbpedia:Calgary
    foaf:page <http://en.wikipedia.org/wiki/Calgary> ;
    dbpedia:wikipage-de <http://de.wikipedia.org/wiki/Calgary> ;
    foaf:depiction <http://upload.wikimedia.org/thumb/3/32> ;
    dbpedia:reference <http://www.calgary.ca> ;
    dbpedia:reference <http://www.tourismcalgary.com> .

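Language-tagged abstracts let an application pick the text that matches a user's language. The sketch below shows one way to do that with a simplified parser. The function names are hypothetical, and the regular expression covers only the plain `"text"@lang` form shown on the slide, not the full Turtle grammar (no escaped quotes, no datatyped literals).

```python
import re

# Simplified pattern for a language-tagged literal such as "some text"@en.
LANG_LITERAL = re.compile(r'^"(?P<text>.*)"@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)$')


def parse_lang_literal(literal):
    """Split a language-tagged literal into (text, language tag)."""
    match = LANG_LITERAL.match(literal.strip())
    if match is None:
        raise ValueError("not a language-tagged literal: " + literal)
    return match.group("text"), match.group("lang")


def pick_abstract(literals, preferred_lang):
    """Return the first abstract in the preferred language, or None."""
    for literal in literals:
        text, lang = parse_lang_literal(literal)
        if lang == preferred_lang:
            return text
    return None


# The two abstracts from the slide, kept in their abbreviated form.
abstracts = [
    '"Calgary is the largest ..."@en',
    '"Calgary ist eine Stadt ..."@de',
]
print(pick_abstract(abstracts, "de"))
```
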
DBpedia Basics:
• Structured information can be extracted from Wikipedia
  • It serves as a basis for enabling sophisticated queries against Wikipedia content
• The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model
  • For representing extracted information and for publishing it on the Web
  • The SPARQL query language is used to query this data
• The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language

The DBpedia Dataset
• Describes 20.8 million things, of which 10.5 million overlap with the English DBpedia
• The full DBpedia dataset features labels and abstracts for 10.3 million unique things in 111 different languages
• 8.0 million links to images and 24.4 million HTML links to external web pages
• 27.2 million data links into external RDF data sets
• 55.8 million links to Wikipedia categories and 8.2 million links to YAGO categories
• Consists of 1.89 billion pieces of information (RDF triples), of which 400 million were extracted from the English edition
• English version: 3.77 million things, of which 2.35 million are classified in a consistent ontology
  • 764,000 persons
  • 573,000 places
  • 333,000 creative works: music albums, films and video games
  • 192,000 organizations: companies and educational institutions
  • 202,000 species and 5,500 diseases

Multi-Lingual Abstracts
• Dataset contains a short and a long English abstract for each concept
• Short abstracts
  • English: 3,770,000
  • German: 1,244,000
  • French: 1,197,000
  • Dutch: 993,000
  • Italian: 882,000
  • Spanish: 879,000
  • Polish: 848,000
  • Japanese: 781,000
  • Portuguese: 699,000
  • Swedish: 457,000
  • Chinese: 445,000

Accessing DBpedia Dataset over the Web
• SPARQL Endpoint
• Linked Data Interface
• DB Dumps for Download

SPARQL:
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information on the Web
• The W3C SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources
  • whether the data is stored natively as RDF or viewed as RDF via middleware

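To make the "directed, labeled graph" idea concrete, the sketch below stores a few triples from the Calgary slides as tuples and matches a single triple pattern against them, which is the basic operation behind a SPARQL `WHERE` clause. It is a toy illustration, not a SPARQL engine: `match_pattern` is a hypothetical helper, and real SPARQL joins several patterns and supports filters, optional parts and more.

```python
# Triples taken from the Calgary examples, written as
# (subject, predicate, object) tuples. Terms starting with "?" are variables.
GRAPH = {
    ("dbpedia:Calgary", "skos:subject", "dbpedia:Category_Cities_in_Alberta"),
    ("dbpedia:Calgary", "skos:subject", "dbpedia:Host_cities_Olympic_Games"),
    ("dbpedia:Calgary", "dbpedia:reference", "<http://www.calgary.ca>"),
}


def match_pattern(graph, pattern):
    """Return one variable binding (a dict) per triple that fits the pattern."""
    results = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No mismatch found: this triple is a solution.
            results.append(binding)
    return results


categories = match_pattern(GRAPH, ("dbpedia:Calgary", "skos:subject", "?category"))
print(sorted(b["?category"] for b in categories))
```
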
DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• Hosted on an OpenLink Virtuoso server
• Can answer SPARQL queries such as
  • Give me all sitcoms that are set in NYC
  • All tennis players from Moscow
  • All films by Quentin Tarantino
  • All German musicians who were born in Berlin in the 19th century

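A question such as "all films by Quentin Tarantino" reaches the endpoint as an HTTP request carrying the query text. The sketch below only builds the request URL and sends nothing. Two parts are assumptions to verify against the live endpoint: the `dbo:director` property used in the query, and the `format` parameter for choosing the result serialization. The `query` parameter name comes from the SPARQL protocol.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL from the slide

# Assumed vocabulary: dbo:director is used for illustration only; the property
# that actually links films to directors should be checked on the endpoint.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?film WHERE {
  ?film dbo:director <http://dbpedia.org/resource/Quentin_Tarantino> .
}
"""


def build_query_url(endpoint, query, result_format="application/sparql-results+json"):
    """Encode a SPARQL query as an HTTP GET URL (hypothetical helper).

    "query" is the parameter name defined by the SPARQL protocol; "format" is
    an assumption about what the Virtuoso-hosted endpoint accepts.
    """
    params = {"query": query.strip(), "format": result_format}
    return endpoint + "?" + urlencode(params)


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

Passing the resulting URL to `urllib.request.urlopen` would send the query. That step is left out so the sketch runs without network access.
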
Interesting Example: Finding everything Bart wrote on the blackboard in season 12 of The Simpsons
• The Simpsons episode Wikipedia pages are the identified "things" (entities) that we consider as the subjects of our RDF triples.
• The content of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12".
• The episode's DBpedia page tells us that p:blackboard is the property name for the Wikipedia infobox "Chalkboard" field.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?episode ?chalkboard_gag WHERE {
  ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag }

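The result of such a SELECT query is a table with one column per variable. When the endpoint returns the W3C SPARQL JSON results format, the table can be flattened as sketched below. The sample response is a hand-written placeholder in that layout, not real DBpedia output, and `bindings_to_rows` is a hypothetical helper name.

```python
import json

# Illustrative response in the W3C SPARQL JSON results layout. The episode URI
# and gag text are placeholders, not data returned by DBpedia.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["episode", "chalkboard_gag"]},
  "results": {"bindings": [
    {"episode": {"type": "uri", "value": "http://example.org/episode/1"},
     "chalkboard_gag": {"type": "literal", "value": "placeholder gag text"}}
  ]}
}
"""


def bindings_to_rows(response_text):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(response_text)
    variables = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        # A variable may be unbound in a given solution, hence the .get() calls.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


for row in bindings_to_rows(SAMPLE_RESPONSE):
    print(row["episode"], "|", row["chalkboard_gag"])
```
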
Linked Data Interface
• A large body of information and knowledge is already available in structured form, yet not accessible on the Web
• Integrating open data provides real value
• Linked Data on the Web can be accessed using Semantic Web browsers
• Semantic Web browsers enable users to navigate between different data sources
• Linked Data also allows robots of Semantic Web search engines to follow these links to crawl the Semantic Web

Linked Data Interface
• Project follows Linked Data principles
  • All concepts are identified using Uniform Resource Identifier (URI) references; a URI is a compact string of characters used to identify or name a resource
• Linked Data interface can be used by
  • Semantic Web browsers, like
    • DISCO Hyperdata Browser
    • Tabulator Browser
    • OpenLink RDF Browser
  • Semantic Web crawlers, like
    • Zitgist (Zitgist LLC, USA)
    • SWSE (DERI, Ireland)
    • Swoogle (UMBC, USA)

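Browsers and crawlers use the Linked Data interface by dereferencing a concept's URI and asking for RDF rather than HTML through the HTTP `Accept` header. The sketch below prepares such a request without sending it. The resource URI pattern `http://dbpedia.org/resource/...` and the chosen media type are assumptions for illustration, and `build_rdf_request` is a hypothetical helper.

```python
from urllib.request import Request


def build_rdf_request(resource_uri, media_type="application/rdf+xml"):
    """Prepare (but do not send) an HTTP request asking for RDF instead of HTML."""
    return Request(resource_uri, headers={"Accept": media_type})


# Assumed resource URI for the Calgary concept used in the earlier slides.
request = build_rdf_request("http://dbpedia.org/resource/Calgary")
print(request.full_url, request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the lookup. Using the same URI with a different `Accept` value is what lets one identifier serve both people and machines.
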
DBpedia Use Cases
• Improving Wikipedia Search
• Royalty-Free Data Source for other Applications
• Nucleus for the Emerging Web of Data

Improving Wikipedia Search (Various Interfaces)
Improving Wikipedia Search
Royalty-Free Data Source for other Applications
• DBpedia is published under the GNU Free Documentation License
• Example use case: SPARQL-generated tables within webpages

Nucleus for the Emerging Web of Data
• W3C SWEO Linking Open Data Project
• 295 data sets consisting of over 31 billion RDF triples, which are interlinked by 504 million RDF links

DBpedia User Applications
• AboutThisDay.com: Search engine for births and deaths of people, etc.
• DBpedia Mobile: Map view annotated with DBpedia entities
• RelFinder: Connections between objects
• SemLens: Uses scatter plots to analyze DBpedia data and semantic lenses
• DBpedia Navigator:
• Alumis: Answer engine based on DBpedia

How can I support DBpedia?
• Develop another cool user interface to DBpedia
• Publish more RDF datasets with dereferenceable URIs
• Interlink your datasets with DBpedia

Discussions?
• DBpedia Website
  • http://wiki.dbpedia.org/About
• Linking Open Data Project Website
  • http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData