Scalable RDF Data Management & SPARQL Query Processing

Download Report

Transcript Scalable RDF Data Management & SPARQL Query Processing

Scalable
RDF Data Management
& SPARQL Query Processing
Martin Theobald1,
Katja Hose2, Ralf Schenkel3
1University
of Antwerp, Belgium
2University of Aalborg, Denmark
3University of Passau, Germany
Outline of this Tutorial
• Part I
– RDF in Centralized Relational Databases
• Part II
– RDF in Distributed Settings
• Part III
– Managing Uncertain RDF Data
Outline for Part I
• Part I.1: Foundations
– Introduction to RDF and Linked Open Data
– A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook
Information Extraction
YAGO/DBpedia et al.
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)
>120 M facts for YAGO2
(mostly from Wikipedia infoboxes
& categories)
YAGO2 Knowledge Base
3 M entities, 120 M facts
100 relations, 200k classes
Entity
subclass
subclass
Organization
accuracy
 95%
subclass
Person
Location
subclass
subclass
Scientist
subclass
subclass
Biologist
subclass
Politician
instanceOf
subclass
subclass
State
instanceOf
Physicist
Country
instanceOf
City
instanceOf
Germany
instanceOf
Oct 23, 1944
instanceOf
Max_Planck
Society
instanceOf
Erwin_Planck
diedOn
Nobel Prize
Kiel
hasWon
FatherOf
locatedIn
locatedIn
bornIn
SchleswigHolstein
citizenOf
Oct 4, 1947
Apr 23, 1858
diedOn
Max_Planck
bornOn
means
means
Angela Merkel
“Max
Planck”
http://www.mpi-inf.mpg.de/yago-naga/
means
“Max Karl Ernst
Ludwig Planck”
means
“Angela
Merkel”
means
“Angela
Dorothea
Merkel”
Why care about scalability?
Rapid growth of available semantic data
Sources:
linkeddata.org
wikipedia.org
Why care about scalability?
As of Sept. 2011:
> 5 million owl:sameAs links
between DBpedia/YAGO/Freebase
Rapid growth of available semantic data
Sources:
linkeddata.org
wikipedia.org
More than 30 billion triples in more than 200 sources across the LOD cloud
DBPedia: 3.4 million entities, 1 billion triples
… and still growing
• Billion triple challenge 2008: 1B triples
• Billion triple challenge 2010: 3B triples
http://km.aifb.kit.edu/projects/btc-2010/
• Billion triple challenge 2011: 2B triples
http://km.aifb.kit.edu/projects/btc-2011/
•
|
• War stories from
http://www.w3.org/wiki/LargeTripleStores
–
–
–
–
BigOWLIM: 12B triples in Jun 2009
Garlik 4Store: 15B triples in Oct 2009
OpenLink Virtuoso: 15.4B+ triples
AllegroGraph: 1+ Trillion triples
Queries can be complex, too
SELECT DISTINCT ?a ?b ?lat ?long WHERE
{ ?a dbpedia:spouse ?b.
?a dbpedia:wikilink dbpediares:actor.
?b dbpedia:wikilink dbpediares:actor.
?a dbpedia:placeOfBirth ?c.
?b dbpedia:placeOfBirth ?c.
?c owl:sameAs ?c2.
?c2 pos:lat ?lat.
?c2 pos:long ?long.
}
Q7 on BTC2008 in [Neumann & Weikum, 2009]
What effects does the financial crisis have on
migration rates in the US?
Is there a significant increase of serious weather
conditions in Europe over the past 20 years?
Which glutamic-acid proteases are inhibitors of HIV?
Question Answering (QA) Systems
• KB from Wikipedia and user edits
• 600 million facts, 25 million entities
• KB of curated, structured data
• 10 trillion (!) facts, 50k algorithms
IBM Watson: Deep Question Answering
William Wilkinson's "An Account of the
Principalities of Wallachia and Moldavia"
inspired this author's most famous novel
This town is known as "Sin City" & its
downtown is "Glitter Gulch"
As of 2010, this is the only
former Yugoslav republic in the EU
99 cents got me a 4-pack of Ytterlig coasters
from this Swedish chain
question
classification &
decomposition
knowledge
back-ends
D. Ferrucci et al.: Building Watson: An Overview of the
DeepQA Project. AI Magazine, Fall 2010.
YAGO
www.ibm.com/innovation/us/watson/index.htm
SPARQL 1.0 / 1.1
• Query language for RDF suggested by the W3C.
• 3 ways to interpret RDF data:
– Instances of logical predicates (“facts”)
– Graphs (subjects/objects as nodes,
predicates as directed and labeled edges)
– Relations (either multiple binary relations
or a single, large ternary relation)
• SPARQL main building block:
– select-project-join combination of relational triple patterns
equivalent to graph isomorphism queries
over a potentially very large RDF graph
SPARQL – Example
Example query:
Find all actors from Ontario (that are in the knowledge base)
scientist
isA
actor
isA
vegetarian
isA
isA
Mike_Myers
Jim_Carrey
bornIn
bornIn
Scarborough
Newmarket
locatedIn
locatedIn
Ontario
locatedIn
Canada
isA
physicist
isA
chemist
isA
isA
Albert_Einstein
Otto_Hahn
bornIn
bornIn
Ulm
Frankfurt
locatedIn
locatedIn
Germany
locatedIn
Europe
SPARQL – Example
Example query:
Find all actors from Ontario (that are in the knowledge base)
SELECT ?person WHERE { ?person isA actor. ?person bornIn ?loc .
?loc locatedIn Ontario . }
Find subgraphs of this form:
actor
constants
isA
?person
bornIn
?loc
locatedIn
Ontario
actor
isA
variables
vegeta
isA
isA
Mike_Myers
Jim_Carrey
bornIn
bornIn
Scarborough
Newmarket
locatedIn
locatedIn
Ontario
locatedIn
Canada
SPARQL 1.0 – More Features
• Eliminate duplicates in results
SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc.
?loc locatedIn ?c}
• Return results in some order
SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc.
?loc locatedIn Ontario} ORDER BY DESC(?person)
with optional LIMIT n clause
• Optional matches and filters on bounded var’s
SELECT ?person WHERE {?person isA actor.
OPTIONAL{?person bornIn ?loc}.
FILTER (!BOUND(?loc))}
• More operators: ASK, DESCRIBE, CONSTRUCT
See: http://www.w3.org/TR/rdf-sparql-query/
SPARQL 1.1 Extensions of the W3C
W3C SPARQL 1.1:
• Aggregations (COUNT, AVG, …) and grouping
• Subqueries in the WHERE clause
• Safe negation: FILTER NOT EXISTS {?x …}
– Syntactic sugar for
OPTIONAL {?x … }
FILTER(!BOUND(?x))
• Expressions in the SELECT clause:
SELECT (?a+?b) AS ?sum
• Label constraints on paths:
?x foaf:knows/foaf:knows/foaf:name ?name
• More functions and operators …
RDF+SPARQL: Centralized Engines
•
•
•
•
•
•
BigOWLIM (now ontotext.com)
OpenLink Virtuoso
OntoBroker (now semafora-systems.com)
Apache Jena (different main-memory/relational backends)
Sesame (now openRDF.org)
SW-Store, Hexastore, 3Store, RDF-3X
(no reasoning)
System deployments with >1011 triples
( see http://esw.w3.org/LargeTripleStores)
SPARQL: Extensions from Research (1)
More complex graph patterns:
• Transitive paths [Anyanwu
et al., WWW’07]
SELECT ?p, ?c WHERE {
?p isA scientist .
?p ??r ?c. ?c isA Country . ?c locatedIn Europe .
PathFilter(cost(??r) < 5) .
PathFilter(containsAny(??r,?t ). ?t isA City . }
• Regular expressions [Kasneci et al., ICDE’08]
SELECT ?p, ?c WHERE {
?p isA ?s. ?s isA scientist.
?p (bornIn | livesIn | citizenOf) locatedIn* Europe.}
Meanwhile mostly covered by the SPARQL 1.1 query proposal.
SPARQL: Extensions from Research (2)
Queries over federated RDF sources:
• Determine distribution of triple patterns as part
of query (for example in Jena ARQ)
• Automatically route triple predicates to useful
sources
SPARQL: Extensions from Research (2)
Queries over federated RDF sources:
• Determine distribution of triple patterns as part
of query (for example in Jena ARQ)
• Automatically route triple predicates to useful
sources
Potentially requires mapping of
identifiers from different sources
SPARQL 1.1 explicitly supports
federation of sources
http://www.w3.org/TR/sparql11
-federated-query/
Ranking is Essential!
• Queries often have a huge number of results:
– “scientists from Canada”
– “publications in databases”
– “actors from the U.S.”
• Queries may have no matches at all:
– “Laboratoire d'informatique de Paris 6”
– “most beautiful railway stations”
• Ranking is an integral part of search
• Huge number of app-specific ranking methods:
paper/citation count, impact, salary, …
• Need for generic ranking of 1) entities and 2) facts
Extending Entities with Keywords
Remember: entities occur in facts & in documents
 Associate entities with terms in those documents,
keywords in URIs, literals, …
(context of entity)
chancellor
Germany
scientist election
Stuttgart21
Guido
Westerwelle
France Nicolas
Sarkozy
Extensions: Keywords
Problem: not everything is triplified!
• Consider witnesses/sources
(provenance meta-facts)
• Allow text predicates with
each triple pattern (à la XQ-FT)
Semantics:
• triples match struct. pred.
• witnesses match text pred.
European composers who have won the Oscar,
whose music appeared in dramatic western scenes,
and who also wrote classical pieces ?
Select ?p Where {
?p instanceOf Composer .
?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe .
?p hasWon ?a .?a Name AcademyAward .
?p contributedTo ?movie [western, gunfight, duel, sunset] .
?p composed ?music [classical, orchestra, cantata, opera] . }
Extensions: Keywords
Problem: not everything is triplified!
• Consider witnesses/sources
(provenance meta-facts)
• Allow text predicates with
each triple pattern (à la XQ-FT)
Proximity of
keywords or phrases
boosts expressiveness
French politicians married to Italian singers?
Select ?p1, ?p2 Where {
?p1 instanceOf ?c1 [France, politics] .
?p2 instanceOf ?c2 [Italy, singer] .
?p1 marriedTo ?p2 . }
CS researchers whose advisors worked on the Manhattan project?
Select ?r, ?a Where {
instOf
science“] .
?r ?p1
?o1researcher
[“computer[“computer
science“] .
workedOn
?x [“Manhattan
project“]
.
?a ?p2
?o2 [“Manhattan
project“]
.
hasAdvisor
?r ?p3
?a . } ?a . }
Extensions: Keywords
Problem: not everything is triplified!
CLEF/INEX 2012-13
Linked Data Track
https://inex.mmci.uni-saarland.de/tracks/lod/
<topic id="2012374" category="Politics">
<jeopardy_clue>Which German politician is a successor of another politician
who stepped down before his or her actual term was over,
and what is the name of their political ancestor?</jeopardy_clue>
<keyword_title>German politicians successor other stepped down before actual
term name ancestor</keyword_title>
<sparql_ft>
SELECT ?s ?s1 WHERE {
?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
?s1 <http://dbpedia.org/property/successor> ?s .
FILTER FTContains (?s, "stepped down early") .
}
</sparql_ft>
</topic>
Extensions: Keywords / Multiple Languages
Problem: not everything is triplified!
Multilingual Question Answering over
Linked Data (QALD-3), CLEF 2011-13
http://greententacle.techfak.unibielefeld.de/~cunger/qald/
<question id="4" answertype="resource" aggregation="false" onlydbo="true">
<string lang="en">Which river does the Brooklyn Bridge cross?</string>
<string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
<string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
<string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
<string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string>
<string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
<keywords lang="en">river, cross, Brooklyn Bridge</keywords>
<keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
<keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
<keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
<keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
<keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
<query>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX res: <http://dbpedia.org/resource/>
SELECT DISTINCT ?uri
WHERE {
res:Brooklyn_Bridge dbo:crosses ?uri .
}
</query>
</question>
What Makes a Fact “Good”?
Confidence:
Prefer results that are likely correct
 accuracy of info extraction
 trust in sources
(authenticity, authority)
Informativeness:
Prefer results with salient facts
Statistical estimation from:
 frequency in answer
 frequency on Web
 frequency in query log
Diversity:
Prefer variety of facts
Conciseness:
Prefer results that are tightly connected
 size of answer graph
 weight of Steiner tree
bornIn (Jim Gray, San Francisco) from
“Jim Gray was born in San Francisco”
(en.wikipedia.org)
livesIn (Michael Jackson, Tibet) from
“Fans believe Jacko hides in Tibet”
(www.michaeljacksonsightings.com)
q: Einstein isa ?
Einstein isa scientist
Einstein isa vegetarian
q: ?x isa vegetarian
Einstein isa vegetarian
Whocares isa vegetarian
E won … E discovered … E played …
E won … E won … E won … E won …
Einstein won NobelPrize
Bohr won NobelPrize
Einstein isa vegetarian
Cruise isa vegetarian
Cruise born 1962 Bohr died 1962
How Can We Implement This?
Confidence:
Prefer results that are likely correct
 accuracy of info extraction
 trust in sources
(authenticity, authority)
Informativeness:
Prefer results with salient facts
Statistical estimation from:
 frequency in answer
 frequency on Web
 frequency in query log
Diversity:
Prefer variety of facts
Conciseness:
Prefer results that are tightly connected
 size of answer graph
 weight of Steiner tree
Empirical accuracy of Information Extraction
PageRank-style estimate of trust
combine into:
max { accuracy (f,s) * trust(s) |
s  witnesses(f) }
PageRank-style entity/fact ranking
[V. Hristidis et al., S.Chakrabarti, …]
or
IR models: tf*idf … [K.Chang et al., …]
Statistical Language Models [de Rijke et al.]
Statistical Language Models
[Zhai et al., Elbassuoni et al.]
Graph algorithms (BANKS, STAR, …)
[S.Chakrabarti et al., G.Kasneci et al., …]
Outline for Part I
• Part I.1: Foundations
– Introduction to RDF
– A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook
RDF in Rowstores
•
Rowstore: general relational database, storing
relations (incl. facts) as complete rows
(MySQL, PostgreSQL, Oracle, DB2, SQLServer, …)
•
General principles:
– store triples in one giant three-attribute table
(subject, predicate, object)
– convert SPARQL to equivalent SQL
– The database will do the rest
• Strings
often
mapped
to unique (with
integer
IDs
Simple
extension
to quadruples
graphid):
(graph,subject,predicate,object)
• Used by many
TripleStores, including 3Store,
Jena, HexaStore, RDF-3X, …
We consider only triples for simplicity!
Example: Single Triple Table
ex:Katja
ex:Martin
ex:Ralf
ex:teaches
ex:works_for
ex:PhD_from
ex:teaches
ex:works_for
ex:PhD_from
ex:teaches
ex:PhD_from
ex:works_for
ex:Databases;
ex:MPI_Informatics;
ex:TU_Ilmenau.
ex:Databases;
ex:MPI_Informatics;
ex:Saarland_University.
ex:Information_Retrieval;
ex:Saarland_University;
ex:Saarland_University,
ex:MPI_Informatics.
subject
ex:Katja
ex:Katja
ex:Katja
ex:Martin
ex:Martin
ex:Martin
ex:Ralf
ex:Ralf
ex:Ralf
ex:Ralf
predicate
ex:teaches
ex:works_for
ex:PhD_from
ex:teaches
ex:works_for
ex:PhD_from
ex:teaches
ex:PhD_from
ex:works_for
ex:works_for
object
ex:Databases
ex:MPI_Informatics
ex:TU_Ilmenau
ex:Databases
ex:MPI_Informatics
ex:Saarland_University
ex:Information_Retrieval
ex:Saarland_University
ex:Saarland_University
ex:MPI_Informatics
Conversion of SPARQL to SQL
General approach to translate SPARQL into SQL:
(1) Each triple pattern is translated into a (self-) JOIN
over the triple table
(2) Shared variables create JOIN conditions
(3) Constants create WHERE conditions
(4) FILTER conditions create WHERE conditions
(5) OPTIONAL clauses create OUTER JOINS
(6) UNION clauses create UNION expressions
Example: Conversion to SQL Query
SELECT ?a ?b ?t WHERE
{?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. }
OPTIONAL {?a teaches ?t}
FILTER (regex(?u, “Saar”))
Projection

?a

?a,?u
P4
P1.subject
R1.B,as
R2.T
A, FROM
P2.subject as B
SELECT R1.A,

?u
P3
( SELECT
P1.subject
as A,P2,
P2.subject
as B
FROM
Triples
P1, Triples
Triples P3
Filter
P2
WHERE
FROM Triples
P1.predicate=“works_for”
P1, Triples P2, Triples
AND
P3 P2.predicate=“works_for”
regex(?u,“Saar“)
WHERE
AND P3.predicate=“phd_from”
P1.predicate=“works_for” AND P2.predicate=“works_for”
P1 AND P1.object=P3.object
AND
ANDP1.object=P2.object
P3.predicate=“phd_from”
AND P1.subject=P3.subject
AND
ANDREGEXP_LIKE(P1.object,
P1.object=P2.object AND
“Saar”)
P1.subject=P3.subject AND P1.object=P3.object
AND REGEXP_LIKE(P1.object, “Saar”)
) R1 LEFT OUTER JOIN
( SELECT P4.subject as A, P4.object as T FROM Triples P4
WHERE P4.predicate=“teaches”) AS R2
) ON (R1.A=R2.A)
Is that all?
Well, no.
• Which indexes should be built?
(to support efficient evaluation of triple patterns)
• How can we reduce storage space?
• How can we find the best execution plan?
Existing databases need modifications:
• flexible, extensible, generic storage not needed here
• cannot deal with multiple self-joins of a single table
• often generate bad execution plans
Dictionary for Strings
Map all strings to unique integers (e.g., via hashing)
• Regular size (4-8 bytes), much easier to handle
• Dictionary usually small, can be kept in main
memory
<http://example.de/Katja>
<http://example.de/Martin>
<http://example.de/Ralf>
 194760
 679375

4634
This may break original lexicographic sorting order
 RANGE conditions (not in SPARQL) are difficult!
 FILTER conditions may be more expensive!
Indexes for Commonly Used Triple Patterns
Patterns with a single variable are frequent
Example: Albert_Einstein invented ?x
 Build clustered index over (s,p,o)
(16,19,5356)
(16,24,567)
(16,24,876)
All triples in
(s,p,o) order (27,19,643)
(27,48,10486)
(50,10,10456)
…
B+ tree for
easy access
1. Lookup ids for constants:
Albert_Einstein  16, invented  24
2. Lookup known prefix in index:
(16,24,0)
3. Read results while prefix matches:
(16,24,567), (16,24,876)
come already sorted!
Can also be used for pattern like Albert_Einstein
?p ?x
Build similar clustered indexes for all six permutations (3 x 2 x 1 = 6)
• SPO, POS, OSP to cover all possible triplet patterns
• SOP, OPS, PSO to have all sort orders for patterns with two var’s
Triple table no longer needed, all triples in each index
Why Sort Order Matters for Joins
When inputs sorted by join
attribute, use Merge Join:
• sequentially scan both inputs
• immediately join matching triples
• skip over parts without matches
• allows pipelining
When inputs are unsorted/sorted
by wrong attribute, use Hash Join:
• build hash table from one input
• scan other input, probe hash table
• needs to touch every input triple
• breaks pipelining
 HJ
 MJ
(16,19,5356)
(16,24,567)
(16,24,876)
(27,19,643)
(27,48,10486)
(50,10,10456)
(16,33,46578)
(16,56,1345)
(24,16,1353)
(27,18,133)
(47,37,20495)
(50,134,1056)
(16,19,5356)
(16,24,567)
(16,24,876)
(27,19,643)
(27,48,10486)
(50,10,10456)
(27,18,133)
(50,134,1056)
(16,56,1345)
(24,16,1353)
(47,37,20495)
(16,33,46578)
In general, Merge Joins are more preferable:
small memory footprint, pipelining
RDF-3x: Even More Indexes!
SPARQL 1.0 considers duplicates (unless removed with DISTINCT)
but does not (yet) support aggregates/counting
often queries with many duplicates like
SELECT ?x WHERE ?x ?y Germany.
to retrieve entities related to Germany
(but counts may be important in the application!)
this materializes many identical intermediate results
Solution: even more redundancy!
• Pre-compute aggregated indexes SP,SO,PO,PS,OP,OS,S,P,O
Example: SO contains, for each pair (s,o), the number of triples
with subject s and object o
• Do not materialize identical bindings, but keep counts
Example: ?x=Albert_Einstein:4; ?x=Angela_Merkel:10
• 15 indexes overall (all SPO permutations + their unique subsets)
RDF-3x: Compression Scheme for Triplets
• Compress sequences of triples in lexicographic order
(v1;v2;v3); for SPO: v1=S, v2=P, v3=O
• Step 1: compute per-attribute deltas
(16,19,5356)
(16,19,5356)
(16,24,567)
(0,5,-4798)
(16,24,676)
 (0,0,109)
(27,19,643)
(11,-5,-34)
(27,48,10486)
(0,29,9843)
(50,10,10456)
(23,-38,-30)
• Step 2: variable-byte encoding for each delta triple
gap
bit
header
(7 bits)
Delta of value 1
(0-4 bytes)
Delta of value 2
(0-4 bytes)
Delta of value 3
(0-4 bytes)
1-13 bytes
When gap=1, the
delta of value3 is
included in header,
all others are 0
Otherwise, header contains length of encoding for each of the
three deltas (5*5*5=125 combinations stored in 7 bits)
Many variants exist; this one is designed for triplets…
Compression Effectiveness vs. Efficiency
• Byte-level encoding almost as effective as bit-level
encoding techniques (Gamma, Golomb, Rice, etc.)
• Much faster (10x) for decompressing
• Example for Barton dataset [Neumann & Weikum: VLDB’10]:
– Raw data 51 million triples, 7GB uncompressed (as N-Triples)
– All 6 main indexes:
• 1.1GB size, 3.2s decompression with byte-level encoding
• Optionally: additional compression with LZ77 2x more
compact, but much slower to decompress
– Compression always on page level
Back to the Example Query
SELECT ?a ?b ?t WHERE
{?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. }
OPTIONAL {?a teaches ?t}
FILTER (regex(?u, “Saar”))
250
Projection
50
?u MJ
 MJ
1000
PSO(works_for,?u,?b)
?u,?a
Filter
regex(?u,“Saar“)
POS(teaches,?a,?t)
100
POS(pdh_from,?u,?a)
1000
POS(works_for,?u,?a)
Projection
?a
250
?a
250
5
250
2500
50
 MJ
?u
Filter
 HJ
?a,?u
POS(teaches,?a,?t)
100
PSO(phd_from,?a,?u)
1000
regex(?u,“Saar“)
POS(works_for,?u,?b)
1000
POS(works_for,?u,?a)
Core ingredients of a goodWhich
query of
optimizer
the twoare
plans is better?
selectivity estimators for triple patterns
(indexintermediate
scans) and joins
How many
results?
RDF-3x: Selectivity Estimation
How many results will a triple pattern have?
Standard databases:
• Per-attribute histograms
• Assume independence of attributes
too simplistic
and inexact
 Use aggregated indexes for exact count
Additional join statistics for triple blocks (pages):
…
(16,19,5356)
(16,24,567)
(16,24,876)
(27,19,643)
(27,48,10486)
(50,10,10456)
…
Assume independence
between triple patterns;
additionally precompute
exact statistics for frequent
paths in the data
Handling Updates
What should we do when our data changes?
(SPARQL 1.1 has updates!)
Assumptions:
• Queries far more frequent than updates
• Updates mostly insertions, hardly any deletions
• Different applications may update concurrently
Solution: Differential Indexing
RDF-3x: Differential Updates
Staging architecture for updates in RDF-3X
Workspace A:
Triples inserted
completion by application A
of A
Workspace B:
Triples inserted
completion by application B
of B
on-demand indexes
at query time
kept in main memory
Deletions:
• Insert the same tuple again with “deleted” flag
• Modify scan/join operators: merge differential indexes with main index
Outline for Part I
• Part I.1: Foundations
– Introduction to RDF
– A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook
Principles
Observations and assumptions:
• Not too many different predicates
• Triple patterns usually have fixed predicate
• Need to access all triples with one predicate
Design consequence:
• Use one two-attribute table for each predicate
Example: Columnstores
ex:Katja
ex:Martin
ex:Ralf
ex:teaches
ex:Databases;
ex:works_for ex:MPI_Informatics;
works_for
ex:PhD_from ex:TU_Ilmenau.
subject
ex:teaches
ex:Databases;
ex:Katja
ex:works_for ex:MPI_Informatics;
ex:Martin
ex:PhD_from ex:Saarland_University.
ex:teaches
ex:Information_Retrieval; ex:Ralf
ex:Ralf
ex:PhD_from ex:Saarland_University;
ex:works_for ex:Saarland_University,
ex:MPI_Informatics.
teaches
subject
ex:Katja
ex:Martin
ex:Ralf
object
ex:Databases
ex:Databases
ex:Information_Retrieval
PhD_from
subject
ex:Katja
ex:Martin
ex:Ralf
object
ex:MPI_Informatics
ex:MPI_Informtatics
ex:Saarland_University
ex:MPI_Informatics
object
ex:TU_Ilmenau
ex:Saarland_University
ex:Saarland_University
Simplified Example: Query Conversion
SELECT ?a ?b ?t WHERE
{?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. }
SELECT W1.subject as A, W2.subject as B
FROM works_for W1, works_for W2, phd_from P3
WHERE W1.object=W2.object
AND W1.subject=P3.subject
AND W1.object=P3.object
So far, this is yet another relational representation of RDF.
So, what is a columnstore?
Columnstores and RDF
Columnstores store all columns of a table separately.
PhD_from
subject
ex:Katja
ex:Martin
ex:Ralf
object
ex:TU_Ilmenau
ex:Saarland_University
ex:Saarland_University
PhD_from:subject
ex:Katja
ex:Martin
ex:Ralf
PhD_from:object
ex:TU_Ilmenau
ex:Saarland_University
ex:Saarland_University
Advantages:
• Fast if only subject or object are accessed, not both
• Allows for a very compact representation
Problems:
• Need to recombine columns if subject and object are accessed
• Inefficient for triple patterns with predicate variable
Compression in Columnstores
General ideas:
• Store subject only once
• Use same order of subjects for all columns, including NULL values
when necessary
subject
ex:Katja
ex:Martin
ex:Ralf
ex:Ralf
PhD_from
ex:TU_Ilmenau
ex:Saarland_University
ex:Saarland_University
NULL
teaches
ex:Databases
ex:Databases
ex:Information_Retrieval
NULL
• Additional compression to get rid of NULL values
PhD_from: bit[1110]
ex:TU_Ilmenau
ex:Saarland_University
ex:Saarland_University
Teaches: range[1-3]
ex:Databases
ex:Databases
ex:Information_Retrieval
works_for
ex:MPI_Informatics
ex:MPI_Informatics
ex:Saarland_University
ex:MPI_Informatics
Outline for Part I
• Part I.1: Foundations
– Introduction to RDF
– A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook
Property Tables
Group entities with similar predicates into a relational table
(for example using RDF types or a clustering algorithm).
ex:Katja
ex:Martin
ex:Ralf
subject
ex:Katja
ex:Martin
ex:Ralf
ex:Ralf
ex:teaches
ex:Databases;
ex:works_for ex:MPI_Informatics;
ex:PhD_from ex:TU_Ilmenau.
ex:teaches
ex:Databases;
ex:works_for ex:MPI_Informatics;
ex:PhD_from ex:Saarland_University.
ex:teaches
ex:Information_Retrieval;
ex:PhD_from ex:Saarland_University;
ex:works_for ex:Saarland_University,
subject
ex:MPI_Informatics.
ex:Katja
ex:Martin
ex:Ralf
ex:Axel
predicate
object
ex:works_for ex:MPI_Informatics
ex:works_for ex:MPI_Informatics
ex:works_for ex:Saarland_University
ex:works_for ex:MPI_Informatics
teaches
ex:Databases
ex:Databases
ex:IR
NULL
PhD_from
ex:TU_Ilmenau
ex:Saarland_University
ex:Saarland_University
ex:TU_Vienna
“Leftover triples”
Property Tables: Pros and Cons
Advantages:
• More in the spirit of existing relational systems
• Saves many self-joins over triple tables etc.
Disadvantages:
• Potentially many NULL values
• Multi-value attributes problematic
• Query mapping depends on schema
• Schema changes very expensive
Even More Systems…
• Store RDF data as sparse matrix with bit-vector
compression [BitMat, Hendler at al.: ISWC’09]
• Convert RDF into XML and use XML methods
(XPath, XQuery, …)
• Store RDF data in graph databases and perform
bi-simulation [Fletcher at al.: ESWC’12] or employ
specialized graph index structures [gStore, Zou et al.:
PVLDB’11]
• And many more …
 See our list of readings.
Which Technique is Best?
• Performance depends a lot on precomputation,
optimization, implementation, fine-tuning …
• Comparative results on BTC 2008:
(from [Neumann & Weikum, 2009])
RDF-3X
RDF-3X (2008)
COLSTORE
ROWSTORE
RDF-3X
RDF-3X (2008)
COLSTORE
ROWSTORE
Challenges and Opportunities
• SPARQL with different entailment regimes
• New SPARQL 1.1 features
(grouping, aggregation, updates)
• User-oriented ranking of query results
– Efficient top-k operators
– Effective scoring methods for structured queries
• What are the limits of a centralized RDF engine?
• Dealing with uncertain RDF data –
what is the most likely query answer?
– Triples with probabilities  probabilistic databases
Outline of this Tutorial
• Part I
– RDF in Centralized Relational Databases
• Part II
– RDF in Distributed Settings
• Part III
– Managing Uncertain RDF Data
Outline for Part II
• Part II.1: Search Engines for the Semantic Web
• Part II.2: Mediator-based and Federated
Architectures
Semantic Web Search Engines
• Querying RDF data collections started by adapting
existing search engines to RDF data.
– Crawling for .rdf files, and HTML documents with embedded
RDF content (see: RDFa microformat).
– Indexing & search based on keywords extracted from entityand property names.
– Usually generate a virtual document for an entity (string literals
and human-readable names).
• Swoogle [Ding et al., CIKM’04] (University of Maryland)
• Falcons [Cheng at al., WWW’08] (Nanjing University)
Outline for Part II
• Part II.1: Search Engines for the Semantic Web
• Part II.2: Mediator-based and Federated
Architectures
Classification of Distributed Approaches
Approaches for querying
distributed and potentially heterogeneous (RDF) data sources
Virtually materialized
approaches
Materialization-based approaches
(data-warehousing)
Mediator-based
systems
Federated
systems
MapReduce/
Hadoop
DARQ
FedEx
YARS2
Shard, Jena-HBase
[Abadi et al. PVLDB’11]
Peer-2-Peer
Gridvine
RDFPeers
Shared-nothing
architectures
Partout
4Store
Eagre
Shared-memory
architectures
(Message-Passing, RMI, etc.)
Trinity (MSR)
How to Integrate Data Sources?
• Ship and integrate data from different sources
to the client.
• Three common approaches:
– Query-driven (single mediator)
– Database federations (exported schemas)
– Warehousing (fully integrated
?
& centrally managed)
RDF
Source
RDF
Source
Query-Driven Approach
SPARQL Client
SPARQL Client
Mediator
Wrapper
query
result
RDF
Source
Wrapper
query
result
RDF
Source
Wrapper
query
result
RDF
Source
List of SPARQL endpoints: http://www.w3.org/wiki/SparqlEndpoints
DBpedia: http://dbpedia.org/sparql
YAGO: https://d5gate.ag5.mpi-sb.mpg.de/webyagospo/Browser
Advantages of Query-Driven Integration
• No need to copy data
– no or little own storage costs
– no need to purchase data
• Potentially more up-to-date data
• Mediator holds catalog (statistics, etc.) and may
optimize queries
• Only generic query interface needed at sources
(SPARQL endpoints)
• May be less draining on sources
• Sources often even unaware of participation
Federation-based Approach
SPARQL Client
SPARQL Client
Federated Schema
Exported Schema
query
result
Exported Schema
query
Exported Schema
query
result
result
Local Schema
Local Schema
Local Schema
RDF
Source
RDF
Source
RDF
Source
Source 1
Source 2
…
Source n
Advantages of Federation-Based Integration
• Very similar to query-driven integration, except
– that the sources know that they are part of a federation;
– and they export their local schemas into a federated schema.
• Intermediate step toward full integration of the data in a single
“warehouse”.
Warehousing Architecture
SPARQL Client
SPARQL Client
Query & Analysis
Metadata
Warehouse
Integration
RDF
Source
RDF
Source
RDF
Source
Integrated LOD index:
http://lod2.openlinksw.com/sparql
Advantages of Warehousing
• Perform Extract-Transform-Load (ETL) processes
with periodic updates over the source
• High query performance
• Local processing at sources unaffected
• Can operate even when sources are offline
• Can query data that is no longer stored at sources
• More detailed statistics and metadata available at
warehouse
– Modify, summarize (store aggregates), analyse
– Add historical information, provenance, timestamps, etc.
Classification of Distributed Approaches
Approaches for querying
distributed and potentially heterogeneous (RDF) data sources
Virtually materialized
approaches
Materialization-based approaches
(data-warehousing)
Mediator-based
systems
Federated
systems
MapReduce/
Hadoop
DARQ
FedEx
YARS2
Shard, Jena-HBase
[Abadi et al. PVLDB’11]
Peer-2-Peer
Gridvine
RDFPeers
Shared-nothing
architectures
Partout
4Store
Eagre
Shared-memory
architectures
(Message-Passing, RMI, etc.)
Trinity (MSR)
DARQ [Leser et al., Humbold University Berlin, ISWC’08]
• Classical mediator-based architecture
connecting a given SPARQL endpoint to
other endpoints via a combination of
wrappers and service descriptions.
• Service descriptions
– RDF data descriptions
– Statistical information
– Binding constraints
• Query optimizer based on rewriting
rules and cost estimations for physical
join operators.
FedEx [fluid Op’s & MPI-INF: ISWC’11]
• Online query optimization over federations of SPARQL endpoints.
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategory drugbank-category:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrug rdf:type kegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrug purl:title ?title .
}
• Cost estimates based on result sizes of SPARQL ASK queries.
• “Bound nested-loop joins” by grouping sets of variable bindings
into SPARQL UNION queries (instead of using FILTER conditions):
Partout [Galaraga, Hose, Schenkel: PVLDB’13]
• Materialization-based, distributed &
workload-aware SPARQL engine.
• Distribution helps to scale-out query
processing via parallel join executions.
Global query workload (aka. “query log”)
Global query graph
• Triple fragments are distributed • H1…Hn run local RDF-3x
over hosts H1…Hn by
instances.
– (1) maximizing query locality, and
– (2) balancing the hosts workload.
– (1) local S,P,O statistics by RDF-3x,
– (2) global (cached) statistics.
Partout Example Query Plan
•
•
•
H1, H2, H3 hold triplets for ?s rdf:type db:city
H1 has triplets for ?s db:located db:Germany
H2 has triplets for ?s db:name ?name
More Distributed RDF Engines
•
•
•
•
•
•
•
•
•
Shard TripleStore (Hadoop + Hash-Partitioning)
RDFPeers (P2P/Chord architecture) [Cai et al., WWW’04]
Gridvine (P2P/Chord architecture) [Aberer et al., VLDB’07]
YARS2 (federated architecture) [Decker at al., ISWC’07]
Jena-HBase (Hadoop & HBase) [Khadilkar et al., ISWC’12]
SW-Store (Hadoop/RDF-3x) [Abadi et al., PVLDB’11]
4Store (materialized, shared-nothing) [Harris et al., SSWS’09]
Eagre (materialized, shared-nothing) [HKUST & HP Labs, ICDE’13]
Trinity (materialized, shared-memory, message passing) [MSR, SIGMOD’13]
 more in Zoi’s tutorial in the afternoon…
Outline of this Tutorial
• Part I
– RDF in Centralized Relational Databases
• Part II
– RDF in Distributed Settings
• Part III
– Managing Uncertain RDF Data
Outline for Part III
• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond
• Part III.3: Probabilistic Database Engines
– Stanford Trio Project
– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data
– URDF @ Max Planck Institute
What is “Uncertain” Data?
“Certain” Data
“Uncertain” Data
Temperature is 74.634589 F
Sensor reported 75 ± 0.5 F
Bob works for Yahoo
Bob works for either Yahoo or
Microsoft
Mary sighted a Finch
Mary sighted either a Finch
(60%) or a Sparrow (40%)
It always rains in Galway
There is a 89% chance of rain
in Galway tomorrow
Yahoo stocks will be at 100
in a month
Yahoo stock will be between
60 and 120 in a month
John’s age is 23
John’s age is in [20,30]
… And Why Does It Arise?
“Certain” Data
“Uncertain” Data
Temperature is 74.634589 F
Sensor reported 75 ± 0.5 F
Precision of devices
Bob works for Yahoo
Bob works for either Yahoo or
Microsoft
Mary sighted a Finch
Mary sighted either a Finch
(60%) or a Sparrow (40%)
Lack of exact
information
(alternatives and
missing values)
It always rains in Galway
There is a 89% chance of rain
in Galway tomorrow
Yahoo stocks will be at 100
in a month
Yahoo stock will be between
60 and 120 in a month
John’s age is 23
John’s age is in [20,30]
Uncertainty
about future
events
Anonymization
Applications: Deduplication
Name
John Doe
J. Doe
?
80% match
Applications: Information Integration
name,
hPhone,
oPhone,
hAddr,
oAddr
name,
phone,
address
at the schema level:
“schema integration”
Combined View
at the instance level:
“record linkage”
Applications: Information Extraction (I)
Restaurant Zip
Hard Rock
Cafe
94111
94133
94109
Applications: Information Extraction (II)
Subj.
Pred.
Obj.
Galway
type
City
locatedIn
Ireland
hasPopulation
75,414
areaCode
091
namedAfter
Gaillimh_River
…
…
…
What is Uncertain Data
and Why Does It Arise?
Applications: Information Extraction (III)
YAGO/DBpedia et al.
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)
>120 M facts for YAGO2
(mostly from Wikipedia infoboxes)
New fact candidates
type(Jeff, Author)[0.9]
author(Jeff, Drag_Book)[0.8]
author(Jeff,Cind_Book)[0.6]
worksAt(Jeff, Bell_Labs)[0.7]
type(Jeff, CEO)[0.4]
100’s M additional facts
from Wikipedia text
How do current database management
systems (DBMS) handle uncertainty?
They don’t 
What Do (Most) Applications Do?
• Clean: turn into data that DBMSs can handle
Observer Bird-1
Bird-1
Mary
Finch: 80%
Sparrow: 20%
Finch
Susan
Dove: 70%
Sparrow: 30%
Dove
Jane
Hummingbird: 65%
Sparrow: 35%
Hummingbird
(1) Loss of information
(2) Errors compound and propagate insidiously
Outline for Part III
• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond
• Part III.3: Probabilistic Database Engines
– Stanford Trio Project
– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data
– URDF @ Max Planck Institute
Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch:
Probabilistic Databases
(Synthesis Lectures on Data Management)
Morgan & Claypool Publishers, 2012
Databases Today are Deterministic
• An item either is in the database or it is not.
• A tuple either is in the query answer or it is not.
• This applies to all variety of data models:
– Relational, E/R, hierarchical, XML, …
What is a Probabilistic Database ?
• “An tuple belongs to the database” is a
probabilistic event.
• “A tuple is an answer to the query” is a
probabilistic event.
• Can be extended to all possible kinds of data
models; we consider only
 probabilistic relational data.
Sample Spaces & Venn Diagrams
Sample Space 
“Tuple t1 is in
the database.”
“Tuple t2 is an
answer to a query.”
• Sample space : all possible events that can be observed. Pr() = 1.
• Random variable χt assigns a probability to an event s.t. 0 ≤ Pr(χt) ≤ 1.
• As a convention, we will use tuple identifiers in the place of random
variables to denote probabilistic events.
Possible Worlds Semantics
Attribute domains:
int,
# values: 232,
varchar(55),
datetime
2440,
264
Relational schema:
Employee(ID:int, name:varchar(55), dob:datetime, salary:int)
# of possible tuples:
# of possible relation instances:
232 × 2440 × 264 × 232
32 × 2440 × 264 × 232
2
2
Database schema:
Employee(. . .), Projects( . . . ), Groups( . . .), WorksFor( . . .)
# of possible database instances: N (= big but finite)
The Definition
Given a finite set of all possible database instances:
INST = {I1, I2, I3, . . ., IN}
Definition: A probabilistic database Ip
is a probability distribution on INST
Pr : INST → [0,1]
s.t. i=1,…,N Pr(Ii) = 1
Definition: A possible world is I  INST s.t. Pr(I) > 0
Example
p
I
=
Customer Address
Product
Customer Address
Product
John
Seattle
Gizmo
John
Boston
Gadget
John
Seattle
Camera
Sue
Denver
Gizmo
Sue
Denver
Gizmo
Pr(I2) = 1/12
Pr(I1) = 1/3
Customer Address
Product
Customer Address
Product
John
Seattle
Gizmo
John
Boston
Gadget
John
Seattle
Camera
Sue
Seattle
Camera
Sue
Seattle
Camera
Pr(I4) = 1/12
Pr(I3) = 1/2
Possible worlds = {I1, I2, I3, I4}
Tuples as Events
Marginal
probability of t
One tuple t  event “t  I”
Pr(t) = I: t  I Pr(I)
Two tuples t1, t2  event “t1  I Λ t2  I”
Pr(t1 Λ t2) = I: t1I Λ t2I Pr(I)
Marginal
probability of t1 Λ t2
Tuple Correlations
NOT
Pr(⌐t1) = 1 - Pr(t1)
⌐
Independent-AND
Pr(t1 Λ t2) = Pr(t1) Pr(t2)
IΛ
Independent-OR
Pr(t1 V t2) = 1-(1-Pr(t1))(1-Pr(t2))
IV
Disjoint-AND
Pr(t1 Λ t2) = 0
DΛ
Disjoint-OR
Pr(t1 V t2) = Pr(t1)+Pr(t2)
DV
Negatively correlated Pr(t1 Λ t2) < Pr(t1) Pr(t2)
N
Positively correlated Pr(t1 Λ t2) > Pr(t1) Pr(t2)
P
Identical
Pr(t1 Λ t2) = Pr(t1) = Pr(t2)
=
Example with Correlations
p
I
=
N
=
Customer Address
Product
Customer Address
Product
John
Seattle
Gizmo
John
Boston
Gadget
John
Seattle
Camera
Sue
Denver
Gizmo
Sue
Denver
Gizmo
D
Pr(I2) = 1/12
Pr(I1) = 1/3
P
Customer Address
Product
Customer Address
Product
John
Seattle
Gizmo
John
Boston
Gadget
John
Seattle
Camera
Sue
Seattle
Camera
Sue
Seattle
Camera
D
Pr(I4) = 1/12
Pr(I3) = 1/2
Special Case!
Tuple-independent probabilistic database
TUP = {t1, t2, …, tM}
pr : TUP → (0,1]
= all tuples
INST = P(TUP)
N = 2M
No restrictions w.r.t. other tuples
Pr(I) = t  I pr(t) × t  I (1-pr(t))
… back to the Venn Diagram (I)
Sample Space 
“Tuple t1 is in
the database.”
“Tuple t2 is in the
database.”
Pr(“Tuple t1 is in the database and tuple t2 is in the database”)
:= Pr(t1) x Pr(t2) = pr(t1) x pr(t2)
 If t1 and t2 are independent (per assumption!):
4 possible worlds = 4 subsets of events
… back to the Venn Diagram (II)
Sample Space 
“Tuple t2 is in the
database.”
“Tuple t1 is in
the database.”
Pr(“Tuple t1 is in the database and tuple t2 is in the database”) := 0
 If t1 and t2 are disjoint (per assumption!):
3 possible worlds = 3 subsets of events
Tuple Prob.  Possible Worlds
J=
Assumption:
Tuples are
independent!
Name
City
pr
John
Seattle
p1 = 0.8
Sue
Boston
p2 = 0.6
Fred
Boston
p3 = 0.9
Ip =
Name
I1
City
I2
Name
City
Name
City
Name
City
Name
City
Name
City
Name
City
Name
City
John
Seattl
Sue
Bosto
Fred
Bosto
John
Seattl
John
Seattl
Sue
Bosto
John
Seattl
Sue
Bosto
Fred
Bosto
Fred
Bosto
Sue
Bosto
Fred
Bosto
I3
I4
I5
(1-p1)
(1-p2)
(1-p3) p1(1-p2)(1-p3) (1-p1)p2(1-p3) (1-p1)(1-p2)p3 p1p2(1-p3)
=1
I6
I7
I8
p1(1-p2)p3
(1-p1)p2p3
p1p2p3
Tuple Prob.  Query Evaluation
Customer
Product
Date
pr
Name
City
pr
John
Gizmo
...
q1
John
Seattle
p1
John
Gadget
...
q2
Sue
Boston
p2
John
Gadget
...
q3
Fred
Boston
p3
Sue
Camera
...
q4
Sue
Gadget
...
q5
Sue
Gadget
...
q6
Fred
Gadget
...
q7
SELECT DISTINCT x.city
FROM Personp x, Purchasep y
WHERE x.Name = y.Customer
and y.Product = ‘Gadget’
Tuple
Seattle
Boston
Marginals
Probability
p1(1-(1-q2)(1-q3))
1- (1- p2(1-(1-q5)(1-q6 ) ))
× (1- p3q7 )
Summary of Data Model
Possible Worlds Semantics
• Very powerful model:
– Can capture any tuple correlations.
• Needs separate representation formalism:
(“just tables” are generally not enough)
 Boolean event expressions to capture complex tupledependencies: “provenance”, “lineage”, “views”, etc.
• But: query evaluation may be very expensive.
– Need to find good cases, otherwise must approximate.
Outline for Part III
• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond
• Part III.3: Probabilistic Database Engines
– Stanford Trio Project
– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data
– URDF @ Max Planck Institute
[Widom et al.: 2008]
Trio’s Data Model
Uncertainty-Lineage Databases (ULDBs)
1.
2.
3.
4.
Alternatives
‘?’ (Maybe) Annotations
Confidence values
Lineage
Trio’s Data Model
1. Alternatives: uncertainty about value
Saw (witness, color, car)
Amy
red, Honda ∥ red, Toyota ∥ orange, Mazda
Three possible
instances
Trio’s Data Model
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
Saw (witness, color, car)
Amy
red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty
blue, Acura
Six possible
instances
?
Trio’s Data Model
• 1. Alternatives
• 2. ‘?’ (Maybe) Annotations
• 3. Confidences: weighted uncertainty
Saw (witness, color, car)
Amy
red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty
blue, Acura 0.6
Six possible instances,
each with a probability
?
So Far: Model is Not Closed
Saw (witness, car)
Cathy
Honda ∥ Mazda
Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda
Suspects = πperson(Saw ⋈ Drives)
Suspects
Jimmy
Billy ∥ Frank
Hank
?
?
?
CANNOT
Does not correctly
capture possible
instances in the
result
Example with Lineage
ID
11
Saw (witness, car)
Cathy
Honda ∥ Mazda
ID
Drives (person, car)
21
Jimmy, Toyota ∥ Jimmy, Mazda
22
Billy, Honda ∥ Frank, Honda
23
Hank, Honda
Suspects = πperson(Saw ⋈ Drives)
ID
Suspects
31
Jimmy
32
Billy ∥ Frank
33
Hank
?
?
?
λ(31) = (11,2) Λ (21,2)
λ(32,1) = (11,1) Λ (22,1); λ(32,2) = (11,1) Λ (22,2)
λ(33) = (11,1) Λ 23
Example with Lineage
ID
11
Saw (witness, car)
Cathy
Honda ∥ Mazda
ID
Drives (person, car)
21
Jimmy, Toyota ∥ Jimmy, Mazda
22
Billy, Honda ∥ Frank, Honda
23
Hank, Honda
Suspects = πperson(Saw ⋈ Drives)
ID
Suspects
31
Jimmy
32
Billy ∥ Frank
33
Hank
? λ(31) = (11,2) Λ (21,2)
? λ(32,1) = (11,1) Λ (22,1); λ(32,2) = (11,1) Λ (22,2)
? λ(33) = (11,1) Λ 23
Correctly captures
possible instances in
the result (7)
Operational Semantics
D
direct
implementation
possible
instances
I1, I2, …, In
D′
rep. of
instances
Q on each
instance
Closure:
up-arrow
always exists
J1, J2, …, Jm
Completeness: any (finite) set of possible
instances can be represented
Summary on Trio’s Data Model
Uncertainty-Lineage Databases (ULDBs)
1.
2.
3.
4.
Alternatives
‘?’ (Maybe) Annotations
Confidence values
Lineage
Theorem: ULDBs are closed and complete.
Formally studied properties like minimization, equivalence,
approximation and membership based on lineage.
[Benjelloun, Widom, et al.: VLDB J. 08]
MYSTIQ: Query Complexity
• Data complexity of a query Q:
• Compute Q(Ip), for probabilistic database J
– Extensional query evaluation:
Works for “safe” query plans with PTIME data
complexity
– Intensional query evaluation:
Works for any plan but has #P-complete data
complexity in the general case
• Assume independent tuples in J
• Compute marginal probabilities for tuples in Q
• Boolean event expressions for intensional query evaluation
Extensional Query Evaluation
[Fuhr&Roellke:1997, Dalvi&Suciu:2004]
Relational op’s compute probabilities
v p
v1 v2 p1 p2
s
v p
v 1-(1-p1)(1-p2)…
v p1(1-p2)
P
×
v1 p1
or: p1 + p2 + …
v2 p2
v
v
p1
p2
Data complexity: PTIME
v
p1
v p2
[Dalvi&Suciu:2004]
SELECT DISTINCT x.City
FROM Personp x, Purchasep y
WHERE x.Name = y.Customer
and y.Product = ‘Gadget’
“Safe Plans”
Sea
1-(1-p1q1)(1- p1q2)(1- p1q3)
Jon Sea p1(1-(1-q1)(1-q2)(1-q3))
×
P
Wrong !
×
Jon Sea p1
Jon Sea p1q1
Jon Sea p1q2
Jon Sea p1q3
Jon q1
Jon q2
Jon q3
Correct !
Jon 1-(1-q1)(1-q2)(1-q3)
P
Jon Sea
p1
Depends on plan !!!
Jon
Jon
Jon
q1
q2
q3
[Dalvi&Suciu:2004]
Query Complexity
Sometimes there exists a correct extensional plan,
but consider the following:
Qbad :- R(x), S(x,y), T(y)
Data complexity
is #P-complete
NP = class of problems of the form “is there a witness ?”
#P = class of problems of the form “how many witnesses ?”
(will be coming back to this…)
[Fuhr&Roellke:1997]
Intensional Database
Atomic event ids
Probabilities:
Event expressions: Λ, V, ⌐
e1, e2, e3, …
p1, p2, p3, …  [0,1]
e3 Λ (e5 V ⌐e2)
Intensional probabilistic database J
 each tuple t has an event attribute t.E
Probability of Boolean Expressions
Needed for query evaluation!
E
= X1X3 v X1X4 v X2X5 v X2X6
Sampling: Randomly make each variable true with the following probabilities
Pr(X1) = p1, Pr(X2) = p2, . . . . . , Pr(X6) = p6
What is Pr(E) ???
Answer: Re-group cleverly
E = X1 (X3 v X4 ) v X2 (X5 v X6)
Pr(E) = 1 - (1-p1(1-(1-p3)(1-p4)))
(1-p2(1-(1-p5)(1-p6)))
“Read once”
formula
Complexity Issues
Theorem [Valiant:1979]
For a Boolean expression E, computing Pr(E) is #P-complete
NP = class of problems of the form “is there a witness ?”  SAT
#P = class of problems of the form “how many witnesses ?”  #SAT
The decision problem for 2CNF is in PTIME
The counting problem for 2CNF is #P-complete
MYSTIQ: [Re, Suciu: VLDB’04]
Probabilistic Query Evaluation on
Top of a Deterministic Database Engine
(Top-k) Answers
1. Sampling
SQL Query
Probabilistic
Query Engine
Deterministic
Database
2. Extensional joins
3. Indexes
Outline for Part III
• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond
• Part III.3: Probabilistic Database Engines
– Stanford Trio Project
– MystiQ @ U Washington
• Part III.4: Uncertain RDF Data
– URDF Project @ Max Planck Institute
Uncertain RDF (URDF) Data Model
• Extensional Layer (information extraction & integration)
– High-confidence facts: existing knowledge base (“ground truth”)
– New fact candidates: extracted facts with confidence values
– Integration of different knowledge sources:
Ontology merging or explicit Linked Data (owl:sameAs, owl:equivProp.)
 Large “Probabilistic Database” of RDF facts
• Intensional Layer (query-time inference)
– Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
– Hard rules: consistency constraints (more general FOL rules)
– Propositional & probabilistic consistency reasoning
Soft Rules vs. Hard Rules
(Soft) Deduction Rules vs.
(Hard) Consistency Constraints
• People may live in more than one place
livesIn(x,y)  marriedTo(x,z)  livesIn(z,y) [0.8]
livesIn(x,y)  hasChild(x,z)  livesIn(z,y) [0.5]
• People are not born in different places/on different dates
bornIn(x,y)  bornIn(x,z)  y=z
bornOn(x,y)  bornOn(x,z)  y=z
• People are not married to more than one person
(at the same time, in most countries?)
marriedTo(x,y,t1)  marriedTo(x,z,t2)  y≠z
 disjoint(t1,t2)
Soft Rules vs. Hard Rules
(Soft) Deduction Rules vs.
(Hard) Consistency Constraints
Deductive database:
Datalog, core of SQL &
relational algebra,
• People may live in more than one place RDF/S, OWL2-RL, etc.
livesIn(x,y)  marriedTo(x,z)  livesIn(z,y) [0.8]
livesIn(x,y)  hasChild(x,z)  livesIn(z,y) [0.5]
• People are not born in different places/on different dates
More general FOL
constraints:
Datalog with constraints,
• People are not married to more than oneX-Tuples
person in Prob.-DB’s
(at the same time, in most countries?)
owl:FunctionalProperty,
marriedTo(x,y,t1)  marriedTo(x,z,t2)  y≠z etc.
bornIn(x,y)  bornIn(x,z)  y=z
bornOn(x,y)  bornOn(x,z)  y=z
 disjoint(t1,t2)
URDF: Running Example
KB:
Base Facts
type[1.0]
Computer
Scientist
Rules
type[1.0]
hasAdvisor(x,y) 
worksAt(y,z)
 graduatedFrom(x,z)
type[1.0]
hasAdvisor[0.8]
Jeff
hasAdvisor[0.7]
Surajit
graduatedFrom
graduatedFrom
[0.6]
[?]
[0.4]
David
graduatedFrom[0.9]
graduatedFrom(x,y) 
graduatedFrom(x,z)
 y=z
graduatedFrom[?]
graduatedFrom[0.7]
worksAt[0.9]
Stanford
Princeton
type[1.0]
type[1.0]
University
Derived Facts
gradFr(Surajit,Stanford)
gradFr(David,Stanford)
Basic Types of Inference
• MAP Inference
– Find the most likely assignment to query variables y
under a given evidence x.
– Compute: argmax y P( y | x)
(NP-hard for MaxSAT)
• Marginal/Success Probabilities
– Probability that query y is true in a random world
under a given evidence x.
– Compute: ∑y P( y | x)
(#P-hard already for
conjunctive queries)
General Route: Grounding & MaxSAT Solving
Query
graduatedFrom(x, y)
1) Grounding
CNF
(graduatedFrom(Surajit, Stanford) 
graduatedFrom(Surajit, Princeton)) 1000
 (graduatedFrom(David, Stanford) 
graduatedFrom(David, Princeton))
 (hasAdvisor(Surajit, Jeff) 
worksAt(Jeff, Stanford) 
graduatedFrom(Surajit, Stanford))
1000
0.4
 (hasAdvisor(David, Jeff) 
worksAt(Jeff, Stanford) 
graduatedFrom(David, Stanford))
0.4






0.9
0.8
0.7
0.6
0.7
0.9
worksAt(Jeff, Stanford)
hasAdvisor(Surajit, Jeff)
hasAdvisor(David, Jeff)
graduatedFrom(Surajit, Princeton)
graduatedFrom(Surajit, Stanford)
graduatedFrom(David, Princeton)
– Consider only facts (and rules)
which are relevant for answering
the query
2) Propositional formula in CNF,
consisting of
– Grounded hard & soft rules
– Weighted base facts
3) Propositional Reasoning
– Find truth assignment to facts such
that the total weight of the
satisfied clauses is maximized
 MAP inference: compute “most
likely” possible world
URDF: MaxSAT Solving with Soft & Hard Rules
Special case: Horn-clauses as soft rules & mutex-constraints as hard rules
C: Weighted Horn clauses (CNF)
S: Mutex-const.
[Theobald,Sozio,Suchanek,Nakashole: VLDS‘12]
{ graduatedFrom(Surajit, Stanford),
graduatedFrom(Surajit, Princeton) }
{ graduatedFrom(David, Stanford),
graduatedFrom(David, Princeton) }
(hasAdvisor(Surajit, Jeff) 
worksAt(Jeff, Stanford) 
graduatedFrom(Surajit, Stanford)) 0.4
 (hasAdvisor(David, Jeff) 
worksAt(Jeff, Stanford) 
graduatedFrom(David, Stanford))
0.4






0.9
0.8
0.7
0.6
0.7
0.9
worksAt(Jeff, Stanford)
hasAdvisor(Surajit, Jeff)
hasAdvisor(David, Jeff)
graduatedFrom(Surajit, Princeton)
graduatedFrom(Surajit, Stanford)
graduatedFrom(David, Princeton)


Find: arg max y P( y | x)
Resolves to a variant of MaxSAT
for propositional formulas
Compute
MaxSAT Alg.
W0 = ∑clauses C w(C) P(C is satisfied);
For each hard constraint S {
For each fact f in St {
Compute
Wf+t = ∑clauses C w(C) P(C is sat. | f = true);
}
Compute
WS-t = ∑clauses C w(C) P(C is sat. | St = false);
Choose truth assignment to f in St that
maximizes Wf+t , WS-t ;
Remove satisfied clauses C;
t++;
}
• Runtime: O(|S||C|)
• Approximation
guarantee of 1/2
Deductive Grounding with Lineage (SLD Resolution/Datalog)
Query
graduatedFrom(Surajit, y)
graduatedFrom
(Surajit,
Q1 Princeton)
graduatedFrom
(Surajit,
Q2 Stanford)
 A(B (CD))
A(B (CD))

A graduatedFrom
C
hasAdvisor
(Surajit,Jeff)[0.8]
hasAdvisor(x,y) 
worksAt(y,z)
 graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) 
graduatedFrom(x,z)
 y=z
Base Facts
\/
(Surajit,
Princeton)[0.7]
Rules
B graduatedFrom
/\
D
(Surajit,
Stanford)[0.6]
worksAt
(Jeff,Stanford)[0.9]
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]
Lineage & Possible Worlds
Query
graduatedFrom(Surajit, y)
[Das Sarma,Theobald,Widom: ICDE‘08
Dylla, Miliaraki,Theobald: CIKM‘11]
1) Deductive Grounding
0.7x(1-0.888)=0.078
graduatedFrom
(Surajit,
Q1 Princeton)
(1-0.7)x0.888=0.266
graduatedFrom
(Surajit,
Q2 Stanford)
A(B (CD))
 A(B (CD))
1-(1-0.72)x(1-0.6)
=0.888

\/
0.8x0.9
=0.72
A graduatedFrom
(Surajit,
Princeton)[0.7]
C
hasAdvisor
(Surajit,Jeff)[0.8]
D
2) Lineage DAG (not in CNF),
consisting of
– Grounded hard & soft rules
– Weighted base facts
Plus: entire derivation history!
3) Probabilistic Inference
B graduatedFrom
/\
– Dependency graph of the query
– Trace lineage of individual query
answers
(Surajit,
Stanford)[0.6]
worksAt
(Jeff,Stanford)[0.9]
 Compute marginals:
P(Q): aggregate probabilities of
all possible worlds where the
lineage of the query evaluates to
“true”
P(Q|H): drop “impossible worlds”
Possible Worlds Semantics
P(Q1)=0.0784
P(Q2)=0.2664
P(Q1|H)=0.0784 / 0.412
= 0.1903
P(Q2|H)=0.2664 / 0.412
= 0.6466
A:0.7
B:0.6
C:0.8
D:0.9
Q2:
1
1
1
1
0
0.7x0.6x0.8x0.9 = 0.3024
1
1
1
0
0
0.7x0.6x0.8x0.1 = 0.0336
1
1
0
1
0
… = 0.0756
1
1
0
0
0
… = 0.0084
1
0
1
1
0
… = 0.2016
1
0
1
0
0
… = 0.0224
1
0
0
1
0
… = 0.0504
1
0
0
0
0
… = 0.0056
0
1
1
1
1
0.3x0.6x0.8x0.9 = 0.1296
0
1
1
0
1
0.3x0.6x0.8x0.1 = 0.0144
0
1
0
1
1
0.3x0.6x0.2x0.9 = 0.0324
0
1
0
0
1
0.3x0.6x0.2x0.1 = 0.0036
0
0
1
1
1
0.3x0.4x0.8x0.9 = 0.0864
0
0
1
0
0
… = 0.0096
0
0
0
1
0
… = 0.0216
0
0
0
0
0
… = 0.0024
P(W)
 A(B(CD))
Hard rule H:  A   (B  (CD))
0.0784
0.2664
1.0
0.412
More Probabilistic Approaches
• Propositional
– Stochastic MaxSat solvers: MaxWalkSat (MAP-Inference)
– URDF: constrained weighted MaxSat solver for soft & hard rules
• Lineage & Possible Worlds (tuple-independent database)
– Exact probabilistic inference: junction trees, variable elimination
– Approximate inference: decision diagrams/Shannon expansions, sampling
• Combining First-Order Logic &
Probabilistic Graphical Models
– Markov Logic Networks*
[Richardson & Domingos: Machine Learning 2006]
– Factor Graphs
[FactorIE, McCallum et al.: NIPS 2008]
– Variety of MCMC sampling techniques for probabilistic inference
(e.g., Gibbs sampling, MC-SAT, etc.)
*Alchemy – Open-Source AI: http://alchemy.cs.washington.edu/
Experiments
• YAGO Knowledge Base: 2 Mio entities, 20 Mio facts
• Basic query answering: SLD grounding & MaxSat solving of 10 queries over 16
soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks: runtime comparisons for synthetic rule expansions
• URDF: SLD grounding & MaxSat
solving
|C| - # literals in soft rules
|S| - # literals in hard rules
• URDF MaxSat vs. Markov
Logic (MAP inference & MCSAT)
UViz: URDF Visualization Frontend
[Meiser, Dylla, Theobald: CIKM’11 Demo]
• System components:
– Flash Player client
– Tomcat server (JRE)
– Relational backend
(JDBC)
– Remote Method
Invocation & Object
Serialization (BlazeDS)
UViz: URDF Visualization Frontend
[Meiser, Dylla, Theobald: CIKM’11 Demo]
Demo!
http://urdf.mpi-inf.mpg.de
Recommended Readings
PART I
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query/
• SPARQL 1.1 Query Language, W3C Working Draft, 21 March 2013, http://www.w3.org/TR/sparql11-query/
• SPARQL 1.1 Federated Query, W3C Working Draft, 21 March 2013, http://www.w3.org/TR/sparql11-federated-query/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW Conference, 2007
• Krisztian Balog, Edgar Meij, Maarten de Rijke: Entity Search: Building Bridges between Two Worlds. WWW, 2010
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
• Tao Cheng , Xifeng Yan , Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs. CIKM, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Thomas Neumann, Gerhard Weikum: Scalable join processing on very large RDF graphs. SIGMOD Conference, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
• François Picalausa, Yongming Luo, George H. L. Fletcher, Jan Hidders, Stijn Vansummeren: A Structural Approach to Indexing Triples. ESWC 2012
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases
by Web Services. SIGMOD Conference, 2010
• Cheng Xiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
• Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. PVLDB 4(8), 2011
PART II
• Min Cai, Martin R. Frank: RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network. WWW, 2004
• Gong Cheng, Weiyi Ge, Yuzhong Qu: Falcons: searching and browsing entities on the semantic web. WWW, 2008
• Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal C Doshi, Joel Sachs: Swoogle: A Search and Metadata Engine for the Semantic
Web. CIKM, 2004
• Luis Galárraga, Katja Hose, Ralf Schenkel: Partout: A Distributed Engine for Efficient RDF Processing. To appear in PVLDB, 2013
• Steve Harris, Nick Lamb, Nigel Shadbolt: 4store: The Design and Implementation of a Clustered RDF Store. SSWS, 2009
• Jiewen Huang, Daniel J. Abadi, Kun Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011
• Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani Thuraisingham, Paolo Castagna: Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. ISWC, 2012
• Bastian Quilitz, Ulf Leser: Querying Distributed RDF Data Sources with SPARQL. ISWC, 2008
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt: FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. ESWC, 2011
• Bin Shao, Haixun Wang, Yatao Li: Trinity: A Distributed Graph Engine on a Memory Cloud. To appear in SIGMOD, 2013
• Xiaofei Zhang, Lei Chen, Yongxin Tong, Min Wang: EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud. ICDE, 2013
PART III
• Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with uncertainty and lineage. VLDB J. 17(2), 2008
• Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Ré, Dan Suciu: MYSTIQ: a system for finding more answers by using probabilities. SIGMOD
Conference, 2005
• Nilesh N. Dalvi, Dan Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB, 2004
• Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE, 2013
• Norbert Fuhr, Thomas Rölleke: A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Trans. Inf. Syst. 15(1), 1997
• Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive reasoning in uncertain RDF knowledge bases. CIKM, 2011
• Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS, 2012
• Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management), Morgan & Claypool Publishers, 2012