Oblivious Querying - Department of Computer Science
Download
Report
Transcript Oblivious Querying - Department of Computer Science
Oblivious Querying of Data
with Irregular Structure
Yaron Kanza
The Rachel and Selim Benin
School of Engineering and Computer Science
The Hebrew University of Jerusalem
1
Joint Work With
• Queries with Incomplete Answers
– Werner Nutt, Shuky Sagiv
• Flexible Queries
– Shuky Sagiv
• SQL4X
– Sara Cohen, Shuky Sagiv
• Computing Full Disjunctions
– Shuky Sagiv
2
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
3
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
4
The Semistructured Data Model
• Data is described as a rooted labeled
directed graph
• Nodes represent objects
• Edges represent relationships between
objects
• Atomic values are attached to atomic nodes
5
Movie Database
1
Movie
Movie
11
Actor
Actor
21
Name
30
Mark
Hamill
22
Name
Title
12
Star 1977
Wars
Harrison
Ford
25
Title
27
Léon
33
Magnolia
A Movie Database Example
6
28
Kyle
Title
MacLachlan
Name
Natalie
Portman
14
Movie
Name T.V. Series
26
32
Actor
13
Actor Title
Year
23 24
31
Film
29
Title
Year
34
35
36
Twin
Peaks
Dune
1984
<?xml version=“1.0”?>
<MDB>
<Movie>
<Title>Star Wars</Title>
<Year>1977</Year>
<Actor>
<Name>Mark Hamill</Name>
</Actor>
<Actor>
<Name>Harrison Ford</Name>
</Actor>
</Movie>
…
</MDB>
XML that Encodes the Semistructured Data
7
What Should be the
form of the Query?
Movie Database
1
Movie
Movie
11
Actor
Actor
21
Name
30
Mark
Hamill
8
22
Name
Title
12
Star 1977
Wars
Harrison
Ford
25
Title
27
Léon
33
Magnolia
29
28
Kyle
Title
MacLachlan
Name
Natalie
Portman
14
Movie
Name T.V. Series
26
32
Actor
13
Actor Title
Year
23 24
31
Film
Title
Year
34
35
36
Twin
Peaks
Dune
1984
Consider a Query that Requests
Movies, Actors that Acted in the Movies
and the Movies’ Year of Release
The movie has
a year attribute
Movie Database
The year of the
movie is missing
1
Movie
Movie
11
Actor
Actor
21
Name
30
Mark
Hamill
22
Name
Title
12
Star 1977
Wars
Harrison
Ford
25
Title
27
Léon
33
Magnolia
Incomplete Data
9
28
Kyle
Title
MacLachlan
Name
Natalie
Portman
14
Movie
Name T.V. Series
26
32
Actor
13
Actor Title
Year
23 24
31
Film
29
Title
Year
34
35
36
Twin
Peaks
Dune
1984
Movie Database
Actor below movie
Movie
Movie
11
Actor
Actor
21
Name
30
Mark
Hamill
22
Name
Title
Harrison
Ford
25
Title
27
Léon
33
Magnolia
Variations in Structure
10
28
Kyle
Title
MacLachlan
Name
Natalie
Portman
14
Movie
Name T.V. Series
26
32
Actor
13
Actor Title
Year
Star 1977
Wars
31
Film
12
23 24
Movie below actor
1
29
Title
Year
34
35
36
Twin
Peaks
Dune
1984
Movie Database
1
A movie label
11
Actor
Actor
21
Name
30
Mark
Hamill
22
Name
Title
25
Movie
Name T.V. Series
Title
26
27
Léon
Natalie
Portman
28
Kyle
Title
MacLachlan
Name
32
A film label
13
13
Actor Title
Year
Star 1977
Wars
Harrison
Ford
Film
12
23 24
31
Actor
Movie
Movie
34
Magnolia
Title
Year
33
34
35
Twin
Peaks
Dune
1984
Dealing with ontology variations is
Ontology Variations
beyond the scope of this talk
11
29
Irregular Data
• Data is incomplete
– Missing values of attributes in objects
• Data has structural variations
– Relationships between objects are represented
differently in different parts of the database
• Data has ontology variations
– Different labels are used to describe objects of
the same type
12
Irregular data does not conform to a strict schema
Queries over irregular data
should not be rigid patterns
The schema cannot guide a user
in formulating a query
13
Data is contributed
by many users
in a variety of designs
The query should
deal with different
structures of data
The structure of the
database is changed
frequently
Queries should be
rewritten frequently
The description of the
schema is large
(e.g., a DTD of XML)
It is difficult to use the
schema when
formulating queries
In Which Cases is it Difficult to Formulate
14
Queries over Semistructured Data?
Can Regular Expressions Help in
Querying Irregular Data?
• In many cases, regular expressions can be
used to query irregular data
• Yet, regular expressions are
– Not efficient – it is difficult to evaluate regular
expressions
– Not intuitive – it is difficult for a naïve user to
formulate regular expressions
15
More on Using
Regular Expressions
• When querying irregular data, the size of
the regular expression could be exponential
in the number of labels in the database
– For n types of objects, there are n! possible
hierarchies
– For an object with n attributes, there are 2n
subsets of missing attributes
16
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
17
Queries with Incomplete Answers
• We have developed queries that deal with
incomplete data in a novel way and return
incomplete answers
• The queries return maximal answers rather
than complete answers
• Different query semantics admit different
levels of incompleteness
18
Queries with complete answers
Queries with AND Semantics
Queries with Weak Semantics
Queries with OR Semantics
Queries with Incomplete Answers
19
Increasing
level of
incompleteness
Queries and Matchings
• The queries are labeled rooted directed
graphs
• Query nodes are variables
• Matchings are assignments of database
objects to the query variables according to
– the constraints specified in the query, and
– the semantics of the query
20
Constraints On Complete Matchings
• Root Constraint:
• Satisfied if the query root is mapped to the db root
Query Root
r
1
Database Root
• Edge Constraint:
• Satisfied if a query edge with label l is mapped to a
database edge with label l
x
l
l
y
21
12
25
Movie Database
A Complete
Matching
1
r
Movie
Movie
Producer Movie
11
Uncredited
Director Actor
Actor
Title
Actor
Year
21
22
23 24
12
Actor
27
25
x
Director
Title
26
29
Star 1977
Name
Name
Hook
Name
Name Name Wars
Date of
34
30
32
31
33
birth George Dustin
Steven
Mark Harrison
Lucas Hoffman
Spielberg
35
Hamill Ford
14 May 1944
22
Movie
Producer
y
Director Uncredited
Actor
z
Date of
birth Name
u
The root constraint and
All the nodes are mapped
all the edge constraints
to non-null values
are satisfied
v
Movie Database
1
r
Movie
Movie
Producer Movie
11
Uncredited
Director Actor
Actor
Title
Actor
Year
21
22
23 24
12
Actor
27
25
x
Director
Title
26
29
Star 1977
Name
Name
Hook
Name
Name Name Wars
Date of
34
30
32
31
33
birth George Dustin
Steven
Mark Harrison
Lucas Hoffman
Spielberg
35
Hamill Ford
14 May 1944
Consider the case where Node 35
is removed from the database
23
Movie
Producer
y
Director Uncredited
Actor
z
Date of
birth Name
u
v
No Complete
Matching Exists!
Movie Database
1
r
Movie
Movie
Producer Movie
11
Uncredited
Director Actor
Actor
Title
Actor
Year
21
22
23 24
Star 1977
Name Name Wars
30
31
Mark Harrison
Hamill Ford
24
Actor
27
25
NULL
12
Director
Title
26
29
Name
Name Name Hook
32
33
George Dustin
Lucas Hoffman
34
Steven
Spielberg
x
Movie
NULL
Producer
y
Director Uncredited
Actor
z
NULL
Date of
birth Name
u
v
NULL
This is not a matching, since the sequence of labels
Every
from the Not
database
rootPartial
to NodeAssignment
31 is different from
is an
Incomplete
Matching
any sequence
of labels
that starts
at the query root
and ends in variable v
The Reachability Constraint
on Partial Matchings
• A query node v that is mapped to a database
object o satisfies the reachability constraint
if there is a path from the query root to v,
such that all edge constraints along this path
Database
are satisfied
11
r l
2
l1
x
w
l4
l3
y
25
Query
z
l5
l6
vv
5
l1
l2
7
l4
l3
8
9
l5
l6
55
“And” Matchings
• A partial matching is an AND matching if
– The root constraint is satisfied
– The reachability constraint is satisfied by every
query node that is mapped to a database node
– If a query node is mapped to a database node,
all the incoming edge constraints are satisfied
r
x
Producer
Actor
Director
26
y
z
Movie Database
1
r
Movie
Movie
Producer Movie
11
Uncredited
Director Actor
Actor
Title
Actor
Year
21
22
23 24
Star 1977
Name Name Wars
30
31
Mark Harrison
Hamill Ford
12
Actor
27
25
x
Director
Title
29
26
Name
Name Name Hook
32
34
33
George Dustin
Lucas Hoffman
Steven
Spielberg
An AND Matching
27
Movie
Producer
y
Director Uncredited
Actor
z
Date of
birth Name
u
NULL
v
Movie Database
1
r
Movie
Movie
Producer Movie
11
Uncredited
Director Actor
Actor
Title
Actor
Year
21
22
23 24
Star 1977
Name Name Wars
30
31
Mark Harrison
Hamill Ford
12
Actor
27
25
x
Director
Title
26
29
Name
Name Name Hook
32
33
George Dustin
Lucas Hoffman
Suppose that we remove the
edges that are labeled with
Uncredited
Actor
28
34
Movie
Producer
y
Director Uncredited
Actor
z
Date of
birth Name
u
v
Steven
Spielberg
In an AND matching,
Node z must be null!
Weak Satisfaction of
Edge Constraints
• Edge Constraint:
• Is Weakly Satisfied if it is either
• Satisfied (as defined earlier), or
• One (or more) of its nodes is mapped to a null value
null
x
12
l
l
y
25
x
12
l
m
y
25
null
29
x
l
null
12
m
y
25
x
l
y
null
Weak Matchings
• A partial matching is a weak matching
if
– The root constraint is satisfied
– The reachability constraint is satisfied by
every query node that is mapped to a
database node
– Every edge constraint is weakly satisfied
30
Movie Database
1
r
Movie
Movie
Producer Movie
11
12
Actor
Title Director
Actor
Year
21
22
23 24
Star 1977
Name Name Wars
30
27
Mark Harrison
Hamill Ford
25
Director
Title
26
33
George Dustin
Lucas Hoffman
A Weak Matching
31
29
Name
Name Name Hook
32
31
Actor
x
34
Steven
Spielberg
Movie
Producer
y
Director Uncredited
Actor
z
NULL
Date of
birth Name
u
v
NULL
Edges that are
weakly satisfied
In a weak matching, all four options are permitted
In an AND matching, only the first three options
are permitted
null
x
12
l
l
y
25
x
12
l
m
y
25
null
32
null
x
x
l
l
y
m
y
null
12
25
Movie Database
1
r
Movie
Movie
Producer Movie
11
12
Actor
Title Director
Actor
Year
21
22
23 24
Star 1977
Name Name Wars
30
31
Mark Harrison
Hamill Ford
27
Actor
25
Director
Title
26
29
Name
Name Name Hook
32
33
George Dustin
Lucas Hoffman
Consider the case where edges
labeled with Producer are removed
33
x
34
Movie
Producer
y
Director Uncredited
Actor
z
Date of
birth Name
u
v
Steven
Spielberg
In a weak matching,
Node z must be null!
“OR” Matchings
• A partial matching is an OR matching
if
– The root constraint is satisfied
– The reachability constraint is satisfied by
every query node that is mapped to a
database node
34
Movie Database
1
r
Movie
Movie
Movie
11
12
Actor
Title Director
Actor
Year
21
22
23 24
Star 1977
Name Name Wars
30
31
Mark Harrison
Hamill Ford
27
Actor
25
x
Director
Title
29
26
Name
Name Name Hook
32
34
33
George Dustin
Lucas Hoffman
Steven
Spielberg
An OR Matching
35
Movie
Producer
y
Director Uncredited
Actor
z
NULL
Date of
birth Name
u
v
NULL
An edge which
is not weakly
satisfied
Increasing Level of
Incompleteness
• A complete matching is an AND matching
• An AND matching is a weak matching
• A weak matching is an OR matching
36
Maximal Matchings
• A tuple t1 subsumes a tuple t2 if t1 is the result of
replacing some null values in t2 by non-null
values:
t1=(1, 5, 2, null)
t2=(1, null, 2, null)
Matchings are represented as
tuples of oid’s and null values
• A matching is maximal if no other matching
subsumes it
• A query result consists of maximal matchings
only
37
On the Complexity of Computing
Queries with Incomplete Answers
• The size of the result can be exponential in
the size of the input (database and query)
– Note that the same is true when joining
relations – the size of the result can be
exponential in the size of the input (database
and query)
• Instead of using data complexity (where the
runtime depends only on the size of the
database), we use input-output complexity
38
Input-Output Complexity
In input-output complexity, the
time complexity is a function of
the size of the query,
the size of the database, and
the size of the result.
39
The Motivation for Using
I/O Complexity
• Measuring the time complexity with respect to the
size of the input does not separate between the
following two cases:
– An algorithm that does an exponential amount of work
simply because the size of the output is exponential in
the size of the input
– An algorithm that does an exponential amount of work
even when the query result is small
• Either the algorithm is naïve (e.g., it unnecessarily
computes subsumed matchings) or the problem is
hard
40
I/O Complexity of Query Evaluation
(lower bounds are for non-emptiness)
Query /
Semantics
Complete
41
Path
Query
PTIME
Tree
Query
DAG
Query
Cyclic
Query
PTIME
NPComplete
NPComplete
AND
PTIME
PTIME
PTIME
NPComplete
Weak
PTIME
PTIME
PTIME
PTIME
OR
PTIME
PTIME
PTIME
PTIME
Recent Results (PODS’03)
Filter Constraints
• Constraints that filter the results (i.e., the
maximal matchings)
• There are
– Weak filter constraints (the constraint is
satisfied if a variable in the constraint is null)
– Strong filter constraints (all variables must be
non-null for satisfaction)
• Existence constraint: !x is true if x is not null
42
I/O Complexity of Query Evaluation with
Existence Constraints
(lower bounds are for non-emptiness)
Query /
Semantics
Complete
AND
43
Path
Query
PTIME
PTIME
Tree
Query
DAG
Query
Cyclic
Query
PTIME
NPComplete
NPComplete
PTIME
NPComplete
NPComplete
NPComplete
NPComplete
Weak
PTIME
PTIME
NPComplete
OR
PTIME
PTIME
NPComplete
I/O Complexity of Query Evaluation with
Weak Equality/Inequality Constraints
(lower bounds are for non-emptiness)
Path
Query
Tree
Query
DAG
Query
Cyclic
Query
PTIME
NPComplete
NPComplete
NPComplete
PTIME
NPComplete
NPComplete
NPComplete
Weak
PTIME
NPComplete
NPComplete
NPComplete
OR
PTIME
NPComplete
NPComplete
NPComplete
Query /
Semantics
Strong
AND
44
Query Containment
• Query containments for queries with incomplete
answers is defined differently from query
containment for queries with complete answers
• Q1 Q2 if for all database D,
every matching of Q1 w.r.t. to D
is subsumed by
a matchings of Q2 w.r.t. to D
• Query containment (query equivalence) is useful
for the development of optimization techniques
45
Containment in AND Semantics
• Homomorphism between the query graphs
is necessary and sufficient for containment
Q1
r
l1
l2
x
Q2
r
l1
p
l2
l2
y
l3
u
z
l3
l4
v
u
q
Q1 Q2
l4
v
homomorphism
• Deciding whether one query is contained in
another is NP-Complete
46
Containment in OR Semantics
• The following is a necessary and sufficient
condition for query containment in OR semantics
• For every spanning tree T1 of the contained query,
there a spanning tree T2 of the containing query,
such that there is a homomorphism from T2 to T1
– is in ΠP2
– NP-Complete if the containee is a tree
– polynomial if the container is a tree
47
Containment in Weak Semantics
• Similar to containment in OR Semantics,
with the following difference
• Instead of checking homomorphism
between spanning trees, we check
homomorphism between graph fragments
– A graph fragment is a restriction of the query to
a subset of the variables that includes the query
root such that every node in the fragment is
reachable from the root
48
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
49
Flexible Queries
• To deal with structural variations in the
data, we have developed flexible queries
50
Rigid Queries
Semiflexible Queries
Flexible Queries
Flexible Queries
51
Increasing
level of
flexibility
Example
Movie Database
r
Actor
A query that finds all pairs of actors
that acted in the same movie
Actor
x
y
Movie
Movie
z
However, if in the database, actors
are descendents of movies,
the query has to be reformulated
Instead, we propose new ways
of matching queries to databases
52
Rigid matchings and
complete matchings
are the same
Returning rigid matchings is the usual
semantics for queries
(e.g., XQuery, Lorel, XML-QL, etc.)
53
Constraints On Rigid Matchings
• Root Constraint:
• Satisfied if the query root is mapped to the db root
Query Root
r
1
Database Root
• Edge Constraint:
• Satisfied if a query edge with label l is mapped to a
database edge with label l
x
l
l
y
54
12
25
Movie Database
1
Movie
Movie
11
Actor
Actor
21
Name
30
22
Name
Title
Movie
T.V. Series
Star 1977
Wars
Name 28
Léon
Title
Name
31
x
Movie
y
29
26
34
32
Natalie
Portman
A Rigid Matching
55
14
Actor Title
Year
25
Actor
Actor
12
23 24
Mark Harrison
Hamill Ford
r
27
Title Year
35
36
Twin
1984
Peaks Dune
Kyle
MacLachlan
This is not a Rigid Matching
A Semiflexible Matching
• The query root is mapped
to the db root
• A query node with an
incoming label l is
mapped to a db node
with an incoming label l
• The image of every query
path is embedded in some
database path
• SCC is mapped to SCC
56
Query
Root
DB
Root
1
r
x
l
y
×
9
l
11
A Semiflexible Matching
Query
• The query root is
Root
The lasttotwo
conditions
mapped
the db
root
r
cannot
be
verified
locally,
• A query node with an
i.e., by label
considering
incoming
l is
one query
mapped
to a dbedge
node at a time
with an incoming label l
• The image of every query
path is embedded in some
database path
• SCC is mapped to SCC
57
DB
Root
1
x
l
9
l
y
11
Movie Database
1
Movie
Movie
11
Actor
Actor
21
Name
30
22
Name
Title
Star 1977
Wars
Mark Harrison
Hamill Ford
14
Movie
T.V. Series
Actor Title
Year
25
Name 28
Léon
Title
Name
x
Movie
y
29
26
32
Natalie
Portman
34
27
Title Year
35
36
Twin
1984
Peaks Dune
Kyle
MacLachlan
The Semiflexible Matchings
58
Actor
Actor
12
23 24
31
r
We get all the
actor-movie pairs
r
r
Actor
Movie
y
Actor
x
x
Movie
y
Under semiflexible semantics,
these two queries are equivalent
The user does not have to know
if movies are above or below
actors in the database
59
Movie Database
1
Movie
Movie
11
Actor
Actor
21
Name
30
22
Name
Title
Star 1977
Wars
Mark Harrison
Hamill Ford
Actor
x
14
Movie
Movie
T.V. Series
Actor Title
Year
25
Actor
Actor
12
23 24
31
r
Name 28
Léon
Title
Name
y
Movie
z
29
26
32
Natalie
Portman
34
27
Title Year
35
36
Twin
1984
Peaks Dune
Kyle
MacLachlan
Impossible to get this pair by means
We get
of apairs
rigid of
Another since
Example
a is a dag
matching,
the of
query
actors
andthat
theacted
db in
60
Matching
is Semiflexible
a tree
the same movie
A Flexible Matching
• The query root is mapped
to the db root
• A query node with an
incoming label l is mapped
to a db node with an
incoming label l
• An edge is mapped to two
nodes on one path
• Notice that a path in the query
is not necessarily mapped to a
path in the db
61
Query
Root
r
DB
Root
1
x
l
9
l
y
11
An Example of a Flexible Query
r
Director
A director
The director name
x
Name
Movie
A movie of the director
y
z
Title
Actor
An actor in the movie
v
u
Name
The name of the actor
62
w
The title of the movie
Movie Database
1
r
Movie
Director
Producer Movie
11
Actor
Title Director
Actor
Year
21
22
23 24
x
12
27
Actor
25
Director
Title
26
Name
Movie
z
29
Star 1977
Name
Name
Hook
Name
Name Name Wars
Date of
34
30
32
31
33
birth George Dustin
Steven
Mark Harrison
Lucas Hoffman
Spielberg
35
Hamill Ford
14 May 1944
y
Title
Actor
flexible
neither a rigid
AThis
query
edge matching
is mappedisto
matching
nor
a
semiflexible
matching
two
db
nodes
on
one
path
63
v
u
Name
w
Movie Database
1
r
Movie
Producer
Producer Movie
11
Actor
Title Director
Actor
Year
21
22
23 24
x
12
27
Actor
25
Director
Title
26
29
Movie
y
Star 1977
Name
Name
Hook
Name
Name Name Wars
Date of
34
30
32
31
33
birth George Dustin
Steven
Mark Harrison
Lucas Hoffman
Spielberg
35
Hamill Ford
14 May 1944
In thisWhy
flexible
matching, a producer
are semiflexible
matchingsis given
with
that he directed
but did
not produce
preferred
sometimes
to flexible
matchings?
64 a movie
Movie Database
1
r
Movie
Producer
Producer Movie
11
Producer
Actor
99
Title Director
Actor
Year
21
22
23 24
27
x
12
Actor
25
Director
Title
26
29
Movie
y
Star 1977
Name
Name
Hook
Name
Name Name Wars
Date of
34
30
32
31
33
birth George Dustin
Steven
Mark Harrison
Lucas Hoffman
Spielberg
35
Hamill Ford
14 May 1944
In semiflexible semantics, the problem is solved
since the image of a query path is embedded in
65
a database
path
Differences Between the Semiflexible
and Flexible Semantics
• On a technical level, in flexible matchings
– Query paths are not necessarily embedded in database
paths
– SCC’s are not necessarily mapped to SCC’s
• On a conceptual level, in the semiflexible
semantics, nodes are “semantically related” if they
are on the same path, and hence
– Query paths are embedded in database paths
• In the flexible semantics, this condition is relaxed:
– Query edges are embedded in database paths
66
Increasing Level of Flexibility
• A rigid matching is a semiflexible matching
• A semiflexible matching is a flexible
matching
67
Verifying that Mappings are
Semiflexible Matchings
• Is a given mapping of query nodes to database
nodes a semiflexible matching?
– Not as simple as for rigid matchings (no local test, i.e.,
need to consider paths rather than edges)
• In a dag query, the number of paths may be
exponential
– Yet, verifying is in polynomial time
• In a cyclic query, the number of paths may be
infinite
– Yet, verifying is in exponential time
68
Verifying that a Mapping is a
Semiflexible Matching
Query /
Database
Path
Query
Tree
Query
DAG
Query
Cyclic
Query
No
Path
Database
PTIME
PTIME
PTIME
Tree
Database
PTIME
PTIME
PTIME
DAG
Database
PTIME
PTIME
PTIME
matchings
Cyclic
Database
PTIME
PTIME
coNP
coNP
69
matchings
No
matchings
No
Input-Output Complexity of Query Evaluation
for the Semiflexible Semantics
• Next slide summarizes results about the
input-output complexity
– Polynomial for a dag query and a tree database
(or simpler cases)
• Rather difficult to prove, even when the query is a
tree, since there is no local test for verifying that
mappings are semiflexible matchings
– Exponential lower bounds for other cases
70
I/O Complexity for SF Semantics
(lower bounds are for non-emptiness)
Query /
Database
Path
Database
Path
Query
PTIME
Tree
Query
PTIME
DAG
Query
Cyclic
Query
PTIME
Result is
empty
Tree
Database
PTIME
PTIME
PTIME
Result is
empty
DAG
Database
NPComplete
NPComplete
NPComplete
Result is
empty
Cyclic
Database
NPComplete
NPComplete
NP-Hard
NP-Hard
(in P2)
(in P2)
71
Data Complexity is Polynomial in all Cases
Query Evaluation for the
Flexible Semantics
• The database is replaced with a relationship
graph which is a graph, such that
– The nodes are the nodes of the database
– Two nodes are connected by an edge if there is
a path between them in the database (the
direction of the path is unimportant)
• The query is evaluated under rigid
semantics w.r.t. the relationship graph
72
I/O Complexity of Query Evaluation
for the Flexible Semantics
• Results follow from a reduction to query
evaluation under the rigid semantics
• Tree query
– Input-Output complexity is polynomial
• DAG query
– Testing for non-emptiness is NP-Complete
73
Query Containment
• Q1 Q2 if for all database D,
the set of matchings of Q1 w.r.t. to D
is contained in
the set of matchings of Q2 w.r.t. to D
• We assume that
– Both queries have the same set of variables
74
Complexity of Query Containment
• Under the semiflexible semantics, Q1 Q2 iff the
identity mapping from the variables of Q2 to the
variables of Q1 is a semiflexible matching of Q2
w.r.t. Q1
• Thus, containment is
– in coNP when Q1 is a cyclic graph and Q2 is
either a dag or a cyclic graph
– in polynomial time in all other cases
• Under the flexible semantics, query containment
is always in polynomial time
75
Database Equivalence
• D1 and D2 are equivalent if for all queries
Q,
the set of matchings of Q w.r.t. to D1
is equal to
the set of matchings of Q w.r.t. to D2
• Both databases must have the same set of
objects and the same root
76
Complexity of Database Equivalence
• For the semiflexible semantics, deciding
equivalence of databases is
– in polynomial time if both databases are dags
– in coNP if one of the databases has cycles
• For the flexible semantics, deciding
equivalence of databases is polynomial in
all cases
77
Database Transformation
MDB
MDB
1
Actor
1
Actor
Actor
2
3
4
Dustin
Harrison
Movie
Hoffman
Ford Movie
Movie
6
Hook
8
Star Wars
Movie
Hook
Mark
Hamill
Actor
2
Dustin
Hoffman
Movie
6
8 Star Wars
Actor
3
Harrison
Ford
Actor
4
Mark
Hamill
The databases are equivalent under both
A DAG has become a TREE!
the flexible and semiflexible semantics
78
Transforming a Database into a Tree
• Reasons for transforming a database into an
equivalent tree database:
– Evaluation of queries over a tree database is
more efficient
– In a graphical user interface, it is easier to
represent trees than DAGs or cyclic graphs
– Storing the data in a serial form (e.g., XML)
requires no references
79
Transformation into a Tree
• There are algorithms for
– Testing if a database can be transformed into an
equivalent tree database, and
– Performing the transformation
• For the semiflexible semantics
– The algorithms are polynomial
• For the flexible semantics
– The algorithms are exponential
80
Implementing Flexible Queries
• Flexible queries were implemented in
SQL4X
• In an SQL4X query, relations and XML
documents are queried simultaneously
• A query result can be either a relation or an
XML document
81
A query under the
Flexible Semantics
r
Director
x
Name
Movie
y
z
Title
v
An SQL4X Query
QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of ‘MDB.xml’, y Name of x,
z Movie of x, v Title of z
82
A query under the
Flexible Semantics
r
Director
x
Name
Movie
y
z
Title
v
An SQL4X Query
Constraints can
be added
QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of ‘MDB.xml’, y Name of x,
z Movie of x, v Title of x
WHERE text(v) = ‘Star Wars’
83
A query under the
Flexible Semantics
FilmBudgets
r
Director
Title
x
…
…
…
…
Name
Movie
y
z
Title
v
Budget
A relation with data
about film budgets
An SQL4X Query
QUERY AS RELATION
SELECT text(x) as director, text(v) as title, Budget
FROM x Director of ‘MDB.xml’, y Name of x,
z Movie of x, v Title of x, FilmBudgets
WHERE text(v) = FilmBudgets.Title
84
Relations and XML Documents
can be queried simultaneously
Agenda
Why is is difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
85
Combining the Paradigms
• In oblivious querying:
– The user does not have to know where data is
incomplete
– The user does not have to know the exact
structure of the data
• The paradigm of flexible queries and the
paradigm of queries with incomplete
answers should be combined
86
Flexible Queries with
Incomplete Answers
• A flexible query w.r.t. a database is actually
a rigid query w.r.t. the relationship graph
• Evaluating a query in AND-semantics
(weak semantics, OR-Semantics) w.r.t. the
relationship graph produces a flexible query
that returns maximal answers rather than
complete answers
87
Movie Database
1
r
Movie
Director
Producer Movie
11
Actor
Title Director
Actor
Year
21
22
23 24
x
12
27
Actor
25
Director
Title
26
29
Star 1977
Name
Name
Hook
Name
Name Name Wars
Date of
34
30
32
31
33
birth George Dustin
Steven
Mark Harrison
Lucas Hoffman
Spielberg
35
Hamill Ford
14 May 1944
88
Consider the case where Node 25
and Node 33 are removed
Name
Movie
y
z
Title
Actor
v
u
Name
w
Movie Database
1
r
Movie
Director
Producer Movie
11
Actor
Title Director
Actor
Year
21
22
23 24
x
12
27
Star 1977
Name
Name Name Wars
Date of
30
32
31
birth George
Mark Harrison
Lucas
35
Hamill Ford
14 May 1944
Director
Title
26
Name
Movie
z
29
Title
Actor
Hook Name
34
Steven
Spielberg
y
v
u
Name
NULL
w
NULL
A Flexible matching which is also an
incomplete (maximal) matching
89
Agenda
Why is is difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
90
Full Disjunction
• Intuitively, the full disjunction of a given set
of relations is the join of these relations that
does not discard dangling tuples
• Dangling tuples are padded with nulls
• Only maximal tuples are retained in the full
disjunction (as in the case of QwIA)
91
Movies
Actors
m-id title
year
1
Zelig
2
Antz
language
a-id name
date-of-birth
1983 English
1
Woody Allen
1/12/1935
1998 English
2
Bruce Willis
19/3/1955
3
Armageddon 1998 English
3
Julia Roberts
28/10/1967
4
Fantasia
1940 English
Acted-in a-id m-id role
Actors-that-Directed a-id m-id
1
1
The Full Disjunction of the Given Relations
1
1
Zelig
1
2
Z
2
3
Harry
m-id
title
year
language a-id name
Date-of-birth role
1
Zelig
1983
English
1
Woody Allen
1/12/1935
Zelig
2
Antz
1998
English
1
Woody Allen
1/12/1935
Z
3
Armageddon
1998
English
2
Bruce Willis
19/3/1955
Harry
4
Fantasia
1940
English
3
Julia Roberts
28/10/1967
92
Movies
m-id title
year
language
1
Zelig
1983 English
2
Antz
1998 English
3
Armageddon 1998 English
4
Fantasia
This tuple will not
be in the full disjunction
1940 English
m-id
title
year
language a-id name
1
Zelig
1983
English
Date-of-birth role
The Full Disjunction of the Given Relations
m-id
title
year
language a-id name
Date-of-birth role
1
Zelig
1983
English
1
Woody Allen
1/12/1935
Zelig
2
Antz
1998
English
1
Woody Allen
1/12/1935
Z
3
Armageddon
1998
English
2
Bruce Willis
19/3/1955
Harry
4
Fantasia
1940
English
The
does
not
subsumed
93 full
disjunction
3 include
Julia Roberts
28/10/1967 tuples
Movies
Actors
m-id title
year
1
Zelig
2
Antz
language
a-id name
date-of-birth
1983 English
1
Woody Allen
1/12/1935
1998 English
2
Bruce Willis
19/3/1955
3
Armageddon 1998 English
3
Julia Roberts
28/10/1967
4
Fantasia
1940 English
Acted-in
a-id m-id role
a-id m-id
1
1
Zelig
1
1
2
Z
The Full Disjunction of the Given Relations
2
3
Harry
m-id
m-id
1
4
2
title
title
Zelig
Fantasia
Antz
year
year
1983
1940
1998
language
language
English
English
English
a-id
a-id
1
3
1
name
name
Woody Allen
Julia Roberts
Woody Allen
Date-of-birth
Date-of-birth
1/12/1935
28/10/1967
1/12/1935
role
role
Zelig
Z
3
Armageddon
1998
English
2
Bruce Willis
19/3/1955
Harry
Actors-that-Directed
1
The
full
disjunction
does not include
tuples that
are based
4
Fantasia
1940 English
on
Cartesian Product
rather than
join
94
3
Julia Roberts 28/10/1967
In the Full Disjunction
of a Given Set of Relations:
Every tuple of the input is a part
of at least one tuple of the output
Tuples are joined as in a natural
join, padded with null values
The result includes only
“maximal connected portions”
95
Motivation for Full Disjunctions
• Full disjunctions have been proposed by
Galiando-Legaria as an alternative for
outerjoins [SIGMOD’94]
• Rajaraman and Ullman suggested to use full
disjunctions for information integration
[PODS’96]
96
Computing Full Disjunctions
for γ-acyclic Relation Schemas
• Rajaraman and Ullman have shown how to
evaluate the full disjunction by a sequence
of natural outerjoins when the relation
schemas are γ-acyclic
• Hence, the full disjunction can be computed
in polynomial time, under input-output
complexity, when the relation schemas are
γ-acyclic
97
Weak Semantics Generalizes
Full Disjunctions
• Relations can be converted into a
semistructured database
• The full disjunction can be expressed as the
union of several queries that are evaluated
under weak semantics
We have developed an algorithm that uses this
generalization to compute full disjunctions in
polynomial time under I/O complexity,
even when the relation schemas are cyclic
98
Generalizing Full Disjunctions
• In a full disjunction, tuples are joined
according to equality constraints as in a
natural join (or equi-join)
• We can generalize full disjunctions to
support constraints that are not merely
equality among attributes
99
Example
Movies (m-id, title, year, language, location)
Actors (a-id, name, date-of-birth)
Acted-in (a-id, m-id, role)
Actors-that-Directed (a-id, m-id)
Historical-Events (name, date, description)
Historical-Sites (Country, State, City, Site)
The date of the historical event is a date in the year when
The filming location is near the historical site
the movie was released
100
Another Way of Generalizing Full
Disjunctions: Use OR-Semantics
• OR-semantics is used rather than weak
semantics when tuples are joined
• This relaxes the requirement that every pair
of tuples should be join consistent
• Instead, a tuple of the full disjunction is
only required to be generated by database
tuples that form a connected subgraph, but
need not be pairwise join consistent
101
Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
e-id
ename
city
007
James Bond
London
6
6
MI-6
10
6
MI-6
10
10
Liverpool
King
102
dept dept dname
-no -no
building building city
The Full Disjunction
street
Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
e-id
ename
city
007
James Bond
London
103
dept dept dname
-no -no
6
6
MI-6
building building city
10
10
Liverpool
The Full Disjunction under OR-Semantics
street
King
Information Integration from
Heterogeneous Sources
Integrated Relation
Relation
Relation
Relation
Query
Query
Query
Data Source
Data Source
Data Source
104
We use queries that combine flexible semantics
and weak semantics:
-The queries are insensitive to changes in the data
- Easy to formulate the query
Integrated Relation
Relation
Relation
Relation
Query
Query
Query
Data Source
Data Source
Data Source
105
The integration of the relations is done with
a full disjunction of the computed relations
Integrated Relation
Relation
Relation
Relation
Query
Query
Query
Data Source
Data Source
Data Source
106
Conclusion
• Flexible and semiflexible queries facilitate
easy and intuitive querying of
semistructured databases
– Querying the database even when the user is
oblivious to the structure of the database
– Queries are insensitive to variations in the
structure of the database
107
Conclusion (continued)
• Queries in AND semantics, OR semantics
or weak semantics facilitate easy and
intuitive querying of incomplete databases
– Querying the database even when the user is
oblivious to missing data
– Queries return maximal answers rather than
complete answers
108
Conclusion (continued)
• The two paradigms of flexible queries and
queries with maximal answers can be
combined
• The combination of the paradigms can
facilitate integration of information from
heterogeneous sources
109
Conclusion (continued)
• Full disjunctions can be computed using
queries in weak semantics
• Full disjunctions can be generalized so that
relations are joined using constraints that
are not merely equality constraints
110
Thank You
Questions?
111