An Abstract Framework for Generating Maximal Answers to Queries

Download Report

Transcript An Abstract Framework for Generating Maximal Answers to Queries

An Abstract Framework
for Generating Maximal
Answers to Queries
Sara Cohen, Yehoshua Sagiv
ICDT 2005
Motivation
Queries and Databases
Answers and Semantics
Graph Properties
ICDT 2005
The Problem



In many different domains, we are given the
option to query some source of information
Usually, the user only gets results if the query
can be completely answered (satisfied)
In many domains, this is not appropriate, e.g.,
 The
user is not familiar with the database
 The database does not contain complete information
 There is a mismatch between the ontology of the user
and that of the database
 The query is a “search” that is not expected to be
correct
ICDT 2005
Search for papers by
“Smith” that appeared in
ICDT 2004
ICDT 2005
Sorry, no matching
record found
ICDT 2005
Search for buses from
“Haifa-Technion” to
“Ben Gurion Airport”
ICDT 2005
There is no direct bus
line between the required
destinations
ICDT 2005
Search for buses to
“Ben Gurion Airport”
ICDT 2005
Must choose From and To
ICDT 2005
What Do Users Need?



Users need a way to get interesting partial
answers to their queries, especially if a
complete answer does not exist
These partial answers should contain maximal
information
Main Problems:
 What
should be the semantics of partial answers?
 How can all partial answers be efficiently computed?
ICDT 2005
Previous Work

Many solutions have been given for the main
problems
 solutions

differ, according to the problem domain
Examples:
 Full
disjunctions: Galindo-Legaria (94), Rajaraman,
Ullman (96), Kanza, Sagiv (03)
 Queries with incomplete answers over semistructured
data: Kanza, Nutt, Sagiv (99)
 FleXPath: Amer-Yahia, Lakshmanan, Pandit (04)
 Interconnections: Cohen, Kanza, Sagiv (03)
ICDT 2005
Our Contribution

In the past, for each semantics considered, the
query evaluation problem had to be studied
anew. In this paper, we:
 Present
a general framework for defining semantics
for partial answers
 Framework is general enough to cover most
previously studied semantics
 Query evaluation problem can be solved once within
this framework – and reused for new semantics
 Results improve upon previous evaluation algorithms
 Presents relationship between this problem and that
of the maximal P-subgraph problem
ICDT 2005
Motivation
Queries and Databases
Answers and Semantics
Graph Properties
ICDT 2005
Databases

Databases are modeled as data graphs:
(V, E, r, lV, lE)
 r:
Can have a designated root
 lV: Labels on the vertices
 lE: Labels on the edges

Note:
 Nodes
correspond to data items
 Even databases that do not have an inherent
graph structure can be modeled as graphs,
e.g., relational databases
ICDT 2005
XML as a Data Graph
University
Name
Dept
Dept
Technion
Name
Computer
Science
Name
Chana
Israeli
Name
Faculty
Faculty
Biology
Professor
Teaches
Lecturer
Teaches
Databases
Name
Bioinformatics
ICDT 2005
Teaches
Avi
Levy
Molecular
Biology
Relational Database as a Data Graph
Sites
Climates
Country
Canada
UK
USA
Climate
diverse
temporate
temporate
Country City
Site
UK
London Buckingham
USA
NY
Metropolitan
Accommodations
Country
UK
City
London
Hotel
Plaza
Canada
Canada
Montreal
Toronto
Hitlon
Ramada
ICDT 2005
Relational Database as a Data Graph
Sites
Climates
Country
Canada
UK
USA
Climate
diverse
temporate
temporate
Country City
Site
UK
London Buckingham
USA
NY
Metropolitan
Accommodations
(C, (Canada, diverse))
(C, (UK, temporate))
Country
UK
City
London
Hotel
Plaza
(C, (USA, temporate))
Canada
Canada
Montreal
Toronto
Hitlon
Ramada
ICDT 2005
Relational Database as a Data Graph
Sites
(C, (Canada, diverse))
(C, (UK, temporate))
Country City
Site
UK
London Buckingham
USA
NY
Metropolitan
(C, (USA, temporate))
Accommodations
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
Country
UK
City
London
Hotel
Plaza
(A, (Canda, Toronto, Ramada))
Canada
Canada
Montreal
Toronto
Hitlon
Ramada
ICDT 2005
Relational Database as a Data Graph
Sites
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
Country City
Site
UK
London Buckingham
USA
NY
Metropolitan
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
ICDT 2005
Relational Database as a Data Graph
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
ICDT 2005
Queries

Queries are modeled as query graphs: (V, E, r,
CV, CE, s)
 r:
Can have a designated root
 CV : Vertex constraints on the vertices (basically, a
boolean function on vertices)
 CE : Edge constraints on the edges (basically, a
boolean function on pairs of vertices)
 s: A structural constraint, one of the letters C, R, N
(defines the required structure of answers, i.e.,
connected, rooted or none)

Note: Nodes correspond to query variables
ICDT 2005
XML Query as a Graph

= University
Returns faculty
members from the
Biology Department
Is Descendent
= Dept and
ContainsText(Biology)
Is Child
Structural Constraint:
Rooted
= Faculty
Is GrandChild
= Name
ICDT 2005
Join Query as a Graph

C
A
S
Structural Constraint:
Connected
Belongs to: C
C.Country =
A.Country
Belongs to: A
q1
q2
C.Country =
S.Country
q3
A.Country =
S.Company and
A.City = S.City
ICDT 2005
Belongs to: S
Motivation
Queries and Databases
Answers and Semantics
Graph Properties
ICDT 2005
Assignment Graphs
Assignment graphs are used to compactly
represent assignments of query nodes to
database nodes
 Basically, assignment graph for Q and D,
written QD has:

 Node
(q,d) for each pair q Q and d D such
that d satisfies the constraint on q
 Edge ((q,d), (q’,d’)) if there is an edge (q,q’)
in Q and (d,d’) satisfies the constraint on (q,q’)
 May also have a root (details omitted)
ICDT 2005
(S, (UK, London, Buckingham))
s1
(C, (UK, temporate))
c1
a1
(A, (UK, London, Plaza))
(A, (Canda, Toronto, Ramada))
(C, (Canada, diverse))
c2
(C, (USA, temporate))
c3
a2
a3
(A, (Canda, Montreal, Hilton))
s2
(S, (USA, NY, Metropolitan))
Belongs to: C
(q1, c1)
C.Country =
A.Country
q1
C.Country =
S.Country
(q1, c2)
Belongs
Belongs
q2
q3 to: S
to: A
A.Country = S.Company
and A.City = S.City
(q1, c3)
ICDT 2005
(S, (UK, London, Buckingham))
s1
(C, (UK, temporate))
c1
a1
(A, (UK, London, Plaza))
(A, (Canda, Toronto, Ramada))
(C, (Canada, diverse))
c2
(C, (USA, temporate))
c3
a2
a3
(A, (Canda, Montreal, Hilton))
s2
(S, (USA, NY, Metropolitan))
Belongs to: C
(q1, c1)
(q2, a1)
C.Country =
S.Country
(q1, c2)
(q2, a2)
Belongs
Belongs
q2
q3 to: S
to: A
A.Country = S.Company
and A.City = S.City
(q1, c3)
(q2, a3)
C.Country =
A.Country
q1
ICDT 2005
(S, (UK, London, Buckingham))
s1
(C, (UK, temporate))
c1
a1
(A, (UK, London, Plaza))
(A, (Canda, Toronto, Ramada))
(C, (Canada, diverse))
c2
(C, (USA, temporate))
c3
a2
a3
(A, (Canda, Montreal, Hilton))
s2
(S, (USA, NY, Metropolitan))
(q3, s1)
Belongs to: C
(q1, c1)
(q2, a1)
C.Country =
S.Country
(q1, c2)
(q2, a2)
Belongs
Belongs
q2
q3 to: S
to: A
A.Country = S.Company
and A.City = S.City
(q1, c3)
(q2, a3)
C.Country =
A.Country
q1
ICDT 2005
(q3, s2)
(S, (UK, London, Buckingham))
s1
(C, (UK, temporate))
c1
a1
(A, (UK, London, Plaza))
(A, (Canda, Toronto, Ramada))
(C, (Canada, diverse))
c2
(C, (USA, temporate))
c3
a2
a3
(A, (Canda, Montreal, Hilton))
s2
(S, (USA, NY, Metropolitan))
(q3, s1)
Belongs to: C
(q1, c1)
(q2, a1)
C.Country =
S.Country
(q1, c2)
(q2, a2)
Belongs
Belongs
q2
q3 to: S
to: A
A.Country = S.Company
and A.City = S.City
(q1, c3)
(q2, a3)
C.Country =
A.Country
q1
ICDT 2005
(q3, s2)
(S, (UK, London, Buckingham))
s1
(C, (UK, temporate))
c1
a1
(A, (UK, London, Plaza))
(A, (Canda, Toronto, Ramada))
(C, (Canada, diverse))
c2
(C, (USA, temporate))
c3
a2
a3
(A, (Canda, Montreal, Hilton))
s2
(S, (USA, NY, Metropolitan))
(q3, s1)
Belongs to: C
(q1, c1)
(q2, a1)
C.Country =
S.Country
(q1, c2)
(q2, a2)
Belongs
Belongs
q2
q3 to: S
to: A
A.Country = S.Company
and A.City = S.City
(q1, c3)
(q2, a3)
C.Country =
A.Country
q1
ICDT 2005
(q3, s2)
(S, (UK, London, Buckingham))
s1
(C, (UK, temporate))
c1
a1
(A, (UK, London, Plaza))
(A, (Canda, Toronto, Ramada))
(C, (Canada, diverse))
c2
(C, (USA, temporate))
c3
a2
a3
(A, (Canda, Montreal, Hilton))
s2
(S, (USA, NY, Metropolitan))
(q3, s1)
Belongs to: C
(q1, c1)
(q2, a1)
C.Country =
S.Country
(q1, c2)
(q2, a2)
Belongs
Belongs
q2
q3 to: S
to: A
A.Country = S.Company
and A.City = S.City
(q1, c3)
(q2, a3)
C.Country =
A.Country
q1
ICDT 2005
(q3, s2)
Partial Assignment

A partial assignment is any subgraph of
QD that does not contain two different
nodes (q,d) and (q,d’)
 otherwise,
would map the node q to two
different database nodes

Can distinguish special types of partial
assignments:
 vertex
complete
 edge complete
 structurally consistent
ICDT 2005
Every
query
node
must
Every
edge
constraint
The partial assignment
appear
in the
partial
between
query
satisfies the query’s
assignment
variables
in constraint
the partial
structural
assignment holds
Example
(q3, s1)
(q1, c1)
(q2, a1)
(q1, c2)
(q2, a2)
(q1, c3)
(q2, a3)
(q3, s2)
ICDT 2005
 Vertex Complete,
 Edge Complete,
 Structurally Consistent
 Vertex Complete,
 Edge Complete,
 Structurally Consistent
 Vertex Complete,
 Edge Complete,
 Structurally Consistent
Semantics


All partial assignments for Q over D that satisfy
the vertex and edge constraints are encoded in
QD
A semantics defines which subgraphs of the
answer graph (i.e., which partial assignments)
are in fact answers, e.g.,
 Sves
allows all partial assignments that are vertex
complete, edge complete and structurally consistent
 Ses allows all partial assignments that are edge
complete and structurally consistent
 Ss allows all partial assignments that are structurally
consistent

Usually, we are only interested in maximal
partial assignemnts
ICDT 2005
Example: Join
(q3, s1)
(q1, c1)
(q2, a1)
(q1, c2)
(q2, a2)
(q1, c3)
(q2, a3)
(q3, s2)
ICDT 2005
Using
semantics Sves
we get the
natural join
Example: Join “becomes” a Full Disjunction
(q3, s1)
(q1, c1)
(q2, a1)
(q1, c2)
(q2, a2)
(q1, c3)
(q2, a3)
(q3, s2)
ICDT 2005
Using
semantics Ses
we get the full
disjunction
Other Examples

Queries with incomplete answers over
semistructured data: Kanza, Nutt, Sagiv (PODS
99)
 Weak
semantics modeled by Ses; Or-semantics
modeled by Ss

FleXPath: Amer-Yahia, Lakshmanan, Pandit
(Sigmond 04)
 Modeled

by Ses
Interconnections: Cohen, Kanza, Sagiv (03)
 Complete
interconnection can be modeled by Ses;
Reachable interconnection can be modeled by Ss
ICDT 2005
Motivation
Queries and Databases
Answers and Semantics
Graph Properties
ICDT 2005
Semantics are a type of Graph Property

A graph property P is a set of graphs,
e.g.,
 is
a clique
 is a bipartite graph
A semantics defines a set of graphs, for
every Q, D (these graphs are subgraphs of
QD)
 Therefore, semantics are a type of graph
property

ICDT 2005
Hereditary Graph Properties and their
Variants




There are several interesting types of graph
properties that have been studied in graph
theory
A graph property P is hereditary if every
induced subgraph of a graph in P, is also in P
(e.g., clique, is a forest)
A graph property P is connected-hereditary if
every connected induced subgraph of a graph in
P, is also in P (e.g., is a tree)
Can define rooted-hereditary similarly
ICDT 2005
Semantics are usually Hereditary
Most semantics for partial answers
considered in the past are hereditary (in
some sense), i.e., subgraphs of a partial
answer are also partial answers
 Many semantics require connectivity of
results (e.g., full disjunctions)
 Some require answers to be rooted (e.g.,
FlexPath)

ICDT 2005
Maximal P-Subgraph Problem

Given a graph property P, and a graph G
The maximal P-subgraph problem is:
Find all maximal induced subgraphs of G
that have property P

Therefore, the problem of finding all
maximal answers for a query over a
database, under a given semantics, is a
special case of the maximal P-subgraph
problem
ICDT 2005
Efficient Query Evaluation

There are efficient algorithms that find all
maximal P-subgraphs for hereditary, connected
hereditary and rooted hereditary properties
 Efficient
in terms of the input and the output (i.e.,
incremental polynomial time)

Use these algorithms to find maximal query
answers, e.g., to find full disjunctions, weak
answers, or-answers, etc.
 Improves
upon previous results
ICDT 2005
Conclusion
Presented abstract framework
 Can model many different types of
queries, databases and semantics in the
framework
 Semantics in the framework are graph
properties
 Solve the maximal P-subgraph problem
once and reuse it to find maximal query
answers

ICDT 2005
Future Work



It is convenient to define ranking functions and
return answers in ranking order
How/when can this be done in our framework?
Note: From the modeling it is immediately
apparent that ranking cannot always be
performed efficiently
 The
problem of finding a maximal P-subgraph of size
k is NP complete for hereditary and connectedhereditary graph properties (Yannakakis, STOC 78)
ICDT 2005
Thank you!
Questions?
ICDT 2005