slides - VLDB 2005


Answering Imprecise Queries over Web Databases
Ullas Nambiar and Subbarao Kambhampati
Department of CS & Engineering
Arizona State University
VLDB, Aug 30 – Sep 02, 2005, Trondheim, Norway
Why Imprecise Queries?
Want a 'sedan' priced around $7000.
A feasible query: Make = "Toyota", Model = "Camry", Price ≤ $7000
– What about the price of a Honda Accord?
– Is there a Camry for $7100?
Solution: Support Imprecise Queries
Make      Model   Price   Year
Toyota    Camry   $7000   1999
Toyota    Camry   $7000   2001
Toyota    Camry   $6700   2000
Toyota    Camry   $6500   1998
The Imprecise Query Answering Problem
Problem Statement:
Given a conjunctive query Q over a relation R, find a ranked set of tuples of R that satisfy Q above a threshold of similarity Tsim:

  Ans(Q) = { x | x ∈ R, Similarity(Q, x) > Tsim }

Constraints:
– Autonomous database
  • Data accessible only by querying
  • Data model, operators etc. not modifiable
– Supports the boolean (relational) model
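The Ans(Q) definition above can be sketched directly: scan the relation, score each tuple against the query, and keep the tuples above the threshold, best first. The relation, query, and exact-match similarity below are illustrative placeholders, not part of AIMQ itself.

```python
# Minimal sketch of Ans(Q) = { x | x in R, Similarity(Q, x) > Tsim },
# assuming a caller-supplied similarity function.

def answer_imprecise(R, similarity, query, t_sim):
    """Return tuples of R with similarity(query, x) > t_sim, best first."""
    scored = [(similarity(query, x), x) for x in R]
    answers = [(s, x) for s, x in scored if s > t_sim]
    answers.sort(key=lambda pair: pair[0], reverse=True)
    return answers

# Toy relation and an illustrative similarity (fraction of matching attributes):
R = [
    {"Make": "Toyota", "Model": "Camry", "Price": 7000},
    {"Make": "Honda", "Model": "Accord", "Price": 6800},
    {"Make": "Toyota", "Model": "Corolla", "Price": 6500},
]
query = {"Make": "Toyota", "Model": "Camry", "Price": 7000}
sim = lambda q, x: sum(q[a] == x[a] for a in q) / len(q)

ranked = answer_imprecise(R, sim, query, t_sim=0.3)
```

With this toy data the exact Camry match ranks first and the partially matching Corolla is also returned, while the Accord falls below the threshold.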
Existing Approaches
• Similarity search over vector space
  – Data must be stored as vectors of text
    (WHIRL, W. Cohen, 1998)
• Enhanced database model
  – Add a 'similar-to' operator to SQL; distances provided by an expert/system designer
    (VAGUE, A. Motro, 1988)
  – Support similarity search and query refinement over abstract data types
    (Binderberger et al, 2003)
• User guidance
  – Users provide information about the objects required and their possible neighborhood
    (Proximity Search, Goldman et al, 1998)
Limitations:
1. User/expert must provide similarity measures
2. New operators needed to use distance measures
3. Not applicable over autonomous databases
Motivation & Challenges: What We Did!
Objectives
– Minimal burden on the end user
– No changes to the existing database
– Domain independent
Motivation
– Mimic the relevance-based ranked retrieval paradigm of IR systems
– Can we learn relevance statistics from the database?
– Use the estimated relevance model to improve the querying experience of users
Challenges
• Estimating query–tuple similarity
  – Weighted summation of attribute similarities
  – Syntactic similarity inadequate
  – Need to estimate semantic similarity
    • Not enough ontologies
• Measuring attribute importance
  – Not all attributes are equally important
  – Users cannot quantify importance
AIMQ
[Architecture diagram: wrappers probe data sources on the WWW using random sample queries; the collected data is processed into a Sample Dataset. The Dependency Miner mines AFDs & keys to produce Weighted Dependencies; the Similarity Miner extracts concepts and estimates similarity to produce a Similarity Matrix. The Query Engine takes an imprecise query, maps it to a precise query, identifies & executes similar queries, and ranks the results, returning Ranked Tuples.]
The AIMQ approach
1. Map: convert "like" to "=" : Qpr = Map(Q)
2. Derive the Base Set Abs = Qpr(R)
3. Use the Base Set as a set of relaxable selection queries
4. Using AFDs, find the relaxation order
5. Derive the Extended Set by executing the relaxed queries
6. Use concept similarity to measure tuple similarities
7. Prune tuples below the threshold
8. Return the ranked set
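Steps 3–5 of the flow above can be sketched as follows: treat the precise base query as a set of selections and relax it by dropping constraints in a given relaxation order, executing each relaxed query to accumulate the extended set. The in-memory "database", the relaxation order, and the helper names are all illustrative, not AIMQ's actual implementation.

```python
# Sketch of query relaxation: drop one constraint at a time, in the
# given relaxation order, and execute each relaxed query.

def relaxed_queries(base_query, relax_order):
    """Yield queries with successively more constraints removed."""
    q = dict(base_query)
    for attr in relax_order:
        q = {a: v for a, v in q.items() if a != attr}
        if q:                      # skip the fully unconstrained query
            yield dict(q)

def execute(R, query):
    """Evaluate a conjunctive selection query over an in-memory relation."""
    return [t for t in R if all(t.get(a) == v for a, v in query.items())]

R = [
    {"Make": "Toyota", "Model": "Camry", "Price": 7000, "Year": 2000},
    {"Make": "Toyota", "Model": "Corolla", "Price": 7000, "Year": 2001},
    {"Make": "Honda", "Model": "Accord", "Price": 7000, "Year": 2000},
]
base = {"Make": "Toyota", "Model": "Camry", "Price": 7000}
extended = []
for rq in relaxed_queries(base, ["Price", "Model", "Make"]):
    extended.extend(execute(R, rq))
```

Each relaxation widens the answer set; duplicates would be deduplicated and scored by similarity in the later steps.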
Query-Tuple Similarity
• Tuples in the extended set show different levels of relevance
• Ranked according to their similarity to the corresponding tuples in the base set using

  Sim(Q, t) = Σ (i = 1..n) Wimp(Ai) ×  { VSim(Q.Ai, t.Ai)            if Dom(Ai) is Categorical
                                       { 1 − |Q.Ai − t.Ai| / Q.Ai    if Dom(Ai) is Numerical

– n = Count(Attributes(R)) and Wimp is the importance weight of the attribute
– Distance used as similarity for numerical attributes, e.g. Price, Year
– VSim: semantic value similarity estimated by AIMQ for categorical attributes, e.g. Make, Model
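The Sim(Q, t) formula can be sketched as below, assuming VSim is available as a lookup table of learned semantic similarities for categorical value pairs and that numeric similarity is 1 − |Q.Ai − t.Ai| / Q.Ai, clipped at zero. The weights and VSim entries are made-up illustrations, not values mined by AIMQ.

```python
# Sketch of Sim(Q, t): importance-weighted sum of per-attribute
# similarities, with a VSim lookup for categorical attributes and a
# normalized difference for numerical ones.

def tuple_similarity(query, tup, w_imp, vsim, numeric_attrs):
    total = 0.0
    for attr, qval in query.items():
        if attr in numeric_attrs:
            s = max(0.0, 1.0 - abs(qval - tup[attr]) / qval)
        else:
            # exact match scores 1; otherwise fall back to learned VSim
            s = 1.0 if qval == tup[attr] else vsim.get((qval, tup[attr]), 0.0)
        total += w_imp[attr] * s
    return total

w_imp = {"Make": 0.2, "Model": 0.3, "Price": 0.5}   # illustrative importance weights
vsim = {("Camry", "Accord"): 0.6}                   # illustrative learned VSim entry
q = {"Make": "Toyota", "Model": "Camry", "Price": 7000}
t = {"Make": "Honda", "Model": "Accord", "Price": 7100}
score = tuple_similarity(q, t, w_imp, vsim, numeric_attrs={"Price"})
```

Here the Accord scores well despite matching no attribute exactly, because Camry and Accord are semantically similar and the prices are close.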
Deciding Attribute Order
• Mine AFDs and approximate keys
• Create a dependence graph using AFDs
  – Strongly connected, hence a topological sort is not possible
• Using the approximate key with highest support, partition attributes into
  – Deciding set
  – Dependent set
  – Sort the subsets using dependence and influence weights
• Measure attribute importance as

  Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)) × { Wtdecides(Ai) / Σ Wtdecides  or  Wtdepends(Ai) / Σ Wtdepends }

• Attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation

Example: CarDB(Make, Model, Year, Price)
  Decides: Make, Year
  Depends: Model, Price
  Order: Price, Model, Year, Make
  1-attribute: {Price, Model, Year, Make}
  2-attribute: {(Price, Model), (Price, Year), (Price, Make), …}
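The importance-weight formula above can be sketched as follows: attributes in the deciding set of the best approximate key are normalized against Σ Wtdecides, the dependent ones against Σ Wtdepends, and both are scaled by position in the relaxation order (relaxed last = most important). The numeric weights below are illustrative, not mined from CarDB.

```python
# Sketch of Wimp(Ai) = RelaxOrder(Ai)/count(Attributes(R)) x
#   Wtdecides(Ai)/Sum(Wtdecides)  or  Wtdepends(Ai)/Sum(Wtdepends)

def importance(attr, relax_order, wt_decides, wt_depends, deciding_set):
    n = len(relax_order)                      # count(Attributes(R))
    pos = relax_order.index(attr) + 1         # RelaxOrder(Ai), 1-based
    wts = wt_decides if attr in deciding_set else wt_depends
    return (pos / n) * (wts[attr] / sum(wts.values()))

# CarDB example from the slide; the weight values are made up.
relax_order = ["Price", "Model", "Year", "Make"]   # relaxed first -> last
deciding = {"Make", "Year"}
wt_decides = {"Make": 0.6, "Year": 0.4}
wt_depends = {"Model": 0.7, "Price": 0.3}

w = {a: importance(a, relax_order, wt_decides, wt_depends, deciding)
     for a in relax_order}
```

Under these weights Make, relaxed last, comes out most important and Price, relaxed first, least, which matches the relaxation order on the slide.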
Empirical Evaluation
• Goal
  – Test robustness of the learned dependencies
  – Evaluate the effectiveness of the query relaxation and similarity estimation
• Database
  – Used-car database CarDB based on Yahoo Autos
    CarDB(Make, Model, Year, Price, Mileage, Location, Color)
  – Populated using 100k tuples from Yahoo Autos
• Algorithms
  – AIMQ
    • RandomRelax: randomly picks the attribute to relax
    • GuidedRelax: uses the relaxation order determined using approximate keys and AFDs
  – ROCK: RObust Clustering using linKs (Guha et al, ICDE 1999)
    • Compute neighbours and links between every tuple
      – Neighbour: tuples similar to each other
      – Link: number of common neighbours between two tuples
    • Cluster tuples having common neighbours
Robustness of Dependencies
[Charts: dependence of attributes Model, Color, Year, Make and quality of keys, for sample sizes of 15k, 25k, 50k and 100k tuples]
Attribute dependence order & key quality are unaffected by sampling
Robustness of Value Similarities

Value            Similar Values    25k     100k
Make="Kia"       Hyundai           0.17    0.17
                 Isuzu             0.15    0.15
                 Subaru            0.13    0.13
Make="Bronco"    Aerostar          0.19    0.21
                 F-350             0       0.12
                 Econoline Van     0.11    0.11
Year="1985"      1986              0.16    0.16
                 1984              0.13    0.14
                 1987              0.12    0.12
Efficiency of Relaxation
[Charts: Work/Relevant Tuple for 10 queries, under Guided Relaxation and Random Relaxation, at ε = 0.5, 0.6 and 0.7]
• Random Relaxation: average 8 tuples extracted per relevant tuple for ε = 0.5, increasing to 120 tuples for ε = 0.7; not resilient to change in ε
• Guided Relaxation: average 4 tuples extracted per relevant tuple for ε = 0.5, going up to 12 tuples for ε = 0.7; resilient to change in ε
Accuracy over CarDB
[Chart: Average MRR of GuidedRelax, RandomRelax and ROCK over 14 queries]
• Similarity learned using the 25k sample
• 14 queries over 100k tuples
• Mean Reciprocal Rank (MRR) estimated as

  MRR(Q) = Avg( 1 / (|UserRank(ti) − AIMQRank(ti)| + 1) )

• Overall high MRR shows high relevance of the suggested answers
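The MRR estimate above can be sketched directly: for each answer tuple, take the reciprocal of one plus the gap between the rank the user assigned and the rank AIMQ assigned, then average over the tuples. The rank dictionaries below are illustrative, not from the CarDB study.

```python
# Sketch of MRR(Q) = Avg( 1 / (|UserRank(ti) - AIMQRank(ti)| + 1) ).

def mrr(user_rank, aimq_rank):
    """user_rank / aimq_rank: dicts mapping tuple id -> rank (1-based)."""
    recips = [1.0 / (abs(user_rank[t] - aimq_rank[t]) + 1)
              for t in user_rank]
    return sum(recips) / len(recips)

# Illustrative ranks: AIMQ swaps the user's 2nd and 3rd tuples.
user = {"t1": 1, "t2": 2, "t3": 3}
aimq = {"t1": 1, "t2": 3, "t3": 2}
score = mrr(user, aimq)
```

A perfect agreement gives MRR = 1; here the single swapped pair yields (1 + 1/2 + 1/2) / 3 = 2/3.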
AIMQ - Summary
• An approach for answering imprecise queries over Web databases
  – Mine and use AFDs to determine attribute order
  – Domain-independent semantic similarity estimation technique
  – Automatically compute attribute importance scores
• Empirical evaluation shows
  – Efficiency and robustness of the algorithms
  – Better performance than current approaches
  – High relevance of the suggested answers
  – Domain independence