The DB/IR intersection


4/29
Issues in Bridging DB & IR
Announcements:
Next class: Interactive Review (Come prepared)
Homework III solutions online
Demos tomorrow (instructions will be mailed by the
end of the class)
First, some discussion of BibFinder: how queries are mapped, etc.

CEAS Online Evaluations

"…please be both enthusiastic and adamant about this in your classes"
  – Associate Dean for Academic Affairs

You can do them at https://intraweb.eas.asu.edu/eval
Will be available until the end of day May 5th


Instructors get to see it only after the grades have been given
(so the exam is unfettered by what you might think about it)
(so you don’t need to feel compelled to be particularly nice)

Your feedback would be appreciated (especially the written comments)

Last semester I got 2,196 words of comments;
let us see if we can break the record ;-)
The popularity of the Web brings two broad challenges to databases:

1. Integration of autonomous data sources (data/information integration)
   - Technically this has to handle heterogeneous data too, but we will sort of assume that the sources are "quasi-relational".
2. Supporting heterogeneous data (combining DB/IR)

   - This can be tackled in the presence of a single database.
   - The issues are:
     - How to do effective querying in the presence of structured and text data.
     - How to support IR-style querying on the DB (e.g., the "Stuff I Have Seen" project), because users seem to know IR/keyword-style querying better. (Notice the irony here: we said structure is good because it supports structured querying.)
     - How to support imprecise queries.
DB vs. IR

Databases:
- Allow structured querying
- Queries and results (tuples) are different objects
- Soundness & completeness expected
- User is expected to know what she is doing

IR:
- Only supports unstructured querying
- Queries and results are both documents!
- High precision & recall is hoped for
- User is expected to be a dunderhead.
Some specific problems
1. How to handle textual attributes?
2. How to support keyword-based querying?
3. How to handle imprecise queries? (Ullas Nambiar's work)
1. Handling text fields in data tuples

- Often you have database relations some of whose fields are "textual", e.g. a movie database which has, in addition to year, director, etc., a column called "Review" that is unstructured text.
- Normal DB operations ignore this unstructured stuff (you can't join over it).
  - SQL sometimes supports a "contains" constraint (e.g. give me movies whose review contains "rotten"); a sketch of the idea follows.
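For concreteness, here is a tiny sketch of such a constraint. The movie table, column names, and rows are made up, and SQLite (used purely for illustration) has no CONTAINS predicate, so LIKE stands in for it.

```python
import sqlite3

# Hypothetical movie table with an unstructured "review" column (all names and rows assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INT, director TEXT, review TEXT)")
conn.execute("INSERT INTO movies VALUES ('Gigli', 2003, 'M. Brest', 'a truly rotten script')")
conn.execute("INSERT INTO movies VALUES ('Heat', 1995, 'M. Mann', 'tense and well acted')")

# SQLite lacks CONTAINS, so LIKE approximates the "review contains 'rotten'" constraint.
rows = conn.execute(
    "SELECT title, year FROM movies WHERE review LIKE ?", ("%rotten%",)
).fetchall()
print(rows)  # [('Gigli', 2003)]
```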
Soft Joins: WHIRL [Cohen]

- We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes. The join tuples are then output in ranked form, with the rank proportional to the similarity.
- Neat idea, but it does have some implementation difficulties:
  - Most tuples in the cross-product will have non-zero similarities, so we need query processing that will somehow produce only the highly ranked tuples.
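As a rough illustration of the similarity-join idea (a toy sketch, not Cohen's WHIRL implementation), the snippet below builds TF-IDF vectors for the text fields on each side, scores every pair in the cross-product by cosine similarity, and returns the pairs in ranked order. All data and names are made up.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build TF-IDF vectors for a list of strings (toy tokenizer: lowercase split)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)          # document frequency per term
    return [{w: tf * math.log(n / df[w]) for w, tf in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def similarity_join(left, right, top_k=3):
    """Rank all (left, right) pairs by vector similarity of their text fields."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), l, r) for a, l in zip(lv, left) for b, r in zip(rv, right)]
    return sorted(pairs, reverse=True)[:top_k]

# Made-up review snippets joined against made-up critic blurbs
reviews = ["a rotten script but great action", "a tense crime drama"]
blurbs  = ["great action sequences", "slow and rotten"]
for score, l, r in similarity_join(reviews, blurbs):
    print(f"{score:.2f}  {l!r} ~ {r!r}")
```

Note that this sketch materializes the full cross-product, which is exactly the implementation difficulty noted above; the real system needs ranked query processing that avoids doing so.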
2. Supporting keyword search on databases

How do we answer a query like "Soumen Sunita"?

Issues:
- The schema is normalized (not everything is in one table).
- How do we rank multiple tuples which contain the keywords?

BANKS: Keyword Search in DB. What BANKS does:
- The whole DB is seen as a directed graph (edges correspond to foreign keys).
- Answers are subgraphs, ranked by edge weights.
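To make the graph view concrete, here is a much-simplified sketch (not the actual BANKS algorithm): tuples are nodes, foreign-key references are edges (treated as undirected here), and an answer to a two-keyword query is a connecting path found by BFS, with shorter paths ranked higher as a crude stand-in for edge-weight ranking. All node names are invented.

```python
from collections import deque

# Toy tuple graph: node -> neighbors reachable via foreign-key references (made up)
graph = {
    "author:Soumen":  ["writes:1", "writes:2"],
    "author:Sunita":  ["writes:3"],
    "writes:1": ["author:Soumen", "paper:Mining the Web"],
    "writes:2": ["author:Soumen", "paper:Focused Crawling"],
    "writes:3": ["author:Sunita", "paper:Focused Crawling"],
    "paper:Mining the Web": ["writes:1"],
    "paper:Focused Crawling": ["writes:2", "writes:3"],
}

def keyword_nodes(keyword):
    """Tuple nodes whose contents mention the keyword."""
    return [n for n in graph if keyword.lower() in n.lower()]

def connect(a, b):
    """Shortest connecting path between two tuple nodes (BFS); shorter = higher ranked."""
    queue, seen = deque([[a]]), {a}
    while queue:
        path = queue.popleft()
        if path[-1] == b:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Query "Soumen Sunita": connect tuples matching each keyword and rank the subgraphs
answers = [connect(a, b) for a in keyword_nodes("Soumen") for b in keyword_nodes("Sunita")]
answers = sorted((p for p in answers if p), key=len)
print(answers[0])  # e.g. the path that goes through the shared "Focused Crawling" paper
```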
3. Supporting Imprecise Queries

- Increasing number of Web-accessible databases, e.g. bibliographies, reservation systems, department catalogs, etc.
- Difficulty in extracting the desired information:
  - Support for precise queries only: exactly matching tuples.
  - Limited query capabilities provided by form-based query interfaces.
  - Lack of schema/domain information.
  - Increasing complexity of data types, e.g. hypertext, images, etc.
- Oftentimes the user wants "about the same" instead of "exact":
  - e.g. want cars priced "around" $7000
  - Bibliography search: find similar publications.
- Solution: provide answers closely matching the query constraints.
Relaxing queries…

It is obvious how to relax certain type of attribute values


But how do we relax categorical attributes?


E.g. price=7000 is approximately the same as price=7020
How should we relax Make=Honda?
Two possible approaches


Assume that domain specific information about similarity of values
is available (difficult to satisfy in practice)
Attempt to derive the similarity between attribute values directly
from the data


Qn: How do we compute similarity between “Make=Honda” and
“Make=Chevrolet”
Idea: Compare the set all tuples where Make=Honda to the set of all
tuples where Make=Chevrolet


Consider the set of tuples as a vector of bags (where bags correspond to
the individual attributes)
Use IR similarity techniques to compare the vectors
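A small sketch of this idea under assumed data: each Make value is represented by the bag of values that co-occur with it in the other columns (flattened into a single bag here for brevity), and the bags are compared with an IR-style Jaccard measure.

```python
from collections import Counter

# Made-up CarDb tuples: (Make, Model, Price, Location)
cars = [
    ("Honda", "Civic", 7000, "Tempe"),
    ("Honda", "Accord", 12000, "Phoenix"),
    ("Chevrolet", "Cavalier", 6800, "Tempe"),
    ("Chevrolet", "Impala", 13000, "Phoenix"),
    ("BMW", "M3", 35000, "Scottsdale"),
]

def bag_for(make):
    """Bag of all values co-occurring with this Make (the 'vector of bags' idea, flattened)."""
    return Counter(str(v) for t in cars if t[0] == make for v in t[1:])

def bag_jaccard(b1, b2):
    inter = sum((b1 & b2).values())
    union = sum((b1 | b2).values())
    return inter / union if union else 0.0

print(bag_jaccard(bag_for("Honda"), bag_for("Chevrolet")))  # shared locations -> some similarity
print(bag_jaccard(bag_for("Honda"), bag_for("BMW")))        # nothing shared -> 0.0
```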
Finding similarities between attribute values
5/4
Challenges in answering Imprecise Queries

Challenges:
- Extracting additional tuples with minimal domain knowledge
- Estimating similarity with minimal user input

We introduce IQE (Imprecise Query Engine), which:
- Uses the query workload to identify other precise queries
- Extracts additional tuples satisfying a query by issuing similar precise queries
- Measures the distance between queries using answerset similarity
Answerset Similarity

- Answerset A(Q): the set of all answer tuples of query Q given by relation R.
- Query similarity: Sim(Q1, Q2) := Sim(A(Q1), A(Q2))
- Measuring answerset similarity:
  - Relational model: exact match between tuples; captures only complete overlap.
  - Vector space model: match keywords; also detects partial overlaps.

[Example from the slide: the answerset for Q(Author=Widom) contains tuples such as (Widom, Stream…, VLDB, 2002) and (Widom, Optimize…, ICDE, 1998); the answerset for Q(Author=Ullman) contains tuples such as (Ullman, Optimize…, PODS, 1998) and (Ullman, Mining…, 2000).]
SuperTuple

- Problem: a vector space model representation for answersets.
- Answer: the SuperTuple.

ST(Q_Author=Widom):
  Co-author:  R. Motwani:3, Molina:6, …
  Title:      warehouse:5, optimizing:2, streams:6, …
  Conference: SIGMOD:3, VLDB:4, …
  Year:       2000:6, 1999:5, …
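A minimal sketch of building such a supertuple from an answerset; the attribute names and toy tuples below are assumptions, not actual BibFinder data.

```python
from collections import Counter

# Assumed answerset for Q(Author=Widom): tuples of (co_author, title, conference, year)
answerset = [
    ("R. Motwani", "Query optimization over data streams", "SIGMOD", "2000"),
    ("Molina",     "Warehouse view maintenance",            "VLDB",   "1999"),
]

ATTRS = ["co_author", "title", "conference", "year"]

def supertuple(tuples):
    """SuperTuple: for each attribute, a bag (Counter) of the keywords appearing in it."""
    st = {a: Counter() for a in ATTRS}
    for row in tuples:
        for attr, value in zip(ATTRS, row):
            st[attr].update(value.lower().split())
    return st

st_widom = supertuple(answerset)
print(st_widom["conference"])  # Counter({'sigmod': 1, 'vldb': 1})
```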
Similarity Measures

- Jaccard similarity metric with bag semantics:
  SimJ(Q1, Q2) = |Q1 ∩ Q2| / |Q1 ∪ Q2|
- Doc-Doc similarity:
  - Equal importance to all attributes; the supertuple is considered as a "single bag" of keywords.
  - Sim_doc-doc(Q1, Q2) = SimJ(ST_Q1, ST_Q2)

ST(Q_Author=Ullman):
  Co-author:  C. Li:5, R. Motwani:7, …
  Title:      data-mining:3, optimizing:5, …
  Conference: SIGMOD:5, VLDB:5, …
  Year:       2000:6, 1999:5, …
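Sketched in Python with Counters as bags (toy supertuples, not the real ones above): SimJ takes multiset intersection and union sizes, and Sim_doc-doc flattens each supertuple into one bag before applying SimJ.

```python
from collections import Counter

def bag_jaccard(b1: Counter, b2: Counter) -> float:
    """SimJ with bag semantics: |Q1 ∩ Q2| / |Q1 ∪ Q2| over multisets."""
    inter = sum((b1 & b2).values())   # multiset intersection takes min counts
    union = sum((b1 | b2).values())   # multiset union takes max counts
    return inter / union if union else 0.0

def sim_doc_doc(st1: dict, st2: dict) -> float:
    """Flatten each supertuple (attr -> bag) into one bag, then take SimJ."""
    flat1 = sum(st1.values(), Counter())
    flat2 = sum(st2.values(), Counter())
    return bag_jaccard(flat1, flat2)

# Toy supertuples (attribute -> bag of keyword counts)
st_widom  = {"conference": Counter({"sigmod": 3, "vldb": 4}), "year": Counter({"2000": 6})}
st_ullman = {"conference": Counter({"sigmod": 5, "vldb": 5}), "year": Counter({"2000": 6})}
print(sim_doc_doc(st_widom, st_ullman))
```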
Weighted-Attribute Similarity

- Weights assigned to attributes signify their importance to the user.
- Sim_watr(Q1, Q2) = Σ_i w_i × SimJ(ST_Q1(A_i), ST_Q2(A_i))

ST(Q_Author=Widom):
  Co-author:  R. Motwani:3, Molina:6, …
  Title:      warehouse:5, optimizing:2, streams:6, …
  Conference: SIGMOD:3, VLDB:4, …
  Year:       2000:6, 1999:5, …
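The weighted variant under the same assumptions; the weights below are made-up examples of user-supplied attribute importances.

```python
from collections import Counter

def bag_jaccard(b1, b2):
    inter, union = sum((b1 & b2).values()), sum((b1 | b2).values())
    return inter / union if union else 0.0

def sim_weighted_attr(st1, st2, weights):
    """Sim_watr(Q1, Q2) = sum_i w_i * SimJ(ST_Q1(A_i), ST_Q2(A_i))."""
    return sum(w * bag_jaccard(st1[a], st2[a]) for a, w in weights.items())

st_widom  = {"conference": Counter({"sigmod": 3, "vldb": 4}), "year": Counter({"2000": 6, "1999": 5})}
st_ullman = {"conference": Counter({"sigmod": 5, "vldb": 5}), "year": Counter({"2000": 6, "1999": 5})}

# Example weights: the user cares more about Conference than Year (weights sum to 1)
print(sim_weighted_attr(st_widom, st_ullman, {"conference": 0.7, "year": 0.3}))
```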
Empirical Evaluation

- Goal: evaluate the efficiency and effectiveness of our approach.
- Setup:
  - A database system extending the bibliography mediator BibFinder, projecting the relation Publications(Author, Title, Conference, Journal, Year).
  - Query log consisting of 10K precise queries.
- User study: 3 graduate students; 90 test queries, 30 chosen by each student.
- Platform: Java 2 on a Linux server (Intel Celeron 2.2 GHz, 512 MB).

                         Time      Size
Supertuple Generation    126 sec   21 MB
Similarity Estimation    10 hrs    6 MB
Answering an Imprecise Query

- Estimating query similarity:
  - For each q ∈ Qlog, compute Sim(q, q') for all q' ∈ Qlog:
    - Sim_doc-doc(q, q') = SimJ(ST_q, ST_q')
    - Sim_watr(q, q') = Σ_i w_i × SimJ(ST_q(A_i), ST_q'(A_i))
- Extracting similar answers: given a query Q,
  - Map Q to a query q ∈ Qlog
  - Identify the k queries most similar to q
  - Execute the k new queries
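A sketch of the two phases with toy data: the query-log similarities are precomputed pairwise, and at answer time the imprecise query is mapped to its closest log query, whose k most similar log queries are then the ones to execute. The string-overlap sim() here is only a stand-in for the supertuple-based measures above, and query execution itself is stubbed out.

```python
from itertools import combinations

# Toy query log (made up); real similarities would come from the supertuple measures above.
qlog = ['Title="e learning"', 'Title="web technology"', 'Conference="WISE"', 'Author="Ullman"']

def sim(q1, q2):
    """Toy stand-in: keyword overlap between the queried values themselves."""
    w1, w2 = set(q1.lower().split('"')[1].split()), set(q2.lower().split('"')[1].split())
    return len(w1 & w2) / len(w1 | w2)

# Offline phase: pairwise similarity over the query log
pairwise = {(a, b): sim(a, b) for a, b in combinations(qlog, 2)}

def answer_imprecise(q, k=2):
    # Map the imprecise query to the closest precise query in the log...
    anchor = max(qlog, key=lambda lq: sim(q, lq))
    # ...pick its k most similar log queries...
    neighbors = sorted((lq for lq in qlog if lq != anchor),
                       key=lambda lq: pairwise.get((anchor, lq), pairwise.get((lq, anchor), 0)),
                       reverse=True)[:k]
    # ...and return the precise queries to execute against the DB (execution not shown).
    return [anchor] + neighbors

print(answer_imprecise('Title="web-based learning"'))
```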
Some Results

   Imprecise Query                    Top Relevant Queries
1  Title=“web-based learning”         Title=“e learning”, Title=“web technology”, Conference=“WISE”
2  Title=“Information Extraction”     Title=“Information filtering”, Title=“text mining”, Title=“relevance feedback”
3  Author=“Abiteboul”                 Author=“vianu”, Author=“Dan Suciu”
Relevance of Suggested Answers

[Plot: relevance estimation error per test query, comparing Doc-Doc similarity and Weighted-Attribute similarity; errors fall between 0 and 1.]

Are the results precise? The average error in relevance estimation is around 25%.
User Study – Summary

- Precision for the top-10 related queries is above 75%.
- The Doc-Doc similarity measure dominates Weighted-Attribute similarity.

[Plot: relevance of the top-10 answers for each of the 3 users, roughly 75-85%, comparing Doc-Doc and Weighted-Attribute similarity.]

Lessons:
- Queries with popular keywords are difficult.
- Efficiently and effectively capturing user interest is difficult.
- A solution requiring less user input is more acceptable.
What's Next?

Open issues:
- The most similar query may not be present in the workload.
- Answers to a similar query will have varying similarity depending on the affected attributes.

Solution:
- Given an imprecise query, generate the most similar query.
- Use attribute importance and value-value similarity to order tuples.

Challenges:
- Estimating attribute importance
- Estimating value-value similarity
Learning the Semantics of the Data

- Estimate value-value similarity:
  - Similarity between values v11, v12 of a categorical attribute A:
    Sim(v11, v12) = Σ_i w_i × Sim(Co-related_value(A_i, v11), Co-related_value(A_i, v12)),
    where A_i ∈ Attributes(R), A_i ≠ A
  - Euclidean distance for numerical attributes.
- Use the model of the database (AFDs, keys, value correlations) to:
  - Identify an implicit structure for the tuple.
  - Show other tuples that least break the structure.

Example: CarDb(Make, Model, Year, Price, Mileage, Location, Color)

Approximate keys:
- {Model, Mileage, Location} uniquely determines 90% of the cars in the DB.
- {Model, Mileage, Color} uniquely determines 84% of the cars in the DB.

Approximate Functional Dependencies (AFDs):
- Model -> Make
- Year -> Price
- Mileage -> Year
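A sketch of how such statistics could be mined from the data (the CarDb rows below are made up): here an AFD X -> Y is scored by the fraction of tuples whose Y value agrees with the majority Y value among tuples sharing their X value, and an approximate key by the fraction of tuples it identifies uniquely. This is one common way to measure AFD confidence, not necessarily the exact metric used in the work above.

```python
from collections import Counter, defaultdict

# Toy CarDb(Make, Model, Year, Price) rows (made up)
cars = [
    {"Make": "Honda", "Model": "Civic",  "Year": 2000, "Price": 7000},
    {"Make": "Honda", "Model": "Civic",  "Year": 2001, "Price": 7500},
    {"Make": "Honda", "Model": "Accord", "Year": 2000, "Price": 9000},
    {"Make": "Chevy", "Model": "Civic",  "Year": 1999, "Price": 6000},  # violates Model -> Make
]

def afd_confidence(rows, lhs, rhs):
    """Fraction of rows whose rhs value is the majority rhs value for their lhs value."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[lhs]][r[rhs]] += 1
    agree = sum(c.most_common(1)[0][1] for c in groups.values())
    return agree / len(rows)

def approx_key_quality(rows, attrs):
    """Fraction of rows uniquely identified by the attribute combination."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return sum(1 for r in rows if counts[tuple(r[a] for a in attrs)] == 1) / len(rows)

print(afd_confidence(cars, "Model", "Make"))        # Model -> Make holds on 3 of 4 rows
print(approx_key_quality(cars, ["Model", "Year"]))  # how often (Model, Year) is unique
```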
Query relaxation
Finding similarities between attribute values