Towards Unifying Database Systems and Information Retrieval


CAREER: Towards Unifying Database Systems and Information Retrieval Systems
NSF IDM Workshop
10 Oct 2004
Jayavel Shanmugasundaram
Cornell University
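10000 foot view of Data Management
[Figure: two-axis diagram. Queries axis: Ranked Keyword Search vs. Complex and Structured. Data axis: Structured vs. Unstructured. Information Retrieval Systems cover ranked keyword search over unstructured data; Database Systems cover complex and structured queries over structured data.]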
10000 foot view of Data Management
[Figure: the same Queries vs. Data diagram, annotated with "Text search in databases" and "Ranking based on structured values".]
Internet Archive Database
Movies
Mid  Name             Description
10   Amateur Film     … they stand on the golden gate bridge and …
20   American Thrift  … golden gate bridge with statue of liberty …
…    …                …
SELECT *
FROM Movies M
ORDER BY score(M.description, "golden gate")
FETCH TOP 10 RESULTS ONLY
• Traditional IR scoring methods (e.g., TF*IDF) are often not very meaningful in this context
– Developed for stand-alone document collections
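To make the TF*IDF reference concrete, here is a minimal sketch of one common variant (raw term frequency times the log of inverse document frequency). The function name `tf_idf`, the trimmed sample descriptions, and the third document are illustrative assumptions, not part of the talk.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score `term` in one document: raw term frequency times log(N / df)."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Hypothetical mini-corpus; the first two entries are trimmed from the
# Movies descriptions, the third is invented for illustration.
corpus = [
    "they stand on the golden gate bridge".split(),
    "golden gate bridge with statue of liberty".split(),
    "a home movie of a family picnic".split(),
]
scores = [tf_idf("golden", d, corpus) for d in corpus]
```

In this sketch the two descriptions that mention the term receive identical scores, because the text alone offers nothing else to tell them apart.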
Internet Archive Database
Movies
Mid  Name             Description
10   Amateur Film     … they stand on the golden gate bridge and …
20   American Thrift  … golden gate bridge with statue of liberty …
…    …                …
Reviews
Rid  Mid  Name      Rating
901  10   bleblanc  2
902  10   harry     1
903  20   cooker    4
904  20   alice     5
…    …    …         …
Statistics
Sid  Mid  Visits  Downloads
81   10   285     90
82   20   927     247
…    …    …       …
Structured Value Ranking (SVR)
Structured Value Ranking
• Use structured data values associated with text columns to score results
• Main technical challenge
– Need to produce top-k results efficiently
• Order inverted lists by score
– But scores change frequently [Aizen et al., 2004]
• Flash crowds on Internet
• Recent award announcements
– How can we process top-k results efficiently while allowing frequent score updates?
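As an illustration of scoring text results from structured values, the sketch below combines an average review rating with visit and download counts. The talk does not give a scoring formula, so the weighted sum, the weights, the sample rows, and the names `svr_score` and `top_k` are all assumptions.

```python
from statistics import mean

# Hypothetical rows shaped like the Reviews and Statistics tables.
reviews = [(10, 2), (10, 4), (20, 5)]  # (mid, rating)
stats = {
    10: {"visits": 100, "downloads": 10},
    20: {"visits": 300, "downloads": 50},
}

def svr_score(mid, w_rating=100.0, w_visits=1.0, w_downloads=2.0):
    """Assumed scoring function: weighted sum of structured values."""
    ratings = [r for m, r in reviews if m == mid]
    avg = mean(ratings) if ratings else 0.0
    s = stats.get(mid, {"visits": 0, "downloads": 0})
    return w_rating * avg + w_visits * s["visits"] + w_downloads * s["downloads"]

def top_k(candidate_mids, k):
    """Naive top-k: score every candidate, then sort."""
    return sorted(candidate_mids, key=svr_score, reverse=True)[:k]

best = top_k([10, 20], 1)
```

Sorting every candidate is the brute-force baseline; the challenge on this slide is to return the top k without scoring and sorting the whole list, even as visits and ratings keep changing.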
Solution Overview
• Order inverted lists by score
– Queries efficient
– Score updates slow
• Order inverted lists by document id
– Queries slow
– Score updates efficient
• Hybrid solution: order inverted lists by chunk
– Order chunks by score
– Order documents within chunk by id
• Guo et al. [ICDE 2005]
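The hybrid layout can be sketched as follows. This is a simplified reading of the idea on the slide, not the algorithm from Guo et al.: the fixed score thresholds, the side dictionary of scores, and the class name `ChunkedPostingList` are assumptions made for illustration.

```python
import bisect
import heapq

class ChunkedPostingList:
    """Postings for one term, grouped into score-range chunks.

    Chunk i holds documents whose score lies in [bounds[i+1], bounds[i]).
    Chunks are in descending score order; doc ids ascend inside each chunk.
    """

    def __init__(self, bounds):
        self.bounds = bounds  # descending thresholds, e.g. [inf, 100, 10, 0]
        self.chunks = [[] for _ in range(len(bounds) - 1)]
        self.scores = {}      # doc_id -> current score (assumed side table)

    def _chunk_of(self, score):
        for i in range(len(self.chunks)):
            if score >= self.bounds[i + 1]:
                return i
        raise ValueError("score below lowest bound")

    def upsert(self, doc_id, score):
        new = self._chunk_of(score)
        old_score = self.scores.get(doc_id)
        if old_score is None:
            bisect.insort(self.chunks[new], doc_id)
        else:
            old = self._chunk_of(old_score)
            if old != new:  # only chunk-crossing updates touch the lists
                lst = self.chunks[old]
                lst.pop(bisect.bisect_left(lst, doc_id))
                bisect.insort(self.chunks[new], doc_id)
        self.scores[doc_id] = score

    def top_k(self, k):
        best = []  # min-heap of (score, doc_id), capped at k entries
        for chunk in self.chunks:
            for doc_id in chunk:
                heapq.heappush(best, (self.scores[doc_id], doc_id))
                if len(best) > k:
                    heapq.heappop(best)
            # Every later chunk has strictly lower scores, so stop early.
            if len(best) == k:
                break
        return [d for _, d in sorted(best, reverse=True)]

pl = ChunkedPostingList([float("inf"), 100, 10, 0])
for doc_id, score in [(1, 5), (2, 150), (3, 40), (4, 120), (5, 12)]:
    pl.upsert(doc_id, score)
pl.upsert(3, 60)    # stays in the same chunk: posting lists untouched
pl.upsert(1, 500)   # e.g. a flash crowd: moves to the top chunk
top2 = pl.top_k(2)
```

Small score changes leave the lists alone, while queries can still stop after the first chunk that yields k results. Chunk width trades off between the two: wider chunks mean cheaper updates but more documents scanned per query.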
10000 foot view of Data Management
[Figure repeated: the Queries vs. Data diagram from the first overview slide.]
Applications
• Content management
– Mix of structured and unstructured data
• Database with date and time of accident (structured data) and accident description (unstructured data)
– Semi-structured data
• Scientific documents, Shakespeare’s plays, …
• Support flexible keyword search interface over mix of structured and unstructured data
– XRANK [Guo et al., SIGMOD 2003]
XML Keyword Search
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
• Most specific results (exploits structure!)
• Ranking at granularity of elements (generalizes PageRank)
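The "most specific results" idea can be sketched as returning the deepest elements whose subtree contains every keyword. This is a simplified approximation rather than XRANK's exact result definition; the trimmed sample document, the substring matching, and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the workshop example, with elided parts left out.
DOC = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <abstract>We consider the recently proposed language</abstract>
    </paper>
  </proceedings>
</workshop>
"""

def contains_all(elem, keywords):
    """True if the element's subtree text contains every keyword.

    Uses plain substring matching on lowercased text (a simplification).
    """
    text = " ".join(elem.itertext()).lower()
    return all(k in text for k in keywords)

def most_specific(elem, keywords):
    """Return the deepest elements whose subtree contains every keyword."""
    if not contains_all(elem, keywords):
        return []
    hits = [h for child in elem for h in most_specific(child, keywords)]
    return hits or [elem]  # no child qualifies, so this element is the answer

root = ET.fromstring(DOC)
tags = [e.tag for e in most_specific(root, ["xql", "language"])]
```

Here the keywords fall in the paper's title and abstract, so the `paper` element is returned instead of the whole `workshop` document.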
10000 foot view of Data Management
[Figure repeated: the Queries vs. Data diagram from the first overview slide.]
Applications
• The Internet is enabling end-users to directly ask queries and explore results
– E.g., used car marketplace
– Find all “bright red ford mustangs” that cost less than 20% of the average price of cars in their class
• Characteristics of queries
– Keyword search (for ease of use)
– Complex query operations (information synthesis)
– Want to see ranked results!
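The used car query combines all three characteristics: a keyword match, an aggregate over structured data, and ranking. A minimal sketch follows, assuming hypothetical listings, whole-word keyword matching, and ranking by price relative to the class average (the talk does not specify a ranking criterion).

```python
from collections import defaultdict

# Hypothetical listings: (id, class, price, free-text description)
cars = [
    (1, "sports", 30000, "bright red ford mustang, low mileage"),
    (2, "sports", 28000, "black ford mustang convertible"),
    (3, "sports", 2000, "bright red ford mustang, needs engine work"),
    (4, "sedan", 9000, "red ford sedan"),
]

def class_averages(rows):
    """Average price per car class (the structured, aggregate part)."""
    totals = defaultdict(lambda: [0, 0])
    for _, cls, price, _ in rows:
        totals[cls][0] += price
        totals[cls][1] += 1
    return {cls: s / n for cls, (s, n) in totals.items()}

def search(rows, keywords, fraction=0.2):
    """Keyword match plus price filter, ranked by price / class average."""
    avg = class_averages(rows)
    hits = []
    for cid, cls, price, desc in rows:
        words = desc.lower().replace(",", " ").split()
        if all(k in words for k in keywords) and price < fraction * avg[cls]:
            hits.append((price / avg[cls], cid))
    return [cid for _, cid in sorted(hits)]  # cheapest relative to class first

result = search(cars, ["bright", "red", "ford", "mustang"])
```

Neither a plain keyword engine nor a plain SQL query covers this alone: the first lacks the per-class aggregate, the second lacks keyword matching and ranking.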
Towards Unifying DB and IR
• No standard query language for both DB and IR
– SQL, XQuery mostly “database query languages”
• Have developed TeXQuery: a full-text search extension to XQuery
– Amer-Yahia et al. (WWW 2004)
– Full composability of database and IR primitives, ranking
– Adopted as the precursor to the XQuery full-text extensions currently being developed by the W3C
• Come see demo tomorrow
Related Work
• Integrating DB and IR systems
– For the most part, treat individual systems as “black boxes”
– Our goal is to unify DB and IR systems
• Search over Semi-Structured Data
– Specialized techniques for searching semi-structured data
– Our goal is to generalize DB and IR techniques
• Keyword search and ranking in databases
Summary
• Many emerging applications require a unification of DB and IR techniques
– E-commerce applications
– Semi-structured documents
– Content management
• This argues for a new generation of systems and techniques that seamlessly provide this capability
– SVR, XRANK, TeXQuery, …
• Educational benefit: present unified view of data management
– Currently at graduate level
– Eventually introduce concepts at undergraduate level