Towards Unifying Database Systems and Information

Databases and Information Retrieval:
Rethinking the Great Divide
SIGMOD Panel
14 Jun 2005
Jayavel Shanmugasundaram
Cornell University
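10,000-Foot View of Data Management
[Figure: Information Retrieval Systems and Database Systems placed along two axes. Queries: Ranked Keyword Search vs. Complex and Structured. Data: Structured vs. Unstructured. The gaps between the two kinds of system are labeled "The Great Data Divide" and "The Great Query Divide".]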
Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
– Example: Approaches based on SQL/MM
• Option 2: Extend existing DB systems with IR functionality, or vice versa
– Example: Add searching and ranking to RDBMSs
• Option 3: Design a new data management system from the ground up
– Example: Quark data management system
Why Option 1 Won’t Work
[Figure: Information Retrieval Systems and Database Systems placed along two axes. Queries: Ranked Keyword Search vs. Complex and Structured. Data: Structured vs. Unstructured.]
Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
– Example: Approaches based on SQL/MM
– Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
– Example: Add searching and ranking to RDBMSs
• Option 3: Design a new data management system from the ground up
– Example: Quark data management system
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
Find relevant elements in important workshops between
the years 1999 and 2001 that are about ‘Ricardo’ and ‘XML’
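The query above mixes a structured predicate (the workshop year) with a keyword predicate (‘Ricardo’ and ‘XML’) and leaves open which element to return. A minimal Python sketch of one possible reading follows. It is an illustration only, not how Quark or any system named in this talk evaluates the query. The `DOC` string is a hypothetical well-formed stand-in for the elided excerpt, the helper names are invented here, "important" is not modeled, and the rule "return the smallest element containing every keyword" is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the slide's elided excerpt.
DOC = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <section name="Introduction">Searching on structured text is becoming more important with XML</section>
    </paper>
  </proceedings>
</workshop>
"""


def text_of(elem):
    """Lower-cased text of an element and all of its descendants."""
    return " ".join("".join(elem.itertext()).split()).lower()


def contains_all(elem, keywords):
    """True if every keyword occurs somewhere in the element's subtree."""
    text = text_of(elem)
    return all(k in text for k in keywords)


def smallest_matches(root, keywords):
    """Elements that contain all keywords while none of their children do."""
    kws = [k.lower() for k in keywords]
    hits = []
    for elem in root.iter():
        if contains_all(elem, kws) and not any(contains_all(c, kws) for c in elem):
            hits.append(elem)
    return hits


def query(xml_text, keywords, first_year, last_year):
    """Structured filter on the workshop year, then keyword search inside it."""
    workshop = ET.fromstring(xml_text)
    year = int(workshop.get("date").split()[-1])  # assumes the year comes last
    if not first_year <= year <= last_year:
        return []
    return [e.tag for e in smallest_matches(workshop, keywords)]


result = query(DOC, ["Ricardo", "XML"], 1999, 2001)
print(result)
```

Under these assumptions the answer is the paper rather than the whole workshop: ‘Ricardo’ appears only in an author element and ‘XML’ only in the section text, so the paper is the smallest element that covers both. A different return rule would give a different answer.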
Why Extending (R)DBMSs Won’t Work
• Violates many assumptions “hardwired” into current database systems
• Assumption: structured queries over structured fields, keyword search queries over text fields
– Is author name a structured or text field?
• Assumption: operators have precise, well-defined semantics
– Even the query result is not well-defined: do we return a paper or a workshop?
• Assumption: scoring is tacked on as a relational attribute
– How can this scoring generalize IR scoring?
Why Extending IR Systems Won’t Work
• IR systems provide little support for structured data
• No support for complex operators
– How can complex queries be evaluated?
• Scoring does not take structure into account
– How can scoring capture both structured and unstructured data?
Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
– Example: Approaches based on SQL/MM
– Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
– Example: Add searching and ranking to RDBMSs
– Drawback: Shoehorns alien functionality into already complex systems
• Option 3: Design a new data management system from the ground up
– Example: Quark data management system
Why Option 3 Will Work
• Designed from the ground up around three principles
• Structural data independence
– Users can issue any query (complex and keyword) over any data (structured and unstructured)
• Generalized scoring
– Scoring works over any mix of structured and unstructured data (e.g., XRank over HTML and XML)
• Flexible query language
– Allows for arbitrary return results and scores (e.g., TeXQuery, precursor to XQuery Full-Text, NEXI)
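To make "generalized scoring" concrete, here is a toy Python sketch of a single scoring function applied unchanged to a nested (structured) document and a flat (unstructured) one. It is not XRank's ranking function. The damping rule, the `DECAY` constant, and the two sample documents are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DECAY = 0.5  # hypothetical damping applied once per level of nesting


def score(elem, keyword, decay=DECAY):
    """Toy score: keyword hits in the element's own text plus damped child scores."""
    # Text that belongs directly to this element, excluding child element text.
    own_text = (elem.text or "") + "".join((c.tail or "") for c in elem)
    own_hits = own_text.lower().split().count(keyword.lower())
    return own_hits + decay * sum(score(c, keyword, decay) for c in elem)


# Hypothetical inputs: one nested document and one flat document.
structured = ET.fromstring(
    "<paper><title>XML search</title><body><sec>ranking XML</sec></body></paper>"
)
flat = ET.fromstring("<html>XML ranking over XML text</html>")

print(score(structured, "XML"))  # hit in title damped once, hit in sec damped twice
print(score(flat, "XML"))        # two direct hits, no damping
```

The design choice shown is that the same function handles both inputs: on flat text it reduces to a plain term count, and on nested data the structure changes the score through the per-level damping.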
Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
– Example: Approaches based on SQL/MM
– Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
– Example: Add searching and ranking to RDBMSs
– Drawback: Shoehorns alien functionality into already complex systems
• Option 3: Design a new data management system from the ground up
– Example: Quark data management system
– Most promising alternative!