Towards Unifying Database Systems and Information

Information Retrieval and Databases:
Synergies and Syntheses
IDM Workshop Panel
15 Sep 2003
Jayavel Shanmugasundaram
Cornell University
10000 foot view of Data Management
[Figure: Information Retrieval Systems and Database Systems positioned along two axes. Queries: ranked keyword search vs. complex and structured. Data: structured vs. unstructured.]
Applications
• Information discovery over structured databases
• Keyword search over relational databases
– DBXplorer [Agrawal et al.]
– DISCOVER [Hristidis et al.]
– BANKS [Hulgeri et al.]
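The idea behind keyword search over a relational database can be sketched in a few lines: the keywords of one query may be spread across tuples in different tables, so matching tuples have to be connected through foreign keys before a result can be returned. The sketch below is a simplified illustration only, not the algorithm of DBXplorer, DISCOVER, or BANKS. The tables, column names, and the `keyword_search` helper are hypothetical.

```python
# Hypothetical toy data: papers.author_id is a foreign key into authors.
authors = {
    1: {"name": "Alice Smith"},
    2: {"name": "Bob Jones"},
}
papers = [
    {"id": 10, "title": "Ranking XML Elements", "author_id": 1},
    {"id": 11, "title": "XML Storage", "author_id": 2},
    {"id": 12, "title": "Query Optimization", "author_id": 1},
]


def tokens(text):
    """Lower-cased whitespace tokens of a text field."""
    return set(text.lower().split())


def keyword_search(keywords, authors, papers):
    """Return (paper id, author name) for joined rows that contain every keyword.

    A keyword may match in either table; the join along the foreign key
    is what lets a single result cover keywords from both tuples.
    """
    wanted = {k.lower() for k in keywords}
    results = []
    for paper in papers:
        author = authors[paper["author_id"]]  # follow the foreign key
        joined = tokens(paper["title"]) | tokens(author["name"])
        if wanted <= joined:
            results.append((paper["id"], author["name"]))
    return results


print(keyword_search(["alice", "xml"], authors, papers))
```

Here "alice" matches only in the authors table and "xml" only in the papers table, so neither table alone can answer the query. A real system would also need an index over the text columns and a way to rank the joined results.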
Applications
• Content management
– Mix of structured and unstructured data
• Database with date and time of accident (structured data) and accident description (unstructured data)
– Semi-structured data
• Scientific documents, Shakespeare’s plays, …
• Support flexible ranked keyword search interface over such data
– XRANK [Guo et al., SIGMOD 2003]
– XIRQL [Fuhr et al., SIGIR 2001]
XML Keyword Search
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
• Most specific results (exploits structure!)
• Ranking at granularity of elements
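The "most specific results" idea can be illustrated with a short sketch: return an element only if its subtree contains all query keywords and none of its children does. This is an illustration under simplifying assumptions, not the XRANK or XIRQL algorithm, and it leaves out ranking entirely. The `most_specific` helper is hypothetical, and the document is a trimmed copy of the example above with the elided parts left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the workshop document from the slide (elided parts omitted).
DOC = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <section name="Introduction">Searching on structured text is becoming more important with XML</section>
    </paper>
  </proceedings>
</workshop>
"""


def text_tokens(elem):
    """Lower-cased words found anywhere in the element's subtree."""
    words = " ".join(elem.itertext()).lower().split()
    return {w.strip(".,:;") for w in words}


def most_specific(root, keywords):
    """Elements containing all keywords whose children do not."""
    wanted = {k.lower() for k in keywords}
    hits = []

    def visit(elem):
        if not wanted <= text_tokens(elem):
            return False
        # Visit every child (no short-circuit) so all matches are collected.
        child_hits = [visit(child) for child in elem]
        if not any(child_hits):
            hits.append(elem)
        return True

    visit(root)
    return hits


root = ET.fromstring(DOC)
print([e.tag for e in most_specific(root, ["xql", "navarro"])])
print([e.tag for e in most_specific(root, ["xml", "searching"])])
```

The first query is answered by the `paper` element, because the two keywords sit in different children of it. The second is answered by the `section` element alone, even though the whole `workshop` also contains both words. Recomputing the token set at every level is wasteful and is done here only to keep the sketch short.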
Applications
• The Internet is enabling end-users to directly ask queries and explore results
– E.g., used car marketplace
– Find all “bright red ford mustangs” that cost less than 20% of the average price of cars in their class
• Characteristics of queries
– Keyword search (for ease of use)
– Complex query operations (information synthesis)
– Want to see ranked results!
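The used car query combines all three characteristics, which a small sketch can make concrete: an aggregate over structured data (average price per class), a filter against that aggregate, and a keyword score used for ranking. The listings, field names, and the `search` helper below are hypothetical, and the price condition follows the slide's wording literally (price below 20% of the class average), exposed as a `fraction` parameter.

```python
from statistics import mean

# Hypothetical listings; field names and values are illustrative only.
cars = [
    {"id": 1, "cls": "sports", "price": 3000, "desc": "bright red Ford Mustang, one owner"},
    {"id": 2, "cls": "sports", "price": 30000, "desc": "red Ford Mustang convertible"},
    {"id": 3, "cls": "sports", "price": 27000, "desc": "blue Chevrolet Camaro"},
    {"id": 4, "cls": "sedan", "price": 1500, "desc": "bright red sedan"},
    {"id": 5, "cls": "sedan", "price": 18000, "desc": "silver sedan"},
]


def search(cars, keywords, fraction=0.2):
    """Ids of cars priced below fraction * class average, ranked by keyword hits."""
    wanted = [k.lower() for k in keywords]

    # Structured part: average price per class.
    prices_by_class = {}
    for car in cars:
        prices_by_class.setdefault(car["cls"], []).append(car["price"])
    avg = {cls: mean(prices) for cls, prices in prices_by_class.items()}

    ranked = []
    for car in cars:
        if car["price"] >= fraction * avg[car["cls"]]:
            continue  # fails the price condition
        # Unstructured part: count query keywords present in the description.
        words = {w.strip(".,") for w in car["desc"].lower().split()}
        score = sum(1 for k in wanted if k in words)
        if score > 0:
            ranked.append((score, car["id"]))

    # Highest score first; ties broken by id for a stable order.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [car_id for _, car_id in ranked]


print(search(cars, ["bright", "red", "ford", "mustang"]))
```

Car 1 matches all four keywords and car 4 only two, so both are returned but in that order. A pure database query would drop car 4 for not matching every term, while a pure keyword search could not express the price condition at all.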
Towards Unifying DB and IR
• No standard query language for both DB and IR
– SQL and XQuery mostly “database” query languages
• Currently developing TeXQuery: a full-text search extension to XQuery
– With S. Amer-Yahia, C. Botev, J. Robie
– Full composability of database and IR primitives, ranking
– Submitted to W3C committee on full-text extensions to XQuery
Summary
• Applications have a mix of structured (DB domain) and unstructured (IR domain) data
– Stark difference in how they can be processed
• Benefits of unifying DB & IR
– Ranked keyword search (information discovery) over both structured and unstructured data
– Complex queries over structured/semi-structured data
• A truly unified data store
– Need to generalize DB and IR techniques