presentation source
Download
Report
Transcript presentation source
Search and Retrieval:
Query Languages
Prof. Marti Hearst
SIMS 202, Lecture 19
Last Time
Finding Out About
Intro to Standard Information Retrieval
Intro to Boolean Queries
Marti A. Hearst
SIMS 202, Fall 1997
Finding Out About
Three phases:
Asking of a question
Construction of an answer
Assessment of the answer
Part of an iterative process
Marti A. Hearst
SIMS 202, Fall 1997
Finding Out About is an Iterative
Process
Repositories
Goals
Workspace
Marti A. Hearst
SIMS 202, Fall 1997
Information Retrieval:
A Restricted Form of FOA
The system has available only pre-existing,
“canned” text passages.
Its response is limited to selecting from these
passages and presenting them to the user.
It must select, say, 10 or 20 passages out of
millions or billions!
Marti A. Hearst
SIMS 202, Fall 1997
Query Languages
Express the user’s information need
Components:
query language
program to interpret the language
collection to compare the interpreted query
against
Marti A. Hearst
SIMS 202, Fall 1997
Types of Query Languages
Boolean
Natural language (free style)
Hybrid structured and free text
Form-based
SQL (for database queries)
Marti A. Hearst
SIMS 202, Fall 1997
Today
More on Boolean Queries
Database Queries
IR vs. Database Queries
Marti A. Hearst
SIMS 202, Fall 1997
Basic Boolean Queries
Components:
terms (operands)
connectors (operators)
AND
OR
NOT
Marti A. Hearst
SIMS 202, Fall 1997
Boolean Queries
Cat
Cat OR Dog
Cat AND Dog
(Cat AND Dog)
(Cat AND Dog) OR Collar
(Cat AND Dog) OR (Collar AND Leash)
(Cat OR Dog) AND (Collar OR Leash)
Marti A. Hearst
SIMS 202, Fall 1997
Boolean Queries
Usually expressed as INFIX operators in IR
NOT is UNARY PREFIX operator
((a AND b) OR (c AND (NOT b)))
AND and OR can be n-ary operators
((a AND b) OR (c AND b))
(a AND b AND c AND d)
Some rules
NOT(a) AND NOT(b) = NOT(a OR b)
NOT(a) OR NOT(b)= NOT(a AND b)
NOT(NOT(a)) = a
Marti A. Hearst
SIMS 202, Fall 1997
Information
need
Collections
Pre-process
text input
Parse
Query
Index
Rank
Result Sets
Run a query, get a result set
Two choices
Reformulate query, run on entire collection
Reformulate query, run on result set
Example: Dialog query
(Redford AND Newman)
-> S1 1450 documents
(S1 AND Sundance)
->S2 898 documents
Marti A. Hearst
SIMS 202, Fall 1997
Information
need
Collections
Pre-process
text input
Parse
Query
Rank
Index
Reformulated
Query
Re-Rank
Ordering of Retrieved Documents
Pure Boolean has no ordering
In practice:
order chronologically
order by total number of “hits” on query terms
What if one term has more hits than others?
Is it better to one of each term or many of one term?
Fancier methods have been investigated
p-norm is most famous
usually impractical to implement
usually hard for user to understand
Marti A. Hearst
SIMS 202, Fall 1997
Boolean
Advantages
simple queries are easy to understand
relatively easy to implement
Disadvantages
difficult to specify what is wanted
too much returned, or too little
ordering not well determined
Dominant language in commercial
systems until the WWW
Marti A. Hearst
SIMS 202, Fall 1997
Faceted Boolean Query
Strategy: break query into facets
(polysemous with earlier meaning of facets)
conjunction of disjunctions
a1 OR a2 OR a3
b1 OR b2
c1 OR c2 OR c3 OR c4
AND
each facet expresses a topic
“rain forest” OR jungle OR amazon
medicine OR remedy OR cure
Smith OR Zhou
Marti A. Hearst
SIMS 202, Fall 1997
AND
Faceted Boolean Query
Query still fails if one facet missing
Alternative: Coordination level ranking
Order results in terms of how many facets
(disjuncts) are satisfied
Also called Quorum ranking, Overlap ranking, and
Best Match
Problem: Facets still undifferentiated
Alternative: assign weights to facets
Marti A. Hearst
SIMS 202, Fall 1997
Proximity Searches
Proximity: terms occur within K positions of
one another
A “Near” function can be more vague
near(pen, paper)
Sometimes order can be specified
Also, Phrases and Collocations
pen w/5 paper
“United Nations”
“Bill Clinton”
Phrase Variants
“retrieval of information” “information retrieval”
Marti A. Hearst
SIMS 202, Fall 1997
Filters
Filters: Reduce set of candidate docs
Often specified simultaneous with query
Usually restrictions on metadata
restrict by:
date range
internet domain (.edu .com .berkeley.edu)
author
size
limit number of documents returned
Marti A. Hearst
SIMS 202, Fall 1997
SQL: Database Query Language
Somewhat like Boolean
Geared towards relational model
A typical combination:
SELECT X
FROM Y
WHERE Z
Marti A. Hearst
SIMS 202, Fall 1997
SQL basics
SELECT: lists columns of tables
FROM:
which tables to use
WHERE: which rows to include
Example: get customer 1234’s bill:
SELECT fee
FROM accounts
WHERE customer_number = 1234
Marti A. Hearst
SIMS 202, Fall 1997
Simple SQL Join
Given two relations:
accounts (customer name and fee)
customers (customer name and address)
Retrieve all customers names, addresses, and
fees:
SELECT customer_num, address, fee
FROM accounts, customers
WHERE customers.customer_num =
accounts.customer_num
See reader for more details
Marti A. Hearst
SIMS 202, Fall 1997
IR vs. Database Systems
(modified from van Rijsbergen)
IR
Database
System
Provides:
User’s query:
Pointer to data
Data item
General
Specific
Retrieval
method:
Probabilistic,
Approximate
Deterministic,
correct
Criteria for
success:
Utility
Efficiency
Marti A. Hearst
SIMS 202, Fall 1997
Comparing Information Retrieval
to Database Retrieval
Query: How many SCSI drives did Compaq buy
last year?
Database:
A natural query on a purchasing relation.
Text Collection:
More difficult to search for
Less likely to find a reliable answer
Interpretation needed on results.
Marti A. Hearst
SIMS 202, Fall 1997
Comparing Information Retrieval
to Database Retrieval
Query: What is the best SCSI disk drive to buy?
Database:
Complex query combining information from
many relations. Probably can’t be written.
Text Collection:
Find usenet articles with people’s opinions
Interpret and use text fragments that may
have been written for other purposes.
Marti A. Hearst
SIMS 202, Fall 1997
Next Time:
Assessing the Answer
How well do the results answer the
question?
How relevant are they to the user?
Marti A. Hearst
SIMS 202, Fall 1997
Assignment -- Text Search!
Get a Lexis-Nexis account from Roberta
Use it in the lab or at home
Many collections are accessible
News
Public Interest
Legal
Get to know the interface
Do some searches
Answer some questions
Marti A. Hearst
SIMS 202, Fall 1997