presentation source

Download Report

Transcript presentation source

Search and Retrieval:
Query Languages
Prof. Marti Hearst
SIMS 202, Lecture 19
Last Time
Finding Out About
 Intro to Standard Information Retrieval
 Intro to Boolean Queries

Marti A. Hearst
SIMS 202, Fall 1997
Finding Out About

Three phases:
Asking of a question
 Construction of an answer
 Assessment of the answer


Part of an iterative process
Marti A. Hearst
SIMS 202, Fall 1997
Finding Out About is an Iterative
Process
Repositories
Goals
Workspace
Marti A. Hearst
SIMS 202, Fall 1997
Information Retrieval:
A Restricted Form of FOA

The system has available only pre-existing,
“canned” text passages.

Its response is limited to selecting from these
passages and presenting them to the user.

It must select, say, 10 or 20 passages out of
millions or billions!
Marti A. Hearst
SIMS 202, Fall 1997
Query Languages
Express the user’s information need
 Components:

query language
 program to interpret the language
 collection to compare the interpreted query
against

Marti A. Hearst
SIMS 202, Fall 1997
Types of Query Languages
Boolean
 Natural language (free style)
 Hybrid structured and free text
 Form-based
 SQL (for database queries)

Marti A. Hearst
SIMS 202, Fall 1997
Today
More on Boolean Queries
 Database Queries
 IR vs. Database Queries

Marti A. Hearst
SIMS 202, Fall 1997
Basic Boolean Queries

Components:
terms (operands)
 connectors (operators)

AND
 OR
 NOT

Marti A. Hearst
SIMS 202, Fall 1997
Boolean Queries

Cat

Cat OR Dog

Cat AND Dog

(Cat AND Dog)

(Cat AND Dog) OR Collar

(Cat AND Dog) OR (Collar AND Leash)

(Cat OR Dog) AND (Collar OR Leash)
Marti A. Hearst
SIMS 202, Fall 1997
Boolean Queries

Usually expressed as INFIX operators in IR


NOT is UNARY PREFIX operator


((a AND b) OR (c AND (NOT b)))
AND and OR can be n-ary operators


((a AND b) OR (c AND b))
(a AND b AND c AND d)
Some rules
NOT(a) AND NOT(b) = NOT(a OR b)
 NOT(a) OR NOT(b)= NOT(a AND b)
 NOT(NOT(a)) = a

Marti A. Hearst
SIMS 202, Fall 1997
Information
need
Collections
Pre-process
text input
Parse
Query
Index
Rank
Result Sets
Run a query, get a result set
 Two choices

Reformulate query, run on entire collection
 Reformulate query, run on result set


Example: Dialog query
(Redford AND Newman)
 -> S1 1450 documents
 (S1 AND Sundance)
 ->S2 898 documents

Marti A. Hearst
SIMS 202, Fall 1997
Information
need
Collections
Pre-process
text input
Parse
Query
Rank
Index
Reformulated
Query
Re-Rank
Ordering of Retrieved Documents


Pure Boolean has no ordering
In practice:


order chronologically
order by total number of “hits” on query terms



What if one term has more hits than others?
Is it better to one of each term or many of one term?
Fancier methods have been investigated

p-norm is most famous


usually impractical to implement
usually hard for user to understand
Marti A. Hearst
SIMS 202, Fall 1997
Boolean

Advantages
simple queries are easy to understand
 relatively easy to implement


Disadvantages
difficult to specify what is wanted
 too much returned, or too little
 ordering not well determined


Dominant language in commercial
systems until the WWW
Marti A. Hearst
SIMS 202, Fall 1997
Faceted Boolean Query

Strategy: break query into facets
(polysemous with earlier meaning of facets)

conjunction of disjunctions
a1 OR a2 OR a3
b1 OR b2
c1 OR c2 OR c3 OR c4

AND
each facet expresses a topic
“rain forest” OR jungle OR amazon
medicine OR remedy OR cure
Smith OR Zhou
Marti A. Hearst
SIMS 202, Fall 1997
AND
Faceted Boolean Query
Query still fails if one facet missing
 Alternative: Coordination level ranking



Order results in terms of how many facets
(disjuncts) are satisfied
Also called Quorum ranking, Overlap ranking, and
Best Match
Problem: Facets still undifferentiated
 Alternative: assign weights to facets

Marti A. Hearst
SIMS 202, Fall 1997
Proximity Searches

Proximity: terms occur within K positions of
one another


A “Near” function can be more vague



near(pen, paper)
Sometimes order can be specified
Also, Phrases and Collocations


pen w/5 paper
“United Nations”
“Bill Clinton”
Phrase Variants

“retrieval of information” “information retrieval”
Marti A. Hearst
SIMS 202, Fall 1997
Filters
Filters: Reduce set of candidate docs
 Often specified simultaneous with query
 Usually restrictions on metadata


restrict by:
date range
 internet domain (.edu .com .berkeley.edu)
 author
 size
 limit number of documents returned

Marti A. Hearst
SIMS 202, Fall 1997
SQL: Database Query Language
Somewhat like Boolean
 Geared towards relational model
 A typical combination:

SELECT X
FROM Y
WHERE Z
Marti A. Hearst
SIMS 202, Fall 1997
SQL basics
SELECT: lists columns of tables
 FROM:
which tables to use
 WHERE: which rows to include


Example: get customer 1234’s bill:
SELECT fee
FROM accounts
WHERE customer_number = 1234
Marti A. Hearst
SIMS 202, Fall 1997
Simple SQL Join

Given two relations:



accounts (customer name and fee)
customers (customer name and address)
Retrieve all customers names, addresses, and
fees:
SELECT customer_num, address, fee
FROM accounts, customers
WHERE customers.customer_num =
accounts.customer_num

See reader for more details
Marti A. Hearst
SIMS 202, Fall 1997
IR vs. Database Systems
(modified from van Rijsbergen)
IR
Database
System
Provides:
User’s query:
Pointer to data
Data item
General
Specific
Retrieval
method:
Probabilistic,
Approximate
Deterministic,
correct
Criteria for
success:
Utility
Efficiency
Marti A. Hearst
SIMS 202, Fall 1997
Comparing Information Retrieval
to Database Retrieval
Query: How many SCSI drives did Compaq buy
last year?

Database:


A natural query on a purchasing relation.
Text Collection:
More difficult to search for
 Less likely to find a reliable answer
 Interpretation needed on results.

Marti A. Hearst
SIMS 202, Fall 1997
Comparing Information Retrieval
to Database Retrieval
Query: What is the best SCSI disk drive to buy?

Database:


Complex query combining information from
many relations. Probably can’t be written.
Text Collection:
Find usenet articles with people’s opinions
 Interpret and use text fragments that may
have been written for other purposes.

Marti A. Hearst
SIMS 202, Fall 1997
Next Time:
Assessing the Answer
How well do the results answer the
question?
 How relevant are they to the user?

Marti A. Hearst
SIMS 202, Fall 1997
Assignment -- Text Search!



Get a Lexis-Nexis account from Roberta
Use it in the lab or at home
Many collections are accessible



News
Public Interest
Legal

Get to know the interface
Do some searches

Answer some questions

Marti A. Hearst
SIMS 202, Fall 1997