Introduction to IR and the Search Process

Download Report

Transcript Introduction to IR and the Search Process

Lecture 15: Intro to Information Retrieval
SIMS 202:
Information Organization
and Retrieval
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2002
http://www.sims.berkeley.edu/academics/courses/is202/f02/
IS 202 – FALL 2002
2002.10.22 - SLIDE 1
Lecture Overview
• Review
– Database Design
– Normalization
– Web-enabled Databases
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and
Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
IS 202 – FALL 2002
2002.10.22 - SLIDE 2
Models (1)
Application 1
External
Model
Application 2
Application 3
Application 4
External
Model
External
Model
External
Model
Application 1
Conceptual
requirements
Application 2
Conceptual
requirements
Application 3
Conceptual
requirements
Conceptual
Model
Logical
Model
Internal
Model
Application 4
Conceptual
requirements
IS 202 – FALL 2002
2002.10.22 - SLIDE 3
Database System Life Cycle
Physical
Creation
2
Design
Conversion
1
3
Growth,
Change, &
Maintenance
6
Integration
4
Operations
5
IS 202 – FALL 2002
2002.10.22 - SLIDE 4
Normal Forms
•
•
•
•
•
•
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
IS 202 – FALL 2002
2002.10.22 - SLIDE 5
Normalization
No transitive
dependency
between
nonkey
attributes
All
determinants
are candidate
keys - Single
multivalued
dependency
IS 202 – FALL 2002
BoyceCodd and
Higher
Functional
dependency
of nonkey
attributes on
the primary
key - Atomic
values only
Full
Functional
dependency
of nonkey
attributes on
the primary
key
2002.10.22 - SLIDE 6
Dynamic Web Applications 2
Web
Server
Internet
Files
CGI
DBMS
Server
database
database
database
Clients
IS 202 – FALL 2002
2002.10.22 - SLIDE 7
Server Interfaces
SQL
HTML
DHTML
Web Server
JavaScript
Native
DB
Interfaces
Database
Web DB
CGI
App ODBC
Web Server
API’s
ColdFusion
Native DB
interfaces
JDBC
PhP Perl
Web Application
Server
Adapted from
John P. Ashenfelter,
Choosing a Database for Your Web Site
IS 202 – FALL 2002
Java
ASP
2002.10.22 - SLIDE 8
Photo Browser
• The current photo browser uses a
combination of
– Javascript for expandable hierarchies
– Database in MS Access
– ColdFusion to search the database when one
of the facets is selected
• The database design for the photo
database currently looks like…
IS 202 – FALL 2002
2002.10.22 - SLIDE 9
Photo Browser ER
IS 202 – FALL 2002
2002.10.22 - SLIDE 10
Photo Database
• Lets look at the photo database in the
Access interface
– Multi-Facet queries
– Queries for multiple descriptors in the same
facet (harder)
IS 202 – FALL 2002
2002.10.22 - SLIDE 11
Lecture Overview
• Review
– Database Design
– Normalization
– Web-enabled Databases
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and
Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
IS 202 – FALL 2002
2002.10.22 - SLIDE 12
Review: Information Overload
• “The world's total yearly production of print, film,
optical, and magnetic content would require
roughly 1.5 billion gigabytes of storage. This is
the equivalent of 250 megabytes per person for
each man, woman, and child on earth.” (Varian
& Lyman)
• “The greatest problem of today is how to teach
people to ignore the irrelevant, how to refuse to
know things, before they are suffocated. For too
many facts are as bad as none at all.” (W.H.
Auden)
IS 202 – FALL 2002
2002.10.22 - SLIDE 13
Course Outline
• Organization
–
–
–
–
Overview
Categorization
Metadata and markup
Metadata for multimedia
• Photo Project
– Controlled vocabularies,
classification, thesauri
– Information design
• Thesaurus design
• Database design
IS 202 – FALL 2002
• Retrieval
– The search process
– Content analysis
• Tokenization, Zipf’s law,
lexical associations
– IR implementation
– Term weighting and
document ranking
• Vector space model
– User interfaces
• Overviews, query
specification, providing
context
2002.10.22 - SLIDE 14
Key Issues In This Course
• How to describe information resources or
information-bearing objects in ways so that
they may be effectively used by those who
need to use them
– Organizing
• How to find the appropriate information
resources or information-bearing objects
for someone’s (or your own) needs
– Retrieving
IS 202 – FALL 2002
2002.10.22 - SLIDE 15
Key Issues
Creation
Active
Authoring
Modifying
Using
Creating
Retention/
Mining
Organizing
Indexing
Accessing
Filtering
Storing
Retrieval
Semi-Active
Discard
Utilization Disposition
Distribution
Networking
Searching
Inactive
IS 202 – FALL 2002
2002.10.22 - SLIDE 16
IR Textbook Topics
IS 202 – FALL 2002
2002.10.22 - SLIDE 17
More Detailed View
IS 202 – FALL 2002
2002.10.22 - SLIDE 18
A Lot
What We’ll Cover
A Little
IS 202 – FALL 2002
2002.10.22 - SLIDE 19
IR Topics for 202
•
•
•
•
The Search Process
Information Retrieval Models
Content Analysis/Zipf Distributions
Evaluation of IR Systems
– Precision/Recall
– Relevance
– User Studies
•
•
•
•
System and Implementation Issues
Web-Specific Issues
User Interface Issues
Special Kinds of Search
IS 202 – FALL 2002
2002.10.22 - SLIDE 20
Lecture Overview
• Review
– Database Design
– Normalization
– Web-enabled Databases
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and
Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
IS 202 – FALL 2002
2002.10.22 - SLIDE 21
The Standard Retrieval Interaction Model
IS 202 – FALL 2002
2002.10.22 - SLIDE 22
Standard Model of IR
• Assumptions:
– Maximizing precision and recall
simultaneously
– The information need remains static
– The value is in the resulting document set
IS 202 – FALL 2002
2002.10.22 - SLIDE 23
Problems with Standard Model
• Users learn during the search process:
– Scanning titles of retrieved documents
– Reading retrieved documents
– Viewing lists of related topics/thesaurus terms
– Navigating hyperlinks
• Some users don’t like long disorganized
lists of documents
IS 202 – FALL 2002
2002.10.22 - SLIDE 24
IR is an Iterative Process
Repositories
Goals
Workspace
IS 202 – FALL 2002
2002.10.22 - SLIDE 25
IR is a Dialog
• The exchange doesn’t end with first answer
• User can recognize elements of a useful answer
• Questions and understanding changes as the
process continues
IS 202 – FALL 2002
2002.10.22 - SLIDE 26
Bates’ “Berry-Picking” Model
• Standard IR model
– Assumes the information need remains the
same throughout the search process
• Berry-picking model
– Interesting information is scattered like berries
among bushes
– The query is continually shifting
IS 202 – FALL 2002
2002.10.22 - SLIDE 27
Berry-Picking Model
A sketch of a searcher… “moving through many actions towards a
general goal of satisfactory completion of research related to an
information need.” (after Bates 89)
Q2
Q4
Q3
Q1
Q5
Q0
IS 202 – FALL 2002
2002.10.22 - SLIDE 28
Berry-Picking Model (cont.)
• The query is continually shifting
• New information may yield new ideas and
new directions
• The information need
– Is not satisfied by a single, final retrieved set
– Is satisfied by a series of selections and bits
of information found along the way
IS 202 – FALL 2002
2002.10.22 - SLIDE 29
Information Seeking Behavior
• Two parts of a process:
– Search and retrieval
– Analysis and synthesis of search results
• This is a fuzzy area
– We will look at several different working
theories
IS 202 – FALL 2002
2002.10.22 - SLIDE 30
Search Tactics and Strategies
• Search Tactics
– Bates 1979
• Search Strategies
– Bates 1989
– O’Day and Jeffries 1993
IS 202 – FALL 2002
2002.10.22 - SLIDE 31
Tactics vs. Strategies
• Tactic: short term goals and maneuvers
– Operators, actions
• Strategy: overall planning
– Link a sequence of operators together to
achieve some end
IS 202 – FALL 2002
2002.10.22 - SLIDE 32
Information Search Tactics
• Monitoring tactics
– Keep search on track
• Source-level tactics
– Navigate to and within sources
• Term and Search Formulation tactics
– Designing search formulation
– Selection and revision of specific terms within
search formulation
IS 202 – FALL 2002
2002.10.22 - SLIDE 33
Monitoring Tactics (Strategy-Level)
• Check
– Compare original goal with current state
• Weigh
– Make a cost/benefit analysis of current or
anticipated actions
• Pattern
– Recognize common strategies
• Correct Errors
• Record
– Keep track of (incomplete) paths
IS 202 – FALL 2002
2002.10.22 - SLIDE 34
Source-Level Tactics
• “Bibble”:
– Look for a pre-defined result set
• E.g., a good link page on web
• Survey:
– Look ahead, review available options
• E.g., don’t simply use the first term or first source
that comes to mind
• Cut:
– Eliminate large proportion of search domain
• E.g., search on rarest term first
IS 202 – FALL 2002
2002.10.22 - SLIDE 35
Source-Level Tactics (cont.)
• Stretch
– Use source in unintended way
• E.g., use patents to find addresses
• Scaffold
– Take an indirect route to goal
• E.g., when looking for references to obscure poet,
look up contemporaries
• Cleave
– Binary search in an ordered file
IS 202 – FALL 2002
2002.10.22 - SLIDE 36
Search Formulation Tactics
• Specify
– Use as specific terms as possible
• Exhaust
– Use all possible elements in a query
• Reduce
– Subtract elements from a query
• Parallel
– Use synonyms and parallel terms
• Pinpoint
– Reducing parallel terms and refocusing query
• Block
– To reject or block some terms, even at the cost of
losing some relevant documents
IS 202 – FALL 2002
2002.10.22 - SLIDE 37
Term Tactics
• Move around the thesaurus
– Superordinate, subordinate, coordinate
– Neighbor (semantic or alphabetic)
– Trace – pull out terms from information
already seen as part of search (titles, etc.)
– Morphological and other spelling variants
– Antonyms (contrary)
IS 202 – FALL 2002
2002.10.22 - SLIDE 38
Additional Considerations (Bates 79)
• Add a Sort tactic!
• More detail is needed about short-term
cost/benefit decision rule strategies
• When to stop?
– How to judge when enough information has
been gathered?
– How to decide when to give up an
unsuccessful search?
– When to stop searching in one source and
move to another?
IS 202 – FALL 2002
2002.10.22 - SLIDE 39
Implications
• Interfaces should make it easy to store
intermediate results
• Interfaces should make it easy to follow
trails with unanticipated results
• Makes evaluation more difficult
IS 202 – FALL 2002
2002.10.22 - SLIDE 40
More Later…
• Later in the course:
– More on Search Process and Strategies
– User interfaces to improve IR process
– Incorporation of Content Analysis into better
systems
IS 202 – FALL 2002
2002.10.22 - SLIDE 41
Restricted Form of the IR Problem
• The system has available only preexisting, “canned” text passages
• Its response is limited to selecting from
these passages and presenting them to
the user
• It must select, say, 10 or 20 passages out
of millions or billions!
IS 202 – FALL 2002
2002.10.22 - SLIDE 42
Information Retrieval
• Revised Task Statement:
Build a system that retrieves documents that
users are likely to find relevant to their queries
• This set of assumptions underlies the field
of Information Retrieval
IS 202 – FALL 2002
2002.10.22 - SLIDE 43
Lecture Overview
• Review
– Database Design
– Normalization
– Web-enabled Databases
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and
Developments
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
IS 202 – FALL 2002
2002.10.22 - SLIDE 44
Visions of IR Systems
• Paul Otlet, 1930’s
• Emanuel Goldberg, 1920’s - 1940’s
• H.G. Wells, “World Brain: The idea of a
permanent World Encyclopedia.”
(Introduction to the Encyclopedie
Francaise), 1937.
• Vannevar Bush, “As we may think.”
Atlantic Monthly, 1945.
IS 202 – FALL 2002
2002.10.22 - SLIDE 45
Card-Based IR Systems
• Uniterm (Casey, Perry, Berry, Kent: 1958)
– Developed and used from mid 1940’s)
LUNAR
110 181
430 241
820 761
901
IS 202 – FALL 2002
EXCURSION
90 241
52
130 281
92
640
122
870
342
12
42
602
982
73
113
233
44
74
134
194
15
85
95
165
63
83
93
46
76
136
34
44
104
25
66
75
86
115 146
12457
7
28
17
78
37 118
127 198
377 288
407
39
79
109
179
17
57
97
157
207
43821
58
49
88 119
158 139
178 199
248 269
298
2002.10.22 - SLIDE 46
Card Systems
• Batten Optical Coincidence Cards (“Peeka-Boo Cards”), 1948
Excursion
Lunar
IS 202 – FALL 2002
2002.10.22 - SLIDE 47
Card Systems
• Zatocode (edge-notched cards) Mooers,
1951
Document 34
Document 1
Title: lksd ksdj sjd Lunar
Title: lksd ksdj sjd sjsjfkl
Document
Author: Smith, J.
Author:
Smith, 200
J.
Title:lksf
Xksd
Lunar
sjd sjsjfkl Abstract: lksf uejm jshy
Abstract:
uejm
jshy
Author:
Jones,
ksd jh uyw hhy jha jsyhe
ksd jh
uyw hhy
jha R.
jsyhe
Abstract: Lunar uejm jshy
ksd jh uyw hhy jha jsyhe
IS 202 – FALL 2002
2002.10.22 - SLIDE 48
Computer-Based Systems
• Bagley’s 1951 MS thesis from MIT
suggested that searching 50 million item
records, each containing 30 index terms
would take approximately 41,700 hours
– Due to the need to move and shift the text in
core memory while carrying out the
comparisons
• 1957 – Desk Set with Katharine Hepburn
and Spencer Tracy – EMERAC
IS 202 – FALL 2002
2002.10.22 - SLIDE 49
Historical Milestones in IR Research
•
•
•
•
•
•
•
1958 Statistic Language Properties (Luhn)
1960 Probabilistic Indexing (Maron & Kuhns)
1961 Term association and clustering (Doyle)
1965 Vector Space Model (Salton)
1968 Query expansion (Roccio, Salton)
1972 Statistical Weighting (Sparck-Jones)
1975 2-Poisson Model (Harter, Bookstein,
Swanson)
• 1976 Relevance Weighting (Robertson, SparckJones)
• 1980 Fuzzy sets (Bookstein)
• 1981 Probability without training (Croft)
IS 202 – FALL 2002
2002.10.22 - SLIDE 50
Historical Milestones in IR Research (cont.)
• 1983 Linear Regression (Fox)
• 1983 Probabilistic Dependence (Salton, Yu)
• 1985 Generalized Vector Space Model (Wong,
Rhagavan)
• 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et
al.)
• 1990 Latent Semantic Indexing (Dumais,
Deerwester)
• 1991 Polynomial & Logistic Regression
(Cooper, Gey, Fuhr)
• 1992 TREC (Harman)
• 1992 Inference networks (Turtle, Croft)
• 1994 Neural networks (Kwok)
IS 202 – FALL 2002
2002.10.22 - SLIDE 51
Development of Bibliographic Databases
• Chemical Abstracts Service first produced
“Chemical Titles” by computer in 1961
• Index Medicus from the National Library of
Medicine soon followed with the creation
of the MEDLARS database in 1961
• By 1970, most secondary publications
(indexes, abstract journals, etc.) were
produced by machine
IS 202 – FALL 2002
2002.10.22 - SLIDE 52
Boolean IR Systems
•
•
•
•
•
•
•
•
Synthex at SDC, 1960
Project MAC at MIT, 1963 (interactive)
BOLD at SDC, 1964 (Harold Borko)
1964 New York World’s Fair – Becker and Hayes
produced system to answer questions (based on
airline reservation equipment)
SDC began production for a commercial service
in 1967 – ORBIT
NASA-RECON (1966) becomes DIALOG
1972 Data Central/Mead introduced LEXIS –
Full text
Online catalogs – late 1970’s and 1980’s
IS 202 – FALL 2002
2002.10.22 - SLIDE 53
Experimental IR systems
• Probabilistic indexing – Maron and Kuhns,
1960
• SMART – Gerard Salton at Cornell –
Vector space model, 1970’s
• SIRE at Syracuse
• I3R – Croft
• TREC – 1992
IS 202 – FALL 2002
2002.10.22 - SLIDE 54
The Internet and the WWW
• Gopher, Archie, Veronica, WAIS
• Tim Berners-Lee, 1991 creates WWW at
CERN – originally hypertext only
• Web-crawler
• Lycos
• Alta Vista
• Inktomi
• Google
IS 202 – FALL 2002
2002.10.22 - SLIDE 55
Information Retrieval – Historical View
•
•
•
•
•
Research
Boolean model, statistics
of language (1950’s)
Vector space model,
probablistic indexing,
relevance feedback
(1960’s)
Probabilistic querying
(1970’s)
Fuzzy set/logic, evidential
reasoning (1980’s)
Regression, neural nets,
inference networks, latent
semantic indexing, TREC
(1990’s)
IS 202 – FALL 2002
Industry
• DIALOG, Lexus-Nexus,
• STAIRS (Boolean based)
• Information industry
(O($B))
• Verity TOPIC (fuzzy logic)
• Internet search engines
(O($100B?)) (vector
space, probabilistic)
2002.10.22 - SLIDE 56
Research Sources in Information Retrieval
• ACM Transactions on Information Systems
• Am. Society for Information Science Journal
• Document Analysis and IR Proceedings (Las
Vegas)
• Information Processing and Management
(Pergammon)
• Journal of Documentation
• SIGIR Conference Proceedings
• TREC Conference Proceedings
IS 202 – FALL 2002
2002.10.22 - SLIDE 57
Research Systems Software
• INQUERY (Croft/U. Mass)
• OKAPI (Robertson)
• PRISE (Harman/NIST)
– http://potomac.ncsl.nist.gov/prise
• SMART (Buckley/Cornell)
• CHESHIRE (Larson/Berkeley)
– http://cheshire.lib.berkeley.edu
IS 202 – FALL 2002
2002.10.22 - SLIDE 58
Structure of an IR System
Search
Line
Interest profiles
& Queries
Formulating query in
terms of
descriptors
Information Storage and Retrieval System
Rules of the game =
Rules for subject indexing +
Thesaurus (which consists of
Lead-In
Vocabulary
and
Indexing
Language
Storage of
profiles
Store1: Profiles/
Search requests
Documents
& data
Storage
Line
Indexing
(Descriptive and
Subject)
Storage of
Documents
Comparison/
Matching
Store2: Document
representations
Adapted from Soergel, p. 19
Potentially
Relevant
Documents
IS 202 – FALL 2002
2002.10.22 - SLIDE 59
Structure of an IR System
Search
Line
Interest profiles
& Queries
Formulating query in
terms of
descriptors
Information Storage and Retrieval System
Rules of the game =
Rules for subject indexing +
Thesaurus (which consists of
Lead-In
Vocabulary
and
Indexing
Language
Storage of
profiles
Store1: Profiles/
Search requests
Documents
& data
Storage
Line
Indexing
(Descriptive and
Subject)
Storage of
Documents
Comparison/
Matching
Store2: Document
representations
Adapted from Soergel, p. 19
Potentially
Relevant
Documents
IS 202 – FALL 2002
2002.10.22 - SLIDE 60
Structure of an IR System
Search
Line
Interest profiles
& Queries
Formulating query in
terms of
descriptors
Information Storage and Retrieval System
Rules of the game =
Rules for subject indexing +
Thesaurus (which consists of
Lead-In
Vocabulary
and
Indexing
Language
Storage of
profiles
Store1: Profiles/
Search requests
Documents
& data
Storage
Line
Indexing
(Descriptive and
Subject)
Storage of
Documents
Comparison/
Matching
Store2: Document
representations
Adapted from Soergel, p. 19
Potentially
Relevant
Documents
IS 202 – FALL 2002
2002.10.22 - SLIDE 61
Structure of an IR System
Search
Line
Interest profiles
& Queries
Formulating query in
terms of
descriptors
Information Storage and Retrieval System
Rules of the game =
Rules for subject indexing +
Thesaurus (which consists of
Lead-In
Vocabulary
and
Indexing
Language
Storage of
profiles
Store1: Profiles/
Search requests
Documents
& data
Storage
Line
Indexing
(Descriptive and
Subject)
Storage of
Documents
Comparison/
Matching
Store2: Document
representations
Adapted from Soergel, p. 19
Potentially
Relevant
Documents
IS 202 – FALL 2002
2002.10.22 - SLIDE 62
Relevance (Introduction)
• In what ways can a document be relevant to a
query?
– Answer precise question precisely
• Who is buried in grant’s tomb? Grant.
– Partially answer question
• Where is Danville? Near Walnut Creek.
– Suggest a source for more information.
• What is lymphodema? Look in this Medical Dictionary.
– Give background information
– Remind the user of other knowledge
– Others...
IS 202 – FALL 2002
2002.10.22 - SLIDE 63
Next Time
• Boolean Search Logic
• Preparing information for search
– Lexical analysis
• Using Lexis-Nexis (Assignment 8)
IS 202 – FALL 2002
2002.10.22 - SLIDE 64