IR Through the Ages

Intelligent Information Retrieval
CS 336
Lisa Ballesteros
Spring 2006
What is Information Retrieval?
• Includes the following:
– Organization
– Storage/Representation
– Manipulation/Analysis
– Search/Retrieval
• How far back in history can we find
examples?
IR Through the Ages
• 3rd Century BCE
– Library of Alexandria
• 500,000 volumes
• catalogs and classifications
• 13th Century A.D.
– First concordance of the Bible
• What is a concordance?
• 15th Century A.D.
– Invention of printing
• 1600
– University of Oxford Library
• All books printed in England
IR Through the Ages
• 1755
– Johnson’s Dictionary
• Set standard for dictionaries
• Included common language
• Helped standardize spelling
• 1800
– Library of Congress
• 1828
– Webster’s Dictionary
• Significantly larger than previous dictionaries
• Standardized American spelling
• 1852
– Roget’s Thesaurus
IR Through the Ages
• 1876
– Dewey Decimal Classification
• 1880’s
– Carnegie Public Libraries
• 1,681 built (first public library 1850)
• 1930’s
– Punched card retrieval systems
• 1940’s
– Bush’s Memex
– Shannon’s Communication Theory
– Zipf’s “Law”
Historical Summary
• 1960’s
– Basic advances in retrieval and indexing techniques
• 1970’s
– Probabilistic and vector space models
– Clustering, relevance feedback
– Large, on-line, Boolean information services
– Fast string matching
• 1980’s
– Natural Language Processing and IR
– Expert systems and IR
– Off-the-shelf IR systems
IR Through the Ages
• Late 1980’s
– First mini-computer and PC systems incorporating
“relevance ranking”
• Early 1990’s
– information storage revolution
• 1992
– First large-scale information service incorporating
probabilistic retrieval (West’s legal retrieval system)
IR Through the Ages
• Mid 1990’s to present
– Multimedia databases
• 1994 to present
– The Internet and Web explosion
• e.g. Google, Yahoo, Lycos, Infoseek (now Go)
• 1995 to present
– Digital Libraries
– Data Mining
– Agents and Filtering
– Knowledge and Distributed Intelligence
– Information Organization
– Knowledge Management
Historical Summary
• 1990’s
– Large-scale, full-text IR and filtering experiments
and systems (TREC)
– Dominance of ranking
– Many web-based retrieval engines
– Interfaces and browsing
– Multimedia and multilingual
– Machine learning techniques
Trends in IR Technology
[Figure: technologies plotted against time (1970, 1990), grouped by the volume of on-line information]
• Gigabytes: Boolean Retrieval and Filtering, Ranked Retrieval, Concept-Based Retrieval, Ranked Filtering
• Terabytes: Information Extraction, Summarization, Distributed Retrieval
• Petabytes: Data Mining, Visualization, Image and Video Retrieval
• Along the time axis: Batch systems... Interactive systems... Database systems... Cheap storage... Internet... Multimedia...
A 1-page Word document without any images ≈ 10 kilobytes (KB) of disk space.
1 terabyte ≈ one hundred million image-free Word documents.
1 petabyte = one thousand terabytes.
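The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes) and the slide's estimate of roughly 10 KB per page; the constant and function names are illustrative only.

```python
# Decimal (SI) units are an assumption; binary units would shift the counts slightly.
KB = 10**3
TB = 10**12
PB = 10**15

DOC_SIZE_BYTES = 10 * KB  # the slide's estimate for a 1-page, image-free document


def docs_that_fit(capacity_bytes, doc_size_bytes=DOC_SIZE_BYTES):
    """Return how many whole documents of the given size fit in the capacity."""
    return capacity_bytes // doc_size_bytes


docs_per_terabyte = docs_that_fit(TB)
docs_per_petabyte = docs_that_fit(PB)
terabytes_per_petabyte = PB // TB

print(docs_per_terabyte)       # 100000000 (one hundred million)
print(terabytes_per_petabyte)  # 1000
```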
Historical Summary
• The Future
– Logic-based IR?
– NLP?
– Integration with other functionality
– Distributed, heterogeneous database access
– IR in context
– “Anytime, Anywhere”
Information Retrieval
• Ad Hoc Retrieval
– Given a query and a large database of text objects, find the
relevant objects
• Distributed Retrieval
– Many distributed databases
• Information Filtering
– Given a text object from an information stream (e.g. newswire)
and many profiles (long-term queries), decide which profiles
match
• Multimedia Retrieval
– Databases of other types of unstructured data, e.g. images,
video, audio
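Information filtering inverts ad hoc retrieval: the queries (profiles) are stored, and documents arrive one at a time. The following is a minimal sketch, assuming each profile is simply a set of required terms and that a document matches when it contains all of them. The profile names, the sample newswire text, and the `matching_profiles` helper are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matching_profiles(document, profiles):
    """Return the names of the profiles whose terms all occur in the document."""
    doc_terms = set(tokenize(document))
    return sorted(name for name, terms in profiles.items() if terms <= doc_terms)


# Hypothetical long-term queries, one per user interest.
profiles = {
    "energy": {"oil", "prices"},
    "sports": {"soccer"},
    "markets": {"stock", "prices"},
}

story = "Oil prices rose sharply as stock markets reacted to the news."
print(matching_profiles(story, profiles))  # ['energy', 'markets']
```

A practical filtering system would more likely score each profile and apply a threshold than require every term to be present.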
Information Retrieval
• Multilingual Retrieval
– Retrieval in a language other than English
• Cross-language Retrieval
– Query in one language (e.g. Spanish), retrieve
documents in other languages (e.g. Chinese,
French, and Spanish)
Information Retrieval
• Text Representation (Indexing)
– given a text document, identify the concepts that describe the
content and how well they describe it
• what makes a “good” representation?
• how is a representation generated from text?
• what are retrievable objects and how are they organized?
• Representing an Information Need (Query Formulation)
– describe and refine information needs as explicit queries
• what is an appropriate query language?
• how can interactive query formulation and refinement be supported?
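One common answer to "how is a representation generated from text?" is an inverted index: each term maps to the documents that contain it, with the within-document frequency as a crude measure of how well the term describes the document. The sketch below assumes simple regex tokenization and a tiny illustrative stopword list; the sample documents and function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "of", "the", "to", "in"}  # illustrative, not a standard list


def index_terms(text):
    """Tokenize, lowercase, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(index_terms(text)).items():
            index[term][doc_id] = tf
    return dict(index)


docs = {
    "d1": "The library of Alexandria held catalogs and classifications.",
    "d2": "A concordance lists the words of a text and where they occur.",
    "d3": "Library catalogs organize the books in a library.",
}
index = build_inverted_index(docs)
print(index["library"])   # {'d1': 1, 'd3': 2}
print(index["catalogs"])  # {'d1': 1, 'd3': 1}
```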
Information Retrieval
• Comparing Representations (Retrieval)
– compare text and information need representations to
determine which documents are likely to be relevant
• what is a “good” model of retrieval?
• how is uncertainty represented?
• Evaluating Retrieved Text (Feedback)
– present documents for user evaluation and modify query
based on feedback
• what are good metrics?
• what constitutes a good experimental testbed?
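As one possible illustration of comparison and evaluation, the sketch below ranks documents by cosine similarity between tf-idf vectors (a vector space model) and then scores the result with precision and recall. The weighting formula, the sample documents, and all function names are assumptions chosen for brevity, not the only reasonable choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return ({doc_id: {term: weight}}, idf) using tf * log(N / df) weights."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(len(docs) / n) for term, n in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in c.items()}
        for doc_id, c in counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return ids of documents with a nonzero score, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


docs = {
    "d1": "probabilistic retrieval models rank documents",
    "d2": "boolean retrieval returns unranked sets",
    "d3": "libraries catalog books",
}
ranking = rank("probabilistic retrieval", docs)
print(ranking)                            # ['d1', 'd2']
print(precision_recall(ranking, {"d1"}))  # (0.5, 1.0)
```

Here the relevance judgment ({"d1"}) is supplied by hand; in an experimental testbed it would come from human assessors.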
Information Retrieval and Filtering
[Diagram]
• Information Need → Representation → Query
• Text Objects → Representation → Indexed Objects
• Query + Indexed Objects → Comparison → Retrieved Objects
• Retrieved Objects → Evaluation/Feedback → Query
Features of a Modern IR Product
• Effective “relevance ranking”
• Simple free text (“natural language”) query capability
• Boolean and proximity operators
• Term weighting
• Query formulation assistance
• Query by example
• Filtering
• Field-based retrieval
• Distributed architecture
• Index anything
• Fast retrieval
• Information Organization
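Boolean and proximity operators can be illustrated with a positional index, which records where each term occurs in each document. This is a minimal sketch under simple assumptions (regex tokenization, two-term queries); the operator names `boolean_and`, `boolean_or`, and `near` are hypothetical and do not reflect any particular product's query syntax.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [word positions of the term]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[term][doc_id].append(pos)
    # Convert to plain dicts so later lookups cannot create entries by accident.
    return {term: dict(postings) for term, postings in index.items()}


def boolean_and(index, a, b):
    """Documents containing both terms."""
    return set(index.get(a, {})) & set(index.get(b, {}))


def boolean_or(index, a, b):
    """Documents containing either term."""
    return set(index.get(a, {})) | set(index.get(b, {}))


def near(index, a, b, k):
    """Documents where a and b occur within k word positions of each other."""
    hits = set()
    for doc_id in boolean_and(index, a, b):
        positions_a, positions_b = index[a][doc_id], index[b][doc_id]
        if any(abs(pa - pb) <= k for pa in positions_a for pb in positions_b):
            hits.add(doc_id)
    return hits


docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "retrieval of legal information from large databases",
    "d3": "database systems store structured records",
}
index = build_positional_index(docs)
print(sorted(boolean_and(index, "information", "retrieval")))  # ['d1', 'd2']
print(sorted(near(index, "information", "retrieval", 1)))      # ['d1']
print(sorted(boolean_or(index, "systems", "databases")))       # ['d1', 'd2', 'd3']
```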
Typical Systems
• IR systems
– Verity, Fulcrum, Excalibur
• Database systems
– Oracle, Informix
• Web search and In-house systems
– West, LEXIS/NEXIS, Dialog
– Yahoo, Google, MSN, AskJeeves
IR vs. Database Systems
• Emphasis on effective, efficient retrieval of
unstructured data
• IR systems typically have very simple schemas
• Query languages emphasize free text, although
Boolean combinations of words are also common
IR vs. Database Systems
• Matching is more complex than with structured
data (semantics less obvious)
– easy to retrieve the wrong objects
– need to measure accuracy of retrieval
• Less focus on concurrency control and recovery,
although update is very important