Information Retrieval - Lyle School of Engineering


Information Retrieval
CSE 8337
Spring 2003
Introduction/Overview
Material for these slides obtained from:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book
Motivation



IR: representation, storage,
organization of, and access to
information items
Focus is on the user information need
User information need:


Find all docs containing information on college
tennis teams which: (1) are maintained by a USA
university and (2) participate in the NCAA
tournament.
Emphasis is on the retrieval of information (not
data)
DB vs IR




Records (tuples) vs. documents
Well defined results vs. fuzzy results
DB grew out of files and traditional
business systems
IR grew out of library science and need
to categorize/group/access
books/articles
DB vs IR (cont’d)
Data retrieval
which docs contain a set of keywords?
Well defined semantics
a single erroneous object implies failure!
Information retrieval
information about a subject or topic
semantics is frequently loose
small errors are tolerated
IR system:
interpret contents of information items
generate a ranking which reflects relevance
notion of relevance is most important
Motivation
IR in the last 20 years:
classification and categorization
systems and languages
user interfaces and visualization
Still, the area was seen as being of narrow interest
Advent of the Web changed this perception
once and for all
universal repository of knowledge
free (low cost) universal access
no central editorial board
many problems though: IR seen as key to finding the
solutions!
Basic Concepts
The User Task
[Figure: the user task. Labels: Retrieval, Browsing, Database.]
Retrieval
information or data
purposeful
Browsing
glancing around
cars, Le Mans, France, tourism
Basic Concepts
Logical view of the documents
[Figure: logical view of the documents. Labels: docs; accents, spacing; stopwords; noun groups; stemming; manual indexing; structure; full text; index terms.]
Document representation viewed as a continuum:
logical view of docs might shift
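The shift from a full-text view to an index-term view can be sketched in a few lines. This is only an illustration under stated assumptions: the stopword list and the function names (`strip_accents`, `full_text_tokens`, `index_terms`) are hypothetical, and the noun-group and stemming steps from the figure are left out.

```python
import re
import unicodedata
from typing import List

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "are", "by", "in", "of", "on", "the", "which"}

def strip_accents(text: str) -> str:
    """Remove diacritics: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def full_text_tokens(text: str) -> List[str]:
    """Full-text view: every word, normalized for accents, case and spacing."""
    return re.findall(r"[a-z0-9]+", strip_accents(text).lower())

def index_terms(text: str) -> List[str]:
    """Reduced view: the full-text tokens with stopwords removed."""
    return [tok for tok in full_text_tokens(text) if tok not in STOPWORDS]

doc = "Café   reviews of the tennis teams in the NCAA"
print(full_text_tokens(doc))  # nine tokens, stopwords included
print(index_terms(doc))       # ['cafe', 'reviews', 'tennis', 'teams', 'ncaa']
```

Each further text operation would move the representation further from full text and closer to a small set of index terms.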
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Text Database, Index (inverted file). Flows: user need, text, logical view, query, user feedback, retrieved docs, ranked docs.]
Fuzzy Sets and Logic




Fuzzy Set: Set membership function is a real
valued function with output in the range [0,1].
f(x): Probability x is in F.
1-f(x): Probability x is not in F.
EX:
  T = {x | x is a person and x is tall}
  Let f(x) be the probability that x is tall
  Here f is the membership function
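The tall-person example can be sketched as a membership function. This is a minimal illustration, not the slides' definition: the linear ramp, the 160 cm and 190 cm bounds, and the 175 cm crisp cutoff are all assumptions chosen for the example.

```python
def tall_membership(height_cm: float, low: float = 160.0, high: float = 190.0) -> float:
    """Membership value f(x) in [0, 1] for the fuzzy set T of tall people.

    Assumed shape: 0 at or below `low`, 1 at or above `high`, linear in between.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

def crisp_tall(height_cm: float, cutoff: float = 175.0) -> bool:
    """Ordinary (non-fuzzy) set for comparison: membership is either True or False."""
    return height_cm >= cutoff

for h in (150, 175, 200):
    f = tall_membership(h)
    # f is the value for "x is in T", 1 - f the value for "x is not in T".
    print(h, f, 1 - f, crisp_tall(h))
```

With these assumed bounds, a height of 175 cm gets a membership value of 0.5, whereas the crisp set has to answer yes or no.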
Fuzzy Sets
IR is Fuzzy
[Figure: two panels, Simple and Fuzzy, each showing Reject and Accept regions.]
Information Retrieval






Information Retrieval (IR): retrieving
desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about “data mining”.
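A keyword-based search for the sample query might look like the sketch below. The three-document collection and the function names are hypothetical, and the query is treated as a Boolean AND of its keywords rather than as a phrase.

```python
from collections import defaultdict

# Hypothetical toy collection, for illustration only.
docs = {
    1: "data mining finds patterns in large data sets",
    2: "mining for gold in the mountains",
    3: "information retrieval and data mining overlap",
}

def build_inverted_index(collection):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def keyword_search(index, query):
    """Return the ids of documents containing every keyword in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

index = build_inverted_index(docs)
print(sorted(keyword_search(index, "data mining")))  # [1, 3]
```

Document 2 contains "mining" but not "data", so it is not returned. A pure keyword match of this kind says nothing about whether a returned document is actually about data mining.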
Information Retrieval



Similarity: measure of how close a query is
to a document.
Documents which are “close enough” are
retrieved.
Metrics:
  Precision = |Relevant and Retrieved| / |Retrieved|
  Recall = |Relevant and Retrieved| / |Relevant|
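As a worked example of the two metrics, the sketch below computes them from sets of document ids. The ids are made up, and returning 0.0 when a denominator is empty is an assumed convention, not something the slides specify.

```python
def precision_recall(relevant: set, retrieved: set):
    """Return (precision, recall) for one query.

    precision = |relevant and retrieved| / |retrieved|
    recall    = |relevant and retrieved| / |relevant|
    """
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0  # assumed convention
    recall = hits / len(relevant) if relevant else 0.0       # assumed convention
    return precision, recall

relevant = {1, 3, 5, 7}       # hypothetical ids of the relevant documents
retrieved = {1, 2, 3, 4, 6}   # hypothetical ids the system returned
p, r = precision_recall(relevant, retrieved)
print(p, r)  # 0.4 0.5
```

Two of the five retrieved documents are relevant (precision 2/5), and two of the four relevant documents were retrieved (recall 2/4).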
IR Query Result Measures