Information Retrieval - Lyle School of Engineering
Information Retrieval
CSE 8337
Spring 2003
Introduction/Overview
Material for these slides obtained from:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book
Motivation
IR: representation, storage, organization of, and access to information items
Focus is on the user information need
User information need:
Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
Emphasis is on the retrieval of information (not data)
DB vs IR
Records (tuples) vs. documents
Well defined results vs. fuzzy results
DB grew out of files and traditional business systems
IR grew out of library science and the need to categorize/group/access books/articles
DB vs IR (cont’d)
Data retrieval
which docs contain a set of keywords?
Well defined semantics
a single erroneous object implies failure!
Information retrieval
information about a subject or topic
semantics is frequently loose
small errors are tolerated
IR system:
interpret contents of information items
generate a ranking which reflects relevance
notion of relevance is most important
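The contrast between data retrieval and information retrieval can be sketched in a few lines. This is only an illustration, not a method from the slides: the three toy documents, the function names `boolean_match` and `overlap_score`, and the overlap-based score are all assumptions chosen to keep the example small.

```python
# Toy collection (hypothetical documents, for illustration only).
docs = {
    "d1": "college tennis team wins ncaa tournament",
    "d2": "university tennis schedule",
    "d3": "ncaa basketball tournament bracket",
}

def boolean_match(query_terms, text):
    """Data retrieval: a document either contains every keyword or it does not."""
    words = set(text.split())
    return all(term in words for term in query_terms)

def overlap_score(query_terms, text):
    """Toy relevance score: fraction of query terms that appear in the document."""
    words = set(text.split())
    return sum(term in words for term in query_terms) / len(query_terms)

query = ["tennis", "ncaa", "tournament"]

# Well-defined result set: only exact matches qualify.
exact = [doc_id for doc_id, text in docs.items() if boolean_match(query, text)]

# Ranking: every document gets a score, and partial matches are tolerated.
ranked = sorted(docs, key=lambda doc_id: overlap_score(query, docs[doc_id]), reverse=True)
```

Here `exact` holds only the document that contains all three keywords, while `ranked` orders every document by how much of the query it covers, so a partially matching document still appears, just lower in the list.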
Motivation
IR in the last 20 years:
classification and categorization
systems and languages
user interfaces and visualization
Still, the area was seen as being of narrow interest
The advent of the Web changed this perception once and for all
universal repository of knowledge
free (low cost) universal access
no central editorial board
many problems though: IR seen as key to finding the solutions!
Basic Concepts
The User Task
[Figure: the user task against the database, either retrieval or browsing]
Retrieval
information or data
purposeful
Browsing
glancing around
cars, Le Mans, France, tourism
Basic Concepts
Logical view of the documents
[Figure: from full text to index terms. Labels: docs, structure, accents and spacing, stopwords, noun groups, stemming, manual indexing]
Document representation viewed as a continuum:
logical view of docs might shift
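A minimal sketch of how text operations might shift the logical view from full text toward index terms. The stopword list, the suffix-stripping rule in `naive_stem`, and the function names are all hypothetical and deliberately crude; a real system would use a proper stemmer and a much larger stopword list.

```python
import unicodedata

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "of", "a", "in", "and", "to", "is"}

def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(full_text):
    """Apply accent removal, case folding, stopword removal, and stemming in turn."""
    text = strip_accents(full_text).lower()
    tokens = [token.strip(".,;:!?\"'()") for token in text.split()]
    tokens = [token for token in tokens if token and token not in STOPWORDS]
    return [naive_stem(token) for token in tokens]

terms = index_terms("Indexing the documents of a café")
```

Each step discards some information from the full text, which is why the representation is a continuum: stopping after accent removal keeps nearly the full text, while running every step leaves only a short list of index terms.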
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flows: user need, logical view, query, retrieved docs, ranked docs, user feedback]
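The indexing and searching steps of the retrieval process rely on an inverted file. The sketch below shows one simple way such an index could be built and queried; the toy documents, the whitespace tokenization, and the names `build_inverted_file` and `search` are assumptions made for illustration, and ranking is left out.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        # Intersect posting lists; an unknown term empties the result.
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical text database.
docs = {
    1: "college tennis team",
    2: "tennis tournament schedule",
    3: "college basketball team",
}
index = build_inverted_file(docs)
hits = search(index, "college tennis")
```

The index is built once from the text database, so a query only touches the posting lists of its own terms instead of scanning every document.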
Fuzzy Sets and Logic
Fuzzy Set: the set membership function is a real-valued function with output in the range [0,1].
f(x): Probability x is in F.
1-f(x): Probability x is not in F.
EX:
T = {x | x is a person and x is tall}
Let f(x) be the probability that x is tall
Here f is the membership function
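As a sketch of what a membership function f for the set T might look like, the code below uses a linear ramp between two heights. The breakpoints of 160 cm and 190 cm, the ramp shape, and the function names are purely hypothetical; the slides do not specify how f is defined.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Membership f(x) in the fuzzy set T of tall people, as a value in [0, 1].

    Assumed shape: 0 at or below `low`, 1 at or above `high`,
    and a linear ramp in between.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

def not_tall_membership(height_cm):
    """Complement 1 - f(x): the degree to which x is not in T."""
    return 1.0 - tall_membership(height_cm)

halfway = tall_membership(175.0)
```

Unlike an ordinary set, where a person is either in T or not, a height between the two breakpoints gets an intermediate value, and the complement is obtained as 1 - f(x).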
Fuzzy Sets
IR is Fuzzy
[Figure: accept/reject decisions under simple versus fuzzy classification]
Information Retrieval
Information Retrieval (IR): retrieving
desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about “data mining”.
Information Retrieval
Similarity: measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
Metrics:
Precision = |Relevant and Retrieved| / |Retrieved|
Recall = |Relevant and Retrieved| / |Relevant|
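The two metrics can be computed directly from sets of document ids. This is a minimal sketch: the function name `precision_recall`, the example ids, and the convention of returning 0.0 when a denominator is empty are assumptions, not something the slides define.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of relevant and retrieved document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    # Assumed convention: report 0.0 when the denominator would be zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
precision, recall = precision_recall(
    relevant={"d1", "d2", "d3", "d4"},
    retrieved={"d1", "d2", "d5"},
)
```

In this example two of the three retrieved documents are relevant, giving a precision of 2/3, and two of the four relevant documents were retrieved, giving a recall of 1/2.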
IR Query Result Measures