Basics of Information Retrieval


Basics of Information Retrieval
Lillian N. Cassel
Some of these slides are taken or adapted from
Source: http://www.stanford.edu/class/cs276/cs276-2006-syllabus.html
Basic ideas
 Information overload
 The challenging byproduct of the information age
 Huge amounts of information available -- how to find what you need when you need it
Think about addresses, e-mail messages, files of interesting articles, etc.
 Information retrieval is the formal study of efficient and effective ways to extract the right bit of information from a collection.
 The web is a special case, as we will discuss.
Some distinctions
 Data, information, knowledge
 How do you distinguish among them?
http://www.systems-thinking.org/dikw/dikw.htm
 Information sources
 Very well organized, indexed, controlled
 Totally unorganized, uncharacterized, uncontrolled
 Something in between
Databases
 Databases hold specific data items
 Organization is explicit
 Keys relate items to each other
 Queries are constrained, but effective in retrieving the data that is there
 Databases generally respond to specific queries with specific results
 Browsing is difficult
 Searching for items not anticipated by the designers can be difficult
The Web
 The Web contains many kinds of elements
 Organization?
 There are no keys to relate items to each other
 Queries are unconstrained; effectiveness depends on the tools used.
 Web search generally responds to general queries with specific results
 Browsing is possible, though somewhat complicated
 There are no designers of the overall Web structure.
 Describe how you frequently use the web
 What works easily?
 What has been difficult?
Digital Library
 Something in between the very structured database and the unstructured Web.
 Content is controlled. Someone makes the entries. (Maybe a lot of people make the entries, but there are rules for admission.)
 Searching and browsing are somewhat open, not controlled by fixed keys and anticipated queries.
 Nature of the collection regulates indexing somewhat.
In all cases
 Trying to connect an information user to the specific information wanted.
 Concerned with efficiency and effectiveness
 Effectiveness - how well did we do?
 Efficiency - how well did we use available resources?
Effectiveness
 Two measures:
 Precision: of the results returned, what percentage are meaningful to the goal of the query?
 Recall: of the materials available that match the query, what percentage were returned?
 Ex. Search returns 590,000 responses and 195 are relevant. How well did we do?
 Not enough information.
Did the 590,000 include all relevant responses? If so, recall is perfect.
195/590,000 is not good precision!
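To make the example above concrete, here is a minimal calculation (purely illustrative; it assumes, as the slide's hypothetical does, that every relevant document in the collection was among the results returned):

```python
# Worked example from the slide: 590,000 results returned, 195 of them relevant.
returned = 590_000
relevant_returned = 195
relevant_in_collection = 195   # assumption: all relevant documents were returned

precision = relevant_returned / returned              # ~0.00033, about 0.03%
recall = relevant_returned / relevant_in_collection   # 1.0 -- perfect recall

print(f"precision = {precision:.5f}, recall = {recall:.1f}")
```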
The process
Query entered → Query interpreted → Index searched → Items retrieved → Results ranked
The Collection
 Where does the collection come from?
 How is the index created?
 Those are important distinguishing characteristics
 Inverted Index -- Ordered list of terms related to the collected materials. Each term has an associated pointer to the related material(s). (A small sketch follows below.)
 www.cs.cityu.edu.hk/~deng/5286/T51.doc
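As a rough illustration (the three-document collection and the whitespace tokenization below are invented, not from the slides), an inverted index can be built in a few lines:

```python
# Minimal inverted index: each term maps to the set of document ids containing it.
from collections import defaultdict

docs = {  # hypothetical toy collection
    1: "information retrieval finds information",
    2: "the web is a special case",
    3: "databases hold specific data items",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # naive tokenization, for illustration only
        inverted_index[term].add(doc_id)

print(inverted_index["information"])    # {1}
```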
Crawling the web
 Misnomer as the spider or robot does not actually move about the web
 Program sends a normal request for the page, just as a browser would.
 Retrieve the page and parse it.
Look for anchors -- pointers to other pages -- and put them on a list of URLs to visit.
Extract key words (possibly all words) to use as index terms related to that page.
 Take the next URL and do it again
Actually, the crawling and processing are parallel activities (a sketch of the loop follows below).
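A minimal sketch of that loop, using only Python's standard library (single-threaded, no robots.txt handling, no politeness delays; real crawlers add all of these and run fetching and indexing in parallel, as the slide notes):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class AnchorParser(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    to_visit, seen = [seed_url], set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                    # skip pages that fail to load
        parser = AnchorParser()
        parser.feed(html)
        # Here a real crawler would extract index terms from the page text.
        to_visit.extend(urljoin(url, link) for link in parser.links)
    return seen
```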
Responding to search queries
 Use the query string provided
 Form a boolean query
 Join all words with AND? With OR?
 Find the related index terms
 Return the information available about the pages that correspond to the query terms.
 Many variations on how to do this. Usually proprietary to the company.
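Reusing the toy inverted index from the earlier sketch, the AND/OR choice on the slide corresponds directly to a set intersection or union over the postings (function names are illustrative):

```python
def boolean_and(query, index):
    """Documents containing every query term (intersection of postings)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

def boolean_or(query, index):
    """Documents containing at least one query term (union of postings)."""
    return set().union(*(index.get(term, set()) for term in query.lower().split()))

# boolean_and("information retrieval", inverted_index)  -> {1}
```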
Making the connections
 Stemming
 Making sure that simple variations in word form are recognized as equivalent for the purpose of the search: exercise, exercises, exercised, for example.
 Indexing
 A keyword or group of selected words
 Any word (more general)
 How to choose the most relevant terms to use as index elements for a set of documents.
 Build an inverted file for the chosen index terms.
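A production system would use a real stemmer such as Porter's; the toy suffix-stripper below is far cruder and purely illustrative, but it shows how the word forms from the slide collapse to a single index term:

```python
def crude_stem(word):
    """Strip a few common English suffixes -- a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All three variants from the slide map to the same term:
print({crude_stem(w) for w in ["exercise", "exercises", "exercised"]})   # {'exercis'}
```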
The Vector model
 Let
 N be the total number of documents in the collection
 ni be the number of documents which contain the term ki
 freq(i,j) be the raw frequency of ki within document dj
 A normalized tf (term frequency) factor is given by
 tf(i,j) = freq(i,j) / max_l freq(l,j)
 where the maximum is computed over all terms l which occur within the document dj
 The idf (inverse document frequency) factor is computed as
 idf(i) = log(N / ni)
 The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
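A compact sketch of these formulas (the three-document collection is invented; the slide defines tf and idf but stops there, so combining them as the product tf × idf is the usual weighting, not something stated on the slide):

```python
import math
from collections import Counter

docs = {  # hypothetical toy collection
    "d1": "information retrieval finds information",
    "d2": "the web is a special case",
    "d3": "information on the web",
}

N = len(docs)
term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
doc_freq = Counter(t for counts in term_counts.values() for t in counts)  # n_i

def tf(term, doc):
    counts = term_counts[doc]
    return counts[term] / max(counts.values())   # freq(i,j) / max_l freq(l,j)

def idf(term):
    return math.log(N / doc_freq[term])          # log(N / n_i)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("information", "d1"), 3))     # ~0.405
```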
Anatomy of a web page
 Metatags: Information about the page
 Primary source of indexing information for a search engine.
 Ex. Title. Never mind what has an H1 tag (though that may be considered), what is in the <title> </title> brackets?
 Other tags provide information about the page. This is easier for the search engine to use than determining the meaning of the text of the page.
 Dealing with the cheaters
 False information provided in the web page to make the search engine return this page
False metatags, invisible words (repeated many times), etc.
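To show concretely what a search engine reads from the <title> and <meta> tags, here is a small standard-library sketch (the sample page is invented):

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects the <title> text and any <meta name=... content=...> pairs."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"]] = d["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page = (
    "<html><head><title>IR Basics</title>"
    '<meta name="description" content="Intro to information retrieval">'
    "</head><body><h1>Something else entirely</h1></body></html>"
)
extractor = MetaExtractor()
extractor.feed(page)
print(extractor.title, extractor.meta)
```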
Standard Metatags
 The Dublin Core (http://dublincore.org/)
15 common items to use in labeling any web document:
Title
Creator
Subject
Description
Publisher
Contributor
Date
Resource Type
Format
Identifier
Source
Language
Relation
Coverage
Rights
Hubs and authorities
 A hub points to a lot of other places.
 CITIDEL is a hub for computing information
 NSDL is a hub for science, technology, engineering and mathematics education.
 Authorities are pointed to by a lot of other places.
 W3C.org is an authority for information about the web.
 When hub or authority status is captured, the search can be more accurate.
 If several pages match a query, and one is an authority page, it will be ranked higher.
 When a hub matches a query, the pages it points to are likely to be relevant.
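The slide does not name an algorithm, but this intuition is what link-analysis methods such as Kleinberg's HITS compute; the sketch below iterates hub and authority scores over an invented link graph (page names and structure are made up for illustration):

```python
# page -> pages it links to (invented graph)
links = {
    "hub1": ["auth1", "auth2", "auth3"],
    "hub2": ["auth1", "auth2"],
    "auth1": [], "auth2": [], "auth3": [],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):                # a handful of iterations converges on this graph
    # Authority score: sum of the hub scores of pages pointing at the page.
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # Hub score: sum of the authority scores of the pages it points to.
    hub = {p: sum(auth[t] for t in links[p]) for p in links}
    # Normalize so the scores stay bounded.
    a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(max(auth, key=auth.get), max(hub, key=hub.get))   # best authority, best hub
```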
Some Digital Library examples
 Between the chaos of the Web and the strict structure of a database, the digital library contains an organized collection.
 We saw the digital collection at the Falvey library session.
 See also:
 NSDL: www.nsdl.org
And the computing component, CITIDEL: citidel.villanova.edu
 American Memory: http://memory.loc.gov/ammem/index.html
Conclusions
 The plan was to introduce the basic concepts of information retrieval in a form accessible to most students, before you have read anything about it.
 We will look more deeply at these subjects in the coming weeks.
 A word about the pattern for these slides …