Basics of Information Retrieval
Download
Report
Transcript Basics of Information Retrieval
Basics of Information
Retrieval
Lillian N. Cassel
Some of these slides are taken or adapted from
Source: http://www.stanford.edu/class/cs276/cs276-2006-syllabus.html
Basic ideas
Information overload
The challenging byproduct of the information
age
Huge amounts of information available -- how
to find what you need when you need it
Think about addresses, e-mail messages, files
of interesting articles, etc.
Information retrieval is the formal study of
efficient and effective ways to extract the right
bit of information from a collection.
The web is a special case, as we will discuss.
Some distinctions
Data, information, knowledge
How do you distinguish among them?
http://www.systems-thinking.org/dikw/dikw.htm
Information sources
Very well organized, indexed, controlled
Totally unorganized, uncharacterized,
uncontrolled
Something in between
Databases
Databases hold specific data items
Organization is explicit
Keys relate items to each other
Queries are constrained, but effective in retrieving the
data that is there
Databases generally respond to specific queries with
specific results
Browsing is difficult
Searching for items not anticipated by the designers
can be difficult
The Web
The Web contains many kinds of elements
Organization?
There are no keys to relate items to each other
Queries are unconstrained; effectiveness depends on
the tools used.
Web queries generally respond to general queries
with specific results
Browsing is possible, though somewhat complicated
There are no designers of the overall Web structure.
Describe how you frequently use the web
What works easily?
What has been difficult?
Digital Library
Something in between the very structured
database and the unstructured Web.
Content is controlled. Someone makes the
entries. (Maybe a lot of people make the
entries, but there are rules for admission.)
Searching and browsing are somewhat open,
not controlled by fixed keys and anticipated
queries.
Nature of the collection regulates indexing
somewhat.
In all cases
Trying to connect an information user to the
specific information wanted.
Concerned with efficiency and effectiveness
Effectiveness - how well did we do?
Efficiency - how well did we use available
resources?
Effectiveness
Two measures:
Precision
Of the results returned, what percentage are meaningful
to the goal of the query?
Recall
Of the materials available that match the query, what
percentage were returned?
Ex. Search returns 590,000 responses and 195 are
relevant. How well did we do?
Not enough information.
Did the 590,000 include all relevant responses? If so,
recall is perfect.
195/590,000 is not good precision!
The process
Query entered
Results Ranked
Query
Interpreted
Index
searched
Items
retrieved
The Collection
Where does the collection come from?
How is the index created?
Those are important distinguishing
characteristics
Inverted Index -- Ordered list of terms
related to the collected materials. Each
term has an associated pointer to the
related material(s).
www.cs.cityu.edu.hk/~deng/5286/T51.doc
Crawling the web
Misnomer as the spider or robot does not actually
move about the web
Program sends a normal request for the page, just as
a browser would.
Retrieve the page and parse it.
Look for anchors -- pointers to other pages.
•
Put them on a list of URLs to visit
Extract key words (possibly all words) to use as index
terms related to that page
Take the next URL and do it again
Actually, the crawling and processing are parallel
activities
Responding to search queries
Use the query string provided
Form a boolean query
Join all words with AND? With OR?
Find the related index terms
Return the information available about the
pages that correspond to the query terms.
Many variations on how to do this. Usually
proprietary to the company.
Making the connections
Stemming
Making sure that simple variations in word form are
recognized as equivalent for the purpose of the search:
exercise, exercises, exercised, for example.
Indexing
A keyword or group of selected words
Any word (more general)
How to choose the most relevant terms to use as index
elements for a set of documents.
Build an inverted file for the chosen index terms.
The Vector model
Let,
N be the total number of documents in the collection
ni be the number of documents which contain ki
freq(i,j) raw frequency of ki within dj
A normalized tf (term frequency) factor is given by
tf(i,j) = freq(i,j) / max(freq(i,j))
where the maximum is computed over all terms which
occur within the document dj
The idf (index term frequency) factor is computed as
idf(i) = log (N/ni)
the log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of
information associated with the term ki.
Anatomy of a web page
Metatags: Information about the page
Primary source of indexing information for a search
engine.
Ex. Title. Never mind what has an H1 tag (though that
may be considered), what is in the <title> </title>
brackets?
Other tags provide information about the page. This is
easier for the search engine to use than determining
the meaning of the text of the page.
Dealing with the cheaters
False information provided in the web page to make
the search engine return this page
False metatags, invisible words (repeated many times),
etc
Standard Metatags
The Dublin Core (http://dublincore.org/)
15 common items to use in labeling any web
document
Title
Creator
Subject
Description
Publisher
Contributor
Date
Resources type
Format
Identifier
Source
Language
Relation
Coverage
Rights
Hubs and authorities
Hub points to a lot of other places.
CITIDEL is a hub for computing information
NSDL is a hub for science, technology, engineering
and mathematics education.
Authorities are pointed to by a lot of other places.
W3C.org is an authority for information about the web.
When Hub or Authority status is captured, the search
can be more accurate.
If several pages match a query, and one is an authority
page, it will be ranked higher.
When a hub matches a query, the pages it points to are
likely to be relevant.
Some Digital Library examples
Between the chaos of the Web and the strict structure
of a database, the digital library contains an
organized collection.
We saw the digital collection at the Falvey library
session.
See also:
NSDL www.nsdl.org
And the computing component, CITIDEL:
citidel.villanova.edu
American Memory
http://memory.loc.gov/ammem/index.html
Conclusions
The plan was to introduce the basic concepts
of information retrieval in a form accessible to
most students,before you have read anything
about it.
We will look more deeply at these subjects in
the coming weeks.
A word about the pattern for these slides …