Transcript Document

Information Retrieval
04/13/2005
Yan Huang - CSCI5330 Database
Implementation – Information Retrieval
Information Retrieval Systems
[Figure: diagram relating a keyword query, the IR system, and documents]
Keyword Search

In full text retrieval, all the words in each document are
considered to be keywords.


We use the word term to refer to the words in a document
Ranking of documents on the basis of estimated
relevance to a query is critical
Similarity Based Retrieval


Similarity based retrieval - retrieve documents similar to
a given document
Similarity can be used to refine answer set to keyword
query

User selects a few relevant documents from those retrieved by
keyword query, and system finds other documents similar to
these
Similarity Measures

A similarity measure is a function that computes the
degree of similarity between two vectors.

Using a similarity measure between the query and each
document:


It is possible to rank the retrieved documents in the order of
presumed relevance.
It is possible to enforce a certain threshold so that the size of
the retrieved set can be controlled.
Relevance Ranking

Relevance ranking is based on factors such as

Term frequency


Frequency of occurrence of query keyword in document
Inverse document frequency

How many documents the query keyword occurs in


Fewer documents → more importance given to the keyword
Hyperlinks to documents

More links to a document → document is more important
Relevance Ranking Using Terms (Cont.)

Most systems add to the above model



Words that occur in title, author list, section headings, etc. are
given greater importance
Words whose first occurrence is late in the document are given
lower importance
Very common words such as “a”, “an”, “the”, “it” etc are
eliminated


Called stop words
Proximity: if keywords in query occur close together in the
document, the document has higher importance than if they
occur far apart
Vector Space Model


Assume t distinct terms remain after preprocessing; call them index
terms or the vocabulary.
These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|


Each term i in a document or query j is given a real-valued weight $w_{ij}$.
Both documents and queries are expressed as t-dimensional vectors:
$d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
Term Weights

More frequent terms in a document are more important, i.e. more
indicative of the topic.
fij = frequency of term i in document j

May want to normalize term frequency (tf) by dividing by the
frequency of the most common term in the document:
$tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$
Reverse Term Weights



Terms that appear in many different documents are less indicative
of overall topic.
$df_i$ = document frequency of term i = number of documents containing term i
$idf_i$ = inverse document frequency of term i = $\log_2 (N / df_i)$
(N: total number of documents)
An indication of a term’s discrimination power.
Log used to dampen the effect relative to tf.
TF-IDF Weighting




A typical combined term importance indicator is tf-idf weighting:
$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2 (N / df_i)$
A term occurring frequently in the document but rarely in the rest of
the collection is given high weight.
Many other ways of determining term weights have been proposed.
Experimentally, tf-idf has been found to work well.
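
The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `tf_idf` and the toy collection `docs` are made up for the example, documents are assumed to be non-empty and already tokenized, and tf is normalized by the most common term in each document as described on the Term Weights slide.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with w_ij = tf_ij * idf_i."""
    n_docs = len(docs)
    # df_i: number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)          # f_ij: raw frequency of each term
        max_f = max(counts.values())   # frequency of the most common term
        weights.append({
            term: (f / max_f) * math.log2(n_docs / df[term])
            for term, f in counts.items()
        })
    return weights

# Hypothetical toy collection (assumption, not from the slides)
docs = [
    ["database", "index", "index", "query"],
    ["database", "query", "ranking"],
    ["web", "ranking", "ranking", "database"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy collection "database" occurs in every document, so its idf is log2(3/3) = 0 and its weight vanishes, while "index" occurs often in one document and nowhere else, so it gets the highest weight there.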
Inner Product Measure

Similarity between vectors for the document dj and query q can be
computed as the vector inner product:

$$\mathrm{sim}(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq}$$

where wij is the weight of term i in document j and wiq is the weight of
term i in the query
For binary vectors, the inner product is the number of matched query
terms in the document (size of intersection).
For weighted term vectors, it is the sum of the products of the weights of
the matched terms.
Inner Product -- Examples
Binary:

D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3
Size of vector = size of vocabulary = 7
0 means corresponding term not found in document or query
Weighted:
D1 = 2T1 + 3T2 + 5T3
Q = 0T1 + 0T2 + 2T3
D2 = 3T1 + 7T2 + 1T3
sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2
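
The two examples can be checked with a short Python sketch. The function name `inner_product` and the variable `Qw` (the weighted query) are chosen for this illustration only.

```python
def inner_product(d, q):
    """Sum of the products of corresponding term weights."""
    assert len(d) == len(q), "both vectors must cover the same vocabulary"
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example: the result counts the query terms matched in the document
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))      # 3

# Weighted example over the terms T1, T2, T3
D1, D2, Qw = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner_product(D1, Qw))    # 10
print(inner_product(D2, Qw))    # 2
```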
Cosine Similarity Measure


Cosine similarity measures the cosine of the angle
between two vectors.
Inner product normalized by the vector lengths.

$$\mathrm{CosSim}(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}}$$

[Figure: document vectors D1 and D2 and query vector Q in the space spanned by terms t1, t2, t3]
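
As a rough illustration, cosine similarity can be computed for the weighted vectors from the inner product examples (D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, Q = 2T3). The function name `cos_sim` and the choice to return 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math

def cos_sim(d, q):
    """Inner product of d and q divided by the product of their lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_d * norm_q)

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))   # 0.81
print(round(cos_sim(D2, Q), 2))   # 0.13
```

D1 still ranks above D2, as with the plain inner product, but the scores now lie between 0 and 1 and no longer grow with the length of the document vector.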
Relevance Using Hyperlinks




Problem with keyword search?
Problem with most frequently visited website search?
Idea: use popularity of Web site (e.g. how many people visit it)
to rank site pages that match given keywords
Problem: hard to find actual popularity of site
Different Ranking Factors


Keyword and anchor-text based search finds all the
related pages first
PageRank ranks the search result set

A highly ranked page is not interesting to you at all if it is
not related
Link Counts
[Figure: link graph with the pages Taher’s Home Page, Sep’s Home Page, DB Pub Server, CS361, Yahoo!, and CNN; one home page is labeled "Linked by 2 Unimportant pages", the other "Linked by 2 Important Pages"]
Definition of PageRank
let us calculate

$$x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$$
Definition of PageRank
[Figure: example graph with the nodes Sep, Taher, DB Pub Server, CNN, and Yahoo!; the links carry the weights 1/2, 1/2, 1, and 1, and the values shown are 0.05, 0.25, 0.1, 0.1, and 0.1]
PageRank Diagram
[Figure: three-node graph, each node with rank 0.333]
Initialize all nodes to rank $x_i^{(0)} = \frac{1}{n}$
PageRank Diagram
[Figure: three-node graph with the values 0.167, 0.167, 0.333, and 0.333]
Propagate ranks across links
(multiplying by link weights)
PageRank Diagram
[Figure: three-node graph with the ranks 0.5, 0.333, and 0.167]

$$x_i^{(1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(0)}$$
PageRank Diagram
[Figure: three-node graph with the values 0.167, 0.5, 0.167, and 0.167]
PageRank Diagram
[Figure: three-node graph with the ranks 0.333, 0.5, and 0.167]

$$x_i^{(2)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(1)}$$
PageRank Diagram
[Figure: three-node graph with the ranks 0.4, 0.4, and 0.2]
After a while…

$$x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$$
Computing PageRank
Initialize: $x_i^{(0)} = \frac{1}{n}$

Repeat until convergence:

$$x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$$

$x_i$ = importance of page i
$x_j$ = importance of page j
$B_i$ = pages j that link to page i
$N_j$ = number of outlinks from page j
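
The iteration can be sketched as follows. The three-page graph (A links to B and C, B links to C, C links to A) is an assumption of this example: the slides do not list the links, but this topology is consistent with the rank values shown in the diagrams (0.5, 0.333, 0.167 after one step and 0.4, 0.4, 0.2 after a while). The function name `pagerank`, the tolerance, and the iteration cap are also illustrative, and every page is assumed to have at least one outlink.

```python
def pagerank(out_links, tol=1e-10, max_iter=1000):
    """Iterate x_i <- sum over j in B_i of x_j / N_j, starting from 1/n."""
    pages = list(out_links)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}           # initialize all ranks to 1/n
    for _ in range(max_iter):
        nxt = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = x[j] / len(targets)        # x_j / N_j
            for i in targets:                  # j is in B_i for each target i
                nxt[i] += share
        if max(abs(nxt[p] - x[p]) for p in pages) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical graph (assumption): A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph, max_iter=1))   # ranks after one propagation step
print(pagerank(graph))               # ranks at convergence
```

Under this assumed graph the ranks settle at 0.4 for A, 0.2 for B, and 0.4 for C.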
Definition of PageRank

The importance of a page is given by the
importance of the pages that link to it
d is a damping factor, usually 0.85

$$x_i = (1 - d) + d \sum_{j \in B_i} \frac{1}{N_j} x_j$$

$x_i$ = importance of page i
$x_j$ = importance of page j
$B_i$ = pages j that link to page i
$N_j$ = number of outlinks from page j
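
A minimal sketch of the damped version, under stated assumptions: the graph is a hypothetical three-page example, every page has at least one outlink, the starting value of 1.0 per page and the fixed iteration count are arbitrary choices, and `damped_pagerank` is a name made up here.

```python
def damped_pagerank(out_links, d=0.85, iterations=100):
    """Iterate x_i <- (1 - d) + d * sum over j in B_i of x_j / N_j."""
    pages = list(out_links)
    x = {p: 1.0 for p in pages}               # starting point (assumption)
    for _ in range(iterations):
        nxt = {p: 1.0 - d for p in pages}     # the (1 - d) term for every page
        for j, targets in out_links.items():
            for i in targets:
                nxt[i] += d * x[j] / len(targets)
        x = nxt
    return x

# Hypothetical graph (assumption): A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = damped_pagerank(graph)
print(ranks)
```

With this form of the formula and no pages without outlinks, the ranks sum to the number of pages rather than to 1.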
Synonyms and Homonyms

Synonyms

E.g. document: “motorcycle repair”, query: “motorcycle
maintenance”



need to realize that “maintenance” and “repair” are synonyms
System can extend query as “motorcycle and (repair or
maintenance)”
Homonyms


E.g. “object” has different meanings as noun/verb
Can disambiguate meanings (to some extent) from the context
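
The synonym expansion described above can be illustrated with a small sketch. The synonym table, the function names `expand_query` and `matches`, and the sample documents are all hypothetical; a real system would draw synonyms from a thesaurus.

```python
# Hypothetical synonym table (assumption for this example)
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}

def expand_query(terms):
    """Turn each query term into an OR-group of the term and its synonyms."""
    return [{t} | SYNONYMS.get(t, set()) for t in terms]

def matches(doc_terms, expanded):
    """AND over the groups: every group needs at least one term in the document."""
    return all(group & doc_terms for group in expanded)

# "motorcycle maintenance" becomes motorcycle and (repair or maintenance)
expanded = expand_query(["motorcycle", "maintenance"])
print(matches({"motorcycle", "repair"}, expanded))   # True
print(matches({"bicycle", "repair"}, expanded))      # False
```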
Indexing of Documents

An inverted index maps each keyword Ki to a set of
documents Si that contain the keyword

Documents identified by identifiers
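
A minimal sketch of such an index, assuming whitespace tokenization and lowercasing. The function name `build_inverted_index` and the sample documents are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword K_i to the set S_i of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical documents keyed by identifier
docs = {
    1: "motorcycle repair manual",
    2: "motorcycle maintenance",
    3: "database index",
}
index = build_inverted_index(docs)
print(sorted(index["motorcycle"]))                      # [1, 2]
# An "and" query intersects the document sets of its keywords
print(sorted(index["motorcycle"] & index["repair"]))    # [1]
```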
Measuring Retrieval Effectiveness

Relevant performance metrics:


Precision - what percentage of the retrieved documents
are relevant to the query.
Recall - what percentage of the documents relevant to
the query were retrieved.
Precision and Recall

Precision: a/(a+c)
Among all the retrieved, how many are actual positive?
Recall: a/(a+b)
Percentage of actual positive data retrieved
F measure: 2pr/(r+p)

                      predict
                   true    false
actual  positive    a        b
        negative    c        d
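
The three measures can be computed from the sets of retrieved and relevant documents. This is a sketch with made-up document identifiers; `precision_recall_f` is a name chosen here, and returning 0.0 when a denominator is zero is a convention of the example.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F) using the a, b, c counts defined above."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # relevant and retrieved
    b = len(relevant - retrieved)   # relevant but missed
    c = len(retrieved - relevant)   # retrieved but not relevant
    p = a / (a + c) if a + c else 0.0
    r = a / (a + b) if a + b else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical result: 4 documents retrieved, 6 actually relevant, 2 in common
p, r, f = precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
print(p, round(r, 3), round(f, 3))   # 0.5 0.333 0.4
```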
Training Data

Problem: which documents are actually relevant, and
which are not


Usual solution: human judges
Create a corpus of documents and queries, with humans
deciding which documents are relevant to which queries

TREC (Text REtrieval Conference) Benchmark
Web Crawling

Web crawlers are programs that locate and gather
information on the Web

Recursively follow hyperlinks present in known documents, to
find other documents


Starting from a seed set of documents
Fetched documents


Handed over to an indexing system
Can be discarded after indexing, or stored as a cached copy
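
The link-following loop can be sketched without any network access. Here the "Web" is a hypothetical in-memory dictionary, `fetch_links` stands in for downloading a page and extracting its hyperlinks, and the traversal uses an explicit breadth-first queue instead of recursion. All names are assumptions of this example.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Follow links outward from the seed set; return pages in fetch order."""
    queue = deque(dict.fromkeys(seeds))   # seed set, duplicates dropped
    seen = set(queue)
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)               # here the page would go to the indexer
        for link in fetch_links(url):
            if link not in seen:          # never queue a known document twice
                seen.add(link)
                queue.append(link)
    return fetched

# Hypothetical in-memory "Web": page -> pages it links to
web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: web.get(url, []))
print(order)   # ['a', 'b', 'c', 'd']
```

The `seen` set is what keeps the crawler from looping forever on cycles such as a → c → a.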
Browsing

Storing related documents together in a library
facilitates browsing



Users can see not only the requested document but also
related ones.
Browsing is facilitated by a classification system that
organizes logically related documents together.
Organization is hierarchical: classification
hierarchy
A Classification Hierarchy For A Library System
Classification DAG


Documents can reside in multiple places in a
hierarchy in an information retrieval system, since
physical location is not important.
The classification hierarchy is thus a Directed Acyclic
Graph (DAG)
A Classification DAG For A Library Information
Retrieval System
Web Directories

A Web directory is just a classification directory on
Web pages


E.g. Yahoo! Directory, Open Directory project
Issues:



What should the directory hierarchy be?
Given a document, which nodes of the directory are categories
relevant to the document
Often done manually

Classification of documents into a hierarchy may be done based on
term similarity
Some slides of this slide set adapted from the
following slides:


Prof. James Allan’s course slides
Extrapolation Methods for Accelerating PageRank Computations by
Sepandar D. Kamvar et al.