Transcript of PowerPoint slides

CS 430: Information Discovery
Lecture 9
Term Weighting and Ranking
1
Course Administration
2
Building an Index
[Figure: the indexing pipeline, from Frakes, page 7. Documents (text) → assign document IDs → documents (with document numbers and *field numbers) → break into words → words → stoplist → non-stoplist words → stemming* → stemmed words → term weighting* → terms with weights → Index database. *Indicates an optional operation.]
3
Term weighting
Zipf's Law: if the words w in a collection are ranked by frequency, with rank r(w) and frequency f(w), they roughly fit the relation:
r(w) * f(w) = c
This suggests that some terms are more effective than others in retrieval.
In particular, relative frequency is a useful measure: it identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency.
Term weights are functions used to quantify these concepts.
4
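The relation can be checked empirically by counting words and multiplying each word's rank by its frequency. A minimal sketch in Python, using a made-up toy word list (a real collection is needed to see the near-constant product):

```python
from collections import Counter

# Toy word list (hypothetical); real collections fit Zipf far more closely.
words = ("the cat sat on the mat the dog sat on the log "
         "the cat and the dog met on the mat").split()

counts = Counter(words)
ranked = counts.most_common()          # rank 1 = most frequent word
products = [rank * freq for rank, (word, freq) in enumerate(ranked, start=1)]
# For a large collection the products r(w) * f(w) would be roughly constant.
```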
Inverse Document Frequency
Weighting
Principle:
(a) Weight is proportional to the number of times that the term
appears in the document
(b) Weight is inversely proportional to the number of documents
that contain the term:
Very simple approach:
wij = fij / dj
Where: wij is the weight given to term j in document i
fij is the frequency with which term j appears in document i
dj is the number of documents that contain term j
5
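A sketch of this very simple scheme, using a small hypothetical collection (the document and term names are invented for illustration):

```python
# Hypothetical toy collection: document -> term frequencies.
docs = {
    "d1": {"apple": 3, "fruit": 1},
    "d2": {"apple": 1, "banana": 2},
    "d3": {"banana": 4, "fruit": 2},
}

def doc_frequency(term):
    """d_j: the number of documents that contain the term."""
    return sum(1 for counts in docs.values() if term in counts)

def simple_weight(doc, term):
    """w_ij = f_ij / d_j  (the slide's very simple scheme)."""
    f_ij = docs[doc].get(term, 0)
    return f_ij / doc_frequency(term)
```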
Term Frequency
Concept
A term that appears many times within a document is
likely to be more important than a term that appears
only once.
6
Term Frequency
Suppose term j appears fij times in document i
Let mi = maxj (fij)
i.e., mi is the maximum frequency of any term in document i
Term frequency (TF):
tfij = fij / mi
7
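A minimal sketch of this TF computation, assuming the term counts for one document are already available as a dictionary (the terms and counts are invented):

```python
def term_frequency(doc_counts):
    """tf_ij = f_ij / m_i, where m_i is the largest term count in document i."""
    m_i = max(doc_counts.values())
    return {term: f / m_i for term, f in doc_counts.items()}

# Hypothetical term counts for one document.
tf = term_frequency({"apple": 3, "fruit": 1, "pie": 2})
# The most frequent term always gets tf = 1.0.
```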
Inverse Document Frequency
Concept
A term that occurs in a few documents is likely to be a
better discriminator than a term that appears in most or all
documents.
8
Inverse Document Frequency
For term j
number of documents = n
document frequency (number of documents in
which term j occurs) = dj
One possible measure is: n/dj
But this over-emphasizes small differences. Therefore a
more useful definition is:
Inverse document frequency (IDF):
idfj = log2 (n/dj) + 1
9
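The IDF definition translates directly into code. A sketch (values for the slide's n = 1,000 example check out, e.g. dj = 100 gives about 4.32):

```python
import math

def idf(n, d_j):
    """idf_j = log2(n / d_j) + 1."""
    return math.log2(n / d_j) + 1

# Rare terms score higher than terms that occur everywhere.
rare = idf(1000, 100)       # about 4.32
ubiquitous = idf(1000, 1000)  # exactly 1.0
```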
Example of Inverse Document Frequency
Example: n = 1,000 documents

term k    dk      idfk
A           100   4.32
B           500   2.00
C           900   1.13
D         1,000   1.00

From: Salton and McGill
10
Weighting
Practical experience has demonstrated that weights of the
following form perform well in a wide variety of circumstances:
(Weight of term j in document i)
= (Term frequency) * (Inverse document frequency)
The standard weighting scheme is:
wij = tfij * idfj
= (fij / mi) * (log2 (n/dj) + 1)
Frakes, Chapter 14 discusses many variations on this basic
scheme.
11
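The standard weighting scheme can be sketched end to end. A minimal example on an invented mini-collection (document and term names are made up for illustration):

```python
import math

# Hypothetical mini-collection: document -> term counts.
docs = {
    "d1": {"apple": 3, "fruit": 1},
    "d2": {"apple": 1, "banana": 2},
    "d3": {"banana": 4, "fruit": 2},
}
n = len(docs)

def tfidf(doc, term):
    """w_ij = (f_ij / m_i) * (log2(n / d_j) + 1)."""
    counts = docs[doc]
    f_ij = counts.get(term, 0)
    if f_ij == 0:
        return 0.0
    m_i = max(counts.values())                              # max term count in doc
    d_j = sum(1 for c in docs.values() if term in c)        # document frequency
    return (f_ij / m_i) * (math.log2(n / d_j) + 1)
```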
Page Rank Algorithm (Google)
Concept:
The rank of a web page is higher if many pages link to it.
Links from highly ranked pages are given greater weight
than links from less highly ranked pages.
12
Intuitive Model
A user:
1. Starts at a random page on the web
2. Selects a random hyperlink from the current page and
jumps to the corresponding page
3. Repeats Step 2 a very large number of times
Pages are ranked according to the relative frequency with
which they are visited.
13
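This random-surfer process can be simulated directly. A sketch, using a small invented six-page web (the link structure here is made up and is not the example from the later slides):

```python
import random
from collections import Counter

# Hypothetical link structure: page -> pages it links to.
links = {
    "P1": ["P2"],
    "P2": ["P1", "P3"],
    "P3": ["P4"],
    "P4": ["P1", "P2", "P3", "P5"],
    "P5": ["P4"],
    "P6": ["P4", "P5"],
}

random.seed(0)
page = random.choice(list(links))      # 1. start at a random page
visits = Counter()
for _ in range(100_000):               # 3. repeat many times
    page = random.choice(links[page])  # 2. follow a random hyperlink
    visits[page] += 1

# Rank pages by relative visit frequency.
ranking = [p for p, _ in visits.most_common()]
```

Note that P6, which no page links to, is never visited; the damped model on a later slide fixes this kind of problem.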
Page Ranks
[Figure: link matrix for pages P1 to P6. Rows are the cited pages, columns are the citing pages; a 1 marks a link from the citing page to the cited page, and a Number row gives the number of links from each citing page.]
14
Normalize by Number of Links from Page
[Figure: the same link matrix with each entry divided by the number of links from the citing page (entries such as 1, 0.5, and 0.25), so that each column sums to 1. The resulting matrix is called B.]
15
Weighting of Pages
Initially all pages have weight 1:
w1 = (1, 1, 1, 1, 1, 1)
Recalculate weights:
w2 = Bw1
[Figure: for the example matrix, the components of w2 are 1.75, 0.25, 0.50, 2.25, 0.50, and 0.75.]
16
Page Ranks (Basic Algorithm)
Iterate until wk = Bwk-1
This w is the principal eigenvector of B.
It ranks the pages by the links to them, normalized by the number of citations from each page and weighted by the ranking of the citing pages.
Basic algorithm:
• calculates the ranks for all pages
• lists hits in page rank order
17
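The iteration wk = Bwk-1 is a power iteration, and a few lines of code suffice to sketch it. This uses an invented three-page web, not the slides' six-page example:

```python
# Hypothetical 3-page web: page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

# B is column-stochastic: B[cited][citing] = 1 / (links out of citing page).
B = [[0.0] * n for _ in range(n)]
for citing, cited_pages in links.items():
    for cited in cited_pages:
        B[cited][citing] = 1.0 / len(cited_pages)

w = [1.0] * n                          # initially all pages have weight 1
for _ in range(100):                   # iterate until w_k ≈ B w_{k-1}
    w_next = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
    if max(abs(a - b) for a, b in zip(w_next, w)) < 1e-9:
        break
    w = w_next
# w now approximates the principal eigenvector of B.
```

Because each column of B sums to 1, the total weight (here 3) is preserved at every step.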
Google PageRank Model
A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page and jumps to it
2b. With probability 1-p, selects a random hyperlink from the
current page and jumps to the corresponding page
3. Repeats Steps 2a and 2b a very large number of times
Pages are ranked according to the relative frequency with
which they are visited.
18
Google: PageRank
The Google PageRank algorithm is usually written with the following notation:
Page A has pages T1, ..., Tn pointing to it.
– d: damping factor
– C(A): number of links out of A
Iterate until convergence:
P(A) = (1 - d) + d * [ P(T1)/C(T1) + ... + P(Tn)/C(Tn) ]
19
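The damped iteration can be sketched directly from this formula. A minimal example on an invented three-page web (the page names and link structure are made up for illustration):

```python
# Hypothetical link structure: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.85                                   # damping factor
pages = list(links)
P = {p: 1.0 for p in pages}                # initial ranks

def out_degree(p):
    """C(p): number of links out of page p."""
    return len(links[p])

for _ in range(100):
    P_new = {}
    for page in pages:
        # The pages T_i pointing to `page`.
        incoming = [t for t in pages if page in links[t]]
        P_new[page] = (1 - d) + d * sum(P[t] / out_degree(t) for t in incoming)
    if max(abs(P_new[p] - P[p]) for p in pages) < 1e-9:
        break
    P = P_new
```

With this formulation the ranks sum to the number of pages, and pages with more (and better-ranked) in-links end up with higher ranks.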
Compare TF.IDF to PageRank
With TF.IDF, documents are ranked according to how well they
match a specific query.
With PageRank, pages are ranked in order of importance, with
no reference to a specific query.
20