csci4220-week10x
Rensselaer Polytechnic Institute
CSCI-4220 – Network Programming
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition
by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
How do we make acquired documents searchable?
How do we best convert documents to their index terms?
Simplest approach is find, which requires no text transformation
Useful in user applications, but not in search (why?)
Optional transformation handled during the find operation: case sensitivity
English documents are predictable:
The top two most frequently occurring words are “the” and “of” (10% of word occurrences)
The top six most frequently occurring words account for 20% of word occurrences
The top fifty most frequently occurring words account for 50% of word occurrences
Given all unique words in a (large) document, approximately 50% occur only once
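These proportions are easy to check on any corpus. Below is a minimal Python sketch (not from the slides); the tokenization (lowercasing and keeping letter runs only) is a simplifying assumption, and the file name is hypothetical.

    import re
    from collections import Counter

    def coverage(path, ks=(2, 6, 50)):
        # Count word occurrences with a naive tokenizer: lowercase, letter runs only
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"[a-z]+", f.read().lower())
        counts = Counter(words)
        total = sum(counts.values())
        ranked = [n for _, n in counts.most_common()]  # frequencies, most common first
        for k in ks:
            print(f"top {k:3d} words cover {sum(ranked[:k]) / total:.1%} of occurrences")
        once = sum(1 for n in counts.values() if n == 1)
        print(f"{once / len(counts):.1%} of unique words occur only once")

    # coverage("ap89_sample.txt")   # hypothetical file name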
George Kingsley Zipf
(1902-1950)
Zipf’s law:
Rank words in order of decreasing frequency
The rank (r) of a word times its frequency (f) is approximately equal to a constant (k):
r × f ≈ k
In other words, the frequency of the rth most common word is inversely proportional to r
The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document
Revise Zipf’s law as:
r × Pr ≈ c
For English, c ≈ 0.1
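A quick way to spot-check this (a sketch of mine, not from the slides) is to rank words by frequency and print r × Pr for a few ranks; under Zipf’s law the product stays roughly constant, around 0.1 for English text.

    from collections import Counter

    def zipf_check(words, ranks=(1, 5, 10, 50, 100)):
        # words: a list of tokens; prints r * Pr for the selected ranks
        counts = Counter(words)
        total = sum(counts.values())
        ranked = counts.most_common()      # [(word, freq), ...] by decreasing frequency
        for r in ranks:
            if r <= len(ranked):
                word, freq = ranked[r - 1]
                pr = freq / total          # probability of occurrence of this word
                print(f"r={r:4d}  word={word!r:>12}  r*Pr={r * pr:.3f}")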
Verify Zipf’s law using the AP89 dataset:
Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):
Total documents: 84,678
Total word occurrences: 39,749,179
Vocabulary size: 198,763
Words occurring > 1000 times: 4,169
Words occurring once: 70,064
[Table: top 50 words of AP89]
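The table above is a set of collection statistics; a sketch like the following (my own, assuming each document arrives as a list of tokens) computes the same quantities for any corpus.

    from collections import Counter

    def collection_stats(docs):
        # docs: iterable of token lists; returns AP89-style collection statistics
        counts = Counter()
        total_docs = 0
        for tokens in docs:
            total_docs += 1
            counts.update(tokens)
        return {
            "Total documents": total_docs,
            "Total word occurrences": sum(counts.values()),
            "Vocabulary size": len(counts),
            "Words occurring > 1000 times": sum(1 for n in counts.values() if n > 1000),
            "Words occurring once": sum(1 for n in counts.values() if n == 1),
        }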
As the corpus grows, so does vocabulary size
Fewer new words appear when the corpus is already large
The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and is called Heaps’ law:
v = k × n^β
Constants k and β vary; typically 10 ≤ k ≤ 100 and β ≈ 0.5
[Figure: Heaps’ law fit to web pages crawled from .gov in early 2004; note the values of k and β]
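As a formula, Heaps’ law is a one-liner; the sketch below (mine) shows it as a prediction. The constants k ≈ 63 and β ≈ 0.46 in the comment are illustrative guesses, not values given in the slides, chosen so the prediction lands near AP89’s observed vocabulary of 198,763.

    def heaps_vocab(n, k, beta):
        # Predicted vocabulary size for a corpus of n word occurrences: v = k * n^beta
        return k * n ** beta

    # Example with the AP89 totals from the earlier slide (k and beta are illustrative):
    # heaps_vocab(39_749_179, k=63, beta=0.46)  -> roughly 197,000, vs. observed 198,763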
Word occurrence statistics can be used to estimate the result set size of a user query
Aside from stop words, how many pages contain all of the query terms?
▪ To figure this out, first assume that words occur independently of one another
▪ Also assume that the search engine knows N, the number of documents it indexes
Given three query terms a, b, and c
The probability of a document containing all three is the product of the individual probabilities for each query term:
P(a ∩ b ∩ c) = P(a) × P(b) × P(c)
P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring
We assume the search engine knows the number of documents that a word occurs in
Call these na, nb, and nc
▪ Note that the book uses fa, fb, and fc
Estimate individual query term probabilities:
P(a) = na / N
P(b) = nb / N
P(c) = nc / N
Given P(a), P(b), and P(c), we estimate the result set size as:
nabc = N × (na / N) × (nb / N) × (nc / N)
nabc = (na × nb × nc) / N²
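This estimate translates directly into code; the sketch below is mine, and the document counts passed in the comment are placeholders except for aquarium’s 26,480, which appears later in the slides.

    def estimate_result_size(N, term_doc_counts):
        # Independence assumption: n_abc = N * (n_a/N) * (n_b/N) * (n_c/N)
        #                                = (n_a * n_b * n_c) / N^2 for three terms
        estimate = N
        for n_term in term_doc_counts:
            estimate *= n_term / N
        return round(estimate)

    # Hypothetical document counts for "tropical" and "fish" on a GOV2-sized index;
    # only aquarium's 26,480 and N = 25,205,179 come from the slides:
    # estimate_result_size(25_205_179, [100_000, 1_000_000, 26_480])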
This estimation sounds good, but it is lacking due to our query term independence assumption
Using the GOV2 dataset (N = 25,205,179), the estimates are poor because of the query term independence assumption
Word co-occurrence data could be used instead...
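One hedged reading of “use word co-occurrence data” (my interpretation, not a formula stated in the slides): if the index records how many documents contain both a and b (call it nab), treat that pair as a single event and only assume independence between the pair and c.

    def estimate_with_pair(N, n_ab, n_c):
        # n_abc = N * (n_ab/N) * (n_c/N) = n_ab * n_c / N
        # n_ab is an observed co-occurrence count, so only one independence assumption remains
        return round(n_ab * n_c / N)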
Extrapolate based on the size of the current result set:
The current result set is the subset of documents that have been ranked thus far
Let C be the number of documents found thus far containing all the query words
Let s be the proportion of the total documents ranked (use the least frequently occurring term)
Estimate result set size via nabc = C / s
Given example query: tropical fish aquarium
The least frequently occurring term is aquarium (which occurs in 26,480 documents)
After ranking 3,000 documents, 258 documents contain all three query terms
Thus, nabc = C / s = 258 / (3,000 ÷ 26,480) ≈ 2,277
After processing 20% of the documents, the estimate is 1,778
▪ Which overshoots the actual value of 1,529
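The extrapolation itself is one line of arithmetic; this sketch (mine) reproduces the worked example above.

    def extrapolate_result_size(found_so_far, ranked_so_far, least_frequent_term_count):
        # Estimate result set size as C / s, where s is the proportion of the least
        # frequently occurring term's documents that have been ranked so far
        s = ranked_so_far / least_frequent_term_count
        return round(found_so_far / s)

    # Numbers from the slides: 258 matches after ranking 3,000 of aquarium's 26,480 documents
    # extrapolate_result_size(258, 3_000, 26_480)  -> 2277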