
Rensselaer Polytechnic Institute
CSCI-4220 – Network Programming
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition
by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
How do we make acquired documents searchable?
How do we best convert documents to their index terms?

Simplest approach is find, which requires no text transformation
• Useful in user applications, but not in search (why?)
• Optional transformation handled during the find operation: case sensitivity
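
As a rough illustration (not from the slides), a minimal find in Python, with case sensitivity handled as an optional transformation during the operation itself:

    # Minimal sketch of the "find" approach: scan raw text with no
    # prior transformation; optionally fold case during the call.
    def find(text, query, case_sensitive=False):
        if not case_sensitive:
            text, query = text.lower(), query.lower()
        return text.find(query)        # index of first match, or -1

    print(find("Tropical Fish Aquarium", "fish"))   # 9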

English documents are predictable:
• Top two most frequently occurring words are “the” and “of” (10% of word occurrences)
• Top six most frequently occurring words account for 20% of word occurrences
• Top fifty most frequently occurring words account for 50% of word occurrences
• Given all unique words in a (large) document, approximately 50% occur only once
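
A quick way to check these statistics on any corpus (a hedged sketch: "corpus.txt" is a hypothetical file name, and whitespace splitting is a crude stand-in for real tokenization):

    from collections import Counter

    words = open("corpus.txt").read().lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    ranked = counts.most_common()      # [(word, freq), ...] by decreasing freq

    for k in (2, 6, 50):
        share = sum(f for _, f in ranked[:k]) / total
        print(f"top {k:>2} words cover {share:.1%} of occurrences")

    singletons = sum(1 for f in counts.values() if f == 1)
    print(f"{singletons / len(counts):.1%} of unique words occur once")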

George Kingsley Zipf
(1902-1950)
Zipf’s law:
• Rank words in order of decreasing frequency
• The rank (r) of a word times its frequency (f) is approximately equal to a constant (k):
  r × f ≈ k
• In other words, the frequency of the rth most common word is inversely proportional to r
• The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document
• Revise Zipf’s law as:
  r × Pr ≈ c   (for English, c ≈ 0.1)
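
To eyeball Zipf’s law on the same word counts (a sketch under the same assumptions as above), print r × f and r × Pr for the top-ranked words and look for a roughly constant value:

    from collections import Counter

    words = open("corpus.txt").read().lower().split()
    counts = Counter(words)
    total = sum(counts.values())

    for r, (word, f) in enumerate(counts.most_common(10), start=1):
        print(f"{r:>2}  {word:<12}  r*f = {r * f:>10,}  r*Pr = {r * f / total:.3f}")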

Verify Zipf’s law using the AP89 dataset:
• Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):

  Total documents                      84,678
  Total word occurrences           39,749,179
  Vocabulary size                     198,763
  Words occurring > 1000 times          4,169
  Words occurring once                 70,064

[Table: top 50 words of AP89]

As the corpus grows, so does vocabulary size
• Fewer new words when the corpus is already large

The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and called Heaps’ law:
  v = k × n^β
• Constants k and β vary
• Typically 10 ≤ k ≤ 100 and β ≈ 0.5

[Figure: vocabulary growth of web pages crawled from .gov in early 2004; note the fitted values of k and β]
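
Given a handful of (corpus size, vocabulary size) measurements, k and β can be fit by ordinary linear regression in log-log space, since log v = log k + β log n. A sketch with made-up sample points (not the .gov data):

    import math

    samples = [(10_000, 3_200), (100_000, 10_500), (1_000_000, 33_000)]
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(v) for _, v in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)

    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    k = math.exp(my - beta * mx)
    print(f"k ≈ {k:.1f}, β ≈ {beta:.2f}")   # here k ≈ 30, β ≈ 0.51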

Word occurrence statistics can be used to estimate the result set size of a user query
• Aside from stop words, how many pages contain all of the query terms?
▪ To figure this out, first assume that words occur independently of one another
▪ Also assume that the search engine knows N, the number of documents it indexes

Given three query terms a, b, and c
• Probability of a document containing all three is the product of individual probabilities for each query term:
  P(a ∩ b ∩ c) = P(a) × P(b) × P(c)
• P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring

We assume the search engine knows the number of documents that a word occurs in
• Call these na, nb, and nc
▪ Note that the book uses fa, fb, and fc

Estimate individual query term probabilities:
• P(a) = na / N   P(b) = nb / N   P(c) = nc / N

Given P(a), P(b), and P(c), we estimate the result set size as:
  nabc = N × (na / N) × (nb / N) × (nc / N)
       = (na × nb × nc) / N²
• This estimate sounds good, but it falls short because of our query term independence assumption
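
Putting the estimate into code (a sketch: N is the GOV2 collection size from the next slide; na and nb are made-up document frequencies, while nc = 26,480 is the aquarium count used later):

    N = 25_205_179                            # documents indexed (GOV2)
    na, nb, nc = 1_000_000, 500_000, 26_480   # na, nb are hypothetical

    n_abc = (na * nb * nc) / N ** 2
    print(f"estimated result set size: {n_abc:,.0f}")   # ≈ 21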

Using the GOV2 dataset with N = 25,205,179
• Poor results, because of the query term independence assumption
• Could use word co-occurrence data...

Extrapolate based on the size of the current result set:
• The current result set is the subset of documents that have been ranked thus far
• Let C be the number of documents found thus far containing all the query words
• Let s be the proportion of the total documents ranked (use the least frequently occurring term)
• Estimate result set size via nabc = C / s

Given example query: tropical fish aquarium
• Least frequently occurring term is aquarium (which occurs in 26,480 documents)
• After ranking 3,000 documents, 258 documents contain all three query terms
• Thus, nabc = C / s = 258 / (3,000 ÷ 26,480) ≈ 2,277
• After processing 20% of the documents, the estimate is 1,778
▪ Which overshoots the actual value of 1,529
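
The same arithmetic as a small reusable sketch (the function name and signature are my own, not from the slides):

    # Extrapolation estimate nabc = C / s, with s taken as the fraction of
    # the least frequent term's documents that have been ranked so far.
    def estimate_result_size(C, ranked, least_frequent_df):
        s = ranked / least_frequent_df
        return C / s

    print(round(estimate_result_size(258, 3_000, 26_480)))   # 2277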