Scalable Methods for Estimating Document Frequencies
Download
Report
Transcript Scalable Methods for Estimating Document Frequencies
Scalable Methods for Estimating
Document Frequencies of
Collocations in Databases
Tan Yee Fan
2006 December 15
WING Group Meeting
Motivation
Input:
Output:
A list of items, L
For each a, b in L, the
cooccurrence value
between a and b
Often done by querying
some database for
document frequencies
e.g. f(a), f(b) and f(a b)
Many cooccurrence
measures need f(a b)
Problem Statement
Input
Output
A list of items, L
For each a, b in L, the document frequency
f(a b) in database
Naïve pairwise algorithm need O(n2) queries
Not scalable (e.g. n ~ 1000)
Bandwidth issues and server overload
Related Work
C-PANKOW (Cimiano et al., 2005)
POLYPHONET (Matsuo et al., 2006)
Matching named entities to concepts
Building a social network
Avoid pairwise queries as far as possible
Both C-PANKOW and POLYPHONET perform
“document processing” to achieve this goal
Is document processing really necessary?
Related Work
QProber
(Ipeirotis, 2002)
Obtain a sample of
documents from
database
Select some words to
query and fit a power law
curve
Estimate document
frequencies of the rest
Figure from Ipeirotis (2002)
This Project
Extend QProber algorithm to collocations
Algorithm
Obtain a sample of documents from database
Select some collocations to query and fit a power
law curve
Estimate document frequencies of the rest
Query Selection Strategy
Query selection strategy
For each word w, order collocations in sampled
documents containing w by rank
Uniformly select q collocations to query
decreasing rank
Use O(qn) queries, with q << n
Experiment
Database of 2000 newsgroup articles
Evaluated on a lexicon of 100 words
Vary sample size s and number of queries q
Conclusion
Possible to estimate document frequencies of
collocations reliably using O(n) queries
Next step
Can the methods be applied to disambiguating
author names, publication venue titles, etc.?
Additional Slides
Estimating Actual Document Frequencies
Alternative method
For each word w, fit a power law curve using the
collocations containing w
Estimation for unknown collocation w1 w2:
Average the values estimated from the curve of w1 and
the curve of w2
Problem
Quality of each curve is not as good as lesser
training examples used
Query Selection Strategy
Alternative strategy
Uniform selection of
collocations to query
without regard to
frequencies
Problem
Together with alternative
method, can produce
large errors due to
selection of collocations
at the tail of the power
law curve to query