Scalable Methods for Estimating Document Frequencies

Download Report

Transcript Scalable Methods for Estimating Document Frequencies

Scalable Methods for Estimating
Document Frequencies of
Collocations in Databases
Tan Yee Fan
2006 December 15
WING Group Meeting
Motivation

Input:


Output:


A list of items, L
For each a, b in L, the
cooccurrence value
between a and b
Often done by querying
some database for
document frequencies

e.g. f(a), f(b) and f(a  b)

Many cooccurrence
measures need f(a  b)
Problem Statement

Input


Output


A list of items, L
For each a, b in L, the document frequency
f(a  b) in database
Naïve pairwise algorithm need O(n2) queries


Not scalable (e.g. n ~ 1000)
Bandwidth issues and server overload
Related Work

C-PANKOW (Cimiano et al., 2005)


POLYPHONET (Matsuo et al., 2006)


Matching named entities to concepts
Building a social network
Avoid pairwise queries as far as possible


Both C-PANKOW and POLYPHONET perform
“document processing” to achieve this goal
Is document processing really necessary?
Related Work

QProber
(Ipeirotis, 2002)



Obtain a sample of
documents from
database
Select some words to
query and fit a power law
curve
Estimate document
frequencies of the rest
Figure from Ipeirotis (2002)
This Project

Extend QProber algorithm to collocations

Algorithm



Obtain a sample of documents from database
Select some collocations to query and fit a power
law curve
Estimate document frequencies of the rest
Query Selection Strategy

Query selection strategy


For each word w, order collocations in sampled
documents containing w by rank
Uniformly select q collocations to query
decreasing rank

Use O(qn) queries, with q << n
Experiment



Database of 2000 newsgroup articles
Evaluated on a lexicon of 100 words
Vary sample size s and number of queries q
Conclusion

Possible to estimate document frequencies of
collocations reliably using O(n) queries

Next step

Can the methods be applied to disambiguating
author names, publication venue titles, etc.?
Additional Slides
Estimating Actual Document Frequencies

Alternative method


For each word w, fit a power law curve using the
collocations containing w
Estimation for unknown collocation w1 w2:


Average the values estimated from the curve of w1 and
the curve of w2
Problem

Quality of each curve is not as good as lesser
training examples used
Query Selection Strategy

Alternative strategy

Uniform selection of
collocations to query
without regard to
frequencies

Problem

Together with alternative
method, can produce
large errors due to
selection of collocations
at the tail of the power
law curve to query