Using TF-IDF to Determine Word Relevance in Document Queries

Juan Ramos
[email protected]
Department of Computer Science, Rutgers University,
23515 BPO Way, Piscataway, NJ, 08855
Information Retrieval Problem


Given corpus D and query q = w_1, w_2, …, w_n,
return documents d that maximize Pr(d | q, D).
Easy to dismiss given widespread use of
query retrieval today (web searches,
database management, etc.)
Approaches to Ad Hoc Retrieval

Probability and Statistics



Naïve Bayes
Approaches include the user’s mindset.
Vector Models



Latent Semantic Indexing
Reduce n-dimensional vector space of documents
Return documents whose distance to query is
small
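
To illustrate the "small distance to query" idea, here is a minimal sketch that ranks documents by cosine distance between raw term-count vectors. It deliberately omits the dimensionality reduction that Latent Semantic Indexing performs; whitespace tokenization and the names cosine_distance and nearest_documents are assumptions made for illustration, not anything from the slides.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def nearest_documents(query, docs, top_k=2):
    """Return indices of the top_k documents closest to the query."""
    q_vec = Counter(query.split())
    distances = [
        (cosine_distance(q_vec, Counter(doc.split())), idx)
        for idx, doc in enumerate(docs)
    ]
    distances.sort()  # smallest distance first, ties by document index
    return [idx for _, idx in distances[:top_k]]

docs = ["cat sat mat", "dog chased cat", "treaty signed council"]
closest = nearest_documents("treaty council", docs, top_k=1)
```

In this toy corpus only the third document shares any term with the query, so it is the one returned.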
TF-IDF Weighting Scheme

Given corpus D, word w, document d,
calculate w_d = f_{w,d} * log(|D| / f_{w,D})


Many varieties of basic mathematical
scheme
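
The basic weight can be sketched in a few lines. This assumes the usual reading of the symbols, which the slide does not spell out: f_{w,d} is the number of times w occurs in d, and f_{w,D} is the number of documents in D that contain w. The natural logarithm, whitespace tokenization, and the function name tf_idf are also assumptions.

```python
import math

def tf_idf(word, doc_tokens, corpus_tokens):
    """Weight of `word` in one document: f_{w,d} * log(|D| / f_{w,D})."""
    f_wd = doc_tokens.count(word)                      # occurrences of w in d
    f_wD = sum(1 for d in corpus_tokens if word in d)  # documents containing w
    if f_wd == 0 or f_wD == 0:
        return 0.0  # word absent from the document or the corpus
    return f_wd * math.log(len(corpus_tokens) / f_wD)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the treaty was signed".split(),
]

common = tf_idf("the", corpus[0], corpus)     # appears in every document
rare = tf_idf("treaty", corpus[2], corpus)    # appears in one document
```

A word that occurs in every document gets weight zero because log(|D| / |D|) = 0, while a word confined to one document is weighted by log |D|.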
Procedure

Scan each d, compute each w_{i,d}, return set
D’ that maximizes Σ_i w_{i,d}
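
The procedure might look like the following sketch: compute document frequencies once, sum the query-word weights for each document, and return the highest-scoring ones. The top_k cutoff, the tie-breaking rule, whitespace tokenization, and the name rank_documents are assumptions; the slide only says to return the set that maximizes the sum.

```python
import math
from collections import Counter

def rank_documents(query, docs, top_k=3):
    """Rank documents by the summed TF-IDF weights of the query words."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # f_{w,D}: number of documents that contain each word
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for idx, tokens in enumerate(tokenized):
        counts = Counter(tokens)  # f_{w,d} for every word in this document
        score = sum(
            counts[w] * math.log(n_docs / doc_freq[w])
            for w in query.split()
            if doc_freq[w] > 0    # skip query words unseen in the corpus
        )
        scores.append((score, idx))

    # highest score first; ties go to the lower document index
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:top_k]]

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the treaty was signed by the council",
]
best = rank_documents("treaty council", docs, top_k=1)
```

Precomputing the document frequencies keeps the scan to one pass over the corpus per query rather than one pass per query word.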
Experiment



Documents from Linguistic Data
Consortium’s United Nations Parallel
Text Corpus
Support noise by enforcing case-sensitivity, no parsing of SGML symbols
Brute-force approach: consider only f_{w,d}
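
One way to read the brute-force baseline is a score built from raw term counts alone, with no inverse-document-frequency factor. The sketch below follows that reading and mirrors the noisy setup: matching is case-sensitive and markup tokens are left unparsed. The function name brute_force_scores and the whitespace tokenizer are assumptions.

```python
def brute_force_scores(query, docs):
    """Score each document by raw query-term counts f_{w,d} only."""
    query_words = query.split()
    scores = []
    for doc in docs:
        tokens = doc.split()  # case-sensitive; markup tokens stay as-is
        scores.append(sum(tokens.count(w) for w in query_words))
    return scores

docs = ["The the <p> the treaty", "The The treaty"]
scores = brute_force_scores("the treaty", docs)
```

Because "The" and "the" are distinct tokens here, the first document scores 3 and the second scores 1, and the common word "the" contributes most of the first score.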
Results
[Figure: Query Results with TF-IDF; axes labeled Documents and Relevance]
Extensions and Further Research



Genetic TF-IDF: evolve weighting
schemes that compete with TF-IDF.
Hillclimbing, gradient descent TF-IDF.
Cross-language settings: return
documents in a different language than
the query.