Using TF-IDF to Determine Word Relevance in Document Queries
Juan Ramos
[email protected]
Department of Computer Science, Rutgers University,
23515 BPO Way, Piscataway, NJ, 08855
Information Retrieval Problem
Given corpus D and query q = $w_1, w_2, \ldots, w_n$, return documents d that maximize
Pr(d | q, D).
Easy to dismiss given widespread use of
query retrieval today (web searches,
database management, etc.)
Approaches to Ad Hoc Retrieval
Probability and Statistics
Naïve Bayes
Approaches include the user’s mindset.
Vector Models
Latent Semantic Indexing
Reduce n-dimensional vector space of documents
Return documents whose distance to query is
small
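The "distance to query" idea in the vector model can be sketched in a few lines. This is only an illustration under assumptions not stated on the slide: documents and queries are represented as raw term-count vectors, the distance is cosine distance, and no dimensionality reduction (the Latent Semantic Indexing step) is performed. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def cosine_distance(tokens_a, tokens_b):
    """Cosine distance between two bag-of-words term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def closest_document(query_tokens, documents):
    """Index of the document whose vector is nearest to the query vector."""
    return min(range(len(documents)),
               key=lambda i: cosine_distance(query_tokens, documents[i]))

# Hypothetical toy documents, already tokenized.
documents = [
    ["trade", "agreement", "signed"],
    ["security", "council", "resolution"],
]
print(closest_document(["council", "resolution"], documents))
```

A full Latent Semantic Indexing pipeline would first project these vectors into a lower-dimensional space before measuring distance; that step is omitted here.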
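TF-IDF Weighting Scheme
Given corpus D, word w, and document d,
calculate $w_d = f_{w,d} \cdot \log(|D| / f_{w,D})$,
where $f_{w,d}$ is the number of times w appears in d and $f_{w,D}$ is the number of documents in D that contain w.
Many variations of this basic mathematical
scheme exist.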
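A minimal sketch of the weight for a single word and document, assuming the natural logarithm (the slide does not fix the base) and documents that are already split into tokens. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(word, doc_tokens, corpus):
    """Weight of `word` in one document: f_{w,d} * log(|D| / f_{w,D})."""
    term_freq = Counter(doc_tokens)[word]            # f_{w,d}
    doc_freq = sum(1 for d in corpus if word in d)   # f_{w,D}
    if term_freq == 0 or doc_freq == 0:
        return 0.0  # word absent from the document or from the corpus
    return term_freq * math.log(len(corpus) / doc_freq)

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["the", "treaty", "was", "signed"],
    ["the", "council", "met"],
    ["the", "treaty", "the", "council", "the", "vote"],
]

# "the" occurs in every document, so log(3/3) = 0 cancels its high count.
print(tf_idf("the", corpus[2], corpus))
# "vote" occurs in one document only, so its weight is 1 * log(3/1).
print(tf_idf("vote", corpus[2], corpus))
```

The example shows the intended effect of the scheme: a word that is frequent everywhere receives no weight, while a word concentrated in few documents is weighted up.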
Procedure
Scan each d, compute each $w_{i,d}$, return set
D’ that maximizes $\sum_i w_{i,d}$
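The procedure can be sketched as a scoring loop: sum the TF-IDF weights of the query words in each document, then keep the highest-scoring documents. The sketch assumes natural logarithms, pre-tokenized documents, and a fixed result size `top_k`; the slide does not say how large D’ should be, so that parameter, the function name, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def rank_documents(query, corpus, top_k=2):
    """Return indices of the top_k documents by summed TF-IDF of the query words."""
    n_docs = len(corpus)
    # f_{w,D}: number of documents that contain each word.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    scores = []
    for idx, doc in enumerate(corpus):
        counts = Counter(doc)  # f_{w,d} for every word in this document
        score = sum(
            counts[w] * math.log(n_docs / doc_freq[w])
            for w in query
            if doc_freq[w] > 0  # skip query words unseen in the corpus
        )
        scores.append((score, idx))
    # Highest score first; ties broken by document index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:top_k]]

# Hypothetical toy corpus.
corpus = [
    ["security", "council", "resolution"],
    ["trade", "agreement", "council"],
    ["security", "security", "report"],
    ["weather", "report"],
]
print(rank_documents(["security", "resolution"], corpus))
```

Document 0 matches both query words, and document 2 matches only "security" (twice), so they rank first and second in this toy run.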
Experiment
Documents from Linguistic Data
Consortium’s United Nations Parallel
Text Corpus
Support noise by enforcing case-sensitivity and not parsing SGML symbols
Brute-force approach: consider only $f_{w,d}$
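For contrast, the brute-force baseline can be sketched as scoring by raw term frequency alone. This assumes whitespace tokenization with no case folding, which is one way to read the case-sensitivity setting above; the function name, document labels, and texts are hypothetical and are not data from the experiment.

```python
from collections import Counter

def raw_frequency_score(query, text):
    """Sum of raw counts f_{w,d} of the query words; case-sensitive, no IDF term."""
    counts = Counter(text.split())  # split on whitespace only, keep case as-is
    return sum(counts[w] for w in query.split())

# Hypothetical toy documents.
docs = {
    "a": "the the the the report",
    "b": "the treaty text",
    "c": "The Treaty text",
}
query = "the treaty"
scores = {name: raw_frequency_score(query, text) for name, text in docs.items()}
print(scores)
```

In this toy run document "a" outscores "b" despite never mentioning "treaty", because the common word "the" dominates the raw count. Document "c" scores zero because "The" and "Treaty" do not match the lowercase query under case-sensitive comparison.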
Results
[Figure: Query Results with TF-IDF. Axes: Documents, Relevance.]
Extensions and Further Research
Genetic TF-IDF: evolve weighting
schemes that compete with TF-IDF.
Hill-climbing and gradient-descent TF-IDF.
Cross-language settings: return
documents in a different language than the
query.