WI01talk - Carnegie Mellon University
Download
Report
Transcript WI01talk - Carnegie Mellon University
Online Learning for Web Query
Generation: Finding Documents
Matching a Minority Concept on the
Web
Rayid Ghani
Accenture Technology Labs, USA
Rosie Jones
Carnegie Mellon University, USA
Dunja Mladenic
J. Stefan Institute, Slovenia
Motivation
Need a collection of documents matching a
particular concept
Search on the web, modify query, analyze
documents, modify query,…
Repetitive, time-consuming, requires
reasonable familiarity with the concept
Task
Given:
1 Document in Target Concept
1 Other Document (negative example)
Access to a Web Search Engine
Create
a Corpus of the Target
Concept quickly with no human
effort
Algorithm
Seed Docs
Query Generator
WWW
Filter/Classifier
Web
Initial Docs
Word Statistics
Build
Query
Learning
Relevant
Filter
Non-Relevant
Query Generation
Examine current relevant and non-relavent
documents to generate a query likely to find
documents that ARE similar to the relevant
ones and NOT similar to non-relevant ones
A Query consists of m inclusion terms and n
exclusion terms
e.g +intelligence +web –military
Query Term Selection Methods
Uniform (UN) – select k words randomly from
the current vocabulary
Term-Frequency (TF) – select top k words
ranked according to their frequency
Probabilistic TF (PTF) – k words with
probability proportional to their frequency
Query Term Selection Methods
RTFIDF – top k words according to their
rtfidf scores
Odds-Ratio (OR) – top k words according
to their odds-ratio scores
Probabilistic OR (POR) – select k words
with probability proportional to their OddsRatio scores
Query Parameters
4 Parameters
Inclusion Term-Selection Method
Exclusion Term-Selection Method
Inclusion Length
Exclusion Length
Example: Odds-Ratio, rtfidf, 3,6
Experimental Setup
Language: Slovenian
Initial documents: 1 web page in Slovenian, 1
in English
Search engine: Altavista
Evaluation
Goal: Collect as many relevant documents as
possible while minimizing the cost
Cost
Number of total documents retrieved from the Web
Number of distinct Queries issued to the Search Engine
Evaluation Measures
Percentage of retrieved documents that are relevant
Number of relevant documents retrieved per unique query
Fixed Query Parameters
Fix Query Lengths and Vary Term-Selection
Methods
Fix Term-Selection Methods and Vary Query
Lengths
Results (Ghani et al. , SIGIR 2001):
Odds-Ratio works well overall
Long Queries are precise but with low recall
Why Online Learning?
Different Term-Selection Methods Excel with
different Query Lengths
Best Combination of methods and lengths
may change as different parts of the
Web/feature space are explored
Learning Methods
Memory-Less (ML) Learning
Long-Term Memory (LT) Learning
Ignore all history and only use the current performance
Use all of the previous history
Additive Update Rule
Multiplicative Update Rule
Fading Memory (FM) Learning
Use all of the history but with a decay function over time
Results
LTM
LTM
Memory-Less
Memory-Less
Results
Further Experiments
Other Languages
Keywords
Similar results with Croatian, Czech and Tagalog
Similar results when initializing with keywords
instead of documents
Comparison to Altavista’s “More Like This”
Better performance than Altavista’s feature
Conclusions
Successfully able to build corpora for minority
languages (Slovenian, Croatian, Czech,
Tagalog) using Web search engines
Online Learning is useful in adapting to
different parts of the Web space
System and Corpora are/will be available at
www.cs.cmu.edu/~TextLearning/CorpusBuilder