Transcript: WI01 Talk - Carnegie Mellon University

Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web

Rayid Ghani, Accenture Technology Labs, USA
Rosie Jones, Carnegie Mellon University, USA
Dunja Mladenic, J. Stefan Institute, Slovenia
Motivation
- Need a collection of documents matching a particular concept
- Search on the web, modify the query, analyze documents, modify the query, ...
- Repetitive, time-consuming, requires reasonable familiarity with the concept
Task
Given:
- 1 document in the target concept
- 1 other document (negative example)
- Access to a Web search engine
Create a corpus of the target concept quickly with no human effort
Algorithm
[Diagram: seed documents provide the initial relevant/non-relevant examples and word statistics; a query generator builds queries from those statistics; queries go to a Web search engine; retrieved documents pass through a filter/classifier into relevant and non-relevant sets, which feed back into query learning.]
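The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation; generate_query, search_web, and classify are caller-supplied stand-ins for the query generator, the search-engine interface, and the filter/classifier boxes in the figure.

```python
# Rough sketch of the corpus-building loop in the diagram above.
def build_corpus(seed_relevant, seed_nonrelevant,
                 generate_query, search_web, classify, n_iterations=100):
    relevant = list(seed_relevant)        # starts with 1 positive document
    nonrelevant = list(seed_nonrelevant)  # starts with 1 negative document
    for _ in range(n_iterations):
        # Build a query from word statistics of the current two sets.
        query = generate_query(relevant, nonrelevant)
        # Issue the query to the Web search engine and fetch the hits.
        for doc in search_web(query):
            # Filter/classify each retrieved document; both sets feed
            # back into the next round of query generation.
            if classify(doc, relevant, nonrelevant):
                relevant.append(doc)
            else:
                nonrelevant.append(doc)
    return relevant
```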
Query Generation
- Examine current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to the non-relevant ones
- A query consists of m inclusion terms and n exclusion terms
- e.g. +intelligence +web -military
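Rendering the selected terms in the +term / -term syntax shown above is straightforward; a minimal sketch:

```python
def format_query(inclusion_terms, exclusion_terms):
    """Render m inclusion and n exclusion terms in the +term / -term syntax."""
    plus = " ".join("+" + t for t in inclusion_terms)
    minus = " ".join("-" + t for t in exclusion_terms)
    return (plus + " " + minus).strip()

print(format_query(["intelligence", "web"], ["military"]))
# -> "+intelligence +web -military"
```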
Query Term Selection Methods
- Uniform (UN): select k words randomly from the current vocabulary
- Term-Frequency (TF): select the top k words ranked by their frequency
- Probabilistic TF (PTF): select k words with probability proportional to their frequency
Query Term Selection Methods (continued)
- RTFIDF: select the top k words by their rtfidf score
- Odds-Ratio (OR): select the top k words by their odds-ratio score
- Probabilistic OR (POR): select k words with probability proportional to their odds-ratio score
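The slides do not spell out the scoring formulas. The sketch below uses the standard odds-ratio definition (log odds of a word in relevant documents over its odds in non-relevant documents) with add-one smoothing; the smoothing and any tie-breaking are assumptions, not necessarily the exact formulation used in the talk.

```python
import math
from collections import Counter

def odds_ratio_top_k(relevant_docs, nonrelevant_docs, k):
    """Pick the top-k terms by odds-ratio score.

    OR(w) = log( P(w|rel) * (1 - P(w|non)) / ((1 - P(w|rel)) * P(w|non)) )
    Documents are lists of tokens; document frequencies are smoothed.
    """
    def doc_freq(docs):
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        return df

    rel_df, non_df = doc_freq(relevant_docs), doc_freq(nonrelevant_docs)
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    vocab = set(rel_df) | set(non_df)

    def score(w):
        p_rel = (rel_df[w] + 1) / (n_rel + 2)   # smoothed P(w | relevant)
        p_non = (non_df[w] + 1) / (n_non + 2)   # smoothed P(w | non-relevant)
        return math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))

    return sorted(vocab, key=score, reverse=True)[:k]
```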
Query Parameters
A query is specified by 4 parameters:
- Inclusion term-selection method
- Exclusion term-selection method
- Inclusion length
- Exclusion length
Example: Odds-Ratio, rtfidf, 3, 6
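The online learner chooses among combinations of these four parameters. A minimal sketch of how that space might be enumerated; the method names match the slides, but the length grid is illustrative, not the exact one used in the experiments.

```python
from itertools import product

TERM_SELECTION_METHODS = ["UN", "TF", "PTF", "RTFIDF", "OR", "POR"]
QUERY_LENGTHS = [1, 2, 3, 4, 6, 10]  # illustrative values only

# Each tuple (inclusion method, exclusion method, inclusion length,
# exclusion length) is one candidate query type, e.g.
# ("OR", "RTFIDF", 3, 6) as in the slide's example.
PARAMETER_COMBINATIONS = list(product(
    TERM_SELECTION_METHODS, TERM_SELECTION_METHODS,
    QUERY_LENGTHS, QUERY_LENGTHS))
```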
Experimental Setup
- Language: Slovenian
- Initial documents: 1 web page in Slovenian, 1 in English
- Search engine: AltaVista
Evaluation
- Goal: collect as many relevant documents as possible while minimizing the cost
- Cost:
  - Number of total documents retrieved from the Web
  - Number of distinct queries issued to the search engine
- Evaluation measures:
  - Percentage of retrieved documents that are relevant
  - Number of relevant documents retrieved per unique query
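Both measures can be computed from simple counts kept while the system runs; a minimal sketch, with illustrative numbers in the usage line:

```python
def evaluation_measures(n_relevant, n_retrieved, n_unique_queries):
    """Precision over all retrieved documents, and relevant documents
    retrieved per unique query issued to the search engine."""
    precision = n_relevant / n_retrieved if n_retrieved else 0.0
    per_query = n_relevant / n_unique_queries if n_unique_queries else 0.0
    return precision, per_query

# e.g. 120 relevant documents out of 400 retrieved, using 50 distinct queries
print(evaluation_measures(120, 400, 50))  # -> (0.3, 2.4)
```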
Fixed Query Parameters
- Fix query lengths and vary term-selection methods
- Fix term-selection methods and vary query lengths
Results (Ghani et al., SIGIR 2001):
- Odds-Ratio works well overall
- Long queries are precise but have low recall
Why Online Learning?
- Different term-selection methods excel with different query lengths
- The best combination of methods and lengths may change as different parts of the Web/feature space are explored
Learning Methods
- Memory-Less (ML) Learning: ignore all history and use only the current performance
- Long-Term Memory (LT) Learning: use all of the previous history
  - Additive update rule
  - Multiplicative update rule
- Fading Memory (FM) Learning: use all of the history, but with a decay function over time
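The slides name the update rules but do not give their equations. The sketch below shows one plausible reading, where each query-parameter combination carries a weight updated from the fraction of relevant documents its last query retrieved; the specific rules and constants are assumptions, not the paper's exact equations.

```python
def memory_less(weights, combo, reward):
    """Memory-Less: the weight is just the most recent performance."""
    weights[combo] = reward

def long_term_additive(weights, combo, reward, eta=0.1):
    """Long-Term Memory, additive rule: accumulate reward over all history."""
    weights[combo] += eta * reward

def long_term_multiplicative(weights, combo, reward, beta=2.0):
    """Long-Term Memory, multiplicative rule: rescale the weight by the reward."""
    weights[combo] *= beta ** reward

def fading_memory(weights, combo, reward, decay=0.9):
    """Fading Memory: exponentially decay old history before adding the new reward."""
    weights[combo] = decay * weights[combo] + (1 - decay) * reward
```

In all cases the next parameter combination would then be chosen based on these weights, for example with probability proportional to a combination's weight.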
Results
[Plots comparing Long-Term Memory (LTM) learning against Memory-Less learning.]
Further Experiments
- Other languages: similar results with Croatian, Czech, and Tagalog
- Keywords: similar results when initializing with keywords instead of documents
- Comparison to AltaVista's "More Like This": better performance than AltaVista's feature
Conclusions
- Successfully built corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines
- Online learning is useful in adapting to different parts of the Web space
- The system and corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder