Distributed Information Retrieval

Server Ranking for Distributed Text Retrieval Systems on the Internet
B. Yuwono and D. Lee

Siemens TREC-4 Report: Further Experiments with Database Merging
E. Voorhees

Brian Shaw
CS 5604
Issue: Merging for Effective Results

- Multiple brokers (which take search queries) and multiple collection servers.
- A broker must select appropriate collection servers and merge their results.
Server Ranking: Overview

- Problem: the "cost" (including the user's time) of broadcasting every query to all servers, and the processing power this consumes.
- Solution: the broker ranks the collection servers by a "goodness score", broadcasts the query to at most σ (sigma) of them (a preset number or a scoring threshold), and merges the results (see the sketch below).

(Paper 1: Server Ranking for Distributed Text Retrieval on the Internet)
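A minimal Python sketch of the dispatch step just described. The Server interface (.goodness() and .search() methods) and the name SIGMA are illustrative assumptions, not details from the paper.

    SIGMA = 3  # preset cap: broadcast to at most sigma collection servers

    def select_and_dispatch(query, servers):
        # Rank collection servers by their goodness score for this query.
        ranked = sorted(servers, key=lambda s: s.goodness(query), reverse=True)
        # Broadcast only to the top sigma servers (a scoring threshold could
        # be used instead of a fixed count) and collect their local result
        # lists, which the broker then merges.
        return [(s, s.search(query)) for s in ranked[:SIGMA]]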
Server Ranking: Server Selection

- Relies solely on document frequency (DF) data; all collection servers must report DF changes to the broker.
- The Cue Validity Variance (CVV) goodness score estimates how well a term j distinguishes one collection server from another; it is not an indication of the quantity or quality of relevant documents (see the sketch below).
(Paper 1: Server Ranking for Distributed Text Retrieval on the Internet)
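A sketch of CVV scoring as defined in the paper: the cue validity CV(i, j) compares term j's density inside collection i against its density in the other collections, CVV(j) is the variance of CV across the C collections, and a server's goodness for a query is the CVV-weighted sum of its document frequencies. The data layout (df as a list of term-to-DF dicts, sizes as collection sizes) is an assumption of this sketch.

    def goodness_scores(query_terms, df, sizes):
        C = len(sizes)  # number of collection servers

        def cue_validity(i, j):
            inside = df[i].get(j, 0) / sizes[i]
            outside = (sum(df[k].get(j, 0) for k in range(C) if k != i)
                       / sum(sizes[k] for k in range(C) if k != i))
            return inside / (inside + outside) if inside + outside else 0.0

        def cvv(j):
            cv = [cue_validity(i, j) for i in range(C)]
            mean = sum(cv) / C
            return sum((v - mean) ** 2 for v in cv) / C  # variance across servers

        # Goodness of server i: sum of CVV(j) * DF(i, j) over the query terms.
        return [sum(cvv(j) * df[i].get(j, 0) for j in query_terms)
                for i in range(C)]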
Server Ranking: Merging

- Assumption 1: the best document in collection i is as relevant as the best document in collection k. A collection server containing only a few, but highly relevant, documents will therefore still contribute to the final list.
- Assumption 2: the distance between two consecutive document ranks is inversely proportional to the goodness score. Relative goodness scores are thus roughly proportional to the number of documents each server contributes to the final list.
- The final ranking combines the goodness scores with the local rankings (see the sketch below).
(Paper 1: Server Ranking for Distributed Text Retrieval on the Internet)
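One way to read Assumption 2 in code: if the gap between consecutive documents from server i is inversely proportional to its goodness score G_i, the r-th local document can be given the global sort key r / G_i, so higher-scoring servers place their documents more densely near the top. This is an illustrative reading, not code from the paper.

    def merge_by_goodness(result_lists, goodness_scores):
        merged = []
        for docs, g in zip(result_lists, goodness_scores):
            for rank, doc in enumerate(docs, start=1):
                merged.append((rank / g, doc))  # higher goodness -> smaller gaps
        # Sort on the global key; ties keep insertion order (Python sort is stable).
        merged.sort(key=lambda pair: pair[0])
        return [doc for _, doc in merged]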
Experiments: Overview

- Problem: the broker has no access to meta-data from isolated collection servers.
- Solution: choose collection server(s) based on the results of previous training queries.
(Paper 2: Further Experiments with Database Merging)
Experiments: Server Selection, Two Approaches

- Query Clustering (QC): cluster the training queries (by the number of common documents they retrieve) and compute each cluster's "centroid vector"; at query time, compare the query vector to the centroids and assign a weight to each collection (see the sketch below).
- Modeling Relevant Document Distributions (MRDD): find the M most similar training queries and assign weights to the collections based on the distribution of relevant documents in those training runs.
(Paper 2: Further Experiments with Database Merging)
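A simplified sketch of the run-time half of QC: the broker holds centroid vectors built offline from the clustered training queries, compares the incoming query vector to them by cosine similarity, and turns the best match per collection into that collection's weight. Representing vectors as term-to-weight dicts and taking the maximum similarity are assumptions of this sketch, not details from the report.

    import math

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    def collection_weights(query_vec, centroids_by_collection):
        # centroids_by_collection: {collection_id: [centroid_vector, ...]}
        return {cid: max(cosine(query_vec, c) for c in centroids)
                for cid, centroids in centroids_by_collection.items()}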
Experiments: Merging

- N documents are retrieved from each server, with N determined by the collection weights.
- The final ranking is a random process: roll a C-faced die biased by the number of documents still to be picked from each of the C collections (see the sketch below).
(Paper 2: Further Experiments with Database Merging)
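A sketch of the biased-die merge: each collection i owes a quota of documents (set by the weighting step), and the final list is built by repeatedly rolling a C-faced die whose faces are weighted by how many documents each collection still has to contribute. The sketch assumes each result list is at least as long as its quota.

    import random

    def biased_die_merge(result_lists, quotas, rng=random):
        remaining = list(quotas)            # documents still to pick per collection
        cursors = [0] * len(result_lists)   # next local rank per collection
        merged = []
        while any(remaining):
            # Roll the C-faced die: face i wins with probability
            # remaining[i] / sum(remaining).
            i = rng.choices(range(len(remaining)), weights=remaining)[0]
            merged.append(result_lists[i][cursors[i]])
            cursors[i] += 1
            remaining[i] -= 1
        return merged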
Comparison

                             Paper 1: Server Ranking          Paper 2: Experiments
Broker's knowledge           Shared document frequency data   Training query results
Collection server selection  CVV goodness scoring             Comparison to training queries
Merging                      Goodness score & local rank      Random (biased die)
Conclusions

- The server ranking method proposed by Yuwono and Lee is an effective way to minimize operating costs (such as the user's time) in an environment where brokers and collection servers can share document frequency data.
- The "isolated merging strategies" proposed by Voorhees are an effective way to choose collection servers when no meta-information is shared between the broker and the collection servers.