Efficient Result-set Merging Across Thousands of Hosts

Download Report

Transcript Efficient Result-set Merging Across Thousands of Hosts

Efficient Result-set Merging
Across Thousands of Hosts
Simulating an Internet-scale GIR
application with the GOV2 Test
Collection
Christopher Fallen
Arctic Region Supercomputing Center
University of Alaska Fairbanks
Overview
• Internet-scale GIR simulation for the
TREC 2006 Terabyte Track
• Using the distribution of relevance
judgments from the TREC 2004-6
Terabyte Tracks to limit result-set length
• Collection ranking
TREC 2006 Terabyte Track
• GOV2 test collection is the publicly available
text from http://*.gov in early 2004
– 426 GB of text
– 25 million documents
• Retrieve ranked result sets for 50 topics
– Top results from each group are placed in the
relevance pool
– 140,000 relevance judgments (qrels) for 149
topics (three years of TREC Terabyte Tracks)
• {qrels} := {(topic, document, relevance)}
A GIR simulation using GOV2
• Suppose each host provides an Index/Search
(I/S) Service
– The query processor (QP) must merge ranked
results from every responding host
• Partition GOV2 into hostnames, create one
index per host
– 17,000 hostnames
– {Mean, median} size = {1,500, 34} documents
– Standard deviation = 18,000 documents
Ranked distribution of GOV2
host sizes
Consequences of search-byhost for result-set merging
• Hundreds of ranked lists (responding hosts)
per topic
– Many short lists
– Several long lists
– Large bandwidth requirement for each query
• Merging algorithms must be robust with
respect to long and short result-sets and the
large number of result-sets
“Logistic Regression” Merge
• Find logistic curve
parameters that
yield the best fit to
the relevance scoreat-rank data
• Parameter
estimates need
several results per
result-set
Distribution of Relevance
Judgments
• For most topics, there are many more nonrelevant qrels than relevant qrels
• Across all topics, the number of non-relevant,
relevant, and very relevant qrels is strongly
correlated with host size
• The hosts that contain relevant qrels also
contain non-relevant qrels
– But the relevant documents are probably near the
top of each host’s result-set!
(# relevant)/(# non-relevant) qrels
Skimming the top n
documents from each host
• Is there a simple functional relationship
between the number of likely relevant
documents in a host (that retrieves any
documents at all) to the size of the
host?
• A proportional model of relevance for
each relevance score r is simple…
(# r qrels from Host) = Cr(topic) * |Host|
Skimming the top n
documents from each host
• … and the constant Cr(topic) of
proportionality can be measured from TREC
Terabyte Track data, then averaged over
topics to get <Cr>
• Does the model adequately describe the
data?
– A posteriori: Does the model describe TREC data?
– A priori: Can the model be used to truncate resultsets based on host size without affecting IR
performance?
Proportional relevance model
• Two-way ANOVA applied to host-by-topic
table of the values
(# r qrels from Host)/|Host| = Cr
• For the relevant and very-relevant qrels only,
the total variance is largely between topics
but not within the topics
– Crel is not sensitive across hosts for a fixed topic
– The standard deviation of Crel across topics is
larger than the mean value so the stdv. can be
used as a conservative estimate and the mean
can be used as an aggressive estimate
Proportional relevance model
• Select the top-performing topics from one run
of the TREC 2006 TB Track
• Truncate the result-set from each host using
the aggressive <Crel> = .0005
• Merge truncated result sets and compare the
IR performance of the merged list with the list
merged from the entire result sets
– No statistically significant difference in P@20
performance of either result-set was observed
after discarding >30% of the results from one set.
Relevance-ranking hosts
• Assume that web documents are grouped
non-arbitrarily by hostname according to
content
• Then many orderings or rankings are
possible on (host, document) pairs
– Dictionary order
– Round-robin
• Truncating the ranked lists of hosts may lead
to increased search efficiency with negligible
IR performance penalty
Retrieve from only 1:5 hosts?
Future work
• Collection ranking
– Maximum document relevance score
– Minimum query projection residual into
reduced collection term-doc subspace