SIGIR01 - Information retrieval


MetaSearch
R. Manmatha,
Center for Intelligent Information Retrieval,
Computer Science Department,
University of Massachusetts, Amherst.
Introduction
• MetaSearch / Distributed Retrieval
– Well-defined problem.
– Language Models are a good way to solve these problems.
• Grand Challenge
– Massively Distributed Multi-lingual Retrieval.
MetaSearch
• Combine results from different search engines.
– Single Database, or Highly Overlapped Databases.
» Example: the Web.
– Multiple Databases or Multi-lingual Databases.
• Challenges
– Incompatible scores, even if the same search engine is used for different databases.
» Collection differences and engine differences.
– Document scores depend on the query. Combination on a per-query basis makes training difficult.
• Current Solutions involve learning how to map scores between different systems.
– An alternative approach involves aggregating ranks.
Current Solutions for MetaSearch – Single Database Case
• Solutions
– Reasonable solutions involve mapping scores, either by simple normalization, by equalizing score distributions, or by training (a sketch follows this slide).
– Rank-based methods, e.g. Borda counts, Markov chains.
– Mapped scores are usually combined using linear weighting.
– Performance improvement is about 5 to 10%.
– Search engines need to be similar in performance.
» May explain why simple normalization schemes work.
• Other Approaches
– A Markov Chain approach has been tried. However, results on standard datasets are not available for comparison.
– It shouldn't be difficult to try more standard LM approaches.
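To make the score-mapping idea concrete, here is a minimal sketch of simple per-engine min-max normalization followed by a weighted linear combination (CombSUM-style fusion). The function names, the toy scores, and the engine weights are illustrative assumptions, not material from the talk.

def min_max_normalize(results):
    """Map one engine's {doc_id: score} dict onto the [0, 1] range."""
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    if hi == lo:  # degenerate case: every document has the same score
        return {doc: 1.0 for doc in results}
    return {doc: (s - lo) / (hi - lo) for doc, s in results.items()}

def linear_combine(runs, weights=None):
    """Weighted linear (CombSUM-style) fusion of normalized runs.

    runs    : list of {doc_id: raw_score} dicts, one per engine.
    weights : optional per-engine weights, e.g. learned from training data.
    """
    if weights is None:
        weights = [1.0] * len(runs)
    combined = {}
    for run, w in zip(runs, weights):
        for doc, s in min_max_normalize(run).items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with made-up scores from two hypothetical engines.
engine_a = {"d1": 12.0, "d2": 7.5, "d3": 3.1}
engine_b = {"d2": 0.92, "d4": 0.85, "d1": 0.10}
print(linear_combine([engine_a, engine_b], weights=[0.6, 0.4]))

The min-max normalization stands in for the "simple normalization" mentioned above; equalizing score distributions or learning the mapping from training data would replace that step while keeping the same linear combination.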
Challenges – MetaSearch for Single Databases
• Can one effectively combine search engines which differ a lot in performance?
– Improve performance even using poorly performing engines? How?
– Or use a resource-selection-like approach to eliminate poorly performing engines on a per-query basis.
• Techniques from other fields.
– Techniques in economics and the social sciences for voter aggregation may be useful (Borda count, Condorcet, etc.); a sketch follows this slide.
• LM approaches
– Will possibly improve performance by characterizing the scores at a finer granularity than, say, score distributions.
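As a concrete illustration of the voter-aggregation idea mentioned above, here is a minimal Borda-count sketch over ranked lists. The point scheme assumed here (rank r out of n earns n - r points, and documents an engine did not return earn nothing) is one common convention chosen only for illustration.

def borda_count(ranked_lists):
    """Aggregate ranked lists of doc ids (best first), one list per engine.

    Assumed convention: in a list of length n, the document at 0-based rank r
    receives n - r points; unreturned documents get no points from that engine.
    """
    points = {}
    for ranking in ranked_lists:
        n = len(ranking)
        for r, doc in enumerate(ranking):
            points[doc] = points.get(doc, 0) + (n - r)
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: three "voters" (engines) ranking a handful of documents.
print(borda_count([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]))

Because it uses only ranks, this kind of aggregation sidesteps the incompatible-score problem entirely, which is why it can combine engines whose score scales differ widely.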
Multiple Databases
• Two main factors determine variation in document scores
– Search engine scoring functions.
– Collection variations which essentially change the IDF.
• Effective score normalization requires
– Disregarding databases which are unlikely to have the answer.
» Resource Selection.
– Normalizing out collection variations on a per-query basis.
– Mostly ad hoc normalizing functions.
• Language Models.
– Resource Descriptions already provide language models for collections.
– Could use these to factor out collection variations (a sketch follows this slide).
– Tricky to do this for different search engines.
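A minimal sketch of the language-model idea on this slide: treat each resource description as term counts, build a smoothed unigram language model per collection, and rank collections by the query's log-likelihood. The Jelinek-Mercer smoothing, the mixing weight, and the toy collections are assumptions made purely for illustration; the talk only notes that resource descriptions already provide collection language models.

import math

LAMBDA = 0.7  # assumed Jelinek-Mercer mixing weight

def unigram_lm(term_counts):
    """Maximum-likelihood unigram model from raw term counts."""
    total = sum(term_counts.values())
    return {t: c / total for t, c in term_counts.items()}

def query_log_likelihood(query_terms, coll_lm, global_lm):
    """Smoothed log P(query | collection)."""
    score = 0.0
    for t in query_terms:
        p = LAMBDA * coll_lm.get(t, 0.0) + (1 - LAMBDA) * global_lm.get(t, 1e-9)
        score += math.log(p)
    return score

# Toy resource descriptions (term counts) for two hypothetical collections.
descriptions = {
    "news":   {"election": 40, "market": 10, "genome": 1},
    "biomed": {"genome": 55, "protein": 30, "election": 2},
}

# Global model pooled over all descriptions, used for smoothing.
pooled = {}
for counts in descriptions.values():
    for t, c in counts.items():
        pooled[t] = pooled.get(t, 0) + c
global_lm = unigram_lm(pooled)

query = ["genome", "protein"]
lms = {name: unigram_lm(counts) for name, counts in descriptions.items()}
ranking = sorted(lms, reverse=True,
                 key=lambda name: query_log_likelihood(query, lms[name], global_lm))
print(ranking)  # collections most likely to contain the answer come first

The same per-collection likelihoods could, in principle, serve as a query-specific normalizer for document scores, though doing that across different search engines remains the tricky part noted above.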
Multi-lingual Databases
• Normalizing scores across multiple databases.
– Difficult Problem
• Possibility:
– Create language models for each database.
– Use simple translation models to map across databases (a sketch follows this slide).
– Use this to normalize scores.
– Difficult.
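As a rough illustration of the translation-model possibility raised above, here is a minimal sketch that maps an English query into a French database's language model via a toy translation table p(target term | source term) and scores the translated query. The translation table, the probabilities, and the smoothing floors are all invented for illustration; the slide only raises the idea and notes that it is difficult.

import math

# Hypothetical English-to-French translation table: p(target term | source term).
translation = {
    "river": {"fleuve": 0.6, "riviere": 0.4},
    "bank":  {"banque": 0.7, "rive": 0.3},
}

def translated_term_prob(term, target_lm):
    """Probability mass of the term's translations under the target collection LM."""
    return sum(p * target_lm.get(t, 1e-9) for t, p in translation.get(term, {}).items())

def cross_lingual_score(query_terms, target_lm):
    """Log-likelihood of the translated query under the target collection LM."""
    return sum(math.log(max(translated_term_prob(t, target_lm), 1e-12))
               for t in query_terms)

# Toy French collection language model (unigram probabilities).
french_lm = {"fleuve": 0.02, "riviere": 0.01, "banque": 0.005, "rive": 0.004}
print(cross_lingual_score(["river", "bank"], french_lm))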
Distributed Web Search
• Distribute web search over multiple sites/servers.
– Localized / Regional.
– Domain dependent.
– Possibly no central coordination.
– Server Selection / Database Selection, with or without explicit queries.
• Research Issues
– Partial representations of the world.
– Trust, Reliability.
• Peer to peer.
Challenges
• Formal Methods for Resource Descriptions, Ranking, and Combination
– Example: Language Modeling.
– Beyond collections as big documents
• Multi-lingual retrieval
– Combining the outputs of systems searching databases in many languages.
• Peer to Peer Systems
– Beyond broadcasting simple keyword searches.
– Non-centralized
– Networking considerations e.g. availability, latency, transfer time.
• Distributed Web Search
• Data, Web Data.