Web Search - Electures

Web Search – Summer Term 2006
VII. Selected Topics: Metasearch Engines [1]
(c) Wolfgang Hürst, Albert-Ludwigs-University
Introduction
So far: General-purpose search engines
(index and search the whole web)
In addition: Special-purpose search
engines exist (index and search documents
from a particular domain)
Problem:
- Both cover only a fraction of the web
- Users sometimes have to query several
search engines separately
Possible solution: Metasearch engines
Motivation
Why should we build metasearch engines?
- To increase the coverage of the web
- To address the scalability problem of searching the web
- To facilitate the invocation of multiple
search engines
- To improve the retrieval effectiveness
A typical metasearch engine session
1. User sends query to metasearch engine
2. Metasearch engine processes query and
sends it to component search engines
3. Component search engines return results
(based on local similarity) to metasearch
engine
4. Metasearch engine merges and ranks
returned results (based on global similarities)
and presents them to the user
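To make this flow concrete, here is a minimal sketch of such a session in Python; the component-engine interface (a function returning (document, local similarity) pairs) and the weighting used for the global scores are assumptions for illustration, not part of the slide.

from typing import Callable, Dict, List, Tuple

# A component search engine is modelled as a function that takes a query
# string and returns (document id, local similarity) pairs -- a hypothetical interface.
ComponentEngine = Callable[[str], List[Tuple[str, float]]]

def metasearch(query: str,
               engines: Dict[str, ComponentEngine],
               weights: Dict[str, float],
               top_k: int = 10) -> List[Tuple[str, float]]:
    # steps 1-4 of a typical session: dispatch, collect, merge, rank
    merged: Dict[str, float] = {}
    for name, engine in engines.items():
        for doc, local_sim in engine(query):                  # steps 2/3: query the components
            global_sim = weights.get(name, 1.0) * local_sim   # step 4: crude global similarity
            merged[doc] = max(merged.get(doc, 0.0), global_sim)
    # step 4 (cont.): one ranked result list for the user
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)[:top_k]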
A typical architecture
[Figure: the USER INTERFACE passes the query to the DATABASE SELECTOR
(database selection problem) and on to the DOCUMENT SELECTOR (document
selection problem); the QUERY DISPATCHER sends the query to the
component SEARCH ENGINEs; the RESULT MERGER (result merging problem)
combines the returned results for the user interface]
Challenges for metasearch engines
Why is it hard to create metasearch engines?
Because of the heterogeneity among the
involved autonomous component search
engines:
- Various indexing methods
- Different document term weighting schemes
- Different query term weighting schemes
- Varying similarity functions
- Different document databases
- Unequal document versions
- Different result representations
A typical architecture (recap; see the figure above) - focus: the database selection problem
The database selection problem
Selection is important if the number of
component search engines is large
Goal: Identify as many potentially useful
databases as possible while minimizing the
number of useless databases selected by mistake
Decision is usually based on a
representative of the database
Most important questions:
- What are good representatives?
- How do we get them?
Approaches for database selection
Three types of techniques:
- Rough representative approaches
e.g. several keywords or paragraphs
- Statistical representative approaches
statistical information (e.g. document
frequency of a term)
- Learning-based approaches
learned from past experience (training
queries or real (previous) user queries)
Rough representative approaches
Typically several keywords and paragraphs
Only give a rather general idea about the database
Often manually generated
Automatic approaches exist (e.g. taking the text from
the interface page or anchor text from pages pointing
to the search engine)
Alternatively: Involve the user, e.g. by letting them select
search engines explicitly or specify a subject area
Advantages: Easy to generate, low storage space
requirements; works well for highly specialized
component search engines with diversified topics
Disadvantage: Does not work well for databases whose
documents cover diverse interests
Statistical representative approaches
Examples:
- D-WISE approach
- CORI Net approach
- gGlOSS approach
- Estimating the number of potentially
useful documents
- Estimating the similarity of the most
similar documents
The D-WISE approach
The representative of a component search engine
contains
- the document frequency of each term in
the component search engine (n values dij)
- the number of documents in the database
(1 value ni)
Ranking the component search engines based on a
given query q:
1. Calculate the cue
validity CVij of each
query term tj (formula below)
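Following the D-WISE description in the referenced survey [1], the cue validity of query term tj for component database i can be written as (N = number of component databases):

CV_{ij} = \frac{d_{ij}/n_i}{\,d_{ij}/n_i + \left(\sum_{k \neq i} d_{kj}\right) / \left(\sum_{k \neq i} n_k\right)\,}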
The D-WISE approach (cont.)
2. Calculate the variance CVVj of the CVij's of each
query term tj for all component databases
(ACVj = average of all CVij's for all comp. databases)
3. Compute the ranking
score ri for each
component database i
with respect to q
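Likewise following [1], the cue validity variance and the ranking score can be written as:

CVV_j = \frac{1}{N}\sum_{i=1}^{N}\left(CV_{ij} - ACV_j\right)^2
\qquad
r_i = \sum_{t_j \in q} CVV_j \cdot d_{ij}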
The D-WISE approach (cont.)
4. Select component databases with the highest
ranking score
(Intuitive interpretation of the ranking score: Indicates
where useful query terms are concentrated)
Advantages of this approach:
- easily scalable
- easy to compute
Disadvantages:
- ranking scores are relative scores
(no absolute quality measures)
- does not distinguish between multiple term
appearances within single documents
Learning-based approaches
Predict the usefulness of a database for new queries
based on the experience gained with it on past ones
How to obtain these retrieval experiences?
- Usage of training queries (static learning)
- Usage of real user queries; accumulate retrieval
knowledge gradually and update it continuously
(dynamic learning)
- Use a combined learning approach, i.e. start with
training queries and update continuously based on
real user queries
Examples:
- The MRDD approach
- The SavvySearch approach
- The ProFusion approach
The MRDD approach
Static learning approach, i.e. a set of training
queries is given and all relevant documents for each
training query have to be identified (manually)
Stores a vector reflecting the distribution of the
relevant documents: <r1, r2, ..., rs>
with ri = number of top-ranked documents that
must be retrieved to obtain i relevant documents
Example:
Retrieval result: (d1, ..., d100)
Relevant documents: d1, d4, d10, d17, d30
Distribution vector:
<r1, r2, r3, r4, r5> = <1, 4, 10, 17, 30>
The MRDD approach (cont.)
Application:
Input: query q from the user
1. Compare q to the training queries and select the k
most similar ones (e.g. k=8)
2. Calculate the average distribution vector over these
k queries for each database
3. Select the databases Di that maximize precision (continued in the sketch below)
Example:
D1: <1, 4, 6, 7, 10, 12, 17>
D2: <3, 5, 7, 9, 15, 20>
D3: <2, 3, 6, 9, 11, 16>
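The slide stops after listing the averaged distribution vectors. As an illustration of the selection step (my own continuation, not part of the slide): suppose 4 relevant documents are wanted in total. Retrieving the top 7 documents of D1 yields 4 relevant out of 7 retrieved (precision 4/7, about 0.57), and no split across D1-D3 does better here. A minimal brute-force sketch of this step:

from itertools import product

# Averaged distribution vectors from the slide: vec[k-1] = number of
# top-ranked documents needed from that database to obtain k relevant ones.
vectors = {
    "D1": [1, 4, 6, 7, 10, 12, 17],
    "D2": [3, 5, 7, 9, 15, 20],
    "D3": [2, 3, 6, 9, 11, 16],
}

def cost(vec, k):
    # documents to retrieve from one database to obtain k relevant documents
    return 0 if k == 0 else vec[k - 1]

def best_allocation(vectors, wanted):
    # try every split of 'wanted' relevant documents over the databases
    names = list(vectors)
    best = None
    for split in product(range(wanted + 1), repeat=len(names)):
        if sum(split) != wanted:
            continue
        if any(k > len(vectors[n]) for n, k in zip(names, split)):
            continue  # this database cannot supply that many relevant documents
        retrieved = sum(cost(vectors[n], k) for n, k in zip(names, split))
        if retrieved == 0:
            continue
        precision = wanted / retrieved
        if best is None or precision > best[0]:
            best = (precision, dict(zip(names, split)))
    return best

print(best_allocation(vectors, 4))
# precision 4/7 (about 0.571), e.g. 1 relevant document from D1 and 3 from D3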
A typical architecture (recap; see the figure above) - focus: the document selection problem
The document selection problem
The naive approach (i.e. returning all results) won't
work because it delivers too many documents
Goal: Retrieve as many potentially useful documents
from a component database as possible while
minimizing the retrieval of useless documents
Different categories of approaches:
- User determination
- Weighted allocation
- Learning-based approaches
- Guaranteed retrieval
User determination
Let the user select a number for each component
database (or use a default number, if none is given
by the user)
Advantage:
- Works well with small numbers of component
databases and if the user is reasonably familiar
with them
Disadvantage:
- Does not work for larger sets of component
databases
Weighted allocation
Use the rank (or ranking score) calculated by the
database selection algorithm to specify the number
of documents that should be selected from the
respective database
Example:
If ri is the ranking score of database Di (i = 1, ...,
N) and m documents should be returned to the
user, then select about m * ri / (r1 + ... + rN)
documents from database Di (proportional allocation)
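A small numerical illustration of this proportional allocation (the numbers are mine, not from the slide): with N = 3, ranking scores r1 = 0.5, r2 = 0.3, r3 = 0.2 and m = 20 requested documents, the metasearch engine retrieves 20 * 0.5 / 1.0 = 10 documents from D1, 6 from D2, and 4 from D3.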
Learning-based approaches
Idea: Learn how many documents to retrieve for a
given query based on past retrieval experiences for
similar queries
Example: The MRDD approach (see before) is a
learning-based approach that already includes
document selection!
Guaranteed retrieval
Idea: Take global similarities into account (not only
local ones, which depend e.g. on the component
databases' own document frequencies)
Example approaches:
- Query modification
- Computing the tightest local threshold
A typical architecture (recap; see the figure above) - focus: the result merging problem
Result merging approaches
Goal: Combine all returned results (sorted by
local similarity) into one single result (sorted by
global similarity)
Why is this a hard problem?
- Local similarities are not provided by all
search engines
- The heterogeneity of the component search
engines makes local similarities hard (if not
impossible) to compare
- Local similarities might differ significantly
from global ones
- Should documents returned by multiple
search engines be treated differently?
Result merging approaches (cont.)
Two types of approaches exist:
Local similarity adjustment, e.g.
- Adjust local similarities using additional
information (e.g. quality of the component
database)
- Convert local document ranks to similarities
Global similarity estimation
Attempts to estimate the true global similarities
Local similarity adjustment
Distinguish three cases:
1. The selected databases (or returned results)
are pairwise disjoint (or nearly disjoint)
2. The selected databases (or returned results)
overlap but are not identical
3. The selected databases are identical
The latter case is known as data fusion and
normally not considered for metasearch engines
In the following: Assume case 1 (disjoint results)
Local similarity adjustment
1st: assume local similarities exist
1. Normalize the returned similarities
2. Use database scores to adjust the local
similarities (i.e. give higher preference to
documents from highly rated databases), e.g.
If s is the ranking score of database D and
s̄ is the average of all these scores, then a
weight w for database D is derived from s, s̄,
and the number N of databases (higher-rated
databases receive higher weights)
Define the adjusted similarity as w*x
(x = original, returned local similarity)
3. Sort results by adjusted similarity
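A minimal sketch of these three steps in Python. The min-max normalization and the simple weight w = s / s̄ are assumptions chosen for illustration, not the slide's actual formulas.

def normalize(similarities):
    # step 1: min-max normalize the returned local similarities into [0, 1]
    lo, hi = min(similarities.values()), max(similarities.values())
    if hi == lo:
        return {doc: 1.0 for doc in similarities}
    return {doc: (sim - lo) / (hi - lo) for doc, sim in similarities.items()}

def merge_adjusted(results, db_scores):
    # results: {database: {doc: local similarity}}, db_scores: {database: ranking score s}
    avg = sum(db_scores.values()) / len(db_scores)
    merged = []
    for db, sims in results.items():
        w = db_scores[db] / avg                 # step 2: assumed weight, favours highly rated databases
        for doc, x in normalize(sims).items():
            merged.append((doc, w * x))         # adjusted similarity w*x
    # step 3: one list sorted by adjusted similarity
    return sorted(merged, key=lambda pair: pair[1], reverse=True)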
Local similarity adjustment
2nd: assume local similarities do NOT exist
Possible approaches:
- Use the local document rank information
directly to perform the merge, e.g.
1. Arrange databases based on their scores
2. Select documents with the round-robin method
(Alternatively: randomized version where the
documents are selected based on the probability
of the database being relevant)
- Convert the local document ranks to similarities
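A sketch of the rank-based round-robin merge; the data layout and tie handling are my own choices for illustration.

def round_robin_merge(ranked_lists, db_scores, top_k):
    # ranked_lists: {database: [doc1, doc2, ...]} ordered by local rank
    # 1. arrange databases based on their selection scores, best first
    order = sorted(ranked_lists, key=lambda db: db_scores[db], reverse=True)
    merged, seen, depth = [], set(), 0
    # 2. take the next-ranked document from each database in turn
    while len(merged) < top_k and any(depth < len(ranked_lists[db]) for db in order):
        for db in order:
            if depth < len(ranked_lists[db]):
                doc = ranked_lists[db][depth]
                if doc not in seen:
                    seen.add(doc)
                    merged.append(doc)
                    if len(merged) == top_k:
                        break
        depth += 1
    return merged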
Local similarity adjustment
Distinguish three cases:
1. The selected databases (or returned results)
are pairwise disjoint (or nearly disjoint)
2. The selected databases (or returned
results) overlap but are not identical
How to deal with results returned by multiple
search engines?
Calculate adjusted similarities as before and use
(e.g.) methods from data fusion to combine
them. However, this might not work well because
of the different coverage of the component
databases (-> active research field).
Global similarity estimation
Approaches to estimate global similarities:
1. Document fetching:
Idea: Download returned documents to get
(e.g.) term frequencies, estimate global
document frequency
Disadvantages: expensive (but remedies exist,
e.g. download docs in parallel, keep downloading
and analyzing docs while initial results are
presented, only download the beginning portion)
Advantages:
- Identifies obsolete URLs
- Ranking based on current (up-to-date) content
- Better result representation
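A simplified sketch of the document-fetching idea: download the returned documents concurrently, drop obsolete URLs, and re-rank by the current content. The fetching details and the score function are assumptions for illustration.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, max_bytes=20_000):
    # only download the beginning portion of each document (one of the remedies above)
    try:
        with urlopen(url, timeout=5) as response:
            return url, response.read(max_bytes).decode("utf-8", errors="ignore")
    except OSError:
        return url, None          # obsolete URL: drop it from the merged result

def rerank_by_content(urls, score):
    # score: any function mapping the fetched text to a global similarity estimate
    with ThreadPoolExecutor(max_workers=8) as pool:   # download documents in parallel
        pages = list(pool.map(fetch, urls))
    ranked = [(url, score(text)) for url, text in pages if text is not None]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)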
Global similarity estimation
Approaches to estimate global similarities (cont.):
2. Use of discovered knowledge:
Basic idea:
Try to figure out the specific document indexing
and similarity computation methods used by the
different component search engines
Use this information to
- better compare local similarities
- better adjust local similarities to make them
comparable with each other
- better derive global from local similarities
Global similarity estimation
2. Use of discovered knowledge: Examples
Assume all component search engines use the same
indexing and local similarity estimation methods,
which do not include collection-dependent statistics
-> local similarities are comparable and
can be used directly
Assume the only difference is the usage of
(different) lists of stopwords
-> modify the query to make results comparable
Assume idf information is also used
-> either adjust local similarities or compute global
similarities directly
(cont. on next slide)
Global similarity estimation
2. Use of discovered knowledge: Examples
Assume idf information is also used
-> either adjust local similarities or compute global
similarities directly
Case 1: Query q contains just a single term t
Similarity in the component database: based on the
local (collection-dependent) idf weight of t
Global similarity: multiply the local similarity with the
ratio of the global to the local idf weight of t
Case 2: Query q contains multiple terms: see lit.
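Assuming the idf weight enters the local similarity as a simple multiplicative factor (an assumption; g_t and l_t denote the global and local idf weight of t), the single-term adjustment can be written as:

sim_{global}(q, d) = sim_{local}(q, d) \cdot \frac{g_t}{l_t}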
Other challenges for metasearch engines
1. Integrate local systems employing different
indexing techniques
2. Integrate local systems supporting different
types of queries (e.g. Boolean vs. vector space
queries)
3. Discover knowledge about component search
engines
4. Develop more effective result merging methods
5. Study the appropriate cooperation between a
metasearch engine and the local systems
6. Incorporate new indexing and weighting
techniques to build better metasearch engines
Other challenges for metasearch engines
7. Improve the effectiveness of metasearch
8. Decide where to place the software components
of a metasearch engine
9. Create a standard testbed to evaluate the
proposed techniques for database selection,
document selection, and result merging
10. Extend metasearch techniques to different
types of data sources
References
[1] Meng, Yu, Liu: Building Efficient and Effective
Metasearch Engines. ACM Computing Surveys,
Vol. 34, No. 1, March 2002.
Recap: IR System & Tasks Involved
[Figure: user interface; query processing (parsing & term processing)
yielding the logical view of the information need; searching and
ranking against the index; result representation; on the document
side: select data for indexing, parsing & term processing, index
construction; performance evaluation across the whole process]
General web search engine architecture
[Figure: clients send queries to the query engine, which answers them
using the indexes (text, structure, utility) and the ranking module;
crawler(s), steered by the crawl control module, fetch pages from the
WWW into the page repository; the indexer module and the collection
analysis module build the indexes; usage feedback is fed back into the
system]
(cf. Fig. 1 in Arasu et al.: "Searching the Web", ACM Transactions on
Internet Technology, Vol. 1/1, August 2001, page 4)
Next week
1. Final grading of the programming exercises
2. Exercises with the PageRank simulation tool
(participation mandatory, but no grading)
Exam dates
August 28th and 29th OR
September 18th and 19th OR
by personal arrangement