Distributed Search over the Hidden Web
Hierarchical Database
Sampling and Selection
Panagiotis G. Ipeirotis & Luis Gravano
Outline
Introduction
Background
Focused Probing for Content Summary
Construction
Exploiting Topic Hierarchies for Database
Selection
Experiments: Data and Metrics
Experimental Results
Conclusion and Future Work
Introduction
Search engines create their indexes by
spidering or crawling Web pages
Hidden Web sources store their content
in searchable databases
An Example…
In the medical database
CANCERLIT (www.cancer.gov), the
query [ lung and cancer ] returns
68,430 matches.
Searching Google with the query [“lung
and cancer” site:www.cancer.gov]
returns 23 matches.
Meta-searchers
Tools for searching Hidden-Web data sources
Rely on statistical content summaries of the databases
Perform three main tasks:
Database Selection
Query Translation
Result Merging
This Paper Presents
An algorithm to derive content summaries
from “uncooperative” databases
A Database Selection Algorithm that
exploits:
Extracted content summaries
Hierarchical classification of the databases
Background:
Existing Database Selection Algorithms
Contd.: Database Selection Algorithms
Assumption: Query words are
independently distributed over database
documents.
The answer to a query is the set of all the
documents that satisfy the Boolean
expression.
Deficiency: Are the content summaries
accurate and up to date?
Uniform Probing for Content
Summary Construction
Extracts a document sample from a given
database D and computes the frequency of
each observed word w in the sample,
SampleDF(w)
The Algorithm:
1. Start with an empty content summary where
SampleDF(w) = 0 for each word w, and a
general (i.e., not specific to D),
comprehensive word dictionary.
2. Pick a word and send it as a query to
database D.
3. Retrieve the top-k documents returned.
4. If the number of retrieved documents exceeds
a pre-specified threshold, stop. Else continue
the sampling process by returning to Step 2.
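The slides leave the sampling loop abstract; below is a minimal Python sketch of it in the RS-Ord spirit, assuming a hypothetical query_database(word, k) wrapper around the database's search interface that returns the top-k matching documents as lists of words, and an external dictionary of candidate probe words.

```python
import random
from collections import Counter

def uniform_probe_sample(dictionary, query_database, k=4, max_docs=300):
    # SampleDF(w): number of sampled documents containing w
    sample_df = Counter()
    sampled_docs = 0
    words = list(dictionary)
    random.shuffle(words)
    for probe in words:                      # Step 2: pick a word as a query probe
        docs = query_database(probe, k)      # Step 3: retrieve the top-k documents
        for doc in docs:
            sample_df.update(set(doc))       # count each word once per document
        sampled_docs += len(docs)
        if sampled_docs >= max_docs:         # Step 4: stop after enough documents
            break
    return sample_df
```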
Two Versions of the Algorithm
RS-Ord:
RandomSampling-OtherResource
(probe words picked from an external dictionary)
RS-Lrd:
RandomSampling-LearnedResource
(probe words picked from the documents retrieved so far)
Deficiencies:
ActualDF(w) for each word w is not
revealed
RS-Ord tends to produce inefficient
executions in which it repeatedly issues
queries to databases that produce no
matches
Database Classification
Rationale : Queries closely associated
with topical categories retrieve mainly
documents about that category
Place the database in a classification
scheme based on the number of matches
Automation and Hierarchical
Classification
Automates classification using queries derived
automatically from a rule-based document
classifier.
A rule-based classifier is a set of logical rules
defining classification decisions, e.g.:
jordan AND bulls → Sports; hepatitis → Health
Apply this principle recursively to create a
hierarchical classifier.
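A rule-based classifier of this kind can be sketched as follows; the two rules are just the slide's illustrative examples, and the representation (antecedent word sets paired with categories) is an assumption, not the authors' implementation.

```python
# Illustrative rules from the slide; a real classifier would have many more.
RULES = [
    ({"jordan", "bulls"}, "Sports"),
    ({"hepatitis"}, "Health"),
]

def classify(document_words):
    """Return the categories whose rules fire for a document, given as a set of words."""
    return {category for antecedent, category in RULES if antecedent <= document_words}

# Each rule also yields a query probe for its category, e.g. the rule
# "jordan AND bulls -> Sports" becomes the Boolean probe [jordan AND bulls].
print(classify({"jordan", "bulls", "score"}))   # {'Sports'}
```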
Focused Probing
Sends query probes and extracts the number
of matches without retrieving any
documents.
Calculates two metrics, the Coverage(Ci)
and Specificity(Ci) for the subcategory Ci
If the values of Coverage(Ci) and
Specificity(Ci) exceed two pre-specified
thresholds Tc and Ts, respectively, classify
the database into a category Ci
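As a rough illustration (not the authors' exact definitions), the decision step can be read as: take Coverage(Ci) to be the total number of matches returned by the probes derived for Ci, and Specificity(Ci) to be Ci's share of the matches across the sibling subcategories; the threshold values in the example call below are made up.

```python
def classify_database(match_counts, tc=10, ts=0.25):
    """match_counts: subcategory -> number of matches its query probes produced."""
    total = sum(match_counts.values()) or 1
    chosen = []
    for ci, coverage in match_counts.items():
        specificity = coverage / total
        if coverage >= tc and specificity >= ts:   # both thresholds exceeded
            chosen.append(ci)
    return chosen

# Example: with Tc=10, Ts=0.25 only "Health" survives here.
print(classify_database({"Sports": 520, "Health": 12040, "Computers": 33}))
```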
Authors’ Algorithm
Exploit the topic hierarchy to
produce a document sample that:
Is topically representative of the database contents
Yields an accurate content summary efficiently
Content-Summary Construction
Steps of the algorithm:
Query the database using focused probing to:
Retrieve a document sample.
Generate a preliminary content summary
Categorize the database.
Estimate the absolute frequencies of the
words retrieved from the database.
Building Content Summaries from
Extracted Documents
ActualDF(w):
The actual number of documents in the
database that contain word w.
The algorithm knows this number only if [w] is
a single-word query probe that was issued to
the database
SampleDF(w):
The number of documents in the extracted
sample that contain word w.
Focused Probing for Content
Summary Construction
Estimating Absolute Document
Frequencies
Use Mandelbrot’s formula f = P (r + p)^(-B) for the
distribution of words to estimate unknown ActualDF(·)
frequencies.
Sort words in descending order of their
SampleDF(·) frequencies.
Focus on words with known ActualDF(·)
frequencies.
Find the P, B, and p parameter values that best fit
the data.
Estimate ActualDF(wi) for all words wi with
unknown ActualDF(wi) as P (ri + p)^(-B)
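A possible implementation of this fitting step, assuming numpy/scipy are available and that the known ActualDF values come from the single-word probes mentioned earlier:

```python
import numpy as np
from scipy.optimize import curve_fit

def mandelbrot(r, P, p, B):
    # Mandelbrot's formula: frequency = P * (rank + p) ** -B
    return P * (r + p) ** (-B)

def estimate_actual_df(ranked_words, known_actual_df):
    """ranked_words: words sorted by descending SampleDF (ranks 1, 2, ...).
    known_actual_df: word -> ActualDF for words issued as single-word probes."""
    ranks = np.array([i + 1 for i, w in enumerate(ranked_words) if w in known_actual_df],
                     dtype=float)
    freqs = np.array([known_actual_df[w] for w in ranked_words if w in known_actual_df],
                     dtype=float)
    # Fit P, p, B to the (rank, ActualDF) pairs that are known exactly.
    (P, p, B), _ = curve_fit(mandelbrot, ranks, freqs,
                             p0=(freqs.max(), 1.0, 1.0), bounds=(0, np.inf))
    # Keep known frequencies; predict the rest from the fitted curve.
    return {w: known_actual_df.get(w, mandelbrot(i + 1, P, p, B))
            for i, w in enumerate(ranked_words)}
```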
Example
Creating Content Summaries
for Topic Categories
Example:
“metastasis” did not appear in any of the
documents sampled from CANCERLIT during
probing
Cancer-BACUP, classified under “Cancer”, has a
high ActualDF_est(metastasis) = 3,569
Convey this information by associating a content
summary with category “Cancer” that is obtained
by merging the summaries of all databases under
this category
In the merged summary, ActualDF_est(w) is the sum
of the document frequencies of w over the databases
under this category
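A minimal sketch of the merging step, assuming each per-database summary is a dict from word to its (estimated) document frequency:

```python
from collections import Counter

def merge_summaries(db_summaries):
    """db_summaries: iterable of dicts mapping word -> ActualDF_est for one database."""
    category_summary = Counter()
    for summary in db_summaries:
        category_summary.update(summary)   # Counter.update adds frequencies per word
    return category_summary

# e.g. a word like "metastasis", missed when sampling CANCERLIT, still gets a
# non-zero frequency in the "Cancer" category summary via Cancer-BACUP.
```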
Selecting Databases
Hierarchically: Algorithm
Inputs: a query Q, the number of databases
to select K, and the top category C
Steps: HierSelect(Query Q, Category C, int K)
1: Use a flat database selection algorithm to assign a score for Q to
   each subcategory of C
2: if there is a subcategory of C with a non-zero score
3:   Pick the subcategory Cj with the highest score
4:   if NumDBs(Cj) >= K // Cj has enough databases
5:     return HierSelect(Q, Cj, K)
6:   else // Cj does not have enough databases
7:     return DBs(Cj) ∪ FlatSelect(Q, C−Cj, K−NumDBs(Cj))
8: else // no subcategory of C has a non-zero score
9:   return FlatSelect(Q, C, K)
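The pseudocode translates fairly directly into Python; the sketch below assumes a hypothetical Category object exposing its subcategories and the databases classified under it, plus externally supplied score and flat_select functions.

```python
def hier_select(query, category, k, score, flat_select):
    """score(query, subcategory) -> flat-selection score of that subcategory for query;
    flat_select(query, databases, k) -> list of up to k selected databases."""
    subs = category.subcategories
    if not subs:                                   # leaf category: fall back to flat selection
        return flat_select(query, category.databases, k)
    scores = {c: score(query, c) for c in subs}    # step 1: score each subcategory
    best = max(scores, key=scores.get)             # step 3: Cj with the highest score
    if scores[best] > 0:                           # step 2: some subcategory matches
        if len(best.databases) >= k:               # step 4: Cj has enough databases
            return hier_select(query, best, k, score, flat_select)   # step 5
        others = [db for c in subs if c is not best for db in c.databases]
        return best.databases + flat_select(       # step 7: DBs(Cj) plus the rest of C - Cj
            query, others, k - len(best.databases))
    return flat_select(query, category.databases, k)                 # step 9
```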
Example: Topic hierarchy for database
selection (babe AND ruth, K=3)
Experiments: Data and Metrics
Evaluate two main sets of techniques:
1. Content-summary construction techniques
2. Database selection techniques
Evaluate the algorithms using two data
sets:
Controlled Database Set
Web Database Set
Data Sets
Controlled Database Set
500,000 newsgroup articles from 54
newsgroups
81,000 articles to train document classifiers
over the 72-node topic hierarchy
419,000 articles to build the set of Controlled
Databases
Contained 500 databases ranging in size from
25 to 25,000 documents.
Data Sets
Web Database Set
50 real web-accessible databases over
which we have no control.
Databases picked randomly from two
directories of hidden-web databases, namely
InvisibleWeb and Complete Planet
Content-summary construction
Test variations of the Focused Probing technique
against RS-Ord and RS-Lrd.
Focused Probing:
Evaluated configurations with different underlying
document classifiers for query-probe creation.
Different values for the thresholds Ts and Tc:
Varied the specificity threshold Ts from 0 to 1
Fixed the coverage threshold at Tc = 10.
Database Selection
Effectiveness
Underlying database selection algorithms:
The hierarchical algorithm
relies on a “flat” database selection
algorithm.
Chosen flat algorithms: CORI, bGlOSS
Adapted both algorithms to work with
category content summaries.
Database Selection
Effectiveness
Content Summary Construction
Evaluated how the hierarchical database
selection algorithm behaved over content
summaries generated by different techniques
Also studied the QPilot strategy:
Exploits HTML links to characterize text
databases.
Content Summary Quality
Metric: coverage of the actual
database vocabulary by the content summary
ctf ratio = Σ_{w ∈ Tr} ActualDF(w) / Σ_{w ∈ Td} ActualDF(w)
Tr = set of terms in the extracted content summary, Td =
complete set of words in the database vocabulary
Results:
Focused Probing techniques achieve much
higher ctf ratios than RS-Ord and RS-Lrd.
The coverage of the Focused Probing summaries
increases for lower thresholds of Ts
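For concreteness, a small sketch of the ctf-ratio computation, under the assumption that ActualDF is available for the full database vocabulary (as it is for the Controlled set):

```python
def ctf_ratio(summary_terms, actual_df):
    """summary_terms: set of words in the extracted content summary (Tr).
    actual_df: word -> ActualDF(w) over the full database vocabulary (Td)."""
    covered = sum(df for w, df in actual_df.items() if w in summary_terms)
    total = sum(actual_df.values())
    return covered / total if total else 0.0
```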
Content Summary Quality
Correlation of word rankings: Used the Spearman
Rank Correlation Coefficient (SRCC)
to measure how well a content summary
orders words by frequencies with respect to
the actual word frequency order in the
database.
Result: The Focused Probing methods have
higher SRCC values than RS-Ord and RS-Lrd.
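A quick way to approximate this measurement, using scipy.stats.spearmanr over the words common to the extracted and actual summaries (the paper's exact tie handling may differ):

```python
from scipy.stats import spearmanr

def summary_srcc(sample_df, actual_df):
    """Rank correlation between estimated and actual document frequencies,
    computed over the words that both summaries contain."""
    common = [w for w in sample_df if w in actual_df]
    estimated = [sample_df[w] for w in common]
    actual = [actual_df[w] for w in common]
    rho, _pvalue = spearmanr(estimated, actual)
    return rho
```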
Content Summary Quality - Efficiency
Focused Probing techniques on average
retrieve one document per query sent
RS-Lrd retrieves about one document per
two queries.
RS-Ord unnecessarily issues many
queries that produce no document
matches.
Content Summary Quality
The Focused Probing techniques produce
significantly better-quality summaries than
RS-Ord and RS-Lrd do, in terms of vocabulary
coverage and word-ranking preservation.
Database Selection Effectiveness
Methodology:
Used the Web set of real web-accessible databases
50 queries from the Web Track of TREC
Each database selection algorithm picked 3
databases for each query
Retrieved the top 5 documents for each query.
Human evaluators judged
the relevance of each retrieved document
for the query
Database Selection Effectiveness
Measured the precision of a technique for each
query q as the fraction of the returned
documents judged relevant for q.
[Figure: Average precision of the different database selection algorithms.]
Database Selection Effectiveness
Analysis:
All the flat selection techniques suffer from
incomplete coverage of the underlying
probing-generated summaries.
QPilot summaries do not work well for
database selection because they generally
contain only a few words and are hence
highly incomplete.
Hierarchical vs. flat database
selection
The hierarchical algorithm using CORI as its
flat database selection algorithm has 50% better
precision than flat CORI selection with
the same content summaries.
For bGlOSS, the improvement is 92%.
Reason: Topic hierarchy compensates for
incomplete content summaries.
Hierarchical vs. flat database
selection
Measured the fraction of times that the
hierarchical database selection algorithm
picked a database for a query
that produced matches for the query
but was given a zero score by the flat
database selection algorithm of choice.
Conclusion
Presented a novel and efficient method for
constructing content summaries of
web-accessible text databases
Presented a hierarchical database selection
algorithm that exploits the database content
summaries
The algorithm exploits the generated classification
to produce accurate results even with imperfect
content summaries.
Future Work
Alternative hierarchy traversing techniques.
For example, “route” queries to multiple
categories if appropriate.
Examine the effect of absolute frequency
estimation on database selection.
Alternative methods for creating content
summaries.