Distributed Search over the Hidden Web
Hierarchical Database
Sampling and Selection
Panagiotis G. Ipeirotis & Luis Gravano
Outline
Introduction
Background
Focused Probing for Content Summary
Construction
Exploiting Topic Hierarchies for Database
Selection
Experiments: Data and Metrics
Experimental Results
Conclusion and Future Work
Introduction
Search engines create their indexes by
spidering or crawling Web pages
Hidden Web sources store their content
in searchable databases
An Example…
In the medical database
CANCERLIT (www.cancer.gov), the
query [ lung and cancer ] returns
68,430 matches.
Searching Google with the query [“lung
and cancer” site:www.cancer.gov]
returns 23 matches.
Meta-searchers
Tools for searching Hidden-Web data sources
Rely on statistical content summaries of the databases
Perform three main tasks:
Database Selection
Query Translation
Result Merging
This Paper Presents
An algorithm to derive content summaries
from “uncooperative” databases
A Database Selection Algorithm that
exploits:
Extracted content summaries
Hierarchical classification of the databases
Background:
Existing Database Selection Algorithms
Contd.: Database Selection Algorithms
Assumption: Query words are
independently distributed over database
documents.
The answer to a query is the set of all the
documents that satisfy the Boolean
expression.
Deficiency: Are the content summaries
accurate and up to date?
Uniform Probing for Content
Summary Construction
Extracts a document sample from a given
database D and computes the frequency of
each observed word w in the sample,
SampleDF(w)
The Algorithm:
1. Start with an empty content summary where
SampleDF(w) = 0 for each word w, and a
general (i.e., not specific to D),
comprehensive word dictionary.
2. Pick a word and send it as a query to
database D.
3. Retrieve the top-k documents returned.
4. If the number of retrieved documents exceeds
a pre-specified threshold, stop. Else continue
the sampling process by returning to Step 2.
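The slides leave the sampling loop abstract; below is a minimal Python sketch of it in the RS-Ord spirit, assuming a hypothetical query_database(word, k) wrapper around the database's search interface that returns the top-k matching documents as lists of words, and an external dictionary of candidate probe words.

```python
import random
from collections import Counter

def uniform_probe_sample(dictionary, query_database, k=4, max_docs=300):
    # SampleDF(w): number of sampled documents containing w
    sample_df = Counter()
    sampled_docs = 0
    words = list(dictionary)
    random.shuffle(words)
    for probe in words:                      # Step 2: pick a word as a query probe
        docs = query_database(probe, k)      # Step 3: retrieve the top-k documents
        for doc in docs:
            sample_df.update(set(doc))       # count each word once per document
        sampled_docs += len(docs)
        if sampled_docs >= max_docs:         # Step 4: stop after enough documents
            break
    return sample_df
```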
Two Versions of the Algorithm
RS-Ord:
RandomSampling-OtherResource
(probe words picked from an external dictionary)
RS-Lrd:
RandomSampling-LearnedResource
(probe words picked from the documents retrieved so far)
Deficiencies:
ActualDF(w) for each word w is not
revealed
RS-Ord tends to produce inefficient
executions in which it repeatedly issues
queries to databases that produce no
matches
Database Classification
Rationale : Queries closely associated
with topical categories retrieve mainly
documents about that category
Place the database in a classification
scheme based on the number of matches
Automation and Hierarchical
Classification
Automates classification using queries derived
automatically from a rule-based document
classifier.
A rule-based classifier is a set of logical rules
defining classification decisions, e.g.:
jordan AND bulls → Sports; hepatitis → Health
Apply this principle recursively to create a
hierarchical classifier.
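A rule-based classifier of this kind can be sketched as follows; the two rules are just the slide's illustrative examples, and the representation (antecedent word sets paired with categories) is an assumption, not the authors' implementation.

```python
# Illustrative rules from the slide; a real classifier would have many more.
RULES = [
    ({"jordan", "bulls"}, "Sports"),
    ({"hepatitis"}, "Health"),
]

def classify(document_words):
    """Return the categories whose rules fire for a document, given as a set of words."""
    return {category for antecedent, category in RULES if antecedent <= document_words}

# Each rule also yields a query probe for its category, e.g. the rule
# "jordan AND bulls -> Sports" becomes the Boolean probe [jordan AND bulls].
print(classify({"jordan", "bulls", "score"}))   # {'Sports'}
```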
Focused Probing
Sends query probes and extracts the number
of matches without retrieving any
documents.
Calculates two metrics, the Coverage(Ci)
and Specificity(Ci) for the subcategory Ci
If the values of Coverage(Ci) and
Specificity(Ci) exceed two pre-specified
thresholds Tc and Ts, respectively, classify
the database into a category Ci
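As a rough illustration (not the authors' exact definitions), the decision step can be read as: take Coverage(Ci) to be the total number of matches returned by the probes derived for Ci, and Specificity(Ci) to be Ci's share of the matches across the sibling subcategories; the threshold values in the example call below are made up.

```python
def classify_database(match_counts, tc=10, ts=0.25):
    """match_counts: subcategory -> number of matches its query probes produced."""
    total = sum(match_counts.values()) or 1
    chosen = []
    for ci, coverage in match_counts.items():
        specificity = coverage / total
        if coverage >= tc and specificity >= ts:   # both thresholds exceeded
            chosen.append(ci)
    return chosen

# Example: with Tc=10, Ts=0.25 only "Health" survives here.
print(classify_database({"Sports": 520, "Health": 12040, "Computers": 33}))
```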
Authors’ Algorithm
Exploit the topic hierarchy to
produce a document sample that:
Is topically representative of the database contents
Yields an accurate content summary efficiently
Content-Summary Construction
Steps of the algorithm:
Query the database using focused probing to:
Retrieve a document sample.
Generate a preliminary content summary
Categorize the database.
Estimate the absolute frequencies of the
words retrieved from the database.
Building Content Summaries from
Extracted Documents
ActualDF(w):
The actual number of documents in the
database that contain word w.
The algorithm knows this number only if [w] is
a single-word query probe that was issued to
the database
SampleDF(w):
The number of documents in the extracted
sample that contain word w.
Focused Probing for Content
Summary Construction
Estimating Absolute Document
Frequencies
Use Mandelbrot’s formula f = P (r + p)^(-B) for the
distribution of words to estimate unknown ActualDF(·)
frequencies.
Sort words in descending order of their
SampleDF(·) frequencies.
Focus on words with known ActualDF(·)
frequencies.
Find the P, B, and p parameter values that best fit
the data.
Estimate ActualDF(wi) for all words wi with
unknown ActualDF(wi) as P (ri + p)^(-B)
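A possible implementation of this fitting step, assuming numpy/scipy are available and that the known ActualDF values come from the single-word probes mentioned earlier:

```python
import numpy as np
from scipy.optimize import curve_fit

def mandelbrot(r, P, p, B):
    # Mandelbrot's formula: frequency = P * (rank + p) ** -B
    return P * (r + p) ** (-B)

def estimate_actual_df(ranked_words, known_actual_df):
    """ranked_words: words sorted by descending SampleDF (ranks 1, 2, ...).
    known_actual_df: word -> ActualDF for words issued as single-word probes."""
    ranks = np.array([i + 1 for i, w in enumerate(ranked_words) if w in known_actual_df],
                     dtype=float)
    freqs = np.array([known_actual_df[w] for w in ranked_words if w in known_actual_df],
                     dtype=float)
    # Fit P, p, B to the (rank, ActualDF) pairs that are known exactly.
    (P, p, B), _ = curve_fit(mandelbrot, ranks, freqs,
                             p0=(freqs.max(), 1.0, 1.0), bounds=(0, np.inf))
    # Keep known frequencies; predict the rest from the fitted curve.
    return {w: known_actual_df.get(w, mandelbrot(i + 1, P, p, B))
            for i, w in enumerate(ranked_words)}
```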
Example
Creating Content Summaries
for Topic Categories
Example:
“metastasis” did not appear in any of the
documents sampled from CANCERLIT during
probing
Cancer-BACUP, classified under “Cancer”, has a
high ActualDF_est(metastasis) = 3,569
Convey this information by associating a content
summary with category “Cancer” that is obtained
by merging the summaries of all databases under
this category
In the merged summary, ActualDF_est(w) is the sum
of the document frequencies of w over the databases
under this category
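A minimal sketch of the merging step, assuming each per-database summary is a dict from word to its (estimated) document frequency:

```python
from collections import Counter

def merge_summaries(db_summaries):
    """db_summaries: iterable of dicts mapping word -> ActualDF_est for one database."""
    category_summary = Counter()
    for summary in db_summaries:
        category_summary.update(summary)   # Counter.update adds frequencies per word
    return category_summary

# e.g. a word like "metastasis", missed when sampling CANCERLIT, still gets a
# non-zero frequency in the "Cancer" category summary via Cancer-BACUP.
```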
Selecting Databases
Hierarchically: Algorithm
Inputs: a query Q, the number of databases
to select K, and the top category C
Steps: HierSelect(Query Q, Category C, int K)
1: Use a flat database selection algorithm to assign a score for Q to
   each subcategory of C
2: if there is a subcategory of C with a non-zero score
3:   Pick the subcategory Cj with the highest score
4:   if NumDBs(Cj) >= K // Cj has enough databases
5:     return HierSelect(Q, Cj, K)
6:   else // Cj does not have enough databases
7:     return DBs(Cj) ∪ FlatSelect(Q, C−Cj, K−NumDBs(Cj))
8: else // no subcategory of C has a non-zero score
9:   return FlatSelect(Q, C, K)
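The pseudocode translates fairly directly into Python; the sketch below assumes a hypothetical Category object exposing its subcategories and the databases classified under it, plus externally supplied score and flat_select functions.

```python
def hier_select(query, category, k, score, flat_select):
    """score(query, subcategory) -> flat-selection score of that subcategory for query;
    flat_select(query, databases, k) -> list of up to k selected databases."""
    subs = category.subcategories
    if not subs:                                   # leaf category: fall back to flat selection
        return flat_select(query, category.databases, k)
    scores = {c: score(query, c) for c in subs}    # step 1: score each subcategory
    best = max(scores, key=scores.get)             # step 3: Cj with the highest score
    if scores[best] > 0:                           # step 2: some subcategory matches
        if len(best.databases) >= k:               # step 4: Cj has enough databases
            return hier_select(query, best, k, score, flat_select)   # step 5
        others = [db for c in subs if c is not best for db in c.databases]
        return best.databases + flat_select(       # step 7: DBs(Cj) plus the rest of C - Cj
            query, others, k - len(best.databases))
    return flat_select(query, category.databases, k)                 # step 9
```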
Example: Topic hierarchy for database
selection (babe AND ruth, K=3)
Experiments: Data and Metrics
Evaluate two main sets of techniques:
1. Content-summary construction techniques
2. Database selection techniques
Evaluate the algorithms using two data
sets:
Controlled Database Set
Web Database Set
Data Sets
Controlled Database Set
500,000 newsgroup articles from 54
newsgroups
81,000 articles to train document classifiers
over the 72-node topic hierarchy
419,000 articles to build the set of Controlled
Databases
Contained 500 databases ranging in size from
25 to 25,000 documents.
Data Sets
Web Database Set
50 real web-accessible databases over
which we have no control.
Databases picked randomly from two
directories of hidden-web databases, namely
InvisibleWeb and Complete Planet
Content-summary construction
Test variations of the Focused Probing technique
against RS-Ord and RS-Lrd.
Focused Probing:
Evaluated configurations with different underlying
document classifiers for query-probe creation.
Different values for the thresholds Ts and Tc:
Varied the specificity threshold Ts from 0 to 1
Fixed the coverage threshold at Tc = 10.
Database Selection
Effectiveness
Underlying database selection algorithms:
The hierarchical algorithm
relies on a “flat” database selection
algorithm.
Chosen flat algorithms: CORI, bGlOSS
Adapted both algorithms to work with
category content summaries.
Database Selection
Effectiveness
Content Summary Construction
Evaluated how the hierarchical database
selection algorithm behaved over content
summaries generated by different techniques
Also studied the QPilot strategy:
Exploits HTML links to characterize text
databases.
Content Summary Quality
Metric: coverage of the actual
database vocabulary by the content summary
ctf ratio = Σ_{w ∈ Tr} ActualDF(w) / Σ_{w ∈ Td} ActualDF(w)
Tr = set of terms in the extracted content summary, Td =
complete set of words in the database vocabulary
Results:
Focused Probing techniques achieve much
higher ctf ratios than RS-Ord and RS-Lrd.
The coverage of the Focused Probing summaries
increases for lower thresholds of Ts
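For concreteness, a small sketch of the ctf-ratio computation, under the assumption that ActualDF is available for the full database vocabulary (as it is for the Controlled set):

```python
def ctf_ratio(summary_terms, actual_df):
    """summary_terms: set of words in the extracted content summary (Tr).
    actual_df: word -> ActualDF(w) over the full database vocabulary (Td)."""
    covered = sum(df for w, df in actual_df.items() if w in summary_terms)
    total = sum(actual_df.values())
    return covered / total if total else 0.0
```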
Content Summary Quality
Correlation of word rankings: Used the Spearman
Rank Correlation Coefficient (SRCC)
to measure how well a content summary
orders words by frequencies with respect to
the actual word frequency order in the
database.
Result: The Focused Probing methods have
higher SRCC values than RS-Ord and RS-Lrd.
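A quick way to approximate this measurement, using scipy.stats.spearmanr over the words common to the extracted and actual summaries (the paper's exact tie handling may differ):

```python
from scipy.stats import spearmanr

def summary_srcc(sample_df, actual_df):
    """Rank correlation between estimated and actual document frequencies,
    computed over the words that both summaries contain."""
    common = [w for w in sample_df if w in actual_df]
    estimated = [sample_df[w] for w in common]
    actual = [actual_df[w] for w in common]
    rho, _pvalue = spearmanr(estimated, actual)
    return rho
```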
Content Summary Quality - Efficiency
Focused Probing techniques on average
retrieve one document per query sent
RS-Lrd retrieves about one document per
two queries.
RS-Ord unnecessarily issues many
queries that produce no document
matches.
Content Summary Quality
The Focused Probing techniques produce
significantly better-quality summaries than
RS-Ord and RS-Lrd do, in terms of vocabulary
coverage and word-ranking preservation.
Database Selection Effectiveness
Methodology:
Used the Web set of real web-accessible databases
50 queries from the Web Track of TREC
Each database selection algorithm picked 3
databases for each query
Retrieved the top 5 documents for each query.
Human evaluators judged
the relevance of each retrieved document
for the query
Database Selection Effectiveness
Measured the precision of a technique for each
query q as the fraction of the returned
documents judged relevant for q.
[Figure: Average precision of the different database selection algorithms.]
Database Selection Effectiveness
Analysis:
All the flat selection techniques suffer from
incomplete coverage of the underlying
probing-generated summaries.
QPilot summaries do not work well for
database selection because they generally
contain only a few words and are hence
highly incomplete.
Hierarchical vs. flat database
selection
The hierarchical algorithm using CORI as its
flat database selection algorithm has 50% better
precision than flat CORI selection with
the same content summaries.
For bGlOSS, the improvement is 92%.
Reason: Topic hierarchy compensates for
incomplete content summaries.
Hierarchical vs. flat database
selection
Measured the fraction of times that the
hierarchical database selection algorithm
picked a database for a query
that produced matches for the query
but was given a zero score by the flat
database selection algorithm of choice.
Conclusion
Presented a novel and efficient method for
constructing content summaries of
web-accessible text databases
Presented a hierarchical database selection
algorithm that exploits the database content
summaries
The algorithm exploits the generated classification
to produce accurate results even with imperfect
content summaries.
Future Work
Alternative hierarchy traversing techniques.
For example, “route” queries to multiple
categories if appropriate.
Examine the effect of absolute frequency
estimation on database selection.
Alternative methods for creating content
summaries.