Topical Crawlers for Building Digital Library Collections
Topical Crawlers for Building
Digital Library Collections
Presenter: Qiaozhu Mei
1. J.Qin et al. Building Domain Specific Web
Collections for Scientific Digital Libraries: A
Meta-Search Enhanced Focused Crawling
Method
2. G.Pant et al. Panorama: Extending Digital
Libraries with Topical Crawlers
(JCDL 2004)
Outline
Problem Description
Research Background
Their Approaches
Designing Classifier
Enhancing Meta-Search
Identifying Communities in Collection
Experiments
Discussion
Problem Description
Problem: Collect domain specific documents
from the Web and manage the literature
collection.
Topical Crawlers (Focused Crawlers) are
designed to collect domain specific docs
How to bridge the gap between Web communities?
Digital Library vs. Search Engine
Digital Library
Domain specific, serving for literature study
High Quality
Topical Crawler + Collection management
Knowledge discovery
Search Engine
General, serving for web search
High Quantity
General Crawler + Online retrieval
Indexing, retrieving performance, etc.
Research Background
Domain Definition
Expand training set, get starting URLs
VSM, Naïve Bayes, SVM, etc.
TF-IDF, K-Means, etc.
PageRank, HITS, etc.
BFS, best-first search, tree pruning, multiple starting URLs, tunneling
Why doesn't a general crawler with breadth-first
search work?
Web Communities
Design Classifier (Pant et al.)
Motivation: define the domain and distinguish
relevant & non-relevant documents
Approach:
Query Google with title & reference to construct
positive/negative example set (training set)
Use Vector Space Model to represent documents,
use TF-IDF as term weights
Use Naïve Bayesian Classifier to estimate Pr(c+|q),
which is used for ranking
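A minimal sketch of how a relevance score such as Pr(c+|q) could rank a crawl frontier in a best-first crawler. The names `best_first_crawl`, `fetch_links`, and `score`, as well as the toy graph and scores, are illustrative assumptions and not taken from Pant et al. In a real crawler the score of an unfetched URL is usually inherited from the page that links to it; here each URL is scored directly to keep the example short.

```python
import heapq
from typing import Callable, Dict, Iterable, List


def best_first_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit pages in order of decreasing score and return the visit order."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in set(seeds)]
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Toy link graph and made-up relevance scores standing in for Pr(c+|q).
TOY_GRAPH: Dict[str, List[str]] = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}
TOY_SCORES = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}

order = best_first_crawl(
    ["seed"],
    fetch_links=lambda u: TOY_GRAPH.get(u, []),
    score=lambda u: TOY_SCORES[u],
    max_pages=4,
)
print(order)
```

A breadth-first crawler would visit "a" before "d"; the priority queue lets the crawl follow the more promising branch first.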
Design Classifier (cont.)
TF-IDF weighting:
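As an illustration, one common TF-IDF variant (relative term frequency times the log of inverse document frequency) can be sketched as follows. The exact weighting used by Pant et al. may differ; the function name `tf_idf` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} map per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Toy corpus (made up for illustration).
docs = [
    ["topical", "crawler", "crawler"],
    ["digital", "library"],
    ["topical", "library"],
]
w = tf_idf(docs)
print(w[0])
```

Terms that are frequent in a document but rare across the collection ("crawler" above) receive the highest weights.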
Enhancing Meta-Search (Qin et al.)
Motivation: Overcome the limitation of local
search algorithms in crawling; bridge
distributed Web communities
Approach:
Manually provide domain specific queries
Query Meta-search Engine to get multiple starting
urls.
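A rough sketch of the seed-gathering step: ranked result lists from several engines are interleaved and de-duplicated to form the starting URLs. The engine names, the URLs, and the helper `merge_seed_urls` are hypothetical; how Qin et al. actually merge results is not stated on the slide.

```python
from itertools import zip_longest
from typing import Dict, List


def merge_seed_urls(results_by_engine: Dict[str, List[str]], limit: int) -> List[str]:
    """Interleave ranked result lists round-robin and drop duplicate URLs."""
    merged: List[str] = []
    seen = set()
    # Each tier holds the rank-i result of every engine (None if exhausted).
    for tier in zip_longest(*results_by_engine.values()):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged[:limit]


# Hypothetical results for one domain-specific query.
results = {
    "engine_a": ["http://a.example/1", "http://shared.example/"],
    "engine_b": ["http://shared.example/", "http://b.example/2"],
}
seeds = merge_seed_urls(results, limit=10)
print(seeds)
```

Because the seeds come from different engines, they are more likely to fall into separate Web communities than links reached from a single starting page.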
Identifying Communities in Collection
(Pant et al.)
Motivation: analyze the latent structures in
collection, summarize and represent potential
communities
Approach:
Use k-means for content clustering
Use HITS for structural clustering
Label clusters by TF-IDF filtering
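For the structural side, the standard HITS iteration can be sketched in a few lines. This is the textbook algorithm on a made-up link graph, not the authors' implementation; the function name `hits` and the node names are assumptions.

```python
import math
from typing import Dict, List, Tuple


def hits(
    graph: Dict[str, List[str]], iterations: int = 50
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a directed link graph."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[d] for d in graph.get(n, [])) for n in nodes}
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Toy graph: h1 and h2 act as link pages, p and q as content pages.
toy = {"h1": ["p", "q"], "h2": ["p"], "p": [], "q": []}
hub, auth = hits(toy)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

Pages with high authority scores are candidates for representing a community, while high hub scores point to good link collections.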
Experiments (Qin et al.)
Experiments Design
Compare with Google and a Domain Specific SE.
996028 pages, 1/3 from meta-search method.
pre@20.
Compare meta-search enhanced crawling with
traditional one, by means of precision.
997632 pages from baseline method.
pre@10
Experts define queries and judge results.
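The pre@10 and pre@20 measures above are precision at a rank cutoff: the share of the top k results that experts judged relevant. A small sketch, with `precision_at_k` and the judgment list being illustrative assumptions:

```python
from typing import Sequence


def precision_at_k(ranked_relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k results judged relevant (missing results count as misses)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_relevance[:k]) / k


# Hypothetical expert judgments for one query, in ranked order.
judgments = [True, False, True, True]
print(precision_at_k(judgments, 2), precision_at_k(judgments, 4))
```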
Experiments (Qin et al.) cont.
Experiment Result 1:
Their approach:
General SE:
Domain Specific:
Meta-search enhanced method better than general search
engine and traditional domain specific search engine.
Experiment Result 2
Expert ranking results in range 1-4
Meta-search Enhanced: 2.77
Baseline: 2.51
In Top 100 results from Meta-search Enhanced collection:
from meta-search: 3.22, rest: 2.61
Experiments (Pant et al.)
Experiments Design:
Test Bed: from CiteSeer. 94 papers as initial
documents.
Use one (expanded by querying Google) for building
positive example set, 93 for building negative example
set.
Compare with a BFS crawler.
Harvest rate:
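Harvest rate is commonly understood as the average relevance of the pages crawled so far, tracked as the crawl proceeds. The sketch below assumes that reading; the precise definition used by Pant et al. may differ, and the function name and scores are made up for illustration.

```python
from typing import List


def running_harvest_rate(scores: List[float]) -> List[float]:
    """Average relevance of the first i crawled pages, for i = 1..N."""
    rates: List[float] = []
    total = 0.0
    for i, s in enumerate(scores, start=1):
        total += s
        rates.append(total / i)
    return rates


# Hypothetical relevance scores (1.0 = relevant) in crawl order.
rates = running_harvest_rate([1.0, 0.0, 1.0, 1.0])
print(rates)
```

Plotting this curve for a topical crawler and a BFS crawler on the same seeds shows which one keeps finding relevant pages as the crawl grows.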
Experiments (Pant et al.) cont.
Experiment results:
InterWeave: A middleware system for
distributed shared states
Conclusions
System overview for building literature
collection by topical web crawlers
Classifier enhanced Best first search
performs better than Breadth first search.
Meta-search enhanced topical crawler
performs better than topical crawlers without
meta-search.
A clustering based method to represent
latent community structures in collection
Discussion
Contribution of these two papers
Qin et al.: use meta-search to get multiple
starting URLs
Pant et al.: clarify and implement a sound system
structure. Propose a way to discover latent
communities in collection
Constraints
no significant theoretical contribution
experiments not convincing
Thanks!