Topical Crawlers for Building Digital Library Collections


Topical Crawlers for Building
Digital Library Collections
Presenter: Qiaozhu Mei
 1. J. Qin et al. Building Domain Specific Web
Collections for Scientific Digital Libraries: A
Meta-Search Enhanced Focused Crawling
Method
 2. G. Pant et al. Panorama: Extending Digital
Libraries with Topical Crawlers
(JCDL 2004)
Outline
 Problem Description
 Research Background
 Their Approaches
Designing Classifier
Enhancing Meta-Search
Identifying Communities in Collection
 Experiments
 Discussion
Problem Description
 Problem: Collect domain specific documents
from the Web and manage the literature
collection.
 Topical Crawlers (Focused Crawlers) are
designed to collect domain specific docs
 How to bridge the gap between Web communities.
Digital Library vs. Search Engine
 Digital Library
Domain specific,
serving
literature study
High Quality
Topical Crawler +
Collection
management
Knowledge
discovery
 Search Engine
General, serving
web search
High Quantity
General Crawler +
Online retrieval
Indexing, retrieving
performance, etc.
Research Background
Domain
Definition
Expand training set,
get starting URLs
VSM,
Naïve Bayes,
SVM, etc.
TF-IDF,
K-Means, etc.
PageRank,
HITS, etc.
BFS,
Best-first search,
Tree pruning,
Multiple starting
URLs,
Tunneling
 Why doesn't a general crawler with breadth-first
search work?
Web Communities
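
To make the contrast between breadth-first and best-first crawling concrete, here is a minimal sketch on a toy in-memory link graph. The graph `LINKS`, the scores in `SCORE`, and both function names are hypothetical and exist only for illustration; neither paper's crawler is reproduced here. The sketch assumes a link's priority is the relevance score of the page on which it was found, which is one common simplification.

```python
import heapq
from collections import deque

# Hypothetical toy web: page -> outgoing links.
LINKS = {
    "seed": ["x1", "a1"],
    "x1": ["x2", "x3"],
    "a1": ["a2"],
    "a2": ["a3"],
    "x2": [], "x3": [], "a3": [],
}
# Hypothetical relevance scores standing in for a trained classifier.
SCORE = {"seed": 0.9, "a1": 0.8, "a2": 0.7, "a3": 0.7,
         "x1": 0.1, "x2": 0.1, "x3": 0.1}


def bfs_crawl(start, budget):
    """Visit pages in breadth-first order until the budget is spent."""
    seen, order, queue = {start}, [], deque([start])
    while queue and len(order) < budget:
        page = queue.popleft()
        order.append(page)
        for nxt in LINKS[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def best_first_crawl(start, budget):
    """Visit first the links that were found on the highest-scoring pages."""
    seen, order = {start}, []
    frontier = [(-1.0, start)]  # max-heap via negated priority
    while frontier and len(order) < budget:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for nxt in LINKS[page]:
            if nxt not in seen:
                seen.add(nxt)
                # Priority of a link is the score of its parent page.
                heapq.heappush(frontier, (-SCORE[page], nxt))
    return order
```

With a budget of five pages, the breadth-first crawl spends three fetches on the off-topic pages `x1`, `x2`, `x3`, while the best-first crawl follows the relevant chain `a1`, `a2`, `a3`.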
Design Classifier (Pant et al.)
 Motivation: define the domain and distinguish
relevant & non-relevant documents
 Approach:
Query Google with title & reference to construct
positive/negative example set (training set)
Use Vector Space Model to represent documents,
use TF-IDF as term weights
Use Naïve Bayesian Classifier to estimate Pr(c+|q),
which is used for ranking
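
The classifier step can be sketched as follows. This is a minimal multinomial Naïve Bayes with add-one smoothing over raw term counts, not the exact model from Pant et al. (the slide mentions TF-IDF weights, which this sketch does not use). The function names `train_nb` and `prob_relevant` and the whitespace tokenizer are assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(pos_docs, neg_docs):
    """Fit term counts and class priors from positive/negative examples."""
    counts = {"+": Counter(), "-": Counter()}
    for doc in pos_docs:
        counts["+"].update(doc.lower().split())
    for doc in neg_docs:
        counts["-"].update(doc.lower().split())
    vocab = set(counts["+"]) | set(counts["-"])
    n_docs = len(pos_docs) + len(neg_docs)
    priors = {"+": len(pos_docs) / n_docs, "-": len(neg_docs) / n_docs}
    return counts, vocab, priors


def prob_relevant(page_text, model):
    """Estimate Pr(c+ | page) with add-one (Laplace) smoothing."""
    counts, vocab, priors = model
    log_post = {}
    for c in ("+", "-"):
        total = sum(counts[c].values()) + len(vocab)
        lp = math.log(priors[c])
        for tok in page_text.lower().split():
            if tok in vocab:  # ignore terms never seen in training
                lp += math.log((counts[c][tok] + 1) / total)
        log_post[c] = lp
    # Normalize in log space to avoid underflow on long pages.
    m = max(log_post.values())
    exp = {c: math.exp(v - m) for c, v in log_post.items()}
    return exp["+"] / (exp["+"] + exp["-"])
```

A crawler could then rank its frontier by `prob_relevant`, which is the ranking use the slide describes.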
Design Classifier (cont.)
 TF-IDF weighting:
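
The slide's own formula did not survive extraction. As an illustration only, the sketch below uses one common TF-IDF variant, weight = tf × log(N / df), where tf is the term count in the document, N the number of documents, and df the number of documents containing the term. The paper may use a different variant; `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # count each term once per document
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors
```

Under this variant a term that appears in every document gets weight zero, so it contributes nothing to distinguishing one document from another.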
Enhancing Meta-Search (Qin et al.)
 Motivation: Overcome the limitation of local
search algorithms in crawling; bridge
distributed Web communities
 Approach:
Manually provide domain specific queries
Query a meta-search engine to get multiple starting
URLs.
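
The seeding step can be sketched as merging result lists from several engines into one deduplicated set of starting URLs. This sketch makes no network calls: the result lists are passed in as plain data, and `normalize`, `merge_seeds`, and the `per_engine` cutoff are hypothetical names, not part of the method in Qin et al.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Lowercase scheme and host, drop query/fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def merge_seeds(results_by_engine, per_engine=10):
    """Merge the top results of each engine, keeping first-seen order."""
    seeds, seen = [], set()
    for urls in results_by_engine.values():
        for url in urls[:per_engine]:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                seeds.append(key)
    return seeds
```

Because the seeds come from several engines and several queries, they can land in Web communities that are not linked to each other, which is what lets the crawler start inside each one.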
Identifying Communities in Collection
(Pant et al.)
 Motivation: analyze the latent structures in
collection, summarize and represent potential
communities
 Approach:
Use k-means for content clustering
Use HITS for structural clustering
Label clusters by TF-IDF filtering
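
For the structural side, here is a minimal sketch of the HITS iteration on a small link graph. It only computes hub and authority scores; how Pant et al. turn those scores into communities is not shown on the slide and is not reproduced here. The function name `hits` and the fixed iteration count are assumptions.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a {page: [targets]} graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```

Pages with high authority are those cited by good hubs, and good hubs are those that point to many high-authority pages.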
Experiments (Qin et al.)
 Experiment Design
Compare with Google and a domain-specific search engine (SE).
 996,028 pages, 1/3 from the meta-search method.
 Metric: precision at 20 (pre@20).
Compare meta-search enhanced crawling with
traditional crawling in terms of precision.
 997,632 pages from the baseline method.
 Metric: precision at 10 (pre@10).
Experts define queries and judge results.
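
The pre@k metric can be computed as in the sketch below. It assumes expert judgments are available as a set of relevant result identifiers; `precision_at_k` is a hypothetical helper name. The denominator is k even when fewer than k results are returned, which is one common convention.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k ranked results that were judged relevant."""
    top_k = ranked_results[:k]
    n_relevant = sum(1 for r in top_k if r in relevant)
    return n_relevant / k
```

For example, if 2 of the first 4 results are judged relevant, precision at 4 is 0.5.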
Experiments (Qin et al.) cont.
 Experiment Result 1:
[Figure: results for their approach, the general SE, and the domain-specific SE]
The meta-search enhanced method performs better than the general search
engine and the traditional domain-specific search engine.
 Experiment Result 2:
Experts rated results on a scale of 1 to 4.
Meta-search enhanced: 2.77
Baseline: 2.51
Among the top 100 results from the meta-search enhanced collection:
 pages from meta-search: 3.22; the rest: 2.61
Experiments (Pant et al.)
 Experiment Design:
 Test bed: from CiteSeer. 94 papers as initial
documents.
 Use one paper (expanded by querying Google) to build the
positive example set and the other 93 to build the negative
example set.
 Compare with a BFS crawler.
 Harvest rate:
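
The slide's definition did not survive extraction. Harvest rate is commonly defined as the fraction of crawled pages that are relevant, and the sketch below assumes that definition with one boolean relevance flag per crawled page. The paper may instead use a score-based variant; both function names are hypothetical.

```python
def harvest_rate(relevance_flags):
    """Fraction of crawled pages judged relevant (0.0 for an empty crawl)."""
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)


def running_harvest_rate(relevance_flags):
    """Harvest rate after each crawled page, for plotting against crawl size."""
    rates, relevant = [], 0
    for i, flag in enumerate(relevance_flags, start=1):
        relevant += 1 if flag else 0
        rates.append(relevant / i)
    return rates
```

Plotting the running value for a topical crawler and a BFS crawler over the same number of pages gives the kind of comparison this experiment describes.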
Experiments (Pant et al.) cont.
 Experiment results:
[Figure; surviving label: InterWeave: A middleware system for
distributed shared states]
Conclusions
 System overview for building literature
collection by topical web crawlers
 Classifier enhanced Best first search
performs better than Breadth first search.
 Meta-search enhanced topical crawler
performs better than topical crawlers without
meta-search.
 A clustering based method to represent
latent community structures in collection
Discussion
 Contribution of these two papers
 Qin et al.: enhance crawling with meta-search to get multiple
starting URLs
 Pant et al.: clarify and implement a sound system
structure. Propose a way to discover latent
communities in the collection
 Constraints
 no significant theoretical contribution
 experiments not convincing
 Thanks!