
Detecting Near-Duplicates for Web Crawling
Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
WWW '07: Proceedings of the 16th International Conference on World Wide Web
Speaker: Kunwar Aditya Raghuwanshi
Near-Duplicates
• Near-duplicates are documents that offer identical content but differ in small portions. These differing portions may be attributed to advertisements, banners, timestamps, news feeds, etc.
Why do near-duplicates need to be detected?
• To save storage space for crawled data.
• To improve search quality (e.g., PageRank) by fine-tuning incoming/outgoing link counts.
• To produce better top results.
• Politeness.
Near-duplicate detection considerations
• Near-duplicate web pages must be detected during the crawl phase.
• The scale of the World Wide Web is very large: tens of billions of documents need to be indexed while millions of pages are crawled every day.
• Crawl decisions need to be made quickly.
• Should support single-document and multi-document (batch) queries.
How to detect Near-Duplicates?
• Use Charikar's simhash
– Obtain an f-bit fingerprint for each document/web page.
– Two documents are near-duplicates iff their fingerprints differ in at most k bits, i.e., the Hamming distance between the two simhashes is at most k (sketched below).
– The paper demonstrates that the algorithm performs well for f = 64 bits and k = 3.
• Exact or focused probes into a pre-sorted list of fingerprints are very slow and impractical.
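As a minimal Python illustration of the near-duplicate test (the function names are mine, and fingerprints are treated as plain integers), the comparison reduces to counting 1-bits in the XOR of the two fingerprints:

```python
def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")

def is_near_duplicate(fp1: int, fp2: int, k: int = 3) -> bool:
    # Fingerprints differing in at most k bits are treated as near-duplicates.
    return hamming_distance(fp1, fp2) <= k
```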
Simhash
The feature set can be shingles, the document vector, connectivity information, phrases, anchor text, anchor windows, etc.
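A minimal Python sketch of how a simhash fingerprint could be computed over such a feature set (the choice of MD5 as the underlying hash and unit weights are my own assumptions, not prescribed by the paper):

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash from (feature, weight) pairs.

    Each feature is hashed to f bits; an f-dimensional vector of counters is
    increased by the weight where the hash bit is 1 and decreased where it is 0.
    The sign of each counter yields one bit of the fingerprint.
    """
    v = [0] * f
    for feature, weight in features:
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(f) if v[i] > 0)

# Example: word shingles with unit weights.
fp = simhash([("detecting near duplicates", 1), ("near duplicates for", 1)])
```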
Efficient Approach
• Consider a sorted table of 2^d f-bit simhash fingerprints
– quite a few of the 2^d d-bit combinations exist in the table
– very few d-bit combinations are duplicated
• Choose d′ such that |d′ − d| is small enough to allow quick exact probes on the top d′ bits; the remaining f − d′ bits of each candidate are then checked against k (see the sketch below).
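A rough sketch of that probing step, assuming a sorted Python list of 64-bit integer fingerprints (helper names are mine): one exact probe on the top d′ bits narrows the table to a small contiguous range, and only those candidates are checked against k:

```python
import bisect

F = 64  # fingerprint width in bits

def probe(sorted_fps, query, d_prime, k=3):
    # All entries sharing the top d_prime bits with `query` form a contiguous
    # range in the sorted list; locate it with two binary searches.
    prefix = query >> (F - d_prime)
    lo = bisect.bisect_left(sorted_fps, prefix << (F - d_prime))
    hi = bisect.bisect_left(sorted_fps, (prefix + 1) << (F - d_prime))
    # Keep only candidates within Hamming distance k overall.
    return [fp for fp in sorted_fps[lo:hi] if bin(fp ^ query).count("1") <= k]
```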
Example
• Split the 64 bits into 4 blocks of 16 bits each.
• Build 4 tables of permuted fingerprints, one per block position.
• Perform exact probes on the 16 bits of each block.
• Selecting the right split is a tradeoff between the space required to store the tables and the time required for matching; the optimal number of tables is given by the expression derived in the paper (see the sketch below).
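A rough Python sketch of this 4-table example under my own naming (not the paper's code): each table stores the fingerprints rotated so that a different 16-bit block occupies the most-significant position. Since k = 3 differing bits cannot touch all four blocks, at least one block of a near-duplicate matches the query exactly, so a 16-bit exact probe in some table must find it:

```python
import bisect

F, BLOCK, K = 64, 16, 3
MASK = (1 << F) - 1

def rotate(fp, r):
    # Left-rotate a 64-bit fingerprint by r bits.
    r %= F
    return ((fp << r) | (fp >> (F - r))) & MASK

def build_tables(fingerprints):
    # One sorted table per block position: table i has block i rotated to the top.
    return [sorted(rotate(fp, i * BLOCK) for fp in fingerprints)
            for i in range(F // BLOCK)]

def find_near_duplicates(tables, q):
    matches = set()
    for i, table in enumerate(tables):
        qr = rotate(q, i * BLOCK)
        prefix = qr >> (F - BLOCK)
        lo = bisect.bisect_left(table, prefix << (F - BLOCK))
        hi = bisect.bisect_left(table, (prefix + 1) << (F - BLOCK))
        for fp in table[lo:hi]:
            if bin(fp ^ qr).count("1") <= K:            # verify the remaining 48 bits
                matches.add(rotate(fp, F - i * BLOCK))  # undo the rotation
    return matches
```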
Example (contd.)
Optimization
• Tables can be compressed, since the first d bits are common across successive (sorted) fingerprints.
• Compute a Huffman code over the distribution of the position of the most-significant 1-bit in the XOR of two successive fingerprints.
• Record the first fingerprint as-is.
• XOR each fingerprint with its successor and append the Huffman code for the position of the most-significant 1-bit (sketched below).
• This roughly halves the storage needed for the crawled fingerprint set.
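A minimal sketch of the delta step that feeds the Huffman coder (the coder itself is omitted and the function name is mine): for a sorted fingerprint list, only the position of the most-significant differing bit between neighbours is computed, and its skewed distribution is what makes the Huffman code effective:

```python
def msb_delta_positions(sorted_fps):
    # Position (0..63) of the most-significant 1-bit in the XOR of each pair of
    # successive sorted fingerprints; -1 means the two fingerprints are identical.
    # In the full scheme this position is Huffman-coded and followed by the
    # lower-order bits of the new fingerprint.
    return [(prev ^ cur).bit_length() - 1
            for prev, cur in zip(sorted_fps, sorted_fps[1:])]
```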
Batch mode
• Crawlers fetch millions of web pages every day; quick decisions are needed for new pages.
• Use the MapReduce framework for batch queries (see the sketch below)
– Map: mappers work on chunks of the main fingerprint set together with the whole query file, and output near-duplicate fingerprints.
– Reduce: collects all mapper output, removes duplicates, and produces a single sorted file.
– Each chunk is small, so it can be scanned sequentially; tables for the query file can be built in memory, so no compression is needed.
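A rough single-machine sketch of the batch flow (the actual system runs on Google's MapReduce over GFS chunks; `build_tables` and `find_near_duplicates` refer to the earlier 4-table sketch, and all names here are my own):

```python
def map_chunk(chunk_fps, query_tables):
    # Mapper: scan one chunk of the stored fingerprint set sequentially and emit
    # every fingerprint that is a near-duplicate of some query fingerprint.
    return [fp for fp in chunk_fps if find_near_duplicates(query_tables, fp)]

def reduce_outputs(mapper_outputs):
    # Reducer: merge all mapper outputs, drop duplicates, return one sorted list.
    return sorted({fp for hits in mapper_outputs for fp in hits})

# Usage: build in-memory tables once for the (small) query file, then fan out
# chunks of the large fingerprint set to mappers.
# query_tables = build_tables(query_fps)
# results = reduce_outputs([map_chunk(chunk, query_tables) for chunk in chunks])
```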
Map-Reduce
Why choose f=64, k=3
Related Works
• Techniques differ when they deal with web documents, files, or domain-specific corpora, as the end goals vary from detecting web mirrors, clustering, and plagiarism to spam detection.
• An efficient theoretical solution exists for d = 1; progress has been reported for large d.
• Signature Schemes:
– Mod-p shingles
– Min-hash for Jaccard similarity of sets
– Signatures/fingerprints over IR-based document vectors
– Checksums
Conclusions
• The paper addresses the problem of near-duplicate detection specifically for web crawlers.
• It proposes practical approaches and algorithms for single and batch mode queries.
• It validates the use of simhash properties.
• It surveys existing de-duplication methodologies.
• No comparison is available, as existing techniques have not been tested on such a large corpus (8 billion documents).
Pros & Cons
• Pros
– Provides a good set of results on real data (100M queries against a corpus of 8B documents); previous work in the field does not report results for a real-world query size.
– Practical and efficient; solves space issues with compression.
• Cons
– Based on similarity rather than exact content matching, so limited accuracy is expected.
– Web pages still need to be downloaded, so bandwidth is not reduced.
– Does not provide a comparison with existing techniques.
Relation to topics from Class
• Search Engine Basics
• Crawlers & Crawling
• Deduplication
• MapReduce