iRobot: An Intelligent Crawler for Web Forums
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang
Microsoft Research, Asia
July 16, 2015
Outline
• Motivation & Challenge
• iRobot – Our Solution
– System Overview
– Module Details
• Evaluation
Why Web Forums Are Important
• Forums are a huge resource of human knowledge
– Popular all over the world
– Cover almost any conceivable topic and issue
• Forum data can benefit many applications
– Improve the quality of search results
– Enable various data mining tasks on forum data
• Collecting forum data
– Is the basis of all forum-related research
– Is not a trivial task
Why Forum Crawling is Difficult
• Duplicate Pages
– Forums have a complex in-site structure
– Many browsing shortcuts lead to the same content
• Invalid Pages
– Most forums have access control
– Some pages can only be visited after registration
• Page-flipping
– A long thread is shown across multiple pages
– Deep navigation levels
The Limitations of Generic Crawlers
• In general crawling, each page is treated independently
– Fixed crawling depth
– Cannot avoid duplicates before downloading
– Fetch lots of invalid pages, such as login prompts
– Ignore the relationships between pages from the same thread
• Forum crawling needs a site-level perspective!
Statistics on Some Forums
• Around 50% of crawled pages are useless
• This wastes both bandwidth and storage
What Is a Site-Level Perspective?
• Understand the organization structure
• Find an optimal crawling strategy
[Figure: the site-level perspective of "forums.asp.net", a sitemap with the nodes Entry, List-of-Board, List-of-Thread, Post-of-Thread, Login Portal, Search Result, Digest, and Browse-by-Tag]
iRobot: An Intelligent Forum Crawler
[Figure: system overview. Labels: General Web Crawling, Forum Crawling, Restart; components: Sitemap Construction, Traversal Path Selection, Crawler, Segmentation & Archiving; data: Raw Pages, Meta]
Outline
• Motivation & Challenge
• Our Solution – iRobot
– System Overview
– Module Details
  Sitemap Construction
  • How many kinds of pages?
  • How do these pages link with each other?
  • Which pages are valuable?
  Traversal Path Selection
  • Which links should be followed?
• Evaluation
Page Clustering
• Forum pages are generated from a database and templates
• Layout is a robust way to describe a template
– Repetitive regions are everywhere on forum pages
– Layout can be characterized by repetitive regions
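A minimal sketch of layout-based clustering, not the authors' algorithm. It assumes each page has already been reduced to a list of tag paths, one per DOM block (a hypothetical input format; the slides do not specify the features or the clustering method). Tag paths that repeat at least `min_repeats` times stand in for "repetitive regions", and pages with the same set of repetitive paths are grouped together.

```python
from collections import Counter, defaultdict


def layout_signature(tag_paths, min_repeats=3):
    """Return the set of tag paths that occur at least min_repeats times."""
    counts = Counter(tag_paths)
    return frozenset(path for path, n in counts.items() if n >= min_repeats)


def cluster_pages(pages, min_repeats=3):
    """Group page URLs whose repetitive-region signatures are identical."""
    clusters = defaultdict(list)
    for url, tag_paths in pages.items():
        clusters[layout_signature(tag_paths, min_repeats)].append(url)
    return list(clusters.values())


# Hypothetical pages: two thread lists, one thread, one login page.
pages = {
    "/board/1": ["html/body/table/tr"] * 20 + ["html/body/div"],
    "/board/2": ["html/body/table/tr"] * 35 + ["html/body/div"],
    "/thread/7": ["html/body/div/div.post"] * 10 + ["html/body/div"],
    "/login": ["html/body/form"],
}
clusters = cluster_pages(pages)
```

The two board pages land in one cluster even though they list different numbers of threads, because only the identity of the repeated region matters, not its length. Exact signature matching is the simplest possible choice; a real system would likely need a similarity threshold.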
[Figure: page clusters identified on the example forum: List-of-Thread, Post-of-Thread, List-of-Board, Login Portal, Search Result, Digest, Browse-by-Tag]
Link Analysis
• URL patterns can distinguish links, but are not reliable on all sites
• Location on the page can also distinguish links
[Figure: links on an example page labeled by target, e.g. 1. Login, 4. Thread List, 5. Thread]
• A Link = URL Pattern + Location
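The "URL Pattern + Location" idea can be sketched as a composite key. This is an illustration under assumptions: `url_pattern` uses a simple digit-masking rule, and the location is a plain string label such as `"thread-list"` (both hypothetical; the slides do not say how patterns or locations are represented).

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url):
    """Generalize a URL: mask digit runs in the path, keep only query keys."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted(pair.split("=", 1)[0] for pair in parts.query.split("&") if pair)
    return path + ("?" + "&".join(keys) if keys else "")


def group_links(links):
    """Group (url, location) pairs by the key (URL pattern, location)."""
    groups = defaultdict(list)
    for url, location in links:
        groups[(url_pattern(url), location)].append(url)
    return dict(groups)


# Hypothetical links found on one page.
links = [
    ("/t/101.aspx", "thread-list"),
    ("/t/102.aspx", "thread-list"),
    ("/t/101.aspx", "sidebar"),
    ("/login.aspx?ReturnUrl=/t/101.aspx", "header"),
]
groups = group_links(links)
```

Here the two `/t/{n}.aspx` links in the thread list form one group, while the same URL in the sidebar forms another: the URL pattern alone would merge them, and the location keeps them apart.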
Informativeness Evaluation
• Which kind of pages (nodes) are valuable?
• Some heuristic criteria
– A larger node is more likely to be valuable
– Pages of large size are more likely to be valuable
– A diverse node is more likely to be valuable (diversity is based on content de-duplication)
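One way to combine the three heuristics is sketched below. The multiplicative score and the exact-hash de-duplication are assumptions made for illustration; the slides list the criteria but give no formula. `Page`, `diversity`, and `informativeness` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def diversity(pages):
    """Fraction of pages with distinct content, using exact-hash de-duplication."""
    if not pages:
        return 0.0
    digests = {hashlib.sha1(p.text.encode("utf-8")).hexdigest() for p in pages}
    return len(digests) / len(pages)


def informativeness(pages, total_pages):
    """Assumed score: node share * average page size * diversity."""
    if not pages or total_pages == 0:
        return 0.0
    share = len(pages) / total_pages
    avg_size = sum(len(p.text) for p in pages) / len(pages)
    return share * avg_size * diversity(pages)


# Hypothetical nodes: three distinct thread pages vs. three identical login prompts.
thread_node = [Page("/t/1", "a" * 500), Page("/t/2", "b" * 700), Page("/t/3", "c" * 600)]
login_node = [Page(f"/login?r={i}", "Please log in") for i in range(3)]
total = len(thread_node) + len(login_node)

thread_score = informativeness(thread_node, total)
login_score = informativeness(login_node, total)
```

The login node scores low on two counts: its pages are short, and all three collapse to a single content hash, so its diversity is one third.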
Traversal Path Selection
• Clean sitemap
– Remove valueless nodes
– Remove duplicate nodes
– Remove links to valueless / duplicate nodes
• Find an optimal path
– Construct a spanning tree
– Use depth as cost
• User browsing behaviors
– Identify page-flipping links (numbered links, Prev/Next)
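The steps above can be sketched on a toy sitemap. Two assumptions are made for illustration: "use depth as cost" is read as a breadth-first spanning tree, so every node is reached at minimum depth, and the set of nodes dropped as valueless or duplicate is chosen by hand. The node names come from the sitemap figure; the edges, the `drop` set, and the function names are hypothetical.

```python
import re
from collections import deque


def clean_sitemap(graph, drop):
    """Remove dropped nodes and every link that points to them."""
    return {
        node: [m for m in targets if m not in drop]
        for node, targets in graph.items()
        if node not in drop
    }


def min_depth_tree(graph, root):
    """Breadth-first spanning tree: maps each reachable node to its parent."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def is_page_flipping(anchor_text):
    """True for anchors that are a bare number or Pre/Prev/Previous/Next."""
    return re.fullmatch(r"\d+|pre|prev|previous|next", anchor_text.strip(), re.IGNORECASE) is not None


graph = {
    "Entry": ["List-of-Board", "Login Portal", "Search Result", "Digest", "Browse-by-Tag"],
    "List-of-Board": ["List-of-Thread", "Login Portal"],
    "List-of-Thread": ["Post-of-Thread", "Login Portal"],
    "Post-of-Thread": ["Login Portal"],
    "Search Result": ["Post-of-Thread"],
    "Digest": ["Post-of-Thread"],
    "Browse-by-Tag": ["List-of-Thread"],
    "Login Portal": [],
}
# Assumed for illustration: one invalid node and three shortcut (duplicate) nodes.
drop = {"Login Portal", "Search Result", "Digest", "Browse-by-Tag"}
tree = min_depth_tree(clean_sitemap(graph, drop), "Entry")
```

After cleaning, the only remaining path is Entry, List-of-Board, List-of-Thread, Post-of-Thread, which is the traversal path a crawler would follow, with page-flipping links handled separately within the list and thread pages.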
Evaluation Criteria
• Duplicate ratio
[Figure: bar chart, Mirrored Pages vs. iRobot, 0% to 25%, per forum]
• Invalid ratio
[Figure: bar chart, Mirrored Pages vs. iRobot, 0% to 70%, per forum]
• Coverage ratio
[Figure: bar chart of coverage ratio, 0% to 100%, per forum]
Forums: Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, Hoopchina
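The slide names the three criteria without defining them. The sketch below assumes the natural definitions: duplicate and invalid ratios are shares of the crawled pages, and coverage is the number of valuable pages crawled divided by the number of valuable pages in a mirrored reference copy. The function name, the label strings, and the example counts are hypothetical.

```python
from collections import Counter


def crawl_ratios(crawled_labels, mirrored_valuable):
    """Compute duplicate, invalid, and coverage ratios under the assumed definitions.

    crawled_labels: one of "valuable", "duplicate", "invalid" per crawled page.
    mirrored_valuable: number of valuable pages in the mirrored reference set.
    """
    total = len(crawled_labels)
    if total == 0 or mirrored_valuable == 0:
        raise ValueError("need at least one crawled page and one reference page")
    counts = Counter(crawled_labels)
    return {
        "duplicate": counts["duplicate"] / total,
        "invalid": counts["invalid"] / total,
        "coverage": counts["valuable"] / mirrored_valuable,
    }


# Made-up example: 10 crawled pages against a mirror with 8 valuable pages.
labels = ["valuable"] * 6 + ["duplicate"] * 3 + ["invalid"]
ratios = crawl_ratios(labels, mirrored_valuable=8)
```

Lower duplicate and invalid ratios mean less wasted bandwidth and storage, while coverage checks that the savings do not come at the cost of missing valuable pages.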
Effectiveness and Efficiency
• Effectiveness
[Figure: bar charts of Invalid, Duplicate, and Valuable pages per forum, y-axis 0 to 6000, for (a) a generic crawler and (b) iRobot]
• Efficiency
[Figure: bar charts of Invalid, Duplicate, and Valuable pages per forum, y-axis 0 to 20000, for (a) a generic crawler and (b) iRobot]
Forums: Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, Hoopchina (one panel's axis also carries the label Gentoo)
Performance vs. Sampled Page#
[Figure: coverage ratio, duplicate ratio, and invalid ratio (y-axis 0% to 90%) versus number of sampled pages (10, 20, 50, 100, 500, 1000)]
Preserved Discussion Threads
Forums        Mirrored   Crawled by iRobot   Correctly Recovered
Biketo        1584       1313                1293
Asp           600        536                 536
Baidu         −          −                   −
Douban        62         60                  37
CQZG          1393       1384                1311
Tripadvisor   326        272                 272
Hoopchina     2935       2829                2593
Overall: 94.5% of the threads crawled by iRobot and 87.6% of the mirrored threads are correctly recovered
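The slide does not say how its two percentages are derived. The check below sums the table columns (skipping Baidu, which has no counts) and shows that 94.5% and 87.6% are consistent with correctly recovered threads as a share of crawled threads and of mirrored threads, respectively. The variable names are illustrative.

```python
# (mirrored, crawled by iRobot, correctly recovered) per forum, from the table.
threads = {
    "Biketo": (1584, 1313, 1293),
    "Asp": (600, 536, 536),
    "Douban": (62, 60, 37),
    "CQZG": (1393, 1384, 1311),
    "Tripadvisor": (326, 272, 272),
    "Hoopchina": (2935, 2829, 2593),
}

mirrored = sum(row[0] for row in threads.values())   # 6900
crawled = sum(row[1] for row in threads.values())    # 6394
recovered = sum(row[2] for row in threads.values())  # 6042

recovered_of_crawled = round(100 * recovered / crawled, 1)    # 94.5
recovered_of_mirrored = round(100 * recovered / mirrored, 1)  # 87.6
```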
Conclusions
• An intelligent forum crawler based on site-level structure analysis
– Identify page templates / valuable pages / link analysis / traversal path selection
• Some modules can still be improved
– More automated & mature algorithms in SIGIR’08
• More future work directions
– Queue management
– Refresh strategies
Thanks!