Crowd Crawling - ACM Conference on Online Social Networks

Crowd Crawling:
Towards Collaborative Data Collection
for Large-scale Online Social Networks
Cong Ding, Yang Chen*, and Xiaoming Fu
University of Göttingen
*Duke University
Significance of social network data crawling
• Understanding user behaviors
• Improving SNS architectures
• Handling privacy/security issues
• and so on...
Current data collection methods (1)
• ISP-based measurement [Schneider IMC’09]
Only ISPs can do this
Current data collection methods (2)
• Cooperate with SNS companies [Yang IMC’11]
Most research groups do not have this opportunity
Current data collection methods (3)
• A single group crawls the data (and shares it with others)
[Gjoka INFOCOM’10]
Suffers from request rate limiting
Shortcomings of crawling by a single group
• Wastes computing and network resources
• Introduces overhead for service providers (and may lead to stricter rate limiting)
• Leaves the research community without ground truth
A new thought
Why not collect data collaboratively?
System overview
Coordinator
Crawlers
System design
• Fetching UIDs (BFS, etc.)
• Handling crawling failure (timeout)
• Bypassing request rate limiting (a large number of IP addresses)
• Data fidelity (redundant crawling)
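The first, second, and fourth design points above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Coordinator` class, its method names, the 60-second default timeout, and the `majority_result` helper are all hypothetical and chosen only for illustration. The sketch assumes crawlers pull one UID at a time, report the crawled data together with the neighbor UIDs they discovered, and that redundant crawl results are hashable values that can be compared directly. Rate limiting is not modeled.

```python
import time
from collections import Counter, deque


class Coordinator:
    """Hypothetical coordinator: hands out UIDs in BFS order and
    re-queues any UID whose crawler did not answer before the timeout."""

    def __init__(self, seed_uids, timeout=60.0, clock=time.monotonic):
        self.frontier = deque(seed_uids)  # BFS queue of UIDs waiting to be crawled
        self.seen = set(seed_uids)        # every UID ever queued (avoids duplicates)
        self.pending = {}                 # uid -> deadline of the outstanding assignment
        self.done = {}                    # uid -> crawled data
        self.timeout = timeout            # assumed value; not given in the slides
        self.clock = clock                # injectable clock, useful for testing

    def next_uid(self):
        """Return the next UID for a crawler, or None if nothing is ready."""
        self._requeue_expired()
        while self.frontier:
            uid = self.frontier.popleft()
            if uid in self.done:
                continue  # a late result arrived after this UID was re-queued
            self.pending[uid] = self.clock() + self.timeout
            return uid
        return None

    def submit(self, uid, data, neighbor_uids):
        """Store a crawler's result and enqueue newly discovered neighbors."""
        if uid in self.done:
            return  # ignore duplicate submissions
        self.pending.pop(uid, None)
        self.done[uid] = data
        for neighbor in neighbor_uids:
            if neighbor not in self.seen:
                self.seen.add(neighbor)
                self.frontier.append(neighbor)

    def _requeue_expired(self):
        """Treat assignments past their deadline as failed and queue them again."""
        now = self.clock()
        expired = [uid for uid, deadline in self.pending.items() if deadline <= now]
        for uid in expired:
            del self.pending[uid]
            self.frontier.append(uid)


def majority_result(results):
    """Redundant crawling: return the value reported by a strict majority
    of crawlers, or None if no value has a strict majority."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None
```

In this sketch a crawler loop would repeatedly call `next_uid()`, fetch that user's data, and call `submit()`. For the fidelity check, the same UID would be assigned to several crawlers and their answers passed to `majority_result`.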
Implementation
• A proof-of-concept prototype (without the data fidelity part) for crawling Weibo
• 472 PlanetLab servers as crawlers
Evaluation
• In 24 hours, we crawled the data of 2.22M Weibo users,
including user profiles, all posts, and all social connections
• Comparison:
• Fu et al. (PLOS ONE 2013) collected the data of 30K users in 6 days
• Guo et al. (PAM 2013) collected the data of 1M users in 1 month
#UIDs/day:
• Crowd Crawling: 2.22M
• Fu et al.: 5K
• Guo et al.: 33K
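The per-day figures follow from dividing the number of users by the collection time. The short check below reproduces that arithmetic. The function name `uids_per_day` is hypothetical, and the 30-day month is an assumption, since the slides say only "1 month".

```python
def uids_per_day(num_users, days):
    """Average number of user IDs crawled per day."""
    return num_users / days


rates = {
    "Crowd Crawling": uids_per_day(2_220_000, 1),  # 2.22M users in 24 hours
    "Fu et al.": uids_per_day(30_000, 6),          # 30K users in 6 days
    "Guo et al.": uids_per_day(1_000_000, 30),     # assumption: 1 month = 30 days
}

# Ratio of the Crowd Crawling rate to each of the other two rates.
speedup_vs_fu = rates["Crowd Crawling"] / rates["Fu et al."]
speedup_vs_guo = rates["Crowd Crawling"] / rates["Guo et al."]
```

Under these assumptions, the Fu et al. and Guo et al. rates round to the 5K and 33K reported above.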
Conclusion and Discussion
• Data sharing may violate some providers’ terms of service
o Twitter does not allow data sharing (even for research)
o Weibo allows data sharing among researchers
• Unrestricted data sharing might raise ethical issues
o The data should be anonymized
• We will publish the data crawled in the evaluation