The Design and Implementation of Crawler Framework Based on Scrapy


2016Fall.01
Group of Intelligent Analysis & Recommendation System, Xiamen University
The Design and Implementation of Crawler Framework Based on Scrapy
September 19, 2016
WANG Weiwei, Department of Automation, Xiamen University
CONTENT
01 Which sites can be crawled
02 The Framework of Crawler
03 Data processing and application
04 Open Source Code
05 Our Code
06 Distributed Crawls
07 Avoiding getting banned
08 Papers and Research
01
PART ONE
Which sites can be crawled
1. Which sites can be crawled
All kinds of sites
Which sites are worth crawling…
02
PART TWO
The Framework of Crawler
2. The Framework of Crawler
Scrapy
(https://scrapy.org/)
A Fast and Powerful Scraping and Web Crawling Framework
03
PART THREE
Data processing and application
3. Data processing and application
Content and Text Analysis
News websites, e.g. http://news.sina.com.cn/, http://news.163.com/, http://news.qq.com/ …
Industry Analysis
Shopping sites, e.g. http://www.jd.com/, https://www.taobao.com/, http://www.yhd.com/ …
Social Media Monitoring
Social networks, e.g. Weibo, WeChat Official Accounts, Facebook, Twitter …
04
PART FOUR
Open Source Code
4. Open Source Code
Scrapy is a fast high-level web crawling and web
scraping framework, used to crawl websites and
extract structured data from their pages. It can be
used for a wide range of purposes, from data
mining to monitoring and automated testing.
4. Open Source Code
WeChat Official Account crawler
https://github.com/hexcola/wcspider
Douban Books crawler
https://github.com/lanbing510/DouBanSpider
Zhihu crawler
https://github.com/LiuRoy/zhihu_spider
Bilibili user crawler
https://github.com/airingursb/bilibili-user
Sina Weibo crawler
https://github.com/LiuXingMing/SinaSpider
Distributed novel-download crawler
https://github.com/gnemoug/distribute_crawler
CNKI (China National Knowledge Infrastructure) crawler
https://github.com/yanzhou/CnkiSpider
Lianjia real-estate crawler
https://github.com/lanbing510/LianJiaSpider
JD.com crawler
https://github.com/taizilongxu/scrapy_jingdong
QQ Groups crawler
https://github.com/caspartse/QQ-Groups-Spider
WooYun crawler
https://github.com/hanc00l/wooyun_public
05
PART FIVE
Our Code
5. Our Code
- Based on Scrapy
- Encapsulated
- Provides an API
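The slides do not show the framework's actual API, so the sketch below only illustrates what "encapsulation plus an API" on top of Scrapy might look like. All names here (CrawlJob, build_spider) are hypothetical, invented for this example.

```python
from dataclasses import dataclass


@dataclass
class CrawlJob:
    """Hypothetical declarative description of one crawl task."""

    start_urls: list        # where the crawl begins
    allowed_domains: list   # stay within these domains
    item_fields: dict       # CSS selectors keyed by output field name


def build_spider(job):
    """Turn a CrawlJob into a plain config dict that an underlying
    Scrapy spider could consume (sketch only, not the group's API)."""
    return {
        "start_urls": list(job.start_urls),
        "allowed_domains": list(job.allowed_domains),
        "fields": dict(job.item_fields),
    }
```

The design idea such a wrapper captures is that callers describe *what* to crawl declaratively, while the Scrapy-specific details stay hidden behind the API.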
5. Our Code
WORKFLOW
5. Our Code
What to do next on our framework?
- JavaScript rendering
- Simulated user login
- Cookie handling
- Proxy servers
- Redis
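The proxy-server item above can be prototyped as a small Scrapy downloader middleware: setting `request.meta["proxy"]` routes that request through the given proxy. The class below is a hedged sketch; the class name and the proxy addresses are illustrative, not part of the group's framework.

```python
import random

# Illustrative proxy pool; a real deployment would load these from a
# config file or a rotating-proxy service.
PROXY_POOL = [
    "http://127.0.0.1:8118",
    "http://127.0.0.1:8119",
]


class RandomProxyMiddleware:
    """Downloader-middleware sketch: route each outgoing request
    through a randomly chosen proxy from the pool."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXY_POOL)
        return None  # let Scrapy continue handling the request
```

Enabled via the project's DOWNLOADER_MIDDLEWARES setting, this would spread requests across the pool without touching spider code.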
06
PART SIX
Distributed Crawls
6. Distributed Crawls
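The slide itself gives no detail here. One common way to distribute Scrapy crawls across machines is the scrapy-redis extension, which shares the request queue and duplicate filter through a Redis server. The fragment below is a sketch of its documented settings, assuming scrapy-redis is installed; the Redis URL is a placeholder.

```python
# settings.py fragment for a distributed crawl with scrapy-redis
# (assumes the scrapy-redis package; the Redis URL is a placeholder).
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared request queue in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared duplicate filter
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = "redis://localhost:6379"
```

With these settings, several workers running the same spider pull URLs from one queue, which also matches the "Redis" item on the previous slide.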
07
PART SEVEN
Avoiding getting banned
7. Avoiding getting banned
• rotate your user agent from a pool of well-known ones
from browsers (google around to get a list of them)
• disable cookies (see COOKIES_ENABLED) as some sites
may use cookies to spot bot behaviour
• use download delays (2 or higher). See
DOWNLOAD_DELAY setting.
• if possible, use Google cache to fetch pages, instead of
hitting the sites directly
• use a pool of rotating IPs. For example, the free Tor
project or paid services like ProxyMesh
• use a highly distributed downloader that circumvents
bans internally, so you can just focus on parsing clean
pages. One example of such downloaders is Crawlera
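The first three bullets map onto concrete Scrapy settings plus a small downloader middleware. The sketch below is illustrative: the user-agent strings are a tiny sample pool (a real pool should be much larger), and the middleware class name is our own, not a Scrapy built-in.

```python
import random

# Tiny sample pool of well-known browser user-agent strings
# (illustrative only; rotate from a much larger list in practice).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/47.0",
]


class RotateUserAgentMiddleware:
    """Downloader-middleware sketch: pick a random User-Agent header
    for every outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # continue normal request processing


# settings.py fragment matching the bullet points above:
# COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour
# DOWNLOAD_DELAY = 2        # wait 2 seconds between requests
```

Together with a proxy pool, these settings cover the cheap, self-hosted end of the advice; Crawlera-style services handle the rest externally.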
08
PART EIGHT
Papers and Research
8. Papers and Research
- Crawler Technology
- Data Mining
Q&A
Thanks for Listening