A Web Crawler Design for Data Mining
Mike Thelwall
University of Wolverhampton, Wolverhampton, UK
Journal of Information Science 2001
27 April 2011
Presentation @ IDB Lab Seminar
Presented by Jee-bum Park
Outline
Introduction
Architecture
Implementation
System Testing
Conclusion
2
Introduction
- Motive
The importance of the web has guaranteed academic interest in it,
not only for affiliated technologies, but also for its content
3
Introduction
- Motive
Information scientists and others wish to perform data mining on
large numbers of web pages
They will require the services of a web crawler,
– To extract patterns from the web
– To extract meaning from the link structure of the web
An effective paradigm for a web mining crawler is therefore needed
4
Introduction
- Web Crawler
A web crawler, robot or spider
A program that is capable of iteratively and automatically,
– Downloading web pages
– Extracting URLs from their HTML
– Fetching them
5
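As an illustration of this loop, the sketch below is a minimal single-site crawler written in Python with only the standard library; the seed URL, page limit, and same-host restriction are illustrative choices rather than details from the paper.

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkParser(HTMLParser):
        """Collect href attributes from anchor tags in a downloaded page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=100):
        """Iteratively download pages, extract URLs from their HTML, and fetch them."""
        host = urlparse(seed).netloc
        queue, seen, pages = [seed], {seed}, {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", "replace")
            except Exception:
                continue  # skip unreachable or non-HTML resources
            pages[url] = html
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)  # resolve relative links
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages
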
Introduction
- Web Crawler: Workflow
[Diagram: a web crawler traversing http://idb.snu.ac.kr/, discovering pages under / (index.html, index.php, login.php, logo.gif), /images/ (menu.jpg, bg.png), /board/ (index.php?id=2, index.php?id=3, a.jpg), and /board/files/ (b.txt, c.zip)]
6
Introduction
- Web Crawler: Architecture
7
Introduction
- Web Crawler: Roles
A sophisticated web crawler may also be capable of,
– Identifying pages judged relevant to the crawl
– Rejecting pages as duplicates of ones previously visited
– Supporting the action of search engines
For example, constructing the searchable index
8
Introduction
- Web Crawler: Issue
In the normal course of operation,
a simple crawler will spend most of its time awaiting data
– Requesting a web page
– Receiving a web page
For this reason, crawlers are normally multi-threaded
If the crawling task requires more complex processing,
the speed of the crawler will be reduced
A distributed approach for crawlers is needed
9
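Since the waiting happens inside network calls, several requests can be overlapped simply by running the fetches in a thread pool. A minimal sketch, assuming an illustrative worker count and URL list:

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        """Download one page; the thread blocks on network I/O, not on the CPU."""
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return url, response.read()
        except Exception as error:
            return url, error

    urls = ["http://idb.snu.ac.kr/", "http://idb.snu.ac.kr/board/"]  # illustrative
    with ThreadPoolExecutor(max_workers=8) as pool:  # eight requests in flight at once
        for url, result in pool.map(fetch, urls):
            size = len(result) if isinstance(result, bytes) else result
            print(url, "->", size)
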
Introduction
- Distributed Systems
Using idle computers connected to the internet
– To gain extra processing power
– To distribute processing power
For personal site-specific crawlers, a single personal computer
solution may be fast enough
An alternative is a distributed model
– A central control unit
– Many crawlers operating on individual personal computers
10
Outline
Introduction
Architecture
Implementation
System Testing
Conclusion
11
Architecture
Two components,
– The crawler/analyzer units
– The control unit
Four constraints,
1. Almost all processing should be conducted on idle computers
2. The distributed architecture should not increase network traffic
3. The system must be able to operate through a firewall
4. The components must be easy to install and remove
12
Architecture
[Diagram: the control unit coordinating crawler units running on idb.snu.ac.kr, siva.snu.ac.kr, brahma.snu.ac.kr, my.snu.ac.kr, sugang.snu.ac.kr, and etl.snu.ac.kr]
13
Architecture
- The Crawler/Analyzer Units
The program will,
– Crawl a site or set of sites
– Analyze the pages
– Report its results
It can execute on the kind of computer that is likely to have spare time,
normally a personal computer
14
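A possible sketch of one job cycle for such a unit: it asks the control unit for a job, crawls the site, runs a toy link analysis, and reports the result. The control unit address, the /job and /report endpoints, the JSON job format, and the analysis itself are all assumptions made for illustration; the crawl function is passed in (for example, the crawler sketch from the introduction).

    import json
    import urllib.request

    CONTROL_UNIT = "http://control.example.org"  # hypothetical control unit address

    def analyze_links(pages):
        """A toy analysis: count the '<a ' tags in each downloaded page."""
        return {url: html.lower().count("<a ") for url, html in pages.items()}

    def run_unit(crawl):
        """One cycle: request a job, crawl the site, analyze it, report back."""
        job = json.loads(urllib.request.urlopen(CONTROL_UNIT + "/job").read())
        if not job:
            return  # no unallocated commands at the moment; try again later
        pages = crawl(job["site"])
        report = {"job_id": job["id"], "summary": analyze_links(pages)}
        urllib.request.urlopen(CONTROL_UNIT + "/report",
                               data=json.dumps(report).encode())
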
Architecture
- The Crawler/Analyzer Units: Data Management
Accessing permanent storage space to save the web pages
– Linking to a database
– Using the normal file storage system
Pages must be saved on each host computer,
in order to minimize network traffic
If the system is capable of handling enough data,
a large-scale server-based database can be used
It must provide a facility for the user to delete all saved data
15
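A minimal sketch of the file-storage option, assuming a local directory named crawler_store and filenames derived from an MD5 digest of the URL; both choices, and the delete-everything routine, are illustrative rather than the paper's implementation.

    import hashlib
    import shutil
    from pathlib import Path

    STORE = Path.home() / "crawler_store"  # hypothetical local storage directory

    def save_page(url, html):
        """Keep the page on the local disk so it never crosses the network again."""
        STORE.mkdir(exist_ok=True)
        name = hashlib.md5(url.encode()).hexdigest() + ".html"  # filename from URL
        (STORE / name).write_text(html, encoding="utf-8")

    def clear_all_data():
        """The facility for the user to delete everything the crawler has saved."""
        shutil.rmtree(STORE, ignore_errors=True)
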
Architecture
- The Crawler/Analyzer Units: Interface
The user must be able to,
– Stop the crawler immediately
– Clear all saved data from the computer
16
Architecture
- The Control Unit
The control unit will live on a web server
It will be triggered when a crawler unit requests a job or sends some data
It will need to store the commands that the owner wishes to be executed,
each marked with a status,
– Completed
– In progress
– Unallocated
17
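One way such a control unit might look, sketched as a small Python web service with an in-memory table of commands, matching the job-request and report exchange assumed in the crawler unit sketch earlier; the endpoints, job fields, and port are assumptions rather than the system's actual interface.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Commands the owner wishes to be executed, each marked with a status.
    jobs = [
        {"id": 1, "site": "http://idb.snu.ac.kr/", "status": "unallocated"},
        {"id": 2, "site": "http://etl.snu.ac.kr/", "status": "unallocated"},
    ]

    class ControlUnit(BaseHTTPRequestHandler):
        def do_GET(self):
            """A crawler unit requests a job: hand out the next unallocated one."""
            job = next((j for j in jobs if j["status"] == "unallocated"), None)
            if job:
                job["status"] = "in progress"
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(job or {}).encode())

        def do_POST(self):
            """A crawler unit sends some data: mark the matching job completed."""
            length = int(self.headers.get("Content-Length", 0))
            report = json.loads(self.rfile.read(length))
            for job in jobs:
                if job["id"] == report.get("job_id"):
                    job["status"] = "completed"
            self.send_response(204)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8000), ControlUnit).serve_forever()
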
Architecture
[Diagram (repeated): the control unit coordinating crawler units running on idb.snu.ac.kr, siva.snu.ac.kr, brahma.snu.ac.kr, my.snu.ac.kr, sugang.snu.ac.kr, and etl.snu.ac.kr]
18
Outline
Introduction
Architecture
Implementation
System Testing
Conclusion
19
Implementation
- The Crawler/Analyzer Units
The architecture was employed to create a system for analyzing
the link structure of university web sites
20
Implementation
- The Crawler/Analyzer Units
Previous system
– Running a single crawler/analyzer program
Issues
– It did not run quickly enough
– It had to be individually set up and run on a number of computers
– Inefficient in terms of both human time and processor use!
New system
– The existing stand-alone crawler was used as the basis
– Communication and easy installation features were added
– Buttons to instantly close the program and remove any saved data
– Processed by a compressor for easy distribution
21
Implementation
- The Crawler/Analyzer Units
Choice of the types of checking for duplicate pages
– No page checking
– HTML page checking
– Weak HTML page checking
Comparing methods
– Comparing each page against all of the others
Naive
– Various numbers were calculated from the text of each page
For example, the length of the page, MD5 or SHA-1 hash, etc.
22
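A sketch of the number-based comparison: each page is reduced to an MD5 digest so duplicates can be grouped without comparing every page against every other. The "weak" variant here strips HTML markup before hashing, which is an assumed reading of weak HTML page checking rather than the paper's exact definition.

    import hashlib
    import re

    def digest(html, weak=False):
        """Reduce a page to a single number used as its duplicate-checking key."""
        if weak:
            html = re.sub(r"<[^>]*>", "", html)  # assumed: ignore markup, compare text
        return hashlib.md5(html.encode("utf-8", "replace")).hexdigest()

    def find_duplicates(pages, weak=False):
        """Group URLs whose pages share the same digest."""
        groups = {}
        for url, html in pages.items():
            groups.setdefault(digest(html, weak), []).append(url)
        return [urls for urls in groups.values() if len(urls) > 1]
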
Implementation
- The Control Unit
Entirely new!
It was given a reporting facility
– Statistics
– To deliver a summary of the crawler units
23
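A small sketch of the kind of summary such a reporting facility could deliver, counting tasks in each state over a job table like the one in the control unit sketch; the statistics actually reported by the system are not listed on this slide.

    from collections import Counter

    def status_summary(jobs):
        """Statistics for the owner: the number of tasks in each state."""
        counts = Counter(job["status"] for job in jobs)
        return {state: counts.get(state, 0)
                for state in ("unallocated", "in progress", "completed")}
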
Outline
Introduction
Architecture
Implementation
System Testing
Conclusion
24
System Testing
The system was tested in June and July of 2000
Each task specified,
– A set of sites or web pages to download
– An analysis to perform on the downloaded sites
25
System Testing
- Result
The total number of crawler units
– Peaked at just over 100 with three rooms of computers
9112 tasks completed by the system
Over 100,000 pages downloaded
Each crawler used approximately 1 GB of hard disk space
The system had become a virtual computer with over
100 GB of disk space and over 100 processors
26
System Testing
- Limitations
The system was not able to run fully automatically
The problem was randomly generated web pages
– For example, a huge set of web pages containing usage statistics for
electronic equipment with one page per device per day
The solution was,
– To manually check the root cause of the problem
– To add the offending URLs to a banned list operated by the control unit
An alternative is to design a heuristic that avoids such problems,
– For example, a maximum crawl depth
27
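A sketch of how a crawler unit might apply the two safeguards together, a banned-URL list distributed by the control unit and a maximum crawl depth; the banned prefixes, the depth limit, and the depth measure are all illustrative.

    from urllib.parse import urlparse

    BANNED = ["http://example.org/stats/"]  # hypothetical prefixes from the control unit
    MAX_DEPTH = 5                           # illustrative crawl-depth limit

    def allowed(url, depth):
        """Skip URLs on the banned list and pages nested too deeply in a site."""
        if any(url.startswith(prefix) for prefix in BANNED):
            return False
        return depth <= MAX_DEPTH

    def path_depth(url):
        """One possible depth measure: the number of path segments in the URL."""
        return len([segment for segment in urlparse(url).path.split("/") if segment])
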
Outline
Introduction
Architecture
Implementation
System Testing
Conclusion
28
Conclusion
The distributed architecture has shown itself
– Capable of crawling a large collection of web sites
– By using idle processing power and disk space
The testing of the system has shown that
– It cannot operate fully automatically
– Without an effective heuristic for identifying duplicate pages
29
Conclusion
The architecture is particularly suited to situations
– Where a task can be decomposed into a collection of crawling-based tasks
It would be unsuitable if
– The crawls had to cross-reference each other
– The data mining had to be performed in an integrated way
The architecture is an effective way to use idle computing
resources in order to perform large-scale web data mining tasks
30
Thank You!
Any Questions or Comments?