Crawler.Design.Mid.II


Web Categorization Crawler II
Mohammed Agabaria
Adam Shobash
Supervisor: Victor Kulikov
Spring 2010
Design & Architecture
May 2010
Contents
 Crawler Overview
 Crawling Problems
 System Components
    Main Components
 Project Goals
 Improvements
    Improvements in Categorizer
    Improvements in Ranker
    Improvements in Frontier
 Schedule
Crawler Overview
 A Web Crawler is a computer program that browses the World Wide Web in a methodical, automated manner
 The Crawler starts with a list of URLs to visit, called the seeds list
 The Crawler visits these URLs, identifies all the hyperlinks in each page, and adds them to the list of URLs to visit, called the frontier
 URLs from the frontier are recursively visited according to a predefined set of policies
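The seeds/frontier loop described above can be sketched as follows. This is a minimal illustrative sketch, not the project's implementation; `fetch` and `extract_links` are hypothetical callables supplied by the caller:

```python
from collections import deque

def crawl(seeds, fetch, extract_links, max_pages=100):
    """Visit URLs starting from the seeds, adding discovered links
    to the frontier until the page budget is exhausted."""
    frontier = deque(seeds)   # list of URLs still to visit
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue          # simple policy: visit each URL only once
        visited.add(url)
        page = fetch(url)
        pages[url] = page
        for link in extract_links(page):
            if link not in visited:
                frontier.append(link)
    return pages
```

With a deque the frontier is FIFO (breadth-first); Part II replaces this ordering with rank-based selection.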
Crawling Problems
 The World Wide Web contains a large volume of data
    A crawler can only download a fraction of the Web pages
    Thus there is a need to prioritize and speed up downloads, and to crawl only the relevant pages
 Dynamic page generation
    May cause duplication in the content retrieved by the crawler
    Also causes crawler traps: an endless combination of HTTP requests to the same page
 Fast rate of change
    Pages that were downloaded may have changed since the last time they were visited
    Some crawlers may need to revisit pages in order to keep their data up to date
Main Components
[Architecture diagram: the Crawler runs several Workers (Worker1, Worker2, …) that share a Frontier and store results through the Storage System, which consists of a Data Base Handler in front of the Data Base; the Web Application GUI loads and stores configurations and views the results.]
Project Goals – Part II
 Improve the current simple and naïve algorithms into more efficient and sophisticated ones:
    Improve the Categorizer
       Will examine the content of the page
    Improve the Ranker
       Will use an algorithm similar to "Shark Search"
       Fetch first the pages that were ranked as preferred
    Improve the Frontier
       Must be faster
       Must use less space
 Extend the Web Application GUI:
    The user will have more options when using the crawler
    More user friendly
Improvements In Categorizer
 The Part I Categorizer ascribed every web page to the single "Web Page" category – a dummy Categorizer
 The Part II Categorizer will use a somewhat more sophisticated algorithm:
    It will go over all the categories that have been defined by the user
    It will compare the current web page to every category:
       Every category has a confidence level (CL)
       The match level of the current web page to a certain category is evaluated
       If the evaluation is bigger than the CL of that category, the current web page is ascribed to it
    Eventually the current web page will be ascribed to all the categories it belongs to (hopefully)
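The slides do not define how the match level is evaluated; the sketch below assumes it is the percentage of a category's keywords that occur in the page, which is one plausible reading of the keyword-list design. Category names and data here are made up for illustration:

```python
def match_level(page_words, keywords):
    """Percentage of the category's keywords that occur in the page."""
    words = set(page_words)
    hits = sum(1 for kw in keywords if kw in words)
    return 100 * hits // len(keywords)

def classify(page_words, categories):
    """Ascribe the page to every category whose match level
    exceeds that category's confidence level (CL)."""
    return [name
            for name, (keywords, cl) in categories.items()
            if match_level(page_words, keywords) > cl]
```

A page can end up in several categories at once, matching the slide's "ascribed to all the categories it belongs to".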
Improvements In Ranker
 The Part I Ranker gives all URLs a rank of zero – a dummy Ranker
 The Part II Ranker will replace this "algorithm" with a better one
 It will use an algorithm close to the Shark Search algorithm, taking into consideration:
    The text near the extracted link
    The rank of the parent web page
 The rank of a URL is calculated with the formula:

rank(url) = α · rank(parentUrl) + (1 − α) · rank(url_neighborhood)

where 0 ≤ α ≤ 1
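The weighted formula above can be computed as in this sketch. How the neighbourhood rank is derived from the nearby text is not specified in the slides, so it is passed in as a precomputed value:

```python
def rank_url(parent_rank, neighborhood_rank, alpha=0.5):
    """rank(url) = alpha * rank(parentUrl)
                 + (1 - alpha) * rank(url_neighborhood),
    with 0 <= alpha <= 1 blending inherited and contextual relevance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * parent_rank + (1 - alpha) * neighborhood_rank
```

Setting α close to 1 makes a URL inherit mostly its parent's rank; α close to 0 weights the anchor-text neighbourhood instead.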
Improvements In Frontier
 Part I Frontier – FIFO
    Not efficient enough
 Part II Frontier – Trie + Rank_List
    Every URL that has been extracted will be passed to the Trie
       Will use less memory
       Much easier to search for a certain URL
          To check if a URL has been visited before – if so, don't visit it again!
    Every URL will have a priority level (rank)
       The next URL to be fetched is the one with the highest rank
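A sketch of this frontier's behaviour, under stated assumptions: a plain set stands in for the Trie (same membership test; the Trie additionally shares storage for common URL prefixes), and a heap stands in for the Rank_List:

```python
import heapq

class Frontier:
    """Priority frontier: skip already-seen URLs, pop highest rank first."""

    def __init__(self):
        self.seen = set()  # the Trie in the real design: visited-check
        self.heap = []     # max-priority queue via negated ranks

    def add(self, url, rank):
        if url in self.seen:   # visited before -> don't visit again!
            return
        self.seen.add(url)
        heapq.heappush(self.heap, (-rank, url))

    def next_url(self):
        """Return the pending URL with the highest rank."""
        neg_rank, url = heapq.heappop(self.heap)
        return url
```

Both the visited check and the push/pop are cheap (O(1) and O(log n)), versus scanning a FIFO list for duplicates.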
Schedule
 Part I project: still in progress
    The Crawler part is almost finished
       All of its units have been finished, including the testing for all units
    Remaining: Storage part, primitive GUI, report book
 Part II project:
    Study the Shark Search algorithm in more detail
    Explore the Categorizer's algorithm
    Apply both algorithms to the project
    Build the improved Frontier
    Build an improved and more user-friendly GUI



Thank You!
Appendix
The Need for a Crawler
 The main "core" of search engines
 Can be used to gather specific information from Web pages (e.g. statistical info, classifications)
 Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links
Project Properties
 Multi-threaded design in order to utilize all the system resources
 Implements a customized page-rank algorithm in order to determine the priority of the URLs
 Contains a Categorizer unit that determines the category of a downloaded page
    The category set can be customized by the user
 Contains a URL Filter unit that can restrict crawling to specified networks, and allows other URL filtering options
 Working environment:
    Windows platform
    C# programming language
    .NET environment
    MS-SQL database system (extensible to work with other database systems)
Worker Class Diagram
[UML class diagram, flattened below; association multiplicities omitted]

ResourceContent
-url : string
-resourceType : object
-resourceContent : object
-returnCode : int
+getResourceURL() : string
+getResourceType() : object
+getResourceContent() : object
+getReturnCode() : int
+isValid() : bool

FetcherManager
-timeOut : int
-protocolList : List
+fetchResource(in url : string) : ResourceContent
+setTimeOut(in timeout : int) : void
+addProtocol(in protocol : ResourceFetcher) : void

«interface» ResourceFetcher
+fetch(in url : string, in timeOut : int) : ResourceContent
+canFetch(in url : string) : bool

HttpResourceFetcher (implements ResourceFetcher)
+fetch() : ResourceFetcher
+canFetch(in url : string) : bool

ResourceProcessorManager
-resourceProcessorList : List
+processResource(in resource : ResourceContent) : void

«interface» ResourceProcessor
+process(in resource : ResourceContent) : void
+canProcess(in resource : ResourceContent) : bool

HtmlPageCategorizationProcessor (implements ResourceProcessor)
-urlCategorized
-urlHashed
+process(in content : ResourceContent) : void
+canProcess(in content : ResourceContent) : bool
+deployResourceToStorage() : void
-hashUrl(in urlName : string) : int

Note: the queue is given to the constructor as a reference; the processor class should not allocate a new queue and should use the given reference instead.
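The manager/protocol split in the diagram can be illustrated with this sketch. Class and method names follow the diagram (the project itself is in C#); the method bodies are illustrative assumptions, not the real implementation:

```python
class ResourceContent:
    """Fetched resource plus its HTTP-style return code."""
    def __init__(self, url, content, return_code):
        self.url = url
        self.content = content
        self.return_code = return_code

    def is_valid(self):
        return self.return_code == 200

class FetcherManager:
    """Delegates each URL to the first registered protocol
    fetcher that claims it via canFetch."""
    def __init__(self, time_out=30):
        self.time_out = time_out
        self.protocol_list = []

    def add_protocol(self, fetcher):
        self.protocol_list.append(fetcher)

    def fetch_resource(self, url):
        for fetcher in self.protocol_list:
            if fetcher.can_fetch(url):
                return fetcher.fetch(url, self.time_out)
        return ResourceContent(url, None, 404)  # no handler for this scheme

class HttpResourceFetcher:
    def can_fetch(self, url):
        return url.startswith("http://") or url.startswith("https://")

    def fetch(self, url, time_out):
        # a real implementation would issue the HTTP request here
        return ResourceContent(url, "<html></html>", 200)
```

New protocols (e.g. FTP) plug in via addProtocol without changing the manager, which is the point of the ResourceFetcher interface.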
Worker Class Diagram – cont.
[UML class diagram, flattened below; association multiplicities omitted]

Extractor
+extractLinks(in url : string, in page : string) : <unspecified>
+extractContentWordList(in page : string) : <unspecified>
-findLinks(in page : string) : <unspecified>
-normalizeLink(in parentUrl : string, in link : string) : string

«requirement» In extractLinks the output will be a list of structs which contain a url and its neighbour word list for further processing.

HtmlPageCategorizationProcessor (as on the previous slide)
-urlCategorized
-urlHashed
+process(in content : ResourceContent) : void
+canProcess(in content : ResourceContent) : bool
+deployResourceToStorage() : void
-hashUrl(in urlName : string) : int

Ranker
-categorizer : Categorizer
+rankUrl(in parentRank : int, in parentConent : string, in url : string) : int

Note: the Categorizer is given as a reference to the constructor of Ranker; Ranker should not allocate a new Categorizer and should use the given reference instead.

Categorizer
-CategoryList : List
+classifyContent(in wordList) : void
+getSuitableCategoryID() : int
+getMatchLevel() : int

Category
-categoryID : string
-ParentName : string
-categoryName : string
-keywordList : List
-confidenceLevel : int
+Category(in id, in pname, in name, in keywords, in cl)
+getCategoryID() : int
+getParentName() : int
+getCategoryName()
+getConfidenceLevel() : int
+getKeywords()
+getMatchLevel(in WordList) : int
-canonicForm() : string
-synonymousList(in word : string) : <unspecified>

Note: the Category class will be immutable, meaning every property is defined while constructing the object instance.

Constraints
-linkDepth : int
-restrictedNetworks
-crawlNetworks
-allowUrlParameters : bool
+getAllowedDepth()
+getRestrictionList()
+getCrawlList()
+isParametrizationAllowed()
+isUrlValid(in url : string) : bool
-getUrlDepth() : int
-getUrlNetwork() : string
-containsParameter() : bool

Filter
+filterLinks(in linkList) : <unspecified>
+canonize(in url : string) : string