Crawling the Web

Download Report

Transcript Crawling the Web

Lecture 3: Crawling the Web
(Chap 2, Charkrabarti)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering,
National Cheng Kung University
2007/10/02
1
Crawling the Web
•Web pages
—Few thousand characters long
—Served through the internet using the hypertext
transport protocol (HTTP)
—Viewed at client end using `browsers’
•Crawler
—To fetch the pages to the computer
—At the computer
•Automatic programs can analyze hypertext
documents
HTML
• HyperText Markup Language
• Lets the author
— specify layout and typeface
— embed diagrams
— create hyperlinks.
• expressed as an anchor tag with a HREF attribute
• HREF names another page using a Uniform
Resource Locator (URL),
— URL =
• protocol field (“HTTP”) +
• a server hostname (“www.cse.iitb.ac.in”) +
• file path (/, the `root' of the published file system).
3
HTTP(hypertext transport protocol)
• Built on top of the Transport Control Protocol
(TCP)
• Steps (from client end)
— resolve server host name to Internet address (IP)
• Use Domain Name Server (DNS)
• DNS is a distributed database of name-to-IP mappings
maintained at a set of known servers
— contact the server using TCP
• connect to default HTTP port (80) on the server.
• Enter the HTTP requests header (E.g.: GET)
• Fetch the response header
— MIME (Multipurpose Internet Mail Extensions)
— A meta-data standard for email and Web content transfer
• Fetch the HTML page
4
Crawl “all” Web pages?
• Problem: no catalog of all accessible
URLs on the Web.
• Solution:
— start from a given set of URLs
— Progressively fetch and scan them for new
outlinking URLs
— fetch these pages in turn…..
— Submit the text in page to a text indexing
system
— and so on……….
5
Crawling procedure
• Simple
— Great deal of engineering goes into industrystrength crawlers
— Industry crawlers crawl a substantial fraction
of the Web
— E.g.: Alta Vista, Northern Lights, Inktomi
• No guarantee that all accessible Web
pages will be located in this fashion
• Crawler may never halt …….
— pages will be added continually even as it is
running.
6
Crawling overheads
• Delays involved in
— Resolving the host name in the URL to an IP
address using DNS
— Connecting a socket to the server and
sending the request
— Receiving the requested page in response
• Solution: Overlap the above delays by
— fetching many pages at the same time
7
Anatomy of a crawler.
• Page fetching threads
— Starts with DNS resolution
— Finishes when the entire page has been fetched
• Each page
— stored in compressed form to disk/tape
— scanned for outlinks
• Work pool of outlinks
— maintain network utilization without overloading it
• Dealt with by load manager
• Continue till the crawler has collected a sufficient
number of pages.
8
Typical anatomy of a large-scale crawler
9
Large-scale crawlers: performance
and reliability considerations
• Need to fetch many pages at same time
— utilize the network bandwidth
— single page fetch may involve several seconds of network
latency
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
— Explicit encoding of the state of a fetch context in a data
structure
— Polling socket to check for completion of network transfers
— Multi-processing or multi-threading: Impractical
• Care in URL extraction
— Eliminating duplicates to reduce redundant fetches
— Avoiding “spider traps” (e.g., fake URLs)
10
DNS caching, pre-fetching and
resolution
• A customized DNS component with…..
1. Custom client for address resolution
2. Caching server
3. Prefetching client
11
Custom client for address
resolution
• Tailored for concurrent handling of multiple
outstanding requests
• Allows issuing of many resolution requests
together
— polling at a later time for completion of
individual requests
• Facilitates load distribution among many
DNS servers.
12
Caching server
• With a large cache, persistent across DNS
restarts
• Residing largely in memory if possible.
• DNS resolution
– UNIX gethostbyname: cannot provide
concurrent requests
– Mercator: reduce time from 87% to 25%
– ADNS: asynchronous DNS client library
www.chiark.greenend.org.uk/~ian/adns
13
Prefetching client
• Steps
1. Parse a page that has just been fetched
2. extract host names from HREF targets
3. Make DNS resolution requests to the caching server
• Usually implemented using UDP
— User Datagram Protocol
— connectionless, packet-based communication
protocol
— does not guarantee packet delivery
• Does not wait for resolution to be completed.
14
Multiple concurrent fetches
• Managing multiple concurrent connections
— A single download may take several seconds
— Open many socket connections to different HTTP
servers simultaneously
• Multi-CPU machines not useful
— crawling performance limited by network and disk
• Two approaches
1. using multi-threading
2. using non-blocking sockets with event handlers
15
Multi-threading
• logical threads
— physical thread of control provided by the operating
system (E.g.: pthreads) OR
— concurrent processes
• fixed number of threads allocated in advance
• programming paradigm
—
—
—
—
create a client socket
connect the socket to the HTTP service on a server
Send the HTTP request header
read the socket (recv) until
• no more characters are available
— close the socket.
• use blocking system calls
16
Multi-threading: Problems
• performance penalty
— mutual exclusion
— concurrent access to data structures
• slow disk seeks.
— great deal of interleaved, random input-output
on disk
— Due to concurrent modification of document
repository by multiple threads
17
Non-blocking sockets and event
handlers
• non-blocking sockets
— connect, send or recv call returns immediately without waiting for
the network operation to complete.
— poll the status of the network operation separately
• “select” system call
— lets application suspend until more data can be read from or
written to the socket
— timing out after a pre-specified deadline
— Monitor polls several sockets at the same time
• More efficient memory management
— code that completes processing not interrupted by other
completions
— No need for locks and semaphores on the pool
— only append complete pages to the log
18
Link extraction and normalization
• Goal: Obtaining a canonical form of URL
• URL processing and filtering
— Avoid multiple fetches of pages known by
different URLs
— many IP addresses
• For load balancing on large sites
• Mirrored contents/contents on same file system
• “Proxy pass“
• Mapping of different host names to a single IP address
• need to publish many logical sites
— Relative URLs
• need to be interpreted w.r.t a base URL.
19
Canonical URL
•
•
•
•
Formed by
Using a standard string for the protocol
Canonicalizing the host name
Adding an explicit port number
Normalizing and cleaning up the path
• /books/../papers/index.html =>
/papers/index.html
20
Robot exclusion
• Check
— whether the server prohibits crawling a
normalized URL
— In robots.txt file in the HTTP root directory of
the server
• species a list of path prefixes which crawlers
should not attempt to fetch.
• http://www.iitb.ac.in/robots.txt
• Meant for crawlers only
• User-agent specification
21
Eliminating already-visited URLs
• Checking if a URL has already been fetched
— Before adding a new URL to the work pool
— Needs to be very quick.
— Achieved by computing MD5 ( Message-Digest
algorithm 5) hash function
• (www.rsasecurity.com/rsalabs/faq/3-6-6.html)
• X = MD5("vote1234") =
8339e38c61175dbd07846ad70dc226b2
• Exploiting spatio-temporal locality of access
• Two-level hash function.
— most significant bits (say, 24) derived by hashing the host name
plus port
— lower order bits (say, 40) derived by hashing the path
• concatenated bits use d as a key in a B-tree
• qualifying URLs added to frontier of the crawl.
• hash values added to B-tree.
22
Spider traps
• Protecting from crashing on
— Ill-formed HTML
• E.g.: page with 68 kB of null characters
— Misleading sites
• indefinite number of pages dynamically generated
by CGI scripts
• paths of arbitrary depth created using soft directory
links and path remapping features in HTTP server
23
Spider Traps: Solutions
• No automatic technique can be foolproof
• Check for URL length
• Guards
— Preparing regular crawl statistics
— Adding dominating sites to guard module
— Disable crawling active content such as CGI
form queries
— Eliminate URLs with non-textual data types
24
Avoiding repeated expansion of
links on duplicate pages
• Reduce redundancy in crawls
• Duplicate detection
— Mirrored Web pages and sites
• Detecting exact duplicates
— Checking against MD5 digests of stored URLs
— Representing a relative link v (relative to aliases u1
and u2) as tuples (h(u1); v) and (h(u2); v)
• Detecting near-duplicates
— Even a single altered character will completely
change the digest !
• E.g.: date of update/ name and email of the site administrator
— Solution : Shingling
25
Load monitor
•
Keeps track of various system statistics
— Recent performance of the wide area
network (WAN) connection
•
E.g.: latency and bandwidth estimates.
— Operator-provided/estimated upper bound
on open sockets for a crawler
— Current number of active sockets.
26
Thread manager
• Responsible for
 Choosing units of work from frontier
 Scheduling issue of network resources
 Distribution of these requests over multiple
ISPs if appropriate.
• Uses statistics from load monitor
27
Per-server work queues
• Denial of service (DoS) attacks
 limit the speed or frequency of responses to
any fixed client IP address
• Avoiding DOS
 limit the number of active requests to a given
server IP address at any time
 maintain a queue of requests for each server
 Use the HTTP/1.1 persistent socket capability.
 Distribute attention relatively evenly between
a large number of sites
• Access locality vs. politeness dilemma
28
Text repository
• Crawler’s last task
 Dumping fetched pages into a repository
• Decoupling crawler from other functions
for efficiency and reliability preferred
• Page-related information stored in two
parts
 meta-data
 page contents
29
Storage of page-related information
• Meta-data
 relational in nature
 usually managed by custom software to avoid
relation database system overheads
 text index involves bulk updates
 includes fields like content-type, last-modified
date, content-length, HTTP status code, etc.
30
Page contents storage
• Typical HTML Web page compresses to 24 kB (using zlib)
• File systems have a 4-8 kB file block size
 Too large !!
• Page storage managed by custom storage
manager
 simple access methods for
 crawler to add pages
 Subsequent programs (Indexer etc) to retrieve
documents
31
Page Storage
• Small-scale systems
 Repository fitting within the disks of a single
machine
 Use of storage manager
(E.g.: Berkeley DB: www.sleepycat.com)
 Manage disk-based databases within a single file
 configuration as a hash-table/B-tree for URL
access key
 To handle ordered access of pages
 configuration as a sequential log of page records.
 Since Indexer can handle pages in any order
32
Page Storage
• Large-scale systems
 Repository distributed over a number of
storage servers
 Storage servers
 Connected to the crawler through a fast local
network (E.g.: gigabit Ethernet)
 Hashed by URLs
 `T3' grade leased lines.
 To handle 10 million pages (40 GB) per hour
33
Large-scale crawlers often use multiple ISPs and a bank of local
storage servers to store the pages crawled.
34
Refreshing crawled pages
• Search engine's index should be fresh
• Web-scale crawler never `completes' its
job
• High variance of rate of page changes
• “If-modified-since” request header with
HTTP protocol
 Impractical for a crawler
• Solution
 At commencement of new crawling round
estimate which pages have changed
35
Determining page changes
• “Expires” HTTP response header
 For page that come with an expiry date
• Otherwise need to guess if revisiting that
page will yield a modified version.
 Score reflecting probability of page being
modified
 Crawler fetches URLs in decreasing order of
score.
 Assumption : recent past predicts the future
36
Estimating page change rates
• Brewington and Cybenko & Cho
 Algorithms for maintaining a crawl in which most
pages are fresher than a specified epoch.
• Prerequisite
 average interval at which crawler checks for changes
is smaller than the inter-modification times of a page
• Small scale intermediate crawler runs
 to monitor fast changing sites
 E.g.: current news, weather, etc.
 Patched intermediate indices into master index
37
Putting together a crawler
 Reference implementation of the HTTP
client protocol
 World-wide Web Consortium
(http://www.w3c.org/)
 w3c-libwww package
38
Design of the core components:
Crawler class.
• To copy bytes from network sockets to storage
media
• Three methods to express Crawler's contract
with user
 pushing a URL to be fetched to the Crawler
(fetchPush)
 Termination callback handler (fetchDone) called with
same URL
 Method (start) which starts Crawler's event loop.
• Implementation of Crawler class
 Need for two helper classes called DNS and Fetch
39
40
41
42
43
Crawler at WKD Lab. of Academia
Sinica
44
Parameters
45
Initial URLs
46
Crawling
47
Storing pages
48
Criterions for the Project Crawler
• Performance
— Speed
— Scalability
• Interesting/useful design strategies
—
—
—
—
—
Prefetching
Threading
Compressing
Data management
Others
• Algorithms: MD5, Depth first search/Breadth first search
• Configuration: flexible parameter setting
49