Transcript Document
Overview of Web-Crawlers
Neal Richter & Anthony Arnone
• Nov 30, 2005 – CS Conference Room
• These slides are edited versions of the chapter 2 lecture notes from “Mining the Web” by Soumen Chakrabarti.
• “Programming Spiders, Bots, and Aggregators in Java” by Jeff Heaton is also a good reference.
Crawling the Web
Web pages
• A few thousand characters long
• Served over the Internet using the hypertext transport protocol (HTTP)
• Viewed at the client end using browsers
Crawler
• Fetches the pages to a computer
• At that computer, automatic programs can analyze the hypertext documents
HTML (HyperText Markup Language)
Lets the author
• specify layout and typeface
• embed diagrams
• create hyperlinks
A hyperlink is expressed as an anchor tag with an HREF attribute
HREF names another page using a Uniform Resource Locator (URL)
• URL = protocol field (“HTTP”) + server host name (“www.cse.iitb.ac.in”) + file path (“/”, the root of the published file system)
HTTP (hypertext transport protocol)
Built on top of the Transmission Control Protocol (TCP)
Steps (from the client end; a socket-level sketch follows below)
• Resolve the server host name to an Internet address (IP)
  Use the Domain Name System (DNS)
  DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
• Contact the server using TCP
  Connect to the default HTTP port (80) on the server
  Send the HTTP request header (e.g. GET)
  Fetch the response header
  – MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
  Fetch the HTML page
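As a concrete illustration, here is a minimal sketch of these steps in Python using raw sockets. The host name and path are the example values from the URL slide above; a production crawler would add error handling, timeouts, redirects, and HTTP/1.1 details.

```python
# Minimal sketch of the client-side HTTP steps above (assumes the example host
# www.cse.iitb.ac.in is reachable; no error handling or timeouts).
import socket

host = "www.cse.iitb.ac.in"   # example server host name
path = "/"                    # "/" is the root of the published file system

ip = socket.gethostbyname(host)            # 1. resolve the host name via DNS
sock = socket.create_connection((ip, 80))  # 2. contact the server over TCP, default HTTP port 80

# 3. send the HTTP request header (a simple GET)
sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))

# 4. fetch the response: headers (including the MIME Content-Type), then the HTML page
response = b""
while True:
    chunk = sock.recv(4096)
    if not chunk:              # server closed the connection
        break
    response += chunk
sock.close()

headers, _, html_page = response.partition(b"\r\n\r\n")
print(headers.decode("iso-8859-1"))
```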
Crawl “all” Web pages?
Problem: there is no catalog of all accessible URLs on the Web.
Solution:
• start from a given set of URLs
• progressively fetch and scan them for new outlinking URLs
• fetch these pages in turn …
• submit the text in each page to a text indexing system
• and so on … (a sketch of this loop follows below)
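A minimal sketch of this fetch-scan-fetch loop, using only the Python standard library. `index_text` is a placeholder for the text indexing system, and politeness, robot exclusion, and error handling are omitted.

```python
# Sketch of the basic crawl loop: seed URLs -> fetch -> extract outlinks -> repeat.
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects HREF targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_text(url, html):
    pass  # placeholder: submit the page text to a text indexing system

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    seen = set(seed_urls)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                      # skip pages that cannot be fetched
        index_text(url, html)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:         # scan for new outlinking URLs
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)  # fetch these pages in turn
```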
Crawling procedure
The basic procedure is simple
• A great deal of engineering goes into industry-strength crawlers
• Industry crawlers crawl a substantial fraction of the Web
  E.g.: AltaVista, Northern Light, Inktomi
• No guarantee that all accessible Web pages will be located in this fashion
The crawler may never halt …
• pages are added continually even as it runs
Crawling overheads
Delays involved in
• resolving the host name in the URL to an IP address using DNS
• connecting a socket to the server and sending the request
• receiving the requested page in response
Solution: overlap the above delays by
• fetching many pages at the same time
Anatomy of a crawler
Page fetching threads
• start with DNS resolution
• finish when the entire page has been fetched
Each page
• stored in compressed form to disk/tape
• scanned for outlinks
Work pool of outlinks
• maintain network utilization without overloading it
• dealt with by a load manager
Continue till the crawler has collected a sufficient number of pages.
Typical anatomy of a large-scale crawler.
Large-scale crawlers: performance and reliability considerations
Need to fetch many pages at the same time
• to utilize the network bandwidth
• a single page fetch may involve several seconds of network latency
Highly concurrent and parallelized DNS lookups
Use of asynchronous sockets
• explicit encoding of the state of a fetch context in a data structure
• polling sockets to check for completion of network transfers
• multi-processing or multi-threading: impractical
Care in URL extraction
• eliminating duplicates to reduce redundant fetches
• avoiding “spider traps”
DNS caching, pre-fetching and resolution
A customized DNS component with …
1. Custom client for address resolution
   Tailored for concurrent handling of multiple outstanding requests
   Allows issuing many resolution requests together
   – polling at a later time for completion of individual requests
   Facilitates load distribution among many DNS servers
2. Caching server
   With a large cache, persistent across DNS restarts
   Residing largely in memory if possible
3. Prefetching client (a sketch follows below)
   Parse a page that has just been fetched & extract host names from HREF targets
   Query the DNS cache via UDP
   Don’t wait for the answer; check later, when it is needed
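A small sketch of the prefetching idea: host names extracted from HREF targets are handed to a resolver pool without waiting, and the crawler checks the cache later when it actually needs the address. A thread pool stands in here for a custom asynchronous DNS client; a real component would also persist the cache and spread queries across DNS servers.

```python
# Sketch of DNS prefetching with an in-memory cache (a thread pool stands in
# for a custom asynchronous DNS client).
import socket
from concurrent.futures import ThreadPoolExecutor

class DNSPrefetcher:
    def __init__(self, workers=32):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.cache = {}          # host name -> Future that resolves to an IP address

    def prefetch(self, host):
        """Issue the resolution request now; do not wait for the answer."""
        if host not in self.cache:
            self.cache[host] = self.pool.submit(socket.gethostbyname, host)

    def resolve(self, host):
        """Check later, when the address is actually needed."""
        self.prefetch(host)
        return self.cache[host].result()   # blocks only if the lookup is still in flight
```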
Multiple concurrent fetches
Managing multiple concurrent connections
• A single download may take several seconds
• Open many socket connections to different HTTP servers simultaneously
Multi-CPU machines not very useful
• crawling performance is limited by network and disk
Two approaches
1. using multi-threading
2. using non-blocking sockets with event handlers
Multi-threading
• logical threads
  • physical threads of control provided by the operating system (e.g. pthreads), OR
  • concurrent processes
• fixed number of threads allocated in advance
• programming paradigm (a sketch follows below)
  • create a client socket
  • connect the socket to the HTTP service on a server
  • send the HTTP request header
  • read the socket (recv) until no more characters are available
  • close the socket
• uses blocking system calls
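A minimal sketch of this multi-threaded paradigm: a fixed number of threads, each taking URLs from a shared work pool and fetching them with blocking calls. `urllib.request` stands in for the explicit socket create/connect/send/recv/close sequence, `store_page` is a placeholder for the document repository, and the seed URLs are hypothetical.

```python
# Sketch of multi-threaded fetching with a fixed thread pool and blocking calls.
import queue
import threading
import urllib.request

work_pool = queue.Queue()        # shared work pool of URLs

def store_page(url, page):
    pass                          # placeholder: write the page to the repository

def worker():
    while True:
        url = work_pool.get()
        if url is None:           # sentinel value: no more work for this thread
            break
        try:
            # urlopen performs the connect / send request / recv-until-done / close steps
            page = urllib.request.urlopen(url, timeout=10).read()
            store_page(url, page)
        except Exception:
            pass                  # ignore failed fetches in this sketch

NUM_THREADS = 16                  # fixed number of threads allocated in advance
threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for url in ["http://example.com/", "http://example.org/"]:   # hypothetical seed URLs
    work_pool.put(url)
```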
Multi-threading: Problems
• performance penalty
  • mutual exclusion
  • concurrent access to data structures
• slow disk seeks
  • a great deal of interleaved, random input-output on disk
  • due to concurrent modification of the document repository by multiple threads
Non-blocking sockets and event handlers
• non-blocking sockets
  • a connect, send or recv call returns immediately without waiting for the network operation to complete
  • poll the status of the network operation separately
• “select” system call (a select-based sketch follows below)
  • lets the application suspend until more data can be read from or written to the socket
  • times out after a pre-specified deadline
  • monitor polls several sockets at the same time
• more efficient memory management
  • code that completes processing is not interrupted by other completions
  • no need for locks and semaphores on the pool
  • only append complete pages to the log
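A sketch of the event-driven alternative: non-blocking sockets, with select() used to wait until some socket is ready and a per-socket fetch context updated as data arrives. The two hosts are hypothetical examples, and DNS resolution, error handling, and HTTP parsing are left out.

```python
# Sketch of non-blocking sockets polled with select(); each FetchContext
# explicitly encodes the state of one fetch.
import select
import socket

class FetchContext:
    def __init__(self, host, path="/"):
        self.sock = socket.socket()
        self.sock.setblocking(False)           # connect/send/recv return immediately
        try:
            self.sock.connect((host, 80))
        except BlockingIOError:
            pass                                # the connection completes in the background
        self.request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
        self.response = b""

contexts = {c.sock: c for c in (FetchContext("example.com"), FetchContext("example.org"))}

while contexts:
    want_write = [s for s, c in contexts.items() if c.request]
    readable, writable, _ = select.select(list(contexts), want_write, [], 5.0)
    for sock in writable:                       # connection established: send the request
        ctx = contexts[sock]
        sent = sock.send(ctx.request)
        ctx.request = ctx.request[sent:]
    for sock in readable:
        ctx = contexts[sock]
        chunk = sock.recv(4096)
        if chunk:
            ctx.response += chunk               # more of the page has arrived
        else:                                   # server closed the connection: page complete
            sock.close()
            del contexts[sock]                  # only now append the complete page to the log
```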
Link extraction and normalization
• Goal: obtain a canonical form of each URL
• URL processing and filtering
  • avoid multiple fetches of pages known by different URLs
• Relative URLs
  • need to be interpreted w.r.t. a base URL
• The same page may be served from many IP addresses or mirror hosts
Canonical URL
Formed by (a sketch of these steps follows below)
• using a standard string for the protocol
• canonicalizing the host name
• adding an explicit port number
• normalizing and cleaning up the path
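A sketch of these four steps using urllib.parse; the exact canonicalization rules chosen here (default ports, path cleanup) are illustrative, not the book's specification.

```python
# Sketch of URL canonicalization: standard protocol string, lower-case host,
# explicit port, and a cleaned-up path.
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonical_url(url, base=None):
    if base is not None:                        # relative URLs are interpreted w.r.t. a base URL
        url = urljoin(base, url)
    parts = urlsplit(url)
    scheme = parts.scheme.lower()               # standard string for the protocol
    host = (parts.hostname or "").lower()       # canonical host name
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)   # explicit port number
    segments = []                               # normalize and clean up the path
    for seg in (parts.path or "/").split("/"):
        if seg == "..":
            if segments:
                segments.pop()
        elif seg not in ("", "."):
            segments.append(seg)
    path = "/" + "/".join(segments)
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# e.g. canonical_url("HTTP://WWW.CSE.IITB.AC.IN/a/./b/../c")
#      -> "http://www.cse.iitb.ac.in:80/a/c"
```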
Robot exclusion
• Check whether the server prohibits crawling a normalized URL (see the sketch below)
  • in the robots.txt file in the HTTP root directory of the server
  • specifies a list of path prefixes which crawlers should not attempt to fetch
• Meant for crawlers only
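A sketch of robot exclusion checking with the standard library's urllib.robotparser: robots.txt is fetched once per server root and cached, and each normalized URL is tested against its path prefixes before fetching. The user-agent string is a hypothetical example.

```python
# Sketch of a robots.txt check with a per-server cache.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

robots_cache = {}                 # server root -> parsed robots.txt

def allowed(url, user_agent="ExampleCrawler"):
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}"
    if root not in robots_cache:
        parser = RobotFileParser(root + "/robots.txt")
        parser.read()             # fetch and parse the exclusion file from the HTTP root
        robots_cache[root] = parser
    return robots_cache[root].can_fetch(user_agent, url)
```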
Eliminating already-visited URLs
Checking if a URL has already been fetched
• before adding a new URL to the work pool
• needs to be very quick
• achieved by computing an MD5 hash function on the URL
Exploiting spatio-temporal locality of access
• two-level hash function (a sketch follows below)
  – most significant bits (say, 24) derived by hashing the host name plus port
  – lower-order bits (say, 40) derived by hashing the path
• concatenated bits used as a key in a B-tree
• qualifying URLs added to the frontier of the crawl
• hash values added to the B-tree
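A sketch of the two-level key described above: 24 bits from an MD5 hash of host plus port and 40 bits from an MD5 hash of the path, concatenated into one 64-bit key so that URLs from the same host stay close together. An in-memory set stands in for the on-disk B-tree.

```python
# Sketch of the two-level hash key for the "already visited?" check.
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    parts = urlsplit(url)
    host_port = f"{parts.hostname}:{parts.port or 80}".encode()
    path = (parts.path or "/").encode()
    host_bits = int.from_bytes(hashlib.md5(host_port).digest(), "big") >> (128 - 24)
    path_bits = int.from_bytes(hashlib.md5(path).digest(), "big") >> (128 - 40)
    return (host_bits << 40) | path_bits   # URLs on the same host share the high-order bits

visited = set()                            # stands in for the B-tree of hash values

def already_seen(url):
    key = url_key(url)
    if key in visited:
        return True
    visited.add(key)                       # qualifying URL: record its key, add URL to the frontier
    return False
```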
Spider traps
Protecting the crawler from crashing on
• ill-formed HTML
  E.g.: a page with 68 kB of null characters
• misleading sites
  • an indefinite number of pages dynamically generated by CGI scripts
  • paths of arbitrary depth created using soft directory links and path-remapping features in the HTTP server
Spider traps: Solutions
No automatic technique can be foolproof
• Check for URL length
Guards (a guard sketch follows below)
• preparing regular crawl statistics
• adding dominating sites to a guard module
• disabling crawling of active content such as CGI form queries
• eliminating URLs with non-textual data types
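A sketch of such a guard module; the thresholds, extension list, and CGI heuristic are illustrative assumptions, not prescribed values.

```python
# Sketch of guard checks applied to each candidate URL before fetching.
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256                      # suspiciously long URLs often indicate a trap
NON_TEXTUAL = (".jpg", ".png", ".gif", ".zip", ".exe", ".pdf")
MAX_PAGES_PER_SITE = 10000                # sites dominating the crawl statistics

pages_per_site = Counter()                # per-host crawl statistics, updated as pages are fetched

def passes_guards(url):
    if len(url) > MAX_URL_LENGTH:
        return False
    parts = urlsplit(url)
    if parts.query or "cgi-bin" in parts.path:        # active content such as CGI form queries
        return False
    if parts.path.lower().endswith(NON_TEXTUAL):      # non-textual data types
        return False
    if pages_per_site[parts.hostname] > MAX_PAGES_PER_SITE:   # dominating site
        return False
    return True
```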
Avoiding repeated expansion of links on duplicate pages
Reduce redundancy in crawls
Duplicate detection
• mirrored Web pages and sites
Detecting exact duplicates
• checking against MD5 digests of stored URLs
• representing a relative link v (relative to aliases u1 and u2) as the tuples (h(u1), v) and (h(u2), v)
Detecting near-duplicates
• even a single altered character completely changes the digest!
  E.g.: date of update, or the name and email of the site administrator
• Solution: shingling (a sketch follows below)
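A sketch of shingling: each page is represented by the set of hashes of its overlapping w-word windows, and two pages whose shingle sets overlap heavily are treated as near-duplicates even if a date or an administrator's e-mail address differs. The window size and threshold are illustrative.

```python
# Sketch of shingling for near-duplicate detection.
import hashlib

def shingles(text, w=4):
    """Hashes of all overlapping w-word windows of the page text."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + w]).encode()).hexdigest()
        for i in range(max(len(words) - w + 1, 1))
    }

def resemblance(text_a, text_b, w=4):
    """Jaccard overlap of the two shingle sets (1.0 = identical)."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Pages with resemblance above, say, 0.9 can be treated as near-duplicates.
```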
Text repository
The crawler’s last task
• dumping fetched pages into a repository
• decoupling the crawler from other functions is preferred for efficiency and reliability
Page-related information stored in two parts (a storage sketch follows below)
• meta-data
• page contents
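A sketch of such two-part storage under assumed formats: page contents appended to a data file compressed with zlib, and per-page meta-data (URL, fetch time, content type, offset, length) appended as JSON records to a separate file. The file names and record fields are illustrative.

```python
# Sketch of a two-part page repository: compressed contents plus meta-data records.
import json
import time
import zlib

def store_page(url, content_type, page_bytes,
               data_path="pages.dat", meta_path="pages.meta"):
    compressed = zlib.compress(page_bytes)       # pages stored in compressed form
    with open(data_path, "ab") as data_file:
        offset = data_file.tell()
        data_file.write(compressed)
    record = {                                    # meta-data kept separately
        "url": url,
        "fetched": time.time(),
        "content_type": content_type,
        "offset": offset,
        "length": len(compressed),
    }
    with open(meta_path, "a") as meta_file:
        meta_file.write(json.dumps(record) + "\n")
```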
Large-scale crawlers often use multiple ISPs and a bank of local storage
servers to store the pages crawled.
Refreshing crawled pages
The search engine’s index should be fresh
A Web-scale crawler never ‘completes’ its job
High variance in the rate of page changes
“If-modified-since” request header in the HTTP protocol (a sketch follows below)
• impractical for a large Web crawler
Solution
• at the commencement of a new crawling round, estimate which pages have changed
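For a single URL, the conditional request looks roughly like the sketch below: the “If-Modified-Since” header makes the server answer 304 Not Modified when the page is unchanged, which is exactly the mechanism that does not scale to asking about every crawled URL individually.

```python
# Sketch of a conditional re-fetch with the "If-Modified-Since" request header.
import urllib.error
import urllib.request
from email.utils import formatdate

def refetch_if_changed(url, last_fetch_timestamp):
    request = urllib.request.Request(url)
    request.add_header("If-Modified-Since",
                       formatdate(last_fetch_timestamp, usegmt=True))
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read()       # the page has changed: return the new contents
    except urllib.error.HTTPError as err:
        if err.code == 304:              # 304 Not Modified: keep the stored copy
            return None
        raise
```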
Determining page changes
“Expires” HTTP response header
• for pages that come with an expiry date
Otherwise need to guess whether revisiting the page will yield a modified version
• maintain a score reflecting the probability that the page has been modified
• crawler fetches URLs in decreasing order of score
Assumption: the recent past predicts the future
Estimating page change rates
Brewington & Cybenko, and Cho
• algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
• prerequisite: the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page
Small-scale intermediate crawler runs
• to monitor fast-changing sites
  E.g.: current news, weather, etc.
• intermediate indices patched into the master index
Questions?
Lots of programming detail is missing!