
Crawling The Web
Motivation
• By crawling the Web, data is retrieved from the
Web and stored in local repositories
• Most common examples: search engines, Web
archives
• The idea: use links between pages to traverse the
Web
• Since the Web is dynamic, updates should be
done continuously (or frequently)
Crawling Basic Algorithm
[Figure: flowchart with the steps Init (initial seeds), Get next URL, Get page, Extract Data, and Extract Links, connected to the to-visit URLs, the visited URLs, the database, and the WWW]
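The loop named on this slide can be sketched roughly as follows. This is a minimal illustration, not a production crawler: FAKE_WEB, get_page, crawl, and max_pages are hypothetical names, and an in-memory dictionary stands in for real HTTP requests so the example runs offline.

```python
from collections import deque

# Hypothetical in-memory stand-in for the Web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.org/": ("home", ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("page a", ["http://example.org/b", "http://example.org/"]),
    "http://example.org/b": ("page b", ["http://example.org/c"]),
    "http://example.org/c": ("page c", []),
}


def get_page(url):
    """'Get page' step; a real crawler would issue an HTTP request here."""
    return FAKE_WEB.get(url)


def crawl(seeds, max_pages=100):
    to_visit = deque(seeds)   # to-visit URLs, initialised with the seeds
    visited = set(seeds)      # URLs already queued or fetched
    database = {}             # extracted data, keyed by URL
    while to_visit and len(database) < max_pages:
        url = to_visit.popleft()          # Get next URL
        page = get_page(url)              # Get page
        if page is None:
            continue                      # unreachable page: skip it
        text, links = page
        database[url] = text              # Extract Data
        for link in links:                # Extract Links
            if link not in visited:
                visited.add(link)
                to_visit.append(link)
    return database


if __name__ == "__main__":
    for url, text in crawl(["http://example.org/"]).items():
        print(url, "->", text)
```

Marking a URL as visited when it is queued, rather than when it is fetched, keeps the same URL from being queued twice.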
The Web as a Graph
• The Web is modeled as a directed graph
- The nodes are the Web pages
- The edges are pairs (P1, P2) such that there is a
link from P1 to P2
• Crawling the Web is a graph traversal (search
algorithm)
• Can we traverse all of the Web this way?
The Hidden Web
• The hidden Web consists of
- Pages that no other page links to
• how can we get to these pages?
- Dynamic pages that are created as a result of filling a
form
• http://www.google.com/search?q=crawlers
Traversal Orders
• Different traversal orders can be used:
- Breadth-First Crawlers
• to-visit pages are stored in a queue
- Depth-First Crawlers
• to-visit pages are stored in a stack
- Best-First Crawlers
• to-visit pages are stored in a priority-queue,
according to some metric
- How should the traversal order be chosen?
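The three orders differ only in the data structure that holds the to-visit pages. The sketch below illustrates this on a tiny hypothetical link graph; GRAPH, SCORE, and visit_order are made-up names, and the score is an arbitrary stand-in for "some metric" (higher means more promising).

```python
import heapq
from collections import deque

# Hypothetical link graph and per-page scores, for illustration only.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
SCORE = {"A": 0, "B": 5, "C": 1, "D": 0, "E": 9}


def visit_order(graph, seed, strategy, score=None):
    """Return the order in which pages are visited under the given strategy."""
    if strategy not in ("breadth", "depth", "best"):
        raise ValueError("unknown strategy: " + strategy)
    seen = {seed}
    order = []
    # Best-first keeps a heap of (negated score, page); the others use a deque.
    frontier = [(-score[seed], seed)] if strategy == "best" else deque([seed])
    while frontier:
        if strategy == "breadth":
            page = frontier.popleft()            # queue: first in, first out
        elif strategy == "depth":
            page = frontier.pop()                # stack: last in, first out
        else:
            page = heapq.heappop(frontier)[1]    # priority queue: best score first
        order.append(page)
        for link in graph[page]:
            if link in seen:
                continue
            seen.add(link)
            if strategy == "best":
                heapq.heappush(frontier, (-score[link], link))
            else:
                frontier.append(link)
    return order


if __name__ == "__main__":
    for strategy in ("breadth", "depth", "best"):
        print(strategy, visit_order(GRAPH, "A", strategy, SCORE))
```

Scores are negated because heapq is a min-heap, so the page with the highest score is popped first.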
Avoiding Cycles
• To avoid visiting the same page more than once,
a crawler has to keep a list of the URLs it has
visited
• The target of every encountered link is checked against
this list before it is inserted into the to-visit list
• Which data structure for visited-links should be
used?
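One common answer is a hash set of normalized URLs, which gives constant-time membership checks on average. The sketch below assumes that resolving relative links and dropping the #fragment is enough normalization; real crawlers usually need more. VisitedSet, normalize, and add_if_new are hypothetical names.

```python
from urllib.parse import urldefrag, urljoin


def normalize(base_url, href):
    """Resolve href against the page it appeared on and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute)[0]


class VisitedSet:
    """Hash set of normalized URLs that have already been seen."""

    def __init__(self):
        self._urls = set()

    def add_if_new(self, base_url, href):
        """Return the normalized URL if it is new, otherwise None."""
        url = normalize(base_url, href)
        if url in self._urls:
            return None
        self._urls.add(url)
        return url


if __name__ == "__main__":
    visited = VisitedSet()
    print(visited.add_if_new("http://example.org/a/index.html", "../b.html#top"))
    print(visited.add_if_new("http://example.org/", "b.html"))  # same page again
```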
Directing Crawlers
• Sometimes people want to direct automatic crawling
over their resources
• Direction examples:
“Do not visit my files!”
“Do not index my files!”
“Only my crawler may visit my files!”
“Please, follow my useful links…”
• Solution: publish instructions in some known format
• Crawlers are expected to follow these instructions
Robots Exclusion Protocol
• A method that allows Web servers to indicate
which of their resources should not be visited by
crawlers
• Put the file robots.txt at the root directory of the
server
- http://www.cnn.com/robots.txt
- http://www.w3.org/robots.txt
- http://www.ynet.co.il/robots.txt
- http://www.whitehouse.gov/robots.txt
- http://www.google.com/robots.txt
robots.txt Format
• A robots.txt file consists of several records
• Each record consists of a set of crawler IDs
and a set of URLs these crawlers are not allowed
to visit
- User-agent lines: which crawlers are directed?
- Disallow lines: which URLs are not to be visited
by these crawlers (agents)?
robots.txt Format
The following example is taken from
http://www.w3.org/robots.txt:
User-agent: W3Crobot/1
Disallow: /Out-Of-Date
User-agent: *
Disallow: /Team
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /History
Disallow: /Out-Of-Date
W3Crobot/1 is not allowed to visit files under the directory /Out-Of-Date
And those that are not W3Crobot/1…
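To make the record structure concrete, here is a small sketch that parses the example above and checks a few paths. It is deliberately simplified: agent names are compared by exact, case-insensitive match, rules are plain path prefixes, and only User-agent and Disallow lines are handled. parse_robots, allowed, and W3C_EXAMPLE are hypothetical names, as are the bot name OtherBot and the file names in the paths.

```python
W3C_EXAMPLE = """\
User-agent: W3Crobot/1
Disallow: /Out-Of-Date

User-agent: *
Disallow: /Team
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /History
Disallow: /Out-Of-Date
"""


def parse_robots(text):
    """Return a list of records, each a pair (agent names, disallowed prefixes)."""
    records, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:                            # rules already seen: new record
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field == "disallow" and agents:
            rules.append(value)
    if agents:
        records.append((agents, rules))
    return records


def allowed(records, agent, path):
    """A record naming the agent wins over the '*' record; no record means allowed."""
    agent = agent.lower()
    chosen = None
    for agents, rules in records:
        if agent in agents:
            chosen = rules
            break
        if "*" in agents and chosen is None:
            chosen = rules
    if chosen is None:
        return True
    return not any(rule and path.startswith(rule) for rule in chosen)


if __name__ == "__main__":
    records = parse_robots(W3C_EXAMPLE)
    print(allowed(records, "W3Crobot/1", "/Team/plan.html"))
    print(allowed(records, "OtherBot", "/Team/plan.html"))
```

Python's standard library also provides urllib.robotparser, which is likely a better starting point in practice; its agent-matching rules differ from the exact match assumed here.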
Robots Meta Tag
• A Web-page author can also publish directions for
crawlers
• These are expressed by the meta tag with name robots,
inside the HTML file
• Format: <meta name="robots" content="options"/>
• Options:
- index (noindex): index (do not index) this file
- follow (nofollow): follow (do not follow) the links of this file
Robots Meta Tag
An Example:
<html>
<head>
<meta name="robots" content="noindex,follow">
<title>...</title>
</head>
<body> …
How should a crawler act
when it visits this page?
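One way a crawler might read these options is sketched below with Python's standard html.parser module. RobotsMetaParser and robots_directives are hypothetical names, and the default of "index, follow" when no tag is present is an assumption of this sketch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the index/follow decision from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.index = True    # assumed default when no directive is given
        self.follow = True   # assumed default when no directive is given

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        options = {option.strip().lower() for option in content.split(",")}
        if "noindex" in options:
            self.index = False
        if "nofollow" in options:
            self.follow = False


def robots_directives(html):
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


PAGE = """<html>
<head>
<meta name="robots" content="noindex,follow">
<title>...</title>
</head>
<body></body>
</html>"""

if __name__ == "__main__":
    may_index, may_follow = robots_directives(PAGE)
    print("index:", may_index, "follow:", may_follow)
```

For the example page, the result is that the page itself should not be indexed, but its links may still be followed.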
Revisit Meta Tag
• Web page authors may want Web applications to
have an up-to-date copy of their page
• Using the revisit meta tag, page authors can give
crawlers some idea of how often the page is
being updated
• For example:
<meta name="revisit-after" content="10 days" />
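A crawler that chooses to honour this hint could turn the content value into a revisit interval. The sketch below assumes the value always has the form "N days", as in the example; other formats are simply rejected. parse_revisit_after is a hypothetical name.

```python
import re
from datetime import timedelta


def parse_revisit_after(content):
    """Parse a value such as '10 days'; return None if the format is not recognised."""
    match = re.fullmatch(r"\s*(\d+)\s*days?\s*", content, flags=re.IGNORECASE)
    if match is None:
        return None
    return timedelta(days=int(match.group(1)))


if __name__ == "__main__":
    print(parse_revisit_after("10 days"))
```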
Stronger Restrictions
• It is possible for a (non-polite) crawler to ignore
the restrictions imposed by robots.txt and robots
meta directions
• Therefore, if one wants to ensure that automatic
robots do not visit her resources, she has to use
other mechanisms
- For example, password protections
Resources
• Read this nice tutorial about web crawling:
http://informatics.indiana.edu/fil/Papers/crawling.pdf
• To find out more about directing crawlers, visit
www.robotstxt.org
• A dictionary of HTML meta tags can be found at
http://vancouver-webpages.com/META/
Basic HTTP Security
Authentication
• Web servers expose their pages to Web users
• However, some of the pages should sometimes be
exposed only to certain users
- Examples?
• Authentication allows the server to expose a specific
page only after a correct username and password have been
supplied
• HTTP includes a specification for a basic access
authentication scheme
- Some servers avoid it and use other mechanisms
Basic Authentication Scheme
[Figure, four steps: a server with Realm A (/a/A.html, /a/B.jsp), Realm B (/b/C.css, /b/D.xml), and the files E.xsl and F.xml outside any realm]
• GET E.xsl → OK + Content
• GET /a/B.jsp → 401 + Basic realm="A"
• GET /a/B.jsp + user:pass → OK + Content
• GET /a/A.html + user:pass → OK + Content
Basic Authentication Scheme
• To restrict a set of pages to certain users, the server
designates a realm name for these pages and defines the
authorized users (usernames and passwords)
• When a page is requested without correct authentication
information, the server returns a 401 (Unauthorized)
response with a "WWW-Authenticate" header like the
following:
WWW-Authenticate: Basic realm="realm-name"
Basic Authentication Scheme
• The browser then prompts the user for a username and a
password, and sends them in the "Authorization" header:
Authorization: Basic base64(username:password)
• The string username:password is only Base64-encoded, not
encrypted (anyone can decode it...)
• Does the user have to re-enter her username and password for
other requests to the same server?
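The encoding used by the Basic scheme is Base64, so building and reversing the header value takes only a few lines. The sketch below uses hypothetical helper names (basic_auth_header, decode_basic_auth) and the placeholder credentials user and pass; it assumes UTF-8 for the credential bytes.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for the Basic scheme."""
    credentials = (username + ":" + password).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def decode_basic_auth(header_value):
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


if __name__ == "__main__":
    value = basic_auth_header("user", "pass")
    print("Authorization:", value)
    print(decode_basic_auth(value))
```

Because anyone who sees the header can run the decoding step, Basic authentication protects the password only when the connection itself is encrypted.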
Browser Cooperation
• Throughout the session, the browser stores the
username and password and automatically sends
the Authorization header in either of the
following cases:
- The requested resource is under the directory of the
originally authenticated resource
- The browser received 401 from the Web server and
the WWW-Authenticate header has the same realm as
the originally authenticated resource
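The two cases can be sketched as a small credential cache. This is an illustration of the rule as stated on the slide, not a description of any particular browser: CredentialCache, remember, preemptive, and for_challenge are hypothetical names, and the cache covers a single server.

```python
class CredentialCache:
    """Per-session store of credentials for one server."""

    def __init__(self):
        self._by_directory = {}   # directory prefix -> credentials
        self._by_realm = {}       # realm name -> credentials

    def remember(self, path, realm, credentials):
        """Record credentials that the server accepted for path in realm."""
        directory = path.rsplit("/", 1)[0] + "/"
        self._by_directory[directory] = credentials
        self._by_realm[realm] = credentials

    def preemptive(self, path):
        """Case 1: the resource is under the originally authenticated directory."""
        for directory, credentials in self._by_directory.items():
            if path.startswith(directory):
                return credentials
        return None

    def for_challenge(self, realm):
        """Case 2: a 401 arrived whose WWW-Authenticate realm is already known."""
        return self._by_realm.get(realm)


if __name__ == "__main__":
    cache = CredentialCache()
    cache.remember("/a/B.jsp", "A", "user:pass")
    print(cache.preemptive("/a/A.html"))   # same directory: send without asking
    print(cache.preemptive("/b/C.css"))    # other directory: nothing to send yet
    print(cache.for_challenge("A"))        # 401 with realm A: reuse credentials
```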