Introduction and background - Villanova Computer Science


Web Crawling
Fall 2011
Dr. Lillian N. Cassel
Overview of the class
• Purpose: Course Description
– How do they do that? Many web applications, from Google to travel
sites to resource collections, present results found by crawling the
Web to find specific materials of interest to the application
theme. Crawling the Web involves technical issues, politeness
conventions, characterization of materials, decisions about the
breadth and depth of a search, and choices about what to present
and how to display results. This course will explore all of these
issues. In addition, we will address what happens after you crawl
the web and acquire a collection of pages. You will decide on the
questions, but some possibilities might include these: What
summer jobs are advertised on web sites in your favorite
area? What courses are offered in most (or few) computer science
departments? What theatres are showing what
movies? And so on. Students will develop a web site built by crawling at
least some part of the web to find appropriate materials, categorize
them, and display them effectively. Prerequisites: some
programming experience: CSC 1051 or the equivalent.
Tonight’s class
• Review the syllabus and general plan
• Get some information about student
experience
• Look at some background about the web
and information handling
Note: We will touch on several topics that we will
explore in greater depth later in the semester
Overview – 2
• Note:
– Students will develop a web site built by
crawling at least some part of the web to find
appropriate materials, categorize them, and
display them effectively.
• You will need to choose a project, or at least a
general area of interest, soon.
– Prerequisites: some programming experience:
CSC 1051 or the equivalent.
• I need to know what programming experience you
have.
Overview – 3
• Web presence
– www.csc.villanova.edu/~cassel/5930Crawl/593-WebCrawlingSyllabus.html
– Blackboard site
• We will use it for the things it is good for
– Keep in touch with both the web site and
the Blackboard site
– Also – be sure to watch your e-mail. Any
announcements or general information will
go there – to the address on the class list
The Web
• Beginnings – Information Webs
– Vannevar Bush As We May Think
– Hypertext. Ted Nelson and others
• The World Wide Web
– Origins and Structure
• Directed graph
• Possibility of cycles
– Characteristics
Web Document Collection
• No design/co-ordination
• Distributed content creation, linking, democratization of publishing
• Content includes truth, lies, obsolete information, contradictions …
• Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases) …
• Scale much larger than previous text collections … but corporate records are catching up
• Growth – slowed down from initial “volume doubling every few months” but still expanding
• Content can be dynamically generated
Web Operation
• Basic Client Server model
– The http protocol
• HyperText Transfer Protocol
– Few simple commands that allow communication between
the server and an application requesting something from the
server – usually a browser, but not always.
– Server
• The site where the content resides.
• Most of the web is served up by Apache and its
byproducts.
– Client
• The program requesting something from the server.
• Browsers most often, but also web crawlers and other
applications.
HTTP: GET and POST
• GET <path> HTTP/<version>
– Requests that the server send the specific page
at <path> back to the requestor.
– The version number allows compatible
communication
– Server sends header and the requested file
(page).
– Additional requests can follow.
• POST
– Similar to a GET but allows additional
information to be sent to the server.
– Useful for purchases or page edits.
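To make the exchange concrete, here is a minimal sketch using Python's standard http.client module; the host www.example.com and the paths are placeholders, not part of the course material.

import http.client

# GET <path> HTTP/1.1 -- ask the server to send back a specific page.
conn = http.client.HTTPConnection("www.example.com", 80)
conn.request("GET", "/")
response = conn.getresponse()
print(response.status, response.reason)   # e.g. 200 OK
print(response.getheaders())              # the header lines come first ...
page = response.read()                    # ... followed by the requested file (page)

# POST -- like GET, but additional data travels to the server in the request body.
# (Additional requests can follow on the same connection.)
conn.request("POST", "/search", body="q=web+crawling",
             headers={"Content-Type": "application/x-www-form-urlencoded"})
print(conn.getresponse().status)
conn.close()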
HEAD
• HEAD <path> HTTP/<version>
• Useful for checking whether a previously
fetched web page has changed.
• The request results in header information,
but not the page itself.
• Response:
– Confirm HTTP version compatibility
– Date:
– Server:
– Last-Modified:
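A sketch of how a crawler might use HEAD to decide whether to refetch a page, again with Python's http.client; the host, path, and saved timestamp are made-up values for illustration.

import http.client

def last_modified(host, path):
    """Send HEAD <path>: the server returns headers only, not the page itself."""
    conn = http.client.HTTPConnection(host)
    conn.request("HEAD", path)
    response = conn.getresponse()
    stamp = response.getheader("Last-Modified")   # may be None
    conn.close()
    return stamp

# Hypothetical usage: refetch only if the page changed since the last crawl.
previous = "Tue, 06 Sep 2011 14:00:00 GMT"        # value saved from an earlier fetch
current = last_modified("www.example.com", "/")
if current is not None and current != previous:
    print("Page has changed; fetch it again.")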
Full set of HTTP commands (HTTP/1.1 methods)
• OPTIONS Command
• GET Command
• HEAD Command
• POST Command
• PUT Command
• DELETE Command
• TRACE Command
• CONNECT Command
Search
• Search engines, whether general-purpose engines like Google and Yahoo or more specialized ones, do not crawl the web looking for results after receiving a query.
– That would take much too long and provide
unacceptable performance
• Search engines actually search a
carefully constructed database with
indices created for efficiently locating
content
Architecture of a Search Engine
Ref: Manning Introduction to
Information Retrieval
Crawling in Context
• So, we see that crawling is just one step
in a complex process of acquiring
information from the Web to use in any
application.
• Usually, we will want to sort through the information we find to get the most relevant part for our use, so the example of a search engine is relevant.
Our goals
• We will crawl the Web.
• We will also store the results of the
crawl so that we can retrieve what we
want when we want it.
• We will learn about fundamental
concepts of information retrieval,
including index creation.
Crawling
• A web crawler (aka a spider) is a program that does the following (a minimal code sketch appears after this list)
– Starts with one or more URLs – the seeds
• Other URLs will be found in the pages pointed to by the seed
URLs. They will be the starting point for further crawling
– Uses the standard protocols for requesting a resource
from a server
• Requirements for respecting server policies
• Politeness
– Parses the resource obtained
• Obtains additional URLs from the fetched page
– Implements policies about duplicate content
– Recognizes and eliminates duplicate or unwanted URLs
– Adds found URLs to the queue and continues from the
request to server step
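A minimal, single-threaded sketch of the loop just described, using only the Python standard library. The seed URL is a placeholder, and politeness and robots.txt handling are deliberately left out here; they are covered later.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    frontier = deque([seed])          # queue of URLs still to fetch
    seen = {seed}                     # eliminate duplicate URLs
    while frontier and len(seen) <= limit:
        url = frontier.popleft()
        try:
            page = urlopen(url).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                  # skip pages we cannot fetch
        parser = LinkParser()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)   # normalize relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

crawl("http://www.example.com/")      # placeholder seed URL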
The Depth of the Web
• A URL gives access to a web page.
• That page may have links to other pages.
• Some pages are generated only when
information is provided through a form.
• These pages cannot be discovered just
by crawling.
• The surface web is huge.
• The deeper web is unfathomable.
Anatomy of a URL
• http://www.csc.villanova.edu/~cassel
• That is a web page.
• Three parts
– http – the protocol to use for retrieving the
page
– www.csc.villanova.edu -- the name of the
domain
• csc is a subdomain of the villanova domain
– ~cassel
• Abbreviation for index.html in the directory cassel
at the machine associated with
www.csc.villanova.edu
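The standard library can split a URL into exactly these parts; a short sketch using urllib.parse on the URL from the slide:

from urllib.parse import urlparse

parts = urlparse("http://www.csc.villanova.edu/~cassel")
print(parts.scheme)    # 'http' -- the protocol to use for retrieving the page
print(parts.netloc)    # 'www.csc.villanova.edu' -- the domain name
print(parts.path)      # '/~cassel' -- the path the server maps to a page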
The major domain categories
• Generic categories:
– .net -- Originally restricted to major participants in maintaining the
Internet. Now open.
– .org -- Generally non profit organizations, including professional
organizations such as acm.org
– .com -- Commercial organizations such as amazon.com, etc.
– .edu -- Restricted to higher education (post secondary) institutions. High
schools and elementary schools are not allowed to use it.
– .gov – Government organizations, such as nsf.gov
– .mil – Military sites
– .int – International
• Country Codes
– .us
– .uk
• Uses second level domains such as ac.uk or co.uk
– And other country designations. Who is .tv?
• Newer ones: .biz, .name, etc.
• All regulated by the Internet Assigned Numbers Authority (IANA)
If not http:// then what?
• Other protocols can be specified in the
request to a server:
– file:// local file on the current host
– ftp:// use the ftp protocol to fetch the file
– Etc.
Domain categories
• The domain categories serve to partition the
universe of domain names.
• Domain Name Servers (DNS) perform the lookup that translates a domain name to an IP address (see the sketch after this list).
• An IP address locates a particular machine and
makes a communication path known.
– Most common still: 32 bit IPv4 addresses
– Newer: 128 bit IPv6
• A server will typically have many programs running,
several listening for network connections.
– A port number (16 bits) identifies the specific process
for the desired connection.
– Default port for web connections: 80
– If other than 80, it must be specified in the URL
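A short sketch of the lookup and port details just listed, using Python's socket module; the host name is a placeholder.

import socket

# DNS lookup: translate a domain name into an (IPv4) IP address.
ip = socket.gethostbyname("www.example.com")
print(ip)                                  # dotted-quad address of the machine

# Connect to the web server's default port, 80; a non-default port would
# have to appear explicitly in the URL (e.g. http://host:8080/...).
with socket.create_connection((ip, 80), timeout=5) as sock:
    sock.sendall(b"HEAD / HTTP/1.1\r\nHost: www.example.com\r\n\r\n")
    print(sock.recv(200))                  # first bytes of the server's reply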
Crawler features
• A crawler must be
– Robust: Survive spider traps, i.e., websites that fool a spider into fetching large or limitless numbers of pages within the domain.
• Some deliberate; some errors in site design
– Polite: Crawlers can interfere with the normal
operation of a web site. Servers have policies,
both implicit and explicit, about the allowed
frequency of visits by crawlers. Responsible
crawlers obey these. Others become
recognized and rejected outright.
Ref: Manning Introduction to Information Retrieval
Crawler features
• A crawler should be
– Distributed: able to execute on multiple systems
– Scalable: The architecture should allow additional
machines to be added as needed
– Efficient: Performance is a significant issue if
crawling a large web
– Useful: Quality standards should determine which
pages to fetch
– Fresh: Keep the results up-to-date by crawling
pages repeatedly in some organized schedule
– Extensible: A modular, well-crafted architecture allows the crawler to expand to handle new formats, protocols, etc.
Ref: Manning Introduction to Information Retrieval
Scale
• A one month crawl of a billion pages
requires fetching several hundred pages per
second
• It is easy to lose sight of the numbers when
dealing with data sources on the scale of
the Web.
– 30 days * 24 hours/day * 60 minutes/hour * 60
seconds/minute = 2,592,000 seconds
– 1,000,000,000 pages/2,592,000 seconds = 385.8
pages/second
• Note that those numbers assume that the
crawling is continuous
Ref: Manning Introduction to Information Retrieval
Google Search
• See http://video.google.com/videoplay?docid=1243280683715323550&hl=en#
• Marissa Mayer of Google on how a search happens at Google.
Basic Crawl Architecture
[Diagram: the URL frontier feeds the fetch module, which consults DNS and the WWW; fetched documents are parsed; a “content seen?” test checks document fingerprints (doc FP’s); extracted links pass the robots filters, the URL filter, and duplicate URL elimination (against the URL set) before rejoining the URL frontier.]
Ref: Manning Introduction to Information Retrieval
Crawler Architecture
• Modules:
– The URL frontier (the queue of URLs still to be
fetched, or fetched again)
– A DNS resolution module (The translation from
a URL to a web server to talk to)
– A fetch module (use http to retrieve the page)
– A parsing module to extract text and links from
the page
– A duplicate elimination module to recognize
links already seen
Ref: Manning Introduction to Information Retrieval
Crawling threads
• With so much space to explore, so many
pages to process, a crawler will often
consist of many threads, each of which
cycles through the same set of steps we
just saw. There may be multiple threads
on one processor or threads may be
distributed over many nodes in a
distributed system.
Politeness
• Not optional.
• Explicit
– Specified by the web site owner
– What portions of the site may be crawled and what portions
may not be crawled
• robots.txt file
• Implicit
– If no restrictions are specified, still restrict how often you hit
a single site.
– You may have many URLs from the same site. Too much
traffic can interfere with the site’s operation. Crawler hits
are much faster than ordinary traffic – could overtax the
server. (Constitutes a denial of service attack) Good web
crawlers do not fetch multiple pages from the same server
at one time.
Robots.txt
• Protocol nearly as old as the web
– www.robotstxt.org/wc/norobots.html
– File: URL/robots.txt
• Contains the access restrictions
– Example:
# All robots (spiders/crawlers)
User-agent: *
Disallow: /yoursite/temp/

# Only the robot named searchengine; nothing is disallowed for it
User-agent: searchengine
Disallow:
Processing robots.txt
• First line:
– User-agent – identifies to whom the instruction
applies. * = everyone; otherwise, specific crawler
name
– Disallow: or Allow: provides path to exclude or
include in robot access.
• Once the robots.txt file is fetched from a site,
it does not have to be fetched every time you
return to the site.
– Just takes time, and uses up hits on the
server
– Cache the robots.txt file for repeated
reference
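Python's standard library includes a parser for exactly this; the sketch below fetches a site's robots.txt once and then reuses the cached, parsed rules. The site and the crawler name "mycrawler" are made up.

from urllib.robotparser import RobotFileParser

robots = RobotFileParser("http://www.example.com/robots.txt")
robots.read()                                   # fetch the file once

# Reuse the cached, parsed rules for every URL on this site.
for url in ["http://www.example.com/", "http://www.example.com/temp/page.html"]:
    if robots.can_fetch("mycrawler", url):      # "mycrawler" is a made-up agent name
        print("allowed:   ", url)
    else:
        print("disallowed:", url)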
Robots <META> tag
• robots.txt provides information about
access to a directory.
• A given file may have an html meta tag
that directs robot behavior
• A responsible crawler will check for that
tag and obey its direction.
• Ex:
– <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
– OPTIONS: INDEX, NOINDEX, FOLLOW, NOFOLLOW
See http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2 and http://www.robotstxt.org/meta.html
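One way a crawler might honour the tag while parsing a page, sketched with Python's html.parser; the sample page and the class are illustrative only.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Record the CONTENT of a <META NAME="ROBOTS" ...> tag, if present."""
    def __init__(self):
        super().__init__()
        self.directives = set()
    def handle_starttag(self, tag, attrs):
        attrs = {k.lower(): (v or "") for k, v in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives = {d.strip().upper()
                               for d in attrs.get("content", "").split(",")}

page = '<html><head><META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
index_page   = "NOINDEX" not in parser.directives
follow_links = "NOFOLLOW" not in parser.directives
print(index_page, follow_links)        # True False for the example above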
Crawling
• Pick a URL from the frontier (which one? – a frontier policy decision)
• Fetch the document at the URL
• Parse the fetched document
– Extract links from it to other docs (URLs)
• Check if the URL has content already seen
– If not, add to indices
• For each extracted URL
– Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
– Check if it is already in the frontier (duplicate URL elimination)
Ref: Manning Introduction to Information Retrieval
Basic Crawl Architecture
[Diagram: the same basic crawl architecture shown earlier – URL frontier, fetch, parse, content-seen test, URL filter, and duplicate URL elimination, supported by DNS, the robots filters, doc FP’s, and the URL set.]
Ref: Manning Introduction to Information Retrieval
DNS – Domain Name Server
• Internet service to resolve URLs into IP
addresses
• Distributed servers, some significant
latency possible
• OS implementations – DNS lookup is
blocking – only one outstanding request at
a time.
• Solutions
– DNS caching
– Batch DNS resolver – collects requests and
sends them out together
Ref: Manning Introduction to Information Retrieval
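The first of those two workarounds, DNS caching, can be sketched in a few lines; this is a toy in-process cache, not a full resolver module.

import socket

_dns_cache = {}

def resolve(host):
    """Return the IP for host, consulting the OS resolver only on a cache miss."""
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)   # blocking lookup
    return _dns_cache[host]

resolve("www.example.com")   # first call pays the lookup latency
resolve("www.example.com")   # second call is answered from the cache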
Parsing
• Fetched page contains
– Embedded links to more pages
– Actual content for use in the application
• Extract the links
– Relative link? Expand (normalize)
– Seen before? Discard
– New?
• Meet criteria? Append to URL frontier
• Does not meet criteria? Discard
• Examine content
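The "Relative link? Expand (normalize)" step is what urllib.parse.urljoin does; a short illustration with a made-up base page and links:

from urllib.parse import urljoin, urldefrag

base = "http://www.example.com/dept/index.html"   # page we just fetched

for raw in ["courses.html", "../news/", "http://other.example.org/x"]:
    absolute = urljoin(base, raw)        # relative link? expand it
    absolute, _ = urldefrag(absolute)    # drop any #fragment before comparing
    print(absolute)
# http://www.example.com/dept/courses.html
# http://www.example.com/news/
# http://other.example.org/x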
Content
• Seen before?
– How to tell?
• Fingerprints, shingles
– Documents may be identical, or merely similar
– If already in the index, do not process it again
Ref: Manning Introduction to Information Retrieval
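A toy illustration of the two ideas named on the slide: a whole-document fingerprint catches exact duplicates, and word shingles give a similarity score for near-duplicates. The hash function and shingle length are arbitrary choices for the sketch.

import hashlib

def fingerprint(text):
    """Exact-duplicate test: identical documents hash to the same value."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=4):
    """Near-duplicate test: the set of all k-word windows in the document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Similarity of two shingle sets (1.0 = identical, 0.0 = disjoint)."""
    return len(a & b) / len(a | b) if a | b else 1.0

doc1 = "web crawling is one step in building a search engine"
doc2 = "web crawling is one step in building a search application"
print(fingerprint(doc1) == fingerprint(doc2))             # False: not identical
print(round(jaccard(shingles(doc1), shingles(doc2)), 2))  # high, but below 1.0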
Distributed crawler
• For big crawls,
– Many processes, each doing part of the job
• Possibly on different nodes
• Geographically distributed
– How to distribute
• Give each node a set of hosts to crawl
• Use a hashing function to partition the set of
hosts
– How do these nodes communicate?
• Need to have a common index
Ref: Manning Introduction to Information Retrieval
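The hash-partitioning option can be sketched in a few lines: hash the host name and take the result modulo the number of crawler nodes, so every URL on one host lands on the same node. The cluster size and URLs are placeholders.

import hashlib
from urllib.parse import urlparse

NUM_NODES = 4   # assumed size of the crawler cluster

def node_for(url):
    """Assign every URL on the same host to the same crawler node."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

print(node_for("http://www.example.com/a.html"))   # same node ...
print(node_for("http://www.example.com/b.html"))   # ... as this one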
Communication between nodes
The output of the URL filter at each node is sent to the
Duplicate URL Eliminator at all nodes
[Diagram: each node runs the basic crawl architecture – fetch, parse, content-seen test, URL filter – plus a host splitter that routes URLs destined for other nodes “to other hosts”; URLs arriving “from other hosts” feed the duplicate URL eliminator and the URL frontier, with DNS, doc FP’s, robots filters, and the URL set as supporting stores.]
Ref: Manning Introduction to Information Retrieval
URL Frontier
• Two requirements
– Politeness: do not go too often to the same
site
– Freshness: keep pages up to date
• News sites, for example, change frequently
• Conflicts – The two requirements may be
directly in conflict with each other.
• Complication
– Fetching the URLs embedded in a page will yield many URLs located on the same server. Delay fetching those.
Ref: Manning Introduction to Information Retrieval
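A very small sketch of the politeness half of the frontier: one FIFO queue per host plus a timestamp that keeps the crawler from revisiting a host too soon. The two-second delay is an arbitrary assumption, and freshness scheduling is left out.

import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    """Toy frontier: one FIFO queue per host and a minimum gap between hits."""
    def __init__(self, delay=2.0):
        self.delay = delay                      # assumed per-host delay (seconds)
        self.queues = defaultdict(deque)
        self.next_allowed = defaultdict(float)  # earliest next visit per host

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def pop(self):
        """Return a URL from some host that may be visited now, else None."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None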
More …
• We will examine these things more
completely. What will you actually do?
• Goal
– Write a simple crawler
• Not distributed, not multi-threaded
• Use a seed URL, connect with the server, fetch the
document, extract links, extract content
– Explore existing crawlers
• Evaluate their characteristics
• Learn to use one to do serious crawling
– Process the documents fetched to serve some
purpose. Create a web site for that purpose.
Ref: Manning Introduction to Information Retrieval
Processing the documents
• Create an index and store the
documents and the index so that
appropriate content can be found when
needed.
• Learn the fundamentals of information
retrieval as they apply to web services
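As a preview of index creation, here is a toy inverted index: each term maps to the set of documents that contain it. The two documents are made up.

from collections import defaultdict

docs = {
    1: "web crawling fetches pages from the web",
    2: "an index maps terms to the pages that contain them",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["pages"]))   # [1, 2] -- both documents mention "pages"
print(sorted(index["index"]))   # [2]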
Next week
• A simple web crawler
• Web crawl visualization
• Criteria for judging web crawlers
• Some web crawlers to compare