WEB - C. Lee Giles


Basic WWW Technologies
Thanks to P. Smyth, Hayes, Mark
Sapossnekk, B. Arms.
Web and Internet
• Focus
– Infrastructure
– Standards
– Languages
– Structure (crawlers)
– Access
2
What Is the World Wide Web?
The world wide web (web) is a network of
information resources. The web relies on three
mechanisms to make these resources readily
available to the widest possible audience:
1. A uniform naming scheme for locating resources
on the web (e.g., URIs).
2. Protocols, for access to named resources over
the web (e.g., HTTP).
3. Hypertext, for easy navigation among resources
(e.g., HTML).
3
Internet vs. Web
Internet:
• The Internet is the more general term
• Includes the physical aspect of the underlying networks
and mechanisms such as email, FTP, HTTP…
Web:
• Associated with information stored on the
Internet
• Also refers to a broader class of networks, e.g., the Web
of English Literature
Both the Internet and the web are networks
Networks vs. Graphs
Examples?
http://www.cybergeography.org/
5
Essential Components of WWW
Resources:
• Conceptual mappings to concrete or abstract entities, which do not
change in the short term
• ex: IST512 website (web pages and other kinds of files)
Resource identifiers (hyperlinks):
• Strings of characters that represent generalized addresses and may
contain instructions for accessing the identified resource
• http://clgiles.ist.psu.edu/IST512 is used to identify our course
homepage
Transfer protocols:
• Conventions that regulate the communication between a browser
(web user agent) and a server
6
Internet Technologies
The World Wide Web
A way to access and share information
Technical papers, marketing materials, recipes, ...
A huge network of computers: the Internet
Graphical, not just textual
Information is linked to other information
Application development platform
Shop from home
Provide self-help applications for customers and
partners
...
7
Internet Technologies
WWW Architecture
Client/Server, Request/Response architecture
You request a Web page
e.g. http://www.msn.com/default.asp
HTTP request
The Web server responds with data in the form of a Web page
HTTP response
Web page is expressed as HTML
Pages are identified by a Uniform Resource Locator (URL)
Protocol: http
Web server: www.msn.com
Web page: default.asp
Can also provide parameters: ?name=Leon
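These pieces can be pulled apart programmatically. A minimal sketch in Python using the slide's example URL (illustrative only; a real application would add error handling):

from urllib.parse import urlparse, parse_qs

url = "http://www.msn.com/default.asp?name=Leon"
parts = urlparse(url)
print(parts.scheme)           # 'http'            (protocol)
print(parts.netloc)           # 'www.msn.com'     (web server)
print(parts.path)             # '/default.asp'    (web page)
print(parse_qs(parts.query))  # {'name': ['Leon']} (parameters)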
8
Internet Technologies
Web Standards
Internet Engineering Task Force (IETF)
http://www.ietf.org/
Founded 1986
Request For Comments (RFC) at
http://www.ietf.org/rfc.html
World Wide Web Consortium (W3C)
http://www.w3.org
Founded 1994 by Tim Berners-Lee
Publishes technical reports and recommendations
9
Internet Technologies
Web Design Principles
Interoperability: Web languages and protocols
must be compatible with one another
independent of hardware and software.
Evolution: The Web must be able to accommodate
future technologies. Encourages simplicity,
modularity and extensibility.
Decentralization: Facilitates scalability and
robustness.
10
Languages of the WWW
Markup languages
A markup language combines text and extra information
about the text. The extra information, for example
about the text's structure or presentation, is
expressed using markup, which is intermingled with
the primary text. The best-known markup language
in modern use is HTML (Hypertext Markup
Language), one of the foundations of the World Wide
Web. Historically, markup was (and is) used in the
publishing industry in the communication of printed
work between authors, editors, and printers.
11
What is a markup language?
Textual (i.e. human-readable) language
where significant elements are indicated
by markers
<TITLE>XML</TITLE>
Examples are RTF, HTML, XML, TEX etc.
Easy to process and can be manipulated by
a variety of application programs
12
Standard Generalized Markup
Language (SGML)
• Based on GML (generalized markup language),
developed by IBM in the 1960s
• An international standard (ISO 8879:1986) defines how
descriptive markup should be embedded in a document
• Can define any document format of any
complexity
• Enables extensibility, structure, and validation
• Too many optional features for the Web
• Gave birth to the extensible markup language (XML),
W3C recommendation in 1998
14
The Purpose of SGML
SGML is designed to make your information last longer than
the systems that created it. Such longevity also implies
immunity to short-term changes -- such as a change from one
application program to another -- so SGML is also inherently
designed for re-purposing and portability.
15
What is SGML?
SGML (and its derivatives, HTML and XML) are
ASCII character-based representations of
electronic data
Remember, it's all bits--meaning is derived from
how they are organized…
Think of SGML docs as strings that must be
parsed--A web browser parses an HTML doc
and uses the markup codes to display the data
contained
Since it's all ASCII, these docs can also be
handled by non-parsing tools (such as vi, emacs,
perl, etc.)
16
What is SGML?
SGML is:
very large, powerful and complex
has been in heavy industrial and commercial
use for two decades (ISO standard since
1986)
XML is a lightweight, cut-down version of
SGML
17
SGMLXMLHTML
SGML is the “mother tongue” – but is overkill for most
common desktop applications.
XML is an abbreviated version of SGML
easier to define your own document types
easier for programmers to write programs to handle
documents (and data)
omits all the options (and most of the more complex and
less-used parts) of SGML
HTML is just one of many SGML or XML “applications”
– the most frequently used on the Web
18
SGML Components
SGML documents have three parts:
• Declaration: specifies which characters and delimiters
may appear in the application
• DTD (document type definition) / style sheet: defines the
syntax of markup constructs
• Document instance: actual text (with the tags) of the
document
More info can be found at:
http://www.W3.Org/markup/SGML
19
Structure of SGML documents
Prolog
SGML Declaration--information about the dialect of
SGML used, codes used, delimiters.
Document Type Definition (DTD)--external description
of the relationship of data elements
Instance
Content
Descriptive Markup
Output Specifications (e.g. a style sheet)
DSSSL (Document Style Semantic Specification
Language)
FOSI (Formatted Output Specification Instance)
21
SGML Markup
Looks like HTML (really, HTML looks like SGML,
because it is SGML!)
What is done with tagged text determined by
applications
<anthology><poem><title>The SICK ROSE
<stanza>
<line>O Rose thou art sick.
<line>The invisible worm,
<line>That flies in the night
<line>In the howling storm:
<stanza>
<line>Has found out thy bed
<line>Of crimson joy:
<line>And his dark secret love
<line>Does thy life destroy.
<poem>
<!-- more poems go here
-->
</anthology>
22
The DTD
In SGML, documents are given a type, defined in
the Document Type Definition
The DTD is just another text file that:
Lists Constituent Parts in a series of Declaration
Statements
Ensures Consistent Structure
Think of this as being analogous to objects, which
have specific properties and values
SGML Documents can be checked against the
DTD by a parser
23
Simple Example of a DTD
"!" marks a Declaration Statement
Elements are named and have Start Tags and
End Tags
Declarations consist of three parts:
Type and Name, consisting of other Elements or
Reserved keywords (e.g. #PCDATA or Element)
Minimization Rules governing tags
Content Model, with Occurrence Indicators (+ ? *) or
Group Connectors (, & |)
<!ELEMENT anthology - -  (poem+)>
<!ELEMENT poem      - O  (title?, stanza+)>
<!ELEMENT title     - O  (#PCDATA) >
<!ELEMENT stanza    - O  (line+)   >
<!ELEMENT line      O O  (#PCDATA) >
24
Using Data:
The "Traditional" Model
CGI stands for Common Gateway Interface. CGI allows HTML pages to
interact with programming applications.
Open Database Connectivity (ODBC) is a standard software API
for connecting to database management systems (DBMS).
25
Using Data:
The SGML Approach
26
The Up Side
Data Independence--data structure
controlled by use of an open standard
Longevity--structure is determined by DTD,
not a monolithic and possibly proprietary
application
Flexibility--separation of formatting and
content description yields multiple uses by
different parsing systems
27
The Down Side
Strict encoding
DTDs
Lack of SGML Applications
28
HTML Background
• HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the
Mosaic browser developed at NCSA.
• The Web depends on Web page authors and
vendors sharing the same conventions for
HTML. This has motivated joint work on
specifications for HTML.
• HTML standards are organized by the W3C:
http://www.w3.org/MarkUp/
31
HTML Functionalities
HTML gives authors the means to:
• Publish online documents with headings, text, tables,
lists, photos, etc
– Include spread-sheets, video clips, sound clips, and other
applications directly in their documents
• Link information via hypertext links, at the click of a
button
• Design forms for conducting transactions with remote
services, for use in searching for information, making
reservations, ordering products, etc
32
HTML Versions
• HTML 4.01 is a revision of the HTML 4.0 Recommendation first
released on 18th December 1997.
– HTML 4.01 Specification:
http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
• HTML 4.0 was first released as a W3C Recommendation on 18
December 1997
• HTML 3.2 was W3C's first Recommendation for HTML which
represented the consensus on HTML features for 1996
• HTML 2.0 (RFC 1866) was developed by the IETF's HTML
Working Group, which set the standard for core HTML
features based upon current practice in 1994.
33
Sample Webpage HTML
Structure
<HTML>
<HEAD>
<TITLE>The title of the webpage</TITLE>
</HEAD>
<BODY> <P>Body of the webpage
</BODY>
</HTML>
35
HTML Structure
• An HTML document is divided into a head section
(here, between <HEAD> and </HEAD>) and a body
(here, between <BODY> and </BODY>)
• The title of the document appears in the head (along
with other information about the document)
• The content of the document appears in the body. The
body in this example contains just one paragraph,
marked up with <P>
36
HTML Hyperlink
<a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource
to another
• It has two ends, called anchors, and a direction
• Starts at the "source" anchor and points to the
"destination" anchor, which may be any Web
resource (e.g., an image, a video clip, a sound
bite, a program, an HTML document)
37
What is XML?
XML – eXtensible Markup Language
designed to improve the functionality of the
Web by providing more flexible and
adaptable information and identification
“extensible” because not a fixed format like
HTML
a language for describing other languages (a
meta-language)
design your own customised markup language
38
Why use XML?
XML is written in SGML – the
Standard Generalized Markup
Language, an international standard
(ISO 8879)
XML = very simple dialect of SGML
goal = enable generic SGML to be served,
received and processed on the Web in
ways not possible with HTML
39
Why use XML?
XML is not just for Web pages
use to store any kind of structured
document
to enclose/encapsulate information in
order to pass it between different
computing systems that are otherwise
unable to communicate
40
Key feature of XML
An application is free to use XML tagged data in
many different ways, e.g.
produce an image
generate a formatted text listing
display the XML document’s markup in pretty
colors
restructure the data into a format for storing in a
database, transmission over a network, input to
another program.
41
XML is important because...
Removes 2 constraints that held back
Web development:
dependence on a single, inflexible
document type (HTML) [much abused]
the complexity of full SGML
[many options but hard to program]
42
XML… allows the flexible development of
user-defined document types.
provides a robust, non-proprietary,
persistent, and verifiable file format
for the storage and transmission of
text and data both on and off the Web
43
XML Software?
hundreds (probably thousands) of
programs are “XML ready” already
today.
xml.coverpages.org covers news of new
additions to XML
44
Is XML a Computer Language?
XML is not C or C++ or like any other
programming language
By itself, it cannot specify calculations,
actions, decisions to be carried out in
any order
XML is a markup specification language
45
XML - a Markup Language
with XML, you can design ways of describing
information (text or data), usually for storage,
transmission or processing by a program
XML conveys no information about what should be
done with the data or text – it merely describes it.
By itself, XML does not do anything – it is a data
description format
46
How do I run or execute an XML file?
You can’t and you don’t !
XML is not a programming language
XML is a markup specification language
XML files are just data (waiting for a
program to do something with them)
XML files can be viewed with an XML
editor or XML-compatible browser
47
Things to Remember
XML does not replace HTML – it provides an
alternative which allows you to define your own
set of markup elements to a published standard:
<?xml version="1.0" standalone="yes"?>
<conversation>
<greeting>Hello, world!</greeting>
<response>Stop the planet, I want to get
off!</response>
</conversation>
48
Things to Remember
All parts of an XML document are case
sEnSiTiVe
Element type names are case-sensitive,
so <BODY> … </body> is not allowed.
Attribute names are case-sensitive:
<PIC width="7cm"/> and
<PIC WIDTH="6cm"/>
describe different attributes (width vs. WIDTH),
not just different values for the same attribute.
49
What is XQuery?
XQuery is the language for querying XML data
The best way to explain XQuery is to say that
XQuery is to XML what SQL is to database
tables.
XQuery uses XPath expressions to extract XML data.
XPath is a language for finding information in an XML document.
XPath is used to navigate through elements and attributes in an XML
document.
XQuery is defined by the W3C.
XQuery is supported by all the major database engines (IBM, Oracle,
Microsoft, etc.)
XQuery 1.0 is not yet a W3C Recommendation (XQuery is a Working
Draft). Hopefully it will be a recommendation in the near future.
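As an illustration, a small query in the FLWOR style the drafts define (the file name books.xml and its elements are hypothetical, not from the slides):

for $b in doc("books.xml")/bookstore/book
where $b/price > 30
order by $b/title
return $b/title

The XPath expression /bookstore/book selects the elements; the where, order by, and return clauses filter, sort, and shape the result, much as SQL does over tables.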
50
Resource Identifiers
URI: Uniform Resource Identifiers
• URL: Uniform Resource Locators
• URN: Uniform Resource Names
– Legacy, not used
– Ex: urn:isbn:4322347
51
Ping – TCP/IP
IP discovery
PING - Packet Internet Groper; a utility
used to determine whether a particular
computer is currently connected to the
Internet. It works by sending a packet to
the specified IP address and waiting for
a reply.
52
Ping (Packet Internet Groper)
[Screenshot: example run of the ping command]
53
Introduction to URIs
Every resource available on the Web has an
address that may be encoded by a URI
URIs typically consist of three pieces:
• The naming scheme of the mechanism used
to access the resource. (HTTP, FTP)
• The name of the machine hosting the
resource
• The name of the resource itself, given as a
path
54
URI Example
http://www.w3.org/TR
• There is a document available via the HTTP
protocol
• Residing on the machines hosting www.w3.org
• Accessible via the path "/TR"
55
Protocols
Describe how messages are encoded and
exchanged
For the Internet, there are different layering
architectures:
• ISO OSI 7-Layer Architecture
• TCP/IP 4-Layer Architecture
56
ISO OSI Layering Architecture
57
ISO’s Design Principles
• A layer should be created where a different level
of abstraction is needed
• Each layer should perform a well-defined
function
• The layer boundaries should be chosen to
minimize information flow across the interfaces
• The number of layers should be large enough
that distinct functions need not be thrown
together in the same layer, and small enough
that the architecture does not become unwieldy
58
TCP/IP Layering Architecture
59
TCP/IP Layering Architecture
• A simplified model that provides end-to-end reliable connections
• The network layer
– Hosts drop packets into this layer, which
routes them toward the destination
– Promises only best-effort delivery (“try my best”)
• The transport layer
– Reliable byte-oriented stream
60
The Internet Model
61
Hypertext Transfer Protocol (HTTP)
• A connection-oriented protocol (over TCP) used
to carry WWW traffic between a browser
and a server
• An application-level protocol that runs on top of
TCP, one of the Internet's transport-layer protocols
• HTTP communication is established via a
TCP connection, usually to server port 80
62
GET Method in HTTP
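A simplified example of the exchange, using the slide's earlier URL (headers abbreviated; real requests and responses carry many more fields):

GET /default.asp HTTP/1.1
Host: www.msn.com

HTTP/1.1 200 OK
Content-Type: text/html

<HTML> ...the page... </HTML>

The browser opens a TCP connection to port 80, sends the GET request, and the server answers with a status line, headers, and the HTML body.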
63
Domain Name System
DNS (domain name service): mapping from
domain names to IP addresses
IPv4:
• IPv4 was initially deployed January 1, 1983 and
is still the most commonly used version.
• 32-bit address: a string of 4 decimal numbers
separated by dots, ranging from 0.0.0.0 to
255.255.255.255.
IPv6:
• Revision of IPv4 with 128-bit addresses
64
IP Addresses
All devices connected to the Internet have a 32-bit IP (IPv4)
address associated with them. 2^32 = total addresses?
Think of the IP address as a logical address (possibly
temporary), while the 48-bit address on every NIC is the
physical, or permanent, address.
Computers, networks and routers use the 32-bit binary
address, but a more readable form is the dotted decimal
notation.
65
IP Addresses
For example, the 32-bit binary address
10000000 10011100 00001110 00000111 (4 octets)
translates to
128.156.14.7 (called dotted decimal notation)
Range of each octet is 0-255 (2^8 = 256 values)
There are basically four types of IP addresses:
Classes A, B, C and D.
A particular class address has a unique network address size
and a unique host address size.
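A quick check of the conversion in Python (a minimal sketch; the binary string is the slide's example):

# Convert a 32-bit binary address to dotted decimal notation
binary = "10000000 10011100 00001110 00000111"
print(".".join(str(int(octet, 2)) for octet in binary.split()))
# prints: 128.156.14.7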
66
DNS Lookup
http://www.bankes.com/nslookup.htm
nslookup - Name Server Lookup; a Linux/Windows utility
used to query Internet domain name servers. An
nslookup is usually used to find the IP address
corresponding to a hostname.
whois - An Internet program which allows users to query a
database of people and other Internet entities, such as
domains, networks, and hosts, kept at the NIC. The
information for people shows a person's company name,
address, phone number and email address
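For example, from a command line (hypothetical names; output varies by system):

nslookup www.psu.edu      # ask a name server for the host's IP address
whois psu.edu             # query the registration database for the domain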
67
Top Level Domains (TLD)
Top level domain names, .com, .edu, .gov and ISO
3166 country codes
There are three types of top-level domains:
• Generic domains were created for use by the Internet
public
• Country code domains were created to be used by
individual countries
• The .arpa (Address and Routing Parameter Area)
domain is designated to be used exclusively for
Internet-infrastructure purposes
68
Registrars
• Domain names ending with .aero, .biz,
.com, .coop, .info, .museum, .name, .net,
.org, or .pro can be registered through
many different companies (known as
"registrars") that compete with one another
• InterNIC at http://internic.net
• Registrars Directory:
http://www.internic.net/regist.html
69
Web Search Engine Use and Commerce Continues to Grow
Pew Internet & American Life Internet Project Survey: Sept, 2005
- Search Engine News:
Search engine advertising revenues exceed TV networks
Walmart and other retailers express concern over Google
Google fights DOJ
European Union decides to build own search engine to combat US control of search
FOG replaces FOM
http://www.pewinternet.org
74
Web Search Engine Use and Commerce Continues to Grow
http://www.pewinternet.org
75
Search Engine Coverage of the WWW
Overlap analysis used for estimating the size of
the indexable web
• W: size of set of webpages available to search engines
• Wa, Wb: number of pages crawled by two independent
engines a and b
• P(Wa), P(Wb): probabilities that a page was crawled by
search engine a or b
• P(Wa)= Wa / W
• P(Wb)= Wb / W
• P(Wa ∩ Wb) = |Wa ∩ Wb| / W
76
Overlap Analysis Capture/recapture
• Bayes rule: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
• P(Wa ∩ Wb) = P(Wa | Wb) P(Wb)
If a and b are independent:
P(Wa ∩ Wb) = P(Wa) × P(Wb)
• |Wa ∩ Wb| / W = (Wa / W) × (Wb / W)
so W = (Wa × Wb) / |Wa ∩ Wb|
[Venn diagram: the sets Wa and Wb overlapping inside the Web]
Need the search engines to tell you what they have and what overlaps
with each other.
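A worked example with hypothetical numbers: suppose engine a indexes Wa = 100 million pages, engine b indexes Wb = 120 million, and 30 million pages appear in both indexes. Then W ≈ (100M × 120M) / 30M = 400 million pages, so each engine alone covers only a quarter to a third of the estimated web.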
77
Overlap Analysis
Researchers (Lawrence and Giles) found:
• Web had at least 320 million pages in 1997
• 60% of web was covered by six major engines
• Maximum coverage of a single engine was 1/3
of the web
What is the overlap today? What is the size of the
web? Can it be measured?
78
Dynamic HTML
Refers to Web content that changes each time it is viewed. For example,
the same URL could result in a different page depending on any
number of parameters, such as:
• Geographic location of the reader
• Time of day
• Previous pages viewed by the reader
• Profile of the reader
There are many technologies for producing dynamic HTML, including CGI
scripts, Server-Side Includes (SSI), cookies, Java, JavaScript, and
ActiveX.
79
How to Improve the Coverage?
• Meta-search engine: dispatch the user
query to several engines at the same time,
then collect and merge the results into one
list for the user.
• Any suggestions?
80
Graph Structure in the Web
http://www9.org/w9cdrom/160/160.html
81
What is a Web Crawler?
Web Crawler
• A program for downloading web pages.
• Given an initial set of seed URLs, it recursively
downloads every page that is linked from pages in
the set up to some limit.
• A focused web crawler downloads only those
pages whose content satisfies some criterion.
Also known as a web spider
Very dependent on the structure of the web.
82
Pseudocode for a Simple Crawler
Start_URL = “http://www.ebizsearch.org”;
List_of_URLs ={}; #empty at first
append(List_of_URLs,Start_URL); # add start url to list
While(notEmpty(List_of_URLs)) {
for each URL_in_List in (List_of_URLs) {
if(URL_in_List is_of HTTProtocol) {
if(URL_in_List permits_robots(me)){
Content=fetch(Content_of(URL_in_List));
Store(someDataBase,Content);
# caching
if(isEmpty(Content) or isError(Content){
skip to next_URL_in_List;
} #if
else {
URLs_in_Content=extract_URLs_from_Content(Content);
append(List_of_URLs,URLs_in_Content);
} #else
} else { discard(URL_in_List); skip to next_URL_in_List; }
if(stop_Crawling_Signal() is TRUE) { break; }
} #foreach
} #while
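The same loop as a minimal runnable sketch in Python (an illustration, not the course's code; it uses only the standard library, crawls breadth-first, and omits the politeness and robots.txt checks a real crawler needs):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links = base_url, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(start_url, limit=50):
    pages = {}                                # url -> content (the "database")
    queue, seen = deque([start_url]), {start_url}
    while queue and len(pages) < limit:
        url = queue.popleft()                 # FIFO queue = breadth-first
        try:
            content = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                          # broken link or timeout: skip it
        pages[url] = content                  # cache the page
        parser = LinkExtractor(url)
        parser.feed(content)
        for link in parser.links:
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

Swapping queue.popleft() for queue.pop() turns this breadth-first crawler into a depth-first one.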
83
Web Crawler
• A crawler is a program that picks up a
page and follows all the links on that page
• Crawler = Spider = Bot = Harvester
• Usual types of crawler:
– Breadth First
– Depth First
84
Breadth First Crawlers
Use breadth-first search (BFS) algorithm
• Get all links from the starting page and
add them to a queue
• Remove the 1st link from the queue, get all
links on its page and add them to the queue
• Repeat the above step till the queue is empty
85
Search Strategies BF
Breadth-first Search
86
Breadth First Crawlers
87
Depth First Crawlers
Use depth first search (DFS) algorithm
• Get the 1st unvisited link from the start
page
• Visit the link and follow its 1st unvisited link
• Repeat the above step till there are no
unvisited links left
• Then go to the next unvisited link in the previous
level and repeat the 2nd step
88
Search Strategies DF
Depth-first Search
89
Depth First Crawlers
90
Intelligent Crawlers
Crawls based on page analysis and
relevance – focused crawlers
91
Search Strategy Trade-Off’s
Breadth-first explores uniformly outward from the root page
but requires memory of all nodes on the previous level
(exponential in depth). Standard spidering method.
Depth-first requires memory of only depth times branching
factor (linear in depth) but gets “lost” pursuing a single
thread.
Both strategies are implementable using a queue of links
(URLs).
92
Avoiding Page Duplication
Must detect when revisiting a page that has already been
spidered (the web is a graph, not a tree).
Must efficiently index visited pages to allow a rapid
recognition test.
Tree indexing (e.g. a trie) or a hashtable:
Index the page using its URL as a key.
Must canonicalize URLs (e.g. delete a trailing “/”)
Does not detect duplicated or mirrored pages.
Index the page using its textual content as a key:
Requires first downloading the page.
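A minimal canonicalization sketch in Python (illustrative only; real crawlers normalize much more, e.g. percent-encoding and query-parameter order):

from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Lower-case the scheme and host, drop a default port and trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if parts.scheme == "http" and host.endswith(":80"):
        host = host[: -len(":80")]            # :80 is implied for http
    path = parts.path.rstrip("/") or "/"      # treat ".../IST512/" like ".../IST512"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

# canonicalize("HTTP://WWW.Example.com:80/IST512/") -> "http://www.example.com/IST512"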
93
Robot Exclusion
Web sites and pages can specify that robots
should not crawl/index certain areas.
Two components:
Robots Exclusion Protocol: Site-wide
specification of excluded directories.
Robots META Tag: Individual document tag to
exclude indexing or following links.
94
Robots Exclusion Protocol
Site administrator puts a “robots.txt” file at the root
of the host’s web directory.
http://www.ebay.com/robots.txt
http://www.cnn.com/robots.txt
File is a list of excluded directories for a given
robot (user-agent).
Exclude all robots from the entire site:
User-agent: *
Disallow: /
95
Robot Exclusion Protocol
Examples
Exclude specific directories:
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
Disallow: /users/paranoid/
Exclude a specific robot:
User-agent: GoogleBot
Disallow: /
Allow a specific robot:
User-agent: GoogleBot
Disallow:
User-agent: *
Disallow: /
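Python's standard library can evaluate these rules; a small sketch (the robots.txt URL is the slide's eBay example, the page path is hypothetical):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.ebay.com/robots.txt")
rp.read()  # fetch and parse the file
print(rp.can_fetch("GoogleBot", "http://www.ebay.com/some/page.html"))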
96
Robot Exclusion Protocol
Details
Use blank lines only to separate the records for
different User-agents.
One directory per “Disallow” line.
No regex patterns in directories.
The file must be named “robots.txt”, not “robot.txt”.
Ethical robots obey “robots.txt”
97
Robots META Tag
Include META tag in HEAD section of a specific HTML
document.
<meta name="robots" content="none">
Content value is a pair of values for two aspects:
index | noindex: Allow/disallow indexing of this page.
follow | nofollow: Allow/disallow following links on this page.
98
Robots META Tag (cont)
Special values:
all = index,follow
none = noindex,nofollow
Examples:
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="none">
99
Robot Exclusion Issues
META tag is newer and less well-adopted than “robots.txt”.
Standards are conventions to be followed by “good robots.”
Companies have been prosecuted for “disobeying” these
conventions and “trespassing” on private cyberspace.
“Good robots” also try not to “hammer” individual sites with
lots of rapid requests, which can amount to a
“denial of service” attack.
100
Multi-Threaded Spidering
The bottleneck is network delay in downloading individual
pages.
Best to have multiple threads running in parallel, each
requesting a page from a different host.
Distribute URLs to threads to guarantee equitable
distribution of requests across different hosts, to
maximize throughput and avoid overloading any single
server.
The early Google spider had multiple coordinated crawlers with
about 300 threads each, together able to download over
100 pages per second.
101
Directed/Focused Spidering
Sort queue to explore more “interesting”
pages first.
Two styles of focus:
Topic-Directed
Link-Directed
102
Not so Simple…
Performance -- How do you crawl
1,000,000,000 pages?
Politeness -- How do you avoid overloading
servers?
Failures -- Broken links, time outs, spider traps.
Strategies -- How deep do we go? Depth first
or breadth first?
Implementations -- How do we store and
update S and the other data structures
needed?
104
What to Retrieve
No web crawler retrieves everything
Most crawlers retrieve only
HTML (leaves and nodes in the tree)
ASCII clear text (only as leaves in the tree)
Some retrieve
PDF
PostScript,…
Indexing after crawl
Some index only the first part of long files
Do you keep the files (e.g., Google cache)?
105
Crawling to build an historical
archive
Internet Archive:
http://www.archive.org
A not-for-profit organization in San
Francisco, created by Brewster Kahle, to
collect and retain digital materials for future
historians.
Services include the Wayback Machine.
107