Characterizing the Web Using Sampling Methods


Ed O’Neill
Brian Lavoie
OCLC Online Computer Library Center, Inc.
Web Measurement, Metrics,
and Mathematical Models Workshop
WWW9 Conference
Nonprofit, membership, library computer
service and research organization …
• 9,000 member libraries world-wide
• W3C Member
• Cataloging, reference, resource sharing,
and preservation services
• Maintain and distribute the Dewey Decimal
Classification
Roadmap
• Web Characterization Project
• Sampling the Web
• Data Collection and Storage
• Data Analysis
• Ongoing project: 1997 to present
• Answer basic questions about the Web:
– How big is it?
– What’s out there?
– How is it evolving?
– Focus on content, not network infrastructure
• Help libraries cope with integrating Web
content into their collections
Definitions: Web Objects
• Sampling the Web requires clear and
unambiguous definition of units
• The organization of Web-accessible
information suggests three object types:
Web resource, Web page, Web site
• Based on W3C Working Draft:
http://www.w3.org/1999/05/WCA-terms/
Web Resource
• An information object that:
– is accessible from the Web (via HTTP)
– is irreducible (finest level of meaningful
granularity)
– has an unambiguous identity (URI)
• In practice, a Web resource is a file
accessible from the Internet via HTTP
Web Page
An aggregate object, consisting of one or
more Web resources that are:
• Collectively identified by a single URI
• Rendered simultaneously as a single object
http://www.oclc.org/info.htm
http://www.oclc.org/images/logo.gif
http://www.oclc.org/applet.class
Web Site
A collection of Web pages that …
– reside at a single network location (IP
address)
– are interlinked: any of the site’s Web pages can
be accessed by:
• following a sequence of hyperlink references
• beginning at the site’s home page
• spanning only Web pages residing at the same
network location.
Sampling the Web
• Objective: Collect representative Web sample
• Methodology: Identify and collect random
sample of Web sites — every Web site should
have an equal probability of being included in
the sample
• Result: Random sample of Web sites; cluster
sample of Web pages
Sampling Approach
IP Address Space (4,294,967,296)
Sampled addresses
Allocated addresses
HTTP hosts
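A minimal sketch of drawing a uniform random sample from the IPv4 address space, assuming sampling without replacement. The function name `sample_ipv4` and the `seed` parameter are illustrative and do not come from the project's own tooling.

```python
import ipaddress
import random

ADDRESS_SPACE = 2 ** 32  # 4,294,967,296 possible IPv4 addresses


def sample_ipv4(count, seed=None):
    """Draw `count` distinct IPv4 addresses uniformly at random."""
    rng = random.Random(seed)
    # Sampling from a range object draws without replacement and does not
    # materialize all 2**32 integers in memory.
    picks = rng.sample(range(ADDRESS_SPACE), count)
    return [str(ipaddress.IPv4Address(n)) for n in picks]


if __name__ == "__main__":
    for ip in sample_ipv4(5, seed=2000):
        print(ip)
```

Because every address is equally likely to be drawn, a site reachable at exactly one address has the same chance of selection as any other such site.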
Data Collection
Harvester probes each random IP: “Hello … Do you speak HTTP?”
• IP #1: No response
• IP #2: “Yes … Welcome” (HTTP Code = 200)
• IP #3: “Yes … Go away” (HTTP Code = 403)
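The probe can be sketched roughly as follows, assuming a plain `GET /` on port 80. The function names and the outcome labels are illustrative; the slides do not say how the harvester issues its request.

```python
import http.client


def classify_probe(status):
    """Map a probe result to the outcomes shown on the slide.

    `status` is the HTTP status code, or None if nothing answered.
    """
    if status is None:
        return "no response"
    if status == 200:
        return "web site"        # "Yes ... Welcome"
    if status == 403:
        return "access refused"  # "Yes ... Go away"
    return "other HTTP response"


def probe(ip, timeout=5.0):
    """Send GET / to port 80 of `ip`; return the status code or None."""
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and non-HTTP replies all count
        # as "no response" in this sketch.
        return None
    finally:
        conn.close()
```

A live run would be `classify_probe(probe(ip))` for each sampled address.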
Polychrest Harvester
• Java-based Web harvesting agent
• Analyzes URI references in HTML markup
to determine object type and extent
• Currently analyzes following elements:
<A>
<AREA>
<BASE>
<BODY>
<FRAME>
<HEAD>
<IFRAME>
<IMG>
<INPUT>
<LINK>
<SCRIPT>
URI Analysis
• Two stages:
(1) determine object type
(2) filter on network location (if applicable)
• Examples (Sample IP: 132.174.1.5):
<A HREF="http://www.oclc.org/page.htm"> YES
<A HREF="http://www.microsoft.com"> NO
<IMG SRC="oclc.gif"> YES
<IMG SRC="http://www.w3.org/w3.gif"> YES
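One reading of the examples is that embedded resources count as part of the page wherever they are hosted, while links to other pages are kept only if they stay at the sampled IP. A sketch under that assumption follows. The tag groupings, the function `should_harvest`, and the `resolve` hook are hypothetical, not Polychrest's actual logic.

```python
from urllib.parse import urljoin, urlparse

# Assumed split of elements into the two object types. The slide only shows
# <A> (link to another page) and <IMG> (resource embedded in the page).
PAGE_LINK_TAGS = {"a", "area"}
EMBEDDED_TAGS = {"img", "script"}


def should_harvest(tag, uri, base_url, sample_ip, resolve):
    """Return True if the referenced object is harvested for the sampled site.

    `resolve` maps a host name to an IP address string (hypothetical hook;
    socket.gethostbyname could be passed in for live use).
    """
    tag = tag.lower()
    # Stage 1: determine object type from the element.
    if tag in EMBEDDED_TAGS:
        return True  # part of the page, so no location filter applies
    if tag not in PAGE_LINK_TAGS:
        return False  # elements outside this sketch are ignored
    # Stage 2: page links are kept only if they stay at the sampled IP.
    host = urlparse(urljoin(base_url, uri)).hostname
    return host is not None and resolve(host) == sample_ip
```

With a resolver that maps www.oclc.org to 132.174.1.5 (an assumption drawn from the slide's example), the four references above evaluate to YES, NO, YES, YES.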
Harvesting
• Harvesting of a Web site is initiated immediately
after it is identified
• Polychrest understands Web object definitions
for resources, pages, and sites
• Web site extent determined by:
– breadth-first search, using home page as root
– follow internal Web page links only
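The breadth-first traversal can be sketched as below. The `internal_links` mapping is an in-memory stand-in, assumed here for illustration, for fetching each page and extracting its internal links.

```python
from collections import deque


def site_extent(home_page, internal_links):
    """Breadth-first traversal from the home page over internal links.

    `internal_links` maps a page URI to the internal page URIs it links to.
    Returns the pages in the order they are visited.
    """
    seen = {home_page}
    order = []
    queue = deque([home_page])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in internal_links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Pages that no chain of internal links reaches from the home page are never visited, which matches the "interlinked" condition in the Web site definition.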
Unique Web Sites
• Not uncommon for a single Web site to be
accessible from multiple IP addresses *
• Sites at different IPs, but with identical
content, are considered to be one logical site
(often identified with a single domain name)
• Creates bias in sample: greater probability of
these sites being selected than sites
associated with a single address
Filtering Rule
A harvested IP is only considered a “hit” if the
sample IP is the “lowest” among all IPs
associated with a given collection of Web
pages.
Example:
132.174.1.6, 132.174.1.5, 132.174.1.4 (lowest: 132.174.1.4)
How can we identify sites with multiple IPs?
De-Duping Tests
Domain-name-to-IP-address mapping:
– for sites with domain names
– resolve domain name to IP address; if sampled IP is
lowest among returned IP(s), OK
Example:
Sample IP: 207.46.130.149
Resolves to www.microsoft.com
www.microsoft.com resolves to:
207.46.131.137
207.46.131.30
207.46.130.149
207.46.130.45
207.46.130.14
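A sketch of the domain-name test, assuming addresses are compared numerically octet by octet. The helper names are illustrative. In the example above the sampled IP is not the lowest of the returned addresses (207.46.130.14 is), so under the rule as stated it would not count as a hit.

```python
import ipaddress
import socket


def lowest_ip(addresses):
    """Return the numerically lowest address in dotted-quad form."""
    return str(min(ipaddress.IPv4Address(a) for a in addresses))


def passes_dns_test(sampled_ip, resolved_ips):
    """True if the sampled IP is the lowest among the resolved addresses."""
    return lowest_ip(list(resolved_ips) + [sampled_ip]) == sampled_ip


def resolve_all(domain):
    """Look up every IPv4 address for `domain` (requires network access)."""
    _name, _aliases, addresses = socket.gethostbyname_ex(domain)
    return addresses
```

Comparing `IPv4Address` objects rather than strings avoids ordering mistakes such as treating 207.46.130.149 as lower than 207.46.130.45.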
De-Duping Tests … Continued
“Same-Octet” Test:
– Harvest home page from IP addresses with same
first three octets as sampled IP, but lower 4th octet
Example:
Sampled IP: 132.174.1.5
Lower 4th octet: 132.174.1.4, 132.174.1.3, 132.174.1.2, 132.174.1.1, 132.174.1.0
– If any home page harvested from a lower 4th octet
matches the home page from the sampled IP, the sampled
IP fails the filtering rule
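The same-octet test might look like the following sketch. It assumes "matches" means byte-identical home pages, which the slides do not specify, and `fetch_home_page` is a hypothetical hook standing in for the harvester.

```python
import hashlib


def lower_neighbors(sampled_ip):
    """List addresses sharing the first three octets with a lower 4th octet."""
    prefix, last = sampled_ip.rsplit(".", 1)
    return [f"{prefix}.{n}" for n in range(int(last) - 1, -1, -1)]


def fails_same_octet_test(sampled_ip, fetch_home_page):
    """True if a lower neighbor serves the same home page as the sampled IP.

    `fetch_home_page` returns the page bytes for an IP, or None if no
    page could be harvested.
    """
    target = fetch_home_page(sampled_ip)
    if target is None:
        return False
    digest = hashlib.sha256(target).digest()
    for ip in lower_neighbors(sampled_ip):
        page = fetch_home_page(ip)
        if page is not None and hashlib.sha256(page).digest() == digest:
            return True
    return False
```

For the sampled IP 132.174.1.5, `lower_neighbors` yields the five addresses from 132.174.1.4 down to 132.174.1.0.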
De-Duping Tests … Continued
• Intra-Sample Duplicate Detection:
– Identify sites within sample with identical content
– Retain only site with lowest IP address
• Unique Web Site:
– Defined as any site identified in the sample that
passes all three of the duplicate detection tests
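Intra-sample duplicate detection can be sketched as grouping sites by a content fingerprint and keeping the lowest address in each group. How content is fingerprinted is an assumption here (for example, a hash of the harvested site); the function name is illustrative.

```python
import ipaddress
from collections import defaultdict


def dedupe_sample(site_fingerprints):
    """Keep one IP per group of identical sites: the numerically lowest.

    `site_fingerprints` maps each sampled IP to a content fingerprint.
    """
    groups = defaultdict(list)
    for ip, fingerprint in site_fingerprints.items():
        groups[fingerprint].append(ip)
    kept = [min(ips, key=ipaddress.IPv4Address) for ips in groups.values()]
    return sorted(kept, key=ipaddress.IPv4Address)
```

If several sampled addresses all serve one site, only the lowest of them is retained and the rest are dropped from the sample.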
Synopsis: 1999 Sample
IP Addresses: 4,294,967,296
Sampled IPs (0.1%): 4,294,967
Connect to Port 80 for each sampled IP address
– Web site identified if HTTP response code = 200
Sampled Web Sites: 4,882
– hit rate of about 1 out of a thousand
Apply De-Duping Tests
Sampled Unique Sites: 3,649
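The arithmetic behind these figures can be checked with a short calculation. Scaling the unique-site count by the inverse of the sampling fraction is an assumption about how the sample is extrapolated; the result comes out near 3.6 million, in line with the 1999 site count reported in this presentation.

```python
ADDRESS_SPACE = 4_294_967_296
SAMPLED_IPS = 4_294_967
SAMPLED_SITES = 4_882
UNIQUE_SITES = 3_649

# Fraction of the address space that was probed (about 0.1%).
sampling_fraction = SAMPLED_IPS / ADDRESS_SPACE
# Share of probed addresses that answered with HTTP 200.
hit_rate = SAMPLED_SITES / SAMPLED_IPS
# Assumed extrapolation: scale the sample count up to the full space.
estimated_unique_sites = UNIQUE_SITES / sampling_fraction

print(f"sampling fraction: {sampling_fraction:.4%}")
print(f"hit rate: {hit_rate * 1000:.2f} per thousand sampled IPs")
print(f"estimated unique sites: {estimated_unique_sites / 1e6:.1f} million")
```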
Network Security
• Attempts to connect to random IP addresses have been
viewed suspiciously by network administrators
– like calling unlisted telephone numbers
• Inquiries have been made about our activity (mostly
cordial) *
• For June 2000 Web sample:
– assign separate IP and domain name to machine running
harvester
– run Web server with page explaining our project and supplying
contact information
Data Storage
• Polychrest stores data collected from a
single Web site in one SGML-format
archive file
• Software splits archive file into separate
files for manual viewing; links are localized
– Harvested Site Example: 192.48.117.67.dmp *
• For long-term storage, converting SGML
into Internet Archive format
Site Growth (1,000)
[Bar chart: number of Web sites by year]
1997: 1.2 million
1998: 2 million
1999: 3.6 million
Web Site Types
• Provisional site: serves only temporary or
transitional pages (server templates, “under
construction” pages, “site has moved” pages)
• Private site: prohibits access explicitly
(password, IP filter, firewall) or implicitly (site
intended to be used by specific users)
• Public site: provides unrestricted access to
some portion of the site containing
meaningful content
Types of Sites (1,000): 1999
[Bar chart: number of Web sites by type]
Provisional: 1 million
Private: 400,000
Public: 2.2 million
Accomplishments
• Well-tested sampling methodology
• Data collection and analysis tools
• Innovative data analysis
• Only consistent time-series (1998 to present)
• Data available on request for scholarly use
Further Information...
OCLC Online Computer Library Center, Inc.:
http://www.oclc.org/
Web Characterization Project:
http://www.oclc.org/oclc/research/projects/webstats/
E-mail:
[email protected]
Web Publishing Patterns
• Self-publishing: Web publishing patterns do not follow
print model. Vast majority of Web sites exist to promote
and disseminate information about site’s publisher. Unlike
traditional print publishers, only a minority of Web
publishers ‘sell’ information
• Volatility: The Web is very volatile—less than half of the
Web sites in the 1998 sample still existed when the 1999
sample was collected. Pages are even more volatile
• Inaccessible: Less than half the Web sites have been
indexed by the major search engines, and an even lower
proportion of the pages has been indexed
Emergence of Dark Matter
• Dynamically generated information,
usually in response to a query
• Inaccessible to harvesters
• Cannot be indexed *
• Dark information appears to be more
common in the latest sample
Dark Matter Example
Site with Multiple IP Addresses
194.66.97.202, 194.66.99.88, 194.66.102.59,
194.66.110.112, 194.66.122.251, 194.66.123.63
These six IP addresses from the sample produced:
Example Responses (Edited)
“For the past two weeks or so, a host registered to you, has been sending network-scanning-like activity to port 80 of seemingly random IP addresses in our address space. I’m not sure the purpose of this activity but it appears to be in error. It appears innocuous enough; figured it would’ve stopped on its own by now.”

“We have noticed that an oclc server has been regularly checking a machine in our domain. Can you tell us why this server is interested in our little purple SGI?”

“[Our] server has no restrictions on access, but as far as I know, there are no links to it on any other web sites or search engines, and we have told no one but our development partners about it. Therefore, I was surprised when I found [oclc] in the server's log files many times over the last several months. So...can you identify this user, how they found out about our server, and what their intentions are? If it is a user, we'd appreciate knowing who it is.”