Measuring the Web

What?
• Use, size
– Of entire Web, of sites (popularity), of pages
– Growth thereof
• Technologies in use (servers, media types)
• Properties
– Traffic (periodicity, self-similarity, timeouts, ..)
– Page-change properties (frequency, amount, ..)
– Links (self-similarity, ..)
Why?
• Improve Web technologies
• Improve sites
• Improve search
• Justify prices
• Science
How?
• Surveys
• Instrumentation
– Proxy/router logs, server logs, ..
• Sampling and statistical inference
A few survey services
• Nielsen//NetRatings
• Pew Internet Project
• DLF/CLIR study
Some survey results
• NetRatings (Dec 2002)
– 168M US “Home” Internet Users
– Use Web 7 hours/week to view 17 sites
• Pew Study (July 2002)
– 111M US Internet Users
– 33M of them use a search engine once/day
Simple sampling
• Netcraft server survey
– Generated by crawling and URL submission
– 35M sites in 2002 (Archive has 50M)
• OCLC Host survey
– Generate random IP addrs and look for hosts
– 9,040,000 IP addrs with web servers in 2002
• 8,712,000 Unique Web sites
OCLC technique
• Generate 0.1% * 2^32 random IP numbers
• Screen out “bad ones”
– Private addresses, IANA lists
• HTTP to port 80 of remainder
• Multiply number of responses by 1000
• Use heuristics to eliminate “duplicates” (sampling loop sketched below)
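A minimal sketch of this style of IP sampling in Python, following the slide's steps; the reserved-address checks, the two-second timeout, and the function names are illustrative assumptions, not OCLC's actual tooling, and a real survey would probe addresses in parallel rather than one at a time.

```python
import ipaddress
import random
import socket

TOTAL_IPV4 = 2 ** 32
SCALE = 1000                       # slide: responses are multiplied by 1000
SAMPLE_SIZE = TOTAL_IPV4 // SCALE  # i.e., a ~0.1% sample of the address space

def is_screened_out(ip_int):
    """Drop 'bad' addresses: private, loopback, multicast, IANA-reserved."""
    addr = ipaddress.IPv4Address(ip_int)
    return (addr.is_private or addr.is_loopback or addr.is_multicast
            or addr.is_reserved or addr.is_unspecified)

def answers_on_port_80(ip_int, timeout=2.0):
    """HTTP probe: a successful TCP connect to port 80 counts as a hit."""
    try:
        with socket.create_connection((str(ipaddress.IPv4Address(ip_int)), 80),
                                      timeout=timeout):
            return True
    except OSError:
        return False

def estimate_web_hosts():
    hits = 0
    for _ in range(SAMPLE_SIZE):
        ip = random.randrange(TOTAL_IPV4)
        if not is_screened_out(ip) and answers_on_port_80(ip):
            hits += 1
    # Scale the hit count back up to the full address space; duplicate
    # elimination (aliases, mirrors) would still be applied afterwards.
    return hits * SCALE
```

Treating any successful connect on port 80 as a web server is a simplification; the survey also needs the duplicate-elimination heuristics from the last bullet before reporting unique sites.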
IP sampling and virtual hosting
• Netcraft says 1/2 of domain names virtually hosted on 100K IP numbers
• In 2000, OCLC said 3M IP addrs serving data, versus 3.4M IP addrs found by Netcraft
Interlude: “Size of Web”
• Size in (virtual) hosts, probably 40-60M
– Based on Netcraft, OCLC, and Archive data
• Size in pages: infinite
– People are obsessed with providing page estimates, but this is a silly thing to do!
Heavy-tailed distributions
• Zipf, Pareto, power laws, lognormal
• Chic to find such things (Web, physics, bio)
– …and then postulate “generative models”
• Statistics are squirrelly
– For example, averages can be misleading (simulated below)
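A small simulation of that point, assuming (purely for illustration) Pareto-distributed values drawn with Python's standard library; the shape parameter and sample sizes are arbitrary choices.

```python
import random

random.seed(0)

def pareto_sample(alpha, n):
    """Draw n values from a Pareto(alpha) distribution (support starts at 1)."""
    return [random.paretovariate(alpha) for _ in range(n)]

# With a heavy tail (alpha just above 1), the sample mean is dominated by a few
# huge observations and swings from sample to sample, while the median is stable.
for trial in range(5):
    xs = sorted(pareto_sample(alpha=1.1, n=10_000))
    mean = sum(xs) / len(xs)
    median = xs[len(xs) // 2]
    print(f"trial {trial}: mean = {mean:8.1f}   median = {median:5.2f}")
```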
Heavy-tails on the Web
• Host and page:
– Links (in and out; a rank-frequency fit is sketched after this slide)
– Sizes
– Popularity
– In page case, both inter- and intrasite
• Page-size-to-popularity (Zipfian)
• Page and user reading times
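As one way to eyeball such a distribution, the sketch below fits a Zipf exponent to rank-ordered in-link counts on a log-log scale; the counts are made-up placeholder numbers, not measurements from any crawl.

```python
import math

# Hypothetical in-link counts for the 12 most-linked pages of a site;
# in practice these would come from crawl data.
inlink_counts = [5012, 1800, 950, 640, 410, 300, 220, 160, 120, 95, 70, 50]

# Under a Zipf-like law, count ≈ C / rank^s, so log(count) is roughly linear
# in log(rank); an ordinary least-squares fit of the slope recovers s.
xs = [math.log(rank) for rank in range(1, len(inlink_counts) + 1)]
ys = [math.log(count) for count in inlink_counts]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
print(f"fitted Zipf exponent s ≈ {-slope:.2f}")
```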
Tripping on heavy tails
• How not to compute size of Web:
– Use OCLC approach to find random hosts
– Crawl each of these to measure average size
– Multiply average size by host count
• Problem: heavy-tailed distribution of host size means that the host sample is biased towards smaller hosts (simulated below)
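A toy simulation of how this recipe goes wrong, assuming (for illustration only) Pareto-distributed host sizes; every constant here is arbitrary.

```python
import random

random.seed(1)

# Synthetic "Web": a million hosts whose page counts follow a heavy-tailed
# (Pareto) distribution, so a few giant hosts hold a large share of all pages.
NUM_HOSTS = 1_000_000
host_sizes = [int(random.paretovariate(1.2)) + 1 for _ in range(NUM_HOSTS)]
true_total = sum(host_sizes)

def naive_estimate(sample_size=500):
    """The slide's recipe: crawl a few hundred random hosts, average their
    sizes, and multiply by the total host count."""
    sample = random.sample(host_sizes, sample_size)
    return sum(sample) / sample_size * NUM_HOSTS

estimates = sorted(naive_estimate() for _ in range(200))
print(f"true total pages : {true_total:,}")
print(f"median estimate  : {int(estimates[100]):,}")   # usually below the truth
print(f"estimate range   : {int(estimates[0]):,} .. {int(estimates[-1]):,}")
```

Because the rare giant hosts are almost never in a small sample, most estimates fall short of the true total, while the occasional sample that does catch one overshoots wildly.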
Advanced inference
• Determine relative size of search engines A, B
• Pr[A∩B|A] = |A∩B|/|A|
• Pr[A∩B|B] = |A∩B|/|B|
• => |A|/|B| = Pr[A∩B|B] / Pr[A∩B|A] (estimator sketched below)
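A sketch of the resulting estimator; the four callables for sampling and membership testing are hypothetical stand-ins to be supplied by the experimenter, not any search engine's API.

```python
def estimate_relative_size(sample_from_a, in_b, sample_from_b, in_a, n=1000):
    """Estimate |A| / |B| from overlap frequencies:

        Pr[A∩B | A] ≈ fraction of URLs sampled from A that B also contains
        Pr[A∩B | B] ≈ fraction of URLs sampled from B that A also contains
        |A| / |B|   = Pr[A∩B | B] / Pr[A∩B | A]

    `sample_from_a` / `sample_from_b` draw a (near-)random URL from each
    engine's index; `in_a` / `in_b` test whether an engine has a URL.
    """
    overlap_given_a = sum(in_b(sample_from_a()) for _ in range(n)) / n
    overlap_given_b = sum(in_a(sample_from_b()) for _ in range(n)) / n
    return overlap_given_b / overlap_given_a
```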
Advanced inference
• Sample URLs from A (procedure sketched after this slide)
– Issue a random conjunctive query with <200 results, select a random result
• Test if present in B
– Query with 8 rarest words and look for result
• Estimate Pr[A∩B|A] as the fraction of URLs sampled from A that are also found in B
• URL sampling biased to long documents
• Biased by ranking and details of engine
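A sketch of the sampling and membership-test steps just described; `query_fn`, `words_fn`, and `rarity_fn` are hypothetical helpers standing in for real search-engine and document access.

```python
import random

def sample_url_from(engine, lexicon, query_fn, max_results=200):
    """Draw a near-random URL from `engine`: keep issuing random conjunctive
    queries (here, pairs of lexicon words) until one returns fewer than
    `max_results` hits, then pick one hit uniformly at random.
    `query_fn(engine, query)` is a hypothetical helper returning a result list."""
    while True:
        query = " ".join(random.sample(lexicon, 2))
        results = query_fn(engine, query)
        if 0 < len(results) < max_results:
            return random.choice(results)

def present_in(engine, url, words_fn, rarity_fn, query_fn, k=8):
    """Membership test: build a strong query from the k rarest words of the
    page at `url` and check whether `engine` returns that URL.
    `words_fn(url)` yields the page's words; `rarity_fn(word)` returns
    something like a document frequency, so ascending sort puts rare words first."""
    rare_words = sorted(words_fn(url), key=rarity_fn)[:k]
    results = query_fn(engine, " ".join(rare_words))
    return url in results
```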
Conclusions
• Measuring the Web is hard because it cannot be enumerated or even reliably sampled
• Statistical methods are impacted by biases that cannot be quantified
• Validation is not possible
• The problem is getting harder (e.g., link spam)
• Quantitative studies are fascinating and a good research problem