Web Measurement
Chapter 7, Section 7.3
Hessam Mirsadeghi
ECE Department, University of Tehran
Fall 2009
Outline
Web measurement motivation
Properties of interest
Challenges of web measurement
Web measurement tools
State of the art
Web properties
Web traffic data gathering and analysis
Web performance
Web applications
University of Tehran
Web Measurement
2
Motivation
The single most popular Internet application. Measurement
can be very useful.
The single largest application studied in Internet
measurement
75% of Internet traffic in the Web's first decade of existence
Around a billion web users
Properties of Interest
Web Measurement Properties
Web is at the most-visible level for users
Some of the properties are decomposable into components at
other layers of protocol stack
Web latency
DNS, TCP, HTTP
Web server delay
Client-side rendering
Web Measurement Properties (cont’d)
High-Level Characterization
Measuring fraction of web traffic
Measuring the use of HTTP protocol
Considerable traffic over HTTP while the clients and servers
are p2p nodes
Here we consider web traffic that involves web clients
communicating with a web server
High-Level Characterization (cont’d)
Knowledge of entities involved in web transactions
Clients, proxies, servers
Measuring the count and growth of web entities
Providing insight on how the web has evolved and is being used
e.g. number of clients behind a proxy provides insights on the extent of
caching
Location
Identifying where clients and proxies are present can
help content providers move resources closer to them
Location data can help businesses tailor content and
manner of delivery, and consider alternate
architectural improvements in placement of services.
Network and physical location
Configuration
Different server configurations impact
performance
Client and proxy configurations
Protocol variants supported
Compliance with protocol specification
Client connectivity
User Workload Models
How resources are accessed within a web site
reconfiguring the web site
modifying the resources
Alternatives for delivery of popular resources
Constructing models for “think-time” of users
Help in dealing with the new classes of users
Modeling novel phenomena such as flash crowds and attacks
Traffic Properties
Reduction of redundant transfers and sudden surges
Caching the resources
Cacheability of resources, deployment and use of
caches, performance of caches
Handling circumstances like flash crowds
Application Demands
Better understanding of the interaction between the
application and transport-level protocols
Improvements in the protocols
Reducing time-to-glass
The actual flow of a web transaction from the user
click to displaying data
Web Performance
Dominating much of the web measurement work
Popularity of a web site is highly dependent on its
performance
Finding ways to reduce delays
Sources of slowdowns
Challenges of web measurement
Challenges to Measurement
Application-level nature
Dependence on multiple protocols
DNS, TCP, HTTP
Large sets of entities with varying configurations
Equally diverse user population
Challenges to Measurement (cont’d)
Hidden data
Hidden layers
Hidden entities
Hidden Data
Much of the traffic is intra-net and inaccessible.
Access to remote server data, even old logs, is often
unavailable.
From the server end, information about the clients (e.g.
connection bandwidth) is obscured.
New pages are constantly added, old ones removed or
modified.
Hidden Data (cont’d)
Access information for web pages is not accessible.
TCP configuration parameters significantly impact
performance and cannot be remotely ascertained
Tools like TBIT for testing the impact of TCP variants like
Reno, Tahoe, or Vegas
Hidden Layers
Protocol and network layers are harder to measure.
Requires both deep knowledge of the network protocols and an
understanding of the precise interactions between the different
network protocols
The number of end clients is unknown due to proxies.
Requests may be redirected at different layers of the protocol
to different servers.
Redirections can happen at DNS, TCP, or HTTP level.
Hidden Layers (cont’d)
[Figure: a client requesting Index.html from a server, with embedded objects (Foo1.jpg, Foo2.jpg, Foo3.jpg, ad1, ad2, ad3) served by CDN Servers 1 and 2 and Ad Servers 1 to 3]
Hidden Entities
Proxies, HTTP and TCP redirectors
Transparent interception proxies return results from a cache.
Different behavior of switches for web-related and non-web-related traffic
Lack of predictability due to multiple hidden entities at various
layers of protocol stack.
Web Measurement Tools
Tools: Estimation of Web Traffic
Since the start of the 21st century, peer-to-peer traffic has taken the lead in terms of
number of bytes
Web still remains the number one application in terms of
active users
Almost 1 billion Internet users, a vast majority of whom use
the web
Tools: Sampling & DNS
Netflow: traffic to the HTTP port (80)
DNS traces to see which IP addresses are looked up
Lookup counts for well-known web servers are likely to be high
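A minimal sketch of the port-based estimate, assuming flow records have already been exported and reduced to hypothetical (destination port, byte count) pairs. It is not a Netflow parser, and a port-80 count both misses web traffic on other ports and includes non-web traffic carried over HTTP.

```python
from typing import Iterable, Tuple

# Hypothetical flow record: (destination_port, byte_count).
Flow = Tuple[int, int]

def web_byte_fraction(flows: Iterable[Flow], web_ports=(80,)) -> float:
    """Return the fraction of bytes sent to the given web ports."""
    total = 0
    web = 0
    for port, nbytes in flows:
        total += nbytes
        if port in web_ports:
            web += nbytes
    # An empty trace has no traffic, so report a fraction of zero.
    return web / total if total else 0.0

# Made-up sample: one HTTP flow, one DNS flow, one flow on another port.
sample_flows = [(80, 6000), (53, 500), (6881, 3500)]
print(web_byte_fraction(sample_flows))
```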
Tools: Server Logs
Number of requests and clients are logged in web server logs
Web log analyzers for generating statistics
Presence of obscured data
Proxies
Inter-arrival time of requests
Range and diversity of resources requested
Crawlers and Spiders
Disproportionate number of requests from one or a few IP addresses
Anonymizers
Caches
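One way to spot the "disproportionate number of requests" signal in a server log is to count requests per client address. The sketch below assumes the log has already been parsed into a list of client IPs; the 20% share threshold is an arbitrary illustrative value, not a figure from the slides.

```python
from collections import Counter

def heavy_hitters(client_ips, share_threshold=0.2):
    """Return the IPs whose share of all requests exceeds share_threshold."""
    counts = Counter(client_ips)
    total = sum(counts.values())
    # With an empty log the comprehension never runs, so no division by zero.
    return sorted(ip for ip, n in counts.items() if n / total > share_threshold)

# Made-up log: one address issues 8 of 10 requests.
log_ips = ["192.0.2.1"] * 8 + ["192.0.2.2", "192.0.2.3"]
print(heavy_hitters(log_ips))
```

An address flagged this way may be a crawler, a proxy, or an anonymizer; the count alone cannot tell them apart.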
Tools: Surveys
Estimating the number of web servers (Netcraft)
Important metric: number and identity of popular
web servers
Business, technical, and social implications
Tools: Locating Entities
An increasingly difficult problem
Server resources are distributed geographically
Large number of resources
Increase availability
Being closer to clients
Several businesses can use the same server farm to increase
utilization.
Locating clients: simple ‘traceroute’, techniques such as
network aware clustering
Tools: Structural View
The linkage structure on web pages
HITS algorithm for identifying hubs and authorities
Hub: a page having multiple high-value links about a topic
Authority: the page having high-quality content on a given topic
Web pages as nodes and links as edges in a graph model
Page rankings and Improvement of web searching
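The hub and authority idea can be sketched as a short iteration over the link graph. This is a simplified illustration on an in-memory graph (a dict mapping each page to the pages it links to), with a fixed iteration count chosen for convenience; it is not a production ranking implementation.

```python
import math

def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Made-up graph: pages "a" and "b" both link to "c".
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
print(max(auth, key=auth.get), max(hub, key=hub.get))
```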
Tools: Web Searching & Crawling
One of the most important www applications
Components:
Crawler: traverses the accessible part of the web to fetch
web pages
Indexer: indexes the crawled pages
Search tool: accepts queries and returns pointers to the
matching pages
Tools: Web Performance
Measuring a particular web site’s latency and availability
from diverse client perspectives.
Examining different latency components such as DNS,
TCP or HTTP differences, and CDNs
Global measurements of the web to examine protocol
compliance and ensure reduction of outages.
Tools: Web Performance (cont’d)
A variety of companies offer such services:
Keynote, Akamai, eValid Test Suite, etc.
A common technique: a distributed set of monitors around the
world sending periodic requests to web sites.
Tools: Network Aware Clustering
An effective technique to group IP addresses into clusters
quickly and automatically
Non-overlapping clusters
Being close topologically
Common administrative control
Clustering by use of BGP routing table snapshots and longest
prefix matching.
Same prefix → same cluster
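The "same prefix, same cluster" rule can be sketched with Python's standard ipaddress module. The prefix list below is a made-up stand-in for a BGP routing table snapshot, and the linear scan is chosen for brevity; a real tool would more likely use a trie for longest-prefix matching.

```python
import ipaddress

def cluster_of(ip, prefixes):
    """Return the longest prefix in `prefixes` that contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        # Keep the most specific (longest) matching prefix seen so far.
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return str(best) if best is not None else None

# Hypothetical prefixes standing in for a routing table snapshot.
table = ["192.0.2.0/24", "192.0.2.128/25", "198.51.100.0/24"]
for client in ["192.0.2.200", "192.0.2.5", "203.0.113.1"]:
    print(client, cluster_of(client, table))
```

Clients that map to the same prefix land in the same cluster; addresses with no matching prefix stay unclustered.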
Tools: Network Aware Clustering (cont’d)
BGP routing table snapshot
Tools: Network Aware Clustering (cont’d)
Application
Used to group client IP addresses in web server
logs
Recognizing proxies and spiders
Better content access prediction
etc
Tools: Network Aware Clustering (cont’d)
[Figure: total server log, with labels "Client containing spider" and "Cluster containing proxy"]
Tools: Handling Mobile Clients (cont’d)
Figure 3. Document Browsing with Summarizer on WAP
Christopher C. Yang and Fu Lee Wang. Fractal Summarization
for Mobile Devices to Access Large Documents on the Web. In
Proceedings of the World Wide Web Conference, May 2003.
Tools: Handling Mobile Clients (cont’d)
Continuous growth in mobile web
Wireless network delays
Tailored content
Similar methods:
Server logs of mobile content providers
Lab experiments (e.g. emulate mobile devices, induce packet loss)
Wide-area experiments
State of the Art
State of the Art
Four main parts of web measurement:
Web properties
Traffic gathering and analysis
Performance issues
Applications
Web Properties: High Level
Reduction in web traffic estimation
Unreachable data
Firewalls and other barriers due to attacks
Use of internal web sites
The shift from Web to P2P
Around a million new sites a month (Netcraft)
Web Properties: High Level (cont’d)
60 million web sites in fall 2004
A vast fraction have little or no traffic compared to the top few hundred.
Apache and Microsoft server implementations
together have 90% of the market (68% for Apache)
Web Properties: High Level (cont’d)
[Figure: Netcraft survey (news.netcraft.com)]
Web Properties: High Level (cont’d)
[Figure: Netcraft survey (news.netcraft.com)]
Web Properties: Location
A steadily growing number of users are in Asian
countries such as China and India.
The fraction of web content from the US and Europe
is falling.
Implications on where servers will be mirrored and
supported languages.
Web Properties: Configuration
Popular sites use a variety of techniques to improve
server performance:
Distribute servers geographically (e.g. 3 world cup servers
in the U.S., 1 in France)
Redirecting requests to the least loaded server in a farm.
Caching frequently requested resources
Web Properties: User Workload Models
We measure user workload by looking at:
the duration of HTTP connections
request and response sizes
number of unique IP addresses contacting a given Web site
number and frequency of accesses of individual resources at
a given Web site
etc.
Web Properties: Access Dynamics
Web page access has been experimentally verified to
follow Zipf-like distribution.
Zipf’s law:
Probability of a request to the ith most popular page is
proportional to 1/i
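As a small worked example of Zipf's law, the sketch below normalizes weights of 1/i over n pages into request probabilities. The alpha exponent is an added parameter to cover the "Zipf-like" case; alpha = 1 gives the plain 1/i form stated above.

```python
def zipf_probabilities(n, alpha=1.0):
    """Request probability for ranks 1..n, proportional to 1 / i**alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# With alpha = 1, the most popular page is requested twice as often
# as the second most popular and four times as often as the fourth.
probs = zipf_probabilities(4)
print(probs[0] / probs[1], probs[0] / probs[3])
```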
State of the Art
Four main parts of web measurement:
Web properties
Traffic gathering and analysis
Performance issues
Applications
Web Traffic: Critical Path Analysis
Constructing critical path to understand where delays
are introduced in web requests
Packet propagation
Network variation (e.g. queuing at routers)
Packet loss
Delay at server and client
Web Traffic: Critical Path Analysis (cont’d)
Only some of the components are responsible for
overall response time
Importance of activities on the critical path
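The critical path is the longest chain of dependent activities, so only delays on that chain change the overall response time. The sketch below finds it on a small dependency graph; the activity names and timings (in milliseconds) are invented for illustration, and the graph is assumed to be acyclic.

```python
from functools import lru_cache

def critical_path(durations, deps):
    """Return (total_time, path) for the longest dependency chain.

    durations: activity -> time taken
    deps: activity -> list of activities that must finish first
    """
    @lru_cache(maxsize=None)
    def finish(act):
        # Longest (time, path) among prerequisites, then add this activity.
        best_time, best_path = 0.0, ()
        for pre in deps.get(act, ()):
            t, path = finish(pre)
            if t > best_time:
                best_time, best_path = t, path
        return best_time + durations[act], best_path + (act,)

    return max((finish(a) for a in durations), key=lambda r: r[0])

# Hypothetical web transaction.
durations = {"dns": 30, "tcp": 50, "request": 40, "server": 120, "image": 60}
deps = {"tcp": ["dns"], "request": ["tcp"],
        "server": ["request"], "image": ["request"]}
print(critical_path(durations, deps))
```

In this made-up example the image fetch is off the critical path, so speeding it up would not reduce the total time.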
Web Traffic: Software Aid
httperf:
Sends HTTP requests and processes responses
Simulates workload
Gathers statistics
Supports HTTP/1.1
Freely available in source code
Web Traffic: Software Aid (cont’d)
wget
Fetches a large number of pages rooted at a particular node.
Can fetch all the pages up to a certain “level” according to
links
Mercator (a personalized crawler)
Uses a seed page and then does breadth-first search on the
links to find pages.
Higher weight for pages having more incoming links.
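The seed-plus-breadth-first traversal can be sketched over an in-memory link graph. The `links` dict stands in for fetching and parsing real pages, and `max_pages` is an added safety limit; the sketch does not implement the incoming-link weighting and is not Mercator's actual code.

```python
from collections import deque

def bfs_crawl(seed, links, max_pages=100):
    """Visit pages breadth-first from `seed`; `links` maps page -> out-links."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for nxt in links.get(page, ()):
            # Enqueue each page only once, even if many pages link to it.
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

# Made-up link graph with a cycle back to the seed.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
print(bfs_crawl("a", links))
```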
State of the Art
Four main parts of web measurement:
Web properties
Traffic gathering and analysis
Performance issues
Applications
Web Performance: Intro
User-perceived latency is a key factor because it
affects the popularity of a site.
Beyond a certain delay, user cancellations of the page
increase sharply.
Web Performance: CDNs
Busy servers outsource delivery of some of their pages
CDNs combine the workload of several sites into a single
provider.
Mirroring the CDNs to be located near clients.
DNS-based redirection
DNS overhead is a serious bottleneck in some CDNs
Web Performance: CDNs (cont’d)
• Motivation:
• More hops between client and Web server => more
congestion!
• Same data flowing repeatedly over links between
clients and Web server
[Figure: clients C1, C2, C3 and C4 and web server S, connected through IP routers]
Web Performance: CDNs (cont’d)
[Figure: caches. Web server www.cnn.com (new content: "WTC News!"), 1,000,000 other hosts sending requests, and user merlot.cis.udel.edu whose request is answered with old content at the ISP. Legend: congestion/bottleneck, caching proxy]
Web Performance: CDNs (cont’d)
• Caching problems:
• Caching proxies serve only their clients, not all users on
the Internet
• Content providers (say, Web servers) cannot rely on
existence and correct implementation of caching
proxies
• Accounting issues with caching proxies.
For instance, www.cnn.com needs to know the number
of hits to the webpage for advertisements displayed on
the webpage
Web Performance: CDNs (cont’d)
[Figure: web server www.cnn.com (new content: "WTC News!"), mirrors in WA, CA, MI, IL, MA, NY, FL and DE, 1,000,000 other users, and user merlot.cis.udel.edu whose request returns new content. Legend: distribution infrastructure, mirrors]
Web Performance: CDNs (cont’d)
• Overlay network to distribute content from origin servers to
users
• Avoids large amounts of same data repeatedly traversing
potentially congested links on the Internet
• Reduces Web server load
• Reduces user perceived latency
DNS-based Request Routing
Q: How does the Akamai DNS know which surrogate is closest?
[Figure: client merlot.cis.udel.edu (128.4.30.15) sends a DNS query for www.cnn.com through its local DNS server louie.udel.edu (128.4.4.12) to the Akamai DNS of the Akamai CDN; surrogates california.cnn.akamai.com and delaware.cnn.akamai.com (58.15.100.152 and 145.155.10.15); DNS response: A 145.155.10.15]
DNS-Based Request Routing (cont’d)
[Figure: DNS query and DNS response exchanged between client merlot.cis.udel.edu (128.4.30.15), the local DNS server louie.udel.edu (128.4.4.12) and the Akamai DNS; www.cnn.com and the Akamai CDN surrogates also shown]
DNS-Based Redirection
Problem:
The content server is optimized for the local name server,
not the actual client
Client may be far from name server
In a study, only 16% of the clients were in the same
network-aware cluster as the local DNS server
Total & Selective Redirection
1. Total redirection
Any request for origin server is redirected to CDN
Basically, CDN takes control of content provider’s DNS zone
Benefit: All requests are automatically redirected
Disadvantage: May send lots of traffic to CDN, hence expensive for the
content provider
2. Selective redirection
Content provider marks which objects are to be served from CDN
Typically, larger objects like images are selected
Refer to images as: <img src=http://cdn.com/foo/bar/img.gif>
Pro: Fine-grained control over what gets delivered
Con: Have to (manually) mark content for CDN
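The manual marking step in selective redirection amounts to rewriting image references so they point at the CDN. The sketch below does this with a regular expression for relative .gif and .jpg sources only; the CDN prefix is taken from the slide's example URL, and the rewrite rule itself is an assumption for illustration, not how any particular CDN tool works.

```python
import re

CDN_PREFIX = "http://cdn.com/foo/bar/"  # from the slide's example URL

# Matches <img src="..."> where the source is a relative .gif or .jpg path.
# Excluding ":" from the path leaves absolute URLs untouched.
IMG_SRC = re.compile(r'(<img\s+src=")([^":]+\.(?:gif|jpg))(")')

def rewrite_images(html, cdn_prefix=CDN_PREFIX):
    """Point relative image sources at the CDN host."""
    return IMG_SRC.sub(
        lambda m: m.group(1) + cdn_prefix + m.group(2) + m.group(3), html)

page = '<p>News</p><img src="img.gif"><a href="story.html">more</a>'
print(rewrite_images(page))
```

The base document and the link stay on the origin server; only the marked image would be fetched from the CDN.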
Total Redirection
[Figure: client, CDN surrogate server and origin server, with index.html and the embedded image1.gif and image2.gif]
Partial Redirection
[Figure: client, CDN surrogate server and origin server, with index.html and the embedded image1.gif and image2.gif]
Total vs. Selective Redirection
Total redirection has clearly superior performance
Selective redirection is typically slower than downloading
everything from the origin server
But origin server might be loaded…
Which redirection is more used?
Initially, selective redirection was used
These days, mainly total redirection
Web Performance: Client Connectivity
Finding clients’ connection quality
Delivering the most suitable version of content
Tailoring server’s policy
Sending just the base document
Using compression
Keep persistent connections open longer
Measure the inter-arrival time of requests to classify
clients.
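A rough sketch of classifying a client from request inter-arrival times, assuming the server has the timestamps (in seconds) of one client's requests. The one-second threshold and the "poor"/"good" labels are hypothetical placeholders; the slides do not give the actual classification rule.

```python
def classify_client(request_times, threshold=1.0):
    """Label a client from the mean gap between consecutive requests."""
    if len(request_times) < 2:
        return "unknown"  # a single request gives no inter-arrival time
    times = sorted(request_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    # Long gaps between requests suggest a slower connection.
    return "poor" if mean_gap > threshold else "good"

print(classify_client([0.0, 0.2, 0.5]))
print(classify_client([0.0, 2.0, 5.0]))
```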
Web Performance: Client Connectivity (cont’d)
Stability of client classification
Classifying new clients using network-aware
clustering
same cluster → same class
Classification works best for sites having a variety of
clients.
Web Performance: Client Connectivity (cont’d)
Server action conclusions:
- Compression gave consistently good results for poorer clients, but not for well-connected ones.
- Reducing the quality of objects only yielded benefits for a modem client.
- Bundling was effective when there was good connectivity, or poor connectivity with large latency.
- Persistent connections with serialized requests did not show significant improvement.
- Pipelining was only significant for clients with high throughput or RTT.
Balachander Krishnamurthy, Craig E. Wills, Yin Zhang, and Kashi Vishwanath. Design, Implementation, and Evaluation of a Client Characterization Driven Web Server. In Proceedings of the World Wide Web Conference, May 2003.
Web Performance: Protocol Compliance
A 16-month study used the httperf tool to test for HTTP
protocol compliance.
Absence of required headers (such as date)
Nearly half the servers did not implement range requests.
Inability to handle long URIs in a graceful manner.
The popular Apache server was most compliant, then
Microsoft’s IIS.
State of the Art
Four main parts of web measurement:
Web properties
Traffic gathering and analysis
Performance issues
Applications
Web Applications: Searching
In 1999, 200 million pages and 1.5 billion links were
examined.
The probability of a node having in-degree i is proportional to
1/i^x (x > 1).
Nodes with a large in-degree are considered “high rank”
Used frequently in search engines
Sites may use fake linkages to trick crawlers.
Web Applications: Searching (cont’d)
A four-part separation in web structure.
A central core
Two parts connected to the core
One part with no connection to the core
All the components have roughly equal number of pages!
Web Applications: Searching (cont’d)
Over 90% of web pages are reachable from each other when links are treated as undirected.
The probability that a directed path exists from one random page to another is
only about 0.25.
The well-connected component will remain connected even if
we remove nodes with large degrees (hubs).
Web Applications: Searching (cont’d)
Image resources change infrequently.
Many text documents change periodically.
Some studies have tried to model the rate of change of pages
as a Poisson process.
Some studies have examined the rate of change in different
domains (e.g. .com vs .org).
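Under a Poisson model with change rate λ, the probability that a page has changed at least once after time t is 1 - e^(-λt). The sketch below evaluates that expression; the rate and the time units are illustrative inputs, not measured values.

```python
import math

def prob_changed(rate, interval):
    """Probability of at least one change in `interval` for a Poisson process.

    rate: average number of changes per unit time
    interval: elapsed time, in the same unit
    """
    return 1.0 - math.exp(-rate * interval)

# A page that changes once a week on average, revisited after 1 and 7 days.
print(prob_changed(1 / 7, 1), prob_changed(1 / 7, 7))
```

A crawler could use such an estimate to decide how soon a page is worth refetching.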
Web Applications: Searching (cont’d)
150 web sites were studied over a 7-month period.
Incoming links of the pages were computed
Rich getting richer!
Pages in the bottom 60% ranking received no additional
links.
Search engines need to change how they rank pages.
Web Applications: Searching (cont’d)
A study examined several subsets of pages.
A significant fraction of links were dead, with impact on
crawling and page ranking.
Over 50% dead links in some cases.
Faster crawling and more useful ranking by avoiding dead
links.
Web Applications: Flash Crowds
Large number of legitimate and wanted requests (unlike DoS
attacks in which the requests are not wanted)
During flash crowds
Same average number of requests per client
No increase in the number of client clusters
Between 60% and 82% of the resources are accessed only at this time.
Less than 10% of the resources are responsible for 90% of the requests.
DoS attackers have no way of knowing the typical distribution
of client clusters.
Many new clusters emerge.
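The cluster-based distinction can be sketched by comparing the client clusters seen during a surge with those seen in normal traffic. The 50% limit on previously unseen clusters is an arbitrary illustrative threshold, and the cluster identifiers are assumed to come from a separate clustering step such as network-aware clustering.

```python
def classify_surge(baseline_clusters, surge_clusters, new_cluster_limit=0.5):
    """Label a surge by the share of its client clusters not seen before."""
    surge = set(surge_clusters)
    if not surge:
        return "no traffic"
    new_share = len(surge - set(baseline_clusters)) / len(surge)
    # Flash crowds come mostly from known clusters; attacks bring many new ones.
    if new_share > new_cluster_limit:
        return "possible DoS attack"
    return "flash crowd"

known = {"192.0.2.0/24", "198.51.100.0/24"}
print(classify_surge(known, ["192.0.2.0/24", "198.51.100.0/24"]))
print(classify_surge(known, ["192.0.2.0/24", "203.0.113.0/24",
                             "203.0.114.0/24", "203.0.115.0/24"]))
```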
Flash Crowd vs DoS Attack
Flash crowd: increase in number of clients, fixed number of clusters
DoS attack: increase in number of both clients and clusters
Web Applications: Blogs
Providing early warning of flash crowds
Different rate of change compared to traditional web pages
Having many references, the same as popular web sites
Significant fraction of links going to other blogs
Having significantly more self-references