Web measurement

Web Measurement
Chapter 7, Section 7.3
Hessam Mirsadeghi
ECE Department, University of Tehran
Fall 2009
Outline

Web measurement motivation

Properties of interest

Challenges of web measurement

Web measurement tools

State of the art
  Web properties
  Web traffic data gathering and analysis
  Web performance
  Web applications
University of Tehran
Web Measurement
2
Motivation

The web is the single most popular Internet application, so measuring it can be very useful.

It is the single largest application studied in Internet measurement.

It accounted for about 75% of Internet traffic in its first decade of existence.

Around a billion web users
Properties of Interest
Web Measurement Properties

The web is the level most visible to users.

Some of its properties decompose into components at other layers of the protocol stack.

Example: web latency decomposes into
  DNS, TCP, HTTP
  Web server delay
  Client-side rendering
High-Level Characterization

Measuring fraction of web traffic

Measuring the use of HTTP protocol

Considerable traffic runs over HTTP even though the clients and servers are P2P nodes

Here we consider only web traffic in which web clients communicate with a web server
High-Level Characterization (cont’d)

Knowledge of entities involved in web transactions
  Clients, proxies, servers
Measuring the count and growth of web entities
  Provides insight into how the web has evolved and is being used
  e.g. the number of clients behind a proxy provides insight into the extent of caching
Location

Identifying where clients and proxies are present can
help content providers move resources closer to them

Location data can help businesses tailor content and manner of delivery,
and consider alternate architectural improvements in the placement of services.

Network and physical location
Configuration





Different server configurations affect performance
Client and proxy configurations
Protocol variants supported
Compliance with protocol specification
Client connectivity
User Workload Models

How resources are accessed within a web site
  Reconfiguring the web site
  Modifying the resources
  Alternatives for delivery of popular resources

Constructing models for “think-time” of users

Help in dealing with the new classes of users

Modeling novel phenomena such as flash crowds and attacks
Traffic Properties

Reduction of redundant transfers and sudden surges

Caching the resources

Cacheability of resources, deployment and use of
caches, performance of caches

Handling circumstances like flash crowds
Application Demands

Better understanding of the interaction between the application and transport-level protocols
  Improvements in the protocols
  Reducing time-to-glass
The actual flow of a web transaction, from the user click to displaying data
Web Performance

Dominating much of the web measurement work

The popularity of a web site is highly dependent on its
performance

Finding ways to reduce delays

Sources of slowdowns
Challenges of web measurement
Challenges to Measurement

Application-level nature

Dependence on multiple protocols

DNS, TCP, HTTP

Large sets of entities with varying configurations

Equally diverse user population
Challenges to Measurement (cont’d)

Hidden data

Hidden layers

Hidden entities
Hidden Data

Much of the traffic is intranet traffic and inaccessible.

Access to remote server data, even old logs, is often
unavailable.

From the server end, information about the clients (e.g.
connection bandwidth) is obscured.

New pages are constantly added, old ones removed or
modified.
Hidden Data (cont’d)

Access information for web pages is not available.

TCP configuration parameters significantly impact
performance and cannot be ascertained remotely

Tools like TBIT test the impact of TCP variants such as
Reno, Tahoe, or Vegas
Hidden Layers

Protocol and network layers are harder to measure.

Measuring them requires both deep knowledge of the network protocols and an
understanding of the precise interactions between them.

The number of end clients is unknown because of proxies.

Requests may be redirected at different layers of the protocol stack
to different servers.

Redirections can happen at the DNS, TCP, or HTTP level.
Hidden Layers (cont’d)
[Figure: a client, the origin server (Index.html), CDN servers 1 and 2, and ad servers 1 to 3. The objects shown for the page are its text, Foo1.jpg, Foo2.jpg, Foo3.jpg, and ad1, ad2, ad3.]
Hidden Entities

Proxies, HTTP and TCP redirectors

Transparent interception proxies return results from a cache.

Switches behave differently for web-related and non-web-related traffic

Lack of predictability due to multiple hidden entities at various
layers of the protocol stack.
Web Measurement Tools
Tools: Estimation of Web Traffic

From the start of the 21st century, peer-to-peer traffic took the lead in terms of
number of bytes

Web still remains the number one application in terms of
active users

Almost 1 billion Internet users, a vast majority of whom use
the web
Tools: Sampling & DNS

Netflow: traffic to the HTTP port (80)

DNS traces show which IP addresses are looked up

Lookup counts for well-known web servers are likely to be high
Tools: Server Logs



The number of requests and clients is logged in web server logs
Web log analyzers for generating statistics
Presence of obscured data
  Proxies
    Inter-arrival time of requests
    Range and diversity of resources requested
  Crawlers and spiders
    Disproportionate number of requests from one or a few IP addresses
  Anonymizers
  Caches
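
The crawler hint above (a disproportionate number of requests from one or a few IP addresses) can be sketched in a few lines. This is only an illustration: the log format (client IP as the first field), the sample lines, and the 50% share threshold are assumptions, not part of the original slides.

```python
from collections import Counter


def requests_per_client(log_lines):
    """Count requests per client, assuming the client IP is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def suspected_crawlers(log_lines, share_threshold=0.5):
    """Return the IPs whose share of all requests exceeds share_threshold."""
    counts = requests_per_client(log_lines)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(ip for ip, n in counts.items() if n / total > share_threshold)


# Hypothetical log lines, made up for this sketch.
sample_log = [
    '10.0.0.1 - - [01/Oct/2009:10:00:00 +0330] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.9 - - [01/Oct/2009:10:00:01 +0330] "GET /a.html HTTP/1.1" 200 300',
    '10.0.0.9 - - [01/Oct/2009:10:00:01 +0330] "GET /b.html HTTP/1.1" 200 310',
    '10.0.0.9 - - [01/Oct/2009:10:00:02 +0330] "GET /c.html HTTP/1.1" 200 290',
    '10.0.0.2 - - [01/Oct/2009:10:00:03 +0330] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.9 - - [01/Oct/2009:10:00:03 +0330] "GET /d.html HTTP/1.1" 200 305',
]

print(suspected_crawlers(sample_log))
```

A single IP can also be a proxy fronting many users, so a real analysis would combine this count with the other hints on the slide (inter-arrival times, diversity of resources requested).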
Tools: Surveys

Estimating the number of web servers (Netcraft)

Important metric: number and identity of popular
web servers

Business, technical, and social implications
Tools: Locating Entities

An increasingly difficult problem

Server resources are distributed geographically
  Large number of resources
  Increase availability
  Being closer to clients
Several businesses can use the same server farm to increase
utilization.

Locating clients: simple ‘traceroute’, or techniques such as
network-aware clustering
Tools: Structural View

The linkage structure on web pages

HITS algorithm for identifying hubs and authorities
  Hub: a page having multiple high-value links about a topic
  Authority: a page having high-quality content on a given topic

Web pages as nodes and links as edges in a graph model
Page rankings and improvement of web searching
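
A minimal sketch of the hub/authority iteration, assuming the usual formulation of HITS: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with normalization after each round. The toy graph and the page names are made up for illustration.

```python
import math


def normalize(scores):
    """Scale scores so that their squares sum to 1 (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: add up authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical five-page web graph.
toy_links = {
    "portal": ["a", "b", "c"],
    "blog": ["a", "b"],
    "a": [],
    "b": [],
    "c": ["a"],
}

hub, auth = hits(toy_links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this toy graph "portal" links to the most well-regarded pages and ends up as the top hub, while "a" receives the most links from good hubs and ends up as the top authority.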
Tools: Web Searching & Crawling

One of the most important WWW applications

Components:

Crawler: traverses the accessible part of the web to fetch
web pages

Indexer: indexes the crawled pages

Search tool: accepts queries and returns pointers to the
matching pages
Tools: Web Performance (cont’d)

Measuring a particular web site’s latency and availability
from diverse client perspectives.

Examining different latency components such as DNS,
TCP or HTTP differences, and CDNs

Global measurements of the web to examine protocol
compliance and ensure reduction of outages.
Tools: Web Performance (cont’d)

A variety of companies offer such services:
  Keynote, Akamai, eValid Test Suite, etc.
A common technique: a distributed set of monitors around the
world sending periodic requests to web sites.
Tools: Network Aware Clustering

An effective technique to group IP addresses into clusters
quickly and automatically
  Non-overlapping clusters
  Topologically close
  Under common administrative control
Clustering uses BGP routing table snapshots and longest
prefix matching.
  Same prefix → same cluster
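
The "same prefix → same cluster" rule can be sketched with Python's standard ipaddress module. The prefix list below is a made-up stand-in for a BGP routing table snapshot, and the linear scan is only for clarity; a real implementation would likely use a trie for the longest prefix match.

```python
import ipaddress


def longest_prefix_match(ip, prefixes):
    """Return the most specific network in prefixes that contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net in prefixes:
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best


def cluster_clients(client_ips, bgp_prefixes):
    """Group client IPs by their longest matching prefix."""
    prefixes = [ipaddress.ip_network(p) for p in bgp_prefixes]
    clusters = {}
    for ip in client_ips:
        match = longest_prefix_match(ip, prefixes)
        key = str(match) if match is not None else None
        clusters.setdefault(key, []).append(ip)
    return clusters


# Hypothetical prefixes standing in for a BGP table snapshot.
snapshot = ["128.4.0.0/16", "128.4.30.0/24", "12.0.0.0/8"]
clients = ["128.4.30.15", "128.4.4.12", "12.1.2.3", "192.0.2.1"]

for prefix, members in cluster_clients(clients, snapshot).items():
    print(prefix, members)
```

Clients that match no prefix are collected under the key None so that they are not silently dropped.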
Tools: Network Aware Clustering (cont’d)

[Figure: BGP routing table snapshot]
Tools: Network Aware Clustering (cont’d)

Application

Used to group client IP addresses in web server
logs

Recognizing proxies and spiders

Better content access prediction

etc
Tools: Network Aware Clustering (cont’d)
[Figure: plot of the total server log, with labels marking the client containing a spider and the cluster containing a proxy]
Tools: Handling Mobile Clients (cont’d)
Figure 3. Document Browsing with Summarizer on WAP
Christopher C. Yang and Fu Lee Wang. Fractal Summarization
for Mobile Devices to Access Large Documents on the Web. In
Proceedings of the World Wide Web Conference, May 2003.
Tools: Handling Mobile Clients (cont’d)

Continued growth in the mobile web

Wireless network delays

Tailored content

Similar methods:

Server logs of mobile content providers

Lab experiments (e.g. emulating mobile devices, inducing packet loss)

Wide-area experiments
State of the Art
Four main parts of web measurement:
  Web properties
  Traffic gathering and analysis
  Performance issues
  Applications
Web Properties: High Level

Reduction in web traffic estimation
  Unreachable data
    Firewalls and other barriers due to attacks
    Use of internal web sites
  The shift from web to P2P
Around a million new sites a month (Netcraft)
Web Properties: High Level (cont’d)

60 million web sites in fall 2004
  A vast fraction have little or no traffic compared to the top few hundred.
Apache and Microsoft server implementations
together have 90% of the market (68% for Apache)
Web Properties: High Level (cont’d)
[Two figures from the Netcraft survey (news.netcraft.com)]
Web Properties: Location

A steadily growing number of users are in Asian
countries such as China and India.

The fraction of web content from the US and Europe
is falling.

This has implications for where servers will be mirrored and
which languages are supported.
Web Properties: Configuration

Popular sites use a variety of techniques to improve
server performance:

Distributing servers geographically (e.g. three World Cup servers
in the U.S., one in France)

Redirecting requests to the least loaded server in a farm.

Caching frequently requested resources
Web Properties: User Workload Models

We measure user workload by looking at:

the duration of HTTP connections

request and response sizes,

number of unique IP addresses contacting a given Web site

number and frequency of accesses of individual resources at
a given Web site

etc.
Web Properties: Access Dynamics

Web page access has been experimentally verified to
follow a Zipf-like distribution.

Zipf’s law:

The probability of a request to the i-th most popular page is
proportional to 1/i
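
To turn "proportional to 1/i" into actual probabilities for a site with N pages, the weights 1/i are divided by their sum, so P(i) = (1/i) / (1/1 + 1/2 + ... + 1/N). The sketch below does this; the exponent parameter alpha is an assumption added to cover the "Zipf-like" case, and alpha = 1 gives Zipf's law as stated above.

```python
def zipf_probabilities(n_pages, alpha=1.0):
    """Request probability for pages ranked 1..n_pages, with P(i) ~ 1 / i**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_pages + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = zipf_probabilities(4)
for rank, p in enumerate(probs, start=1):
    print(f"rank {rank}: {p:.3f}")
```

With alpha = 1, the most popular page is requested twice as often as the second and three times as often as the third, whatever the value of N.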
Web Traffic: Critical Path Analysis

Constructing critical path to understand where delays
are introduced in web requests

Packet propagation

Network variation (e.g. queuing at routers)

Packet loss

Delay at server and client
Web Traffic: Critical Path Analysis (cont’d)

Only some of the components are responsible for
overall response time

Importance of activities on the critical path
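
One way to make the critical path concrete is to model a page load as activities with durations and dependencies, then find the longest chain. The activity names, durations (in milliseconds), and dependencies below are invented for illustration; they are not measurements from the slides.

```python
def critical_path(durations, depends_on):
    """Return (path, total_time) for the longest dependency chain.

    durations maps each activity to its duration; depends_on maps an
    activity to the activities that must finish before it can start.
    """
    finish = {}
    best_pred = {}

    def finish_time(activity):
        if activity not in finish:
            start = 0.0
            best_pred[activity] = None
            for pred in depends_on.get(activity, []):
                t = finish_time(pred)
                if t > start:
                    start, best_pred[activity] = t, pred
            finish[activity] = start + durations[activity]
        return finish[activity]

    last = max(durations, key=finish_time)
    total = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = best_pred[last]
    return list(reversed(path)), total


# Hypothetical page load, durations in milliseconds.
durations = {
    "dns": 30, "tcp_connect": 40, "http_request": 10,
    "server_processing": 120, "response_transfer": 60,
    "css_fetch": 50, "image_fetch": 90, "render": 20,
}
depends_on = {
    "tcp_connect": ["dns"],
    "http_request": ["tcp_connect"],
    "server_processing": ["http_request"],
    "response_transfer": ["server_processing"],
    "css_fetch": ["response_transfer"],
    "image_fetch": ["response_transfer"],
    "render": ["css_fetch", "image_fetch"],
}

path, total = critical_path(durations, depends_on)
print(" -> ".join(path), f"({total:.0f} ms)")
```

Here css_fetch is off the critical path: speeding it up would not change the overall response time, while any saving in image_fetch or server_processing would.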
Web Traffic: Software Aid

httperf:
  Sends HTTP requests and processes responses
  Simulates workload
  Gathers statistics
  Supports HTTP/1.1
  Freely available in source code
Web Traffic: Software Aid (cont’d)

wget
  Fetches a large number of pages rooted at a particular node.
  Can fetch all the pages up to a certain “level” according to
  links
Mercator (a personalized crawler)
  Uses a seed page and then does breadth-first search on the
  links to find pages.
  Higher weight for pages having more incoming links.
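
Both tools above walk the link graph outward from a starting page. The sketch below shows a breadth-first traversal from a seed with a level limit, run on an in-memory toy graph rather than the real web. It is not how wget or Mercator are actually implemented, and it leaves out Mercator's weighting by incoming links; the get_links callback and the toy site are assumptions.

```python
from collections import deque


def crawl_breadth_first(seed, get_links, max_level):
    """Visit pages breadth-first from seed, following links up to max_level hops."""
    level_of = {seed: 0}
    order = [seed]
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        level = level_of[page]
        if level == max_level:
            continue  # do not follow links beyond the level limit
        for link in get_links(page):
            if link not in level_of:
                level_of[link] = level + 1
                order.append(link)
                queue.append(link)
    return order


# Hypothetical site structure: page -> pages it links to.
toy_site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/"],
    "/c": ["/d"],
    "/d": [],
}

print(crawl_breadth_first("/", lambda page: toy_site.get(page, []), max_level=2))
```

With max_level=2 the page "/d", three links away from the seed, is never fetched.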
Web Performance: Intro

User-perceived latency is a key factor because it
affects the popularity of a site.

Beyond a certain delay, user cancellations of the page
increase sharply.
Web Performance: CDNs

Busy servers outsource delivery of some of their pages

CDNs combine the workload of several sites into a single
provider.

CDN servers are mirrored so that they are located near clients.

DNS-based redirection

DNS overhead is a serious bottleneck in some CDNs
Web Performance: CDNs (cont’d)
• Motivation:
  • More hops between client and Web server => more congestion!
  • Same data flowing repeatedly over links between clients and Web server
[Figure: clients C1 to C4 connected to server S through IP routers]
Web Performance: CDNs (cont’d)

[Figure: caches. Labels: Web server www.cnn.com with new content ("WTC News!"); an ISP caching proxy holding old content; requests from the user merlot.cis.udel.edu and from 1,000,000 other hosts. Legend: congestion/bottleneck, caching proxy.]
Web Performance: CDNs (cont’d)
• Caching problems:
• Caching proxies serve only their clients, not all users on
the Internet
• Content providers (say, Web servers) cannot rely on
existence and correct implementation of caching
proxies
• Accounting issues with caching proxies.
For instance, www.cnn.com needs to know the number
of hits to the webpage for advertisements displayed on
the webpage
Web Performance: CDNs (cont’d)
[Figure: distribution infrastructure. Labels: Web server www.cnn.com with new content ("WTC News!"); mirrors in WA, CA, MI, IL, MA, NY, FL, and DE; requests from the user merlot.cis.udel.edu and from 1,000,000 other users, answered with new content. Legend: distribution infrastructure, mirrors.]
Web Performance: CDNs (cont’d)
• Overlay network to distribute content from origin servers to
users
• Avoids large amounts of same data repeatedly traversing
potentially congested links on the Internet
• Reduces Web server load
• Reduces user perceived latency
DNS-based Request Routing
Q: How does the Akamai DNS know which surrogate is closest?
[Figure: the client merlot.cis.udel.edu (128.4.30.15) sends a DNS query for www.cnn.com through its local DNS server louie.udel.edu (128.4.4.12) to the Akamai DNS. The Akamai CDN has surrogates california.cnn.akamai.com and delaware.cnn.akamai.com (addresses shown: 58.15.100.152 and 145.155.10.15). The DNS response is the A record 145.155.10.15.]
DNS-Based Request Routing (cont’d)
[Figure: the same setup, showing the DNS query and DNS response exchanged between the client merlot.cis.udel.edu (128.4.30.15), its local DNS server louie.udel.edu (128.4.4.12), and the Akamai DNS in front of the surrogates for www.cnn.com.]
DNS-Based Redirection

Problem:

The content server is optimized for the local name server,
not the actual client

Client may be far from name server

In a study, only 16% of the clients were in the same
network-aware cluster as the local DNS server
Total & Selective Redirection

1. Total redirection
  Any request for the origin server is redirected to the CDN
  Basically, the CDN takes control of the content provider’s DNS zone
  Benefit: All requests are automatically redirected
  Disadvantage: May send lots of traffic to the CDN, hence expensive for the
  content provider
2. Selective redirection
  Content provider marks which objects are to be served from the CDN
  Typically, larger objects like images are selected
  Refer to images as: <img src=http://cdn.com/foo/bar/img.gif>
  Pro: Fine-grained control over what gets delivered
  Con: Have to (manually) mark content for the CDN
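
The manual marking step in selective redirection can be automated in principle. The sketch below rewrites relative image URLs in a page so that they point at the CDN, using the http://cdn.com/foo/bar base from the example above. The regular expression, the function name, and the choice to leave absolute URLs untouched are assumptions made for illustration; a production tool would use a real HTML parser.

```python
import re

CDN_BASE = "http://cdn.com/foo/bar"  # base URL taken from the slide's example

# Matches the src attribute of an <img> tag, quoted or unquoted.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\']?)([^"\'\s>]+)\2', re.IGNORECASE)


def mark_images_for_cdn(html, cdn_base=CDN_BASE):
    """Point relative <img> sources at the CDN; leave everything else alone."""
    def rewrite(match):
        prefix, quote, url = match.groups()
        if url.startswith(("http://", "https://")):
            return match.group(0)  # already absolute, keep as is
        return f"{prefix}{quote}{cdn_base}/{url.lstrip('/')}{quote}"

    return IMG_SRC.sub(rewrite, html)


page = '<html><body><img src="/img.gif"><a href="/about.html">About</a></body></html>'
print(mark_images_for_cdn(page))
```

Only the image is redirected; the link to /about.html still goes to the origin server, which is the fine-grained control listed as the advantage of this approach.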
Total Redirection
[Figure: the client fetches index.html and the embedded image1.gif and image2.gif; the entities shown are the origin server and the CDN surrogate server.]
Partial Redirection
[Figure: the client fetches index.html and the embedded image1.gif and image2.gif; the entities shown are the origin server and the CDN surrogate server.]
Total vs. Selective Redirection

Total redirection has clearly superior performance
  Selective redirection is typically slower than downloading
  everything from the origin server
  But origin server might be loaded…
Which redirection is more used?
  Initially, selective redirection was used
  These days, mainly total redirection
Web Performance: Client Connectivity

Finding clients’ connection quality

Delivering the most suitable version of content

Tailoring server’s policy
  Sending just the base document
  Using compression
  Keeping persistent connections open longer
Measure the inter-arrival time of requests to classify
clients.
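
A minimal sketch of classifying a client from the inter-arrival times of its requests. The use of the median gap, the class names, and both thresholds (in seconds) are assumptions chosen for illustration, not values from the study cited in these slides.

```python
from statistics import median


def inter_arrival_times(timestamps):
    """Gaps in seconds between consecutive requests from one client."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def classify_client(timestamps, poor_threshold=1.0, good_threshold=0.2):
    """Label a client 'good', 'moderate', 'poor', or 'unknown'.

    Assumption: a well-connected client fetches a page's embedded objects
    in quick succession, so short gaps suggest good connectivity.
    """
    gaps = inter_arrival_times(timestamps)
    if not gaps:
        return "unknown"
    typical_gap = median(gaps)
    if typical_gap >= poor_threshold:
        return "poor"
    if typical_gap <= good_threshold:
        return "good"
    return "moderate"


print(classify_client([0.0, 0.05, 0.10, 0.15]))  # requests 50 ms apart
print(classify_client([0.0, 1.5, 3.2, 5.0]))     # requests over a second apart
```

A server could then apply one of the policies listed above (for example, compression) only to clients in the "poor" class.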
Web Performance: Client Connectivity (cont’d)

Stability of client classification

Classifying new clients using network-aware
clustering
  same cluster → same class
Classification works best for sites having a variety of
clients.
Web performance: Client Connectivity (cont’d)
Server Action conclusions:
- Compression: consistently good results for poorer but not well-connected clients.
- Reducing the quality of objects only yielded benefits for a modem client.
- Bundling was effective when there was good connectivity or poor connectivity with large latency.
- Persistent connections with serialized requests did not show significant improvement.
- Pipelining was only significant for clients with high throughput or RTT.
Balachander Krishnamurthy, Craig E. Wills, Yin Zhang, and
Kashi Vishwanath. Design, Implementation, and Evaluation of a
Client Characterization Driven Web Server. In Proceedings of the
World Wide Web Conference, May 2003.
Web Performance: Protocol Compliance

A 16-month study used the httperf tool to test for HTTP
protocol compliance.

Absence of required headers (such as date)

Nearly half the servers did not implement range requests.

Inability to handle long URIs in a graceful manner.

The popular Apache server was most compliant, then
Microsoft’s IIS.
Web Applications: Searching

In 1999, 200 million pages and 1.5 billion links were
examined.

The probability of a node having in-degree i is proportional to
1/i^x (x > 1).

Nodes with a large in-degree are considered “high rank”

Used frequently in search engines

Sites may use fake linkages to trick crawlers.
Web Applications: Searching (cont’d)

A four-part separation in web structure.

A central core

Two parts connected to the core

One part with no connection to the core

All the components have roughly equal number of pages!
Web Applications: Searching (cont’d)

Over 90% of web pages are connected to each other when links are treated as undirected.

Following links in their actual direction, the probability of reaching a random page from another is
only 0.25.

The well-connected component will remain connected even if
we remove nodes with large degrees (hubs).
Web Applications: Searching (cont’d)

Image resources change infrequently.

Many text documents change periodically.

Some studies have tried to model the rate of change of pages
as a Poisson process.

Some studies have examined the rate of change in different
domains (e.g. .com vs .org).
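
Under the Poisson model mentioned above, a page that changes at an average rate of λ times per day has changed at least once within t days with probability 1 - e^(-λt). The sketch below computes this; the function names, the units (days), and the naive rate estimate are assumptions for illustration.

```python
import math


def probability_changed(rate_per_day, days):
    """P(at least one change within `days`) for a Poisson change process."""
    return 1.0 - math.exp(-rate_per_day * days)


def estimate_rate(changes_observed, days_observed):
    """Naive rate estimate; assumes every change in the period was observed."""
    return changes_observed / days_observed


# Hypothetical page seen to change 3 times in 30 days.
rate = estimate_rate(3, 30)
for days in (1, 7, 30):
    print(f"within {days:2d} days: {probability_changed(rate, days):.2f}")
```

A crawler could use such an estimate to decide how often a page is worth revisiting, which ties in with the observation that images change rarely while many text documents change periodically.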
Web Applications: Searching (cont’d)

150 web sites were studied over a 7-month period.

Incoming links of the pages were computed

Rich getting richer!

Pages in the bottom 60% ranking received no additional
links.

Search engines need to change the way they rank pages.
Web Applications: Searching (cont’d)

A study examined several subsets of pages.

A significant fraction of links were dead, with an impact on
crawling and page ranking.

Over 50% dead links in some cases.

Faster crawling and more useful ranking by avoiding dead
links.
Web Applications: Flash Crowds

Large number of legitimate and wanted requests (unlike DoS
attacks, in which the requests are not wanted)

During flash crowds
  Same average number of requests per client
  No increase in the number of client clusters
  Between 60% and 82% of the resources are accessed only at this time.
  Less than 10% of the resources are responsible for 90% of the requests.
DoS attackers have no way of knowing the typical distribution
of client clusters.
  Many new clusters emerge.
Flash Crowd vs DoS Attack

Flash crowd
  Increase in number of clients
  Fixed number of clusters
DoS attack
  Increase in number of both clients and clusters
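
The distinction above suggests a simple heuristic: compare the client clusters seen during a surge with those seen in normal operation. The sketch below is only an illustration of that idea; the 50% threshold, the labels, and the example prefixes are assumptions, and a real detector would need more signals than this one.

```python
def classify_surge(baseline_clusters, surge_clusters, new_cluster_threshold=0.5):
    """Guess the nature of a surge from the fraction of previously unseen clusters."""
    surge = set(surge_clusters)
    if not surge:
        return "no traffic"
    new_clusters = surge - set(baseline_clusters)
    fraction_new = len(new_clusters) / len(surge)
    if fraction_new > new_cluster_threshold:
        return "possible DoS attack"
    return "possible flash crowd"


# Hypothetical clusters, identified by network prefix.
baseline = {"128.4.0.0/16", "12.0.0.0/8", "58.15.0.0/16"}
flash = {"128.4.0.0/16", "12.0.0.0/8", "58.15.0.0/16"}
attack = {"12.0.0.0/8", "203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24"}

print(classify_surge(baseline, flash))
print(classify_surge(baseline, attack))
```

The clusters themselves could come from network-aware clustering of the client IP addresses in the server log.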
Web Applications: Blogs

Providing early warning of flash crowds

Different rate of change compared to traditional web pages

Having many references, like popular web sites

Significant fraction of links going to other blogs

Having significantly more self-references