internal-tutorial

Transcript internal-tutorial

Mining Web Traces:
Workload Characterization, Performance
Diagnosis, and Applications
Lili Qiu
Microsoft Research
Internal Talk
November 2002
1
Motivation

Why do we care about Web traces?

Content providers






How do users come to visit the Web site?
Why do users leave the Web site? Is poor
performance the cause for this?
Where are the performance bottlenecks?
What content are users interested in?
How do users’ interest vary in time?
How do users’ interest vary across different
geographical regions?
2
Motivation (Cont.)

Web hosting companies




ISPs



Accounting & billing
Server selection
Provisioning server farms: where to place servers
How to save bandwidth by storing proxy caches?
Traffic engineering & provisioning
Researchers



Where are the performance bottlenecks?
How to improve Web performance?
Examples: Traffic measurements have influenced the
design of HTTP (e.g., persistent connections and
pipeline), TCP (e.g., initial congestion window)
3
Outline





Background
Web workload characterization
Performance diagnosis
Applications of traces
Bibliography
4
Part I: Background




Web software components
Web semantic components
Web protocols
Types of Web traces
5
Web Software Components

Web clients

replica
proxy

Internet
proxy
Web
Clients
replica
proxy

Web servers

Web
Servers



An application that
establishes connections
to send Web requests
E.g., Mosaic, Netscape
Navigator, Microsoft IE
An application that
accepts connections to
service requests by
sending back responses
E.g., Apache, Microsoft
IIS
Web proxies (optional)
Web replicas (optional)
6
Web Semantic Components

Uniform Resource Identifier (URI)

An identifier for a Web resource





Hypertext Markup Language (HTML)


Name of protocol: http, https, ftp, ..
Name of the server
Name of the resource on the server
e.g., http://www.foobar.com/info.html
Platform-independent styles (indicated by markup tags)
that define the various components of a Web document
Hypertext Transfer Protocol (HTTP)

Define the syntax and semantics of messages exchanged
between Web software components
7
Example of a Web Transaction
1. DNS query
DNS
server
2. Setup TCP connection
3. HTTP request
4. HTTP response
Browser
Web server
8
Internet Protocol Stack
Application layer: application programs
(HTTP, Telnet, FTP, DNS)
Transport layer: error control + flow control
(TCP,UDP)
Network layer: routing
(IP)
Datalink layer: handle hardware details
(Ethernet, ATM)
Physical layer: moving bits
(coaxial cable, optical fiber)
9
HTTP Protocol

Hypertext Transfer Protocol (HTTP)

HTTP 1.0 [BLFF96]



The most widely used HTTP version
A “Stop and wait” protocol
HTTP 1.1 [GMF+99]

Adds persistent connections, pipelining,
caching, content negotiation, …
10
HTTP 1.0

HTTP request


Request = Simple-Request | Full-Request
Simple-Request = "GET" SP Request-URI CRLF
Full-Request = Request-Line;
*( General/Request/Entity Header) ;
CRLF
[ Entity-Body ] ;
Request-Line = Method SP Request-URI SP HTTPVersion CRLF
Method = "GET" ;| "HEAD" ; | "POST" ;| extensionmethod
Example: GET /info.html HTTP/1.0
11
HTTP 1.0 (Cont.)

HTTP response


Response = Simple-Response | Full-Response
Simple-Response = [ Entity-Body ]
Full-Response = Status-Line;
*( General/Response/Entity Header );
CRLF
[ Entity-Body ] ;
Example:
HTTP/1.0 200 OK
Date: Mon, 09 Sep 2002 06:07:53 GMT
Server: Apache/1.3.20 (Unix) (Red-Hat/Linux) PHP/4.0.6
Last-Modified: Mon, 29 Jul 2002 10:58:59 GMT
Content-Length: 21748
Content-Type: text/html
This is the document content …
<21748 bytes of the current version of info.html>
12
HTTP 1.1

Connection management

Persistent connections [Mogul95]


Use one TCP connection for multiple HTTP requests
Pros:



Cons:



Reduce the overhead of connection setup and teardown
Avoid TCP slow start
head-of-line blocking
increase servers’ state
Pipeline [Pad95]



Send multiple requests without waiting for a response
between requests
Pros: avoid the round-trip delay of waiting for each
response
Cons: connection aborts are harder to deal with
13
HTTP 1.1 (Cont.)

Caching


Continues to support the notion of expiration
used in HTTP 1.0
Add a cache-control header to handle the issues
of cacheability and semantic transparency [KR01]


E.g., no-cache, only-if-cache, no-store, max-age, maxstale, min-fresh, …
Others




Range request
Content negotiation
Security
…
14
Types of Web Traces

Application level traces





Server logs: CLF and ECLF formats
CLF format
<remoteIP,remoteID,usrName,Time,request,responseCode,
contentLength>
e.g., 192.1.1.1, -, -, 8/1/2000, 10:00:00, “GET
/news/index.asp HTTP/1.1”, 200, 3410
Proxies logs: CLF and ECLF formats
Client logs: no standard logging formats
Packet level traces



Collection method: monitor a network link
Available tools: tcpdump, libpcap, netmon
Concerns: packet dropping, timestamp accuracy
15
Tutorial Outline





Background
Web workload characterization
Performance diagnosis
Applications of traces
Bibliography
16
Part II: Web Workload Characterization





Overview of workload
characterization
Content dynamics
Access dynamics
Common pitfalls
Case studies
17
Overview of Workload
Characterization




Process of trace analyses
Common analysis techniques
Common analysis tools
Challenges in workload characterization
18
Process of Trace Analyses

Collect traces





where to monitor, how to collect (e.g., efficiency,
privacy, accuracy)
Determine key metrics to characterize
Process traces
Draw inferences from the data
Apply the traces or insights gained from the
trace analyses to design better protocols &
systems
19
Common Analysis Techniques Statistics



Mean
Median
Geometric mean: less sensitive to outliers
GM  n xi   E (log x )


Variance and standard deviation
1 N
var( x)   ( xi  u ) 2 , std ( x)  var( x)
N i 1
Confidence interval


A range of values that has a specified probability of
containing the parameter being estimated
Example: 95% confidence interval 10  x  20
20
Common Analysis Techniques –
Statistics (Cont.)

Cumulative distribution (CDF)


Probability density function (PDF)


Points: (x, P(X  x))
Derivative of CDF: f(x) = dF(x)/dx
Check for heavy tail distribution


Log-log complementary plot, and check its tail
Example: Pareto distribution
a 
F ( x)  1  ( ) , a,   0, x  a
x
If 2, distribution has infinite variance (a heavy
tail)
If 1, distribution has infinite mean
21
Common Analysis Techniques –
Data Fitting


Visually compare two distributions
Chi Squared tests [AS86,Jain91]





Divide the data points into k bins
2
k
(
x

E
)
Compute X 2 
i
i

Ei
i 1
If X2  X2(,k-c), then two distributions are close,
where  is significance level, c is the number of
estimated parameters for the distribution + 1
Need enough samples
Kolmogorov-Smirnov tests [AS86,Jain91]


Compares two distributions by finding the maximum
differences between two variables’ cumulative
distribution functions
Need to fully specify the distribution
22
Common Analysis Techniques –
Data Fitting (Cont.)

Anderson-Darling Test [Ste74]

Modification of the Kolmogorov-Smirnov test, giving more
weight to the tails
A2   N  S , where
N
S 
i 1


2i  1
[ InF (Yi )  In(1  F (YN 1i ))]
N
If A  critical value, two distributions are similar;
otherwise they are not (F is CDF, and Yi are ordered data)
Quantile-quantile plots [AS86,Jain91]


Compare two distributions by plotting the inverse of the
cumulative distribution function
F-1(x) for two variables, and find best fitting line
If the slope of the line is close to 1, and y-intercept is
close to 0, the two data sets are almost identically
distributed
23
Common Analysis Tools

Scripting languages


Databases


SQL, DB2, …
Statistics packages


VB, Perl, awk, UNIX shell scripts, …
Matlab, S+, R, SAS, …
Write our own low level programs

C, C++, C#, …
24
Challenges in Workload
Characterization
replica
proxy
Internet
proxy
Clients


replica
proxy
Servers
Workload characteristics vary both in space and in
time
Each of the Web components provides a limited
perspective on the functioning of the Web
25
Workload Variation


Vary with measurement points
Vary with sites being measured



Vary with the clients being measured




Information servers (news site), e-commercial servers,
query servers, streaming servers, upload servers
US vs. Europe, …
Internet clients vs. wireless clients
University clients vs. home users
US vs. Europe, …
Vary in time




Day vs. night
Weekday vs. weekend
Changes with new applications, recent events
Evolve over time, …
26
Different Web Components’ Views

View from clients



View from servers



Know details of client activities, such as requests satisfied by
browser caches, client aborts
The ability to record detailed information, as this does not
impose significant load on a client browser
Requests satisfied by browser & proxy caches will not appear in
the logs
May not log detailed information to ensure fast processing of
client requests
View from proxies


Depending on the proxy’s location
 A proxy close to clients see requests from a a small client
group to a large number of servers [KR00]
 A proxy close to the servers see requests from a large client
group to a small number of servers [KR00]
Requests satisfied by browser caches or proxy caches
27
encountered earlier will not appear in the logs
Part II: Web Workload





Overview
Content dynamics
Access dynamics
Common pitfalls
Case studies
28
Content Dynamics



File types
File size distribution
File update patterns


How often files are updated
How much files are updated
29
File Types

Text files


Images


Javascript, cgi, asp, pdf, ps, gzip, ppt, …
Multimedia files


Jpeg, gif, bitmap, …
Applications


HTML, plain text, …
Audio, video
…
30
File Size Distribution

Two definitions




D1: Size of all files on a Web server
D2: Size of all files transferred by a Web server
D1  D2, because some files can be transferred
multiple times or not in completion and other
files are not transferred
Studies show that the distribution of file
sizes in both definitions exhibit heavy tails
(i.e., P[F > x] ~ x-, 0    2)
31
File Update Interval

Varies in time


Varies across sites



Hot events and fast changing events require more
frequent update, e.g., Worldcup
Depending on server update policies & update tools
Depending on the nature of content (e.g., University sites
have slower update rate than news sites)
Recent studies


Study of the proxy traces collected at DEC and AT&T in
1996 showed the rate of change depended on content type,
top-level domains etc. [DFK+97]
Study of 1999 MSNBC logs shows that modification
history yields a rough predictor of future modification
interval [PQ00]
32
Extent of Change upon
Modifications

Varies in time


Varies across sites



Different events trigger different amount of updates
Depending on servers’ update policies and update tools
Depending on the nature of the content
Recent studies

Studies of 1996 DEC and AT&T proxy [MDF+97] and
1999 MSNBC log [PQ00] show that most file
modifications are small  delta encoding can be very
useful
33
Part II: Web Workload






Motivation
Limitations of workload
measurements
Content dynamics
Access Dynamics
Common pitfalls
Case studies
34
Access Dynamics




File popularity distribution
Temporal stability
Spatial locality
User request arrivals & durations
35
Document Popularity

Web requests follow Zipf-like distribution



Request frequency  1/i, where i is a document’s ranking
The value of  depends on the point of measurements
 Between 0.6 and 1 for client traces and proxy traces
 Close to or larger than 1 for server traces [ABC+96, PQ00]
The value of  varies over time (e.g., larger  during hot events)
Clients/Proxies
Less popular servers
MSNBC
2
1.5

1
0.5
0
36
Impact of the value 
Percentage of Requests

1
0.8
0.6
0.4

0.2
Larger  means more
concentrated accesses on
popular documents
caching is more
beneficial
90% of the accesses are
accounted by

0
0
0.2
0.4
0.6
0.8
1
Percentage of Documents (sorted by popularity)
12/17/98 Server Traces
10/06/99 Proxy Traces

08/01/99 Server Traces

Top 36% files in proxy
traces [BCF+99, PQ00]
Top 10% files in small
departmental server logs
reported in [AW96]
Top 2-4% files in
MSNBC traces
37
Temporal Stability

Metrics

Coarse-grained: likely duration that a current
popular file remains popular


e.g., overlap between the set of popular documents on
day 1 and day 2
Fine-grained: how soon a requested file will be
requested again

e.g., LRU stack distance [ABC+96]
File 5
File 4
File 3
File 2
File 1
Stack distance = 4
File 2
File 5
File 4
File 3
File 1
38
Spatial Locality

Refers to if users in the same geographical
location or at the same organization tend to
request a similar set of content

E.g., compare the degree of requests locally
shared
39
Spatial Locality (Cont.)
Hot Event
Fraction of
requests shared
Fraction of requests
shared
Normal Day
1
0.8
0.6
0.4
0.2
0
0.E+00
1.E+04
2.E+04
3.E+04
Domain ID
4.E+04
5.E+04
1.2
1
0.8
0.6
0.4
0.2
0
0.0E+00 5.0E+03 1.0E+04 1.5E+04 2.0E+04 2.5E+04 3.0E+04 3.5E+04
Domain ID
Domain-based clustering
Random clustering
Domain membership is significant
except when there is a “hot” event of global interest
40
User Request Arrivals & Duration

User workload at three levels




Exponential distribution [LNJV99,KR01]


Session: a consecutive series of requests from a user to a
Web site
Click: a user action to request a page, submit a form, etc.
Request: each click generates one or more HTTP requests
Session duration
Heavy-tail distribution [KR01]




# clicks in a session, most in the range of 4-6 [Mah97]
# embedded references in a Web page
Think time: time between clicks
Active time: time to download a Web page and its
embedded images
41
Common Pitfalls

Trace analyses are all about writing scripts &
plotting nice graphs

Challenges






Understanding the limitation of the traces



Trace collection: where to monitor, how to collect (e.g.,
efficiency, privacy, accuracy)
Identify important metrics, and understand why they are
important
Sound measurements require disciplines [Pax97]
Dealing with errors and outliers
Draw implications from data analyses
No representative traces: workload changes in time and in
space
Try to diversify data sets (e.g., collect traces at different
places and different sites) before jumping into conclusions
Draw inferences more than what data show
42
Part II: Web Workload






Motivation
Limitations of workload measurements
Content dynamics
Access dynamics
Common pitfalls
Case studies




Boston University client log study
UW proxy log study
MSNBC server log study
MSN Mobile server log study
43
Case Study I: BU Client Log Study

Overview



One of the few client log studies
Analyze clients’ browsing pattern and their impact on
network traffic [CBC95]
Approaches

Trace collection



Modify Mosaic and distribute it to machines in CS Dept.
at Boston Univ. to collect client traces in 1995
Log format: <client machine, request time, user id, URI,
document size, retrieval time>
Data analyses



Distribution of document size, document popularity
Relationship between retrieval latency and response
size
Implications on caching strategies
44
Major Findings

Power law distributions




Distribution of document sizes
Distribution of user requests for documents
# requests to documents as a function of their
popularity
Caching strategies should take into account
of document size (i.e., give preference to
smaller documents)
45
Case Study II: UW Proxy Log Study

Overview


Proxy traces collected at the University of
Washington
Approaches [WVS+99a, WVS+99b]


Trace collection: deploy a passive network
sniffer between the Univ. of Washington and the
rest of the Internet in May 1999
Set well-defined objectives


Understand the extent of document sharing within an
organization and across different organizations
Understand the performance benefit of cooperative
proxy caching
46
Major Findings




Members of an organization are more likely
to request the same documents than a
random set of clients
Most popular documents are globally popular
Cooperative caching is most beneficial for
small organizations
Cooperative caching among large
organizations yield minor improvement if any
47
Case Study III: MSNBC Server
Log Study

Overview of MSNBC server site




a large news site
server cluster with 40 nodes
25 million accesses a day (HTML content alone)
Period studied: Aug. – Oct. 99 & Dec. 17, 98
flash crowd
48
Approaches

Trace collection




HTTP access logs
Content Replication System (CRS) logs
HTML content logs
Data analyses

Content dynamics




How often files are modified?
How to predict modification interval?
How much does a file change upon modification?
Access dynamics




Document popularity
Temporal stability
Spatial locality
Correlation between document age and popularity
49
Major Findings

Content dynamics



Modification history is a rough predictor  guide for
setting TTL, but need an alternative mechanism (e.g.,
callback based invalidation) as backup
Frequent but minimal file modifications  delta encoding
Access dynamics



Set of popular files remains stable for days 
pushing/prefetching previous hot data that have
undergone modifications
Domain membership has a significant bearing on client
accesses except during a flash crowd of global interest
 make sense to have a proxy cache for an organization
Zipf-like distribution of file popularity but with a much
larger  than at proxies  potential of reverse caching
and replication
50
Case Study IV: Mobile Server Log
Study

Overview of a popular commercial Web site for
mobile clients
 Content


Services



news, weather, stock quotes, email, yellow pages,
travel reservations, entertainment etc.
Notification
Browse
Period studied


3.25 million notifications in Aug. 20 – 26, 2000
33 million browse requests in Aug. 15 – 26, 2000
51
Approaches

Analyze by user categories

Cellular users


Offline users


Signup services and specify preferences
Analyze by Web services



Download content onto their PDAs for later (offline)
browsing, e.g. AvantGo
Desktop users


Browse the Web in real time using cellular
technologies
Browse
Notifications
Use SQL database to manage data
52
Major Findings

Notification Services



Browse Services



Popularity of notification messages follows a Zipf-like
distribution, with top 1% notification objects responsible
for 54-64% of total messages  multicast notifications
Exhibits geographical locality  useful to provide localized
notification services
0.1% - 0.5% urls account for 90% requests  cache the
results of popular queries
The set of popular urls remain stable  cache a stable set
of queries or optimize query based on a stable workload
Correlation between the two services

Correlation is limited  influence design of pricing plans
53
Tutorial Outline




Background
Web Workload
Performance Diagnosis
Applications of traces
54
Part III: Performance Diagnosis



Overview of performance diagnosis
Infer the causes of high end-to-end delay in
Web transfers [BC00]
Infer the causes of high end-to-end loss
rate in Web transfers
[CDH+99,DPP+01,NC01,PQ02, PQW02]
55
Overview of Performance Diagnosis

Web Server

Goal: determine trouble
spot locations
Metrics of interest



Sprint
AT&T
MCI
UUNET
Qwest
AOL


Earthlink

Why so
slow?
Delay
Loss rate
Raw bandwidth
Available bandwidth
Traffic rate
Why interesting



Resolve the trouble spots
Server selection
Placement of mirror
servers
56
Finding the Sources of Delays

Goal


Why is my Web transfer slow? Is it because of
the server or the network or the client?
Sources of delay in Web transfer




DNS lookup
Server delays
Client delays
Network delays



Propagation delays
Queuing delays
Delays introduced by packet losses (e.g., signaled by
the fast retransmit mechanism or TCP timeouts)
57
TCPEval Tool




Inputs: “tcpdump” packet traces taken at the
communicating Web server and client
Generates a variety of statistics for file
transactions
 File and packet transfer latencies
 Packet drop characteristics
 Packet and byte counts per unit time
Generates both timeline and sequence plots for
transactions
Generates critical path profiles and statistics for
transactions
58
Critical Path Analysis Tool [BC00]
Data flow
Client
Server
Critical Path
Client
Server
Network delay
Network delay
Server delay
Network delay
Client delay
Network delay
Server delay
Network delay
due to pkt loss
59
Finding Sources of Packet Losses
Goal


Identify lossy links
server
(1-l1)*(1-l2)*(1-l4) = (1-p1)
(1-l1)*(1-l2)*(1-l5) = (1-p2)
…
(1-l1)*(1-l3)*(1-l8) = (1-p5)
l1
l2
l4
l5 l6
l3
l7
l8
clients
p1
p2
p3
p4
p5
- an under-constrained
system of equations
- measurement errors
60
Approaches

Active probing

Probing



Multicast probes
Striped unicast probes
Technique -- Expectation Maximization (EM)

a numerical algorithm to compute  that maximizes
P(D|), where D are observations,  are ensemble of
link loss rates
S
A
S
B
A
B
61
Approaches (Cont.)

Passive monitoring

Random sampling



Linear optimization


Determine a unique solution by optimizing an objective
function
Gibbs sampling


Random sample the solution space, and draw conclusions
based on samples
Akin to Monte Carlo sampling
Determine P(|D) by drawing samplings, where  is
ensemble of loss rates of links in the network, and D is
observed packet transmission and losses at the clients
EM

A numerical algorithm to compute  that maximizes
P(D|)
62
Other Performance Studies using
Web traces



Characterize Internet performance (e.g.,
spatial & temporal locality) [BSS+97]
Study the behavior of TCP during Web
transfers [BPS+98]
Reconstruct different client page accesses
and measure performance characteristics
for the accesses [FCT+02]
63
Tutorial Outline





Background
Web Workload
Performance Diagnosis
Applications of traces
Bibliography
64
Part IV: Applications of Traces


Synthetic workload generation
Cache design







Pre-fetching algorithms [CB98, FJC+99]
Placement of Web proxies/replicas [QPV01]
Other optimizations



Cache replacement policies [CI97,BCF+99]
Cache consistency algorithms [LC97, YBS99,YAD+01]
Cooperative cache or not [WVS+99]
Cache infrastructure
…
Improving TCP for Web transfers [Mah97,PK98,ZQK00]
Concurrent downloads, pipelining, compression,…
65
Synthetic Workload Generation

Generate user requests



Generate user sessions using a Poisson arrival
process
For each user session, determine # clicks using a
Pareto distribution
Assign a click to a request for a Web page, while
making sure



The popularity distribution of files follows a Zipf-like
distribution [BC98]
Capture the temporal locality of successive requests
for the same resource
Generate a next click from the same user with
think time following a Pareto distribution
66
Synthetic Workload Generation
(Cont.)

Generate Web pages





Determine the number of Web pages
Generate the size of each Web pages using a lognormal distribution
Associate a page with some number of embedded
pages using an empirical distribution (heavy-tail)
Generate file modification events
Examples of generators

Webbench [Wbe], WebStone[TS95], Surge
[BC98], SPecweb99 [SP99], Web Polygraph [WP],
…
67
Cache Replacement Policies

Problem formulation


Given a fixed size cache, how to evict pages to
maximize the hit ratio once the cache is full?
Hit ratio



Fraction of requests satisfied by the cache
Fraction of the total size of requested data satisfied
by the cache
Factors to consider





Request frequency
Modification frequency
Benefit of caching: reduction in latency & BW
Cost of caching: storage
Caveat: NOT all hits are equal. Hit ratios do NOT
map directly to performance improvement.
68
Cache Replacement Policies (Cont.)

Approaches


Least recently used (LRU)
Least frequently used (LFU)



GreedyDual-size [CI97]


Perfect: maintain counters for all pages seen
In-cache: maintain counters only for pages that are in
cache
Assign a utility value to each object, and replace the
one with the lowest utility
Use of traces


Evaluate the algorithms using trace-driven
simulations or synthetic workload
Analytically derive the hit ratios for different
replacement policies based on a workload model
69
Placement of Web Proxies/Replicas

Problem formulation [JJK+01,QPV01]


How to place a fixed number of proxies/replicas
to minimize users’ request latency
Factors to consider


Spatial distribution of requests
Temporal stability of requests


Stability in popularity of objects
Stability in spatial distribution of requests
70
Placement of Web Proxies/Replicas
(Cont.)

Approaches




Greedy placement
Hot-spot placement
Random placement
Use of traces



Trace-driven simulations
High concentration of requests to a small
number of objects  focus on replicating only
popular objects
Temporal stability in requests  no need to
frequently change the locations of
proxies/replicas
71
References







[AS86] R. B. D’Agostino and M. A. Stephens. Goodness-of-Fit Techniques.
Marcel Dekker, New York, NY 1986.
[ABC+96] Virgilio Almeida, Azer Bestavros, Mark Crovella and Adriana de
Oliveria. Characterizing reference locality in the WWW. In Proceedings of
1996 International Conference on Parallel and Distributed Information
Systems (PDIS'96), December 1996.
[ABQ01] A. Adya, P. Bahl, and L. Qiu. Analyzing Browse Patterns of Mobile
Clients. In Proc. of SIGCOMM Measurement Workshop, Nov. 2001.
[ABQ02] A. Adya, P. Bahl, and L. Qiu. Characterizing Alert and Browse
Services for Mobile Clients. In Proc. of USENIX, Jun. 2002.
[AL01] P. Albitz, and C. Liu. DNS and BIND (4th Edition), O’Reilly &
Associates, Apr. 2001.
[AW97] M. Arlitt and C. Williamson. Internet Web Servers: Workload
Characterization and Performance Implications. IEEE/ACM Transactions on
Networking, Vol. 5, No. 5, pp. 631-645, October 1997.
[BC98] P. Barford and M. Crovella. Generating representative workloads for
network and server performance evaluation. In Proc. of SIGMETRICS, 1998.
72
References (Cont.)






[BBC+98] P. Barford, A. Bestavros, M. Crovella, and A. Bradley. Changes in
Web Client Access Patterns: Characteristics and Caching Implications, Special
Issue on World Wide Web Characterization and Performance Evaluation; World
Wide Web Journal, December 1998.
[BCF+99] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching
and Zipf-like Distributions: Evidence and Implications. In Proc. of INFOCOM,
Mar. 1999.
[BC00] P. Barford and M. Crovella. Critical Path Analysis of TCP Transactions.
In Proc. of ACM SIGCOMM, Aug. 2000.
[BLFF96] T. Berners-Lee, R. Fielding, and H. Frystyk. Hypertext Transfer
Protocol -- HTTP/1.0. RFC 1945, May 1996.
[BPS+98] H. Balakrishnan, V. N. Padmanabhan, S. Seshan, M. Stemm and R. H.
Katz. TCP Behavior of a Busy Internet Server: Analysis and Improvements. In
Proc. IEEE Infocom, San Francisco, CA, USA, March 1998.
[BSS+97] H. Balakrishnan, S. Seshan, M. Stemm, and R. H. Katz. Analyzing
Stability in Wide-Area Network Performance. In Proc. of SIGMETRICS, Jun.
1997.
73
References (Cont.)







[CDH+99] R. Caceres, N. G. Duffield, J. Horowitz, D. Towsley, T. Bu.
Multicast-Based Inference of Network Internal Loss Characteristics. In Proc.
Infocom, Mar. 1999.
[CB98] M. Crovella and P. Barford. The network effects of prefetching. In
Proc. of INFOCOM, 1998.
[CBC95] C. R. Cunha, A. Bestavros, and M. E. Crovella. Characteristics of
WWW client-based traces. Technical Report BU-CS-95-010, CS Dept., Boston
University, 1995.
[CI97] P. Cao and S. Irani. Cost-Aware WWW proxy caching algorithms. In
Proc. of USITS, Dec. 1997.
[DFK+97] F. Douglis, A. Feldmann, B. Krishnamurth, and J. Mogul. Rate of
change and other metrics: a live study of the World Wide Web. In Proc. of
USITS, 1997.
[DPP+01] N. G. Duffield, F. Lo Presti, V. Paxson, D. Towsley. In Proc.
Infocom, Apr. 2001.
[FCD+99] A. Feldmann, R. Caceres, F. Douglis, and M. Rabinovich.
Performance of Web Proxy Caching in heterogeneous bandwidth
enviornments. In Proc. of INFOCOM, March 1999.
74
References (Cont.)








[FJC+99] L. Fan, Q. Jacobson, P. Cao and W. Lin. Web Prefetching Between
Low-Bandwidth Clients and Proxies: Potential and Performance. In Proc. of
SIGMETRICS, 1999.
[FCT+02] Y. Fu, L. Cherkassova, W. Tang, and A. Vahdat. EtE: Passive End-toEnd Internet Service Performance Monitering. In Proc. of USENIX, Jun. 2002.
[GMF+99] J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. BernersLee. Hypertext Transfer Protocol – HTTP 1.1. RFC 2616, Jun. 1999.
[JK88] V. Jacobson, M. J. Karels. Congestion Avoidance and Control. In Proc.
SIGCOMM, Aug. 1988.
[JJK+01] S. Jamin, C. Jin, A. R. Kurc, D. Raz, and Y. Shavitt. Constrained
Mirror Placement on the Internet. In Proc. of INFOCOM, Apr. 2001.
[Jain91] R. Jain. The Art of Computer Systems Performance Analysis. John
Wiley and Sons, 1991.
[Kel02] T. Kelly. Thin-Client Web Access Patterns: Measurements from a
Cache-Busting Proxy. Computer Communications, Vol. 25, No. 4 (March 2002),
pages 357-366.
[KR01] B. Krishnamurthy and J. Rexford. Web Protocols and Practice,
HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement. AddisonWesley, May 2001.
75
References (Cont.)







[LC97] C. Liu and P. Cao. Maintaining Strong Cache Consistency in the WorldWide Web. In Proc. of ICDCS'97, pp. 12-21, May 1997.
[LNJV99] Z. Liu, N. Niclausse, and C. Jalpa-Villaneuva. Web Traffic Modeling
and Performance Comparison Between HTTP 1.0 and HTTP 1.1. In Erol
Gelenbe, editor, System Performance Evaluation: Methodologies and
Applications. CRC Press, Aug. 1999.
[Mah97] Bruce Mah. An empirical model of HTTP network traffic. In Proc. of
INFOCOM, April 1997.
[Mogul95] Jeffrey C. Mogul. The Case for Persistent-Connection HTTP. In
Proc. SIGCOMM '95, pages 299-313. Cambridge, MA, August, 1995.
[MDF+97] J. C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy.
Potential benefits of delta-encoding and data compression for HTTP, In Proc. of
SIGCOMM, September 1997.
[NC01] R. Nowak and M. Coates. Unicast Network Tomography using the EM
algorithm. Submitted to IEEE Transactions on Information Theory, Dec. 2001
[Pad95] V. N. Padmanabhan. Improving World Wide Web Latency. Technical
Report UCB/CSD-95-875, University of California, Berkeley, May 1995.
76
References (Cont.)








[PQ00] V. N. Padmanabhan and L. Qiu. The Content and Access Dynamics of a
Busy Web Server. In Proc. of SIGCOMM, Aug. 2000.
[PQ02] V. N. Padmanabhan and L. Qiu. Network Tomography using Passive
End-to-End Measurements, DIMACS on Internet and WWW Measurement,
Mapping and Modeling, Feb. 2002.
[PQW02] V. N. Padmanabhan, L. Qiu, and H. J. Wang. Passive Network
Tomography using Bayesian Inference. Internet Measurement Workshop, Nov.
2002.
[QPV01] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the Placement of
Web Server Replicas. In Proc. of INFOCOM, Apr. 2001.
[SP99] SPECWeb99 Benchmark. http://www.spec.org/osg/web99/.
[Pax98] V. Paxson. An Introduction to Internet Measurement and Modeling.
SIGCOMM’98 tutorial, August 1998.
[Ste74] M. A. Stephens. EDF Statistics for Goodness of Fit and Some
Comparison. Journal of the American Statistical Association, Vol. 69, pp. 730 –
737.
[TS95] G. Trent and M. Sake. WebStone: The First Generation in HTTP Server
Benchmarking, Feb. 1995. http://www.mindcraft.com/webstone/paper.html.
77
References (Cont.)






[Wbe] Webbench.
http://www.zdnet.com/etestinglabs/stories/benchmarks/0,8829,2326243,00.html.
[WP] Web Polygraph: Proxy performance benchmark.
http://polygraph.ircache.net/.
[WVS+99a] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M.
Brown, T. Landray,D. Pinnel, A. Karlin, and H. Levy. OrganizationBased Analysis of Web-Object Sharing and Caching. In Proc. of the
Second USENIX Symposium on Internet Technologies and Systems,
Boulder, CO, October 1999.
[WVS+99b] A. Wolman, G. M. Voelker, N. Sharma, N. Cardwell, A.
Karlin, and H. M. Levy. On the scale and performance of cooperative
Web proxy caching. In Proc. of the 17th ACM Symposium on Operating
Systems Principles, Kiawah Island, SC, Dec. 1999.
[YAD01] J. Yin, L. Alvisi, M. Dahlin, A. Iyengar. Engineering serverdriven consistency for large scale dynamic services.
[YBS99] H. Yu, L. Breslau, and S. Shenker. A Scalable Web Cache Consistency
Architecture. In Proc. of SIGCOMM, August 1999.
78
Acknowledgement
Thank Alec Wolman for his helpful
comments.
79
Thank you!
http://www.research.microsoft.com/~liliq/talks/internal-web-perf.ppt
80
Web Protocols
HTTP messages
HTTP
TCP segments
TCP
IP
HTTP
IP pkt
IP
IP pkt
Ethernet Ethernet Sonet
Ethernet
TCP
IP
IP pkt
Sonet Ethernet
Sonet link
A picture taken from [KR01]
IP
Ethernet
Ethernet
81
#1: Random Sampling

server


l1
l3
l2

Randomly sample the solution space
Repeat this several times
Draw conclusions based on overall
statistics
How to do random sampling?

l4
l5 l6
l7
l8

determine loss rate bound for each link using
best downstream client
iterate over all links:

p1
p2
p3
p4
clients
p5


pick loss rate at random within bounds
update bounds for other links
Problem: little tolerance for estimation
error
82
#2: Linear Optimization
server
l1
l3
l2
l4
p1
l5 l6
p2
p3
l7
l8
p4
p5
clients
1) Convert the constraints into linear
constraints using log transform
Li = log(1/(1-li)), Pj = log(1/(1-pj))
(1-l1)*(1-l2)*(1-l4) = (1-p1)  L1+L2+L4 = P1
2) Add slack variables to account for errors
L1+L2+L4 = P1 L1+L2+L4 + S1 = P1
minimize wLi + |Sj|
subject to
L1+L2+L4 + S1 = P1
L1+L2+L5 + S2 = P2
…
L1+L3+L8 + S5 = P5
Goals

Parsimonious explanation

Robust to error in client loss rate
estimate
83
# 3: Gibbs Sampling

D





ensemble of loss rates of links in the network
Goal


observed packet transmission and loss at the clients
determine the posterior distribution P(|D)
Approach


Use Markov Chain Monte Carlo with Gibbs sampling to
obtain samples from P(|D)
Draw conclusions based on the samples
84
# 3: Gibbs Sampling (Cont.)

Applying Gibbs sampling to network
tomography




1) Initialize link loss rates arbitrarily
2) For j = 1 : warmup
for each link i
compute P(li|D, {li’})
where li is loss rate of link i, and {li’} = kI lk
3) For j = 1 : realSamples
for each link i
compute P(li|D, {li’})
Use all the samples obtained at step 3 to
approximate P(|D)
85
Simulation Experiments


Advantage: no uncertainty about link loss rate!
Methodology

Topologies used:



randomly-generated packet loss events at each link




randomly-generated: 20 - 3000 nodes, max degree = 5-50
real topology obtained by tracing paths to microsoft.com
clients
A fraction f of the links are good, and the rest are “bad”
LM1: good links: 0 – 1%, bad links: 5 – 10%
LM2: good links: 0 – 1%, bad links: 1 – 100%
Goodness metrics:


Coverage: # correctly inferred lossy links
False positive: # incorrectly inferred lossy links
86
Random Topologies
1000-node random topologies (d=10, f=0.95)
1000-node random topologies (d=10, f=0.5)
160
600
140
500
100
# links
# links
120
80
60
40
400
300
200
100
20
0
0
Random
LP
Gibbs
Random
"# true lossy links"
LP
Gibbs
"# true lossy links"
"# correctly identified lossy links"
"# false positive"
"# correctly identified lossy links"
"# false positive"
Techniques
Coverage
False Positive
Computation
Random
High
High
Low
LP
Modest
Low
Medium
Gibbs sampling
High
Low
High
87
Trace-driven Validation

Validation approach




Experimental setup


Divide client traces into two: tomography and validation
Tomography data set => loss inference
Validation set => check if clients downstream of the
inferred lossy links experience high loss
Real topologies and loss traces collected from traceroute
and tcpdump at microsoft.com during Dec. 20, 2000 and
Jan. 11, 2002
Results


False positive rate is between 5 – 30%
Likely candidates for lossy links:



links crossing an inter-AS boundary
links having a large delay (e.g. transcontinental links)
links that terminate at clients
88
Tutorial Outline


Background
Web Workload Characterization








Performance Diagnosis



Motivation
Data Analyses and fittings
Understanding the limitations
Content dynamics
Access dynamics
Case Studies
Synthetic workload generation
Infer causes of high end-to-end delay in Web transfers
Infer causes of high end-to-end loss in Web transfers
Applications of traces
89
Part II: Web Workload Characterization

Overview




Content dynamics







File size distribution
File update patterns
Access dynamics


Process of trace analyses
Common analysis techniques & tools
Challenges in workload characterization
File popularity distribution
Temporal stability
Spatial locality
Browser sessions: length & arrival pattern
Common pitfalls
Case studies

Boston University client log study, UW proxy log study, MSNBC
server log study, a mobile log study
90
Cache Consistency Algorithms




Problem formulation
Factors to consider
Approaches
Use of traces
91
Major Findings
Observations
Implications
Top 1% notification
objects account for 5464% of total messages.
Delivering notifications
via multicast would be
effective.
Notification exhibits
geographical locality.
Useful to provide
localized notification
services.
92
Major Findings (Cont.)
Observations
Implications
0.1% - 0.5% full urls (i.e.
121-442) account for 90%
requests.
Caching the results of
popular queries would be
very effective.
Cache a stable set of
popular query results or
optimize query performance
based on a stable workload.
The set of popular urls
remain stable.
Limited correlation between Service providers cannot
users’ browsing and
solely rely on users’
notification pattern.
notification profile to
predict how much & what
they will browse.
93
Types of Web Traces

Flow level traces

<srcIP, destIP, nextIP, input, output, pkts, bytes,
sTime, fTime, srcPort, destPort, pad, flags, prot,
tos, srcAs, destAs, srcMask, destMask, seq>
94
Views from Clients


Capture clients’ requests to all servers
Pros



Know details of client activities, such as requests
satisfied by browser caches, client aborts
The ability to record detailed information, as
this does not impose significant load on a client
browser
Cons


Need to modify browser software
Hard to deploy for a large number of clients
95
Views from Web Servers


Capture most clients’ requests (excluding
those satisfied by caches) to a single server
Pros


Relatively easy to deploy/change logging
software
Cons


Requests satisfied by browser & proxy caches
will not appear in the logs
May not log detailed information to ensure fast
processing of client requests
96
Views from Web Proxies

Depending on the proxy’s location



Pros




A proxy close to clients see requests from a a small
client group to a large number of servers [KR00]
A proxy close to the servers see requests from a large
client group to a small number of servers [KR00]
See requests from a diverse set of clients to a diverse
set of servers, and determine the popularity ranking of
different Web sites
Useful for studying caching policies
Ease of collection
Cons


Requests satisfied by browser caches will not appear in
the logs
May not log detailed information to ensure fast
97
processing of requests
Web Protocols (Cont.)

DNS [AL01]


TCP [JK88]


An application layer protocol responsible for
translating hostname to IP and vice versa (e.g.,
msn.com  207.68.172.246)
A transport layer protocol that does error control
and flow control
Hypertext Transfer Protocol (HTTP)

HTTP 1.0 [BLFF96]



The most widely used HTTP version
A “Stop and wait” protocol
HTTP 1.1 [GMF+99]

Adds persistent connections, pipelining, caching,
content negotiation, …
98

internal-tutorial

Transcript internal-tutorial

Directory