Mining Web Traces:
Workload Characterization, Performance
Diagnosis, and Applications
Lili Qiu
Microsoft Research
Internal Talk
November 2002
Motivation
Why do we care about Web traces?
Content providers
How do users come to visit the Web site?
Why do users leave the Web site? Is poor
performance the cause for this?
Where are the performance bottlenecks?
What content are users interested in?
How does users’ interest vary over time?
How does users’ interest vary across different geographical regions?
Motivation (Cont.)
Web hosting companies
Accounting & billing
Server selection
Provisioning server farms: where to place servers
How to save bandwidth by deploying proxy caches?
ISPs
Traffic engineering & provisioning
Researchers
Where are the performance bottlenecks?
How to improve Web performance?
Examples: traffic measurements have influenced the design of HTTP (e.g., persistent connections and pipelining) and TCP (e.g., initial congestion window)
Outline
Background
Web workload characterization
Performance diagnosis
Applications of traces
Bibliography
Part I: Background
Web software components
Web semantic components
Web protocols
Types of Web traces
Web Software Components
Web clients: an application that establishes connections to send Web requests
E.g., Mosaic, Netscape Navigator, Microsoft IE
Web servers: an application that accepts connections to service requests by sending back responses
E.g., Apache, Microsoft IIS
Web proxies (optional)
Web replicas (optional)
[Figure: Web clients connect through proxies across the Internet to proxies, replicas, and Web servers]
Web Semantic Components
Uniform Resource Identifier (URI)
An identifier for a Web resource
Name of protocol: http, https, ftp, …
Name of the server
Name of the resource on the server
e.g., http://www.foobar.com/info.html
Hypertext Markup Language (HTML)
Platform-independent styles (indicated by markup tags) that define the various components of a Web document
Hypertext Transfer Protocol (HTTP)
Defines the syntax and semantics of messages exchanged between Web software components
Example of a Web Transaction
1. DNS query
DNS
server
2. Setup TCP connection
3. HTTP request
4. HTTP response
Browser
Web server
Internet Protocol Stack
Application layer: application programs
(HTTP, Telnet, FTP, DNS)
Transport layer: error control + flow control
(TCP,UDP)
Network layer: routing
(IP)
Datalink layer: handle hardware details
(Ethernet, ATM)
Physical layer: moving bits
(coaxial cable, optical fiber)
HTTP Protocol
Hypertext Transfer Protocol (HTTP)
HTTP 1.0 [BLFF96]
The most widely used HTTP version
A “Stop and wait” protocol
HTTP 1.1 [GMF+99]
Adds persistent connections, pipelining,
caching, content negotiation, …
HTTP 1.0
HTTP request
Request = Simple-Request | Full-Request
Simple-Request = "GET" SP Request-URI CRLF
Full-Request = Request-Line
*( General-Header | Request-Header | Entity-Header )
CRLF
[ Entity-Body ]
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
Method = "GET" | "HEAD" | "POST" | extension-method
Example: GET /info.html HTTP/1.0
HTTP 1.0 (Cont.)
HTTP response
Response = Simple-Response | Full-Response
Simple-Response = [ Entity-Body ]
Full-Response = Status-Line
*( General-Header | Response-Header | Entity-Header )
CRLF
[ Entity-Body ]
Example:
HTTP/1.0 200 OK
Date: Mon, 09 Sep 2002 06:07:53 GMT
Server: Apache/1.3.20 (Unix) (Red-Hat/Linux) PHP/4.0.6
Last-Modified: Mon, 29 Jul 2002 10:58:59 GMT
Content-Length: 21748
Content-Type: text/html
This is the document content …
<21748 bytes of the current version of info.html>
HTTP 1.1
Connection management
Persistent connections [Mogul95]
Use one TCP connection for multiple HTTP requests
Pros:
Reduce the overhead of connection setup and teardown
Avoid TCP slow start
Cons:
Head-of-line blocking
Increased state at servers
Pipelining [Pad95]
Send multiple requests without waiting for a response between requests
Pros: avoid the round-trip delay of waiting for each response
Cons: connection aborts are harder to deal with
HTTP 1.1 (Cont.)
Caching
Continues to support the notion of expiration used in HTTP 1.0
Adds a Cache-Control header to handle the issues of cacheability and semantic transparency [KR01]
E.g., no-cache, only-if-cached, no-store, max-age, max-stale, min-fresh, …
Others
Range request
Content negotiation
Security
…
Types of Web Traces
Application level traces
Server logs: CLF and ECLF formats
CLF format
<remoteIP,remoteID,usrName,Time,request,responseCode,
contentLength>
e.g., 192.1.1.1, -, -, 8/1/2000, 10:00:00, “GET
/news/index.asp HTTP/1.1”, 200, 3410
Proxy logs: CLF and ECLF formats
Client logs: no standard logging formats
Packet level traces
Collection method: monitor a network link
Available tools: tcpdump, libpcap, netmon
Concerns: packet dropping, timestamp accuracy
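As a rough sketch, a log line in the comma-separated layout of the example above can be parsed as follows (this follows the slide's example layout; real CLF is space-delimited, so the regular expression would need adjusting):

```python
import re

# Fields per the slide: <remoteIP, remoteID, usrName, date, time,
# "request", responseCode, contentLength>
LINE_RE = re.compile(
    r'\s*(?P<ip>[^,]+),\s*(?P<ident>[^,]+),\s*(?P<user>[^,]+),'
    r'\s*(?P<date>[^,]+),\s*(?P<time>[^,]+),'
    r'\s*"(?P<request>[^"]*)",\s*(?P<status>\d+),\s*(?P<length>\d+)\s*')

def parse_log_line(line):
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("malformed log line: %r" % line)
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["length"] = int(rec["length"])
    # Split the request field into method, URI, and protocol version.
    method, uri, version = rec["request"].split()
    rec.update(method=method, uri=uri, version=version)
    return rec

rec = parse_log_line('192.1.1.1, -, -, 8/1/2000, 10:00:00, '
                     '"GET /news/index.asp HTTP/1.1", 200, 3410')
```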
Tutorial Outline
Background
Web workload characterization
Performance diagnosis
Applications of traces
Bibliography
Part II: Web Workload Characterization
Overview of workload
characterization
Content dynamics
Access dynamics
Common pitfalls
Case studies
Overview of Workload
Characterization
Process of trace analyses
Common analysis techniques
Common analysis tools
Challenges in workload characterization
Process of Trace Analyses
Collect traces
where to monitor, how to collect (e.g., efficiency,
privacy, accuracy)
Determine key metrics to characterize
Process traces
Draw inferences from the data
Apply the traces or insights gained from the
trace analyses to design better protocols &
systems
Common Analysis Techniques – Statistics
Mean
Median
Geometric mean: less sensitive to outliers
GM = (x_1 · x_2 · … · x_n)^(1/n) = e^(E[log x])
Variance and standard deviation
var(x) = (1/N) · Σ_{i=1}^{N} (x_i − μ)², std(x) = sqrt(var(x))
Confidence interval
A range of values that has a specified probability of containing the parameter being estimated
Example: 95% confidence interval: 10 ≤ x̄ ≤ 20
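The statistics above can be sketched in a few lines (the 95% confidence interval here uses the normal approximation with z = 1.96, an assumption that holds only for reasonably large samples):

```python
import math

def summary_stats(xs):
    """Mean, median, geometric mean, variance, std, and a 95% CI for the mean."""
    n = len(xs)
    xs_sorted = sorted(xs)
    mean = sum(xs) / n
    mid = n // 2
    median = xs_sorted[mid] if n % 2 else (xs_sorted[mid - 1] + xs_sorted[mid]) / 2
    # Geometric mean via exp(E[log x]) -- less sensitive to outliers.
    gmean = math.exp(sum(math.log(x) for x in xs) / n)
    var = sum((x - mean) ** 2 for x in xs) / n
    std = math.sqrt(var)
    # Normal-approximation 95% confidence interval for the mean (z = 1.96).
    half = 1.96 * std / math.sqrt(n)
    return {"mean": mean, "median": median, "gmean": gmean,
            "var": var, "std": std, "ci95": (mean - half, mean + half)}

stats = summary_stats([1.0, 2.0, 4.0, 8.0, 1000.0])
```

Note how the single outlier (1000) pulls the mean far above the geometric mean and the median.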
Common Analysis Techniques –
Statistics (Cont.)
Cumulative distribution function (CDF)
Points: (x, P(X ≤ x))
Probability density function (PDF)
Derivative of CDF: f(x) = dF(x)/dx
Check for heavy-tailed distribution
Log-log complementary plot, and check its tail
Example: Pareto distribution
F(x) = 1 − (a/x)^α, where α, a > 0 and x ≥ a
If α ≤ 2, the distribution has infinite variance (a heavy tail)
If α ≤ 1, the distribution has infinite mean
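A minimal sketch of the heavy-tail check described above: draw samples from a Pareto distribution by inverse-transform sampling, then estimate the tail index α from the slope of the empirical complementary CDF on a log-log scale (the sample size, seed, and parameters are illustrative):

```python
import math, random

def pareto_sample(alpha, a, n, rng):
    # Inverse-transform sampling from F(x) = 1 - (a/x)**alpha.
    return [a / (1.0 - rng.random()) ** (1.0 / alpha) for _ in range(n)]

def tail_index(xs):
    xs = sorted(xs)
    n = len(xs)
    # Empirical complementary CDF on a log-log scale: CCDF ~ x**(-alpha),
    # so the least-squares slope of these points is -alpha.
    pts = [(math.log(x), math.log(1.0 - i / n))
           for i, x in enumerate(xs) if i > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    return -(sum((x - mx) * (y - my) for x, y in pts) /
             sum((x - mx) ** 2 for x, _ in pts))

rng = random.Random(42)
alpha_hat = tail_index(pareto_sample(alpha=1.5, a=1.0, n=20000, rng=rng))
```

On real trace data the fit should be restricted to the tail of the plot, since the body of the distribution need not follow the power law.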
Common Analysis Techniques –
Data Fitting
Visually compare two distributions
Chi-squared tests [AS86, Jain91]
Divide the data points into k bins
Compute X² = Σ_{i=1}^{k} (x_i − E_i)² / E_i
If X² ≤ χ²(α, k−c), the two distributions are close, where α is the significance level, and c is the number of estimated parameters for the distribution + 1
Need enough samples
Kolmogorov-Smirnov tests [AS86, Jain91]
Compare two distributions by finding the maximum difference between the two variables’ cumulative distribution functions
Need to fully specify the distribution
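The two-sample Kolmogorov-Smirnov statistic, i.e. the maximum vertical gap between two empirical CDFs, can be computed directly without a statistics package (a sketch; the lookup against critical values for significance is omitted):

```python
# Merge-scan over both sorted samples, tracking the gap between the two
# empirical CDFs after each step; ties advance both sides together.
def ks_statistic(xs, ys):
    xs, ys = sorted(xs), sorted(ys)
    nx, ny = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < nx and j < ny:
        if xs[i] < ys[j]:
            i += 1
        elif ys[j] < xs[i]:
            j += 1
        else:
            v = xs[i]
            while i < nx and xs[i] == v:
                i += 1
            while j < ny and ys[j] == v:
                j += 1
        d = max(d, abs(i / nx - j / ny))
    return d

d_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])      # identical samples
d_disjoint = ks_statistic([1, 2, 3], [10, 11, 12])     # non-overlapping samples
```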
Common Analysis Techniques –
Data Fitting (Cont.)
Anderson-Darling Test [Ste74]
Modification of the Kolmogorov-Smirnov test, giving more
weight to the tails
A² = −N − S, where
S = Σ_{i=1}^{N} ((2i − 1)/N) · [ln F(Y_i) + ln(1 − F(Y_{N+1−i}))]
If A² ≤ the critical value, the two distributions are similar; otherwise they are not (F is the CDF, and Y_i are the ordered data)
Quantile-quantile plots [AS86, Jain91]
Compare two distributions by plotting the inverse of the cumulative distribution function, F⁻¹(x), for the two variables, and find the best-fitting line
If the slope of the line is close to 1, and the y-intercept is close to 0, the two data sets are almost identically distributed
Common Analysis Tools
Scripting languages
VB, Perl, awk, UNIX shell scripts, …
Databases
SQL, DB2, …
Statistics packages
Matlab, S+, R, SAS, …
Write our own low-level programs
C, C++, C#, …
Challenges in Workload
Characterization
[Figure: Web clients connect through proxies across the Internet to proxies, replicas, and servers]
Workload characteristics vary both in space and in time
Each of the Web components provides a limited perspective on the functioning of the Web
Workload Variation
Vary with measurement points
Vary with sites being measured
Information servers (news sites), e-commerce servers, query servers, streaming servers, upload servers
US vs. Europe, …
Vary with the clients being measured
Internet clients vs. wireless clients
University clients vs. home users
US vs. Europe, …
Vary in time
Day vs. night
Weekday vs. weekend
Changes with new applications, recent events
Evolve over time, …
Different Web Components’ Views
View from clients
Know details of client activities, such as requests satisfied by browser caches, and client aborts
The ability to record detailed information, as this does not impose significant load on a client browser
View from servers
Requests satisfied by browser & proxy caches will not appear in the logs
May not log detailed information, to ensure fast processing of client requests
View from proxies
Depending on the proxy’s location
A proxy close to clients sees requests from a small client group to a large number of servers [KR00]
A proxy close to servers sees requests from a large client group to a small number of servers [KR00]
Requests satisfied by browser caches or proxy caches encountered earlier will not appear in the logs
Part II: Web Workload
Overview
Content dynamics
Access dynamics
Common pitfalls
Case studies
Content Dynamics
File types
File size distribution
File update patterns
How often files are updated
How much files are updated
File Types
Text files
HTML, plain text, …
Images
JPEG, GIF, bitmap, …
Applications
JavaScript, cgi, asp, pdf, ps, gzip, ppt, …
Multimedia files
Audio, video
…
File Size Distribution
Two definitions
D1: Size of all files on a Web server
D2: Size of all files transferred by a Web server
D1 ≠ D2, because some files can be transferred multiple times or not to completion, and other files are not transferred at all
Studies show that the distribution of file sizes under both definitions exhibits heavy tails
(i.e., P[F > x] ~ x^(−α), 0 < α ≤ 2)
File Update Interval
Varies in time
Hot events and fast-changing events require more frequent updates, e.g., the World Cup
Varies across sites
Depending on server update policies & update tools
Depending on the nature of the content (e.g., university sites have slower update rates than news sites)
Recent studies
Study of the proxy traces collected at DEC and AT&T in
1996 showed the rate of change depended on content type,
top-level domains etc. [DFK+97]
Study of 1999 MSNBC logs shows that modification
history yields a rough predictor of future modification
interval [PQ00]
Extent of Change upon
Modifications
Varies in time
Varies across sites
Different events trigger different amount of updates
Depending on servers’ update policies and update tools
Depending on the nature of the content
Recent studies
Studies of the 1996 DEC and AT&T proxy logs [MDF+97] and the 1999 MSNBC logs [PQ00] show that most file modifications are small → delta encoding can be very useful
Part II: Web Workload
Motivation
Limitations of workload
measurements
Content dynamics
Access Dynamics
Common pitfalls
Case studies
Access Dynamics
File popularity distribution
Temporal stability
Spatial locality
User request arrivals & durations
Document Popularity
Web requests follow a Zipf-like distribution
Request frequency ∝ 1/i^α, where i is a document’s ranking
The value of α depends on the point of measurement
Between 0.6 and 1 for client traces and proxy traces
Close to or larger than 1 for server traces [ABC+96, PQ00]
The value of α varies over time (e.g., it is larger during hot events)
[Figure: measured α values at clients/proxies, at less popular servers, and at MSNBC, on a scale from 0 to 2]
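A sketch of the Zipf-like model above: generate requests with frequency proportional to 1/i^α, then recover α by least-squares on log(frequency) vs. log(rank) over the most popular documents (all parameter values are illustrative):

```python
import math, random

def zipf_ranks(alpha, n_docs, n_reqs, rng):
    # Draw document ranks with probability proportional to 1/i**alpha.
    weights = [1.0 / (i ** alpha) for i in range(1, n_docs + 1)]
    return rng.choices(range(1, n_docs + 1), weights=weights, k=n_reqs)

def estimate_alpha(ranks, top=100):
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    # Re-rank documents by observed popularity, then fit the top of the curve.
    freqs = sorted(counts.values(), reverse=True)[:top]
    pts = [(math.log(i + 1), math.log(f)) for i, f in enumerate(freqs)]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    slope = (sum((x - mx) * (y - my) for x, y in pts) /
             sum((x - mx) ** 2 for x, _ in pts))
    return -slope    # log(freq) = const - alpha * log(rank)

rng = random.Random(7)
alpha_hat = estimate_alpha(
    zipf_ranks(alpha=1.0, n_docs=5000, n_reqs=200000, rng=rng))
```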
Impact of the α Value
Larger α means more concentrated accesses on popular documents → caching is more beneficial
90% of the accesses are accounted for by
The top 36% of files in proxy traces [BCF+99, PQ00]
The top 10% of files in small departmental server logs reported in [AW96]
The top 2-4% of files in MSNBC traces
[Figure: percentage of requests vs. percentage of documents (sorted by popularity) for the 12/17/98 server traces, 10/06/99 proxy traces, and 08/01/99 server traces]
Temporal Stability
Metrics
Coarse-grained: likely duration that a current
popular file remains popular
e.g., overlap between the set of popular documents on
day 1 and day 2
Fine-grained: how soon a requested file will be
requested again
e.g., LRU stack distance [ABC+96]
Example: with the stack [File 5, File 4, File 3, File 2, File 1] (most recent on top), a request for File 2 has stack distance 4, and the stack becomes [File 2, File 5, File 4, File 3, File 1]
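The LRU stack distance computation illustrated above can be sketched directly (the linear scan per request is fine for illustration; real trace analysis would use a tree-based structure for efficiency):

```python
# A request's stack distance is the 1-based depth of the item in the LRU
# stack (top = most recently used); the item then moves to the top.
# First-time references get distance None (a cold miss).
def lru_stack_distances(refs):
    stack = []          # stack[0] is the most recently used item
    dists = []
    for r in refs:
        if r in stack:
            dists.append(stack.index(r) + 1)
            stack.remove(r)
        else:
            dists.append(None)
        stack.insert(0, r)
    return dists

# The slide's example: File 2 sits 4 deep, so its distance is 4.
dists = lru_stack_distances([1, 2, 3, 4, 5, 2])
```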
Spatial Locality
Refers to whether users in the same geographical location or at the same organization tend to request a similar set of content
E.g., compare the degree of requests locally
shared
Spatial Locality (Cont.)
[Figure: fraction of requests shared vs. domain ID, on a normal day and during a hot event, for domain-based clustering vs. random clustering]
Domain membership is significant, except when there is a “hot” event of global interest
User Request Arrivals & Duration
User workload at three levels
Session: a consecutive series of requests from a user to a Web site
Click: a user action to request a page, submit a form, etc.
Request: each click generates one or more HTTP requests
Session arrivals follow an exponential distribution [LNJV99, KR01]
Session duration
# clicks in a session, most in the range of 4-6 [Mah97]
Think time: time between clicks; heavy-tail distribution [KR01]
Active time: time to download a Web page and its embedded images
# embedded references in a Web page
Common Pitfalls
Misconception: trace analysis is all about writing scripts & plotting nice graphs
Challenges
Understanding the limitation of the traces
Trace collection: where to monitor, how to collect (e.g.,
efficiency, privacy, accuracy)
Identify important metrics, and understand why they are
important
Sound measurements require discipline [Pax97]
Dealing with errors and outliers
Drawing implications from the data analyses
No representative traces: workload changes in time and in space
Try to diversify data sets (e.g., collect traces at different places and different sites) before jumping to conclusions
Avoid drawing inferences beyond what the data show
Part II: Web Workload
Motivation
Limitations of workload measurements
Content dynamics
Access dynamics
Common pitfalls
Case studies
Boston University client log study
UW proxy log study
MSNBC server log study
MSN Mobile server log study
Case Study I: BU Client Log Study
Overview
One of the few client log studies
Analyze clients’ browsing pattern and their impact on
network traffic [CBC95]
Approaches
Trace collection
Modify Mosaic and distribute it to machines in CS Dept.
at Boston Univ. to collect client traces in 1995
Log format: <client machine, request time, user id, URI,
document size, retrieval time>
Data analyses
Distribution of document size, document popularity
Relationship between retrieval latency and response
size
Implications on caching strategies
Major Findings
Power law distributions
Distribution of document sizes
Distribution of user requests for documents
# requests to documents as a function of their
popularity
Caching strategies should take document size into account (i.e., give preference to smaller documents)
Case Study II: UW Proxy Log Study
Overview
Proxy traces collected at the University of
Washington
Approaches [WVS+99a, WVS+99b]
Trace collection: deploy a passive network
sniffer between the Univ. of Washington and the
rest of the Internet in May 1999
Set well-defined objectives
Understand the extent of document sharing within an
organization and across different organizations
Understand the performance benefit of cooperative
proxy caching
46
Major Findings
Members of an organization are more likely
to request the same documents than a
random set of clients
Most popular documents are globally popular
Cooperative caching is most beneficial for
small organizations
Cooperative caching among large organizations yields minor improvement, if any
Case Study III: MSNBC Server
Log Study
Overview of MSNBC server site
a large news site
server cluster with 40 nodes
25 million accesses a day (HTML content alone)
Period studied: Aug. – Oct. 1999, plus the Dec. 17, 1998 flash crowd
Approaches
Trace collection
HTTP access logs
Content Replication System (CRS) logs
HTML content logs
Data analyses
Content dynamics
How often are files modified?
How can the modification interval be predicted?
How much does a file change upon modification?
Access dynamics
Document popularity
Temporal stability
Spatial locality
Correlation between document age and popularity
Major Findings
Content dynamics
Modification history is a rough guide for setting TTLs, but an alternative mechanism (e.g., callback-based invalidation) is needed as a backup
Frequent but minimal file modifications → delta encoding
Access dynamics
The set of popular files remains stable for days → push/prefetch previously hot data that have undergone modifications
Domain membership has a significant bearing on client accesses, except during a flash crowd of global interest → it makes sense to have a proxy cache per organization
Zipf-like distribution of file popularity, but with a much larger α than at proxies → potential for reverse caching and replication
Case Study IV: Mobile Server Log
Study
Overview of a popular commercial Web site for
mobile clients
Content
news, weather, stock quotes, email, yellow pages, travel reservations, entertainment, etc.
Services
Notification
Browse
Period studied
3.25 million notifications in Aug. 20 – 26, 2000
33 million browse requests in Aug. 15 – 26, 2000
Approaches
Analyze by user categories
Cellular users: browse the Web in real time using cellular technologies
Offline users: download content onto their PDAs for later (offline) browsing, e.g., AvantGo
Desktop users: sign up for services and specify preferences
Analyze by Web services
Notifications
Browse
Use a SQL database to manage the data
Major Findings
Notification services
Popularity of notification messages follows a Zipf-like distribution, with the top 1% of notification objects responsible for 54-64% of total messages → multicast notifications
Notifications exhibit geographical locality → useful to provide localized notification services
Browse services
0.1% - 0.5% of URLs account for 90% of requests → cache the results of popular queries
The set of popular URLs remains stable → cache a stable set of queries, or optimize queries based on a stable workload
Correlation between the two services
The correlation is limited → influences the design of pricing plans
Tutorial Outline
Background
Web Workload
Performance Diagnosis
Applications of traces
Part III: Performance Diagnosis
Overview of performance diagnosis
Infer the causes of high end-to-end delay in
Web transfers [BC00]
Infer the causes of high end-to-end loss
rate in Web transfers
[CDH+99,DPP+01,NC01,PQ02, PQW02]
Overview of Performance Diagnosis
Goal: determine trouble spot locations
[Figure: a Web client wonders “Why so slow?” about a path to a Web server crossing many ISPs (Sprint, AT&T, MCI, UUNET, Qwest, AOL, Earthlink)]
Metrics of interest
Delay
Loss rate
Raw bandwidth
Available bandwidth
Traffic rate
Why interesting
Resolve the trouble spots
Server selection
Placement of mirror servers
Finding the Sources of Delays
Goal
Why is my Web transfer slow? Is it because of
the server or the network or the client?
Sources of delay in Web transfer
DNS lookup
Server delays
Client delays
Network delays
Propagation delays
Queuing delays
Delays introduced by packet losses (e.g., signaled by
the fast retransmit mechanism or TCP timeouts)
TCPEval Tool
Inputs: “tcpdump” packet traces taken at the
communicating Web server and client
Generates a variety of statistics for file
transactions
File and packet transfer latencies
Packet drop characteristics
Packet and byte counts per unit time
Generates both timeline and sequence plots for
transactions
Generates critical path profiles and statistics for
transactions
Critical Path Analysis Tool [BC00]
[Figure: timelines of the data flow between client and server, with the critical path segments attributed to server delay, client delay, network delay, and network delay due to packet loss]
Finding Sources of Packet Losses
Goal
Identify lossy links
[Figure: a tree rooted at the server, with link loss rates l1 … l8 and observed client loss rates p1 … p5]
(1-l1)*(1-l2)*(1-l4) = (1-p1)
(1-l1)*(1-l2)*(1-l5) = (1-p2)
…
(1-l1)*(1-l3)*(1-l8) = (1-p5)
An under-constrained system of equations
Measurement errors
Approaches
Active probing
Probing
Multicast probes
Striped unicast probes
Technique: Expectation Maximization (EM)
A numerical algorithm to compute the θ that maximizes P(D|θ), where D are the observations, and θ is the ensemble of link loss rates
[Figure: multicast vs. striped unicast probes from source S to receivers A and B]
Approaches (Cont.)
Passive monitoring
Random sampling
Randomly sample the solution space, and draw conclusions based on the samples
Linear optimization
Determine a unique solution by optimizing an objective function
Gibbs sampling
Akin to Monte Carlo sampling
Determine P(θ|D) by drawing samples, where θ is the ensemble of loss rates of links in the network, and D is the observed packet transmission and loss at the clients
EM
A numerical algorithm to compute the θ that maximizes P(D|θ)
Other Performance Studies using
Web traces
Characterize Internet performance (e.g.,
spatial & temporal locality) [BSS+97]
Study the behavior of TCP during Web
transfers [BPS+98]
Reconstruct different client page accesses
and measure performance characteristics
for the accesses [FCT+02]
Tutorial Outline
Background
Web Workload
Performance Diagnosis
Applications of traces
Bibliography
Part IV: Applications of Traces
Synthetic workload generation
Cache design
Cache replacement policies [CI97, BCF+99]
Cache consistency algorithms [LC97, YBS99, YAD+01]
Cooperative cache or not [WVS+99]
Cache infrastructure
…
Pre-fetching algorithms [CB98, FJC+99]
Placement of Web proxies/replicas [QPV01]
Improving TCP for Web transfers [Mah97, PK98, ZQK00]
Other optimizations
Concurrent downloads, pipelining, compression, …
Synthetic Workload Generation
Generate user requests
Generate user sessions using a Poisson arrival
process
For each user session, determine # clicks using a
Pareto distribution
Assign a click to a request for a Web page, while
making sure
The popularity distribution of files follows a Zipf-like
distribution [BC98]
Capture the temporal locality of successive requests
for the same resource
Generate a next click from the same user with
think time following a Pareto distribution
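The generation steps above can be sketched as follows (the distribution parameters are illustrative assumptions, not values from any trace):

```python
import random

def pareto(rng, alpha, xmin=1.0):
    # Inverse-transform sample from a Pareto distribution with scale xmin.
    return xmin / (1.0 - rng.random()) ** (1.0 / alpha)

def generate_workload(n_sessions, n_pages, zipf_alpha, rng):
    weights = [1.0 / (i ** zipf_alpha) for i in range(1, n_pages + 1)]
    requests = []                      # (timestamp, page_id) pairs
    t = 0.0
    for _ in range(n_sessions):
        t += rng.expovariate(1.0)      # Poisson session arrivals
        n_clicks = max(1, int(pareto(rng, alpha=1.5)))   # heavy-tailed # clicks
        now = t
        for _ in range(n_clicks):
            page = rng.choices(range(n_pages), weights=weights)[0]  # Zipf-like
            requests.append((now, page))
            now += pareto(rng, alpha=1.5)                # heavy-tailed think time
    return requests

reqs = generate_workload(n_sessions=200, n_pages=100, zipf_alpha=1.0,
                         rng=random.Random(0))
```

This sketch omits the temporal-locality step (biasing successive clicks toward recently requested pages), which a full generator such as SURGE models explicitly.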
Synthetic Workload Generation
(Cont.)
Generate Web pages
Determine the number of Web pages
Generate the size of each Web page using a log-normal distribution
Associate a page with some number of embedded pages using an empirical (heavy-tailed) distribution
Generate file modification events
Examples of generators
Webbench [Wbe], WebStone [TS95], SURGE [BC98], SPECweb99 [SP99], Web Polygraph [WP], …
Cache Replacement Policies
Problem formulation
Given a fixed size cache, how to evict pages to
maximize the hit ratio once the cache is full?
Hit ratio
Fraction of requests satisfied by the cache
Fraction of the total size of requested data satisfied
by the cache
Factors to consider
Request frequency
Modification frequency
Benefit of caching: reduction in latency & BW
Cost of caching: storage
Caveat: NOT all hits are equal. Hit ratios do NOT
map directly to performance improvement.
Cache Replacement Policies (Cont.)
Approaches
Least recently used (LRU)
Least frequently used (LFU)
Perfect: maintain counters for all pages seen
In-cache: maintain counters only for pages that are in cache
GreedyDual-Size [CI97]
Assign a utility value to each object, and replace the one with the lowest utility
Use of traces
Evaluate the algorithms using trace-driven
simulations or synthetic workload
Analytically derive the hit ratios for different
replacement policies based on a workload model
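A sketch of GreedyDual-Size as described in [CI97]: each object gets a utility H = L + cost/size, the object with the smallest H is evicted, and L is inflated to the evicted object's H so that older entries age out (the uniform cost function used here makes the policy favor small objects):

```python
def greedydual_size(requests, capacity, cost=lambda size: 1.0):
    """requests: iterable of (object_id, size) pairs. Returns the hit count."""
    cache = {}          # object_id -> [H value, size]
    used = 0
    L = 0.0             # inflation value: H of the most recently evicted object
    hits = 0
    for obj, size in requests:
        if obj in cache:
            hits += 1
            cache[obj][0] = L + cost(size) / size   # refresh utility on a hit
            continue
        if size > capacity:
            continue                                # can never fit in the cache
        while used + size > capacity:
            victim = min(cache, key=lambda o: cache[o][0])
            L = cache[victim][0]                    # inflate L to the evicted H
            used -= cache[victim][1]
            del cache[victim]
        cache[obj] = [L + cost(size) / size, size]
        used += size
    return hits

hits = greedydual_size([("a", 1), ("b", 2), ("a", 1)], capacity=4)
```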
Placement of Web Proxies/Replicas
Problem formulation [JJK+01,QPV01]
How to place a fixed number of proxies/replicas
to minimize users’ request latency
Factors to consider
Spatial distribution of requests
Temporal stability of requests
Stability in popularity of objects
Stability in spatial distribution of requests
Placement of Web Proxies/Replicas
(Cont.)
Approaches
Greedy placement
Hot-spot placement
Random placement
Use of traces
Trace-driven simulations
High concentration of requests on a small number of objects → focus on replicating only popular objects
Temporal stability in requests → no need to frequently change the locations of proxies/replicas
References
[AS86] R. B. D’Agostino and M. A. Stephens. Goodness-of-Fit Techniques.
Marcel Dekker, New York, NY 1986.
[ABC+96] Virgilio Almeida, Azer Bestavros, Mark Crovella and Adriana de
Oliveria. Characterizing reference locality in the WWW. In Proceedings of
1996 International Conference on Parallel and Distributed Information
Systems (PDIS'96), December 1996.
[ABQ01] A. Adya, P. Bahl, and L. Qiu. Analyzing Browse Patterns of Mobile
Clients. In Proc. of SIGCOMM Measurement Workshop, Nov. 2001.
[ABQ02] A. Adya, P. Bahl, and L. Qiu. Characterizing Alert and Browse
Services for Mobile Clients. In Proc. of USENIX, Jun. 2002.
[AL01] P. Albitz, and C. Liu. DNS and BIND (4th Edition), O’Reilly &
Associates, Apr. 2001.
[AW97] M. Arlitt and C. Williamson. Internet Web Servers: Workload
Characterization and Performance Implications. IEEE/ACM Transactions on
Networking, Vol. 5, No. 5, pp. 631-645, October 1997.
[BC98] P. Barford and M. Crovella. Generating representative workloads for
network and server performance evaluation. In Proc. of SIGMETRICS, 1998.
References (Cont.)
[BBC+98] P. Barford, A. Bestavros, M. Crovella, and A. Bradley. Changes in
Web Client Access Patterns: Characteristics and Caching Implications, Special
Issue on World Wide Web Characterization and Performance Evaluation; World
Wide Web Journal, December 1998.
[BCF+99] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching
and Zipf-like Distributions: Evidence and Implications. In Proc. of INFOCOM,
Mar. 1999.
[BC00] P. Barford and M. Crovella. Critical Path Analysis of TCP Transactions.
In Proc. of ACM SIGCOMM, Aug. 2000.
[BLFF96] T. Berners-Lee, R. Fielding, and H. Frystyk. Hypertext Transfer
Protocol -- HTTP/1.0. RFC 1945, May 1996.
[BPS+98] H. Balakrishnan, V. N. Padmanabhan, S. Seshan, M. Stemm and R. H.
Katz. TCP Behavior of a Busy Internet Server: Analysis and Improvements. In
Proc. IEEE Infocom, San Francisco, CA, USA, March 1998.
[BSS+97] H. Balakrishnan, S. Seshan, M. Stemm, and R. H. Katz. Analyzing
Stability in Wide-Area Network Performance. In Proc. of SIGMETRICS, Jun.
1997.
References (Cont.)
[CDH+99] R. Caceres, N. G. Duffield, J. Horowitz, D. Towsley, T. Bu.
Multicast-Based Inference of Network Internal Loss Characteristics. In Proc.
Infocom, Mar. 1999.
[CB98] M. Crovella and P. Barford. The network effects of prefetching. In
Proc. of INFOCOM, 1998.
[CBC95] C. R. Cunha, A. Bestavros, and M. E. Crovella. Characteristics of
WWW client-based traces. Technical Report BU-CS-95-010, CS Dept., Boston
University, 1995.
[CI97] P. Cao and S. Irani. Cost-Aware WWW proxy caching algorithms. In
Proc. of USITS, Dec. 1997.
[DFK+97] F. Douglis, A. Feldmann, B. Krishnamurth, and J. Mogul. Rate of
change and other metrics: a live study of the World Wide Web. In Proc. of
USITS, 1997.
[DPP+01] N. G. Duffield, F. Lo Presti, V. Paxson, D. Towsley. In Proc.
Infocom, Apr. 2001.
[FCD+99] A. Feldmann, R. Caceres, F. Douglis, and M. Rabinovich. Performance of Web Proxy Caching in Heterogeneous Bandwidth Environments. In Proc. of INFOCOM, March 1999.
References (Cont.)
[FJC+99] L. Fan, Q. Jacobson, P. Cao and W. Lin. Web Prefetching Between
Low-Bandwidth Clients and Proxies: Potential and Performance. In Proc. of
SIGMETRICS, 1999.
[FCT+02] Y. Fu, L. Cherkassova, W. Tang, and A. Vahdat. EtE: Passive End-to-End Internet Service Performance Monitoring. In Proc. of USENIX, Jun. 2002.
[GMF+99] J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee. Hypertext Transfer Protocol – HTTP/1.1. RFC 2616, Jun. 1999.
[JK88] V. Jacobson, M. J. Karels. Congestion Avoidance and Control. In Proc.
SIGCOMM, Aug. 1988.
[JJK+01] S. Jamin, C. Jin, A. R. Kurc, D. Raz, and Y. Shavitt. Constrained
Mirror Placement on the Internet. In Proc. of INFOCOM, Apr. 2001.
[Jain91] R. Jain. The Art of Computer Systems Performance Analysis. John
Wiley and Sons, 1991.
[Kel02] T. Kelly. Thin-Client Web Access Patterns: Measurements from a
Cache-Busting Proxy. Computer Communications, Vol. 25, No. 4 (March 2002),
pages 357-366.
[KR01] B. Krishnamurthy and J. Rexford. Web Protocols and Practice: HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement. Addison-Wesley, May 2001.
References (Cont.)
[LC97] C. Liu and P. Cao. Maintaining Strong Cache Consistency in the World-Wide Web. In Proc. of ICDCS'97, pp. 12-21, May 1997.
[LNJV99] Z. Liu, N. Niclausse, and C. Jalpa-Villaneuva. Web Traffic Modeling
and Performance Comparison Between HTTP 1.0 and HTTP 1.1. In Erol
Gelenbe, editor, System Performance Evaluation: Methodologies and
Applications. CRC Press, Aug. 1999.
[Mah97] Bruce Mah. An empirical model of HTTP network traffic. In Proc. of
INFOCOM, April 1997.
[Mogul95] Jeffrey C. Mogul. The Case for Persistent-Connection HTTP. In
Proc. SIGCOMM '95, pages 299-313. Cambridge, MA, August, 1995.
[MDF+97] J. C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy.
Potential benefits of delta-encoding and data compression for HTTP, In Proc. of
SIGCOMM, September 1997.
[NC01] R. Nowak and M. Coates. Unicast Network Tomography using the EM
algorithm. Submitted to IEEE Transactions on Information Theory, Dec. 2001
[Pad95] V. N. Padmanabhan. Improving World Wide Web Latency. Technical
Report UCB/CSD-95-875, University of California, Berkeley, May 1995.
References (Cont.)
[PQ00] V. N. Padmanabhan and L. Qiu. The Content and Access Dynamics of a
Busy Web Server. In Proc. of SIGCOMM, Aug. 2000.
[PQ02] V. N. Padmanabhan and L. Qiu. Network Tomography using Passive
End-to-End Measurements, DIMACS on Internet and WWW Measurement,
Mapping and Modeling, Feb. 2002.
[PQW02] V. N. Padmanabhan, L. Qiu, and H. J. Wang. Passive Network
Tomography using Bayesian Inference. Internet Measurement Workshop, Nov.
2002.
[QPV01] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the Placement of
Web Server Replicas. In Proc. of INFOCOM, Apr. 2001.
[SP99] SPECWeb99 Benchmark. http://www.spec.org/osg/web99/.
[Pax98] V. Paxson. An Introduction to Internet Measurement and Modeling.
SIGCOMM’98 tutorial, August 1998.
[Ste74] M. A. Stephens. EDF Statistics for Goodness of Fit and Some
Comparison. Journal of the American Statistical Association, Vol. 69, pp. 730 –
737.
[TS95] G. Trent and M. Sake. WebStone: The First Generation in HTTP Server
Benchmarking, Feb. 1995. http://www.mindcraft.com/webstone/paper.html.
References (Cont.)
[Wbe] Webbench.
http://www.zdnet.com/etestinglabs/stories/benchmarks/0,8829,2326243,00.html.
[WP] Web Polygraph: Proxy performance benchmark.
http://polygraph.ircache.net/.
[WVS+99a] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, and H. Levy. Organization-Based Analysis of Web-Object Sharing and Caching. In Proc. of the Second USENIX Symposium on Internet Technologies and Systems, Boulder, CO, October 1999.
[WVS+99b] A. Wolman, G. M. Voelker, N. Sharma, N. Cardwell, A.
Karlin, and H. M. Levy. On the scale and performance of cooperative
Web proxy caching. In Proc. of the 17th ACM Symposium on Operating
Systems Principles, Kiawah Island, SC, Dec. 1999.
[YAD01] J. Yin, L. Alvisi, M. Dahlin, A. Iyengar. Engineering server-driven consistency for large scale dynamic services.
[YBS99] H. Yu, L. Breslau, and S. Shenker. A Scalable Web Cache Consistency
Architecture. In Proc. of SIGCOMM, August 1999.
Acknowledgement
Thanks to Alec Wolman for his helpful comments.
Thank you!
http://www.research.microsoft.com/~liliq/talks/internal-web-perf.ppt
Web Protocols
[Figure: protocol layering of a Web transfer, taken from [KR01]: HTTP messages are carried in TCP segments, TCP segments in IP packets, and IP packets over Ethernet and SONET links through intermediate routers]
#1: Random Sampling
Randomly sample the solution space
Repeat this several times
Draw conclusions based on overall statistics
How to do random sampling?
determine loss rate bound for each link using
best downstream client
iterate over all links:
pick loss rate at random within bounds
update bounds for other links
Problem: little tolerance for estimation error
[Figure: example tree topology — server at the root, links l1 – l8,
clients at the leaves observing path loss rates p1 – p5]
82
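The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: it assumes loss rates have been log-transformed so that a path's loss is additive over its links, and the three-link tree, the path-loss values, and the top-down link ordering are all made up for illustration.

```python
import random

random.seed(1)

# Hypothetical tree: server --link0--> branch --link1--> client A,
#                                      branch --link2--> client B.
# Loss rates are log-transformed so a path's loss is the sum of its
# links' losses; the per-path values below are made up.
paths = {"A": [0, 1], "B": [0, 2]}
path_loss = {"A": 0.20, "B": 0.01}   # client A sits behind a lossy last hop

def one_sample():
    """Draw one random assignment of link losses consistent with all paths."""
    assigned = {}
    for link in (0, 1, 2):           # top-down order: shared link first
        bounds = []
        for client, links in paths.items():
            if link in links:
                # Loss already assigned to other links on this client's path.
                used = sum(assigned.get(k, 0.0) for k in links)
                bounds.append(path_loss[client] - used)
        # Bound comes from the best (least lossy) downstream client.
        assigned[link] = random.uniform(0.0, max(min(bounds), 0.0))
    return assigned

# Repeat many times and draw conclusions from overall statistics.
N = 2000
avg = [0.0, 0.0, 0.0]
for _ in range(N):
    s = one_sample()
    for i in range(3):
        avg[i] += s[i] / N
print(avg)   # link 1 stands out; links 0 and 2 are capped by client B's low loss
```

Averaged over many samples, link 1 (client A's last hop) receives most of the loss, while the shared link is capped by client B's low observed loss, which is the intuition behind using the best downstream client as a bound.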
#2: Linear Optimization
1) Convert the constraints into linear
constraints using a log transform
Li = log(1/(1-li)), Pj = log(1/(1-pj))
(1-l1)*(1-l2)*(1-l4) = (1-p1)  =>  L1+L2+L4 = P1
2) Add slack variables to account for errors
L1+L2+L4 = P1  =>  L1+L2+L4 + S1 = P1
minimize w*(sum of Li) + (sum of |Sj|)
subject to
L1+L2+L4 + S1 = P1
L1+L2+L5 + S2 = P2
…
L1+L3+L8 + S5 = P5
Goals
Parsimonious explanation
Robust to error in client loss rate
estimates
[Figure: same tree topology — server at the root, links l1 – l8,
clients at the leaves with path loss rates p1 – p5]
83
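The log transform in step 1 can be checked numerically. A tiny sketch, with made-up per-link loss rates:

```python
import math

# L = log(1/(1-l)) turns the multiplicative path relation
# (1-l1)(1-l2)(1-l4) = (1-p1) into the linear one L1+L2+L4 = P1.
def to_log(loss):
    return math.log(1.0 / (1.0 - loss))

l1, l2, l4 = 0.02, 0.05, 0.10                  # illustrative per-link loss rates
p1 = 1 - (1 - l1) * (1 - l2) * (1 - l4)        # end-to-end loss seen by client 1

lhs = to_log(l1) + to_log(l2) + to_log(l4)     # L1 + L2 + L4
rhs = to_log(p1)                               # P1
print(abs(lhs - rhs) < 1e-9)                   # True: the linear constraint holds
```

With the constraints linear, the objective can be handed to any off-the-shelf LP solver; the weight w trades off parsimony (small total link loss) against slack (tolerated measurement error).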
# 3: Gibbs Sampling
D: observed packet transmissions and losses at the clients
θ: ensemble of loss rates of links in the network
Goal
determine the posterior distribution P(θ|D)
Approach
Use Markov Chain Monte Carlo with Gibbs sampling to
obtain samples from P(θ|D)
Draw conclusions based on the samples
84
# 3: Gibbs Sampling (Cont.)
Applying Gibbs sampling to network
tomography
1) Initialize link loss rates arbitrarily
2) For j = 1 : warmup
for each link i
sample li from P(li | D, {li'})
where li is the loss rate of link i, and {li'} is the set
of loss rates of all links other than i
3) For j = 1 : realSamples
for each link i
sample li from P(li | D, {li'})
Use all the samples obtained at step 3 to
approximate P(θ|D)
85
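The procedure above can be sketched for a tiny hypothetical tree (one shared link plus one last-hop link per client). This is an illustrative sketch, not the actual implementation: it discretizes each link's loss rate onto a grid, assumes a uniform prior and independent per-packet (Bernoulli) losses, and the topology and probe/loss counts are all made up.

```python
import math
import random

random.seed(0)

# Hypothetical tree: server --link0--> branch --link1--> client A,
#                                      branch --link2--> client B.
paths = {0: [0, 1], 1: [0, 2]}     # client -> links on its path
probes = {0: 1000, 1: 1000}        # probes sent toward each client (made up)
losses = {0: 200, 1: 10}           # probes lost on each path (made up)

GRID = [i / 100 for i in range(100)]   # discretized candidate loss rates

def log_lik(rates):
    """Log-likelihood of the observed per-client losses given link loss rates."""
    ll = 0.0
    for c, links in paths.items():
        p_ok = 1.0
        for i in links:
            p_ok *= 1.0 - rates[i]             # packet survives every link
        p_ok = min(max(p_ok, 1e-9), 1.0 - 1e-9)
        ll += (probes[c] - losses[c]) * math.log(p_ok) \
            + losses[c] * math.log(1.0 - p_ok)
    return ll

def gibbs(n_links=3, warmup=100, real_samples=200):
    rates = [random.choice(GRID) for _ in range(n_links)]   # step 1: arbitrary init
    samples = []
    for sweep in range(warmup + real_samples):              # steps 2 and 3
        for i in range(n_links):
            # Conditional of link i given D and all other links: evaluate the
            # unnormalized posterior on the grid, then sample from it.
            logw = []
            for v in GRID:
                rates[i] = v
                logw.append(log_lik(rates))
            m = max(logw)
            w = [math.exp(x - m) for x in logw]
            r = random.random() * sum(w)
            acc = 0.0
            for v, wi in zip(GRID, w):
                acc += wi
                if acc >= r:
                    rates[i] = v
                    break
        if sweep >= warmup:                     # keep only post-warmup samples
            samples.append(list(rates))
    return samples

samples = gibbs()
means = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
print(means)   # link 1 (client A's lossy last hop) should dominate
```

The posterior correctly attributes client A's 20% loss to its last-hop link: client B's low loss bounds the shared link, so the samples concentrate the remaining loss on link 1.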
Simulation Experiments
Advantage: no uncertainty about link loss rate!
Methodology
Topologies used:
randomly generated: 20 – 3000 nodes, max degree = 5 – 50
real topology obtained by tracing paths to microsoft.com clients
Randomly generated packet loss events at each link:
A fraction f of the links are good, and the rest are “bad”
LM1: good links: 0 – 1%, bad links: 5 – 10%
LM2: good links: 0 – 1%, bad links: 1 – 100%
Goodness metrics:
Coverage: # correctly inferred lossy links
False positive: # incorrectly inferred lossy links
86
Random Topologies
[Figure: two bar charts for 1000-node random topologies (d=10), one with
f=0.95 and one with f=0.5, comparing Random, LP, and Gibbs on the number
of true lossy links, correctly identified lossy links, and false positives]

Technique        Coverage   False positive   Computation
Random           High       High             Low
LP               Modest     Low              Medium
Gibbs sampling   High       Low              High
87
Trace-driven Validation
Experimental setup
Real topologies and loss traces collected from traceroute
and tcpdump at microsoft.com between Dec. 20, 2000 and
Jan. 11, 2002
Validation approach
Divide client traces into two sets: tomography and validation
Tomography data set => loss inference
Validation set => check if clients downstream of the
inferred lossy links experience high loss
Results
False positive rate is between 5 – 30%
Likely candidates for lossy links:
links crossing an inter-AS boundary
links having a large delay (e.g., transcontinental links)
links that terminate at clients
88
Tutorial Outline
Background
Web Workload Characterization
  Motivation
  Data analyses and fittings
  Understanding the limitations
  Content dynamics
  Access dynamics
  Case studies
  Synthetic workload generation
Performance Diagnosis
  Infer causes of high end-to-end delay in Web transfers
  Infer causes of high end-to-end loss in Web transfers
Applications of traces
89
Part II: Web Workload Characterization
Overview
Content dynamics
  File size distribution
  File update patterns
Access dynamics
  File popularity distribution
  Temporal stability
  Spatial locality
  Browser sessions: length & arrival pattern
Process of trace analyses
  Common analysis techniques & tools
  Common pitfalls
Challenges in workload characterization
Case studies
  Boston University client log study, UW proxy log study, MSNBC
  server log study, a mobile log study
90
Cache Consistency Algorithms
Problem formulation
Factors to consider
Approaches
Use of traces
91
Major Findings
Observation: Top 1% of notification objects account for 54 – 64% of
total messages.
Implication: Delivering notifications via multicast would be effective.
Observation: Notification exhibits geographical locality.
Implication: Useful to provide localized notification services.
92
Major Findings (Cont.)
Observation: 0.1% – 0.5% of full URLs (i.e., 121 – 442) account for
90% of requests.
Implication: Caching the results of popular queries would be very
effective.
Observation: The set of popular URLs remains stable.
Implication: Cache a stable set of popular query results or optimize
query performance based on a stable workload.
Observation: Limited correlation between users’ browsing and
notification patterns.
Implication: Service providers cannot solely rely on users’
notification profiles to predict how much & what they will browse.
93
Types of Web Traces
Flow level traces
<srcIP, destIP, nextIP, input, output, pkts, bytes,
sTime, fTime, srcPort, destPort, pad, flags, prot,
tos, srcAs, destAs, srcMask, destMask, seq>
94
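A flow record like the tuple above maps naturally onto a small data structure. A minimal sketch in Python — the field names follow the slide, but the whitespace-separated text encoding and the sample values below are hypothetical, purely for illustration:

```python
from collections import namedtuple

# Field names taken from the slide's flow-record tuple.
FIELDS = ("srcIP destIP nextIP input output pkts bytes sTime fTime "
          "srcPort destPort pad flags prot tos srcAs destAs "
          "srcMask destMask seq")

FlowRecord = namedtuple("FlowRecord", FIELDS)

def parse_flow(line):
    """Parse one whitespace-separated flow record into a FlowRecord."""
    parts = line.split()
    if len(parts) != len(FlowRecord._fields):
        raise ValueError("expected %d fields, got %d"
                         % (len(FlowRecord._fields), len(parts)))
    return FlowRecord(*parts)

# Sample values are made up; prot=6 would denote TCP, destPort=80 HTTP.
rec = parse_flow("10.0.0.1 10.0.0.2 10.0.0.254 1 2 15 12000 100 200 "
                 "34567 80 0 0x1b 6 0 64512 64513 24 24 42")
print(rec.srcPort, rec.destPort, rec.prot)   # 34567 80 6
```

Named fields make downstream analyses (e.g., aggregating bytes by destAs, or filtering on prot/destPort) much less error-prone than indexing into a bare 20-element tuple.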
Views from Clients
Capture clients’ requests to all servers
Pros
Know details of client activities, such as requests
satisfied by browser caches, client aborts
The ability to record detailed information, as
this does not impose significant load on a client
browser
Cons
Need to modify browser software
Hard to deploy for a large number of clients
95
Views from Web Servers
Capture most clients’ requests (excluding
those satisfied by caches) to a single server
Pros
Relatively easy to deploy/change logging
software
Cons
Requests satisfied by browser & proxy caches
will not appear in the logs
May not log detailed information to ensure fast
processing of client requests
96
Views from Web Proxies
Depending on the proxy’s location:
A proxy close to clients sees requests from a small
client group to a large number of servers [KR00]
A proxy close to the servers sees requests from a large
client group to a small number of servers [KR00]
Pros
Sees requests from a diverse set of clients to a diverse
set of servers, and can determine the popularity ranking of
different Web sites
Useful for studying caching policies
Ease of collection
Cons
Requests satisfied by browser caches will not appear in
the logs
May not log detailed information to ensure fast
processing of requests
97
Web Protocols (Cont.)
DNS [AL01]
An application layer protocol responsible for
translating hostnames to IP addresses and vice versa
(e.g., msn.com -> 207.68.172.246)
TCP [JK88]
A transport layer protocol that does error control
and flow control
Hypertext Transfer Protocol (HTTP)
HTTP 1.0 [BLFF96]
The most widely used HTTP version
A “stop and wait” protocol
HTTP 1.1 [GMF+99]
Adds persistent connections, pipelining, caching,
content negotiation, …
98
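The version difference shows up directly in the request bytes. A sketch — the helper function below is hypothetical, but the header rules it encodes (mandatory Host: and default-on persistence in HTTP/1.1; opt-in Keep-Alive in HTTP/1.0) follow the RFCs:

```python
def build_request(version, host, path, keep_alive=True):
    """Build a raw GET request line plus headers for HTTP/1.0 or HTTP/1.1."""
    lines = ["GET %s HTTP/%s" % (path, version)]
    if version == "1.1":
        lines.append("Host: %s" % host)             # mandatory in HTTP/1.1
        if not keep_alive:
            lines.append("Connection: close")       # persistence is the default
    else:
        if keep_alive:
            lines.append("Connection: Keep-Alive")  # opt-in for HTTP/1.0
    lines.append("")                                # blank line ends the headers
    return "\r\n".join(lines) + "\r\n"

print(build_request("1.0", "www.foobar.com", "/info.html").splitlines()[0])
# GET /info.html HTTP/1.0
```

Because persistence is the HTTP/1.1 default, a client fetching a page with many embedded objects can reuse one TCP connection (and pipeline requests on it) instead of paying a connection setup per object, which is exactly the stop-and-wait cost HTTP/1.0 incurs.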