Web Protocols and Practice
Chapter 10: Web Workload Characterization
Topics
Web Workload Definition
Workload Characterization
Statistics and Probability Distributions
HTTP Message Characteristics
Web Resource Characteristics
User Behavior Characteristics
Applying Workload Models
Web Workload Definition
Important performance metrics, such as user-perceived latency and server throughput, depend on the interaction of numerous protocols and software components.
A workload consists of the set of all inputs a system receives over a period of time.
Web workload models are used to generate request traffic for comparing the performance of different proxy and server implementations.
Developing a workload model involves three
main steps:
Identifying the important workload parameters
Analyzing measurement data to quantify these parameters
Validating the model against reality
Constructing a workload model requires an
understanding of statistical techniques for
analyzing measurement data and representing
the key properties of Web traffic.
Key properties of Web workloads are:
HTTP message characteristics
Resource characteristics
User behavior
Workload Characterization
A workload model consists of a collection of
parameters that represent the key features of
the workload that affect the resource allocation
and system performance.
A workload model can be applied to a variety of performance evaluation tasks, such as the following:
Identifying performance problems
Benchmarking Web components
Capacity planning
There are several approaches to constructing workload models:
Trace-driven workload
» Constructs requests directly from an existing log or
trace
» Reproduces a known workload
» Avoids the intermediate step of analyzing the traffic
» Does not provide flexibility for experimenting with changes to the workload
» Provides no clear separation between the load and the performance
Stress testing
» Sends requests as fast as possible to evaluate a
proxy or a server under heavy load
» May not represent realistic traffic patterns
Synthetic Workload
» Derives from an explicit mathematical model that can be inspected, analyzed, and criticized
» Represents the key properties of real Web traffic
» Explores system performance in a controlled manner
by changing the parameters associated with each
probability distribution
To ensure that a workload model is
representative of real workloads, the parameters
of the model should have certain properties:
Decoupling from underlying system
Proper level of detail
Independence from other parameters
(Table 10.1)
Table 10.1. Examples of Web workload parameters

Protocol: request method, response code
Resource: content type, resource size, response size, popularity, modification frequency, temporal locality, number of embedded resources
Users: session interarrival times, number of clicks per session, request interarrival times
Statistics and Probability Distributions
Statistics such as the mean, median, and
variance capture the basic properties of many
workload parameters.
The mean is the average value of the parameter.
The median is the middle value: half of the values are smaller than the median and the other half are larger.
The variance (or standard deviation) quantifies how much the parameter varies from the average value.
For a sequence of 4100, 4700, 4200, 20,000, and 4000 bytes:
mean size = 7400 bytes
median size = 4200 bytes
For a sequence of 4100, 4700, 4200, 4800, and 4000 bytes:
mean size = 4360 bytes
median size = 4200 bytes
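As an illustrative sketch (not part of the original slides), these statistics can be checked with Python's standard statistics module:

```python
# Illustrative check of the slide's first example sequence.
import statistics

sizes = [4100, 4700, 4200, 20000, 4000]   # response sizes in bytes

print(statistics.mean(sizes))     # -> 7400 bytes (mean size)
print(statistics.median(sizes))   # -> 4200 bytes (median size)
print(statistics.pstdev(sizes))   # population standard deviation around the mean
```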
Probability distributions capture how a
parameter varies over a wide range of values.
For a sequence of 4100, 4700, 4200, 20,000, and 4000 bytes:
F(x) = P(X <= x)
Example of a cumulative distribution function (CDF)
For a sequence of 4100, 4700, 4200, 20,000, and 4000 bytes:
Fc(x) = P(X > x) = 1 − F(x)
Figure 10.1. Example of a complementary cumulative distribution function (CCDF)
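A small sketch (assumed helper code, not from the book) shows how F(x) and Fc(x) can be computed empirically for this sequence:

```python
# Empirical CDF and CCDF for the example response sizes.
sizes = [4100, 4700, 4200, 20000, 4000]
n = len(sizes)

def cdf(x):
    # F(x) = P(X <= x): fraction of observations no larger than x
    return sum(1 for s in sizes if s <= x) / n

def ccdf(x):
    # Fc(x) = P(X > x) = 1 - F(x)
    return 1.0 - cdf(x)

for x in (4000, 4200, 5000, 20000):
    print(x, cdf(x), ccdf(x))
```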
Several probability distributions have been
widely applied to workload characterization.
One of the most popular probability distributions
is the exponential distribution with the form
f(x) = λ e^(−λx)
mean E(x) = 1/λ
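As a hedged sketch, exponential values with rate λ can be generated and checked against the mean 1/λ (the rate value below is just an assumption):

```python
# Sample from an exponential distribution and check that the
# empirical mean is close to 1/lambda.
import random

lam = 0.5                                          # assumed rate parameter
samples = [random.expovariate(lam) for _ in range(100_000)]
print(sum(samples) / len(samples))                 # should be close to 1/lam = 2.0
```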
Relating a measured distribution to an equation
requires justifying the hypothesis that the
equation is capable of accurately representing
the measured data.
Justifying this hypothesis consists of two key
steps:
The measured data is fit to the equation to determine the value of each parameter.
Statistical tests are performed to compare the resulting equation with the measured data.
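One way to carry out these two steps, assuming SciPy is available and using synthetic data as a stand-in for real measurements, is sketched below:

```python
# Step 1: fit the candidate distribution to the measured data.
# Step 2: run a goodness-of-fit test against the fitted distribution.
import numpy as np
from scipy import stats

measured = np.random.exponential(scale=2.0, size=5000)   # placeholder for real measurements

loc, scale = stats.expon.fit(measured, floc=0)            # step 1: estimate 1/lambda
stat, p_value = stats.kstest(measured, "expon", args=(loc, scale))  # step 2: Kolmogorov-Smirnov test
print(f"fitted mean 1/lambda = {scale:.3f}, KS p-value = {p_value:.3f}")
```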
In some cases, no single well-known distribution
matches the measured data.
It may be necessary to represent different parts
of the measured distribution with different
equations.
HTTP Message Characteristics
HTTP Request Methods
HTTP Response Codes
HTTP Request Methods
Knowing which request methods arise in
practice is useful for optimizing server
implementation and developing realistic
benchmarks for evaluating Web proxies and
servers.
Traffic characteristics:
The overwhelming majority of Web requests use the GET method to fetch resources and invoke scripts.
A small fraction of HTTP requests use the POST method to submit data in forms.
Measurements show a small number of HEAD
requests to test an operational Web server.
Web Distributed Authoring and Versioning (WebDAV) makes frequent use of the PUT and DELETE methods.
The emergence of tools for testing and debugging
Web components may increase the use of the
TRACE method.
The exact distribution of request methods varies
from site to site.
HTTP Response Codes
Knowing how servers respond to client requests
is an important part of constructing a realistic
model of Web workloads.
Traffic characteristics:
200 OK: for 75% to 90% of responses
304 Not Modified: for 10% to 30% of responses
The other redirection (3xx) codes and the client error (4xx) codes are the next most common.
206 Partial Content: may become more common
when the server returns a range of bytes from the
requested resource
302 Found: used for redirection responses; its frequency varies from site to site
Web Resource Characteristics
Content Type
Resource Size
Response Size
Resource Popularity
Modification Frequency (Resource Changes)
Temporal Locality
Number of Embedded Resources
Understanding the characteristics of Web
resources is an important part of modeling Web
workload.
Resources vary in terms of:
How big they are
How popular they are
How often they change
Characteristics of Web resources are:
Content type
Resource size
Response size
Resource popularity
Modification frequency (Resource changes)
Temporal locality
Number of embedded resources
Content type
Content type has a direct relationship to other
key workload parameters, such as resource size
and modification frequency.
Traffic characteristics:
The overwhelming majority of resources are text content (plain and HTML) and images (JPEG and GIF).
The remaining content types include documents such as PostScript and PDF, software such as JavaScript or Java applets, and audio and video data.
The emergence of new applications can influence the distribution of content types.
Resource Sizes
The sizes of Web resources affect:
The storage requirements at the origin server
The overhead of caching resources at browsers
and proxies
The load on the network
The latency in delivering the response message
Traffic characteristics:
The average resource size is relatively small
» Average size of an HTML file: 4 to 8 KB
» Median size of an HTML file: 2 KB
» Average size of an image: 14 KB
Knowing the distribution of resource sizes at
Web sites is useful for deciding how to allocate
memory or disk space at a server or proxy.
The high variability in resource size is captured
by the Pareto distribution
f(x) = (α/x)(k/x)^α, for x ≥ k
mean E(x) = αk/(α−1), for α > 1
α is a shape parameter
k is a scale parameter
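A minimal sketch of drawing heavy-tailed resource sizes from this Pareto distribution by inverse-transform sampling (the α and k values are assumptions):

```python
# Draw Pareto-distributed resource sizes: P(X > x) = (k/x)**alpha for x >= k.
import random

def pareto_size(alpha: float, k: float) -> float:
    u = 1.0 - random.random()            # uniform in (0, 1], avoids division by zero
    return k / (u ** (1.0 / alpha))      # invert the CCDF

alpha, k = 1.2, 1000.0                   # shape > 1, so the mean alpha*k/(alpha-1) exists
sizes = [pareto_size(alpha, k) for _ in range(100_000)]
print(sum(sizes) / len(sizes))           # compare with alpha*k/(alpha-1) = 6000
```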
Figure 10.2. Exponential and Pareto distributions (with mean of 1)
Figure 10.3. Exponential and Pareto distributions on a logarithmic scale
Figure 10.4. Lognormal distribution
Response Sizes
In analyzing server and network performance, the size of response messages is a more important factor than the size of resources.
Traffic characteristics:
Response sizes may differ from resource sizes for
a variety of reasons:
» Some HTTP response messages do not have a
message body.
» Some Web resources are never requested and do
not contribute to the set of response messages.
» Some responses are aborted before they complete,
resulting in shorter transfers.
The median of the response size distribution is
several hundred bytes smaller than the median
resource size.
Response sizes can be represented by a
combination of the lognormal and Pareto
distributions.
Response size distribution has a heavy tail.
Some factors suggest that the distribution of
response sizes is not the same as the distribution
of resource sizes.
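A sketch of the lognormal-body/Pareto-tail combination mentioned above; the cutoff, tail probability, and parameter values are all assumptions for illustration:

```python
# Hybrid response-size generator: lognormal body, Pareto tail.
import random

def response_size(p_tail=0.05, mu=8.0, sigma=1.0, alpha=1.2, cutoff=10_000.0):
    if random.random() < p_tail:
        u = 1.0 - random.random()               # Pareto tail above the cutoff
        return cutoff / (u ** (1.0 / alpha))
    return random.lognormvariate(mu, sigma)     # lognormal body (log-mean mu, log-stddev sigma)

print([int(response_size()) for _ in range(10)])
```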
Resource Popularity
The popularity of the various resources at a Web
site has important performance implications.
The most popular resources are likely to reside
in main memory at the origin server, obviating
the need to fetch the data from disk.
Traffic characteristics:
Popularity is measured in terms of the proportion
of requests that access a particular resource
The probability mass function (pmf) P(r) captures the proportion of requests directed to each resource.
The proportion of requests for a resource follows
Zipf’s Law:
P(r) = k r^(−1)
r is the rank of an object
k is a constant that ensures that P(r) sums to 1.
(Figure 10.5)
Figure 10.5. Zipf’s law
More generally, a Zipf-like distribution has the form
P(r) = k r^(−c)
for some constant c.
» The extreme case of c = 0 corresponds to all resources having equal popularity.
» Early studies of requests to Web servers found c
values close to 1.
» More recent studies show values for c in the range of
0.75 to 0.95
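A sketch of a Zipf-like popularity distribution over N resources, with the number of resources and the exponent c chosen arbitrarily:

```python
# Zipf-like popularity: P(r) = k * r**(-c), normalized to sum to 1.
import random

N, c = 1000, 0.8
weights = [r ** (-c) for r in range(1, N + 1)]        # unnormalized popularity by rank
k = 1.0 / sum(weights)                                # normalizing constant
pmf = [k * w for w in weights]

# Draw a short synthetic request stream by rank (rank 1 = most popular).
requests = random.choices(range(1, N + 1), weights=pmf, k=20)
print(pmf[0], requests)
```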
Resource Changes
Web resources change over time as a result of
modifications at the origin server.
Modifications to resources affect the
performance of Web caching.
Resources that change less often may be given
preference in caching or revalidated with the
origin server less frequently.
Traffic characteristics:
Images do not change very often
Text and HTML files change more often than
images
Some resources change in a periodic fashion:
News stories
The Expires header could indicate the next time
that a cached resource would change.
Accurate timing information in the HTTP
response message can reduce the load on the
origin server as well as the user-perceived
latency for accessing the resource.
An accurate model of Web workloads needs to consider the frequency of resource changes.
Temporal Locality
The time between successive requests for the same resource has a significant effect on Web traffic.
Resource popularity indicates the frequency of
requests without indicating the spacing between
the requests.
Temporal locality captures the likelihood that a
requested resource will be requested again in
the near future.
Testing a server with a benchmark that has low
temporal locality would underestimate the
potential throughput.
High temporal locality also increases the
likelihood that a request is satisfied by a browser
or proxy cache and reduces the likelihood that a
resource has changed since the previous
access.
Traffic characteristics:
Temporal locality can be measured by sequencing through the stream of requests, putting each request at the top of a stack, and noting the position in the stack (the stack distance) of the previous access to each resource.
A small stack distance suggests high temporal locality.
The stack distances for requests for a resource
follow a lognormal distribution.
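A sketch of this stack-distance measurement, using a move-to-front list and a made-up request trace:

```python
# Measure temporal locality via stack distances (move-to-front list).
def stack_distances(trace):
    stack, distances = [], []
    for resource in trace:
        if resource in stack:
            distances.append(stack.index(resource) + 1)  # 1 = re-referenced immediately
            stack.remove(resource)
        else:
            distances.append(None)                       # first reference: no distance
        stack.insert(0, resource)                        # move (or push) to the top
    return distances

trace = ["a.html", "logo.gif", "a.html", "b.html", "logo.gif", "a.html"]
print(stack_distances(trace))                            # small values => high temporal locality
```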
Number of Embedded Resources
Embedded resources include images,
JavaScript programs, and other HTML files that
appear as frames in the containing Web page.
The number of embedded references in a Web
page has significant impact on the server and
network load.
Traffic characteristics:
Web pages have a median of 8 to 20 embedded
resources.
The distribution has high variability, following the
Pareto distribution.
The number of embedded images has tended to increase over time as more users have high-bandwidth connections to the Internet.
A large number of embedded resources does
not necessarily translate into a large number of
requests to the Web server:
A cached copy of an embedded resource may be available.
Some embedded images do not reside at the
same Web server as the containing Web page.
User Behavior Characteristics
Web workload characteristics depend on the
behavior of users as they download Web pages
from various sites.
The workload introduced by a single user can be
modeled at three levels:
Session
» The series of requests by a single user to a single
Web site could be viewed as a logical session.
Click
» A user performs one or more clicks to request Web
pages.
Request
» Each click triggers the browser to issue an HTTP
request for a resource.
Each session arrival brings a new user to the
site.
The client may establish a new TCP connection
for a request or send a request on an existing
TCP connection.
Session arrivals can be studied by considering
the time between the start of one user session
and the start of the next user session.
The session arrival times follow an exponential
distribution.
Exponential interarrival times correspond to a Poisson process, in which users arrive independently of one another.
The exponential distribution is not an accurate
model of interarrival times of TCP connections
and HTTP requests.
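A sketch of generating session start times as a Poisson process, i.e. from exponentially distributed interarrival times (the arrival rate is an assumption):

```python
# Session arrivals as a Poisson process: exponential interarrival times.
import random

def session_start_times(rate_per_sec, n_sessions):
    t, starts = 0.0, []
    for _ in range(n_sessions):
        t += random.expovariate(rate_per_sec)   # exponential gap to the next session
        starts.append(round(t, 2))
    return starts

print(session_start_times(rate_per_sec=0.2, n_sessions=5))
```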
A workload model that assumes that HTTP requests arrive as a Poisson process would underestimate the likelihood of heavy-load periods and would overestimate the potential performance of the Web server.
The number of clicks associated with user
sessions has considerable influence on the load
on a server.
The number of clicks follows a Pareto
distribution, suggesting that some sessions
involve a much larger number of clicks than
others.
The time between successive requests (the request interarrival time) by each user has important implications for the server and network load.
The time between the downloading of one page
and its embedded images and the user’s next
click is referred to as think time or quiet time.
The characteristics of user think times influence
the effectiveness of policies for closing
persistent connections.
Most interrequest times are less than 60
seconds.
The think times follow a Pareto distribution with a heavy tail, with α around 1.5.
Heavy-tailed distributions apply to numerous
properties of Web traffic:
Resource sizes
Response sizes
The number of embedded references in a Web
page
The number of clicks per session
The time between successive clicks
A Web session can be modeled as a sequence
of on/off periods, in which each on period
corresponds to downloading a Web page and its
embedded images and each off period
corresponds to the user’s think time.
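A sketch of this on/off session structure, with Pareto think times (α around 1.5, as noted above); the remaining parameter values are assumptions:

```python
# One user session as alternating on (page download) and off (think time) periods.
import random

def simulate_session(n_clicks, think_alpha=1.5, think_scale=2.0):
    events = []
    for click in range(n_clicks):
        n_embedded = int(random.paretovariate(1.3))                  # heavy-tailed embedded count
        events.append(("on", click, n_embedded))
        think = think_scale * random.paretovariate(think_alpha)      # off period in seconds
        events.append(("off", click, round(think, 2)))
    return events

for event in simulate_session(n_clicks=3):
    print(event)
```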
The durations of the on and off periods both follow heavy-tailed distributions.
The load on Web servers and the network
exhibits a phenomenon known as self similarity,
in which the traffic varies dramatically on a
variety of time scales from microseconds to
several minutes.
Applying Workload Models
A deeper understanding of Web workload
characteristics can drive the creation of a
workload model for evaluating Web protocols
and software components.
Generating synthetic traffic involves sampling
the probability distribution associated with each
workload parameter.
(Table 10.2)
Table 10.2. Probability distributions in Web workload models

Exponential: session interarrival times
Pareto: response sizes (tail of distribution), resource sizes (tail of distribution), number of embedded images, request interarrival times
Lognormal: response sizes (body of distribution), resource sizes (body of distribution), temporal locality
Zipf-like: resource popularity
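Tying these together, the sketch below samples the distributions in Table 10.2 to emit a tiny synthetic request stream; every numeric parameter value is an assumption chosen only for illustration:

```python
# Minimal synthetic-workload generator based on the distributions in Table 10.2.
import random

N_RESOURCES, ZIPF_C = 500, 0.8
popularity = [r ** (-ZIPF_C) for r in range(1, N_RESOURCES + 1)]      # Zipf-like popularity

def generate_session(start_time):
    t, requests = start_time, []
    n_clicks = max(1, int(random.paretovariate(1.5)))                 # Pareto: clicks per session
    for _ in range(n_clicks):
        rank = random.choices(range(1, N_RESOURCES + 1), weights=popularity, k=1)[0]
        size = int(random.lognormvariate(8.0, 1.0))                   # lognormal: response-size body
        requests.append((round(t, 2), rank, size))
        t += random.paretovariate(1.5)                                # Pareto: think time to next click
    return requests

t, workload = 0.0, []
for _ in range(3):                                                    # three user sessions
    t += random.expovariate(0.1)                                      # exponential session interarrivals
    workload.extend(generate_session(t))
print(workload)
```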
Generating synthetic traffic that accurately
represents a real workload is very challenging.
Validation of the synthetic workload model is an
important step in constructing and using a
workload model.
Validation is different from verification:
Verification involves testing that the synthetic
traffic has the statistical properties embodied in
the workload model.
Validation requires demonstrating that the
performance of a system subjected to the
synthetic workload matches the performance of
the same system under a real workload,
according to some predefined performance
metrics.
Synthetic workload models are also used to test
servers over a range of scenarios that might not
have happened in practice.
Generating synthetic traffic provides an
opportunity to evaluate a proxy or server in a
controlled manner.
Web performance depends on the interaction
between user behavior, resource characteristics,
server load, and network dynamics.
Synthetic workloads help address the need to
evaluate and compare Web software
components in a controlled manner.