Web Performance and Errors
Workshop on Dependability of e-Business Systems
Internet Performance / Availability from an end-user perspective
Eric Siegel
[email protected]
KEYNOTE
The Internet Performance Authority
2855 Campus Drive
San Mateo, CA 94403
(650) 522-1000
www.keynote.com
Agenda
• The importance of performance
• A quick web-technology and Internet-technology tutorial
• Web page performance factors and benchmarks
• Transaction performance factors and benchmarks
• Performance measurement goals, technologies, and issues
• Load testing for web transactions
2
Performance Is Important!
“Twenty-eight percent of shoppers who have suffered failed performance
attempts said they stopped shopping at the web site where they had
problems, and six percent said they stopped buying at that particular
company’s off-line store.” (Boston Consulting Group, quoted in Infoworld /
Computerworld 3/00)
“It takes only 8 ½ seconds for half of the subjects to [give up]” (Peter
Bickford, “Worth the Wait?” in Netscape/View Source Magazine 10/97)
“Perhaps as much as $4.35 billion in e-commerce sales in the U.S. may be
lost each year due to unacceptable download speeds and resulting user
bailout behaviors.” (Zona Research 4/99)
“Fifty-eight percent of online customers surveyed indicated quick
download time as a key factor in determining whether they would return
to a web site.” (Forrester Research 1/99)
“One of the top three reasons cited by online shoppers for dissatisfaction
with a web site is slow site performance.” (Jupiter Communications / NFO
Worldwide 1/99)
“At one site, the abandonment rate fell from 30% to 6-8% because of a one
second improvement in load time.” (Zona Research 4/99)
3
Effects of Poor Performance
• Lost prospective customer
– If the site didn’t work, or took too long, your prospect may not
return for a long time – if ever.
• Lost sale
– If your competitor’s site was up and responsive, you may have
lost a single sale.
• Lost customer
– If this happens repeatedly, you’ve lost a customer,
– AND the customer may stop going to associated web sites and
physical locations!
• Lost reputation
– People talk about poor performance; word spreads.
– People are looking for a few good sites that they can trust!
4
E-Commerce Performance Challenges
• 24x7 availability and geographic distribution; expectation of
universal access
• A shared network resource
• No control over customers’ environment
• Multiple servers, which may be geographically distributed,
participate in a single user interaction
• Dynamic, complex content
• Poor support for session structures
• Potentially massive peak volumes
• Difficult to predict workload mix
5
An Instant Web Tutorial
• The Domain Name System (DNS), a worldwide hierarchy of
directories, translates www.fangdog.com into 10.9.23.22.
• TCP/IP carries the data between your browser and 10.9.23.22; it
detects errors and corrects them by retransmitting.
• The data consists of HTTP, HTML, and the page’s information.
• HTTP (Hypertext Transfer Protocol) carries the Hypertext Markup
Language (HTML) and provides the basic Web page commands:
– GET
– POST
– Query String (e.g., http://www.fangdog.com/filename?fur=matted )
• HTML describes the page:
– Formatting
– Content, and the servers/files (e.g., pix.fangdog.com/gifs/picture1.gif)
from which that content can be downloaded
– Links
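As a small illustration of the pieces named above, the following Python sketch splits the slide's example URL into the hostname that DNS must translate, the path requested with GET or POST, and the query string. The function name `describe_url` is hypothetical; only standard-library calls are used.

```python
from urllib.parse import urlsplit, parse_qs


def describe_url(url):
    """Split a URL into host, path, and query-string parameters."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,          # name that DNS must translate
        "path": parts.path,              # resource requested with GET or POST
        "query": parse_qs(parts.query),  # parameters carried in the query string
    }


info = describe_url("http://www.fangdog.com/filename?fur=matted")
print(info)
```

Each query parameter maps to a list because the same name may legally appear more than once in a query string.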
6
An Instant Internet Tutorial
[Figure: network diagram. Labels: Browser; Internet Access Provider (Access Devices, Routers, DNS, Cache); The Internet (Routers, Peering Point; PSInet, Digex, UUnet, BBN, Verio, GTE, Mindspring, Sprint, Worldcom); Web Server; Servers.]
Some of the additional servers provide third-party ads; others are distributed content providers.
7
Internet Routing Within An ISP
• Routers read every packet’s header and select an outgoing path for the next hop
– Each hop adds delay
– Routing information is imperfect
– Packets may use non-optimal paths
– Packets may loop
– Routers may not notice moderate congestion
• Packets can be discarded by routers or otherwise lost
– Noise in the communications link can corrupt packets, causing them to be discarded
– Hopelessly looping packets are discarded
– Temporary overflow of router buffers causes packet loss
– Severely overloaded routers tend to lose massive numbers of packets in waves
[Figure: routers within an ISP.]
8
Internet Routing Between ISPs (Peering)
• Internet Service Providers enter into legal contracts to carry each other’s traffic
– Traffic transfer between ISPs occurs at peering points
– Some peering points are public; e.g., MAE-EAST (and MAE-WEST!)
– Other peering points are privately arranged
– Peering philosophies differ among ISPs
• Congestion may occur at peering points, especially public ones!
– The primary inter-ISP “routing” protocol, BGP-4, usually does not look at congestion
• The end-to-end route in one direction is usually different from the end-to-end route in the other direction!
– Depends on legal and financial arrangements between ISPs, etc.
[Figure: routers in ISP “A” and ISP “B” connected through a peering point.]
9
Internet Access Providers, Caching, and
Distributed Content Providers
• End-users (customers!) connect to an Access Device maintained by an Internet Access Provider or by their corporate IS department
– Dial-in, xDSL from home
– LAN link at the office
• Access Device connects to routers and then to the Internet
• End-users convert hostname (www.com) into Internet address (10.9.23.22) by using Domain Name System (DNS) distributed directory
– A worldwide hierarchy of directories
– Controlled by authoritative record created by hostname’s owner
• Cache or Distributed Content system may also be locally available
[Figure: browser connected through the Internet Access Provider’s access devices and routers, with DNS and cache, to the Internet.]
10
Domain Name System
• The end-user’s browser asks the end-user’s local DNS server to
translate the hostname (www.fangdog.com) into an IP address
– The end-user’s local DNS server may be owned by the Access
Service Provider or by the end-user’s corporation
– It may be very close, or it may be geographically distant
• DNS servers retain translation information for a period of time
(“time to live”) controlled by the authoritative name server
– The authoritative name server is controlled by the name’s owner
• If a DNS server doesn’t have the information, it looks elsewhere in
the hierarchy
– It may need to go all the way to the authoritative name server
• Authoritative Name Servers can furnish multiple addresses, etc.
– For “round-robin” load balancing
• Some proprietary load balancers provide a DNS server function
– Can make a sophisticated choice of which IP address to return
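The caching and round-robin behavior described above can be sketched in a few lines of Python. This is a toy model, not a real DNS implementation: the class name `CachingResolver`, the injected `authoritative_lookup` callable, and the second address `10.9.23.23` are all assumptions made for illustration. In real round-robin DNS the rotation is usually performed by the name server when it orders its answer; here it is done in the cache to keep the example short.

```python
import itertools
import time


class CachingResolver:
    """Toy local DNS server: caches answers for their TTL and rotates addresses."""

    def __init__(self, authoritative_lookup, clock=time.monotonic):
        # authoritative_lookup(hostname) -> (list_of_addresses, ttl_seconds)
        self._lookup = authoritative_lookup
        self._clock = clock
        self._cache = {}  # hostname -> (expiry_time, cycling iterator of addresses)

    def resolve(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is None or now >= entry[0]:
            # Cache miss or expired entry: go back to the authoritative source.
            addresses, ttl = self._lookup(hostname)
            entry = (now + ttl, itertools.cycle(addresses))
            self._cache[hostname] = entry
        return next(entry[1])  # hand out addresses in round-robin order


# Demonstration with a fake authoritative server and a fake clock.
calls = []


def fake_authoritative(hostname):
    calls.append(hostname)
    return ["10.9.23.22", "10.9.23.23"], 300  # two addresses, 300-second TTL


fake_now = [0.0]
resolver = CachingResolver(fake_authoritative, clock=lambda: fake_now[0])
answers = [resolver.resolve("www.fangdog.com") for _ in range(3)]
print(answers, "authoritative lookups:", len(calls))
```

The time to live is chosen by the name's owner, so a short TTL lets the owner redirect traffic quickly at the cost of more lookups that travel all the way to the authoritative server.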
11
Caching and Distributed Content Providers
• Many ISPs install Caching systems that retain commonly requested web pages locally.
– Usually, only static, unchanging content is cached.
– The web page designer can try to influence the behavior of remote
caching devices.
– Caching decreases the amount of traffic that the ISP must pull
through adjacent ISPs.
– Caching improves speed to the users.
– Remember, the browser also caches content.
• Distributed Content Providers have constructed worldwide
systems of caching and content distribution devices.
– Usually in partnership with local ISPs
– These systems may also be able to handle streaming media and
some dynamically-generated web pages.
– In some cases, they use distribution systems (leased lines,
satellite, etc.) that completely bypass the Internet’s core.
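One common way for a page designer to try to influence remote caches is the standard HTTP `Cache-Control` response header. The sketch below is only an illustration: the function name `cache_headers` and the one-day default lifetime are assumptions, and a cache is free to apply its own policy.

```python
def cache_headers(is_static, max_age_seconds=86400):
    """Return HTTP response headers that hint how caches should treat a component.

    The one-day default for static content is an illustrative assumption.
    """
    if is_static:
        # Static, unchanging content: any cache may keep it for max_age_seconds.
        return {"Cache-Control": f"public, max-age={max_age_seconds}"}
    # Dynamically generated content: ask caches not to store it at all.
    return {"Cache-Control": "no-store"}


print(cache_headers(True))
print(cache_headers(False))
```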
12
Server Farm Architecture Summary
[Figure: server farm architecture. Labels, in order: Routers; Servers; Security Control; Load-sharing Devices; Web Servers (four); Database Back-End.]
13
A Definition of Performance
• Web e-commerce performance measures the user's experience
interacting with your web site, not your in-house experience or
the experience inside your web hosting center.
– Download time
– Transaction Time
    – banking, stock trading, purchasing
– Availability
– Errors
    – Failed connection attempts
    – Missing pages
    – Missing page components
    – Broken links
    – Transaction failure
    – Fulfillment failure (product delivery failure)
14
Web Performance Factors
• The web page seen by the browser is often generated from a
number of different sources:
– Ad servers
– Geographically-distributed content servers
– Static content servers
– Dynamic content servers
– Back-end databases
• Download performance is therefore affected by:
– Geographic location of the browser
– Congestion and latency between servers and browser
– Performance of load-distribution and load-sharing schemes
– Performance of the servers and their back-end databases
• For example...
15
Components of Web Page Download Time
[Figure: page download timeline with callouts: external ad server, Akamai server, application delay, redirection delay, slow downloading image.]
This page includes “Akamaized” content distribution and external ad servers
16
What Is “Good” Performance?
• Commonly-cited “Eight-Second Rule”
• But a better measure is your competitors’ performance
– What is your end-user’s frame of reference?
    – Competitors
    – Commonly-accessed consumer sites (Yahoo!, etc.)
    – (how will you measure these sites?)
– What does your end-user care about?
    – Browsing your catalog quickly
    – Placing orders quickly, without failures or delays
• The location of your end-user affects expectations.
– On a corporate LAN with high-speed access
– At home, on a 28.8k modem
– Using a laptop in the rain at a gas station in Milan
17
For Comparison...
Performance of Major Web Sites

                      Business Day                 24x7
Site                  Mean Download   Error Rate   Mean Download   Error Rate
Amazon in US          4.5 sec         1.1 %        3.7 sec         0.5 %
IBM in US             3.4 sec         4.8 %        2.5 sec         4.6 %
Yahoo in US           0.7 sec         0.3 %        0.6 sec         0.4 %
KB 40 in US           4.7 sec                      3.8 sec
KB 40 in Europe       7.3 sec                      6.5 sec
Euro 20 in Europe     6.8 sec                      4.0 sec

• Measured May 1 – 31, 2000 over high-speed links
• 67 measurement locations in U.S., 22 in Europe; each location measures every 15 min
• “Business Day” is 10:00 a.m. to 4:00 p.m. CDT or MET, M – F.
• Benchmark measurements available at www.keynote.com
18
Improving Web Page Performance
• Decrease the use of frames and Java.
• Avoid complex, deeply-nested tables.
• Decrease the number of individual page components.
• Decrease the size of each component; “thin” the images.
• Give the viewer something to look at while the page is loading; minimize
perceived delay.
• Consider using a distributed hosting solution, at least for static page
components.
– This is particularly important for cross-oceanic connections!
– You may be able to serve graphics locally while going to the central
server system to handle the transaction itself.
• Use flat files (and a naming convention) instead of databases.
• If you’re dynamically generating pages:
– Be sure to tune the system (add memory; use server cache; etc.)
– Dynamically starting a process takes time.
• Be sure your ISP has good peering to your customers’ ISPs.
19
Transaction Performance Factors
• Scaling transactions is much more difficult than scaling simple web page delivery!
– Need to maintain transaction context between web pages, associating a user with a transaction-in-progress
    – Use IP address?
        – Different users of one ISP can have same address
        – Users can switch IP addresses in mid-transaction
    – Use a cookie?
        – Users can set browser to refuse cookies
    – Embed user and state information within each page, or each link?
        – Requires dynamic page generation logic
    – Use Secure Socket Layer (SSL) session ID?
        – Load balancer may need to be one end of the secure connection
• Need to detect and handle abandoned transactions
– Each “active” session consumes server memory
– A timeout value is a reasonable technique
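The timeout technique for abandoned transactions can be sketched as a small in-memory session store. Everything here is hypothetical: the class name `SessionStore`, the 15-minute default, and the use of a random session ID (which would be carried in a cookie or embedded in each link, as discussed above) are illustrative assumptions rather than a recommended design.

```python
import time
import uuid


class SessionStore:
    """Keeps transaction context per session and drops sessions that go idle."""

    def __init__(self, timeout_seconds=900, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._sessions = {}  # session_id -> (last_seen, state dict)

    def create(self):
        session_id = uuid.uuid4().hex  # opaque ID sent back to the browser
        self._sessions[session_id] = (self._clock(), {})
        return session_id

    def touch(self, session_id):
        """Return the session state and restart its timer; None if unknown or expired."""
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        last_seen, state = entry
        now = self._clock()
        if now - last_seen > self._timeout:
            del self._sessions[session_id]
            return None
        self._sessions[session_id] = (now, state)
        return state

    def purge_abandoned(self):
        """Free the memory held by idle sessions; return how many were removed."""
        now = self._clock()
        stale = [sid for sid, (seen, _) in self._sessions.items()
                 if now - seen > self._timeout]
        for sid in stale:
            del self._sessions[sid]
        return len(stale)
```

Calling `purge_abandoned()` periodically bounds the memory consumed by “active” sessions whose users have already left.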
20
Components of a Transaction’s Download Time
21
Keynote Broker Trading Index
• Average response times and success rates for creating a
standard stock-order transaction
– The measurement agent opens the brokerage’s home page, logs on, obtains a stock quote, creates an order to buy stock, and logs out before confirming the order.
• Measurements are performed every 15 minutes between 9 a.m.
and 4 p.m. EST during market trading days.
• From 10 major metropolitan areas in the U.S.
• Unsuccessful transactions include those in which any Web page
fails to download completely and those that do not complete
within a specified time limit.
– The time limit for a transaction is calculated by multiplying the
number of Web pages in a transaction by 12 seconds.
– Each week, individual brokerage error rates typically range from
0.2% to 30%
• Available at http://www.keynote.com/
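The success rule described above is simple arithmetic and can be sketched as follows. The function names and the five-page script length used in the example are assumptions (the slide does not state the number of pages), as is treating a transaction that finishes exactly at the limit as successful.

```python
SECONDS_PER_PAGE = 12  # per-page allowance stated on the slide


def transaction_time_limit(page_count):
    """Time limit for a transaction: number of pages multiplied by 12 seconds."""
    return page_count * SECONDS_PER_PAGE


def is_successful(page_times, pages_in_script):
    """page_times: seconds for each page that downloaded completely, in order."""
    all_pages_completed = len(page_times) == pages_in_script
    within_limit = sum(page_times) <= transaction_time_limit(pages_in_script)
    return all_pages_completed and within_limit


def error_rate(results):
    """Percentage of unsuccessful transactions in a list of True/False results."""
    failures = sum(1 for ok in results if not ok)
    return 100.0 * failures / len(results)


# Assumed five-page script: home page, log on, quote, create order, log out.
results = [
    is_successful([3, 4, 5, 6, 2], 5),     # completes in 20 s
    is_successful([3, 4], 5),              # a page failed to download
    is_successful([20, 20, 20, 5, 5], 5),  # 70 s exceeds the 60 s limit
    is_successful([10, 10, 10, 10, 10], 5),
]
print(transaction_time_limit(5), results, error_rate(results))
```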
22
For Comparison...
Keynote Broker Trading Index
[Figure: chart of Total Time (transaction time, mean seconds) and Error Rate (%) by month, Aug '99 through May '00.]
23
Improving Transaction Performance
• Decrease the number of pages required per transaction
– Each page is a new chance for connection failure
• Measure performance
– Detect performance issues and triage them quickly!
– Use proxy Agents in geographic locations of customer groups
– Count abandoned shopping carts, etc.
• Plan for failure
– How will the customer get help?
– Number to call; transaction ID
24
Performance Measurement Goals
• Evaluation of improvements and competition
– From a stable, representative set of measurement agents
– Long-term trending and benchmarks
• Quick diagnostics and triage when problems occur.
– Get the problem assigned to the proper support groups quickly.
– Use a “white box” unloaded server for comparisons
• System tuning
– Where, in the complex system, are the bottlenecks?
– How are response time and availability affected by site traffic?
– Complex if users and servers are geographically distributed!
– How are response time and availability affected by background
traffic and events on the Internet?
• Prepare for and evaluate load testing
– Understand load details
– Validate load-test results against production performance
25
A Note About Availability
• Combination of MTBF (Mean Time Between Failures) and MTTR
(Mean Time To Repair)
• Affected by error rate?
– At what error rate or pattern is the system “unavailable”?
• Affected by time of day or date?
– Do you care if the system is down at 1am Eastern Time on Sundays?
• What must be “available”, and from where?
– Designated servers?
– Access to backbone routers?
– Access to specific gateways or routers?
– Designated end-to-end paths?
• How is it measured?
– Sampling by testing devices?
– Sampling from the designated servers, etc?
– What is measurement granularity?
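For reference, the usual way to combine MTBF and MTTR is the steady-state formula availability = MTBF / (MTBF + MTTR). The sketch below shows that formula next to a sampling-based estimate of the kind a testing device would produce; the function names are hypothetical, and the figures in the example are made up for illustration.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def sampled_availability(samples):
    """samples: booleans, True when a measurement attempt succeeded."""
    samples = list(samples)
    return sum(samples) / len(samples)


# Illustrative numbers: one failure every 999 hours, repaired in 1 hour.
print(steady_state_availability(999, 1))

# 96 samples is one day of measurements taken every 15 minutes.
one_day = [True] * 95 + [False]
print(sampled_availability(one_day))
```

The sampled figure depends on measurement granularity: with one sample every 15 minutes, an outage shorter than that can fall between samples and go unseen.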
26
Measurement Technologies
• Element vs. End-to-End
• Active vs. Passive
• Quick Overviews of:
– Element Measurement (usually Passive)
– Active End-to-End
– Passive End-to-End
27
Element vs. End-to-End Measurement
• Element (“point measurement”)
– Shows only the behavior within a particular network element
(router, switch, link, server)
– Network internal measures are crucial for problem solving.
– Must be correlated with End-to-End View, for quick fixes to
problems seen by end users.
• End-to-End
– End-to-end . . . but . . . where are the “ends” in “end-to-end”?
– Network internals are usually irrelevant and confusing to network
users.
– Used in constructing Service Level Agreements (SLAs)
28
Active vs. Passive Measurement
• Active Measurement adds traffic to the system
– Special software or hardware / software Agents perform scripted
transactions, “pings,” and other simulated end-user actions
– Based on sampling
• Passive Measurement watches real users
– Watches existing traffic and system components
– Can sample or can look at every packet and at other data
– Great for Network Operators
29
Element Measurement
• This is measurement of individual network and server elements
– Great for system operators and for triage
– Necessary for tracking load vs. element utilization
• Typical element measures:
– Device status: CPU, memory, link utilization; queue sizes
– Identity of heavy users, hardware port flows
– Bandwidth usage measurement
– Application statistics (page hits, user counts, abandoned shopping
carts, etc.)
• Measurement technology can be active or passive.
– Passive measurement is the most common and may not need to
be based on sampling.
– Some passive measurement tools can examine frame or packet
headers to track response times, etc.
– Active measurement helps correlation with end-user views.
30
Typical Passive Measurement of Elements
[Figure: example screens showing passive disk usage measures and passive router SNMP component measures.]
31
Active Measurement of the End-User’s Experience
• Network-level pings, etc. are useful for debugging, but are not a
true measurement of end-user experience.
– They reach only the outskirts of the web hosting system, not the application
– They do not indicate the health of the application
– They usually are not directly correlated with the end-user’s web page experience
• Automated measurement agents run scripts to download web
pages and run transactions.
– Includes non-network (e.g., server) time
– May include detailed component measurements that are useful for
triage and trending
– Finds errors as seen by users
• Active measurement of the end-user’s experience builds a
baseline that can be used to evaluate any single-site or
distributed web serving solution, even if the web serving
solution’s technology changes over time.
32
Typical Active End-User Measurement
• Details of download can be timed and displayed.
• Download details can be trended over time.
• Includes:
– DNS lookup
– TCP connection complete
– Redirections complete
– First packet of base page
– Base page complete
– Content (gifs, etc.) complete
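A minimal sketch of how an active agent might time several of these phases for a single base page is shown below. It is an illustration under stated assumptions, not a description of any vendor's agent: the function name `time_base_page` is hypothetical, only plain HTTP is handled, and redirections and embedded content (gifs, etc.) are not followed.

```python
import socket
import time
from urllib.parse import urlsplit


def time_base_page(url, timeout=10.0):
    """Fetch one page over plain HTTP; return cumulative timings in seconds."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    marks = {}
    start = time.perf_counter()

    # Phase 1: DNS lookup (hostname to address).
    family, socktype, proto, _, address = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    marks["dns_lookup"] = time.perf_counter() - start

    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)

        # Phase 2: TCP connection complete.
        sock.connect(address)
        marks["tcp_connect"] = time.perf_counter() - start

        request = (f"GET {path} HTTP/1.0\r\n"
                   f"Host: {parts.netloc}\r\n"
                   "Connection: close\r\n\r\n")
        sock.sendall(request.encode("ascii"))

        # Phase 3: first packet of the base page arrives.
        chunk = sock.recv(4096)
        marks["first_byte"] = time.perf_counter() - start

        # Phase 4: base page complete (the server closes the connection).
        size = 0
        while chunk:
            size += len(chunk)
            chunk = sock.recv(4096)
        marks["base_page_complete"] = time.perf_counter() - start

    marks["bytes"] = size  # response headers plus body
    return marks
```

Each value is measured from the start of the request, so the duration of one phase is the difference between adjacent marks. Storing these marks per measurement is what makes trending over time possible.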
33
Active Measurement Issues
• What should be measured? (Page URLs? Transaction scripts?)
– How will you measure your competitors?
• What sampling rate is sufficient?
• How many Agents are needed, where should they be located, and
how should they be built?
– Stand-alone Agents, on dedicated workstations, located at sites
you control
– Applets downloaded into user machines
– Do you have permission?
– Do you have control? (Are these user machines portable? Do
users reconfigure them without telling you? How will you build
long-term trending baselines using this data?)
– What if there aren’t any active users in an important location?
How do you detect communications failures?
– Measurement services (e.g., Keynote Systems) that can represent
“Internet” users in the world at large
34
Passive Measurement of the End-User’s Experience
• Watches actual end-user performance
– Can be embedded in end-user’s browser
– An intermediate network device can examine packet headers to
track response times, etc.
– Server application can use an API to send signals to tool.
– This can’t usually be used to measure competitors
• Some passive tools can examine web server data logs to track
locations of users and their web site activity.
– This can track every user, without delaying production (if analysis
is done off-line).
– But this won’t see failures and time delays caused by page
elements that are delivered from a different geographical location
(e.g., ad servers)
– And it can’t be used to measure competitors
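As an illustration of off-line log analysis, the sketch below counts hits and errors per page from web server log lines. It assumes the widely used Common Log Format; the sample lines, addresses, and the function name `summarize` are made up for the example. As noted above, a log like this records only what the server itself delivered.

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(log_lines):
    """Return (hits per path, errors per path) for lines in the assumed format."""
    hits, errors = Counter(), Counter()
    for line in log_lines:
        match = CLF_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        path = match["path"]
        hits[path] += 1
        if int(match["status"]) >= 400:  # 4xx and 5xx responses count as errors
            errors[path] += 1
    return hits, errors


sample = [
    '10.1.2.3 - - [01/May/2000:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 5120',
    '10.1.2.4 - - [01/May/2000:10:00:02 -0500] "GET /gifs/picture1.gif HTTP/1.0" 404 210',
    '10.1.2.3 - - [01/May/2000:10:00:05 -0500] "GET /index.html HTTP/1.0" 200 5120',
]
hits, errors = summarize(sample)
print(dict(hits), dict(errors))
```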
35
Passive Measurement Issues
• What should be measured? (Page URLs? Transaction scripts?)
• How will you track true response time as seen by end-user, not just
pieces of that response time?
• How will you standardize measurements for long-term trending?
• Location of passive measurement probes?
– Will the sampled pages and transactions be representative of all
users?
– For probes downloaded into user machines:
– Do you have permission?
– Do you have control? (Are these user machines portable? Do
users reconfigure them without telling you? How will you build
long-term trending baselines using this data?)
– What if there aren’t any active users in an important location?
How do you detect communications failures?
• Does passive measurement affect response time?
36
Load Characterization and Measurement
• Understanding load patterns and measuring load are as important
as measuring performance.
– Customer response to marketing campaigns
– Changes in usage patterns of web site
– Correlation of load, system element utilization, and response time
– Gathering data for testing
• Characterization (understanding load patterns) gathers data
about how individual end-users travel through the site.
• Load Measurement gathers data on the load presented to the
server system
– The server system may be geographically distributed.
37
Site Testing
• If your site breaks under load, it’s very easy for your customers
to click away ... and they will!
• Unfortunately, it’s risky to predict behavior beyond what you’ve
already seen.
– System performance is non-linear at best.
– Just one additional user may reveal a bug in the system.
– Response time vs. load is probably exponential.
– Internet traffic is worse than exponential; it’s fractal.
– Advertising or other factors may change the pattern of accesses to
your site.
– The total count of visitors may stay the same, but their paths
through the site may change.
– If people suddenly start to buy, instead of browse, will that
break your site?
38
Types of Testing – 1
• Functional testing – does the site work at all? – is critically important,
but it’s not enough.
– Functional tests find missing elements, broken links, errors.
– Most functional tests succeed even if they have to wait for a long
time; most real users have abandoned the site by then.
• Load testing measures the response of the site to a specified load.
– This type of testing can be used to measure the effects of
changes to the web system.
– The load in a “load test” doesn’t need to stress the system; many
load tests are designed to emulate a normal load.
39
Types of Testing – 2
• Stress testing finds the instantaneous breaking points.
– Under what load level, and what type of load, does the system fail
or provide unacceptable response times?
– Will load-sharing devices fail?
– Will database replies time out and result in empty pages?
• Endurance testing measures system performance after a sustained
high load.
– System performance may degrade after a large number of users
– Poor re-use of system resources
– Poor handling of abandoned sessions
– Some systems may break entirely after a sustained high load.
40
Testing Tools
• Most testing is done within the server site.
– Functional Testing (if all page components are within the site)
– Initial stress testing
– Testing of web server and back-end database integration
• Final testing should be done across the network.
– Find problems with geographically-distributed systems
– Distributed servers
– DNS difficulties
– Find problems in Internet connectivity
– ISP connectivity
– Network aggregation bottlenecks (routers, etc.)
– Peering to ISPs that are used by customers
• Use realistic scripts!
41
Scripts for Testing and Measuring
• What is the definition of “response time”?
– Will you get statistics for each page, not just for the total transaction?
• Can your scripts react to the received web pages?
– Different “think time” for different responses
– Transaction abandonment if response time is too long
• Which transactions are used by important users?
– Which transactions are used by customers who are buying?
– Which transactions are used by irritable, politically-powerful users?
• The transaction sets should exercise all major parts of the system.
• Create different transaction sets for different situations, e.g.,
– Monthly cycles
– Special advertising promotions
• System designers and administrators will optimize the system to
make the measurements look good; be sure that when they do that,
they’ll also help the end users!
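A script that reacts to received pages, as described above, can be sketched like this. The function `run_script` and its parameters are hypothetical, and the `fetch` callable stands in for a real page download so that the example runs without a network; a real load generator would also sleep for the think time instead of only adding it up.

```python
def run_script(pages, fetch, think_time, abandon_after):
    """Run a scripted transaction and report per-page statistics.

    pages: ordered list of page names.
    fetch(page) -> (response_seconds, body); stands in for a real download.
    think_time(body) -> seconds the simulated user pauses after that page.
    abandon_after: response time (seconds) beyond which the user gives up.
    """
    per_page = []
    thinking = 0.0
    for page in pages:
        seconds, body = fetch(page)
        per_page.append((page, seconds))  # statistics for each page, not just the total
        if seconds > abandon_after:
            return {"completed": False, "abandoned_at": page,
                    "per_page": per_page, "think_seconds": thinking}
        thinking += think_time(body)  # different think time for different responses
    return {"completed": True, "abandoned_at": None,
            "per_page": per_page, "think_seconds": thinking}


# Canned responses standing in for a site under test.
canned = {
    "home": (1.2, "welcome"),
    "quote": (2.5, "quote"),
    "order": (9.0, "order form"),
}


def longer_for_forms(body):
    return 20.0 if "form" in body else 5.0


result = run_script(["home", "quote", "order"], canned.__getitem__,
                    longer_for_forms, abandon_after=8.0)
print(result)
```

Here the simulated user abandons the transaction at the order page because its 9-second response exceeds the 8-second tolerance, which is the behavior a purely functional test would miss.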
42
... and remember ...
Even after the site seems to be running perfectly,
You must measure and monitor continually
To Avoid the Dreaded . . .
43
Nightmare on Web Street!
[Figure: Keynote chart with callouts “Webmaster goes home” and “Webmaster arrives at work”.]
Yeeks!
44
Keynote Systems (www.keynote.com), “The Internet Performance Authority,” is the world’s
leading supplier of Internet performance measurement, diagnostic, and consulting services to
companies with e-commerce web sites. Keynote captures over 24 million performance
measurements daily, using Keynote’s global infrastructure of nearly 500 measurement computers
connected to the major Internet backbones from over 120 statistically selected Internet access
locations representing 50 metropolitan areas worldwide. Internet performance and availability data
are collected at Keynote’s sophisticated operations center and are instantly available to customers
through any Web browser. Keynote currently measures individual web pages as well as
transactions and streaming media. Keynote also supplies web load testing services through its
recent acquisition of Velogic, Inc.
Eric Siegel is a Senior Internet Consultant with Keynote Systems, Inc. and is the author of Designing
Quality of Service Solutions for the Enterprise (John Wiley & Sons, 1999). Before joining Keynote
Systems, Mr. Siegel was a Senior Network Analyst at NetReference, Inc., which specializes in
network architectural design and strategic planning, and he was a Senior Network Architect with
Tandem Computers, where he was the technical leader and coordinator for all of Tandem's data
communications specialists worldwide. Mr. Siegel also worked for Network Strategies, Inc. and for
the MITRE Corporation, where he specialized in computer network design and performance
evaluation. Mr. Siegel received both his B.S. and M.E.E. degrees in Electrical Engineering from
Cornell University, and he has been a member of the Internet community since 1978.
45