Empirical Quantification of
Opportunities for Content Adaptation
in Web Servers
Michael Gopshtein and Dror Feitelson
School of Engineering and Computer Science
The Hebrew University of Jerusalem
Supported by a grant from the Israel Internet Association
Capacity Planning
[Figure: capacity over time for the daily cycle of activity, showing utilized capacity and wasted capacity]
Capacity Planning
[Figure: capacity over time during a flash crowd]
Capacity Planning
• The problem:
– Required capacity for flash crowds cannot be
anticipated in advance
– Even capacity for daily fluctuations is highly
wasteful
• Academic solution: use admission control
• Business practice: unacceptable to reject
any clients
– Especially in cases of surge in traffic
Content Adaptation
• Trade off quality for throughput
– Installed capacity matches normal load
– Handle abnormal load by reducing quality
– But still manage to provide meaningful service
to all clients
• Assumes normal optimizations have been
made already
– Compress or combine images, promote
caching, …
– Empirically this usually is not the case
Content Adaptation
[Figure: low load]
Content Adaptation
[Figure: high load]
Content Adaptation
• Maintain the invariant:
$$\text{rate of requests} \times \text{cost per request} \le \text{capacity}$$
• Need to change quality (and cost!) of
content
– Prepare multiple versions in advance
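To make the invariant concrete, the sketch below picks the best pre-built version of a page whose cost still fits the installed capacity at the current request rate. The version names and cost figures are hypothetical illustrations, not measurements from the talk, and the fallback to the cheapest version reflects the stated goal of not rejecting clients.

```python
from typing import List, Tuple

def pick_version(request_rate: float, capacity: float,
                 versions: List[Tuple[str, float]]) -> str:
    """Return the highest-quality version whose total load fits capacity.

    versions is ordered from best quality to worst; each entry is
    (name, cost_per_request), in the same units as capacity per second.
    """
    for name, cost in versions:
        # The invariant: rate of requests x cost per request <= capacity.
        if request_rate * cost <= capacity:
            return name
    # Nothing fits: serve the cheapest version instead of rejecting clients.
    return versions[-1][0]

# Hypothetical versions prepared in advance (names and costs are assumptions).
versions = [("full", 10.0), ("reduced", 4.0), ("minimal", 1.0)]
```

With a capacity of 1000 units per second, 50 requests per second still allow the full version, while 200 requests per second force the reduced one.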
The Questions
• What are the main costs in web service?
– Bottleneck is CPU / network / disk?
– What do we gain by eliminating HTTP requests?
– What do we gain by reducing file sizes?
• What can realistically be done?
– What is the structure of a “random” site?
– How much can we reduce quality?
Assumption: static web pages only
Costs of Serving Web Pages
Measuring Random Web Sites
• http://en.wikipedia.org/wiki/Special:Random
• Use title of page as input to Google search
• Extract domain of first link to get home page
• Retrieve it using IE
• Collect statistical data by intercepting
system calls to send and receive
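The "extract domain of first link" step can be sketched with the standard library alone. The function name `home_page` is a hypothetical helper, and the sketch assumes the search result is already available as a URL string; it does not perform the search or the retrieval.

```python
from urllib.parse import urlsplit

def home_page(url: str) -> str:
    """Reduce a search-result URL to the home page of its site."""
    parts = urlsplit(url)
    # Keep only the scheme and host; drop path, query, and fragment.
    return f"{parts.scheme}://{parts.netloc}/"
```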
Retrieved Component Sizes
A quarter of the total data comes from
components larger
than 200 KB
These are only 0.02% of
the components
Download Times
Download time
(and bandwidth
requirements)
roughly proportional
to image size
Network Bandwidth
• Typical Ethernet packets are 1526 bytes
– Ethernet and TCP/IP headers require 54 bytes
– HTTP response headers require 280-325 bytes
• Most components fit into few packets
– 43% fit into a single packet
– 24% more fit into 2 packets
Save bandwidth by reducing
number of small components
or size of large components
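A rough way to see why most components fit into one or two packets is to count packets from the figures on this slide. The sketch below takes the slide's numbers at face value (1526-byte packets, 54 bytes of lower-level headers) and assumes 300 bytes of HTTP response headers, a value picked from inside the quoted 280-325 range. It ignores connection setup and acknowledgement packets.

```python
import math

PACKET_BYTES = 1526   # packet size quoted on the slide
LOWER_HEADERS = 54    # Ethernet and TCP/IP headers, from the slide
HTTP_HEADERS = 300    # assumption: a value within the slide's 280-325 range

def packets_needed(body_bytes: int, http_headers: int = HTTP_HEADERS) -> int:
    """Estimate the number of data packets needed for one HTTP response."""
    payload = PACKET_BYTES - LOWER_HEADERS   # bytes of response per packet
    return math.ceil((body_bytes + http_headers) / payload)
```

Under these assumptions a component of up to 1172 bytes fits in a single packet, while a 200 KB component needs about 140.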
Locality and Caching
• Flash crowds typically involve a very small
number of pages (possibly the home
page)
• Servers allocate GB of memory for cache
• This is enough for thousands of files
Disk is not expected to be
a bottleneck
CPU Overhead
• CPU usage reflects several activities
– Opening TCP connection
– Processing request
– Sending data
• Measure using combinatorial
microbenchmarks
– Open connection only
– One extremely large file
– Many small files
– Many requests for non-existent file
CPU Overhead
Example: single 10KB file
– Establishing connection: 25%
– Processing request: 72%
– Data transfer: 3%
• Processing and transfer costs are equal at 240KB
– Only 0.3% of files are that big
If CPU is bottleneck, need
to reduce number of requests
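The numbers above are consistent with a simple linear model: fixed costs for the connection and the request, plus a transfer cost proportional to file size. The sketch below assumes that model and uses relative units taken from the 10KB example (25, 72, and 3); the linearity itself is an assumption, not something the slides state.

```python
CONNECT = 25.0             # connection setup, relative units from the 10KB example
PROCESS = 72.0             # request processing, relative units
TRANSFER_PER_10KB = 3.0    # data transfer for 10KB, assumed to scale linearly

def cpu_cost(size_kb: float) -> dict:
    """Break down the relative CPU cost of serving one file of size_kb."""
    return {
        "connect": CONNECT,
        "process": PROCESS,
        "transfer": TRANSFER_PER_10KB * size_kb / 10,
    }

def break_even_kb() -> float:
    """File size at which transfer cost equals request processing cost."""
    return PROCESS * 10 / TRANSFER_PER_10KB
```

Scaling the 3% transfer share of a 10KB file up to the 72% processing share gives a break-even size of 240KB, which matches the figure on the slide.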
Optimizations
Guidelines
• Either CPU or network is the bottleneck
• Network bandwidth saved by reducing
large components
• CPU saved by eliminating small
components
• Maintaining “acceptable” quality is
subjective
Eliminating Images
• Images have many functions
– Story (main illustrative item)
– Preview (for other page)
– Commercial
– Logo
– Decoration (bullets, background)
– Navigation (buttons, menus)
– Text (special formatting)
• Some can be eliminated or replaced
Distribution of Types
• Manually classified
959 images from 30
random sites
• 50% decoration
• 18% preview
• 11% commercial
• 6% logo
• 6% text
Automatic Identification
• Decorations are candidates for elimination
• Identified by combination of attributes:
– Use GIF format
– Appear in HTML tags other than <IMG>
– Appear multiple times in same page
– Small original size
– Displayed size much bigger than original
– Large change in aspect ratio when displayed
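One way to combine the attributes listed above is a simple vote count, sketched below. The `ImageUse` record, the thresholds (32x32 pixels, 4x stretch, 2x aspect change), and the cut-off of three matching attributes are all illustrative assumptions; the slides do not give the actual rule. Image dimensions are assumed to be positive.

```python
from dataclasses import dataclass

@dataclass
class ImageUse:
    fmt: str            # file format, e.g. "gif" or "jpeg"
    in_img_tag: bool    # False if referenced from a tag other than <IMG>
    occurrences: int    # times the image appears in the same page
    orig_w: int         # original size in pixels
    orig_h: int
    shown_w: int        # displayed size in pixels
    shown_h: int

def decoration_score(img: ImageUse, small_pixels: int = 32 * 32,
                     stretch: float = 4.0, aspect_change: float = 2.0) -> int:
    """Count how many of the six decoration attributes the image has."""
    orig_area = img.orig_w * img.orig_h
    shown_area = img.shown_w * img.shown_h
    orig_aspect = img.orig_w / img.orig_h
    shown_aspect = img.shown_w / img.shown_h
    ratio = max(orig_aspect, shown_aspect) / min(orig_aspect, shown_aspect)
    checks = [
        img.fmt.lower() == "gif",            # uses GIF format
        not img.in_img_tag,                  # appears outside <IMG>
        img.occurrences > 1,                 # repeated in the page
        orig_area <= small_pixels,           # small original size
        shown_area >= stretch * orig_area,   # displayed much bigger
        ratio >= aspect_change,              # aspect ratio changed
    ]
    return sum(checks)

def is_decoration(img: ImageUse, min_score: int = 3) -> bool:
    """Treat an image as a decoration if enough attributes match."""
    return decoration_score(img) >= min_score
```

A repeated 8x8 GIF bullet scores 4 and a stretched 1x20 background strip scores 5, while a 640x480 JPEG shown at its original size scores 0.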
Image Sizes Distribution
[Figure: distribution of image sizes for commercial, preview, and decoration images]
Auxiliary Files
• JavaScript
– May be crucial for page function
– Impossible to understand automatically
• CSS (style sheets)
– May be crucial for page structure
– May be possible to identify those parts that
are used
Auxiliary Files
• Cannot be eliminated
• Common wisdom: use separate files
– Allow caching at client
– Save retransmission with each page
• Alternative: embed in HTML
– Reduce number of requests
– May be better for flash crowds that do not
request multiple pages
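The embedding alternative can be sketched as a text substitution over the HTML. The function `inline_assets` and the `files` mapping from URL to file content are hypothetical. The regular expressions only recognize the exact attribute layout shown in the comments, so this is an illustration of the idea rather than a robust HTML rewriter.

```python
import re

def inline_assets(html: str, files: dict) -> str:
    """Replace references to external CSS and JavaScript with their content.

    files maps a referenced URL to the text of that file; references to
    files that are not in the mapping are left unchanged.
    """
    def css(match):
        body = files.get(match.group(1))
        return match.group(0) if body is None else f"<style>{body}</style>"

    def js(match):
        body = files.get(match.group(1))
        return match.group(0) if body is None else f"<script>{body}</script>"

    # Matches <link rel="stylesheet" href="..."> in this attribute order only.
    html = re.sub(r'<link\s+rel="stylesheet"\s+href="([^"]+)"\s*/?>', css, html)
    # Matches <script src="..."></script> with no other attributes.
    html = re.sub(r'<script\s+src="([^"]+)"\s*>\s*</script>', js, html)
    return html
```

Each inlined file removes one HTTP request per page view, at the price of resending its content with every page.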
Text and HTML
• Some areas may be eliminated under
extreme conditions
– Commercials
– Some previews and navigation options
• Often encapsulated in <DIV> tags
• Sometimes identified by ID or class
names, e.g. “sidebanner”
– Especially when using modular design
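Removing a block identified by its ID or class name can be sketched with Python's standard `html.parser`. The class `BlockStripper` and the helper `strip_blocks` are hypothetical names; the sketch drops every <DIV> whose id or class matches, including anything nested inside it, and passes everything else through.

```python
from html.parser import HTMLParser

class BlockStripper(HTMLParser):
    """Copy HTML to self.out, skipping <div> blocks with a listed id or class."""

    def __init__(self, drop_names):
        super().__init__(convert_charrefs=False)
        self.drop = set(drop_names)
        self.out = []
        self.depth = 0   # nesting depth of <div> inside a dropped block

    def _matches(self, attrs):
        d = dict(attrs)
        names = set((d.get("class") or "").split())
        if d.get("id"):
            names.add(d["id"])
        return bool(names & self.drop)

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1          # nested div inside a dropped block
        elif tag == "div" and self._matches(attrs):
            self.depth = 1               # start dropping
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "div":
                self.depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.depth:
            self.out.append(f"<!{decl}>")

def strip_blocks(html, drop_names):
    parser = BlockStripper(drop_names)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_blocks(page, {"sidebanner"})` removes the side banner and everything inside it while leaving the rest of the page intact.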
Summary
Content Adaptation
• Degraded content usually better than
exclusion
• Only way to handle flash crowds that
overwhelm installed capacity
• Empirical results identify main options
– Identify and eliminate decorations
– Compress large images (story, commercial)
– Embed JavaScript and CSS
– Hide unnecessary blocks
Next Paper Preview
• Implementation in Apache
• Monitor CPU utilization and idle threads to
switch between modes
• Use mod_rewrite to redirect URLs to
adapted content
• Achieve up to a 10x increase in throughput
for extreme adaptation
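The mode switch driven by CPU utilization and idle threads might look like the sketch below. The class `ModeSwitch`, all thresholds, and the use of separate enter and leave thresholds (hysteresis, to avoid flapping between modes) are assumptions made for illustration; the slides only state which two quantities are monitored.

```python
class ModeSwitch:
    """Decide between normal and adapted content from two load indicators."""

    def __init__(self, cpu_high=0.90, cpu_low=0.60, min_idle=2, ok_idle=10):
        # All thresholds are hypothetical values, not taken from the talk.
        self.cpu_high = cpu_high   # enter adapted mode at or above this CPU load
        self.cpu_low = cpu_low     # leave adapted mode at or below this CPU load
        self.min_idle = min_idle   # enter adapted mode with this few idle threads
        self.ok_idle = ok_idle     # leave only with at least this many idle threads
        self.adapted = False

    def update(self, cpu_util: float, idle_threads: int) -> str:
        """Feed one measurement and return the mode to use next."""
        if not self.adapted:
            if cpu_util >= self.cpu_high or idle_threads <= self.min_idle:
                self.adapted = True
        elif cpu_util <= self.cpu_low and idle_threads >= self.ok_idle:
            self.adapted = False
        return "adapted" if self.adapted else "normal"
```

Once in adapted mode the server stays there until both indicators have clearly recovered, so a load hovering around a single threshold does not cause repeated switching.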