Web Caching - CS
Download
Report
Transcript Web Caching - CS
Web Caching
Elliot Jaffe
Presentation for The Seminar on Database and
Internet
Hebrew University, Fall 2002
Agenda
Caching: Why, Where, How, What
Some empirical data: Zipf’s Law
Content Delivery Networks
Bibliography
Why cache?
Number of unique pages: 800M < X < 2.2B
Number of unique web sites: 8,500,000
static pages: %30 - %40
pages revisited: %80
expected hit-rate: %24 - %32
Why cache?
Bandwidth
Latency
Performance = Response Time
Server Load
Failure Redundancy
Where
Reverse
Reverse
Reverse
Proxy
Reverse
Proxy
Proxy
Proxy
Local ISP
Content
Content
Content
Content
Server
Server
Server
Server
cache cdn
L4 Switch
cache
Intranet
cache
cache
cache
Browser Browser Browser
Data Center
ISP
cdn
Hot-potato routing
Get traffic off of your
network as soon as
possible
Bounces traffic around
the internet
Increases chance of
dropped packet
Increases latency
Destination
You are here
How: Types of Caches
Simple Proxy
Transparent Proxy
Reverse Proxy
Adaptive Caching
Push Caching
Active Caching
Streaming Caches
How: Simple Proxy
Harvest/Squid
Provide web content for a fixed user base
Standalone operation
May be transparent
Commodity product/technology
Easy to get 90% correct
How: Transparent Proxy
No client configuration
Violates end-to-end paradigm
Client thinks it is talking directly to server
Server thinks it is talking to cache
Implemented as
Pass-through unit
L4 switch
How: Reverse Proxy
Designed to offload duties from one or more
specific servers
Data size is limited to size of static content on
the server
Challenge is fast, disk-less operation
Cache consistency is easy
Single point of failure
How: Adaptive Caching
ISP Level caching
Cooperating multiple distributed caches
Operate as a cache-mesh based on content
demand
Multicast for group membership (GCS)
Content Routing Protocol sends request to
the appropriate cache within the mesh
How: Push Caching
Send the data out proactively
Content Delivery Networks
Paid for by data providers
More on this later!
How: Active Caching
Use an applet inside of the cache to
customize dynamic pages on the fly
How do you identify dynamic pages?
Where does the custom data come from?
Who is going to pay for this service?
How: Streaming Caches
What about streaming content
Movies
Audio
Proprietary streaming protocols
Challenge is to maintain Quality of content
and service
Who pays for this?
What: Content and Protocols
Mostly Static Content
HTML
XML
GIF
AVI
EXE
Etc.
What: Content and Protocols
HTTP 1.0 Basic protocol
Send Request based on fix number of verbs
GET
HEAD
POST
Receive response, meta-data, content
What: Content and Protocols
HTTP Request
Request = Simple-Request | Full-Request
Simple-Request = "GET" SP Request-URI CRLF
Full-Request = Request-Line ;
* ( General-Header ;
| Request-Header ;
| Entity-Header ) ;
CRLF
[ Entity-Body ]
What: Content and Protocols
Example:
GET /pub/www/index.html HTTP/1.0
Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 19 Oct 2002 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Thu, 13 Jul 2000 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
Response:
HTTP/1.1 304 Not Modified
Basic caching algorithm
Pages may be
Fresh: up-to-date
Expired: current date > expiration date
Stale: “old”
Basic caching algorithm - #2
If (page is in the cache)
if ( page is expired or stale )
Get from server - if-modified-since
If not modified, Get from cache
Get from cache
Else
Get from Server
Basic caching algorithm - #3
If cache has space
Store the file
Else
1. Delete expired from cache
2. Delete stale from cache
3. Delete LRU from cache
4. Delete largest/smallest from cache?
Agenda
Caching: Why, Where, How, What
Some empirical data: Zipf’s Law
Content Delivery Networks
Bibliography
Zipf’s law
Zipf’s law: The frequency of an event P as a function
of rank i is a power law function:
Pi = Ω / iα where α ≤ 1
Zipf’s law
Observed to be true for
Frequency
of written words in English
texts
Population of cities
Income of a company as a function of
rank
Zipf’s law and web access
For a given server, page access by rank follows Zipf’s
law
Web requests from a fixed population of users follows
Zipf’s law 0.64 < α < 0.83
Observations
Top %1 of all documents account for %20 -
%35 of proxy requests
Top %10 account for %45 - %55 of requests
It takes %25 to %40 of all documents to
account for %70 of requests
It takes %70 to %80 of all documents to
account for %90 of requests
Observations
Observations
For an infinite sized
cache, the hit-ratio for a
web-proxy grows in a
log-like fashion as a
function of the client
population of the proxy
and the number of
requests seen by the
proxy.
Observations
The hit-ratio of a web cache grows in a log-
like fashion as a function of the cache size.
Observations
Locality of Reference
The probability that a document will be
referenced k requests after it was last
referenced is roughly proportional to 1/k.
Observations - NOT
There is very little correlation between access
frequency and document size
There is no correlation between access
frequency and the change rate of a document
No single web server contributes to most of
the popular pages
Zipf’s Law and Caching
Discussion
How does this help in cache design?
Are there any business implications?
Agenda
Caching: Why, Where, How, What
Some empirical data: Zipf’s Law
Content Delivery Networks
Bibliography
CDN
“Traditional” CDN
Dirty Secrets
P2P content delivery systems
Why use a CDN?
Reverse
Reverse
Reverse
Proxy
Reverse
Proxy
Proxy
Proxy
Local ISP
Content
Content
Content
Content
Server
Server
Server
Server
cache cdn
L4 Switch
cache
Intranet
cache
cache
cache
Browser Browser Browser
Data Center
ISP
cdn
What is CDN?
Content Deliver Networks = PUSH
PUSH = Prefetch
CDN Mechanisms
DNS redirection
Complete
Partial
URL rewrite
A DNS-redirecting CDN
DNS
redirector
example.com ?
B
Client
Network
Model
HTTP
server
A
HTTP
B
server
GET http://example.com/foo
http://example.com/foo
HTTP
server
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
C
Original
server
CDN DNS Full Redirection
(Semi)automatic mechanism to replicate original site
on CDN servers
Replace original DNS entry with enhanced DNS
server that uses knowledge of network and server
load to direct clients to appropriate CDN server
TTL on DNS entries are very short
Adero, NetCaching, IntelliDNS
CDN DNS Partial Redirection
Statically modify selected URL’s within pages to point
to CDN service
Replicate selected objects to CDN service
Redirect clients of selected URL’s using enhanced
DNS server that uses knowledge of network and
server load
Akamai, Digital Island, MirrorImage, SolidSpeed,
Speedera
CDN rewrite
Modify pages at the origin server on the fly
Change embedded URL’s based on up-to-
date knowledge of the network and CDN
server loads
Does not require additional DNS lookups
Fasttide, Clearway
Measuring a CDN’s performance
Two papers
K.L.Johnson,J.F.Carr,M.S.Day,and
M.F.Kaashoek,”The measured performance of
content distribution networks,”in Proceedings of the
5th International Web Caching Workshop and
Content Delivery Workshop,(Lisbon,Portugal),May
2000.
B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and
Performance of Content Distribution Networks” in
ACM SIGCOMM INTERNET MEASUREMENT
WORKSHOP 2001.
The measured performance of content
distribution networks
Client Actions
R: Resolve domain name
F: Fetch content
Ordinary client use of CDN: RF
Instead of doing (RF)+ we do R+ then F+
This allows us to compare the server chosen
to some other servers that could have been
chosen, over a large number of fetches.
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
Procedure
R+: Collect a set of servers by repeated DNS
queries
to a variety of name servers
over a number of hours
F+: Fetch a particular piece of content from
each member of the set, measuring latency
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
Important Details
Interleaved fetches
Fetch1 at server1, fetch1 at server2, etc.
Not fetch1 at server1, fetch2 at server1, etc.
Unmeasured fetch before measured fetch
Avoids cache misses
Measure only HTTP fetch latency
CDN not penalized for cost of DNS resolution
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks: Looking at these graphs
Note: log plot of latency
Gray line: cumulative
distribution at one server
Red line: cumulative
distribution at all servers
Blue line: cumulative
distribution at CDN
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
Cumulative Distribution
Right way to look at this data
Want to understand frequency and magnitude
of bad choices
Consistent = vertical
Fast = to the left
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
Results
Akamai does a better job than Digital Island
Neither does a particularly good job of
selecting the optimal server
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
What’s wrong with this study?
Focus is on choice of server
Cost of DNS is explicitly excluded
How does this relate to client performance?
Measuring a CDN’s performance
Two papers
K.L.Johnson,J.F.Carr,M.S.Day,and
M.F.Kaashoek,”The measured performance of
content distribution networks,”in Proceedings of the
5th International Web Caching Workshop and
Content Delivery Workshop,(Lisbon,Portugal),May
2000.
B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and
Performance of Content Distribution Networks” in
ACM SIGCOMM INTERNET MEASUREMENT
WORKSHOP 2001.
On the Use and Performance of
Content Distribution Networks
Focus is on client perceived performance
Build canonical web page with images from
CDN server
On the Use and Performance of
Content Distribution Networks
If each CDN serves different content, then
how did they create comparable pages?
Size matters!
Select images of (almost) identical sizes from
each of the CDN services
On the Use and Performance of
Content Distribution Networks
Step 1:
For services using only DNS redirection, get
an IP address from the DNS server
For services using rewriting, get the page and
extract the CDN content server from the page
Amortize DNS lookup time over all images in
this page
On the Use and Performance of
Content Distribution Networks
Step 2:
Download all the images from the IP address
of the identified server
Throw this data away
The purpose is to make sure that there are no
cache misses
On the Use and Performance of
Content Distribution Networks
Step 3:
Download all the images from the IP address
of the identified server just like a browser
would (4 in parallel)
Repeat every 30 minutes over a period of 24
hours with a 10 minute jitter
On the Use and Performance of
Content Distribution Networks
Results
On the Use and Performance of
Content Distribution Networks
Four Conclusions
1.
Forcing a DNS lookup in the critical path of resource retrieval,
does not generally result in better server choices
2.
The download time from a previously selected server is often
better than from the download time from the newly selected
server
3.
CDN servers are generally not loaded so frequent DNS lookup
is not helpful
4.
It makes sense for CDNs to increase the DNS TTL given to a
client unless the servers are known to be loaded
On the Use and Performance of
Content Distribution Networks
Is this a better study?
More detailed results
Relates to observed performance
A good marketing white paper
What did we learn?
Dirty Secrets of the CDN world
CDNs are tremendously underutilized
CDNs are over-architected
The value of a CDN is its remote presence in
the ISP. Not in its ability to load balance
Remember the ISP Interconnect?
P2P content delivery systems
PUSH content to the
leaf nodes
Content Manager
Server other leaf nodes
from the edges
Kontiki
2
1
client
client
3
P2P CDN
Four Challenges
1. Aggregate input streams
2. Deal with unstable peers
3. Manage Malicious peers
4. Who really pays for this?
P2P Caching?
Discussion:
Is this a good idea?
What are the issues?
Where is the payback?
Agenda
Caching: Why, Where, How, What
Some empirical data: Zipf’s Law
Content Delivery Networks
Bibliography
Bibliography
Gray, Shenoy, Rules of Thumb in Data Engineering 1999, Revised March 2000.
Microsoft Research MS-TR-99-100
Berners-Lee, Fielding, Frystyk, Hypertext Transfer Protocol -- HTTP/1.0, IETF
RFC 1945, http://www.w3.org/Protocols/rfc1945/rfc1945
Fielding, Gettys, Mogul, Frystyk, Masinter, Leach, Berners-Lee, Hypertext
Transfer Protocol -- HTTP/1.1, ftp://ftp.isi.edu/in-notes/rfc2616.txt
Greg Barish and Katia Obraczka. World Wide Web Caching: Trends and
Techniques. IEEE Communications, May 2000.
http://www.isi.edu/people/katia/cache-survey.pdf.
Breslau, Cao, Fan, Phillips, Shenker, Web Caching and Zipf-like distributions:
Evidence and Implications, IEEE Infocom 1999
K.L.Johnson,J.F.Carr,M.S.Day,and M.F.Kaashoek,”The measured performance
of content distribution networks,”in Proceedings of the 5th International Web
Caching Workshop and Content Delivery Workshop,(Lisbon,Portugal),May
2000. www.terena.nl/conf/wcw/Proceedings/S4/S4-1.pdf
B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and Performance of Content
Distribution Networks” in ACM SIGCOMM INTERNET MEASUREMENT
WORKSHOP 2001. http://www.icir.org/vern/imw-2001/imw2001-papers/10.pdf
Bibliography
“Zipf Distribution of Web Site Popularity”,
http://www.useit.com/alertbox/zipf.html
S. Gribble, E. Brewer, “System Design Issues for Internet
Middleware Services: Deductions from a Large Client Trace”,
Proceedings of the USENIX Symposium on Internet
Technologies and Systems Monterey,California,December 1997
“The Internet is a little bit broken”,
http://www.internap.com/about/theproblem.html
“Reliable Internet Connectivity with BGP, Chapter 7, Influencing
Entrance Selection”,
http://www.bgpbook.com/archpolicyenter.html
A. Cockburn, B. McKenzie, “What do Web Users Do? An
Empirical Analysis of Web Use”,
http://www.cosc.cantebury.ac.nz/~andy/papers/ijhcsAnalysis.pdf