Web Caching
Elliot Jaffe
Presentation for The Seminar on Database and Internet
Hebrew University, Fall 2002
Agenda
 Caching: Why, Where, How, What
 Some empirical data: Zipf’s Law
 Content Delivery Networks
 Bibliography
Why cache?
 Number of unique pages: 800M < X < 2.2B
 Number of unique web sites: 8,500,000
 static pages: 30% - 40%
 pages revisited: 80%
 expected hit-rate: 24% - 32%
Why cache?
 Bandwidth
 Latency
 Performance = Response Time
 Server Load
 Failure Redundancy
Where
[Figure: network diagram with the labels Browser, Intranet, cache, Local ISP, ISP, L4 Switch, cdn, Data Center, Reverse Proxy, and Content Server]
Hot-potato routing
 Get traffic off of your network as soon as possible
 Bounces traffic around the internet
 Increases chance of dropped packet
 Increases latency
[Figure: route across networks from "You are here" to "Destination"]
How: Types of Caches
 Simple Proxy
 Transparent Proxy
 Reverse Proxy
 Adaptive Caching
 Push Caching
 Active Caching
 Streaming Caches
How: Simple Proxy
 Harvest/Squid
 Provide web content for a fixed user base
 Standalone operation
 May be transparent
 Commodity product/technology
 Easy to get 90% correct
How: Transparent Proxy
 No client configuration
 Violates end-to-end paradigm
    Client thinks it is talking directly to server
    Server thinks it is talking to cache
 Implemented as
    Pass-through unit
    L4 switch
How: Reverse Proxy
 Designed to offload duties from one or more
specific servers
 Data size is limited to size of static content on
the server
 Challenge is fast, disk-less operation
 Cache consistency is easy
 Single point of failure
How: Adaptive Caching
 ISP Level caching
 Cooperating multiple distributed caches
 Operate as a cache-mesh based on content
demand
 Multicast for group membership (GCS)
 Content Routing Protocol sends request to
the appropriate cache within the mesh
How: Push Caching
 Send the data out proactively
 Content Delivery Networks
 Paid for by data providers
 More on this later!
How: Active Caching
 Use an applet inside of the cache to
customize dynamic pages on the fly
 How do you identify dynamic pages?
 Where does the custom data come from?
 Who is going to pay for this service?
How: Streaming Caches
 What about streaming content?
    Movies
    Audio
 Proprietary streaming protocols
 Challenge is to maintain quality of content and service
 Who pays for this?
What: Content and Protocols
 Mostly Static Content
    HTML
    XML
    GIF
    AVI
    EXE
    Etc.
What: Content and Protocols
 HTTP 1.0 Basic protocol
    Send request based on a fixed number of verbs
        GET
        HEAD
        POST
    Receive response, meta-data, content
What: Content and Protocols
 HTTP Request
Request        = Simple-Request | Full-Request
Simple-Request = "GET" SP Request-URI CRLF
Full-Request   = Request-Line
                 *( General-Header
                  | Request-Header
                  | Entity-Header )
                 CRLF
                 [ Entity-Body ]
What: Content and Protocols
 Example:
GET /pub/www/index.html HTTP/1.0
 Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 19 Oct 2002 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
 Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
 Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Thu, 13 Jul 2000 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
 Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
 Response:
HTTP/1.1 304 Not Modified
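The two "if-modified-since" exchanges above differ only in whether the page changed after the date the client sent. A minimal Python sketch of that server-side decision follows. The function name conditional_status and its string-based interface are assumptions made for illustration, not part of any server named in the slides; the date parsing uses the standard library's email.utils.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: str, if_modified_since: Optional[str]) -> int:
    """Return 304 if the page is unchanged since the client's copy, else 200.

    Both arguments are HTTP dates such as "Sat, 19 Oct 2002 19:43:31 GMT".
    """
    if if_modified_since is None:
        return 200  # plain GET: always send the full response
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the page
    modified = parsedate_to_datetime(last_modified)
    # Not modified after the client's date: the cached copy may be reused.
    return 304 if modified <= since else 200
```

A cache would send the Last-Modified value it stored with the page as the If-Modified-Since header, and keep its copy whenever the answer is 304.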
Basic caching algorithm
Pages may be
 Fresh: up-to-date
 Expired: current date > expiration date
 Stale: “old”
Basic caching algorithm - #2
If (page is in the cache)
    If (page is expired or stale)
        Get from server with if-modified-since
        If not modified, get from cache
        Else store and return the new copy
    Else
        Get from cache
Else
    Get from server
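The lookup logic above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of any real proxy: the Entry record, the get_page function, and the fetch callback (which returns None to stand for "304 Not Modified") are all hypothetical names, and "stale" is folded into the expiry check for simplicity.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    body: bytes
    expires_at: float     # absolute expiry time, seconds since the epoch
    last_modified: float  # value to send as if-modified-since


# Hypothetical origin interface: fetch(url, if_modified_since) returns None
# for "304 Not Modified", or (body, expires_at, last_modified) for a 200.
Fetch = Callable[[str, Optional[float]], Optional[Tuple[bytes, float, float]]]


def get_page(cache: Dict[str, Entry], url: str, fetch: Fetch,
             now: Optional[float] = None) -> bytes:
    now = time.time() if now is None else now
    entry = cache.get(url)
    if entry is not None:
        if now <= entry.expires_at:
            return entry.body                     # fresh: serve from cache
        result = fetch(url, entry.last_modified)  # expired: revalidate
        if result is None:
            return entry.body                     # 304: cached copy is still good
    else:
        result = fetch(url, None)                 # miss: unconditional GET
    body, expires_at, last_modified = result
    cache[url] = Entry(body, expires_at, last_modified)
    return body
```

The sketch does not extend the expiry time after a 304, so an expired page is revalidated on every request until the server sends a new copy.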
Basic caching algorithm - #3
If cache has space
    Store the file
Else
    1. Delete expired from cache
    2. Delete stale from cache
    3. Delete LRU from cache
    4. Delete largest/smallest from cache?
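A possible Python rendering of the replacement order above, covering steps 1 and 3 only. The Item record and make_room function are assumed names. Step 2 is left out because the slides do not define "stale" precisely, and the sketch stops evicting as soon as the new file fits, which is one reading of the list rather than the only one.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Item:
    size: int          # bytes
    expires_at: float  # absolute expiry time
    last_used: float   # time of the most recent hit


def make_room(cache: Dict[str, Item], capacity: int, needed: int, now: float) -> None:
    """Evict entries until `needed` bytes fit within `capacity`."""
    def used() -> int:
        return sum(item.size for item in cache.values())

    # Step 1: delete expired entries.
    for key in [k for k, item in cache.items() if item.expires_at < now]:
        if used() + needed <= capacity:
            return
        del cache[key]
    # Step 3: delete least-recently-used entries, oldest first.
    for key in sorted(cache, key=lambda k: cache[k].last_used):
        if used() + needed <= capacity:
            return
        del cache[key]
```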
Agenda
 Caching: Why, Where, How, What
 Some empirical data: Zipf’s Law
 Content Delivery Networks
 Bibliography
Zipf’s law
 Zipf’s law: The frequency of an event P as a function of rank i is a power law function:
$P_i = \Omega / i^{\alpha}$ where $\alpha \le 1$
Zipf’s law
 Observed to be true for
    Frequency of written words in English texts
    Population of cities
    Income of a company as a function of rank
Zipf’s law and web access
 For a given server, page accesses by rank follow Zipf’s law
 Web requests from a fixed population of users follow a Zipf-like law with 0.64 < α < 0.83
Observations
 Top 1% of all documents account for 20% - 35% of proxy requests
 Top 10% account for 45% - 55% of requests
 It takes 25% to 40% of all documents to account for 70% of requests
 It takes 70% to 80% of all documents to account for 90% of requests
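As a rough sanity check, the share of requests going to the most popular documents can be computed under a pure Zipf-like model. The sketch below is an illustration, not the method used to obtain the figures above: the function names, the document count, and the value of α passed in are assumptions chosen by the caller.

```python
from typing import List


def zipf_probabilities(n: int, alpha: float) -> List[float]:
    """P_i = omega / i**alpha for ranks i = 1..n, normalized to sum to 1."""
    weights = [1.0 / (i ** alpha) for i in range(1, n + 1)]
    omega = 1.0 / sum(weights)  # normalizing constant
    return [omega * w for w in weights]


def top_share(n: int, alpha: float, fraction: float) -> float:
    """Fraction of all requests that go to the top `fraction` of documents."""
    probs = zipf_probabilities(n, alpha)
    k = max(1, int(n * fraction))
    return sum(probs[:k])
```

For example, top_share(100_000, 0.75, 0.01) gives the modeled request share of the top 1% of 100,000 documents with α = 0.75, which can be set beside the measured ranges listed above.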
Observations
 For an infinite-sized cache, the hit-ratio for a web-proxy grows in a log-like fashion as a function of the client population of the proxy and the number of requests seen by the proxy.
Observations
 The hit-ratio of a web cache grows in a log-like fashion as a function of the cache size.
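One way to see where this growth comes from, as a sketch under simplifying assumptions that are not stated on the slides: requests are independent draws from the Zipf-like distribution $P_i = \Omega / i^{\alpha}$ over $N$ documents, and the cache holds exactly the $C$ most popular documents.

```latex
% Hit ratio of a cache holding the C most popular of N documents
H(C) = \sum_{i=1}^{C} P_i
     = \frac{\sum_{i=1}^{C} i^{-\alpha}}{\sum_{i=1}^{N} i^{-\alpha}}

% For large C and N:
H(C) \approx \frac{\ln C}{\ln N}                        \quad (\alpha = 1)
H(C) \approx \left( \frac{C}{N} \right)^{1 - \alpha}    \quad (\alpha < 1)
```

For α = 1 the growth is logarithmic in the cache size; for α below 1 it is a small power of the cache size, which looks similar over the ranges typically measured.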
Observations
Locality of Reference
 The probability that a document will be
referenced k requests after it was last
referenced is roughly proportional to 1/k.
Observations - NOT
 There is very little correlation between access
frequency and document size
 There is no correlation between access
frequency and the change rate of a document
 No single web server contributes to most of
the popular pages
Zipf’s Law and Caching
Discussion
 How does this help in cache design?
 Are there any business implications?
Agenda
 Caching: Why, Where, How, What
 Some empirical data: Zipf’s Law
 Content Delivery Networks
 Bibliography
CDN
 “Traditional” CDN
 Dirty Secrets
 P2P content delivery systems
Why use a CDN?
[Figure: network diagram with the labels Browser, Intranet, cache, Local ISP, ISP, L4 Switch, cdn, Data Center, Reverse Proxy, and Content Server]
What is CDN?
Content Delivery Networks = PUSH
PUSH = Prefetch
CDN Mechanisms
 DNS redirection
    Complete
    Partial
 URL rewrite
A DNS-redirecting CDN
[Figure: network model. The client asks the DNS redirector to resolve example.com and receives the answer "B". It then sends GET http://example.com/foo to HTTP server B. Other nodes shown: HTTP server A, HTTP server C, and the original server.]
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
CDN DNS Full Redirection
 (Semi)automatic mechanism to replicate original site
on CDN servers
 Replace original DNS entry with enhanced DNS
server that uses knowledge of network and server
load to direct clients to appropriate CDN server
 TTLs on DNS entries are very short
 Adero, NetCaching, IntelliDNS
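The selection step of such an enhanced DNS server might look like the Python sketch below. Everything in it is assumed for illustration: the CdnServer record, the resolve function, the load threshold, the 20-second TTL, and the idea of scoring by estimated round-trip time per client region. None of these describe how the listed vendors actually work.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CdnServer:
    ip: str
    load: float               # 0.0 (idle) to 1.0 (saturated)
    rtt_ms: Dict[str, float]  # estimated round-trip time per client region


SHORT_TTL_SECONDS = 20        # assumed value; short so clients re-resolve often


def resolve(servers: List[CdnServer], client_region: str,
            max_load: float = 0.9) -> Tuple[str, int]:
    """Return (ip, ttl) of the nearest server that is not overloaded."""
    # Prefer servers under the load threshold; fall back to all if none qualify.
    candidates = [s for s in servers if s.load < max_load] or servers
    best = min(candidates, key=lambda s: s.rtt_ms.get(client_region, float("inf")))
    return best.ip, SHORT_TTL_SECONDS
```

The short TTL is what lets the redirector change its answer as load and network conditions change, at the cost of more DNS lookups by clients.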
CDN DNS Partial Redirection
 Statically modify selected URLs within pages to point to the CDN service
 Replicate selected objects to the CDN service
 Redirect clients of selected URLs using an enhanced DNS server that uses knowledge of network and server load
 Akamai, Digital Island, MirrorImage, SolidSpeed,
Speedera
CDN rewrite
 Modify pages at the origin server on the fly
 Change embedded URLs based on up-to-date knowledge of the network and CDN server loads
 Does not require additional DNS lookups
 Fasttide, Clearway
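A minimal Python sketch of URL rewriting at the origin, assuming the page embeds images with absolute URLs in src="..." attributes. The function name rewrite_images and the host names in the usage note are hypothetical, and a regular expression is used only to keep the example short; a real rewriter would more likely use an HTML parser.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches the src attribute of an <img> tag (double-quoted values only).
_IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def rewrite_images(html: str, origin_host: str, cdn_host: str) -> str:
    """Point image URLs on origin_host at cdn_host; leave other URLs alone."""
    def swap(match):
        parts = urlsplit(match.group(2))
        if parts.netloc != origin_host:
            return match.group(0)  # relative or third-party URL: unchanged
        rewritten = urlunsplit(parts._replace(netloc=cdn_host))
        return match.group(1) + rewritten + match.group(3)

    return _IMG_SRC.sub(swap, html)
```

The caller picks cdn_host per request, for example rewrite_images(page, "example.com", "cdn.example.net"), which is where up-to-date knowledge of the network and server loads would enter.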
Measuring a CDN’s performance
 Two papers
 K. L. Johnson, J. F. Carr, M. S. Day, and M. F. Kaashoek, “The measured performance of content distribution networks,” in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop (Lisbon, Portugal), May 2000.
 B. Krishnamurthy, C. Wills, Y. Zhang, “On the Use and Performance of Content Distribution Networks,” in ACM SIGCOMM Internet Measurement Workshop 2001.
The measured performance of content
distribution networks
Client Actions
 R: Resolve domain name
 F: Fetch content
 Ordinary client use of CDN: RF
 Instead of doing (RF)+ we do R+ then F+
    This allows us to compare the server chosen to some other servers that could have been chosen, over a large number of fetches.
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
Procedure
 R+: Collect a set of servers by repeated DNS queries
    to a variety of name servers
    over a number of hours
 F+: Fetch a particular piece of content from each member of the set, measuring latency
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
Important Details
 Interleaved fetches
    Fetch1 at server1, fetch1 at server2, etc.
    Not fetch1 at server1, fetch2 at server1, etc.
 Unmeasured fetch before measured fetch
    Avoids cache misses
 Measure only HTTP fetch latency
    CDN not penalized for cost of DNS resolution
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
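The F+ phase with these details can be sketched in Python as follows. This is an illustration and not the authors' measurement tool: measure_interleaved is an assumed name, the fetch callable stands in for an HTTP GET against an already resolved server, and issuing the unmeasured warm-up fetch before every measured fetch is one reading of the slide.

```python
import time
from typing import Callable, Dict, List


def measure_interleaved(servers: List[str], fetch: Callable[[str], None],
                        rounds: int,
                        clock: Callable[[], float] = time.perf_counter
                        ) -> Dict[str, List[float]]:
    """Collect `rounds` HTTP fetch latencies per server, interleaved."""
    latencies: Dict[str, List[float]] = {s: [] for s in servers}
    for _ in range(rounds):
        for server in servers:  # fetch1 at server1, fetch1 at server2, etc.
            fetch(server)       # unmeasured fetch: warms the server's cache
            start = clock()
            fetch(server)       # measured fetch: HTTP latency only, no DNS
            latencies[server].append(clock() - start)
    return latencies
```

Because the server addresses are collected beforehand in the R+ phase, no DNS resolution time enters the measured interval.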
The measured performance of content
distribution networks: Looking at these graphs
 Note: log plot of latency
 Gray line: cumulative distribution at one server
 Red line: cumulative distribution at all servers
 Blue line: cumulative distribution at CDN
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
Cumulative Distribution
 Right way to look at this data
    Want to understand frequency and magnitude of bad choices
 Consistent = vertical
 Fast = to the left
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
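For reference, an empirical cumulative distribution of latencies can be computed with a few lines of Python. The function names below are assumptions made for this sketch; plotting the returned points with latency on a log x-axis would give curves of the kind described above.

```python
from typing import List, Tuple


def empirical_cdf(latencies: List[float]) -> List[Tuple[float, float]]:
    """Return (latency, fraction of fetches at or below that latency) points."""
    ordered = sorted(latencies)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


def fraction_slower_than(latencies: List[float], threshold: float) -> float:
    """How often a fetch was a 'bad choice', i.e. slower than the threshold."""
    return sum(1 for x in latencies if x > threshold) / len(latencies)
```

A steep, nearly vertical curve means the latencies are consistent; a curve further to the left means they are lower.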
The measured performance of content
distribution networks
Results
 Akamai does a better job than Digital Island
 Neither does a particularly good job of
selecting the optimal server
Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content
distribution networks
 What’s wrong with this study?
 Focus is on choice of server
 Cost of DNS is explicitly excluded
 How does this relate to client performance?
Measuring a CDN’s performance
 Two papers
 K. L. Johnson, J. F. Carr, M. S. Day, and M. F. Kaashoek, “The measured performance of content distribution networks,” in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop (Lisbon, Portugal), May 2000.
 B. Krishnamurthy, C. Wills, Y. Zhang, “On the Use and Performance of Content Distribution Networks,” in ACM SIGCOMM Internet Measurement Workshop 2001.
On the Use and Performance of
Content Distribution Networks
 Focus is on client perceived performance
 Build canonical web page with images from
CDN server
On the Use and Performance of
Content Distribution Networks
 If each CDN serves different content, then
how did they create comparable pages?
 Size matters!
 Select images of (almost) identical sizes from
each of the CDN services
On the Use and Performance of
Content Distribution Networks
 Step 1:
 For services using only DNS redirection, get
an IP address from the DNS server
 For services using rewriting, get the page and
extract the CDN content server from the page
 Amortize DNS lookup time over all images in
this page
On the Use and Performance of
Content Distribution Networks
 Step 2:
 Download all the images from the IP address
of the identified server
 Throw this data away
 The purpose is to make sure that there are no
cache misses
On the Use and Performance of
Content Distribution Networks
 Step 3:
 Download all the images from the IP address
of the identified server just like a browser
would (4 in parallel)
 Repeat every 30 minutes over a period of 24
hours with a 10 minute jitter
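Steps 2 and 3 can be sketched in Python as a warm-up pass followed by a timed pass with four parallel downloads. This is an illustration under assumptions, not the paper's tool: timed_page_download is an assumed name, and fetch stands in for an HTTP GET of one image from the already identified server. The 30-minute schedule with jitter is left to the caller.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def timed_page_download(image_urls: List[str], fetch: Callable[[str], bytes],
                        parallel: int = 4) -> float:
    """Return the seconds taken to download all images, browser style."""
    # Step 2: download everything once and throw the data away,
    # so the timed pass sees no cache misses at the CDN server.
    for url in image_urls:
        fetch(url)
    # Step 3: timed pass, up to `parallel` downloads in flight at once.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        list(pool.map(fetch, image_urls))  # wait for every download
    return time.perf_counter() - start
```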
On the Use and Performance of
Content Distribution Networks
 Results
On the Use and Performance of
Content Distribution Networks
Four Conclusions
1. Forcing a DNS lookup in the critical path of resource retrieval does not generally result in better server choices
2. The download time from a previously selected server is often better than the download time from the newly selected server
3. CDN servers are generally not loaded, so frequent DNS lookup is not helpful
4. It makes sense for CDNs to increase the DNS TTL given to a client unless the servers are known to be loaded
On the Use and Performance of
Content Distribution Networks
 Is this a better study?
 More detailed results
 Relates to observed performance
 A good marketing white paper
 What did we learn?
Dirty Secrets of the CDN world
 CDNs are tremendously underutilized
 CDNs are over-architected
 The value of a CDN is its remote presence in the ISP, not its ability to load balance
 Remember the ISP Interconnect?
P2P content delivery systems
 PUSH content to the leaf nodes
 Serve other leaf nodes from the edges
 Kontiki
[Figure: a Content Manager and two clients, with steps numbered 1, 2, and 3]
P2P CDN
Four Challenges
1. Aggregate input streams
2. Deal with unstable peers
3. Manage Malicious peers
4. Who really pays for this?
P2P Caching?
 Discussion:
 Is this a good idea?
 What are the issues?
 Where is the payback?
Agenda
 Caching: Why, Where, How, What
 Some empirical data: Zipf’s Law
 Content Delivery Networks
 Bibliography
Bibliography
 Gray, Shenoy, Rules of Thumb in Data Engineering, 1999, Revised March 2000. Microsoft Research MS-TR-99-100
 Berners-Lee, Fielding, Frystyk, Hypertext Transfer Protocol -- HTTP/1.0, IETF RFC 1945, http://www.w3.org/Protocols/rfc1945/rfc1945
 Fielding, Gettys, Mogul, Frystyk, Masinter, Leach, Berners-Lee, Hypertext Transfer Protocol -- HTTP/1.1, ftp://ftp.isi.edu/in-notes/rfc2616.txt
 Greg Barish and Katia Obraczka, World Wide Web Caching: Trends and Techniques, IEEE Communications, May 2000. http://www.isi.edu/people/katia/cache-survey.pdf
 Breslau, Cao, Fan, Phillips, Shenker, Web Caching and Zipf-like distributions: Evidence and Implications, IEEE Infocom 1999
 K. L. Johnson, J. F. Carr, M. S. Day, and M. F. Kaashoek, “The measured performance of content distribution networks,” in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop (Lisbon, Portugal), May 2000. www.terena.nl/conf/wcw/Proceedings/S4/S4-1.pdf
 B. Krishnamurthy, C. Wills, Y. Zhang, “On the Use and Performance of Content Distribution Networks,” in ACM SIGCOMM Internet Measurement Workshop 2001. http://www.icir.org/vern/imw-2001/imw2001-papers/10.pdf
Bibliography
 “Zipf Distribution of Web Site Popularity”, http://www.useit.com/alertbox/zipf.html
 S. Gribble, E. Brewer, “System Design Issues for Internet Middleware Services: Deductions from a Large Client Trace”, Proceedings of the USENIX Symposium on Internet Technologies and Systems, Monterey, California, December 1997
 “The Internet is a little bit broken”, http://www.internap.com/about/theproblem.html
 “Reliable Internet Connectivity with BGP, Chapter 7, Influencing Entrance Selection”, http://www.bgpbook.com/archpolicyenter.html
 A. Cockburn, B. McKenzie, “What do Web Users Do? An Empirical Analysis of Web Use”, http://www.cosc.cantebury.ac.nz/~andy/papers/ijhcsAnalysis.pdf