caching - The University of Sydney

ELEC 5501
Advanced Communication Networks
Web Caching and
Content Distribution Networks
Bjorn Landfeldt, The University of Sydney
Outcomes
• Understand the drive for content replication
• Awareness of the differences and
similarities between caching and CDNs
• Awareness of the current best practices and
standards
• Understanding how replication can help
increase QoS
Problem
• Massive amounts of data stored on servers
• Server capacity and network capacity
limited
• Expensive to go “long distances” over the
Internet
• Solution: replicate content or cache content
• Today, the web - tomorrow any data
Overview of Web
Caching
• Cache server (proxy)
• Why caching in the network?
• Hierarchical caching
• Problems with caching
Strongly based on Keith Ross’s Tutorial
Cache Server
• A cache is both a server and a client
[Figure: clients, cache server (proxy), and origin servers, with the ISP boundary marked]
Why Cache in the
Network
• Reduce latency by avoiding slow links
between client and origin server
– Low bandwidth links
– Congested links
• Reduce traffic on links
– Between institutional network and regional ISP
– Reduce traffic on transoceanic links
• Spread load of overloaded origin server to
caches
– An Internet dense with caches allows a content
provider to offer high performance distribution at
low cost
• Inexpensive server
• Low-bandwidth Internet connection
Implications of Cache in
the Network
• Network caching complements client
caching
• Paradigm shift in traffic engineering
– Bandwidth is no longer the only shared
resource; now there is bandwidth and storage
Hierarchical Caching
• Each ISP can have a cache
• ISPs higher in the hierarchy have
– Larger user populations
– Higher hit rates
[Figure: hierarchy of origin servers, national ISP, regional ISPs, local ISP, and clients; the legend marks the caches]
Cache Chaining
• Hierarchies use cache chaining
• All communications along chain can be
over HTTP
client → cache → server
User configures browser to point to cache
client → cache → cache → server
User configures browser to point to the first cache, and the first cache
points to the second cache, and so on
Cooperative Caching
• Multiple sibling caches within a single ISP
• One or more of the siblings could contain
the requested object
• Cooperation
– ICP: siblings send messages to each other to
find a copy of object
– CARP: URL space partitioned
[Figure: clients and sibling caches]
Caching Challenges
• Cache consistency:
– Cache often must guess whether a stored object is stale or fresh
• Dynamic content:
– Caches shouldn't cache outputs of CGI scripts
• Hit counts and personalization:
– Caches can cause hit count calculations and cookie transactions to
fail
• Less-savvy users and privacy-concerned users:
– How do you get a user to point his/her browser to a cache?
• Access control:
– How do you make sure that the seller of the documents gets paid?
– Legal and security restrictions
• Enormous multimedia files:
– Disk storage is increasing at a rate of approx 60% a year still!
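The "guess" in the cache consistency bullet is usually a heuristic. A minimal sketch, assuming the common rule of thumb that an object with no explicit expiry is treated as fresh for about 10% of its age at the time it was fetched; the function name `is_fresh` and the 10% factor are illustrative assumptions, not something these slides prescribe:

```python
from datetime import datetime, timedelta

def is_fresh(fetched_at, last_modified, now, factor=0.1):
    """Guess freshness: allow a lifetime of `factor` times the object's age when fetched."""
    heuristic_lifetime = (fetched_at - last_modified) * factor
    return (now - fetched_at) <= heuristic_lifetime

fetched = datetime(2024, 1, 11)
modified = datetime(2024, 1, 1)   # object was 10 days old when fetched, so the guess is ~1 day

print(is_fresh(fetched, modified, fetched + timedelta(hours=12)))  # still inside the guessed lifetime
print(is_fresh(fetched, modified, fetched + timedelta(days=2)))    # past it: revalidate with the origin
```

An object that changes rarely gets a longer guessed lifetime, which is why the guess can still be wrong in either direction.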
Replication Caching
• When an ftp or HTTP server is very busy it can
replicate itself
– Load balancing distributes the load across all servers
• Round-robin DNS
– Maps a single host name to multiple servers with
different IP addresses
– DNS rotates the IP addresses each time it receives a
request
• Re-directions
– Webserver returns a re-direction to a parallel server
• Can be done with a 301 Moved Permanently status and a Location:
header in the response message
• Main server re-directs request to a pool of servers
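The round-robin DNS rotation described above can be sketched as a toy resolver. This is only an illustration under stated assumptions: the class name `RoundRobinDNS`, the host name, and the documentation-range addresses are all made up, and a real name server does far more than this.

```python
from collections import deque

class RoundRobinDNS:
    """Toy resolver: one host name maps to several addresses, rotated on every query."""

    def __init__(self, records):
        self._records = {name: deque(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        addrs = self._records[name]
        answer = list(addrs)   # clients typically try the first address in the answer
        addrs.rotate(-1)       # the next query sees the list shifted by one
        return answer

dns = RoundRobinDNS({"www.example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]})
first_choices = [dns.resolve("www.example.com")[0] for _ in range(4)]
print(first_choices)
```

Nothing in the rotation looks at server load or health, which is the weakness the round-robin approach is known for.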
Round Robin DNS
Advantages &
Disadvantages
• Advantages
– Inexpensive
– Easy to set up
– Application OS independent
– Requires no resources from the application servers
• Disadvantages
– Doesn't monitor server load
– Doesn't remove failed servers from the rotation
– Won't work well if servers are of different size/power
– Doesn't work well if session state must be maintained
– DNS caching causes problems
DNS Redirection
• Some intermediary intercepts the request, and
directs it to a selected site.
– Layer 4-7 switching? E.g., look at URL or server IP address.
– Interpose on the binding procedure, before the client
sends the request itself.
• Smart clients, Active Names, RPC binding, or DNS lookup
• Most third-party CDNs are based on DNS servers
that select the cache/replica site on DNS lookup for
the request.
• Akamai, Digital Island, Web hosting providers (e.g., Exodus), etc.
• Like DNS-RR....but smarter...
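What "smarter" than DNS-RR might mean can be sketched as a selection function run at lookup time. The `Replica` fields, the region labels, and the ranking rule (healthy first, then same region, then lowest load) are assumptions for illustration only; the slides do not say how any particular CDN ranks its sites.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    ip: str
    region: str
    load: float          # 0.0 idle .. 1.0 saturated (hypothetical metric)
    healthy: bool = True

def select_replica(client_region, replicas):
    """Pick the address a DNS-based request router could return for this lookup."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise LookupError("no healthy replica available")
    # Same-region replicas sort first (False < True), then the least loaded one wins.
    return min(candidates, key=lambda r: (r.region != client_region, r.load))

replicas = [
    Replica("192.0.2.10", "eu", 0.2),
    Replica("198.51.100.10", "au", 0.9),
    Replica("203.0.113.10", "au", 0.4, healthy=False),
]
print(select_replica("au", replicas).ip)
print(select_replica("us", replicas).ip)
```

In practice the DNS server sees the address of the client's resolver rather than the client itself, so the region is an estimate.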
Pre-fetching Cache
• Retrieves specific pages or sites at regular
intervals
– Can also pre-fetch pages that are outside the referenced
site
– Pre-fetch their referenced pages
– Can pre-fetch a hierarchy of pages across a number of
sites
• May also perform periodic up-to-date checks on
all documents in the cache
Cache Effectiveness
• Previous work has shown that hit rate
increases with population size
• However, single proxy caches have
practical limits
– Load, network topology, organizational
constraints
• One technique to scale the client population
is to have proxy caches cooperate
Hierarchical Caches
Idea: place caches at exchange or
switching points in the network, and
cache at each level of the hierarchy.
Resolve misses through the parent.
[Figure: origin Web site and the Internet upstream; groups of clients downstream]
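Resolving misses through the parent can be sketched with a small in-memory model. The `Cache` class, the level names, and the dictionary standing in for the origin Web site are hypothetical; real caches would speak HTTP and apply expiry rules.

```python
class Cache:
    """One level of the hierarchy; a miss is resolved through the parent."""

    def __init__(self, name, parent=None, origin=None):
        self.name, self.parent, self.origin = name, parent, origin
        self.store = {}

    def get(self, url):
        """Return (body, name of the level that actually had the object)."""
        if url in self.store:
            return self.store[url], self.name
        if self.parent is not None:
            body, source = self.parent.get(url)
        else:
            body, source = self.origin[url], "origin"
        self.store[url] = body   # keep a copy on the way back down
        return body, source

origin = {"/index.html": "<html>hello</html>"}
national = Cache("national", origin=origin)
regional = Cache("regional", parent=national)
local_a = Cache("local-a", parent=regional)
local_b = Cache("local-b", parent=regional)

first = local_a.get("/index.html")[1]    # misses at every level
second = local_b.get("/index.html")[1]   # sibling subtree benefits from the shared parent
third = local_a.get("/index.html")[1]    # now a local hit
print(first, second, third)
```

The second request shows why caches higher in the hierarchy see higher hit rates: they aggregate the requests of every population below them.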
Cache Array Routing
Protocol
• A set of caching proxies can effectively function
as a single logical cache
• Uses a hash function to partition the URLs across
caches
• All queries are done over HTTP
– No new application layer protocol such as ICP
– Can take advantage of HTTP/1.1
• Implemented in Microsoft and Netscape cache server
products
Operation
• A client trying to locate a cached resource
targets the request to the appropriate cache
by applying a hash function
• The hash function uses the request URL and
the identity of the proxy members to
construct a resolution path
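One way to combine the request URL with the identity of each proxy member is to score every (member, URL) pair and order the members by score. The sketch below uses SHA-256 purely as a stand-in; the actual protocol defines its own hash functions and member load factors, which are not reproduced here, and the proxy names are hypothetical.

```python
import hashlib

def _score(member, url):
    """Combined score for one (member, URL) pair; SHA-256 is an assumption, not the protocol's hash."""
    digest = hashlib.sha256(f"{member}|{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def resolution_path(url, members):
    """Members ordered from highest to lowest score; the first one is asked first."""
    return sorted(members, key=lambda m: _score(m, url), reverse=True)

members = ["proxy1.example.net", "proxy2.example.net", "proxy3.example.net"]
url = "http://www.example.com/a.gif"
path = resolution_path(url, members)
print(path)

# Dropping a member that was not first leaves the first choice for this URL unchanged.
remaining = [m for m in members if m != path[-1]]
print(resolution_path(url, remaining)[0] == path[0])
```

Because every client computes the same ordering from the same member list, no cache-to-cache query is needed to find the object.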
Hash Routing Overview
• Choose a hash function h() which maps URLs to a
hash space
– Let the hash space be {1,…,60}
– Let h() be the sum of the ASCII codes of the characters
in the URL, modulo 60, plus 1 (so that h() falls in {1,…,60})
• Partition hash space: one set for each sibling
– Client hashes URL, determines set to which hashed URL
belongs and sends request to corresponding sibling
– Set for cache 1 = {1,…,30}, set for cache 2 = {31,…,60}
– h(URLa) = 35, send request to cache 2
– If sibling does not have object, obtain from origin server
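The example above can be written out directly. One assumption is added here: 1 is added after the modulo so that values land in {1,…,60}, since a bare modulo 60 would give 0 to 59. The names `h`, `SIBLING_SETS`, and `route` are illustrative.

```python
def h(url):
    """Sum of the ASCII codes of the URL's characters, modulo 60, shifted into 1..60."""
    return sum(ord(ch) for ch in url) % 60 + 1

# Partition of the hash space: one set per sibling, as in the two-cache example.
SIBLING_SETS = {"cache 1": range(1, 31), "cache 2": range(31, 61)}

def route(url):
    """Return (sibling, hash value) for the sibling whose set contains h(url)."""
    value = h(url)
    for sibling, hash_set in SIBLING_SETS.items():
        if value in hash_set:
            return sibling, value
    raise ValueError(f"hash value {value} is outside every sibling set")

print(route("http://www.example.com/index.html"))
```

A character-sum hash is easy to follow but spreads real URLs poorly (anagrams collide), so it serves only as a teaching example.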
Hashing: Cache Array
Routing Protocol (CARP)
[Figure: a hash function maps the request “GET www.hotsite.com” onto caches responsible for a-f, g-p, q-u, and v-z]
Advantages
1. single-hop request resolution
2. no redundant caching of objects
3. allows client-side implementation
4. no new cache-cache protocols
5. reconfigurable
Hash Routing (2)
• Each object resides in at most one sibling
• Client is immediately directed to the correct
sibling
– Disk and RAM storage are effectively
aggregated -> higher hit rates
Edward Chow
Content Distribution
Networks
• Thus far we have looked at caching
– Caches are provided by the ISP (network) or the
client: forward proxy caches
Edward Chow
Another Solution
• Push Content to the edges of the network
Content Distribution
Networks (CDNs)
• Be proactive and distribute the content
closer to the clients
• The distribution infrastructure is not owned
by the ISP, or the owners of content
– Third Party
• A CDN is a collection of interconnected
cache servers scattered around the
world, any of which can serve a client
Basic CDN Operation
• When a request is sent to the server (origin),
it is redirected to another server (proxy
cache server) which is closer and/or can
serve faster
– The origin server must be able to determine the
location of the client and find the appropriate
proxy cache server
Jeff Chase
Generalized Cache/CDN
(Internal View)
[Figure labels: interior caches; request routing function ƒ; root caches, reverse proxies, CDN caches; leaf caches (e.g., ISP proxies); bound client populations]
CDN Challenges
• Challenges are
– Which cache server to use (request routing
function) ?
– When/where/how to push/deliver the content
(content distribution)?
– Where to put cache servers?
– Associated questions
• How many cache servers are needed?
• How about dynamic content?
How L4-Aware Systems
Work
• By making intelligent switching decisions and forwarding
frames based on TCP/UDP port information and IP
source/destination addresses
• L4 switching = session switching
– examines client requests directed at the L4 switch
– multiplexes client requests across any server available to handle those requests
– passively measures application health and responsiveness to determine server
availability
– stateful processing
• By combining the benefits of L4 software on a high-speed
L2 switching platform
• By using this information to establish policy
controls for how traffic is to be managed
Key Layer 4-based
Applications
1. Local/Global Server load balancing
2. High availability applications
3. Web Cache Redirection
4. DNS redirection
5. Firewall Load Balancing
6. URL-based redirection, switching
E.g. Local Server Load
Balancing
• Scalable application processing capacity
– Add servers on-demand
• High availability
– Server/application health monitoring
– Backup and overflow servers
– Hot-standby switch configurations
• Tiers-of-service by servers
– Priority users/applications can be
directed to premium servers
• Integrated switch and load balancer
– Flexibility
– Scalability
– Economy of scale
– Performance
[Figure: clients and groups of FTP, HTTP, DNS, and database query servers]
Jeff Chase
Alternative Solution
• Intelligent DNS-based request routing has some tricky
parts:
– Third-party CDNs contract with content providers (e.g., Web sites
such as cnn.com) to serve a subset of their content.
• Resource-rich content, e.g., images, audio, video.
– To use DNS request routing, the CDN must assume DNS duties for
the URLs that reference the content it serves.
– The content provider does not want to designate the CDN as the
authoritative DNS server for its domain (e.g., cnn.com).
• Solution: make up new DNS domains for the content served
by the CDN – URL rewriting
URL Rewriting
• Origin server dynamically generates
pages to redirect clients to different
content servers
• Page is dynamically rewritten with the
IP address of a mirror server.
Pre-Caching
• Content is delivered to cache before
requests are generated
• Used for highly distributed usage
• Caches can be updated during off-hours
to reduce network load
• There are no standardised schemes so we
will look at how this is done in practice
Just-In-Time
• Content is pulled from the origin server to the
cache when a request is received from a client
• The object is delivered to the client and
simultaneously stored on the cache for later use
• Can implement multicasting for efficient
content transfer between caches
• Leased lines may be used between servers to
ensure QoS
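The difference between pre-caching (push before any request) and just-in-time delivery (pull on the first request) can be sketched with a toy edge cache. `EdgeCache`, `fake_origin`, and the URLs are hypothetical names; multicast between caches and leased lines are outside what this sketch models.

```python
class EdgeCache:
    """Toy CDN edge cache supporting both delivery styles."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin   # callable: url -> content
        self._store = {}
        self.origin_fetches = 0           # how often the origin had to be contacted

    def _pull(self, url):
        self.origin_fetches += 1
        self._store[url] = self._fetch(url)

    def preload(self, urls):
        """Pre-caching: content arrives before any client asks for it."""
        for url in urls:
            self._pull(url)

    def get(self, url):
        """Just-in-time: pull on the first miss, store it, then serve from the cache."""
        if url not in self._store:
            self._pull(url)
        return self._store[url]

def fake_origin(url):
    return f"content of {url}"

edge = EdgeCache(fake_origin)
edge.preload(["/logo.gif"])   # pushed during off-hours
edge.get("/logo.gif")         # hit, no origin traffic
edge.get("/clip.mpg")         # first request pulls from the origin
edge.get("/clip.mpg")         # later requests are served locally
print(edge.origin_fetches)
```

Either way the origin is contacted once per object; the two styles differ in when that transfer happens and who pays the latency.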
Example - Akamai (1)
• Akamai sells a content delivery service that looks
like what a hosting company sells as Internet
interconnection bandwidth
• When you "Akamaize" content, the content is
subsequently served by Akamai’s system rather
than from the origin server.
• Content provider pays Akamai on the basis of the
peak load experienced (in Mbits/second - just like
bandwidth).
• The net result is usually a significant
improvement in access performance
Akamai (2)
• Akamai has implemented a distributed network of servers on
multiple service provider backbones across the Internet
– No central server that knows about all proxies and controls them
– They have put proxies in the networks of many service providers
– This way they hope that every client will be in the vicinity of at
least one of them
• In order for servers to cooperate and exchange information
(sort of pre-fetching) they have developed a dynamic
discovery scheme called Name-Dropper
• The way they have picked these locations for the proxies is
not known
Akamai – Operation (1)
• The size of most web pages (70%, according to Akamai) is driven
not by the text they contain but by other embedded objects
– Get the text from the server and all the other objects from a
nearby proxy
• Every page that is served by the Akamai network
1) is passed through a program that tags all embedded objects
2) when the client downloads this page and requests the embedded
objects, those requests are directed to a nearby proxy
– Therefore every client gets a slightly different page
• This can be done in two ways
– Dynamically generating the proper code and feeding it to the client
– Intercepting packets that have references to the tagged objects and
changing them on the fly
Akamai – Operation (2)
Akamai – Operation (3)
• The Akamai scheme basically relies on a
preprocessing phase where the large objects of a
page are identified and tagged
• No other changes to the original page or the
software for generating the page
• Then these objects are distributed to some proxies
– Map different objects to different proxies in order to
balance the traffic
Akamai – Operation (4)
• Akamai has developed a technique for mapping
objects to proxies which is called consistent
hashing
• The client would need to decide which proxy contains the
required information and can deliver it fastest
– But the client's software doesn't have the capability to perform such a
function!
• So this is performed while names are resolved to IP
addresses, using Akamai's DNS servers
– The DNS server performs the hashing function for the
client and returns as its answer the IP address of the closest
proxy
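Consistent hashing is often explained with a ring, and a minimal sketch of that textbook form follows. It is not a description of Akamai's actual system, whose details the slides do not give; the proxy names, the use of SHA-256, and the number of points per proxy are assumptions.

```python
import bisect
import hashlib

def _point(key):
    """Position of a key on the ring (SHA-256 is a stand-in hash)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, proxies, points_per_proxy=100):
        # Each proxy owns many points on the ring to even out its share of objects.
        self._ring = sorted(
            (_point(f"{proxy}#{i}"), proxy)
            for proxy in proxies
            for i in range(points_per_proxy)
        )
        self._points = [p for p, _ in self._ring]

    def lookup(self, object_name):
        """The object belongs to the first proxy point clockwise from its own position."""
        idx = bisect.bisect_right(self._points, _point(object_name)) % len(self._ring)
        return self._ring[idx][1]

proxies = ["p1", "p2", "p3"]
small = ConsistentHashRing(proxies)
big = ConsistentHashRing(proxies + ["p4"])
keys = [f"/img/{i}.gif" for i in range(200)]
moved = sum(small.lookup(k) != big.lookup(k) for k in keys)
print(f"{moved} of {len(keys)} objects moved after adding p4")
```

The useful property is that adding a proxy moves only the objects that the new proxy takes over; every other object stays where it was.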
Jeff Chase
Domain Granularity and
“Akamaizing”
– Akamai creates new domain names for each client content
provider.
• e.g., a128.g.akamai.net
– Akamai’s DNS servers are authoritative for the new domains.
– The client content provider modifies its content so that
embedded URLs reference the new domains.
• “Akamaize” content, e.g.: http://www.cnn.com/image-of-the-day.gif
becomes http://a128.g.akamai.net/image-of-the-day.gif.
– Using multiple domain names for each client allows the CDN
to further subdivide the content into groups.
• DNS sees only the requested domain name, but it can route requests for
different domains independently.
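The rewriting step can be sketched as a function that repoints embedded object URLs at the CDN's domain while leaving the page's own links alone. The function name `akamaize`, the regular expression, and the list of file extensions are illustrative assumptions; only the two host names and the example image URL come from the slide.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def akamaize(html, origin_host, cdn_host, extensions=(".gif", ".jpg", ".png")):
    """Rewrite absolute URLs of embedded objects on origin_host to point at cdn_host."""
    def rewrite(match):
        parts = urlsplit(match.group(0))
        if parts.hostname == origin_host and parts.path.lower().endswith(extensions):
            return urlunsplit(parts._replace(netloc=cdn_host))
        return match.group(0)   # leave HTML pages and other hosts untouched
    return re.sub(r'https?://[^\s"<>]+', rewrite, html)

page = (
    '<a href="http://www.cnn.com/index.html">'
    '<img src="http://www.cnn.com/image-of-the-day.gif"></a>'
)
rewritten = akamaize(page, "www.cnn.com", "a128.g.akamai.net")
print(rewritten)
```

Once the embedded URL names the new domain, the DNS lookup for that name goes to the CDN's authoritative servers, which is where the replica is chosen.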