Transcript slides

Availability of Network Application Services
Candidacy Exam
Jong Yul Kim
February 3, 2010
Motivation

Critical systems are being deployed on the internet, for example:
• Next Generation 9-1-1
• Business transactions
• Smart grid

How do we enhance the availability of these critical services?
Scope

What can application service providers do to enhance end-to-end service availability?

The problems are that:
• The underlying internet is unreliable.
• Servers fail.
Scope

Level of abstraction: system design
• On top of networks as clouds (but we can probe them)
• Using servers as nodes

The following are important for availability, but not in scope:
• Techniques that ISPs use.
• Techniques to enhance availability of individual components of the system.
• Defense against security vulnerability attacks.
Contents
The underlying internet is unreliable.

Symptoms of unavailability in the application
• Web: “Unable to connect to server.”
• VoIP: call establishment failure, call drop [2]

Symptoms of the network during unavailability
• High percentage of packet loss
• Long bursts of packet loss [1,2]
• 23% of lost packets belong to outages [2]
• Packet delay [2]

[1] Dahlin et al. End-to-End WAN Service Availability. IEEE/ACM TON 2003.
[2] Jiang and Schulzrinne. Assessment of VoIP Service Availability in the Current Internet. PAM 2003.
The underlying internet is unreliable.

Network availability figures
• Paths to popular web servers: 99.6% [1]
• Call success probability: 99.53% [2]

[1] Gummadi et al. Improving the Reliability of Internet Paths with One-hop Source Routing. OSDI 2004.
[2] Jiang and Schulzrinne. Assessment of VoIP Service Availability in the Current Internet. PAM 2003.
The underlying internet is unreliable.

Possible causes of these symptoms
• Network congestion
• BGP updates
  - 90% of lost packets near BGP updates are part of a burst of 1500 packets or longer, i.e. at least 30 seconds of continuous loss. [1]
• Intra-domain link failure [1,2]
• Software bugs in routers [2]

[1] Kushman et al. Can You Hear Me Now?! It Must Be BGP. SIGCOMM CCR 2007.
[2] Jiang and Schulzrinne. Assessment of VoIP Service Availability in the Current Internet. PAM 2003.
The underlying internet is unreliable.

Challenges
• Quick re-routing around network failures
• With limited control of the network
Contents
Resilient Overlay Network [1]

• An overlay of cooperative nodes inside the network
• Each node is like a router
  - It has its own forwarding table
  - A link-state algorithm is used to construct a map of the RON network
• Nodes actively probe the paths among themselves (see the sketch below)
  - Fully meshed probing = fast detection of path outages

[1] Andersen et al. Resilient Overlay Networks. SOSP 2001.
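The probing-and-rerouting idea above can be condensed into a few lines. This is a minimal illustrative sketch, not the actual RON implementation: the node names, loss figures, and the loss-based scoring are made-up assumptions standing in for RON's probe-driven link-state database and its latency/loss/throughput metrics.

    # Hypothetical per-hop loss rates measured by active probing (node -> node).
    # In a real RON these come from continuous probes; here they are fixed inputs.
    probe_loss = {
        ("A", "B"): 0.40, ("B", "A"): 0.40,   # direct path A<->B is lossy
        ("A", "C"): 0.01, ("C", "A"): 0.01,
        ("C", "B"): 0.02, ("B", "C"): 0.02,
    }

    def path_loss(hops):
        """Loss rate of a path = 1 - product of per-hop delivery rates."""
        delivered = 1.0
        for hop in zip(hops, hops[1:]):
            delivered *= 1.0 - probe_loss.get(hop, 1.0)   # unknown hop -> unreachable
        return 1.0 - delivered

    def best_route(src, dst, nodes):
        """Pick the direct path or the best single-intermediary overlay path."""
        candidates = [[src, dst]]
        candidates += [[src, via, dst] for via in nodes if via not in (src, dst)]
        return min(candidates, key=path_loss)

    nodes = {"A", "B", "C"}
    route = best_route("A", "B", nodes)
    print(route, round(path_loss(route), 3))   # expect ['A', 'C', 'B'] 0.03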
Resilient Overlay Network [1]

• Outage defined as: outage(r, p) = 1 when the observed packet loss rate, averaged over an interval r, is larger than p on the path (sketch below).
• “RON win” defined as: the average loss rate on the Internet path ≥ p, while the RON loss rate < p.
• Results (figure), with r = 30 minutes
  - RON1: 12 nodes, >64 hours
  - RON2: 16 nodes, >85 hours

[1] Andersen et al. Resilient Overlay Networks. SOSP 2001.
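The outage and “RON win” definitions translate directly into a short check over a loss trace. The sample rates below are made-up numbers; r and p are the averaging interval and loss threshold from the slide.

    def avg(xs):
        return sum(xs) / len(xs)

    def outage(loss_samples, p):
        """outage(r, p) = 1 when the loss rate averaged over interval r exceeds p.
        loss_samples are the per-probe loss rates collected during one interval r."""
        return 1 if avg(loss_samples) > p else 0

    def ron_win(internet_samples, ron_samples, p):
        """A 'RON win': the Internet path averages loss >= p while RON stays below p."""
        return avg(internet_samples) >= p and avg(ron_samples) < p

    p = 0.30                               # e.g. a 30% loss threshold
    internet = [0.6, 0.8, 0.5, 0.7]        # hypothetical direct-path samples in one interval
    ron = [0.05, 0.02, 0.10, 0.04]         # hypothetical overlay-path samples
    print(outage(internet, p), ron_win(internet, ron, p))   # -> 1 True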
Resilient Overlay Network [1]

Merits
• Re-routes quickly around network failures
• Good recovery from core network failures

Limitations
• Does not recover from edge failures or last-hop failures
• Active probing is expensive
• Clients and servers cannot use RON directly
• Need to have a strategy to deploy RON nodes
• Trades off scalability for reliability [1]
• May violate ISP contracts

[1] Andersen et al. Resilient Overlay Networks. SOSP 2001.
Contents
One-Hop Source Routing [1]

• Intermediaries
• Main idea: if my path does not work, try another path through an intermediary.
• “Spray and pray” (sketch below)
  - Issue the packet to 4 randomly chosen intermediaries every 5 seconds and see if they succeed (called random-4).

[1] Gummadi et al. Improving the Reliability of Internet Paths with One-hop Source Routing. OSDI 2004.
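A rough sketch of the random-4 policy under stated assumptions: send_direct and send_via are stand-ins for the real send path and the relay mechanism, and the relay list is hypothetical.

    import random

    # Hypothetical relay nodes; in the paper these are PlanetLab hosts.
    INTERMEDIARIES = [f"relay{i}" for i in range(20)]

    def send_direct(packet, dst):
        """Stand-in for a normal send; pretend the direct path is down."""
        return False

    def send_via(packet, via, dst):
        """Stand-in for relaying the packet through one intermediary; pretend relays work."""
        return True

    def random_k_send(packet, dst, k=4):
        """'Spray and pray': on direct-path failure, retry through k random relays."""
        if send_direct(packet, dst):
            return "direct"
        for via in random.sample(INTERMEDIARIES, k):
            if send_via(packet, via, dst):
                return via
        return None        # all k attempts failed

    print(random_k_send(b"payload", "server.example.com"))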
One-Hop Source Routing [1]

• Results (figures)

[1] Gummadi et al. Improving the Reliability of Internet Paths with One-hop Source Routing. OSDI 2004.
One-Hop Source Routing [1]

Merits
• Simple, stateless
• Good for applications where users are tolerant of faults
• Good for rerouting around core network failures

Limitations
• Not always guaranteed to find an alternate path
• Cannot reroute around edge network failures
• Intermediaries raise questions:
  - How does the client find them?
  - Where to place them?
  - Will they cooperate?

[1] Gummadi et al. Improving the Reliability of Internet Paths with One-hop Source Routing. OSDI 2004.
Contents
Multi-homing

• Use multiple ISPs to allow network path redundancy at the edge network.

Figure: Availability gains by adding the best 2 and 3 ISPs based on RTT performance. (Overlay routing achieved 100% availability during the 5-day testing period.) [1]

[1] Akella et al. A Comparison of Overlay Routing and Multihoming Route Control. SIGCOMM 2004.
Multi-homing

• Availability depends on choosing: [1]
  - Good upstream ISPs
  - ISPs that do not have path overlaps upstream
• Hints at the need for path diversity (see the sketch below)
• Some claim that gains are comparable to overlay routing. [2]

[1] Akella et al. A Measurement-Based Analysis of Multihoming. SIGCOMM 2003.
[2] Akella et al. A Comparison of Overlay Routing and Multihoming Route Control. SIGCOMM 2004.
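A toy illustration of why ISP choice and path diversity matter: route control sends traffic over the healthier uplink, and if the two upstream paths fail independently (no overlapping upstream links) the combined availability is much higher than either link alone. The availability numbers are invented; real route control operates at the routing/NAT level rather than in application code.

    # Hypothetical measured availability (fraction of successful probes) per ISP link.
    link_availability = {"isp_a": 0.993, "isp_b": 0.998}

    def pick_uplink(measurements):
        """Route control in miniature: send new flows out the healthiest uplink."""
        return max(measurements, key=measurements.get)

    def combined_availability(measurements):
        """If the upstream paths fail independently (true path diversity), the chance
        that at least one works is 1 - product of the individual failure probabilities."""
        failure = 1.0
        for a in measurements.values():
            failure *= 1.0 - a
        return 1.0 - failure

    print(pick_uplink(link_availability))                      # -> isp_b
    print(round(combined_availability(link_availability), 6))  # -> 0.999986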
Comparison: Resilient Overlay Network vs. One-hop Source Routing vs. Multihoming

Recovery mechanism
• RON: re-route through another node with a path to the destination
• One-hop source routing: spray to other nodes and hope for the best
• Multihoming: route control

Recoverable failure location
• RON: core network
• One-hop source routing: core network
• Multihoming: edge network

Management overhead
• RON: costly; active measurements needed
• One-hop source routing: none for the requestor; relaying for the intermediary
• Multihoming: measurements on links

Recovery time
• RON: seconds
• One-hop source routing: seconds
• Multihoming: N/A

Path guarantee
• RON: deterministic
• One-hop source routing: probabilistic
• Multihoming: N/A

Scalability
• RON: 2 ~ 50 nodes
• One-hop source routing: scales well
• Multihoming: does not scale well

Client modification
• RON: yes
• One-hop source routing: yes
• Multihoming: no

Supported applications
• RON: all IP-based services
• One-hop source routing: all IP-based services
• Multihoming: all IP-based services

Limitations
• RON: not directly usable by clients and servers; may violate ISP contracts
• One-hop source routing: bootstrap problem
• Multihoming: need to find good ISPs with non-overlapping paths; will probably suffer from slow BGP convergence
Contents
② Server failures

Server failures
• Hardware failures
  - More use of commodity hardware and components
  - “Well designed and manufactured HW: >1% fail/year.” [1]
• Software failures
  - Concurrent programs are prevalent, and so are bugs. [2]
  - Cases of introducing another bug while trying to fix one. [1]
• Environmental failures
  - Electricity outage
  - Operator error

[1] Patterson. Recovery Oriented Computing: A New Research Agenda for a New Century. Keynote address, HPCA 2002.
[2] Lu et al. Learning from Mistakes – A Comprehensive Study on Real World Concurrency Bug Characteristics. ASPLOS 2008.
Contents
One-site Clustering

• Three models for organizing servers in a cluster [1]
  - Cluster-based Web system: one virtual IP address exposed to the client
  - Virtual Web cluster: one virtual IP address shared by all servers
  - Distributed Web system: IP addresses of the servers exposed to the client

[1] Cardellini et al. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Surveys 2002.
Cluster-based Web System [1]

• One IP address exposed to the client
• Request routing
  - By the web switch
• Server selection
  - Content-aware
  - Client-aware
  - Server-aware

[1] Cardellini et al. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Surveys 2002.
Virtual Web Cluster [1]

• All servers share the same IP address
• Request routing
  - All servers have the same MAC address
  - Or layer-2 multicast
• Server selection
  - By hash function (sketch below)

[1] Cardellini et al. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Surveys 2002.
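The static hash selection is easy to picture in code. A minimal sketch, assuming the hash is taken over the client IP and the server list is fixed (which is exactly why the scheme cannot rebalance when membership changes); server names are hypothetical.

    import zlib

    SERVERS = ["srv0", "srv1", "srv2", "srv3"]   # hypothetical cluster members

    def responsible_server(client_ip, servers=SERVERS):
        """Static hash partitioning: every server sees every packet (shared IP/MAC),
        each computes this locally, and only the owner of the bucket replies;
        the others silently drop the frame."""
        bucket = zlib.crc32(client_ip.encode()) % len(servers)
        return servers[bucket]

    print(responsible_server("192.0.2.17"))
    print(responsible_server("198.51.100.4"))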
Distributed Web System [1]

• Multiple IP addresses exposed
• Request routing
  - Primary: by DNS
  - Secondary: by the servers
• Server selection
  - By DNS

[1] Cardellini et al. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Surveys 2002.
Effects of design on service availability

Server selection
• Cluster-based: web switch
• Virtual Web: hash function
• Distributed Web: authoritative DNS, servers

Request routing
• Cluster-based: IP or MAC address
• Virtual Web: MAC uni-/multicast
• Distributed Web: IP address

Load balancing
• Cluster-based: web switch
• Virtual Web: none
• Distributed Web: using DNS

Group membership
• Cluster-based: not necessary
• Virtual Web: not possible (needs another network)
• Distributed Web: possible

Single point of failure
• Cluster-based: web switch (can be made redundant)
• Virtual Web: none (all servers)
• Distributed Web: DNS server (can be made redundant)

Failover method
• Cluster-based: web switch decides
• Virtual Web: N/A
• Distributed Web: take over the IP address using ARP

Fault detection
• Cluster-based: heartbeat
• Virtual Web: N/A
• Distributed Web: heartbeat

Other limitations
• Cluster-based: none
• Virtual Web: hash function is static
• Distributed Web: DNS records are cached by the client
Improving Availability of the Distributed Web Model

• PRESS Web Server [1]
  - Cooperative nodes
  - Serve from cache instead of disk
  - All servers have a map of where cached files are
• Availability mechanism (sketch below)
  - Servers are organized in a ring structure; each sends heartbeats to the next one in the ring.
  - Three heartbeats missing: the predecessor has crashed.
  - A node that restarts broadcasts its IP address to join the cluster again.

[1] Nagaraja et al. Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services. SC 2003.
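A schematic version of the ring failure detector described above, assuming a logical ring and a per-predecessor counter of missed heartbeats; node names and the monitoring API are illustrative, not the paper's actual code.

    MISS_LIMIT = 3   # three missing heartbeats -> predecessor declared crashed

    class RingMonitor:
        """Each node sends heartbeats to its successor and is monitored by it."""

        def __init__(self, nodes):
            self.nodes = list(nodes)                      # ring order
            self.missed = {n: 0 for n in self.nodes}      # misses seen by each successor

        def predecessor(self, node):
            i = self.nodes.index(node)
            return self.nodes[i - 1]

        def heartbeat_interval(self, node, heartbeat_seen):
            """Called once per interval on 'node' about its predecessor."""
            pred = self.predecessor(node)
            self.missed[pred] = 0 if heartbeat_seen else self.missed[pred] + 1
            if self.missed[pred] >= MISS_LIMIT:
                return f"{pred} crashed; {node} reports it and the ring is rebuilt"
            return None

    ring = RingMonitor(["n0", "n1", "n2", "n3"])
    for _ in range(3):                      # n1 misses three heartbeats from n0
        event = ring.heartbeat_interval("n1", heartbeat_seen=False)
    print(event)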
Improving Availability of PRESS [1]

To enhance availability:
1. Add a front-end node to mask failures
   - Layer-4 switch (LVS) with IP tunneling
   - Cluster-based model!
2. Robust group membership
3. Application-level heartbeats
4. Fault model enforcement
   - If there is a fault, crash the whole node and restart.

Availability improves from 99.5% to 99.99%.

[1] Nagaraja et al. Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services. SC 2003.
Effects of design on service availability (revisited)

• Same table as before, with the cluster-based model highlighted as the best choice (following the PRESS results).
Contents
Reliable Server Pooling (RSerPool)

• “RSerPool provides an application-independent set of services and protocols for building fault-tolerant and highly-available client/server applications.” [1]
• Specifies the behavior of registrar, server, and client
• Two protocols designed over SCTP
  - Aggregate Server Access Protocol (ASAP): used between server and registrar, and between client and registrar
  - Endpoint haNdlespace Redundancy Protocol (ENRP): used among registrars

[1] Lei et al. An Overview of Reliable Server Pooling Protocols. RFC 5351. IETF 2008.
Reliable Server Pooling (RSerPool)

• Each registrar is the home registrar to a set of servers. Each maintains a complete view of the pools by synchronizing with the others.
  - Servers can add themselves by registering with a registrar; that registrar becomes their home registrar.
  - The home registrar probes server status.
  - Clients cache the list of servers from a registrar.
  - Clients also report failures to the registrar.
• Servers provide service through the normal application protocol, such as HTTP.

• Registrars
  - Announce themselves to servers, clients, and other registrars using IP multicast or by static configuration.
  - Share all knowledge about the pool with other registrars.
    All server updates (registration, re-registration, deregistration) are announced by the home registrar.
    They maintain connections to peer registrars using SCTP.
    A checksum mechanism is used to audit the consistency of the handlespace.
  - Keep a count of unreachability reports for each server in the pool; if a threshold is reached, the server is removed from the pool (sketch below).
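One registrar duty above, counting unreachability reports and dropping a server once a threshold is reached, is simple enough to sketch. The threshold value and server IDs are assumptions; RSerPool leaves the exact policy to the implementation.

    from collections import Counter

    UNREACHABLE_THRESHOLD = 3    # assumed policy, set by the operator

    class Registrar:
        def __init__(self):
            self.pool = {"srv-a", "srv-b"}          # pool element IDs (hypothetical)
            self.reports = Counter()

        def endpoint_unreachable(self, server_id):
            """Handle one client-side ASAP Endpoint Unreachable report."""
            if server_id not in self.pool:
                return
            self.reports[server_id] += 1
            if self.reports[server_id] >= UNREACHABLE_THRESHOLD:
                self.pool.discard(server_id)        # stop handing this server out
                del self.reports[server_id]

    reg = Registrar()
    for _ in range(3):
        reg.endpoint_unreachable("srv-b")
    print(reg.pool)       # -> {'srv-a'}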
Fault detection mechanisms [1]

Failed component: Client
• Detected by the server via an application-specific mechanism
• Detected by the registrar via a broken connection

Failed component: Server
• Detected by the client via an application-specific mechanism
• Detected by the registrar via an ASAP Keep-Alive timeout
• Detected by the registrar via ASAP Endpoint Unreachable reports

Failed component: Registrar
• Detected by the client via an ASAP request timeout
• Detected by the server via an ASAP request timeout
• Detected by another registrar via an ENRP Presence timeout
• Detected by another registrar via an ENRP request timeout

[1] Dreibholz. Reliable Server Pooling – Evaluation, Optimization, and Extension of a Novel IETF Architecture. Ph.D. Thesis, 2007.
Reliability mechanisms [1]

• Registrar takeover
  - Each registrar already has a complete view.
  - Registrars volunteer to take over the failed one.
  - Conflicts are resolved by comparing registrar IDs (the lowest one wins; sketch below).
  - The winner sends probes to the servers under the failed registrar.

[1] Dreibholz. Reliable Server Pooling – Evaluation, Optimization, and Extension of a Novel IETF Architecture. Ph.D. Thesis, 2007.
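The takeover arbitration boils down to an ID comparison among the volunteering peers. A minimal sketch assuming numeric registrar IDs:

    def takeover_winner(failed_registrar, volunteers):
        """All surviving registrars may volunteer; the one with the lowest ID wins
        and becomes responsible for probing the failed registrar's servers."""
        live = [r for r in volunteers if r != failed_registrar]
        return min(live)

    peers = [0x2001, 0x0042, 0x0777]             # hypothetical registrar IDs
    print(hex(takeover_winner(0x2001, peers)))   # -> 0x42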
Reliability mechanisms [1]

• Server failover
  - Uses a client-side state-keeping mechanism (sketch below)
    A state cookie is sent to the client via the control channel. The cookie serves as a checkpoint.
    When a server fails, the client connects to a new server and sends the state cookie. The new server picks up from that state.
  - Uses an ASAP session layer between client and server
    Data and control channels are multiplexed over a single connection to enforce the correct ordering of transactions and cookies.

[1] Dreibholz. Reliable Server Pooling – Evaluation, Optimization, and Extension of a Novel IETF Architecture. Ph.D. Thesis, 2007.
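A sketch of the client side of cookie-based failover, under the assumption that the dictionary returned by failover() stands in for the real ASAP session-layer actions (reporting the failure to the registrar, reconnecting, and replaying the cookie). Server names are hypothetical.

    class FailoverClient:
        """Keeps the newest state cookie and replays it to a fresh server on failure."""

        def __init__(self, server_list):
            self.servers = list(server_list)   # cached from the registrar
            self.cookie = None                 # latest checkpoint from the server

        def on_cookie(self, cookie):
            self.cookie = cookie               # arrives on the multiplexed control channel

        def failover(self):
            """Current server failed: pick the next pool element and resume from the cookie."""
            failed = self.servers.pop(0)
            if not self.servers:
                raise RuntimeError("pool exhausted")
            return {"report_failure": failed,          # also reported back to the registrar
                    "connect_to": self.servers[0],
                    "resume_from": self.cookie}

    client = FailoverClient(["pe-1.example.net", "pe-2.example.net"])
    client.on_cookie(b"txn-checkpoint-17")
    print(client.failover())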
Evaluation of RSerPool

Merits
• Includes the client in the system design
• Every element keeps an eye on the others
• There are mechanisms for registrar takeover and server failover
• Multiple layers of failure detection (e.g., SCTP level and application level)

Limitations
• The client doesn’t do much to help when unavailability problems occur
• Too much overhead in maintaining a consistent view of the pools; not scalable to many servers
• Use of IP multicast confines deployment of registrars and servers to a multicast domain; most likely, clients would have to be statically configured
• Assumes that a client connectivity problem is a server problem; servers at one site will all get dropped if there is a big network problem
Contents
③ Distributed Clusters

Distributed Clusters
• Distributed clusters intrinsically have both network path and server redundancy.
• How can we utilize these redundancies?
• Study of the availability techniques of:
  - Content Distribution Networks
  - Domain Name System
Contents
Akamai vs. Limelight [1]

Akamai
• Philosophy
  - Go closer to the client
  - Scatter small clusters all over the place
• Scale
  - 27,000 servers
  - 65 countries
  - 656 ASes
  - Two-level DNS

Limelight
• Philosophy
  - Go closer to ISPs
  - Large clusters at a few key locations near many ISPs
• Scale
  - 4,100 servers
  - 18 countries
  - Has its own AS
  - Flat DNS, uses anycast

[1] Huang et al. Measuring and Evaluating Large-Scale CDNs. IMC 2008.
Akamai vs. Limelight [1]

• Methodology
  - Connect to port 80 once every few hours.
  - Failure = two consecutive connection errors.
• Results
  - Server and cluster availability is higher for Limelight.
  - But service availability may be different!

[1] Huang et al. Measuring and Evaluating Large-Scale CDNs. IMC 2008.
Akamai Failure Model

• Failure model: “We assume that a significant and constantly changing number of components or other failures occur at all times in the network.” [1]
  - Components: link, machine, rack, data center, multi-network, …
• This leads to having small clusters scattered across diverse ISPs and geographic locations.

[1] Afergan et al. Experience with some Principles for Building an Internet-Scale Reliable System. USENIX WORLDS 2005.
Akamai: Scalability and Availability [1]

• Use two-level DNS to direct clients to a server
  - Top level directs clients to a region (e.g., g.akamai.net) and returns low-level name servers in multiple regions
  - Low level resolves queries within the region and returns short DNS TTLs (20 seconds)
• Servers use ARP to take over a failed server
• Use the internet for inter-cluster communication
  - Uses multi-path routing directed by software logic (= overlay network?)

[1] Afergan et al. Experience with some Principles for Building an Internet-Scale Reliable System. USENIX WORLDS 2005.
Contents
DNS Root

• Redundancy
  - Redundant hardware that takes over a failed one with or without human intervention
  - At least 3 servers recommended, with one at a remote site [3]
  - Backups of the zone file stored at off-site locations
  - Connectivity to the internet
• Diversity
  - Servers geographically located in 130 places in 53 countries [1,2]
    Topological diversity matters more
  - Hardware, software, and operating system of servers
  - Diverse organizations, personnel, and operational processes
  - Distribution of zone files within each root server operator

[1] Bush et al. Root Name Server Operational Requirements. RFC 2870. IETF 2000.
[2] http://www.icann.org/en/committees/security/dns-security-update-1.htm
[3] Elz et al. Selection and Operation of Secondary DNS Servers. RFC 2182. IETF 1997.
The use of anycast for availability [1]

• Basic anycast
  - Announce an identical IP address from multiple nodes
  - The routing system takes the client request to the closest node
• Hierarchical anycast
  - Global vs. local nodes
  - If any node fails, it stops its announcement
  - A global node takes over automatically

[1] Abley. Hierarchical Anycast for Global Service Distribution. ISC Technical Note 2003-1. 2003.
Is anycast good for everyone? [1]

• Not really…
• Packets of long sessions may go to another node if the routing dynamics change
  - Depends on service time and the stability of routing
• A lot of routing considerations
  - Aggregated prefixes
  - Multiple services from one prefix
  - Consideration of the route propagation radius

[1] Abley and Lindqvist. Operation of Anycast Services. RFC 4786. IETF 2006.
Contents
Conclusion

• The techniques presented here attempt to recover from either network failure or server failure through redundancy.
  - There is redundancy in the network path.
  - Server redundancy can be handled.
• An application service provider may have to use a combination of these techniques.
Conclusion

• CDNs and DNS have been good so far…
  - Can these techniques be applied to other applications?
• A system designed for availability is only as good as its failure model.
• Further research on availability is needed
  - Few techniques
  - Not much about availability in the literature