Experience with some Principles for Building an Internet-Scale Reliable System
Mike Afergan (Akamai and MIT)
Joel Wein (Akamai and Polytechnic University, Brooklyn NY)
Amy LaMeyer (Akamai)
Overview
• Background
• Our Development Philosophy
• Guiding Principles
• Metrics and Benefits of the Approach
Downloading www.xyz.com with Akamai’s EdgeSuite
[Figure: user's browser, DNS, customer Web server (WWW.XYZ.COM), and Akamai servers, with steps numbered 1-7]
• User enters www.xyz.com
• Browser requests IP address for www.xyz.com, which is CNAMEd to Akamai
• DNS returns IP address of optimal Akamai server
• Browser requests HTML
• Akamai server assembles page, contacting customer Web server if necessary
• Optimal Akamai server returns HTML
• Browser obtains objects from optimal Akamai servers, contacting the customer Web server if necessary
15,000+ Servers
1,100+ Networks
2,500+ Locations
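
The download steps above can be sketched in a few lines of Python. This is only an illustration, not Akamai's mapping system: the hostnames, IP addresses, latency figures, and the rule "optimal = lowest-latency healthy server" are all assumptions made for the example.

```python
# Hypothetical data: names, addresses, and latencies are invented for illustration.
CNAMES = {"www.xyz.com": "www.xyz.com.edge.example.net"}

# Candidate edge servers per CNAME target: (ip, latency_ms to this user, healthy)
EDGE_SERVERS = {
    "www.xyz.com.edge.example.net": [
        ("192.0.2.10", 40, True),
        ("192.0.2.20", 12, False),  # closest, but currently failed
        ("192.0.2.30", 18, True),
    ]
}


def resolve(hostname):
    """Follow the CNAME, then return the IP of the best healthy edge server."""
    target = CNAMES.get(hostname, hostname)
    candidates = [s for s in EDGE_SERVERS[target] if s[2]]
    if not candidates:
        raise LookupError(f"no healthy edge server for {hostname}")
    ip, _latency, _healthy = min(candidates, key=lambda s: s[1])
    return ip


class EdgeServer:
    """Serves objects from cache, contacting the origin only if necessary."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}
        self.origin_fetches = 0

    def get(self, path):
        if path not in self.cache:
            self.cache[path] = self.origin[path]  # contact customer Web server
            self.origin_fetches += 1
        return self.cache[path]


ORIGIN = {"/index.html": "<html>...</html>", "/logo.gif": "GIF89a"}

ip = resolve("www.xyz.com")
edge = EdgeServer(ORIGIN)
page = edge.get("/index.html")
page_again = edge.get("/index.html")  # served from cache, no origin contact
```

Note that the failed server is skipped even though it is the closest one, which is the behavior the rest of the talk builds on.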
What is this Paper About?
• Internal effort to assess and further formalize internal processes for reliability.
• Produced a long list of principles, some quite basic
    • e.g. Input checking
• A smaller set of principles capturing our basic approach to building distributed systems emerged.
    • Some we realized only in retrospect
    • Many are not unique or new to us
Sharing our Principles
• “Not always easy in practice”
• Similarities with academic literature
• Enables useful operational approach
This talk is not:
• Detailed exposition or justification of entire system or architecture
• Scientific reliability study
• Adequate comparison with previous literature
Challenges
• Failures all the time at different levels:
    • Machines, racks, datacenters, networks, multiple networks
• Connectivity Statistics:
[Figure: connectivity statistics chart; axes labeled “Health” and Time]
Our Philosophy
Assumption: We assume that a significant and
constantly changing number of component or other
failures occur at all times in the network.
Our software is designed to work seamlessly despite
numerous failures as part of the operational network.
Consequence of Philosophy
Do                             Don’t
Commodity hardware             More reliable servers
Third-party Datacenters        Own our own network
Smaller regions                Larger more reliable clusters
Spread regions within ISPs     Find most reliable datacenters
Use the public Internet        Have dedicated links
Our Principles
Principle #6: Notice and Quarantine Faults
Principle #5: Zoning for Releases
Principle #4: Fail-Stop & Restart
Principle #3: Distributed Control
Principle #2: Logic and Software for Message Reliability
Principle #1: Ensure Significant Redundancy
Philosophy: Work with numerous failures
Assumption: Significant and constantly changing failures
Principle #1: Ensure Significant Redundancy
• Base Approach: Redundancy at every layer
• Example Problem:
    • gTLDs return only 13 entries
    • The set is relatively static
    • Solution: IP Anycast to overload the IP addresses
• Other Practical Constraints
    • DNS TTLs constrain flexibility
    • Third-party protocols
    • Cost
Simple in theory, often challenging in practice.
Principle #2: Use Logic and Software to Provide Message Reliability
• Many message types:
    • Monitoring information
    • Customer content
• We use an overlay transport (UDP and HTTP)
• We do not:
    • Have dedicated pipes
    • Own datacenters
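
One way to picture "logic and software" standing in for dedicated pipes is a sender that retries across several overlay paths until one acknowledges delivery. The sketch below is a hypothetical illustration: the path names, the `send_reliably` helper, and the scripted failure patterns are all invented here and do not describe Akamai's actual transport.

```python
def send_reliably(message, paths, max_rounds=3):
    """Try each path in turn, for up to max_rounds rounds.

    Returns the name of the first path that acknowledges, or None.
    """
    for _ in range(max_rounds):
        for name, send in paths:
            try:
                if send(message):
                    return name
            except OSError:
                continue  # treat a broken link like a lost message
    return None


def make_flaky_path(outcomes):
    """Build a path whose successive sends succeed or fail as scripted."""
    remaining = iter(outcomes)

    def send(message):
        return next(remaining, False)

    return send


def dead_path(message):
    raise OSError("link down")


paths = [
    ("direct", dead_path),
    ("via-region-a", make_flaky_path([False, True])),
    ("via-region-b", make_flaky_path([False, False, False])),
]

delivered_by = send_reliably("load report", paths)
undeliverable = send_reliably("load report", [("direct", dead_path)])
```

Here the direct path is down and the first attempt through each relay is lost, so delivery succeeds on the second round via `via-region-a`. When every path is dead the function returns `None` rather than blocking, leaving the decision to the caller.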
Principle #3: Distributed Control
• Different Layers:
    • Leader-Election
    • Failover
[Figure callouts: “Suspending region ensures reliability”; “Region contains the only reliable content!”]
Our ability to tolerate failures facilitates our
approach to software development and operations.
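
The slide names leader election and failover without giving an algorithm, so the following is only a minimal sketch under an assumed rule: every machine in a region applies the same deterministic choice (lowest-named live machine) to the same liveness view. The machine names and the `elect_leader` helper are hypothetical.

```python
def elect_leader(machines, is_alive):
    """Assumed rule: the lowest-named live machine leads; None if none are live."""
    live = sorted(m for m in machines if is_alive(m))
    return live[0] if live else None


region = ["m1", "m2", "m3"]
alive = {"m1": True, "m2": True, "m3": True}

leader = elect_leader(region, alive.get)

# Failover: when the leader fails, re-running the same rule picks a successor
# without any central coordinator.
alive["m1"] = False
new_leader = elect_leader(region, alive.get)

# If the whole region is down there is no leader to elect.
nobody = elect_leader(region, lambda m: False)
```

A real system also has to agree on the liveness view itself, which this sketch takes as given.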
Principle #4: Fail Stop and Restart
Why?
1. Significant downside to a mistake
2. Strong mechanism for recovery
Akamai could be viewed as a seven-year experiment in
running Recovery Oriented Computing.
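
As a rough illustration of fail-stop and restart, the sketch below has a worker abort as soon as it sees something it does not expect, and a supervisor that restarts it from clean state. The names (`InvariantViolation`, `handle`, `run_with_restarts`), the "negative request" failure condition, and the restart cap are assumptions for the example, not details from the talk.

```python
class InvariantViolation(Exception):
    """Raised when the worker sees state it cannot trust."""


def handle(request, state):
    """Hypothetical worker step: stop rather than continue with bad input."""
    if request < 0:
        raise InvariantViolation(f"unexpected request {request!r}")
    state["served"] += 1


def run_with_restarts(requests, max_restarts=5):
    """Process requests, restarting the worker with fresh state after each stop.

    Returns (total requests served, number of restarts).
    """
    restarts = 0
    served_total = 0
    state = {"served": 0}
    for request in requests:
        try:
            handle(request, state)
        except InvariantViolation:
            served_total += state["served"]
            restarts += 1
            if restarts > max_restarts:
                raise  # give up instead of restarting forever
            state = {"served": 0}  # restart from a clean slate
    served_total += state["served"]
    return served_total, restarts


served, restarts = run_with_restarts([1, 2, -1, 3])
```

The bad request costs one restart, and the requests on either side of it are still served. The cap on restarts is one simple guard against a worker that fails on every start.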
A Cautious Approach
[Figure: diagram with components marked X; label “1.2.3.4”]
Problems:
1) Continual Rolls
2) System-wide Issues
Principle #6: Notice and Quarantine Faults
• Challenging Problem
• Many classes of solution
• Open problem with vital importance
Principle #5: Zoning
Phase 1: One Machine
Phase 2: Subset (< 1/8th) of the network
Phase 3: Entire Network
The prior principles have, unexpectedly, enabled a much more reliable and
aggressive release process.
Release Type                 Phase 1->2    Phase 2->3
Customer Config              15 mins       20 mins
System Config                30 min        2 hours
Standard Software Release    24 hours      24 hours
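
The three zoning phases can be sketched as a rollout loop that stops at the first unhealthy machine. This is a simplified illustration: the `phased_release` helper, the health-check callback, and the 24-machine network are hypothetical, and the waiting periods between phases from the table above are left out.

```python
def phased_release(machines, healthy_after_update):
    """Update machines phase by phase, aborting at the first unhealthy one.

    Returns (phase name where the release aborted, or None; machines touched).
    """
    subset_size = (len(machines) - 1) // 8  # strictly below 1/8 of the network
    phases = [
        ("Phase 1", machines[:1]),                 # one machine
        ("Phase 2", machines[1:1 + subset_size]),  # small subset
        ("Phase 3", machines[1 + subset_size:]),   # everything else
    ]
    updated = []
    for name, batch in phases:
        for machine in batch:
            updated.append(machine)
            if not healthy_after_update(machine):
                return name, updated  # abort before touching later machines
    return None, updated


network = [f"m{i}" for i in range(24)]

# A release that misbehaves on m2 is caught in Phase 2 after touching 3 machines.
aborted_in, touched = phased_release(network, lambda m: m != "m2")

# A clean release reaches the whole network.
clean_abort, clean_touched = phased_release(network, lambda m: True)
```

The point of the structure is that a bad release is confined to the machines already touched, which the rest of the network can absorb as ordinary failures.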
Benefits to Software Development
• High Rate of Change per Month:
    • ~22 network/software releases
    • ~1000 customer configuration releases
Our confidence in our network’s ability to handle faults enables
an aggressive rate of change.
Phase          Aborts    % of Total
Phase One      36        6.49
Phase Two      17        3.06
Add’l Phase    3         0.54
World          23        4.14
Data from 25 months = 556 software releases
Benefit to Operations
• Normal NOCC Staffing:
    • 7-8 people during the day
    • 3 people at night
• Per person:
    • 1800 servers
    • 300 datacenters
Our ability to treat faults as normal occurrences, not as crises, helps us scale.
Principles
1. Ensure Significant Redundancy
2. Use Logic and Software for Messaging
3. Employ Distributed Control
4. Fail Stop and Restart
5. Employ Zoning
6. Notice and Quarantine Faults
Key Points
These principles:
1) Build upon each other
2) Enable Akamai’s highly reliable service and ability to scale