Scalability -- Vanguard Conference
Some Claims about Giant-Scale Systems
Prof. Eric A. Brewer
UC Berkeley
Vanguard Conference, May 5, 2003
Five Claims
Scalability is easy (availability is hard)
How to build a giant-scale system
Services are King (infrastructure centric)
P2P is cool but overrated
XML doesn’t help much
Need new IT for the Third World
Claim 1: Availability >> Scalability
Key New Problems
Unknown but large growth
Must be truly highly available
Incremental & Absolute scalability
1000’s of components
Hot swap everything (no recovery time)
$6M/hour for stocks, $300k/hour for retailers
Graceful degradation under faults & saturation
Constant evolution (internet time)
Software will be buggy
Hardware will fail
These can’t be emergencies...
Typical Cluster
Scalability is EASY
Just add more nodes….
Unless you want HA
Or you want to change the system…
Hard parts:
  Availability
  Overload
  Evolution
Step 1: Basic Architecture
(Diagram: clients connect over an IP network to a single-site server; a load manager fronts the server nodes, an optional backplane connects them, and they share a persistent data store.)
Step 2: Divide Persistent Data
Transactional (expensive, must be correct)
Supporting data
  HTML, images, applets, etc.
  Read mostly
  Publish in snapshots (normally stale)
Session state
  Persistent across connections
  Limited lifetime
  Can be lost (rarely)
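A minimal sketch (not from the talk) of treating these three classes with different guarantees; the store classes, the snapshot flip, and the 30-minute session TTL are illustrative assumptions.

```python
import time

class TransactionalStore:
    """Expensive, must be correct: writes are applied synchronously."""
    def __init__(self):
        self._data = {}
    def write(self, key, value):
        # In a real system this would be a durable, ACID transaction.
        self._data[key] = value
    def read(self, key):
        return self._data[key]

class SnapshotStore:
    """Supporting data (HTML, images, applets): read-mostly, published in snapshots."""
    def __init__(self):
        self._live = {}      # what readers see (normally slightly stale)
        self._staging = {}   # where new content accumulates
    def stage(self, key, value):
        self._staging[key] = value
    def publish(self):
        # Snapshot flip: readers never see a half-published set.
        self._live = dict(self._staging)
    def read(self, key):
        return self._live.get(key)

class SessionStore:
    """Session state: persistent across connections, limited lifetime, can be lost."""
    def __init__(self, ttl_seconds=1800):
        self._ttl = ttl_seconds
        self._data = {}
    def put(self, session_id, state):
        self._data[session_id] = (time.time(), state)
    def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None or time.time() - entry[0] > self._ttl:
            return None  # expired or lost: caller must tolerate this
        return entry[1]
```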
Step 3: Basic Availability
a) Depend on layer-7 switches
  Isolate external names (IP address, port) from specific machines
b) Automatic detection of problems
  Node-level checks (e.g. memory footprint)
  (Remote) app-level checks
c) Focus on MTTR
  Easier to fix & test than MTBF
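A rough sketch of steps (a)-(c), assuming each node exports a memory gauge and a trivial health query over HTTP; the URLs, thresholds, and the refresh_pool helper are hypothetical, not part of the talk.

```python
import urllib.request

def node_level_check(node, max_memory_mb=1500):
    """Node-level check, e.g. memory footprint (gauge URL is an assumption)."""
    try:
        with urllib.request.urlopen(f"http://{node}/stats/memory_mb", timeout=2) as r:
            return float(r.read().decode()) < max_memory_mb
    except Exception:
        return False

def app_level_check(node):
    """Remote application-level check: does a trivial real query succeed?"""
    try:
        with urllib.request.urlopen(f"http://{node}/health/query", timeout=2) as r:
            return r.status == 200
    except Exception:
        return False

def refresh_pool(nodes):
    """Keep only nodes that pass both checks; the layer-7 switch stops routing
    to the rest, so recovery is a fast routing change (MTTR), not an emergency."""
    return [n for n in nodes if node_level_check(n) and app_level_check(n)]
```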
Step 4: Overload Strategy
Goal: degrade service (some) to allow more capacity during overload
Examples:
  Simpler pages (less dynamic, smaller size)
  Fewer options (just the basics)
Must kick in automatically
  “Overload mode” is relatively easy to detect
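One way such an automatic “overload mode” might look; the latency-based trigger, the 0.5 s threshold, and the two page variants are illustrative assumptions.

```python
import collections

recent_latencies = collections.deque(maxlen=1000)

def record(latency_s):
    recent_latencies.append(latency_s)

def overloaded(threshold_s=0.5):
    """Overload is relatively easy to detect, e.g. from rising request latency."""
    if not recent_latencies:
        return False
    return sum(recent_latencies) / len(recent_latencies) > threshold_s

def render_page(user):
    if overloaded():
        # Degrade: smaller, mostly static page with just the basics.
        return f"<html><body>Welcome {user}. (basic mode)</body></html>"
    # Full page: dynamic content, personalization, more options.
    return f"<html><body>Welcome {user}. <!-- recommendations, ads, ... --></body></html>"
```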
Step 5: Disaster Tolerance
a) Pick a few locations
  Independent failure
b) Dynamic redirection of load
  Best: client-side control
  Next: switch all traffic (long routes)
  Worst: DNS remapping (takes a while)
c) Target site will get overloaded -- but you have overload handling
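A tiny sketch of the “best” option, client-side control: the client holds a list of sites and fails over itself, so redirection needs no DNS or routing change. The site URLs are placeholders; the surviving site absorbs the extra load, which is where the overload handling above comes in.

```python
import urllib.request

SITES = ["https://east.example.com", "https://west.example.com"]  # hypothetical

def fetch(path, timeout=3):
    last_error = None
    for site in SITES:
        try:
            with urllib.request.urlopen(site + path, timeout=timeout) as r:
                return r.read()
        except Exception as e:
            last_error = e   # site down or partitioned: try the next one
    raise last_error         # all sites failed
```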
Step 6: Online Evolution
Goal: rapid evolution without downtime
a) Publishing model
  Decouple development from live system
  Atomic push of content
  Automatic revert if trouble arises
b) Three methods
Evolution: Three Approaches
Flash Upgrade
  Fast reboot into new version
  Focus on MTTR (< 10 sec)
  Reduces yield (and uptime)
Rolling Upgrade
  Upgrade nodes one at a time in a “wave”
  Temporary 1/n harvest reduction, 100% yield
  Requires co-existing versions
“Big Flip”
The Big Flip
Steps:
1) take down 1/2 the nodes
2) upgrade that half
3) flip the “active half” (site upgraded)
4) upgrade second half
5) return to 100%
Avoids mixed versions (!)
can replace schema, protocols, ...
Twice used to change physical location
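A sketch of the flip as orchestration code, assuming the cluster tooling provides drain, upgrade, and activate hooks (all hypothetical here).

```python
def big_flip(nodes, drain, upgrade, activate):
    """activate(group) is assumed to route all traffic to exactly that group."""
    half_a, half_b = nodes[: len(nodes) // 2], nodes[len(nodes) // 2 :]

    for n in half_a:              # 1) take down half the nodes
        drain(n)
    for n in half_a:              # 2) upgrade that half
        upgrade(n)

    activate(half_a)              # 3) flip the "active half": the upgraded
                                  #    half serves; old and new never mix

    for n in half_b:              # 4) upgrade the second half
        drain(n)
        upgrade(n)

    activate(half_a + half_b)     # 5) return to 100% capacity
```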
The Hat Trick
Merge the handling of:
  Disaster tolerance
  Online evolution
  Overload handling
The first two reduce capacity, which then kicks in overload handling (!)
Claim 2: Services are King
Coming of The Infrastructure
(Chart: installed base over 1996, 1998, and 2000, on a 0-100 scale, for Telephone, PC, and Non-PC Internet devices.)
Infrastructure Services
Much lower cost & more functionality
  longer battery life
Data is in the infrastructure
  can lose the device
  enables groupware
  can update/access from home or work
  phone book on the web, not in the phone
  can use a real PC & keyboard
Much simpler devices
  faster access
  Surfing is 3-7 times faster
  Graphics look good
Transformation Examples
Tailor content for each user & device
(Figure: example transformations with size reductions of 6.8x, 65x, and 10x; one distilled document excerpt reads “1.2 The Remote Queue Model -- We introduce Remote Queues (RQ), ….”)
Infrastructure Services (2)
Much cheaper overall cost (20x?)
  Device utilization = 4%, infrastructure = 80%
  Admin & support costs also decrease
“Super Convergence” (all -> IP)
  View PowerPoint slides with teleconference
  Integrated cell phone, pager, web/email access
  Map, driving directions, location-based services
Can upgrade/add services in place!
  Devices last longer and grow in usefulness
  Easy to deploy new services => new revenue
Internet Phases (prediction)
Internet as New Media
  HTML, basic search
Consumer Services (today)
  Shopping, travel, Yahoo!, eBay, tickets, …
Industrial Services
  XML, micropayments, spot markets
Everything in the Infrastructure
  store your data in the infrastructure
  access anytime/anywhere
Claim 3: P2P is cool but overrated
P2P Services?
Not soon…
Challenges
  Untrusted nodes !!
  Network Partitions
  Much harder to understand behavior
  Harder to upgrade
Relatively few advantages…
Better: Smart Clients
Mostly helps with:
  Load balancing
  Disaster tolerance
  Overload
Can also offload work from servers
Can also personalize results
  E.g. mix search results locally
  Can include private data
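A sketch of a smart client that spreads load across replicas, tolerates a dead replica, and mixes in private local results; the replica URLs, response format, and scoring are assumptions for illustration.

```python
import json
import random
import urllib.request

REPLICAS = ["https://search1.example.com", "https://search2.example.com"]  # hypothetical

def server_results(query):
    # Client-side load balancing: try replicas in a random order, skip failures.
    for url in random.sample(REPLICAS, len(REPLICAS)):
        try:
            with urllib.request.urlopen(f"{url}/q?{query}", timeout=2) as r:
                return json.load(r)   # assumed: list of {"title", "score"} dicts
        except Exception:
            continue
    return []  # every replica failed: degrade rather than error out

def search(query, private_docs):
    merged = server_results(query)
    # Personalize locally: private data never leaves the client.
    merged += [d for d in private_docs if query in d["title"]]
    return sorted(merged, key=lambda d: d.get("score", 0), reverse=True)
```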
Claim 4: XML doesn’t help much…
The Problem
Need services for computers to use
HTML only works for people
Sites depend on human interpretation of ambiguous content
“Scraping” content is BAD
Very error prone
No strategy for evolution
XML doesn’t solve any of these issues!
At best: RPC with an extensible schema
Why it is hard…
The real problem is *social*
Doesn’t make evolution better…
What do the fields mean?
Who gets to decide?
Two sides still need to agree on schema
Can ignore stuff you don’t understand?
When can a field change? Consequences?
At least need versioning system…
XML can mislead us to ignore/postpone the real issues!
Claim 5: New IT for the Third World
Plug for new area…
Bridging the IT gap is the only long-term path to global stability
Convergence makes it possible:
  802.11 wireless ($5/chipset)
  Systems on a chip (cost, power)
  Infrastructure services (cost, power)
Goal: 10-100x reduction in overall cost and power
Five Claims
Scalability is easy (availability is hard)
How to build a giant-scale system
Services are King (infrastructure centric)
P2P is cool but overrated
XML doesn’t help much
Need new IT for the Third World
Backup
Refinement
Retrieve part of distilled object at higher quality
Distilled image (by 60X)
Zoom in to original resolution
The CAP Theorem
Consistency
Availability
Tolerance to network partitions
Theorem: You can have at most two of these properties for any shared-data system
Forfeit Partitions
(Keep Consistency and Availability; give up tolerance to network partitions)
Examples
  Single-site databases
  Cluster databases
  LDAP
  xFS file system
Traits
  2-phase commit
  cache validation protocols
Forfeit Availability
(Keep Consistency and partition tolerance; give up Availability)
Examples
  Distributed databases
  Distributed locking
  Majority protocols
Traits
  Pessimistic locking
  Make minority partitions unavailable
Forfeit Consistency
(Keep Availability and partition tolerance; give up Consistency)
Examples
  Coda
  Web caching
  DNS
Traits
  expirations/leases
  conflict resolution
  optimistic
These Tradeoffs are Real
The whole space is useful
Real internet systems are a careful mixture of
ACID and BASE subsystems
We use ACID for user profiles and logging (for
revenue)
But there is almost no work in this area
Symptom of a deeper problem: systems and
database communities are separate but
overlapping (with distinct vocabulary)
CAP Take Homes
Can have consistency & availability within a
cluster (foundation of Ninja), but it is still hard
in practice
OS/Networking good at BASE/Availability, but
terrible at consistency
Databases better at C than Availability
Wide-area databases can’t have both
Disconnected clients can’t have both
All systems are probabilistic…
The DQ Principle
Data/query * Queries/sec = constant = DQ
  for a given node
  for a given app/OS release
A fault can reduce the capacity (Q), completeness (D), or both
Faults reduce this constant linearly (at best)
Harvest & Yield
Yield: Fraction of Answered Queries
  Related to uptime but measured by queries, not by time
  Drop 1 out of 10 connections => 90% yield
  At full utilization: yield ~ capacity ~ Q
Harvest: Fraction of the Complete Result
  Reflects that some of the data may be missing due to faults
  Replication: maintain D under faults
DQ corollary: harvest * yield ~ constant
  ACID => choose 100% harvest (reduce Q but 100% D)
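A small worked example of the DQ principle and its corollary; the node count and per-node DQ value are made up for illustration.

```python
nodes = 100
dq_per_node = 1.0                  # data-per-query * queries/sec, per node
total_dq = nodes * dq_per_node     # the "constant" for this app/OS release

failed = 10
surviving_dq = (nodes - failed) * dq_per_node    # faults cut DQ linearly

# Option A: keep 100% yield, give up harvest (answer from less data).
harvest_a, yield_a = surviving_dq / total_dq, 1.0
# Option B (ACID-style): keep 100% harvest, give up yield (turn queries away).
harvest_b, yield_b = 1.0, surviving_dq / total_dq

assert abs(harvest_a * yield_a - harvest_b * yield_b) < 1e-9
print(harvest_a * yield_a)   # 0.9 either way: harvest * yield tracks remaining DQ
```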
Harvest Options
1) Ignore lost nodes
RPC gives up
forfeit small part of the database
reduce D, keep Q
2) Pair up nodes
RPC tries alternate
survives one fault per pair
reduce Q, keep D
3) n-member replica groups
Decide when you care...
Replica Groups
With n members:
Each fault reduces Q by 1/n
D stable until nth fault
Added load is 1/(n-1) per fault
n=2 => double load or 50% capacity
n=4 => 133% load or 75% capacity
“load redirection problem”
Disaster tolerance: better have >3 mirrors
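The arithmetic behind the load redirection problem, as a throwaway helper (the function name is mine).

```python
def after_one_fault(n):
    """n-member replica group: one fault removes 1/n of Q; each survivor
    absorbs an extra 1/(n-1) of its normal load."""
    added_load = 1.0 / (n - 1)
    return {
        "survivor_load": 1.0 + added_load,            # n=2 -> 2.0 (double load)
        "capacity_if_not_overprovisioned": (n - 1) / n,  # n=2 -> 0.5 (50% capacity)
    }

print(after_one_fault(2))   # double load or 50% capacity
print(after_one_fault(4))   # ~133% load or 75% capacity
```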
Graceful Degradation
Goal: smooth decrease in harvest/yield proportional to faults
Saturation will occur
  we know DQ drops linearly
  high peak/average ratios...
  must reduce harvest or yield (or both)
  must do admission control!!!
One answer: reduce D dynamically
  disaster => redirect load, then reduce D to compensate for extra load
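One way “reduce D dynamically” might look, assuming data is spread over equal partitions and each query normally touches all of them; the load signal and the formula are illustrative.

```python
def partitions_to_query(total_partitions, load_factor):
    """load_factor = offered load / capacity. Above 1.0, shrink the slice of
    the database each query touches, so the DQ spent per query drops and more
    queries can be admitted (harvest falls, yield stays up)."""
    if load_factor <= 1.0:
        return total_partitions                 # normal mode: 100% harvest
    return max(1, int(total_partitions / load_factor))

print(partitions_to_query(100, 0.8))   # 100 -> full harvest
print(partitions_to_query(100, 2.0))   # 50  -> ~50% harvest, but full yield
```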
Thinking Probabilistically
Maximize symmetry
  SPMD + simple replication schemes
Make faults independent
  requires thought
  avoid cascading errors/faults
  understand redirected load
  KISS
Use randomness
  makes worst-case and average case the same
  ex: Inktomi spreads data & queries randomly
  Node loss implies a random 1% harvest reduction
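A sketch in the spirit of the Inktomi example: spread documents pseudo-randomly over 100 nodes, so losing any one node costs a random ~1% of the harvest. The hashing scheme here is just an illustrative stand-in.

```python
import hashlib

NODES = 100

def node_for(doc_id):
    # Deterministic pseudo-random placement: each document lands on a
    # uniformly random node, so any single node holds ~1% of the data.
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % NODES

docs = [f"doc-{i}" for i in range(100_000)]
lost_node = 17
lost = sum(1 for d in docs if node_for(d) == lost_node)
print(lost / len(docs))   # ~0.01: a node loss is a random ~1% harvest reduction
```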
Server Pollution
Can’t fix all memory leaks
  Third-party software leaks memory and sockets
  so does the OS sometimes
  Some failures tie up local resources
Solution: planned periodic “bounce”
  Not worth the stress to do any better
  Bounce time is less than 10 seconds
  Nice to remove load first…
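A minimal sketch of a rolling “bounce”; the drain/restart hooks stand in for whatever load-manager and process-control commands the real system provides.

```python
import time

def bounce(node, remove_load, restart, restore_load):
    remove_load(node)    # nice to remove load first…
    restart(node)        # fresh process: leaked memory/sockets reclaimed
    restore_load(node)   # bounce time kept under ~10 seconds

def bounce_all(nodes, remove_load, restart, restore_load, period_s=24 * 3600):
    """Bounce one node at a time, spread over the period, so the cluster
    never loses more than one node's worth of capacity."""
    for node in nodes:
        bounce(node, remove_load, restart, restore_load)
        time.sleep(period_s / len(nodes))
```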
Key New Problems
Unknown but large growth
Must be truly highly available
Incremental & Absolute scalability
1000’s of components
Hot swap everything (no recovery time allowed)
No “night”
Graceful degradation under faults & saturation
Constant evolution (internet time)
Software will be buggy
Hardware will fail
These can’t be emergencies...
Conclusions
Parallel Programming is very relevant,
except…
historically avoids availability
no notion of online evolution
limited notions of graceful degradation (checkpointing)
best for CPU-bound tasks
Must think probabilistically about everything
no such thing as a 100% working system
no such thing as 100% fault tolerance
partial results are often OK (and better than none)
Capacity * Completeness == Constant
Partial checklist
What is shared? (namespace, schema?)
What kind of state in each boundary?
How would you evolve an API?
Lifetime of references? Expiration impact?
Graceful degradation as modules go down?
External persistent names?
Consistency semantics and boundary?
The Move to Clusters
No single machine can handle the load
Only solution is clusters
Other cluster advantages:
  Cost: about 50% cheaper per CPU
  Availability: possible to build HA systems
  Incremental growth: add nodes as needed
  Replace whole nodes (easier)
Goals
Sheer scale
Handle 100M users, going toward 1B
Largest: AOL Web Cache: 12B hits/day
High Availability
Large cost for downtime
$250K per hour for online retailers
$6M per hour for stock brokers
Disaster Tolerance?
Overload Handling
System Evolution
Decentralization?