Scalability -- Vanguard Conference
Some Claims about Giant-Scale Systems
Prof. Eric A. Brewer
UC Berkeley
Vanguard Conference, May 5, 2003
Five Claims
Scalability is easy (availability is hard)
How to build a giant-scale system
Services are King (infrastructure centric)
P2P is cool but overrated
XML doesn’t help much
Need new IT for the Third World
Claim 1: Availability >> Scalability
Key New Problems
Unknown but large growth
Must be truly highly available
Incremental & Absolute scalability
1000’s of components
Hot swap everything (no recovery time)
$6M/hour for stocks, $300k/hour for retailers
Graceful degradation under faults & saturation
Constant evolution (internet time)
Software will be buggy
Hardware will fail
These can’t be emergencies...
Typical Cluster
Scalability is EASY
Just add more nodes….
Unless you want HA
Or you want to change the system…
Hard parts:
  Availability
  Overload
  Evolution
Step 1: Basic Architecture
(Diagram: clients connect over an IP network to a single-site server; a load manager fronts the server nodes, an optional backplane connects them, and they share a persistent data store.)
Step 2: Divide Persistent Data
Transactional (expensive, must be correct)
Supporting data
  HTML, images, applets, etc.
  Read mostly
  Publish in snapshots (normally stale)
Session state
  Persistent across connections
  Limited lifetime
  Can be lost (rarely)
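A minimal sketch (not from the talk) of treating these three classes with different guarantees; the store classes, the snapshot flip, and the 30-minute session TTL are illustrative assumptions.

```python
import time

class TransactionalStore:
    """Expensive, must be correct: writes are applied synchronously."""
    def __init__(self):
        self._data = {}
    def write(self, key, value):
        # In a real system this would be a durable, ACID transaction.
        self._data[key] = value
    def read(self, key):
        return self._data[key]

class SnapshotStore:
    """Supporting data (HTML, images, applets): read-mostly, published in snapshots."""
    def __init__(self):
        self._live = {}      # what readers see (normally slightly stale)
        self._staging = {}   # where new content accumulates
    def stage(self, key, value):
        self._staging[key] = value
    def publish(self):
        # Snapshot flip: readers never see a half-published set.
        self._live = dict(self._staging)
    def read(self, key):
        return self._live.get(key)

class SessionStore:
    """Session state: persistent across connections, limited lifetime, can be lost."""
    def __init__(self, ttl_seconds=1800):
        self._ttl = ttl_seconds
        self._data = {}
    def put(self, session_id, state):
        self._data[session_id] = (time.time(), state)
    def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None or time.time() - entry[0] > self._ttl:
            return None  # expired or lost: caller must tolerate this
        return entry[1]
```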
Step 3: Basic Availability
a) Depend on layer-7 switches
  Isolate external names (IP address, port) from specific machines
b) Automatic detection of problems
  Node-level checks (e.g. memory footprint)
  (Remote) app-level checks
c) Focus on MTTR
  Easier to fix & test than MTBF
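A rough sketch of steps (a)-(c), assuming each node exports a memory gauge and a trivial health query over HTTP; the URLs, thresholds, and the refresh_pool helper are hypothetical, not part of the talk.

```python
import urllib.request

def node_level_check(node, max_memory_mb=1500):
    """Node-level check, e.g. memory footprint (gauge URL is an assumption)."""
    try:
        with urllib.request.urlopen(f"http://{node}/stats/memory_mb", timeout=2) as r:
            return float(r.read().decode()) < max_memory_mb
    except Exception:
        return False

def app_level_check(node):
    """Remote application-level check: does a trivial real query succeed?"""
    try:
        with urllib.request.urlopen(f"http://{node}/health/query", timeout=2) as r:
            return r.status == 200
    except Exception:
        return False

def refresh_pool(nodes):
    """Keep only nodes that pass both checks; the layer-7 switch stops routing
    to the rest, so recovery is a fast routing change (MTTR), not an emergency."""
    return [n for n in nodes if node_level_check(n) and app_level_check(n)]
```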
Step 4: Overload Strategy
Goal: degrade service (some) to allow more capacity during overload
Examples:
  Simpler pages (less dynamic, smaller size)
  Fewer options (just the basics)
Must kick in automatically
  “Overload mode” is relatively easy to detect
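One way such an automatic “overload mode” might look; the latency-based trigger, the 0.5 s threshold, and the two page variants are illustrative assumptions.

```python
import collections

recent_latencies = collections.deque(maxlen=1000)

def record(latency_s):
    recent_latencies.append(latency_s)

def overloaded(threshold_s=0.5):
    """Overload is relatively easy to detect, e.g. from rising request latency."""
    if not recent_latencies:
        return False
    return sum(recent_latencies) / len(recent_latencies) > threshold_s

def render_page(user):
    if overloaded():
        # Degrade: smaller, mostly static page with just the basics.
        return f"<html><body>Welcome {user}. (basic mode)</body></html>"
    # Full page: dynamic content, personalization, more options.
    return f"<html><body>Welcome {user}. <!-- recommendations, ads, ... --></body></html>"
```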
Step 5: Disaster Tolerance
a) Pick a few locations
  Independent failure
b) Dynamic redirection of load
  Best: client-side control
  Next: switch all traffic (long routes)
  Worst: DNS remapping (takes a while)
c) Target site will get overloaded -- but you have overload handling
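A tiny sketch of the “best” option, client-side control: the client holds a list of sites and fails over itself, so redirection needs no DNS or routing change. The site URLs are placeholders; the surviving site absorbs the extra load, which is where the overload handling above comes in.

```python
import urllib.request

SITES = ["https://east.example.com", "https://west.example.com"]  # hypothetical

def fetch(path, timeout=3):
    last_error = None
    for site in SITES:
        try:
            with urllib.request.urlopen(site + path, timeout=timeout) as r:
                return r.read()
        except Exception as e:
            last_error = e   # site down or partitioned: try the next one
    raise last_error         # all sites failed
```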
Step 6: Online Evolution
Goal: rapid evolution without downtime
a) Publishing model
  Decouple development from live system
  Atomic push of content
  Automatic revert if trouble arises
b) Three methods
Evolution: Three Approaches
Flash Upgrade
  Fast reboot into new version
  Focus on MTTR (< 10 sec)
  Reduces yield (and uptime)
Rolling Upgrade
  Upgrade nodes one at a time in a “wave”
  Temporary 1/n harvest reduction, 100% yield
  Requires co-existing versions
“Big Flip”
The Big Flip
Steps:
1) take down 1/2 the nodes
2) upgrade that half
3) flip the “active half” (site upgraded)
4) upgrade second half
5) return to 100%
Avoids mixed versions (!)
can replace schema, protocols, ...
Twice used to change physical location
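A sketch of the flip as orchestration code, assuming the cluster tooling provides drain, upgrade, and activate hooks (all hypothetical here).

```python
def big_flip(nodes, drain, upgrade, activate):
    """activate(group) is assumed to route all traffic to exactly that group."""
    half_a, half_b = nodes[: len(nodes) // 2], nodes[len(nodes) // 2 :]

    for n in half_a:              # 1) take down half the nodes
        drain(n)
    for n in half_a:              # 2) upgrade that half
        upgrade(n)

    activate(half_a)              # 3) flip the "active half": the upgraded
                                  #    half serves; old and new never mix

    for n in half_b:              # 4) upgrade the second half
        drain(n)
        upgrade(n)

    activate(half_a + half_b)     # 5) return to 100% capacity
```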
The Hat Trick
Merge the handling of:
  Disaster tolerance
  Online evolution
  Overload handling
The first two reduce capacity, which then kicks in overload handling (!)
Claim 2: Services are King
Coming of The Infrastructure
(Chart: installed base over 1996, 1998, and 2000, on a 0-100 scale, for Telephone, PC, and Non-PC Internet devices.)
Infrastructure Services
Much lower cost & more functionality
  longer battery life
Data is in the infrastructure
  can lose the device
  enables groupware
  can update/access from home or work
  phone book on the web, not in the phone
  can use a real PC & keyboard
Much simpler devices
  faster access
  Surfing is 3-7 times faster
  Graphics look good
Transformation Examples
Tailor content for each user & device
(Figure: example transformations with size reductions of 6.8x, 65x, and 10x; one distilled document excerpt reads “1.2 The Remote Queue Model -- We introduce Remote Queues (RQ), ….”)
Infrastructure Services (2)
Much cheaper overall cost (20x?)
  Device utilization = 4%, infrastructure = 80%
  Admin & support costs also decrease
“Super Convergence” (all -> IP)
  View PowerPoint slides with teleconference
  Integrated cell phone, pager, web/email access
  Map, driving directions, location-based services
Can upgrade/add services in place!
  Devices last longer and grow in usefulness
  Easy to deploy new services => new revenue
Internet Phases (prediction)
Internet as New Media
  HTML, basic search
Consumer Services (today)
  Shopping, travel, Yahoo!, eBay, tickets, …
Industrial Services
  XML, micropayments, spot markets
Everything in the Infrastructure
  store your data in the infrastructure
  access anytime/anywhere
Claim 3: P2P is cool but overrated
P2P Services?
Not soon…
Challenges
  Untrusted nodes !!
  Network Partitions
  Much harder to understand behavior
  Harder to upgrade
Relatively few advantages…
Better: Smart Clients
Mostly helps with:
  Load balancing
  Disaster tolerance
  Overload
Can also offload work from servers
Can also personalize results
  E.g. mix search results locally
  Can include private data
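A sketch of a smart client that spreads load across replicas, tolerates a dead replica, and mixes in private local results; the replica URLs, response format, and scoring are assumptions for illustration.

```python
import json
import random
import urllib.request

REPLICAS = ["https://search1.example.com", "https://search2.example.com"]  # hypothetical

def server_results(query):
    # Client-side load balancing: try replicas in a random order, skip failures.
    for url in random.sample(REPLICAS, len(REPLICAS)):
        try:
            with urllib.request.urlopen(f"{url}/q?{query}", timeout=2) as r:
                return json.load(r)   # assumed: list of {"title", "score"} dicts
        except Exception:
            continue
    return []  # every replica failed: degrade rather than error out

def search(query, private_docs):
    merged = server_results(query)
    # Personalize locally: private data never leaves the client.
    merged += [d for d in private_docs if query in d["title"]]
    return sorted(merged, key=lambda d: d.get("score", 0), reverse=True)
```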
Claim 4: XML doesn’t help much…
The Problem
Need services for computers to use
HTML only works for people
Sites depend on human interpretation of ambiguous content
“Scraping” content is BAD
Very error prone
No strategy for evolution
XML doesn’t solve any of these issues!
At best: RPC with an extensible schema
Why it is hard…
The real problem is *social*
Doesn’t make evolution better…
What do the fields mean?
Who gets to decide?
Two sides still need to agree on schema
Can ignore stuff you don’t understand?
When can a field change? Consequences?
At least need versioning system…
XML can mislead us to ignore/postpone the real issues!
Claim 5: New IT for the Third World
Plug for new area…
Bridging the IT gap is the only long-term path to global stability
Convergence makes it possible:
  802.11 wireless ($5/chipset)
  Systems on a chip (cost, power)
  Infrastructure services (cost, power)
Goal: 10-100x reduction in overall cost and power
Five Claims
Scalability is easy (availability is hard)
How to build a giant-scale system
Services are King (infrastructure centric)
P2P is cool but overrated
XML doesn’t help much
Need new IT for the Third World
Backup
Refinement
Retrieve part of distilled object at higher quality
Distilled image (by 60X)
Zoom in to original resolution
The CAP Theorem
Consistency
Availability
Tolerance to network partitions
Theorem: You can have at most two of these properties for any shared-data system
Forfeit Partitions
(Keep Consistency and Availability; give up tolerance to network partitions)
Examples
  Single-site databases
  Cluster databases
  LDAP
  xFS file system
Traits
  2-phase commit
  cache validation protocols
Forfeit Availability
(Keep Consistency and partition tolerance; give up Availability)
Examples
  Distributed databases
  Distributed locking
  Majority protocols
Traits
  Pessimistic locking
  Make minority partitions unavailable
Forfeit Consistency
(Keep Availability and partition tolerance; give up Consistency)
Examples
  Coda
  Web caching
  DNS
Traits
  expirations/leases
  conflict resolution
  optimistic
These Tradeoffs are Real
The whole space is useful
Real internet systems are a careful mixture of
ACID and BASE subsystems
We use ACID for user profiles and logging (for
revenue)
But there is almost no work in this area
Symptom of a deeper problem: systems and
database communities are separate but
overlapping (with distinct vocabulary)
CAP Take Homes
Can have consistency & availability within a
cluster (foundation of Ninja), but it is still hard
in practice
OS/Networking good at BASE/Availability, but
terrible at consistency
Databases better at C than Availability
Wide-area databases can’t have both
Disconnected clients can’t have both
All systems are probabilistic…
The DQ Principle
Data/query * Queries/sec = constant = DQ
  for a given node
  for a given app/OS release
A fault can reduce the capacity (Q), completeness (D), or both
Faults reduce this constant linearly (at best)
Harvest & Yield
Yield: Fraction of Answered Queries
  Related to uptime but measured by queries, not by time
  Drop 1 out of 10 connections => 90% yield
  At full utilization: yield ~ capacity ~ Q
Harvest: Fraction of the Complete Result
  Reflects that some of the data may be missing due to faults
  Replication: maintain D under faults
DQ corollary: harvest * yield ~ constant
  ACID => choose 100% harvest (reduce Q but 100% D)
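A small worked example of the DQ principle and its corollary; the node count and per-node DQ value are made up for illustration.

```python
nodes = 100
dq_per_node = 1.0                  # data-per-query * queries/sec, per node
total_dq = nodes * dq_per_node     # the "constant" for this app/OS release

failed = 10
surviving_dq = (nodes - failed) * dq_per_node    # faults cut DQ linearly

# Option A: keep 100% yield, give up harvest (answer from less data).
harvest_a, yield_a = surviving_dq / total_dq, 1.0
# Option B (ACID-style): keep 100% harvest, give up yield (turn queries away).
harvest_b, yield_b = 1.0, surviving_dq / total_dq

assert abs(harvest_a * yield_a - harvest_b * yield_b) < 1e-9
print(harvest_a * yield_a)   # 0.9 either way: harvest * yield tracks remaining DQ
```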
Harvest Options
1) Ignore lost nodes
RPC gives up
forfeit small part of the database
reduce D, keep Q
2) Pair up nodes
RPC tries alternate
survives one fault per pair
reduce Q, keep D
3) n-member replica groups
Decide when you care...
Replica Groups
With n members:
Each fault reduces Q by 1/n
D stable until nth fault
Added load is 1/(n-1) per fault
n=2 => double load or 50% capacity
n=4 => 133% load or 75% capacity
“load redirection problem”
Disaster tolerance: better have >3 mirrors
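The arithmetic behind the load redirection problem, as a throwaway helper (the function name is mine).

```python
def after_one_fault(n):
    """n-member replica group: one fault removes 1/n of Q; each survivor
    absorbs an extra 1/(n-1) of its normal load."""
    added_load = 1.0 / (n - 1)
    return {
        "survivor_load": 1.0 + added_load,            # n=2 -> 2.0 (double load)
        "capacity_if_not_overprovisioned": (n - 1) / n,  # n=2 -> 0.5 (50% capacity)
    }

print(after_one_fault(2))   # double load or 50% capacity
print(after_one_fault(4))   # ~133% load or 75% capacity
```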
Graceful Degradation
Goal: smooth decrease in harvest/yield proportional to faults
Saturation will occur
  we know DQ drops linearly
  high peak/average ratios...
  must reduce harvest or yield (or both)
  must do admission control!!!
One answer: reduce D dynamically
  disaster => redirect load, then reduce D to compensate for extra load
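One way “reduce D dynamically” might look, assuming data is spread over equal partitions and each query normally touches all of them; the load signal and the formula are illustrative.

```python
def partitions_to_query(total_partitions, load_factor):
    """load_factor = offered load / capacity. Above 1.0, shrink the slice of
    the database each query touches, so the DQ spent per query drops and more
    queries can be admitted (harvest falls, yield stays up)."""
    if load_factor <= 1.0:
        return total_partitions                 # normal mode: 100% harvest
    return max(1, int(total_partitions / load_factor))

print(partitions_to_query(100, 0.8))   # 100 -> full harvest
print(partitions_to_query(100, 2.0))   # 50  -> ~50% harvest, but full yield
```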
Thinking Probabilistically
Maximize symmetry
  SPMD + simple replication schemes
Make faults independent
  requires thought
  avoid cascading errors/faults
  understand redirected load
  KISS
Use randomness
  makes worst-case and average case the same
  ex: Inktomi spreads data & queries randomly
  Node loss implies a random 1% harvest reduction
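A sketch in the spirit of the Inktomi example: spread documents pseudo-randomly over 100 nodes, so losing any one node costs a random ~1% of the harvest. The hashing scheme here is just an illustrative stand-in.

```python
import hashlib

NODES = 100

def node_for(doc_id):
    # Deterministic pseudo-random placement: each document lands on a
    # uniformly random node, so any single node holds ~1% of the data.
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % NODES

docs = [f"doc-{i}" for i in range(100_000)]
lost_node = 17
lost = sum(1 for d in docs if node_for(d) == lost_node)
print(lost / len(docs))   # ~0.01: a node loss is a random ~1% harvest reduction
```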
Server Pollution
Can’t fix all memory leaks
  Third-party software leaks memory and sockets
  so does the OS sometimes
  Some failures tie up local resources
Solution: planned periodic “bounce”
  Not worth the stress to do any better
  Bounce time is less than 10 seconds
  Nice to remove load first…
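A minimal sketch of a rolling “bounce”; the drain/restart hooks stand in for whatever load-manager and process-control commands the real system provides.

```python
import time

def bounce(node, remove_load, restart, restore_load):
    remove_load(node)    # nice to remove load first…
    restart(node)        # fresh process: leaked memory/sockets reclaimed
    restore_load(node)   # bounce time kept under ~10 seconds

def bounce_all(nodes, remove_load, restart, restore_load, period_s=24 * 3600):
    """Bounce one node at a time, spread over the period, so the cluster
    never loses more than one node's worth of capacity."""
    for node in nodes:
        bounce(node, remove_load, restart, restore_load)
        time.sleep(period_s / len(nodes))
```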
Key New Problems
Unknown but large growth
Must be truly highly available
Incremental & Absolute scalability
1000’s of components
Hot swap everything (no recovery time allowed)
No “night”
Graceful degradation under faults & saturation
Constant evolution (internet time)
Software will be buggy
Hardware will fail
These can’t be emergencies...
Conclusions
Parallel Programming is very relevant,
except…
historically avoids availability
no notion of online evolution
limited notions of graceful degradation (checkpointing)
best for CPU-bound tasks
Must think probabilistically about everything
no such thing as a 100% working system
no such thing as 100% fault tolerance
partial results are often OK (and better than none)
Capacity * Completeness == Constant
Partial checklist
What is shared? (namespace, schema?)
What kind of state in each boundary?
How would you evolve an API?
Lifetime of references? Expiration impact?
Graceful degradation as modules go down?
External persistent names?
Consistency semantics and boundary?
The Move to Clusters
No single machine can handle the load
Only solution is clusters
Other cluster advantages:
  Cost: about 50% cheaper per CPU
  Availability: possible to build HA systems
  Incremental growth: add nodes as needed
  Replace whole nodes (easier)
Goals
Sheer scale
Handle 100M users, going toward 1B
Largest: AOL Web Cache: 12B hits/day
High Availability
Large cost for downtime
$250K per hour for online retailers
$6M per hour for stock brokers
Disaster Tolerance?
Overload Handling
System Evolution
Decentralization?