Mega-services - Duke University


“Mega-services”:
Scale and Consistency
A Lightning Tour
Now with Elastic Scaling!
Jeff Chase
Duke University
A service
[Diagram: a client sends a request to the server and receives a reply; inside the service, a web server passes work to an app server, which talks to a DB server backed by a store.]
The Steve Yegge rant, part 1
Products vs. Platforms
Selectively quoted/clarified from http://steverant.pen.io/, emphasis added.
This is an internal Google memorandum that “escaped”. Yegge had moved
to Google from Amazon. His goal was to promote service-oriented software
structures within Google.
So one day Jeff Bezos [CEO of Amazon] issued a mandate....[to the
developers in his company]:
His Big Mandate went something along these lines:
1) All teams will henceforth expose their data and functionality through
service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of interprocess communication allowed:
no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed
is via service interface calls over the network.
The Steve Yegge rant, part 2
Products vs. Platforms
4) It doesn't matter what technology they use. HTTP, Corba, PubSub,
custom protocols -- doesn't matter. Bezos doesn't care.
5) All service interfaces, without exception, must be designed from the
ground up to be externalizable. That is to say, the team must plan and
design to be able to expose the interface to developers in the outside
world. No exceptions.
6) Anyone who doesn't do this will be fired.
7) Thank you; have a nice day!
Managing overload
• What should we do when a service is in overload?
– Overload: service is close to saturation (λ > λmax), leading to
unacceptable response time.
– Work queues grow without bound, increasing memory
consumption.
[Graph: throughput X vs. offered load λ; X rises with offered load up to saturation at λmax.]
Options for overload
1. Thrashing
– Keep trying and hope things get better. Accept each request and inject it into the system. Then drop requests at random if some queue overflows its memory bound. Note: leads to dropping requests after work has been invested, wasting work and reducing throughput (e.g., congestion collapse).
2. Admission control or load conditioning
– Reject requests as needed to keep the system healthy. Reject them early, before they incur processing costs. Choose your victims carefully, e.g., prefer “gold” customers, or reject the most expensive requests. (See the sketch after this list.)
3. Dynamic provisioning or elastic scaling
– E.g., acquire new capacity on the fly from a cloud provider, and shift load over to the new capacity.
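A minimal admission-control sketch in Python (my own, not from the slides; the queue bound and “gold” preference are illustrative assumptions): reject early, before any processing cost is paid.

```python
import queue

MAX_QUEUE = 1000      # illustrative bound on admitted-but-unserved requests

work = queue.Queue()

def admit(request):
    """Admission control: reject early, before the request incurs processing cost.
    `request` is a dict in this sketch, e.g. {"gold": True, ...}."""
    depth = work.qsize()
    if depth >= MAX_QUEUE:
        return reject(request)                          # saturated: shed everything
    if depth >= 0.8 * MAX_QUEUE and not request.get("gold", False):
        return reject(request)                          # near saturation: prefer "gold" customers
    work.put(request)                                   # admitted: worker threads drain this queue
    return True

def reject(request):
    # Tell the client to back off; no work has been invested in this request yet.
    return False
```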
Elastic scaling: “pay as you grow”
EC2 Elastic Compute Cloud
The canonical public cloud
[Diagram: a client uses a service deployed from a virtual appliance image; the service runs as a guest on hosts operated by one or more cloud providers.]
IaaS: Infrastructure as a Service
[Diagram: a client calls a service; the hosting platform stack is OS over VMM over physical hardware.]
Hosting performance and isolation are determined by the virtualization layer: virtual machines (VMs), e.g., VMware, KVM, etc.
EC2 is a public IaaS cloud (fee-for-service). Deployment of private clouds is growing rapidly with open IaaS cloud software.
Native virtual machines (VMs)
• Slide a hypervisor underneath the kernel.
– New OS/TCB layer: virtual machine monitor (VMM).
• Kernel and processes run in a virtual machine (VM).
– The VM “looks the same” to the OS as a physical machine.
– The VM is a sandboxed/isolated context for an entire OS.
• Can run multiple VM instances on a shared computer.
Thank you, VMware
Also in the news, the Snowden NSA leak for Hallowe’en 2013. Scary?
Scaling a service
[Diagram: incoming work flows through a dispatcher to a server cluster/farm/cloud/grid running on a support substrate in a data center.]
Add servers or “bricks” for scale and robustness.
Issues: state storage, server selection, request routing, etc.
Service scaling and bottlenecks
Scale up by adding capacity incrementally?
• “Just add bricks/blades/units/elements/cores”...but that presumes we
can parallelize the workload.
• “Service workloads parallelize easily.”
– Many independent requests: spread requests across multiple units.
– Problem: some requests use shared data. Partition data into chunks and
spread them across the units: be sure to read/write a common copy.
• Load must be evenly distributed, or else some unit saturates before
the others (bottleneck).
A bottleneck limits throughput
and/or may increase response
time for some class of requests.
Coordination for service clusters
• How to assign data and functions among servers?
– To spread the work across an elastic server cluster?
• How to know which server is “in charge” of a given
function or data object?
– E.g., to serialize reads/writes on each object, or otherwise
ensure consistent behavior: each read sees the value stored by
the most recent write (or at least some reasonable value).
• Goals: safe, robust, even, balanced, dynamic, etc.
• Two key techniques:
– Leases (leased locks)
– Hash-based adaptive data distributions
What about failures?
• Systems fail. Here’s a reasonable set of assumptions
about failure properties:
• Nodes/servers/replicas/bricks
– Fail-stop or fail-fast fault model
– Nodes either function correctly or remain silent
– A failed node may restart, or not
– A restarted node loses its memory state, and recovers its
secondary (disk) state
– Note: nodes can also fail by behaving in unexpected ways, like
sending false messages. These are called Byzantine failures.
• Network messages
– “delivered quickly most of the time” but may be dropped.
– Message source and content are safely known (e.g., crypto).
Challenge: data management
• Data volumes are growing enormously.
• Mega-services are “grounded” in data.
• How to scale the data tier?
– Scaling requires dynamic placement of data items across data
servers, so we can grow the number of servers.
– Caching helps to reduce load on the data tier.
– Replication helps to survive failures and balance read/write load.
– E.g., alleviate hot-spots by spreading read load across multiple
data servers.
– Caching and replication require careful update protocols to
ensure that servers see a consistent view of the data.
Service-oriented architecture of Amazon's platform
Dynamo is a scalable, replicated key-value store.
Key-value stores
• Many mega-services are built on key-value stores.
– Store variable-length content objects: think “tiny files” (value)
– Each object is named by a “key”, usually fixed-size.
– Key is also called a token: not to be confused with a crypto key!
Although it may be a content hash (SHAx or MD5).
– Simple put/get interface with no offsets or transactions (yet).
– Goes back to literature on Distributed Data Structures [Gribble
1998] and Distributed Hash Tables (DHTs).
[image from Sean Rhea, opendht.org]
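As a tiny sketch of the put/get interface (mine, not from the lecture), using a content hash as the key, as the slide notes it may be:

```python
import hashlib

store = {}   # in-process stand-in for a distributed key-value store

def put(value: bytes) -> str:
    """Store an object under its content hash (SHA-256 here) and return the key."""
    key = hashlib.sha256(value).hexdigest()
    store[key] = value
    return key

def get(key: str) -> bytes:
    return store[key]

k = put(b"tiny file contents")
assert get(k) == b"tiny file contents"
```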
Key-value stores
• Data objects named in a “flat” key space (e.g., “serial numbers”)
• K-V is a simple and clean abstraction that admits a scalable, reliable
implementation: a major focus of R&D.
• Is put/get sufficient to implement non-trivial apps?
[Diagram: a distributed application calls put(key, data) and get(key) on a distributed hash table; the DHT's lookup service maps lookup(key) to a node IP address among the participating nodes.]
[image from Morris, Stoica, Shenker, etc.]
Scalable key-value stores
• Can we build massively scalable key/value stores?
– Balance the load: distribute the keys across the nodes.
– Find the “right” server(s) for a given key.
– Adapt to change (growth and “churn”) efficiently and reliably.
– Bound the spread of each object.
• Warning: it’s a consensus problem!
• What is the consistency model for massive stores?
– Can we relax consistency for better scaling? Do we have to?
Memcached is a scalable in-memory key-value cache.
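One standard way to “distribute the key space” across an elastic set of nodes is consistent hashing. A minimal sketch (my own; node names are illustrative, and real memcached clients use their own hashing schemes):

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Map keys to nodes on a hash ring; adding/removing a node moves only nearby keys."""
    def __init__(self, nodes, vnodes=100):
        # Place several virtual points per node for better balance.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["cache1", "cache2", "cache3"])     # illustrative node names
print(ring.node_for("user:1234"))                   # the same key always routes to the same node
```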
Scaling database access
• Many services are data-driven.
– Multi-tier services: the “lowest” layer is a data tier with the authoritative copy of service data.
• Data is stored in various stores or databases, some with an advanced query API, e.g., SQL (Structured Query Language).
• Databases are hard to scale.
– Complex data: atomic, consistent, recoverable, durable (“ACID”).
[Diagram: web servers issue queries through a SQL query API to database servers.]
Caches can help if much of the workload is simple reads.
Memcached
• “Memory caching daemon”
• It's just a key/value store with a simple get/put API
• Scalable cluster service
– array of server nodes
– distribute requests among nodes
– how? distribute the key space
– scalable: just add nodes
• Memory-based
• LRU object replacement
• Many technical issues: multi-core server scaling, MxN communication, replacement, consistency
[Diagram: web servers call get/put on a tier of memcached servers, in front of the database servers reached through a SQL query API.]
[From Spark Plug to Drive Train: The Life of an App Engine Request, Alon Levi, 5/27/09]
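A common way web servers use such a cache is the cache-aside pattern, since “caches can help if much of the workload is simple reads.” A hedged sketch (the in-process dict stands in for the memcached cluster, and db_query is a hypothetical placeholder for the data tier):

```python
CACHE = {}   # in-process stand-in for the memcached cluster's get/put API

def db_query(user_id):
    # Hypothetical placeholder for a slow, authoritative read from the data tier.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    value = CACHE.get(key)            # 1. try the cache first
    if value is None:
        value = db_query(user_id)     # 2. miss: read the authoritative copy
        CACHE[key] = value            # 3. populate the cache for later readers
    return value

def update_user(user_id, new_value):
    # Writes go to the database (omitted); invalidate the cached copy so
    # readers do not keep seeing stale data.
    CACHE.pop(f"user:{user_id}", None)
```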
Consistency and coordination
Very loosely:
• Consistency: each read of a {key, file, object} sees the
value stored by the most recent write.
– Or at least writes propagate to readers “eventually”.
– More generally: the service behaves as expected: changes to
shared state are properly recorded and observed.
• Coordination: the roles and functions of each element
are understood and properly adjusted to respond to
changes (e.g., failures, growth, or rebalancing: “churn”).
– E.g., distribution of functions or data among the elements.
Example: mutual exclusion
• It is often necessary to grant some node/process
the “right” to “own” some given data or function.
• Ownership rights often must be mutually exclusive.
– At most one owner at any given time.
• How to coordinate ownership?
• Warning: it’s a consensus problem!
One solution: lock service
[Diagram: clients A and B each send acquire to the lock service; the service grants the lock to A, which executes x=x+1 and releases; the service then grants to B, which executes x=x+1 and releases.]
A lock service in the real world
[Diagram: A acquires the lock and begins x=x+1, then fails (X) while holding it; B's acquire is never granted, and both the service and B are left wondering (???) whether A is still alive.]
Solution: leases (leased locks)
• A lease is a grant of ownership or
control for a limited time.
• The owner/holder can renew or
extend the lease.
• If the owner fails, the lease expires
and is free again.
• The lease might end early.
– lock service may recall or evict
– holder may release or relinquish
A lease service in the real world
[Diagram: A acquires and is granted the lease, begins x=x+1, then fails (???) while holding it; the lease expires, the service grants it to B, and B executes x=x+1 and releases.]
Leases and time
• The lease holder and lease service must agree when
a lease has expired.
– i.e., that its expiration time is in the past
– Even if they can’t communicate!
• We all have our clocks, but do they agree?
– synchronized clocks
• For leases, it is sufficient for the clocks to have a
known bound on clock drift.
– |T(Ci) – T(Cj)| < ε
– Build in slack time > ε into the lease protocols as a safety
margin.
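A minimal sketch of that timing rule (mine, not the lecture's): the service waits an extra ε of slack beyond the nominal expiry before reclaiming, so a holder whose clock lags by up to ε has already stopped using the lease. Both sides call time.time() here only because this is a single-process sketch; in a real system each side reads its own clock.

```python
import time

EPSILON = 2.0     # assumed bound on clock drift between holder and service (seconds)
TERM = 30.0       # lease term (seconds)

class LeaseService:
    def __init__(self):
        self.holder, self.expires = None, 0.0

    def acquire(self, who):
        now = time.time()
        # Reclaim only after the nominal expiry plus EPSILON of slack, so a holder
        # whose clock lags by up to EPSILON has already stopped using the lease.
        if self.holder is None or who == self.holder or now > self.expires + EPSILON:
            self.holder, self.expires = who, now + TERM
            return self.expires        # grant (also serves as renewal for the current holder)
        return None                    # denied: another holder's lease is still live

class LeaseHolder:
    def __init__(self, service, name):
        self.service, self.name, self.expires = service, name, 0.0

    def may_act(self):
        """Safe to touch the protected object only while our lease has not expired."""
        if time.time() >= self.expires:
            granted = self.service.acquire(self.name)    # acquire or renew
            self.expires = granted or 0.0
        return time.time() < self.expires
```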
OK, fine, but…
• What if A does not fail, but is instead isolated by a network partition?
A network partition
[Diagram: a crashed router splits the nodes into disconnected subsets.]
A network partition is any event that blocks all
message traffic between subsets of nodes.
Never two kings at once
[Diagram: A acquires and is granted the lease and executes x=x+1, then is cut off (???); once the lease expires, the service grants it to B, which executes x=x+1 and releases. An expired holder must stop acting as owner, so there are never two “kings” at once.]
Lease example: network file cache consistency
Google File System (GFS)
Similar: Hadoop HDFS, p-NFS, many other parallel file systems.
A master server stores metadata (names, file maps) and acts as lock server.
Clients call master to open file, acquire locks, and obtain metadata. Then they
read/write directly to a scalable array of data servers for the actual data. File
data may be spread across many data servers: the maps say where it is.
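A rough illustration of that client flow (entirely hypothetical names; the real GFS client interface differs in detail): metadata from the master, bulk data directly from the data servers.

```python
def read_file(master, path, offset, length):
    # 1. Control path: ask the master for metadata (chunk handles + replica locations).
    #    `master.lookup` and the chunk/replica fields below are hypothetical stand-ins.
    chunks = master.lookup(path, offset, length)

    # 2. Data path: read each chunk directly from one of its replicas,
    #    bypassing the master for the bulk data transfer.
    data = b""
    for chunk in chunks:
        replica = chunk.replicas[0]                  # any data server holding this chunk
        data += replica.read(chunk.handle, chunk.start, chunk.size)
    return data
```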
OK, fine, but…
• What if the manager/master itself fails?
We can replace it, but the nodes must agree on
who the new master is: requires consensus.
The Answer
• Replicate the functions of the
manager/master.
– Or other coordination service…
• Designate one of the replicas as a primary.
– Or master
• The other replicas are backup servers.
– Or standby or secondary
• If the primary fails, use a high-powered
consensus algorithm to designate and
initialize a new primary.
Consensus: abstraction
[Figure (Coulouris and Dollimore): Step 1, Propose: each process Pi proposes a value vi to the others over an unreliable multicast. Step 2, Decide: the consensus algorithm runs and each Pi decides a value di; all non-faulty processes agree on a value in a bounded time.]
Fischer-Lynch-Paterson (1985)
• No consensus can be guaranteed in an
asynchronous system in the presence of failures.
• Intuition: a “failed” process may just be slow, and
can rise from the dead at exactly the wrong time.
• Consensus may occur recognizably, rarely or often.
Network partition
Split brain
“CAP theorem” (Dr. Eric Brewer): C-A-P, choose two
[Diagram: a triangle with Consistency (C), Availability (A), and Partition-resilience (P) at its corners.]
CA: available and consistent, unless there is a partition.
CP: always consistent, even in a partition, but a reachable replica may deny service if it is unable to agree with the others (e.g., quorum).
AP: a reachable replica provides service even in a partition, but may be inconsistent.
Properties for Correct Consensus
• Termination: All correct processes eventually decide.
• Agreement: All correct processes select the same di.
– Or…(stronger) all processes that do decide select the same
di, even if they later fail.
• Consensus “must be” both safe and live.
• FLP and CAP say that a consensus algorithm
can be safe or live, but not both.
Now what?
• We have to build practical, scalable, efficient
distributed systems that really work in the
real world.
• But the theory says it is impossible to build
reliable computer systems from unreliable
components.
• So what are we to do?
Butler W. Lampson
http://research.microsoft.com/en-us/um/people/blampson
Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at
MIT…..He was one of the designers of the SDS 940 time-sharing system, the Alto personal
distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the
Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the
Microsoft Palladium high-assurance stack, and several programming languages. He received
the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer
Pioneer award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the
NAE’s Draper Prize in 2004.
[Lampson 1995]
Summary/preview
• Master coordinates, dictates consensus
– e.g., lock service
– Also called “primary”
• Remaining consensus problem: who is the
master?
– Master itself might fail or be isolated by a network
partition.
– Requires a high-powered distributed consensus
algorithm (Paxos) to elect a new master.
– Paxos is safe but not live: in the worst case
(multiple repeated failures) it might not terminate.
A Classic Paper
• ACM TOCS:
– Transactions on Computer Systems
• Submitted: 1990. Accepted: 1998
• Introduced:
???
A Paxos Round
[Diagram: leader L self-appoints and sends 1a Propose (“Can I lead b?”) to the nodes N; each node logs its promise and replies 1b Promise (“OK, but”). L waits for a majority, then sends 2a Accept (“v?”); each node logs the value and replies 2b Ack (“OK”). L waits for a majority of acks, then sends 3 Commit (“v!”): the value is now safe.]
Nodes may compete to serve as leader, and may
interrupt one another’s rounds. It can take many
rounds to reach consensus.
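To make the round concrete, here is a minimal single-decree sketch (my own simplification; a real implementation adds stable logging, retries with higher ballots, and leader election):

```python
class Acceptor:
    """One node N: answers Propose (1a) with Promise (1b), and Accept (2a) with Ack (2b)."""
    def __init__(self):
        self.promised = -1        # highest ballot b this node has promised
        self.accepted = None      # (ballot, value) most recently accepted, if any

    def on_propose(self, b):                   # 1a: "Can I lead b?"
        if b > self.promised:
            self.promised = b                  # would be logged to stable storage
            return ("promise", self.accepted)  # 1b: "OK, but" here is what I already accepted
        return ("reject", None)

    def on_accept(self, b, v):                 # 2a: "v?"
        if b >= self.promised:
            self.promised = b
            self.accepted = (b, v)             # would be logged to stable storage
            return "ack"                       # 2b: "OK"
        return "reject"

def run_round(leader_value, b, acceptors):
    """Leader L self-appoints with ballot b and tries to get a value chosen by a majority."""
    majority = len(acceptors) // 2 + 1
    replies = [a.on_propose(b) for a in acceptors]
    grants = [acc for tag, acc in replies if tag == "promise"]
    if len(grants) < majority:
        return None                # interrupted by a competing leader; retry with a higher ballot
    # Safety rule: if any acceptor already accepted a value, propose the one with the
    # highest ballot; otherwise the leader is free to propose its own value.
    prior = [acc for acc in grants if acc is not None]
    v = max(prior, key=lambda bv: bv[0])[1] if prior else leader_value
    acks = [a.on_accept(b, v) for a in acceptors]
    if acks.count("ack") >= majority:
        return v                   # 3: "v!" -- the value is safe (committed)
    return None

# e.g. run_round("x=1", b=1, acceptors=[Acceptor() for _ in range(3)]) returns "x=1"
```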
Consensus in Practice
• Lampson: “Since general consensus is expensive,
practical systems reserve it for emergencies.”
– e.g., to select a primary/master, e.g., a lock server.
• Centrifuge, GFS master, Frangipani, etc.
• Google Chubby service (“Paxos Made Live”)
• Pick a primary with Paxos. Do it rarely; do it right.
– Primary holds a “master lease” with a timeout.
• Renew by consensus with primary as leader.
– Primary is “czar” as long as it holds the lease.
– Master lease expires? Fall back to Paxos.
– (Or BFT.)
Coordination services
• Build your cloud apps around a coordination
service with consensus (Paxos) at its core.
• This service is a fundamental building block
for consistent scalable services.
– Chubby (Google)
– Zookeeper (Yahoo!)
– Centrifuge (Microsoft)
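For example, with ZooKeeper an application can use its lock recipe instead of implementing consensus itself. A hedged sketch using the third-party kazoo client (the ensemble address, lock path, and do_critical_work are illustrative assumptions):

```python
from kazoo.client import KazooClient   # third-party client: pip install kazoo

def do_critical_work():
    # Hypothetical placeholder for the mutually exclusive section.
    pass

# Illustrative ensemble address; point this at a real ZooKeeper quorum.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# ZooKeeper's lock recipe: whichever client holds the lock "owns" this resource.
lock = zk.Lock("/app/locks/resource-42", identifier="worker-7")
with lock:                  # blocks until the lock is granted to this client
    do_critical_work()

zk.stop()
```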
Coordination and Consensus
• If the key to availability and scalability is to decentralize and
replicate functions and data, how do we coordinate the nodes?
– data consistency
– update propagation
– mutual exclusion
– consistent global states
– failure notification
– group membership (views)
– group communication
– event delivery and ordering
We stopped here
• But I’ll leave these slides in.
• Hidden slides (e.g., on VMs and network file
cache consistency) are not to be tested.
• Obviously we only peeked at some of these
topics, but you should understand the problem
of failures (lock service).
Caching in the Web
• Web “proxy” caches are servers that cache Web content.
• Reduce traffic to the origin server.
• Deployed by enterprises to reduce external network
traffic to serve Web requests of their members.
• Also deployed by third-party companies that sell caching
service to Web providers.
– Content Delivery/Distribution Network (CDN)
– Help Web providers serve their clients better.
– Help absorb unexpected load from “flash crowds”.
– Reduce Web server infrastructure costs.
Content Delivery Network (CDN)
DNS Caching
Cache freshness by TTL (“freshness date”), also used by early NFS and HTTP/Web.
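A minimal sketch of TTL-based freshness (not from the slides): serve from the cache only while the entry's “freshness date” has not passed.

```python
import time

cache = {}   # name -> (value, expires_at)

def lookup(name, resolve, ttl=300):
    """Serve from cache while fresh; go back to the authority once the TTL expires."""
    entry = cache.get(name)
    if entry and time.time() < entry[1]:
        return entry[0]                         # still fresh: no traffic to the origin
    value = resolve(name)                       # stale or missing: ask the authoritative source
    cache[name] = (value, time.time() + ttl)    # record the new freshness date
    return value
```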