Transcript: The CAP Theorem

The CAP Theorem
Tomer Gabel, Wix
BuildStuff 2014
Credits
Originally a talk by
Yoav Abrahami (Wix)
Based on “Call Me Maybe” by
Kyle “Aphyr” Kingsbury
Brewer’s CAP Theorem
Partition
Tolerance
Availability
Consistency
By Example
• I want this book!
– I add it to the cart
– Then continue browsing
• There’s only one copy in stock!
• … and someone else just bought it.
Consistency
Consistency: Defined
• In a consistent system: all participants see the same value at the same time
• “Do you have this book in stock?”
Consistency: Defined
• If our book store is an inconsistent system:
– Two customers may buy the book
– But there’s only one item in inventory!
• We’ve just violated a business constraint.
Availability
Availability: Defined
• An available system:
– Is reachable
– Responds to requests (within SLA)
• Availability does not guarantee success!
– The operation may fail
– “This book is no longer available”
Availability: Defined
• What if the system is unavailable?
– I complete the checkout
– And click on “Pay”
– And wait
– And wait some more
– And…
• Did I purchase the book or not?!
Partition Tolerance
Partition Tolerance: Defined
• Partition: one or more nodes are unreachable
• This is distinct from a dead node…
• … but observably the same
• No practical system runs on a single node
• So all systems are susceptible!
[Diagram: a five-node network, A through E]
“The Network is Reliable”
[Diagram: messages between nodes A and B over time, illustrating drop, delay, duplicate and reorder]
• All four happen in an IP network
• To a client, delays and drops are the same
• Perfect failure detection is provably impossible¹!
¹ “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch and Paterson
Partition Tolerance: Reified
• External causes:
– Bad network config
– Faulty equipment
– Scheduled maintenance
• Even software causes partitions:
– GC pauses
– Overloaded servers
• Plenty of war stories!
– Netflix
– Twilio
– GitHub
– Wix :-)
• Some hard numbers¹:
– 5.2 failed devices/day
– 59K lost packets/day
– Adding redundancy only improves by 40%
¹ “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, Gill et al
“Proving” CAP
In Pictures
• Let’s consider a simple system:
– Service A writes values
– Service B reads values
– Values are replicated between nodes
• These are “ideal” systems
– Bug-free, predictable
[Diagram: Node 1 runs service A holding V0; Node 2 runs service B holding V0]
In Pictures
• “Sunny day scenario”:
– A writes a new value V1
– The value is replicated to node 2
– B reads the new value V1
[Diagram: both Node 1 (A) and Node 2 (B) move from V0 to V1]
In Pictures
• What happens if the network drops?
– A writes a new value V1
– Replication fails
– B still sees the old value
– The system is inconsistent
[Diagram: Node 1 (A) holds V1; Node 2 (B) still holds V0]
In Pictures
• A possible mitigation is synchronous replication:
– A writes a new value V1
– The value cannot be replicated, so the write is rejected
– Both A and B still see V0
– The system is logically unavailable
[Diagram: Node 1 (A) rejects V1 and keeps V0; Node 2 (B) holds V0]
What does it all mean?
The network is not reliable
• Distributed systems must handle partitions
• Any modern system runs on more than one node…
• … and is therefore distributed
• Ergo, you have to choose:
– Consistency over availability
– Availability over consistency
Granularity
• Real systems comprise many operations
– “Add book to cart”
– “Pay for the book”
• Each has different properties
• It’s a spectrum, not a binary choice!
[Diagram: a spectrum from Consistency (“Checkout”) to Availability (“Shopping Cart”)]
CAP IN THE REAL WORLD
Kyle “Aphyr” Kingsbury
Breaking consistency guarantees since 2013
PostgreSQL
• Traditional RDBMS
– Transactional
– ACID compliant
• Primarily a CP system
– Writes go against a master node
• “Not a distributed system”
– Except with a client at play!
PostgreSQL
• Writes are a simplified 2PC:
– Client votes to commit
– Server validates the transaction
– Server stores the changes
– Server acknowledges the commit
– Client receives the acknowledgement
[Diagram: Client → Server → Store]
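The simplified commit flow above can be sketched in a few lines. This is a toy model, not PostgreSQL’s actual protocol; all names (`Server`, `Store`, `commit`) are hypothetical.

```python
# Toy model of the simplified 2PC flow: validate, store, acknowledge.
# Hypothetical names, for illustration only.

class Store:
    def __init__(self):
        self.rows = {}

class Server:
    def __init__(self, store):
        self.store = store

    def commit(self, key, value):
        # 1. Validate the transaction (here: a trivial check)
        if value is None:
            raise ValueError("invalid transaction")
        # 2. Store the changes durably
        self.store.rows[key] = value
        # 3. Acknowledge the commit -- but this ack may be lost in transit!
        return "ack"

store = Store()
server = Server(store)
ack = server.commit("book-42", "sold")
# If the network drops right here, the change is durable in the store,
# yet the client never learns whether the commit succeeded.
```

The gap between step 2 and step 3 is exactly where the next slide’s problem lives: the write survives, the acknowledgement does not.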
PostgreSQL
• But what if the ack is never received?
• The commit is already stored…
• … but the client has no indication!
• The system is in an inconsistent state
[Diagram: Client → Server → Store, with the acknowledgement lost in transit]
PostgreSQL
• Let’s experiment!
• 5 clients write to a PostgreSQL instance
• We then drop the server from the network
• Results:
– 1000 writes
– 950 acknowledged
– 952 survivors
So what can we do?
1. Accept false-negatives
– May not be acceptable for your use case!
2. Use idempotent operations
3. Apply unique transaction IDs
– Query state after partition is resolved
• These strategies apply to any RDBMS
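Strategies 2 and 3 above combine naturally: attach a unique, client-generated transaction ID to each write and have the server deduplicate, so a blind retry after a lost ack is harmless. A minimal sketch, with hypothetical names (`Server`, `deposit`):

```python
import uuid

# Sketch: client attaches a unique transaction ID to each write;
# the server deduplicates, so retries after a lost ack are safe.

class Server:
    def __init__(self):
        self.applied = {}   # txn_id -> result; acts as a dedup table
        self.balance = 0

    def deposit(self, txn_id, amount):
        if txn_id in self.applied:
            # Already applied: replay the stored result instead of re-executing
            return self.applied[txn_id]
        self.balance += amount
        self.applied[txn_id] = self.balance
        return self.balance

server = Server()
txn = str(uuid.uuid4())
server.deposit(txn, 100)   # first attempt: suppose the ack is lost
server.deposit(txn, 100)   # client retries with the same ID: no double deposit
```

After the partition heals, the client can also query `applied` by transaction ID to learn whether its write took effect.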
MongoDB
• A document-oriented database
• Availability/scale via replica sets
– Clients write to a master node
– The master replicates writes to n replicas
• User-selectable consistency guarantees
MongoDB
• When a partition occurs:
– If the master is in the minority, it is demoted
– The majority promotes a new master…
– … selected by the highest optime
MongoDB
• The cluster “heals” after partition resolution:
– The “old” master rejoins the cluster
– Acknowledged minority writes are reverted!
MongoDB
• Let’s experiment!
• Set up a 5-node MongoDB cluster
• 5 clients write to the cluster
• We then partition the cluster
• … and restore it to see what happens
MongoDB
• With write concern unacknowledged:
– Server does not ack writes (except TCP)
– The default prior to November 2012
• Results:
– 6000 writes
– 5700 acknowledged
– 3319 survivors
– 42% data loss!
MongoDB
• With write concern acknowledged:
– Server acknowledges writes (after store)
– The default guarantee
• Results:
– 6000 writes
– 5900 acknowledged
– 3692 survivors
– 37% data loss!
MongoDB
• With write concern replica acknowledged:
– Client specifies a minimum number of replicas
– Server acks after the write reaches those replicas
• Results:
– 6000 writes
– 5695 acknowledged
– 3768 survivors
– 33% data loss!
MongoDB
• With write concern majority:
– For an n-node cluster, requires acknowledgement from a majority of nodes (more than n/2)
– Also called “quorum”
• Results:
– 6000 writes
– 5700 acknowledged
– 5701 survivors
– 2 false positives :-(
– No data loss
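The majority rule can be sketched as a tiny model (a hypothetical helper, not MongoDB’s API). The key property is that any two majorities of the same cluster must overlap, so a partitioned minority can never acknowledge a write the majority side won’t eventually see:

```python
# Sketch of a majority ("quorum") write concern for an n-node cluster:
# a write is acknowledged only if a majority of nodes stored it.

def quorum_write(replicas_reached, cluster_size=5):
    # A majority is floor(n/2) + 1 nodes
    return replicas_reached >= cluster_size // 2 + 1

assert quorum_write(3) is True    # 3 of 5: majority, write acknowledged
assert quorum_write(2) is False   # 2 of 5: minority side, write rejected
```

This is why the quorum experiment above loses no data: the minority partition simply cannot acknowledge writes, trading availability for consistency.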
So what can we do?
1. Keep calm and carry on
– As Aphyr puts it, “not all applications need
consistency”
– Have a reliable backup strategy
– … and make sure you drill restores!
2. Use write concern majority
– And take the performance hit
The prime suspects
• Aphyr’s Jepsen tests include:
– Redis
– Riak
– Zookeeper
– Kafka
– Cassandra
– RabbitMQ
– etcd (and consul)
– ElasticSearch
• If you’re considering them, go read his posts
• In fact, go read his posts regardless
http://aphyr.com/tags/jepsen
STRATEGIES FOR DISTRIBUTED SYSTEMS
Immutable Data
• Immutable (adj.): “Unchanging over time or unable to be changed.”
• Meaning:
– No deletes
– No updates
– No merge conflicts
– Replication is trivial
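The “replication is trivial” point is easiest to see with an append-only event log: since existing entries never change, replicating means copying the suffix the target hasn’t seen. A minimal sketch with illustrative names:

```python
# Sketch of an immutable, append-only event log: no updates, no deletes,
# so replication is just shipping missing suffixes.

log = []

def append(event):
    log.append(event)          # the only write operation there is

def replicate(source, target):
    # Existing entries never change, so copy only what the target lacks
    target.extend(source[len(target):])

append({"event": "book_added_to_cart", "book": "SICP"})
append({"event": "book_purchased", "book": "SICP"})

replica = []
replicate(log, replica)        # replica now matches the log exactly
```

Because entries are never rewritten, re-running `replicate` is also harmless, which ties this strategy to the next one: idempotence.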
Idempotence
• An idempotent operation:
– Can be applied one or more times with the same effect
• Enables retries
• Not always possible
– Side-effects are key
– Consider: payments
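The contrast is clearest side by side. In this sketch (hypothetical names), setting a status is idempotent and safe to retry, while a naive payment is not:

```python
# Idempotent vs. non-idempotent operations under retries.

state = {"status": None, "charged": 0}

def set_status(value):
    # Idempotent: applying it once or many times has the same effect
    state["status"] = value

def charge(amount):
    # NOT idempotent: every retry charges the customer again
    state["charged"] += amount

set_status("paid")
set_status("paid")     # retry after a lost ack: harmless

charge(10)
charge(10)             # retry after a lost ack: a duplicate payment!
```

This is why payments usually need the unique-transaction-ID trick from the RDBMS section to make retries safe.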
Eventual Consistency
• A design which prefers availability
• … but guarantees that clients will eventually see consistent reads
• Consider git:
– Always available locally
– Converges via push/pull
– Human conflict resolution
Eventual Consistency
• The system expects data to diverge
• … and includes mechanisms to regain convergence
– Partial ordering to minimize conflicts
– A merge function to resolve conflicts
Vector Clocks
• A technique for partial ordering
• Each node has a logical clock
– The clock increases on every write
– Track the last observed clocks for each item
– Include this vector on replication
• When observed and inbound vectors have no common ancestor, we have a conflict
• This lets us know when history diverged
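A minimal sketch of the comparison rule, representing each vector clock as a dict of per-node counters (helper names are illustrative):

```python
# Minimal vector-clock comparison: each node keeps a counter per node,
# increments its own on write, and ships the vector on replication.

def happened_before(a, b):
    """True if clock a causally precedes clock b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

def conflicts(a, b):
    # Neither precedes the other: histories diverged, we have a conflict
    return a != b and not happened_before(a, b) and not happened_before(b, a)

v1 = {"node1": 1}                 # a write on node1
v2 = {"node1": 1, "node2": 1}     # node2 saw v1, then wrote
v3 = {"node1": 2}                 # node1 wrote again without seeing node2

assert happened_before(v1, v2)    # v1 precedes v2: safe to overwrite
assert conflicts(v2, v3)          # concurrent writes: a merge is needed
```

When `conflicts` fires, the system falls back on a merge function, which is exactly where CRDTs come in.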
CRDTs
• Commutative Replicated Data Types¹
• A CRDT is a data structure that:
– Eventually converges to a consistent state
– Guarantees no conflicts on replication
¹ “A comprehensive study of Convergent and Commutative Replicated Data Types”, Shapiro et al
CRDTs
• CRDTs provide specialized semantics:
– G-Counter: monotonically increasing counter
– PN-Counter: also supports decrements
– G-Set: a set that only supports adds
– 2P-Set: supports removals, but only once
• OR-Sets are particularly useful
– Keep track of both additions and removals
– Can be used for shopping carts
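The G-Counter named above is the simplest CRDT and shows why replication cannot conflict: each replica only increments its own slot, and merging takes the elementwise maximum, an operation that is commutative, associative, and idempotent. A sketch (helper names are illustrative):

```python
# G-Counter sketch: one slot per node; increment your own slot only;
# merge by taking the elementwise maximum of the two vectors.

def g_increment(counter, node):
    counter[node] = counter.get(node, 0) + 1

def g_merge(a, b):
    # Commutative, associative and idempotent -> no conflicts, ever
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def g_value(counter):
    return sum(counter.values())

# Two replicas diverge during a partition...
r1, r2 = {}, {}
g_increment(r1, "node1")
g_increment(r1, "node1")
g_increment(r2, "node2")

# ...and converge on merge, regardless of merge order
assert g_value(g_merge(r1, r2)) == 3
assert g_merge(r1, r2) == g_merge(r2, r1)
```

The richer types (PN-Counter, OR-Set) follow the same pattern with more elaborate state, which is what makes an OR-Set workable as a shopping cart.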
Questions?
Complaints?
[email protected]
@tomerg
http://il.linkedin.com/in/tomergabel
Aphyr’s “Call Me Maybe” blog posts:
http://aphyr.com/tags/jepsen
Thank you for listening
WE’RE DONE HERE!