Scalable, Distributed Data
Structures for Internet Service
Construction
Landon Cox
March 2, 2016
In the year 2000 …
• Portals were thought to be a good idea
• Yahoo!, Lycos, AltaVista, etc.
• Original content up front + searchable directory
• The dot-com bubble was about to burst
• Started to break around 1999
• Lots of companies washed out by 2001
• Google was really taking off
• Founded in 1998
• PageRank was extremely accurate
• Proved: great search is enough (and portals are dumb)
• Off in the distance: Web 2.0, Facebook, AWS, “the cloud”
Questions of the day
1. How do we build highly-available web services?
• Support millions of users
• Want high-throughput
2. How do we build highly-available peer-to-peer services?
• Napster had just about been shut down (centralized)
• BitTorrent was around the corner
• Want to scale to thousands of nodes
• No centralized trusted administration or authority
• Problem: everything can fall apart (and does)
• Some of the solutions to #2 can help with #1
Storage interfaces
[Figure: two storage stacks over physical storage. File system: Process 1 and Process 2 issue mkdir, create, open, read, and write calls against a file hierarchy of directories and files. DBMS: processes issue SQL queries against a logical schema of attributes (Attr1 … AttrN) and values (Val1 … ValN).]
What is the interface to a file system?
What is the interface to a DBMS?
Data independence
• Data independence
• Idea that storage issues should be hidden from programs
• Programs should operate on data independently of underlying details
• In what way do FSes and DBs provide data independence?
• Both hide the physical layout of data
• Can change layout without altering how programs operate on data
• In what way do DBs provide stronger data independence?
• File systems leave format of data within files up to programs
• One program can alter/corrupt a file's layout/format
• Database clients cannot corrupt schema definition
ACID properties
• Databases also ensure ACID
• What is meant by Atomicity?
• Sequences of operations are submitted via transactions
• All operations in transaction succeed or fail
• No partial success (or failure)
• What is meant by Consistency?
• After transaction commits DB is in “consistent” state
• Consistency is defined by data invariants
• i.e., after transaction completes all invariants are true
• What is the downside of ensuring Consistency?
• In tension with concurrency and scalability
• Particularly in distributed settings
ACID properties
• Databases also ensure ACID
• What is meant by Isolation?
• Other processes cannot view modification of in-flight transactions
• Similar to atomicity
• Effects of a transaction cannot be partially viewed
• What is meant by Durability?
• After transaction commits data will not be lost
• Committed transactions survive hardware and software failures (see the sketch below)
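As a concrete illustration of these properties, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and transfer amounts are invented for the example, and real DBMSs implement the same guarantees with very different machinery.

# Minimal ACID sketch with Python's built-in sqlite3 (illustrative schema).
import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Atomicity/Isolation: both updates commit together or not at all, and
    # other connections never observe the in-between state.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()    # Durability: once commit returns, the transfer survives a crash
except sqlite3.Error:
    conn.rollback()  # on error, neither update is applied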
ACID properties
• Databases also ensure ACID
• Do file systems ensure ACID properties?
• Not really
• Atomicity: operations can be buffered, re-ordered, flushed async
• Consistency: many different consistency models
• Isolation: hard to ensure isolation without notion of transaction
• Durability: need to cache undermines guarantees (can use sync)
• What do file systems offer instead of ACID?
• Faster performance
• Greater flexibility for programs
• Byte-array abstraction rather than table abstraction
Needs of cluster-based storage
• Want three things
• Scalability (incremental addition of machines)
• Availability (failure/loss of machines)
• Consistency (sensible answers to requests)
• Traditional DBs fail to provide these features
• Focus on strong consistency hinders scalability and availability
• Requires a lot of coordination and complexity
• For file systems, it depends
• Some offer strong consistency guarantees (poor scalability)
• Some offer good scalability (poor consistency)
Distributed data structures
(DDS)
• Paper from OSDI ‘00
• Steve Gribble, Eric Brewer, Joseph Hellerstein, and David Culler
• Pointed out inadequacies of traditional storage for large-scale services
• Proposed a new storage interface
• More structured than file systems (structure is provided by DDS)
• Not as fussy as databases (no SQL)
• A few operations on data structure elements
Distributed data structures
(DDS)
[Figure: Process 1 and Process 2 issue get and put calls to the DDS, which stores Key1→Val1 … KeyN→ValN across storage bricks.]
Distributed Hash Tables (DHTs)
• DHT: same idea as DDS but decentralized
• Same interface as a traditional hash table
• put(key, value) — stores value under key
• get(key) — returns all the values stored under key
• Built over a distributed overlay network
• Partition key space over available nodes
• Route each put/get request to the appropriate node (see the sketch below)
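To make the interface concrete, here is a minimal single-process stand-in (purely illustrative, no networking or replication) with the put/get semantics above, where get returns every value stored under a key.

# Local stand-in for the DHT interface: put appends, get returns all values.
from collections import defaultdict

class LocalDHT:
    def __init__(self):
        self._table = defaultdict(list)

    def put(self, key: bytes, value: bytes) -> None:
        self._table[key].append(value)

    def get(self, key: bytes) -> list[bytes]:
        return list(self._table[key])

dht = LocalDHT()
dht.put(b"k1", b"v1")
print(dht.get(b"k1"))   # [b'v1']

A real DHT partitions this table across nodes and routes each call through an overlay, which the following slides walk through.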
(The DHT slides that follow are adapted from Sean C. Rhea's talks “OpenDHT: A Public DHT Service” and “Fixing the Embarrassing Slowness of OpenDHT on PlanetLab.”)
How DHTs Work
How do we ensure the put and the get find the same machine?
How does this work in DNS?
[Figure: each node holds a K,V table; put(k1,v1) is routed to the node responsible for k1, and a later get(k1) is routed to that same node, which returns v1.]
Nodes form a logical ring
First question: how do new nodes figure out where they should go on the ring?
[Figure: nodes with IDs 000, 010, 100, and 110 placed on a ring.]
Step 1: Partition Key Space
• Each node in DHT will store some k,v pairs
• Given a key space K, e.g. [0, 2^160):
• Choose an identifier for each node, id_i ∈ K, uniformly at random
• A pair k,v is stored at the node whose identifier is closest to k
• Key technique: cryptographic hashing (see the sketch below)
• Node id = SHA1(MAC address)
• P(SHA1 collision) <<< P(hardware failure)
• Nodes can independently compute their id
• Contrast this to DDS, in which an admin manually assigned nodes to partitions.
[Figure: the key space from 0 to 2^160, partitioned among the nodes.]
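A minimal sketch of this step in Python (illustrative, not OpenDHT's code): node IDs come from SHA-1 of a local attribute (the MAC addresses below are made up), and a key is stored at the node whose ID is closest on the circular key space [0, 2^160).

# Consistent-hashing sketch: SHA-1 node IDs, keys go to the closest ID on the ring.
import hashlib

KEY_SPACE = 2 ** 160

def sha1_id(data: bytes) -> int:
    return int(hashlib.sha1(data).hexdigest(), 16)

def ring_distance(a: int, b: int) -> int:
    d = abs(a - b)
    return min(d, KEY_SPACE - d)          # wrap around the circular key space

def responsible_node(key: bytes, node_ids: list[int]) -> int:
    k = sha1_id(key)
    return min(node_ids, key=lambda nid: ring_distance(nid, k))

# e.g., node id = SHA1(MAC address); each node can compute this on its own
nodes = [sha1_id(mac.encode()) for mac in
         ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "aa:bb:cc:00:00:03")]
print(hex(responsible_node(b"k1", nodes)))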
Step 2: Build Overlay Network
• Each node has two sets of neighbors
• Immediate neighbors in the key space
• Important for correctness
• Long-hop neighbors
• Allow puts/gets in O(log n) hops
[Figure: nodes on the key space 0 to 2^160 with immediate and long-hop neighbor links.]
Step 3: Route Puts/Gets Thru Overlay
• Route greedily, always making progress toward the key (see the sketch below)
[Figure: a get(k) request hops through the overlay, each hop landing closer to k on the key space 0 to 2^160.]
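A sketch of the greedy rule under the same assumptions as the Step 1 sketch; neighbors_of is an assumed callback returning a node's overlay links (immediate plus long-hop neighbors), and ring_distance is repeated here so the snippet stands alone.

# Greedy overlay routing sketch: hand the request to whichever known neighbor
# is closest to the key, stopping when no neighbor is closer than we are.
KEY_SPACE = 2 ** 160

def ring_distance(a: int, b: int) -> int:
    d = abs(a - b)
    return min(d, KEY_SPACE - d)

def route(start: int, key: int, neighbors_of) -> list[int]:
    path, current = [start], start
    while True:
        best = min(neighbors_of(current) + [current],
                   key=lambda n: ring_distance(n, key))
        if best == current:          # no neighbor makes progress: current owns the key
            return path
        current = best
        path.append(current)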
How Does Lookup Work?
• Assign IDs to nodes
• Map hash values to node with closest ID
• Leaf set is successors and predecessors (correctness)
• Routing table matches prefixes (efficiency)
Explain the green arrows. Explain the successively longer red arrows.
[Figure: a lookup routed from a source node through nodes labeled 00…, 10…, 110…, and 111… toward the Lookup ID.]
Iterative vs. recursive
• Previous example: recursive lookup
• Could also perform lookup iteratively (see the sketch below)
• Which one is faster?
• Why might I want to do this iteratively?
• What does DNS do and why?
[Figure: recursive lookup (the request is forwarded hop by hop through the overlay) vs. iterative lookup (the client contacts each hop in turn).]
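Here is a toy, self-contained simulation of the two styles on a 4-bit ring (node IDs and links are invented); the point is who drives the hops. In the recursive style each hop forwards the request onward, while in the iterative style the client asks each hop for the next node and contacts it itself, which lets the client manage per-hop timeouts and retries.

# Toy 4-bit ring: node -> overlay neighbors (made-up topology).
RING = {0b0001: [0b0100, 0b1000],
        0b0100: [0b0001, 0b1100],
        0b1000: [0b1100, 0b0001],
        0b1100: [0b1000, 0b0100]}

def dist(a, b, space=16):
    d = abs(a - b)
    return min(d, space - d)

def next_hop(node, key):
    # greedy choice among the node itself and its neighbors
    return min(RING[node] + [node], key=lambda n: dist(n, key))

def lookup_recursive(node, key):
    # each hop forwards the request; only the owner answers the client
    nxt = next_hop(node, key)
    return node if nxt == node else lookup_recursive(nxt, key)

def lookup_iterative(start, key):
    # the client drives every hop itself, so it controls per-hop timeouts
    node = start
    while (nxt := next_hop(node, key)) != node:
        node = nxt
    return node

print(lookup_recursive(0b0001, 0b1011), lookup_iterative(0b0001, 0b1011))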
(LPC: from the Pastry paper)
Example routing state
[Figure: an example node's routing state, from the Pastry paper.]
OpenDHT Partitioning
• Assign each node an identifier from the key space
• Store a key-value pair (k,v) on several nodes with IDs closest to k
• Call them replicas for (k,v)
[Figure: a node with id = 0xC9A1… on the ring, responsible for the keys closest to its identifier.]
OpenDHT Graph Structure
• Overlay neighbors match prefixes of local identifier
• Choose among nodes with same matching prefix length by network latency (see the sketch below)
[Figure: a node's overlay neighbors, e.g. nodes 0x41, 0x84, 0xC0, and 0xED, chosen by matching identifier prefixes.]
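A sketch of the selection rule above (not OpenDHT's actual table maintenance): group candidate nodes by the length of the identifier prefix they share with us and keep the lowest-latency node in each group. rtt_ms is an assumed map of measured round-trip times, and real Pastry-style tables keep one entry per (prefix length, next digit) rather than one per prefix length.

# Prefix-based neighbor selection with latency tie-breaking (illustrative).
def shared_prefix_len(a: int, b: int, bits: int = 160) -> int:
    for i in range(bits - 1, -1, -1):          # compare from the high-order bit
        if (a >> i) & 1 != (b >> i) & 1:
            return bits - 1 - i
    return bits

def choose_neighbors(my_id: int, candidates: list[int], rtt_ms: dict) -> dict:
    best = {}                                  # prefix length -> (rtt, node)
    for node in candidates:
        if node == my_id:
            continue
        p = shared_prefix_len(my_id, node)
        if p not in best or rtt_ms[node] < best[p][0]:
            best[p] = (rtt_ms[node], node)
    return {p: node for p, (_, node) in best.items()}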
Performing Gets in OpenDHT
• Client sends a get request to gateway
• Gateway routes it along neighbor links to first replica encountered
• Replica sends response back directly over IP
[Figure: the client sends get(0x6b) to a gateway; the request is routed through the overlay to replica 0x6c, which sends the get response back to the client directly over IP.]
DHTs: The Hype
• High availability
• Each key-value pair replicated on multiple nodes
• Incremental scalability
• Need more storage/throughput? Just add more nodes.
• Low latency
• Recursive routing, proximity neighbor selection,
server selection, etc.
Robustness Against Failure
• If a neighbor dies, a node routes through its next best one
• If a replica dies, the remaining replicas create a new one to replace it
[Figure: the client's request is routed around a failed node to a surviving replica.]
Routing Around Failures
• Under churn, neighbors may have failed
• How to detect failures?
• Acknowledge each hop
[Figure: each hop of a route toward k on the key space 0 to 2^160 is acknowledged with an ACK.]
Routing Around Failures
• What if we don’t receive an ACK?
• Resend through a different neighbor
[Figure: after a timeout, the request toward k is resent through a different neighbor.]
Computing Good Timeouts
• What if timeout is too long?
• Increases put/get latency
• What if timeout is too short?
• Get message explosion
[Figure: a timed-out hop on a route toward k.]
(LPC)
Computing Good Timeouts
• Three basic approaches to timeouts
• Safe and static (~5s)
• Rely on history of observed RTTs (TCP style; see the sketch below)
• Rely on a model of RTT based on location
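For the TCP-style option, here is a minimal sketch of the standard RTT estimator (EWMA of the smoothed RTT and its variation, as in RFC 6298); the constants are the usual TCP values, and keeping one estimator per neighbor is an assumption of the sketch, not necessarily Bamboo/OpenDHT's exact scheme.

# TCP-style timeout estimation: smoothed RTT plus a multiple of its variation.
class RttEstimator:
    ALPHA, BETA = 1 / 8, 1 / 4       # standard TCP gains
    K = 4                            # variance multiplier

    def __init__(self):
        self.srtt = None             # smoothed RTT (seconds)
        self.rttvar = None           # RTT variation

    def observe(self, sample: float) -> None:
        if self.srtt is None:
            self.srtt, self.rttvar = sample, sample / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample

    def timeout(self) -> float:
        if self.srtt is None:
            return 5.0               # fall back to the safe static value
        return self.srtt + self.K * self.rttvar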
Computing Good Timeouts
• Chord errs on the side of caution
• Very stable, but gives long lookup latencies
(LPC)
Timeout results
Recovering From Failures
• Can’t route around failures forever
• Will eventually run out of neighbors
• Must also find new nodes as they join
• Especially important if they’re our immediate predecessors or successors:
[Figure: when a new node joins next to us on the key space, part of our old responsibility becomes its new responsibility.]
Recovering From Failures
• Obvious algorithm: reactive recovery
• When a node stops sending acknowledgements, notify other neighbors of potential replacements
• Similar techniques for arrival of new nodes
[Figure: nodes A, B, C, and D on the key space; after B appears to fail, its neighbors send “B failed, use D” and “B failed, use A” messages to patch around it.]
The Problem with Reactive Recovery
• What if B is alive, but network is congested?
• C still perceives a failure due to dropped ACKs
• C starts recovery, further congesting network
• More ACKs likely to be dropped
• Creates a positive feedback cycle (=BAD)
The Problem with Reactive Recovery
• What if B is alive, but network is congested?
• This was the problem with Pastry
• Combined with poor congestion control, causes network to partition under heavy churn
Periodic Recovery
• Every period, each node sends its neighbor list to each of its neighbors (see the sketch below)
• How does this break the feedback loop?
• Volume of recovery messages is independent of failures
• Do we need to send the entire list?
• No, can send a delta from the last message
• What if we contact only a random neighbor (instead of all neighbors)?
• Still converges in log(k) rounds (k = number of neighbors)
[Figure: node C on the key space announces “my neighbors are A, B, D, and E” to each of its neighbors.]
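A sketch of one periodic-recovery round under simplifying assumptions (not Bamboo's actual protocol): each node pushes its neighbor list to its neighbors, and each recipient merges what it hears, keeping the IDs closest to it on the ring. The number of messages per period is fixed regardless of how many failures occurred, which is what breaks the feedback loop; LEAF_SET_SIZE and the flat "closest IDs" rule stand in for separate successor and predecessor lists.

# One round of periodic recovery over an in-memory map of neighbor lists.
KEY_SPACE = 2 ** 160
LEAF_SET_SIZE = 4                    # assumed: how many nearby IDs to keep

def ring_distance(a: int, b: int) -> int:
    d = abs(a - b)
    return min(d, KEY_SPACE - d)

def merge(my_id: int, known: set, heard: list) -> list:
    # merge advertised neighbors with what we already know, keep the closest
    known = (known | set(heard)) - {my_id}
    return sorted(known, key=lambda n: ring_distance(n, my_id))[:LEAF_SET_SIZE]

def recovery_round(neighbor_lists: dict) -> dict:
    # neighbor_lists: node id -> current list of neighbor ids
    updated = {nid: set(nbrs) for nid, nbrs in neighbor_lists.items()}
    for sender, nbrs in neighbor_lists.items():
        for receiver in nbrs:                    # "my neighbors are ..."
            if receiver in updated:
                updated[receiver] = set(merge(receiver,
                                              updated[receiver],
                                              list(nbrs) + [sender]))
    return {nid: sorted(nbrs) for nid, nbrs in updated.items()}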
(LPC)
Recovery results
More key-value stores
• Two settings in which you can use DHTs
• DDS in a cluster
• Bamboo on the open Internet
• How is “the cloud” (e.g., EC2)
different/similar?
• Cloud is a combination of fast/slow networks
• Cloud is under a single administrative domain
• Cloud machines should fail less frequently