PNUTS: Yahoo!`s Hosted Data Serving Platform
Download
Report
Transcript PNUTS: Yahoo!`s Hosted Data Serving Platform
PNUTS: Yahoo!’s Hosted Data
Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava,
Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick
Puz, Daniel Weaver and Ramana Yerneni
Yahoo! Research
How do I build a cool new web
app?
Option 1: Code it up! Make it live!
Scale it later
It gets posted to slashdot
Scale it now!
Flickr, Twitter, MySpace, Facebook, …
2
How do I build a cool new web
app?
Option 2: Make it industrial strength!
Evaluate scalable database backends
Evaluate scalable indexing systems
Evaluate scalable caching systems
Architect data partitioning schemes
Architect data replication schemes
Architect monitoring and reporting infrastructure
Write application
Go live
Realize it doesn’t scale as well as you hoped
Rearchitect around bottlenecks
1 year later – ready to go!
3
Example: social network
updates
Brian
Sonja
Jimi
Brandon
Kurt
What are my friends up to?
Sonja:
Brandon:
4
Example: social network
updates
<photo>
<title>Flower</title>
<url>www.flickr.com</url>
</photo>
6
8
12
15
16
17
Jimi
Mary
Sonja
Brandon
Mike
Bob
<ph..
<re..
<ph..
<po..
<ph..
<re..
5
What do we need from our DBMS?
Web applications need:
Scalability
And the ability to scale linearly
Geographic scope
High availability
Web applications typically have:
Simplified query needs
No joins, aggregations
Relaxed consistency needs
Applications can tolerate stale or reordered data
6
What is PNUTS?
7
What is PNUTS?
A
B
42342
42521
E
W
C
66354
W
D
E
12352
75656
E
C
F
15677
E
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
Indexes and views
Parallel database
CREATE TABLE Parts (
ID VARCHAR,
StockNumber INT,
Status VARCHAR
A 42342
E
…
B 42521
W
)
C 66354
W
D
E
F
12352
75656
15677
Geographic replication
Structured, flexible schema
E
C
E
Hosted, managed infrastructure
8
Query model
Per-record operations
Multi-record operations
Get
Set
Delete
Multiget
Scan
Getrange
Web service (RESTful) API
9
Detailed architecture
Clients
Data-path components
REST API
Routers
Tablet
controller
Message
Broker
Storage units
10
Detailed architecture
Local region
Remote regions
Clients
REST API
Routers
YMB
Tablet controller
Storage
units
11
Tablet splitting and balancing
Each storage unit has many tablets (horizontal partitions of the table)
Storage unit may become a hotspot
Storage unit
Tablet
Overfull tablets split
Tablets may grow over time
Shed load by moving tablets to other servers
12
Query processing
13
Range queries
Apple
Avocado
Grapefruit…Pear?
Banana
Blueberry
Canteloupe
Grape
Kiwi
Lemon
MIN-Canteloupe
SU1
Canteloupe-Lime
SU3
Lime-Strawberry
SU2
Strawberry-MAX
SU1
Router
Grapefruit…Lime?
Lime…Pear?
Lime
Mango
Orange
Strawberry
Tomato
Watermelon
Storage unit 1
Storage unit 2
Storage unit 3
16
Updates
1
8 Sequence # for key k
Write key k
Routers
Message brokers
3
Write key k
2
7 Sequence # for key k
4
Write key k
5
SUCCESS
SU
SU
SU
6
Write key k
17
Asynchronous
replication and
consistency
18
Asynchronous replication
19
Consistency model
Goal: make it easier for applications to reason about updates
and cope with asynchrony
What happens to a record with primary key “Brian”?
Record Update
inserted
v. 1
v. 2
Update
v. 3
Update Update
v. 4
Update
Delete
Update Update
v. 5
v. 6
Generation 1
v. 7
v. 8
Time
20
Consistency model
Read
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
21
Consistency model
Read up-to-date
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
22
Consistency model
Read ≥ v.6
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
23
Consistency model
Write
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
24
Consistency model
Write if = v.7
ERROR
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
25
Consistency model
Write if = v.7
ERROR
Stale version
Stale version
Current
version
Mechanism: per record mastership
v. 1
v. 2
v. 3
v. 4
v. 5
v. 6
Generation 1
v. 7
v. 8
Time
26
Experiments
27
Experimental setup
Production PNUTS code
Three PNUTS regions
Enhanced with ordered table type
2 west coast, 1 east coast
5 storage units, 2 message brokers, 1 router
West: Dual 2.8 GHz Xeon, 4GB RAM, 6 disk RAID 5 array
East: Quad 2.13 GHz Xeon, 4GB RAM, 1 SATA disk
Workload
1200-3600 requests/second
0-50% writes
80% locality
28
Scalability
160
Average latency (ms)
140
120
100
80
60
40
20
0
1
2
3
4
5
6
Storage units
Hash table
Ordered table
29
Request skew
100
90
Average latency (ms)
80
70
60
50
40
30
20
10
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Zipf parameter
Hash table
Ordered table
30
Size of range scans
8000
Average latency (ms)
7000
6000
5000
4000
3000
2000
1000
0
0
0.02
0.04
0.06
0.08
0.1
0.12
Fraction of table scanned
30 clients
300 clients
31
Related work
Distributed and parallel databases
Distributed filesystems
Ceph, Boxwood, Sinfonia
Distributed (P2P) hash tables
Especially query processing and transactions
BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services,
Cassandra
Chord, Pastry, …
Database replication
Master-slave, epidemic/gossip, synchronous…
32
Conclusions and ongoing work
PNUTS is an interesting research product
Research: consistency, performance, fault
tolerance, rich functionality
Product: make it work, keep it (relatively) simple,
learn from experience and real applications
Ongoing work
Indexes and materialized views
Bundled updates
Batch query processing
33
Thanks!
[email protected]
research.yahoo.com
34