Transcript: Pond

Pond
The OceanStore Prototype
Introduction
Problem: Rising cost of storage management
Observations:
Universal connectivity via Internet
A terabyte of storage for about $100 within three years
Solution: OceanStore
OceanStore
Internet-scale
Cooperative file system
High durability
Universal availability
Two-tier storage system
Upper tier: powerful servers
Lower tier: less powerful hosts
More on OceanStore
Unit of storage: data object
Applications: email, UNIX file system
Requirements for the object interface
Information universally accessible
Balance between privacy and sharing
Simple and usable consistency model
Data integrity
OceanStore Assumptions
Infrastructure untrusted except in aggregate
Most nodes are not faulty or malicious
Infrastructure constantly changing
Resources enter and exit the network
without prior warning
Self-organizing, self-repairing, self-tuning
OceanStore Challenges
Expressive storage interface
High durability on untrusted and
changing base
Data Model
The view of the system that is
presented to client applications
Storage Organization
OceanStore data object ~= file
Ordered sequence of read-only versions
Every version of every object kept forever
Can be used as backup
An object contains metadata, data, and
references to previous versions
Storage Organization
A stream of versions identified by an AGUID
Active globally-unique identifier
Cryptographically-secure hash of an
application-specific name and the owner’s
public key
Prevents namespace collisions
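A minimal sketch of the idea in Java; SHA-1 and the simple concatenation are illustrative assumptions, not Pond's exact encoding:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.PublicKey;

    class Aguid {
        // AGUID = secure hash over the application-specific name and the owner's public key.
        static byte[] compute(String appName, PublicKey ownerKey) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(appName.getBytes(StandardCharsets.UTF_8));
            md.update(ownerKey.getEncoded());
            return md.digest();  // colliding AGUIDs would require breaking the hash, so namespaces cannot clash
        }
    }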
Storage Organization
Each version of a data object is stored in a
B-tree-like data structure
Each block has a BGUID
• Cryptographically-secure hash of the block
content
Each version has a VGUID
Two versions may share blocks
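A rough sketch of how a version might reference its blocks; the class and field names are illustrative, not Pond's:

    import java.util.List;

    // A version references read-only blocks by BGUID and points back to the
    // previous version, so blocks that did not change are shared, not copied.
    class ObjectVersion {
        byte[] vguid;            // name of this version (hash of its root block)
        List<byte[]> bguids;     // data blocks, each named by the hash of its contents
        ObjectVersion previous;  // the prior read-only version in the stream
    }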
Storage Organization
Application-Specific Consistency
An update is the operation of adding a
new version to the head of a version
stream
Updates are applied atomically
Represented as an array of potential
actions
Each guarded by a predicate
Application-Specific Consistency
Example actions
Replacing some bytes
Appending new data to an object
Truncating an object
Example predicates
Check for the latest version number
Compare bytes
Application-Specific Consistency
To implement ACID semantics
Check for readers
If none, update
Append to a mailbox
No checking
No explicit locks or leases
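A hedged sketch of predicate-guarded actions covering the two examples above; the types and helper names are hypothetical, not Pond's API:

    import java.util.function.LongPredicate;

    // Illustrative only: an update carries predicate-guarded actions; each action
    // is applied atomically only if its guard holds against the current object state.
    class GuardedAction {
        LongPredicate guard;   // here the guard only inspects the latest version number
        Runnable action;       // stand-in for "replace bytes", "append", "truncate", ...

        GuardedAction(LongPredicate guard, Runnable action) {
            this.guard = guard;
            this.action = action;
        }

        // ACID-style update: apply only if no newer version appeared since we read.
        static GuardedAction compareVersionThenWrite(long versionWeRead, Runnable write) {
            return new GuardedAction(v -> v == versionWeRead, write);
        }

        // Mailbox append: the guard is always true, so concurrent appends never abort.
        static GuardedAction appendAlways(Runnable append) {
            return new GuardedAction(v -> true, append);
        }
    }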
Application-Specific Consistency
Predicate for reads
Examples
• Can’t read something older than 30 seconds
• Can only read data from a specific time frame
System Architecture
Unit of synchronization: data object
Changes to different objects are
independent
Virtualization through Tapestry
Resources are virtual and not tied to
particular hardware
A virtual resource has a GUID, globally
unique identifier
Use Tapestry, a decentralized object
location and routing system
Scalable overlay network, built on TCP/IP
Virtualization through Tapestry
Use GUIDs to address hosts and
resources
Hosts publish the GUIDs of their
resources in Tapestry
Hosts also can unpublish GUIDs and
leave the network
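Roughly, the routing layer is used through an interface like the hypothetical one below; these are not Tapestry's actual method names:

    // Hypothetical interface capturing how OceanStore uses its location-and-routing layer.
    interface ObjectLocationAndRouting {
        void publish(byte[] guid);                        // advertise a locally hosted resource
        void unpublish(byte[] guid);                      // withdraw it before leaving the network
        void routeToObject(byte[] guid, byte[] message);  // deliver a message toward some replica of the GUID
    }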
Replication and Consistency
A data object is a sequence of read-only
versions, consisting of read-only blocks,
named by BGUIDs
No consistency issues when replicating read-only data
The mapping from AGUID to the latest
VGUID may change
Use primary-copy replication
Replication and Consistency
The primary copy
Enforces access control
Serializes concurrent updates
Archival Storage
Replication: 2x storage to tolerate one
failure
Erasure code is much better
A block is divided into m fragments
m fragments encoded into n > m
fragments
Any m fragments can restore the original block
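For example, with m = 16 and n = 32, a block becomes 32 fragments of which any 16 suffice to rebuild it: raw storage grows by n/m = 2x, but the block now survives the loss of any 16 fragments instead of depending on a single copy.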
Caching of Data Objects
Reconstructing a block from its erasure-coded
fragments is an expensive process
Need to locate m fragments from m
machines
Use whole-block caching for frequently read objects
Caching of Data Objects
To read a block, look for the block first
If not available
Find block fragments
Decode fragments
Publish that the host now caches the block
Amortize the cost of erasure
encoding/decoding
Caching of Data Objects
Updates are pushed to secondary
replicas via application-level multicast
tree
The Full Update Path
Serialized updates are disseminated via
the multicast tree for an object
At the same time, updates are encoded
and fragmented for long-term storage
The Primary Replica
Primary servers run Byzantine
agreement protocol
Need more than 2/3 nonfaulty participants
The number of messages required grows
quadratically with the number of participants
Public-Key Cryptography
Too expensive
Use symmetric-key message
authentication codes (MACs)
Two to three orders of magnitude faster
Downside: can’t prove the authenticity of
a message to a third party
Used only for the inner ring
Public-key cryptography for outer ring
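For context, both primitives are available through the standard Java crypto APIs; the algorithm choices below (HMAC-SHA1, SHA1withRSA) are illustrative:

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.security.PrivateKey;
    import java.security.Signature;

    class AuthCompare {
        // Symmetric MAC: cheap, but only the key holders can verify it.
        static byte[] mac(byte[] sharedKey, byte[] msg) throws Exception {
            Mac h = Mac.getInstance("HmacSHA1");
            h.init(new SecretKeySpec(sharedKey, "HmacSHA1"));
            return h.doFinal(msg);
        }

        // Public-key signature: far slower, but any host holding the public key
        // (e.g. the outer ring) can verify it, even as a third party.
        static byte[] sign(PrivateKey sk, byte[] msg) throws Exception {
            Signature s = Signature.getInstance("SHA1withRSA");
            s.initSign(sk);
            s.update(msg);
            return s.sign();
        }
    }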
Proactive Threshold Signatures
Byzantine agreement guarantees correctness
if no more than 1/3 of the servers fail
during the life of the system
Not practical for a long-lived system
Need to reboot servers at regular
intervals
Key holders are fixed
Proactive Threshold Signatures
Proactive threshold signatures
More flexibility in choosing the membership
of the inner ring
A single public key is paired with a number
of private key shares
Each server uses its key share to generate a
signature share
Proactive Threshold Signatures
Any k shares may be combined to
produce a full signature
To change membership of an inner ring
Regenerate signature shares
No need to change the public key
Transparent to secondary hosts
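A purely structural sketch of the k-of-n signing idea; the interfaces are hypothetical and the actual threshold-RSA mathematics is omitted:

    import java.util.List;

    // Hypothetical shapes only, to show the structure of a k-of-n threshold scheme.
    interface KeyShare       { SignatureShare sign(byte[] message); }
    interface SignatureShare { }
    interface ShareCombiner  {
        // Any k distinct shares yield a signature that verifies under the one,
        // unchanged public key, so secondary hosts never see membership changes.
        byte[] combine(List<SignatureShare> shares, int k);
    }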
The Responsible Party
Who chooses the inner ring?
Responsible party:
A server that publishes sets of failure-independent nodes
• Through offline measurement and analysis
Software Architecture
Java atop the Staged Event-Driven
Architecture (SEDA)
Each subsystem is implemented as a stage
With its own state and thread pool
Stages communicate through events
50,000 semicolons by five graduate
students and many undergrad interns
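A minimal stage in the SEDA style might look like the sketch below (not the real SEDA or Pond classes): each stage owns an event queue and a thread pool, and other stages interact with it only by enqueuing events.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Consumer;

    // Minimal stage sketch: private state, private thread pool, event-driven input.
    class Stage<E> {
        private final BlockingQueue<E> events = new LinkedBlockingQueue<>();
        private final ExecutorService pool;

        Stage(int threads, Consumer<E> handler) {
            pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        try { handler.accept(events.take()); }   // handle one event at a time
                        catch (InterruptedException e) { return; }
                    }
                });
            }
        }

        // Other stages communicate with this one purely by enqueuing events.
        void enqueue(E event) { events.offer(event); }
    }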
Language Choice
Java: speed of development
Strongly typed
Garbage collected
Reduced debugging time
Support for events
Easy to port multithreaded code in Java
• Ported to Windows 2000 in one week
Language Choice
Problems with Java:
Unpredictability introduced by garbage
collection
Every thread in the system is halted while
the garbage collector runs
Any on-going process stalls for ~100
milliseconds
May add several seconds to requests that
travel across machines
Experimental Setup
Two test beds
Local cluster of 42 machines at Berkeley
• Each with two 1.0 GHz Pentium III CPUs
• 1.5 GB PC133 SDRAM
• Two 36 GB hard drives in RAID 0
• Gigabit Ethernet adaptor
• Linux 2.4.18 SMP
Experimental Setup
PlanetLab, ~100 nodes across ~40 sites
• 1.2 GHz Pentium III, 1GB RAM
• ~1000 virtual nodes
Storage Overhead
For a 16-out-of-32 erasure code
2.7x for objects larger than 8 kB
For a 16-out-of-64 erasure code
4.8x for objects larger than 8 kB
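For comparison, the ideal expansion is n/m: 2.0x for 16-out-of-32 and 4.0x for 16-out-of-64; the measured 2.7x and 4.8x presumably include per-fragment naming and verification metadata, and the relative overhead is larger still for objects below 8 kB.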
The Latency Benchmark
A single client submits updates of
various sizes to a four-node inner ring
Metric: time from just before the request is
signed until the signature over the result
is verified
Update 40 MB of data over 1000
updates, with 100ms between updates
The Latency Benchmark
Latency Breakdown

Update latency (ms):

Key Size | Update Size | 5% Time | Median Time | 95% Time
512b     | 4 kB        | 39      | 40          | 41
512b     | 2 MB        | 1037    | 1086        | 1348
1024b    | 4 kB        | 98      | 99          | 100
1024b    | 2 MB        | 1098    | 1150        | 1448

Latency breakdown by phase:

Phase     | Time (ms)
Check     | 0.3
Serialize | 6.1
Apply     | 1.5
Archive   | 4.5
Sign      | 77.8
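Note that signing dominates small-update latency: the 77.8 ms Sign phase is most of the ~99 ms median for a 4 kB update with 1024-bit keys, which is why the 512-bit-key numbers are much lower.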
The Throughput Microbenchmark
A number of clients submit updates of
various sizes to disjoint objects, to a
four-node inner ring
The clients
Create their objects
Synchronize themselves
Update the object as many times as
possible for 100 seconds
Archive Retrieval Performance
Populate the archive by submitting
updates of various sizes to a four-node
inner ring
Delete all copies of the data in its
reconstructed form
A single client submits reads
Archive Retrieval Performance
Throughput:
1.19 MB/s (PlanetLab)
2.59 MB/s (local cluster)
Latency
~30-70 milliseconds
The Stream Benchmark
Ran 500 virtual nodes on PlanetLab
Inner Ring in SF Bay Area
Replicas clustered in the 7 largest PlanetLab sites
Streams updates to all replicas
One writer (the content creator) repeatedly
appends to the data object
Others read new versions as they arrive
Measure network resource consumption
The Tag Benchmark
Measures the latency of token passing
OceanStore 2.2 times slower than
TCP/IP
The Andrew Benchmark
File system benchmark
4.6x slower than NFS in read-intensive phases
7.3x slower in write-intensive phases