Transcript VENTI.pps

Venti : A new approach to archival storage
Sean Quinlan & Sean Dorward
Bell Labs, Lucent Technologies
FAST 2002
Presented by: Fabián E. Bustamante
CS 443 Advanced OS
Fabián E. Bustamante, Spring 2005
Archival Storage
Abundant storage - storing data for a long time, even forever
Archival system – write-once policy – changes how we see storage
Prevalent form of archival storage - tape backup
– Central service for a number of clients
– Restoring data tedious and error prone (requires sys admin or privileged software)
– Infrequent, so problems may go undetected for a long time
– Tradeoff between performance of backup & restore (incremental backups are more efficient to generate, but not self-contained)
Snapshots (Plan 9, Andrew FS, …)
– A consistent read-only view of FS at some point in past
– Maintains file system permissions
– Can be accessed with standard tools – ls, cat, cp, grep, diff
– Avoids tradeoff between full vs. incremental backup
  • Looks like full backup
  • Implementation resembles incremental backup – share blocks
  • Only a small number of versions are kept
Venti
GOAL: “To provide a write-once archival repository
that can be shared by multiple client machines and
applications”
Block level network storage system
– Actually a backend storage server for client apps
– Block level interface places few restrictions on data structures/formats
Blocks addressed by hash of their contents
– Uses SHA-1 algorithm
– SHA-1 output is a 160-bit (20-byte) fingerprint of the data block
Write once policy
– Once written cannot be deleted
Multiple writes of same data coalesced
– Data sharing – increases effective storage capacity
– Makes write operation idempotent
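To make the block interface concrete, here is a minimal Go sketch of a content-addressed, write-once store; the Store type and its methods are invented for illustration and are not the actual Venti protocol:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// Fingerprint is the SHA-1 hash of a block's contents (160 bits = 20 bytes).
type Fingerprint [sha1.Size]byte

// Store is a toy in-memory stand-in for a Venti server.
type Store struct {
	blocks map[Fingerprint][]byte
}

func NewStore() *Store {
	return &Store{blocks: make(map[Fingerprint][]byte)}
}

// Write stores a block under the hash of its contents.
// Writing the same data twice is coalesced, so the operation is idempotent.
func (s *Store) Write(data []byte) Fingerprint {
	f := Fingerprint(sha1.Sum(data))
	if _, ok := s.blocks[f]; !ok { // duplicate writes share one copy
		s.blocks[f] = append([]byte(nil), data...)
	}
	return f
}

// Read retrieves a block and verifies it by recomputing the hash.
func (s *Store) Read(f Fingerprint) ([]byte, error) {
	data, ok := s.blocks[f]
	if !ok {
		return nil, fmt.Errorf("unknown fingerprint %x", f)
	}
	if Fingerprint(sha1.Sum(data)) != f {
		return nil, fmt.Errorf("corrupt block %x", f)
	}
	return data, nil
}

func main() {
	s := NewStore()
	f1 := s.Write([]byte("hello"))
	f2 := s.Write([]byte("hello")) // coalesced with the first write
	fmt.Println(f1 == f2)          // true: same contents, same address
}
```

The hash check in Read is the same inherent integrity check on retrieval that the next slide mentions.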
Venti
Multiple clients can share a Venti server
– Hash fn gives a universal namespace
Inherent integrity checking of data
– Fingerprint computation on retrieval
Caching is simplified
Uses magnetic disks as storage technology
– Access time comparable to non-archival data
Hash function – SHA-1 is more than enough for now
Exabyte of storage (10^18 bytes) in 8 KB blocks – n ≈ 10^14 blocks
With SHA-1 (b = 160 bits), the probability that some pair of the n blocks collides:

    p \le \frac{n(n-1)}{2} \cdot \frac{1}{2^b} < 10^{-20}

(the 1/2^b factor is the probability that a given pair will collide)
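Plugging the slide's numbers into that bound, as a quick sanity check (using 2^160 ≈ 1.46 × 10^48):

```latex
p \le \frac{n(n-1)}{2}\cdot 2^{-b}
  \approx \frac{(10^{14})^{2}}{2}\cdot 2^{-160}
  \approx \frac{5\times 10^{27}}{1.46\times 10^{48}}
  \approx 3.4\times 10^{-21} < 10^{-20}
```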
Applications - general
Venti as a building block to construct storage apps
Data divided into blocks
App needs fingerprint for retrieval
One approach – fingerprints packed into pointer blocks, also written to server
Repeated recursively to get a single fingerprint – the root of the tree (sketched in code below)
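A minimal sketch of that recursion, with illustrative block and fan-out sizes (writeBlock stands in for a Venti write; this is not any app's real on-disk layout):

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"fmt"
)

const (
	blockSize = 8192                  // 8 KB data blocks, as on the slides
	fanout    = blockSize / sha1.Size // fingerprints that fit in one pointer block
)

// store is a toy in-memory stand-in for a Venti server.
var store = map[[sha1.Size]byte][]byte{}

// writeBlock stores a block under the SHA-1 hash of its contents.
func writeBlock(data []byte) [sha1.Size]byte {
	f := sha1.Sum(data)
	store[f] = data
	return f
}

// writeTree splits data into blocks, writes them, then packs the resulting
// fingerprints into pointer blocks, recursing until a single root remains.
func writeTree(data []byte) [sha1.Size]byte {
	var fps [][sha1.Size]byte
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		fps = append(fps, writeBlock(data[off:end]))
	}
	if len(fps) == 0 {
		fps = append(fps, writeBlock(nil)) // empty input: one empty block
	}
	for len(fps) > 1 { // each pass builds one level of pointer blocks
		var next [][sha1.Size]byte
		for i := 0; i < len(fps); i += fanout {
			end := i + fanout
			if end > len(fps) {
				end = len(fps)
			}
			var buf bytes.Buffer
			for _, f := range fps[i:end] {
				buf.Write(f[:])
			}
			next = append(next, writeBlock(buf.Bytes()))
		}
		fps = next
	}
	return fps[0] // root fingerprint names the whole tree
}

func main() {
	root := writeTree(bytes.Repeat([]byte("venti "), 100000))
	fmt.Printf("root: %x\n", root)
}
```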
Applications - general
Tree cannot be modified, but unchanged sections of the tree can be reused
More complex data structures – mixing data & fingerprints in a block
e.g. structure for storing a file system – 3 types of blocks
• Directory – file metadata + root fingerprint
• Pointer
• Data
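One way such a structure could be rendered in Go; the DirEntry fields are a guess at "file metadata", not the paper's actual format:

```go
package main

import "fmt"

// Data block: raw file contents, addressed by its SHA-1 fingerprint.
type DataBlock []byte

// Pointer block: packed fingerprints of data blocks or lower pointer blocks.
type PointerBlock struct {
	Children [][20]byte
}

// Directory block: per-file metadata plus the root fingerprint of each
// file's block tree, so one fingerprint can name an entire file system.
type DirectoryBlock struct {
	Entries []DirEntry
}

type DirEntry struct {
	Name string   // file metadata (name, mode, times, ... abbreviated here)
	Mode uint32
	Root [20]byte // fingerprint of the root of the file's block tree
}

func main() {
	d := DirectoryBlock{Entries: []DirEntry{{Name: "README", Mode: 0644}}}
	fmt.Println(len(d.Entries), "entry")
}
```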
Example app: Vac
Similar to zip & tar – store collection of files &
directories as single object
Tree of blocks formed for selected files
vac archive file - 45 bytes long
– 20 bytes for root fingerprint
– 25 bytes fixed header string
– any amount of data "compressed" down to 45 bytes (the blocks themselves live on the server)
Unvac – to restore file from archive
Duplicate copies of file coalesced on server
– Multiple users vac same data – only 1 copy stored
– vac of changed contents stores only the new blocks
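A sketch of writing the 45-byte archive file, assuming (per the slide) a 25-byte fixed header string followed by the 20-byte root fingerprint; the header text here is invented and is not vac's real format:

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"os"
)

// Hypothetical fixed header, exactly 25 bytes; the real vac header differs.
const header = "venti archive v1 sha1...\n"

// writeArchive emits the 45-byte archive: 25-byte header + 20-byte fingerprint.
func writeArchive(path string, root [sha1.Size]byte) error {
	out := append([]byte(header), root[:]...)
	return os.WriteFile(path, out, 0644)
}

func main() {
	root := sha1.Sum([]byte("stand-in for a real root fingerprint"))
	if err := writeArchive("backup.vac", root); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```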
Example app: Physical level backup
Copy the raw disk blocks to Venti
– No need to walk file hierarchy
– Higher throughput
Duplicate blocks are coalesced
Free block – null in pointer tree
User sees full backup of device
Storage space advantages of incremental backup retained
Random access possible
– Directly mounting a backup file system image (w/ OS
support)
– Enables lazy (on demand) restore
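A rough sketch of the copy loop; isFree is a hypothetical predicate (e.g. consulting the file system's allocation bitmap), since the slide only says free blocks become null pointers:

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"io"
	"os"
)

const blockSize = 8192

var zero [sha1.Size]byte // null fingerprint marks a free (unstored) block

var store = map[[sha1.Size]byte][]byte{} // toy stand-in for a Venti server

// backupDevice reads raw blocks sequentially (no file-tree walk, hence the
// high throughput) and writes each used block; duplicates coalesce for free.
func backupDevice(dev io.Reader, isFree func(blockNo int) bool) ([][sha1.Size]byte, error) {
	var ptrs [][sha1.Size]byte
	buf := make([]byte, blockSize)
	for n := 0; ; n++ {
		rn, err := io.ReadFull(dev, buf)
		if err == io.EOF {
			return ptrs, nil // clean end of device
		}
		if err != nil && err != io.ErrUnexpectedEOF {
			return nil, err
		}
		if isFree(n) {
			ptrs = append(ptrs, zero) // free block: null pointer, nothing stored
		} else {
			f := sha1.Sum(buf[:rn])
			store[f] = append([]byte(nil), buf[:rn]...) // idempotent write
			ptrs = append(ptrs, f)
		}
		if err == io.ErrUnexpectedEOF {
			return ptrs, nil // short final block
		}
	}
}

func main() {
	dev, err := os.Open("disk.img") // example image; a raw device also works
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer dev.Close()
	ptrs, err := backupDevice(dev, func(int) bool { return false })
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("backed up %d blocks\n", len(ptrs))
}
```

Because blocks are fetched by fingerprint with random access, restore can be lazy: mount the image and read blocks only as they are referenced.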
Example app - Plan 9 FS
Previously
– Magnetic disk + write-once optical jukebox
– Cache for faster access & accumulated changes between snapshots
– Cache smaller than active file system
– Cache misses are painful
Plan 9 FS on top of Venti
– Venti as storage device (instead of jukebox)
– Equalizes access to both active & archival storage
– Cache can also be smaller
Implementation
Append-only log of data blocks
– Stored on a RAID array
Separate index maps fingerprints to log location
– Can be regenerated from the data log
[Figure: FS and other clients reach the Venti server over the network; inside the server, a block cache and an index cache front the index and the data log]
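A compact sketch of the two structures, with invented names; rebuildIndex shows why the index can be regenerated – it is a pure cache of information already in the log:

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// Server holds an append-only data log plus an index over it.
// Log entry layout (illustrative): 4-byte little-endian length, then the data.
type Server struct {
	log   []byte                    // append-only data log
	index map[[sha1.Size]byte]int64 // fingerprint -> offset in log
}

func NewServer() *Server {
	return &Server{index: map[[sha1.Size]byte]int64{}}
}

func (s *Server) Write(data []byte) [sha1.Size]byte {
	f := sha1.Sum(data)
	if _, ok := s.index[f]; ok {
		return f // duplicate: already in the log
	}
	off := int64(len(s.log))
	var hdr [4]byte
	binary.LittleEndian.PutUint32(hdr[:], uint32(len(data)))
	s.log = append(s.log, hdr[:]...)
	s.log = append(s.log, data...)
	s.index[f] = off
	return f
}

func (s *Server) Read(f [sha1.Size]byte) ([]byte, bool) {
	off, ok := s.index[f]
	if !ok {
		return nil, false
	}
	n := binary.LittleEndian.Uint32(s.log[off : off+4])
	return s.log[off+4 : off+4+int64(n)], true
}

// rebuildIndex scans the log from the start, recomputing each block's
// fingerprint; nothing is lost if the index is destroyed.
func (s *Server) rebuildIndex() {
	s.index = map[[sha1.Size]byte]int64{}
	for off := int64(0); off < int64(len(s.log)); {
		n := int64(binary.LittleEndian.Uint32(s.log[off : off+4]))
		s.index[sha1.Sum(s.log[off+4:off+4+n])] = off
		off += 4 + n
	}
}

func main() {
	s := NewServer()
	f := s.Write([]byte("archival data"))
	s.rebuildIndex() // index fully recoverable from the log
	data, _ := s.Read(f)
	fmt.Println(string(data))
}
```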
Venti’s data log
Log divided into arenas for maintenance
Each arena – an append-only section of data blocks (variable size)
– Per-block header – for integrity checking & data recovery
– Data compressed (LZ77)
  • encoding – which compression algorithm, if any
  • esize – size after compression
– Per-arena directory of blocks – for fast index rebuild & error recovery
Data log and directory grow in opposite directions within an arena
When an arena is filled, it is marked as sealed and its fingerprint is computed
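As an illustration, a plausible Go rendering of a per-block log entry carrying the two fields named on the slide; the remaining fields and the layout are assumptions, not the actual arena format:

```go
package main

import "fmt"

// Compression algorithms for the encoding field (values are illustrative).
const (
	EncNone uint8 = iota // block stored uncompressed
	EncLZ77              // block compressed with LZ77
)

// logEntry sketches the per-block header written to the data log; the header
// lets tools verify and recover blocks by scanning the log directly.
type logEntry struct {
	fingerprint [20]byte // SHA-1 of the *uncompressed* contents
	blockType   uint8    // e.g. data vs. pointer block
	size        uint16   // size before compression
	encoding    uint8    // which compression algorithm, if any
	esize       uint16   // size after compression (what the log actually holds)
	data        []byte   // esize bytes of (possibly compressed) contents
}

func main() {
	e := logEntry{size: 8192, encoding: EncLZ77, esize: 3100}
	fmt.Printf("8 KB block stored as %d bytes (encoding=%d)\n", e.esize, e.encoding)
}
```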
Venti’s index
No. of possible fingerprints >> no. of blocks on a server
Index as disk-resident hash table
Hashing function maps fingerprints to index buckets
Going through index is main performance penalty
– Caching (block and index)
– Index striped across multiple disks – fingerprint location in index is random
& index is too big to fit in memory
– Write buffering (to increase number of concurrent accesses)
Block cache hit – index lookup & data log bypassed
Index cache hit – index lookup bypassed
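The read path those caches create, sketched with invented helper names (diskIndexLookup stands in for the striped, disk-resident hash table):

```go
package main

import (
	"crypto/sha1"
	"errors"
	"fmt"
)

type fp = [sha1.Size]byte

var (
	blockCache = map[fp][]byte{} // fingerprint -> block contents
	indexCache = map[fp]int64{}  // fingerprint -> offset in data log
)

// diskIndexLookup stands in for hashing the fingerprint to a bucket of the
// disk-resident hash table; this is the main performance penalty.
func diskIndexLookup(f fp) (int64, error) { return 0, errors.New("not found") }

// readLog stands in for fetching (and decompressing) a block from the data log.
func readLog(off int64) []byte { return nil }

func read(f fp) ([]byte, error) {
	if b, ok := blockCache[f]; ok {
		return b, nil // block cache hit: index lookup and data log bypassed
	}
	off, ok := indexCache[f]
	if !ok {
		var err error
		if off, err = diskIndexLookup(f); err != nil { // slow path
			return nil, err
		}
		indexCache[f] = off
	}
	b := readLog(off) // index cache hit still reads the log, but skips the index
	blockCache[f] = b
	return b, nil
}

func main() {
	_, err := read(sha1.Sum([]byte("example")))
	fmt.Println(err) // "not found": the toy disk index is empty
}
```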
Performance: computing environments
Testbed
– Dedicated dual 550MHz Pentium III, 2GB mem.
– Access through a 100Mbps Ethernet
– Data log stored on a 500GB MaxTronic IDE RAID 5 array
– Index stored on 8 Seagate Cheetah 18XL 9GB SCSI drives
Micro-benchmark results (throughput in MB/s):

                Sequential   Random   Virgin   Duplicate
                reads        reads    writes   writes
Uncached           0.9         0.4      3.7       5.6
Index cache        4.2         0.7       -        6.2
Block cache        6.8          -        -        6.5
Raw RAID          14.8         1.0     12.4      12.4
Storage usage – file servers used
2 Plan 9 file servers, bootes and emelie
– Spanning 1990 to 2001
522 user accounts, 50-100 active at a time
Numerous development projects hosted
Several large data sets
– Astronomical data, satellite imagery, multimedia files
Next figures show
– Size of active file system
– Space consumed on jukebox
– Space when using Venti
– Ratio – using Venti reduces the cost of retaining snapshots (so ratio is smaller)
Storage usage
[Figure: growth over time of active file system size, jukebox usage, and Venti usage for bootes and emelie]
Storage saving
When stored on Venti, size of jukebox data
reduced by 3 factors
– Elimination of duplicate blocks
– Elimination of block fragmentation
– Compression of block contents
% reduction of storage          Bootes   Emelie
Elimination of duplicates        27.8%    31.3%
Elimination of fragments         10.2%    25.4%
Data compression                 33.8%    54.1%
Total reduction                  59.7%    76.5%
Reliability & recovery
Tools for integrity checking & error recovery
– Verifying structure of arena
– Checking there is an index entry for every block in the data log, and vice versa
– Rebuilding index from data log
– Copying arena to removable media
Data log on RAID 5 disk array
– Protection against single drive failures
Off-site mirrors for extra protection
If that’s not enough … sealed arenas can be copied to removable media
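A sketch of the index/log cross-check, with toy in-memory types standing in for the real on-disk structures:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

type fp = [sha1.Size]byte

// checkConsistency verifies both directions named on the slide: every log
// block has an index entry pointing at it, and every index entry points at
// a log block whose contents hash to that fingerprint.
func checkConsistency(logBlocks map[int64][]byte, index map[fp]int64) []error {
	var errs []error
	for off, data := range logBlocks {
		if got, ok := index[sha1.Sum(data)]; !ok || got != off {
			errs = append(errs, fmt.Errorf("log block at %d missing from index", off))
		}
	}
	for f, off := range index {
		data, ok := logBlocks[off]
		if !ok || sha1.Sum(data) != f {
			errs = append(errs, fmt.Errorf("index entry %x does not match log", f))
		}
	}
	return errs
}

func main() {
	blocks := map[int64][]byte{0: []byte("a block")}
	index := map[fp]int64{sha1.Sum([]byte("a block")): 0}
	fmt.Println(len(checkConsistency(blocks, index))) // 0: consistent
}
```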
Future Work
Load balancing
– Distribute Venti across multiple machines
– Replicate server
– Use of a proxy server to hide the distribution from clients
Better access control
– Currently just authentication to server
– Single root fingerprint gives access to entire file
tree
Use of variable sized blocks as in LBFS
Conclusion
Addressing blocks by SHA-1 hash of their contents
– a good fit for archival storage
Magnetic disks as storage technology
– Large capacity at low price – online storage
– Random access
– Performance comparable to non-archival data
Write once model
– Reduces accidental or malicious data loss
– Simplifies administration
– Simplifies caching
– Allows sharing of data