G22.3250-001
SGI’s XFS
or Cool Pet Tricks with B+ Trees
Robert Grimm
New York University
Altogether Now: The Three Questions
• What is the problem?
• What is new or different?
• What are the contributions and limitations?
Some Background
• Inode
  - On-disk data structure containing a file's metadata and pointers to its data
  - Defines FS-internal namespace
    - Inode numbers
  - Originated in Unix FFS
• Vnode
  - Kernel data structure for representing files
  - Provides standard API for accessing different FS's
  - Originated in SunOS
Motivation
• On one side: I/O bottleneck
  - It's becoming harder to utilize increasing I/O capacity and bandwidth
    - 112 disk drives at 9 GB each provide 1 TB of storage
    - High-end drives provide 500 MB/sec of sustained disk bandwidth
• On the other side: I/O-intensive applications
  - Editing of uncompressed video
    - 30 MB/sec per stream, 108 GB for one hour
  - Streaming compressed video on demand
    - 2.7 TB for 1,000 movies; 200 movies require 100 MB/sec
Scalability Problems of Existing File Systems
• Slow crash recovery
  - fsck needs to scan the entire disk
• No support for large file systems
  - 32-bit block pointers address only 4 billion blocks
    - At 8 KB per block, 32 TB (see the arithmetic below)
• No support for large, sparse files
  - 64-bit block pointers require more levels of indirection
    - Are also quite inefficient
  - Fixed-size extents are still too limiting
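Where the 32 TB figure comes from, as a quick check of the arithmetic (the 8 KB block size is the one assumed on the slide):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* 32-bit block pointers can name at most 2^32 distinct blocks. */
        uint64_t nblocks = 1ULL << 32;        /* about 4.3 billion blocks */
        uint64_t blksz   = 8ULL * 1024;       /* 8 KB per block           */
        uint64_t max_fs  = nblocks * blksz;   /* 2^45 bytes               */
        printf("32-bit pointers, 8 KB blocks: %llu TB\n",
               (unsigned long long)(max_fs >> 40));   /* prints 32 */
        return 0;
    }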
Scalability Problems of Existing File Systems (cont.)
• No support for large, contiguous files
  - Bitmap structures for tracking free and allocated blocks do not scale
  - Hard to find large regions of contiguous space
  - But we need contiguous allocation for good utilization of bandwidth
• No support for large directories
  - Linear layout does not scale
  - In-memory hashing imposes high memory overheads
• No support for large numbers of files
  - Inodes preallocated during file system creation
XFS Approach in a Nutshell
• Use 64-bit block addresses
  - Support for larger file systems
• Use B+ trees and extents
  - Support for larger numbers of files, larger files, larger directories
  - Better utilization of I/O bandwidth
• Log metadata updates
  - Faster crash recovery
XFS Architecture
• I/O manager
  - I/O requests
• Directory manager
  - File system name space
• Space manager
  - Free space, inode & file allocation
• Transaction manager
  - Atomic metadata updates
• Unified buffer cache
• Volume manager
  - Striping, concatenation, mirroring
Storage Scalability
• Allocation groups
  - Are regions with their own free space maps and inodes
  - Support AG-relative block and inode pointers
    - Reduce size of data structures
  - Improve parallelism of metadata management
    - Allow concurrent accesses to different allocation groups
  - Unlike FFS, are mostly motivated by scalability, not locality
• Free space
  - Two B+ trees describing extents (what's a B+ tree? see the sketch below)
    - One indexed by starting block (used when?)
    - One indexed by length of extent (used when?)
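As a rough illustration of why the free space is indexed twice, here is a minimal sketch; plain sorted arrays and a linear scan stand in for the B+ trees, and the names are illustrative rather than XFS's:

    #include <stddef.h>
    #include <stdint.h>

    /* A free extent: a contiguous run of free blocks. */
    struct free_extent { uint64_t start, len; };

    /* Index 1: extents sorted by starting block.  Used when freeing blocks:
     * find the neighbors of the freed range so it can be coalesced with them. */
    struct free_extent *find_by_start(struct free_extent *by_start, size_t n,
                                      uint64_t start) {
        size_t lo = 0, hi = n;
        while (lo < hi) {                  /* binary search, as a B+ tree would */
            size_t mid = (lo + hi) / 2;
            if (by_start[mid].start < start) lo = mid + 1; else hi = mid;
        }
        return lo < n ? &by_start[lo] : NULL;  /* first extent at or after start */
    }

    /* Index 2: extents sorted by length.  Used when allocating: find the
     * smallest free extent that still satisfies the request (best fit). */
    struct free_extent *find_by_len(struct free_extent *by_len, size_t n,
                                    uint64_t want) {
        for (size_t i = 0; i < n; i++)
            if (by_len[i].len >= want) return &by_len[i];
        return NULL;                       /* no single extent is large enough */
    }

Freeing blocks consults the by-start view to find and merge adjacent free extents; satisfying an allocation of a given size consults the by-length view, which is why both views over the same extents are kept.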
Storage Scalability (cont.)
• Large files
  - File storage tracked by extent map (see the sketch below)
    - Each entry: block offset in file, length in blocks, starting block on disk
  - Small extent map organized as list in inode
  - Large extent map organized as B+ tree rooted in inode
    - Indexed by block offset in file
• Large number of files
  - Inodes allocated dynamically
    - In chunks of 64
  - Inode locations tracked by B+ tree
    - Only points to inode chunks
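A minimal sketch of an extent-map entry and the offset lookup it supports; a flat array stands in for the inode-rooted B+ tree, and the field names are assumptions, not XFS's on-disk format:

    #include <stddef.h>
    #include <stdint.h>

    /* One extent-map entry: a contiguous run of file blocks mapped to disk. */
    struct extent {
        uint64_t file_off;    /* block offset within the file */
        uint64_t len;         /* length in blocks             */
        uint64_t disk_block;  /* starting block on disk       */
    };

    /* Map a file block offset to a disk block: find the extent covering the
     * offset (this is why the B+ tree is indexed by file offset) and add the
     * distance into that extent.  Returns 0 for a hole in a sparse file. */
    uint64_t block_for_offset(const struct extent *map, size_t n, uint64_t off) {
        for (size_t i = 0; i < n; i++)
            if (off >= map[i].file_off && off < map[i].file_off + map[i].len)
                return map[i].disk_block + (off - map[i].file_off);
        return 0;
    }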
Storage Scalability (cont.)
• Large directories
  - Directories implemented as (surprisingly) B+ trees
    - Map 4-byte hashes to directory entries (name, inode number)
• Fast crash recovery
  - Enabled by write-ahead log (see the sketch below)
  - For all structural updates to metadata
    - E.g., creating a file → directory block, new inode, inode allocation tree block, allocation group header block, superblock
  - Independent of actual data structures
  - However, still need disk scavengers for catastrophic failures
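A minimal sketch of the write-ahead discipline described above; the record layout and function names are assumptions for illustration, not XFS's actual log format:

    #include <stdint.h>

    /* A log record describes one structural change, e.g. "inode X allocated"
     * or "directory block Y gained an entry": it names the changed object and
     * carries the new image of the changed bytes. */
    struct log_record {
        uint64_t object_id;    /* which metadata block or inode changed */
        uint32_t offset, len;  /* changed region within that object     */
        uint8_t  data[64];     /* new contents of the region            */
    };

    /* Write-ahead rule: the record must be durable in the log before the
     * modified metadata is written back in place.  After a crash, recovery
     * replays the log instead of scanning the entire disk as fsck would. */
    int update_metadata(const struct log_record *rec,
                        int (*append_to_log)(const struct log_record *),
                        int (*write_in_place)(const struct log_record *)) {
        if (append_to_log(rec) != 0)    /* 1. log the change first        */
            return -1;
        return write_in_place(rec);     /* 2. only then update it on disk */
    }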
Performance Scalability
• Allocating files contiguously
  - On-disk allocation is delayed until flush (see the sketch below)
    - Uses (cheap) memory to improve I/O performance
    - Typically enables allocation in one extent
      - Even for random writes (think memory-mapped files)
    - Avoids allocation for short-lived files
  - Extents have a large range: 21-bit length field
  - Block size can vary by file system
    - Small blocks for file systems with many small files
    - Large blocks for file systems with mostly large files
• What prevents long-term fragmentation?
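A minimal sketch of the delayed-allocation idea, assuming a hypothetical allocator callback; the real bookkeeping in XFS is per-page and considerably more involved:

    #include <stdint.h>

    /* In-memory state for a file under delayed allocation: writes are held in
     * the cache and we only count how many blocks will eventually be needed. */
    struct delalloc_file {
        uint64_t reserved_blocks;   /* dirty blocks with no disk address yet */
    };

    /* Write path: account for the new blocks; no disk blocks are chosen yet. */
    void delalloc_write(struct delalloc_file *f, uint64_t new_blocks) {
        f->reserved_blocks += new_blocks;
    }

    /* Flush path: the file's size is now known, so ask the allocator for one
     * contiguous extent covering everything.  A short-lived file deleted
     * before the flush never allocates (or frees) any blocks at all. */
    uint64_t delalloc_flush(struct delalloc_file *f,
                            uint64_t (*alloc_extent)(uint64_t nblocks)) {
        uint64_t start = alloc_extent(f->reserved_blocks);
        f->reserved_blocks = 0;
        return start;   /* starting disk block of the single extent */
    }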
Performance Scalability (cont.)
• Performing file I/O
  - Read requests issued for large I/O buffers
    - Followed by multiple read-ahead requests for sequential reads
  - Writes are clustered to form larger I/O requests
    - Delayed allocation helps with buffering writes
  - Direct I/O lets applications bypass the buffer cache and use DMA
    - Applications have control over I/O, while still accessing the file system
    - But they also need to align data on block boundaries and issue requests in multiples of the block size (see the sketch below)
  - Reader/writer locking supports more concurrency
    - Direct I/O leaves serialization entirely to applications
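A short sketch of what the alignment requirement means for an application; this assumes Linux's O_DIRECT flag and a 4096-byte device block size, with error handling trimmed:

    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* Direct I/O bypasses the buffer cache, so the kernel DMAs straight
         * into the user buffer: the buffer address, the file offset, and the
         * request size must all be multiples of the block size. */
        const size_t blksz = 4096;          /* assumed device block size */
        void *buf;
        if (posix_memalign(&buf, blksz, blksz) != 0)
            return 1;                       /* aligned buffer */

        int fd = open("datafile", O_RDONLY | O_DIRECT);
        if (fd < 0)
            return 1;

        ssize_t n = read(fd, buf, blksz);   /* aligned size, aligned offset 0 */
        (void)n;

        close(fd);
        free(buf);
        return 0;
    }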
Performance Scalability (cont.)
• Accessing and updating metadata
  - Updates recorded in an asynchronous write-ahead log
    - Modified data still only flushed after the log update has completed
    - But metadata is not locked, so multiple updates can be batched
  - Log may be placed on a different device from the file system
    - Including non-volatile memory
  - Log operation is simple, though the log is centralized (see the sketch below)
    - Provide buffer space, copy, write out, notify
    - Copying can be done by the processors performing the transaction
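A minimal sketch of the "provide buffer space, copy, write out, notify" sequence; the fixed in-memory buffer and the callbacks are illustrative stand-ins for XFS's centralized log buffers:

    #include <stdint.h>
    #include <string.h>

    #define LOG_BUF_SIZE 4096

    struct log_buf {
        uint8_t  data[LOG_BUF_SIZE];
        uint32_t used;               /* bytes reserved so far */
    };

    /* Steps 1 and 2: reserve space in the shared log buffer and copy the
     * record into it.  The copy is done by the processor running the
     * transaction, so logging does not funnel through a single thread. */
    int log_reserve_and_copy(struct log_buf *lb, const void *rec, uint32_t len) {
        if (lb->used + len > LOG_BUF_SIZE)
            return -1;               /* buffer full: write it out first */
        memcpy(lb->data + lb->used, rec, len);
        lb->used += len;
        return 0;
    }

    /* Steps 3 and 4: write the filled buffer to the log device and notify
     * waiting transactions that their updates are now durable. */
    void log_flush(struct log_buf *lb,
                   void (*write_out)(const void *, uint32_t),
                   void (*notify_waiters)(void)) {
        write_out(lb->data, lb->used);
        lb->used = 0;
        notify_waiters();
    }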
Experiences
I/O Throughput
• What can we conclude?
  - Read speed, difference between creates and writes, parallelism
Benchmark Results
• Datamation sort
  - 3.52 seconds (7 seconds previous record)
• Indy MinuteSort
  - 1.6 GB sorted in 56 seconds (1.08 GB previously)
• SPEC SFS
  - 8806 SPECnfs instead of 7023 SPECnfs with EFS
    - An increase of about 25%, with mostly small & synchronous writes, on similar hardware
Directory Lookups
Why this noticeable break?
A Final Note
• Are these systems/approaches contradictory?
  - Recoverable virtual memory
    - Simpler is better, better performing
  - XFS
    - Way more complex is better, better performing
What Do You Think?