FilesystemImplementation

Download Report

Transcript FilesystemImplementation

CSC 660: Advanced OS
Filesystem Implementation
CSC 660: Advanced Operating Systems
Slide #1
Topics
1.
2.
3.
4.
5.
6.
7.
8.
Disks
Ext2 Filesystem Layout
Inode Allocation
Block Addressing
Block Allocation
e2fsck
Journaling
Stackable Filesystems
CSC 660: Advanced Operating Systems
Slide #2
Filesystem Layering
CSC 660: Advanced Operating Systems
Slide #3
Hard Drive Components
CSC 660: Advanced Operating Systems
Slide #4
Hard Drive Components
Actuator
Moves arm across disk to read/write data.
Arm has multiple read/write heads (often 2/platter.)
Platters
Rigid substrate material + magnetic coating.
Divided into many concentric tracks.
Spindle Motor
Spins platters from 3600-15,000 rpm.
Speed determines disk latency.
Cache
2-16MB of cache memory
Reliability: write-back vs. write-through
CSC 660: Advanced Operating Systems
Slide #5
Disk Information: hdparm
# hdparm -i /dev/hde
/dev/hde:
Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WMA8C4533667
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: device does not report version:
* signifies the current active mode
CSC 660: Advanced Operating Systems
Slide #6
Disk Performance
Seek Time
Time to move head to desired track (3-8 ms)
Rotational Delay
Time until head over desired block (8ms for 7200)
Latency
Seek Time + Rotational Delay
Throughput
Data transfer rate (20-80 MB/s)
CSC 660: Advanced Operating Systems
Slide #7
Latency vs. Throughput
Which is more important?
Depends on the type of load.
Sequential access – Throughput
Multimedia on a single user PC
Random access – Latency
Most servers
How to improve performance
Faster disks
Caching
More spindles (disks).
More disk controllers.
CSC 660: Advanced Operating Systems
Slide #8
Ext2 Disk Data Structures
CSC 660: Advanced Operating Systems
Slide #9
Block Groups
• Each block group contains
–
–
–
–
Block bitmap
Inode bitmap
Inode blocks
Data blocks
• How many blocks in a group?
– Bitmaps are only 1 block in size.
– Block bitmap can map 8 x blocksize blocks.
– 4Kbyte blocks => # data blocks = 32K (128M)
CSC 660: Advanced Operating Systems
Slide #10
Inode Table
• Consecutive set of inode blocks.
– Inodes are 128 bytes in size.
– A 4K block contains 32 inodes.
• Extended Inode Attributes
– Problem: inodes are fixed size, must be 2n
– Solution: i_file_acl attribute points to non-inode block
containing extended attributes (immutable bit, ACLs)
– System calls: setxattr(), getxattr()
– ACLs are most common application and have their own
system calls: setfacl(), getfacl()
CSC 660: Advanced Operating Systems
Slide #11
Disk Block Usage
• Regular files
– Zero-size files use no blocks.
• Symlinks
– Pathnames < 60 chars stored in i_block field of inode.
• Directories
– Format:
• Dev/IPC Files
– No data
blocks
CSC 660: Advanced Operating Systems
Slide #12
Creating an ext2 Filesystem
1. Initializes superblock + group descriptors.
2. Checks for bad blocks and creates list.
3. For each block group
1. Reserves space: super,desc,inodes,bitmaps.
2. Initializes inode and block bitmaps to zero.
3. Initializes inode table.
4.
5.
6.
7.
Creates root (/) directory.
Creates lost+found directory for e2fsck.
Updates bitmaps with two directories.
Groups bad blocks in lost+found.
CSC 660: Advanced Operating Systems
Slide #13
Managing Disk Space
• How to avoid file fragmentation?
– If file blocks aren’t contiguous on disk,
expensive seeks are required
• How to access blocks quickly?
– The kernel should be able to quickly convert file
offset into a logical block number on disk with
few disk accesses.
CSC 660: Advanced Operating Systems
Slide #14
Creating Inodes
1. Allocate VFS inode with new_inode()
2. If inode is a dir, find a suitable block group
–
–
If subdirectory of /, find block group with above
average free inodes + free blocks.
If not subdir of /, use block group of parent dir if
•
•
•
–
Group does not have too many directories.
Group has enough free inodes.
Group has small “debt” value (+dirs, -files)
Else use 1st block group with free inodes > avg.
CSC 660: Advanced Operating Systems
Slide #15
Creating Inodes
3. If new inode not a directory
–
–
–
Logarithmic search for free inodes starting with block
group of parent directory.
Ex: searches i % n, (i+1)%n, (i+1+2)%n
If log search fails, perform linear search.
4. Reads bitmap of selected block group.
–
Searches for 1st unused bit to get inode #.
5. Allocates disk inode.
–
–
Sets bit, marks inode bitmap block dirty.
Sets inode fields and writes to disk.
CSC 660: Advanced Operating Systems
Slide #16
Data Block Addressing
• Block Numbers
– File: relative position of block within file.
• File Offset -> File Block: (int) (offset / blocksize).
– Logical: position within disk partition.
• File Block -> Logical Block: use inode to translate.
• Inodes
–
–
–
–
Direct blocks: 12 logical block numbers (48K)
Indirect: points to block of block #s (4M)
Double-indirect (4G)
Triple-indirect (4T)
CSC 660: Advanced Operating Systems
Slide #17
Inode Block Addressing
CSC 660: Advanced Operating Systems
Slide #18
File Holes
• Portion of file not stored on disk.
– Can contain only null bytes.
– echo –n “x” | dd of=/tmp/hole bs=1024 seek=6
– Used by databases and similar hashing apps.
• How big is a file with a hole?
– i_size includes null bytes in hole.
– i_blocks stores data blocks actually used.
CSC 660: Advanced Operating Systems
Slide #19
File Holes
CSC 660: Advanced Operating Systems
Slide #20
Allocating Data Blocks
• Goal parameter
–
–
–
–
Preferred logical block number.
If prev 2 blocks consecutive, goal = prev+1
Else if at least 1 block alloc, goal = prev
Else goal = first logical block of group
• ext2_alloc_block()
– If goal block pre-allocated to file, allocates.
– Else, discards remaining pre-allocated and
invokes ext2_new_block().
CSC 660: Advanced Operating Systems
Slide #21
ext2_new_block()
• If goal is free, allocates.
• Otherwise checks for nearby free blocks.
• If no nearby free blocks, checks all groups
– Starts with block group of goal block.
– Searches for group of 8+ adjacent free blocks.
– If no such group, looks for single free block.
• Will allocate up to 8 adjacent free blocks.
CSC 660: Advanced Operating Systems
Slide #22
e2fsck
1. Checks validity of all inodes.
Is file mode valid? Are blocks valid?
Are blocks used by multiple inodes?
2. Checks validity of all directories.
Valid format? Do all entries refer to inodes from 1?
3. Checks directory connectivity.
Is there a path from / to each directory?
4. Checks inode reference counts.
Compares link counts with values calculated in 1+2.
Moves undeleted 0 link count files to /lost+found.
5. Checks filesystem summary validity.
Do on-disk inode/block bitmaps match e2fsck ones?
CSC 660: Advanced Operating Systems
Slide #23
ext3 = ext2 + journaling
• ext3 adds a journal to the filesystem.
–
–
–
–
Journal (log) does sequential writes.
Just blocks, no inodes, bitmaps, etc.
Kernel thread writes log blocks to ext2 format.
Why? Eliminate need for e2fsck after crash.
CSC 660: Advanced Operating Systems
Slide #24
Journaling
Perform system call-level changes by:
1.
2.
3.
4.
Write blocks to journal.
Wait for write to be committed to journal.
Write blocks to filesystem.
Discard blocks from journal.
CSC 660: Advanced Operating Systems
Slide #25
System Failure Resolution
• Failure before journal commit
– Ignore missing or incomplete journal blocks.
– Change is lost, but filesystem is consistent.
• Failure after commit
– Journal blocks are written to filesystem.
CSC 660: Advanced Operating Systems
Slide #26
Journal Types
• Journal
– All data and metadata logged to journal.
– Safest and slowest ext3 mode.
• Ordered (default)
– Only metadata changes are logged.
– Ensures data blocks written before metadata.
– Guarantees writes that enlarge are safe.
• Writeback
– Only metadata logged, no re-ordered.
– Fastest and least safe ext3 mode.
CSC 660: Advanced Operating Systems
Slide #27
Stackable Filesystems
• Filesystems useful for enhancing OS.
–
–
–
–
–
File encryption.
Secure deletion.
Virus detection.
File versioning.
UnionFS.
• But, filesystems are difficult to develop.
– 10,000+ lines of C code is typical.
– Most of which you don’t want to change.
CSC 660: Advanced Operating Systems
Slide #28
Stackable Filesystems
• Solution #1
– Copy ext3fs + add your code.
– Problem: maintenance, keeping up with ext3.
• Solution #2
– Add a layer of indirection: stackable filesystems.
CSC 660: Advanced Operating Systems
Slide #29
Stackable Filesystems
CSC 660: Advanced Operating Systems
Slide #30
File Data API
• encode_data
– Called by write calls before data sent to lowerlevel filesystem.
• decode_data
– Called by read calls after data received from
lower-level filesystem.
• Arguments
– I/O blocks: cannot change size.
– File attributes, user credentials.
CSC 660: Advanced Operating Systems
Slide #31
Filename API
• encode_filename
– Modifies filename from user system call that is
sent to lower-level filesystem.
• decode_filename
– Modifies filename from filesystem before
returning to user.
• Arguments
– Filenames: can change length, but no invalid chars
– File’s vnode, user credentials.
CSC 660: Advanced Operating Systems
Slide #32
File Attributes API
• No specific API.
• Must modify wrapfs calls directly.
CSC 660: Advanced Operating Systems
Slide #33
References
1.
2.
3.
4.
5.
6.
7.
Daniel P. Bovet and Marco Cesati, Understanding the Linux Kernel,
3rd edition, O’Reilly, 2005.
Remy Card, Theodore T’so, Stephen Tweedie, “Design and
Impementation of the Second Extended Filesystem,”
http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994.
Robert Love, Linux Kernel Development, 2nd edition, Prentice-Hall,
2005.
Claudia Rodriguez et al, The Linux Kernel Primer, Prentice-Hall,
2005.
Mendel Rosenblum and John K. Osterhout, “The Design and
Implementation of a Log-structured Filesystem,” 13th ACM SOSP,
1991.
Andrew S. Tanenbaum, Modern Operating Systems, 3rd edition,
Prentice-Hall, 2005.
Erek Zadok and Jason Nieh, “FIST: A Language for Stackable
Filesystems,” http://www.filesystems.org/docs/fist-lang/fist.pdf, 2000.
CSC 660: Advanced Operating Systems
Slide #34