UNIX File Systems 3


UNIX File and Directory Caching
How UNIX Optimizes File System Performance and Presents Data to User Processes Using a Virtual File System
V-NODE Layer
• The v-node is the in-memory interface to the disk. It consists of:
• File-system-independent functions dealing with:
  – hierarchical naming;
  – locking;
  – quotas;
  – attribute management and protection.
• Object (file) creation and deletion, read and write, and changes in space allocation:
  – These functions refer to file-store internals specific to the file system: the physical organization of data on the device.
  – For local data files, the v-node refers to a UNIX-specific structure called the i-node (index node), which holds all the information needed to access the actual data store (see the sketch below).
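To make the split concrete, the following is a minimal C sketch of how a v-node layer can dispatch generic operations to file-system-specific code through a table of function pointers. All names here (vnode, vnode_ops, fs_data, vn_read) are illustrative assumptions, not the actual kernel definitions.

#include <sys/types.h>

struct vnode; /* forward declaration */

// File-system-specific operations, supplied by each file system
// (illustrative sketch; not the actual kernel interface).
struct vnode_ops {
    ssize_t (*read)(struct vnode *vn, void *buf, size_t len, off_t off);
    ssize_t (*write)(struct vnode *vn, const void *buf, size_t len, off_t off);
    int     (*create)(struct vnode *dir, const char *name);
    int     (*remove)(struct vnode *dir, const char *name);
};

// File-system-independent part: naming, locking, quotas, and attribute
// state live here; everything device-specific hides behind `ops`.
struct vnode {
    int                     refcount;  // shared by all layers
    int                     locked;    // generic locking state
    const struct vnode_ops *ops;       // FS-specific dispatch table
    void                   *fs_data;   // e.g., points to the i-node of a local file
};

// Generic read: the caller never sees whether the data lives in a
// local i-node-based file system or somewhere else.
static ssize_t vn_read(struct vnode *vn, void *buf, size_t len, off_t off) {
    return vn->ops->read(vn, buf, len, off);
}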
Actual File I/O
• The CPU cannot access file data on disk directly.
• The data must first be brought into main memory.
  – How can this be done efficiently?
• Read/write mapping using the in-memory system file/directory buffer cache.
• Memory-mapped files: i-node lists, directories, regular files.
• The data is then fed from memory into the CPU's data pipeline.
Virtual INODES (In Memory)
[Figure: relation between logical and physical data views]
Program I/O - Virtual to Real
File Read/Write Memory Mapping
• File data is made available to applications via a pre-allocated main memory region: the buffer cache.
• The file system transfers data between the buffer cache and the disk at the granularity of disk blocks.
• The data is explicitly copied from/to the buffer cache to/from the user application's address space (process).
• Alternatively, a file (or a portion thereof) is mapped into a contiguous region of the process's virtual memory.
• The mapping operation is very efficient: it just marks the blocks.
• Access to the file is then governed by the virtual memory subsystem (see the sketch after this list).
• Advantages:
  – reduced copying;
  – no need for a pre-allocated buffer cache in main memory.
• Disadvantages:
  – less or no control over the actual disk writing: the file data becomes volatile;
  – a mapped area must fit within the virtual address space.
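As an illustration of the mapping approach, here is a minimal sketch using the POSIX mmap interface; the file name "data.txt" is an assumption for the example.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.txt", O_RDONLY);   // hypothetical input file
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    // Map the whole file into a contiguous region of virtual memory.
    // No data is copied here; pages are faulted in on first access.
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // The VM subsystem now services the accesses; no read() copies needed.
    for (off_t i = 0; i < st.st_size; i++)
        putchar(p[i]);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}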
Read/Write Mapping
[Figure: files A, B, and C in main memory, cached by the kernel buffer cache]
Reading data (Disk block=1K)
[Figure: read path: bytes 1324 through 3172 of file C move from disk into the kernel buffer cache, then are copied to the user buffer at ptr]
unsigned char buf[8192];
unsigned char *ptr = buf + 126;
int fd = open("C", O_RDONLY);
lseek(fd, 1324, SEEK_SET);
read(fd, ptr, 1848);
// 1324 = 1024 + 300: the read starts 300 bytes into the second 1K block
// 1848 = 724 + 1024 + 100: the rest of that block, all of the next block,
// and 100 bytes of the one after it
Writing data (Disk block=1K)
[Figure: write path: bytes 1324 through 3172 of file C are copied from the user buffer into the kernel buffer cache, then written to disk; blocks beyond the old end of the file cover a previously unallocated region]
unsigned char buf[8192];
unsigned char *ptr = buf + 126;
int fd = open("C", O_WRONLY);
lseek(fd, 1324, SEEK_SET);
write(fd, ptr, 1848);
// 1324 = 1024 + 300: the write starts 300 bytes into the second 1K block
// 1848 = 724 + 1024 + 100: the rest of that block, all of the next block,
// and 100 bytes of the one after it
Buffer Cache Management
• All disk I/O goes through the buffer cache. Both data and metadata (e.g., i-nodes, directories) are cached, using LRU replacement.
• A dirty (modified) marker indicates whether a data block needs to be written back (see the sketch below).
• Advantages:
  – Hides the disk access details from the user program: block size, memory alignment, memory allocation in multiples of the block size, etc.
  – Disk blocks are cached.
  – Block aggregation for small transfers (locality).
  – Block re-use across processes.
  – Transient data might never be written to disk.
• Disadvantages:
  – Extra copying: disk -> buffer cache -> user space.
  – Vulnerability to failures.
  – Lost user data blocks are the user's concern; the control data blocks (metadata) are the real problem:
    • i-nodes, pointer blocks, and directories can be in the cache when a failure occurs;
    • as a result, the file system's internal state might be corrupted;
    • fsck is then required, resulting in long (re-)boot times.
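A minimal sketch of what one buffer cache entry might look like, with an LRU list and a dirty marker; the structure and field names are illustrative assumptions, not an actual kernel definition.

#include <sys/types.h>

// One cached disk block (illustrative sketch).
struct buf {
    dev_t         dev;        // device the block belongs to
    long          blkno;      // disk block number
    int           dirty;      // set on modification; block needs write-back
    struct buf   *lru_prev;   // doubly linked LRU list:
    struct buf   *lru_next;   //   head = most recently used, tail = eviction victim
    unsigned char data[1024]; // the cached 1K disk block itself
};

// On every hit, the block moves to the LRU head; on a miss with a full
// cache, the tail block is evicted (written back first if it is dirty).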
File System Reliability and Recovery
• File system data consists of file control data (metadata) and user data.
• Failures can cause loss and corruption of cached metadata or user data.
• A power failure during a sector write may physically corrupt the data stored in that sector.
• Loss or corruption of metadata can lead to a far more massive loss of user data.
  – File systems must care about the metadata more than about the user data.
  – The operating system cares about the file system data (e.g., metadata).
  – Users must care about their data themselves (e.g., backups).
• Caching affects the reliability of WRITE processing.
  – Is it guaranteed that the requested data is indeed written to disk?
  – What if the cached blocks are metadata blocks rather than user data?
• Solutions:
  – write-through: writes bypass the cache;
  – write-back: dirty blocks are written asynchronously (by background processes).
Data Reliability in UNIX
• User data writes follow a write-back policy:
  – User data is written back to disk periodically.
  – Calls such as sync and fsync force a write of the dirty blocks (see the sketch below).
• Metadata writes follow a write-through policy: updates are written to disk immediately, bypassing the cache.
• Problems:
  – Some data is not written in place, so the file system can go back to the last consistent version.
  – Some data is replicated, such as the UNIX superblock.
  – The file system goes through a consistency check/repair cycle at boot time, as specified in the /etc/fstab options (see the man pages for fsck and fstab).
  – Write-through negatively affects performance.
• Solution: maintain a sequential log of metadata updates, a journal: e.g., IBM's Journaled File System (JFS) in AIX.
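A minimal sketch of forcing dirty blocks to disk with fsync; the file name and record contents are assumptions for the example.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("records.dat", O_WRONLY | O_CREAT, 0644); // hypothetical file
    if (fd < 0) { perror("open"); return 1; }

    const char *rec = "important record\n";
    // write() only copies the data into the kernel buffer cache;
    // under write-back, the dirty blocks reach the disk later.
    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

    // fsync() blocks until this file's dirty blocks (and its metadata)
    // are actually on disk: only now is the write durable.
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}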
Journaled File System (JFS)
• Metadata operations are logged (journaled):
  – create, link, mkdir, truncate, allocating write, ...
  – each operation may involve several metadata updates (a transaction).
• Once an operation is logged, it can return: write-ahead logging.
• The disk writes themselves are performed asynchronously; block aggregation is possible.
• A cursor (pointer) is maintained. The cursor is advanced once the updated blocks associated with a transaction have been written to disk (hardened). Hardened transaction records can be deleted from the journal.
• Upon recovery: re-do all the operations starting from the last cursor position (see the sketch below).
• Advantages:
  – asynchronous metadata writes;
  – fast recovery: the recovery time depends on the journal size, not on the file system size.
• Disadvantages:
  – an extra write per operation;
  – space taken up by the journal (insignificant).
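To make the cursor mechanism concrete, here is a minimal in-memory C sketch of write-ahead logging with a hardening cursor; all names, the fixed-size log, and the printf stand-ins for real block writes are assumptions for illustration, not JFS's actual format.

#include <stdio.h>

#define LOG_SIZE 64

struct log_record {
    int  txn_id;   // transaction this update belongs to
    long blkno;    // metadata block the transaction updates
};

static struct log_record journal[LOG_SIZE];
static int log_head = 0;  // next free slot: where new records are appended
static int cursor   = 0;  // records before the cursor are hardened (on disk)

// Write-ahead logging: append the record first; only after it is in
// the journal may the operation return to the caller.
static void log_append(int txn_id, long blkno) {
    journal[log_head % LOG_SIZE] = (struct log_record){ txn_id, blkno };
    log_head++;
}

// Called once the in-place block writes of the oldest logged
// transaction have reached the disk: advance the cursor past it,
// which lets those journal records be reclaimed.
static void harden_oldest(void) {
    int txn = journal[cursor % LOG_SIZE].txn_id;
    while (cursor < log_head && journal[cursor % LOG_SIZE].txn_id == txn)
        cursor++;
}

// Recovery: re-do every operation from the last cursor position on;
// everything before the cursor is already hardened on disk.
static void recover(void) {
    for (int i = cursor; i < log_head; i++)
        printf("redo txn %d, block %ld\n",
               journal[i % LOG_SIZE].txn_id, journal[i % LOG_SIZE].blkno);
}

int main(void) {
    log_append(1, 100);  // txn 1 updates an i-node block
    log_append(1, 205);  // ...and a pointer block
    log_append(2, 100);  // txn 2 touches the same i-node block
    harden_oldest();     // txn 1's blocks have reached the disk
    recover();           // after a crash: only txn 2 is re-done
    return 0;
}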