CENG334 Introduction to Operating Systems

Download Report

Transcript CENG334 Introduction to Operating Systems

CENG334
Introduction to Operating Systems
Filesystem implementation
Topics:
Erol Sahin
Dept of Computer Eng.
Middle East Technical University
Ankara, TURKEY
URL: http://kovan.ceng.metu.edu.tr/~erol/Courses/CENG334
1
File System Layout
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
2
Contiguous Allocation
(a) Contiguous allocation of disk space for 7 files.
(b) The state of the disk after files D and F have been removed.
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
3
Linked List Allocation
Storing a file as a linked list of disk blocks.
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
4
Linked List Allocation Using a Table in Memory
Linked list allocation using a file allocation table in main
memory. Will be discussed more in detail in FAT.
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
5
I-nodes
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
6
Implementing Directories (1)
(a)
(b)
A simple directory containing fixed-size entries with the disk addresses and
attributes in the directory entry.
A directory in which each entry just refers to an i-node.
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
7
Implementing Directories (2)
Two ways of handling long file names in a directory. (a) In-line. (b)
In a heap.
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
8
Shared Files (1)
File system containing a shared file.
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
9
Shared Files (2)
(a) Situation prior to linking. (b) After the link is created. (c) After the
original owner removes the file.
10
Disk Space Management Block Size (1)
Percentage of files smaller than a given size (in bytes).
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
11
Disk Space Management Block Size (2)
The solid curve (left-hand scale) gives the data rate of a disk. The dashed curve (righthand scale) gives the disk space efficiency. All files are 4 KB.
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
12
Keeping track of Free Blocks -1
Linked list of disk blocks, with
each block filled with free disk
block numbers.
With 1KB block size, and 32 bit
disk block number, each block
can hold the numbers for 255
free blocks. The 256th slot
points to the next block
holding free disk blocks.
Example:



500 GB disk has approx. 488
million blocks.
If the disk is empty: storing the
free blocks requires 488
million/255 = 1.9 million blocks.
If the disk is nearly full, then the
linked list requires small.
Storage is essential free.. Free
blocks are used.
13
Cont’d
The free list method can be enhanced..
Instead of keeping track of each free block, we can keep track of
consecutive free blocks represented by:


The address of the first free block
The count of free blocks
In the free list method, only one block of pointers need to be stored in
the memory. When it runs out, another one can be brought into the
memory.
14
Keeping track of Free Blocks -2
One bit for keeping the state of each block.


1 : free blocks
0 : allocated blocks
500 GB disk, 1KB blocks


488 million bits = 60000 1KB blocks
For an empty disk:
 Less space than linked list method, since we use
1-bit instead of 32-bits for each free block
17
CENG334
Introduction to Operating Systems
Filesystems – Case studies
Topics:
FAT
Linux
Erol Sahin
Dept of Computer Eng.
Middle East Technical University
Ankara, TURKEY
URL: http://kovan.ceng.metu.edu.tr/ceng334
Slide adapted from: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
18
FAT (MS-DOS) Filesystems
MS-DOS, Windows 98, Windows ME
Still supported in Windows NT, XP, Vista
Its use has been shifted towards embedded devices such as



Digital cameras
MP3 playes
iPod (the default filesystem)
19
Directories
Filenames are limited to 8+3 characters.. Smaller ones are left justified and padded with
space.
Attributes: read-only, archive, hidden, system
Time represented with 2 bytes: correct upto +-2 second
Date: Counts in three fields: day (5 bit), month (4 bits), year (7 bits).

Contains: Y2108 problem
File size: 2 bytes. Theoretically 4GB limit, but the limit is 2GB due to other reasons.
10 bits reserved for future use.
20
File Allocation Table
MS-DOS keeps track of files through
FAT table hold in memory.
First block number (2 bytes) used as
the index to the FAT table which
has 64K entries.
Block size can be set as multiple of
512 byes.
Three versions of FAT depending on
the number of bits a disk address
contains:



FAT-12
FAT-16
FAT-32 (actually should be called as
FAT-28)
FAT is also used for keeping track of
free blocks.
21
FAT-12
Block size: 512 bytes
(2^12-10) X 512 bytes =~ 2MB

10 disk addresses are used as
special markers.
FAT table size: 4096 entries
with 2 bytes each.
Worked well for floppy disks.
The limit was extended using
larger block sizes: 1KB, 2KB,
4KB providing support for
partitions 16MB.
Limited for hard disks.
22
The MS-DOS File System (2)
Figure 4-32. Maximum partition size for different block sizes. The
empty boxes represent forbidden combinations.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
23
The UNIX V7 File System (1)
Figure 4-33. A UNIX V7 directory entry.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
24
The UNIX V7 File System (2)
Figure 4-34. A UNIX i-node.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
25
The UNIX V7 File System (3)
Figure 4-35. The steps in looking up /usr/ast/mbox.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
26
27
28
29
30
31
The ISO 9660 File System
Figure 4-30. The ISO 9660 directory entry.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
32
Rock Ridge Extensions
Rock Ridge extension fields:
•
•
•
•
•
•
•
•
PX - POSIX attributes.
PN - Major and minor device numbers.
SL - Symbolic link.
NM - Alternative name.
CL - Child location.
PL - Parent location.
RE - Relocation.
TF - Time stamps.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
33
Joliet Extensions
Joliet extension fields:
•
•
•
•
Long file names.
Unicode character set.
Directory nesting deeper than eight levels.
Directory names with extensions
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
34
Virtual File Systems
VFS: another layer of
abstraction
Upper interface: for processes
implementing POSIX
interface
Lower interface: for concrete
file systems
VFS translates the POSIX calls
to the calls of the filesystems
under it.
35
VFS- 2
At boot time, the root filesystem is registered with VFS.
When other filesystems are mounted, they must also register with VFS.
When a filesystem registers, it provides the list of addresses of the functions that
the VFS demands, such as reading a block.
After registration, when one opens a file:
open(“/usr/include/unistd.h”, O_RDONLY)
VFS creates a v-node and makes a call to the concrete filesystem to return all the
information needed.
The created v-node also contains pointers to the table of functions for the concrete
filesystem that the file resides.
36
Virtual File Systems (2)
A simplified view of the data structures and code used by the VFS and concrete
file system to do a read.
37
CENG334
Introduction to Operating Systems
Filesystem Corruption and
Backups
Erol Sahin
Dept of Computer Eng.
Middle East Technical University
Ankara, TURKEY
URL: http://kovan.ceng.metu.edu.tr/ceng334
38
Filesystem corruption
What happens when you are making changes to a filesystem and the
system crashes?



Example: Modifying block 5 of a large directory, adding lots of new file entries
System crashes while the block is being written
The new files are “lost!”
System runs fsck program on reboot


Scans through the entire filesystem and locates corrupted inodes and directories
Can typically find the bad directory, but may not be able to repair it!
 The directory could have been left in any state during the write
fsck can take a very long time on large filesystems

And, no guarantees that it fixes the problems anyway
39
Example:
Example: removing a file requires
 Remove the file from its directory
 Release the i-node to the pool of free i-nodes
 Return all the disk blocks to the pool of free disk blocks
In the absence of crashes the order these steps taken do not matter.
In the presence of crashes, however, it does!
40
Example:
Example: removing a file requires
 Remove the file from its directory
 Release the i-node to the pool of free i-nodes
 Return all the disk blocks to the pool of free disk blocks
The inodes and file blocks will not be accessible from any file yet they
will not be available for reassignment.
41
Example:
Example: removing a file requires
 Remove the file from its directory
 Release the i-node to the pool of free i-nodes
 Return all the disk blocks to the pool of free disk blocks
The directory node will point to an invalid inode or (if the inode is
reassigned) point to a different file.
The blocks of the file will not be available for reassignment.
42
Example:
Example: removing a file requires
 Remove the file from its directory
 Release the i-node to the pool of free i-nodes
 Return all the disk blocks to the pool of free disk blocks
The file will point to empty blocks, or (after reassignment) it will share
the blocks of other files to which these were reassigned..
43
File system consistency
Consistency check is typically done after a crash..


UNIX: fsck
Windoze: scandisk
Redundant information in the filesystem is used:


Check the blocks
Check the files
Two tables, each containing a counter initialized to 0


Blocks in use: How many times a block is present in a file
 Read all the i-nodes using a raw device (not through the filesystem calls)
 For each block that is referenced in the inode structure, increment the
corresponding block use counter by one.
Free blocks:
 Examine the free block list of free block bitmap structure
 Each appearance of a free block increments the counter by one
44
File System Consistency
File system states. (a) Consistent. (b) Missing block. (c) Duplicate
block in free list. (d) Duplicate data block.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
45
Missing block
•
•
Harmless but wastes space
Action: Add the missing block to the free list.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
46
Duplicate block in free list
•
•
Can only occur in linked list representation. Bitmap
representation does not have this problem.
Action: Rebuild the free list.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
47
Duplicate data block
•
•
The worst thing that can happen! A data block appears in two different
files..
Action:
•Allocate a free block
•Copy the contents into the new block.
•Change the links such that each copy appears once in each file.
•For sure, the contents of one of the files is garbled.
•The filesystem is made to be consistent.
•The user is informed.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
48
Consistency check for directories
Uses a table of counters per file (rather than per block)
Starts from the root and traverses the tree




For each i-node, it increments the corresponding counter for that file
 Remember due to hard links, a file can appear more than once
It then checks the link counts stored in the i-nodes to these values.
If the link count > counter
 Even if the file is deleted by from all the directory entries, it will continue to exist.
 Solution: correct the link count
If the counter > link count
 Although the file is linked from, say, two directories, removal from one would
cause the i-node deleted leaving the other one invalid.
 Solution: correct the link count
49
Journaling Filesystems
Ensure that changes to the filesystem are made atomically



That is, a group of changes are made all together, or not at all
In the directory modification example, this means that after the system reboots:
 The directory either looks exactly as it did before the block was modified
 Or the directory looks exactly as it did after the block was modified
Cannot leave an FS entity (data block, inode, directory, etc.) in an intermediate state!
Idea: Maintain a log of all changes to the filesystem


Log contains entries that indicate what was done
e.g., “Directory 2841 had inodes 404, 407, and 408 added to it”
To make a filesystem change:



1. Write an intent-to-commit record to the log
2. Write the appropriate changes to the log
 Do not modify the filesystem data directly!!!
3. Write a commit record to the log
This is essentially the same as the notion of database transactions
50
Journaling FS Recovery
What happens when the system crashes?


Filesystem data has not actually been modified, just the log!
So, the FS itself reflects only what happened before the crash
Periodically synchronize the log with the filesystem data


Called a checkpoint
Ensures that the FS data reflects all of the changes in the log
No need to scan the entire filesystem after a crash...

Only need to look at the log entries since the last checkpoint!
For each log entry, see if the commit record is there

If not, consider the changes incomplete, and don't try to make them
51
Journaling FS Example
File 1
File 2
Checkpoint
Log
52
Journaling FS Example
File 1
File 2
Checkpoint
Log
53
Journaling FS Example
File 1
File 2
Checkpoint
Log
Filesystem reflects changes up to last checkpoint
Fsck scans changelog from last checkpoint forward
Doesn't find a commit record ... changes are simply ignored
54
File System Backups (1)
Backups are generally made to handle one of two
potential problems:
•
•
Recover from disaster.
Recover from stupidity.
•Thrash bins
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
55
Backup what?
•
In order to have storage efficiency, one can
choose not to backup:
•Executable files (since they can usually be restored
from manufacturer CD-ROMs
•Temporary files
• /tmp
•Special files (which correspond to I/O devices)
• /dev
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
56
Back-up issues
Full backup


backup all the filesystem providing a snapshot of the system at that point which can be fully
restored
Typically done weekly or monthly
Incremental backup



Backup only the files that were changed after the most recent backup
Typically, a weekly backup is followed by daily incremental backups
Smaller backup size, and faster
Compressed backups


Reduced storage
Less secure: a single bad byte can screw the whole backup
Backing up and active filesystem is tricky

Since during backup, files and directories are being modified
Makes the system less secure


Each backup tape/disk needs to be as safe as the serve itself..
It doesn’t matter if your backup tapes are lying around, even if you have the most secure
computer system..
57
Dumping strategies – physical dump
Algorithm

Starts at block 0 of the disk and copies all the data onto a tape/disk
Pros


Simple to implement in a bug-free way
Fast
Cons


Backups the free blocks as well
 Backuping a free disk takes as much storage/time as a full disk
Dumping of “bad blocks” is a concern
 Typically disk controllers provide bad block replacement transparently without the
OS even knowing about it
 If a block goes bad after formatting, then the OS typically creates a “file”
consisting of all the bad blocks.
58
Dumping strategies – logical dump
Pros

Allows to restore a single file. If
directories that lie on the path
to the file-to-be-restored were
deleted, then they would also
be restored.
Cons

Slow and complicated.
59
Dumping strategies – logical dump
Algorithm


Full backup: Traverses the
filesystem as a tree and
creates the same filesystem
structure on the backup
disk/tape.
Partial backup: all the
directories on the path to the
particular file needs to be
saved. For instance, backing
up file 9 requires the saving of
directories, 1,5,6, and 7.
Squares are directories, circles are files.
Shaded items have been modified since
last dump. Each directory and file is
labeled by its i-node number.
60
Dumping strategies – incremental logical dump
Bitmaps, indexed by i-node
number are used.
Phase 1: Examine all files
and directories below the
starting directory (root in
this case), and


mark the modified files.
mark ALL THE DIRECTORIES
Squares are directories, circles are files.
Shaded items have been modified since
last dump. Each directory and file is
labeled by its i-node number.
61
Dumping strategies – incremental logical dump
Phase 2: Recursively walk
the tree again, and

UNMARK all the directories
that do not have any modified
files under them.
Note:


10 and 11 are unmarked
5 and 6 remain marked
Squares are directories, circles are files.
Shaded items have been modified since
last dump. Each directory and file is
labeled by its i-node number.
62
Dumping strategies – incremental logical dump
Phase 3: Scan the i-nodes
in numerical order and

Dump all the marked
directories
Squares are directories, circles are files.
Shaded items have been modified since
last dump. Each directory and file is
labeled by its i-node number.
63
Dumping strategies – incremental logical dump
Phase 4: dump the marked
files.
Squares are directories, circles are files.
Shaded items have been modified since
last dump. Each directory and file is
labeled by its i-node number.
64
Dumping strategies – incremental logical dump
Restoring:


Restore all the directories that
were backupped
Restore all the files
Squares are directories, circles are files.
Shaded items have been modified since
last dump. Each directory and file is
labeled by its i-node number.
65
Issues
Links:

If a file is linked to more than one directory, only one copy should be saved.
Holes:




In UNIX, some file, such as core files may contain holes.
These files, write a few bytes, and then seek to a distant file offset and write some
more bytes.
These empty blocks that are not written should not be dumped and stored.
Cores typically have may megabytes of empty blocks.
Special files

Such as named pipes (which can appear anywhere in the filesystem) should not be
dumped.
66
CENG334
Introduction to Operating Systems
Filesystem Caching and LFS
Topics:
Disks
Erol Sahin
Dept of Computer Eng.
Middle East Technical University
Ankara, TURKEY
URL: http://kovan.ceng.metu.edu.tr/ceng334
67
File System Caching
Most filesystems cache significant amounts of disk in memory


e.g., Linux tries to use all “free” physical memory as a giant cache
Avoids huge overhead for going to disk for every I/O
Issues:


When do you commit a write to the disk?
 What happens if you write only to the memory cache and then the system crashes?
How do you keep the memory and disk views of a file consistent?
 What if the file metadata (inodes, etc.) is modified before the data blocks?
Read-ahead


Read a few extra blocks into memory when you do one read operation
 Amortize the cost of the seek
Useful if the blocks of a file are laid out in contiguous blocks
 Take advantage of sequential access patterns on the file
68
Caching
Reading a 32-bit word from memory takes 10 nsec.
Hard disks can transfer data at: 100MB/sec, that is 40nsec per 32-bit
words..

PLUS 5-10 msecs of seek time!
Caching aims to fill in the gap..
Often thousands of blocks are kept in cache.
69
Caching (1)
Hash the device and the disk address and look up the result in
a hash table.
All the blocks with the same cache value are chained together
through a linked list.
In addition to this, a bidirectional link runs through all the
blocks implementing a LRU list.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
70
Caching (2)
•
•
•
•
Some blocks, such as i-node blocks, are rarely referenced two times
within a short interval.
Consider a modified LRU scheme, taking two factors into account:
•Is the block likely to be needed again soon?
•Is the block essential to the consistency of the file system?
For both questions, blocks can be divided into categories
•i-node blocks
•Indirect blocks
•Directory blocks
•Data blocks
• Full
• Partially full
Blocks that are essential for filesystem consistency should be written to disk
immediately – write-through-caches
•UNIX: synch (every 30 seconds)
•Windows: In the past: none, recent ones: FlushFileBuffers
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
71
Block read ahead
The system tries to get blocks into the cache, before they are
accessed to increase the hit rate.


Sequential access: performance improvement
Random access: performance degradation
72
Reducing Disk Arm Motion
Figure 4-29. (a) I-nodes placed at the start of the disk.
(b) Disk divided into cylinder groups, each with its own blocks
and i-nodes.
Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
73