linux-review2

Transcript linux-review2

Linux Review
nd
(2 Half)
COMS W4118
Spring 2008
Interrupts
in Linux
COMS W4118
Spring 2008
Interrupts


Forcibly change normal flow of control
Similar to context switch (but lighter weight)




Hardware saves some context on stack; Includes
interrupted instruction if restart needed
Enters kernel at a specific point; kernel then
figures out which interrupt handler should run
Execution resumes with special “iret” instruction
Many different types of interrupts
Types of Interrupts

Asynchronous



From external source, such as I/O device
Not related to instruction being executed
Synchronous (also called exceptions)


Processor-detected exceptions:
 Faults — correctable; offending instruction is retried
 Traps — often for debugging; instruction is not retried
 Aborts — major error (hardware failure)
Programmed exceptions:
 Requests for kernel intervention (software intr/syscalls)
CPU’s ‘fetch-execute’ cycle
User
Program
Fetch instruction at IP
ld
add
IP
st
Save context
Decode the fetched instruction
mul
ld
Get INTR ID
Execute the decoded instruction
sub
Lookup ISR
bne
add
Advance IP to next instruction
Execute ISR
jmp
…
IRQ?
no
yes
IRET
Faults


Instruction would be illegal to execute
Examples:






Writing to a memory segment marked ‘read-only’
Reading from an unavailable memory segment (on disk)
Executing a ‘privileged’ instruction
Detected before incrementing the IP
The causes of ‘faults’ can often be ‘fixed’
If a ‘problem’ can be remedied, then the CPU can
just resume its execution-cycle
Traps



A CPU might have been programmed to
automatically switch control to a ‘debugger’
program after it has executed an instruction
That type of situation is known as a ‘trap’
It is activated after incrementing the IP
Error Exceptions



Most error exceptions — divide by zero, invalid
operation, illegal memory reference, etc. — translate
directly into signals
This isn’t a coincidence. . .
The kernel’s job is fairly simple: send the
appropriate signal to the current process




force_sig(sig_number, current);
That will probably kill the process, but that’s not the
concern of the exception handler
One important exception: page fault
An exception can (infrequently) happen in the kernel

die(); // kernel oops
Interrupt Hardware
Legacy PC Design
(for single-proc
systems)
Ethernet
IRQs
Slave
PIC
(8259)
SCSI Disk
Master
PIC
(8259)
INTR
x86
CPU
Real-Time Clock
Keyboard Controller



Programmable Interval-Timer
I/O devices have (unique or shared) Interrupt Request
Lines (IRQs)
IRQs are mapped by special hardware to interrupt
vectors, and passed to the CPU
This hardware is called a Programmable Interrupt
Controller (PIC)
The `Interrupt Controller’





Responsible for telling the CPU when a specific
external device wishes to ‘interrupt’
 Needs to tell the CPU which one among several
devices is the one needing service
PIC translates IRQ to vector
 Raises interrupt to CPU
 Vector available in register
 Waits for ack from CPU
Interrupts can have varying priorities
 PIC also needs to prioritize multiple requests
Possible to “mask” (disable) interrupts at PIC or CPU
Early systems cascaded two 8 input chips (8259A)
APIC, IO-APIC, LAPIC

Advanced PIC (APIC) for SMP systems




Local APIC (LAPIC) versus “frontend” IO-APIC



Used in all modern systems
Interrupts “routed” to CPU over system bus
IPI: inter-processor interrupt
Devices connect to front-end IO-APIC
IO-APIC communicates (over bus) with Local APIC
Interrupt routing




Allows broadcast or selective routing of interrupts
Ability to distribute interrupt handling load
Routes to lowest priority process
 Special register: Task Priority Register (TPR)
Arbitrates (round-robin) if equal priority
Assigning IRQs to Devices







IRQ assignment is hardware-dependent
Sometimes it’s hardwired, sometimes it’s set physically,
sometimes it’s programmable
PCI bus usually assigns IRQs at boot
Some IRQs are fixed by the architecture
 IRQ0: Interval timer
 IRQ2: Cascade pin for 8259A
Linux device drivers request IRQs when the device is opened
Note: especially useful for dynamically-loaded drivers, such as
for USB or PCMCIA devices
Two devices that aren’t used at the same time can share an IRQ,
even if the hardware doesn’t support simultaneous sharing
Assigning Vectors to IRQs


Vector: index (0-255) into interrupt descriptor table
Vectors usually IRQ# + 32




Below 32 reserved for non-maskable intr & exceptions
Maskable interrupts can be assigned as needed
Vector 128 used for syscall
Vectors 251-255 used for IPI
Interrupt Descriptor Table


The ‘entry-point’ to the interrupt-handler is located
via the Interrupt Descriptor Table (IDT)
IDT: “gate descriptors”



Segment selector + offset for handler
Descriptor Privilege Level (DPL)
Gates (slightly different ways of entering kernel)
 Task gate: includes TSS to transfer to (not used by
Linux)


Interrupt gate: disables further interrupts
Trap gate: further interrupts still allowed
Interrupt Masking




Two different types: global and per-IRQ
Global — delays all interrupts
Selective — individual IRQs can be masked
selectively
Selective masking is usually what’s needed
— interference most common from two
interrupts of the same type
Putting It All Together
Memory Bus
IRQs
0
idtr
PIC
INTR
CPU
0
IDT
vector
N
handler
Mask points
255
Dispatching Interrupts







Each interrupt has to be handled by a special
device- or trap-specific routine
Interrupt Descriptor Table (IDT) has gate descriptors
for each interrupt vector
Hardware locates the proper gate descriptor for this
interrupt vector, and locates the new context
A new stack pointer, program counter, CPU and
memory state, etc., are loaded
Global interrupt mask set
The old program counter, stack pointer, CPU and
memory state, etc., are saved on the new stack
The specific handler is invoked
Handling Nested Interrupts




As soon as possible, unmask the global
interrupt
As soon as reasonable, re-enable interrupts
from that IRQ
But that isn’t always a great idea, since it
could cause re-entry to the same handler
IRQ-specific mask is not enabled during
interrupt-handling
Nested Execution

Interrupts can be interrupted





Exceptions can be interrupted


By different interrupts; handlers need not be reentrant
No notion of priority in Linux
Small portions execute with interrupts disabled
Interrupts remain pending until acked by CPU
By interrupts (devices needing service)
Exceptions can nest two levels deep



Exceptions indicate coding error
Exception code (kernel code) shouldn’t have bugs
Page fault is possible (trying to touch user data)
Interrupt Handling Philosophy



Do as little as possible in the interrupt handler
Defer non-critical actions till later
Structure: top and bottom halves




Top-half: do minimum work and return (ISR)
Bottom-half: deferred processing (softirqs,
tasklets)
Again — want to do as little as possible with
IRQ interrupts masked
No process context available
No Process Context





Interrupts (as opposed to exceptions) are not
associated with particular instructions
They’re also not associated with a given
process
The currently-running process, at the time of
the interrupt, as no relationship whatsoever to
that interrupt
Interrupt handlers cannot refer to current
Interrupt handlers cannot sleep!
Interrupt Stack

When an interrupt occurs, what stack is
used?




Exceptions: The kernel stack of the current
process, whatever it is, is used (There’s always
some process running — the “idle” process, if
nothing else)
Interrupts: hard IRQ stack (1 per processor)
SoftIRQs: soft IRQ stack (1 per processor)
These stacks are configured in the IDT and
TSS at boot time by the kernel
First-Level Interrupt Handler




Often in assembler
Perform minimal, common functions: save
registers, unmask other interrupts
Eventually, undoes that: restores registers,
returns to previous context
Most important: call proper second-level
interrupt handler (C program)
Finding the Proper Handler




On modern hardware, multiple I/O devices
can share a single IRQ and hence interrupt
vector
First differentiator is the interrupt vector
Multiple interrupt service routines (ISR) can
be associated with a vector
Each device’s ISR for that IRQ is called; the
determination of whether or not that device
has interrupted is device-dependent
Deferrable Work

We don’t want to do too much in regular interrupt
handlers:






Interrupts are masked
We don’t want the kernel stack to grow too much
Instead, interrupt handlers schedule work to be
performed later
Three deferred work mechanisms: softirqs, tasklets,
and work queues
Tasklets are built on top of softirqs
For all of these, requests are queued
Softirqs


Statically allocated: specified at kernel compile time
Limited number:
Priority Type
0
High-priority tasklets
1
Timer interrupts
2
Network transmission
3
Network reception
4
SCSI disks
5
Regular tasklets
Running Softirqs

Run at various points by the kernel



Most important: after handling IRQs and after timer
interrupts
Softirq routines can be executed simultaneously on
multiple CPUs:



System calls, exceptions
Code must be re-entrant
Code must do its own locking as needed
Hardware interrupts always enabled when softirqs
are running.
Rescheduling Softirqs




A softirq routine can reschedule itself
This could starve user-level processes
Softirq scheduler only runs a limited number
of requests at a time
The rest are executed by a kernel thread,
ksoftirqd, which competes with user
processes for CPU time
Tasklets





Similar to softirqs
Created and destroyed dynamically
Individual tasklets are locked during
execution; no problem about re-entrancy, and
no need for locking by the code
Only one instance of tasklet can run, even
with multiple CPUs
Preferred mechanism for most deferred
activity
Work Queues




Always run by kernel threads
Softirqs and tasklets run in an interrupt
context; work queues have a process context
Because they have a process context, they
can sleep
However, they’re kernel-only; there is no user
mode associated with it
Synchronization
in Linux
COMS W4118
Spring 2008
Linux Synch Primitives

Memory barriers


Atomic operations


Local, global
Spin locks


memory bus lock, read-modify-write ops
Interrupt/softirq disabling/enabling


avoids compiler, cpu instruction re-ordering
general, read/write, big reader
Semaphores

general, read/write
Choosing Synch Primitives

Avoid synch if possible! (clever instruction ordering)



Use atomics or rw spinlocks if possible
Use semaphores if you need to sleep



Example: inserting in linked list (needs barrier still)
Can’t sleep in interrupt context
Don’t sleep holding a spinlock!
Complicated matrix of choices for protecting data
structures accessed by deferred functions
Architectural Dependence




The implementation of the synchronization
primitives is extremely architecture
dependent.
This is because only the hardware can
guarantee atomicity of an operation.
Each architecture must provide a mechanism
for doing an operation that can examine and
modify a storage location atomically.
Some architectures do not guarantee
atomicity, but inform whether the operation
attempted was atomic.
Barriers: Definition



Barriers are used to prevent a processor and/or the compiler
from reordering instruction execution and memory modification.
Barriers are instructions to hardware and/or compiler to complete
all pending accesses before issuing any more
 read memory barrier – acts on read requests
 write memory barrier – acts on write requests
Intel –
 certain instructions act as barriers: lock, iret, control regs
 rmb – asm volatile("lock;addl $0,0(%%esp)":::"memory")


add 0 to top of stack with lock prefix
wmb – Intel never re-orders writes, just for compiler
Atomic Operations

Many instructions not atomic in hardware (smp)



Compiler may not generate atomic code




Read-modify-write instructions: inc, test-and-set, swap
unaligned memory access
even i++ is not necessarily atomic!
If the data that must be protected is a single word,
atomic operations can be used. These functions
examine and modify the word atomically.
The atomic data type is atomic_t.
Intel implementation

lock prefix byte 0xf0 – locks memory bus
Serializing with Interrupts



Basic primitive in original UNIX
Doesn’t protect against other CPUs
Intel: “interrupts enabled bit”


cli to clear (disable), sti to set (enable)
Enabling is often wrong; need to restore


local_irq_save()
local_irq_restore()
Interrupt Operations

Services used to serialize with interrupts are:
local_irq_disable - disables interrupts on the current
CPU
local_irq_enable - enable interrupts on the current
CPU
local_save_flags - return the interrupt state of the
processor
local_restore_flags - restore the interrupt state of the
processor

Dealing with the full interrupt state of the
system is officially discouraged. Locks should
be used.
Spin Locks



A spin lock is a data structure (spinlock_t )
that is used to synchronize access to critical
sections.
Only one thread can be holding a spin lock at
any moment. All other threads trying to get
the lock will “spin” (loop while checking the
lock status).
Spin locks should not be held for long
periods because waiting tasks on other
CPUs are spinning, and thus wasting CPU
execution time.
__raw_spin_lock_string
1: lock; decb %0
jns
3f
2: rep; nop
cmpb $0, %0
jle 2b
jmp 1b
3:
#
#
#
#
#
#
#
atomically decrement
if clear sign bit jump forward to 3
wait
spin – compare to 0
go back to 2 if <= 0 (locked)
unlocked; go back to 1 to try again
we have acquired the lock …
From linux/include/asm-i386/spinlock.h
spin_unlock merely writes 1 into the lock field.
Spin Lock Operations

Functions used to work with spin locks:
spin_lock_init – initialize a spin lock before
using it for the first time
spin_lock – acquire a spin lock, spin waiting
if it is not available
spin_unlock – release a spin lock
spin_unlock_wait – spin waiting for spin lock
to become available, but don't acquire it
spin_trylock – acquire a spin lock if it is
currently free, otherwise return error
spin_is_locked – return spin lock state
Spin Locks & Interrupts

The spin lock services also provide
interfaces that serialize with interrupts
(on the current processor):
spin_lock_irq - acquire spin lock and disable
interrupts
spin_unlock_irq - release spin lock and
reenable
spin_lock_irqsave - acquire spin lock, save
interrupt state, and disable
spin_unlock_irqrestore - release spin lock
and restore interrupt state
RW Spin Locks & Interrupts

The read/write lock services also
provide interfaces that serialize with
interrupts (on the current processor):
read_lock_irq - acquire lock for read and
disable interrupts
read_unlock_irq - release read lock and
reenable
read_lock_irqsave - acquire lock for read,
save interrupt state, and disable
read_unlock_irqrestore - release read lock
and restore interrupt state

Corresponding functions for write exist
as well (e.g., write_lock_irqsave).
The Big Reader Lock


Reader optimized RW spinlock
RW spinlock suffers cache contention


Per-CPU, cache-aligned lock arrays


One for reader portion, another for writer portion
To read: set bit in reader array, spin on writer


On lock and unlock because of write to rwlock_t
Acquire when writer lock free; very fast!
To write: set bit and scan ALL reader bits

Acquire when reader bits all free; very slow!
Semaphores



A semaphore is a data structure that is used
to synchronize access to critical sections or
other resources.
A semaphore allows a fixed number of tasks
(generally one for critical sections) to "hold"
the semaphore at one time. Any more tasks
requesting to hold the semaphore are
blocked (put to sleep).
A semaphore can be used for serialization
only in code that is allowed to block.
Semaphore Structure

Struct semaphore
 count (atomic_t):





wait: wait queue
sleepers:


0 (none),
1 (some), occasionally 2
wait: wait queue
implementation requires lower-level
synch
 atomic updates, spinlock, interrupt
disabling


> 0: free;
= 0: in use, no waiters;
< 0: in use, waiters
struct semaphore
atomic_t count
int sleepers
wait_queue_head_t wait
lock
prev
next
Semaphores

optimized assembly code for normal case (down())


up() is easy


atomically decrement; continue
contended down() is really complex!



atomically increment; wake_up() if necessary
uncontended down() is easy


C code for slower “contended” case (__down())
basically increment sleepers and sleep
loop because of potentially concurrent ups/downs
still in down() path when lock is acquired
RW Semaphores





A rw_semaphore is a semaphore that allows
either one writer or any number of readers (but
not both at the same time) to hold it.
Any writer requesting to hold the rw_semaphore
is blocked when there are readers holding it.
A rw_semaphore can be used for serialization
only in code that is allowed to block. Both types
of semaphores are the only synchronization
objects that should be held when blocking.
Writers will not starve: once a writer arrives,
readers queue behind it
Increases concurrency; introduced in 2.4
Mutexes
A mutex is a data structure that is also
used to synchronize access to critical
sections or other resources, introduced
in 2.6.16.
 Why? (Documentation/mutex-design.txt)






simpler (lighter weight)
tighter code
slightly faster, better scalability
no fastpath tradeoffs
debug support – strict checking of
adhering to semantics
Linux Virtual File
System (VFS) and Ext2
COMS W4118
Spring 2008
Linux File System Model

basically UNIX file semantics



File systems are mounted at various points
Files identified by device inode numbers
VFS layer just dispatches to fs-specific functions

libc read() -> sys_read()





what type of filesystem does this file belong to?
call filesystem (fs) specific read function
maintained in open file object (file)
example: file->f_op->read(…)
similar to device abstraction model in UNIX
VFS System Calls

fundamental UNIX abstractions




files (everything is a file)
 ex: /dev/ttyS0 – device as a file
 ex: /proc/123 – process as a file
processes
users
lots of syscalls related to files! (~100)



most dispatch to filesystem-specific calls
some require no filesystem action
 example: lseek(pos) – change position in file
others have default VFS implementations
VFS System Calls (cont.)








filesystem ops – mounting, info, flushing, chroot, pivot_root
directory ops – chdir, getcwd, link, unlink, rename, symlink
file ops – open/close, (p)read(v)/(p)write(v), seek, truncate, dup
fcntl, creat,
inode ops – stat, permissions, chmod, chown
memory mapping files – mmap, munmap, madvise, mlock
wait for input – poll, select
flushing – sync, fsync, msync, fdatasync
file locking – flock
Big Four Data Structures




struct file
 information about an open file
 includes current position (file pointer)
struct dentry
 information about a directory entry
 includes name + inode#
struct inode
 unique descriptor of a file or directory
 contains permissions, timestamps, block map (data)
 inode#: integer (unique per mounted filesystem)
struct superblock
 descriptor of a mounted filesystem
Two More Data Structures

struct file_system_type

pointer to implementing module
 including how to read a superblock
 On module load, you call register_file_system and pass a pointer
to this structure
struct vfsmount
 Represents a mounted instance of a particular file system
 One super block can be mounted in two places, with different
covering sub mounts
 Thus lookup requires parent dentry and a vfsmount


name of file system
Data Structure Relationships
task_struct
inode cache
inode
inode
inode
fds
open
file
object
open
file
object
open
file
object
superblock
dentry
dentry
dentry
dentry
dentry
dentry cache
disk
Sharing Data Structures

calling dup() –



opening the same file twice –


shares dentries
opening same file via different hard links –


shares open file objects
example: 2>&1
shares inodes
mounting same filesystem on different dirs –

shares superblocks
Superblock

mounted filesystem descriptor


usually first block on disk (after boot block)
copied into (similar) memory structure on mount



distinction: disk superblock vs memory superblock
dirty bit (s_dirt), copied to disk frequently
important fields






s_dev, s_bdev – device, device-driver
s_blocksize, s_maxbytes, s_type
s_flags, s_magic, s_count, s_root, s_dquot
s_dirty – dirty inodes for this filesystem
s_op – superblock operations
u – filesystem specific data
Inode

"index" node – unique file or directory descriptor




meta-data: permissions, owner, timestamps, size, link
count
data: pointers to disk blocks containing actual data
 data pointers are "indices" into file contents (hence "inode")
inode # - unique integer (per-mounted filesystem)
what about names and paths?



high-level fluff on top of a "flat-filesystem"
implemented by directory files (directories)
directory contents: name + inode
File Links

UNIX link semantics

hard links – multiple dir entries with same inode #
 equal status; first is not "real" entry
 file deleted when link count goes to 0
 restrictions



can't hard link to directories (avoids cycles)
or across filesystems
soft (symbolic) links – little files with pathnames
 just aliases for another pathname
 no restrictions, cycles possible, dangling links possible
(Open) File Object

struct file (usual variable name - filp)





file descriptor (small ints)


association between file and process
no disk representation
created for each open (multiple possible, even same file)
most important info: file pointer
index into array of pointers to open file objects
file object states

unused (memory cache + root reserve (10))



get_empty_filp()
inuse (per-superblock lists)
system-wide max on open file objects (~8K)

/proc/sys/fs/file-max
Dentry

abstraction of directory entry





directory api: dentry iterators



ex: line from ls -l
either files (hard links) or soft links or subdirectories
every dentry has a parent dentry (except root)
sibling dentries – other entries in the same directory
posix: opendir(), readdir(), scandir(), seekdir(), rewinddir()
syscall: getdents()
why an abstraction?


UNIX: directories are really files with directory "records"
MSDOS, etc.: directory is just a big table on disk (FAT)


no such thing as subdirectories!
just fields in table (file->parentdir), (dir->parentdir)
Dentry Cache

very important cache for filesystem performance



dentry cache "controls" inode cache


every file access causes multiple dentry accesses!
example: /tmp/foo
 dentries for "/", "/tmp", "/tmp/foo" (path components)
inodes released only when dentry is released
dentry cache accessed via hash table

hash(dir, filename) -> dentry
Dentry Cache (continued)

dentry states





free (not valid; maintained by slab cache)
in-use (associated with valid open inode)
unused (valid but not being used; LRU list)
negative (file that does not exist)
dentry ops

just a few, mostly default actions

ex: d_compare(dir, name1, name2)

case-insensitive for MSDOS
Process-related Files

current->fs (fs_struct)




root (for chroot jails)
pwd
umask (default file permissions)
current->files (files_struct)



fd[] (file descriptor array – pointers to file objects)
 0, 1, 2 – stdin, stdout, stderr
originally 32, growable to 1,024 (RLIMIT_NOFILE)
 complex structure for growing … see book
close_on_exec memory (bitmap)
 open files normally inherited across exec
Filesystem Types

Linux must "know about" filesystem before mount


multiple (mounted) instances of each type possible
special (virtual) filesystems (like /proc)



structuring technique to touch kernel data
examples:
 /proc, /dev (devfs)
 sockfs, pipefs, tmpfs, rootfs, shmfs
associated with fictitious block device (major# 0)

minor# distinguishes special filesystem types
Registering a Filesystem Type

must register before mount


register_filesystem() / unregister_filesystem


static (compile-time) or dynamic (modules)
adds file_system_type object to linked-list
 file_systems (head; kernel global variable)
 file_systems_lock (rw spinlock to protect list)
file_system_type descriptor



name, flags, pointer to implementing module
list of superblocks (mounted instances)
read_super() – pointer to method for reading superblock
 most important thing! filesystem specific
Data Structure Relationships (2)
open
file
object
f_dentry
dentry
d_subdirs
open
file
object
dentry
dentry
inode
inode
dentry inode
┴
open
file
object
d_inode
d_subdirs
d_parent
i_sb
superblock
d_sb
i_dentries
Ext2

“Standard” Linux File System



Uses FFS like layout



Previously it was the most commonly used
Serves as a basis for Ext3 which adds journaling
Each file system is composed of identical block groups
Allocation is designed to improve locality
Inodes contain pointers (32 bits) to blocks



Direct, Indirect, Double Indirect, Triple Indirect
Maximum file size: 4.1TB (4K Blocks)
Maximum file system size: 16TB (4K Blocks)
Ext2 Disk Layout


Files in the same directory are stored in the
same block group
Files in different directories are spread
among the block groups
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Block Addressing in Ext2
Twelve “direct” blocks
Inode
Data
Data
Block
Data
Block
Block
BLKSIZE/4
Indirect
Blocks
Indirect
Blocks
(BLKSIZE/4)2
Double
Indirect
Indirect
Blocks
(BLKSIZE/4)3
Triple
Indirect
Double
Indirect
Indirect
Blocks
Indirect
Blocks
Data
Data
Block
Data
Block
Block
Data
Data
Block
Data
Block
Block
Data
Data
Block
Data
Block
Block
Data
Data
Block
Data
Block
Block
Data
Data
Block
Data
Block
Block
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Ext2 Directory Structure
(a) A Linux directory with three files. (b) The same directory
after the file voluminous has been removed.
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639