linux-review2
Download
Report
Transcript linux-review2
Linux Review
nd
(2 Half)
COMS W4118
Spring 2008
Interrupts
in Linux
COMS W4118
Spring 2008
Interrupts
Forcibly change normal flow of control
Similar to context switch (but lighter weight)
Hardware saves some context on stack; Includes
interrupted instruction if restart needed
Enters kernel at a specific point; kernel then
figures out which interrupt handler should run
Execution resumes with special “iret” instruction
Many different types of interrupts
Types of Interrupts
Asynchronous
From external source, such as I/O device
Not related to instruction being executed
Synchronous (also called exceptions)
Processor-detected exceptions:
Faults — correctable; offending instruction is retried
Traps — often for debugging; instruction is not retried
Aborts — major error (hardware failure)
Programmed exceptions:
Requests for kernel intervention (software intr/syscalls)
CPU’s ‘fetch-execute’ cycle
User
Program
Fetch instruction at IP
ld
add
IP
st
Save context
Decode the fetched instruction
mul
ld
Get INTR ID
Execute the decoded instruction
sub
Lookup ISR
bne
add
Advance IP to next instruction
Execute ISR
jmp
…
IRQ?
no
yes
IRET
Faults
Instruction would be illegal to execute
Examples:
Writing to a memory segment marked ‘read-only’
Reading from an unavailable memory segment (on disk)
Executing a ‘privileged’ instruction
Detected before incrementing the IP
The causes of ‘faults’ can often be ‘fixed’
If a ‘problem’ can be remedied, then the CPU can
just resume its execution-cycle
Traps
A CPU might have been programmed to
automatically switch control to a ‘debugger’
program after it has executed an instruction
That type of situation is known as a ‘trap’
It is activated after incrementing the IP
Error Exceptions
Most error exceptions — divide by zero, invalid
operation, illegal memory reference, etc. — translate
directly into signals
This isn’t a coincidence. . .
The kernel’s job is fairly simple: send the
appropriate signal to the current process
force_sig(sig_number, current);
That will probably kill the process, but that’s not the
concern of the exception handler
One important exception: page fault
An exception can (infrequently) happen in the kernel
die(); // kernel oops
Interrupt Hardware
Legacy PC Design
(for single-proc
systems)
Ethernet
IRQs
Slave
PIC
(8259)
SCSI Disk
Master
PIC
(8259)
INTR
x86
CPU
Real-Time Clock
Keyboard Controller
Programmable Interval-Timer
I/O devices have (unique or shared) Interrupt Request
Lines (IRQs)
IRQs are mapped by special hardware to interrupt
vectors, and passed to the CPU
This hardware is called a Programmable Interrupt
Controller (PIC)
The `Interrupt Controller’
Responsible for telling the CPU when a specific
external device wishes to ‘interrupt’
Needs to tell the CPU which one among several
devices is the one needing service
PIC translates IRQ to vector
Raises interrupt to CPU
Vector available in register
Waits for ack from CPU
Interrupts can have varying priorities
PIC also needs to prioritize multiple requests
Possible to “mask” (disable) interrupts at PIC or CPU
Early systems cascaded two 8 input chips (8259A)
APIC, IO-APIC, LAPIC
Advanced PIC (APIC) for SMP systems
Local APIC (LAPIC) versus “frontend” IO-APIC
Used in all modern systems
Interrupts “routed” to CPU over system bus
IPI: inter-processor interrupt
Devices connect to front-end IO-APIC
IO-APIC communicates (over bus) with Local APIC
Interrupt routing
Allows broadcast or selective routing of interrupts
Ability to distribute interrupt handling load
Routes to lowest priority process
Special register: Task Priority Register (TPR)
Arbitrates (round-robin) if equal priority
Assigning IRQs to Devices
IRQ assignment is hardware-dependent
Sometimes it’s hardwired, sometimes it’s set physically,
sometimes it’s programmable
PCI bus usually assigns IRQs at boot
Some IRQs are fixed by the architecture
IRQ0: Interval timer
IRQ2: Cascade pin for 8259A
Linux device drivers request IRQs when the device is opened
Note: especially useful for dynamically-loaded drivers, such as
for USB or PCMCIA devices
Two devices that aren’t used at the same time can share an IRQ,
even if the hardware doesn’t support simultaneous sharing
Assigning Vectors to IRQs
Vector: index (0-255) into interrupt descriptor table
Vectors usually IRQ# + 32
Below 32 reserved for non-maskable intr & exceptions
Maskable interrupts can be assigned as needed
Vector 128 used for syscall
Vectors 251-255 used for IPI
Interrupt Descriptor Table
The ‘entry-point’ to the interrupt-handler is located
via the Interrupt Descriptor Table (IDT)
IDT: “gate descriptors”
Segment selector + offset for handler
Descriptor Privilege Level (DPL)
Gates (slightly different ways of entering kernel)
Task gate: includes TSS to transfer to (not used by
Linux)
Interrupt gate: disables further interrupts
Trap gate: further interrupts still allowed
Interrupt Masking
Two different types: global and per-IRQ
Global — delays all interrupts
Selective — individual IRQs can be masked
selectively
Selective masking is usually what’s needed
— interference most common from two
interrupts of the same type
Putting It All Together
Memory Bus
IRQs
0
idtr
PIC
INTR
CPU
0
IDT
vector
N
handler
Mask points
255
Dispatching Interrupts
Each interrupt has to be handled by a special
device- or trap-specific routine
Interrupt Descriptor Table (IDT) has gate descriptors
for each interrupt vector
Hardware locates the proper gate descriptor for this
interrupt vector, and locates the new context
A new stack pointer, program counter, CPU and
memory state, etc., are loaded
Global interrupt mask set
The old program counter, stack pointer, CPU and
memory state, etc., are saved on the new stack
The specific handler is invoked
Handling Nested Interrupts
As soon as possible, unmask the global
interrupt
As soon as reasonable, re-enable interrupts
from that IRQ
But that isn’t always a great idea, since it
could cause re-entry to the same handler
IRQ-specific mask is not enabled during
interrupt-handling
Nested Execution
Interrupts can be interrupted
Exceptions can be interrupted
By different interrupts; handlers need not be reentrant
No notion of priority in Linux
Small portions execute with interrupts disabled
Interrupts remain pending until acked by CPU
By interrupts (devices needing service)
Exceptions can nest two levels deep
Exceptions indicate coding error
Exception code (kernel code) shouldn’t have bugs
Page fault is possible (trying to touch user data)
Interrupt Handling Philosophy
Do as little as possible in the interrupt handler
Defer non-critical actions till later
Structure: top and bottom halves
Top-half: do minimum work and return (ISR)
Bottom-half: deferred processing (softirqs,
tasklets)
Again — want to do as little as possible with
IRQ interrupts masked
No process context available
No Process Context
Interrupts (as opposed to exceptions) are not
associated with particular instructions
They’re also not associated with a given
process
The currently-running process, at the time of
the interrupt, as no relationship whatsoever to
that interrupt
Interrupt handlers cannot refer to current
Interrupt handlers cannot sleep!
Interrupt Stack
When an interrupt occurs, what stack is
used?
Exceptions: The kernel stack of the current
process, whatever it is, is used (There’s always
some process running — the “idle” process, if
nothing else)
Interrupts: hard IRQ stack (1 per processor)
SoftIRQs: soft IRQ stack (1 per processor)
These stacks are configured in the IDT and
TSS at boot time by the kernel
First-Level Interrupt Handler
Often in assembler
Perform minimal, common functions: save
registers, unmask other interrupts
Eventually, undoes that: restores registers,
returns to previous context
Most important: call proper second-level
interrupt handler (C program)
Finding the Proper Handler
On modern hardware, multiple I/O devices
can share a single IRQ and hence interrupt
vector
First differentiator is the interrupt vector
Multiple interrupt service routines (ISR) can
be associated with a vector
Each device’s ISR for that IRQ is called; the
determination of whether or not that device
has interrupted is device-dependent
Deferrable Work
We don’t want to do too much in regular interrupt
handlers:
Interrupts are masked
We don’t want the kernel stack to grow too much
Instead, interrupt handlers schedule work to be
performed later
Three deferred work mechanisms: softirqs, tasklets,
and work queues
Tasklets are built on top of softirqs
For all of these, requests are queued
Softirqs
Statically allocated: specified at kernel compile time
Limited number:
Priority Type
0
High-priority tasklets
1
Timer interrupts
2
Network transmission
3
Network reception
4
SCSI disks
5
Regular tasklets
Running Softirqs
Run at various points by the kernel
Most important: after handling IRQs and after timer
interrupts
Softirq routines can be executed simultaneously on
multiple CPUs:
System calls, exceptions
Code must be re-entrant
Code must do its own locking as needed
Hardware interrupts always enabled when softirqs
are running.
Rescheduling Softirqs
A softirq routine can reschedule itself
This could starve user-level processes
Softirq scheduler only runs a limited number
of requests at a time
The rest are executed by a kernel thread,
ksoftirqd, which competes with user
processes for CPU time
Tasklets
Similar to softirqs
Created and destroyed dynamically
Individual tasklets are locked during
execution; no problem about re-entrancy, and
no need for locking by the code
Only one instance of tasklet can run, even
with multiple CPUs
Preferred mechanism for most deferred
activity
Work Queues
Always run by kernel threads
Softirqs and tasklets run in an interrupt
context; work queues have a process context
Because they have a process context, they
can sleep
However, they’re kernel-only; there is no user
mode associated with it
Synchronization
in Linux
COMS W4118
Spring 2008
Linux Synch Primitives
Memory barriers
Atomic operations
Local, global
Spin locks
memory bus lock, read-modify-write ops
Interrupt/softirq disabling/enabling
avoids compiler, cpu instruction re-ordering
general, read/write, big reader
Semaphores
general, read/write
Choosing Synch Primitives
Avoid synch if possible! (clever instruction ordering)
Use atomics or rw spinlocks if possible
Use semaphores if you need to sleep
Example: inserting in linked list (needs barrier still)
Can’t sleep in interrupt context
Don’t sleep holding a spinlock!
Complicated matrix of choices for protecting data
structures accessed by deferred functions
Architectural Dependence
The implementation of the synchronization
primitives is extremely architecture
dependent.
This is because only the hardware can
guarantee atomicity of an operation.
Each architecture must provide a mechanism
for doing an operation that can examine and
modify a storage location atomically.
Some architectures do not guarantee
atomicity, but inform whether the operation
attempted was atomic.
Barriers: Definition
Barriers are used to prevent a processor and/or the compiler
from reordering instruction execution and memory modification.
Barriers are instructions to hardware and/or compiler to complete
all pending accesses before issuing any more
read memory barrier – acts on read requests
write memory barrier – acts on write requests
Intel –
certain instructions act as barriers: lock, iret, control regs
rmb – asm volatile("lock;addl $0,0(%%esp)":::"memory")
add 0 to top of stack with lock prefix
wmb – Intel never re-orders writes, just for compiler
Atomic Operations
Many instructions not atomic in hardware (smp)
Compiler may not generate atomic code
Read-modify-write instructions: inc, test-and-set, swap
unaligned memory access
even i++ is not necessarily atomic!
If the data that must be protected is a single word,
atomic operations can be used. These functions
examine and modify the word atomically.
The atomic data type is atomic_t.
Intel implementation
lock prefix byte 0xf0 – locks memory bus
Serializing with Interrupts
Basic primitive in original UNIX
Doesn’t protect against other CPUs
Intel: “interrupts enabled bit”
cli to clear (disable), sti to set (enable)
Enabling is often wrong; need to restore
local_irq_save()
local_irq_restore()
Interrupt Operations
Services used to serialize with interrupts are:
local_irq_disable - disables interrupts on the current
CPU
local_irq_enable - enable interrupts on the current
CPU
local_save_flags - return the interrupt state of the
processor
local_restore_flags - restore the interrupt state of the
processor
Dealing with the full interrupt state of the
system is officially discouraged. Locks should
be used.
Spin Locks
A spin lock is a data structure (spinlock_t )
that is used to synchronize access to critical
sections.
Only one thread can be holding a spin lock at
any moment. All other threads trying to get
the lock will “spin” (loop while checking the
lock status).
Spin locks should not be held for long
periods because waiting tasks on other
CPUs are spinning, and thus wasting CPU
execution time.
__raw_spin_lock_string
1: lock; decb %0
jns
3f
2: rep; nop
cmpb $0, %0
jle 2b
jmp 1b
3:
#
#
#
#
#
#
#
atomically decrement
if clear sign bit jump forward to 3
wait
spin – compare to 0
go back to 2 if <= 0 (locked)
unlocked; go back to 1 to try again
we have acquired the lock …
From linux/include/asm-i386/spinlock.h
spin_unlock merely writes 1 into the lock field.
Spin Lock Operations
Functions used to work with spin locks:
spin_lock_init – initialize a spin lock before
using it for the first time
spin_lock – acquire a spin lock, spin waiting
if it is not available
spin_unlock – release a spin lock
spin_unlock_wait – spin waiting for spin lock
to become available, but don't acquire it
spin_trylock – acquire a spin lock if it is
currently free, otherwise return error
spin_is_locked – return spin lock state
Spin Locks & Interrupts
The spin lock services also provide
interfaces that serialize with interrupts
(on the current processor):
spin_lock_irq - acquire spin lock and disable
interrupts
spin_unlock_irq - release spin lock and
reenable
spin_lock_irqsave - acquire spin lock, save
interrupt state, and disable
spin_unlock_irqrestore - release spin lock
and restore interrupt state
RW Spin Locks & Interrupts
The read/write lock services also
provide interfaces that serialize with
interrupts (on the current processor):
read_lock_irq - acquire lock for read and
disable interrupts
read_unlock_irq - release read lock and
reenable
read_lock_irqsave - acquire lock for read,
save interrupt state, and disable
read_unlock_irqrestore - release read lock
and restore interrupt state
Corresponding functions for write exist
as well (e.g., write_lock_irqsave).
The Big Reader Lock
Reader optimized RW spinlock
RW spinlock suffers cache contention
Per-CPU, cache-aligned lock arrays
One for reader portion, another for writer portion
To read: set bit in reader array, spin on writer
On lock and unlock because of write to rwlock_t
Acquire when writer lock free; very fast!
To write: set bit and scan ALL reader bits
Acquire when reader bits all free; very slow!
Semaphores
A semaphore is a data structure that is used
to synchronize access to critical sections or
other resources.
A semaphore allows a fixed number of tasks
(generally one for critical sections) to "hold"
the semaphore at one time. Any more tasks
requesting to hold the semaphore are
blocked (put to sleep).
A semaphore can be used for serialization
only in code that is allowed to block.
Semaphore Structure
Struct semaphore
count (atomic_t):
wait: wait queue
sleepers:
0 (none),
1 (some), occasionally 2
wait: wait queue
implementation requires lower-level
synch
atomic updates, spinlock, interrupt
disabling
> 0: free;
= 0: in use, no waiters;
< 0: in use, waiters
struct semaphore
atomic_t count
int sleepers
wait_queue_head_t wait
lock
prev
next
Semaphores
optimized assembly code for normal case (down())
up() is easy
atomically decrement; continue
contended down() is really complex!
atomically increment; wake_up() if necessary
uncontended down() is easy
C code for slower “contended” case (__down())
basically increment sleepers and sleep
loop because of potentially concurrent ups/downs
still in down() path when lock is acquired
RW Semaphores
A rw_semaphore is a semaphore that allows
either one writer or any number of readers (but
not both at the same time) to hold it.
Any writer requesting to hold the rw_semaphore
is blocked when there are readers holding it.
A rw_semaphore can be used for serialization
only in code that is allowed to block. Both types
of semaphores are the only synchronization
objects that should be held when blocking.
Writers will not starve: once a writer arrives,
readers queue behind it
Increases concurrency; introduced in 2.4
Mutexes
A mutex is a data structure that is also
used to synchronize access to critical
sections or other resources, introduced
in 2.6.16.
Why? (Documentation/mutex-design.txt)
simpler (lighter weight)
tighter code
slightly faster, better scalability
no fastpath tradeoffs
debug support – strict checking of
adhering to semantics
Linux Virtual File
System (VFS) and Ext2
COMS W4118
Spring 2008
Linux File System Model
basically UNIX file semantics
File systems are mounted at various points
Files identified by device inode numbers
VFS layer just dispatches to fs-specific functions
libc read() -> sys_read()
what type of filesystem does this file belong to?
call filesystem (fs) specific read function
maintained in open file object (file)
example: file->f_op->read(…)
similar to device abstraction model in UNIX
VFS System Calls
fundamental UNIX abstractions
files (everything is a file)
ex: /dev/ttyS0 – device as a file
ex: /proc/123 – process as a file
processes
users
lots of syscalls related to files! (~100)
most dispatch to filesystem-specific calls
some require no filesystem action
example: lseek(pos) – change position in file
others have default VFS implementations
VFS System Calls (cont.)
filesystem ops – mounting, info, flushing, chroot, pivot_root
directory ops – chdir, getcwd, link, unlink, rename, symlink
file ops – open/close, (p)read(v)/(p)write(v), seek, truncate, dup
fcntl, creat,
inode ops – stat, permissions, chmod, chown
memory mapping files – mmap, munmap, madvise, mlock
wait for input – poll, select
flushing – sync, fsync, msync, fdatasync
file locking – flock
Big Four Data Structures
struct file
information about an open file
includes current position (file pointer)
struct dentry
information about a directory entry
includes name + inode#
struct inode
unique descriptor of a file or directory
contains permissions, timestamps, block map (data)
inode#: integer (unique per mounted filesystem)
struct superblock
descriptor of a mounted filesystem
Two More Data Structures
struct file_system_type
pointer to implementing module
including how to read a superblock
On module load, you call register_file_system and pass a pointer
to this structure
struct vfsmount
Represents a mounted instance of a particular file system
One super block can be mounted in two places, with different
covering sub mounts
Thus lookup requires parent dentry and a vfsmount
name of file system
Data Structure Relationships
task_struct
inode cache
inode
inode
inode
fds
open
file
object
open
file
object
open
file
object
superblock
dentry
dentry
dentry
dentry
dentry
dentry cache
disk
Sharing Data Structures
calling dup() –
opening the same file twice –
shares dentries
opening same file via different hard links –
shares open file objects
example: 2>&1
shares inodes
mounting same filesystem on different dirs –
shares superblocks
Superblock
mounted filesystem descriptor
usually first block on disk (after boot block)
copied into (similar) memory structure on mount
distinction: disk superblock vs memory superblock
dirty bit (s_dirt), copied to disk frequently
important fields
s_dev, s_bdev – device, device-driver
s_blocksize, s_maxbytes, s_type
s_flags, s_magic, s_count, s_root, s_dquot
s_dirty – dirty inodes for this filesystem
s_op – superblock operations
u – filesystem specific data
Inode
"index" node – unique file or directory descriptor
meta-data: permissions, owner, timestamps, size, link
count
data: pointers to disk blocks containing actual data
data pointers are "indices" into file contents (hence "inode")
inode # - unique integer (per-mounted filesystem)
what about names and paths?
high-level fluff on top of a "flat-filesystem"
implemented by directory files (directories)
directory contents: name + inode
File Links
UNIX link semantics
hard links – multiple dir entries with same inode #
equal status; first is not "real" entry
file deleted when link count goes to 0
restrictions
can't hard link to directories (avoids cycles)
or across filesystems
soft (symbolic) links – little files with pathnames
just aliases for another pathname
no restrictions, cycles possible, dangling links possible
(Open) File Object
struct file (usual variable name - filp)
file descriptor (small ints)
association between file and process
no disk representation
created for each open (multiple possible, even same file)
most important info: file pointer
index into array of pointers to open file objects
file object states
unused (memory cache + root reserve (10))
get_empty_filp()
inuse (per-superblock lists)
system-wide max on open file objects (~8K)
/proc/sys/fs/file-max
Dentry
abstraction of directory entry
directory api: dentry iterators
ex: line from ls -l
either files (hard links) or soft links or subdirectories
every dentry has a parent dentry (except root)
sibling dentries – other entries in the same directory
posix: opendir(), readdir(), scandir(), seekdir(), rewinddir()
syscall: getdents()
why an abstraction?
UNIX: directories are really files with directory "records"
MSDOS, etc.: directory is just a big table on disk (FAT)
no such thing as subdirectories!
just fields in table (file->parentdir), (dir->parentdir)
Dentry Cache
very important cache for filesystem performance
dentry cache "controls" inode cache
every file access causes multiple dentry accesses!
example: /tmp/foo
dentries for "/", "/tmp", "/tmp/foo" (path components)
inodes released only when dentry is released
dentry cache accessed via hash table
hash(dir, filename) -> dentry
Dentry Cache (continued)
dentry states
free (not valid; maintained by slab cache)
in-use (associated with valid open inode)
unused (valid but not being used; LRU list)
negative (file that does not exist)
dentry ops
just a few, mostly default actions
ex: d_compare(dir, name1, name2)
case-insensitive for MSDOS
Process-related Files
current->fs (fs_struct)
root (for chroot jails)
pwd
umask (default file permissions)
current->files (files_struct)
fd[] (file descriptor array – pointers to file objects)
0, 1, 2 – stdin, stdout, stderr
originally 32, growable to 1,024 (RLIMIT_NOFILE)
complex structure for growing … see book
close_on_exec memory (bitmap)
open files normally inherited across exec
Filesystem Types
Linux must "know about" filesystem before mount
multiple (mounted) instances of each type possible
special (virtual) filesystems (like /proc)
structuring technique to touch kernel data
examples:
/proc, /dev (devfs)
sockfs, pipefs, tmpfs, rootfs, shmfs
associated with fictitious block device (major# 0)
minor# distinguishes special filesystem types
Registering a Filesystem Type
must register before mount
register_filesystem() / unregister_filesystem
static (compile-time) or dynamic (modules)
adds file_system_type object to linked-list
file_systems (head; kernel global variable)
file_systems_lock (rw spinlock to protect list)
file_system_type descriptor
name, flags, pointer to implementing module
list of superblocks (mounted instances)
read_super() – pointer to method for reading superblock
most important thing! filesystem specific
Data Structure Relationships (2)
open
file
object
f_dentry
dentry
d_subdirs
open
file
object
dentry
dentry
inode
inode
dentry inode
┴
open
file
object
d_inode
d_subdirs
d_parent
i_sb
superblock
d_sb
i_dentries
Ext2
“Standard” Linux File System
Uses FFS like layout
Previously it was the most commonly used
Serves as a basis for Ext3 which adds journaling
Each file system is composed of identical block groups
Allocation is designed to improve locality
Inodes contain pointers (32 bits) to blocks
Direct, Indirect, Double Indirect, Triple Indirect
Maximum file size: 4.1TB (4K Blocks)
Maximum file system size: 16TB (4K Blocks)
Ext2 Disk Layout
Files in the same directory are stored in the
same block group
Files in different directories are spread
among the block groups
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Block Addressing in Ext2
Twelve “direct” blocks
Inode
Data
Data
Block
Data
Block
Block
BLKSIZE/4
Indirect
Blocks
Indirect
Blocks
(BLKSIZE/4)2
Double
Indirect
Indirect
Blocks
(BLKSIZE/4)3
Triple
Indirect
Double
Indirect
Indirect
Blocks
Indirect
Blocks
Data
Data
Block
Data
Block
Block
Data
Data
Block
Data
Block
Block
Data
Data
Block
Data
Block
Block
Data
Data
Block
Data
Block
Block
Data
Data
Block
Data
Block
Block
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
Ext2 Directory Structure
(a) A Linux directory with three files. (b) The same directory
after the file voluminous has been removed.
Picture from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639