Transcript Lesson16

Page-Faults in Linux
How can we study the handling of
page-fault exceptions?
Why page-faults happen
• Trying to access a virtual memory-address
• Instruction-operand / instruction-address
• Read-data/write-data, or fetch-instruction
• Maybe page is ‘not present’
• Maybe page is ‘not readable’
• Maybe page is ‘not writable’
• Maybe page is ‘not visible’
Page-fault examples
movl  %eax, (%ebx)    ; writable?
movl  (%ebx), %eax    ; readable?
jmp   ahead           ; present?
Everything depends on the entries in the
current page-directory and page-tables,
and on the cpu’s Current Privilege Level
Current Privilege Level (CPL)
Layout of segment-register contents (16 bits):
bits 15…3: segment-selector index
bit 2: TI = Table-Indicator
bits 1…0: RPL = Requested Privilege Level
CPL is determined by the value of the RPL field in CS and SS
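The field layout above can be sketched in C as a small decoder. This is only an illustration: the names `selector_fields` and `decode_selector` are hypothetical, and the selector values mentioned afterwards are made-up examples, not values taken from any particular kernel.

```c
#include <stdint.h>

/* Fields of a 16-bit segment selector (hypothetical helper type). */
struct selector_fields {
    unsigned index;  /* bits 15..3: index into the descriptor table */
    unsigned ti;     /* bit 2: Table-Indicator (0 = GDT, 1 = LDT)   */
    unsigned rpl;    /* bits 1..0: Requested Privilege Level        */
};

/* Split a selector value into its three fields. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;
    f.ti    = (sel >> 2) & 1;
    f.rpl   = sel & 3;
    return f;
}
```

For instance, a selector of 0x0073 decodes to index 14, TI 0, RPL 3 (user mode), while 0x0010 decodes to index 2, TI 0, RPL 0 (kernel mode).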
What does the CPU do?
• Whenever the cpu detects a page-fault, its
action depends on Current Privilege Level
• If CPL == 0 (executing in kernel mode):
1) push EFLAGS register
2) push CS register
3) push EIP register
4) push error-code
5) jump to page-fault service-routine
Alternative action in user-mode
• If CPL == 3 (executing in user mode)
the CPU will switch to its kernel-mode stack:
0) push SS and ESP
1) push EFLAGS
2) push CS
3) push EIP
4) push error-code
5) jump to the page-fault service-routine
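Because the stack grows downward, the last item pushed sits at the lowest address. A rough sketch of what the service-routine finds on the kernel stack after a fault from user mode is shown here; the struct and helper names are hypothetical and serve only to picture the push order listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Kernel-stack contents on entry, lowest address first (hypothetical type). */
struct user_fault_frame {
    uint32_t error_code;  /* pushed last, so it is at the top of the stack */
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;         /* ESP and SS are present only when the CPU */
    uint32_t ss;          /* switched stacks (fault from CPL == 3)    */
};

/* The RPL field of the saved CS tells which mode was interrupted. */
static int fault_came_from_user(const struct user_fault_frame *f)
{
    return (f->cs & 3) == 3;
}
```

The first four fields sit at the same offsets in the kernel-mode case, so the saved CS can be inspected before deciding whether the ESP and SS slots are meaningful.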
How CPU finds new stack
• Special CPU segment-register: TR
• TR is the ‘Task Register’
• TR holds ‘selector’ for a GDT descriptor
• Descriptor is for a ‘Task State Segment’
• So TR points indirectly to current TSS
• TSS stores address of kernel-mode stack
Stack Switching mechanism
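[Diagram: IDTR locates the INTERRUPT DESCRIPTOR TABLE, whose gate descriptor leads to the kernel code. GDTR locates the GLOBAL DESCRIPTOR TABLE, where TR selects the TSS descriptor. That descriptor locates the TASK STATE SEGMENT, whose SS0 and ESP0 fields give the kernel stack. In user-space, CS and EIP refer to the user code, and SS and ESP to the user stack.]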
Let’s ‘intercept’ page-faults
• Use our systems programming knowledge
• We build a ‘new’ Interrupt Descriptor Table
• With our own ‘customized’ interrupt-gates
• Use a ‘new’ gate for page-fault exceptions
• Other existing gates we can simply copy
• Why not just modify the existing IDT?
• It’s ‘write-protected’ in some Linux kernels
• But we can still ‘read’ it (i.e., for copying)
Very delicate to implement
• Will need to use some assembly language
• Using C language doesn’t give full control
• C Compiler designers didn’t plan for this!
• (except they did allow for using assembly)
• Assembly requires us to be very precise
• So try keeping assembly to a minimum
• We can use a mixture of assembly and C
Allocate a mapped page
• Device interrupts are ‘asynchronous’
• CPU requires instant access to the IDT
• We must ensure the CPU can find the new IDT
• Cannot risk putting it in ‘high memory’
• We can use the ‘get_free_page()’ function
• With flags: GFP_KERNEL and GFP_DMA
• (This ensures the page will always be mapped)
• No memory available? Cannot continue.
Must find address of current IDT
• We’ll need it for copying the existing gates
• We’ll need it for restoring old IDT upon exit
• We can use the ‘sidt’ instruction to find it
• But ‘sidt’ needs a 48-bit memory-operand
• No such type is directly supported in C
• We could use a 64-bit type (e.g., long long)
• Better to use array of three 16-bit values
Getting hold of current IDT
• We need to declare a global variable
• Because ‘init_module()’ needs it
• And also ‘cleanup_module()’ needs it
• Use ‘static’ to make it private
• Use ‘short’ to get 16-bit array-entries
• Use ‘unsigned’ to avoid sign-extensions
static unsigned short oldidtr[ 3 ];
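The 48-bit value stored by ‘sidt’ consists of a 16-bit table limit followed by a 32-bit base address, so the three array entries can be taken apart as sketched below. The helper names `idtr_limit` and `idtr_base` are hypothetical, and the sketch assumes the array has already been filled in (inside the module that might be done with something like `asm(" sidt oldidtr ");`, which is not executed here).

```c
#include <stdint.h>

/* Entry 0 holds the table limit (size in bytes, minus 1). */
static uint16_t idtr_limit(const unsigned short idtr[3])
{
    return idtr[0];
}

/* Entries 1 and 2 hold the low and high halves of the 32-bit base address. */
static uint32_t idtr_base(const unsigned short idtr[3])
{
    return (uint32_t)idtr[1] | ((uint32_t)idtr[2] << 16);
}
```

Declaring the entries ‘unsigned’ matters here: a signed 16-bit value such as 0xC03A would be sign-extended when widened, corrupting the reassembled address.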
Activating a ‘new’ IDT
• When we’re ready, we can use ‘lidt’
• Instruction will change the IDTR register
• Instruction needs 48-bit memory operand
• So again we will declare a suitable array
static unsigned short newidtr[ 3 ];
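Filling in such an array is the reverse of decoding one: the limit goes in the first entry and the base address is split across the other two. A minimal sketch follows, assuming a hypothetical helper named `make_idtr`; the base address in the usage note is an invented example.

```c
#include <stdint.h>

/* Build the 48-bit IDTR image: 16-bit limit, then 32-bit base address. */
static void make_idtr(unsigned short idtr[3], uint32_t base, uint16_t limit)
{
    idtr[0] = limit;                           /* table size in bytes, minus 1 */
    idtr[1] = (unsigned short)(base & 0xFFFF); /* base address, low half       */
    idtr[2] = (unsigned short)(base >> 16);    /* base address, high half      */
}
```

A full table of 256 eight-byte gates occupies 2048 bytes, so its limit would be 0x07FF; a call such as `make_idtr(newidtr, 0xC1234000, 0x07FF)` would then prepare the operand.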
Initializations
• We need to initialize our ‘idtr’ array
• We need to initialize new Descriptor Table
• Use ‘memcpy()’ for copying within kernel
• Page-Fault’s gate-descriptor must be built
• Must conform to CPU’s expected layout
• Need to use a local 64-bit variable
unsigned long long gate_desc;
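The copy-then-replace step can be pictured with ordinary arrays standing in for the two descriptor tables. This is a user-space sketch only: `install_gate` is a hypothetical helper, and the tables are plain arrays of 64-bit values rather than the real IDT or an allocated kernel page.

```c
#include <stdint.h>
#include <string.h>

#define IDT_GATES          256   /* number of gates in a full table */
#define PAGE_FAULT_VECTOR  0x0E  /* exception number for page-faults */

/* Copy every existing gate, then substitute our own page-fault gate.
   Returns the original page-fault gate so that it can be consulted later. */
static uint64_t install_gate(uint64_t *new_idt, const uint64_t *old_idt,
                             uint64_t gate_desc)
{
    memcpy(new_idt, old_idt, IDT_GATES * sizeof(uint64_t));
    new_idt[PAGE_FAULT_VECTOR] = gate_desc;
    return old_idt[PAGE_FAULT_VECTOR];
}
```

The old table is never written to, which fits the situation where the existing IDT is write-protected but still readable.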
Format for a Gate Descriptor
Quadword (64-bits)
bits 63…48: offset[ 31…16 ]
bits 47…32: gate type
bits 31…16: segment-selector
bits 15…0: offset[ 15…0 ]
The address of the fault-handler is ‘split’ into a hiword and a loword
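Assembling a gate in this format is a matter of shifting each field into place. In the sketch below, `make_gate` and `gate_offset` are hypothetical helpers; the type value 0x8E00 mentioned afterwards (present, privilege-level 0, 32-bit interrupt-gate) and the handler address are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Pack a handler address, code-segment selector and gate type into 64 bits. */
static uint64_t make_gate(uint32_t offset, uint16_t selector, uint16_t type)
{
    uint64_t gate_desc;
    gate_desc  = (uint64_t)(offset >> 16) << 48;  /* offset[31..16], the hiword */
    gate_desc |= (uint64_t)type << 32;            /* gate type                  */
    gate_desc |= (uint64_t)selector << 16;        /* segment-selector           */
    gate_desc |= offset & 0xFFFF;                 /* offset[15..0], the loword  */
    return gate_desc;
}

/* Recover the handler address from an existing gate (e.g., the old one). */
static uint32_t gate_offset(uint64_t gate_desc)
{
    return (uint32_t)((gate_desc >> 48) << 16) | (uint32_t)(gate_desc & 0xFFFF);
}
```

With a handler at 0xC0123456, selector 0x0010 and type 0x8E00, the packed quadword is 0xC0128E0000103456. Applying `gate_offset()` to the original page-fault gate is one way the old handler's address could be obtained.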
Declaring our fault-handler
• Tell the C compiler our handler’s name:
asmlinkage void isr0x0E( void );
• Its type and value are set by assembler:
asm(" .text ");
asm(" .type isr0x0E, @function ");
asm("isr0x0E: ");
Save/Restore cpu registers
• Upon entering:
asm(" pushal ");
asm(" pushl %ds ");
asm(" pushl %es ");
• Upon leaving:
asm(" popl %es ");
asm(" popl %ds ");
asm(" popal ");
asm(" jmp *old_isr ");
Handler must access kernel data
• Registers CS and SS get set up by the CPU
• But it’s our job to set up the DS and ES registers
• Linux uses same segments for data and stack
asm(" mov %ss, %eax ");
asm(" mov %eax, %ds ");
asm(" mov %eax, %es ");
• (Current kernel version doesn’t use FS or GS)
Transfer to a C function
• Handler will need some info from the stack
• The ‘error-code’ will be needed for sure
• So C function will need an ‘argument’
• So here’s our C function prototype:
static void handler( unsigned long *tos );
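How the C function might pick the error-code off the stack can be sketched as follows. The stack positions are an assumption: the sketch supposes ‘tos’ points at the most recently saved register (ES), so that ES, DS and the eight registers saved by ‘pushal’ come before the error-code. All names here are hypothetical. The three bit tests follow the x86 page-fault error-code format (bit 0 = page was present, bit 1 = write access, bit 2 = user mode).

```c
/* Assumed positions relative to 'tos' (depends on what the entry code pushed). */
enum {
    TOS_ES         = 0,   /* saved by 'pushl %es' */
    TOS_DS         = 1,   /* saved by 'pushl %ds' */
    TOS_PUSHAL     = 2,   /* eight registers saved by 'pushal': slots 2..9 */
    TOS_ERROR_CODE = 10,  /* pushed by the CPU */
    TOS_EIP        = 11,
    TOS_CS         = 12,
    TOS_EFLAGS     = 13
};

static unsigned long fault_error_code(const unsigned long *tos)
{
    return tos[TOS_ERROR_CODE];
}

/* Bit 0: 0 = page was 'not present', 1 = a protection violation. */
static int fault_page_present(unsigned long err) { return (err & 1) != 0; }

/* Bit 1: 0 = the access was a read, 1 = the access was a write. */
static int fault_was_write(unsigned long err)    { return (err & 2) != 0; }

/* Bit 2: 0 = fault occurred in kernel mode, 1 = in user mode. */
static int fault_from_user(unsigned long err)    { return (err & 4) != 0; }
```

An error-code of 6, for example, would describe a user-mode write to a page that is not present.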