Improving IPC by Kernel Design
Jochen Liedtke
Proceedings of the 14th ACM Symposium on Operating
Systems Principles
Asheville, North Carolina
1993
The Performance of
µ-Kernel-Based Systems
H. Haertig, M. Hohmuth, J. Liedtke, S.
Schoenberg, J. Wolter
Proceedings of the 16th Symposium on Operating
Systems Principles
October 1997, pp. 66-77
Jochen Liedtke (1953 – 2001)
• 1977 – Diploma in Mathematics
from the University of Bielefeld.
• 1984 – Moved to GMD (German
National Research Center). Built
L3. Known for overcoming IPC
performance hurdles.
• 1996 – IBM T.J. Watson Research
Center. Developed L4, a 12 KB
second-generation microkernel.
The IPC Dilemma
• IPC is a core paradigm of µ-kernel architectures
• Most IPC implementations perform poorly
• Very fast message passing is needed to
run device drivers and other performance-critical
components at user level.
• Result: programmers circumvent IPC, co-locating
device drivers in the kernel and defeating the main
purpose of the microkernel architecture
What to Do?
• Optimize IPC performance above all else!
• Result: L3 and L4, second-generation microkernel-based operating systems
• Many clever optimizations, but no single “silver
bullet”
Summary of Techniques
Seventeen Total
Standard System Calls (Send/Recv)
Kernel entered/exited four times per call!
• Client (Sender): send ( ); system call, enter kernel, exit kernel
• Server (Receiver): receive ( ); system call, enter kernel, exit kernel
• Server (Receiver): send ( ); system call, enter kernel, exit kernel
• Client is not blocked
• Client (Sender): receive ( ); system call, enter kernel, exit kernel
New Call/Response-based System Calls
Special system calls for RPC-style interaction
Kernel entered and exited only twice per call!
• Client (Sender): call ( ); system call, enter kernel, allocate CPU to server, suspend
• Server (Receiver): resume from being suspended in reply_and_recv_next ( );, exit kernel, handle message
• Server (Receiver): reply_and_recv_next ( ); enter kernel, send reply, wait for next message
• Client (Sender): re-allocate CPU to client, exit kernel
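The difference between the two system-call styles can be illustrated by counting kernel entries for one request/reply round trip. The following is only a toy model: the `Kernel` class and both helper functions are hypothetical names introduced for illustration, not part of L3.

```python
class Kernel:
    """Toy kernel that only counts how often user code traps into it."""

    def __init__(self):
        self.entries = 0

    def syscall(self, name):
        # Every system call is one kernel entry plus one matching exit.
        self.entries += 1
        return name


def rpc_with_send_recv(kernel):
    """One round trip using separate send and receive system calls."""
    kernel.syscall("client send")      # client sends the request
    kernel.syscall("server receive")   # server picks up the request
    kernel.syscall("server send")      # server sends the reply
    kernel.syscall("client receive")   # client picks up the reply


def rpc_with_call_reply(kernel):
    """One round trip using the combined calls named on the slide."""
    kernel.syscall("client call")                 # send request and wait for reply
    kernel.syscall("server reply_and_recv_next")  # send reply and wait for next request


old, new = Kernel(), Kernel()
rpc_with_send_recv(old)
rpc_with_call_reply(new)
print(old.entries, new.entries)  # prints "4 2"
```

Each combined call merges a send with the receive that would otherwise follow it, which is where the halving comes from.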
Complex Message Structure
Batching IPC
Combine a sequence of send operations into a
single operation by supporting complex messages
• Benefit: reduces number of sends.
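A minimal sketch of the batching idea, assuming a hypothetical `Channel` that counts send operations and a hypothetical `ComplexMessage` container (neither is L3's actual message format): several pieces of data that would each need their own send are packed into one message and sent once.

```python
from dataclasses import dataclass, field


@dataclass
class ComplexMessage:
    """Hypothetical container bundling several parts into one message."""
    parts: list = field(default_factory=list)


class Channel:
    """Toy IPC channel that counts how many send operations it performs."""

    def __init__(self):
        self.sends = 0
        self.delivered = []

    def send(self, message):
        self.sends += 1
        self.delivered.append(message)


parts = [b"header", b"payload", b"trailer"]

# Without complex messages: one send per part.
simple = Channel()
for part in parts:
    simple.send(part)

# With a complex message: all parts travel in a single send.
batched = Channel()
batched.send(ComplexMessage(parts=list(parts)))

print(simple.sends, batched.sends)  # prints "3 1"
```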
Direct Transfer by Temporary Mapping
• Naïve message transfer: copy from sender to kernel then
from kernel to receiver
• Optimizing transfer by sharing memory between sender
and receiver is not secure
• L3 supports single-copy transfers by temporarily mapping
a communication window into the sender.
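The two transfer strategies above differ in how many times the message bytes are copied. The sketch below only models that count in user-space Python: a `memoryview` stands in for the temporarily mapped communication window, and both function names are hypothetical.

```python
def two_copy_transfer(sender_buf):
    """Naive path: sender -> kernel buffer -> receiver."""
    kernel_buf = bytes(sender_buf)          # copy 1: sender to kernel
    receiver_buf = bytearray(kernel_buf)    # copy 2: kernel to receiver
    return receiver_buf, 2


def temporary_mapping_transfer(sender_buf, receiver_buf):
    """Single-copy path: write straight into the receiver's buffer.

    The memoryview models the communication window: an alias of the
    receiver's memory that exists only for the duration of the transfer.
    """
    window = memoryview(receiver_buf)           # "map" the window
    window[:len(sender_buf)] = sender_buf       # the only copy
    window.release()                            # "unmap" the window
    return 1


message = b"request!"
copied, naive_copies = two_copy_transfer(message)

target = bytearray(len(message))
mapped_copies = temporary_mapping_transfer(message, target)

print(naive_copies, mapped_copies)  # prints "2 1"
```

Because the window is released after each transfer, sender and receiver never share memory permanently, which is the security concern the slide raises about plain shared memory.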
Scheduling
• Conventionally, ipc operations call or reply & receive
require scheduling actions:
– Delete sending thread from the ready queue.
– Insert sending thread into the waiting queue.
– Delete the receiving thread from the waiting queue.
– Insert receiving thread into the ready queue.
• These operations, together with 4 expected TLB misses,
will take at least 1.2 µs (23% T).
Solution: Lazy Scheduling
• Don’t bother updating the scheduler queues!
• Instead, delay the movement of threads among queues until
the queues are queried.
• Why?
– A sending thread that blocks will soon unblock again, and maybe
nobody will ever notice that it blocked
• Lazy scheduling is achieved by setting state flags (ready /
waiting) in the Thread Control Blocks
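A minimal sketch of lazy scheduling, assuming a single ready queue and a state flag per thread. The `Thread` and `LazyScheduler` classes and the `queue_updates` counter are hypothetical names used for illustration, not L3's data structures. Blocking only flips the flag; stale entries are removed when the queue is next queried.

```python
from collections import deque

READY, WAITING = "ready", "waiting"


class Thread:
    def __init__(self, name, state=READY):
        self.name = name
        self.state = state  # stands in for the state flag in the TCB


class LazyScheduler:
    def __init__(self, threads):
        self.ready_queue = deque(threads)
        self.queue_updates = 0  # counts real queue insertions/deletions

    def block(self, thread):
        # Lazy: only flip the flag, leave the thread in the ready queue.
        thread.state = WAITING

    def unblock(self, thread):
        thread.state = READY
        if thread not in self.ready_queue:
            self.ready_queue.append(thread)
            self.queue_updates += 1

    def pick_next(self):
        # The queue is being queried, so clean up stale entries now.
        while self.ready_queue:
            thread = self.ready_queue[0]
            if thread.state == READY:
                return thread
            self.ready_queue.popleft()
            self.queue_updates += 1
        return None


client = Thread("client")
server = Thread("server", state=WAITING)  # stale entry: waiting for a request
sched = LazyScheduler([client, server])

# One call / reply round trip: four state changes, no queue operations.
sched.block(client)
sched.unblock(server)
sched.block(server)
sched.unblock(client)

print(sched.queue_updates, sched.pick_next().name)  # prints "0 client"
```

The client blocked and unblocked again before anyone queried the queue, so the scheduler never had to touch it.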
Pass Short Messages in Registers
• Most messages are very short, 8 bytes (plus
8 bytes of sender id)
– E.g. ack/error replies from device drivers or
hardware-initiated interrupt messages.
• Transfer short messages via CPU registers.
• Performance gain of 2.4 µs or 48% T.
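A rough sketch of the register path, assuming two 32-bit registers carry the 8 message bytes. The register names `msg_lo` and `msg_hi`, the dictionary standing in for the register file, and both functions are hypothetical. Messages longer than 8 bytes are rejected here and would take the memory-copy path instead.

```python
REGISTER_PAYLOAD_BYTES = 8  # short-message size quoted on the slide


def send_short(payload, sender_id):
    """Pack a short message into simulated registers, or return None."""
    if len(payload) > REGISTER_PAYLOAD_BYTES:
        return None  # too long: caller must fall back to a memory copy
    padded = payload.ljust(REGISTER_PAYLOAD_BYTES, b"\0")
    return {
        "msg_lo": int.from_bytes(padded[:4], "little"),
        "msg_hi": int.from_bytes(padded[4:], "little"),
        "sender_id": sender_id,
    }


def receive_short(regs, length):
    """Unpack the first `length` message bytes from simulated registers."""
    raw = regs["msg_lo"].to_bytes(4, "little") + regs["msg_hi"].to_bytes(4, "little")
    return raw[:length]


regs = send_short(b"ack", sender_id=7)
print(receive_short(regs, 3), regs["sender_id"])
```

The message never touches a user or kernel buffer on this path, which is why no copy is needed.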
Impact on IPC Performance
• For an eight-byte message, ipc
time for L3 is 5.2 µs compared
to 115 µs for Mach, a 22-fold
improvement.
• For a large message (4K), a 3-fold
improvement is seen.
Relative Importance of Techniques
• Quantifiable impact of techniques
– 49% means that removing that item would increase ipc time
by 49%.
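As a worked example of this definition, the helper below computes the ipc time that would result if one technique were removed. The function name is hypothetical, and pairing the 49% figure with the 5.2 µs eight-byte L3 time quoted earlier in the deck is an assumption made only for illustration.

```python
def time_without_technique(baseline_us, impact_percent):
    """IPC time if a technique with the given impact were removed."""
    return baseline_us * (1 + impact_percent / 100)


baseline_us = 5.2  # assumed baseline: the 8-byte L3 ipc time from the deck
print(round(time_without_technique(baseline_us, 49), 2))  # prints 7.75
```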
OS and Application-Level Performance
OS-Level Performance
Application-Level Performance
Conclusion
• Use a synergistic approach to improve IPC
performance
– A thorough understanding of hardware/software
interaction is required
– no “silver bullet”
• IPC performance can be improved by a factor of
10
• … but even so, a micro-kernel-based OS will not
be as fast as an equivalent monolithic OS
– L4-based Linux outperforms Mach-based Linux, but
not monolithic Linux