Improving IPC by Kernel Design
Jochen Liedtke
Proceedings of the 14th ACM Symposium on Operating
Systems Principles
Asheville, North Carolina
1993
The Performance of µ-Kernel-Based Systems
H. Haertig, M. Hohmuth, J. Liedtke, S.
Schoenberg, J. Wolter
Proceedings of the 16th ACM Symposium on Operating
Systems Principles
October 1997, pp. 66-77
Jochen Liedtke (1953 – 2001)
• 1977 – Diploma in Mathematics
from the University of Bielefeld.
• 1984 – Moved to GMD (German
National Research Center for
Computer Science). Built L3. Known
for overcoming ipc performance hurdles.
• 1996 – IBM T.J. Watson Research
Center. Developed L4, a 12 KB
second-generation microkernel.
The IPC Dilemma
• Inter-process communication (ipc) by message passing is one of the
central paradigms of µ-kernel and client/server architectures.
– It increases modularity, flexibility, security and scalability.
• But most ipc implementations of the time performed poorly (1st
generation micro-kernels such as Mach or Chorus). Really fast
message passing was needed to run device drivers and other
performance-critical components at user level.
• So programmers started to circumvent ipc, for example by moving
device drivers and other components back into the kernel.
• To gain acceptance, ipc has to become a very efficient basic
mechanism.
What to Do?
• The author sets out to construct a µ-kernel that
will achieve a tenfold improvement in ipc
performance over comparable systems.
• “ipc performance is the master” is a key design
principle.
• The result is L3, a micro-kernel based operating
system built at GMD (German National Research
Center for Computer Science), and later L4.
• Use a synergistic approach; no single
“silver bullet” exists.
Summary of Techniques
Seventeen Total
Measured Performance Gains
• Note the synergistic effect. For 8-byte ipc:
– 49% + 23% + 21% + 18% + 13% + 10% = 134%
– 49% means that removing that technique would increase ipc time
by 49%.
Standard System Calls (Send, Receive)
The kernel is entered and exited four times, at 107 cycles each time.
Client (Sender)
L4_ipc_send ( ); system call,
Enter kernel
Exit kernel
Server (Receiver)
L4_ipc_receive ( ); system call,
Enter kernel
Exit kernel
Client is not Blocked
L4_ipc_send ( ); system call,
Enter kernel
Exit kernel
L4_ipc_receive ( ); system call,
Enter kernel
Exit kernel
Add New System Calls
The kernel is entered and exited two times, half as often.
Client (Sender)
L4_ipc_call ( ); system call,
Enter kernel
Allocate Processor to Server
Suspend
Client IS Blocked
L4_ipc_receive ( ); system call,
Processor allocated to Client
Exit kernel
Server (Receiver)
L4_ipc_reply_and_wait ( );
Resume from being suspended
Return to user (exit kernel)
Inspect message
L4_ipc_reply_and_wait ( );
Enter kernel
Send Reply
Wait for next message
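The saving from the combined system calls can be checked with a little arithmetic. The sketch below assumes one client/server round trip and takes the 107-cycle figure from the slide; the function and constant names are made up for illustration.

```python
# Cost of one kernel entry plus exit, in cycles (figure from the slides).
CYCLES_PER_KERNEL_CROSSING = 107


def crossing_cycles(system_calls: int) -> int:
    """Cycles spent entering and leaving the kernel for one round trip."""
    return system_calls * CYCLES_PER_KERNEL_CROSSING


# Separate send and receive on each side: four system calls per round trip.
separate = crossing_cycles(4)
# call on the client, reply_and_wait on the server: two system calls.
combined = crossing_cycles(2)

print(separate, combined, separate - combined)  # 428 214 214
```

Only the kernel entry and exit overhead is counted here; the work done inside the kernel is not.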
Complex Message Structure
Combine a sequence of send operations into a
single operation by supporting complex
messages.
• Benefit: reduces number of sends.
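A minimal sketch of the idea, assuming a message is simply a list of buffers. ComplexMessage and CountingChannel are hypothetical names used for illustration, not the L3/L4 message format; the channel only counts how many send operations are issued.

```python
from dataclasses import dataclass, field


@dataclass
class ComplexMessage:
    """Hypothetical container: several buffers delivered by one send."""
    parts: list = field(default_factory=list)


class CountingChannel:
    """Stand-in for the kernel's send path; counts send operations."""

    def __init__(self):
        self.sends = 0
        self.delivered = []

    def send(self, payload):
        self.sends += 1
        self.delivered.append(payload)


header, body, trailer = b"hdr", b"payload", b"end"

# Simple messages only: one send per buffer.
simple = CountingChannel()
for part in (header, body, trailer):
    simple.send(part)

# Complex message: the same three buffers travel in a single send.
complex_channel = CountingChannel()
complex_channel.send(ComplexMessage([header, body, trailer]))

print(simple.sends, complex_channel.sends)  # 3 1
```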
Direct Transfer by Temporary Mapping
• LRPC and RPC share user-level memory between client and
server to transfer messages. But this may affect security.
• Other micro-kernels transfer messages by a twofold copy:
from process A's space into kernel space, then into process B's space.
• L4 provides single-copy transfers by temporarily sharing
the target region with the sender.
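The difference between the twofold copy and the single copy can be modelled by counting copied bytes. This is only a sketch: a real kernel sets up the temporary mapping through page tables, whereas here a Python memoryview stands in for the mapped window, and both function names are assumptions.

```python
def two_copy_transfer(message: bytes, receiver_buf: bytearray) -> int:
    """Copy sender -> kernel buffer -> receiver; return bytes copied."""
    kernel_buf = bytearray(message)               # copy 1: process A -> kernel
    receiver_buf[:len(kernel_buf)] = kernel_buf   # copy 2: kernel -> process B
    return 2 * len(message)


def mapped_transfer(message: bytes, receiver_buf: bytearray) -> int:
    """Copy once through a temporary window onto the receiver's buffer."""
    # The memoryview aliases the target region instead of duplicating it,
    # playing the role of the temporary mapping.
    with memoryview(receiver_buf) as window:
        window[:len(message)] = message           # the only copy
    return len(message)                           # window is released here


msg = b"x" * 4096
buf_a, buf_b = bytearray(4096), bytearray(4096)
copied_two = two_copy_transfer(msg, buf_a)
copied_one = mapped_transfer(msg, buf_b)

print(copied_two, copied_one)  # 8192 4096
```

Both receivers end up with the same contents; only the number of bytes moved differs.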
Scheduling, Conventional
• Conventionally, the ipc operations call and reply & receive
require four scheduling actions:
– Delete the sending thread from the ready queue.
– Insert the sending thread into the waiting queue.
– Delete the receiving thread from the waiting queue.
– Insert the receiving thread into the ready queue.
• These operations, together with 4 expected TLB misses,
will take at least 1.2 µs (23% T).
Solution, Lazy Scheduling
• Conventional ipc requires updating the thread scheduler
queues on every operation. Performance can be improved by
delaying the movement of threads within and between queues
until the queues are queried. This “lazy” scheduling is
achieved by setting state flags (ready / waiting) in the Thread
Control Blocks (tcb, which holds basic information about a
thread) and then scanning the queues at query time for threads
which should be moved to different queues.
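The mechanism can be sketched as follows. This is a simplified model under stated assumptions, not the L3 or L4 implementation: the class and method names are invented, there are only two queues, and queue_ops merely counts the deletions and insertions that a conventional scheduler would perform on every ipc.

```python
from collections import deque
from dataclasses import dataclass

READY, WAITING = "ready", "waiting"


@dataclass(eq=False)
class Tcb:
    """Minimal thread control block: a name plus the state flag."""
    name: str
    state: str


class LazyScheduler:
    def __init__(self):
        self.queues = {READY: deque(), WAITING: deque()}
        self.queue_ops = 0  # counts queue deletions and insertions

    def add(self, tcb):
        self.queues[tcb.state].append(tcb)

    def ipc(self, sender, receiver):
        """Hand the processor from sender to receiver: flags only."""
        sender.state = WAITING
        receiver.state = READY

    def next_ready(self):
        """Query time: move threads whose flag disagrees with their queue."""
        for label, queue in self.queues.items():
            for tcb in [t for t in queue if t.state != label]:
                queue.remove(tcb)
                self.queues[tcb.state].append(tcb)
                self.queue_ops += 2
        ready = self.queues[READY]
        return ready[0] if ready else None


sched = LazyScheduler()
client, server = Tcb("client", READY), Tcb("server", WAITING)
sched.add(client)
sched.add(server)

sched.ipc(client, server)    # call
sched.ipc(server, client)    # reply
first = sched.next_ready()   # flags match the queues again, nothing moves
ops_after_round_trip = sched.queue_ops

sched.ipc(client, server)    # call, then the scheduler is queried
second = sched.next_ready()
ops_after_query = sched.queue_ops

print(first.name, ops_after_round_trip, second.name, ops_after_query)
```

A complete call and reply between two queries costs no queue operations at all in this model; the four updates are paid only when a query finds the flags and the queues out of step.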
Pass Short Messages in Registers
• Typically, a high proportion of messages are very
short: 8 bytes (plus 8 bytes of sender id).
Examples would be ack/error replies from device
drivers or hardware-initiated interrupt messages.
• The 486 processor has enough registers to allow
direct transfer of short messages via cpu registers.
• Performance gain of 2.4 µs, or 48% T.
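As an illustration, an 8-byte message fits into two 32-bit register values. The sketch below only models the size check and the packing; the function names and the choice of two little-endian words are assumptions, and a real kernel would leave the values in cpu registers across the context switch rather than return them from a function.

```python
import struct

SHORT_MESSAGE_BYTES = 8  # short-message size quoted on the slide


def to_registers(payload: bytes):
    """Pack a short payload into two 32-bit values, or return None."""
    if len(payload) > SHORT_MESSAGE_BYTES:
        return None  # too long: fall back to the memory-copy path
    padded = payload.ljust(SHORT_MESSAGE_BYTES, b"\x00")
    return struct.unpack("<II", padded)  # two little-endian 32-bit words


def from_registers(regs) -> bytes:
    """Rebuild the 8-byte payload from the two register values."""
    return struct.pack("<II", *regs)


ack = b"ACK\x00\x00\x00\x00\x01"
regs = to_registers(ack)
round_trip = from_registers(regs)
too_long = to_registers(b"123456789")

print(len(regs), round_trip == ack, too_long)  # 2 True None
```

Payloads shorter than 8 bytes are zero-padded here, so the original length is not preserved; a fuller sketch would carry it separately.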
IPC Performance
• For an eight-byte message, ipc
time for L3 is 5.2 µs compared
to 115 µs for Mach, a 22-fold
improvement.
• For a large message (4 KB), a 3-fold
improvement is seen.
Monolithic Kernel vs. Microkernel
L4 Performance
Conclusion
• Use a synergistic approach to achieve greater ipc
performance; a single “silver bullet” may not
exist.
• A thorough understanding of the interaction
between the hardware architecture and the
operating system is key to many of the
improvements. As a consequence, microkernels tuned
this way are not portable between hardware architectures.
• L4 demonstrated the viability of running
applications on top of a micro-kernel.
References
• http://i30www.ira.de/aboutus/people/liedtke/inmemoriam.php
• Microkernels; Ulfar Erlingsson, Athanasios Kyparlis
• Monolithic Kernel vs. Microkernel; Benjamin Roch; TU Wien