Transcript Slides

Design Tradeoffs For
Software-Managed TLBs
Authors: Nagle, Uhlig, Stanley,
Sechrest, Mudge & Brown
Definition


• The virtual-to-physical address translation operation sits on
the critical path between the CPU and the cache.
• If every memory request issued by the processor required one or
more accesses to main memory (to read page table entries), then
the processor would be very slow.
• The TLB is a cache for page table entries. It works in much the
same way as the data cache: it stores recently accessed page
table entries.
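
To make the miss cost concrete, here is a minimal C sketch of the
translation step (toy code with invented names such as
toy_page_table; a real TLB is fully associative hardware, not a
direct-mapped array): a hit avoids the page table entirely, while
a miss pays for reads of PTEs in main memory.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT  12                /* 4 KB pages, as on the R2000 */
    #define TLB_ENTRIES 64                /* R2000 TLB size */

    /* Toy TLB: direct-mapped on the virtual page number for brevity. */
    static struct { uint32_t vpn, pfn; int valid; } tlb[TLB_ENTRIES];

    /* Stand-in for the page table walk that would read PTEs from RAM. */
    static uint32_t toy_page_table(uint32_t vpn) { return vpn + 0x100; }

    static uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpn = vaddr >> PAGE_SHIFT;
        unsigned  i  = vpn % TLB_ENTRIES;
        if (!tlb[i].valid || tlb[i].vpn != vpn) {  /* TLB miss: slow path */
            tlb[i].vpn   = vpn;
            tlb[i].pfn   = toy_page_table(vpn);    /* one or more RAM reads */
            tlb[i].valid = 1;
        }
        return (tlb[i].pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
    }

    int main(void)
    {
        printf("0x%08x -> 0x%08x\n", 0x00401234u, translate(0x00401234u));
        return 0;
    }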
Operations on an address request by the CPU



• Each TLB entry covers a whole page of physical memory, so a
relatively small number of TLB entries will cover a large amount
of memory.
• The large coverage of main memory by each TLB entry means that
TLBs have a high hit rate.
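
For example, with the R2000's 64 entries and 4 KB pages, the TLB
maps 64 × 4 KB = 256 KB of memory at once, so a typical working
set hits in the TLB far more often than it misses.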
TLB types


• Fully associative was common in early TLB designs.
• Set associative is more common in newer designs.
The Problem.


• This paper discusses software-managed TLB design tradeoffs and
their interaction with a range of operating systems. Software
management can impose considerable penalties, which are highly
dependent on the operating system's structure and its use of
virtual memory.
• Namely, memory references that require mappings not in the TLB
result in misses that must be serviced either by hardware or by
software.
Test Environment





• DECstation 3100 with a MIPS R2000 processor.
• The R2000 contains a 64-entry, fully associative TLB.
• The R2000 TLB hardware supports partitioning into two sets, an
upper and a lower set.
• The lower set consists of entries 0-7 and is used for page
table entries that are slow to retrieve.
• The upper set consists of entries 8-63 and contains the more
frequently used level 1 user PTEs.
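
As an illustration of the partition, here is a C sketch (invented
names; only the slot arithmetic reflects the R2000): the OS picks
which lower slot to write, while ordinary refills choose a random
victim among the upper entries 8-63, much as the R2000's tlbwr
instruction does via its Random register.

    #include <stdlib.h>

    #define TLB_SIZE    64
    #define WIRED_SLOTS 8    /* lower set: entries 0-7, never randomly replaced */

    struct tlb_entry { unsigned vpn, pfn, valid; };
    static struct tlb_entry tlb[TLB_SIZE];

    /* The OS places a slow-to-retrieve PTE in a lower slot it chooses. */
    static void tlb_write_wired(unsigned slot, struct tlb_entry e)
    {
        tlb[slot % WIRED_SLOTS] = e;
    }

    /* Ordinary refills (level 1 user PTEs) go to a random upper slot. */
    static void tlb_write_random(struct tlb_entry e)
    {
        tlb[WIRED_SLOTS + rand() % (TLB_SIZE - WIRED_SLOTS)] = e;
    }

    int main(void)
    {
        struct tlb_entry e = { 0x123, 0x456, 1 };
        tlb_write_wired(0, e);   /* e.g. a level 2 PTE */
        tlb_write_random(e);     /* e.g. a level 1 user PTE */
        return 0;
    }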
Test Tools.



• A system analysis tool called Monster, which enables us to
monitor actual miss handling costs in CPU cycles.
• A TLB simulator called Tapeworm, which is compiled directly
into the kernel so that it can intercept all of the actual TLB
misses caused by both user process and OS kernel memory
references.
• The TLB information that Tapeworm extracts from the running
system is used to obtain TLB miss counts and to simulate
different TLB configurations.
System monitoring with Monster.


• Monster is a hardware monitoring system comprising the
monitored DECstation 3100, an attached logic analyzer, and a
controlling workstation.
• It measures the amount of time taken to handle each TLB miss.
TLB Simulation with Tapeworm.


• The Tapeworm simulator is built into the operating system and
is invoked whenever there is a TLB miss.
• The simulator uses the real TLB misses to simulate its own TLB
configuration.
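
A minimal sketch of the idea, with invented names (the real
Tapeworm lives in the kernel's miss handlers): because every
reference that misses the hardware TLB passes through the
handler, a hook there can replay the reference against a
simulated TLB without storing any address trace. The real
Tapeworm also keeps the hardware TLB's contents consistent with
the simulated configuration so that no simulated miss goes
unobserved.

    /* Hypothetical hook, called from the kernel's TLB miss handler on
     * every genuine miss, for both user and kernel references. */
    #define SIM_ENTRIES 32                 /* simulated configuration under study */
    static unsigned      sim_tlb[SIM_ENTRIES];    /* toy: direct-mapped on VPN */
    static unsigned char sim_valid[SIM_ENTRIES];
    static unsigned long sim_miss_count;

    void tapeworm_hook(unsigned vpn)
    {
        unsigned i = vpn % SIM_ENTRIES;
        if (!sim_valid[i] || sim_tlb[i] != vpn) { /* misses in the simulated TLB too */
            sim_tlb[i]   = vpn;
            sim_valid[i] = 1;
            sim_miss_count++;                     /* charged to the simulated design */
        }
    }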
Trace Driven Simulation


• Trace-driven simulation has traditionally been used because it
is good for studying the components of computer memory systems,
such as TLBs.
• A sequence of memory references is fed to the simulation model
to mimic the way that a real processor might exercise the design.
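
For contrast with Tapeworm, a toy trace-driven loop in C (the
file name refs.trace and its one-address-per-line format are
invented): addresses captured from a real run are replayed, in
order, against the model.

    #include <stdio.h>

    int main(void)
    {
        FILE *trace = fopen("refs.trace", "r");
        unsigned long vaddr, refs = 0, sum = 0;
        if (!trace) return 1;
        while (fscanf(trace, "%lx", &vaddr) == 1) {
            sum ^= vaddr;   /* stand-in for exercising the TLB model */
            refs++;
        }
        fclose(trace);
        printf("replayed %lu references (checksum %lx)\n", refs, sum);
        return 0;
    }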
Problems with Trace driven simulation



• It is difficult to obtain accurate traces.
• It consumes considerable processing and storage resources.
• It assumes that address traces are invariant to changes in the
structural parameters of a simulated TLB.
Solution.



• Compile the TLB simulator, Tapeworm, directly into the
operating system kernel. This allows us to account for all
system activity, including multiple-process and kernel
interactions.
• It does not require address traces.
• It considers all TLB misses, whether caused by user-level tasks
or by the kernel.
Benchmarks
Operating Systems
Test Results
OS Impact on software managed TLBs


• Different operating systems gave different results, although
the same applications were run on each system.
• There is a difference in both the number of TLB misses and the
total TLB service time.
Increasing TLB Performance




• Additional TLB miss vectors.
• Increase lower slots in the TLB partition.
• Increase TLB size.
• Modify TLB associativity.
TLB Miss Vectors






• L1 User - miss on a level 1 user PTE
• L1 Kernel - miss on a level 1 kernel PTE
• L2 - miss on a level 2 PTE, after a level 1 user miss
• L3 - miss on a level 3 PTE, after a level 1 kernel miss
• Modify - miss on a protection violation
• Invalid - page fault
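
A sketch of what the extra vectors buy, with invented handler
names: if the hardware dispatches each miss type directly to its
own entry point, as in the table below, no handler has to spend
cycles decoding which case occurred before servicing it.

    /* One entry point per miss type, selected directly by hardware. */
    enum miss_type { L1_USER, L1_KERNEL, L2_MISS, L3_MISS,
                     MODIFY, INVALID, N_TYPES };

    static void l1_user_refill(void)   { /* fast uTLB path: L1 user PTE */ }
    static void l1_kernel_refill(void) { /* level 1 kernel PTE */ }
    static void l2_refill(void)        { /* level 2 PTE */ }
    static void l3_refill(void)        { /* level 3 PTE */ }
    static void modify_fault(void)     { /* protection violation on write */ }
    static void page_fault(void)       { /* invalid: enter the VM system */ }

    static void (*const miss_vector[N_TYPES])(void) = {
        l1_user_refill, l1_kernel_refill, l2_refill,
        l3_refill, modify_fault, page_fault,
    };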
TLB Miss Vector Results
Modifying Lower TLB Partition



• OSF/1: increasing from 4 to 5 lower slots decreases miss
handling time by 50%.
• Mach 3.0: performance increases up to 8 slots.
• Microkernels benefit from a larger lower TLB partition because
many OS services (e.g., the Unix server on Mach 3.0) are mapped
by L2 PTEs.
Increasing TLB size
• Building TLBs with additional upper slots.
• The most significant component is L1 kernel (L1K) misses, due
to the large number of mapped data structures in the kernel.
• Allowing the uTLB handler to service L1K misses reduces the
TLB service time.
• In each system there is a noticeable improvement in TLB service
time as the TLB size increases.
Conclusion.

Software management of TLBs magnifies the
importance of the interactions between TLBs and
operating systems, because of the large variation in
TLB miss service times that can exist. TLB behavior
depends upon the kernel’s use of virtual memory to
map its own data structures, including the page
tables themselves. TLB behavior is also dependent
upon the division of service functionality between
the kernel and separate user tasks.