Presentation slide transcript
Tornado: Maximizing Locality and
Concurrency in a Shared Memory
Multiprocessor Operating System
By: Ben Gamsa, Orran Krieger, Jonathan Appavoo,
Michael Stumm
Presented by: Holly Grimes
Locality
In uniprocessors
Spatial Locality: access neighboring memory locations
Temporal Locality: access memory locations that were accessed recently
In multiprocessors, locality involves each processor using
data from its own cache
Why is Locality Important?
Modern multiprocessors exhibit
Higher memory latency
Large write-sharing costs
Large cache lines (false sharing)
Larger system sizes
Large secondary caches
NUMA effects
Why is Locality Important?
Sharing the counter requires moving it back and forth
between the CPU caches
Solution??
Split the counter into an array of integers
Each CPU gets its own counter
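A minimal sketch of this idea in C++ (the CPU count and the way a caller identifies its CPU are assumptions for illustration, not from the paper):

```cpp
#include <cstddef>

constexpr std::size_t NUM_CPUS = 4;   // assumed machine size for illustration

// Shared version: every increment drags the same cache line between CPUs.
long shared_counter = 0;

// Split version: each CPU updates only its own slot.
long per_cpu_counter[NUM_CPUS] = {};

void increment(std::size_t cpu) {     // 'cpu' is the caller's CPU id
    per_cpu_counter[cpu]++;           // touches only this CPU's slot
}

long read_total() {                   // reads are assumed rare, so summing is acceptable
    long total = 0;
    for (std::size_t cpu = 0; cpu < NUM_CPUS; cpu++)
        total += per_cpu_counter[cpu];
    return total;
}
```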
Achieving Locality
Using an array of counters seems to solve our problem…
But what happens if both array elements map to the same
cache line?
False sharing
Solution:
Pad each array element
Different elements map to different cache lines
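A hedged sketch of the padded array, assuming 64-byte cache lines (the actual line size is architecture dependent):

```cpp
#include <cstddef>

constexpr std::size_t CACHE_LINE_SIZE = 64;   // assumption; varies by machine
constexpr std::size_t NUM_CPUS = 4;

// alignas pads and aligns each element so no two counters share a cache line.
struct alignas(CACHE_LINE_SIZE) PaddedCounter {
    long value = 0;
};

PaddedCounter per_cpu_counter[NUM_CPUS];

void increment(std::size_t cpu) {
    per_cpu_counter[cpu].value++;     // no false sharing with neighboring slots
}
```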
Comparing Counter Implementations
NUMA Effects
Achieving Locality in an OS
We’ve seen several ways to achieve locality in the
implementation of a counter
Now we extend these concepts to see how locality can
be achieved in an OS
Tornado’s Approach
Build a new OS from the ground up
Make locality the primary design goal
Tornado’s Approach
[Diagram: user applications and the OS implementation, with locality and independence between them. Illustration by Philip Howard]
Tornado’s Locality-Maximizing Features
Object-oriented Design
Clustered Objects
A New Locking Strategy
Protected Procedure Calls
Object-oriented Design
Object-oriented Structure
Each OS resource is represented as a separate object
All locks and data structures are internal to the objects
Localizes and encapsulates each resource’s locks and data
This structure allows different resources to be managed
without accessing shared data structures
without acquiring shared locks
Simplifies OS implementation
modular design
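A sketch of the pattern, not Tornado’s actual classes: each resource object carries its own lock and data, so work on different resources never meets at a shared lock. Region appears in the memory-management example below; its fault handler here is an illustrative assumption.

```cpp
#include <cstddef>
#include <mutex>

class Region {                        // one object per mapped memory region
public:
    void handle_fault(std::size_t addr) {
        std::lock_guard<std::mutex> guard(lock_);   // lock is internal to this object
        // ... resolve the fault using only this region's own data structures ...
        (void)addr;
    }
private:
    std::mutex lock_;                 // protects only this region's state
    // per-region data lives here, never shared with other resource objects
};
```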
Example: Memory Management Objects
[Diagram: memory management objects per process: Process, Region, FCM (File Cache Manager), COR (Cached Object Representative), HAT (Hardware Address Translation), and DRAM (memory manager). Illustration by Philip Howard]
Object-oriented Design
For each resource, this design provides one object to be
shared by all CPUs
Comparable to having one copy of the counter shared among
all CPUs
To maximize locality, something more needs to be done
Clustered Objects
Clustered objects are composed of a set of
representative objects
Clients access a clustered object using a common
reference to the object
There can be one rep for the system, one rep per processor,
or one rep for a cluster of processors
Each call to an object using this reference is automatically
directed to the appropriate rep
Clients do not need to know anything about the location or
organization of the reps to use a clustered object
Impact on locality is similar to the effect of padded arrays
on the counter
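A sketch of the idea using a per-CPU counter as the clustered object (the class names and current_cpu() are illustrative assumptions, not Tornado’s interfaces):

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_CPUS = 4;
std::size_t current_cpu() { return 0; }   // stub; a real system queries the CPU id

class CounterRep {                        // one representative per processor
public:
    void increment() { value_++; }        // purely local update
    long value() const { return value_; }
private:
    long value_ = 0;
};

class ClusteredCounter {                  // the common reference clients use
public:
    ClusteredCounter() : reps_(NUM_CPUS) {}
    void increment() { reps_[current_cpu()].increment(); }   // routed to the local rep
    long total() const {                  // consistency by combining reps on demand
        long sum = 0;
        for (const auto& rep : reps_) sum += rep.value();
        return sum;
    }
private:
    std::vector<CounterRep> reps_;
};
```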
Keeping Clustered Objects Consistent
When a clustered object has multiple reps, there must be
a way of keeping the reps consistent
Coordination between reps can happen via
Shared Memory
Protected Procedure Calls
Clustered Object Implementation
Each processor has a translation table
The table is located at the same virtual address in every
processor
For each clustered object, the table contains a pointer to the
rep that serves the given processor
A clustered object reference is just a pointer into this
table
Reps for a clustered object are created dynamically when
they are first accessed
Dynamic rep creation is dealt with by the global miss handler
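A simplified sketch of the translation-table mechanism (the sizes and the eager create-a-new-rep policy in the miss handler are assumptions; Tornado lets each clustered object decide how reps map to processors):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t NUM_CPUS    = 4;
constexpr std::size_t MAX_OBJECTS = 1024;

struct Rep { /* per-processor state of one clustered object */ };

struct TranslationTable {
    std::array<Rep*, MAX_OBJECTS> entries{};   // null until first access from this CPU
};

TranslationTable per_cpu_table[NUM_CPUS];      // in Tornado, same virtual address on every CPU

Rep* global_miss_handler(std::size_t obj_id, std::size_t cpu) {
    // First access from this CPU: create (or choose to reuse) a rep and install it.
    Rep* rep = new Rep();
    per_cpu_table[cpu].entries[obj_id] = rep;
    return rep;
}

Rep* resolve(std::size_t obj_id, std::size_t cpu) {
    Rep* rep = per_cpu_table[cpu].entries[obj_id];
    return rep ? rep : global_miss_handler(obj_id, cpu);
}
```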
A New Locking Strategy
Synchronization
Two kinds of synchronization issues must be dealt with
Using locks to protect data structures
Ensuring the existence of needed data structures
Locks
Tornado uses spin-then-block locks to minimize the
overhead of the lock/unlock instructions
Tornado limits lock contention by
Encapsulating the locks in objects to limit the scope of locks
Using clustered objects to provide multiple copies of a lock
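A hedged sketch of a spin-then-block lock; the spin limit and the condition-variable fallback are illustrative choices, not Tornado’s implementation:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

class SpinThenBlockLock {
public:
    void lock() {
        constexpr int SPIN_LIMIT = 1000;          // illustrative tuning knob
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (!held_.load(std::memory_order_relaxed) &&
                !held_.exchange(true, std::memory_order_acquire))
                return;                           // acquired while spinning
        }
        std::unique_lock<std::mutex> guard(mutex_);
        cv_.wait(guard, [this] {                  // block until we can acquire
            return !held_.exchange(true, std::memory_order_acquire);
        });
    }
    void unlock() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            held_.store(false, std::memory_order_release);
        }
        cv_.notify_one();                         // wake one blocked waiter, if any
    }
private:
    std::atomic<bool> held_{false};
    std::mutex mutex_;
    std::condition_variable cv_;
};
```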
Existence Guarantees
Traditional Approach: use locks to protect object
references
Prevents races where one process destroys an object that
another process is using
Tornado’s Approach: semi-automatic garbage collection
Garbage collection destroys a clustered object only when
there are no more references to the object
When clients use an existing reference to access a clustered
object, they have a guarantee that the referenced object still
exists
References can be safely accessed without the use of (global)
locks
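A simplified sketch of the existence guarantee, using a single count of in-flight accesses; Tornado’s semi-automatic garbage collection keeps this bookkeeping per processor so that even the count stays local. Here destroy() is assumed to be called between enter() and leave(), with no new references handed out afterwards.

```cpp
#include <atomic>

class Removable {
public:
    void enter() { active_.fetch_add(1, std::memory_order_acquire); }   // start of a call

    void leave() {                                                      // end of a call
        if (active_.fetch_sub(1, std::memory_order_acq_rel) == 1 &&
            doomed_.load(std::memory_order_acquire))
            delete this;              // last user is gone: safe to reclaim
    }

    // Logically delete the object; the final leave() frees it.
    // Precondition (assumed): the caller is between enter() and leave(),
    // and no thread can obtain a new reference from here on.
    void destroy() { doomed_.store(true, std::memory_order_release); }

    virtual ~Removable() = default;

private:
    std::atomic<int>  active_{0};     // calls currently inside the object
    std::atomic<bool> doomed_{false};
};
```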
Protected Procedure Calls
Interprocess Communication
Tornado uses Protected Procedure Calls (PPCs) to bring
locality and concurrency to interprocess communication
A PPC is a call from a client object to a server object
Acts like a clustered object call that passes between the protection
domains of the client and server processes
Advantages of PPC
Client requests are always serviced on their local processor
Clients and servers share the CPU in a manner similar to handoff
scheduling
The server has one thread of control for each client request
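A conceptual sketch of the shape of a PPC only; the protection-domain crossing itself is elided and the names are invented. The point it illustrates: the request is serviced on the caller's own processor, and each request gets its own server-side thread of control.

```cpp
#include <cstddef>

struct Request {
    std::size_t cpu;   // the client's processor; the request is serviced there
    int op;            // placeholder payload
};

class ServerObject {                    // server side, typically a clustered object
public:
    int handle(const Request& req) {
        // One server thread of control per client request, running on req.cpu.
        return req.op * 2;              // placeholder work
    }
};

class PpcStub {                         // client-side view of the server
public:
    explicit PpcStub(ServerObject& server) : server_(server) {}
    int call(std::size_t local_cpu, int op) {
        // In Tornado this step crosses from the client's protection domain to
        // the server's, much like a clustered-object call; here it is a plain call.
        return server_.handle(Request{local_cpu, op});
    }
private:
    ServerObject& server_;
};
```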
Performance
Conclusions
Tornado’s increased locality and concurrency have
produced a scalable OS design for multiprocessors
This locality was provided by several key system features
An object-oriented design
Clustered objects
A new locking strategy
Protected procedure calls