Many-Core Operating Systems
Burton Smith
Technical Fellow
Advanced Strategies and Policy
The von Neumann Premise
Simply put: “there is exactly one program counter”
It has led to some artifacts:
Synchronous coprocessor coroutining (e.g. 8087)
Interrupts for asynchronous concurrency
Demand paging
to make memory allocation incremental
to let |virtual| > |physical|
And some serious problems:
The memory wall (insufficient memory concurrency)
The ILP wall (diminished improvement in ILP)
The power wall (the cost of run-time ILP exploitation)
Given multiple program counters, what should we change?
Scheduling?
Synchronization?
Computing is at a Crossroads
Continual performance improvement is our lifeblood
It encourages people to buy new hardware
It opens up new software possibilities
Single-thread performance is nearing the end of the line
But Moore’s Law will continue for some time to come
What can we do with all those transistors?
Computation needs to become as parallel as possible
Henceforth, serial means slow
Systems must support general purpose parallel computing
The alternative to all this is commoditization
New many-core chips will need new system software
And vice versa!
This talk is about interplay between OS and hardware
Many-Core OS Challenges
Architecture of the parallel virtual machine
Processor management
Multiple processors
A mix of in-order and out-of-order CPUs
GPUs and other performance accelerators
I/O processors and devices
Memory management
Performance problems due to paging
TLB pressure from larger working sets
Bandwidth resources
Quality of service (time management)
For media applications, games, real-time apps, etc.
For deadlines
The Parallel Virtual Machine
What should the interface that the OS presents to parallel application software look like?
Stable, negotiated resource allocation
Isolation among protection domains
Freedom from bottlenecks in OS services
The key objective is fine-grain application parallelism
We need the whole tree, not just the low-hanging fruit
Fine-grain Parallelism
Exploitable parallelism grows as task granularity shrinks
But dependences among tasks become more numerous
Inter-task dependence enforcement demands scheduling
A task needing a value from elsewhere must wait for it
User-level work scheduling is called for
No privilege change is needed to stop or restart a task
Locality (e.g. cache content) can be better preserved
Today’s OS and hardware don’t encourage waiting
OS thread scheduling makes blocking dangerous
Instruction sets encourage non-blocking approaches
Busy-waiting wastes instruction issue opportunities
Impact:
Better instruction set support for blocking synchronization (see the sketch below)
Changes to OS processor and memory resource management
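To make the blocking-synchronization point concrete, here is a minimal sketch (my own illustration, not from the talk) of a full/empty-style synchronization variable emulated with POSIX threads. The instruction set support the slide calls for would give the same stop-and-resume behavior without trapping into these kernel-backed primitives.

```c
/* A full/empty synchronization variable emulated in software.
 * Hardware support would let a task block and resume issuing
 * without mutexes, condition variables, or kernel involvement. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  filled;
    int             full;   /* 0 = empty, 1 = full */
    long            value;
} syncvar_t;

void syncvar_init(syncvar_t *s) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->filled, NULL);
    s->full = 0;
}

/* Consume: block (without busy-waiting) until a producer fills it. */
long syncvar_read(syncvar_t *s) {
    pthread_mutex_lock(&s->lock);
    while (!s->full)
        pthread_cond_wait(&s->filled, &s->lock);  /* yields the CPU */
    s->full = 0;
    long v = s->value;
    pthread_mutex_unlock(&s->lock);
    return v;
}

/* Produce: fill the variable and wake one waiting consumer. */
void syncvar_write(syncvar_t *s, long v) {
    pthread_mutex_lock(&s->lock);
    s->value = v;
    s->full = 1;
    pthread_cond_signal(&s->filled);
    pthread_mutex_unlock(&s->lock);
}
```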
Multithreading and Synchronization
Fine-grain multithreading can use TLP to tolerate latency (see the sketch below)
Memory latency
Other operation latency, e.g. branch latency
Synchronization latency
In the latter case, some architectural support is helpful
To stop issuing from a context while it is waiting
To resume issuing when the wait is over
To free up the context if and when a wait becomes long
The benefits:
Waiting does not consume issue slots
Overhead is automatically amortized
I talked about this stuff in my 1996 FCRC keynote
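A back-of-envelope sketch of why TLP tolerates latency; the latency, issue rate, and miss interval below are illustrative assumptions, not figures from the talk. By Little’s law, there must be enough other contexts issuing to cover each context’s wait.

```c
/* Estimate the hardware thread contexts needed to hide memory latency. */
#include <stdio.h>

int main(void) {
    double mem_latency_ns   = 100.0; /* assumed DRAM round trip              */
    double issue_per_ns     = 1.0;   /* assumed instruction issue rate       */
    double instrs_per_miss  = 25.0;  /* assumed instructions between misses  */

    /* Each context issues for 25 ns, then waits 100 ns.  Keeping the
     * pipeline issuing every cycle takes enough contexts to fill the wait. */
    double contexts = 1.0 + mem_latency_ns * issue_per_ns / instrs_per_miss;
    printf("contexts needed ~= %.0f\n", contexts);  /* prints 5 */
    return 0;
}
```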
Resource Scheduling Consequences
Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
An asynchronous OS API is a necessary corollary
Scheduling memory via demand paging is also problematic
Instead, the two schedulers should negotiate (a hypothetical interface is sketched below)
The application tells the OS its resource needs/desires
The OS makes decisions based on the big picture:
Availability of resources
Appropriateness of power consumption level
Requirements for quality of service
The OS can preempt resources to reclaim them
But with notification, so the application can rearrange things
Resources should be time- and space-shared in chunks
Scheduling turns into a bin-packing problem
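As an illustration of what such a negotiation might look like, here is a hypothetical C interface sketch; every name in it (resource_request, negotiate, preempt_cb, and so on) is invented for this example, not taken from any existing OS.

```c
/* Hypothetical negotiation interface: the application states its needs,
 * the OS answers with a grant, and preemption arrives as a notification
 * rather than a silent revocation. */
#include <stddef.h>

typedef struct {
    int      min_cores, desired_cores;   /* processor needs              */
    size_t   min_bytes, desired_bytes;   /* memory needs                 */
    unsigned qos_ms;                     /* 0 = no quality-of-service deadline */
} resource_request;

typedef struct {
    int    cores;
    size_t bytes;
} resource_grant;

/* The runtime registers a callback so it can rearrange its own
 * user-level schedule before the OS reclaims the resources. */
typedef void (*preempt_cb)(const resource_grant *being_reclaimed);

resource_grant negotiate(const resource_request *req, preempt_cb on_preempt);
```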
Bin Packing
The more resources allocated, the more swapping overhead
It would be nice to amortize it…
The more resources you get, the longer you may keep them
[Figure: allocations drawn as rectangles in a plane with axes “Quantity of resource” and “Time”]
Roughly, this means scheduling = packing squarish blocks (a first-fit sketch follows below)
QOS applications might need long rectangles instead
When the blocks don’t fit, the OS can morph them a little
Or cut corners when absolutely necessary
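A minimal sketch of the packing problem (my illustration; the 64-core machine and 32-quantum horizon are assumed): each request is a cores-by-quanta rectangle, and a first-fit pass slides it along the time axis until it fits under the machine’s core budget.

```c
/* First-fit placement of resource-time blocks. */
#include <stdio.h>

#define NCORES  64
#define HORIZON 32            /* scheduling quanta we plan ahead */

static int used[HORIZON];     /* cores already committed per quantum */

/* Return the first start time where the block fits, or -1. */
int place(int cores, int quanta) {
    for (int t = 0; t + quanta <= HORIZON; t++) {
        int ok = 1;
        for (int k = 0; k < quanta && ok; k++)
            if (used[t + k] + cores > NCORES) ok = 0;
        if (ok) {
            for (int k = 0; k < quanta; k++) used[t + k] += cores;
            return t;
        }
    }
    return -1;  /* no fit: the OS must morph the block or cut corners */
}

int main(void) {
    printf("A: t=%d\n", place(48, 4));   /* squarish block       */
    printf("B: t=%d\n", place(32, 4));   /* forced to start later */
    printf("C: t=%d\n", place(8, 24));   /* long QOS rectangle   */
    return 0;
}
```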
What About Priority Scheduling?
Priorities are appropriate for some kinds of scheduling
Especially when some things to be scheduled are optional
If it all has to be done, how do the priorities get set?
The answer is usually “ad-hoc, and often!”
Fairness is seldom maintained in the process
Quality of service needs a different approach
“How much work must be done before the next deadline?”
Even highly interactive tasks can benefit
Deadlines are harder to implement than priorities
Then again, so is bin packing compared to fixed quanta
Fairness can also be based on quality-of-service concepts
Relative work rates rather than absolute
“In the next 16 milliseconds, give level i activities r times as many processor-seconds as level i-1 activities” (worked through below)
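Working the quoted rule through: level i’s share is proportional to r^i, so its processor-seconds in the frame are P·T·r^i / Σ r^j. The machine size, number of levels, and r below are assumed for illustration.

```c
/* Per-level processor-seconds in one 16 ms frame under the quoted rule. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const int    levels = 4;
    const double r      = 2.0;     /* assumed ratio between levels */
    const double frame  = 0.016;   /* seconds                      */
    const int    cores  = 64;      /* assumed processor count      */

    double total = 0.0;
    for (int i = 0; i < levels; i++) total += pow(r, i);

    for (int i = 0; i < levels; i++) {
        double share = pow(r, i) / total;   /* fraction of machine capacity */
        printf("level %d: %.4f processor-seconds\n", i, share * cores * frame);
    }
    return 0;
}
```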
Heterogeneous Processors
There are two kinds of heterogeneity:
In architecture (HA), i.e. different instruction sets
In implementation (HI), i.e. different performance characteristics
Both are likely to be important
A single application might ask for a heterogeneous mix
Failure in the HA case might need multiple versions or JIT
In the HI case, scheduling might be based on instrumentation
A key question is whether a processor is time-sharable
If not, the OS has to dedicate it to one application at a time
With user-level scheduling and some support for preemption, application state save and restore can be done at user level
Virtual Memory Design Alternatives
Swapping instead of demand paging
Address-space names/identifiers
TLB shootdown becomes a rarer event
Hardware TLB coherence
Two-dimensional addressing (segmentation w/o registers)
To assist with variable granularity memory allocation
To help mitigate upward pressure on TLB size
To leverage persistent memory via segment sharing
A variation of mmap() might suffice for this purpose (sketched below)
To accommodate variations in memory bank architecture
Local versus global, for example
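A minimal sketch of the mmap() variation idea: mapping a file MAP_SHARED yields a named, persistent, sharable segment of the address space. The path handling and segment size here are illustrative.

```c
/* Map a file-backed segment that outlives the process and can be
 * shared by name with other protection domains. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_BYTES (1 << 20)   /* illustrative segment size */

void *map_segment(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, SEG_BYTES) < 0) { close(fd); return NULL; }

    /* MAP_SHARED makes stores visible to other processes mapping the
     * same file, and the file itself persists after exit. */
    void *seg = mmap(NULL, SEG_BYTES, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);  /* the mapping keeps the segment alive */
    return seg == MAP_FAILED ? NULL : seg;
}
```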
Physical Memory Bank Architecture
Interleaving addresses across the banks is a solution
Page granularity is the standard choice
If memory access is non-uniform, this is not the best idea
Consider this example:
An application is using 31 cores, about half of them
50% of its cache misses are stack references
The stacks are all allocated in a compact virtual region
How many of the 128 memory banks are available? (see the arithmetic sketch below)
Stacks should be allocated near their processors
So should compiler-allocated temporary arrays on the heap
Is it one bank architecture scheme fits all, or not?
If not, how do we manage the virtual address space?
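The example’s arithmetic, sketched with assumed page and stack sizes: under page-granularity interleaving, 31 stacks packed into a compact region land on only a few of the 128 banks, so half the application’s misses queue behind a small fraction of the machine’s memory bandwidth.

```c
/* Count the banks touched by the hot page of each of 31 stacks. */
#include <stdio.h>

#define NBANKS    128
#define PAGE_SIZE 4096UL
#define STACK_SZ  (16 * PAGE_SIZE)   /* assumed per-thread stack */

static int bank_of(unsigned long addr) {
    return (addr / PAGE_SIZE) % NBANKS;   /* page-interleaved mapping */
}

int main(void) {
    int hit[NBANKS] = {0};
    unsigned long base = 0x10000000UL;    /* compact stack region */

    /* Assume each thread's hot references stay near its stack top. */
    for (int t = 0; t < 31; t++)
        hit[bank_of(base + t * STACK_SZ)] = 1;

    int banks = 0;
    for (int b = 0; b < NBANKS; b++) banks += hit[b];
    printf("stack traffic touches %d of %d banks\n", banks, NBANKS);
    return 0;   /* with these sizes: 8 of 128 */
}
```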
“Hot Spots”
When processors share memory, they can interfere
Within an application, this creates performance problems
Hardware help is needed to discover “where” these are
Between applications, interference is even more serious
Not only data races, but also bandwidth oversubscription
Performance unpredictability
Denial of service
Covert-channel signaling
Bandwidth is a resource like any other
We need to be able to partition and isolate it
I/O Architecture
Direct memory access is usually a good way to do I/O
Today’s DMA mostly demands “wired down” pages
This leads to lots of data copying and other OS warts
But I/O devices are getting smarter all the time
Transistors are cheaper than almost anything else
Why not treat I/O devices like heterogeneous processors?
Teach them to do virtual address translation
Allocate them to real-time or sensor-intensive applications
Allocate them to a not-very-trusted “driver application”
Address space sharing can be partial, as it is now
There is a problem, though: inter-domain signaling (IPC)
This is what interrupts do
I have some issues with interrupts
Interrupts
Interrupts are OK when there is only one processor
Some people avoid them to make systems more predictable
If there are many processors, which one do you interrupt?
The usual solution: “just pick one and leave it to software”
A better idea is to signal via an address space you already share (perhaps only in part) with the intended recipient
“The DMA at address <a> is (ready)(done)”
This is kinda like doing programmed I/O via device CSRs
It’s also the way the CDC 6600 and 7600 did things
You may not want to have the signal recipient busy-wait (a futex-based sketch follows below)
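A sketch of that style of signaling, with the device emulated by ordinary code and Linux futexes standing in for whatever wakeup mechanism the hardware would provide: the status word lives in memory shared with the recipient, who blocks on it rather than busy-waiting or taking an interrupt.

```c
/* Memory-based completion signaling: "the DMA at address <a> is done". */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { DMA_PENDING = 0, DMA_DONE = 1 };

static atomic_int dma_status = DMA_PENDING;  /* lives in shared memory */

/* Device side: complete the transfer, then wake any waiter. */
void device_complete(void) {
    atomic_store(&dma_status, DMA_DONE);
    syscall(SYS_futex, &dma_status, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* Recipient side: block (no busy-wait) until the status word changes. */
void wait_for_dma(void) {
    while (atomic_load(&dma_status) == DMA_PENDING)
        syscall(SYS_futex, &dma_status, FUTEX_WAIT, DMA_PENDING,
                NULL, NULL, 0);
}
```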
Conclusions
It is time to rethink some of the basics of computing
There is lots of work for everyone to do
e.g. I’ve left out compilers, debuggers, and applications
We need basic research as well as industrial development
Research in computer systems is deprecated these days
In the USA, NSF and DOD need to take the initiative