Many-Core Operating Systems
Burton Smith
Technical Fellow
Advanced Strategies and Policy
The von Neumann Premise
Simply put: “there is exactly one program counter”
It has led to some artifacts:
Synchronous coprocessor coroutining (e.g. 8087)
Interrupts for asynchronous concurrency
Demand paging
to make memory allocation incremental
to let |virtual| > |physical|
And some serious problems:
The memory wall (insufficient memory concurrency)
The ILP wall (diminished improvement in ILP)
The power wall (the cost of run-time ILP exploitation)
Given multiple program counters, what should we change?
Scheduling?
Synchronization?
Computing is at a Crossroads
Continual performance improvement is our lifeblood
It encourages people to buy new hardware
It opens up new software possibilities
Single-thread performance is nearing the end of the line
But Moore’s Law will continue for some time to come
What can we do with all those transistors?
Computation needs to become as parallel as possible
Henceforth, serial means slow
Systems must support general purpose parallel computing
The alternative to all this is commoditization
New many-core chips will need new system software
And vice versa!
This talk is about interplay between OS and hardware
Many-Core OS Challenges
Architecture of the parallel virtual machine
Processor management
Multiple processors
A mix of in-order and out-of-order CPUs
GPUs and other performance accelerators
I/O processors and devices
Memory management
Performance problems due to paging
TLB pressure from larger working sets
Bandwidth resources
Quality of service (time management)
For media applications, games, real-time apps, etc.
For deadlines
The Parallel Virtual Machine
What should the interface that the OS presents to parallel application software look like?
Stable, negotiated resource allocation
Isolation among protection domains
Freedom from bottlenecks in OS services
The key objective is fine-grain application parallelism
We need the whole tree, not just the low-hanging fruit
Fine-grain Parallelism
Exploitable parallelism grows as task granularity shrinks
But dependences among tasks become more numerous
Inter-task dependence enforcement demands scheduling
A task needing a value from elsewhere must wait for it
User-level work scheduling is called for
No privilege change is needed to stop or restart a task
Locality (e.g. cache content) can be better preserved
Today’s OS and hardware don’t encourage waiting
OS thread scheduling makes blocking dangerous
Instruction sets encourage non-blocking approaches
Busy-waiting wastes instruction issue opportunities
Impact:
Better instruction set support for blocking synchronization (see the sketch below)
Changes to OS processor and memory resource management
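To make the blocking-synchronization point concrete, here is a minimal sketch (my own illustration, not from the talk) of a full/empty-style synchronization variable emulated with POSIX threads. The instruction set support the slide calls for would give the same stop-and-resume behavior without trapping into these kernel-backed primitives.

```c
/* A full/empty synchronization variable emulated in software.
 * Hardware support would let a task block and resume issuing
 * without mutexes, condition variables, or kernel involvement. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  filled;
    int             full;   /* 0 = empty, 1 = full */
    long            value;
} syncvar_t;

void syncvar_init(syncvar_t *s) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->filled, NULL);
    s->full = 0;
}

/* Consume: block (without busy-waiting) until a producer fills it. */
long syncvar_read(syncvar_t *s) {
    pthread_mutex_lock(&s->lock);
    while (!s->full)
        pthread_cond_wait(&s->filled, &s->lock);  /* yields the CPU */
    s->full = 0;
    long v = s->value;
    pthread_mutex_unlock(&s->lock);
    return v;
}

/* Produce: fill the variable and wake one waiting consumer. */
void syncvar_write(syncvar_t *s, long v) {
    pthread_mutex_lock(&s->lock);
    s->value = v;
    s->full = 1;
    pthread_cond_signal(&s->filled);
    pthread_mutex_unlock(&s->lock);
}
```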
Multithreading and Synchronization
Fine-grain multithreading can use TLP to tolerate latency (see the sketch below)
Memory latency
Other operation latency, e.g. branch latency
Synchronization latency
In the latter case, some architectural support is helpful
To stop issuing from a context while it is waiting
To resume issuing when the wait is over
To free up the context if and when a wait becomes long
The benefits:
Waiting does not consume issue slots
Overhead is automatically amortized
I talked about this stuff in my 1996 FCRC keynote
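A back-of-envelope sketch of why TLP tolerates latency; the latency, issue rate, and miss interval below are illustrative assumptions, not figures from the talk. By Little’s law, there must be enough other contexts issuing to cover each context’s wait.

```c
/* Estimate the hardware thread contexts needed to hide memory latency. */
#include <stdio.h>

int main(void) {
    double mem_latency_ns   = 100.0; /* assumed DRAM round trip              */
    double issue_per_ns     = 1.0;   /* assumed instruction issue rate       */
    double instrs_per_miss  = 25.0;  /* assumed instructions between misses  */

    /* Each context issues for 25 ns, then waits 100 ns.  Keeping the
     * pipeline issuing every cycle takes enough contexts to fill the wait. */
    double contexts = 1.0 + mem_latency_ns * issue_per_ns / instrs_per_miss;
    printf("contexts needed ~= %.0f\n", contexts);  /* prints 5 */
    return 0;
}
```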
Resource Scheduling Consequences
Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
An asynchronous OS API is a necessary corollary
Scheduling memory via demand paging is also problematic
Instead, the two schedulers should negotiate (a hypothetical interface is sketched below)
The application tells the OS its resource needs/desires
The OS makes decisions based on the big picture:
Availability of resources
Appropriateness of power consumption level
Requirements for quality of service
The OS can preempt resources to reclaim them
But with notification, so the application can rearrange things
Resources should be time- and space-shared in chunks
Scheduling turns into a bin-packing problem
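As an illustration of what such a negotiation might look like, here is a hypothetical C interface sketch; every name in it (resource_request, negotiate, preempt_cb, and so on) is invented for this example, not taken from any existing OS.

```c
/* Hypothetical negotiation interface: the application states its needs,
 * the OS answers with a grant, and preemption arrives as a notification
 * rather than a silent revocation. */
#include <stddef.h>

typedef struct {
    int      min_cores, desired_cores;   /* processor needs              */
    size_t   min_bytes, desired_bytes;   /* memory needs                 */
    unsigned qos_ms;                     /* 0 = no quality-of-service deadline */
} resource_request;

typedef struct {
    int    cores;
    size_t bytes;
} resource_grant;

/* The runtime registers a callback so it can rearrange its own
 * user-level schedule before the OS reclaims the resources. */
typedef void (*preempt_cb)(const resource_grant *being_reclaimed);

resource_grant negotiate(const resource_request *req, preempt_cb on_preempt);
```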
Bin Packing
The more resources allocated, the more swapping overhead
It would be nice to amortize it…
The more resources you get, the longer you may keep them
[Figure: allocations drawn as rectangles in a plane with axes “Quantity of resource” and “Time”]
Roughly, this means scheduling = packing squarish blocks (a first-fit sketch follows below)
QOS applications might need long rectangles instead
When the blocks don’t fit, the OS can morph them a little
Or cut corners when absolutely necessary
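A minimal sketch of the packing problem (my illustration; the 64-core machine and 32-quantum horizon are assumed): each request is a cores-by-quanta rectangle, and a first-fit pass slides it along the time axis until it fits under the machine’s core budget.

```c
/* First-fit placement of resource-time blocks. */
#include <stdio.h>

#define NCORES  64
#define HORIZON 32            /* scheduling quanta we plan ahead */

static int used[HORIZON];     /* cores already committed per quantum */

/* Return the first start time where the block fits, or -1. */
int place(int cores, int quanta) {
    for (int t = 0; t + quanta <= HORIZON; t++) {
        int ok = 1;
        for (int k = 0; k < quanta && ok; k++)
            if (used[t + k] + cores > NCORES) ok = 0;
        if (ok) {
            for (int k = 0; k < quanta; k++) used[t + k] += cores;
            return t;
        }
    }
    return -1;  /* no fit: the OS must morph the block or cut corners */
}

int main(void) {
    printf("A: t=%d\n", place(48, 4));   /* squarish block       */
    printf("B: t=%d\n", place(32, 4));   /* forced to start later */
    printf("C: t=%d\n", place(8, 24));   /* long QOS rectangle   */
    return 0;
}
```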
What About Priority Scheduling?
Priorities are appropriate for some kinds of scheduling
Especially when some things to be scheduled are optional
If it all has to be done, how do the priorities get set?
The answer is usually “ad-hoc, and often!”
Fairness is seldom maintained in the process
Quality of service needs a different approach
“How much work must be done before the next deadline?”
Even highly interactive tasks can benefit
Deadlines are harder to implement than priorities
Then again, so is bin packing compared to fixed quanta
Fairness can also be based on quality-of-service concepts
Relative work rates rather than absolute
“In the next 16 milliseconds, give level i activities r times as many processor-seconds as level i-1 activities” (worked through below)
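Working the quoted rule through: level i’s share is proportional to r^i, so its processor-seconds in the frame are P·T·r^i / Σ r^j. The machine size, number of levels, and r below are assumed for illustration.

```c
/* Per-level processor-seconds in one 16 ms frame under the quoted rule. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const int    levels = 4;
    const double r      = 2.0;     /* assumed ratio between levels */
    const double frame  = 0.016;   /* seconds                      */
    const int    cores  = 64;      /* assumed processor count      */

    double total = 0.0;
    for (int i = 0; i < levels; i++) total += pow(r, i);

    for (int i = 0; i < levels; i++) {
        double share = pow(r, i) / total;   /* fraction of machine capacity */
        printf("level %d: %.4f processor-seconds\n", i, share * cores * frame);
    }
    return 0;
}
```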
Heterogeneous Processors
There are two kinds of heterogeneity:
In architecture (HA), i.e. different instruction sets
In implementation (HI), i.e. different performance characteristics
Both are likely to be important
A single application might ask for a heterogeneous mix
Failure in the HA case might need multiple versions or JIT
In the HI case, scheduling might be based on instrumentation
A key question is whether a processor is time-sharable
If not, the OS has to dedicate it to one application at a time
With user-level scheduling and some support for preemption, application state save and restore can be done at user level
Virtual Memory Design Alternatives
Swapping instead of demand paging
Address-space names/identifiers
TLB shootdown becomes a rarer event
Hardware TLB coherence
Two-dimensional addressing (segmentation w/o registers)
To assist with variable granularity memory allocation
To help mitigate upward pressure on TLB size
To leverage persistent memory via segment sharing
A variation of mmap() might suffice for this purpose (sketched below)
To accommodate variations in memory bank architecture
Local versus global, for example
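A minimal sketch of the mmap() variation idea: mapping a file MAP_SHARED yields a named, persistent, sharable segment of the address space. The path handling and segment size here are illustrative.

```c
/* Map a file-backed segment that outlives the process and can be
 * shared by name with other protection domains. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_BYTES (1 << 20)   /* illustrative segment size */

void *map_segment(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, SEG_BYTES) < 0) { close(fd); return NULL; }

    /* MAP_SHARED makes stores visible to other processes mapping the
     * same file, and the file itself persists after exit. */
    void *seg = mmap(NULL, SEG_BYTES, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);  /* the mapping keeps the segment alive */
    return seg == MAP_FAILED ? NULL : seg;
}
```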
Physical Memory Bank Architecture
Interleaving addresses across the banks is a solution
Page granularity is the standard choice
If memory access is non-uniform, this is not the best idea
Consider this example:
An application is using 31 cores, about half of them
50% of its cache misses are stack references
The stacks are all allocated in a compact virtual region
How many of the 128 memory banks are available? (see the arithmetic sketch below)
Stacks should be allocated near their processors
So should compiler-allocated temporary arrays on the heap
Is it one bank architecture scheme fits all, or not?
If not, how do we manage the virtual address space?
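The example’s arithmetic, sketched with assumed page and stack sizes: under page-granularity interleaving, 31 stacks packed into a compact region land on only a few of the 128 banks, so half the application’s misses queue behind a small fraction of the machine’s memory bandwidth.

```c
/* Count the banks touched by the hot page of each of 31 stacks. */
#include <stdio.h>

#define NBANKS    128
#define PAGE_SIZE 4096UL
#define STACK_SZ  (16 * PAGE_SIZE)   /* assumed per-thread stack */

static int bank_of(unsigned long addr) {
    return (addr / PAGE_SIZE) % NBANKS;   /* page-interleaved mapping */
}

int main(void) {
    int hit[NBANKS] = {0};
    unsigned long base = 0x10000000UL;    /* compact stack region */

    /* Assume each thread's hot references stay near its stack top. */
    for (int t = 0; t < 31; t++)
        hit[bank_of(base + t * STACK_SZ)] = 1;

    int banks = 0;
    for (int b = 0; b < NBANKS; b++) banks += hit[b];
    printf("stack traffic touches %d of %d banks\n", banks, NBANKS);
    return 0;   /* with these sizes: 8 of 128 */
}
```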
“Hot Spots”
When processors share memory, they can interfere
Within an application, this creates performance problems
Hardware help is needed to discover “where” these are
Between applications, interference is even more serious
Not only data races, but also bandwidth oversubscription
Performance unpredictability
Denial of service
Covert-channel signaling
Bandwidth is a resource like any other
We need to be able to partition and isolate it
I/O Architecture
Direct memory access is usually a good way to do I/O
Today’s DMA mostly demands “wired down” pages
This leads to lots of data copying and other OS warts
But I/O devices are getting smarter all the time
Transistors are cheaper than almost anything else
Why not treat I/O devices like heterogeneous processors?
Teach them to do virtual address translation
Allocate them to real-time or sensor-intensive applications
Allocate them to a not-very-trusted “driver application”
Address space sharing can be partial, as it is now
There is a problem, though: inter-domain signaling (IPC)
This is what interrupts do
I have some issues with interrupts
Interrupts
Interrupts are OK when there is only one processor
Some people avoid them to make systems more predictable
If there are many processors, which one do you interrupt?
The usual solution: “just pick one and leave it to software”
A better idea is to signal via an address space you already share (perhaps only in part) with the intended recipient
“The DMA at address <a> is (ready)(done)”
This is kinda like doing programmed I/O via device CSRs
It’s also the way the CDC 6600 and 7600 did things
You may not want to have the signal recipient busy-wait (a futex-based sketch follows below)
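A sketch of that style of signaling, with the device emulated by ordinary code and Linux futexes standing in for whatever wakeup mechanism the hardware would provide: the status word lives in memory shared with the recipient, who blocks on it rather than busy-waiting or taking an interrupt.

```c
/* Memory-based completion signaling: "the DMA at address <a> is done". */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { DMA_PENDING = 0, DMA_DONE = 1 };

static atomic_int dma_status = DMA_PENDING;  /* lives in shared memory */

/* Device side: complete the transfer, then wake any waiter. */
void device_complete(void) {
    atomic_store(&dma_status, DMA_DONE);
    syscall(SYS_futex, &dma_status, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* Recipient side: block (no busy-wait) until the status word changes. */
void wait_for_dma(void) {
    while (atomic_load(&dma_status) == DMA_PENDING)
        syscall(SYS_futex, &dma_status, FUTEX_WAIT, DMA_PENDING,
                NULL, NULL, 0);
}
```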
Conclusions
It is time to rethink some of the basics of computing
There is lots of work for everyone to do
e.g. I’ve left out compilers, debuggers, and applications
We need basic research as well as industrial development
Research in computer systems is deprecated these days
In the USA, NSF and DOD need to take the initiative