USING-OS-OBSERVATIONS


USING OS
OBSERVATIONS TO
IMPROVE
PERFORMANCE IN
MULTICORE SYSTEMS
Authors:
Rob Knauerhase
Paul Brett
Barbara Hohlt
Tong Li
Scott Hahn
Presented By
Ankit Patel
Paper summary


Shows a practical way of improving system
performance by modifying the operating system
Paper also shows that using features provided by
the multicore processor in the operating system
kernel can significantly improve the
performance of the system



Today’s operating systems don’t handle the
complexity of multicore processors
The goal of this research paper is to show that the
OS can use data obtained from dynamic runtime
observation of task behavior to guide its policies.
This paper will also demonstrate the utility of
observation-based policies
Background



A multicore processor has several independent
processing units on a single chip. Also known as
Chip Multiprocessing (CMP)
E.g. dual- and quad-core processors from Intel and
AMD
The cores share certain on-chip resources, e.g. the
Last-Level Cache


Another type of multicore system:
Simultaneous multithreading (SMT)

Looks like multicore, but implements each virtual
core with a combination of functional units within a
processor. E.g. Pentium 4 processors based on HT
technology.
OK
There is one problem here….
Problem



As the number of cores within a die increases,
design complexity also increases for
multithreaded cores.
The current OS is completely unaware of the internal
design of the multicore processor.
So it neither exploits CMP features nor avoids
CMP challenges.
From OS level

Current operating systems exploit multicore
systems by using a multiprocessing kernel, e.g.
Linux and Windows use Symmetric
Multiprocessing (SMP), where the kernel treats each
core as an independent processor.
Features provided by processor to
improve performance


Modern processors also include hardware
features for monitoring the CPU’s performance
and behavior
For example, the Intel VTune system allows
programmers to optimize for cache usage and
floating-point and multimedia extensions
(MMX) instruction usage.
What should the operating system do?


The OS should observe the behavior
of threads running in the system
These observations, combined with knowledge
of the processor architecture, allow the
implementation of different policies in the OS.
But this is expensive,
and the CEO will ask for business value
before investing?
Advantages



Good policies can improve overall system
performance
Improve application performance
Decrease system power consumption
(a huge issue in data centers)
Or provide arbitrary user-defined combinations
of these benefits.
…and this is enough business value
to bring a new operating system to
market…
Don’t you think?
…Very Exciting....
Can you show how this can be done
in OS kernel??
Testing environment used in this
paper



Linux (kernel 2.6.20) on Intel Xeon 5300 series
(contains two 4MB LLC arrays, each shared by
two cores)
Mac OS X (Darwin) on Intel Core 2 Duo
(contains one 2MB LLC shared by two cores)
NO Microsoft WINDOWS

WHY ???????????
Challenges

Cache interference in the last-level cache (LLC):

If a task runs on core A, it can use the entire LLC. Another task, running on
core B, shares the LLC resource; the resulting contention slows both tasks.
Worse, the amount of contention is quite dynamic because it depends on
each task’s behavior at a given time. This behavior is impossible for the
application to know at compile time.

Lack of intelligent thread migration:

Linux and OS X include migration for basic load balancing, but they treat
each core equivalently, without the notion of resources shared among cores.

No accommodation of cores with different features:

Current Intel-compatible multicore implementations feature cores that are
exact copies of each other. In the future, however, some cores will likely be
functionally asymmetric for reasons of power, die area, cost, and complexity.
Observation subsystem

Inspects relevant performance monitoring
counters and kernel data structures and gathers
information on a per-thread basis.
Processor counters used for several
measurable events





LLC misses (INVALID_L2_RQSTS),
LLC references (L2_RQSTS),
instructions retired (INSTR_RETIRED.ANY)
core cycles (CPU_CLK_UNHALTED.CORE)
reference cycles (CPU_CLK_UNHALTED.REF).
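
As an illustration of how per-thread counter readings like the ones above might be turned into usable metrics, here is a minimal sketch. The `CounterSample` type, its field names, and `quantum_metrics` are hypothetical names invented for this example; actually reading the hardware counters needs a platform-specific interface that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Hypothetical snapshot of one thread's counters (names are assumptions)."""
    llc_misses: int
    llc_references: int
    instructions_retired: int
    core_cycles: int


def quantum_metrics(start: CounterSample, end: CounterSample) -> dict:
    """Derive per-quantum metrics from two snapshots taken around a quantum."""
    cycles = end.core_cycles - start.core_cycles
    refs = end.llc_references - start.llc_references
    misses = end.llc_misses - start.llc_misses
    instrs = end.instructions_retired - start.instructions_retired
    return {
        # Guard against empty quanta so we never divide by zero.
        "misses_per_cycle": misses / cycles if cycles else 0.0,
        "miss_ratio": misses / refs if refs else 0.0,
        "ipc": instrs / cycles if cycles else 0.0,
    }
```

Taking deltas between two snapshots, rather than using raw totals, keeps the metrics tied to a single scheduling quantum.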
Policy: Reducing cache interference


Cache misses per cycle are the best indication of
cache interference.
How to predict future behavior from historical
behavior:

Basing the weight on the metrics from the
immediately past quantum (that is, using temporal
locality as our predictor) is the best solution.
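
A minimal sketch of the "last quantum predicts the next" idea follows. The class and method names are hypothetical, and using cache misses per cycle directly as the weight is an assumption made for illustration, not a detail confirmed by the slides.

```python
class LastQuantumPredictor:
    """Keeps only the most recent quantum's weight for each task."""

    def __init__(self):
        self._weights = {}

    def record(self, task_id, llc_misses, cycles):
        # Assumption: weight = cache misses per cycle in the quantum just ended.
        # Older history is deliberately overwritten (temporal locality).
        self._weights[task_id] = llc_misses / cycles if cycles else 0.0

    def weight(self, task_id):
        # Tasks that have never run are treated as weightless (an assumption).
        return self._weights.get(task_id, 0.0)
```

Overwriting instead of averaging means a task that changes phase is reclassified after a single quantum.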
From Implementation Point of View



Implemented observation and scheduler modifications
in the Linux and Mac OS X kernels.
Added a cache interference policy among cores sharing
an LLC.
When core 0 is ready to be assigned a new task, the
scheduler examines the weights of tasks on other cores
and chooses a task whose weight best complements the
corunning tasks for core 0. Thus, heavy and light tasks
tend to be coscheduled on the shared cache, avoiding
the interference that results from coscheduling two
heavy tasks.
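
The selection step described above can be sketched roughly as follows. This is not the paper's kernel code: `pick_complementary` and the `target_load` parameter are hypothetical, and "best complements" is interpreted here as "brings the combined weight on the shared cache closest to a target".

```python
def pick_complementary(candidates, corunner_weights, target_load):
    """Pick the runnable task that best balances the shared cache.

    candidates: dict mapping task name -> weight (e.g. misses per cycle).
    corunner_weights: weights of tasks currently running on the other
        cores that share the same LLC.
    target_load: hypothetical tuning knob, the combined weight to aim for.
    """
    if not candidates:
        return None
    co_load = sum(corunner_weights)
    # A heavy co-runner pushes the choice toward a light task, and vice versa.
    return min(candidates,
               key=lambda t: abs(co_load + candidates[t] - target_load))
```

For example, with candidates `{"heavy": 0.9, "light": 0.1}` and a target of 1.0, a co-runner of weight 0.8 leads to picking `"light"`, while a co-runner of weight 0.1 leads to picking `"heavy"`.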
Results


Linux: 30% improvement over the current
implementation in the worst-case scenario, and 6%
overall performance improvement
Mac OS X: 3% overall performance
improvement
Results
Policy: Migrating across caches


Use observations to inform task migration decisions
in the OS.
Goal is to distribute cache-heavy threads
throughout the system, not only helping spread
out cache load, but also providing more
opportunity for OBS-L to achieve benefits with
its local policies.
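
One simple way to spread cache-heavy tasks across LLC groups is a greedy placement, sketched below. This is an illustrative assumption, not the migration algorithm from the paper; `spread_across_llc_groups` is a hypothetical name.

```python
def spread_across_llc_groups(task_weights, num_groups):
    """Greedy placement: heaviest task first, onto the least-loaded LLC group.

    task_weights: dict mapping task name -> cache weight.
    Returns a list of task-name lists, one per LLC group.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    groups = [[] for _ in range(num_groups)]
    loads = [0.0] * num_groups
    ordered = sorted(task_weights.items(), key=lambda kv: kv[1], reverse=True)
    for task, weight in ordered:
        lightest = loads.index(min(loads))  # group with the least cache load
        groups[lightest].append(task)
        loads[lightest] += weight
    return groups
```

Placing the heaviest tasks first means they land on different groups before the light tasks fill in around them.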
Results



The Linux load balancer produced speedups of between 8 and 18
percent for the heavy cachebuster tasks.
With the addition of OBS-X, cachebuster performance
increased between 12 percent and 62 percent.
WHY ???

OBS-X distributed the cache-heavy tasks across LLC groups,
which minimizes the coscheduling of heavy tasks
Results
Policy: Addressing fairness

Under this policy (implemented as OBS-C), the
system computes the difference in weights
between any coscheduled tasks and transfers
CPU time (tcredit) proportionally from the
heavier to the lighter task
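
A rough sketch of a proportional credit transfer between two coscheduled tasks is given below. The function name, the `scale` parameter, and the normalization by the sum of the weights are all assumptions made for illustration; the slides only say that time is transferred in proportion to the weight difference.

```python
def transfer_credit(weight_a, weight_b, quantum_ms, scale=1.0):
    """Return (delta_a, delta_b): CPU-time adjustments in milliseconds.

    The heavier task gives up time and the lighter task gains the same
    amount. `scale` is a hypothetical tuning knob.
    """
    total = weight_a + weight_b
    diff = weight_a - weight_b
    if total == 0 or diff == 0:
        return 0.0, 0.0  # nothing to rebalance
    # Assumption: normalize the difference so the credit never exceeds
    # scale * quantum_ms.
    credit = scale * quantum_ms * abs(diff) / total
    if diff > 0:
        return -credit, credit
    return credit, -credit
```

The two adjustments always sum to zero, so total CPU time is conserved and only its split between the tasks changes.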
Results
Policy: Accommodating functional
asymmetry



Simulated functional asymmetry in an existing
multicore system.
Used CPU-specific flags to disable floating-point
(FP) and SIMD instructions (including MMX
and streaming SIMD) on a subset
of cores in the system
A second version of the policy (OBS-F2)
monitors the task after it migrates
Another Feature of the policy

OBS-F tracks the accumulation of unavailable
instructions over time (total number and
number of faulting quanta). Using the task’s
history, OBS-F can determine that the frequency
of faulting instructions is high enough that the
task should be banned from a core for the rest
of its life in the system, saving both migration
costs and cache interference caused by a task’s
frequently moving to and from FP-enabled
cores.
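
The history tracking described above could look roughly like this sketch. `FaultHistory`, `min_quanta`, and `max_fault_fraction` are hypothetical names and thresholds chosen for illustration; the slides do not state how OBS-F decides that the fault frequency is "high enough".

```python
from dataclasses import dataclass


@dataclass
class FaultHistory:
    """Per-task record of unavailable-instruction faults on a restricted core."""
    total_faults: int = 0      # total number of faulting instructions
    faulting_quanta: int = 0   # number of quanta with at least one fault
    quanta_seen: int = 0

    def record_quantum(self, faults: int) -> None:
        self.quanta_seen += 1
        self.total_faults += faults
        if faults > 0:
            self.faulting_quanta += 1

    def should_ban(self, min_quanta: int = 10,
                   max_fault_fraction: float = 0.5) -> bool:
        # Assumed rule: wait for enough history, then ban the task from the
        # restricted core if most of its quanta there have faulted.
        if self.quanta_seen < min_quanta:
            return False
        return self.faulting_quanta / self.quanta_seen > max_fault_fraction
```

Requiring a minimum amount of history avoids banning a task because of a short burst of FP or SIMD work early in its life.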
Results
Other observation-driven policies

Reducing functional-unit interference


Observation of contention will allow the OS to implement analogous
policies - either to migrate a task somewhere with less contention or to
credit CPU time back to a task that has suffered disproportionately from
contention.
Multicore power management


Implementations allow cores to change power states independently. Initial
research indicates that OS-level observation—of both hardware power
events and per-task activity—will enable better power management
policies.
For example: tasks that are largely memory bound might run on a lower-power
core and obtain similar performance, whereas computation-intensive tasks
might benefit from migration to a higher-speed or higher-power core.

Virtualization:

Runtime observation of behavior is even more
important for VMs because an even greater opacity
prevents knowledge of what a VM is doing or will
do next.
Windows lovers…
Pop Quiz
Write a simple program and run two instances of
it on Windows: one with normal
priority and the other with a higher
priority. See how the normal-priority
process suffers. Compare the same
on Linux
Research Avenues



More research is needed on how the OS manages cache
Changing the kernel requires a full understanding of at least
the OS kernel’s memory management and scheduler,
which are themselves big research areas.
Because processor technology progresses fast and OS kernels
progress slowly, OS kernels underutilize the
functionality provided by modern multicore processors.
Another research area could explore how
to utilize the processor to its full potential
Business Value

If we implement at least one of the above-mentioned
policies, it could be a big turnaround for the new OS and add more business
value for the OS vendor.
Summary

The OS can help by making dynamic observations of
task behavior and then implementing smarter
policies based on the results of these
observations.
So what do you think??
How are these two papers related?
Questions, Comments, Concerns ???
Thank you for staying awake !!!