Performance of multiprocessing systems: Benchmarks and performance counters
Miodrag Bolic
ELG7187 Topics in Computers: Multiprocessor Systems on Chip
Outline
• Benchmarks
• Measurements and monitoring
• Performance counters
Types of benchmarks [1]
• Synthetic benchmarks
– small artificial programs containing a mixture of statements selected to be representative of a large class of real applications.
• Kernel benchmarks
– small but relevant parts of real applications which typically capture a large portion of the execution time of real applications.
• Real application benchmarks
– complete real applications executed with representative inputs.
Benchmarks: challenges
• Challenges in developing benchmarks
– Testing a whole system: CPU, cache, main memory, compilers
– Selecting a suitable set of applications
– Making benchmarks portable (ANSI C: How big is a long? How big is a pointer? Does this platform implement calloc? Is it little endian or big endian? See the sketch at the end of this slide.)
• Fixed workload benchmarks - how fast the workload was completed
– EEMBC MPEG-x benchmark - time to process the entire video
• Throughput benchmarks - how many workload units were completed per unit time
– EEMBC MPEG-x benchmark - number of frames processed in a fixed amount of time
• The base metrics
– the same compiler flags must be used in the same order for all benchmarks.
• The peak metrics
– different compiler options may be used on each benchmark.
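A minimal C sketch, not part of the original slides, probing the portability questions listed above; it assumes only a hosted C99 implementation:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    int main(void) {
        /* Type sizes differ across platforms (e.g., long is 4 bytes on
           LLP64 Windows but 8 bytes on LP64 Unix systems). */
        printf("sizeof(long)   = %zu\n", sizeof(long));
        printf("sizeof(void *) = %zu\n", sizeof(void *));

        /* calloc is required by the C standard library, but a benchmark
           can still check that it behaves as expected. */
        void *p = calloc(1, 16);
        printf("calloc         = %s\n", p ? "available" : "failed");
        free(p);

        /* Detect byte order by inspecting the first byte of a known value. */
        uint32_t probe = 1;
        printf("byte order     = %s endian\n",
               (*(uint8_t *)&probe == 1) ? "little" : "big");
        return 0;
    }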
Available Benchmarks [2]
• SPEC CPU (general purpose)
• MediaBench (media)
• BioPerf (bioinformatics)
• PARSEC - multi-threaded workloads on multicore processors
• DaCapo to evaluate Java workloads
• STAMP to evaluate transactional memory
SPEC
• Each of the programs is executed three times on the computer system U to be tested. For each of the programs $A_i$, an average execution time $T_U(A_i)$ in seconds is determined by taking the median of the three execution times measured.
• For each program, the execution time $T_U(A_i)$ determined in step (1) is normalized with respect to the reference computer R by dividing the execution time $T_R(A_i)$ on R by the execution time $T_U(A_i)$ on U. This yields an execution factor $F_U(A_i) = T_R(A_i)/T_U(A_i)$.
– R - a Sun Ultra Enterprise 2 with a 296 MHz UltraSPARC II processor
• SPECint2006 is computed as the geometric mean of the execution factors of the 12 SPEC integer programs.
• Geometric mean:
– the comparison between two machines is independent of the choice of the reference computer (shown below).
– does not provide information about the actual execution time of the programs.
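Concretely, with the definitions above, SPECint2006 for a machine U is

$$\mathrm{SPECint}(U) = \left(\prod_{i=1}^{12} F_U(A_i)\right)^{1/12} = \left(\prod_{i=1}^{12} \frac{T_R(A_i)}{T_U(A_i)}\right)^{1/12}$$

and the reference independence follows because the $T_R(A_i)$ terms cancel when two machines are compared:

$$\frac{\mathrm{SPECint}(U_1)}{\mathrm{SPECint}(U_2)} = \left(\prod_{i=1}^{12} \frac{T_{U_2}(A_i)}{T_{U_1}(A_i)}\right)^{1/12}$$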
Measurement [3]
• It is based on direct measurements of the system under study using a software and/or hardware monitor.
• A monitor performs three tasks:
– data acquisition,
– data analysis,
– result output.
• An event is a change in the system state.
– Examples are a process context switch, the beginning of a seek on a disk, and the arrival of a packet.
• A trace is a log of events
– includes the time of the event, the type of event, etc. (a possible record layout is sketched below)
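One possible layout for a trace record, a hypothetical sketch based only on the fields described above:

    #include <stdint.h>

    /* Hypothetical event types, drawn from the examples above. */
    enum event_type {
        EV_CONTEXT_SWITCH,
        EV_DISK_SEEK_BEGIN,
        EV_PACKET_ARRIVAL
    };

    /* One entry in the trace: when the event happened, what kind of
       event it was, and where it occurred. */
    struct trace_record {
        uint64_t timestamp_ns;  /* time of the event */
        enum event_type type;   /* type of the event */
        uint32_t cpu_id;        /* processor on which it occurred */
        uint32_t pid;           /* process involved, if any */
    };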
Activating a monitor [3]
• Tracing (event-driven monitor) - when an event occurs, the monitor is activated to capture data about the state of the system. This gives a complete trace of the executing program.
• Sampling - the monitor is activated by clock interrupts (a minimal sketch follows).
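A minimal sketch of a sampling monitor on a POSIX system (an assumption; the slides do not prescribe an implementation): a profiling timer delivers SIGPROF at regular intervals, and the handler records a sample. A real profiler would record the interrupted program counter rather than just incrementing a count.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t samples;

    /* Activated on every profiling-timer (clock) interrupt. */
    static void on_tick(int sig) {
        (void)sig;
        samples++;  /* a real monitor would capture program state here */
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_tick;
        sigaction(SIGPROF, &sa, NULL);

        /* Deliver SIGPROF every 10 ms of CPU time used by this process. */
        struct itimerval it;
        it.it_interval.tv_sec = 0;
        it.it_interval.tv_usec = 10000;
        it.it_value = it.it_interval;
        setitimer(ITIMER_PROF, &it, NULL);

        /* Workload being monitored. */
        volatile double x = 0;
        for (long i = 0; i < 100000000L; i++) x += i * 0.5;

        printf("collected %d samples\n", (int)samples);
        return 0;
    }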
Performance counters
• Time-based profiles - where your software spends its time.
• Hardware performance measurements - what the processor is doing and how effectively the processor is being utilized.
• Hardware measurements also pinpoint particular reasons why the CPU is stalling rather than accomplishing useful work.
• http://perfsuite.ncsa.uiuc.edu/publications/LJ135/t1.html
Advantages [4]
• The application and operating system remain largely unmodified, apart
from the addition of drivers in the operating system to enable access to
the hardware performance counters.
• Not using a simulation of the application, operating system, or processor ensures the accuracy of the collected event counts.
• Performance-monitoring hardware collects data on the fly as the
application executes, allowing full-speed data collection and avoiding the
slowness of simulation-based approaches.
• This approach can collect data for both the application and the operating
system.
Performance monitoring [4]
• Performance events can be grouped into:
– program characterization,
– memory accesses,
– pipeline stalls,
– branch prediction,
– resource utilization.
• Performance-monitoring hardware has two components:
– performance event detectors
– event counters (both are exercised in the sketch below).
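A minimal Linux sketch, not from the slides, of driving both components from software: the perf_event_attr structure programs an event detector (here, the hardware instruction event), and the resulting file descriptor exposes the event counter. It assumes a Linux kernel with perf_event support.

    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* glibc provides no wrapper for perf_event_open. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;           /* program the event detector */
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* event to detect */
        attr.disabled = 1;
        attr.exclude_kernel = 1;                  /* count user code only */

        int fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile long sum = 0;                    /* workload being counted */
        for (long i = 0; i < 1000000; i++) sum += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long count;
        read(fd, &count, sizeof(count));          /* read the event counter */
        printf("instructions retired: %lld\n", count);
        close(fd);
        return 0;
    }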
MIPS R10000 [5]
• Count-enable bits for User, Supervisor, Kernel, and/or Exception level modes; any combination of count enable bits may be asserted.
• Event select
• IP[7] interrupt enable
Intel’s solution
• Hardware performance counters are defined outside the "architectural" register set, and they are not saved and restored on process context switches.
• The measurements are therefore attached to the processor, not to a process or thread.
• It is possible to separate user code from system code according to the privilege level.
• The Intel Pentium-series processors include a 64-bit cycle counter and two 40-bit event counters, with a list of events and additional semantics that depend on the particular processor.
• The AMD Athlon processor has a 64-bit cycle counter and four 48-bit event counters.
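A minimal sketch of reading the 64-bit cycle counter mentioned above, using the __rdtsc intrinsic available in GCC and Clang on x86; the programmable event counters require privileged MSR access or an OS interface, so only the cycle counter is shown:

    #include <stdio.h>
    #include <x86intrin.h>  /* __rdtsc(), x86 only */

    int main(void) {
        unsigned long long start = __rdtsc();  /* read 64-bit cycle counter */

        volatile double x = 1.0;               /* workload being timed */
        for (int i = 0; i < 1000000; i++) x *= 1.000001;

        /* Because the counter belongs to the processor rather than the
           process, a context switch or CPU migration perturbs the reading. */
        printf("elapsed cycles: %llu\n", __rdtsc() - start);
        return 0;
    }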
Using performance counters [4]
• Scheduling
– A single per-core metric (such as IPC or cache miss rate) is not sufficient to categorize application behavior.
• Different thread types often have highly varying characteristics.
• Threads behave differently depending on which thread was scheduled beforehand.
• Tuning memory access
• Communication pattern
Problems with performance counters [6]
Advanced performance counters [6]
Software [4]
• The Performance Application Programming Interface (PAPI) tool
– provides a common interface to performance-monitoring hardware for many different processors, including Alpha, Athlon, Cray, Itanium, MIPS, Pentium, PowerPC, and UltraSparc.
– can initiate and reset counters and read them (see the sketch after this list).
• Intel’s VTune Performance Analyzer
– supports all Intel Pentium and Itanium processors,
– provides additional performance analysis tools such as call graph profiling and processor-specific tuning advice.
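A minimal sketch using PAPI’s classic high-level API (the interface that existed around the time of these slides; newer PAPI releases replace it) to start, read, and stop two preset counters; compile with -lpapi:

    #include <papi.h>
    #include <stdio.h>

    int main(void) {
        int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };  /* cycles, instructions */
        long long values[2];

        /* Start two hardware counters (initializes the library if needed). */
        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "failed to start counters\n");
            return 1;
        }

        volatile double x = 0;                 /* workload being measured */
        for (int i = 0; i < 1000000; i++) x += i * 0.25;

        /* Stop the counters and read their final values. */
        PAPI_stop_counters(values, 2);
        printf("cycles: %lld  instructions: %lld  IPC: %.2f\n",
               values[0], values[1], (double)values[1] / values[0]);
        return 0;
    }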
Other approaches for collecting processor performance data [4]
• Software monitoring
– Modify code to collect data (a minimal sketch follows).
– Requires having the source code and being able to rebuild the application.
• Simulators
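A minimal sketch of software monitoring by modifying the code, assuming POSIX clock_gettime: measurement calls are inserted directly around the region of interest, which is why the source must be available and the application rebuildable.

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        /* Instrumentation inserted by hand around the code of interest. */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);   /* begin measurement */

        volatile long sum = 0;                  /* region being monitored */
        for (long i = 0; i < 50000000L; i++) sum += i;

        clock_gettime(CLOCK_MONOTONIC, &t1);   /* end measurement */
        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("region took %.6f s\n", secs);
        return 0;
    }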
References
1. Thomas Rauber and Gudula Rünger, Parallel Programming: For Multicore and Cluster Systems, Springer, 2010 (Chapter 4).
2. Lieven Eeckhout, Computer Architecture Performance Evaluation Methods, Synthesis Lectures on Computer Architecture, June 2010.
3. Lei Hu and Ian Gorton, Performance Evaluation for Parallel Systems: A Survey, University of NSW, Australia, UNSW-CSE-TR-9707, October 1997.
4. B. Sprunt, “The Basics of Performance Monitoring Hardware,” IEEE Micro, July-August 2002, pp. 64-71.
5. MIPS Technologies, MIPS R10000 Microprocessor User’s Manual, Ver. 2.0, 1996. http://techpubs.sgi.com/library/manuals/2000/007-2490-001/pdf/007-2490-001.pdf
6. V. Salapura et al., “Next Generation Performance Counters: Towards Monitoring over a Thousand Concurrent Events,” IBM Research Report RC24351, 2007.
Additional material covered in the lecture
1. Geometric mean computation [1]