Lecture Notes 6
Lecture 6: Multicore Systems
Multicore Computers
(chip multiprocessors)
Combine two or more processors (cores) on a single piece
of silicon
Each core consists of ALU, registers, pipeline hardware, L1
instruction and data caches
Multithreading is used
Pollack’s Rule
Performance increase is roughly proportional to the square
root of the increase in complexity
performance ∝ √complexity
Power consumption increase is roughly linearly
proportional to the increase in complexity
power consumption ∝ complexity
Pollack’s Rule
complexity   power   performance
1            1       1
4            4       2
25           25      5
100s of low complexity cores, each operating at very low
power
Ex: Four small cores
complexity   power   performance
4 x 1        4 x 1   4
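The trade-off in the tables above can be sketched in C. This is a minimal illustration, not part of the original slides: the function names are hypothetical, all values are relative to a baseline core of complexity 1, and the many-core figure assumes a perfectly parallel workload. The rule is written in its inverse form (complexity needed for a given speedup), so no square-root routine is required.

```c
/* Pollack's Rule, inverse form: performance ~ sqrt(complexity), so a
   single core that is `speedup` times faster than the baseline needs
   roughly speedup^2 times the complexity. */
double pollack_complexity(double speedup) {
    return speedup * speedup;
}

/* Power grows roughly linearly with complexity, so it follows the
   same speedup^2 curve. */
double pollack_power(double speedup) {
    return pollack_complexity(speedup);
}

/* Assumption (not part of the rule): the same complexity budget spent
   on baseline cores of complexity 1 yields `budget` cores, and on a
   perfectly parallel workload their aggregate performance is `budget`. */
double small_core_throughput(double complexity_budget) {
    return complexity_budget;
}
```

For example, one core with speedup 2 costs complexity 4 and power 4, while four baseline cores on the same budget reach an aggregate performance of 4, matching the "four small cores" row.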
Increasing CPU Performance
Manycore Chip
Composed of hybrid cores
• Some general purpose
• Some graphics
• Some floating point
Exascale Systems
Board composed of multiple
manycore chips sharing memory
Rack composed of multiple
boards
A room full of these racks
Millions of cores
Exascale systems (10^18 Flop/s)
Moore’s Law Reinterpreted
Number of cores per chip doubles every 2 years
Number of threads of execution doubles every 2
years
Shared Memory MIMD
[Figure: four processors (P) connected by a bus to a shared memory]
Shared memory
• Single address space
• All processes have access to the pool of shared memory
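A single address space means every thread can read the same array directly, with no explicit data transfer. The following is a minimal sketch using OpenMP; the function name shared_sum is hypothetical. If the compiler is run without OpenMP support, the pragma is ignored and the loop runs sequentially with the same result.

```c
/* Sum an array that lives in the single shared address space.
   Each thread reads its share of `data` directly; the reduction
   clause gives every thread a private partial sum and combines
   them when the loop ends. */
long shared_sum(const int *data, int n) {
    long sum = 0;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += data[i];
    return sum;
}
```

The reduction clause is one way to avoid a race on sum; without it, concurrent updates to the shared variable would need explicit synchronization.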
Shared Memory MIMD
[Figure: four control units (CU), each issuing an instruction stream to its own processing element (PE); the PEs exchange data streams with a shared memory]
Each processor executes different instructions asynchronously, using different data
Symmetric Multiprocessors
(SMP)
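[Figure: SMP organization. Processors, each with L1 and L2 caches, are connected by a system bus to main memory and I/O]
Classification: MIMD, shared memory, UMA (uniform memory access)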
Symmetric Multiprocessors
(SMP)
Characteristics:
Two or more similar processors
Processors share the same memory and I/O facilities
Processors are connected by bus or other internal connection
scheme, such that memory access time is the same for each
processor
All processors share access to I/O devices
All processors can perform the same functions
The system is controlled by an integrated operating system that
provides interaction between processors and their programs
Symmetric Multiprocessors
(SMP)
Operating system:
Provides tools and functions to exploit the parallelism
Schedules processes or threads across all of the processors
Takes care of
• scheduling of threads and processes on processors
• synchronization among processors
Multicore Computers
[Figure: CPU cores 1 … n, each with its own L1-I and L1-D caches; a single L2 cache; main memory; I/O]
Dedicated L1 Cache (ARM11 MPCore)
Multicore Computers
[Figure: CPU cores 1 … n, each with its own L1-I, L1-D and L2 caches; main memory; I/O]
Dedicated L2 Cache (AMD Opteron)
Multicore Computers
[Figure: CPU cores 1 … n, each with its own L1-I and L1-D caches, sharing one L2 cache; main memory; I/O]
Shared L2 Cache (Intel Core Duo)
Multicore Computers
[Figure: CPU cores 1 … n, each with its own L1-I, L1-D and L2 caches, sharing one L3 cache; main memory; I/O]
Shared L3 Cache (Intel Core i7)
Multicore Computers
Advantages of Shared L2 cache
Reduced overall miss rate
• A thread on one core may cause a frame to be brought into the cache; a thread on another core can then access the same location without a further miss
Data shared by multiple cores is not replicated
The amount of shared cache allocated to each core may be dynamic
Interprocessor communication is easy to implement
Advantages of Dedicated L2 cache
Each core can access its private cache more rapidly
L3 cache
When the amount of memory and number of cores grow, L3 cache provides
better performance
Multicore Computers
On-chip interconnects
Bus
Crossbar
Off-chip communication (CPU-to-CPU or I/O):
Bus-based
Multicore Computers
Multithreading
A multithreaded processor provides a separate PC for each
thread (hardware multithreading)
Implicit multithreading
• Concurrent execution of multiple threads extracted from a single sequential program
Explicit multithreading
• Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines
Multicore Computers
Explicit Multithreading
Fine-grained multithreading (interleaved multithreading)
• Processor deals with two or more thread contexts at a time
• Switches from one thread to another at each clock cycle
Coarse-grained multithreading (blocked multithreading)
• Instructions of a thread are executed sequentially until an event that causes a delay (e.g. a cache miss) occurs
• This event causes a switch to another thread
Simultaneous multithreading (SMT)
• Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
• Thread-level parallelism is combined with instruction-level parallelism (ILP)
Chip multiprocessing (CMP)
• Each processor of a multicore system handles separate threads
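The difference between the fine-grained and coarse-grained policies can be shown with a toy issue schedule. This is a simplified sketch, not a model of any real processor: the function names and the stall array are hypothetical, one instruction issues per cycle, and a delayed thread is assumed to be ready again by the time the scheduler returns to it.

```c
/* Fine-grained (interleaved): switch to the next thread context at
   every clock cycle, in round-robin order.
   issued[c] receives the id of the thread that issues in cycle c. */
void fine_grained(int n_threads, int n_cycles, int *issued) {
    for (int c = 0; c < n_cycles; c++)
        issued[c] = c % n_threads;
}

/* Coarse-grained (blocked): keep issuing from the current thread until
   a delay event occurs, then switch to the next thread.
   stall[c] != 0 marks a delay event (e.g. a cache miss) caused by the
   instruction issued in cycle c. */
void coarse_grained(int n_threads, int n_cycles, const int *stall,
                    int *issued) {
    int current = 0;
    for (int c = 0; c < n_cycles; c++) {
        issued[c] = current;
        if (stall[c])
            current = (current + 1) % n_threads;
    }
}
```

With two threads and delay events in cycles 2 and 4, the fine-grained schedule alternates 0, 1, 0, 1, 0, 1 while the coarse-grained schedule is 0, 0, 0, 1, 1, 0.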
[Figure: Coarse-grained, Fine-grained, Simultaneous Multithreading, CMP]
GPUs
(Graphics Processing Units)
Characteristics of GPUs
GPUs are accelerators for CPUs
SIMD
GPUs have many parallel processors and many concurrent threads
(i.e. 10 or more cores; 100s or 1000s of threads per core)
A CPU-GPU combination is an example of heterogeneous computing
GPGPU (general-purpose GPU): using a GPU to run computations traditionally handled by the CPU
GPUs
Core Complexity
Out-of-order execution
Dynamic branch prediction
Deeper pipelines for higher clock rates
More circuitry
High performance
GPUs
Complex cores are preferable for:
Numeric applications with high instruction-level parallelism
Floating-point applications
A large number of simple cores is preferable when:
The application's serial part is small
Cache Performance
Intel Core i7
Roofline Performance Model
Arithmetic intensity is the ratio of floating-point operations in a program
to the number of data bytes accessed by the program from main
memory
Arithmetic intensity = floating-point operations / number of data bytes accessed (FLOPs/byte)
Roofline Performance Model
Attainable GFLOPs/second = min(Peak memory bandwidth x Arithmetic intensity, Peak floating-point performance)
Roofline Performance Model
Peak floating-point performance is given by the hardware
specifications of the computer (FLOPs/second)
For multicore chips, peak performance is the collective performance
of all the cores on the chip. If the system has multiple chips, multiply
the peak per chip by the number of chips
Peak memory performance is also given by the hardware
specifications of the computer (Mbytes/second)
Maximum floating-point performance that the memory system of the
computer can support for a given arithmetic intensity, can be plotted
as
Peak memory bandwidth x Arithmetic intensity
(bytes/second) x (FLOPs/byte) ==> FLOPs/second
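The roofline bound is simply the smaller of the memory-limited and compute-limited rates. A minimal C sketch follows; the function and parameter names are hypothetical, and the units (GFLOPs/second, GBytes/second) are chosen here so that the product comes out in GFLOPs/second.

```c
/* Attainable performance under the roofline model.
   peak_gflops     : peak floating-point performance (GFLOPs/second)
   peak_bw_gbytes  : peak memory bandwidth (GBytes/second)
   intensity       : arithmetic intensity (FLOPs/byte)
   Returns the upper bound on performance in GFLOPs/second. */
double roofline_gflops(double peak_gflops, double peak_bw_gbytes,
                       double intensity) {
    /* (GBytes/second) x (FLOPs/byte) = GFLOPs/second */
    double memory_bound = peak_bw_gbytes * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For a made-up machine with a 16 GFLOPs/second peak and 8 GBytes/second of bandwidth, a kernel with intensity 0.5 is limited by memory to 4 GFLOPs/second, while a kernel with intensity 4 is limited by the floating-point peak of 16 GFLOPs/second.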
Roofline Performance Model
Roofline sets an upper bound on performance
Roofline of a computer does not vary by benchmark kernel
Stream Benchmark
A synthetic benchmark
Measures the performance of long vector operations
They have no temporal locality and they access arrays that are
larger than the cache size
http://www.cs.virginia.edu/stream/ref.html
#define N 2000000
...
/* Copy: c = a */
void tuned_STREAM_Copy() {
    int j;
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j];
}
/* Add: c = a + b */
void tuned_STREAM_Add() {
    int j;
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j] + b[j];
}
/* Scale: b = scalar * c */
void tuned_STREAM_Scale(double scalar) {
    int j;
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        b[j] = scalar * c[j];
}
/* Triad: a = b + scalar * c */
void tuned_STREAM_Triad(double scalar) {
    int j;
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];
}
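The kernels above have very low arithmetic intensity, which is why they stress memory bandwidth instead of the floating-point units. The sketch below tabulates per-iteration counts under stated assumptions: 8-byte doubles, every array element read or written counts once, and extra traffic such as write-allocate is ignored. The struct, table, and function names are hypothetical.

```c
/* Per-iteration counts for one STREAM kernel, assuming 8-byte doubles. */
typedef struct {
    const char *name;
    int flops;  /* floating-point operations per loop iteration */
    int bytes;  /* bytes read plus bytes written per loop iteration */
} stream_kernel;

static const stream_kernel STREAM_KERNELS[] = {
    {"Copy",  0, 16},  /* c[j] = a[j]             : 1 load, 1 store  */
    {"Scale", 1, 16},  /* b[j] = scalar*c[j]      : 1 load, 1 store  */
    {"Add",   1, 24},  /* c[j] = a[j]+b[j]        : 2 loads, 1 store */
    {"Triad", 2, 24},  /* a[j] = b[j]+scalar*c[j] : 2 loads, 1 store */
};

/* Arithmetic intensity in FLOPs/byte for one kernel. */
double arithmetic_intensity(const stream_kernel *k) {
    return (double)k->flops / (double)k->bytes;
}
```

Under these assumptions even Triad performs only 2 FLOPs per 24 bytes, about 0.083 FLOPs/byte, so its attainable performance sits on the bandwidth-limited part of the roofline.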