Improved schedulability analysis of EDF on
Multicore systems
Marko Bertogna
Scuola Superiore S.Anna,
ReTiS Lab, Pisa, Italy
Course outline
Why do we need more than one processor?
Key design aspects of multicore platforms
Overview of existing multicore architectures
Parallel programming problems
Multiprocessor Real-Time systems
Introduction to Multicore Systems
Summary
Limits of current single processor
technologies
Superscalar and SMT architectures
SMP vs ASMP
UMA vs NUMA
Cache coherency
Key design aspects
Example of multicore systems
As Moore’s law goes on…
Number of transistors per chip doubles every 18 to 24 months
…heating becomes a problem
Pentium Tejas cancelled!
[Figure: power density (W/cm²) vs. year (1971–2008) for Intel processors from the 4004 to the Pentium 4; the trend crosses the hot-plate level and heads toward nuclear-reactor power densities.]
P ∝ V f ⇒ clock speed limited to less than 4 GHz
Why did it happen?
Integration technology constantly improves…
CMOS manufacturing process
Half-pitch: half the distance between identical features in an array of transistors
Technology trends
Reduced gate sizes
Higher frequencies allowed
More space available
BUT
Physical limits of semiconductor-based microelectronics
Larger dynamic power consumed
Leakage current becomes important
Problematic data synchronization (GALS)
Intel’s timeline

Year   Processor       Manufacturing technology   Frequency   Number of transistors
1971   4004            10 µm                      108 kHz     2,300
1972   8008            10 µm                      800 kHz     3,500
1974   8080            6 µm                       2 MHz       4,500
1978   8086            3 µm                       5 MHz       29,000
1979   8088            3 µm                       5 MHz       29,000
1982   286             1.5 µm                     6 MHz       134,000
1985   386             1.5 µm                     16 MHz      275,000
1989   486             1 µm                       25 MHz      1,200,000
1993   Pentium         0.8 µm                     66 MHz      3,100,000
1995   Pentium Pro     0.6 µm                     200 MHz     5,500,000
1997   Pentium II      0.25 µm                    300 MHz     7,500,000
1999   Pentium III     0.18 µm                    500 MHz     9,500,000
2000   Pentium 4       0.18 µm                    1.5 GHz     42,000,000
2002   Pentium M       90 nm                      1.7 GHz     55,000,000
2005   Pentium D       65 nm                      3.2 GHz     291,000,000
2006   Core 2 Duo      65 nm                      2.93 GHz    291,000,000
2007   Core 2 Quad     65 nm                      2.66 GHz    582,000,000
2008   Core 2 Quad X   45 nm                      >3 GHz      820,000,000
Frequency
Lead microprocessor frequency doubles every 2 years
[Figure: frequency (MHz) vs. year (1970–2010) for Intel processors from the 4004 through the Pentium and P6; courtesy Intel.]
Die Size Growth
Die size grows by 14% every two years to satisfy Moore’s Law
Frequency and power
f = operating frequency
V = supply voltage (V ≈ 0.3 + 0.7 f)
Reducing the voltage forces a more than proportional reduction in frequency
P = Pdynamic + Pstatic = power consumed
Pdynamic ∝ A C V² f (main contributor down to hundreds of nm)
Pstatic ∝ V Ileak (due to leakage; important when going below 100 nm)
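As a quick sanity check of these formulas, here is a minimal sketch in C of the slide's power model; the constants A·C and Ileak are made-up illustrative values, not measurements. It shows why two cores at half frequency can match the throughput of one fast core for a fraction of the power, which is the argument developed in the next slides.

/* Minimal sketch of P = A*C*V^2*f + V*Ileak with V ~= 0.3 + 0.7*f
 * (f normalized to 1.0). Constants are illustrative only. */
#include <stdio.h>

static double power(double f, double a_c, double ileak) {
    double v = 0.3 + 0.7 * f;           /* supply voltage tracks frequency */
    return a_c * v * v * f + v * ileak; /* dynamic + static power */
}

int main(void) {
    const double a_c = 1.0, ileak = 0.1;      /* assumed values */
    double one_fast = power(1.0, a_c, ileak); /* 1 core at full speed  */
    double two_slow = 2.0 * power(0.5, a_c, ileak); /* 2 cores, half speed */
    printf("1 core  @ f=1.0: P = %.3f\n", one_fast);
    printf("2 cores @ f=0.5: P = %.3f\n", two_slow);
    return 0;
}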
Design considerations (I)
P = A C V² f + V Ileak
Number of transistors grows → A grows
Die size grows → C grows
Reducing V would allow a quadratic reduction in dynamic power
But clock frequency would decrease more than linearly (since V ≈ 0.3 + 0.7 f) unless Vth is reduced as well
But, again, there are problems:
Ileak (≈ Isub + Igox) increases when Vth is low!
Design considerations (II)
P = A C V² f + V Ileak
Reducing Vth and gate dimensions → leakage current becomes dominant in recent process technologies → static dissipation
Static power dissipation is always present, unless the device state is lost
Summarizing:
There is no way out for classic frequency scaling on single-core systems!
Power delivery and dissipation
[Figure: power (W) vs. year (1971–2008) for Intel processors from the 4004 (~1 W) through the Pentium, with a projected trend toward 500 W, 1.5 kW, 5 kW and 18 kW; courtesy Intel.]
Power density
Power density too high to keep junctions at low temp
[Figure: power density (W/cm²) vs. year (1970–2010) for Intel processors from the 4004 through the Pentium and P6, passing the hot-plate level and trending toward rocket-nozzle and nuclear-reactor densities; courtesy Intel.]
UltraSPARC Power consumption
The largest amount of power is consumed by the cores and by leakage
Has this ever happened before?
Keeping Moore’s law alive
Exploit the immense number of transistors in
other ways
Reduce gate sizes, keeping the frequency sufficiently low
Use a higher number of slower logic gates
In other words:
Switch to Multicore Systems!
What is this?
1979 - David May’s B0042 board - 42 Transputers
Definitions
Multicore CPU = two or more cores on a single package composed of one or more dies
die = IC (Integrated Circuit)
Monolithic processor = all cores on a single die
Multi-Chip Module (MCM) = multiple dies on a single package
Multi-CPU = multiple physically separated
processing units (which often contain special circuitry
to facilitate communication between each other).
We will use the terms multiprocessor and multicore interchangeably
Terminology
Threads vs. processes (or tasks)
Processes: independent; carry considerable state info; have a separate address space
Resources (memory, file handles, sockets, device handles) are owned by processes
Threads and processes
Threads of the same process share the same resources
Processes can share resources only through
explicit methods
The only resources exclusively owned by a
thread are the thread stack, register status
and thread-local storage (if any)
Scheduling
A scheduler is responsible for allocating tasks and threads to the available computing resources
Various kinds of scheduling algorithms:
Best-effort → minimize makespan
Real-Time → meet deadlines
… other metrics to minimize
Preemption and context changes
Access to shared resources
Multithreading
Classic multithreading on single processor
systems required Time Division Multiplexing
(TDM)
Time driven
Event driven
Multiprocessors: different threads and processes can run on different CPUs
Multithreading is easier (“native”) on
multicore platforms
But scheduling requires more attention
Multithreading issues
Race conditions
Starvation, deadlock, livelock, priority
inversion
Synchronization (mutex, lock)
Atomic execution (semaphores)
Communication:
shared-memory (requires locking)
message-passing (slower but easier)
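To make the locking point concrete, here is a minimal sketch with POSIX threads (not from the lecture): a mutex serializes the read-modify-write of a shared counter that would otherwise be a race condition.

/* Two threads increment a shared counter; without the mutex the
 * read-modify-write would race. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);   /* enter critical section */
        counter++;                   /* shared-memory communication */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}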
Different kinds of Parallelisms
Instruction Level Parallelism (ILP)
Data Level Parallelism (DLP)
Thread Level Parallelism (TLP)
Instruction Level Parallelism (ILP)
Execute multiple instructions per clock cycle
Each functional unit on a core is an execution
resource within a single CPU:
Arithmetic Logic Unit (ALU)
Floating Point Unit (FPU)
bit shifter, multiplier, etc.
Need to solve data dependencies
Data dependency
Consider the sequential code:
1. e = a + b
2. f = c + d
3. g = e * f
Operation 3 depends on the results of operations 1 and 2, so it cannot execute before 1 and 2 are completed; operations 1 and 2, however, are independent and can run in parallel.
How to “parallelize” software?
Parallelism can be extracted from ordinary
programs
At run-time (by complex specific HW)
At compile-time (simplifying CPU design and improving run-time performance)
Degree of ILP is application dependent
ILP: superscalar architectures
Data dependency check in HW
Complex mechanisms
power, die-space and time consuming
Problematic when:
code is difficult to predict
instructions have many interdependencies
Superscalar pipelining
5-stage pipelining
2-way superscalar execution
ILP optimizations
Instruction pipelining
Superscalar execution
Out-of-order execution
deferred memory accesses
combined load and store
Speculative execution
branch prediction,
speculative load
…
Processor front and back end
Intel describes its processors as having
“in-order front end”
“out-of-order execution engine”
ILP: compile-time techniques
Compiler decides which operations can run in parallel
Removes the complexity of instruction scheduling
from HW to SW
New instruction sets that explicitly encode multiple
independent operations per instruction
Very Long Instruction Word (VLIW): one instruction encodes
multiple operations (one for each execution unit)
Explicitly Parallel Instruction Computing (EPIC): adds
features to VLIW (cache prefetching instructions, ...)
Data Level Parallelism (DLP)
Higher parallelism than superscalar architectures
SIMD instructions (Single Instruction, Multiple Data)
Intel’s MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX
AMD’s 3DNow!, 3DNow! Professional, SSE5
ARM’s NEON, IBM’s AltiVec and SPE, etc.
Graphic cards (GPU)
Cell Processor’s SPU
Useful when the same operation has to be applied to a large set of data (e.g., multimedia, graphic operations on pixels, etc.)
Multiple data are read and/or modified at the same time, as in the sketch below
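As an illustration of SIMD (not from the lecture), here is a minimal sketch using Intel's SSE intrinsics: a single _mm_add_ps instruction adds four packed floats at once. Compile for x86 with SSE enabled (e.g., -msse).

/* One SIMD instruction operating on four floats: Single Instruction,
 * Multiple Data. */
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float r[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb); /* 4 additions in one instruction */
    _mm_storeu_ps(r, vr);

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}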
Thread Level Parallelism (TLP)
Higher level parallelism than ILP
Different kinds of TLP:
Superthreading
Hyperthreading or Simultaneous MultiThreading (SMT): needs a superscalar processor
Chip-level MultiProcessing (CMP): needs a multicore architecture
Combinations of the above solutions
Superthreading
Temporal multithreading (fine- or coarse-grained): when the processor is idle, execute instructions of another thread
Makes better use of the computing resources
when a thread is blocked
Requires adequate hardware support to
minimize context change overhead
Hyperthreading
Simultaneous MultiThreading (SMT)
Introduced by Intel with the Pentium 4
Execute instructions from multiple threads
simultaneously needs superscalar support
Energy inefficient
can use up to 46% more power than dual-core
designs
Increases cache thrashing by 42%, whereas dual
core results in a 37% decrease
temporarily dropped by Intel (will return with
Nehalem architecture)
Single-threaded CPU
4 running programs
Only the red program
is executing
Up to 4 instr/clock cycle
7 functional units
Pipeline bubbles
Super-threading (time-slice multithreading)
Multithreaded processor:
able to handle more than
one thread at a time
Interleaved execution of
different threads (helps
masking memory latency)
All instructions in a pipeline
stage must come from
the same thread
Hyper-threading (Simultaneous MultiThreading)
Interleaved execution of
different threads (helps
masking memory latency)
Instructions in a pipeline
stage may come from
different threads
Hyper-threading (SMT)
From OS perspective: many “logical”
processors
Average ILP for a thread = 2.5 instr/cycle
Pentium 4 issues at most 3 instr/cycle to the
execution core
Hyperthreaded processors can exploit parallelism beyond a single thread’s ILP
But..
SMT can be worse than non-SMT approaches
A thread monopolizing an execution unit for many
consecutive pipeline stages can stall the available
functional units (that could have been used by
other threads)
Cache thrashing problem
Different logical processors can execute two
threads accessing completely different memory
areas
Not a problem for Multicores
A smart SMT-aware OS running on a multicore would schedule two different tasks on different processors → resource contention is minimized
Comparing parallel solutions
Dual vs Superscalar vs SMT processor
Dual core achieves the best performance w.r.t. power and area
Multicore examples (high-end)
Intel’s Core 2: 2, 4 cores
AMD: 2, 4 cores
IBM-Toshiba-Sony Cell processor: 8 cores
Sun’s Niagara UltraSPARC: 8 cores
Microsoft’s Xenon: 3 cores (Xbox 360)
Tilera’s TILE64: 64-core
Others (network processors, DSP, GPU,…)
Multicore examples (embedded)
ARM’s MPCore: 4 cores
Nios II: x Cores
Network Processors are being replaced by
multicore chips (Broadcom’s 8-core
processors)
DSP: TI, Freescale, Atmel, Picochip (up to
300 cores, communication domain)
How many cores in the future?
Application dependent
Typically few for high-end computing
Many trade-offs
transistor density
technology limits
Amdahl’s law
Beyond 2 billion transistors/chip
Intel’s Tukwila
Itanium based
2.046 B FET
Quad-core
65 nm technology
2 GHz on 170W
30 MB cache
2-way SMT (8 threads per chip)
How many cores in the future?
Intel’s 80 core prototype already available
Able to transfer a TB of data/s (while Core 2 Duo reaches 1.66 GB data/s)
To be released in 5 years
How many cores in the future?
Berkeley: weather simulation at 1.5 km resolution, 1000× realtime, 3M custom-tailored Tensilica cores
Petaflop computer
Power around 2.5 MW
estimated for 2011
cost around 75 M$
main obstacle will be on software
Supercomputers
Petaflop supercomputers (current supercomputers have Teraflop computing power)
IBM’s Bluegene/P
3 Pflops
884,736 quad-core PowerPC processors
ready before 2010
Prediction
Patterson & Hennessy: “number of cores will
double every 18 months, while power, clock
frequency and costs will remain constant”
Amdahl’s law
Originally made for speed-ups in a portion of
a program
Later adapted to measure the speedup
obtained increasing the number of processors
P = Parallel portion of a given application
N = Number of processors/cores
The total speedup obtained increasing N is S(N) = 1 / ((1 − P) + P/N)
Considerations on Amdahl’s law
For N arbitrarily large maximum speedup
tends to 1 / (1-P)
In practice, performance/price falls rapidly as
N is increased once there is even a small
component of (1 − P)
Example: P = 90% → (1 − P) = 10% → speedup < 10
Consequences of Amdahl’s law
“Law of diminishing returns”: picking optimal improvements first, each successive gain is smaller; the same holds when adding processors
Considering as well the memory, bus and I/O
bottlenecks, the situation gets worse
Parallel computing is only useful for
limited numbers of processors, or
problems with very high values of P
“embarrassingly parallel problems”
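A minimal sketch in C of the speedup formula above, using the slide's P = 90% example: the speedup saturates below 10 no matter how many cores are added.

/* Amdahl's law: S(N) = 1 / ((1-P) + P/N). */
#include <stdio.h>

static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double p = 0.9;  /* parallel fraction from the example */
    int cores[] = {1, 2, 4, 8, 16, 64, 1024};
    for (int i = 0; i < 7; i++)
        printf("N = %4d -> speedup = %.2f\n", cores[i], amdahl(p, cores[i]));
    return 0;  /* output approaches but never reaches 10 */
}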
Embarrassingly parallel problems
Problems for which
no particular effort is needed to segment the
problem into a very large number of parallel tasks
there is no essential dependency (or
communication) between those parallel tasks
1 core @ 4 GHz = 2 cores @ 2 GHz
Examples: GPU handled problems, 3D
projection (independent rendering of each
pixel), brute-force searching in cryptography
SMP
Symmetric MultiProcessing
Many identical processors are connected to a
single shared main memory.
SMP
Any processor can work on any task
Easy migration
Access time to a memory location is independent of
which processor makes the request or which memory
chip contains the transferred data: UMA (Uniform Memory Access)
Load balancing
Typically limited to a small number of CPUs
For larger systems, alternatives like NUMA (Non-Uniform Memory Access)
SMP architectures
Intel’s Xeon, Pentium D, Core Duo, Core 2
Duo
AMD’s Athlon64 X2, Quad FX, Opteron (200
and 2000 series)
Other non-x86:
UltraSPARC, SPARC64, SGI MIPS, SGI f1200, Intel
Itanium, PA-RISC, DEC Alpha, POWER and
PowerPC (604, 604e, G4 and G5 series)
ASMP
ASymmetric MultiProcessing
Assigns certain tasks only to certain
processors
ASMP
Typically there is one Master and one or more Slave processors (unlike SMP)
Master distributes the load to slaves
Limitations to interactions among slave
processors, I/O, memory
Possible to use identical cores in an
asymmetrical way (dedicating tasks to cores)
NUMA
Quickly access local memory
Slower access to remote memory
Improve throughput for localized data
Interprocessor migration is more expensive
Difficult load-balancing
Cache and memory coherence problems
Software solutions
Hardware solutions (ccNUMA)
NUMA
One can view NUMA as a very tightly coupled
form of cluster computing
UMA or NUMA?
Depends on the addressed application
Locality favours NUMA
Tight data dependencies among processes
favour UMA
Caches
Caches play key role in modern architectures
Reduce average data access time
Reduce bandwidth demands placed on shared
interconnect
Memory Wall
A.k.a. Von Neumann bottleneck
Growing disparity of speed between CPU and
memory outside the CPU chip:
From 1986 to 2000, CPU speed improved at an
annual rate of 55% while memory speed only
improved at 10%.
Mainly due to limited communication
bandwidth and memory latency
Larger cache sizes to mask the latency to
memory
Memory bottleneck
Write policy
Write-through: every cache write causes a write to memory
Write-back: cache writes aren’t immediately mirrored to memory
Modified data is marked as Dirty
Dirty data needs to be written back to main memory before eviction
Writes may be held in a temporary queue; processing multiple stores improves bus usage
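To make the two policies concrete, here is a minimal sketch in C for a single cache line (a hypothetical structure, not from the lecture): write-through updates memory on every store, while write-back only marks the line Dirty and updates memory on eviction.

/* "memory" is a variable standing in for DRAM. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int data; bool valid; bool dirty; } line_t;
static int memory = 0;

static void write_through(line_t *l, int v) {
    l->data = v; l->valid = true;
    memory = v;               /* every cache write goes to memory */
}

static void write_back(line_t *l, int v) {
    l->data = v; l->valid = true;
    l->dirty = true;          /* memory is now stale; mark Dirty */
}

static void evict(line_t *l) {
    if (l->valid && l->dirty)
        memory = l->data;     /* dirty data written back on eviction */
    l->valid = l->dirty = false;
}

int main(void) {
    line_t l = {0};
    write_back(&l, 42);
    printf("write-back, before eviction: memory = %d\n", memory); /* 0  */
    evict(&l);
    printf("write-back, after eviction:  memory = %d\n", memory); /* 42 */
    write_through(&l, 7);
    printf("write-through:               memory = %d\n", memory); /* 7  */
    return 0;
}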
Cache coherency
Private processor caches create a problem
Copies of a variable can be present in multiple
processors
A write by one processor may not become visible
to others
They’ll keep accessing the stale value in their caches ⇒ cache coherence problem
Cache coherency protocols: MSI
Modified (M): cached data (new) ≠ data in main memory (old); needs write-back before eviction
Shared (S): data unmodified; can be evicted without writing it back
Invalid (I): cached data is old; must be fetched from memory or another cache
[State diagram: transitions labeled processor-action/bus-transaction. PrRd/— and PrWr/— loop on M; BusRd/Flush takes M to S; BusRdX/Flush takes M to I; PrRd/— and BusRd/— loop on S; PrWr/BusRdX takes S (and I) to M; BusRdX/— takes S to I; PrRd/BusRd takes I to S.]
MSI protocol
Maintained through communication between caches
and main memory
Examples:
Read on M or S states: cache supplies the data
Read on I state: must verify if the data is M in another cache
(snooping, cache-directory, etc)
Write on M state: data written locally
Write on S state: invalidate (snoop, cache dir, etc) other
caches in S state before writing the data
Write on I state: invalidate (snoop, cache dir, etc) other
caches in S or M state before writing the data
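Here is a minimal sketch in C of one cache line's MSI state machine, following the transitions listed above; the actual bus signalling (issuing BusRd/BusRdX, flushing the data) is abstracted away, so this is an illustration rather than a full coherence controller.

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_t;

/* Local processor accesses. */
static msi_t on_proc(msi_t s, int is_write) {
    if (is_write) return MODIFIED;       /* PrWr: S or I issue BusRdX first */
    return (s == INVALID) ? SHARED : s;  /* PrRd: I issues BusRd; M/S hit   */
}

/* Transactions snooped from other caches. */
static msi_t on_bus(msi_t s, int is_rdx) {
    if (is_rdx) return INVALID;          /* BusRdX: invalidate our copy     */
    return (s == MODIFIED) ? SHARED : s; /* BusRd: M flushes and drops to S */
}

int main(void) {
    msi_t line = INVALID;
    line = on_proc(line, 0);  /* local read:   I -> S */
    line = on_proc(line, 1);  /* local write:  S -> M */
    line = on_bus(line, 0);   /* remote read:  M -> S (after flush) */
    line = on_bus(line, 1);   /* remote write: S -> I */
    printf("final state = %d (0=I, 1=S, 2=M)\n", line);
    return 0;
}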
MSI protocol (example)
Three processors P1, P2, P3, each with a private cache and a bus snooper, share a bus to main memory.
Notation: r1 = read from P1, w3 = write from P3
Example sequence: r1, r2, w3, r2
r1: P1 issues BusRd; main memory supplies the value; the line enters P1’s cache in state S
r2: P2 issues BusRd; the line enters P2’s cache in state S
w3: P3 issues BusRdX; the copies in P1 and P2 are invalidated (S → I); P3 writes the value and holds the line in state M
r2: P2 issues BusRd; P3 snoops it and flushes its modified copy (M → S); P2 caches the line in state S
Cache coherency mechanisms
Directory-based: central directory of cached blocks; point-to-point messages; slower, but more scalable
Snooping (or sniffing): monitor address lines for write accesses to cached locations; broadcast messages; faster, but less scalable
Snarfing (less used): cache controller watches both address and data; cached data is always updated when modified
Snoopy cache-coherence control
Bus is a broadcast medium & Caches know
what they have
Cache Controller “snoops” all transactions on
the shared bus and takes action to ensure
coherence
[Figure: processors P1 … Pn, each with a private cache ($), on a shared bus connecting them to memory and I/O devices; each cache’s bus snoop observes every cache-memory transaction.]
Cache coherency protocols
MESI: introduced in the Pentium processor; adds an Exclusive state; reduces traffic due to writes to blocks exclusively present in a single cache
MOSI: adds an Owned state; reduces traffic due to write-back of blocks read by other caches
MOESI: has both Exclusive and Owned states; beneficial when communication between two CPUs has better latency and bandwidth than access to main memory (e.g., multicore systems with per-core L2 caches)
Cache vs scratchpad memory
Both are the fastest kind of memory after
internal registers
Scratchpad used in PS1’s R3000, PS2’s
R5000 and PS3’s Cell
Performance boost with multicore
Interrupts can be handled on an idle processor
instead of preempting the running process (also for
programs written for single core)
Not faster execution, but smoother appearance
For inherently parallel applications (graphic
operations, servers, compilers, distributed
computing) speedup proportional to the number of
processors
Limitations due to serialized RAM access and cache
coherency
Less likely to benefit from multicores
I/O bound tasks
Tasks composed of a series of pipeline-dependent calculations
Tasks that frequently communicate with each
other
Tasks that contend for shared resources
Programming
Applications are written for uniprocessors
Word processors and computer games
Adapting an application suited to a single core to a multiprocessor device results in performance losses
Multicore architectures are much more
widespread nowadays
Multicore programming is becoming more
important
Exploiting multicores
Multicore-capable OS’s
Windows NT 4.0/2000/XP/2003 Server/Vista
Linux and Unix-like systems
Mac OS
VxWorks, QNX, etc.
Multi-threaded applications
Multicore optimizations for game engines
Half-Life 2: Episode Two, Crysis, etc.
Software development tools
Parallel programming
Existing parallel programming models
OpenMP
MPI
IBM’s X10
Intel’s TBB (abstraction for C++)
Sun’s Fortress
Cray’s Chapel
Cilk (Cilk++)
Codeplay’s Sieve C++
Rapidmind Development Platform
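As a small taste of one of the models listed above, here is a minimal OpenMP sketch in C: the parallel-for pragma splits the loop iterations across the available cores (compile with -fopenmp).

/* Data-parallel vector addition with OpenMP. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    #pragma omp parallel for   /* iterations run on different cores */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %g, threads available = %d\n",
           c[N - 1], omp_get_max_threads());
    return 0;
}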
Design stages
Problem decomposition: break the application
into a large number of small concurrent tasks
Communication analysis: detect existing
interdependencies among the various tasks
Composition: combine/merge tasks to obtain
the desired grain of parallelism
Mapping: allocate tasks to the available
computing resources (if not automatically
done by a scheduler)
Partitioning tasks to processors
Possible partitioning choices
Partition by CPU load
Partition by information-sharing requirements
Partition by functionality
Use the least possible number of processors
or run at the lowest possible frequency
Depends on considerations like fault tolerance,
power consumed, temperature, etc.
Beyond multicore chips
Many-core devices
Tens or hundreds of cores
Integration of SMT techniques into multicore
chips
Special-purpose heterogeneous cores
Memory-on-chip
Fine-grain power management
Dynamic Voltage and Frequency Scaling
(DVFS)