Transcript Core
Latest processors
Computer architectures M
1
Introduction – new requirements
• No more clock frequency increases (power consumption – temperature – heat dissipation)
• Power consumption and configurations totally scalable
• QPI (QuickPath Interconnect)
• Larger caches (integrated L1, L2 and L3)
• First example (2009), Nehalem: mono/dual/quad/8-core processor, 2-way multithreaded (2/4/8/16 virtual processors), 64-bit parallelism, 4-wide superscalar (4 CISC instructions to the decoders in parallel => 4+3=7 u-ops per clock, 4 u-ops to the RAT per clock, 6 u-ops to the EUs per clock), 40-bit physical address => 1 TeraByte (one million MBytes, see the check below), 45 nm technology. From 700M to 2G transistors on chip. Sandy Bridge: 32 nm technology -> 2.3G transistors
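A quick numeric check of the 40-bit physical address figure (a small C sketch, not part of the original slides):

/* 40-bit physical addresses reach 2^40 bytes = 1 TB (about one million MB). */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t bytes = 1ULL << 40;                    /* 2^40 */
    printf("%llu bytes = %llu MB = 1 TB\n",
           (unsigned long long)bytes,
           (unsigned long long)(bytes >> 20));      /* ~1e6 MB */
    return 0;
}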
2
Nehalem Processor
[Die diagram: four cores, each with first- and second-level caches, a central queue, the shared L3 cache, the integrated memory controller, miscellaneous I/O and the QPI links]
CORE architecture, but with Hyperthreading
4
Cores, Caches and Links
• Reconfigurable architecture
• Notebook: 1- or 2-core systems which must have reduced cost, low power consumption and low execution latency for single tasks
• Desktop: similar characteristics but power consumption is less important. High bandwidth for graphical applications
• Server: high number of cores, very high bandwidth and low latency for many different tasks. RAS – Reliability, Availability, Serviceability – of paramount importance
[Diagram: cores and the uncore region – 3 integrated dynamic DDR3 memory controllers (IMC) towards DRAM, the shared L3 cache, the QPI links and the Power & Clock unit]
5
http://www.agner.org/optimize/microarchitecture.pdf
6
Nehalem characteristics
• QPI bus.
• Each processor (core) has a 32 KB, 4-way set-associative instruction cache, a 32 KB, 8-way set-associative data cache, and a unified (data and instructions) 256 KB, 8-way set-associative second-level cache. The second-level cache is not inclusive
• Each quad-core socket (node) relies on a maximum of three DDR3 channels which provide up to 32 GB/s peak bandwidth. Each channel operates in independent mode and the controller handles the requests out of order (OOO) so as to minimize the total latency
• Each core can handle up to 10 data cache misses and up to 16 transactions concurrently (i.e. instruction and data retrieval from a higher-level cache). In comparison, Core 2 could handle 8 data cache misses and 14 transactions
• Third-level cache (inclusive cache)
• There is a central queue which provides the interconnection and arbitration between the cores and the "uncore" region (common to all cores), that is L3, the memory controller and the QPI interface
• From the performance point of view an inclusive L3 is the ideal configuration since it makes it possible to handle the coherence problems of the chip efficiently (see later) and avoids data replication. Since it is inclusive, any datum present in any core is present in L3 too (although possibly in a non-coherent state)
• Cache sizes change according to the model (see the sketch after this list)
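As an illustration of the cache parameters above (a hedged sketch: the 64-byte line size is assumed, being the usual value on these processors; the 8 MB, 16-way L3 figure comes from a later slide), the number of sets follows from size, associativity and line size:

/* sets = size / (ways * line_size); the 64-byte line size is an assumption. */
#include <stdio.h>

static unsigned sets(unsigned size_bytes, unsigned ways, unsigned line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

int main(void)
{
    printf("L1I  32 KB,  4-way: %u sets\n", sets(32 * 1024, 4, 64));        /* 128  */
    printf("L1D  32 KB,  8-way: %u sets\n", sets(32 * 1024, 8, 64));        /* 64   */
    printf("L2  256 KB,  8-way: %u sets\n", sets(256 * 1024, 8, 64));       /* 512  */
    printf("L3    8 MB, 16-way: %u sets\n", sets(8 * 1024 * 1024, 16, 64)); /* 8192 */
    return 0;
}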
7
Increased power Core
[Pipeline diagram: Instruction Fetch (ITLB + 32 kB instruction cache) feeds the Instruction Queue; the decoders (1 complex + 3 simple) produce up to 7 u-ops; Rename/Allocate passes 4 u-ops per clock to the Retirement Unit (ReOrder Buffer) and the Reservation Station, which issues 6 u-ops per clock to the Execution Units; loads and stores go through the DTLB and the 32 kB data cache, backed by the 2nd-level TLB, the unified 256 kB 2nd-level cache, L3 and beyond]
8
Macrofusion
• All Core macrofusion cases plus…
– CMP+Jcc macrofusion added for these branch conditions too (see the example after the list):
• JL/JNGE
• JGE/JNL
• JLE/JNG
• JG/JNLE
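A hedged C illustration (not from the slides): the compare-and-branch closing a signed loop is normally compiled into a CMP followed by one of the Jcc forms listed above (e.g. JL/JGE), a pair that Nehalem can macrofuse into a single u-op:

/* The loop bound test below is typically emitted as "cmp" + "jl"/"jge",
   i.e. a CMP+Jcc pair eligible for macrofusion on Nehalem. */
long sum_first_n(const long *a, long n)
{
    long s = 0;
    for (long i = 0; i < n; i++)   /* signed compare -> JL/JGE family */
        s += a[i];
    return s;
}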
9
Core loop detector
• Exploits hardware loop detection. The loop detector analyzes the branches and determines whether they form a loop (a jump in a fixed direction, always back to the same target)
– Avoids the repetitive fetch and branch prediction
– … but still requires decoding every cycle
[Core front end: Branch Prediction -> Fetch -> Loop Stream Detector (18 CISC instructions) -> Decode]
10
Nehalem loop detector
• Similar concept, but a higher number of instructions is considered (similar to the trace cache)
[Nehalem front end: Branch Prediction -> Fetch -> Decode -> Loop Stream Detector (28 micro-ops)]
After the Loop Stream Detector, the last step is a separate stack engine which removes all the u-ops regarding the stack. The u-ops which speculatively modify the stack pointer are handled by a separate adder which writes to a "delta" register (a register apart – RSB), periodically synchronized with the architectural register which contains the non-speculative stack pointer. The u-ops which only manipulate the stack pointer therefore do not enter the execution units.
11
Two-level branch predictor
• Two levels in order to hold an ever-increasing number of predictions. The mechanism is similar to that of the caches (see the sketch below)
• The first-level BTB has 256/512 entries (according to the model): if the prediction is not found there, a second-level BTB (activated only upon access) with 2K-8K entries is interrogated. This reduces power consumption since the second-level BTB is very seldom activated
• Increased number of RSB (Return Stack Buffer) slots (deeper speculative execution)
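A minimal sketch of the two-level lookup idea (an illustrative data structure only: entry format, indexing and sizes are assumptions, not Intel's actual design):

/* Illustrative two-level BTB lookup: the small first-level table is probed
   first; only on a miss is the large second-level table accessed, which is
   what keeps the second level (and its power cost) rarely active. */
#include <stdint.h>
#include <stdbool.h>

#define L1_ENTRIES  256      /* 256/512 depending on the model */
#define L2_ENTRIES 4096      /* 2K-8K depending on the model   */

typedef struct { uint64_t tag; uint64_t target; bool valid; } btb_entry;

static btb_entry btb_l1[L1_ENTRIES];
static btb_entry btb_l2[L2_ENTRIES];

static bool btb_lookup(uint64_t branch_pc, uint64_t *target)
{
    btb_entry *e = &btb_l1[branch_pc % L1_ENTRIES];
    if (e->valid && e->tag == branch_pc) { *target = e->target; return true; }

    /* Second level interrogated only on a first-level miss. */
    e = &btb_l2[branch_pc % L2_ENTRIES];
    if (e->valid && e->tag == branch_pc) {
        *target = e->target;
        btb_l1[branch_pc % L1_ENTRIES] = *e;   /* promote into level 1 */
        return true;
    }
    return false;                              /* predict fall-through */
}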
12
More powerful memory subsystem
• Fast access to non-aligned data. Greater freedom for the compiler
• Hierarchical TLB (number of entries):
– 1st-level instruction TLB, 4k pages: 128 slots, 4-way
– 1st-level data TLB, 4k pages: 64 slots, 4-way
– 2nd-level unified TLB, 4k pages: 512 slots, 4-way
13
Nehalem internal architecture
• Nehalem's two TLB levels are dynamically allocated between the two threads.
• A very big difference with Core is the degree of cache coverage. Core had a 6 MB L2 cache and the TLB had 136 entries (4-way). Considering 4k pages the coverage was 136x4x4KB = 2176 KB, about a third of the L2 (6 MB).
• Nehalem has a 576-entry DTLB (512 second level + 64 first level), which means 4x576 = 2304 translations, amounting to 2304x4KB = 9216 KB of memory, covering the entire L3 (8 MB). (The meaning is that an address translation has a high probability of targeting data that is in L3. The arithmetic is sketched below.)
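The coverage figures are simply entries × ways × page size; a small sketch of the slide's arithmetic:

/* TLB coverage, following the slide's arithmetic (entries x ways x page size). */
#include <stdio.h>

int main(void)
{
    unsigned page_kb  = 4;
    unsigned core_cov = 136 * 4 * page_kb;          /* Core:    2176 KB */
    unsigned neh_cov  = (512 + 64) * 4 * page_kb;   /* Nehalem: 9216 KB */

    printf("Core DTLB coverage   : %u KB (L2 cache: 6144 KB)\n", core_cov);
    printf("Nehalem DTLB coverage: %u KB (L3 cache: 8192 KB)\n", neh_cov);
    return 0;
}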
14
(Smart) Caches
• Three-level hierarchy
• First level: 32 KB instructions (4-way) and 32 KB data (8-way)
• Second level: unified 256 KB (8-way, ~10 cycles access time)
• Third level: shared among the various cores; its size depends on the number of cores. For a quad-core: 8 MB, 16-way, latency 30-40 cycles. Designed for future expansion
• Inclusive: all addresses present in L1 and L2 are present in L3 too (possibly with different states). Each L3 line has n "core valid" bits (4 in the quad-core case) which indicate which cores (if any) have a copy of the line. (A datum in L1 or L2 is certainly in L3 too, but not vice versa)
• L3 has a private power plane and operates at its own frequency (not the same as the cores) and a higher voltage. This is because power must be spared and large caches very often exhibit errors if their voltage is too low
[Diagram: each core with its own L1 caches (32 kB instructions + 32 kB data) and 256 kB L2 cache, all connected to the shared L3 cache]
15
Inclusive cache vs. exclusive
[Diagram: an exclusive L3 cache and an inclusive L3 cache, each shared by Cores 0-3]
An example: the datum requested by Core 0 is not present in its L1 and L2 and is therefore requested from the common L3
16
Inclusive cache vs. exclusive
[Diagram: MISS in the exclusive L3 and MISS in the inclusive L3]
• The requested datum cannot be retrieved from L3 in either case
17
Inclusive cache vs. exclusive (miss)
[Diagram: MISS in both L3 caches]
Exclusive: a request must be sent to all the other cores
Inclusive: the datum is certainly not on the chip
18
Inclusive cache vs. exclusive (hit)
[Diagram: HIT in both L3 caches]
Exclusive: no further requests to the other cores; if the datum is in L3 it cannot be in the other cores
Inclusive: the datum could be in other cores too, but…
19
Inclusive cache vs. exclusive
Inclusive: the L3 cache has a directory (one bit per core) which indicates if, and in which, cores the datum is present. A snoop is necessary only if exactly one bit is set (the datum might have been modified). If two or more bits are set, the line in L3 is «clean» and can be forwarded from L3 to the requesting core. (The philosophy is directory-based coherence.)
[Diagram: HIT in the inclusive L3 with all core valid bits at 0 – the line is forwarded with no snoop]
20
Inclusive cache vs. exclusive
[Diagram: MISS in the exclusive L3 vs. HIT in the inclusive L3 with core valid bits 0 0 1 0]
Exclusive: all cores must be tested
Inclusive: here only the core holding the datum (Core 2) must be tested (see the sketch below)
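A hedged sketch of the core-valid-bit decision just described (names and bit layout are illustrative; GCC/Clang builtins are assumed):

/* On an L3 hit, the "core valid" bits attached to the line tell L3 whether
   a snoop is needed: with zero or with two-or-more bits set the line is
   clean and can be forwarded directly; with exactly one bit set that single
   core may hold a modified copy and must be snooped. */
#include <stdint.h>

enum l3_action { FORWARD_FROM_L3, SNOOP_ONE_CORE };

static enum l3_action on_l3_hit(uint8_t core_valid_bits, int *core_to_snoop)
{
    int set = __builtin_popcount(core_valid_bits);

    if (set == 1) {                                  /* possibly modified   */
        *core_to_snoop = __builtin_ctz(core_valid_bits);
        return SNOOP_ONE_CORE;
    }
    return FORWARD_FROM_L3;      /* 0 or >=2 copies: the L3 line is clean   */
}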
21
Execution unit
Unified Reservation Station – executes up to 6 u-ops/cycle
• Schedules operations to the Execution Units
• Single scheduler for all Execution Units; can be used by all integer, all FP u-ops, etc.
• 3 memory u-ops per cycle: 1 Load, 1 Store Address, 1 Store Data
• 3 "computational" u-ops per cycle
6 ports, as in Core:
• Port 0: Integer ALU & Shift, FP Multiply, Divide, SSE Integer ALU, Integer Shuffles
• Port 1: Integer ALU & LEA, FP Add, Complex Integer, SSE Integer Multiply
• Port 5: Integer ALU & Shift, Branch, FP Shuffle, SSE Integer ALU, Integer Shuffles
• Port 2: Load
• Port 3: Store Address
• Port 4: Store Data
22
Execution unit
Loop Stream Detector
• Each fetch retrieves 16 bytes (128 bits) from the cache, which are inserted in a predecode buffer; from there 6 instructions at a time are sent to an 18-instruction queue. Up to 4 CISC instructions at a time are sent to the 4 decoders (when possible) and the decoded u-ops are sent to a 28-slot Loop Stream Detector
• A new technique is implemented (Unaligned Cache Accesses) which grants the same execution speed to aligned and non-aligned accesses (i.e. those crossing a cache line). Before, non-aligned accesses carried a big execution penalty which very often prevented the use of particular instructions: from Nehalem on this is no longer the case (see the sketch below)
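A small sketch of aligned vs. unaligned 128-bit loads (illustrative only; SSE2 intrinsics assumed): before Nehalem, compilers avoided the unaligned form because of its penalty, while from Nehalem on it costs about the same when the data happen to be aligned:

/* MOVDQU (unaligned load) vs MOVDQA (aligned load). */
#include <emmintrin.h>

/* Load 16 bytes from a pointer with any alignment. */
__m128i load16_any_alignment(const void *p)
{
    return _mm_loadu_si128((const __m128i *)p);   /* MOVDQU */
}

/* Load 16 bytes from a 16-byte aligned pointer. */
__m128i load16_aligned(const void *p)
{
    return _mm_load_si128((const __m128i *)p);    /* MOVDQA: faults if p is
                                                     not 16-byte aligned */
}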
23
Nehalem internal architecture
EU for non-memory instructions
• The RAT can rename up to 4 u-ops per cycle (the number of physical registers differs according to the implementation). The renamed instructions are inserted in the ROB and, when their operands are ready, in the Unified Reservation Station
• The ROB and the RS are shared between the two threads. The ROB is statically subdivided: identical speculative execution "depth" for the two threads
• The RS stage is "competitively" shared according to the situation. A thread could be waiting for a memory datum and therefore be using few or no entries of the RS: it would be senseless to reserve entries for a blocked thread
24
Nehalem internal architecture
EU for memory instructions
• Up to 48 Loads and 32 Stores in the MOB
• From the Load and Store buffers the u-ops access the caches hierarchically. As in the PIV, caches and TLBs are dynamically shared between the two threads. Each core accepts up to 16 outstanding misses for the best use of the increased memory bandwidth
25
Nehalem full picture
L3 shared among the chip cores. Uncore
• Other improvements: SSE 4.2 instructions for string manipulation (very important for XML handling)
• Instructions for CRC computation (important for transmissions)
26
Execution parallelism improvement
• Increased ROB size (33% – 128 slots)
• Improvement of the related structures:
Structure — Intel Core microarchitecture (Merom) -> Intel Core microarchitecture (Nehalem) — Comment
• Reservation Stations: 32 -> 36 (dispatches operations to the execution units)
• Load Buffers: 32 -> 48 (tracks all load operations allocated)
• Store Buffers: 20 -> 32 (tracks all store operations allocated)
27
Nehalem vs Core
• Core internal architecture modified for QPI
• The execution engine is the same. Some blocks were added for integer and FP optimisation. In practice the increased number of RS entries allows full use of the EUs, which in Core were sometimes starved.
• For the same reason multithreading was reintroduced (and therefore the greater ROB – 128 slots vs the 96 of Core – and the larger number of RS entries – 36 vs the 32 of Core).
• Load buffers are now 48 (32 in Core) and store buffers 32 (20 in Core). The buffers are partitioned between the threads (fairness).
• It must be noted that the static sharing between threads provides each thread with a reduced number of ROB slots (64 = 128/2 instead of 96), but Intel states that in the case of a single thread all resources are given to it…
30
Power consumption control
PLL = Phase Lock Loop. BCLK: from a base (quartz-generated) frequency all the requested frequencies are generated.
Each core has its own PLL, its own Vcc/frequency sensors and its own power supply gate; the uncore and the LLC form a separate domain.
PCU = Power Control Unit: current, temperature and power are controlled in real time. Flexible: a sophisticated hardware algorithm for power consumption optimization.
[Diagram: PCU connected to the per-core sensors, PLLs and power supply gates, to the uncore/LLC domain, and to Vcc and BCLK]
33
Core power consumption
[Chart: breakdown of the total power consumption]
• Clock distribution (blue): a high-frequency design requires an efficient global clock distribution
• Leakage currents (green): high-frequency systems are affected by unavoidable losses
• Local clocks and logic (red): clocks and gates
34
Power minimisation
• Idle CPU states are called Cn
• The higher n is, the lower the power consumption, BUT the longer the time needed to exit the «idle» state
[Chart: idle power (W) vs. exit latency (ms) for states C0, C1, …, Cn]
• The OS informs the CPU when no processes must be executed – privileged instruction MWAIT(Cn); a sketch follows below
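A minimal sketch of how an operating system's idle loop can request a C-state with MONITOR/MWAIT (assumptions: ring-0 execution, since the instructions are privileged here; a compiler providing the SSE3 intrinsics; a hypothetical wakeup_flag location; the hint encodings are model specific):

#include <pmmintrin.h>
#include <stdint.h>

static volatile uint32_t wakeup_flag;       /* hypothetical wakeup location */

static void idle_in_c_state(unsigned c_state_hint)
{
    /* Arm the monitor on the wakeup location...                          */
    _mm_monitor((const void *)&wakeup_flag, 0, 0);
    /* ...then halt in the requested C-state: the hint selects the target
       state. A store to wakeup_flag or an interrupt resumes execution at
       the instruction following MWAIT.                                   */
    _mm_mwait(/* extensions */ 0, /* hints */ c_state_hint);
}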
35
C states (before Nehalem)
• C0 state: active CPU
• C1 and C2 states: pipeline clock and the majority of the other clocks blocked
• C3 state: all clocks blocked
• C4, C5 and C6 states: progressive reduction of the operating voltage
• Core had only one power plane and all devices had to be idle before the voltage could be reduced
• The higher C values were the most expensive in terms of exit time (voltage increase, state restore, pipeline restart, etc.)
[Chart: total power consumption split into clock distribution, leakage currents, local clocks and logic]
36
C states behaviour (Core)
[Chart: power of Core 0 and Core 1 vs. time]
Task completed, no task ready: instruction MWAIT(C6).
37
C states behaviour (Core)
[Chart: power of Core 0 and Core 1 vs. time]
Core 1 execution stopped, its state saved and its clocks stopped. Core 0 keeps executing.
38
C states behaviour (Core)
[Chart: power of Core 0 and Core 1 vs. time]
Core 0: task completed, no task ready. Instruction MWAIT(C6).
39
C states behaviour (Core)
[Chart: power of Core 0 and Core 1 vs. time]
Only now (both cores idle) is it possible to reduce the voltage and power.
40
C states behaviour (Core)
[Chart: power of Core 0 and Core 1 vs. time]
Core 1 interrupt: the voltage is increased (for both processors, single power plane), Core 1 clocks are reactivated, its state is restored and the instruction following MWAIT(C6) is executed. Core 0 stays idle.
41
C states behaviour (Core)
[Chart: power of Core 0 and Core 1 vs. time]
Core 0 interrupt: Core 0 goes to state C0 and the instruction following MWAIT(C6) is executed. Core 1 keeps executing.
42
C6 Nehalem
[Chart: power of Cores 0-3 vs. time]
Cores 0, 1, 2, and 3 active. Separate power supply for each core!
43
C6 Nehalem
[Chart: power of Cores 0-3 vs. time]
Core 2: task completed, no task ready. MWAIT(C6).
44
C6 Nehalem
[Chart: power of Cores 0-3 vs. time]
Core 2 stopped, its clocks stopped. Cores 0, 1, and 3 keep executing.
45
C6 Nehalem
[Chart: power of Cores 0-3 vs. time]
Core 2 power gate switched off: its voltage drops to 0 and it enters state C6. Cores 0, 1, and 3 keep executing.
46
C6 Nehalem
[Chart: power of Cores 0-3 vs. time]
Core 0: task completed, no task ready. MWAIT(C6) – Core 0 enters C6. Cores 1 and 3 keep executing.
47
C6 Nehalem
[Chart: power of Cores 0-3 vs. time]
Core 2 interrupt: Core 2 returns to C0 and execution resumes from the instruction following MWAIT(C6). Cores 1 and 3 keep executing.
48
C6 Nehalem
[Chart: power of Cores 0-3 vs. time]
Core 0 interrupt: its power gate is turned back on, its clocks are reactivated, its state is restored and execution resumes from the instruction following MWAIT(C6). Cores 1, 2, and 3 keep executing.
49
C6 Nehalem
• All cores in state C6:
o Core power drops to ~0
• Entire package in package C6:
o Uncore clocks blocked
o I/O in low power
o Uncore clock distribution blocked
[Chart: core power consumption (clock and logic, clock distribution, losses, x N cores) and uncore power (uncore logic, I/O, uncore clock distribution, uncore losses)]
50
Further power reduction
• Memory
o Memory clocks are blocked between requests when the usage is low
o Memory refresh continues even when the package is in the C3 (clocks blocked) and C6 (power down) states
• Links
o Low power when the processor moves to a deeper Cx state
• The Power Control Unit monitors the interrupt frequency and changes the C states accordingly
o The C states requested by the operating system depend on the processor utilization
o With some light workloads the utilization can be low, yet the latency can be of paramount importance (e.g. real-time systems)
o The CPU can implement complex behaviour optimisation algorithms
• The system changes the operating clock frequency according to the requirements in order to minimize the power consumption (processor P states)
• The Power Control Unit sets the operating voltage for each clock frequency, operating condition and silicon characteristics
• When a core enters a low-power C state its operating voltage is reduced while that of the other cores is unmodified
51
Turbo pre-Nehalem (Core)
Clock stopped: power reduction in the inactive cores.
[Chart: frequency of Core 0 and Core 1 – workload lightly threaded – no Turbo]
52
Turbo pre-Nehalem (Core)
Clock stopped: power reduction in the inactive cores.
Turbo Mode: in response to the workload it adds additional performance bins within the headroom.
[Chart: frequency of Core 0 raised above the nominal value while Core 1 is stopped, vs. the no-Turbo case]
53
Turbo Nehalem
• It uses the available clock frequency headroom to maximize performance for both multi- and single-threaded workloads
Power gating: zero power for the inactive cores.
[Chart: frequency of Cores 0-3 – workload lightly threaded or < TDP – no Turbo]
TDP: Thermal Design Power. An indication of the heat (energy) produced by a processor, which is also the max. power that the cooling system must dissipate. Measured in Watts.
54
Turbo Nehalem
Power gating: zero power for the inactive cores.
Turbo Mode: in response to the workload it adds additional performance bins (frequency increase) within the headroom.
[Chart: frequency of Cores 0-3 with Turbo vs. the no-Turbo case – workload lightly threaded or < TDP]
55
Turbo Nehalem
Power gating: zero power for the inactive cores.
Turbo Mode: in response to the workload it adds additional performance bins within the headroom.
[Chart: frequency of Cores 0-3 with Turbo vs. the no-Turbo case – workload lightly threaded or < TDP]
56
Turbo Nehalem
Active cores running workloads < Thermal Design Power.
[Chart: frequency of Cores 0-3 – workload lightly threaded or < TDP – no Turbo]
57
Turbo Nehalem
Active cores running workloads < TDP.
Turbo Mode: in response to the workload it adds additional performance bins within the headroom.
[Chart: frequency of Cores 0-3 with Turbo vs. the no-Turbo case – workload lightly threaded or < TDP]
TDP = Thermal Design Power
58
Turbo Nehalem
Power gating: zero power for the inactive cores.
Turbo Mode: in response to the workload it adds additional performance bins within the headroom.
[Chart: frequency of Cores 0-3 with Turbo vs. the no-Turbo case – workload lightly threaded or < TDP]
59
Turbo enabling
• Turbo Mode is transparent
– Frequency transitions are handled in hardware
– The operating system asks for P-state changes (frequency and voltage) in a transparent way, activating Turbo Mode only when needed for better performance (a small observation sketch follows below)
– The Power Control Unit keeps the silicon within the required boundaries
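As a hedged, Linux-specific illustration (not part of the slides): the frequency currently requested through the OS P-state mechanism can be observed via the cpufreq interface, while the actual Turbo transitions remain invisible and are handled in hardware by the PCU:

/* Read the frequency (kHz) currently requested for cpu0 via Linux cpufreq.
   Purely an observation aid for the P-state discussion above. */
#include <stdio.h>

int main(void)
{
    const char *path =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
    FILE *f = fopen(path, "r");
    unsigned long khz;

    if (f && fscanf(f, "%lu", &khz) == 1)
        printf("cpu0 requested frequency: %lu kHz\n", khz);
    else
        printf("cpufreq interface not available\n");
    if (f)
        fclose(f);
    return 0;
}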
60
Westmere
• Westmere is the name of the 32 nm Nehalem shrink and is the basis of Core i3, Core i5, and the higher core-count parts (i7 and up).
Characteristics
• 2-12 native cores (multithreaded => up to 24 logical processors)
• 12 MB L3 cache
• Some versions have an integrated graphics controller
• A new instruction set (AES-NI) for the Advanced Encryption Standard (AES) and a new instruction, PCLMULQDQ, which executes carry-less multiplication as required by cryptography (e.g. disk encryption); a sketch follows below
• 1 GB page support
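A minimal sketch of the carry-less multiplication provided by PCLMULQDQ, using the compiler intrinsic (wmmintrin.h; compile e.g. with -mpclmul; the operand values are arbitrary illustration values):

/* Carry-less (GF(2)) multiplication of two 64-bit operands: the kind of
   primitive that GCM and CRC computations are built on. */
#include <wmmintrin.h>
#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set_epi64x(0, 0xB5);       /* first 64-bit operand  */
    __m128i b = _mm_set_epi64x(0, 0x87);       /* second 64-bit operand */

    /* Multiply the low 64-bit halves without carries (imm8 = 0x00). */
    __m128i p = _mm_clmulepi64_si128(a, b, 0x00);

    long long lo = _mm_cvtsi128_si64(p);
    printf("carry-less product (low 64 bits): 0x%llx\n",
           (unsigned long long)lo);
    return 0;
}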
61
62
Roadmap
63
Hexa-Core Gulftown – 12 Threads
64
Westmere: 32nm Nehalem
65
EVGA W555 - Dual processor Westmere – (2x6x2) = 24 threads !!
Video controller
66
Intel 5520 chipset and two nForce 200 controllers under the heat sink
8 SATA ports: 2 x 6 Gb/s and 6 x 3 Gb/s
Processors
6x2 DIMM DDR3
7 PCI-E slots
Non-standard E-ATX size
1 or 2 processors and overclocking
67
68
Cooling towers
69
Roadmap
22 nm technology – 3D tri-gate transistors – 14-stage pipeline – Dual-channel DDR3 – 64 KB L1 (32 KB instructions + 32 KB data) and 256 KB L2 per core – Reduced cache latency – Can use DDR4 – Three possible GPUs: the most powerful (GT3) has 20 EUs – Integrated voltage regulator (moved from the motherboard onto the chip) – Better power consumption – Up to 100 W TDP – 10% improved performance
5 decoded CISC instructions, macrofused, produce 4 u-ops per clock – Up to 8 u-ops dispatched per clock
70
Haswell front end
71
Haswell front end
72
Haswell execution unit
8 ports !!
Increasing the OoO window allows the execution units to extract more parallelism and thus improve single-threaded performance.
Priority was given to extracting instruction-level parallelism (see the sketch below).
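A small C illustration (not from the slides) of why a wider OoO window and more execution ports pay off: splitting one dependency chain into two independent accumulators exposes instruction-level parallelism that the scheduler can dispatch to different ports in the same cycle:

#include <stddef.h>

double sum_one_chain(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];              /* every add waits for the previous one */
    return s;
}

double sum_two_chains(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) { /* two independent chains run in parallel */
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n)
        s0 += x[i];
    return s0 + s1;
}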
73
Haswell execution unit
74
Haswell
75
78