A Power-Efficient High-Throughput 32-Thread SPARC Processor
Negar Esmaeilie Falah
Instructor: Prof. M. Fakhraiee
Class Presentation
Adapted from ISSCC 2006 / Session 5 / Processors / 5.1
Outline
• Motivation
• Architecture Overview
• Performance / Power
• Physical Implementation
• Integer Register File
• L2 Cache
• Conclusion
Motivation
• Commercial server applications
  – High thread-level parallelism (TLP)
  – Low instruction-level parallelism (ILP)
• Major concerns:
  – Power
  – Cooling
  – Space
The Niagara SPARC Processor
• New architecture and new pipeline to achieve high throughput and performance/watt
• Many small, simple cores
  – Shallow, single-issue pipeline
  – Small L1 caches
• Fine-grained multithreading within each core
• L2 cache shared across all cores
• High-bandwidth memory subsystem
Architecture Features
• CPU with 32 threads to exploit TLP
• 8 cores/chip with 4 threads/core to hide memory and pipeline stalls
  – Shared pipeline to reuse resources
  – Shared L2 cache for efficient data sharing among threads
• High-bandwidth memory subsystem to increase throughput:
  – Highly associative, banked L2 cache
  – High-bandwidth crossbar to the L2 cache
  – High bandwidth to DRAM
Processor Block Diagram
[Block diagram: eight SPARC cores (Sparc 0–7) connect through a crossbar to the floating-point unit, the control register interface, and four L2 cache banks; each L2 bank has its own DRAM control channel driving a 144-bit DDR2 interface at 400 MT/s. A JBUS system interface (200 MHz), an SSI ROM interface (50 MHz), and a JTAG clock & test unit complete the chip.] [1]
Micrograph and Overview
[Die micrograph: eight SPARC cores, four L2 tag/buffer/data banks, two DRAM controllers, the crossbar, FPU, clock/test unit, I/O bridge, JBUS interface, and four DDR2 interfaces.] [1]
Features:
• 8 64-bit multithreaded SPARC cores
• Shared 3MB L2 cache
• 16KB I-cache per core
• 8KB D-cache per core
• 4 144-bit DDR2 channels
• 3.2 GB/s JBUS I/O
Technology:
• 90nm CMOS process, 9LM Cu interconnect
• 63 W @ 1.2 GHz / 1.2 V
• Die size: 378 mm², 279M transistors
• Flip-chip ceramic LGA package
SPECjbb Execution Efficiency
[Chart: per-thread cycle breakdown for SPECjbb, splitting compute time from idle time due to pipeline latency, pipeline conflicts, and memory latency.] [1]
• Single-threaded: 3.79 idle cycles per compute cycle, so efficiency = 1 / (1 + 3.79) ≈ 21%
• Four-threaded: 1.56 idle cycles per 4 compute cycles, so efficiency = 4 / (4 + 1.56) ≈ 72% (worked in the sketch below)
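A minimal sketch of the efficiency arithmetic above, in Python; the function name is ours, and the cycle counts are the figures quoted on the slide.

```python
def execution_efficiency(compute_cycles: float, idle_cycles: float) -> float:
    """Fraction of cycles spent doing useful work."""
    return compute_cycles / (compute_cycles + idle_cycles)

# Figures from the SPECjbb slide above.
print(f"1 thread:  {execution_efficiency(1, 3.79):.0%}")   # ~21%
print(f"4 threads: {execution_efficiency(4, 1.56):.0%}")   # ~72%
```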
Power
• Power-efficient architecture
  – Single-issue, in-order, six-stage pipeline
  – No speculation, predication, or branch prediction
  – Small cores can operate at a lower frequency while still achieving high throughput
• Thermal monitoring (a toy policy sketch follows below)
  – Peak power kept closer to average power
  – Issue rate controlled within the cores
  – Idle threads halted
  – Thread distribution across cores optimized for performance or power under a limited workload
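Purely as an illustration of the thread-distribution idea above, here is a toy placement policy in Python. The temperature threshold, the coolest-cores-first heuristic, and the function itself are assumptions; only the 8 cores and 4 hardware threads per core come from the slides.

```python
# Toy thermal/power policy sketch (illustrative only, not Niagara's actual
# hardware/firmware mechanism).
def place_threads(runnable_threads: int, temps_c: list[float],
                  temp_limit_c: float = 85.0) -> dict[int, int]:
    """Distribute runnable threads across cores, avoiding hot cores."""
    placement = {core: 0 for core in range(len(temps_c))}
    # Fill the coolest cores first, up to 4 hardware threads per core;
    # cores left with no threads simply stay idle (their threads are halted).
    for core in sorted(placement, key=lambda c: temps_c[c]):
        if runnable_threads == 0:
            break
        if temps_c[core] >= temp_limit_c:
            continue  # throttle: give an over-temperature core no new work
        placement[core] = min(4, runnable_threads)
        runnable_threads -= placement[core]
    return placement

# Example: 10 runnable threads on 8 cores, one core already running hot.
print(place_threads(10, [60, 62, 90, 59, 61, 63, 58, 64]))
```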
Chip Power Consumption: 63 W [1]
H-Tree Clock Distribution
[Figure: H-tree clock distribution network.] [3]
CoolThreads Advantages
[Thermal map: die junction temperatures of roughly 59–66°C, shown against a 107°C comparison point.] [1]
• Improved reliability with lower and more uniform junction temperatures
  – Increased lifetime
  – Total failure rate reduced by ~8x (vs. 105°C)
• Optimized performance/reliability trade-off
  – Frequency guardbands due to CHC, NBTI, etc. reduced by >55%
  – Reduced design margins (EM/NBTI)
  – Less variation across the die
Physical Design
• Fully static, cell-based design methodology
  – Many replicated blocks
  – Custom design only for SRAMs, analog, and I/Os
  – Increased chip robustness and test coverage
• Clock distribution combines an H-tree and a buffered tree
• All SRAMs testable through the scan chain
Statistics:
  – Transistors: 279 million
  – Standard cells: 2.2 million
  – Flops: 400,614 (32 scan chains)
  – Repeaters: 161,000
  – Memory: 20 macros, 416 instances
  – Memory bits: 35.6 million
  – Decoupling caps: 710 nF
Integer Register File Overview
• One register file is required per thread
• Supports the standard SPARC register-window RF
• Highly integrated cell structure supports 4 threads while saving area and power (a structural sketch follows below)
  – 8 windows of 32 entries
  – 3 read ports + 2 write ports for the active window
  – Read/write: single-cycle throughput, 1-cycle latency
• Swaps are pipelined across threads for save/restore operations
  – Swaps block within a thread but not across threads, for optimal CMT performance
  – 3-cycle latency with single-cycle throughput
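A minimal structural sketch of the organization described above: 4 threads, each with 8 windows of 32 entries, and one active window per thread visible to the 3 read / 2 write ports. The class and method names are ours, and the overlap of ins/outs between adjacent SPARC windows is ignored for brevity.

```python
THREADS, WINDOWS, ENTRIES = 4, 8, 32   # sizes from the slide

class ThreadRF:
    """One thread's windowed integer register file (simplified)."""

    def __init__(self):
        self.windows = [[0] * ENTRIES for _ in range(WINDOWS)]
        self.cwp = 0                    # current window pointer

    @property
    def active(self):                   # the window seen by the 3R/2W ports
        return self.windows[self.cwp]

    def save(self):                     # SPARC SAVE: open the next window
        self.cwp = (self.cwp + 1) % WINDOWS   # triggers a window "swap"

    def restore(self):                  # SPARC RESTORE: back to the previous window
        self.cwp = (self.cwp - 1) % WINDOWS

irf = [ThreadRF() for _ in range(THREADS)]   # one register file per thread
irf[0].active[5] = 42                        # write entry 5 of thread 0's active window
irf[0].save()                                # thread 0 opens a new window
```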
IRF Swaps Across Threads
[Timing diagram: back-to-back swap requests (SAVE/RSTO) from threads 1–3 under both schemes.] [1]
• Conventional swap: a swap request is fulfilled every 2 cycles
• Internally pipelined swap: the DEC/SAVE/RSTO stages of consecutive threads overlap, so a swap request is fulfilled every cycle with a fixed 3-cycle latency (compared in the sketch below)
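A toy cycle-count comparison of the two schemes above, assuming n back-to-back swap requests. The only figures used are those from the slide: a swap every 2 cycles conventionally, versus single-cycle throughput with a fixed 3-cycle latency when internally pipelined.

```python
def conventional_swap_cycles(n: int) -> int:
    """One swap request fulfilled every 2 cycles."""
    return 2 * n

def pipelined_swap_cycles(n: int) -> int:
    """Fixed 3-cycle latency, then one swap completes per cycle."""
    return 3 + (n - 1) if n else 0

for n in (1, 4, 8):
    print(f"{n} swaps: conventional {conventional_swap_cycles(n)} cycles, "
          f"pipelined {pipelined_swap_cycles(n)} cycles")
```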
L2 Cache
• High-bandwidth 3MB shared level-2 cache (an address-mapping sketch follows below)
  – Four 750KB independent banks
  – 12-way set associative
  – 16B read and write operations
  – 2-cycle throughput with 8-cycle latency
  – Direct communication with DRAM and the JBus
  – Maximum bandwidth of 153.6 GB/s
• Reverse-mapped directory
  – A CAM-based directory holds L1 cache tags instead of L2 tags to reduce area
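As a rough illustration of how an address maps onto this banked, 12-way cache, here is a sketch that splits a physical address into bank, set, and byte offset. The 64B line size, the bank interleaving on low-order line-address bits, and the 1024 sets per bank are assumptions; the slide only gives the bank count, capacity, and associativity.

```python
LINE_BYTES = 64        # assumed line size
BANKS = 4              # four independent L2 banks (from the slide)
SETS_PER_BANK = 1024   # assumed: 3MB / (4 banks * 12 ways * 64B lines)

def l2_index(paddr: int) -> tuple[int, int, int]:
    """Split a physical address into (bank, set index, byte offset)."""
    offset = paddr % LINE_BYTES
    line = paddr // LINE_BYTES
    bank = line % BANKS                        # interleave lines across banks
    set_idx = (line // BANKS) % SETS_PER_BANK
    return bank, set_idx, offset

print(l2_index(0x12345678))   # (bank, set, offset) for one example address
```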
Crossbar
• The 8 cores communicate with the L2 cache, FPU, and control register interface
• 134.4 GB/s data bandwidth
• 3-stage pipeline: request, arbitrate, transmit (an arbiter sketch follows below)
• 2 queue entries per source/destination pair
• Arbiter prioritizes requests by age
• Standard-cell macros with semi-custom routing
[1]
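A toy model of the age-based arbitration at one crossbar destination, assuming each source keeps a small request queue. The 8 sources, the 2-entry queue per source/destination pair, and oldest-first arbitration are from the slide; the class, the tie-break on source index, and everything else are illustrative assumptions.

```python
from collections import deque

SOURCES = 8        # the eight SPARC cores
QUEUE_DEPTH = 2    # queue entries per source/destination pair

class DestinationPort:
    """One crossbar destination (e.g. an L2 bank) with per-source queues."""

    def __init__(self):
        self.queues = [deque() for _ in range(SOURCES)]

    def request(self, source: int, payload, cycle: int) -> bool:
        """Stage 1: a source posts a request, stamped with its arrival cycle."""
        if len(self.queues[source]) == QUEUE_DEPTH:
            return False                          # queue full, source retries
        self.queues[source].append((cycle, payload))
        return True

    def arbitrate(self):
        """Stage 2: grant the oldest pending request across all sources."""
        pending = [(q[0][0], s) for s, q in enumerate(self.queues) if q]
        if not pending:
            return None
        _, winner = min(pending)                  # smallest arrival cycle wins
        return winner, self.queues[winner].popleft()[1]   # stage 3 would transmit this

port = DestinationPort()
port.request(source=3, payload="load A", cycle=10)
port.request(source=1, payload="load B", cycle=9)
print(port.arbitrate())   # source 1 wins: its request is older
```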
L2 Data Array
• Each 750KB bank is divided into 4 sub-banks
• Each sub-bank reads 16B independently
• 12 16KB panels per sub-bank
• Each panel contains the data for 1 of the 12 ways
• 12 64KB custom macros per bank
[Floorplan: one L2 data bank drawn as four logical sub-banks; each sub-bank is built from per-way panels (way9, way10, way11, ...) composed of 32KB and 64KB arrays, with 128b data buses into the interface datapath unit.] [1]
L2 Data Clock Header Design
• A special clock header design allows:
  – Sub-bank and panel-level clock gating to minimize non-active power
  – Only 1–4 panels activated out of the 48 panels in a bank
  – An interlocking scheme for 2-cycle throughput
[Schematic: the L2 clock is gated per panel by sub-bank enable (sbank_en), panel enable (panel_en), and way_select, with a dynamic flip-flop and set/reset interlock driven by access_done and po_reset.] [1]
(A behavioral sketch of the panel selection follows below.)
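As a behavioral sketch of the gating described above, the snippet below picks which of the 48 panels in one bank get an enabled clock for an access. Only the counts (4 sub-banks of 12 way panels, 1 to 4 panels active) come from the slide; the idea that a 16B access touches one sub-bank while a full-line operation touches all four is an assumption for illustration.

```python
SUBBANKS = 4   # sub-banks per L2 bank
WAYS = 12      # one panel per way per sub-bank -> 48 panels per bank

def panels_to_clock(way: int, subbanks: list[int]) -> set[tuple[int, int]]:
    """Return the (sub-bank, way) panels whose clock is enabled;
    every other panel in the bank stays gated to save active power."""
    assert 0 <= way < WAYS and all(0 <= sb < SUBBANKS for sb in subbanks)
    return {(sb, way) for sb in subbanks}

print(len(panels_to_clock(way=9, subbanks=[2])))           # 16B access: 1 of 48 panels
print(len(panels_to_clock(way=9, subbanks=[0, 1, 2, 3])))  # full-line op: 4 of 48 panels
```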
Conclusion
• New CMT architecture developed to address commercial workload requirements
• 32 threads hide instruction latency in a short, simple pipeline
• Large bandwidth, instead of high frequency, delivers the target performance at low power
• Cooler and more uniform chip temperatures enhance the performance/reliability trade-off
• Circuits designed for high bandwidth and low power to support multithreading
References
[1] A. S. Leon, J. L. Shin, K. W. Tam, W. Bryg, F. Schumacher, P. Kongetira, D. Weisner, and A. Strong, "A Power-Efficient High-Throughput 32-Thread SPARC Processor," ISSCC Dig. Tech. Papers, Session 5.1, Feb. 2006.
[2] P. Kongetira, "A 32-Way Multithreaded SPARC Processor," 16th Hot Chips Symp., Aug. 2004.
[3] M. A. El-Moursy and E. G. Friedman, "Exponentially Tapered H-Tree Clock Distribution Networks," 2004.