Transcript MPSoC

University of Tehran
Electrical and Computer Engineering School
Design of ASIC CMOS Systems Course
MPSoC
Presented by:
Mahdi Hamzeh
Instructor:
Dr S.M. Fakhraie
Spring 2006
This is a class presentation. All data are copyrighted to the respective authors listed in the references and have been used here for educational purposes only.
Outline
Introduction
A Power-Efficient High-Throughput 32-Thread SPARC Processor
A 16-Core RISC Microprocessor with Network Extensions
A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache
A 2.6GHz Dual-Core 64b×86 Microprocessor with DDR2 Memory Support
Conclusion
Introduction
Multi-processor systems-on-chip (MPSoCs) pose many new challenges to the design of embedded systems [5]
MPSoCs are becoming a necessary way to balance performance, power, and reliability while maintaining the maximum degree of flexibility [6]
Energy-efficient design is strongly required in MPSoCs [6]
  Dynamic power and leakage power
A Power-Efficient High-Throughput 32-Thread SPARC Processor

Features
High throughput
Performance/watt optimized
Concurrent execution of 32 threads
8 symmetrical 4-way 64b multithreaded SPARC cores
16KB ICache per Core
8KB DCache per Core
High-bandwidth low-latency cache/memory
Single-issue 6-stage pipeline for each core
Maximize pipeline utilization
CPU-to-cache crossbar with 134GB/s bandwidth
4-banked 12-way L2 cache
Pipelined shared 3MB L2 cache with 153.6GB/s bandwidth
Four 144b DDR2 DIMM channels at 400MT/s (mega-transfers per second) delivering 25.6GB/s
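As a rough check (assuming each 144b DDR2 channel carries 128 data bits, i.e. 16B, plus 16 ECC bits), the quoted memory bandwidth follows directly: 4 channels × 16B/transfer × 400MT/s = 25.6GB/s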
Processor block diagram [1]
Niagara processor micrograph and overview [1]
A Power-Efficient High-Throughput 32-Thread SPARC Processor (Contd.)

Thread features
  4 threads are interleaved per cycle with zero thread-switch cost
  When any thread is blocked by a cache miss or branch penalty, the other threads issue instructions more frequently, effectively hiding the miss latency of the first thread
Measured IPC (instructions per cycle)
  5.76 with an actual L2 latency of 20.9 CPU cycles and a memory latency of 106ns on the Java Business Benchmark (SPECjbb)
  Pipeline efficiency of 71% (5.76 out of a maximum of 8, since the 8 cores are each single-issue)
A balanced H-tree scheme is used to distribute the global clock
Thermal gradient of only 7°C at 63W
Worst-case junction temperature is 66°C (compared to a typical Tj of 105°C, reliability improves by 5×)
Technology
90nm CMOS process
9 layers of Cu interconnect
378mm2 die
279M transistors
Packaged in a flip-chip ceramic LGA with 1933 pins
Power dissipation is 63W at 1.2V and 1.2GHz
Library cells are static CMOS with a 1.5 P/N width ratio
Chip power consumption: 63W [1]
Components

Peak Power Control
  Active power- and temperature-control mechanisms allow threads and cores to be dynamically scheduled or idled
  Clock-gating techniques include coarse-grain gating to disable selected cores, and fine-grain gating to disable about 30% of the datapath flops on average
  The PLL generates 3 ratioed clock domains:
    CPU, crossbar, L2 cache
    Memory interface
    System interface
Components (Contd.)

On-chip L2 cache
  12-way set-associative
  Divided into 4 independent banks
  The banks operate concurrently to read out up to 256B
  Each sub-bank supplies 16B with 2-cycle throughput
  Provides a maximum data array read bandwidth of 153.6GB/s
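As a consistency check (assuming the L2 runs on the 1.2GHz CPU/crossbar/L2 clock domain), the quoted bandwidth follows from the 256B-per-2-cycles figure: 256B / 2 cycles × 1.2GHz = 153.6GB/s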
L2 cache data array floorplan and interlocking clock header [1]
Design Methodology
Hold-time methodology based on metal-programmable delay buffers, allowing the top-level route to be frozen while hold violations are still being resolved
A 16-Core RISC Microprocessor with Network Extensions

Features
Icache size (each core): 32kB
Dcache size (each core): 8kB
L2 cache: 1MB
Number of MIPS cores: 16
Number of metal layers: 9
Process: 0.13μm CMOS
Voltage: 1.2V
Frequency: 600MHz
Power: 25W (450mW per individual processor at 600MHz)
Chip plot [2]
Number of transistors: 180 million
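For context, 16 cores × 450mW ≈ 7.2W, so the cores account for well under half of the 25W chip power; the remainder presumably goes to the shared L2 cache, network coprocessors, I/Os, and clock distribution.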
A 16-Core RISC Microprocessor with Network Extensions (Contd.)

Targeted for layer-4 through layer-7 network applications
Designed for power efficiency
Components (Hardwired)
  Security engines
  Network function accelerators
  Memory/network/bus controllers
  Most of the silicon area is dedicated to the 16 RISC processors and the 1MB L2 cache; the remaining area is occupied by network coprocessors and physical interfaces
Chip interfaces
  64b 133MHz PCI/PCI-X
  144b 800MHz DDR2
  36b 600MHz low-latency DRAM interface
  Miscellaneous and general-purpose I/Os
Peak performance [2]
Architecture
Each RISC core can issue two MIPS instructions per cycle in order
32kB 4-way set-associative virtual instruction cache
The execute unit consists of two pipelines:
  The first handles all instructions
  The second handles only ALU/insert/extract/shift/move instructions
The memory section consists of:
  8kB fully associative Dcache
  2kB write buffer
  32-entry (64-page) unified translation look-aside buffers (TLBs)
Multiplication/division unit, in addition to support for the standard MIPS instructions
Cryptographic operations are accelerated by dedicated units supporting different encryption methods: 3DES, AES, MD5, SHA1/256/512, and GF2
The 16 processors share a 1MB fully coherent L2 write-back cache
RISC Processor [2]
Power
Aggressive clock gating of all place-and-route and custom islands
Some blocks present natural exclusivity, and hardware enforces this exclusivity to reduce peak power
  Example: in the execution unit, only the ALU or the shifter needs to be enabled
Power efficiency is approximately 2000 MIPS/W
Global clock distribution power is <1W at 1.2V, 600MHz for a skew of <50ps
Design Methodology
Combination of an industry-standard synthesis and place-and-route flow for control blocks, and full-custom schematic/layout design for the datapath-style units
Global clock distribution is full custom and consists of a power-efficient variable-density grid that minimizes total metal capacitance while maintaining low-resistance paths to the heaviest clock loads
Local conditional clocks are two gain stages from the global clock and are designed on an ad-hoc basis
Global floorplanning and wiring is done with an in-house tool that handles routing in addition to optimal repeater, local clock driver, and decoupling capacitance placement
Chip floorplan [3]
A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache

Features
Two 64b cores
16MB unified L3 cache
Each core has two threads
Each core has a unified 1MB L2 cache
1.328B transistors
Frequency: 3.0GHz
435mm2 die
1.25V core supply
Worst-case power dissipation: 165W
Typical server-workload power: 110W
Process: 65nm
8 copper interconnect layers
Low-k carbon-doped oxide (k=2.9) inter-level dielectric
Flip-chip (C4) attached to a 12-layer (4-4-4) organic package with an integrated heat spreader
Package has 604 pins (238 are signal pins and the rest are power and ground)
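A rough sanity check on the transistor count (counting only the 6T data cells and ignoring tags, redundancy, and periphery): 16MB × 8 bits/byte × 6 transistors/cell ≈ 0.8B transistors, so the L3 data array alone accounts for well over half of the 1.328B total.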
Die micrograph [3]
Components

L3 cache
  Uses 256 data sub-arrays (64kB each)
  32 redundancy sub-arrays (68kB each)
  A data sub-array stores 32 bits; a redundancy sub-array stores 34 bits
  6T memory cell size is 0.624μm2
  The physical address is 40b wide
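For reference, 256 data sub-arrays × 64kB = 16MB, matching the full L3 capacity, and the 68kB redundancy sub-array size is consistent with the 34/32 bit ratio (68/64 = 34/32).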
Clock and PLL
  3 PLLs
  The uncore clock is distributed through a balanced tree embedded in nine vertical spines
  De-skew circuits controlled by on-die fuses reduce the uncore clock skew to less than 11ps
Clock distribution map [3]
Components (Contd.)

Level shifters
  Used between voltage domains
DFT and debug features
  Scan
  Observability registers (scan-out)
  I/O loopback and I/O test generator (IBIST)
  On-die clock shrink
…
Power
Only 0.8% of all L3 cache array blocks are powered up for each cache access
To reduce L3 cache leakage, NMOS sleep transistors are implemented in the SRAM sub-arrays and PMOS power-gating devices are used in the cache periphery (both saving about 3W of leakage)
Supply voltage
  Three supply voltages:
    One for the two cores
    A separate supply for the L3 cache together with the associated control logic
    A third for the FSB I/O circuits
  The design uses longer-Le devices (10% longer than nominal) in non-timing-critical paths to reduce subthreshold leakage (about 54% of the transistor width in the cores and 76% of the transistor width in the uncore, excluding cache arrays)
Voltage domains and power breakdown [3]
L3 cache sleep circuit and shut-off mode [3]
A 2.6GHz Dual-Core 64b×86 Microprocessor with DDR2 Memory Support

Features
Process technology: 90nm triple-Vt, partially-depleted SOI, with 9-layer Cu metallization and dual gate-oxide thicknesses
2 CPU cores
220mm2 die area
77.4mm2 L2 cache area
Transistor count: 243M (134M in the L2 arrays, 13M in the L1 arrays)
L1 instruction cache: 64kB per core, parity protected
L1 data cache: 64kB per core, ECC protected
L2 cache: 1MB per core, ECC protected
Memory interface: 128b DDR2-800, 12.8GB/s
2.6GHz operating frequency
1.35V core supply
Power dissipation: 95W
The chip implements the Pacifica architecture for hardware support of virtualization
The design has a 7% frequency margin and a 10% voltage margin at its operating point
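The memory-interface bandwidth follows directly from the stated figures: a 128b (16B) data path at 800MT/s gives 16B × 800MT/s = 12.8GB/s.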
Die micrograph [4]
Components
2 Hammer cores
On-chip DDR2 memory controller
3 identical PLLs:
  2 PLLs provide clocks for the 3 HyperTransport links
  The third provides a clock for the memory controller and both cores
Clock Distribution
A balanced H-tree drives the clock signal from the PLL to the final clock buffers
Worst-case clock skew is 21ps
Power
Fine-grained clock gating reduces the load on the clock grid and reduces power consumption
The clock grids over the 2 cores can be separately enabled
Low-power operating modes:
  The clock grids over the CPU cores are disabled
  The clock grid over the memory controller runs at 1/256th the frequency of the system clock
The grid provides a low-resistance path to all clock receivers, so clock drivers do not have to be tuned based on loading at the end of the design cycle
Reducing the supply voltage from 1.35V to 1.1V achieves a three-fold reduction in static leakage and a 47% reduction in dynamic power, at a cost of 20% in frequency
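A quick check of the dynamic-power figure (assuming dynamic power scales as CV²f and that the full 20% frequency reduction is taken): (1.1/1.35)² × 0.8 ≈ 0.53, i.e., roughly the 47% reduction quoted.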
Static leakage versus frequency [4]
Conclusion
Power-hungry techniques like memory speculation, out-of-order execution, and predication are not needed to achieve the desired performance [1]
Extensive use of simple static CMOS circuits improves robustness [2]
These designs target power efficiency, which is a key requirement for MPSoC embedded applications [3]
References
[1] A. S. Leon, J. L. Shin, K. W. Tam, W. Bryg, F. Schumacher, P. Kongetira, W. Weisner, A. Strong, "A Power-Efficient High-Throughput 32-Thread SPARC Processor", International Solid-State Circuits Conference, February 2006.
[2] V. Yalala, D. Brasili, D. Carlson, A. Hughes, A. Jain, T. Kiszely, K. Kodandapani, A. Varadharajan, T. Xanthopoulos, "A 16-Core RISC Microprocessor with Network Extensions", International Solid-State Circuits Conference, February 2006.
[3] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, "A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache", International Solid-State Circuits Conference, February 2006.
[4] M. Golden, S. Arekapudi, G. Dabney, M. Haertel, S. Hale, L. Herlinger, Y. Kim, K. McGrath, V. Palisetti, M. Singh, "A 2.6GHz Dual-Core 64b×86 Microprocessor with DDR2 Memory Support", International Solid-State Circuits Conference, February 2006.
[5] K. C. Chang, J. S. Shen, T. F. Chen, "Evaluation and Design Trade-Offs Between Circuit-Switched and Packet-Switched NoCs for Application-Specific SoCs", 43rd Design Automation Conference, July 2006.
[6] I. Issenin, E. Brockmeyer, B. Durinck, N. Dutt, "Multiprocessor System-on-Chip Data Reuse Analysis for Exploring Customized Memory Hierarchies", 43rd Design Automation Conference, July 2006.
Questions?