A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache

Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan
Chang (Intel, Santa Clara, CA)
ISSCC 2006
Instructor: Dr. S. M. Fakhraie
Provided by: Nayere Ghobadi
Advanced VLSI Class Presentation
Fall 2006

Outline

Multi-core processors
Cache
Xeon processors
Dual-Core Multi-Threaded Xeon Processor
Features
16MB L3 cache
Clock generation and distribution
Voltage supplies
Processor package
Front-side bus (FSB)
Protection
Temperature sensing
Summary and conclusion

Multi-core Processors

A multi-core processor combines two or more independent processors into a single package, often a single IC.
Multi-core processors exhibit some form of thread-level parallelism (TLP).
Diagram of an Intel Core 2 dual-core processor (from [6])

Multi-core Processors Cont.

Advantages:
1. Signals don't have to travel off-chip, so cache coherency circuitry can operate at a much higher clock rate.
2. Multi-core designs require much less space than multi-chip designs.
3. They use slightly less power than two coupled single-core processors.

Multi-core Processors Cont.

Disadvantages:
1. In addition to OS support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors.
2. Multi-core designs drive production yields down and are more difficult to manage thermally.

Cache

A cache is a temporary storage area where frequently accessed data can be stored for rapid access.
If the processor finds the desired memory location in the cache, this is known as a cache hit; otherwise it is a cache miss. The proportion of accesses that result in a cache hit is known as the hit rate.
Diagram of a CPU memory cache (from [6])
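
The hit, miss and hit-rate definitions above can be illustrated with a minimal sketch. It assumes a tiny direct-mapped cache. The function name `hit_rate`, the cache geometry and the address trace are all hypothetical and are not taken from the Xeon design.

```python
def hit_rate(addresses, num_lines=4, line_size=16):
    """Simulate a tiny direct-mapped cache and return hits / accesses."""
    tags = [None] * num_lines          # one stored tag per cache line
    hits = 0
    for addr in addresses:
        block = addr // line_size      # memory block that holds this byte
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1                  # cache hit
        else:
            tags[index] = tag          # cache miss: fill the line
    return hits / len(addresses)


# Hypothetical byte-address trace: 4 of the 8 accesses hit.
trace = [0, 4, 8, 64, 0, 4, 128, 132]
print(hit_rate(trace))  # 0.5
```

Accesses to 64 and 128 map to the same line as address 0 and evict it, which is why the second access to address 0 misses.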
Cache Cont.

Multi-level caches:
There is a tradeoff between cache latency and hit rate.
Larger caches have better hit rates but longer latency.
So many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches.
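
The latency versus hit-rate tradeoff can be made concrete with the standard average-memory-access-time formula. The sketch below is illustrative only: the latencies and hit rates are assumed values, not measurements of this processor, and `average_access_time` is a hypothetical helper.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles.

    levels: list of (hit_latency_cycles, local_hit_rate), fastest level first.
    memory_latency: cycles to reach main memory after the last level misses.
    """
    # Work backwards from main memory: the miss penalty of each level is the
    # average access time of everything behind it.
    time_behind = memory_latency
    for latency, hit_rate in reversed(levels):
        time_behind = latency + (1.0 - hit_rate) * time_behind
    return time_behind


# Assumed numbers: one large, slow cache versus a three-level hierarchy.
single = average_access_time([(40, 0.98)], memory_latency=300)
multi = average_access_time([(3, 0.90), (12, 0.80), (40, 0.90)],
                            memory_latency=300)
print(round(single, 2), round(multi, 2))  # 46.0 5.6
```

With these assumed figures the hierarchy is much faster on average, because most accesses are served by the small first level and only the rare misses pay the latency of the large cache.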
Xeon Processors

Xeon is Intel's brand name for its server-class PC microprocessors intended for multiple-processor machines.
Xeon processors generally have more cache and support larger multiprocessor configurations than their desktop counterparts.
Xeon processor and logo (from [6])

Dual-Core Multi-Threaded Xeon Processor
Features

Two 64b cores. Each core has:
1. Two threads
2. A unified 1MB L2 cache
16MB unified L3 cache
A simple direct interface between core and front-side bus (FSB) for minimizing:
1. L3 cache latency.
2. External bus latency.
Block diagram (from [1])

Dual-Core Multi-Threaded Xeon Processor
Features Cont.

Caching FSB controller for handling:
1. Core arbitration.
2. L3 cache accesses.
3. External bus requests.
The processor die is 435mm² with 1.328B transistors.
It operates at more than 3.0GHz from a 1.25V core supply.
Die micrograph (from [1])

Dual-Core Multi-Threaded Xeon Processor
Features Cont.

The worst-case power dissipation is 165W (power dissipation on a typical server workload is 110W).
65nm process technology.
Eight copper interconnect layers.
Low-k carbon-doped oxide (k=2.9) inter-level dielectric.
65nm process technology summary (from [1])

16MB L3 Cache

6T SRAM Cell

Read:
Precharge both bitlines high
Raise wordline
One of the two bitlines will be pulled down by the cell
Write:
Drive one bitline high, the other low
Raise wordline
Bitlines overpower cell with new value
6T memory cell (from [5]); signal labels: word, bit, bit_b
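
The read and write sequences above can be mirrored in a purely behavioral sketch. It models only the logical outcome of each step, not transistor sizing, timing or read stability. The class name `SramCell6T` and its methods are hypothetical.

```python
class SramCell6T:
    """Behavioral stand-in for a 6T cell: stores q; the other node is 1 - q."""

    def __init__(self, value=0):
        self.q = value

    def read(self):
        """Return the (bit, bit_b) bitline values after a read."""
        bit, bit_b = 1, 1              # step 1: precharge both bitlines high
        # step 2: wordline raised; the cell node holding 0 discharges the
        # bitline attached to it, the other bitline stays high
        if self.q == 0:
            bit = 0
        else:
            bit_b = 0
        return bit, bit_b

    def write(self, value):
        """Overwrite the stored bit through the bitlines."""
        bit, bit_b = value, 1 - value  # step 1: drive one bitline high, one low
        # step 2: wordline raised; the driven bitlines overpower the cell
        self.q = bit


cell = SramCell6T()
cell.write(1)
print(cell.read())  # (1, 0)
```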
16MB L3 Cache Cont.

256 data sub-arrays (64kB each). Each data sub-array stores 32 bits.
32 redundancy sub-arrays (68kB each). Each redundancy sub-array stores 34 bits.
The cache is composed of 6T memory cells with a size of 0.624µm².
The physical address is 40b wide.
To reduce active power, only 0.8% of all array blocks are powered up for each cache access.
L3 cache block (from [1])
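
As a quick consistency check on these figures, the data sub-arrays should add up to the advertised 16MB. The sketch below assumes that "kB" means 1024 bytes and that the 40b physical address is byte-granular. Both are assumptions, and the helper names are hypothetical.

```python
KB = 1024  # assumption: the slide's "kB" means 1024 bytes


def l3_data_capacity_bytes(subarrays=256, subarray_kb=64):
    """Total capacity of the data sub-arrays (redundancy arrays excluded)."""
    return subarrays * subarray_kb * KB


def addressable_bytes(address_bits=40):
    """Memory reachable with a byte-granular physical address."""
    return 2 ** address_bits


print(l3_data_capacity_bytes() // (KB * KB), "MB of L3 data")  # 16 MB of L3 data
print(addressable_bytes() // KB ** 4, "TB addressable")        # 1 TB addressable
```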
16MB L3 Cache Cont.

Sleep circuit

Active mode:
Virtual Vss = Vss.
Full voltage swing.
Sleep mode:
Virtual Vss = 250mV.
Reduces the leakage by 2X.
Shut-off mode:
NMOS shut-off device is turned off.
Virtual Vss = Vcc/2.
Reduces the leakage by 4X.
L3 cache sleep circuit and shut-off mode (from [1])
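
To see what the 2X and 4X factors could mean for the whole array, the sketch below computes leakage relative to an all-active cache. It assumes both factors are measured against active-mode leakage and that leakage scales linearly with the fraction of sub-arrays in each mode. The mode mix in the example is hypothetical.

```python
# Leakage of one sub-array relative to active mode (2X and 4X reductions).
LEAKAGE_FACTOR = {"active": 1.0, "sleep": 1 / 2, "shut_off": 1 / 4}


def relative_leakage(mode_fractions):
    """Array leakage relative to an all-active array.

    mode_fractions: dict mapping mode name to the fraction of sub-arrays
    in that mode; the fractions must sum to 1.
    """
    total = sum(mode_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("mode fractions must sum to 1")
    return sum(LEAKAGE_FACTOR[mode] * frac
               for mode, frac in mode_fractions.items())


# Hypothetical mix: 0.8% of blocks active, the rest asleep.
print(round(relative_leakage({"active": 0.008, "sleep": 0.992}), 3))  # 0.504
```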
Clock Generation and Distribution

The critical clocking features of this processor are:
1. Multiple clock domains with different frequencies.
2. Dedicated core and uncore voltage domains.
Separate PLLs and clock distribution trees for each core and the associated L2 cache.
A third PLL for the uncore half-frequency clock.
De-skew circuits controlled by on-die fuses reduce the uncore clock skew to less than 11ps.

Clock Generation and Distribution Cont.

System clock (BCLK) = 200MHz.
Core clock (MCLK) = BCLK×N. MCLK can be more than 3.0GHz at a 1.25V core supply voltage (Vcore).
Uncore clock (SCLK) = MCLK/2, using a separate uncore voltage supply (Vcache).
FSB clock (ZCLK) = BCLK×4 (quad pumping).
Clock distribution map (from [2])
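
The clock relationships listed here reduce to simple arithmetic, sketched below. The core ratio N is not given on the slide, so `core_ratio=17` is an assumed value chosen only because it yields a core clock above 3.0GHz. The function name is hypothetical.

```python
def clock_frequencies_mhz(bclk_mhz=200, core_ratio=17):
    """Derive the clock domains from the system clock.

    core_ratio is the multiplier N in MCLK = BCLK x N (assumed value).
    """
    mclk = bclk_mhz * core_ratio   # core clock
    sclk = mclk / 2                # uncore clock runs at half the core clock
    zclk = bclk_mhz * 4            # quad-pumped FSB clock
    return {"BCLK": bclk_mhz, "MCLK": mclk, "SCLK": sclk, "ZCLK": zclk}


print(clock_frequencies_mhz())
# {'BCLK': 200, 'MCLK': 3400, 'SCLK': 1700.0, 'ZCLK': 800}
```

The 800MHz ZCLK value matches the 800MT/s FSB transfer rate quoted elsewhere in the presentation.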
Voltage Supplies

Three voltage supplies are used for:
1. The two cores.
2. The L3 cache together with the associated control logic.
3. The FSB I/O circuits.
Level shifters are used between voltage domains.
A custom tool checks for the presence and correct connectivity of level shifters on all signals that cross voltage domain boundaries.
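
The presentation does not describe how the custom checking tool works. The sketch below only illustrates the kind of rule such a tool might enforce, on a made-up signal list. The `Signal` record, the domain names and `check_level_shifters` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Signal:
    name: str
    driver_domain: str                 # voltage domain of the driving gate
    receiver_domain: str               # voltage domain of the receiving gate
    # (from_domain, to_domain) of the level shifter on this net, if any
    level_shifter: Optional[Tuple[str, str]] = None


def check_level_shifters(signals):
    """Return a list of error strings for domain-crossing signals."""
    errors = []
    for s in signals:
        if s.driver_domain == s.receiver_domain:
            continue                   # no crossing, nothing to check
        expected = (s.driver_domain, s.receiver_domain)
        if s.level_shifter is None:
            errors.append(f"{s.name}: missing level shifter")
        elif s.level_shifter != expected:
            errors.append(f"{s.name}: level shifter connects wrong domains")
    return errors


signals = [
    Signal("req", "Vcore", "Vcache", ("Vcore", "Vcache")),  # correct
    Signal("ack", "Vcache", "Vcore"),                       # shifter missing
    Signal("dat", "Vcache", "Vfsb", ("Vcore", "Vfsb")),     # wrong source
    Signal("loc", "Vcore", "Vcore"),                        # same domain
]
for error in check_level_shifters(signals):
    print(error)
```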
Voltage Supplies Cont.
Voltage domains and power breakdown (from [1])

Processor Package

The processor uses flip-chip, or Controlled Collapse Chip Connection (C4), packaging.
The processor die has 13164 C4 solder bumps.
The die is attached to a 12-layer (4-4-4) organic package with an integrated heat spreader.
The package has 604 pins. 238 pins are signal pins and the rest are power and ground.
The chip-level power distribution consists of a uniform M8-M7 grid synchronized with the C4 power and ground bump array.

Front-Side Bus (FSB)

Operates at 800MT/s.
A symmetric pre-driver design controls the edge rate to meet timing and signal integrity requirements:
1. The FSB output swing (VOL to VOH) is divided into six voltage levels.
2. Each level is driven by an output driver segment with a different RON value.
3. When a segment is enabled, it forms a parallel resistance to the previously enabled segments.
4. A new voltage level is generated, thus creating a staircase-like waveform in every transition.
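
The staircase effect of enabling parallel driver segments can be sketched with a simple resistive-divider model. The model assumes pull-down segments working against a termination resistor tied to a termination voltage. That topology, the resistor values and the name `staircase_levels` are all assumptions rather than details from the paper.

```python
def staircase_levels(r_on_segments, r_term=50.0, v_term=1.2):
    """Output voltage after each successive pull-down segment is enabled.

    r_on_segments: on-resistance (ohms) of each segment, in enable order.
    r_term, v_term: assumed termination resistor and termination voltage.
    """
    levels = []
    conductance = 0.0
    for r_on in r_on_segments:
        conductance += 1.0 / r_on      # new segment in parallel with the rest
        r_eff = 1.0 / conductance      # combined on-resistance so far
        levels.append(v_term * r_eff / (r_eff + r_term))
    return levels


# Six hypothetical segments with different RON values, one per voltage level.
for step, volts in enumerate(staircase_levels([400, 300, 200, 150, 100, 60]), 1):
    print(f"segment {step} enabled: {volts:.3f} V")
```

Each newly enabled segment lowers the combined resistance, so the output steps down monotonically instead of switching in a single sharp edge.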
Front-Side Bus (FSB) Cont.
Symmetric I/O pre-driver circuit (from [1])

Protection

Bit interleaving is used for adjacent cache lines to prevent a single upset event from causing multiple bit errors in the same cache line.
The L3 data and tag arrays and the L2 data array have error-correction code (ECC) protection.
The L2 tag has parity checking.
A dynamic 32-entry cache line disable mechanism protects the L3 cache from erratic bits and infant mortality failures.
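
The benefit of bit interleaving can be shown with a small sketch. It assumes that physically adjacent columns in a row belong to different cache lines in round-robin order. The interleave factor of 4 and the function names are hypothetical, since the slide does not give the actual layout.

```python
def physical_to_logical(column, interleave=4):
    """Map a physical column to (cache_line, bit_position) in an interleaved row."""
    return column % interleave, column // interleave


def bits_hit_per_line(first_column, width, interleave=4):
    """Count corrupted bits per cache line when an upset spans adjacent columns."""
    counts = {}
    for column in range(first_column, first_column + width):
        line, _bit = physical_to_logical(column, interleave)
        counts[line] = counts.get(line, 0) + 1
    return counts


# An upset that flips 3 adjacent cells starting at physical column 10.
print(bits_hit_per_line(10, 3))                # {2: 1, 3: 1, 0: 1}
print(bits_hit_per_line(10, 3, interleave=1))  # {0: 3}
```

With interleaving each affected line sees a single-bit error, which a single-error-correcting code can repair. Without it, the same event produces a three-bit error in one line.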
Temperature Sensing

Three diodes for temperature sensing:
One in each core, routed to an on-package temperature-monitor chip that provides temperature data to the system for fan speed control.
One between the two cores, routed to pins for system use.
A temperature sensor near the hot spot in each core provides a digital temperature readout that is used in conjunction with operating-system power-state requests to make informed throttle and boost decisions.

Summary and Conclusion

Dual-core multi-threaded Xeon processor in 65nm process technology.
The processor is flip-chip (C4).
Has two 64b cores. Each core has two threads and a unified 1MB L2 cache.
Has a 16MB unified L3 cache.
Operates at more than 3.0GHz from a 1.25V core supply.
Three voltage supplies.
The processor FSB operates at 800MT/s.

References

[1] S. Rusu, S. Tam, “A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache”, IEEE ISSCC Tech. Digest, p. 118, 2006.
[2] S. Tam, J. Leung, “Clock Generation and Distribution of a Dual-Core Xeon® Processor with 16MB L3 Cache”, IEEE ISSCC Tech. Digest, p. 382, 2006.
[3] “Dual-Core Intel® Xeon® Processor 7100 Series Datasheet”, Intel Corporation, September 2006.
[4] S. Tam, et al., “Clock Generation and Distribution for the Third Generation Itanium® Processor”, Symp. VLSI Circuits, pp. 9-12, Jun. 2003.
[5] N. H. E. Weste, D. Harris, “CMOS VLSI Design”, Pearson Education Inc., 2005.
[6] Wikipedia, The Free Encyclopedia. Available: http://en.wikipedia.org/