Transcript SMT
The Alpha 21364 and 21464 Microprocessors:
Continuing the Performance Lead Beyond Y2K
Shubu Mukherjee, Ph.D.
Principal Hardware Engineer
VSSAD Labs, Alpha Development Group
Compaq Computer Corporation
Shrewsbury, Massachusetts
Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)
Better answers
Alpha Microprocessor Roadmap
[Figure: roadmap chart. Vertical axis: Higher Performance. Horizontal axis: First System Ship, 1998 to 2003]
21264: EV6 (0.35µm), EV67 (0.28µm), EV68 (0.18µm)
21364: EV7 (0.18µm), EV78 (0.125µm)
21464: EV8 (0.125µm)
Alpha 21264 Microprocessor
Architectural Features
First “Out-of-Order” Alpha
Four-wide superscalar
…
Performance
World’s Fastest Microprocessor (www.spec.org, 11/17/99)
39 SPECint95, 68 SPECfp95 @ 700 MHz
Intel Pentium III @ 733 MHz delivers 36 SPECint95, 30 SPECfp95
Alpha 21364 Goals
Leadership single stream performance
Higher operating frequency
Integrated memory interface
Leadership multiprocessor performance
Integrated system / multiprocessor interface
Alpha 21364 Features
System-on-a-Chip
Alpha 21264 core with enhancements
Integrated L2 Cache
Integrated memory controller
Integrated network interface
Fault-Tolerance
Support for lock-step operation to enable high-availability systems.
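One way to picture lock-step operation: two replicas execute the same inputs, and their results are compared after every step, so a mismatch flags a fault before it propagates. The sketch below is a software illustration under that assumption only; it does not describe how the 21364 hardware implements lock-step. The names `lockstep_run`, `accumulate`, and `faulty_accumulate` are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

def lockstep_run(
    step_a: Callable[[int, int], int],
    step_b: Callable[[int, int], int],
    inputs: Iterable[int],
    state: int = 0,
) -> Tuple[int, Optional[int]]:
    """Run two replicas on the same inputs and compare after each step.

    Returns (last state both replicas agreed on, index of the first
    mismatching step or None if the replicas never diverged).
    """
    for i, x in enumerate(inputs):
        out_a = step_a(state, x)
        out_b = step_b(state, x)
        if out_a != out_b:
            return state, i          # fault detected: stop before committing
        state = out_a                # both agree: commit the new state
    return state, None

def accumulate(state: int, x: int) -> int:
    """Toy 'processor': adds each input to a running total."""
    return state + x

def faulty_accumulate(state: int, x: int) -> int:
    """Same computation, but flips the low result bit when x == 3
    to model a transient fault (purely illustrative)."""
    result = state + x
    return result ^ 1 if x == 3 else result
```

Calling `lockstep_run(accumulate, faulty_accumulate, [1, 2, 3, 4])` stops at the step where the two replicas disagree and reports the last agreed state.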
21364 Chip Block Diagram
[Figure: 21364 chip block diagram]
21264 core with 64K Icache and 64K Dcache
16 L1 miss buffers, 16 L1 victim buffers
L2 cache with 16 L2 victim buffers
Memory controller: RAMBUS, address in / address out
Network interface: N, S, E, W, and I/O ports
21364 Core
[Figure: 21364 core pipeline, stages 0 to 6 (FETCH, MAP, QUEUE, REG, EXEC, DCACHE), 4 instructions / cycle]
Fetch: branch predictors, next-line address, L1 instruction cache (64KB, 2-set)
Integer: register map, issue queue (20), two register files (80), each with Exec and Addr units
Floating point: register map, issue queue (15), register file (72), FP ADD, Div/Sqrt, FP MUL
Memory: L1 data cache (64KB, 2-set), L2 cache (1.5MB, 6-set), victim buffer, miss address
80 in-flight instructions plus 32 loads and 32 stores
Integrated L2 Cache
1.5 MB
6-way set associative
16 GB/s total read/write bandwidth
16 Victim buffers for L1 -> L2
16 Victim buffers for L2 -> Memory
ECC SECDED code
12ns load to use latency
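The figures above can be turned into a few derived quantities. This is a back-of-the-envelope sketch, not data from the slides: the 64-byte line size is an assumption (the slide does not state it), the clock is taken as the lower bound of the "1000+ MHz" target, and GB is read as 10^9 bytes. The function names are hypothetical.

```python
L2_CAPACITY_BYTES = 3 * 512 * 1024   # 1.5 MB, from the slide
L2_WAYS = 6                          # 6-way set associative, from the slide
LINE_BYTES = 64                      # ASSUMPTION: line size is not on the slide
CLOCK_MHZ = 1000                     # ASSUMPTION: lower bound of "1000+ MHz"

def cache_sets(capacity_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets = capacity / (ways * line size)."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets

def ns_to_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles."""
    return latency_ns * clock_mhz / 1000.0

def bytes_per_cycle(gb_per_s: float, clock_mhz: float) -> float:
    """Convert a bandwidth in GB/s (10^9 bytes) to bytes per clock cycle."""
    return gb_per_s * 1e9 / (clock_mhz * 1e6)

if __name__ == "__main__":
    print("sets:", cache_sets(L2_CAPACITY_BYTES, L2_WAYS, LINE_BYTES))
    print("load-to-use cycles:", ns_to_cycles(12, CLOCK_MHZ))
    print("bytes per cycle:", bytes_per_cycle(16, CLOCK_MHZ))
```

Under these assumptions, the 12 ns load-to-use latency corresponds to 12 cycles at 1000 MHz, and a faster clock would raise the cycle count proportionally.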
Integrated Memory Controller
Direct RAMbus
High data capacity per pin
800 MHz operation
30ns CAS latency pin to pin
6 GB/sec read or write bandwidth
100s of open pages
Directory based cache coherence
ECC SECDED
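SECDED stands for single error correct, double error detect. The slides do not give the code width used in the 21364, so the sketch below uses a deliberately tiny example: an extended Hamming code over 4 data bits (8-bit codeword). It illustrates the principle only; the functions `encode` and `decode` are hypothetical names, and real memory ECC protects much wider words.

```python
from typing import List, Optional, Tuple

def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming(7,4) positions, with check bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code

def decode(received: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), where status is 'ok', 'corrected', or 'double'."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    overall = sum(code) % 2
    if overall == 1:
        # Odd total parity: exactly one bit flipped. The syndrome is its
        # position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even total parity but nonzero syndrome: two bits flipped.
        return "double", None
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

The overall parity bit is what separates the two cases: a single flipped bit makes the total parity odd and can be repaired, while two flipped bits leave the parity even with a nonzero syndrome and can only be reported.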
Integrated Network Interface
Direct processor-to-processor interconnect
10 GB/second per processor
15ns processor-to-processor latency
Out-of-order network with adaptive routing
Asynchronous clocking between processors
3 GB/second I/O interface per processor
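Adaptive routing means a packet may take any outgoing link that brings it closer to its destination, choosing among them based on congestion; because packets can then follow different paths, they may arrive out of order. The sketch below assumes the processors form a two-dimensional torus reached through the N, S, E, and W ports. The torus topology, the coordinate convention (E = +x, S = +y), and all function names are assumptions for illustration, not details taken from the slides.

```python
from typing import FrozenSet, List, Tuple

Node = Tuple[int, int]
STEP = {"E": (1, 0), "W": (-1, 0), "S": (0, 1), "N": (0, -1)}

def productive_directions(src: Node, dst: Node, width: int, height: int) -> List[str]:
    """Directions that shorten the distance to dst on a width x height torus."""
    dirs = []
    east = (dst[0] - src[0]) % width      # hops needed going east
    west = (src[0] - dst[0]) % width      # hops needed going west
    if east != 0:
        if east <= west:
            dirs.append("E")
        if west <= east:
            dirs.append("W")
    south = (dst[1] - src[1]) % height
    north = (src[1] - dst[1]) % height
    if south != 0:
        if south <= north:
            dirs.append("S")
        if north <= south:
            dirs.append("N")
    return dirs

def hop_count(src: Node, dst: Node, width: int, height: int) -> int:
    """Minimal number of hops between two nodes, using wraparound links."""
    east = (dst[0] - src[0]) % width
    south = (dst[1] - src[1]) % height
    return min(east, width - east) + min(south, height - south)

def adaptive_route(src: Node, dst: Node, width: int, height: int,
                   congested: FrozenSet[Tuple[Node, str]] = frozenset()) -> List[Node]:
    """Follow a minimal path, preferring productive links not marked congested."""
    path = [src]
    node = src
    while node != dst:
        options = productive_directions(node, dst, width, height)
        free = [d for d in options if (node, d) not in congested]
        d = (free or options)[0]          # fall back if every option is congested
        node = ((node[0] + STEP[d][0]) % width, (node[1] + STEP[d][1]) % height)
        path.append(node)
    return path
```

Every productive direction reduces the remaining hop count by one, so the route always terminates and stays minimal regardless of which link congestion pushes it onto.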
21364 System Block Diagram
[Figure: multiple 21364 processors (labeled 364), each with its own memory (M) and I/O (IO), connected to one another]
Alpha 21364 Technology
0.18 µm CMOS
1000+ MHz
100 Watts @ 1.5 volts
3.5 cm²
6 Layer Metal
100 million transistors
8 million logic
92 million RAM
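A few quantities follow directly from these numbers. The sketch below is simple arithmetic on the slide's figures; the six-transistors-per-bit SRAM cell is an assumption (the slide does not describe the cell), and the function names are hypothetical.

```python
def supply_current_amps(power_watts: float, vdd_volts: float) -> float:
    """Average supply current from P = V * I."""
    return power_watts / vdd_volts

def power_density(power_watts: float, area_cm2: float) -> float:
    """Power per unit die area, in W/cm^2."""
    return power_watts / area_cm2

def sram_transistors(capacity_bytes: int, transistors_per_bit: int = 6) -> int:
    """Transistors in an SRAM data array.

    ASSUMPTION: a 6-transistor cell; tags, ECC bits, and peripheral
    circuits are not counted.
    """
    return capacity_bytes * 8 * transistors_per_bit

if __name__ == "__main__":
    print("current (A):", round(supply_current_amps(100, 1.5), 1))
    print("density (W/cm^2):", round(power_density(100, 3.5), 2))
    print("L2 data array transistors:", sram_transistors(3 * 512 * 1024))
```

Under the 6-transistor assumption, the 1.5 MB L2 data array alone would account for roughly 75 million of the 92 million RAM transistors, which is consistent with the cache dominating the transistor budget.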
Alpha 21364 Status
70 SPECint95 (estimated)
120 SPECfp95 (estimated)
RTL model running
Tapeout: Summer 2000
21364 Summary: System on a Chip
Integrated L2 cache and memory controller
outstanding single processor performance
Integrated network interface
high performance multi-processor systems
scales to large number of processors
Alpha 21464 Goals
Leadership single stream performance
Higher operating frequency / better technology
New microarchitecture
Integrated memory interface (like 21364)
Leadership multiprocessor performance
Simultaneous Multithreading (with minimal change/cost)
Integrated system / multiprocessor interface (like 21364)
Alpha 21464 Technology Overview
Leading edge process technology – 1.2-2.0GHz
0.125µm CMOS
SOI-compatible
Cu interconnect
low-k dielectrics
Chip characteristics
~1.2V Vdd
~250 Million transistors
Alpha 21464 Architecture Overview
Enhanced out-of-order execution
8-wide superscalar
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based, ccNUMA for up to 512-way multiprocessing
4-way simultaneous multithreading (SMT)
Instruction Issue
[Figure: function unit usage over time]
Reduced function unit utilization due to dependencies
Superscalar Issue
[Figure: function unit usage over time]
Superscalar leads to more performance, but lower utilization
Predicated Issue
[Figure: function unit usage over time]
Adds to function unit utilization, but results are thrown away
Chip Multiprocessor
[Figure: function unit usage over time]
Limited utilization when only running one thread
Fine Grained Multithreading
[Figure: function unit usage over time]
Intra-thread dependencies still limit performance
Simultaneous Multithreading
[Figure: function unit usage over time]
Maximum utilization of function units by independent operations
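The contrast between fine-grained multithreading and simultaneous multithreading can be made concrete with a toy issue-slot model. This is a sketch under simplifying assumptions, not a model of the 21464: each thread is a list of instruction groups, instructions within a group are independent, each group depends on the one before it, and all function units are interchangeable. The function `simulate` and its `policy` values are hypothetical.

```python
from typing import List

def simulate(threads: List[List[int]], width: int, policy: str) -> float:
    """Return function unit utilization: issued slots / available slots.

    threads: one list of group sizes per thread. A thread may issue only
             from its current group in a given cycle (dependency limit).
    width:   issue slots (function units) available per cycle.
    policy:  "fine" lets a single thread issue each cycle (round robin);
             "smt" lets all threads share the slots of the same cycle.
    """
    if policy not in ("fine", "smt"):
        raise ValueError("policy must be 'fine' or 'smt'")
    remaining = [list(t) for t in threads]       # private copy of the work
    total = sum(sum(t) for t in threads)
    cycles = 0
    turn = 0
    while any(remaining):
        cycles += 1
        slots = width
        active = [i for i, groups in enumerate(remaining) if groups]
        start = turn % len(active)               # rotate priority each cycle
        order = active[start:] + active[:start]
        if policy == "fine":
            order = order[:1]                    # only one thread may issue
        for i in order:
            take = min(slots, remaining[i][0])
            remaining[i][0] -= take
            slots -= take
            if remaining[i][0] == 0:
                remaining[i].pop(0)              # next group waits a cycle
            if slots == 0:
                break
        turn += 1
    return total / (width * cycles)

if __name__ == "__main__":
    work = [2, 1, 1, 2]                          # low-ILP thread (illustrative)
    print("1 thread :", simulate([work], 4, "smt"))
    print("4T fine  :", simulate([work] * 4, 4, "fine"))
    print("4T SMT   :", simulate([work] * 4, 4, "smt"))
```

With one thread, the model behaves like a plain superscalar machine. Fine-grained multithreading hides nothing within a cycle, so its utilization stays bounded by each thread's own parallelism, while the SMT policy fills leftover slots with instructions from other threads.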
Basic Out-of-order Pipeline
[Figure: pipeline diagram, thread-blind]
Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire
Structures: PC, Icache, Register Map, Regs, Dcache, Regs
SMT Pipeline
[Figure: pipeline diagram]
Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire
Structures: PC, Icache, Register Map, Regs, Dcache, Regs
Changes for SMT
Basic pipeline – unchanged
Replicated resources
Program counters
Register maps
Shared resources
Register file (size increased)
Instruction queue
First and second level caches
Translation buffers
Branch predictor
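The split between replicated and shared resources can be sketched as a data structure: each thread gets its own program counter and register map, while every thread allocates from one physical register file. This is a minimal illustration assuming a simple free-list renamer; the class and method names are hypothetical, the register count is arbitrary, and registers are never freed in this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ThreadContext:
    """Replicated per thread: a program counter and a register map."""
    pc: int = 0
    # architectural register name -> physical register number
    register_map: Dict[str, int] = field(default_factory=dict)

class SMTCore:
    """Shared by all threads: one pool of physical registers."""

    def __init__(self, num_threads: int = 4, num_physical_regs: int = 16):
        self.contexts: List[ThreadContext] = [
            ThreadContext() for _ in range(num_threads)
        ]
        self.free_list: List[int] = list(range(num_physical_regs))

    def rename_dest(self, thread: int, arch_reg: str) -> int:
        """Map a thread's destination register to a free physical register.

        After this step the physical register number alone identifies the
        value, so later pipeline stages need not know which thread owns it.
        """
        if not self.free_list:
            raise RuntimeError("no free physical registers")
        phys = self.free_list.pop(0)
        self.contexts[thread].register_map[arch_reg] = phys
        return phys

if __name__ == "__main__":
    core = SMTCore()
    a = core.rename_dest(0, "r1")    # thread 0 writes r1
    b = core.rename_dest(1, "r1")    # thread 1 writes its own r1
    print(a, b, len(core.free_list))
```

Two threads writing the same architectural register receive different physical registers, which suggests why the shared register file has to grow with the number of threads.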
Multiprogrammed workload
[Chart: 1T, 2T, 3T, and 4T on a 0% to 250% scale for SpecInt, SpecFP, and Mixed Int/FP; values not recoverable]
Decomposed SPEC95 Applications
[Chart: 1T, 2T, 3T, and 4T on a 0% to 250% scale for Turb3d, Swm256, and Tomcatv; values not recoverable]
Multithreaded Applications
[Chart: 1T, 2T, and 4T on a 0% to 300% scale for Barnes, Chess, Sort, and TP; values not recoverable]
Architectural Abstraction
1 Processor with 4 Thread Processing Units (TPUs)
Shared hardware resources
[Figure: TPU0, TPU1, TPU2, and TPU3 sharing the Icache, TLB, Dcache, and Scache]
21464 System Block Diagram
[Figure: multiple EV8 processors, each with its own memory (M) and I/O (IO), connected to one another; the label "0123" also appears in the figure]
Alpha 21464 Summary
Leadership single stream performance
Higher operating frequency / better technology
New microarchitecture
Integrated memory interface (like 21364)
Leadership multiprocessor performance
Simultaneous Multithreading (with minimal changes/cost)
Integrated system / multiprocessor interface (like 21364)
Maintain Performance Lead Beyond Y2K
Alpha 21364
Reuses 21264 microprocessor core
System on a chip
Alpha 21464
New microarchitecture
System on a chip
Simultaneous Multithreading
My Current Research: Beyond 21464?
The Truth Project (w/ Joel Emer)
The Multinet Project (w/ Rick Kessler)
Tightly-coupled multiprocessor networks
The Reliant Project (w/ Steve Reinhardt)
Examines different microarchitectural issues
Self-Checking Microprocessors using SMT, ISCA submission
Asim (w/ VSSAD Labs)
Performance Model for Alphas beyond 21464