UltraSPARC III
Hari P. Ananthanarayanan
Anand S. Rajan
University of Connecticut

Presentation Outline
Background
Introduction to the UltraSPARC
Instruction Issue Unit
Integer Execute Unit
Floating Point Unit
Memory Subsystem

Introduction
Third generation of Sun Microsystems’ 64-bit SPARC V9 architecture
Design targets:
600 MHz clock frequency
70-watt power dissipation @ 1.8 V
0.25-micron process with 6 metal layers
Transistor count: 12 million (RAM), 4 million (logic)
Die size of 360 mm²

A Tour of the UltraSPARC
14-stage pipeline (the full stage sequence is sketched below)
Instruction Issue Unit occupies stages A through J
Integer Execution Unit – stages R through D
Data Cache Unit – stages E through W
Floating Point Unit – stages E through D
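
The unit-to-stage ranges above imply the full 14-stage sequence. A minimal listing, assuming the stage letters from Sun's published UltraSPARC III pipeline description; the one-line roles are paraphrased from this presentation and simplified.

    # The 14 stages implied by the unit/stage ranges above.
    PIPELINE = [
        ("A", "fetch address generation"),      # IIU: stages A through J
        ("P", "I-cache access, cycle 1 of 2"),  # wave pipelined
        ("F", "I-cache access, cycle 2 of 2"),
        ("B", "branch target computation"),
        ("I", "decode"),
        ("J", "dispatch"),
        ("R", "register read"),                 # IEU: stages R through D
        ("E", "execute"),                       # DCU and FPU begin here
        ("C", "D-cache access, cycle 1 of 2"),
        ("M", "D-cache access, cycle 2"),
        ("W", "write"),                         # DCU: stages E through W
        ("X", "extended FP/graphics execution"),
        ("T", "trap detection"),
        ("D", "done: commit to the architectural registers"),
    ]
    assert len(PIPELINE) == 14
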
Design Goals
Minimize latency on the integer execution path, which determines the cycle time – limit each stage to approximately 8 logic gate delays (at the 600 MHz target, one cycle is about 1.67 ns)
Minimize performance degradation due to clock overhead – e.g., the on-chip caches are wave pipelined
Minimize branch misprediction latency – use of a miss queue

Instruction Pipeline

Instruction Issue Unit
UltraSPARC III is a static-speculation machine: the compiler lays out the predicted path sequentially, which places fewer demands on the fetch unit
Stage A contains a small, 32-byte buffer that supports sequential prefetching into the instruction cache
I-cache access takes 2 cycles (stages P and F) and is wave pipelined

Instruction Issue Unit – Contd.
ITLB access and branch prediction are overlapped with the I-cache access
The target address is generated only in stage B; the fetch is redirected to stage A if the branch is taken
20-entry instruction queue and 4-entry miss queue; the latter stores the alternate execution path to mitigate the effects of misprediction
Stages I and J decode and dispatch instructions; scoreboarding checks for operand dependences (a toy scoreboard is sketched below)
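
A minimal scoreboard sketch in Python, assuming only the slide's claim that dispatch checks operand dependences; the structure and stall policy here are illustrative, not Sun's implementation.

    # Toy scoreboard: track which registers have a write in flight.
    class Scoreboard:
        def __init__(self, num_regs=32):
            self.busy = [False] * num_regs   # pending write per register

        def can_dispatch(self, srcs, dst):
            # Stall on a RAW hazard (a source still being produced) or a
            # WAW hazard (the destination already has a write in flight).
            return not any(self.busy[r] for r in srcs) and not self.busy[dst]

        def dispatch(self, dst):
            self.busy[dst] = True            # mark the destination at dispatch

        def writeback(self, dst):
            self.busy[dst] = False           # clear when the result is written

    sb = Scoreboard()
    sb.dispatch(dst=3)                       # e.g. an add producing r3 in flight
    print(sb.can_dispatch(srcs=[3], dst=4))  # False: r3 is not ready yet
    sb.writeback(dst=3)
    print(sb.can_dispatch(srcs=[3], dst=4))  # True
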
Branch Prediction Mechanism
Slightly modified gshare algorithm with 16K saturating 2-bit counters – the three low-order bits of the predictor index use PC information only (a sketch follows)
8-cycle misprediction penalty, since the pipeline stages must be drained
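
A minimal sketch of the modified gshare scheme: 16K two-bit counters require a 14-bit index, whose low 3 bits come from the PC alone and whose upper 11 bits mix the PC with global history. The exact PC bits chosen and the history length/update policy are assumptions for illustration.

    NUM_COUNTERS = 16 * 1024           # 16K two-bit counters -> 14-bit index
    counters = [1] * NUM_COUNTERS      # initialize weakly not-taken
    ghist = 0                          # global history register (11 bits here)

    def index(pc):
        pc_bits = (pc >> 2) & (NUM_COUNTERS - 1)    # word-aligned PC bits (assumed)
        low3 = pc_bits & 0x7                        # low 3 bits: PC only
        high11 = ((pc_bits >> 3) ^ ghist) & 0x7FF   # upper 11 bits: PC xor history
        return (high11 << 3) | low3

    def predict(pc):
        return counters[index(pc)] >= 2             # taken if the counter is 2 or 3

    def update(pc, taken):
        global ghist
        i = index(pc)
        counters[i] = min(counters[i] + 1, 3) if taken else max(counters[i] - 1, 0)
        ghist = ((ghist << 1) | int(taken)) & 0x7FF
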
Integer Execute Unit
Executes loads, stores, shift, arithmetic, logical and branch instructions
4 integer executions per cycle – 2 arithmetic/logical/shift, 1 load/store and 1 branch
The entire datapath uses dynamic precharge circuits – this is the E stage
Future-file technique to handle exceptions – there are working and architectural register files (WARF)

Integer Execute Unit – Contd.
Integer execution reads operands from the WRF in the R stage and writes results to it in the C stage
On an exception, the ARF is copied into the WRF
Results are committed to the ARF at the end of the pipe (a sketch of this commit/rollback scheme follows)
Integer multiply and divide are not pipelined and execute in the ASU; the strategy is to decouple less frequently executed instructions
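
A minimal sketch of the WARF commit/rollback scheme, using only what the slides state: results are written to the WRF early, committed to the ARF at the end of the pipe, and an exception restores the WRF from the ARF. Function names are illustrative.

    NUM_REGS = 32
    wrf = [0] * NUM_REGS      # working register file: speculative results
    arf = [0] * NUM_REGS      # architectural register file: committed state

    def write_result(reg, value):
        wrf[reg] = value      # written early in the pipe (C stage, per the slides)

    def commit(reg):
        arf[reg] = wrf[reg]   # end of pipe: the result becomes architectural

    def exception_rollback():
        wrf[:] = arf          # exception: copy the ARF into the WRF
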
Floating Point Unit
Executes floating-point and partitioned fixed-point (graphics) instructions
3 datapaths:
4-stage multiply
4-stage add/subtract/compare
Unpipelined divide/square root (a throughput sketch follows)
The FPU is pushed back by one stage to keep the integer unit compact (countering the effect of wire delays)
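
A small sketch contrasting the pipelined 4-stage datapaths with the unpipelined divide; the divide latency below is a placeholder, since the slides do not give one.

    MUL_STAGES = 4            # pipelined: a new multiply can start every cycle
    DIV_CYCLES = 20           # placeholder latency for the unpipelined divider

    def cycles_for_multiplies(n):
        return MUL_STAGES + (n - 1)   # fill the pipe once, then 1 result/cycle

    def cycles_for_divides(n):
        return DIV_CYCLES * n         # each divide occupies the entire unit

    print(cycles_for_multiplies(10))  # 13
    print(cycles_for_divides(10))     # 200
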
Data Cache Unit
Memory – L1 Data Cache
64 KB, 4-way set associative, 32-byte lines
2-cycle access time – wave pipelined
Sum-addressed memory (SAM) – combines the address addition with the word-line decode (see the sketch below)
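
A minimal sketch of the sum-addressed idea: rather than adding base + offset with a carry-propagate adder and then decoding the sum, each decoder row tests base + offset == k directly with carry-free logic. The 9-bit index follows from the stated geometry (64 KB / 4 ways / 32-byte lines = 512 sets); the match identity is the standard carry-free formulation.

    import random

    INDEX_BITS = 9                   # 64 KB / 4 ways / 32 B lines = 512 sets
    MASK = (1 << INDEX_BITS) - 1

    def row_matches(base, offset, k):
        # The row for set k fires iff base + offset == k (mod 2^9): the
        # half-adder sum bits must equal the carries the full sum would need.
        a, b = base & MASK, offset & MASK
        sum_bits = (a ^ b ^ k) & MASK
        carries = (((a & b) | ((a | b) & ~k)) << 1) & MASK
        return sum_bits == carries

    # Sanity check against an ordinary add:
    for _ in range(1000):
        a, b = random.getrandbits(INDEX_BITS), random.getrandbits(INDEX_BITS)
        k = (a + b) & MASK
        assert row_matches(a, b, k) and not row_matches(a, b, (k + 1) & MASK)
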
Memory – Prefetch Cache
2 KB, 2-way, 64-byte lines
Multi-ported SRAM
Supports streaming data (similar to stream buffers)
Detects striding loads – hardware prefetches are issued independently of software prefetches (a toy detector is sketched below)
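
A toy stride detector in Python, assuming only the slide's claim that striding loads are detected in hardware; the one-entry-per-PC table and the two-matching-strides confirmation policy are illustrative.

    table = {}   # pc -> (last_addr, last_stride)

    def on_load(pc, addr):
        if pc not in table:
            table[pc] = (addr, 0)
            return None                      # no history for this load yet
        last_addr, last_stride = table[pc]
        stride = addr - last_addr
        table[pc] = (addr, stride)
        if stride != 0 and stride == last_stride:
            return addr + stride             # confirmed stride: prefetch ahead
        return None                          # issue no prefetch

    # A load walking an array with stride 8 starts triggering prefetches:
    for a in (0x1000, 0x1008, 0x1010, 0x1018):
        print(on_load(pc=0x400, addr=a))     # None, None, 4120 (0x1018), 4128 (0x1020)
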
Memory – Write Cache
2 KB, 4-way, 64-byte lines
Reduces the bandwidth consumed by store traffic
Sole source of on-chip dirty data – makes on-chip cache consistency easy to handle
Write-validate scheme – loads multiplex between L2 bytes and write-cache bytes (see the byte-merge sketch below)
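
A minimal sketch of the write-validate byte merge on a load: the write cache keeps per-byte valid bits, and each byte of the loaded line comes from the write cache where valid there, else from the L2 copy. Names and layout are illustrative.

    def merge_line(l2_bytes, wc_bytes, wc_valid):
        # wc_valid[i] is True where the write cache holds the newest byte i.
        return bytes(wc_bytes[i] if wc_valid[i] else l2_bytes[i]
                     for i in range(len(l2_bytes)))

    l2 = bytes(range(8))                     # line as read from L2
    wc = bytes([0xFF] * 8)                   # bytes written by recent stores
    valid = [False, True, True, False, False, False, True, False]
    print(merge_line(l2, wc, valid).hex())   # 00ffff030405ff07
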
External Memory Interface
L2 cache – direct-mapped, unified data and instructions, 12-cycle access time
The cache controller provides programmable support for 4 MB or 8 MB
On-chip main memory controller
On-chip tags – allow an associative L2 cache design without a latency penalty (tag sizing sketched below)
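
Back-of-the-envelope sizing of the on-chip L2 tag array; the 64-byte line size is an assumption (the slides give only the 4 MB and 8 MB capacities), so treat the counts as illustrative. Keeping the tags on chip lets the hit/miss decision proceed without a round trip to the off-chip SRAM.

    LINE_BYTES = 64                          # assumed L2 line size
    for capacity_mb in (4, 8):
        entries = capacity_mb * 1024 * 1024 // LINE_BYTES
        print(f"{capacity_mb} MB L2 -> {entries} on-chip tag entries")
    # 4 MB -> 65536 entries; 8 MB -> 131072 entries
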
Layout of UltraSPARC III