Tarantula: A Vector Extension to the Alpha Architecture

Tarantula
A Vector
Extension to the
Alpha Architecture
Roger Espasa, Federico Ardanaz, Joel Emer,
Stephen Felix, Julio Gago, Roger Gramunt, Isaac
Hernandez, Toni Juan, Geoff Lowney, Matthew
Mattina, André Seznec
Universitat Politècnica de Catalunya, Barcelona,
Spain
Compaq Computer Corporation, Shrewsbury, MA
State of the World
• CMOS Technology progresses
– More transistors, more functional units, more control
overhead
• VLIW and Wide Superscalar
– More individually controlled units
– Amount of real estate for control logic grows nonlinearly
• Vector ISA
– Localization of parallelism, aggregation of control
– Regular structures, simple control
Tarantula
• EV8 core + tightly integrated Vector
Unit
– Out of Order execution, Register Renaming
– Integrated in VM and cache coherence
system
– SMT support
• Targeted at scientific computing
applications
• Requires compiler support and
recompilation
Vector ISA
• New Architectural State
– 32 vector registers (v0-v31)
• v31 wired to 0. Used for prefetch
– Vector length (vl), Vector stride (vs), Vector
Mask (vm)
• 45 New Instructions
– 5 Groups
• Vector-Vector, Vector-Scalar, Strided Memory
Access, Random Memory Access, Vector Control
Vector Mask
• Allows conditional
execution without
EV8 scalar registers
• VM can be renamed
A(i).ne.0.and.B(i).gt.2
vloadq A(i) --> v0
vloadq B(i) --> v1
vcmpne v0, #0 --> v6
vcmpgt v1, #2 --> v7
vand v6, v7 --> v8
setvm v8 --> vm
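The mask sequence above can be emulated in plain Python as a rough sketch. The list-based registers and the `masked_add` helper are illustrative assumptions. They model only the element-wise effect of the instructions, not Tarantula's encoding, register width, or timing.

```python
# Illustrative emulation of the mask sequence for A(i).ne.0 .and. B(i).gt.2.
# Vector registers are modelled as plain Python lists (an assumption).

def vcmpne(v, imm):
    """Element-wise 'not equal' against an immediate; yields 0/1 per element."""
    return [int(x != imm) for x in v]

def vcmpgt(v, imm):
    """Element-wise 'greater than' against an immediate; yields 0/1 per element."""
    return [int(x > imm) for x in v]

def vand(a, b):
    """Element-wise AND of two 0/1 vectors."""
    return [x & y for x, y in zip(a, b)]

def masked_add(dst, a, b, vm):
    """Hypothetical masked operation: only elements with vm[i] == 1 are written."""
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, vm)]

A = [0, 5, 3, 0, 7]
B = [9, 1, 4, 8, 3]

v6 = vcmpne(A, 0)   # A(i) != 0
v7 = vcmpgt(B, 2)   # B(i) > 2
vm = vand(v6, v7)   # combined condition, as installed by setvm

result = masked_add([0] * len(A), A, B, vm)
```

Only the elements where both conditions hold are updated, and the rest of the destination keeps its previous value.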
Tarantula Block Diagram
Vector Execution Unit
• 16 independent lanes
– No communication, except for gather/scatter
• Each lane has
– 2 functional units
– Slice of Register File and Mask
• Allows high bandwidth
– Address generator and private TLB
• 32 functional units appear as only 2 issue
ports
– Simple scheduling
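The lane arithmetic can be sketched as follows. The sketch assumes each functional unit processes one element per lane per cycle, which the slides do not state explicitly. The function names are illustrative.

```python
import math

NUM_LANES = 16      # from the slide
FUS_PER_LANE = 2    # from the slide

def occupancy_cycles(vl):
    """Cycles one vector instruction occupies its issue port, assuming each
    lane handles one element per cycle (an assumption)."""
    return math.ceil(vl / NUM_LANES)

def peak_ops_per_cycle():
    """Peak element operations per cycle with both functional units in every lane busy."""
    return NUM_LANES * FUS_PER_LANE

full = occupancy_cycles(128)   # full-length vector of 128 elements
short = occupancy_cycles(20)   # short vector still rounds up to whole cycles
peak = peak_ops_per_cycle()
```

Under this assumption a full 128-element instruction keeps an issue port busy for eight cycles, which is why two ports are enough to feed 32 functional units.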
Vector Unit – Core
Interface
• Vector Unit physically separate from
core
– Little modification to core
• Large bus prevented by routing space
– Core to VBox
• 3-instruction bus
• 2 data buses for scalars from the EV8 register file
• 3-instruction kill-signal bus for misspeculation
– VBox to Core
• 3-instruction completion bus
Power Consumption
Vector Memory System
• Bound to EV8 VM and Cache Coherence
architecture
• High Load/Store Bandwidth required
– Goal: one 64-bit datum per flop
– Memory bus too slow
– L1 cache too small for vector data
– Direct Connection to L2 Cache
• Non-Unit Stride central problem
– 20% of all accesses
– Don’t match cache lines
Non-Unit Strides
• EV8 4MByte L2 Cache in 128 banks
– 8 ways, 16 banks per way
– Read 8 ways, select correct one
• Non-unit stride accesses
– Read 16 independent cache lines
– Select one qword per line
• Requires
– Conflict free addresses
– Conflict free writes to 16 lanes
• One qword per lane per cycle
Conflict Free Addresses
• Possible for any 128 consecutive
elements
– For stride S = σ × 2^s (σ odd) with s ≤ 4
– Order stored in ROM table
• Elements accessed out of order
– Even for vector length < 128, address generation
takes the full eight cycles
• Slice
– Group of 16 conflict free addresses
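The stride condition can be checked with a small sketch that factors a stride into an odd part and a power of two. The function names are illustrative, and the unit of the stride (bytes or qwords) is left as the slides leave it.

```python
def decompose_stride(stride):
    """Return (sigma, s) such that stride == sigma * 2**s with sigma odd."""
    if stride == 0:
        raise ValueError("stride must be non-zero")
    sigma, s = stride, 0
    while sigma % 2 == 0:
        sigma //= 2
        s += 1
    return sigma, s

def is_reorderable(stride, max_s=4):
    """True when the power-of-two exponent is small enough (s <= 4) for the
    conflict-free reordering described above to apply."""
    _, s = decompose_stride(stride)
    return s <= max_s

example = decompose_stride(24)   # 24 = 3 * 2**3
ok = is_reorderable(48)          # 48 = 3 * 2**4, so s = 4
too_big = is_reorderable(32)     # 32 = 1 * 2**5, so s = 5
```

Strides that fail this check are the self-conflicting ones, which cannot use the reordering scheme.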
PUMP
• Stride 1 accesses
– 80% of all accesses
– 128 Qwords in 16 (aligned) or 17
(misaligned) cache lines
• Full cache lines read into PUMP
latches
– Two qwords per cycle sent to VBox
• Similar for writes
• Allows double bandwidth
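The 16 versus 17 line count follows from simple address arithmetic. A minimal sketch, assuming 64-byte cache lines (consistent with 128 qwords of 8 bytes filling exactly 16 lines); the function name is illustrative.

```python
QWORD_BYTES = 8
LINE_BYTES = 64   # assumption: 128 qwords * 8 B = 1024 B = 16 lines of 64 B

def lines_touched(base_addr, num_qwords=128):
    """Number of cache lines covered by a stride-1 access starting at base_addr."""
    first_line = base_addr // LINE_BYTES
    last_byte = base_addr + num_qwords * QWORD_BYTES - 1
    last_line = last_byte // LINE_BYTES
    return last_line - first_line + 1

aligned = lines_touched(0x1000)      # starts on a line boundary
misaligned = lines_touched(0x1008)   # starts one qword into a line
```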
Gathers and Scatters
• Arbitrary Address for every vector element
– Reordering algorithm doesn’t work
• Conflict Resolution Box (CR)
– Find biggest subset of non-conflicting addresses,
pack into slice
– Add new addresses to remaining ones and repeat
• Worst case 128 slices generated
• Same algorithm used for self-conflicting
strides
– Stride S = σ × 2^s with s > 4
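The packing loop can be illustrated with a greedy sketch. It is a simplification, not the CR box's actual algorithm: it takes a maximal subset in element order rather than searching for the biggest one, and it handles one whole vector at a time. The bank mapping (`bank_of`) and the element-to-lane mapping (index mod 16) are assumptions made for illustration.

```python
NUM_BANKS = 16
NUM_LANES = 16
LINE_BYTES = 64   # assumption

def bank_of(addr):
    """Hypothetical bank mapping: consecutive cache lines go to consecutive banks."""
    return (addr // LINE_BYTES) % NUM_BANKS

def pack_slices(addresses):
    """Greedily group (element index, address) pairs into slices in which no
    two entries share a bank or a lane."""
    pending = list(enumerate(addresses))
    slices = []
    while pending:
        used_banks, used_lanes = set(), set()
        current, leftover = [], []
        for idx, addr in pending:
            bank, lane = bank_of(addr), idx % NUM_LANES
            if bank in used_banks or lane in used_lanes:
                leftover.append((idx, addr))   # conflicts: retry in a later slice
            else:
                used_banks.add(bank)
                used_lanes.add(lane)
                current.append((idx, addr))
        slices.append(current)
        pending = leftover
    return slices

# 16 addresses in 16 different banks fit in a single slice.
best_case = pack_slices([i * LINE_BYTES for i in range(16)])

# 128 addresses that all fall in bank 0 need one slice each.
worst_case = pack_slices([i * LINE_BYTES * NUM_BANKS for i in range(128)])
```

When every address maps to the same bank, the sketch produces 128 single-element slices, matching the worst case quoted above.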
Vector Misses
• To handle L2 misses consider slices as
atomic
• On miss, slice moved to Miss Address
File (MAF)
– Wait for missing data
– Go to retry queue
• Too many retries cause Panic Mode
– MAF NACKs all other L2 requests that might
prevent progress
Scalar-Vector Coherency
• VBox bypasses L1 cache
– Presence bit P indicates L2 cache line loaded
by VCore
– If P is set, VBox invalidates L1
• Scalar Write followed by Vector Read is
not covered
– Barrier command required
– DrainM purges the write buffer and causes a replay
trap
Evaluation
• No Compiler support available
– Hand-coded assembler kernels
• Scientific Benchmarks
• ASIM Simulator
– Cycle Accurate EV8 simulator
• Tarantula compared to
– EV8
– EV8 + Tarantula’s memory system
– Tarantula4 (1:4 ratio to RAMBUS frequency)
Operations per Cycle
Speed Up over EV8
Conclusions
• Vector Processor most efficient solution for
many applications
• Vector Unit can be added to standard
microprocessor core
• Big Bandwidth requirement can only be
satisfied by L2 cache
• Potentially big performance gains
– Speedup of 2× to 20× over EV8
• Performance depends on good code
– Tiling + aggressive prefetching
• Very good power/performance ratio
Questions
• Can only scientific applications exploit
vector processors?
– Radix sort worked
– Powerful memory access instructions
– Masks allow logic execution
• Does anyone know more about PRAM
algorithms?
• EV8/VBox coherency seems quirky.
Does anyone see a better solution?