The CRAY-1 Computer System
Richard M. Russell
Presented by Andrew Waterman
ECE259 Spring 2008
Background
• CRAY-1 by no means first vector machine
– 1960s: Westinghouse Solomon/ILLIAC IV
– 1974: CDC STAR 100
• “I never, ever want to be a pioneer” --Cray
– STAR 100, ILLIAC IV: who's this Amdahl dude?
• 1972: Cray Research formed after spat with CDC
– Seymour Cray wanted to start from scratch on
8600; CDC brass, not so much
• 1976: first CRAY-1 deployed at Livermore
CRAY-1 Hardware
Look Ma, No ASICs!
CRAY-1 Architecture
• 5-ton, vector uniprocessor
• Word size = 64 bits
• 80 MHz clock
• 8MB RAM in 16 banks @ 20 MHz
– fcpu/fmem = 4 (!!)
• Fairly RISCy 16- or 32-bit instructions
– Load/store; register-register operations
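Why 16 banks matter when fcpu/fmem = 4 can be sketched with a toy model. The model below assumes word-interleaved banks (bank = address mod 16) and a bank that stays busy for 4 CPU cycles per access; both the interleaving rule and the function name `cycles_for_stream` are illustrative assumptions, not details taken from the slides.

```python
NUM_BANKS = 16          # from the slide
BANK_BUSY_CYCLES = 4    # fcpu/fmem = 4: a bank is busy for 4 CPU cycles

def cycles_for_stream(n_words, stride):
    """Cycles needed to issue n_words accesses at a fixed word stride.

    Assumes at most one access is issued per CPU cycle and that an access
    stalls until its bank is free (hypothetical, simplified model).
    """
    bank_free_at = [0] * NUM_BANKS
    cycle = 0
    for i in range(n_words):
        bank = (i * stride) % NUM_BANKS      # assumed interleaving rule
        start = max(cycle, bank_free_at[bank])
        bank_free_at[bank] = start + BANK_BUSY_CYCLES
        cycle = start + 1                    # next access issues no earlier
    return cycle

if __name__ == "__main__":
    for stride in (1, 4, 8, 16):
        print(f"stride {stride:2d}: {cycles_for_stream(64, stride)} cycles")
```

In this model a unit-stride stream never revisits a bank within 4 cycles, so it sustains one word per cycle, while a stride of 16 hits the same bank every time and runs roughly 4x slower.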
Scalar Operation and Octal Annoyance
• 10₈ A-registers for 24-bit address calculations
• 100₈ B-registers serve as backing store for A-registers
• 10₈ S-registers for source/dest of scalar integer/FP insns
• T is to S as B is to A
• 11₈ pipelined scalar FUs
– Address add, mult
– Integer add, shift, logic, pop count
– FP add, mult, reciprocal
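The counts on this slide are written in octal (the trailing 8 is a radix subscript). A quick conversion to decimal, with the dictionary keys chosen here purely for illustration:

```python
# Counts as written on the slide, in octal (base 8).
octal_counts = {
    "A-registers": "10",
    "B-registers": "100",
    "S-registers": "10",
    "scalar FUs": "11",
}

# int(text, 8) parses a string as a base-8 number.
decimal_counts = {name: int(text, 8) for name, text in octal_counts.items()}

if __name__ == "__main__":
    for name, value in decimal_counts.items():
        print(f"{name}: {value}")
```

The decimal FU count agrees with the nine units listed on the slide: two address, four integer, three FP.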
Scalar Operation
• Protection without virtual memory
– Base & limit address regs
• Ld $dest,$addr actually loads from $base+$addr
• Program killed if $base+$addr >= $limit
• A handful of registers for interrupts, exceptions, etc.
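The base-and-limit check described above can be sketched in a few lines. This is a minimal model of the rule as the slide states it; `ProtectionFault`, `translate`, and `load` are hypothetical names, and raising an exception stands in for killing the program.

```python
class ProtectionFault(Exception):
    """Stands in for the program being killed (hypothetical name)."""

def translate(addr, base, limit):
    """Relocate a program address and apply the limit check from the slide."""
    physical = base + addr
    if physical >= limit:
        raise ProtectionFault(f"address {physical} is at or beyond limit {limit}")
    return physical

def load(memory, addr, base, limit):
    """Model of 'Ld $dest,$addr': the access really goes to $base+$addr."""
    return memory[translate(addr, base, limit)]

if __name__ == "__main__":
    memory = list(range(100))
    print(load(memory, 5, base=10, limit=20))   # reads memory[15]
    try:
        load(memory, 10, base=10, limit=20)     # 10 + 10 >= 20
    except ProtectionFault as fault:
        print("killed:", fault)
```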
OS and Front End
• COS (Cray Operating System) handles job scheduling, storage management (tapes!), other I/O, checkpointing
– Packaged with CAL (assembler)
– ...and CFT (Fortran compiler), more later
• Command-line interface and job submission via
separate front-end computer, e.g. VAX
Vector Operation (Finally!)
• 8x64-word V-registers
• Vector Length Register
– Indicates # ops performed by vector insns
– Set from contents of an A-register
• Vector Mask Register
– Indicates which elements in vector to operate on
– Set by vector test insns (e.g. VM[i] := ($Vk[i] == 0))
• 6 Vector FUs
– integer add, shift, bitwise logic
– FP via scalar FPU: add, mult, reciprocal
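How the Vector Length and Vector Mask registers interact can be sketched as follows. This is a simplified model of the semantics as the slide describes them, not the CRAY-1 instruction set; `vector_test_zero` and `masked_add` are hypothetical names, and leaving unselected elements unchanged is an assumption of this sketch.

```python
def vector_test_zero(vk, vl):
    """Build a mask with VM[i] = (Vk[i] == 0) for the first vl elements."""
    return [vk[i] == 0 for i in range(vl)]

def masked_add(vi, vj, vk, vm, vl):
    """Return a copy of Vi with Vi[i] = Vj[i] + Vk[i] where the mask is set.

    Only the first vl elements are considered; elements beyond vl, or with
    a clear mask bit, keep their old value (assumption of this sketch).
    """
    result = list(vi)
    for i in range(vl):
        if vm[i]:
            result[i] = vj[i] + vk[i]
    return result

if __name__ == "__main__":
    vl = 3                                   # as if set from an A-register
    vk = [0, 3, 0, 5]
    vm = vector_test_zero(vk, vl)            # mask over the first 3 elements
    print(masked_add([9, 9, 9, 9], [1, 2, 3, 4], vk, vm, vl))
```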
Vector Load/Store Architecture
• Big departure from STAR 100: register-register ops
• CRAY-1 memory bandwidth == 80Mword/s ==
1word/cycle
– If all 2-source insns are memory-memory, then
IPC=1/3! (and that assumes no bank conflicts!)
– Solution: the RISC approach
• Combined with chaining (next), can sustain > 1 flop/cycle
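The 1/3 figure follows from counting memory words per operation. A small sketch of that arithmetic, with `ops_per_cycle` as a hypothetical helper and bank conflicts ignored as on the slide:

```python
from fractions import Fraction

MEMORY_WORDS_PER_CYCLE = 1   # 80 Mword/s at an 80 MHz clock

def ops_per_cycle(loads_per_op, stores_per_op):
    """Upper bound on operations per cycle when every operand touches memory."""
    words_per_op = loads_per_op + stores_per_op
    return Fraction(MEMORY_WORDS_PER_CYCLE, words_per_op)

if __name__ == "__main__":
    # A 2-source memory-memory op reads two words and writes one.
    print(ops_per_cycle(loads_per_op=2, stores_per_op=1))
```

Register-register operations avoid this bound because operands already in V-registers cost no memory traffic.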
Chaining
• Pipeline bypass meets vectors
• Consider SAXPY vector expression a*X+Y
– Slow approach: compute a*X (64 mults), then
compute a*X+Y (64 adds)
• Total latency: 128+mult latency+add latency
– since, in CRAY-1, all FUs are pipelined
– But... no fundamental serialization requirement
• As soon as a*X[0] is computed, can compute
a*X[0]+Y[0]
• Total latency: 64+mult latency+add latency
(speedup of almost 2)
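The two latency formulas above can be compared directly. The pipeline latencies used below (7 cycles for multiply, 6 for add) are illustrative assumptions, and the function names are hypothetical.

```python
def unchained_latency(n, mult_latency, add_latency):
    """All n multiplies drain through the pipe, then all n adds."""
    return (n + mult_latency) + (n + add_latency)

def chained_latency(n, mult_latency, add_latency):
    """Each add starts as soon as its product leaves the multiplier."""
    return n + mult_latency + add_latency

if __name__ == "__main__":
    n, mult, add = 64, 7, 6      # latencies are assumed values
    slow = unchained_latency(n, mult, add)
    fast = chained_latency(n, mult, add)
    print(f"unchained: {slow} cycles, chained: {fast} cycles, "
          f"speedup: {slow / fast:.2f}")
```

The speedup approaches 2 as the vector length grows relative to the pipeline latencies.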
Chaining Example
• Assume: 8-element vectors, single-cycle ops
mul.ds $v2,$v3,$s1
add.d $v1,$v2,$v1
• Without chaining:
mmmmmmmm
        aaaaaaaa
• With chaining:
mmmmmmmm
 aaaaaaaa
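The two timelines above can be regenerated with a short sketch, one character per cycle. It assumes the slide's single-cycle ops, so a chained add trails its multiply by exactly one cycle; `timeline` is a hypothetical helper.

```python
def timeline(n, chained):
    """Return [multiply_row, add_row], one character per cycle.

    With single-cycle ops, a chained add starts one cycle after the first
    multiply; an unchained add waits for all n multiplies to finish.
    """
    offset = 1 if chained else n
    mul_row = "m" * n
    add_row = " " * offset + "a" * n
    return [mul_row, add_row]

if __name__ == "__main__":
    for chained in (False, True):
        rows = timeline(8, chained)
        label = "with" if chained else "without"
        print(f"{label} chaining ({len(rows[1])} cycles total):")
        print("\n".join(rows))
```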
Vector Startup Times
• For vector ops to be worth using, startup overhead must be small
• CRAY-1 can issue a vector insn every cycle, assuming
no structural hazards on FUs
– Result: vector performance > scalar performance
for as few as four elements/vector
Cray Fortran Compiler (CFT)
• Important insight: hand-coding assembly sucks
• The actual important insight: most vectorizable code
is of the embarrassingly-parallel variety
– Even with 1970s compiler technology, innermost-loop parallelism is low-hanging fruit
– Exploit this—make the compiler do the heavy lifting
• CFT is pretty good for branchless inner loops
• ...but doesn't even attempt to vectorize code with IFs
– So any use of the Vector Mask register must be
hand-coded
• Upshot: a good start, but not quite there
Analysis
• Extremely fast computer for 1976
• Thought experiment: what if CRAY-1's parameters
scaled with Moore's Law? (32 years == 21 doublings)
– 200,000 transistors => 400 billion transistors
– 8MB main memory => 16TB main memory
– 80 MHz clock => ~0.17 petahertz? (if only)
• For a (merely) 2nd-generation vector processor, the
CRAY-1 was ahead of its time (I think)
– I'm not the only one: it was commercially
phenomenal
• However, design techniques (discrete logic) are totally
unscalable
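The scaling thought experiment on this slide is simple arithmetic: 21 doublings is a factor of 2^21. The sketch below redoes the numbers, assuming 8MB means 8 * 2^20 bytes; the variable names are illustrative.

```python
DOUBLINGS = 21                       # from the slide: 32 years == 21 doublings
FACTOR = 2 ** DOUBLINGS              # about 2.1 million

transistors = 200_000 * FACTOR       # slide rounds this to 400 billion
memory_bytes = 8 * 2**20 * FACTOR    # 8MB scaled up
clock_hz = 80e6 * FACTOR             # 80 MHz scaled up

if __name__ == "__main__":
    print(f"scale factor: {FACTOR:,}")
    print(f"transistors:  {transistors:,}")
    print(f"memory:       {memory_bytes / 2**40:.0f} TB")
    print(f"clock:        {clock_hz / 1e12:.0f} THz")
```

The transistor and memory figures match the slide's rounding; the scaled clock comes out in the hundreds of terahertz, a bit short of a petahertz.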
Questions?