IA-64 Intel Itanium

IA-64 Architecture
(Think Intel Itanium)
also known as
(EPIC – Explicitly Parallel Instruction Computing)
a new kind of superscalar computer
HW 5 - Due 12/4
Please clean up boards in lab by Dec 3
* Put good wires in the box
* Take chips off of the board using chip puller
* Put parts away in the proper bins.
* THANKS!
Superpipelined & Superscalar Machines
Superpipelined machine:
• Superpipelined machines overlap pipe stages
— Relies on stages being able to begin operations before the last
is complete.
Superscalar Machine:
A superscalar machine employs multiple
independent pipelines to execute multiple
independent instructions in parallel.
— Particularly common instructions (arithmetic, load/store,
conditional branch) can be executed independently.
Why A New Architecture Direction?
Processor designers' obvious choices for using the
increasing number of transistors on a chip and
extra speed:
• Bigger caches → diminishing returns
• Increase degree of superscaling by adding more
execution units → complexity wall: more logic, need
improved branch prediction, more renaming registers,
more complicated dependencies.
• Multiple processors → challenge to use them
effectively in general computing
• Longer pipelines → greater penalty for misprediction
IA-64 : Background
• Explicitly Parallel Instruction Computing (EPIC)
- Jointly developed by Intel & Hewlett-Packard (HP)
• New 64-bit architecture
— Not an extension of the x86 series
— Not an adaptation of HP's 64-bit RISC architecture
• To exploit increasing chip transistors and increasing speeds
• Utilizes systematic parallelism
• Departure from superscalar trend
Note: Became the architecture of the Intel Itanium
Basic Concepts for IA-64
• Instruction level parallelism
— EXPLICIT in machine instruction, rather than determined at run
time by processor
• Long or very long instruction words (LIW/VLIW)
— Fetch bigger chunks already “preprocessed”
• * Predicated Execution
— Marking groups of instructions for a late decision on “execution”.
• * Control Speculation
— Go ahead and fetch & decode instructions, but keep track of them
so the decision to “issue” them, or not, can be practically made later
• * Data Speculation (or Speculative Loading)
— Go ahead and load data early so it is ready when needed, and have a
practical way to recover if speculation proved wrong
• * Software Pipelining
- Multiple iterations of a loop can be executed in parallel
General Organization
Predicate Registers
• Used as a flag for instructions that may or may not be
executed.
• A set of instructions is assigned a predicate register when it is
uncertain whether the instruction sequence will actually be
executed (think branch).
• Only instructions with a predicate value of true are executed.
• When it is known that the instruction is going to be executed,
its predicate is set. All instructions with that predicate true
can now be completed.
• Those instructions with predicate false are now candidates for
cleanup.
Predication
Speculative Loading
General Organization
IA-64 Key Hardware Features
• Large number of registers
— IA-64 instruction format assumes 256 registers
– 128 × 64-bit integer, logical & general purpose
– 128 × 82-bit floating point and graphic
— 64 × 1-bit predicate registers
(To support high degree of parallelism)
• Multiple execution units
— Probably pipelined
— 8 or more ?
IA-64 Register Set
Relationship between
Instruction Type & Execution Unit
IA-64 Execution Units
• I-Unit
— Integer arithmetic
— Shift and add
— Logical
— Compare
— Integer multimedia ops
• M-Unit
— Load and store
– Between register and memory
— Some integer ALU operations
• B-Unit
— Branch instructions
• F-Unit
— Floating point instructions
Instruction Format Diagram
Instruction Format
128 bit bundles
• Can fetch one or more bundles at a time
• Bundle holds three instructions plus template
• Instructions are usually 41 bits long
— Have associated predicated execution registers
• Template contains info on which instructions can be
executed in parallel
— Not confined to single bundle
— e.g. a stream of 8 instructions may be executed in parallel
— Compiler will have re-ordered instructions to form contiguous
bundles
— Can mix dependent and independent instructions in same bundle
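The bundle layout above can be sketched in Python as plain bit packing. This is only an illustration: the slides give the 128-bit bundle and the 41-bit instructions, while the 5-bit template width (128 - 3 × 41) and its placement in the low-order bits are assumptions added here, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5          # assumed width: 128 - 3 * 41
SLOT_BITS = 41             # instruction width given on the slide
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three 41-bit instruction slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("instruction does not fit in 41 bits")
        # Slot i sits directly above the template and the earlier slots.
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Recover (template, [slot0, slot1, slot2]) from a packed bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots
```

Packing and then unpacking returns the original fields, and the largest possible bundle occupies exactly 128 bits.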
Field Encoding & Instr Set Mapping
Note: BAR indicates stops: Possible dependencies with Instructions after the stop
Assembly Language Format
[qp] mnemonic [.comp] dest = srcs ;; //
• qp - qualifying predicate register
– 1 at execution → execute and commit result to hardware
– 0 → result is discarded
• mnemonic - name of instruction
• comp - one or more instruction completers used to qualify mnemonic
• dest - one or more destination operands
• srcs - one or more source operands
• ;; - instruction group stop (when appropriate)
– An instruction group is a sequence without read-after-write or write-after-write dependencies
– Instructions within a group do not need hardware register dependency checks
• // - comment follows
Assembly Example
Register Dependency:
ld8 r1 = [r5] ;;    //first group
add r3 = r1, r4     //second group
• Second instruction depends on value in r1
— Changed by first instruction
— Can not be in same group for parallel execution
• Note ;; ends the group of instructions that can be
executed in parallel
Assembly Example
Multiple Register Dependencies:
ld8 r1 = [r5]        //first group
sub r6 = r8, r9 ;;   //first group
add r3 = r1, r4      //second group
st8 [r6] = r12       //second group
• Last instruction stores in the memory location whose
address is in r6, which is established in the second
instruction
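The grouping rule behind these two examples can be sketched as a small Python pass that decides where a stop (;;) is needed. It is a simplified illustration, not how a real IA-64 compiler schedules code: it tracks register dependencies only (memory dependencies are ignored), and the tuple representation and function name are hypothetical.

```python
def split_into_groups(instrs):
    """Greedy grouping: start a new group when an instruction reads (RAW) or
    writes (WAW) a register already written in the current group.

    Each instruction is (text, registers_written, registers_read).
    """
    groups, current, written = [], [], set()
    for text, writes, reads in instrs:
        if current and written & (set(reads) | set(writes)):
            groups.append(current)          # a stop (;;) goes here
            current, written = [], set()
        current.append(text)
        written |= set(writes)
    if current:
        groups.append(current)
    return groups


program = [
    ("ld8 r1 = [r5]",   {"r1"}, {"r5"}),
    ("sub r6 = r8, r9", {"r6"}, {"r8", "r9"}),
    ("add r3 = r1, r4", {"r3"}, {"r1", "r4"}),
    ("st8 [r6] = r12",  set(),  {"r6", "r12"}),
]
groups = split_into_groups(program)
```

For the four-instruction example, the pass places the load and subtract in the first group and the add and store in the second, matching the stop shown in the listing.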
Assembly Example – Predicated Code
Consider the following program with branches:
if (a&&b)
    j = j + 1;
else
    if(c)
        k = k + 1;
    else
        k = k - 1;
i = i + 1;
Assembly Example – Predicated Code
Source Code          Pentium Assembly Code
if (a&&b)                cmp a, 0     ; compare with 0
                         je L1        ; branch to L1 if a = 0
                         cmp b, 0
                         je L1
    j = j + 1;           add j, 1     ; j = j + 1
                         jmp L3
else
    if(c)            L1: cmp c, 0
                         je L2
        k = k + 1;       add k, 1     ; k = k + 1
                         jmp L3
    else
        k = k - 1;   L2: sub k, 1     ; k = k - 1
i = i + 1;           L3: add i, 1     ; i = i + 1
Assembly Example – Predicated Code
Source Code          Pentium Code          IA-64 Code
if (a&&b)                cmp a, 0               cmp.eq p1, p2 = 0, a ;;
                         je L1
                         cmp b, 0          (p2) cmp.eq p1, p3 = 0, b
                         je L1
    j = j + 1;           add j, 1          (p3) add j = 1, j
                         jmp L3
else
    if(c)            L1: cmp c, 0          (p1) cmp.ne p4, p5 = 0, c
                         je L2
        k = k + 1;       add k, 1          (p4) add k = 1, k
                         jmp L3
    else
        k = k - 1;   L2: sub k, 1          (p5) add k = -1, k
i = i + 1;           L3: add i, 1               add i = 1, i
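One way to convince yourself that the predicated IA-64 sequence computes the same result as the branching source is to emulate both in Python. The sketch below is only a model: each Python `if p[n]` stands for a qualifying predicate that squashes the instruction (the hardware takes no branch), and it assumes all predicate registers start cleared, which the slides do not state. The function names are hypothetical.

```python
def branchy(a, b, c, j, k, i):
    """Reference semantics of the source program."""
    if a and b:
        j += 1
    else:
        if c:
            k += 1
        else:
            k -= 1
    i += 1
    return j, k, i


def predicated(a, b, c, j, k, i):
    """Straight-line version following the IA-64 code, one line per instruction."""
    p = {n: False for n in range(1, 6)}       # assumption: predicates start cleared
    p[1], p[2] = (a == 0), (a != 0)           # cmp.eq p1, p2 = 0, a ;;
    if p[2]:
        p[1], p[3] = (b == 0), (b != 0)       # (p2) cmp.eq p1, p3 = 0, b
    if p[3]:
        j += 1                                # (p3) add j = 1, j
    if p[1]:
        p[4], p[5] = (c != 0), (c == 0)       # (p1) cmp.ne p4, p5 = 0, c
    if p[4]:
        k += 1                                # (p4) add k = 1, k
    if p[5]:
        k -= 1                                # (p5) add k = -1, k
    i += 1                                    # add i = 1, i
    return j, k, i
```

Running both functions over every combination of zero and non-zero a, b, and c gives identical results under these assumptions.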
Example of Prediction
Data Speculation
• Load data from memory before needed
• What might go wrong?
— Load moved before a store that might alter the memory location
— Need a subsequent check on the value
Assembly Example – Data Speculation
Consider the following program:
(p1) br some_label    // cycle 0
ld8 r1 = [r5] ;;      // cycle 0 (indirect memory op – 2 cycles)
add r1 = r1, r3       // cycle 2
Assembly Example – Data Speculation
Consider the following program:
Original code
(p1) br some_label    //cycle 0
ld8 r1 = [r5] ;;      //cycle 0
add r1 = r1, r3       //cycle 2
Speculated Code
ld8.s r1 = [r5] ;;    //cycle -2
// other instructions
(p1) br some_label    //cycle 0
chk.s r1, recovery    //cycle 0
add r2 = r1, r3       //cycle 0
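The ld8.s / chk.s pairing can be modelled in Python as a load that defers its fault and a check that runs recovery code only if the fault actually matters. This is a rough sketch under assumptions not on the slides: a bad address is modelled as a missing dictionary key, the deferred fault is represented by a sentinel standing in for IA-64's NaT ("not a thing") bit, and all function names are hypothetical. Intel's manuals call this form control speculation.

```python
class DeferredFault:
    """Stand-in for a register whose NaT bit is set."""


NAT = DeferredFault()


def ld(memory, addr):
    """Non-speculative load: faults immediately on a bad address."""
    if addr not in memory:
        raise MemoryError(f"fault at address {addr:#x}")
    return memory[addr]


def ld_s(memory, addr):
    """Speculative load: on a bad address, defer the fault instead of raising."""
    return memory.get(addr, NAT)


def chk_s(value, recovery):
    """Check: if the speculative load faulted, run the recovery code."""
    return recovery() if value is NAT else value


def run(memory, r5, r3, p1):
    r1 = ld_s(memory, r5)                    # ld8.s r1 = [r5], hoisted above the branch
    if p1:                                   # (p1) br some_label
        return None                          # branch taken: result unused, no fault seen
    r1 = chk_s(r1, lambda: ld(memory, r5))   # chk.s r1, recovery
    return r1 + r3                           # add r2 = r1, r3
```

With a valid address the add proceeds without waiting for the load. With a bad address the fault surfaces only when the branch is not taken, which is what makes hoisting the load above the branch safe.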
Assembly Example – Data Speculation
Consider the following program:
st8 [r4] = r12        //cycle 0
ld8 r6 = [r8] ;;      //cycle 0 (indirect memory op – 2 cycles)
add r5 = r6, r7 ;;    //cycle 2
st8 [r18] = r5        //cycle 3
What if r4 and r8 point to the same address?
Assembly Example – Data Speculation
Consider the following program:
Without Data Speculation
st8 [r4] = r12        //cycle 0
ld8 r6 = [r8] ;;      //cycle 0
add r5 = r6, r7 ;;    //cycle 2
st8 [r18] = r5        //cycle 3
With Data Speculation
ld8.a r6 = [r8] ;;    //cycle -2, adv
// other instructions
st8 [r4] = r12        //cycle 0
ld8.c r6 = [r8]       //cycle 0, check
add r5 = r6, r7 ;;    //cycle 0
st8 [r18] = r5        //cycle 1
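The advanced load and its check can be sketched in Python with a small table that remembers which address each advanced load read. This is a simplified model, not the hardware design: the table plays the role of IA-64's Advanced Load Address Table (ALAT), which the slides do not mention, and the class name, method names, memory contents, and the address assumed for r18 are all hypothetical.

```python
class Machine:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.alat = {}            # register name -> address of its advanced load

    def ld_a(self, reg, addr):
        """Advanced load: load early and remember the address."""
        self.alat[reg] = addr
        return self.memory[addr]

    def st(self, addr, value):
        """Store: invalidate any advanced load from the same address."""
        self.memory[addr] = value
        self.alat = {r: a for r, a in self.alat.items() if a != addr}

    def ld_c(self, reg, addr, current):
        """Check load: keep the early value if still valid, else reload."""
        if self.alat.get(reg) == addr:
            return current
        return self.memory[addr]


def run(r4, r8, r12, r7):
    m = Machine({0x10: 1, 0x20: 2, 0x30: 0})
    r6 = m.ld_a("r6", r8)         # ld8.a r6 = [r8]
    m.st(r4, r12)                 # st8 [r4] = r12
    r6 = m.ld_c("r6", r8, r6)     # ld8.c r6 = [r8]
    r5 = r6 + r7                  # add r5 = r6, r7
    m.st(0x30, r5)                # st8 [r18] = r5, with r18 assumed to hold 0x30
    return m.memory[0x30]
```

When r4 and r8 differ, the early value is kept. When they point to the same address, the store removes the table entry and the check load picks up the newly stored value.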
Assembly Example – Data Speculation
Data Dependencies:
Speculation
ld8.a r6 = [r8] ;;    //cycle -2
// other instructions
st8 [r4] = r12        //cycle 0
ld8.c r6 = [r8]       //cycle 0
add r5 = r6, r7 ;;    //cycle 0
st8 [r18] = r5        //cycle 1
Speculation with data dependency
ld8.a r6 = [r8] ;;    //cycle -3, adv ld
// other instructions
add r5 = r6, r7       //cycle -1, uses r6
// other instructions
st8 [r4] = r12        //cycle 0
chk.a r6, recover     //cycle 0, check
back:                 //return pt
st8 [r18] = r5        //cycle 0
recover:
ld8 r6 = [r8] ;;      //get r6 from [r8]
add r5 = r6, r7 ;;    //re-execute
br back               //jump back
Software Pipelining
// y[i] = x[i] + c
L1: ld4 r4=[r5],4 ;;   //cycle 0 load postinc 4
    add r7=r4,r9 ;;    //cycle 2
    st4 [r6]=r7,4      //cycle 3 store postinc 4
    br.cloop L1 ;;     //cycle 3
• Adds a constant to one vector and stores the result in another
• No opportunity for instruction-level parallelism within one iteration
• Instructions in iteration x are all executed before iteration x+1 begins
• If there are no address conflicts between loads and stores, independent
instructions can be moved from iteration x+1 to iteration x
Pipeline - Unrolled Loop, Pipeline Display
Original Loop
L1: ld4 r4=[r5],4 ;;   //cycle 0 load postinc 4
    add r7=r4,r9 ;;    //cycle 2
    st4 [r6]=r7,4      //cycle 3 store postinc 4
    br.cloop L1 ;;     //cycle 3
Unrolled loop
ld4 r32=[r5],4 ;;      //cycle 0
ld4 r33=[r5],4 ;;      //cycle 1
ld4 r34=[r5],4         //cycle 2
add r36=r32,r9 ;;      //cycle 2
ld4 r35=[r5],4         //cycle 3
add r37=r33,r9         //cycle 3
st4 [r6]=r36,4 ;;      //cycle 3
ld4 r36=[r5],4         //cycle 3
add r38=r34,r9         //cycle 4
st4 [r6]=r37,4 ;;      //cycle 4
add r39=r35,r9         //cycle 5
st4 [r6]=r38,4 ;;      //cycle 5
add r40=r36,r9         //cycle 6
st4 [r6]=r39,4 ;;      //cycle 6
st4 [r6]=r40,4 ;;      //cycle 7
Pipeline Display (figure not reproduced)
Unrolled Loop Observations
• Completes 5 iterations in 7 cycles
— Compared with 20 cycles in original code
• Assumes two memory ports
— Load and store can be done in parallel
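The cycle counts above can be reproduced with a small Python scheduling model. It is an idealized sketch: the latencies are read off the cycle annotations in the original loop (add two cycles after the load, store one cycle after the add), one new iteration is assumed to start every cycle, and the function names are hypothetical. This model issues the fifth load in cycle 4 rather than cycle 3 as annotated on the slide, but the final store still lands in cycle 7.

```python
LOAD_TO_ADD = 2     # add issues 2 cycles after its load (cycle 0 -> cycle 2)
ADD_TO_STORE = 1    # store issues 1 cycle after its add (cycle 2 -> cycle 3)


def sequential_cycles(n):
    """Original loop: each iteration occupies cycles 0..3 before the next starts."""
    return n * (LOAD_TO_ADD + ADD_TO_STORE + 1)


def pipelined_schedule(n):
    """Start one iteration per cycle; return {cycle: [operations issued]}."""
    schedule = {}
    for i in range(n):
        schedule.setdefault(i, []).append(f"ld4 x[{i}]")
        schedule.setdefault(i + LOAD_TO_ADD, []).append(f"add y[{i}]")
        schedule.setdefault(i + LOAD_TO_ADD + ADD_TO_STORE, []).append(f"st4 y[{i}]")
    return schedule


schedule = pipelined_schedule(5)
last_cycle = max(schedule)            # cycle in which the final store issues
busiest = max(len(ops) for ops in schedule.values())
```

For five iterations the original loop needs 20 cycles, while the overlapped schedule issues its last store in cycle 7. In the busiest cycles a load and a store issue together, which is why two memory ports are assumed.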
Support For Software Pipelining
• Automatic register renaming
— Fixed-size area of the predicate and fp register files (p16-p63, fr32-fr127) and programmable-size
area of the gp register file (max r32-r127) capable of rotation
— Loop using r32 on first iteration automatically uses r33 on second
• Predication
— Each instruction in loop predicated on rotating predicate register
– Determines whether pipeline is in prolog, kernel, or epilog
• Special loop termination instructions
— Branch instructions that cause registers to rotate and loop counter to decrement
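Register rotation can be illustrated with a short Python model in which a base offset is adjusted on every loop branch, so a value written under one register name shows up under the next name one iteration later. This is only a sketch: the region size of 8 is an arbitrary assumption, and the class and method names are hypothetical.

```python
class RotatingRegisters:
    """Model of a rotating region that starts at r32."""
    BASE = 32

    def __init__(self, size=8):
        self.size = size                  # assumed region size
        self.physical = [0] * size
        self.rrb = 0                      # rotating register base

    def _index(self, logical):
        # Map a logical register number to a physical slot.
        return (logical - self.BASE + self.rrb) % self.size

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def read(self, logical):
        return self.physical[self._index(logical)]

    def rotate(self):
        """Performed by the loop branch: a value written as r32 is next seen as r33."""
        self.rrb = (self.rrb - 1) % self.size


regs = RotatingRegisters()
regs.write(32, "x[0]")    # iteration 0 loads into r32
regs.rotate()
regs.write(32, "x[1]")    # iteration 1 also names r32, but a different slot
regs.rotate()
# Now regs.read(34) == "x[0]" and regs.read(33) == "x[1]"
```

Because every iteration can use the same register names, the loop body does not need to be unrolled by hand to keep values from different iterations apart.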
Intel’s Itanium Implements the IA-64