Transcript EPIC

ECE 587
Advanced Computer Architecture I
Chapter 9
EPIC Architecture
Herbert G. Mayer, PSU
Status 7/18/2015
1
Syllabus
 Introduction
 Intel® Itanium® Architecture
 Definitions
 Data and Memory
 Itanium Registers
 Instruction Set Architecture (ISA)
 Bibliography
2
Itanium 2 Processor
3
Itanium Processor Block Diagram
4
Introduction

The Itanium® processor is Intel’s first published, commercial
64-bit computer product, launched 2001, co-developed with HP

Published means: Smart Intel was diligently developing
another 64-bit processor, the extended version of its ancient,
trusted, ugly x86 architecture, just in case, as a secret backup
risk hedge

64-bit means that the logical address range spans 2^64 different
memory bytes; and natural integer objects are 64 bits wide

The exact format of integer objects is described in section Data
and Memory

During its development at Intel, the first generation of Itanium
processors was code-named Merced

The family is now officially called IPF, for Itanium Processor
Family, while early in its development it was referred to as
IA-64, for Intel 64-bit architecture
5
Introduction
 Intel’s Itanium architecture is radically different from
the widely used IA-32 architecture
 IA-32 should be referred to as x86 architecture, lest
one incorrectly infers today that it be restricted to
32-bit addresses and integer types of 32-bit length
 That limitation no longer exists since introduction of
64-bit versions about ½ year after AMD’s extension
of IA-32 to 64 bits; see also EM64T
 Imagine how Intel felt, when AMD, the company
having produced CPUs compatible with Intel’s chips,
suddenly had a more advanced, attractive x86 CPU!
6
Intel® Itanium® Architecture
 Interestingly, IA-32 object code is executable on
Itanium processors
 More interesting yet, even the Hewlett-Packard PA-RISC
code is natively executable on this new 64-bit
IPF processor
 HP was Intel’s strategic partner in the definition,
development, and cost sharing of the IPF
 Be cautious about performance inferences: just
because IA-32 object code is executable on IPF, do
not deduce that such code executes on the IPF as fast
as, or faster than, on an x86 processor
7
Intel® Itanium® Architecture

IPF is Intel’s and HP’s first instance of the novel EPIC
architecture

EPIC stands for Explicitly Parallel Instruction Computing. It is
Intel’s first launched 64-bit architecture; the second was
launched later (1q04), with EM64T, the first 64-bit version of the
old x86 architecture

HP already had a 64-bit version with its Performance
Architecture (PA) RISC processor at the time Itanium was
launched

Explicit means that the assembly language programmer (or the
smart compiler) bears the intellectual burden of taking advantage
of the parallelism in the architecture

It is not the processor that automatically exploits the
numerous, parallel computing modules; it needs to be told
8
Intel® Itanium® Architecture

As a consequence, compilers for IPF are highly complex

Complexity is not desirable, as that means more errors,
decreased object code quality, something the promoter of a
new architecture should avoid

On the other hand, the IPF has provided explicit architectural
features that ease implementing highly optimizing compilers

A case in point is the architectural support for
software-pipelined (SW PL) loops

Certain source constructs let the compiler emit SW PL loops
that need no prologue and epilogue. This not only renders the
object code more compact, but also faster
9
Intel® Itanium® Architecture
 Parallel means an Itanium processor gains speed
not solely via high clock rates, but via simultaneous
execution of multiple operations in one clock cycle
 Key concepts refined, or newly introduced, in IPF
include: predication, branch prediction, branch
elimination, conditional move, speculation, parallel
comparisons, and a large register file
 Itanium is only the first implementation of the new
64-bit Intel Architecture. Contrary to what you would
expect, initially Itanium implemented only 44
physical address bits of the 64 logical address bits
 Initial product name was Merced
10
Intel® Itanium® Architecture

With 44 bits only, the total address range of first Itanium HW
was only a millionth of the logical address range, but still 4000
times larger than earlier 32-bit architecture

In its second generation, 56 physical bits of the 64-bit logical
address space were implemented in HW

Product name of that new version: Itanium® 2

Short-term, no severe limitations were expected with restricted
56-bit addresses

Still about 16 million times larger than 32-bit addressing space

Integer type operands are of course full 64 bits wide
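
The ratios quoted on this slide and the previous one follow directly from powers of two. A minimal Python check, assuming only the bit widths named in the text (the helper name ratio is illustrative, not from the source):

```python
# Address-range ratios for the physical address widths quoted in the text.
LOGICAL_BITS = 64   # logical address width
LEGACY_BITS = 32    # earlier 32-bit architecture


def ratio(bits_a: int, bits_b: int) -> int:
    """How many times larger a bits_a-wide address space is than a bits_b-wide one."""
    return 2 ** (bits_a - bits_b)


for physical_bits in (44, 56):
    fraction = ratio(LOGICAL_BITS, physical_bits)
    gain = ratio(physical_bits, LEGACY_BITS)
    print(f"{physical_bits} bits: 1/{fraction:,} of the logical range, "
          f"{gain:,} times the 32-bit range")
```

So "a millionth" is 1/2^20, "4000 times" is 2^12, and "16 million times" is 2^24.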
11
Intel® Itanium® Architecture

Unlike earlier parallel VLIW architectures, EPIC does not use a
fixed width instruction encoding

Instead, operational functions can be combined to operate in
parallel from a single to as many instructions as desired

What is critical in EPIC is that all code is written assuming
parallel semantics within a group (to be explained later), and
sequential semantics across groups

To be able to run in parallel, the machine is built with multiple
execution modules that can all work at the same time

This allows a natural architecture migration from say, 6 HW
modules executing on today’s Itanium, to as many as can be
crammed into a future silicon chip a few years from now
12
Intel® Itanium® Architecture

To illustrate, here is a sample taken from ref [1]. Consider 2 memory
operands a and b to be swapped:
temp := a;
a := b;
b := temp;

The semicolon operator ‘;’ implies sequential semantics. On a
machine with parallel semantics, it would be sufficient to write:
a := b, // operand latching needed
b := a; // operand latching needed

With the comma operator ‘,’ implying parallel semantics,
similar to syntactic conventions in the programming language
Algol-68. This source snippet is just a generic example; NOT a
sample of the Itanium assembly language
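
The difference between sequential and parallel semantics can be tried out in Python, whose tuple assignment reads all right-hand sides before writing any target. This is only an analogy for operand latching, not Itanium code; the function names are illustrative:

```python
def swap_sequential(a, b):
    """Sequential semantics: a temporary is needed."""
    temp = a
    a = b
    b = temp
    return a, b


def swap_parallel(a, b):
    """Parallel semantics: both right-hand sides are read (latched) first."""
    a, b = b, a
    return a, b


def swap_without_latching(a, b):
    """What goes wrong if the two assignments simply run one after the other."""
    a = b
    b = a   # reads the already overwritten a
    return a, b
```

The third function shows why latching is needed: without it, both operands end up holding the old value of b.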
13
Definitions
14
Definitions
Branch Elimination
 Replacing object code that has conditional
branches, with code that has multiple execution
paths, lacking branches
 The second version with branches eliminated
must be semantically equivalent to the original
code with branches
 Everything else equal, the version without
branches will execute faster
15
Definitions
Bundle
 Group of 3 instructions plus a template, that all fit
into a 16-byte long, 16-byte aligned section of
instruction memory on Itanium
16
Definitions
Conditional Move

Move instruction that transfers bits from source to
destination, but only if an associated condition is true

Otherwise the instruction operates like a noop

Such a move can serve as a special case of branch
elimination. For example, the C source construct:
if ( a > 0 ) x = 99; -- HL source program
could be mapped into the conditional move:
cmov x, #99, a, #0, gt -- hypothetical asm
which has no branches. Source operand #99 is moved into
memory location x only if the > condition holds between
operands a and integer literal 0
17
Definitions
Endian, Endianness
 A convention that defines in which order the
bytes of a multi-byte data object are
addressed
 If the byte at the higher address holds the more
significant part of the value, we call this little-endian
 The other way around we call big-endian ordering
18
Definitions
EPIC
 Explicitly Parallel Instruction Computing, with IPF
being the first commercial architecture that
implements EPIC
 Note IPF’s ability to also execute old Intel x86 and
old HP PA object code
19
Definitions
Epilogue
 When the steady state of a software-pipelined loop
completes, there may remain operands not yet used
and operations not yet computed that did not fit
into the steady state
 These last operands must be consumed, some
must even be generated during the epilogue, and
ultimately the pipeline must be drained
 This is accomplished in the object code after the
steady state, and that portion of code is called the
epilogue
 See also prologue
20
Definitions
Group
 A sequence of instructions, each with an
associated template and a defined stop
 A group is composed of one bundle or more
 The stop means, the hardware cannot start
executing any subsequent group, until the current
group has completed
 Syntax notation for stop in Itanium assembler is
the double-semicolon ;;
21
Definitions
Parallel Comparison





A composite source program condition of the form:
( ( a > b ) && ( c <= d ) )
requires multiple steps to compute a boolean predicate
Generally, on a sequential architecture these multiple steps
are combined via explicit instructions for anding and oring,
or else the flow of control of execution selects a matching
true label. All this takes time
The Itanium processor allows parallel evaluation of certain
composite Boolean expressions in a single step
The result can be used as a predicate in subsequent
instructions. Notice that such combined Boolean
expressions must be side-effect free
Also this is not equivalent to C’s short-circuit evaluation of
complex boolean expressions!
22
Definitions
Parallel Comparison, Cont’d
 For example, another complex boolean expression
( fun( j, k ) && ( i < MAX ) )
cannot be mapped into a parallel EPIC comparison
 The reason: one operand is a function call fun( j, k ) with
a possibly large number of parameters, which may
have a side-effect on one of the other operands,
for example “i”, which is yet to be compared
 This type of boolean expression is mapped into
sequential code
23
Definitions
Predication

Is the association of a boolean condition with the execution
of an instruction sequence. This allows the following:

Two instruction streams can be executed in parallel, clearly
requiring multiple hardware modules; provided on EPIC

Both streams have a predicate associated with their
operations. Only the stream with the true predicate is
actually retired; the other will be aborted and ignored

Abort can happen as soon as the predicate is known. This
means, the computation of the predicate can proceed in
parallel with the execution of the two code streams, but must
complete by the time these 2 code streams wait to learn which
one is the winner

An ISA with predication requires bits for the predicates to
use, and which direction (true? or false?) to select

Also, the discarded code path must not contain any side-effect,
such as a write to memory!
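
The idea of computing both streams and committing only one can be sketched in Python. This is a software model under stated assumptions, not how the hardware works internally: the function name and the chosen example (an absolute difference) are illustrative, and the predicate is modeled as the integer 0 or 1.

```python
def predicated_abs_diff(a: int, b: int) -> int:
    """Model of predicated code for: if a > b: r = a - b  else: r = b - a."""
    p_true = int(a > b)      # predicate, 1 or 0
    p_false = 1 - p_true     # complementary predicate
    # Both streams are computed; neither has a side effect.
    stream_1 = a - b
    stream_2 = b - a
    # Commit only the result whose predicate is 1; the other is discarded.
    return p_true * stream_1 + p_false * stream_2
```

No branch is taken on the comparison result; the predicate only selects which already computed value survives.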
24
Definitions
Prologue
 Before a software-pipelined loop body can be
initiated, hardware resources (e.g. registers) must
be initialized; we say the loop must be primed
 This is accomplished in the object code before the
steady state, called the Prologue
 See also epilogue
25
Definitions
Register File
 The IPF has a rich set of registers
 This includes 128 general purpose registers (for
integer operations), 128 floating-point, 64
predicate, 8 branch, and 128 so-called
application registers
 Also a variety of special purpose registers is
visible; visible means accessible by the assembly
language program
 Includes a user mask, stack marker (frame
marker), ip, processor id, and performance
monitoring registers
26
Definitions
Speculation

If it is suspected, but not certain, that operand o will be used
in the future, and this operand is not readily available (not
yet in a high-speed register), and it takes long, relative to
instruction execution, to fetch o, a processor may initiate
the fetch well before o is actually used

Advantage: by the time o is needed, it is already available
without delay

Disadvantage: if the flow of control never reaches the place
where o was thought to be needed, then the speculative
fetch was superfluous

May still be meaningful, if a) no side-effects occurred that
are harmful to program correctness, and b) if the hardware
resource required to fetch o was idle anyway; then no loss!
27
Definitions
Steady State
 The software-pipelined object code executed
repeatedly, after the Prologue has completed and
before the Epilogue becomes active, is called the
Steady State
 Each iteration of the Steady State makes some
progress toward multiple iterations of the original
source loop
 See also prologue and epilogue
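
The three phases (prologue, steady state, epilogue) can be sketched in Python for a simple two-stage loop. This is a hand-written illustration, not compiler output; the function name, the two stages ("load" and "multiply"), and the factor parameter are assumptions made for the example.

```python
def scale_pipelined(xs, factor=2):
    """Software-pipelined form of: for x in xs: ys.append(x * factor).

    Stage 1 "loads" an element; stage 2 multiplies the element that was
    loaded one pass earlier.
    """
    ys = []
    if not xs:
        return ys
    # Prologue: prime the pipeline with the first load.
    loaded = xs[0]
    # Steady state: each pass finishes source iteration i-1 and starts i.
    for i in range(1, len(xs)):
        next_loaded = xs[i]          # stage 1 of source iteration i
        ys.append(loaded * factor)   # stage 2 of source iteration i-1
        loaded = next_loaded
    # Epilogue: drain the pipeline; one loaded operand is still unused.
    ys.append(loaded * factor)
    return ys
```

Each pass through the steady state makes progress on two source iterations at once, which is the property the definition above describes.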
28
Definitions
Syllable
 Is the instruction-only portion of a bundle
 A bundle always holds 3 instructions plus a
template, the template specifying additional
necessary information about an instruction
 The instruction alone, without the needed template
information, is a syllable
29
Data & Memory
30
Data and Memory
 Native data types of IPF resemble conventional 32-bit
architectures, except for the longer 64-bit integer
and unsigned formats
 An extension over IA-32 object code is the IPF
bundle
 Data types include integer, unsigned, floating-point,
and pointer
 Integers are of different widths: byte, word,
double-word, or quad-word precision
 Length in bits as well as min and max values are
listed below:
31
Data and Memory, Min Max
Type             Byte   Word   Doubleword   Quadword
Integer [bits]   8      16     32           64
Unsigned [bits]  8      16     32           64
Pointer [bits]   NA     NA     Comp. 32     64
Float [bits]     NA     NA     32, 64       64, 80

Type          Byte   Word     Double-word      Quad-word
Minint        -128   -32,768  -2,147,483,648   -9,223,372,036,854,775,808
Maxint        127    32,767   2,147,483,647    9,223,372,036,854,775,807
Minunsigned   0      0        0                0
Maxunsigned   255    65,535   4,294,967,295    18,446,744,073,709,551,615
32
Data and Memory
 Negative numbers are represented in two’s
complement format, with the sign-bit in the
most-significant position
 Floating-point data use the IEEE 754 standard
 Bits representing integer values are numbered
from 0 in the least significant position (rightmost
position) to higher values. For example, the most
significant bit in a double word is in position
indexed 31
 Maximum address on the first generation Itanium
processor (Merced) was only 17,592,186,044,415,
or 2^44 - 1. It grew in the second generation to 56
bits, and is now a full 64 bits long
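
The minimum and maximum values of the integer types, and the largest address for a given number of address bits, follow from the bit width alone. A small Python sketch, with illustrative names (WIDTHS, signed_range, unsigned_range) that are not from the source:

```python
WIDTHS = {"byte": 8, "word": 16, "double-word": 32, "quad-word": 64}


def signed_range(bits: int):
    """(min, max) of a two's complement integer with the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def unsigned_range(bits: int):
    """(min, max) of an unsigned integer with the given width."""
    return 0, 2 ** bits - 1


for name, bits in WIDTHS.items():
    lo, hi = signed_range(bits)
    umax = unsigned_range(bits)[1]
    print(f"{name:12} signed {lo:,} .. {hi:,}   unsigned max {umax:,}")

# Largest byte address reachable with 44 address bits
print(f"{unsigned_range(44)[1]:,}")
```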
33
Data and Memory
 Bytes are stored in little-endian order by default
 Possible to programmatically select little- or
big-endian order, by setting the be bit in the user
mask, a special status register
 The be bit (for big-endian) does not affect how
instructions are stored or fetched from memory
 Object code is always represented in little-endian
order; programmer selected endianness only
impacts data
 In little-endian order, the least significant data
byte is stored in the byte with the
lowest address; conversely for big-endian order
34
Data and Memory
Data quad-word 0x1102030455060708 would be stored:
Data stored in 8 adjacent bytes in memory in little-endian order:
addr:    0      1      2      3      4      5      6      7
value:   0x08   0x07   0x06   0x55   0x04   0x03   0x02   0x11
Same int value 0x1102030455060708 stored in 8-byte register:
byte:    byte7  byte6  byte5  byte4  byte3  byte2  byte1  byte0
value:   0x11   0x02   0x03   0x04   0x55   0x06   0x07   0x08
35
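
The byte layout of the same quad-word can be reproduced with Python's built-in int.to_bytes. This only illustrates the two byte orders; it says nothing about how the be bit is set on real hardware.

```python
value = 0x1102030455060708

# Byte at index 0 is the byte at the lowest memory address.
little = value.to_bytes(8, byteorder="little")
big = value.to_bytes(8, byteorder="big")

print("little-endian:", [f"{b:02x}" for b in little])
print("big-endian:   ", [f"{b:02x}" for b in big])

# Reading the bytes back with the matching order restores the value.
restored = int.from_bytes(little, byteorder="little")
```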
Itanium Registers
 The Itanium processor has 128 general registers
(GR), 128 floating-point registers (FR), 64
single-bit predicate registers (PR), 8 branch registers
(BR), and 128 application registers (AR)
 In addition, there are Performance Monitor Data
registers (PMD), processor identifiers (CPUID), a
Current Frame Marker register (CFM), user mask
(UM), and instruction pointer registers (IP)
 GRs, FRs, BRs, ARs, CPUIDs, IP, and PMDs are 64
bits wide
 PRs are 1 bit wide, while the UM holds 6 and the
CFM 38 bits; depicted below:
36
Itanium Register File
[Figure: Itanium register file; recoverable contents listed below]
GR      gr0 … gr127         bits 63…0 each
FR      fr0 … fr127         bits 63…0 each
PR      pr0 … pr63          bit 0 each
BR      br0 … br7           bits 63…0 each
AR      ar0 … ar127         bits 63…0 each
        Named ARs: ar0 – ar7 (KR0 – KR7), ar16 (RSC), ar17 (BSP),
        ar18 (BSPSTORE), ar19 (RNAT), ar21 (FCR), ar30 (FDR),
        ar32 (CCV), ar36 (UNAT), ar40 (FPSR), ar44 (ITC),
        ar65 (LC), ar66 (EC)
ip                          bits 63…0
cfm                         bits 37…0
User Mask  um               bits 5…0
CPUID   cpuid0 … cpuidn     bits 63…0 each
PMD     pmd0 … pmdm         bits 63…0 each
37
Itanium Registers GR
 The 128 GR registers are the common workhorses
during computation
 They contain integer values being computed that
can also be used as the source and destination
operands in move operations
 It is possible to use these integer values as
machine addresses, thus GRs can be used as
pointers in load- and store-operations
 All machine instructions can refer to these
registers, for reading and writing values
38
Itanium Registers GR

In addition to the 64 data bits, each GR has an associated bit
called the NAT, which stands for Not A Thing. NAT is 1, if the
associated register has not been initialized with good data

NATs support speculation. For example, if a speculative load
is issued but aborted, before the value arrives in its destined
GR, the NAT value can be set to record that fact

Enables integrity of the machine’s exception process

Certain instructions can manipulate individual bits or bit
strings of the 64 bits in the various GRs; there are 2 groups

The first 32, GR0 through GR31, are visible to all software,
and are used to hold globally computed, intermediate
values. However, GR0 is read-only, providing the constant 0,
64 bits long
39
Itanium Registers GR
 The next 96, GR32 to GR127, are used to
implement a small but frequently used portion of
the top of the run-time stack; i.e. work like a
special-purpose top-of-stack cache
 These stack registers are made available to SW by
allocation of a register stack frame, and include
from 0 to 96 registers. All registers not used from
this subset are inaccessible to general SW
 The stack frame portion implemented via GRs is
further partitioned into subsections, one meant to
hold local registers, the other output registers, i.e.
results of the function call
40
Itanium Predicate Registers (PR)
 Execution of most IPF instructions can be
predicated by one of the PRs
 Value 1 in the PR means: the operation completes
normally and its result is committed
 Value 0 means: the result will not be posted (committed),
even if it has been computed already. I.e. there will
be no impact on the architectural state of the machine
 A rare exception of an instruction that cannot be
predicated is the loop operation
41
Itanium Predicate Registers (PR)
 The PRs are also partitioned into 2 sections:
 PR0 through PR15 are static PRs
 The other 48 are so called rotating PRs
 PR0 is an exceptional register, it can only be read,
and its value is always 1, meaning, the predicate is
true; thus PR0 can be used to denote
unconditional execution
 The remaining 48 PRs are used to hold stage
predicates, used during software-pipelining
42
Branch Registers (BR)

IPF instructions are grouped in bundles, which are 16-byte
aligned byte sequences holding executable code. Hence their
rightmost 4 address bits will always be 0 due to alignment;
they don’t need to be stored explicitly

Execution of an indirect branch requires an explicit operand

On the Itanium architecture this operand is a branch register;
it holds the branch destination

The machine then loads the value of the referenced BR into IP
and execution continues from there

Executing branch-related instructions is about the only way to
directly affect the value in the instruction pointer, the register
that holds the address of the next bundle to be executed
43
Current Frame Marker Register CFM
Note: the Frame Marker is often referred to as the Stack Marker
 Each function has a specific stack frame
associated with it, which is created at function
invocation; it is cleared at function return
 If all the relevant data of a function’s stack frame
do fit, they are placed in the stack of general
registers; else the overflowing data must reside in
memory
 Either way, the current frame marker (CFM) holds
the frame marker for the function that is currently
active
44
Current Frame Marker Register CFM
Layout of the CFM register:
Bits:    37 .. 32   31 .. 25   24 .. 18   17 .. 14   13 .. 7   6 .. 0
Field:   rrb.pr     rrb.fr     rrb.gr     sor        sol       sof

Meaning of Bits in CFM:
Name     Bit Field   Meaning
sof      0..6        Total size of stack frame
sol      7..13       Size of local part of stack frame, in words
sor      14..17      Size of rotating portion of stack frame. The number of rotating registers is 8 times the sor value
rrb.gr   18..24      Register rename base for GRs
rrb.fr   25..31      Register rename base for FRs
rrb.pr   32..37      Register rename base for PRs
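
The bit fields of the CFM can be extracted with shifts and masks. The sketch below uses the field positions from the layout above; the names CFM_FIELDS, encode_cfm, and decode_cfm are illustrative helpers, not part of any Itanium tool.

```python
# field name -> (lowest bit position, width in bits), taken from the CFM layout
CFM_FIELDS = {
    "sof": (0, 7),
    "sol": (7, 7),
    "sor": (14, 4),
    "rrb.gr": (18, 7),
    "rrb.fr": (25, 7),
    "rrb.pr": (32, 6),
}


def encode_cfm(fields: dict) -> int:
    """Pack field values into a 38-bit CFM value; missing fields are 0."""
    cfm = 0
    for name, value in fields.items():
        low, width = CFM_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit into {width} bits")
        cfm |= value << low
    return cfm


def decode_cfm(cfm: int) -> dict:
    """Split a CFM value into its named fields."""
    return {name: (cfm >> low) & ((1 << width) - 1)
            for name, (low, width) in CFM_FIELDS.items()}


frame = decode_cfm(encode_cfm({"sof": 10, "sol": 6, "sor": 1}))
rotating_registers = 8 * frame["sor"]   # per the table: 8 times the sor value
```

The field widths add up to 7 + 7 + 4 + 7 + 7 + 6 = 38 bits, matching the 38-bit CFM.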
45
Application Registers (AR)
Application Registers – t.b.d.:
register      Mnemonic      Description of register
ar0 – ar7     KR0 – KR7     Kernel registers 0 .. 7
ar8 – ar15                  Reserved
ar16          t.b.d.
46
Instruction Pointer (IP)
 IPF instructions are fetched in units of bundles,
which are chunks of 16 bytes, or 128 bits
 Bundles are stored bundle-aligned
 The ip can address 18,446,744,073,709,551,616
different bytes (but only at bundle addresses)
 The rightmost 4 bits of the ip thus will always be
zero, due to the bundle-alignment
47
Performance Monitor Data Register
 These are architecture-provided resources that
record the use of hardware modules
 Contents are read-only by SW
 But contrary to the performance monitor registers
on Intel Pentium architectures, they are user
visible
48
Itanium ISA
Instruction Set Architecture
49
Instruction Set Architecture (ISA)
Parallelism and Dependences

Itanium instructions that are explicitly packaged in groups
can execute in parallel

Assembly programmer or compiler may craft groups as
large as desired; the performance consequence is: All
operations embedded in a single group can be executed
simultaneously, in parallel, saving time over the equivalent
sequential execution

The physical silicon angle of this is: Of all operations that
could be executed in parallel only those are actually
performed in parallel, for which there exist HW resources

E.g. on an Itanium®2 implementation of IPF, there are 6 units
available to operate in parallel
50
Instruction Set Architecture (ISA)
Parallelism and Dependences

If fewer actions are enclosed in a group, some HW will idle

If more actions could be included in a group, then all HW
elements are active, yet some degree of possible parallelism
is lost; future HW implementations may execute that same
object code faster due to the higher degree of parallelism

Parallel execution is not feasible if dependencies exist
between instructions. On the IPF family, however, these
dependencies are not resolved by the machine

It is the human programmer or the optimizing compiler that
explicitly tracks, what can be done in parallel, and what must
be done in sequence. The machine just runs it, goal: TO BE
FAST!
51
Instruction Set Architecture (ISA)
Parallelism and Dependences

If a result has to be computed first before it can be read
somewhere else (memory or register), a true dependence
exists; AKA data dependence; conventional to say
“dependence”

On Itanium we call this a RAW (Read after Write) dependence

If a result has to be read first before it can be re-computed, a
false dependence is created, AKA anti-dependence

On Itanium this is named WAR (Write after Read) dependency

If a result has to be computed first before it can be computed
again, assuming that an intermediate reference is possible,
output dependence is created
52
Instruction Set Architecture (ISA)
Parallelism and Dependences

Itanium calls this WAW (Write after Write) dependency

In all these cases, the prior operation has to complete,
before the dependent can be started; e.g.:
ld8 r14 = [r3]        -- load GR14 with 8 bytes addressed by GR3
add r15 = r14, r16    -- integer sum into GR15, RAW dependence

The loading of an 8-byte value into (8-byte) register GR14
must complete first, before the addition of the 2 long integer
values, held in GR14 and GR16, can be started

Note the assembler register names: r14, and not gr14
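
The three kinds of dependence can be detected mechanically from the registers an instruction writes and reads. The Python sketch below models an instruction as a (destination, set of sources) pair; this representation and the function name are assumptions for illustration only.

```python
def dependences(first, second):
    """Classify the dependences of `second` on the earlier instruction `first`.

    Each instruction is modeled as (destination register, set of source registers).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 in srcs2:
        found.add("RAW")   # second reads what first writes: true dependence
    if dest2 in srcs1:
        found.add("WAR")   # second overwrites what first reads: anti-dependence
    if dest1 == dest2:
        found.add("WAW")   # both write the same register: output dependence
    return found


# The ld8/add pair from the text
ld8 = ("r14", {"r3"})
add = ("r15", {"r14", "r16"})
print(dependences(ld8, add))
```

A compiler or assembly programmer performs this kind of check before placing two instructions into the same group.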
53
Instruction Set Architecture (ISA)
Assembly Language Format
 Format of an Itanium assembler instruction:
[(pr)] mnemonic[.comp] dest = src1 [, src2 [, src3 ] ]
 In this meta-syntax, [ and ] brackets mean that the
bracketed portion of the instruction is optional
 In assembly syntax, these bracket pairs [] express
indirection
 Be careful not to confuse the 2 different contexts!
Meaning of the various assembly language fields:
54
Instruction Set Architecture (ISA)
syntax     Name                 Meaning
(pr)       Predicate register   Used to predicate execution; if the value is 0, the result is not committed; if 1 (true), the result is committed. pr0 is always 1, hence the associated instructions are executed unconditionally
mnemonic   Instruction          Name of the instruction, telling the assembler which operation to perform
comp       Completer            Further qualifies or completes the instruction specification. There may be multiple completers per instruction; not all instructions have a completer
dest       Destination          The destination of the specified instruction. Choices are: register or memory
src1       Source one           Source operand. Not all instructions require a source. Some instructions allow multiple sources. Sources may be: immediate operands, or registers. Memory can be a source via indirection (through a register)
src2       Source two           Ditto
src3       Source three         Ditto
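
The instruction format can be taken apart with a regular expression. The Python sketch below is a simplified reading of the meta-syntax shown earlier; it is not a real assembler, and it does not handle comments, stops (;;), or labels. The names INSTRUCTION and parse are illustrative.

```python
import re

INSTRUCTION = re.compile(
    r"^\s*(?:\((?P<pr>p\d+)\)\s+)?"      # optional predicate, e.g. (p3)
    r"(?P<mnemonic>[a-z][a-z0-9]*)"      # mnemonic, e.g. add
    r"(?P<comp>(?:\.[a-z0-9]+)*)\s+"     # zero or more completers, e.g. .a
    r"(?P<dest>[^=]+?)\s*=\s*"           # destination(s), left of '='
    r"(?P<srcs>.+?)\s*$"                 # comma-separated sources
)


def parse(line: str) -> dict:
    """Split one assembly line into predicate, mnemonic, completers, dest, sources."""
    m = INSTRUCTION.match(line)
    if m is None:
        raise ValueError(f"not in the expected format: {line!r}")
    return {
        "pr": m.group("pr") or "p0",     # an omitted predicate behaves like p0
        "mnemonic": m.group("mnemonic"),
        "completers": [c for c in m.group("comp").split(".") if c],
        "dest": [d.strip() for d in m.group("dest").split(",")],
        "srcs": [s.strip() for s in m.group("srcs").split(",")],
    }


print(parse("(p0) add r5 = r4, r3, 1"))
print(parse("ld8 r14 = [r3]"))
```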
55
Instruction Set Architecture (ISA)
Assembly Language Format

A sample assembly language instruction is shown next:
(p0) add r5 = r4, r3, 1 // (p0) can be skipped

This is an integer add instruction that sums up the integer
values in GR4 and GR3, also adds 1

Assigns sum to register GR5. Since the predicate register used
is PR0, which is always true, the commit of the sum to register
GR5 is unconditional, just as if no predicate qualifier had been
given

Predicate registers, when listed, are enclosed in parentheses

Not all instructions allow or need a completer. Typical
completers are shown below. Some instructions allow multiple
completers, notably the memory access instructions, and
branch instructions
56
Instruction Set Architecture (ISA)
Completer
Completer   Meaning
.a          For “advanced” load; check later if successful
.c          Check
.clr        If advanced load was not successful, clear the reg
.nc         no clear
.s          Speculative; e.g. for load; NOT allowed for store!
.many       t.b.d.
.few        t.b.d.
.excl       t.b.d.
Many more   .equ .unc etc.
57
Instruction Set Architecture (ISA)
Itanium Bundle Format
 Executable code on Itanium comes in units of
bundles. A bundle consists of 3 instructions, all
grouped with an associated template
 Template completes the instruction specification
and above all, can define a group boundary, AKA
stop. Stop defines boundary between one group
and the next
 If no stop is included in a template, this means
that the bundle will be part of a larger group,
consisting of more instructions in the next bundle
58
Instruction Set Architecture (ISA)
Itanium Bundle Format
 Each instruction is 41 bits long; the template consumes
5 bits, with one template per bundle
 With 3 instructions per bundle, the overall bundle
length is 3 * 41 + 5 = 128 bits, perfectly fitting into 16
bytes; all bundle-aligned, easily accomplished due to
first bundle residing on a mod-16 memory boundary
 From then on all will be aligned on 16-byte boundaries
 With the memory bus being 128 bits wide (or wider on
future IPF implementations) and bundles being
bundle-aligned, fetching instruction memory is fast
59
Instruction Set Architecture (ISA)
Itanium Bundle Format
 General layout of a bundle is shown next, with bits
ordered from 0 through 127, increasing right to left:
Bits:    127 .. 87       86 .. 46        45 .. 5         4 .. 0
Field:   instruction 2   instruction 1   instruction 0   template
 The template serves as a means for the compiler
to communicate additional information about the
instructions, without which they would be
ambiguous
 One such key piece of information is placement of
an instruction group stop, in assembler ;;
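
The bundle layout (a 5-bit template in bits 0..4, followed by three 41-bit instruction slots) can be expressed with shifts and masks. The Python sketch below treats a bundle as a 128-bit integer; pack_bundle and unpack_bundle are illustrative names, and no real instruction encodings are used.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
BUNDLE_BITS = 3 * SLOT_BITS + TEMPLATE_BITS   # 3 * 41 + 5 = 128


def pack_bundle(template: int, slots) -> int:
    """Combine a template and three instruction slots into one bundle value."""
    if not 0 <= template < (1 << TEMPLATE_BITS) or len(slots) != 3:
        raise ValueError("need a 5-bit template and exactly 3 slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError(f"slot {i} does not fit into {SLOT_BITS} bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int):
    """Return (template, [instruction 0, instruction 1, instruction 2])."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("a bundle is exactly 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
             for i in range(3)]
    return template, slots
```

Instruction 0 thus occupies bits 5..45, instruction 1 bits 46..86, and instruction 2 bits 87..127.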
60
Instruction Set Architecture (ISA)
Itanium Bundle Format
 A group stop can occur after instruction 2, or 1, or
0, indicating an earlier group must complete
execution, before another starts
 An Itanium bundle allows at most 2 stops. Thus,
if 3 are needed, then a no-operation must be
packed into one of the instructions, to effectively
create 2 physical groups, with the third being the
NOOP, whose execution order does not matter
 Compiler-generated code performs this workaround automatically
61
Instruction Set Architecture (ISA)
Itanium Bundle Format
 The template specifies which types of instructions
are assembled into slot 0, 1 and 2. IPF instructions
are partitioned into the following 6 types:
Type   Meaning
A      ALU: integer or memory unit
I      Non-ALU: Integer unit
M      Memory unit
F      Floating-point unit
B      Branch unit
L+X    Extended unit, or Branch unit
62
Instruction Set Architecture (ISA)
Itanium Bundle Format
 Providing such information to the processor in the
template speeds up instruction decoding, and
thus improves execution speed
 A list with the Instruction Set Architecture (ISA)
templates and embedded stops is shown next:
63
Instruction Set Architecture (ISA)
Template #   type       slot 0           slot 1                slot 2
0 = 0x00     MII        Memory unit      Integer unit          Integer unit
1 = 0x01     MII_       Memory unit      Integer unit          Integer unit ;;
2 = 0x02     MI_I       Memory unit      Integer unit ;;       Integer unit
3 = 0x03     MI_I_      Memory unit      Integer unit ;;       Integer unit ;;
4 = 0x04     MLX        Memory unit      L unit                Extended unit
5 = 0x05     MLX_       Memory unit      L unit                Extended unit ;;
6 = 0x06     reserved
7 = 0x07     reserved
8 = 0x08     MMI        Memory unit      Memory unit           Integer unit
9 = 0x09     MMI_       Memory unit      Memory unit           Integer unit ;;
10 = 0x0a    M_MI       Memory unit ;;   Memory unit           Integer unit
11 = 0x0b    M_MI_      Memory unit ;;   Memory unit           Integer unit ;;
12 = 0x0c    MFI        Memory unit      Floating-point unit   Integer unit
13 = 0x0d    MFI_       Memory unit      Floating-point unit   Integer unit ;;
14 = 0x0e    MMF        Memory unit      Memory unit           Floating-point unit
15 = 0x0f    MMF_       Memory unit      Memory unit           Floating-point unit ;;
16 = 0x10    MIB        Memory unit      Integer unit          Branch unit
17 = 0x11    MIB_       Memory unit      Integer unit          Branch unit ;;
18 = 0x12    MBB        Memory unit      Branch unit           Branch unit
19 = 0x13    MBB_       Memory unit      Branch unit           Branch unit ;;
20 = 0x14    reserved
21 = 0x15    reserved
22 = 0x16    BBB        Branch unit      Branch unit           Branch unit
23 = 0x17    BBB_       Branch unit      Branch unit           Branch unit ;;
24 = 0x18    MMB        Memory unit      Memory unit           Branch unit
25 = 0x19    MMB_       Memory unit      Memory unit           Branch unit ;;
26 = 0x1a    reserved
27 = 0x1b    reserved
28 = 0x1c    MFB        Memory unit      Floating-point unit   Branch unit
29 = 0x1d    MFB_       Memory unit      Floating-point unit   Branch unit ;;
30 = 0x1e    reserved
31 = 0x1f    reserved
64
Instruction Set Architecture (ISA)
Itanium Bundle Format
 The difference between above templates 0x00 and
0x01, both being MII type operations is: after
instruction 2 in template 0x01 there is a stop, while
in template 0x00 there is none
 In other words, the next bundle after the one for
template 0x00 will belong to the same group, and a
higher degree of parallelism will be possible
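
The type notation in the template list encodes both the unit per slot and the stops: each letter names a unit, and an underscore marks a stop after the preceding slot. A small Python sketch of that reading, with illustrative names (UNITS, slots_and_stops):

```python
UNITS = {
    "M": "Memory", "I": "Integer", "F": "Floating-point",
    "B": "Branch", "L": "L", "X": "Extended",
}


def slots_and_stops(template_type: str):
    """Turn a type such as 'MI_I_' into a list of (unit, stop_after) pairs."""
    result = []
    for ch in template_type:
        if ch == "_":
            if not result:
                raise ValueError("a stop must follow a slot")
            unit, _ = result[-1]
            result[-1] = (unit, True)    # stop after the preceding slot
        else:
            result.append((UNITS[ch], False))
    return result


print(slots_and_stops("MII"))    # template 0x00: no stop
print(slots_and_stops("MII_"))   # template 0x01: stop after slot 2
```

A bundle whose last pair has stop_after equal to False continues its group into the next bundle.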
65
Instruction Set Architecture (ISA)
Itanium Assembly Code

An instruction group is a sequence of 1 or more instructions
delimited by a stop. The first instruction in a whole program
is thought to be preceded by a stop

Similarly, the last instruction of a complete program is
thought to be followed by a stop

All instructions placed into a single group can be executed
in parallel. Whether or not they will depends on the number
of hardware resources available. In the initial Itanium
architecture only 6 resources are available

In a later implementation, many more may be available, thus
potentially speeding up execution of the same old Itanium
code on a future generation

The ;; indicates to the assembler where one group ends
and thus the next group starts
66
Instruction Set Architecture (ISA)
Itanium Assembly Code

Some assembly language instructions are shown next:
cmp.eq p1, p2 = r33, r34

This checks general purpose registers 33 and 34 for
equality; if true, predicate register 1 is set to true, predicate
register 2 to false. Otherwise, since GR33 and GR34 are not
equal, p1 is set to false and p2 to true. A more complicated
case is:
(p3) cmp.eq.unc p1, p2 = r33, r34

checks if predicate register 3 is true at the start. If so, then
only if registers GR33 and GR34 are equal this acts like a
regular IPF comparison

Else, i.e. if p3 is false a priori, then predicate registers 1
and 2 are both set to false
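
The behavior of the two comparisons can be modeled in Python on a list of 64 predicate values. This is a sketch of the semantics described above, not a full model of the instruction; the function name cmp_eq and its parameters are illustrative. The case of a false qualifying predicate without .unc (targets left unchanged) is not spelled out in the text and is an assumption here.

```python
def cmp_eq(pr, qp, p_true, p_false, a, b, unc=False):
    """Model of `(qp) cmp.eq[.unc] p_true, p_false = a, b` on predicate list pr."""
    if pr[qp]:
        # Qualifying predicate is true: regular comparison
        pr[p_true] = (a == b)
        pr[p_false] = (a != b)
    elif unc:
        # .unc with a false qualifying predicate: both targets are cleared
        pr[p_true] = False
        pr[p_false] = False
    # else: assumed to leave both target predicates unchanged


pr = [False] * 64
pr[0] = True                      # PR0 always reads as true

cmp_eq(pr, 0, 1, 2, 7, 7)         # cmp.eq p1, p2 = r33, r34 with equal values
after_plain = (pr[1], pr[2])

pr[1] = pr[2] = True
cmp_eq(pr, 3, 1, 2, 7, 7, unc=True)   # (p3) cmp.eq.unc with p3 false
after_unc = (pr[1], pr[2])
```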
67
Bibliography
1. Triebel, Walter: “IA-64 Architecture for Software Developers”,
Intel Press © 2000, 308 pages
2. http://www.intel.com/design/itanium2/manuals/25110901.pdf
3. http://h21007.www2.hp.com/portal/StaticDownload?attachment_ciid=c2d2e0aecd2b7110VgnVCM100000275d6e10RCRD&ciid=ce1fd701521c7110VgnVCM100000275d6e10RCRD
4. http://www.intel.com/design/itanium/downloads/245320.htm
5. http://www.intel.com/design/itanium/manuals/iiasdmanual.htm
6. http://download.intel.com/design/Itanium2/manuals/25111003.pdf
68