Itanium EPIC Architecture

ECE 485/585
Microprocessors
Chapter 9
Itanium® EPIC Processor Architecture
Herbert G. Mayer, PSU
Status 9/1/2016
Syllabus
 Introduction
 Intel® Itanium® Architecture
 Data and Memory
 Itanium Registers
 Instruction Set Architecture ISA
 Assembler Source Program
 Appendix
 Bibliography
Photo of Itanium 2 Processor
Itanium Processor Block Diagram
Introduction

The Itanium® processor is Intel’s first published, commercial
64-bit computer product, launched 2001, co-developed with
HP Corp. IPF stands for Itanium Processor Family

Published means: Smart Intel was diligently developing a
contemporaneous, competing 64-bit processor, the extended
version of its ancient x86 architecture, just in case, as a
secret backup risk hedge

64-bit means that the logical address range spans 2^64 different
memory bytes, and natural integer objects are 64 bits wide

The exact format of data objects is described in section Data
and Memory

During its development at Intel, the first generation of Itanium
processors was internally code-named Merced

The family is now officially called IPF, for Itanium Processor
Family, while early in its development it was referred to as IA-64, for Intel 64-bit architecture; a name that later conflicted with the 64-bit x86
Introduction
 Intel’s Itanium architecture is radically different from
the widely used 32-bit IA-32 architecture
 IA-32 is better referred to as the x86 architecture, lest
one incorrectly infer today that it is restricted to
32-bit addresses and integer types of 32-bit length
 That limitation no longer exists since Intel introduced
64-bit versions about ½ year after AMD’s extension
of IA-32 to 64 bits; see also EM64T
 Imagine how Intel felt, when AMD, the company
having produced CPUs compatible with Intel’s chips,
suddenly had a more advanced, attractive x86 CPU!
Intel® Itanium® Architecture
 Interestingly, IA-32 object code is executable on
Itanium processors
 More interesting yet, even Hewlett-Packard PA-RISC
code is executable on this novel 64-bit IPF
processor
 HP and Intel were strategic partners in the definition,
development, and cost sharing of the IPF, with HP
having initiated the development
 Be cautious about performance inferences! Just
because IA-32 object code is executable on IPF, one
should not deduce that such code executes on IPF as
fast as on an x86 processor!
Intel® Itanium® Architecture

IPF is Intel’s and HP’s first instance of the novel EPIC
architecture

EPIC stands for Explicitly Parallel Instruction Computing. IPF is
Intel’s first launched 64-bit architecture; the second was
launched later (1Q04) with EM64T, the first 64-bit version of the
old x86 architecture

HP already had a 64-bit version with its Performance
Architecture (PA) RISC processor at the time of Itanium launch

Explicit means that the assembly language programmer (or the
smart compiler) bears the intellectual burden of taking advantage of
the parallelism in the architecture; see ref [8]

It is not the processor that automatically exploits the
numerous, parallel computing modules; the microprocessor
needs to be told!
Intel® Itanium® Architecture
 As a consequence, compilers for IPF are highly
complex; see Donald Knuth’s comment, ref [7]
 Compiler complexity is not desirable, as that means
more errors, decreased object code quality,
something a new architecture should avoid
 On the other hand, the IPF has provided explicit
architectural features that enable implementing
highly optimizing compilers
 A case in point is architectural support for software
pipelined loops (SW PL)
 Certain source constructs let the compiler emit SW
PL loops that need no prologue and epilogue
 Absence of Prologue and Epilogue not only renders
the object code more compact, but also faster
Intel® Itanium® Architecture
 Parallel means an Itanium processor gains speed
not solely via high clock rates, but via simultaneous
execution of multiple operations in one clock cycle
 Key concepts refined, or newly introduced, in IPF
include: predication, branch prediction, branch
elimination, conditional move, speculation, parallel
comparisons, and a large register file
 The first implementation of the new 64-bit Intel + HP
Itanium architecture implemented only 44 physical
address bits of the 64 logical address bits
Intel® Itanium® Architecture
 With 44 bits, the total initial address range of first
Itanium HW was only about a millionth of the logical
address range, but still 4000 times larger than earlier
32-bit architecture
 In its second generation, 56 physical bits of the 64-bit
logical address space were implemented in HW
 Product name of that new version: Itanium® 2
 Short-term, no severe limitations were expected with
restricted 56-bit addresses
 Still about 16 million times larger than 32-bit
addressing space
 Integer type operands are of course full 64 bits wide
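The ratios quoted above follow directly from powers of two. A minimal sketch of that arithmetic, in Python purely for illustration; the constant and function names are assumptions made for this example, and the bit counts are the ones stated on the slides:

```python
# Address widths as quoted on the slides.
LOGICAL_BITS = 64     # logical address width
MERCED_BITS = 44      # physical address bits, first-generation Itanium
ITANIUM2_BITS = 56    # physical address bits quoted for Itanium 2
LEGACY_BITS = 32      # earlier 32-bit architectures


def size_ratio(larger_bits: int, smaller_bits: int) -> int:
    """Return how many times a 2**larger_bits space exceeds a 2**smaller_bits space."""
    if larger_bits < smaller_bits:
        raise ValueError("larger_bits must not be below smaller_bits")
    return 2 ** (larger_bits - smaller_bits)


merced_vs_32bit = size_ratio(MERCED_BITS, LEGACY_BITS)      # "4000 times larger"
logical_vs_merced = size_ratio(LOGICAL_BITS, MERCED_BITS)   # "about a millionth"
itanium2_vs_32bit = size_ratio(ITANIUM2_BITS, LEGACY_BITS)  # "about 16 million times"

print(merced_vs_32bit, logical_vs_merced, itanium2_vs_32bit)
```

The rounded figures on the slide ("4000", "a millionth", "16 million") correspond to 2^12, 2^-20 and 2^24.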
Intel® Itanium® Architecture

Unlike earlier parallel VLIW architectures, EPIC does not use a
fixed width instruction encoding

Instead, operations can be combined to execute in
parallel, from a single instruction to as many as desired

What is critical in EPIC is that all code is written assuming
parallel semantics within a group (to be explained later), and
sequential semantics across groups

To be able to run in parallel, the machine is built with multiple
execution modules that can all work at the same time

This allows a natural architecture migration from say, 6 HW
modules executing on today’s Itanium, to as many as can be
crammed into a future silicon microprocessor a few years from
now
Intel® Itanium® Architecture

To illustrate with a sample taken from ref [1], consider 2 memory
operands a and b to be swapped

    temp := a;   // a, b, temp are memory locs
    a    := b;
    b    := temp;

The semicolon operator ‘;’ implies sequential semantics. On a
machine with parallel semantics, it would be sufficient to write

    a := b,      // operand latching needed
    b := a;      // operand latching needed

with the comma operator ‘,’ implying parallel semantics, similar
to syntactic conventions in the programming language Algol-68

This source snippet is just a generic example; NOT a sample of
the Itanium assembly language
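The difference between the two semantics can be tried out in any language with simultaneous assignment. The following is only a sketch in Python (not Itanium code; all function names are made up for this illustration): tuple assignment reads both right-hand sides before writing, which plays the role of the operand latching mentioned above.

```python
def swap_sequential(a, b):
    """Sequential semantics: a temporary is required."""
    temp = a
    a = b
    b = temp
    return a, b


def swap_parallel(a, b):
    """Parallel semantics: both right-hand sides are read (latched)
    before either left-hand side is written."""
    a, b = b, a
    return a, b


def swap_without_latching(a, b):
    """What goes wrong when the two assignments run one after the
    other without a temporary: the old value of a is lost."""
    a = b
    b = a
    return a, b


print(swap_sequential(1, 2))        # (2, 1)
print(swap_parallel(1, 2))          # (2, 1)
print(swap_without_latching(1, 2))  # (2, 2)
```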
Data & Memory
Data and Memory
 Native data types of IPF resemble conventional 32-bit
architectures, except for the longer 64-bit integer and
unsigned formats
 An extension over IA-32 object code is the IPF bundle
 Data types include integer, unsigned, floating-point,
and pointer
 Integers are of different widths: byte, word,
double-word, or quad-word precision
 Length in bits as well as min and max values are
listed below:
Data and Memory, Min Max

Type         Integer [bits]  Unsigned [bits]  Pointer [bits]  Float [bits]
Byte         8               8                NA              NA
Word         16              16               NA              NA
Doubleword+  32              32               Comp. 32        32, 64
Quadword+    64              64               64              64, 80

Type         Minint                      Maxint                     Minunsigned  Maxunsigned
Byte         -128                        127                        0            255
Word         -32,768                     32,767                     0            65,535
Double-word  -2,147,483,648              2,147,483,647              0            4,294,967,295
Quad-word    -9,223,372,036,854,775,808  9,223,372,036,854,775,807  0            18,446,744,073,709,551,615
Data and Memory
 Negative numbers are represented in two’s
complement format, with the sign bit in the most-significant position
 Floating-point data use the IEEE 754 standard
 Bits representing integer values are numbered from
0 in the least significant position (rightmost
position) to higher values
 For example, the most significant bit in a double
word is in position indexed 31 (Note the unusual
word definition on Intel architectures: 2 bytes)
 Maximum address on the first generation Itanium
processor (Merced) was only 17,592,186,044,415 or
2^44 - 1. It grew in the second generation to 56 bits,
and is now a full 64 bits long
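The limits of two's complement integers, the bit numbering, and the maximum 44-bit address can all be derived from the width alone. A small sketch in Python, for illustration only; the helper names are assumptions of this example:

```python
def signed_range(bits: int) -> tuple:
    """Smallest and largest value of a two's complement integer of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_max(bits: int) -> int:
    """Largest unsigned value (or highest address) representable in this width."""
    return (1 << bits) - 1


def msb_index(bits: int) -> int:
    """Index of the most significant bit when bit 0 is the rightmost bit."""
    return bits - 1


def twos_complement_pattern(value: int, bits: int) -> int:
    """Bit pattern that represents value in two's complement of this width."""
    return value & unsigned_max(bits)


for name, width in (("byte", 8), ("word", 16), ("double-word", 32), ("quad-word", 64)):
    print(name, signed_range(width), unsigned_max(width))

print(unsigned_max(44))                   # highest address with 44 address bits
print(hex(twos_complement_pattern(-1, 8)))
```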
Data and Memory
 Bytes are stored in little-endian order by default
 It is possible to programmatically select little- or
big-endian order, by setting the be bit in the user
mask, a special status register
 The be bit (for big-endian) does not affect how
instructions are stored or fetched from memory
 Object code is always represented in little-endian
order; programmer-selected endianness only
impacts data
 In little-endian order, the least significant byte of a
data object is stored in the byte with the
lowest address; conversely for big-endian order
Data and Memory
Data quad-word 0x1102030455060708 stored in 8 adjacent bytes in
memory in little-endian order:

addr:   0     1     2     3     4     5     6     7
byte:   0x08  0x07  0x06  0x55  0x04  0x03  0x02  0x11

Same int value 0x1102030455060708 stored in big-endian order:

        byte7 byte6 byte5 byte4 byte3 byte2 byte1 byte0
byte:   0x11  0x02  0x03  0x04  0x55  0x06  0x07  0x08
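The two byte orders of this quad-word can be reproduced on any machine. The sketch below uses Python's built-in int.to_bytes and int.from_bytes; it only illustrates the byte layout and says nothing about how the Itanium be bit is set.

```python
value = 0x1102030455060708

# Byte at index 0 corresponds to the lowest memory address.
little = value.to_bytes(8, byteorder="little")
big = value.to_bytes(8, byteorder="big")

print([f"0x{b:02x}" for b in little])  # least significant byte first
print([f"0x{b:02x}" for b in big])     # most significant byte first

# Reading the bytes back with the matching byte order restores the value.
assert int.from_bytes(little, byteorder="little") == value
assert int.from_bytes(big, byteorder="big") == value
```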
Itanium Registers
 The Itanium processor has 128 general registers
(GR), 128 floating-point registers (FR), 64 single-bit
predicate registers (PR), 8 branch registers
(BR), and 128 application registers (AR)
 In addition, there are Performance Monitor Data
registers (PMD), processor identifiers (CPUID), a
Current Frame Marker register (CFM), a user mask
(UM), and an instruction pointer register (IP)
 GRs, BRs, ARs, CPUIDs, IP, and PMDs are 64
bits wide; FRs are 82 bits wide
 PRs are 1 bit wide, while the UM holds 6 and the
CFM 38 bits; depicted below:
Itanium Register File
[Figure: Itanium register file. General registers gr0 … gr127
(bits 63…0); floating-point registers fr0 … fr127; predicate
registers pr0 … pr63 (1 bit each); branch registers br0 … br7
(bits 63…0); application registers ar0 … ar127, among them
ar0 – ar7 (KR0 – KR7), ar16 (RSC), ar17 (BSP), ar18 (BSPSTORE),
ar19 (RNAT), ar21 (FCR), ar30 (FDR), ar32 (CCV), ar36 (UNAT),
ar40 (FPSR), ar44 (ITC), ar65 (LC), ar66 (EC); ip; cfm (bits 37…0);
user mask um (bits 5…0); cpuid0 … cpuidn (bits 63…0);
pmd0 … pmdm (bits 63…0)]
Itanium Registers GR
 The 128 GR registers are the common workhorses
during computation
 They contain integer values being computed
 It is possible to use these integer values as machine
addresses, thus GRs can be used as pointers in
load- and store-operations
 All machine instructions can refer to these registers,
for reading and writing values
 In addition to the 64 data bits, each GR has an
associated NAT bit, which stands for Not A Thing
 NAT is 1, if the associated register has not been
initialized with valid data
Itanium Registers GR
 NATs support speculation
 For example, if a speculative load is issued but
aborted, before the value arrives in its destined GR,
the NAT state records that fact
 Enables integrity of the machine’s exception process
 There are 2 groups of GR registers:
 The first 32, GR0 through GR31, are visible to all
software, and are used to hold globally computed,
intermediate values
 However, GR0 is read-only, providing the constant 0,
64 bits long
Itanium Registers GR
 The next 96, GR32 to GR127, are used to implement a
small but frequently used portion of the top of the
run-time stack; i.e. they work like a special-purpose
top-of-stack cache
 These stack registers are made available to SW by
allocation of a register stack frame, which includes from
0 to 96 registers
 Registers not used from this subset are inaccessible
to general SW
 The stack frame portion implemented via GRs is
further partitioned into subsections, one meant to
hold local registers, the other output registers, i.e.
arguments passed by the current function to the functions it calls
Sample Stack Frame, Generic
[Figure: generic stack frame, labeled with sp, locals + temps,
stack marker, bp, and actual parameters]
Itanium Predicate Registers PR
 Execution of most IPF instructions can be
predicated by one of the PRs
 Value 1 in the PR means: the operation can
be completed normally
 PR value 0 means the result will not be
posted (committed), even if it has been
computed already. I.e. there will be no stores
and no impact on any AR of the machine
 An exception is the loop operation, an
instruction that cannot be predicated
Itanium Predicate Registers
 The PRs are also partitioned into 2 sections:
 PR0 through PR15 are static PRs
 The other 48 are so-called rotating PRs
 PR0 is an exceptional register: it can only be
read, and its value is always 1, meaning the
predicate is true; thus PR0 denotes
unconditional execution
 The 48 rotating PRs are used to hold
stage predicates, used during software
pipelining
 SW PL to be discussed in advanced computer
architecture
Branch Registers BR

IPF instructions are grouped in bundles, which are 16-byte
aligned byte sequences holding executable code. Hence their
rightmost 4 address bits will always be 0 due to alignment;
these 4 address bits don’t need to be stored explicitly

Execution of an indirect branch requires an explicit operand

On the Itanium architecture this operand is a branch register;
a branch register BR holds the branch destination

The machine then loads the value of the referenced BR into
the IP register and execution continues from there; IP stands
for Instruction Pointer

Executing branch-related instructions is about the only way to
directly affect the value in the instruction pointer, the register
that holds the address of the next bundle to be executed
Current Frame Marker Register CFM
Note: Frame Marker is often referred to as Stack Frame,
and its fixed portion as the Stack Marker
 Each function has a specific stack frame associated
with it, which is created at function invocation; it is
cleared at function return
 If all the relevant data of a function’s stack frame do
fit, they are placed in the stack of general registers;
else the overflowing data must reside in memory
 Either way, the current frame marker (CFM) holds the
frame marker for the function that is currently active
 Generally, most functions have small stack frames
Current Frame Marker Register CFM
Layout of the CFM register:

bits:   37 .. 32  31 .. 25  24 .. 18  17 .. 14  13 .. 7  6 .. 0
field:  rrb.pr    rrb.fr    rrb.gr    sor       sol      sof

Meaning of bits in CFM:

Name    Bit Field  Meaning
sof     0..6       Total size of stack frame
sol     7..13      Size of local part of stack frame, in words
sor     14..17     Size of rotating portion of stack frame. The number
                   of the rotating registers is 8 times the sor value
rrb.gr  18..24     Register rename base for GRs
rrb.fr  25..31     Register rename base for FRs
rrb.pr  32..37     Register rename base for PRs
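The CFM bit fields listed above can be extracted with shifts and masks. The following Python sketch is illustrative only: the dictionary layout and the function names are assumptions of this example, and the sample field values are arbitrary.

```python
# Field name -> (lowest bit position, width in bits), per the CFM layout.
CFM_FIELDS = {
    "sof": (0, 7),
    "sol": (7, 7),
    "sor": (14, 4),
    "rrb.gr": (18, 7),
    "rrb.fr": (25, 7),
    "rrb.pr": (32, 6),
}


def decode_cfm(cfm: int) -> dict:
    """Split a 38-bit CFM value into its named fields."""
    return {
        name: (cfm >> low) & ((1 << width) - 1)
        for name, (low, width) in CFM_FIELDS.items()
    }


def encode_cfm(fields: dict) -> int:
    """Build a CFM value from named fields; fields left out are 0."""
    cfm = 0
    for name, value in fields.items():
        low, width = CFM_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        cfm |= value << low
    return cfm


def rotating_registers(cfm: int) -> int:
    """The number of rotating registers is 8 times the sor value."""
    return 8 * decode_cfm(cfm)["sor"]


example = encode_cfm({"sof": 5, "sol": 4, "sor": 1})
print(decode_cfm(example), rotating_registers(example))
```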
Application Registers AR
Application Registers – t.b.d.:

Register     Mnemonic     Description of register
ar0 – ar7    KR0 – KR7    Kernel registers 0 .. 7
ar8 – ar15                Reserved
ar16                      t.b.d.
Instruction Pointer IP
 IPF instructions are fetched in units of bundles,
which are chunks of 16 bytes, or 128 bits
 Bundles are stored bundle-aligned
 The ip can address 18,446,744,073,709,551,616
different bytes (but only at bundle addresses)
 The rightmost 4 bits of the ip thus will always be
zero, due to the bundle-alignment
 Hence these 4 bits need not be stored on the
microprocessor silicon
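Because bundles are 16-byte aligned, the low 4 bits of a bundle address carry no information. A short Python sketch of that observation; the function names are made up for this illustration, and the 60-bit "stored" form is only a model of the idea, not a statement about the actual silicon.

```python
BUNDLE_BYTES = 16
LOW_BITS = 4  # log2(BUNDLE_BYTES)


def is_bundle_aligned(address: int) -> bool:
    """True if the rightmost 4 address bits are all zero."""
    return address & (BUNDLE_BYTES - 1) == 0


def bundle_address(address: int) -> int:
    """Address of the bundle that contains the given byte address."""
    return address & ~(BUNDLE_BYTES - 1)


def stored_ip(address: int) -> int:
    """Drop the 4 low bits that are always zero (60 bits remain)."""
    if not is_bundle_aligned(address):
        raise ValueError("instruction pointer must be bundle-aligned")
    return address >> LOW_BITS


def full_ip(stored: int) -> int:
    """Recreate the 64-bit address from the 60 stored bits."""
    return stored << LOW_BITS


ip = 0x4000000000000010
print(hex(bundle_address(0x1234)), hex(full_ip(stored_ip(ip))))
```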
Performance Monitor Data Register
 These are architecture-provided resources
that record the use of hardware modules
 Contents are read-only for SW
 But contrary to the performance monitor
registers on Intel Pentium architectures,
they are user visible on Itanium
Itanium ISA
Instruction Set Architecture
Instruction Set Architecture ISA
Parallelism, Dependences, and Groups

Itanium instructions packaged in groups can execute in
parallel; allows fast execution, if HW is available!

Assembly programmer or compiler may craft groups as
large as desired; the performance consequence is:

All operations embedded in a single group can be executed
simultaneously, in parallel, saving time over the equivalent
sequential execution

The physical silicon angle of this is: Of all operations that
could be executed in parallel only those are actually
performed in parallel, for which there exist HW resources

E.g. on an Itanium® 2 implementation of IPF, there are 6 units
available to operate in parallel
Instruction Set Architecture ISA
Parallelism, Dependences, and Groups

If fewer actions are enclosed in a group, some HW will idle

If a group contains more actions than the HW can execute at
once, then all HW elements are active, yet some degree of
possible parallelism is lost on this implementation; future HW
implementations may execute that same object code faster
due to the higher degree of parallelism

Parallel execution is not feasible if dependencies exist
between instructions

On Itanium these dependencies are not resolved by the
machine

It is the human programmer or optimizer that explicitly
tracks, what can be done in parallel, and what must be done
in sequence. The machine just runs it, goal: TO BE FAST!
Instruction Set Architecture ISA
Parallelism, Dependences, and Groups

If a result has to be computed first before it can be read
somewhere else (memory or register), a true dependence
exists, AKA data dependence; conventionally just called
“dependence”. On Itanium this is called a RAW (Read after
Write) dependence

If a result has to be read first before it can be re-computed, a
false dependence is created, AKA anti-dependence. On Itanium
this is named a WAR (Write after Read) dependence

If a result has to be computed first before it can be computed
again, assuming that an intermediate reference is possible, an
output dependence is created. Itanium calls this third
dependence a WAW (Write after Write) dependence
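The three kinds of dependence can be told apart by comparing what an earlier and a later instruction read and write. The following Python sketch is a simplified model for illustration; the function name and the representation of instructions as sets of register names are assumptions of this example.

```python
def classify_dependences(first_writes, first_reads, second_writes, second_reads):
    """Return the set of dependences of a later instruction on an earlier one.

    Each argument is a set of register (or memory location) names.
    """
    found = set()
    if first_writes & second_reads:
        found.add("RAW")  # true dependence: later reads what earlier writes
    if first_reads & second_writes:
        found.add("WAR")  # anti-dependence: later overwrites what earlier reads
    if first_writes & second_writes:
        found.add("WAW")  # output dependence: both write the same location
    return found


# A load into r14 followed by an add that reads r14: RAW.
print(classify_dependences({"r14"}, {"r3"}, {"r15"}, {"r14", "r16"}))
# Earlier reads r4, later writes r4: WAR.
print(classify_dependences({"r5"}, {"r4"}, {"r4"}, {"r6"}))
# No shared registers: the two could be placed in one group.
print(classify_dependences({"r5"}, {"r4"}, {"r7"}, {"r6"}))
```

An empty result means that, in this simplified model, nothing forces the two instructions into different groups.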
Instruction Set Architecture ISA
Parallelism, Dependences, and Groups

In all these cases, the prior operation has to complete, before
the dependent can be started; e.g.:
    ld8 r14 = [r3]       // load GR14 with 8 bytes addressed by GR3
    add r15 = r14, r16   // integer sum into GR15, RAW dep

This is an example of RAW dependence, AKA true
dependence

The loading of an 8-byte value into (8-byte) register GR14 must
complete first, before the addition of the 2 long integer values,
held in GR14 and GR16, can be started

Note the assembler register names: r14, and not gr14

This is Intel and HP assembly language convention! Another
assembler may use different conventions
Instruction Set Architecture ISA
Assembly Language Format
 Format of an Itanium assembler instruction:
[(pr)] mnemonic[.comp] dest = src1 [, src2 [, src3 ] ]
 In this meta-syntax, [ and ] brackets mean that the
bracketed portion of the instruction is optional
 In assembly syntax, square bracket pairs [] express
indirection
 Be careful not to confuse these 2 different contexts!
Meaning of the various assembly language fields:
Instruction Set Architecture ISA
Syntax    Name                Meaning
(pr)      Predicate register  Used to predicate execution; if its value is 0, the
                              result is not committed; if 1 (true), the result is
                              committed. pr0 is always 1, hence the associated
                              instructions are executed unconditionally
mnemonic  Instruction         Name of the instruction, telling the assembler which
                              operation to perform
comp      Completer           Further qualifies or completes the instruction
                              specification. There may be multiple completers per
                              instruction; not all instructions have a completer
dest      Destination         The destination of the specified instruction. Choices
                              are: register or memory
src1      Source one          Source operand. Not all instructions require a source.
                              Some instructions allow multiple sources. Sources
                              may be: immediate operands, or registers. Memory
                              can be a source via indirection (through a register)
src2      Source two          Ditto
src3      Source three        Ditto
Instruction Set Architecture ISA
Assembly Language Format

A sample assembly language instruction is shown next:
(p0) add r5 = r4, r3, 1 // (p0) can be skipped

This is an integer add instruction that sums up the integer
values in GR4 and GR3, also adds integer literal 1

Assigns sum to register GR5. Since the predicate register used
is PR0, which is always true, the commit of the sum to register
GR5 is unconditional, as if no predicate qualifier had been given

Predicate registers, when listed, are enclosed in ( ) parentheses

Not all instructions allow or need a completer. Typical
completers are shown below

Some instructions allow multiple completers, notably the
memory access instructions, and branch instructions
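The general instruction format can be made concrete with a small parser. This is only a sketch in Python of the meta-syntax [(pr)] mnemonic[.comp] dest = src1 [, src2 [, src3 ] ]; it is not how a real Itanium assembler works, it handles only instructions that contain an "=" sign, and the names INSTRUCTION and parse_instruction are assumptions of this example.

```python
import re

INSTRUCTION = re.compile(
    r"""
    ^\s*
    (?:\((?P<pr>p\d+)\)\s*)?          # optional qualifying predicate
    (?P<mnemonic>[a-z][a-z0-9]*)      # instruction name, e.g. add, ld8
    (?P<completers>(?:\.[a-z0-9]+)*)  # zero or more completers
    \s+
    (?P<dest>[^=]+?)                  # one or more destinations
    \s*=\s*
    (?P<sources>[^/]+?)               # one or more sources
    \s*(?://.*)?$                     # optional trailing comment
    """,
    re.VERBOSE,
)


def parse_instruction(line: str) -> dict:
    """Split one assembly line into the fields of the general format."""
    match = INSTRUCTION.match(line)
    if match is None:
        raise ValueError(f"not in 'dest = src' form: {line!r}")
    return {
        "pr": match.group("pr") or "p0",  # omitted predicate means p0
        "mnemonic": match.group("mnemonic"),
        "completers": [c for c in match.group("completers").split(".") if c],
        "dest": [d.strip() for d in match.group("dest").split(",")],
        "sources": [s.strip() for s in match.group("sources").split(",")],
    }


print(parse_instruction("(p0) add r5 = r4, r3, 1 // (p0) can be skipped"))
print(parse_instruction("ld8 r14 = [r3]"))
```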
Instruction Set Architecture ISA
Completer  Meaning
.a         For “advanced” load; check later if successful
.c         Check
.clr       If advanced load was not successful, clear the reg
.nc        No clear
.s         Speculative; e.g. for load; NOT allowed for store!
.many      t.b.d.
.few       t.b.d.
.excl      t.b.d.
Many more  .equ .unc etc.
Instruction Set Architecture ISA
Itanium Bundle Format
 Executable code on Itanium comes in units of
bundles. A bundle consists of 3 instructions, all
grouped with an associated template
 Template completes the instruction specification
and above all, defines group boundaries
 Boundary is also known as a stop. Stop defines
where one group ends and another group starts
 If no stop is included in a template, this means
that the bundle will be part of a larger group,
consisting of more instructions in the next bundle
Instruction Set Architecture ISA
Itanium Bundle Format
 Each instruction is 41 bits long, a template consumes
5 bits, one template per bundle
 With 3 instructions per bundle, the overall bundle
length is 3 * 41 + 5 = 128 bits, fitting into 16 bytes; all
bundle-aligned, easily accomplished due to first
bundle residing on a mod-16 memory boundary
 From then on all will be aligned on 16-byte boundaries
 With the memory bus being 128 bits wide (or wider on
future IPF implementations) and bundles being
bundle-aligned, fetching instruction memory is fast,
requiring one single transfer on the bus
Instruction Set Architecture ISA
Itanium Bundle Format
 General layout of a bundle is shown next, with bits
ordered from 0 through 127, increasing right to left

bits:   127 .. 87       86 .. 46        45 .. 5         4 .. 0
field:  instruction 2   instruction 1   instruction 0   template

 The template serves as a means for the compiler to
communicate additional information about
instructions 0, 1, and 2, without which they could be
ambiguous
 One such key piece of information is the placement
of an instruction group stop, in assembler ;;
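The bundle layout (5 template bits plus three 41-bit instruction slots) can be expressed as plain bit packing. The Python sketch below only illustrates the arithmetic 3 * 41 + 5 = 128; the function names and the sample slot values are made up and do not encode real instructions.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
BUNDLE_BITS = 3 * SLOT_BITS + TEMPLATE_BITS  # 128 bits, i.e. 16 bytes


def pack_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Combine a template and three instruction slots into one 128-bit value."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
    return (
        template                                       # bits 4 .. 0
        | (slot0 << TEMPLATE_BITS)                     # bits 45 .. 5
        | (slot1 << (TEMPLATE_BITS + SLOT_BITS))       # bits 86 .. 46
        | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS))   # bits 127 .. 87
    )


def unpack_bundle(bundle: int) -> tuple:
    """Return (template, slot0, slot1, slot2) of a 128-bit bundle."""
    slot_mask = (1 << SLOT_BITS) - 1
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot0 = (bundle >> TEMPLATE_BITS) & slot_mask
    slot1 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & slot_mask
    slot2 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & slot_mask
    return template, slot0, slot1, slot2


bundle = pack_bundle(0x01, 0x111, 0x222, 0x333)
print(BUNDLE_BITS, len(bundle.to_bytes(16, byteorder="little")), unpack_bundle(bundle))
```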
Instruction Set Architecture ISA
Itanium Bundle Format
 A group stop can occur after instruction 2, or 1, or
0, indicating an earlier group must complete
execution, before another starts
 But Itanium allows at most 2 stops in
a bundle
 If 3 stops are needed, a NOOP must be packed
into one of the instruction slots, to effectively create 2
physical groups, with the third being the NOOP,
whose execution order does not matter
 Compiler-generated code performs this workaround
automatically
Instruction Set Architecture ISA
Itanium Bundle Format
 The template specifies which types of instructions
are assembled into slot 0, 1, and 2
 IPF instructions are partitioned into the following 6
types:

Type  Meaning
A     ALU: integer or memory unit
I     Non-ALU: Integer unit
M     Memory unit
F     Floating-point unit
B     Branch unit
L+X   Extended unit, or Branch unit
Instruction Set Architecture ISA
Itanium Bundle Format
 Providing such information in the template speeds
up instruction decoding, improving execution speed
 A list with the Instruction Set Architecture (ISA)
templates and embedded stops is shown next
 Note at most 2 stops in any of the formats
 On an architecture that aims to have large groups, it
seems logical to have few stops (max 2) per bundle
Instruction Set Architecture ISA
Template #  Type      Slot 0               Slot 1                  Slot 2
0 = 0x00    MII       Memory unit          Integer unit            Integer unit
1 = 0x01    MII_      Memory unit          Integer unit            Integer unit ;;
2 = 0x02    MI_I      Memory unit          Integer unit ;;         Integer unit
3 = 0x03    MI_I_     Memory unit          Integer unit ;;         Integer unit ;;
4 = 0x04    MLX       Memory unit          L unit?                 Extended unit
5 = 0x05    MLX_      Memory unit          L unit?                 Extended unit ;;
6 = 0x06    reserved
7 = 0x07    reserved
8 = 0x08    MMI       Memory unit          Memory unit             Integer unit
9 = 0x09    MMI_      Memory unit          Memory unit             Integer unit ;;
10 = 0x0a   M_MI      Memory unit ;;       Memory unit             Integer unit
11 = 0x0b   M_MI_     Memory unit ;;       Memory unit             Integer unit ;;
12 = 0x0c   MFI       Memory unit          Floating-point unit     Integer unit
13 = 0x0d   MFI_      Memory unit          Floating-point unit     Integer unit ;;
14 = 0x0e   MMF       Memory unit          Memory unit             Floating-point unit
15 = 0x0f   MMF_      Memory unit          Memory unit             Floating-point unit ;;
16 = 0x10   MIB       Memory unit          Integer unit            Branch unit
17 = 0x11   MIB_      Memory unit          Integer unit            Branch unit ;;
18 = 0x12   MBB       Memory unit          Branch unit             Branch unit
19 = 0x13   MBB_      Memory unit          Branch unit             Branch unit ;;
20 = 0x14   reserved
21 = 0x15   reserved
22 = 0x16   BBB       Branch unit          Branch unit             Branch unit
23 = 0x17   BBB_      Branch unit          Branch unit             Branch unit ;;
24 = 0x18   MMB       Memory unit          Memory unit             Branch unit
25 = 0x19   MMB_      Memory unit          Memory unit             Branch unit ;;
26 = 0x1a   reserved
27 = 0x1b   reserved
28 = 0x1c   MFB       Memory unit          Floating-point unit     Branch unit
29 = 0x1d   MFB_      Memory unit          Floating-point unit     Branch unit ;;
30 = 0x1e   reserved
31 = 0x1f   reserved
Instruction Set Architecture ISA
Itanium Bundle Format
 The difference between above templates 0x00 and
0x01, both being MII type operations is: after
instruction 2 in template 0x01 there is a stop, while
in template 0x00 there is none
 In other words, the next bundle after the one for
template 0x00 will belong to the same group, and a
higher degree of parallelism will be possible there
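The template names encode both the unit type of each slot and the position of the stops: an underscore after a letter marks a stop after that slot. A Python sketch of that reading; the table contents are taken from the template list, while the dictionary and function names are assumptions made for this illustration.

```python
# Template number -> type name; reserved template numbers are left out.
TEMPLATES = {
    0x00: "MII", 0x01: "MII_", 0x02: "MI_I", 0x03: "MI_I_",
    0x04: "MLX", 0x05: "MLX_",
    0x08: "MMI", 0x09: "MMI_", 0x0A: "M_MI", 0x0B: "M_MI_",
    0x0C: "MFI", 0x0D: "MFI_", 0x0E: "MMF", 0x0F: "MMF_",
    0x10: "MIB", 0x11: "MIB_", 0x12: "MBB", 0x13: "MBB_",
    0x16: "BBB", 0x17: "BBB_",
    0x18: "MMB", 0x19: "MMB_",
    0x1C: "MFB", 0x1D: "MFB_",
}


def decode_template(number: int) -> list:
    """Return [(unit_letter, stop_after_slot), ...] for slots 0, 1, 2."""
    name = TEMPLATES.get(number)
    if name is None:
        raise ValueError(f"template 0x{number:02x} is reserved")
    slots = []
    for char in name:
        if char == "_":
            unit, _ = slots[-1]
            slots[-1] = (unit, True)   # stop after the preceding slot
        else:
            slots.append((char, False))
    return slots


def ends_group(number: int) -> bool:
    """True if the bundle ends with a stop, i.e. the next bundle starts a new group."""
    return decode_template(number)[-1][1]


print(decode_template(0x00), ends_group(0x00))
print(decode_template(0x01), ends_group(0x01))
```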
Instruction Set Architecture ISA
Itanium Assembly Code

A group is a sequence of 1 or more instructions delimited by
a stop. The first instruction in a whole program is thought to
be preceded by a stop

Similarly, the last instruction of a complete program is
thought to be followed by a stop

All instructions placed into a single group can be executed
in parallel. Whether or not they will depends on the number
of hardware resources available. In the initial Itanium
architecture only 6 resources were available

In a later implementation, more HW resources may become
available, thus potentially speeding up execution of the
same old, unchanged Itanium code on a future generation

The ;; indicates to the assembler where one group ends
and thus the next group starts
Instruction Set Architecture ISA
Itanium Assembly Code

Some assembly language instructions follow:

    cmp.eq p1, p2 = r33, r34

This checks general purpose registers 33 and 34 for
equality; if equal, predicate register 1 is set to true, predicate
register 2 to false. Otherwise p1 is set to false and p2 to true.
A more complicated case is:

    (p3) cmp.eq.unc p1, p2 = r33, r34

This checks if predicate register 3 is true at the start. If so, and if
registers GR33 and GR34 are equal, register p1 is set to true
and p2 to false, else the reverse

Else, i.e. if p3 is false a priori, predicate registers 1
and 2 are both set to false
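The behavior of the two compare forms can be modeled in a few lines. This Python sketch is a simplified simulation, not the architectural definition: the function names are invented here, and the case of a plain compare with a false qualifying predicate (targets left unchanged) is this example's reading of ordinary predication rather than something stated on the slide.

```python
def fresh_predicates() -> list:
    """64 predicate registers; p0 is always true."""
    predicates = [False] * 64
    predicates[0] = True
    return predicates


def cmp_eq(predicates, qp, p_true, p_false, a, b, unc=False):
    """Model of '(qp) cmp.eq[.unc] p_true, p_false = a, b'."""
    if predicates[qp]:
        predicates[p_true] = (a == b)
        predicates[p_false] = (a != b)
    elif unc:
        # .unc: with a false qualifying predicate both targets are cleared
        predicates[p_true] = False
        predicates[p_false] = False
    # plain compare with a false qualifying predicate: targets untouched
    return predicates


pr = fresh_predicates()
cmp_eq(pr, 0, 1, 2, 7, 7)            # cmp.eq p1, p2 = r33, r34 with equal values
print(pr[1], pr[2])

pr = fresh_predicates()
pr[1] = pr[2] = True
cmp_eq(pr, 3, 1, 2, 7, 7, unc=True)  # (p3) cmp.eq.unc ... with p3 false
print(pr[1], pr[2])
```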
Assembler Source Program
With & Without
Stack Unwind Operations
From ref [8]
Assembler for Hello World, With
// hello_world.c assembly with unwind directive
// sample taken from ref [8]
        .file   "hello.c"
        .pred.safe_across_calls p1-p5, p16-p63
        .section .rdata, "a", "progbits"
        .align  8
.STRING1:
        stringz "Hello World!!!\n"
        .text
        .align  16
        .global hello#
        .proc   hello#
hello:
        .prologue
        .save   ar.pfs, r34
        alloc   r34 = ar.pfs, 0, 4, 1, 0
        .vframe r35
        mov     r35 = r12
        .save   rp, r33
        mov     r33 = b0                // load branch register into GR33
        .body
        addl    r36 = @ltoff(.STRING1), gp
        ;;
        ld8     r36 = [r36]
        mov     r32 = r1
        br.call.sptk.many b0 = printf#  // b0!
        ;;
        mov     r1 = r32
        mov     ar.pfs = r34
        mov     b0 = r33                // restore branch register
        .restore sp
        mov     r12 = r35
        br.ret.sptk.many b0
        .endp   hello#
        .global printf#
        .type   printf#, @function
Assembler for Hello World, Without
// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// The string is defined in the read-only data section
        .section .rdata, "a", "progbits"
        .align  8
.STRING1:
        stringz "Hello World!!!\n"
// definition of function hello is in text section
// Registers to be saved in local registers:
// gp = r1 - loc0 = r32
// rp = b0 - loc1 = r33
// ar.pfs - loc2 = r34
// sp = r12 - loc3 = r35
        .text
        .global hello
        .proc   hello
hello:
        alloc   loc2 = ar.pfs, 0, 4, 1, 0
        mov     loc3 = sp
        mov     loc1 = b0               // save branch register b0
        addl    out0 = @ltoff(.STRING1), gp
        ;;
        ld8     out0 = [out0]           // group of 3 instructions
        mov     loc0 = gp
        br.call.sptk.many b0 = printf
        ;;
        mov     gp = loc0
        mov     ar.pfs = loc2
        mov     b0 = loc1
        mov     sp = loc3
        br.ret.sptk.many b0
        .endp   hello
        .global printf
        .type   printf, @function
Appendix:
Some Definitions
Definitions
Branch Elimination
 Replacing object code that has conditional
branches, with code that has a straight-forward
execution path, lacking branches
 The second version with branches eliminated
must be semantically equivalent to the original
code with branches
 Everything else equal, the version without
branches generally executes faster due to fewer
cache misses
Definitions
Bundle
 Group of 3 instructions plus a template, that all fit
into a 16-byte long, 16-byte aligned section of
instruction memory on Itanium
 Total number of bits = 128
Definitions
Conditional Move

Move instruction that transfers bits from source to
destination, but only if an associated condition is true

Otherwise the instruction operates like a noop

Such a move can serve as a special case of branch
elimination. For example, the C source construct:
    if ( a > 0 ) x = 99;     -- HL source program
could be mapped into the conditional move:
    cmov x, #99, a, #0, gt   -- hypothetical asm
which has no branches. Source operand #99 is moved into
memory location x only if the > condition holds between
operands a and integer literal 0
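The semantics of the hypothetical cmov can be written down as a function that returns the new value of the destination. The Python sketch below only models the meaning of the instruction (Python itself still branches internally); the function name cmov mirrors the hypothetical mnemonic and is not a real instruction or library call.

```python
import operator


def cmov(dest, src, lhs, rhs, condition):
    """New value of dest: src if condition(lhs, rhs) holds, otherwise
    dest unchanged (the instruction then acts like a noop)."""
    return src if condition(lhs, rhs) else dest


x = 1
a = 5
x = cmov(x, 99, a, 0, operator.gt)   # a > 0 holds, so x takes the value 99
print(x)

y = 1
b = -5
y = cmov(y, 99, b, 0, operator.gt)   # b > 0 does not hold, y keeps its value
print(y)
```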
Definitions
Endian, Endianness
 A convention that defines in which order the
bytes of a multi-byte data object are
addressed
 Can be programmed on Itanium with the be bit
 If the higher-address byte holds the more
significant part of the value, we call this little-endian
 typical on Intel x86 architecture
 The other way around we call big-endian ordering
 typical on IBM 370 architecture
Definitions
EPIC
 Explicitly Parallel Instruction Computing, with IPF
being the first commercial architecture that
implements EPIC
 Note IPF’s ability to also execute old Intel x86 and
old HP PA object code
Definitions
Epilogue
 When the steady state of a software pipelined loop
completes, there may be yet to be used operands
and operations to be computed that would not fit
into the steady state
 These last operands must be consumed, some
even be generated during the epilogue, and
ultimately the pipeline must be drained
 This is accomplished in the object code after the
steady state, and that portion of code is called the
epilogue
 See also prologue
Definitions
Group
 A sequence of instructions, each with an
associated template and a defined stop
 A group is composed of one bundle or more
 The stop means, the hardware cannot start
executing any subsequent group, until the current
group has completed
 Syntax notation for stop in Itanium assembler is
the double-semicolon ;;
Definitions
Parallel Comparison
 A composite source program condition of the form
( ( a > b ) && ( c <= d ) )
requires multiple steps to compute a boolean predicate
 Generally, on a sequential architecture these multiple steps
are combined via explicit instructions for anding and oring,
or else the flow of control selects a matching
true label. All this takes time
 The Itanium processor allows parallel evaluation of certain
composite boolean expressions in one single step
 The result can be used as a predicate in subsequent
instructions. Notice that such combined boolean
expressions must be side-effect free
 Parallel comparison is not equivalent to C’s short-circuit
evaluation of complex boolean expressions!
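A simplified model of the idea, loosely following and-type compares that all target one predicate: the predicate is preset to true, every comparison is evaluated, and any false comparison clears it. The function name `parallel_and` and the operand values are assumptions for illustration; real hardware performs the compares concurrently rather than in a loop.

```python
def parallel_and(conditions):
    """Model and-type parallel compares that write one shared predicate."""
    predicate = True                          # predicate is preset to true
    outcomes = [bool(c) for c in conditions]  # every compare is evaluated
    for outcome in outcomes:
        if not outcome:
            # each false compare writes the same value (false), so the
            # order in which the compares complete does not matter
            predicate = False
    return predicate

a, b, c, d = 5, 3, 2, 2
p = parallel_and([a > b, c <= d])
print(p)
```

Because every writer can only store the same value, no ordering among the compares is needed, which is what allows them to share one instruction group.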
Definitions
Parallel Comparison, Cont’d
 For example, another complex boolean expression
( fun( j, k ) && ( i < MAX ) )
cannot be mapped into a parallel EPIC comparison,
since one operand is a function call fun( j, k ) with
a possibly large number of parameters, which may
have a side-effect on one of the other operands,
for example “i”, which is yet to be compared
 This type of boolean expression is mapped into
sequential code
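Why the side-effect matters can be shown with a small sketch. Everything here is hypothetical: `fun` is given an invented side-effect on a global operand `i`, and `eager_parallel` merely imitates sampling both operands at once.

```python
MAX = 10
state = {"i": 9}              # hypothetical global operand "i"

def fun(j, k):
    """Hypothetical function with a side-effect on i."""
    state["i"] += j + k
    return True

def sequential(j, k):
    # source order: call fun first, then compare the updated i
    return fun(j, k) and (state["i"] < MAX)

def eager_parallel(j, k):
    # both operands sampled together: i is read before the side-effect lands
    i_before = state["i"]
    f = fun(j, k)
    return f and (i_before < MAX)

state["i"] = 9
seq_result = sequential(1, 1)
state["i"] = 9
par_result = eager_parallel(1, 1)
print(seq_result, par_result)
```

The two evaluations disagree for the same inputs, so such an expression has to keep its sequential order.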
Definitions
Predication
 Predication is the association of a boolean condition with
the execution of an instruction sequence. This allows the following:
 Two instruction streams can be executed in parallel, clearly
requiring multiple hardware modules; these are provided on EPIC
 Both streams have a predicate associated with their
operations. Only the stream with the true predicate is
actually retired; the other will be aborted and ignored
 Abort can happen as soon as the predicate is known. This
means the computation of the predicate can proceed in
parallel with the execution of the two code streams, but must
complete by the time these 2 code streams wait to learn
which one is the winner
 An ISA with predication requires instruction bits to name the
predicate to use, and to select which direction (true? or false?) applies
 Also, the discarded code path must not cause any side-effect,
such as a write to memory!
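The retire-or-discard behavior can be imitated in a few lines. This is a sketch of the concept only; `run_predicated`, `if_converted_max`, and the register names are assumptions for this example, and the two "streams" are executed one after the other rather than truly in parallel.

```python
def run_predicated(instructions, predicates, registers):
    """Commit only the instructions whose qualifying predicate is true."""
    for qp, dest, operation in instructions:
        result = operation(registers)   # both streams are computed
        if predicates[qp]:              # only the true stream is retired
            registers[dest] = result
    return registers

def if_converted_max(a, b):
    """Branch-free form of: if (a > b) m = a; else m = b;"""
    regs = {"a": a, "b": b, "m": None}
    cond = a > b
    preds = {"p1": cond, "p2": not cond}    # complementary predicate pair
    code = [
        ("p1", "m", lambda r: r["a"]),      # (p1) m = a
        ("p2", "m", lambda r: r["b"]),      # (p2) m = b
    ]
    return run_predicated(code, preds, regs)["m"]

print(if_converted_max(3, 7))
```

The discarded operation only computes a value and never stores it, which mirrors the requirement that the losing path leave no side-effect behind.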
Definitions
Prologue
 Before a software pipelined loop body can be
initiated, hardware resources (e.g. registers) must
be initialized; we say the loop must be primed
 This is accomplished in the object code before the
steady state, called the Prologue
 See also epilogue
Definitions
Register File
 The IPF has a rich set of registers
 This includes 128 general purpose registers (for
integer operations), 128 floating-point-, 64
predicate-, 8 branch-, and 128 so-called
application registers
 Also a variety of special purpose registers is
visible; visible means accessible by the assembly
language program
 These include a user mask, stack marker (frame
marker), ip, processor id, and performance
monitoring registers
Definitions
Speculation
 If it is suspected, but not certain, that operand o will be used
in the future, and this operand is not readily available (not
yet in a high-speed register), and it takes long to fetch o, a
processor may initiate the fetch well before o is actually
used
 Advantage: by the time o is needed, it is already available
without delay
 Disadvantage: if the flow of control never reaches the place
where o was thought to be needed, then the speculative
fetch was superfluous
 Speculation may still be worthwhile if a) no side-effects occurred
that are harmful to program correctness, and b) the hardware
resource required to fetch o was idle anyway; then no loss!
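A minimal sketch of the trade-off, assuming a made-up memory dictionary and a `fetch` helper that stands in for a slow load. It shows only the hoisting of the fetch above the branch; it does not model how Itanium defers exceptions raised by a speculative load.

```python
memory = {0x100: 42}          # hypothetical memory contents
log = []                      # records every fetch that was issued

def fetch(address):
    """Stand-in for a slow memory load."""
    log.append(address)
    return memory[address]

def compute(take_branch, address):
    o = fetch(address)        # speculative fetch, issued before the branch
    if take_branch:
        return o + 1          # o is already available, no delay here
    return 0                  # o unused: the fetch was superfluous but harmless

used = compute(True, 0x100)
unused = compute(False, 0x100)
print(used, unused, len(log))
```

Both calls issue the fetch, but only the first one benefits from it; the second shows the superfluous case, which costs nothing as long as the load has no harmful side-effect.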
Definitions
Steady State
 The software pipelined object code executed
repeatedly, after the Prologue has completed and
before the Epilogue becomes active, is called the
Steady State
 Each iteration of the Steady State makes some
progress toward multiple iterations of the original
source loop
 See also prologue and epilogue
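Prologue, steady state, and epilogue can be seen together in a hand-pipelined loop. The sketch assumes a simple source loop `b[i] = 2 * a[i]` split into three stages (load, multiply, store); the function name and the staging are illustrative choices, not compiler output.

```python
def pipelined_double(a):
    """Compute [2 * x for x in a] with a 3-stage software pipeline."""
    n = len(a)
    if n < 2:                      # too short to pipeline; plain loop
        return [2 * x for x in a]
    b = [None] * n

    # Prologue: prime the pipeline
    loaded = a[0]                  # load     for iteration 0
    product = 2 * loaded           # multiply for iteration 0
    loaded = a[1]                  # load     for iteration 1

    # Steady state: each pass advances three source iterations at once
    for t in range(2, n):
        b[t - 2] = product         # store    for iteration t-2
        product = 2 * loaded       # multiply for iteration t-1
        loaded = a[t]              # load     for iteration t

    # Epilogue: drain the pipeline
    b[n - 2] = product             # store    for iteration n-2
    product = 2 * loaded           # multiply for iteration n-1
    b[n - 1] = product             # store    for iteration n-1
    return b

print(pipelined_double([1, 2, 3, 4, 5]))
```

On EPIC hardware the three statements of the steady state are independent of each other and could be issued in the same cycle; here they simply run in sequence.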
Definitions
Syllable
 A syllable is the instruction-only portion of a bundle
 A bundle always holds 3 instructions plus a
template, the template specifying additional
necessary information about the instructions
 The instruction alone, without the needed template
information, is a syllable
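The split of a bundle into template and syllables can be sketched with bit operations. The widths used here (a 128-bit bundle with a 5-bit template in the low bits followed by three 41-bit slots) come from the Itanium architecture manuals rather than from this slide, and the function names are assumptions for illustration.

```python
SLOT_BITS = 41        # assumed width of one syllable (instruction slot)
TEMPLATE_BITS = 5     # assumed width of the template field

def encode_bundle(template, syllables):
    """Pack a template and 3 syllables into one 128-bit integer."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, syllable in enumerate(syllables):
        field = syllable & ((1 << SLOT_BITS) - 1)
        bundle |= field << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle

def decode_bundle(bundle):
    """Split a 128-bit bundle into its template and 3 syllables."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    syllables = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(3)
    ]
    return template, syllables

bundle = encode_bundle(0x10, [1, 2, 3])
print(decode_bundle(bundle))
```

The three slot fields and the template field add up to exactly 128 bits, so nothing in a bundle is left unused.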