Intel Itanium Architecture
Welcome to the Presentation
Pang Kee Yeoh
Indraneel Mitra
Majid Jameel
Presentation Overview
Features of Itanium
Future of Itanium
Competition for Itanium
Intel Itanium Architecture
Itanium is a new processor family and architecture,
designed by Intel and HP with the future of high-end
servers and workstations in mind.
Features of Itanium
64-bit addressing
EPIC (Explicit Parallel Instruction Computing)
Wide Parallel Execution core
Predication
FPU, ALU and Rotating registers
Large fast Cache
High Clock Speed
Scalability
Error Handling
Fast Bus Architecture
Itanium Specifications
Physical Characteristics
– 25.4M transistors
– 0.18-micron CMOS process
– 6 metal layers
– C4 (flip-chip) assembly technology
– 1012-pad organic land grid array
– 733MHz and 800MHz initial release clock speeds
Itanium Specifications Cont…
Instruction Dispersal
– 2 bundle dispersal windows
– 3 instructions per bundle
– 9 function unit slots
– 2 integer slots
– 2 floating point slots
– 2 memory slots
– 3 branch slots
– Maximum of 6 instructions issued each cycle
Itanium Specifications Cont…
Floating Point Units
– 2 extended and double precision FMACs (Floating-point Multiply Add Calculators)
– 4 double or single precision operations per clock maximum
– 3.2 GFLOPS of peak double precision floating point performance at 800MHz
– 2 additional single precision FMACs
– 4 single precision operations per clock maximum
– 6.4 GFLOPS of peak single precision floating point performance total at 800MHz
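The peak figures above follow from simple arithmetic. The sketch below reproduces them under one assumption that the slide does not state explicitly: each FMAC is counted as two operations per clock (one multiply and one add). The function and constant names are illustrative only.

```python
def peak_gflops(ops_per_clock: int, clock_mhz: float) -> float:
    """Peak throughput in GFLOPS: operations per clock times clock rate."""
    return ops_per_clock * clock_mhz / 1000.0


CLOCK_MHZ = 800
OPS_PER_FMAC = 2   # assumption: one multiply plus one add per clock
MAIN_FMACS = 2     # extended/double precision units
EXTRA_FMACS = 2    # additional single precision units

double_ops = MAIN_FMACS * OPS_PER_FMAC                   # 4 per clock
single_ops = (MAIN_FMACS + EXTRA_FMACS) * OPS_PER_FMAC   # 8 per clock

print(peak_gflops(double_ops, CLOCK_MHZ))   # 3.2
print(peak_gflops(single_ops, CLOCK_MHZ))   # 6.4
```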
Itanium Specifications Cont…
Integer and Branch Units
– 4 single cycle integer ALUs
– 4 MMX units
– 3 branch units
Itanium Specifications Cont…
Level 3 Cache
– Off-die in two or four chips
– 2MB or 4MB
– Runs at core clock
– 4-way set associative
– Up to 294.8 million transistors
– 128-bit bus
– 21+ cycle latency
Itanium Specifications Cont…
Level 2 Cache
– On-die
– 96k of full-speed cache
– 6-way set associative
– 256-bit bus
– 6+ cycle latency
Itanium Specifications Cont…
Level 1 Cache
– On-die
– 16k instruction cache
– 4-way set associative
– 16k integer only data cache
– 2+ cycle latency
Itanium Specifications Cont…
x86 Compatibility
– Hardware decoder turns x86 instructions into EPIC instructions
– Dynamic scheduler optimizes x86 for EPIC micro-architecture
– Shared cache
– Shared execution core
64-bit addressing
EPIC processors are capable of addressing
a 64-bit memory space. In comparison, 32-bit
x86 processors access a relatively small
32-bit address space, or up to 4GB of
memory.
A 32-bit memory space may be a limiting
factor to performance. A 64-bit space gives the
Itanium the memory addressing ability
needed to meet current and foreseeable
future high-end processing needs.
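The 4GB limit and the size of the 64-bit space both come straight from the number of address bits. A minimal sketch, assuming byte addressing (the function name is illustrative):

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


GIB = 2 ** 30

print(addressable_bytes(32) // GIB)   # 4 (GB)
print(addressable_bytes(64) // GIB)   # 17179869184 (GB), i.e. 16 exabytes
# How many 32-bit address spaces fit into one 64-bit address space:
print(addressable_bytes(64) // addressable_bytes(32))   # 4294967296
```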
64-bit addressing cont…
Through bank switching, x86 processors, such
as the Intel Pentium III Xeon and the AMD
Athlon, can address more than 4GB of
memory. Unfortunately, there is hardware
and software overhead to bank switching that
harms performance and increases complexity.
64-bit addressing cont…
The first generation of Itanium systems, using the
460GX chipset, will be expandable with up to 64GB
of memory. Generations beyond that will be able to
take more memory. Higher end Itanium systems
designed by the likes of SGI, IBM and HP should
eventually be able to take far more than 64GB.
While it may be hard to imagine 4GB or even 64GB
of memory being a bottleneck to performance,
when one considers SGI has mentioned plans to
eventually build machines using 512 Itanium
processors accessing more than a terabyte of data
in main memory, 64GB of memory, let alone 4GB,
begins to look rather small.
EPIC
EPIC is a new computer architecture standard that Intel
has introduced with its new Itanium architecture.
Previously, computer architectures fell into three
classes: RISC, CISC and VLIW.
EPIC uses complex instructions in addition to
basic instructions. A complex instruction
includes information on how to run it
in parallel with other instructions.
EPIC instructions are put together by the
compiler into a group of three called a bundle.
Bundling
EPIC cont…
A bundle is a three-instruction-wide word that improves
instruction level parallelism. Each bundle contains three
instructions and a template field, which are set during
code generation by a compiler or the assembler.
Bundles are then sent to the CPU.
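To make the bundle format concrete, here is a small sketch of packing three instructions and a template into one word. The slides do not give field widths; the sketch assumes the IA-64 layout of a 128-bit bundle with a 5-bit template in the low bits followed by three 41-bit instruction slots. The function names are hypothetical, and the slot values are arbitrary numbers, not real encodings.

```python
TEMPLATE_BITS = 5   # assumed width of the template field
SLOT_BITS = 41      # assumed width of each instruction slot


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = tuple((word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
                  for i in range(3))
    return template, slots


bundle = pack_bundle(0x10, (1, 2, 3))
print(unpack_bundle(bundle))   # (16, (1, 2, 3))
```

Note that 5 + 3 × 41 = 128, so the three slots and the template fill the word exactly.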
Bundles in the CPU are put together in an instruction
group with other instructions.
An instruction group is a set of instructions which do
not have “read after write or write after write
dependencies between them and may execute in
parallel.” This means that the instructions in a group do not
affect the data the others are working on, so they can run
together without getting in each other's way.
EPIC cont…
In any given clock cycle, the processor
executes as many instructions from one
instruction group as it can according to
resources.
An instruction group must contain at least
one instruction but the number of instructions
in an instruction group is not limited.
An instruction group ends either statically, at a
cycle break placed by the compiler, or dynamically
at run time, when a branch is taken.
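The rule for forming instruction groups can be sketched in a few lines. This is a simplified model, not Intel's algorithm: it walks a straight-line instruction list in order and starts a new group (in effect, places a cycle break) whenever the next instruction would read or write a register already written in the current group. The `Instr` type and register names are made up for the illustration.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    writes: Tuple[str, ...]   # registers written
    reads: Tuple[str, ...]    # registers read


def split_into_groups(instrs: List[Instr]) -> List[List[Instr]]:
    """Greedily split a list into groups free of RAW and WAW dependencies."""
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    written = set()           # registers written so far in the current group
    for ins in instrs:
        raw = any(r in written for r in ins.reads)    # read after write
        waw = any(w in written for w in ins.writes)   # write after write
        if raw or waw:
            groups.append(current)                    # cycle break goes here
            current, written = [], set()
        current.append(ins)
        written.update(ins.writes)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("add r1 = r2, r3", ("r1",), ("r2", "r3")),
    Instr("add r4 = r5, r6", ("r4",), ("r5", "r6")),
    Instr("add r7 = r1, r4", ("r7",), ("r1", "r4")),   # needs r1 and r4
]
for group in split_into_groups(program):
    print([ins.name for ins in group])
```

The first two adds are independent and share a group; the third reads their results, so it falls into a second group.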
EPIC cont…
In addition to grouping instructions into
bundles, the compiler handles several other
important tasks that improve efficiency,
parallelism and speed.
CISC puts most of the burden of scheduling
instructions onto the CPU hardware. RISC gives
some of this responsibility to the compiler. VLIW
gives even more importance to the compiler.
EPIC improves on previous technology by
adding branch hints, register stack and rotation,
data and control speculation and memory hints.
It also uses branch prediction.
Predication
Predication is a compiling technique that optimises or
removes branching code by rewriting it so that
much of the code runs in parallel.
It minimises the time it takes to run if – then –
else situations and uses processor width to
run both the ‘then’ and ‘else’ paths in parallel.
When the ‘if’ condition is determined, the
result of the incorrect path is discarded.
By removing branches and making code more
parallel, predication reduces the number of
cycles it takes to complete a task while
making use of a wide processor.
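The technique described above (usually called predication) can be illustrated by rewriting a small if – then – else without a branch. This is only a model of the data flow: Python does not run the two paths in parallel, and the function names are invented for the example. On a predicated machine, the compare would set a pair of predicates and each path's instructions would be guarded by one of them.

```python
def abs_diff_branching(a: int, b: int) -> int:
    """Conventional version: the processor must choose one path."""
    if a > b:
        return a - b
    else:
        return b - a


def abs_diff_predicated(a: int, b: int) -> int:
    """Branch-free version: both paths are computed, one result is kept."""
    p_then = a > b          # the compare produces a predicate...
    p_else = not p_then     # ...and its complement
    then_result = a - b     # 'then' path, computed unconditionally
    else_result = b - a     # 'else' path, computed unconditionally
    # Only the result whose predicate is true survives; the other is discarded.
    return p_then * then_result + p_else * else_result


print(abs_diff_branching(7, 3), abs_diff_predicated(7, 3))   # 4 4
print(abs_diff_branching(2, 9), abs_diff_predicated(2, 9))   # 7 7
```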
Predication
According to Jerry Huck of HP:
“Imagine that you are walking into the bank. You will
make either a deposit or a withdrawal. The teller may
predict you will make a withdrawal as they know you
usually do, so they fill out a withdrawal form as you get
in line. If you get to the front and make a withdrawal,
all is well, but if you are there to make a deposit, the
teller then has to fill out the deposit slip and the time it
takes to complete the transaction increases.
With predication, the teller is ambidextrous and, when
you get in line they fill out both a withdrawal and a
deposit slip, so that when you get to the front, no matter
what task you intend on doing, the process will run
without a hitch.”
Predication cont…
In the metaphor, predication is the teller’s
knowledge that they should fill out both the
deposit and withdrawal form before they know
exactly what you want. The teller’s ambidexterity,
the ability to fill out both forms at once, is akin to
the ability of an EPIC processor to run instructions
in parallel. Predication removes the penalty of if –
then – else and allows the if – then – else process to
run with as few steps as possible.
A side benefit of predication is that the removal of
branches causes fewer branch mispredictions. A branch
misprediction requires the pipeline to be flushed,
which costs many cycles.
Predication reduces wasted processor time.
Wide Parallel Execution core
Itanium processors are very wide.
They are intended to run multiple instructions
and operations in parallel.
Itanium processors will be deep with a ten stage
pipeline.
The first generation Itanium processor will be
able to issue six EPIC instructions in parallel every
clock cycle.
The six-issue (two-bundle) scheduler disperses
instructions into nine functional slots: two
integer slots, two floating-point slots, two
memory slots and three branch slots.
Wide Parallel Execution core cont…
This limits the number of each type of
instruction that can be issued in a single clock
cycle. If one or more instructions cannot be executed
because too many slots of one type are filled, they
are delayed until the next cycle.
This means that good compiler design is
crucial to how well the Itanium performs.
Backing up the Itanium's six-issue scheduler are
eleven execution units: four integer, two
floating-point, three branch and two load/store
units.
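The effect of the slot limits can be shown with a toy issue model. It assumes the slot counts from the specification slide (2 integer, 2 floating-point, 2 memory, 3 branch), a maximum of six instructions per cycle, and strictly in-order issue that stops for the cycle at the first instruction whose slot type is full. Real dispersal works on whole bundles and is more subtle; the names here are illustrative.

```python
# Slots available per cycle: I = integer, F = floating point,
# M = memory, B = branch (counts taken from the specification slide).
SLOTS_PER_CYCLE = {"I": 2, "F": 2, "M": 2, "B": 3}
MAX_ISSUE = 6


def cycles_to_issue(stream: str) -> int:
    """Count cycles needed to issue a string of instruction types in order."""
    cycles = 0
    i = 0
    while i < len(stream):
        free = dict(SLOTS_PER_CYCLE)   # every slot is free at cycle start
        issued = 0
        # Issue in order until the machine is full or the needed slot is taken.
        while (i < len(stream) and issued < MAX_ISSUE
               and free[stream[i]] > 0):
            free[stream[i]] -= 1
            issued += 1
            i += 1
        cycles += 1
    return cycles


print(cycles_to_issue("MMIIFF"))   # 1: a well-mixed stream fills one cycle
print(cycles_to_issue("IIIIII"))   # 3: only two integer slots per cycle
```

The same six instructions take one cycle or three depending on their mix, which is why instruction scheduling by the compiler matters so much.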
Wide Parallel Execution core cont…
This helps support the various EPIC
instructions that can launch more than one
operation in a single instruction, such as
SIMD floating-point operations.
Combined with the EPIC instruction set, the
Itanium can execute up to 20 operations in a
single cycle when doing some floating-point
intensive tasks.
FPU, ALU and Rotating Registers
FPU
– The Itanium contains two pipelined FMAC
(Floating Point Multiply Add Calculator)
units for extended and double precision, good
for 3.2GFLOPS of peak double-precision
performance at 800MHz. There are an additional
two FMACs tuned for 3D applications. They are each
capable of processing up to two single-precision
floating-point operations per clock.
That yields another 3.2GFLOPS of single-precision
processing power. All together, the
Itanium has a theoretical max of 6.4GFLOPS of
single-precision floating point processing
power.
FPU, ALU and Rotating Registers cont…
ALU
– There are four pipelined ALUs (Arithmetic
Logic Unit) in the original Itanium. Each
can process one integer calculation per
cycle. They can also process MMX type
instructions. While the Itanium has the
potential to be a massive floating-point
powerhouse, its integer performance also
has tremendous potential.
FPU, ALU and Rotating Registers cont…
Plentiful Registers
– The Itanium will come with 128 floating point
and 128 integer registers. When processing up to
20 operations in a single clock, the registers give
plenty of room for data inside the processor. This
reduces the chances of the execution of an
instruction being delayed because data could not
be held locally. This is especially important since
the Itanium can process up to eight floating-point
operations in a single clock. With the possibility
of eight operations running in a single clock,
having too few registers could be a serious
bottleneck.
FPU, ALU and Rotating Registers cont…
The registers also have the ability to rotate.
Rotating registers allow the processor to perform
an operation on multiple software-accessible
registers in turn.
This increases CPU pipeline utilization and
efficiency when dealing with streams of data to
process.
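A toy model may help show why rotation suits streams of data. It assumes the usual scheme in which software names stay fixed while a rotating base shifts the mapping, so a value written to register 0 in one loop iteration is visible as register 1 in the next. The class, its method names and the four-register size are hypothetical, chosen only for the illustration.

```python
class RotatingRegisterFile:
    """Simplified rotating register file: logical names map to physical
    registers through a base that moves on every rotation."""

    def __init__(self, size: int):
        self.size = size
        self.physical = [0] * size
        self.base = 0                      # rotating base (illustrative name)

    def _index(self, logical: int) -> int:
        return (logical + self.base) % self.size

    def read(self, logical: int) -> int:
        return self.physical[self._index(logical)]

    def write(self, logical: int, value: int) -> None:
        self.physical[self._index(logical)] = value

    def rotate(self) -> None:
        # After a rotation, what was logical register n is logical n + 1.
        self.base = (self.base - 1) % self.size


def pipelined_double(data):
    """Two-stage software pipeline: stage 1 'loads' into r0, stage 2 doubles
    the value loaded one iteration earlier, which has rotated into r1."""
    regs = RotatingRegisterFile(4)
    out = []
    for i in range(len(data) + 1):         # one extra pass drains the pipeline
        if i < len(data):
            regs.write(0, data[i])         # stage 1 for element i
        if i >= 1:
            out.append(2 * regs.read(1))   # stage 2 for element i - 1
        regs.rotate()
    return out


print(pipelined_double([1, 2, 3]))   # [2, 4, 6]
```

Both stages use the same register names in every iteration, so overlapping iterations need no extra copy instructions.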
Large Fast Cache
When a processor is waiting for data or
instructions, time is wasted. The longer it
takes for data and instructions to get to the
CPU, the worse it gets. When data and
instructions are in cache, the processor can
grab them much quicker than when having to
go to slow main memory. Not only is cache
latency much lower than DRAM latency, the
bandwidth is much higher.
Large Fast Cache cont…
There are some tricky programming
techniques in use out there to keep often-used
data and instructions in cache, and they are
not the kind of techniques you learn in your
high school BASIC course.
Still, the easiest way to keep data and
instructions in cache is to have a lot of cache
to keep them in. Intel knew that when they
designed the Itanium.
Large Fast Cache cont…
The Itanium has three levels of cache. L1 and
L2 are on-die while L3 is on cartridge.
According to Intel, the L3 cache weighs in at
2MB or 4MB of four-way set associative cache
on two or four 1MB chips.
IDC reports that the L2 cache is 96k in
size, and that the L1 cache, which does not deal
with floating point data, is split into a 16KB integer
data cache and a 16KB instruction cache.
Large Fast Cache cont…
The level three cache (294.8 million transistors
in its 4MB form) runs at the full processor speed,
giving 12.8GBps of cache bandwidth at
800MHz over its 128-bit bus.
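The 12.8GBps figure can be checked from the 128-bit L3 bus width given in the specifications. A minimal sketch, assuming one transfer per clock and 1GB = 10^9 bytes (the function name is illustrative):

```python
def bus_bandwidth_gbps(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in GB/s for a bus moving one word per clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz / 1000.0


print(bus_bandwidth_gbps(128, 800))   # 12.8
```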
With 2MB or 4MB of L3 cache on the Itanium,
the chances of the required data and
instructions being in cache are quite good,
bus traffic can be reduced, and performance
increases. With six pipelines hungry for
instructions and data, the Itanium needs all
the cache it can get.
Large Fast Cache cont…
To make caching even more effective, Intel uses data
speculation and cache hints. Data speculation means
loading data early, before it is certain that the data
will be needed or that it will not be changed before it
is used. If the data is needed and has not changed, the
CPU does not have to take a latency hit from
fetching it.
The processor, with the help of compiled instructions,
looks ahead, anticipates what info it may need, and
then brings it to cache or into the processor. This helps
hide memory latency. Cache hints are two-bit markers
for memory loads set by the compiler that help the
CPU find data in cache. This improves the speed of
retrieving data from cache.
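Data speculation as described above can be modelled with a small table that remembers which early loads are still valid. The sketch is loosely patterned on the advanced-load-and-check idea in the Itanium architecture, but the class, its methods and the dictionary used as memory are all hypothetical simplifications.

```python
class AdvancedLoadTable:
    """Toy model: track early (speculative) loads and invalidate them
    when a later store hits the same address."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.entries = {}                  # register name -> watched address

    def advanced_load(self, reg: str, addr: int) -> int:
        """Load early and remember that reg holds a speculative copy."""
        self.entries[reg] = addr
        return self.memory[addr]

    def store(self, addr: int, value: int) -> None:
        """Write memory; any speculative copy of this address is now stale."""
        self.memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, reg: str, addr: int, speculative_value: int):
        """Return (value, reloaded). Reload only if the entry was invalidated."""
        if self.entries.get(reg) == addr:
            return speculative_value, False    # speculation held, no latency
        return self.memory[addr], True         # must fetch the fresh value


mem = {0x10: 5, 0x20: 7}
table = AdvancedLoadTable(mem)
early = table.advanced_load("r4", 0x10)        # hoisted above the stores

table.store(0x20, 9)                           # unrelated store
print(table.check_load("r4", 0x10, early))     # (5, False)

table.store(0x10, 6)                           # conflicting store
print(table.check_load("r4", 0x10, early))     # (6, True)
```

In the common case the early value is still good and the load latency is hidden; only a conflicting store forces a reload.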
Clock Speed
The first generation of Itanium processors will
come in the first half of 2001 at 733MHz and
800MHz. The first generation's clock speed may
not be particularly quick, but Intel has several
generations ahead of the Itanium already in the
works that should increase performance.
Intel claims they have plenty of clock headroom
in the Itanium design and are aiming for a
greater than 1GHz clock speed with their second
generation Itanium processor, McKinley, which
will have the L3 cache on-die.
Scalability
The Itanium was not designed for small
systems; it is intended for workstations and
servers with 1 to 4000 processors.
There are several Itanium features designed
to help with hardware scalability: a full-CPU-speed
Level 3 bus, a large L3 cache, deferred-transaction
support and flexible page sizes.
Scalability cont…
The full-CPU-speed Level 3 bus provides quick
communication between CPUs. The large L3
cache reduces inter-CPU bus traffic by keeping
data close to the CPU that needs it.
Deferred-transaction support can stop one CPU
from getting in the way of another. Flexible
page sizes, from 4KB to 256MB, give the
Itanium family the flexibility to access small
amounts of memory in small chunks and
massive amounts of memory in massive chunks
without the overhead of smaller page sizes.
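The benefit of large pages is easy to quantify: the number of pages, and so the number of translations the system has to track, drops sharply. A minimal sketch using the 64GB first-generation memory limit mentioned earlier as the example region (the function name is illustrative):

```python
KB, MB, GB = 2 ** 10, 2 ** 20, 2 ** 30


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages required to cover a region (rounded up)."""
    return -(-region_bytes // page_bytes)   # ceiling division


print(pages_needed(64 * GB, 4 * KB))     # 16777216 pages of 4KB
print(pages_needed(64 * GB, 256 * MB))   # 256 pages of 256MB
```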
Scalability cont…
The first generation Itanium chipset, the
460GX, will support up to four processors,
and OEMs will be able to build eight-way
and larger systems.
Successive generations of chipsets should be
successively more scalable. Third party
solutions should also increase scalability.
Error Handling
The Itanium will have extensive error
handling capabilities. It features ECC and
parity error checking on most processor
caches and busses.
If a machine error occurs and a piece of data
becomes corrupted, the ECC or parity
checking will allow the machine to recognize
the error, fix it if possible, or flag it as
corrupted.
The processor also has the capability to kill an
application or thread that has experienced a
machine error without having to reboot.
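The "recognize the error, fix it if possible" behaviour can be illustrated with a textbook error-correcting code. The sketch uses a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word; it is chosen for brevity and is not a description of the ECC scheme Intel actually uses in the Itanium caches or busses.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 bits: [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4        # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code):
    """Return (corrected codeword, error position or 0 if none found)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4      # syndrome = 1-based error position
    if position:
        c[position - 1] ^= 1             # flip the faulty bit back
    return c, position


def hamming74_data(code):
    """Extract the 4 data bits from a 7-bit codeword."""
    return [code[2], code[4], code[5], code[6]]


word = hamming74_encode([1, 0, 1, 1])
damaged = list(word)
damaged[4] ^= 1                          # simulate a single-bit machine error
fixed, where = hamming74_correct(damaged)
print(where, hamming74_data(fixed))      # 5 [1, 0, 1, 1]
```

A plain parity bit, by contrast, can only detect that an odd number of bits flipped; it cannot say which one, so the data can be flagged as corrupted but not repaired.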
Error Handling cont…
Chipset, OS, and system designers, which
will include the likes of HP, IBM, Compaq,
SGI, Microsoft and Intel, will bring out their
own error handling and reliability processes
that should further enhance Itanium-based
server uptime to 99.9% and beyond.
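To put the uptime target in perspective, an availability percentage converts directly into allowed downtime per year. A minimal sketch, assuming a 365-day year (the names are illustrative):

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


print(round(downtime_hours_per_year(99.9), 2))    # 8.76
print(round(downtime_hours_per_year(99.99), 2))   # 0.88
```

So 99.9% uptime still allows almost nine hours of outage a year, which is why "and beyond" matters for servers.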
Fast Bus Architecture
A major link in the food delivery system for
the Itanium is the system bus. The Itanium
will use a 2.1GBps multi-drop system bus to
keep it well fed with data and instructions. We
expect it will have a 128-bit 133MHz bus.
The memory subsystem and I/O will be
determined by the chipset used. First
generation systems should use dual-memory
ported SDRAM giving 4.2GBps of memory
bandwidth. Later generations will have the
option to use DDR SDRAM or RDRAM.
Fast Bus Architecture cont…
Eventually, Intel plans on moving server
platforms to DDR II. 64-bit, 66MHz PCI and
AGP Pro (4x) should be common on Itanium
motherboards, and support will be included
in Intel's 460GX chipset.
Itanium Roadmap
Future
According to Intel, the EPIC architecture was
designed with about 25 years of headroom for
future development in mind.
McKinley will follow the original Itanium and
will integrate its L3 cache onto the CPU die.
McKinley will arrive in the first half of 2002.
Madison may also arrive in 2002 on a .13-micron
process. Deerfield will arrive not long after, also
on a .13 process, at a lower price and
performance level but with more performance
for the dollar than Madison.
Future cont…
Furthermore it will offer larger amounts of
L3 cache.
Deerfield will be positioned as a value part
alongside Madison, much as the Celeron is
positioned relative to the Pentium III today. It might be
the CPU targeting consumer desktops.
Competition
Sun UltraSPARC
IBM PowerPC
Compaq’s Alpha
AMD’s Sledgehammer
Competition
Sun UltraSPARC
In 1995, 8 years after the first SPARCstation
was introduced, Sun went 64-bit with the
introduction of the UltraSPARC 1 RISC
processor. The first model ran at 143MHz and
had 128-bit datapaths.
In 1996, it became the first 64-bit CPU to
incorporate multimedia extensions to handle
complex 2D/3D graphics.
Competition cont…
In 1997, the UltraSPARC 2 was released at
250MHz, while the UltraSPARC 3 (with new 256-bit
datapaths) was released in the second quarter
of 2000. Other plans include an UltraSPARC 4,
which will be pumped up to 1GHz, and an
UltraSPARC 5, which will run at 1.5GHz.
Competition cont…
The UltraSPARC 2 was designed using a 0.25
micron process, while the UltraSPARC 3
employed a 0.18 micron process.
As a side issue, Sun believes that its
UltraSPARC 3, 4 and 5 will be ahead of
Itanium before it arrives because the binary
application code written for the UltraSPARC
2 will run unmodified on the later series,
making the transition easy.
Competition cont…
There are several arguments that Sun puts forth
against the Itanium:
– Sun has been supplying 64 bit solutions since 1995
and has ironed out its bugs. While Itanium may be
arriving soon, the testing of applications and
enterprise solutions could well take much longer.
– Furthermore, Sun produces Solaris (which has
been a true 64-bit operating system since 1999), so it has the
necessary experience in ironing out problems.
Competition cont…
– Lastly Sun claims that its Visual Instruction Set
can be used to speed up networking, I/O and
memory management by optimizing the passing
of data blocks through protocol stacks with the
special instructions.
Competition cont…
IBM PowerPC
– IBM’s PowerPC RISC processor made its
debut on the 14th of February 1990.
– In 1991, an alliance was formed between IBM,
Motorola and Apple, and the PowerPC is still
being developed for Macs today.
– PowerPCs went 64-bit in 1998 under the
codename Power3, which covers the PowerPC
604e and PowerPC RS64 processors.
– IBM’s roadmap included building a Power4 in
the last quarter of 2000, which was to run at
1GHz.
Competition cont…
Compaq’s Alpha
– While Sun’s UltraSPARC and IBM’s PowerPC
processors went 64-bit in 1995 and 1998, Digital’s
Alpha CPU has been 64-bit ever since its birth
in 1992.
– Even then the Alpha was a powerful CPU,
launched at 200MHz when the MIPS 64-bit R4000
ran at 100MHz and Intel’s 32-bit 386 only ran at
25MHz.
– Today this processor belongs to Compaq.
Competition cont…
– The AlphaServer SC can run from 64 to 512
processors. Furthermore, an AlphaServer SC
would form the largest supercomputer in
Europe, running 2500 Alpha EV67 CPUs, and
would handle 5 trillion instructions per second.
– Accordingly, it is believed that Alphas
would be more appropriate for huge number-crunching
scientific and multimedia/entertainment applications.
Competition cont…
AMD’s Sledgehammer
– At the Microprocessor Forum on 5th October
1999, AMD announced details of its 64-bit
processor, codenamed Sledgehammer.
– AMD plans to extend Intel’s
original x86 instruction set to include a 64-bit mode.
This is to maintain compatibility with 32-bit
apps while benefiting from a 64-bit platform.
Competition cont…
– Sledgehammer will also employ AMD’s future
system bus, named Lightning Data Transport
(LDT). LDT is an internal chip-to-chip
interconnect that can deliver up to 6.4 Gbits/sec
of bandwidth, roughly 24 times faster than
current 266Mbits/sec system interconnects.
– Finally, the Sledgehammer’s unique selling
point is that it will be the only chip with full
native x86 32-bit and 64-bit compatibility. Other
64-bit chips may offer some kind of x86
compatibility, but according to AMD, each of
these relegates the x86 instructions to
second-class status.
Conclusion
The Itanium is a complex, bleeding-edge,
forward-looking processor family that holds
promise for huge gains in processing power. The
processor uses the entirely new EPIC
architecture, which has the potential to deliver large
improvements in processor parallelism. It is all
about speed, and the Itanium has the ability to
deliver it, but the real test will come once Itanium
hits the consumer market.