AMD’s Microprocessors Architecture

AMD’s Microprocessor Model | Launch Year | Competitor Intel Model
I. K5 (four-issue out-of-order processor, first chip of the K86 family) | 1994 | 100 MHz Pentium
II. K7 (three-issue superscalar processor) | 1998 | Katmai
III. Athlon (world’s first 7th-generation x86 processor) | 2000 | Pentium III
Fields of discussion:
- Basic features of each model’s architecture,
- Instruction handling,
- Caches & memory management,
- Pipeline (number of stages, possible penalties).



Advanced Computer Architectures

Part I: AMD’s K5 - Basic Features
- Advanced four-issue superscalar core that supports speculative, out-of-order execution and register renaming.
- Static chip of 4.3 million transistors, designed for 3.3 V and implemented in AMD’s 0.5-micron, three-metal CMOS process.
- 30% faster than the Pentium at the same clock rate (2.5 times as fast as a 486). The K5 is a flexible, aggressive microarchitecture that achieves higher performance on integer code, with less emphasis on floating-point performance.
- The challenge of decoding multiple x86 instructions in parallel is met by predecoding them as they are fetched from memory into the instruction cache.
Block diagram of AMD’s K5


K5 adds predecode bits to x86 instructions before caching them.
x86 instructions are fetched 16 bytes at a time and converted into one or more microinstructions (RISC operations, or ROPs); up to four ROPs are issued per cycle.
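The predecode idea can be sketched as follows. This is only an illustration, not the K5's actual predecode format: the function names, the single "start bit" per byte, and the example instruction lengths are all assumptions. The point it shows is that once the start of each variable-length instruction is marked in the cached line, the decoders can locate several instructions at once instead of scanning byte by byte.

```python
def predecode_start_bits(instr_lengths, line_size=16):
    """Return one start bit per byte of a fetched line.

    instr_lengths: byte lengths of the consecutive instructions in the line
    (hypothetical input; real hardware derives them from the opcode bytes).
    """
    bits = [0] * line_size
    offset = 0
    for length in instr_lengths:
        if offset >= line_size:
            break  # the remaining instructions start in a later line
        bits[offset] = 1  # mark the first byte of this instruction
        offset += length
    return bits


def instruction_starts(bits, max_instrs=4):
    """Byte offsets of the first few instructions, read straight from the bits."""
    return [i for i, bit in enumerate(bits) if bit][:max_instrs]


bits = predecode_start_bits([1, 3, 2, 5, 6])
print(instruction_starts(bits))  # [0, 1, 4, 6]
```

The bits are computed once, when the line enters the instruction cache, and reused on every later fetch of that line.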
K5’s instruction translation process
- Instructions are fed into a 16-byte queue.
- Up to four ROPs’ worth of instructions can be pulled from the queue during each clock cycle.
- Four identical ROP converters translate x86 instructions into ROPs.
- Once past the ROP converters, the instruction stream is RISC-like.
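A rough sketch of the "up to four ROPs per cycle" limit is given below. It is a simplification under stated assumptions: each x86 instruction is represented only by the number of ROPs it expands to (hypothetical values), instructions are taken strictly in order, and an instruction is never split across cycles. The name `dispatch_cycles` is made up for this example.

```python
from collections import deque


def dispatch_cycles(rop_counts, width=4):
    """Group in-order instructions into cycles of at most `width` ROPs."""
    queue = deque(rop_counts)
    cycles = []
    while queue:
        issued, used = [], 0
        # Pull instructions while their ROPs still fit in this cycle.
        while queue and used + queue[0] <= width:
            used += queue[0]
            issued.append(queue.popleft())
        if not issued:
            # An instruction wider than the limit is issued alone
            # (a simplification, not the K5's real handling).
            issued.append(queue.popleft())
        cycles.append(issued)
    return cycles


print(dispatch_cycles([1, 2, 1, 3, 1, 1, 2]))  # [[1, 2, 1], [3, 1], [1, 2]]
```

Seven instructions totalling 11 ROPs take three cycles here, because the 3-ROP instruction does not fit behind the first group.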
K5’s Execution Units
- ROPs are dispatched to the reservation stations of the execution units, where they wait for the needed resources to become available.
- Available execution units (6 in number) include:
  - 2 ALUs (one with a shifter, the other with the divider; 2 reservation stations),
  - 1 FPU (1 reservation station),
  - 2 load/store units (2 reservation stations each), and
  - 1 branch unit (2 reservation stations).
- 8 operand buses feed the execution units (4 units are fed with 2 operands each on every cycle).
- 5 result buses, 41 bits wide, support transfers of floating-point data (2 are used in parallel to support x86-compatible floating-point operations).
- 40-word register file.
- One 16-entry reorder buffer (ROB) stores results from speculatively executed instructions; the results are then written to the register file.
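The role of the reorder buffer can be illustrated with a small model. It is a sketch only: the class and method names are invented here, and the slides give nothing beyond the 16-entry size. Results may complete in any order, but they are copied to the register file strictly in program order, so a speculative result never becomes visible ahead of an older, unfinished instruction.

```python
from collections import OrderedDict


class ReorderBuffer:
    def __init__(self, size=16):
        self.size = size
        self.entries = OrderedDict()  # tag -> [dest_reg, value or None]
        self.next_tag = 0

    def allocate(self, dest_reg):
        """Reserve an entry in program order; None means dispatch must stall."""
        if len(self.entries) >= self.size:
            return None
        tag = self.next_tag
        self.next_tag += 1
        self.entries[tag] = [dest_reg, None]
        return tag

    def complete(self, tag, value):
        """Record a result; this may happen out of order."""
        self.entries[tag][1] = value

    def retire(self, register_file):
        """Write finished results to the register file, oldest first."""
        retired = []
        while self.entries:
            tag, (reg, value) = next(iter(self.entries.items()))
            if value is None:
                break  # the oldest instruction has not finished yet
            register_file[reg] = value
            del self.entries[tag]
            retired.append(tag)
        return retired
```

For example, if the second of two instructions finishes first, `retire` returns nothing until the first one completes, and then retires both.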
Pipeline in K5’s architecture
- 6 pipeline stages, only 5 of which affect instruction timing.
- An extra decode stage is required for the x86-to-ROP translation.
- Address generation is combined into the same stage as cache access.
[Figure: K5 pipeline stages. Labels: extra decode stage; two address-generation phases (1st phase, 2nd phase): calculation of the cache index, and full linear 32-bit address calculation with segment-protection checking.]
Caches and Memory management
- The data cache is divided into four banks. There are 2 access ports, one for each load/store unit.
- The instruction cache has a 16-byte line size. A 16-byte buffer ensures compatibility with the Pentium bus, which performs 32-byte bursts: the buffer holds the second cache line of each burst.
- The caches are virtually addressed and tagged to avoid the need to translate addresses before a cache access.
- A single set of physical tags is shared by the instruction and data caches. This eliminates conflicts with the CPU for cache access and ensures consistency between the instruction and data caches.
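Why a banked data cache helps two load/store units can be shown with a minimal sketch. The bank-selection rule below (4-byte interleaving on the low address bits) is an assumption made for illustration; the slides state only that there are four banks and two access ports.

```python
NUM_BANKS = 4    # from the slide
WORD_BYTES = 4   # assumed interleave granularity, not from the slides


def bank_of(address):
    """Bank that holds the word containing `address` (assumed mapping)."""
    return (address // WORD_BYTES) % NUM_BANKS


def can_access_in_parallel(addr_a, addr_b):
    """Two accesses proceed in the same cycle only if they hit different banks."""
    return bank_of(addr_a) != bank_of(addr_b)


print(can_access_in_parallel(0x00, 0x04))  # True: banks 0 and 1
print(can_access_in_parallel(0x00, 0x10))  # False: both map to bank 0
```

Under this mapping, two accesses collide only when their word addresses are a multiple of four words apart, so most pairs of accesses can be served together.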
Summary of AMD’s K5
- The K5 goes further in combining x86 compatibility with a RISC-like core (an architecture similar to NexGen’s Nx586), achieving:
  - a large software base,
  - high performance.
- The K5’s competitors at that time:
  - Cyrix’s M1 (not yet launched in the market),
  - Pentium, which
    - holds a large part of the marketplace, often forcing AMD to provide higher performance at the same price,
    - is already on the way to increasing its clock rate (promising a 120-MHz part in 1 year’s time, early 1995, or even a 150-MHz part in 2 years’ time, late 1995).







Part II: AMD’s K7 3-issue superscalar processor - Basic Features
- 10-stage pipeline -> enables high clock rates,
- Large L1 instruction and data caches -> functions in systems with or without a backside L2,
- Astonishingly small die (184 mm²) despite its transistor count (22 million).
- Up to 72 instructions can be in execution in the K7’s out-of-order integer pipe, floating-point pipe, and load/store unit.
K7’s deep 10-stage pipeline
- Long pipeline (the first half is devoted to x86 instruction decoding),
- Simple branch prediction using a 2,048-entry branch history table with a 2-bit Smith prediction algorithm, but
- A misprediction costs a minimum of 10 cycles.
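A 2-bit Smith predictor with a 2,048-entry table can be sketched in a few lines. The table size and the 2-bit saturating counter come from the slide; the indexing by low-order address bits, the initial counter value, and all names are assumptions for illustration.

```python
class SmithPredictor:
    """2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries  # assumed start: weakly not taken

    def _index(self, pc):
        return pc % self.entries  # assumed: low-order bits of the branch address

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through a sequence of outcomes and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


MIN_PENALTY_CYCLES = 10  # minimum misprediction cost quoted on the slide
loop_branch = [True] * 9 + [False]  # a loop that iterates ten times
predictor = SmithPredictor()
first = count_mispredictions(predictor, 0x400, loop_branch)
second = count_mispredictions(predictor, 0x400, loop_branch)
print(first * MIN_PENALTY_CYCLES, second * MIN_PENALTY_CYCLES)  # 20 10
```

Once warmed up, the counter mispredicts only the loop exit, which is the main attraction of the 2-bit scheme over a single history bit.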
K7’s Instruction Decoding Delivers High Instruction Bandwidth
- x86 instructions are decoded into MOPs (3/cycle) and dispatched either to the integer pipe (via the direct-path or vector-path decoders, depending on their complexity) or to the floating-point pipe (passed directly to the instruction control unit, ICU).
K7’s Out-of-Order Integer Pipe Issuing 6 ROPs/Cycle
- The integer scheduler is an out-of-order 15-entry reservation station organized as three 5-MOP queues.
- A MOP equals 1 or 2 ROPs (load, store, load/store, ALU operation, or branch).
- The integer pipe provides 3 integer execution units (IEUs) and 3 address generation units (AGUs).
- Each of the 3 scheduler queues is physically associated with an IEU/AGU pair.
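The figures on this slide fit together arithmetically, as the short sketch below shows. It assumes (as a simplification, not a statement from the slides) that each queue issues at most one MOP per cycle and that a MOP's ROPs go to the IEU, the AGU, or both; the function names are illustrative.

```python
QUEUES = 3             # scheduler queues, one per IEU/AGU pair
ENTRIES_PER_QUEUE = 5  # MOPs held per queue
UNITS_PER_QUEUE = 2    # one IEU plus one AGU


def scheduler_capacity():
    """Total MOPs the integer scheduler can hold."""
    return QUEUES * ENTRIES_PER_QUEUE


def peak_rops_per_cycle():
    """Upper bound on ROPs issued per cycle."""
    return QUEUES * UNITS_PER_QUEUE


def rops_issued(selected_mops):
    """ROPs issued in one cycle.

    selected_mops: one (needs_ieu, needs_agu) pair per queue that issues.
    """
    if len(selected_mops) > QUEUES:
        raise ValueError("at most one MOP per queue per cycle in this model")
    return sum(int(ieu) + int(agu) for ieu, agu in selected_mops)


print(scheduler_capacity(), peak_rops_per_cycle())  # 15 6
```

The peak of 6 ROPs per cycle is reached only when all three issuing MOPs need both an ALU operation and an address calculation.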
K7 provides large Memory Bandwidth to support instruction execution
- The 44-entry load/store unit (LSU) queues memory requests from the AGUs (3/cycle), and
- issues them to the D-cache out of order (the cache can provide up to 8 Gbytes/s of bandwidth).
- Data is snooped from the result buses.
- The cache is nonblocking.
- The data cache is physically tagged. A 2-level translation lookaside buffer (TLB) translates effective addresses to physical addresses in parallel.
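A two-level TLB lookup can be sketched as below. The slides give no sizes or replacement policy, so the entry counts, the FIFO eviction, the dictionary standing in for the page table, and all names are assumptions; only the 4-KB page size is the standard x86 value.

```python
PAGE_SIZE = 4096  # standard x86 4-KB page


class TwoLevelTLB:
    def __init__(self, l1_entries=4, l2_entries=16):  # hypothetical sizes
        self.l1_entries = l1_entries
        self.l2_entries = l2_entries
        self.l1 = {}  # virtual page number -> physical frame number
        self.l2 = {}

    @staticmethod
    def _fill(table, capacity, vpn, pfn):
        if len(table) >= capacity:
            table.pop(next(iter(table)))  # evict the oldest entry (FIFO)
        table[vpn] = pfn

    def translate(self, vaddr, page_table):
        """Return (physical address, where the translation was found)."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.l1:
            pfn, source = self.l1[vpn], "L1"
        elif vpn in self.l2:
            pfn, source = self.l2[vpn], "L2"
            self._fill(self.l1, self.l1_entries, vpn, pfn)
        else:
            pfn, source = page_table[vpn], "walk"  # slow path
            self._fill(self.l2, self.l2_entries, vpn, pfn)
            self._fill(self.l1, self.l1_entries, vpn, pfn)
        return pfn * PAGE_SIZE + offset, source
```

A small first level keeps the common case fast, while the larger second level catches translations that would otherwise need a full page-table walk.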
K7’s additional features
- On-chip tags support a large external L2.
- The K7 connects to the chip set via a point-to-point interconnect instead of a shared bus.
- This requires more pins in multiprocessor (MP) configurations but allows the bus to run at higher speed.
- The K7’s bus comprises three separate ports: address in, address out, and a 72-bit bidirectional data port.
- It uses a five-state MOESI cache coherence protocol (‘owned’ state added).
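What the added ‘owned’ state buys can be seen in a textbook-style sketch of how a cached line reacts to other processors. This follows the generic MOESI description, not AMD's documented bus behavior, and the function names are illustrative.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_remote_read(state):
    """State of a local line after another cache reads the same line."""
    if state in (State.MODIFIED, State.OWNED):
        # The dirty line is shared without first being written to memory;
        # this cache stays responsible for supplying and writing it back.
        return State.OWNED
    if state in (State.EXCLUSIVE, State.SHARED):
        return State.SHARED
    return State.INVALID


def on_remote_write(state):
    """Any local copy is invalidated when another cache writes the line."""
    return State.INVALID


def must_write_back_on_eviction(state):
    """Only dirty lines (modified or owned) need to be written to memory."""
    return state in (State.MODIFIED, State.OWNED)


print(on_remote_read(State.MODIFIED).value)  # O
```

Without the owned state (plain MESI), a modified line would have to be written back to memory before another cache could share it.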
Summary of the K7 microprocessor
- The K7 is the most complex of any current x86 processor.
- It seems to outperform Intel’s models on an instructions-per-clock basis.
- It promises AMD performance leadership, allowing the company to increase both prices and profit margins.


Part III: AMD’s Athlon (1st member of the 7th-generation AMD processor family)
[Figure: Processor architecture/technology competitive comparison, AMD seventh generation vs. Intel previous generation.]
AMD Athlon Processor Microarchitecture Features
I. The industry's first nine-issue, superpipelined, superscalar x86 processor microarchitecture designed for high clock frequencies:
   - Multiple x86 instruction decoders
   - Three out-of-order, superscalar, fully pipelined floating-point execution units, which execute all x87 (floating-point), MMX, and 3DNow! instructions
   - Three out-of-order, superscalar, pipelined integer units
   - Three out-of-order, superscalar, pipelined address calculation units
   - 72-entry instruction control unit
   - Advanced dynamic branch prediction
II. High-performance cache architecture featuring an integrated 128KB L1 cache and a programmable, high-speed backside L2 cache interface
III. 200MHz AMD Athlon processor system bus (scalable beyond 400 MHz) enabling leading-edge system bandwidth for data movement-intensive applications
IV. Enhanced 3DNow! technology with new instructions to enable improved integer math calculations for speech or video encoding and improved data movement for Internet plug-ins and other streaming applications
AMD Athlon Processor Architecture Block Diagram
II. High-Performance Cache Design
- Integrated, dual-ported, split L1 caches (64KB data and 64KB instruction) with a separate snoop port and eight banks to support concurrent access by two 64-bit loads or stores,
- Multi-level translation look-aside buffers (TLBs),
- A scalable L2 cache controller with a 72-bit interface, and
- An integrated tag for cost-effective 512KB L2 configurations.
- First to incorporate a system-based MOESI (Modified, Owned, Exclusive, Shared, Invalid) cache control protocol for x86 multiprocessing platforms, thus delivering exceptional performance in both uniprocessor and multiprocessor systems.
- It supports error correction code (ECC) protection.
III. The Industry's First 200-MHz System Bus for x86 Platforms
[Figure: advantages offered, AMD seventh generation vs. Intel previous generation.]
IV. Leading-Edge Floating Point and 3D Multimedia Technology
- The three execution units (Fmul, Fadd, and Fstore) in the AMD Athlon processor's floating-point pipeline handle all x87 (floating-point) instructions, MMX instructions, and enhanced 3DNow! instructions. Using a packed data format and single-instruction multiple-data (SIMD) operations, the processor can deliver as many as four 32-bit, single-precision floating-point results per clock cycle, for a peak performance of 4.0 Gigaflops at 1000 MHz.
- It uses AMD's original 21 3DNow! instructions plus 24 new ones:
  - 12 that improve multimedia-enhanced integer math calculations,
  - 7 that accelerate data movement and Internet functionality,
  - 5 DSP instructions that enhance the performance of communications applications (unique to Athlon).
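The peak-performance figure and the instruction counts on this slide follow from simple arithmetic, sketched below. The function name is illustrative; the inputs are the numbers quoted above.

```python
def peak_gflops(results_per_cycle, clock_mhz):
    """Peak floating-point throughput in Gigaflops."""
    return results_per_cycle * clock_mhz / 1000.0


# Four single-precision results per cycle at 1000 MHz.
print(peak_gflops(4, 1000))  # 4.0

# Enhanced 3DNow!: 21 original instructions plus the three new groups.
new_instructions = 12 + 7 + 5
print(new_instructions, 21 + new_instructions)  # 24 45
```

The 4.0 Gigaflops value is a peak: it assumes every cycle produces four results, which real code rarely sustains.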
Examples of applications that benefit from the Athlon processor’s capabilities
- Advanced imaging software for processing digital images
- Enhanced Internet browsing using next-generation browser features
- Architectural 3D rendering systems
- CAD/CAE software packages
- Near real-time MPEG-2 video encoding/editing for higher-quality video
- Speech recognition in Web browsing and word processing
- Financial modelling and trading software
- Realistic 3D software, including 3D games and flight simulators.
Summary of AMD’s Athlon processor
- It uses the latest microarchitecture innovations and system bus technology to deliver the industry’s highest performance for x86-compatible platforms.
- It is compatible with x86 versions of Microsoft Windows, as well as other operating systems.
- It provides a new level of performance and data movement capabilities for the next generation of computation-intensive software.
- It powers the next generation of applications in the growing fields of:
  - digital imaging,
  - the Internet,
  - enterprise computing,
  - CAD/CAE packages,
  - scientific/technical applications, and
  - 3D gaming.
CONCLUSION
- AMD’s microprocessors have consistently competed against Intel’s performance and market dominance.
- Each model takes on the challenge of offering:
  - high performance at the same clock rate,
  - advanced techniques: register renaming, out-of-order execution, and superscalar design,
  - support for high-end desktop products and for uniprocessor and multiprocessing workstations and servers.
- AMD’s forthcoming optimized chipsets are planned to enable multiprocessing system designs based on 2 or more AMD family processors.