EECC722 - Shaaban


Advanced Computer Architecture
Course Goal:
Understanding the important emerging design techniques, machine structures, technology factors, and evaluation methods that will determine the form of high-performance programmable processors and computing systems in the 21st century.
Important Factors:
• Driving force: applications with diverse and increased computational demands, even in mainstream computing (multimedia, etc.).
• Techniques must be developed to overcome the major limitations of current computing systems to meet such demands:
  – ILP limitations, memory latency, I/O performance.
  – Increased branch penalty and other stalls in deeply pipelined CPUs.
  – General-purpose processors as the only homogeneous system computing resource.
• Enabling technology for many possible solutions:
  – Increased density of VLSI logic (one billion transistors in 2005?).
  – Enables a high level of system-level integration.
Course Topics
Topics we will cover include:
• Overcoming inherent ILP & clock scaling limitations by exploiting Thread-Level Parallelism (TLP):
  – Support for Simultaneous Multithreading (SMT).
    • Alpha EV8, Intel P4 Xeon (aka Hyper-Threading), IBM Power5.
  – Chip Multiprocessors (CMPs):
    • The Hydra Project, IBM Power4, Power5, ...
• Instruction fetch bandwidth / memory latency reduction:
  – Conventional & block-based Trace Cache (Intel P4).
• Advanced branch prediction techniques.
• Towards micro-heterogeneous computing systems:
  – Vector processing: Vector Intelligent RAM (VIRAM).
  – Digital Signal Processing (DSP) & media architectures & processors.
  – Re-Configurable Computing and processors.
• Virtual memory implementation issues.
• High-performance storage: Redundant Arrays of Disks (RAID).
Computer System Components

[Figure: Block diagram of a typical computer system: CPU with L1/L2/L3 caches, Front Side Bus to the North Bridge (memory controller), memory bus to main memory, and I/O buses through the South Bridge to adapters, controllers, and NICs.]

• CPU: 1000 MHz - 3.6 GHz (a multiple of the system bus speed); pipelined (7-30 stages); superscalar (max ~4 instructions/cycle), single-threaded; dynamically scheduled or VLIW; dynamic and static branch prediction.
• Front Side Bus (FSB), with support for one or more CPUs. Examples: Alpha, AMD K7: EV6, 400 MHz; Intel PII, PIII: GTL+, 133 MHz; Intel P4: 800 MHz.
• Main memory options (via the memory controller and memory bus):
  – SDRAM PC100/PC133: 100-133 MHz, 64-128 bits wide, 2-way interleaved, ~900 MB/s.
  – Double Data Rate (DDR) SDRAM PC3200: 400 MHz (effective, 200x2), 64-128 bits wide, 4-way interleaved, ~3.2 GB/s (second half 2002).
  – RAMbus DRAM (RDRAM) PC800, PC1066: 400-533 MHz (DDR), 16-32 bit wide channel, ~1.6-3.2 GB/s per channel.
• I/O buses: e.g., PCI 33-66 MHz, PCI-X 133 MHz, 32-64 bits wide, 133-1024 MB/s; connecting disks, displays, keyboards, and networks (Fast Ethernet, Gigabit Ethernet, ATM, Token Ring, ...).
• North Bridge + South Bridge form the chipset.
Computer System Components (continued)

Enhanced CPU performance & capabilities:
• Memory latency reduction: conventional & block-based Trace Cache.
• Support for Simultaneous Multithreading (SMT): Intel HT.
• VLIW & intelligent compiler techniques: Intel/HP EPIC IA-64.
• More advanced branch prediction techniques.
• Chip Multiprocessors (CMPs): The Hydra Project, IBM Power4, Power5.
• Vector processing capability: Vector Intelligent RAM (VIRAM), or multimedia ISA extensions.
• Digital Signal Processing (DSP) capability in the system.
• Re-Configurable Computing hardware capability in the system.

Memory-system directions (annotated on the same system diagram of CPU, caches, FSB, chipset, memory, and I/O):
• Integrate the memory controller and a portion of main memory with the CPU: Intelligent RAM.
• Integrated memory controller: AMD Opteron, IBM Power5.
• High-performance storage: disks (RAID).

Recent trend: more system-component integration (lowers cost, improves system performance).
EECC551 Review
• Recent Trends in Computer Design.
• A Hierarchy of Computer Design.
• Computer Architecture's Changing Definition.
• Computer Performance Measures.
• Instruction Pipelining.
• Branch Prediction.
• Instruction-Level Parallelism (ILP).
• Loop-Level Parallelism (LLP).
• Dynamic Pipeline Scheduling.
• Multiple Instruction Issue (CPI < 1): Superscalar vs. VLIW.
• Dynamic Hardware-Based Speculation.
• Cache Design & Performance.
Trends in Computer Design
• The cost/performance ratio of computing systems has seen a steady decline due to advances in:
  – Integrated circuit technology: decreasing feature size:
    • Clock rate improves roughly in proportion to the improvement in feature size.
    • The number of transistors improves roughly with the square of that improvement (or faster).
  – Architectural improvements in CPU design.
• Microprocessor systems directly reflect IC improvement in terms of a yearly 35% to 55% improvement in performance.
• Assembly language has been mostly eliminated and replaced by other alternatives such as C or C++.
• Standard operating systems (UNIX, NT) lowered the cost of introducing new architectures.
• Emergence of RISC architectures and RISC-core architectures.
• Adoption of quantitative approaches to computer design based on
empirical performance observations.
Processor Performance Trends
Mass-produced microprocessors became a cost-effective, high-performance replacement for custom-designed mainframe and minicomputer CPUs.

[Figure: Relative performance (log scale, 0.1 to 1000) vs. year, 1965-2000, for supercomputers, mainframes, minicomputers, and microprocessors; microprocessor performance catches up with and overtakes the other classes during the 1990s.]
Microprocessor Performance 1987-97

[Figure: Integer SPEC92 performance, 1987-1997, for a range of machines (SUN-4/260, MIPS M/120, MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, 5/300, 5/500, 21264/600), rising from well under 100 to about 1200 for the DEC Alpha 21264/600.]
Microprocessor Architecture Trends

• CISC machines: instructions take variable times to complete.
• RISC machines (microcode): simple instructions, optimized for speed.
• RISC machines (pipelined): same individual instruction latency; greater throughput through instruction "overlap".
• Superscalar processors: multiple instructions executing simultaneously. Two paths forward:
  – VLIW: "superinstructions" grouped together; decreased hardware control complexity.
  – Multithreaded processors: additional HW resources (regs, PC, SP); each context gets the processor for x cycles.
• Single-chip multiprocessors (CMPs): duplicate entire processors (technology soon allows this due to Moore's Law).
• Simultaneous Multithreading (SMT): multiple HW contexts (regs, PC, SP); each cycle, any context may execute.
• The trends converge in SMT/CMPs (e.g., IBM Power5 in 2004).
Microprocessor Frequency Trend

[Figure: Clock frequency (MHz, log scale to 10,000) and gate delays per clock vs. year, 1987-2005, for Intel (386, 486, Pentium, Pentium Pro, Pentium II), IBM PowerPC (601, 603, 604, 604+, MPC750), and DEC Alpha (21064A, 21066, 21164, 21164A, 21264, 21264S) processors. Processor frequency scales by 2X per generation.]

• Frequency doubles each generation.
• The number of gate delays per clock reduces by 25%.
• This leads to deeper pipelines with more stages (e.g., the Intel Pentium 4E has 30+ pipeline stages).

Reality check: clock frequency scaling is slowing down! (Did silicon finally hit the wall?)
Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).
Microprocessor Transistor Count Growth Rate

[Figure: Transistors per chip (log scale, 10^3 to 10^8) vs. year, 1970-2000: i4004, i8080, i8086, i80286, i80386, i80486, Pentium, ... One billion transistors in 2005? Example counts: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million.]

Moore's Law: 2X transistors per chip every 1.5 years; still possibly valid until 2010.
Parallelism in Microprocessor VLSI Generations

[Figure: Transistors per chip (log scale, roughly 10^3 to 10^8) vs. year, 1970-2005, from the i4004, i8008, i8080, i8086, i80286, i80386 through the R2000, R3000, R10000, and Pentium, marking successive eras of exploited parallelism.]

• Bit-level parallelism (early generations).
• Instruction-level parallelism: multiple micro-operations per cycle (superscalar).
• Thread-level parallelism (?) / chip-level parallel processing:
  – Simultaneous Multithreading (SMT), e.g., Intel's Hyper-Threading.
  – Chip Multiprocessors (CMPs), e.g., IBM Power4.
Computer Technology Trends:
Evolutionary but Rapid Change
• Processor:
– 2X in speed every 1.5 years; 100X performance in last decade.
• Memory:
– DRAM capacity: > 2x every 1.5 years; 1000X size in last decade.
– Cost per bit: Improves about 25% per year.
• Disk:
  – Capacity: > 2X in size every 1.5 years.
  – Cost per bit: improves about 60% per year.
  – 200X size in last decade.
  – Only 10% performance improvement per year, due to mechanical limitations.
• Expected state-of-the-art PC by end of year 2004:
  – Processor clock speed: > 3600 MegaHertz (3.6 GigaHertz)
  – Memory capacity: > 2 GigaBytes
  – Disk capacity: > 300 GigaBytes (0.3 TeraBytes)
Architectural Improvements
• Increased optimization, utilization and size of cache systems with
multiple levels (currently the most popular approach to utilize the
increased number of available transistors) .
• Memory-latency hiding techniques.
• Optimization of pipelined instruction execution.
• Dynamic hardware-based pipeline scheduling.
• Improved handling of pipeline hazards.
• Improved hardware branch prediction techniques.
• Exploiting Instruction-Level Parallelism (ILP) in terms of multiple-instruction issue and multiple hardware functional units.
• Inclusion of special instructions to handle multimedia applications.
• High-speed system and memory bus designs to improve data transfer
rates and reduce latency.
Current Computer Architecture Topics

[Figure: Layered view of the topics.]
• Instruction Set Architecture: addressing, protection, exception handling.
• Organization/datapath and control (VLSI): pipelining, hazard resolution, superscalar, reordering, branch prediction, speculation, VLIW, vector, DSP, ...
  – Pipelining and Instruction-Level Parallelism (ILP).
  – Multiprocessing and simultaneous CPU multithreading: Thread-Level Parallelism (TLP).
• Memory hierarchy: L1 cache, L2 cache, DRAM; interleaving, bus protocols; coherence, bandwidth, latency.
• Input/Output and storage: disks, WORM, tape; RAID; emerging technologies.
CPU Execution Time: The CPU Equation
• A program is comprised of a number of instructions, I
– Measured in:
instructions/program
• The average instruction takes a number of cycles per instruction
(CPI) to be completed.
– Measured in: cycles/instruction
– IPC (Instructions Per Cycle) = 1/CPI
• CPU has a fixed clock cycle time C = 1/clock rate
– Measured in:
seconds/cycle
• CPU execution time is the product of the above three parameters as follows:

      CPU Time = I x CPI x C

      CPU time = Seconds / Program
               = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
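The equation can be checked with a few lines of Python; the instruction count, CPI, and clock rate in the example call are illustrative values only, not figures from the lecture:

    def cpu_time(instruction_count, cpi, clock_rate_hz):
        """CPU time = I x CPI x C, where C = 1 / clock rate (seconds per cycle)."""
        cycle_time = 1.0 / clock_rate_hz
        return instruction_count * cpi * cycle_time

    # Hypothetical program: 2 billion instructions, CPI = 1.165, 3.6 GHz clock.
    print(cpu_time(2e9, 1.165, 3.6e9))   # ~0.647 seconds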
Factors Affecting CPU Performance

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

                                        Instruction    CPI      Clock
                                        Count I        (IPC)    Cycle C
    Program                             X              X
    Compiler                            X              X
    Instruction Set Architecture (ISA)  X              X
    Organization (Micro-Architecture)                  X        X
    Technology                                                  X
Metrics of Computer Performance

[Figure: Metrics used at each level of the system.]
• Application level: execution time on a target workload (SPEC95, SPEC2000, etc.).
• Programming language / compiler / ISA level: millions of instructions per second (MIPS); millions of (floating-point) operations per second (MFLOP/s).
• Datapath, control, and function-unit level: megabytes per second; cycles per second (clock rate).

Each metric has a purpose, and each can be misused.
SPEC: System Performance
Evaluation Cooperative
The most popular and industry-standard set of CPU benchmarks.
• SPECmarks, 1989:
– 10 programs yielding a single number (“SPECmarks”).
• SPEC92, 1992:
– SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs).
• SPEC95, 1995:
– SPECint95 (8 integer programs):
• go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
– SPECfp95 (10 floating-point intensive programs):
• tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
– Performance relative to a Sun SuperSPARC I (50 MHz), which is given a score of SPECint95 = SPECfp95 = 1.
• SPEC CPU2000, 1999:
– CINT2000 (11 integer programs). CFP2000 (14 floating-point intensive programs)
– Performance relative to a Sun Ultra5_10 (300 MHz) which is given a score of
SPECint2000 = SPECfp2000 = 100
Top 20 SPEC CPU2000 Results (As of March 2002)

Top 20 SPECint2000
  #   MHz   Processor           int peak  int base
  1   1300  POWER4                814       790
  2   2200  Pentium 4             811       790
  3   2200  Pentium 4 Xeon        810       788
  4   1667  Athlon XP             724       697
  5   1000  Alpha 21264C          679       621
  6   1400  Pentium III           664       648
  7   1050  UltraSPARC-III Cu     610       537
  8   1533  Athlon MP             609       587
  9    750  PA-RISC 8700          604       568
 10    833  Alpha 21264B          571       497
 11   1400  Athlon                554       495
 12    833  Alpha 21264A          533       511
 13    600  MIPS R14000           500       483
 14    675  SPARC64 GP            478       449
 15    900  UltraSPARC-III        467       438
 16    552  PA-RISC 8600          441       417
 17    750  POWER RS64-IV         439       409
 18    700  Pentium III Xeon      438       431
 19    800  Itanium               365       358
 20    400  MIPS R12000           353       328

Top 20 SPECfp2000
  #   MHz   Processor           fp peak   fp base
  1   1300  POWER4               1169      1098
  2   1000  Alpha 21264C          960       776
  3   1050  UltraSPARC-III Cu     827       701
  4   2200  Pentium 4 Xeon        802       779
  5   2200  Pentium 4             801       779
  6    833  Alpha 21264B          784       643
  7    800  Itanium               701       701
  8    833  Alpha 21264A          644       571
  9   1667  Athlon XP             642       596
 10    750  PA-RISC 8700          581       526
 11   1533  Athlon MP             547       504
 12    600  MIPS R14000           529       499
 13    675  SPARC64 GP            509       371
 14    900  UltraSPARC-III        482       427
 15   1400  Athlon                458       426
 16   1400  Pentium III           456       437
 17    500  PA-RISC 8600          440       397
 18    450  POWER3-II             433       426
 19    500  Alpha 21264           422       383
 20    400  MIPS R12000           407       382
Source: http://www.aceshardware.com/SPECmine/top.jsp
Performance Enhancement Calculations:
Amdahl's Law
• The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is used
• Amdahl’s Law:
Performance improvement or speedup due to enhancement E:

    Speedup(E) = Execution Time without E / Execution Time with E
               = Performance with E / Performance without E

– Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected; then:

    Execution Time with E = ((1 - F) + F/S) x Execution Time without E

Hence the speedup is given by:

    Speedup(E) = Execution Time without E / (((1 - F) + F/S) x Execution Time without E)
               = 1 / ((1 - F) + F/S)
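A minimal sketch of this formula in Python; the fraction and speedup factor in the example call are the same ones used in the worked load-enhancement example two slides ahead:

    def amdahl_speedup(f, s):
        """Speedup(E) = 1 / ((1 - F) + F/S): fraction F of execution time sped up by factor S."""
        return 1.0 / ((1.0 - f) + f / s)

    # 45% of execution time sped up by a factor of 2.5
    print(round(amdahl_speedup(0.45, 2.5), 2))   # 1.37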
Pictorial Depiction of Amdahl’s Law
Enhancement E accelerates fraction F of execution time by a factor of S
Before (execution time without enhancement E):
    [ Unaffected, fraction (1 - F) | Affected fraction F ]

After (execution time with enhancement E):
    [ Unaffected, fraction (1 - F): unchanged | F/S ]

    Speedup(E) = Execution Time without enhancement E / Execution Time with enhancement E
               = 1 / ((1 - F) + F/S)
Performance Enhancement Example
• For the RISC machine with the following instruction mix given earlier:

    Op       Freq   Cycles   CPI(i)   % Time
    ALU      50%    1        .5       23%
    Load     20%    5        1.0      45%
    Store    10%    3        .3       14%
    Branch   20%    2        .4       18%
                             CPI = 2.2

• If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?

    Fraction enhanced = F = 45% or .45
    Unaffected fraction = 100% - 45% = 55% or .55
    Factor of enhancement = S = 5/2 = 2.5

    Using Amdahl's Law:
    Speedup(E) = 1 / ((1 - F) + F/S) = 1 / (.55 + .45/2.5) = 1.37
Extending Amdahl's Law To Multiple Enhancements
• Suppose that enhancement Ei accelerates a fraction Fi of the
execution time by a factor Si and the remainder of the time is
unaffected then:
    Speedup = Original Execution Time / (((1 - Σi Fi) + Σi Fi/Si) x Original Execution Time)

    Speedup = 1 / ((1 - Σi Fi) + Σi Fi/Si)

Note: All fractions refer to original execution time.
Amdahl's Law With Multiple Enhancements:
Example
• Three CPU or system performance enhancements are proposed with the following speedups and percentage of the code execution time affected:

    Speedup1 = S1 = 10     Percentage1 = F1 = 20%
    Speedup2 = S2 = 15     Percentage2 = F2 = 15%
    Speedup3 = S3 = 30     Percentage3 = F3 = 10%

• While all three enhancements are in place in the new design, each enhancement affects a different portion of the code and only one enhancement can be used at a time.
• What is the resulting overall speedup?

    Speedup = 1 / ((1 - Σi Fi) + Σi Fi/Si)
            = 1 / [(1 - .2 - .15 - .1) + (.2/10 + .15/15 + .1/30)]
            = 1 / [.55 + .0333]
            = 1 / .5833 = 1.71
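The same calculation as a small Python helper over lists of fractions and speedup factors (the values in the call are taken from the example above):

    def amdahl_multi(fractions, speedups):
        """Overall speedup when each enhancement i accelerates fraction F_i by S_i.
        All fractions refer to the original execution time and must not overlap."""
        unaffected = 1.0 - sum(fractions)
        affected = sum(f / s for f, s in zip(fractions, speedups))
        return 1.0 / (unaffected + affected)

    print(round(amdahl_multi([0.20, 0.15, 0.10], [10, 15, 30]), 2))   # 1.71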
Pictorial Depiction of Example
Before (execution time with no enhancements = 1):
    [ Unaffected, fraction .55 | F1 = .2 (S1 = 10) | F2 = .15 (S2 = 15) | F3 = .1 (S3 = 30) ]

After (execution time with enhancements):
    [ Unaffected, fraction .55: unchanged | .2/10 | .15/15 | .1/30 ]
    = .55 + .02 + .01 + .00333 = .5833

Speedup = 1 / .5833 = 1.71
Note: All fractions refer to original execution time.
Evolution of Instruction Sets
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers
(Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model
from Implementation
High-level Language Based
(B5000 1963)
Concept of a Family
(IBM 360 1964)
General Purpose Register Machines
  – Complex Instruction Sets (Vax, Intel 432, 1977-80)
  – Load/Store Architecture (CDC 6600, Cray 1, 1963-76)
      → RISC (MIPS, SPARC, HP-PA, IBM RS6000, ... 1987)
A "Typical" RISC
• 32-bit fixed-format instructions (3 formats: I, R, J).
• 32 64-bit GPRs (R0 contains zero; double-precision operands take a register pair).
• 32 64-bit FPRs.
• 3-address, reg-reg arithmetic instructions.
• Single addressing mode for load/store: base + displacement (no indirection).
• Simple branch conditions (based on register values).
• Delayed branch.
A RISC ISA Example: MIPS

Instruction formats (bit positions 31..0):

    Register-Register (R-type):  Op (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | sa (10-6) | funct (5-0)
    Register-Immediate (I-type): Op (31-26) | rs (25-21) | rt (20-16) | immediate (15-0)
    Branch:                      Op (31-26) | rs (25-21) | rt (20-16) | displacement (15-0)
    Jump / Call (J-type):        Op (31-26) | target (25-0)
Instruction Pipelining Review
• Instruction pipelining is a CPU implementation technique in which multiple operations on a number of instructions are overlapped.
• An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction. Each step is called a pipeline stage or a pipeline segment.
• The stages or steps are connected in a linear fashion, one stage to the next, to form the pipeline: instructions enter at one end, progress through the stages, and exit at the other end.
• The time to move an instruction one step down the pipeline is equal to the machine cycle and is determined by the stage with the longest processing delay.
• Pipelining increases the CPU instruction throughput: the number of instructions completed per cycle.
  – Under ideal conditions (no stall cycles), instruction throughput is one instruction per machine cycle, or ideal CPI = 1.
• Pipelining does not reduce the execution time of an individual instruction: the time needed to complete all processing steps of an instruction (also called instruction completion latency).
  – Minimum instruction latency = n cycles, where n is the number of pipeline stages.
MIPS In-Order Single-Issue Integer Pipeline: Ideal Operation

(Time in clock cycles, left to right)

    Instruction       1    2    3    4    5    6    7    8    9
    Instruction I     IF   ID   EX   MEM  WB
    Instruction I+1        IF   ID   EX   MEM  WB
    Instruction I+2             IF   ID   EX   MEM  WB
    Instruction I+3                  IF   ID   EX   MEM  WB
    Instruction I+4                       IF   ID   EX   MEM  WB

MIPS pipeline stages:
    IF  = Instruction Fetch
    ID  = Instruction Decode
    EX  = Execution
    MEM = Memory Access
    WB  = Write Back

Cycles 1-4 fill the pipeline; the first instruction (I) completes in cycle 5 and the last instruction (I+4) completes in cycle 9.
5 pipeline stages, ideal CPI = 1.
A Pipelined MIPS Datapath
• Obtained from multi-cycle MIPS datapath by adding buffer registers between pipeline stages
• Assume register writes occur in first half of cycle and register reads occur in second half.
Pipeline Hazards
• Hazards are situations in pipelining which prevent the next
instruction in the instruction stream from executing during
the designated clock cycle.
• Hazards reduce the ideal speedup gained from pipelining
and are classified into three classes:
– Structural hazards: Arise from hardware resource
conflicts when the available hardware cannot support all
possible combinations of instructions.
– Data hazards: Arise when an instruction depends on the
results of a previous instruction in a way that is exposed
by the overlapping of instructions in the pipeline
– Control hazards: Arise from the pipelining of conditional
branches and other instructions that change the PC
Performance of Pipelines with Stalls
• Hazards in pipelines may make it necessary to stall the pipeline by one or more cycles, degrading performance from the ideal CPI of 1.
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
• If pipelining overhead is ignored (no change in clock cycle) and we
assume that the stages are perfectly balanced then:
Speedup = CPI un-pipelined / CPI pipelined
= CPI un-pipelined / (1 + Pipeline stall cycles per instruction)
MIPS with Memory
Unit Structural Hazards
Resolving A Structural
Hazard with Stalling
Data Hazards
• Data hazards occur when the pipeline changes the order of
read/write accesses to instruction operands in such a way that
the resulting access order differs from the original sequential
instruction operand access order of the unpipelined machine
resulting in incorrect execution.
• Data hazards usually require one or more instructions to be
stalled to ensure correct execution.
• Example:
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8,R1,R9
XOR R10, R1, R11
– All the instructions after DADD use the result of the DADD instruction
– DSUB, AND instructions need to be stalled for correct execution.
Data Hazard Example
Figure A.6 The use of the result of the DADD instruction in the next three instructions
causes a hazard, since the register is not written until after those instructions read it.
Minimizing Data hazard Stalls by Forwarding
• Forwarding is a hardware-based technique (also called register
bypassing or short-circuiting) used to eliminate or minimize
data hazard stalls.
• Using forwarding hardware, the result of an instruction is copied
directly from where it is produced (ALU, memory read port
etc.), to where subsequent instructions need it (ALU input
register, memory write port etc.)
• For example, in the MIPS pipeline with forwarding:
– The ALU result from the EX/MEM register may be forwarded or fed
back to the ALU input latches as needed instead of the register
operand value read in the ID stage.
– Similarly, the Data Memory Unit result from the MEM/WB register
may be fed back to the ALU input latches as needed .
– If the forwarding hardware detects that a previous ALU operation is to
write the register corresponding to a source for the current ALU
operation, control logic selects the forwarded result as the ALU input
rather than the value read from the register file.
MIPS Pipeline
with Forwarding
A set of instructions that depend on the DADD result uses forwarding paths to avoid the data hazard
Data Hazard Classification
(I is the earlier instruction, J the later instruction; both access a shared operand.)

• Read After Write (RAW): I writes the shared operand, J then reads it.
• Write After Read (WAR): I reads the shared operand, J then writes it.
• Write After Write (WAW): I writes the shared operand, J then writes it.
• Read After Read (RAR): I reads the shared operand, J then reads it — not a hazard.
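A toy sketch of this classification in Python; the register sets in the example call come from the DADD/DSUB pair on the Data Hazards slide above, and the helper itself is purely illustrative:

    def classify_hazard(i_writes, i_reads, j_writes, j_reads):
        """Classify the dependence between an earlier instruction I and a later
        instruction J, given the register sets each one reads and writes."""
        if i_writes & j_reads:
            return "RAW"   # true dependence
        if i_reads & j_writes:
            return "WAR"   # anti-dependence
        if i_writes & j_writes:
            return "WAW"   # output dependence
        return "none (RAR or independent)"

    # DADD R1,R2,R3 followed by DSUB R4,R1,R5: J reads R1 written by I -> RAW
    print(classify_hazard({"R1"}, {"R2", "R3"}, {"R4"}, {"R1", "R5"}))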
Control Hazards
• When a conditional branch is executed it may change the PC
and, without any special measures, leads to stalling the pipeline
for a number of cycles until the branch condition is known.
• In current MIPS pipeline, the conditional branch is resolved in
the MEM stage resulting in three stall cycles as shown below:
    Branch instruction      IF   ID    EX    MEM   WB
    Branch successor             IF    stall stall IF    ID    EX    MEM   WB
    Branch successor + 1                           IF    ID    EX    MEM
    Branch successor + 2                                 IF    ID    EX
    Branch successor + 3                                       IF    ID
    Branch successor + 4                                             IF
    Branch successor + 5                                                   ...

Assuming we stall on a branch instruction: three clock cycles are wasted for every branch in the current MIPS pipeline.
Reducing Branch Stall Cycles
Pipeline hardware measures to reduce branch stall cycles:
1- Find out whether a branch is taken earlier in the pipeline.
2- Compute the taken PC earlier in the pipeline.
In MIPS:
– In MIPS branch instructions BEQZ, BNE, test a register
for equality to zero.
– This can be completed in the ID cycle by moving the zero
test into that cycle.
– Both PCs (taken and not taken) must be computed early.
– Requires an additional adder because the current ALU is
not useable until EX cycle.
– This results in just a single cycle stall on branches.
Modified MIPS Pipeline:
Conditional Branches
Completed in ID Stage
Pipeline Performance Example
• Assume the following MIPS instruction mix:

    Type          Frequency
    Arith/Logic   40%
    Load          30%    of which 25% are followed immediately by an instruction using the loaded value
    Store         10%
    Branch        20%    of which 45% are taken

• What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in the ID stage when using a branch not-taken scheme?

• CPI = Ideal CPI + Pipeline stall clock cycles per instruction
      = 1 + stalls by loads + stalls by branches
      = 1 + (.3 x .25 x 1) + (.2 x .45 x 1)
      = 1 + .075 + .09
      = 1.165
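The same arithmetic as a small Python check; the frequencies and one-cycle penalties come from the example above (a load-use stall and a taken-branch stall on this pipeline):

    def pipeline_cpi(ideal_cpi, stall_sources):
        """CPI = ideal CPI + sum of (instruction frequency x fraction that stalls
        x stall cycles) over all stall sources."""
        return ideal_cpi + sum(freq * frac * cycles
                               for freq, frac, cycles in stall_sources)

    # (frequency, fraction that stalls, stall cycles): loads, then taken branches
    print(pipeline_cpi(1.0, [(0.30, 0.25, 1), (0.20, 0.45, 1)]))   # 1.165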
Pipelining and Exploiting
Instruction-Level Parallelism (ILP)
• Pipelining increases performance by overlapping the execution
of independent instructions.
• The CPI of a real-life pipeline is given by (assuming ideal
memory):
Pipeline CPI = Ideal Pipeline CPI + Structural Stalls + RAW Stalls
+ WAR Stalls + WAW Stalls + Control Stalls
• A basic instruction block is a straight-line code sequence with no
branches in, except at the entry point, and no branches out
except at the exit point of the sequence .
• The amount of parallelism in a basic block is limited by
instruction dependence present and size of the basic block.
• In typical integer code, dynamic branch frequency is about 15%
(average basic block size of 7 instructions).
Increasing Instruction-Level Parallelism
• A common way to increase parallelism among instructions is
to exploit parallelism among iterations of a loop
– (i.e Loop Level Parallelism, LLP).
• This is accomplished by unrolling the loop either statically
by the compiler, or dynamically by hardware, which
increases the size of the basic block present.
• In this loop every iteration can overlap with any other
iteration. Overlap within each iteration is minimal.
    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];

• In vector machines, utilizing vector instructions is an important alternative way to exploit loop-level parallelism.
• Vector instructions operate on a number of data items; the above loop would require just four such instructions.
MIPS Loop Unrolling Example
• For the loop:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
The straightforward MIPS assembly code is given by:

    Loop:  L.D     F0, 0(R1)     ; F0 = array element
           ADD.D   F4, F0, F2    ; add scalar in F2
           S.D     F4, 0(R1)     ; store result
           DADDUI  R1, R1, #-8   ; decrement pointer 8 bytes
           BNE     R1, R2, Loop  ; branch if R1 != R2

R1 is initially the address of the element with the highest address.
8(R2) is the address of the last element to operate on.
MIPS FP Latency Assumptions
• All FP units assumed to be pipelined.
• The following FP operation latencies are used:

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
Loop Unrolling Example (continued)
• This loop code is executed on the MIPS pipeline as follows:

    No scheduling:                       Clock cycle
    Loop:  L.D     F0, 0(R1)             1
           stall                         2
           ADD.D   F4, F0, F2            3
           stall                         4
           stall                         5
           S.D     F4, 0(R1)             6
           DADDUI  R1, R1, #-8           7
           stall                         8
           BNE     R1, R2, Loop          9
           stall                         10

    10 cycles per iteration.

    With delayed branch scheduling:
    Loop:  L.D     F0, 0(R1)
           DADDUI  R1, R1, #-8
           ADD.D   F4, F0, F2
           stall
           BNE     R1, R2, Loop
           S.D     F4, 8(R1)

    6 cycles per iteration (10/6 = 1.7 times faster).
Loop Unrolling Example (continued)
• The resulting loop code when four copies of the loop body are unrolled without reuse of registers (no scheduling):

    Loop:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     F4, 0(R1)       ; drop DADDUI & BNE
           L.D     F6, -8(R1)
           ADD.D   F8, F6, F2
           S.D     F8, -8(R1)      ; drop DADDUI & BNE
           L.D     F10, -16(R1)
           ADD.D   F12, F10, F2
           S.D     F12, -16(R1)    ; drop DADDUI & BNE
           L.D     F14, -24(R1)
           ADD.D   F16, F14, F2
           S.D     F16, -24(R1)
           DADDUI  R1, R1, #-32
           BNE     R1, R2, Loop

Three branches and three decrements of R1 are eliminated. Load and store addresses are changed to allow the DADDUI instructions to be merged.
The loop runs in 28 clock cycles, assuming each L.D has 1 stall cycle, each ADD.D has 2 stall cycles, the DADDUI 1 stall, and the branch 1 stall cycle, or 7 cycles for each of the four elements.
Loop Unrolling Example (continued)
When scheduled for the pipeline:

    Loop:  L.D     F0, 0(R1)
           L.D     F6, -8(R1)
           L.D     F10, -16(R1)
           L.D     F14, -24(R1)
           ADD.D   F4, F0, F2
           ADD.D   F8, F6, F2
           ADD.D   F12, F10, F2
           ADD.D   F16, F14, F2
           S.D     F4, 0(R1)
           S.D     F8, -8(R1)
           DADDUI  R1, R1, #-32
           S.D     F12, -16(R1)
           BNE     R1, R2, Loop
           S.D     F16, 8(R1)      ; 8 - 32 = -24

The execution time of the loop has dropped to 14 cycles, or 3.5 clock cycles per element, compared to 6.8 before scheduling and 6 when scheduled but not unrolled.
Unrolling the loop exposed more computation that can be scheduled to minimize stalls.
Loop-Level Parallelism (LLP) Analysis
• Loop-Level Parallelism (LLP) analysis focuses on whether data
accesses in later iterations of a loop are data dependent on data
values produced in earlier iterations.
e.g. in
for (i=1; i<=1000; i++)
x[i] = x[i] + s;
the computation in each iteration is independent of the previous
iterations and the loop is thus parallel. The use of X[i] twice is within
a single iteration.
Thus loop iterations are parallel (or independent from each other).
• Loop-carried Dependence: A data dependence between different
loop iterations (data produced in earlier iteration used in a later one).
• LLP analysis is normally done at the source code level or close to it
since assembly language and target machine code generation
introduces a loop-carried name dependence in the registers used for
addressing and incrementing.
• Instruction level parallelism (ILP) analysis, on the other hand, is
usually done when instructions are generated by the compiler.
LLP Analysis Example 1
• In the loop:
    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }
(Where A, B, C are distinct non-overlapping arrays)
– S2 uses the value A[i+1], computed by S1 in the same iteration. This
data dependence is within the same iteration (not a loop-carried
dependence).
 does not prevent loop iteration parallelism.
– S1 uses a value computed by S1 in an earlier iteration, since iteration i
computes A[i+1] read in iteration i+1 (loop-carried dependence,
prevents parallelism). The same applies for S2 for B[i] and B[i+1]
These two dependences are loop-carried spanning more than one iteration
preventing loop parallelism.
LLP Analysis Example 2
• In the loop:

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

  – S1 uses the value B[i] computed by S2 in the previous iteration (loop-carried dependence).
  – This dependence is not circular:
    • S1 depends on S2, but S2 does not depend on S1.
  – Can be made parallel by replacing the code with the following:

    A[1] = A[1] + B[1];               /* loop start-up code */
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];         /* loop completion code */
LLP Analysis Example 2 (continued)

Original loop:

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

    Iteration 1:    A[1] = A[1] + B[1];        B[2] = C[1] + D[1];
    Iteration 2:    A[2] = A[2] + B[2];        B[3] = C[2] + D[2];
    ...
    Iteration 99:   A[99] = A[99] + B[99];     B[100] = C[99] + D[99];
    Iteration 100:  A[100] = A[100] + B[100];  B[101] = C[100] + D[100];

S1 of each iteration reads the B value produced by S2 of the previous iteration: a loop-carried dependence.

Modified parallel loop:

    A[1] = A[1] + B[1];               /* loop start-up code */
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];         /* loop completion code */

    Iteration 1:    B[2] = C[1] + D[1];        A[2] = A[2] + B[2];
    ...
    Iteration 98:   B[99] = C[98] + D[98];     A[99] = A[99] + B[99];
    Iteration 99:   B[100] = C[99] + D[99];    A[100] = A[100] + B[100];

The dependence between the two statements is now within the same iteration: not loop-carried.
Reduction of Data Hazards Stalls
with Dynamic Scheduling
• So far we have dealt with data hazards in instruction pipelines by:
– Result forwarding and bypassing to reduce latency and hide or
reduce the effect of true data dependence.
– Hazard detection hardware to stall the pipeline starting with the
instruction that uses the result.
– Compiler-based static pipeline scheduling to separate the dependent
instructions minimizing actual hazards and stalls in scheduled code.
• Dynamic scheduling:
– Uses a hardware-based mechanism to rearrange instruction
execution order to reduce stalls at runtime.
– Enables handling some cases where dependencies are unknown at
compile time.
– Similar to the other pipeline optimizations above, a dynamically
scheduled processor cannot remove true data dependencies, but tries
to avoid or reduce stalling.
Dynamic Pipeline Scheduling: The Concept
• Dynamic pipeline scheduling overcomes the limitations of in-order
execution by allowing out-of-order instruction execution.
• Instruction are allowed to start executing out-of-order as soon as
their operands are available.
Example:

    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F12, F8, F14

In the case of in-order execution, SUBD must wait for DIVD to complete, which stalled ADDD before starting execution. In out-of-order execution, SUBD can start as soon as the values of its operands F8 and F14 are available.
• This implies allowing out-of-order instruction commit (completion).
• May lead to imprecise exceptions if an instruction issued earlier
raises an exception.
• This is similar to pipelines with multi-cycle floating point units.
Dynamic Scheduling:
The Tomasulo Algorithm
• Developed at IBM and first implemented in IBM’s 360/91
mainframe in 1966, about 3 years after the debut of the scoreboard
in the CDC 6600.
• Dynamically schedule the pipeline in hardware to reduce stalls.
• Differences between IBM 360 & CDC 6600 ISA.
– IBM has only 2 register specifiers/instr vs. 3 in CDC 6600.
– IBM has 4 FP registers vs. 8 in CDC 6600.
• Current CPU architectures that can be considered descendants of
the IBM 360/91 which implement and utilize a variation of the
Tomasulo Algorithm include:
RISC CPUs: Alpha 21264, HP 8600, MIPS R12000, PowerPC G4
RISC-core x86 CPUs: AMD Athlon, Pentium III, 4, Xeon ….
Dynamic Scheduling: The Tomasulo Approach
The basic structure of a MIPS floating-point unit using Tomasulo’s algorithm
Reservation Station Fields
• Op Operation to perform in the unit (e.g., + or –)
• Vj, Vk Value of Source operands S1 and S2
– Store buffers have a single V field indicating result to
be stored.
• Qj, Qk Reservation stations producing source
registers. (value to be written).
– No ready flags as in Scoreboard; Qj,Qk=0 => ready.
– Store buffers only have Qi for RS producing result.
• A: Address information for loads or stores. Initially
immediate field of instruction then effective address
when calculated.
• Busy: Indicates reservation station and FU are busy.
• Register result status: Qi Indicates which functional
unit will write each register, if one exists.
– Blank (or 0) when no pending instructions exist that will write to that register.
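As a rough illustration only (not the lecture's notation), the reservation-station fields above can be captured in a small data structure; Qj/Qk of 0 means the corresponding operand value is already available:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReservationStation:
        """Sketch of one Tomasulo reservation-station entry (field names follow the slide)."""
        op: Optional[str] = None     # operation to perform (e.g., '+' or '-')
        vj: Optional[float] = None   # value of source operand S1, if available
        vk: Optional[float] = None   # value of source operand S2, if available
        qj: int = 0                  # RS number that will produce S1 (0 = ready)
        qk: int = 0                  # RS number that will produce S2 (0 = ready)
        a: Optional[int] = None      # address information for loads/stores
        busy: bool = False           # entry and its functional unit are in use

        def ready(self) -> bool:
            # Execution may begin when both source operands are available.
            return self.busy and self.qj == 0 and self.qk == 0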
Three Stages of Tomasulo Algorithm

1. Issue: Get an instruction from the pending Instruction Queue.
   – The instruction is issued to a free reservation station (no structural hazard).
   – The selected RS is marked busy.
   – Control sends available instruction operand values (from ISA registers) to the assigned RS.
   – Operands not available yet are renamed to the RSs that will produce them (register renaming).
2. Execution (EX): Operate on operands.
   – When both operands are ready, start executing on the assigned FU.
   – If all operands are not ready, watch the Common Data Bus (CDB) for the needed result (forwarding is done via the CDB).
3. Write result (WB): Finish execution.
   – Write the result on the Common Data Bus to all awaiting units.
   – Mark the reservation station as available.

• Normal data bus: data + destination ("go to" bus).
• Common Data Bus (CDB): data + source ("come from" bus):
  – 64 bits for data + 4 bits for the functional unit source address.
  – Write data to a waiting RS if the source matches the RS it expects (the one producing the result).
  – Does the result forwarding via broadcast to waiting RSs.
Dynamic Conditional Branch Prediction
• Dynamic branch prediction schemes differ from static mechanisms because they use the run-time behavior of branches to make more accurate predictions than is possible using static prediction.
• Usually information about the outcomes of previous occurrences of a given branch (branching history) is used to predict the outcome of the current occurrence. Some of the proposed dynamic branch prediction mechanisms include:
  – One-level or Bimodal: uses a Branch History Table (BHT), a table of usually two-bit saturating counters which is indexed by a portion of the branch address (low bits of the address).
  – Two-Level Adaptive Branch Prediction.
  – McFarling's Two-Level Prediction with index sharing (gshare).
  – Hybrid or Tournament Predictors: use a combination of two or more (usually two) branch prediction mechanisms.
• To reduce the stall cycles resulting from correctly predicted taken branches to zero cycles, a Branch Target Buffer (BTB) that includes the addresses of conditional branches that were taken, along with their targets, is added to the fetch stage.
Branch Target Buffer (BTB)
• Effective branch prediction requires the target of the branch at an early pipeline stage.
• One can use additional adders to calculate the target as soon as the branch instruction is decoded. This would mean that one has to wait until the ID stage before the target of the branch can be fetched; taken branches would be fetched with a one-cycle penalty (this was done in the enhanced MIPS pipeline, Fig A.24).
• To avoid this problem one can use a Branch Target Buffer (BTB). A typical BTB is an associative memory where the addresses of taken branch instructions are stored together with their target addresses.
• Some designs store n prediction bits as well, implementing a combined BTB and BHT.
• Instructions are fetched from the target stored in the BTB in case the branch is predicted-taken and found in the BTB. After the branch has been resolved, the BTB is updated. If a branch is encountered for the first time, a new entry is created once it is resolved.
• Branch Target Instruction Cache (BTIC): a variation of the BTB which also caches the code of the branch target instruction in addition to its address. This eliminates the need to fetch the target instruction from the instruction cache or from memory.
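A minimal sketch of the BTB lookup/update behavior described above, using a plain dictionary keyed by branch address; the addresses and table size are made up, and a real BTB is a small associative hardware structure rather than software:

    class BranchTargetBuffer:
        """Toy BTB: maps the address of a taken branch to its target address."""
        def __init__(self, size=256):
            self.size = size
            self.entries = {}           # branch PC -> predicted target PC

        def lookup(self, pc):
            """Return the predicted target if this PC hits in the BTB, else None."""
            return self.entries.get(pc)

        def update(self, pc, taken, target):
            """After the branch resolves: insert taken branches, drop not-taken ones."""
            if taken:
                if pc not in self.entries and len(self.entries) >= self.size:
                    self.entries.pop(next(iter(self.entries)))  # crude replacement
                self.entries[pc] = target
            else:
                self.entries.pop(pc, None)

    btb = BranchTargetBuffer()
    btb.update(0x400100, taken=True, target=0x400200)
    print(hex(btb.lookup(0x400100) or 0))   # 0x400200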
Basic Branch Target Buffer (BTB)
One-Level Bimodal Branch Predictors
• One-level or bimodal branch prediction uses only one level of branch
history.
• These mechanisms usually employ a table which is indexed by lower
bits of the branch address.
• The table entry consists of n history bits, which form an n-bit
automaton or saturating counters.
• Smith proposed such a scheme, known as the Smith algorithm, that
uses a table of two-bit saturating counters.
• One rarely finds the use of more than 3 history bits in the literature.
• Two variations of this mechanism:
– Decode History Table: Consists of directly mapped entries.
– Branch History Table (BHT): Stores the branch address as a tag. It
is associative and enables one to identify the branch instruction
during IF by comparing the address of an instruction with the stored
branch addresses in the table (similar to BTB).
One-Level Bimodal Branch Predictors: Decode History Table (DHT)

[Figure: The N low bits of the branch address index a table of 2^N two-bit saturating counters.]
• The high bit of the selected counter determines the branch prediction: 0 = Not Taken, 1 = Taken.
• The table has 2^N entries. Example: for N = 12, the table has 2^12 = 4096 = 4K entries.
• Number of bits needed = 2 x 4K = 8K bits.
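A small simulation sketch of such a table of two-bit saturating counters (the table size and the branch address in the example are arbitrary choices, not from the lecture):

    class BimodalPredictor:
        """One-level predictor: 2^n two-bit saturating counters indexed by low
        address bits; counter values 0-1 predict not-taken, 2-3 predict taken."""
        def __init__(self, n_bits=12):
            self.mask = (1 << n_bits) - 1
            self.counters = [1] * (1 << n_bits)    # start weakly not-taken

        def predict(self, branch_addr):
            return self.counters[branch_addr & self.mask] >= 2   # high bit set?

        def update(self, branch_addr, taken):
            i = branch_addr & self.mask
            if taken:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)

    p = BimodalPredictor()
    for outcome in [True, True, False, True]:      # made-up branch history
        print(p.predict(0x400123), outcome)
        p.update(0x400123, outcome)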
[Figure: Prediction accuracy of a 4096-entry basic dynamic two-bit branch predictor on SPEC89: the misprediction rate averages about 11% for the integer programs and about 4% for the FP programs.]
Correlating Branches
Recent branches are possibly correlated: The behavior of
recently executed branches affects prediction of current
branch.
Example:

    if (aa == 2)        /* B1 */
        aa = 0;
    if (bb == 2)        /* B2 */
        bb = 0;
    if (aa != bb) {     /* B3 */
        ...

    DSUBUI  R3, R1, #2
    BNEZ    R3, L1          ; b1 (aa != 2)
    DADD    R1, R0, R0      ; aa == 0
L1: DSUBUI  R3, R2, #2
    BNEZ    R3, L2          ; b2 (bb != 2)
    DADD    R2, R0, R0      ; bb == 0
L2: DSUBUI  R3, R1, R2      ; R3 = aa - bb
    BEQZ    R3, L3          ; b3 (aa == bb)

Branch B3 is correlated with branches B1 and B2: if B1 and B2 are both not taken, then B3 will be taken. Using only the behavior of one branch cannot detect this behavior.
Correlating Two-Level Dynamic GAp Branch Predictors
• Improve branch prediction by looking not only at the history of the branch in question but also at that of other branches, using two levels of branch history.
• Uses two levels of branch history:
  – First level (global):
    • Record the global pattern or history of the m most recently executed branches as taken or not taken. Usually an m-bit shift register.
  – Second level (per branch address):
    • 2^m prediction tables; each table entry has an n-bit saturating counter.
    • The branch history pattern from the first level is used to select the proper branch prediction table in the second level.
    • The low N bits of the branch address are used to select the correct prediction entry within the selected table; thus each of the 2^m tables has 2^N entries.
    • Total number of bits needed for the second level = 2^m x n x 2^N bits.
• In general, the notation (m,n) GAp predictor means:
  – Record the last m branches to select between 2^m history tables.
  – Each second-level table uses n-bit counters (each table entry has n bits).
• The basic two-bit single-level bimodal BHT is then a (0,2) predictor.
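A compact simulation sketch of an (m, n) GAp predictor as described above, built from two-bit counters; m = 2 and N = 4 here mirror the (2,2) example on the next slide, and the branch address in the usage loop is arbitrary:

    class GApPredictor:
        """(m, n=2) GAp: a global m-bit branch history selects one of 2^m tables;
        the low N bits of the branch address select a 2-bit counter in that table."""
        def __init__(self, m=2, n_addr_bits=4):
            self.m = m
            self.addr_mask = (1 << n_addr_bits) - 1
            self.history = 0                                   # global m-bit shift register
            self.tables = [[1] * (1 << n_addr_bits) for _ in range(1 << m)]

        def predict(self, branch_addr):
            counter = self.tables[self.history][branch_addr & self.addr_mask]
            return counter >= 2                                # high bit = predict taken

        def update(self, branch_addr, taken):
            row, col = self.history, branch_addr & self.addr_mask
            c = self.tables[row][col]
            self.tables[row][col] = min(3, c + 1) if taken else max(0, c - 1)
            # shift the outcome into the global history register
            self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    gap = GApPredictor()
    for outcome in [True, False, True, True]:                  # made-up outcomes
        print(gap.predict(0x20), outcome)
        gap.update(0x20, outcome)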
Organization of a Correlating Two-Level GAp (2,2) Branch Predictor

[Figure: The first level is a 2-bit global branch history shift register that selects one of the second-level tables; the low 4 bits of the branch address select the entry within that table. The high bit of the selected 2-bit counter determines the prediction: 0 = Not Taken, 1 = Taken.]
• m = number of branches tracked in the first level = 2, thus 2^m = 2^2 = 4 tables in the second level.
• N = number of low bits of the branch address used = 4, thus each second-level table has 2^N = 2^4 = 16 entries.
• n = number of bits per second-level table entry = 2.
• Number of bits for the second level = 2^m x n x 2^N = 4 x 2 x 16 = 128 bits.
[Figure: Prediction accuracy of two-bit dynamic predictors under SPEC89, comparing a basic single-level (bimodal) predictor with a correlating two-level GAp predictor; the correlating predictor achieves noticeably lower misprediction rates.]
Multiple Instruction Issue: CPI < 1
• To improve a pipeline's CPI to be better [less] than one, and to utilize ILP better, a number of independent instructions have to be issued in the same pipeline cycle.
• Multiple instruction issue processors are of two types:
  – Superscalar: A number of instructions (2-8) is issued in the same cycle, scheduled statically by the compiler or dynamically (Tomasulo).
    • PowerPC, Sun UltraSparc, Alpha, HP 8000, ...
  – VLIW (Very Long Instruction Word): A fixed number of instructions (3-6) are formatted as one long instruction word or packet (statically scheduled by the compiler).
    • Joint HP/Intel agreement (Itanium, Q4 2000).
    • Intel Architecture-64 (IA-64), 64-bit address: Explicitly Parallel Instruction Computing (EPIC): Itanium.
• Limitations of the approaches:
  – Available ILP in the program (both).
  – Specific hardware implementation difficulties (superscalar).
  – VLIW optimal compiler design issues.
Simple Statically Scheduled Superscalar Pipeline
• Two instructions can be issued per cycle (two-issue superscalar).
• One of the instructions is integer (including load/store, branch). The other instruction is a floating-point operation.
  – This restriction reduces the complexity of hazard checking.
• Hardware must fetch and decode two instructions per cycle.
• Then it determines whether zero (a stall), one, or two instructions can be issued per cycle.

[Table: Two-issue statically scheduled pipeline in operation over cycles 1-8. Each cycle an integer instruction and an FP instruction (assumed to be adds) are issued together; successive pairs proceed through IF, ID, EX, MEM, and WB one cycle apart.]
Intel/HP VLIW "Explicitly Parallel Instruction Computing (EPIC)"
• Three instructions in 128-bit "groups"; an instruction template field determines whether the instructions are dependent or independent.
  – Smaller code size than old VLIW, larger than x86/RISC.
  – Groups can be linked to show dependencies of more than three instructions.
• 128 integer registers + 128 floating-point registers.
  – No separate register files per functional unit as in old VLIW.
• Hardware checks dependencies (interlocks, preserving binary compatibility over time).
• Predicated execution: an implementation of conditional instructions used to reduce the number of conditional branches in the generated code, giving a larger basic block size.
• IA-64: name given to the instruction set architecture (ISA).
• Itanium: name of the first implementation (2001).
Intel/HP EPIC VLIW Approach

[Figure: Compiler flow - original source code → compiler: expose instruction parallelism (instruction dependency analysis) → exploit parallelism: generate VLIWs → optimize.]

128-bit bundle (bits 127..0):  Instruction 2 | Instruction 1 | Instruction 0 | Template
Unrolled Loop Example for Scalar Pipeline

     1  Loop: L.D     F0, 0(R1)
     2        L.D     F6, -8(R1)
     3        L.D     F10, -16(R1)
     4        L.D     F14, -24(R1)
     5        ADD.D   F4, F0, F2
     6        ADD.D   F8, F6, F2
     7        ADD.D   F12, F10, F2
     8        ADD.D   F16, F14, F2
     9        S.D     F4, 0(R1)
    10        S.D     F8, -8(R1)
    11        DADDUI  R1, R1, #-32
    12        S.D     F12, -16(R1)
    13        BNE     R1, R2, LOOP
    14        S.D     F16, 8(R1)      ; 8 - 32 = -24

Latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles.
14 clock cycles, or 3.5 per iteration.
Loop Unrolling in Superscalar Pipeline: (1 Integer, 1 FP)/Cycle

    Clock cycle   Integer instruction        FP instruction
     1            L.D    F0, 0(R1)
     2            L.D    F6, -8(R1)
     3            L.D    F10, -16(R1)        ADD.D  F4, F0, F2
     4            L.D    F14, -24(R1)        ADD.D  F8, F6, F2
     5            L.D    F18, -32(R1)        ADD.D  F12, F10, F2
     6            S.D    F4, 0(R1)           ADD.D  F16, F14, F2
     7            S.D    F8, -8(R1)          ADD.D  F20, F18, F2
     8            S.D    F12, -16(R1)
     9            DADDUI R1, R1, #-40
    10            S.D    F16, -24(R1)
    11            BNE    R1, R2, LOOP
    12            S.D    F20, -32(R1)

• Unrolled 5 times to avoid delays (+1 due to superscalar issue).
• 12 clocks, or 2.4 clocks per iteration (1.5X).
• 7 issue slots wasted.
Loop Unrolling in VLIW Pipeline: (2 Memory, 2 FP, 1 Integer)/Cycle

    Clock  Memory reference 1   Memory reference 2   FP operation 1       FP operation 2       Int. op / branch
     1     L.D  F0, 0(R1)       L.D  F6, -8(R1)
     2     L.D  F10, -16(R1)    L.D  F14, -24(R1)
     3     L.D  F18, -32(R1)    L.D  F22, -40(R1)    ADD.D F4, F0, F2     ADD.D F8, F6, F2
     4     L.D  F26, -48(R1)                         ADD.D F12, F10, F2   ADD.D F16, F14, F2
     5                                               ADD.D F20, F18, F2   ADD.D F24, F22, F2
     6     S.D  F4, 0(R1)       S.D  F8, -8(R1)      ADD.D F28, F26, F2
     7     S.D  F12, -16(R1)    S.D  F16, -24(R1)                                              DADDUI R1, R1, #-56
     8     S.D  F20, 24(R1)     S.D  F24, 16(R1)
     9     S.D  F28, 8(R1)                                                                     BNE R1, R2, LOOP

• Unrolled 7 times to avoid delays.
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X).
• Average: 2.5 ops per clock, 50% efficiency.
• Note: Needs more registers in VLIW (15 vs. 6 in superscalar).
Superscalar Dynamic Scheduling
• How to issue two instructions and keep in-order instruction
issue for Tomasulo?
– Assume: 1 integer + 1 floating-point operations.
– 1 Tomasulo control for integer, 1 for floating point.
• Issue at 2X Clock Rate, so that issue remains in order.
• Only FP loads might cause a dependency between integer and
FP issue:
– Replace load reservation station with a load queue;
operands must be read in the order they are fetched.
– Load checks addresses in Store Queue to avoid RAW
violation
– Store checks addresses in Load Queue to avoid WAR,
WAW.
• Called “Decoupled Architecture”
Superscalar Architectures:
Issue Slot Waste Classification
•
Empty or wasted issue slots can be defined as either vertical waste or
horizontal waste:
– Vertical waste is introduced when the processor issues no
instructions in a cycle.
– Horizontal waste occurs when not all issue slots can be filled
in a cycle.
Hardware Support for Extracting More Parallelism
• Compiler ILP techniques (loop unrolling, software pipelining, etc.) are not effective in uncovering maximum ILP when branch behavior is not well known at compile time.
• Hardware ILP techniques:
  – Conditional or predicated instructions: an extension to the instruction set with instructions that turn into no-ops if a condition is not valid at run time.
  – Speculation: an instruction is executed before the processor knows that the instruction should execute, to avoid control dependence stalls:
    • Static speculation by the compiler with hardware support:
      – The compiler labels an instruction as speculative and the hardware helps by ignoring the outcome of incorrectly speculated instructions.
      – Conditional instructions provide limited speculation.
    • Dynamic hardware-based speculation:
      – Uses dynamic branch prediction to guide the speculation process.
      – Dynamic scheduling and execution continue past a conditional branch in the predicted branch direction.
Conditional or Predicated Instructions
• Avoid branch prediction by turning branches into conditionally-executed instructions:

      if (x) then (A = B op C) else NOP

  – If false, then neither store the result nor cause an exception: the instruction is annulled (turned into a NOP).
  – Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move.
  – HP PA-RISC can annul any following instruction.
  – IA-64: 64 1-bit condition fields selected, allowing conditional execution of any instruction (predication).
• Drawbacks of conditional instructions:
  – Still takes a clock cycle even if "annulled".
  – Must stall if the condition is evaluated late.
  – Complex conditions reduce effectiveness; the condition becomes known late in the pipeline.
Dynamic Hardware-Based Speculation
• Combines:
– Dynamic hardware-based branch prediction
– Dynamic Scheduling: of multiple instructions to issue and
execute out of order.
• Continue to dynamically issue and execute instructions past a conditional branch in the dynamically predicted branch direction, before control dependencies are resolved.
– This overcomes the ILP limitations of the basic block size.
– Creates dynamically speculated instructions at run-time with no
compiler support at all.
– If a branch turns out as mispredicted all such dynamically
speculated instructions must be prevented from changing the state of
the machine (registers, memory).
• Addition of commit (retire or re-ordering) stage and forcing
instructions to commit in their order in the code (i.e to write
results to registers or memory).
• Precise exceptions are possible since instructions must commit in
order.
Hardware-Based
Speculation
Speculative Execution +
Tomasulo’s Algorithm
Four Steps of Speculative Tomasulo Algorithm
1. Issue — Get an instruction from FP Op Queue
If a reservation station and a reorder buffer slot are free, issue instruction
& send operands & reorder buffer number for destination (this stage is
sometimes called “dispatch”)
2. Execution — Operate on operands (EX)
When both operands are ready then execute; if not ready, watch CDB for
result; when both operands are in reservation station, execute; checks
RAW (sometimes called “issue”)
3. Write result — Finish execution (WB)
Write on Common Data Bus to all awaiting FUs & reorder buffer; mark
reservation station available.
4. Commit — Update registers, memory with reorder buffer result
– When an instruction is at head of reorder buffer & the result is present,
update register with result (or store to memory) and remove instruction
from reorder buffer.
– A mispredicted branch at the head of the reorder buffer flushes the
reorder buffer (sometimes called “graduation”)
⇒ Instructions issue in order, execute (EX) and write results (WB) out of order, but must commit in order.
Memory Hierarchy: The Motivation
• The gap between CPU performance and main memory has been widening, with higher-performance CPUs creating performance bottlenecks for memory access instructions.
• The memory hierarchy is organized into several levels of memory, with the smaller, more expensive, and faster memory levels closer to the CPU: registers, then the primary cache level (L1), then additional secondary cache levels (L2, L3, ...), then main memory, then mass storage (virtual memory).
• Each level of the hierarchy is a subset of the level below: data found in a level is also found in the level below, but at lower speed.
• Each level maps addresses from a larger physical memory to a smaller level of physical memory.
• This concept is greatly aided by the principle of locality, both temporal and spatial, which indicates that programs tend to reuse data and instructions that they have used recently or those stored in their vicinity, leading to the working set of a program.
Memory Hierarchy: Motivation
Processor-Memory (DRAM) Performance Gap

[Figure: Relative performance (log scale, 1 to 1000) vs. year, 1980-2000: processor performance grows about 60% per year while DRAM performance grows about 7% per year, so the processor-memory performance gap grows about 50% per year.]
Cache Design & Operation Issues
• Q1: Where can a block be placed in cache?
(Block placement strategy & Cache organization)
– Fully Associative, Set Associative, Direct Mapped.
• Q2: How is a block found if it is in cache?
(Block identification)
– Tag/Block.
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU.
• Q4: What happens on a write?
(Cache write policy)
– Write through, write back.
EECC722 - Shaaban
#90 Lec # 1 Fall 2004 9-6-2004
Cache Organization & Placement Strategies
Placement strategies or mapping of a main memory data block onto
cache block frame addresses divide cache into three organizations:
1 Direct mapped cache: A block can be placed in one location only,
given by:
(Block address) MOD (Number of blocks in cache)
2 Fully associative cache: A block can be placed anywhere in cache.
3 Set associative cache: A block can be placed in a restricted set of
places, or cache block frames. A set is a group of block frames in
the cache. A block is first mapped onto the set and then it can be
placed anywhere within the set. The set in this case is chosen by:
(Block address) MOD (Number of sets in cache)
If there are n blocks in a set the cache placement is called n-way
set-associative.
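The two mapping functions above can be written directly; a minimal sketch, with hypothetical frame and set counts chosen only for illustration:

def direct_mapped_frame(block_address, num_blocks):
    # Direct mapped: (Block address) MOD (Number of blocks in cache)
    return block_address % num_blocks

def set_associative_set(block_address, num_sets):
    # Set associative: (Block address) MOD (Number of sets in cache)
    return block_address % num_sets

# e.g. 256 block frames direct mapped, or the same frames organized 4-way (64 sets):
print(direct_mapped_frame(1000, 256))   # frame 232
print(set_associative_set(1000, 64))    # set 40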
EECC722 - Shaaban
#91 Lec # 1 Fall 2004 9-6-2004
Locating A Data Block in Cache
• Each block frame in cache has an address tag.
• The tags of every cache block that might contain the required data
are checked in parallel.
• A valid bit is added to the tag to indicate whether this entry contains
a valid address.
• The address from the CPU to cache is divided into:
– A block address, further divided into:
• An index field to choose a block set in cache.
(no index field when fully associative).
• A tag field to search and match addresses in the selected set.
– A block offset to select the data from the block.
Address fields:  | Tag | Index | Block Offset |    (Tag + Index = Block Address)
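A minimal sketch of this address decomposition (the 8-bit index and 2-bit offset widths used as defaults here are assumptions for illustration only):

def split_address(addr, index_bits=8, offset_bits=2):
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), offset)      # 0x48d15 0x9e 0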
EECC722 - Shaaban
#92 Lec # 1 Fall 2004 9-6-2004
Address Field Sizes
Physical Address Generated by CPU:  | Tag | Index | Block Offset |    (Tag + Index = Block Address)
Block offset size = log2(block size)
Index size = log2(Number of sets) = log2(Total number of blocks / associativity)
Tag size = address size - index size - offset size
Mapping function:
Cache set or block frame number = Index = (Block Address) MOD (Number of Sets)
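These formulas can be checked with a small helper; a sketch under assumed cache parameters (the function name and arguments are not from the lecture):

import math

def field_sizes(address_bits, cache_bytes, block_bytes, associativity):
    total_blocks = cache_bytes // block_bytes
    num_sets = total_blocks // associativity
    offset = int(math.log2(block_bytes))       # log2(block size)
    index = int(math.log2(num_sets))           # log2(total blocks / associativity)
    tag = address_bits - index - offset
    return tag, index, offset

# 32-bit addresses, 4 KB cache, one-word (4-byte) blocks, 4-way set associative:
print(field_sizes(32, 4096, 4, 4))   # (22, 8, 2) -- matches the MIPS example on the next slide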
EECC722 - Shaaban
#93 Lec # 1 Fall 2004 9-6-2004
4K Four-Way Set Associative Cache:
MIPS Implementation Example
[Figure: 4 KB, 4-way set associative cache, MIPS-style implementation.
1024 block frames, each block one word (4 bytes), organized as 256 sets;
can cache up to 2^32 bytes = 4 GB of memory.
The 32-bit address is split into: Tag = 22 bits, Index = 8 bits (selects one of
the 256 sets), Block offset = 2 bits; Block Address = 30 bits.
The four valid bits and tags of the indexed set are compared in parallel with the
tag field, and a 4-to-1 multiplexor selects the data word on a hit.]
Mapping Function: Cache Set Number = (Block address) MOD (256)
EECC722 - Shaaban
#94 Lec # 1 Fall 2004 9-6-2004
Cache Performance:
Average Memory Access Time (AMAT), Memory Stall cycles
• The Average Memory Access Time (AMAT): The number of
cycles required to complete an average memory access request
by the CPU.
• Memory stall cycles per memory access: The number of stall
cycles added to CPU execution cycles for one memory access.
• For ideal memory: AMAT = 1 cycle, which results in zero
memory stall cycles.
• Memory stall cycles per average memory access = (AMAT - 1)
• Memory stall cycles per average instruction =
Memory stall cycles per average memory access
x Number of memory accesses per instruction
= (AMAT - 1) x (1 + fraction of loads/stores)
(the 1 accounts for the instruction fetch made by every instruction)
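Plugging assumed example numbers into these formulas (the AMAT and load/store fraction below are illustrative, not from the lecture):

AMAT = 3.0               # cycles per average memory access (assumed)
loads_stores = 0.30      # fraction of instructions that are loads/stores (assumed)

stalls_per_access = AMAT - 1
accesses_per_instruction = 1 + loads_stores      # 1 instruction fetch + data accesses
stalls_per_instruction = stalls_per_access * accesses_per_instruction
print(stalls_per_instruction)                    # ~2.6 stall cycles per instruction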
EECC722 - Shaaban
#95 Lec # 1 Fall 2004 9-6-2004
Cache Performance
Single Level L1 Princeton (Unified) Memory Architecture
CPUtime = Instruction count x CPI x Clock cycle time
CPIexecution = CPI with ideal memory
CPI = CPIexecution + Mem Stall cycles per instruction
Mem Stall cycles per instruction =
Mem accesses per instruction x Miss rate x Miss penalty
CPUtime = Instruction Count x (CPIexecution +
Mem Stall cycles per instruction) x Clock cycle time
CPUtime = IC x (CPIexecution + Mem accesses per instruction x
Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x
Clock cycle time
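For example, with assumed values for every term in the equation above (instruction count, ideal CPI, access rate, miss rate, miss penalty and clock are all illustrative):

IC = 1e9                  # instruction count (assumed)
CPI_execution = 1.1       # CPI with ideal memory (assumed)
accesses_per_instr = 1.3  # 1 fetch + 0.3 loads/stores per instruction (assumed)
miss_rate = 0.02          # unified L1 miss rate (assumed)
miss_penalty = 100        # cycles to main memory (assumed)
clock_cycle = 1e-9        # 1 GHz clock (assumed)

mem_stalls_per_instr = accesses_per_instr * miss_rate * miss_penalty
CPU_time = IC * (CPI_execution + mem_stalls_per_instr) * clock_cycle
print(mem_stalls_per_instr, CPU_time)   # ~2.6 stall cycles per instruction, ~3.7 seconds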
EECC722 - Shaaban
#96 Lec # 1 Fall 2004 9-6-2004
Miss Rates for Caches with Different Size,
Associativity & Replacement Algorithm
Sample Data
Size      2-way LRU   2-way Random   4-way LRU   4-way Random   8-way LRU   8-way Random
16 KB     5.18%       5.69%          4.67%       5.29%          4.39%       4.96%
64 KB     1.88%       2.01%          1.54%       1.66%          1.39%       1.53%
256 KB    1.15%       1.17%          1.13%       1.13%          1.12%       1.12%
EECC722 - Shaaban
#97 Lec # 1 Fall 2004 9-6-2004
Cache Write Strategies
1 Write Through: Data is written to both the cache block and to the
corresponding block of main memory.
– The lower level always has the most updated data; an important
feature for I/O and multiprocessing.
– Easier to implement than write back.
– A write buffer is often used to reduce CPU write stall while data is
written to memory.
2 Write Back: Data is written or updated only to the cache
block. The modified or dirty cache block is written to main
memory when it is being replaced from cache.
– Writes occur at the speed of the cache.
– A status bit, called a dirty or modified bit, is used to indicate
whether the block was modified while in cache; if not, the block is
not written back to main memory when replaced.
– Uses less memory bandwidth than write through.
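A minimal sketch contrasting the two policies (the dictionaries standing in for the cache, dirty bits and main memory are assumptions for illustration, not a cache model):

def write_through(cache, memory, addr, value):
    cache[addr] = value
    memory[addr] = value                  # lower level always has the latest data

def write_back(cache, dirty, memory, addr, value):
    cache[addr] = value
    dirty[addr] = True                    # defer the memory update

def replace_write_back(cache, dirty, memory, addr):
    if dirty.get(addr):                   # only dirty blocks are written back
        memory[addr] = cache[addr]
    cache.pop(addr, None)
    dirty.pop(addr, None)

cache, dirty, memory = {}, {}, {}
write_back(cache, dirty, memory, 0x40, 7)        # memory not touched yet
replace_write_back(cache, dirty, memory, 0x40)   # dirty block written back on replacement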
EECC722 - Shaaban
#98 Lec # 1 Fall 2004 9-6-2004
Cache Write Miss Policy
• Since data is usually not needed immediately on a write miss,
two options exist on a cache write miss:
Write Allocate:
The cache block is loaded on a write miss followed by write hit actions.
No-Write Allocate:
The block is modified in the lower level (lower cache level, or main
memory) and not loaded into cache.
While either write miss policy can be used with
write back or write through:
• Write back caches always use write allocate to capture
subsequent writes to the block in cache.
• Write through caches usually use no-write allocate since
subsequent writes still have to go to memory.
EECC722 - Shaaban
#99 Lec # 1 Fall 2004 9-6-2004
Memory Access Tree, Unified L1
Write Through, No Write Allocate, No Write Buffer
CPU Memory Access
  Read:
    L1 Read Hit:   Access time = 1       Stalls = 0
    L1 Read Miss:  Access time = M + 1   Stalls per access = % reads x (1 - H1) x M
  Write (with no write buffer, every write goes to memory):
    L1 Write Hit:  Access time = M + 1   Stalls per access = % writes x H1 x M
    L1 Write Miss: Access time = M + 1   Stalls per access = % writes x (1 - H1) x M

Stall Cycles Per Memory Access = % reads x (1 - H1) x M + % writes x M
AMAT = 1 + % reads x (1 - H1) x M + % writes x M
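Plugging assumed values into the write-through tree above (hit rate, read/write mix and memory penalty M are illustrative only):

H1 = 0.95       # L1 hit rate (assumed)
reads = 0.75    # fraction of memory accesses that are reads (assumed)
writes = 0.25   # fraction that are writes (assumed)
M = 100         # main memory access penalty in cycles (assumed)

stalls_per_access = reads * (1 - H1) * M + writes * M   # every write stalls for M
AMAT = 1 + stalls_per_access
print(stalls_per_access, AMAT)    # ~28.75 stall cycles, AMAT ~29.75 cycles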
EECC722 - Shaaban
#100 Lec # 1 Fall 2004 9-6-2004
Memory Access Tree Unified L1
Write Back, With Write Allocate
CPU Memory Access
  Read:
    L1 Read Hit (% reads x H1):   Access time = 1     Stalls = 0
    L1 Read Miss:
      Clean block:  Access time = M + 1    Stall cycles = M x (1 - H1) x % reads x % clean
      Dirty block:  Access time = 2M + 1   Stall cycles = 2M x (1 - H1) x % reads x % dirty
  Write:
    L1 Write Hit (% writes x H1): Access time = 1     Stalls = 0
    L1 Write Miss:
      Clean block:  Access time = M + 1    Stall cycles = M x (1 - H1) x % writes x % clean
      Dirty block:  Access time = 2M + 1   Stall cycles = 2M x (1 - H1) x % writes x % dirty

Stall Cycles Per Memory Access = (1 - H1) x (M x % clean + 2M x % dirty)
AMAT = 1 + Stall Cycles Per Memory Access
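With assumed values (hit rate, clean/dirty split and memory penalty M are illustrative), the write-back tree evaluates as:

H1 = 0.95      # L1 hit rate (assumed)
clean = 0.6    # fraction of replaced blocks that are clean (assumed)
dirty = 0.4    # fraction that are dirty (assumed)
M = 100        # memory access penalty in cycles (assumed)

# A miss to a clean block costs M (fetch); a dirty block costs 2M (write back + fetch).
stalls_per_access = (1 - H1) * (M * clean + 2 * M * dirty)
AMAT = 1 + stalls_per_access
print(stalls_per_access, AMAT)    # ~7.0 stall cycles, AMAT ~8.0 cycles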
EECC722 - Shaaban
#101 Lec # 1 Fall 2004 9-6-2004
Miss Rates For Multi-Level Caches
• Local Miss Rate: This rate is the number of misses in a
cache level divided by the number of memory accesses to
this level. Local Hit Rate = 1 - Local Miss Rate
• Global Miss Rate: The number of misses in a cache level
divided by the total number of memory accesses generated
by the CPU.
• Since level 1 receives all CPU memory accesses, for level 1:
Local Miss Rate = Global Miss Rate = 1 - H1
• For level 2, since it only receives the accesses that missed in level 1:
Local Miss Rate = Miss Rate L2 = 1 - H2
Global Miss Rate = Miss Rate L1 x Miss Rate L2
= (1 - H1) x (1 - H2)
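A quick numeric check with assumed hit rates H1 and H2:

H1 = 0.95    # L1 hit rate (assumed)
H2 = 0.80    # L2 local hit rate, i.e. of the accesses that missed in L1 (assumed)

L1_local_miss = L1_global_miss = 1 - H1            # L1 sees all CPU accesses
L2_local_miss = 1 - H2
L2_global_miss = (1 - H1) * (1 - H2)
print(L1_global_miss, L2_local_miss, L2_global_miss)   # ~0.05, ~0.20, ~0.01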
EECC722 - Shaaban
#102 Lec # 1 Fall 2004 9-6-2004
Write Policy For 2-Level Cache
• Write Policy For Level 1 Cache:
– Usually Write through to Level 2
– Write allocate is used to reduce level 1 miss reads.
– Use write buffer to reduce write stalls
• Write Policy For Level 2 Cache:
– Usually write back with write allocate is used.
• To minimize memory bandwidth usage.
• The above 2-level cache write policy results in an inclusive L2 cache,
since the contents of L1 are also in L2.
• Common in the majority of CPUs with two levels of cache.
• As opposed to exclusive L1/L2 caches (e.g., AMD Athlon).
EECC722 - Shaaban
#103 Lec # 1 Fall 2004 9-6-2004
2-Level (Both Unified) Memory Access Tree
L1: Write Through to L2, Write Allocate, With Perfect Write Buffer
L2: Write Back with Write Allocate
CPU Memory Access
  L1 Hit (H1):        Stalls per access = 0
  L1 Miss (1 - H1):
    L2 Hit:           Stalls = (1 - H1) x H2 x T2
    L2 Miss ((1 - H1) x (1 - H2)):
      Clean block:    Stall cycles = M x (1 - H1) x (1 - H2) x % clean
      Dirty block:    Stall cycles = 2M x (1 - H1) x (1 - H2) x % dirty

Stall cycles per memory access
  = (1 - H1) x H2 x T2 + M x (1 - H1) x (1 - H2) x % clean + 2M x (1 - H1) x (1 - H2) x % dirty
  = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x (% clean x M + % dirty x 2M)
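With assumed parameter values (H1, H2, T2, M and the clean/dirty split below are illustrative), the 2-level formula evaluates as:

H1, H2 = 0.95, 0.80       # L1 hit rate; L2 local hit rate (assumed)
T2 = 10                   # L2 access time in cycles, paid on an L1 miss (assumed)
M = 100                   # main memory penalty in cycles (assumed)
clean, dirty = 0.7, 0.3   # state of the L2 block being replaced (assumed)

stalls_per_access = ((1 - H1) * H2 * T2
                     + (1 - H1) * (1 - H2) * (clean * M + dirty * 2 * M))
AMAT = 1 + stalls_per_access
print(stalls_per_access, AMAT)    # ~1.7 stall cycles, AMAT ~2.7 cycles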
EECC722 - Shaaban
#104 Lec # 1 Fall 2004 9-6-2004
Cache Optimization Summary
(MR = Miss Rate, MP = Miss Penalty, HT = Hit Time; + improves the factor, – hurts it)

Technique                            MR   MP   HT   Complexity
Larger Block Size                    +    –         0
Higher Associativity                 +         –    1
Victim Caches                        +              2
Pseudo-Associative Caches            +              2
HW Prefetching of Instr/Data         +              2
Compiler Controlled Prefetching      +              3
Compiler Reduce Misses               +              0
Priority to Read Misses                   +         1
Subblock Placement                        +    +    1
Early Restart & Critical Word 1st         +         2
Non-Blocking Caches                       +         3
Second Level Caches                       +         2
Small & Simple Caches                –         +    0
Avoiding Address Translation                   +    2
Pipelining Writes                              +    1
EECC722 - Shaaban
#105 Lec # 1 Fall 2004 9-6-2004
X86 CPU Cache/Memory Performance Example:
AMD Athlon XP/64/FX Vs. Intel P4/Extreme Edition
Intel P4 3.2 GHz Extreme Edition:   Data L1: 8 KB,  L2: 512 KB,  L3: 2048 KB
Intel P4 3.2 GHz:                   Data L1: 8 KB,  L2: 512 KB
AMD Athlon 64 FX-51 2.2 GHz:        Data L1: 64 KB, L2: 1024 KB (exclusive)
AMD Athlon 64 3400+ 2.2 GHz:        Data L1: 64 KB, L2: 1024 KB (exclusive)
AMD Athlon 64 3200+ 2.0 GHz:        Data L1: 64 KB, L2: 1024 KB (exclusive)
AMD Athlon 64 3000+ 2.0 GHz:        Data L1: 64 KB, L2: 512 KB (exclusive)
AMD Athlon XP 2.2 GHz:              Data L1: 64 KB, L2: 512 KB (exclusive)
Main Memory: Dual (64-bit) Channel PC3200 DDR SDRAM, peak bandwidth of 6400 MB/s
Source: The Tech Report 1-21-2004
http://www.tech-report.com/reviews/2004q1/athlon64-3000/index.x?pg=3
EECC722 - Shaaban
#106 Lec # 1 Fall 2004 9-6-2004