High Performance Computing
(CS 680)
Lecture 2a: Overview of High Performance Processors*
Jeremy R. Johnson
*This lecture was derived from material in
the text (HPC Chap. 1-2).
Introduction
• Objective: To review recent developments in the design of
high performance microprocessors and to indicate how these
features affect program performance.
• An example program will be used to illustrate benchmarking
techniques and the effect of compiler optimizations and code
organization on performance. We will show how changes
in software can improve performance by better utilizing the
underlying hardware; a minimal sketch of this kind of
experiment follows the topic list. Our goal for the course is
to understand this behavior.
• Topics
– pipelining
– instruction level parallelism, superscalar and out of order execution
– memory hierarchy: cache, virtual memory
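
As a taste of the benchmarking theme, here is a minimal sketch (the
matrix size and timing method are illustrative, not from the text): the
same sum over a matrix, traversed in row-major and then column-major
order. Only the code organization differs, yet on most machines the
cache-friendly order is several times faster.

    #include <stdio.h>
    #include <time.h>

    #define N 2000
    static double a[N][N];

    int main(void) {
        double sum = 0.0;
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)      /* row-major: unit stride, cache friendly */
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        clock_t t1 = clock();
        for (int j = 0; j < N; j++)      /* column-major: stride of N doubles, poor locality */
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        clock_t t2 = clock();
        printf("row order:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column order: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return sum == 0.0 ? 0 : 1;       /* use sum so the loops are not optimized away */
    }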
RISC vs. CISC
• CISC: the instruction set is made up of powerful instructions close
to the primitives of a high-level language such as C or FORTRAN.
• RISC: low-level instructions are emphasized. RISC is a label
most commonly used for a set of instruction set architecture
characteristics chosen to ease the use of aggressive
implementation techniques found in high-performance
processors (John Mashey).
• RISC became prevalent in the mid-1980s (an earlier example is
the CDC 6600), when more transistors and better compilers
became available.
• RISC trades complex instructions for a faster clock rate and more
room for extra registers, cache, and advanced performance
techniques.
Characterizing RISC
• Instruction pipelining
• Pipelined floating-point execution
• Uniform instruction length
• Delayed branching
• Load/Store architecture
• Simple addressing modes
Pipelining
• Instruction pipelining breaks each instruction into stages:
– Instruction Fetch (IF)
– Instruction Decode (ID)
– Operand Fetch (F)
– Execute (E)
– Writeback (W)
• Successive instructions overlap in the pipeline, one stage apart:

IF ID F  E  W
   IF ID F  E  W
      IF ID F  E  W
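
The payoff is easy to quantify: without pipelining, n instructions on a
k-stage machine take n × k cycles; with pipelining, one instruction
completes per cycle once the pipe is full, for k + (n - 1) cycles. A
minimal sketch of the arithmetic:

    #include <stdio.h>

    int main(void) {
        int k = 5;  /* stages: IF ID F E W */
        for (int n = 1; n <= 5; n++)
            printf("%d instruction(s): unpipelined %2d cycles, pipelined %d cycles\n",
                   n, n * k, k + n - 1);
        return 0;
    }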
Branches and Hazards
• If a branch is taken, the pipeline may need to be flushed, since
the wrong instructions may have been started. Instructions fetched
after the branch are guesses until the branch resolves in the
Execute stage:

branch: IF ID F  E  W
guess:     IF ID F  E  W
guess:        IF ID F  E  W
guess:           IF ID F  E  W
sure:               IF ID F  E
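
The cost of wrong guesses can be observed directly. A minimal sketch
(array size and trip counts are arbitrary): the same loop runs over
random and then sorted data; only the predictability of the branch
changes, and on processors that guess past branches the predictable
version is typically several times faster.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1000000

    static int cmp(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    /* Time the same data-dependent branch over the array 100 times. */
    static double run(const int *data, long long *sum) {
        *sum = 0;
        clock_t t0 = clock();
        for (int r = 0; r < 100; r++)
            for (int i = 0; i < N; i++)
                if (data[i] >= 128)   /* the branch under test */
                    *sum += data[i];
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void) {
        static int data[N];
        long long sum;
        for (int i = 0; i < N; i++)
            data[i] = rand() % 256;

        /* Random data: the branch goes each way about half the time,
         * so guesses are often wrong and the pipeline is flushed. */
        double t = run(data, &sum);
        printf("random: %.3f s (sum %lld)\n", t, sum);

        /* Sorted data: long false run, then long true run, so the
         * guesses are almost always right. */
        qsort(data, N, sizeof data[0], cmp);
        t = run(data, &sum);
        printf("sorted: %.3f s (sum %lld)\n", t, sum);
        return 0;
    }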
Advanced Techniques
• Superscalar Processors
– issue more than one instruction per cycle
– instructions issued together cannot have dependencies or hardware conflicts
– for example, an add can execute simultaneously with a multiply
• Superpipelining
– more stages in the pipeline, allowing a faster clock
• Out-of-order and speculative execution
– maintain program semantics but allow instructions to be computed in a
different order
– may need to guess which instructions to execute next (branch prediction)
– relies on the difference between when a result is computed and when
the instruction is officially executed (retired)
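
A hypothetical fragment to make the dependency point concrete: the
first two statements are independent and can issue together on a
superscalar machine; the chain that follows cannot be sped up by issue
width alone.

    #include <stdio.h>

    int main(void) {
        double a = 1.0, b = 2.0, c = 3.0, d = 4.0;

        /* Independent: no shared results, so a superscalar processor
         * can issue the add and the multiply in the same cycle. */
        double s = a + b;
        double p = c * d;

        /* Dependent chain: each operation needs the previous result,
         * so extra issue slots cannot shorten it. */
        double t = s * p;
        double u = t + a;
        double v = u * b;

        printf("%g %g %g %g %g\n", s, p, t, u, v);
        return 0;
    }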
Post-RISC Pipeline
[Figure: the post-RISC pipeline — instruction fetch (IF) and decode (ID)
with branch prediction, rename registers (RR), an instruction reorder
buffer (IRB) from which instructions issue out of order to the execute
units (E), and in-order retirement (R).]
Memory Hierarchy
• SRAM vs. DRAM
– small fast memory vs. large slow memory
– principle of locality
• Registers
• Cache (level 1)
• Cache (level 2)
• Main memory
• Disk
Memory Access Speed on
DEC 21164 Alpha
• Clock speed 500 MHz (= 2 ns cycle time)
• Registers (2 ns)
• L1 On-Chip (4 ns)
• L2 On-Chip (5 ns)
• L3 Off-Chip (30 ns)
• Memory (220 ns)
• A load that must go all the way to main memory thus costs about
110 cycles (220 ns / 2 ns)
Common Framework for Memory
Hierarchies
• Question 1: Where can a block be placed?
– One place (direct mapped), a few places (set associative), or any place
(fully associative)
• Question 2: How is a block found?
– There are four methods: indexing, limited search, full search, or table
lookup
• Question 3: Which block should be replaced on a cache
miss?
– Typically least recently used or random block
• Question 4: What happens on writes?
– Write-through or write-back
Mapping to Cache
• Cache - the level of the memory hierarchy between the CPU
and main memory.
• Direct-Mapped Cache - each memory block maps to exactly one
location in the cache:
(Block address) mod (Number of blocks in cache)
• The number of blocks is typically a power of two, so the cache
location is obtained from the low-order bits of the block address.
[Figure: memory blocks 0 through 31 mapping onto a 4-block
direct-mapped cache (blocks 0-3); blocks 0, 4, 8, ... all map to cache
block 0, and so on.]
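
The mapping rule is easy to tabulate. A minimal sketch (a 4-block
cache is assumed, matching the figure):

    #include <stdio.h>

    int main(void) {
        int num_blocks = 4;  /* small power-of-two cache, as in the figure */
        for (int block_addr = 0; block_addr < 32; block_addr++)
            printf("memory block %2d -> cache block %d\n",
                   block_addr, block_addr % num_blocks);
        /* With a power-of-two num_blocks, the mod is just the low-order
         * bits: block_addr % 4 == (block_addr & 0x3). */
        return 0;
    }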
Locating Data in the Cache
• Use the cache index (the mapping) to select an entry, then
compare the stored tag to the Tag (high-order bits of the
address) to see if the element is currently in cache
• A valid bit is used to indicate whether the data in the
cache entry is valid
• A hit occurs when the data is in cache; otherwise it is
a miss
• The extra time required when a cache miss occurs
is called the miss penalty
[Figure: address (showing bit positions) for a 1024-entry direct-mapped
cache with one-word blocks — the 32-bit address splits into a 20-bit
tag (bits 31-12), a 10-bit index (bits 11-2) selecting one of the 1024
entries (0-1023), and a 2-bit byte offset (bits 1-0); each entry holds
a valid bit, a 20-bit tag, and 32 bits of data, and Hit is asserted
when the entry is valid and the tags match.]
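
Extracting the three fields is a pair of shifts and masks. A sketch
using the figure's parameters (1024 one-word, 4-byte blocks); the
example address is arbitrary:

    #include <stdio.h>

    int main(void) {
        unsigned addr = 0x12345678u;           /* arbitrary example address        */
        unsigned byte_offset = addr & 0x3;     /* bits 1-0                         */
        unsigned index = (addr >> 2) & 0x3FF;  /* bits 11-2: 10 bits, 1024 entries */
        unsigned tag = addr >> 12;             /* bits 31-12: 20 bits              */
        printf("tag 0x%05X  index %u  offset %u\n", tag, index, byte_offset);
        return 0;
    }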
Example
• 32-word memory
• 8-word cache
The cache is initially empty (all valid bits off):

Index  Valid  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N

Reference stream (word addresses); each address maps to cache block
(address mod 8), the low three bits of its binary form:

Address  Binary  Cache block  Hit or miss
22       10110   110          miss
26       11010   010          miss
22       10110   110          hit
26       11010   010          hit
16       10000   000          miss
3        00011   011          miss
16       10000   000          hit
18       10010   010          miss
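
The table can be reproduced mechanically. A minimal sketch simulating
the 8-block direct-mapped cache over the same reference stream:

    #include <stdio.h>

    int main(void) {
        int refs[] = {22, 26, 22, 26, 16, 3, 16, 18};
        int valid[8] = {0}, tag[8] = {0};

        for (int i = 0; i < 8; i++) {
            int addr = refs[i];
            int index = addr % 8;  /* low 3 bits of the word address */
            int t = addr / 8;      /* remaining high-order bits      */
            int hit = valid[index] && tag[index] == t;
            printf("address %2d -> block %d%d%d: %s\n", addr,
                   (index >> 2) & 1, (index >> 1) & 1, index & 1,
                   hit ? "hit" : "miss");
            valid[index] = 1;      /* on a miss, the block is loaded */
            tag[index] = t;
        }
        return 0;
    }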
Cache Organization
• Since the cache is smaller than memory, more than one address
must map to the same line in the cache
• Direct-Mapped Cache
– address mod cache size (only one cache location to which a given
memory address can map)
• Fully Associative Cache
– an address can be mapped anywhere in the cache
– needs a tag and an associative search to find whether an element
is in the cache
• Set-Associative Cache
– a compromise between the two extremes
– an element can map to several locations (one set); the three
schemes are contrasted in the sketch below
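
A sketch contrasting where one block may be placed under each scheme
(an 8-block cache and 2-way associativity are assumed for
illustration):

    #include <stdio.h>

    int main(void) {
        int block_addr = 22, num_blocks = 8, ways = 2;  /* illustrative sizes */
        int num_sets = num_blocks / ways;

        printf("direct mapped:   exactly block %d\n", block_addr % num_blocks);
        printf("2-way set assoc: any of %d blocks in set %d\n",
               ways, block_addr % num_sets);
        printf("fully assoc:     any of the %d blocks (tag search required)\n",
               num_blocks);
        return 0;
    }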
Model for Cache Misses
• Compulsory misses (cold-start)
– These are cache misses caused by the first access to a block that has
never been in the cache
• Capacity misses
– These are cache misses caused when the cache cannot contain all the
blocks needed during execution of a program
• Conflict misses (collision)
– These are cache misses that occur in a set-associative or direct-mapped
cache when multiple blocks compete for the same set. These misses are
eliminated with a fully associative cache of the same size
Measuring Cache Performance
• CPU time = (CPU execution clock cycles +
Memory stall clock cycles) × Clock-cycle time
• Memory stall clock cycles =
Read-stall cycles + Write-stall cycles
• Read-stall cycles = Reads/program × Read miss rate ×
Read miss penalty
• Write-stall cycles = (Writes/program × Write miss rate ×
Write miss penalty) + Write buffer stalls
(assumes a write-through cache)
• Write buffer stalls should be negligible, and the write and read
miss penalties are equal (the cost to fetch a block from memory), so:
• Memory stall clock cycles = Memory accesses/program × miss
rate × miss penalty
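
Plugging numbers into the last formula shows how quickly memory stalls
dominate. Every input below is hypothetical, chosen only for
illustration:

    #include <stdio.h>

    int main(void) {
        /* All inputs are assumed values, not from the text. */
        double instructions = 1.0e9;
        double accesses     = 1.3e9;   /* ~1.3 memory accesses per instruction */
        double miss_rate    = 0.02;    /* 2% of accesses miss                  */
        double miss_penalty = 100.0;   /* cycles to fetch a block from memory  */
        double base_cpi     = 1.0;     /* CPI assuming a perfect cache         */

        double stalls = accesses * miss_rate * miss_penalty;  /* the final formula */
        double cycles = instructions * base_cpi + stalls;
        printf("memory stall cycles: %.2e\n", stalls);
        printf("effective CPI: %.2f (vs. %.2f with a perfect cache)\n",
               cycles / instructions, base_cpi);
        return 0;
    }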
Virtual Memory
• Decouple physical addresses (memory locations) from the
addresses used by a program. The programmer sees a large
memory with the same virtual addresses independent of
where the program is actually placed in memory.
– Virtual-to-physical mapping is performed via a page table
– Since page tables can themselves be in virtual memory, there could be
several table lookups for a single memory reference
– The TLB (translation lookaside buffer) is a cache that stores commonly
used virtual-to-physical mappings
• Page Fault
– when a page is not in memory it must be brought in (from disk)
– very slow (usually handled with OS intervention)
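
A one-level translation sketch (the page size, table size, and mapping
are all hypothetical) showing how a virtual address splits into a page
number and an offset:

    #include <stdio.h>

    #define PAGE_SHIFT 12                        /* 4 KB pages (assumed)           */
    #define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

    int main(void) {
        unsigned page_table[16] = {0};           /* virtual page -> physical frame  */
        page_table[2] = 7;                       /* say virtual page 2 is in frame 7 */

        unsigned vaddr = (2u << PAGE_SHIFT) | 0x123;
        unsigned frame = page_table[vaddr >> PAGE_SHIFT];
        unsigned paddr = (frame << PAGE_SHIFT) | (vaddr & OFFSET_MASK);
        printf("virtual 0x%05X -> physical 0x%05X\n", vaddr, paddr);
        /* A TLB caches recent (virtual page, frame) pairs so that most
         * references skip the page-table lookup; a page fault occurs
         * when the page is not in memory at all. */
        return 0;
    }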
Improving Memory Performance
• Larger and wider caches
• Cache bypass
• Interleaved and pipelined memory systems
• Prefetching
• Post-RISC effects on memory
• New memory trends