Transcript Document

Chapter 5 Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Miss Rate
5.6 Reducing Cache Miss Penalty or Miss
Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for
Improving Performance
1
Chapter 5 Memory Hierarchy Design
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
5.12 Crosscutting Issues: The Design of Memory Hierarchies
5.13 Putting it all together: Alpha 21264 memory hierarchies
5.14 Another view: The Emotion Engine of the Sony Playstation 2
5.15 Another view: The Sun Fire 6800 Server
5.16 Fallacies and Pitfalls
5.17 Concluding remarks
5.18 Historical perspective and references
2
The levels in a typical memory hierarchy in
embedded, desktop, and server computers
An economical solution to the desire for large, fast memory is a memory
hierarchy, which takes advantage of locality and of the
cost-performance of memory technology
3
Starting with 1980 performance as a baseline, the gap in
performance between memory and CPUs is plotted over time
4
5.2 Review of the ABCs of Caches
• Cache
• Virtual memory
• Memory stall cycles
• Direct mapped
• Valid bit
• Block address
• Write through
• Instruction cache
• Average memory access time
• Cache hit
• Page
• Miss penalty
5
Review of the ABCs of Caches
• Fully associative
• Dirty bit
• Block offset
• Write back
• Data cache
• Hit time
• Cache miss
• Page fault
• Miss rate
• N-way set associative
• Least-recently used
• Tag field
6
Review of the ABCs of Caches
• Write allocate
• Unified cache
• Misses per instruction
• Block
• Locality
• Set
• Random replacement
• Index field
• No-write allocate
• Write buffer
• Write stall
7
The typical levels in the hierarchy slow down and get larger as
we move away from the CPU, for a large workstation or small
server
8
Cache performance review
Memory stall cycles = Number of misses × Miss penalty
                    = IC × (Misses / Instruction) × Miss penalty
                    = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
9
Cache performance review
Memory stall clock cycles = IC × Reads per instruction × Read miss rate × Read miss penalty
                          + IC × Writes per instruction × Write miss rate × Write miss penalty
Memory stall clock cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
(A C sketch of both forms follows this slide.)
10
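To make the two equivalent formulations above concrete, here is a minimal C sketch that evaluates both; the instruction count, access rates, miss rates, and penalty are hypothetical values chosen for illustration, not numbers from the text.

#include <stdio.h>

int main(void) {
    /* Hypothetical workload parameters (illustrative only). */
    double IC = 1e9;                  /* instruction count           */
    double reads_per_instr  = 0.26;   /* loads per instruction       */
    double writes_per_instr = 0.10;   /* stores per instruction      */
    double read_miss_rate   = 0.04;
    double write_miss_rate  = 0.02;
    double miss_penalty     = 100.0;  /* clock cycles per miss       */

    /* Split formulation: reads and writes accounted for separately. */
    double stalls_split =
        IC * reads_per_instr  * read_miss_rate  * miss_penalty +
        IC * writes_per_instr * write_miss_rate * miss_penalty;

    /* Combined formulation: one miss rate over all memory accesses. */
    double accesses_per_instr = reads_per_instr + writes_per_instr;
    double miss_rate = (reads_per_instr  * read_miss_rate +
                        writes_per_instr * write_miss_rate) / accesses_per_instr;
    double stalls_combined =
        IC * accesses_per_instr * miss_rate * miss_penalty;

    printf("Memory stall cycles (split)    = %.3e\n", stalls_split);
    printf("Memory stall cycles (combined) = %.3e\n", stalls_combined);
    return 0;
}

Both formulations print the same total, since the combined miss rate is the access-weighted average of the read and write miss rates.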
Cache miss rate and penalty
11
Cache miss rate and penalty
12
Measuring miss rate as misses per instruction
Misses / Instruction = (Miss rate × Memory accesses) / Instruction count
                     = Miss rate × (Memory accesses / Instruction)
13
Memory stall cycles
14
Four memory hierarchy questions
• Q1: Where can a block be placed in the upper level? (Block
placement)
• Q2: How is a block found if it is in the upper level? (Block
identification)
• Q3: Which block should be replaced on a miss? (Block
replacement)
• Q4: What happens on a write? (Write strategy)
15
Q1: Where can a block be placed in the
upper level? (Block placement)
• Direct mapped: only one place
• Fully associative: anywhere in the cache
• Set associative: a set is a group of blocks in
the cache; the set is given by
(Block address) MOD (Number of sets in cache)
(A small C sketch of this mapping follows this slide.)
16
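A small C sketch of the placement rules above; the cache geometry (256 block frames, 4-way set associative) is an assumed example, not taken from the figure.

#include <stdio.h>

int main(void) {
    unsigned long block_address = 123456;  /* arbitrary block address            */
    unsigned num_blocks = 256;             /* hypothetical cache size in blocks  */
    unsigned assoc      = 4;               /* 4-way set associative              */

    unsigned num_sets = num_blocks / assoc;

    /* Direct mapped: one possible frame, (Block address) MOD (Number of blocks). */
    unsigned dm_frame = block_address % num_blocks;

    /* Set associative: (Block address) MOD (Number of sets in cache);
       the block may go in any of the 'assoc' ways of that set. */
    unsigned set = block_address % num_sets;

    printf("direct-mapped frame = %u\n", dm_frame);
    printf("set-associative set = %u (any of %u ways)\n", set, assoc);
    printf("fully associative   = any of the %u frames\n", num_blocks);
    return 0;
}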
This example cache has eight block frames and memory
has 32 blocks
17
The three portions of an address in a set-associative or direct-mapped cache:
the tag field is compared against the stored tags for a hit, the index selects the set,
and the block offset selects the desired data from the block
18
Q3: Which block should be replaced on a
cache miss? (Block replacement)
• Random
• Least-recently used (LRU)
• First in, first out (FIFO)
19
Data cache misses per 1000 instructions comparing least-recently used, random, and first in, first out replacement
for several sizes and associativities
20
Q4: What happens on a write? (Write strategy)
• Two write policies:
Write through (page 401)
Write back
• Two options on a write miss:
Write allocate - The block is allocated on a
write miss, followed by the
write hit actions above.
No-write allocate - In this apparently unusual alternative,
write misses do not affect the cache. Instead, the block is modified
only in the lower-level memory (main memory).
(A sketch of the four policy combinations follows this slide.)
21
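The sketch below illustrates how the two write policies combine with the two write-miss options. It is an illustrative outline only: the one-block "cache" and the helper routines are invented for the demo and are not the book's design.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical one-line "cache" used only to illustrate the policies. */
static long cached_addr = -1;
static bool dirty = false;

static bool cache_has_block(long addr) { return cached_addr == addr; }
static void lower_level_write(long addr) { printf("  write block %ld to memory\n", addr); }
static void fetch_block(long addr)       { printf("  fetch block %ld from memory\n", addr); cached_addr = addr; }

/* Handle a store under a chosen policy pair. */
static void handle_write(long addr, bool write_back, bool write_allocate) {
    if (cache_has_block(addr)) {                 /* write hit                        */
        if (write_back) dirty = true;            /* update cache, mark block dirty   */
        else            lower_level_write(addr); /* write through: update memory too */
    } else if (write_allocate) {                 /* write miss, allocate the block   */
        if (write_back && dirty) lower_level_write(cached_addr); /* evict dirty victim */
        fetch_block(addr);
        handle_write(addr, write_back, write_allocate); /* now a write hit */
    } else {                                     /* no-write allocate                */
        lower_level_write(addr);                 /* modify lower level only          */
    }
}

int main(void) {
    printf("write back + write allocate:\n");
    handle_write(100, true, true);
    printf("write through + no-write allocate:\n");
    handle_write(200, false, false);
    return 0;
}

Write back is usually paired with write allocate and write through with no-write allocate, which is the pairing exercised in main() above.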
No-write allocate versus write allocate
22
The organization of the data cache in the Alpha
21264 microprocessor (2-way set associative)
41-bit physical address
48-bit virtual address
23
Physical address
• Physical address = 38-bit block address + 6-bit block offset
• Example:
2^Index = Cache size / (Block size × Set associativity)
        = 65,536 / (64 × 2) = 512 = 2^9
• 38-bit block address = 9-bit index + 29-bit tag
(A C sketch of this address split follows this slide.)
24
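Using the field widths just derived (6-bit offset, 9-bit index, and the remaining bits as tag), a physical address can be split with shifts and masks, as in this sketch; the sample address is arbitrary.

#include <stdio.h>

int main(void) {
    /* Data cache parameters from the slide: 65,536-byte cache,
       64-byte blocks, 2-way set associative -> 512 sets. */
    const unsigned offset_bits = 6;   /* 2^6 = 64-byte block */
    const unsigned index_bits  = 9;   /* 2^9 = 512 sets      */

    unsigned long long paddr = 0x12345678ABCULL;   /* arbitrary sample address */

    unsigned long long offset = paddr & ((1ULL << offset_bits) - 1);
    unsigned long long index  = (paddr >> offset_bits) & ((1ULL << index_bits) - 1);
    unsigned long long tag    = paddr >> (offset_bits + index_bits); /* remaining upper bits
                                                                        (29 in the slide's breakdown) */

    printf("offset = %llu, index = %llu, tag = 0x%llx\n", offset, index, tag);
    return 0;
}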
Misses per 1000 instructions for instruction, data, and
unified caches of different sizes
25
Cache performance
Average memory access time = Hit time + Miss rate × Miss penalty
Example:
Hit time = 0.25 ~ 1.0 ns
Miss penalty = 75 ~ 100 clock cycles
Miss rate = 1%
(A worked C sketch follows this slide.)
26
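A worked instance of the equation, with illustrative values picked from the ranges on the slide; for simplicity everything is expressed in clock cycles, which is an assumption made for the example.

#include <stdio.h>

int main(void) {
    /* Illustrative values picked from the ranges on the slide. */
    double hit_time     = 1.0;    /* clock cycles */
    double miss_rate    = 0.01;   /* 1%           */
    double miss_penalty = 100.0;  /* clock cycles */

    /* AMAT = Hit time + Miss rate * Miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;

    printf("Average memory access time = %.2f clock cycles\n", amat); /* 2.00 */
    return 0;
}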
Unified cache versus separate caches
27
Unified cache or separate caches
28
Unified cache or separate caches
Instruction first
29
Average memory access time and processor
performance
• Average memory access time depends on
many factors:
- Memory stalls
- CPU (out-of-order execution), ...
CPU time = (CPU execution clock cycles + Memory stall clock cycles)
           × Clock cycle time
(A worked C sketch follows this slide.)
30
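A sketch of the CPU-time equation above with hypothetical parameters (the instruction count, base CPI, stall rate, and clock are all made up for illustration):

#include <stdio.h>

int main(void) {
    /* Hypothetical processor and workload parameters. */
    double IC               = 1e9;      /* instructions                */
    double cpi_execution    = 1.0;      /* CPI ignoring memory stalls  */
    double stalls_per_instr = 1.5;      /* memory stall cycles / instr */
    double clock_cycle_time = 1e-9;     /* 1 ns (1 GHz clock)          */

    double exec_cycles  = IC * cpi_execution;
    double stall_cycles = IC * stalls_per_instr;

    /* CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time */
    double cpu_time = (exec_cycles + stall_cycles) * clock_cycle_time;

    printf("CPU time = %.3f s (effective CPI = %.2f)\n",
           cpu_time, cpi_execution + stalls_per_instr);
    return 0;
}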
Performance impact by cache
31
Performance impact by cache
Without cache: CPI = 151
With cache: CPI = 4
With cache and no misses: CPI = 1
32
Cache misses have a double-barreled
impact on a CPU
• A lower CPI:
the lower the CPI(execution), the higher the relative impact of a fixed
number of cache miss clock cycles.
• A faster clock:
when calculating CPI, the cache miss penalty is measured in CPU
clock cycles for a miss. Therefore, even if the memory hierarchies of two
computers are identical, the CPU with the higher clock rate has a larger
number of clock cycles per miss and hence a higher memory portion of
CPI.
33
Direct mapped or two-way set associative
34
ex-P410-1
35
Miss penalty and out-of-order execution processors
• Redefine memory stalls to lead to a new definition of miss
penalty as nonoverlapped latency:
Memory stall cycles / Instruction =
    (Misses / Instruction) × (Total miss latency − Overlapped miss latency)
• Miss latency issues:
length of memory latency
length of latency overlap
36
Out-of-order execution with a direct-mapped
cache
37
Summary of performance equations in this chapter
38
We organize 17 cache optimizations into four
categories
• Reducing the miss penalty (Section 5.4): multilevel caches, critical
word first, read miss before write miss, merging write buffers, and
victim caches.
• Reducing the miss rate (Section 5.5): larger block size, larger cache
size, higher associativity, way prediction and pseudoassociativity, and
compiler optimizations.
• Reducing the miss penalty or miss rate via parallelism (Section 5.6):
nonblocking caches, hardware prefetching, and compiler prefetching.
• Reducing the time to hit in the cache (Section 5.7): small and simple
caches, avoiding address translation, pipelined cache access, and trace
caches.
39
5.4 Reducing cache miss penalty
• First miss penalty reduction technique: multilevel caches
Average memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)
Average memory access time = Hit time(L1) + Miss rate(L1)
    × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
• Multiple miss rates
Local miss rate - Miss rate(L1), or Miss rate(L2)
Global miss rate - Miss rate(L1) for level 1, or Miss rate(L1) × Miss rate(L2)
for level 2
(A worked C sketch of these equations follows this slide.)
40
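The following sketch evaluates the two-level equations and the global miss rate with hypothetical cache parameters (the hit times, miss rates, and memory penalty are illustrative, not figures from the chapter):

#include <stdio.h>

int main(void) {
    /* Hypothetical two-level cache parameters (clock cycles / fractions). */
    double hit_time_L1     = 1.0;
    double hit_time_L2     = 10.0;
    double miss_rate_L1    = 0.04;   /* local = global for L1 */
    double miss_rate_L2    = 0.25;   /* local miss rate of L2 */
    double miss_penalty_L2 = 100.0;  /* main memory access    */

    double miss_penalty_L1 = hit_time_L2 + miss_rate_L2 * miss_penalty_L2;
    double amat = hit_time_L1 + miss_rate_L1 * miss_penalty_L1;

    double global_miss_rate_L2 = miss_rate_L1 * miss_rate_L2;

    printf("Miss penalty(L1)       = %.2f cycles\n", miss_penalty_L1);   /* 35.00 */
    printf("AMAT                   = %.2f cycles\n", amat);              /* 2.40  */
    printf("Global miss rate of L2 = %.3f\n", global_miss_rate_L2);      /* 0.010 */
    return 0;
}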
Average number of persons = Initial number + Miss rate × Miss penalty
The first-level and the second-level caches
41
The first-level and the second-level caches
42
Miss rates versus cache size for multilevel
caches
(y-axis: second-level miss rate)
43
Relative execution time by second-level cache
size
44
Two major questions for the design of
the second-level cache
• Will it lower the average memory access time
portion of the CPI, and how much does it cost?
• The second-level caches should be much bigger
than the first.
• Multilevel inclusion is the natural policy for
memory hierarchies: L1 data are always present in
L2.
• Multilevel exclusion: L1 data is never found in an
L2 cache(prevents wasting space in the L2 cache)
45
The impact of second-level cache
associativity on its miss penalty
46
Second miss penalty reduction
technique: Critical word first and early restart
• This strategy is impatience: Don't wait for the full
block to be loaded before sending the requested
word and restarting the CPU.
Critical word first - Request the missed word first from memory and
send it to the CPU as soon as it arrives; let the CPU continue execution
while filling the rest of the words in the block. Critical-word-first fetch
is also called wrapped fetch and requested word first.
Early restart-Fetch the words in normal order, but as soon as the
requested word of the block arrives, send it to the CPU and let the
CPU continue execution.
47
The average miss penalty
48
Third miss penalty reduction technique: Giving
priority to read misses over writes
• Fourth miss penalty reduction technique:
Merging write buffers (Fig. 5.12)
• Fifth miss penalty reduction technique: Victim
caches (Fig. 5.13)
• Summary of miss penalty reduction techniques:
The principle of locality
Retrieving the requested word early
Giving priority to read misses over writes
Victim caches
49
R2=R3?
50
To illustrate write merging, the write buffer on top does not use it,
while the write buffer on the bottom does
(merging into a single entry)
51
Placement of victim cache in the memory hierarchy
52
5.5 Reducing miss rate
• Three simple categories of cache misses (the three C's):
Compulsory - cold-start or first-reference misses
Capacity - the cache cannot contain all the blocks needed
Conflict - collision misses or interference misses
• Associativity → miss rate: conflict misses are further broken down by
associativity (eight-way, four-way, two-way, one-way)
53
Total miss rate for each size cache and percentage of each
according to the three C's
54
Total miss rate (top) and distribution of miss rate (bottom) for each
size cache according to the three C’s for the data in Fig 5.14
55
Miss rate versus block size for five different-sized
caches
56
First miss rate reduction technique: Larger block size
• Larger blocks reduce compulsory
misses by exploiting spatial locality.
• Larger blocks increase the miss penalty (more
data is transferred per miss).
• Larger blocks can increase the miss rate (they reduce
the number of blocks in the cache).
57
Actual miss rate versus block size for five different-sized caches
in Fig 5.16
58
Average memory access time
59
Average memory access time versus block size for five
different-sized caches in Fig. 5.16
60
Second miss rate reduction technique: larger
caches
• The size of second- or third-level caches in 2001 equals the size
of main memory in desktop computers in 1991.
• Third miss rate reduction technique: higher associativity
Two general rules of thumb:
-- Eight-way set associative is, for practical purposes, about as
effective at reducing misses as fully associative.
-- A direct-mapped cache of size N has about the same
miss rate as a two-way set-associative cache of size
N/2
61
1-way, 2-way, 4-way and 8-way
62
1-way, 2-way, 4-way and 8-way
63
Average memory access time using miss rate in Fig. 5.14
for parameters in the example
64
Fourth miss rate reduction technique: way prediction
and pseudoassociative caches
• Way prediction saves pipeline stages in
more than 85% of the instruction fetches.
• Pseudoassociative or column associative:
on a miss, check a second entry before going to the next level.
A simple way is to invert the most
significant bit of the index field to find the
other block in the "pseudoset"
(see the C sketch after this slide).
65
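The "invert the most significant bit of the index" rule mentioned above is a single bit operation; this sketch assumes a 9-bit index purely as an example.

#include <stdio.h>

int main(void) {
    const unsigned index_bits = 9;                       /* example: 512-set cache */
    unsigned index = 0x1A3 & ((1u << index_bits) - 1);   /* some 9-bit index       */

    /* Pseudoassociative check: on a miss, probe the block whose index has
       its most significant bit inverted (the other member of the "pseudoset"). */
    unsigned pseudo_index = index ^ (1u << (index_bits - 1));

    printf("primary index = 0x%03x, pseudoset partner = 0x%03x\n",
           index, pseudo_index);
    return 0;
}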
Relationship between regular hit time, pseudohit,
and miss penalty
66
Fifth miss rate reduction technique: Compiler
optimizations
• Reordering instructions and procedures can reduce miss rates
(by up to 50%).
• Long cache blocks
• Improve spatial and temporal locality: loop
interchange, blocking
/* Before */
for (j=0; j<100; j=j+1)
for (i=0; i<5000; i=i+1)
x[i][j]= 2 * x[i][j];
/* After: interchanging the loops makes the inner loop walk
   along a row, giving unit-stride (spatially local) accesses. */
for (i=0; i<5000; i=i+1)
for (j=0; j<100; j=j+1)
x[i][j]= 2 * x[i][j];
67
Blocking
•
Storing the array row by row (row major order) or column by column (column major
order) does not solve the problem, because both rows and columns are used in every iteration
of the loop. Such orthogonal accesses mean that transformations such as loop
interchange are not helpful.
• Instead of operating on entire rows or columns of an array, blocked algorithms operate
on submatrices or blocks.
/*Before */
for (i=0; i<N; i=i+1)
for (j=0; j<N; j=j+1)
{ r=0;
for ( k=0; k<N; k=k+1)
r=r+y[i][k]*z[k][j];
x[i][j]=r;
}
In the worst case, there would be 2N^3 + N^2 memory words accessed for N^3
operations
68
Blocking factor
• The original code is changed to compute on a
submatrix of size B by B.
/* After (corrected: the j loop increments j, and the accumulator is r) */
for (jj=0; jj<N; jj=jj+B)
for (kk=0; kk<N; kk=kk+B)
for (i=0; i<N; i=i+1)
for (j=jj; j<min(jj+B, N); j=j+1)
{ r=0;
for (k=kk; k<min(kk+B, N); k=k+1)
r=r + y[i][k]*z[k][j];
x[i][j]=x[i][j] + r;
};
In the worst case, there would be 2N^3/B + N^2 memory words accessed for N^3
operations (a self-contained, runnable version follows this slide)
69
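For reference, here is a self-contained, runnable version of the blocked multiply above, with the loop-variable and accumulator fixes applied; the matrix size N, blocking factor B, and use of double matrices are assumptions made for the demo.

#include <stdio.h>

#define N 128          /* matrix dimension (assumed for the demo) */
#define B 32           /* blocking factor                         */

static double x[N][N], y[N][N], z[N][N];

static int min(int a, int b) { return a < b ? a : b; }

int main(void) {
    /* Arbitrary initialization so the result is nontrivial. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            y[i][j] = (double)(i + j);
            z[i][j] = (double)(i - j);
            x[i][j] = 0.0;
        }

    /* Blocked matrix multiply: operate on B-by-B submatrices of z and
       B-wide strips of y so they stay resident in the cache. */
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }

    printf("x[0][0] = %g, x[N-1][N-1] = %g\n", x[0][0], x[N-1][N-1]);
    return 0;
}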
A snapshot of the three arrays x, y, and z when
i=1
70
The age of access to the arrays x, y, and z
71
5.6 Reducing cache miss penalty or miss rate
via parallelism
• First miss penalty/rate reduction technique: nonblocking
caches to reduce stalls on cache misses
-- The CPU could continue fetching instructions from the
instruction cache while waiting for the data cache to return
the missing data → a nonblocking cache or lockup-free
cache
-- "Hit under 1 miss", or "hit under multiple misses", ...
72
Two-way set associative
73
Ratio of the average memory stall time for a blocking
cache to hit-under-miss schemes as the number of
outstanding misses is varied for 18 SPEC92 programs
74
Second miss penalty/rate reduction
technique:Hardware prefetching of
instructions and data
• Instruction prefetch is frequently done in hardware outside
of the cache.
• The processor fetches two blocks on a miss: the requested
block and the consecutive block (to the instruction stream
buffer).
• A stream buffer would catch 15% to 25% of the misses
from a 4KB direct-mapped instruction cache with 16-byte
blocks.
• Eight stream buffers can capture 50% to 70% of all misses
from a processor with two 64 KB four-way set-associative
caches.
75
Average access time, cache size, miss rate
76
Average memory access time
77
Third miss penalty/rate reduction
technique: compiler-controlled prefetching
• The compiler inserts prefetch instructions to request
the data before they are needed.
Two flavors of prefetch:
* Register prefetch will load the value into a register.
* Cache prefetch loads data only into the cache and not the register.
• Faulting versus nonfaulting prefetch: the address does or does not cause an
exception for virtual address faults and protection violations.
• Nonfaulting prefetches simply turn into no-ops if they
would normally result in an exception.
78
Compiler-controlled prefetching
• The most effective prefetch is “semantically
invisible” to a program.
It doesn’t change the contents of registers and
memory.
It cannot cause virtual memory faults.
• The goal of prefetching is to overlap execution
with the prefetching of data.
79
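As an illustration of what compiler-inserted, nonfaulting cache prefetches look like, the sketch below uses the GCC/Clang __builtin_prefetch intrinsic as a stand-in for the compiler's own prefetch instruction; the array length and prefetch distance are arbitrary choices.

#include <stdio.h>

#define N 10000
#define PREFETCH_DISTANCE 16   /* elements ahead; tuned per machine in practice */

static double a[N], b[N];

int main(void) {
    for (int i = 0; i < N; i++) b[i] = (double)i;

    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        /* Prefetch a future element into the cache; the builtin compiles to a
           nonfaulting prefetch, so an invalid address simply has no effect. */
        if (i + PREFETCH_DISTANCE < N)
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE], /*rw=*/0, /*locality=*/1);

        a[i] = 2.0 * b[i];
        sum += a[i];
    }
    printf("sum = %g\n", sum);
    return 0;
}

In practice the compiler chooses the prefetch distance from the expected miss latency and the loop's iteration time so the data arrive just before they are needed.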
Data cache misses
80
Data cache
misses
81
The time saved
82
5.7 Reducing Hit Time
• First hit time reduction technique: small and
simple caches
* Smaller hardware is faster, and a small cache certainly
helps the hit time.
* Keep the cache simple.
83
Access times as size and associativity vary in a
CMOS cache
84
Second hit time reduction technique: Avoiding
address translation during indexing of the cache
• Virtual address → physical address translation
• Two tasks:
indexing the cache and comparing addresses
• Process switching → the virtual addresses refer to
different physical addresses, so the cache must be
flushed. One solution: use a process-identifier
tag (PID), so the cache need only be flushed when a
PID is recycled.
85
Synonyms or aliases
• Synonyms or aliases: two different virtual addresses map to the
same physical address.
* Hardware solutions guarantee every cache block a unique
physical address.
* Software solution: page coloring is simply set-associative
mapping applied to virtual memory. The 4 KB (2^12) pages
are mapped using 64 (2^6) sets to ensure that the physical and
virtual addresses match in the last 18 bits. This restriction
means a direct-mapped cache that is 2^18 (256 KB) or
smaller can never have duplicate physical addresses for
blocks.
86
Miss rate versus virtually addressed cache size of a program measured three
ways: without process switches(uniprocess), with process switches using a
process-identifier tag (PID), and with process switches but without PIDs(purge)
87
Third hit time reduction technique:
pipelined cache access
• The pipeline for the Pentium takes 1 clock
cycle to access the instruction cache.
Fourth hit time reduction technique: trace
caches
• A trace cache finds a dynamic sequence of
instructions including taken branches to
load into a cache block.
88
Summary of cache optimizations showing impact on cache
performance and complexity for the techniques in Sections 5.4-5.7
89
5.8 Main memory and organizations for
improving performance
• Cache ← main memory ← I/O
• Memory bandwidth vs. memory speed
→ new organizations
• Assume the performance of the basic memory
organization is
4 clock cycles to send the address
56 clock cycles for the access time per word
4 clock cycles to send a word of data
Given a cache block of 4 words, and that a
word is 8 bytes, the miss penalty is
4 × (4 + 56 + 4) = 256 clock cycles, with a memory
bandwidth of 32 bytes / 256 cycles = 1/8 byte per clock cycle
First technique for higher bandwidth: wider
main memory
• Doubling or quadrupling the width of the cache and memory → doubling or
quadrupling the memory bandwidth
• With a main memory width of 2 words, the miss penalty in
our example would drop from 4 × 64 to 2 × 64 clock cycles.
• A wider cache and memory → higher cost and a harder error correction problem
91
Second technique for higher bandwidth: simple
interleaved memory
• Number of memory banks → parallelism
• Miss penalty of Fig. 5.27(c): 4 + 56 + (4 × 4) = 76 clock cycles;
bandwidth = 4 × 2 × 4 = 32 bytes / 76 CC ≈ 0.4 bytes/CC
• The mapping of addresses to banks affects the
behavior of the memory system.
• The addresses of the four banks are interleaved:
Bank 0 has all words whose address modulo 4 is 0
Bank 1 has all words whose address modulo 4 is 1, and so on
(a C sketch of these calculations follows this slide).
92
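The sketch below reproduces the miss-penalty and bandwidth arithmetic for the three organizations in the example (one-word-wide, two-word-wide, and four-way interleaved memory) and prints the interleaved bank mapping; the timing parameters are the ones given in Section 5.8.

#include <stdio.h>

int main(void) {
    /* Timing parameters from the example. */
    const double t_addr = 4, t_access = 56, t_xfer = 4;   /* clock cycles */
    const int words_per_block = 4, bytes_per_word = 8;
    const double block_bytes = words_per_block * bytes_per_word;  /* 32 bytes */

    /* (a) 1-word-wide memory: one full access per word. */
    double penalty_basic = words_per_block * (t_addr + t_access + t_xfer);       /* 256 */

    /* (b) 2-word-wide memory: half as many accesses. */
    double penalty_wide2 = (words_per_block / 2) * (t_addr + t_access + t_xfer); /* 128 */

    /* (c) 4-way interleaved: one access time, then one transfer per word. */
    double penalty_inter = t_addr + t_access + words_per_block * t_xfer;         /* 76 */

    printf("miss penalty: basic=%.0f  2-word-wide=%.0f  interleaved=%.0f cycles\n",
           penalty_basic, penalty_wide2, penalty_inter);
    printf("bandwidth (interleaved) = %.2f bytes/cycle\n", block_bytes / penalty_inter);

    /* Interleaved bank mapping: word address modulo number of banks. */
    for (unsigned word_addr = 0; word_addr < 8; word_addr++)
        printf("word %u -> bank %u\n", word_addr, word_addr % 4);
    return 0;
}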
Three examples of bus width, memory width, and memory
interleaving to achieve higher memory bandwidth
93
Four-way interleaved memory
94
What can interleaving and wide memory buy?
95
What can interleaving and wide memory buy?
96
How many banks should be included?
• Number of banks >= Number of clock cycles to
access a word in a bank
• As capacity per memory chip increases, there are
fewer chips in the same-sized memory system,
making multiple banks much more expensive.
• A second disadvantage of memory banks is again
the difficulty of main memory expansion.
97
Third technique for higher
bandwidth: independent memory banks
• Interleaving for sequential accesses → multiple
independent accesses
• Cache read in bank i, cache write in bank j
• Multiple cache accesses
• Nonblocking caches allow the CPU to proceed
beyond a cache miss.
• Multiprocessors that share a common memory provide
further motivation for memory banks.
98
5.9 Memory technology
• Memory latency: Access time and cycle time
Access time is the time between when a read is
requested and when the desired word arrives,
while cycle time is the minimum time between
requests to memory.
• One reason that cycle time is greater than access
time is that the memory needs the address lines to
be stable between accesses.
99
DRAM Technology
• Growing DRAM capacity → more address lines
→ multiplex the address lines:
row access strobe (RAS) and column access strobe (CAS)
• DRAM must refresh its contents → occasionally unavailable
• Refresh time <= 5% of the total time
• Fourfold improvement in capacity every three years
• 5% performance improvement per year in row access time
100
Internal organization of a 64M bit DRAM
101
Times of fast and slow DRAMs with each
generation
102
SRAM technology
• SRAM needs no refresh.
• Six transistors per bit
• SRAM designs are concerned with speed and
capacity.
• The capacity of DRAMs is roughly 4 to 8
times that of SRAMs. The cycle time of
SRAM is 8 to 16 times faster than DRAMs,
but they are also 8 to 16 times as expensive.
103
Embedded processor memory
technology: ROM and flash
• ROM for the embedded program and the
constants
• Flash memory allows the embedded device
to alter nonvolatile memory after the system
is manufactured.
• Flash memory allows reading at almost
DRAM speed, but writing flash is 10 to 100
times slower.
104
Improving memory performance in a standard
DRAM chip
• Improving the bandwidth of DRAM:
1. Fast page mode: allow repeated accesses to the
row buffer without another row access time.
2. Synchronous DRAM (SDRAM)
3. The third major DRAM innovation to increase
bandwidth is to transfer data on both the rising
edge and falling edge of the DRAM clock signal (double data rate, DDR).
These add little cost to the system while achieving a
significant improvement in bandwidth.
105
Improving memory performance via a new
DRAM interface: RAMBUS
• DRAM → a high-speed interface → CPU
• The first-generation RAMBUS interface
dropped RAS/CAS, replacing it with a
bus that allows other accesses over the bus
between the sending of the address and
return of the data (RDRAM). Such a bus is
called a packet-switched bus or split-transaction bus.
106
The second-generation RAMBUS
interface
• Direct RDRAM (DRDRAM) → 1.6 GB/sec
• This interface includes separate row- and column-command
buses instead of the conventional
multiplexing; an 18-bit data bus; expanding from
4 to 16 internal banks per RDRAM to reduce bank
conflicts; increasing the number of row buffers
from 4 to 8; increasing the clock to 400 MHz; and
a much more sophisticated controller on chip.
• Because of the separation of the data, row, and column
buses, three transactions can be performed
simultaneously.
107
5.10 Virtual Memory
• Virtual memory divides physical memory into
blocks and allocates them to different processes.
• A protection scheme restricts a process to the
blocks belonging only to that process.
• Virtual memory was invented to relieve
programmers of program block allocation; it
automatically manages the two levels of the
memory hierarchy represented by main memory
and secondary storage.
108
5.10 Virtual memory
• Relocation allows the same program to run
in any location in physical memory.
• Page or segment is used for block, and page
fault or address fault is used for miss
• Virtual address → (virtual memory, in
hardware/software) → physical address
• This is called memory mapping or address translation.
109
The logical program in its contiguous virtual address space is
shown on the left
110
There are further differences between caches and
virtual memory beyond those quantitative ones
mentioned in Fig. 5.32
• Replacement on cache misses is primarily controlled by hardware,
while virtual memory replacement is primarily controlled by the
operating system.
• The size of the processor address determines the size of virtual
memory, but the cache size is independent of the processor address
size.
• Main memory ↔ secondary storage
• Virtual memory systems use either pages, fixed-size blocks
(4096 to 65,536 bytes), or
segments, variable-size blocks (1 byte up to 2^16 to 2^32 bytes)
• Paged addressing has a single fixed-size address divided into page
number and offset within a page, analogous to cache addressing
• Segmented addressing: the variable size of segments requires 1 word for the
segment number and 1 word for an offset within a segment.
111
Typical ranges of parameters for caches and
virtual memory
112
Example of how paging and segmentation divide
a program
113
Paging and segmentation
114
Four memory hierarchy questions
revisited
• Q1: Where can a block be placed in main memory?
Operating systems allow blocks to be placed anywhere
in main memory.
• Q2: How is a block found if it is in main memory? (Fig. 5.35)
Both paging and segmentation rely on a data structure
that is indexed by the page or segment number.
• Q3: Which block should be replaced on a virtual memory
miss?
Almost all operating systems try to replace the least-recently used
(LRU) block, on the assumption that the past predicts the future.
• Q4: What happens on a write?
The write strategy is always write back.
115
The mapping of a virtual address to a physical
address via a page table
116
Techniques for fast address translation
• Paging → each memory reference takes 1. one memory access to obtain the
physical address, and
2. a second access to get the data.
• A general solution is to again rely on the principle of locality:
keep address translations in a special cache.
• This special cache is the translation lookaside buffer (TLB)
(a toy TLB lookup in C follows this slide).
• A TLB entry is like a cache entry where the tag holds portions of the
virtual address and the data portion holds a physical page frame
number, protection field, valid bit, and usually a use bit and dirty bit.
• The OS manages the TLB.
117
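A toy, fully associative TLB lookup in C, showing the tag/data split of a TLB entry described above; the entry count, page size, and field layout are simplified assumptions, not the Alpha's actual organization.

#include <stdbool.h>
#include <stdio.h>

#define TLB_ENTRIES 8
#define PAGE_BITS   13              /* 8 KB pages (assumed) */

typedef struct {
    bool          valid;
    unsigned long vpn;              /* tag: virtual page number    */
    unsigned long pfn;              /* data: physical frame number */
    bool          dirty, use;
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: compare the VPN against every entry. */
static bool tlb_translate(unsigned long vaddr, unsigned long *paddr) {
    unsigned long vpn    = vaddr >> PAGE_BITS;
    unsigned long offset = vaddr & ((1UL << PAGE_BITS) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].use = true;                       /* hint for the replacement policy */
            *paddr = (tlb[i].pfn << PAGE_BITS) | offset;
            return true;                             /* TLB hit */
        }
    return false;                                    /* TLB miss: walk the page table */
}

int main(void) {
    tlb[0] = (tlb_entry){ .valid = true, .vpn = 0x12, .pfn = 0x7 };
    unsigned long pa;
    if (tlb_translate((0x12UL << PAGE_BITS) | 0x44, &pa))
        printf("hit: physical address = 0x%lx\n", pa);   /* 0xE044 */
    else
        printf("miss\n");
    return 0;
}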
Fig. 5.36
• The TLB uses fully associative placement; thus,
the translation begins (steps 1 and 2) by sending the
virtual address to all tags; the type of memory
access is checked for a violation (also in step 2)
against protection information in the TLB.
• The address space number (ASN) plays the same role
as the PID.
• Step 3: 128:1 multiplexor; Step 4: page frame (31
or 28 bits) and page offset (13 bits).
118
Operation of the Alpha 21264 data TLB during
address translation
119
Selecting a page size
• The following favor a larger page size:
1. The size of the page table is inversely proportional to the page size;
memory can therefore be saved by making the page bigger.
2. A larger page size can allow larger caches with fast cache hit times.
3. Transferring larger pages to or from secondary storage, possibly over
a network, is more efficient than transferring smaller pages.
4. The number of TLB entries is restricted, so a larger page size means
that more memory can be mapped efficiently, thereby reducing the
number of TLB misses.
• The main motivation for a smaller page size is conserving storage. A
small page size results in less wasted storage when a contiguous
region of virtual memory is not equal in size to a multiple of the page
size.
120
The overall picture of a hypothetical memory hierarchy going from
virtual address to L2 cache access
121
5.11 Protection and examples of virtual
memory
• Multiprogramming, where a computer is shared by several programs
running concurrently, led to new demands for protection and sharing
among programs.
• Process: a running program plus any state needed to continue running
it.
• Exchanging processes is called a process switch or context switch.
• The operating system designer must guarantee that processes do not
interfere with each others' computation.
• The operating system partitions main memory so that several different
processes have their state in memory at the same time.
• The operating system allows communication between processes and saves
memory by reducing the number of copies of identical information.
122
Protecting processes
• The simplest protection mechanism is a pair of registers that
checks every address to be sure that it falls between the
two limits, base and bound. An address is valid if
Base <= Address <= Bound or (Base + Address) <= Bound
(a C sketch of this check follows this slide).
• The computer designer has three more responsibilities in
helping the operating system designer protect processes from
each other:
1. Provide at least two modes: a user process and an
operating system (supervisor) process.
2. Provide a portion of the CPU state that a user process
can use but not write.
3. Provide mechanisms whereby the CPU can go from user
mode to supervisor mode and vice versa.
123
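The base-and-bound check described above is a single comparison per access; this sketch uses the "Base <= Address <= Bound" form with arbitrary limit values.

#include <stdbool.h>
#include <stdio.h>

/* Base-and-bound protection: an address is valid only if it lies
   between the two limit registers (Base <= Address <= Bound). */
static bool address_valid(unsigned long addr, unsigned long base, unsigned long bound) {
    return base <= addr && addr <= bound;
}

int main(void) {
    unsigned long base = 0x10000, bound = 0x1FFFF;   /* arbitrary process limits */
    printf("0x12345: %s\n", address_valid(0x12345, base, bound) ? "ok" : "protection fault");
    printf("0x20000: %s\n", address_valid(0x20000, base, bound) ? "ok" : "protection fault");
    return 0;
}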
Protection mechanism
• The simplest way of doing protection is to add
permission flags to each page and segment.
• Since few programs today intentionally modify
their own code, an operating system can detect
accidental writes to code by offering read-only
protection to pages.
• Processes are thus protected from one
another by having their own page tables, each
pointing to distinct pages of memory.
124
Protection mechanism
• “Ring” added to the protection structure expand
memory access protection from two levels ( user
and kernel) to many more.
• A program can’t unlock access to the data unless it
has the key.
125
A paged virtual memory example:The Alpha
memory management and the 21264 TLB
• The Alpha architecture uses a combination of segmentation
and paging, providing protection while minimizing page table
size.
• This combination provides many advantages:
segmentation divides the address space and conserves
page table space, while paging provides virtual memory,
relocation, and protection.
• 64-bit (virtual) address space → 41-bit physical address
space.
126
The organization of seg0 and seg1 in the Alpha
127
The mapping of an Alpha virtual address
128
Five protection fields of each entry of
the page table
• Valid –says that the page frame number is valid for
hardware translation
• User read enable- Allows user programs to read data
within this page
• Kernel read enable-Allows the kernel to read data within
this page
• User write enable-Allows user programs to write data
within this page
• Kernel write enable-Allows the kernel to write data within
this page
129
The maximum virtual address and physical
address
• The maximum virtual address and physical
address are tied to the page size.
• Page size 8 KB ~ 64 KB → maximum virtual address: 3 × 13
+ 16 = 55 bits; maximum physical address: 32 +
16 = 48 bits.
• Memory management in the Alpha 21264 relies on
page-level address translation and correct
operation of the operating system to provide safety
to multiple processes sharing the computer.
130
Memory hierarchy parameters of the Alpha
21264: TLB (translation lookaside buffer),
PTE (page table entry)
131
A segmented virtual memory example:
Protection in the Intel Pentium
• The original 8086 used segments for addressing, with no virtual
memory or protection mechanism.
• The "successors" to the 8086 (IA-32): larger address space and a
protection scheme to avoid security loopholes.
• Doubling the traditional two levels of protection, the
Pentium has four levels of protection.
• The IA-32 divides the address space, allowing both the
operating system and the user access to the full space.
• The IA-32 allows the operating system to maintain the
protection level of the called routine for the parameters that
are passed to it.
132
IA-32 has a page table entry (PTE), or a
segment descriptor
• The fields found in PTE
1. Present bit-Equivalent to the PTE valid bit, used to
indicate this is a valid translation
2. Base field-Equivalent to a page frame address,
containing the physical address of the first byte of the
segment.
3. Access bit - Like the reference bit or use bit in some
architectures, helpful for replacement algorithms.
4. Attributes field - Specifies the valid operations and
protection levels for operations that use this segment.
133
Reference: Walter A Triebel and Avtar Singh, “The 8088 and
8086 Microprocessors: Programming,Interfacing, Software,
Hardware, and Applications”, Prentice-Hall, Fourth Edition,
2003,
• DPL:Descriptor privilege level
• Privilege check by CPU
• Conforming: when a service is initiated, the
current privilege level may change. This depends
on whether the software that was interrupted was in a
code segment that was configured as conforming
or nonconforming. If the interrupted code is in a
conforming code segment, CPL does not change
when the service routine is initiated.
134
The IA-32 segment descriptors are distinguished by bits in the
attributes field
135
Adding sharing and protection
• Half of the address space is shared by all processes and half is unique
to each process, called global address space and local address space,
respectively.
• Writing data A descriptorAction???
• Privilege levels are compared between the caller and the called routine
• Adding safe calls from user to OS: gates, and inheriting the protection level
for parameters.
• To restrict entry into others’ code, the IA-32 provides a special
segment descriptor, or call gate, identified by a bit in the attributes
field.
• The call gate is to prevent the user from randomly jumping anywhere
into a protected or more privileged code segment.
136
5.12 Crosscutting Issues:The design of
memory hierarchies
• Superscalar CPU and number of ports to cache
Parallelism: multiple instructions can be issued within a
single clock cycle.
• Speculative execution and the memory system
• To fulfill the demand for instruction-level parallelism
and clock rate, increasingly the instruction cache and the first
part of instruction execution are merging.
• Embedded computers have bigger instruction caches.
• Embedded computers often are placed in real-time
environments where a set of tasks must be completed
every time period → a portion of the cache acts like a small
scratch-pad memory under program control.
137
I/O and consistency of cached data
• There is little danger in the CPU seeing the old or stale
copy.
• I/O devices give the opportunity for other devices to cause
copies to be inconsistent or for other devices to read
stale copies → the cache coherency problem.
• I/O competing with the CPU for cache access will cause
the CPU to stall for I/O.
• If a write-through cache were used, then memory would
have an up-to-date copy of the information, and there
would be no stale-data issue for output.
138
I/O and consistency of cached data
• Input requires some extra work. The software solution is to guarantee
that no blocks of the I/O buffer designated for input are in the cache.
• One approach: A buffer page is marked as noncachable; the operating
system always inputs to such a page.
• Another approach: the operating system flushes the buffer addresses
from the cache before the input occurs.
• A hardware solution: Check the I/O addresses on input to see if they
are in cache.(Checking of I/O addresses in parallel with processor
cache accesses)
• Cache coherency protocols:Maintain coherency for multiple processors
139
The cache coherency problem
140
5.13 Putting it all together: Alpha 21264
memory hierarchy
• The 21264 is an out-of-order execution processor
that fetches up to four instructions per clock cycle
and executes up to six instructions per clock cycle.
• Virtual address → physical address:
48-bit virtual → 44-bit physical
43-bit virtual → 41-bit physical
• At power-on, the chip loads the instruction cache
serially from an external PROM (16K instructions)
into the cache.
141
5.13 Putting it all together: Alpha 21264
memory hierarchy
• The preloaded instructions execute in privileged
architecture library (PAL) mode.
• The software executed in PAL mode is simply machine
language routines with some implementation-specific
extensions to allow access to low-level hardware, such as the
TLB.
• One of the first steps is to update the instruction TLB with
valid page table entries (PTEs) for this process. Kernel
code updates the appropriate page table entry (in memory)
for each page to be mapped. A miss in the TLB is handled
by PAL code, which does not itself rely on the TLB.
142
5.13 Putting it all together: Alpha 21264
memory hierarchy
• Once the operating system is ready to begin executing a
user process, it sets the PC to the appropriate address in
segment seg0
• First, a 12-bit address is sent to the 64 KB instruction
cache, along with a 35-bit page number. An 8-bit address
space number (ASN) is also sent, for the same purpose as
using ASNs in the TLB (step 1).
• The instruction cache is virtually indexed and virtually
tagged, so instruction TLB translations are only required
on cache misses.
143
5.13 Putting it all together: Alpha 21264
memory hierarchy
• To reduce latency, the instruction cache includes two
mechanisms to begin early access of the next block: way
prediction and line prediction.
• Way prediction: the way-predicting cache relies on a 1-bit
field for every 16 bytes to predict which of two sets will
be used next.
• Line prediction: predict the next sequential group on a
cache miss.
• Step 2: The tag field of the PC is compared to the address
from the tag portion of the cache, and the 8-bit process
ASN is compared to the tag ASN field.
144
5.13 Putting it all together: Alpha 21264
memory hierarchy
• Step 3: The valid bit is also checked. If any field
has the wrong value, it is a miss. On a hit in the
instruction cache, the proper fetch block is
supplied, and the next way and line prediction is
loaded to read the next block.
• Step 4: An instruction cache miss causes a
simultaneous check of the instruction TLB and the
instruction prefetcher.
• Step 5: The fully associative TLB simultaneously
searches all 128 entries to find a match between
the address and a valid PTE.
145
• Step 6: If the desired instruction address is found
5.13 Putting it all together: Alpha 21264
memory hierarchy
• Step 7: The instructions are eventually supplied directly
by the prefetcher.
• Step 8: Otherwise, if there is no TLB exception, an access
to the second-level cache is started.
• Step 9: The 35-bit block address (41-bit physical address −
6-bit block offset) is divided into an 18-bit tag and a 17-bit
index.
• Step 10: The cache controller reads the tag from that index
and checks whether it matches and is valid.
• Step 11: On a match, it returns the critical 16 bytes.
• Step 12: At the same time, a request is made for the next
sequential 64-byte block.
146
5.13 Putting it all together: Alpha 21264
memory hierarchy
• Step 13: The next sequential 64-byte block is loaded into the
instruction prefetcher in the next 6 clock cycles
• Step 14: To save time, the prefetched instructions are passed around
the CPU and then written to the instruction cache while the instructions
execute in the CPU.
• Step 15: If the instruction is not found in the secondary cache, the
physical address command is sent to the ES40 system chip set via four
consecutive transfer cycles on a narrow, 15-bit outbound address bus.
• Step 16: The address and command use the address bus for 8 CPU
cycles. The ES40 connects the microprocessor to memory via a
crossbar to one of two 256-bit memory buses to service the request
147
The overall picture of the Alpha 21264 memory
hierarchy
148
Alpha 21264/21164 performance speedup versus
miss rate for SPECint2000
149
Performance of the 21264 memory hierarchy
• 21264: out-of-order execution; a memory stall for one instruction may
be completely hidden by successful completion of later instructions.
• The higher the miss rate, the higher the speedup of the 21264 over the
21164 (in-order execution).
• The speedup reflects the 21264's ability to continue to execute during cache
misses that stall the 21164 but hit in the L2 cache of the 21264.
• The peak rate of the 21264 is a CPI of 0.25, or 4 instructions per clock
cycle.
• SPEC95: the 21264 completes almost 2 instructions per clock cycle.
• Database applications:
higher miss rate + higher branch misprediction rate
→ the server workload places heavier demands on the
memory hierarchy than the SPEC benchmarks do.
150
CPI and misses per 1000 instructions for running a TPC-C-like
database workload and the SPEC95 benchmarks on the Alpha 21264
in the Compaq ES40
151
5.14 Another view: The emotion engine
of the Sony Playstation 2
• The operation of Playstation 2: The data are often
a continuous stream.
• The steady stream of graphics and audio
demanded by electronic games leads to a different
approach to memory design.
• The style is high bandwidth via many dedicated
independent memories.
• Fig. 5.46 shows that much smaller caches capture
the misses for multimedia applications.
152
Three C’s for MPEG3 decode
153
Figure 5.47: The block diagram of the Sony
Playstation 2
• Playstation 2: A game machine, there are interfaces for
video, sound, and a DVD player
• Two standard computer I/O buses, USB and IEEE 1394, a
PCMCIA slot, and a Modem.
• It includes a 34-MHz MIPS processor that also acts as the
emulation computer to run games for earlier Sony
Playstations.
• It also connects to a standard PC audio card to provide the
sound for the games.
154
Figure 5.47: The block diagram of the Sony
Playstation 2
• The 21264 microprocessor in 0.25-micron technology is
about 160 mm2 and uses 150M transistors.
• The Emotion Engine in 0.25-micron technology is about
225mm2 and uses 13.5M transistors. And the Graphics
Synthesizer is 279mm2.
• Two-instruction issue per clock cycle.
• 128-bit SIMD instructions for multimedia applications.
• Vector Unit 0 is primarily a DSP-like coprocessor for the
CPU.
• Vector Unit 1 has similar functions to VPU0, but it
normally operates independently of the CPU.
155
Block diagram of the Sony Playstation 2
156
Figure 5.47: The block diagram of the Sony
Playstation 2
• The PS2 uses two PC800 (400 MHz) DRDRAM chips over two
channels, offering 32 MB of storage and a peak memory bandwidth of
3.2 GB/sec.
• The Graphics Synthesizer takes rendering commands from the
Emotion Engine in what are commonly called display lists. These are
lists of 32-bit commands that tell the renderer what shapes to use and where
to place them, plus what colors and textures to fill them with.
• The Graphics Synthesizer contains the full video buffer and has a
2048-bit-wide interface so that pixel filling is not a bottleneck.
• The separate memories dedicated to individual functions
inexpensively achieve greater memory bandwidth for the entire system.
157
Figure 5.47: The block diagram of the Sony
Playstation 2
• A major insight shaped the design of the Emotion Engine: Generally,
in a racing car game there are foreground objects that are constantly
changing and background objects that change less in reaction to the
event, although the background can be most of the screen. This
observation led to a split of responsibilities.
• CPU-VPU0 handles the foreground action.
• CPU-VPU1 handles the background.
(1) The traditional 64-bit MIPS architecture including a floating-point
unit
(2) The MIPS architecture extended with multimedia instructions
(VPU0)
(3) Independent vector processor (VPU1)
(4) Image processing unit to accelerate MPEG decoding.
Split function + Dedicated memory (four memories)
158
Figure 5.47: The block diagram of the Sony
Playstation 2
• The programmer organizes all memories as two double buffers, one
pair for incoming DMA data and one pair for outgoing DMA data.
• The programmer next sets up the 10 DMA channels, taking care to meet
the real-time deadline for realistic animation of 15 frames per second.
• Fig. 5.48: Serial mode, where CPU/VPU0 acts as a preprocessor on what to
give VPU1 for it to create for the Graphics Interface, using the
scratchpad memory as the buffer.
Parallel mode, where both CPU/VPU0 and VPU1 create
display lists. The display lists and Graphics Synthesizer have multiple
context identifiers to distinguish the parallel display lists to produce a
coherent final image.
• Two dedicated buses: a 128-bit path between the CPU and VPU0 and a
128-bit path between VPU1 and the Graphics Interface. The
programmer also chooses which bus to use when setting up the DMA
channels.
159
Two modes of using emotion engine organization
160
5.15 Another view: The Sun Fire 6800 Server
• The Sun Fire 6800 is a midrange multiprocessor server
with particular attention paid to the memory system.
• The emphasis of this server is cost-performance
for both commercial computing and running
database applications such as data warehousing and data
mining.
• This server also includes special features to improve
availability and maintainability.
• The miss rate is 0.5% for a 1 MB data cache.
• Multiprocessor → coherency misses (a fourth C)
161
Clock cycles per instruction for memory accesses versus
off-chip cache size for a four-processor server
162
Technical summary of the Sun Fire 6800 server and UltraSPARC III
microprocessor
163
Sun Fire 6800 server block diagram
164
Sun Fire 6800 Server
• The data switch yields a peak bandwidth to off-chip
memory of 11 GB/Sec.
• Error correction codes enable buses and memories to both
detect and correct errors.
• The idea is to calculate and store parity over different
subsets of the bits in the protected word.
• When parity does not match, it indicates an error.
• The Sun Fire ECC was also designed to detect any pair of
bit errors, and also to detect if a whole DRAM chip failed,
turning all the bits of an 8-bit-wide chip to 0 → single-error
correcting/double-error detecting.
165
Sun Fire 6800 Server
• Memory is connected directly to the processor to lower
latency (a DRAM controller on the chip saves 80 ns).
• The UltraSPARC includes the tags for the L2 cache on chip → saves 10 clock
cycles off a miss.
• The on-chip caches are both four-way set associative (32 KB
instruction cache, 64 KB data cache).
• To reduce the latency of the data cache, it combines an address adder
with the word line decoder: sum-addressed memory reduces
latency from 3 to 2 clock cycles.
• The L1 data cache uses write through (no-write allocate) and the L2 cache
uses write back (write allocate). Both caches provide parity and detect
errors.
166
Instruction and data misses per 1000 instruction as cache
size varies from 4 KB to 4096 KB
167
5.16 Fallacies and pitfalls
• Fallacy: Predicting cache performance of one program from
another.
• Pitfall: Simulating enough instructions to get accurate
performance measures of the memory hierarchy.
• Pitfall: Too small an address space.
• Pitfall: Emphasizing memory bandwidth in DRAMs versus
memory latency.
• Pitfall: Delivering high memory bandwidth in a cache-based system.
• Pitfall: Ignoring the impact of the operating system on the
performance of the memory hierarchy.
• Pitfall: Relying on the operating system to change the
page size over time.
168
Instruction misses per 1000 references for five inputs to the
perl benchmark from SPEC2000
169
Comparison of the price of SDRAM versus DRDRAM
in memory modules and in systems in 2000
170
Top 10 in memory bandwidth as measured by the
copy portion of the STREAM benchmark
[McCalpin 2001]
171
Misses and time spent in misses for applications
and operating system
172
5.17 Concluding Remarks
• Balancing memory speed against CPU speed
• Do 8 KB pages make sense with a terabyte of main
memory?
• The design decisions at all these levels interact,
and the architect must take a whole-system view
to make wise decisions.
• New inventions: prefetching, cache-aware
compilers, increasing page size.
• Balancing design time against debug time
173
Desktop, embedded, and server microprocessors in 2001.
174
175
5.18 Historical Perspective and
references
• A two-level memory hierarchy—1962
• Translation lookaside buffer—1978
• IBM 360: 32 bits of address (1964)
• Measuring program behavior as memory traffic as well as
miss rate—1968
• Spatial locality and temporal locality—1982
• Three C's model—1987
• The victim cache—1993
• Nonblocking cache—1981, 1998
• Multilevel inclusion property—1988
• Multilevel exclusion property—1994
176
5.18 Historical Perspective and
references
• Prefetching via stream buffers—1990
• Stream buffers that work well with
nonblocking loads and speculative
execution for in-order processors—1995
• Out-of-order execution—1997
• The measurements of SPEC2000
benchmarks collected by Cantin and Hill—2001
177