Cache memories.

Why is memory important?
• Processor performance has increased at a much faster rate than memory performance, making main memory the bottleneck.
[Figure: the CPU-DRAM performance gap, 1980-2000 — CPU performance (log scale) climbs far faster than DRAM performance.]
– 1980: no cache in microprocessors
– 1995: two-level cache, 60% of the transistors on the Alpha 21164 microprocessor
Memory Hierarchy
In 1998:
  SRAM   0.5-5 ns            $4,000-$10,000 per GByte
  DRAM   50-70 ns            $100-$200 per GByte
  Disk   5-20 million ns     $0.50-$2 per GByte
[Figure: levels in the memory hierarchy — the CPU at the top, then Level 1, Level 2, ... Level n (labelled Cache, Memory, Disk); distance from the CPU in access time increases downwards, and the size of the memory at each level grows.]
Users want memory that is fast and cheap.
The solution: a memory hierarchy.
A memory hierarchy in which the faster but smaller part is “close” to the CPU and used most of the time, and in which the slower but larger part is “far” from the CPU, gives us the illusion of having a fast, large, inexpensive memory.
Memory Hierarchy
[Figure: the memory hierarchy as a pyramid — Processor Registers at the tip, then L1 Cache, L2 Cache, DRAM Memory, and Disks (Secondary Storage) at the base.]
Differences in Memory Levels
Level                 Memory Technology   Typical Size    Typical Access Time   Cost per GByte
Registers             D Flip-Flops        64 x 32 bits    0.5-3 ns              N/A
L1 Cache (on chip)    SRAM                16 KBytes       0.5-5 ns              $4,000-$10,000
L2 Cache (off chip)   SRAM                256 KBytes      0.5-5 ns              $4,000-$10,000
Main Memory           DRAM                64 MBytes       50-70 ns              $100-$200
Secondary Storage     Magnetic Disk       2 GBytes        10-20 ms              $0.50-$2
Memory Technology and the Principle of Locality
• Faster memories are more expensive per bit
• Slower memories are usually denser (smaller area per bit)

Memory Technology   Typical access time   $ per GByte
SRAM                0.5-5 ns              $4,000-$10,000
DRAM                50-70 ns              $100-$200
Magnetic Disk       10-20 million ns      $0.50-$2
Static vs. Dynamic RAM
• Static RAM
– Fast (active drive)
– Less dense (4-6 transistors/bit)
– Stable (holds value as long as power is applied)
• Dynamic RAM
– Slower
– High density (1 transistor/bit)
– Unstable (needs refresh)
Neither device holds data if power is removed.
Locality
• Temporal locality (locality in time):
  – If we accessed a certain address, the chances are high that we will access it again shortly. For data this is so because we probably update it; for instructions it is so because we tend to use loops.
• Spatial locality (locality in space):
  – If we accessed a certain address, the chances are high that we will access its neighbors.
  – For instructions this is so due to the sequential nature of programs. For data it is so because we use groups of variables, such as arrays.
• So, let's keep recently used data and instructions in a fast memory (i.e., close to the CPU). This memory is called the cache. A small sketch of both kinds of locality follows below.
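To make the two kinds of locality concrete, here is a minimal C sketch (the array, its size and the loop are illustrative, not from the slides):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];                /* zero-initialized array            */
    int sum = 0;

    /* Spatial locality: a[i] and a[i+1] are neighbours in memory, so    */
    /* after a miss brings a block into the cache, the next few hit.     */
    /* Temporal locality: the loop instructions and `sum` are reused     */
    /* on every iteration, so they stay in the cache.                    */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}
```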
The cache principle
The key terms:
Hit: a successful search for info in the cache. If it is in the cache, we have a hit and we continue executing the instructions.
Miss: an unsuccessful search for info in the cache. If it is not in the cache, we have a miss and we have to bring the requested data from a slower memory one level up in the hierarchy. Until then, we must stall the pipeline!
Block: the basic unit that is loaded into the cache when a miss occurs. The minimal block size is a single word.
Direct Mapped Cache
[Figure: a direct-mapped cache with 8 blocks (indices 000-111). Memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 each map to the cache block whose index equals the address's low-order 3 bits; e.g. all addresses ending in 101 map to cache index 101.]
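A minimal sketch of the mapping rule shown in the figure, assuming word addresses and an 8-block cache (the addresses are the ones from the figure):

```c
#include <stdio.h>

/* Direct mapping: a memory block can live in exactly one cache block,
 * chosen by (block address) modulo (number of cache blocks).  With 8
 * cache blocks this is just the low-order 3 bits of the address, e.g.
 * 00101, 01101, 10101 and 11101 all map to cache index 101.           */
int main(void) {
    unsigned num_blocks = 8;
    unsigned addrs[] = { 0x01, 0x05, 0x09, 0x0D, 0x11, 0x15, 0x19, 0x1D };

    for (int i = 0; i < 8; i++)
        printf("memory address %2u -> cache index %u\n",
               addrs[i], addrs[i] % num_blocks);
    return 0;
}
```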
Direct Mapped Cache
[Figure: block = 1 word, cache size = 16 words, i.e. 2^n blocks.]
One possible arrangement for a MIPS cache:
[Figure: 32-bit address split into a byte offset (bits 1:0), a 10-bit index (bits 11:2) selecting one of 1024 entries, and a 20-bit tag (bits 31:12). Each entry holds a valid bit, a 20-bit tag and 32 bits of data; Hit is asserted when the entry is valid and its tag matches the address tag.]
Another possibility for MIPS (the actual DECStation 3100):
[Figure omitted.]
For any 32-bit address CPU:
[Figure: the address is split into a byte offset (bits 1:0), an n-bit index (bits n+1:2) selecting one of 2^n cache locations, and a (30-n)-bit tag (bits 31:n+2). Each entry holds a valid bit, a (30-n)-bit tag and 32 bits of data; Hit is asserted when the entry is valid and the tags match, and Data is the entry's word.]
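A hedged sketch of how the three fields could be extracted in software, assuming n = 10 index bits as in the 1024-entry arrangement above (the constant and function names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* Field split for a direct-mapped cache with 2^n one-word blocks:
 *   bits [1:0]     byte offset inside the word
 *   bits [n+1:2]   index (selects one of the 2^n cache entries)
 *   bits [31:n+2]  tag (stored in the entry and compared on access)   */
#define N_INDEX_BITS 10

static uint32_t cache_index(uint32_t addr) {
    return (addr >> 2) & ((1u << N_INDEX_BITS) - 1);
}

static uint32_t cache_tag(uint32_t addr) {
    return addr >> (2 + N_INDEX_BITS);
}

int main(void) {
    uint32_t addr = 0x1234ABCD;     /* made-up address */
    printf("index = %u, tag = 0x%x\n", cache_index(addr), cache_tag(addr));
    return 0;
}
```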
Direct Mapped Cache: Temporal Example (Figure 7.6)

lw $1,22($0)   (address 10110, index 110)   Miss: valid
lw $2,26($0)   (address 11010, index 010)   Miss: valid
lw $3,22($0)   (address 10110, index 110)   Hit!

Cache contents after the accesses:
Index   Valid   Tag   Data
000     N
001     N
010     Y       11    Memory[11010]
011     N
100     N
101     N
110     Y       10    Memory[10110]
111     N
Direct Mapped Cache: Worst case, always miss! (Figure 7.6)

lw $1,22($0)   (address 10110)   Miss: valid
lw $2,30($0)   (address 11110)   Miss: tag
lw $3,6($0)    (address 00110)   Miss: tag

All three addresses map to index 110, so each load evicts the previous block:
Index   Valid   Tag                     Data
000     N
001     N
010     N
011     N
100     N
101     N
110     Y       10, then 11, then 00    Memory[10110], then Memory[11110], then Memory[00110]
111     N
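This worst-case trace can be reproduced with a toy direct-mapped model; the sketch below assumes word addresses and an 8-block, one-word-block cache as in the example (all names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Word addresses 22, 30 and 6 (10110, 11110, 00110) all map to index
 * 110, so this access pattern misses every time: each load evicts the
 * previous block even though the rest of the cache is empty.          */
#define NUM_BLOCKS 8

struct line { bool valid; uint32_t tag; };
static struct line cache[NUM_BLOCKS];

static bool access_word(uint32_t word_addr) {
    uint32_t index = word_addr % NUM_BLOCKS;
    uint32_t tag   = word_addr / NUM_BLOCKS;
    if (cache[index].valid && cache[index].tag == tag)
        return true;                 /* hit                             */
    cache[index].valid = true;       /* miss: fetch block, replace tag  */
    cache[index].tag   = tag;
    return false;
}

int main(void) {
    uint32_t trace[] = { 22, 30, 6, 22, 30, 6 };
    for (int i = 0; i < 6; i++)
        printf("word address %2u: %s\n", trace[i],
               access_word(trace[i]) ? "hit" : "miss");
    return 0;
}
```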
Direct Mapped Cache: MIPS Architecture (Figure 7.7)
[Figure: 32-bit address with a byte offset (bits 1:0), a 10-bit index (bits 11:2) and a 20-bit tag (bits 31:12). The index selects one of 1024 entries; each entry holds a valid bit, a 20-bit tag and 32 bits of data. The stored tag is compared with the address tag ("Compare Tags"); Hit is asserted when the entry is valid and the tags match, and Data is the entry's word.]
Handling writes
• Write-through
  – Anything we write is written both to the cache and to the memory (we are discussing one-word blocks for now).
• Write-through usually uses a write buffer
  – Since writing to the slower memory takes too much time, we use an intermediate buffer. It absorbs the write “bursts” of the program and slowly but surely writes them to the memory. (If the buffer gets full, we must stall the CPU.)
• Write-back
  – Another method is to copy the cache block into the memory only when it is replaced with another block. This is called write-back or copy-back. A sketch of both policies follows below.
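A rough sketch of the two policies, with a hypothetical 4-entry write buffer; this is a model of the idea, not of any particular hardware:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* One cache line with a one-word block; the 4-entry write buffer and
 * all names are illustrative.                                          */
struct line { bool valid, dirty; uint32_t tag, data; };

#define BUF_SIZE 4
static uint32_t write_buf[BUF_SIZE];
static int      buf_count;

/* Write-through: update the cache AND queue the word for memory.
 * If the buffer is full, the real CPU would stall until it drains;
 * here the drain is modelled as simply freeing one slot.               */
static void write_through(struct line *l, uint32_t data) {
    l->data = data;
    if (buf_count == BUF_SIZE)
        buf_count--;                 /* "stall" while one entry drains   */
    write_buf[buf_count++] = data;
}

/* Write-back: update the cache only and mark the line dirty; memory is
 * updated later, when the line is evicted.                             */
static void write_back(struct line *l, uint32_t data) {
    l->data  = data;
    l->dirty = true;
}

static void evict(struct line *l) {
    if (l->dirty) {
        /* here the whole block would be copied back to DRAM            */
        l->dirty = false;
    }
    l->valid = false;
}

int main(void) {
    struct line l = { .valid = true };
    write_through(&l, 1);            /* cache and memory (via buffer)    */
    write_back(&l, 2);               /* cache only, line becomes dirty   */
    evict(&l);                       /* dirty block written back here    */
    printf("data=%u, buffered writes=%d\n", l.data, buf_count);
    return 0;
}
```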
The block size
The block does not have to be a single word. When we increase the size of the cache blocks, we improve the hit rate, since we reduce the misses by exploiting the spatial locality of the program (mainly) but also of the data (e.g., in image processing). Here is a comparison of the miss rates of two programs with one-word vs. four-word blocks:

Program   Block size in words   Instruction miss rate   Data miss rate   Effective combined miss rate
gcc       1                     6.1%                    2.1%             5.4%
gcc       4                     2.0%                    1.7%             1.9%
spice     1                     1.2%                    1.3%             1.2%
spice     4                     0.3%                    0.6%             0.4%
Direct Mapped Cache
[Figure: block = 1 word, cache size = 16 words.]
Direct Mapped Cache
[Figure: block = 4 words, cache size = 16 words, i.e. 4 blocks of 4 words each.]
This is still called a direct-mapped cache, since each block in the memory is mapped directly to a single block in the cache.
A 4-word-block direct-mapped implementation
[Figure: 32-bit address with a byte offset (bits 1:0), a 2-bit block offset (bits 3:2), a 12-bit index (bits 15:4) selecting one of 4K entries, and a 16-bit tag (bits 31:16). Each entry holds a valid bit, a 16-bit tag and 128 bits (4 words) of data; the block offset drives a multiplexor that selects one 32-bit word, and Hit is asserted when the entry is valid and the tags match.]
When we have more than a single word in a block, the storage efficiency is slightly higher, since we have one tag per block instead of per word. On the other hand, we slow the cache somewhat, since we add multiplexors. Anyhow, this is not the main point; the point is reducing the miss rate.
A 2^m-word-block implementation
[Figure: 32-bit address with a byte offset inside a word (bits 1:0), an m-bit block offset (bits m+1:2), an n-bit index (bits n+m+1:m+2) selecting one of 2^n entries, and a (30-n-m)-bit tag (bits 31:n+m+2). Each entry holds a valid bit, a (30-n-m)-bit tag and 32·2^m bits of data; the block offset drives a multiplexor that selects one 32-bit word, and Hit is asserted when the entry is valid and the tags match.]
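The same field split, as a hedged sketch in C; m = 2 and n = 12 below match the 4-word-block, 4K-entry example above, but any values work:

```c
#include <stdint.h>
#include <stdio.h>

/* Field split for a direct-mapped cache with 2^n blocks of 2^m words:
 *   bits [1:0]        byte offset inside a word
 *   bits [m+1:2]      block offset -> selects a word via the mux
 *   bits [n+m+1:m+2]  index        -> selects a cache entry
 *   bits [31:n+m+2]   tag                                              */
#define M 2     /* 2^2 = 4 words per block */
#define N 12    /* 2^12 = 4K entries       */

static uint32_t block_offset(uint32_t a) { return (a >> 2) & ((1u << M) - 1); }
static uint32_t index_of(uint32_t a)     { return (a >> (2 + M)) & ((1u << N) - 1); }
static uint32_t tag_of(uint32_t a)       { return a >> (2 + M + N); }

int main(void) {
    uint32_t a = 0x0001234C;        /* made-up address */
    printf("tag=0x%x index=%u word=%u\n", tag_of(a), index_of(a), block_offset(a));
    return 0;
}
```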
Block size and miss rate:
When we increase the size of the block, the miss rate, especially for instructions, is reduced. However, if we leave the cache size as is, we reach a situation where there are too few blocks, so we have to replace them even before we have taken advantage of the locality, i.e., before we have used the entire block. That increases the miss rate (and explains the right-hand side of the graph below).
[Figure: miss rate vs. block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB; the miss rate first falls as blocks grow, then rises again for the smaller caches with very large blocks.]
Block size and writes:
When we have more than a single word in a block, then when we write (a single word) into a block we must first read the entire block from the memory (unless it is already in the cache) and only then write to the cache and to the memory. If the block is already in the cache, the process is exactly as it was for a single-word-block cache.

Separate instruction and data caches
Note that usually we have separate instruction and data caches. Having a single cache for both could give some flexibility, since we sometimes have more room for data, but the two separate caches have twice the bandwidth, i.e., we can read both at the same time (2 times faster). That is why most CPUs use separate instruction and data caches.
Faster CPUs need better caches
It is shown in the book (section 7.3, pp. 565-567) that when we improve the CPU (reduce the CPI or the clock period) but leave the cache as is, the relative miss penalty increases. This means that we need better caches for faster CPUs. Better means we should reduce the miss rate and reduce the miss penalty.

Reducing the miss rate
This is done by giving the cache more flexibility in keeping data. So far we allowed a memory block to be mapped to a single block in the cache; we called it a direct-mapped cache. There is no flexibility here. The most flexible scheme is one in which a block can be stored in any of the cache blocks. That way, we can keep several frequently used blocks that would otherwise compete for the same cache block in a direct-mapped implementation. Such a flexible scheme is called a fully associative cache. In a fully associative cache the tag must be compared against all cache entries.
There is also a compromise called an "N-way set associative" cache. Here each memory block is mapped to one of a set of N blocks in the cache.
Note that for caches with more than one possible mapping, we must employ some replacement policy (LRU or random are used); a sketch of such a lookup follows below.
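A sketch of an N-way set-associative lookup with LRU replacement, using illustrative sizes (4 ways, 2 sets, one-word blocks); the trace reuses the three conflicting word addresses from the direct-mapped worst-case example, which now coexist in one set:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define WAYS 4
#define SETS 2          /* 8 blocks total, like the 8-entry example    */

struct way { bool valid; uint32_t tag; unsigned age; };
static struct way cache[SETS][WAYS];

static bool access_word(uint32_t word_addr) {
    uint32_t set = word_addr % SETS;
    uint32_t tag = word_addr / SETS;
    struct way *s = cache[set];

    for (int w = 0; w < WAYS; w++)                 /* compare tag to every way */
        if (s[w].valid && s[w].tag == tag) {
            for (int o = 0; o < WAYS; o++) s[o].age++;
            s[w].age = 0;                          /* hit: most recently used  */
            return true;
        }

    int victim = 0;                                /* miss: pick a victim      */
    for (int w = 0; w < WAYS; w++) {
        if (!s[w].valid) { victim = w; break; }    /* free way first           */
        if (s[w].age > s[victim].age) victim = w;  /* otherwise the LRU way    */
    }
    for (int o = 0; o < WAYS; o++) s[o].age++;
    cache[set][victim] = (struct way){ .valid = true, .tag = tag, .age = 0 };
    return false;
}

int main(void) {
    uint32_t trace[] = { 22, 30, 6, 22, 30, 6 };   /* misses, then all hits    */
    for (int i = 0; i < 6; i++)
        printf("word address %2u: %s\n", trace[i],
               access_word(trace[i]) ? "hit" : "miss");
    return 0;
}
```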
Direct Mapped Cache
[Figure: block = 4 words, cache size = 16 words.]
2-way set associative Cache
[Figure: block = 4 words, cache size = 32 words; the cache is organized as 4 sets of 2 blocks each (in general, N·2^n blocks of 2^m words), and a memory block can go into either of the 2 blocks of its set.]
A 2-way set associative cache
[Figure: the address is split into a byte offset inside a word (bits 1:0), an m-bit block offset, an n-bit index and a (30-n-m)-bit tag. The index selects one set; each of the two ways holds 2^n entries with a valid bit, a (30-n-m)-bit tag and 32·2^m bits of data. Both ways are read in parallel, the two tag comparisons produce Hit1 and Hit2, and a final multiplexor chooses Data from the way that hit.]
Here the block size is 2^m words. We see that we actually have 2 caches plus a multiplexor.
Fully associative Cache
[Figure: block = 4 words, cache size = 32 words, i.e. 8 blocks; a memory block can be placed in any of the 8 cache blocks.]
A fully associative cache
[Figure: the address is split into a byte offset, an m-bit block offset and a (30-m)-bit tag; there is no index. Every entry holds a valid bit, a (30-m)-bit tag and 2^m words of data, and the address tag is compared against all entries in parallel; the matching entry (if any) asserts hit and supplies Data through the word-select multiplexor.]
Here the block size is 2^m words. We see that we have only N blocks (and the tag is compared against all of them).
Faster CPUs need better caches
Better means we should reduce the miss rate; for that we used a 4-way set associative cache. Better also means reducing the miss penalty.

Reducing the miss penalty
This is done by using a two-level cache. The first-level cache is on the same chip as the CPU; actually it is part of the CPU. It is very fast (1-2 ns, less than one clock cycle) and small, its block is also small, so it can be 4-way set associative. The level-2 cache is off chip, about 10 times slower, but still about 10 times faster than the main memory (DRAM). It has larger blocks and is almost always 2-way set associative or direct mapped, aimed mainly at reducing the read penalty. Analyzing such caches is complicated; usually simulations are required. A back-of-the-envelope example follows below.
An optimal single-level cache is usually larger and slower than the level-1 cache, and faster and smaller than the level-2 cache.
Note that usually we have separate instruction and data caches.
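A back-of-the-envelope sketch of why a second level helps; all of the numbers below are made up for illustration, only the average-memory-access-time (AMAT) formula structure is standard:

```c
#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty.
 * With two levels, the L1 miss penalty is itself the L2 access time
 * plus the (local) L2 miss rate times the DRAM access time.           */
int main(void) {
    double l1_hit = 1.0,  l1_miss_rate = 0.05;   /* cycles, fraction    */
    double l2_hit = 10.0, l2_miss_rate = 0.20;   /* local L2 miss rate  */
    double mem    = 100.0;                       /* DRAM access, cycles */

    double amat_one_level = l1_hit + l1_miss_rate * mem;
    double amat_two_level = l1_hit + l1_miss_rate *
                            (l2_hit + l2_miss_rate * mem);

    printf("one level : %.2f cycles\n", amat_one_level);   /* 6.00 */
    printf("two levels: %.2f cycles\n", amat_two_level);   /* 2.50 */
    return 0;
}
```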
Virtual Memory
Address translation
The translation is simple. We use the LSBs to point at the address inside a page, and the rest of the bits, the MSBs, to point at a "virtual" page. The translation replaces the virtual page number with a physical page number having a smaller number of bits. This means that the physical memory is smaller than the virtual memory, and so we'll have to load and store pages whenever required. Before VM, the programmer was responsible for loading and replacing "overlays" of code or data; VM takes this burden away.
By the way, using pages and "relocating" the code and the data every time they are loaded into memory also enables better usage of memory: large contiguous areas are not required. A minimal sketch of the translation follows below.
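A minimal sketch of the translation itself, assuming 4 KB pages and a toy in-memory page table (the addresses and the mapping are made up):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                  /* assume 4 KB pages              */

int main(void) {
    static uint32_t page_table[2048]; /* toy table: PPN indexed by VPN  */

    uint32_t vaddr  = 0x00403A24;                  /* made-up address   */
    uint32_t vpn    = vaddr >> PAGE_BITS;          /* virtual page no.  */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    page_table[vpn] = 0x00017;                     /* pretend mapping   */

    /* the offset passes through unchanged; only the page number is
       translated (1st read: the page table, 2nd read: the data itself) */
    uint32_t paddr = (page_table[vpn] << PAGE_BITS) | offset;
    printf("virtual 0x%08x -> physical 0x%08x\n", vaddr, paddr);
    return 0;
}
```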
Address translation
The translation is done by a table called the page table. We have such a table, residing in the main memory, for each process. A special register, the page table register, points at the start of the table. When switching the program, i.e., switching to another process, we change the contents of that register so it points to the appropriate page table. [To switch a process also means storing all the registers, including the PC, of the current process and retrieving those of the process we want to switch to. This is done by the operating system every now and then according to some predetermined rule.]
We need a valid bit, the same as in caches, which tells whether the page is valid or not.
In VM we have fully associative placement of pages in the physical memory, to reduce the chance of a page fault. We also apply sophisticated algorithms for the replacement of pages. Since the read/write time (from/to disk) is very long, we use a software mechanism instead of the hardware used in caches. Also, we use a write-back scheme and not write-through.
The page table
The operating system (OS) creates a copy of all the pages of a process on the disk. It loads the requested pages into the physical memory and keeps track of which page is loaded and which is not. The page table can also be used to point at the pages on the disk: if the valid bit is on, the table has the physical page address; if the valid bit is off, the table has its disk address.
When a page fault occurs and all physical memory is used, the OS must choose which page to replace. LRU is often used. However, to simplify things, a "use" or "reference" bit is set by hardware every time a page is accessed, and every now and then these bits are cleared by the OS. According to these bits, the OS can decide which pages have a higher chance of being used and keep them in memory; a sketch of this approximation follows below.
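A sketch of the reference-bit approximation described above; the page count, function names and the "clear everything" policy are illustrative:

```c
#include <stdbool.h>
#include <stdio.h>

/* The reference bit is set on every access (by hardware) and cleared
 * periodically (by the OS); pages whose bit is still clear are better
 * eviction candidates, since they were not used recently.              */
#define NUM_PAGES 8

static bool referenced[NUM_PAGES];

static void touch(int page) { referenced[page] = true; }     /* set by h/w        */

static void clear_all(void) {                                /* OS, periodically  */
    for (int p = 0; p < NUM_PAGES; p++) referenced[p] = false;
}

static int pick_victim(void) {                               /* on a page fault   */
    for (int p = 0; p < NUM_PAGES; p++)
        if (!referenced[p]) return p;   /* not used since the last clearing        */
    return 0;                           /* everything was used: fall back           */
}

int main(void) {
    clear_all();
    touch(2); touch(5); touch(2);
    printf("evict page %d\n", pick_victim());   /* prints 0: never referenced       */
    return 0;
}
```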
The page table could be very big, so there are techniques to keep it small. We do not prepare room for all possible virtual addresses, but add an entry whenever a new page is requested. We sometimes have a page table with two parts: one for the heap, growing upwards, and one for the stack, growing downwards. Some operating systems use hashing to translate between the virtual page address and the page table. Sometimes the page table itself is allowed to be paged.
Note that every access to the memory consists of two reads: first we read the physical page address from the page table, and only then can we perform the real read.
Copyright 1998 Morgan Kaufmann Publishers, Inc. All rights reserved.