Transcript

Memory hierarchies
An introduction to the use of cache memories as a technique for exploiting the ‘locality-of-reference’ principle
Recall system diagram
[Diagram: the CPU and main memory are connected by the system bus, together with several I/O subsystems]
SRAM versus DRAM
• Random Access Memory uses two distinct
technologies: ‘static’ and ‘dynamic’
• Static RAM is much faster
• Dynamic RAM is much cheaper
• Static RAM retains its contents without needing refresh
• Dynamic RAM must be continually ‘refreshed’ or its contents are lost
• Our workstations have 1GB of DRAM, but only 512KB of SRAM (a ratio of roughly 2000-to-1)
Data access time
• The time-period between the issuing of the data-access signal and the output (or acceptance) of the data by the addressed subsystem
• FETCH-OPERATION: the CPU places the address of a memory-location on the system bus, and the memory-system responds by placing that memory-location’s data on the system bus
Comparative access timings
• Processor registers: 5 nanoseconds
• SRAM memory: 15 nanoseconds
• DRAM memory: 60 nanoseconds
• Magnetic disk: 10 milliseconds
• Optical disk: (even slower)
System design decisions
• How much memory?
• How fast?
• How expensive?
• Tradeoffs and compromises
• Modern systems employ a ‘hybrid’ design in which small, fast, expensive SRAM is supplemented by larger, slower, cheaper DRAM
Memory hierarchy
[Diagram: the CPU exchanges ‘word’ transfers with the cache memory, and the cache memory exchanges ‘burst’ transfers with the main memory]
Instructions versus Data
• Modern system designs frequently use a
pair of separate cache memories, one for
storing processor instructions and another
for storing the program’s data
• Why is this a good idea?
• Let’s examine the way a cache works
Fetching program-instructions
• CPU fetches its next program-instruction
by placing the value from register EIP on
the system bus and issuing a ‘read’ signal
• If that instruction has not previously been fetched, it will not be available in the ‘fast’ cache memory, so it must be fetched from the ‘slower’ main memory; however, it will be ‘saved’ in the cache in case it’s needed again very soon (e.g., in a program loop)
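That fetch path can be sketched in a few lines of C (a toy lookup table with invented names, not the Pentium’s actual cache organization):

/* Toy sketch of the fetch path described above: a miss goes to the slow
 * DRAM stand-in, but the value is saved in the fast cache, so a repeat
 * fetch (e.g., the next pass of a loop) becomes a hit. */
#include <stdint.h>

#define CACHE_SLOTS 128

static struct { int valid; uint32_t addr; uint32_t value; } cache[CACHE_SLOTS];

static uint32_t read_dram(uint32_t addr)      /* stand-in for a slow DRAM read */
{
    static uint32_t dram[1 << 16];
    return dram[addr & 0xFFFF];
}

uint32_t fetch(uint32_t addr)
{
    int slot = addr % CACHE_SLOTS;
    if (cache[slot].valid && cache[slot].addr == addr)
        return cache[slot].value;             /* cache hit: fast path  */
    uint32_t v = read_dram(addr);             /* cache miss: slow path */
    cache[slot].valid = 1;                    /* save it for next time */
    cache[slot].addr  = addr;
    cache[slot].value = v;
    return v;
}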
Most programs use ‘loops’
// EXAMPLE:
int array[64];          // assumed to be declared and filled elsewhere
int sum = 0, i = 0;
do {
        sum += array[ i ];
        ++ i;
} while ( i < 64 );
Assembly language loop
        movl    $0, %eax                # initialize accumulator
        movl    $0, %esi                # initialize array-index
again:
        addl    array(, %esi, 4), %eax  # add next int
        incl    %esi                    # increment array-index
        cmp     $64, %esi               # check for exit-condition
        jae     finis                   # index is beyond bounds
        jmp     again                   # else do another addition
finis:
Benefits of cache-mechanism
• During the first pass through the loop, all
of the loop-instructions must be fetched
from the (slow) DRAM memory
• But loops are typically short – maybe only
a few dozen bytes – and multiple bytes
can be fetched together in a single ‘burst’
• Thus subsequent passes through the loop
can fetch all of these same instructions
from the (fast) SRAM memory
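As a rough illustration (the byte count below is an estimate, not a measurement), the short loop shown earlier easily fits in a single cache line:

/* Rough arithmetic only: a handful of 1-7 byte instructions occupies well
 * under 32 bytes, so one burst fills the line and every later pass through
 * the loop is served entirely from SRAM. */
enum {
    APPROX_LOOP_BYTES = 20,   /* estimated size of the loop's code */
    CACHE_LINE_BYTES  = 32,   /* Pentium cache-line size           */
    LINES_NEEDED = (APPROX_LOOP_BYTES + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES   /* == 1 */
};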
‘Locality-of-Reference’
• The ‘locality-of-reference’ principle says that, with most computer programs, when the CPU fetches an instruction to execute, it is very likely that the next instruction it fetches will be located at a nearby address
Accessing data-operands
• Accessing a program’s data differs from accessing its instructions in that the data consists of ‘variables’ as well as ‘constants’
• Instructions are constant, but data may be
modified (i.e., may ‘write’ as well as ‘read’)
• This introduces the problem known as ‘cache
coherency’ – if a new value gets assigned to a
variable, its cached value will be modified, but
what will happen to its original memory-value?
Cache ‘write’ policies
• ‘Write-Through’ policy: a modified variable in the cache is immediately copied back to the original memory-location, to ensure that the memory and the cache are kept ‘consistent’ (synchronized)
• ‘Write-Back’ policy: a modified variable in the
cache is not immediately copied back to its
original memory-location – some values nearby
might also get changed soon, in which case all
adjacent changes could be done in one ‘burst’
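A minimal sketch of the two policies, using a toy cache line with a ‘dirty’ flag (invented names, not the Pentium’s actual controller logic):

/* Hypothetical illustration of the two write policies described above. */
#include <stdint.h>

struct line { uint8_t data[32]; int dirty; };   /* one 32-byte cached block */

/* Write-Through: update the cache AND the memory copy immediately. */
void write_through(struct line *ln, uint8_t *memory, int offset, uint8_t value)
{
    ln->data[offset] = value;
    memory[offset]   = value;     /* memory stays synchronized at once */
}

/* Write-Back: update only the cache now; memory gets updated later, in one
 * burst, when the (dirty) line is eventually replaced. */
void write_back(struct line *ln, int offset, uint8_t value)
{
    ln->data[offset] = value;
    ln->dirty = 1;                /* remember that memory is now stale */
}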
Locality-of-reference for data
// EXAMPLE (from array-processing):
int price[64], quantity[64], revenue[64];
for (int i = 0; i < 64; i++)
        revenue[ i ] = price[ i ] * quantity[ i ];
Cache ‘hit’ or cache ‘miss’
• When the CPU accesses an address, it
first looks in the (fast) cache to see if the
address’s value is perhaps already there
• If the CPU finds that the address’s value is in the cache, this is called a ‘cache hit’
• Otherwise, if the CPU finds that the value
it wants to access is not presently in the
cache, this is called a ‘cache miss’
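One way to picture the hit-or-miss test is with a simple direct-mapped sketch (the field widths below are invented for illustration, not the Pentium’s actual geometry):

/* An address splits into tag / line-index / byte-offset; a 'hit' means the
 * selected line is valid and holds a matching tag, otherwise it's a miss. */
#include <stdint.h>

struct cline { int valid; uint32_t tag; uint8_t data[32]; };

static struct cline lines[128];               /* hypothetical: 128 lines x 32 bytes */

int is_cache_hit(uint32_t address)
{
    uint32_t index = (address >> 5) & 127;    /* bits 5..11 pick the line  */
    uint32_t tag   =  address >> 12;          /* remaining high-order bits */
    return lines[index].valid && lines[index].tag == tag;
}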
Cache-Line Replacement
• A small cache-memory fills up quickly
• So some ‘stale’ items need to be removed
in order to make room for the ‘fresh’ items
• Various policies exist for ‘replacement’
• But ‘write-back’ of any data in the cache
which is ‘inconsistent’ with data in memory
can safely be deferred until it’s time for the
cached data to be ‘replaced’
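The deferral described above can be sketched like this (again a toy line with a ‘dirty’ flag, write-back policy assumed):

/* Only when a line is chosen for replacement does its modified content get
 * copied back to main memory, and only if it is 'dirty'. */
#include <stdint.h>
#include <string.h>

struct wbline { int valid; int dirty; uint32_t tag; uint8_t data[32]; };

void replace_line(struct wbline *ln, uint32_t new_tag,
                  const uint8_t *fresh, uint8_t *backing_memory)
{
    if (ln->valid && ln->dirty)
        memcpy(backing_memory, ln->data, 32);   /* deferred write-back      */
    memcpy(ln->data, fresh, 32);                /* install the fresh block  */
    ln->tag   = new_tag;
    ln->valid = 1;
    ln->dirty = 0;
}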
Performance impact
[Plot: average access time y versus cache-hit frequency x, a straight line falling from 75ns at x = 0 to 15ns at x = 1]
y = 75 - 60x (nanoseconds): a cache hit costs only the 15ns SRAM access, while a miss costs 15ns + 60ns = 75ns, so the average access time is y = 15x + 75(1 - x) = 75 - 60x
Examples:
If x = .90, then y = 21ns
If x = .95, then y = 18ns
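A quick check of the formula, using the access timings listed earlier (15ns SRAM, 60ns DRAM):

/* Average access time as a function of the hit-rate x: a hit costs 15ns,
 * a miss costs 15ns + 60ns = 75ns, so y = 15x + 75(1 - x) = 75 - 60x. */
double avg_access_ns(double x)      /* x = proportion of cache hits */
{
    return 75.0 - 60.0 * x;
}
/* avg_access_ns(0.90) == 21.0,  avg_access_ns(0.95) == 18.0 */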
Control Register 0
• Privileged software (e.g., a kernel module)
can disable Pentium’s cache-mechanism
(by using some inline assembly language)
CR0 bit layout (selected bits):
bit 31: PG    bit 30: CD    bit 29: NW    bit 18: AM    bit 16: WP
bit 5: NE    bit 4: ET    bit 3: TS    bit 2: EM    bit 1: MP    bit 0: PE
asm(" movl %cr0, %eax ");
asm(" orl $0x60000000, %eax ");
asm(" movl %eax, %cr0 ");
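For context, a kernel-module version of the same idea might look roughly like the sketch below; this is a guess at what ‘nocache.c’ does, not its actual source (32-bit x86 assumed):

/* Hypothetical sketch only: set CD (bit 30) and NW (bit 29) in CR0 to
 * disable caching, flush with 'wbinvd', and undo it on module removal. */
#include <linux/module.h>
#include <linux/init.h>

static int __init nocache_init(void)
{
    unsigned long cr0;
    asm volatile(" movl %%cr0, %0 " : "=r"(cr0));
    cr0 |= 0x60000000;                              /* CD = 1, NW = 1 */
    asm volatile(" movl %0, %%cr0 " : : "r"(cr0));
    asm volatile(" wbinvd ");                       /* flush the cache */
    return 0;
}

static void __exit nocache_exit(void)
{
    unsigned long cr0;
    asm volatile(" movl %%cr0, %0 " : "=r"(cr0));
    cr0 &= ~0x60000000;                             /* CD = 0, NW = 0 */
    asm volatile(" movl %0, %%cr0 " : : "r"(cr0));
}

module_init(nocache_init);
module_exit(nocache_exit);
MODULE_LICENSE("GPL");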
In-class demonstration
• Install ‘nocache.c’ module to turn off cache
• Execute ‘cr0.cpp’ to verify cache disabled
• Run ‘sieve.cpp’ using the ‘time’ command
to measure performance w/o caching
• Remove ‘nocache’ to re-enable the cache
• Run ‘sieve.cpp’ again to measure speedup
• Run ‘fsieve.cpp’ for comparing the speeds
of memory-access versus disk-access
In-class exercises
• Try modifying the ‘sieve.cpp’ source-code so that each array-element is stored at an address which is in a different cache-line (cache-line size on Pentium is 32 bytes); see the sketch after this list
• Try modifying the ‘fsieve.cpp’ source-code
to omit the file’s ‘O_SYNC’ flag; also try
shrinking the ‘m’ parameter so that more
than one element is stored in a disk-sector
(size of each disk-sector is 512 bytes)
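One way to approach the first exercise is to pad each element out to the cache-line size (a hypothetical sketch only; the real sieve.cpp is surely organized differently):

/* Pad each array-element to 32 bytes so that consecutive elements land in
 * different Pentium cache-lines. The array size here is arbitrary. */
struct padded_flag {
    char flag;          /* the sieve's actual datum                 */
    char pad[31];       /* filler, to reach the 32-byte line size   */
};

static struct padded_flag sieve[8192];    /* element i now occupies its own cache-line */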