Direct Mapped Cache

Download Report

Transcript Direct Mapped Cache

Lecture 11, Oct. 29, 2007
•
Syllabus change:
– Tues, Oct. 29, 2007 – Caching
– Thurs, Nov. 1, 2007 – Review
– Tues, Nov. 6, 2007 – Midterm
– Tues, Nov. 15, 2007 – Quiz (caching/V-M) (%10)
– Thurs, Nov. 29, 2007 – Final Project / Final Day of Class
• Still PopNet
– Make modifications to the code
• Add lasers to simulated code
• Add no virtual memory, but make the I/O BW large
and scalable
• Add network coding/computation at each node (ECC,
compression, etc)
•
I will give HW on Caching/Virtual Memory (Due Nov. 15, same day as
quiz occurs)
2004 Morgan Kaufmann Publishers
1
Chapter Seven
2004 Morgan Kaufmann Publishers
2
Memories: Review
•
SRAM:
– value is stored on a pair of inverting gates
– very fast but takes up more space than DRAM (4 to 6 transistors)
•
DRAM:
– value is stored as a charge on capacitor (must be refreshed)
– very small but slower than SRAM (factor of 5 to 10)
Word line
A
A
B
B
Pass transistor
Capacitor
Bit line
2004 Morgan Kaufmann Publishers
3
Exploiting Memory Hierarchy
•
Users want large and fast memories!
SRAM access times are .5 – 5ns at cost of $4000 to $10,000 per GB.
DRAM access times are 50-70ns at cost of $100 to $200 per GB.
Disk access times are 5 to 20 million ns at cost of $.50 to $2 per GB.
•
Try and give it to them anyway
– build a memory hierarchy
2004
CPU
Level 1
Increasing distance
from the CPU in
access time
Levels in the
Level 2
memory hierarchy
Level n
Size of the memory at each level
2004 Morgan Kaufmann Publishers
4
Locality
•
A principle that makes having a memory hierarchy a good idea
•
If an item is referenced,
temporal locality: it will tend to be referenced again soon
spatial locality: nearby items will tend to be referenced soon.
Why does code have locality?
•
Our initial focus: two levels (upper, lower)
– block: minimum unit of data
– hit: data requested is in the upper level
– miss: data requested is not in the upper level
2004 Morgan Kaufmann Publishers
5
Cache
•
•
Two issues:
– How do we know if a data item is in the cache?
– If it is, how do we find it?
Our first example:
– block size is one word of data
– "direct mapped"
For each item of data at the lower level,
there is exactly one location in the cache where it might be.
e.g., lots of items at the lower level share locations in the upper level
2004 Morgan Kaufmann Publishers
6
Direct Mapped Cache
Mapping: address is modulo the number of blocks in the cache
Cache
000
001
010
011
100
101
110
111
•
00001
00101
01001
01101
10001
10101
11001
11101
Memory
2004 Morgan Kaufmann Publishers
7
Direct Mapped Cache
Address (showing bit positions)
•
31 30
13 12 11
2 10
Byte
offset
For MIPS:
Hit
20
10
Tag
Data
Index
Index
0
1
2
Valid Tag
Data
1021
1022
1023
20
32
=
What kind of locality are we taking advantage of?
2004 Morgan Kaufmann Publishers
8
Direct Mapped Cache
•
Taking advantage of spatial locality:
Address (showing bit positions)
31
14 13
18
Hit
65
8
210
4
Tag
Byte
offset
Data
Block offset
Index
18 bits
V
512 bits
Tag
Data
256
entries
16
32
32
32
=
Mux
32
2004 Morgan Kaufmann Publishers
9
Hits vs. Misses
•
Read hits
– this is what we want!
•
Read misses
– stall the CPU, fetch block from memory, deliver to cache, restart
•
Write hits:
– can replace data in cache and memory (write-through)
– write the data only into the cache (write-back the cache later)
•
Write misses:
– read the entire block into the cache, then write the word
2004 Morgan Kaufmann Publishers
10
Hardware Issues
•
Make reading multiple words easier by using banks of memory
CPU
CPU
CPU
Multiplexor
Cache
Cache
Cache
Bus
Bus
Memory
b. Wide memory organization
Bus
Memory
Memory
Memory
Memory
bank 0
bank 1
bank 2
bank 3
c. Interleaved memory organization
Memory
a. One-word-wide
memory organization
•
It can get a lot more complicated...
2004 Morgan Kaufmann Publishers
11
Performance
•
Increasing the block size tends to decrease miss rate:
40%
35%
Miss rate
30%
25%
20%
15%
10%
5%
0%
4
16
64
Block size (bytes)
256
1 KB
8 KB
16 KB
64 KB
256 KB
•
Use split caches because there is more spatial locality in code:
Program
gcc
spice
Block size in
words
1
4
1
4
Instruction
miss rate
6.1%
2.0%
1.2%
0.3%
Data miss
rate
2.1%
1.7%
1.3%
0.6%
Effective combined
miss rate
5.4%
1.9%
1.2%
0.4%
2004 Morgan Kaufmann Publishers
12
Performance
•
Simplified model:
execution time = (execution cycles + stall cycles)  cycle time
stall cycles = # of instructions  miss ratio  miss penalty
•
Two ways of improving performance:
– decreasing the miss ratio
– decreasing the miss penalty
What happens if we increase block size?
2004 Morgan Kaufmann Publishers
13
Decreasing miss ratio with associativity
One-way set associative
(direct mapped)
Block
Tag Data
0
Two-way set associative
1
2
3
4
5
6
Set
Tag Data Tag Data
0
1
2
3
7
Four-way set associative
Set
Tag Data Tag Data Tag Data Tag Data
0
1
Eight-way set associative (fully associative)
Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data
Compared to direct mapped, give a series of references that:
– results in a lower miss ratio using a 2-way set associative cache
– results in a higher miss ratio using a 2-way set associative cache
2004 Morgan Kaufmann Publishers 14
assuming we use the “least recently used” replacement strategy
An implementation
Address
31 30
12 11 10 9 8
8
22
Index
0
1
2
V
Tag
Data
V
321 0
Tag
Data
V
Tag
Data
V
Tag
Data
253
254
255
22
32
4-to-1 multiplexor
Hit
Data
2004 Morgan Kaufmann Publishers
15
Performance
15%
1 KB
12%
2 KB
9%
4 KB
6%
8 KB
16 KB
32 KB
3%
64 KB
128 KB
0
One-way
Two-way
Four-way
Eight-way
Associativity
2004 Morgan Kaufmann Publishers
16
Decreasing miss penalty with multilevel caches
•
Add a second level cache:
– often primary cache is on the same chip as the processor
– use SRAMs to add another cache above primary memory (DRAM)
– miss penalty goes down if data is in 2nd level cache
•
Example:
– CPI of 1.0 on a 5 Ghz machine with a 5% miss rate, 100ns DRAM access
– Adding 2nd level cache with 5ns access time decreases miss rate to .5%
•
Using multilevel caches:
– try and optimize the hit time on the 1st level cache
– try and optimize the miss rate on the 2nd level cache
2004 Morgan Kaufmann Publishers
17
Cache Complexities
•
Not always easy to understand implications of caches:
1200
2000
Radix sort
1000
Radix sort
1600
800
1200
600
800
400
200
Quicksort
400
0
Quicksort
0
4
8
16
32
64
128
256
512 1024 2048 4096
Size (K items to sort)
Theoretical behavior of
Radix sort vs. Quicksort
4
8
16
32
64
128
256
512 1024 2048 4096
Size (K items to sort)
Observed behavior of
Radix sort vs. Quicksort
2004 Morgan Kaufmann Publishers
18
Cache Complexities
•
Here is why:
5
•
Radix sort
4
3
2
1
Quicksort
0
4
8
16
32
64
128
256
512 1024 2048 4096
Size (K items to sort)
•
Memory system performance is often critical factor
– multilevel caches, pipelined processors, make it harder to predict outcomes
– Compiler optimizations to increase locality sometimes hurt ILP
•
Difficult to predict best algorithm: need experimental data
2004 Morgan Kaufmann Publishers
19
Virtual Memory
•
Main memory can act as a cache for the secondary storage (disk)
Virtual addresses
Physical addresses
Address translation
Disk addresses
•
Advantages:
– illusion of having more physical memory
– program relocation
– protection
2004 Morgan Kaufmann Publishers
20
Pages: virtual memory blocks
•
Page faults: the data is not in memory, retrieve it from disk
– huge miss penalty, thus pages should be fairly large (e.g., 4KB)
– reducing page faults is important (LRU is worth the price)
– can handle the faults in software instead of hardware
– using write-through is too expensive so we use writeback
Virtual address
31 30 29 28 27
15 14 13 12 11 10 9 8
3210
Page offset
Virtual page number
Translation
29 28 27
15 14 13 12 11 10 9 8
Physical page number
Physical address
3210
Page offset
2004 Morgan Kaufmann Publishers
21
Page Tables
Virtual page
number
Page table
Physical page or
Valid disk address
1
1
1
1
0
1
1
0
1
1
0
1
Physical memory
Disk storage
2004 Morgan Kaufmann Publishers
22
Page Tables
Page table register
Virtual address
31 30 29 28 27
1 5 1 4 1 3 1 2 11 1 0 9 8
Virtual page number
Page offset
12
20
Valid
3 2 1 0
Physical page number
Page table
18
If 0 then page is not
present in memory
29 28 27
1 5 1 4 1 3 1 2 11 1 0 9 8
Physical page number
Physical address
3 2 1 0
Page offset
2004 Morgan Kaufmann Publishers
23
(can stop here) Making Address Translation Fast
•
A cache for address translations: translation lookaside buffer
TLB
Virtual page
number Valid Dirty Ref
1
1
1
1
0
1
0
1
1
0
0
0
Tag
Physical page
address
1
1
1
1
0
1
Physical memory
Page table
Physical page
Valid Dirty Ref or disk address
1
1
1
1
0
1
1
0
1
1
0
1
Typical values:
1
0
0
0
0
0
0
0
1
1
0
1
1
0
0
1
0
1
1
0
1
1
0
1
16-512 entries,
miss-rate: .01% - 1%
miss-penalty: 10 – 100 cycles
Disk storage
2004 Morgan Kaufmann Publishers
24
TLBs and caches
Virtual address
TLB access
TLB miss
exception
No
Yes
TLB hit?
Physical address
No
Try to read data
from cache
Cache miss stall
while read block
No
Cache hit?
Yes
Write?
No
Yes
Write access
bit on?
Write protection
exception
Yes
Try to write data
to cache
Deliver data
to the CPU
Cache miss stall
while read block
No
Cache hit?
Yes
Write data into cache,
update the dirty bit, and
put the data and the
address into the write buffer
2004 Morgan Kaufmann Publishers
25
TLBs and Caches
Virtual address
31 30 29
14 13 12 11 10 9
Virtual page number
3 2 1 0
Page offset
12
20
Valid Dirty
Tag
Physical page number
=
=
=
=
=
=
TLB
TLB hit
20
Page offset
Physical page number
Physical address
Block
Cache index
Physical address tag
offset
18
8
4
Byte
offset
2
8
12
Valid
Data
Tag
Cache
=
Cache hit
32
Data
2004 Morgan Kaufmann Publishers
26
Modern Systems
•
2004 Morgan Kaufmann Publishers
27
Modern Systems
•
Things are getting complicated!
2004 Morgan Kaufmann Publishers
28
Some Issues
•
Processor speeds continue to increase very fast
— much faster than either DRAM or disk access times
100,000
10,000
1,000
Performance
CPU
100
10
Memory
1
Year
•
Design challenge: dealing with this growing disparity
– Prefetching? 3rd level caches and more? Memory design?
2004 Morgan Kaufmann Publishers
29
Chapters 8 & 9
(partial coverage)
2004 Morgan Kaufmann Publishers
30
Interfacing Processors and Peripherals
•
•
•
I/O Design affected by many factors (expandability, resilience)
Performance:
— access latency
— throughput
— connection between devices and the system
— the memory hierarchy
— the operating system
A variety of different users (e.g., banks, supercomputers, engineers)
Interrupts
Processor
Cache
Memory- I/O bus
Main
memory
I/O
controller
Disk
Disk
I/O
controller
I/O
controller
Graphics
output
Network
2004 Morgan Kaufmann Publishers
31
I/O
•
Important but neglected
“The difficulties in assessing and designing I/O systems have
often relegated I/O to second class status”
“courses in every aspect of computing, from programming to
computer architecture often ignore I/O or give it scanty coverage”
“textbooks leave the subject to near the end, making it easier
for students and instructors to skip it!”
•
GUILTY!
— we won’t be looking at I/O in much detail
— be sure and read Chapter 8 in its entirety.
— you should probably take a networking class!
2004 Morgan Kaufmann Publishers
32
I/O Devices
•
Very diverse devices
— behavior (i.e., input vs. output)
— partner (who is at the other end?)
— data rate
2004 Morgan Kaufmann Publishers
33
I/O Example: Disk Drives
Platters
Tracks
Platter
Sectors
Track
•
To access data:
— seek: position head over the proper track (3 to 14 ms. avg.)
— rotational latency: wait for desired sector (.5 / RPM)
— transfer: grab the data (one or more sectors) 30 to 80 MB/sec
2004 Morgan Kaufmann Publishers
34
I/O Example: Buses
•
•
•
•
Shared communication link (one or more wires)
Difficult design:
— may be bottleneck
— length of the bus
— number of devices
— tradeoffs (buffers for higher bandwidth increases latency)
— support for many different devices
— cost
Types of buses:
— processor-memory (short high speed, custom design)
— backplane (high speed, often standardized, e.g., PCI)
— I/O (lengthy, different devices, e.g., USB, Firewire)
Synchronous vs. Asynchronous
— use a clock and a synchronous protocol, fast and small
but every device must operate at same rate and
clock skew requires the bus to be short
— don’t use a clock and instead use handshaking
2004 Morgan Kaufmann Publishers
35
I/O Bus Standards
•
Today we have two dominant bus standards:
2004 Morgan Kaufmann Publishers
36
Other important issues
•
Bus Arbitration:
— daisy chain arbitration (not very fair)
— centralized arbitration (requires an arbiter), e.g., PCI
— collision detection, e.g., Ethernet
•
Operating system:
— polling
— interrupts
— direct memory access (DMA)
•
Performance Analysis techniques:
— queuing theory
— simulation
— analysis, i.e., find the weakest link (see “I/O System
Design”)
•
Many new developments
2004 Morgan Kaufmann Publishers
37
Pentium 4
•
I/O Options
Pentium 4
processor
DDR 400
(3.2 GB/sec)
Main
memory
DIMMs
DDR 400
(3.2 GB/sec)
System bus (800 MHz, 604 GB/sec)
AGP 8X
Memory
(2.1 GB/sec)
Graphics
controller
output
hub
CSA
(north bridge)
(0.266 GB/sec)
1 Gbit Ethernet
82875P
Serial ATA
(150 MB/sec)
(266 MB/sec) Parallel ATA
(100 MB/sec)
Serial ATA
(150 MB/sec)
Parallel ATA
(100 MB/sec)
Disk
Disk
Stereo
(surroundsound)
AC/97
(1 MB/sec)
USB 2.0
(60 MB/sec)
...
I/O
controller
hub
(south bridge)
82801EB
CD/DVD
Tape
(20 MB/sec)
10/100 Mbit Ethernet
PCI bus
(132 MB/sec)
2004 Morgan Kaufmann Publishers
38
Fallacies and Pitfalls
•
Fallacy: the rated mean time to failure of disks is 1,200,000 hours,
so disks practically never fail.
•
Fallacy: magnetic disk storage is on its last legs, will be replaced.
•
Fallacy: A 100 MB/sec bus can transfer 100 MB/sec.
•
Pitfall: Moving functions from the CPU to the I/O processor,
expecting to improve performance without analysis.
2004 Morgan Kaufmann Publishers
39
Multiprocessors
•
Idea: create powerful computers by connecting many smaller ones
good news: works for timesharing (better than supercomputer)
bad news: its really hard to write good concurrent programs
many commercial failures
Processor
Processor
Processor
Cache
Cache
Cache
Processor
Processor
Processor
Cache
Cache
Cache
Memory
Memory
Memory
Single bus
Memory
I/O
Network
2004 Morgan Kaufmann Publishers
40
Questions
•
How do parallel processors share data?
— single address space (SMP vs. NUMA)
— message passing
•
How do parallel processors coordinate?
— synchronization (locks, semaphores)
— built into send / receive primitives
— operating system protocols
•
How are they implemented?
— connected by a single bus
— connected by a network
2004 Morgan Kaufmann Publishers
41
Supercomputers
Plot of top 500 supercomputer sites over a decade:
Single Instruction multiple data (SIMD)
500
Cluster
(network of
workstations)
400
Cluster
(network of
SMPs)
300
Massively
parallel
processors
(MPPs)
200
100
Sharedmemory
multiprocessors
(SMPs)
0
93 93 94 94 95 95 96 96 97 97 98 98 99 99 00
Uniprocessors
2004 Morgan Kaufmann Publishers
42
Using multiple processors an old idea
•
Some SIMD designs:
•
Costs for the the Illiac IV escalated from $8 million in 1966 to $32 million in
1972 despite completion of only ¼ of the machine. It took three more years
before it was operational!
“For better or worse, computer architects are not easily discouraged”
Lots of interesting designs and ideas, lots of failures, few successes
2004 Morgan Kaufmann Publishers
43
Topologies
P0
P1
P2
P3
P0
a. 2-D grid or mesh of 16 nodes
P4
P1
P5
P2
P6
P3
P7
P4
P5
P6
P7
b. Omega network
a. Crossbar
b. n-cube tree of 8 nodes (8 = 23 so n = 3)
2004 Morgan Kaufmann Publishers
44
Clusters
•
•
•
•
•
•
Constructed from whole computers
Independent, scalable networks
Strengths:
– Many applications amenable to loosely coupled machines
– Exploit local area networks
– Cost effective / Easy to expand
Weaknesses:
– Administration costs not necessarily lower
– Connected using I/O bus
Highly available due to separation of memories
In theory, we should be able to do better
2004 Morgan Kaufmann Publishers
45
Google
•
•
•
•
•
Serve an average of 1000 queries per second
Google uses 6,000 processors and 12,000 disks
Two sites in silicon valley, two in Virginia
Each site connected to internet using OC48 (2488 Mbit/sec)
Reliability:
– On an average day, 20 machines need rebooted (software error)
– 2% of the machines replaced each year
In some sense, simple ideas well executed. Better (and cheaper)
than other approaches involving increased complexity
2004 Morgan Kaufmann Publishers
46
Concluding Remarks
•
Evolution vs. Revolution
“More often the expense of innovation comes from being too disruptive
to computer users”
“Acceptance of hardware ideas requires acceptance by software
people; therefore hardware people should learn about software. And if
software people want good machines, they must learn more about hardware
to be able to communicate with and thereby influence hardware engineers.”
2004 Morgan Kaufmann Publishers
47