Lec08-MemoryHierarchyIII

CSCE 513 Computer Architecture
Lec08
Memory Hierarchy IV
Topics

• Pipelining Review
  • Load-Use Hazard
• Memory Hierarchy Review
  • Terminology review
  • Basic Equations
  • 6 Basic Optimizations

Memory Hierarchy – Chapter 2
Readings: Appendix B, Chapter 2
September 28, 2015
CSCE 513 Fall 2015
AMAT Equations
Terminology
(abbreviations)
• AMAT – Average Memory Access Time
• HT – Hit Time
• MR – Miss Rate
• MP – Miss Penalty
AMAT – weighted average
AMAT – weighted average (continued)
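The weighted-average view of AMAT can be sketched in a few lines, using the abbreviations from the terminology slide (HT, MR, MP). This is a minimal sketch, not the slide's own derivation: the function names and the sample numbers (1-cycle hit, 5% miss rate, 100-cycle penalty) are illustrative assumptions.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = HT + MR * MP, with all times in cycles."""
    return hit_time + miss_rate * miss_penalty


def amat_weighted(hit_time, miss_rate, miss_penalty):
    """The same quantity written as a weighted average of two cases."""
    hit_case = hit_time                   # cost of an access that hits
    miss_case = hit_time + miss_penalty   # cost of an access that misses
    return (1 - miss_rate) * hit_case + miss_rate * miss_case


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    print(amat(1, 0.05, 100))
    print(amat_weighted(1, 0.05, 100))
```

Both functions give the same value because (1 - MR) * HT + MR * (HT + MP) simplifies to HT + MR * MP.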
2.2 - Ten Advanced Cache Optimizations
Five Categories
1. Reducing the hit time: small and simple first-level caches and way prediction. Both techniques also generally decrease power consumption.
2. Increasing cache bandwidth: pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption.
3. Reducing the miss penalty: critical word first and merging write buffers. These optimizations have little impact on power.
4. Reducing the miss rate: compiler optimizations.
5. Reducing the miss penalty or miss rate via parallelism: hardware prefetching and compiler prefetching.
Advanced Optimizations
Way Prediction
• To improve hit time, predict the way to pre-set the mux
  • Mis-prediction gives longer hit time
  • Prediction accuracy
    • > 90% for two-way
    • > 80% for four-way
    • I-cache has better accuracy than D-cache
  • First used on MIPS R10000 in mid-90s
  • Used on ARM Cortex-A8
• Extend to predict block as well
  • “Way selection”
  • Increases mis-prediction penalty
Copyright © 2012, Elsevier Inc. All rights reserved.
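The effect of way-prediction accuracy on average hit time can be estimated with a one-line weighted average. This is a rough sketch: the cycle counts (1 cycle when the predicted way is correct, 2 cycles otherwise) are assumptions for illustration, and only the accuracy figures come from the slide.

```python
def avg_hit_time(accuracy, predicted_cycles=1, mispredicted_cycles=2):
    """Average hit time when a fraction `accuracy` of hits use the predicted way.

    The default cycle counts are hypothetical, not measured values.
    """
    return accuracy * predicted_cycles + (1 - accuracy) * mispredicted_cycles


if __name__ == "__main__":
    for label, accuracy in (("two-way", 0.90), ("four-way", 0.80)):
        print(label, avg_hit_time(accuracy))
```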
Advanced Optimizations
Pipelining Cache
• Pipeline cache access to improve bandwidth
  • Examples:
    • Pentium: 1 cycle
    • Pentium Pro – Pentium III: 2 cycles
    • Pentium 4 – Core i7: 4 cycles
• Increases branch mis-prediction penalty
• Makes it easier to increase associativity
Advanced Optimizations
Nonblocking Caches
• Allow hits before previous misses complete
  • “Hit under miss”
  • “Hit under multiple miss”
• L2 must support this
• In general, processors can hide L1 miss penalty but not L2 miss penalty
Advanced Optimizations
Multibanked Caches
• Organize cache as independent banks to support simultaneous access
  • ARM Cortex-A8 supports 1-4 banks for L2
  • Intel i7 supports 4 banks for L1 and 8 banks for L2
• Interleave banks according to block address
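Interleaving banks by block address can be sketched as a modulo on the block number, so that consecutive blocks land in different banks. This is a minimal sketch: the 64-byte block size and the 4-bank default are assumptions for illustration, and `bank_of` is a hypothetical helper name.

```python
def bank_of(byte_address, block_size=64, num_banks=4):
    """Return the bank holding the block that contains `byte_address`.

    Sequential interleaving: bank = block address mod number of banks.
    """
    block_address = byte_address // block_size
    return block_address % num_banks


if __name__ == "__main__":
    # Five consecutive 64-byte blocks spread across four banks, then wrap.
    print([bank_of(a) for a in (0, 64, 128, 192, 256)])
```

Because neighbouring blocks map to different banks, a sequential access stream can keep several banks busy at once.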
Advanced Optimizations
Critical Word First, Early Restart
• Critical word first
  • Request missed word from memory first
  • Send it to the processor as soon as it arrives
• Early restart
  • Request words in normal order
  • Send missed word to the processor as soon as it arrives
• Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
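The difference between the two strategies can be sketched by counting how many word transfers happen before the processor gets the word it missed on. This is a minimal sketch under stated assumptions: one word arrives per transfer, and critical word first fetches the rest of the block in wrap-around order (a common choice, but an assumption here). The function names are hypothetical.

```python
def critical_word_first_order(missed_word, words_per_block):
    """Fetch order that starts at the missed word and wraps around the block."""
    return [(missed_word + i) % words_per_block for i in range(words_per_block)]


def transfers_until_restart(missed_word, words_per_block, critical_first):
    """Number of word transfers before the missed word reaches the processor."""
    if critical_first:
        order = critical_word_first_order(missed_word, words_per_block)
    else:
        order = list(range(words_per_block))  # early restart: normal order
    return order.index(missed_word) + 1


if __name__ == "__main__":
    print(critical_word_first_order(5, 8))
    print(transfers_until_restart(5, 8, critical_first=True))
    print(transfers_until_restart(5, 8, critical_first=False))
```

With small blocks the two counts are close, which is one reason the benefit depends on block size.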
Advanced Optimizations
Merging Write Buffer
• When storing to a block that is already pending in the write buffer, update the write buffer entry
• Reduces stalls due to full write buffer
• Does not apply to I/O addresses
[Figure panels: “No write buffering” and “Write buffering”]
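Write merging can be sketched as a small buffer whose entries each hold one block. This is a toy model, not a description of any real processor: the class name, the 4-entry and 4-words-per-entry sizes, and the word-granularity addresses are all assumptions for illustration.

```python
class WriteBuffer:
    """Toy write buffer; each entry holds the pending words of one block."""

    def __init__(self, num_entries=4, words_per_entry=4, merging=True):
        self.num_entries = num_entries
        self.words_per_entry = words_per_entry
        self.merging = merging
        self.entries = []  # list of (block_address, {offset: value})

    def write(self, word_address, value):
        """Buffer a store. Returns False if the buffer is full (a stall)."""
        block, offset = divmod(word_address, self.words_per_entry)
        if self.merging:
            for entry_block, words in self.entries:
                if entry_block == block:
                    words[offset] = value  # merge into the pending entry
                    return True
        if len(self.entries) == self.num_entries:
            return False  # no free entry: the processor would stall
        self.entries.append((block, {offset: value}))
        return True


if __name__ == "__main__":
    for merging in (False, True):
        buf = WriteBuffer(merging=merging)
        for addr in (100, 101, 102, 103):  # four stores to the same block
            buf.write(addr, addr)
        print(merging, len(buf.entries), buf.write(104, 0))
```

Without merging, four stores to one block fill all four entries and the next store stalls; with merging they share a single entry.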
Advanced Optimizations
Compiler Optimizations
• Loop interchange
  • Swap nested loops to access memory in sequential order
• Blocking
  • Instead of accessing entire rows or columns, subdivide matrices into blocks
  • Requires more memory accesses but improves locality of accesses
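The benefit of loop interchange can be illustrated by replaying both loop orders through a toy cache model and counting misses. This is a sketch under stated assumptions: a direct-mapped cache with 16 blocks of 8 words, a 64 x 64 matrix stored in row-major order, and word-granularity addresses. None of these parameters come from the lecture.

```python
def count_misses(addresses, block_words=8, num_blocks=16):
    """Count misses in a toy direct-mapped cache; addresses are word indices."""
    tags = [None] * num_blocks
    misses = 0
    for addr in addresses:
        block = addr // block_words
        index = block % num_blocks
        if tags[index] != block:
            tags[index] = block
            misses += 1
    return misses


def column_order(rows, cols):
    """Inner loop walks down a column: stride of `cols` words (before interchange)."""
    return [i * cols + j for j in range(cols) for i in range(rows)]


def row_order(rows, cols):
    """Inner loop walks along a row: stride of 1 word (after interchange)."""
    return [i * cols + j for i in range(rows) for j in range(cols)]


if __name__ == "__main__":
    print("column order misses:", count_misses(column_order(64, 64)))
    print("row order misses:   ", count_misses(row_order(64, 64)))
```

In this model the interchanged (row-order) loop misses once per 8-word block, while the original column-order loop misses on every access.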
Advanced Optimizations
Hardware Prefetching
• Fetch two blocks on miss (include next sequential block)
[Figure: Pentium 4 pre-fetching]
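Fetching the next sequential block on a miss can be sketched with a simple miss counter. This is a deliberately simplified model: the cache is treated as unbounded, prefetched blocks are assumed to arrive before they are needed, and the trace is a list of block numbers. The function name is hypothetical.

```python
def misses_on_trace(block_trace, prefetch_next=False):
    """Count demand misses; optionally bring in block b+1 whenever b misses."""
    cached = set()
    misses = 0
    for block in block_trace:
        if block not in cached:
            misses += 1
            cached.add(block)
            if prefetch_next:
                cached.add(block + 1)  # next sequential block
    return misses


if __name__ == "__main__":
    sequential = list(range(16))
    print(misses_on_trace(sequential))
    print(misses_on_trace(sequential, prefetch_next=True))
```

For a purely sequential scan the model halves the demand misses. A real prefetcher also competes for bandwidth and cache space, which this sketch ignores.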
Advanced Optimizations
Compiler Prefetching
• Insert prefetch instructions before data is needed
• Non-faulting: prefetch doesn’t cause exceptions
• Register prefetch
  • Loads data into register
• Cache prefetch
  • Loads data into cache
• Combine with loop unrolling and software pipelining
Advanced Optimizations
Summary
Memory Technology
• Performance metrics
  • Latency is concern of cache
  • Bandwidth is concern of multiprocessors and I/O
  • Access time
    • Time between read request and when desired word arrives
  • Cycle time
    • Minimum time between unrelated requests to memory
• DRAM used for main memory, SRAM used for cache
Memory Technology
• SRAM
  • Requires low power to retain bit
  • Requires 6 transistors/bit
• DRAM
  • Must be re-written after being read
  • Must also be periodically refreshed
    • Every ~ 8 ms
    • All bits in a row can be refreshed simultaneously
  • One transistor/bit
  • Address lines are multiplexed:
    • Upper half of address: row access strobe (RAS)
    • Lower half of address: column access strobe (CAS)
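The multiplexed addressing can be sketched as splitting one address into two halves that share the same pins. This is a minimal sketch: the 20-bit address width is an assumption for illustration, the helper name is hypothetical, and the width must be even for the halves to be equal.

```python
def split_dram_address(address, address_bits=20):
    """Split an address into (row, column) halves.

    The upper half is sent with RAS, the lower half with CAS.
    `address_bits` is assumed to be even.
    """
    half = address_bits // 2
    row = address >> half                 # upper half of the address
    column = address & ((1 << half) - 1)  # lower half of the address
    return row, column


if __name__ == "__main__":
    address = (992 << 10) | 31
    print(split_dram_address(address))
```

Sending the address in two steps halves the number of address pins the DRAM chip needs.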
Memory Technology
• Amdahl:
  • Memory capacity should grow linearly with processor speed
  • Unfortunately, memory capacity and speed have not kept pace with processors
• Some optimizations:
  • Multiple accesses to same row
  • Synchronous DRAM
    • Added clock to DRAM interface
    • Burst mode with critical word first
  • Wider interfaces
  • Double data rate (DDR)
  • Multiple banks on each DRAM device
Memory Technology
Memory Optimizations
• DDR:
  • DDR2
    • Lower power (2.5 V -> 1.8 V)
    • Higher clock rates (266 MHz, 333 MHz, 400 MHz)
  • DDR3
    • 1.5 V
    • 800 MHz
  • DDR4
    • 1-1.2 V
    • 1600 MHz
• GDDR5 is graphics memory based on DDR3
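The clock rates above translate into peak transfer rates once the double data rate is taken into account. This is a rough sketch that assumes the slide's MHz figures are bus clock rates and that the memory sits on a 64-bit (8-byte) DIMM data path. Both are assumptions, and sustained bandwidth is lower than this peak.

```python
def peak_bandwidth_mb_s(bus_clock_mhz, bus_bytes=8):
    """Peak bandwidth in MB/s: two transfers per clock times the bus width."""
    transfers_per_clock = 2  # double data rate: data moves on both clock edges
    return bus_clock_mhz * transfers_per_clock * bus_bytes


if __name__ == "__main__":
    for name, mhz in (("DDR2", 400), ("DDR3", 800), ("DDR4", 1600)):
        print(name, peak_bandwidth_mb_s(mhz), "MB/s")
```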
Memory Technology
Memory Optimizations
• Graphics memory:
  • Achieve 2-5 X bandwidth per DRAM vs. DDR3
    • Wider interfaces (32 vs. 16 bit)
    • Higher clock rate
      • Possible because they are attached via soldering instead of socketed DIMM modules
• Reducing power in SDRAMs:
  • Lower voltage
  • Low power mode (ignores clock, continues to refresh)
Memory Technology
Memory Power Consumption
Memory Technology
Flash Memory
• Type of EEPROM
• Must be erased (in blocks) before being overwritten
• Non-volatile
• Limited number of write cycles
• Cheaper than SDRAM, more expensive than disk
• Slower than SRAM, faster than disk
Understand ReadyBoost and whether it will Speed Up your System
Windows 7 supports Windows ReadyBoost.
• This feature uses external USB flash drives as a hard disk cache to improve disk read performance.
• Supported external storage types include USB thumb drives, SD cards, and CF cards.
• Since ReadyBoost will not provide a performance gain when the primary disk is an SSD, Windows 7 disables ReadyBoost when reading from an SSD drive.
External storage must meet the following requirements:
• Capacity of at least 256 MB, with at least 64 kilobytes (KB) of free space. The 4-GB limit of Windows Vista has been removed.
• At least a 2.5 MB/sec throughput for 4-KB random reads
• At least a 1.75 MB/sec throughput for 1-MB random writes
http://technet.microsoft.com/en-us/magazine/ff356869.aspx
Memory Technology
Memory Dependability
• Memory is susceptible to cosmic rays
• Soft errors: dynamic errors
  • Detected and fixed by error correcting codes (ECC)
• Hard errors: permanent errors
  • Use spare rows to replace defective rows
• Chipkill: a RAID-like error recovery technique
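How an error correcting code detects and fixes a soft error can be sketched with the classic Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This code is chosen here only for brevity and is an assumption of this example. ECC memory typically uses wider codes over 64-bit words, and the lecture does not name a specific code.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits, error position); position 0 means no error found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3  # syndrome gives the 1-based position
    if error_pos:
        c[error_pos - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_pos


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a soft error in position 5
    print(hamming74_correct(word))
```

The syndrome bits recompute each parity group; read together as a binary number, they point at the flipped position.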
Solid State Drives
http://en.wikipedia.org/wiki/Solid-state_drive
http://www.tomshardware.com/charts/hard-drives-andssds,3.html
• Hard Drives 34 dimensions: e.g. Desktop performance
• SSD -
Windows Experience Index
Control Panel\All Control Panel Items\Performance Information and Tools
