Computer Structure
The Uncore
2nd Generation Intel® Core™
 Integrates CPU, Graphics, MC, PCI Express* on a single chip
 Next Generation Intel® Turbo Boost Technology
 High Bandwidth Last Level Cache
 Next Generation Graphics and Media
 Intel® Advanced Vector Extensions (Intel® AVX) – substantial performance improvement
 Integrated Memory Controller – 2ch DDR3
 Embedded DisplayPort
 Discrete Graphics Support: 1x16 or 2x8
 High BW / low-latency core/GFX interconnect
 Intel® Hyper-Threading Technology – 4 Cores / 8 Threads or 2 Cores / 4 Threads
[Die diagram: System Agent with Display, IMC (2ch DDR3), x16 PCI Express* and DMI to the PCH; a ring connecting the Cores with their LLC slices; and the Graphics unit]
Foil taken from IDF 2011
3rd Generation Intel® Core™
 22nm process
 Quad core die, with Intel HD Graphics 4000
 1.4 Billion transistors
 Die size: 160 mm²
The Uncore Subsystem
 The SoC design provides a high bandwidth bi-directional ring bus
– Connects the IA cores and the various uncore sub-systems
 The uncore subsystem includes
– A system agent
– The graphics unit (GT)
– The last level cache (LLC)
 In the Intel Xeon Processor E5 Family
– No graphics unit (GT)
– Instead it contains many more components:
 An LLC with larger capacity and snooping capabilities to support multiple processors
 Intel® QuickPath Interconnect interfaces that can support multi-socket platforms
 Power management control hardware
 A system agent capable of supporting high bandwidth traffic from memory and I/O devices
From the Optimization Manual
Scalable Ring On-die Interconnect
• Ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domain
• Composed of 4 rings
– 32 Byte Data ring, Request ring, Acknowledge ring and Snoop ring
– Fully pipelined at core frequency/voltage: bandwidth, latency and power scale with cores
• Massive ring wire routing runs over the LLC with no area impact
• Access on ring always picks the shortest path – minimizes latency
• Distributed arbitration; ring protocol handles coherency, ordering, and core interface
• Scalable to servers with a large number of processors
High Bandwidth, Low Latency, Modular
Foil taken from IDF 2011
Last Level Cache – LLC
 The LLC consists of multiple cache slices
– The number of slices is equal to the number of IA cores
– Each slice contains a full cache port that can supply 32 bytes/cycle
 Each slice has a logic portion and a data array portion
– The logic portion handles
 Data coherency
 Memory ordering
 Access to the data array portion
 LLC misses and write-back to memory
– The data array portion stores cache lines
 May have 4/8/12/16 ways
 Corresponding to 0.5MB/1MB/1.5MB/2MB slice size (a quick capacity check follows this slide)
 The GT sits on the same ring interconnect
– Uses the LLC for its data operations as well
– May in some cases compete with the cores for the LLC
From the Optimization Manual
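
As a quick sanity check on the slice sizes quoted above, here is a minimal sketch that computes slice capacity from the way count, assuming 64-byte lines and 2048 sets per slice (the set count is an assumption consistent with the 0.5MB/1MB/1.5MB/2MB figures, not stated on the slide).

    /* Sketch: LLC slice capacity as a function of way count.
     * Assumes 64-byte cache lines and 2048 sets per slice. */
    #include <stdio.h>

    int main(void) {
        const unsigned line_bytes = 64;
        const unsigned sets_per_slice = 2048;   /* assumed */
        const unsigned ways[] = {4, 8, 12, 16};

        for (unsigned i = 0; i < sizeof ways / sizeof ways[0]; i++) {
            unsigned long bytes = (unsigned long)ways[i] * sets_per_slice * line_bytes;
            printf("%2u ways -> %.1f MB per slice\n", ways[i], bytes / (1024.0 * 1024.0));
        }
        return 0;
    }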
Cache Box
• Interface block
– Between Core/Graphics/Media and the Ring
– Between Cache controller and the Ring
– Implements the ring logic, arbitration, cache controller
– Communicates with System Agent for LLC misses, external snoops, non-cacheable accesses
• Full cache pipeline in each cache box
– Physical Addresses are hashed at the source to prevent hot spots and increase bandwidth
– Maintains coherency and ordering for the addresses that are mapped to it
– LLC is fully inclusive with “Core Valid Bits” (CVB) – eliminates unnecessary snoops to cores
– Per-core CVB indicates if a core needs to be snooped for a given cache line (see the sketch after this slide)
• Runs at core voltage/frequency, scales with Cores
Distributed coherency & ordering; Scalable Bandwidth, Latency & Power
Foil taken from IDF 2011
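
To make the Core Valid Bits idea concrete, here is a minimal sketch of how an inclusive LLC line with per-core valid bits lets the cache box skip snoops to cores that cannot hold the line. The structure names and the 4-core bitmask layout are illustrative assumptions, not the actual hardware interface.

    /* Sketch: per-core "Core Valid Bits" on an inclusive LLC line used to
     * filter snoops. Names and layout are illustrative assumptions. */
    #include <stdio.h>
    #include <stdint.h>

    #define NUM_CORES 4

    struct llc_line {
        uint64_t tag;
        uint8_t  core_valid_bits;   /* bit i set => core i may hold this line */
    };

    /* Snoop only the cores whose CVB is set; the others cannot have the line
     * because the LLC is inclusive of the core caches. */
    static void snoop_for_line(const struct llc_line *line) {
        for (int core = 0; core < NUM_CORES; core++) {
            if (line->core_valid_bits & (1u << core))
                printf("snoop core %d for tag 0x%llx\n", core,
                       (unsigned long long)line->tag);
            /* cores with a clear bit are skipped: no snoop traffic */
        }
    }

    int main(void) {
        struct llc_line line = { .tag = 0xABCD, .core_valid_bits = 0x5 }; /* cores 0, 2 */
        snoop_for_line(&line);
        return 0;
    }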
Ring Interconnect and LLC
 The physical addresses of data kept in the LLC are distributed among the cache slices by a hash function (an illustrative hash sketch follows this slide)
– Addresses are uniformly distributed
– From the cores' and the GT's view, the LLC acts as one shared cache
 With multiple ports and bandwidth that scales with the number of cores
– The number of cache slices increases with the number of cores
 The ring and LLC are therefore not likely to be a BW limiter to core operation
– From the SW point of view, this does not appear as a normal N-way cache
– The LLC hit latency, ranging between 26-31 cycles, depends on
 The core location relative to the LLC slice (how far the request needs to travel on the ring)
 All the traffic that cannot be satisfied by the LLC still travels through the cache-slice logic portion and the ring to the system agent
– E.g., LLC misses, dirty line writeback, non-cacheable operations, and MMIO/IO operations
From the Optimization Manual
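
The actual slice-selection hash is not given on the slide; the sketch below uses a simple XOR-fold of the physical address bits above the 64-byte line offset as an assumed stand-in, just to show how a hash can spread addresses evenly across slices.

    /* Sketch: distributing physical addresses across LLC slices with a hash.
     * The XOR-fold below is an illustrative stand-in, not Intel's real hash. */
    #include <stdio.h>
    #include <stdint.h>

    #define NUM_SLICES 4   /* one slice per core */

    static unsigned slice_of(uint64_t phys_addr) {
        uint64_t line = phys_addr >> 6;        /* drop the 64-byte line offset */
        unsigned h = 0;
        while (line) {                         /* fold the remaining bits */
            h ^= (unsigned)(line & (NUM_SLICES - 1));
            line >>= 2;                        /* 2 = log2(NUM_SLICES) */
        }
        return h;
    }

    int main(void) {
        unsigned counts[NUM_SLICES] = {0};
        /* Walk 64-byte lines over 1 MB and check the distribution is even. */
        for (uint64_t addr = 0; addr < (1u << 20); addr += 64)
            counts[slice_of(addr)]++;
        for (unsigned s = 0; s < NUM_SLICES; s++)
            printf("slice %u: %u lines\n", s, counts[s]);
        return 0;
    }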
LLC Sharing
• LLC is shared among all Cores, Graphics and Media
– Graphics driver controls which streams are cached/coherent
– Any agent can access all data in the LLC, independent of who allocated the line, after memory range checks
• Controlled LLC way allocation mechanism prevents thrashing between Core/GFX
• Multiple coherency domains
– IA Domain (fully coherent via cross-snoops)
– Graphics domain (Graphics virtual caches, flushed to IA domain by graphics engine)
– Non-Coherent domain (Display data, flushed to memory by graphics engine)
Much higher Graphics performance, DRAM power savings, more DRAM BW available for Cores
Foil taken from IDF 2011
Cache Hierarchy

Level           Capacity  Ways    Line Size  Write Update  Inclusive  Latency   Bandwidth
                                  (bytes)    Policy                   (cycles)  (bytes/cyc)
L1 Data         32KB      8       64         Write-back    -          4         2 × 16
L1 Instruction  32KB      8       64         N/A           -          -         -
L2 (Unified)    256KB     8       64         Write-back    No         12        1 × 32
LLC             Varies    Varies  64         Write-back    Yes        26-31     1 × 32

 The LLC is inclusive of all cache levels above it
– Data contained in the core caches must also reside in the LLC
– Each LLC cache line holds an indication of the cores that may have this line in their L2 and L1 caches
 Fetching data from the LLC when another core has the data
– Clean hit – data is not modified in the other core – 43 cycles
– Dirty hit – data is modified in the other core – 60 cycles
From the Optimization Manual
Data Prefetch to L2$ and LLC
 Two HW prefetchers fetch data from memory to the L2$ and LLC
– The streamer and the spatial prefetcher prefetch the data to the LLC
– Typically data is also brought to the L2
 Unless the L2 cache is heavily loaded with missing demand requests
 Spatial Prefetcher
– Strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk (see the sketch after this slide)
 Streamer Prefetcher
– Monitors read requests from the L1 caches for ascending and descending sequences of addresses
 L1 D$ requests: loads, stores, and the L1 D$ HW prefetcher
 L1 I$ code fetch requests
– When a forward or backward stream of requests is detected
 The anticipated cache lines are prefetched
 Prefetched cache lines must be in the same 4K page
From the Optimization Manual
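
A minimal sketch of the spatial prefetcher's pairing rule: the "pair" of a 64-byte line is the other half of its 128-byte aligned chunk, i.e. the line address with bit 6 flipped. The helper name is mine, not Intel's.

    /* Sketch: the spatial prefetcher pairs each 64-byte line with the other
     * line of its 128-byte aligned chunk. Helper name is illustrative. */
    #include <stdio.h>
    #include <stdint.h>

    static uint64_t pair_line(uint64_t phys_addr) {
        uint64_t line = phys_addr & ~0x3Full;   /* align down to the 64-byte line */
        return line ^ 0x40;                     /* flip bit 6: the buddy line */
    }

    int main(void) {
        uint64_t addrs[] = {0x1000, 0x1040, 0x1C8};
        for (int i = 0; i < 3; i++)
            printf("demand line 0x%llx -> prefetch pair 0x%llx\n",
                   (unsigned long long)(addrs[i] & ~0x3Full),
                   (unsigned long long)pair_line(addrs[i]));
        return 0;
    }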
Data Prefetch to L2$ and LLC
 Streamer Prefetcher Enhancement
– The streamer may issue two prefetch requests on every L2 lookup
 Runs up to 20 lines ahead of the load request
– Adjusts dynamically to the number of outstanding requests per core
 Not many outstanding requests → prefetch further ahead
 Many outstanding requests → prefetch to the LLC only, and less far ahead
– When cache lines are far ahead
 Prefetch to the LLC only and not to the L2$
 Avoids replacement of useful cache lines in the L2$
– Detects and maintains up to 32 streams of data accesses (a simplified stream-table sketch follows this slide)
 For each 4K byte page, can maintain one forward and one backward stream
From the Optimization Manual
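
The sketch below is a much-simplified model of such a stream table, assuming one tracked stream per 4K page and a fixed table of 32 entries. The detection threshold, prefetch distance and round-robin replacement are illustrative choices, not the real hardware's.

    /* Sketch: a simplified streamer table - up to 32 tracked 4KB pages, each
     * remembering the last line touched and the stream direction. Thresholds
     * and round-robin replacement are illustrative, not the actual hardware. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define NUM_STREAMS 32

    struct stream {
        uint64_t page;        /* 4KB page being tracked */
        uint64_t last_line;   /* last line address seen in that page */
        int      dir;         /* +1 forward, -1 backward, 0 unknown */
        int      valid;
    };

    static struct stream table[NUM_STREAMS];
    static int next_victim;

    static void observe(uint64_t addr) {
        uint64_t page = addr >> 12, line = addr >> 6;

        for (int i = 0; i < NUM_STREAMS; i++) {
            if (table[i].valid && table[i].page == page) {
                long long step = (long long)line - (long long)table[i].last_line;
                if (step == 1 || step == -1) {
                    table[i].dir = (int)step;
                    /* stream confirmed: prefetch the next line ahead */
                    printf("prefetch line 0x%llx (dir %+lld)\n",
                           (unsigned long long)(line + step), step);
                }
                table[i].last_line = line;
                return;
            }
        }
        /* new page: allocate an entry (round-robin victim) */
        table[next_victim] = (struct stream){ page, line, 0, 1 };
        next_victim = (next_victim + 1) % NUM_STREAMS;
    }

    int main(void) {
        memset(table, 0, sizeof table);
        for (uint64_t a = 0x2000; a < 0x2200; a += 64)   /* ascending accesses */
            observe(a);
        return 0;
    }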
Lean and Mean System Agent
• Contains PCI Express*, DMI, Memory Controller, Display Engine…
• Contains Power Control Unit
– Programmable uController, handles all power management and reset functions in the chip
• Smart integration with the ring
– Provides cores/Graphics/Media with high BW, low latency to DRAM/IO for best performance
– Handles IO-to-cache coherency
• Separate voltage and frequency from ring/cores, Display integration for better battery life
• Extensive power and thermal management for PCI Express* and DDR
Smart I/O Integration
Foil taken from IDF 2011
The System Agent
 The system agent contains the following components
– An arbiter that handles all accesses from the ring domain and from I/O (PCIe* and DMI) and routes the accesses to the right place
– PCIe controllers that connect to external PCIe devices
 Support different configurations: x16+x4, x8+x8+x4, x8+x4+x4+x4
– A DMI controller that connects to the PCH chipset
– An integrated display engine, Flexible Display Interconnect, and Display Port, for the internal graphic operations
– The memory controller
 All main memory traffic is routed from the arbiter to the memory controller
– The memory controller supports two channels of DDR
 Data rates of 1066MHz, 1333MHz and 1600MHz
 8 bytes per cycle
– Addresses are distributed between the memory channels based on a local hash function that attempts to balance the load between the channels in order to achieve maximum bandwidth and minimum hotspot collisions (see the sketch after this slide)
From the Optimization Manual
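
The channel hash itself is not specified here; as an illustration, the sketch below picks a channel by XOR-ing a few address bits above the 64-byte line offset, which is one simple way to spread sequential and strided traffic across two channels. The bit selection is an assumed stand-in, not Intel's function.

    /* Sketch: spreading physical addresses across two DDR channels with a
     * simple XOR hash. The bit choice is an assumed stand-in. */
    #include <stdio.h>
    #include <stdint.h>

    static unsigned channel_of(uint64_t phys_addr) {
        /* XOR a few address bits above the 64-byte line offset. */
        return (unsigned)((phys_addr >> 6) ^ (phys_addr >> 12) ^ (phys_addr >> 17)) & 1;
    }

    int main(void) {
        unsigned counts[2] = {0, 0};
        for (uint64_t addr = 0; addr < (1u << 24); addr += 64)   /* 16 MB of lines */
            counts[channel_of(addr)]++;
        printf("channel 0: %u lines, channel 1: %u lines\n", counts[0], counts[1]);
        return 0;
    }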
The Memory Controller
 For best performance
– Populate both channels with equal amounts of memory
 Preferably the exact same types of DIMMs
– Using more ranks for the same amount of memory results in somewhat better memory bandwidth
 Since more DRAM pages can be open simultaneously
– Use the highest supported speed DRAM, with the best DRAM timings
 The two memory channels have separate resources
– They handle memory requests independently
– Each memory channel contains a 32 cache-line write-data-buffer
 The memory controller contains a high-performance out-of-order scheduler
– Attempts to maximize memory bandwidth while minimizing latency
– Writes to the memory controller are considered completed when they are written to the write-data-buffer
– The write-data-buffer is flushed out to main memory at a later time, not impacting write latency
From the Optimization Manual
The Memory Controller
 Partial writes are not handled efficiently by the memory controller
– They may result in read-modify-write operations on the DDR channel,
 if the partial writes do not complete a full cache line in time
– Software should avoid creating partial write transactions whenever possible and consider alternatives,
 such as buffering the partial writes into full cache line writes (a minimal sketch follows this slide)
 The memory controller also supports high-priority isochronous requests
– E.g., USB isochronous and Display isochronous requests
 High bandwidth of memory requests from the integrated display engine takes up some of the memory bandwidth
– Impacts core access latency to some degree
From the Optimization Manual
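
Here is a minimal sketch of the "buffer partial writes into full cache line writes" advice: rather than issuing many small stores that each dirty part of a line, accumulate them in a 64-byte staging buffer and write the line out in one piece. The buffer structure and flush routine are illustrative, not a real driver API.

    /* Sketch: combining partial writes into full 64-byte line writes.
     * The staging buffer and flush routine are illustrative. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE 64

    struct line_buffer {
        uint64_t line_addr;        /* 64-byte aligned destination */
        uint8_t  data[LINE];
        uint64_t valid_mask;       /* bit i set => byte i has been written */
    };

    static void flush_if_full(struct line_buffer *b, uint8_t *memory) {
        if (b->valid_mask == ~0ull) {                      /* all 64 bytes present */
            memcpy(memory + b->line_addr, b->data, LINE);  /* one full-line write */
            b->valid_mask = 0;
        }
    }

    static void partial_write(struct line_buffer *b, uint8_t *memory,
                              uint64_t addr, const uint8_t *src, size_t len) {
        for (size_t i = 0; i < len; i++) {
            unsigned off = (unsigned)((addr + i) & (LINE - 1));
            b->data[off] = src[i];
            b->valid_mask |= 1ull << off;
        }
        flush_if_full(b, memory);   /* reaches memory only when the line is complete */
    }

    int main(void) {
        static uint8_t memory[4096];
        struct line_buffer b = { .line_addr = 128 };
        uint8_t chunk[16];
        memset(chunk, 0xAB, sizeof chunk);
        for (int i = 0; i < 4; i++)            /* four 16-byte partial writes */
            partial_write(&b, memory, 128 + 16 * i, chunk, sizeof chunk);
        printf("memory[128] = 0x%02X (written as one full line)\n", memory[128]);
        return 0;
    }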
Integration: Optimization Opportunities
• Dynamically redistribute power between Cores & Graphics
• Tight power management control of all components, providing better granularity and deeper idle/sleep states
• Three separate power/frequency domains: System Agent (Fixed), Cores+Ring, Graphics (Variable)
• High BW Last Level Cache, shared among Cores and Graphics
– Significant performance boost, saves memory bandwidth and power
• Integrated Memory Controller and PCI Express ports
– Tightly integrated with the Core/Graphics/LLC domain
– Provides low latency & low power – removes intermediate busses
• Bandwidth is balanced across the whole machine, from Core/Graphics all the way to the Memory Controller
• Modular uArch for optimal cost/power/performance
– Derivative products done with minimal effort/time
Foil taken from IDF 2011
DRAM
Basic DRAM chip
[Block diagram: the address bus feeds a Row Address Latch (strobed by RAS#) and a Column Address Latch (strobed by CAS#); the row and column address decoders select a location in the memory array, which drives the Data pins]
 DRAM access sequence
– Put the Row address on the addr. bus and assert RAS# (Row Addr. Strobe) to latch the Row
– Put the Column address on the addr. bus and assert CAS# (Column Addr. Strobe) to latch the Column
– Get the data on the data bus
DRAM Operation
 A DRAM cell consists of a transistor + a capacitor
– The capacitor keeps the state; the transistor guards access to the state
– Reading the cell state: raise the access line AL and sense the data line DL
 If the capacitor is charged → a current flows on the data line DL
– Writing the cell state: set DL and raise AL to charge/drain the capacitor
– Charging and draining a capacitor is not instantaneous
 Leakage current drains the capacitor even when the transistor is closed
– The DRAM cell is therefore refreshed periodically, every 64ms
[Cell diagram: access line AL gates transistor M, which connects capacitor C to data line DL]
DRAM Access Sequence Timing
[Timing diagram: RAS#, CAS#, address bus A[0:7] and data bus for an access to Row i / Col n, annotated with tRP (Row Precharge), tRCD (RAS-to-CAS delay) and CL (CAS latency)]
– Put the row address on the address bus and assert RAS#
– Wait the RAS# to CAS# delay (tRCD) between asserting RAS and CAS
– Put the column address on the address bus and assert CAS#
– Wait the CAS latency (CL) between the time CAS# is asserted and the data being ready
– Row precharge time (tRP): the time to close the current row and open a new row
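
As a small worked example of these parameters, the sketch below adds up page-hit, page-empty and page-miss latencies (CL, tRCD + CL, and tRP + tRCD + CL) for an illustrative DDR3-1600 part with 11-11-11 timings; the timing values are assumptions for the example, not taken from this slide.

    /* Sketch: DRAM access latency from the timing parameters above, using
     * illustrative DDR3-1600 11-11-11 timings (assumed values). */
    #include <stdio.h>

    int main(void) {
        double io_clock_mhz = 800.0;            /* DDR3-1600: 800 MHz I/O clock */
        double cycle_ns = 1000.0 / io_clock_mhz;

        int CL = 11, tRCD = 11, tRP = 11;       /* timings in I/O bus cycles */

        double page_hit_ns   = CL * cycle_ns;                  /* row already open */
        double page_empty_ns = (tRCD + CL) * cycle_ns;         /* row must be opened */
        double page_miss_ns  = (tRP + tRCD + CL) * cycle_ns;   /* another row is open */

        printf("page hit  : %.2f ns\n", page_hit_ns);
        printf("page empty: %.2f ns\n", page_empty_ns);
        printf("page miss : %.2f ns\n", page_miss_ns);
        return 0;
    }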
DRAM controller
 The DRAM controller gets an address and a command
– Splits the address into Row and Column (a simple address-split sketch follows this slide)
– Generates the DRAM control signals at the proper timing
[Block diagram: A[20:23] feed an address decoder that produces the chip select; A[10:19] and A[0:9] go through an address mux onto the memory address bus; a time-delay generator drives RAS#, CAS# and the mux Select; data on D[0:7] and R/W# connect to the DRAM]
 DRAM data must be periodically refreshed
– The DRAM controller performs the DRAM refresh, using a refresh counter
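
A minimal sketch of the controller's address split, using the bit fields shown in the diagram (A[0:9] as column, A[10:19] as row, A[20:23] as chip select); the field widths are just those of this example part, not a general rule.

    /* Sketch: splitting a flat address into column / row / chip-select fields,
     * following the A[0:9] / A[10:19] / A[20:23] split from the diagram. */
    #include <stdio.h>
    #include <stdint.h>

    struct dram_addr {
        unsigned col;    /* A[0:9]   - 10 bits */
        unsigned row;    /* A[10:19] - 10 bits */
        unsigned chip;   /* A[20:23] - 4 bits  */
    };

    static struct dram_addr split(uint32_t addr) {
        struct dram_addr d;
        d.col  = addr         & 0x3FF;
        d.row  = (addr >> 10) & 0x3FF;
        d.chip = (addr >> 20) & 0xF;
        return d;
    }

    int main(void) {
        uint32_t addr = 0xABCDE;
        struct dram_addr d = split(addr);
        /* The controller would drive the chip select, put d.row on the bus with
         * RAS#, then d.col with CAS#, honoring tRCD and CL in between. */
        printf("addr 0x%05X -> chip %u, row 0x%03X, col 0x%03X\n",
               addr, d.chip, d.row, d.col);
        return 0;
    }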
Improved DRAM Schemes
 Paged Mode DRAM
– Multiple accesses to different columns from the same row
– Saves the RAS and RAS-to-CAS delay
[Timing diagram: RAS# stays asserted for one Row while CAS# pulses for Col n, Col n+1, Col n+2, returning Data n, Data n+1, Data n+2]
 Extended Data Output RAM (EDO RAM)
– A data output latch enables overlapping the next column address with the current column data
[Timing diagram: the same paged-mode sequence, but Data n is still driven while the Col n+1 address is presented]
Improved DRAM Schemes (cont)
 Burst DRAM
– Generates the consecutive column addresses by itself
[Timing diagram: a single Col n address with CAS# returns Data n, Data n+1, Data n+2 as a burst]
Synchronous DRAM – SDRAM
 All signals are referenced to an external clock (100MHz-200MHz)
– Makes timing more precise with other system devices
 4 banks – multiple pages open simultaneously (one per bank)
 Command driven functionality instead of signal driven
– ACTIVE: selects both the bank and the row to be activated
 An ACTIVE to a new bank can be issued while accessing the current bank
– READ/WRITE: selects the column
 Burst oriented read and write accesses
– Successive column locations accessed in the given row
– Burst length is programmable: 1, 2, 4, 8, and full-page
 May end a full-page burst with BURST TERMINATE to get an arbitrary burst length
 A user programmable Mode Register
– CAS latency, burst length, burst type
 Auto precharge: may close the row after the last read/write in a burst
 Auto refresh: internal counters generate the refresh address
SDRAM Timing
[Timing diagram: clock, command bus (ACT, RD, RD with auto-precharge, NOP), bank, address and data buses for interleaved accesses to Bank 0 and Bank 1, with CL = 2, BL = 1, tRCD > 20ns, tRRD > 20ns, tRC > 70ns]
 tRCD: ACTIVE to READ/WRITE gap = tRCD(MIN) / clock period
 tRC: successive ACTIVE commands to a different row in the same bank
 tRRD: successive ACTIVE commands to different banks
DDR-SDRAM
 2n-prefetch architecture
– DRAM cells are clocked at the same speed as SDR SDRAM cells
– The internal data bus is twice the width of the external data bus
– Data capture occurs twice per clock cycle
 Lower half of the bus sampled at the clock rise
 Upper half of the bus sampled at the clock fall
[Diagram: an SDRAM array with a 2n-bit internal bus multiplexed onto an n-bit external bus, giving 400M transfers/sec from a 200MHz clock]
 Uses 2.5V (vs. 3.3V in SDRAM)
– Reduced power consumption
DDR SDRAM Timing
[Timing diagram: 133MHz clock with an ACT/RD/NOP command sequence to Bank 0 and Bank 1 (tRCD > 20ns, tRRD > 20ns, tRC > 70ns, CL = 2); each RD returns a 4-transfer burst (j, +1, +2, +3 and n, +1, +2, +3) on both clock edges]
DIMMs
 DIMM: Dual In-line Memory Module
– A small circuit board that holds memory chips
 64-bit wide data path (72 bit with parity)
– Single sided: 9 chips, each with an 8 bit data bus
– Dual sided: 18 chips, each with a 4 bit data bus
– Data BW: 64 bits on each rising and falling edge of the clock
 Other pins
– Address – 14, RAS, CAS, chip select – 4, VDC – 17, Gnd – 18, clock – 4, serial address – 3, …
DDR Standards
 DRAM timing, measured in I/O bus cycles, specifies 3 numbers
– CAS Latency – RAS to CAS Delay – RAS Precharge Time (CL-tRCD-tRP)
 CAS latency (latency to get data in an open page) in nsec
– CAS Latency × I/O bus cycle time

Standard  Mem. clock  I/O bus clock  Cycle time  Data rate  VDDQ  Module   Peak transfer  Timings        CAS Latency
name      (MHz)       (MHz)          (ns)        (MT/s)     (V)   name     rate (MB/s)    (CL-tRCD-tRP)  (ns)
DDR-200   100         100            10          200        2.5   PC-1600  1600
DDR-266   133⅓        133⅓           7.5         266⅔       2.5   PC-2100  2133⅓
DDR-333   166⅔        166⅔           6           333⅓       2.5   PC-2700  2666⅔
DDR-400   200         200            5           400        2.6   PC-3200  3200           2.5-3-3        12.5
                                                                                          3-3-3          15
                                                                                          3-4-4          15

 Total BW for DDR400
– 3200M Byte/sec = 64 bit × 2 × 200MHz / 8 (bit/byte)
– 6400M Byte/sec for dual channel DDR SDRAM
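
The two formulas on this slide (CAS latency in ns and the peak transfer rate) are easy to check mechanically; the sketch below does so for DDR-400 with 3-3-3 timings.

    /* Sketch: checking the slide's formulas for DDR-400.
     * CAS latency (ns) = CL x I/O bus cycle time
     * Peak BW (MB/s)   = 64-bit bus x 2 transfers/clock x I/O clock / 8 bits-per-byte */
    #include <stdio.h>

    int main(void) {
        double io_clock_mhz = 200.0;     /* DDR-400 I/O bus clock */
        double cycle_ns = 1000.0 / io_clock_mhz;
        double cl = 3.0;                 /* from the 3-3-3 timings */

        double cas_ns = cl * cycle_ns;
        double peak_mb_s = 64.0 * 2.0 * io_clock_mhz / 8.0;   /* single channel */

        printf("CAS latency: %.1f ns\n", cas_ns);                             /* 15.0 ns */
        printf("peak BW    : %.0f MB/s (x2 for dual channel)\n", peak_mb_s);  /* 3200    */
        return 0;
    }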
DDR2
 DDR2 doubles the bandwidth
– 4n prefetch: internally reads/writes 4× the amount of data of the external bus
– A DDR2-533 cell works at the same frequency as a DDR-266 cell or a PC133 cell
– Prefetching increases latency
 Smaller page size: 1KB vs. 2KB
– Reduces activation power – the ACTIVATE command reads all bits in the page
 8 banks in 1Gb densities and above
– Increases random accesses
 1.8V (vs 2.5V) operation voltage
– Significantly lower power
[Diagram: memory cell array feeding the I/O buffers and data bus for SDR, DDR (2n prefetch) and DDR2 (4n prefetch)]
DDR2 Standards

Standard   Mem clock  Cycle    I/O bus      Data rate  Module    Peak transfer  Timings  CAS Latency
name       (MHz)      time     clock (MHz)  (MT/s)     name      rate (MB/s)    (CL)     (ns)
DDR2-400   100        10 ns    200          400        PC2-3200  3200           3-3-3    15
                                                                                4-4-4    20
DDR2-533   133        7.5 ns   266          533        PC2-4200  4266           3-3-3    11.25
                                                                                4-4-4    15
DDR2-667   166        6 ns     333          667        PC2-5300  5333           4-4-4    12
                                                                                5-5-5    15
DDR2-800   200        5 ns     400          800        PC2-6400  6400           4-4-4    10
                                                                                5-5-5    12.5
                                                                                6-6-6    15
DDR2-1066  266        3.75 ns  533          1066       PC2-8500  8533           6-6-6    11.25
                                                                                7-7-7    13.125
DDR3
 30% power consumption reduction compared to DDR2
– 1.5V supply voltage, compared to DDR2's 1.8V
– 90 nanometer fabrication technology
 Higher bandwidth
– 8 bit deep prefetch buffer (vs. 4 bit in DDR2 and 2 bit in DDR)
 Transfer data rate
– Effective clock rate of 800–1600 MHz using both rising and falling edges of a 400–800 MHz I/O clock
– DDR2: 400–800 MHz using a 200–400 MHz I/O clock
– DDR: 200–400 MHz based on a 100–200 MHz I/O clock
 DDR3 DIMMs
– 240 pins, the same number as DDR2, and the same size
– Electrically incompatible, and have a different key notch location
DDR3 Standards

Standard    Mem clock  I/O bus      Cycle time  Data rate  Module     Peak transfer  Timings        CAS Latency
name        (MHz)      clock (MHz)  (ns)        (MT/s)     name       rate (MB/s)    (CL-tRCD-tRP)  (ns)
DDR3-800    100        400          2.5         800        PC3-6400   6400           5-5-5          12 1⁄2
                                                                                     6-6-6          15
DDR3-1066   133⅓       533⅓         1.875       1066⅔      PC3-8500   8533⅓          6-6-6          11 1⁄4
                                                                                     7-7-7          13 1⁄8
                                                                                     8-8-8          15
DDR3-1333   166⅔       666⅔         1.5         1333⅓      PC3-10600  10666⅔         8-8-8          12
                                                                                     9-9-9          13 1⁄2
DDR3-1600   200        800          1.25        1600       PC3-12800  12800          9-9-9          11 1⁄4
                                                                                     10-10-10       12 1⁄2
                                                                                     11-11-11       13 3⁄4
DDR3-1866   233⅓       933⅓         1.07        1866⅔      PC3-14900  14933⅓         11-11-11       11 11⁄14
                                                                                     12-12-12       12 6⁄7
DDR3-2133   266⅔       1066⅔        0.9375      2133⅓      PC3-17000  17066⅔         12-12-12       11 1⁄4
                                                                                     13-13-13       12 3⁄16
DDR2 vs. DDR3 Performance
 The high latency of DDR3 SDRAM has a negative effect on streaming operations
Source: xbitlabs
How to get the most of Memory?
 Single Channel DDR
[Diagram: CPU with L2 Cache on the FSB (Front Side Bus); the DRAM controller drives a single DDR DIMM over the memory bus]
 Dual channel DDR
– Each DIMM pair must be the same
[Diagram: CPU with L2 Cache on the FSB; the DRAM controller drives two DDR DIMMs, one on channel A and one on channel B]
 Balance FSB and memory bandwidth
– An 800MHz FSB provides 800MHz × 64bit / 8 = 6.4 GByte/sec
– Dual Channel DDR400 SDRAM also provides 6.4 GByte/sec
How to get the most of Memory?
 Each DIMM supports 4 open pages simultaneously
– The more open pages, the more random access
– It is better to have more DIMMs
 n DIMMs: 4n open pages
 DIMMs can be single sided or dual sided
– Dual sided DIMMs may have a separate CS for each side
 The number of open pages is doubled (goes up to 8)
 This is not a must – dual sided DIMMs may also have a common CS for both sides, in which case there are only 4 open pages, as with single sided DIMMs
SRAM – Static RAM
 True random access
 High speed, low density, high power
 No refresh
 Address not multiplexed
 DDR SRAM
– 2 READs or 2 WRITEs per clock
– Common or Separate I/O
– DDRII: 200MHz to 333MHz Operation; Density: 18/36/72Mb+
 QDR SRAM
– Two separate DDR ports: one read and one write
– One DDR address bus: alternating between the read address and the write address
– QDRII: 250MHz to 333MHz Operation; Density: 18/36/72Mb+
SRAM vs. DRAM
 Random Access: access time is the same for all locations

               DRAM – Dynamic RAM            SRAM – Static RAM
Refresh        Refresh needed                No refresh needed
Address        Address muxed: row + column   Address not multiplexed
Access         Not true “Random Access”      True “Random Access”
Density        High (1 Transistor/bit)       Low (6 Transistors/bit)
Power          Low                           High
Speed          Slow                          Fast
Price/bit      Low                           High
Typical usage  Main memory                   Cache
Read Only Memory (ROM)
 Random Access
 Non volatile
 ROM Types
– PROM – Programmable ROM
 Burnt once using special equipment
– EPROM – Erasable PROM
 Can be erased by exposure to UV, and then reprogrammed
– E2PROM – Electrically Erasable PROM
 Can be erased and reprogrammed on board
 Write time (programming) much longer than RAM
 Limited number of writes (thousands)