Computer Structure
The Uncore
2nd Generation Intel® Core™
Integrates CPU, Graphics, MC, PCI Express* on a single chip
Next Generation Intel® Turbo Boost Technology
High Bandwidth Last Level Cache
Next Generation Graphics and Media
Embedded DisplayPort
Discrete Graphics Support: 1x16 or 2x8
High BW / low-latency core/GFX interconnect
Intel® Advanced Vector Extensions (Intel® AVX) – substantial performance improvement
Integrated Memory Controller – 2ch DDR3
Intel® Hyper-Threading Technology – 4 Cores / 8 Threads or 2 Cores / 4 Threads
[Die diagram: four Cores, each with an LLC slice, and the Graphics unit on a ring; System/Display Agent with the IMC (2ch DDR3), x16 PCI Express* and DMI to the PCH]
Foil taken from IDF 2011
3rd Generation Intel® Core™
22nm process
Quad core die, with Intel HD Graphics 4000
1.4 Billion transistors
Die size: 160 mm2
The Uncore Subsystem
The SoC design provides a high bandwidth bi-directional ring bus
– Connects the IA cores and the various uncore sub-systems
The uncore subsystem includes
– A system agent
– The graphics unit (GT)
– The last level cache (LLC)
In the Intel® Xeon® Processor E5 Family
– There is no graphics unit (GT)
– Instead it contains many more components:
An LLC with larger capacity and snooping capabilities to support multiple processors
Intel® QuickPath Interconnect interfaces that can support multi-socket platforms
Power management control hardware
A system agent capable of supporting high bandwidth traffic from memory and I/O devices
From the Optimization Manual
Scalable Ring On-die Interconnect
• Ring-based interconnect between Cores, Graphics,
Last Level Cache (LLC) and System Agent domain
• Composed of 4 rings
– 32 Byte Data ring, Request ring,
Acknowledge ring and Snoop ring
– Fully pipelined at core frequency/voltage:
bandwidth, latency and power scale with cores
• Massive ring wire routing runs over the LLC
with no area impact
• Access on the ring always picks the shortest path – minimizing latency
• Distributed arbitration, ring protocol handles
coherency, ordering, and core interface
• Scalable to servers with a large number of processors
High Bandwidth, Low Latency, Modular
Foil taken from IDF 2011
Last Level Cache – LLC
The LLC consists of multiple cache slices
– The number of slices is equal to the number of IA cores
– Each slice contains a full cache port that can supply 32 bytes/cycle
Each slice has logic portion + data array portion
– The logic portion handles
Data coherency
Memory ordering
Access to the data array portion
LLC misses and write-back to memory
– The data array portion stores cache lines
May have 4/8/12/16 ways
Corresponding to 0.5M/1M/1.5M/2M block size
The GT sits on the same ring interconnect
– Uses the LLC for its data operations as well
– May in some cases compete with the cores for LLC capacity
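As a quick check of the slice sizes above: a 16-way slice with 64-byte lines and 2048 sets holds 16 × 2048 × 64 B = 2 MB, so a quad-core die with four such slices provides an 8 MB LLC (the 2048-set figure is implied by these numbers, not stated explicitly).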
From the Optimization Manual
Cache Box
• Interface block
– Between Core/Graphics/Media and the Ring
– Between Cache controller and the Ring
– Implements the ring logic, arbitration, cache controller
– Communicates with System Agent for LLC misses, external snoops, non-cacheable accesses
• Full cache pipeline in each cache box
– Physical Addresses are hashed at the source
to prevent hot spots and increase bandwidth
– Maintains coherency and ordering for the
addresses that are mapped to it
– LLC is fully inclusive with “Core Valid Bits” –
eliminates unnecessary snoops to cores
– Per core CVB indicates if core needs to be
snooped for a given cache line
• Runs at core voltage/frequency, scales with Cores
Distributed coherency & ordering;
Scalable Bandwidth, Latency & Power
Foil taken from IDF 2011
Ring Interconnect and LLC
The physical addresses of data kept in the LLC are distributed among the cache slices by a hash function (see the sketch at the end of this slide)
– Addresses are uniformly distributed
– From the cores' and the GT's view, the LLC acts as one shared cache
With multiple ports and bandwidth that scales with the number of cores
– The number of cache slices increases with the number of cores
The ring and LLC are not likely to be a BW limiter to core operation
– From the SW point of view, this does not appear as a normal N-way cache
– The LLC hit latency, ranging between 26 and 31 cycles, depends on the core location relative to the LLC block (how far the request needs to travel on the ring)
All the traffic that cannot be satisfied by the LLC still travels through the cache-slice logic portion and the ring to the system agent
– E.g., LLC misses, dirty line writeback, non-cacheable operations, and
MMIO/IO operations
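Intel does not document the actual slice-selection hash; the following is a minimal illustrative sketch (a hypothetical XOR-fold of the physical line address, not the real function) of how line addresses could be spread roughly uniformly across the slices:

#include <stdint.h>

/* Hypothetical slice-selection hash: XOR-fold the upper bits of the
 * physical line address so that nearby lines spread across slices.
 * This is NOT Intel's (undocumented) hash, only an illustration. */
static unsigned llc_slice(uint64_t phys_addr, unsigned num_slices)
{
    uint64_t line = phys_addr >> 6;      /* 64-byte cache lines */
    uint64_t h = line;
    h ^= h >> 12;                        /* fold in higher address bits */
    h ^= h >> 24;
    return (unsigned)(h % num_slices);   /* pick one of the slices */
}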
From the Optimization Manual
LLC Sharing
• LLC is shared among all Cores, Graphics and Media
– Graphics driver controls which streams are cached/coherent
– Any agent can access all data in the LLC, independent of who allocated the line, after memory range checks
• Controlled LLC way allocation mechanism prevents thrashing between Core/GFX
• Multiple coherency domains
– IA Domain (fully coherent via cross-snoops)
– Graphics domain (graphics virtual caches, flushed to IA domain by the graphics engine)
– Non-Coherent domain (display data, flushed to memory by the graphics engine)
Much higher Graphics performance, DRAM power savings, more DRAM BW available for Cores
Foil taken from IDF 2011
Cache Hierarchy
Level           Capacity  Ways    Line Size  Write Update  Inclusive  Latency   Bandwidth
                                  (bytes)    Policy                   (cycles)  (Byte/cyc)
L1 Data         32KB      8       64         Write-back    -          4         2 × 16
L1 Instruction  32KB      8       64         N/A           -          -         -
L2 (Unified)    256KB     8       64         Write-back    No         12        1 × 32
LLC             Varies    Varies  64         Write-back    Yes        26-31     1 × 32
The LLC is inclusive of all cache levels above it
– Data contained in the core caches must also reside in the LLC
– Each LLC cache line holds an indication of the cores that may have
this line in their L2 and L1 caches
Fetching data from LLC when another core has the data
– Clean hit – data is not modified in the other core – 43 cycles
– Dirty hit – data is modified in the other core – 60 cycles
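As a hedged illustration of how the Core Valid Bits cut down snoop traffic (the structure layout and field names below are assumptions for illustration, not the actual tag format), an LLC hit only snoops the cores whose bit is set:

#include <stdint.h>
#include <stdbool.h>

#define NUM_CORES 4

/* Hypothetical LLC line tag carrying per-core valid bits (CVB). */
struct llc_line {
    uint64_t tag;
    bool     dirty;
    uint8_t  core_valid;   /* bit i set => core i may hold the line in L1/L2 */
};

/* Snoop only the cores flagged in the CVB; if no bit is set,
 * the line can be returned with no core snoop at all. */
static void snoop_on_hit(const struct llc_line *line,
                         void (*snoop_core)(unsigned core))
{
    for (unsigned c = 0; c < NUM_CORES; c++)
        if (line->core_valid & (1u << c))
            snoop_core(c);   /* cross-snoop: ~43 cycles clean, ~60 dirty */
}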
From the Optimization Manual
Data Prefetch to L2$ and LLC
Two HW prefetchers fetch data from memory to L2$ and LLC
– Streamer and spatial prefetcher prefetch the data to the LLC
– Typically data is brought also to the L2
Unless the L2 cache is heavily loaded with missing demand requests.
Spatial Prefetcher
– Strives to complete every cache line fetched to the L2 cache with the
pair line that completes it to a 128-byte aligned chunk
Streamer Prefetcher
– Monitors read requests from the L1 caches for ascending and
descending sequences of addresses
L1 D$ requests: loads, stores, and L1 D$ HW prefetcher
L1 I$ code fetch requests
– When a forward or backward stream of requests is detected
The anticipated cache lines are prefetched
Prefetched cache lines must be in the same 4K page
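A minimal sketch of the two address calculations described above (the helper names are illustrative; the 64-byte line and 4K page sizes are from the text):

#include <stdint.h>

#define LINE 64u
#define PAGE 4096u

/* Spatial prefetcher: the pair line is the other 64-byte half of the
 * 128-byte aligned chunk, i.e. the line address with bit 6 flipped. */
static uint64_t pair_line(uint64_t addr)
{
    uint64_t line_base = addr & ~(uint64_t)(LINE - 1);  /* 64 B aligned */
    return line_base ^ LINE;
}

/* Streamer (sketch): the next candidate line of a detected forward stream;
 * returns 0 when the prefetch would cross the 4K page boundary. */
static uint64_t next_stream_line(uint64_t last_line_addr)
{
    uint64_t next = last_line_addr + LINE;
    return (next / PAGE == last_line_addr / PAGE) ? next : 0;
}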
From the Optimization Manual
Data Prefetch to L2$ and LLC
Streamer Prefetcher Enhancement
– The streamer may issue two prefetch requests on every L2 lookup
Runs up to 20 lines ahead of the load request
– Adjusts dynamically to the number of outstanding requests per core
Not many outstanding requests: prefetch further ahead
Many outstanding requests: prefetch to LLC only, and less far ahead
– When cache lines are far ahead
Prefetch to LLC only and not to the L2$
Avoids replacement of useful cache lines in the L2$
– Detects and maintains up to 32 streams of data accesses
For each 4K byte page, can maintain one forward and one backward
stream
From the Optimization Manual
Lean and Mean System Agent
• Contains PCI Express*, DMI, Memory Controller, Display Engine…
• Contains Power Control Unit
– Programmable uController, handles all power management and reset functions in the chip
• Smart integration with the ring
– Provides Cores/Graphics/Media with high BW, low latency to DRAM/IO for best performance
– Handles IO-to-cache coherency
• Separate voltage and frequency from ring/cores, Display integration for better battery life
• Extensive power and thermal management for PCI Express* and DDR
Smart I/O Integration
Foil taken from IDF 2011
The System Agent
The system agent contains the following components
– An arbiter that handles all accesses from the ring domain and from I/O
(PCIe* and DMI) and routes the accesses to the right place
– PCIe controllers connect to external PCIe devices
Support different configurations: x16+x4, x8+x8+x4, x8+x4+x4+x4
– DMI controller connects to the PCH chipset
– Integrated display engine, Flexible Display Interconnect, and Display
Port, for the internal graphic operations
– Memory controller
All main memory traffic is routed from the arbiter to the
memory controller
– The memory controller supports two channels of DDR
Data rates of 1066MHz, 1333MHz and 1600MHz
8 bytes per cycle
– Addresses are distributed between memory channels based on a local
hash function that attempts to balance the load between the channels in
order to achieve maximum bandwidth and minimum hotspot collisions
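A quick peak-bandwidth check from the numbers above: at a 1600MHz data rate with 8 bytes per transfer, each channel delivers 1600M × 8 B = 12.8 GByte/sec, or 25.6 GByte/sec for the two channels combined.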
From the Optimization Manual
The Memory Controller
For best performance
– Populate both channels with equal amounts of memory
Preferably the exact same types of DIMMs
– Using more ranks for the same amount of memory results in somewhat better memory bandwidth
Since more DRAM pages can be open simultaneously
– Use highest supported speed DRAM, with the best DRAM timings
The two memory channels have separate resources
– Handle memory requests independently
– Each memory channel contains a 32 cache-line write-data-buffer
The memory controller contains a high-performance out-of-order scheduler
– Attempts to maximize memory bandwidth while minimizing latency
– Writes to the memory controller are considered completed when
they are written to the write-data-buffer
– The write-data-buffer is flushed out to main memory at a later time,
not impacting write latency
From the Optimization Manual
The Memory Controller
Partial writes are not handled efficiently on the memory
controller
– May result in read-modify-write operations on the DDR channel
if the partial-writes do not complete a full cache-line in time
– Software should avoid creating partial write transactions whenever possible and consider alternatives, such as buffering the partial writes into full cache-line writes (see the sketch at the end of this slide)
The memory controller also supports high-priority
isochronous requests
– E.g., USB isochronous, and Display isochronous requests
High bandwidth of memory requests from the integrated
display engine takes up some of the memory bandwidth
– Impacts core access latency to some degree
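A minimal sketch of the software-side buffering idea mentioned above (the buffer structure and helper are illustrative assumptions, not a specific API):

#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

/* Accumulate small updates into a full 64-byte line image so the memory
 * controller sees one full-line write instead of several partial writes
 * (which may trigger read-modify-write on the DDR channel). */
struct line_buf {
    uint8_t data[CACHE_LINE];
    size_t  filled;                      /* bytes accumulated so far */
};

/* Returns 1 once the line image is complete and ready to be stored
 * to its destination with a single full-line write. */
static int append_partial(struct line_buf *b, const void *src, size_t len)
{
    if (len > CACHE_LINE - b->filled)
        len = CACHE_LINE - b->filled;    /* stay within one line */
    memcpy(&b->data[b->filled], src, len);
    b->filled += len;
    return b->filled == CACHE_LINE;
}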
From the Optimization Manual
Integration: Optimization Opportunities
• Dynamically redistribute power between Cores & Graphics
• Tight power management control of all components,
providing better granularity and deeper idle/sleep states
• Three separate power/frequency domains:
System Agent (Fixed), Cores+Ring, Graphics (Variable)
• High BW Last Level Cache, shared among Cores and Graphics
– Significant performance boost, saves memory bandwidth and power
• Integrated Memory Controller and PCI Express ports
– Tightly integrated with Core/Graphics/LLC domain
– Provides low latency & low power – remove intermediate busses
• Bandwidth is balanced across the whole machine,
from Core/Graphics all the way to Memory Controller
• Modular uArch for optimal cost/power/performance
– Derivative products done with minimal effort/time
Foil taken from IDF 2011
DRAM
Basic DRAM chip
[Block diagram: the address bus feeds a Row Address Latch (strobed by RAS#) and a Column Address Latch (strobed by CAS#); the row address decoder selects a row of the memory array and the column address decoder selects the data column driven onto the Data pins.]
DRAM access sequence
– Put the Row on the address bus and assert RAS# (Row Addr. Strobe) to latch the Row
– Put the Column on the address bus and assert CAS# (Column Addr. Strobe) to latch the Column
– Get the data on the data bus
DRAM Operation
DRAM cell consists of transistor + capacitor
– Capacitor keeps the state;
Transistor guards access to the state
– Reading cell state:
raise the access line AL and sense the data line DL
If the capacitor is charged, a current flows on DL
– Writing cell state:
set DL and raise AL to charge/drain the capacitor
– Charging and draining a capacitor is not instantaneous
Leakage current drains the capacitor even when the transistor is closed
– DRAM cell periodically refreshed every 64ms
DRAM Access Sequence Timing
[Timing diagram: Row i is placed on A[0:7] and RAS# asserted; after tRCD (RAS-to-CAS delay) Col n is placed on A[0:7] and CAS# asserted; Data n appears after CL (CAS latency); tRP (row precharge) must pass before Row j can be opened.]
– Put the row address on the address bus and assert RAS#
– Wait for the RAS# to CAS# delay (tRCD) between asserting RAS# and CAS#
– Put the column address on the address bus and assert CAS#
– Wait for the CAS latency (CL) between the time CAS# is asserted and the data is ready
– Row precharge time (tRP): time to close the current row and open a new row
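For a rough worst-case random access (the current row must be closed and a new one opened), these delays simply add up. With the representative DDR-400 3-3-3 timings listed later in this deck (5 ns cycle time): tRP + tRCD + CL = (3 + 3 + 3) × 5 ns = 45 ns before the first data word arrives.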
DRAM controller
DRAM controller gets address and command
– Splits address to Row and Column
– Generates DRAM control signals at the proper timing
[Block diagram: an address decoder uses A[20:23] to generate the chip select; an address mux drives the memory address bus with A[10:19] (row) and A[0:9] (column); a time-delay generator sequences RAS#, CAS# and the mux select; D[0:7] and R/W# connect directly to the DRAM.]
DRAM data must be periodically refreshed
– DRAM controller performs DRAM refresh, using refresh counter
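A minimal sketch of the address split implied by the diagram above (the field boundaries follow the A[0:9] / A[10:19] / A[20:23] labels; the struct is illustrative, not a real controller interface):

#include <stdint.h>

/* Split a flat address into the fields a simple DRAM controller drives:
 * A[0:9] -> column, A[10:19] -> row, A[20:23] -> chip select. */
struct dram_addr {
    unsigned col;     /* 10 bits */
    unsigned row;     /* 10 bits */
    unsigned chip;    /*  4 bits */
};

static struct dram_addr split_addr(uint32_t addr)
{
    struct dram_addr d;
    d.col  = addr         & 0x3FF;    /* bits 0..9   */
    d.row  = (addr >> 10) & 0x3FF;    /* bits 10..19 */
    d.chip = (addr >> 20) & 0xF;      /* bits 20..23 */
    return d;
}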
Improved DRAM Schemes
Paged Mode DRAM
– Multiple accesses to different columns from same row
– Saves RAS and RAS to CAS delay
[Timing diagram: RAS# stays asserted with the Row latched; successive CAS# assertions with Col n, Col n+1, Col n+2 return Data n, Data n+1, Data n+2 without re-opening the row.]
Extended Data Output RAM (EDO RAM)
– A data output latch enables overlapping the next column address with the current column's data
[Timing diagram: with the output latch, Col n+1 is driven on A[0:7] while Data n is still being read out, so Data n, Data n+1, Data n+2 follow back-to-back.]
Improved DRAM Schemes (cont)
Burst DRAM
– Generates consecutive column address by itself
[Timing diagram: after a single CAS# with Col n, the DRAM returns Data n, Data n+1, Data n+2 on consecutive cycles, generating the column addresses internally.]
Synchronous DRAM – SDRAM
All signals are referenced to an external clock (100MHz-200MHz)
– Makes timing more precise with other system devices
4 banks – multiple pages open simultaneously (one per bank)
Command driven functionality instead of signal driven
– ACTIVE: selects both the bank and the row to be activated
ACTIVE to a new bank can be issued while accessing current bank
– READ/WRITE: select column
Burst oriented read and write accesses
– Successive column locations accessed in the given row
– Burst length is programmable: 1, 2, 4, 8, and full-page
May end full-page burst by BURST TERMINATE to get arbitrary burst length
A user programmable Mode Register
– CAS latency, burst length, burst type
Auto pre-charge: may close row at last read/write in burst
Auto refresh: internal counters generate refresh address
SDRAM Timing
[Timing diagram, BL = 1, CL = 2: ACT to Bank 0 Row i; after tRCD > 20ns, RD Col j and RD+PC Col k to Bank 0; ACT to Bank 1 Row m (tRRD > 20ns after the first ACT), then RD Col n; ACT to Bank 0 Row l only after tRC > 70ns from the first ACT, then RD Col q; Data j, k, n, q each appear CL = 2 cycles after the corresponding RD.]
tRCD: ACTIVE to READ/WRITE gap = tRCD(MIN) / clock period
tRC: successive ACTIVE to a different row in the same bank
tRRD: successive ACTIVE commands to different banks
DDR-SDRAM
2n-prefetch architecture
– DRAM cells are clocked at the same speed as SDR SDRAM cells
– Internal data bus is twice the width of the external data bus
– Data capture occurs twice per clock cycle
Lower half of the bus sampled at clock rise
Upper half of the bus sampled at clock fall
[Diagram: the SDRAM array delivers 2n bits (0:2n-1) internally; the lower half (0:n-1) and upper half (n:2n-1) are multiplexed onto the n-bit external bus, giving 400M transfers/sec from a 200MHz clock.]
Uses 2.5V (vs. 3.3V in SDRAM)
– Reduced power consumption
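Putting numbers on the 2n-prefetch: a 200MHz clock with a transfer on each edge gives 400M transfers/sec; with a 64-bit (8-byte) DIMM that is 400M × 8 B = 3.2 GByte/sec, matching the PC-3200 module rating in the DDR standards table below.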
DDR SDRAM Timing
[Timing diagram, 133MHz clock, CL = 2: ACT to Bank 0 Row i; after tRCD > 20ns, RD Col j; ACT to Bank 1 Row m (tRRD > 20ns later), then RD Col n; ACT to Bank 0 Row l only after tRC > 70ns; each RD returns a burst of four transfers (j, +1, +2, +3 and n, +1, +2, +3) using both clock edges.]
DIMMs
DIMM: Dual In-line Memory Module
– A small circuit board that holds memory chips
64-bit wide data path (72 bit with parity)
– Single sided: 9 chips, each with 8 bit data bus
– Dual sided: 18 chips, each with 4 bit data bus
– Data BW: 64 bits on each rising and falling edge of the clock
Other pins
– Address – 14, RAS, CAS, chip select – 4, VDC – 17, Gnd – 18,
clock – 4, serial address – 3, …
DDR Standards
DRAM timing, measured in I/O bus cycles, specifies 3 numbers
– CAS Latency – RAS to CAS Delay – RAS Precharge Time
CAS latency (latency to get data in an open page) in nsec
– CAS Latency × I/O bus cycle time
Standard  Mem clock  I/O bus clock  Cycle time  Data rate  VDDQ  Module   Peak transfer  Timings              CAS Latency
name      (MHz)      (MHz)          (ns)        (MT/s)     (V)   name     rate (MB/s)    (CL-tRCD-tRP)        (ns)
DDR-200   100        100            10          200        2.5   PC-1600  1600           -                    -
DDR-266   133⅓       133⅓           7.5         266⅔       2.5   PC-2100  2133⅓          -                    -
DDR-333   166⅔       166⅔           6           333⅓       2.5   PC-2700  2666⅔          -                    -
DDR-400   200        200            5           400        2.6   PC-3200  3200           2.5-3-3/3-3-3/3-4-4  12.5/15/15
Total BW for DDR400
– 3200M Byte/sec = 64 bit × 2 × 200MHz / 8 (bit/byte)
– 6400M Byte/sec for dual channel DDR SDRAM
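Applying the CAS-latency formula above to DDR-400 with CL = 3: 3 × 5 ns = 15 ns, which is the value shown in the table.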
DDR2
DDR2 doubles the bandwidth
– 4n pre-fetch: internally read/write 4× the amount of data as the external bus
– A DDR2-533 cell works at the same frequency as a DDR266 cell or a PC133 cell
– Prefetching increases latency
Smaller page size: 1KB vs. 2KB
– Reduces activation power – the ACTIVATE command reads all bits in the page
8 banks in 1Gb densities and above
– Increases random accesses
1.8V (vs 2.5V) operation voltage
– Significantly lower power
[Diagram: memory cell array feeding I/O buffers and the external data bus, repeated to compare prefetch widths.]
DDR2 Standards
Standard   Mem clock  Cycle time  I/O bus clock  Data rate  Module    Peak transfer  Timings                CAS Latency
name       (MHz)      (ns)        (MHz)          (MT/s)     name      rate (MB/s)    (CL-tRCD-tRP)          (ns)
DDR2-400   100        10          200            400        PC2-3200  3200           3-3-3 / 4-4-4          15 / 20
DDR2-533   133        7.5         266            533        PC2-4200  4266           3-3-3 / 4-4-4          11.25 / 15
DDR2-667   166        6           333            667        PC2-5300  5333           4-4-4 / 5-5-5          12 / 15
DDR2-800   200        5           400            800        PC2-6400  6400           4-4-4 / 5-5-5 / 6-6-6  10 / 12.5 / 15
DDR2-1066  266        3.75        533            1066       PC2-8500  8533           6-6-6 / 7-7-7          11.25 / 13.125
DDR3
30% power consumption reduction compared to DDR2
– 1.5V supply voltage, compared to DDR2's 1.8V
– 90 nanometer fabrication technology
Higher bandwidth
– 8 bit deep prefetch buffer (vs. 4 bit in DDR2 and 2 bit in DDR)
Transfer data rate
– Effective clock rate of 800–1600 MHz using both rising and falling
edges of a 400–800 MHz I/O clock
– DDR2: 400–800 MHz using a 200–400 MHz I/O clock
– DDR: 200–400 MHz based on a 100–200 MHz I/O clock
DDR3 DIMMs
– 240 pins, the same number as DDR2, and are the same size
– Electrically incompatible, and have a different key notch location
DDR3 Standards
Standard    Mem clock  I/O bus clock  Cycle time  Data rate  Module     Peak transfer  Timings                      CAS Latency
Name        (MHz)      (MHz)          (ns)        (MT/s)     name       rate (MB/s)    (CL-tRCD-tRP)                (ns)
DDR3-800    100        400            2.5         800        PC3-6400   6400           5-5-5 / 6-6-6                12 1⁄2 / 15
DDR3-1066   133⅓       533⅓           1.875       1066⅔      PC3-8500   8533⅓          6-6-6 / 7-7-7 / 8-8-8        11 1⁄4 / 13 1⁄8 / 15
DDR3-1333   166⅔       666⅔           1.5         1333⅓      PC3-10600  10666⅔         8-8-8 / 9-9-9                12 / 13 1⁄2
DDR3-1600   200        800            1.25        1600       PC3-12800  12800          9-9-9 / 10-10-10 / 11-11-11  11 1⁄4 / 12 1⁄2 / 13 3⁄4
DDR3-1866   233⅓       933⅓           1.07        1866⅔      PC3-14900  14933⅓         11-11-11 / 12-12-12          11 11⁄14 / 12 6⁄7
DDR3-2133   266⅔       1066⅔          0.9375      2133⅓      PC3-17000  17066⅔         12-12-12 / 13-13-13          11 1⁄4 / 12 3⁄16
DDR2 vs. DDR3 Performance
The high latency of DDR3 SDRAM has a negative effect on streaming operations
Source: xbitlabs
How to get the most of Memory ?
Single Channel DDR
[Diagram: the CPU and L2 Cache connect over the FSB (Front Side Bus) to the DRAM controller, which drives a single DDR memory bus with its DIMM.]
Dual channel DDR
– Each DIMM pair must be the same
[Diagram: the CPU and L2 Cache connect over the FSB to the DRAM controller, which drives two DDR channels (CH A and CH B), each with its own DIMM.]
Balance FSB and memory bandwidth
– 800MHz FSB provides 800MHz × 64bit / 8 = 6.4 G Byte/sec
– Dual Channel DDR400 SDRAM also provides 6.4 G Byte/sec
How to get the most of Memory ?
Each DIMM supports 4 open pages simultaneously
– The more open pages, the more random access
– It is better to have more DIMMs
n DIMMs: 4n open pages
DIMMs can be single sided or dual sided
– Dual sided DIMMs may have a separate CS for each side
The number of open pages is doubled (goes up to 8)
This is not a must – dual sided DIMMs may also have a common CS for both sides, in which case there are only 4 open pages, as with a single-sided DIMM
SRAM – Static RAM
True random access
High speed, low density, high power
No refresh
Address not multiplexed
DDR SRAM
– 2 READs or 2 WRITEs per clock
– Common or Separate I/O
– DDRII: 200MHz to 333MHz Operation; Density: 18/36/72Mb+
QDR SRAM
– Two separate DDR ports: one read and one write
– One DDR address bus: alternating between the read address and
the write address
– QDRII: 250MHz to 333MHz Operation; Density: 18/36/72Mb+
SRAM vs. DRAM
Random Access: access time is the same for all locations
               DRAM – Dynamic RAM           SRAM – Static RAM
Refresh        Refresh needed               No refresh needed
Address        Address muxed: row + column  Address not multiplexed
Access         Not true "Random Access"     True "Random Access"
Density        High (1 transistor/bit)      Low (6 transistors/bit)
Power          Low                          High
Speed          Slow                         Fast
Price/bit      Low                          High
Typical usage  Main memory                  Cache
Read Only Memory (ROM)
Random Access
Non volatile
ROM Types
– PROM – Programmable ROM
Burnt once using special equipment
– EPROM – Erasable PROM
Can be erased by exposure to UV, and then reprogrammed
– E2PROM – Electrically Erasable PROM
Can be erased and reprogrammed on board
Write time (programming) much longer than RAM
Limited number of writes (thousands)