ECE 485/585
Microprocessors
Chapter 1
Microprocessor Characterization:
CPU & Memory
Herbert G. Mayer, PSU
Status 11/30/2016
Parts gratefully taken with permission from Eric Krause @ PSU
1
Syllabus
 Introduction
 Microprocessor μP
 Latency and Bandwidth
 Memory Hierarchy
 Memory Types
 Memory Low-Level View
 CISC vs. RISC
 Bibliography
2
Introduction
 In lectures on Microprocessor System
Design, the focus is on microprocessors,
yet a system view is included
 System includes memory, bus, peripherals,
brief mention of power supply, especially
for an embedded μP
 This EE class goes beyond pure function,
and includes low-level EE discussions for
parts of the μP system
 Ideal outcome for you: to be able to design
your future employer’s microcontroller, or
ideally a full μP
3
Introduction
 Key modules of any microprocessor (μP) are:
1. Central Processing Unit, AKA CPU, includes ALU, Register
File, pc, ir, flags, and internal registers
2. Memory (AKA Main Memory), including stack portion
3. Caches; L1 and sometimes L2 integrated on same silicon
die, physically but not logically part of CPU
4. Data-, address-, and control buses connecting CPU,
peripherals, and memory; AKA System Bus
5. Peripherals and their controllers, connected via bus
6. IO devices and controller, connected to system bus
7. Branch Prediction unit; invisible to API
 Vast speed differences between CPU and memory
 Great speed disparity between various controllers
 Inherently slow are those accepting manual input
4
Introduction: Generic μP
Generic Microprocessor System with Memory, Controllers
5
Introduction: Generic μP
 Generic μP above shows coarse resolution of CPU
 Leaves out caches and branch prediction, necessary to
increase processing- and data access speed
 Lists program counter pc (AKA instruction pointer),
register file, transparent instruction register ir (current
instruction being executed)
 Views components other than CPU as “hanging off”
central system bus
 In reality, at times (for some Intel μP) multiple buses
are used, with varying speeds, width, functions, etc.
 Some buses are proprietary, in order to maximize
transmission speed (latency, throughput)
 Other buses for commodity peripherals (disks, thumb
drives, printers etc.) have standardized interfaces, at
times with much lower speeds, allowing easy exchange
6
Introduction: Abstract μP
Abstract Microprocessor System with Memory, Controllers
7
Introduction: Abstract μP
 Abstract μP above also exhibits coarse
resolution of complete system
 But highlights multiple buses for control,
addresses, and actual data transmitted
between memory and CPU
 Only the box “Input and Output” is even more
abstract than in earlier pictures
 There are many ways of depicting the same
idea, depending on what detail is omitted from
a more complete μP system
 See yet another model on a later page, where
memory is further partitioned into logical
subsections
8
Microprocessor μP
 What is essential about a microprocessor?
 Nothing really, compared to an old-fashioned
mainframe CPU, except:
1. Form-factor: Way smaller than main frame;
see Cray 1 below
2. Power consumption: way less power
3. Clock rate: way higher clock speed
4. General use: microprocessors found almost
everywhere, including cell phones, space
probes
9
Introduction: Another Abstract μP
10
Introduction: μP Characterization
 Typically a μP system consists of a single
chip CPU, plus memory, bus, and peripherals
 Embedded μP is part of a larger system,
controls that system, often inaccessible: e.g.
when used on interplanetary space probe
 Key part of laptop computer is μP
 Heart of a contemporary desktop is a μP,
possibly a dual or quad processor; the
processors are AKA cores
 Servers contain multiple μPs, each of which
may have multiple cores
 Mainframe computers, minicomputers, had
way larger form factor and higher need for
power, need for air-conditioned cooling
11
Introduction: μP Characterization
Not a μP: Cray 1 Supercomputer. Additional modules:
Farms of disk drives; Air conditioning; Power supply.
AKA: The Most Expensive Love Seat on Earth
12
Pure Microprocessor μP, no Memory
Connecting pins on this reverse side of processor
13
Microprocessor μP
 What is essential about microprocessor μP vs.
mainframe CPU? Continued from p. 9:
 Nothing really, compared to an old-fashioned
mainframe CPU, except:
5. Except reliability: small IC allows easier
shielding from radiation due to small footprint
6. And scalability: Switch 2, 4, . . . n together for
parallel processing in high-end servers
7. And use of less electric power for cooling
8. μPs allow creation of large farms of servers
for massive compute needs, e.g. Amazon,
Google
 Yet there are some true differences; see
section 2: “High Level View”
14
Pure Microprocessor μP, no Memory
Connection on same side as processor
15
Microprocessors in Servers
Server Farm w. 1000s of MP μProcessors
16
Latency & Bandwidth
17
Latency & Bandwidth
 A μP memory is characterized by metrics:
1. Latency: time between action & response
2. Bandwidth: units, e.g. bytes, transmitted per time
3. Capacity: total address range
 Unit of latency is time t
 Unit of bandwidth is number of data units per
time t; for example, bandwidth can be GB/s,
or gigabytes per second
 Capacity refers to address range, e.g. 32- or
64-bit range: 2^32 or 2^64 different units/bytes
18
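The capacity metric above follows directly from the number of address bits; a minimal Python sketch:

```python
# Capacity of an address space: n address bits select 2^n distinct
# addressable units (bytes, on a byte-addressable machine).
def capacity_units(address_bits: int) -> int:
    return 2 ** address_bits

print(capacity_units(32))           # 4294967296 bytes
print(capacity_units(32) // 2**30)  # 4 (GiB)
```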
Latency
 Latency is the time elapsed between issuing a
specific system request and receiving the response
 Def: memory latency is the time between 1.) issuing
a memory access –by executing a ld instruction–
and 2.) the time a next instruction can use the
loaded data
 Measured in units of seconds × 10^-3, 10^-6 or 10^-9
 Careful! For the latency of a memory access, the
second point is not necessarily the time at which
the access itself completes
 This distinction alludes to speculative execution,
discussed later
19
Latency
 Latency is critical, as μP generally stalls –ignoring
speculative execution for the moment– if memory
subsystem needs time to respond to a load or store
 High memory latency is undesired, as it increases
the time of program completion
 Low latency is attractive; shortens that time
 Within one selected technology, latency cannot
widely be improved by spending more resources
 If latency must be improved for a μP system,
generally some other memory access technology
should be selected; often way more expensive
20
Latency & Bandwidth
 Bandwidth characterizes data throughput of system
 For example, bandwidth of a memory subsystem is
the number of information units transmitted per step
 Often that unit is a byte: an addressable composite of
8 bits
 On some large mainframes, the unit of information is
a word, e.g. 60-bit or 64-bit words on Cyber systems,
but these are not microprocessors
 High-performance μP can overlap multiple memory
requests to increase bandwidth, without changing
latency for a single access
 Bandwidth can, up to some technology limit, be
improved by higher technology cost, i.e. spending
more $ for wider buses or other resources
21
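To make the bandwidth-vs.-latency point concrete, a small sketch; the bus width and transfer rate below are illustrative numbers, not any real part's specification:

```python
# Peak memory bandwidth = bytes per transfer * transfers per second.
# Bus width and rate here are illustrative, not a real part's spec.
def peak_bandwidth_gb_s(bus_width_bits: int, transfers_per_sec: float) -> float:
    return bus_width_bits / 8 * transfers_per_sec / 1e9

# 64-bit bus at 1.6 GT/s:
print(peak_bandwidth_gb_s(64, 1.6e9))   # 12.8 GB/s
# Doubling the bus width doubles bandwidth; latency is unchanged.
print(peak_bandwidth_gb_s(128, 1.6e9))  # 25.6 GB/s
```

This is exactly the sense in which bandwidth can be bought with more resources (a wider bus) while latency cannot.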
Memory
22
Memory Hierarchy
 Conventional to show memory in block diagrams as
one logical block
 In reality there are numerous memory types, each
with differing attributes, such as: speed, maximum
size, cost, lifetime etc.
 One of those types is: cache memory, with the best
speed attribute!
 Then why wouldn’t an architect design main memory
solely out of pure cache memory technology?
 Rhetorical question: total cost would be prohibitive,
yet it would indeed speed up average memory
access dramatically
 During the evolution of computer technology, the
discrepancy between processor speed and memory
access speed continues to grow worse!
23
Trend of Memory Speed
Chart (performance vs. time): the CPU curve (Intel® Pentium II:
out-of-order execution; Intel® Xeon™: Hyperthreading Technology;
~30% gains from instruction-level and thread-level parallelism)
rises faster than the DRAM curve; multilevel caches bridge the gap.
Processor CPU speed increases over time.
Memory access speed also increases over time,
but more slowly than CPU, hence the gap widens!
24
Ideal Memory
 Ideal memory has: unbounded capacity, to
store any data set and any program code
 Exploits: infinite bandwidth, to move any
number of data to/from the μP in no time
 Is: persistent, i.e. bits of information retain
their value between power cycles
 Exhibits: no latency, so that the μP never has
reason to stall
 Costs: Low cost, so that $ investment for
very large main memory does not dominate
μP system expense
25
Memory Hierarchy
Memory pyramid shows HW resources that hold data
sorted in decreasing order of speed, top to bottom:
1. Registers, internal to μP, small in number, except on
newer architectures such as Itanium  Fastest!
2. L1 cache, often on chip, few tens of kB
3. L2 cache, on chip; on newer μP hundreds of kB
4. L3 cache, common on servers, generally off-chip
5. Main memory, can be physically smaller than logical
address space; solved via VMM
6. SSD disk, known as solid state disk; AKA RAM disk,
is storage device w/o moving parts; so disk is not to
be interpreted literally!
7. Old fashioned disk, with rotating magnetic storage
8. Back-up tape or disk; slowest
26
Memory Hierarchy
27
Memory Hierarchy
 Various HW resources hold information to be
generated and processed by the ALU
 Ideally, such data are present in HW registers
 Desirable, since register to register arithmetic
operations often can be completed in a single cycle
 And on a superscalar μP sometimes multiple
instructions can be executed in a single cycle
 Alas! There are only a few registers available, thus the
actual data must also reside elsewhere
 Generally, that is main memory, or caches
 Until shortly before 64-bit computing, physical
memories were sufficiently cheap and large to render
virtualization superfluous!
28
Memory Hierarchy (from Wikipedia)
29
Intel® 80486 Memory Organization
30
Memory Attributes
Memory Characteristics by Technology:
Bandwidth data for Intel Haswell; see ref [7]
31
Memory Attributes
 Microprocessor memory AKA: memory, main
memory, primary memory, or main store
 Not to be confused with other devices storing data,
such as rotating disc drives, SSDs, magnetic tapes,
optical drives, punched cards in days of old, etc.
 Memory and thus its size is an inherent part of the
architecture: e.g. its total addressable space is
defined by number of address bits
 Number of addressable units on magnetic media is
generally not limited by address bits; e.g. data could
be accessed sequentially in sequential files with no
predefined upper bound
 Different technologies of creating computer
memories exist, various pros and cons
32
Memory Attributes
 Memory HW typically organized in banks, rows,
and columns
 Signal to initiate memory access is called strobe
 When data are needed, the memory controller selects
a bank based on the address, activates a row access
strobe (RAS) to ID the line of the data, followed by a
strobe for column access (CAS)
 One clock-cycle later the data are ID-ed
 Another clock later data can be sent
 Strobe takes multiple cycles, as strobe length is
dictated by memory technology, not by the clock
rate of CPU; can result in many CPU cycles
33
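The bank/row/column selection can be sketched as plain bit slicing; the field widths below are illustrative, a real controller sizes them to the DRAM geometry:

```python
# Sketch: how a memory controller might split a physical address into
# bank, row, and column fields. Field widths here are illustrative.
COL_BITS, ROW_BITS, BANK_BITS = 10, 14, 3

def split_address(addr: int):
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col  # bank select, then RAS row, then CAS column

# Address composed of bank 5, row 100, column 7:
print(split_address((5 << 24) | (100 << 10) | 7))  # (5, 100, 7)
```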
Memory Attributes
 Memory can be: read-writable, or read-only; latter
known as ROM
 Memory can be accessible sequentially only (e.g.
tape), or randomly by address; latter known as RAM
 Memory can be volatile or persistent even after
power is turned off; latter known as non-volatile
 Memory can be static or dynamic. Static RAM
retains information while power is applied, known
as SRAM
 Dynamic RAM needs periodic refresh while power is
applied (once every few tens of milliseconds)
known as DRAM
 Not all combinations of all technologies make sense
or are desirable!
34
Memory-Related Nomenclature
 Byte is sequence of 8 bits, addressable as one unit
 Unit gibibit AKA gibit is a certain multiple of bits; see
https://en.wikipedia.org/wiki/Gibibit
 1 gibibit = 2^30 bits = 1,073,741,824 bits = 1,024 mebibits
 1 gibibit ≈ 1.073 × 10^9 bits ≈ 1.073 gigabits
35
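The unit arithmetic on this slide checks out directly:

```python
# Binary vs. decimal bit units, as defined above.
GIBIBIT = 2 ** 30   # 1,073,741,824 bits
MEBIBIT = 2 ** 20
GIGABIT = 10 ** 9

print(GIBIBIT)             # 1073741824
print(GIBIBIT // MEBIBIT)  # 1024 mebibits per gibibit
print(GIBIBIT / GIGABIT)   # 1.073741824 gigabits
```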
Memory Types: DRAM
Dynamic RAM, DRAM:
 Stores bit as charge on 1 capacitor plus 1 transistor
 Cheap to build but leaks charge; compact vs. SRAM
 Sensitive to disturbance, such as light, rays, etc.
 Due to leakage, must be refreshed every few tens of
milliseconds during operation; refresh rate is quite
slow relative to CPU clock speed
 Used for main memory, due to low cost per bit
 Not persistent: volatile; info lost after power-down
 Packaged either as 168-pin dual inline memory
modules (DIMMs) in 64-bit chunks, or as 72-pin
SIMMs in 32-bit chunks
36
Memory Types: DRAM
1 transistor, 1 CAP DRAM cell for 1 bit memory
37
Memory Types: DRAM
Sample DRAM Photo
38
Memory Types: SRAM
Static RAM, SRAM:
 Expensive to build: 2 decimal orders of magnitude
more expensive than DRAM
 Uses 6 transistors per bit --or 4 transistors and
resistors R in MOS technology, but Rs are large
 Consumes way more silicon space than DRAM
 Fast access time, about 1/10th of DRAM access time
 Not sensitive to light and mild radiation
 Used for cache memory
 Not persistent: volatile; info lost after power-down
 Due to high cost, NOT all of main memory is built
from SRAM; and with caches there is no need to!
39
Memory Types: SRAM
6 Transistor CMOS SRAM cell for 1 bit cache
40
Memory Types: FPM DRAM
Fast Page Mode DRAM, FPM DRAM:
 Like DRAM, FPM DRAM loads a consecutive row of
bytes into its internal buffers
 After use of one part (e.g. using a single byte or
single word), DRAM discards the buffer
 If next memory access refers to the same area,
DRAM reloads the same buffer again; then accesses
 FPM DRAM skips consecutive loads, if they are
known to refer to same general area: speeding up
memory access
 In actual SW, multiple memory references to
close-by addresses are quite frequent!
 FPM DRAM benefits from this, as do data caches
41
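A toy model of this row-buffer behavior, assuming a hypothetical 1 KiB internal row buffer: sequential (close-by) accesses cost one row load, scattered accesses cost one load each:

```python
# Toy Fast Page Mode model: a reload is needed only when an access
# leaves the currently buffered row. Row size is illustrative.
ROW_SIZE = 1024  # bytes per internal row buffer (hypothetical)

def count_row_loads(addresses):
    loads, current_row = 0, None
    for addr in addresses:
        row = addr // ROW_SIZE
        if row != current_row:  # row miss: reload the buffer
            loads += 1
            current_row = row
    return loads

# 16 sequential byte accesses hit one row...
print(count_row_loads(list(range(16))))                 # 1
# ...while 16 accesses one row apart reload every time.
print(count_row_loads([i * ROW_SIZE for i in range(16)]))  # 16
```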
Memory Types: EDO DRAM
Extended Data Out DRAM, EDO DRAM:
 Works like FPM DRAM
 But in case of bursts --i.e. multiple memory
accesses in a row to successive addresses– the
CAS signal can be spaced more closely together
than in case of independent addresses
 And all the RAS signals (except for the first) in case
of a burst can be saved anyway
 Result: faster access for consecutive memory
addresses
 Note an analogous phenomenon in data cache use,
where it is referred to as spatial locality
42
Memory Types: SDRAM
Synchronous DRAM: SDRAM
 Memory technologies listed so far use timing signals
separate from the memory controller's clock: they
are asynchronous
 SDRAM reduces circuitry and thus cost by recycling:
using the rising edge of the already existing external
clock driving memory controller
 As a consequence of synchronicity, SDRAM produces
needed data faster than asynchronous memory
technologies
 Moreover, technology is used per memory bank: thus,
if multiple sequential accesses refer to different
banks, SDRAM accesses data in parallel, faster than
asynchronous technology
43
Memory Types: DDR SDRAM
Double Data-Rate Synch DRAM, DDR SDRAM:
 Is an SDRAM technology
 Instead of using purely the rising edge of the
external clock signal, with which it is synced, DDR
uses both edges
 AKA double-pumping!
 Result is doubling the memory access speed!
 DDR2 SDRAM doubles the data rate again by using
another, separate internal clock at a different ratio,
transferring data at each of the 4 edges
 DDR2 is not backward compatible with DDR
 Since 2008 DDR3 doubles this again; also not
compatible with other DDR technologies
44
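The double-pumping arithmetic, sketched with an illustrative 100 MHz external clock:

```python
# Transfers per second: SDR uses the rising clock edge only;
# DDR uses both edges ("double-pumping"). Clock value is illustrative.
def transfers_per_second(clock_hz: float, edges_per_cycle: int) -> float:
    return clock_hz * edges_per_cycle

clock = 100e6  # hypothetical 100 MHz external clock
print(transfers_per_second(clock, 1))  # SDR: 1.0e8 transfers/s
print(transfers_per_second(clock, 2))  # DDR: 2.0e8 transfers/s, doubled
```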
Memory Types: RDRAM
Rambus DRAM, RDRAM:
 Technology by Rambus Corp. in Sunnyvale:
https://www.rambus.com/corporate-overview/
 Developed early 2000s, as an improved type of
synchronous dynamic RAM
 Touted by Rambus to become THE sole technology
in high-bandwidth applications
 Expected to become standard PC memory, once
Intel agreed to adoption with its future chipsets
 Legal disputes with other manufacturers over
technology and ownership resulted in RDRAM not
being widely accepted after about 2003
45
Memory Types
Richard Crisp @ Rambus Key DRAM Developer:
46
Memory Low-Level View
(Section taken from Eric Krause PSU ECE Dept.)
47
Memory Terminology
 Latch (one FlipFlop) stores 1 bit
 Register stores a full machine word
 Memory devices store >> 1 words; on 32-bit architectures it is
feasible to have all of memory as real, physical memory
 General usage of reading memory:
 enable device/memory
 supply address of word on address lines
 addressed word arrives on data lines
Common Terms
 word size = number of bits of natural computing unit, e.g. integer
 Capacity = 2^(address bits)
 Bus = parallel lines connecting memory and μP
 Volatility = means: memory contents are lost on power-off
 ROM = Read Only Memory
 RAM = Random Access Memory
48
Memory Types
 Types of memory:
 MROM: Mask-programmed during manufacturing
 PROM: Programmed by user, by blowing fuses
 EPROM: Electrically programmable by user, erased by
exposing to ultraviolet light
 EEPROM: Electrically Erasable Programmable ROM; can
be programmed or erased by user, one byte at a time
 Flash EEPROM: A type of EEPROM programmable in
blocks rather than single bytes
 Synchronous Flash EEPROM: Synchronous version of
the above
 Memory attributes:
 All are non-volatile
 Writes can be slower than reads!
 Asynchronous (except for Synchronous Flash EEPROM)
49
Words of Memory
ROM: Read Only Memory
1. How many words can be
addressed? 2^(address bits) = 2^15
2. What is the width of these
words? 8 bits
3. How is it activated/turned on?
Specific signal
4. Why does it have 3-state
outputs? To allow multiple of the
same devices to be connected
50
ROM Timing
Pins are: power, clock, address, output, CE and OE
51
ROM Timing
 Assuming Address asserted when supplied to inputs
 There will be propagation delay before output appears
 Called Address Access Time tACC, i.e. time to wait for
valid data available on outputs after address applied
 Assuming it was already powered up and outputs
were enabled
 A key parameter when using memory
 If address is already present on address inputs when
CE# is asserted, it takes the chip-enable time tCE to
power on
 tOE is the time for output buffers to be turned on
52
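Data become valid only after the slowest of the three enabling paths; a sketch with illustrative (not data-sheet) timing values in nanoseconds:

```python
# Valid data appear after the slowest of: address path (tACC),
# chip-enable path (tCE), output-enable path (tOE).
# All delays below are illustrative values, not from a real data sheet.
def data_valid_ns(t_addr, t_ce, t_oe, t_acc=70, t_ce_delay=70, t_oe_delay=30):
    """t_addr, t_ce, t_oe: times (ns) at which address, CE#, OE# assert."""
    return max(t_addr + t_acc, t_ce + t_ce_delay, t_oe + t_oe_delay)

# Address and CE# asserted at t=0, OE# at t=20 ns:
print(data_valid_ns(0, 0, 20))  # 70: the tACC and tCE paths dominate
```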
ROM Timing
 These parameters are major limiting factors for
the performance of a microprocessor and pose
important design considerations
 What is happening on the data lines before the
output is valid? See chevrons; bus is in
unknown state!
 Individual outputs may be low or high
 What do valid data look like? Or invalid data?
 How do we know if we clock in invalid data?
We generally don’t!
53
RAM Timing
 Note that ROMs are also randomly
addressable
 Random access memory (RAM) can be read,
written, is volatile
 Speed to read and write are generally equal
 SRAM: has very fast access times, uses 6
transistors/bit. Data are static while powered
 DRAM: Slower access, but only 1 transistor/bit
 DRAM data must be refreshed regularly while
powered
 SDRAM: Pipelined, synchronous DRAM
 DDR SDRAM: Double Data Rate SDRAM
54
RAM Timing
RAM: Random Access Memory,
here 12 address lines:
 How many words can be addressed? 4k
 What is the width of these words? 8 bits
 Is Asynchronous!
 Benefit of synchronous would be: here the
μP needs to keep the address applied the
whole time while waiting for data to arrive;
only then can the μP move to the next
address; the μP couldn't start generating
the next address until the current operation
was complete; wasted time, especially as
processor speeds increase
 A synchronous device can latch the
address at its inputs for the duration of the
HOLD time, which in modern devices can
be 0, and clock it in. Then the μP can do
something else, enabling parallel operations
 Pipelined SRAM can do this:
55
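The benefit can be quantified with a simple cycle count: for n accesses of latency L cycles, a non-pipelined device takes n * L cycles, while a pipelined one takes L + (n - 1):

```python
# Why pipelining helps: an asynchronous device serializes accesses,
# while a pipelined device accepts a new address every cycle
# while earlier accesses are still in flight.
def total_cycles(n_accesses: int, latency: int, pipelined: bool) -> int:
    if pipelined:
        return latency + (n_accesses - 1)  # one new result per cycle
    return n_accesses * latency            # each access fully serialized

print(total_cycles(100, 4, pipelined=False))  # 400 cycles
print(total_cycles(100, 4, pipelined=True))   # 103 cycles
```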
Pipelined SRAM
56
Pipelined SRAM
Actual data sheet for pipelined SRAM shows:
 input registers
 output registers
 address register
 enable register (control signals)
 control logic: see the gates on left
57
CISC vs. RISC
58
CISC vs. RISC Discussion
 Design approach of early computer
architects was to: allow all engineering
methods, exercise total freedom of
instruction design, and apply full creativity
for building CPUs
 Computers had many instruction types, such
as register-to-memory, memory-to-memory,
memory-to-register operations, etc.
 Instructions had various lengths; e.g. Intel
x86 instructions vary from 1 to 15 bytes
 Even opcode proper of instructions varied;
from 1 bit (iWarp C&A instruction) to multiple
bytes, (e.g. 9 byte NOOP on x86)
 This was called complex instruction set
computing, AKA CISC
59
CISC vs. RISC Discussion
 But computers were never fast enough!
 Mid 1980s, architects produced a new
architecture, with rigid design rules, severe
limitations, for the sake of faster clock
 David Patterson and Carlo Séquin at UC
Berkeley, and others at Stanford University,
postulated the new approach, with defined
restrictions, resulting in faster clock rate
 Referred to as reduced instruction set
computing, AKA RISC
 Term RISC coined by David Patterson
60
CISC vs. RISC Discussion
 For a while RISC seemed to be the clear
winner in the race for speed
 Today one observes a resurgence of CISC
computing
 Yet internally, practically all architectures
practice RISC approach –internal design is
hidden from user, not visible in the
Instruction Set Architecture, AKA ISA
61
CISC
 Varying instruction length
 Pretty much any instruction can access
memory
 Opcodes consume from few bits up to
multiple bytes
 Rich variety of different instructions
 Generally slower clock speed
 Multiple –even many– cycles for most
instructions
62
RISC
 Uniform instruction length; e.g. 4 bytes on old 32-bit architectures
 Only load and store ops access memory
 Only load and store ops consume multiple cycles
 Uniform opcode length, uniform instruction length,
e.g. only 32 bit instructions on 32-bit architecture
 Limited variety of instructions; i.e. small number of
different opcodes
 Generally fast clock speed
 Generally 1 cycle per instruction execution
 Single-cycle goal forces unusual steps for some FP
ops; e.g. to break FP-divide or multiply into a short
sequence of equivalent but simpler operations
63
Bibliography
1. Shen, John Paul, and Mikko H. Lipasti: Modern
Processor Design, Fundamentals of Superscalar
Processors, McGraw Hill, ©2005, ISBN 10:
0070570647, ISBN 13: 9780070570641
2. http://forums.amd.com/forum/messageview.cfm?cati
d=11&threadid=29382&enterthread=y
3. http://www.ece.umd.edu/~blj/papers/hpca2006.pdf
4. Kilburn, T., et al: “One-level storage system,” IRE
Transactions, EC-11, 2, 1962, pp. 223-235
5. RISC versus CISC:
https://en.wikipedia.org/wiki/Reduced_instruction_s
et_computing
6. RISC:
https://cs.stanford.edu/people/eroberts/courses/soc
o/projects/risc/whatis/index.html
7. https://software.intel.com/en-us/forums/intel64
moderncode-for-parallel-architectures/topic/608964