Transcript Memory
ECE 485/585
Microprocessors
Chapter 1
Microprocessor Characterization:
CPU & Memory
Herbert G. Mayer, PSU
Status 11/30/2016
Parts gratefully taken with permission from Eric Krause @ PSU
1
Syllabus
Introduction
Microprocessor μP
Latency and Bandwidth
Memory Hierarchy
Memory Types
Memory Low-Level View
CISC vs. RISC
Bibliography
2
Introduction
In lectures on Microprocessor System
Design, the focus is on microprocessors,
yet a system view is included
The system includes memory, bus, peripherals, and a
brief mention of the power supply, especially
for an embedded μP
This EE class goes beyond pure function,
and includes low-level EE discussions for
parts of the μP system
Ideal outcome for you: to be able to design
your future employer’s microcontroller, or
ideally a full μP
3
Introduction
Key modules of any microprocessor (μP) are:
1. Central Processing Unit, AKA CPU, includes ALU, Register
File, pc, ir, flags, and internal registers
2. Memory (AKA Main Memory), including stack portion
3. Caches; L1 and sometimes L2 integrated on same silicon
die, physically but not logically part of CPU
4. Data-, address-, and control buses connecting CPU,
peripherals, and memory; AKA System Bus
5. Peripherals and their controllers, connected via bus
6. IO devices and controller, connected to system bus
7. Branch Prediction unit; invisible to API
Vast speed differences exist between CPU and memory
Great speed disparity exists between various controllers
Controllers accepting manual input are inherently slow
4
Introduction: Generic μP
Generic Microprocessor System with Memory, Controllers
5
Introduction: Generic μP
Generic μP above shows coarse resolution of CPU
Leaves out caches and branch prediction, which are necessary to
increase processing and data-access speed
Lists program counter pc (AKA instruction pointer),
register file, transparent instruction register ir (current
instruction being executed)
Views components other than CPU as “hanging off”
central system bus
In reality, at times (for some Intel μPs) multiple buses
are used, with varying speeds, widths, functions, etc.
Some buses are proprietary, in order to maximize
transmission speed (latency, throughput)
Other buses for commodity peripherals (disks, thumb
drives, printers etc.) have standardized interfaces, at
times with much lower speeds, allowing easy exchange
6
Introduction: Abstract μP
Abstract Microprocessor System with Memory, Controllers
7
Introduction: Abstract μP
Abstract μP above also exhibits coarse
resolution of complete system
But highlights multiple buses for control,
addresses, and actual data transmitted
between memory and CPU
Only the box “Input and Output” is even more
abstract than in earlier pictures
There are many ways of depicting the same
idea, depending on what detail is omitted from
a more complete μP system
See yet another model on a later page, where
memory is further partitioned into logical
subsections
8
Microprocessor μP
What is essential about a microprocessor?
Nothing really, compared to an old-fashioned
mainframe CPU, except:
1. Form-factor: way smaller than a mainframe;
see Cray 1 below
2. Power consumption: way less power
3. Clock rate: way higher clock speed
4. General use: microprocessors found almost
everywhere, including cell phones, space
probes
9
Introduction: Another Abstract μP
10
Introduction: μP Characterization
Typically a μP system consists of a single
chip CPU, plus memory, bus, and peripherals
Embedded μP is part of a larger system,
controls that system, often inaccessible: e.g.
when used on interplanetary space probe
Key part of laptop computer is μP
Heart of a contemporary desktop is a μP, may contain
dual or quad processors, each AKA a core
Servers contain multiple μPs, each of which
may have multiple cores
Mainframe computers and minicomputers had a
way larger form factor, a higher need for
power, and a need for air-conditioned cooling
11
Introduction: μP Characterization
Not a μP: Cray 1 Supercomputer. Additional modules:
Farms of disk drives; Air conditioning; Power supply.
AKA: The Most Expensive Love Seat on Earth
12
Pure Microprocessor μP, no Memory
Connecting pins on this reverse side of processor
13
Microprocessor μP
What is essential about microprocessor μP vs.
mainframe CPU? Continued from p. 9:
Nothing really, compared to an old-fashioned
mainframe CPU, except:
5. Except reliability: small IC allows easier
shielding from radiation due to small footprint
6. And scalability: switch 2, 4, …, n together for
parallel processing in high-end servers
7. And use of less electric power for cooling
8. μPs allow creation of large farms of servers
for massive compute needs, e.g. Amazon,
Google
Yet there are some true differences; see
section 2: “High-Level View”
14
Pure Microprocessor μP, no Memory
Connection on same side as processor
15
Microprocessors in Servers
Server Farm with 1000s of MP μProcessors
16
Latency & Bandwidth
17
Latency & Bandwidth
A μP memory is characterized by metrics:
1. Latency: time between action & response
2. Bandwidth: units (e.g. bytes) transmitted per time
3. Capacity: total address range
Unit of latency is time t
Unit of bandwidth is number of data units per
time t; for example, bandwidth can be GB/s,
or gigabytes per second
Capacity refers to address range, e.g. 32- or
64-bit range: 2^32 or 2^64 different units/bytes
18
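As a side illustration, not from the slides: the capacity relation above in a few lines of C. The address widths in the loop are arbitrary examples (2^64 itself cannot be represented in a 64-bit count).

```c
#include <stdio.h>
#include <stdint.h>

/* Capacity = 2^(address bits); the widths below are illustrative examples. */
int main(void) {
    for (int bits = 16; bits <= 48; bits += 16) {
        uint64_t capacity = (uint64_t)1 << bits;   /* 2^bits addressable units */
        printf("%2d address bits -> %llu addressable bytes\n",
               bits, (unsigned long long)capacity);
    }
    return 0;
}
```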
Latency
Latency is the time elapsed between issuing a
specific system request and receiving the response
Def: memory latency is the time between (1) issuing
a memory access – by executing a ld instruction –
and (2) the time a next instruction can use the
loaded data
Measured in units of seconds × 10^-3, 10^-6 or 10^-9
Careful! The second point is not necessarily the
time at which such an access completes
This distinction alludes to speculative execution,
discussed later
19
Latency
Latency is critical, as μP generally stalls –ignoring
speculative execution for the moment– if memory
subsystem needs time to respond to a load or store
High memory latency is undesired, as it increases
the time of program completion
Low latency is attractive; shortens that time
Within one selected technology, latency cannot
be improved much by spending more resources
If latency must be improved for a μP system,
generally some other memory access technology
should be selected; often way more expensive
20
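A hedged sketch of measuring latency, not part of the slides: a pointer-chasing loop in C where every load depends on the previous one, so the μP cannot overlap requests and the time per hop approximates access latency rather than bandwidth. Buffer size, stride, and hop count are arbitrary assumptions; a randomly permuted chain would defeat the prefetcher even better.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (1 << 22)   /* pointer-sized slots, ~32 MB total (assumption) */
#define HOPS (1 << 24)   /* number of dependent loads to time (assumption) */

int main(void) {
    size_t *chain = malloc(N * sizeof *chain);
    if (!chain) return 1;

    /* Build a chain that strides through the buffer; 4099 is prime, hence
       coprime to N, so the chase visits every slot before repeating. */
    for (size_t i = 0; i < N; i++)
        chain[i] = (i + 4099) % N;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (long i = 0; i < HOPS; i++)
        idx = chain[idx];                 /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (idx=%zu)\n", ns / HOPS, idx);
    free(chain);
    return 0;
}
```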
Latency & Bandwidth
Bandwidth characterizes data throughput of system
For example, the bandwidth of a memory subsystem is
the number of information units transmitted per unit of time
Often that unit is a byte: an addressable composite of
8 bits
On some large mainframes, the unit of information is
word, e.g. 60-bit or 64-bit words on Cyber systems,
but these are not microprocessors
High-performance μP can overlap multiple memory
requests to increase bandwidth, without changing
latency for a single access
Bandwidth can, up to some technology limit, be
improved by higher technology cost, i.e. spending
more $ for wider buses or other resources
21
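A matching bandwidth sketch, again an illustration under assumptions rather than a benchmark: stream a large block with memcpy and divide the bytes moved by the elapsed time. Real results depend heavily on caches, the compiler, and the platform.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t bytes = (size_t)256 << 20;    /* 256 MB per pass (assumption) */
    int passes = 8;
    char *src = malloc(bytes), *dst = malloc(bytes);
    if (!src || !dst) return 1;
    memset(src, 1, bytes);               /* touch the pages before timing */
    memset(dst, 0, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < passes; p++)
        memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gb = (double)bytes * passes * 2 / 1e9;   /* read + write traffic */
    printf("~%.1f GB/s effective copy bandwidth\n", gb / s);
    free(src); free(dst);
    return 0;
}
```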
Memory
22
Memory Hierarchy
Conventional to show memory in block diagrams as
one logical block
In reality there are numerous memory types, each
with differing attributes, such as: speed, maximum
size, cost, lifetime etc.
One of those types is: cache memory, with the best
speed attribute!
Then why wouldn’t an architect design main memory
solely out of pure cache memory technology?
Rhetorical question: total cost would be prohibitive,
yet it would indeed speed up average memory
access dramatically
During the evolution of computer technology, the
discrepancy between processor speed and memory
access speed continues to grow worse!
23
Trend of Memory Speed
[Chart: performance over time, CPU vs. DRAM access speed; the curves diverge. Annotations: Intel® Pentium II Processor – out-of-order execution; multilevel caches; Intel® Xeon™ Processor – Hyper-Threading Technology; ~30%; instruction-level and thread-level parallelism.]
Processor (CPU) speed increases over time.
Memory access speed also increases over time,
but more slowly than CPU, hence the gap widens!
24
Ideal Memory
Ideal memory has: unbounded capacity, to
store any data set and any program code
Exploits: infinite bandwidth, to move any
number of data to/from the μP in no time
Is: persistent, i.e. bits of information retain
their value between power cycles
Exhibits: no latency, so that the μP never has
reason to stall
Costs: Low cost, so that $ investment for
very large main memory does not dominate
μP system expense
25
Memory Hierarchy
Memory pyramid shows HW resources that hold data
sorted in decreasing order of speed, top to bottom:
1. Registers, internal to μP, small in number, except on
newer architectures such as Itanium; fastest!
2. L1 cache, often on chip, few tens of kB
3. L2 cache, on chip; on newer μP hundreds of kB
4. L3 cache, common on servers, generally off-chip
5. Main memory, can be physically smaller than logical
address space; solved via VMM
6. SSD, known as solid-state disk, AKA RAM disk,
is a storage device w/o moving parts; so “disk” is not to
be interpreted literally!
7. Old fashioned disk, with rotating magnetic storage
8. Back-up tape or disk; slowest
26
Memory Hierarchy
27
Memory Hierarchy
Various HW resources hold information to be
generated and processed by the ALU
Ideally, such data are present in HW registers
Desirable, since register to register arithmetic
operations often can be completed in a single cycle
And on a superscalar μP sometimes multiple
instructions can be executed in a single cycle
Alas! There are only a few registers available, thus the
actual data must also reside elsewhere
Generally, that is main memory, or caches
Until shortly before 64-bit computing, physical
memories were sufficiently cheap and large to render
virtualization superfluous!
28
Memory Hierarchy (from Wikipedia)
29
Intel® 80486 Memory Organization
30
Memory Attributes
Memory Characteristics by Technology:
Bandwidth data for Intel Haswell; see ref [7]
31
Memory Attributes
Microprocessor memory AKA: memory, main
memory, primary memory, or main store
Not to be confused with other devices storing data,
such as rotating disc drives, SSDs, magnetic tapes,
optical drives, punched cards in days of old, etc.
Memory and thus its size is an inherent part of the
architecture: e.g. its total addressable space is
defined by the number of address bits
The number of addressable units on magnetic media is
generally not limited by address bits; e.g. data can be
accessed sequentially in files with no
predefined upper bound
Different technologies of creating computer
memories exist, various pros and cons
32
Memory Attributes
Memory HW typically organized in banks, rows,
and columns
Signal to initiate memory access is called strobe
When data are needed, the memory controller selects a
bank based on the address, activates a row access
strobe (RAS) to identify the line holding the data,
followed by a strobe for column access (CAS)
One clock cycle later the data are identified
Another clock cycle later the data can be sent
A strobe takes multiple cycles, as strobe length is
dictated by memory technology, not by the clock
rate of the CPU; one access can cost many CPU cycles
33
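A small C sketch of the bank/row/column split just described. The field widths and their ordering are hypothetical assumptions; every real memory controller defines its own address map.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical DRAM address map: low bits select the column, then the bank,
   then the row. The controller asserts RAS with the row, then CAS with the
   column. Field widths are assumptions, not a real controller's layout. */
#define COL_BITS  10
#define BANK_BITS  3
#define ROW_BITS  14

int main(void) {
    uint32_t addr = 0x01A2F3C4u;                             /* example address */
    uint32_t col  =  addr                            & ((1u << COL_BITS)  - 1);
    uint32_t bank = (addr >> COL_BITS)               & ((1u << BANK_BITS) - 1);
    uint32_t row  = (addr >> (COL_BITS + BANK_BITS)) & ((1u << ROW_BITS)  - 1);
    printf("addr 0x%08X -> bank %u, row %u (RAS), column %u (CAS)\n",
           addr, bank, row, col);
    return 0;
}
```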
Memory Attributes
Memory can be: read-writable, or read-only; latter
known as ROM
Memory can be accessible sequentially only (e.g.
tape), or randomly by address; latter known as RAM
Memory can be volatile or persistent even after
power is turned off; latter known as non-volatile
Memory can be static or dynamic. Static RAM
retains information while power is applied, known
as SRAM
Dynamic RAM needs periodic refresh while power is
applied (once every few tens of milliseconds)
known as DRAM
Not all combinations of all technologies make sense
or are desirable!
34
Memory-Related Nomenclature
A byte is a sequence of 8 bits, addressable as one unit
The unit gibibit, AKA gibit, is a certain multiple of bits; see
https://en.wikipedia.org/wiki/Gibibit
1 gibibit = 2^30 bits = 1,073,741,824 bits = 1,024 mebibits
1 gibibit ≈ 1.073 × 10^9 bits ≈ 1.073 gigabits
35
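The same conversions, checked in a few lines of C (purely illustrative):

```c
#include <stdio.h>
#include <stdint.h>

/* Confirms the gibibit conversions quoted above. */
int main(void) {
    uint64_t gibibit = (uint64_t)1 << 30;              /* 2^30 bits */
    printf("1 gibibit = %llu bits = %llu mebibits = %.3f gigabits\n",
           (unsigned long long)gibibit,
           (unsigned long long)(gibibit >> 20),        /* divide by 2^20 */
           gibibit / 1e9);
    return 0;
}
```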
Memory Types: DRAM
Dynamic RAM, DRAM:
Stores a bit as charge on 1 capacitor, plus 1 transistor per cell
Cheap to build but leaks charge; compact vs. SRAM
Sensitive to disturbance, such as light, rays, etc.
Due to leakage, must be refreshed every few tens of
milliseconds during operation; refresh rate is quite
slow relative to CPU clock speed
Used for main memory, due to low cost per bit
Not persistent: volatile; info lost after power-down
Packaged either as 168-pin dual inline memory modules (DIMMs)
in 64-bit chunks; or as 72-pin SIMMs in 32-bit chunks
36
Memory Types: DRAM
1-transistor, 1-capacitor DRAM cell for 1 bit of memory
37
Memory Types: DRAM
Sample DRAM Photo
38
Memory Types: SRAM
Static RAM, SRAM:
Expensive to build: 2 decimal orders of magnitude
more expensive than DRAM
Uses 6 transistors per bit – or 4 transistors and
resistors R in MOS technology, but the Rs are large
Consumes way more silicon space than DRAM
Fast access time, about 1/10th of DRAM access time
Not sensitive to light and mild radiation
Used for cache memory
Not persistent: volatile; info lost after power-down
Due to high cost, NOT all of main memory is built
from SRAM; and with caches there is no need to
39
Memory Types: SRAM
6-transistor CMOS SRAM cell for 1 bit of cache
40
Memory Types: FPM DRAM
Fast Page Mode DRAM, FPM DRAM:
Like plain DRAM, FPM DRAM loads a consecutive row of
bytes into its internal buffers
After use of one part (e.g. a single byte or
single word), plain DRAM discards the buffer
If the next memory access refers to the same area,
plain DRAM reloads the same buffer again; only then accesses it
FPM DRAM skips these redundant reloads, if accesses are
known to refer to the same general area, speeding up
memory access
In actual SW, multiple memory references to close-by addresses are quite frequent!
FPM DRAM benefits from this, as do data caches
41
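A hedged illustration of that spatial locality, not taken from the slides: summing a matrix row by row touches consecutive addresses, which FPM DRAM and data caches reward, while summing column by column jumps a full row between accesses. The matrix size is an arbitrary assumption.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define N 2048                     /* ~32 MB matrix, larger than typical caches */
static double a[N][N];

static double sum_rows(void) {     /* consecutive addresses: good locality */
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

static double sum_cols(void) {     /* stride of N*8 bytes: poor locality */
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

static double seconds(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void) {
    double t0 = seconds();
    double r  = sum_rows();
    double t1 = seconds();
    double c  = sum_cols();
    double t2 = seconds();
    printf("row-major %.3f s, column-major %.3f s (sums %g %g)\n",
           t1 - t0, t2 - t1, r, c);
    return 0;
}
```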
Memory Types: EDO DRAM
Extended Data Out DRAM, EDO DRAM:
Works like FPM DRAM
But in case of bursts – i.e. multiple memory
accesses in a row to successive addresses – the
CAS signals can be spaced more closely together
than in the case of independent addresses
And all the RAS signals (except for the first) in case
of a burst can be saved anyway
Result: faster access for consecutive memory
addresses
Note an analogous phenomenon in data cache use,
where it is referred to as spatial locality
42
Memory Types: SDRAM
Synchronous DRAM: SDRAM
Memory technologies listed so far use timing signals that
are separate from the memory controller’s clock: asynchronous
SDRAM reduces circuitry and thus cost by recycling:
using the rising edge of the already existing external
clock driving the memory controller
As a consequence of synchronicity, SDRAM produces
needed data faster than asynchronous memory
technologies
Moreover, technology is used per memory bank: thus,
if multiple sequential accesses refer to different
banks, SDRAM accesses data in parallel, faster than
asynchronous technology
43
Memory Types: DDR SDRAM
Double Data-Rate Synch DRAM, DDR SDRAM:
Is an SDRAM technology
Instead of using purely the rising edge of the
external clock signal, with which it is synced, DDR
uses both edges
AKA double-pumping!
Result is doubling the memory access speed!
DDR2 SDRAM doubles the data rate again by using
another, separate internal clock at a different ratio,
transferring data at each of the 4 edges
DDR2 is not backward compatible with DDR
Since 2008 DDR3 doubles this again; also not
compatible with other DDR technologies
44
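A back-of-the-envelope sketch of what double-pumping buys; the 200 MHz bus clock and 64-bit module width below are illustrative assumptions, not figures from the slides.

```c
#include <stdio.h>

/* Peak transfer rate of a double-pumped (DDR) interface:
   transfers/s = 2 x bus clock, bytes/transfer = bus width / 8. */
int main(void) {
    double bus_clock_hz    = 200e6;                 /* assumed I/O bus clock */
    double transfers_per_s = 2.0 * bus_clock_hz;    /* both clock edges used */
    double bytes_per_xfer  = 64.0 / 8.0;            /* assumed 64-bit module */
    printf("peak ~ %.1f GB/s\n", transfers_per_s * bytes_per_xfer / 1e9);
    return 0;
}
```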
Memory Types: RDRAM
Rambus DRAM, RDRAM:
Technology by Rambus Corp. in Sunnyvale:
https://www.rambus.com/corporate-overview/
Developed in the early 2000s as an improved type of
synchronous dynamic RAM
Touted by Rambus to become THE sole technology
in high-bandwidth applications
Expected to become standard PC memory, once
Intel agreed to adopt it in its future chipsets
Legal disputes with other manufacturers about
technology and ownership resulted in RDRAM not
being widely accepted after about 2003
45
Memory Types
Richard Crisp @ Rambus, Key DRAM Developer:
46
Memory Low-Level View
(Section taken from Eric Krause PSU ECE Dept.)
47
Memory Terminology
Latch (one flip-flop) stores 1 bit
Register stores a full machine word
Memory devices store >> 1 word; on 32-bit architectures it is
feasible to have all of memory as real, physical memory
General usage of reading memory:
enable device/memory
supply address of word on address lines
addressed word arrives on data lines
Common Terms
word size = number of bits of natural computing unit, e.g. integer
Capacity = 2^(address bits)
Bus = parallel lines connecting memory and μP
Volatility = memory contents are lost on power-off
ROM = Read Only Memory
RAM = Random Access Memory
48
Memory Types
Types of memory:
MROM: Mask-programmed during manufacturing
PROM: Programmed by user, by blowing fuses
EPROM: Electrically programmable by user, erased by
exposing to ultraviolet light
EEPROM: Electrically Erasable Programmable ROM; can
be programmed or erased by user, one byte at a time
Flash EEPROM: A type of EEPROM programmable in
blocks rather than single bytes
Synchronous Flash EEPROM: Synchronous version of
the above
Memory attributes
All are non-volatile
Writes can be slower than reads!
Asynchronous (except for EEPROM)
49
Words of Memory
ROM: Read Only Memory
1. How many words can be
addressed? 2^(address bits) = 2^15
2. What is the width of these
words? 8 bits
3. How is it activated/turned on?
By a specific signal
4. Why does it have 3-state
outputs? To allow multiple of the
same devices to be connected
50
ROM Timing
Pins are: power, clock, address, output, CE and OE
51
ROM Timing
Assume the address is asserted when supplied to the inputs
There will be a propagation delay before the output appears
Called Address Access Time tACC, i.e. the time to wait until
valid data are available on the outputs after the address is applied
This assumes the device was already powered up and its outputs
were enabled
A key parameter when using memory
If the address is already present on the address inputs when
CE# is asserted, it takes some time tCE to power on
tOE is the time for the output buffers to be turned on
52
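A hypothetical wait-state calculation built on these parameters; every number below is an assumption for illustration, so consult the actual ROM data sheet and bus timing in practice.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double t_clk_ns      = 10.0;   /* assumed 100 MHz bus clock              */
    double t_acc_ns      = 70.0;   /* assumed ROM address access time tACC   */
    double cycles_per_rd = 2.0;    /* assumed bus cycles available w/o waits */

    double available_ns = cycles_per_rd * t_clk_ns;
    int wait_states = (t_acc_ns > available_ns)
                    ? (int)ceil((t_acc_ns - available_ns) / t_clk_ns)
                    : 0;
    printf("tACC = %.0f ns, %.0f ns available -> %d wait state(s)\n",
           t_acc_ns, available_ns, wait_states);
    return 0;
}
```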
ROM Timing
These parameters are major limiting factors for
the performance of a microprocessor and pose
important design considerations
What is happening on the data lines before the
output is valid? See chevrons; bus is in
unknown state!
Individual outputs may be low or high
What do valid data look like? Or invalid data?
How do we know if we clock in invalid data?
We generally don’t!
53
RAM Timing
Note that ROMs are also randomly
addressable
Random access memory (RAM) can be read,
written, is volatile
Speed to read and write are generally equal
SRAM: has very fast access times, uses 6
transistors/bit. Data are static while powered
DRAM: Slower access, but only 1 transistor/bit
DRAM data must be refreshed regularly while
powered
SDRAM: Pipelined, synchronous DRAM
DDR SDRAM: Double Data Rate SDRAM
54
RAM Timing
RAM: Random Access Memory,
here 12 address lines:
How many words can be addressed? 4k
What is the width of these words? 8 bits
Is Asynchronous!
Benefit of synchronous operation: here the μP
needs to keep the address stable the whole time
while waiting for data to arrive; only then can the μP
move to the next address; the μP couldn't start
generating the next address until the current
operation was complete; wasted time,
especially as processor speeds increase
With a synchronous device the μP can present the address
on the address inputs just for the duration of the HOLD
time, which in modern devices can be 0,
and clock it in. Then the μP can do something
else, enabling parallel operations
Pipelined SRAM can do this:
55
Pipelined SRAM
56
Pipelined SRAM
Actual Data Sheet for pipelined SRAM, shows:
input registers
output registers
address register
enable register (control signals)
control logic: see the gates on left
57
CISC vs. RISC
58
CISC vs. RISC Discussion
Design approach of early computer
architects was to: allow all engineering
methods, exercise total freedom of
instruction design, and apply full creativity
for building CPUs
Computers had many instruction types, such
as register-to-memory, memory-to-memory,
memory-to-register operations, etc.
Instructions had various lengths; e.g. Intel
x86 instructions vary from 1 to 15 bytes
Even the opcode proper of instructions varied,
from 1 bit (iWarp C&A instruction) to multiple
bytes (e.g. the 9-byte NOOP on x86)
This was called complex instruction set
computing, AKA CISC
59
CISC vs. RISC Discussion
But computers were never fast enough!
In the mid 1980s, architects produced a new
architecture, with rigid design rules and severe
limitations, for the sake of a faster clock
David Patterson and Carlo Séquin at UC Berkeley, and
others at Stanford University, postulated this new
approach, with defined restrictions, resulting
in a faster clock rate
Referred to as reduced instruction set
computing, AKA RISC
Term RISC coined by David Patterson
60
CISC vs. RISC Discussion
For a while RISC seemed to be the clear
winner in the race for speed
Today one observes a resurgence of CISC
computing
Yet internally, practically all architectures
practice the RISC approach – the internal design is
hidden from the user, not visible in the
Instruction Set Architecture, AKA ISA
61
CISC
Varying instruction length
Pretty much any instruction can access
memory
Opcodes consume from few bits up to
multiple bytes
Rich variety of different instructions
Generally slower clock speed
Multiple –even many– cycles for most
instructions
62
RISC
Uniform instruction length; e.g. 4 bytes on old 32-bit architectures
Only load and store ops access memory
Only load and store ops consume multiple cycles
Uniform opcode length, uniform instruction length,
e.g. only 32-bit instructions on a 32-bit architecture
Limited variety of instructions; i.e. small number of
different opcodes
Generally fast clock speed
Generally 1 cycle per instruction execution
Single-cycle goal forces unusual steps for some FP
ops; e.g. to break FP-divide or multiply into a short
sequence of equivalent but simpler operations
63
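A sketch of the last point, replacing a floating-point divide a/b by a short sequence of simpler operations: Newton-Raphson refinement of 1/b using only multiplies and subtracts. The linear seed over the mantissa range and the iteration count are assumptions sized for double precision; real hardware typically seeds from a small lookup table or a reciprocal-estimate instruction.

```c
#include <stdio.h>
#include <math.h>

/* x_{n+1} = x_n * (2 - m * x_n) converges to 1/m, roughly doubling the
   number of correct bits per step; b is assumed positive and nonzero. */
static double recip(double b) {
    int e;
    double m = frexp(b, &e);              /* b = m * 2^e with 0.5 <= m < 1 */
    double x = 2.9282 - 2.0 * m;          /* rough linear estimate of 1/m  */
    for (int i = 0; i < 5; i++)
        x = x * (2.0 - m * x);            /* multiplies and subtracts only */
    return ldexp(x, -e);                  /* 1/b = (1/m) * 2^-e            */
}

int main(void) {
    double a = 355.0, b = 113.0;
    printf("a * recip(b) = %.15f\n", a * recip(b));
    printf("a / b        = %.15f\n", a / b);
    return 0;
}
```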
Bibliography
1. Shen, John Paul, and Mikko H. Lipasti: Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, ©2005, ISBN-10: 0070570647, ISBN-13: 9780070570641
2. http://forums.amd.com/forum/messageview.cfm?catid=11&threadid=29382&enterthread=y
3. http://www.ece.umd.edu/~blj/papers/hpca2006.pdf
4. Kilburn, T., et al.: “One-level storage system”, IRE Transactions, EC-11, 2, 1962, pp. 223–235
5. RISC versus CISC: https://en.wikipedia.org/wiki/Reduced_instruction_set_computing
6. RISC: https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/whatis/index.html
7. https://software.intel.com/en-us/forums/intel64
moderncode-for-parallel-architectures/topic/608964