
CS 152
Computer Architecture and Engineering
Lecture 26
The Final Chapter
A whirlwind retrospective on the term
0
December 7th, 2001
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
12/7/01
©UCB Fall 2001
CS152 / Kubiatowicz
Lec26.1
What IS Quantum Computing?
Can we Use Quantum Mechanics to Compute?
• Weird properties of quantum mechanics?
– You’ve already seen one: tunneling of electrons through
insulators to make TMJ RAM
– Quantization: Only certain values or orbits are good
• Remember orbitals from chemistry???
– Superposition: Schizophrenic physical elements don’t quite
know whether they are one thing or another
• All existing digital abstractions try to eliminate QM
– Transistors/Gates designed with classical behavior
– Binary abstraction: a “1” is a “1” and a “0” is a “0”
• Quantum Computing:
Use of Quantization and Superposition to compute.
• Interesting results:
– Shor’s algorithm: factors in polynomial time!
– Grover’s algorithm: Finds items in unsorted database in time
proportional to square-root of n.
Quantization: Use of “Spin”
[Figure: spin-½ particle (proton/electron) with its spin axis between North and South poles; representation: |0> or |1>]
• Particles like Protons have an intrinsic “Spin” when defined with
respect to an external magnetic field
• Quantum effect gives “1” and “0”:
– The spin is either “UP” or “DOWN”, nothing in between
• Kane Proposal: use of impurity Phosphorus in silicon
– Spin of odd proton is used to represent the bit
– Manipulation of this bit via “Hyperfine” interaction with electrons
Now add Superposition!
• The bit can be in a combination of “1” and “0”:
– Written as: |ψ> = C0|0> + C1|1>
– The C’s are complex numbers!
– Important constraint: |C0|^2 + |C1|^2 = 1
• If we measure the bit to see what it looks like,
– With probability |C0|^2 we will find |0> (say “UP”)
– With probability |C1|^2 we will find |1> (say “DOWN”)
• Is this a real effect? Options:
– This is just statistical: given a large number of protons, a fraction of them (|C0|^2) are “UP” and the rest are down.
– This is a real effect, and the proton is really both things until you try to look at it
• Reality: second choice! There are experiments to prove it!
Implications: A register can have many values
• Implications of superposition:
– An n-bit register can have 2^n values simultaneously!
– 3-bit example:
|ψ> = C000|000> + C001|001> + C010|010> + C011|011> + C100|100> + C101|101> + C110|110> + C111|111>
• Probabilities of measuring all bits are set by coefficients:
– So, prob of getting |000> is |C000|^2, etc.
– Suppose we measure only one bit (the first):
• We get a “0” with probability: P0 = |C000|^2 + |C001|^2 + |C010|^2 + |C011|^2
Result: |ψ'> = (1/√P0)(C000|000> + C001|001> + C010|010> + C011|011>)
• We get a “1” with probability: P1 = |C100|^2 + |C101|^2 + |C110|^2 + |C111|^2
Result: |ψ'> = (1/√P1)(C100|100> + C101|101> + C110|110> + C111|111>)
• Problem: Don’t want the environment to measure before we’re ready!
– Solution: Quantum Error Correction Codes!
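The measurement rule above can be sketched in a few lines of Python. This is a toy sketch, not a real quantum simulator; the state vector, the function name, and the choice of an equal superposition are my own illustration.

```python
import random

# Toy 3-qubit state: amplitudes for |000>..|111>, normalized so the
# squared magnitudes sum to 1 (equal superposition of |000>..|011>).
amps = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]

def measure_first_bit(amps):
    """Measure only the first (most significant) bit of the register.

    With probability P0 (sum of |c|^2 over the |0xx> half) we see 0,
    and the surviving amplitudes are renormalized by 1/sqrt(P0);
    likewise for 1 with P1 = 1 - P0.
    """
    p0 = sum(abs(c) ** 2 for c in amps[:4])
    if random.random() < p0:
        norm = p0 ** 0.5
        return 0, [c / norm for c in amps[:4]] + [0.0] * 4
    norm = (1.0 - p0) ** 0.5
    return 1, [0.0] * 4 + [c / norm for c in amps[4:]]
```

With the state above, P0 = 1, so measuring the first bit always yields 0 and leaves the other two bits still in superposition.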
Model? Operations on coefficients + measurements
[Figure: Input (complex state) → Unitary Transformations → Measure → Output (classical answer)]
• Basic Computing Paradigm:
– Input is a register with superposition of many values
• Possibly all 2^n inputs equally probable!
– Unitary transformations compute on coefficients
• Must maintain probability property (sum of squares = 1)
• Looks like doing computation on all 2^n inputs simultaneously!
– Output is one result attained by measurement
• If done poorly, just like probabilistic computation:
– If 2^n inputs equally probable, there may be 2^n outputs equally probable.
– After measuring, it’s like picking a random input to a classical function!
– All interesting results have some form of “Fourier transform” computation being done in the unitary transformation
Some Issues in building quantum computer
• What are the bits and how do we manipulate them?
– NMR computation: use “cup of liquid”.
• Use nuclear spins (special protons on complex molecules).
• Manipulate with radio-frequencies
• IBM has produced a 7-bit computer
– Silicon options (more scalable)
• Impurity Phosphorus in silicon
• Manipulate through electrons (including measurement)
• Still serious noise/fabrication issues
– Other options:
• Optical (Phases of photons represent bits)
• Single electrons trapped in magnetic fields
• How do we prevent the environment from “Measuring”?
– Make spins as insulated from environment as possible
– Quantum Error Correction!
• Where do we get “clean” bits (i.e., unsuperposed |0> or |1>)?
– Entropy exchange unit:
• Radiates heat to environment (entropy)
• Produces clean bits (COLD) to enter into device
DNA Computing
• Can we use DNA to do massive
computations?
– Organisms do it
– DNA has an extremely high
information density:
• 4 different base pairs:
– Adenine/Thymine
– Guanine/Cytosine
• Always paired up on opposite strands => energetically favorable
– Active operations:
• Copy: Split strands of 1 DNA apart in
proper solution, gain 2 copies
• Concatenate: e.g., the strand GTAATCCT will cause XXXXXCATT to combine with AGGAYYYYY (GTAA pairs with CATT, TCCT pairs with AGGA)
• Polymerase Chain Reaction (PCR):
greatly amplifies region of molecule
between two marker molecules
[DNA figure ©1999 Access Excellence @ the National Health Museum]
DNA Computing and the traveling salesman
• Given a set of cities and costs
between them (possibly directed
paths):
– Find shortest path
– Simpler: find single path that visits all cities
• A DNA-computing example follows:
– Every city represented by unique 20 base-pair strand
– Every path between cities represented by complementary
pairs: 10 pairs from source city, 10 pairs from destination
– Shorter example: AAGT for city 1, TTCG for city 2; path 1->2: CAAA
Will build:  AAGTTTCG
              ..CAAA..
(the path strand CAAA hybridizes across the city-city junction)
– Dump “city molecules” and “path molecules” into a test tube.
Select and amplify paths of the right length. Analyze for result.
– Been done for 7 cities! (Adleman, 1994)
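The toy encoding above can be checked with a few lines of Python. Names like `complement` are mine; this just mimics Watson-Crick base pairing on strings, not real chemistry.

```python
# Watson-Crick complement: A<->T, G<->C.
COMP = str.maketrans("ATGC", "TACG")

def complement(strand: str) -> str:
    """Return the base-by-base complementary strand."""
    return strand.translate(COMP)

# The slide's toy cities:
city1, city2 = "AAGT", "TTCG"

# A path molecule 1->2 complements the second half of city 1 plus the
# first half of city 2, so it sticks the two city strands together.
path_1_to_2 = complement(city1[2:] + city2[:2])  # -> "CAAA"
```

Running this reproduces the slide's path strand CAAA for the 1->2 edge.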
Where have we been?
[Collage of figures from the term: the Booth-encoded multiplier datapath (multiplicand register, sign extension, 34-bit ALU, Booth encoder, HI/LO registers, control logic); the processor-memory performance gap chart (CPU “Moore’s Law” vs. DRAM 9%/yr or 2X/10 yrs, gap grows 50%/year, 1980-2000); and pipelined IFetch-Dcd-Exec-Mem-WB instruction diagrams]
Single/multicycle datapaths
Pipelining
I/O
Memory systems
The Big Picture
° Since 1946 all computers have had 5 components
[Figure: the five components: Processor (Control + Datapath), Memory, Input, Output]
What is “Computer Architecture”?
Application
Operating
System
Compiler
Firmware
Instr. Set Proc. I/O system
Instruction Set
Architecture
Datapath & Control
Digital Design
Circuit Design
Layout
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
Performance and Technology Trends
[Chart: performance (0.1 to 1000) vs. year (1965-2000) for supercomputers, mainframes, minicomputers, and microprocessors; microprocessors overtake the other classes]
• Technology Power: 1.2 x 1.2 x 1.2 = 1.7 x / year
– Feature Size: shrinks 10% / yr. => Switching speed improves 1.2 / yr.
– Density: improves 1.2x / yr.
– Die Area: 1.2x / yr.
• One lesson of RISC is to keep the ISA as simple as possible:
– Shorter design cycle => fully exploit the advancing technology (~3yr)
– Advanced branch prediction and pipeline techniques
– Bigger and more sophisticated on-chip caches
Examples of “Moore’s Law”
• Processor
– logic capacity: about 30% per year
– clock rate: about 20% per year
– performance: about 50-60% per year (2x in 18 months)
• Memory
– DRAM capacity: about 60% per year (4x every 3 years)
– memory speed: about 10% per year
– cost per bit: improves about 25% per year
• Disk
– capacity: about 60% per year
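These rates compound. A two-line helper (mine, not from the slides) shows why 60%/year is the same thing as roughly 4x every 3 years:

```python
def growth(rate_per_year: float, years: float) -> float:
    """Total improvement factor from a constant annual growth rate."""
    return (1.0 + rate_per_year) ** years

dram_3yr = growth(0.60, 3)      # 1.6^3 = 4.096, i.e. ~4x every 3 years
proc_rate = 2 ** (1 / 1.5) - 1  # 2x in 18 months is ~59% per year
```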
Processor Performance
[Chart: performance (0 to 350) vs. year (1982-1994); RISC processors improve steeply after the RISC introduction, while Intel x86 improves at about 35%/yr]
Instruction Set Architecture (subset of Computer Arch.)
... the attributes of a [computing] system as seen by the
programmer, i.e. the conceptual structure and functional
behavior, as distinct from the organization of the data
flows and controls, the logic design, and the physical
implementation.
– Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Set
-- Instruction Formats
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
Instruction Set Design
software
instruction set
hardware
Summary of the Design Process
Hierarchical Design to manage complexity
Top Down vs. Bottom Up vs. Successive Refinement
Importance of Design Representations:
Block Diagrams
Decomposition into Bit Slices (top-down meets bottom-up, e.g. a mux design meets at the truth table)
Truth Tables, K-Maps
Circuit Diagrams
Other Descriptions: state diagrams, timing diagrams, reg xfer, . . .
Optimization Criteria:
Area (Gate Count, [Package Count], Logic Levels, Pin Out)
Delay (Fan-in/Fan-out)
Power
Cost
Design time
Measurement and Evaluation
Architecture is an iterative process:
-- searching the space of possible designs
-- at all levels of computer systems
[Figure: cycle of Creativity → Design → Cost/Performance Analysis, sorting Good Ideas from Mediocre Ideas and Bad Ideas]
You must be willing to throw out Bad Ideas!!
One of most important aspects of design: TEST
• Think about testing from beginning of design
• Well over 50% of modern teams devoted to testing
• VHDL Test benches: monitoring hardware to aid
debugging:
– Include assert statements to check for “things that should
never happen”
[Figure: complete top-level design wrapped in a test bench: the Device Under Test is driven by inline vectors or file I/O (either for patterns or output diagnostics), with an inline monitor producing output in readable format (disassembly) and assert statements throughout]
Basis of Evaluation
• Actual Target Workload
Pros: representative
Cons: very specific; non-portable; difficult to run or measure; hard to identify cause
• Full Application Benchmarks
Pros: portable; widely used; improvements useful in reality
Cons: less representative
• Small “Kernel” Benchmarks
Pros: easy to run, early in design cycle
Cons: easy to “fool”
• Microbenchmarks
Pros: identify peak capability and potential bottlenecks
Cons: “peak” may be a long way from application performance
Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
ExTime(with E) = ((1-F) + F/S) x ExTime(without E)
Speedup(with E) = 1 / ((1-F) + F/S)
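As a sanity check, Amdahl's Law drops straight into Python (the function name is mine):

```python
def speedup(f: float, s: float) -> float:
    """Amdahl's Law: fraction f of the task is accelerated by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Even an infinite speedup on half the task caps the overall gain at 2x:
# speedup(0.5, s) -> 2 as s -> infinity.
```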
Performance Evaluation Summary
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
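The iron law above, as a tiny helper (names mine):

```python
def cpu_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds/Program = Instructions x Cycles/Instruction x Seconds/Cycle."""
    return instructions * cpi / clock_hz

# e.g. 1e9 instructions at CPI 2.0 on a 1 GHz clock take 2 seconds.
```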
° Time is the measure of computer performance!
° Remember Amdahl’s Law:
Speedup is limited by unimproved part of program
° Good products created when have:
• Good benchmarks
• Good ways to summarize performance
° If NOT good benchmarks and summary, then choice between
1) improving product for real programs
2) changing product to get more sales (sales almost always
wins)
Integrated Circuit Costs
Die cost = Wafer cost / (Dies per wafer x Die yield)
Dies per wafer = π x (Wafer_diam / 2)^2 / Die Area − π x Wafer_diam / √(2 x Die Area) − Test dies
Die Yield = Wafer yield x (1 + Defects_per_unit_area x Die_Area / α)^(−α)
Die cost goes roughly with the cube of the area.
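The cost model above in Python. The units (cm, cm²) and the default α = 3 are my assumptions; the slide leaves α implicit.

```python
import math

def dies_per_wafer(wafer_diam: float, die_area: float, test_dies: int = 0) -> float:
    """Gross dies: wafer area over die area, minus an edge-loss term."""
    return (math.pi * (wafer_diam / 2) ** 2 / die_area
            - math.pi * wafer_diam / math.sqrt(2 * die_area)
            - test_dies)

def die_yield(wafer_yield: float, defect_density: float,
              die_area: float, alpha: float = 3.0) -> float:
    """Yield = wafer_yield * (1 + defect_density * die_area / alpha)^-alpha."""
    return wafer_yield * (1 + defect_density * die_area / alpha) ** (-alpha)

def die_cost(wafer_cost: float, wafer_diam: float, die_area: float,
             wafer_yield: float, defect_density: float) -> float:
    return wafer_cost / (dies_per_wafer(wafer_diam, die_area)
                         * die_yield(wafer_yield, defect_density, die_area))
```

Doubling the die area both halves the gross die count and hurts yield, which is where the roughly cubic cost growth comes from.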
Computer Arithmetic
• Bits have no inherent meaning: operations determine whether they are really ASCII characters, integers, floating point numbers
• Hardware algorithms for arithmetic:
– Carry lookahead/carry save addition
– Multiplication and divide.
– Booth algorithms
• Divide uses same hardware as multiply (Hi & Lo registers
in MIPS)
• Floating point follows paper & pencil method of scientific
notation
– using integer algorithms for multiply/divide of significands
Carry Look Ahead (Design trick: peek)
C0 = Cin
C1 = G0 + C0·P0
C2 = G1 + G0·P1 + C0·P0·P1
C3 = G2 + G1·P2 + G0·P1·P2 + C0·P0·P1·P2
C4 = . . .
Per-bit generate/propagate: G = A and B, P = A xor B
A B | C-out
0 0 | 0      “kill”
0 1 | C-in   “propagate”
1 0 | C-in   “propagate”
1 1 | 1      “generate”
[Figure: chain of four G/P adder cells (A0/B0 .. A3/B3) producing sum bits S and feeding the lookahead logic]
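A quick way to check the carry equations is to evaluate the generate/propagate recurrence directly. This is a behavioral Python sketch (names mine), not hardware; the point of lookahead is that each C_i unrolls into the flat two-level expressions above instead of rippling.

```python
def cla_carries(a: int, b: int, c0: int, n: int = 4) -> list:
    """Compute carries c0..cn from per-bit generate/propagate bits.

    g_i = a_i AND b_i, p_i = a_i XOR b_i, and
    c_{i+1} = g_i OR (p_i AND c_i), which unrolls into the
    sum-of-products carry equations on the slide.
    """
    carries = [c0]
    for i in range(n):
        g = (a >> i) & (b >> i) & 1   # this bit generates a carry
        p = ((a >> i) ^ (b >> i)) & 1  # this bit propagates a carry
        carries.append(g | (p & carries[i]))
    return carries
```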
MULTIPLY HARDWARE Version 3
• 32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg (shift right), (0-bit Multiplier reg)
[Figure: Multiplicand (32 bits) feeds a 32-bit ALU whose result is written into the upper half (“HI”) of the 64-bit shift-right Product (Multiplier) register (“HI”/“LO”), under Control]
Divide can use same hardware!
Booth’s Algorithm Insight
Example bit string: 0 1 1 1 1 0 (beginning, middle, and end of a run of 1s)
Current Bit  Bit to the Right  Explanation          Example     Op
1            0                 Begins run of 1s     0001111000  sub
1            1                 Middle of run of 1s  0001111000  none
0            1                 End of run of 1s     0001111000  add
0            0                 Middle of run of 0s  0001111000  none
Originally for Speed (when shift was faster than add)
• Replace a string of 1s in multiplier with an initial subtract
when we first see a one and then later add for the bit after
the last one
Example: 01111 = 10000 − 1
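A behavioral Python sketch of the recoding (mine, not hardware): it handles non-negative multipliers of up to n bits, with one extra add if a run of 1s reaches the top bit.

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int = 8) -> int:
    """Radix-2 Booth recoding: scan (current bit, bit to the right).

    (1,0) -> subtract shifted multiplicand (run of 1s begins),
    (0,1) -> add it back (run of 1s just ended),
    (0,0)/(1,1) -> do nothing.
    """
    product = 0
    prev = 0                         # implicit 0 to the right of bit 0
    for i in range(n):
        cur = (multiplier >> i) & 1
        if (cur, prev) == (1, 0):    # begin run of 1s
            product -= multiplicand << i
        elif (cur, prev) == (0, 1):  # end run of 1s
            product += multiplicand << i
        prev = cur
    if prev == 1:                    # run of 1s extends past bit n-1
        product += multiplicand << n
    return product
```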
Pentium Bug
• Pentium: Difference between bugs that board designers
must know about and bugs that potentially affect all users
– $200,000 cost in June to repair design
– $400,000,000 loss in December in profits to replace bad parts
– How much to repair Intel’s reputation?
– Make public a complete description of bugs in the latter category?
• What is technologist’s and company’s responsibility to disclose
bugs?
Administrivia
• Oral reports Tuesday:
– 10:00 – 11:20 am and 1:00 – 2:20 pm in 306 Soda
– Remember: talk is 15 minutes + 5 minutes questions
• Don’t bring more than 8 slides!!!
• Practice! Your final project grade will depend partially on
your oral report.
• Everyone should participate in talk
– Sheet on my door
• Finally:
– 4pm go over to lab to run mystery programs.
– Reports submitted via Web site by midnight (not with
submit program!)
• Everyone should show up for at least part of this
Multiple Cycle Datapath
[Figure: multiple-cycle datapath: PC, instruction register, register file (Ra/Rb/Rw, busA/busB/busW), sign extend, a shared 32-bit ALU with ALUSelA/ALUSelB muxes, ALUOut register, memory data register, and ideal memory, steered by control signals PCWr, PCWrCond, IorD, MemWr, IRWr, RegDst, RegWr, ExtOp, ALUOp, MemtoReg, and PCSrc]
Control: Hardware vs. Microprogrammed
° Control may be designed using one of several initial
representations. The choice of sequence control, and how logic is
represented, can then be determined independently; the control can
then be implemented with one of several methods using a
structured logic technique.
                          “hardwired control”     “microprogrammed control”
Initial Representation    Finite State Diagram    Microprogram
Sequencing Control        Explicit Next State     Microprogram counter
                          Function                + Dispatch ROMs
Logic Representation      Logic Equations         Truth Tables
Implementation Technique  PLA                     ROM
Finite State Machine (FSM) Spec
[Finite state diagram: state 0000 “instruction fetch” (IR <= MEM[PC]; PC <= PC + 4) → state 0001 “decode”, which dispatches to Execute states: R-type ALUout <= A fun B (0100), ORi ALUout <= A or ZX (0110), LW/SW ALUout <= A + SX (1000/1011), BEQ ALUout <= PC + SX (0010) then “if A = B then PC <= ALUout” (0011); Memory states: M <= MEM[ALUout] (1001), MEM[ALUout] <= B (1100); Write-back states: R[rd] <= ALUout (0101), R[rt] <= ALUout (0111), R[rt] <= M (1010)]
Sequencer-based control unit
[Figure: control logic (inputs from opcode via address select logic, outputs to the multicycle datapath), driven by a state register with an adder to increment the state]
Types of “branching”:
• Set state to 0
• Dispatch (state 1)
• Use incremented state number
“Macroinstruction” Interpretation
[Figure: main memory holds the user program plus data (ADD, SUB, AND, ..., DATA), and this can change; the CPU’s execution unit interprets each macroinstruction by mapping it into a microsequence in control memory, e.g. the AND microsequence: Fetch, Calc Operand Addr, Fetch Operand(s), Calculate, Save Answer(s)]
Microprogramming
Label   ALU    SRC1  SRC2     Dest.    Memory     Mem. Reg.  PC Write     Sequencing
Fetch:  Add    PC    4                 Read PC    IR         ALU          Seq
        Add    PC    Extshft                                              Dispatch
Rtype:  Func   rs    rt       rd ALU                                      Seq
                                                                          Fetch
Ori:    Or     rs    Extend0  rt ALU                                      Seq
                                                                          Fetch
Lw:     Add    rs    Extend            Read ALU                           Seq
                              rt MEM                                      Seq
                                                                          Fetch
Sw:     Add    rs    Extend            Write ALU                          Seq
                                                                          Fetch
Beq:    Subt.  rs    rt                                      ALUoutCond.  Fetch
Precise Interrupts
• Precise  state of the machine is preserved as if program
executed up to the offending instruction
– All previous instructions completed
– Offending instruction and all following instructions act as if they have
not even started
– Same system code will work on different implementations
– Position clearly established by IBM
– Difficult in the presence of pipelining, out-of-order execution, ...
– MIPS takes this position
• Imprecise  system software has to figure out what is where
and put it all back together
• Performance goals often lead designers to forsake precise
interrupts
– system software developers, user, markets etc. usually wish they had
not done this
• Modern techniques for out-of-order execution and branch
prediction help implement precise interrupts
Recap: Pipelining Lessons (its intuitive!)
[Laundry analogy figure: four loads (A-D) moving through overlapped 30-minute wash/dry/fold stages starting at 6 PM]
° Multiple tasks operating simultaneously using different resources
° Potential speedup = Number pipe stages
° Pipelining doesn’t help latency of single task, it helps throughput of entire workload
° Pipeline rate limited by slowest pipeline stage
° Unbalanced lengths of pipe stages reduces speedup
° Time to “fill” pipeline and time to “drain” it reduces speedup
° Stall for Dependences
Why Pipeline? Because the resources are there!
[Pipeline diagram: instructions 0-4 flowing in order through Im (instruction memory), Reg, ALU, Dm (data memory), and Reg stages, one instruction issuing per clock cycle]
Can pipelining get us into trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two
different ways at the same time
• E.g., combined washer/dryer would be a structural hazard or
folder busy doing something else (watching TV)
– data hazards: attempt to use item before it is ready
• E.g., one sock of pair in dryer and one in washer; can’t fold
until get sock from washer through dryer
• instruction depends on result of prior instruction still in the
pipeline
– control hazards: attempt to make a decision before condition is evaluated
• E.g., washing football uniforms and need to get proper
detergent level; need to see after dryer before next load in
• branch instructions
• Can always resolve hazards by waiting
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards
Exceptions in a 5 stage pipeline
[Pipeline exception diagram: program flow vs. time; an instruction TLB fault is detected in IFetch, a bad instruction in Dcd, an overflow in Exec, and a data TLB fault in Mem, while later instructions continue down the IFetch-Dcd-Exec-Mem-WB pipeline]
• Use pipeline to sort this out!
– Pass exception status along with instruction.
– Keep track of PCs for every instruction in pipeline.
– Don’t act on exception until it reaches WB stage
• Handle interrupts through “faulting noop” in IF stage
• When instruction reaches WB stage:
– Save PC  EPC, Interrupt vector addr  PC
– Turn all instructions in earlier stages into noops!
Data Stationary Control
• The Main Control generates the control signals during
Reg/Dec
– Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
– Control signals for Mem (MemWr, Branch) are used 2 cycles later
– Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: the Main Control generates all signals during Reg/Dec; ExtOp, ALUSrc, ALUOp, and RegDst flow through the ID/Ex register, MemWr and Branch continue through Ex/Mem, and MemtoReg and RegWr continue through Mem/Wr, so each signal travels down the pipe until its stage uses it]
Pipeline Summary
° Simple 5-stage pipeline: F D E M W
° Pipelines pass control information down the pipe just
as data moves down pipe
° Resolve data hazards through forwarding.
° Forwarding/Stalls handled by local control
° Exceptions stop the pipeline
° MIPS I instruction set architecture made pipeline visible
(delayed branch, delayed load)
° More performance from deeper pipelines, parallelism
° You built a complete 5-stage pipeline in the lab!
Out of order execution: Tomasulo Organization
[Tomasulo organization: an FP op queue and the FP registers feed reservation stations (Add1-Add3, Mult1-Mult2) in front of the FP adders and FP multipliers; load buffers (Load1-Load6) bring data from memory and store buffers send it back; all results broadcast on the Common Data Bus (CDB)]
12/7/01
©UCB Fall 2001
CS152 / Kubiatowicz
Lec26.45
How can the machine exploit available ILP?
Technique                                          Limitation
° Pipelining: issue 1 instr. / cycle               Issue rate, FU stalls, FU depth
° Super-pipeline: issue 1 instr. / (fast) cycle;   Clock skew, FU stalls, FU depth
  IF takes multiple cycles
° Super-scalar: issue multiple scalar              Hazard resolution
  instructions per cycle
° VLIW: each instruction specifies                 Packing, Compiler
  multiple scalar operations
[Pipe diagrams: overlapped IF D Ex M W stages illustrating each technique]
Processor-DRAM Gap (latency)
[Chart: performance (1 to 1000) vs. year (1980-2000); µProc improves 60%/yr (“Moore’s Law”) while DRAM improves 7%/yr, so the processor-memory performance gap grows about 50%/year]
Levels of the Memory Hierarchy
Upper level (smaller, faster) to lower level (larger, cheaper):
Level         Capacity    Access Time  Cost               Staging Xfer Unit         Managed by
Registers     100s bytes  a few ns                        instr. operands, 1-8 B    prog./compiler
Cache (SRAM)  K bytes     2-100 ns     $.01-.001/bit      blocks, 8-128 bytes       cache cntl
Main (DRAM)   M bytes     100ns-1us    $.01-.001/bit      pages, 512-4K bytes       OS
Disk          G bytes     ms           10^-4-10^-3 cents  files, Mbytes             user/operator
Tape          infinite    sec-min      10^-6 cents
Memory Hierarchy
° The Principle of Locality:
• Programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
° Three Major Categories of Cache Misses:
• Compulsory Misses: sad facts of life. Example: cold start
misses.
• Conflict Misses: increase cache size and/or associativity.
• Capacity Misses: increase cache size
° Virtual Memory invented as another level of the hierarchy
– Today VM allows many processes to share single memory
without having to swap all processes to disk, protection
more important
– TLBs are important for fast translation/checking
Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct-mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in
parallel
– Data is selected based on the tag result
[Figure: two-way set-associative lookup: the Cache Index selects one set from each of two tag/data arrays, both stored tags are compared with the address tag in parallel, the compare results are ORed into Hit, and Sel0/Sel1 drive a mux that selects the hitting Cache Block]
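The lookup above can be modeled behaviorally. This is a toy Python model I wrote (class and field names are mine); real hardware compares both tags in parallel, while the loop here just mimics that, with LRU replacement on a miss.

```python
class TwoWaySetCache:
    """Toy 2-way set-associative cache: the index selects a set, both
    stored tags are checked, and LRU picks the victim way on a miss."""

    def __init__(self, num_sets: int, block_bits: int):
        self.num_sets = num_sets
        self.block_bits = block_bits
        # Each set holds two tag "ways" plus an LRU pointer.
        self.sets = [{"ways": [None, None], "lru": 0}
                     for _ in range(num_sets)]

    def access(self, addr: int) -> bool:
        """Return True on hit, False on miss (filling the LRU way)."""
        block = addr >> self.block_bits
        index = block % self.num_sets
        tag = block // self.num_sets
        s = self.sets[index]
        for way, stored in enumerate(s["ways"]):
            if stored == tag:          # hit in one of the two ways
                s["lru"] = 1 - way     # the other way becomes LRU
                return True
        victim = s["lru"]              # miss: fill the LRU way
        s["ways"][victim] = tag
        s["lru"] = 1 - victim
        return False
```

Two blocks that map to the same index can now coexist, which is exactly what a direct-mapped cache cannot do.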
Quicksort vs. Radix as vary number keys: Cache misses
[Chart: cache misses per key (0 to 5) vs. job size (1,000 to 10,000,000 keys); radix sort’s misses per key climb with job size while quicksort’s stay low]
What is proper approach to fast algorithms?
Performance Retrospective
° Theory of Algorithms & Compilers often based on number of operations
° Compilers remove operations and “simplify” ops:
Integer adds << Integer multiplies << FP adds << FP multiplies
• Advanced pipelines => these operations take similar time
(FP multiply faster than integer multiply)
° As Clock rates get higher and pipelines are longer, instructions
take less time but DRAMs only slightly faster (although much
larger)
° Today time is a function of (ops, cache misses);
° Given importance of caches, what does this mean to:
• Compilers?
• Data structures?
• Algorithms?
• How do you tune performance on Pentium Pro? Random?
Static RAM Cell
[Figure: 6-transistor SRAM cell: cross-coupled inverters hold complementary values, accessed through pass transistors gated by the word (row select) line onto the complementary bit and bit-bar lines; a pullup can replace part of the cell to save area]
• Write:
1. Drive bit lines (bit = 1, bit-bar = 0)
2. Select row
• Read:
1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit-bar
1-Transistor Memory Cell (DRAM)
[Figure: 1-T DRAM cell: a single access transistor, gated by the row select line, connects a storage capacitor to the bit line]
• Write:
1. Drive bit line
2. Select row
• Read:
1. Precharge bit line to Vdd
2. Select row
3. Cell and bit line share charges
• Very small voltage changes on the bit line
4. Sense (fancy sense amp)
• Can detect changes of ~1 million electrons
5. Write: restore the value
• Refresh:
1. Just do a dummy read to every cell.
Classical DRAM Organization (square)
[Figure: square RAM cell array: a row decoder takes the row address and drives one word (row) select line; each intersection of a word line with the bit (data) lines is a 1-T DRAM cell; a column selector & I/O circuits take the column address and steer the data]
• Row and Column Address together select 1 bit at a time
I/O System Design Issues
• Systems have a hierarchy of busses as well (PC: memory, PCI, ESA)
[Figure: processor with cache on a memory-I/O bus, alongside main memory and I/O controllers for disks, graphics, and the network; the controllers send interrupts back to the processor]
A Three-Bus System (+ backside cache)
[Figure: three-bus system: processor and memory on a processor-memory bus, with a backside cache bus to the L2 cache; bus adaptors bridge the processor-memory bus down to a small number of I/O buses]
• A small number of backplane buses tap into the processor-memory bus
– Processor-memory bus is only used for processor-memory traffic
– I/O buses are connected to the backplane bus
• Advantage: loading on the processor bus is greatly reduced
12/7/01
©UCB Fall 2001
CS152 / Kubiatowicz
Lec26.57
Main components of Intel Chipset: Pentium II/III
• Northbridge:
– Handles memory
– Graphics
• Southbridge: I/O
– PCI bus
– Disk controllers
– USB controllers
– Audio
– Serial I/O
– Interrupt controller
– Timers
Disk Device Terminology
Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time
Order-of-magnitude times for 4K byte transfers:
Average Seek: 8 ms or less
Rotate: 4.2 ms @ 7200 RPM (half a revolution on average)
Xfer: 1 ms @ 7200 RPM
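The latency sum is easy to sanity-check in Python. A sketch with names of my choosing; the 0.1 ms controller time is a made-up placeholder, not a figure from the slide.

```python
def disk_latency_ms(queue_ms: float, controller_ms: float,
                    seek_ms: float, rpm: float, xfer_ms: float) -> float:
    """Disk latency = queueing + controller + seek + avg rotation + transfer."""
    rotation_ms = 0.5 * 60_000.0 / rpm  # on average, half a revolution
    return queue_ms + controller_ms + seek_ms + rotation_ms + xfer_ms

# Slide's order-of-magnitude 4KB transfer at 7200 RPM (empty queue):
example = disk_latency_ms(0.0, 0.1, 8.0, 7200, 1.0)  # ~13.3 ms
```

Half a revolution at 7200 RPM is 30,000/7200 = 4.17 ms, which is where the slide's 4.2 ms comes from.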
Disk I/O Performance
Metrics: Response Time, Throughput
[Chart: response time (0-300 ms) vs. throughput (0-100% of total bandwidth); response time grows sharply as throughput approaches 100%]
[Figure: requests flow from Proc through a Queue to the IOC and Device]
Response time = Queue + Device Service time
A Little Queuing Theory: M/M/1 queues
[Figure: the system = Queue + server (IOC, Device), fed by Proc]
• Described “memoryless” or Markovian request arrival (M for exponentially random, C = 1), 1 server: M/M/1 queue
• When service times also have C = 1:
Tq = Tser x u / (1 − u)
Tser = average time to service a customer
u = server utilization (0..1): u = λ x Tser
Tq = average time/customer in queue
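The formula drops into Python directly (names mine; times in any consistent unit, arrival rate in customers per that same unit):

```python
def mm1_queue_time(t_ser: float, arrival_rate: float) -> float:
    """M/M/1 queue: Tq = Tser * u / (1 - u), with u = lambda * Tser."""
    u = arrival_rate * t_ser
    assert 0 <= u < 1, "utilization must be below 1 for a stable queue"
    return t_ser * u / (1.0 - u)

# At 50% utilization the queueing time equals the service time;
# it blows up as utilization approaches 100%.
```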
Computers in the news: Tunneling Magnetic Junction
What does the future hold?
Forces on Computer Architecture
[Figure: Computer Architecture at the center, pushed on by Technology, Programming Languages, Applications, Operating Systems, and History (A = F / M)]
Today: building materials prevalent
• Originally: worried about squeezing the last ounce of
performance from limited resources
• Today: worried about an abundance (embarrassment) of
riches?
– Billions of transistors on a chip (17nm Yeah!)
– Microprocessor Report articles wondering if all the lessons of
RISC are now irrelevant
• Moore’s laws: exponential growth of everything
– Transistors, Performance, Disk Space, Memory Size
• So, what matters any more????
Key Technologies
° Fast, cheap, highly integrated “computers-on-a-chip”
• IDT R4640, NEC VR4300, StrongARM, Superchips
• Devices everywhere!
° Micromechanical Devices (MEMS)
• Integrated sensors everywhere
° Ubiquitous access to fast networks
• Network is everywhere!
• ISDN, Cable Modems, ATM, Metrocom (wireless) . .
° Platform independent programming languages
• Java, JavaScript, Visual Basic Script
° Lightweight Operating Systems
• GEOS, NCOS, RISCOS
° ???
Issues for Research
• Complexity:
– more than 50% of design teams now for verification
• Power
– Processor designs hampered in performance to keep from
melting
– Why 3 or 4 orders of magnitude difference in power consumption between custom hardware and general von Neumann architectures?
• Energy
– Portable devices
• Scalability, Reliability, Maintainability
– How to keep services up 24x7?
• Performance (“Cost conscious”)
– how to get performance without a lot of power, complexity, etc.
• Security?
– What are the consequences of distributed info to privacy?
– What are moral implications of ubiquitous sensors???
1995 Computer Food Chain
[Figure: 1995 food chain: PCs and workstations devouring the minicomputer (hitting wall soon), with the mainframe (future is bleak), vector supercomputer, and massively parallel processors further up the chain]
Interconnection Networks
° Switched vs. shared media: with a switch, multiple pairs can communicate at the same time over “point-to-point” connections
Cluster/Network of Workstations (NOW)
[Figure comparing organizations: an MPP (processors with memories and network interfaces tightly coupled on a fast, switched network for fast communication), an SMP (several processors sharing memory on one bus, NI on the I/O bus, general-purpose I/O), and distributed computing (workstations with disks on a slow, scalable network). A NOW clusters commodity workstations/PCs with NIs on a fast, switched network, giving incremental scalability and timeliness]
2005 Computer Food Chain?
[Diagram: Mainframe, Minicomputer, Vector Supercomputer, and Massively Parallel Processors, alongside Portable Computers and Networks of Workstations/PCs]
Lec26.71
Intelligent DRAM (IRAM)
• IRAM motivation (2000 to 2005)
– 256 Mbit/1 Gbit DRAMs in near future (128 MByte)
– Current CPUs starved for memory BW
– On-chip memory BW = SQRT(Size)/RAS, or 80 GB/sec
– 1% of a Gbit DRAM = 10M transistors for µprocessor
– Even in DRAM process, a 10M trans. CPU is attractive
– Package could be network interface vs. Addr./Data pins
– Embedded computers are increasingly important
• Why not re-examine computer design based on separation
of memory and processor?
– Compact code & data?
– Vector instructions?
– Operating systems? Compilers? Data Structures?
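The SQRT(Size)/RAS bandwidth figure can be sanity-checked with back-of-the-envelope arithmetic: a roughly square DRAM array of N bits opens about sqrt(N) bits per row access. The 50 ns RAS cycle below is an assumed illustrative value, not a number from the lecture.

```python
# Rough check of "on-chip memory BW = SQRT(Size)/RAS, or 80 GB/sec"
# for a 1 Gbit DRAM. The RAS cycle time is an assumption.
import math

bits = 2 ** 30               # 1 Gbit DRAM
row_bits = math.isqrt(bits)  # square array: ~sqrt(Size) bits per row access
ras_cycle = 50e-9            # assume a 50 ns RAS (row access) cycle

bandwidth = (row_bits / 8) / ras_cycle   # bytes per second
# ~8.2e10 B/s, i.e. on the order of the slide's 80 GB/sec
```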
Lec26.72
IRAM Vision Statement
Microprocessor & DRAM
on a single chip:
– on-chip memory latency
5-10X, bandwidth 50-100X
– improve energy efficiency
2X-4X (no off-chip bus)
– serial I/O 5-10X v. buses
– smaller board area/volume
– adjustable memory size/width
[Diagram: conventional organization (Proc with L1 caches and L2$ on one chip, a bus to separate DRAM and I/O chips) vs. IRAM (Proc, logic fab, and DRAM fab integrated on a single chip with I/O)]
Lec26.73
OceanStore: The Oceanic Data Utility
Global-Scale Persistent Storage
Lec26.74
Ubiquitous Devices → Ubiquitous Storage
• Consumers of data move, change from one device to
another, work in cafes, cars, airplanes, the office, etc.
• Properties REQUIRED for OceanStore:
– Strong Security: data must be encrypted whenever in the
infrastructure; resistance to monitoring
– Coherence: too much data for naïve users to keep coherent
“by hand”
– Automatic replica management and optimization: huge
quantities of data cannot be managed manually
– Simple and automatic recovery from disasters: probability
of failure increases with size of system
– Utility model: world-scale system requires cooperation
across administrative boundaries
Lec26.75
Utility-based Infrastructure
[Diagram: OceanStore spanning multiple providers (Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM)]
• Service provided by confederation of companies
– Monthly fee paid to one service provider
– Companies buy and sell capacity
from each other
Lec26.76
Introspective Computing
• Biological analogs for computer systems:
– Continuous adaptation
– Insensitivity to design flaws
• Both hardware and software
• Necessary if we can never be sure that all components
are working properly…
[Diagram: Compute → Monitor → Adapt cycle]
• Examples:
– ISTORE -- applies introspective
computing to disk storage
– DynaComp -- applies introspective
computing at chip level
• Compiler always running and part of execution!
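The Compute/Monitor/Adapt cycle can be sketched as a tiny control loop. Everything here is a hypothetical illustration (the stats keys, the error threshold, and "add redundancy" as the adaptation) rather than anything from ISTORE or DynaComp themselves.

```python
# Minimal sketch of the Compute/Monitor/Adapt cycle. All names and
# thresholds are hypothetical illustrations.

def introspective_step(workload, stats, threshold=0.1):
    """One iteration: compute, monitor an error metric, adapt a knob."""
    result = sum(workload)                  # Compute: do the real work
    ops = max(stats.get("ops", 1), 1)
    error_rate = stats.get("errors", 0) / ops          # Monitor
    if error_rate > threshold:              # Adapt: e.g. raise redundancy
        stats["redundancy"] = stats.get("redundancy", 1) + 1
    stats["ops"] = stats.get("ops", 0) + 1
    return result

stats = {"errors": 5, "ops": 10}
total = introspective_step([1, 2, 3], stats)   # error rate 0.5 > 0.1, so adapt
```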
12/7/01
©UCB Fall 2001
CS152 / Kubiatowicz
Lec26.77
and why not
° multiprocessors on a chip?
° complete systems on a chip?
• memory + processor + I/O
° computers in your credit card?
° networking in your kitchen? car?
° eye tracking input devices?
° Wearable computers
° Intelligent books (made from paper!)
Lec26.78
Learned from Cal/CS152?
Lec26.79
CS152: So what's in it for me? (from 1st lecture)
° In-depth understanding of the inner-workings of
modern computers, their evolution, and trade-offs
present at the hardware/software boundary.
• Insight into fast/slow operations that are easy/hard to
implement in hardware
° Experience with the design process in the context of
a large complex (hardware) design.
• Functional Spec --> Control & Datapath --> Physical implementation
• Modern CAD tools
° Designer's "Intellectual" toolbox.
Lec26.80
Simulate Industrial Environment (from 1st lecture)
° Project teams must have at least 4 members
• Managers have value
° Communicate with colleagues (team members)
• What have you done?
• What answers do you need from others?
• You must document your work!!!
• Everyone must keep an on-line notebook
° Communicate with supervisor (TAs)
• How is the team’s plan?
• Short progress reports are required:
- What is the team’s game plan?
- What is each member’s responsibility?
Lec26.81
So let’s thank those TAs
Lec26.82
Summary: Things we Hope You Learned from 152
° Keep it simple and make it work:
• Fully test everything individually & then together;
things break when put together
• Retest everything whenever you make any changes
• Last minute changes are big “no nos”
° Group dynamics. Communication is the key to
success:
• Be open with others about your expectations & your problems (e.g., trip)
• Everybody should be there at design meetings when key decisions
are made and jobs are assigned
° Planning is very important (“plan your life; live your
plan”):
• Promise what you can deliver; deliver more than you promise
• Murphy’s Law: things DO break at the last minute
- DON’T make your plan based on the best case scenarios
- Freeze your design and don’t make last minute changes
° Never give up! It is not over until you give up (“Bear
won’t die”)
Lec26.83
The End!
Lec26.84