Transcript of lecture slides (PowerPoint)

CS 152
Computer Architecture and Engineering
Lecture 26
The Final Chapter
A whirlwind retrospective on the term
0
May 10, 2004
John Kubiatowicz (www.cs.berkeley.edu/~kubitron)
lecture slides: http://inst.eecs.berkeley.edu/~cs152/
What IS Quantum Computing?
5/10/04
©UCB Spring 2004
CS152 / Kubiatowicz
Lec26.2
Can we Use Quantum Mechanics to Compute?
• Weird properties of quantum mechanics?
– You’ve already seen one: tunneling of electrons through
insulators to make TMJ RAM
– Quantization: Only certain values or orbits are good
• Remember orbitals from chemistry???
– Superposition: Schizophrenic physical elements don’t quite
know whether they are one thing or another
• All existing digital abstractions try to eliminate QM
– Transistors/Gates designed with classical behavior
– Binary abstraction: a “1” is a “1” and a “0” is a “0”
• Quantum Computing:
Use of Quantization and Superposition to compute.
• Interesting results:
– Shor’s algorithm: factors in polynomial time!
– Grover’s algorithm: Finds items in unsorted database in time
proportional to square-root of n.
Quantization: Use of “Spin”
[Figure: spin-1/2 particle (proton/electron) between North and South magnetic poles; representation: |0> or |1>]
• Particles like Protons have an intrinsic “Spin” when defined with
respect to an external magnetic field
• Quantum effect gives “1” and “0”:
– Spin is either “UP” or “DOWN”, nothing in between
Kane Proposal II (First one didn’t quite work)
[Figure: silicon substrate with phosphorus impurity atoms beneath single-spin control gates and inter-bit control gates]
• Bits Represented by combination of proton/electron spin
• Operations performed by manipulating control gates
– Complex sequences of pulses perform NMR-like operations
• Temperature < 1 Kelvin!
Now add Superposition!
• The bit can be in a combination of “1” and “0”:
– Written as: |ψ> = C0|0> + C1|1>
– The C’s are complex numbers!
– Important constraint: |C0|^2 + |C1|^2 = 1
• If we measure the bit to see what it looks like:
– With probability |C0|^2 we will find |0> (say “UP”)
– With probability |C1|^2 we will find |1> (say “DOWN”)
• Is this a real effect? Options:
– This is just statistical: given a large number of protons, a fraction of them (|C0|^2) are “UP” and the rest are “DOWN”.
– This is a real effect, and the proton is really both things until you try to look at it
• Reality: second choice! There are experiments to prove it!
Implications: A register can have many values
• Implications of superposition:
– An n-bit register can hold 2^n values simultaneously!
– 3-bit example:
|ψ> = C000|000> + C001|001> + C010|010> + C011|011> + C100|100> + C101|101> + C110|110> + C111|111>
• Probabilities of measuring all bits are set by coefficients:
– So, prob of getting |000> is |C000|^2, etc.
– Suppose we measure only one bit (the first):
• We get a “0” with probability: P0 = |C000|^2 + |C001|^2 + |C010|^2 + |C011|^2
Result: |ψ> = (1/√P0)·(C000|000> + C001|001> + C010|010> + C011|011>)
• We get a “1” with probability: P1 = |C100|^2 + |C101|^2 + |C110|^2 + |C111|^2
Result: |ψ> = (1/√P1)·(C100|100> + C101|101> + C110|110> + C111|111>)
• Problem: Don’t want environment to measure before ready!
– Solution: Quantum Error Correction Codes!
Model? Operations on coefficients + measurements
[Figure: Input (complex state) → Unitary Transformations → Measure → Output (classical answer)]
• Basic Computing Paradigm:
– Input is a register with a superposition of many values
• Possibly all 2^n inputs equally probable!
– Unitary transformations compute on coefficients
• Must maintain probability property (sum of squares = 1)
• Looks like doing computation on all 2^n inputs simultaneously!
– Output is one result attained by measurement
• If you do this poorly, it is just like probabilistic computation:
– If 2^n inputs are equally probable, there may be 2^n outputs equally probable.
– After measuring, it is as if you picked a random input to a classical function!
– All interesting results have some form of “Fourier transform” computation being done in the unitary transformation
Some Issues in building quantum computer
• What are the bits and how do we manipulate them?
– NMR computation: use “cup of liquid”.
• Use nuclear spins (special protons on complex molecules).
• Manipulate with radio-frequencies
• IBM Has produced a 7-bit computer
– Silicon options (more scalable)
• Impurity Phosphorus in silicon
• Manipulate through electrons (including measurement)
• Still serious noise/fabrication issues
– Other options:
• Optical (Phases of photons represent bits)
• Single ions trapped in magnetic fields
• How do we prevent the environment from “Measuring”?
– Make spins as insulated from environment as possible
– Quantum Error Correction!
• Where do we get “clean” bits (i.e. unsuperposed |0> or |1>)?
– Entropy exchange unit:
• Radiates heat to environment (entropy)
• Produces clean bits (COLD) to enter into device
ION Trap Quantum Computer: Promising technology
[Figure: top and cross-sectional views of the trap electrodes]
• IONS of Be+ trapped in an oscillating quadrupole field
– Internal electronic modes of IONS
used for quantum bits
– MEMs technology
– Target? 50,000 ions
– ROOM Temperature!
• Ions moved to interaction regions
– Ion interactions with one another mediated by lasers
Proposal: NIST group
DNA Computing
• Can we use DNA to do massive
computations?
– Organisms do it
– DNA has an extremely high
information density:
• 4 different base pairs:
– Adenine/Thymine
– Guanine/Cytosine
• Always paired up on opposite strands => energetically favorable
– Active operations:
• Copy: split the strands of one DNA molecule apart in the proper solution; gain 2 copies
• Concatenate: e.g., GTAATCCT will cause XXXXXCATT to combine with AGGAYYYYY
• Polymerase Chain Reaction (PCR):
greatly amplifies region of molecule
between two marker molecules
(Figure: ©1999 Access Excellence @ the National Health Museum)
DNA Computing and the Hamiltonian Path
• Given a set of cities and costs
between them (possibly directed
paths):
– Find shortest path
– Simpler: find single path that visits all cities
• DNA Computing example is latter version:
– Every city represented by unique 20 base-pair strand
– Every path between cities represented by complementary
pairs: 10 pairs from source city, 10 pairs from destination
– Shorter example:
AAGT for city 1, TTCG for city 2
Path 1->2: CAAA
Will build: AAGTTTCG
..CAAA..
– Dump “city molecules” and “path molecules” into a test tube. Select and amplify paths of the right length. Analyze for result.
– Been done for 7 cities! (Adleman, 1994)
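The complementary-pairing encoding above can be sketched in Python. The city strands below are made-up 8-base examples rather than Adleman's actual 20-base strands, and for simplicity the complement here keeps the same reading order (real DNA strands pair anti-parallel).

```python
# Sketch of the Adleman-style encoding with hypothetical 8-base "cities".
COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Watson-Crick base complement (strand anti-parallelism ignored)."""
    return "".join(COMP[b] for b in strand)

def path_molecule(src, dst):
    """An edge strand complementary to the back half of the source city and
    the front half of the destination city, so in solution it 'splints'
    the two city strands into one longer molecule."""
    half = len(src) // 2
    return complement(src[half:] + dst[:half])

city1, city2 = "AAGTGGTC", "TTCGAACC"   # hypothetical city strands
edge = path_molecule(city1, city2)
# The edge strand pairs with the junction of the concatenated cities:
junction = (city1 + city2)[len(city1) // 2 : len(city1) + len(city2) // 2]
assert complement(edge) == junction
```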
Even more promising uses of DNA
• Self-assembly of components
[Figure: components with active regions bonded to a DNA substrate]
– DNA serves as substrate
– Attach active elements in the middle of components
– Final step: metal deposited over DNA
• Other interesting structures could be built
Administrivia
• Oral reports Thursday:
– 1:00 – 3:00 in 306 Soda
– Remember: talk is 20 minutes + 5 minutes questions
• Don’t bring more than 10 slides!!!
• Practice! Your final project grade will depend partially on
your oral report.
• Everyone should participate in talk
– Sheet on my door
– Everyone should show up for these 2 hours (short!)
– Plan on cheering on your classmates (ask good
questions)
• Finally: Final contest when?
– Friday?
– Monday? (My original plan)
Where have we been?
[Figure montage of the term's topics: the Booth-encoded multiplier datapath (multiplicand register, 34-bit ALU, Booth encoder, HI/LO registers, control logic); single/multicycle datapaths; the pipelined IFetch/Dcd/Exec/Mem/WB stages; the CPU vs. DRAM “Moore's Law” graph (µProc 60%/yr (2x/1.5yr), DRAM 9%/yr (2x/10 yrs), processor-memory performance gap grows 50%/year); pipelining; memory systems; I/O. CS152 Spring '04]
The Big Picture
° Since 1946 all computers have had 5 components
[Figure: Processor (Control + Datapath), Memory, Input, Output]
What is “Computer Architecture”?
Application
Operating System
Compiler / Firmware
Instruction Set Architecture (Instr. Set Proc., I/O system)
Datapath & Control
Digital Design
Circuit Design
Layout
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
Performance and Technology Trends
[Graph: performance (log scale) vs. year, 1965-2000: supercomputers, mainframes, and minicomputers are overtaken by microprocessors]
• Technology power: 1.2 x 1.2 x 1.2 ≈ 1.7x / year
– Feature size: shrinks 10% / yr => switching speed improves 1.2x / yr
– Density: improves 1.2x / yr
– Die area: 1.2x / yr
• One lesson of RISC is to keep the ISA as simple as possible:
– Shorter design cycle => fully exploit the advancing technology (~3yr)
– Advanced branch prediction and pipeline techniques
– Bigger and more sophisticated on-chip caches
Examples of “Moore’s Law’s”
• Processor
– Logic capacity: about 30% per year
– Clock rate: about 20% per year
– Performance: about 50-60% per year (2x in 18 months)
• Memory
– DRAM capacity: about 60% per year (4x every 3 years)
– Memory speed: about 10% per year
– Cost per bit: improves about 25% per year
• Disk
– capacity: about 60% per year
Instruction Set Architecture (subset of Computer Arch.)
... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logic design, and the physical implementation.
– Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Set
-- Instruction Formats
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
Instruction Set Design
[Figure: the instruction set is the interface between software and hardware]
Measurement and Evaluation
Architecture is an iterative process
-- searching the space of possible designs
-- at all levels of computer systems
[Figure: Design / Analysis loop driven by Creativity and Cost/Performance Analysis; ideas are sorted into Good, Mediocre, and Bad]
You must be willing to throw out Bad Ideas!!
One of most important aspects of design: TEST
• Think about testing from beginning of design
• Well over 50% of modern teams devoted to testing
• Verilog Test benches: monitoring hardware to aid
debugging:
– Include assert statements to check for “things that should
never happen”
[Figure: complete top-level design wrapped in a test bench: the Device Under Test is driven by inline vectors or file I/O (for patterns or output diagnostics) and watched by an inline monitor with assert statements and output in readable format (disassembly)]
Basis of Evaluation
Pros and cons, from most to least representative:
• Actual Target Workload. Pros: representative. Cons: very specific; non-portable; difficult to run or measure; hard to identify cause.
• Full Application Benchmarks. Pros: portable; widely used; improvements useful in reality. Cons: less representative.
• Small “Kernel” Benchmarks. Pros: easy to run, early in design cycle. Cons: easy to “fool”.
• Microbenchmarks. Pros: identify peak capability and potential bottlenecks. Cons: “peak” may be a long way from application performance.
Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

ExTime(with E) = ((1 - F) + F/S) x ExTime(without E)

Speedup(with E) = 1 / ((1 - F) + F/S)
Performance Evaluation Summary
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
° Time is the measure of computer performance!
° Remember Amdahl’s Law:
Speedup is limited by unimproved part of program
° Good products created when have:
• Good benchmarks
• Good ways to summarize performance
° If you do NOT have good benchmarks and summaries, then the choice is between:
1) improving the product for real programs
2) changing the product to get more sales (sales almost always wins)
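The CPU-time product above, with hypothetical numbers plugged in (instruction count, CPI, and clock rate are made-up examples):

```python
def cpu_time(instructions, cpi, clock_hz):
    """The slide's equation: Seconds/Program =
    (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)."""
    return instructions * cpi / clock_hz

# Hypothetical program: 10^9 instructions, CPI of 1.5, 500 MHz clock
t = cpu_time(1e9, 1.5, 500e6)
assert t == 3.0   # seconds
```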
Computer Arithmetic
• Bits have no inherent meaning: operations determine whether they are really ASCII characters, integers, or floating point numbers
• Hardware algorithms for arithmetic:
– Carry lookahead/carry save addition
– Multiplication and divide.
– Booth algorithms
• Divide uses same hardware as multiply (Hi & Lo registers
in MIPS)
• Floating point follows paper & pencil method of scientific
notation
– using integer algorithms for multiply/divide of significands
Carry Look Ahead (Design trick: peek)
C0 = Cin
C1 = G0 + C0·P0
C2 = G1 + G0·P1 + C0·P0·P1
C3 = G2 + G1·P2 + G0·P1·P2 + C0·P0·P1·P2
C4 = . . .

[Figure: a chain of per-bit cells, each taking Ai and Bi and producing G, P, and the sum bit S]

Carry-out as a function of the inputs:
A B | C-out
0 0 | 0       “kill”
0 1 | C-in    “propagate”
1 0 | C-in    “propagate”
1 1 | 1       “generate”

G = A and B
P = A xor B
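The lookahead equations can be checked exhaustively with a small Python sketch (a 4-bit slice; variable names are mine, not from the slide):

```python
def cla4(a, b, c0):
    """4-bit carry lookahead: compute every carry directly from the
    generate/propagate signals rather than rippling through full adders."""
    g = [(a >> i & 1) & (b >> i & 1) for i in range(4)]   # generate:  A and B
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(4)]   # propagate: A xor B
    c = [c0]
    for i in range(4):
        # C_{i+1} = G_i + P_i * C_i, which expands into the flattened
        # sum-of-products forms above (C2 = G1 + G0 P1 + C0 P0 P1, ...)
        c.append(g[i] | (p[i] & c[i]))
    s = [p[i] ^ c[i] for i in range(4)]                   # sum bit per position
    return sum(bit << i for i, bit in enumerate(s)), c[4] # 4-bit sum, carry-out

# Exhaustive check against ordinary addition
for a in range(16):
    for b in range(16):
        for cin in (0, 1):
            s, cout = cla4(a, b, cin)
            assert (cout << 4) | s == a + b + cin
```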
MULTIPLY HARDWARE Version 3
• 32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg (shift right), (0-bit Multiplier reg)
[Figure: 32-bit Multiplicand register feeding a 32-bit ALU; the 64-bit Product register (“HI”/“LO”) holds the multiplier and shifts right; control logic drives the write]
Divide can use same hardware!
Booth’s Algorithm Insight
Annotated bit string 0 1 1 1 1 0: reading right to left, the first 1 is the beginning of the run, the 1s after it are the middle, and the 0 past the last 1 marks the end of the run.

Current Bit | Bit to the Right | Explanation         | Example    | Op
1           | 0                | Begins run of 1s    | 0001111000 | sub
1           | 1                | Middle of run of 1s | 0001111000 | none
0           | 1                | End of run of 1s    | 0001111000 | add
0           | 0                | Middle of run of 0s | 0001111000 | none

Originally for speed (when shift was faster than add)
• Replace a string of 1s in the multiplier with an initial subtract when we first see a one and then later an add for the bit after the last one:
01111 = 10000 - 1
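The recoding rules in the table translate directly into code. This is a sketch of radix-2 Booth recoding for non-negative operands; the function name and default bit width are my choices, not from the slide.

```python
def booth_multiply(multiplicand, multiplier, bits=10):
    """Radix-2 Booth: scan the multiplier LSB-first with a phantom 0 to the
    right; subtract at a (1,0) pair (run of 1s begins), add at a (0,1) pair
    (run of 1s just ended), and do nothing inside a run."""
    product = 0
    prev = 0                              # phantom bit to the right
    for i in range(bits):
        cur = (multiplier >> i) & 1
        if (cur, prev) == (1, 0):         # begin run of 1s -> subtract
            product -= multiplicand << i
        elif (cur, prev) == (0, 1):       # end run of 1s -> add
            product += multiplicand << i
        prev = cur
    if prev == 1:                         # run of 1s ran off the MSB
        product += multiplicand << bits
    return product

# The slide's identity: the run 01111 (= 15) becomes 10000 - 1 = 16 - 1
assert booth_multiply(7, 0b01111) == 7 * 15
# The table's example multiplier, 0001111000 (= 120):
assert booth_multiply(13, 0b0001111000) == 13 * 120
```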
Multiple Cycle Datapath
[Datapath figure: PC, Instruction Reg, Mem Data Reg, Reg File, 32-bit ALU, and Ideal Memory connected through muxes; control signals PCWr, PCWrCond, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, MemtoReg, PCSrc, Zero]
Sequencer-based control unit
[Figure: control logic (inputs from the opcode, outputs to the multicycle datapath) built from a state register plus address select logic; an adder supplies the incremented state]
Types of “branching”:
• Set state to 0
• Dispatch (state 1)
• Use incremented state number
Microprogramming
Label  | ALU   | SRC1 | SRC2    | Dest.  | Memory    | Mem. Reg. | PC Write    | Sequencing
Fetch: | Add   | PC   | 4       |        | Read PC   | IR        | ALU         | Seq
       | Add   | PC   | Extshft |        |           |           |             | Dispatch
Rtype: | Func  | rs   | rt      |        |           |           |             | Seq
       |       |      |         | rd ALU |           |           |             | Fetch
Ori:   | Or    | rs   | Extend0 |        |           |           |             | Seq
       |       |      |         | rt ALU |           |           |             | Fetch
Lw:    | Add   | rs   | Extend  |        |           |           |             | Seq
       |       |      |         |        | Read ALU  |           |             | Seq
       |       |      |         | rt MEM |           |           |             | Fetch
Sw:    | Add   | rs   | Extend  |        |           |           |             | Seq
       |       |      |         |        | Write ALU |           |             | Fetch
Beq:   | Subt. | rs   | rt      |        |           |           | ALUoutCond. | Fetch
Precise Interrupts
• Precise  state of the machine is preserved as if program
executed up to the offending instruction
– All previous instructions completed
– Offending instruction and all following instructions act as if they have
not even started
– Same system code will work on different implementations
– Position clearly established by IBM
– Difficult in the presence of pipelining, out-of-order execution, ...
– MIPS takes this position
• Imprecise  system software has to figure out what is where
and put it all back together
• Performance goals often lead designers to forsake precise
interrupts
– system software developers, user, markets etc. usually wish they had
not done this
• Modern techniques for out-of-order execution and branch
prediction help implement precise interrupts
Recap: Pipelining Lessons (it’s intuitive!)
[Figure: laundry analogy: four loads (A-D) overlapped in time starting at 6 PM, each stage taking 30 minutes]
° Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
° Multiple tasks operate simultaneously using different resources
° Potential speedup = number of pipe stages
° Pipeline rate limited by slowest pipeline stage
° Unbalanced lengths of pipe stages reduce speedup
° Time to “fill” the pipeline and time to “drain” it reduce speedup
° Stall for dependences
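The fill/drain and slowest-stage lessons above can be quantified with a toy model (stage times in minutes, following the laundry analogy; the numbers are made-up):

```python
def pipelined_time(n_tasks, stage_times):
    """Total time for n tasks through a pipeline whose clock is set by the
    slowest stage: (stages - 1) cycles to fill, then one task per cycle."""
    clock = max(stage_times)                  # rate limited by slowest stage
    return (len(stage_times) - 1 + n_tasks) * clock

def unpipelined_time(n_tasks, stage_times):
    return n_tasks * sum(stage_times)

stages = [30, 30, 30, 30]                     # four balanced 30-minute stages
# With only 4 tasks, fill/drain keeps us well below the 4x ideal:
speedup = unpipelined_time(4, stages) / pipelined_time(4, stages)
assert round(speedup, 3) == round(480 / 210, 3)

# As the workload grows, speedup approaches the number of stages:
big = unpipelined_time(10**6, stages) / pipelined_time(10**6, stages)
assert 3.99 < big < 4.0
```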
Why Pipeline? Because the resources are there!
[Figure: five instructions (Inst 0-4) overlapped in time, each flowing through Im (instruction memory), Reg, ALU, Dm (data memory), Reg; in any one cycle every resource is busy with a different instruction]
Can pipelining get us into trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two
different ways at the same time
• E.g., combined washer/dryer would be a structural hazard or
folder busy doing something else (watching TV)
– data hazards: attempt to use item before it is ready
• E.g., one sock of pair in dryer and one in washer; can’t fold
until get sock from washer through dryer
• instruction depends on result of prior instruction still in the
pipeline
– control hazards: attempt to make a decision before a condition is evaluated
• E.g., washing football uniforms and need to get proper
detergent level; need to see after dryer before next load in
• branch instructions
• Can always resolve hazards by waiting
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards
Resolve RAW by “forwarding” (or bypassing)
[Figure: 5-stage datapath (IAU/npc, I-mem, Regs, forwarding muxes into the A/B/im latches, ALU, D-mem, Regs) with bypass paths from the later pipeline registers back to the ALU inputs]
• Detect the nearest valid write-op operand register and forward into the op latches, bypassing the remainder of the pipe
• Increase muxes to add paths from pipeline registers
• Data Forwarding = Data Bypassing
Data Stationary Control
• The Main Control generates the control signals during
Reg/Dec
– Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
– Control signals for Mem (MemWr, Branch) are used 2 cycles later
– Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: control signals latched down the pipe in the ID/Ex, Ex/Mem, and Mem/Wr registers alongside the data, each stage consuming its own subset (ExtOp/ALUSrc/ALUOp/RegDst at Exec; MemWr/Branch at Mem; MemtoReg/RegWr at Wr)]
Exceptions in a 5 stage pipeline
[Figure: four overlapped instructions; faults arise at different stages (Inst TLB fault in IFetch, bad instruction in Dcd, overflow in Exec, Data TLB fault in Mem), out of program-flow order]
• Use pipeline to sort this out!
– Pass exception status along with instruction.
– Keep track of PCs for every instruction in pipeline.
– Don’t act on exception until it reaches the WB stage
• Handle interrupts through “faulting noop” in IF stage
• When instruction reaches WB stage:
– Save PC  EPC, Interrupt vector addr  PC
– Turn all instructions in earlier stages into noops!
You Did It!
• You all built a working 5-stage pipeline!
(Give yourselves a hand!)
Out of order execution: Tomasulo Organization
[Figure: FP Op Queue and FP Registers feeding reservation stations (Add1-3, Mult1-2) in front of the FP adders and multipliers; six load buffers from memory and store buffers to memory; results broadcast on the Common Data Bus (CDB)]
Tomasulo With Reorder buffer (ROB)
[Figure: Tomasulo with a Reorder Buffer (ROB) between the FP Op Queue and the registers. ROB entries, oldest to newest: LD F0,10(R2) (not done, ROB1); ADDD F10,F4,F0 (not done, ROB2); DIVD F2,F10,F6 (not done, ROB3); BNE F2,<…> (not done, ROB5); LD F4,0(R3) (done, value M[10], ROB5); ADDD F0,F4,F6 (executing, ROB6); ST 0(R3),F0 (done, <val2>, ROB7). Reservation stations hold ROB tags such as ROB1/ROB2 in place of operands that are not yet available]
Prediction: Branches, Dependencies, Data
• Prediction has become essential to getting good
performance from scalar instruction streams.
• Architects are now predicting everything:
data dependencies, actual data, and results of groups of
instructions:
– At what point does computation become a probabilistic
operation + verification?
– We are pretty close with control hazards already…
• Why does prediction work?
– Underlying algorithm has regularities.
– Data that is being operated on has regularities.
– Instruction sequence has redundancies that are artifacts of
way that humans/compilers think about problems.
• Prediction  Compressible information streams?
How can the machine exploit available ILP?
Technique | Limitation
° Pipelining | Issue rate, FU stalls, FU depth
° Super-pipeline (issue 1 instr. per (fast) cycle; IF takes multiple cycles) | Clock skew, FU stalls, FU depth
° Super-scalar (issue multiple scalar instructions per cycle) | Hazard resolution
° VLIW (each instruction specifies multiple scalar operations) | Packing, Compiler
[Pipeline stage diagrams (IF D Ex M W) omitted]
Pentium 4 Block Diagram
[Figure: Pentium 4 block diagram]
Pentium-4 die floor plan
[Die photo: Pentium 4 floor plan, with the L1 Dcache and L2 cache marked]
Processor-DRAM Gap (latency)
[Graph: performance (log scale) vs. year, 1980-2000. CPU (“Moore's Law”) improves 60%/yr; DRAM improves 7%/yr; the processor-memory performance gap grows 50% / year]
Levels of the Memory Hierarchy
Level (upper to lower) | Capacity | Access Time / Cost | Staging Xfer Unit | Managed by
CPU Registers | 100s bytes | < 2 ns | Instr. operands, 1-8 bytes | prog./compiler
Cache | K bytes SRAM | 2-100 ns, $.01-.001/bit | Blocks, 8-128 bytes | cache cntl
Main Memory | M bytes DRAM | 100 ns - 1 µs, $.01-.001 | Pages, 512-4K bytes | OS
Disk | G bytes | ms, 10^-4 - 10^-3 cents | Files, Mbytes | user/operator
Tape | infinite | sec-min, 10^-6 | |

Upper levels are smaller and faster; lower levels are larger and cheaper per bit.
Memory Hierarchy
° The Principle of Locality:
• Programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
° Three Major Categories of Cache Misses:
• Compulsory misses: sad facts of life (e.g., cold-start misses).
• Conflict misses: remedy by increasing cache size and/or associativity.
• Capacity misses: remedy by increasing cache size.
° Virtual Memory invented as another level of the hierarchy
– Today VM allows many processes to share single memory
without having to swap all processes to disk, protection
more important
– TLBs are important for fast translation/checking
Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in
parallel
– Data is selected based on the tag result
[Figure: two-way set-associative lookup. The Cache Index selects one set; the two cache tags are compared with the address tag in parallel; the valid-qualified compare results OR together into Hit and drive the Sel1/Sel0 mux that picks the cache block]
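A minimal sketch of the two-way lookup described above. The LRU replacement policy, class name, and set count are my assumptions; the slide specifies only the parallel tag compare and block select.

```python
N_SETS = 4   # assumed tiny cache for illustration

class TwoWayCache:
    def __init__(self):
        # each set holds up to two (tag, data) entries, most recently used last
        self.sets = [[] for _ in range(N_SETS)]

    def access(self, addr, data=None):
        index, tag = addr % N_SETS, addr // N_SETS   # index selects the set
        ways = self.sets[index]
        for i, (t, d) in enumerate(ways):            # both tags compared "in parallel"
            if t == tag:
                ways.append(ways.pop(i))             # refresh LRU order
                return True, d                       # hit: mux selects this block
        if len(ways) == 2:
            ways.pop(0)                              # evict least recently used way
        ways.append((tag, data))
        return False, None                           # miss; line now filled

cache = TwoWayCache()
# Two addresses mapping to the same set coexist (unlike direct-mapped):
hit, _ = cache.access(0, "a"); assert not hit
hit, _ = cache.access(4, "b"); assert not hit        # same index, different tag
hit, d = cache.access(0);      assert hit and d == "a"
```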
Static RAM Cell
6-Transistor SRAM Cell
[Figure: cross-coupled inverters holding 0/1, accessed through the word line (row select) and the bit / bit-bar lines; in some arrays bit-bar is replaced with a pullup to save area]
• Write:
1. Drive bit lines (bit = 1, bit-bar = 0)
2. Select row
• Read:
1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit-bar
1-Transistor Memory Cell (DRAM)
[Figure: one access transistor gated by row select, a storage capacitor, and the bit line]
• Write:
1. Drive bit line
2. Select row
• Read:
1. Precharge bit line to Vdd
2. Select row
3. Cell and bit line share charge
• Very small voltage change on the bit line
4. Sense (fancy sense amp)
• Can detect changes of ~1 million electrons
5. Write: restore the value
• Refresh:
1. Just do a dummy read to every cell
DRAM Capacitors: more capacitance in a small area
• Trench capacitors:
– Logic ABOVE capacitor
– Gain in surface area of capacitor
– Better scaling properties
• Stacked capacitors:
– Logic BELOW capacitor
– Gain in surface area of capacitor
– Better planarization
Computers in the news: Tunneling Magnetic Junction
A Three-Bus System (+ backside cache)
[Figure: processor with a backside cache bus to the L2 cache; a processor-memory bus to memory; bus adaptors hang I/O buses off the processor-memory bus]
• A small number of backplane buses tap into the processor-memory bus
– The processor-memory bus is used only for processor-memory traffic
– I/O buses are connected to the backplane bus
• Advantage: loading on the processor bus is greatly reduced
Main components of Intel Chipset: Pentium II/III
• Northbridge:
– Handles memory
– Graphics
• Southbridge: I/O
– PCI bus
– Disk controllers
– USB controllers
– Audio
– Serial I/O
– Interrupt controller
– Timers
Disk Device Terminology
Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time

Order-of-magnitude times for 4 KB transfers:
Average seek: 8 ms or less
Rotate: 4.2 ms @ 7200 rpm
Xfer: 1 ms @ 7200 rpm
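The latency breakdown above as a small Python model. Queueing and controller time are set to zero in the example, and the average rotation time is taken to be half a revolution, which reproduces the slide's 4.2 ms figure at 7200 rpm.

```python
def disk_latency_ms(queue_ms, controller_ms, seek_ms, rpm, xfer_ms):
    """Disk latency = queueing + controller + seek + rotation + transfer.
    Average rotational delay is modeled as half a revolution."""
    rotation_ms = 0.5 * 60_000.0 / rpm      # half a revolution, in ms
    return queue_ms + controller_ms + seek_ms + rotation_ms + xfer_ms

# The slide's order-of-magnitude numbers for a 4 KB transfer at 7200 rpm:
t = disk_latency_ms(queue_ms=0, controller_ms=0, seek_ms=8, rpm=7200, xfer_ms=1)
assert abs(t - 13.17) < 0.01     # ~4.17 ms rotate, matching "4.2 ms @ 7200 rpm"
```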
Disk I/O Performance
Metrics: response time, throughput
[Graph: response time (ms) rises steeply as throughput approaches 100% of total bandwidth]
[Figure: Proc -> Queue -> IOC -> Device]
Response time = Queue + Device Service time
A Little Queuing Theory: M/M/1 queues
[Figure: Proc -> Queue -> server (IOC, Device); the queue plus server form the system]
• Described as “memoryless” or Markovian request arrival (M for C = 1, exponentially random), 1 server: M/M/1 queue
• When service times have C = 1, M/M/1 queue:
Tq = Tser x u / (1 - u)
Tser = average time to service a customer
u = server utilization (0..1): u = λ x Tser
Tq = average time/customer in queue
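The M/M/1 formula above in code; the service time and arrival rates below are made-up examples chosen to show how queueing delay explodes as utilization nears 1.

```python
def mm1_queue_time(arrival_rate, t_ser):
    """M/M/1: Tq = Tser * u / (1 - u), with utilization u = lambda * Tser."""
    u = arrival_rate * t_ser
    assert u < 1.0, "queue is unstable at utilization >= 1"
    return t_ser * u / (1.0 - u)

# 10 ms service time (hypothetical device):
assert round(mm1_queue_time(50.0, 0.010), 4) == 0.01    # u = 0.5 -> Tq = Tser
assert round(mm1_queue_time(90.0, 0.010), 4) == 0.09    # u = 0.9 -> 9x Tser
```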
What does the future hold?
Forces on Computer Architecture
[Figure: Computer Architecture pulled on by Technology, Applications, Programming Languages, Operating Systems, and History (A = F / M)]
Today: building materials prevalent
• Originally: worried about squeezing the last ounce of
performance from limited resources
• Today: worried about an abundance (embarrassment) of
riches?
– Billions of transistors on a chip (17nm Yeah!)
– Microprocessor Report articles wondering if all the lessons of
RISC are now irrelevant
• Moore’s laws: exponential growth of everything
– Transistors, Performance, Disk Space, Memory Size
• So, what matters any more????
Key Technologies
° Fast, cheap, highly integrated “computers-on-a-chip”
• IDT R4640, NEC VR4300, StrongARM, Superchips
• Devices everywhere!
° Micromechanical Devices (MEMS)
• Integrated sensors everywhere
° Ubiquitous access to fast networks
• Network is everywhere!
• ISDN, Cable Modems, ATM, Metrocom (wireless) . .
° Platform independent programming languages
• Java, JavaScript, Visual Basic Script
° Lightweight Operating Systems
• GEOS, NCOS, RISCOS
° ???
Issues for Research
• Complexity:
– more than 50% of design teams now for verification
• Power
– Processor designs hampered in performance to keep from
melting
– Why is there a 3 or 4 order-of-magnitude difference in power consumption between custom hardware and general von Neumann architectures?
• Energy
– Portable devices
• Scalability, Reliability, Maintainability
– How to keep services up 24x7?
• Performance (“Cost conscious”)
– how to get performance without a lot of power, complexity, etc.
• Security?
– What are the consequences of distributed info to privacy?
– What are moral implications of ubiquitous sensors???
Learned from Cal/CS152?
CS152: So what's in it for me? (from 1st lecture)
° In-depth understanding of the inner-workings of
modern computers, their evolution, and trade-offs
present at the hardware/software boundary.
• Insight into fast/slow operations that are easy/hard to implement in hardware
° Experience with the design process in the context of
a large complex (hardware) design.
• Functional Spec --> Control & Datapath --> Physical implementation
• Modern CAD tools
° Designer's "Intellectual" toolbox.
Simulate Industrial Environment (from 1st lecture)
° Project teams must have at least 4 members
• Managers have value
° Communicate with colleagues (team members)
• What have you done?
• What answers do you need from others?
• You must document your work!!!
• Everyone must keep an on-line notebook
° Communicate with supervisor (TAs)
• How is the team’s plan?
• Short progress reports are required:
- What is the team’s game plan?
- What is each member’s responsibility?
So let’s thank those TAs
Summary: Things we Hope You Learned from 152
° Keep it simple and make it work:
• Fully test everything individually & then together; things break when put together
• Retest everything whenever you make any changes
• Last minute changes are big “no nos”
° Group dynamics. Communication is the key to
success:
• Be open with others of your expectations & your problems (e.g., trip)
• Everybody should be there on design meetings when key decisions
are made and jobs are assigned
° Planning is very important (“plan your life; live your
plan”):
• Promise what you can deliver; deliver more than you promise
• Murphy’s Law: things DO break at the last minute
- DON’T make your plan based on the best case scenarios
- Freeze your design and don’t make last minute changes
° Never give up! It is not over until you give up (“Bear
won’t die”)
The End!