
CS 152
Computer Architecture and Engineering
Lecture 25
The Final Chapter
A whirlwind retrospective on the term
0
December 1, 1999
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
Outline of Today’s Lecture
° Recap: What was covered in lectures (45 minutes)
° Questions and Administrative Matters (2 minutes)
° Future of Computer Architecture and Engineering (15 minutes)
° Lessons from CS 152 (10 minutes)
° HKN evaluation of teaching staff (15 minutes)
Where have we been?
[Figure collage of the term's topics: single-cycle and multicycle datapaths, the double-bit Booth multiplier datapath, the five-stage pipeline (IFetch, Dcd, Exec, Mem, WB), the processor-memory performance gap plot ("Moore's Law": µProc 60%/yr vs. DRAM 9%/yr, gap grows 50%/yr), memory systems, and I/O]
The Big Picture
° Since 1946 all computers have had 5 components
[Figure: the five classic components — Control and Datapath (the Processor), Memory, Input, and Output]
What is “Computer Architecture”?
[Figure: levels of abstraction — Application, Operating System, Compiler, Firmware, Instr. Set Proc. & I/O system, Datapath & Control, Digital Design, Circuit Design, Layout; the Instruction Set Architecture is the boundary between firmware and hardware]
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
Performance and Technology Trends
[Plot: relative performance (0.1 to 1000, log scale) vs. year, 1965-2000, for supercomputers, mainframes, minicomputers, and microprocessors]
• Technology Power: 1.2 x 1.2 x 1.2 ≈ 1.7x / year
– Feature Size: shrinks 10% / yr. => Switching speed improves 1.2x / yr.
– Density: improves 1.2x / yr.
– Die Area: 1.2x / yr.
• One lesson of RISC is to keep the ISA as simple as possible:
– Shorter design cycle => fully exploit the advancing technology (~3yr)
– Advanced branch prediction and pipeline techniques
– Bigger and more sophisticated on-chip caches
Examples of “Moore’s Law”
• Processor
– logic capacity: about 30% per year
– clock rate: about 20% per year
– Performance: about 50-60% per year (2x in 18 months)
• Memory
– DRAM capacity: about 60% per year (4x every 3 years)
– Memory speed: about 10% per year
– Cost per bit: improves about 25% per year
• Disk
– capacity: about 60% per year
Processor Performance
[Plot: performance vs. year, 1982-1994 — after the RISC introduction, RISC processors improve much faster than the roughly 35%/yr trend of the Intel x86 line]
Instruction Set Architecture (subset of Computer Arch.)
... the attributes of a [computing] system as seen by the
programmer, i.e. the conceptual structure and functional
behavior, as distinct from the organization of the data
flows and controls, the logic design, and the physical
implementation.
– Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Set
-- Instruction Formats
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
Instruction Set Design
[Figure: the instruction set is the interface between software and hardware]
Summary of the Design Process
Hierarchical Design to manage complexity
Top Down vs. Bottom Up vs. Successive Refinement
Importance of Design Representations:
  Block Diagrams
  Decomposition into Bit Slices
  Truth Tables, K-Maps
  Circuit Diagrams
  Other Descriptions: state diagrams, timing diagrams, reg xfer, . . .
[Figure annotation: top-down and bottom-up design meet at the truth-table level, e.g. a mux design]
Optimization Criteria: Area, Gate Count, Delay, [Package Count], Pin Out, Logic Levels, Power, Fan-in/Fan-out, Cost, Design time
Measurement and Evaluation
Architecture is an iterative process
-- searching the space of possible designs
-- at all levels of computer systems
[Figure: Design/Analysis loop — Creativity drives Design; Cost/Performance Analysis sorts the results into Good Ideas, Mediocre Ideas, and Bad Ideas]
You must be willing to throw out Bad Ideas!!
One of the most important aspects of design: TEST
• Think about testing from beginning of design
• Well over 50% of a modern design team is devoted to testing
• VHDL Test benches: monitoring hardware to aid
debugging:
– Include assert statements to check for “things that should
never happen”
[Figure: test-bench setup — the Complete Top-Level Design is the Device Under Test inside a Test Bench; inline vectors or File I/O supply patterns, an Inline Monitor produces output in readable (disassembled) format, assert statements flag errors, and File I/O can also log output diagnostics]
Why should you keep a design notebook?
• Keep track of the design decisions and the reasons behind
them
– Otherwise, it will be hard to debug and/or refine the design
– Write it down so that you can remember it in a long project: 2 weeks -> 2 yrs
– Others can review notebook to see what happened
• Record insights you have on certain aspect of the design
as they come up
• Record of the different design & debug experiments
– Memory can fail when very tired
• Industry practice: learn from others' mistakes
Basis of Evaluation
Actual Target Workload
  Pros: representative
  Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
Full Application Benchmarks
  Pros: portable; widely used; improvements useful in reality
  Cons: less representative
Small “Kernel” Benchmarks
  Pros: easy to run, early in design cycle
  Cons: easy to “fool”
Microbenchmarks
  Pros: identify peak capability and potential bottlenecks
  Cons: “peak” may be a long way from application performance
Amdahl's Law
Speedup due to enhancement E:

  Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction F of the task
by a factor S and the remainder of the task is unaffected; then:

  ExTime(with E) = ((1-F) + F/S) x ExTime(without E)
  Speedup(with E) = 1 / ((1-F) + F/S)
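A minimal C sketch of this formula (the F = 0.4 and S = 10 below are invented example values):

  #include <stdio.h>

  /* Amdahl's Law: Speedup = 1 / ((1 - F) + F/S) */
  static double amdahl_speedup(double F, double S) {
      return 1.0 / ((1.0 - F) + F / S);
  }

  int main(void) {
      double F = 0.4, S = 10.0;   /* invented example: 40% of the task sped up 10x */
      printf("Overall speedup = %.2f\n", amdahl_speedup(F, S));   /* prints 1.56 */
      return 0;
  }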
Performance Evaluation Summary
  CPU time = Seconds / Program
           = Instructions / Program  x  Cycles / Instruction  x  Seconds / Cycle
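The same identity as a small C sketch (the instruction count, CPI, and clock rate are invented example values):

  #include <stdio.h>

  int main(void) {
      double instructions = 2.0e9;      /* dynamic instruction count (assumed) */
      double cpi          = 1.5;        /* average cycles per instruction (assumed) */
      double clock_hz     = 500.0e6;    /* 500 MHz clock (assumed) */

      /* CPU time = Instructions x Cycles/Instruction x Seconds/Cycle */
      double seconds = instructions * cpi / clock_hz;
      printf("CPU time = %.2f s\n", seconds);   /* 2e9 * 1.5 / 500e6 = 6.00 s */
      return 0;
  }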
° Time is the measure of computer performance!
° Remember Amdahl’s Law:
Speedup is limited by unimproved part of program
° Good products are created when you have:
• Good benchmarks
• Good ways to summarize performance
° Without good benchmarks and summaries, the choice is between
1) improving the product for real programs
2) changing the product to get more sales (sales almost always wins)
Integrated Circuit Costs
  Die cost = Wafer cost / (Dies per wafer x Die yield)

  Dies per wafer = pi x (Wafer_diam / 2)^2 / Die_Area  –  pi x Wafer_diam / sqrt(2 x Die_Area)  –  Test dies
                 ~ Wafer Area / Die Area

  Die yield = Wafer yield x { 1 + (Defects_per_unit_area x Die_Area) / alpha } ^ (–alpha)
Die cost goes roughly with the cube of the die area.
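A rough C sketch of these formulas; the wafer cost, diameter, defect density, and alpha below are invented example values (compile with -lm):

  #include <stdio.h>
  #include <math.h>

  int main(void) {
      const double PI = 3.14159265358979;
      double wafer_cost  = 2000.0;   /* $ per wafer (assumed)            */
      double wafer_diam  = 20.0;     /* cm (assumed)                     */
      double die_area    = 1.0;      /* cm^2 (assumed)                   */
      double defect_rate = 1.0;      /* defects per cm^2 (assumed)       */
      double alpha       = 3.0;      /* process complexity (assumed)     */
      double wafer_yield = 1.0;
      double test_dies   = 0.0;

      double dies_per_wafer = PI * (wafer_diam / 2) * (wafer_diam / 2) / die_area
                            - PI * wafer_diam / sqrt(2.0 * die_area)
                            - test_dies;
      double die_yield = wafer_yield * pow(1.0 + defect_rate * die_area / alpha, -alpha);
      double die_cost  = wafer_cost / (dies_per_wafer * die_yield);

      printf("dies/wafer = %.0f, die yield = %.2f, die cost = $%.2f\n",
             dies_per_wafer, die_yield, die_cost);
      return 0;
  }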
Computer Arithmetic
• Bits have no inherent meaning: the operations determine
whether they are really ASCII characters, integers, or floating point
numbers
• Hardware algorithms for arithmetic:
– Carry lookahead/carry save addition
– Multiplication and division
– Booth algorithms
• Divide uses same hardware as multiply (Hi & Lo registers
in MIPS)
• Floating point follows paper & pencil method of scientific
notation
– using integer algorithms for multiply/divide of significands
Carry Look Ahead (Design trick: peek)
C0 = Cin
C1 = G0 + C0 · P0
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2
C4 = . . .

[Figure: a chain of four adder cells, each taking Ai and Bi and producing Si plus its G and P outputs]

Per-bit carry-out behavior:
A = 0, B = 0  =>  C-out = 0      (“kill”)
A = 0, B = 1  =>  C-out = C-in   (“propagate”)
A = 1, B = 0  =>  C-out = C-in   (“propagate”)
A = 1, B = 1  =>  C-out = 1      (“generate”)

G = A and B   (“generate”)
P = A xor B   (“propagate”)
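A small C sketch of a 4-bit carry-lookahead computation using these G/P definitions (the inputs are arbitrary example values):

  #include <stdio.h>

  int main(void) {
      unsigned a = 0xB, b = 0x6, cin = 0;   /* arbitrary 4-bit example inputs */
      unsigned g[4], p[4], c[5];

      for (int i = 0; i < 4; i++) {
          g[i] = ((a >> i) & 1) & ((b >> i) & 1);   /* generate  = A and B */
          p[i] = ((a >> i) & 1) ^ ((b >> i) & 1);   /* propagate = A xor B */
      }
      /* Every carry is computed directly from the inputs -- no ripple */
      c[0] = cin;
      c[1] = g[0] | (c[0] & p[0]);
      c[2] = g[1] | (g[0] & p[1]) | (c[0] & p[0] & p[1]);
      c[3] = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2]) | (c[0] & p[0] & p[1] & p[2]);
      c[4] = g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3]) | (g[0] & p[1] & p[2] & p[3])
                  | (c[0] & p[0] & p[1] & p[2] & p[3]);

      unsigned sum = 0;
      for (int i = 0; i < 4; i++) sum |= (p[i] ^ c[i]) << i;   /* Si = Pi xor Ci */
      printf("sum = 0x%X, carry out = %u\n", sum, c[4]);       /* 0xB + 0x6 = 0x11 */
      return 0;
  }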
MULTIPLY HARDWARE Version 3
• 32-bit Multiplicand reg, 32-bit ALU, 64-bit Product reg (shift right), (0-bit Multiplier reg)
[Figure: the Multiplicand (32 bits) feeds a 32-bit ALU whose result is written into the “HI” half of the 64-bit Product register; the Product register, which also holds the Multiplier in its “LO” half, shifts right under Control]
Divide can use same hardware!
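A C sketch of the same idea in software — an illustration, not the hardware itself — where the multiplier sits in the low half of the product register and the whole register shifts right each step (function and variable names are mine):

  #include <stdio.h>
  #include <stdint.h>

  /* Unsigned 32x32 -> 64-bit multiply, Version-3 style: test Product[0],
     conditionally add the multiplicand into HI, then shift {carry, HI, LO}
     right by one each of the 32 steps. */
  static uint64_t mult_v3(uint32_t multiplicand, uint32_t multiplier) {
      uint32_t hi = 0, lo = multiplier;
      for (int i = 0; i < 32; i++) {
          uint64_t sum = hi;                        /* 33-bit add; carry kept in bit 32 */
          if (lo & 1)                               /* test Product[0]                  */
              sum += multiplicand;
          lo = (lo >> 1) | ((uint32_t)sum << 31);   /* HI's low bit falls into LO[31]   */
          hi = (uint32_t)(sum >> 1);                /* carry shifts back into HI[31]    */
      }
      return ((uint64_t)hi << 32) | lo;
  }

  int main(void) {
      printf("%llu\n", (unsigned long long)mult_v3(123456789u, 987654321u));
      /* expect 121932631112635269 */
      return 0;
  }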
Booth’s Algorithm Insight
0 1 1 1 1 0
(scanning right to left: beginning of run — middle of run — end of run)
Current Bit   Bit to the Right   Explanation           Example       Op
1             0                  Begins run of 1s      0001111000    sub
1             1                  Middle of run of 1s   0001111000    none
0             1                  End of run of 1s      0001111000    add
0             0                  Middle of run of 0s   0001111000    none
Originally for speed (when shift was faster than add)
• Replace a string of 1s in the multiplier with an initial subtract when we first
see a one, and then later an add for the bit after the last one:
  01111 = –1 + 10000
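A C sketch that applies the recoding table above to the example operand (names and output formatting are mine):

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      uint16_t m = 0x78;   /* 0001111000, the example operand from the table */
      int prev = 0;        /* imaginary bit to the right of bit 0 */

      printf("bit  cur prev  op\n");
      for (int i = 0; i < 10; i++) {
          int cur = (m >> i) & 1;
          const char *op = (cur == prev) ? "none"   /* middle of a run             */
                         : (cur == 1)    ? "sub"    /* run of 1s begins: subtract  */
                                         : "add";   /* run of 1s ends: add         */
          printf("%3d   %d    %d   %s\n", i, cur, prev, op);
          prev = cur;
      }
      return 0;
  }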
Double Bit Booth Multiplier
[Figure: double-bit Booth multiplier datapath — the 32-bit Multiplicand register is sign-extended to 34 bits both directly (x1) and shifted left by one (x2); a 34x2 MUX selects x1/x2, a 34-bit ALU adds or subtracts under control of the Booth Encoder (ENC[2:0], driven by LO[1:0] and a Prev bit), and the HI and LO registers (16x2 bits each) shift by two bits per cycle to produce Result[HI] and Result[LO]]
Pentium Bug
• Pentium: Difference between bugs that board designers
must know about and bugs that potentially affect all users
– $200,000 cost in June to repair design
– $400,000,000 loss in December in profits to replace bad parts
– How much to repair Intel’s reputation?
– Make public a complete description of bugs in the latter category?
• What is the technologist’s and the company’s responsibility to disclose bugs?
Multiple Cycle Datapath
[Figure: multicycle datapath — a single Ideal Memory (address chosen by the IorD mux from PC or ALU Out) feeds the Instruction Register and Memory Data Register; the Reg File (Ra, Rb, Rw, with RegDst and MemtoReg muxes) loads the A and B registers; the ALU takes ALUSelA (PC or A) and ALUSelB (B, 4, sign-extended Imm16, or Imm16 << 2) under ALU Control; control signals include PCWr, PCWrCond, PCSrc, IorD, MemWr, IRWr, RegDst, RegWr, ExtOp, ALUOp, ALUSelA, ALUSelB, MemtoReg, plus Zero from the ALU]
Control: Hardware vs. Microprogrammed
° Control may be designed using one of several initial
representations. The choice of sequence control, and how logic is
represented, can then be determined independently; the control can
then be implemented with one of several methods using a
structured logic technique.
                            “hardwired control”        “microprogrammed control”
Initial Representation      Finite State Diagram       Microprogram
Sequencing Control          Explicit Next State        Microprogram counter
                            Function                   + Dispatch ROMs
Logic Representation        Logic Equations            Truth Tables
Implementation Technique    PLA                        ROM
Finite State Machine (FSM) Spec
[Figure: multicycle FSM —
state 0000 “instruction fetch”: IR <= MEM[PC]; PC <= PC + 4
state 0001 “decode”, then dispatch:
  R-type:  0100 Execute: ALUout <= A fun B;    0101 Write-back: R[rd] <= ALUout
  ORi:     0110 ALUout <= A or ZX;             0111 R[rt] <= ALUout
  LW:      1000 ALUout <= A + SX;  1001 Memory: M <= MEM[ALUout];  1010 R[rt] <= M
  SW:      1011 ALUout <= A + SX;  1100 Memory: MEM[ALUout] <= B
  BEQ:     0010 ALUout <= PC + SX; 0011 if A = B then PC <= ALUout]
Sequencer-based control unit
[Figure: sequencer-based control unit — a State Register feeds Address Select Logic (driven by the Opcode) and an adder (+1); the selected state indexes the Control Logic, whose outputs drive the multicycle datapath]
Types of “branching”:
• Set state to 0
• Dispatch (state 1)
• Use incremented state number
“Macroinstruction” Interpretation
[Figure: “macroinstruction” interpretation — Main Memory holds the user program plus data (ADD, SUB, AND, ..., DATA; this can change!); each macroinstruction is mapped into a microsequence in the CPU’s control memory (e.g., the AND microsequence: Fetch, Calc Operand Addr, Fetch Operand(s), Calculate, Save Answer(s)), which drives the execution unit]
Microprogramming
Label   ALU    SRC1  SRC2     Dest.    Memory      Mem. Reg.  PC Write     Sequencing
Fetch:  Add    PC    4                 Read PC     IR         ALU          Seq
        Add    PC    Extshft                                               Dispatch
Rtype:  Func   rs    rt                                                    Seq
                              rd ALU                                       Fetch
Ori:    Or     rs    Extend0                                               Seq
                              rt ALU                                       Fetch
Lw:     Add    rs    Extend                                                Seq
                                       Read ALU                            Seq
                              rt MEM                                       Fetch
Sw:     Add    rs    Extend                                                Seq
                                       Write ALU                           Fetch
Beq:    Subt.  rs    rt                                       ALUoutCond.  Fetch
Precise Interrupts
• Precise => the state of the machine is preserved as if the program
executed up to the offending instruction
– All previous instructions completed
– Offending instruction and all following instructions act as if they have
not even started
– Same system code will work on different implementations
– Position clearly established by IBM
– Difficult in the presence of pipelining, out-of-order execution, ...
– MIPS takes this position
• Imprecise => system software has to figure out what is where
and put it all back together
• Performance goals often lead designers to forsake precise
interrupts
– system software developers, users, markets, etc. usually wish they had
not done this
• Modern techniques for out-of-order execution and branch
prediction help implement precise interrupts
Administrivia
• Oral reports Monday? (Thursday?):
– 10am - 12pm, 2pm-4pm, 306 Soda
– 5pm go over to lab to run mystery programs.
– Reports submitted via Web site by 5pm (not in BOX
downstairs)
– TAs have handed out a list of requirements
– Remember: talk is 15 minutes + 5 minutes questions
• Don’t bring more than 8 slides!!!
• Practice! Your final project grade will depend partially on
your oral report.
• Everyone should participate in the talk
Recap: Pipelining Lessons (it’s intuitive!)
[Figure: laundry analogy — four tasks A-D, each a sequence of 30-minute stages, overlapped in time starting at 6 PM]
° Multiple tasks operating simultaneously using different resources
° Potential speedup = Number of pipe stages
° Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
° Pipeline rate limited by slowest pipeline stage
° Unbalanced lengths of pipe stages reduce speedup
° Time to “fill” pipeline and time to “drain” it reduces speedup
° Stall for dependences
Why Pipeline? Because the resources are there!
[Figure: time (clock cycles) on the horizontal axis, instruction order (Inst 0 through Inst 4) on the vertical — each instruction flows through Im, Reg, ALU, Dm, Reg one cycle behind the previous one, so every resource is in use each cycle]
Can pipelining get us into trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two
different ways at the same time
• E.g., combined washer/dryer would be a structural hazard or
folder busy doing something else (watching TV)
– data hazards: attempt to use item before it is ready
• E.g., one sock of pair in dryer and one in washer; can’t fold
until get sock from washer through dryer
• instruction depends on result of prior instruction still in the
pipeline
– control hazards: attempt to make a decision before the
condition is evaluated
• E.g., washing football uniforms and need to get proper
detergent level; need to see after dryer before next load in
• branch instructions
• Can always resolve hazards by waiting
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards
Exceptions in a 5 stage pipeline
[Figure: five overlapped IFetch-Dcd-Exec-Mem-WB instructions vs. time, with program flow downward — an Inst TLB fault strikes in IFetch, a bad instruction in Dcd, an overflow in Exec, and a Data TLB fault in Mem, so exceptions can be raised out of program order]
• Use pipeline to sort this out!
– Pass exception status along with the instruction.
– Keep track of PCs for every instruction in the pipeline.
– Don’t act on an exception until it reaches the WB stage
• Handle interrupts through a “faulting noop” in the IF stage
• When the instruction reaches the WB stage:
– Save PC => EPC, Interrupt vector addr => PC
– Turn all instructions in earlier stages into noops!
Data Stationary Control
• The Main Control generates the control signals during Reg/Dec (see the sketch after the figure)
– Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
– Control signals for Mem (MemWr, Branch) are used 2 cycles later
– Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: IF/ID, ID/Ex, Ex/Mem, and Mem/Wr pipeline registers — Main Control decodes in Reg/Dec; ID/Ex carries ExtOp, ALUSrc, ALUOp, RegDst plus the Mem and Wr signals; Ex/Mem carries MemWr, Branch plus the Wr signals; Mem/Wr carries MemtoReg, RegWr]
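A C-struct sketch of the idea — an illustration only, not the course's actual HDL — showing decoded control bits riding along in the pipeline registers (field names follow the slide):

  #include <stdbool.h>

  /* Control bits decoded in Reg/Dec ride down the pipe with their instruction. */
  struct ex_ctrl  { bool ExtOp, ALUSrc, RegDst; int ALUOp; };
  struct mem_ctrl { bool MemWr, Branch; };
  struct wb_ctrl  { bool MemtoReg, RegWr; };

  struct id_ex_reg  { struct ex_ctrl ex; struct mem_ctrl mem; struct wb_ctrl wb; };
  struct ex_mem_reg {                    struct mem_ctrl mem; struct wb_ctrl wb; };
  struct mem_wr_reg {                                         struct wb_ctrl wb; };

  /* On each clock edge the control moves one stage to the right. */
  static void clock_edge(struct id_ex_reg *id_ex, struct ex_mem_reg *ex_mem,
                         struct mem_wr_reg *mem_wr) {
      mem_wr->wb  = ex_mem->wb;     /* used 3 cycles after decode */
      ex_mem->wb  = id_ex->wb;
      ex_mem->mem = id_ex->mem;     /* used 2 cycles after decode */
      /* id_ex->ex is consumed this cycle; main control refills id_ex next cycle */
  }

  int main(void) {
      struct id_ex_reg  id_ex  = { { true, true, false, 2 }, { false, false }, { false, true } };
      struct ex_mem_reg ex_mem = { { false, false }, { false, false } };
      struct mem_wr_reg mem_wr = { { false, false } };
      clock_edge(&id_ex, &ex_mem, &mem_wr);   /* the RegWr bit is now one stage closer to WB */
      return 0;
  }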
Pipeline Summary
° Simple 5-stage pipeline: F D E M W
° Pipelines pass control information down the pipe just
as data moves down pipe
° Resolve data hazards through forwarding.
° Forwarding/Stalls handled by local control
° Exceptions stop the pipeline
° MIPS I instruction set architecture made pipeline visible
(delayed branch, delayed load)
° More performance from deeper pipelines, parallelism
° You built a complete 5-stage pipeline in the lab!
Out of order execution: Tomasulo Organization
[Figure: Tomasulo organization — the FP Op Queue and FP Registers feed reservation stations (Add1-Add3 in front of the FP adders, Mult1-Mult2 in front of the FP multipliers); Load Buffers (Load1-Load6) bring data from memory and Store Buffers send results to memory; all results broadcast on the Common Data Bus (CDB)]
How can the machine exploit available ILP?
Technique                                                   Limitation
° Pipelining                                                Issue rate, FU stalls, FU depth
° Super-pipeline                                            Clock skew, FU stalls, FU depth
  - Issue 1 instr. / (fast) cycle
  - IF takes multiple cycles
° Super-scalar                                              Hazard resolution
  - Issue multiple scalar instructions per cycle
° VLIW                                                      Packing, Compiler
  - Each instruction specifies multiple scalar operations
[Diagrams of staggered IF D Ex M W timelines accompany each technique]
Processor-DRAM Gap (latency)
[Plot: performance (1 to 1000, log scale) vs. year, 1980-2000 — CPU performance (“Moore’s Law”) grows about 60%/yr (µProc) while DRAM grows about 7%/yr, so the processor-memory performance gap grows about 50%/year]
Levels of the Memory Hierarchy
Capacity / Access Time / Cost                        Staging Xfer Unit (managed by)
CPU Registers: 100s Bytes, <2 ns                     -> Instr. Operands, 1-8 bytes (prog./compiler)
Cache: K Bytes, 2-100 ns, $.01-.001/bit (SRAM)       -> Blocks, 8-128 bytes (cache cntl)
Main Memory: M Bytes, 100 ns-1 us, $.01-.001 (DRAM)  -> Pages, 512-4K bytes (OS)
Disk: G Bytes, ms, 10^-3 to 10^-4 cents              -> Files, Mbytes (user/operator)
Tape: infinite, sec-min, 10^-6
(Upper levels are smaller and faster; lower levels are larger)
Memory Hierarchy
° The Principle of Locality:
• Programs access a relatively small portion of the address
space at any instant of time (a short example follows this list).
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
° Three Major Categories of Cache Misses:
• Compulsory Misses: sad facts of life. Example: cold start
misses.
• Conflict Misses: increase cache size and/or associativity.
• Capacity Misses: increase cache size
° Virtual Memory invented as another level of the hierarchy
– Today VM allows many processes to share a single memory
without having to swap all processes to disk; protection is now
more important
– TLBs are important for fast translation/checking
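A tiny C illustration of spatial locality (the array size is arbitrary): the row-major loop walks memory sequentially and reuses each cache block, while the column-major loop strides across it and misses far more often:

  #include <stdio.h>

  #define N 1024
  static double a[N][N];   /* 8 MB array, larger than a 1999-era cache */

  int main(void) {
      double sum = 0.0;

      /* Good spatial locality: consecutive iterations touch adjacent addresses */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              sum += a[i][j];

      /* Poor spatial locality: each access jumps N * sizeof(double) bytes */
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              sum += a[i][j];

      printf("%f\n", sum);
      return 0;
  }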
Set Associative Cache
• N-way set associative: N entries for each Cache Index
– N direct mapped caches operate in parallel
• Example: Two-way set associative cache (a lookup sketch follows the figure)
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in parallel
– Data is selected based on the tag result
[Figure: two-way set-associative lookup — the Cache Index selects one set; the two ways’ Valid bits and Cache Tags are compared against the Adr Tag in parallel (Hit = OR of the two compares), and a mux (Sel1/Sel0) picks the hitting way’s Cache Data as the Cache Block]
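A C sketch of the two-way lookup described above (the cache geometry and names are invented for illustration):

  #include <stdio.h>
  #include <stdint.h>
  #include <stdbool.h>

  #define NSETS 128    /* assumed geometry: 128 sets x 2 ways x 32-byte blocks */
  #define BLOCK 32

  struct line { bool valid; uint32_t tag; uint8_t data[BLOCK]; };
  static struct line cache[NSETS][2];

  /* Both ways of the selected set are checked (in hardware, in parallel);
     the tag comparison picks which way's data block is returned. */
  static bool lookup(uint32_t addr, uint8_t **block_out) {
      uint32_t index = (addr / BLOCK) % NSETS;     /* Cache Index selects a set */
      uint32_t tag   = addr / BLOCK / NSETS;       /* Adr Tag                   */
      for (int way = 0; way < 2; way++) {
          struct line *l = &cache[index][way];
          if (l->valid && l->tag == tag) {         /* Hit = OR of the compares  */
              *block_out = l->data;
              return true;
          }
      }
      return false;                                /* miss */
  }

  int main(void) {
      uint8_t *blk;
      printf("hit = %d\n", lookup(0x1234, &blk));   /* cold cache: prints 0 */
      return 0;
  }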
Quicksort vs. Radix as vary number keys: Cache misses
[Plot: cache misses per key vs. job size in keys (1,000 to 10,000,000) — Radix sort’s misses per key climb steeply with job size while Quicksort’s stay low]
What is the proper approach to fast algorithms?
Performance Retrospective
° Theory of Algorithms & Compilers is often based on the number of operations
° Compilers remove operations and “simplify” ops:
Integer adds << Integer multiplies << FP adds << FP multiplies
• Advanced pipelines => these operations take similar time
(FP multiply faster than integer multiply)
° As clock rates get higher and pipelines get longer, instructions
take less time but DRAMs get only slightly faster (although much
larger)
° Today time is a function of (ops, cache misses);
° Given importance of caches, what does this mean to:
• Compilers?
• Data structures?
• Algorithms?
• How do you tune performance on Pentium Pro? Random?
Static RAM Cell
6-Transistor SRAM Cell
[Figure: 6-T cell — two cross-coupled inverters store complementary values (0/1); two access transistors, gated by the word line (row select), connect the cell to the bit and bit-bar lines]
• Write:
1. Drive the bit lines (bit = 1, bit-bar = 0)
2. Select row
• Read:
1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure they are equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on the column detects the difference between bit and bit-bar
(Note from the figure: some transistors can be replaced with a pullup to save area.)
1-Transistor Memory Cell (DRAM)
• Write:
– 1. Drive the bit line
– 2. Select row
• Read:
– 1. Precharge the bit line to Vdd
– 2. Select row
– 3. Cell and bit line share charges
• Very small voltage changes on the bit line
– 4. Sense (fancy sense amp)
• Can detect changes of ~1 million electrons
– 5. Write: restore the value
• Refresh
– 1. Just do a dummy read to every cell.
[Figure: one access transistor, gated by row select, connects the storage capacitor to the bit line]
Classical DRAM Organization (square)
[Figure: square RAM Cell Array — each intersection of a word (row) select line and a bit (data) line is a 1-T DRAM cell; the row decoder, driven by the row address, selects one row, and the Column Selector & I/O Circuits, driven by the column address, pick the data bit]
• Row and Column Address together select 1 bit at a time
Main Memory Performance
• Simple: CPU, Cache, Bus, Memory same width (32 bits)
• Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
• Interleaved: CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved

Timing model: 1 cycle to send the address, 6 cycles access time, 1 to send data; a cache block is 4 words
Simple M.P.      = 4 x (1+6+1) = 32
Wide M.P.        = 1+6+1       = 8
Interleaved M.P. = 1 + 6 + 4x1 = 11
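A small C check of those three miss penalties under the stated timing model:

  #include <stdio.h>

  int main(void) {
      int addr = 1, access = 6, xfer = 1, words = 4;   /* cycles, per the timing model */

      int simple      = words * (addr + access + xfer);   /* 4 x (1+6+1) = 32 */
      int wide        = addr + access + xfer;             /* 1+6+1       =  8 */
      int interleaved = addr + access + words * xfer;     /* 1+6+4x1     = 11 */

      printf("simple = %d, wide = %d, interleaved = %d cycles\n",
             simple, wide, interleaved);
      return 0;
  }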
I/O System Design Issues
• Systems have a hierarchy of busses as well (PC: memory, PCI, ESA)
[Figure: the Processor (with Cache) and Main Memory sit on the Memory - I/O Bus; I/O Controllers for disks, graphics, and the network also attach to it and raise interrupts to the processor]
A Three-Bus System
[Figure: the Processor and Memory share the Processor-Memory Bus; Bus Adaptors connect it to a Backplane Bus, and further Bus Adaptors hang I/O Buses off the backplane]
• A small number of backplane buses tap into the processor-memory bus
– Processor-memory bus is only used for processor-memory
traffic
– I/O buses are connected to the backplane bus
• Advantage: loading on the processor bus is greatly reduced
Disk Device Terminology
Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time
Order of magnitude times for 4K byte transfers:
Average Seek: 8 ms or less
Rotate: 4.2 ms @ 7200 rpm
Xfer: 1 ms @ 7200 rpm
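Adding those order-of-magnitude numbers up in C (the queueing and controller times are invented placeholders):

  #include <stdio.h>

  int main(void) {
      /* Times in ms; queueing and controller times are assumed values */
      double queue = 0.0, controller = 0.1;
      double seek = 8.0;       /* average seek               */
      double rotate = 4.2;     /* half a rotation @ 7200 rpm */
      double xfer = 1.0;       /* 4 KB transfer              */

      double latency = queue + controller + seek + rotate + xfer;
      printf("disk latency ~ %.1f ms\n", latency);   /* about 13.3 ms */
      return 0;
  }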
Disk I/O Performance
Metrics: Response Time, Throughput
[Figure: response time (ms) vs. throughput (% of total BW) — response time rises sharply as throughput approaches 100%; requests flow Proc -> Queue -> IOC -> Device]
Response time = Queue + Device Service time
A Little Queuing Theory: M/M/1 queues
[Figure: the System = Queue + server; requests flow Proc -> Queue -> IOC/Device]
• Described by “memoryless” or Markovian request arrival
(M for C = 1, exponentially random), 1 server: M/M/1 queue
• When service times also have C = 1: M/M/1 queue
Tq = Tser x u / (1 – u)
  Tser = average time to service a customer
  u    = server utilization (0..1): u = lambda x Tser
  Tq   = average time/customer in the queue
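A small C sketch of these formulas (the arrival rate and service time are invented example values):

  #include <stdio.h>

  int main(void) {
      double tser   = 0.010;   /* average service time: 10 ms (assumed)          */
      double lambda = 50.0;    /* arrival rate: 50 requests per second (assumed) */

      double u  = lambda * tser;          /* server utilization            */
      double tq = tser * u / (1.0 - u);   /* average time waiting in queue */

      printf("u = %.2f, Tq = %.1f ms, Tq + Tser = %.1f ms\n",
             u, tq * 1e3, (tq + tser) * 1e3);   /* u = 0.50, Tq = 10.0 ms */
      return 0;
  }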
Computers in the news: Tunneling Magnetic Junction
Computers in the News: Sony Playstation 2000
• (as reported in Microprocessor Report, Vol 13, No. 5)
– Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
– Graphics Synthesizer: 2.4 Billion pixels per second
– Claim: Toy Story realism brought to games!
Computers in the News: Electronic Ink
• Electronic Ink:
– Little capsules with charged balls that are
half black/half white
– Placing an electronic charge of one polarity makes dot black and the
other polarity makes it white.
– Flexible, cheap, paper-like displays!
[Figures: schematic diagram and electron micrograph of the capsules]
What does the future hold?
Forces on Computer Architecture
[Figure: forces acting on Computer Architecture — Technology, Programming Languages, Applications, Operating Systems, and History (A = F / M)]
Today: building materials prevalent
• Originally: worried about squeezing the last ounce of
performance from limited resources
• Today: worried about an abundance (embarrassment) of
riches?
– Billions of transistors on a chip (17nm Yeah!)
– Microprocessor Report articles wondering if all the lessons of
RISC are now irrelevant
• Moore’s laws: exponential growth of everything
– Transistors, Performance, Disk Space, Memory Size
• So, what matters any more????
Key Technologies
° Fast, cheap, highly integrated “computers-on-a-chip”
• IDT R4640, NEC VR4300, StrongARM, Superchips
• Devices everywhere!
° Micromechanical Devices (MEMS)
• Integrated sensors everywhere
° Ubiquitous access to fast networks
• Network is everywhere!
• ISDN, Cable Modems, ATM, Metrocom (wireless) . .
° Platform independent programming languages
• Java, JavaScript, Visual Basic Script
° Lightweight Operating Systems
• GEOS, NCOS, RISCOS
° ???
Issues for Research
• Complexity:
– more than 50% of design teams are now devoted to verification
• Power
– Processor designs hampered in performance to keep from
melting
– Why is there a 3 or 4 orders of magnitude difference in power consumption
between custom hardware and general von Neumann architectures?
• Energy
– Portable devices
• Scalability, Reliability, Maintainability
– How to keep services up 24x7?
• Performance (“Cost conscious”)
– how to get performance without a lot of power, complexity, etc.
• Security?
– What are the consequences of distributed info to privacy?
– What are moral implications of ubiquitous sensors???
1985 Computer Food Chain
[Figure: PC, Workstation, Minicomputer, Mainframe, and Vector Supercomputer (“Big Iron”)]
1995 Computer Food Chain
[Figure: PC, Workstation, Massively Parallel Processors, Mainframe (future is bleak), Vector Supercomputer; the Minicomputer is hitting the wall soon]
Interconnection Networks
° Switched vs. Shared Media: with a switch, multiple pairs communicate at the
same time over “point-to-point” connections
Cluster/Network of Workstations (NOW)
[Figure: a spectrum of organizations — SMP (several processors sharing memory, one NI, fast communication), MPP (P+M+NI nodes on a fast, switched network), Cluster/NOW (complete workstations with disk, memory, general-purpose I/O, and an NI on the I/O bus, joined by a fast, switched network), and Distributed Computing (similar nodes on a slow, scalable network); clusters offer incremental scalability and timeliness]
2005 Computer Food Chain?
[Figure: Portable Computers and Networks of Workstations/PCs dominate; the Mainframe, Minicomputer, Vector Supercomputer, and Massively Parallel Processors fade away]
Intelligent DRAM (IRAM)
• IRAM motivation (2000 to 2005)
– 256 Mbit/1 Gbit DRAMs in the near future (128 MByte)
– Current CPUs are starved for memory BW
– On-chip memory BW = SQRT(Size)/RAS, or 80 GB/sec
– 1% of a Gbit DRAM = 10M transistors for a µprocessor
– Even in a DRAM process, a 10M-transistor CPU is attractive
– Package could be a network interface vs. Addr./Data pins
– Embedded computers are increasingly important
• Why not re-examine computer design based on separation
of memory and processor?
– Compact code & data?
– Vector instructions?
– Operating systems? Compilers? Data Structures?
IRAM Vision Statement
Microprocessor & DRAM
on a single chip:
– on-chip memory latency
5-10X, bandwidth 50-100X
– improve energy efficiency
2X-4X (no off-chip bus)
– serial I/O 5-10X v. buses
– smaller board area/volume
– adjustable memory size/width
[Figure: a conventional system — Proc with on-chip $ and L2$ on a logic-fab die, connected by buses to off-chip DRAM and I/O — vs. IRAM — Proc and DRAM together on one die in a DRAM fab, with I/O attached directly]
OceanStore: The Oceanic Data Utility
Global-Scale Persistent Storage
Ubiquitous Devices => Ubiquitous Storage
• Consumers of data move, change from one device to
another, work in cafes, cars, airplanes, the office, etc.
• Properties REQUIRED for OceanStore:
– Strong Security: data must be encrypted whenever in the
infrastructure; resistance to monitoring
– Coherence: too much data for naïve users to keep coherent
“by hand”
– Automatic replica management and optimization: huge
quantities of data cannot be managed manually
– Simple and automatic recovery from disasters: probability
of failure increases with size of system
– Utility model: world-scale system requires cooperation
across administrative boundaries
Utility-based Infrastructure
[Figure: interconnected provider clouds — Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM]
• Service provided by a confederation of companies
– Monthly fee paid to one service provider
– Companies buy and sell capacity from each other
OceanStore Assumptions
• Untrusted Infrastructure:
– Infrastructure is comprised of untrusted components
– Only cyphertext within the infrastructure
– Must be careful to avoid leaking information
• Mostly Well-Connected:
– Data producers and consumers are connected to a high-bandwidth network most of the time
– Exploit mechanisms such as multicast for quicker
consistency between replicas
• Promiscuous Caching:
– Data may be cached anywhere, anytime
– Global optimization through tacit information collection
• Operations Interface with Conflict Resolution:
– Applications employ an operations-oriented interface, rather
than a file-systems interface
– Coherence is centered around conflict resolution
Introspective Computing
• Biological Analogs for computer systems:
– Continuous adaptation
– Insensitivity to design flaws
• Both hardware and software
• Necessary if you can never be sure that all components are working properly...
[Figure: Compute -> Monitor -> Adapt cycle]
• Examples:
– ISTORE -- applies introspective computing to disk storage
– DynaComp -- applies introspective computing at the chip level
• Compiler always running and part of execution!
and why not
° multiprocessors on a chip?
° complete systems on a chip?
• memory + processor + I/O
° computers in your credit card?
° networking in your kitchen? car?
° eye tracking input devices?
° Wearable computers
° Intelligent books (made from paper!)
Learned from Cal/CS152?
CS152: So what's in it for me? (from 1st lecture)
° In-depth understanding of the inner-workings of
modern computers, their evolution, and trade-offs
present at the hardware/software boundary.
• Insight into fast/slow operations that are easy/hard to
implement in hardware
° Experience with the design process in the context of
a large complex (hardware) design.
• Functional Spec --> Control & Datapath --> Physical implementation
• Modern CAD tools
° Designer's "Intellectual" toolbox.
Simulate Industrial Environment (from 1st lecture)
° Project teams must have at least 4 members
• Managers have value
° Communicate with colleagues (team members)
• What have you done?
• What answers do you need from others?
• You must document your work!!!
• Everyone must keep an on-line notebook
° Communicate with supervisor (TAs)
• How is the team’s plan?
• Short progress reports are required:
- What is the team’s game plan?
- What is each member’s responsibility?
So let’s thank those TAs
Summary: Things we Hope You Learned from 152
° Keep it simple and make it work:
• Fully test everything individually & then together; things that work alone break when put together
• Retest everything whenever you make any changes
• Last minute changes are big “no-nos”
° Group dynamics. Communication is the key to
success:
• Be open with others about your expectations & your problems (e.g., a trip)
• Everybody should be at design meetings when key decisions
are made and jobs are assigned
° Planning is very important (“plan your life; live your
plan”):
• Promise what you can deliver; deliver more than you promise
• Murphy’s Law: things DO break at the last minute
- DON’T make your plan based on the best case scenarios
- Freeze your design and don’t make last minute changes
° Never give up! It is not over until you give up (“Bear
won’t die”)