ECE 252 / CPS 220 Advanced Computer

ECE552 / CPS550
Advanced Computer Architecture I
Lecture 1
Metrics and Early Machines
Benjamin Lee
Electrical and Computer Engineering
Duke University
www.duke.edu/~bcl15
www.duke.edu/~bcl15/class/class_ece552fall16.html
Computing Devices (Then)
- Mark I: Harvard University, 1944
- EDSAC: University of Cambridge, 1949
Computing Devices (Now)
- iPad: Apple/ARM, 2010
- Blue Gene/P: IBM, 2007
Computer Architecture
Application
Gap too large to bridge in
one step
Physics
Computer architecture is the design of abstraction layers,
which allow efficient implementations of computational
applications on available technologies
Abstraction Layers
- Application
- Algorithm
- Programming Language
- Operating System/Virtual Machines
- Instruction Set Architecture (ISA)
- Microarchitecture
- Gates/Register-Transfer Level (RTL)
- Circuits
- Devices
- Physics
Figure annotations: domain of early computer architecture (‘50s-’80s); domain of recent computer architecture (since ‘90s)
An Integrated Approach
Architect Systems
- Coordinate system across hardware-software interface
~ Technology, hardware, run-time software, compilers, apps
- Responsible for end-to-end functionality
Design and Analyze
- Search design space of computer systems
- Evaluate designs with quantitative metrics
~ Performance, power, cost
Navigate Computing Landscape
- Technologies are emerging
- Applications are demanding
- Systems are scaling
ECE 552 Executive Summary
- In-order Datapath (understand, ECE 250; built, ECE 350)
- Chip Multiprocessors (understand, experiment, ECE 552)
ECE 552 Administrivia
Instructor
- Prof. Benjamin Lee, [email protected]
- Office Hours: TuTh 4-5pm, Hudson 210
Teaching Assistants
- Ramin Bashizade, [email protected]
- Office Hours: WF 3:30 – 4:30pm, LSRC D301
- Tamara Lehman, [email protected]
- Office Hours: TuTh 11:30 – 12:30pm, TBD
Lectures
- Tu/Th 10:05 – 11:20AM, Teer 203
Text
- Computer Architecture: A Quantitative Approach, 5th Edition (2012). Do not use earlier editions.
Web
- http://www.duke.edu/~BCL15/class/class_ece552fall16.html
ECE 552 Prerequisites
Participation
- Electrical and Computer Engineering, Computer Science
- PhD, MS, Undergraduates
Prerequisites
- Introduction to computer architecture (CPS 104, ECE 152, or equiv.)
- Programming (homework/projects in C, C++)
Background Knowledge
- Instruction sets, computer arithmetic, assembly programming
D.A. Patterson and J.L. Hennessy. Computer Organization and Design:
The Hardware/Software Interface, 5th Edition.
ECE 552 Lectures
1. Design Metrics
   a) Performance
   b) Power
   c) Early machines
2. Simple Pipelining
   a) Multi-cycle machines
   b) Branch prediction
   c) In-order superscalar
   d) Optimizations
3. Complex Pipelining
   a) Score-boarding, Tomasulo algorithm
   b) Out-of-order superscalar
4. Explicitly Parallel Architectures
   a) VLIW
   b) Vector machines
   c) Multi-threading
5. Memory Systems
   a) Caches
   b) DRAM
   c) Virtual memory
6. Multiprocessors
   a) Memory models
   b) Coherence protocols
Midterm Exam
Fall Break
ECE 552 Readings
1. Technology
   a) Moore’s Law
   b) Technology scaling
2. History
   a) Classic machines
   b) The 801 minicomputer
3. Pipelining
   a) Power as a design constraint
   b) Optimizing pipeline depth
4. Microarchitecture
   a) Branch prediction
   b) Complexity and superscalar design
5. Parallelism I
   a) Data flow processors
   b) Simultaneous multi-threading
6. Memory
   a) Victim cache
   b) Phase change memory
7. Parallelism II
   a) Consistency
   b) Coherence
ECE 552 Components
30% Homework and Readings
- Homework done in teams of 3
- 5 classes dedicated to paper discussions
20% Midterm exam
- 75 minutes (in class), closed book
20% Final exam
- 3 hours, closed book
30% Term project/paper
- Project done in teams of 3
Academic Policy
University policy as codified by Duke Undergraduate Honor Code will be strictly
enforced. Zero tolerance for cheating and/or plagiarism.
ECE 552 Academic Policy
University policy as codified by the Duke Undergraduate Honor Code will be
strictly enforced. Zero tolerance for cheating and/or plagiarism.
If a student is suspected of academic dishonesty (e.g., cheating on an exam,
copying a lab report, collaborating inappropriately on an assignment), faculty
are required to report the matter to the Office of Student Conduct.
A student found responsible for academic dishonesty faces formal disciplinary
action, which may include suspension. A student suspended twice for academic
dishonesty automatically faces a minimum 5-year separation from Duke
University.
ECE 552 Term Project
Scope
- Semester-long research project
- Teams of 3
- Students propose project ideas (Oct 14)
Final Paper
- 6-12 page research paper
- Evaluate research idea quantitatively
- Survey and cite related work
ECE 552 Upcoming Deadlines
1 September – Reading #1 Due
Readings are available on Sakai.
Submit reading responses on Sakai.
1. Moore. “Cramming more components onto integrated circuits”
2. Horowitz et al. “Scaling, power, and the future of CMOS”
15 September – Homework #1 Due
Homework will be available on Sakai.
Submit homework on Sakai in teams of two.
Performance
Latency versus Throughput
Definitions
- Latency: time to finish given task (a.k.a. execution time)
- Throughput: number of tasks in given time (a.k.a. bandwidth)
- Throughput exploits parallelism. Latency cannot
Example: Move people from Duke to UNC, 10 miles
- Car: capacity = 5, speed = 60 miles/hour
  - Latency = 10 miles / (60 miles/hour) = 10 minutes
  - Throughput = 3 round trips/hour x 5 people = 15 people/hour
- Bus: capacity = 60, speed = 20 miles/hour
  - Latency = 10 miles / (20 miles/hour) = 30 minutes
  - Throughput = 1 round trip/hour x 60 people = 60 people/hour
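The car and bus numbers above can be reproduced with a short sketch. It assumes, as the trip counts imply, that the vehicle must drive back empty before carrying the next load; the function names are illustrative only.

```python
def latency_minutes(distance_miles, speed_mph):
    """One-way travel time in minutes."""
    return 60.0 * distance_miles / speed_mph


def throughput_people_per_hour(capacity, distance_miles, speed_mph):
    """People delivered per hour, assuming an empty return trip (assumption)."""
    round_trip_hours = 2.0 * distance_miles / speed_mph
    return capacity / round_trip_hours


# Duke to UNC, 10 miles
car = (latency_minutes(10, 60), throughput_people_per_hour(5, 10, 60))
bus = (latency_minutes(10, 20), throughput_people_per_hour(60, 10, 20))
print("car: %.0f min, %.0f people/hour" % car)
print("bus: %.0f min, %.0f people/hour" % bus)
```

The car wins on latency and the bus wins on throughput, which is why the two metrics have to be reported separately.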
Aggregating Performance
Addition
- Latency is additive. Throughput is not.
- Example: Consider applications A1 and A2 on processor P
- Latency(A1,A2) = Latency(A1) + Latency(A2)
- Throughput (A1,A2) = 1/[1/Throughput(A1) + 1/Throughput(A2)]
Averages
- Arithmetic Mean: (1/N) * ∑P=1..N Latency(P)
- For measures that are proportional to time (e.g., latency)
- Harmonic Mean: N / ∑P=1..N 1/Throughput(P)
- For measures that are inversely proportional to time (e.g., throughput)
- Geometric Mean: (∏P=1..N Speedup(P))^(1/N)
- For ratios (e.g., speed-ups)
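As a rough illustration of the three averages, the sketch below uses Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). All input values are made up for the example.

```python
from statistics import mean, harmonic_mean, geometric_mean

latencies = [2.0, 4.0, 6.0]   # seconds per program (hypothetical)
throughputs = [40.0, 60.0]    # tasks per second (hypothetical)
speedups = [2.0, 8.0]         # ratios versus a baseline (hypothetical)

avg_latency = mean(latencies)                # proportional to time
avg_throughput = harmonic_mean(throughputs)  # inversely proportional to time
avg_speedup = geometric_mean(speedups)       # ratios


def combined_throughput(t1, t2):
    """Throughput of running A1 then A2: latencies add, throughputs do not."""
    return 1.0 / (1.0 / t1 + 1.0 / t2)


print(avg_latency, avg_throughput, avg_speedup)
print(combined_throughput(40.0, 60.0))
```

Note that the arithmetic mean of the two throughputs would be 50, while the harmonic mean is 48; the harmonic mean matches what the total work divided by the total time gives.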
Processor Performance
[Figure: Performance (vs. VAX-11/780) on SPECint benchmarks, 1978-2006, log scale from 1 to 10000. Annotated growth rates: 25%/year, 52%/year, ??%/year. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.]
Performance Factors
Latency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)
Seconds / Cycle
- Technology and architecture
- Transistor scaling
- Processor microarchitecture
Cycles / Instruction (CPI)
- Architecture and systems
- Processor microarchitecture
- System balance (processor, memory, network, storage)
Instructions / Program
- Algorithm and applications
- Compiler transformations, optimizations
- Instruction set architecture
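The latency equation can be evaluated directly. The sketch below is only an illustration; the instruction count, CPI, and clock frequency are hypothetical values.

```python
def latency_seconds(instructions, cpi, clock_hz):
    """Latency = (seconds/cycle) x (cycles/instruction) x (instructions/program)."""
    seconds_per_cycle = 1.0 / clock_hz
    return seconds_per_cycle * cpi * instructions


# Hypothetical program: 1 billion dynamic instructions
base = latency_seconds(1e9, cpi=2.0, clock_hz=500e6)
better_cpi = latency_seconds(1e9, cpi=1.0, clock_hz=500e6)
fewer_insts = latency_seconds(0.5e9, cpi=2.0, clock_hz=500e6)
print(base, better_cpi, fewer_insts)
```

Each of the three factors scales latency linearly, so halving CPI and halving the instruction count give the same improvement here.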
Moore’s Law
- Moore. “Cramming more components onto integrated circuits.” Electronics, Vol 38, No. 8, 1965.
- As integration increases, packaging cost decreases
- How does Moore’s Law impact performance?
Field-Effect Transistors
- MOS: metal-oxide semiconductor
- FET: field-effect transistor
- Charge carriers flow between source-drain, controlled by gate voltage
- Abstract MOSFET as electrical switch
[Figure: MOSFET showing source, gate, drain, bulk, and channel, with channel width and length labeled]
Complementary MOS (CMOS)
- Map voltages to logical values (Vdd=1, Gnd=0)
- Implement complementary Boolean logic
- nFET: conduct charge when Vg = Vdd, used in pull-down network
- pFET: conduct charge when Vg = Gnd, used in pull-up network
- Examples: Inverter, NAND
[Figure: CMOS inverter (input A, output !A) and NAND gate (inputs A and B, output !(AB)), built from pFETs connected to Vdd and nFETs connected to Gnd]
Transistor Dimensions
- Process defined by feature size (F), layout design (λ = F/2)
- Example: F = 2λ = 45nm process technology
- Transistor dimensions determine technology performance
- Transistor drive strength (i.e., speed) increases as channel length shrinks
[Figure: transistor layout showing source, gate, drain, and bulk; minimum length = 2λ, width = 4λ]
Dennard Scaling
- Dennard et al. “Design of ion-implanted MOSFETs with very small physical dimensions,” Journal of Solid-State Circuits, 1974.
- Scale not only dimensions but also doping concentration and voltage
- Transistors become faster (1.4x)
- Applied to Moore’s Law: k=1.4, 1/k = 0.7 every 18-24 months
[Figure: MOSFET showing source, gate, drain, and bulk, with width and length labeled]
Dennard Scaling Limits
- Horowitz et al. “Scaling, power, and the future of CMOS.” IEDM, 2005.
- Classical Dennard scaling ended at 130nm in 2000-2001.
- Oxide Thickness: How to manage increasing leakage? Use high-K dielectrics
- Channel Length: How to manage increasing leakage? Stop scaling L
- Doping Concentration: How to handle imprecise doping? Manage variability
- Voltage: How to manage increasing leakage? Stop scaling V
- Current: How to increase current with shrinking channels? Stress silicon
- Example: Intel 22nm process technology with FinFET
[Image: Courtesy Intel Corp.]
Cycles per Instruction (CPI)
Average Instruction Latency
- Different instructions require different number of cycles
- Examine instruction frequency
- CPI is slightly easier to calculate than IPC (time versus rate)
Example
- Instruction frequency: 1/3 INT, 1/3 FP, 1/3 MEM operations
- Instruction cycles: 1cy INT, 3cy FP, 2cy MEM
- CPI = (1/3 x 1) + (1/3 x 3) + (1/3 x 2) = 2 cycles
Caveat
- CPI provides high-level, quick estimates of performance
- Does not account for details (e.g., instruction dependences)
CPI and Design
Baseline Processor + Application
- Integer ALU: 50%, 1 cycle
- Load: 20%, 5 cycles
- Store: 10%, 1 cycle
- Branch: 20%, 2 cycles
Possible Enhancements
- Option 1: Branch prediction for 1-cycle branch
- Option 2: Bigger data cache for 3-cycle load
- Which enhancement is preferred?
Cycles Per Instruction
- Base = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 2) = 2 cycles
- Option 1 = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 1) = 1.8 cycles
- Option 2 = (0.5 x 1) + (0.2 x 3) + (0.1 x 1) + (0.2 x 2) = 1.6 cycles
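The comparison is a frequency-weighted sum, which the following sketch recomputes from the instruction mix given on this slide. The dictionary layout and helper names are just one convenient way to write it down.

```python
def cpi(mix):
    """Weighted-average cycles per instruction for {name: (frequency, cycles)}."""
    return sum(freq * cycles for freq, cycles in mix.values())


base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}

# Option 1: branch prediction gives 1-cycle branches
option1 = dict(base, branch=(0.2, 1))
# Option 2: bigger data cache gives 3-cycle loads
option2 = dict(base, load=(0.2, 3))

for name, mix in [("base", base), ("option 1", option1), ("option 2", option2)]:
    print("%s: CPI = %.1f" % (name, cpi(mix)))
```

With this mix the bigger data cache is preferred, because loads contribute more cycles to the baseline (0.2 x 5) than branches do (0.2 x 2).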
Measuring CPI
Physical Measurements
- Measure wall clock time as application runs
- Multiply time by clock frequency to get cycles
- Profile application with hardware counters (e.g., Intel VTune)
Simulated Measurements
- Cycle-level, microarchitectural simulation (e.g., SimpleScalar)
- Run applications on simulated hardware
- Track instructions as they progress through the design
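For the physical measurement, the arithmetic is simple once a wall-clock time and a dynamic instruction count are available. The sketch below assumes both have already been measured (for example, with hardware counters); the numbers are hypothetical.

```python
def measured_cpi(wall_clock_seconds, clock_hz, instructions):
    """Cycles = time x clock frequency; CPI = cycles / dynamic instruction count."""
    cycles = wall_clock_seconds * clock_hz
    return cycles / instructions


# Hypothetical run: 2 seconds at 3 GHz, retiring 4 billion instructions
example_cpi = measured_cpi(2.0, 3e9, 4e9)
print("CPI = %.2f, IPC = %.2f" % (example_cpi, 1.0 / example_cpi))
```

This estimate assumes the clock frequency is constant during the run, which may not hold on machines that change frequency dynamically.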
Benchmarking
Measuring Performance
- Target Workload: accurate but not portable
- Representative Benchmark: portable but not accurate
- Microbenchmark: small, fast code sequences but incomplete
Representative Benchmarks
- SPEC (Standard Performance Evaluation Corporation, www.spec.org)
- Collects, standardizes, distributes benchmark programs
- Scientific and commercial computing
- SPLASH-2, NAS, SPEC OpenMP, SPECjbb
- Online transaction processing (OLTP) with heavy I/O, memory
- TPC-C, TPC-H, TPC-W
- Datacenter workloads
- Search (e.g., Nutch/Lucene), analytics (e.g., Hadoop, Spark)
Single Accumulator
- Carry-over from calculators, typically less than 2-dozen instructions
- Single operand (AC)
LOAD x          AC ← M[x]
STORE x         M[x] ← (AC)
ADD x           AC ← (AC) + M[x]
SUB x
SHIFT LEFT      AC ← 2 × (AC)
SHIFT RIGHT
JUMP x          PC ← x
JGE x           if (AC) ≥ 0 then PC ← x
LOAD ADR x      AC ← Extract address field (M[x])
STORE ADR x
Using Accumulator
Ci ← Ai + Bi, 1 ≤ i ≤ n
LOOP  LOAD   N       # AC ← M[N]
      JGE    DONE    # if (AC) ≥ 0, PC ← DONE
      ADD    ONE     # AC ← (AC) + 1
      STORE  N       # M[N] ← (AC)
F1    LOAD   A       # AC ← M[A]
F2    ADD    B       # AC ← (AC) + M[B]
F3    STORE  C       # M[C] ← (AC)
      JUMP   LOOP
DONE  HLT
Notice M[N] is a counter, not an index.
How to modify the addresses A, B and C?
Self-Modifying Code
Ci ← Ai + Bi, 1 ≤ i ≤ n
LOOP  LOAD       N      # AC ← M[N]
      JGE        DONE   # if (AC) ≥ 0, PC ← DONE
      ADD        ONE    # AC ← (AC) + M[ONE]
      STORE      N      # M[N] ← (AC)
F1    LOAD       A      # AC ← M[A]
F2    ADD        B      # AC ← (AC) + M[B]
F3    STORE      C      # M[C] ← (AC)
      LOAD ADR   F1     # AC ← address field (M[F1])
      ADD        ONE    # AC ← (AC) + M[ONE]
      STORE ADR  F1     # changes address of A
      LOAD ADR   F2
      ADD        ONE
      STORE ADR  F2     # changes address of B
      LOAD ADR   F3
      ADD        ONE
      STORE ADR  F3     # changes address of C
      JUMP       LOOP
DONE  HLT
Each iteration requires:
              total   book-keeping
Inst fetch    17      14
Stores        5       4
Index Registers
Specialized registers to simplify address calculations
- T. Kilburn, Manchester University, 1950s
- Instead of single AC register, use AC and IX registers
Modify Existing Instructions
- Load x, IX      AC ← M[x + (IX)]
- Add x, IX       AC ← (AC) + M[x + (IX)]
Add New Instructions
- Jzi x, IX       if (IX) = 0 then PC ← x, else (IX) ← (IX) + 1
- Loadi x, IX     IX ← M[x] (truncated to fit IX)
Index registers have accumulator-like characteristics
Using Index Registers
Ci ← Ai + Bi, 1 ≤ i ≤ n
      LOADi  -n, IX       # load -n into IX
LOOP  JZi    DONE, IX     # if (IX) = 0, PC ← DONE; else (IX) ← (IX) + 1
      LOAD   LASTA, IX    # AC ← M[LASTA + (IX)]
      ADD    LASTB, IX    # note: LASTA is address
      STORE  LASTC, IX    # of last element in A
      JUMP   LOOP
DONE  HALT
- Longer instructions (1-2 bits), index registers with ALU circuitry
- Does not require self-modifying code, modify IX instead
- Improved program efficiency (operations per iteration)
              total   book-keeping
Inst fetch    5       2
Stores        1       0
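To make the index-register loop concrete, here is a small Python sketch that steps through the same instruction sequence and counts instruction fetches and stores. The memory layout (A, B, C stored back to back) and the function name are assumptions made for the example, not part of the original machine description.

```python
def run_vector_add(a, b):
    """Simulate the index-register loop computing C[i] = A[i] + B[i]."""
    n = len(a)
    # Hypothetical layout: A at 0..n-1, B at n..2n-1, C at 2n..3n-1
    mem = list(a) + list(b) + [0] * n
    last_a, last_b, last_c = n - 1, 2 * n - 1, 3 * n - 1

    ix = -n        # LOADi -n, IX
    ac = 0
    fetches = 1    # the LOADi instruction
    stores = 0
    while True:
        fetches += 1                 # JZi DONE, IX
        if ix == 0:
            break                    # PC <- DONE
        ix += 1                      # else (IX) <- (IX) + 1
        ac = mem[last_a + ix]        # LOAD  LASTA, IX
        ac = ac + mem[last_b + ix]   # ADD   LASTB, IX
        mem[last_c + ix] = ac        # STORE LASTC, IX
        stores += 1
        fetches += 4                 # LOAD, ADD, STORE, JUMP
    fetches += 1                     # HALT
    return mem[2 * n:], fetches, stores


c, fetches, stores = run_vector_add([1, 2, 3], [10, 20, 30])
print(c, fetches, stores)
```

Each loop iteration fetches 5 instructions and performs 1 store, matching the table; the remaining 3 fetches are the one-time LOADi, the final JZi, and HALT.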
Modifying Index Registers
Option 1: Increment index register by k
    AC ← (IX)          new instruction
    AC ← (AC) + k
    IX ← (AC)          new instruction
Also, the AC must be saved and restored
Option 2: Manipulate index register directly
    INCi k, IX         IX ← (IX) + k
    STOREi x, IX       M[x] ← (IX) (extended to fit a word)
IX begins to resemble AC
- Several index registers, accumulators
- Motivates general-purpose registers (e.g., MIPS ISA R0-R31)
Evolution of Addressing Modes
1. Single accumulator, absolute address
    Load x               AC ← M[x]
2. Single accumulator, index registers
    Load x, IX           AC ← M[x + (IX)]
3. Single accumulator, indirection
    Load (x)             AC ← M[M[x]]
4. Multiple accumulators, index registers, indirection
    Load Ri, IX, (x)     Ri ← M[M[x] + (IX)]
5. Indirection through registers
    Load Ri, (Rj)        Ri ← M[M[(Rj)]]
6. The Works
    Load Ri, Rj, (Rk)    Ri ← M[Rj + (Rk)]; Rj = index; Rk = base address
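A few of these addressing modes can be sketched as plain Python functions over a list that stands in for memory. The function names and the memory contents are hypothetical; only the address arithmetic follows the forms listed above.

```python
def load_absolute(mem, x):
    """Load x: AC <- M[x]"""
    return mem[x]


def load_indexed(mem, x, ix):
    """Load x, IX: AC <- M[x + (IX)]"""
    return mem[x + ix]


def load_indirect(mem, x):
    """Load (x): AC <- M[M[x]]"""
    return mem[mem[x]]


def load_indirect_indexed(mem, x, ix):
    """Load Ri, IX, (x): Ri <- M[M[x] + (IX)]"""
    return mem[mem[x] + ix]


def load_base_index(mem, rj, rk):
    """Load Ri, Rj, (Rk): Ri <- M[Rj + (Rk)], Rj = index, Rk = base address"""
    return mem[rj + rk]


mem = [3, 10, 20, 30, 40, 50]   # hypothetical memory contents
print(load_absolute(mem, 1), load_indexed(mem, 1, 2), load_indirect(mem, 0))
```

Each step in the evolution adds one more term or one more memory lookup to the effective-address calculation.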
Evolution of Instruction Formats
Zero-address Formats
- Instructions have zero operands
- Operands on a stack
    add      M[sp] ← M[sp] + M[sp-1]
    load     M[sp] ← M[M[sp]]
- Stack can be registers or memory
- Top of stack usually cached in registers
[Figure: stack holding A, B, C, with register SP]
One-address Formats
- Instructions have one operand
- Accumulator is always other implicit operand
Evolution of Instruction Formats
Two-address Formats
- Destination is same as one of the operand sources
    Ri ← (Ri) + (Rj)     # (Reg x Reg) to Reg
    Ri ← (Ri) + M[x]     # (Reg x Mem) to Reg
- x can be specified directly or via register
- x address calculation could include indexing, indirection, etc.
Three-address Formats
- One destination and up to two operand sources
    Ri ← (Rj) + (Rk)     # (Reg x Reg) to Reg
    Ri ← (Rj) + M[x]     # (Reg x Mem) to Reg
Data Formats
Data Sizes
- Bytes, half-words, words, double words
Byte Addressing
- Location of most-, least-significant bytes
- Big Endian: most-significant byte (MSB) at the lowest address
- Little Endian: least-significant byte (LSB) at the lowest address
Word Alignment
- Suppose memory is organized into 32-bit words (i.e., 4 bytes).
- Word-aligned addresses begin only at 0, 4, 8, … bytes
[Figure: byte addresses 0 through 7]
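Byte ordering and alignment are easy to check with Python's standard `struct` module. The sketch below packs one 32-bit value both ways; the value and the helper name `is_word_aligned` are chosen only for illustration.

```python
import struct

value = 0x0A0B0C0D

# ">I" = big-endian unsigned 32-bit, "<I" = little-endian unsigned 32-bit
big = struct.pack(">I", value)
little = struct.pack("<I", value)

print("big endian:   ", big.hex())     # most-significant byte first
print("little endian:", little.hex())  # least-significant byte first


def is_word_aligned(address, word_bytes=4):
    """True if a byte address falls on a word boundary (0, 4, 8, ...)."""
    return address % word_bytes == 0


print([a for a in range(8) if is_word_aligned(a)])
```

The same four bytes appear in both encodings, only in opposite order, so data exchanged between machines of different endianness has to be byte-swapped.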
Software Developments
Numerical Libraries (up to 1955)
- floating-point operations
- transcendental functions
- matrix multiplication, equation solvers, etc.
High-level Languages (1955-1960)
- Fortran, 1956
- assemblers, loaders, linkers, compilers
Operating Systems (1955-1960)
- accounting programs to track usage and charges
Pitfall: Incomplete Metrics
Ignoring Instructions per Program
- Neglect dynamic instruction count
- Misleading if working in algorithms, compilers, or ISA
Using Instructions per Second
- MIPS = (Instructions / Cycle) x (Cycles / Second) x 1E-6
- FLOPS: considers only floating-point instructions
- Example: CPI = 2, clock frequency = 500MHz, 250 MIPS
- Example: compiler removes instructions, latency falls, MIPS increases
Using Clock Frequency
- Cannot equate clock frequency with performance
- Proc A: CPI = 2, f = 500MHz
- Proc B: CPI = 1, f = 300MHz
- Given the same ISA and compiler, B is faster
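The processor A versus B comparison can be checked numerically. The sketch below computes MIPS and the time to run a fixed program; the 1-billion-instruction program size is a hypothetical value, and the same instruction count is used for both processors because the slide assumes the same ISA and compiler.

```python
def mips(cpi, clock_hz):
    """MIPS = (instructions/cycle) x (cycles/second) x 1E-6."""
    return (1.0 / cpi) * clock_hz * 1e-6


def run_time_seconds(instructions, cpi, clock_hz):
    return instructions * cpi / clock_hz


program = 1e9  # dynamic instructions (hypothetical)

mips_a, time_a = mips(2, 500e6), run_time_seconds(program, 2, 500e6)
mips_b, time_b = mips(1, 300e6), run_time_seconds(program, 1, 300e6)
print("A: %.0f MIPS, %.2f s" % (mips_a, time_a))
print("B: %.0f MIPS, %.2f s" % (mips_b, time_b))
```

B finishes first despite its lower clock frequency, because its lower CPI more than compensates.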
Pitfall: Diminishing Returns
- Amdahl. “Validity of the single-processor approach…” AFIPS, 1967.
Amdahl’s Law (Make Common Case Fast)
Consider improving fraction F of system with a speedup S.
T(new) = T(base) x (1-F) + T(base) x F / S
       = T(base) x [(1-F) + F/S]
Speedup = T(base)/T(new) = 1 / [(1-F) + F/S]
Max Speedup = 1 / (1 – F)
Example
- Suppose FP computation is 1/4 of an application’s execution time
- Maximum benefit from optimizing FP unit is 1.3x (=1/0.75)
- Multiprocessor systems were original application of this law
- Accounts for diminishing marginal returns
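Amdahl's Law is short enough to evaluate directly. The sketch below reproduces the floating-point example (F = 1/4) and shows how the speedup approaches its bound as S grows; the particular values of S are arbitrary.

```python
def amdahl_speedup(fraction, speedup):
    """Overall speedup when `fraction` of execution time is sped up by `speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)


def max_speedup(fraction):
    """Limit of amdahl_speedup as the component speedup grows without bound."""
    return 1.0 / (1.0 - fraction)


for s in (2, 10, 100):
    print("S = %3d -> overall speedup %.3f" % (s, amdahl_speedup(0.25, s)))
print("bound: %.3f" % max_speedup(0.25))
```

Even a 100x faster FP unit yields an overall speedup below 1.33x, which is the diminishing return the slide refers to.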
Power
Power Factors
Definitions
- Energy (Joules) = a x C x V^2
- Power (Watts) = a x C x V^2 x f
Power Factors and Trends
- activity (a): function of application resource usage
- capacitance (C): function of design; scales with area
- voltage (V): constrained by leakage, which increases as V falls
- frequency (f): varies with pipelining and transistor speeds
- Models in cycle-accurate simulators (e.g., Princeton Wattch)
Dynamic Voltage and Frequency Scaling (DVFS)
- P-states: move between operational modes with different V, f
- Intel TurboBoost: increase V, f for short durations without violating
thermal design point (TDP)
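The dynamic power formula can be used to estimate what a DVFS step buys. The sketch below compares a nominal operating point with one where voltage and frequency are both scaled by 0.85; the activity factor, capacitance, voltage, and frequency are hypothetical values picked for the example.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Power (Watts) = a x C x V^2 x f."""
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical operating points
nominal = dynamic_power(0.2, 1e-9, 1.0, 2e9)
scaled = dynamic_power(0.2, 1e-9, 0.85, 0.85 * 2e9)

ratio = scaled / nominal
print("nominal %.3f W, scaled %.3f W, ratio %.3f" % (nominal, scaled, ratio))
```

Because power depends on V^2 and f, scaling both by the same factor changes dynamic power by roughly the cube of that factor. The model ignores leakage power.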
Power and Temperature
Temperature
• Power density (Watts / sq-mm) is proxy for thermal effects
• Estimate thermal conductivity, resistance to identify processor hot spots (e.g., HotSpot simulator)
Power Budgets
• Power affects package cost
• 130W servers, 65W desktops, 10-30W laptops, 1-2W hand-held
Power and Multiprocessors
Multiprocessors
• Chip multiprocessors (CMPs) integrate multiple cores on die
Efficiency
• Reduce power with simpler cores
• Recover lost performance with many-core parallelism
Power and Multiprocessors
Lower voltage, frequency
• Voltage, frequency scale together: V ∝ f
• Power proportional to V^2 and f: Power ∝ V^2 f
• Performance proportional to f: Perf ∝ f
Example
• Baseline: 1 core at V, f
• Multiprocessor: 4 cores at 0.85V, 0.85f; program is 75% parallel
• Core Power @ lower V, f: 0.61x = 0.85^3
• Core Performance @ lower V, f: 0.85x
• Multicore Power @ lower V, f: 2.44x = 0.61 x 4
• Multicore Performance @ 4 cores: 2.28x = 1/[0.25 + (0.75 / 4)]
• Multicore Performance @ lower f: 1.94x = 2.28 x 0.85
• Multiprocessor: 1.5% power per 1% performance [+144% power, +94% perf]
• Boosting V, f: 3% power per 1% performance [+(1.01^3 - 1) power, +(1.01 - 1) perf]
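The multiprocessor example can be recomputed step by step. The sketch below follows the slide's assumptions (V ∝ f, Power ∝ V^2 f, Perf ∝ f, Amdahl's Law for the 75% parallel program); the variable names are illustrative. Unrounded, the multicore power comes out near 2.46x rather than the 2.44x obtained from the rounded 0.61.

```python
def amdahl_speedup(parallel_fraction, cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


scale = 0.85   # voltage and frequency both scaled by 0.85
cores = 4

core_power = scale ** 3              # Power ~ V^2 f with V ~ f
core_perf = scale                    # Perf ~ f
multi_power = cores * core_power
multi_perf = amdahl_speedup(0.75, cores) * core_perf

# Percent extra power paid per percent extra performance
multi_ratio = (multi_power - 1.0) / (multi_perf - 1.0)
boost_ratio = (1.01 ** 3 - 1.0) / (1.01 - 1.0)   # raise V, f by 1% instead

print("multicore: %.2fx power, %.2fx perf, %.2f%% power per 1%% perf"
      % (multi_power, multi_perf, multi_ratio))
print("boosting:  %.2f%% power per 1%% perf" % boost_ratio)
```

Under these assumptions, adding slower cores costs about half as much power per unit of performance as raising voltage and frequency.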
Cost
Cost
Non-recurring Engineering (NRE)
- Dominated by engineer-years ($200K per engineer-year)
- Mask costs (>$1M per spin)
Chip Cost
- Depends on wafer and chip size, process maturity
Packaging Cost
- Depends on number of pins (e.g., signal + power/ground)
- Depends on thermal design point (e.g., heat sink)
Total Cost of Ownership
- Capital costs (e.g., server procurement cost)
- Operating costs (e.g., electricity)
Yield
Wafers
- Integrated circuits built with multi-step chemical process on wafers
- Cost per wafer depends on wafer size, number of steps
Chip (a.k.a. Die)
- If chips are large, fewer chips per wafer
- Larger chips have lower yield
- Uniform defect density
- Chip cost is proportional to area^2 to area^3
Process Variability
- Yield is non-binary
- Binning for speed grades
- Binning for core count
- Post-fabrication tuning with spares
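One way to see why larger chips cost disproportionately more is a simple yield model. The sketch below assumes a Poisson yield model with uniform defect density and ignores dies lost at the wafer edge; neither assumption comes from the slides, and the wafer cost, wafer area, and defect density are hypothetical.

```python
import math


def poisson_yield(defects_per_cm2, die_area_cm2):
    """Fraction of good dies under a Poisson defect model (assumption)."""
    return math.exp(-defects_per_cm2 * die_area_cm2)


def cost_per_good_die(wafer_cost, wafer_area_cm2, die_area_cm2, defects_per_cm2):
    """Wafer cost spread over good dies; edge loss ignored for simplicity."""
    dies_per_wafer = wafer_area_cm2 / die_area_cm2
    good_dies = dies_per_wafer * poisson_yield(defects_per_cm2, die_area_cm2)
    return wafer_cost / good_dies


# Hypothetical 300mm wafer (about 707 sq-cm), $5000 per wafer, 0.5 defects/sq-cm
small = cost_per_good_die(5000.0, 707.0, 1.0, 0.5)
large = cost_per_good_die(5000.0, 707.0, 2.0, 0.5)
print("1 sq-cm die: $%.2f, 2 sq-cm die: $%.2f" % (small, large))
```

Doubling the die area more than doubles the cost per good die, because fewer dies fit on the wafer and a smaller fraction of them work.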
Compatibility
Compatibility
In the early 1960s, IBM had 4 incompatible computers
- IBM 701, 650, 702, 1401
- Different instruction set architectures
- Different I/O systems, secondary storage (magnetic tapes, drums, disks)
- Different assemblers, compilers, libraries
- Different markets (e.g., business, scientific, real-time)
The need for compatibility motivated IBM 360.
IBM 360: Design Principles
Amdahl, Blaauw and Brooks, “Architecture of the IBM System/360” 1964
1. Support growth and successor machines
2. Connect I/O devices with general method
3. Emphasize total performance
- Evaluate programmability, answers per month not bits per second
4. Eliminate manual intervention
- Machine must be capable of supervising itself
5. Reduce down time
- Build hardware fault checking and fault location support
6. Facilitate assembly
- Redundant I/O devices, memories for fault tolerance
7. Support flexibility
- Some problems required floating-point words > 36bits
IBM 360: General Purpose Registers
Processor State
• Sixteen 32-bit general-purpose registers, used as index and base registers
• Four 64-bit floating-point registers
• Program status word (PSW) with program counter (PC)
• Condition codes, control flags
Data Formats
• 8-bit bytes: the IBM 360 is why bytes are 8 bits long today!
• 16-bit half-words
• 32-bit words
• 64-bit double-words
IBM 360: Initial Implementation
                 Model 30              Model 70
Storage          8K - 64 KB            256K - 512 KB
Datapath         8-bit                 64-bit
Circuit Delay    30 nsec/level         5 nsec/level
Local Store      Main Store            Transistor Registers
Control Store    1 microsecond read    Conventional circuits
Abstraction
• IBM 360 ISA hid technologies across models
Milestone
• The first true ISA designed as portable hardware-software interface
• With minor modifications, ISA still survives today
IBM z11: 47 Years Later
Technology (seconds / cycle)
5.2 GHz in IBM 45nm CMOS technology
1.4 billion transistors in 512 sq-mm
Microarchitecture (cycle / instruction)
64-bit virtual addressing
Out-of-order, 3-way superscalar pipeline
Redundant datapaths
L1 i-cache (64KB); L1 d-cache (128KB)
L2 cache (1.5MB), private, per-core
L3 cache (24MB), eDRAM
Power and Parallelism
Quad-core design
Scales to 96 cores in one machine
IBM HotChips 2010
Acknowledgements
These slides contain material developed and copyrighted by
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- Alvin Lebeck (Duke)
- David Patterson (UCB)
- Daniel Sorin (Duke)