Part 1: Introduction


UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Computer Architecture
ECE 668
Part 1
Introduction
Csaba Andras Moritz
Coping with ECE 668
 Students with varied backgrounds
 Prerequisites – Basic Computer Architecture, VLSI
 2 projects to choose from, some flexibility beyond that
 You need software and/or Verilog/HSPICE skills to complete them
 2 exams – midterm and final
 Class participation, attend office hours
 About the instructor
 First lectures – review of Performance and Pipelining (Chapter 1 + Appendix A)
 Many lectures will use the whiteboard as well as slides
 Lectures related to the textbook and beyond – many lectures are outside the textbook
 Web: www.ecs.umass.edu/ece/andras/courses/ECE668/
What you should know
 Basic machine structure
 processor (data path, control, arithmetic), memory, I/O
 Read and write an assembly language; C, C++, ...
 MIPS/ARM ISA preferred
 Understand the concepts of pipelining and virtual memory
 Basic VLSI – HSPICE and/or Verilog
Textbook and references
 Textbook: J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 4th edition (or later), Morgan Kaufmann.
 Recommended reading:
 J.P. Shen and M.H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.
 Chandrakasan et al., Design of High-Performance Microprocessor Circuits.
 NASIC research papers and Nanoelectronics textbook chapter; SKYBRIDGE, N3ASIC, CMOL, FPNI, SPWF papers if interested.
 Other research papers we bring up in class.
Course Outline
 I. Introduction (Ch. 1)
 II. Pipeline Design (App. A)
 III. Instruction-level Parallelism, Pipelining (App. A, Ch. 2)
 IV. Memory Design: Memory Hierarchy, Cache Memory, Secondary Memory (Ch. 4)
 V. Multiprocessors (Ch. 3)
 VI. Deep Submicron Implementation – Process Variation, Power-aware Architectures, Compiler's role
 VII. Nanoscale architectures
Administrative Details
Instructor: Prof. Csaba Andras Moritz
KEB 309H
Email: [email protected]
Office Hours: Tues. 2:30–3:30 pm & Thur. 2:30–3:00 pm
 TA – pending
 Course web page: details available at:
http://www.ecs.umass.edu/ece/andras/courses/ECE668
Grading
 Midterm – 35%
 Project – 30%: two projects to choose from
 Class Participation – 5%
 Final Exam – 30%
 Homework – exam questions
What is "Computer Architecture"?

Computer Architecture
=
Instruction Set Architecture + Machine Organization
(e.g., pipelining, memory hierarchy, storage systems, etc.)
or Unconventional Organization

Examples: the IBM 360 – one ISA spanning minicomputer, mainframe, and supercomputer implementations; Intel x86 vs. ARM vs. nanoprocessors
Computer Architecture Topics – Processors

(Diagram: the layered topics covered in this course)
 Input/Output and Storage: disks, tape; RAID (performance, reliability)
 Memory Hierarchy: DRAM, L2 cache, L1 cache; interleaving, bus protocols; VLSI; bandwidth, latency, addressing
 Instruction Set Architecture / Instruction-Level Parallelism: pipelining, hazard resolution, superscalar, reordering, branch prediction, VLIW, vector
 Advanced (2013): CMOS multi-cores & nanoprocessors?
Scaling
(Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program)

Shrinking geometry
(Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program)

Die

Wafer
CPUs: Archaic (Nostalgic) vs. Semi-Modern vs. Modern?

1982 Intel 80286
 12.5 MHz
 2 MIPS (peak)
 Latency 320 ns
 134,000 xtors, 47 mm²
 16-bit data bus, 68 pins
 Microcode interpreter, separate FPU chip
 (no caches)

2001 Intel Pentium 4
 1500 MHz (120X)
 4500 MIPS peak (2250X)
 Latency 15 ns (20X)
 42,000,000 xtors, 217 mm²
 64-bit data bus, 423 pins
 3-way superscalar, dynamic translation to RISC ops, superpipelined (22 stages), out-of-order execution
 On-chip 8KB data cache, 96KB instruction trace cache, 256KB L2 cache
2015?
Multi-core = Network on a chip
 Everything you learn as CSE students applied/integrated in a chip!
Intel Polaris with 80 cores
(Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program)

Tilera processor with 64 cores
 MIT startup from the Raw project (I used to be involved in this)
What is next: Nanoprocessors?
 Molecular memory, NASIC processors, 3D?

(Figure: crossed nanowire (NW) devices; courtesy of Prof Chui's group at UCLA.)
(Figure: NASIC ALU – 2-4 decoders and register file rf3~0 streaming opcode, operanda, operandb, dest, and result through an adder/multiplier. Copyright: NASIC group, UMASS.)
From Nanodevices to Nanocomputing

(Figure: a crossed nanowire array device – n+ gate, p-channel, n+ source & drain – scaled up to array-based circuits with built-in fault tolerance (NASICs): a bit-sliced adder with a/b inputs, s/c outputs, and clk/Up/Down control signals; evaluation/cascading via streaming control with surrounding microwires; culminating in a nanoprocessor.)
NASICs Fabric-Based Architectures

WIre Streaming Processor (WISP)
 General purpose stream processor
 5-stage pipeline with minimal feedback
 Built-in fault tolerance: up to 10% device-level defect rates
 33X density advantage vs. 16nm scaled CMOS
 Simpler manufacturing
 ~9X improved power-per-performance efficiency (rough estimate)

Cellular Architecture
 Special purpose for image and signal processing
 Massively parallel array of identical interacting simple functional cells
 Fully programmable from external template signals
 22X denser than 16nm scaled CMOS

N3ASIC – 3D Nanowire Technology
N3P – Hybrid Spin-Charge Platform
Skybridge 3D Circuits – Vertically Integrated
 3D circuit concept and 1-bit full adder designed in my group
 FETs are gate-all-around on vertical nanowires
Example ISAs in Processors (Instruction Set Architectures)

ISA             Versions                                        First year
ARM             (32, 64-bit, v8)                                1985
Digital Alpha   (v1, v3)                                        1992
HP PA-RISC      (v1.1, v2.0)                                    1986
Sun Sparc       (v8, v9)                                        1987
MIPS            (MIPS I, II, III, IV, V)                        1986
Intel           (8086, 80286, 80386, 80486, Pentium, MMX, ...)  1978

RISC vs. CISC
Basics
 Let us review some basics
RISC ISA Encoding Example
Virtualized ISAs
 BlueRISC TrustGUARD
 ISA is randomly created internally
 Fluid – more than one ISA possible
Characteristics of RISC
 Only Load/Store instructions access memory
 A relatively large number of registers

Goals of new computer designs
 Higher performance
 More functionality (e.g., MMX)
 Other design objectives? (examples)
How to measure performance?
• Time to run the task
– Execution time, response time, latency
– Performance may be defined as 1 / Ex_Time
• Rate of completing tasks
– Throughput, bandwidth
Speedup

performance(x) = 1 / execution_time(x)

"Y is n times faster than X" means
n = speedup = Execution_time(old, brand X) / Execution_time(new, brand Y)

Speedup must be greater than 1, so the slower (old) time goes in the numerator:
Tx/Ty = 3/2 = 1.5, but not Ty/Tx = 2/3 = 0.67
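
As a minimal Python sketch of these definitions (the function names are ours; the 3 s / 2 s times are the slide's example):

    def performance(execution_time):
        # Performance is the reciprocal of execution time
        return 1.0 / execution_time

    def speedup(time_old, time_new):
        # "New is n times faster than old": slower time on top
        return time_old / time_new

    print(speedup(3.0, 2.0))  # 1.5 -> Y is 1.5 times faster than X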
MIPS and MFLOPS
 MIPS (Millions of Instructions Per Second)
 Can we compare two different CPUs using MIPS? (Risky – instruction counts differ across ISAs and compilers; see the sketch below)
 MFLOPS (Millions of Floating-point Operations Per Second)
 Application dependent (e.g., compiler)
 Still useful for benchmarks
 Benchmarks: e.g., SPEC CPU 2000: 26 applications (with inputs)
 SPECint2000: twelve integer, e.g., gcc, gzip, perl
 SPECfp2000: fourteen floating-point intensive, e.g., equake
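
Why raw MIPS can mislead across ISAs, as a small Python sketch (the two machines and all numbers are invented for illustration):

    # Same program: the RISC compiler emits more, simpler instructions;
    # the CISC compiler emits fewer, slower ones.
    machines = {
        "RISC": {"insns": 2.0e9, "cpi": 1.0, "clock_hz": 1.0e9},
        "CISC": {"insns": 1.2e9, "cpi": 2.5, "clock_hz": 1.5e9},
    }

    for name, m in machines.items():
        ex_time = m["insns"] * m["cpi"] / m["clock_hz"]   # seconds
        mips = m["insns"] / (ex_time * 1e6)
        print(f"{name}: {ex_time:.1f} s, {mips:.0f} MIPS")
    # Both take 2.0 s, yet they rate 1000 MIPS vs. 600 MIPS:
    # equal performance, very different MIPS ratings.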
SPEC CPU 2000 – SPECint2000

Benchmark    Language  Category
164.gzip     C         Compression
175.vpr      C         FPGA Circuit Place & Route
176.gcc      C         C Compiler
181.mcf      C         Combinatorial Optimization
186.crafty   C         Game Playing: Chess
197.parser   C         Word Processing
252.eon      C++       Computer Visualization
253.perlbmk  C         PERL Prog. Language
254.gap      C         Group Theory, Interpreter
255.vortex   C         Object-oriented Database
256.bzip2    C         Compression
300.twolf    C         Place and Route Simulator

www.specbench.org/cpu2000
SPECfp2000

Benchmark     Language   Category
168.wupwise   Fortran77  Quantum Chromodynamics
171.swim      Fortran77  Shallow Water Modeling
172.mgrid     Fortran77  Multi-grid Solver
173.applu     Fortran77  Partial Differential Equations
177.mesa      C          3-D Graphics Library
178.galgel    Fortran90  Fluid Dynamics
179.art       C          Image Recognition / Neural Nets
183.equake    C          Seismic Wave Propagation
187.facerec   Fortran90  Face Recognition
188.ammp      C          Computational Chemistry
189.lucas     Fortran90  Primality Testing
191.fma3d     Fortran90  Finite-element Crash Simulation
200.sixtrack  Fortran77  Nuclear Physics Accelerator Design
301.apsi      Fortran77  Meteorology: Pollutant Distribution

(SPEC CPU 2006 is the current suite.)
Other Benchmarks (www.spec.org)

Workload Category               Example Benchmark Suites
CPU – Uniprocessor              SPEC CPU 2006, Java Grande Forum Benchmarks, SciMark, ASCI
CPU – Parallel Processor        SPLASH, NASPAR
Multimedia                      MediaBench
Embedded                        EEMBC benchmarks
Digital Signal Processing       BDTI benchmarks
Java – Client side              SPECjvm98, CaffeineMark
Java – Server side              SPECjBB2000, VolanoMark
Java – Scientific               Java Grande Forum Benchmarks, SciMark
On-Line Transaction Processing  TPC-C, TPC-W
Decision Support Systems        TPC-H, TPC-R
Web Server                      SPECweb99, TPC-W, VolanoMark
Electronic commerce             TPC-W, SPECjBB2000
Mail-server                     SPECmail2000
Network File System             SPEC SFS 2.0
Personal Computer               SYSMARK, WinBench, 3DMarkMAX99
Handheld devices                SPEC committee
Synthetic Benchmarks

Whetstone Benchmark (www.cse.clrc.ac.uk/disco/Benchmarks/whetstone.shtml)

Rank  Machine                        Total CPU (s)  MWIPS
1     Pentium 4/3066 (ifc)           9.2            4071
2     HP Superdome Itanium2/1500     9.8            3826
3     HP RX5670 Itanium2/1500-H      9.8            3855
4     Pentium 4/2666 (ifc)           10.4           3532
5     IBM pSeries 690Turbo/1.7       10.8           3472
6     Compaq Alpha ES45/1250         10.9           3441
7     HP RX4640 Itanium2/1300        11.3           3324
8     IBM Regatta-HPC/1300           11.5           3281
9     IBM pSeries 690Turbo/1.3       11.7           3260
10    AMD Opteron848/2200            11.8           3158

(The original table also listed per-loop Mflop ratings (Vl=1024: N2, N3, N8); those columns are not reproduced here.)

Dhrystone Benchmark – MIPS cores

Core   DMIPS/MHz  Freq (MHz)  DMIPS  Inline DMIPS/MHz  Inline DMIPS
4Kc    1.3        300         390    1.6               480
4KEc   1.35       300         405    1.8               540
5Kc    1.4        350         490    2.0               700
5Kf    1.4        320         448    2.0               640
20Kc   1.7        600         1020   2.2               1320
How do we design faster CPUs?
 Faster technology – used to be the main approach
 (a) getting more expensive
 (b) reliability & yield
 (c) speed of light (3×10^8 m/s) limits signal propagation
 Larger dies (SOC – System On a Chip)
 fewer wires between ICs, but low yield (next slide)
 Parallel processing – use n independent processors
 limited success
 n-issue superscalar microprocessor (currently n=4)
 Can we expect a Speedup = n?
 Pipelining
 Multi-threading
Power consumption
 Dynamic
 P_dynamic = α × C_L × V_dd² × f
 Leakage
 Mainly from subthreshold conduction (the FETs leak current)
 Significant at small feature sizes (lower Ion/Ioff)
 Power-aware architectures
 Objective is often to minimize activity
 Role of compilers – control
 Circuit-level optimizations – make the same work more efficient
 CAD tools – e.g., clock gating – make it easy to add
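
Plugging illustrative numbers into the dynamic-power equation in Python (every value below is an assumption, not from the slides):

    # P_dynamic = alpha * C_L * Vdd^2 * f
    alpha = 0.2      # activity factor (assumed)
    c_load = 1.0e-9  # switched capacitance in farads (assumed)
    vdd = 1.1        # supply voltage in volts (assumed)
    freq = 2.0e9     # clock frequency in Hz (assumed)

    p_dynamic = alpha * c_load * vdd ** 2 * freq
    print(f"P_dynamic = {p_dynamic:.2f} W")  # ~0.48 W
    # Note the quadratic V_dd term: lowering the supply voltage is the
    # single most effective knob for dynamic power.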
Define and quantify power
 Power_idle = Current_idle × Voltage
 Leakage current increases in processors with smaller transistor sizes
 Increasing the number of transistors increases power even if they are turned off
 Leakage is dominant below 90 nm
 Very low power systems even gate the voltage to inactive modules to control loss due to leakage
Define and quantify dependability (2/3)
 Module reliability = measure of continuous service accomplishment (or time to failure). Two metrics:
 Mean Time To Failure (MTTF) measures Reliability
 Mean Time To Repair (MTTR) measures Service Interruption
 Failures In Time (FIT) = 1/MTTF, the rate of failures
• Traditionally reported as failures per billion hours of operation
 Mean Time Between Failures (MTBF) = MTTF + MTTR
 Module availability (MA) measures service as it alternates between the two states of accomplishment and interruption (a number between 0 and 1, e.g., 0.9)
 Module availability MA = MTTF / (MTTF + MTTR)
Example calculating reliability
 If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), the overall failure rate is the sum of the failure rates of the modules
 Calculate FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF):

FailureRate = 10 × (1/1,000,000) + 1/500,000 + 1/200,000
            = (10 + 2 + 5) / 1,000,000
            = 17,000 FIT (failures per 10^9 hours)

MTTF = 10^9 / 17,000 ≈ 59,000 hours
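
The same computation as a short Python sketch (component counts and MTTFs taken from the example above):

    # 10 disks (1M h MTTF each), 1 controller (0.5M h), 1 power supply (0.2M h)
    mttfs = [1_000_000] * 10 + [500_000, 200_000]

    # Exponential lifetimes: the system failure rate is the sum of rates
    failure_rate = sum(1.0 / m for m in mttfs)  # failures per hour
    fit = failure_rate * 1e9                    # failures per 10^9 hours
    mttf = 1.0 / failure_rate                   # hours

    print(f"FIT = {fit:.0f}")      # 17000
    print(f"MTTF = {mttf:.0f} h")  # ~58824 (about 6.7 years)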
Integrated Circuits Yield

Die Yield = Wafer_yield × (1 + (Defect_Density × Die_Area) / α)^(-α)
Integrated Circuits Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost = Wafer cost / (Dies per Wafer × Die Yield)

Dies per wafer = π × (Wafer_diam / 2)² / Die_Area − (π × Wafer_diam) / sqrt(2 × Die_Area) − Test dies

Die cost goes up roughly with (Die_Area)²
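
Putting the yield and cost formulas together in a Python sketch (the wafer diameter, die area, defect density, α, and wafer cost are assumed example values):

    import math

    def dies_per_wafer(wafer_diam, die_area, test_dies=0):
        # Gross dies minus the edge-loss term, minus any test dies
        return (math.pi * (wafer_diam / 2) ** 2 / die_area
                - math.pi * wafer_diam / math.sqrt(2 * die_area)
                - test_dies)

    def die_yield(defect_density, die_area, alpha, wafer_yield=1.0):
        return wafer_yield * (1 + defect_density * die_area / alpha) ** (-alpha)

    # Assumed: 30 cm wafer, 1 cm^2 die, 0.4 defects/cm^2, alpha = 4, $5000 wafer
    n = dies_per_wafer(30.0, 1.0)
    y = die_yield(0.4, 1.0, 4.0)
    print(f"{n:.0f} dies/wafer, die yield {y:.2f}, die cost ${5000 / (n * y):.2f}")
    # ~640 dies/wafer, yield ~0.68, ~$11.4 per good die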
Amdahl's Law – Basics

Example: executing a program on n independent processors

Fraction_enhanced = parallelizable part of the program
Speedup_enhanced = n

ExTime_new = ExTime_old × (1 − Fraction_enhanced) + ExTime_old × Fraction_enhanced / n

Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / n)

lim (n → ∞) Speedup_overall = 1 / (1 − Fraction_enhanced)
Amdahl's Law – Graph
Law of Diminishing Returns
(Graph: Speedup_overall flattens toward the asymptote 1 / (1 − Fraction_enhanced) as the enhancement grows.)
Amdahl's Law – Extension

Example: improving part of a processor (e.g., multiplier, floating-point unit)

Fraction_enhanced = part of the program to be enhanced

Speedup_overall = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
                < 1 / (1 − Fraction_enhanced)

A given signal processing application consists of 40% multiplications. An enhanced multiplier will execute 5 times faster:

Speedup_overall = 1 / (0.6 + 0.4/5) = 1/0.68 = 1.47 < 1/0.6 = 1.66
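
A minimal Python sketch of Amdahl's Law, checked against the multiplier example above:

    def amdahl(fraction_enhanced, speedup_enhanced):
        # Overall speedup when only part of the work is sped up
        return 1.0 / ((1.0 - fraction_enhanced)
                      + fraction_enhanced / speedup_enhanced)

    # 40% of the work (multiplications) runs 5x faster
    print(f"{amdahl(0.4, 5):.2f}")     # 1.47

    # Bound as the enhancement becomes infinitely fast: 1 / (1 - 0.4)
    print(f"{amdahl(0.4, 1e12):.2f}")  # ~1.67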
Amdahl's Law – Another Example
 Floating-point instructions improved to run 2X faster, but only 10% of actual run time is used by FP instructions:

ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

Speedup_overall = 1 / 0.95 = 1.053
Instruction execution
 Components of average execution time (CPI Law)
 Average CPU time per program:

CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

where Cycles / Instruction = CPI and Seconds / Cycle = 1 / clock_rate

 The "End to End Argument" is what RISC was ultimately about – it is the performance of the complete system that matters, not individual components!
Cycles Per Instruction – Another Performance Metric

"Average Cycles per Instruction"
CPI = Total_No_of_Cycles / Instruction_Count

"CPI of Individual Instructions"
CPI_j – CPI for instruction type j (j = 1, …, n)
I_j – number of times instruction type j is executed

CPU time = Cycle_Time × Σ (j = 1..n) CPI_j × I_j

"Instruction Frequency"
CPI = Σ (j = 1..n) CPI_j × F_j, where F_j = I_j / Instruction_Count
Example: Calculating CPI

Base Machine (Reg / Reg) – typical mix of instruction types in a program:

Op      Freq  Cycles  CPI_j × F_j  (% Time)
ALU     50%   1       0.5          (33%)
Load    20%   2       0.4          (27%)
Store   10%   2       0.2          (13%)
Branch  20%   2       0.4          (27%)

CPI = 0.5 + 0.4 + 0.2 + 0.4 = 1.5
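
The same weighted-sum CPI calculation as a Python sketch (the instruction mix is copied from the table above):

    # (frequency, cycles) per instruction class
    mix = {"ALU": (0.50, 1), "Load": (0.20, 2),
           "Store": (0.10, 2), "Branch": (0.20, 2)}

    cpi = sum(f * c for f, c in mix.values())
    print("CPI =", cpi)  # 1.5
    for op, (f, c) in mix.items():
        # Each class's share of total cycles (the "% Time" column)
        print(f"{op:6s} {100 * f * c / cpi:.0f}%")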
Pipelining – Basics

4 consecutive operations: Z = F(X, Y) = SqRoot(X² + Y²)

X, Y → ( )² → + → Square Root → Z

If each step takes 1T, then one calculation takes 3T, and four take 12T.

Pipelined version:
Stage 1: X², Y²    Stage 2: X² + Y²    Stage 3: SqRoot → Z

Assuming ideally that each stage takes 1T:
What will be the latency (time to produce the first result)?
What will be the throughput (pipeline rate in the steady state)?
Pipelining – Timing

With the 3 stages overlapped, the 4 operations take a total of 6T (vs. 12T unpipelined); Speedup = 12T / 6T = 2

For n operations: total time = 3T + (n − 1)T, i.e., the latency plus (n − 1) further results at the steady-state rate

Speedup = n × 3T / (3T + (n − 1)T) = 3n / (n + 2) → 3 (the number of stages) as n → ∞
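
A quick Python check of the pipelined-timing formulas (the stage count and per-stage time T follow the example; the function names are ours):

    def pipeline_time(n_ops, n_stages=3, t=1.0):
        # First result after n_stages * t, then one result per t
        return n_stages * t + (n_ops - 1) * t

    def speedup(n_ops, n_stages=3):
        sequential = n_ops * n_stages * 1.0  # unpipelined, in units of T
        return sequential / pipeline_time(n_ops, n_stages)

    print(pipeline_time(4))  # 6.0 -> four operations take 6T
    print(speedup(4))        # 2.0
    print(speedup(10_000))   # ~3.0 -> approaches the stage count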
Pipelining – Non-ideal

Non-ideal situation:
1. Steps take T1, T2, T3: Rate = 1 / max(Ti)
   The slowest unit determines the throughput
2. To allow independent operation, latches must be added between stages:
   t = t_latch + max(Ti)
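
Both effects in a tiny Python illustration (the stage delays and latch overhead are assumed example numbers):

    # Unbalanced stage delays in ns, plus latch overhead per stage
    stage_delays = [0.8, 1.2, 1.0]  # T1, T2, T3 (assumed)
    t_latch = 0.1                   # latch/register overhead (assumed)

    cycle = t_latch + max(stage_delays)  # clock period, set by slowest stage
    print(f"cycle = {cycle:.1f} ns, "
          f"throughput = {1.0 / cycle:.2f} results/ns")  # 1.3 ns, 0.77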
Rule of Thumb for Latency Lagging BW

In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
(and capacity improves faster than bandwidth)

 Stated alternatively: bandwidth improves by more than the square of the improvement in latency (e.g., a 1.2X latency gain squared is only 1.44X, still less than the 2X bandwidth gain)
Latency Lags Bandwidth (last ~20 years)

 Performance Milestones
 Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth)
 Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
 Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
 Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(Log-log plot: relative bandwidth improvement, 1 to 10000, vs. relative latency improvement, 1 to 100, for Processor, Network, Memory, and Disk; every curve lies above the "latency improvement = bandwidth improvement" line. CPU high, memory low: the "Memory Wall".)
Summary of Technology Trends

 For disk, LAN, memory, and microprocessor, bandwidth improves by the square of the latency improvement
 In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
 Lag probably even larger in real systems, as bandwidth gains are multiplied by replicated components
 Multiple processors in a cluster, or even in a chip
 Multiple disks in a disk array
 Multiple memory modules in a large memory
 Simultaneous communication in a switched LAN
 HW and SW developers should innovate assuming Latency Lags Bandwidth
 If everything improves at the same rate, then nothing really changes
 When rates vary, real innovation is required
Summary of Architecture Trends

 CMOS microprocessors focus on computing bandwidth with multiple cores
 Accelerators for specialized support
 Software to take advantage – Von Neumann design
 As nanoscale technologies emerge, new architectural areas are created
 Unconventional architectures
» Not programmed – would operate more like the brain, through learning and inference
 As well as new opportunities for microprocessor design

Backup slides for students
6 Reasons Latency Lags Bandwidth

1. Moore's Law helps BW more than latency
• Faster transistors, more transistors, more pins help Bandwidth
» MPU transistors: 0.130 vs. 42 M xtors (300X)
» DRAM transistors: 0.064 vs. 256 M xtors (4000X)
» MPU pins: 68 vs. 423 pins (6X)
» DRAM pins: 16 vs. 66 pins (4X)
• Smaller, faster transistors, but they communicate over (relatively) longer lines: limits latency
» Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
» MPU die size: 35 vs. 204 mm² (ratio sqrt ≈ 2X)
» DRAM die size: 47 vs. 217 mm² (ratio sqrt ≈ 2X)
6 Reasons Latency Lags Bandwidth (cont'd)

2. Distance limits latency
• Size of DRAM block → long bit and word lines → most of DRAM access time
• Speed of light between computers on a network

3. Bandwidth easier to sell ("bigger = better")
• E.g., 10 Gbit/s Ethernet ("10 Gig") vs. 10 msec latency Ethernet
• 4400 MB/s DIMM ("PC4400") vs. 50 ns latency
• Even if just marketing, customers are now trained
• Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance
6 Reasons Latency Lags Bandwidth (cont'd)

4. Latency helps BW, but not vice versa
• Spinning a disk faster improves both bandwidth and rotational latency
» 3600 RPM → 15000 RPM = 4.2X
» Average rotational latency: 8.3 ms → 2.0 ms
» Other things being equal, also helps BW by 4.2X
• Lower DRAM latency → more accesses/second (higher bandwidth)
• Higher linear density helps disk BW (and capacity), but not disk latency
» 9,550 BPI → 533,000 BPI ≈ 60X in BW
6 Reasons Latency Lags Bandwidth (cont'd)

5. Bandwidth hurts latency
• Queues help Bandwidth, hurt Latency (queuing theory)
• Adding chips to widen a memory module increases Bandwidth, but higher fan-out on address lines may increase Latency

6. Operating System overhead hurts Latency more than Bandwidth
• Long messages amortize overhead; overhead is a bigger part of short messages