CS510: Computer Architectures


CS510: Computer Architectures
Fall, 2001
Jung Wan Cho
[email protected]
Ext.3512, 8701
Administration
Textbook:
J. L. Hennessy and D. A. Patterson
Computer Architecture: A Quantitative Approach,
2nd Edition, MKP, Inc.
Grading:
Homework: 10%
Mid-Term and Final Exams: 40% each
Others: 10%
Office:
Room 3437(3rd fl., East wing of CS Bldg)
Office Hours: M/W 10:00~11:30AM
TA:
이승학
Table of Contents
• Introduction
• Cost
• Performance
• Instruction Set Architecture
• Implementation
• Pipelining
• Instruction-level Parallelism
• Memory Hierarchy
• Storage Systems
• Input/Output Systems
• etc
Lecture 1
Introduction
Hardware Resources in a Computer System
Hardware Resources
• Central Processing Unit(CPU)
– Registers and Flags
– Arithmetic and Logic Unit(ALU, Execution Unit)
– Buses
• Memory System
– Main Memory
– Memory Bus
• I/O System
– I/O Devices
– I/O Buses
– I/O Controllers
Central Processing Unit
Central Processing Unit
Hardware Resources in a CPU can be classified into 3 classes
– Storage Resources
• Registers
• Flags
– Functional Resources
• ALU
– Transfer Resources
• Internal Buses
Storage Resources
Storage Resources
• Registers
– Dedicated Registers
• Data Registers
• Address Registers
– Index Register, Base Address Register, Stack Pointer Register,
Page/Segment Pointer Register
• Control Registers
– Major State Register, Timing State Register, etc
– Special Registers
• Registers that cannot be accessed by the programmer
– PC, MAR, MBR, IR, etc
– General Purpose Registers
• Registers that can be used for any purpose
• Flags
– Represent the resulting status of arithmetic operations
– Represent the status of the CPU
Dedicated Registers vs General Purpose Registers
• Register utilization efficiency issue
– GPRs are utilized better than Dedicated Registers
• One may run short of Data Registers while some Address Registers sit unused
• Instruction length issue
– Dedicated registers make the instruction length shorter than GPRs
• Fewer bits are needed to address the Dedicated Registers
• Execution speed issue
– The access time of a dedicated register is shorter than that of a GPR
• A shorter register address results in a faster access time (register address decoding time)
Flags
• A Flag is a 1-bit register(flip-flop)
– Represents the result of an arithmetic operation
• O_flag, Z_flag, N_flag(S_flag), C_flag, etc
– Represents the status of the CPU
• IE_flag(interrupt enable), IR_flag(interrupt requested), IM_flag(interrupt mask)
• U_flag(User/Supervisor mode)
• etc
• Flags can be packed into a register
– Packed flags along with other important special registers
– Program Status Word(s)(PSW)
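Packing flags into a PSW amounts to ordinary bit manipulation. The sketch below is a minimal Python model with assumed bit positions (not the layout of any particular machine):

```python
# A minimal sketch (bit positions assumed, not from the lecture) of flags
# packed into a Program Status Word and accessed with bit operations.
Z_FLAG  = 1 << 0   # zero result
N_FLAG  = 1 << 1   # negative result (sign)
C_FLAG  = 1 << 2   # carry out
O_FLAG  = 1 << 3   # overflow
IE_FLAG = 1 << 4   # interrupts enabled

psw = 0
psw |= Z_FLAG | IE_FLAG      # set flags
psw &= ~IE_FLAG              # clear a flag (mask interrupts)
print(bool(psw & Z_FLAG))    # test a flag -> True
```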
Register Life
– A Register Life begins with storing information into the register and ends just before new information is stored into that register
– Length of the Register Life: the number of machine instructions executed during the Register Life

[Diagram: R1's Register Life - the life begins when R1 <- information 1 is executed; later instructions Ri and Rj use R1 as a source, e.g., f(R1, X) and f(R1, Y); the life ends when R1 <- information 2 stores new information into R1.]
Statistics: How Many Registers?
We need to evaluate the following in terms of the time used or time saved by having or not having enough registers:
– How many registers are used simultaneously?
• The average number of simultaneously live registers is 2 ~ 6
– How many would be sufficient most or all of the time?
• No program uses more than 15 registers simultaneously
• 17 out of 41 programs would get by with fewer than 10 registers (10 registers would suffice 90% of the time)
– What would be the overhead if the number of registers were reduced?
• For a live register, 2 pairs of LD/ST instructions
• For a dead register life (a register with a long dormant period), 1 pair of LD/ST instructions
– For what purposes were the registers used during their lives?
• On average 39% (18 ~ 68%) of the lives are used for indexing
Status of CPU for the Running Program
• Context: the processor environment for the running program
– Contents of registers and flags
• The context of the program continuously changes as the execution progresses
• Context Switch
When the executing procedure changes, as in a procedure call or return, the context needs to be changed
• A Context Switch is a very costly operation
– Multiple register Load and Store operations
• LD multiple registers, ST multiple registers
– Alternatives: Multiple Sets of registers, or Overlapping Multiple Sets of registers
Context Switch
• The instruction set supports Load Multiple Registers and Store Multiple Registers instructions
• Multiple Register Sets
– Instead of saving and restoring the Context, simply change the Register Set Pointer value

[Diagram: a Register Set Pointer selects one of n register sets, Set 0 ... Set n-1.]
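To see why changing the Register Set Pointer is cheap, here is a minimal Python sketch (set and register counts assumed) in which a context switch touches no register contents at all:

```python
# A minimal sketch (not from the lecture) of multiple register sets:
# a context switch just changes the register-set pointer instead of
# saving/restoring every register through memory.
NUM_SETS, REGS_PER_SET = 4, 8

register_file = [[0] * REGS_PER_SET for _ in range(NUM_SETS)]
rsp = 0  # Register Set Pointer: selects the active set

def read_reg(i):
    return register_file[rsp][i]

def write_reg(i, value):
    register_file[rsp][i] = value

def context_switch(new_set):
    # O(1): no loads or stores of register contents are needed
    global rsp
    rsp = new_set
```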
Functional Resources
Arithmetic and Logic Unit
• Arithmetic Unit
– Basic arithmetic unit
• Adder
– High performance arithmetic unit(Superscalar, Superpipeline)
• Adder, Multiplier, Divider, etc
• Floating point unit, Decimal arithmetic unit, etc
• Logic Unit
– Basic logic unit
• A logic function from one of the functionally complete sets
– High performance logic unit
• AND, OR, Invert, EXOR, etc
• Shifter
– Serial shifter
– Parallel shifter
Speed Enhanced FU
• Multiple Functional Units
– Superscalar
• More than one functional unit performs operations concurrently on different sets of data
• Pipelined Functional Unit
– Superpipeline
• One functional unit performs the same function, but different phases of the operation, on many sets of data
Transfer Resources
Internal Bus
The more Internal Buses a CPU has, the more opportunities there are for Concurrent Operations.
Internal Bus
[Diagram: internal bus organization of three machine types, each connecting its registers and ALU]
– Stack_Machine: PC, IR, MAR, MBR, IX, SP, ALU
– AC_Machine: PC, IR, MAR, MBR, IX, AC, ALU
– GPR_Machine: PC, IR, MAR, MBR, IX, GPR file, ALU
Memory System
Main Memory
Performance Gap
[Figure: the processor-memory performance gap, 1980-2000, on a log scale from 1 to 1000. CPU performance grows 35%/year in the early 1980s and 55%/year thereafter, while DRAM performance grows only 7%/year.]
Memory Bus
[Diagram: the CPU and I/O share the Memory Bus to Memory, which holds the instructions and data of the running program; I/O provides files of data and programs.]
The SPEED of memory is always considered too SLOW, and the CAPACITY of memory is always considered too SMALL.
Memory Bandwidth
• Memory Bandwidth
– Information transfer rate in bits/sec
– Determined by the memory cycle (or access) time and the delay of the memory bus
– Memory Bandwidth can be increased by increasing the information unit (the size of a word), but this requires increasing the size of registers, the width of buses, etc
• Bottleneck
– Memory Bandwidth is shared by the CPU and I/O
» Normally the CPU accesses memory in every memory cycle for instruction fetch and data fetch/store
» I/O steals some of the memory cycles from the CPU
– The memory cycle is too slow with respect to the CPU speed
» The CPU often waits for information from memory before continuing its operation(s)
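As a quick illustration of the bandwidth relation above, the following Python sketch uses assumed numbers (a 60 ns cycle and an 8-byte word) to show how widening the word raises the bandwidth:

```python
# A small sketch (numbers are illustrative, not from the lecture):
# bandwidth = (memory cycles per second) x (bytes transferred per cycle)
cycle_time_ns = 60      # assumed memory cycle time
bytes_per_cycle = 8     # assumed 64-bit wide memory word

cycles_per_sec = 1e9 / cycle_time_ns
bandwidth_mb_per_sec = cycles_per_sec * bytes_per_cycle / 1e6
print(f"{bandwidth_mb_per_sec:.0f} MB/s")    # ~133 MB/s

# Doubling the word width doubles bandwidth for the same cycle time:
print(f"{cycles_per_sec * 2 * bytes_per_cycle / 1e6:.0f} MB/s")  # ~267 MB/s
```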
Speed Gap
Cache
– A small-capacity, fast storage placed between the CPU and Main Memory
– Information anticipated to be needed by the CPU is brought into the cache ahead of time
– The CPU can find the information in the cache most of the time, so the effective memory access time becomes close to the cache access time
Multiple Module Memory and Address Interleaving
– Main Memory is organized as multiple memory modules
– The memory system maps consecutive addresses onto different memory modules
– Accesses to consecutive addresses can then be pipelined: while one module sends out the accessed information, another module accesses its information, and meanwhile yet another module decodes its address, etc (see the sketch below)
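A minimal Python sketch of low-order address interleaving (the module count is assumed) shows how consecutive addresses land in different modules:

```python
# A minimal sketch of address interleaving (module count assumed, not
# from the lecture): consecutive addresses map to different modules,
# so consecutive accesses can proceed in a pipelined fashion.
NUM_MODULES = 4

def module_of(addr):
    return addr % NUM_MODULES        # low-order interleaving

def offset_in_module(addr):
    return addr // NUM_MODULES

for addr in range(8):
    print(addr, "-> module", module_of(addr), "offset", offset_in_module(addr))
# addresses 0,1,2,3 hit modules 0,1,2,3; address 4 wraps back to module 0
```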
Memory Capacity
Virtual Storage System
– Main Memory is supplemented by a large-capacity secondary storage
– Information needed by the CPU can be accessed from the Main Memory even though it is stored in the secondary storage
– In this way, the capacity of the Main Memory virtually looks like the capacity of the secondary storage
– Some mechanism is needed to bring the information stored in the secondary storage into the Main Memory before the CPU tries to access that information
Trends in Computer Architectures
Pre-WWII: Mechanical calculating machines
WWII-50s: Technology improvement
- relays --> vacuum tubes
- high-level languages
60s: Miniaturization/Packaging
- Transistors
- Integrated Circuits
70s: Semantic Gap
- Complex instruction sets
- Large support in hardware
- microcoding
80s: Keep it simple
- RISC
- Shift complexity to software
90s: What to do with all these transistors?
- Large on-chip caches
- Prefetching hardware
- Speculative execution
- Special-purpose instructions and hardware
- Multiple processors on a chip
Review on Technology Trends
Development of Computer Architectures
Dimension of Evolution
• Speed
» Component-level(technology driven) speedup
Cycle time -- Processor, Memory, I/O
» Instruction-level speedup
» Task/Program-level speedup
• Storage Capacity
» Main Memory
» Storage Hierarchy
• Functionality
» Instruction Set
» High Function Execution Unit
• Friendliness
» Programmer-friendly
» Compiler-friendly
Focusing Points
[Diagram: a computer system with the parameters of interest for each subsystem]
– CPU: Clock Period, CPI, Instruction Count
– Memory Bus: Bandwidth
– Memory: Capacity, Cycle Time
– Secondary Storage: Capacity, Data Rate
A Balanced Computer System
[Diagram: the CPU (execution rate E, in MIPS) is connected by the Memory Bus to Memory (capacity C in Mb, cycle time; memory bandwidth BW in Mb/sec), and by the I/O Bus to Secondary Storage (capacity in Mb, data rate B in Mb/sec). Balance is expressed by the C/E and E/B ratios.]
A Balanced Computer System: Subsystem Characteristics
• CPU
– Instruction execution rate: MIPS/MFLOPS
– MIPS is misleading, since the amount of information processed by an instruction varies:
1 MIPS on an 8-bit machine is not the same work as 1 MIPS on a 32-bit machine
• Memory
– Cycle time alone is misleading - what matters is how fast a unit of information is accessed
– Bandwidth = cycles/sec x bytes/cycle
– For a given cycle time, doubling the width of a word doubles the bandwidth
• Secondary Storage
– The Data Rate(B) does not completely represent the device performance
– IRG(inter-record gap), Seek time, and Latency also matter
A Balanced Computer System: Gaining Higher MIPS
FACT: Memory Cycle Time >> CPU Cycle Time
Design Features - Avoid memory access
• Large register file
• Cache
No matter how large they are, they cannot store all information - memory is still needed
FACT: I/O Transfer Rate(B) << Memory Bandwidth(BW)
Design Features
• Large capacity memory
• Virtual storage system
Memory Capacity(C) should be as large as possible
To increase MIPS(E), Memory Capacity(C) needs to be increased accordingly:
the C/E Ratio must be kept constant
A Balanced Computer System: Gaining Higher MIPS
No matter how large the Memory Capacity(C) is, Secondary Storage accesses are still needed
(library routines, system programs including the OS)
• Requirement of a high Data Rate(B)
- needs a constant E/B Ratio
• To increase the Data Rate(B), transfer bigger blocks
- needs larger buffers, hence larger memory capacity(C)
• Multiprogramming to reduce CPU idle time during secondary storage accesses
- needs large-capacity(C) memory for more users' load modules
To gain higher MIPS(E), a balanced C/E Ratio and a balanced E/B Ratio are needed
CPU Performance
CPU time = IC x CPI x Clock Period
– Clock Period: Component Technology and Hardware Organization
– CPI(Clocks per Instruction): Hardware Organization and
Instruction Set Architecture
– IC(Instruction Count): Instruction Set Architecture and Compiler
CPU performance depends on these three factors, and can be
improved by reducing one or more of these factors
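A small worked example of the formula, with assumed numbers, shows how improving any one factor reduces CPU time:

```python
# A small worked example (numbers assumed, not from the lecture) of
# CPU time = IC x CPI x Clock Period.
ic = 2_000_000        # instruction count
cpi = 1.5             # average clocks per instruction
clock_period = 2e-9   # 2 ns clock period (500 MHz)

cpu_time = ic * cpi * clock_period
print(f"CPU time = {cpu_time * 1e3:.1f} ms")            # 6.0 ms

# Improving any one factor helps: e.g., a compiler that cuts IC by 20%
print(f"With 0.8x IC: {0.8 * cpu_time * 1e3:.1f} ms")   # 4.8 ms
```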
Clock Period
[Diagram: registers feeding an ALU whose output is written back into the registers]
Clock Period = (Setup Time + Hold Time) of the registers(F/F) + Delay of (Registers + ALU)
The Clock Period is determined by
• Timing Characteristics of the Components
• Organization(Algorithm) of the Execution Unit
Clock Period: Evolution of Component Technology
Vacuum Tubes --> Transistors(discrete components) --> SSI(gates) --> MSI(functional blocks) --> LSI --> VLSI(subsystems) --> WSI(systems) --> ?
Each step brings opportunities to put more functions on a chip.
• Compact Packages
– Low power consumption
– Shorter delays, faster clock period
– Smaller size systems
• High Density Packages
– Higher functionality
– More reliable systems
Clock Period: Evolution of the Execution Unit(ALU)
The speed of arithmetic algorithms reached its theoretical upper bound:
addition in the second generation, and multiplication in the early part of the third generation of computers.
Add Time (Shmuel Winograd): the fastest add algorithm combines
CR Representation + a Conditional Sum Adder
Chinese Remainder Representation
Chinese Remainder Theorem
Given a set of N pairwise relatively prime numbers m_i, a set of remainders uniquely determines an integer A in the range 0 <= A < M, where

M = \prod_{i=1}^{N} m_i

Representation of A with the m_i's:

A = (a_1, a_2, ..., a_N) = (|A|_{m_1}, |A|_{m_2}, ..., |A|_{m_N})

Conversion to decimal: let

N_j = \prod_{i=1, i \ne j}^{N} m_i = M / m_j

A = \left| \sum_{i=1}^{N} N_i \, | a_i / N_i |_{m_i} \right|_M

where |x|_m denotes x mod m, and the division a_i / N_i is performed modulo m_i, i.e., a_i is multiplied by the inverse of N_i mod m_i.
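The encode/decode steps can be checked with a short Python sketch (the moduli 3, 5, 7 are chosen only for illustration); note the carry-free digit-wise addition:

```python
# A minimal sketch of Chinese Remainder Representation: encode A as its
# remainders, decode with the CRT formula above.
from math import prod

m = [3, 5, 7]    # pairwise relatively prime moduli (illustrative choice)
M = prod(m)      # representable range: 0 <= A < 105

def to_crr(A):
    return tuple(A % mi for mi in m)

def from_crr(a):
    # A = | sum_i N_i * |a_i * N_i^{-1}|_{m_i} |_M, with N_i = M / m_i
    total = 0
    for ai, mi in zip(a, m):
        Ni = M // mi
        total += Ni * ((ai * pow(Ni, -1, mi)) % mi)
    return total % M

print(to_crr(30))            # (0, 0, 2)
print(from_crr((0, 0, 2)))   # 30

# Digit-wise addition, no carries between digits:
a, b = to_crr(30), to_crr(47)
s = tuple((x + y) % mi for x, y, mi in zip(a, b, m))
print(from_crr(s))           # 77
```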
Characteristics of CRR
• Compact Code: each code symbol is much shorter than the original number
• Independent operations on the individual code symbols:
no need to consider carry propagation to adjacent code symbols
• Operations on complements of numbers for subtraction or negative-number representation
• Easy to generate the CRR and easy to convert back to decimal numbers
Conditional Sum Adder
[Worked example: adding 30 + 177. In one 1-bit add time, every bit position computes a conditional carry "c" and sum "s" for both possible carry-ins, 0 and 1. Then, in log2(n) selection stages, adjacent groups are merged: the low group's carry-out selects which of the high group's two precomputed results is correct.]
Add time = (1-bit add time) + log2(n) x (correct sum bit selection time)
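A Python sketch of the algorithm (the bit width is assumed to be a power of two) reproduces the slide's 30 + 177 example; each merge of the loop corresponds to one selection stage:

```python
def conditional_sum_add(a, b, n=8):
    """Add two n-bit numbers (n a power of two) the conditional-sum way."""
    assert n & (n - 1) == 0, "n must be a power of two"
    # Step 1: for every bit, precompute (sum bits, carry-out) for both
    # possible carry-ins, 0 and 1.  This is the single 1-bit add time.
    blocks = []
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        blocks.append(((x ^ y,), x & y,        # result if carry-in = 0
                       (x ^ y ^ 1,), x | y))   # result if carry-in = 1
    # Step 2: log2(n) selection stages.  Merge adjacent blocks by letting
    # the low block's carry-out select the high block's precomputed result.
    while len(blocks) > 1:
        merged = []
        for lo, hi in zip(blocks[0::2], blocks[1::2]):
            ls0, lc0, ls1, lc1 = lo
            hs0, hc0, hs1, hc1 = hi
            s0 = ls0 + (hs1 if lc0 else hs0)   # combined result, carry-in 0
            c0 = hc1 if lc0 else hc0
            s1 = ls1 + (hs1 if lc1 else hs0)   # combined result, carry-in 1
            c1 = hc1 if lc1 else hc0
            merged.append((s0, c0, s1, c1))
        blocks = merged
    sum_bits, carry, _, _ = blocks[0]          # the real carry-in is 0
    return sum(bit << i for i, bit in enumerate(sum_bits)) + (carry << n)

print(conditional_sum_add(30, 177))   # 207, as in the slide's example
```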
Upper Bound of Add Time
• Winograd's Upper Bound on Add Time

T > t_a + log_{r/2}( n / (r/2) ) * t_s

n: number of bits of the largest remainder
t_s: correct sum and carry selection time
t_a: 1-bit add time
r: fan-in of the functional module
• The speed of arithmetic operations can be further improved only by
– Component Technology
Faster switching, more functions in a package (larger fan-in)
– Architecture
Multiple functional units and concurrent operations
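Reading the bound numerically with assumed unit delays shows its logarithmic growth in the operand width n:

```python
# A quick numeric reading (parameter values assumed) of the bound
# T > t_a + log_{r/2}(n / (r/2)) * t_s
import math

t_a, t_s = 1.0, 1.0   # 1-bit add time and selection time, in gate delays
r = 4                 # assumed fan-in of the functional modules
for n in (8, 16, 32, 64):
    T = t_a + math.log(n / (r / 2), r / 2) * t_s
    print(f"n={n:2d}: T > {T:.1f} gate delays")
# The bound grows only logarithmically in the operand width n.
```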
Clocks per Instruction
Complex instructions take more clocks to complete.
Instruction Cycle:
Instruction Fetch --> Decode --> Operand Fetch --> Perform Operation
What keeps the cycle short: simple, short instructions; simple addressing modes; simple operations; non-memory-referencing instructions.
Clocks per Instruction(CPI)
Clocks per Instruction - the number of clock pulses to complete an instruction
• Instruction Set Processor Architecture dependent
– Concurrent execution of multiple instructions
– vector, pipelined, superscalar, VLIW
• Simple, Short, Direct(not encoded), and non-Memory-Referencing Instructions(RISC) are favorable
– RR-type instructions <- require a large number of registers
– Simple addressing modes
• Instructions performing complex functions(CISC - Powerful Instructions) are not favorable
– Multiplication, division, ... require many clock cycles
– LDM and STM
– Memory referencing with complex addressing modes
Instruction Count
The number of machine instructions needed to implement an application or an algorithm
• Depends on the Instruction Set
– In general, code becomes longer with simple instruction sets and shorter with complex instruction sets
• Depends on the Compiler
– It is a very difficult task for compilers to optimize code generation, especially in a general-purpose instruction set environment
Clocks per Instruction and Instruction Count: Instruction Set
Range of Instruction Sets:
1-instruction IS (Bn instruction) ... RISC ... general-purpose high-performance IS ... CISC
Issues in instruction set design: trade-offs among the 3 Es
• Elegance
- Completeness
- Symmetry
- Flexibility
• Efficiency
- Instruction length
- Address map
- Frequency of use
- Memory BW utilization
- Instruction execution O/H
• Environment
- Multiprogramming environment
- Code generation by compilers
Clocks per Instruction and Instruction Count: Complex Instruction Set(CISC)
Design Philosophy
Elegance
– General-purpose functions for all kinds of applications
• Programmer-friendly, compiler-friendly
– Support for a variety of addressing modes
Efficiency
– Reduction of the Semantic Gap between HLL and ML by including HLL primitives in the IS
• Increases the CPI
• Significantly reduces the Instruction Count
• Compiler-friendly
– Reduction of the ratio of overhead to execution
• High-performance instructions to use the memory bandwidth efficiently
– Reduction of the Instruction Count
Environment
– Instruction-level support for Sharing and Protection of resources and for context switching
• Reduces the Instruction Count
Instruction Set Evolution: Criticism of CISC
Large Powerful Instruction Sets
Efficiency Issues
• Large, powerful, general-purpose instruction sets are Efficient in terms of
– Flexibility and application adaptability
– Memory bandwidth utilization - Instruction Count
• However, they are also Inefficient from the viewpoint of
– Specialized instructions for several HLLs
• Excess baggage for any one language
• Compilers may not utilize the right instructions
– Ignoring program execution characteristics
• The most frequently used and most time-consuming instructions matter
• CPI is important
Control Unit Issues
• The Control Unit also becomes large and complex
– Microprogramming the CU is inevitable
– Microprogramming inefficiency - slow clock cycle
– Consumes a significant portion of the processor chip area
Clocks per Instruction and Instruction Count: Reduced Instruction Set (RISC)
Philosophy
• Make the most Frequently used statements(instructions) simple and fast
• Make the most Time-consuming statements(instructions) fast
Clocks per Instruction and Instruction Count: Reduced Instruction Set (RISC)
Most Frequently Executed Instructions
The most frequently executed HLL statements are Assignment Statements.
The most frequently executed type of machine operation is
storage <- F(storage, storage)
To make them execute fast, i.e., to reduce CPI:
1-cycle instructions
– Identical instruction format and length
– RR instructions, which use Registers for the storage of operands
(requires a large register file)
R <- F(R, R), e.g., ADD Rd,Rs1,Rs2
– Simple addressing modes for Loads and Stores
e.g., LD Rd, S2
Clocks per Instruction and Instruction Count: Reduced Instruction Set (RISC)
Most Time-Consuming Instructions
The most time-consuming statements are Procedure CALLs and RETURNs, which involve a large number of Loads and Stores.
To make CALLs and RETURNs fast, i.e., to reduce CPI:
Multiple Sets of Registers
– Switch register sets on a context switch
– Avoids memory accesses for the context switch, at the cost of R-R moves for parameters
– Further optimization by Overlapping Multiple Sets of Registers
Other time-consuming statements are Branch-type statements in pipelined execution architectures.
To make Branches fast:
Branch optimization by compilers
Multiple Register Set
[Diagram: a Register Set Pointer selects one of m register sets, Set0 ... Set(m-1), each containing registers R0 ... R(n-1).]
Overlapping MRS
[Diagram: one Physical Register File in which Proc A's Logical Register File and Proc B's Logical Register File overlap; the registers in the overlapping region are shared between the two procedures, e.g., for parameter passing.]
Clocks per Instruction and Instruction Count: Registers
Dimensions - Width, Number, and Types of Registers
Width of Registers
– 4, 8, 16, 32, 64, ...
– Wider registers improve both CPI and Instruction Count
Number of Registers
A large number of registers, in general, improves both CPI and Instruction Count
– Logical Registers
• Somewhere between 8 and 32 is optimal
• When insufficient, a penalty of 2 Loads and 2 Stores per operation
– Physical Registers
• Simply the more the better, but on-chip space is limited
• MRS and Overlapping MRS improve CPU performance significantly
Types of Registers
– General Purpose vs Dedicated Special Purpose Registers
– Flexibility vs Register addressing
Penalty for Not Enough Registers
[Diagram: traffic between Registers and Memory when registers run short - store data that is not needed now whenever its register is needed for another data load, then load the saved data back when it is needed again; meanwhile the current data must also be loaded.]
Clocks per Instruction and Instruction Count: Control Unit Implementation
[Diagram: a microprogrammed Control Unit within the CPU. The IR is decoded into a starting address for the CSAR; the Control Storage holds the microprogram and emits micro-instructions that drive the Control Points; flags and the status of the CPU select branch addresses within the microprogram.]
Clocks per Instruction and Instruction Count: Control Unit Implementation
Hardwired Control or Microprogrammed Control
Flexibility Issues
– Application adaptability
– Debugging
– Modification, Tailorability, Development
Microprogrammed Control has a definite advantage over Hardwired Control
Speed
– Hardwired control is definitely faster, i.e., gives a smaller CPI
Cost
[Graph: Control Unit cost versus richness of the instruction set - the hardwired cost rises steeply with instruction-set richness, while the microprogrammed cost rises slowly.]
Implementation of a Microprogrammed CPU
[Diagram: A Machine = CPU(Hardware) + Instruction Set. With a microprogrammed CU, the underlying hardware serves as a Host Machine on which the instruction set is realized by the microprogram.]
Clocks per Instruction and Instruction Count: Microprogrammed Control Unit
Microprogramming made the computer family concept feasible
– A set of hardware-wise different computers which provide an identical IS
– Developing a Computer Family with Hardwired Control is very expensive
– Each computer in the family has a hardware-wise identical Control Unit with a different microprogram
Microprogrammed computers have evolved into a new concept:
– Microprogrammable computers
The Control Unit provides Writable Control Storage so that the microprogram can be modified or replaced with a new microprogram for the user's needs
– The Universal Host concept
Microprogrammable computers could improve the Instruction Count
– Instructions tailored to the application can be implemented
Computer Family, Universal Host
[Diagram, left: a CPU family (same ISA) realized with different hardware - CPU1 ... CPUn, each with its own machine M1 ... Mn and control unit CU1 ... CUn holding microprogram mP1 ... mPn. Right: a single Universal Host with a microprogrammable CU realizes the different target machine ISAs by loading the corresponding microprograms.]
Further Optimization of CPI and IC
Further improvement in the Effective CPI can be made by executing multiple instructions in parallel
– Pipelining of instruction executions
– Superscalar
Further improvement in IC can be made by executing a single instruction on multiple operands
– Vector Processing(SIMD)
Further improvement in both CPI and IC can be made by executing a single instruction that specifies multiple operations
– VLIW
Reduction of CPI: Pipelining
Reduce the effective CPI by overlapping the executions of multiple instructions in every clock cycle.
3-stage Pipelined Instruction Execution
[Diagram: instructions I0-I4 flow through the stages F(Instruction Fetch), D(Decode, Assemble operand), and E(Execute, write result); after the pipeline fills, one instruction completes per clock, with filling at the start and draining at the end.]
Sources of inefficiency: Hazards, and pipeline filling and draining
– Structural Hazards
– Data Hazards
– Control Hazards
Hazards degrade the pipeline performance
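A back-of-the-envelope Python sketch (stall counts assumed for illustration) shows how hazard stalls raise the effective CPI above the ideal value of 1:

```python
# Effective CPI of a pipeline: ideal CPI is 1 once the pipeline is full,
# and every hazard adds stall cycles per instruction (numbers assumed).
def effective_cpi(ideal_cpi, stalls_per_instr):
    return ideal_cpi + stalls_per_instr

# e.g., 0.2 stall cycles/instr from data hazards, 0.15 from control hazards
cpi = effective_cpi(1.0, 0.2 + 0.15)
print(cpi)        # 1.35
# Speedup over a non-pipelined machine taking 3 cycles per instruction:
print(3.0 / cpi)  # ~2.2x of the ideal 3x
```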
Reduction of CPI: Superscalar and Superpipeline
Reduce the Effective CPI by issuing Multiple Instructions for Concurrent Execution to the instruction execution pipeline
– Superscalar - multiple copies of the pipeline stages
For m copies, the effective CPI is 1/m of the base pipeline's
– Superpipeline - pipelined pipeline stages
For m sub-stages, the effective CPI is 1/m of the base pipeline's
[Diagrams: superscalar instruction execution of degree 3, and superpipelined instruction execution of degree 3.]
Reduction of CPI: Superscalar and Superpipeline
CPI can be reduced to less than 1 (sub-unity CPI)
– Hazards become more serious
– Thus, performance enhancement techniques are needed:
• Register renaming for data hazards
• Out-of-order Issue and out-of-order Execution
• Branch prediction
Reduction of CPI: VLIW
A single instruction specifies multiple concurrent operations
• Requirements
– A large number of data paths, to make a large number of concurrent operations possible
– An intelligent compiler, to pack the collection of operations representing a program into VLIW instructions
Execution of VLIW Instructions
• Simple hardware implementation
– No need for hardware to detect instruction parallelism, which is explicitly specified by the instruction
• Binary compatibility is absent among processors
Classification of Computer Architectures
Mike Flynn's Classification
• Instruction Stream
– A sequence of instructions executed by a single processor
• Data Stream
– A sequence of data processed by a single processor
SISD Architecture
SIMD Architecture
MISD Architecture
MIMD Architecture
SIMD Architectures
[Diagram: a single Control Unit broadcasts one instruction stream (IS) to processing elements PE0 ... PEn-1; each PE operates on its own data stream DS0 ... DSn-1 through a Shared Memory built from modules M0 ... Mm-1.]
Vector Processor or Array Processor
MIMD Architectures
– Shared-Memory Multiprocessor
– Message-Passing Multicomputer
– Scalable Shared-Memory Multiprocessor
– Dataflow Architecture
– Multithreaded Architecture
Types of MIMD Architectures: Shared-Memory Multiprocessor
[Diagram: processors P and memory modules M connected through an Interconnection Network.]
Limitations
– Memory access latency
– Hot Spot problem
– Unscalable
Examples: Encore Multimax, Sequent Symmetry, BBN TC-2000
Scalable Shared-Memory Multiprocessor
– Global single address space
– Stanford DASH, KSR-1, IEEE SCI
Types of MIMD Architectures: Message-Passing Multicomputer
[Diagram: processor-memory nodes (each P with its own M) connected point-to-point through an Interconnection Network.]
• Scalable
• Limitations
– Communication overhead
– Hard to program
Examples: TMC CM-5, Intel Paragon, nCUBE
Future Trends of Processors
• Complexity
– 50 million transistors on a 1-inch die
– Incorporate multiple processors on a single chip
• Performance
– Over 2,000 MIPS
– Operation at over 250MHz
• Architectural Features
– 64-bit addresses and data types
– 256-bit I/O paths to memory
– Special-purpose processing units
• Vector floating-point operations
• Graphics, video, sound generation and processing
An entire personal computer or workstation on a chip
Future Trends of Processors
[Figure: the design space of modern processor families, plotted as CPI (0.1 to 20) versus clock rate (5 to 1000 MHz) with a constant 100-MIPS contour. Scalar RISC sits near CPI 1-5 at moderate clock rates, with superscalar and superpipelined designs below it; superscalar RISC, VLIW, and vector supercomputers occupy the sub-1 CPI region. The most likely future processor space lies toward sub-1 CPI at hundreds of MHz.]
Architecture: Where are we heading?
"Parallel Processing" is the magic word
History: The speed of traditional single-CPU computers has increased 10-fold every 5 years. But by the time a new parallel computer has been developed and implemented, single-CPU computers will be just as fast.
Minsky's Conjecture: The speedup achievable by a parallel computer increases as the logarithm of the number of processors, so large-scale parallelism is unproductive.
Amdahl's Law: A small number of sequential operations can effectively limit the speedup of a parallel algorithm. If the sequentially executed portion is 10%, the maximum achievable speedup is 10, no matter how many processors a parallel computer has.
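The 10% example can be checked directly; the short Python sketch below evaluates Amdahl's formula Speedup = 1 / (s + (1 - s)/n):

```python
# A quick check of the Amdahl's Law example above: with a 10% sequential
# fraction, the speedup approaches 10 no matter how many processors are used.
def amdahl_speedup(seq_fraction, n_processors):
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n_processors)

for n in (10, 100, 1000, 10**6):
    print(n, round(amdahl_speedup(0.10, n), 2))
# 10 -> 5.26, 100 -> 9.17, 1000 -> 9.91, 1000000 -> 10.0 (asymptote = 1/0.10)
```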
Architecture: Where are we heading?
Up to now, the only class of parallel computers well established in the market is Vector Machines
– Many applications are being developed
– But these machines are not suited to many classes of problems:
machine understanding/translation, expert systems, knowledge-base applications, heuristics, ...
Building Block: Where are we heading?
Application Adaptability through the Reconfigurable Building Block is the magic word.
• Computer Technology Developments
– Higher degrees of integration: from high-density chips toward wafer-scale integration
– Needs for new packaging, mounting, cooling, ... techniques
• Architectural Developments
– Functionality: more functions in a package
– Reliability: redundancy, fail-soft operation
– Multiplicity: multiple copies of subsystems
A Building Block contains highly complex functions for all kinds of applications
– For a particular application, the unused functions are extra baggage
– This requires Reconfiguration within the building block to mask out the unneeded functions
A reincarnation of the Microprogram