Transcript a. ALU 1
Computer Organization and
Architecture
(3 Credits/SKS)
Prof. Dr. Bagio Budiardjo
Even Semester (Semester Genap) 2010/2011
About the Course :
Course Objectives: After completing this course the
students are expected to understand and to be able to
analyze computer architecture, in particular
instruction-set design (e.g. addressing modes) and its
influence on performance. The students are also expected
to understand the meaning of computer organization, that
is, the interconnection of the sub-systems of a computing
system: CPU, memory, bus and I/O.
The students are also expected to understand a more advanced
technique in processor design: pipelining.
Key words : architecture, instruction-set design, computer
organization, performance, processor design and,
pipelining techniques
About the grading scheme :
• This part is not rigid; the grade will be a combination of
homework, quizzes, exercises, mid-test and final test, whenever
possible.
• One possible scheme is:
Homework : 15% (4 assignments)
Mid test : 40%
Final test : 45%
• Grading the homework: maximum 5 points each, with three
levels of grading: Good (5), OK (3), and Bad (2).
The books and supporting materials :
• William Stallings's book Computer Organization and
Architecture, Seventh Edition, Prentice Hall, 2006, will be used
as the main reference for this lecture. There is a new edition of
this book, issued in 2010, but up till now it is still unavailable in
Jakarta.
• The classic book Logic and Computer Design Fundamentals,
by M. Morris Mano and Charles Kime (Pearson Asia, 2004), is good,
but it puts too much stress on digital logic.
We use materials from this book to explain the hardware
design of computer components, whenever possible.
• Chapters covered will be: 1, 2, 3, 4, 5, 10, 11
and 13 (Stallings). Additional materials about pipelining are
taken from another book.
Books and supporting materials - continued
• There will be no handouts (unless it is very important).
Lecture notes are distributed via memory stick/CD; the SAP
can be downloaded from SIAK-NG
• Students are encouraged to read books/papers in this field
of study.
Schedule of class :
• At scheduled time and place (K-102) for about 120
minutes
• Lecture will be given mainly using LCD projector
About the “course direction”
Why do we study Computer Architecture ?
History:
A course under this name was taught in many
universities long before microprocessors
existed. Years ago, people studied mainframe
architectures: IBM S/370, CDC Cyber, CRAY,
Amdahl, etc.
Since microprocessors emerged, this course has
changed slightly to cope with more advanced
topics: computer design and performance issues
About the “course direction”
[Diagram: where this course sits. Computer Organization & Architecture (OAK, this course) branches into three directions:
• Micro & Embedded – microprocessors, applications of µprocessors, and embedded systems: embedding µprocessor-based intelligence into new systems/devices.
• Processor Architecture & Design – analyzing processor design, emphasizing how to obtain better processing speed; analyzing and implementing computer systems to achieve the best processing speed (cost effectiveness).
• Parallel & Distributed Computing Systems – organizing processors/computing systems to obtain better speed-up with different processing paradigms.]
About the “course direction” - continued
This course is aimed at :
1. Explaining the phenomena of computer
architecture and computer design
Knowing the basic instruction cycle and its
implications for processing speed
2. Studying the "key" problems:
a. CPU-memory bottleneck
b. CPU-I/O device problems
3. Studying how "performance" could be
improved
(example: CPU-memory: cache memory)
4. How could we improve execution speed
with other techniques?
(example: pipelining)
Reasons for studying
Computer Architecture
(Stalling’s arguments)
• Able to select a "proper" computer system for a
particular environment (cost and effectiveness)
• Able to analyze a processor "embedded" in an
environment; able to analyze the use of a
processor in an automobile, and to use proper
tools for the analysis
• Able to choose proper software for a particular
computer system
View of a Computer System
– Processor Organization : Another view
[Figure: one view of the processor organization. The CPU (Central Processing Unit) contains the Control Unit, IR, PC, MAR, MBR, cache memory, general registers R1-R3, the ALU latches ALU1, ALU2, ALU3, and the ADDER, all interconnected by an internal BUS and connected to/from memory through the MMU (Memory Management Unit); an FPU (Floating Point Unit) sits beside the ALU. Issues: clock speed, gating signals. The whole organization is implemented in a chip.]
Frequently Asked Question
What is the role of the CPU clock?
What is the difference between a P IV/2.4 G and a
P IV/3.0 G? (CPU clock speeds of 2.4 and 3.0 GHz)
Consider an instruction of a CPU:
AR R1, R2
(add register: add the content of R1 and the content of
register R2, place the result in R1)
– Execution steps of AR R1,R2
The "possible" micro-execution steps are:
a. ALU1 ← [R1]   {content of R1 is moved to ALU1}
b. ALU2 ← [R2]   {content of R2 is moved to ALU2}
c. ADD           {ALU1 + ALU2 → ALU3}
d. R1 ← [ALU3]   {result of the addition is moved to R1}
If each micro-step is executed in "one" clock cycle,
then this AR instruction needs 4 clock cycles.
For the time being, we ignore the fetch cycle.
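A minimal sketch in Python of the single-bus constraint (one transfer, or one ADD, per clock), just to make the four micro-steps concrete; the register values 7 and 5 are arbitrary, and this is an illustration, not the actual hardware.

# AR R1,R2 on the single-bus CPU: each micro-step costs one clock,
# so the instruction takes 4 clocks (fetch cycle ignored, as in the slide).
regs = {"R1": 7, "R2": 5, "ALU1": 0, "ALU2": 0, "ALU3": 0}

micro_steps = [
    ("ALU1", "R1"),    # a. ALU1 <- [R1]
    ("ALU2", "R2"),    # b. ALU2 <- [R2]
    ("ADD",  None),    # c. ALU3 <- ALU1 + ALU2
    ("R1",   "ALU3"),  # d. R1   <- [ALU3]
]

clock = 0
for dst, src in micro_steps:
    if dst == "ADD":
        regs["ALU3"] = regs["ALU1"] + regs["ALU2"]
    else:
        regs[dst] = regs[src]   # the single bus allows only this one transfer
    clock += 1                  # each micro-step takes one clock cycle

print(regs["R1"], "computed in", clock, "clock cycles")   # 12 in 4 clock cycles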
ADD R1, R2 – Processor Organization – continued.1
a. ALU1 ← [R1]
b. ALU2 ← [R2]
c. ADD
d. R1 ← [ALU3]
[Figure: the single-bus processor (Control Unit, IR, PC, MAR, MBR, R1-R3, ALU1-ALU3, ADDER, BUS) with step a highlighted: ALU1 ← [R1], the content of R1 gated over the BUS into ALU1. Dashed lines: inactive paths/units.]
ADD R1, R2 – Processor Organization – continued.2
a. ALU1 ← [R1]
b. ALU2 ← [R2]
c. ADD
d. R1 ← [ALU3]
[Figure: the same single-bus processor with step b highlighted: ALU2 ← [R2], the content of R2 gated over the BUS into ALU2. Dashed lines: inactive paths/components.]
ADD R1, R2 – Processor Organization – continued.3
a. ALU1 ← [R1]
b. ALU2 ← [R2]
c. ADD
d. R1 ← [ALU3]
[Figure: the same single-bus processor with step c highlighted: ADD, the ADDER combines ALU1 and ALU2 into ALU3. Dashed lines: inactive paths/components.]
ADD R1, R2 – Processor Organization – continued.4
a. ALU1 ← [R1]
b. ALU2 ← [R2]
c. ADD
d. R1 ← [ALU3]
[Figure: the same single-bus processor with step d highlighted: R1 ← [ALU3], the result gated over the BUS back into R1. Dashed lines: inactive paths/components.]
– Processor Organization – Microprogram
ADD R1, R2
a. ALU1 ← [R1]
b. ALU2 ← [R2]
c. ADD
d. R1 ← [ALU3]
[Figure: the single-bus processor with gating signals added: A1, A2, A3 gate registers R1, R2, R3 onto/off the BUS; B1, B2, B3 gate ALU1, ALU2, ALU3; ADD activates the ADDER. The Control Unit issues the microprogram below.]
Microprogram (1 = open; 0 = closed):

Step  A1  A2  A3  B1  B2  B3  ADD
 a     1   0   0   1   0   0    0
 b     0   1   0   0   1   0    0
 c     0   0   0   0   0   0    1
 d     1   0   0   0   0   1    0
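The same table can be written down as data. The Python sketch below only re-encodes the gate settings above and prints which gates are open in each step; it is an illustration of the idea of a stored microprogram, not of any real control unit.

# The ADD R1,R2 microprogram from the table above, written down as data.
# 1 = gate open, 0 = gate closed; columns follow the slide: A1 A2 A3 B1 B2 B3 ADD.
signals = ["A1", "A2", "A3", "B1", "B2", "B3", "ADD"]
microprogram = {
    "a": [1, 0, 0, 1, 0, 0, 0],   # ALU1 <- [R1]
    "b": [0, 1, 0, 0, 1, 0, 0],   # ALU2 <- [R2]
    "c": [0, 0, 0, 0, 0, 0, 1],   # ADD
    "d": [1, 0, 0, 0, 0, 1, 0],   # R1   <- [ALU3]
}
for step, word in microprogram.items():
    opened = [name for name, bit in zip(signals, word) if bit]
    print(f"step {step}: open {', '.join(opened)}")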
Analysis of Instruction Cycle
• With a single bus it is slow, since in each "clock"
only one transfer can be executed
• Is there any other way to "improve" the speed?
• A dual-bus processor may be faster
• Additional processor cost
Dual processor-bus : A way to improve speed
[Figure: dual-bus processor. Registers R1-R3, ALU1, ALU2, the ADDER and ALU3 are attached to two buses (bus 1 and bus 2); the other components (Control Unit, IR, PC, MAR, MBR) share the buses.]
With two buses:
1. ALU1 ← [R1] (bus 1); ALU2 ← [R2] (bus 2)
2. ADD
3. R1 ← [ALU3] (bus 1)
Only 3 clock cycles needed, 25% faster.
How about this:
1. ALU1 ← [R1] (bus 1); ALU2 ← [R2] (bus 2); ADD
2. R1 ← [ALU3] (bus 1)
Only 2 clock cycles needed, 50% faster.
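The "25%" and "50%" figures follow directly from the cycle counts; a quick arithmetic check (nothing hardware-specific, cycle counts taken from the slides):

# 4 clocks on the single bus versus 3 or 2 clocks on the dual-bus variants
# (fetch cycle still ignored).
single_bus = 4
for dual_bus in (3, 2):
    saving = (single_bus - dual_bus) / single_bus
    print(f"{dual_bus} clocks: {saving:.0%} fewer clocks than {single_bus}")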
Dual processor-bus : Microprogram level representation
[Figure: the dual-bus processor at the microprogram level. R1, R2, R3, ALU1, ALU2, the ADDER and ALU3 are attached to bus 1 and bus 2 through gating signals A1-A6 and B1-B6; a SUB signal activates subtraction in the ADDER. The other components (Control Unit, IR, PC, MAR, MBR) share the buses.]
How do we create the microprogram for the
instruction SUB R3, R2?
Microprogram for SUB R3, R2 on dual bus Processor
1. Assume that the subtraction and the transfer back of the
result of the SUB operation are done in separate clocks:

Step  A1  A2  A3  A4  A5  A6  B1  B2  B3  B4  B5  B6  SUB
 a     0   0   1   0   1   0   1   0   1   0   0   0    0
 b     0   0   0   0   0   0   0   0   0   0   0   0    1
 c     0   0   0   0   0   1   0   0   0   0   0   1    0
2. Assume that the subtraction and the transfer back of the
result of the SUB operation are done in the same clock:

Step  A1  A2  A3  A4  A5  A6  B1  B2  B3  B4  B5  B6  SUB
 a     0   0   1   0   1   0   1   0   1   0   0   0    0
 b     0   0   0   0   0   1   0   0   0   0   0   1    1
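As with the single-bus ADD microprogram, the two SUB variants can be written down as lists of control words. The Python sketch below only re-encodes the two tables above (the per-step comments are my reading of the tables, by analogy with the ADD case); counting the rows confirms 3 clocks versus 2 clocks.

# SUB R3,R2 control words copied from the two tables above.
# Columns: A1..A6, B1..B6, SUB (1 = gate open / operation active).
variant_1 = [  # subtraction and write-back in separate clocks
    [0,0,1,0,1,0, 1,0,1,0,0,0, 0],   # a: operands gated onto the two buses
    [0,0,0,0,0,0, 0,0,0,0,0,0, 1],   # b: SUB
    [0,0,0,0,0,1, 0,0,0,0,0,1, 0],   # c: result gated back to R3
]
variant_2 = [  # subtraction and write-back in the same clock
    [0,0,1,0,1,0, 1,0,1,0,0,0, 0],   # a: operands gated onto the two buses
    [0,0,0,0,0,1, 0,0,0,0,0,1, 1],   # b: SUB and result back to R3
]
for name, prog in (("variant 1", variant_1), ("variant 2", variant_2)):
    print(name, "needs", len(prog), "clock cycles")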
Triple processor-bus: Can the processing speed be improved?
[Figure: triple-bus processor. Registers R1-R3, ALU1, ALU2, the ADDER and ALU3 are attached to three buses; the other components (Control Unit, IR, PC, MAR, MBR) share the buses. Please notice the direction of the arrows.]
If all the CPU components (registers, ALUs and adder)
could work in one third (1/3) of a clock
cycle (transfer of bits, adding
numbers), how many clocks are
needed to complete an addition
operation (ADD R1,R2)?
Write down the "register transfer" steps
and the microprogram for your
register transfer language.
Program Execution
• A scientific program written in assembly language is run on a
microprocessor with a 1 GHz clock. To complete the program, it needs
to execute:
a. 150,000 arithmetic instructions (e.g. ADD R1,R2; MUL R1,R3; etc.)
b. 250,000 register transfer instructions (e.g. MOV R1,R2; etc.)
c. 100,000 memory access instructions (e.g. LOAD R1,X; STORE
R2,Y; etc.).
If, on average, arithmetic instructions need 2 clocks (to complete),
register transfer instructions need 1 clock and memory access
instructions need 10 clocks, calculate the average CPI (clocks per
instruction) of the above-mentioned program.
How much time is needed to complete the program (in seconds)?
Can it be "one clock"? – Yes it can!
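A worked check of the CPI exercise above, using only the instruction mix and per-class clock counts given on the slide; the formula is the standard weighted-average CPI.

# Average CPI and execution time for the exercise above (1 GHz clock).
clock_hz = 1e9
mix = {                                   # class: (instruction count, clocks each)
    "arithmetic":        (150_000,  2),
    "register transfer": (250_000,  1),
    "memory access":     (100_000, 10),
}
total_instr  = sum(n for n, _ in mix.values())
total_clocks = sum(n * c for n, c in mix.values())
cpi  = total_clocks / total_instr         # 1,550,000 / 500,000 = 3.1
time = total_clocks / clock_hz            # 1.55e6 clocks / 1e9 Hz = 1.55 ms
print(f"average CPI = {cpi}, execution time = {time * 1e3} milliseconds")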
Views of Other Books on “Micro Operations”
• The bus is called the "data path"
• It does not consist only of a bus (a bunch of wires), but
also of other digital devices
• Enable signals are asserted to speed up execution
• Additional (processor) cost
Datapath Example:
Taken from Morris Mano's book
• Four parallel-load
registers
• Two mux-based
register selectors
• Register destination
decoder
• Mux B for external
constant input
• Buses A and B with external
address and data outputs
• ALU and Shifter with
Mux F for output select
• Mux D for external data input
• Logic for generating status bits
V, C, N, Z
[Figure: Mano's datapath – a register file (R0-R3 with load enables, A select and B select multiplexers, destination decoder and D address), MUX B with Constant in, Bus A and Bus B with Address out and Data out, the arithmetic/logic unit (ALU) and Shifter feeding MUX F (G select, H select), MUX D with external Data in, Bus D, and logic generating the status bits V, C, N, Z.]
Datapath Example: Performing a Microoperation
Microoperation: R0 ← R1 + R2
• Apply 01 to A select to place the
contents of R1 onto Bus A
• Apply 10 to B select to place the
contents of R2 onto B data, and
apply 0 to MB select to place
B data on Bus B
• Apply 0010 to G select to perform the
addition G = Bus A + Bus B
• Apply 0 to MF select and 0 to MD
select to place the value of G onto
Bus D
• Apply 00 to Destination select to
enable the Load input to R0
• Apply 1 to Load Enable to force the
Load input to R0 to 1 so that R0 is
loaded on the clock pulse (not shown)
• The overall microoperation requires
1 clock cycle (!)
(A sketch of the corresponding control word follows the figure below.)
[Figure: the same Mano datapath, with the control inputs listed above highlighted.]
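The select values listed above can be gathered into one control word. The field widths below are the ones implied by the slide, and the packing order is illustrative only; treat this as a sketch of the idea, not the exact format used in Mano's book.

# Control settings for the microoperation R0 <- R1 + R2, copied from the
# step list above (Mano-style datapath).
fields = [                       # (name, width in bits, value)
    ("A_select",    2, 0b01),    # put R1 on Bus A
    ("B_select",    2, 0b10),    # put R2 on B data
    ("MB_select",   1, 0b0),     # route B data onto Bus B
    ("G_select",    4, 0b0010),  # ALU operation: A + B
    ("MF_select",   1, 0b0),     # choose the ALU output G
    ("MD_select",   1, 0b0),     # route G onto Bus D
    ("Dest_select", 2, 0b00),    # destination register R0
    ("Load_enable", 1, 0b1),     # actually load R0 on the clock edge
]
word, total_bits = 0, 0
for _, width, value in fields:
    word = (word << width) | value        # pack the fields MSB-first
    total_bits += width
print(f"{total_bits}-bit control word = {word:0{total_bits}b}")  # one clock does it all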
Lesson Learned
• We could improve the instruction execution speed by
increasing processor clock speed (can we?)
• We could improve the instruction execution speed by
implementing dual bus (can we?)
• We can overcome (partly) the CPU-Memory bottleneck by
inserting cache memory between CPU and Main Memory
(can we?)
• Is there any other way to improve instruction execution
speed (increasing performance)? - pipelining
• Do these improvements require extra cost? (cost vs.
performance issue)
What do we get after studying Computer
Architecture ?
• It is always a complicated question to answer.
• Basically we learn about processor design
issues, namely the hardware of a computer, but it is
taught through "software" logic.
• At least we know about basic building blocks of a
computer
• We know the design development trends
Question : How do we fetch the instruction?
(from memory)
• There is a procedure to bring an instruction from memory
to the CPU (IR); it is called the instruction fetch
• The PC always holds the address of the (next) instruction in
memory
• The PC transfers the address to the MAR, and memory is READ
• The PC usually is incremented by 1 (to point to the next instruction)
• The instruction is placed by memory in the MBR
• The content of the MBR is transferred to the IR
(the instruction is fetched, ready to be executed)
Question : How do we fetch the instruction?
(from memory) - continued
• Or, with register transfer language, we could express the
fetch cycle as:
1. MAR ← [PC]
2. READ (memory) and wait for completion
3. IR ← [MBR]
In terms of CPU clocks, these steps may take up to 50 CPU
clocks, depending on the memory speed.
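A toy rendering of the fetch cycle in Python, just to make the register-transfer steps concrete; the addresses and instruction strings are arbitrary, and the memory "wait for completion" is collapsed into a plain dictionary lookup.

# Instruction fetch, mirroring the register-transfer steps above.
memory = {100: "ADD R1,R2", 101: "SUB R3,R2"}     # address -> instruction
cpu = {"PC": 100, "MAR": 0, "MBR": None, "IR": None}

def fetch(cpu, memory):
    cpu["MAR"] = cpu["PC"]            # 1. MAR <- [PC]
    cpu["MBR"] = memory[cpu["MAR"]]   # 2. READ (memory) and wait for completion
    cpu["IR"]  = cpu["MBR"]           # 3. IR  <- [MBR]
    cpu["PC"] += 1                    #    PC now points to the next instruction
    return cpu["IR"]

print(fetch(cpu, memory))             # 'ADD R1,R2' -- fetched, ready to be executed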
What is our topic?
Instruction Set Architecture (ISA)
[Diagram: layers from top to bottom – Application
Program, Compiler, OS, ISA, CPU Design, Circuit
Design, Chip Layout; the ISA sits between the software
layers above it and the hardware design layers below it.]
Chapter 1 : Introduction
1. 1. Introduction : Organization & Architecture
• Organization and Architecture : two jargons that are often
confusing
• Computer organization refers to the operational units and
their interconnections that realize the architectural
specifications (!)
• Computer Architecture refers to those attributes of a
system visible to a programmer, or put another way, those
attributes that have a direct impact on the logical execution
of a program (!)
• The latter definition (architecture) is concerned more with
performance than the first one (organization)
1. 1. Introduction - continued
• Architecture is concerned more with basic instruction-set
design, which may lead to better performance of the system
• Organization is the implementation of a computer
system, in terms of the interconnection of its functional units:
CPU, memory, bus and I/O devices
• Example: the IBM S/370 family architecture. There are plenty
of IBM products having the same architecture (S/370) but
different organizations, depending on their price/performance
measures. Cost and performance differentiate the organizations
• So, the organization of a computer is the implementation of
its architecture, tailored to fit the intended price and
performance measures.
Chapter 2 :
Computer Evolution and
Performance
ENIAC - background
• Electronic Numerical Integrator And Computer
• Eckert and Mauchly
• University of Pennsylvania
• Trajectory tables for weapons
• Started 1943
• Finished 1946
– Too late for the war effort
• Used until 1955
ENIAC - details
• Decimal (not binary)
• 20 accumulators of 10 digits
• Programmed manually by switches
• 18,000 vacuum tubes
• 30 tons
• 15,000 square feet
• 140 kW power consumption
• 5,000 additions per second
[Photographs: ENIAC; another view of ENIAC.]
Structure of von Neumann machine
IAS - details
• 1000 x 40 bit words
– Binary number
– 2 x 20 bit instructions
• Set of registers (storage in CPU)
– Memory Buffer Register
– Memory Address Register
– Instruction Register
– Instruction Buffer Register
– Program Counter
– Accumulator
– Multiplier Quotient
2. 1.Evolution and Performance - history
• In 1946 von Neumann and his colleagues proposed the IAS
machine (at the Institute for Advanced Study)
• The design included:
– main memory
– ALU
– Control Unit
– I/O
• The first stored-program computer, able to perform:
+, −, ×, ÷
• The "father" of all modern computers/processors
Structure of IAS
IAS
2. 1. Evolution and Performance -history
IAS components are:
• MBR (memory buffer register), MAR (memory address
register), IR (instruction register), IBR (instruction buffer
register), PC (program counter), AC (accumulator) and
MQ (multiplier quotient), and memory (1000 locations)
• 20-bit instructions: 8-bit opcode, 12-bit address (addressing
one of 1000 memory locations – 0 to 999)
• 39-bit data (plus a 1-bit sign)
• Operations: data transfer between registers and ALU,
unconditional branch, conditional branch, arithmetic,
address modify
2.1. Evolution - History of Commercial computers
• First generation: in 1950 Mauchly & Eckert developed
UNIVAC I, used by the Census Bureau
• Then appeared UNIVAC II, which later grew into the UNIVAC 1100
series (1103, 1104, 1105, 1106, 1108) – vacuum tubes and later
transistors
• Second generation: transistors, IBM 7094 (NCR, RCA and others
also tried to develop their own versions, but were commercially
unsuccessful)
• Third generation: Integrated Circuits (IC) – SSI. IBM S/360
was the successful example
• Later generations (possibly fourth and fifth): LSI and VLSI
technology
2.1. Evolution - history of commercial computers
Table 2.1
Generation   Approx. Time   Technology    Speed (opr/sec)
----------------------------------------------------------
1            1946-57        Vacuum tube         40,000
2            1958-64        Transistor         200,000
3            1965-71        SSI & MSI        1,000,000
4            1972-77        LSI             10,000,000
5            1978-          VLSI           100,000,000
----------------------------------------------------------
Vacuum Tubes
Transistor
2.1. Evolution - System 360 Family
Characteristic                Model 30  Model 40  Model 50  Model 65  Model 75
-------------------------------------------------------------------------------
Max memory size (bytes)          64K      256K      256K      512K      512K
Memory data rate (MB/s)          0.5       0.8       2.0       8.0      16.0
Processor cycle time (µs)        1.0     0.625       0.5      0.25       0.2
Relative speed                     1       3.5        10        21        50
Max number of data channels        3         3         4         6         6
Max channel data rate (KB/s)     250       400       800      1250      1250
-------------------------------------------------------------------------------
• The family architecture gave rise to the terms upward and
downward compatible.
Generations of Computer
• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small scale integration - 1965 on
– Up to 100 devices on a chip
• Medium scale integration - to 1971
– 100-3,000 devices on a chip
• Large scale integration - 1971-1977
– 3,000 - 100,000 devices on a chip
• Very large scale integration - 1978 to date
– 100,000 - 100,000,000 devices on a chip
• Ultra large scale integration
– Over 100,000,000 devices on a chip
Moore’s Law
• Increased density of components on chips
• Gordon Moore – co-founder of Intel
• Number of transistors on a chip will double every year
• Since the 1970s development has slowed a little
– Number of transistors doubles every 18 months
• Cost of a chip has remained almost unchanged
• Higher packing density means shorter electrical paths,
giving higher performance
• Smaller size gives increased flexibility
• Reduced power and cooling requirements
• Fewer interconnections increase reliability
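The two doubling rates quoted above (every year versus every 18 months) diverge noticeably over a decade; a quick illustration of the arithmetic, where the starting count of 2,300 transistors (roughly the Intel 4004) is chosen only for illustration.

# Transistor-count growth under the two doubling periods quoted above.
start, years = 2_300, 10
every_12_months = start * 2 ** (years / 1.0)
every_18_months = start * 2 ** (years / 1.5)
print(f"after {years} years: {every_12_months:,.0f} vs {every_18_months:,.0f} transistors")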
Moore’s Law
Growth in CPU Transistor Count
IBM 360 series
• 1964
• Replaced (& not compatible with) 7000 series
• First planned “family” of computers
– Similar or identical instruction sets
– Similar or identical O/S
– Increasing speed
– Increasing number of I/O ports (i.e. more terminals)
– Increased memory size
– Increased cost
• Multiplexed switch structure
2.1. Evolution - Later generations
• Semiconductor memories:
1K, 4K, 16K, 64K, 256K, 1M, 4M, 16 Mbits on a single chip
At present: 256 Mbit, 512 Mbit per chip
• Microprocessors appeared:
Intel 4004 (1971), Intel 8008 ('72), Intel 8080 (8-bit, '74),
8086 (16-bit, '78), 80386 (32-bit, '85) onward
• At almost the same time: Motorola 6800 (8-bit), 68000
(16-bit), 68010 (16-bit), 68020 (32-bit), 68030/40 (32-bit)
• Then Motorola's products disappeared commercially
• Intel products have dominated the market since the appearance
of the IBM PC
2.1. Evolution of Microprocessors
Table 2.2
Feature                    8008    8080    8086    80386    80486
------------------------------------------------------------------
Year introduced            1972    1974    1978     1985     1989
# of instructions            66     111     133      154      235
Address bus width             8      16      20       32       32
Data bus width                8       8      16       32       32
# of registers                8       8      16        8        8
Memory addressability      16KB    64KB    1 MB     4 GB     4 GB
Bus bandwidth (MB/s)          –    0.75       5       32       32
Reg-Reg add time (µs)         –     1.3     0.3    0.125     0.06
------------------------------------------------------------------
2.2 Designing for Performance
• Processor prices continue to drop every year
• $1000 for an advanced system is today's price: in it you
may find more than 100 million transistors!
• Even 100 million pieces of toilet paper cost more!!
• Computing power is practically free!!
• People solve problems that were never thought possible
before: image processing, speech recognition,
videoconferencing, multimedia authoring, etc.
• We need more and more computing power
• The organization and architecture of today's processors
remain (basically) the same as those of the IAS!
• The algorithms to improve speed and efficiency differ!
2.2. Designing - processor speed
• The Intel Pentium and the PowerPC follow Moore's Law:
by shrinking the size of lines in IC chips by 10%, the industry may get
new ICs with 4 times the transistor density every 3 years!
• The above law also holds for DRAM (Dynamic Random Access
Memory)
• Even if the capacity increases, the speed doesn't increase
automatically
• More work is needed in designing instructions
• Also, techniques for faster instruction execution must be
developed: branch prediction, data flow analysis and
speculative execution
Pentium Evolution (1)
• 8080
– first general purpose microprocessor
– 8 bit data path
– Used in first personal computer – Altair
• 8086
– much more powerful
– 16 bit
– instruction cache, prefetching a few instructions
– 8088 (8 bit external bus) used in first IBM PC
• 80286
– 16 Mbyte memory addressable
– up from 1Mb
• 80386
– 32 bit
– Support for multitasking
Pentium Evolution (2)
• 80486
– sophisticated powerful cache and instruction pipelining
– built in maths co-processor
• Pentium
– Superscalar
– Multiple instructions executed in parallel
• Pentium Pro
– Increased superscalar organization
– Aggressive register renaming
– branch prediction
– data flow analysis
– speculative execution
Pentium Evolution (3)
• Pentium II
– MMX technology
– graphics, video & audio processing
• Pentium III
– Additional floating point instructions for 3D graphics
• Pentium 4
– Note Arabic rather than Roman numerals
– Further floating point and multimedia enhancements
• Itanium
– 64 bit
– see chapter 15
• See Intel web pages for detailed information on processors
Intel Microprocessor Performance
Summary: Important Points
• Organization and Architecture
• Family Architectures
• Functions of a Computer (data processing, control, data movement)
• Birth of Computers (ENIAC – decimal, IAS – binary), Mauchly–Eckert
• Microprocessors (Intel 4004, 8008, 8080, 8086/16-bit, 80386/32-bit)
• IAS Instructions
• Von Neumann bottleneck
• Increasing clock speed, making the bus wider, cache memory
• Losers: e.g. Motorola microprocessors, Radio Shack
• Denser transistors on a single chip (4 times every 3 years, by
shrinking lines by 10%)