Transcript Document
OVERVIEW & COMPUTER PERFORMANCE
SCR1043 - Module 1
--
Organization and Architecture
- Structure and Function
Reference: William Stallings – Computer
Organization & Architecture
Computer Architecture is those attributes visible to the
programmer. Examples:
the Instruction set
the number of bits used to represent various data types
I/O mechanisms
memory addressing techniques
Computer Organization is how features are implemented:
Control signals
Interfaces between computer and peripherals
The memory technology being used
So, for example, the fact that a multiply instruction is
available is a computer architecture issue. How that
multiply is implemented is a computer organization issue.
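The multiply example can be sketched in code. This is a minimal illustration, not taken from the reference: the function names are hypothetical and the operands are assumed to be non-negative integers. Both functions offer the same programmer-visible operation (the architecture) while producing the result in different ways (the organization).

```python
def multiply_by_repeated_addition(a: int, b: int) -> int:
    """Multiply two non-negative integers by adding a to a running total b times."""
    result = 0
    for _ in range(b):
        result += a
    return result


def multiply_by_shift_and_add(a: int, b: int) -> int:
    """Multiply two non-negative integers by examining b one bit at a time."""
    result = 0
    while b > 0:
        if b & 1:      # lowest bit of b is set: add the current shifted copy of a
            result += a
        a <<= 1        # shift a left by one bit (doubles it)
        b >>= 1        # move on to the next bit of b
    return result
```

A program that calls either function sees the same result, so swapping one implementation for the other does not require any change to the software.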
Many computer manufacturers offer a family of computer models, all with the same architecture but with differences in organization.
All members of the Intel x86 family share the same basic architecture.
The IBM System/370 architecture, first introduced in 1970, included a number of models that share the same basic architecture, and it has survived to this day as the architecture of IBM’s mainframe product line.
The newer models retained the same architecture so that the customer’s software investment was protected (code compatibility).
A computer is a complex system, organized as a hierarchy of interrelated subsystems at different levels.
At each level, the designer is concerned with structure and function:
Structure: The way in which the components are interrelated.
Function: The operation of each individual component as part of the structure.
The computer system in this course will be described from the top down, instead of bottom up.
Four main structural components:
Central processing unit (CPU): Controls the operation of the computer and performs its data processing functions. Its major structural components are:
  Control unit: Controls the operation of the CPU
  Arithmetic and logic unit (ALU): Performs the computer’s data processing functions
  Registers: Provide storage internal to the CPU
  CPU interconnection: Some mechanism that provides for communication among the control unit, ALU, and registers
Main memory: Stores data
I/O: Moves data between the computer and its external environment
System interconnection: Some mechanism that provides for communication among CPU, main memory, and I/O
[Figure: top-level structure of the computer. Peripherals and communication lines attach to the computer, which contains the central processing unit, main memory, input/output, and the systems interconnection.]
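[Figure: structure of the CPU. Within the computer, the CPU connects to memory and I/O over the system bus; the CPU itself contains the arithmetic and logic unit, registers, control unit, and internal CPU interconnection.]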
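[Figure: structure of the control unit. Within the CPU, the control unit connects to the ALU and registers over the internal bus; the control unit itself contains sequencing logic, control unit registers and decoders, and control memory.]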
A computer performs only four functions:
Data processing
  process data in a variety of forms and to a variety of requirements
Data storage
  short-term and long-term data storage for retrieval and update
Data movement
  move data between the computer and the outside world
Control
  control the processing, movement, and storage of data using instructions
How are these functions performed?
  through a PROGRAM
A sequence of steps
For each step, a computer function is executed
For each operation, a different/new set of control signals is needed
For each operation a unique code (instruction) is provided
e.g. ADD, MOVE
A hardware segment accepts the code and issues the control signals
Approach 1: Hardwired program
connecting/combining various logic components to
store data and perform arithmetic and logic
operations
Hardwired systems are inflexible
Approach 2: Software
General purpose hardware can do different tasks, given
correct control signals
Instead of re-wiring, supply a new set of control signals
through instruction codes
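The software approach can be sketched as a small interpreter. This is a minimal illustration under stated assumptions: the ADD and MOVE codes come from the slides, while the SUB code, the register names, and the `(opcode, dest, src)` tuple format are hypothetical. The same `run` function plays the role of the general purpose hardware; only the list of instruction codes changes between tasks.

```python
def run(program, registers):
    """Execute a list of (opcode, dest, src) tuples against a register dictionary."""
    for opcode, dest, src in program:
        if opcode == "MOVE":       # copy the src register into the dest register
            registers[dest] = registers[src]
        elif opcode == "ADD":      # dest = dest + src
            registers[dest] = registers[dest] + registers[src]
        elif opcode == "SUB":      # dest = dest - src (assumed extra opcode)
            registers[dest] = registers[dest] - registers[src]
        else:
            raise ValueError(f"unknown instruction code: {opcode}")
    return registers


# Two different tasks for the same "hardware": no re-wiring, only new codes.
program_sum = [("MOVE", "R0", "R1"), ("ADD", "R0", "R2")]
program_diff = [("MOVE", "R0", "R1"), ("SUB", "R0", "R2")]
```

Calling `run(program_sum, {"R0": 0, "R1": 5, "R2": 3})` leaves the sum of R1 and R2 in R0; passing `program_diff` instead leaves their difference there.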
- A Brief History of Computers
- Designing for Performance
- Pentium and PowerPC Evolution
Reference: William Stallings – Computer
Organization & Architecture
1943-1946: ENIAC (Electronic Numerical Integrator And Computer)
First general purpose computer
Designed by Mauchly and Eckert
Designed to create ballistics tables for WWII, but finished too late; it helped determine H-bomb feasibility instead. General purpose!
30 tons + 1500 sq. ft. + 18000 vacuum tubes + 140 kW = 5000 additions/sec
1945: stored-program concept first implemented for EDVAC (Electronic Discrete Variable Computer).
Key concepts:
Data and instructions are stored in a single read-write
memory.
The contents of this memory are addressable by
location, without regard to the type of data
contained there
Execution occurs in a sequential fashion from one
instruction to the next
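The three key concepts can be sketched with a toy machine. This is a minimal illustration, not a model of EDVAC or any real machine: the LOAD, ADD, STORE and HALT opcodes, the single accumulator, and the `(opcode, address)` tuple format are all assumptions. Instructions and data sit in one read-write list, every cell is reached by its location, and the program counter steps from one instruction to the next.

```python
def run_stored_program(memory):
    """Run a program that is held in the same memory list as its data."""
    accumulator = 0
    pc = 0                              # program counter: address of the next instruction
    while True:
        opcode, address = memory[pc]    # fetch the instruction from the single memory
        pc += 1                         # sequential execution: move to the next location
        if opcode == "LOAD":
            accumulator = memory[address]
        elif opcode == "ADD":
            accumulator += memory[address]
        elif opcode == "STORE":
            memory[address] = accumulator
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")


memory = [
    ("LOAD", 4),    # address 0: instruction
    ("ADD", 5),     # address 1: instruction
    ("STORE", 6),   # address 2: instruction
    ("HALT", 0),    # address 3: instruction
    20,             # address 4: data
    22,             # address 5: data
    0,              # address 6: data (receives the result)
]
```

After `run_stored_program(memory)`, address 6 holds the sum of the values at addresses 4 and 5. Nothing in the memory itself marks a cell as code or data; only the way the machine uses the cell decides that.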
Prototype for all subsequent general-purpose computers. With rare exceptions, all of today’s computers have this same general structure, and are referred to as von Neumann machines.
General IAS structure consists of:
A main memory, which stores both data and instructions
An ALU capable of operating on binary data
A control unit, which interprets the instructions in
memory and causes them to be executed
I/O equipment operated by the control unit
1950: UNIVAC – commissioned by Census Bureau for 1950 calculations
Late 1950’s: UNIVAC II
Greater memory and higher performance
Same basic architecture as UNIVAC
First example of upward compatibility
1953: IBM 701 – primarily for science
1955: IBM 702 – primarily for business
1947: Transistor developed at Bell Labs
Introduction of more complex ALU and control units
High-level programming languages
The data channel – an independent I/O module with its own processor and instruction set
The multiplexor – a central termination point for data channels, CPU, and memory. Precursor to the idea of a data bus.
DEC (Digital Equipment Corporation), founded in 1957, delivered its first computer, the PDP-1, which began the minicomputer phenomenon.
1958: Integrated circuit developed
1964: Introduction of IBM System/360
First planned family of computer products.
Characteristics of a family:
Similar or Identical Instruction Set and Operating System
Increasing Speed
Increasing Number of I/O Ports
Increasing Memory Size
Increasing Cost
Different models could all run the same software, but
with different price/performance.
Literally - “small electronics”
A computer is made up of gates, memory cells and interconnections
These can be manufactured on a semiconductor, e.g. silicon wafer
With microelectronics, the density of components on a chip keeps increasing
From Gordon Moore, co-founder of Intel (Moore’s Law):
Number of transistors on a chip will double every year
Since the 1970’s development has slowed a little; the modified law says
Number of transistors on a chip doubles every 18 months
Therefore, more circuits can be packed on the same size chip
Higher packing density means
Shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increases reliability
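The two doubling rates above can be compared numerically. This is a minimal sketch: the function name and its parameters are hypothetical, and it simply assumes smooth exponential growth with a fixed doubling period.

```python
def transistor_growth_factor(years: float, doubling_period_years: float) -> float:
    """Return how many times the transistor count multiplies over the given span."""
    return 2.0 ** (years / doubling_period_years)
```

For example, `transistor_growth_factor(3, 1.5)` applies the modified 18-month law over 3 years, and `transistor_growth_factor(3, 1.0)` applies the original one-year law over the same span.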
[Figure: Moore’s prediction versus actual]
1964
Replaced (& not compatible with) 7000 series
First planned “family” of computers
Similar or identical instruction sets
Similar or identical O/S
Increasing speed
Increasing number of I/O ports (i.e. more terminals)
Increased memory size
Increased cost
1964: First PDP-8 shipped
First minicomputer
Started OEM market
Introduced the bus structure
Did not need air conditioned room
Small enough to sit on a lab bench
$16,000 compared to $100k++ for IBM 360
Semiconductor memory
Replaced bulky core memory
Goes through its own generations in size, increasing by a factor of 4 each time: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M on a single chip with declining cost and access time
Microprocessor and personal computers
Distributed computing
Larger and larger scales of integration
Microprocessor : all CPU components on a single chip
1971 - 4004
First microprocessor
4 bit
Followed in 1972 by 8008
8 bit
Both designed for specific applications
1974 - 8080
Intel’s first general purpose microprocessor
Designed to be the CPU of a general purpose
microcomputer
8080
first general purpose microprocessor
8 bit data path
Used in first personal computer – Altair
8086
much more powerful
16 bit
instruction cache that prefetches a few instructions
8088 (8 bit external bus) used in first IBM PC
80286
16 MB memory addressable
80386
First 32 bit design
Support for multitasking: run multiple programs at the same time
80486
sophisticated powerful cache and instruction pipelining
built in maths co-processor
Pentium
Superscalar technique - multiple instructions executed in
parallel
Pentium Pro
Increased superscalar organization
Aggressive register renaming
branch prediction
data flow analysis
speculative execution
Pentium II
MMX technology
graphics, video & audio processing
Pentium III
Additional floating point instructions for 3D graphics
Pentium 4
Further floating point and multimedia enhancements
Itanium
64 bit
Core Duo
start of multicore processors
1975, 801 minicomputer project (IBM): RISC
Berkeley RISC I processor
1986, IBM commercial RISC workstation product, RT PC.
Not a commercial success
Many rivals with comparable or better performance
1990, IBM RISC System/6000
RISC-like superscalar machine
POWER architecture
IBM alliance with Motorola (68000 microprocessors) and Apple (used 68000 in Macintosh)
Result is PowerPC architecture
Derived from the POWER architecture
Superscalar RISC
Apple Macintosh
Embedded chip applications
Price/performance
Price drops every year
Performance increases almost yearly
Memory goes up a factor of 4 every 3 years or so
The basic building blocks for today’s computers are the same as those of the IAS computer nearly 50 years ago.
Density of integrated circuits increases by a factor of 4 every 3 years (e.g. memory evolution)
Also results in a performance boost of 4-5 times every 3 years
Requires more elaborate ways of feeding instructions quickly enough. Some techniques:
Branch prediction
Data-flow analysis
Speculative execution
Not all components increase in performance at the same rate as the processor
This results in a need to adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components.
Must carry a constant flow of program instructions and data between memory chips and processor
Processor speed and memory capacity have grown rapidly
Speed with which data can be transferred between processor and main memory has lagged badly
DRAM density goes up faster than amount of main memory needed
Number of DRAM’s goes down
With fewer DRAM’s, less opportunity for parallel data transfer
Increase number of bits retrieved at one time
Make DRAM “wider” rather than “deeper”
Change DRAM interface
Include cache in DRAM chip
Reduce frequency of memory access
More complex and efficient cache between processor and
memory
Cache on chip/processor
Increase interconnection bandwidth between processor and
memory
High speed buses
Hierarchy of buses
I/O devices also become increasingly demanding
Peripherals with intensive I/O demands
Large data throughput demands
Processors can handle this
Problem moving data
Solutions:
Caching
Buffering
Higher-speed interconnection buses
More elaborate bus structures
Multiple-processor configurations
Peripherals (I/O devices) have extremes
• in speed: < 1 Hz to GHz
• in amount of data transfer: < 1 bit/sec to Gb/sec
Because of constant and unequal changes in:
Processor components
Main memory
I/O devices
Interconnection structures,
designers must constantly strive to balance their throughput and processing demands.
Increase hardware speed of processor
  Fundamentally due to shrinking logic gate size
  More gates, packed more tightly, increasing clock rate
  Propagation time for signals reduced
Increase size and speed of caches
  Dedicating part of processor chip
  Cache access times drop significantly
Change processor organization and architecture
  Increase effective speed of execution
  Parallelism
Power
Power density increases with density of logic and clock speed
Dissipating heat becomes a problem
RC delay
Speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting the transistors
Due to increased density:
Interconnect wires become thinner, increasing resistance (R)
Wires are closer together, increasing capacitance (C)
Therefore, delay increases as the RC product increases
Memory latency
Memory speeds lag behind processor speeds
Solution:
More emphasis on organizational and architectural approaches
Improving the organization and architecture of the CPU gives better performance gains than raising the processing speed (technology) alone
Typically two or three levels of cache between processor and main memory (L1, L2, L3)
Chip density increased
More cache memory on chip
Faster cache access
Pentium chip devoted about 10% of chip area to cache
Pentium 4 devotes about 50%
Enable parallel execution of instructions
Pipeline works like assembly line
Different stages of execution of different instructions at same time along pipeline
Superscalar allows multiple pipelines within single processor
Instructions that do not depend on one another can be executed in parallel
Both of these approaches are reaching a point of diminishing returns.
Internal organization of processors is already complex
Can get a great deal of parallelism
Further significant increases likely to be relatively modest
Benefits from cache are reaching limit
Increasing clock rate runs into power dissipation problem
Some fundamental physical limits are being reached
We can use Amdahl’s law to estimate maximum expected performance improvements to an overall system when only part of the system is improved.
Within a processor, increase in performance is proportional to square root of increase in complexity
If software can use multiple processors, doubling number of processors almost doubles performance
So, use two simpler processors on the chip rather than one more complex processor
Multiple processors on single chip
With large shared cache
With two processors, larger caches are justified
Power consumption of memory logic (for cache) is less than processing logic
Example: IBM POWER4
Two cores based on PowerPC
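The slides invoke Amdahl’s law and the square-root rule without writing them out. In its usual form, if a fraction f of the execution time is improved by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The sketch below encodes that form together with the square-root rule; the function and parameter names are hypothetical, and the square-root function takes the proportionality constant to be 1.

```python
import math


def amdahl_speedup(improved_fraction: float, improvement_factor: float) -> float:
    """Overall speedup when only part of the execution time is improved."""
    if not 0.0 <= improved_fraction <= 1.0:
        raise ValueError("improved_fraction must be between 0 and 1")
    unimproved = 1.0 - improved_fraction              # time share left untouched
    improved = improved_fraction / improvement_factor  # time share after the speedup
    return 1.0 / (unimproved + improved)


def single_processor_gain(complexity_increase: float) -> float:
    """Performance gain from making one processor more complex (square-root rule)."""
    return math.sqrt(complexity_increase)
```

As a usage example, `amdahl_speedup(0.5, 2.0)` estimates the overall gain when half of the run time is made twice as fast, and `single_processor_gain(2.0)` estimates the gain from doubling the complexity of one processor, which can be set against the near-doubling the slides attribute to a second processor when the software can use it.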
CPU Performance and its factors
Evaluating Performance
Reference: David A. Patterson & John L. Hennessy –
Computer Organization And Design
Hardware performance is often key to the effectiveness of an entire system of hardware and software.
For different types of applications, different performance metrics may be appropriate, and different aspects of a computer system may be the most significant factor in determining overall performance.
Understanding how best to measure performance, and the limitations of performance measurements, is important when selecting a computer system
To understand the issues of assessing performance, ask:
Why does a piece of software perform as it does?
Why can one instruction set be implemented to perform better than another?
How does some hardware feature affect performance?
Performance is important!
Identify HW/SW performance problems
Comparisons:
Which machine is faster?
Which ISA is better?
Which implementation (of an ISA) is faster?
Expose significant performance issues (enable us to ignore unimportant issues)
Which of these airplanes has the best performance?
How do we say one computer has better performance than another?
• Performance based on speed
  • To take a single passenger from one point to another in the least time – Concorde
• Performance based on throughput
  • To transport 450 passengers from one point to another – 747
Response Time and Throughput
Response Time: time to respond (complete an
operation)
Throughput: jobs completed per unit time
Often can trade one for the other
MB/s, Mb/s: Megabytes, Megabits Per Second
MIPS: Millions of Instructions Per Second
CPI: Clock Cycles Per Instruction
IPC: Instructions Per Clock cycle
Hz: (processor clock frequency) cycles Per Second
LIPS: Logical Inferences Per Second
FLOPS: Floating-Point arithmetic Operations Per Second
Real time: “Wall Clock” time, always ticking
CPU execution time (CPU time): ticks only when CPU is working for you
User: CPU time spent in the program
System: CPU time spent in the operating system performing tasks on behalf of the program
Clock cycle: Also called tick, clock tick, clock period, clocks, cycle (e.g. 0.25 nanosecond). The time for one clock period, usually of the processor clock, which runs at a constant rate
Clock rate: the inverse of the clock cycle. Frequency (e.g. 4 GHz)
CPU execution time for a program = Seconds for the program
Clock cycle time = Seconds per clock cycle
Clock ticks at a constant rate, so measure time in clock cycles:
Seconds/Program = Cycles/Program * Seconds/Cycle
Prefer clock frequency? Divide by Hz:
Seconds/Program = (Cycles/Program) / Clock rate (Freq)
A simple formula relates the most basic metrics (i.e., clock cycles and clock cycle time) to CPU time:
CPU time = CPU clock cycles * Clock cycle time = CPU clock cycles / Clock rate
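The relationship between clock cycles, clock rate and CPU time can be sketched as two small helpers. The function and parameter names are hypothetical; the units are assumed to be cycles, hertz and seconds.

```python
def cpu_time_seconds(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


def clock_rate_from_cycle_time(clock_cycle_time_seconds: float) -> float:
    """Clock rate is the inverse of the clock cycle time."""
    return 1.0 / clock_cycle_time_seconds
```

For instance, `clock_rate_from_cycle_time(0.25e-9)` converts a 0.25 nanosecond clock cycle into its clock rate in hertz, which can then be passed to `cpu_time_seconds`.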
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. Computer B will run this program in 6 seconds, given that computer B requires 1.2 times as many clock cycles as computer A for this program. What is computer B’s clock rate?
CPU Time(A) = CPU Clock Cycles(A) / Clock Rate(A)
10 s = CPU Clock Cycles(A) / 4 GHz
10 s = CPU Clock Cycles(A) / (4 X 10^9 Hz)
CPU Clock Cycles(A) = 40 X 10^9 cycles
CPU Time(B) = 1.2 X CPU Clock Cycles(A) / Clock Rate(B)
6 s = 1.2 X CPU Clock Cycles(A) / Clock Rate(B)
Clock Rate(B) = 1.2 X 40 X 10^9 cycles / 6 seconds
Clock Rate(B) = 48 X 10^9 cycles / 6 seconds
Clock Rate(B) = 8 X 10^9 cycles / second
Clock Rate(B) = 8 GHz
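The same calculation can be checked with a short script. This is a sketch only: the function name and parameter names are hypothetical, and the inputs are the figures stated in the problem.

```python
def required_clock_rate_hz(time_a_s: float, clock_rate_a_hz: float,
                           cycle_ratio_b_to_a: float, time_b_s: float) -> float:
    """Clock rate B needs to finish in time_b_s, given A's time and clock rate."""
    cycles_a = time_a_s * clock_rate_a_hz      # CPU Clock Cycles(A)
    cycles_b = cycle_ratio_b_to_a * cycles_a   # B needs this many times A's cycles
    return cycles_b / time_b_s                 # Clock Rate(B)


clock_rate_b_hz = required_clock_rate_hz(10, 4e9, 1.2, 6)
```

Dividing `clock_rate_b_hz` by 1e9 gives the rate in GHz for comparison with the hand calculation.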
Instruction count = Instructions executed for the program
Clock cycles per instruction (CPI) = Average number of clock cycles per instruction
Programs are made of instructions:
Cycles/Program = Instructions/Program * Cycles/Instruction
Using CPI:
Cycles/Program = Instructions/Program * CPI
Or, using Instructions Per Clock (IPC):
Cycles/Program = (Instructions/Program) / IPC
CPU time = Seconds/Program
         = Cycles/Program * Seconds/Cycle
         = Instructions/Program * Cycles/Instruction * Seconds/Cycle
In other words:
CPU time = Instruction count * CPI * Clock cycle time
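The relationship between CPU time, instruction count, CPI and clock cycle time can be sketched as follows. The function and parameter names are hypothetical, and the clock cycle time is assumed to be given in seconds.

```python
def cpu_time(instruction_count: float, cpi: float, clock_cycle_time_s: float) -> float:
    """CPU time = instruction count * cycles per instruction * seconds per cycle."""
    return instruction_count * cpi * clock_cycle_time_s


def cpi_from_ipc(ipc: float) -> float:
    """CPI and IPC are reciprocals of each other."""
    return 1.0 / ipc
```

A program of one million instructions with a CPI of 2.0 on a 250 ps clock would be evaluated as `cpu_time(1_000_000, 2.0, 250e-12)`; these numbers are illustrative and not taken from the slides.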
Suppose we have two implementations of the same instruction set architecture running the same program. Which computer is faster and by how much?
Computer A: clock cycle time = 250 ps and CPI = 2.0
Computer B: clock cycle time = 500 ps and CPI = 1.2
Say I = number of instructions for the program; find the number of clock cycles for A and B
CPU Clock Cycles(A) = I X CPI(A)
CPU Clock Cycles(A) = I X 2.0
CPU Clock Cycles(B) = I X CPI(B)
CPU Clock Cycles(B) = I X 1.2
Compute CPU Time for A and B
CPU Time(A) = CPU Clock Cycles(A) X Clock Cycle Time(A)
CPU Time(A) = I X 2.0 X 250 ps = I X 500 ps
CPU Time(B) = CPU Clock Cycles(B) X Clock Cycle Time(B)
CPU Time(B) = I X 1.2 X 500 ps = I X 600 ps
Clearly A is faster. The amount faster is the ratio of the execution times:
Performance(A) / Performance(B) = Execution time(B) / Execution time(A) = (I X 600 ps) / (I X 500 ps) = 1.2
We can conclude that A is 1.2 times faster than B for this program
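Because both machines run the same instruction count I, it cancels out of the ratio, and the comparison reduces to time per instruction. A minimal sketch of that comparison, with a hypothetical function name and cycle times assumed to be in picoseconds:

```python
def relative_performance(cpi_a: float, cycle_time_a_ps: float,
                         cpi_b: float, cycle_time_b_ps: float) -> float:
    """Return Performance(A) / Performance(B) for the same program on both machines."""
    time_per_instruction_a = cpi_a * cycle_time_a_ps
    time_per_instruction_b = cpi_b * cycle_time_b_ps
    # Performance is the inverse of execution time, so the ratio flips.
    return time_per_instruction_b / time_per_instruction_a
```

A result above 1 means A is faster; a result below 1 means B is faster. Calling `relative_performance(2.0, 250, 1.2, 500)` applies it to the two machines in this example.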
Sometimes it is possible to compute the CPU clock cycles by looking at the different types of instructions and using their individual clock cycle counts:
CPU Clock Cycles = sum over i = 1..n of (CPIi X Ci)
Ci = count of the number of instructions of class i executed
CPIi = average number of cycles per instruction for that instruction class
n = number of instruction classes
Remember that overall CPI for a program will depend on both the number of cycles for each instruction type and the frequency of each instruction type in the program execution
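The per-class calculation can be sketched as a weighted sum. The function names are hypothetical, and the two input lists are assumed to be in the same class order: one holds the average cycles per instruction for each class, the other holds how many instructions of each class were executed.

```python
def total_clock_cycles(cpi_by_class, count_by_class):
    """Sum, over all instruction classes, of cycles per instruction times count."""
    if len(cpi_by_class) != len(count_by_class):
        raise ValueError("both lists must describe the same instruction classes")
    return sum(cpi * count for cpi, count in zip(cpi_by_class, count_by_class))


def overall_cpi(cpi_by_class, count_by_class):
    """Overall CPI = total clock cycles / total instruction count."""
    return total_clock_cycles(cpi_by_class, count_by_class) / sum(count_by_class)
```

Changing only the counts while keeping the per-class cycle figures fixed changes the overall CPI, which is why the instruction mix matters as much as the cost of each class.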
A compiler designer is trying to decide between two code sequences for a particular computer. The hardware designers have supplied the following facts:
Instruction class:    A  B  C
CPI for this class:   1  2  3
For a particular high-level-language statement, the compiler writer is considering two code sequences that require the following instruction counts:
Code sequence    A  B  C
1                2  1  2
2                4  1  1
Which code sequence executes the most instructions?
Which will be faster?
What is the CPI for each sequence?
Sequence 1 executes Instruction Count(1) = 2+1+2 = 5 instructions
Sequence 2 executes Instruction Count(2) = 4+1+1 = 6 instructions
CPU Clock Cycles(1) = (2X1)+(1X2)+(2X3) = 10 cycles
CPU Clock Cycles(2) = (4X1)+(1X2)+(1X3) = 9 cycles
So code Sequence 2 is faster, even though it executes 1 extra instruction
Code Sequence 2 uses fewer clock cycles for more instructions, so it must have a lower CPI
CPI = CPU Clock Cycles / Instruction Count
CPI(1) = CPU Clock Cycles(1) / Instruction Count(1) = 10/5 = 2
CPI(2) = CPU Clock Cycles(2) / Instruction Count(2) = 9/6 = 1.5
The evolution of computers has been characterized by increasing processor speed, decreasing component size, increasing memory size, and increasing I/O capacity and speed.
All computer designers must balance performance and cost.
Using the execution time of real programs as the metric is a reliable method of determining and reporting performance.