Transcript Document

OVERVIEW & COMPUTER PERFORMANCE
SCR1043 - Module 1
1
--
Organization and Architecture
- Structure and Function
Reference: William Stallings – Computer
Organization & Architecture

Computer Architecture is those attributes visible to the
programmer. Examples:
 the Instruction set
 the number of bits used to represent various data types
 I/O mechanisms
 memory addressing techniques

Computer Organization is how features are implemented:
 Control signals
 Interfaces between computer and peripherals
 The memory technology being used

So, for example, the fact that a multiply instruction is
available is a computer architecture issue. How that
multiply is implemented is a computer organization issue.
 Many computer manufacturers offer a family of computer models, all with the same architecture but with differences in organization.
 All members of the Intel x86 family share the same basic architecture.
 The IBM System/370 architecture, first introduced in 1970, included a number of models that share the same basic architecture, and it has survived to this day as the architecture of IBM’s mainframe product line.
 The newer models retained the same architecture so that the customer’s software investment was protected (code compatibility).
 A computer is a complex system: a hierarchical system of interrelated subsystems at different levels.
 At each level, the designer is concerned with structure and function:
   Structure: The way in which the components are interrelated.
   Function: The operation of each individual component as part of the structure.
 The computer system in this course will be described from the top down, instead of bottom-up.
 Four main structural components:
   Central processing unit (CPU): Controls the operation of the computer and performs its data processing functions. Its major structural components are:
       Control unit: Controls the operation of the CPU
       Arithmetic and logic unit (ALU): Performs the computer’s data processing functions
       Registers: Provide storage internal to the CPU
       CPU interconnection: Some mechanism that provides for communication among the control unit, ALU, and registers
   Main memory: Stores data
   I/O: Moves data between the computer and its external environment
   System interconnection: Some mechanism that provides for communication among CPU, main memory, and I/O
[Figure: top-level structure of the computer. Inside the computer: central processing unit, main memory, input/output, and systems interconnection. Outside the computer: peripherals and communication lines.]
[Figure: structure of the CPU. Inside the CPU: arithmetic and logic unit, registers, control unit, and internal CPU interconnection. Elsewhere in the computer: I/O, memory, and system bus.]
[Figure: structure of the control unit. Inside the control unit: sequencing logic, control unit registers and decoders, and control memory. Elsewhere in the CPU: ALU, registers, and internal bus.]

There are only four functions:
 Data processing
   process data in a variety of forms and requirements
 Data storage
   short- and long-term data storage for retrieval and update
 Data movement
   move data between the computer and the outside world
 Control
   control of processing, moving and storing data using instructions

How are these functions performed?
 Through a PROGRAM
 A program is a sequence of steps
 For each step, a computer function is executed
 For each operation, a different/new set of control signals is needed
 For each operation a unique code (instruction) is provided
   e.g. ADD, MOVE
 A hardware segment accepts the code and issues the control signals
 Approach 1: Hardwired program
   Connecting/combining various logic components to store data and perform arithmetic and logic operations
   Hardwired systems are inflexible

Approach 2: Software


General purpose hardware can do different tasks, given
correct control signals
Instead of re-wiring, supply a new set of control signals
through instruction codes
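
As a rough illustration of the software approach, the sketch below treats one Python function as the general-purpose hardware and a list of instruction codes as the program. The register names, the three-field instruction layout, and the `execute` function are assumptions made for this example; only the ADD and MOVE codes come from the slides.

```python
def execute(program, registers):
    """Interpret each instruction code and perform the matching operation.

    The function plays the role of general-purpose hardware: the same code
    handles any program; only the instruction codes supplied to it change.
    """
    for opcode, dst, src in program:
        if opcode == "MOVE":
            # Copy the source register into the destination register.
            registers[dst] = registers[src]
        elif opcode == "ADD":
            # Add the source register to the destination register.
            registers[dst] = registers[dst] + registers[src]
        else:
            raise ValueError(f"unknown instruction code: {opcode}")
    return registers


# Hypothetical program: r0 = r1, then r0 = r0 + r2.
program = [("MOVE", "r0", "r1"), ("ADD", "r0", "r2")]
result = execute(program, {"r0": 0, "r1": 5, "r2": 7})
print(result["r0"])
```

Changing the task means supplying a different list of instruction codes; the `execute` function (the "hardware") is not rewired.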
- A Brief History of Computers
- Designing for Performance
- Pentium and PowerPC Evolution
Reference: William Stallings – Computer
Organization & Architecture
 1943-1946:
ENIAC (Electronic Numerical
Integrator And Computer)
 First
general purpose computer
 Designed
by Mauchly and Eckert
 Designed
to create ballistics tables for WWII,
but too late – helped determine H-bomb
feasibility instead. General purpose!
 30
tons + 15000 sq. ft. + 18000 vacuum tubes +
140 KW = 5000 additions/sec
SCR1043 - Module 1
16
SCR1043 - Module 1
17
 1945:
stored-program concept first
implemented for EDVAC (Electronic Discrete
Variable Computer).
 Key
concepts:

Data and instructions are stored in a single read-write
memory.

The contents of this memory are addressable by
location, without regard to the type of data
contained there

Execution occurs in a sequential fashion from one
instruction to the next
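
The three key concepts can be sketched as a toy fetch-execute loop. This is only an illustration: the LOAD/ADD/STORE/HALT codes, the accumulator, and the memory layout are hypothetical and are not the EDVAC or IAS instruction set.

```python
def run(memory):
    """Run a toy stored-program machine until it reaches HALT."""
    accumulator = 0
    pc = 0  # program counter: address of the next instruction
    while True:
        # Instructions are fetched from the same memory that holds the data,
        # and they are addressed by location only.
        opcode, address = memory[pc]
        pc += 1  # sequential execution: move on to the next instruction
        if opcode == "LOAD":
            accumulator = memory[address]
        elif opcode == "ADD":
            accumulator += memory[address]
        elif opcode == "STORE":
            memory[address] = accumulator
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown instruction code: {opcode}")


# One read-write memory: addresses 0-3 hold instructions, 4-6 hold data.
memory = [
    ("LOAD", 4),   # address 0
    ("ADD", 5),    # address 1
    ("STORE", 6),  # address 2
    ("HALT", 0),   # address 3
    2,             # address 4: first operand
    3,             # address 5: second operand
    0,             # address 6: result goes here
]
run(memory)
print(memory[6])
```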
 The IAS computer is the prototype for all subsequent general-purpose computers. With rare exceptions, all of today’s computers have this same general structure, and are referred to as von Neumann machines.
 General IAS structure consists of:
   A main memory, which stores both data and instructions
   An ALU capable of operating on binary data
   A control unit, which interprets the instructions in memory and causes them to be executed
   I/O equipment operated by the control unit
 1950:
UNIVAC – commissioned by Census
Bureau for 1950 calculations
 Late



1950’s: UNIVAC II
Greater memory and higher performance
Same basic architecture as UNIVAC
First example of upward compatibility
 1953:
IBM 701 – primarily for science
 1955:
IBM 702 – primarily for business
SCR1043 - Module 1
22
 1947:
Transistor developed at Bell Labs
 Introduction of more complex ALU and control units
 High-level programming languages
 The
data channel – an independent I/O module
with its own processor and instruction set
 The
multiplexor – a central termination point for
data channels, CPU, and memory. Precursor to idea
of data bus.
 DEC
(Digital Equipment Corporation) founded in
1957 delivered its first computer, PDP-1, a minicomputer phenomenon.
SCR1043 - Module 1
23
SCR1043 - Module 1
24
 1958:
Integrated circuit developed
 1964:
Introduction of IBM System/360

First planned family of computer products.
Characteristics of a family:






Similar or Identical Instruction Set and Operating System
Increasing Speed
Increasing Number of I/O Ports
Increasing Memory Size
Increasing Cost
Different models could all run the same software, but
with different price/performance.
SCR1043 - Module 1
25
 Microelectronics: literally “small electronics”
 A computer is made up of gates, memory cells and interconnections
 These can be manufactured on a semiconductor
   e.g. silicon wafer
 With microelectronics, the density of components on a chip keeps on increasing
 From Gordon Moore, co-founder of Intel (Moore’s law):
   Number of transistors on a chip will double every year
   Since the 1970’s development has slowed a little; a modified law says the number of transistors on a chip doubles every 18 months
   Therefore, more circuits can be packed on the same size chip
 Higher packing density means:
   Shorter electrical paths, giving higher performance
   Smaller size gives increased flexibility
   Reduced power and cooling requirements
   Fewer interconnections increase reliability
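
The two doubling rates can be compared with a short calculation. This is a sketch of the arithmetic only; the function name and parameters are made up for the example, and real transistor counts do not follow the curve exactly.

```python
def transistor_growth(years, doubling_period_years=1.5):
    """Growth factor in transistor count after `years`, assuming the count
    doubles once every `doubling_period_years` (1.5 years = 18 months)."""
    return 2 ** (years / doubling_period_years)


# Modified law: doubling every 18 months.
growth_modified = transistor_growth(3)
# Original prediction: doubling every year.
growth_original = transistor_growth(3, doubling_period_years=1.0)

print(growth_modified, growth_original)
```

Under the 18-month rule, three years gives two doublings, a factor of 4. Under the original yearly rule, the same period gives a factor of 8.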
[Figure: Moore’s prediction versus actual]
 1964
 Replaced
 First


(& not compatible with) 7000 series
planned “family” of computers
Similar or identical instruction sets
Similar or identical O/S
 Increasing
speed
 Increasing number of I/O ports (i.e. more
terminals)
 Increased memory size
 Increased cost
SCR1043 - Module 1
29
SCR1043 - Module 1
30
 1964:
First PDP-8 shipped
 First minicomputer
 Started OEM market
 Introduced the bus structure
 Did
not need air conditioned
room
 Small
enough to sit on a lab
bench
 $16,000
compared to
$100k++ for IBM 360
SCR1043 - Module 1
31
 Semiconductor memory
   Replaced bulky core memory
   Goes through its own generations in size, increasing by a factor of 4 each time: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M on a single chip, with declining cost and access time
 Microprocessor and personal computers
 Distributed computing
 Larger and larger scales of integration










Microprocessor : all CPU components on a single chip
1971 - 4004
First microprocessor
4 bit
Followed in 1972 by 8008
8 bit
Both designed for specific applications
1974 - 8080
Intel’s first general purpose microprocessor
Designed to be the CPU of a general purpose
microcomputer




8080
 first general purpose microprocessor
 8 bit data path
 Used in first personal computer – Altair
8086
 much more powerful
 16 bit
 instruction cache, prefetch few instructions
 8088 (8 bit external bus) used in first IBM PC
80286
 16 MB memory addressable
80386
 First 32 bit design
 Support for multitasking- run multiple programs at the same time



80486
 sophisticated powerful cache and instruction pipelining
 built in maths co-processor
Pentium
 Superscalar technique - multiple instructions executed in
parallel
Pentium Pro
 Increased superscalar organization
 Aggressive register renaming
 branch prediction
 data flow analysis
 speculative execution





Pentium II
 MMX technology
 graphics, video & audio processing
Pentium III
 Additional floating point instructions for 3D graphics
Pentium 4
 Further floating point and multimedia enhancements
Itanium
 64 bit
Core Duo
 start of multicore processors
1975, 801 minicomputer project (IBM) RISC
 Berkeley RISC I processor
 1986, IBM commercial RISC workstation product, RT PC.

Not commercial success
 Many rivals with comparable or better performance


1990, IBM RISC System/6000
RISC-like superscalar machine
 POWER architecture

IBM alliance with Motorola (68000 microprocessors),
and Apple, (used 68000 in Macintosh)
 Result is PowerPC architecture

Derived from the POWER architecture
 Superscalar RISC
 Apple Macintosh
 Embedded chip applications

 Price/performance
   Price drops every year
   Performance increases almost yearly
   Memory goes up a factor of 4 every 3 years or so
 The basic building blocks for today’s computers are the same as those of the IAS computer nearly 50 years ago.
 Density of integrated circuits increases by a factor of 4 every 3 years (e.g. memory evolution)
 This also results in a performance boost of 4-5 times every 3 years
 Requires more elaborate ways of feeding instructions quickly enough. Some techniques:
   Branch prediction
   Data-flow analysis
   Speculative execution
 Not all components increase in performance at the same rate as the processor
 This results in a need to adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components.
 The interface between processor and main memory must carry a constant flow of program instructions and data between memory chips and processor
 Processor speed and memory capacity have grown rapidly
 Speed with which data can be transferred between processor and main memory has lagged badly
 DRAM density goes up faster than amount of main memory needed
   Number of DRAM’s goes down
   With fewer DRAM’s, less opportunity for parallel data transfer




Increase number of bits retrieved at one time
 Make DRAM “wider” rather than “deeper”
Change DRAM interface
 Include cache in DRAM chip
Reduce frequency of memory access
 More complex and efficient cache between processor and
memory
 Cache on chip/processor
Increase interconnection bandwidth between processor and
memory
 High speed buses
 Hierarchy of buses
 I/O devices also become increasingly demanding
 Peripherals with intensive I/O demands
 Large data throughput demands
 Processors can handle this
 Problem moving data
 Solutions:
   Caching
   Buffering
   Higher-speed interconnection buses
   More elaborate bus structures
   Multiple-processor configurations
 Peripherals (I/O devices) vary between extremes:
  • in speed: < 1 Hz to GHz
  • in amount of data transferred: < 1 bit/sec to Gb/sec
 Because of constant and unequal changes in:
   Processor components
   Main memory
   I/O devices
   Interconnection structures,
  designers must constantly strive to balance their throughput and processing demands.
 Increase hardware speed of processor
   Fundamentally due to shrinking logic gate size
       More gates, packed more tightly, increasing clock rate
       Propagation time for signals reduced
 Increase size and speed of caches
   Dedicating part of processor chip
       Cache access times drop significantly
 Change processor organization and architecture
   Increase effective speed of execution
   Parallelism


Power
 Power density increases with density of logic and clock speed
 Dissipating heat
RC delay
 Speed at which electrons flow is limited by resistance and capacitance of the metal wires connecting them
 Due to increased density:
   Interconnect wires become thinner, increasing resistance (R)
   Wires are closer together, increasing capacitance (C)
   Therefore, delay increases as the RC product increases
Memory latency
 Memory speeds lag behind processor speeds
Solution:
 More emphasis on organizational and architectural approaches
 Better performance comes from improving the architecture of the CPU rather than from raw processing speed (technology) alone
 Typically two or three levels of cache between processor and main memory (L1, L2, L3)
 Chip density increased
   More cache memory on chip
   Faster cache access
 Pentium chip devoted about 10% of chip area to cache
 Pentium 4 devotes about 50%

 Enable parallel execution of instructions
 Pipeline works like an assembly line
   Different stages of execution of different instructions at same time along pipeline
 Superscalar allows multiple pipelines within a single processor
   Instructions that do not depend on one another can be executed in parallel

 Both of these approaches are reaching a point of diminishing returns.
 Internal organization of processors is complex
   Can get a great deal of parallelism
   Further significant increases likely to be relatively modest
 Benefits from cache are reaching a limit
 Increasing clock rate runs into power dissipation problem
   Some fundamental physical limits are being reached
 We can use Amdahl’s law to estimate the maximum expected performance improvement to an overall system when only part of the system is improved.
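
The slides name Amdahl’s law without stating it. In its usual form, if a fraction f of the execution time is sped up by a factor k, the overall speedup is 1 / ((1 - f) + f / k). The sketch below applies that formula; the function names and the sample values are illustrative assumptions.

```python
def amdahl_speedup(fraction_improved, improvement_factor):
    """Overall speedup when `fraction_improved` of the execution time
    is made `improvement_factor` times faster (Amdahl's law)."""
    unimproved = 1.0 - fraction_improved
    return 1.0 / (unimproved + fraction_improved / improvement_factor)


def amdahl_limit(fraction_improved):
    """Upper bound on speedup as the improvement factor grows without limit."""
    return 1.0 / (1.0 - fraction_improved)


# Hypothetical case: half of the run time is made twice as fast.
speedup = amdahl_speedup(0.5, 2.0)
limit = amdahl_limit(0.5)
print(round(speedup, 3), limit)
```

Even an unlimited improvement to half of the run time cannot make the whole system more than twice as fast, which is why improving only one part gives diminishing returns.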
 Within a processor, increase in performance is proportional to square root of increase in complexity
 If software can use multiple processors, doubling number of processors almost doubles performance
   So, use two simpler processors on the chip rather than one more complex processor
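
A small calculation makes the comparison concrete. It is a sketch under stated assumptions: the square-root rule is taken directly from the slide, and the 0.95 efficiency figure standing in for "almost doubles" is a made-up value for illustration.

```python
import math


def single_processor_gain(complexity_factor):
    """Performance gain from one processor made `complexity_factor` times
    more complex, using the square-root rule quoted above."""
    return math.sqrt(complexity_factor)


def multi_processor_gain(processor_count, efficiency):
    """Performance gain from `processor_count` simple processors, where
    `efficiency` (0 to 1) is the assumed fraction of ideal scaling."""
    return processor_count * efficiency


# Spend a doubled transistor budget in two different ways.
one_complex = single_processor_gain(2)          # one processor, twice as complex
two_simple = multi_processor_gain(2, 0.95)      # two processors, assumed 95% scaling

print(round(one_complex, 2), round(two_simple, 2))
```

With these numbers the single complex processor gains about 1.41 times, while two simpler processors gain about 1.9 times, provided the software can use both.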
 Multiple processors on single chip
   With large shared cache
 With two processors, larger caches are justified
   Power consumption of memory logic (for cache) is less than processing logic
 Example: IBM POWER4
   Two cores based on PowerPC

- CPU Performance and its factors
- Evaluating Performance
Reference: David A. Patterson & John L. Hennessy – Computer Organization And Design

Hardware performance is often key to the effectiveness
of an entire system of hardware and software.

For different types of applications, different
performance metrics may be appropriate, and different
aspects of a computer system may be the most
significant factor in determining overall performance.

Understanding how best to measure performance and
limitations of performance is important when selecting
a computer system

We want to understand the issues of assessing performance:
 Why does a piece of software perform as it does?
 Why can one instruction set be implemented to perform better than another?
 How does some hardware feature affect performance?
 Performance is important!
 Identify HW/SW performance problems
 Comparisons:
   Which machine is faster?
   Which ISA is better?
   Which implementation (of an ISA) is faster?
 Expose significant performance issues (enable us to ignore unimportant issues)
 Which of these airplanes has the best performance?
 How do we say one computer has better performance than another?
  • Performance based on speed
     • To take a single passenger from one point to another in the least time – Concorde
  • Performance based on throughput
     • To transport 450 passengers from one point to another – 747
 Response Time and Throughput
   Response Time: time to respond (complete an operation)
   Throughput: jobs completed per unit time
   Often can trade one for the other
 MB/s, Mb/s: Megabytes, Megabits Per Second
 MIPS: Millions of Instructions Per Second
 CPI: Clock Cycles Per Instruction
 IPC: Instructions Per Clock cycle
 Hz: (processor clock frequency) cycles Per Second
 LIPS: Logical Inferences Per Second
 FLOPS: Floating-Point arithmetic Operations Per Second
 Real time: “Wall Clock” time, always ticking
 CPU execution time (CPU time): ticks only when CPU is working for you
   User: CPU time spent in the program
   System: CPU time spent in the operating system performing tasks on behalf of the program
 Clock cycle: Also called tick, clock tick, clock period, clocks, cycle (e.g. 0.25 nanosecond). The time for one clock period, usually of the processor clock, which runs at a constant rate
 Clock rate: the inverse of the clock cycle. Frequency (e.g. 4 GHz)
 CPU execution time for a program = Seconds for the program
 Clock cycle time = Seconds per clock cycle
 Clock ticks at a constant rate, measure time in clock cycles:

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

 Prefer clock frequency? Divide by Hz:

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Cycles} / \text{Program}}{\text{Clock rate (Freq)}}$$
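 A simple formula relates the most basic metrics (i.e., clock cycles and clock cycle time) to CPU time:

$$\text{CPU time} = \text{CPU clock cycles} \times \text{Clock cycle time} = \frac{\text{CPU clock cycles}}{\text{Clock rate}}$$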

Our favorite program runs in 10 seconds on computer A,
which has a 4 GHz clock. Computer B will run this
program in 6 seconds, given that computer B requires
1.2 times as many clock cycles as computer A for this
program. What is computer B’s clock rate?
CPU Time(A) = CPU Clock Cycles(A) / Clock Rate(A)
10 s = CPU Clock Cycles(A) / 4 GHz
10 s = CPU Clock Cycles(A) / (4 X 10^9 Hz)
CPU Clock Cycles(A) = 40 X 10^9 cycles
CPU Time(B) = 1.2 X CPU Clock Cycles(A) / Clock Rate(B)
6 s = 1.2 X CPU Clock Cycles(A) / Clock Rate(B)
Clock Rate(B) = 1.2 X 40 X 10^9 cycles / 6 seconds
Clock Rate(B) = 48 X 10^9 cycles / 6 seconds
Clock Rate(B) = 8 X 10^9 cycles / second
Clock Rate(B) = 8 GHz
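
The same steps can be checked with a few lines of arithmetic. This is only a sketch of the calculation above; the variable names are chosen for the example.

```python
# Given values from the problem statement.
cpu_time_a = 10.0        # seconds on computer A
clock_rate_a = 4e9       # 4 GHz, in cycles per second
cpu_time_b = 6.0         # target seconds on computer B
cycle_ratio_b = 1.2      # B needs 1.2 times as many cycles as A

# CPU time = clock cycles / clock rate, so cycles = time * rate.
clock_cycles_a = cpu_time_a * clock_rate_a
clock_cycles_b = cycle_ratio_b * clock_cycles_a

# Rearranged for B: clock rate = clock cycles / CPU time.
clock_rate_b = clock_cycles_b / cpu_time_b

print(f"Clock rate of B: {clock_rate_b / 1e9:.1f} GHz")
```

Computer B has to run at twice A's clock rate to finish sooner, because it also executes 20% more cycles.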
 Instruction count = Instructions executed for the program
 Clock cycles per instruction (CPI) = Average number of clock cycles per instruction
 Programs are made of instructions:

$$\frac{\text{Cycles}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}}$$

 Using CPI:

$$\frac{\text{Cycles}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \text{CPI}$$

 Or, using Instructions Per Clock (IPC):

$$\frac{\text{Cycles}}{\text{Program}} = \frac{\text{Instructions} / \text{Program}}{\text{IPC}}$$
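$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

 In other words:

$$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$$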
 Suppose we have two implementations of the same instruction set architecture, running the same program. Which computer is faster and by how much?
   Computer A: clock cycle time = 250 ps and CPI = 2.0
   Computer B: clock cycle time = 500 ps and CPI = 1.2
 Say I = number of instructions for the program, find number of clock cycles for A and B
CPU Clock Cycles(A) = I X CPI(A)
CPU Clock Cycles(A) = I X 2.0
CPU Clock Cycles(B) = I X CPI(B)
CPU Clock Cycles(B) = I X 1.2
 Compute CPU Time for A and B
CPU Time(A) = CPU Clock Cycles(A) X Clock Cycle Time(A)
CPU Time(A) = I X 2.0 X 250 ps = I X 500 ps
CPU Time(B) = CPU Clock Cycles(B) X Clock Cycle Time(B)
CPU Time(B) = I X 1.2 X 500 ps = I X 600 ps
 Clearly A is faster. The amount faster is the ratio of execution times.
Performance(A) / Performance(B) = Execution time(B) / Execution time(A) = (I X 600 ps) / (I X 500 ps) = 1.2
 We can conclude, A is 1.2 times faster than B for this program
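
The comparison can be reproduced with CPU time = instruction count X CPI X clock cycle time. This is a minimal sketch: the function name is invented for the example, and the instruction count of one million is an arbitrary assumption, since I cancels out of the ratio.

```python
def cpu_time_ps(instruction_count, cpi, clock_cycle_time_ps):
    """CPU time in picoseconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_ps


instruction_count = 1_000_000  # arbitrary; any positive value gives the same ratio

time_a = cpu_time_ps(instruction_count, cpi=2.0, clock_cycle_time_ps=250)
time_b = cpu_time_ps(instruction_count, cpi=1.2, clock_cycle_time_ps=500)

# Performance is the inverse of execution time, so the ratio is time_b / time_a.
performance_ratio = time_b / time_a
print(f"A is about {performance_ratio:.1f} times faster than B")
```

A wins despite its higher CPI because its clock cycle is half as long.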

 Sometimes it is possible to compute the CPU clock cycles by looking at the different types of instructions and using their individual clock cycle counts:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i)$$

   C_i = count of the number of instructions of class i executed
   CPI_i = average number of cycles per instruction for that instruction class
   n = number of instruction classes

 Remember that overall CPI for a program will depend on both the number of cycles for each instruction type and the frequency of each instruction type in the program execution

 A compiler designer is trying to decide between two code sequences for a particular computer. The hardware designers have supplied the following facts:

  Instruction class     A     B     C
  CPI for this class    1     2     3

 For a particular high-level-language statement, the compiler writer is considering two code sequences that require the following instruction counts:

  Instruction class     A     B     C
  Code sequence 1       2     1     2
  Code sequence 2       4     1     1

 Which code sequence executes the most instructions?
 Which will be faster?
 What is the CPI for each sequence?




Sequence 1 (Instruction Count(1))  2+1+2=5 instructions
Sequence 2 (Instruction Count(2)) 4+1+1=6 instructions
CPU Clock Cycles(1)= (2X1)+(1X2)+(2X3) = 10 cycles
CPU Clock Cycles(2)= (4X1)+(1X2)+(1X3) = 9 cycles

 So code Sequence 2 is faster, even though it executes 1 extra instruction
 Code Sequence 2 uses fewer clock cycles for more instructions, so it must have a lower CPI

CPI = CPU Clock Cycles/Instruction Count
CPI(1) = CPU Clock Cycles(1)/Instruction Count(1) = 10/5 = 2
CPI(2) = CPU Clock Cycles(2)/Instruction Count(2) = 9/6 = 1.5
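
The per-class calculation can be sketched as a weighted sum. The cycle costs (1, 2, 3) and instruction counts are read off the worked solution above; the class labels A, B, C and the function names are assumptions made for this example.

```python
# Cycles per instruction for each instruction class.
CPI_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def clock_cycles(instruction_counts):
    """Sum of (CPI of class i) x (count of class i) over all classes."""
    return sum(CPI_PER_CLASS[cls] * count
               for cls, count in instruction_counts.items())


def overall_cpi(instruction_counts):
    """Overall CPI = total clock cycles / total instruction count."""
    return clock_cycles(instruction_counts) / sum(instruction_counts.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
    print(name, sum(seq.values()), clock_cycles(seq), overall_cpi(seq))
```

The instruction mix matters as much as the instruction count: Sequence 2 replaces expensive class C instructions with cheap class A ones.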


 The evolution of computers has been characterized by increasing processor speed, decreasing component size, increasing memory size, and increasing I/O capacity and speed.
 All computer designers must balance performance and cost.
 Using the execution time of real programs as the metric is a reliable method of determining and reporting performance.