Chapter 11
System Performance Enhancement
Basic Operation of a Computer
- Program is loaded into memory
- Instruction is fetched from memory
- Operands are decoded and the required data is fetched from the specified location (using the addressing mode built into the instruction)
- Operation corresponding to the instruction is executed
- Additional operand determines the return location for the result of the operation
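The steps above can be sketched as a minimal fetch-decode-execute loop. This is a hypothetical toy accumulator machine, not any specific ISA; the opcodes and instruction format are invented for illustration:

```python
# Toy fetch-decode-execute loop for a hypothetical accumulator machine.
# Instructions are (opcode, operand) pairs; opcodes are invented for illustration.
def run(program, memory):
    pc, acc = 0, 0                       # program counter and accumulator
    while pc < len(program):
        op, operand = program[pc]        # fetch the instruction from "memory"
        pc += 1
        if op == "LOAD":                 # fetch required data from memory
            acc = memory[operand]
        elif op == "ADD":                # execute the operation
            acc += memory[operand]
        elif op == "STORE":              # operand gives the result's return location
            memory[operand] = acc
        elif op == "HALT":
            break
    return memory

mem = {0: 5, 1: 7, 2: 0}
run([("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)], mem)
print(mem[2])   # 12
```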
Performance
- CPU performs program instructions via a sequence of fetch-execute cycles
- Note: the fetch-execute cycle consists of many phases
- Performance is degraded by delays in memory accesses
Performance Enhancement
RISC Architecture
- Reduced Instruction Set Computing
- Simple instructions: easier to decode and to run in parallel
- Limited memory access - only load and store instructions touch memory
- Many registers, and compilers that optimize their use
Performance Enhancement
Pipelining - overlap processing of instructions so that more than one instruction is being worked on at a given time
- While one instruction is fetching, another may be executing
- So pipelining performs fetch and execute phases in parallel
- NOTE: only one instruction at a time is actually being executed to completion
- Objective: start and finish one instruction per clock cycle: CPI = 1
Pipelining - Fig. 10.23
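The CPI = 1 objective can be checked with a quick timing sketch: once an ideal k-stage pipeline is full, one instruction completes per cycle, so n instructions take k + n - 1 cycles. This assumes no stalls; the numbers are illustrative:

```python
def pipeline_cycles(n_instructions, n_stages):
    # The first instruction takes n_stages cycles to flow through the pipe;
    # after that, one instruction completes per cycle (ideal, no stalls).
    return n_stages + n_instructions - 1

n, k = 1000, 5
cycles = pipeline_cycles(n, k)
print(cycles, cycles / n)        # 1004 cycles, CPI = 1.004 -> approaches 1
print(n * k)                     # 5000 cycles without pipelining (CPI = 5)
```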
Performance Enhancement
SuperScalar Design - start and finish more than one instruction per clock cycle: CPI < 1
- Executes several operations at once
- Hardware is duplicated to support parallelism
- CPU may have an instruction fetch unit and several execution units operating in parallel
- Hardware schedules instructions to exploit parallelism
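The CPI < 1 claim is easy to quantify in the idealized case: a machine that issues w instructions per cycle needs only ceil(n/w) cycles for n instructions. Real schedulers are limited by data dependencies, so this is an upper bound on the benefit:

```python
def superscalar_cycles(n_instructions, issue_width):
    # Ideal case: up to issue_width instructions start (and finish) per cycle,
    # ignoring dependencies between instructions.
    return -(-n_instructions // issue_width)   # ceiling division

n = 1000
for w in (1, 2, 4):
    cycles = superscalar_cycles(n, w)
    print(f"width {w}: {cycles} cycles, CPI = {cycles / n}")
# width 4 gives CPI = 0.25, i.e. CPI < 1
```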
Other Means of Improving Performance
Multiprocessing
Faster Clock Speed
Wider Instructions and Data Paths
Longer Registers
Faster Disk Access
Memory Enhancements
Multiprocessing
- Increase the number of processors
- Multiprocessors - computers that have multiple CPUs within a single system, sharing memory and I/O devices
- Typically 2-4 processors
- Tightly coupled system
Typical Multiprocessing System
Symmetrical Multiprocessing (SMP) Systems
- Each CPU operates independently
- Each CPU has access to all the system resources (memory and I/O)
- Any CPU can respond to an interrupt
- A program in memory can be executed by any CPU
- Each CPU has identical access to the OS
- Each CPU performs its own dispatch scheduling, that is, determines which program it will execute next
- Very controlled environment - CPUs, memory, I/O devices, and OS are designed to operate together, and communication is built into the system
Increase Clock Speed
- Faster clock speeds improve the overall speed of the system, since instruction cycle time is inversely proportional to clock speed
- Limitation - the ability of the CPU, buses, and other components to keep up
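This relationship is the classic CPU performance equation: execution time = instruction count x CPI / clock rate. A quick sketch with invented numbers shows doubling the clock halving the time, all else being equal:

```python
def execution_time(instr_count, cpi, clock_hz):
    # Instruction cycle time = CPI / clock rate, so total execution time
    # falls as clock speed rises (if everything else keeps up).
    return instr_count * cpi / clock_hz

t1 = execution_time(1_000_000, 2.0, 1e9)   # hypothetical 1 GHz machine
t2 = execution_time(1_000_000, 2.0, 2e9)   # same work at 2 GHz: half the time
print(t1, t2)   # 0.002 0.001
```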
Wider Instruction and Data Paths
- Ability to process more bits at a time improves performance
- CPU can fetch or store more data in a single operation
- CPU can fetch more instructions at a time
- Memory accesses are slow compared to CPU operations, so moving more bits per access improves performance
Longer Registers
- Longer registers (more bits) within the CPU reduce the number of program steps needed to complete a calculation
- Example - using 16-bit registers for 64-bit addition requires 4 additions, plus steps to handle carries between registers, and 4 moves to transfer the result to memory
- With 64-bit registers, only a single addition and a single move to memory via a wider internal bus
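The 16-bit example can be made concrete: a 64-bit sum computed in four 16-bit pieces, propagating the carry between "registers". This is a sketch of the technique; real code would be assembly, with Python integers standing in for registers:

```python
def add64_with_16bit_registers(a, b):
    # Split each 64-bit value into four 16-bit words, add low word to high
    # word, carrying between words -- 4 additions plus carry handling.
    mask, result, carry = 0xFFFF, 0, 0
    for i in range(4):
        wa = (a >> (16 * i)) & mask
        wb = (b >> (16 * i)) & mask
        s = wa + wb + carry
        carry = s >> 16                     # carry into the next word
        result |= (s & mask) << (16 * i)
    return result

a, b = 0x1234_5678_9ABC_DEF0, 0x0FED_CBA9_8765_4321
assert add64_with_16bit_registers(a, b) == (a + b) & 0xFFFF_FFFF_FFFF_FFFF
print(hex(add64_with_16bit_registers(a, b)))   # 0x2222222222222211
```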
Faster Disk Access
- Small improvements in disk access can yield significant improvements in system performance
- Approach - distribute data among multiple devices so it can be accessed simultaneously from different devices
- Manufacturers continue to produce disk drives that are smaller and more densely packed
Larger/Faster Memory
- Increased amounts of memory provide larger buffers that can hold data and programs transferred from I/O devices, reducing the number of disk accesses
- Faster memory reduces the number of wait states that must be inserted into the instruction cycle when memory is accessed
- Memory access time can also be reduced via RISC architecture (more registers mean fewer memory references) and by providing wider memory data paths (e.g., 8 bytes)
Memory
- DRAM - Dynamic RAM - inexpensive, requires less electrical power, and is more compact, with more bits of memory in a single integrated circuit; requires periodic refreshing
- SRAM - Static RAM - 2-3 times faster, but more expensive and requires more chips
- Building all of main memory from SRAM is impractical
- Solution - cache memory
Cache Memory
- Cache memory is organized into blocks of 8-16 bytes each
- A block holds an exact copy of data stored in main memory
- Each block has a tag that identifies the location in main memory of the data contained in the block
- 64KB of cache with 8-byte blocks => 8,192 blocks of data
- A CPU request for memory is handled by the cache controller, which checks the tags for the desired location
- Hit => data is in the cache; Miss => data is not present
- On a read hit, data is transferred from the cache to the CPU; on a write hit, data is stored with its tag in cache memory
- On a miss, data is first copied from main memory into the cache
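The tag-check flow above can be sketched as a tiny direct-mapped cache model. The parameters (4 blocks of 8 bytes, a plain list as memory) are illustrative only, not taken from any particular CPU:

```python
class Cache:
    # Direct-mapped cache model: each memory block maps to exactly one cache
    # slot, identified by a tag, as in the tag-check scheme described above.
    def __init__(self, n_blocks, block_size):
        self.n_blocks, self.block_size = n_blocks, block_size
        self.tags = [None] * n_blocks            # tag per slot (None = empty)
        self.data = [None] * n_blocks
        self.hits = self.misses = 0

    def read(self, memory, addr):
        block_no = addr // self.block_size       # which memory block
        index = block_no % self.n_blocks         # cache slot for that block
        tag = block_no // self.n_blocks
        if self.tags[index] == tag:              # hit: data already in cache
            self.hits += 1
        else:                                    # miss: copy block from memory
            self.misses += 1
            base = block_no * self.block_size
            self.data[index] = memory[base:base + self.block_size]
            self.tags[index] = tag
        return self.data[index][addr % self.block_size]

mem = list(range(64))
c = Cache(n_blocks=4, block_size=8)
values = [c.read(mem, a) for a in (0, 1, 2, 8, 9, 0)]
print(values, c.hits, c.misses)   # [0, 1, 2, 8, 9, 0] 4 2
```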
Cache Illustration
Cache Situations
Full Cache and Memory Write:
- LRU - Least Recently Used algorithm - replace the block that has not been accessed for the longest time
- If the block to be replaced has been altered, first write the block back to memory before replacement
- The cache controller manages the entire cache operation; the CPU is unaware of the cache's presence
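The LRU policy can be sketched with Python's `OrderedDict`. This models the replacement policy only, not real cache hardware; tags, data, and the dirty-block write-back are simplified placeholders:

```python
from collections import OrderedDict

class LRUCache:
    # Holds at most `capacity` blocks; when full, evicts the block that has
    # gone longest without being accessed (LRU), writing it back to memory
    # first if it was altered (the "dirty" case described above).
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()                  # tag -> (data, dirty)

    def access(self, tag, write=False):
        if tag in self.blocks:
            data, dirty = self.blocks.pop(tag)       # hit: re-insert as newest
        else:
            data, dirty = f"block {tag}", False      # miss: fetch from memory
            if len(self.blocks) >= self.capacity:    # cache full: evict LRU
                old_tag, (old_data, old_dirty) = self.blocks.popitem(last=False)
                if old_dirty:
                    pass  # here: write old_data back to memory before replacing
        self.blocks[tag] = (data, dirty or write)    # mark most recently used

c = LRUCache(capacity=2)
for tag, w in [("A", False), ("B", True), ("A", False), ("C", False)]:
    c.access(tag, write=w)
print(list(c.blocks))   # ['A', 'C'] -- B was least recently used and evicted
```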
Why does Cache work?
- Locality of reference - empirical studies show that most well-written programs confine memory references to a few small regions of memory, e.g. sequential instructions, loops, small procedures, or array data
- Hit-to-miss ratios of 90% or better are typical
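The 90% figure matters because effective access time is a weighted average of fast hits and slow misses: t_eff = h * t_cache + (1 - h) * t_memory. The timings below are illustrative, not from the text:

```python
def effective_access_time(hit_ratio, t_cache_ns, t_memory_ns):
    # Weighted average of fast cache hits and slow main-memory misses.
    return hit_ratio * t_cache_ns + (1 - hit_ratio) * t_memory_ns

# Hypothetical timings: 2 ns cache, 20 ns main memory.
for h in (0.0, 0.5, 0.9, 0.99):
    print(f"hit ratio {h:.2f}: {effective_access_time(h, 2, 20):.2f} ns")
# At a 90% hit ratio the average access (3.8 ns) is close to cache speed,
# far below the 20 ns of an uncached access.
```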
Two-Level Cache System