Performance Tuning
John Black
CS 425
UNR, Fall 2000
Why Go Fast?
• Real Time Systems
• Solve Bigger Problems
• Less waiting time
• Better Simulations, Video Games, etc.
• Moore’s Law: Every 18 months CPU speeds double for constant cost
– Has held true for the last 30 years!
How do we go Faster?
• Find the Bottlenecks!
– Hardware problems
– Bad algorithms/data structures
– I/O bound (disk, network, peripherals)
• If none of these, we have to hand tune
Hand Tuning
• Use a “profiler”
– 80% of time is spent in 20% of the code
– Find this 20% and tune by hand
– Do NOT waste time tuning code which is not a
bottleneck
• How can we hand-tune?
Hand Tuning (cont.)
• Exploit architecture-dependent features
(of course this approach limits portability)
• We focus on the Pentium family
– Memory System
– Pipeline
Memory System
• Modern processors have a hierarchical
memory system
– Memory units nearer the processor are faster
but smaller
– Registers: about 40 32-bit registers
– Level 1 Cache: 32K
– Level 2 Cache: 256K or 512K on the P3
– Main Memory
Common Memory-Related Bottlenecks
• Alignment: the requirement that an accessed
object lie at an address which is a multiple
of 16 or 32 or 64, etc.
• For Pentium Pro, P2, P3, there is no penalty
for misaligned data (unless you cross a
cache line boundary)
– Cache lines are 32-byte cells in the caches
– We always read 32 bytes at a time
Locality
• Since we fetch 32 bytes at a time, accessing
memory sequentially (or in a small
neighborhood) is efficient
– This is called “spatial locality”
• Since cache contents are aged (with an LRU
algorithm) and eventually kicked out,
accessing items repeatedly is efficient
– This is called “temporal locality”
Digression on Big-Oh
• We learn in algorithms class that we are
concerned only with the asymptotic running
time on a RAM machine
– With new architectures this may no longer be a
good way to measure performance
– An O(n lg n) algorithm may be FASTER than
an O(n) algorithm due to architectural
considerations
Pipelining
• Modern processors execute many
instructions in parallel to increase speed
Decode → Get Args → Execute → Retire
An instruction may be getting decoded at the same time as
another is getting its arguments and another is executing;
this parallelism greatly speeds up the processor
Is a Pentium RISC or CISC?
• The Pentium instruction set is CISC
(Complex Instruction Set Computer) but it
actually translates into micro-ops and runs
on a RISC (Reduced Instruction Set
Computer) under the sheets
• The RISC instructions (micro-ops) are what
is actually pipelined
How does this Affect Tuning?
• If the pipeline is not fully utilized, we can
improve performance by fixing this
– Reorder instructions to reduce dependencies
and “pipeline stalls”
– Avoid mispredicted branches with loop
unrolling
– Avoid function calls with inlining
The Downside
• All this tuning makes code harder to write,
maintain, debug, port, etc.
• Assembly language may be required, which
has all the above drawbacks