Transcript Ch1a

COM 249 – Computer Organization and
Assembly Language
Chapter 1 (continued)
Performance
Based on slides from D. Patterson and
www-inst.eecs.berkeley.edu/~cs152/
Modified by S. J. Fritz Spring 2009 (1)

§1.4 Performance
Understanding Performance
• Algorithm
  – Determines number of operations executed (IC and possibly CPI)
• Programming language, compiler, architecture
  – Determine number of machine instructions executed per operation (IC, CPI)
• Instruction set architecture
  – Determines the instructions needed for a function, the cycles needed for each instruction, and the clock rate of the processor (IC, CPI, and clock rate)
• Processor and memory system
  – Determine how fast instructions are executed
• I/O system (including OS)
  – Determines how fast I/O operations are executed
Modified by S. J. Fritz Spring 2009 (2)
Comparison of Airplanes
Airplane          Passenger  Cruising Range  Cruising Speed  Passenger Throughput
                  Capacity   (miles)         (MPH)           (passengers × MPH)
Boeing 777        375        4630            610             228,750
Boeing 747        470        4150            610             286,700
Concorde          132        4000            1350            178,200
Douglas DC-8-50   146        8720            544             79,424
Modified by S. J. Fritz Spring 2009 (3)
Defining Performance
Which airplane has the best performance?
[Four bar charts comparing the airplanes on: Passenger Capacity, Cruising Range (miles), Cruising Speed (mph), and Passengers × mph]
Modified by S. J. Fritz Spring 2009 (4)
Response Time and Throughput
• Response time
– How long it takes to do a task
• Throughput
– Total work done per unit time
• e.g., tasks/transactions/… per hour
• How are response time and throughput
affected by
– Replacing the processor with a faster version?
– Adding more processors?
• We’ll focus on response time for now…
Modified by S. J. Fritz Spring 2009 (5)
Response Time and Throughput
• Decreasing response time usually improves
throughput.
• Do these changes increase response time or
throughput or both?
• Replacing the processor with a faster one?
  – Both: each task finishes sooner, so more tasks complete per unit time.
• Adding additional processors?
  – Only throughput, since no single task gets done faster.
Modified by S. J. Fritz Spring 2009 (6)
Relative Performance
• Define Performance
Performance = 1/Execution Time
• “X is n times faster than Y” means:
  Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n
• Example: time taken to run a program
  – 10 s on A, 15 s on B
  – Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
  – So A is 1.5 times faster than B
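A quick check of this definition in Python (a minimal sketch; the times are just the example's values):

def times_faster(time_x, time_y):
    # Performance = 1 / Execution Time, so
    # Perf_X / Perf_Y = Time_Y / Time_X = n
    return time_y / time_x

print(times_faster(10, 15))   # 1.5, so A is 1.5 times faster than B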
Modified by S. J. Fritz Spring 2009 (7)
Measuring Execution Time
• Elapsed time
– Total response time, including all aspects
• Processing, I/O, OS overhead, idle time
– Determines system performance
• CPU time
– Time spent processing a given job
• Discounts I/O time, other jobs’ shares
– Comprises user CPU time and system CPU time
– Different programs are affected differently by CPU
and system performance
Modified by S. J. Fritz Spring 2009 (8)
CPU Clocking
• Operation of digital hardware governed by a
constant-rate clock
[Timing diagram: a clock waveform with the clock period marked; data transfer and computation occur within a cycle, and state is updated on the clock edge]
• Clock period: duration of a clock cycle
  – e.g., 250 ps = 0.25 ns = 250×10^-12 s
• Clock frequency (rate): cycles per second
  – e.g., 4.0 GHz = 4000 MHz = 4.0×10^9 Hz
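As a unit sanity check, a tiny Python sketch converting the example period above to the corresponding clock rate:

period_s = 250e-12            # 250 ps clock period
print(1 / period_s / 1e9)     # 4.0 (GHz), since rate = 1 / period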
Modified by S. J. Fritz Spring 2009 (9)
Program Execution Time(CPU Time)
• CPU Time = Total CPU Clock Cycles × Clock Cycle Length
• Total CPU Clock Cycles = Number of Instructions × CPI
• CPI: Clock Cycles per Instruction
– Average number of clock cycles each instruction takes to execute
• CPU Designer’s Choice: Trade off between the number of
instructions and the duration of the clock cycle
– Long cycle: powerful but complex instructions (CISC)
– Short cycle: simple instructions (RISC)
Modified by S. J. Fritz Spring 2009 (10)
Instruction Count and CPI
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
• Instruction Count for a program
– Determined by program, ISA and compiler
• Average cycles per instruction
– Determined by CPU hardware
– If different instructions have different CPI
• Average CPI affected by instruction mix
Modified by S. J. Fritz Spring 2009 (11)
CPI Example
• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250 ps = I × 500 ps      ← A is faster
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500 ps = I × 600 ps
CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2   ← …by this much
Modified by S. J. Fritz Spring 2009 (12)
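A short Python check of this comparison; the instruction count cancels in the ratio, so any positive value works (1 here):

ic = 1.0                          # instruction count cancels out
time_a = ic * 2.0 * 250e-12       # CPU Time_A = IC × CPI_A × Cycle Time_A
time_b = ic * 1.2 * 500e-12       # CPU Time_B = IC × CPI_B × Cycle Time_B
print(round(time_b / time_a, 2))  # 1.2, so A is 1.2 times faster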
Classic CPU Performance Equation
• Basic performance equation in terms of IC, CPI and
clock cycle time:
• CPU time = Instruction count × CPI × Clock cycle time
  or, since clock rate is the inverse of clock cycle time:
• CPU Time = (Instruction count × CPI) / Clock Rate
Modified by S. J. Fritz Spring 2009 (13)
CPI in More Detail
• If different instruction classes take
different numbers of cycles
Clock Cycles = Σ (from i = 1 to n) ( CPI_i × Instruction Count_i )
• Weighted average CPI:
CPI = Clock Cycles / Instruction Count
    = Σ (from i = 1 to n) ( CPI_i × (Instruction Count_i / Instruction Count) )
where Instruction Count_i / Instruction Count is the relative frequency of class i.
Modified by S. J. Fritz Spring 2009 (14)
CPI Example
• Alternative compiled code sequences using
instructions in classes A, B, C
Class             A    B    C
CPI for class     1    2    3
IC in sequence 1  2    1    2
IC in sequence 2  4    1    1

• Sequence 1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
Modified by S. J. Fritz Spring 2009 (15)
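A small Python sketch that reproduces both sequences; the per-class CPIs and instruction counts come straight from the table:

cpi = {"A": 1, "B": 2, "C": 3}
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

def avg_cpi(seq):
    cycles = sum(cpi[c] * n for c, n in seq.items())
    return cycles, cycles / sum(seq.values())

print(avg_cpi(seq1))   # (10, 2.0)
print(avg_cpi(seq2))   # (9, 1.5)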
Basic Components of Performance
Components of Performance           Units of Measure
CPU execution time                  Seconds
Instruction count                   Instructions executed
Clock cycles per instruction (CPI)  Average number of clock cycles per instruction
Clock cycle time                    Seconds per clock cycle
Modified by S. J. Fritz Spring 2009 (16)
Performance Summary
The BIG Picture
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
• Performance depends on:
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI
  – Instruction set architecture: affects IC, CPI, Tc
Modified by S. J. Fritz Spring 2009 (17)
§1.5 The Power Wall
Power Trends
[Chart: over successive processor generations, clock rate grew about ×1000 while power grew only about ×30; supply voltage dropped from 5V to 1V]
WHY?
• In CMOS IC technology:
  Power = Capacitive load × Voltage² × Frequency
Modified by S. J. Fritz Spring 2009 (18)
Power Wall
• Clock rate and power have increased over
25 years and 8 computer generations and
then flattened off
• They grew together because they are
correlated
• They slowed down because we have run into a practical power limit for cooling microprocessors (thermal problems).
Modified by S. J. Fritz Spring 2009 (19)
CMOS and Power
• Dominant technology for IC (integrated circuits) is
CMOS(complementary metal oxide semiconductor).
• For CMOS primary power dissipation is dynamic
power – power consumed during switching.
• Frequency switched is a function of clock rate.
• Capacitive load is a function of the number of
transistors and the technology.
• Therefore, even though clock rates rose roughly ×1000, power rose only about ×30, largely because supply voltage dropped from 5V to 1V (voltage enters the equation squared):
  Power = Capacitive load × Voltage² × Frequency
Modified by S. J. Fritz Spring 2009 (20)
Reducing Power
• Suppose a new CPU has
– 85% of capacitive load of old CPU
– 15% voltage and 15% frequency reduction
P_new / P_old = ( (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85) ) / ( C_old × V_old² × F_old )
              = 0.85⁴ ≈ 0.52
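The ratio is quick to verify in Python (a one-line check of the algebra above):

print(0.85 * 0.85**2 * 0.85)   # capacitance × voltage² × frequency factors
print(0.85**4)                 # same thing: 0.52200625, about half the old power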
• The power wall
– We can’t reduce voltage further
– We can’t remove more heat
• How else can we improve performance?
New design…
Modified by S. J. Fritz Spring 2009 (21)
§1.6 The Sea Change: The Switch to Multiprocessors
Uniprocessor Performance
• Constrained by power, instruction-level parallelism, memory latency
Modified by S. J. Fritz Spring 2009 (22)
Sea Change
• Sea Change is an idiom meaning a profound
transformation or big change.
• Taken from Shakespeare’s The Tempest, when
Ariel sings:
• "Full fathom five thy father lies,
Of his bones are coral made,
Those are pearls that were his eyes,
Nothing of him that doth fade,
But doth suffer a sea-change,
into something rich and strange…”
• http://en.wikipedia.org/wiki/Sea_change
Modified by S. J. Fritz Spring 2009 (23)
From Uniprocessor to Multiprocessor
• Power limits have forced change in the design of
microprocessors.
• Microprocessors now have multiple processors or
“cores” per chip.
• Called multicore (dual core, quad core, etc.)
• Plan to double the number of cores per chip
every two years.
• Programmers need to rewrite their programs to
take advantage of multiple processors.
Modified by S. J. Fritz Spring 2009 (24)
Multiprocessors
• Multicore microprocessors
– More than one processor per chip
• Requires explicitly parallel programming
– Compare with instruction level parallelism
• Hardware executes multiple instructions at once
• Hidden from the programmer
– Hard to do
• Programming for performance
• Load balancing
• Optimizing communication and synchronization
Modified by S. J. Fritz Spring 2009 (25)
Multicore Microprocessors
Product          AMD Opteron X4   Intel      IBM       Sun UltraSPARC T2
                 (Barcelona)      Nehalem    Power6    (Niagara 2)
Cores per chip   4                4          2         8
Clock rate       2.5 GHz          ~2.5 GHz   4.7 GHz   1.4 GHz
Power            120 W            ~100 W     ~100 W    94 W
Modified by S. J. Fritz Spring 2009 (26)
Parallelism
• Programmers need to switch to explicitly
parallel programming.
• Pipelining (Chapter 4) is an elegant
technique to overlap the execution of
instructions.
• Instruction-level parallelism abstracts away the parallel nature of the hardware, so the programmer and compiler can think in terms of sequential instruction execution.
Modified by S. J. Fritz Spring 2009 (27)
Parallel Programming
• Parallel programs are hard to write:
  – Parallel programming is by definition performance programming, so it must be fast. (If speed is not needed, write a sequential program.)
  – For parallel hardware, the programmer must divide the application so that each processor has the same amount of work to do, with little overhead.
• See the newspaper story analogy, p. 43.
Modified by S. J. Fritz Spring 2009 (28)
Real Stuff: Manufacturing AMD chip
• Manufacture of a chip begins with silicon (found in
sand).
• Silicon is a semiconductor – does not conduct
electricity well.
• Materials are added to silicon to form:
– Conductors (copper or aluminum)
– Insulators (plastic or glass)
– Conduct or insulate under special conditions (as a
switch or transistor)
• VLSI (very large scale integration) circuit is
millions of conductors, insulators and switches in
a small package.
Modified by S. J. Fritz Spring 2009 (29)
§1.7 Real Stuff: The AMD Opteron X4
Manufacturing ICs
• Yield: proportion of working dies per wafer
• http://www.intel.com/museum/onlineexhibits.htm
Modified by S. J. Fritz Spring 2009 (30)
AMD Opteron X2 Wafer
• X2: 300mm wafer, 117 chips, 90nm technology
• X4: 45nm technology
Modified by S. J. Fritz Spring 2009 (31)
Integrated Circuit Cost
Cost per die = Cost per wafer / (Dies per wafer × Yield)
Dies per wafer ≈ Wafer area / Die area
Yield = 1 / (1 + (Defects per area × Die area / 2))²
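A minimal sketch of these formulas in Python; the wafer cost, diameter, die area, and defect rate below are made-up illustrative numbers, not figures from the text:

import math

def cost_per_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    dies_per_wafer = wafer_area / die_area_cm2                     # ignores edge losses
    die_yield = 1 / (1 + defects_per_cm2 * die_area_cm2 / 2) ** 2
    return wafer_cost / (dies_per_wafer * die_yield)

# Hypothetical: $5000 wafer, 30 cm diameter, 1 cm² dies, 0.4 defects/cm²
print(cost_per_die(5000, 30, 1.0, 0.4))   # about $10.2 per good die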
• Nonlinear relation to area and defect rate
– Wafer cost and area are fixed
– Defect rate determined by manufacturing process
– Die area determined by architecture and circuit
design
Modified by S. J. Fritz Spring 2009 (32)
SPEC CPU Benchmark
• Benchmarks are programs used to measure performance
– Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
– Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
– Elapsed time to execute a selection of programs
• Negligible I/O, so focuses on CPU performance
– Normalize relative to reference machine
– Summarize as geometric mean of performance ratios
• CINT2006 (integer) and CFP2006 (floating-point)
n
n
Execution time ratio
i1
Modified by S. J. Fritz Spring 2009 (33)
i
CINT2006 for Opteron X4 2356
Name        Description                    IC×10^9  CPI    Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing  2,118    0.75   0.40     637        9,777     15.3
bzip2       Block-sorting compression      2,389    0.85   0.40     817        9,650     11.8
gcc         GNU C Compiler                 1,050    1.72   0.47     724        8,050     11.1
mcf         Combinatorial optimization     336      10.00  0.40     1,345      9,120     6.8
go          Go game (AI)                   1,658    1.09   0.40     721        10,490    14.6
hmmer       Search gene sequence           2,783    0.80   0.40     890        9,330     10.5
sjeng       Chess game (AI)                2,176    0.96   0.48     837        12,100    14.5
libquantum  Quantum computer simulation    1,623    1.61   0.40     1,047      20,720    19.8
h264avc     Video compression              3,102    0.80   0.40     993        22,130    22.3
omnetpp     Discrete event simulation      587      2.94   0.40     690        6,250     9.1
astar       Games/path finding             1,082    1.79   0.40     773        7,020     9.1
xalancbmk   XML parsing                    1,058    2.70   0.40     1,143      6,900     6.0
Geometric mean                                                                           11.7
(The largest CPI values, e.g., mcf, reflect high cache miss rates.)
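The geometric mean of the twelve SPECratios can be verified with a short Python sketch:

import math

ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
          14.5, 19.8, 22.3, 9.1, 9.1, 6.0]
geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(round(geomean, 1))   # 11.7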
Modified by S. J. Fritz Spring 2009 (34)
SPEC Power Benchmark
• Power consumption of server at
different workload levels
– Performance: ssj_ops/sec
– Power: Watts (Joules/sec)
 10
  10

Overall ssj_ops per Watt    ssj_ops i    poweri 
 i 0
  i 0

Modified by S. J. Fritz Spring 2009 (35)
SPECpower_ssj2008 for X4
Target Load %   Performance (ssj_ops/sec)   Average Power (Watts)
100%            231,867                     295
90%             211,282                     286
80%             185,803                     275
70%             163,427                     265
60%             140,160                     256
50%             118,324                     246
40%             92,035                      233
30%             70,500                      222
20%             47,126                      206
10%             23,066                      180
0%              0                           141
Overall sum     1,283,590                   2,605
∑ssj_ops / ∑power = 493
Modified by S. J. Fritz Spring 2009 (36)
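A quick Python check of the overall metric, summing the two columns of the table:

ssj_ops = [231867, 211282, 185803, 163427, 140160,
           118324, 92035, 70500, 47126, 23066, 0]
power = [295, 286, 275, 265, 256, 246, 233, 222, 206, 180, 141]
print(sum(ssj_ops), sum(power))            # 1283590 2605
print(round(sum(ssj_ops) / sum(power)))    # 493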
§1.8 Fallacies and Pitfalls
Pitfalls and Fallacies
• Fallacies: commonly held misconceptions, usually presented with a counterexample.
• Pitfalls: easily made mistakes, often generalizations of principles that are true in a limited context.
• Purpose of these sections is to help you to
avoid making mistakes.
Modified by S. J. Fritz Spring 2009 (37)
Amdahl's Law
• Amdahl's Law governs the speedup of using parallel processors on a problem, versus using only one serial processor.
  – Before we examine Amdahl's Law, we should gain a better understanding of what is meant by speedup.
• Speedup is the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many (j) processors). The formula for speedup is:
  S = T(1) / T(j)
• Efficiency is the speedup divided by the number of processors used.
• http://cs.wlu.edu/~whaleyt/classes/parallel/topics/amdahl.html
Modified by S. J. Fritz Spring 2009 (38)
Amdahl’s Law
• If N is the number of processors, s is the amount of
time spent (by a serial processor) on serial parts of a
program and p is the amount of time spent (by a
serial processor) on parts of the program that can be
done in parallel, then Amdahl's law says that
speedup is given by
Speedup = (s + p ) / (s + p / N )
= 1 / (s + p / N ),
where we have set total time s + p = 1 for algebraic
simplicity.
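A minimal Python sketch of this formula, applied to the 20-hour example on the next slide (s = 1 hour serial, p = 19 hours parallelizable):

def speedup(s, p, n):
    # Amdahl's law: serial time s, parallelizable time p, n processors
    return (s + p) / (s + p / n)

for n in (2, 10, 100, 1000):
    print(n, round(speedup(1, 19, n), 2))
# 1.9, 6.9, 16.81, 19.63: approaches 20× but never reaches it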
http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
Modified by S. J. Fritz Spring 2009 (39)
Limitations of Amdahl’s Law
If a program needs 20 hours on a single processor, and a 1-hour portion cannot be parallelized while the remaining 19 hours (95%) can be, then no matter how many processors we devote to the parallelized execution, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20×.
Modified by S. J. Fritz Spring 2009 (40)
Pitfall: Amdahl’s Law
• Pitfall: Expecting the improvement of one aspect
of a computer to increase overall performance by
an amount proportional to the size of the
improvement.
• Suppose a program runs in 100 seconds on a
computer, with multiply operations responsible for
80 seconds of this time.
• How much must the speed of the multiplication
improve to have the program run 5 times faster?
Modified by S. J. Fritz Spring 2009 (41)
Pitfall: Amdahl's Law (continued)
• Improving an aspect of a computer and expecting a proportional improvement in overall performance:
  T_improved = T_affected / improvement factor + T_unaffected
• Example: multiply accounts for 80 s out of 100 s
  – How much improvement in multiply performance to get 5× overall?
    20 = 80/n + 20, which requires 80/n = 0
  – Can't be done!
• Corollary: make the common case fast
Modified by S. J. Fritz Spring 2009 (42)
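The impossibility is easy to see numerically; a small hypothetical check of candidate improvement factors n:

def overall_time(n, affected=80.0, unaffected=20.0):
    # T_improved = T_affected / n + T_unaffected
    return affected / n + unaffected

for n in (10, 100, 1e6):
    print(n, overall_time(n))
# 28.0, 20.8, 20.00008: never reaches the 20 s needed for a 5× speedup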
Uses for Amdahl’s Law
• Estimate performance improvements
• Together with CPU performance
equation, use to evaluate potential
enhancements
• Use the corollary: make the common case fast to enhance performance; this is easier than optimizing the rare case.
• Use to examine the practical limits on
the number of parallel processors.
Modified by S. J. Fritz Spring 2009 (43)
Fallacy: Low Power at Idle
• Fallacy: Computers at low utilization use little
power.
• Look back at X4 power benchmark
– At 100% load: 295W
– At 50% load: 246W (83%)
– At 10% load: 180W (61%)
• Google data center
– Mostly operates at 10% – 50% load
– At 100% load less than 1% of the time
• Consider designing processors to make power
proportional to load
Modified by S. J. Fritz Spring 2009 (44)
Pitfall: MIPS as a Performance Metric
• Pitfall: Using a subset of the performance equation as a performance metric.
• MIPS: Millions of Instructions Per Second
– Doesn’t account for
• Differences in ISAs between computers
• Differences in complexity between instructions
MIPS = Instruction count / (Execution time × 10^6)
     = Instruction count / ( (Instruction count × CPI / Clock rate) × 10^6 )
     = Clock rate / (CPI × 10^6)
– CPI varies between programs on a given CPU
Modified by S. J. Fritz Spring 2009 (45)
Performance Measurements
Measurement         Computer A    Computer B
Instruction Count   10 billion    8 billion
Clock Rate          4 GHz         4 GHz
CPI                 1.0           1.1

• Which computer has the higher MIPS rating?
  MIPS = Clock rate / (CPI × 10^6):
  A = 4 GHz / (1.0 × 10^6) = 4,000    B = 4 GHz / (1.1 × 10^6) ≈ 3,636
• Which computer is faster?
  CPU time = IC × CPI / Clock rate:
  A = 10×10^9 × 1.0 / 4 GHz = 2.5 s    B = 8×10^9 × 1.1 / 4 GHz = 2.2 s
  B is faster, despite its lower MIPS rating.
Modified by S. J. Fritz Spring 2009 (46)
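Checking both questions in Python with the table's values:

clock = 4e9   # 4 GHz for both machines
for name, ic, cpi in [("A", 10e9, 1.0), ("B", 8e9, 1.1)]:
    mips = clock / (cpi * 1e6)
    cpu_time = ic * cpi / clock
    print(name, round(mips), round(cpu_time, 2))
# A: 4000 MIPS, 2.5 s; B: 3636 MIPS, 2.2 s, so B is faster despite lower MIPS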
§1.9 Concluding Remarks
Concluding Remarks
• Cost/performance is improving
  – Due to underlying technology development
• Hierarchical layers of abstraction
  – In both hardware and software
• Instruction set architecture
  – The hardware/software interface
• Execution time: the best performance measure
  Seconds/Program = (Instructions/Program) × (Clock cycles/Instruction) × (Seconds/Clock cycle)
• Execution time is the only valid measure of performance.
Modified by S. J. Fritz Spring 2009 (47)
Concluding Remarks (continued)
• Power is a limiting factor
  – Use parallelism to improve performance
    • Via multiple processors
    • By exploiting the locality of accesses to a memory hierarchy, via caches
• The key hardware technology for modern processors is silicon.
• Historical Perspective (see §1.10 on the CD)
Modified by S. J. Fritz Spring 2009 (48)