EE (CE) 6304 Computer Architecture
Lecture #4
(9/3/15)
Yiorgos Makris
Professor
Department of Electrical Engineering
University of Texas at Dallas
Course Web-site:
http://www.utdallas.edu/~gxm112130/EE6304FA15
Measuring Performance
• Performance is in units of things per sec
– bigger is better
• If we are primarily concerned with response time
performance(x) = 1 / execution_time(x)
• "X is n times faster than Y" means
n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)
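For example (a worked instance added here, not on the original slide): if X runs a program in 10 seconds and Y takes 15 seconds, then n = Execution_time(Y) / Execution_time(X) = 15 / 10 = 1.5, so X is 1.5 times faster than Y.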
Performance: What to measure
• Usually rely on benchmarks vs. real workloads
• To increase predictability, collections of benchmark
applications, called benchmark suites, are popular
• SPECCPU: popular desktop benchmark suite
– CPU only, split between integer and floating-point programs
– SPECint2000 has 12 integer programs, SPECfp2000 has 14 floating-point programs
– SPECCPU2006 announced Spring 2006
– SPECSFS (NFS file server) and SPECWeb (web server) added as server benchmarks
• Transaction Processing Council measures server performance and cost-performance for databases
– TPC-C: complex queries for online transaction processing
– TPC-H: models ad hoc decision support
– TPC-W: a transactional web benchmark
– TPC-App: application server and web services benchmark
How to Summarize Suite Performance (1/5)
• Arithmetic average of execution times of all programs?
– But they vary by 4X in speed, so some would count more than others in an arithmetic average
• Could add a weight per program, but how to pick the weights?
– Different companies want different weights for their products
• SPECRatio: normalize execution times to a reference computer, yielding a ratio proportional to performance:
SPECRatio = time on reference computer / time on computer being rated
How to Summarize Suite Performance (2/5)
• If a program's SPECRatio on Computer A is 1.25 times bigger than on Computer B, then

1.25 = SPECRatio_A / SPECRatio_B
     = (ExecutionTime_reference / ExecutionTime_A) / (ExecutionTime_reference / ExecutionTime_B)
     = ExecutionTime_B / ExecutionTime_A
     = Performance_A / Performance_B

• Note that when comparing 2 computers as a ratio, execution times on the reference computer drop out, so the choice of reference computer is irrelevant
How to Summarize Suite Performance (3/5)
• Since these are ratios, the proper mean is the geometric mean (SPECRatio is unitless, so the arithmetic mean is meaningless)

GeometricMean = (SPECRatio_1 × SPECRatio_2 × ... × SPECRatio_n)^(1/n)

1. Geometric mean of the ratios is the same as the ratio of the geometric means
2. Ratio of geometric means = geometric mean of performance ratios
⇒ choice of reference computer is irrelevant!
• These two points make the geometric mean of ratios attractive for summarizing performance
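To see why the reference cancels (a short derivation added here, not from the slides), write the ratio of the geometric means of the two machines' SPECRatios over the n programs:

\[
\frac{GM_A}{GM_B}
= \frac{\bigl(\prod_{i=1}^{n} T_{ref,i}/T_{A,i}\bigr)^{1/n}}
       {\bigl(\prod_{i=1}^{n} T_{ref,i}/T_{B,i}\bigr)^{1/n}}
= \Bigl(\prod_{i=1}^{n} \frac{T_{B,i}}{T_{A,i}}\Bigr)^{1/n}
\]

Every reference time T_ref,i cancels, leaving the geometric mean of the per-program performance ratios of A over B.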
How to Summarize Suite Performance (4/5)
• Does a single mean summarize the performance of the programs in the benchmark suite well?
• Can decide whether the mean is a good predictor by characterizing the variability of the distribution using the standard deviation
• Like the geometric mean, the geometric standard deviation is multiplicative rather than arithmetic
• Can simply take the logarithms of the SPECRatios, compute the standard mean and standard deviation, and then exponentiate to convert back:

GeometricMean = exp( (1/n) × Σ_{i=1..n} ln(SPECRatio_i) )
GeometricStDev = exp( StDev( ln(SPECRatio_i) ) )
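A minimal C sketch of this log/exp recipe (added for illustration, not from the slides; the SPECRatio values in the array are made up):

#include <math.h>
#include <stdio.h>

/* Geometric mean and multiplicative (geometric) standard deviation
   computed via logarithms, as described above. */
int main(void) {
    double ratios[] = { 1.8, 2.4, 3.1, 0.9, 2.0 };  /* hypothetical SPECRatios */
    int n = sizeof(ratios) / sizeof(ratios[0]);

    double sum_ln = 0.0;                    /* mean of ln(SPECRatio_i) */
    for (int i = 0; i < n; i++)
        sum_ln += log(ratios[i]);
    double mean_ln = sum_ln / n;

    double sum_sq = 0.0;                    /* stdev of ln(SPECRatio_i) */
    for (int i = 0; i < n; i++) {
        double d = log(ratios[i]) - mean_ln;
        sum_sq += d * d;
    }
    double stdev_ln = sqrt(sum_sq / n);     /* population form */

    printf("GeometricMean  = %.3f\n", exp(mean_ln));
    printf("GeometricStDev = %.3f\n", exp(stdev_ln));
    return 0;
}

(Compile with -lm. Note that Excel's STDEV() uses the sample form with n-1, so its result differs slightly from this population form.)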
How to Summarize Suite Performance (5/5)
• Standard deviation is more informative if we know the distribution has a standard form
– bell-shaped normal distribution, whose data are symmetric around the mean
– lognormal distribution, where the logarithms of the data (not the data itself) are normally distributed (symmetric) on a logarithmic scale
• For a lognormal distribution, we expect that
– 68% of samples fall in the range [mean / gstdev, mean × gstdev]
– 95% of samples fall in the range [mean / gstdev^2, mean × gstdev^2]
• Note: Excel provides functions EXP(), LN(), and STDEV() that make calculating the geometric mean and multiplicative standard deviation easy
Example Standard Deviation (1/2)
• GM and multiplicative StDev of SPECfp2000 for Itanium 2
[Bar chart: SPECfpRatio for the 14 SPECfp2000 programs (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi) on Itanium 2; GM = 2712, GSTDEV = 1.98, so the one-GSTDEV range is about 1372 to 5362.]
Example Standard Deviation (2/2)
• GM and multiplicative StDev of SPECfp2000 for AMD Athlon
[Bar chart: SPECfpRatio for the same 14 SPECfp2000 programs on AMD Athlon; GM = 2086, GSTDEV = 1.40, so the one-GSTDEV range is about 1494 to 2911.]
How to Mislead with Performance Reports
• Select pieces of workload that work well on your design, ignore
others
• Use unrealistic data set sizes for application (too big or too small)
• Report throughput numbers for a latency benchmark
• Report latency numbers for a throughput benchmark
• Report performance on a kernel and claim it represents an entire
application
• Use 16-bit fixed-point arithmetic (because it’s fastest on your
system) even though application requires 64-bit floating-point
arithmetic
• Use a less efficient algorithm on the competing machine
• Report speedup for an inefficient algorithm (bubblesort)
• Compare hand-optimized assembly code with unoptimized C code
• Compare your design using next year's technology against a competitor's year-old design (1% performance improvement per week)
• Ignore the relative cost of the systems being compared
• Report averages and not individual results
• Report speedup over unspecified base system, not absolute times
• Report efficiency not absolute times
• Report MFLOPS not absolute times (use inefficient algorithm)
[ David Bailey “Twelve ways to fool the masses when giving
performance results for parallel supercomputers” ]
ISA Implementation Review
Fundamental Execution Cycle
[Diagram: the fetch-execute loop between memory (program and data) and the processor (registers and functional units); the path between them is the von Neumann bottleneck.]
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction
A "Typical" RISC ISA
• 32-bit fixed-format instructions (3 formats)
• 32 32-bit GPRs (R0 contains zero; double-precision values take a register pair)
• 3-address, reg-reg arithmetic instructions
• Single address mode for load/store: base + displacement
– no indirection
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Example: MIPS
Register-Register: Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | unused [10:6] | Opx [5:0]
Register-Immediate: Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch: Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call: Op [31:26] | target [25:0]
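As a concrete illustration of the register-register format (an added sketch; the instruction word is a hand-assembled example assumed to encode add r9,r10,r11):

#include <stdint.h>
#include <stdio.h>

/* Extract the register-register fields shown above using shifts and masks. */
int main(void) {
    uint32_t instr = 0x014B4820;            /* hypothetical R-R instruction word */
    uint32_t op  = (instr >> 26) & 0x3F;    /* bits 31:26 */
    uint32_t rs1 = (instr >> 21) & 0x1F;    /* bits 25:21 */
    uint32_t rs2 = (instr >> 16) & 0x1F;    /* bits 20:16 */
    uint32_t rd  = (instr >> 11) & 0x1F;    /* bits 15:11 */
    uint32_t opx = instr & 0x3F;            /* bits 5:0 */
    printf("op=%u rs1=%u rs2=%u rd=%u opx=0x%x\n",
           (unsigned)op, (unsigned)rs1, (unsigned)rs2, (unsigned)rd, (unsigned)opx);
    return 0;
}

This prints op=0 rs1=10 rs2=11 rd=9 opx=0x20.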
Datapath vs Control
[Diagram: a controller drives control points in the datapath; the datapath returns status signals to the controller.]
• Datapath: storage, functional units, and interconnect sufficient to perform the desired functions
– Inputs are control points
– Outputs are signals
• Controller: state machine to orchestrate operation on the datapath
– Based on desired function and signals
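A toy version of such a controller as a state machine in C (an added sketch; the state and signal names are invented, not the slide's):

#include <stdio.h>

/* The controller steps through instruction phases and asserts
   control points each cycle; a datapath would consume them. */
enum state { FETCH, DECODE, EXECUTE, STORE };

static enum state next_state(enum state s) {
    switch (s) {
    case FETCH:   return DECODE;
    case DECODE:  return EXECUTE;
    case EXECUTE: return STORE;
    default:      return FETCH;    /* STORE wraps back to FETCH */
    }
}

int main(void) {
    enum state s = FETCH;
    for (int cycle = 0; cycle < 8; cycle++) {
        /* hypothetical control points derived from the current state */
        printf("cycle %d: state=%d mem_read=%d reg_write=%d\n",
               cycle, (int)s, s == FETCH, s == STORE);
        s = next_state(s);
    }
    return 0;
}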
Simple Pipelining Review
5 Steps of MIPS Datapath
[Diagram: the pipelined datapath with latches IF/ID, ID/EX, EX/MEM, and MEM/WB between stages; PC adder, register file, sign extend, ALU, data memory, and MUXes.]
• Instruction Fetch: IR <= mem[PC]; PC <= PC + 4
• Instr. Decode / Reg. Fetch: A <= Reg[IR.rs]; B <= Reg[IR.rt]
• Execute / Addr. Calc: rslt <= A op(IR.op) B
• Memory Access: WB <= rslt
• Write Back: Reg[IR.rd] <= WB
• Data stationary control
– local decode for each instruction phase / pipeline stage
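The five register transfers can be followed in software; the sketch below (added here, with a hand-assembled encoding assumed to be add r8,r9,r10) walks one instruction through them sequentially. A real pipeline overlaps these phases across instructions:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t mem[16] = { 0x012A4020 };    /* hypothetical add r8,r9,r10 */
    uint32_t reg[32] = { 0 };
    uint32_t pc = 0;
    reg[9] = 5; reg[10] = 7;

    uint32_t ir = mem[pc / 4]; pc += 4;   /* IF:  IR <= mem[PC]; PC <= PC + 4 */
    uint32_t a  = reg[(ir >> 21) & 0x1F]; /* ID:  A <= Reg[IR.rs] */
    uint32_t b  = reg[(ir >> 16) & 0x1F]; /*      B <= Reg[IR.rt] */
    uint32_t rslt = a + b;                /* EX:  rslt <= A op B (add here) */
    uint32_t wb = rslt;                   /* MEM: WB <= rslt (no memory op) */
    reg[(ir >> 11) & 0x1F] = wb;          /* WB:  Reg[IR.rd] <= WB */

    printf("r8 = %u\n", (unsigned)reg[8]);   /* expect 12 */
    return 0;
}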
Visualizing Pipelining
[Diagram: instructions in program order (top to bottom) vs. time (Cycle 1 through Cycle 7); each instruction passes through Ifetch, Reg, ALU, DMem, Reg, offset one cycle from its predecessor.]
Pipelining is not quite that easy!
• Limits to pipelining: Hazards prevent next
instruction from executing during its designated
clock cycle
– Structural hazards: HW cannot support
this combination of instructions (single
person to fold and put clothes away)
– Data hazards: Instruction depends on
result of prior instruction still in the
pipeline (missing sock)
– Control hazards: Caused by delay between
the fetching of instructions and decisions
about changes in control flow (branches
and jumps).
One Memory Port/Structural Hazards
[Diagram: a Load followed by Instr 1-4 in the pipeline. In Cycle 4, the Load's DMem access and Instr 3's Ifetch both need the single memory port: a structural hazard.]
One Memory Port/Structural Hazards
[Diagram: the same sequence resolved by stalling; Instr 3's fetch is delayed one cycle and a bubble propagates down the pipeline.]
• How do you "bubble" the pipe?
Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI) × (Cycle Time_unpipelined / Cycle Time_pipelined)

For simple RISC pipeline, Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) × (Cycle Time_unpipelined / Cycle Time_pipelined)
Example: Dual-port vs. Single-port
• Machine A: dual-ported memory ("Harvard architecture")
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

SpeedUp_A = Pipeline Depth / (1 + 0) × (clock_unpipe / clock_pipe)
          = Pipeline Depth
SpeedUp_B = Pipeline Depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05))
          = (Pipeline Depth / 1.4) × 1.05
          = 0.75 × Pipeline Depth
SpeedUp_A / SpeedUp_B = Pipeline Depth / (0.75 × Pipeline Depth) = 1.33

• Machine A is 1.33 times faster
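The same arithmetic as a tiny C helper (an added sketch; the pipeline depth of 5 is an assumption, and it cancels in the ratio anyway):

#include <stdio.h>

/* speedup = depth / (1 + stall CPI) * (clock_unpipe / clock_pipe) */
static double speedup(double depth, double stall_cpi, double clock_ratio) {
    return depth / (1.0 + stall_cpi) * clock_ratio;
}

int main(void) {
    double depth = 5.0;                    /* assumed pipeline depth */
    double a = speedup(depth, 0.0, 1.00);  /* A: dual port, no load stalls */
    double b = speedup(depth, 0.4, 1.05);  /* B: 40% loads stall 1 cycle, 1.05x clock */
    printf("SpeedUpA=%.2f SpeedUpB=%.2f A/B=%.2f\n", a, b, a / b);
    return 0;
}

This prints A/B = 1.33, matching the slide.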
Data Hazard on R1
[Diagram: the sequence below flowing through the pipeline stages IF, ID/RF, EX, MEM, WB; sub, and, or, and xor all read r1 before add has written it back in WB.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
Three Generic Data Hazards
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature).
This hazard results from an actual need for
communication.
Three Generic Data Hazards
• Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an "output dependence" by compiler writers.
This also results from the reuse of the name "r1".
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
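A compact way to see all three hazard classes together (an added sketch; the three-field instruction record is illustrative, not a real decoder):

#include <stdio.h>

/* Classify dependences between earlier instruction I and later
   instruction J from their destination and source registers. */
struct instr { int dst, src1, src2; };

static void classify(struct instr i, struct instr j) {
    if (j.src1 == i.dst || j.src2 == i.dst)
        printf("RAW: J reads a register that I writes\n");
    if (j.dst == i.src1 || j.dst == i.src2)
        printf("WAR: J writes a register that I reads\n");
    if (j.dst == i.dst)
        printf("WAW: J writes a register that I writes\n");
}

int main(void) {
    struct instr I = { 1, 2, 3 };  /* add r1,r2,r3 */
    struct instr J = { 4, 1, 3 };  /* sub r4,r1,r3 */
    classify(I, J);                /* prints the RAW line only */
    return 0;
}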
Forwarding to Avoid Data Hazard
[Diagram: the same add/sub/and/or/xor sequence; forwarding paths route add's result from the EX/MEM and MEM/WB latches straight to the ALU inputs of the following instructions, avoiding stalls.]
HW Change for Forwarding
[Diagram: the datapath with forwarding hardware; MUXes at the ALU inputs select among the ID/EX register values, the EX/MEM latch, and the MEM/WR latch, with the immediate path and NextPC mux also shown.]
• What circuit detects and resolves this hazard?
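One plausible answer, sketched in C (an added illustration; the signal names and priority follow the usual textbook bypass scheme, not a specific circuit from the slide):

#include <stdio.h>

/* Bypass select for one ALU source: compare the source register against
   the destinations sitting in the EX/MEM and MEM/WR latches. */
enum fwd { FWD_NONE, FWD_EXMEM, FWD_MEMWR };

static enum fwd select_forward(int src_reg,
                               int exmem_dst, int exmem_regwrite,
                               int memwr_dst, int memwr_regwrite) {
    /* Prefer the youngest value: EX/MEM wins over MEM/WR; r0 never forwards. */
    if (exmem_regwrite && exmem_dst != 0 && exmem_dst == src_reg)
        return FWD_EXMEM;
    if (memwr_regwrite && memwr_dst != 0 && memwr_dst == src_reg)
        return FWD_MEMWR;
    return FWD_NONE;
}

int main(void) {
    /* sub r4,r1,r3 right behind add r1,r2,r3: r1 sits in the EX/MEM latch */
    printf("%d\n", (int)select_forward(1, 1, 1, 0, 0));  /* 1 = FWD_EXMEM */
    return 0;
}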