EECS 152 Computer Architecture and Engineering Lec 01


Computer Organization & Design
Weidong Wang (王维东)
[email protected]
College of Information Science & Electronic Engineering
Institute of Information and Communication Engineering
Zhejiang University
Course Information
• Instructor: Weidong WANG
– Email: [email protected]
– Tel(O): 0571-87953170;
– Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306, or email anytime
• TA:
– Mobile, Email:
» Lu Huang黄露, 13516719473/6719473; [email protected]
» Hanqi Shen沈翰祺, 15067115046; [email protected]
» Office Hours: Wednesday & Saturday 14:00-16:30.
» Xindian (High-Tech) Building 308. (SMS or email also works.)
2
Lecture 4
Performance, Energy & Compiler
3
Computer Performance Metrics
• Response Time (latency)
– How long does it take for my job to run?
– How long does it take to execute a job?
– How long must I wait for the database query?
• Throughput
– How many jobs can the machine run at once?
– What is the average execution rate?
– How many queries per minute?
• If we upgrade a machine with a new processor, what do we increase?
• If we add a new machine to the lab what do we increase?
4
Performance = Execution Time
• Elapsed Time
– Counts everything (disk and memory accesses, I/O, etc.)
– A useful number, but often not good for comparison purposes
• E.g., OS & multiprogramming time make it difficult to compare CPUs
• CPU time (CPU = Central Processing Unit = processor)
– Doesn’t count I/O or time spent running other programs
– Can be broken up into system time and user time
• Our focus: user CPU time
– Time spent executing the lines of code that are “in” our program
– Includes arithmetic, memory, and control instructions, …
5
Clock Cycles
• Instead of reporting execution time in seconds, we often
use cycles
• Clock “ticks” indicate when to start activities
• Cycle time = time between ticks = seconds per cycle
• Clock rate (frequency) = cycles per second (1Hz. = 1cycle/sec)
– A 2 GHz clock has a 500 picosecond (ps) cycle time.
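As a quick check of that figure (a worked conversion, not from the slide):

\[
\text{CycleTime} = \frac{1}{\text{ClockRate}} = \frac{1}{2 \times 10^{9}\,\text{Hz}} = 0.5\,\text{ns} = 500\,\text{ps}
\]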
6
Performance and Execution Time
• The program should be something real people care about
– Desktop: MS office, edit, compile
– Server: web, e-commerce, database
– Scientific: physics, weather forecasting
A program is a sequence of commands, written in a computer language, to accomplish a specific goal or solve a specific problem.
7
Measuring Clock Cycles
• Clock cycles/program is not an intuitive or easily determined
value, so
– Clock Cycles = Instructions x Clock Cycles Per Instruction
• Cycles Per Instruction (CPI) used often
– Not to be confused with the Consumer Price Index!
• CPI is an average since the number of cycles per instruction varies
from instruction to instruction
– Average depends on instruction mix, latency of each inst. type etc.
• CPIs can be used to compare two implementations of the same
ISA, but are not useful alone for comparing different ISAs
– An x86 add is different from a MIPS add
8
Using CPI
• Drawing on the previous equation:
ExecutionTime = Instructions x CPI x ClockCycleTime
ExecutionTime = (Instructions x CPI) / ClockRate
• To improve performance (i.e., reduce execution time)
– Increase clock rate (decrease clock cycle time) OR
– Decrease CPI OR
– Reduce the number of instructions
• Designers balance cycle time against the number of cycles
required
– Improving one factor may make the other one worse
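A minimal C sketch of that relationship (the instruction count, CPI, and clock rate below are illustrative values, not measurements from the slides):

    #include <stdio.h>

    int main(void) {
        /* Illustrative values only. */
        double instructions = 1.0e9;   /* dynamic instruction count      */
        double cpi          = 1.5;     /* average cycles per instruction */
        double clock_rate   = 2.0e9;   /* 2 GHz                          */

        /* ExecutionTime = Instructions x CPI / ClockRate */
        double exec_time = instructions * cpi / clock_rate;
        printf("Execution time: %.3f s\n", exec_time);   /* prints 0.750 s */
        return 0;
    }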
9
Clock Rate ≠ Performance
• Mobile Intel Pentium 4 (2.4 GHz) vs. Intel Pentium M (1.6 GHz)
– Is the P4 50% faster?
• Performance on MobileMark with the same memory and disk
– Word, Excel, Photoshop, PowerPoint, etc.
– Mobile Pentium 4 is only 15% faster
• What is the relative CPI?
– ExecTime = IC x CPI / ClockRate
– IC x CPI_M / 1.6 = 1.15 x IC x CPI_4 / 2.4
– CPI_4 / CPI_M = 2.4 / (1.15 x 1.6) = 1.3
10
CPI Calculation
• Different instruction types require different numbers of cycles
• CPI is often reported for types of instructions
ClockCycles = Σ (CPI_i x IC_i), summed over instruction types i = 1 to n
• where CPI_i is the CPI for instruction type i
and IC_i is the count of instructions of that type
11
CPI Calculation
• To compute the overall average CPI use
CPI = ClockCycles / InstructionCount
    = Σ (CPI_i x InstructionCount_i / InstructionCount), summed over i = 1 to n
12
Computing CPI Example
• Given this machine (with the per-class CPI and frequency table shown on the slide), the overall CPI is the sum of CPI x Frequency over the instruction classes
• Average CPI is 0.5 + 0.4 + 0.4 + 0.2 = 1.5
• What fraction of the time for data transfer?
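The per-class CPI/frequency table itself did not survive this transcript; the C sketch below assumes a hypothetical mix chosen so that the per-class products match the 0.5 + 0.4 + 0.4 + 0.2 shown above (ALU 50% at CPI 1, loads 20% at CPI 2, stores 20% at CPI 2, branches 10% at CPI 2):

    #include <stdio.h>

    int main(void) {
        /* Hypothetical instruction classes: frequencies and CPIs are
           illustrative assumptions, chosen so the per-class products
           match the slide (0.5 + 0.4 + 0.4 + 0.2 = 1.5). */
        const double freq[] = { 0.50, 0.20, 0.20, 0.10 };  /* ALU, load, store, branch */
        const double cpi[]  = { 1.0,  2.0,  2.0,  2.0  };

        double avg_cpi = 0.0, xfer_cycles = 0.0;
        for (int i = 0; i < 4; i++) {
            avg_cpi += cpi[i] * freq[i];
            if (i == 1 || i == 2)            /* loads + stores = data transfer */
                xfer_cycles += cpi[i] * freq[i];
        }
        printf("Average CPI: %.2f\n", avg_cpi);                /* 1.50 */
        printf("Data-transfer fraction of time: %.0f%%\n",
               100.0 * xfer_cycles / avg_cpi);                 /* about 53% */
        return 0;
    }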
13
What is the Impact of Displacement-Based Memory Addressing Mode?
• Assume 50% of MIPS loads and stores have a zero
displacement.
14
Speedup加速比
• Speedup allows us to compare different CPUs or optimizations
Speedup = CPUtime_Old / CPUtime_New
• Example
– Original CPU takes 2sec to run a program
– New CPU takes 1.5sec to run a program
– Speedup = 1.33, i.e., a 33% speedup
15
Amdahl's Law (Gene Amdahl, IBM, 1967)
• If an optimization improves a fraction f of execution time by a factor of a
Speedup = T_old / ([(1 - f) + f/a] x T_old) = 1 / [(1 - f) + f/a]
• This formula is known as Amdahl’s Law
• Lessons
– If f -> 100%, then speedup -> a
– If a -> ∞, then speedup -> 1/(1 - f)
• Summary
– Make the common case fast
– Watch out for the non-optimized component
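A small C sketch of the formula (the fraction and improvement factor below are arbitrary example values):

    #include <stdio.h>

    /* Amdahl's Law: speedup when a fraction f of execution time
       is improved by a factor a. */
    static double amdahl_speedup(double f, double a) {
        return 1.0 / ((1.0 - f) + f / a);
    }

    int main(void) {
        /* Example: optimize 80% of the time by a factor of 4. */
        printf("speedup = %.2f\n", amdahl_speedup(0.80, 4.0));  /* 2.50 */
        /* Even an infinite 'a' is capped at 1/(1 - f) = 5.0 here. */
        return 0;
    }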
16
CPI as an analytical tool to guide design
• The machine's CPI (a throughput measure, not latency) combines the per-class cycle counts with the program's instruction mix, and shows where the program spends its time:
CPI = (5 x 30 + 1 x 20 + 2 x 20 + 2 x 10 + 2 x 20) / 100 = 2.7 cycles/instruction
Amdahl’s Law (of Diminishing递减 Returns)
If enhancement "E" makes multiply infinitely fast, but other instructions are unchanged, what is the maximum speedup "S"? (Look at where the program spends its time.)
S = 1 / (post-enhancement % / 100%) = 1 / (48% / 100%) = 2.08
Attributed to Gene Amdahl -- “Amdahl’s Law”
What is the lesson of Amdahl’s Law?
Must enhance computers in a balanced way!
Evaluating Performance
• Performance best determined by running a real application
– Use programs typical of expected workload
• e.g. compiler/editors, scientific applications, graphics, etc.
• Microbenchmarks
– Small programs, synthetic or kernels from larger applications
– Nice for architects and designers
– Can be misleading
• Benchmarks
– Collection of real programs that companies have agreed on
– Components: programs, inputs & outputs, measurement rules, metrics
– Can still be abused
19
Benchmarks
• System Performance Evaluation Cooperative (SPEC)
• Scientific computing: Linpack, SpecOMP, SpecHPC,…
• Embedded benchmarks: EEMBC, Dhrystone, …
• Enterprise computing
– TPC-C, TPC-W, TPC-H
– SpecJbb, SpecSFS, SpecMail, Streams,
• Multiprocessor: PARSEC, SPLASH-2, EEMBC (multicore)
• Other
– 3Dmark, ScienceMark, Winstone, iBench, AquaMark, …
• Watch out: your results will only be as good as your benchmarks
– Make sure you know what the benchmark is designed to measure
– Performance is not the only metric for computing systems
• Cost, power consumption, reliability, real-time performance, …
(The result of 1757 Dhrystones/s measured on a VAX-11/780 is defined as 1 Dhrystone MIPS.)
22
Summarizing Performance
• Combining results from multiple programs into 1 benchmark score
– Sometimes misleading, always controversial,
» … and inevitable
– We all like quoting a single number
• 3 types of means
– Arithmetic: for times
– Harmonic: for rates
– Geometric: for ratios
AM = (1/n) x Σ (Weight_i x Time_i), i = 1 to n
HM = n / Σ (Weight_i / Rate_i), i = 1 to n
GM = ( Π Ratio_i )^(1/n), product over i = 1 to n
23
Using the Means
• Arithmetic mean:
– When you have individual performance scores expressed as times (latency)
• Harmonic mean:
– When you have individual performance scores expressed as rates (throughput)
• Geometric mean:
– Nice property: GM(X/Y) = GM(X)/GM(Y)
– But difficult to relate back to execution times
• Note
– Always look at the results for individual programs
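A minimal C sketch of the three means over an illustrative, equally weighted set of scores (the numbers are made up):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Illustrative, equally weighted scores. */
        double times[]  = { 2.0, 4.0, 8.0 };   /* seconds  (use AM) */
        double rates[]  = { 2.0, 4.0, 8.0 };   /* jobs/sec (use HM) */
        double ratios[] = { 2.0, 4.0, 8.0 };   /* speedups (use GM) */
        int n = 3;

        double am = 0.0, hm_den = 0.0, gm_log = 0.0;
        for (int i = 0; i < n; i++) {
            am     += times[i] / n;          /* arithmetic mean of times        */
            hm_den += 1.0 / rates[i];        /* denominator of harmonic mean    */
            gm_log += log(ratios[i]);        /* sum of logs for geometric mean  */
        }
        printf("AM = %.3f  HM = %.3f  GM = %.3f\n",
               am, n / hm_den, exp(gm_log / n));   /* 4.667, 3.429, 4.000 */
        return 0;
    }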
24
Performance Summary
• Performance is specific to a particular program
– Total execution time is a consistent summary of performance
• For a given architecture performance increases come from:
– Increase in clock rate (without adverse CPI effects)
– Improvements in processor organization that lower CPI
– Compiler enhancements that lower CPI and/or instruction count
– Algorithm/language choices that affect instruction count
• Pitfall: expecting improvement in one aspect of a machine's performance to improve total performance proportionally
25
Break
• Energy
26
Switching Energy: Fundamental Physics
Every logic transition dissipates energy.
[CMOS gate driving a load capacitance C between Vdd and ground]
E(0->1) = (1/2) C Vdd^2
E(1->0) = (1/2) C Vdd^2
Strong result: Independent of technology.
How can we limit switching energy?
(1) Slow down clock (fewer transitions). But we like speed ...
(2) Reduce Vdd. But lowering Vdd lowers the clock speed ...
(3) Fewer circuits. But more transistors can do more work.
(4) Reduce C per node. One reason why we scale processes.
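One common rule of thumb that follows from this is dynamic power P ≈ alpha x C x Vdd^2 x f; a small C sketch with illustrative, made-up values (the capacitance, activity factor, voltage, and frequency are assumptions, not figures from the slide):

    #include <stdio.h>

    int main(void) {
        /* Illustrative values only. */
        double alpha = 0.1;        /* activity factor: fraction of capacitance switching per cycle */
        double c     = 1.0e-9;     /* total switched capacitance, 1 nF */
        double vdd   = 1.0;        /* supply voltage, volts            */
        double f     = 2.0e9;      /* clock frequency, 2 GHz           */

        /* Each transition dissipates energy proportional to C*Vdd^2;
           averaged over time this gives P = alpha * C * Vdd^2 * f.   */
        double p_dyn = alpha * c * vdd * vdd * f;
        printf("Dynamic power: %.2f W\n", p_dyn);   /* 0.20 W */
        return 0;
    }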
Scaling switching energy per gate ...
• Recall process scaling: switching energy per gate has fallen steadily, due to reducing V and C (the length and width of the capacitors decrease, though the plate distance also gets smaller).
• The recent slope is shallower because V is being scaled less aggressively.
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Second Factor: Leakage Currents
Even when a logic gate isn’t switching, it burns power.
Isub: Even when an nFET is off (gate at 0 V), it passes an Ioff leakage current. We can engineer any Ioff we like, but a lower Ioff also results in a lower Ion, and thus a lower clock speed.
Igate: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal.
Intel's current processor designs: leakage vs. switching power. A lot of work was done to get a ratio this good ... 50/50 is common.
Bill Holt, Intel, Hot Chips 17.
Engineering “On” Current at 25 nm ...
• We can increase Ion by raising Vdd and/or lowering Vt.
• [Ids vs. Vgs curve, linear scale: Ion = 1.2 mA at Vdd = 0.7 V, with Vt ≈ 0.25 V; on this scale Ioff looks like 0.]
Plot on a “Log” Scale to See “Off” Current
• We can decrease Ioff by raising Vt, but that lowers Ion.
• [Same Ids vs. Vgs curve on a log scale: Ion = 1.2 mA at Vdd = 0.7 V, Vt ≈ 0.25 V, and Ioff ≈ 10 nA is now visible.]
Device engineers trade speed and power
• We can reduce CV^2 (P_active) by lowering Vdd.
• We can increase speed by raising Vdd and lowering Vt.
• We can reduce leakage (P_standby) by raising Vt.
From: Silicon Device Scaling to the Sub-10-nm Regime
Meikei Ieong, Bruce Doris, Jakub Kedzierski, Ken Rim, Min Yang
Customize processes for product types ...
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Intel: Comparing 2 CPU generations ...
• Find enough tricks, and you can afford to raise Vdd a little so that you can raise the clock speed!
• Clock speed unchanged ... lower Vdd, lower C, but more leakage.
• Design tricks: architecture & circuits.
And so, we can transform this: a single block processing stereo audio at the full clock rate (1/2 of the clocks for "left", 1/2 for "right"), where gate delay is roughly linear with Vdd ...
Into this: two copies of the block, each running at half the clock rate and a lower Vdd (considering CV^2 power only). Ex: top block processes audio channel 1, bottom block processes audio channel 2.
This magic trick brought to you by Cory Hall ...
Multiple Cores for Low Power
Trade hardware for power, on a large scale ...
Cell: the PS3 chip
Cell: a conventional PowerPC CPU with a 512 KB L2 cache + 8 Synergistic Processing Units (SPUs)
One Synergistic Processing Unit (SPU)
256 KB Local Store -- 128 128-bit Registers
SPU issues 2 inst/cycle (in order) to 7 execution units
SPU fills Local Store using DMA to DRAM and network
A “Schmoo” plot for a Cell SPU ...
• The lower Vdd, the less dynamic energy consumption: E(0->1) = E(1->0) = (1/2) C Vdd^2
• The lower Vdd, the longer the minimum clock period, and so the slower the maximum clock frequency.
Clock speed alone doesn’t help E/op ...
• But lowering clock frequency while keeping voltage constant spreads the same amount of work over a longer time, so the chip stays cooler ...
• E(0->1) = E(1->0) = (1/2) C Vdd^2 either way.
Scaling V and f does lower energy/op
• 1 W to get 2.2 GHz performance; 26°C die temp.
• 7 W to reliably get 4.4 GHz performance; 47°C die temp.
• If a program that needs a 4.4 GHz CPU can be recoded to use two 2.2 GHz CPUs ... big win.
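Using the slide's own numbers, and assuming the work splits perfectly across the two cores:

\[
P_{\text{one core @ 4.4 GHz}} = 7\,\text{W}, \qquad
P_{\text{two cores @ 2.2 GHz}} \approx 2 \times 1\,\text{W} = 2\,\text{W}
\quad (\approx 3.5\times \text{ lower power for the same nominal throughput})
\]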
Intel’s dual-core analysis ...
But only if your app(s) can put 2 cores to use!
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
How iPod puts its 2 cores to use ...
Two 80 MHz CPUs. This chip is used in most iPods now, with one CPU doing audio decoding, the other doing photos, etc.
Other Low-Power Techniques
Add “sleep” transistors to logic ...
Example: floating-point unit logic. When running fixed-point instructions, put the logic "to sleep".
+++ When "asleep", leakage power is dramatically reduced.
--- Presence of sleep transistors slows down the clock rate when the logic block is in use.
Intel example: Sleeping cache blocks
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Recall: Most logic on a chip “too fast”
The critical path
Most wires have hundreds of picoseconds to spare.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
Use several supply voltages on a chip ...
Why? We could use the low-power Vdd for logic well off the critical path.
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
On a CPU, where does the heat go?
Half of the power goes to latches (flip-flops). Most of the time, the latches don't change state.
So (gasp) gated clocks are a big win.
But, done with CAD tools in a disciplined way.
From: Bose, Martonosi, Brooks: Sigmetrics-2001 Tutorial
IBM Power 4: How does die heat up?
• 4 dies on a multi-chip module, 2 CPUs per die
• 115 Watts: concentrated in "hot spots"
• [Thermal map: hot spots over the fixed-point units and the cache logic; temperatures shown range from 66.8°C (152°F) to 82°C (179.6°F)]
Break
• Compiler
Translation Hierarchy
• High-level language → Assembly language → Machine language
54
Compiler
Converts high-level language to machine code
[Compiler stages from the figure: lexical analysis, parser, syntax analysis, semantic analysis]
55
Intermediate Representation (IR)
• Three-address code (TAC)

Source:
    j = 2*i + 1;
    if (j >= n)
        j = 2*i + 3;
    return a[j];

TAC:
    t1 = 2 * i
    t2 = t1 + 1
    j = t2
    t3 = j < n
    if t3 goto L0
    t4 = 2 * i
    t5 = t4 + 3
    j = t5
L0: t6 = a[j]
    return t6

• Single assignment: each temporary ti is assigned exactly once
56
Optimizing Compilers
• Provide efficient mapping of program to machine
– Code selection and ordering
– Eliminating minor inefficiencies
– Register allocation
• Don't (usually) improve asymptotic efficiency
– Up to programmer to select best overall algorithm
– Big-O savings are (often) more important than constant factors
• But constant factors also matter
• Optimization types
– Local: inside basic blocks
– Global: across basic blocks e.g. loops
57
Limitations of Optimizing Compilers
• Operate under fundamental constraints
– Improved code has the same output
– Improved code has the same side effects
– Improved code is as correct as the original code
• Most analysis is performed only within procedures
– Whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
– Compiler has difficulty anticipating run-time inputs
– Profile-driven dynamic compilation is becoming more prevalent
– Dynamic compilation (e.g. Java JIT)
• When in doubt, the compiler must be conservative
58
Preview: Code Optimization
• Goal: Improve performance by:
– Removing redundant work
• Unreachable code
• Common-subexpression elimination
• Induction variable elimination
– Using simpler and faster operations
• Dealing with constants in the compiler
• Strength reduction: reformulating expensive operations as less expensive ones
– Managing registers well
– (each of these is examined one by one in what follows)
ExecutionTime = Instructions x CPI x ClockCycleTime
59
Constant Folding
• Detect & combine values that will be constant.
• Respect laws of arithmetic on target machine.
a = 10 * 5 + 6 - b;
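The compiler can evaluate the constant subexpression at translation time; a minimal sketch of the effect (10 * 5 + 6 folds to 56):

    int before(int b) { return 10 * 5 + 6 - b; }   /* as written             */
    int after(int b)  { return 56 - b; }           /* after constant folding */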
60
Constant Propagation
• When x gets assigned a constant c, replace users of
x with uses of c, as long as x remains unchanged.
• Very useful for turning constants into immediates for the code generator.
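The slide's worked example did not survive the transcript; a minimal illustrative sketch of the idea (hypothetical code, not the original example):

    /* Before propagation */
    int f(void) {
        int x = 4;            /* x is assigned a constant and never changed */
        int y = x + 10;
        return y * x;
    }

    /* After propagating x = 4 (folding can then simplify further) */
    int f_prop(void) {
        int y = 4 + 10;
        return y * 4;
    }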
61
Constant Propagation
62
Reduction in Strength
• Some operations are faster (lower CPI) than others:
– Multiplication is slower than addition
– Multiply or divide by 2^n is slower than a shift left or right
– Multiplies take 12 cycles on the MIPS R2000
– ALU instructions take 1 cycle
• Can replace complex instruction with equivalent
simple instructions
• Correctness:正确性
– Overflow behavior preserved?
– Other side-effects of operation? (interrupts)
63
Reduction in Strength
• How to replace multiplication with addition?
• Often arises in for loops & array walks, as in the loop below:
while (i<100) arr[i++] = 0;
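The multiply here is hidden in the indexing: arr[i] means the address arr + i * sizeof(arr[0]). A hedged sketch of what strength reduction effectively does (assuming arr is an int array; this is an illustration, not the slide's own transformation):

    /* Before: arr[i] implies a multiply, base + i * sizeof(int), each iteration */
    void clear_before(int *arr) {
        int i = 0;
        while (i < 100)
            arr[i++] = 0;
    }

    /* After strength reduction: keep i, but advance a byte offset by addition
       instead of recomputing i * sizeof(int) every time around the loop. */
    void clear_after(int *arr) {
        int i = 0;
        char *base = (char *)arr;
        long offset = 0;
        while (i < 100) {
            *(int *)(base + offset) = 0;   /* same element as arr[i]      */
            offset += sizeof(int);         /* the multiply becomes an add */
            i++;
        }
    }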
64
Common Sub-Expression Elimination
65
Common Sub-Expression Elimination
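The worked example on these two slides did not survive the transcript; a small illustrative sketch of the idea (hypothetical code, not the slide's example):

    /* Before: (i * stride) is computed twice */
    int sum_pair(int *a, int i, int stride) {
        return a[i * stride] + a[i * stride + 1];
    }

    /* After common-subexpression elimination: compute it once */
    int sum_pair_cse(int *a, int i, int stride) {
        int t = i * stride;
        return a[t] + a[t + 1];
    }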
66
Induction Variable Elimination
while (i<100) arr [i++] = 0;
• Remember replacing multiply with add in loop?
• Why not get rid of the induction variable i altogether?
• Induction variable is a variable that is changed by a
constant amount on every iteration of a loop.
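Following on from the strength-reduction sketch a few slides back, once the address is updated incrementally, the counter i itself becomes unnecessary; a hedged illustration:

    /* Induction variable i eliminated: iterate directly on a pointer and
       compare against the end address instead of counting up to 100. */
    void clear_no_iv(int *arr) {
        int *p = arr;
        int *end = arr + 100;
        while (p < end)
            *p++ = 0;
    }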
67
Register Allocation & Assignment
• Registers provide the fastest data access in the processor.
• Two terms:
– Which variables belong in registers? (allocation)
– Which register will hold this value? (assignment)
• Once a value is in a register, we'd like not to spill & restore it, so that we can use the same value again.
68
Assembly Code Generation
69
Compiler Performance on Bubble Sort
70
Assembler
• Expands macros and pseudo instructions as well as
converts constants.
• Primary purpose is to produce an object file
– Machine language instructions
– Application data
– Information for memory organization
71
Object File
72
Linker
• Linker combines multiple object modules
– Identify where code/data will be placed in memory
– Resolve code/data cross references
– Produces executable if all references found
• Steps
– Place code and data modules in memory
– Determine the address of data and instruction labels
– Patch both the internal and external references
• Separation between compiler and linker makes standard libraries an efficient solution for maintaining modular code.
73
Loader
• Loader used at run-time
– Reads executable file header for size of text/data segments
– Creates an address space sufficiently large
– Copies program from executable on disk into memory
– Copies arguments to the main program's stack
– Initializes machine registers and sets the stack pointer
– Jumps to start-up routine
– Terminates program when execution completes
74
SPIM System Calls
• SPIM provides a small number of OS “system calls”
Service         Code   Argument                        Result
print_int        1     $a0 = integer
print_float      2     $f12 = float
print_double     3     $f12 = double
print_string     4     $a0 = string
read_int         5                                     Integer in $v0
read_float       6                                     Float in $f0
read_double      7                                     Double in $f0
read_string      8     $a0 = buffer, $a1 = length      Data in buffer
sbrk             9     $a0 = amount                    Address in $v0
exit            10
• Set system call code in $v0 and pass arguments as needed
• See web site for details and examples.
75
MIPS Assembler Directives
• SPIM supports a subset of the MIPS assembler directives
• Some of the directives include
– .asciiz – Store a null-terminated string in memory
– .data – Start of data segment
– .globl – Identify an exported symbol
– .text – Start of text segment
– .word – Store words in memory
76
Procedure Call and Return
• Procedures are required for structured programming
– Aka: functions, methods, subroutines, …
• Implementing procedures in assembly requires several things to be done
– Memory space must be set aside for local variables
– Arguments must be passed in and return values passed out
– Execution must continue after the call
• Procedure Call Steps
– Place parameters in a place where the procedure can access them
– Transfer control to the procedure
– Acquire the storage resources needed for the procedure
– Perform the desired task
– Place the result value in a place where the calling program can access it
– Return control to the point of origin
MIPS Storage Layout
Procedure Activation Record (Frame)
• Each procedure creates an activation record on the stack
Register Assignments
Calling Convention
Caller vs. Callee Saved Registers
Preserved                                Not Preserved
Saved registers ($s0-$s7)                Temporary registers ($t0-$t9)
Stack/frame pointer ($sp, $fp, $gp)      Argument registers ($a0-$a3)
Return address ($ra)                     Return values ($v0-$v1)
• Preserved registers (Callee Save)
– Save register values on stack prior to use
– Restore registers before return
• Not preserved registers (Caller Save)
– Do what you please and expect callees to do likewise
– Should be saved by the caller if needed after procedure call
Call and Return
• Caller
– Save caller-saved registers $a0-$a3, $t0-$t9, if necessary
– Load arguments in $a0-$a3, rest passed on stack
– Execute jal instruction
• Callee Setup
– Allocate memory for new frame ($sp = $sp - frame size)
– Save callee-saved registers $s0-$s7, $fp, $ra, if necessary
– Set frame pointer ($fp = $sp + frame size - 4)
• Callee Return
– Place return value in $v0 and $v1
– Restore any callee-saved registers
– Pop stack ($sp = $sp + frame size)
– Return by jr $ra
Calling Convention Steps
Simple Example
How to solve a complex problem
• Specification: break the problem into simpler steps, specify the goal in executable form
• Algorithm: code it in a high-level language (like C)
• Compiler: translate into the ISA
• Processor: execute instructions as fast as possible
HomeWork
• Readings:
– Read Chapters 3 and 4;
• HW2B
– http://mypage.zju.edu.cn/wdwd/教学工作
– Or ftp://10.13.71.58/ (currently not accessible)
Acknowledgements
• These slides contain material from courses:
– UCB CS152.
– Stanford EE108B
91