Lecture 1: Course Introduction and Review


EEL 5708
High Performance Computer Architecture
Lecture 1
Introduction
August 21, 2006
Lotzi Bölöni
Fall 2006
EEL5708/Bölöni
Lec 1.1
Acknowledgements
• All the lecture slides were adapted from the
slides of David Patterson (1998, 2001) and
David E. Culler (2001), Copyright 1998-2002, University of California Berkeley
Case 1: VIA KT266 chipset for the
Athlon processors
Take 1: April 4, 2001
• Tom’s Hardware (www.tomshardware.com). Web site for
hardware enthusiasts.
• Review of the VIA Apollo KT266 chipset.
• http://www17.tomshardware.com/mainboard/01q2/010409/kt266-10.html
• The website’s conclusion:
KT266 is still way too slow to challenge or even
replace AMD's 760 chipset.
As a conclusion, I could maybe say the typical words
always used in early reviews "let's hope VIA will
finally improve KT266". However, I have my doubts
if this will happen any time soon. My advice to you
is to either forget about DDR altogether for the
time being, or to go for Athlon plus AMD760 and
NOTHING ELSE.
Take 2: One week later…
• Article title: “VIA Apollo KT266 revisited:
Much Ado About Nothing”
(http://www17.tomshardware.com/mainboard/01q2/010416/index.html)
• Another website (www.anandtech.com) obtains
different results.
• An additional resistor (!) mounted on the
motherboard and a different BIOS.
• Tom’s Hardware concludes that there are
indeed improvements, but they are not
significant enough to change the conclusion.
Take 3: Five months later
(September 2001)
• VIA KT266A is launched
• Tom’s Hardware: “’A’ stands for vastly improved
performance”
(http://www17.tomshardware.com/mainboard/01q3/010902/index.html)
• Changes: “improvements” to the memory controller.
• Processor frequency, bus frequency, etc. stay the
same. Pin-by-pin compatible with the predecessors!
• Conclusion:
“The performance of Apollo KT266A is nothing
short of impressive.”
Synthetic benchmarks:
Real world benchmarks
Some conclusions
• “Architecture” matters.
• Real-world benchmarks show less improvement than
synthetic ones: Amdahl’s Law
• Which benchmark do I care about? (this time at
least, they were consistent…)
• …
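The Amdahl's Law point can be made concrete with a small sketch. The fractions and the 1.5x speedup below are hypothetical numbers chosen for illustration, not measurements from the KT266A reviews: a synthetic memory benchmark is assumed to spend most of its time in the improved part, a real-world application much less.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the run time is accelerated."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical: the memory controller makes memory-bound work 1.5x faster.
# Assumed share of run time that is memory-bound: 90% synthetic, 30% real world.
synthetic = amdahl_speedup(0.9, 1.5)
real_world = amdahl_speedup(0.3, 1.5)

print(f"synthetic benchmark speedup:  {synthetic:.2f}x")
print(f"real-world benchmark speedup: {real_world:.2f}x")
```

The same hardware change yields a smaller overall gain whenever the improved part accounts for less of the total run time.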
Case 2: Video compression performance
in Intel Pentium 4 vs. AMD Athlon
Take 1 (11/20/00): First impressions
• Intel Pentium 4 is launched.
• The initial measurements show that it
greatly outperforms the AMD Athlon for
MPEG-4 video compression.
• http://www6.tomshardware.com/cpu/00q4/001120/index.html
Take 1 (11/20/00): First impressions
(cont’d)
Take 2: New results force new
conclusions
• Concerns are raised that the
measurement was done with a low-quality
setting (MMX arithmetic).
• When the measurements are repeated with floating-point
arithmetic, the relative performance
is reversed.
• http://www6.tomshardware.com/cpu/00q4/001122/index.html
Take 2: New results force new
conclusions (cont’d)
Take 3: Intel engineers create an
optimized version of the software
• In response, Intel engineers created a modified
version of the software:
- recompiled it with higher optimizations
- rewrote parts of the code to use the new instruction set
extensions (SSE2)
• The higher optimizations benefited both Intel and
AMD processors (but Intel more).
• The SSE2 options reversed the performance ranking
again.
• Note: AMD engineers created an AMD-optimized
version, too, with significant improvements, but this
did not change the rankings.
Take 3 (cont’d)
Case 2: Conclusions
• Real world benchmark, huge differences
– Why?
• Software solution to a hardware problem?
– Optimizing for the architecture
– So, what if it is not open source?
– Software development cycles…
• Picking the right architecture + understanding the
architecture we have
Review: Measuring performance
Performance measures
• Time to execute a given program
• Number of programs which can be run in
parallel
• Responsiveness (user interfaces)
• Predictable execution time (for real time
systems)
• Energy consumption (mostly for portables,
but check the new Google and Microsoft
data centers…)
• And so on…
Which is faster? (Latency vs throughput)
Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns …
(Performance)
– Throughput, bandwidth
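The two notions of "faster" can be checked with a short sketch using the figures from the plane comparison. The dictionary layout and function name are illustrative choices; throughput is taken as passengers times speed (passenger-miles per hour, pmph), which is how the slide's numbers work out.

```python
# Figures from the plane comparison: trip time (hours), speed (mph), passengers.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times cruising speed."""
    return plane["passengers"] * plane["mph"]


# Latency view: which plane finishes one trip soonest?
lowest_latency = min(planes, key=lambda name: planes[name]["hours"])
# Throughput view: which plane moves the most passenger-miles per hour?
highest_throughput = max(planes, key=lambda name: throughput_pmph(planes[name]))

for name, plane in planes.items():
    print(f"{name}: {plane['hours']} h, {throughput_pmph(plane):,} pmph")
print("lowest latency:", lowest_latency)
print("highest throughput:", highest_throughput)
```

The Concorde wins on latency and the 747 wins on throughput, so the answer to "which is faster?" depends on the metric.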
Definitions
• Performance is in units of things per sec
– bigger is better
• If we are primarily concerned with response time
– $\mathrm{performance}(x) = \dfrac{1}{\mathrm{execution\_time}(x)}$
• "X is n times faster than Y" means
$$n = \frac{\mathrm{Performance}(X)}{\mathrm{Performance}(Y)} = \frac{\mathrm{Execution\_time}(Y)}{\mathrm{Execution\_time}(X)}$$
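A minimal sketch of these definitions, assuming execution times are given in any consistent unit. The function names are illustrative, and the sample values reuse the DC-to-Paris trip times (3 hours and 6.5 hours) purely as example inputs.

```python
def performance(execution_time):
    """Performance as the reciprocal of execution time (bigger is better)."""
    if execution_time <= 0:
        raise ValueError("execution_time must be positive")
    return 1.0 / execution_time


def times_faster(execution_time_x, execution_time_y):
    """Return n such that X is n times faster than Y."""
    return execution_time_y / execution_time_x


# Example inputs: X takes 3 hours, Y takes 6.5 hours.
n = times_faster(3.0, 6.5)
ratio_of_performance = performance(3.0) / performance(6.5)
print(f"X is {n:.2f} times faster than Y")
```

Both ways of computing n agree, since the ratio of performances is the inverse ratio of execution times.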
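Computer Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
(the three factors are the inst count, the CPI, and the cycle time)

               Inst Count   CPI   Clock
Program            X
Compiler           X        (X)
Inst. Set.         X         X
Organization                 X      X
Technology                          X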
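The CPU time product can be evaluated directly. This is a sketch with hypothetical inputs (one billion instructions, a CPI of 1.5, a 1 ns cycle time); the function and parameter names are illustrative.

```python
def cpu_time_seconds(instruction_count, cpi, cycle_time_s):
    """CPU time = instructions/program * cycles/instruction * seconds/cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical program: 1e9 instructions, CPI of 1.5, 1 ns cycle (1 GHz clock).
baseline = cpu_time_seconds(1e9, 1.5, 1e-9)

# Improving any single factor scales the CPU time proportionally,
# provided the other two factors stay the same.
better_cpi = cpu_time_seconds(1e9, 0.75, 1e-9)

print(f"baseline:     {baseline:.3f} s")
print(f"half the CPI: {better_cpi:.3f} s")
```

In practice the factors interact: a change that lowers the instruction count may raise the CPI or the cycle time, so all three have to be considered together.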
Cycles Per Instruction
(Throughput)
“Average Cycles per Instruction”
CPI = (CPU Time * Clock Rate) / Instruction Count
= Cycles / Instruction Count
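$$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j$$
$$\text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where } F_j = \frac{I_j}{\text{Instruction Count}}$$
($F_j$ is the “Instruction Frequency”)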
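As a sketch, the per-class sums can be computed from raw instruction counts and checked against each other. The instruction classes, counts, cycle costs, and cycle time below are hypothetical values chosen only to exercise the formulas.

```python
def summarize(counts, cycles_per_class, cycle_time_s):
    """Return (average CPI, CPU time) from per-class counts I_j and CPI_j."""
    instruction_count = sum(counts.values())
    # F_j = I_j / Instruction Count
    frequencies = {j: counts[j] / instruction_count for j in counts}
    # CPI = sum over j of CPI_j * F_j
    cpi = sum(cycles_per_class[j] * frequencies[j] for j in counts)
    # CPU time = Cycle Time * sum over j of CPI_j * I_j
    cpu_time = cycle_time_s * sum(cycles_per_class[j] * counts[j] for j in counts)
    return cpi, cpu_time


# Hypothetical instruction classes and costs.
counts = {"A": 600, "B": 300, "C": 100}
cycles_per_class = {"A": 1, "B": 2, "C": 4}
cycle_time_s = 1e-9

cpi, cpu_time = summarize(counts, cycles_per_class, cycle_time_s)
total_instructions = sum(counts.values())
print(f"CPI = {cpi:.2f}, CPU time = {cpu_time:.3e} s")
```

The CPU time obtained from the per-class sum equals Instruction Count x CPI x Cycle Time, which is a quick consistency check on the two formulas.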
Example: Calculating CPI bottom up
Base Machine (Reg / Reg), typical mix of instruction types in program

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5
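The bottom-up calculation can be reproduced in a few lines. The frequencies and cycle counts are the ones from the base machine example; the variable names are illustrative.

```python
# Instruction mix of the base machine: op -> (frequency, cycles).
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# Contribution of each class to the average CPI: frequency * cycles.
contribution = {op: freq * cycles for op, (freq, cycles) in mix.items()}
cpi = sum(contribution.values())

# Share of execution time spent in each class: contribution / total CPI.
time_share = {op: c / cpi for op, c in contribution.items()}

for op in mix:
    print(f"{op:<7} CPI(i) = {contribution[op]:.1f}  time = {time_share[op]:.0%}")
print(f"Total CPI = {cpi:.1f}")
```

ALU instructions are half of the mix but only about a third of the execution time, because each one costs fewer cycles than the other classes.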