AMA-L01-Introx

Lecture 1: Introduction
• First 8.5 weeks
– Intro/Review: 2 lectures
– Processor Front-end: 5 lectures
– Execution Core: 4 lectures
– Other topics: 6 lectures
• Next 5.5 weeks
– Processor Case Studies: 11 classes
• Last week of class
– Mini-conference: 2 classes
Course philosophy:
(1) First half: learn details about microarchitecture concepts
(2) Second half: study real designs, applying what we covered in part 1.
• Lectures:
– I’m not taking attendance, but since there’s no textbook,
attendance (and being awake) is incredibly important.
– There will be four homework assignments for this part.
• Supplemental Reading (required):
– “The Pentium Chronicles” by Robert P. Colwell, published
by Wiley-Interscience, ISBN: 0-471-73617-1
– Must complete reading this before the start of case studies
• Case studies:
– Paper reading is mandatory… you cannot participate if you
haven’t read the paper(s)
• Term Project
– Microprocessor/microarchitecture-based project
– Project must be approved
• Mini-Conference
– We will peer-review all projects, similar to how a
conference program committee reviews papers
– Last week of class will be used to hold a mini-conference
where you present your term project
– Food and drink will be provided! :-)
• No Exams, Hooray!
• 4 Homeworks at 5 points each = 20 pts
• 5 TPC reading summaries, 3 pts each = 15 pts
• 11 case-study reading summaries and participation,
3 pts each = 33 pts
• Term project = 32 pts
– Abstract/Proposal: 5 pts
– Mid-project Status: 2 pts
– Write-up: 10 pts
– Reviews (of other people’s projects): 5 pts
– Final Presentation: 10 pts
• If you don’t do the readings, you’re not going to
contribute anything to the discussions, therefore …
– For each case-study session, you must do the reading
before the start of class
– You must also write a brief summary of the readings
– You must submit the summary at the start of class…
The summary is your entrance ticket to class:
If you don’t hand in the summary,
I’m not going to let you enter the classroom!
• What metric to use?
– CPI, IPC, MIPS, FLOPS, polygons/sec, frames/sec, …
• Absolute Runtime
– “How long will it take to run my program?”
– “How long will it take to run my programs?”
• Relative Performance
– “Will my program run faster on an Intel or AMD CPU?”
– “Will my programs run faster on an Intel or AMD CPU?”
– “Will my typical program run faster on Intel or AMD?”
This is the only performance metric that matters (for the uniprocessor world).
Everything else is just a proxy!!!
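• $\text{Runtime} = \text{Total Insts} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
– Total Insts: total work in program; set by algorithms, compilers, ISA extensions
– Cycles/Instruction: CPI or 1/IPC; set by microarchitecture
– Seconds/Cycle: 1/f (clock freq.); set by microarchitecture, process tech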
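The runtime equation can be tried out with a few lines of Python. This is only a sketch: the instruction count, CPI values, and clock frequency below are made-up numbers chosen for illustration, not measurements from any real processor.

```python
def runtime_seconds(total_insts: float, cpi: float, clock_hz: float) -> float:
    """Runtime = Total Insts x (Cycles / Instruction) x (Seconds / Cycle)."""
    seconds_per_cycle = 1.0 / clock_hz
    return total_insts * cpi * seconds_per_cycle


# Hypothetical program: 2 billion instructions on a 2 GHz clock.
baseline = runtime_seconds(2e9, cpi=1.5, clock_hz=2e9)

# Same program and clock, but a hypothetical microarchitecture with CPI = 1.0.
improved = runtime_seconds(2e9, cpi=1.0, clock_hz=2e9)

speedup = baseline / improved
print(f"baseline = {baseline:.2f} s, improved = {improved:.2f} s, speedup = {speedup:.2f}x")
```

Only the CPI term changes here, so the speedup equals the CPI ratio. A compiler change would move the instruction count instead, and a process change would move the clock term.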
• Correct metric depends on the workload
– Single parallel (multi-threaded) application:
• Runtime
– Multiple applications (multi-programmed
workload):
• Typically total system throughput
• Latency/Runtime of a given program not so important
• Fairness and combined fairness/performance metrics
often used.
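The slide does not name specific throughput or fairness metrics. As an assumption, the sketch below uses two that commonly appear in multi-programmed studies: weighted speedup (throughput-oriented) and the harmonic mean of per-program speedups (balances fairness and performance). The IPC numbers are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Throughput-oriented: sum of each program's IPC relative to running alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def harmonic_mean_speedup(ipc_shared, ipc_alone):
    """Fairness/performance: harmonic mean of the per-program relative IPCs."""
    n = len(ipc_shared)
    return n / sum(a / s for s, a in zip(ipc_shared, ipc_alone))


# Hypothetical two-program workload.
ipc_alone = [2.0, 1.0]    # each program running by itself
ipc_shared = [1.0, 0.8]   # program 0 slows to 50%, program 1 to 80%

ws = weighted_speedup(ipc_shared, ipc_alone)
hm = harmonic_mean_speedup(ipc_shared, ipc_alone)
print(f"weighted speedup = {ws:.3f}, harmonic mean speedup = {hm:.3f}")
```

The harmonic mean is pulled toward the program that suffers most, which is why it is used when fairness matters.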
• Which power do you mean?
– Maximum/peak power delivery requirements
• “450W Power Supply”
– Average power delivery requirements
• Battery life
• Electricity bills
• Power to charge/discharge a capacitor
• P = VI
• I = C dV/dt
[Figure: capacitor C charged to voltage V]
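Combining the two relations gives the energy stored when a capacitor is charged from 0 to the supply voltage. The short derivation below is a sketch; the symbol $V_{dd}$ for the final voltage is an assumption made here for clarity.

```latex
\begin{align*}
P(t) &= V(t)\, I(t) = V \, C \, \frac{dV}{dt} \\
E &= \int P \, dt = \int_{0}^{V_{dd}} C \, V \, dV = \tfrac{1}{2} C V_{dd}^{2}
\end{align*}
```

Multiplying this energy per switching event by the number of switching events per second gives a power, which is where the ½CV² term in the dynamic power formula comes from.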
• $P = \tfrac{1}{2} C V^2 f a$
– C: total capacitance switched
– V: power supply voltage
– f: clock frequency
– a: activity factor
• Really, $P = \sum_{i \in \text{all blocks}} P_i = \tfrac{1}{2} f V^2 \sum_{i \in \text{all blocks}} C_i a_i$
• $C_i$ and $a_i$ are hard to determine
– $C_i$ requires detailed circuit design
– $a_i$ depends on dynamic behavior (application specific)
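• Cache Power
– Clock frequency = 2 GHz
– L1 Instruction Cache: C = 1.515 nF, a = 0.88
– L1 Data Cache: C = 0.741 nF, a = 0.6
– L2 Unified Cache: C = 12.7 nF, a = 0.07
– Vdd = 1.5 V
• $P_{IL1} = \tfrac{1}{2} \times 1.515\,\text{nF} \times (1.5\,\text{V})^2 \times 2\,\text{GHz} \times 0.88$
$= \tfrac{1}{2} \times 1.515 \times 10^{-9}\,\text{F} \times 2.25\,\text{V}^2 \times \dfrac{1}{500 \times 10^{-12}\,\text{s}} \times 0.88$
$\approx 3\,\text{F} \cdot \text{V}^2 / \text{s} = 3\,(\text{coulombs}/\text{volt}) \cdot \text{volt}^2 / \text{second}$
$= 3\,\text{coulomb} \cdot \text{volt} / \text{second} = 3\,(\text{Amp} \cdot \text{sec}) \cdot (\text{Watt}/\text{Amp}) / \text{sec}$
$= 3$ Watts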
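• L1 Data Cache: C = 0.741 nF, a = 0.6
• $P_{DL1} = \tfrac{1}{2} \times 0.741\,\text{nF} \times (1.5\,\text{V})^2 \times 2\,\text{GHz} \times 0.6 \approx 1$ Watt
• L2 Unified Cache: C = 12.7 nF, a = 0.07
• $P_{UL2} = \tfrac{1}{2} \times 12.7\,\text{nF} \times (1.5\,\text{V})^2 \times 2\,\text{GHz} \times 0.07 \approx 2$ Watts
• Total Power of All Caches $= P_{IL1} + P_{DL1} + P_{UL2} \approx 3 + 1 + 2 = 6$ Watts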
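The cache numbers can be checked with a short script. It is a sketch that applies P = ½CV²fa per block and sums the results. The capacitances are taken as nanofarads, matching the 1.515e-9 F used in the worked arithmetic; the dictionary keys and function name are just labels chosen here.

```python
def dynamic_power_watts(cap_farads, vdd_volts, freq_hz, activity):
    """Dynamic power of one block: P = 1/2 * C * V^2 * f * a."""
    return 0.5 * cap_farads * vdd_volts ** 2 * freq_hz * activity


VDD = 1.5    # volts
FREQ = 2e9   # 2 GHz

# name -> (capacitance in farads, activity factor), from the cache example.
caches = {
    "IL1": (1.515e-9, 0.88),
    "DL1": (0.741e-9, 0.60),
    "UL2": (12.7e-9, 0.07),
}

power = {name: dynamic_power_watts(c, VDD, FREQ, a) for name, (c, a) in caches.items()}
total = sum(power.values())

for name, watts in power.items():
    print(f"P_{name} = {watts:.2f} W")
print(f"Total = {total:.2f} W")
```

Note that the L2 has by far the largest capacitance but does not dominate, because its activity factor is so low.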
• $P = \tfrac{1}{2} C V^2 f a$
• $f \propto V$
• $P \propto \tfrac{1}{2} C V^2 \cdot V \cdot a$
• $P \propto V^3$
• $\text{Perf} \propto f \propto V$
• Decrease V (a.k.a. voltage/frequency scaling)
– Performance drops linearly
– Power drops cubically!
Rule of thumb: a 3% power reduction corresponds to about a 1% performance drop.
Voltage can be decreased only so far... after that, you can only decrease clock frequency.
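The 3%-for-1% rule of thumb follows directly from the cubic relationship. A minimal sketch, assuming f scales exactly in proportion to V (real parts only approximate this):

```python
def scaled_power_and_perf(voltage_scale: float) -> tuple[float, float]:
    """Relative power and performance after scaling V, with f scaled along with it.

    Assumes f is proportional to V, so power goes as V^3 and performance as V.
    """
    return voltage_scale ** 3, voltage_scale


power, perf = scaled_power_and_perf(0.99)   # lower V (and f) by 1%
power_drop_pct = (1 - power) * 100
perf_drop_pct = (1 - perf) * 100
print(f"power drop = {power_drop_pct:.2f}%, performance drop = {perf_drop_pct:.2f}%")
```

For small changes the cube is nearly linear with slope 3, which is why the rule of thumb works.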
• “Leakage”, “Dark Current”
– Dark current name comes from current measured in
photodetectors when no light is present
• Two Kinds:
– Channel leakage or subthreshold conductance
– Gate leakage
[Figure: transistor with gate, source, and drain; a plot of current against applied gate voltage, with the threshold voltage marked]
• P = positive, N = negative
[Figure: PMOS and NMOS transistors, each with gate, source, and drain; gate voltage levels shown are Vdd and 0 Volts]
[Figure: transistor leakage paths: gate leakage and channel leakage (subthreshold conductance)]
• Gate leakage: $I_{ox} = K_2 W \left(\dfrac{V}{T_{ox}}\right)^2 e^{-\alpha T_{ox} / V}$
– Oxide thickness keeps shrinking (faster transistors)
– Probability of quantum tunneling increases (leakage increases)
• Channel leakage: $I_{sub} = K_1 W e^{-V_{th} / (n V_\theta)} \left(1 - e^{-V / V_\theta}\right)$
– Channel length keeps shrinking (faster transistors)
– Channel resistance decreases (leakage increases)
• Electrons aren’t “here” or “there”
• Location is a probability distribution
• Non-zero probability of being anywhere
[Figure: an electron (e-) at an oxide barrier, shown in two cases: P(Tunnel) << 1, and P(Tunnel) non-negligible]
• ED product (energy * delay)
– Lower is better
• Lower execution latency (i.e., higher performance)
• Lower energy consumption
– Can lead to not-so-great configurations
• Simple CPU → really long execution time, but very low power → lower ED product (may not be acceptable)
• ED² product
– Performance more heavily weighted
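A small numeric comparison shows how the two metrics can disagree. The two designs below are hypothetical and their power and delay numbers are invented purely to illustrate the point; energy is taken as average power times delay.

```python
def energy_delay_metrics(power_watts: float, delay_seconds: float) -> dict:
    """Return energy (J), ED product, and ED^2 product for one design."""
    energy = power_watts * delay_seconds
    return {
        "energy": energy,
        "ED": energy * delay_seconds,
        "ED2": energy * delay_seconds ** 2,
    }


# Hypothetical designs running the same program.
simple = energy_delay_metrics(power_watts=0.5, delay_seconds=10.0)   # slow, very low power
fast = energy_delay_metrics(power_watts=20.0, delay_seconds=2.0)     # fast, power hungry

for name, m in (("simple", simple), ("fast", fast)):
    print(f"{name}: E = {m['energy']:.0f} J, ED = {m['ED']:.0f}, ED2 = {m['ED2']:.0f}")
```

With these numbers the simple CPU wins on ED, while the fast CPU wins on ED² because delay is weighted more heavily.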
• Temperature of the chip determined by
– Power/heat generation rate
– Heat removal
• Given the two, T will settle at a steady state
– Heat flow is function of temperature gradient
– If there’s too much heat, T will increase until gradient large
enough to remove the heat fast enough
– So long as this steady state T is within allowed operating
conditions, everything should work fine
• May have impact on long-term reliability
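One simple way to picture the steady state is a lumped thermal-resistance model, where heat removed is proportional to the temperature difference between chip and ambient. This model, and every number in the sketch below, is an assumption for illustration; the slides only say that heat flow depends on the temperature gradient.

```python
def steady_state_temp_c(power_watts: float, ambient_c: float, r_thermal: float) -> float:
    """Temperature where heat removed, (T - ambient) / r_thermal, equals heat generated."""
    return ambient_c + power_watts * r_thermal


AMBIENT_C = 40.0   # hypothetical ambient temperature, degrees C
R_THERMAL = 0.5    # hypothetical thermal resistance, degrees C per watt
T_MAX_C = 100.0    # hypothetical maximum allowed operating temperature

temps = {p: steady_state_temp_c(p, AMBIENT_C, R_THERMAL) for p in (50.0, 100.0, 150.0)}
for p, t in temps.items():
    status = "ok" if t <= T_MAX_C else "too hot"
    print(f"{p:.0f} W -> {t:.0f} C ({status})")
```

A better heat sink corresponds to a smaller thermal resistance, which lowers the steady-state temperature for the same power.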
• But, leakage is a function of temperature
– $I_{sub} = K_1 W e^{-V_{th} / (n V_\theta)} \left(1 - e^{-V / V_\theta}\right)$
• ↑ Temp leads to ↑ Leakage
• Which burns more power
• Which leads to ↑ Temp, which …
• Positive feedback loop can melt your chip
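The temperature dependence can be made concrete by evaluating the subthreshold leakage formula at two temperatures. This sketch assumes the V-subscript term in the formula is the thermal voltage kT/q, and it treats K1 and the threshold voltage as temperature-independent, which real devices do not satisfy. The threshold voltage and n values are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23            # J/K
ELECTRON_CHARGE = 1.602176634e-19   # C


def thermal_voltage(temp_kelvin: float) -> float:
    """Thermal voltage kT/q in volts (about 26 mV at 300 K)."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def relative_isub(temp_kelvin: float, vth: float = 0.3, vdd: float = 1.5, n: float = 1.5) -> float:
    """I_sub / (K1 * W): the temperature-dependent part of the leakage formula."""
    v_theta = thermal_voltage(temp_kelvin)
    return math.exp(-vth / (n * v_theta)) * (1 - math.exp(-vdd / v_theta))


# Leakage at 100 C relative to leakage at roughly room temperature (300 K).
ratio = relative_isub(373.15) / relative_isub(300.0)
print(f"leakage grows by a factor of about {ratio:.1f}")
```

Under these assumptions a chip that heats from room temperature to 100 C leaks several times more, which is the first step of the feedback loop.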
• Average temperature != local temperature
• Local spots may be hotter
– Leads to “hot spots”
– Temp anywhere cannot
exceed Tjmax (transistors stop
working)
– Possible to have a good global/average temp but still violate Tjmax locally
[Figure: simulated P4 thermals]
[Figure: capacitive coupling between Wire 1 and Wire 2; inductive coupling, where a current change in Wire 1 creates a magnetic field that induces a current in Wire 2]
[Figure: clock cycle time with and without extra noise margin; extra noise margin → decrease in f]
[Figure: water-tank analogy. A flush (Ijohn) causes a pressure drop, so the shower flow falls from Ishower to Ishower - Ijohn.]
[Figure: power distribution network fed from a power supply pin; most nodes sit at 1.5V while one node droops to 1.2V]
Local spikes in power consumption can affect other, far-away blocks, depending on the power distribution network.
[Figure: decoupling or debouncing capacitors (“decaps”) storing charge near a block to help supply current spikes; figure labels: 1.5V, 0.75V, 0.5 mA, 1 mA, 2 mA, up to 3 mA]
• CPU (die) size greatly affects cost
– Current CPUs 1-2 cm²
– Embedded much smaller
• cost and footprint matter in a cell phone or iPod
[Figure: dies on a silicon wafer]
[Figure: manufacturing defects on a wafer cut into 16 dies and on a wafer cut into 4 dies]
• 16 dies: 13/16 working chips, 81.25% yield
• 4 dies: 1/4 working chips, 25.0% yield
Assuming $250 per wafer:
• 52 die, 81.25% yield → 42.25 working parts / wafer → $5.92 per die
• 17 die, 25.0% yield → 4.25 working parts / wafer → $58.82 per die
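The per-die costs follow from dividing the wafer cost by the number of working parts. A minimal sketch using the slide's own numbers; the function name is just a label chosen here.

```python
def cost_per_working_die(wafer_cost: float, dies_per_wafer: int, yield_fraction: float) -> float:
    """Wafer cost spread over only the dies that actually work."""
    working_parts = dies_per_wafer * yield_fraction
    return wafer_cost / working_parts


small_die = cost_per_working_die(250.0, dies_per_wafer=52, yield_fraction=0.8125)
large_die = cost_per_working_die(250.0, dies_per_wafer=17, yield_fraction=0.25)
print(f"small die: ${small_die:.2f}, large die: ${large_die:.2f}")
```

The larger die costs roughly ten times more per working part even though the wafer holds only about three times fewer of them, because yield falls at the same time.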
Yield applies to all sorts of fabrication technologies, not just plain old silicon.
• 20” Display: $600 (In 2009: $400?)
• 30” Display: $1800 (1.5² = 2.25x area, 3x $$$)
As technology matures, yield typically improves, which helps to reduce cost.
Prices from apple.com as of 11/26/2007
• Design time (microarchitecture)
• Implementation time (circuit, layout engineers)
• Validation/Verification (test before fab)
• Debugging (test after fab)
• Repeat…
Impacts Time-to-Market
2x performance / 18 months = 0.893% performance / week
Each week of product delay had better earn you at least 0.9% performance!
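The weekly figure comes from compounding: find the per-week growth rate that doubles performance after 18 months. A short sketch, assuming 52 weeks per year so that 18 months is 78 weeks:

```python
def weekly_growth_pct(factor: float = 2.0, months: float = 18.0) -> float:
    """Compound weekly growth (in percent) that yields `factor` after `months`."""
    weeks = months / 12.0 * 52.0   # 18 months -> 78 weeks
    return (factor ** (1.0 / weeks) - 1.0) * 100.0


pct = weekly_growth_pct()
print(f"{pct:.3f}% performance per week")
```

Dividing 100% by 78 weeks would overstate the rate at about 1.28% per week, since it ignores compounding.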
• Intel Pentium FDIV bug
– Verification/validation should catch this
– It didn’t (last minute optimization, full validation not run)
– Cost: ~ $500M
• Complexity can be costly
• Over half of the design effort is spent on verification
• Some additional direct and indirect costs
• Ex. MMX/SSE
– Costs extra HW, design time, verification, etc.
– Useless without cooperation from application writers
• Intel has a lot of SW people in-house to work on new applications, or
work with 3rd-parties to use new technologies in their applications
• Danger: benefits on new computers, but compatibility issues with older
computers
• Ex. Multi-Core
– Need support from OS vendors and application writers, otherwise no
one can use the extra processors
– Some of the cost shared by others; worthwhile investment for MSFT if
you have to buy Vista for full multi-core support
• Maximize performance
... Within the constraints of
– Peak power, average power, die area, metal layers,
thermals, implementation complexity, verification
complexity, time-to-market, cost to manufacturer (Intel),
cost to OEM (Dell), cost to end-customer (you)
• Huge, multi-variable optimization problem!
– Not all variables are independent
– Not all variables have the same weight
– The same variable may have different weights to different
customers
• Slightly different for different segments
– Laptops: maximize performance and battery life
– Embedded: attain “sufficient” performance and then
maximize battery life
• Your MP3 player only needs to be fast enough to run the MP3
codec; any additional performance provides no end-user benefit and
just costs more/consumes more power
– Server: throughput vs. latency
• In this course, we will be mostly focused on “high-performance” processors (desktop, server)