hspc05 - School of Computing

High Performance Computer
Architecture Challenges
Rajeev Balasubramonian
School of Computing, University of Utah
Dramatic Clock Speed Improvements!!
[Figure: clock speed (in MHz) by year, 1971 to 2003, rising from the 1st Intel processor at 108 kHz to the Intel Pentium 4 at 3.2 GHz]
Clock Speed = Performance ?
• The Intel Pentium4 has a higher clock speed
than the IBM Power4 – does the Pentium4
execute your program faster?
[Figure: timelines for Case 1 and Case 2, marking clock ticks and completing instructions against time]
Performance = Clock Speed x Parallelism
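As a rough illustration of the relation above, the sketch below multiplies clock speed by the number of instructions completed per clock tick. The two chips and their numbers are hypothetical and are not measurements of the Pentium 4 or the Power4.

```python
def performance_mips(clock_mhz, instructions_per_cycle):
    """Millions of instructions completed per second (clock speed x parallelism)."""
    return clock_mhz * instructions_per_cycle


# Hypothetical chips, chosen only to show that the higher clock can lose.
fast_clock = performance_mips(clock_mhz=3000, instructions_per_cycle=0.5)
slow_clock = performance_mips(clock_mhz=1500, instructions_per_cycle=1.5)

print(f"high clock, low parallelism : {fast_clock} MIPS")
print(f"low clock, high parallelism : {slow_clock} MIPS")
```

With these made-up numbers, the chip with half the clock speed finishes more instructions per second.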
What About Parallelism?
[Figure: parallelism (SPECInt / MHz) for Intel, Alpha, and HP processors, 1995 to 2000]
The Basic Pipeline
Consider an automobile assembly line:
[Figure: a 4-stage assembly line (Stage 1 to Stage 4, 1 day each): a new car rolls out every day]
[Figure: the same line split into shorter stages: a new car rolls out every half day]
In each case, it takes 4 days to build a car, but…
More stages → more parallelism and less time
between cars
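The assembly-line arithmetic can be sketched as follows. This assumes the 4 days of work divide evenly across the stages and that the half-day case uses 8 stages; the slide gives only the half-day interval, so the stage count is an inference. The function names are illustrative.

```python
def assembly_line(total_work_days, num_stages):
    """Return (days to build one car, days between finished cars)."""
    stage_time = total_work_days / num_stages
    latency = stage_time * num_stages   # one car still passes through every stage
    interval = stage_time               # a car leaves the last stage every stage_time
    return latency, interval


def cars_finished(days, total_work_days, num_stages):
    """Cars completed after `days`, starting from an empty line."""
    latency, interval = assembly_line(total_work_days, num_stages)
    if days < latency:
        return 0
    return int((days - latency) / interval) + 1


for stages in (4, 8):
    latency, interval = assembly_line(4, stages)
    done = cars_finished(10, 4, stages)
    print(f"{stages} stages: {latency} days per car, "
          f"one car every {interval} days, {done} cars in 10 days")
```

The time to build any single car is unchanged; only the gap between finished cars shrinks.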
What Determines Clock Speed?
• Clock speed is a function of work done in each
stage – in the earlier examples, the clock speeds
were 1 car/day and 2 cars/day
• Similarly, it takes plenty of “work” to execute an
instruction and this work is broken into stages
Execution of a single instruction
A 250 ps stage → 4 GHz clock speed
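The 250 ps to 4 GHz conversion is just the reciprocal of the stage delay. A minimal sketch follows; the 2000 ps total instruction delay is a hypothetical figure used only to show the effect of adding stages, and the sketch ignores latch overhead between stages.

```python
def clock_ghz(stage_delay_ps):
    """Clock frequency in GHz when one stage takes stage_delay_ps picoseconds."""
    return 1000.0 / stage_delay_ps   # 1000 ps = 1 ns, and 1 / (1 ns) = 1 GHz


print(clock_ghz(250))   # the slide's example: a 250 ps stage

# Hypothetical: 2000 ps of work per instruction, split into more and more stages.
TOTAL_WORK_PS = 2000
for stages in (2, 4, 8):
    print(stages, "stages ->", clock_ghz(TOTAL_WORK_PS / stages), "GHz")
```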
Clock Speed Improvements
• Why have we seen such dramatic improvements
in clock speed?
→ work has been broken up into more stages
→ early Intel chips executed work equivalent
to approximately 56 logic gates; today’s
chips execute 12 logic gates worth of work
→ transistors have been becoming faster
→ as technology improves, we can draw
smaller and smaller transistors/gates on a
chip and that improves their speed
(doubles every 5-6 years)
Will these Improvements Continue?
• Transistors will continue to shrink and become
faster for at least 10 more years
• Each pipeline stage is already pretty small –
improvements from this factor will cease
• If clock speed improvements stagnate, should
we turn our focus to parallelism?
Microprocessor Blocks
[Figure: microprocessor block diagram: branch predictor, L1 instruction cache, L2 cache, decode & rename, L1 data cache, issue logic, register file, four ALUs]
Innovations: Branch Predictor
Improve prediction accuracy
by detecting frequent patterns
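One way to picture "detecting frequent patterns" is a table of 2-bit saturating counters indexed by the outcomes of the last few branches. The sketch below is a textbook-style illustration under that assumption, not the predictor of any particular chip; the function name and the taken/not-taken pattern are made up.

```python
def prediction_accuracy(outcomes, history_bits):
    """Fraction of branches predicted correctly by a history-indexed predictor."""
    counters = [1] * (1 << history_bits)   # 2-bit counters, start weakly not-taken
    mask = (1 << history_bits) - 1
    history = 0                            # recent outcomes, newest in the low bit
    correct = 0
    for taken in outcomes:
        predicted_taken = counters[history] >= 2
        if predicted_taken == taken:
            correct += 1
        # Train the counter for this history, saturating at 0 and 3.
        if taken:
            counters[history] = min(3, counters[history] + 1)
        else:
            counters[history] = max(0, counters[history] - 1)
        history = ((history << 1) | int(taken)) & mask
    return correct / len(outcomes)


# A loop-like branch: taken, taken, not taken, repeated 100 times.
pattern = [True, True, False] * 100
no_history = prediction_accuracy(pattern, history_bits=0)
with_history = prediction_accuracy(pattern, history_bits=2)
print(f"single counter: {no_history:.2f}, 2 bits of history: {with_history:.2f}")
```

With two bits of history, each recent-outcome combination is followed by the same result every time, so the predictor is wrong only while warming up.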
Innovations: Out-of-order Issue
Out-of-order issue: if later
instructions do not depend on
earlier ones, execute them first
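A toy model of the idea, assuming one instruction issues per cycle and every register is written only once (as it would be after renaming). The four-instruction program, register names, and latencies are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int = 1


def run(program, out_of_order):
    """Return (issue order, cycle at which the last result is ready)."""
    ready_at = {i.dest: math.inf for i in program}   # results not yet produced
    pending = list(program)
    issue_order = []
    cycle = 0
    while pending:
        for instr in pending:
            # Registers no instruction writes are treated as ready from cycle 0.
            if all(ready_at.get(s, 0) <= cycle for s in instr.srcs):
                ready_at[instr.dest] = cycle + instr.latency
                issue_order.append(instr.name)
                pending.remove(instr)
                break
            if not out_of_order:
                break   # in-order: stall behind the oldest waiting instruction
        cycle += 1
    return issue_order, max(ready_at.values())


program = [
    Instr("load", "r1", ("r0",), latency=3),
    Instr("add", "r2", ("r1", "r0")),    # must wait for the load
    Instr("mul", "r3", ("r4", "r5")),    # independent of the load
    Instr("sub", "r6", ("r3", "r4")),    # depends only on the mul
]
print(run(program, out_of_order=False))
print(run(program, out_of_order=True))
```

In this example the in-order run idles while the load completes, whereas the out-of-order run fills those cycles with the independent multiply and subtract.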
Innovations: Superscalar Architectures
Multiple ALUs: increase
execution bandwidth
Innovations: Data Caches
2K papers on caches: efficient
data layout, stride prefetching
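Stride prefetching can be sketched as watching the address stream for a repeated, constant step and fetching the next address early. The version below is a simplified illustration with a single tracked stream; real prefetchers typically track many streams, and the function name and addresses are made up.

```python
def stride_prefetches(addresses):
    """Addresses a simple stride prefetcher would request ahead of time."""
    prefetches = []
    last_addr = None
    last_stride = None
    for addr in addresses:
        if last_addr is not None:
            stride = addr - last_addr
            # Prefetch only after the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == last_stride:
                prefetches.append(addr + stride)
            last_stride = stride
        last_addr = addr
    return prefetches


# Walking an array of 8-byte elements, then jumping somewhere unrelated.
print(stride_prefetches([100, 108, 116, 124, 500]))
```

The prefetcher needs two matching steps before it trusts the pattern, so the first accesses of a stream still miss.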
Summary
• Historically, computer engineers have focused on
performance
• Performance is a function of clock speed and
parallelism
• As technology improves, clock speeds will
improve, although at a slower rate
• Parallelism has been gradually improving and
much of the low-hanging fruit has been picked
Outline
• Recent Microprocessor History
• Current Trends and Challenges
• Solutions to Handling these Challenges
Trend I : An Opportunity
• Transistors on a chip have been doubling every
two years (Moore’s Law)
• In the past, transistors have been used for
out-of-order logic, large caches, etc…
• In the future, transistors can be employed for
multiple processors on a single chip
Chip Multiprocessors (CMP)
• The IBM Power4 has two processors on a die
• Sun has announced the 8-processor Niagara
[Figure: processors P1, P2, P3, and P4 with an L2 cache on one chip]
The Challenge
• Nearly every chip will have multiple processors,
but where are the threads?
• Some applications will truly benefit – they can be
easily decomposed into threads
• Some applications are inherently sequential – can
we execute speculative threads to speed up these
programs? (open problem!)
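The cost of an inherently sequential portion can be estimated with Amdahl's law, which the slides do not name but which is the standard way to reason about it. The parallel fractions below are hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup over one processor when only part of the work can be threaded."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical applications on an 8-processor chip.
for fraction in (0.0, 0.5, 0.9, 1.0):
    print(f"{fraction:.0%} parallel -> {amdahl_speedup(fraction, 8):.2f}x")
```

An application that is only half parallel gains less than 2x on 8 processors, which is why the sequential programs are the open problem.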
Trend II : Power Consumption
• Power $\propto a f C V^2$, where a is activity factor,
f is frequency, C is capacitance, and V is voltage
• Every new chip has higher frequency, more
transistors (higher C), and slightly lower voltage –
the net result is an increase in power consumption
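To see why the net result is an increase, the proportionality can be evaluated for a new chip relative to its predecessor. The scaling factors below (50% higher frequency, 50% more capacitance, 10% lower voltage) are hypothetical, not data for any real chip generation.

```python
def relative_power(activity=1.0, frequency=1.0, capacitance=1.0, voltage=1.0):
    """Dynamic power relative to a baseline chip, using Power ∝ a * f * C * V^2."""
    return activity * frequency * capacitance * voltage ** 2


# Hypothetical next-generation chip, each factor relative to the old chip.
new_chip = relative_power(frequency=1.5, capacitance=1.5, voltage=0.9)
print(f"new chip draws {new_chip:.2f}x the power of the old one")
```

The voltage term is squared, but a slight reduction in voltage is not enough to offset the growth in frequency and capacitance.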
Scary Slide!
• Power density cannot be allowed to increase at
current rates (Source: Borkar et al., Intel)
Impact of Power Increases
• Well, UtahPower sends you fatter bills every month
• To maintain constant chip temperature, heat
produced on a chip has to be dissipated away –
every additional watt increases cooling cost of a
chip by approximately $4 !!
• If temperature of a chip rises, the power dissipated
also increases (almost exponentially) → a vicious
cycle!
Trend III : Wire Delays
• As technology improves, logic gates shrink →
their speed increases and clock speeds improve
• As logic gates shrink, wires shrink too –
unfortunately, their speed improves only
marginally
• In relative terms, future chips will have fast
transistors/gates and slow wires
Computation is cheap, communication is expensive!
Impact of Wire Delays
• Crossing the chip used to take one cycle
• In the future, crossing the chip can take up to 30
cycles
• Many structures on a chip are wire-constrained
(register file, cache) – their access times slow
down → throughput decreases as instructions
sit around waiting for values
• Long wires also consume power
Trend IV : Soft Errors
• High energy particles constantly collide with
objects and deposit charge
• Transistors are becoming smaller and on-chip
voltages are being lowered → it doesn’t take much
to toggle the state of the transistor
• The frequency of this occurrence is projected to
increase by nine orders of magnitude over a 20
year period
Impact of Soft Errors
• When a particle strike occurs, the component is
not rendered permanently faulty – only the value
it contains is erroneous
• Hence, this is termed a transient fault or soft error
• The error propagates when other instructions read
this faulty value
• This is already a problem for mission-critical apps
(space, defense, highly-available servers) and may
soon be a problem in other domains
Summary of Trends
• More transistors, more processors on a single chip
• High power consumption
• Long wire delays
• Frequent soft errors
We are attempting to exploit transistors to increase
parallelism – in light of the above challenges, we’d
be happy to even preserve parallelism
Transistors & Wire Delays
• Bring in a large window of instructions so you
can find high parallelism
• Distribute instructions across processors so that
communication is minimized
[Figure: instructions mapped onto processors]
Difficult Branches
• Mispredicted branches result in poor parallelism
and wasted work (power)
• Solution: when you arrive at a fork, take both
directions – execute on low frequency units to
control power dissipation levels
[Figure: instructions mapped onto processors]
Thermal Emergencies
• Heterogeneous units allow you to reduce cooling
costs
• If a chip’s peak power is 110W, allow enough
cooling to handle 100W average – save $40/chip!
• If the application starts consuming more than
100W and temperature starts to rise, start
favoring the low power processor cores –
intelligent management allows you to make
forward progress even in a thermal emergency
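The $40 figure follows from the roughly $4 per watt cooling cost quoted earlier. The sketch below reproduces that arithmetic and adds a deliberately simple steering rule; the rule and the core labels are illustrative assumptions, not the policy proposed in the talk.

```python
COOLING_COST_PER_WATT = 4   # dollars per watt, the approximate figure from the slides


def cooling_savings(peak_watts, provisioned_watts):
    """Dollars saved per chip by cooling for less than peak power."""
    return (peak_watts - provisioned_watts) * COOLING_COST_PER_WATT


def preferred_core(current_watts, cooling_budget_watts):
    """Hypothetical rule: favor the low-power cores once over the cooling budget."""
    if current_watts > cooling_budget_watts:
        return "low-power"
    return "high-performance"


print(cooling_savings(110, 100))   # 10 W less cooling at $4 per watt
print(preferred_core(95, 100))
print(preferred_core(108, 100))
```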
Handling Long Wire Delays
• Wires can be designed to have different properties
• Knob 1: wire width and spacing: fat wires are
faster, but have low bandwidth
Handling Wire Capacitance
• Knob 2: wires have repeaters/buffers – many,
large buffers → low delay, high power consumption
Mapping Data to Wires
• We can optimize wires for delay, bandwidth, power
• Different data transfers on a chip have different
latency and bandwidth needs – an intelligent
mapping of data to wires can improve performance
and lower power consumption
Handling Soft Errors
• Errors can be detected and corrected by providing
redundancy – execute two copies of a program
(perhaps, on a CMP) and compare results
• Note that this doubles power consumption!
[Figure: leading thread and trailing thread]
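A software analogy of the two-copy scheme: compute the result twice, compare, and re-execute on a mismatch. The bit flip stands in for a particle strike, and the recovery step (re-execution) is an assumption, since comparing two copies only detects the error. All names are illustrative.

```python
def checked_run(compute, value, flip_bit=None):
    """Run two copies of `compute`, compare, and re-execute if they disagree.

    Returns (result, fault_detected). `flip_bit` simulates a soft error by
    toggling one bit of the leading copy's integer result.
    """
    leading = compute(value)
    if flip_bit is not None:
        leading ^= 1 << flip_bit     # transient fault: the value is wrong, the hardware is fine
    trailing = compute(value)
    if leading == trailing:
        return leading, False
    # Mismatch: we cannot tell which copy is wrong, so run the work again.
    return compute(value), True


def square_plus_one(v):
    return v * v + 1


print(checked_run(square_plus_one, 12))
print(checked_run(square_plus_one, 12, flip_bit=3))
```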
Handling Soft Errors
• Trailing thread is capable of higher performance
than leading thread – but there’s no point catching
up – hence, artificially slow the trailing thread by
lowering its frequency → lower power dissipation
[Figure: leading thread (peak throughput: 1 BIPS) and trailing thread (peak throughput: 2 BIPS)]
Trailing thread never
fetches data from
memory and never
guesses at branches
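Using the 1 BIPS and 2 BIPS figures, a back-of-the-envelope estimate of how far the trailing thread can be slowed. This assumes throughput scales linearly with frequency and that voltage stays fixed, so dynamic power falls in proportion to frequency; both are simplifying assumptions.

```python
def trailing_frequency_fraction(leading_bips, trailing_peak_bips):
    """Fraction of full frequency the trailing thread needs to keep pace."""
    return min(1.0, leading_bips / trailing_peak_bips)


fraction = trailing_frequency_fraction(leading_bips=1.0, trailing_peak_bips=2.0)
print(f"run the trailing thread at {fraction:.0%} of full frequency")
print(f"its dynamic power drops to about {fraction:.0%} at fixed voltage")
```

If the lower frequency also permits a lower voltage, the saving would be larger because power depends on the square of voltage.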
Summary of Solutions
• Heterogeneous wires and processors
• Instructions and data have different needs: map
them to appropriate wires and processors
• Note how these solutions target multiple issues
simultaneously: slow wires, many transistors,
soft errors, power/thermal emergencies
Conclusions
• Performance has improved because of clock
speed and parallelism advances
• Clock speed improvements will continue at a
slower rate
• Parallelism is on a downward trend because of
technology trends and because much of the
low-hanging fruit has been picked
• We must find creative ways to preserve or even
improve parallelism in the future