Chip Architectures: Design Rationale

By Joe Peric
Rationale
Performance: We don’t want slow computers
Cost: They have to be affordable
Money: How much cash to design and test?
Time: How much time do we have?
Target Market: Who will want it?
Competition: What do we do to beat the
competition, and what are they doing?
Problems in Delivering Solutions
Clock frequency
Core efficiency
Processor complexity
Power and heat
Old mindsets in new times
Communicating with the rest of the computer
How do we build parallel computers?
Clock frequency
Increasing clock frequency decreases latency and increases throughput, up to memory and power limits
It is the simplest way to increase processor
performance
Essentially the “heartbeat” of a processor
How it is viewed by different parties
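As a back-of-the-envelope sketch (fixed instruction count and CPI assumed; all numbers hypothetical), the classic performance equation shows why the clock is the simplest lever:

```python
def cpu_time_seconds(instructions, cpi, clock_hz):
    """Classic performance equation: time = instructions * CPI / frequency."""
    return instructions * cpi / clock_hz

# Hypothetical workload: 1 billion instructions at CPI = 1.5
base = cpu_time_seconds(1_000_000_000, 1.5, 2_000_000_000)  # 2 GHz -> 0.75 s
fast = cpu_time_seconds(1_000_000_000, 1.5, 3_000_000_000)  # 3 GHz -> 0.5 s
print(base, fast)  # a 1.5x clock gives a 1.5x speedup, if CPI holds steady
```

In practice CPI rarely holds steady as the clock rises, since memory latency does not scale with it.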
Core Efficiency
More core efficiency = less wastefulness
Increasing core efficiency leads to more
complex cores
The increasing complexity could potentially
make the cores less efficient
“Useful” execution per unit time is a decent measure of efficiency
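A sketch of that efficiency measure, with made-up numbers for a hypothetical 4-wide core:

```python
def useful_ipc(useful_instructions, total_cycles):
    """'Useful' execution per unit time: committed instructions per cycle."""
    return useful_instructions / total_cycles

# Hypothetical 4-wide core: stalls and mispredictions waste most issue slots
peak_ipc = 4.0
achieved = useful_ipc(1_200, 1_000)  # 1.2 useful instructions per cycle
print(achieved / peak_ipc)           # 0.3 -> 30% of peak; the rest is waste
```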
Complexity
CPUs were originally CISC, and were
complicated
Hardware was expensive
RISC was born to make hardware cheap by
decreasing complexity
Today’s situation: CISC/RISC hybrid with
astronomical complexity
Old CPU
And now…
Complexity
Complexity makes it harder to design, build, and test new chips
Steeper learning curves
Larger CPUs
Slower, more expensive, longer time to reach
market
Things have always been moving this way but
may change in the near future
CISC
Complex instruction set computing
CPU can handle large range of instructions
New instructions are added over time to new
chip designs when compilers demand them
Chips got expensive
Many instructions were rarely used, but were still implemented in the chips
Chips got larger, and more complex
RISC
Reduced instruction set computing
CISC costs increased with complexity
RISC was to be a cheap hardware alternative
Idea took Amdahl’s law to heart
Increased compiler costs
RISC computers became more CISC-like over time
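The Amdahl's-law reasoning behind RISC, make the common case fast, can be sketched as follows (the fractions and factors are hypothetical):

```python
def amdahl_speedup(fraction, factor):
    """Amdahl's law: overall speedup when `fraction` of time is sped up `factor`x."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Speed up the simple, common instructions (90% of time) by 2x: big win
print(amdahl_speedup(0.9, 2))   # ~1.82x overall
# Speed up rare, complex instructions (10% of time) by 10x: tiny win
print(amdahl_speedup(0.1, 10))  # ~1.10x overall
```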
VLIW
Very long instruction word
Designed to remedy the problem of RISC
mutating into CISC
Puts extreme emphasis on compiler
development and scheduling of code
VLIW executes “words” made up of multiple
RISC-like instructions
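A minimal sketch of such a word, for a hypothetical 3-slot machine; the compiler, not the hardware, chooses what runs together:

```python
# Hypothetical 3-slot VLIW: the compiler statically packs independent
# RISC-like operations into one fixed-width instruction "word".
BUNDLE_WIDTH = 3

def pack_bundles(ops, width=BUNDLE_WIDTH):
    """Group an already-scheduled op list into words, padding with nops."""
    words = []
    for i in range(0, len(ops), width):
        word = ops[i:i + width]
        word += ["nop"] * (width - len(word))  # empty slots still occupy issue width
        words.append(word)
    return words

print(pack_bundles(["add r1,r2", "mul r3,r4", "ld r5,[r6]", "sub r7,r8"]))
# [['add r1,r2', 'mul r3,r4', 'ld r5,[r6]'], ['sub r7,r8', 'nop', 'nop']]
```

A real VLIW compiler must also prove the ops in a word are independent; this sketch assumes scheduling already did that.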
VLIW
VLIW didn’t catch on much
Software development too long and slow
Instruction level parallelism is already achieved
through current hardware
Extracting ILP in software just makes software more expensive
VLIW Examples
Intel’s Itanium
New instruction set (IA-64)
Relied heavily on compiler optimization for an
ISA that hadn’t existed before
Huge investment and startup costs
Itanium chips were/are expensive and provide
little benefit over standard, well established
solutions
VLIW Examples
Transmeta Crusoe
Had own, brand new native ISA
Had code-morphing software that translated machine code into Crusoe ISA code in real time
Code-morphing made this machine ISA
independent, but very slow
VLIW Examples
Power and Heat
Far less of an issue ten years ago than it is today
CPUs draw much more power than ever
They produce more heat than ever
How to give CPU power
How to cool CPU
Power to cool CPU
Power to cool cooling
Power and Heat
Thermal density
http://bwrc.eecs.berkeley.edu/classes/icdesign/ee241_s05/Lectures/Lecture20_Thermal.pdf
Power and Heat
Cost to wire more pins to deliver more power
Most pins dedicated to this
Power and Heat
What is the problem with heat?
How to remove it
How to prevent it to begin with?
Clock frequency, manufacturing processes
Other “tricks” to decrease power usage and heat
output
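One such trick, dynamic voltage and frequency scaling, falls straight out of the approximate dynamic-power relation; a sketch with made-up chip numbers:

```python
def dynamic_power(c_eff, volts, hertz, activity=1.0):
    """Approximate dynamic CMOS power: P ~ activity * C * V^2 * f."""
    return activity * c_eff * volts * volts * hertz

# Made-up chip: drop from 1.2 V / 3 GHz to 1.0 V / 2 GHz
base = dynamic_power(1e-9, 1.2, 3e9)  # 4.32 W
dvfs = dynamic_power(1e-9, 1.0, 2e9)  # 2.0 W
print(dvfs / base)                    # ~0.46: less than half the power
```

Because voltage enters squared, a modest voltage drop saves more power than the same proportional frequency drop.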
Engineer Mentality
Loving an idea
Market is the plaintiff, judge, jury and
executioner. Many fail to realize this
Brand new ideas are ok but tend to fail
Old ideas were ok but are outdated and
inapplicable
Slightly new but not radical ideas are best
Engineer Mentality
What has the market led us to?
x86 architecture
26 years of backwards compatibility in a slow,
inefficient monster that stole ideas from most
other design philosophies
Started in original IBM PC; now it’s everywhere
Radical new designs keep losing to this beast
Engineer mentality must, disappointingly, be constrained by the real world and the market
Solutions that worked in the past
Increase clock frequency
Increase width
Make memory faster, more integrated, more
available
OoO execution
Prediction
Deeper, wider execution pipelines
Problems with old solutions
The past few years have seen diminishing returns from microarchitectural enhancements
ILP and clock-frequency increases raise engineering complexity while providing less and less benefit each year
Heat and power
Cost
Time to market
The rest of the computer
Processor is useless if it is outside the socket
How does it communicate with other CPUs,
memory, the rest of the computer?
Many different approaches by different
companies as to how it is done
Xeon’s bus
Essentially same design as the original Pentium’s
bus
Has increased in frequency and bus width
A few electrical improvements
Heavily standardized technology; Intel is so large that it cannot change its base technologies quickly
Congested design
Xeon’s bus
Intel’s solution to multi-CPU connectivity
HyperTransport
Newer than Intel’s interconnect approach
So far, more promising
Open for use by anyone in consortium
Tries to simplify interconnections, reduce cost,
and spread bandwidth over the system to avoid
concentrated congestion
HyperTransport diagram
Power4/5 Interconnect
FBC: Fabric Bus Controller
The FBC is like the interconnect chipset found in Xeon servers, but it lives in the CPU itself
The FBC handles memory transactions and communication with other CPUs via their FBCs
Power4/5 Interconnect
Multiprocessing
Connect multiple computers together with their
own CPUs
Beowulf clusters
To some extent the Internet
A LAN
Slow connections between CPUs
Multiprocessing
To remedy slow, time-consuming communication, processors are put on the same circuit board
Communication much faster between processors
and applications
Can be expensive if you want many CPUs to be
connected in this fashion
Multiprocessing
4 sockets on one motherboard:
Multiprocessing
If the communication is still too slow and takes
too long, just stick the CPUs right next to each
other with no obstruction in between
Multi-core solutions
Quick communication between threads
External comms must now accommodate 2 CPUs
Twice the space for one CPU
If you buy two CPUs, you’re stuck with both
Multiprocessing
The current trend in computer architecture
Single cores have not been doubling in
performance every 18 months; in the past 5
years it is closer to every 36 months (or more)
If you want to go faster, don’t wait 18 months
to double your speed. Go parallel instead
Design one core, copy & paste
Multiple cores can perform load balancing to
help keep cool
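The 18-month versus 36-month doubling claim above translates into annualized growth like this (a quick sketch):

```python
def annual_growth(doubling_months):
    """Annualized growth rate implied by a given performance-doubling period."""
    return 2 ** (12 / doubling_months) - 1

print(round(annual_growth(18), 2))  # 0.59 -> ~59% per year, the classic pace
print(round(annual_growth(36), 2))  # 0.26 -> ~26% per year, the recent pace
```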
Multiprocessing
Dual Core
High-speed bus on the CPU itself
Cores can talk to each other’s cache memory much more quickly
Multiprocessing
SMT: Simultaneous multithreading
Allows each CPU core to execute multiple
tasks/threads simultaneously instead of
sequentially
“Hyperthreading” is Intel’s implementation of
SMT on their CPUs
Threads communicate directly through the core’s shared resources, not over an external high-speed interconnect
Multiprocessing
SMT increases CPU efficiency
One CPU pretending to be two CPUs is actually
effective
Two threads on a single core not as fast as two
threads on separate cores
Two threads on one core must fight for / share
execution resources
SMT is actually real multitasking
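A toy model (the utilization figure is assumed, not measured) of why one core pretending to be two helps, but less than two real cores:

```python
def smt_pair_ipc(single_thread_ipc, utilization=0.6):
    """Toy model: a lone thread uses `utilization` of the core's issue slots;
    a second SMT thread fills idle slots, but the pair can never exceed
    2x one thread or the issue slots the core actually has."""
    return min(2 * single_thread_ipc, single_thread_ipc / utilization)

one_thread = 1.2                 # assumed IPC of a lone thread
pair = smt_pair_ipc(one_thread)  # 2.0: better than 1.2, short of 2.4
print(one_thread, pair)
```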
Microprocessors
Intel’s Xeon
AMD’s Opteron
Sun’s Niagara
IBM’s Cell
Intel’s Terascale
Xeon
Intel’s workstation/server CPU
Originally started as Pentium Pro
Lucrative market
Has always had weak comms buses
Add plenty of on-chip memory to alleviate
problem
Xeons are Pentiums given features to work as
server chips
Xeon Pictures
The Pentium Pro
- Only x86 CPU made for servers, then
moved to desktop
The current Xeon
- Fast CPUs, but on interconnect architecture
similar to Pentium Pro
Xeon Pictures
Die photo
Opteron
Introduced in 2003
Designed primarily as a server CPU
Can have up to 4 external communication ports: 3 for HyperTransport, 1 for memory
128 KB L1 and 1,024 KB L2 Cache
106 million transistors
Opteron’s comms.
Opteron die photo
Niagara
New design by Sun
Contains 8 cores on one CPU
Each core can execute 4 threads simultaneously
32 threads simultaneously on one chip
Very simple cores
Focus on throughput
Very weak performance on single threaded code
High bandwidth within CPU and to memory
No SMP support, meant for small systems
Niagara’s target
Cell
9 cores; one general purpose, 8 “SPEs”
SPE: Synergistic Processing Element
Each SPE has powerful math units
Coined as “supercomputer on a chip”
Used in a few servers/supercomputers
Used in Playstation 3
EIB (Element Interconnect Bus): one bus connecting the 9 cores
Cell die photo
Terascale
Proof-of-Concept design
The chip itself is toy-like with respect to real
power and ISA
Each core has a router on it
Allows seamless addition of cores to the CPU
Very cheap for design
Not very effective for performance
Terascale die photo
Terascale
Each core can communicate to immediate
neighbour on each of 4 sides
Example chip had 80 cores
Cores not in use decrease power consumption
If area of CPU gets too hot, work done in that
area of CPU is passed to other cores
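The neighbour-to-neighbour scheme means messages hop through the grid; a sketch, assuming the 80-core example chip is laid out as an 8 x 10 mesh:

```python
def mesh_hops(src, dst):
    """Hops between cores when each core talks only to its 4 neighbours."""
    (r1, c1), (r2, c2) = src, dst
    return abs(r1 - r2) + abs(c1 - c2)

# Cores addressed as (row, column) in an assumed 8 x 10 grid
print(mesh_hops((0, 0), (7, 9)))  # 16 hops corner to corner, the worst case
print(mesh_hops((3, 4), (3, 5)))  # 1 hop to an immediate neighbour
```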
The Future
Old methods of improving performance are no
longer as fruitful as they used to be
Systems are developing more integration, as
fewer chips are needed in a computer to
perform the same functions
Parallelism at the instruction level seems to be fully exploited by compilers and hardware
CPUs are dealing with thermal density problems
The Future
Moore’s law is holding and now transistor
budgets are becoming more relaxed
Cores have become ridiculously complicated
We are now seeing limits to sequential
computing at the hardware level
Single-core performance does not look promising going forward
Where does this lead us?
The Future
Parallel is promising for performance
One simple core can be copied many times
Most new designs have parallelism in mind
already
Software is taking its sweet time to catch up
Programmers need software to help them
parallelize their programs!
OSes need better scheduling and allocation
algorithms!
The Future
Most current compute-intensive programs and
algorithms can be parallelized
Uses: media processing is embarrassingly parallel and obtains a near-linear performance increase with more SMT and cores
Pushing programmers to write parallel code could lead to better programs, or programs that scale with better hardware!
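A sketch of why such workloads parallelize so well: each frame is independent, so work can be split with no communication between workers (the per-frame filter is hypothetical):

```python
def process_frame(frame):
    """Stand-in for per-frame media work (hypothetical brightness doubling)."""
    return [pixel * 2 for pixel in frame]

def split_work(frames, workers):
    """Deal frames round-robin; each worker's share needs no coordination."""
    return [frames[w::workers] for w in range(workers)]

frames = [[1, 2], [3, 4], [5, 6], [7, 8]]
shards = split_work(frames, 2)  # two fully independent work lists
results = [process_frame(f) for shard in shards for f in shard]
print(shards, results)
```

Because no frame depends on another, doubling the workers roughly halves the time, which is the near-linear scaling described above.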
References
http://www.aceshardware.com/
http://www.anandtech.com/
http://www.sandpile.org/
http://www.chip-architect.com/
http://www.intel.com/
http://www.transmeta.com/pdfs/techdocs/efficeon_tm8600_prod_brief.pdf
http://www.amd.com/usen/assets/content_type/white_papers_and_tech_docs/31411.pdf
http://www.arstechnica.com/