
CS 15-447: Computer Architecture
Lecture 26
Emerging Architectures
November 19, 2008
Nael Abu-Ghazaleh
[email protected]
http://www.qatar.cmu.edu/~msakr/15447-f08
Last Time: Buses and I/O
[Diagram: a bus as a bundle of control lines and data lines.]
• Buses: Bunch of wires
• Shared Interconnect: multiple “devices” connect
to the same bus
• Versatile: new devices can connect (even ones
we didn’t know existed when bus was designed)
• Can become a bottleneck
– Shorter->faster; fewer devices->faster
• Have to:
– Define the protocol to make devices communicate
– Come up with an arbitration mechanism
Types of Buses
[Diagram: a processor-memory bus links the processor and memory; bus adaptors connect it to a backplane bus and to I/O buses.]
• System bus
– Connects processor and memory
– Short, fast, synchronous, design specific
• I/O bus
– Usually longer and slower; industry standard
– Needs to match a wide range of I/O devices
– Connects to the processor-memory bus or backplane bus
Bus “Mechanics”
• Master and slave roles
• Have to define how we handshake
– Depends on whether the bus is synchronous or asynchronous
• Bus arbitration protocol
– Contention vs. reservation; centralized vs. distributed
• I/O model
– Programmed I/O; interrupt-driven I/O; DMA (a polling sketch follows this list)
• Increasing performance (mainly bandwidth)
– Shorter; closer; wider
– Block transfers (instead of byte transfers)
– Split transaction buses
– …
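To make the I/O models above concrete, here is a minimal sketch of programmed (polled) I/O in C. The device addresses, the register layout, and the DEV_READY bit are all hypothetical, invented for illustration:

    #include <stdint.h>

    /* Hypothetical memory-mapped device registers (addresses are made up). */
    #define DEV_STATUS ((volatile uint32_t *)0x80000000u)
    #define DEV_DATA   ((volatile uint32_t *)0x80000004u)
    #define DEV_READY  0x1u   /* assumed "data available" status bit */

    uint32_t read_word_polled(void)
    {
        /* Programmed I/O: the CPU busy-waits on the status register,
           burning cycles that interrupts or DMA would give back. */
        while ((*DEV_STATUS & DEV_READY) == 0)
            ;                  /* spin until the device has data */
        return *DEV_DATA;      /* then read the word over the bus */
    }

Interrupt-driven I/O replaces the spin loop with a handler the device invokes when it is ready; DMA goes further and lets the device move whole blocks into memory without the CPU touching each word.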
Today—Emerging Architectures
• We are at an interesting point in computer
architecture evolution
• What is emerging and why is it emerging?
Uniprocessor Performance (SPECint)
[Figure: SPECint performance relative to the VAX-11/780, 1978 to 2006, log scale. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006. Performance grows at 25%/year, then 52%/year, then ??%/year after 2002, leaving today's processors roughly 3X below the old trend line.]
⇒ Sea change in chip design—what is emerging?
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
How did we get there?
• First, what allowed the ridiculous 52% improvement per
year to continue for around 20 years?
– If cars had improved as much, we would have million-km/h cars! (a quick check follows below)
• Is it just the number of transistors/clock rate?
• No! It's also all the stuff that we've been learning about!
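A quick sanity check on the car analogy (the ~230 km/h baseline for a fast 1980s car is my assumption): 20 years of 52%/year compounding gives

\[
1.52^{20} \approx 4300, \qquad 230~\text{km/h} \times 4300 \approx 10^{6}~\text{km/h}.
\]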
Walk down memory lane
• What was the first processor organization we
looked at?
– Single cycle processors
• How did multi-cycle processors improve those?
• What did we do after that to improve
performance?
– Pipelining; why does that help? What are the limitations? (see the speedup formula after this list)
• From there we discussed superscalar
architectures
– Out of order execution; multiple ALUs
– This is basically state of the art in uniprocessors
– What gave us problems there?
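One standard way to answer the pipelining question above: with k unit-time stages and n instructions, and crucially no hazards assumed, the speedup over an unpipelined design is

\[
S(n,k) \;=\; \frac{nk}{k + n - 1} \;\longrightarrow\; k \quad \text{as } n \to \infty .
\]

The limitations the question asks about are exactly what breaks the assumption: structural, data, and control hazards insert stalls, and the slowest stage sets the clock.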
Detour: couple of other design points
• Very Large Instruction Word (VLIW) architectures: let the compiler do the work
• Great for energy efficiency: less hardware spent extracting Instruction Level Parallelism
• Not binary compatible? The Transmeta Crusoe processor worked around this with dynamic binary translation
SIMD ISA Extensions—Parallelism from the Data?
• Same Instruction applied to multiple Data at the same time
– How can this help?
• MMX (Intel) and 3DNow! (AMD) ISA extensions (see the intrinsics sketch below)
• Great for graphics; originally invented for scientific codes
(vector processors)
– Not a general solution
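A minimal sketch of the idea using SSE2 intrinsics, the successor to MMX (the function name and the choice of SSE2 over original MMX are mine); compile with -msse2 on gcc or clang:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    /* One _mm_adds_epi16 performs eight saturating 16-bit additions:
       the same instruction applied to multiple data elements at once.
       n is assumed to be a multiple of 8 for brevity. */
    void add_pixels(int16_t *dst, const int16_t *a, const int16_t *b, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i *)&a[i]);
            __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]);
            _mm_storeu_si128((__m128i *)&dst[i], _mm_adds_epi16(va, vb));
        }
    }

Saturating short-integer arithmetic like this is exactly the pixel math graphics codes are full of, which is why these extensions shine there.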
• End of detour!
Back to Moore’s law
• Why are the “good times” over?
– Three walls
1. “Instruction Level Parallelism” (ILP) Wall
– Less parallelism available in programs (2->4->8->16)
– Tremendous increase in complexity to get more
– Does VLIW help?
– What can help?
– Conclusion: standard architectures cannot continue to do their part of sustaining Moore’s law
Wall 2: Memory Wall
[Figure: processor vs. DRAM performance, 1980 to 2000, log scale. µProc improves 52%/yr (2X/1.5 yr): “Moore’s Law”. DRAM improves 9%/yr (2X/10 yrs). The processor-memory performance gap grows 50%/year.]
• What did we do to help this? (Caches!)
– Still, it is very expensive to access memory
• How do we see the impact in practice? (see the two loops below)
• Very different from when I learned architecture!
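A small illustration of that impact (the array size and element type are illustrative): both loops below do the same 2048 x 2048 additions, but the column-order walk touches a new cache line on nearly every access and can run several times slower.

    #include <stddef.h>

    #define N 2048
    static double grid[N][N];

    /* Walks memory sequentially: one cache miss per line, then hits. */
    double sum_row_major(void)
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += grid[i][j];
        return s;
    }

    /* Strides N * sizeof(double) bytes per access: mostly cache misses. */
    double sum_col_major(void)
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += grid[i][j];
        return s;
    }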
Ways out? Multithreaded Processors
• Can we switch to other threads if we need to
access memory?
– When do we need to access memory?
• What support is needed?
• Can I use it to help with the ILP wall as well?
Simultaneous Multithreaded Processors
• How do I switch between threads?
• Hardware support for that
• How does this help?
• But, increased contention for everything (BW, TLB, caches…)
Third Wall: Physics/Power wall
• We’re down to the level of playing with a few
atoms
• More error-prone; lower yield
• But also soft errors and wear-out
– Logic that sometimes works!
– Can we do something in architecture to recover?
Power! Our topic next class
So, what is our way out? Any ideas?
Power Wall + Memory Wall + ILP Wall = Brick Wall
• Maybe architecture becomes commodity; this is
the best we can do
– This happens to a lot of technologies: why don’t we have the million-km/h car?
• Do we actually need more processing power?
– 8-bit embedded processors are good enough for calculators; 4-bit ones are probably good enough for elevators
– Is there any sense to continue investing so much time
and energy into this stuff?
A lifeline? Multi-core architectures
• How does this help?
• Think of the three walls
• The new Moore’s law:
– the number of cores will double every 3 years! (a quick projection below)
– Many-core architectures
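Taking that literally (the 4-core starting point in 2008 is my assumption):

\[
\text{cores}(t) = 4 \cdot 2^{(t-2008)/3} \;\Rightarrow\; \text{cores}(2020) = 4 \cdot 2^{4} = 64 .
\]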
Overcoming the three walls
• ILP Wall?
– Don’t need to restrict myself to a single thread
– Natural parallelism available across threads/programs
• Memory wall?
– Hmm, that is a tough one; on the surface, seems like
we made it worse
– Maybe help coming from industry
• Physics/power wall?
– Use less aggressive core technology
• Simpler processors, shallower pipelines
• But more processors
– Throw-away cores to improve yield
• Do you buy it?
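A minimal sketch of the thread-level parallelism the ILP-wall bullet points to, using POSIX threads (the summation workload and the 4-thread count are illustrative; link with -lpthread). Each thread sums an independent slice, so the parallelism comes from threads rather than from mining one instruction stream:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N (1L << 24)

    static double partial[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        double s = 0.0;
        /* each thread independently sums its own slice of the range */
        for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            s += 1.0 / (1.0 + i);
        partial[id] = s;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);   /* wait, then combine results */
            total += partial[t];
        }
        printf("sum = %f\n", total);
        return 0;
    }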
7 Questions for Parallelism
• Applications:
1. What are the apps?
2. What are kernels of apps?
• Hardware:
3. What are the HW building blocks?
4. How to connect them?
• Programming Models:
5. How to describe apps and kernels?
6. How to program the HW?
• Evaluation:
7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)
Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
• A 125 mm² chip in 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
– RISC II shrinks to ≈ 0.02 mm² at 65 nm
Processor is the new transistor!
Architecture Design space
• What should each core look like?
• Should all cores look the same?
• How should the on-chip interconnect between them look?
• What level of the cache should they share?
– And what are the implications of that?
• Are there new security issues?
– Side channel attacks; denial of service attacks
• Many other questions…
Brand new playground; exciting time to do
architecture research
Hardware Building Blocks:
Small is Beautiful
• Given the difficulty of design/validation of large designs
• Given power limits on what we can build, parallelism is an energy-efficient way to achieve performance
– Lower threshold voltage means much lower power
• Given that redundant processors can improve chip yield
– Cisco Metro: 188 processors + 4 spares
– Sun sells 6- or 8-processor versions of Niagara
• Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs
• One size fits all?
– Amdahl’s Law ⇒ a few fast cores + many small cores (worked out below)
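The Amdahl’s Law arithmetic behind that bullet (the 95% parallel fraction is an illustrative choice): if a fraction p of the work parallelizes across n cores,

\[
\text{Speedup}(n) \;=\; \frac{1}{(1-p) + p/n}, \qquad p = 0.95,\; n = 100 \;\Rightarrow\; \text{Speedup} \approx 17 .
\]

The serial fraction dominates, which is why a few fast cores for serial code plus many small cores for parallel code beats a sea of identical small ones.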
Elephant in the room
• We tried this parallel processing thing before
– Very difficult
• It failed, pretty much
– A lot of academic progress and neat algorithms, but little
impact commercially
• We actually have to do new programming
– A lot of effort to develop; error-prone; etc.
– The La-Z-Boy programming era is over
– Need new programming models
• Amdahl’s law
• Applications: What will you use 1024 cores for?
• These concerns are being voiced by a substantial
segment of academia/industry
– What do you think?
– It’s coming, no matter what