Research Accelerator
for MultiProcessing
Dave Patterson, UC Berkeley
January 2006
+ RAMP collaborators: Arvind (MIT), Krste Asanović (MIT),
Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford),
Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley),
Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley)
Conventional Wisdom
in Computer Architecture
Old: Multiplies are slow, Memory access is fast
New: “Memory wall” Memory slow, multiplies fast
(200 clocks to DRAM memory, 4 clocks for multiply)
Old: Power is free, Transistors expensive
New: “Power wall” Power expensive, Xtors free
(Can put more on chip than can afford to turn on)
Old: Uniprocessor performance 2X / 1.5 yrs
New: Power Wall + Memory Wall = Brick Wall
Uniprocessor performance only 2X / 5 yrs
Sea change in chip design: multiple “cores”
(2X processors per chip / ~ 2 years)
More instances of simpler processors are more power efficient
Uniprocessor Performance (SPECint)
[Figure: performance relative to the VAX-11/780 (log scale, 1 to 10,000) versus year, 1978 to 2006. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: 20%/year 2002 to present
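The annual growth rates above can be converted into doubling times with a short calculation. This is a minimal sketch assuming simple compound annual growth; the function name is illustrative and not from the slides.

```python
import math

def doubling_time_years(annual_growth):
    """Years needed to double performance at a compound annual rate (0.52 means 52%/year)."""
    return math.log(2) / math.log(1 + annual_growth)

eras = [
    ("VAX, 1978 to 1986", 0.25),
    ("RISC + x86, 1986 to 2002", 0.52),
    ("RISC + x86, 2002 to present", 0.20),
]
for label, rate in eras:
    print(f"{label}: 2X every {doubling_time_years(rate):.1f} years")
```

Under this assumption, 52%/year doubles performance in roughly 1.7 years and 20%/year in roughly 3.8 years, which is in the same range as the 2X / 1.5 yrs and 2X / 5 yrs figures quoted in the talk.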
Sea Change in Chip Design
Intel 4004 (1971): 4-bit processor,
2312 transistors, 0.4 MHz,
10 micron PMOS, 11 mm² chip
RISC II (1983): 32-bit, 5 stage
pipeline, 40,760 transistors, 3 MHz,
3 micron NMOS, 60 mm² chip
125 mm² chip, 0.065 micron CMOS
= 2312 RISC II+FPU+Icache+Dcache
RISC II shrinks to ~ 0.02 mm² at 65 nm
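The shrink arithmetic can be checked with a rough calculation. The sketch below assumes ideal scaling, where area shrinks with the square of the feature size; real designs do not scale this cleanly, and the function name is illustrative.

```python
def scaled_area_mm2(area_mm2, old_feature_um, new_feature_um):
    """Ideal shrink: chip area scales with the square of the feature size."""
    return area_mm2 * (new_feature_um / old_feature_um) ** 2

# RISC II: 60 mm^2 at 3 micron, moved to 0.065 micron (65 nm).
risc2_at_65nm = scaled_area_mm2(60.0, 3.0, 0.065)

# Area available per core if a 125 mm^2 chip holds 2312 of them.
area_per_core = 125.0 / 2312

print(f"RISC II at 65 nm (ideal scaling): {risc2_at_65nm:.3f} mm^2")
print(f"Area budget per core on a 125 mm^2 chip: {area_per_core:.3f} mm^2")
```

Ideal scaling gives about 0.03 mm² for the bare RISC II core, the same order as the ~0.02 mm² on the slide, and leaves about half of each 0.05 mm² budget for the FPU and caches.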
Caches via DRAM or 1 transistor SRAM (www.t-ram.com) ?
Proximity Communication via capacitive coupling at > 1 TB/s ?
(Ivan Sutherland @ Sun / Berkeley)
• Processor is the new transistor?
Déjà vu all over again?
“… today’s processors … are nearing an impasse as
technologies approach the speed of light..”
David Mitchell, The Transputer: The Time Is Now (1989)
Transputer had bad timing
Custom multiprocessors strove to lead uniprocessors
Procrastination rewarded: 2X seq. perf. / 1.5 years
“We are dedicating all of our future product
development to multicore designs. …
This is a sea change in computing”
Paul Otellini, President, Intel (2004)
All microprocessor companies switch to MP
Procrastination penalized: 2X sequential perf. / 5 yrs
Biggest programming challenge: 1 to 2 CPUs
Problems with Sea Change
1. Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready for 1000 CPUs / chip
2. Software people don’t start working hard until hardware arrives
• 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for next iteration of HW/SW
3. How to do research in timely fashion on 1000 CPU systems in algorithms, compilers, OS, architectures, … without waiting years between HW generations?
Characteristics of Ideal Academic
CS Research Supercomputer?
Scale – Hard problems at 1000 CPUs
Cheap – 2006 funding of academic research
Cheap to operate, Small, Low Power – $ again
Community – share SW, training, ideas, …
Simplifies debugging – high SW churn rate
Reconfigurable – test many parameters,
imitate many ISAs, many organizations, …
Credible – results translate to real computers
Performance – run real OS and full apps,
results overnight
Build Academic SC from FPGAs
As ~ 25 CPUs will fit in Field Programmable Gate Array (FPGA), 1000-CPU system from ~ 40 FPGAs?
• 16 32-bit simple “soft core” RISC at 150MHz in 2004 (Virtex-II)
• FPGA generations every 1.5 yrs; ~2X CPUs, ~1.2X clock rate
HW research community does logic design (“gate shareware”) to create out-of-the-box, MPP that runs standard binaries of OS and applications
• Gateware: Processors, Caches, Coherency, Ethernet Interfaces, Switches, Routers, … (some free from open source hardware)
E.g., 1000 processor, standard ISA binary-compatible, 64-bit, cache-coherent supercomputer @ 200 MHz/CPU in 2007
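The sizing argument can be written out as a short projection. This sketch takes the slide's figures at face value (16 simple 32-bit cores at 150 MHz in 2004, ~2X cores and ~1.2X clock per 1.5-year generation, ~25 CPUs per FPGA for the target system); the function and parameter names are illustrative only.

```python
import math

def project(cpus=16, clock_mhz=150.0, years=3.0,
            generation_years=1.5, cpu_factor=2.0, clock_factor=1.2):
    """Project soft cores per FPGA and clock rate after `years` of FPGA generations."""
    generations = years / generation_years
    return cpus * cpu_factor ** generations, clock_mhz * clock_factor ** generations

# 2004 -> 2007 is two FPGA generations.
cpus_2007, clock_2007 = project()

# FPGAs needed for 1000 CPUs at ~25 CPUs per FPGA.
fpgas_for_1000 = math.ceil(1000 / 25)

print(f"Simple 32-bit cores per FPGA in 2007: ~{cpus_2007:.0f} at ~{clock_2007:.0f} MHz")
print(f"FPGAs for 1000 CPUs at 25 CPUs/FPGA: {fpgas_for_1000}")
```

The projected ~216 MHz for simple cores is consistent with the 200 MHz/CPU target quoted for 2007.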
Why RAMP Good for Research MPP?
                                SMP                    Cluster                Simulate                RAMP
Scalability (1k CPUs)           C                      A                      A                       A
Cost (1k CPUs)                  F ($40M)               C ($2-3M)              A+ ($0M)                A ($0.1-0.2M)
Cost of ownership               A                      D                      A                       A
Power/Space (kilowatts, racks)  D (120 kw, 12 racks)   D (120 kw, 12 racks)   A+ (.1 kw, 0.1 racks)   A (1.5 kw, 0.3 racks)
Community                       D                      A                      A                       A
Observability                   D                      C                      A+                      A+
Reproducibility                 B                      D                      A+                      A+
Reconfigurability               D                      C                      A+                      A+
Credibility                     A+                     A+                     F                       A
Perform. (clock)                A (2 GHz)              A (3 GHz)              F (0 GHz)               C (0.1-.2 GHz)
GPA                             C                      B-                     B                       A-
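The GPA row can be roughly reproduced by averaging the ten grades for each approach. This is a sketch under stated assumptions: an unweighted mean on a 4.0 scale with "+" and "-" modifiers ignored. The slides do not say how the GPA was actually computed, so small differences from the letter grades shown are expected.

```python
# Letter-grade points on an assumed 4.0 scale; "+" and "-" modifiers are ignored.
POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# Grades per approach, in row order: Scalability, Cost, Cost of ownership,
# Power/Space, Community, Observability, Reproducibility, Reconfigurability,
# Credibility, Performance.
GRADES = {
    "SMP":      ["C", "F", "A", "D", "D", "D", "B", "D", "A+", "A"],
    "Cluster":  ["A", "C", "D", "D", "A", "C", "D", "C", "A+", "A"],
    "Simulate": ["A", "A+", "A", "A+", "A", "A+", "A+", "A+", "F", "F"],
    "RAMP":     ["A", "A", "A", "A", "A", "A+", "A+", "A+", "A", "C"],
}

def gpa(grades):
    """Unweighted mean of grade points, using only the leading letter of each grade."""
    return sum(POINTS[g[0]] for g in grades) / len(grades)

for name, grades in GRADES.items():
    print(f"{name}: {gpa(grades):.1f}")
```

With these assumptions the averages come out near 2.1, 2.5, 3.2 and 3.8, in line with the C, B-, B and A- shown.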
RAMP 1 Hardware
Completed Dec. 2004 (14x17 inch 22-layer PCB)
Module: 5 Virtex II FPGAs, 18 banks DDR2-400 memory, 20 10GigE conn.
Administration/maintenance ports: 10/100 Enet, HDMI/DVI, USB
~$4K in Bill of Materials (w/o FPGAs or DRAM)
BEE2: Berkeley Emulation Engine 2
Multiple Module RAMP 1 Systems
8 compute modules (plus power
supplies) in 8U rack mount chassis
2U single module tray for developers
Many topologies possible
Disk storage: via disk emulator +
Network Attached Storage
Quick Sanity Check
BEE2 uses old FPGAs (Virtex II), 4 banks DDR2-400/cpu
16 32-bit Microblazes per Virtex II FPGA,
0.75 MB memory for caches
32 KB direct mapped Icache, 16 KB direct mapped Dcache
Assume 150 MHz, CPI is 1.5 (4-stage pipe)
I$ Miss rate is 0.5% for SPECint2000
D$ Miss rate is 2.8% for SPECint2000, 40% Loads/stores
BW need/CPU = 150/1.5*4B*(0.5% + 40%*2.8%)
= 6.4 MB/sec
BW need/FPGA = 16*6.4 = 100 MB/s
Memory BW/FPGA = 4*200 MHz*2*8B = 12,800 MB/s
Plenty of room for tracing, …
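The sanity check above can be replayed as a few lines of arithmetic. The sketch uses only the numbers on the slide (4 bytes per miss, 4 DDR2-400 banks of 8 bytes each per FPGA, as in the Memory BW/FPGA line); the variable names are illustrative.

```python
clock_mhz = 150.0        # assumed soft-core clock
cpi = 1.5                # cycles per instruction
bytes_per_miss = 4       # bytes fetched per miss, as on the slide
icache_miss = 0.005      # I$ miss rate, SPECint2000
dcache_miss = 0.028      # D$ miss rate, SPECint2000
load_store_frac = 0.40   # fraction of instructions that are loads/stores
cpus_per_fpga = 16

# Millions of instructions per second per CPU.
minstr_per_sec = clock_mhz / cpi

# Memory bandwidth demand in MB/s: every instruction may miss in the I$,
# and 40% of instructions may also miss in the D$.
bw_per_cpu = minstr_per_sec * bytes_per_miss * (icache_miss + load_store_frac * dcache_miss)
bw_per_fpga = cpus_per_fpga * bw_per_cpu

# Supply: 4 banks * 200 MHz * 2 transfers per clock (DDR) * 8 bytes.
mem_bw_per_fpga = 4 * 200 * 2 * 8

print(f"Need per CPU:  {bw_per_cpu:.1f} MB/s")
print(f"Need per FPGA: {bw_per_fpga:.0f} MB/s")
print(f"Available per FPGA: {mem_bw_per_fpga} MB/s ({mem_bw_per_fpga / bw_per_fpga:.0f}x headroom)")
```

Computed without rounding, the per-CPU need is about 6.5 MB/s and the per-FPGA need about 104 MB/s, so the slide's 6.4 and 100 MB/s are rounded values and the available bandwidth is over 100 times the demand.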
RAMP FAQ on ISAs
Which ISA will you pick?
Goal is replaceable ISA/CPU L1 cache, rest of infrastructure unchanged (L2
cache, router, memory controller, …)
What do you want from a CPU?
Standard ISA (binaries, libraries, …), simple (area), 64-bit (coherency),
DP Fl.Pt. (apps)
Multithreading? As an option, but want to get to 1000 independent CPUs
When do you need it? 3Q06
RAMP people port my ISA, fix my ISA?
Our plates are full already
Type A vs. Type B gateware
Router, Memory controller, Cache coherency, L2 cache, Disk module,
protocol for each
Integration, testing
Handicapping ISAs
Got it: Power 405 (32b), SPARC v8 (32b),
Xilinx Microblaze (32b)
Very Likely: SPARC v9 (64b)
Likely: IBM Power 64b
Probably (haven’t asked): MIPS32, MIPS64
Not likely: x86
Even less likely: x86-64
We’ll sue you: ARM
RAMP Development Plan
1. Distribute systems internally for RAMP 1 development
  • Xilinx agreed to pay for production of a set of modules for initial contributing developers and first full RAMP system
  • Others could be available if can recover costs
2. Release publicly available out-of-the-box MPP emulator
  • Based on standard ISA (IBM Power, Sun SPARC, …) for binary compatibility
  • Complete OS/libraries
  • Locally modify RAMP as desired
3. Design next generation platform for RAMP 2
  • Base on 65nm FPGAs (2 generations later than Virtex-II)
  • Pending results from RAMP 1, Xilinx will cover hardware costs for initial set of RAMP 2 machines
  • Find 3rd party to build and distribute systems (at near-cost), open source RAMP gateware and software
  • Hope RAMP 3, 4, … self-sustaining
NSF/CRI proposal pending to help support effort
  • 2 full-time staff (one HW/gateware, one OS/software)
  • Look for grad student support at 6 RAMP universities from industrial donations
RAMP Milestones 2006
Name        Goal      Target  CPUs                            Details
Red (SU)    A Start   1Q06    8 32b Power hard cores          Transactional memory SMP
Blue (Cal)  Scale     3Q06    1024 32b Microblaze soft cores  Cluster, MPI
White 1.0   Features  2Q06    64 hard PPC
White 2.0   Features  3Q06    128? soft 32b
White 3.0   Features  4Q06    64? soft 64b
White 4.0   Features  1Q07    Multiple ISAs
White details (across versions): Cache coherent, shared address, deterministic, debug/monitor, commercial ISA
the stone soup of
architecture research
platforms
Wawrzynek: Hardware
Chiou: Glue-support
Patterson: I/O
Kozyrakis: Monitoring
Hoe: Coherence
Asanovic: Cache
Oskin: Net Switch
Arvind: PPC
Lu: x86
Gateware Design Framework
Insight: almost every large building block fits
inside FPGA today
what doesn’t is between chips in real design
Supports both cycle-accurate emulation of
detailed parameterized machine models and rapid
functional-only emulations
Carefully counts for Target Clock Cycles
Units in any hardware design language
(will work with Verilog, VHDL, BlueSpec, C, ...)
RAMP Design Language (RDL) to describe plumbing
to connect units in
Gateware Design Framework
Design composed of units that send messages over channels via ports
Units (10,000 + gates)
• CPU + L1 cache, DRAM controller….
Channels (~ FIFO)
• Lossless, point-to-point, unidirectional, in-order message delivery…
[Figure: a Sending Unit's port “DataOut” connected through a Channel to a Receiving Unit's port “DataIn”. Sender-side signals: DataOut, __DataOut_READY, __DataOut_WRITE. Receiver-side signals: DataIn, __DataIn_READY, __DataIn_READ.]
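The unit/channel/port model can be illustrated with a toy software model. This is only a sketch, not RDL or RAMP gateware: it assumes that __DataOut_READY means the channel can accept a message and that __DataIn_READY means a message is waiting, and the class, method, and function names are hypothetical mirrors of the signal names.

```python
from collections import deque

class Channel:
    """Toy channel: a bounded, lossless, in-order FIFO from one sender port to one receiver port."""

    def __init__(self, capacity=2):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._fifo = deque()

    # Sender-side port ("DataOut")
    def data_out_ready(self):
        """Assumed meaning of __DataOut_READY: there is room for one more message."""
        return len(self._fifo) < self.capacity

    def data_out_write(self, message):
        """Assumed meaning of __DataOut_WRITE: enqueue a message; never drops one."""
        if not self.data_out_ready():
            raise RuntimeError("write attempted while channel is full")
        self._fifo.append(message)

    # Receiver-side port ("DataIn")
    def data_in_ready(self):
        """Assumed meaning of __DataIn_READY: a message is available to read."""
        return len(self._fifo) > 0

    def data_in_read(self):
        """Assumed meaning of __DataIn_READ: dequeue the oldest message."""
        if not self.data_in_ready():
            raise RuntimeError("read attempted while channel is empty")
        return self._fifo.popleft()

def run(messages, capacity=2):
    """Drive a sending unit and a slower receiving unit until all messages are delivered."""
    channel = Channel(capacity)
    pending = deque(messages)
    received = []
    step = 0
    while pending or channel.data_in_ready():
        # Sending unit: write only when the channel signals ready (backpressure).
        if pending and channel.data_out_ready():
            channel.data_out_write(pending.popleft())
        # Receiving unit: reads only on odd steps, so the channel fills up.
        if step % 2 == 1 and channel.data_in_ready():
            received.append(channel.data_in_read())
        step += 1
    return received

print(run(list(range(10))))
```

Because the sender stalls whenever the channel is full, every message arrives exactly once and in order even though the receiver runs at half the sender's rate.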
Status
Submitted NSF proposal August 2005
Biweekly teleconferences (since June 05)
IBM, Sun donating commercial ISA, simple,
industrial-strength, CPU + FPU
Technical report, RDL document
RAMP 1/RDL short course/board distribution
in Berkeley for 40 people @ 6 schools Jan 06
FPGA workshop @ HPCA 2/06, @ ISCA 6/06
ramp.eecs.berkeley.edu
RAMP uses (internal)
Wawrzynek: BEE
Chiou: Net-uP
Patterson: Internet-in-a-Box
Kozyrakis: TCC
Hoe: Reliable MP
Asanovic: 1M-way MT
Oskin: Dataflow
Arvind: BlueSpec
Lu: x86
Multiprocessing Watering Hole
RAMP
Parallel file system Dataflow language/computer Data center in a box
Thread scheduling Security enhancements Internet in a box
Multiprocessor switch design
Router design Compile to FPGA
Fault insertion to check dependability Parallel languages
Killer app: All CS Research, Ind. Advanced Development
RAMP attracts many communities to shared artifact
Cross-disciplinary interactions
Accelerate innovation in multiprocessing
RAMP as next Standard Research Platform?
(e.g., VAX/BSD Unix in 1980s)
Supporters
(wrote letters to NSF)
Gordon Bell (Microsoft)
Ivo Bolsens (Xilinx CTO)
Norm Jouppi (HP Labs)
Bill Kramer (NERSC/LBL)
Craig Mundie (MS CTO)
G. Papadopoulos (Sun CTO)
Justin Rattner (Intel CTO)
Ivan Sutherland (Sun Fellow)
Chuck Thacker (Microsoft)
Kees Vissers (Xilinx)
Doug Burger (Texas)
Bill Dally (Stanford)
Carl Ebeling (Washington)
Susan Eggers (Washington)
Steve Keckler (Texas)
Greg Morrisett (Harvard)
Scott Shenker (Berkeley)
Ion Stoica (Berkeley)
Kathy Yelick (Berkeley)
RAMP Participants:
Arvind (MIT), Krste Asanović (MIT),
Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford),
Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley),
Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley)
Conclusions
RAMP as system-level time machine: preview computers
of future to accelerate HW/SW generations
Trace anything, Reproduce everything, Tape out every day
FTP new supercomputer overnight and boot in morning
Clone to check results (as fast in Berkeley as in Boston?)
Emulate Massive Multiprocessor, Data Center, or Distributed Computer
Carpe Diem
Systems researchers (HW & SW) need the capability
FPGA technology is ready today, and getting better every year
Stand on shoulders vs. toes: standardize on design framework, multi-year
Berkeley effort on FPGA platforms (Berkeley Emulation Engine BEE2)
Architecture researchers get opportunity to immediately aid colleagues via
gateware (as SW researchers have done in past)
“Multiprocessor Research Watering Hole” accelerate
research in multiprocessing via standard research platform
hasten sea change from sequential to parallel computing
Backup Slides
UT FAST
1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
  • Well, not quite that fast right now, but we are using embedded 300MHz PowerPC 405 to simplify
X86, boots Linux, Windows, targeting 80486 to Pentium M-like designs
  • Heavily modified Bochs, supports instruction trace and rollback
Working on “superscalar” model
  • Have straight pipeline 486 model with TLBs and caches
Statistics gathered in hardware
  • Very little if any probe effect
Work started on tools to semi-automate microarchitectural and ISA level exploration
  • Orthogonality of models makes both simpler
Derek Chiou, UTexas
Example: Transactional Memory
Processors/memory hierarchy that support
transactional memory
Hardware/software infrastructure for
performance monitoring and profiling
Will be general for any type of event
Transactional coherence protocol
Christos Kozyrakis, Stanford
Example: PROTOFLEX
Hardware/Software Cosimulation/test
methodology
Based on FLEXUS C++ full-system
multiprocessor simulator
Can swap out individual components to hardware
Used to create and test a non-blocking MSI
invalidation-based protocol engine in
hardware
James Hoe, CMU
Example: Wavescalar Infrastructure
Dynamic Routing Switch
Directory-based coherency scheme and
engine
Mark Oskin, U Washington
Example RAMP App: “Internet in a Box”
Building blocks also Distributed Computing
RAMP vs. Clusters (Emulab, PlanetLab)
Scale: RAMP O(1000) vs. Clusters O(100)
Private use: $100k, so every group has one
Develop/Debug: Reproducibility, Observability
Flexibility: Modify modules (SMP, OS)
Heterogeneity: Connect to diverse, real routers
Explore via repeatable experiments as vary
parameters, configurations vs. observations on
single (aging) cluster that is often idiosyncratic
David Patterson, UC Berkeley
Why RAMP Attractive?
Priorities for Research Parallel Computers
Insight – Commercial priorities radically different from research
1a. Cost of purchase
1b. Cost of ownership (staff to administer it)
1c. Scalability (1000 much better than 100 CPUs)
4. Power/Space (machine room cooling, number of racks)
5. Community synergy (share code, …)
6. Observability (non-obtrusively measure, trace
everything)
7. Reproducibility (to debug, run experiments)
8. Flexibility (change for different experiments)
9. Credibility (Faithfully predicts real hardware behavior)
10. Performance (As long as experiments not too slow)
Related Approaches (1)
Quickturn, Axis, IKOS, Tharas:
FPGA- or special-processor based gate-level hardware emulators
Synthesizable HDL is mapped to array for cycle and bit-accurate netlist
emulation
RAMP’s emphasis is on emulating high-level architecture behaviors
Hardware and supporting software provides architecture-level
abstractions for modeling and analysis
Targets architecture and software research
Provides a spectrum of tradeoffs between speed and
accuracy/precision of emulation
RPM at USC in early 1990’s:
Up to only 8 processors
Only the memory controller implemented with configurable logic
Related Approaches (2)
Software Simulators
Clusters (standard microprocessors)
PlanetLab (distributed environment)
Wisconsin Wind Tunnel (used CM-5 to simulate
shared memory)
All suffer from some combination of:
Slowness, inaccuracy, scalability, unbalanced
computation/communication, target inflexibility