Slides - SoC for HPC

OpenSoC Fabric
An open source, parameterized, network generation tool
Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf
CoDEx
SoC for HPC Programmatic Meeting
August 25-26, 2014. Denver, CO.
1. Motivation
   Power, parallelism, and data movement drive the need for SoC
2. State of the Art in SoC
   What technology is being used to build SoCs? What can we leverage?
3. OpenSoC Overview
   An open source network generator using a new HDL - Chisel
4. OpenSoC Code Deep Dive and Demo
   OpenSoC module walk through and network output
5. Conclusion and Future Work
   New features for OpenSoC
Power: The New Design Constraint
Trends beginning in 2004 are continuing…
‣ Power densities have ceased to increase
‣ No power efficiency increase with smaller transistors
Power: The New Design Constraint
On-chip parallelism increasing to maintain performance increases…
‣ We have come to the end of clock frequency scaling
‣ Moore’s Law is alive and well
  • Now seeing core count increasing
Parallelism increasing
NERSC Trends
             Franklin   Hopper     Edison         Cori (NERSC 8)
Core Count   4          24         48 (logical)   >60
Clock Rate   2.3 GHz    2.1 GHz    2.4 GHz        ~1.5 GHz
Memory       8 GB       32 GB      64 GB          64-128 GB + on package
Peak Perf    0.352 PF   1.288 PF   2.57 PF        > 3 TF
Hierarchical Power Costs
Data movement is the dominant power cost
‣ 6 pJ: Cost to move data 1 mm on-chip
‣ 100 pJ: Typical cost of a single floating point operation
‣ 120 pJ: Cost to move data 20 mm on chip
‣ 250 pJ: Cost to move off-chip, but stay within the package (SMP)
‣ 2000 pJ: Cost to move data off chip into DRAM
‣ ~2500 pJ: Cost to move data off chip to a neighboring node
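To make the comparison concrete, the energy figures on this slide can be expressed as multiples of one floating point operation. The sketch below only restates the slide's numbers; the dictionary keys and the function name are illustrative, and the ~2500 pJ entry is treated as exactly 2500 pJ.

```python
# Energy costs in picojoules, copied from the slide.
ENERGY_PJ = {
    "move 1 mm on-chip": 6,
    "floating point operation": 100,
    "move 20 mm on-chip": 120,
    "off-chip, within package (SMP)": 250,
    "off-chip into DRAM": 2000,
    "off-chip to neighboring node": 2500,  # slide gives ~2500 pJ
}


def relative_to_flop(costs):
    """Return each cost as a multiple of one floating point operation."""
    flop = costs["floating point operation"]
    return {name: pj / flop for name, pj in costs.items()}


ratios = relative_to_flop(ENERGY_PJ)
for name, ratio in ratios.items():
    print(f"{name:32s} {ratio:6.2f} x flop")
```

Read this way, one DRAM access costs as much as 20 floating point operations, while a 1 mm on-chip move costs a small fraction of one.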
Current Hardware Challenges
Embracing embedded designs
‣ Power is now the limiting factor for leading-edge chips
  • Moore’s Law continues… but is now reliant on more exotic technologies such as 3D transistors, FinFET etc.
  • Design Validation / Verification dominating development costs
‣ Solution: Smaller is Better
  • Simpler, 5 to 9 stage pipeline cores
  • Parallel is key to energy efficiency: CV²f
  • Large arrays of small simple cores are easier to verify, more resilient
‣ Vibrant commodity market in IP components
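The "Parallel is key to energy efficiency" bullet refers to the standard dynamic power relation, power proportional to capacitance × voltage² × frequency. A minimal worked sketch follows. It assumes perfectly parallel work, normalized units, and that running at half the frequency allows the supply voltage to drop to 80% of nominal; that voltage figure is an illustrative assumption, not a number from the slides.

```python
def dynamic_power(c, v, f):
    """Dynamic switching power, proportional to C * V^2 * f."""
    return c * v**2 * f


def parallel_power(n_cores, c, v, f, v_scale):
    """Total power of n_cores, each clocked at f / n_cores and supplied with
    v * v_scale. Aggregate throughput matches one core at f only if the
    workload parallelizes perfectly (an assumption of this sketch)."""
    return n_cores * dynamic_power(c, v * v_scale, f / n_cores)


# Normalized units: one core at nominal voltage and frequency draws 1.0.
one_core = dynamic_power(1.0, 1.0, 1.0)

# Two cores at half frequency; 0.8 is an assumed, illustrative voltage scale.
two_cores = parallel_power(2, 1.0, 1.0, 1.0, v_scale=0.8)

print(f"one core : {one_core:.2f}")
print(f"two cores: {two_cores:.2f}")
```

Because voltage enters squared, any voltage reduction made possible by the lower clock rate cuts total power for the same nominal throughput.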
Building an SoC from IP Logic Blocks
It’s Legos with some extra integration and verification cost
[Block diagram: SoC assembled from IP blocks]
‣ Processor Core (ARM, Tensilica, MIPS deriv), with extra “options” like DP FPU, ECC
‣ OpenSoC Fabric (on-chip network) (currently proprietary ARM or Arteris)
‣ DDR memory controller (Denali/Cadence, SiCreations) + Phy & Programmable PLL
‣ PCIe Gen3 Root complex
‣ Integrated FLASH Controller
‣ 10GigE or IB DDR 4x Channel
Diagram block labels: IO, IB or GigE, PCIe, Memory, DRAM, FLASH Control, Mem Control
SoC – What’s out there?
Some common features
‣ Cost driven
  • CPUs integrated with IO and Gfx to reduce BOM cost, decrease total PCB area
  • Die power density / pin constraint driven
‣ Homogeneous cores
‣ Simple Networks
  • Most SoCs rely on ring or cross-bar interconnect
  • Increasing core count will drive need for more complex topologies
  • KNL present in 2016 NERSC machine will have mesh
What Interconnect Provides the Best Power / Performance Ratio?
What tools exist to answer this question?
SoC - Interconnect Examples
Some common topologies
The Importance of Network Topology
Network topology can greatly influence application
performance
An analysis of on-chip interconnection networks for large-scale chip multiprocessors
ACM Transactions on computer architecture and code optimization (TACO), April 2010
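One simple way to see why topology matters as core counts grow is to compare average hop counts. The sketch below is not taken from the cited paper: it builds a bidirectional ring and a 2-D mesh (without wraparound) as adjacency maps and measures the mean shortest-path length with breadth-first search. Hop count is only one proxy for latency and energy, and the 64-node size is chosen for illustration.

```python
from collections import deque


def ring(n):
    """Bidirectional ring of n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def mesh(w, h):
    """2-D mesh of w x h nodes, no wraparound links."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y): [(x + dx, y + dy) for dx, dy in steps
                 if 0 <= x + dx < w and 0 <= y + dy < h]
        for x in range(w) for y in range(h)
    }


def average_hops(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


ring_hops = average_hops(ring(64))
mesh_hops = average_hops(mesh(8, 8))
print(f"64-node ring: {ring_hops:.2f} hops on average")
print(f"8x8 mesh    : {mesh_hops:.2f} hops on average")
```

At 64 nodes the ring averages roughly three times as many hops as the mesh, and the gap widens with size since ring distance grows linearly with node count while mesh distance grows with its square root.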
The Importance of Networks
Networks consume a large fraction of total chip power...
[Pie chart: breakdown of total chip power]
‣ Dual FPMACs: 34%
‣ Routers and links: 26%
‣ IMEM and DMEM: 20%
‣ Clock distribution: 10%
‣ 10-port RF: 10%
A 5-GHz Mesh Interconnect for a Teraflops Processor. IEEE Micro. 2007
What tools exist for SoC research
What tools do we have to evaluate large, complex networks of cores?
‣ Software models
  • Fast to create, but plagued by long runtimes as system size increases
‣ Hardware emulation
  • Fast, accurate evaluation that scales with system size but suffers from long development time
A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs. FPGA 2008
Software Models
C++ based on-chip network simulators
‣ Booksim
  • Cycle-accurate
  • Verified against RTL
  • Few thousand cycles per second
  Booksim ISPASS 2013
‣ Garnet
  • Event driven
  • Simulation speed limits designs to 100’s of cores
  GARNET ISPASS 2009
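A quick back-of-the-envelope calculation shows what "a few thousand cycles per second" means in practice. The numbers below are assumptions for illustration only: a simulation rate of 3,000 cycles per second and a 1 GHz target clock. Neither figure comes from the slides.

```python
SIM_CYCLES_PER_SECOND = 3_000      # assumed reading of "few thousand"
TARGET_CLOCK_HZ = 1_000_000_000    # assumed 1 GHz target design


def wallclock_seconds(target_seconds, clock_hz, sim_rate):
    """Wall-clock time needed to simulate target_seconds of chip time."""
    return target_seconds * clock_hz / sim_rate


one_ms = wallclock_seconds(0.001, TARGET_CLOCK_HZ, SIM_CYCLES_PER_SECOND)
one_s = wallclock_seconds(1.0, TARGET_CLOCK_HZ, SIM_CYCLES_PER_SECOND)

print(f"1 ms of target time: {one_ms / 60:.1f} minutes")
print(f"1 s of target time : {one_s / 86_400:.1f} days")
```

Under these assumptions a single second of target execution takes days to simulate, which is the motivation for hardware-accelerated models.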
Hardware Models
HDL network generators and implementations
‣ Stanford open-source NoC router
  • Verilog
  • Precise but long simulation times
‣ CONNECT network generation
  • Bluespec
  • FPGA Optimized
  CONNECT: fast flexible FPGA-tuned networks-on-chip. CARL 2012
Chisel: A New Hardware DSL
Using Scala to construct Verilog and C++ descriptions
‣ Chisel provides both software and hardware models from the same codebase
‣ Object-oriented hardware development
  • Allows definition of structs and other high-level constructs
‣ Powerful libraries and components ready to use
‣ Working processors fabricated using Chisel
[Flow diagram labels: Chisel, Scala; Software Compilation: SystemC Simulation, C++ Simulation; Hardware Compilation: Verilog, FPGA, ASIC]
Recent Chisel Designs
Chisel-created cores successfully boot Linux
[Die photo: Raven core – 28nm. Labeled regions: Processor Site, Clock test site, DCDC test site, SRAM test site]
Chisel Overview
How does Chisel work?
Algebraic Graph Construction
‣ Not “Scala to Gates”
‣ Describe hardware functionality
‣ Chisel creates graph representation
  • Flattened
‣ Each node translated to Verilog or C++
Example: Mux( x > y, x, y)
[Graph diagram: inputs x and y feed a > node; the > result and x, y feed a Mux node]
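The idea of building a graph from an expression and then translating each node can be illustrated with a toy in Python. This is not how Chisel is implemented and none of these names belong to Chisel's API; `Node`, `Input`, `Mux` and `emit` are hypothetical helpers, and the emitted text is simplified Verilog-like output with signal widths omitted.

```python
class Node:
    """One vertex of the hardware graph: an operator plus its operand nodes."""

    def __init__(self, op, args=(), name=None):
        self.op = op
        self.args = tuple(args)
        self.name = name

    def __gt__(self, other):
        # Evaluating `x > y` does not compare anything; it records a node.
        return Node(">", (self, other))


def Input(name):
    return Node("input", name=name)


def Mux(sel, a, b):
    return Node("mux", (sel, a, b))


def emit(root):
    """Walk the graph once and translate each node to one line of text."""
    lines, names = [], {}

    def visit(node):
        if id(node) in names:          # shared nodes are emitted only once
            return names[id(node)]
        if node.op == "input":
            ref = node.name
        else:
            operands = [visit(arg) for arg in node.args]
            ref = f"t{len(lines)}"
            if node.op == ">":
                expr = f"{operands[0]} > {operands[1]}"
            else:  # mux
                expr = f"{operands[0]} ? {operands[1]} : {operands[2]}"
            lines.append(f"wire {ref} = {expr};")
        names[id(node)] = ref
        return ref

    lines.append(f"assign out = {visit(root)};")
    return lines


x, y = Input("x"), Input("y")
verilog_like = emit(Mux(x > y, x, y))
print("\n".join(verilog_like))
```

The host-language expression `Mux(x > y, x, y)` runs as ordinary code and leaves behind a three-node graph, which is the sense in which the slide says this is not "Scala to Gates".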
OpenSoC Fabric
An open source, flexible, parameterized, NoC generator
‣ Part of the CoDEx tool suite, written in Chisel
‣ Dimensions, topology, VCs all configurable
‣ Fast functional C++ model for functional validation
  • SystemC ready
‣ Verilog based description for FPGA or ASIC
  • Synthesis path enables accurate power / energy modeling
‣ AXI Based endpoints
  • Ready for ARM integration
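What "parameterized" means for a network generator can be sketched as a function from a small parameter bundle to a description of every router. The code below is a hypothetical illustration in Python, not the OpenSoC Fabric API: `FabricParams`, `router_ports` and `elaborate` are invented names, and the choice to omit ports at mesh edges is an assumption (a real generator might instead keep a uniform radix).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricParams:
    """Hypothetical parameter bundle; not the OpenSoC Fabric API."""
    width: int     # routers along x
    height: int    # routers along y
    num_vcs: int   # virtual channels per input port


def router_ports(p, x, y):
    """Ports of the router at (x, y) in a 2-D mesh without wraparound."""
    ports = ["local"]
    if x > 0:
        ports.append("west")
    if x < p.width - 1:
        ports.append("east")
    if y > 0:
        ports.append("south")
    if y < p.height - 1:
        ports.append("north")
    return ports


def elaborate(p):
    """Expand the parameters into one descriptor per router."""
    fabric = {}
    for y in range(p.height):
        for x in range(p.width):
            ports = router_ports(p, x, y)
            fabric[(x, y)] = {
                "ports": ports,
                "vc_buffers": len(ports) * p.num_vcs,
            }
    return fabric


fabric = elaborate(FabricParams(width=4, height=4, num_vcs=2))
total_buffers = sum(r["vc_buffers"] for r in fabric.values())
print(f"{len(fabric)} routers, {total_buffers} VC buffers in total")
```

Changing a single field of the parameter bundle regenerates the whole description, which is the property that makes design-space sweeps over dimensions and VC counts cheap.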
OpenSoC Fabric
An open source, flexible, parameterized, NoC generator
[Diagram: OpenSoC Fabric at the center, connected through AXI interfaces to CPU(s), PCIe, 10GbE, and HMC endpoints]
OpenSoC: Current Status
Projected v1.0 release date of October 1st
‣ Available now:
  • 2-D mesh or Flattened Butterfly network of arbitrary size
  • Wormhole routing
‣ In Development
  • Virtual Channels
  • AXI Interface
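In wormhole switching a packet is divided into flits: the head flit carries the routing information, body flits follow the path the head has set up, and the tail flit marks the end of the packet so the path can be released. A minimal sketch of that packetization is below. The flit layout, the 4-byte flit size and the `packetize` helper are hypothetical and do not describe OpenSoC Fabric's actual flit format.

```python
def packetize(dest, payload, flit_bytes=4):
    """Split a payload into a head flit, body flits and a tail flit.

    Only the head flit carries the destination; body and tail flits carry
    payload chunks and rely on the route set up by the head."""
    if not payload:
        raise ValueError("payload must be non-empty")
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)]
    flits = [("head", dest)]
    for i, chunk in enumerate(chunks):
        kind = "tail" if i == len(chunks) - 1 else "body"
        flits.append((kind, chunk))
    return flits


flits = packetize(dest=(2, 3), payload=b"abcdefghij")
for kind, content in flits:
    print(f"{kind:4s} {content!r}")
```

Because body and tail flits carry no routing state, a router needs to buffer only a few flits per port rather than whole packets.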
OpenSoC Code Examples
Walkthrough of Switch and Full Network Tester
Future additions
Towards a full set of features
‣ Photonics and circuit switched networks
‣ Integrated NIC model
‣ More diverse topologies and routing functions
‣ Validation against RTL and other simulators
‣ Standardized (AXI) interfaces at the endpoints
‣ More powerful synthetic traffic and trace replay support
‣ Power modeling in the C++ model
Acknowledgements
‣ UCB Chisel
‣ US Dept of Energy
‣ Ke Wen
‣ Columbia LRL
‣ John Bachan
‣ Dan Burke
‣ BWRC
More Information
http://opensocfabric.org