Communication-Centric Design


Communication-Centric Design
Robert Mullins
Computer Architecture Group
Computer Laboratory, University of Cambridge
(University of Twente, December 11th 2006)
Convergence to flexible parallel architectures

[Diagram: GPUs, embedded processors, FPGAs, multi-core processors and SoC platforms converging towards a common "?"]

• Power efficient
  – Better match to application characteristics (streaming, coarse-grain parallelism, …)
  – Constraint-driven execution
• Simple
  – Increased regularity
  – S/W programmable
  – Limited core/tile set
  – Eases verification issues
• Flexible
  – Multi-use platform
2
Performance from Parallelism
• The shift to multi-core processors has begun.
• Performance is boosted in a complexity- and power-effective manner.
• On-chip latencies are much lower than in previous parallel machines. This should make programming easier and open new opportunities.
• It is easy to increase the number of cores as the underlying fabrication technology scales.
• Simply an engineering problem?
3
Multi-core Showstoppers!
• As the number of cores increases, current interconnection network designs would require an unacceptable fraction of the chip power budget
• Attempts to exploit low-latency on-chip communication are undermined by complex network interfaces (an old problem) and routers
• Today’s programming models provide little hope of
efficiently exploiting a large degree of on-chip parallelism
(for the average programmer) – related to N.I. problem
• Power constraints will be such that only a small
percentage of the transistors may be active at once! Why
are 100’s of identical cores useful?
• New interconnection network solutions are critical to
making multi- and many-core chips work
• Power limits imply heterogeneity in some form?
4
Our Group’s Research
• Now: support evolution of existing platforms
  – Low-latency and low-power on-chip networks
  – System-timing considerations
  – Networking communications within FPGAs
  – Flexible networked SoC systems, virtual IP
  – On-chip serial interconnects
  – Multi-wavelength optical communication (off-chip)
  – Fault-tolerant design
• Future:
  – Networks of processors to processing networks
  – Processing fabrics
5
Low-Latency Virtual-Channel
Packet-Switched Routers
• The goal was to develop a virtual-channel network for a tiled processor architecture
• Collaboration with Krste Asanović’s SCALE group at MIT
• The problem faced is rising interconnect costs
• Networking communications can increase communication latencies by an order of magnitude or more!
6
The Lochside test chip (2004/5)
• UMC 0.18um process
• 4x4 mesh network, 25mm2
• Single-cycle routers (router + link = 1 clock)
• May be clocked by both a traditional H-tree and the DCG
• 4 virtual channels per input
• 80-bit links
  – 64-bit data + 16-bit control
• 250MHz (worst-case PVT), 16Gb/s/channel (~35 FO4)
• Approx. 5M transistors

[Die plot: 4x4 grid of tiles, each tile containing a router (R) plus traffic generator, debug and test logic]

Mullins, West and Moore (ISCA’04, ASP-DAC’06)
7
Virtual-Channel Flow Control
8
Typical Router Pipeline
• Router pipeline depth limits minimum latency
– Even under low traffic conditions
– Can make packet buffers less effective
– Incurs pipelining overheads
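The latency cost of pipeline depth can be sketched with the standard zero-load model (as in Dally and Towles): the head flit pays the full per-hop pipeline at every router, and the remaining flits stream behind it. The function and the hop/flit counts below are illustrative, not figures from the talk.

```python
def zero_load_latency(hops, cycles_per_hop, flits_per_packet):
    """Packet latency in cycles over `hops` routers at zero load (sketch)."""
    head = hops * cycles_per_hop            # router pipeline + link, per hop
    serialization = flits_per_packet - 1    # body flits follow the head flit
    return head + serialization

# A 4-stage router pipeline plus a 1-cycle link vs. a single-cycle
# router (router + link = 1 clock), over 5 hops with 4-flit packets:
deep = zero_load_latency(5, 4 + 1, 4)    # 5*5 + 3 = 28 cycles
single = zero_load_latency(5, 1, 4)      # 5*1 + 3 = 8 cycles
```

Even with no contention at all, the deeper pipeline multiplies the head-flit latency, which is why pipeline depth limits minimum latency.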
9
Speculative Router Architecture
• VC and switch allocation may be performed concurrently:
– Speculate that waiting packets will be successful in acquiring a VC
– Prioritize non-speculative requests over speculative ones
Li-Shiuan Peh and William J. Dally, “A Delay Model and Speculative Architecture for Pipelined
Routers”, In Proceedings HPCA’01, 2001.
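The priority rule above can be sketched as a toy separable allocator in Python. This is only a behavioural sketch of the idea in Peh and Dally's scheme; the data structures, port names and lowest-index tie-break arbiter are my own illustrative choices, not the hardware design.

```python
def allocate_switch(requests):
    """One round of output-port arbitration with speculation (sketch).

    `requests` maps each output port to a list of (input_port, speculative)
    requests. Non-speculative requests (packets already holding a VC) always
    beat speculative ones (packets still awaiting VC allocation).
    Returns a mapping of output port -> winning input port.
    """
    grants = {}
    for out, reqs in requests.items():
        non_spec = [inp for inp, spec in reqs if not spec]
        spec = [inp for inp, spec in reqs if spec]
        pool = non_spec if non_spec else spec   # speculation only fills gaps
        if pool:
            grants[out] = min(pool)             # placeholder arbiter: lowest index
    return grants

grants = allocate_switch({
    "east": [(0, True), (2, False)],   # non-speculative input 2 beats speculative 0
    "west": [(1, True)],               # an unopposed speculative request still wins
})
# grants == {"east": 2, "west": 1}
```

A speculative grant only becomes useful if the packet also wins a VC in the same cycle; otherwise it is aborted, which is why non-speculative traffic must take priority.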
10
Single Cycle Speculative Router
11
Single Cycle Router Architecture
• Once the speculation mechanism is in place, a range of accuracy/cycle-time trade-offs can be made
  – Blocked VC: pipeline and speculate – use a low-priority switch scheduler
  – Switch and VC next-request calculation
    • Don’t bother calculating the next switch requests, just reuse the current set. It is safe to be pessimistic about what has been granted.
    • Need to be more accurate for VC allocation
  – Abort-logic accuracy
14
Single Cycle Router Architecture
• Decreasing accuracy often leads to a poorer schedule and more aborts, but reduces the router’s cycle time
• Impact of speculation on the single-cycle router:
  – ~10% more cycles on average
  – Clock period reduced by a factor of 1.6
  – Network latency reduced by a factor of ~1.5
• Need to be careful to update arbiter state correctly once the speculation outcome is known
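Those figures combine as a quick sanity check, assuming latency scales simply as cycle count times clock period (my simplification; the slide's figures are the inputs):

```python
# Net effect of speculation on the single-cycle router (figures from the slide):
cycles_overhead = 1.10      # ~10% more cycles on average (aborted speculations)
period_reduction = 1.6      # clock period shorter by a factor of 1.6

latency_reduction = period_reduction / cycles_overhead
print(round(latency_reduction, 2))   # ~1.45, consistent with the stated ~1.5x
```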
15
Lochside Router Clock Period
• 5-port router, 4 VCs per port, 64-bit links (~1.5mm), 90nm technology
• 30-35 FO4 delays (~800MHz), 100% standard cell
• Cycle-time breakdown:
  – FF/clocking: 23% (8.3 FO4)
  – FIFOs/control/datapath: 53% (19 FO4)
  – Link: 22% (7.9 FO4), range 4.6-7.9
• Could move to a router/link pipeline
• Option to pipeline control – maintaining the single-cycle best case
• Impact of technology scaling
• Scalability: doubling VCs to 8 adds only ~10% to the cycle time
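The quoted FO4 components can be cross-checked against the stated 30-35 FO4 total (the arithmetic below is mine; the component figures are the slide's):

```python
# Cycle-time breakdown of the Lochside router, in FO4 delays (from the slide):
ff_clocking = 8.3    # flip-flops and clocking, ~23%
fifos_ctrl  = 19.0   # FIFOs, control and datapath, ~53%
link        = 7.9    # inter-router link, worst case (range 4.6-7.9), ~22%

total = ff_clocking + fifos_ctrl + link
print(round(total, 1))              # 35.2 FO4, at the top of the quoted 30-35 range
print(round(100 * link / total))    # link share of the cycle: ~22%
```

With the link at its best-case 4.6 FO4 the total drops towards the bottom of the range, which is where the router/link pipelining option mentioned above would pay off.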
16
Router Power Optimisation
• Local and global clock gating & signal gating
  – Global clock gating exploits early-request signals from neighbouring routers
  – Slightly pessimistic (based on what is requested, not granted)
  – Factor 2-4 reduction in power consumption
• Peak: 0.15mW/MHz (0.35 unoptimised)
• Low random traffic: 0.06mW/MHz (0.27 unoptimised)
Mullins, SoC’06
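The quoted mW/MHz figures give the following reduction factors (my arithmetic; note the low-traffic ratio actually comes out slightly above the stated 2-4x range):

```python
# Clock/signal-gating savings reported for the Lochside router (mW/MHz):
peak_opt, peak_unopt = 0.15, 0.35   # peak traffic, optimised vs. unoptimised
low_opt, low_unopt = 0.06, 0.27     # low random traffic

print(round(peak_unopt / peak_opt, 1))   # ~2.3x at peak load
print(round(low_unopt / low_opt, 1))     # ~4.5x under low random traffic
```

The bigger win at low load is expected: gating pays off most when routers sit idle, which is exactly when an ungated design wastes the most clock power.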
17
Analysis of Power Consumption
• 22% static power
• 11% inter-router links
• ~1% global clock tree
• 65% dynamic power
  (due to increase as a percentage as technology scales)
• Dynamic power breakdown (low random traffic case):
  – ~50% local clock tree and input FIFOs
  – ~30% router datapath
  – ~20% scheduling and arbitration
18
Distributed Clock Generator (DCG)
• Exploits self-timed circuitry to generate a clock in a distributed fashion
• Low-skew and low-power solution to providing global synchrony
• Mesh topology
• Simple proof of concept provided by the Lochside test chip
S. Fairbanks and S. Moore “Self-timed circuitry
for global clocking”, ASYNC’05
19
Beyond global synchrony
• Clock distribution issues
– Challenge as network is physically distributed
• Increasing process variation
• Synchronization
– Core clock frequencies may vary, perhaps adaptively
– Link and router DVS or other energy/perf. trade-offs
• Selecting a global network clock frequency
– Run at maximum frequency continuously?
– Use a multitude of network clock frequencies?
– Select a global compromise?
20
Beyond Global Synchrony
• A complete spectrum of approaches to system timing exists

[Spectrum diagram, by timing assumptions: "Global" (synchronous) – "Local clocks, interaction with data (becoming aperiodic)" – "None" (delay insensitive); labelled "Less Detection"]
21
Data-Driven and Pausible Clocks
Mullins/Moore, ASYNC’07
22
Example: AsAP project (UC Davis, 2006)
Yu et al, ISSCC’06
23
Example: MAIA chip (Berkeley, 2000)
• GALS architecture, data-flow-driven processing elements (“satellites”)

Zhang et al, ISSCC’00
24
Data-Driven Clocking for On-Chip Routers
• The router should be clocked when one or more inputs are valid (or flits are buffered)
• Free-running (paternoster) elevator:
  – Chain of open compartments
  – Must synchronise before you jump on!
• Traditional elevator:
  – Wait for someone to arrive
  – Close the doors, decide who is in and who is out
  – Metastability issue again (potentially painful!)
25
Data-Driven Clock Implementation
[Diagram: local clock generator template. Incoming data is sampled when at least one input is ready and the clock is low; each input is either admitted or locked out once "lock" is asserted (closing the lift doors)]
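The clock-firing rule described above can be written as a simple predicate. This is a behavioural sketch only; the real implementation is self-timed circuitry, and the function name and argument shapes are my own.

```python
def clock_should_fire(inputs_ready, clock_low, lock_asserted):
    """Decide whether the local router clock fires at this instant (sketch).

    Fire only when at least one input flit is valid and the local clock is
    low. Once `lock` is asserted ("close the lift doors"), inputs arriving
    later are locked out until the next cycle, so the sampled request lines
    cannot change while they are in use.
    """
    if lock_asserted:
        return False                        # doors closed: late arrivals wait
    return clock_low and any(inputs_ready)

# One valid input while the clock is low: the clock fires.
assert clock_should_fire([False, True, False, False], True, False)
# A late arrival after lock is asserted is locked out until next time.
assert not clock_should_fire([True, True, False, False], True, True)
```

The lock is what turns the metastability hazard of the "traditional elevator" into a bounded decision: whoever is sampled before the doors close is in; everyone else waits for the next tick.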
26
Data-driven clocking benefits
• No global clock
• Self-timed power gating?
• DI barrier synchronisation and scheduling extensions
27
Networks of processors to
processing networks
• How will on-chip networks and cores evolve over time?
28
Evolving towards processing networks
• Increased flexibility to optimise computation and communication
  – Network of processors
  – Number of processors increases
  – Core architectures tailored to a “many-core” environment
  – Remove hard tile boundaries
    • Why fix the granularity of cores, communication and memory hierarchies?
  – Move away from the processor + router model
    • Everything is on the network; we need much thinner network interfaces
    • Richer interconnection of components, increased flexibility
  – Add network-based services
    • The network aids collaboration, focuses resources, supports dynamic optimisations, scheduling, …
    • Tailor the virtual architecture to the application
  – Processing network or fabric
29
Thank You.
• 2006 Workshop on On- and Off-Chip Interconnection
Networks for Multicore Systems
http://www.ece.ucdavis.edu/~ocin06/
(Videos of talks and working group feedback are on-line)
• Computer Architecture: A Quantitative Approach, 4th
Edition, Appendix E, entitled Interconnection Networks.
http://ceng.usc.edu/smart/slides/appendixE.html
• Principles and Practices of Interconnection Networks by
William James Dally, Brian Patrick Towles
• On-chip network bibliography and resources:
http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/noc.html
30