TERAFLUX FP7-ICT-2009-4 - Istituto Nazionale di Fisica


TERAFLUX.EU

Exploiting Dataflow Parallelism in Teradevice Computing
(a Proposal to Harness the Future Multicores)

An Overview of a TERAFLUX-like Architecture

Roberto Giorgi – University of Siena (coordinator)

CASTNESS’11 Workshop
(Computing Architectures, Software tools and nano-technologies for
Numerical and Embedded Scalable Systems)
Rome, 18/01/2011

Partners (logos): University of Siena, Barcelona Supercomputing Center,
University of Augsburg, INRIA, University of Cyprus, University of Manchester
Technologies for the coming years
• Many new technologies on the horizon: graphene, junctionless transistors…
  paving the way for 1 TERA devices in a chip/package in a few years
• Feasibility also explored in EU FET projects like TRAMS
2010-09-13
Which Multicore Architecture for 2020?
• The classical 1000-billion-euro question!
• Lessons from the past:
  – Message-passing based architectures have poor PROGRAMMABILITY
  – Shared-memory based architectures have limited scalability or are quite
    COMPLEX TO DESIGN
  – A failure in part of the computation compromises the whole computation:
    poor RELIABILITY
CMP of the future == 3D stacking
• 1000-billion- or 1-tera-device computing platforms pose new challenges:
  – (at least) programmability, complexity of design, reliability
• TERAFLUX context:
  – High-performance computing and applications (not necessarily embedded)
• TERAFLUX scope:
  – Exploiting a less exploited path (DATAFLOW) at each level of abstraction

G. Hendry, K. Bergman, “Hybrid On-chip Data Networks”, HotChips-22,
Stanford, CA, Aug. 2010
What we propose
• Exploiting dataflow concepts both
  – at task level and
  – inside the threads
• Offload and manage accelerated codes
  – to localize the computation
  – to respect the power/performance/temperature/reliability envelope
  – to efficiently handle the parallelism and have an easy and powerful
    execution model
→ PUSHING THE DATA WHERE IT IS NEEDED
Some techniques proposed:
R. Giorgi, Z. Popovic, N. Puzovic, “DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems”,
19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2007),
pp. 263-270, 24-27 Oct. 2007. DOI: http://dx.doi.org/10.1109/SBAC-PAD.2007.27
M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib, Y. N. Patt, “Data Marshaling for Multi-core Architectures”,
37th International Symposium on Computer Architecture (ISCA), Saint-Malo, France, June 2010.
Our pillars
• FIXED and MOST-USED ISA (x86)
• MANYCORE FULL-SYSTEM SIMULATOR (COTSon)
• REAL-WORLD APPLICATIONS (e.g. GROMACS)
• SYNCHRONIZATION: TRANSACTIONAL MEMORY
• GCC-based TOOL-CHAIN
• OFF-THE-SHELF COMPONENTS FOR CORES, OS, NoC
• FDU AND TSU (Fault Detection Unit and Thread Scheduling Unit)
Recent Reference Numbers
• 26/05/2010 – CEA/BULL – TERA100 is the first European supercomputer
  to reach 1 Petaflops (Top500 #6 as of Nov. 2010)
  – Cluster of 4370 nodes with 4 X64 CPUs per node (17480 CPUs or 139840 cores);
    300 TB of memory; 20 PB of secondary storage; benchmark: LINPACK; 5 MW
• 11/11/2010 – National Supercomputing Center in Tianjin – Tianhe-1A –
  fastest supercomputer in the world (2.5 Petaflops)
  – 7168 NVIDIA® Tesla™ M2050 GPUs (about 3.2 million CUDA cores) and
    14336 CPUs or 86016 cores; 230 TB of memory; benchmark: LINPACK; 4 MW
A possible TERAFLUX architectural instance

[Figure: a grid of Auxiliary Cores (AC) connected by a NoC to four DRAM
interfaces, one Service Core (SC), and three I/O cores: IO1 (disk),
IO2 (keyboard), IO3 (NIC). Each core comprises a PE, an L1$, and an
L2$ partition; the uncore comprises the TSU/FDU, a NoC tap, etc.]

AC = Auxiliary Core
SC = Service Core
IOx = I/O or SC Core
TSU = Thread Scheduling Unit
FDU = Fault Detection Unit
[Figure: the TERAFLUX work-package stack. WP2 (Applications) provides data
dependencies and source code to the WP3 Programming Model (holistic approach,
transactional memory); the WP4 Compilation Tools extract TLP and apply
locality optimizations, producing threads (T1, T2, …) for the WP5/WP6
Abstraction Layer and Reliability Layer, which maps virtual CPUs (VCPUs)
onto the physical CPUs (PCPUs) of the WP7 teradevice hardware (simulated),
possibly 1,000-10,000 cores.]
TERAFLUX: toward a different world
• Rely on existing architectures as much as possible and introduce key
  modifications to enhance programmability, simplicity of design, and reliability:
  – not a brand-new language, but leverage and extend other open efforts
    [C+TM, SCALA, OpenMP]
  – not a brand-new system, but leverage and extend other open software
    frameworks [GCC]
  – not a brand-new CPU architecture, but leverage and extend industry-standard
    commodities [x86]
• However, the implications on “classical limitations” can be huge:
  – requirements of the hardware memory architecture which limit extensibility
    (a.k.a. scalability) can be relaxed significantly
  – the dataflow model is turned into a general-purpose approach through the
    addition of transactions
Top-Level ARCHITECTURAL design: a view from 1000 feet
• Pool of MANY asymmetric cores based on the x86-64 ISA on a single chip
  (e.g. 1000 cores or more)
• Some NoC, some memory hierarchy, some I/O, some physical layout (e.g. 3D
  multi-chip), off-the-shelf LINUX:
  → not in the scope of TERAFLUX
  – some options will however be proposed/explored
• TERAFLUX Baseline Machine: the simplest thing we have now
  – e.g. 64 nodes of 16 cores each, with L1, L2, and hierarchical interconnections
  – need to evolve this architecture WITHOUT binding the software to it, to let
    the architecture fully explore dataflow concepts at the machine level
• Major “cross-challenge”: how to integrate the contributions of each WP so
  that the work is done toward a higher goal we could NOT reach as separate WPs
TERAFLUX key results we are aiming at & long-term impact
• Coarse-grain dataflow model (or fine-grain multithreaded model)
  – fine-grain transactional isolation
  – scalable to many cores and distributed memory
  – with built-in application-unaware resilience
  – with novel hardware support structures as needed
• A solid and open evaluation platform based on an x86 simulator, built on
  COTSon by TERAFLUX partner HP Labs (http://cotson.sourceforge.net/)
  – enables leveraging the large software body out there
    (OS, middleware, libraries, applications)
• We are available for cooperation on COTSon also with other EU projects
  (especially TERACOMP projects)
• RESEARCH PAPERS: http://teraflux.eu/Publications
• TERADEVICE SIMULATOR: http://cotson.sourceforge.net
Conclusions: Major Technical Innovations in TERAFLUX
• Fragmenting the applications into finer-grained DF-threads:
  – DF-threads provide an easy way to decouple memory accesses, thereby
    hiding memory latencies, balancing the load, and managing fault and
    temperature information without fine-grain intervention of the software
• Possibility to repeat the execution of a DF-thread in case the thread
  happened to run on a core later discovered to be faulty
• Taking advantage of a “direct” dataflow communication of the data
  (through what we call DF-frames)
• Synchronizing threads while taking advantage of the native dataflow
  mechanism (e.g. several threads can be synchronized at a barrier)
  – DF-threads allow (atomic) transactional semantics (DF meets TM)
• A Thread Scheduling Unit allows fast thread switching and scheduling,
  besides the OS scheduler; scalable and distributed
• A Fault Detection Unit works in conjunction with the TSU