Programming Models and Architectures for Many Core Systems


Programming Models and Architectures
for ManyCore Systems:
Challenges and Opportunities for the next 10 years.
Roberto Vaccaro & Lorenzo Verdoscia
Institute for High Performance Computing and Networking
National Research Council – Italy
[email protected]
Workshop
December 19, Napoli - Italy
R. Vaccaro & L. Verdoscia
1
Introduction
CNR Bioinformatics
■ The computational and storage needs of workloads in several areas, such as life science, are growing exponentially.
■ Overcoming heterogeneity and computing barriers.
– The scientist should be allowed to look at the data
• easily,
• wherever it may be,
• with sufficient processing power for any desired algorithm to process it.
■ In life science, the scientist's requirements concern a range of different scales, from the local parallel component processor to the global architectural level of the cross-organizational grid.
■ Integrated solutions capable of facing the problems at the different architectural levels are needed.
■ Grid of Clusters (Wide Area Network)
■ Cluster (Local Area Network)
■ Commodity Machine (System Level Network)
■ Microprocessor (Network on Chip)
■ ManyCore Chip
■ Photonic Networks for intra-chip, inter-chip, box interconnects
(*) T. Agerwala, M. Gupta, “Systems research challenges: A scale-out perspective”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 173-180
■ An ensemble of N nodes, each comprising p computing elements
■ The p elements are tightly coupled via shared memory (e.g., SMP, DSM)
■ The N nodes are loosely coupled, i.e., distributed memory
■ p is greater than N
■ The distinction is which layer gives us the most power through parallelism
■ GRIDs are built over wide-area networks and across organisational boundaries.
■ Lack of (further) improvement in network latency.
The currently prevailing synchronous approach to distributed programming
(using RPC primitives, for example)
will have to be replaced with an
ASYNCHRONOUS PROGRAMMING APPROACH
that is more
- delay-tolerant
- failure-resilient
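As a minimal sketch of this asynchronous style (all names here are hypothetical, not from the slides), Python's asyncio can overlap many delayed remote calls and tolerate individual failures, where a synchronous RPC loop would block on each call in turn:

```python
import asyncio
import random

async def remote_call(i):
    # Hypothetical remote service call: latency varies, and calls may fail.
    await asyncio.sleep(random.uniform(0.001, 0.005))  # simulated network delay
    if random.random() < 0.2:
        raise ConnectionError(f"call {i} failed")
    return i * i

async def main():
    # Issue 10 calls concurrently; tolerate individual failures instead of
    # blocking on each call in turn as a synchronous RPC loop would.
    results = await asyncio.gather(*(remote_call(i) for i in range(10)),
                                   return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    print(f"{len(ok)} of 10 calls succeeded")

asyncio.run(main())
```

The delay tolerance comes from overlapping the waits; the failure resilience from treating each failed call as data (`return_exceptions=True`) rather than letting one failure abort the whole exchange.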
■ A first step in that direction:
- peer-to-peer (P2P) architectures
- service-oriented architectures (SOA)
capable of supporting reuse of both functionalities and data.
■ Using P2P architectures and protocols it is possible to
- realize distributed systems without any centralized control or hierarchical organisation,
- achieve scalable and reliable location and exchange of scientific data and software in a decentralised manner.
■ Service-Oriented Architecture (SOA) and the web-service infrastructures that assist in their implementation facilitate reuse of functionality.
(*) G. Kandaswamy et al., “Building Web Services for Scientific Grid Applications”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 249-260
■ The ability to locate and invoke a service across machine and organisational boundaries (both in a synchronous and an asynchronous manner) is the fundamental primitive provided by the SOA infrastructure.
■ Computational scientists will be able to flexibly orchestrate SOA services into computational workflows.
■ Appropriate programming-language abstractions for science have to be provided.
■ Fortran and the Message Passing Interface (MPI) are no longer appropriate for the architecture described above.
■ By using abstract machines it is possible to mix compilation and interpretation, as well as to integrate code written in different languages seamlessly into an application or service.
A viable approach
■ Define a Multilevel Integrated Programming Model.
■ Explore the management of concurrency in processor design on a range of different scales:
from instructions to programs,
from microgrids to global grids.
■ Evaluate the possibility and modalities of implementing an integrated H/W and S/W system capable of giving the right answer in terms of:
- inter/intra-processor latency;
- a more delay-tolerant and failure-resilient programming approach;
- capability of data and functionality reuse at the global architecture level (distributed, cross-organisational);
- capability to take advantage of parallel and distributed resources.
By Little’s law, the amount of concurrency needed to hide the latency of
memory accesses will continue to increase as the gap between memory and
processor speed grows. Since the memory latency is improving at a rate of
only roughly 6% each year, the gap is projected to continue growing even as
the increase in processor speed decreases from the historic rate of about 60%
each year to about 20% each year.
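By Little's law, the concurrency needed (outstanding memory operations) equals memory latency times the issue rate. A small sketch with illustrative numbers; the growth rates are the 6% and 20% per year quoted above, while the 200-cycle DRAM latency is a hypothetical figure of the order discussed later in these slides:

```python
# Little's law: concurrency (outstanding memory operations) needed to keep
# the memory system busy equals latency x issue rate.
def concurrency_needed(latency_cycles, ops_per_cycle):
    return latency_cycles * ops_per_cycle

# A hypothetical processor issuing 1 memory op/cycle against 200-cycle DRAM
# must keep 200 operations in flight to hide the latency.
print(concurrency_needed(200, 1))  # -> 200

# If processor speed grows ~20%/year while memory latency improves only
# ~6%/year, the required concurrency grows ~13% per year (1.20 / 1.06).
growth = 1.20 / 1.06
print(f"required concurrency grows ~{(growth - 1) * 100:.0f}% per year")
```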
Computer hardware industry
2005 marked a historic change of direction for the computer hardware industry.
● The major microprocessor companies all announced that
○ future products would be single-chip multiprocessors, and
○ future performance improvements would rely on software-specified parallelism rather than on additional software-transparent parallelism extracted automatically by the microarchitecture.
■ It is meaningful that a multibillion-dollar industry has bet its future on solving the general-purpose parallel computing problem,
even if
so many have previously attempted but failed to provide a satisfactory approach.
■ In order to tackle the parallel processing problem, innovative solutions are urgently needed, which in turn require extensive co-development of hardware and software.
■ Advances in integrated circuit technology impose new challenges about how to implement a high-performance application with low power dissipation on processors made of hundreds of cores running at 200 MHz, rather than on one traditional processor running at 20 GHz.
■ The convergence of the high-performance and embedded industries.
Multicore or Manycore?
■ Multicore will obviously help multiprogrammed workloads, which contain a mix of independent sequential tasks, but how will individual tasks become faster?
■ Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance.
■ Multicore is unlikely to be the ideal answer, and sneaking up on the problem of parallelism via multicore solutions is likely to fail.
■ We desperately need a new solution for parallel hardware and software.
■ Compatibility with old binaries and C programs is valuable to industry, and some researchers are trying to help multicore product plans succeed.
■ We have been thinking bolder thoughts. Our aim is to realize thousands of processors on a chip for new applications, and we welcome new programming models and new architectures if they simplify the efficient programming of such highly parallel systems.
■ Rather than multicore, we are focused on “manycore”.
■ Between February 2005 and December 2006, a group of researchers at the University of California at Berkeley from many backgrounds (circuit design, computer architecture, massively parallel computing, computer-aided design, embedded h/w and s/w, programming languages, compilers, scientific programming, and numerical analysis) met to discuss parallelism from these many angles.
■ The result of borrowing the good ideas about parallelism from different disciplines is the report:
“The Landscape of Parallel Computing Research: A View from Berkeley”
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands,
Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams,
Katherine A. Yelick
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2006-183
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
December 18, 2006
The Landscape
■ Seven critical questions used to frame the landscape of parallel computing research:
1. What are the applications?
2. What are the common kernels of the applications?
3. What are the hardware building blocks?
4. How to connect them?
5. How to describe applications and kernels?
6. How to program the hardware?
7. How to measure success?
■ The report does not have the answers:
- on some questions, non-conventional and provocative perspectives are offered;
- on others, seemingly obvious but sometimes-neglected perspectives are stated.
Embedded versus High-Performance Computing
They have more in common looking forward than they did in the past:
1. Both are concerned with power, whether it is battery life for cell phones or the cost of electricity and cooling in a data center.
2. Both are concerned with hardware utilization. Embedded systems are always sensitive to cost, but efficient use of hardware is also required when you spend $10M to $100M on high-end servers.
3. As the size of embedded software increases over time, the fraction of hand tuning must be limited, and so the importance of software reuse must increase.
4. Since both embedded and high-end servers now connect to networks, both need to prevent unwanted accesses and viruses.
■ The biggest difference between the two targets is the traditional emphasis on real-time computing in embedded, where the computer and the program need to be just fast enough to meet the deadlines, and there is no benefit to running faster.
■ Running faster is usually valuable in server computing.
■ As server applications become more media-oriented, real time may become more important for server computing as well.
Information Society Technologies (IST)
Network of Excellence on High Performance Embedded Architectures and Compilers (HiPEAC)
Mateo Valero (UPC Barcelona), HiPEAC Coordinator, introducing the publication of the first HiPEAC research roadmap (*), wrote:
“From the document it is clear that there are many challenges ahead of us in the design of future high-performance embedded systems. Some of them are familiar, such as the memory wall, the power problem, and the interconnection bottleneck. Others are new, like the proper support for reconfigurable components, fast simulation techniques for multi-core systems, and new programming paradigms for parallel programming.”
(*) K. De Bosschere, W. Luk, X. Martorell, N. Navarro, M. O’Boyle, D. Pnevmatikatos, A. Ramirez, P. Sainrat, A. Seznec, P. Stenström, and O. Temam, “High-Performance Embedded Architecture and Compilation Roadmap”, Transactions on HiPEAC I, Lecture Notes in Computer Science 4050, pp. 5-29, Springer-Verlag, 2007
Parallelism
For at least three decades the promise of parallelism has fascinated researchers.
■ In the past, parallel computing efforts have shown promise and gathered investment, but in the end, uniprocessor computing always prevailed.
■ This time, general-purpose computing is taking an irreversible step toward parallel architectures.
● This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism.
● This plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures.
CW in Computer Architecture
Old & New Conventional Wisdom (CW) in Computer Architecture
guiding principles illustrating how everything is changing in computing
1. Old CW: Power is free, but transistors are expensive.
▪New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is,
we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power.
▪ New CW: For desktops and servers, static power due to leakage can be 40% of
total power.
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors
occurring only at the pins.
▪ New CW: As chips drop below 65 nm feature sizes, they will have high soft and
hard error rates.
4. Old CW: By building upon prior successes, we can continue to raise the level of
abstraction and hence the size of hardware designs.
▪ New CW: Wire delay, noise, cross coupling (capacitive and inductive),
manufacturing variability, reliability, clock jitter, design validation, and so on conspire
to stretch the development time and cost of large designs at 65 nm or smaller feature
sizes.
5. Old CW: Researchers demonstrate new architecture ideas by building chips.
▪New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer
Aided Design software to design such chips, and the cost of design for GHz clock
rates means researchers can no longer build believable prototypes. Thus, an
alternative approach to evaluating architectures must be developed.
6. Old CW: Performance improvements yield both lower latency and higher bandwidth.
▪ New CW: Across many technologies, bandwidth improves by at least the square of
the improvement in latency.
7. Old CW: Multiply is slow, but load and store is fast.
▪ New CW is the “Memory wall”: Load and store is slow, but multiply is fast. Modern
microprocessors can take 200 clocks to access Dynamic Random Access Memory
(DRAM), but even floating-point multiplies may take only four clock cycles.
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems.
▪ New CW is the “ILP wall”: There are diminishing returns on finding more ILP.
9. Old CW: Uniprocessor performance doubles every 18 months.
▪ New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. In 2006,
performance is a factor of three below the traditional doubling every 18 months that
we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may
now take 5 years.
10. Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
▪ New CW: It will be a very long wait for a faster sequential computer.
11. Old CW: Increasing clock frequency is the primary method of improving processor
performance.
▪ New CW: Increasing parallelism is the primary method of improving processor
performance.
12. Old CW: Less than linear scaling for a multiprocessor application is failure.
▪ New CW: Given the switch to parallel computing, any speedup via parallelism is a
success.
Uniprocessor Performance (SPECint)
[Figure: uniprocessor SPECint performance over time, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. Sea change in chip design: multiple “cores” or processors per chip. Growth rates: VAX 25%/year, 1978 to 1986; RISC + x86 52%/year, 1986 to 2002; RISC + x86 ??%/year, 2002 to present.]
The State of Hardware
■ The analysis based on the CW pairs paints a negative picture of the state of hardware.
■ There are compensating positives as well:
● Moore’s Law continues: it will soon be possible to put thousands of simple processors on a single, economical chip;
● very low latency and very high bandwidth are available for communication between these processors within a chip;
● monolithic manycore microprocessors
- represent a very different design point from traditional multichip multiprocessors,
- hold promise for the development of new architectures and programming models.
Applications and Dwarfs
■ Mining the parallelism experience of the high-performance computing community to see if there are lessons we can learn for a broader view of parallel computing.
The hypothesis
● is not that traditional scientific computing is the future of parallel computing;
● is that the body of knowledge created in building programs that run well on massively parallel computers may prove useful in parallelizing future applications.
■ Many of the authors from other areas, such as embedded computing, were surprised at how closely future applications in their domain mapped to problems in scientific computing.
■ The traditional way to guide and evaluate architecture innovation is to study a benchmark suite based on existing programs, such as EEMBC (Embedded Microprocessor Benchmark Consortium), SPEC (Standard Performance Evaluation Corporation), or SPLASH (Stanford Parallel Applications for Shared Memory).
■ It is currently unclear how best to express a parallel computation: a very big obstacle to innovation in parallel computing.
■ It seems unwise to let a set of existing source code drive an investigation into
parallel computing.
■ There is a need to find a higher level of abstraction for reasoning about
parallel application requirements.
■ The main aim is to delineate application requirements in a manner that is not
overly specific to individual applications or the optimizations used for certain
hardware platforms.
■ It is possible to draw broader conclusions about hardware requirements.
■ The approach is to define a number of “Dwarfs”, which each capture a pattern
of computation and communication common to a class of important
applications.
■ Phil Colella identified seven numerical methods that he believed would be important for science and engineering for at least the next decade.
■ The Seven Dwarfs
● constitute classes where membership in a class is defined by similarity in computation and data movement;
● are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of applications.
[Table: Seven Dwarfs, their descriptions, corresponding NAS benchmarks, and example computers.]
[Table: Extensions to the original Seven Dwarfs.]
Recognition, Mining, Synthesis (RMS)
Intel “Era of Tera” Computation Categories
Intel’s RMS and how it maps down to more primitive functions: of the five categories at the top of the figure, Computer Vision is classified as Recognition, Data Mining is Mining, and Rendering, Physical Simulation, and Financial Analytics are Synthesis. [Chen 2006]
Parallel Programming Models
Comparison of 10 current parallel programming models for 5 critical tasks, sorted from most explicit to most implicit. High-performance computing applications [Pancake and Bergmark 1990] and embedded applications [Shah et al. 2004a] suggest these tasks must be addressed one way or the other by a programming model: 1) dividing the application into parallel tasks; 2) mapping computational tasks to processing elements; 3) distribution of data to memory elements; 4) mapping of communication to the interconnection network; and 5) inter-task synchronization.
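Of the five critical tasks above, the most explicit programming models make task division (1) and inter-task synchronization (5) the programmer's job. A small hedged sketch in Python's standard library (function names are mine), dividing a dot product into chunks and synchronizing on the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def dot_chunk(a, b, lo, hi):
    # Task 1: the application is divided into independent chunks.
    return sum(a[i] * b[i] for i in range(lo, hi))

def parallel_dot(a, b, n_tasks=4):
    n = len(a)
    step = (n + n_tasks - 1) // n_tasks
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        # Task 2 (mapping tasks to processing elements) is delegated to the pool.
        futures = [pool.submit(dot_chunk, a, b, lo, min(lo + step, n))
                   for lo in range(0, n, step)]
        # Task 5: inter-task synchronization, waiting for every partial result.
        return sum(f.result() for f in futures)

a = list(range(8))
b = list(range(8))
print(parallel_dot(a, b))  # -> 140, same as the sequential dot product
```

Tasks 3 and 4 (data distribution and communication mapping) are invisible here because a shared-memory runtime handles them; in a distributed-memory model they too would be explicit.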
Limits of Performance of Dwarfs
Limits to the performance of dwarfs, inspired by a suggestion by IBM that a packaging technology could offer virtually infinite memory bandwidth. While the memory wall limits performance for almost half the dwarfs, memory latency is a bigger problem than memory bandwidth.
Transistor Integration Capacity
[Figure: transistor integration capacity]
Pollack’s Rule
[Figure: Pollack's Rule: single-core performance grows roughly as the square root of added complexity]
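Pollack's Rule says the performance gained from microarchitectural complexity grows only as the square root of the area spent. A small illustrative calculation with hypothetical area units, showing why many small cores can out-produce one large core on parallel work:

```python
import math

def core_performance(area, base_area=1.0, base_perf=1.0):
    # Pollack's Rule: single-core performance scales ~ sqrt(area).
    return base_perf * math.sqrt(area / base_area)

# One big core using 4 units of area:
big = core_performance(4.0)        # sqrt(4) = 2x base performance
# Four small cores of 1 unit each, assuming perfectly parallel work:
small = 4 * core_performance(1.0)  # 4 x 1 = 4x base throughput
print(big, small)  # -> 2.0 4.0
```

The comparison assumes perfectly parallel work; Amdahl's Law (below) is what erodes the small-core advantage on partially serial programs.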
Frequency and Power Consumption
[Figure: frequency and power consumption]
ManyCore System
[Figure: illustration of a manycore system]
Amdahl’s Law Limits Parallel Speedup
[Figure: Amdahl's Law limits parallel speedup]
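Amdahl's Law bounds the speedup of a program with serial fraction s on n processors: speedup(n) = 1 / (s + (1 - s)/n), approaching 1/s as n grows. A short numeric sketch:

```python
def amdahl_speedup(serial_fraction, n_cores):
    # Amdahl's Law: the serial fraction limits speedup no matter how
    # many cores are added.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even with only 5% serial code, 1000 cores give nowhere near 1000x:
print(round(amdahl_speedup(0.05, 1000), 1))  # -> 19.6
# The asymptotic limit is 1 / 0.05 = 20x.
```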
Core Performances
[Figure: performance of large, medium, and small cores]
Fine Grain Power Management
[Figure: fine-grain power management]
Network Power Estimate
[Figure: network power estimate]
Three Dimensional Interconnect With Stacking
[Figure: three-dimensional interconnect with stacking]
Assembly of 3D Memory
[Figure: assembly of 3D memory]
Recommended points from Berkeley
■ The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
■ The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS per watt, MIPS per area of silicon, and MIPS per development dollar.
■ Instead of traditional benchmarks, use the 13 “Dwarfs” to design and evaluate parallel programming models and architectures. A dwarf is an algorithmic method that captures a pattern of computation and communication.
■ “Autotuners” should play a larger role than conventional compilers in translating parallel programs.
■ To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
■ To be successful, programming models should be independent of the number of processors.
■ To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
■ Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.
■ Traditional operating systems will be deconstructed, and operating system functionality will be orchestrated using libraries and virtual machines.
■ To explore the design space rapidly, use system emulators based on FPGAs that are highly scalable and low cost.
Maybe they missed a key point, for example:
whenever possible, computational execution should happen in an asynchronous manner.
Because Asynchronous
■ Low power consumption,
… due to fine-grain clock gating and zero standby power consumption.
■ High operating speed,
… operating speed is determined by actual local latencies rather than global worst-case latency.
■ Less emission of electro-magnetic noise,
… the local clocks tend to tick at random points in time.
■ Robustness towards variations in supply voltage, temperature, and fabrication process parameters,
… timing is based on matched delays (and can even be insensitive to circuit and wire delays).
■ Better composability and modularity,
… because of the simple handshake interfaces and the local timing.
■ No clock distribution and clock skew problems,
… there is no global signal that needs to be distributed with minimal phase skew across the circuit.
Auto-tuners
Computational Model
■ Designing clever parallel hardware and then working out how to program it is a big mistake.
■ Designing parallel programming languages and then working out how to implement them is usually a mistake.
■ Developing the right computational model alongside languages and hardware is the key.
■ Think about systems, not just hardware or software.
■ There is lots of (possibly) relevant work, e.g.:
- Dataflow (Single Assignment)
- Graph Rewriting (Functional Languages)
- Bulk Synchronous Parallelism (BSP)
- Transactional Memory
■ Don’t ignore previous work, and particularly don’t re-invent the wheel!
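Of the models listed, Bulk Synchronous Parallelism is the easiest to sketch: workers alternate local computation with barrier synchronization, and data written in one superstep becomes visible to others only after the barrier. A minimal illustrative sketch (not a real BSP library; all names are mine):

```python
from threading import Barrier, Thread

# BSP-style sketch: each worker does local compute, then hits a barrier;
# communication between workers happens only across superstep boundaries.
N = 4
barrier = Barrier(N)
mailbox = [0] * N
result = [0] * N

def worker(rank):
    # Superstep 1: local computation, then publish to the shared mailbox.
    mailbox[rank] = rank + 1
    barrier.wait()  # end of superstep: all writes are now visible
    # Superstep 2: read a neighbour's value written in the previous superstep.
    result[rank] = mailbox[(rank + 1) % N]

threads = [Thread(target=worker, args=(r,)) for r in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(result)  # each rank sees its neighbour's value -> [2, 3, 4, 1]
```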
Language Effectiveness
[Figure: language effectiveness over time, 1970 to 2005, showing C, C++, and Java]
[Figure: language effectiveness versus Moore's Law, 1970 to 2005, log scale]
CISC Architecture
■ Huge effort has gone into improving the performance of the sequential instruction stream: pipelining, branch prediction, out-of-order execution, prefetching, renaming, speculative execution, and value prediction.
■ Complexity has grown unmanageable.
■ Even with 1 billion transistors on a chip, what more can be done?
TRIPS Prototype
[Figure: the TRIPS prototype]
Cyclops-64 Architecture
Cyclops-64 Programming Models and System Software Support
[Figure: the Cyclops-64 software stack: application programming APIs (Co-array Fortran, UPC+/-, EARTH-C+/-, OpenMP-XN, MPI) and an advanced execution/programming model layered over the Cyclops Thread Virtual Machine, which provides thread management, shared-memory operations, fine-grain multithreading, thread creation/termination and synchronization, dynamic memory management, async function invocation, fibers, put/get with sync, load balancing, scheduling, and location consistency; a tool chain (kcc/gcc, percolation compiler) plus simulation/emulation and analytical-modeling infrastructure; all on the Cyclops-64 ISA.]
[Figure: the Cyclops-64 chip: thread units (TU) with scratchpad memories (SP) share floating-point units (FPU) and connect through a 24x24 crossbar network to on-chip memory banks; an A-switch links the chip to off-chip memory (4 GB/sec ports), 1 Gbit/s Ethernet, an IDE HDD, and communication ports for the 3D mesh inter-chip network; 24 PC cards in one "shishkebab" scale toward a 1 PetaFlops system.]
hHLDS
The homogeneous High Level Dataflow System (hHLDS) model
Firing rules in the classical model
Let A = {a1, …, an} be the set of actors and L = {l1, …, ln} be the set of links.
A dataflow graph is a labelled directed graph G = (N, E), where N = A ∪ L is the set of nodes and E ⊆ (A × L) ∪ (L × A) is the set of edges.
Firing of an actor: a token on each input link and no token on each output link.
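A minimal sketch of this classical firing rule (class and link names are mine, not from the slides), evaluating a + b*c as a two-actor dataflow graph:

```python
class Actor:
    def __init__(self, fn, inputs, output):
        self.fn, self.inputs, self.output = fn, inputs, output

    def can_fire(self, tokens):
        # Classical firing rule: a token on each input link AND
        # no token on the output link.
        return all(l in tokens for l in self.inputs) and self.output not in tokens

    def fire(self, tokens):
        args = [tokens.pop(l) for l in self.inputs]  # consume input tokens
        tokens[self.output] = self.fn(*args)         # produce the output token

# Dataflow graph for a + b*c: links carry tokens, actors transform them.
mul = Actor(lambda x, y: x * y, ["b", "c"], "t")
add = Actor(lambda x, y: x + y, ["a", "t"], "out")

tokens = {"a": 1, "b": 2, "c": 3}
for actor in [mul, add]:  # data-driven: fire whichever actor is ready
    if actor.can_fire(tokens):
        actor.fire(tokens)
print(tokens["out"])  # -> 7
```

Execution order is driven purely by token availability, which is the essence of the dataflow model described here.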
The hHLDS model
Special actors in the classical model are characterized by having heterogeneous I/O conditions.
[Figure: the classical special actors Merge, Switch, Gate, and Decider, with their true/false control inputs and data links.]
In hHLDS, any actor has two input links and one output link, and consumes and produces only data tokens.
Firing of an actor: a token on each input link.
Effect: the actor consumes all input tokens and can produce a token on its output link.
[Figure: two example hHLDS graphs: one computes a + b*c with a multiply actor feeding an add actor; the other uses a comparison actor (b ≤ c) to conditionally produce a.]
Comparison between the two models
input (a, c)
b := 1;
repeat
    if a > 1 then a := a / 2
    else a := a * 5;
    b := b * 3;
until b = c;
output (d)
[Figure: (a) the classical dataflow graph of the program above, built with Switch, Merge, and Gate actors and deciders; (b) the equivalent hHLDS graph, with actors numbered 1-14, built only from homogeneous two-input/one-output actors.]
Dataflow Computational Model
[Figure: the dataflow computational model: initial values flow from memory through a graph of arithmetic actors, and results flow back to memory.]