Stamatis Vassiliadis Symposium
The Future of Computing
A+A=A
Mateo Valero
Barcelona Supercomputing Center
To Stamatis,
my loved friend
The way we all do research ... As seen from HPCA 1999
• Microarchitecture idea
• Applications: SPEC, PerfectClub, TPC-D, NAS, Splash …
• Compiler: production, public, custom, …
• Simulator: public, custom, …
• Results: how much we get from our idea
The Past Future ... As seen from HPCA 1999
Absolutely obsessed with going to the limits of extracting available ILP on a single core
• Applications
• Algorithms
• Compiler
• Architecture
• Hardware
The Past Future Continued:
Advanced ILP Techniques for Superscalar Processors
• Optimized Pipeline
• Cache
• Branch Predictors
• Instruction Collapsing
• Value Prediction
• Reuse
• Assisted/Subordinated Threads
• Trace Cache/Processor
• Control/Data Speculation
• Kilo-instruction Processors
• ………
Distant Parallelism: Non-numerical applications
• (In)Dependent threads: e.g. m88ksim
• Application speed-up: 2.65
[Figure: m88ksim thread structure. Labels: check_issue, kill_time, TIMING, statistics, cmmutime, Real_execution, EXE, Sbus2, breakpoint?, FETCH, PC guess, fetch_next]
The “immediate” future: Number of cores doubled every 18 months
“It is better for Intel to get involved in this now so when we get to the
point of having 10s and 100s of cores we will have the answers.
There is a lot of architecture work to do to release the potential,
and we will not bring these products to market until we have good solutions to the programming problem”
Justin Rattner, Intel CTO
Marenostrum
• Most beautiful supercomputer (Fortune magazine, Sept. 2006)
• #1 in Europe, #5 in the World
• 100's of TeraFlops with general purpose Linux supercluster of commodity PowerPC-based Blade Servers
“Now, the grains inside these machines more and more will be multi-core type devices,
and so the idea of parallelization won't just be at the individual chip level,
even inside that chip we need to explore new techniques like transactional memory
that will allow us to get the full benefit of all those transistors
and map that into higher and higher performance.”
Bill Gates, Supercomputing 05 keynote
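As a rough check on the headline claim of this slide (number of cores doubled every 18 months), the growth can be written as a simple exponential. The starting count N_0 and the 10-year horizon (2007 to 2017) are assumptions chosen for illustration only; they are not figures from the talk.

```latex
% Cores after t years, if the count doubles every 1.5 years
N(t) = N_0 \cdot 2^{t/1.5}

% Growth factor over 10 years
\frac{N(10)}{N_0} = 2^{10/1.5} = 2^{6.67} \approx 102

% Hypothetical example: N_0 = 4 cores per chip
N(10) \approx 4 \times 102 \approx 406
```

Under these assumptions a chip moves from a handful of cores to hundreds within a decade, which is consistent with the "10s and 100s of cores" in the Rattner quote.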
Supercomputers will likely have millions of processing cores
The “far” future (e.g. 2017) and The big question!
How to solve the programming problem? a.k.a. How to program the beast?
• How to enable the power of the hundreds to millions of cores on a system?
• Computer Architects must adapt their thinking. From now on, parallel software requirements will directly drive systems design
• We need a multidisciplinary top-down approach to this, including
  • Applications
  • Algorithms
  • Debugging
  • Programming models
  • Programming languages
  • Compilers
  • Operating Systems
  • Runtime environment
  … as design drivers for future Architectures
The holistic view: A + A = A
How to solve the programming problem? a.k.a. How to program the beast?
• How to enable the power of the hundreds to millions of cores on a system?
• Computer Architects must adapt their thinking. From now on, many-core software requirements will directly drive processor design
• We need a multidisciplinary top-down approach to this, including
  • Applications
  • Algorithms
  • Debugging
  • Programming models
  • Programming languages
  • Compilers
  • Operating Systems
  • Runtime environment
  … as design drivers
Applirithms + Adhesive = Architecture
Far Future: Applications
• What will be the typical applications in 2017?
• Is Dwarfs and/versus RMS the right path to follow?
• Applications are ephemeral but the kernels are forever: the applications may change, the kernels stay the same.
• Will streaming applications require new architectures?
• Are we approaching special-purpose accelerators for specific applications?
M. Valero. Microsoft Workshop on Multicore, Seattle, June-2007
Far Future: Algorithms
• Bad news (for some folks): “Rethink and rewrite the algorithms”
• For manycores, the algorithms need to carefully consider:
  • The right level of parallelism
  • Load Balancing
  • Communication-Computation overlapping
  • Speculation (e.g. in message passing)
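One of the items above, communication-computation overlapping, can be sketched in a few lines. This is a minimal illustration rather than anything taken from the talk: `fetch_chunk` and `compute` are hypothetical stand-ins for a message receive and a local kernel, and the 10 ms delay is an invented latency.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_chunk(i):
    """Hypothetical stand-in for communication: 'receive' chunk i."""
    time.sleep(0.01)  # invented network latency
    return list(range(i * 4, i * 4 + 4))


def compute(chunk):
    """Hypothetical stand-in for the local computation on one chunk."""
    return sum(x * x for x in chunk)


def overlapped_sum(num_chunks):
    """Start the transfer of chunk i+1 before computing on chunk i."""
    total = 0
    if num_chunks == 0:
        return total
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(fetch_chunk, 0)
        for i in range(num_chunks):
            chunk = pending.result()  # wait for the current chunk
            if i + 1 < num_chunks:
                # Next transfer proceeds in the background...
                pending = comm.submit(fetch_chunk, i + 1)
            total += compute(chunk)  # ...while this computation runs
    return total


print(overlapped_sum(3))  # 506, the sum of squares of 0..11
```

The same double-buffering pattern is what non-blocking message passing is typically used for; here a single background thread plays the role of the network.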
Microsoft Workshop on Multicore, Seattle, June-2007
Source: Jack Dongarra
Top-Down CMP Design, an initial programmer wishlist
• Easy-to-express parallelism
  • Transactional Memory (TM): compared to locks, TM provides an easy-to-use mechanism for ensuring mutual exclusion
  • Hide all kinds of non-uniformities from the programmer (heterogeneous cores, non-uniform memory access, …)
• Continue using standard tools
  • OpenMP: the industry standard for writing parallel programs on shared memory
  • TM and OpenMP combine ease with familiarity for programming multi-cores
    • BSC-UPC-Microsoft: IWOMP07, MEDEA07
    • Stanford: PACT07
• Dataflow model ideally suited to express parallelism
  • Cell Superscalar = Distant Parallelism + Data Flow + Out of Order Execution
• Supercomputers: MPI + (OpenMP/Cell Superscalar) + TM
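The dataflow idea in the wishlist (tasks run as soon as their inputs exist, possibly out of program order) can be illustrated with a toy scheduler. This is only a sketch under stated assumptions: `run_dataflow` and the task tuples are hypothetical names, not the Cell Superscalar API, and each data name is assumed to be written exactly once, so only true dependencies are tracked.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_dataflow(tasks, data, workers=4):
    """Run tasks given as (func, input_names, output_name) tuples.

    A task is launched as soon as all of its inputs are present in `data`,
    regardless of its position in the list.
    """
    pending = list(tasks)
    running = {}  # future -> name of the value it will produce
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending or running:
            ready = [t for t in pending if all(n in data for n in t[1])]
            for task in ready:
                pending.remove(task)
                func, inputs, output = task
                fut = pool.submit(func, *(data[n] for n in inputs))
                running[fut] = output
            if not running:
                raise ValueError("some task inputs are never produced")
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                data[running.pop(fut)] = fut.result()
    return data


# The first two tasks are independent and may run in parallel;
# the third waits for both of their outputs.
tasks = [
    (lambda a, b: a + b, ("x", "y"), "s"),
    (lambda a, b: a * b, ("x", "y"), "p"),
    (lambda s, p: s - p, ("s", "p"), "r"),
]
result = run_dataflow(tasks, {"x": 3, "y": 4})
print(result["r"])  # -5
```

The programmer states only what each task reads and writes; the ordering and the parallelism are derived by the runtime, which is the property the slide credits to the dataflow model.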
Chip organization in 2017: many-core
• Will they be homogeneous or heterogeneous? Arrays of simple in-order cores, fewer complex out-of-order cores, or a mix of the two? Consentry and Internet Security
• Is Simultaneous Multithreading just for servers?
• Should we push for further optimizing classical OoO implementations or research how to put into practical use radical new approaches such as dataflow or asynchronous architectures?
• How many cores will the processor of 2017 have?
[Figure: chip organization diagram. Labels: Cache, On-die Interconnect, Off-die Interconnect, Memory]
Microsoft Workshop on Multicore, Seattle, June-2007
Chip organization in 2017: memory and interconnection network
• How will the latency and bandwidth problems be addressed?
• 3D integration aware Computer Architecture: it is a great future idea. Will it always be a great future idea?
• What is the best many-core interconnect topology?
• How can we evaluate the importance of the interconnection network in the applications?
• What obstacles do parallel applications face when I/O doesn't scale well?
Microsoft Workshop on Multicore, Seattle, June-2007
An overall picture of the Microsoft Many-core project
[Diagram. Layers: Applications, Programming model, Architecture. Topics: Programming models for future many-core architectures (Functional, Imperative); Transactional Memory (STM, HTM, OpenMP+TM); HW acceleration for Haskell; Architectural support to programming models; Many-core architecture; Power-aware]
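Transactional Memory appears throughout this project picture (STM, HTM, OpenMP+TM). The optimistic execute, validate, retry idea behind it can be shown on a single variable. This is a toy sketch, not how any real STM or HTM is built: `TVar` and `atomically` are illustrative names, real systems track read and write sets over many locations, and a small internal lock is used here only to make the snapshot and commit steps atomic.

```python
import threading


class TVar:
    """Toy transactional variable: a value plus a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._guard = threading.Lock()  # protects snapshot and commit only


def atomically(tvar, update):
    """Apply update(old) -> new optimistically; re-execute on conflict."""
    while True:
        with tvar._guard:
            seen, old = tvar.version, tvar.value  # consistent snapshot
        new = update(old)  # speculative work, done without the guard
        with tvar._guard:
            if tvar.version == seen:  # nobody committed in between
                tvar.value, tvar.version = new, seen + 1
                return new
        # Conflict: another transaction committed first, so try again.


counter = TVar(0)


def worker():
    for _ in range(500):
        atomically(counter, lambda v: v + 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 2000: each increment commits exactly once
```

The caller writes only the update function and never names a lock, which is the ease-of-use argument the wishlist slide makes for TM over explicit locking.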
An overall picture of the IBM MareIncognito project
• Our 10-100 Petaflop research project for BSC (2010)
• Port/develop applications to reduce time-to-production once installed
• Programming models (MPI, OpenMP+TM, CellSs)
• Tools for application development and to support previous evaluations
• Evaluate node architecture (heavily multicored)
• Evaluate interconnect options
[Diagram labels: Application development and tuning; Performance analysis and prediction tools; Model and prototype; Interconnect; Fine-grain programming models; Load balancing; Processor and node]
Supercomputing and e-Science Consolider program
• 5 Grand Challenge applications
• 22 groups
• 119 senior researchers
[Diagram. Application areas: Life Sciences, Earth Sciences, Astrophysics, Engineering, Material Sciences. Computing areas: Compilers and tuning of application kernels; Programming models and performance tuning tools; Architectures and hardware technologies. Legend: Strong interaction; Interaction to be created]
Education for multi-core
[Image labels: “Multicore-based pacifier”; “I” / “programming multicores”]