OS-iesp-roadmap

Technology Drivers
• Traditional HPC application drivers
– OS noise, resource monitoring and management, memory footprint
– Complexity of resources to be managed
• New and evolving programming models
– Shifting emphasis from managing cycles to managing data
– Programming models require more access to resource management decisions
– Hybrid/Mixed programming models (composing applications)
• Node and Memory structures
– On-node RAM, DRAM, Flash
– Stacked memory (performance implications for different access patterns)
– Explicit cache/hierarchy management
– On-node interconnect
– Heterogeneous cores
– On-node power management
• Global structures
– Global address space
– Integration of collectives, esp. synchronization
• Resilience (soft errors and damaged cores)
• HPC OS Sustainability
Increasing importance and complexity of resource management
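As a rough illustration of why OS noise is listed as a driver: in a bulk-synchronous application every step ends at a barrier, so the step takes as long as the slowest node. The sketch below uses a deliberately simple model that is an assumption, not something taken from the slides: each node independently suffers one noise event of fixed length with a small probability per step. The function name and all parameter values are hypothetical.

```python
def expected_step_time(nodes, work=1.0, p_noise=0.001, delay=0.5):
    """Expected duration of one barrier-synchronized step.

    Assumed model (illustrative only): each node independently suffers
    one noise event of length `delay` with probability `p_noise` during
    the step, and the step ends when the slowest node reaches the barrier.
    """
    # Probability that at least one of the nodes is delayed.
    p_any = 1.0 - (1.0 - p_noise) ** nodes
    return work + delay * p_any


if __name__ == "__main__":
    for n in (1, 100, 10_000, 1_000_000):
        t = expected_step_time(n)
        print(f"{n:>9} nodes: expected step time {t:.4f}")
```

Under this model a per-node disturbance that is negligible on one node becomes almost certain to hit some node at large scale, so the expected step time approaches `work + delay`.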
Alternate R&D Strategies
• Evolve an existing OS
– Linux, Plan 9, IBM CNK, Kitten
• Start with an empty emacs buffer
• Steal components from existing operating systems
• Partitioning resources – independent management within a partition
– Composability
• Collective/Global OS
– Global address space?
It’s time to define the winner
Research Agenda
• HPC Community OS
– Define basic structure
– Individual groups work on components
• Expose management of critical resources
• Simulation to evaluate scalability of resource management strategies
• Enable co-design of hardware to support resource management
• Define and implement OS mechanisms that will enable global, autonomic runtime systems
Priority Research Direction:
Community OS Framework for HPC Systems
Key challenges
1. HPC applications have unique resource management needs (e.g., memory layout)
2. Anticipated rapid evolution/revolution in architectures and programming models
3. Limited ability to innovate in existing commodity operating systems
4. Sustainability of HPC OS is difficult
Summary of research direction
1. Develop an OS framework specific to the needs of HPC
2. Open system architecture that exposes the management of critical resources
3. Empower developers of libraries and runtime systems
Potential impact on software component
1. This will enable full access to hardware resources
2. Common foundation for libraries and runtime environments
Potential impact on usability, capability, and breadth of community
1. Context for individual innovation and contribution
2. Timeframe: 2-3 years
Priority Research Direction:
Scalable System Simulation
Key challenges
1. Inability to conduct “apples to apples” comparisons in scalable resource management
2. Evolution / revolution in new systems
3. Wide variety of existing simulators
Summary of research direction
1. Develop a scalable, full system simulation capability
2. Address multi-scale challenges
3. Adapt techniques that have been used in other branches of computational science
4. Develop common interfaces between simulators
Potential impact on software component
1. Ability to evaluate resource management mechanisms and policies at scale
2. Enable architecture/OS co-design
Potential impact on usability, capability, and breadth of community
1. Critical for the OS research/development community
2. Important for runtime community
3. Timeframe: 2-4 years
Priority Research Direction:
Open System APIs
Key challenges
1. Communication management
2. Thread management
3. Memory management
4. Power management
5. Resilience (fault/failure isolation/management)
Summary of research direction
1. Develop community-based APIs to expose critical resources
2. Develop prototype runtime environments for common programming models
Potential impact on software component
1. Provides a fixed point for innovation in API implementation and innovation in the implementation of runtimes (hourglass principle)
2. Differentiation based on performance, not functionality
Potential impact on usability, capability, and breadth of community
1. Critical for supporting the development of new programming models
2. Critical for enabling the development of new architectures
3. Timeframe: 3 to 8 years
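The hourglass principle mentioned on this slide can be sketched as a narrow, fixed interface with freedom to innovate on both sides of it. Everything named below (`NodeResources`, `pin_thread`, `set_power_cap`, `RecordingResources`, `round_robin_runtime`) is hypothetical and invented for illustration. The slides do not define any concrete API, and a real community API would cover far more than two calls.

```python
from abc import ABC, abstractmethod


class NodeResources(ABC):
    """Hypothetical narrow-waist API exposing two node resources."""

    @abstractmethod
    def pin_thread(self, thread_id: int, core: int) -> None:
        """Request that a thread run on a given core."""

    @abstractmethod
    def set_power_cap(self, watts: float) -> None:
        """Request a node-level power limit."""


class RecordingResources(NodeResources):
    """Toy implementation below the waist: it only records requests."""

    def __init__(self) -> None:
        self.placement: dict[int, int] = {}
        self.power_cap: float | None = None

    def pin_thread(self, thread_id: int, core: int) -> None:
        self.placement[thread_id] = core

    def set_power_cap(self, watts: float) -> None:
        self.power_cap = watts


def round_robin_runtime(res: NodeResources, n_threads: int, n_cores: int) -> None:
    """Toy runtime above the waist: it depends only on the interface."""
    for t in range(n_threads):
        res.pin_thread(t, t % n_cores)


if __name__ == "__main__":
    res = RecordingResources()
    round_robin_runtime(res, n_threads=5, n_cores=2)
    print(res.placement)
```

Because the runtime depends only on `NodeResources`, a different implementation could be substituted without changing it, which is the sense in which implementations would differ on performance rather than functionality.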
4.1 Operating Systems
A Community HPC OS
[Figure: roadmap timeline, 2010 to 2019; milestones listed below, year placement not recoverable]
• Community OS Framework
• Prototype implementation of OS Framework
• Robust, Scalable System Simulation
• APIs for energy management
• API for node resilience
• Next Generation Interconnect API
• Runtime Environments enabled
• Autonomic runtime systems