Transcript Slide 1

CARMA: A Comprehensive Management
Framework for High-Performance
Reconfigurable Computing
Ian A. Troxel, Aju M. Jacob, Alan D. George,
Raj Subramaniyan, and Matthew A. Radlinski
High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering
University of Florida
Gainesville, FL
#197 MAPLD 2004
CARMA Motivation

Key missing pieces in RC for HPC
- Dynamic RC fabric discovery and management
- Coherent multitasking, multi-user environment
- Robust job scheduling and management
- Design for fault tolerance and scalability
- Heterogeneous system support
- Device-independent programming model
- Debug and system health monitoring
- System performance monitoring into the RC fabric
- Increased RC device and system usability

Our proposed Comprehensive Approach to Reconfigurable Management Architecture (CARMA) attempts to unify existing technologies as well as fill in the missing pieces.

[Figure: CARMA illustration (Holy Fire by Alex Grey)]
CARMA Framework Overview

CARMA seeks to integrate:
- Graphical user interface
- Flexible programming model
  - Handel-C, Impulse-C, Viva, System Generator, etc.
- COTS application mapper(s)
- Graph-based job description
  - DAGMan, Condensed Graphs, etc.
- Robust management tool
  - Distributed, scalable job scheduling
  - Checkpointing, rollback and recovery
  - Distributed configuration management
  - Multilevel monitoring service (GEMS)
    - Networks, hosts, and boards
    - Monitoring down into the RC fabric
- Device-independent middleware API
- Multiple types of RC boards
  - PCI (many), network-attached, Pilchard
- Multiple high-speed networks
  - SCI, Myrinet, GigE, InfiniBand, etc.

[Figure: CARMA node architecture - Applications, User Interface, Algorithm Mapping, RC Cluster Management, and Performance Monitoring sit above a middleware API on the COTS processor; an RC fabric API connects down into the RC fabric, and control and data networks link the RC node to other nodes]
Application Mapper Evaluation

Evaluating on the basis of ease of use, performance, hardware device independence, programming model, parallelization support, resource targeting, network support, stand-alone mapping, etc.

C-based tools
- Celoxica - SDK (Handel-C)
  - Provides access to in-house boards: ADM-XRC (x1), Tarari (x4), RC1000 (x4)
  - Good deal of success after lessons learned
- Impulse Accelerated Technologies - Impulse-C
  - Provides an option for hardware independence
  - Built upon open-source Streams-C from LANL
  - Supports ANSI standard C

Graphical tools
- StarBridge Systems - Viva
- Nallatech - Fuse / DIMEtalk
- Annapolis Micro Systems - CoreFire

Xilinx - ISE compulsory
- Hardware design focused
- Evaluating the role of JBits, System Generator, and XHWIF

[Figure: Streams-C screenshot, c/o LANL]

Evaluations still ongoing
- Programming model a fundamental issue to be addressed
CARMA Interface

Simple graphical user interface
- Preliminary basis for the graphical user interface via the Simple Web Interface Link Library (SWILL) from the University of Chicago*
- User view for authentication and job submission/status
- Administration view for system status and maintenance

Applications supported
- Single or multiple tasks per job (via CARMA DAGs**)
- CARMA registered (via CARMA API and DAGs) or not
  - Provides security, fault tolerance
- Sequential and parallel (hand-coded or via MPI)
- C-based application mappers supported

CARMA middleware API provides architecture independence
- Any code that can link to the CARMA API library can be executed (Handel-C and ADM-XRC API tested to date; a usage sketch follows below)
- Bit files must be registered with the CARMA Configuration Manager (CM)
- All other mappers can use "not CARMA registered" mode
- Plans for linking Streams/Impulse-C, System Generator, et al.

* http://systems.cs.uchicago.edu/swill/
** Similar to Condor DAGs
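As a rough illustration of the "CARMA registered" path, the sketch below shows how a host-side task might link against the CARMA API library to acquire a board, use a registered bit file, and return results. All identifiers here (carma_*, carma_api.h) are assumptions made for illustration, not the published CARMA interface.

/* Hypothetical host-side CARMA task: names below (carma_*) are assumptions
 * for illustration only, not the actual CARMA API. */
#include <stdio.h>
#include <stdlib.h>

#include "carma_api.h"   /* assumed CARMA middleware API header */

int main(void)
{
    carma_task_t task;

    /* Register this process with the local CARMA node so the JM can
     * track, checkpoint, and migrate it. */
    if (carma_task_init(&task, "blowfish_encrypt") != CARMA_OK)
        return EXIT_FAILURE;

    /* Ask for any board that has the named bit file registered with the
     * Configuration Manager; the CM handles caching and configuration. */
    carma_board_t board;
    if (carma_board_acquire(&task, "blowfish.bit", &board) != CARMA_OK) {
        carma_task_exit(&task, CARMA_FAILED);
        return EXIT_FAILURE;
    }

    /* Device-independent I/O through the BIM, regardless of board type. */
    unsigned int plaintext = 0xDEADBEEF, ciphertext = 0;
    carma_board_write(&board, /*offset=*/0, &plaintext, sizeof plaintext);
    carma_board_read(&board, /*offset=*/4, &ciphertext, sizeof ciphertext);
    printf("result: 0x%08X\n", ciphertext);

    carma_board_release(&board);
    carma_task_exit(&task, CARMA_OK);   /* results flow back to the JM */
    return EXIT_SUCCESS;
}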
CARMA User Interface

[Figure: screenshot of the CARMA user interface]
CARMA Job Manager (JM)

Completed first version of CARMA JM

Prototyping effort (CARMA interoperability)
- Task-based execution via Condor-like DAGs (a data-structure sketch follows below)
- Separate processes and message queues for fault tolerance
- Checkpointing enabled with rollback in progress
- Links to all other CARMA components
- Fully distributed multi-node operation with job/task migration
- Links to CARMA monitor and GEMS to make scheduling decisions
- Tradeoff studies and analyses underway

External extensions to COTS tools (COTS plug and play)
- Expand upon preliminary work @ GWU/GMU*
- Striving for "plug and play" approach to JM
- CARMA Monitor provides board info. (via ELIM)
- Working to link to CARMA CM
- Tradeoff studies and analyses underway
- Integration of other CARMA components in progress

[Figure: CARMA DAG example - tasks Hyper.1 through Hyper.5 with numbered ports, linked through File1, File2, and stdout]
[Figure: COTS job management approach, c/o GWU/GMU]

* Kris Gaj, Tarek El-Ghazawi, et al., "Effective Utilization and Reconfiguration of Distributed Hardware Resources Using Job Management Systems," Reconfigurable Architectures Workshop (RAW) 2003, Nice, France, April 2003.
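To make the Condor-like DAG idea concrete, below is a minimal sketch of how a job's tasks and dependencies (such as the Hyper.1 through Hyper.5 example) could be represented and how a ready task would be picked. The structures are illustrative assumptions, not the actual CARMA DAG format.

/* Minimal task-DAG sketch (illustrative only, not the CARMA DAG format).
 * A job is a set of tasks; each task lists the tasks it depends on, so the
 * JM can dispatch any task whose parents have all completed. */
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4

typedef enum { TASK_WAITING, TASK_RUNNING, TASK_DONE } task_state_t;

typedef struct task {
    const char   *name;              /* e.g. "Hyper.3" */
    const char   *executable;        /* host binary or registered bit file */
    bool          needs_rc_board;    /* true if an FPGA configuration is required */
    struct task  *deps[MAX_DEPS];    /* parent tasks that must finish first */
    int           ndeps;
    task_state_t  state;
} task_t;

/* A task is ready when it is still waiting and all of its parents are done. */
static bool task_ready(const task_t *t)
{
    if (t->state != TASK_WAITING)
        return false;
    for (int i = 0; i < t->ndeps; ++i)
        if (t->deps[i]->state != TASK_DONE)
            return false;
    return true;
}

/* One scheduling pass: return the next dispatchable task, or NULL if none. */
static task_t *next_ready_task(task_t *tasks, size_t ntasks)
{
    for (size_t i = 0; i < ntasks; ++i)
        if (task_ready(&tasks[i]))
            return &tasks[i];
    return NULL;
}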
CARMA CM Design

Builds upon previous design concepts*

Execution Manager (EM)
- Forks tasks from the JM and returns results to the JM
- Requests and releases configurations

Configuration Manager (CM)
- Manages configuration transport and caching
- Loads and unloads configurations via the BIM
- Allows for configuration temporal-locality benefits

Board Interface Module (BIM)
- Provides board independence (an interface sketch follows below)
- Configures and interfaces with a diverse set of RC boards
  - Numerous PCI-based boards
  - Various interfaces for network-attached RC
- Instantiated at startup; the CM spawns a BIM for each board
- Provides hardware independence to higher layers
- Separate BIM for each supported board
- Simple standard interface to boards for remote nodes

Communication module
- Handles all inter-node communication
- Enhances security by authenticating data and configurations

[Figure: CM design on the local node - the CM spawns a BIM for each RC board and uses the BIM (through the CARMA Board Interface Language and the board-specific API) to configure the board; the application and Execution Manager reach the CM via inter-process communication, and file transfers to remote nodes travel over the control network]

* U. of Glasgow (Rage), Imperial College (UK), U. of Washington, among others
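Since the BIM is what gives higher layers hardware independence, one way to picture it is as a small table of operations that each board-specific module implements. The header-style sketch below is a hedged illustration under that assumption, not the actual CARMA Board Interface Language.

/* Sketch of a board-independent interface that a BIM might expose to the
 * CM and EM; names and signatures are assumptions for illustration. */
#include <stddef.h>

typedef struct bim_ops {
    /* Open the device and prepare it for use (called when the CM spawns
     * the BIM for a board at startup). */
    int (*open)(void *ctx);

    /* Load an FPGA configuration (bit file) supplied by the CM. */
    int (*configure)(void *ctx, const void *bitfile, size_t len);

    /* Device-independent data movement used by application tasks. */
    int (*write)(void *ctx, unsigned offset, const void *buf, size_t len);
    int (*read)(void *ctx, unsigned offset, void *buf, size_t len);

    /* Unload the configuration and release the board. */
    int (*release)(void *ctx);
} bim_ops_t;

/* Each supported board (e.g. RC1000, ADM-XRC, Tarari) would provide its
 * own bim_ops_t instance wrapping its vendor API, so the layers above the
 * BIM never see board-specific calls. */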
Distributed CM Management Schemes

Four schemes considered for distributing configuration management (a lookup sketch contrasting CS and SPP follows the figure):
- Master-Worker (MW): jobs submitted "centrally"; global view of the system at all times
- Client-Server (CS): jobs submitted locally; global view of the system at all times; server houses configurations
- Client-Broker (CB): jobs submitted locally; global view of the system at all times; server brokers configurations (holds configuration pointers while configurations move between nodes)
- Simple Peer-to-Peer (SPP): jobs submitted locally; requests and configurations exchanged directly between peers

[Figure: node diagrams for the MW, CS, CB, and SPP schemes, showing the flow of tasks, requests, statistics, and configurations among the GJM, GRMAN, LJM, LRMAN, LRMON, LAPP, and LAPP MAP components on each node]

Note: more in-depth results for the distributed CM appeared at ERSA'04.
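To make the contrast between the server-based and peer-to-peer schemes more concrete, the sketch below shows one way a local configuration manager could locate a bit file under CS (a single request to the server that houses all configurations) versus SPP (querying peers until one has it cached). This is a conceptual sketch; all names and signatures are assumptions, not CARMA code.

/* Conceptual sketch (not CARMA source): locating a configuration under
 * the CS and SPP schemes above. */
#include <stddef.h>

typedef struct {
    const char *name;   /* configuration (bit file) identifier */
    const void *bits;
    size_t      len;
} config_t;

/* Client-Server: a single server houses every configuration, so a lookup is
 * one request; traffic stays low but the server becomes the bottleneck. */
static const config_t *cs_resolve(const char *name,
                                  const config_t *(*ask_server)(const char *))
{
    return ask_server(name);
}

/* Simple Peer-to-Peer: ask peers in turn until one has the configuration
 * cached; response time improves but network utilization grows. */
static const config_t *spp_resolve(const char *name, int npeers,
                                   const config_t *(*ask_peer)(int, const char *))
{
    for (int p = 0; p < npeers; ++p) {
        const config_t *c = ask_peer(p, name);
        if (c != NULL)
            return c;
    }
    return NULL;   /* no peer has it cached */
}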
CM System Recommendations

Scalability projected up to 4096 nodes
- Performed an analytic scalability analysis based on 16-node experimental results
  - Dual 2.4 GHz Xeons and a Tarari CPX2100 HPC board in a 64/66 PCI slot
  - Gigabit Ethernet and 5.3 Gbps Scalable Coherent Interface (SCI) as control and data networks, respectively
- A flat system of 4096 nodes has very high completion times (~5 minutes for SPP and ~83 hrs for CS)
- A layered hierarchy is needed for reasonable completion times (~2.5 sec for SPP over SPP at 4096 nodes)
- CS reduces network traffic by sacrificing response time; SPP improves response time by increasing network utilization

Recommended scheme by system size (number of nodes):

System Constraints | <8      | 8 to 32                  | 32 to 512                | 512 to 1024                | 1024 to 4096
Latency bound      | Flat CS | CS over CS, group size 4 | SPP over CS, group size 4 | SPP over SPP, group size 8 | SPP over SPP, group size 16
Bandwidth bound*   | Flat CS | CS over CS, group size 4 | CS over CS, group size 8  | SPP over CS, group size 8  | SPP over CS, group size 8
Best overall       | Flat CS | CS over CS, group size 4 | SPP over CS, group size 4 | SPP over CS, group size 8  | SPP over CS, group size 8

Conclusions
- CARMA CM design imposes very little overhead on the system
- A hierarchical scheme is needed to scale to systems of thousands of nodes (traditional MW will not work)
- Multiple servers for the CS scheme don't reduce the server bottleneck for system sizes greater than 32
- SPP over CS (group size 8) gives the best overall performance for systems larger than 512 nodes

* Schemes with completion latency values greater than 5 seconds excluded
CARMA Monitoring Services

Monitoring service
- Statistics Collector
  - Gathers local and remote information
  - Updates GEMS* and local values
- Round-Robin Database (RRD)
  - Maintains local information
  - Compact way to store performance logs (a ring-buffer sketch follows below)
- Query Processor
  - Processes task scheduling requests from the JM
  - Supports simple query interface
- CARMA Diagnostic
  - System watchdog alerts based on defined heuristics of failure conditions
  - Provides system monitoring and debug

Initial monitor version is complete
- Studying FPGA monitoring options
- Increasing the scheduling options
- Tradeoff studies and analyses underway

[Figure: CARMA monitoring data flow on a node - the JM, ExMan, ConMan, and the BIM for each FPGA board feed the Statistics Collector, which passes information through the Query Processor into the RRD; GEMS exchanges statistics with other nodes, and the CARMA Diagnostic tool sits on top (points A-E below)]

Initial CARMA Monitor Parameters
A) Stats from JM, ExMan, ConMan, BIM, Board
   - Dynamic statistics (push or pull)
   - Static statistics (pull)
B) Stats from remote nodes via GEMS
C) StatCollector passes info to the RRD from local and remote modules via the Query Processor
D) JM queries the RRD for resource information to make scheduling decisions
E) The CARMA diagnostic tool performs system administration, debug, and optimization

* Gossip-Enabled Monitoring Service (GEMS); developed by the HCS Lab for robust, scalable, multilevel monitoring of resource health and performance. For more info. see http://www.hcs.ufl.edu/gems
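The Round-Robin Database above stores performance logs compactly; assuming it behaves like the usual round-robin store (a fixed-size ring in which the newest sample overwrites the oldest), a minimal sketch could look like the following. All names and sizes are illustrative assumptions, not the CARMA monitor's actual data layout.

/* Minimal round-robin log sketch: a fixed-size ring of samples in which
 * the newest entry overwrites the oldest, so storage never grows. */
#include <time.h>

#define RRD_SLOTS 256            /* assumed capacity per monitored metric */

typedef struct {
    time_t when;
    double value;                 /* e.g. board utilization or queue length */
} rrd_sample_t;

typedef struct {
    rrd_sample_t slot[RRD_SLOTS];
    unsigned     next;            /* index that the next sample overwrites */
    unsigned     count;           /* number of valid samples (<= RRD_SLOTS) */
} rrd_t;

static void rrd_push(rrd_t *db, time_t when, double value)
{
    db->slot[db->next].when  = when;
    db->slot[db->next].value = value;
    db->next = (db->next + 1) % RRD_SLOTS;
    if (db->count < RRD_SLOTS)
        db->count++;
}

/* A simple query of the kind the JM might issue (point D above): the most
 * recently recorded value for this metric. */
static double rrd_latest(const rrd_t *db)
{
    unsigned last = (db->next + RRD_SLOTS - 1) % RRD_SLOTS;
    return db->count ? db->slot[last].value : 0.0;
}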
CARMA End-to-End Service Description

Functionality demonstrated to date
- Graphical user interface
- Job/task scheduling based on board requirements and configuration temporal locality
- Parallel and serial jobs
- CARMA registered and non-registered tasks
- Remote execution and result retrieval
- Configuration caching and management
- Mixed RC and "CPU-only" tasks
- Heterogeneous board execution (3 types thus far)
- System and RC device monitoring
- Inter-node communication via SCI or TCP/IP/GigE
- Fault-tolerant design
  - Processes can be restarted while running

Virtually no system impact from CARMA overhead, despite use of unoptimized code
- Less than 5 MB RAM per node
- Less than 0.1% processor utilization on a 2.4 GHz Xeon server
- Less than 200 Kbps network utilization

CARMA Execution Stages
1) User submits job
2) JM performs a task schedule request and the monitor replies with an execution location
3) JM forwards tasks to the local or remote ExMan
4) If a task requires an RC board, the ExMan sends a configuration request to the local CM
5) The CM finds the file and configures the board
6) The user's task is forked (runs on the processor)
7) Users access RC boards via the BIM
8) Task results are forwarded to the originating JM
9) Job results are forwarded to the originating user
Note: all modules update the monitor (a walk-through sketch of this flow follows below)

[Figure: CARMA node with UI, JM, ExMan, CM, Monitor, BIM, processor (uP), and RC fabric, annotated with execution stages 1-9]
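As a narrative aid only (not CARMA code), the small program below walks through the nine execution stages in order and prints each step; the conditional around stages 4 and 5 reflects that configuration is only needed when a task uses an RC board.

/* Toy walk-through of CARMA's nine execution stages (printout only). */
#include <stdio.h>
#include <stdbool.h>

static void stage(int n, const char *what) { printf("stage %d: %s\n", n, what); }

int main(void)
{
    bool needs_rc_board = true;   /* assume the submitted task uses an FPGA */

    stage(1, "user submits job through the UI");
    stage(2, "JM issues a task schedule request; monitor replies with a location");
    stage(3, "JM forwards the task to the local or remote ExMan");

    if (needs_rc_board) {
        stage(4, "ExMan sends a configuration request to the local CM");
        stage(5, "CM finds the bit file and configures the board");
    }

    stage(6, "the user's task is forked and runs on the processor");
    stage(7, "the task accesses the RC board through the BIM");
    stage(8, "task results are forwarded to the originating JM");
    stage(9, "job results are forwarded to the originating user");

    puts("note: all modules update the monitor throughout");
    return 0;
}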
CARMA Framework Verification

Several test jobs executed concurrently
- Parallel N-Queens Test, composed of
  - ADD.exe, a "CPU-only" task to add two numbers
  - NQueens.bit, an RC1000 task to calculate a subset of the total number of solutions for an N×N board
  - 4 RC1000s and 4 Tararis communicating via MPI
- Parallel Add Test, composed of
  - ADD.exe, a "CPU-only" task to add two numbers
  - AddOne.bit, an RC task to increment an input value
- Parallel Sieve of Eratosthenes (on Tarari)
- Parallel Monte Carlo Pi Generator (on Tarari)
- Blowfish encrypt/decrypt (on ADM-XRC)

[Figure: Example system setup - four Xeon servers linked by TCP/IP (requests, tasks, and results) and SCI (configuration files); each node runs the CARMA modules (UI, JM, ExMan, CM, Monitor, BIM), with Tarari, RC1000, and ADM-XRC boards, a configuration store, and GEMS distributed among them; jobs enter through the SWILL interface, and small DAGs illustrate the Parallel Add and N-Queens tests]

These simple applications were used to test CARMA's functionality, while CARMA's services have wider applicability to problems of greater size and complexity.
Conclusions

First working version of CARMA complete and tested

Numerous features supported
- Simple GUI front-end interface
- Coherent multitasking, multi-user environment
- Dynamic RC fabric discovery and management
- Robust job scheduling and management
- Fault-tolerant and scalable services by design
- Performance monitoring down into the RC fabric
- Heterogeneous board support with hardware independence
- Linking to COTS job management service

Initial testing shows the framework to be sound, with very little overhead imposed upon the system
Future Work and Acknowledgements

Continue to fill in additional CARMA features
- Include support for other boards, application mappers, and languages
- Complete the JM rollback feature and finish linkage to LSF
- Include broker and caching mechanisms for the peer-to-peer distributed CM scheme
- Include more intelligent scheduling algorithms (e.g., Last Release Time)
- Expand RC device monitoring and include debug and optimization mechanisms
- Enhance security, including secure data transfer and authentication
- Deploy on a large-scale test facility

Develop CARMA instantiations for other RC domains
- Distributed shared-memory machines with RC (e.g., SGI Altix)
- Embedded RC systems (e.g., satellite/aircraft systems, munitions)

We wish to thank the following for supporting this research:
- Department of Defense
- Xilinx
- Celoxica
- Alpha Data
- Tarari
- Key vendors of our HPC cluster resources (Intel, AMD, Cisco, Nortel)