Transcript sc08

Global Trees: A Framework for Linked
Data Structures on Distributed Memory
Parallel Systems
D. Brian Larkins, James Dinan, Sriram Krishnamoorthy,
Srinivasan Parthasarthy, Atanas Rountev, P. Sadayappan
Background
• Trees and graphs can concisely represent
relationships between data
• Data sets are becoming increasingly large
and can require compute-intensive
processing
• Developing efficient, memory hierarchyaware applications is hard
Sample Applications
• n-body simulation
• Fast Multipole Methods (FMM)
• multiresolution analysis
• clustering and classification
• frequent pattern mining
Key Contributions
• Efficient fine-grained data access with a
global view of data
• Exploit linked structure to provide fast
global pointer dereferencing
• High-level, locality-aware, parallel
operations on linked data structures
• Application-driven customization
• Empirical validation of the approach
Framework Design
Global Chunk Layer (GCL)
• API and run-time library for managing chunks
- built on ARMCI
• Abstracts common functionality for handling
irregular, linked data
• Provides a global namespace with access
and modification operations
• Extensible and highly customizable to
maximize functionality and performance
Chunks
• A chunk is:
• Contiguous memory segment
• Globally accessible
• Physically local to only one process
• Collection of user-defined elements
• Unit of data transfer
Programming Model
• SPMD with MIMD-style parallelism
• Global pointers permit fine-grained
access
• Chunks allow coarse-grained data
movement
• Uses get/compute/put model for globally
shared data access
• Provides both uniform global view and
chunked global view of data
Global Pointers
c
}
}
p
c = &p.child[i] + p.child[i].ci + p.child[i].no
4252
+
-4252
+
4340
Global Trees (GT)
•
Run-time library and API for global view
programming trees on DM clusters
•
Built on GCL chunk communication framework
•
High-level tree operations which work in parallel and
are locality aware
•
Each process can asynchronously access any
portion of the shared tree structure
GT Concepts
• Tree Groups
•
set of global trees
•
allocations are made from the same chunk pool
• Global Node Pointers
• Tree Nodes
•
link structure managed by GT
•
body is user-defined structure
Example: Copying a Tree
Tree Traversals
• GT provides optimized, parallel
traversals for common traversal orders
• Visitor callbacks are application-defined
computations on a single node
• GT currently provides top-down, bottomup, and level-wise traversals
Sample Traversal Usage
Node Mapping
Custom Allocation
• No single mapping of data elements to
chunks will be optimal
• GT/GCL supports custom allocators to
improve spatial locality
• Allocators can use a hint from call-site
and can keep state between calls
• Default allocation is local-open
Experimental Results
• Evaluate using:
• Barnes-Hut from SPLASH-2
• Compression operation from
MADNESS
• GT compared with:
• Intel’s Cluster OpenMP and
TreadMarks runtime
• UPC
Global Pointer Overhead
compress()
Barnes-Hut
Chunk Size and Bandwidth
Experiments run on the department WCI Cluster - 2.33GHz Intel Xeon, 6GB RAM, Infiniband
Impact of Chunk Utilization
Barnes-Hut
Experiments run on the department WCI Cluster - 2.33GHz Intel Xeon, 6GB RAM, Infiniband
Barnes-Hut Chunk Size
Selection
Barnes-Hut application from SPLASH-2 suite
Barnes-Hut Scaling
chunk size = 256, bodies = 512k
Local vs. Remote Access
MADNESS compress()
Related Approaches
•
Distributed Shared Memory (DSM)
•
•
Cluster OpenMP, TreadMarks, Cashmere, Midway
Distributed Shared Objects (DSO)
•
•
Charm++, Linda, Orca
Partitioned Global Address Space (PGAS) Languages and Systems
•
•
UPC, Titanium, CAF, ARMCI, SHMEM, GASNET
Shared pointer-based data structure support on distributed memory clusters
•
•
Parallel Irregular Trees, Olden
HPCS Programming Languages
•
Chapel, X10
Future Work
• Global Graphs
• GT data reference locality tools
• More applications
Questions
email: [email protected]