Transcript A Library
DGrid: A Library of Large-Scale
Distributed Spatial Data Structures
Pieter Hooimeijer, 2006-05-02
Motivation
• DGrid was designed to:
– Support very large sets of dynamic point data
(i.e. points that move unpredictably).
– Offer flexible trade-offs between the cost of
search operations and the cost of updates.
– Run on parallel and distributed systems.
2
Spatial Data Structures
• Definition—Any data structure that holds:
- points
- rectangles
- lines
- polygons
- curves
- etc.
• They are typically optimized for a particular
type of search operation.
• We’ll focus on point data.
3
• Commonly used example: The Quadtree.
– Works like a binary search tree.
– Each node in the tree has four children: NE,
SE, SW, NW.
– This is called ‘Recursive Decomposition’
4
• The quadtree implementation in DGrid works
like this:
A (1,3)
B (1,2)
C (2,0)
• This is a ‘bottom-up Matrix (MX) Quadtree.’
(section 3.1.2)
5
• Let’s do a search on that quadtree:
• We ruled out the ‘entire’ NE quadrant at
the root level of the tree.
6
• Trade-offs for bottom-up MX Quadtree,
compared to other tree data structures:
– The shape of the tree does not depend on
the insertion order.
– No need to balance the tree (which would be
expensive).
– Insertion and deletion are cheaper for
clustered data.
7
C++ Templates
• They look like this:
template <typename ItemType>
class vector {
// ...
};
• In this case, a separate vector< > class is
generated for each item type.
// Strong type-checking using templates:
MyType * a = someVector.get(5);
// instead of:
MyType * a = (MyType)someVector.get(5);
8
• Turns out, C++ templates are a crude
functional programming language.
• Why ‘crude?’
// C++
{- Haskell
Templates
-}
template
fact 1< =int
1 N>
struct
factfact
n = {n * fact (n - 1)
static const int value = N * fact <N - 1 >:: value;
};
template < >
struct fact <1 > {
static const int value = 1;
};
• This is ‘executed’ by the compiler!
9
• This is called template metaprogramming.
• It’s used extensively in DGrid, to make it:
– easier to use;
– faster;
– type safe.
10
Distributed Data
• DGrid uses Message Passing Interface (MPI)
to run on distributed systems.
– MPI is a library of basic ‘send’ and ‘receive’
operations.
– Each processor gets a unique ID (‘rank’).
– Use if-statements to run different code on
different processes.
11
12
DGrid
• DGrid has these data structures:
– Two types of 2D arrays.
– A quadtree.
– A distributed data structure.
– A location class.
• Allows nesting of these data structures.
13
• Let’s see some examples of nested data
structures:
– A 2D array of quadtrees
(implied: the quadtree contains locations).
– A quadtree of small 2D arrays.
– A 2D array of 2D arrays.
• Called ‘tiling.’
• A lot like a ‘shallow’ quadtree.
14
• DGrid uses the Composite Design Pattern:
location< >
DataStructure
15
• DGrid uses templates instead of a ‘Component’
interface.
• The result is that the user can do this:
using namespace dgrid::tags;
typedef dgrid::dgrid<MyItem,
partial_grid_tag<
quadtree_tag< > > > bucket;
bucket a(0, 0, 639, 639, tiles(64, 64) <<
tiles(1, 1));
• This is the ‘2D array of quadtrees’
example.
16
• Important: The definition of the data structure
is a type!
• Consequences:
– Can check parameters at compile time.
(Must be tiles( ) << tiles ( ) for this example,
or it won’t compile.)
– Compiler can optimize extensively (it knows which
functions are going to call each other).
– Can’t define a type at runtime, so ‘composition’
must be known at compile time.
17
• Data structure operations:
– insert(x, y, item) – add item at (x, y)
– delete(x, y, item) – remove item from (x, y)
– get(x, y, some_list) – get all items at (x, y)
– get_range(x0, y0, x1, y1) – get all items in the
range [ (x0, y0) : (x1, y1) ]
• Note: even the location class must support
these operations.
18
• In a nested data structure, operations are
passed on from level to level.
• Because the types are known at compile
time, these calls can be inlined.
– Pro: eliminates the overhead of the function
call.
– Con: code size increases (function body is
repeated).
19
20
Future Work
• Add more data structures, more search
operations.
• Separate interface further from
implementation.
• (Dynamic) Load Balancing.
21