Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs
Speaker: Dr. Jianting Zhang
Dept. of Computer Science,
The City College of New York & CUNY Graduate Center
Host: Dr. Dali Wang
[Figure: Ecological Informatics at the intersection of Computer Science (Spatial Databases, Data Mining), Geography (GIS Applications, Remote Sensing), and Environmental Sciences]
A conceptual framework of high-performance
environmental data management
[Figure: conceptual framework connecting high-resolution satellite imagery, in-situ observation sensor data, and global and regional climate model outputs with ecological, environmental, and administrative zones through temporal trends, ROIs, zonal statistics, and data assimilation on a high-end computing facility]
Why GPU Computing?
Today’s GPUs closely resemble supercomputer architectures of the past and are essential building blocks of future exascale computing
ASCI Red (1997): the first system to sustain 1 Teraflops, built from 9,298 Intel Pentium II Xeon processors in 72 cabinets
$$$? Space? Power?
Nvidia GTX Titan (Feb. 2013):
•7.1 billion transistors (551 mm²)
•2,688 processors (cores)
•Max bandwidth 288.4 GB/s
•PCI-E peripheral device
•250 W (17.98 GFLOPS/W, single precision)
•Suggested retail price: $999
Zonal Statistics
Min/max/avg/median/…
SQL:
SELECT COUNT(*) FROM T_O, T_Z
WHERE ST_WITHIN(T_O.the_geom, T_Z.the_geom)
GROUP BY T_Z.z_id;
/* Classic ray-crossing point-in-polygon test (PNPOLY):
   counts how many polygon edges a horizontal ray from (x, y) crosses;
   an odd count means the point is inside.
   npol: number of vertices; xp/yp: vertex coordinates; (x, y): query point. */
int pnpoly(int npol, float *xp, float *yp, float x, float y)
{
    int i, j, c = 0;
    for (i = 0, j = npol - 1; i < npol; j = i++) {
        /* Does edge (j, i) straddle the horizontal line through y,
           and does the ray cross it to the right of x? */
        if ((((yp[i] <= y) && (y < yp[j])) ||
             ((yp[j] <= y) && (y < yp[i]))) &&
            (x < (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i]))
            c = !c;  /* toggle inside/outside on each crossing */
    }
    return c;
}
To derive a complete histogram: for each point/raster cell with value v in zone z, increment his[z][v] (serial code; a minimal sketch follows below).
How do we do zonal statistics on polygonal zones over raster cells?
Approach 1: treat each raster cell as a point and use point-in-polygon testing
Approach 2: rasterize the polygons and then use a raster-based technique
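As a baseline for the GPU designs discussed later, here is a minimal serial sketch of the per-cell histogramming idea, assuming every raster cell already carries a zone label; the names (zone_id, value, hist, num_bins) are illustrative placeholders, not the speaker's code.

/* Minimal serial sketch of per-cell zonal histogramming, assuming each raster
   cell already has a zone label. Names are illustrative, not the original code. */
void zonal_histogram(const int *zone_id, const int *value, long long num_cells,
                     int num_zones, int num_bins, int *hist /* num_zones*num_bins */)
{
    for (long long i = 0; i < num_cells; i++) {
        int z = zone_id[i];
        int v = value[i];
        if (z >= 0 && z < num_zones && v >= 0 && v < num_bins)
            hist[(long long)z * num_bins + v]++;   /* his[z][v]++ in the slide's notation */
    }
}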
Data and Computing Challenges
NASA Shuttle Radar Topography
Mission (SRTM) data
•11-day mission in February of 2000
•Elevation data on a near-global scale
•The most complete high-resolution
digital topographic database of Earth
•1 arc-second resolution (~30 meters)
•Total SRTM raw data volume: 12.3 TB
•Continental US (tiles 01-06)
•SRTM: 20×10^9 (20 billion) raster cells (~40 GB raw, ~15 GB compressed TIFF)
•Zones: 3141 counties, 87,097 vertices
Brute-force Point-in-polygon test
•RT = (# of points) × (# of polygon vertices) × (# of ops per point-in-polygon test) / (# of ops per second)
    = 20×10^9 × 87,097 × 20 / (10×10^9) ≈ 3.5×10^6 seconds ≈ 40 days
•Using all of Titan's 18,688 nodes: ~200 seconds
•Flops utilization is typically low: it can be <1% for data-intensive applications (typically <10% in HPC)
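Written out, with the serial estimate simply divided across Titan's 18,688 nodes (the slide's own assumption) to reproduce the ~200-second figure:

RT = \frac{(20 \times 10^{9}) \cdot 87{,}097 \cdot 20}{10 \times 10^{9}} \approx 3.5 \times 10^{6}\ \text{s} \approx 40\ \text{days}, \qquad \frac{3.5 \times 10^{6}\ \text{s}}{18{,}688\ \text{nodes}} \approx 190\ \text{s}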
Hybrid Spatial Databases + HPC Approach
Observation: only points/cells that are close enough to polygons need to be tested
Question: how do we pair neighboring points/cells with polygons?
Step 1: divide a raster tile into blocks and generate per-block histograms
Step 2: derive polygon MBRs and pair MBRs with blocks through a box-in-polygon test (inside/intersect)
Step 3: aggregate per-block histograms into per-polygon histograms for blocks that fall completely within polygons
Step 4: for each intersecting polygon/block pair, perform a point(cell)-in-polygon test for all raster cells in the block and update the respective polygon histogram
[Figure: pairing polygon Minimum Bounding Boxes (MBRs) with raster blocks – (A) intersect, (B) inside, (C) outside]
Hybrid Spatial Databases + HPC Approach
Per-block aggregation for blocks inside polygons, per-cell modification for blocks that intersect polygon boundaries
Advantage: raster cells in blocks that lie completely within polygons do not need individual point-in-polygon tests (a sketch of the box-in-polygon classification follows below)
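As a concrete illustration of the Step 2 box-in-polygon test, the sketch below reuses the pnpoly routine shown earlier on the four corners of a raster block's bounding box. It is a simplification under assumed names (classify_box, BoxRelation): the full approach in the talk also runs all-pair edge-intersection tests between the polygon and the block, which catch boundaries that cut through a block without enclosing any of its corners.

/* Hedged sketch of the Step 2 box-in-polygon classification, using the pnpoly
   routine shown earlier on the four corners of a block's bounding box.
   Corners-only testing is a simplification; the talk also uses all-pair
   edge-intersection tests between the polygon and the block. */
enum BoxRelation { BOX_OUTSIDE = 0, BOX_INSIDE = 1, BOX_INTERSECT = 2 };

enum BoxRelation classify_box(int npol, float *xp, float *yp,
                              float xmin, float ymin, float xmax, float ymax)
{
    float cx[4] = { xmin, xmax, xmax, xmin };
    float cy[4] = { ymin, ymin, ymax, ymax };
    int inside = 0;
    for (int k = 0; k < 4; k++)
        inside += pnpoly(npol, xp, yp, cx[k], cy[k]);   /* corner inside polygon? */
    if (inside == 4) return BOX_INSIDE;     /* Step 3: aggregate the block histogram */
    if (inside == 0) return BOX_OUTSIDE;    /* skip the block entirely */
    return BOX_INTERSECT;                   /* Step 4: per-cell point-in-polygon tests */
}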
GPU Implementations (1)
Identifying parallelisms and mapping to hardware
(1) Deriving per-block histograms
(2) Block in polygon test
(3) Aggregate histograms for “within” blocks
(4) Point-in-polygon test for individual cells
[Figure: mapping raster blocks to GPU thread blocks – point/cell data and polygon vertices are laid out for perfectly coalesced memory accesses and to utilize GPU floating-point power; block-in-polygon testing combines a point-in-polygon test on each cell/block's 4 corners with all-pair edge-intersection tests between the polygon and the cell; per-block histograms are updated with atomicAdd. A kernel sketch of the per-block histogramming follows.]
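The per-block histogramming step (1) maps naturally onto one GPU thread block per raster block, with atomicAdd updates on a shared-memory histogram. The kernel below is only a hedged sketch of that pattern under assumed names (block_cells, block_hists, NUM_BINS, CELLS_PER_BLOCK); it is not the speaker's actual kernel.

// Hedged CUDA sketch of Step 1: one GPU thread block builds the histogram of one
// raster block. Names and sizes are illustrative assumptions.
#define NUM_BINS 5000                   // maximum histogram bins used in the experiments
#define CELLS_PER_BLOCK (360 * 360)     // 0.1 x 0.1 degree raster block at 1 arc-second

__global__ void per_block_histogram(const short *block_cells, int *block_hists)
{
    __shared__ int hist[NUM_BINS];      // per-raster-block histogram in shared memory
    // Zero the shared histogram cooperatively.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        hist[b] = 0;
    __syncthreads();

    // Each thread strides over the cells of this raster block (coalesced reads).
    const short *cells = block_cells + (size_t)blockIdx.x * CELLS_PER_BLOCK;
    for (int i = threadIdx.x; i < CELLS_PER_BLOCK; i += blockDim.x) {
        int v = cells[i];
        if (v >= 0 && v < NUM_BINS)     // values outside the bin range are ignored here
            atomicAdd(&hist[v], 1);     // shared-memory atomic update
    }
    __syncthreads();

    // Write the finished histogram back to global memory.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        block_hists[(size_t)blockIdx.x * NUM_BINS + b] = hist[b];
}

A launch such as per_block_histogram<<<num_raster_blocks, 256>>>(d_cells, d_hists) would match the 256-thread block size listed among the fixed parameters later in the talk.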
GPU Implementations (2)
After incorporating the spatial database technique, the task becomes I/O bound:
•Reading the raw data (40 GB) from disk to memory requires 400+ seconds (~100 MB/s)
•Large memory footprint on CPUs, even when the operations are done on GPUs
•Significant data transfer time between CPUs and GPUs (limited by PCI-E bandwidth, 2-8 GB/s)
A possible solution: use compression libraries (e.g., zlib) to compress the raw data
•Advantage: good compression ratio, saves disk I/O
•Disadvantage 1: requires decompression on CPUs
•Disadvantage 2: CPU memory footprint and CPU-GPU data transfer overhead remain the same
Our solution: reusing the Bitplane Quadtree (BPQ-Tree) for data compression/indexing
•Idea: chop M-bit rasters into M binary bitmaps and then build a quadtree for each bitmap (sketched below)
•BPQ-Tree achieves a competitive compression ratio but is much more parallelization-friendly on GPUs
•Advantage 1: compressed data is streamed from disk to GPU without requiring decompression on CPUs → reduces CPU memory footprint and data transfer time
•Advantage 2: can be used in conjunction with CPU-based compression to further improve the compression ratio (verified but not presented)
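The "chop M-bit rasters into M binary bitmaps" step can be sketched as a simple CUDA kernel. This only illustrates the bitplane decomposition idea, with assumed names (raster, bitplanes); the quadtree construction and the actual BPQ-Tree encoding/decoding of Zhang et al 2011 are not shown.

// Hedged sketch of bitplane decomposition: each M-bit raster cell contributes one
// bit to each of M binary bitmaps. For clarity each output bit is stored in a full
// byte; the real BPQ-Tree packs bits and builds a quadtree per bitmap (not shown).
__global__ void split_bitplanes(const unsigned short *raster,  // M-bit input cells
                                unsigned char *bitplanes,      // M bitmaps, one byte per bit
                                size_t num_cells, int num_bits)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_cells) return;
    unsigned short v = raster[i];
    for (int b = 0; b < num_bits; b++)
        bitplanes[(size_t)b * num_cells + i] = (v >> b) & 1;   // bit b of cell i
}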
GPU Implementations (3)
(Zhang et al 2011)
GPU Implementations (4)
BPQ-Tree Decoding on GPGPUs
(Zhang et al 2011)
1. All the threads assigned to a computing block are bundled together to process a quadrant of matrices in a BPQ-Tree pyramid during decoding.
2. The collective process is looped over all the quadrants and all levels of the pyramid, i.e., Process Collectively and Loop (PCL).
3. The starting positions of the threads in the byte streams are calculated efficiently on the fly in GPGPU shared memory – only the starting position of the computing block needs to be pre-generated.
Example (per-thread byte counts turned into running positions in shared memory):
Step 0: 3 2 0 1 (padded: 0 0 0 0 3 2 0 1)
Step 1: 0 0 0 0 3 5 2 1
Step 2: 0 0 0 0 3 5 5 6
Step 3: 0 0 0 0 3 5 5 6 (the prefix sums 3 5 5 6 give each thread's position)
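The Step 0-3 trace above is consistent with a Hillis-Steele style inclusive scan in shared memory: per-thread byte counts (3 2 0 1) become running offsets (3 5 5 6), and the final step leaves the values unchanged because of the zero padding. The kernel below is a hedged sketch of that pattern (not the code from Zhang et al 2011), assuming one count per thread and a power-of-two block size.

// Hedged sketch of a shared-memory inclusive scan matching the Step 0-3 trace.
// Assumes blockDim.x is a power of two; launch with 2*blockDim.x*sizeof(int) bytes
// of dynamic shared memory, e.g. scan_positions<<<1, 4, 2*4*sizeof(int)>>>(d_c, d_p);
__global__ void scan_positions(const int *counts, int *positions)
{
    extern __shared__ int s[];          // [0, n): zero padding, [n, 2n): values
    int n = blockDim.x, tid = threadIdx.x;
    s[tid] = 0;                         // Step 0 padding: 0 0 0 0
    s[n + tid] = counts[tid];           // Step 0 values:  3 2 0 1
    __syncthreads();
    for (int offset = 1; offset < n; offset <<= 1) {
        int val = s[n + tid - offset];  // read the neighbor before anyone overwrites it
        __syncthreads();
        s[n + tid] += val;              // Step 1 gives 3 5 2 1, Step 2 gives 3 5 5 6
        __syncthreads();
    }
    positions[tid] = s[n + tid];        // inclusive prefix sums = per-thread offsets
}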
Experiments and Evaluations (1)
GeoTECI@CCNY
(Geospatial Technologies and
Environmental Cyberinfrastructure)
http://www-cs.ccny.cuny.edu/~jzhang/LabHardware.htm
Single Node Configuration 1: Dell T5400 WS
•Intel Xeon E5405 dual quad-core processor (2.00 GHz), 16 GB, PCI-E Gen2, 3*500 GB 7200 RPM disks with 32 MB cache ($5,000)
•Nvidia Quadro 6000 GPU, 448 Fermi cores (574 MHz), 6 GB, 144 GB/s ($4,500)
Single Node Configuration 2: DIY WS
•Intel Core i5-650 dual-core processor (hyperthreading enabled), 8 GB, PCI-E Gen3, 500 GB 7200 RPM disk with 32 MB cache (recycled) ($1,000)
•Nvidia GTX Titan GPU, 2,688 Kepler cores (837 MHz), 6 GB, 288 GB/s ($1,000)
SGI Octane III: 2-node mini-cluster (each node with a dual quad-core CPU, 48 GB memory, 4 TB disk, and 2 C2050 GPUs)
Experiments and Evaluations (2)
GPU Cluster 1: OLCF Titan
18,688 nodes, 299,008 cores, 20+ PFlops
(GPU Cluster 2: CUNY HPCC Andy): 744 Nehalem CPU cores, 96 Fermi GPUs
http://www.csi.cuny.edu/cunyhpc/HPC_Systems.html
(GPU Cluster 3: CUNY HPCC Penzias): 1,172 Sandybridge CPU cores, 144 Kepler GPUs
Experiments and Evaluations (3)
Results on Data Pre-processing and BPQ-Tree Compression
Fixed Parameters:
•Raster block size: 0.1*0.1 degree = 360*360 cells (at 1*1 arc-second resolution)
•Maximum histogram bins: 5,000
•Raster chunk size for encoding/decoding: 4096*4096 (a chunk is assigned to a computing block)
•Thread block size: 256
Tile #   Dimension       Partition schema
1        54000*43200     2*2
2        50400*43200     2*2
3        50400*43200     2*2
4        82800*36000     2*2
5        61200*46800     2*2
6        68400*111600    4*4
Total    20,165,760,000 cells   36 partitions
Data format                     Volume (GB)
Original (raw)                  38
TIFF compression                15
gzip compression [1]            8.3
BPQ-Tree compression [2]        7.3
BPQ-Tree + gzip compression     5.5

[1] Asymmetrical: encoding (430 s), decoding (47 s) – encoding is ~10X slower than decoding
[2] Conceptually symmetrical; in practice encoding (195 s), decoding (239 s) – desirable for on-the-fly coding of intermediate model output data
Experiments and Evaluations (4)
Wall-clock end-to-end runtimes
                 Single Node Config 1            Single Node Config 2
                 (Dell T5400 + Quadro 6000)      (DIY + GTX Titan)
Cold cache [1]   180 s                           101 s
Hot cache [2]    78 s                            46 s

[1] using "hdparm -t" ==> 90-95 MB/s; [2] using "hdparm -T" ==> 3-3.5 GB/s
Observations:
(1) The end-to-end runtimes are disk I/O bound, although BPQ-Tree has significantly reduced the I/O time (400+ seconds if the raw data is read directly); parallel I/O (e.g., ADIOS) on clusters is promising for I/O-intensive applications
(2) GTX Titan (based on Kepler) is significantly faster than Quadro 6000 (based on Fermi): more cores + larger caches, the same graphics memory capacity (6 GB), and a much lower price tag → desirable
Experiments and Evaluations (5)
Breakdowns of major computing components (hot cache)

                                                      Quadro 6000   GTX Titan   8-core CPU
(Step 0) Raster decompression (s)                     16.2          8.30        131.9
Step 1: Per-block histogramming (s)                   21.5          13.4        /
Step 2: Block-in-polygon test (s)                     0.11          0.07        /
Step 3: "Within-block" histogram aggregation (s)      0.14          0.11        /
Step 4: Cell-in-polygon test and histogram update (s) 29.7          11.4        /
Total of major steps (s)                              67.7          33.3        /
Wall-clock end-to-end (s)                             85            46          /
Discussions:
(1) Parameters (e.g., block/chunk sizes and # of bins) may affect performance – additional experiments to be completed in future work
(2) GPU hardware architecture may also affect performance significantly: 8.1X speedup for Quadro 6000, 15.9X speedup for GTX Titan, and 5.8X speedup for C2050 (Zhang et al 2011), over an 8-core CPU
Experiments and Evaluations (6)
Wall-clock end-to-end runtimes on Titan
#include <mpi.h>
#include <stdio.h>
#include <assert.h>

#define NUM_TILE 36   /* total number of raster partitions (assumed here; see the table above) */

void process_tile(int tile_id);   /* per-tile GPU zonal-statistics pipeline (defined elsewhere) */

int main(int argc, char *argv[])
{
    int myrank, commsize;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &commsize);
    printf("%d %d\n", myrank, commsize);
    assert(commsize > 0);

    /* Round-robin distribution: rank r processes tiles r, r+commsize, r+2*commsize, ... */
    int round = (NUM_TILE - 1) / commsize + 1;
    double start = MPI_Wtime();
    for (int zz = 0; zz < round; zz++) {
        int ww = zz * commsize + myrank;
        if (ww >= NUM_TILE) break;
        process_tile(ww);   /* "process tile #ww" in the original slide */
    }
    double finish = MPI_Wtime();
    double tot_time = finish - start;

    /* The slowest rank determines the reported end-to-end time. */
    double longest_tot_time;
    MPI_Allreduce(&tot_time, &longest_tot_time, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    if (myrank == 0)
        printf("longest_tot_time=%10.3f\n", longest_tot_time);
    MPI_Finalize();
    return 0;
}
# of nodes    1      2      4      8      16
Runtime (s)   60.7   31.3   17.9   10.2   7.6
[Chart: Scalability on Titan – runtime (s) vs. number of nodes, plotting the data in the table above]
Take Home Messages
1. We aim at designing novel data structures and algorithms that are efficient on new GPU parallel hardware to speed up processing of large-scale geospatial data.
2. Experiments on zonal statistics using 20+ billion NASA SRTM 30-meter-resolution DEM cells and 3,000+ US county boundaries have demonstrated an impressive end-to-end runtime of 180 seconds or better using a single commodity GPU device.
3. Additional experiments on the OLCF Titan GPU cluster using 1-16 nodes have demonstrated good scalability of the proposed technique. The end-to-end runtime can be further reduced to 10 seconds or better using 8 or more nodes → interactive exploration of large-scale geospatial data becomes possible on high-end computing facilities.
4. It is possible to further improve the performance by fine-tuning relevant parameters; however, from an application perspective, it is more important to understand the level of performance that end users can expect before investing money and time → this study provides a concrete example.
5. Personal view: geospatial technologies have reached a tipping point due to hardware advances and changed cost models (parallel hardware, large memory capacity, and BigData-driven applications) → there are great research/development/application opportunities to develop the next generation of high-performance GIS/Spatial Databases on different hardware platforms and in different types of applications (e.g., embedding a DBMS in CESM running on clusters).
http://www-cs.ccny.cuny.edu/~jzhang/
http://www.nsf.gov/awardsearch/showAward?AWD_ID=1302423
Special Thanks to:
Dali Wang (ORNL Host) (Ecosystem Simulation Science – ESD)
Yaxing Wei (Environmental Science Data/Systems – ESD)
Norbert Podhorszki (Scientific Data Group – CSM)
Seyong Lee (Future Technology Group – CSM)
Ben Mayer (Computational Earth Sciences – CSM)
Matt Norman (Scientific Computing/NCCS)
Anil Cheriyadat (GIST – CSE)
Dilip Reddy Patlolla (GIST – CSE)
Jiangye Yuan (GIST – CSE)
Varun Chandola (GIST – CSE)