Transcript Slides - Data Management Lab

Spatial Big Data Challenges
Intersecting Cloud Computing and Mobility
Shashi Shekhar
McKnight Distinguished University Professor
Department of Computer Science and Engineering
University of Minnesota
www.cs.umn.edu/~shekhar
1
Spatial Databases: Representative Projects
• Evacuation Route Planning [map legend: routes only in old plan, only in new plan, in both plans]
• Parallelize Range Queries
• Shortest Paths: storing graphs in disk blocks
2
Why cloud computing for spatial data?
• Geospatial Intelligence [Dr. M. Pagels, DARPA, 2006]
• Estimated at 140 terabytes per day, 150 petabytes annually
• Annual volume is roughly 150 times the historical content of the entire Internet
• Analyze daily data as well as historical data
3
Eco-Routing
• Minimize fuel consumption and GHG emissions
– rather than proxies, e.g., distance or travel time
– avoid congestion, idling at red lights, turns, elevation changes, etc.
U.P.S. Embraces High-Tech Delivery Methods (July 12, 2007): "The research at U.P.S. is paying off ... saving roughly three million gallons of fuel, in good part by mapping routes that minimize left turns."
4
Real-time and Historic Travel-time, Fuel Consumption, GPS Tracks
5
Eco-Routing Research Challenges
• Frames of Reference
– Absolute to moving-object based (Lagrangian)
• Data model of Lagrangian graphs (see the sketch at the end of this slide)
– Conceptual – generalize the time-expanded graph
– Logical – Lagrangian abstract data types
– Physical – clustering, indexing, Lagrangian routing algorithms
• Flexible Architecture
– Allow inclusion of new algorithms, e.g., GPS-track mining
– Merge solutions from different algorithms
• Geo-sensing of events,
– e.g., volunteered geographic information (e.g., OpenStreetMap),
– social unrest (Ushahidi), flash mobs, …
• Geo-Prediction,
– e.g., predict the track of a hurricane or a vehicle
– Challenges: auto-correlation, non-stationarity
• Geo-privacy
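Below is a minimal illustrative sketch (not the author's implementation) of the time-expanded graph idea referenced above, in Python; the road segments, time horizon, and travel-time table are hypothetical.

```python
# Sketch: a tiny time-expanded graph for time-dependent routing.
# Each road node is copied once per discrete time step; an edge
# (u, t) -> (v, t + travel_time(u, v, t)) models departing u at time t.
# Segment names and the travel-time table are hypothetical.
from collections import defaultdict

# travel_time[(u, v)][t] = minutes to traverse u -> v when departing at step t
travel_time = {
    ("A", "B"): {0: 5, 1: 8, 2: 8},   # congestion builds after t = 0
    ("B", "C"): {0: 4, 1: 4, 2: 4},
}

def build_time_expanded_graph(travel_time, horizon):
    graph = defaultdict(list)          # (node, t) -> [((node, t'), cost), ...]
    for (u, v), table in travel_time.items():
        for t in range(horizon):
            cost = table.get(t, table[max(table)])   # hold last known value
            graph[(u, t)].append(((v, t + cost), cost))
    return graph

g = build_time_expanded_graph(travel_time, horizon=3)
print(g[("A", 1)])   # departing A at t = 1 arrives at B at t = 9 (cost 8)
```

A Lagrangian view would evaluate each edge's cost at the traveler's own arrival time while routing, which is the behavior the Lagrangian abstract data types above would encapsulate.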
6
Cloud Computing and Spatial Big Data
• Motivation
• Case Study 1: Simpler to Parallelize
• Case Study 2: Harder
• Case Study 3: Hardest
• Wrap up
7
Simpler: Land-cover Classification
• Multiscale, multigranular image classification into land-cover categories
[Figure: input image and classification output at 2 scales]
M̂ = argmax_M quality(M), where
quality(M) = likelihood(observation | M) · 2^(−penalty(M))
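A small illustrative sketch of this model-selection step in Python, assuming the reconstructed MDL-style criterion quality(M) = likelihood(observation | M) · 2^(−penalty(M)); the candidate models, log-likelihoods, and penalties are made up.

```python
import math

# Hypothetical candidates: (description, log-likelihood of the observed image, penalty in bits)
candidates = [
    ("coarse scale, 2 classes", -1200.0, 40),
    ("fine scale, 4 classes",   -1100.0, 90),
]

def log_quality(log_likelihood, penalty_bits):
    # log of quality(M) = likelihood(observation | M) * 2^(-penalty(M))
    return log_likelihood - penalty_bits * math.log(2)

best = max(candidates, key=lambda m: log_quality(m[1], m[2]))
print("selected model:", best[0])
```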
8
Parallelization Choice
1. Initialize parameters and memory
2. for each Spatial Scale
3.   for each Quad
4.     for each Class
5.       Calculate Quality Measure
6.     end for Class
7.   end for Quad
8. end for Spatial Scale
9. Post-processing
Input: 64 x 64 image (Plymouth County, MA); 4 classes (All, Woodland, Vegetated, Suburban)
Language: UPC
Platform: Cray X1, 1-8 processors
[Charts: Speedup and Efficiency vs. number of processors (1, 2, 4, 8) for Class-level and Quad-level parallelization]
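For illustration only (the slide's actual implementation is in UPC on a Cray X1), a minimal Python sketch of the two parallelization choices being compared: one task per quad versus one task per (quad, class) pair. The quality function, quad count, and pool size are placeholders.

```python
# Sketch of the two parallelization choices compared on this slide:
# distribute work over quads (coarser grain) or over classes (finer grain).
# The quality computation below is a placeholder, not the MMIC algorithm.
from multiprocessing import Pool
from itertools import product

QUADS = range(16)          # quadrants of the image at one spatial scale
CLASSES = ["All", "Woodland", "Vegetated", "Suburban"]

def quality(quad, land_class):
    return (quad + 1) * len(land_class)   # placeholder quality measure

def quad_task(quad):
    # quad-level parallelism: one task computes all classes for its quad
    return [quality(quad, c) for c in CLASSES]

def class_task(args):
    # class-level parallelism: one task per (quad, class) pair
    quad, land_class = args
    return quality(quad, land_class)

if __name__ == "__main__":
    with Pool(4) as pool:
        by_quad = pool.map(quad_task, QUADS)
        by_class = pool.map(class_task, product(QUADS, CLASSES))
    print(len(by_quad), len(by_class))   # 16 coarse tasks vs. 64 fine tasks
```

Quad-level tasks are coarser-grained, so there are fewer, larger tasks to schedule; class-level tasks are finer-grained and more numerous.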
9
Harder: Parallelizing Vector GIS
• (1/30)-second response-time constraint on range queries
• Parallel processing is necessary since the best sequential computer cannot meet the requirement
• Blue rectangle = a range query; polygon colors show processor assignment (see the sketch at the end of this slide)
[Architecture diagram: Remote Terrain Databases -> (bounding box, set of polygons) -> Local Terrain Database (25 km x 25 km) -> (8 km x 8 km bounding box, set of polygons, ~2 Hz) -> High Performance GIS Component -> Graphics Engine -> Display (30 Hz view)]
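For illustration, a small hypothetical sketch of the range query itself: return the polygons whose bounding boxes intersect the query rectangle. The polygon data and the round-robin processor assignment are made up and stand in for the partitioning shown in the figure.

```python
# Sketch: bounding-box range query over polygons, with a static
# round-robin processor assignment (hypothetical data and assignment).
from dataclasses import dataclass

@dataclass
class Polygon:
    pid: int
    bbox: tuple  # (xmin, ymin, xmax, ymax)

def intersects(a, b):
    # axis-aligned bounding boxes overlap unless separated on x or y
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def range_query(polygons, query_bbox):
    return [p.pid for p in polygons if intersects(p.bbox, query_bbox)]

polygons = [Polygon(i, (i, i, i + 2, i + 2)) for i in range(10)]
assignment = {p.pid: p.pid % 4 for p in polygons}   # 4 processors, round-robin

hits = range_query(polygons, query_bbox=(0, 0, 4, 4))
print(hits, [assignment[h] for h in hits])
```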
10
Data-Partitioning Approach
• Initial static partitioning
• Run-time dynamic load balancing (DLB)
• Platforms: Cray T3D (distributed memory), SGI Challenge (shared memory)
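A minimal sketch of run-time dynamic load balancing with a shared work pool, using Python multiprocessing on one machine rather than the Cray T3D or SGI Challenge; the batch sizes and costs are hypothetical.

```python
# Sketch: dynamic load balancing via a shared work pool. Idle workers pull
# the next batch of polygons as soon as they finish their current one, so
# uneven batch costs do not leave processors idle. Data is hypothetical.
import multiprocessing as mp
import time

def worker(work_queue, results):
    while True:
        batch = work_queue.get()
        if batch is None:              # poison pill: no more work
            break
        time.sleep(0.01 * len(batch))  # stand-in for processing the polygons
        results.put((mp.current_process().name, len(batch)))

if __name__ == "__main__":
    work_queue, results = mp.Queue(), mp.Queue()
    batches = [[0] * n for n in (5, 1, 1, 8, 2, 2, 1, 4)]   # uneven work
    for b in batches:
        work_queue.put(b)
    num_workers = 4
    for _ in range(num_workers):
        work_queue.put(None)           # one poison pill per worker
    workers = [mp.Process(target=worker, args=(work_queue, results))
               for _ in range(num_workers)]
    for p in workers:
        p.start()
    for _ in batches:                  # collect one result per batch
        print(results.get())
    for p in workers:
        p.join()
```

How much work to leave in the shared pool, rather than assign statically up front, is the pool-size choice the next slide flags as challenging.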
11
DLB Pool-Size Choice is Challenging!
12
Hardest – Location Prediction
[Maps: nest locations, vegetation durability, distance to open water, water depth]
13
Ex. 3: Hardest to Parallelize
• Classical Linear Regression: y = xβ + ε
• Spatial Auto-Regression: y = ρWy + xβ + ε
• Maximum Likelihood Estimation:
ln(L) = ln|I − ρW| − (n/2)·ln(2π) − (n/2)·ln(σ²) − SSE/(2σ²)
• Need cloud computing to scale up to large spatial datasets.
• However, computing the determinant of a large matrix is an open problem!
ρ: the spatial auto-regression (auto-correlation) parameter
W: n-by-n neighborhood matrix over the spatial framework
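A minimal NumPy sketch that evaluates the reconstructed SAR log-likelihood for one candidate ρ on synthetic data; the neighborhood matrix, data, and chain structure are made up, and a real estimator would search over ρ.

```python
# Sketch: evaluate the SAR log-likelihood for one candidate rho on
# synthetic data. The dense log-determinant below is exactly the step
# that does not scale to very large neighborhood matrices.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
W = np.zeros((n, n))
for i in range(n - 1):                 # simple chain neighborhood
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)   # row-normalize

x = rng.normal(size=(n, k))
beta_true = np.array([1.5, -0.7])
y = np.linalg.solve(np.eye(n) - 0.4 * W, x @ beta_true + rng.normal(size=n))

def sar_log_likelihood(rho, y, x, W):
    n = len(y)
    A = np.eye(n) - rho * W
    z = A @ y                                      # y - rho * W * y
    beta, *_ = np.linalg.lstsq(x, z, rcond=None)   # profile out beta
    resid = z - x @ beta
    sse = resid @ resid
    sigma2 = sse / n                               # MLE of sigma^2
    sign, logdet = np.linalg.slogdet(A)            # ln|I - rho W|
    return logdet - n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2) - sse / (2 * sigma2)

print(sar_log_likelihood(0.4, y, x, W))
```

The slogdet call is the pain point the slide highlights: for very large n, computing ln|I − ρW| repeatedly during the search over ρ is what requires new parallel formulations.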
14
Cloud Computing and Spatial Big Data
• Motivation: Spatial Big Data in National Security & Eco-routing
• Case Study 1: Simpler to Parallelize
– Map-reduce is okay (see the sketch at the end of this slide)
– Should it provide spatial declustering services?
– Can a query compiler generate map-reduce parallel code?
• Case Study 2: Harder
– Need dynamic load balancing beyond map-reduce
• Case Study 3: Hardest
– Need new computer science, e.g.,
• eco-routing algorithms
• determinant of a large matrix
• parallel formulation of evacuation route planning
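A minimal, hypothetical sketch of why "map-reduce is okay" for the simpler case: per-record work maps to independent tasks keyed by a grid cell (a simple spatial declustering key), simulated here in plain Python rather than on an actual MapReduce cluster.

```python
# Sketch: map-reduce style aggregation over a spatial grid. Each map call
# handles one record independently; the shuffle groups by grid cell; the
# reduce computes a per-cell summary. Points and cell size are hypothetical.
from collections import defaultdict

points = [(1.2, 3.4, 10.0), (1.8, 3.1, 14.0), (7.5, 0.2, 3.0)]  # (x, y, value)
CELL = 2.0

def map_fn(point):
    x, y, value = point
    cell = (int(x // CELL), int(y // CELL))   # spatial declustering key
    return cell, value

def reduce_fn(values):
    return sum(values) / len(values)          # per-cell mean

# simulate shuffle + reduce
groups = defaultdict(list)
for key, value in map(map_fn, points):
    groups[key].append(value)
result = {cell: reduce_fn(vals) for cell, vals in groups.items()}
print(result)   # {(0, 1): 12.0, (3, 0): 3.0}
```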
15
Acknowledgments
• HPC Resources, Research Grants
– Army High Performance Computing Research Center-AHPCRC
– Minnesota Supercomputing Institute - MSI
• Spatial Database Group Members
– Mete Celik, Sanjay Chawla, Vijay Gandhi, Betsy George, James Kang,
Baris M. Kazar, QingSong Lu, Sangho Kim, Sivakumar Ravada
• USDOD
– Douglas Chubb, Greg Turner, Dale Shires, Jim Shine, Jim Rodgers
– Richard Welsh (NCS, AHPCRC), Greg Smith
• Academic Colleagues
– Vipin Kumar
– Kelley Pace, James LeSage
– Junchang Ju, Eric D. Kolaczyk, Sucharita Gopal
16