Efficient Handling of Massive (Terrain) Datasets
Lars Arge
AARHUS UNIVERSITET
Department of Computer Science
Massive Data Algorithmics
Massive data being acquired/used everywhere
Storage management software is billion-$ industry
Examples (2002):
Phone: AT&T 20TB phone call database, wireless tracking
Consumer: WalMart 70TB database, buying patterns
Web: Google indexes 8 billion web pages
Geography: NASA satellites generate terabytes each day
Science is increasingly about mining massive data (Nature 2/06)
Terrain Data
New technologies: Much easier/cheaper to collect detailed data
Previous ‘manual’ or radar-based methods
− Often 30 meters between data points
− Sometimes 10 meter data available
New laser scanning methods (LIDAR)
− Less than 1 meter between data points
− Centimeter accuracy (previously meter)
Denmark:
~2 million points at 30 meter spacing (<<1GB)
~18 billion points at 1 meter spacing (>>1TB)
COWI (and others) now scanning DK
NC scanned after Hurricane Floyd in 1999
Massive data = I/O-Bottleneck
I/O is often bottleneck when handling massive datasets
Disk access is 10^6 times slower than main memory access!
[Disk diagram: read/write head, read/write arm, track, magnetic surface]
Disk systems try to amortize the large access time by transferring large contiguous blocks of data
Need to store and access data to take advantage of blocks!
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
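As a minimal illustration of this point, here is a C++ sketch (not from the talk; the scratch file, element type, scattered order, and block size are assumptions) that sums the same N integers twice: once with one small read per element at scattered offsets, and once with a sequential scan in large blocks.

#include <cstdio>
#include <vector>

long long sum_random_access(std::FILE* f, const std::vector<long>& positions) {
    // One small read per element at scattered offsets: roughly one disk
    // access per element, each paying the full seek and rotational latency.
    long long sum = 0;
    int x;
    for (long pos : positions) {
        std::fseek(f, pos * (long)sizeof(int), SEEK_SET);
        if (std::fread(&x, sizeof x, 1, f) == 1) sum += x;
    }
    return sum;
}

long long sum_blocked_scan(std::FILE* f, std::size_t block_elems) {
    // Sequential scan in blocks of block_elems ints: roughly one large
    // transfer per block, so the access time is amortized over many elements.
    long long sum = 0;
    std::vector<int> buf(block_elems);
    std::size_t got;
    std::fseek(f, 0, SEEK_SET);
    while ((got = std::fread(buf.data(), sizeof(int), block_elems, f)) > 0)
        for (std::size_t i = 0; i < got; ++i) sum += buf[i];
    return sum;
}

int main() {
    // Hypothetical setup: write 1,000,000 ints to a scratch file, then sum
    // them via scattered reads and via a blocked sequential scan (4 MB blocks).
    const long N = 1000000;
    std::FILE* f = std::fopen("scratch.bin", "w+b");
    for (long i = 0; i < N; ++i) { int v = 1; std::fwrite(&v, sizeof v, 1, f); }
    std::vector<long> positions(N);
    for (long i = 0; i < N; ++i)
        positions[i] = (i % 1000) * (N / 1000) + i / 1000;  // scattered order
    std::printf("%lld %lld\n", sum_random_access(f, positions),
                sum_blocked_scan(f, 1 << 20));
    std::fclose(f);
    return 0;
}

Both calls compute the same sum; only the access pattern differs, which is exactly the gap the slide is about.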
I/O-efficient Algorithms
Taking advantage of block access is important
Traditionally, algorithms were developed without block considerations
I/O-efficient algorithms lead to large runtime improvements
[Graph: running time vs. data size; a normal algorithm’s running time explodes once the data size exceeds the main memory size, while an I/O-efficient algorithm keeps scaling]
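To put numbers on “large runtime improvements”, here is the standard comparison from the external-memory (I/O) model of Aggarwal and Vitter (not stated on the slide), with N = input size, M = internal memory size, and B = block size, all measured in elements:

scan(N) = Θ(N / B) I/Os        sort(N) = Θ((N / B) · log_{M/B}(N / B)) I/Os

An algorithm designed without blocks in mind can instead incur close to one disk access per element operation, so for inputs far larger than M the gap is typically several orders of magnitude.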
Scalability: Hierarchical Memory
Block access is not only important on the disk level
Machines have a complicated memory hierarchy
Levels get larger and slower
Block transfers on all levels
[Diagram: memory hierarchy with L1, L2, and RAM; graph of running time vs. data size]
My Research Work
Theoretical work on I/O- (and cache-) efficient algorithms
Data structures, computational geometry, graph theory, …
Focus on Geographic Information Systems problems
Algorithm engineering work, e.g.:
TPIE system for simple, efficient, and portable implementation of I/O-efficient algorithms
Software for terrain data processing
LIDAR data handling
Terrain flow computations
Example: Terrain Flow
Terrain water flow has many important applications
Predict location of streams, areas susceptible to floods…
[Images: 7 am and 3 pm]
Conceptually, flow is modeled using two basic attributes (sketched in code below):
Flow direction: the direction water flows at a point
Flow accumulation: the amount of water flowing through a point
Flow accumulation is used to compute other hydrological attributes: drainage network, topographic convergence index…
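As a minimal in-memory sketch of these two attributes (illustrative only: the 4x4 grid and its elevations are made up, and each cell drains to the neighbor with the largest elevation drop, a simplification of the usual D8 rule that weights diagonal neighbors by distance), the following C++ first computes flow directions, then flow accumulation by visiting cells from highest to lowest and pushing each cell’s water to its downslope neighbor.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int R = 4, C = 4;                      // tiny hypothetical DEM
    std::vector<double> elev = {                 // made-up elevations, row-major
        9, 8, 7, 6,
        8, 7, 6, 5,
        7, 6, 5, 4,
        6, 5, 4, 1 };

    // Flow direction: index of the neighbor with the largest drop, if any
    // neighbor is lower than the cell itself; -1 means no outflow.
    std::vector<int> dir(R * C, -1);
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c) {
            double best = 0;                     // largest drop found so far
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc) {
                    int nr = r + dr, nc = c + dc;
                    if ((dr || dc) && nr >= 0 && nr < R && nc >= 0 && nc < C) {
                        double drop = elev[r * C + c] - elev[nr * C + nc];
                        if (drop > best) { best = drop; dir[r * C + c] = nr * C + nc; }
                    }
                }
        }

    // Flow accumulation: every cell receives one unit of water; visiting the
    // cells from highest to lowest and pushing each cell's accumulated water
    // to its downslope neighbor gives the amount flowing through each cell.
    std::vector<int> order(R * C);
    for (int i = 0; i < R * C; ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return elev[a] > elev[b]; });
    std::vector<double> acc(R * C, 1.0);
    for (int i : order)
        if (dir[i] >= 0) acc[dir[i]] += acc[i];

    for (int r = 0; r < R; ++r) {                // print the accumulation grid
        for (int c = 0; c < C; ++c) std::printf("%5.0f", acc[r * C + c]);
        std::printf("\n");
    }
    return 0;
}

Sorting by elevation guarantees a cell’s accumulated flow is final before it is pushed downstream; this sort-then-sweep structure is the kind of pattern that can be made I/O-efficient, which the systems on the following slides do for grids of billions of cells.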
Terrain Flow Accumulation
Collaboration with environmental researchers at Duke University
Appalachian Mountains dataset:
800 x 800 km at 100 m resolution: a few gigabytes
On a ½GB machine: 14 days!!
ArcGIS:
Performance somewhat unpredictable
Days on a few gigabytes of data
Many gigabytes of data…
The Appalachian dataset would be terabytes in size at 1 m resolution
Terrain Flow Accumulation: TerraFlow
We developed theoretically I/O-optimal algorithms
TPIE implementation was very efficient
Appalachian Mountains flow accumulation in 3 hours!
Developed into a comprehensive software package for flow computation on massive terrains: TerraFlow
Efficient: 2-1000 times faster than existing software
Scalable: >1 billion elements!
Flexible: Flexible flow modeling (direction) methods
Extension to ArcGIS
LIDAR Terrain Data Work: TerraStream
TerraStream: a software “pipeline” for handling terrain data (a structural sketch follows below)
Points to DEM (incl. breaklines)
DEM flow modeling (incl. “flooding”, “flat” routing, “noise” reduction)
DEM flow accumulation (incl. river extraction)
DEM hierarchical watershed computation
All stages work for both grid and TIN DEMs
Capable of handling massive datasets
Test dataset: 400M point Neuse river basin (1/3 NC) (>17GB)
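Purely as a structural sketch (these are not TerraStream’s actual interfaces; the Stage type, stage names, and file names are illustrative assumptions), the following C++ shows the pipeline shape the slide describes: each stage streams one file-resident dataset into the next, so intermediate results live on disk and no stage needs the whole dataset in memory.

#include <cstdio>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A stage turns one on-disk dataset into another, ideally by streaming it
// (sequential, blocked reads and writes) rather than loading it into RAM.
using Stage = std::function<void(const std::string& in, const std::string& out)>;

void run_pipeline(std::string input,
                  const std::vector<std::pair<std::string, Stage>>& stages) {
    for (const auto& [name, stage] : stages) {
        std::string output = input + "." + name;   // intermediate stays on disk
        std::printf("stage %s: %s -> %s\n",
                    name.c_str(), input.c_str(), output.c_str());
        stage(input, output);
        input = output;                            // next stage consumes this file
    }
}

int main() {
    // Stage names mirror the slide's list; the bodies are stubs here.
    Stage todo = [](const std::string&, const std::string&) { /* streaming pass */ };
    run_pipeline("lidar_points.xyz", {
        { "dem",        todo },   // points -> DEM (incl. breaklines)
        { "flow_model", todo },   // flooding, flat routing, noise reduction
        { "flow_accum", todo },   // flow accumulation, river extraction
        { "watersheds", todo },   // hierarchical watershed computation
    });
    return 0;
}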
Examples of Ongoing Terrain Work
Terrain modeling, e.g.:
“Raw” LIDAR to point conversion (LIDAR point classification)
(incl. feature detection/removal, e.g. bridges)
Further improved flow and erosion modeling (e.g. carving)
Contour line extraction (incl. smoothing and simplification)
Terrain (and other) data fusion (incl. format conversion)
Terrain analysis, e.g.:
Choke point, navigation, visibility, change detection,…
Major grand goal:
Construction of a hierarchical (simplified) DEM where derived features (water flow, drainage, choke points) are preserved/consistent
Thanks
Lars Arge
[email protected]
Work supported in part by
US National Science Foundation ESS grant EIA–98070734, RI grant EIA–9972879, CAREER grant EIA–9984099, and ITR grant EIA–0112849
US Army Research Office grants W911NF-04-1-0278 and DAAD19-03-1-0352
Ole Rømer Scholarship from Danish Science Research Council
NABIIT grant from the Danish Strategic Research Council
Danish National Research foundation (MADALGO center)