Efficient Handling of Massive (Terrain) Datasets
Lars Arge
AARHUS UNIVERSITET
Department of Computer Science
Massive Data Algorithmics
Massive data being acquired/used everywhere
Storage management software is billion-$ industry
Examples (2002):
Phone: AT&T 20TB phone call database, wireless tracking
Consumer: WalMart 70TB database, buying patterns
Web: Google indexes 8 billion web pages
Geography: NASA satellites generate terabytes each day
Science is increasingly about mining massive data (Nature 2/06)
Terrain Data
New technologies: Much easier/cheaper to collect detailed data
Previous ‘manual’ or radar-based methods
− Often 30 meters between data points
− Sometimes 10 meter data available
New laser scanning methods (LIDAR)
− Less than 1 meter between data points
− Centimeter accuracy (previously meter)
Denmark:
~2 million points at 30 meter spacing (<<1GB)
~18 billion points at 1 meter spacing (>>1TB)
COWI (and others) now scanning DK
NC scanned after Hurricane Floyd in 1999
Massive data = I/O-Bottleneck
I/O is often bottleneck when handling massive datasets
Disk access is 10^6 times slower than main memory access!
[Disk diagram: read/write head, read/write arm, track, magnetic surface]
Disk systems try to amortize the large access time by transferring large contiguous blocks of data
Need to store and access data to take advantage of blocks!
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
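As a minimal illustration of this point, here is a C++ sketch (not from the talk; the scratch file, element type, scattered order, and block size are assumptions) that sums the same N integers twice: once with one small read per element at scattered offsets, and once with a sequential scan in large blocks.

#include <cstdio>
#include <vector>

long long sum_random_access(std::FILE* f, const std::vector<long>& positions) {
    // One small read per element at scattered offsets: roughly one disk
    // access per element, each paying the full seek and rotational latency.
    long long sum = 0;
    int x;
    for (long pos : positions) {
        std::fseek(f, pos * (long)sizeof(int), SEEK_SET);
        if (std::fread(&x, sizeof x, 1, f) == 1) sum += x;
    }
    return sum;
}

long long sum_blocked_scan(std::FILE* f, std::size_t block_elems) {
    // Sequential scan in blocks of block_elems ints: roughly one large
    // transfer per block, so the access time is amortized over many elements.
    long long sum = 0;
    std::vector<int> buf(block_elems);
    std::size_t got;
    std::fseek(f, 0, SEEK_SET);
    while ((got = std::fread(buf.data(), sizeof(int), block_elems, f)) > 0)
        for (std::size_t i = 0; i < got; ++i) sum += buf[i];
    return sum;
}

int main() {
    // Hypothetical setup: write 1,000,000 ints to a scratch file, then sum
    // them via scattered reads and via a blocked sequential scan (4 MB blocks).
    const long N = 1000000;
    std::FILE* f = std::fopen("scratch.bin", "w+b");
    for (long i = 0; i < N; ++i) { int v = 1; std::fwrite(&v, sizeof v, 1, f); }
    std::vector<long> positions(N);
    for (long i = 0; i < N; ++i)
        positions[i] = (i % 1000) * (N / 1000) + i / 1000;  // scattered order
    std::printf("%lld %lld\n", sum_random_access(f, positions),
                sum_blocked_scan(f, 1 << 20));
    std::fclose(f);
    return 0;
}

Both calls compute the same sum; only the access pattern differs, which is exactly the gap the slide is about.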
I/O-efficient Algorithms
Taking advantage of block access is important
Traditionally, algorithms were developed without block considerations
I/O-efficient algorithms lead to large runtime improvements
[Graph: running time vs. data size; a normal algorithm’s running time explodes once the data size exceeds the main memory size, while an I/O-efficient algorithm keeps scaling]
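To put numbers on “large runtime improvements”, here is the standard comparison from the external-memory (I/O) model of Aggarwal and Vitter (not stated on the slide), with N = input size, M = internal memory size, and B = block size, all measured in elements:

scan(N) = Θ(N / B) I/Os        sort(N) = Θ((N / B) · log_{M/B}(N / B)) I/Os

An algorithm designed without blocks in mind can instead incur close to one disk access per element operation, so for inputs far larger than M the gap is typically several orders of magnitude.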
Scalability: Hierarchical Memory
Block access is not only important on the disk level
Machines have a complicated memory hierarchy
Levels get larger and slower
Block transfers on all levels
[Diagram: memory hierarchy with L1, L2, and RAM; graph of running time vs. data size]
My Research Work
Theoretical work on I/O- (and cache-) efficient algorithms
Data structures, computational geometry, graph theory, …
Focus on Geographic Information Systems problems
Algorithm engineering work, e.g.:
TPIE system for simple, efficient, and portable implementation of I/O-efficient algorithms
Software for terrain data processing
LIDAR data handling
Terrain flow computations
Example: Terrain Flow
Terrain water flow has many important applications
Predict location of streams, areas susceptible to floods…
[Images: 7 am and 3 pm]
Conceptually, flow is modeled using two basic attributes (sketched in code below):
Flow direction: the direction water flows at a point
Flow accumulation: the amount of water flowing through a point
Flow accumulation is used to compute other hydrological attributes: drainage network, topographic convergence index…
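As a minimal in-memory sketch of these two attributes (illustrative only: the 4x4 grid and its elevations are made up, and each cell drains to the neighbor with the largest elevation drop, a simplification of the usual D8 rule that weights diagonal neighbors by distance), the following C++ first computes flow directions, then flow accumulation by visiting cells from highest to lowest and pushing each cell’s water to its downslope neighbor.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int R = 4, C = 4;                      // tiny hypothetical DEM
    std::vector<double> elev = {                 // made-up elevations, row-major
        9, 8, 7, 6,
        8, 7, 6, 5,
        7, 6, 5, 4,
        6, 5, 4, 1 };

    // Flow direction: index of the neighbor with the largest drop, if any
    // neighbor is lower than the cell itself; -1 means no outflow.
    std::vector<int> dir(R * C, -1);
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c) {
            double best = 0;                     // largest drop found so far
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc) {
                    int nr = r + dr, nc = c + dc;
                    if ((dr || dc) && nr >= 0 && nr < R && nc >= 0 && nc < C) {
                        double drop = elev[r * C + c] - elev[nr * C + nc];
                        if (drop > best) { best = drop; dir[r * C + c] = nr * C + nc; }
                    }
                }
        }

    // Flow accumulation: every cell receives one unit of water; visiting the
    // cells from highest to lowest and pushing each cell's accumulated water
    // to its downslope neighbor gives the amount flowing through each cell.
    std::vector<int> order(R * C);
    for (int i = 0; i < R * C; ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return elev[a] > elev[b]; });
    std::vector<double> acc(R * C, 1.0);
    for (int i : order)
        if (dir[i] >= 0) acc[dir[i]] += acc[i];

    for (int r = 0; r < R; ++r) {                // print the accumulation grid
        for (int c = 0; c < C; ++c) std::printf("%5.0f", acc[r * C + c]);
        std::printf("\n");
    }
    return 0;
}

Sorting by elevation guarantees a cell’s accumulated flow is final before it is pushed downstream; this sort-then-sweep structure is the kind of pattern that can be made I/O-efficient, which the systems on the following slides do for grids of billions of cells.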
Terrain Flow Accumulation
Collaboration with environmental researchers at Duke University
Appalachian Mountains dataset:
800 x 800 km at 100 m resolution: a few gigabytes
On a ½GB machine: 14 days!!
ArcGIS:
Performance somewhat unpredictable
Days on a few gigabytes of data
Many gigabytes of data…
The Appalachian dataset would be terabytes in size at 1 m resolution
Terrain Flow Accumulation: TerraFlow
We developed theoretically I/O-optimal algorithms
TPIE implementation was very efficient
Appalachian Mountains flow accumulation in 3 hours!
Developed into a comprehensive software package for flow computation on massive terrains: TerraFlow
Efficient: 2-1000 times faster than existing software
Scalable: >1 billion elements!
Flexible: Flexible flow modeling (direction) methods
Extension to ArcGIS
LIDAR Terrain Data Work: TerraStream
TerraStream: a software “pipeline” for handling terrain data (a structural sketch follows below)
Points to DEM (incl. breaklines)
DEM flow modeling (incl. “flooding”, “flat” routing, “noise” reduction)
DEM flow accumulation (incl. river extraction)
DEM hierarchical watershed computation
All stages work for both grid and TIN DEMs
Capable of handling massive datasets
Test dataset: 400M point Neuse river basin (1/3 NC) (>17GB)
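Purely as a structural sketch (these are not TerraStream’s actual interfaces; the Stage type, stage names, and file names are illustrative assumptions), the following C++ shows the pipeline shape the slide describes: each stage streams one file-resident dataset into the next, so intermediate results live on disk and no stage needs the whole dataset in memory.

#include <cstdio>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A stage turns one on-disk dataset into another, ideally by streaming it
// (sequential, blocked reads and writes) rather than loading it into RAM.
using Stage = std::function<void(const std::string& in, const std::string& out)>;

void run_pipeline(std::string input,
                  const std::vector<std::pair<std::string, Stage>>& stages) {
    for (const auto& [name, stage] : stages) {
        std::string output = input + "." + name;   // intermediate stays on disk
        std::printf("stage %s: %s -> %s\n",
                    name.c_str(), input.c_str(), output.c_str());
        stage(input, output);
        input = output;                            // next stage consumes this file
    }
}

int main() {
    // Stage names mirror the slide's list; the bodies are stubs here.
    Stage todo = [](const std::string&, const std::string&) { /* streaming pass */ };
    run_pipeline("lidar_points.xyz", {
        { "dem",        todo },   // points -> DEM (incl. breaklines)
        { "flow_model", todo },   // flooding, flat routing, noise reduction
        { "flow_accum", todo },   // flow accumulation, river extraction
        { "watersheds", todo },   // hierarchical watershed computation
    });
    return 0;
}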
Examples of Ongoing Terrain Work
Terrain modeling, e.g.:
“Raw” LIDAR to point conversion (LIDAR point classification)
(incl. feature detection/removal, e.g. bridges)
Further improved flow and erosion modeling (e.g. carving)
Contour line extraction (incl. smoothing and simplification)
Terrain (and other) data fusion (incl. format conversion)
Terrain analysis, e.g.:
Choke point, navigation, visibility, change detection,…
Major grand goal:
Construction of a hierarchical (simplified) DEM where derived features (water flow, drainage, choke points) are preserved/consistent
Thanks
Lars Arge
[email protected]
Work supported in part by
US National Science Foundation ESS grant EIA–98070734, RI grant EIA–9972879, CAREER grant EIA–9984099, and ITR grant EIA–0112849
US Army Research Office grants W911NF-04-1-0278 and DAAD19-03-1-0352
Ole Rømer Scholarship from Danish Science Research Council
NABIIT grant from the Danish Strategic Research Council
Danish National Research foundation (MADALGO center)