Application of c-squares spatial
indexing to an archive of
remotely sensed data
Tony Rees
Divisional Data Centre
CSIRO Marine Research, Australia
([email protected])
The starting point …
• Archive of c.60,000 satellite scenes, 1986-current, accumulating at another 15+ per day
• Region covered varies for each scene
• There is often a requirement to retrieve scenes that include a specific region or place
• Low-res GIF images (like the examples shown) are available for the last 24 months only
• Polygons can be calculated for each scene, but are not currently stored in any spatially searchable system.
Options for spatial indexing - 1
Bounding rectangles
• Mostly simple to generate/store/query
• Frequently a poor fit to the actual dataset extent
• Potentially become massively over-representative as data approaches the pole
• Do not cope well with oblique or curved boundaries to the dataset footprint (common in satellite images)
• Poor discrimination for fine-scale queries (many “false positives” returned)
• There can be problems with footprints which cross the date line, or include a pole
[Figure: a scene footprint and its bounding rectangle, which follows lat/long grid lines]
Options for spatial indexing - 2
Bounding polygons
• OK to generate/store
• Best fit to the actual dataset extent
• Need additional points as data approaches the pole (regions of high curvature)
• Computationally expensive to query (often use bounding rectangles as well, as pre-filter)
• Tend to require a dedicated GIS system to handle the spatial queries.
[Figure: a scene footprint with its bounding polygon and inflection points]
Options for spatial indexing - 3
Raster (tiled) approach
• Simple to generate/store (as list of tiles intersected)
• Reasonable fit to the actual dataset extent; improves with decreasing unit square size
• Automatically increases resolution as data approaches the pole (regions of high curvature)
• Simple/rapid to query, no computation or dedicated GIS required
• No ambiguity with footprints which extend across the date line, or include a pole.
[Figure: a scene footprint overlaid with its intersected tiles, showing the boundary of all tiles for this scene]
C-squares basics
• Based on a tiled (gridded) representation of the earth’s surface, at a choice of 10 x 10, 5 x 5, 1 x 1, 0.5 x 0.5 degree tiles, etc. (0.5 x 0.5 degree tiles are used in this example)
• Every tile has a unique, hierarchical alphanumeric ID (every child tile ID incorporates the IDs of its parent, grandparent, etc.)
• Dataset (= scene) extents are represented by a list of all the tiles included in, or intersected by, the footprint
• Spatial search comprises looking for one or more tile IDs in the set associated with any dataset (= a simple text search).
(more details – see www.marine.csiro.au/csquares/ )
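As an illustration of how the hierarchical tile IDs are built, here is a minimal PL/SQL sketch of the top (10 x 10 degree) level of the encoding, based on the published c-squares specification. The function name is illustrative (not part of the CMR code), and the finer levels (5, 1 and 0.5 degrees) plus boundary cases such as lat = 90 are omitted:

create or replace function csq_10deg (lat in number, lon in number)
  return varchar2 is
  quad pls_integer;  -- global quadrant digit: 1 = NE, 3 = SE, 5 = SW, 7 = NW
begin
  if    lat >= 0 and lon >= 0 then quad := 1;
  elsif lat <  0 and lon >= 0 then quad := 3;
  elsif lat <  0 and lon <  0 then quad := 5;
  else                             quad := 7;
  end if;
  -- quadrant digit, then the tens digit of |lat|, then the tens of |lon|
  -- padded to two digits: e.g. 42.9 S, 147.3 E (Hobart) encodes as '3414'
  return to_char(quad)
         || to_char(trunc(abs(lat) / 10))
         || lpad(to_char(trunc(abs(lon) / 10)), 2, '0');
end csq_10deg;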
Spatial queries supported
• Retrieve all scenes which include all or part of a given tile
(at 10 x 10, 5 x 5, 1 x 1, or 0.5 x 0.5 degrees), or set of tiles
– optionally, filter also by other criteria e.g. date, satellite, etc.
• Retrieve all tiles associated with a given scene, with option
to:
– draw representation on a map (or range of maps)
– export the list to a file or another database
– compare the list with other lists (~ = polygon overlays)
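In SQL terms, the first query type reduces to a simple equality test on the tile ID. A minimal sketch, using the table names that appear later in this document, and ignoring compressed codes for the moment:

select distinct scene_id
from   satdata.satdata_csq_all
where  csquare = '3414:100:1';   -- all scenes touching one 0.5 x 0.5 degree tile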
NB, the choice of initial tile size is important:
- too few tiles, and only coarse spatial queries are supported;
- too many tiles, and indexes get very large (and queries potentially slow).
However, compression of blocks of contiguous tiles can be quite effective
(1 code can replace 4, 25, 100, or 400 codes in certain cases).
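For example, if a scene covers all four 0.5 x 0.5 degree children of the 1 x 1 degree square 3414:100, the wildcard notation used in the search SQL later in this document collapses them into a single code:

3414:100:1  3414:100:2  3414:100:3  3414:100:4  -->  3414:100:*

Similarly, 3414:1**:* stands for a fully covered 5 x 5 degree square (replacing up to 100 half-degree codes), and 3414:***:* for a fully covered 10 x 10 degree square (up to 400).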
A real-world example
NOAA-12 scene, 21 Jun 2003 06:23
• 10 x 10 degree squares: 28 (base level of hierarchy, cannot compress)
• 5 x 5 degree squares: 99 = 36 after compression
• 1 x 1 degree squares: 1,982 = 515 after compression
• 0.5 x 0.5 degree squares: 7,691 = 704 after compression
[Figure panels: the scene footprint tiled at each square size, plus a detail view at 0.5 x 0.5 degrees]
Practicalities for satellite data
• A single scene may measure (say) approx. 40 x 50 degrees, = 2,000 1 x 1 degree squares, or 8,000 0.5 x 0.5 degree squares
• Quadtree-like compression reduces this to (e.g.) 500 codes at 1 x 1 degree resolution, 700 codes at 0.5 x 0.5 degree resolution
• Still require quite a lot of codes (e.g. 42 million) to represent a collection of 60,000 scenes
• Each code is 10 characters long and scene IDs are (say) 6 characters long, thus c.670 million bytes are required for the raw index data (before compiling secondary indexes, etc.)
• With secondary indexing, probably need around 2 GB (+) to hold the spatial index.
Designing the index
• Option 1: a table of scenes, with a string of the codes comprising each scene footprint, 1 row per scene
– comment: fast to retrieve all codes for a scene, slow for a spatial query. The list of codes may present storage problems (e.g. in database field/s) owing to its length.
• Option 2: a table of c-square codes, with a list of the scenes containing each code, 1 row per code (= an “inverted index”)
– comment: fast for a spatial search, slow to retrieve all codes for a scene. The list of scenes may present storage problems (e.g. in database field/s) owing to its length. Harder to manage with respect to scene addition/deletion etc.
(continues…)
Designing the index (cont’d)
• Option 3: a table of scene/c-square pairs, 1 row per pair, indexed on both columns
– comment: OK to retrieve all codes for a scene, and also for a spatial query. Row length is always constant, and new rows are simply created as required. However, the table potentially becomes very long (e.g. 40 million rows) and may become slow to query …
• Option 4: as option (3), however the single table is replaced by multiple smaller tables – e.g. split by blocks of c-squares, blocks of scenes, or year, or satellite
– comment: will tend to be “tuned” to favour queries which can be completed by accessing the smallest number of tables. Could overcome this by adding (say) an element of Option 1 in parallel (with the added cost of storing the same information in more than one place).
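In Oracle, the multiple smaller tables of Option 4 could also be realised as partitions of one logical table. A sketch only, with illustrative names and an assumed split by year:

create table scene_csq_pairs (
  scene_id   varchar2(6)  not null,
  csquare    varchar2(10) not null,
  scene_year number(4)    not null
)
partition by range (scene_year) (
  partition p_pre2000 values less than (2000),
  partition p_2000s   values less than (2010),
  partition p_max     values less than (maxvalue)
);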
Present philosophy …
• Currently going with Option 3: a table of scene/c-square pairs, 1 row per pair, indexed on both columns.
• However, will consider splitting this along spatial or date lines if performance degrades excessively as the index grows (= Option 4); may then also add Option 1 in parallel, for rapid footprint retrieval if needed.
Actual table structure –
2 “search” tables only
• Scene details (1 row per scene)
• Scene/c-square pairs (600-700 rows per scene)
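Illustrative DDL for the two search tables, with column names and types inferred from the search SQL shown below (the actual CMR schema may well differ):

create table scene_info (
  scene_id        varchar2(6) primary key,
  satellite       varchar2(20),
  scene_date_time date,
  image_location  varchar2(200)
);

create table satdata_csq_all (
  scene_id varchar2(6)  not null references scene_info,
  csquare  varchar2(10) not null   -- may be a compressed (wildcard) code
);

-- indexed on both columns, per Option 3
create index csq_all_scene_ix on satdata_csq_all (scene_id);
create index csq_all_csq_ix   on satdata_csq_all (csquare);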
Spatial Search Interface (prototype):
www.marine.csiro.au/csquares/satsearch.htm
[Figure: clickable map interface]
Example search result
[Figure: example search result page]
SQL for spatial search
(example for a 0.5 degree search square; sqrsize, search_csq, startdate, enddate and sat are parameters supplied by the calling procedure)

select distinct A.scene_id, B.satellite, B.scene_date_time, B.image_location
from satdata.satdata_csq_all A, satdata.scene_info B
where (
        (sqrsize = 0.5 and
          (A.csquare = search_csq                              -- e.g. 3414:100:1 (0.5 degree square)
           or A.csquare = substr(search_csq,1,8) || ':*'       -- 1 level of compression
           or A.csquare = substr(search_csq,1,6) || '**:*'     -- 2 levels of compression
           or A.csquare = substr(search_csq,1,4) || ':***:*')) -- 3 levels of compression
        -- (plus other supported search square size options go here)
      )
  and (startdate is null or B.scene_date_time >= startdate)
  and (enddate is null or B.scene_date_time <= enddate)
  and (sat = 'any' or B.satellite = sat)
  and A.scene_id = B.scene_id
order by B.scene_date_time, B.satellite;
This can be triggered by an HTTP request, e.g. as follows:
http://aqua.hba.marine.csiro.au:7272/marq/owa/satdata_search.get_scenes?
sat=any&startday=20&startmon=06&startyr=2003&endday=22&endmon=
06&endyr=2003&W_long=150&N_lat=-25&E_long=155&S_lat=-30
… This generates either an HTML result page (to the browser), or a machine-parseable scene list, e.g.:
h1awj h1awl h1awm h1awo h1awr h1awt h1aww h1awy h1ax0 h1ax2 h1ax5
h1ax7 h1ax9 h1axb h1axd h1axe h1axh h1axj h1axl h1axn h1axo h1axr
h1axt h1axw h1axx h1ay0 h1ay1 h1ay4 h1ay8 h1ay9 h1ayb h1ayc h1ayd
h1aye h1ayg h1ayh h1ayj h1ayl h1ayo h1ayp h1ayr h1ays h1ayu h1ayw
h1az1 h1az3 h1az5 h1az8 h1aza h1azg h1azj h1azl h1azm h1azn h1azp
h1azq h1azt h1azw h1azz h1b00 h1b03 h1b04 h1b06 h1b09 h1b0a h1b0c
h1b0d h1b0f h1b0i h1b0j h1b0n h1b0q h1b0t h1b0u h1b0w h1b0y h1b10
h1b12 h1b13 h1b15 h1b18 h1b1b h1b1e h1b1h h1b1j h1b1m h1b1o h1b1r
h1b1u h1b1w h1b20 h1b21 h1b24 h1b26 h1b28 h1b2a h1b2b h1b2e h1b2g
Automated scene processing and upload
1. An automated routine prepares 2 sets of files: scene information, and scene polygons, in batches of 100-300 scenes
2. For the first batch, a script initiates file upload to the database, then file conversion (c-square generation) for each scene in the batch (currently takes c.3 minutes/scene)
3. When all scenes have been converted, the c-squares are added to the master searchable table, and the scenes are flagged as “upload completed”
4. Close to the estimated completion time, a robot wakes up and checks at 10-minute intervals to detect when the job is complete (via an HTTP call to the same procedure which creates the web scene count on the opening screen)
5. When the scene count returns “0 scenes currently being uploaded”, the robot calls the script to repeat stages (2)-(5) with the next batch of scenes.
C-square encoding algorithm
• “First generation” conversion code constructed January-May 2003, at CMR Hobart
• Currently runs in Oracle PL/SQL; reads/writes from/to 6 Oracle tables (including compression of the c-square string from each scene)
• Includes special treatment for:
– scenes crossing the date line
– scenes including a pole
• Polar X-Y coordinates used to interpolate between quoted boundary points (gives a better representation for the southernmost boundary)
• Special “anti-bleed” treatment incorporated (polygons touching, but not entering, a given square do not trigger that square)
• Could probably benefit from re-writing the code in a different language for faster operation, and/or refining the method of operation.
Next steps …
• Keep a watch on index performance (i.e., retrieval
times) as database size increases
• Assess whether this approach is attractive compared with systems already in use elsewhere – and whether it is of interest to others
• Look into improving the encoding procedure speed
(currently a bottleneck for processing a large
archive).