DATA MODELS IN GIS

OUTLINE:
Overview of models
Data and levels of measurements
Raster and vector models
Conversion between models
Databases
DIGITAL INFORMATION

GIS requires that both data and maps be represented as numbers

GIS places data into the computer's memory in a physical data structure (i.e. files and directories)

files can be written in binary or as ASCII text

binary is faster to read and smaller; ASCII can be read and edited by humans but uses more space

data are sent through a "pipe" consisting of 0s and 1s, stored on devices that can store only 0s and 1s, and processed as 0s and 1s
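The trade-off between binary and ASCII storage can be illustrated with a minimal sketch. The sample values below are hypothetical, and packing each number as an 8-byte double is only one possible binary layout, not the format of any particular GIS.

```python
import struct

# Hypothetical sample: eight coordinate values such as a GIS might store.
values = [623412.25, 623424.5, 623478.0, 623482.75,
          623429.125, 623508.5, 623555.0, 623634.875]

# ASCII encoding: human-readable text, one value per line.
ascii_bytes = "\n".join(repr(v) for v in values).encode("ascii")

# Binary encoding: each value packed as an 8-byte little-endian double.
binary_bytes = struct.pack(f"<{len(values)}d", *values)

# Both encodings decode back to the same numbers.
from_ascii = [float(s) for s in ascii_bytes.decode("ascii").split("\n")]
from_binary = list(struct.unpack(f"<{len(values)}d", binary_bytes))

print(len(ascii_bytes), "bytes as ASCII;", len(binary_bytes), "bytes as binary")
```

For these values the text form is longer than the packed form, but it can be opened and corrected in any text editor.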
DATA

locational and attribute data in a GIS

attribute type: discrete vs continuous

discrete: presumed to occur at distinct locations
with empty locations having a value of zero for the
attribute in question

continuous: feature occurs throughout
geographical region; no locations are empty
DATA
Levels of Measurement:

four levels are commonly recognized – nominal,
ordinal, interval and ratio

each subsequent level includes all characteristics of
preceding levels

data available at higher levels can be reduced to
lower levels; opposite is not true
LEVEL OF MEASUREMENTS
Nominal Scale

objects are classed into groups; groups possess
arbitrary labels (numbers/names)
e.g. religion, land use/cover

discrete variable
LEVEL OF MEASUREMENTS
Ordinal Scale

categorization plus an ordering/ranking of data
e.g. country road, street, highway

can identify larger/smaller but cannot comment on the degree of difference between values

only the order matters: K=5, L=3, M=1 is equivalent to K=500, L=300, M=10
discrete variables
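A short sketch of why the two sets of K, L, M scores are equivalent on an ordinal scale: only the ranking is meaningful, and both sets produce the same ranking. The helper name ranks and the road speeds are hypothetical, chosen only to illustrate reducing ratio data to ordinal and then nominal data.

```python
def ranks(scores):
    """Return the labels ordered from smallest to largest score."""
    return sorted(scores, key=scores.get)

a = {"K": 5, "L": 3, "M": 1}
b = {"K": 500, "L": 300, "M": 10}
same_order = ranks(a) == ranks(b)   # the magnitudes differ, the order does not

# Reducing a higher level to a lower one (the reverse is not possible).
speed_kmh = {"country road": 50, "street": 30, "highway": 100}  # ratio (hypothetical)
ordinal = {name: rank for rank, name in enumerate(ranks(speed_kmh), start=1)}
nominal = set(speed_kmh)            # labels only, no order
```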
LEVEL OF MEASUREMENTS
Interval Scale

measurements arranged in rank and distance
between measurements is known

no “true” zero point
e.g. elevation/topographic lines,
temperature in °C

discrete or continuous
LEVEL OF MEASUREMENTS
Ratio Scale

like interval scaling: both rank
and separation are known,
but there is also a known,
fixed starting point
e.g. temperature on the Kelvin
scale; speed

continuous and discrete
DATA MODELS – REPRESENTING
DATA
1. Reality – total phenomena as they actually exist
2. Conceptual Data Model – describes and defines included
entities (how they will be represented)
3. Logical Data Model – logical organization of the database
elements
4. Physical Data Model or File Structure – how information
will be structured for access
DATA MODELS

logical data model is how data are organized for use
by the GIS.

GISs have traditionally used either raster or vector for
maps.


raster – based on pixels

vector – based on points, lines and polygons
while most GISs can handle both raster and vector data, only one is used for the internal organization of spatial data.
DATA MODELS

rasters and vectors can be flat files … if they are
simple
Raster-based line (flat file):
0000000000000000
0001100000100000
1010100001010000
1100100001010000
0000100010001000
0000100010000100
0001000100000010
0010000100000001
0111001000000001
0000111000000000
0000000000000000
Vector-based line (flat file of coordinate pairs):
4753456 623412
4753436 623424
4753462 623478
4753432 623482
4753405 623429
4753401 623508
4753462 623555
4753398 623634
RASTER DATA MODELS

basic unit is cells or pixels which are uniformly
spaced

each cell/pixel has spatial and spectral information.
e.g. digital elevation data and digital images

spatially exhaustive sampling of the area of interest


every cell has a value, even if it is “missing.”
cell has a resolution, given as the cell size in ground
units.

higher resolution, smaller cell dimensions
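How cell size ties a grid to the ground can be sketched as follows. The function names and the convention of measuring rows downward from the top edge of the grid extent are assumptions for illustration; individual GIS packages may use other conventions.

```python
import math

def cell_index(x, y, x_min, y_max, cell_size):
    """Return (row, col) of the cell containing ground point (x, y).

    Rows are counted downward from the top edge y_max and columns
    rightward from the left edge x_min (an assumed convention).
    """
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)
    return row, col

def cell_count(width, height, cell_size):
    """Number of cells needed to cover an extent at a given cell size."""
    return math.ceil(height / cell_size) * math.ceil(width / cell_size)

# Halving the cell size (finer resolution) quadruples the number of cells.
coarse = cell_count(1000, 1000, 10)
fine = cell_count(1000, 1000, 5)
```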
RASTER DATA MODELS
[Figure: Generic structure for a grid, labelling the grid extent, rows, columns, grid cell and resolution.]
[Figure: Fining of resolution.]
CREATING RASTER DATA MODELS



creating raster is like laying a grid over a map

code each cell with a value representing attribute

every cell has a value, even if null or zero
(integers, ratios, etc.)
values for each cell are written into a file (using a spreadsheet, database or word processor)

the file is imported into the GIS so it can be reformatted

each pixel presumably has one value; is this correct in reality? (the mixed pixel issue)
RASTER AND MISSING DATA
GIS data layer as a grid with a large section of “missing data,” in this
case, the zeros in the ocean off of New York and New Jersey.
MIXED PIXEL ISSUE
[Figure: the same mixed cells coded under three rules ("Water dominates", "Winner takes all", "Edges separate"); cells are labelled W, G or E.]
[Figure: four ways to assign a value to a mixed water/land cell: "Largest share", "Central point", "Presence/Absence" and "Percent occurrence" (example values 35%, 70%, 80%, 100%).]
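The four coding rules named for mixed water/land cells (largest share, central point, presence/absence, percent occurrence) can be compared with a small sketch. It assumes each raster cell has been sampled on a finer square sub-grid of 'W' (water) and 'L' (land) values; the function name and that sampling scheme are hypothetical.

```python
def code_cell(subcells):
    """Code one raster cell from a square sub-grid of 'W'/'L' samples."""
    flat = [v for row in subcells for v in row]
    share = flat.count("W") / len(flat)          # fraction of the cell that is water
    n = len(subcells)
    return {
        # Largest share: the class covering most of the cell (ties go to land here).
        "largest_share": "W" if share > 0.5 else "L",
        # Central point: the class found at the middle sample.
        "central_point": subcells[n // 2][n // 2],
        # Presence/absence: is there any water at all?
        "presence_absence": share > 0,
        # Percent occurrence: how much of the cell is water.
        "percent_occurrence": round(100 * share),
    }

cell = [["W", "W", "L"],
        ["W", "L", "L"],
        ["L", "L", "L"]]
result = code_cell(cell)
```

The same cell is coded as land by the first two rules and as containing water by the third, which is why the rule used should be recorded with the data.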
CREATING RASTER DATA MODELS


raster data visualized as map layers

map layer: data describing a single characteristic
for a location

multiple items of information require multiple
layers
creates problems – raster databases can become
enormous

each map layer has thousands of cells
RASTER DATA MODELS
Advantages

simple data structures

each cell can be owned by only one feature.

overlay and combination of maps and remote sensed
images easy

simulation easy, because cells have the same size
and shape

technology is cheap
RASTER DATA MODELS
Advantages

some spatial analysis methods simple to perform

local: cell by cell calculations

focal: models cell value based on neighbours

zonal: models cell value based on geographical
areas

global: models cell value based on all cells
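The four classes of operation can be sketched on a tiny grid in plain Python. The grid values, the zone labels and the particular functions chosen (doubling, neighbourhood mean, zone sum, deviation from the overall mean) are illustrative assumptions only.

```python
grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
zones = [["a", "a", "b"],
         ["a", "b", "b"],
         ["c", "c", "c"]]
rows, cols = len(grid), len(grid[0])

# Local: each output cell depends only on the same input cell.
local = [[v * 2 for v in row] for row in grid]

# Focal: each output cell is the mean of its 3x3 neighbourhood (clipped at edges).
def focal_mean(r, c):
    window = [grid[i][j]
              for i in range(max(0, r - 1), min(rows, r + 2))
              for j in range(max(0, c - 1), min(cols, c + 2))]
    return sum(window) / len(window)

focal = [[focal_mean(r, c) for c in range(cols)] for r in range(rows)]

# Zonal: each output cell is the sum of all cells sharing its zone.
zone_sum = {}
for r in range(rows):
    for c in range(cols):
        zone_sum[zones[r][c]] = zone_sum.get(zones[r][c], 0) + grid[r][c]
zonal = [[zone_sum[zones[r][c]] for c in range(cols)] for r in range(rows)]

# Global: each output cell depends on every cell (deviation from the overall mean).
mean = sum(map(sum, grid)) / (rows * cols)
global_dev = [[v - mean for v in row] for row in grid]
```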
RASTER DATA MODELS
Disadvantages

volumes of graphic data

use of large cells to reduce data volumes

poor at representing points, lines and areas; good at
surfaces

must often include redundant or missing data

network linkages are difficult to establish

projection transformations are time consuming
COMPRESSION TECHNIQUES

raster compression techniques used in GIS are run-length encoding and quadtrees
Run-length Encoding – more efficient

values often occur in runs across several cells

form of spatial autocorrelation

e.g. array 0 0 0 1 1 0 0 1 1 1 0 0 1 1 1 would be
entered as 3 0 2 1 2 0 3 1 2 0 3 1
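A minimal sketch of run-length encoding that reproduces the example above. The function names are illustrative, and runs are stored as (count, value) pairs.

```python
from itertools import groupby

def rle_encode(cells):
    """Collapse consecutive equal values into (run length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(cells)]

def rle_decode(pairs):
    """Expand (run length, value) pairs back into the original sequence."""
    return [value for count, value in pairs for _ in range(count)]

row = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1]
pairs = rle_encode(row)
flat = [n for pair in pairs for n in pair]   # 3 0 2 1 2 0 3 1 2 0 3 1
```

The saving depends on how long the runs are: a row that alternates value in every cell would take more space encoded than raw.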
RUN-LENGTH CODING
Legend: A. Mixed Conifer, B. Douglas Fir, C. Oak Savannah, D. Grassland
Row-by-row coding (56 entries for 7x8 array):
CCCCCBBDCCCCBBDCCCBBBDDCBBA
ADDDDBAADDBBBAADDDAAAADDDA
AAA
Run-length coding (22 pairs, or 44 entries, for 7x8 array):
5C 2B 1D 4C 2B 1D 3C 3B 2D 1C 2B 2A 4D
1B 2A 2D 3B 2A 3D 4A 3D 4A
COMPRESSION TECHNIQUES
Quadtree Compression

hierarchical data model using a variable-sized grid cell

finer subdivisions are used in areas requiring finer
detail (higher resolution)

pixel in each higher layer is derived from average or
majority of 4 pixels from the lower layer

not as efficient for more variable or complex data

used primarily as a way to store data for rapid retrieval
on display devices
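The variable-sized cells of a quadtree can be sketched with a recursive subdivision. This is a simplified region quadtree that only merges blocks whose cells are all identical; it does not build the averaged or majority layers mentioned above. It assumes a square grid whose side is a power of two, and the function names are hypothetical.

```python
def build_quadtree(grid, r0=0, c0=0, size=None):
    """Return a leaf value for a homogeneous block, otherwise a list of
    four subtrees in the order NW, NE, SW, SE."""
    if size is None:
        size = len(grid)
    first = grid[r0][c0]
    if all(grid[r][c] == first
           for r in range(r0, r0 + size)
           for c in range(c0, c0 + size)):
        return first                      # homogeneous block: store one value
    half = size // 2
    return [build_quadtree(grid, r0, c0, half),
            build_quadtree(grid, r0, c0 + half, half),
            build_quadtree(grid, r0 + half, c0, half),
            build_quadtree(grid, r0 + half, c0 + half, half)]

def count_leaves(node):
    """Number of stored values in the tree."""
    return 1 if not isinstance(node, list) else sum(count_leaves(n) for n in node)

grid = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [0, 0, 1, 1]]
tree = build_quadtree(grid)
```

Only the south-east quadrant needs subdividing here, so the tree stores fewer values than the 16 cells of the grid. A highly variable grid would subdivide almost everywhere, which is why the method is less efficient for complex data.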
QUAD TREE STRUCTURE
RASTER DATA FORMAT

most raster formats are digital image formats.

most GISs accept TIF, GIF, JPEG or encapsulated
PostScript, which are not georeferenced.

DEMs are true raster data formats.
VECTOR DATA MODELS

think of world as a space populated by discrete
features of various shapes and kinds – points, lines,
areas.

any location in space may be empty or occupied by
one or more point, line or area.
VECTOR DATA MODELS
point

zero-dimensional abstraction of an object represented by a
single X,Y co-ordinate.

normally represents a geographic feature too small to be
displayed as a line or area

stored by their real (earth) coordinates
VECTOR DATA MODELS
line

set of ordered co-ordinates that represent the shape of
geographic features too narrow to be displayed as an area at
the given scale or linear features with no area

lines and areas are built from sequences of points in order.

lines have a direction to the ordering of the points.
VECTOR DATA MODELS
polygon

feature used to represent areas.

defined by the lines that make up its boundary and a point
inside its boundary for identification.

have attributes that describe the geographic feature they
represent.
VECTOR DATA MODELS

vector data evolved the arc/node model in the 1960s.

an area consists of lines and a line consists of
points.

points, lines, and areas can each be stored in their own
files, with links between them.

endpoint of a line (arc) is called a node; arc junctions
are only at nodes.

stored with the arc is the topology (i.e. the connecting
arcs and left and right polygons).
TOPOLOGY

topological data structures dominate GIS software.

stored explicitly

allows automated error detection and elimination.

rarely are maps topologically clean when digitized or
imported.

GIS has to be able to build topology from unconnected
arcs.
[Figure: Arc/Node Map Data Structure with Files. Polygon "A" is enclosed by arcs 1 and 2, which pass through points 1 to 13.]
Points File
1 x y
2 x y
3 x y
4 x y
5 x y
6 x y
7 x y
8 x y
9 x y
10 x y
11 x y
12 x y
13 x y
Arcs File
1 1,2,3,4,5,6,7
2 1,8,9,10,11,12,13,7
File of Arcs by Polygon
A: 1,2, Area, Attributes
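The three linked files of the arc/node structure can be mirrored with three dictionaries. The arc and polygon contents are taken from the figure; the point coordinates are hypothetical placeholders, since the figure gives only "x y", and the function name is illustrative.

```python
# Points file: point ID -> (x, y). Coordinates are made up for illustration.
points = {i: (float(i), float(i % 3)) for i in range(1, 14)}

# Arcs file: arc ID -> ordered list of point IDs.
arcs = {1: [1, 2, 3, 4, 5, 6, 7],
        2: [1, 8, 9, 10, 11, 12, 13, 7]}

# Arcs-by-polygon file: polygon ID -> list of bounding arc IDs.
polygons = {"A": [1, 2]}

def polygon_ring(poly_id):
    """Chain a polygon's arcs into one closed ring of point IDs, reversing
    an arc when needed so it starts where the previous arc ended."""
    ring = []
    for arc_id in polygons[poly_id]:
        ids = arcs[arc_id]
        if ring and ids[0] != ring[-1]:
            ids = ids[::-1]
        ring.extend(ids if not ring else ids[1:])
    return ring

ring = polygon_ring("A")
nodes = {p for ids in arcs.values() for p in (ids[0], ids[-1])}  # arc endpoints
coords = [points[i] for i in ring]   # coordinates are looked up only when needed
```

The ring is assembled purely from IDs, which illustrates how topological queries can run without reading the points file until coordinates are actually required.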
TOPOLOGY

relationship between nodes, arcs and polygons.

topologically structured database for ease of retrieval
and implementation of spatial-relational operations.

advantages:

simple, elegant and efficient

relational database construction and analysis

complete topology makes map overlay feasible.

topology allows many GIS operations to be done
without accessing the point files.
VECTOR DATABASE CREATION

database creation involves several stages:

input of the spatial data

input of the attribute data

linking spatial and attribute data

spatial data is entered via digitized points and lines,
scanned and vectorized lines or directly from other
digital sources

once the spatial data has been entered, much work is
still needed before it can be used
VECTOR DATABASE CREATION
Building Topology

once points are entered and geometric lines are
created, topology must be "built"

this involves calculating and encoding relationships
between the points, lines and areas

this information may be automatically coded into tables
of information in the database
VECTOR DATABASE CREATION
Editing

during topology generation process, problems such as
overshoots, undershoots and spikes are either flagged
for editing by the user or corrected automatically

automatic editing involves the use of a tolerance value
which defines the width of a buffer zone around objects
within which adjacent objects should be joined
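Tolerance-based automatic editing can be sketched as snapping dangling arc endpoints to the nearest existing node. The function name and sample coordinates are hypothetical, and real GIS software applies more elaborate rules for overshoots and spikes.

```python
import math

def snap_endpoints(endpoints, nodes, tolerance):
    """Move each endpoint onto the nearest node if it lies within the
    tolerance; otherwise keep it and flag it for manual editing."""
    snapped, flagged = [], []
    for p in endpoints:
        nearest = min(nodes, key=lambda n: math.dist(p, n))
        if math.dist(p, nearest) <= tolerance:
            snapped.append(nearest)       # small gap: join automatically
        else:
            snapped.append(p)             # too far: leave for the user
            flagged.append(p)
    return snapped, flagged

nodes = [(0.0, 0.0), (10.0, 10.0)]
endpoints = [(0.0, 0.1), (5.0, 5.0)]
snapped, flagged = snap_endpoints(endpoints, nodes, tolerance=0.5)
```

Choosing the tolerance is a trade-off: too small and undershoots remain, too large and features that should stay separate are merged.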
VECTOR DATA MODELS
Advantages

good representation of structures (points, lines,
polygons)

compact and more efficient

topology can be completely described

accurate graphics

retrieval, updating and generalization of graphics and
attributes possible

work well with pen and light-plotting devices and tablet
digitizers.
VECTOR DATA MODELS
Disadvantages

complex data structures

combination of several vector polygon maps or polygon
and raster maps through overlay creates difficulties

simulation is difficult

display and plotting can be expensive

technology is expensive

not good at continuous coverage or plotters that fill
areas.

TIN must be used to represent volumes.
VECTOR DATA FORMATS

vector formats are either page definition languages or
preserve ground coordinates.

page languages are HPGL, PostScript, and AutoCAD
DXF.

true vector GIS data formats include ArcView
Shapefiles and ArcGIS Interchange Files (E00); the
latter store topology.
VECTOR DATA MODELS
List of coordinates “spaghetti”

simple

easy to manage

no topology

lots of duplication, hence need for large storage space

very often used in CAC (computer assisted cartography)
VECTOR DATA MODELS
Vertex Dictionary

no duplication, but still this model does not use topology
VECTOR DATA MODELS
Dual Independent Map Encoding (DIME)

developed by US Bureau of the
Census

nodes (intersections of lines) are
identified with codes

assigns a directional code in the
form of a "from node" and a "to
node"

both street addresses and UTM
coordinates are explicitly defined
for each link
VECTOR TO RASTER EXCHANGE


data exchange by translation (export and import) can lead to
significant errors in attributes and in geometry.
efficient data exchange is important for the future of GIS.
ADVANCED DATA MODELS - TIN

triangulated irregular network is a set of elevation
points which have been connected to form a network
of triangles.

developed in early 1970s as a simple way to build a
surface

the sample points are connected by lines to form
triangles; within each triangle the surface is usually
represented by a plane

triangles fit together in a manner which simulates the
face of the land.
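Representing the surface inside each triangle by a plane means the elevation at any interior point follows from the three vertex elevations. The sketch below uses barycentric weights; the function name and the sample triangle are hypothetical.

```python
def interpolate_z(p, a, b, c):
    """Elevation at planimetric point p = (x, y) on the plane through the
    triangle vertices a, b, c, each given as (x, y, z)."""
    (x, y), (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = p, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    # Barycentric weights: how strongly each vertex influences point p.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    return w1 * z1 + w2 * z2 + w3 * z3

# Hypothetical triangle lying on the plane z = 100 + x + 2y.
a, b, c = (0.0, 0.0, 100.0), (10.0, 0.0, 110.0), (0.0, 10.0, 120.0)
z = interpolate_z((2.0, 3.0), a, b, c)
```

At a vertex the weights reduce to 1, 0, 0, so the surface passes exactly through the sample points.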
ADVANCED DATA MODELS - TIN

irregularly spaced sample points can be adapted to
the terrain

rough terrain - more points

smooth terrain - less points

an irregularly spaced sample is more efficient
ADVANCED DATA MODELS - TIN


a TIN can be seen as a set of triangular polygons, each having attributes of slope, aspect and area, and three vertices having elevation attributes

TIN models work best in areas with sharp breaks in slope
ADVANCED DATA MODELS - TIN
Advantages

ability to describe the surface at different levels of resolution

efficiency in storing data

allows simple calculation of basin areas, slopes, channels, and many other geometric parameters

Disadvantages

in many cases requires visual inspection and manual control of the network
DATABASES

a spatial database is a collection of spatially
referenced data that acts as a model of reality

these selected phenomena are deemed important
enough to represent in digital form

the digital representation might be for some past,
present or future time period
DIGITAL DATABASES

scaleless: data can be stored at the level of detail
found in the environment

cartographer is responsible for choosing the content
and resolution

scale critical factor:

level of resolution set by field instruments

digitizing - resolution of instrument and
abstraction and production factors
DIGITAL DATABASES

problems arise when using data sets of different resolutions,
e.g. roads may not line up

resolved using ancillary source materials

additional problems arise when using data sets of different
themes, e.g. combining elevation and drainage data can produce water
running uphill or non-level lakes
DIGITAL DATABASES
Value of databases:

Cost of creation – cheaper to get data from an
existing database

Appropriateness of use

Lack of alternative data sources

Graphic output
METADATA

“data about the data”

could include data elements that: identify the
data, identify the custodians and access conditions
to the data, describe projection, content, quality of
data

describes the action taken when handling
databases of varying scale
Dataset information
Title: Ortofotos'95
Abstract: Ortofotos'95 is a collection of ortho-rectified aerial photographs. These aerial photographs cover Portugal and were obtained in August 1995 on false color infrared film at scale 1:40 000. CNIG, the Directorate General of Forests and the paper mill industry are the owners of the aerial photographs (in paper format).
Type of dataset: Airborne data > Aerial photos
Locations: Portugal
Temporal range: 1995-
Dataset scales: 1:25 000-1:50 000
Dataset resolution: 1 - 3 meters
Dataset quality remarks: Acquisition of data: aerial photographs; the film is scanned at very high resolution and ortho-rectified using a DTM derived from topographic cartography at scale 1:25 000
Information creation date: 1999-10-29
DATABASES

pre-1970s, command line based with read and write
to hard disk, tapes, diskettes

database approach – all reading and writing through
simple interface (no need to care about tapes, etc.)

for small GIS projects it is sufficient to store geographic
information as simple files.

with large data volumes and number of data users
best to use a database management system (DBMS)

relational design has been the most useful (since
1980s)
DATABASE MANAGEMENT
SYSTEMS

contain tables or feature classes in which:

rows are entities (records, observations, features): all information about one occurrence of a feature

columns are attributes (fields, data elements, variables): one type of information for all features

a key field is an attribute whose values uniquely identify each row
Parcel Table
Parcel #   Address        Block   $ Value
8          501 N Hi       1       105,450
9          590 N Hi       2       89,780
36         1001 W. Main   4       101,500
75         1175 W. 1st    12      98,000
Each row is an entity; each column (e.g. $ Value) is an attribute; Parcel # is the key field.
DATABASES - RDBM

tables are related or joined using a common record
identifier (column variable) present in both tables
Example:

goal: produce map of values by district/neighbourhood

problem: no district code available in parcel table
Parcel Table
Parcel #   Address        Block   $ Value
8          501 N Hi       1       105,450
9          590 N Hi       2       89,780
36         1001 W. Main   4       101,500
75         1175 W. 1st    12      98,000
DATABASES - RDBM

solution: join parcel table containing values with
geography table containing location codings, using
Block as key field
Parcel Table
Parcel #   Address        Block   $ Value
8          501 N Hi       1       105,450
9          590 N Hi       2       89,780
36         1001 W. Main   4       101,500
75         1175 W. 1st    12      98,000
Block is the secondary or foreign key.

Geography Table
Block   District   Tract   City
1       A          101     Dallas
2       B          101     Dallas
4       B          105     Dallas
12      E          202     Garland
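The join described on this slide can be tried with Python's built-in sqlite3 module. The table and column names below are hypothetical renderings of the Parcel and Geography tables, and SQLite merely stands in for whichever relational DBMS a GIS actually uses.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE parcel (parcel_no INTEGER PRIMARY KEY, address TEXT,
                     block INTEGER, value INTEGER);
CREATE TABLE geography (block INTEGER PRIMARY KEY, district TEXT,
                        tract INTEGER, city TEXT);
INSERT INTO parcel VALUES
    (8, '501 N Hi', 1, 105450), (9, '590 N Hi', 2, 89780),
    (36, '1001 W. Main', 4, 101500), (75, '1175 W. 1st', 12, 98000);
INSERT INTO geography VALUES
    (1, 'A', 101, 'Dallas'), (2, 'B', 101, 'Dallas'),
    (4, 'B', 105, 'Dallas'), (12, 'E', 202, 'Garland');
""")

# Join on the shared Block column, then total the parcel values per district.
rows = con.execute("""
    SELECT g.district, SUM(p.value)
    FROM parcel AS p JOIN geography AS g ON p.block = g.block
    GROUP BY g.district ORDER BY g.district
""").fetchall()
```

Neither table alone can answer the question; the district totals exist only once the two are linked through Block.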
DATABASES - RDBM
[Figure: Relational linkages between spatial attributes (water right locations) and descriptive attributes.]
DATABASES
Advantage

very flexible

export data to another system easily

enables simple operations
e.g. search for records satisfying some condition
Description                          Thickness    Code
New Ice                              <10 cm       1
Nilas, Ice Rind                      0-10 cm      2
Young Ice                            10-30 cm     3
Grey Ice                             10-15 cm     4
Grey-White Ice                       15-30 cm     5
First-Year Ice                       30-200 cm    6
Thin First-Year Ice                  30-70 cm     7
Thin First-Year Ice, first stage     30-50 cm     8
Thin First-Year Ice, second stage    50-70 cm     9
Medium First-Year Ice                70-120 cm    1.
Thick First-Year Ice                 120-200 cm   4.
Old Ice                                           7.
Second-Year Ice                                   8.
Multi-Year Ice                                    9.