Transcript filename

Creating and Using Attribute Databases
In this lesson you will learn:
•
concept of the attribute database as a table
•
database elements: variables, observations, data, labels, data dictionary, aliases,
indexes
•
data types and formats
•
basic database operations
•
attribute queries
•
attribute statistics
•
attribute data graphs
The attribute database as a table
The attribute database as a table
Database elements
Portion of the data dictionary
for the Illinois Historic Tornado
database.
Creating the database
Steps in creating the attribute database
1.
2.
3.
4.
5.
identify the attributes to be captured
create attribute columns for each attribute; label each column
specify the data type for each attribute
specify validation rules for each attribute
specify the data format for each attribute
Data types
Data formats
Data type
Formatting Options
text
length of data field; i.e., maximum number of characters
Ex.: “Indian Shoals State Park “ has a length of 25 characters
integer numeric
short vs. long integer
short integer – stores numbers from –32,768 to 32,767
long integer – stores numbers from –2,147,483,648 to 2,147,483,647
decimal numeric
precision and scale:
precision = maximum length of the decimal number, including the
decimal point and digits to the left and right of the decimal point. Ex.:
176.8859 has a precision value of 8.
scale =maximum number of digits to the right of the decimal place. Ex.:
176.8859 has a scale value of 4.
Tabular database formats
Common database “exchange” formats
Database standards
Spreadsheet standards
MS-Access
filename.mdb
Paradox
filename.db
dBase II, III, IV, 5, 7
filename.dbf
MS-Excel
filename.xls
Lotus 1-2-3
filename.wks
Quattro Pro
filename.wq1
Open Database Connectivity
ODBC-compliant applications: ..assorted..
MS-Access, Visual FoxPro,
SQL Server, Oracle, dBase,
Paradox, DB2, Sybase, etc.
Formatted “text” file
Delimited text (comma
delimited, tab delimited, etc.)
filename.csv
filename.txt
Fixed-width text
filename.txt
Tabular database formats
Comma-delimited text (filename.csv)
Tab-delimited text (filename.txt)
Fixed-width text (filename.txt)
Basic database operations
a. data entry &
editing
b. sorts
c. queries
d. data statistics
e. data graphs
Basic database operations: data maintenance
• add/delete observations
• add/delete attribute fields
• edit data
− spell check (text & memo fields)
− find/replace
− re-enter data
− append new observations
• restructure attribute data
− calculate new field based on existing fields
− modify format
− change data type
Basic database operations: sorts
Single-column sort
Obs.
Index
Tract
13
A.013
101
147
A.147
419
AvgInc
Obs.
City
Ward
CityID
Alderman
ResidLU
2324
44200
18
Decatur
001
121
“R”
0.92
103
977
57800
19
Decatur
002
121
“I”
0.67
B.219
104
854
63400
21
Decatur
004
121
“R”
0.89
83
B.083
107
3842
33460
24
Decatur
005
121
“R”
0.70
6
A.006
109
2771
50050
20
Decatur
007
121
“D”
0.74
214
B.014
211
1644
38880
22
Decatur
009
121
“I”
0.23
189
A.189
212
1897
40010
115
Dixon
001
144
“R”
0.88
164
A.164
215
1330
39770
111
Dixon
002
144
“I”
0.80
97
A.097
217
1018
40005
113
Dixon
003
144
“R”
0.54
255
B.055
323
1226
47340
114
Dixon
004
144
“D”
0.66
337
B.137
618
1897
30500
112
Dixon
005
144
“R”
0.45
392
B.192
620
2170
30390
79
Elgin
003
207
“D”
0.61
…
…
…
…
…
…
…
…
…
…
Popln
Multi-column sort
…
Basic database operations: hierarchical sorts
(1)
(2)
(3)
Aa
01
ii
Aa
01
i
Aa
01
i
Aa
02
ii
Aa
02
ii
Aa
02
Ab
01
iii
Ab
01
ii
Ab
01
ii
Ab
02
iv
Ab
02
Ab
02
i
ii
ii
Hierarchical sort: column 1 (ascending); column 2 (ascending); column 3 (descending)
Simple attribute queries
Simple attribute queries
Land-use percentage, by city ward
residential
transportation & utilities
commercial
parks & open space
industrial
Compound attribute queries
Color
Age
Black
Chestnut
Bay
Gray
0-1 yr
1-2 yrs
2-3 yrs
3-5 yrs
> 5 yrs
The contingency table view of compound attributes
Buckskin
White
Multi-attribute queries
Color
Age
0-1 yr
1-2 yrs
2-3 yrs
3-5 yrs
> 5 yrs
Black
Chestnut
Bay
Gray
Buckskin
White
Multi-attribute queries
Operator
Set action
Logic Outcome
NOT
set complement
Logical converse of the operand.
AND
intersection of two sets
True if both operands are true, false otherwise.
OR
union of two sets
True if either 1st or 2nd operand is true, or if
both are true. False if both operands are false.
XOR
union less intersection
True if 1st operand is true or 2nd operand is
true. False if both are true or both are false.
Compound statements are written in the form: operand-1 LOGICAL OPERATOR operand-2;
i.e., horse = black AND horse = 5 years of age or older
The set
of Black
horses
“NOT Black”
horses
The set of all horses
The set of
Black
horses
The set of
horses ≥
5 yrs old
Data statistics
Measurement
scale (model)
Properties
Allowable operations
Examples
1. Nominal
measures
“categories”
count
eye color, land use
2. Ordinal
identifies order:
most to least,
smallest to largest;
count, <, =, >
class standing (fr, so, jr,
sr), physiographic relief
3. Interval
quantitative: no true
zero, but preserves
equal intervals
count, <, =, >, +, average, range, median,
standard deviation, etc.
°F, soil productivity
rating
4. Ratio
quantitative: has
true zero, preserves
ratios
count, <. =, >, +, -, ×,÷, ln()…
average, range, median,
standard deviation, etc.
distance, population
density, snow pack
depth
Measures of central tendency
Median: center point of a data distribution
exactly 50% of the observations have a data value < the median
and 50% have a data value > the median
Mean:
the average data value = 1/n × Σ (all data values)
the mean = the median only if the data are unimodal and
symmetrically distributed about the mean
mean
0.4
0.3
0.2
0.1
0.0
-6
-4
-2
0
V2
2
4
Measures of dispersion
Range: the span, or extent of data values
range = maximum data value – minimum data value
Variance: average squared distance of all observations from the mean
Standard Deviation: the square root of the variance, interpreted as the
average distance of all observations away from the mean.
for a unimodal symmetric distribution, approximately 68% of all data
values will lie within one standard deviation of the mean and 95.4%
within 2 standard deviations of the mean
Data graphs
40
120
30
80
20
40
10
0
0
1.0
1.1
1.2
1.2
1.3
1.4
1.5
1.5
Bulk.Density
1.6
1.7
1.8
1.8
1.9
4.5
5.0
5.5
6.0
6.5
wet.pH
7.0
7.5
8.0
8.5
Data graphs for visualizing the distribution of data
1.8
Bulk.Density
1.6
1.4
1.2
1.0
4.5
5.5
6.5
7.5
8.5
-3
-2
-1
wet.pH
0
1
2
3
Normal Distribution
Box-whisker plot
Quantile-Quantile plot, with Normal distribution
reference line
0.4
0.25
0.3
0.20
0.15
0.2
0.10
0.1
0.05
0.00
0.0
270
272
274
276
Elev.m.
Density plot
278
280
282
-3.2
-2.6
-2.1
-1.6
-1.0
-0.5
0.0
0.5
1.1
1.6
2.1
2.7
3.2
V2
Histogram with density plot (Normal distribution)
Data graphs for visualizing data relations
30
CEC
25
20
15
10
3000
4000
5000
6000
7000
Calcium
A bivariate scatterplot illustrating the relationship between soil Calcium and
Cation Exchange Capacity in a northern Illinois soil.
What you have learned
In this lesson you learned:
• Tabular databases are organized as tables, with rows as observations, columns as attributes, and the
data or information contained inside the table. It may also contain indexes, a data dictionary, and aliases.
• The data dictionary is vital to the proper interpretation and use of data. It should contain a description of
each attribute’s measurement scale, how it was measured, when and where it was collected, by whom,
and for what purpose.
• Database design includes: which attributes and how they are labeled, what data type to use for each
attribute, data validation rules, and data storage format.
• Basic data types include text string or memo for text or qualitative information, and integer, decimal,
and byte for numeric or quantitative information.
• Tabular databases can be created in database, spreadsheet, statistical analysis and other software and
exchanged in standard database, spreadsheet, ODBC, and formatted text file formats.
• Nearly all database software has functional capabilities for data entry and editing, sorts, queries, data
statistics, and data graphs.
• Save a copy of your database before performing any maintenance or segmentation! Be especially
careful with editing operations involving find/replace, and any operation that changes data formats or type.
• Single- and multi-column sorts are useful for isolating more obvious data errors and as a starting point for
segmenting the data into smaller databases, classifying observations, and creating indexes.
• Query operations can take the form of find queries, filter queries or subset queries, of which only the last
effects permanent change to the content of the database.
• Compound queries utilize the logical operators NOT, AND, OR and XOR to join query operands.
• Measures of central tendency, measures of dispersion, data distribution graphs, and scatterplots are
often useful in data verification, but their greatest value is in data segmentation.