Data Models for Ecological Databases

John Porter
Department of Environmental Sciences
University of Virginia
DBMS Types
File system-based
Hierarchical
Network
Relational
Object-oriented
You’ve seen these before; now let’s go into more detail
File-System Based
Directory
  Files
  Files
  Files
• very simple and easy to set up
• inefficient
• few capabilities
Hierarchical
• efficient
• not very general
• e.g. phylogenetic structures
Diagram: a tree rooted at Project, with node labels Datasets, Investigators, Variables, Locations, Codes, Methods, geographical, images
Network Database
Diagram: Projects, Datasets, and Locations connected by explicit links
Links are hard-coded into the database. They are not a property of the data.
• very flexible
• unwieldy to modify
• not widely used
Relational Database
Diagram: tables Projects, Datasets, and Locations, with the key fields Data_id and Location_id
Linkages are through the properties of the data itself, not hard-coded.
• widely used, mature
• table-oriented
• restricted range of structures
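To make "linkages through the properties of the data" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column names data_id and location_id follow the slide; the title and name columns and every row value are made up for illustration. The two tables are related only because they share location_id values, so the link is resolved by a join at query time.

```python
import sqlite3

# In-memory database; all table contents below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (
        location_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE datasets (
        data_id     INTEGER PRIMARY KEY,
        title       TEXT,
        location_id INTEGER REFERENCES locations(location_id)
    );
    INSERT INTO locations VALUES (1, 'Site A'), (2, 'Site B');
    INSERT INTO datasets VALUES
        (10, 'Tree survey', 1),
        (11, 'Rainfall', 2),
        (12, 'Bird counts', 1);
""")

# The link between a dataset and its location is just a matching value.
rows = conn.execute("""
    SELECT d.title, l.name
    FROM datasets AS d
    JOIN locations AS l ON d.location_id = l.location_id
    ORDER BY d.data_id
""").fetchall()

for title, name in rows:
    print(title, "->", name)
```

Relinking a dataset to a different location is then an ordinary data update, with no change to the database structure.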
Object Oriented
Diagram: an object bundling Methods with its Object Data Structure
Complex data structures, along with the methods to use the data, are in the database.
• developing: few commercial implementations
• diverse structures
• extensible
Data Modeling
• DBMSs are highly flexible
• Good: they can do a lot!
• Bad: they have to be told how to do it!
• A Database Management System is the CANVAS, the DATA MODEL is the painting...
Data Modeling
Data modeling is used to develop the structures used in a database.
Your data model affects
– the reliability of the data
– the efficiency and speed of queries
– the complexity of the database
Data modeling is an art, not a science!
Some Terminology:
Tables contain attributes or fields (columns) and multiple observations or tuples (rows).

Spec_code  Genus    Species  Common Name
QRCALB     Quercus  alba     White Oak
QRCRBR     Quercus  rubra    Red Oak
Flat-file

Genus    Species  Common Name  Observer    Date
Quercus  alba     White Oak    Jones, D.   15-Jun-1998
Quercus  alba     White Oak    Smith, D.   12-Jul-1935
Quercus  alba     White Oat    Doe, J.     15-Sep-1920
Quercus  rubra    Red Oak      Fisher, K.  15-Jun-1998
Quercus  rubra    Red Oak      James, J.   15-Sep-1920

Diagram notation: tables in boxes, attributes in ovals
Diagram: table Observation with attributes Genus, Species, Common Name, Observer, Date
Normalization
One widely used approach for reducing errors within a database is to normalize your data structures.
Normalization is the process of eliminating duplicate or redundant information.
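As a rough illustration of what normalization does, the sketch below splits the flat-file rows from the previous slide into a species table and an observation table. The function name normalize and the hand-written lookup from (genus, species) to Spec_code are assumptions made for this example, not part of the original slides. Because each species is stored only once, the repeated common names can be compared, and the inconsistent "White Oat" entry is reported as a conflict.

```python
# Flat-file rows: (genus, species, common name, observer, date).
flat_rows = [
    ("Quercus", "alba", "White Oak", "Jones, D.", "15-Jun-1998"),
    ("Quercus", "alba", "White Oak", "Smith, D.", "12-Jul-1935"),
    ("Quercus", "alba", "White Oat", "Doe, J.", "15-Sep-1920"),
    ("Quercus", "rubra", "Red Oak", "Fisher, K.", "15-Jun-1998"),
    ("Quercus", "rubra", "Red Oak", "James, J.", "15-Sep-1920"),
]

# Assumed lookup from (genus, species) to the species code used as the key.
codes = {("Quercus", "alba"): "QRCALB", ("Quercus", "rubra"): "QRCRBR"}


def normalize(rows, codes):
    """Split flat rows into a species table and an observation table."""
    species = {}        # spec_code -> (genus, species, common name)
    observations = []   # (spec_code, observer, date)
    conflicts = []      # (spec_code, stored name, conflicting name)
    for genus, sp, common, observer, date in rows:
        code = codes[(genus, sp)]
        if code not in species:
            species[code] = (genus, sp, common)
        elif species[code][2] != common:
            # Redundant copies disagree: keep the first, report the other.
            conflicts.append((code, species[code][2], common))
        observations.append((code, observer, date))
    return species, observations, conflicts


species, observations, conflicts = normalize(flat_rows, codes)
print(species)
print(conflicts)
```

Keeping the first name seen is only one possible policy; a real cleanup would more likely check names against an authoritative species list.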
Two-table Relational Database

Species table:
Spec_code  Genus    Species  Common Name
QRCALB     Quercus  alba     White Oak
QRCRBR     Quercus  rubra    Red Oak

Observation table:
Spec_code  Observer    Date
QRCALB     Jones, D.   15-Jun-1998
QRCALB     Smith, D.   12-Jul-1935
QRCALB     Doe, J.     15-Sep-1920
QRCRBR     Fisher, K.  15-Jun-1998
QRCRBR     James, J.   15-Sep-1920

Diagram: the Species table (attributes Spec_code, Genus, Species, Common Name) is linked to the Observation table (attributes Spec_code, Observer, Date) through Spec_code
Complex Data Model
Diagram: tables Species, Images, Observations, Internet Links, Locations, Observers, and Specimens
Notation: links are drawn as one-to-one or one-to-many
Data Model for Metadata at the VCR/LTER
Diagram: tables Personnel, Projects, Mailing Lists, Dataset, Locations, Variable, and Codes
Notation: links are drawn as optional or mandatory linkages
“Beanstalk” & “String of Pearls”

What    Value  Date      Location
Temp    23     10/19/00  SEV
Humid   95     10/19/00  SEV
Precip  0.01   10/18/00  VCR

Related tables: Metadata (methods, units); Location Table (Lat/Lon)
Beanstalk / String of Pearls
• Highly normalized
• Extremely flexible - capable of handling
many different kinds of data
• Inefficient
– Queries can be very slow
– Can require large amounts of space
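The trade-off can be seen in a small sketch that takes the three example rows from the previous slide and regroups them into one record per date and location. The function name pivot and the dictionary layout are assumptions for illustration. A new kind of measurement needs no new column, only new rows, which is where the flexibility comes from; the price is that every "wide" record has to be reassembled from many single-value rows.

```python
# One row per measured value: (what, value, date, location).
long_rows = [
    ("Temp", 23.0, "10/19/00", "SEV"),
    ("Humid", 95.0, "10/19/00", "SEV"),
    ("Precip", 0.01, "10/18/00", "VCR"),
]


def pivot(rows):
    """Regroup single-value rows into one record per (date, location)."""
    wide = {}
    for what, value, date, location in rows:
        wide.setdefault((date, location), {})[what] = value
    return wide


wide = pivot(long_rows)
for key, record in wide.items():
    print(key, record)
```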
Why is there no perfect data
model for ecological data?
• One of the reasons data modeling is
an ART not a SCIENCE is that
ecologists use data in many different
ways
– Data that is perfectly formed for one
kind of analysis may be unusable for
another
– Different analytical software may be
used
Why No Perfect Model?
• Generally ecologists want to use data in “flat file” formats that combine all the tables containing data into a single, denormalized “spreadsheet”-type format, but even that format can vary between researchers
– ClimDB needed to support single-parameter and multiple-parameter formats to meet researcher needs
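A flat file of this kind can be generated on demand from normalized tables rather than stored. The sketch below, which assumes the species and observation tables from the oak example and a hypothetical helper named to_flat_csv, looks up each observation's species record and writes one denormalized CSV row per observation. It does not attempt to reproduce any actual ClimDB format.

```python
import csv
import io

# Normalized tables from the oak example.
species = {
    "QRCALB": ("Quercus", "alba", "White Oak"),
    "QRCRBR": ("Quercus", "rubra", "Red Oak"),
}
observations = [
    ("QRCALB", "Jones, D.", "15-Jun-1998"),
    ("QRCALB", "Smith, D.", "12-Jul-1935"),
    ("QRCALB", "Doe, J.", "15-Sep-1920"),
    ("QRCRBR", "Fisher, K.", "15-Jun-1998"),
    ("QRCRBR", "James, J.", "15-Sep-1920"),
]


def to_flat_csv(species, observations):
    """Join each observation to its species record and return CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Genus", "Species", "Common Name", "Observer", "Date"])
    for code, observer, date in observations:
        writer.writerow([*species[code], observer, date])
    return buf.getvalue()


flat = to_flat_csv(species, observations)
print(flat)
```

Since the common name now comes from a single species record, every exported row for a species carries the same spelling.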