GCE Data Toolbox -- metadata-based
tools for automated data processing and analysis
Wade Sheldon
University of Georgia
GCE-LTER
Rationale
Data processing, quality control, data analysis, and
metadata generation are traditionally carried out as
separate activities, often in different time frames
and using different technologies
Problems:
Metadata may not reflect all processing steps
Much routine data analysis is done without Q/C or metadata
No economy of scale – leads to “one-off” solutions
Metadata generation should ideally occur
throughout the data cycle and “inform” data
analysis
Design Goals
Develop Integrated Storage Standard
Tabular Data
QA/QC Information
Metadata (overall data set & columns/attributes)
Develop Software to Support Standard
Code Library/API
User Interfaces
Apply Technology to Acquire, Manage,
Distribute GCE-LTER Data
Explore Use as Prototype Technology for
Metadata-based Data Processing, Synthesis
Storage Standard
Developed Using MATLAB®
Local expertise, large scientific user base
Cross-platform (Win32, Solaris, *nix, Mac OS X)
Rapid development environment
Supports multiple interfaces (interactive command line, batch-mode scripts, GUI, WWW)
Good interoperability with other technologies (Java, Perl, SQL)
Defined “GCE Data Structure” Spec. (based on
MATLAB/C structures)
Structure with 17 named fields
Specific content rules for each field (software validation)
Combines data, metadata, QA/QC, processing history
Storage Standard
GCE Data Structure Specification (v1.1)
Field          Description
title          Title of the Overall Data Set
version        List of Toolbox Versions Used
datafile       List of Data Files Processed
createdate     Date of Creation
editdate       Date of Last Edit
history        Processing History
metadata       General Metadata (parseable array)
name           Column Names
description    Column Descriptions
units          Column Units
datatype       Physical Data Types (Storage types)
variabletype   Logical Data Types (Variable types)
numbertype     Numerical Types
precision      Decimal Places to Display
values         Table of Data Values (numerical, text)
criteria       QA/QC Criteria
flags          QA/QC Flags Assigned
Field Categories: Structure Info, Metadata, Data Table, QA/QC Info
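To make the field list concrete, here is a minimal sketch of the 17 fields as a Python dataclass with a small consistency check. The toolbox itself is written in MATLAB. The class name `GCEDataStructure`, the function `validate`, the chosen Python types, and the two rules checked are assumptions for illustration only; they are not the content rules of the actual specification.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class GCEDataStructure:
    """Hypothetical Python analogue of the 17-field GCE Data Structure."""
    title: str = ""
    version: List[str] = field(default_factory=list)
    datafile: List[str] = field(default_factory=list)
    createdate: str = ""
    editdate: str = ""
    history: List[str] = field(default_factory=list)
    metadata: List[List[str]] = field(default_factory=list)
    name: List[str] = field(default_factory=list)
    description: List[str] = field(default_factory=list)
    units: List[str] = field(default_factory=list)
    datatype: List[str] = field(default_factory=list)
    variabletype: List[str] = field(default_factory=list)
    numbertype: List[str] = field(default_factory=list)
    precision: List[int] = field(default_factory=list)
    values: List[List[Any]] = field(default_factory=list)  # one list per column
    criteria: List[str] = field(default_factory=list)
    flags: List[List[str]] = field(default_factory=list)   # one list per column


# Fields assumed (for this sketch) to hold exactly one entry per data column.
PER_COLUMN_FIELDS = (
    "description", "units", "datatype", "variabletype", "numbertype",
    "precision", "values", "criteria", "flags",
)


def validate(s: GCEDataStructure) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    ncols = len(s.name)
    for fname in PER_COLUMN_FIELDS:
        found = len(getattr(s, fname))
        if found != ncols:
            problems.append(f"{fname}: expected {ncols} entries, found {found}")
    if len({len(col) for col in s.values}) > 1:
        problems.append("values: columns differ in length")
    return problems


example = GCEDataStructure(
    title="Example salinity series",
    name=["Date", "Salinity"],
    description=["Observation date", "Surface salinity"],
    units=["", "PSU"],
    datatype=["s", "f"],
    variabletype=["datetime", "data"],
    numbertype=["none", "continuous"],
    precision=[0, 2],
    values=[["2003-01-01", "2003-01-02"], [30.1, 31.2]],
    criteria=["", "x<0='I'"],
    flags=[["", ""], ["", ""]],
)
print(validate(example))
```

Keeping data, column metadata, QA/QC information, and history in one object is what lets a single validation pass check them against each other.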
Software – GCE Data Toolbox
Core
Function Library
Create, Validate Structures
Import Data, Metadata (ASCII, MATLAB, SQL)
Manipulate Data, Metadata (unit conversions, add/delete/update)
Export Data, Metadata (various formats)
Dynamic, Rule-based QA/QC Flagging
Self-documenting Processing
Operation Logging (Processing History)
Transparent Metadata Creation/Updating
Dynamic (JIT) Metadata Generation for Columns
Support for Metadata “Templating”
Application of Boilerplate Metadata based on Parameter Matching
Supports Rapid Documentation of Routine Data Sources
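The metadata templating idea above (boilerplate applied by parameter matching) can be sketched as follows. This is an illustrative Python sketch, not the toolbox's MATLAB implementation; the template contents, the column names, and the function `apply_template` are all hypothetical.

```python
# Hypothetical template: boilerplate column metadata keyed by parameter name.
TEMPLATE = {
    "Temp_Water": {"description": "Water temperature", "units": "degC", "precision": 2},
    "Salinity": {"description": "Salinity", "units": "PSU", "precision": 2},
}


def apply_template(column_names, template):
    """Build per-column metadata, filling in boilerplate where a name matches."""
    result = []
    for name in column_names:
        # Start from blank metadata so unmatched columns are still listed.
        meta = {"name": name, "description": "", "units": "", "precision": 0}
        meta.update(template.get(name, {}))
        result.append(meta)
    return result


columns = apply_template(["Temp_Water", "Salinity", "Battery_V"], TEMPLATE)
for meta in columns:
    print(meta)
```

Columns with no match keep blank metadata, so they remain visible as items that still need documenting by hand.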
Software – GCE Data Toolbox
Support for Analysis
Descriptive Statistics, Reports
Visualization, Mapping
Support for Synthesis
Composite Data Set Creation
Multiple Data Set Merge/Concatenation
Relational Join
Metadata Content Meshing
Data Set Summarization
Statistical Data Reduction/Re-sampling
Data Set Standardization
Unit Conversions (automatic, interactive)
Template-based Semantic Mapping
Automatic Semantic Mediation (prototype stage)
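As an illustration of metadata-aware standardization, the sketch below converts a column's units and updates its metadata and processing history in the same step. It is a Python sketch under stated assumptions: the dictionary layout, the `CONVERSIONS` table, and the function `convert_units` are hypothetical and do not reproduce the toolbox's own functions.

```python
# Hypothetical conversion table: (from_units, to_units) -> conversion function.
CONVERSIONS = {
    ("ft", "m"): lambda x: x * 0.3048,
    ("degF", "degC"): lambda x: (x - 32.0) * 5.0 / 9.0,
}


def convert_units(column, new_units):
    """Return a converted copy of a column; the input column is not modified."""
    key = (column["units"], new_units)
    if key not in CONVERSIONS:
        raise ValueError(f"no conversion defined from {key[0]} to {key[1]}")
    convert = CONVERSIONS[key]
    note = f"converted {column['name']} from {column['units']} to {new_units}"
    return {
        **column,
        "values": [convert(v) for v in column["values"]],
        "units": new_units,                    # metadata stays in sync with the data
        "history": column["history"] + [note],  # operation is logged automatically
    }


gage = {"name": "Gage_Height", "units": "ft", "values": [1.0, 10.0], "history": []}
gage_m = convert_units(gage, "m")
print(gage_m["units"], gage_m["values"], gage_m["history"])
```

Because the units field and the history entry change together with the values, the metadata cannot drift out of step with the data.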
Software – User Interfaces
Unattended Batch Mode Processing
Interactive Command Line Processing (conventional MATLAB UI)
Full help text for each function
Well-defined input/output arguments
GUI Applications
Standard Forms, Dialogs, Controls
No MATLAB Experience Required
WWW – MATLAB Web Server
HTML Forms, Querystring Input
HTML Pages and/or Static File Output
[Screenshots: Command-Line Interface, GUI Applications, WWW Interface]
Current Applications
Automated Data Processing
Direct data import from data logger files, WWW data
sources (USGS), SQL queries
Automatic metadata creation (templates, data mining)
Rule-based QA/QC flagging
Data Set Packaging
Batch processing to create/update data, metadata
products
On-demand generation of data, metadata, stat reports in
custom formats (end-user scripts, GUI applications,
WWW forms)
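The rule-based QA/QC flagging and processing-history logging mentioned in these slides can be sketched as below. This is an illustrative Python sketch: the rule format (a predicate paired with a flag character), the flag codes `I` and `Q`, the thresholds, and the function `apply_flag_rules` are assumptions, not the toolbox's criteria syntax.

```python
from datetime import datetime


def apply_flag_rules(values, rules, history):
    """Evaluate every rule against every value and log the operation.

    rules is a list of (predicate, flag_character) pairs; a value receives
    the flag characters of all rules whose predicate is true for it.
    """
    flags = ["".join(flag for test, flag in rules if test(v)) for v in values]
    flagged = sum(1 for f in flags if f)
    stamp = datetime.now().isoformat(timespec="seconds")
    history.append(
        f"{stamp}: applied {len(rules)} flag rule(s); "
        f"{flagged} of {len(values)} value(s) flagged"
    )
    return flags


# Hypothetical rules for a salinity column.
salinity_rules = [
    (lambda x: x < 0, "I"),   # assumed code for "invalid"
    (lambda x: x > 40, "Q"),  # assumed code for "questionable"
]

salinity = [12.1, 35.8, -1.0, 41.5]
history = []
flags = apply_flag_rules(salinity, salinity_rules, history)
print(flags)
print(history[-1])
```

Storing the rules with the data set, rather than hard-coding them, means the flags can be recomputed whenever values or criteria change, and each run leaves an entry in the processing history.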
Current Applications
Data Exploration/Analysis by PIs
Descriptive Statistics based on attribute metadata
Visualization with Interactive Filtering (Frequency
Histograms, 2D Plots, Map Plots)
Data Reduction/Re-sampling to Provide Customized Data at Various “Scales”
Aggregated Statistics
Binned Statistics
Query/Filtering (sub-selection)
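Binned statistics of the kind listed above can be sketched in a few lines. This Python sketch assumes numeric time stamps in minutes and fixed-width bins; the function `binned_stats` and the sample readings are hypothetical and only illustrate the general technique of statistical data reduction.

```python
from collections import defaultdict
from statistics import mean


def binned_stats(times, values, bin_width):
    """Group values into fixed-width time bins and summarize each bin."""
    bins = defaultdict(list)
    for t, v in zip(times, values):
        bin_start = (t // bin_width) * bin_width  # left edge of the bin holding t
        bins[bin_start].append(v)
    return [
        {"bin_start": b, "n": len(vs), "mean": mean(vs), "min": min(vs), "max": max(vs)}
        for b, vs in sorted(bins.items())
    ]


# Hypothetical 15-minute temperature readings, reduced to hourly statistics.
minutes = [0, 15, 30, 45, 60, 75]
temps = [20.0, 22.0, 21.0, 23.0, 25.0, 27.0]
hourly = binned_stats(minutes, temps, 60)
for row in hourly:
    print(row)
```

Changing `bin_width` yields the same data at a different "scale" without touching the source observations.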
Current Applications
Data Harvesting (GCE)
USGS Data (WWW real-time, daily, finalized data)
Campbell Scientific Data Arrays (post-processing
triggered after LoggerNet Retrieval)
Sea-Bird Hydrographic Data
USGS Data Harvesting Service for HydroDB
Weekly harvest for 31 stations/7 LTER Sites
Automatic Resampling, Unit Conversions, Q/C
Availability
Description, Screenshots, Fully Functional
Toolbox Available on WWW:
http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm
Requires MATLAB 5.3, 6.0, or 6.5 (any platform)
“Public” Version Compiled
Source Code Requests Considered on Case-by-Case Basis
Future Development Plans
EML 2.0 Support
Metadata-mediated Data Set Integration
Unit Conversions
Re-sampling
More WWW Interface Development