GCE Data Toolbox -- metadata-based
tools for automated data processing and analysis
Wade Sheldon
University of Georgia
GCE-LTER

Rationale

- Data processing, quality control, data analysis and metadata generation traditionally carried out as separate activities, often in different time frames using different technologies
- Problems:
  - Metadata may not reflect all processing steps
  - Much routine data analysis done w/o Q/C, metadata
  - No economy of scale – leads to “one-off” solutions
- Metadata generation should ideally occur throughout the data cycle and “inform” data analysis

Design Goals

- Develop Integrated Storage Standard
  - Tabular Data
  - QA/QC Information
  - Metadata (overall data set & columns/attributes)
- Develop Software to Support Standard
  - Code Library/API
  - User Interfaces
- Apply Technology to Acquire, Manage, Distribute GCE-LTER Data
- Explore Use as Prototype Technology for Metadata-based Data Processing, Synthesis

Storage Standard

- Developed Using MATLAB®
  - Local expertise, large scientific user base
  - Cross-platform (Win32, Solaris, *nix, Mac OS X)
  - Rapid development environment
  - Supports multiple interfaces (interactive command line, batch-mode scripts, GUI, WWW)
  - Good interoperability with other technologies (Java, Perl, SQL)
- Defined “GCE Data Structure” Spec. (based on MATLAB/C structures)
  - Structure with 17 named fields
  - Specific content rules for each field (software validation)
  - Combines data, metadata, QA/QC, processing history

Storage Standard

GCE Data Structure Specification (v1.1)

Categories: Structure Info, Metadata, Data Table, QA/QC Info

Field         Description
------------  ---------------------------------------
title         Title of the Overall Data Set
version       List of Toolbox Versions Used
datafile      List of Data Files Processed
createdate    Date of Creation
editdate      Date of Last Edit
history       Processing History
metadata      General Metadata (parseable array)
name          Column Names
description   Column Descriptions
units         Column Units
datatype      Physical Data Types (Storage types)
variabletype  Logical Data Types (Variable types)
numbertype    Numerical Types
precision     Decimal Places to Display
values        Table of Data Values (numerical, text)
criteria      QA/QC Criteria
flags         QA/QC Flags Assigned
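
As an illustration only, the 17-field layout can be mimicked in Python with a dictionary and a small validator. The toolbox itself is written in MATLAB; the function names `new_structure` and `validate`, the choice of which fields hold one entry per column, and the checks themselves are assumptions made for this sketch and do not reproduce the toolbox's actual content rules.

```python
# Hypothetical sketch: the field names come from the specification table;
# the function names and the validation checks are assumed for illustration.
FIELDS = [
    "title", "version", "datafile", "createdate", "editdate", "history",
    "metadata", "name", "description", "units", "datatype", "variabletype",
    "numbertype", "precision", "values", "criteria", "flags",
]

# Assumption: these fields hold one entry per data column.
PER_COLUMN = [
    "name", "description", "units", "datatype", "variabletype",
    "numbertype", "precision", "values", "criteria", "flags",
]


def new_structure(title):
    """Return an empty structure containing all 17 named fields."""
    s = {field: [] for field in FIELDS}
    s["title"] = title
    return s


def validate(s):
    """Return a list of problems; an empty list means the structure passed."""
    problems = [f"missing field: {f}" for f in FIELDS if f not in s]
    if problems:
        return problems
    ncols = len(s["name"])
    for f in PER_COLUMN:
        if len(s[f]) != ncols:
            problems.append(f"{f}: expected {ncols} entries, found {len(s[f])}")
    return problems
```

Keeping data, column metadata, and QA/QC fields in one object is what lets a single validation pass check that they stay in step with each other.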

Software – GCE Data Toolbox

- Core Function Library
  - Create, Validate Structures
  - Import Data, Metadata (ASCII, MATLAB, SQL)
  - Manipulate Data, Metadata (unit conversions, add/delete/update)
  - Export Data, Metadata (various formats)
  - Dynamic, Rule-based QA/QC Flagging
- Self-documenting Processing
  - Operation Logging (Processing History)
  - Transparent Metadata Creation/Updating
  - Dynamic (JIT) Metadata Generation for Columns
- Support for Metadata “Templating”
  - Application of Boilerplate Metadata based on Parameter Matching
  - Supports Rapid Documentation of Routine Data Sources
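
The templating idea (boilerplate metadata applied by parameter matching) might look roughly like the following Python sketch. The template contents, the matching on exact column names, and the function name `apply_template` are all hypothetical; the toolbox's own MATLAB implementation and template format are not shown in these slides.

```python
# Hypothetical template table: keys are column names to match,
# values are the boilerplate metadata applied to a matching column.
TEMPLATE = {
    "Temp": {"description": "Water temperature", "units": "degrees C"},
    "Salinity": {"description": "Salinity", "units": "PSU"},
}


def apply_template(column_names, template):
    """Return per-column metadata, filling matched columns from the template."""
    result = []
    for name in column_names:
        # Unmatched columns get empty metadata rather than a guess.
        boilerplate = template.get(name, {})
        result.append({
            "name": name,
            "description": boilerplate.get("description", ""),
            "units": boilerplate.get("units", ""),
        })
    return result
```

For example, `apply_template(["Temp", "Depth"], TEMPLATE)` documents the `Temp` column from the template and leaves `Depth` blank for manual entry.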

Software – GCE Data Toolbox

- Support for Analysis
  - Descriptive Statistics, Reports
  - Visualization, Mapping
- Support for Synthesis
  - Composite Data Set Creation
    - Multiple Data Set Merge/Concatenation
    - Relational Join
    - Metadata Content Meshing
  - Data Set Summarization
    - Statistical Data Reduction/Re-sampling
  - Data Set Standardization
    - Unit Conversions (automatic, interactive)
    - Template-based Semantic Mapping
    - Automatic Semantic Mediation (prototype stage)
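
A metadata-aware unit conversion, one of the standardization steps listed above, could be sketched in Python as follows. The conversion table, the column layout, and the function name `convert_units` are assumptions for illustration and are not the toolbox's MATLAB API. The point of the sketch is that the values, the units metadata, and the processing history change together.

```python
# Hypothetical conversion table: (from_units, to_units) -> (multiplier, offset)
CONVERSIONS = {
    ("ft", "m"): (0.3048, 0.0),
    ("degF", "degC"): (5.0 / 9.0, -32.0 * 5.0 / 9.0),
}


def convert_units(column, to_units):
    """Return a converted copy of a column dict; the input is left unchanged."""
    multiplier, offset = CONVERSIONS[(column["units"], to_units)]
    return {
        "name": column["name"],
        "units": to_units,
        "values": [v * multiplier + offset for v in column["values"]],
        # Record the operation so the metadata reflects every processing step.
        "history": column["history"] + [
            f"converted {column['name']} from {column['units']} to {to_units}"
        ],
    }
```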

Software – User Interfaces

- Unattended Batch Mode Processing
- Interactive Command Line Processing (conventional MATLAB UI)
  - Full help text for each function
  - Well-defined input/output arguments
- GUI Applications
  - Standard Forms, Dialogs, Controls
  - No MATLAB Experience Required
- WWW – MATLAB Web Server
  - HTML Forms, Querystring Input
  - HTML Pages and/or Static File Output

[Figure labels: Command-Line Interface, GUI Applications, WWW Interface]

Current Applications

- Automated Data Processing
  - Direct data import from data logger files, WWW data sources (USGS), SQL queries
  - Automatic metadata creation (templates, data mining)
  - Rule-based QA/QC flagging
- Data Set Packaging
  - Batch processing to create/update data, metadata products
  - On-demand generation of data, metadata, stat reports in custom formats (end-user scripts, GUI applications, WWW forms)
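
Rule-based QA/QC flagging combined with operation logging, as used in the automated processing described above, can be sketched in Python like this. The flag codes, the threshold rules, and the function names are hypothetical and chosen only for illustration; the toolbox's actual criteria syntax is MATLAB-based and is not reproduced here.

```python
from datetime import datetime

# Hypothetical rules: (flag code, predicate). Codes and limits are assumed.
RULES = [
    ("I", lambda v: v < -5 or v > 45),   # invalid: outside a plausible range
    ("Q", lambda v: v > 35),             # questionable: unusually high
]


def flag_values(values, rules):
    """Return one flag string per value; '' means no rule matched."""
    flags = []
    for v in values:
        flags.append("".join(code for code, test in rules if test(v)))
    return flags


def flag_column(column, rules):
    """Assign flags in place and append a processing-history entry."""
    column["flags"] = flag_values(column["values"], rules)
    n = sum(1 for f in column["flags"] if f)
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    column["history"].append(
        f"{stamp}: flagged {n} of {len(column['values'])} values"
    )
    return column
```

Because the rules are data rather than hard-coded checks, the same flagging pass can be re-run whenever the values or the criteria change.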

Current Applications

- Data Exploration/Analysis by PIs
  - Descriptive Statistics based on attribute metadata
  - Visualization with Interactive Filtering (Frequency Histograms, 2D Plots, Map Plots)
- Data Reduction/Re-sampling to Provide Customized Data at Various “Scales”
  - Aggregated Statistics
  - Binned Statistics
  - Query/Filtering (sub-selection)
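
Binned statistics of the kind listed above can be illustrated with a short Python sketch. The function name `binned_stats`, the fixed-width binning on a numeric time column, and the chosen summary statistics are assumptions for this example rather than a description of the toolbox's MATLAB routines.

```python
from collections import defaultdict
from statistics import mean


def binned_stats(times, values, bin_width):
    """Group values into fixed-width bins of `times` and summarize each bin."""
    bins = defaultdict(list)
    for t, v in zip(times, values):
        # Key each bin by its lower edge, e.g. 65 with width 60 falls in bin 60.
        bins[int(t // bin_width) * bin_width].append(v)
    return {
        start: {"n": len(vs), "min": min(vs), "max": max(vs), "mean": mean(vs)}
        for start, vs in sorted(bins.items())
    }
```

Changing `bin_width` is what produces the same data at different "scales", for instance hourly versus daily summaries of one sensor record.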

Current Applications

- Data Harvesting (GCE)
  - USGS Data (WWW real-time, daily, finalized data)
  - Campbell Scientific Data Arrays (post-processing triggered after LoggerNet Retrieval)
  - Sea-Bird Hydrographic Data
- USGS Data Harvesting Service for HydroDB
  - Weekly harvest for 31 stations/7 LTER Sites
  - Automatic Resampling, Unit Conversions, Q/C

Availability

- Description, Screen-shots, Fully-functional Toolbox Available on WWW: http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm
- Requires MATLAB 5.3, 6.0, 6.5 (any platform)
- “Public” Version Compiled
- Source Code Requests Considered on Case-by-Case Basis

Future Development Plans

- EML 2.0 Support
- Metadata-mediated Data Set Integration
- Unit conversions
- Re-sampling
- More WWW Interface Development