Preparation of DTOC Indices
Scientific Database Approaches
John H. Porter
University of Virginia
&
Kristin Vanderbilt
University of New Mexico
Road Map
Why have Scientific Databases?
Challenges for Scientific Databases
Approaches to Scientific Databases
Strategies for Initiating Ecological
Databases
WHY have Scientific Databases?
Improvement of data quality
• multiple users provide multiple
opportunities for detecting and correcting
problems in data
Cost
• data costs less to save than to collect
again
• with environmental data, often data cannot
be collected again at any cost
WHY have Scientific Databases?
Environmental Policy and Management
• environmental policy decisions require data
that are regional or national, but most
ecological data are collected at smaller
scales
• National Policies
• International Policies
WHY have Scientific Databases?
New Science
• Long Term
– long-term studies depend on databases to
retain project history
• Synthesis
– use of data for a purpose other than that
for which they were collected
• Integrated, multidisciplinary projects
– depend on databases to facilitate sharing of
data
Evolution of Data Sharing
- Traditional Model
[Diagram: Data Collection -> Use Data -> Publications, after which: Lose or Discard Data]
Evolution of Data Sharing
- New Model
[Diagram: Data Collection -> Use Data -> Publications, plus retained Data and Metadata that support:]
• Regional Analyses
• Global Change
• Long-term Studies
• Synthesis
Challenges for Scientific Databases
Long-term perspective
• without databases, most data do
not outlive the project that collected
them
The 20-year rule
• GOAL: data that are accessible
and interpretable 20 years in the
future
Meeting Long Term Needs
• TECHNOLOGICAL - media and formats that do
not become obsolete
• CONTEXTUAL - need to
capture context of data
collection
• SEMANTIC - terms
need to be well-defined
Challenges for Scientific Databases
Deal with Diversity
• science means asking NEW questions
– new kinds of queries
• scientific data is heterogeneous and
diverse
• scientific users have different backgrounds
and goals
• the user community for a given database
will be dynamic
Characteristics of Ecological Data
[Figure: types of data plotted by Data Volume (per dataset), from Low to High, against Complexity/Metadata Requirements, from Low to High. Labels shown: Satellite Images, Weather Stations, Business Data, Most Software, Gene Sequences, GIS, Most Ecological Data, Primary Productivity, Biodiversity Surveys, Population Data, Soil Cores]
Comparison to Business Databases
Business-oriented databases have been
very different from scientific databases
• Relatively small number of well-defined
data elements
– E.g., Part number, count, price
• Repeatable reports (e.g., sales report)
• Rules for integrating data well understood
• Intolerant of different values associated
with an element
– E.g., hourly rate of pay
Ecoinformatics Development:
Alignment with IT community
Information
Technology
Ecoinformatics
Reason: IT focused on proprietary business applications
modified from James Brunt
Changing Times
New emphases on “data mining” are
forcing business databases to become
more like scientific databases
• Example: data on customer demographics
are linked to regional store inventories
• Integration of data resources not designed
with integration in mind
Ecoinformatics Development:
Alignment with IT community
XML,
Web Services,
Semantic Mediation
IT
Ecoinformatics
Reason: IT now focuses on domain-neutral
access to distributed data products.
Modified from James Brunt
The Ecoinformatics Challenge:
Can we make information available to
ecologists:
• In ways they can locate the information
they need?
• With information in forms they can readily
use?
How can we assure that the information
is current and accurate?
Not all Scientific Databases are Alike!
Scientific data are available at a number
of different “levels”
LOW: individual investigator posts data
on web page for students to retrieve
MEDIUM: Online databases for
supporting a project
HIGH: system automatically integrates
data from a large number of sources
Different types of Scientific Databases
“Portal”, “Value-Added”
or “Integrated” Infobases
Researchers
International/
National/Regional Systems
Project or Site-Based Systems
Individual datasets
Tools for Creating Scientific
Databases
Web Server – HTML, XML
• IIS
• Apache – open source
Database Management Systems (DBMS)
• Input, query, update, sort, output
Statistical Packages
• Aggregate, graph
Programming Languages
• C++, Java, Perl, Python, Visual Basic, PHP
• Create custom code
Tools for Scientific Database
Development
Relational Database Management
Systems – RDBMS in common use
• Microsoft Access / Microsoft SQL Server
• Oracle
• MySQL – open source
Statistical Packages
• SAS
• SPSS
• R – open source
Spreadsheets
Spreadsheets are fantastic tools – but
not for scientific databases!
• Encourage “bad practice” – irregular data
structures that can’t be parsed easily
• Lack “auditability” – difficult or impossible
to back-track calculations
• Proprietary formats become obsolete
• Cannot export anything other than
values or graphs (formulae are lost)
Not Every Scientific DB needs or
uses the same tools
Example 1 – Basic Data Access
• Post comma-delimited files on web server
• Metadata files – XML text files (structured)
or unstructured
Example 2 – Add Products
• Use SAS to conduct error-checking and
generate graphics from data
• Use scripts/programs to automate
production process
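Examples 1 and 2 can be sketched together as follows. This is a minimal illustration only: it uses Python in place of SAS, and the dataset name, column names, limits, and the XML layout are all hypothetical (this is not EML or any other published metadata standard). The idea is that a structured metadata file can drive automated error-checking of a comma-delimited data file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical structured metadata for a hypothetical data file.
# The element and attribute names are illustrative, not a standard.
METADATA_XML = """<dataset name="air_temperature">
  <column name="date" type="text"/>
  <column name="temp_c" type="float" unit="degrees Celsius" min="-40" max="50"/>
</dataset>"""

# Hypothetical comma-delimited data with one bad value and one gap.
DATA_CSV = """date,temp_c
2004-06-01,21.5
2004-06-02,215.0
2004-06-03,
"""


def load_limits(metadata_xml):
    """Return {column name: (min, max)} for columns that declare limits."""
    root = ET.fromstring(metadata_xml)
    limits = {}
    for col in root.findall("column"):
        if "min" in col.attrib and "max" in col.attrib:
            limits[col.get("name")] = (float(col.get("min")), float(col.get("max")))
    return limits


def check_rows(data_csv, limits):
    """Return (line number, column, problem) tuples for suspect values."""
    problems = []
    reader = csv.DictReader(io.StringIO(data_csv))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, (low, high) in limits.items():
            raw = row.get(name) or ""
            if raw == "":
                problems.append((line_no, name, "missing"))
                continue
            try:
                value = float(raw)
            except ValueError:
                problems.append((line_no, name, "not numeric"))
                continue
            if not low <= value <= high:
                problems.append((line_no, name, "out of range"))
    return problems


problems = check_rows(DATA_CSV, load_limits(METADATA_XML))
for problem in problems:
    print(problem)
```

Because the limits live in the metadata rather than in the script, the same check can be rerun unchanged each time a new file is posted.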
Possible Systems
Example 3 - Manage Metadata in DBMS
• Metadata in Access Database
• Provide comma-delimited data files
Example 4 - Manage Metadata on Web
• Link web forms to backend DBMS
Example 5 - Full DBMS system
• Metadata in DBMS
• Data dynamically queried from DBMS
using web interface
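A rough sketch of Example 5, under stated assumptions: SQLite (through Python's standard sqlite3 module) stands in for whatever DBMS a site actually uses, the table and column names are hypothetical, and the web interface is omitted. The function at the end is the kind of parameterized query a web form might call.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a production DBMS.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: metadata (dataset, variable) and data (observation)
# are stored side by side in the same database.
conn.executescript("""
CREATE TABLE dataset (
    dataset_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    contact    TEXT
);
CREATE TABLE variable (
    dataset_id INTEGER REFERENCES dataset(dataset_id),
    name       TEXT NOT NULL,
    unit       TEXT,
    definition TEXT
);
CREATE TABLE observation (
    dataset_id INTEGER REFERENCES dataset(dataset_id),
    obs_date   TEXT,
    variable   TEXT,
    value      REAL
);
""")

# Illustrative records only.
conn.execute(
    "INSERT INTO dataset VALUES (1, 'Example station air temperature', 'data manager')"
)
conn.execute(
    "INSERT INTO variable VALUES (1, 'temp_c', 'degrees Celsius', 'Daily mean air temperature')"
)
conn.executemany(
    "INSERT INTO observation VALUES (1, ?, 'temp_c', ?)",
    [("2004-06-01", 21.5), ("2004-06-02", 23.0), ("2004-06-03", 19.0)],
)


def query_observations(connection, variable, start, end):
    """Return (date, value, unit) rows for one variable in a date range."""
    sql = """
        SELECT o.obs_date, o.value, v.unit
        FROM observation AS o
        JOIN variable AS v
          ON v.dataset_id = o.dataset_id AND v.name = o.variable
        WHERE o.variable = ? AND o.obs_date BETWEEN ? AND ?
        ORDER BY o.obs_date
    """
    return connection.execute(sql, (variable, start, end)).fetchall()


rows = query_observations(conn, "temp_c", "2004-06-02", "2004-06-03")
for row in rows:
    print(row)
```

Joining each value to its unit at query time means the data never leave the system without the metadata needed to interpret them.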
Level of Structure
Unstructured Data/Metadata
• Easy to produce
• Hard to use
Structured Data/Metadata
• Harder to produce
• Easy to export, alter, update
• The specific tool used to structure data
(e.g., XML, DBMS) is increasingly less
critical than the structure itself
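To illustrate why the structure matters more than the tool, here is a small sketch in which one structured metadata record is exported to both JSON and XML. The record contents and the element names are hypothetical; the point is only that once metadata are structured, moving between formats is mechanical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical structured metadata record.
metadata = {
    "title": "Example soil core survey",
    "variables": [
        {"name": "depth_cm", "unit": "centimeters"},
        {"name": "organic_pct", "unit": "percent"},
    ],
}


def to_json(record):
    """Export the record as JSON text."""
    return json.dumps(record, indent=2)


def to_xml(record):
    """Export the same record as XML text (illustrative element names)."""
    root = ET.Element("dataset")
    ET.SubElement(root, "title").text = record["title"]
    for var in record["variables"]:
        ET.SubElement(root, "variable", name=var["name"], unit=var["unit"])
    return ET.tostring(root, encoding="unicode")


def variable_names(xml_text):
    """Read the variable names back out of the XML export."""
    return [v.get("name") for v in ET.fromstring(xml_text).findall("variable")]


print(to_json(metadata))
print(to_xml(metadata))
```

The same record written as a free-text paragraph would need a person, not a script, to make either conversion.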
Evolving a Database
Development of a database is an
evolutionary process
Implement system based on current
priorities - but think ahead!
Seek scalable solutions
• avoid bottlenecks
• adding the 1000th piece of data should be
as easy as adding the first (or easier)
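One way to read the scalability point, sketched under assumptions: if the catalog is built by a script that scans a directory, then adding the 1000th dataset is the same step as adding the first (drop a file in place and rerun). The file names, column names, and catalog fields below are hypothetical, and a temporary directory stands in for a real data directory.

```python
import csv
import tempfile
from pathlib import Path


def build_catalog(data_dir):
    """Return one catalog entry per CSV file found in data_dir."""
    catalog = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # first line holds column names
            n_rows = sum(1 for _ in reader)    # remaining lines are data rows
        catalog.append({"file": path.name, "columns": header, "rows": n_rows})
    return catalog


# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 4):
        (Path(tmp) / f"plot_{i:04d}.csv").write_text("date,count\n2004-06-01,3\n")
    catalog = build_catalog(tmp)

for entry in catalog:
    print(entry)
```

No step in the script depends on how many files already exist, which is the property the slide asks for.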
Developing a Database: Questions to Ask
Why is this database NEEDED?
Who will be the USERS of the
database?
What types of QUESTIONS should the
database be able to answer?
What INCENTIVES will be available for
data providers?
Meeting the Challenges
Prioritize
• focus on developing the most critical data
resources
• most commonly, critical data refer to the
research site as a whole
– Meteorology & Climatology
– Bibliography of past research at the station
– GIS data layers for the station research area
Meeting the Challenges
Get additional resources
• NSF Grants
• Upcoming NSF initiatives:
– SEI+II – interdisciplinary research
– National Ecological Observatory Network
(NEON)
• Institutional Support
Meeting the Challenges
Work with researchers and enlist their
help in developing ecological databases
• Develop policies for data collection and
sharing that dictate the responsibilities of:
–The data provider/producer
–The data system
–Users of the data
Use Standard Methods when
Possible
Advantages of using
standard methods
• Increases intercomparability
(and hence, value) of data,
facilitating cross-site
comparisons
• Reduces cost of methods
development
Standards
Costs of using standards
• Standard methods may be poorly suited to
local conditions
• Developing standards is time consuming
and difficult
For some types of monitoring,
standards may not exist, or may do a
poor job characterizing desired
parameters
Standards
“The wonderful thing about standards is that
there are so many of them to choose from”
Sources of Standards
• Published literature
• Government Agencies (e.g., USGS, EPA)
• Project standards (e.g., LTER Climate
Stations)
• Resource Discovery Initiative for Field
Stations (RDIFS) directory (under
development)
Information Systems
Developing an information system is a
critical component of research
• You can’t exploit data you no longer have!
Creating good “metadata” (data about
data) is crucial to maintaining data
usability over time
Exploit Partnerships &
Existing Resources
OBFS Resource Discovery Initiative for Field
Stations (RDIFS)
• Ecoinformatics Training
• Publications Database
• Registry for field station data (free advertising!)
• Database of standards
• Keyword Thesaurus
Ecoinformatics.org/ Knowledge Network for
Biocomplexity Project
• Ecological Metadata Language (EML)
• Tools
Other Possible Collaborations
ORNL Mercury System
• Cataloging and metadata tools with the
data and metadata left on your system
Global Change Master Directory
• online system for metadata with searching
capabilities
OPeNDAP.org
• Online tools for oceanographic data
Exploiting External Resources
Ecological Society of America journal
Ecological Archives
• accepts “data papers” for major and
important data sets
Concluding Thoughts
Developing ecological information
systems seems a daunting task
Every system starts somewhere. Even
oaks start with acorns!
Once started, you can build on
successes, a little at a time
Remember, the compound interest on
zero is zero!
Next Step
Experience is a good guide to helping
build the sort of database your users will
want to use
It's good to try out the existing systems
to see what works (and what doesn’t) as
a user