Data Integration and Synthesis tools

Data Integration, Analysis, and Synthesis
Matthew B. Jones
National Center for Ecological Analysis and Synthesis
University of California Santa Barbara
Scalable Information Networks for the Environment
http://knb.ecoinformatics.org
Funding: National Science Foundation (DEB99-80154, DBI99-04777)
NCEAS’ Mission


Integrate existing data for broad ecological synthesis
Use synthesis to inform policy and management
Synthesis at NCEAS





Research
Management
Policy
200+ synthesis projects
1900+ participating scientists
Research projects




Hunsaker – Quantification of Uncertainty in Spatial Data for Ecological Applications
Ives & Frost – Intrinsic and Extrinsic Variability in Community Dynamics
Osenberg – Meta-Analysis, Interaction Strength and Effect Size; Application of Biological Models to the Synthesis of Experimental Data
Murdoch – Complex Population Dynamics
Management projects




Andelman – Designing and Assessing the Viability of Nature Reserve Systems at Regional Scales: Integration of Optimization, Heuristic and Dynamic Models
Boersma & Kareiva – Prospectus For An Analysis of Recovery Plans and Delisting
Kareiva – Habitat Conservation Planning for Endangered Species
Lubchenco, Palumbi, & Gaines – Developing the Theory of Marine Reserves
Policy projects


Costanza & Farber – The Value of the World's Ecosystem Services and Natural Capital: Toward a Dynamic, Integrated Approach
http://www.nceas.ucsb.edu/
Synthesis projects

Use existing data...

Distributed sources
Varying protocols
Varying formats

Obtained via personal collaboration


Functional breakdown

Functional breakdown for synthesis








Data discovery
Data access
Data storage
Data interpretation
Quality assessment
Data Conversion & Integration
Analysis & Modeling
Visualization
Presentation Outline

Integration, Analysis, and Synthesis:

Challenges
Data Heterogeneity








Population survey
Experimental
Taxonomic survey
Behavioral
Meteorological
Oceanographic
Hydrology
…




Economic
Social (urban ecology)
Paleoecological
Historical


Land use
Demographics
Types of Heterogeneity

Intensional vs. Arbitrary Heterogeneity

Syntax (format): CSV, fixed ASCII, proprietary binary
Schema (organization): non-normalized models
Semantics (meaning/methods):
Protocol semantics (e.g., scale)
Parameter semantics (e.g., body size (g))
Conceptual framework (e.g., experimental treatments)
Taxonomy + nomenclature
Data Dispersion

Data are distributed among:


Independent researcher holdings
Research station collections:
LTER Network (24 sites)
Org. of Biological Field Stations (168 sites)
Univ. Cal Natural Reserve System (36 sites)
MARINE (62 sites)
PISCO
Agency databases
Museum databases

Access via personal networking: not scalable
Lack of Metadata

Majority of ecological data undocumented
Lack information on syntax, schema and semantics of data
Impossible to understand data without contacting the original researchers

Documentation conventions vary widely
Requires large time investment to understand each data set
Scaling Data Integration

Because of:
Data heterogeneity
Data dispersion
Lack of documentation
Integration and synthesis are limited to a manual process

Thus, difficult to scale integration efforts up to large numbers of data sets
Data Integration

A
Date        Site   Species   Area   Count
10/1/1993   N654   PIRU      2      26
10/3/1994   N654   PIRU      2      29
10/1/1993   N654   BEPA      1      3

B
Date        Site   picrub   betpap
31Oct1993   1      13.5     1.6
14Nov1994   1      8.4      1.8

C
Date         Site   Species             Density
10/1/1993    N654   Picea rubens        13
10/3/1994    N654   Picea rubens        14.5
10/1/1993    N654   Betula papyrifera   3
10/31/1993   1      Picea rubens        13.5
10/31/1993   1      Betula papyrifera   1.6
11/14/1994   1      Picea rubens        8.4
11/14/1994   1      Betula papyrifera   1.8
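The integration on this slide (source tables A and B merged into table C) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the species-code mapping, the field names, and the choice of ISO dates in the output are hypothetical and not part of the KNB tools. Density for table A is computed as count divided by area, which matches the values shown in C.

```python
from datetime import datetime

# Source table A: counts per sampled area, species given as codes.
TABLE_A = [
    {"date": "10/1/1993", "site": "N654", "species": "PIRU", "area": 2, "count": 26},
    {"date": "10/3/1994", "site": "N654", "species": "PIRU", "area": 2, "count": 29},
    {"date": "10/1/1993", "site": "N654", "species": "BEPA", "area": 1, "count": 3},
]

# Source table B: densities, one column per species (non-normalized schema).
TABLE_B = [
    {"date": "31Oct1993", "site": "1", "picrub": 13.5, "betpap": 1.6},
    {"date": "14Nov1994", "site": "1", "picrub": 8.4, "betpap": 1.8},
]

# Assumed semantic metadata: each source's codes mapped to a shared vocabulary.
SPECIES_NAMES = {
    "PIRU": "Picea rubens",
    "BEPA": "Betula papyrifera",
    "picrub": "Picea rubens",
    "betpap": "Betula papyrifera",
}


def iso_date(text, fmt):
    """Parse a date in the source's own format and return it as YYYY-MM-DD."""
    # Note: %b month names depend on the locale; English is assumed here.
    return datetime.strptime(text, fmt).date().isoformat()


def integrate(table_a, table_b):
    """Convert both sources to one schema: date, site, species, density."""
    rows = []
    for r in table_a:
        rows.append({
            "date": iso_date(r["date"], "%m/%d/%Y"),
            "site": r["site"],
            "species": SPECIES_NAMES[r["species"]],
            "density": r["count"] / r["area"],  # count per unit area
        })
    for r in table_b:
        # Unpivot: one output row per species column.
        for code in ("picrub", "betpap"):
            rows.append({
                "date": iso_date(r["date"], "%d%b%Y"),
                "site": r["site"],
                "species": SPECIES_NAMES[code],
                "density": r[code],
            })
    return rows


integrated = integrate(TABLE_A, TABLE_B)
```

Each step here (date parsing, unit conversion, unpivoting, code lookup) is driven by information that would normally live in the metadata, which is why undocumented data forces this work to be done by hand.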
Presentation Outline



Integration, Analysis, and Synthesis:
Challenges
Current work

Knowledge Network for Biocomplexity

Partnership for Biodiversity Informatics
Knowledge Network for Biocomplexity (KNB)

National network for biocomplexity data




Data discovery
Data access
Data interpretation
Enable advanced services




Data integration
Analysis framework
Hypothesis modeling
Visualization
Central Role of Metadata

What metadata?





Ownership, attribution, structure, contents, methods, quality, etc.
Critical for addressing data heterogeneity issues
Critical for developing extensible systems
Critical for long-term data preservation
Allows advanced services to be built
KNB Components


Ecological Metadata Language (EML)
Morpho -- data management for ecologists






Cross platform Java application
Metacat -- flexible metadata & data system
Analysis and Modeling engine
Data integration engine
Semantic Query Processor
Hypothesis Modeling Engine
Ecological Metadata Language

XML syntax for representing metadata
Extensible – can add new metadata
Modular – can subset metadata for specific applications
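To make "XML syntax for representing metadata" concrete, here is a minimal Python sketch that builds a small metadata document and then reads back only the part one application needs. The element and attribute names (`metadata`, `entity`, `attribute`, `unit`) and the sample values are hypothetical, chosen for illustration; they are not the actual EML schema.

```python
import xml.etree.ElementTree as ET


def build_metadata(title, owner, attributes):
    """Build a small metadata document. Element names are illustrative only."""
    root = ET.Element("metadata")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "owner").text = owner
    entity = ET.SubElement(root, "entity", name="density_table")
    for name, unit in attributes:
        attr = ET.SubElement(entity, "attribute", name=name)
        if unit is not None:
            attr.set("unit", unit)
    return root


doc = build_metadata(
    title="Example tree density survey",   # hypothetical data set
    owner="Example Researcher",            # hypothetical owner
    attributes=[("date", None), ("site", None),
                ("species", None), ("density", "#/m2")],
)

# Serialize to XML text, as it might be stored or exchanged.
xml_text = ET.tostring(doc, encoding="unicode")

# "Modular" use: an application that only needs the variable names
# can read that subset and ignore everything else.
attribute_names = [a.get("name") for a in doc.iter("attribute")]
```

Because the document is plain XML, new elements can be added later without breaking a reader that only looks for the elements it knows about.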
EML 2.0beta3 modules














eml-resource -- Basic resource info
eml-dataset -- Data set info
eml-literature -- Citation info
eml-software -- Software info
eml-party -- People and Organizations
eml-entity -- Data entity (table) info
eml-attribute -- Attribute (variable) info
eml-constraint -- Integrity constraints
eml-physical -- Physical format info
eml-access -- Access control
eml-distribution -- Distribution info
eml-project -- Research project info
eml-coverage -- Geographic, temporal and taxonomic coverage
eml-protocol -- Methods and QA/QC
Metacat metadata system
[Figure: network of Metacat servers. Node labels: SEV, AND, CAP, NRS, OBFS, NCEAS, LTER, SDSC, Metacat. Key: Metacat Catalog, Morpho clients, Web clients, Site metadata system, XML wrapper.]
Metacat architecture
Metacat Server
Metacat Servlet (Dispatcher)
Java Servlet Engine (Tomcat)
HTTP Server (Apache)
Query Subsystem
Storage Subsystem
Replication Subsystem
Validation Subsystem
Transformation Subsystem
JDBC API
RDBMS (Oracle)
Data Storage Interface
FS Adapter
File System
Authentication Interface
LDAP Adapter
LDAP
Metacat web interface
OBFS Network
UC Natural Reserve System
LTER Network
Functional breakdown

Functional breakdown for synthesis








Data discovery
Data access
Data storage
Data interpretation
Quality assessment
Data Conversion & Integration
Analysis & Modeling
Visualization
Quality Assessment system
Inputs: Data + Semantic Metadata + Researcher Decisions
Output: Quality Assessment Report
Quality Assessment






Integrity constraint checking
Data type checking
Metadata completeness
Data entry errors
Outlier detection
Check assertions about data


e.g., trees don’t shrink
e.g., sea urchins do
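The assertion check above ("trees don't shrink", but sea urchins do) could look roughly like the following Python sketch. The record layout, the `CAN_SHRINK` lookup, and the sample measurements are all hypothetical; in practice the rule for each taxon would come from the semantic metadata or from the researcher.

```python
# Assumed domain knowledge: which kinds of organisms may legitimately shrink.
CAN_SHRINK = {"tree": False, "sea urchin": True}

# Hypothetical repeated size measurements of tagged individuals.
MEASUREMENTS = [
    {"taxon": "tree", "id": "T1", "year": 1993, "size": 12.0},
    {"taxon": "tree", "id": "T1", "year": 1994, "size": 12.6},
    {"taxon": "tree", "id": "T2", "year": 1993, "size": 20.1},
    {"taxon": "tree", "id": "T2", "year": 1994, "size": 19.4},  # suspicious
    {"taxon": "sea urchin", "id": "U1", "year": 1993, "size": 5.2},
    {"taxon": "sea urchin", "id": "U1", "year": 1994, "size": 4.8},  # allowed
]


def find_shrinkage(records):
    """Flag individuals whose size decreased when the taxon should not shrink."""
    by_individual = {}
    for rec in records:
        by_individual.setdefault((rec["taxon"], rec["id"]), []).append(rec)

    violations = []
    for (taxon, ident), recs in by_individual.items():
        if CAN_SHRINK[taxon]:
            continue  # the assertion does not apply to this taxon
        recs = sorted(recs, key=lambda r: r["year"])
        # Compare each measurement with the next one in time.
        for earlier, later in zip(recs, recs[1:]):
            if later["size"] < earlier["size"]:
                violations.append((ident, earlier["year"], later["year"]))
    return violations


violations = find_shrinkage(MEASUREMENTS)
```

A flagged record is a candidate data entry error, not proof of one, so the result belongs in a report for the researcher rather than in an automatic correction.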
Data Integration
Inputs: Data + Semantic Metadata + Researcher Decisions
Output: Integrated Data Set

Date         Site   Species             Density
10/1/1993    N654   Picea rubens        13
10/3/1994    N654   Picea rubens        14.5
10/1/1993    N654   Betula papyrifera   3
10/31/1993   1      Picea rubens        13.5
10/31/1993   1      Betula papyrifera   1.6
11/14/1994   1      Picea rubens        8.4
11/14/1994   1      Betula papyrifera   1.8
Data Integration
[Figure: tables A, B, and C repeated from the earlier Data Integration slide, with a chart of Density (#/m2) (axis 0 to 16) for Picea rubens and Betula papyrifera.]
Scaling Analysis and Modeling
Analysis + Model Metadata
Data and Metadata Input (from Morpho/Metacat)
Inputs, Outputs, Processing
Execution engine (plugins): SAS, R, Matlab, Simulation models, ...
Output
Scaling Analysis and Modeling
Data and Metadata Input
Configuration for Analysis and Models
DDL Specification (Inputs and DDL Code)
Procedural Specification (Inputs and proc code)
Input Map Specification (test inputs mapped to metadata/data fields)
Test Specification Parser
Input Map Parser
Data/Metadata Input facilitator and Parser
Script with unresolved variables
Script with symbolically resolved variables
Script with some fully resolved variables
Script/Metadata/Data Validation and Conflict Resolution
User or ontological input for conflict resolution
Data Package (Metadata with data file)
Fully resolved final script
Execution Engine
ScriptExecutor
Analytical Engine Plugin
OutputStream from Analytical Engine
Output Config File
OutputRenderer
Output (HTML, XML, Text, etc.)
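The progression on this slide, from a script with unresolved variables to a fully resolved final script, can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the script text, the input map, the metadata field names, and the use of `string.Template` are not taken from the KNB design.

```python
import re
from string import Template

# Hypothetical analysis script with unresolved variables ($name placeholders).
SCRIPT = "summary(lm($response ~ $predictor, data = read.csv('$datafile')))"

# Hypothetical input map: script variable -> metadata field that supplies it.
INPUT_MAP = {
    "response": "attribute.density",
    "predictor": "attribute.year",
    "datafile": "physical.filename",
}

# Hypothetical metadata from a data package; the file name is not yet known.
PACKAGE_METADATA = {
    "attribute.density": "density",
    "attribute.year": "year",
}


def resolve(script, input_map, metadata):
    """Fill in every variable whose metadata field is available; keep the rest."""
    bindings = {var: metadata[field]
                for var, field in input_map.items() if field in metadata}
    return Template(script).safe_substitute(bindings)


def unresolved_variables(script):
    """List placeholder names that still need a value."""
    return sorted(set(re.findall(r"\$(\w+)", script)))


# First pass: some variables resolved, one still symbolic.
partial = resolve(SCRIPT, INPUT_MAP, PACKAGE_METADATA)

# Conflict resolution step: the user supplies the missing value.
completed_metadata = {**PACKAGE_METADATA, "physical.filename": "trees.csv"}
final = resolve(partial, INPUT_MAP, completed_metadata)
```

Only a script with no unresolved variables left would be handed to the execution engine; anything still listed by `unresolved_variables` is a prompt for user or ontological input.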
Semantic metadata


Describes the relationship between measurements and ecologically relevant concepts
Drawn from a controlled vocabulary

Ontology for ecological measurements
Ecological Ontologies
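Concepts linked by isa relations: Organism, Species, Taxon
Concepts linked by has relations: Biodiversity, Sampling Area (A), Species Count (S), Abundance of Species i (N_i), Abundance (N), Proportional Abundance of Species i (p_i), Shannon Diversity (H'), Species Evenness (J')

Abundance: $N = \sum_{i=1}^{S} N_i$
Proportional abundance of species i: $p_i = N_i / N$
Shannon diversity: $H' = -\sum_{i=1}^{S} p_i \ln p_i$
Species evenness: $J' = H' / \ln S$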
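The quantities in this ontology are related by simple formulas, so they can be derived from raw abundances. The Python sketch below computes species count S, total abundance N, Shannon diversity H', and evenness J' from a list of per-species abundances N_i; the function name and the handling of zero counts and of a single species are choices made for this example.

```python
import math


def diversity_indices(abundances):
    """Compute S, N, H' and J' from per-species abundances N_i."""
    # Species with zero abundance are treated as absent.
    counts = [n for n in abundances if n > 0]
    S = len(counts)                               # species count
    N = sum(counts)                               # total abundance
    p = [n / N for n in counts]                   # proportional abundances p_i
    H = -sum(pi * math.log(pi) for pi in p)       # Shannon diversity H'
    # Evenness is undefined when ln S = 0, i.e. with fewer than two species.
    J = H / math.log(S) if S > 1 else float("nan")
    return {"S": S, "N": N, "H": H, "J": J}


# Two equally abundant species: H' = ln 2 and evenness J' = 1.
even = diversity_indices([10, 10])
```

Encoding these relationships in an ontology is what would let a system recognize that a data set reporting species counts can supply the inputs for a diversity analysis.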
What drives synthesis





Science questions
Hypotheses
Analyses + Models
Integrated Data
Original Data
Conclusions




Barriers to integration can be addressed using structured metadata
Can accomplish a lot with ‘just’ mechanical transformations
Domain ontologies + semantic mediation are paths to scaling integration
Analysis drives all other phases of integration