Grid Technologies and Networks


Vision for the 21st Century
Information Environment in Ecology
(Ecoinformatics)
Deana Pennington
University of New Mexico
LTER Network Office
Shawn Bowers
UCSD
San Diego Supercomputer Center
Data Types
Ecological Metadata Language (EML)
Field data
Small
Complex formats
Heterogeneous
Imagery
Massive
Simple formats
Continuous spatial
Spatial Data Workbench:
SEEK: large ITR project
Small NPACI project
Ground sensors
If georeferenced
Massive
GIS
Simple formats
Moderately large
Continuous temporal
Complex formats
Wireless Sensor Workshop
NEON Observatories: question driven data collection
Analytical Domains:
Information Acquisition, Archival & Retrieval
Data Preprocessing & Product Creation
Integrated Data Analysis & Synthesis
Inference From Pattern
Information Technologies:
Hardware, networks
Semantic mediation
Electronic notebooks
Data mining
Processing Pipelines
Remote Sensing
Wireless Sensors
Exploratory spatial data analysis
High-throughput processing
Metadata
Pattern matching
Expert systems
Databases & Query
Visualization
Web design
Grid technologies
EML
Spatial Data Workbench
Wireless Sensors
SEEK Workflows
Computational Models
Genetic algorithms
Cellular automata
Adaptive agents, et al.
Characteristics of Ecological Data
[Figure: dataset types plotted by data volume per dataset (low to high) against complexity/metadata requirements (low to high). Items shown: satellite images, wireless sensors, GIS, weather stations, business data, gene sequences, primary productivity, biodiversity surveys, population data, soil cores. SEEK is marked on the plot.]
Modified from B. Michener
Field Data:
Semantics

Dataset 1:
Date        Site  Species  Area  Count
10/1/1993   N654  PIRU     2     26
10/3/1994   N654  PIRU     2     29
10/1/1993   N654  BEPA     1     3

Dataset 2:
Date       Site  picrub  betpap
31Oct1993  1     13.5    1.6
14Nov1994  1     8.4     1.8

Integrated:
Date        Site  Species            Density
10/1/1993   N654  Picea rubens       13
10/3/1994   N654  Picea rubens       14.5
10/1/1993   N654  Betula papyrifera  3
10/31/1993  1     Picea rubens       13.5
10/31/1993  1     Betula papyrifera  1.6
11/14/1994  1     Picea rubens       8.4
11/14/1994  1     Betula papyrifera  1.8
Modified from B. Michener, 2003
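The integration step in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: density is taken to be count divided by area (which reproduces the values shown), the species codes are assumed to map to the scientific names as listed, and the `SPECIES` table, the `integrate` function, and the ISO date output are hypothetical names and choices, not part of SEEK or EML.

```python
from datetime import datetime

# Hypothetical lookup table: the code-to-name mapping is inferred from the
# example rows, not taken from a published vocabulary.
SPECIES = {
    "PIRU": "Picea rubens", "BEPA": "Betula papyrifera",
    "picrub": "Picea rubens", "betpap": "Betula papyrifera",
}

# Dataset 1: one row per species, with sampled area and count.
long_rows = [
    ("10/1/1993", "N654", "PIRU", 2, 26),
    ("10/3/1994", "N654", "PIRU", 2, 29),
    ("10/1/1993", "N654", "BEPA", 1, 3),
]

# Dataset 2: one row per visit, with one density column per species.
wide_rows = [
    ("31Oct1993", "1", {"picrub": 13.5, "betpap": 1.6}),
    ("14Nov1994", "1", {"picrub": 8.4, "betpap": 1.8}),
]


def integrate(long_rows, wide_rows):
    """Return (ISO date, site, species name, density) rows from both layouts."""
    out = []
    for date, site, code, area, count in long_rows:
        day = datetime.strptime(date, "%m/%d/%Y").date()
        # Assumption: density = count / area.
        out.append((day.isoformat(), site, SPECIES[code], count / area))
    for date, site, densities in wide_rows:
        day = datetime.strptime(date, "%d%b%Y").date()
        for column, density in densities.items():
            out.append((day.isoformat(), site, SPECIES[column], density))
    return out


integrated = integrate(long_rows, wide_rows)
for row in integrated:
    print(row)
```

Every decision hidden in this sketch (date format, species code, unit conversion) is the kind of knowledge that metadata and ontologies would need to carry for the step to be automated.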
Remotely Sensed & Ground Data
Remotely sensed



Satellite
Landsat since 1972 (multispectral)
Ikonos (hyperspatial)
Hyperion (hyperspectral)
Airborne
Air photos (historical reconnaissance)
Radar
Thermal
ADAR (multispectral)
AVIRIS (hyperspectral)
Ground data
Field data
Automated sensors
Wireless sensors
Target
Remotely sensed images capture
information in continuous space,
which can then be compared
through time to derive events
Event
t=2
t=1
Wireless sensors capture information in continuous time,
which can then be compared through space to derive
spatial patterns
Event A
Event A
Event A
t
t
t
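The two complementary views can be illustrated with toy data. The sketch below is a hedged illustration only: the grids, sensor names, readings, and thresholds are made up, and `changed_cells` and `event_times` are hypothetical helper names. The first function compares two images through time to locate an event in space; the second compares sensor time series across space to see when the same event reached each location.

```python
# Two toy 3x3 image grids of the same area at t=1 and t=2 (made-up values).
image_t1 = [[0.6, 0.6, 0.5], [0.6, 0.7, 0.5], [0.5, 0.5, 0.5]]
image_t2 = [[0.6, 0.2, 0.5], [0.6, 0.1, 0.5], [0.5, 0.5, 0.5]]


def changed_cells(before, after, threshold):
    """Grid cells whose value changed by more than threshold between dates."""
    return [
        (r, c)
        for r, row in enumerate(before)
        for c, value in enumerate(row)
        if abs(after[r][c] - value) > threshold
    ]


# Three toy sensor time series sampled at the same time steps (made-up values).
sensor_series = {
    "west": [1, 1, 9, 1, 1, 1],
    "mid": [1, 1, 1, 9, 1, 1],
    "east": [1, 1, 1, 1, 9, 1],
}


def event_times(series, threshold):
    """Time step of the first reading above threshold at each sensor."""
    return {
        name: next(i for i, value in enumerate(values) if value > threshold)
        for name, values in series.items()
    }


event_cells = changed_cells(image_t1, image_t2, threshold=0.3)
arrival = event_times(sensor_series, threshold=5)
print(event_cells)
print(arrival)
```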
History Repeats Itself…
“…use of remotely sensed data…lagged for many years. The
reasons for this have little to do with the sophistication of
remote sensing technology. Rather it has to do more with the
ability to store, manage, access and use the massive data
produced by satellites, radar facilities and other remote sensing
instruments. Without advanced information processing, it would
take decades to compile and analyze the incredible amounts of
information that produced by many of these instruments.”
-Dr. Rita Colwell, Director NSF, 1998
Environmental Cyberinfrastructure Needs for
Distributed Sensor Networks: a Report from
a NSF Sponsored Workshop (2003)

Sensors
Deployed Sensor Networks
Metadata
Security and Error Resiliency
Cyberinfrastructure for Sensor Networks

Analysis and Visualization

Education
Outreach
Collaboration and Partnering






Information Acquisition, Archival & Retrieval
Data Preprocessing & Product Creation
Integrated Data Analysis & Synthesis
Inference From Pattern
Incorporating IT Analytical
Advances into Ecology
Grid Technologies
Knowledge Representation,
Semantics and Ontologies
The Semantic Web
Extend the current web with “knowledge” and
“meaning” for
Better searching (that is, better answers to current searches)
Automated software tools that process web information
(comparison shopping, making appointments, and so on)
Proposes a new form of web content, which uses
ontologies and knowledge representation techniques
The Semantic Web [Sci. Am., May ‘01,
Berners-Lee]
“Mom needs to see a specialist
for a series of physical therapy
sessions – can you take her?”
Find physical therapist
for mom using my
schedule
get openings
get physician
prescription
Semantic-Web
Agent
get possible
providers
and availability
Return provider
available within 10
miles of location
get
locations
Semantic Web Architecture
(RDF)
The Resource Description Framework (RDF),
which is a language to:
Define standard ontologies
Annotate web-pages with Semantic-Web content

Ultimately, tools … to exploit semantic
mark up
Web-crawlers, search engines, personal agents
RDF / RDF Schema
Insurance
Provider
covers
Physician
worksAt
Medical
Facility
locatedAt
Location
Physical
Therapist
An RDF Schema (or OWL) ontology
Serves as a common set of terms (a vocabulary) with
relationships and constraints
Can be published as Web-content using RDF (for
others to use)
RDF / RDF Schema
Insurance
Provider
covers
Physician
worksAt
Medical
Facility
worksAt
University
Hospital
locatedAt
Location
Physical
Therapist
BlueCross
covers
Dr. Hartman
With RDF, this Web-page can be
annotated using the ontology
locatedAt
555 Univ.
Drive …
RDF / RDF Schema
Which Physical Therapists workAt a Facility within Location X?
[Diagram: ontology classes Insurance Provider, Physician, Physical Therapist, Medical Facility, and Location, linked by covers, worksAt, and locatedAt; instances BlueCross, Dr. Hartman, and University Hospital, linked by covers, worksAt, and locatedAt]
Annotations provide access to the
meaningful, or semantic content of
the Web-page
555 Univ.
Drive …
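A query over annotations like these can be sketched with plain Python tuples standing in for RDF triples. This is a minimal illustration, not an RDF library: the identifiers are simplified from the diagram labels, the statement that Dr. Hartman is a Physical Therapist and the direction of each link are assumptions, and `subjects`, `objects`, and `therapists_at` are hypothetical helper names.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
triples = {
    ("PhysicalTherapist", "subClassOf", "Physician"),      # assumed
    ("BlueCross", "type", "InsuranceProvider"),
    ("DrHartman", "type", "PhysicalTherapist"),            # assumed
    ("BlueCross", "covers", "DrHartman"),
    ("DrHartman", "worksAt", "UniversityHospital"),
    ("UniversityHospital", "type", "MedicalFacility"),
    ("UniversityHospital", "locatedAt", "555 Univ. Drive"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is stated."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def therapists_at(location):
    """Physical therapists who work at a facility located at `location`."""
    facilities = subjects("locatedAt", location)
    return sorted(
        person
        for person in subjects("type", "PhysicalTherapist")
        if objects(person, "worksAt") & facilities
    )


matches = therapists_at("555 Univ. Drive")
print(matches)
```

The query never inspects the page text itself; it only follows the annotated relationships, which is the point the slide makes about semantic content.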
SEEK and the Semantic Web
We want to build technology using Semantic-Web
standards to …
… explore the use of semantics to help scientists
deal with heterogeneity
Define standard ecological ontologies
Automate dataset and analytic-step discovery, exchange,
and integration
Help researchers construct and reuse scientific
workflows, for example, for ecological modeling
SEEK EcoGrid
1. Question of interest
2. Query EcoGrid for workflows (ontologies)
3. Query EcoGrid for data (ontologies & semantic mediation)
4. SRB optimizes and runs analysis
5. Get results…archive to EcoGrid
Working Groups:
1. EcoGrid
2. Semantic mediation & KR
3. Analysis & Modeling
4. Taxon
5. BEAM
6. EOT
Pipeline
60 Gigabits/second
Resources (data & computational)
Managed by Storage Resource Broker (SRB)
Pipeline
EcoGrid
Analytical Services
Storage
Resource
Broker
Data Services
(includes analytical libraries)
1. Node Registry
• Web service: XML standards, SOAP/WSDL protocols
• Data: REQUIRES standard metadata (EML and others)
• Workflows: standard workflow metadata?
Matt Jones, 2003
SEEK Components
Overview of architecture
AM: Analysis and Modeling System
Analytical Pipeline (AP)
AS x
TS1
ASy
ASz
Example of “AP 0”
TS2
ASr
etc.
Parameters w/ Semantics
Data Binding
SM: Semantic
Mediation


System
Logic Rules
Semantic Mediation
Engine
WSDL/UDDI
EG: EcoGrid
ASr
AP 0
Invasive species
over time
Library of Analysis
Steps, Pipelines
& Results
WSDL/UDDI
C
ECO2
C
WSDL/UDDI
Query Processing
ECO2-CL
Parameter
Ontologies &
Taxonomies
Execution Environment:
SAS, MATLAB, etc.
C
ECO2
C
C
MC
EML
Darw
Wrap
SRB
KNB Species
…
C
TaxOn
Raw data sets
wrapped
for integration
w/ EML, etc.
Benefits to Users
Scientists
• Access to high end computing technologies
• Better integration of all relevant data
• Workflow standardization and analysis
• Time and resource efficiency
• Reusable analytical steps & workflows
Students
• Improved access to knowledge base
Environmental Managers
• Accessibility to current scientific approach
Policy makers
• Timely input to decision making
Formal documentation of methods (output in report format)
Reproducibility of methods
Visual creation and communication of methods
Versioning
Automated data typing and transformation
SEEK: ENM workflows
EcoGrid
DataBase
Species
pres. & abs.
points
Species
pres. & abs.
points
Test sample
+A2
+A3
EcoGrid
Query
Physical
Transformation
Sample
Data
EcoGrid
DataBase
Training
sample
GARP
rule set
Data
Calculation
Validation
GARP
rule set
Integrated
layers
Env.
layers
EcoGrid
DataBase
EcoGrid
Query
EcoGrid
DataBase
Model quality
parameters
+A1
Integrated
layers
Layer
Integration
Native range
prediction map
Map
Generation
User
Selected
prediction
maps
Scaling
Archive
To Ecogrid
Generate
Metadata
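The sample, calculation, and validation steps of this workflow can be sketched as small composable functions. This is a hedged illustration only: GARP itself is not implemented here, and a simple threshold rule on one made-up environmental value stands in for the GARP rule set. The function names, the toy points, and the 25% test fraction are all assumptions.

```python
import random


def split_sample(points, test_fraction, seed):
    """Shuffle presence/absence points and split them into training and test."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold_rule(training):
    """Stand-in for the rule set: midpoint between the two class means."""
    present = [value for value, is_present in training if is_present]
    absent = [value for value, is_present in training if not is_present]
    return (sum(present) / len(present) + sum(absent) / len(absent)) / 2


def validate(threshold, test):
    """Fraction of test points whose presence the rule predicts correctly."""
    correct = sum((value >= threshold) == is_present for value, is_present in test)
    return correct / len(test)


# Toy points: (environmental layer value, species present?). Made-up data in
# which the species is present wherever the value is 15 or more.
points = [(value, value >= 15) for value in range(5, 25)]

training, test = split_sample(points, test_fraction=0.25, seed=0)
rule = fit_threshold_rule(training)
accuracy = validate(rule, test)
print(len(training), len(test), rule, accuracy)
```

Keeping each step as a separate function mirrors the idea of reusable analytical steps: the sampling and validation steps would stay the same if the stand-in rule were replaced by a real model.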
Analytical Pipelines
Sloan Digital Sky Survey:
Mapping the Universe
“The raw data…are fed through data
analysis software pipelines…to extract
about 400 attributes for each celestial
object…These pipelines embody much
of mankind’s knowledge of
astronomy.” Szalay et al., 2001
Species Distribution Pipeline
Species
pres. & abs.
points
Acoustic
Signal
Processing
Pipeline
Species
pres. & abs.
points
Test sample
+A2
+A3
Physical
Transformation
Model quality
parameters
+A1
Sample
Data
Training
sample
GARP
rule set
Data
Calculation
Validation
GARP
rule set
Integrated
layers
Image
Processing
Pipeline
EcoGrid
Query
Interpolation
Pipeline
Env.
layers
Integrated
layers
Layer
Integration
Native range
prediction map
Map
Generation
User
Selected
prediction
maps
Scaling
Remotely sensed data (land cover class, etc.)
Ground sensor data (climate, etc.)
Archive
To Ecogrid
Generate
Metadata
Analytical Pipelines: SDW
SRB/
MCAT
Radiometric
Corrections
Maps
HPSS @ SDSC
Remotely Sensed
Imagery
Georegistration
Band
Indices
Data
Transformation
Site Field Observations
Ground truth
Climate
Supervised
Classification
Band
Selection
Unsupervised
Classification
Segmentation
Land Cover
(Patch) Metrics
Climate/Land Cover
Integrated Graphics
Exploratory analysis
Vegetation patterns
Vegetation dynamics
Model parameterization
Brain atlas
Registration
Template
Distance
Transforms
Prototypes
Grey value
images
Statistical
Classification
Segmented
images
Biomedical
Informatics
Research
Network
Kikinis et al., 2001
T. Kapur, et al., 1998; Tina Kapur, 1999.
Surgical Planning Laboratory, 2001
Society for Industrial and Applied
Mathematics (SIAM) Conference on
Imaging Science, 2004
CONFERENCE THEMES
• Image acquisition
• Image reconstruction and restoration
• Image storage, compression, and retrieval
• Image coding and transmission
• PDEs in image filtering and processing
• Image registration and warping
• Image modeling and analysis
• Statistical aspects of imaging
• Wavelets and multiscale analysis
• Multidimensional imaging sciences
• Inverse problems in imaging sciences
• Mathematics of visualization
• Biomedical imaging
• Applications
“By their very nature, these
challenges cut across the disciplines
of physics, engineering,
mathematics, biology, medicine, and
statistics.”
Why not ecology and environmental
science?
Ontologies
Astrophysics
Ontology
Ecology Ontology
•Landscape Ecology
•Land Managers
•Soil science
•Etc.
Generic
Image/Signal
Ontologies
Biomedical
Ontology
Digital Film
Ontology
And many others…
Landscape Ecology Example
Generic Image Ontologies
Structural
Ontologies
Method
Ontologies
Pixel calc
Classification
Segmentation
Domain Ontologies
Patch metrics
Atm Corr
Land cover class
Patch ID
Physical
Ontologies
Modified from Camara et al. (2001)
TM
EMR 7 bands
HDF Place/date
Calibrations
So far….
Grid Technology
• EcoGrid vs semantic web
Analytical pipelines/Workflows
• Sensors: generic vs domain specific
• Reuse of actors/workflows
• Workflow metadata and reporting
Ontologies/Semantic Mediation
• Query EcoGrid for workflows
• Query EcoGrid for data to fit the selected workflow(s)
• Integration of heterogeneous data types
Exploratory Data Analysis
Data Mining
-finding interesting patterns
Visualization
-showing interesting patterns
NDVI at Sevilleta
1989 90 91 92 93 94 95 96 97 98 99 00 01 2002
TM
AVHRR
MODIS
AVHRR: 1 x 1 km pixels, 14 years * 26 images/year * 1824 pixels = 663,936 data points
TM: 30 x 30m pixels, 14 years * 2 images/year * 65,260 pixels = 1,827,280 data points
if 20 images/year => 18,272,800 data points
if 30 years => 39,156,000 data points
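The data-volume figures above follow from years × images per year × pixels, and can be checked with a short script. The helper names are hypothetical, and the 30-year case is read as also using 20 images per year, which is the reading that reproduces the stated total. The `ndvi` function uses the standard definition of the index, (NIR - Red) / (NIR + Red), which the slide itself does not spell out.

```python
def data_points(years, images_per_year, pixels):
    """Number of values in a time series of single-band images."""
    return years * images_per_year * pixels


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from two reflectance values."""
    return (nir - red) / (nir + red)


avhrr = data_points(14, 26, 1824)         # 1 x 1 km pixels
tm = data_points(14, 2, 65260)            # 30 x 30 m pixels
tm_dense = data_points(14, 20, 65260)     # if 20 images/year
tm_dense_30 = data_points(30, 20, 65260)  # if 30 years at 20 images/year

for label, n in [("AVHRR", avhrr), ("TM", tm),
                 ("TM, 20 images/year", tm_dense),
                 ("TM, 20 images/year, 30 years", tm_dense_30)]:
    print(f"{label}: {n:,} data points")
```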
Spatiotemporal Analysis & Vis:
Drought Effects
[Figure: panels for 1999, 2000, 2001, and 2002 across two-week periods from July 16-29 through Sep 10-23; axis: sum of all cells, 0 to 6]
Spatiotemporal Analysis & Vis:
Drought Effects
[Figure: spring, 5th percentile, by year, 1989 to 2002]
April 23 - October 8 Drought-Effects
Number of cells with significantly low productivity compared with historic range (5%)
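One way to read the 5% criterion is sketched below: a cell is counted when its current value falls below the 5th percentile of that cell's own historic values. This interpretation is an assumption, as are the toy values, the cell names, and the helper functions `percentile` and `count_low_cells`; the slide does not state how the threshold was actually computed.

```python
def percentile(values, pct):
    """Linearly interpolated percentile of a list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def count_low_cells(history, current, pct=5):
    """Count cells whose current value is below their own historic percentile."""
    return sum(
        1
        for cell, value in current.items()
        if value < percentile(history[cell], pct)
    )


# Made-up productivity values (for example NDVI) for three cells.
history = {
    "a": [0.30, 0.32, 0.28, 0.35, 0.31],
    "b": [0.50, 0.55, 0.52, 0.48, 0.51],
    "c": [0.10, 0.12, 0.11, 0.13, 0.09],
}
current = {"a": 0.20, "b": 0.52, "c": 0.08}

low_cells = count_low_cells(history, current)
print(low_cells)
```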
[Figure: count of cells by year and period for the North (N) and South (S) groups, 1989 to 2002; y-axis: sum of count, 0 to 160; periods: S = Spring, F = Summer/Fall]
Linking and Brushing
Visualization: Investigating cancer incidence and risk factors. From GeoVista Studio, Penn State University.
Hyperspectral Imagery = 224
bands
AVIRIS hyperspectral
data cube
> 50 gigabytes of raw
data per acquisition
Hyperspectral Example
Pavement
True
Color
Riparian
Clouds
Agriculture
False
Color
River
300 pixels
6 km
300 pixels * 300 pixels * 224 bands = 20,160,000 data points
Arid
Upland
Training Samples
Testing Samples
Legend
Limited
Set
Label Error
Land Cover Class
Full
Set
Clouds
River
Riparian
Arid Upland
Semi-arid Upland
Pavement
Agriculture
Barren
Limited Set:
192 training pixels, 7 mislabeled, out of 90,000 total pixels
*low % training pixels
*errors in training set
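The numbers for this example can be put in proportion with a few lines of arithmetic. The sketch below uses only the figures given above (a 300 × 300 pixel scene, 224 bands, 192 training pixels, 7 of them mislabeled); the variable names are illustrative.

```python
width, height, bands = 300, 300, 224

total_pixels = width * height       # pixels in the scene
cube_values = total_pixels * bands  # values in the hyperspectral cube

training_pixels, mislabeled = 192, 7
training_fraction = training_pixels / total_pixels
label_error_rate = mislabeled / training_pixels

print(f"total pixels: {total_pixels:,}")
print(f"cube values: {cube_values:,}")
print(f"training fraction: {training_fraction:.2%}")
print(f"label error rate: {label_error_rate:.1%}")
```

The limited set therefore trains on roughly a fifth of one percent of the scene, with a few percent of those labels wrong.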
Supervised Classifiers
[Figure: pixels plotted by Band 1 and Band 2 for Class 1 and Class 2, showing the class means, a pixel to be classified (x), Euclidean distance, probability contours, and a support vector machine hyperplane]
Limited Sample Set
A) ML 89.4%
B) NBN 83.3%
C) SVM 77.2%
D) MD 69.4%
Clouds
River
Riparian
Agriculture
Arid Upland
Barren
Pavement
ML = Maximum Likelihood
NBN = Naïve Bayesian Network
SVM = Support Vector Machine
MD = Minimum Distance
Full Sample Set
A) ML 96.4%
B) NBN 90.9%
C) SVM 72.9%
D) MD 88.4%
Clouds
River
Riparian
Agriculture
Arid Upland
Semi-arid Upland
Barren
Pavement
ML = Maximum Likelihood
NBN = Naïve Bayesian Network
SVM = Support Vector Machine
MD = Minimum Distance
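Of the four classifiers compared, minimum distance (MD) is the simplest to write down: each pixel is assigned to the class whose mean is nearest in band space by Euclidean distance. The sketch below is a generic two-band illustration with made-up reflectance values, not the implementation behind the accuracies reported here; `class_means` and `classify` are hypothetical names.

```python
import math


def class_means(training):
    """Mean band vector per class from {label: [(band1, band2), ...]}."""
    return {
        label: tuple(sum(band) / len(pixels) for band in zip(*pixels))
        for label, pixels in training.items()
    }


def classify(pixel, means):
    """Label of the class mean closest to the pixel (Euclidean distance)."""
    return min(means, key=lambda label: math.dist(pixel, means[label]))


# Made-up training pixels for two of the land cover classes.
training = {
    "River": [(0.05, 0.02), (0.07, 0.04)],
    "Agriculture": [(0.10, 0.50), (0.12, 0.46)],
}

means = class_means(training)
first = classify((0.06, 0.05), means)
second = classify((0.11, 0.40), means)
print(first, second)
```

Because the class means are estimated from the training pixels alone, a small training set with a few wrong labels can move them noticeably, which is consistent with MD being sensitive to the limited sample set.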
Data Mining Challenges
Biomedical Data
• Large sample sets
• Few correlates (dozens)
• Hard classes
Ecologic Data
• Paucity of accurate reference data
• Spatial autocorrelation
• Large number of potential correlates
• Fuzzy classes
• Uncertainty
Basic Research Need
• Spatiotemporal analysis & visualization
techniques that explicitly deal with
these challenges
• EcoGrid archive of ground truth data
and the ontologies that will allow us to
semantically mediate the classes
Where do we start?
Field data
SEEK: infrastructure
Imagery
Spatial Data Workbench:
Small NPACI project
Ground sensors
Wireless Sensor Workshop
Future Systems:
Link with SEEK
Pipeline
Semantic transformation
to integrate field data
Pipeline
Unspecified ground
sensor pipeline
EcoGrid
Query
+
Sample
Data
SRB/
MCAT
Data
Calculation
Map
Generation
Radiometric
Corrections
HPSS @ SDSC Georegistration
Remotely Sensed
Imagery
Data
Transformation
Site Field Observations
Ground truth
Climate
User
Maps
Band
Indices
Layer
Integration
Archive Generate
To Ecogrid Metadata
Unsupervised
Classification
Supervised
Land Cover
Segmentation
Classification
(Patch) Metrics
Band
Selection
Climate/Land Cover
Integrated Graphics
Algorithm
Ontologies
Validation
Image
Ontologies
Geographic
Ontologies
Spatial &
Temporal
Ontologies
Models
Competition
Connectivity
Climate
Urban expansion
Et al.
Domain
Ontologies
Signal Processing
Ontologies
We start with you!
Data Sharing
Metadata
Databases
Computer savvy
End!
Incorporating sensor processing
1. Build a generic image and signal processing knowledge base
2. Develop actors for these functions
3. Build knowledge bases for domains of interest, and relate
them to the generic knowledge base
• ENM pipelines
• NEON competition
• Hazards (fire, flood, drought, disease)
4. Develop processing pipelines
5. Identify sensor (image and signal) data and analytical
resources, convert them to web services
6. When EcoGrid is ready, register them as nodes
National Center?
• Multidisciplinary staff
• Working groups (4-6 weeks)
• Multidisciplinary postdocs
• Summer school in ecoinformatics