Community Cyberinfrastructure and X-informatics

Community cyberinfrastructure and
X-informatics - Assessment of
convergence and innovation based
on project experience
Peter Fox
High Altitude Observatory,
NCAR
Work performed in part with Deborah McGuinness (RPI),
Rob Raskin (JPL), Krishna Sinha (VT), Luca Cinquini
(NCAR), Patrick West (NCAR), Stephan Zednik (NCAR),
Paulo Pinheiro da Silva (UTEP), Li Ding (RPI) and
others
Fox CI and X-informatics - CSIG 2008, Aug 11
Outline
• Background and inevitabilities
• Informatics -> e-Science
• Informatics methodology e.g. Semantic
Web as an approach and a technology
– Virtual Observatories: use cases, some
examples, and non-specialist use
– Data ingest, integration, mining and
where we are heading
• Discussion
Background
Scientists should be able to access a global, distributed
knowledge base of scientific data that:
• appears to be integrated
• appears to be locally available
But… data is obtained by multiple instruments, using
various protocols, in differing vocabularies, using
(sometimes unstated) assumptions, with
inconsistent (or non-existent) meta-data. It may be
inconsistent, incomplete, evolving, and distributed
And… there exist(ed) significant levels of semantic
heterogeneity, large-scale data, complex data
types, legacy systems, inflexible and unsustainable
implementation technology…
Information products have Lots of Audiences
But data has Lots of Audiences
SCIENTISTS TOO
[Figure: audience spectrum from More Strategic to Less Strategic]
From “Why EPO?”, a NASA internal
report on science education, 2005
Shifting the Burden from the User
to the Provider
The Astronomy approach; datatypes as a service
[Diagram] VO App1, VO App2, VO App3 -> VO layer -> DB1, DB2, DB3, …, DBn
• VO layer: VOTable, Simple Image Access Protocol, Simple Spectrum
Access Protocol, Simple Time Access Protocol
• Labels on the diagram: Limited interoperability; Lightweight
semantics; Limited meaning, hard coded; Limited extensibility;
Under review
• Open Geospatial Consortium: Web {Feature, Coverage, Mapping}
Service and Sensor Web Enablement: Sensor {Observation, Planning,
Analysis} Service use the same approach
Mind the Gap!
As a result of finding out who is doing what, sharing experience/
expertise, and substantial coordination:
• There is/was still a gap between science and the underlying
infrastructure and technology that is available
• Cyberinfrastructure is the new research environment(s) that support
advanced data acquisition, data storage, data management, data
integration, data mining, data visualization and other computing
and information processing services over the Internet.
• Informatics - information science includes the science of (data and)
information, the practice of information processing, and the
engineering of information systems. Informatics studies the
structure, behavior, and interactions of natural and artificial systems
that store, process and communicate (data and) information. It also
develops its own conceptual and theoretical foundations. Since
computers, individuals and organizations all process information,
informatics has computational, cognitive and social aspects,
including study of the social impact of information technologies.
(Wikipedia)
Progression after progression
[Diagram] IT / Cyber Infrastructure -> Cyber Informatics -> Core
Informatics -> Science Informatics, aka X-informatics -> Science, SBAs
(overall label: Informatics)
Virtual Observatories
Make data and tools quickly and easily accessible
to a wide audience.
Operationally, virtual observatories need to find the
right balance of data/model holdings, portals and
client software that researchers can use without
effort or interference as if all the materials were
available on his/her local computer using the
user’s preferred language: i.e. appear to be
local and integrated
Likely to provide controlled vocabularies that may
be used for interoperation in appropriate
domains along with database interfaces for
access and storage -> thus part IT, part CI, part
Informatics
[Diagram] Components, top to bottom: Semantic mediation layer -
mid-upper-level; VO Portal, Web Serv., VO API; Semantic mediation
layer - VSTO - low level; Metadata, schema, data; DB1, DB2, DB3, …, DBn
Added value at successive levels:
• Education, clearinghouses, other services, disciplines, et c.
• Semantic interoperability
• Semantic query, hypothesis and inference
• Query, access and use of data
Mediation Layer
• Ontology - capturing concepts of Parameters, Instruments,
Date/Time, Data Product (and associated classes, properties) and
Service Classes
• Maps queries to underlying data
• Generates access requests for metadata, data
• Allows queries, reasoning, analysis, new hypothesis generation,
testing, explanation, et c.
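The mediation-layer idea of mapping a concept-level query to access requests can be sketched as follows. This is a minimal illustration, not the VSTO implementation: the concept names, database names, column names and the `mediate` function are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical mapping from ontology concepts to the column name each
# database uses for them. None of these names come from VSTO itself.
CONCEPT_TO_SOURCE = {
    "NeutralTemperature": {"DB1": "tn", "DB2": "temp_neutral"},
    "IonVelocity": {"DB2": "vi"},
}


@dataclass(frozen=True)
class AccessRequest:
    """One request the mediation layer would send to a single database."""
    database: str
    column: str
    start: str
    end: str


def mediate(concept, start, end):
    """Map one concept-level query to a request per database that holds it."""
    sources = CONCEPT_TO_SOURCE.get(concept, {})
    return [AccessRequest(db, col, start, end) for db, col in sorted(sources.items())]


if __name__ == "__main__":
    for request in mediate("NeutralTemperature", "2005-01-01", "2005-01-31"):
        print(request)
```

The user asks only for a concept and a time range; which databases hold it, and under what local name, stays inside the mediation layer.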
Semantic Web Methodology and
Technology Development Process
• Establish and improve a well-defined methodology vision for
Semantic Technology based application development
• Leverage controlled vocabularies, et c.
[Diagram] Iterative cycle: Use Case -> Small Team, mixed skills ->
Analysis -> Develop model/ontology -> Use Tools -> Science/Expert
Review & Iteration -> Adopt Technology Approach -> Leverage
Technology Infrastructure -> Rapid Prototype -> Open World: Evolve,
Iterate, Redesign, Redeploy
Science and technical use cases
Find data which represents the state of the neutral
atmosphere anywhere above 100km and toward the
arctic circle (above 45N) at any time of high
geomagnetic activity.
– Extract information from the use-case - encode knowledge
– Translate this into a complete query for data - inference and
integration of data from instruments, indices and models
Provide semantically-enabled, smart data query services
via a SOAP web service for the Virtual Ionosphere-Thermosphere-Mesosphere
Observatory that retrieve
data, filtered by constraints on Instrument, Date-Time,
and Parameter in any order and with constraints
included in any combination.
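A query that accepts constraints "in any order and in any combination" can be sketched with optional keyword arguments. The catalog records, instrument names and the `query` function below are hypothetical and stand in for the real service interface.

```python
from datetime import date

# Hypothetical catalog; instrument and parameter names are illustrative only.
CATALOG = [
    {"instrument": "FabryPerot", "parameter": "NeutralTemperature", "date": date(2005, 1, 26)},
    {"instrument": "FabryPerot", "parameter": "NeutralWind", "date": date(2008, 3, 21)},
    {"instrument": "IncoherentScatterRadar", "parameter": "ElectronDensity", "date": date(2005, 1, 26)},
]


def query(instrument=None, parameter=None, start=None, end=None):
    """Return records matching every supplied constraint; omitted ones are ignored."""
    results = []
    for rec in CATALOG:
        if instrument is not None and rec["instrument"] != instrument:
            continue
        if parameter is not None and rec["parameter"] != parameter:
            continue
        if start is not None and rec["date"] < start:
            continue
        if end is not None and rec["date"] > end:
            continue
        results.append(rec)
    return results


if __name__ == "__main__":
    # Constraints can be given in any order and any subset.
    print(query(parameter="ElectronDensity", start=date(2005, 1, 1)))
    print(query(end=date(2005, 12, 31), instrument="FabryPerot"))
```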
Inferred plot type
and return required
axes data
But data has Lots of Audiences
More Strategic
Less Strategic
From “Why EPO?”, a NASA internal
report on science education, 2005
What is a Non-Specialist Use Case?
A teacher accesses the internet, goes
to an Educational Virtual
Observatory and enters a
search for “Aurora”.
Someone should be able to query a virtual observatory without
having specialist knowledge
What should the User Receive?
Teacher receives four groupings of search results:
1) Educational materials:
http://www.meted.ucar.edu/topics_spacewx.php and
http://www.meted.ucar.edu/hao/aurora/
2) Research, data and tools: via research VOs but
the search for brightness, or green/red line emission
is mediated for them
3) Did you know?: Aurora is a phenomenon of the
upper terrestrial atmosphere (ionosphere) also
known as Northern Lights
4) Did you mean?: Aurora Borealis or Aurora
Australis, etc.
Semantic Information Integration:
Concept map for educational use of
science data in a lesson plan
Informatics issues for
Virtual Observatories
• Scaling to large numbers of data providers and
redefining the roles/ relations among them
• Branding and attribution (where did this data come
from and who gets the credit, is it the correct version,
is this an authoritative source?)
• Provenance/derivation (propagating key information
as it passes through a variety of services, copies of
processing algorithms, …)
• Crossing discipline boundaries
• Data quality, preservation, stewardship
• Security, access to resources, policies
Provenance
• Origin or source from which something
comes, its intention for use, whom or what
it was generated for, the manner of
manufacture, history of subsequent
owners, sense of place and time of
manufacture, production or discovery;
documented in detail sufficient to allow
reproducibility
Use cases
• Who (person or program) added the comments
to the science data file for the best vignetted,
rectangular polarization brightness image from
January 26, 2005 18:49:09 UT taken by the
ACOS Mark IV polarimeter?
• What was the cloud cover and atmospheric
seeing conditions during the local morning of
January 26, 2005 at MLSO?
• Find all good images on March 21, 2008.
• Why are the quick look images from March 21,
2008, 1900UT missing?
• Why does this image look bad?
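Questions like these can be answered once provenance events are recorded alongside the data. The sketch below is one possible shape for such a record and two lookups over it; the artifact identifiers, agent names, notes and function names are made up for illustration and do not describe the actual MLSO or SPCDIS systems.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProvenanceEvent:
    artifact: str   # hypothetical identifier for a data file or image
    action: str     # what happened to the artifact
    agent: str      # person or program responsible
    when: datetime
    note: str = ""


# Illustrative log entries only.
LOG = [
    ProvenanceEvent("pb_image_20050126", "created", "ingest_pipeline", datetime(2005, 1, 26, 18, 49, 9)),
    ProvenanceEvent("pb_image_20050126", "added_comment", "observer_a", datetime(2005, 1, 26, 19, 5, 0), "thin cirrus"),
    ProvenanceEvent("quicklook_20080321", "skipped", "ingest_pipeline", datetime(2008, 3, 21, 19, 0, 0), "computer crash"),
]


def agents_for(artifact, action):
    """Who (person or program) performed this action on this artifact?"""
    return sorted({e.agent for e in LOG if e.artifact == artifact and e.action == action})


def notes_for(artifact):
    """Recorded explanations for an artifact, in time order."""
    events = sorted(LOG, key=lambda e: e.when)
    return [e.note for e in events if e.artifact == artifact and e.note]


if __name__ == "__main__":
    print(agents_for("pb_image_20050126", "added_comment"))
    print(notes_for("quicklook_20080321"))
```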
Quick look browse
Yasukawa: Computer crash
Yasukawa: Computer crash
Yasukawa: Rain, cloud
Visual browse
Search
A Better Way to Access Data
The Problem
Scientists only use data from a single instrument because it is difficult to
access, process, and understand data from multiple instruments.
A typical data query might be:
“Give me the temperature, pressure, and water vapor from the AIRS
instrument from Jan 2005 to Jan 2008”
“Search for MLS/Aura Level 2, SO2 Slant Column Density from 2/1/2007”
A Solution
Using a simple process, SESDI allows data from various sources to be
registered in an ontology so that it can be easily accessed and understood.
Scientists can use only the ontology components that relate to their data.
An SESDI query might look like:
“Show all areas in California where sulfur dioxide (SO2) levels were
above normal between Jan 2000 and Jan 2007”
This query will pull data from all available sources registered in the
ontology and allow seamless data fusion. Because the query is
measurement related, scientists do not need to understand the details of
the instruments and data types.
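A measurement-oriented query of this kind can be sketched as a registry keyed by measurement rather than by instrument. Everything here is an assumption for illustration: the source names, units, "normal" baselines and data values are invented, and SESDI's actual registration process is not shown. Each source keeps its own unit and baseline, so values are compared only within a source.

```python
from datetime import date

REGISTRY = {}  # measurement concept -> list of registered sources


def register(measurement, source, unit, normal, records):
    """Register a data source under the measurement it provides."""
    REGISTRY.setdefault(measurement, []).append(
        {"source": source, "unit": unit, "normal": normal, "records": records}
    )


def above_normal(measurement, start, end):
    """Collect readings above each source's own baseline, across all sources."""
    hits = []
    for entry in REGISTRY.get(measurement, []):
        for when, value in entry["records"]:
            if start <= when <= end and value > entry["normal"]:
                hits.append((entry["source"], when, value, entry["unit"]))
    return hits


# Invented sources and values, for illustration only.
register("SO2", "ground_traverse", "t/d", 1000.0,
         [(date(2005, 3, 1), 850.0), (date(2006, 7, 4), 1900.0)])
register("SO2", "satellite_scd", "DU", 1.0,
         [(date(2006, 7, 4), 2.4), (date(2007, 2, 1), 0.3)])

if __name__ == "__main__":
    for hit in above_normal("SO2", date(2000, 1, 1), date(2007, 1, 31)):
        print(hit)
```

The caller names only the measurement and the period; which instruments contribute is decided by what has been registered.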
Determine the statistical signatures of volcanic
forcings on the height of the tropopause
Detection and attribution relations…
Leveraged VSTO semantic framework indicating how volcano and
atmospheric parameters and databases can immediately be
plugged in to the semantic data framework to enable data
integration.
Discussion (1)
• Taken together, the collected experience points to an emerging
informatics core capability that is starting to take data-intensive
science into a new realm of realizability and, potentially,
sustainability
– Use cases
– X-informatics
– Core Informatics
– Cyber Informatics
• Evolvable technical infrastructure
Progression after progression
[Diagram] IT / Cyber Infrastructure -> Cyber Informatics -> Core
Informatics -> Science Informatics -> Science, Societal Benefit Areas,
Edu (overall label: Informatics)
One example:
• CI = OPeNDAP server running over HTTP/HTTPS
• Cyberinformatics = Data (product) and service ontologies, triple store
• Core informatics = Reasoning engine (Pellet), OWL, CMAP,
• Science (X) informatics = Use cases, science domain terms, concepts in
an ontology
Discussion (2)
• The data and information challenges are (almost)
being identified as increasingly common
• Data and information science is becoming the
‘fourth’ column (along with theory, experiment
and computation)
• Semantics are a key ingredient for progress
in informatics
• A sustained involvement of key inter-disciplinary
team members is very important -> leads to
incentives, rewards, etc. and a balance of
research and production
Summary
• Informatics is playing a key role in filling the gap
between science (and the spectrum of non-expert)
use and generation and the underlying
cyberinfrastructure
– This is evident due to the emergence of X-informatics
(world-wide)
• Our experience is implementing informatics as
semantics in Virtual Observatories (as a working
paradigm) and Grid environments
– VSTO is only one example of success
– Data mining, data integration, smart search, provenance
• Informatics is a profession and a community activity
and requires efforts in all 3 sub-areas (science, core,
cyber) and must be synergistic
More Information
• Virtual Solar Terrestrial Observatory (VSTO):
http://vsto.hao.ucar.edu, http://www.vsto.org
• Semantically-Enabled Science Data Integration (SESDI):
http://sesdi.hao.ucar.edu
• Semantic Provenance Capture in Data Ingest Systems
(SPCDIS): http://spcdis.hao.ucar.edu
• SAM/Semantic Knowledge Integration Framework
(SKIF): http://skif.hao.ucar.edu
• Conferences: numerous
• Journals: Earth Science Informatics
• Texts: <empty>, a few are in progress
• Courses:
– Semantic e-Science, fall 2008 course at RPI
– Geoinformatics, at Purdue
• Contact: Peter Fox [email protected]
Spare room
Translating the Use-Case - nonmonotonic?
Input
Physical properties: State of neutral atmosphere
Spatial:
• Above 100km
• Toward arctic circle (above 45N)
Conditions:
• High geomagnetic activity
Action: Return Data
GeoMagneticActivity has ProxyRepresentation
GeophysicalIndex is a ProxyRepresentation (in Realm of Neutral Atmosphere)
Kp is a GeophysicalIndex
hasTemporalDomain: “daily”
hasHighThreshold: xsd_number = 8
Date/time when Kp >= 8
Specification needed for query to CEDARWEB
• Instrument
• Parameter(s)
• Operating Mode
• Observatory
• Date/time
Return-type: data
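The translation step, turning "high geomagnetic activity" into concrete dates via the Kp proxy, can be sketched as below. The threshold of 8, the 100 km altitude and the 45N latitude come from the slide; the daily Kp values, the dictionary keys and the function names are assumptions, and the real CEDARWEB query format is not reproduced.

```python
from datetime import date

KP_HIGH_THRESHOLD = 8  # hasHighThreshold from the slide

# Made-up daily Kp values, for illustration only.
DAILY_KP = {
    date(2005, 1, 20): 4,
    date(2005, 1, 21): 8,
    date(2005, 1, 22): 9,
    date(2005, 1, 23): 5,
}


def high_activity_dates(daily_kp, threshold=KP_HIGH_THRESHOLD):
    """Dates on which the Kp proxy indicates high geomagnetic activity."""
    return sorted(day for day, kp in daily_kp.items() if kp >= threshold)


def build_query(daily_kp):
    """Assemble a hypothetical query specification from the use-case inputs."""
    return {
        "physical_property": "state of neutral atmosphere",
        "min_altitude_km": 100,
        "min_latitude_deg_north": 45,
        "dates": high_activity_dates(daily_kp),
        "return_type": "data",
    }


if __name__ == "__main__":
    print(build_query(DAILY_KP))
```

Instrument, parameter, operating mode and observatory would still have to be inferred from the ontology before the specification is complete; this sketch covers only the date/time part.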
VSTO - semantics and ontologies in an operational
environment: vsto.hao.ucar.edu, www.vsto.org
Web Service
Semantic filtering by
domain or instrument
hierarchy
Partial exposure of Instrument class hierarchy - users
seem to LIKE THIS
Semantic Web Services
An OWL document returned using the VSTO ontology can be used either
syntactically or semantically
VSTO achievements
• Conceptual model and architecture developed by combined
team; KR experts, domain experts, and software engineers
• Semantic framework developed and built with a small,
cohesive, carefully chosen team in a relatively short time
(deployments in 1st year)
• Production portal released, includes security, et c. with
community migration (and so far endorsement)
• VSTO ontology version 1.2, (vsto.owl) in production, 2.0 in
preparation
• Web Services encapsulation of semantic interfaces in use
• Solar Terrestrial use-cases are driving the completion of the
ontologies (e.g. instruments)
• Using ontologies and the overall framework in other
applications (volcanoes, climate, oceans, water, …)
Semantic Web Basics
• The triple: {subject-predicate-object}
Interferometer is-a optical instrument
Optical instrument has focal length
An ontology is a representation of this knowledge
• W3C is the primary (but not sole) governing organization for
languages, specifications, best practices, et c.
– RDF - Resource Description Framework
– OWL 1.0 - Web Ontology Language (OWL 1.1 on the way)
• Encode the knowledge in triples in a triple-store; software is
built to traverse the semantic network, which can then be queried or
reasoned upon
• Put semantics between/ in your interfaces, i.e. between layers
and components in your architecture, i.e. between ‘users’ and
‘information’ to mediate the exchange
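The triple idea can be illustrated with a toy in-memory store built from the two example statements on this slide. This is a sketch only: a production system would use RDF/OWL tooling and a real reasoner, and the `TripleStore` class and its method names are invented for the example.

```python
class TripleStore:
    """A toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the given pattern; None acts as a wildcard."""
        return [
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]

    def ancestors(self, node):
        """Follow is-a links transitively (a very small piece of reasoning)."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for _, _, parent in self.match(s=current, p="is-a"):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def properties(self, node):
        """Properties stated for the node or inherited from its ancestors."""
        kinds = {node} | self.ancestors(node)
        return {o for kind in kinds for _, _, o in self.match(s=kind, p="has")}


store = TripleStore()
store.add("Interferometer", "is-a", "OpticalInstrument")
store.add("OpticalInstrument", "has", "FocalLength")

if __name__ == "__main__":
    # Nothing states directly that an interferometer has a focal length;
    # it follows from traversing the is-a link.
    print(store.properties("Interferometer"))
```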
Semantic Web Benefits
• Unified/ abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the
number of selections from eight to three
• Generates only syntactically correct queries: which could not always be
ensured in previous implementations without semantics
• Semantic query support: by using background ontologies and a
reasoner, our application has the opportunity to expose only coherent
queries (portal and services)
• Semantic integration: in the past users had to remember (and maintain
codes) to account for numerous different ways to combine and plot the
data whereas now semantic mediation provides the level of sensible data
integration required, now exposed as smart web services
– understanding of coordinate systems, relationships, data synthesis,
transformations, et c.
– returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional
research associates and those from outside the fields)
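Exposing only coherent queries can be sketched as narrowing the remaining choices after each selection. The instrument-to-parameter relation below is hypothetical and far smaller than a real ontology; the function names are invented for the example.

```python
# Hypothetical "instrument measures parameter" relation, illustrative only.
MEASURES = {
    "FabryPerot": {"NeutralTemperature", "NeutralWind"},
    "IncoherentScatterRadar": {"ElectronDensity", "IonTemperature"},
}


def selectable_parameters(instrument=None):
    """Parameters a user may still pick, given the instrument already chosen."""
    if instrument is None:
        return sorted(set().union(*MEASURES.values()))
    return sorted(MEASURES.get(instrument, set()))


def selectable_instruments(parameter=None):
    """Instruments a user may still pick, given the parameter already chosen."""
    return sorted(
        name for name, params in MEASURES.items()
        if parameter is None or parameter in params
    )


if __name__ == "__main__":
    print(selectable_parameters("FabryPerot"))
    print(selectable_instruments("ElectronDensity"))
```

Because incoherent combinations are never offered, the user makes fewer selections and cannot build a query that has no possible answer.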
Example 1: Registration of
Volcanic Data
Location Codes:
• U - Above the 180° turn at
Holei Pali (upper Chain of
Craters Road)
• L - Below Holei Pali (lower
Chain of Craters Road)
• UL - Individual traverses
were made both above and
below the 180° turn at Holei
Pali
• H - Highway 11
SO2 Emission from Kilauea east rift zone vehicle-based (Source: HVO)
Abbreviations: t/d=metric tonne (1000 kg)/day,
SD=standard deviation, WS=wind speed, WD=wind
direction east of true north, N=number of traverses
Registering Volcanic Data (1)
Registering Volcanic Data (2)
• No explicit lat/long data
• Volcano identified by name
• Volcano ontology framework will link
name to location
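Linking a volcano name to a location can be sketched as a lookup applied at registration time. The coordinates are approximate, and the record fields, the emission value and the `add_location` function are assumptions for illustration; the actual volcano ontology is not shown.

```python
# Approximate summit coordinates (degrees north, degrees east).
VOLCANO_LOCATIONS = {
    "Kilauea": (19.42, -155.29),
}


def add_location(record):
    """Return a copy of the record with lat/lon resolved from the volcano name.

    Raises KeyError if the name is not known to the lookup.
    """
    lat, lon = VOLCANO_LOCATIONS[record["volcano"]]
    return {**record, "lat": lat, "lon": lon}


if __name__ == "__main__":
    # Invented measurement record with no explicit lat/long.
    raw = {"volcano": "Kilauea", "so2_t_per_day": 1500.0, "location_code": "U"}
    print(add_location(raw))
```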
Example 2: Registration of
Atmospheric Data
Satellite data for SO2
emissions
Abbreviation: SCD: Slant Column
Density (in Dobson Unit (DU))
Registering Atmospheric Data (1)
SAM Project Objectives
S. Graves, R. Ramachandran
• To create a prototype Semantic Analysis and
Mining framework (SAM) comprising:
– Data mining and knowledge extraction web services
– Linked ontologies describing the mining services,
data and the problem domain
– Web-based client
• To allow users to discover and explore existing
data and services, compose workflows for
mining and invoke these workflows.
– Semantic search
– Automated web service invocation
– Automated web service composition
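Automated service composition can be sketched as a search over services described by their input and output types. The service names, data types and the `compose` function are hypothetical; SAM's actual ontologies and composition method are not described in these slides, and breadth-first search is simply one straightforward choice.

```python
from collections import deque

# Hypothetical service descriptions: name -> (input type, output type).
SERVICES = {
    "subset": ("SwathData", "RegionalData"),
    "extract_features": ("RegionalData", "FeatureVectors"),
    "cluster": ("FeatureVectors", "Clusters"),
}


def compose(start, goal):
    """Find a shortest chain of services turning `start` data into `goal` data.

    Returns the list of service names in order, or None if no chain exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, path = queue.popleft()
        if current == goal:
            return path
        for name, (needs, produces) in SERVICES.items():
            if needs == current and produces not in seen:
                seen.add(produces)
                queue.append((produces, path + [name]))
    return None


if __name__ == "__main__":
    print(compose("SwathData", "Clusters"))
```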
Data Mining Ontology: Design
Courtesy: R. Ramachandran
Data Mining Ontology: Snapshot
Courtesy: R. Ramachandran
The Information Era: Interoperability
Modern information and communications
technologies are creating an
“interoperable” information era in which
ready access to data and information can
be truly universal. Open access to data
and services enables us to meet the new
challenges of understanding the Earth and
its space environment as a complex
system:
• managing and accessing large data sets
• higher space/time resolution capabilities
• rapid response requirements
• data assimilation into models
• crossing disciplinary boundaries.
Virtual Observatories
• Conceptual examples:
• In-situ: Virtual measurements
– Related measurements
• Remote sensing: Virtual, integrative
measurements
– Data integration
• Managing virtual data products/ sets
Virtual Solar Terrestrial Observatory
• A distributed, scalable education and research
environment for searching, integrating, and analyzing
observational, experimental, and model databases.
• Subject matter covers the fields of solar, solar-terrestrial
and space physics
• Provides virtual access to specific data, model, tool and
material archives containing items from a variety of
space- and ground-based instruments and experiments,
as well as individual and community modeling and
software efforts bridging research and educational use
• 3 year NSF-funded (OCI/SCI) project - completed
• Several follow-on projects
Problem definition
• Data is coming in faster, in greater volumes and outstripping our
ability to perform adequate quality control
• Data is being used in new ways and we frequently do not have
sufficient information on what happened to the data along the
processing stages to determine if it is suitable for a use we did not
envision
• We often fail to capture, represent and propagate manually
generated information that needs to go with the data flows
• Each time we develop a new instrument, we develop a new data
ingest procedure and collect different metadata and organize it
differently. It is then hard to use with previous projects
• The task of event determination and feature classification is onerous
and we don't do it until after we get the data
Building blocks
• Data formats and metadata: IAU standard FITS, with
SOHO keyword convention, JPEG, GIF
• Ontologies: OWL-DL and RDF
• The Proof Markup Language (PML) provides an interlingua
for capturing the information agents need to understand
results and to justify why they should believe the results.
• The Inference Web toolkit provides a suite of tools for
manipulating, presenting, summarizing, analyzing, and
searching PML, so that end users can understand information
and its derivation, thereby facilitating trust in and reuse of
information.
• Capturing semantics of data quality, event, and feature
detection within suitable community ontology packages
(SWEET, VSTO)