Data Grid - RCDL 2002


The GRID Adventures:
SDSC's Storage Resource Broker
and Web Services
in Digital Library Applications
Arcot Rajasekar, Reagan Moore, Bertram
Ludäscher, Ilya Zaslavsky
[email protected]
San Diego Supercomputer Center
University of California, San Diego
Data and Knowledge Systems
Staff
• Reagan Moore
• Chaitan Baru
• Data Mining Lab (Tony Fountain)
• Advanced Query Processing Lab (Amarnath Gupta)
• Knowledge-Based Integration Lab (Bertram Ludäscher)
• Data Grid Lab (Arcot Rajasekar)
• Spatial Information Systems Lab (Ilya Zaslavsky)
+ 2-3 programmers in each lab, + graduate and undergraduate students
Now: connecting research with production databases and data grid
solutions
RCDL’02, Dubna, October 15-17 2002
Overview
• Intro
– SDSC and NPACI
• Part I: technologies
– What is Data Grid
– Data, Information, and Knowledge Infrastructures at SDSC/DICE
– SDSC Storage Resource Broker, with examples
– MIX (Mediation of Information Using XML), and Knowledge-Based Mediation
• Part II: case studies
– BIRN: the First Operational Data Grid
– Web Services Demos
– Persistent Archives at SDSC
• Summary
A Distributed National Laboratory for
Computational Science and Engineering
1st Teraflops System for US Academia
• 1 TFLOPs IBM SP
– 144 8-processor compute nodes
– 12 2-processor service nodes
– 1,176 Power3 processors at 222 MHz
– Initially > 640 GB memory (4 GB/node), upgrade to > 1 TB later
– 6.8 TB switch-attached disk storage
• Largest SP with 8-way nodes
• High-performance access to HPSS
Bioinformatics Infrastructure for
Large-Scale Analyses
• Next-generation tools for
accessing, manipulating, and
analyzing biological data
– Biology, Stanford University
– DICE, SDSC
• Analysis of Protein Data Bank,
GenBank and other databases
• Accelerate key discoveries for
health and medicine
• Supporting and leveraging new
data grid projects, such as
BIRN in biology
SRB
Part I: technologies
What is Data Grid
Data, Information, and Knowledge
Infrastructures at SDSC/DICE
SDSC Storage Resource Broker
MIX (Mediation of Information Using XML),
and Knowledge-Based Mediation
What are Data Grids?
• Power Grid Analogy
– Multiple power generators
– Complex transmission networks
with switching
– Simple Usage Interface – plug and play
– Guaranteed Supply - Meeting of
demands (peak and lull)
– Complex cost function
• More than one data provider
• Best movement of data across computer networks
• Seamless Access to Data with good ‘Finding Aids’
• Guarantee of Data Access
• Access Control, Quotas & Complex Usage Costing
Data Grids
Data Grid - linking multiple data collections
Separate name spaces
Separate schema
Separate administration domains
Heterogeneous database instances
Database A
Data grid
Database B
The data grid is itself a collection that provides
mechanisms to hide latency and manage semantics
Federated Digital Libraries
Virtual Data Grid - linking multiple data collections
Ability to execute processes to recreate derived data
Database A
Services
Virtual Data Grid
Database B
Services
The virtual data grid integrates data grid and digital library
technology to manage processes
Why Data Grids: Data Handling Problems
• Large Datasets; Large Number of Datasets; Scaling
• Distributed, Heterogeneous Storage
• Virtualization & Transparency
• Collaboration, Access Control, Authentication, Security
• Replication, Coherency, Synchronization
• Fault Tolerance and Load Distribution
• Scheduling, Caching & Data Placements
• Data Migration over Time & Space
• Data/Collection Curation
• Uniform Name Space
• Handling Legacy Data and Data/Resource Evolution
• User-friendly Interfaces – foster collaborations
Why Data Grids: Metadata Problems
• Types of Metadata – Relational to XML to unstructured
• Standardized to User-defined Metadata
• Large Number of Attributes
• Large Size; Scaling
• Federation - integration over space
• Evolution - integration over time
• Evolution - integration over contexts
• Discovery and Search
• Presentation – user friendly
• Extraction and Maintenance
DAKS Data Management Hierarchy
• Model-Based Information Management
– Rule-based ontology mapping, conceptual-level mediation - CMIX
• Information Mediation
– Data federation across multiple libraries - MIX
• Digital Library
– Interoperable services for information discovery and presentation - SDLIP
• Data Collection
– Tools for managing data set collections on databases - MCAT
• Data Handling
– Systems for data retrieval from remote storage - SRB
• Persistent Archives
– Storage of data collections for 30+ years
SRB as a Solution
• The Storage Resource Broker is middleware
• It virtualizes resource access
• It mediates access to distributed heterogeneous resources
• It uses a MetaCATalog to facilitate the brokering
• It integrates data and metadata
Application
MCAT
SRB Server
HRM DB2, Oracle, Illustra, ObjectStore HPSS, ADSM, UniTree UNIX, NTFS, HTTP, FTP
Distributed Storage Resources
(database systems, archival storage systems, file systems, ftp, http, …)
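The brokering idea on this slide can be sketched in a few lines of Python. This is a toy model, not the SRB API: `Replica`, `MetaCatalog`, `Broker`, `make_reader`, the resource names, and the paths are all hypothetical, and the "storage systems" are in-memory dictionaries. It only illustrates an application asking for a logical name while the catalog decides which physical replica and driver serve the request.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # hypothetical resource name
    driver: str         # storage type, e.g. "hpss" or "unix"
    physical_path: str  # location as the storage system itself knows it


@dataclass
class MetaCatalog:
    """Toy stand-in for MCAT: maps logical names to physical replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, replica):
        self.entries.setdefault(logical_name, []).append(replica)

    def replicas(self, logical_name):
        return list(self.entries.get(logical_name, []))


class Broker:
    """Toy stand-in for an SRB server: tries replicas until one can be read."""

    def __init__(self, catalog, drivers):
        self.catalog = catalog
        self.drivers = drivers  # driver name -> function(physical_path) -> bytes

    def read(self, logical_name):
        for replica in self.catalog.replicas(logical_name):
            reader = self.drivers.get(replica.driver)
            if reader is None:
                continue  # no driver for this storage type
            try:
                return reader(replica.physical_path)
            except OSError:
                continue  # resource unavailable, try the next replica
        raise FileNotFoundError(logical_name)


def make_reader(store):
    """Build a driver over an in-memory dict that simulates a storage system."""
    def read(path):
        if path not in store:
            raise OSError(f"{path} not available")
        return store[path]
    return read


archive = {}                                  # simulates an unavailable archive
disk = {"/cache/img001.fits": b"pixels"}      # simulates a disk cache

catalog = MetaCatalog()
catalog.register("/collection/sky/img001", Replica("archive-1", "hpss", "/hpss/img001.fits"))
catalog.register("/collection/sky/img001", Replica("cache-1", "unix", "/cache/img001.fits"))

broker = Broker(catalog, {"hpss": make_reader(archive), "unix": make_reader(disk)})
data = broker.read("/collection/sky/img001")  # served from the disk replica
```

The application never sees `/hpss/...` or `/cache/...`; it only uses the logical name, which is the sense in which resource access is virtualized.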
Solution SRB
SDSC Storage Resource Broker & Meta-data Catalog
[Diagram: SRB architecture, top to bottom]
Application interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Web
SRB server, with MCAT holding Resource, Method, User; User Defined; Dublin Core; and Application Meta-data
Services: Metadata Extraction, Remote Proxies, DataCutter
Archives: HPSS, ADSM, HRM, UniTree, DMF
File Systems: Unix, NT, Mac OSX
Databases: DB2, Oracle, Sybase
SRB Space
[Diagram: clients connect to a network of SRB servers, which front data repositories (DR), digital libraries (DL), and a meta catalog (MC)]
DR - Data Repository
DL - Digital Library
MC - Meta Catalog
MySRB: Web-based Access to the SRB
• Browse in Hierarchical Collections
• Registration of (remote) Legacy Files & Directories
• Registration of SQL Objects
• Registration of URLs
• Data Movement Operations
– Ingest & Re-Ingest, Delete, Unlink
– Replicate, Copy, Move, S-Link
• Access Control Operations
– Read, Write, Own, Curate, Annotate, …
– Ticket-based Access
• Version Control Operations
– Read Lock, Write Lock, Unlock
– Check In, Check Out
Meta data Management in MySRB
• Types of Meta Data
– System-level Metadata
• Size, resource, owner, date, access control, …
– User-defined Meta data
• for data & collections
• <name,value,unit> triples
• No limits in number of metadata
• Support for Collection-level schemas
– Comments, default values, drop-down lists
• Support for Standardized Schemas
– (e.g. Dublin Core)
– Annotations
• Supports textual annotations
• Annotator, date, context also registered
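User-defined metadata as <name,value,unit> triples can be illustrated with a small in-memory store. This is a sketch under assumptions, not MySRB's implementation: `Meta`, `MetadataStore`, the object names, and the attribute names are hypothetical. It shows that any number of triples can be attached to an object and that discovery is a filter over those triples.

```python
from typing import NamedTuple


class Meta(NamedTuple):
    name: str
    value: str
    unit: str


class MetadataStore:
    """Toy store of user-defined <name, value, unit> triples per data object."""

    def __init__(self):
        self._triples = {}  # object name -> list of Meta

    def annotate(self, obj, name, value, unit=""):
        # No fixed schema: any number of triples may be attached to an object.
        self._triples.setdefault(obj, []).append(Meta(name, value, unit))

    def find(self, name, matches):
        """Return objects having a triple called `name` that satisfies `matches`."""
        return sorted(
            obj
            for obj, triples in self._triples.items()
            if any(t.name == name and matches(t) for t in triples)
        )


store = MetadataStore()
store.annotate("img001.fits", "exposure", "7.8", "s")
store.annotate("img001.fits", "band", "J")
store.annotate("img002.fits", "exposure", "1.3", "s")

long_exposures = store.find(
    "exposure", lambda t: t.unit == "s" and float(t.value) > 5
)
```

Keeping the unit next to the value lets a query check it explicitly, which matters once different contributors describe the same attribute in different units.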
SRB Projects
• Digital Libraries
– UCB, UMich, UCSB, Stanford, CDL
– NSF NSDL - UCAR / DLESE
• NASA Information Power Grid
• DOE ASCI Data Visualization Corridor
• Astronomy
– National Virtual Observatory
– 2MASS Project (2 Micron All Sky Survey)
• Particle Physics
– Particle Physics Data Grid (DOE)
– GriPhyN
– SLAC Synchrotron Data Repository
• Medicine
– Visible Embryo (NLM)
• Earth Systems Sciences
– ESIPS
– LTER
• Persistent Archives
– NARA
– LOC
• Neuro Science & Molecular Science
– TeleScience, Brain Images, BIRN
– JCSG (SSRL/SLAC), AfCS, …
Large Data Project Examples
• Astronomy:
– National Virtual Observatory
• Integrate 18 sky surveys (ITR proposal)
– 2MASS Project (2 Micron All Sky Survey)
• 10 TB; 5 million files
• Co-locate Images for Spatial Access
• Data Mining across entire collection
• Replicate to Caltech HPSS
• Particle Physics:
– Particle Physics Data Grid (DOE)
– GriPhyN (NSF ITR project)
• CERN LHC 1 PB/yr (1 billion objects)
• Multi-Lab integration
– SLAC Synchrotron Data Repository
National Virtual Observatory
Data Grid
[Diagram: National Virtual Observatory data grid layers]
1. Portals and Workbenches
2. Knowledge & Resource Management (concept space)
3. Metadata View and Data View (bulk data analysis, catalog analysis), over Standard APIs and Protocols
4. Grid: security, caching, replication, backup, scheduling
5. Information discovery, metadata delivery, data discovery, data delivery, over a Standard Metadata format, Data model, Wire format
6. Catalog Mediator and Data mediator (catalog/image specific access)
7. Compute Resources, Derived Collections, Catalogs, Data Archives
Digital Sky Data Ingestion
[Diagram: input tapes from telescopes feed the SRB on a SUN E10K with an 800 GB data cache in front of HPSS (10 TB) at SDSC; the star catalog is kept in Informix on a SUN at IPAC Caltech]
Digital Sky Data Ingestion
• The input data was on tapes in a random (temporal…) order.
• Ingestion took nearly 1.5 years, almost continuously (24*7*365), over 4 parallel streams (4 MB/sec per stream).
• Total 10+ TB: 5 million 2 MB images in 147,000 containers.
• SRB performed a spatial sort on data insertion (scientists view/analyze data by neighborhood). The disk cache (800 GB) for the HPSS containers was utilized.
• Ingestion speed was limited by input tape reads
– Only two tapes per day could be read
• The workflow incorporated persistent features to deal with network outages and other failures.
• The C API was utilized for fine-grained control and to manipulate and insert metadata into the Informix catalog at IPAC Caltech.
– http://www.ipac.caltech.edu/2mass
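The figures on this slide can be cross-checked with a short calculation. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and treats the quoted numbers as exact, which the slide does not claim; the variable names are illustrative.

```python
MB = 10**6
TB = 10**12

images = 5_000_000          # images ingested
image_size = 2 * MB         # bytes per image
containers = 147_000        # HPSS containers
streams = 4                 # parallel ingestion streams
stream_rate = 4 * MB        # bytes per second per stream

total_bytes = images * image_size                      # collection size
images_per_container = images / containers             # average packing
container_size_mb = total_bytes / containers / MB      # average container size
aggregate_rate = streams * stream_rate                 # bytes per second overall
transfer_days = total_bytes / aggregate_rate / 86_400  # pure transfer time

print(f"total: {total_bytes / TB:.1f} TB")
print(f"images per container: {images_per_container:.1f}")
print(f"container size: {container_size_mb:.1f} MB")
print(f"transfer time at full rate: {transfer_days:.1f} days")
```

Under these assumptions the stream bandwidth alone would move the collection in about a week, which is consistent with the slide's point that the 1.5-year duration was set by tape reads rather than by the network or SRB.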
DigSky Conclusion
• SRB can handle large number of files
• Metadata access is still less than ½ sec delay
• Replication of large collections
• Single command for geographical replication
• On-the-fly sorting (out-of-tape sorting)
• Availability of data otherwise not possible
• Near-line access to 5 million files (10 TB)
• Successfully used in web-access & large scale analysis (daily)
Demonstration
• Go to MySRB
• For Additional Information:
http://www.npaci.edu/dice/srb
[email protected]
MIX:
Mediation of Information
using XML
Mediation of Information using XML (MIX)
[Diagram: an XML Query is posed against XML View Document(s) exported by each source: a wrapped Data Source (e.g. home ads), a Native XML Database, and a wrapped Legacy Source]
Export:
• Schema & Metadata (DTD, RDF, …)
• Capabilities
A Typical Mediation Scenario
[Diagram: the User Interface sends a Query to the Mediator (integrated views over heterogeneous sources) and receives Results; the Mediator sends query “fragments” to Wrappers over an SQL Database, a GIS, and HTML sources; each Wrapper converts the incoming query and the outgoing data]
The Home Buyer Scenario
Web
Client
XMAS Query
Results (XML)
MIXm
Mediator
“Homes” mediator
Data
“Neighborhood” mediator
Data
Data
National test scores
“Schools” mediator
N’hood info Community info
(demographics) (name, ZIP)
www.sandag.cog.ca.us
Crime info
(ZIP, stats)
www.sannet.gov
Home info
(real estate)
www.realtor.com
Schools info
(address, size)
www.asd.com
School district
info
(scores,spending,ZIP)
www.homeadvisor.msn.com
Home Buyer GUI
An XML Query (XMAS)
Query body (patterns matched against the sources):
$C:<*.condo>
  <address zip=$Z/>
</condo> AT www.condo.com
AND
$S:<*.school type=elementary>
  <address zip=$Z/>
</school> AT schools.org
Query head (answer construction):
<folder>
  $C
  $S for $S
</folder> for $C
Source document (fragment):
...
<RealEstateAgent>
  <name>J. Smith</name>
  <condos>
    <condo>
      <address ... zip=92037>
      <price>$170k OBO</price>
      <bedrooms>2</bedrooms>
    </condo>
  </condos>
</RealEstateAgent>
Answer document (fragment):
<condosAndSchools>
  <folder>
    <condo>
      <address ... zip=92037>
      <price>$170k OBO</price>
      <bedrooms>2</bedrooms>
    </condo>
    <school>
      <name>La Jolla High</name>
      <address … zip=92037>
    </school>
    <school>…</school>
  </folder>
Home Buyer GUI (Answers)
Generated
XMAS Query
XML Answer
Document
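The condo/school XMAS query joins two sources on ZIP code and groups the matching schools under each condo. A rough Python equivalent is sketched below with `xml.etree.ElementTree`. It is not the MIX mediator or an XMAS engine: the two sample documents, the school names, and the function name `condos_and_schools` are made up for illustration, and both sources are assumed to be already fetched as XML strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical source documents standing in for the two wrapped sources.
CONDOS_XML = """
<RealEstateAgent>
  <name>J. Smith</name>
  <condos>
    <condo><address zip="92037"/><price>$170k OBO</price><bedrooms>2</bedrooms></condo>
    <condo><address zip="92101"/><price>$210k</price><bedrooms>1</bedrooms></condo>
  </condos>
</RealEstateAgent>
"""

SCHOOLS_XML = """
<schools>
  <school type="elementary"><name>School A</name><address zip="92037"/></school>
  <school type="elementary"><name>School B</name><address zip="92037"/></school>
  <school type="high"><name>School C</name><address zip="92037"/></school>
</schools>
"""


def condos_and_schools(condos_xml, schools_xml):
    """Group elementary schools under each condo that shares their ZIP code."""
    # "*.condo" in the query means a condo element at any depth.
    condos = list(ET.fromstring(condos_xml).iter("condo"))
    schools = [
        s for s in ET.fromstring(schools_xml).iter("school")
        if s.get("type") == "elementary"
    ]
    answer = ET.Element("condosAndSchools")
    for condo in condos:
        zip_code = condo.find("address").get("zip")
        matches = [s for s in schools if s.find("address").get("zip") == zip_code]
        if not matches:
            continue  # the join needs at least one school binding per condo
        folder = ET.SubElement(answer, "folder")
        folder.append(condo)
        folder.extend(matches)
    return answer


answer = condos_and_schools(CONDOS_XML, SCHOOLS_XML)
answer_text = ET.tostring(answer, encoding="unicode")
```

In the mediator this join is expressed declaratively, which leaves the engine free to decide which source to query first and how to push selections down to the wrappers; the loop above fixes one evaluation order.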
Our Research
• In what query language does the user pose a query?
• How does the query engine of the mediator rewrite the query?
• How does the mediator combine/restructure/post-process partial results?
• What data model and query transformation scheme should the wrappers use for different source types?
[Diagram: User Query (XMAS) goes to the Mediator, which exchanges XML with wrappers W1, W2, W3 over sources S1, S2, S3]
For details: http://www.npaci.edu/DICE/MIX
New MIX Challenges from Scientific Applications
• Complex Data
– SDSC’s Scientific Data Applications (current/planned, e.g.
Neurosciences: NCMIR, NIH BIRN, Earth sciences: GEON, GeoGrid, ...)
show that syntactic/structural integration is insufficient for ...
Complex Multiple-World Mediation Problems:
– complex, disjoint, seemingly unrelated data
– “hidden semantics” in complex, indirect relationships
=> Semantic (aka Model/Knowledge-Based) Mediation
– lift mediation to the level of conceptual models (CMs)
– use domain experts’ knowledge formalized as rules over CMs
=> Specialized Extensions
• temporal, geospatial, statistical, DQ/accuracy... operations
=> Extend Mediation Scope and Power via Deductive Rules
INFORMATION
MEDIATION WITH
DOMAIN MAPS
An Unresolved Challenge
How do nerve cells change as we learn and remember?
A multi-resolution study of the rat hippocampus at Boston University
Dendritic spine morphology and its variations
density = #spines/length
Reconstructions from the Synapse Lab, Boston University
Hypothesis
• Distribution of spines changes
with learning
• Each spine type performs a different
task in information transmission
Observations
• Spine density, size, shape and PSD vary with maturity
• Spine neck geometry controls peak Calcium amount
• Calcium flow parameters depend on the different subclasses of spines
Next Questions
• Does anyone else have corroborative evidence for these observations?
• Are these observations true in other comparable parts of the brain?
• Is this consistent with the distribution of Calcium-binding proteins?
Example for Formalizing Domain Knowledge:
Domain Map for SYNAPSE and NCMIR
A domain map comprises
• Description Logic facts ...
- concepts ("classes")
- roles ("associations")
• derived properties ...
• ... expressed as logic rules
- (e.g. F-logic)
Purkinje cells and Pyramidal cells have dendrites
that have higher-order branches that contain spines.
Dendritic spines are ion (calcium) regulating components.
Spines have ion binding proteins. Neurotransmission
involves ionic activity (release). Ion-binding proteins
control ion activity (propagation) in a cell. Ion-regulating
components of cells affect ionic activity (release).
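The statements above can be read as a small graph of concepts connected by roles. The sketch below encodes them as triples and answers one derived question by plain graph traversal. It is not description logic or F-logic reasoning: the concept identifiers, the role names, and the functions `parts_of` and `influences` are hypothetical simplifications of the slide's text.

```python
# (subject concept, role, object concept), paraphrasing the slide's statements.
DOMAIN_MAP = [
    ("purkinje_cell", "has", "dendrite"),
    ("pyramidal_cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "contains", "spine"),
    ("spine", "isa", "ion_regulating_component"),
    ("spine", "has", "ion_binding_protein"),
    ("neurotransmission", "involves", "ion_activity"),
    ("ion_binding_protein", "controls", "ion_activity"),
    ("ion_regulating_component", "affects", "ion_activity"),
]

PART_ROLES = {"has", "contains"}         # roles followed when collecting parts
EFFECT_ROLES = {"affects", "controls"}   # roles that count as influencing


def parts_of(concept, facts=DOMAIN_MAP):
    """Everything reachable from `concept` through has/contains links."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for subject, role, obj in facts:
            if subject == current and role in PART_ROLES and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


def influences(concept, target, facts=DOMAIN_MAP):
    """True if the concept, or any of its parts, affects or controls `target`."""
    for part in parts_of(concept, facts) | {concept}:
        kinds = {part} | {o for s, r, o in facts if s == part and r == "isa"}
        if any(s in kinds and r in EFFECT_ROLES and o == target for s, r, o in facts):
            return True
    return False


purkinje_parts = parts_of("purkinje_cell")
purkinje_affects_ions = influences("purkinje_cell", "ion_activity")
```

A derived property such as "cells whose parts regulate ions" is the kind of statement the slide expresses as a logic rule; a rule engine would derive it from the facts instead of hard-coding the traversal.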
[Diagram: domain expert knowledge, the resulting domain map, and the equivalent Description Logic facts]
Extended Mediator Architecture for Semantic Mediation
[Diagram: extended mediator architecture]
• USER/Client works against the CM (Integrated View)
• Mediator Engine: Domain Map (DM), Integrated View Definition (IVD), XSB Engine with FL rule processing, LP rule processing, and graph processing
• GCM layer: CM S1, CM S2, CM S3, attached via CM Plug-In and Logic API (capabilities)
• CM Queries & Results are exchanged in XML
• Each source S1, S2, S3 sits behind an XML-Wrapper and a CM-Wrapper
Comparison & Summary: Semantic Mediation
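Compared: (Complex) Single World / Simple Multiple World versus Complex Multiple World
• Integration target
– Single / simple multiple world: global schema (common / shared)
– Complex multiple world: 1..n shared domain maps
• Example scenario
– Single / simple multiple world: suppliers’ catalogs / home buyer
– Complex multiple world: complex scientific data (neuroscience, geoscience, …)
• Schema level overlap
– Single / simple multiple world: large / small
– Complex multiple world: none … small
• Instance level overlap
– Single / simple multiple world: large / none
– Complex multiple world: none
• Source correlation
– Single / simple multiple world: direct, instance / schema level
– Complex multiple world: indirect, conceptual (knowledge) level
• Techniques
– Single / simple multiple world: schema transformations, schema integration, “structural” integration
– Complex multiple world: domain maps, formalized domain knowledge (“semantic bridges”) => model-based (“semantic”) mediation
• Integration languages, expressiveness
– Single / simple multiple world: relational, semistructured, queries & transformations (e.g., SQL, XQuery, XSLT)
– Complex multiple world: conceptual (description logics), object-oriented, deductive features (e.g., GCM, F-logic)
• Integrators
– Single / simple multiple world: DB expert
– Complex multiple world: domain expert + KRDB expert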
Part II: case studies
BIRN
Web Services
Persistent Archives
NIH is Funding
a Brain Imaging Federated Repository
Biomedical Informatics
Research Network
(BIRN)
NIH Plans to Expand
to Other Organs
and Many Laboratories
Part of the UCSD CRBS
National Partnership for Advanced Computational Infrastructure
Center for Research on Biological Structure
Infrastructure for Sharing Neuroscience Data
SOURCES:
• NCMIR, U.C. San Diego
• Caltech Neuroimaging
• Center for Imaging Science, Johns Hopkins
• Center for Computational Biology, Montana State
• Laboratory of Neuro Imaging (LONI), UCLA
• Computational Neurobiology Laboratory, Salk Inst.
• Van Essen Laboratory, Washington University
• …
Data Management Infrastructure (DAKS/NPACI)
• MIX: Mediation in XML
• MCAT: information discovery
• SRB: data handling
• HPSS: storage
• ...
Surface atlas, Van Essen Lab
Knowledge-based
GRID infrastructure
? ?
?
?
Data Management Infrastructure (“Data Grid”)
GTOMO, Telemicroscopy, Globus, SRB/MCAT, HPSS
stereotaxic atlas LONI
MCell, CNL, Salk
NCMIR, UCSD
CCB, Montana SU
The Need for Semantic Integration
Cross-source queries
What is the cerebellar distribution of rat proteins with more than 70%
homology with human NCS-1? Any structure specificity?
How about other rodents?
Cross-source
relationships are
modeled
??? Integrated
View Definition ???
Wrapper
Semantic (knowledge-based) mediation
services
??? Integrated
View ???
???Mediator ???
Wrapper
Data, relationships,
constraints are
modeled (CMs)
Wrapper
Wrapper
Web
protein localization
morphometry
neurotransmission
CaBP, Expasy
Hidden Semantics: Protein Localization
[Figure labels: Purkinje Cell layer of Cerebellar Cortex; Molecular layer of Cerebellar Cortex; Fragment of dendrite]
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>
Mediation Services:
Source Registration (System Issues)
Source
Data Type
Result Delivery
table tree
Query Capability
Access Protocol
ARC SQL XML DOOD
QL
file
Tuple-at-a-time
Stream
SRB HTTP JDBC
Set-at-a-time
Binary for Viewer
Selections
SPJ
Mediation Services:
Source Registration (Semantics Issues)
• Domain Map Registration
– provide concept space/ontology
• … as a private object (“myANATOM”)
• … merge with others (give “semantic bridges”)
• … and check for conflicts
• Conceptual Model Registration
– schema: classes, associations, attributes
– domain constraints
– “put data into context” (linking data to the domain map)
Next
Mediation Services:
Integrated View Definition
DERIVE
  protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
FROM
  % from PROLAB
  I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
      anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}] ,
  % from ANATOM
  NAE:neuro_anatomic_entity[name->Anatom; located_in->>{Brain_region}],
  AS..segments..features[name->Feature_name; value->Value].
• provided by the domain expert and mediation engineer
• declarative language (here: Frame-logic)
Mediation Services: Semantic Annotation Tools
line drawing → annotation → (spatial) database for mediation
Part II: case studies
Web Services
Find school districts in San Diego where computer ownership rates among residents are over 80%
Web Services Demo 1
Clients: AxioMap, Polexis
Java Servlet
XML Mediator (Enosys)
XML query (XCQL)
Spatial Mediator
XML
WSDL
WSDL
Web Server
SOAP
Sociology Web Server
Workbench
SOAP
Java
Servlets
Oracle
DBMS
San Diego
Digital
Divide
Survey
Java
Servlets
Boundaries of
municipalities
and school
districts
Oracle
DBMS
Web Services Demo 2
Web spatial source,
EPA data
ArcObjects spatial service
Spatial Mediator
Java Servlet
XML
WSDL
Web Server
SOAP
ESRI ArcObjects
Coordinate
Conversion
Service
XML Wrapper
XML Wrapper
EPA Envirofacts Website
Local Pollution Data
Web Services Demo 3
GIS source,
WSDL: for spatial analysis,
survey data analysis,
DBMS query
UCR/FBI data
Process flow across
Web services
Counties
crossed by an
interstate
Counties with
decrease in
homicide
rates over …
%,
1993-99
Counties with
decrease in victims of
firearms over … %,
1993-99
UCR
summaries
,
Oracle
WSDL
WSDL
Victim
data,
SWB
Spatial Query,
ArcIMS/
ArcObjects
Part II: case studies
Persistent Archives
Persistent Archives
• NARA project
• Store & Recover Data after 400 years
• 5 million emails
• 33 million web pages
• 90 million personnel records
Persistent Archives
• Challenges: each of the software and hardware systems may
become obsolete
– the storage media may degrade
– the storage system may become obsolete
– the database backups may become obsolete, with no way to recover the
collection (structure)
– the digital object formats may become obsolete, with no helper
application that can read them
• Persistent archive is a migration mechanism
– support for automatic migration to new technology; automatic ingestion,
management, access, catalog discovery
• Infrastructure independence
– Non-proprietary formatting -- Collection management -- Data set access
– Authentication -- Presentation
• Persistent archive is an interoperability system
– XML as a (meta-) information markup language
Persistent Archive
Persistent archive
Describe archived data as collections
Describe processes used to create collections
Manage evolution of technology
Database A
(today)
Virtual Data Grid
Database A
(tomorrow)
The persistent archive is itself a virtual data grid that provides
mechanisms to manage migration to new technology
Information Hierarchy (Simplest Definitions)
• Data
– digital object, i.e., the object representation as a bit stream
• Information
– any tagged data, where tags are treated as information attributes
– attributes may be tagged data within the digital object, or tagged
data that is associated with the digital object
• Knowledge
– higher-order concepts and relationships between attributes
– relationships can be procedural, temporal, structural, spatial,
functional, ... and described in a Logic formalism (semantic
networks, description logics, conceptual graphs, ...) which is
often rule-based (e.g. Datalog, Frame-Logic)
What Types of Interoperability are Needed?
• Data management (digital objects)
– ability to work with multiple types of storage systems, across
separate administration domains
• Information management (attributes)
– ability to define a collection independent of database choice
– ability to migrate collection onto new databases
• Knowledge management (relationships)
– ability to manage relationships and high-level domain concepts
– ability to map concepts to collection attributes
From XML-Based to Knowledge-Based Archives
• Collection-based archival with XML: save data "as is" plus...
– ... separate content from presentation
– ... tag your data (take a lift in the info hierarchy)
– ... use a self-describing, semistructured data format (XML)
• Knowledge-based archival: now add ...
– ... conceptual level information
– ... integrity constraints
– ... explanations/derivation rules:
• archiving only results y=f(x) vs. archiving the rules/function "f"
(e.g. f = “the Florida procedure”...)
=> employ knowledge representation languages
Knowledge-Based Persistent Archive
Knowledge
Repository for
Rules
Access
Services
Rules - KQL
Knowledge
Relationships
Between
Concepts
Management
XTM DTD
Ingest
Services
Knowledge or
Topic-Based
Query / Browse
Attributes
Semantics
Information
Repository
SDLIP
Information
XML DTD
(Topic Maps / Model-based Access)
Attribute- based
Query
Fields
Containers
Folders
Storage
(Replicas,
Persistent IDs)
Grids
Data
MCAT/HDF
(Data Handling System - SRB / FTP / HTTP)
Feature-based
Query
Knowledge-Based Archival: Senate Example
Data provider says:
“Please archive all records of legislative activities of the 106th Senate!”
Integrity constraints, e.g.:
(1) {senators_with_file} = UNION (sponsor, cosponsors, submitted_by)
(2) {senators} = {sponsors} = {co-sponsors}
Violation:
– the rhs is a SUPERSET of the lhs !
Exceptions:
– (Chafee, John), (Gramm, Phil), (Miller, Zell)
(Possible) Explanations:
– senators who joined (Zell), passed away (Chafee), were forgotten (Gramm)!?
Checking ICs:
IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))
IF condition THEN action
Action = LOG, WARN, ABORT, ...
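The IF condition THEN action pattern can be sketched as a set comparison. The code below is an illustration, not the archive's rule engine: the function name `check_senators`, the placeholder names "Senator A" and "Senator B", and the assignment of the three exceptions to particular roles are assumptions. It checks constraint (1) and applies one of the actions listed above.

```python
def check_senators(senators_with_file, sponsors, cosponsors, submitted_by, action="LOG"):
    """Check that every person named in a legislative role has a senator file.

    Returns the exception log. With action="WARN" the violations are also
    printed; with action="ABORT" a ValueError is raised instead.
    """
    exception_log = []
    mentioned = sponsors | cosponsors | submitted_by
    # IF role(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))
    for name in sorted(mentioned - senators_with_file):
        exception_log.append(("missing_senator_info", name))

    if exception_log and action == "ABORT":
        raise ValueError(f"{len(exception_log)} integrity violation(s)")
    if exception_log and action == "WARN":
        for kind, name in exception_log:
            print(f"warning: {kind}({name})")
    return exception_log


# Made-up collection contents; only the three exception names come from the slide.
senators_with_file = {"Senator A", "Senator B"}
sponsors = {"Senator A", "Chafee, John"}
cosponsors = {"Senator B", "Gramm, Phil"}
submitted_by = {"Miller, Zell"}

log = check_senators(senators_with_file, sponsors, cosponsors, submitted_by)
```

Archiving the rule together with the data means a future reader can rerun the check and see why the three names were flagged, instead of only inheriting the flagged result.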
NARA Herbicides
Collection:
Introduction
The Herbicides Collection - input
From EBCDIC tapes:
6507213207565
6507243207565
6507253207565
6507263207565
6507273207565
6507283207565
6507293207565
6508022022365
AS890255
6508022022365
AS940140
6508042022365
AS925205
6508042022365
AS970065
6508062022365
BS290320
6508062022365
BS275298
6508073207565
YT080110
6508073207565
YT110060
6508113207565
6508123207565
6508151022465
YD350155
6508151022465
YD450150
260404040
260606060
260606060
260606060
260606060
260505050
260404040
060202020
040000{0000D0000000{048{
060000{0000D0000000{072{
060000{0000D0000000{072{
060000{0000D0000000{072{
060000{0000D0000000{072{
050000{0000D0000000{060{
040000{0000D0000000{048{
010000{0000C0000000{012{
000{000{
{0000000{0000000{0000000{0000000{
{0000000{0000000{0000000{0000000{
{0000000{0000000{0000000{0000000{
{0000000{0000000{0000000{0000000{
{0000000{0000000{0000000{0000000{
{0000000{0000000{0000000{0000000{
{0000000{0000000{0000000{0000000{
{0000000{0000000{0000000{0000000{1A
1B
000{000{
060202020 006000{0000C0000000{007B {0000000{0000000{0000000{0000000{1A
000{000{
1B
000{000{
060202020 004000{0000C0000000{004H {0000000{0000000{0000000{0000000{1A
000{000{
1B
000{000{
260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{1A
000{000{
1B
000{000{
260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{
260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{
020202020 008000{0000C0000000{009F {0000000{0000000{0000000{0000000{1A
000{000{
1B
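The `{` characters and trailing letters in this dump look like zoned-decimal sign overpunch, which is what signed numeric fields typically become when EBCDIC zoned data is translated character by character. That reading is an assumption: the slides do not give the record layout, so the sketch below decodes individual tokens only and says nothing about which field is which. The function name `decode_zoned` is hypothetical.

```python
# Standard overpunch convention: the last character carries digit and sign.
POSITIVE = "{ABCDEFGHI"  # last digit 0-9, positive sign
NEGATIVE = "}JKLMNOPQR"  # last digit 0-9, negative sign


def decode_zoned(token):
    """Decode one zoned-decimal token such as '007B' into a signed integer."""
    token = token.strip()
    if not token:
        raise ValueError("empty token")
    body, last = token[:-1], token[-1]
    if last in POSITIVE:
        sign, digit = 1, POSITIVE.index(last)
    elif last in NEGATIVE:
        sign, digit = -1, NEGATIVE.index(last)
    elif last.isdigit():
        sign, digit = 1, int(last)  # unsigned field, no overpunch
    else:
        raise ValueError(f"not a zoned-decimal token: {token!r}")
    if body and not body.isdigit():
        raise ValueError(f"not a zoned-decimal token: {token!r}")
    return sign * int(body + str(digit))


samples = {t: decode_zoned(t) for t in ["007B", "004H", "009F", "024{"]}
```

Any implied decimal point is part of the record layout, not of the encoding, so the decoded integers still need the collection's documentation before they can be interpreted.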
The Herbicides Collection - preservation
Converted to XML:
<YEAR><yearnum>66</yearnum>
<MONTH><monthnum>01</monthnum>
<DATE><datenum>01</datenum>
<MISSION><num>206866</num>
<RUN><code>A</code>
<ctz>3</ctz><multi></multi><prov>27</prov>
<aircrafts>
<scheduled>02</scheduled><airborne>02</airborne><productive>02</productive>
</aircrafts>
<agent>O</agent><gal>02000</gal><hits>0</hits>
<aborts>
<maintenance>0</maintenance><weather>0</weather><battle_damage>0</battle_damage><other>0</other>
</aborts>
<type>D</type><area>024</area><rsult></rsult>
<UTM>
<utmid>1A</utmid>
<utm_coor>YS240780</utm_coor>
</UTM>
<UTM>
<utmid>1B</utmid>
<utm_coor>YS290630</utm_coor>
</UTM></RUN>
<RUN><code>B</code>
<ctz>3</ctz><multi></multi><prov>27</prov>
<aircrafts>
<scheduled>02</scheduled><airborne>02</airborne><productive>02</productive>
</aircrafts>
<agent>O</agent><gal>02000</gal><hits>0A</hits>
<aborts>
<maintenance>0</maintenance><weather>0</weather><battle_damage>0</battle_damage><other>0</other>
</aborts>
<type>D</type><area>024</area><rsult></rsult>
MAPPING
From Geography Markup to Rendering
XML encoding of geographic features (such as GML):
<?xml version="1.0" encoding="iso-8859-1"?>
<rs>
<r><name>Horton Plaza</name><URL></URL><labelpos>41.46,77.51</labelpos><c>5076,1540 4986,1540 4895,1539 4803,1539 4715,1539 4622,1539 4534,1538 4534,1641 4534,1745 4534,1856 4622,1856 4711,1856 4800,1856 4893,1855 4984,1855 5075,1854 5075,1749 5076,1646 </c></r>
<r><name>Gaslamp</name><URL></URL><labelpos>44.60,83.00</labelpos><c>5162,1013 5084,1057 5083,1116 5081,1222 5079,1326 5079,1433 5076,1540 5076,1646 5075,1749 5075,1854 5167,1854 5257,1855 5257,1750 5259,1647 5260,1541 5262,1434 5262,1328 5263,1222 5263,1013 </c></r>
...
Rendered as SVG:
<?xml version="1.0"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN"
"http://www.w3c.org/2000/svg10-20000303-stylable" [
<!ENTITY base "fill:#ff0000;stroke:#000000;stroke-width:1;">
]>
<svg width="100%" height="100%" viewBox="0 0 11590 7547" style="shape-rendering:geometricPrecision; text-rendering:optimizeLegibility">
<g id="karta" transform="scale(1, -1) translate(0, -7547)">
<g id="base" style="&base;">
<path id="a1" title="Horton Plaza" style="fill:#00ff00;" d="M5076,1540L 4986,1540
4895,1539 4803,1539 4715,1539 4622,1539 4534,1538 4534,1641 4534,1745
4534,1856 4622,1856 4711,1856 4800,1856 4893,1855 4984,1855 5075,1854
5075,1749 5076,1646 5076,1540z"/>
<path id="a2" title="Gaslamp" style="fill:#ffff00;" d="M5162,1013L 5084,1057
5083,1116 5081,1222 5079,1326 5079,1433 5076,1540 5076,1646 5075,1749
5075,1854 5167,1854 5257,1855 5257,1750 5259,1647 5260,1541 5262,1434
5262,1328 5263,1222 5263,1013 5162,1013z"/>
</g></g></svg>
Possible rendering formats:
VML or SVG or…
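The step from the feature encoding to SVG is a mechanical transformation: each region's coordinate list becomes a closed path. The sketch below does this with `xml.etree.ElementTree` for one region copied from the slide. The function names `region_to_path` and `render` and the fill list are assumptions, and the slide does not say which tool produced the original SVG (an XSLT stylesheet would be an equally plausible choice).

```python
import xml.etree.ElementTree as ET

# One region from the slide's feature document (XML declaration omitted).
REGIONS_XML = """<rs>
<r><name>Horton Plaza</name><URL></URL><labelpos>41.46,77.51</labelpos>
<c>5076,1540 4986,1540 4895,1539 4803,1539 4715,1539 4622,1539 4534,1538
4534,1641 4534,1745 4534,1856 4622,1856 4711,1856 4800,1856 4893,1855
4984,1855 5075,1854 5075,1749 5076,1646 </c></r>
</rs>"""


def region_to_path(region, path_id, fill):
    """Turn one <r> region into an SVG <path> that closes back on its first point."""
    name = " ".join(region.findtext("name").split())
    points = region.findtext("c").split()
    # Same shape as the slide's paths: M first, L rest, first point again, then z.
    d = f"M{points[0]}L " + " ".join(points[1:] + [points[0]]) + "z"
    return ET.Element(
        "path", {"id": path_id, "title": name, "style": f"fill:{fill};", "d": d}
    )


def render(regions_xml, fills):
    """Build an <svg> element with one path per region, cycling through fills."""
    svg = ET.Element(
        "svg", {"width": "100%", "height": "100%", "viewBox": "0 0 11590 7547"}
    )
    # Flip the y axis as on the slide, so map coordinates read bottom-up.
    group = ET.SubElement(
        svg, "g", {"id": "karta", "transform": "scale(1, -1) translate(0, -7547)"}
    )
    regions = ET.fromstring(regions_xml).iter("r")
    for index, region in enumerate(regions, start=1):
        fill = fills[(index - 1) % len(fills)]
        group.append(region_to_path(region, f"a{index}", fill))
    return svg


svg = render(REGIONS_XML, ["#00ff00", "#ffff00"])
svg_text = ET.tostring(svg, encoding="unicode")
```

Because the archived form is the feature encoding and not the picture, a different renderer (VML, a later SVG version, or something not yet invented) only needs a new version of this transformation.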
XML Map Viewer for the Herbicides Collection
Conclusion
• Necessity & Requirements of a Virtual Data Grid
• SRB – a proven solution
– It is existing middleware
– Field-tested in multiple projects
– Proven Scalability: users, data & resources
• New element of data grid: knowledge management
• Working solutions
– BIRN: the first real data grid complete with knowledge management and cross-ontology bridges
– Web services, to expose grid functionality in a uniform way
– Archiving data, information and knowledge as a grid activity
• www.npaci.edu/DICE/