Data grid technology


Storage Resource Broker
Building Preservation
Environments from
Federated Data Grids
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.sdsc.edu/srb/
Topics
• Preservation environments
• Authenticity
• Integrity
• Digital library technology
• Metadata management
• Data grid technology
• Technology evolution management
Preservation
• Archival processes through which a digital entity is
extracted from its creation environment, and
migrated into a preservation environment, while
maintaining authenticity and integrity information.
• Extraction process requires insertion of support
infrastructure underneath the digital material
• Goal is infrastructure independence, the ability to use
any commercial storage system, database, or access
mechanism
Preservation Communities
• InterPARES - diplomatics
• Preservation of records
• NARA
• Preservation of records from federal agencies
• State archives
• Preservation of submitted “collections”
• Continuum model
• Preservation of active data and records
Digital Libraries
• Support the community vocabulary
• Discovery and browse using community
relevant terms
• Support the community data format
• Maintain information on the data format of
each item
• Support the community access services
• Provide services that manipulate and display
the community data format
Preservation Mandates
• Diplomatics
• Authenticity
• Integrity
• NARA
• Infrastructure independence
• Scalability
• State archives
• Automation of archival processes
InterPARES - Diplomatics
• Authenticity - maintain links to metadata for:
• Date record is made
• Date record is transmitted
• Date record is received
• Date record is set aside [i.e. filed]
• Name of author (person or organization issuing the record)
• Name of addressee (person or organization for whom the record is intended)
• Name of writer (entity responsible for the articulation of the record’s content)
• Name of originator (electronic address from which record is sent)
• Name of recipient(s) (person or organization to whom the record is sent)
• Name of creator (entity in whose archival fonds the record exists)
• Name of action or matter (the activity for which the record is created)
• Name of documentary form (e.g. E-mail, report, memo)
• Identification of digital components
• Identification of attachments (e.g. digital signature)
• Archival bond (e.g. classification code)
InterPARES - Diplomatics
• Integrity - maintain links to metadata for
• Name(s) of the handling office / officer
• Name of office of primary responsibility for keeping
the record
• Annotations or comments
• Actions carried out on the record
• Technical modifications due to transformative
migration
• Validation
Preservation Approach
• Provide mechanisms to:
• Create archival context for the content
• Context is preservation metadata (provenance, administrative,
descriptive, structural, behavioral)
• Content is the submitted digital entity
• Assert integrity - the consistency between the context
and the content
• Track operations done on material and update context
• Assert authenticity - that the material represents the
original site
• Track the chain of custody
• Manage technology evolution (encoding standard,
storage repository, information repository, access
methods)
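A minimal sketch of the approach above, assuming a SHA-256 checksum stands in for the integrity assertion and a simple list stands in for the chain of custody. The class and field names are hypothetical and are not taken from the SRB.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class PreservedEntity:
    content: bytes                                # the submitted digital entity
    context: dict = field(default_factory=dict)   # preservation metadata
    custody: list = field(default_factory=list)   # chain of custody, oldest first

    def ingest(self, custodian: str, provenance: str) -> None:
        """Create the archival context for the content."""
        self.context["provenance"] = provenance
        self.context["sha256"] = sha256_hex(self.content)
        self.custody.append(("ingest", custodian))

    def apply(self, operation: str, custodian: str, new_content: bytes) -> None:
        """Perform a tracked operation and update the context to match."""
        self.content = new_content
        self.context["sha256"] = sha256_hex(new_content)
        self.custody.append((operation, custodian))

    def integrity_ok(self) -> bool:
        """Integrity here means the context still agrees with the content."""
        return self.context.get("sha256") == sha256_hex(self.content)


record = PreservedEntity(b"original bytes")
record.ingest(custodian="archive-a", provenance="agency-x")
record.apply("format-migration", "archive-a", b"migrated bytes")
tracked_ok = record.integrity_ok()

record.content = b"tampered"   # untracked change: context no longer matches
untracked_ok = record.integrity_ok()
```

Operations that go through `apply` keep context and content consistent; a change made behind the system's back is detectable because the recorded checksum no longer matches.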
Data Grids
• Manage shared collections that are
distributed in space
• Location of item, access controls, checksums
• Implement infrastructure independence
• Standard operations for interacting with
storage repositories
• Implement presentation independence
• Standard APIs to support porting of user
interfaces
Preservation Environment
• Digital library infrastructure that
supports
• Preservation metadata
• Arrangement and description of items
• Access mechanisms
• Data grid infrastructure that supports
• Shared collections that are migrated forward in
time
• Management of technology evolution
• Administrative metadata providing status of
records
Infrastructure Independence
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Naming conventions
provided by storage
systems
Data Grids Provide a Level of Indirection
for Each Naming Convention
Data Access Methods (C library, Unix, Web Browser)
Data Collection
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data is organized as a shared collection
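The indirection can be sketched as a small catalog that maps logical names onto physical ones. This is an illustration only: the class, the resource name `archive-1`, and the host names are hypothetical, and the real catalog schema is not described in these slides.

```python
class LogicalNameSpace:
    """Toy catalog; all names here are hypothetical, not the SRB schema."""

    def __init__(self):
        self.resources = {}  # logical resource name -> physical storage location
        self.files = {}      # logical file name -> [(logical resource, physical path)]

    def register_resource(self, logical_resource, location):
        self.resources[logical_resource] = location

    def register_replica(self, logical_file, logical_resource, physical_path):
        if logical_resource not in self.resources:
            raise KeyError(f"unknown resource: {logical_resource}")
        self.files.setdefault(logical_file, []).append((logical_resource, physical_path))

    def resolve(self, logical_file):
        """Map a logical file name to every (location, physical path) replica."""
        return [(self.resources[res], path) for res, path in self.files[logical_file]]


catalog = LogicalNameSpace()
catalog.register_resource("archive-1", "tape.example.org")
catalog.register_replica("/home/collection/report.pdf", "archive-1", "/vault/0001.dat")
before = catalog.resolve("/home/collection/report.pdf")

# Technology evolution: the archive moves to new hardware. Only the resource
# mapping changes; the logical file name used by applications stays the same.
catalog.register_resource("archive-1", "disk.example.org")
after = catalog.resolve("/home/collection/report.pdf")
```

Applications keep using the logical file name; only the catalog entry for the resource is updated when the storage system is replaced.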
Data Grids
• Provide two levels of indirection:
• Low level API used to interact with storage
repositories
• Standard operations for manipulating files in a
storage system
• Standard operations for manipulating a catalog
stored in a database
• High level API used to support user interfaces
• Three basic APIs - “C” library call, Unix shell
commands, Java class library
• Other interfaces are ported on top of the basic
APIs.
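One way to picture the low level API is a driver interface that every storage repository implements. The sketch below is an assumption for illustration: the `StorageDriver` interface and its three operations are hypothetical and much smaller than a real driver layer.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Standard operations every repository type must provide (hypothetical)."""

    @abstractmethod
    def put(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, path: str) -> bytes: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...


class MemoryDriver(StorageDriver):
    def __init__(self):
        self._blobs = {}

    def put(self, path, data):
        self._blobs[path] = data

    def get(self, path):
        return self._blobs[path]

    def delete(self, path):
        del self._blobs[path]


class DirectoryDriver(StorageDriver):
    def __init__(self, root):
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path)

    def put(self, path, data):
        with open(self._full(path), "wb") as handle:
            handle.write(data)

    def get(self, path):
        with open(self._full(path), "rb") as handle:
            return handle.read()

    def delete(self, path):
        os.remove(self._full(path))


def copy_between(source: StorageDriver, target: StorageDriver, path: str) -> None:
    """Higher-level code is written once, against the standard operations."""
    target.put(path, source.get(path))


with tempfile.TemporaryDirectory() as root:
    disk, memory = DirectoryDriver(root), MemoryDriver()
    disk.put("report.txt", b"hello")
    copy_between(disk, memory, "report.txt")
    copied = memory.get("report.txt")
```

Supporting a new kind of repository then means writing one more driver class, while `copy_between` and any user interface built on it stay unchanged.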
Storage Resource Broker 3.3
Application
C Library, Java, Unix Shell, Linux I/O, C++, DLL / Python, Perl, Windows, NT Browser, Kepler Actors, HTTP, DSpace, OpenDAP, GridFTP, OAI, WSDL, (WSRF)
Federation Management
Consistency & Metadata Management / Authorization, Authentication, Audit
Logical Name Space, Latency Management, Data Transport, Metadata Transport
Database Abstraction
• Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
Storage Repository Abstraction
• Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
• ORB
• File Systems - Unix, NT, Mac OSX
• Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
Standard Data Access Operations
Remote operations
Unix file system
Latency management
Procedures
Transformations
Third party transfer
Filtering
Queries
Collective operations
Replication
Fault tolerance
Load leveling
User Application
Common set of operations for interacting
with every type of storage repository
Archive at SDSC
Archive at NARA
Archive at U Md
Building a Distributed Collection
Logical name space
Location independent identifier
Persistent identifier
Collection owned data
Authenticity metadata
Access controls
Audit trails
Checksums
Descriptive metadata
Inter-realm authentication
Single sign-on system
User Application
Data Grid
Common naming convention and set of
attributes for describing digital entities
Archive at SDSC
Archive at NARA
Archive at U Md
Federated Server Architecture
[Figure labels: Read Application; Logical Name Or Attribute Condition; Peer-to-peer Brokering; Parallel Data Access; SRB servers and SRB agents; Server(s) Spawning; MCAT; Data Access; storage resources R1 and R2; numbered steps 1 through 6]
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
Managing Access
• Authenticate users independently of
storage systems
• Preservation environment owns the data
• Authorize data access independently of
storage system
• ACLs on both data and metadata
• Maintain audit trails of all accesses
• Both read and write
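A minimal sketch of these points, assuming access control lists keyed by kind ("data" or "metadata") and logical name, with every check written to an audit trail. The `AccessManager` class and its fields are hypothetical.

```python
from datetime import datetime, timezone


class AccessManager:
    """ACLs and audit trail kept by the data grid, not by the storage system."""

    def __init__(self):
        self.acl = {}          # (kind, logical name) -> {user: set of permissions}
        self.audit_trail = []  # one entry per access attempt, read or write

    def grant(self, kind, name, user, permission):
        self.acl.setdefault((kind, name), {}).setdefault(user, set()).add(permission)

    def check(self, kind, name, user, permission) -> bool:
        allowed = permission in self.acl.get((kind, name), {}).get(user, set())
        self.audit_trail.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "kind": kind,
            "name": name,
            "permission": permission,
            "allowed": allowed,
        })
        return allowed


manager = AccessManager()
manager.grant("data", "/collection/report.pdf", "alice", "read")
manager.grant("metadata", "/collection/report.pdf", "alice", "write")

can_read = manager.check("data", "/collection/report.pdf", "alice", "read")
can_write = manager.check("data", "/collection/report.pdf", "alice", "write")
```

Denied attempts are recorded as well as granted ones, so the trail covers all accesses.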
Collection-owned Data
• Store data at remote storage system under
data-grid ID
• Access data through data grid servers
• Track all operations on data and update state
information
• User authenticates to a data grid server
• Access controls are checked for permissions
• Data grid servers authenticate messages from other
servers
• Remote server authenticates to remote storage
system
• Multiple authentication mechanisms
• GSI / challenge-response / tickets
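The slides do not describe the wire protocol, so the following is only a generic challenge-response sketch, assuming a shared secret per user and an HMAC-SHA256 response. The `GridServer` class and `respond` helper are hypothetical.

```python
import hashlib
import hmac
import secrets


class GridServer:
    """Authenticates users itself, independent of any storage system."""

    def __init__(self, shared_keys):
        self._keys = shared_keys   # user name -> secret key (bytes)
        self._pending = {}         # user name -> outstanding nonce

    def challenge(self, user: str) -> bytes:
        nonce = secrets.token_bytes(16)
        self._pending[user] = nonce
        return nonce

    def verify(self, user: str, response: bytes) -> bool:
        nonce = self._pending.pop(user, None)   # each challenge is single use
        key = self._keys.get(user)
        if nonce is None or key is None:
            return False
        expected = hmac.new(key, nonce, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


def respond(key: bytes, nonce: bytes) -> bytes:
    """Client side: prove knowledge of the key without sending it."""
    return hmac.new(key, nonce, hashlib.sha256).digest()


server = GridServer({"alice": b"alice-secret"})

nonce = server.challenge("alice")
answer = respond(b"alice-secret", nonce)
first_try = server.verify("alice", answer)
replayed = server.verify("alice", answer)       # same answer, no new challenge

nonce = server.challenge("alice")
wrong_key = server.verify("alice", respond(b"guess", nonce))
```

The secret never crosses the network, and a captured response cannot be replayed because the nonce is discarded after one use.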
Provide Context for Data
• Properties of files
• Provenance - source
• Descriptive attributes
• Structure
• Organize properties as metadata in a
collection hierarchy
• Define operations on file properties
• Manage state information - location, replicas,
containers
• Separate context management from content
management
• Maintain consistency of context as operations are
done on content
Database Operations
• Standard interface to support
• Schema extension - user defined attributes
• Snowflake table creation
• SQL generation
• Import and export of XML files
• Bulk metadata load and unload
• Operations required to manage a
catalog that resides in a database
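Two of these operations, user-defined attributes and SQL generation, can be sketched with an in-memory SQLite catalog. The two-table layout and the function names are assumptions made for illustration; they are not the MCAT schema.

```python
import sqlite3


def create_catalog():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE entity (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE);
        CREATE TABLE attribute (
            entity_id INTEGER REFERENCES entity(id),
            name TEXT,
            value TEXT
        );
    """)
    return db


def add_entity(db, logical_name, **attributes):
    """Register an entity with any user-defined attributes, no schema change."""
    cursor = db.execute("INSERT INTO entity (logical_name) VALUES (?)", (logical_name,))
    db.executemany(
        "INSERT INTO attribute VALUES (?, ?, ?)",
        [(cursor.lastrowid, name, str(value)) for name, value in attributes.items()],
    )


def build_query(conditions):
    """Generate SQL selecting entities that match every attribute condition."""
    joins, params = [], []
    for i, (name, value) in enumerate(conditions.items()):
        joins.append(
            f"JOIN attribute a{i} ON a{i}.entity_id = e.id "
            f"AND a{i}.name = ? AND a{i}.value = ?"
        )
        params += [name, value]
    sql = "SELECT e.logical_name FROM entity e " + " ".join(joins) + " ORDER BY e.logical_name"
    return sql, params


db = create_catalog()
add_entity(db, "/coll/a.tif", format="TIFF", agency="X")
add_entity(db, "/coll/b.tif", format="TIFF", agency="Y")
add_entity(db, "/coll/c.pdf", format="PDF", agency="X")

sql, params = build_query({"format": "TIFF", "agency": "X"})
matches = [row[0] for row in db.execute(sql, params)]
```

Values are always passed as parameters; only the join structure is generated from the attribute names being queried.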
National Archives and Records Administration Research Prototype Persistent Archive
Demonstrate preservation
environment
• Authenticity
• Integrity
• Management of
technology evolution
• Mitigation of risk of data loss
• Replication of data
• Federation of catalogs
• Management of preservation
metadata
• Scalability
• EAP collection
• 350,000 files
• 1.2 TBs in size
Federation of Three
Independent Data Grids
• NARA MCAT - Principal copy stored at NARA with complete metadata catalog
• U Md MCAT - Replicated copy at U Md for improved access, load balancing and disaster recovery
• SDSC MCAT - Deep Archive at SDSC, no user access, but complete copy
Preservation Requirements
• Maintain authenticity and integrity of
electronic records
• Authenticity - assertion of provenance of data
• Integrity - assertion of invariance of bits
• Manage risk of data loss
• Media corruption / System failures / Operational errors
/ Natural disaster / Malicious users
• Manage technology obsolescence
• Support migration of collection to new systems
• Bulk data operations
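Integrity as invariance of bits, combined with replication, suggests a simple periodic check: compare each replica against the recorded checksum and restore damaged copies from an intact one. The sketch below assumes SHA-256 and an in-memory dictionary of replicas; the function names are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(replicas: dict, expected: str) -> list:
    """Check each replica (site name -> bytes) against the recorded checksum.

    Damaged replicas are overwritten from an intact copy. Returns the list of
    sites that were repaired, and raises if no intact copy remains.
    """
    intact = [site for site, data in replicas.items() if checksum(data) == expected]
    if not intact:
        raise ValueError("no intact replica left to repair from")
    repaired = []
    for site, data in replicas.items():
        if checksum(data) != expected:
            replicas[site] = replicas[intact[0]]
            repaired.append(site)
    return repaired


original = b"electronic record"
recorded = checksum(original)          # stored in the catalog at ingest

replicas = {
    "NARA": original,
    "U Md": b"electronic rec0rd",      # simulated media corruption
    "SDSC": original,
}
repaired_sites = verify_and_repair(replicas, recorded)
```

With copies at independent sites, a single corrupted or lost replica is recoverable as long as one other copy still matches the recorded checksum.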
Federation
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection A
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection B
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Access controls and consistency constraints
on cross registration of digital entities
Data Grid Zones
• Choose how name spaces will be shared
• Cross register storage resources
• May the other data grid write to my storage?
• Cross register user names
• Users are authenticated by their home zone
• Cross register files
• Can replicate files into another data grid
• Cross register metadata
• Can build a copy of the metadata catalog
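Cross registration of user names can be sketched as follows, assuming users are named `user@zone` and that a zone forwards authentication of a cross-registered user to that user's home zone. The `Zone` class is hypothetical, and plain-text secrets are used only to keep the example short.

```python
class Zone:
    """One data grid in a federation (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.local_users = {}          # user name -> secret
        self.peers = {}                # zone name -> Zone, for cross-registered zones
        self.registered_users = set()  # "user@zone" names accepted from other zones

    def add_user(self, user, secret):
        self.local_users[user] = secret

    def cross_register_user(self, home_zone, user):
        self.peers[home_zone.name] = home_zone
        self.registered_users.add(f"{user}@{home_zone.name}")

    def authenticate(self, qualified_name, secret) -> bool:
        user, _, zone_name = qualified_name.partition("@")
        if zone_name == self.name:
            return user in self.local_users and self.local_users[user] == secret
        if qualified_name not in self.registered_users:
            return False
        # Users are authenticated by their home zone, never by the remote zone.
        return self.peers[zone_name].authenticate(qualified_name, secret)


zone_a, zone_b = Zone("A"), Zone("B")
zone_a.add_user("alice", "s3cret")
zone_a.add_user("bob", "hunter2")
zone_b.cross_register_user(zone_a, "alice")   # only alice is shared with zone B

alice_at_b = zone_b.authenticate("alice@A", "s3cret")
bob_at_b = zone_b.authenticate("bob@A", "hunter2")
```

Zone B never stores alice's secret; it only decides which remote names it is willing to accept, which is the sharing choice each zone makes for its name spaces.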
Federation Environments
[Figure: federation approaches grouped as Peer-to-Peer Data Grids, Replication Data Grids, and Hierarchical Data Grids]
Constraint labels: Replication Constraints, Access Constraints, Consistency Constraints
Other labels: Free Floating, Partial User-ID Sharing, Occasional Interchange, Partial Resource Sharing, Replicated Data, No Metadata Synch, System Set Access Controls, System Controlled Complete Synch, Complete User-ID Sharing, User and Data Replica, Resource Interaction, System Managed Replication, Connection From Any Zone, Complete Resource Sharing, Replicated Catalog, Hierarchical Zone Organization, One Shared User-ID, Nomadic, System Controlled Partial Synch, No Resource Sharing, Snow Flake, Super Administrator Zone Control, Master Slave, No User-ID Sharing, Deep Archive
Examples of Extensibility
• Storage Repository Driver evolution
• Initially supported Unix file system
• Added archival access - UniTree, HPSS
• Added FTP/HTTP
• Added database blob access
• Added database table interface
• Added Windows file system
• Added project archives - Dcache, Castor, ADS
• Added Object Ring Buffer, Datascope
• Adding GridFTP version 3.3
• Database management evolution
• Postgres
• DB2
• Oracle
• Informix
• Sybase
• mySQL (most difficult port - no locks, no views, limited SQL)
Examples of Extensibility
• The 3 fundamental APIs are C library, shell commands,
Java
• Other access mechanisms are ported on top of these interfaces
• API evolution
• Initial access through C library, Unix shell command
• Added inQ Windows browser (C++ library)
• Added mySRB Web browser (C library and shell commands)
• Added Java (Jargon)
• Added Perl/Python load libraries (shell command)
• Added WSDL (Java)
• Added OAI-PMH, OpenDAP, DSpace digital library (Java)
• Added Kepler actors for dataflow access (Java)
• Adding GridFTP version 3.3 (C library)
Storage Resource Broker Collections at SDSC (2/22/2005)
Project | GBs of data stored | Number of files | Number of users
Data Grid
NSF/ITR - National Virtual Observatory | 53,862 | 9,536,751 | 100
NSF - National Partnership for Advanced Computational Infrastructure | 31,263 | 6,435,338 | 380
Hayden Planetarium - Evolution of the Solar System visualizations | 7,201 | 113,600 | 178
Public collections - NSF/NPACI - Joint Center for Structural Genomics | 5,455 | 3,405,266 | 67
NSF/NPACI - Biology and Environmental collections | 20,364 | 52,159 | 67
NSF - TeraGrid, ENZO Cosmology simulations | 155,980 | 1,157,168 | 3,176
NIH - Biomedical Informatics Research Network | 9,830 | 6,632,159 | 241
Miscellaneous static collections | 8,013 | 161,352 | 241
Digital Library
NLM - Digital Embryo image collection | 720 | 45,365 | 23
NSF/NPACI - Long Term Ecological Reserve | 253 | 8,892 | 36
NSF/NPACI - Grid Portal | 2,620 | 53,048 | 460
NIH - Alliance for Cell Signaling microarray data | 559 | 71,318 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,654 | 1,052,202 | 27
NSF/NPACI - Transana education research video collection | 92 | 2,387 | 26
NSF/ITR - Southern California Earthquake Center | 99,010 | 2,074,138 | 64
Persistent Archive
NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota) | 90 | 372,947 | 28
UCSD Libraries archive | 4,147 | 408,050 | 29
NARA - Research Prototype Persistent Archive | 991 | 455,094 | 58
NSF - National Science Digital Library persistent archive | 3,572 | 26,918,638 | 136
TOTAL | 404 TB | 59 million | 5,167
Sites Using the SRB
Academia Sinica, Taiwan
ASCC, Computing Centre, Taiwan
Australian National University
Bedford Oceanography, Canada
Bioinformatics Institute, Singapore
CSIRO, Australia
Data Storage Institute, Singapore
EGEE, French National Center
GeoForschungsZentrum, Germany
James Cook University, Australia
KEK High Energy Physics, Japan
Max Planck Institute, Netherlands
Parallab, Norway
South Australian Advanced Computing
UIB (Parallab), Norway
University of Amsterdam
University of Cambridge, Astronomy
University of Cambridge, e-Science
University of Edinburgh
University of Genoa, Italy
University of Hong Kong
University of Manchester
University of Oslo
University of Southampton
York Univ (UK)
CiteSeer, Penn State
City Univ. of New York
Geospatial Environment, UCSD
Drexel University
EOSDIS Distributed Active, NASA Goddard
Georgia Tech
Kentucky State Libraries & Archives
Library of Congress
Los Alamos National Lab
NASA Ames
NASA Goddard Space Flight Center
NCSA Grid Computing
NIH (NCI Center for Bioinformatics)
Penn State University
Pittsburgh Supercomputing Center
Purdue University, Indiana
Stanford University
TACC, University of Texas
Texas A & M
UC Santa Cruz
UCLA
UCSD Neuroscience
University of Maryland
University of Michigan, CAC department
University of New Mexico
University of Washington
University of Wisconsin
USC
Yale University
Preservation Strategies
• Emulation
• Migrate the display application onto new operating
systems
• Equivalent to forcing use of candlelight to look at 16th
century documents
• Transformative migration
• Migrate the encoding format to the new standard
• Migration period is expected to be 5-10 years
• Persistent object
• Characterize the encoding format
• Migrate the characterization forward in time
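The persistent object idea can be illustrated with a declarative description of a binary record layout: the description, not a particular display application, is what gets carried forward. This is a sketch under stated assumptions; the record layout, field names, and `decode` function are invented for the example.

```python
import struct

# Characterization of the encoding format, kept as data alongside the record.
# Type codes follow Python's struct module: I = uint32, H = uint16, f = float32.
CHARACTERIZATION = {
    "byte_order": "<",   # little-endian, no padding
    "fields": [("station_id", "I"), ("year", "H"), ("temperature", "f")],
}


def decode(characterization: dict, raw: bytes) -> dict:
    """Interpret raw bytes using only the characterization."""
    layout = characterization["byte_order"] + "".join(
        code for _, code in characterization["fields"]
    )
    values = struct.unpack(layout, raw)
    names = [name for name, _ in characterization["fields"]]
    return dict(zip(names, values))


# A record as some original application might have written it.
raw_record = struct.pack("<IHf", 42, 1980, 21.5)
decoded = decode(CHARACTERIZATION, raw_record)
```

The stored bits are never rewritten. Only the decoder that reads the characterization has to be reimplemented on new technology, which is the contrast with transformative migration of the encoding format itself.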
Persistent Objects
Display Applications (timeline: 1980, 1990, 2000, 2010, 2020)
Characterize standard manipulation operations
Characterize encoding format - data structure
Digital Entities (timeline: 1980, 1990, 2000, 2010, 2020)
Preservation
• Archival processes through which a digital entity is
extracted from its creation environment and migrated
to a preservation environment, while maintaining
authenticity and integrity information.
• Extraction process requires insertion of support
infrastructure underneath the digital material,
characterization of the authenticity and integrity,
characterization of the digital encoding format, and
characterization of the display operations
• Goal is infrastructure independence, the ability to use
any commercial storage system, database, or access
mechanism
For More Information
Reagan W. Moore
San Diego Supercomputer Center
[email protected]
http://www.sdsc.edu/srb/