APARSEN overview - Max Planck Institute for Psycholinguistics

APARSEN
Metadata for preservation, curation and
interoperability
Workshop on Research Metadata in Context
7-8 Sept 2010, Nijmegen
David Giaretta
APA and STFC
Digital Preservation
• Ensure that digitally encoded information is
understandable and usable over the long term
– Long term could start at just a few years
• Easy to make claims
– Difficult to provide proof
• Reference Model for Open Archival Information
System (ISO 14721)
– The basic standard for work in digital preservation
– Defines terminology and compliance criteria
Definitions (OAIS)
Not just BIT preservation
Not just rendering
• Long Term Preservation: The act of maintaining
information, Independently Understandable by a
Designated Community, and with evidence
supporting its Authenticity, over the Long Term.
• Long Term: A period of time long enough for there
to be concern about the impacts of changing
technologies, including support for new media and
data formats, and of a changing Designated
Community, on the information being held in an
OAIS. This period extends into the indefinite future.
Information, not just DATA or Documents
Authenticity
Basic concept
• Digital preservation had been dominated by libraries
and (state) archives
• However there was a focus there on “rendered
objects” and “metadata”
“CASPAR banned the use of the term metadata
unless absolutely necessary”
• Tendency to think data is an “easy” add-on
HOWEVER
• Need to deal with DATA – processed to new things,
not just rendered
• Need to follow OAIS – finer grained view
• Need to test and prove that things work
Data…
Level 2 GOME Satellite
instrument data
Contains numbers – need meaning
...to process to this
...or this
...through complex processing schemes
Just Format?
sfqsftfoubujpo jogpsnbujpo svmft
You have a file
JHOVE tells you it is WORD version 7
..with some extra information..
representation information rules
Format Registries – useful but not enough: formats can be
used for multiple purposes e.g. audio files used to store
configuration parameters
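The unreadable line on the "Just Format?" slide appears to be the later phrase with every letter moved one place forward in the alphabet. The slide does not state the rule, so the shift-by-one reading is an observation, not something the author spells out. A minimal sketch, with the hypothetical helper `shift_letters`, of how one small piece of Representation Information (the rule) makes the same bytes understandable:

```python
def shift_letters(text: str, offset: int) -> str:
    """Shift each lowercase ASCII letter by `offset` places, wrapping around a-z."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        else:
            # Spaces and any other characters are passed through unchanged.
            out.append(ch)
    return "".join(out)


# The text as it appears on the slide.
encoded = "sfqsftfoubujpo jogpsnbujpo svmft"

# Assumed rule: each letter was moved one place forward, so move it one place back.
decoded = shift_letters(encoded, -1)
print(decoded)
```

Knowing the file format alone would not have produced the readable phrase; the decoding rule had to be known as well.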
Examples (cont)
• “504b0304140000000800f696….”
• “This is a ZIP file which contains Word files,
each of which contains an encoded message
which needs the key ‘!D$G^AJU*KI’ to decode
it using encryption method SHA7”
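As a rough illustration of how far format identification gets you with the hex string above, the sketch below unpacks the first ten bytes as a ZIP local file header. The field layout used (4-byte signature, then three little-endian 16-bit fields) is the standard ZIP layout; the variable names are illustrative only.

```python
import struct

# The first bytes quoted on the slide (the slide truncates the rest).
prefix = bytes.fromhex("504b0304140000000800f696")

# ZIP local file header: signature, version needed, flags, compression method.
signature, version_needed, flags, method = struct.unpack("<4sHHH", prefix[:10])

is_zip = signature == b"PK\x03\x04"
print(is_zip, version_needed, flags, method)
```

The bytes identify the container and its compression method, but nothing in them says that the contents are Word files or that a key is needed to decode the messages. That knowledge has to be recorded separately.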
Examples (cont)
• LaTeX file containing an EPS (Encapsulated
PostScript) version of an image
• Web page containing Java Applet generating
random numbers
• SWISS-PROT data
• Foreign Language emails
XML enough? – can stare at this and probably
understand it
<family>
<father>John</father>
<mother>Mary</mother>
<son>Paul</son>
</family>
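A small sketch of what a generic XML parser recovers from the fragment above, using Python's standard `xml.etree.ElementTree`. The parser yields the structure (element names and text), while the meaning of "father" or "son" still comes from the reader's own knowledge.

```python
import xml.etree.ElementTree as ET

doc = """<family>
<father>John</father>
<mother>Mary</mother>
<son>Paul</son>
</family>"""

root = ET.fromstring(doc)

# Map each child element name to its text content.
members = {child.tag: child.text for child in root}
print(root.tag, members)
```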
..but what about this?
<VOTABLE version="1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1"
xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
<RESOURCE>
<TABLE name="6dfgs_E7_subset" nrows="875">
<PARAM arraysize="*" datatype="char" name="Original Source" value="http://wwwwfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz">
<DESCRIPTION>URL of data file used to create this table.</DESCRIPTION>
</PARAM>
<PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT demo
usage."/>
<FIELD arraysize="15" datatype="char" name="TARGET">
<DESCRIPTION>Target name</DESCRIPTION>
</FIELD>
<FIELD arraysize="11" datatype="char" name="DEC" unit="DMS">
<DATA>
<FITS>
<STREAM encoding='base64'>
U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBm
b3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAg
ICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAv
IE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg
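To show that the XML wrapper is only the outer layer, the sketch below decodes the first line of the base64 stream from the fragment above. The result is the start of a FITS header, so understanding the table also requires Representation Information for FITS, not just for XML.

```python
import base64

# First line of the <STREAM encoding='base64'> content, copied from the fragment.
first_line = "U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBm"

decoded = base64.b64decode(first_line).decode("ascii")
print(repr(decoded))
```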
Performance Viewer: side-by-side comparison and validation of the transformation. From left to
right: 3D visualization in Ogre3D, 3D model of the stage including the virtual dancer in VRML.
Figure 8 Some aspects of acousmatic production
[Diagram: digital objects classified as Simple or Complex, Static or Dynamic, Rendered or Non-Rendered]
Information Model & Representation Information
The Information Model is key
[Diagram: an Information Object is a Data Object interpreted using 1+ Representation Information. A Data Object is either a Physical Object or a Digital Object; a Digital Object is made of 1+ Bit Sequences. Representation Information is itself interpreted using further Representation Information.]
Recursion ends at KNOWLEDGEBASE of the
DESIGNATED COMMUNITY
(this knowledge will change
over time and region)
Representation Information Network
Modules and Dependencies: defining the Designated
Community
[Diagram nodes: README.txt, TEXT EDITOR, ENGLISH LANGUAGE, WINDOWS XP; FITS FILE, FITS DICTIONARY, FITS STANDARD, FITS JAVA s/w; MULTIMEDIA PERFORMANCE DATA, C3D, 3D motion data files, 3D scene data files, DirectX, MAX/MSP, motion to music mapping strategy; PDF STANDARD, PDF s/w, DICTIONARY SPECIFICATION, XML SPECIFICATION, JAVA VM, UNICODE SPECIFICATION]
[Diagram: Representation Information network for a FITS FILE. Nodes: FITS STANDARD, FITS DICTIONARY, DDL DESCRIPTION, FITS JAVA SOFTWARE, PDF STANDARD, DICTIONARY SPECIFICATION, DDL DEFINITION, JAVA VM, PDF SOFTWARE, XML SPECIFICATION, UNICODE SPECIFICATION, DDL SOFTWARE]
In principle we could use this, plus the
Dictionaries in order to understand the
keywords in order to extract the
numbers
If we can run this then we can use this
in a generic application to extract the
numbers
[Diagram repeated: Representation Information network for a FITS FILE]
If we cannot run the Java Virtual
Machine then we use this source
code to re-write in another
programming language such as C
If we can run this then we can run the
Java software to extract the numbers
If we cannot run this then we can use
an emulator or use its RepInfo to recreate a Java VM
If we cannot run the DDL software then
we can look at the DDL definition and
write some software to extract the
numbers
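The idea that the network is followed until it reaches what the Designated Community already knows can be sketched as a simple graph walk. The edges in `REPINFO_NETWORK` are hypothetical: the node names come from the diagram, but the slides as transcribed do not preserve which node depends on which, and `required_repinfo` is an illustrative helper, not part of any CASPAR tool.

```python
from typing import Dict, List, Set

# Hypothetical dependency edges between the node labels shown in the diagram.
REPINFO_NETWORK: Dict[str, List[str]] = {
    "FITS FILE": ["FITS STANDARD", "FITS DICTIONARY", "FITS JAVA SOFTWARE"],
    "FITS STANDARD": ["PDF STANDARD"],
    "FITS DICTIONARY": ["DICTIONARY SPECIFICATION"],
    "DICTIONARY SPECIFICATION": ["XML SPECIFICATION"],
    "XML SPECIFICATION": ["UNICODE SPECIFICATION"],
    "FITS JAVA SOFTWARE": ["JAVA VM"],
    "PDF STANDARD": ["PDF SOFTWARE"],
}


def required_repinfo(item: str,
                     network: Dict[str, List[str]],
                     knowledge_base: Set[str]) -> List[str]:
    """List the RepInfo needed for `item`, stopping at anything already known."""
    needed: List[str] = []
    seen: Set[str] = set()

    def visit(node: str) -> None:
        for dep in network.get(node, []):
            # Recursion ends at the knowledge base of the Designated Community.
            if dep in knowledge_base or dep in seen:
                continue
            seen.add(dep)
            needed.append(dep)
            visit(dep)

    visit(item)
    return needed


# A community that can already read PDF and XML and can run a Java VM.
community = {"PDF STANDARD", "XML SPECIFICATION", "JAVA VM"}
needed_now = required_repinfo("FITS FILE", REPINFO_NETWORK, community)

# A future community that knows none of these things needs the whole network.
needed_later = required_repinfo("FITS FILE", REPINFO_NETWORK, set())
print(needed_now)
print(len(needed_later))
```

As the knowledge base shrinks over time, the same walk returns more of the network, which is why the archive has to hold the deeper nodes even if nobody needs them today.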
• RepInfo
• Virtualisation / DISCIPLINE
Virtualisation
2-D array
• Height
• Width
• Bits per Pixel
2-D image
• Height
• Width
• Bits per Pixel
• Co-ordinate system
• Time
2-D astronomical image
• Height
• Width
• Bits per Pixel
• Astronomical co-ordinate system
• Time – EPOCH
• Bandpass
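The three levels above read naturally as a chain of increasingly specific types, each adding attributes to the one before. A minimal sketch using Python dataclasses follows; the class names, field names and example values are illustrative assumptions, not taken from CASPAR software.

```python
from dataclasses import dataclass


@dataclass
class Array2D:
    height: int
    width: int
    bits_per_pixel: int

    def size_in_bytes(self) -> int:
        # Total bits rounded up to whole bytes.
        return (self.height * self.width * self.bits_per_pixel + 7) // 8


@dataclass
class Image2D(Array2D):
    coordinate_system: str
    time: str


@dataclass
class AstronomicalImage2D(Image2D):
    epoch: str
    bandpass: str


# Hypothetical example values.
img = AstronomicalImage2D(
    height=1024, width=2048, bits_per_pixel=16,
    coordinate_system="equatorial", time="2010-09-07T00:00:00",
    epoch="J2000", bandpass="V",
)
print(img.size_in_bytes())
```

Software written against `Array2D` can still read the pixel grid of an astronomical image, even if it knows nothing about epochs or bandpasses.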
General Table
• Number of columns
• Names of columns
• Number of rows
• Value in cell at any row, column
Time series
• Number of columns
• Names of columns
• Number of rows
• Value in cell at any row, column
• Time corresponding to any row
Science data table
• Number of columns
• Names of columns
• Number of rows
• Value in cell at any row, column
• Type of column value
• Column “metadata”
• Table “metadata”
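The table levels can be sketched the same way: a general table exposes four operations, and a time series adds one more. The class and method names below are hypothetical and simply mirror the bullet points.

```python
from typing import Any, List, Sequence


class GeneralTable:
    """The four operations listed for a general table."""

    def __init__(self, column_names: Sequence[str],
                 rows: Sequence[Sequence[Any]]) -> None:
        self._names = list(column_names)
        self._rows = [list(r) for r in rows]

    def number_of_columns(self) -> int:
        return len(self._names)

    def names_of_columns(self) -> List[str]:
        return list(self._names)

    def number_of_rows(self) -> int:
        return len(self._rows)

    def value(self, row: int, column: int) -> Any:
        return self._rows[row][column]


class TimeSeries(GeneralTable):
    """A general table plus the time corresponding to any row."""

    def __init__(self, column_names: Sequence[str],
                 rows: Sequence[Sequence[Any]],
                 times: Sequence[str]) -> None:
        super().__init__(column_names, rows)
        if len(times) != len(rows):
            raise ValueError("one time value is required per row")
        self._times = list(times)

    def time_for_row(self, row: int) -> str:
        return self._times[row]


# Hypothetical example data.
series = TimeSeries(
    ["temperature", "pressure"],
    [[280.5, 1013.0], [281.0, 1012.5]],
    ["2010-09-07T00:00:00", "2010-09-07T01:00:00"],
)
print(series.names_of_columns(), series.value(1, 0), series.time_for_row(1))
```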
[Diagram: a tree with a Root node and Nodes 1 to 9]
• Get the Root
• Get the number of children for a node
• Get child number “i”
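The three tree operations are enough to write generic software that walks any tree-shaped data. A minimal sketch follows, with a hypothetical `SimpleTree` class and a traversal that uses nothing but those three calls.

```python
from typing import Dict, List, Optional


class SimpleTree:
    """A tree exposing only the three operations named on the slide."""

    def __init__(self, root: str, children: Dict[str, List[str]]) -> None:
        self._root = root
        self._children = children

    def get_root(self) -> str:
        return self._root

    def child_count(self, node: str) -> int:
        return len(self._children.get(node, []))

    def get_child(self, node: str, i: int) -> str:
        return self._children[node][i]


def count_nodes(tree: SimpleTree, node: Optional[str] = None) -> int:
    """Count all nodes using only get_root, child_count and get_child."""
    if node is None:
        node = tree.get_root()
    total = 1
    for i in range(tree.child_count(node)):
        total += count_nodes(tree, tree.get_child(node, i))
    return total


# A small hypothetical tree; its shape is not taken from the slide's diagram.
tree = SimpleTree("Root", {
    "Root": ["Node 1", "Node 2"],
    "Node 1": ["Node 3", "Node 4"],
    "Node 2": ["Node 5"],
})
total = count_nodes(tree)
print(total)
```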
Image
• Earth Observation Image
• Astronomical Image
– X-ray Astronomical Image
– Optical Astronomical Image
• Artistic Image
• Cultural Heritage Image
Archival Information Package
[Diagram: an Archival Information Package holds Content Information, which is further described by Preservation Description Information. The package is delimited by Packaging Information, which identifies its contents, and is described by a Package Description derived from the package.]
Preservation Description Information
• Reference Information
• Provenance Information
• Context Information
• Fixity Information
• Access Rights Information
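[Diagram: the Archival Information Package expanded. The package is described by a Package Description, derived from the package, and delimited by Packaging Information. Content Information is a Data Object (a Physical Object, or a Digital Object made of 1...* Bits) interpreted using Representation Information (Structure, Other), where one part “adds meaning to” another. Content is further described by Reference, Provenance, Context, Fixity and Access Rights information. Representation Information is itself interpreted using Representation Information.]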
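The package structure in the diagram can be sketched as plain data types. This is only an illustration of the OAIS terms: the field types are assumptions, and using a SHA-256 digest as Fixity Information is one common choice, not something the slides specify.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentInformation:
    data_object: bytes
    representation_information: List[str] = field(default_factory=list)


@dataclass
class PreservationDescriptionInformation:
    reference: str
    provenance: str
    context: str
    fixity: str  # assumed here to be a SHA-256 hex digest of the data object
    access_rights: str


@dataclass
class ArchivalInformationPackage:
    content: ContentInformation
    pdi: PreservationDescriptionInformation
    packaging_information: str


def fixity_ok(aip: ArchivalInformationPackage) -> bool:
    """Check that the data object still matches the recorded digest."""
    digest = hashlib.sha256(aip.content.data_object).hexdigest()
    return digest == aip.pdi.fixity


# Hypothetical example values throughout.
data = b"level 2 instrument data"
aip = ArchivalInformationPackage(
    content=ContentInformation(data, ["hypothetical format description"]),
    pdi=PreservationDescriptionInformation(
        reference="example-id-001",
        provenance="processed from level 1 data",
        context="part of an example mission archive",
        fixity=hashlib.sha256(data).hexdigest(),
        access_rights="open",
    ),
    packaging_information="single file in an example directory",
)
intact = fixity_ok(aip)

# Any change to the bits is detected by the fixity check.
aip.content.data_object = data + b"!"
intact_after_change = fixity_ok(aip)
print(intact, intact_after_change)
```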
Cost sharing
DRM
Preservable
infrastructure
USE DATA
• Use application to find data in
Repository
• Create DIP with enough RepInfo for the
user (via DC profile)
• Obtain more RepInfo from Registry if
necessary
APARSEN
Integration (1000) – JPA Integration
• 1100: Common Vision
• 1200: Staff and experience exchange
• 1300: Common standards
• 1400: Common testing environments
• 1500: Internal W/S & symposia
• 1600: Common tools, software repository and market place
Technical (2000) – JPA Research
• 2100: Preservation Services
• 2200: Identifiers & citability
• 2300: Storage solutions
• 2400: Authenticity & Provenance
• 2500: Interoperability & intelligibility
• 2600: Annotation, Reputation & data quality
• 2700: Scalability
Economic/Legal (3000) – JPA Research
• 3100: Digital Rights & access management
• 3200: Cost/benefit data collection and modelling
• 3300: Peer Review & 3rd party Certification
• 3400: Brokerage services
• 3500: Data policies and governance
• 3600: Business cases
Spreading excellence (4000) – JPA Spreading excellence
• 4100: External W/S & symposia
• 4200: Formal qualifications
• 4300: Training courses
• 4400: Awareness raising
• 4500: Liaison with other stakeholders
• 4600: International liaison
Management (5000)
• 5100: Financial management
• 5200: Technical co-ord.
• 5300: Evaluate impact of the Network of Excellence
JPA Research
Technical (2000)
• 2100: Preservation Services
• 2200: Identifiers & citability
• 2300: Storage solutions
• 2400: Authenticity & Provenance
• 2500: Interoperability & intelligibility
• 2600: Annotation, Reputation & data quality
• 2700: Scalability
Economic/Legal (3000)
• 3100: Digital Rights & access management
• 3200: Cost/benefit data collection and modelling
• 3300: Peer Review & 3rd party Certification
• 3400: Brokerage services
• 3500: Data policies and governance
• 3600: Business cases
Trust
• Certification of repositories
• Reputation and trustability of datasets, publications and people
• Authenticity
Sustainability
• Business cases
• Preservation
• Cost/benefit analysis
• Transfer of custody – who to hand over to and what to hand over
• Storage solutions
Usability
• Intelligibility
• Use by common tools
• Cross domain usability
• Interoperability
Access
• Identity of datasets, publications, people
• Rights and responsibilities
• Policies and governance
FUTURE
• Users may be unable to understand or use the data e.g. the
semantics, format, processes or algorithms involved
• Non-maintainability of essential hardware, software or support
environment may make the information inaccessible
• The chain of evidence may be lost and there may be lack of
certainty of provenance or authenticity
• Access and use restrictions may not be respected in the future
• Loss of ability to identify the location of data
• The current custodian of the data, whether an organisation or
project, may cease to exist at some point in the future
• The ones we trust to look after the digital holdings may let us
down
Links
• CASPAR – http://www.casparpreserves.eu
• CASPAR Source code - http://sourceforge.net/projects/digitalpreserve/
• OAIS Reference Model http://public.ccsds.org/publications/archive/650x0b1.pdf
– and the updated draft is available from
http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206500P11/Overview.aspx
• CASPAR Validation report
http://www.casparpreserves.eu/Members/cclrc/Deliverables/casparvalidation-evaluation-report/at_download/file
• PARSE.Insight:
– www.parse-insight.eu
• Alliance for Permanent Access:
– www.alliancepermanentaccess.eu
• Digital Curation Centre:
– www.dcc.ac.uk
END