DCC Presentation


Digital Curation Centre
a centre of support for data curation and preservation
UK Digital Curation Centre
One Year On
Liz Lyon, Associate Director, Outreach
Chris Rusbridge, DCC Director
Overview
• Why is digital curation important?
• What are the challenges that the DCC faces?
• About the people and our collaborative approach
• Addressing the issues
• How can you contribute to the DCC?
Curation?
“maintaining and adding value to a
trusted body of digital information for
current and future use”
Digital curation continuum
• Static: data preservation (for later use?)
• Dynamic: data curation (in use now, and the future?)
Assuring permanent access to the
records of science & the humanities?
Long term access to primary data
• Increasing data volumes from eScience and
Grid-enabled / cyberinfrastructure applications
• Changing research paradigm: data-driven
science, “big science”
• Observational data, simulations, large-scale
experimentation
• Multi-media resources, statistical data, surveys,
geo-spatial data……
Facilitate “post-processing” and
knowledge extraction
Enable the acquisition of newly-derived information and
knowledge
• Run complex algorithms over primary datasets
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological)
• Analysis (statistical, lexical, pattern matching, gene)
• Presentation (visualisation, rendering)
Provide additional functionality beyond
digital preservation processes
Annotations
• Gene and protein sequences
• e-Lab books (Smart Tea Project in chemistry)
The scholarly knowledge cycle: linking research data to publications
eBank UK Project
http://www.ukoln.ac.uk/projects/ebank-uk/
[Diagram] Components:
• Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
• Research & e-Science workflows
• Data analysis, transformation, mining, modelling
• Data curation: databases & databanks
• Repositories: institutional, e-prints, subject, data, learning objects
• Peer-reviewed publications: journals, conference proceedings
• Aggregator services: national, commercial
• Presentation services: subject, media-specific, data, commercial portals
[Diagram] Flows between components:
• Validation
• Deposit / self-archiving
• Publication
• Linking
• Harvesting metadata
• Searching, harvesting, embedding
• Resource discovery, linking, embedding
Emerging policy on open access to data
DCC people (some of them…)
• Management & Co-ordination
– Director Chris Rusbridge (University of Edinburgh)
• Community Support & Outreach
– Led by Dr Liz Lyon (UKOLN, University of Bath)
• Service Definition & Delivery
– Led by Professor Seamus Ross (HATII [ERPANET], University of
Glasgow)
• Development
– Led by Dr David Giaretta (Astronomical Software & Services,
CCLRC)
• Research
– Led by Professor Peter Buneman (Informatics, University of
Edinburgh)
The challenges we face
Standards
• Interoperability issues: technical & hopefully soluble
Scale
• Volume and diversity of datasets
Culture
• Bringing communities together
• Library/information science/archives “document tradition”
• Domain research (chemists, astronomers, biologists)
• Computer science (databases)
• Commercial suppliers (storage technology)
More challenges……
Process
• Highly-distributed organisation: use collaborative tools
Skills
• Distributed amongst the 4 partners & beyond
Engagement
• Lots of existing work and many significant players
Impact
• Visible & measurable, in the short & long-term
Meeting expectations (which are high…..)
• Of the community and our funders
User requirements analysis
Commissioned study
• Leona Carpenter
• Reporting now
• Desk-based research
• Focus groups
• Interviews
Results will inform research, development,
service definition / delivery and outreach
Recommendations and priority tasks
Some sound bites…
R&D issues: Annotation services, Ontology development, Automating
metadata creation, Tools and toolkits, Data Format Description
Language, Identifiers, Registries, Economic and cost-benefits studies
Advisory services: “Ask-a-Curator”, FAQs, reports, briefings,
awareness-raising materials, best practice guidance, Storage media,
“Like Erpanet”, advise Government, Research Councils, funding
bodies
Professional development: Short courses, conferences, seminars,
workshops, secondments to DCC and to working repository services
Outreach: Leadership for the future, case studies, sharing solutions,
collaboration with other partners, international peers, industry links
Taxonomy of “Users”
Outline Taxonomy of digital
curation users by role
1. Data Creators
2. Data Curators
3. Data Re-users
4. Policy makers (funding bodies, other leaders)
[Diagram; a second version of the slide adds Data Preservers and Data publishers]
Outline Taxonomy by significant
function of organisational entity
1. Research
2. Service provision
3. Learning & teaching
4. Funders
5. Policy / strategy makers
“Designated communities”
[Diagram; a second version of the slide adds Commercial]
Service definition & delivery
• Advisory services
– Responses to queries—from legal to technical guidance
[email protected]
– Site visits (National Institute of Environmental eScience)
• Information Services
– Briefing Documents - Freedom of Information by Mags
McGinley
– DIGITAL CURATION MANUAL
– 20 chapters written by community experts e.g. Metadata
written by Michael Day, UKOLN
– Peer-reviewed
– Checklist for Compliance with best practices and standards
– Technology Watch
Services: workshops
• 2005 Programme
– Preservation of medical databases:
24-25 May at the Gulbenkian Institute,
Lisbon in collaboration with
ERPANET & the Wellcome Trust
– Institutional repositories: 6 July at the
University of Cambridge, UK in
collaboration with DSpace
– Cost models: in collaboration with the
Digital Preservation Coalition, July, at the
British Library
– Persistent identifiers: liaising with NISO,
summer, UK location tbc
Development approach
• OAIS (Open Archival Information System)
linkage: focus on representation information
– link to global work on format registries?
– Concentrate on scientific data formats?
• Repository
– Representation Information
– Standards and Tools
– Aim for OAIS compliance
• Persistent identifiers
• Certification… RLG task force
• Open development wiki and email list
OAIS Reference Model – Functional Model
How relevant to curation?
Representation Net
Representation Information: more detail
How does this relate to format registries?
High Level View
Example of use of Representation Information labelling
Registry issues?
• Trusted repository of Representation Information
– Authenticity of information
– Access control
– Certificates/digests (are they trustworthy over the long
term?)
• Findability
– Persistent IDs
• What can we rely on?
– Labels (to support automated processing)
• Extensibility
• Distributed
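The digest question above can be made concrete with a small sketch. It assumes a hypothetical in-memory registry keyed by a persistent identifier; the record layout, the identifier string, and the choice of SHA-256 are illustrative assumptions, not the DCC registry design.

```python
import hashlib


def digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data using a named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def register(registry: dict, pid: str, rep_info: bytes, label: str) -> dict:
    """Store Representation Information under a persistent ID (hypothetical layout)."""
    record = {
        "label": label,            # machine-readable label for automated processing
        "algorithm": "sha256",     # recorded so the digest can be recomputed later
        "digest": digest(rep_info),
        "content": rep_info,
    }
    registry[pid] = record
    return record


def verify(registry: dict, pid: str) -> bool:
    """Recompute the digest and compare it with the stored value."""
    record = registry[pid]
    return digest(record["content"], record["algorithm"]) == record["digest"]


registry = {}
register(registry, "pid:example/format-1", b"FITS header description", "format-description")
print(verify(registry, "pid:example/format-1"))
```

Recording the algorithm name next to each digest is one possible response to the long-term trust question: a record can be re-verified, then re-digested with a stronger algorithm, if the original one is later considered weak.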
Registry development
• Simple PHP prototype
• Scoping study: unification
– Formats, standards, tools
• More robust prototype in development
– Based on ebXML & JAXR
– Potentially distributed, cooperative
maintenance model
Development Roadmap
• Registry: complete prototype, link to
PRONOM, GDFR etc, handover to
service
• Representation information: describe
CCLRC (science) data using EAST, etc
• Certification work continues
• Additional tools: metadata extraction
• Testbeds, interactions with others
Research approaches
• Publishing & integrating scientific databases
• ‘Archiving’ past states of volatile databases
• Database provenance and annotation
• Organisational dynamics of trusted
repositories
• Automating metadata extraction
• Cost-benefit analysis of data curation
• Rights and responsibilities
The database picture
Source data
Curated data: classified,
cleaned, annotated,
integrated, cross-linked
Curated Databases are Central
Much/most scientific data is now in databases
• They often do not contain source experimental data. Sometimes
just annotation/metadata
• They borrow extensively from, and refer to, other databases
• You are now judged by your data as well as your (paper)
publications!!
• These databases are built and maintained with a great deal of
human or computational effort.
What makes a database?
– it has internal structure or it changes.
Size alone doesn’t qualify
Archiving (preserving) volatile
databases
• How do you preserve something that changes every
hour or minute?
– Important for the scientific record – someone might have
cited your data at time t.
• Current practice
– Create versions (how often?)
– Log changes
– Use diffs
– Do nothing (common!)
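The "log changes" option can be sketched as follows. This is a minimal illustration, assuming a toy key-value database with integer timestamps; the class name `LoggedDatabase` and the log layout are hypothetical, not a description of any DCC tool. Replaying the log recovers the state that someone might have cited at time t.

```python
from typing import Any, Dict, List, Tuple

# One log entry: (timestamp, operation, key, value). Hypothetical layout.
Change = Tuple[int, str, str, Any]


class LoggedDatabase:
    """Toy key-value database that logs every change so past states can be rebuilt."""

    def __init__(self) -> None:
        self.current: Dict[str, Any] = {}
        self.log: List[Change] = []

    def _append(self, change: Change) -> None:
        # Assumption: changes arrive in non-decreasing timestamp order.
        if self.log and change[0] < self.log[-1][0]:
            raise ValueError("changes must be logged in timestamp order")
        self.log.append(change)

    def put(self, t: int, key: str, value: Any) -> None:
        self._append((t, "put", key, value))
        self.current[key] = value

    def delete(self, t: int, key: str) -> None:
        self._append((t, "delete", key, None))
        self.current.pop(key, None)

    def state_at(self, t: int) -> Dict[str, Any]:
        """Replay the log up to and including time t."""
        state: Dict[str, Any] = {}
        for ts, op, key, value in self.log:
            if ts > t:
                break
            if op == "put":
                state[key] = value
            else:
                state.pop(key, None)
        return state


db = LoggedDatabase()
db.put(1, "gene1", "annotation A")
db.put(2, "gene1", "annotation B")
db.delete(3, "gene1")
print(db.state_at(1))
print(db.state_at(2))
print(db.state_at(3))
```

Replay cost grows with the length of the log, which is one reason to combine logging with periodic full versions: a snapshot bounds how far back the replay has to start.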
Curated databases – some
issues
• Integrating and publishing data so that
someone else can use it.
• Annotating existing data and moving
annotations to other databases
• Provenance: where did this data come
from?
• Archiving: how do you preserve
something that is constantly changing?
How do we cite data?
• A URL or citation to an article is already
unsatisfactory.
– DCC client complaint: “I spend a lot of time
searching [electronic documents] for the part that
is relevant to the citation.”
• The problem is much worse when you are
citing something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation
persists?
– Connections with DB archiving and DOIs
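One way to picture a citation that both locates data and persists is to cite three things together: a database identifier, an archived version, and a path to the item inside that version. The sketch below is an assumption-laden illustration; the `DataCitation` class, the `id@version#/path` string form, and the in-memory archive are all hypothetical, and a real identifier might be a DOI.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical citation: which database, which archived version, which item."""
    database_id: str
    version: str
    path: Tuple[str, ...]

    def __str__(self) -> str:
        return f"{self.database_id}@{self.version}#/" + "/".join(self.path)


def resolve(citation: DataCitation, archive: Dict[str, Dict[str, Any]]) -> Any:
    """Follow the citation into the archived version and return the cited item."""
    node = archive[citation.database_id][citation.version]
    for step in citation.path:
        node = node[step]
    return node


# Two archived versions of the same (made-up) database.
archive = {
    "db:example": {
        "2005-05-01": {"proteins": {"P1": {"name": "kinase (provisional)"}}},
        "2005-06-01": {"proteins": {"P1": {"name": "kinase"}}},
    }
}

cite = DataCitation("db:example", "2005-05-01", ("proteins", "P1", "name"))
print(cite)
print(resolve(cite, archive))
```

The citation keeps resolving to the value as it stood in the cited version, even after the live database changes, which is where the connection to database archiving comes in.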
Research approaches
• Publishing & integrating scientific databases
• ‘Archiving’ past states of volatile databases
• Database provenance and annotation
• Organisational dynamics of trusted
repositories
• Automating metadata extraction
• Cost-benefit analysis of data curation
• Rights and responsibilities
– “Public domain, public interest, public funding”
paper, Waelde & McGinley
www.dcc.ac.uk
• www.ijdc.net
• Launch planned June/July
• Peer-reviewed contributions
• Peter Buneman, Editor (research)
• Production editor: Philip Hunter
Sample issue
• Full papers
• Invited articles
• News & views
Papers for submission are very welcome!
1st DCC International Conference
• Location - Bath UK
• 29-30 September 2005
• Keynote speakers
– Cliff Lynch, CNI
– Graham Cameron, European Bio-informatics Institute
• DCC Research update
• Social highlights
Associates Network
Goals
Develop understanding, share best practice, advance
research, promote recognition, develop consensus
Membership
International groups, national bodies, industry partners,
funders, research groups, HEIs, FEIs, individuals……
Benefits
Early access to R&D outputs, advisory services, training,
input to definition and design, community participation
Discussion Forum www.dcc.ac.uk
Please join us!
[Diagram of collaborating organisations]
Group labels: International Collaborations, Research Institutes, Research Councils, Standards Bodies, HEIs & FE
Names shown: BADC, Cambridge, Leicester, Jodrell Bank, NIEeS, ESO, RLG, CMS-Bristol, BODC, NASA, NARA, CNES, ESA, BNSC, RG, IVOA, SDSC, RI, UNC, CEH, DPC, Council for Museums, Archives & Libraries, EDG, GridPP, EGEE, So’ton, MIMAS, NOF, ILRT, CCLRC, NEODC, UKOLN, DELOS, AHDS, NeSC, UofE, DLI (US), Capri, IBM, Almaden, OCLC, CDS, JHU, CSIRO, TU Vienna, Caltech, Data Archive, LDC, Roslin, INRIA, MRC HGU, UPenn, Kyoto, USC, WT-CFG, IC, Maastricht, Durham, NTUA, HUJ, UPC, MaxPlanck, Dutch NA, Swiss NA, Urbino, Salzburg, EBI, GSK, ACM, Oxford, UofG, Innogen, NHS, NLA, OAI, NCS, Microsoft, Oracle, BT, STK, RDN, IASSIST
Acknowledgements
Slides from Peter Buneman,
David Giaretta and others used
with thanks.
How you can help us
• How does OAIS relate to curation?
• How do format registries relate to representation information?
• Who else is working across these areas?
• What outcomes would you like to see?