PowerPoint Presentation - Managing Long-Lived Digital Data

Managing Long-Lived Digital Data-sets
and their Curation: Interdisciplinary Policy Issues
Managing Digital Assets Forum
Washington, D.C.
October 28, 2005
James L. Mullins
Dean of Libraries
Purdue University
Environment
•Long-lived data collections are powerful catalysts for progress
•Need for digital collections – increasing rapidly
•NSB & NSF – leadership: comprehensive strategy, consistent policy framework
National Science Board, Long-Lived Digital Data Collections: Enabling Research
and Education in the 21st Century. National Science Foundation, September
2005. p.10
Environment
•Policies & strategies –
developed to facilitate management,
preservation, and sharing of digital data –
embrace heterogeneity in technical,
scientific and other features found across
the spectrum of digital data collections
National Science Board, Long-Lived Digital Data Collections: Enabling Research
and Education in the 21st Century. National Science Foundation, September
2005. p.10
Assumptions
• Scientific and Technical Researchers –
data proliferation
• Organization and retrieval of data – the challenge
• Solution - not apparent
• Information Technologists, Computer Scientists,
and Statisticians – “not our problem”
• Funding agencies - assess proposals partially on data
management plan
The Problem: Petabytes of Data, e.g.,
•Genomics – datasets – National Center for Biotechnology
Information (NCBI of NLM) – instruct researchers about
resources
•Atmospheric Data – climate change, weather prediction
•Geographical Information – seismological
•Astronomical Data – space exploration
•Nanotechnology – miniaturization of systems
•Nuclear engineering – maintenance of nuclear reactors
The Wants
•Researchers want consistent access to their own data,
now and in the future
•Researchers want informed (metadata) access to data of
colleagues
•Researchers want to share data through distributed access
or duplication of data-set
•Researchers want help in gaining this access
The Needs
•Taxonomy - categorization of data per research area
•Storage/Curation – management of data
•Metadata - data description to assist “data mining”
•Meta Search – locate and download data
•Distribution Grid – transmission
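To make the metadata and meta search needs concrete, here is a minimal Python sketch, not the presenter's system. It assumes a hypothetical `DatasetRecord` with a taxonomy field (`discipline`) and descriptive keywords, and a hypothetical `meta_search` helper that locates datasets by matching a term against those descriptions. All names, identifiers, and sample records are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical metadata record describing one dataset."""
    identifier: str
    title: str
    discipline: str                      # taxonomy category for the research area
    keywords: List[str] = field(default_factory=list)
    location: str = ""                   # where the data is held (illustrative only)


def meta_search(records, term):
    """Return records whose title, discipline, or keywords mention the term."""
    needle = term.lower()
    hits = []
    for rec in records:
        haystack = [rec.title, rec.discipline, *rec.keywords]
        if any(needle in text.lower() for text in haystack):
            hits.append(rec)
    return hits


# Illustrative catalog; the entries are invented for this sketch.
catalog = [
    DatasetRecord("ds-001", "Regional rainfall observations",
                  "atmospheric science", ["climate", "precipitation"], "repo-a"),
    DatasetRecord("ds-002", "E. coli genome annotations",
                  "genomics", ["bacteria", "sequence"], "repo-b"),
]

for rec in meta_search(catalog, "climate"):
    print(rec.identifier, rec.title, rec.location)
```

The search only reads the descriptions, never the data itself, which is why consistent metadata is what makes datasets findable across repositories.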
The Opportunities
•Taxonomy - disciplinary scientists, computer scientists and
librarians
•Storage/Curation – information technologists and librarians
•Metadata – librarians
•Meta Search – computer scientists and librarians
•Distribution Grid – information technologists
The Opportunities: NSF Data Scientists
•Taxonomy - Disciplinary Scientists, Computer Scientists & Librarians
•Principles of Library Science inform “Structure”
•Storage/Curation – Information Technologists and Librarians
•Librarians manage data as part of scholarly resources
•Metadata – Librarians
•Librarians create content description and retrieval points.
•Meta Search – Computer Scientists and Librarians
•Computer Scientists and Librarians collaborate in Distributed
Institutional Repository development: hardware and software
•Distribution Grid – Information Technologists
•Information Technologists build and operate the high-speed network
Purdue Libraries - Interdisciplinary Collaboration
•Taxonomy - Joint Proposal with Chemical Engineer to NSF
•Storage – Joint Proposal to EMC2 and Receipt of Equipment – 32T
•Metadata – Data Information Specialists (Scientists) created to
collaborate on research issues
•Meta Search – environmental and atmospheric datasets downloaded into the
Libraries’ Distributed Institutional Repository (DIR) in collaboration
with disciplinary researchers
•Distribution Grid – Purdue on TeraGrid (high-speed network linking 8
research centers, funded by NSF), testing access to and transmission of
large datasets from one researcher to another.
Librarians testing use of SRB with SDSC
Distributed Institutional Repository (DIR)
[Diagram. Labels: Users, Providers, Maintainers; Portal; Digital Commons; Electronic Thesis (ETDs); Photos; Earhart; E-journals; Datasets; PU Press; Databases; EMC; Climate data; E. Coli; Watershed; Docs; Other; OAI; Repository; LARS; SRB; Raw DB; TeraGrid; PTO]
Policies
•Interdisciplinary collaboration required
•Inter-institutional collaboration highly desirable
•Funding sources must be shared and clear
•Results must be replicable and contributed to
academe
•External funding highly desirable
•NSF/NIH definitions of massive datasets accepted
•Distributed Institutional Repository (DIR) goal
•Curation of data is not new; libraries have been
archiving raw data for centuries
http://www.lib.purdue.edu
Purdue University Libraries