
A SCRIPT FOR ARCHIVING DIGITAL RESEARCH DATA
IMPROVING ACCURACY AND EFFICIENCY IN THE DATAVERSE NETWORK
Rachel Carriere, Thu-Mai Christian, Erin Crane, & Cheryl Thompson | School of Information & Library Science | University of North Carolina at Chapel Hill
ABSTRACT
The H. W. Odum Institute for Research in Social Science at the University of North Carolina at
Chapel Hill collects and preserves digital social science research data and makes it publicly
available online for discovery and secondary analysis via the Dataverse Network (DVN). The
current data ingest workflow requires many tasks across several software programs to correct
truncated data variable labels, a result of the 255-character limit in statistical software
packages. Because of this limitation, ingesting data into the archive is often a complex
process, and it introduces a single point of failure that can result in data corruption.
An examination of the data ingest workflow presents an opportunity to eliminate this single
point of failure by introducing a newly developed script that automates the process of
correcting truncated data variable labels, thus preserving the complete archival record.
RESULTS
To avoid the risks and the single point of failure in the data ingest workflow, a Python script was
developed to eliminate direct user interaction with the DVN database. Instead, the script runs
background processes that locate the appropriate record, read the TXT file containing the complete
data variable labels, and communicate with the DVN database to correct any truncated labels. The
burden on the archivist is reduced, and records in the DVN are accurate and complete.
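The truncation problem described above can be illustrated with a short Python sketch. It is not the authors' script. The 255-character limit comes from the text, but the layout of the TXT label file (one variable name and its full label per line, separated by a tab) and the function names are assumptions made for illustration.

```python
# Assumption: 255 is the label length limit named in the text.
LABEL_LIMIT = 255


def parse_label_file(text):
    """Return {variable_name: full_label} from hypothetical tab-separated lines."""
    labels = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, label = line.partition("\t")
        labels[name.strip()] = label.strip()
    return labels


def find_truncated(stored, full):
    """Return names whose stored label is a proper prefix of the full label."""
    return sorted(
        name
        for name, label in stored.items()
        if name in full and full[name] != label and full[name].startswith(label)
    )


# Simulate what a statistical package would keep of each label.
long_label = "Q1. " + "x" * 300
full = parse_label_file("q1\t" + long_label + "\nq2\tAge of respondent\n")
stored = {name: label[:LABEL_LIMIT] for name, label in full.items()}
print(find_truncated(stored, full))  # prints ['q1']
```

Only the variable whose label exceeds the limit is reported, so short labels are left alone.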
METHOD
This poster reports findings from an analysis of the DVN data ingest workflow and presents
one solution for improving the efficiency and accuracy of data ingest. Several observations
and interview sessions were conducted to study the various tasks and tools involved in the
current workflow. Models illustrating the workflow and tools were developed to assist in
the identification of points of failure and opportunities for improvement. The model below
highlights deficiencies in the current data ingest workflow.
[Figure: model of the current data ingest workflow]
TRIGGER: RECEIVE DATA SUBMISSION PACKAGE
Transform submission package into archival package
• Store data submission package
• Convert submission files to archival preservation formats (e.g., .pdfa, .por)
• Create catalog record and upload files into Dataverse Network (DVN)
Edit the data file for completeness and accuracy
• Decide whether data file requires editing
• Create text file for data edits
• Create SQL code for data edits
• Apply edits to data file in DVN (possible single point of failure if edits are applied to the wrong data in DVN)
• Verify edits were performed
Make the archival package available
• Publicly release archival package in DVN
Figure label: REPLACED BY SCRIPT (marks the manual tasks that the script automates)
GOALS
The growing use of and dependence on digital research data have prompted funding agencies to issue
mandates requiring researchers to develop a data management plan that includes details about data
access, distribution, and archiving. As at other research universities, the Provost of the University
of North Carolina has assembled a task force to develop recommendations on the stewardship of digital
research data. As a result, interest in the digital data archive has grown. The Dataverse Network
platform offers a solution to social science data management and preservation needs; however, a
script that addresses an inherent challenge confronting archivists is necessary to increase the
functionality of the DVN and the usefulness of its records.
THE SCRIPT
1. The archivist uploads the data files to the Dataverse Network (DVN) and notes the
automatically generated Universal Numerical Fingerprint (UNF).
2. The archivist starts the Python script, which prompts the archivist to enter the UNF and
the file path to the TXT data variable label file.
3. The Python script communicates with the DVN PostgreSQL database engine to identify the
appropriate record and overwrite truncated data variable label strings with the correct
strings.
4. The script displays the modified data variables to the archivist for quality control and
documentation.
5. The data variable labels are complete in the DVN, which enables discovery and proper
analysis of the data.
SUMMARY
Scripting offers the power of customizing archival platforms and technologies to meet the needs of
today’s digital archival collections, archivists, and the research professionals who depend on them.
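Steps 2 through 4 of the script described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it uses an in-memory SQLite database as a stand-in for the DVN PostgreSQL database, and the table names (datafile, datavariable), column names, the function correct_labels, and the UNF value are all hypothetical. With a PostgreSQL driver such as psycopg2, the query placeholder would be %s rather than ?.

```python
import sqlite3


def correct_labels(conn, unf, full_labels):
    """Overwrite truncated labels for the file identified by unf; return the changes."""
    cur = conn.cursor()
    cur.execute("SELECT id FROM datafile WHERE unf = ?", (unf,))
    rows = cur.fetchall()
    if len(rows) != 1:
        # Refuse to edit unless exactly one record matches, so that edits
        # cannot be applied to the wrong data.
        raise LookupError(f"expected one record for UNF {unf!r}, found {len(rows)}")
    file_id = rows[0][0]
    cur.execute(
        "SELECT name, label FROM datavariable WHERE datafile_id = ? ORDER BY name",
        (file_id,),
    )
    changed = []
    for name, stored in cur.fetchall():
        full = full_labels.get(name)
        # Only touch labels that are a proper prefix of the complete label.
        if full and full != stored and full.startswith(stored):
            cur.execute(
                "UPDATE datavariable SET label = ? WHERE datafile_id = ? AND name = ?",
                (full, file_id, name),
            )
            changed.append((name, stored, full))
    conn.commit()
    return changed


# Hypothetical schema and sample data standing in for the DVN database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datafile (id INTEGER PRIMARY KEY, unf TEXT);
    CREATE TABLE datavariable (datafile_id INTEGER, name TEXT, label TEXT);
    """
)
full_labels = {"q1": "Q1. " + "x" * 300, "q2": "Age of respondent"}
conn.execute("INSERT INTO datafile VALUES (1, 'UNF:3:example')")
conn.executemany(
    "INSERT INTO datavariable VALUES (1, ?, ?)",
    [(name, label[:255]) for name, label in full_labels.items()],
)

# Report each modified variable for quality control (step 4).
for name, old, new in correct_labels(conn, "UNF:3:example", full_labels):
    print(f"{name}: {len(old)} -> {len(new)} characters")
```

Looking the record up by UNF and stopping unless exactly one record matches is one way to guard against the single point of failure noted in the workflow model. The returned list of changes gives the archivist a record of what was modified.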
NEXT STEPS
• Convert the Python script to a Java GUI application to improve usability
• Integrate the Java (JSP) application into the DVN web interface for data submissions
• Test the script with researchers and data producers to understand how the data ingest process could
be integrated into the research life cycle
Acknowledgements | Thanks to Jonathan Crabtree, Assistant Director of Archive and Information Technology, Odum Institute; Dr. Stephanie Haas, Systems Analysis Professor; & Freeman Lo, Applications Analyst