file naming and versioning

Introduction to Managing
Research and Personal Data
TODAY’S TOPICS
• What is Data? Why manage?
• How to make data findable
• File naming and Versioning
• File formats
• Storage and Back-up
• Searching publicly available data
• Research Data Services @ NCSU
WHAT IS DATA?
• Raw “stuff” lacking context
• “Essence of science” (Greenberg et al., 2009)
• One person’s data is another’s information
Adapted from Greenberg & Baker (2012)
• http://www.niso.org/news/events/2012/dcmi/scientific_data/
Data comes from different sources
• Observational
• Experimental
• Simulation
• Derived or compiled
WHAT IS DATA?
• Observational: captured in real-time, usually
irreplaceable
– e.g. sensor data, telemetry, survey data, sample
data
• Experimental: data from lab equipment, often
reproducible, but expensive
– e.g. gene sequences, chromatograms, toroid
magnetic field data
WHAT IS DATA? contd.
• Simulation: data generated from test models
where model and metadata (inputs) are more
important than output data
– e.g. climate models, economic models
• Derived or compiled: data that is reproducible
(but very expensive)
– e.g. text and data mining, compiled database, 3D
models, data gathered from public documents
WHY MANAGE DATA?
• Because it’s good to avoid disasters
– and to make data easy to find, share, use …
YouTube Break
HOW TO MAKE DATA FINDABLE
• Include enough information about data
• Structured information about an object (data)
• Jargon: METADATA = “data about data”
Greenberg & Baker (2012)
FILE NAMING AND VERSIONING
• If you follow a naming convention for your files
and folders…
– You won’t accidentally overwrite or delete files
– You will locate specific files more easily
– You will preserve significant differences between
different versions of the same file
– You won’t cause confusion if multiple people are
sharing the files
FILE NAMING AND VERSIONING
• Keep names short, if possible
• Use underscores ( _ ) instead of spaces
• Include a date in YYYY-MM-DD format
– ISO 8601 standard
• Don’t include the folder name in the file name
• Include a version number (v1, v01) at the end
• For “final” version, include the word FINAL
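The conventions above can be sketched in a few lines of Python. This is only an illustration: the function name build_file_name, the order of the name parts, and the placement of FINAL after the version number are assumptions, not something the slides prescribe.

```python
import re
from datetime import date


def build_file_name(description, day, version, ext, final=False):
    """Build a name like soil_survey_2013-04-02_v01.csv (hypothetical helper)."""
    # Use underscores instead of spaces.
    stem = re.sub(r"\s+", "_", description.strip())
    # ISO 8601 date (YYYY-MM-DD), then a zero-padded version number.
    parts = [stem, day.isoformat(), f"v{version:02d}"]
    if final:
        # Assumption: the word FINAL goes after the version number.
        parts.append("FINAL")
    return "_".join(parts) + "." + ext.lstrip(".")


print(build_file_name("soil survey", date(2013, 4, 2), 1, "csv"))
print(build_file_name("soil survey", date(2013, 4, 9), 3, "csv", final=True))
```

Zero-padding the version (v01 rather than v1) keeps files in order when a folder is sorted by name.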
FILE NAMING AND VERSIONING
• Turn on “track changes” or other version
tracking function in Office, Google Docs, etc.
• For code, use versioning software
– Subversion, OpenCVS, Git, etc.
• Most of all, be consistent
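One way to check consistency is to test file names against a pattern. The sketch below assumes a hypothetical convention of description, ISO 8601 date, two-digit version, and an optional FINAL marker; the pattern and the function name nonconforming are illustrative only and would need adjusting to whatever convention a group actually adopts.

```python
import re

# Hypothetical convention: <description>_<YYYY-MM-DD>_v<NN>[_FINAL].<ext>
NAME_PATTERN = re.compile(
    r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*"  # description, underscores, no spaces
    r"_\d{4}-\d{2}-\d{2}"               # date in YYYY-MM-DD format
    r"_v\d{2}"                          # two-digit version number
    r"(?:_FINAL)?"                      # optional FINAL marker
    r"\.[A-Za-z0-9]+$"                  # file extension
)


def nonconforming(names):
    """Return the names that do not follow the assumed convention."""
    return [name for name in names if not NAME_PATTERN.match(name)]


files = [
    "soil_survey_2013-04-02_v01.csv",
    "soil survey final.csv",
    "notes_v2.txt",
]
print(nonconforming(files))
```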
FILE FORMATS
• Will I be able to open this file in ten years?
• Is this format proprietary or open?
• Does this format preserve as much of the
original information as possible?
• Is this format widely used and supported?
FILE FORMATS
• For text: TXT, RTF, XML (DOCX, ODT, etc.)
• For tabular data: CSV, TSV
• For databases: CSV, SQL
• For images: TIFF, JPEG 2000 (lossless), SVG
• For audio: MP3, FLAC (lossless), WAV
• For video: AVI, MP4 with H.264
– Uncompressed, where possible
STORAGE OPTIONS
• Hard drives
• Flash drives and optical media
• Google Drive
• Dropbox, Microsoft SkyDrive, iCloud
• SpiderOak
• Velocity – temporary, for transferring files
Back-up Strategy
• Have a regular back-up routine
• Keep copies both locally and in the cloud
• Important to track versions
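A minimal sketch of one local back-up step follows, assuming each day's copies go into a folder named with the date so that earlier versions are kept. The function names sha256_of and back_up and the dated-folder layout are hypothetical; copying to a cloud service is outside the scope of this example.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_root, day=None):
    """Copy source into backup_root/<YYYY-MM-DD>/ and verify the copy."""
    source = Path(source)
    day = day or date.today()
    # Assumption: one dated folder per back-up run keeps older versions.
    target_dir = Path(backup_root) / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Back-up of {source} does not match the original")
    return target
```

Comparing checksums after the copy catches a corrupted or truncated back-up at the time it is made rather than when the file is needed.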
PUBLICLY AVAILABLE DATA SETS
• Genomic, geospatial, climate, census, etc.
• Many researchers are opening up their data on
the web at the request of funding agencies
• Federal government data is open at data.gov
• Some public data is available through:
– google.com/publicdata
– aws.amazon.com/publicdatasets
– Open Data repositories
RESEARCH DATA SERVICES
• Libraries
– Data Management Planning guide for researchers
• DMPTool utility and DMP review service
– Geospatial and Numeric Data Services
– Copyright guidance
• OIT
– Storage, technical support
• SPARCS
– Grants and external funding
Discussion
• Mohan Ramaswamy
– [email protected]