data sets - Department of Information and Computing Sciences
http://www.dans.knaw.nl
Dirk Roorda, coordinator infrastructure
Overview
Part 1: The rising role of data
Part 2: The free use of data
Part 3: The care for data
Part 4: The re-use of data
Part 1: The rising role of data
http://en.wikipedia.org/wiki/Exabyte
Internet size (May 2009): 500 EB
500,000 PB
500 million TB
500 million fat USB disks
500 billion memory cards of 1 GB
70 memory cards per person
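The conversions on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) prefixes, a "fat USB disk" of 1 TB, and a 2009 world population of roughly 6.8 billion; none of these assumptions is stated on the slide itself.

```python
# Decimal (SI) storage units, expressed in bytes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

internet_size = 500 * EB  # figure quoted on the slide (May 2009)

# Assumption: roughly 6.8 billion people in 2009 (not given on the slide).
WORLD_POPULATION_2009 = 6_800_000_000

petabytes = internet_size // PB
terabytes = internet_size // TB        # also the number of 1 TB disks
gigabyte_cards = internet_size // GB   # number of 1 GB memory cards
cards_per_person = gigabyte_cards / WORLD_POPULATION_2009

print(f"{petabytes:,} PB")
print(f"{terabytes:,} TB")
print(f"{gigabyte_cards:,} cards of 1 GB")
print(f"about {cards_per_person:.0f} cards per person")
```

Under these assumptions the result lands slightly above the slide's rounded figure of 70 cards per person.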
Data deluge
http://www.datadeluge.com/
http://en.wikipedia.org/wiki/File:Tree_of_life_SVG.svg
http://tolweb.org/tree/
Where does it come from?
• Instruments
• satellites, sensors, dna-sequencing
• Records
• administrations, censuses, surveys
• Digitisation
• the analog legacy
• Hobby
• pictures, movies, genealogy
• Integration
• better interoperability of existing data
The driving force
Information and Communication Technology
Babbage Analytical Engine
1870
A datacenter
Genealogy
2.5 PB
5328 servers
1.12 MW
http://www.ancestry.com/
http://blog.familytreemagazine.com/insider/Inside+Ancestrycoms+TopSecret+Data+Center.aspx
A closer look
• Linguistics
• text corpora, automatic translation
• Philology
• how to read a million books?
• History
• historical census data
• Archaeology
• archive law, commercial research
Linguistics and Philology
• A chronometric approach to Indian alchemical literature
• Assessing frequency changes in multistage diachronic corpora
• Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets
• A Corpus Study of the Rigveda
• Dictionary generation for less-frequent language pairs using WordNet
• An exercise in non-ideal authorship attribution: the mysterious Maria Ward
http://llc.oxfordjournals.org/
History
http://www.volkstellingen.nl/nl/
http://www.volkstellingen.nl/en/
Archaeology
http://edna.itor.org/nl/intern/upload_directory/a00002/downloads/IMG0013.tif
Archaeology (2)
http://edna.itor.org/nl/oai/oai_addi/oai_addi/OAI:EVALMA:a00002.xml/
Part 2: The free use of data
Open Access
Data is information
Information is knowledge
Knowledge is power
Why share it?
Open Access
Shared knowledge is double knowledge
Without free sharing of knowledge, scientific progress will halt
Tensions between sharing and not sharing remain, though
A good example
http://www.ploscompbiol.org/home.action
Work to do
• organise your data
• let your data work together with those of others
• (colleagues, future scientists, the public)
• ask new questions of the data
• because there is so much of it
• create new (virtual) data collections
Part 3: The care for data
Research Data Recycling
• existing data
• collecting by experiments, surveys
• primary research data
• verifying results by others
• preserving unique data from experiments
• compilation, aggregation, annotation
• databanks
• data mining, analysis, visualisation
• new data as research input
Challenge: Software
Operating system (DOS, Windows 95, ...)
Programming Languages (Basic, Pascal)
File formats (WordPerfect, dBase)
Applications (address books, websites)
Old data may be locked up in old software.
Meeting the challenge
To prevent the problem in the future
Backward compatibility
Open Standards
Open Source Applications
Modular software engineering
keep data separated from interface and business logic
To remedy the problems of the past
Emulation
Migration
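Migration can be sketched as reading records out of a legacy layout and writing them to an open format. The example below is only an illustration: the fixed-width layout, the field names, and the sample records are all invented, and CSV stands in for whatever open format suits the data.

```python
import csv
import io

# Hypothetical legacy layout: fixed-width fields, as an old address book
# application might have stored them. Names and widths are invented.
LEGACY_FIELDS = [("name", 20), ("city", 15), ("year", 4)]


def parse_legacy_record(line):
    """Slice one fixed-width line into a dict of stripped field values."""
    record = {}
    offset = 0
    for field_name, width in LEGACY_FIELDS:
        record[field_name] = line[offset:offset + width].strip()
        offset += width
    return record


def migrate_to_csv(legacy_text):
    """Convert legacy fixed-width text into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=[name for name, _ in LEGACY_FIELDS],
        lineterminator="\n",
    )
    writer.writeheader()
    for line in legacy_text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(parse_legacy_record(line))
    return out.getvalue()


# Invented sample records in the hypothetical legacy layout.
legacy = (
    "A. Jansen".ljust(20) + "Utrecht".ljust(15) + "1995\n"
    + "B. de Vries".ljust(20) + "Den Haag".ljust(15) + "1998\n"
)
print(migrate_to_csv(legacy))
```

The parsing step is the only part that knows about the old layout, so the data stays separate from any interface or business logic built on top of it.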
Challenge: Human organisation
Forgotten jargon
Forgotten knowledge
No metadata
Websites with broken links
Jargon
• II.17. Posterior berry aneurysm with subarachnoid bleed.
• II.18. Subarachnoid bleed with extension into the ventricles.
• II.19. Ruptured berry aneurysm at the end of the internal carotid artery, with obstructive hydrocephalus. Morgagni found the rupture.
• II.22. Subarachnoid hemorrhage.
http://www.pathguy.com/morgagni.htm
Meeting the challenge
Persistent Identifiers
Enough Metadata
Codification of knowledge and practices
Wikipedia
Data management early on
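The idea behind persistent identifiers can be sketched as one level of indirection: publications cite an identifier, and a resolver maps it to wherever the data currently lives. The class, the identifier, and the URLs below are hypothetical and do not describe any particular identifier system.

```python
class IdentifierResolver:
    """Maps persistent identifiers to the current location of a resource."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Record (or update) where the resource behind an identifier lives."""
        self._locations[identifier] = url

    def resolve(self, identifier):
        """Return the current URL; raises KeyError for an unknown identifier."""
        return self._locations[identifier]


resolver = IdentifierResolver()

# Hypothetical identifier and URLs, invented for this example.
resolver.register("example:dataset-0001",
                  "https://old-server.example.org/data/0001")

# The data moves: only the resolver entry changes, not the cited identifier.
resolver.register("example:dataset-0001",
                  "https://repository.example.org/datasets/0001")

print(resolver.resolve("example:dataset-0001"))
```

A citation that uses the identifier keeps working after the move, which is how this approach avoids websites with broken links.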
Part 4: The re-use of data
Data management
Use common infrastructure rather than private means
Use open formats rather than proprietary formats
Use open source software rather than closed software
Use standard ways of documenting data: taxonomies, ontologies, metadata schemes
Common Infrastructure
Local file shares
University repository
DANS
European Infrastructures
DANS
http://easy.dans.knaw.nl/dms
EASY
Dataset
Datafiles
Metadata
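The slide's picture of a dataset as data files plus metadata can be sketched as a small data structure. This is an illustration only: the required fields (title, creator, date) and the SHA-256 checksums are assumptions made for this example, not a description of how EASY stores datasets.

```python
import hashlib
from dataclasses import dataclass, field

# Assumption: a minimal set of descriptive fields every dataset should carry.
REQUIRED_METADATA = ("title", "creator", "date")


@dataclass
class Dataset:
    """A dataset is its data files together with the metadata describing them."""
    metadata: dict
    datafiles: dict = field(default_factory=dict)  # file name -> content (bytes)

    def missing_metadata(self):
        """List required metadata fields that are absent or empty."""
        return [key for key in REQUIRED_METADATA if not self.metadata.get(key)]

    def checksums(self):
        """Compute a SHA-256 checksum per file, useful for later integrity checks."""
        return {
            name: hashlib.sha256(content).hexdigest()
            for name, content in self.datafiles.items()
        }


# Invented sample dataset with one metadata field left empty.
dataset = Dataset(
    metadata={"title": "Example census table", "creator": "", "date": "2009"},
    datafiles={"table.csv": b"year,population\n"},
)
print(dataset.missing_metadata())
print(dataset.checksums())
```

Checking for missing metadata at deposit time is one way to act on "enough metadata" before the knowledge needed to fill it in is forgotten.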
Linguists make their technology accessible
- resources, algorithms, techniques
Humanities and social sciences
- they are the target users
Geleerdenbrieven = Circulation of Knowledge
Archiving = circulation of information
Keep imagining