A Platform for Auditable, Distributed, Asymmetric - Data-PASS

Micah Altman
Associate Director, Harvard-MIT Data Center
Institute for Quantitative Social Science, Harvard University
Bryan Beecher
Director of Computing and Network Services
Inter-university Consortium for Political and Social Research, University of Michigan
Marc Maynard
Director of Technical Services
The Roper Center for Public Opinion Research, University of Connecticut
Jonathan Crabtree
Assistant Director for Archives and Information Technology
H.W. Odum Institute for Research in Social Science, University of North Carolina
CNI 2008 Fall Task Force Meeting
Our Story
• Who are you guys?
• What problem are you trying to solve?
• What have you done?
• Why do we care?
Data-PASS
• Partnership devoted to identifying, acquiring and preserving data at risk of being lost to the social science research community
• Partners
– ICPSR
– Odum Institute
– Harvard-MIT Data Center
– Roper Center
– National Archives
http://flickr.com/photos/phauly/35555985/
Data-PASS
Data-PASS
• Lots of little files (social science data)
– ASCII data files
– PDF technical documentation (codebooks)
– Millions of ‘em
• Archival storage
– Was tape
– Now disk
Before
After
Archival storage?
http://failblog.org/2008/02/08/floppy-fail/
Archival storage?
• Remote disks
• Grids
• Clouds
• With partners?
Why roll your own?
• Policy-driven
• Auditable
• Asymmetric
• Independence of each location
Syndicated Storage Platform (SSP)
• Start with LOCKSS
– Lots of Copies Keep Stuff Safe
• But used in a closed network
– Private LOCKSS Network (PLN)
– A few of them out there
  • MetaArchive perhaps the best known
• Biggest selling point was independence of each node in the PLN
PLNs
• LOCKSS is really easy to set up
• PLNs are more difficult
• Other differences between a traditional PLN and our needs
– Our content isn’t harvestable via HTTP
– Our PLN nodes are different sizes
– Our trust model requirement prevents a centralized authority controlling the network
SSP = Stone Soup Platform?
• ICPSR and Odum set up a small PLN
• HMDC provided a harvester and designed the schema
• Odum built the Comparator
• Roper is building the Invitor
PLN
Schema
• Nodes
– IP address
– Storage commitment
• AUs
– Max size
– # in the PLN
• Lots more
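
As a rough illustration of the schema items listed above, the following Python sketch models nodes and AUs (LOCKSS archival units). The class names, field names, byte units, and example values are all assumptions made for illustration; the slide does not show the actual Data-PASS schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str                # hypothetical label for a partner site
    ip_address: str          # slide: "IP address"
    storage_commitment: int  # slide: "Storage commitment" (bytes assumed)


@dataclass(frozen=True)
class AU:
    au_id: str     # hypothetical identifier
    max_size: int  # slide: "Max size" (bytes assumed)
    copies: int    # slide: "# in the PLN", read here as the desired replica count


@dataclass
class Schema:
    nodes: List[Node]
    aus: List[AU]

    def required_storage(self) -> int:
        # Worst case: every AU at its maximum size, replicated `copies` times.
        return sum(au.max_size * au.copies for au in self.aus)

    def committed_storage(self) -> int:
        return sum(node.storage_commitment for node in self.nodes)

    def is_feasible(self) -> bool:
        # Necessary (not sufficient) conditions: no AU asks for more replicas
        # than there are nodes, and total commitments cover the worst case.
        enough_nodes = all(au.copies <= len(self.nodes) for au in self.aus)
        return enough_nodes and self.required_storage() <= self.committed_storage()


# Example with made-up values; node sizes differ, as in an asymmetric PLN.
schema = Schema(
    nodes=[Node("a", "192.0.2.1", 100), Node("b", "192.0.2.2", 50)],
    aus=[AU("au1", 30, 2), AU("au2", 40, 1)],
)
```

The feasibility check is deliberately coarse: it does not attempt to place individual AUs on individual nodes.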
Comparator
• diff for our SSP
• Compares
– Contents of the LOCKSS Cache Manager [sic]
– Schema
• Produces
– List of differences between “what is” and “what should be”
– Feeds into another tool for “fixing the PLN”
• Machine-actionable output (XML)
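
The comparison described above can be sketched as a small Python function. This is not the Odum Comparator itself: the input shapes (a mapping of AU to desired copy count for "what should be", and a mapping of node to held AUs for "what is") and the XML element and attribute names are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set


def compare(desired_copies: Dict[str, int],
            holdings: Dict[str, Set[str]]) -> ET.Element:
    """Return an XML report of AUs whose actual copy count differs from the schema."""
    report = ET.Element("differences")  # hypothetical root element name

    # Consider every AU named in the schema or found on any node.
    all_aus = set(desired_copies)
    for held in holdings.values():
        all_aus |= held

    for au in sorted(all_aus):
        holders = sorted(node for node, held in holdings.items() if au in held)
        expected = desired_copies.get(au, 0)  # AUs absent from the schema expect 0 copies
        if len(holders) != expected:
            diff = ET.SubElement(report, "difference", au=au,
                                 expected=str(expected), actual=str(len(holders)))
            for node in holders:
                ET.SubElement(diff, "holder", node=node)
    return report


# Example with made-up data: au1 is under-replicated, au2 is over-replicated.
desired = {"au1": 2, "au2": 1}
actual = {"icpsr": {"au1", "au2"}, "odum": {"au2"}}
report = compare(desired, actual)
print(ET.tostring(report, encoding="unicode"))
```

Emitting XML rather than plain text keeps the report machine-actionable, so a follow-on tool can parse it without scraping.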
Invitor
• Reads the report from the Comparator
• Issues requests to PLN nodes to ADD or DROP an AU
– Expectation is that PLN nodes always accept an ADD if they can
  • An offer they cannot refuse
• Requests may be reviewed/approved by a human administrator (or not)
• USENET news technology?
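
A minimal Python sketch of the flow above: read a difference report, turn it into ADD or DROP requests, and optionally pass each request through an approval step. The report format, the `plan_requests` name, and the `approve` callback are hypothetical; the slides do not specify how requests are actually delivered to nodes, so no transport is shown.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple

Request = Tuple[str, str, str]  # (action, node, AU id)


def plan_requests(report_xml: str,
                  nodes: List[str],
                  approve: Callable[[Request], bool] = lambda req: True) -> List[Request]:
    """Turn a difference report into ADD/DROP requests, filtered by `approve`."""
    requests: List[Request] = []
    for diff in ET.fromstring(report_xml).findall("difference"):
        au = diff.get("au")
        expected = int(diff.get("expected"))
        holders = [h.get("node") for h in diff.findall("holder")]
        if len(holders) < expected:
            # Under-replicated: ask nodes that lack the AU to ADD it.
            candidates = [n for n in nodes if n not in holders]
            proposed = [("ADD", n, au) for n in candidates[:expected - len(holders)]]
        else:
            # Over-replicated: ask the surplus holders to DROP it.
            proposed = [("DROP", n, au) for n in holders[expected:]]
        # `approve` stands in for the optional human review step.
        requests.extend(req for req in proposed if approve(req))
    return requests


# Example with a made-up report in an assumed XML format.
example_report = (
    '<differences>'
    '<difference au="au1" expected="2"><holder node="icpsr"/></difference>'
    '<difference au="au2" expected="1">'
    '<holder node="icpsr"/><holder node="odum"/></difference>'
    '</differences>'
)
planned = plan_requests(example_report, ["icpsr", "odum", "roper"])
```

This sketch picks target nodes in list order and ignores each node's storage commitment; a real tool for an asymmetric network would need to weigh available capacity when choosing where to send an ADD.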
Summary
• Data-PASS is a group of archives committed to preserving social science data
• Exploring various technology options
• One avenue is a custom LOCKSS deployment
– Network schema
– OAI data harvester
– Comparison tool
– Network update tool