Minerva: The Web Preservation Project
1
Team Members
Main Reading Room
Library of Congress: Roger Adkins, Cassy Ammen, Allene Hayes, Melissa Levine, Diane Kresh, Jane Mandelbaum, Barbara Tillett
Cornell University: William Arms
Internet Archive: Brewster Kahle, Scott Kirkpatrick
2
1. Open Access Materials on the Web
3
4
Approaches to Collecting and Preservation of the Web
CLOSED ACCESS
• Partnership with publishers: publishers and libraries as partners
• Selective collection of open access web: librarianship in a new domain
• Bulk collection of open access web: automated processes
OPEN ACCESS
5
Web Preservation Project Pilot
• Small number of web sites nominated by selection officers.
Three chosen for close study.
http://www.whitehouse.gov/
http://www.algore2000.com/
http://www.georgewbush.com/
• Copies downloaded using HTTrack mirroring program.
Inspected for errors, anomalies, etc.
• Catalog records created using OCLC's CORC software.
Loaded into the Library of Congress's ILS.
• Trial web site developed to evaluate user access.
• Discussions with Copyright Office on legal issues.
6
Example: The Internet Archive
7
Example: National Library of Australia
8
Example: National Library of Sweden
9
2. Selection and Collection
10
Collecting: Making a Snapshot
[Diagram: Web site, Download, Snapshot, Archive]
A web site is downloaded, using a mirroring program.
A snapshot is stored in an archive.
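A minimal sketch of the storage half of this step, assuming the mirroring program has already downloaded the site's files. The folder layout (one folder per host and per collection time) and the names `store_snapshot` and `archive_root` are illustrative assumptions, not the pilot's actual design.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def store_snapshot(archive_root, site_url, files, collected_at=None):
    """Write one snapshot of a site into the archive and return its folder.

    `files` maps a relative path (e.g. "index.html") to its bytes, as a
    mirroring program might have downloaded them. `collected_at` should be
    a timezone-aware datetime; the current UTC time is used if omitted.
    Hypothetical layout: <archive_root>/<host>/<UTC timestamp>/<relative path>
    """
    collected_at = collected_at or datetime.now(timezone.utc)
    stamp = collected_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    snapshot_dir = Path(archive_root) / urlparse(site_url).netloc / stamp
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    for relative_path, content in files.items():
        target = snapshot_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    return snapshot_dir
```

Keeping the collection time in the folder name means later snapshots of the same site never overwrite earlier ones.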
11
Collecting: Periodic Snapshots
[Diagram: Web site; Snapshot 1, Snapshot 2, Snapshot 3; Archive]
At selected time intervals additional snapshots are made.
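One way to decide when the next snapshot is needed is to compare the time since the last one with the interval chosen for that site. The sketch below assumes a fixed interval per site; the function name `snapshot_due` and the example dates are hypothetical.

```python
from datetime import datetime, timedelta


def snapshot_due(last_snapshot, interval, now):
    """Return True when `interval` or more has passed since the last snapshot.

    A site that has never been collected (last_snapshot is None) is always due.
    """
    if last_snapshot is None:
        return True
    return now - last_snapshot >= interval


# Hypothetical check on 15 October 2000 for two sites last collected on 1 October.
now = datetime(2000, 10, 15)
print(snapshot_due(datetime(2000, 10, 1), timedelta(days=7), now))   # weekly site
print(snapshot_due(datetime(2000, 10, 1), timedelta(days=30), now))  # monthly site
```

A site collected "depending on circumstances" would need a different trigger, such as a selection officer's request, which this sketch does not cover.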
12
Very Rough Estimates
There are no good estimates of how many Web sites the Library of Congress would wish to collect and preserve.
OCLC's Web Characterization Project (February 2000)
  Public web sites:                 2,900,000
  Annual increase:                    700,000
If the Library of Congress collects 1%
  Total number of sites:               30,000
  Annual number new and changed:       15,000
But these numbers are very rough estimates (guesses)!
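The arithmetic behind these figures can be checked in a few lines. The split of the 15,000 figure into new and changed sites is an inference from the numbers above, not something the slide states.

```python
# Figures quoted from OCLC's Web Characterization Project (February 2000).
public_web_sites = 2_900_000
annual_increase = 700_000

# The slide's assumption: the Library collects 1% of public web sites.
percent_collected = 1

# 1% of 2,900,000 is 29,000, which the slide rounds to 30,000.
total_collected = public_web_sites * percent_collected // 100

# 1% of the annual increase gives 7,000 newly collected sites per year.
new_per_year = annual_increase * percent_collected // 100

# The slide's 15,000 covers new AND changed sites, so about 8,000 already
# collected sites would change each year (an inference, not a quoted figure).
changed_per_year = 15_000 - new_per_year

print(total_collected, new_per_year, changed_per_year)
```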
13
Selection Decisions
Which sites to collect?
• Bulk -- collect all within a certain category
• Selective -- collect sites selected by a librarian
How often to make snapshots?
• Monthly, weekly, or depending on circumstances
Which content to collect?
• HTML pages only
• Text and images only
• Everything
14
Examples of Selection Decisions
                   Selection   Frequency   Content
Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw3         bulk        sweeps      all
Minerva            selective   irregular   all
15
Selection Decisions: Recommendations
The Library needs a mixed strategy:
1. Selective selection, for known important sites
2. Bulk selection for selected categories (e.g., .gov sites)
3. Bulk collection without selection for other materials
16
3. Use of the Collections for Scholarship and Research
17
Analysis by Computer
[Diagram: Archive holding Snapshot 1, Snapshot 2, Snapshot 3; Analysis by computer]
Computer programs can be used to analyze the snapshot files.
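As an illustration of analysis by computer, the sketch below counts which hosts a snapshot's HTML files link to, using only the Python standard library. It assumes a snapshot is a folder of files on disk; the names `LinkCollector` and `count_linked_hosts` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def count_linked_hosts(snapshot_dir):
    """Count how often each host appears in absolute links in a snapshot."""
    hosts = Counter()
    for page in Path(snapshot_dir).rglob("*.html"):
        parser = LinkCollector()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in parser.links:
            host = urlparse(link).netloc
            if host:  # relative links have no host and are skipped
                hosts[host] += 1
    return hosts
```

Running the same function over successive snapshots of a site would show how its outgoing links change over time.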
18
Analysis by Patron
[Diagram: Web site; Archive holding Snapshots 1, 2, 3 with Access versions 1, 2, 3; Analysis by patron]
People can study an access version of a site.
19
Access Decisions
Style of access
• Analysis of snapshot files by computer
• Analysis of access version by patron
Editing
• No editing (use snapshot files)
• Minimal editing to make access version
• Fuller editing to maintain experience
• Automatic or by hand
Policy
• Who has access to the collections?
20
Examples of Access Decisions
                   Style        Editing
Internet Archive   computer     no
Pandora            researcher   yes
Minerva            researcher   yes
21
Recommendations about the Use of the Collections for Scholarship and Research
The Library should support the use of the collection in a variety of ways.
1. Computer analysis of snapshot files
2. Automated editing to create access versions of all selected sites, without human checking.
3. Human editing of a few, very important sites.
22
4. Information Discovery
23
Options for Information Discovery
Very large numbers of Web sites will be collected and
preserved. Some form of index or catalog is required.
Options
• List of sites (e.g., Internet Archive)
  Access by URL + date
• Automatic index (e.g., Web search engines)
• Catalog (e.g., MARC or Dublin Core)
  Catalog record for individual site or group of sites
  Access through Library catalog
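Access by URL + date can be sketched as a lookup that returns the most recent snapshot taken on or before a requested date. The class below is a hypothetical in-memory index, not a description of how the Internet Archive implements its list of sites.

```python
from bisect import bisect_right
from datetime import date


class SnapshotIndex:
    """Map each site URL to its snapshot dates, kept in sorted order."""

    def __init__(self):
        self._dates = {}

    def add(self, url, snapshot_date):
        """Record that a snapshot of `url` was taken on `snapshot_date`."""
        dates = self._dates.setdefault(url, [])
        dates.insert(bisect_right(dates, snapshot_date), snapshot_date)

    def lookup(self, url, wanted):
        """Return the latest snapshot date on or before `wanted`, or None."""
        dates = self._dates.get(url, [])
        position = bisect_right(dates, wanted)
        return dates[position - 1] if position else None


# Example with made-up snapshot dates for one of the pilot sites.
index = SnapshotIndex()
index.add("http://www.whitehouse.gov/", date(2000, 9, 1))
index.add("http://www.whitehouse.gov/", date(2000, 10, 1))
print(index.lookup("http://www.whitehouse.gov/", date(2000, 9, 20)))
```

A list like this needs no cataloguing effort, but it only helps a user who already knows the URL of the site they want.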
24
Information Discovery: Web Preservation Project
Procedure
• MARC catalog records created using OCLC's CORC system.
• Loaded into Library of Congress's ILS.
Observations about procedure
• Cataloguing effort similar to other electronic files.
• Some similarities to serials.
• No significant workflow difficulties.
25
Cataloguing Observations
• Detailed information is continually changing.
• Difficulty in selecting title (HTML <title> is often poor).
• Problems with identifiers (multiple, changing URLs).
• Collection level records suitable for special events.
It is difficult to evaluate cataloguing strategy because of lack of
knowledge of user needs.
26
Recommendations about Information Discovery
1. The Library should experiment with various approaches to
indexing and cataloguing Web sites, including automated
indexing, Dublin Core and MARC cataloguing.
2. The Library will probably not be able to afford individual
catalog records for all Web sites that are collected.
27
5. Storage and Preservation
28
Workflow
[Diagram components: Web site, Web Crawler, snapshot, Accession Control, Process, Archive, Catalog, External Access, Analysis by computer, Analysis by patron]
29
Preservation Objective
The objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
• Preservation of bits
• Preservation of content
• Preservation of experience
How is it used?
• Analysis by computer program
• Viewed by human researcher
30
Process of Preservation
[Diagram: Versions 1, 2, 3 across Time 0, Time 1, Time 2]
This process may be applied to either the snapshot or the access version.
31
Storage Decisions: Identification
Identification of Web site
• URL, but Web sites may change their URL
• URN (e.g., Handle or PURL)
Identification and provenance of versions
• Web site identifier
• Collection information (date, time, etc.)
• History of changes
Recommendations
1. Assign URN (e.g., Handle) to each Web site.
2. Store provenance metadata with every file.
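A provenance record covering the three items listed (site identifier, collection information, history of changes) might look like the sketch below. The field names, the JSON serialization and the placeholder identifier are all assumptions made for illustration; the slides do not specify a metadata schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Provenance:
    """Provenance stored alongside one archived file (hypothetical schema)."""

    site_identifier: str   # URN for the Web site, e.g. a Handle
    original_url: str      # URL at collection time; the site may move later
    collected_at: str      # collection date and time, ISO 8601
    history: list = field(default_factory=list)  # changes made after collection

    def record_change(self, when, description):
        """Append one entry to the history of changes."""
        self.history.append({"when": when, "description": description})

    def to_json(self):
        """Serialize the record so it can be stored next to the file."""
        return json.dumps(asdict(self), indent=2)


# "hdl:EXAMPLE/site-1" is a made-up placeholder, not a real Handle.
record = Provenance("hdl:EXAMPLE/site-1", "http://www.whitehouse.gov/", "2000-10-01T12:00:00Z")
record.record_change("2005-03-01T00:00:00Z", "Migrated image files to a newer format")
print(record.to_json())
```

Because the identifier is a URN, not the URL, the record stays valid even if the Web site later changes its address.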
32
Preservation Recommendations
1. Keep the unedited snapshot files by repeated refreshing.
2. Use automated migration of individual files as the basic technique for keeping Web sites (more or less) functional at moderate cost.
3. Use manual editing for a small number of particularly important
sites.
In general, it is not possible to maintain the experience of using
Web sites as technology changes, even with expensive editing.
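Refreshing, in the sense of copying unedited snapshot files to fresh storage, can be paired with a checksum comparison to confirm that the bits survived the copy. The sketch below assumes SHA-256 as the checksum and plain file copies as the refresh mechanism; neither choice comes from the slides.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_file(source, destination):
    """Copy one snapshot file to new storage and confirm the bits are unchanged.

    Returns the digest of the new copy; raises OSError if the copy differs.
    """
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise OSError(f"Refresh of {source} produced a different file")
    return actual
```

This addresses only preservation of bits; keeping the content usable still depends on the migration and editing steps recommended above.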
33
6. General Recommendations
34
General Recommendations
1. Collection and preservation of Web materials should be seen as a single program.
2. The program needs a full-time team of librarians and technical staff.
3. Some aspects can be subcontracted to specialists (e.g., the Web crawler), but the leadership must come from the Library.
4. The Library should seek partnerships with other libraries and archives.
5. Most processes will be automatic, with skilled attention given to a small number of particularly important sites.
35
Demonstration of Pilot System
36