From Data to Discovery – Building Automated Cataloguing Tools with Perl


From Data to Discovery
Building Automated
Cataloguing Tools with Perl
Huw Jones
Cambridge University Library
Cambridge
Small city, big University = lots of libraries!
Lots of libraries = lots of books
Bibliographic records
University Library: 3.85 M
Other libraries: 2.5 M
8 databases
Data problems
Quality
Duplication
Quality – fullness
Of the 2.5 M records in our databases, 1 M are short records
Quality – coding
Duplication
Effects
• Difficulty in resource discovery
• Patchy retrieval
• Lack of authority control
• Difficulty with standard deduplication
• Burden on staff time
• Ties us to multiple database model
Aims
Better records
Fewer records
Existing Solutions?
• Manual recataloguing
• Commercial solutions
• Universal catalogue
• Discovery layer
These either don’t solve the core problem, or are expensive and/or time consuming
Our solution
Automated Cataloguing Tools!
• Short record enrichment
• Automated MARC correction
• Deduplication
Order is important – full, well-coded records are easier to deduplicate
General principles
• Retrieve some records from a Voyager database
• Examine and/or manipulate them
• If necessary, make changes in the database
N.B. Watch indexes and table space!
General tools
• Perl – holds everything together
• Perl DBI – connects to databases
• SQL – retrieves records from database
• MARC::Record modules (from CPAN) – to examine/manipulate records
• Pbulkimport/Batchcat – to make changes to the database
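In practice MARC::Record does the record handling; purely to illustrate the shape of the data these tools work with, here is a hand-rolled parse of one breaker-format field (the notation used in the example records later in the talk). `parse_field` and the hash it returns are inventions for this sketch, not the MARC::Record API.

```perl
use strict;
use warnings;

# Crude stand-in for MARC::Record's accessors: split one breaker-format
# line (e.g. '=260 \\$aNew York :$bEldigio Press,$cc1985.') into its
# tag, indicators and subfield code/value pairs.
sub parse_field {
    my ($line) = @_;
    my ($tag, $ind, $rest) = $line =~ /^=(\d{3}) (..)(.*)$/
        or return;                       # leader etc. not handled here
    my @subfields = map  { [ substr($_, 0, 1), substr($_, 1) ] }
                    grep { length } split /\$/, $rest;
    return { tag => $tag, ind => $ind, subfields => \@subfields };
}

my $f = parse_field('=650 \0$aAstronomy.');
# $f->{tag} is '650', indicators '\0', subfield a is 'Astronomy.'
```

Real code should of course lean on MARC::Record, which also handles the leader, character encoding and ISO 2709 serialisation.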
Batchcat vs Pbulkimport
• Batchcat – installed on PC with Voyager
• More versatile
• Can’t be used on server
• Pbulkimport – limited functionality
• Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN)
• Can be used on server
Books
• Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: 0596001320
• Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN: 1565926994
Enriching short records
How to get from this …
to this
Basic mechanism
• Take short record
• Find a matching full record
• Overlay short record with full record
• Need a source of full records
• In Cambridge, the University Library has a large database of full, authority controlled records
• Start with a file of SHORT RECORD bib ids
• Connect to LOCAL database and check each is a valid bib id
• Connect to EXTERNAL source; find the best FULL RECORD match and score it
• Retrieve SHORT RECORD info from local database
• Compare match score to overlay threshold; if OK, retrieve MARC record for FULL RECORD
• Correct FULL MARC record: remove inappropriate fields, insert fields to be retained from SHORT RECORD
• In local database, overlay SHORT RECORD with FULL RECORD
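The match-and-score step could look roughly like the sketch below. The talk doesn't give the service's actual match points, weights or overlay threshold, so every name and number here is a placeholder.

```perl
use strict;
use warnings;

my $OVERLAY_THRESHOLD = 60;    # hypothetical cut-off

# Normalise a string so trivial punctuation/case differences don't
# block a match.
sub _norm {
    my $s = lc(shift // '');
    $s =~ s/[[:punct:]]+//g;
    $s =~ s/\s+/ /g;
    $s =~ s/^ | $//g;
    return $s;
}

# Score a candidate FULL RECORD against a SHORT RECORD; both are
# hashrefs with isbn/title/date keys. Weights are illustrative only.
sub match_score {
    my ($short, $full) = @_;
    my $score = 0;
    $score += 50 if ($short->{isbn} // '') ne ''
                 && ($short->{isbn} // '') eq ($full->{isbn} // '');
    $score += 30 if _norm($short->{title}) ne ''
                 && _norm($short->{title}) eq _norm($full->{title});
    $score += 10 if ($short->{date} // '') ne ''
                 && ($short->{date} // '') eq ($full->{date} // '');
    return $score;
}

# The overlay only happens when the best candidate clears the threshold:
# overlay($short, $full) if match_score($short, $full) >= $OVERLAY_THRESHOLD;
```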
Output
Interface
Results
• Service has been running for 1 year (much of which was testing)
• 18 libraries subscribed to use service
• 90,000 short records upgraded
MARC checking and correction
• Bibliographic standard – agreed minimum standard for cataloguing
• Every week, libraries receive an automatically generated file of MARC coding errors for correction
• Based on MARC::Lint module with many alterations
Output
Mechanism
• Connects to database using Perl DBI
• Retrieves MARC records for records created/edited in the last week
• Runs them through MARC check
• Prints errors to file
• Emails file to library
Over 100,000 errors pointed out so far!
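To give a flavour of what the (heavily altered) MARC::Lint-based check flags, here is a toy check on a breaker-style 245 field. The real module covers far more conditions; `check_245` and its error strings are inventions for this sketch.

```perl
use strict;
use warnings;

# Flag some of the 245 problems seen in the example record: missing
# space-forward-slash before $c, a capitalised statement of
# responsibility, and a missing final full stop.
sub check_245 {
    my ($field) = @_;       # field content, e.g. '$aTitle /$cby A. Author.'
    my @errors;
    push @errors, '245: subfield _c should be preceded by space-forward-slash'
        if $field =~ /\$c/ && $field !~ m{ /\$c};
    push @errors, '245: statement of responsibility should not start with a capital'
        if $field =~ /\$c[A-Z]/;
    push @errors, '245: field should end with a full stop'
        if $field !~ /\.$/;
    return @errors;
}

# The uncorrected 245 from the example record triggers two of these
# errors; the corrected version passes cleanly.
```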
MARC Correction
How to get from this …
=LDR 00472nam\\2200157\a\4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W.S.,$d1931
=245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
=260 \\$aNew York ;$bEldigio Press,$cc1985
=300 \\$a291p $bill $c23cm
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
to this!
=LDR 00453nam 2200157 a 4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W. S.,$d1931
=245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
=260 \\$aNew York :$bEldigio Press,$cc1985.
=300 \\$a291 p. :$bill. ;$c23 cm.
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
MARC Correction
• Version of module which, where there is no ambiguity, corrects errors
• Built into short record upgrade program
• Also offered as a retrospective service to clean up legacy records
• Possibility of building it into weekly check
Mechanism
• Connects to database using Perl DBI
• Retrieves full MARC record
• Runs against correction module
• Replaces corrected record in database
Output
• Bib id: 662002
• How to build a habitable planet ; By Wallace S. Broecker.
• 100: UPDATE: Spaces inserted between initials in subfield _a
• 245: UPDATE: By uncapitalised at start of subfield c
• 245: UPDATE: Space forward slash inserted before subfield _c
• 260: UPDATE: Full stop inserted at end of field
• 260: UPDATE: Space colon inserted before subfield _b
• 300: UPDATE: Full stop inserted after the p in pagination
• 300: UPDATE: Full stop inserted at end of field
• 300: UPDATE: Illustration abbreviation has been corrected
• 300: UPDATE: Space colon inserted before subfield _b
• 300: UPDATE: Space inserted between digits and cm
• 300: UPDATE: Space inserted between digits and p in pagination
• 300: UPDATE: Space semi-colon inserted before subfield c
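The unambiguous 300-field corrections listed above can be done with regular expressions. The sketch below is an illustration, not the production module, and only handles the patterns that appear in the example record.

```perl
use strict;
use warnings;

# Correct one breaker-style 300 field, e.g. '$a291p $bill $c23cm':
# abbreviate and space pagination/illustration/size, then apply ISBD
# punctuation (space-colon before $b, space-semicolon before $c, and
# a final full stop).
sub correct_300 {
    my ($field) = @_;
    my @out;
    for my $s (grep { length } split /\$/, $field) {
        my ($code, $val) = (substr($s, 0, 1), substr($s, 1));
        $val =~ s/\s*[:;.]?\s*$//;           # drop old trailing punctuation
        if ($code eq 'a') {
            $val =~ s/(\d)\s*p\.?$/$1 p./;   # '291p'  -> '291 p.'
        } elsif ($code eq 'b') {
            $val =~ s/\bill$/ill./;          # 'ill'   -> 'ill.'
        } elsif ($code eq 'c') {
            $val =~ s/(\d)\s*cm$/$1 cm/;     # '23cm'  -> '23 cm'
        }
        push @out, [ $code, $val ];
    }
    my %before = ( b => ' :', c => ' ;' );
    my $result = '';
    for my $i (0 .. $#out) {
        $result .= $before{ $out[$i][0] } // '' if $i > 0;
        $result .= '$' . $out[$i][0] . $out[$i][1];
    }
    $result .= '.' unless $result =~ /\.$/;
    return $result;
}
```

One design point worth copying: the correction is idempotent, so a record that is already correct passes through unchanged.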
Results
• In testing, 70,000 records processed
• Corrected over 200,000 MARC coding errors
• May run ALL our existing records through at some stage
Deduplication – in progress!
Three stages:
• Identification of groups of duplicates
• Identification/construction of ‘best’ record
• Deletion of other records – relinking of holdings/items/Purchase Orders to ‘best’ record
Identification of duplicates
• Connect to a database with Perl DBI
• Use SQL to retrieve records
• For each record, retrieve all available data from tables
• Use matching algorithm to identify groups of duplicates
And you’ll end up with something like this:
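One plausible shape for the matching algorithm is to derive a normalised key from each record and bucket records that share a key. The talk doesn't spell out the real match points, so the key fields and the four-letter author stem below are assumptions for this sketch.

```perl
use strict;
use warnings;

# Build a normalised match key from title, author and date so that
# trivial punctuation/case/fullness differences don't hide duplicates.
sub match_key {
    my ($rec) = @_;                      # hashref: title, author, date
    my $t = lc($rec->{title} // '');
    $t =~ s/[[:punct:]]+//g;
    $t =~ s/\s+/ /g;
    $t =~ s/^ | $//g;
    my $a = lc(substr(($rec->{author} // ''), 0, 4));  # crude author stem
    return join '|', $t, $a, ($rec->{date} // '');
}

# Bucket records by key and keep only the buckets that actually
# contain more than one record.
sub find_duplicate_groups {
    my (@records) = @_;
    my %groups;
    push @{ $groups{ match_key($_) } }, $_ for @records;
    return grep { @$_ > 1 } values %groups;
}
```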
Identification of best record
• For each group of duplicates, MARC records are retrieved
• Passed to scoring algorithm
• Record with highest score forms basis of ‘best’ record
• Retains set fields (e.g. subject headings) from ‘other’ records
• Corrects any MARC coding errors
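A simple version of the scoring pass: every field earns a point, with extra weight for fields that tend to mark a richer record, and the highest-scoring record becomes the basis of the ‘best’ record. The weights here are placeholders; the talk doesn't give the real ones.

```perl
use strict;
use warnings;

# Assumed bonus weights for "richness" fields (subject headings,
# added entries, physical description, notes).
my %bonus = ( 300 => 1, 504 => 1, 650 => 2, 700 => 2 );

# Score a record, represented here as an arrayref of [tag, content]
# pairs: one point per field plus any bonus for its tag.
sub record_score {
    my ($record) = @_;
    my $score = 0;
    $score += 1 + ($bonus{ $_->[0] } // 0) for @$record;
    return $score;
}

# Pick the highest-scoring record from a duplicate group.
sub best_record {
    my (@group) = @_;
    my ($best) = sort { record_score($b) <=> record_score($a) } @group;
    return $best;
}
```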
But …
• No relinking functionality, even in BatchCat
• No viable workaround for libraries using Acquisitions, or one that avoids losing circulation history
In conclusion …
• Tools for librarians, not replacements!
• Do the stuff programs do well, allowing humans to concentrate on what humans do well
• Won’t do all the work, just makes a solution to major data problems feasible
Questions?