Society of American Archivists Research Forum
18 August 2015
A Deep Dive into the Archival MARC
Records in WorldCat (and ArchiveGrid)
Jackie Dooley
Program Officer
OCLC Research
OVERVIEW
• Research objective
• Research questions
• The data set
• High-level findings
• Next steps
RESEARCH OBJECTIVE
Establish a detailed profile of MARC data element
occurrences in archival catalog records, providing a
view of 30+ years of practice.
• Reveal variations in descriptive practice
• Debunk inaccurate assumptions
• Characterize before MARC usage diminishes
• Suggest improvements in descriptive practice
• Enable analysis of implications for discovery
SAMPLE RESEARCH QUESTIONS
• Are descriptions and index terms rich enough to enable effective
discovery of archival materials?
• In what significant ways does archival description differ from one
type of material to another?
• To what extent does use of the archival control byte successfully
capture the universe of archival descriptions?
• Is it true that archivists usually describe materials at the
collection level?
• How often is DACS used as the content standard? And APPM as
its predecessor?
• To what extent are the DACS minimum requirements met?
THE DATA SET
Archival records in WorldCat
OCLC’s WorldCat database of 300+ million records,
filtered to extract “archival” records (currently 4 million, or
about 1% of the total)
Brief version of the filter specs:
• “Unpublished” materials in any format (e.g., text,
visual, moving image, sound recording)
• Coded for “archival control” (Leader byte 08)
• Held by a single institution (i.e., only one attached
holding)
• Excludes published materials in any format, as
well as theses and dissertations
Spoiler alert: It’s not perfect.
Same records as in ArchiveGrid
The full filter specs:
• Only one library holding symbol is attached (to eliminate non-unique items or collections)
• The MARC Leader has one or more of the following:
– Leader byte 06 (record type) has the value d (manuscript music), f (manuscript cartographic), g (projected graphics), i (nonmusic recording), j (music recording), k (visual), p (mixed), r (realia), or t (textual manuscript). [does this include all the new ones?]
– Leader byte 06 has the value "a" (language material) and Leader byte 07 (bibliographic level) has the value "c" (collection).
– Leader byte 08 has the value "a" (archival control).
• Field 260 subfields "a" and "b" are not present (to filter out published works)
• "Bibliography" does not occur at the beginning string of any MARC subject heading subfield "a" or "v" (to filter out published works).
• Field 502 is not present (to filter out theses and dissertations).
• Records with material type "book" or "serial" that have no value in fields 008 or 006 “Nature of Contents” bytes (to eliminate theses, reference works, and other non-archival materials).
http://beta.worldcat.org/archivegrid/about/
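The filter specs above can be read as a single yes/no test per record. The following is a minimal sketch of that reading, not OCLC's actual implementation. The `Record` class is a hypothetical, simplified stand-in for a parsed MARC record. The sketch assumes that the 260 rule means neither subfield "a" nor "b" may be present, and that the "Bibliography" check is case-insensitive. It leaves out the final rule about the "Nature of Contents" bytes, because the specs do not state it precisely enough to code.

```python
from dataclasses import dataclass, field

# Leader/06 codes the specs list as unpublished material types.
ARCHIVAL_RECORD_TYPES = set("dfgijkprt")


@dataclass
class Record:
    """Hypothetical, simplified stand-in for a parsed MARC record."""
    leader: str                                  # 24-character MARC Leader
    holdings: int                                # number of attached holding symbols
    fields: dict = field(default_factory=dict)   # tag -> list of {subfield code: value}


def leader_qualifies(leader: str) -> bool:
    """True if at least one of the three Leader conditions holds."""
    record_type, bib_level, control = leader[6], leader[7], leader[8]
    return (
        record_type in ARCHIVAL_RECORD_TYPES
        or (record_type == "a" and bib_level == "c")
        or control == "a"
    )


def is_archival(rec: Record) -> bool:
    """Apply the filter rules in order; any failed rule rejects the record."""
    if rec.holdings != 1:
        return False
    if not leader_qualifies(rec.leader):
        return False
    # Assumption: a 260 field with either $a or $b signals a published work.
    if any("a" in f or "b" in f for f in rec.fields.get("260", [])):
        return False
    # A 502 (dissertation note) signals a thesis or dissertation.
    if rec.fields.get("502"):
        return False
    # Reject subject headings (6xx) whose $a or $v begins with "Bibliography".
    for tag, occurrences in rec.fields.items():
        if not tag.startswith("6"):
            continue
        for f in occurrences:
            for code in ("a", "v"):
                if f.get(code, "").lower().startswith("bibliography"):
                    return False
    return True
```

Each subfield is stored once per field here, so repeated subfields would need a richer structure in real use.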
So what do you think of our scoping
of archival data elements?
Briefest version of the filter specs:
• “Unpublished” materials in any format
• Under “archival control”
• Held by a single institution
• Excludes all published materials
Spoiler reminder: It’s not perfect.
HIGH-LEVEL FINDINGS
A. Full data
B. Mixed materials
C. Text
D. Visual materials
E. Music scores
F. Maps
G. Audio recordings
Percent of records by type of material
[Chart. Categories: Book, Mixed, Visual, Map, Score. Values shown: 44%, 26%, 25%, 4%, 1%.]
A. Full data
• “Archival control”: 28% of records
• Dates: Nearly half have date span
• Bibliographic level
– 53% describe collections
– 40% describe single items
– “Component” levels rarely used
• 95% are mixed materials, text, or visual materials
• 85% have ≥1 indexed creator names
• 75% have ≥1 indexed subject terms
• 30% have an 856 field (link to external content)
Bibliographic level by type of material
Inclusion of 6xx (subject) index terms
[Chart. Percent of records, by type of material: All, Book, Map, Mixed, Rec, Score, Visual. Series: 600, 610, 650, 651, 653, 655.]
A. Full data, cont.
• Cataloging level
– 29% full cataloging
– 25% minimal
– 44% unknown
• Cataloging rules
– Specified in 30% of records
– appm in 18% of records, dacs in 7%, gihc in 5%
• Form of material: Used most heavily for non-textual materials
• Language
– Two thirds in English
– Not specified in ≥ 25% of records
• Place of publication vs. location of repository
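Figures like the cataloging-rules percentages above come from a simple tally. The sketch below assumes the rule codes (appm, dacs, gihc) are taken from MARC field 040 subfield "e" (description conventions), which the slides do not state explicitly. The function name and input shape are hypothetical.

```python
from collections import Counter


def rules_profile(records):
    """Percent of records naming any rule code, and percent naming each code.

    `records` is an iterable with one list per record, holding that
    record's 040 $e values (an empty list if none are present).
    """
    total = 0
    specified = 0
    by_code = Counter()
    for codes in records:
        total += 1
        if codes:
            specified += 1
        # Count each code once per record, ignoring case.
        by_code.update({c.lower() for c in codes})

    def pct(n):
        return round(100 * n / total, 1) if total else 0.0

    profile = {"specified": pct(specified)}
    profile.update({code: pct(n) for code, n in by_code.items()})
    return profile
```

Counting each code once per record keeps the result comparable to statements such as "appm in 18% of records".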
B. Mixed Materials
• 44% of all records
• 50% are under archival control
• 94% are collection records, 5% are components
• 1xx in 70% of records
• Title: 11% have no 245 $a
• Notes
– 520 in 74% of records
– 545 field in 31% of records
– 500 field in 39% of records
– No other 5xx used in ≥ 25% of records
B. Mixed Materials, cont.
• 600 in 40% of records; mean of 1.5 per record
• 650 in 52% of records; mean of 3.0 per record
• 651 in 45% of records; mean of 1.3 per record
• 655 in 63% of records; mean of 1.3 per record
• 7xx in 28% of records
• 856 in 29% of records
C. Text
• 25% of all records
– 4% are book and pamphlet collections
– 21% are textual manuscripts
• 25% of textual manuscript records are under archival control
• 30% are collection records, 70% are items
• 1xx in 77% of records
• Title: 11% have no 245 $a
• Notes
– 43% have 520 field
– 54% have 500 field
C. Text, cont.
• 600 in 31% of records; mean of 0.9 per record
• 650 in 42% of records; mean of 1.7 per record
• 651 in 31% of records; mean of 0.8 per record
• 655 in 36% of records; mean of 0.7 per record
• 7xx in 50% of records
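The "in N% of records; mean of M per record" figures pair two statistics for each field. The slides do not say which denominator the mean uses, but a mean below 1.0 (such as 0.9 for field 600) can only arise when records lacking the field are counted too. The sketch below computes both variants. The function name and input shape are hypothetical.

```python
def field_stats(records, tag):
    """Occurrence statistics for one MARC tag.

    `records` is a list with one list per record, holding that record's
    field tags (one entry per field occurrence, so repeats are allowed).
    """
    n = len(records)
    if n == 0:
        return {"pct_with_field": 0.0, "mean_all_records": 0.0,
                "mean_records_with_field": 0.0}
    counts = [rec.count(tag) for rec in records]
    with_field = sum(1 for c in counts if c > 0)
    total = sum(counts)
    return {
        # Share of records with at least one occurrence of the tag.
        "pct_with_field": 100 * with_field / n,
        # Mean over every record, including those without the tag.
        "mean_all_records": total / n,
        # Mean over only the records that contain the tag (always >= 1).
        "mean_records_with_field": total / with_field if with_field else 0.0,
    }
```

Reporting both means makes it clear whether a low figure reflects sparse use of a field or few terms per record.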
D. Visual Materials
• 26% of all records
• ≤ 10% are under archival control
• 57% have 007 (technical data values)
• 15% are collection records, 76% are items
• 1xx in 51% of records
• Notes
– 500 in 77% of records
– 520 in 68% of records
– 540 in 57% of records
D. Visual Materials, cont.
• 600 in 32% of records; mean of 1.1 per record
• 650 in 68% of records; mean of 4.2 per record
• 651 in 38% of records; mean of 1.5 per record
• 655 in 81% of records; mean of 1.5 per record
• 7xx in 31% of records
• 856 in 48% of records
E. Music Scores
• 4% of all records
• 1xx in 90% of records
• 240 in 41% of records
• 500 in 96% of records; negligible use of other 5xx’s
• 650 in 96% of records; mean of 2.4 per record
• 655 in 34% of records; genre/form terms often in 650 instead
• 856 in 25% of records
F. Maps
• Less than 1% of all records
• 65% have 007 (technical data values)
• Field 043 (hierarchical geographic area code) in 80% of
records
• 052 in 66% of records (geographic classification)
• 1xx in 53% of records
• 255 in 92% of records (cartographic mathematical data)
F. Maps, cont.
• 500 in 93% of records; use of other 5xx’s negligible
• 650 in 68% of records; mean of 2.8 per record
• 651 in 83% of records; mean of 2.7 per record
• 655 in 84% of records; mean of 1.8 per record
• 7xx in 50% of records
G. Audio Recordings
• Less than 1% of all records
• 60% have 007 (technical data values)
• 1xx in 83% of records
• Notes
– 500 in 77%
– 520 in 68%
– 530 in 27%
– 540 in 57%
G. Audio Recordings, cont.
• 650 in 68%; mean of 5.2 per record
• 651 in 47%; mean of 0.9 per record
• 655 in 67% of records; mean of 1.2 per record
• 7xx in 100% of records
• 856 in 22% of records
NEXT STEPS
Draw conclusions (a few for starters)
• Mixed and textual materials cataloged as collections; other
formats not so much
• “Archival control” byte is far from universally used, so has
little value
• Few of the note fields added for archival or visual materials
communities are widely used (does it matter?)
• As many as 25% of titles for mixed and textual collections
make for lousy browsing (e.g., “Papers” or “Records”)
• Ponder implications for next-gen cataloging (linked data,
BIBFRAME, schema.org)
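The point about titles that make for lousy browsing suggests a simple check. The sketch below flags a title when the 245 $a text, once punctuation and case are ignored, is nothing more than a generic word. Only "Papers" and "Records" come from the slides. The function names are hypothetical, and the word list would need extending for real use.

```python
import re

# The two examples named in the slides; extend as needed.
GENERIC_TITLES = {"papers", "records"}


def is_generic_title(title_a, generic=GENERIC_TITLES):
    """True if the 245 $a text is only a generic word such as "Papers"."""
    # Drop punctuation (e.g. the trailing comma before a date subfield).
    normalized = re.sub(r"[^\w\s]", "", title_a).strip().lower()
    return normalized in generic


def generic_title_share(titles):
    """Percent of titles that are generic; 0.0 for an empty input."""
    titles = list(titles)
    if not titles:
        return 0.0
    return 100 * sum(is_generic_title(t) for t in titles) / len(titles)
```

A title such as "Jane Doe papers" is not flagged, since it carries a name that distinguishes it in a browse list.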
Please send feedback
• Do the data debunk any assumptions?
• Are you dubious about any of the data?
• Would you tweak the specs of our filter?
• Are changes in practice called for?
• What other questions should I be asking?
• Is this a useful project or just an “interesting” one?
Publications & future research
• Publish this data
• Second paper: Implications for discovery
• Future research?
– Data content
– Potential for data remediation
• Generic titles (e.g., Papers, Records)
• Missing language codes
• Other?
– Descriptive practice for web archiving
• If you need an OCLC data set for research ...
SAA Research Forum
Thanks!
Jackie Dooley
Program Officer, OCLC Research
[email protected]
@minniedw