Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison
Lynn Silipigni Connaway
Consulting Research Scientist III
Akeisha Heard
Technical Intern
XXV Annual Charleston Conference
04 November 2005
Introduction
Research Goals
Develop a service to support advanced collection intelligence
Cluster collected objects based on their issuing entity
• As can be determined via metadata about the objects
• Gain intelligence about the nature of individual
publishers
• Collection intelligence
• Acquisition patterns
• User behavior
Research Objectives
Resolve
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
Capture and make available for use various attributes of
individual publishers
• Location of publisher
• Language(s) of materials published
• Genre(s)/format(s) of materials published
• Dominant subject domain(s) of the publisher's output
• Parent company and subsidiaries
Theoretical Foundation: Authority Control
Adhere to authorized form
• Personal names
• Corporate entities
Why no authorized form for publishing entities?
Pragmatic Foundation: Collection Development
Identified publisher series
• Retrospective conversion project (1984)
Family tree
• Which publishers are related?
Approval plans
• Which publishers publish which subjects?
Pragmatic Foundation: OCLC WorldCat Data Mining
Collection Analysis
• Which libraries have the most items by a publisher in a
particular subject area?
• How do library holdings by publisher compare?
E-books for a particular STM publisher (2000)
• Cataloged as reproductions
• 2 publishers!
Pragmatic Foundation: Citation Analysis
Sweetland (1989)
• Reader functions of citations
• Information retrieval via citation databases
• Document retrieval
• Includes interlibrary loan verification
• Bibliometrics
• Faculty and researcher productivity measure
Other functions
• Creation of references/bibliographies
Pragmatic Foundation: Education for Librarians
Collection development & acquisitions librarian education
• Subject focuses of publishers
• Parent and subsidiary relationships
Specialized Corporate Authority Files
ACOLIT (Ruggeri, 2004)
• Names, uniform titles, Italian and international Catholic
institutions, Catholic religious communities, and
institutions
• Related to the Catholic Church, Papal State, and
Vatican City State
COPAR (Boddaert, 2004)
• French official corporate bodies
• Mainly national and preceding the French Revolution
CORELI (Boddaert, 2004)
• Religious corporate bodies from 3 French ancient
specialized catalogues
Specialized Corporate Authority Files
Chinese Modern Author Authority Database (Hu, Tam &
Lo, 2004)
• Chinese authors of expanded works and Chinese corporate
bodies since 1912
Chinese Name Authority Database (Hu, Tam & Lo, 2004)
• Mainly Taiwanese personal names with some Taiwanese
corporate bodies
Specialized Corporate Authority Files
Case study by Elias & Fair (1983)
• Standard Oil Co.’s Media Query File
• No authority control
• 3 professionals in 6 months averaged 12 telephone
calls/day from reporters
• Decided against canonical list for media names
• Noted 20 unique variants for Wall Street Journal
including WSJ, Wall St. Jnl, Wall Street Jnl
Specialized Corporate Authority Files
Case study by French, Powell & Schulman (1997, 2000)
• Smithsonian Astrophysical Observatory’s Astrophysics
Data System database
• Programmatically identify author affiliations and map
variant names to canonical name
• Investigated various techniques separately and
iteratively to bring variants together including:
• Lexical cleanup
• Data clustering algorithms
• Approximate string-matching
• Reduced number of unique strings by 55%
• Required manual review of clusters
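The combination of lexical cleanup and approximate string-matching described above can be sketched roughly as follows. This is only an illustration, not the method French, Powell & Schulman used: the similarity measure (Python's difflib.SequenceMatcher), the 0.85 threshold, the greedy one-pass clustering, and the sample affiliation strings are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def cleanup(name: str) -> str:
    """Lexical cleanup: lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def cluster_variants(names, threshold=0.85):
    """Greedy one-pass clustering (an assumed stand-in technique).

    Each name joins the first cluster whose representative string is
    similar enough after cleanup; otherwise it starts a new cluster.
    """
    clusters = []  # list of (cleaned representative, [original strings])
    for raw in names:
        key = cleanup(raw)
        for representative, members in clusters:
            if SequenceMatcher(None, representative, key).ratio() >= threshold:
                members.append(raw)
                break
        else:
            clusters.append((key, [raw]))
    return [members for _, members in clusters]


# Made-up affiliation strings, for illustration only.
sample = ["Harvard Univ.", "harvard univ", "Harvard Unv.", "Yale University"]
for members in cluster_variants(sample):
    print(members)
```

A greedy pass like this depends on input order and on the threshold chosen, which is one reason the resulting clusters would still need the manual review noted above.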
Database Quality
Literature: Database Quality
Review by O’Neill & Vizine-Goetz (1988)
• Busch (1981)
• < 35% of 141 OCLC libraries routinely reported
errors
• Pollock & Zamora (1983)
• Noted misspellings comprise 90-96% of errors &
include:
• Omission
• Insertion
• Substitution
• Transposition
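The four misspelling types above can be told apart mechanically when a misspelling differs from the correct word by a single edit. The sketch below is one assumed way to do that (strip the shared prefix and suffix, then inspect what is left); it is not taken from Pollock & Zamora, and the function name is hypothetical.

```python
def classify_single_edit(correct: str, observed: str) -> str:
    """Classify how `observed` differs from `correct` when they differ by one edit."""
    if correct == observed:
        return "none"
    shortest = min(len(correct), len(observed))
    # Length of the common prefix.
    i = 0
    while i < shortest and correct[i] == observed[i]:
        i += 1
    # Length of the common suffix, not overlapping the prefix.
    j = 0
    while j < shortest - i and correct[-1 - j] == observed[-1 - j]:
        j += 1
    c_mid = correct[i:len(correct) - j]    # differing part of the correct word
    o_mid = observed[i:len(observed) - j]  # differing part of the observed word
    if len(c_mid) == 1 and not o_mid:
        return "omission"        # a letter is missing
    if not c_mid and len(o_mid) == 1:
        return "insertion"       # an extra letter was added
    if len(c_mid) == 1 and len(o_mid) == 1:
        return "substitution"    # one letter replaced by another
    if len(c_mid) == 2 and c_mid == o_mid[::-1]:
        return "transposition"   # two adjacent letters swapped
    return "other"


print(classify_single_edit("Methuen", "Metheun"))
```

Anything involving more than one edit falls through to "other" and would need a fuller edit-distance comparison.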
Literature: Database Quality
Intner (1989)
• Reviewed 215 matching records in OCLC and RLIN
• Errors relating to publishers:
                               OCLC                     RLIN
Error category                 Count (Total)    %       Count (Total)    %
MARC tagging in 260 field      64 (205)         31.2    52 (191)         27.2
Typographic errors             4 (25)           16.0    3 (26)           11.5
Application of AACR2 & LCRI    4 (32)           12.5    6 (45)           13.3
Literature: Database Quality
Romero (1994)
• Evaluated cataloging of library science students
• Noted 221 errors (28.22%) in the publisher
description area
Issues: Historical Practices
Different rules for abbreviations
• LC Rule Interpretation B.14
• State postal (2-letter) abbreviation if it appears in
the item along with the place
• Anglo-American Cataloguing Rules, Revised (2002)
• Abbreviations included in Appendix B.14
Issues: Historical Practices
ALA Catalog Rules (1941)
• Multiple places of publication and publishers and neither
or first is prominent
• Include first listed first, indicate omission
• Multiple places of publication and publishers and first is
not prominent
• Include prominent first
• Include first listed second
• Unknown place of publication – [n.p.]
Issues: Historical Practices
Anglo-American Cataloging Rules (1967)
• Multiple places of publication and publishers and neither
or first is prominent
• Include first listed only, omit others
• Multiple places of publication and publishers and first is
not prominent
• Include prominent only, omit others
• Unknown place of publication – [n. p.]
Issues: Historical Practices
Anglo-American Cataloguing Rules, Revised (2002)
• Multiple places of publication and publishers and neither
or first is prominent
• Include first listed only, omit others
• Multiple places of publication and publishers and first is
not prominent
• Include first listed first
• Include prominent second
• Unknown place of publication – [S.l.]
Issues: Historical and Local Practices
“u.a.”
• At least one German institution uses “u.a.” as mark of
omission
• Means “et al.”
• Not an AACR2r rule
• Local practice?
• Is local practice/policy an error?
Issues: Historical and Local Practices
WorldCat enhanced records
• Eliminate or lessen the probability of these issues
Examining Quality of WorldCat
WorldCat: Publisher Name Selection Criteria
Fixed field lang = “eng”
WorldCat by Language
• English: 61%
• Non-English: 39%
WorldCat: ISBN Validation Errors
WorldCat records with ISBNs: 22.69%
ISBNs by Language
• English: 55%
• Non-English: 45%
WorldCat: ISBN Validation Errors
ISBNs               Valid                   Invalid
English Language    7,561,445 (99.90%)      7,600 (0.10%)
All Languages       13,147,325 (99.88%)     15,654 (0.12%)
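The slides do not say how ISBNs were validated. A minimal sketch, assuming the standard ISBN-10 check-digit rule (weights 10 down to 1, weighted sum divisible by 11, "X" standing for 10 in the final position only), might look like this. The function name is hypothetical.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Return True if `isbn` passes the ISBN-10 check-digit test."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch in "0123456789":
            value = int(ch)
        elif ch == "X" and position == 9:
            value = 10  # "X" is only legal as the check digit
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


print(is_valid_isbn10("0-19-852663-6"))
```

A check like this catches most single-digit errors and adjacent transpositions, but it cannot tell whether a well-formed ISBN actually belongs to the item described.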
WorldCat: MARC Tagging Errors
Examined English language records based on some known
issues and manual evaluation
Total MARC tagging errors found: 11,874 (0.03%)
WorldCat Tagging Errors
• MARC 260 vs 300 tagging: 55%
• Dates tagging: 43%
• Other: 2%
WorldCat: MARC Tagging Errors
MARC 260 vs 300 tagging
• In 260 field, information from 300 field in
$a, $b, $c and/or $e
Dates tagging
• Date in $a or $b
• Five digit year
• “cm” follows year
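Checks like the dates-tagging ones above lend themselves to simple pattern matching. The sketch below is an assumed approach, not the project's code: it takes the 260 subfields as a plain dictionary (a hypothetical input shape), and the regular expressions are rough heuristics whose hits would still need review, since a publisher name can legitimately contain a year.

```python
import re

# Heuristic patterns; all are assumptions made for this sketch.
YEAR = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")               # plausible 4-digit year
FIVE_DIGIT = re.compile(r"\b\d{5}\b")                       # e.g. "19985"
CM_AFTER_YEAR = re.compile(r"\b\d{4}\b.*\bcm\b", re.IGNORECASE)


def flag_260(subfields: dict) -> list:
    """Return a list of suspected tagging problems for one 260 field."""
    flags = []
    for code in ("a", "b"):
        if YEAR.search(subfields.get(code, "")):
            flags.append(f"date in ${code}")
    date = subfields.get("c", "")
    if FIVE_DIGIT.search(date):
        flags.append("five-digit year")
    if CM_AFTER_YEAR.search(date):
        flags.append("cm follows year")  # 300-field size leaked into 260 $c
    return flags


print(flag_260({"a": "Boston :", "b": "Little, Brown,", "c": "1987 ; 24 cm."}))
```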
WorldCat: Typographical Errors
Used “Typographical Errors in Library Databases” to identify
and quantify English language WorldCat errors (Ballard,
2005)
• Total errors: 26,599 (0.08%)
• Require manual examination to determine if actual
errors
• Searching for Institi*
• Misspelled:
• American Institite of Physics
• British Standards Institition
• Spelled correctly:
• Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin
Institute for Advanced Studies)
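The Institi* example shows why wildcard hits are only candidates. A small sketch of such a truncation search, assuming a case-insensitive regular expression over publisher strings (the function name is hypothetical), returns the two misspellings and the correctly spelled Irish name alike:

```python
import re


def wildcard_hits(stem: str, publisher_names):
    """Return (matched word, full name) pairs for words starting with `stem`."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    hits = []
    for name in publisher_names:
        match = pattern.search(name)
        if match:
            hits.append((match.group(0), name))
    return hits


names = [
    "American Institite of Physics",
    "British Standards Institition",
    "Institiúid Ard-Léinn Bhaile Átha Cliath",
    "Institute of Physics",
]
for word, name in wildcard_hits("Institi", names):
    print(word, "<-", name)
```

The pattern cannot distinguish an error from a legitimate word in another language, so each hit still has to be examined by hand.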
WorldCat: Typographical Errors
Top words (10.4%):
Word                        Probability (per Ballard)    Error Type       WorldCat Count
Worchester                  Highest                      Insertion        398
Metheun                     High                         Transposition    355
Universt*                   Highest                      Omission         299
Unives*                     Highest                      Omission         275
Westminister [and] Press    Highest                      Insertion        266
Niagr*                      High                         Omission         260
Phildel*                    High                         Omission         235
Tallahasee                  High                         Omission         234
John Hopkins Press          Highest                      Omission         227
Institi*                    High                         Substitution     226
WorldCat: Typographical Errors
“Westminister”
• Only included on Ballard list in combination with other
words
• Total errors in WorldCat: 628 (2.36%)
• Require manual review
Where are we now?
WorldCat: MARC 260 Evaluation
Top 10 terms in 260 $b in WorldCat
Term           Count
press          2,094,111
co             1,664,005
university     1,550,435
dept           1,084,647
pub            984,234
research       853,954
service        710,314
institute      660,346
office         649,794
chu ban she    620,735
WorldCat: MARC 260 Evaluation
University Press names in 260 $b in WorldCat
Term         Count
oxford       35,804
hopkins      22,564
cambridge    21,951
harvard      17,069
cornell      11,305
stanford     10,900
purdue       5,468
yale         5,076
princeton    4,746
rutgers      3,854
Clustering
Attempting programmatic clustering of publishers using
ISBN prefixes
• Data clustering (The Free Dictionary)
• "The science of extracting useful information from
large data sets or databases"
• Classification of similar objects into different groups
• Partitioning of a data set into subsets (clusters)
• Data in each subset (ideally) share some common
trait
WorldCat: Clustering Example
Used ISBN prefix 019 (Oxford University Press)
• Total WorldCat records: 58,004,317
• Records with ISBN prefix 019: 84,276 (0.15%)
• Non-unique publisher names from ISBN prefix records:
91,528
                        NACO normalized unique    Number of    Non-singleton    Largest
                        publisher names           clusters     clusters         cluster
One or more 019 ISBN    1,550                     919          222 (24.16%)     82 text strings
All 019 ISBNs           1,386                     799          205 (25.66%)     81 text strings
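The columns in this table can be reproduced in outline with a short sketch. Two things here are assumptions: the normalization is only a rough approximation of the NACO rules (strip diacritics, replace punctuation with spaces, uppercase, collapse whitespace), and because the slides do not say how the normalized names were grouped, the grouping key is left to the caller. The sample names are made up.

```python
import re
import unicodedata
from collections import defaultdict


def naco_like_normalize(name: str) -> str:
    """Rough approximation of NACO normalization (not the full rule set)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    spaced = re.sub(r"[^\w\s]", " ", no_marks)
    return " ".join(spaced.upper().split())


def cluster_by(names, key):
    """Group unique normalized names by a caller-supplied key function."""
    groups = defaultdict(set)
    for name in names:
        normalized = naco_like_normalize(name)
        groups[key(normalized)].add(normalized)
    return list(groups.values())


def summarize(clusters):
    """Compute the table's columns for a list of clusters."""
    non_singleton = [c for c in clusters if len(c) > 1]
    return {
        "unique_names": sum(len(c) for c in clusters),
        "clusters": len(clusters),
        "non_singleton": len(non_singleton),
        "non_singleton_pct": round(100 * len(non_singleton) / len(clusters), 2),
        "largest": max(len(c) for c in clusters),
    }


# Made-up publisher strings; grouping on the first word is a placeholder key.
sample = ["Oxford University Press", "Oxford Univ. Press",
          "OXFORD UNIVERSITY PRESS,", "Clarendon Press"]
print(summarize(cluster_by(sample, key=lambda n: n.split()[0])))
```

The percentages in the table follow the same arithmetic: 222 of 919 clusters is 24.16%, and 205 of 799 is 25.66%.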
Challenges: Publisher Name Authority File
Quality issue
• Level of acceptance for cluster
• What is acceptable?
Subsidiaries and Relationships
• Oxford & Auckland
• Examined manually to determine relationship
Form of name
• What is acceptable?
• Likely to use the most prominent form of name
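One possible reading of "most prominent form" is the variant that occurs most often within a cluster. The sketch below assumes that reading, which the slides do not confirm; the function name and sample strings are hypothetical.

```python
from collections import Counter


def preferred_form(variants):
    """Pick the most frequent variant; break ties alphabetically."""
    counts = Counter(variants)
    # Sort key: higher count first, then alphabetical order for determinism.
    return min(counts, key=lambda form: (-counts[form], form))


cluster = ["Oxford University Press", "Oxford Univ. Press",
           "Oxford University Press", "OUP"]
print(preferred_form(cluster))
```

Frequency is only one candidate criterion; an authority file might instead prefer the fullest form or the form the publisher itself uses.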
Questions and Discussion
Contact Information:
[email protected]
[email protected]
Project Web Site:
http://www.oclc.org/research/projects/publisherns/