Mining for Digital Resources

Mining for Digital Resources:
Identifying and Characterizing Digital
Materials in WorldCat
Brian Lavoie
Lynn Silipigni Connaway
Ed O’Neill
ACRL 12th National Conference
Minneapolis, MN
April 9, 2005
More information about the OCLC Research Data Mining
activity is available online:
http://www.oclc.org/research/projects/mining/
Rising Digital Tide
▪ Equivalent of 5 exabytes of new information created in 2002; 92 percent stored on magnetic or optical media (Lyman and Varian)
▪ Rush to digitize:
  • Cultural artifacts (images, audio, video, text)
  • Published content (books, journals, databases)
  • Communication (listservs, blogs, chat rooms)
  • Government information (reports, data, forms, records)
▪ Survey of Academic Libraries:
  • Average expenditure in 2003 on digital resources: $250,000 (8 percent increase)
  • 40 percent of respondents intend to reduce spending on print resources in order to increase spending on digital resources
Purpose of Study
Focused questions …
▪ Identify digital resources in WorldCat
  • Bibliographic criteria for algorithmic identification
▪ Characterize digital materials:
  • Cataloging activity; material types; holdings patterns …
But also broader questions …
▪ Explore ways to use information in bibliographic records to generate new views of the catalog
▪ “Large scale experiments with existing catalog records to see what can be done with legacy data” (Roy Tennant, Library Journal)
Data Sources
▪ WorldCat: world’s largest and most comprehensive bibliographic database
  • > 50,000 libraries worldwide use and contribute to WorldCat
▪ Copy of WorldCat from July 2004:
  • ~53 million records
▪ Copy of WorldCat holdings file from July 2004:
  • ~950 million holdings
▪ Caveats:
  • No presumption that all (or even most) digital materials are cataloged in WorldCat
  • Focus on cataloging practice and experimentation with bibliographic data
Identifying Digital Materials
▪ “Standard” MARC21 criteria:
  • Type of Record: computer file [LDR/6 = m]
  • Form of Item: electronic [008/23 or 29 = s]
  • General Materials Designation: electronic resource [245 $h]
▪ Other criteria:
  • Physical Description: electronic resource [007/0 = c]
  • Electronic Location and Access [856 2nd ind. = 0, no $3]
  • Additional Materials/Form of Material: computer file/electronic resource [006/0 = m]
  • Reproduction Note: electronic reproduction [533 $a]
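The three “standard” criteria above can be sketched as a record-level test. This is a minimal illustration, not the algorithm used in the study. It assumes a record is held in a plain Python dict with a hypothetical layout (`leader`, `008`, and a list of 245 `$h` values), that meeting any one criterion is enough, and that the Form of Item position is chosen from the record type (29 for maps and visual materials, 23 otherwise), which is one reading of MARC21 rather than something stated on the slide.

```python
# Record types whose 008 Form of Item sits at position 29 (maps, visual
# materials); all other types are assumed here to use position 23.
FORM_AT_29 = set("efgkor")


def meets_standard_criteria(record):
    """Return True if a record meets at least one "standard" criterion.

    `record` uses a hypothetical layout chosen for this sketch only:
      {"leader": str, "008": str, "245h": [str, ...]}
    """
    leader = record.get("leader", "")
    field_008 = record.get("008", "")
    gmd_values = record.get("245h", [])

    record_type = leader[6] if len(leader) > 6 else ""

    # Type of Record: computer file [LDR/6 = m]
    if record_type == "m":
        return True

    # Form of Item: electronic [008/23 or 29 = s]
    pos = 29 if record_type in FORM_AT_29 else 23
    if len(field_008) > pos and field_008[pos] == "s":
        return True

    # General Materials Designation: electronic resource [245 $h]
    return any("electronic resource" in h.lower() for h in gmd_values)


# Illustrative record (not taken from WorldCat): a book coded as electronic.
ebook = {
    "leader": "00000nam a2200000 a 4500",
    "008": " " * 23 + "s" + " " * 16,
    "245h": ["[electronic resource] /"],
}
print(meets_standard_criteria(ebook))  # True
```

Treating the criteria as alternatives matches the way the next slide speaks of records that did not meet "any" of the standard criteria, but that reading is an assumption.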
Analysis of “Other Criteria”
▪ Analyzed records that did NOT meet any of the standard criteria, but DID meet at least one of the other criteria:

  Criterion           Recall      Precision
  007/0 = c           Very High   Low
  856 2nd ind. = 0    High        Low
  006/0 = m           Medium      Low
  533 $a              Low         High

▪ Cataloging issues:
  • Accompanying materials
  • Separate record vs. combined record
  • Mis-codings
▪ Opted for conservative strategy of using only standard criteria
  • Wrote algorithm for automatic scanning of WorldCat
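The recall and precision ratings above are qualitative. As a reminder of what the two measures mean for a single criterion, here is a minimal sketch. The record IDs and sample sets are made up for illustration and are not data from the study.

```python
def precision_recall(flagged, truly_digital):
    """Compute precision and recall for one identification criterion.

    flagged:       set of record IDs the criterion selects
    truly_digital: set of record IDs judged digital (e.g. by manual review)
    """
    true_positives = len(flagged & truly_digital)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_digital) if truly_digital else 0.0
    return precision, recall


# Illustrative IDs only: a broad criterion catches every digital record but
# also many non-digital ones; a narrow criterion is accurate but misses most.
truly_digital = {1, 2, 3, 4}
broad_criterion = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
narrow_criterion = {1}

print(precision_recall(broad_criterion, truly_digital))   # (0.4, 1.0)
print(precision_recall(narrow_criterion, truly_digital))  # (1.0, 0.25)
```

The broad case mirrors the pattern reported for 007/0 = c (very high recall, low precision), and the narrow case mirrors 533 $a (low recall, high precision).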
The WorldCat Digital Bucket
[Figure: WorldCat (~53 million records) containing the “Digital” bucket (~750,000 records, ~1.5 percent)]
Dynamics
▪ Earliest Digital Record (lowest OCLC #):
  • #1617882: entered on September 11, 1975
  • American Antiquarian Society
  • Data file on tape reel
▪ Latest Digital Record (highest OCLC #):
  • #55794312: entered on July 1, 2004
  • Mississippi State University
  • Master’s thesis in PDF format
▪ Rate of Growth: January 2004 – July 2004
  • Net increase of 1.8 million WorldCat records
  • Net increase of 61,000 records describing digital materials: ~3 percent of total increase
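As a quick check of the growth figures, the shares can be recomputed from the rounded counts given on the slides. Because the inputs are rounded, the results are approximate.

```python
# Rounded counts taken from the slides.
total_increase = 1_800_000     # net new WorldCat records, Jan-Jul 2004
digital_increase = 61_000      # net new records describing digital materials
total_records = 53_000_000     # WorldCat records, July 2004
digital_records = 750_000      # records in the "digital bucket"

share_of_growth = digital_increase / total_increase * 100
share_of_total = digital_records / total_records * 100

print(f"Digital share of growth: {share_of_growth:.1f} percent")   # 3.4 percent
print(f"Digital share of WorldCat: {share_of_total:.1f} percent")  # 1.4 percent
```

On these rounded inputs, digital materials account for a larger share of recent growth than of the database as a whole.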
WorldCat Cataloging Activity for Digital Materials:
Number of “Digital Records” Entered, by Year (’75-’04)
[Figure: bar chart of Number of Records (0 to 180,000) by Year Entered Into WorldCat, 1975-2004; 2004 value estimated. Annotation: Contributed: 98% (WorldCat: 88%)]
Distribution of Digital Material Types in
WorldCat (July 2004)
[Figure: pie chart] Books 43%; Comp. Files 26%; Gov. Docs 14%; Serials 6%; Theses 3%; Pamphlets 3%; Other 5%
Digital Material Types in WorldCat: 1985 and 2004
(Percent of Total)
                     2004
Books:                 43
Computer Files:        26
Government Docs:       14
Serials:                6
Theses:                 3
Pamphlets:              3
Other:                  5
Total:                100
[1985 column: values 98, 1, 1 (total 100); row assignment not recoverable from the transcript]
Digital (e-)Books: Additional Characteristics
  Median Holdings: 1 (All books in WC: 3)
  Uniquely Held: 65 percent (All books in WC: 32 percent)
  Total Holdings: ~13 million (All books in WC: ~700 million)
Percent of Total Holdings Set By:
  ARLs: 6 (All books in WC: 23)
  Non-ARL academics: 71 (All books in WC: 44)
  Publics: 13 (All books in WC: 24)
Digital books with at least one print equivalent cataloged in WorldCat: ~88,000
Percent of digital books available online: 70 percent
Looking Ahead … “Murky Buckets”?
▪ Early view: format most important feature of digital materials
  • Implies one “digital bucket”
▪ But as number and variety of digital materials expand …
  • Need for increasingly fine distinctions between buckets
  • “Online e-book” requires 3 filters to surface in search results: Format (digital), Means of access (network), Material type (book)
▪ “Murky Bucket Syndrome”: We cannot entirely, unambiguously slice and dice [large bibliographic databases] because of historic data entry and cataloging practices … that were not oriented toward our new needs (Lorcan Dempsey, quoted by Roy Tennant in Library Journal)
▪ Particularly troublesome for digital materials:
  • Cataloging practices in flux; new types of digital resources
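The “3 filters” point can be sketched as a conjunction of three independent tests. The facet names and sample records below are hypothetical. In practice each facet would have to be derived from MARC coding, which is where the murkiness comes in.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    fmt: str            # hypothetical facet, e.g. "digital" or "print"
    access: str         # hypothetical facet, e.g. "network" or "physical carrier"
    material_type: str  # hypothetical facet, e.g. "book", "serial", "thesis"


def is_online_ebook(rec):
    """All three filters must pass: format, means of access, material type."""
    return (
        rec.fmt == "digital"
        and rec.access == "network"
        and rec.material_type == "book"
    )


# Made-up records for illustration only.
records = [
    Record("A", "digital", "network", "book"),           # online e-book
    Record("B", "digital", "physical carrier", "book"),  # e.g. book on CD-ROM
    Record("C", "digital", "network", "serial"),         # e-journal
    Record("D", "print", "physical carrier", "book"),    # print book
]

online_ebooks = [r.title for r in records if is_online_ebook(r)]
print(online_ebooks)  # ['A']
```

A single “digital” flag would return A, B, and C together, which is why one bucket stops being enough as the variety of materials grows.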
Conclusions
▪ Identification and categorization of digital materials:
  • For now … need more work to identify consistent cataloging patterns in existing bibliographic records
  • And for the future … need clear, stable practices for cataloging digital materials
▪ Benefits:
  • End users (resource discovery based on new views of the catalog)
  • Librarians (digitization priorities, collection analysis …)
▪ “Processable catalogs”:
  • Make bibliographic data work harder!
More information …
▪ Paper forthcoming
▪ Contacts:
  • [email protected]
  • [email protected]
  • [email protected]
▪ Presentation to be posted on OCLC Research Web site:
  • http://www.oclc.org/research/