Capturing Untapped Descriptive Data:
Creating Value for Librarians and Users
Lynn Silipigni Connaway
OCLC Research
ASIST 2006 Conference
November 9, 2006
WorldCat: July 2006
Manifestations (records): 67,282,165
Total holdings: 1,071,507,045
Digital Items: 1,571,803
Works: 53,472,668
Institutions: 26,236
Physical Items*: ~1.6 billion
*Estimated
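As a rough sanity check on the figures above, a few ratios can be derived directly from them. This is only a sketch: the variable names are illustrative, and the derived ratios do not appear on the slide.

```python
# Figures copied from the "WorldCat: July 2006" slide.
records = 67_282_165        # manifestations
holdings = 1_071_507_045    # total holdings
works = 53_472_668
institutions = 26_236

# Derived ratios (not stated on the slide).
holdings_per_record = holdings / records
records_per_work = records / works
holdings_per_institution = holdings / institutions

print(f"Average holdings per record:      {holdings_per_record:.1f}")
print(f"Average records per work:         {records_per_work:.2f}")
print(f"Average holdings per institution: {holdings_per_institution:,.0f}")
```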
Origin of materials represented in WorldCat
• Rest of World: 40%
• US: 34%
• Unknown: 14%
• UK: 9%
• Canada: 3%
Some aspects of “Global WorldCat” …
Content Languages: 476
43% of WorldCat non-English
Top 5 non-English:
• German: 4.5 million
• French: 4.2 million
• Spanish: 2.9 million
• Dutch: 2.1 million
• Chinese: 1.6 million
Materials w/non-US origins: 35.3 million (52%)
Top 5:
• UK: 6.1 million
• Germany: 4.0 million
• France: 2.9 million
• Netherlands: 2.2 million
• Canada: 2.1 million
Non-English Metadata Language: 9.3 million (20 languages)
Top 5:
• Dutch: 4.1 million
• French: 1.4 million
• German: 1.0 million
• Japanese: 0.7 million
• Finnish: 0.7 million
OCLC WorldCat™: Decision-making Resource
• Collection management
  • Cooperative collection development
  • Comparative collection analysis
  • Collection assessment
  • Mass digitization
  • Off-site storage
  • Preservation
• Services
  • Virtual reference
  • Recommender services
• Systems
  • Precision
OCLC WorldCat™: Data Mining Research Projects
• Audience Level
• Publisher Name Server
• WorldMap
Audience Level: Rationale and Objectives
Holdings represent selection decisions by
librarians … implies there are about 1
billion individual selection decisions in the
WorldCat holdings file
Selections are made to serve the interests of
a library’s target community …
• Associate target community (audience level) to particular library profiles, e.g., ARL, non-ARL academic, public, K-12 school …
Implies: we can infer materials’ audience level from holdings patterns, which in turn can support:
• Collection management
• Readers’ advisory services
• Reference services
• Information retrieval
Example: Mother Goose
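The inference described above can be sketched as a weighted average over the types of libraries that hold a title. This is an assumption-laden illustration, not the method used in the project: the weights, the 0-to-1 scale, the library-type keys, and the sample holdings counts are all hypothetical.

```python
# Hypothetical weights on a 0.0 (juvenile) to 1.0 (research) scale.
# These values are invented for illustration only.
LIBRARY_TYPE_WEIGHTS = {
    "k12_school": 0.1,
    "public": 0.4,
    "non_arl_academic": 0.7,
    "arl": 0.9,
}


def audience_level(holdings_by_type):
    """Return the holdings-weighted average audience level, or None if unheld."""
    total = sum(holdings_by_type.values())
    if total == 0:
        return None
    weighted = sum(
        LIBRARY_TYPE_WEIGHTS[library_type] * count
        for library_type, count in holdings_by_type.items()
    )
    return weighted / total


# Invented holdings pattern for a children's title held mostly by
# public and school libraries.
childrens_title = {"public": 80, "k12_school": 20}
print(audience_level(childrens_title))
```

A title held mainly by school and public libraries scores near the low end of the scale, while one held mainly by ARL libraries scores near the high end.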
Publisher Name Server: Research Objectives
• Resolve for data mining and quality of WorldCat
  • ISBN prefixes to publisher name
  • Variant publisher names to a preferred form
• Complement Collection Analysis Service
  • Librarians
  • Publishers
• Capture and make available various attributes of individual publishers
  • Location of publisher
  • Language(s) of materials published
  • Genre(s)/format(s) of materials published
  • Dominant subject domain(s) of the publisher's output
  • Parent company and subsidiaries
Publisher Name Server: Methodology
• Programmatically cluster publishers using ISBN prefixes
  • Data clustering (The Free Dictionary)
  • "The science of extracting useful information from large data sets or databases"
  • Classification of similar objects into different groups
  • Partitioning of a data set into subsets (clusters)
  • Data in each subset (ideally) share some common trait
• Hand parse the entities and resolve ISBN prefixes
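The first step above can be illustrated by grouping records on their ISBN prefix and proposing the most frequent name in each group as a candidate preferred form. This sketch uses plain grouping by key rather than a statistical clustering algorithm, and the function names and sample records are hypothetical; the ISBN-shaped strings are made up and their check digits are not validated.

```python
from collections import Counter, defaultdict


def isbn_prefix(hyphenated_isbn):
    """Return the group and registrant elements, e.g. '0-13' from '0-13-111111-1'."""
    parts = hyphenated_isbn.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated 10-digit ISBN: {hyphenated_isbn!r}")
    return "-".join(parts[:2])


def cluster_by_prefix(records):
    """Group (isbn, publisher_name) pairs by prefix, counting each name variant."""
    clusters = defaultdict(Counter)
    for isbn, name in records:
        clusters[isbn_prefix(isbn)][name] += 1
    return clusters


def preferred_forms(clusters):
    """Propose the most frequent variant per prefix; a person would still review it."""
    return {prefix: names.most_common(1)[0][0] for prefix, names in clusters.items()}


# Made-up sample records for illustration only.
SAMPLE = [
    ("0-13-111111-1", "Prentice-Hall, Inc."),
    ("0-13-222222-2", "Prentice-Hall, Inc."),
    ("0-13-333333-3", "Prentice Hall"),
    ("0-07-444444-4", "McGraw Hill, Inc."),
]

clusters = cluster_by_prefix(SAMPLE)
for prefix, name in preferred_forms(clusters).items():
    print(prefix, "->", name, dict(clusters[prefix]))
```

The hand-parsing step then reviews each cluster, since a single prefix can cover several imprints and a single publisher can own several prefixes.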
Publisher Name Server: Database
• To date >800 records
• Relational database, preserving hierarchical relationships
• Begins with high-occurrence entities to identify:
  • “Top 10” lists (USA, UK, Canada, Australia, Germany, France, Japan, Italy)
  • Top university presses
  • Mergers and acquisitions
Top U.S. Publishing Entities in WorldCat
(22,680,201 total U.S. records)

ISBN Prefix   WorldCat Records   Publishing Entity
0-13          50,298             Prentice-Hall, Inc.
0-07          44,545             McGraw Hill, Inc.
0-06          44,362             HarperCollins (Firm)
0-16          40,451             United States G.P.O.
0-471         37,710             John Wiley & Sons
0-312         33,318             St. Martin's Press
0-671         31,765             Simon & Schuster, Inc.
0-02          27,602             MacMillan Publishers
0-15          18,420             Harcourt Brace & Company
0-394         18,043             Random House (Firm)
0-590         17,290             Scholastic Inc.
0-385         16,768             Doubleday and Company, Inc.
0-395         16,699             Houghton Mifflin Company
0-19          15,724             Oxford University Press
0-03          15,417             Holt, Rinehart, and Winston
Publisher Name Server: Database
Database Fields:
• Publisher Name, Preferred Form
• Source of Preferred Form
• Former Names
• Variant Forms
• ISBN Prefixes
• HQ City
• HQ Country
• Other Cities
• URL
• Languages
• Formats
• DDC Subjects
• LCC Subjects
Data Sources:
• U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
• Books In Print Online (R.R. Bowker)
• The International ISBN Registry (K.G. Saur)
• Publishers’ Weekly Online
• Hoover’s Handbook Online
• Standard and Poor’s Corporate Descriptions
• The Directory of Corporate Affiliations (DIALOG)
• Company websites
DATA MINING
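One way to picture a single record with the fields listed above is as a small data class. This is a sketch only: the attribute names, the types, and the `matches` helper are assumptions made for illustration, not the actual database schema, and the variant form in the example is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PublisherRecord:
    """Illustrative record mirroring the field list on the slide."""
    preferred_name: str
    preferred_name_source: Optional[str] = None
    former_names: List[str] = field(default_factory=list)
    variant_forms: List[str] = field(default_factory=list)
    isbn_prefixes: List[str] = field(default_factory=list)
    hq_city: Optional[str] = None
    hq_country: Optional[str] = None
    other_cities: List[str] = field(default_factory=list)
    url: Optional[str] = None
    languages: List[str] = field(default_factory=list)
    formats: List[str] = field(default_factory=list)
    ddc_subjects: List[str] = field(default_factory=list)
    lcc_subjects: List[str] = field(default_factory=list)

    def matches(self, name: str) -> bool:
        """Hypothetical helper: case-insensitive match on any known name form."""
        known = [self.preferred_name, *self.former_names, *self.variant_forms]
        return name.casefold() in (form.casefold() for form in known)


# The preferred name and prefix come from the top U.S. publishers table;
# the variant form is an invented example.
record = PublisherRecord(
    preferred_name="Prentice-Hall, Inc.",
    variant_forms=["Prentice Hall"],
    isbn_prefixes=["0-13"],
)
print(record.matches("prentice hall"))
```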
Entity-Parsing in a World of Mergers and Acquisitions
[Organization chart: Pearson PLC and the publishing entities grouped under it]
• Pearson PLC
• Penguin Books
• Allen Lane
• Puffin Books
• Ladybird Books
• Pearson Canada
• Copp Clark
• Riverhead Books
• Pearson Technology Group
• Adobe Press
• Cisco Press
• Putnam Books
• Berkley Publishing Group
• Pearson Education, Inc.
• Avery
• Addison-Wesley Publishing Company
• Benjamin/Cummings Publishing Company
• Allyn and Bacon
• Scott, Foresman and Company
• Prentice-Hall, Inc.
• HarperCollins Educational Publishers
• Dominie Press
• Longmans, Green, and Co.
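Resolving an imprint to its top-level owner amounts to following parent links until none remain. The sketch below assumes a simple child-to-parent mapping; the `PARENT` table covers only a small illustrative subset of the entities named above, and its specific links are assumptions rather than a transcription of the chart.

```python
# Illustrative child -> parent links (assumed, not copied from the chart).
PARENT = {
    "Penguin Books": "Pearson PLC",
    "Puffin Books": "Penguin Books",
    "Pearson Education, Inc.": "Pearson PLC",
    "Prentice-Hall, Inc.": "Pearson Education, Inc.",
}


def ultimate_parent(entity, parent=PARENT):
    """Follow parent links upward and return the top-level entity."""
    seen = set()
    while entity in parent:
        if entity in seen:
            raise ValueError(f"cycle in ownership data at {entity!r}")
        seen.add(entity)
        entity = parent[entity]
    return entity


def lineage(entity, parent=PARENT):
    """Return the chain from the entity up to its top-level owner."""
    chain = [entity]
    while chain[-1] in parent and parent[chain[-1]] not in chain:
        chain.append(parent[chain[-1]])
    return chain


print(ultimate_parent("Puffin Books"))
print(" < ".join(lineage("Prentice-Hall, Inc.")))
```

Keeping the links in a table rather than flattening them preserves the hierarchy, so a merger or acquisition becomes a single changed link.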
OCLC WorldMap™: Objectives
• Geographically represent library data from UNESCO, ARL, and NCES
  • Number of libraries
  • Amount of library expenditures
  • Number of volumes and titles
  • Number of librarians
  • Number of users
OCLC WorldMap™: Objectives
• Research prototype
  • Test geographical representation of WorldCat
  • Titles and holdings by country of publication
  • Support data mining research area
  • Visually display mined data to ease review and analysis
  • Internal use
  • Sales and marketing
  • External use
  • Library collection assessment and comparison
  • Complement the AAU/ARL Global Resources Network project
  • Project of the Council on Library and Information Resources (CLIR)
OCLC WorldMap™: Technology
• First implemented in SVG
  • Open standard maintained by W3C
  • Simple XML file
  • Young technology
  • Browser support limited
  • Requires plug-in
• Converted to Flash
  • Browser compatibility
  • Plug-in compatibility (if a plug-in was installed!)
• For a detailed comparison of SVG and Flash, see:
  http://www.carto.net/papers/svg/comparison_flash_svg/
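To show what "simple XML file" means in practice, the sketch below writes a tiny SVG document with Python's standard library. It draws a horizontal bar per country rather than a map, so it is not the WorldMap prototype; the function name, layout constants, and scale are assumptions, and the figures are the top three non-US origins from the earlier slide.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Top non-US origins from the "Global WorldCat" slide, in millions of records.
TOP_ORIGINS = [("UK", 6.1), ("Germany", 4.0), ("France", 2.9)]


def bar_chart_svg(values, bar_height=20, pixels_per_million=40):
    """Return an SVG document (as text) with one labelled bar per value."""
    ET.register_namespace("", SVG_NS)  # emit a default xmlns instead of a prefix
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        width="400",
        height=str(bar_height * len(values)),
    )
    for row, (label, millions) in enumerate(values):
        bar_width = round(millions * pixels_per_million)
        ET.SubElement(
            svg,
            f"{{{SVG_NS}}}rect",
            x="0",
            y=str(row * bar_height),
            width=str(bar_width),
            height=str(bar_height - 4),
        )
        text = ET.SubElement(
            svg,
            f"{{{SVG_NS}}}text",
            x=str(bar_width + 5),
            y=str(row * bar_height + bar_height - 8),
        )
        text.text = f"{label}: {millions} million"
    return ET.tostring(svg, encoding="unicode")


svg_text = bar_chart_svg(TOP_ORIGINS)
print(svg_text)
```

Because the output is ordinary XML, it can be produced and inspected with general-purpose tools, which is the property the slide highlights.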
OCLC WorldMap™
Potential Future Projects
• Audience Level
  • Integrate into WorldCat.org and OPACs to limit searches and retrieved sources
• Publisher Name Server
  • Integrate into OCLC Collection Analysis Service for publisher business intelligence
• WorldMap
  • Subject information “aboutness”
  • Language of item
  • Content language
  • Metadata language
  • Holdings by country of library
Presentation will be available at
http://www.oclc.org/research/presentations/default.htm
Prototypes available at
http://www.oclc.org/research/researchworks/default.htm
Project Web Site:
http://www.oclc.org/research/projects/default.htm
Questions and Discussion
Contact Information:
[email protected]