Making data work harder
Lorcan Dempsey
OCLC
OVGTSL 2005 Conference
Newark, May 11-13
OCoLR
20041025
#53928015
OCLCR
Overview
• Some context
• Looking at data in action
  • OpenWorldCat
  • FRBR
  • Data mining
Context: value
• Amazoogle: what should we be doing that fits into a world they occupy? Where do we provide unique value?
• ROI: libraries invest in data but do not extract as much value from it as they might. Unless we release more value, the argument for this investment becomes weaker.
• User: how do we co-create value with users? What opportunities are there for mixing catalog data and user-contributed data?
• Management intelligence: how do we use data better to inform management decisions?
Context: consequences
• The role of the catalog?
• The role of structured data?
• The role of the library?
Data
• Open WorldCat
• FRBR
• WorldCat Wiki
• Management intelligence
FRBR
• ‘Interim FRBR’ in OWC
• FRBR in research projects
  • FictionFinder
  • Curioser
  • xISBN
  • Algorithm
  • Top 1000
• FRBR in FirstSearch – late this year
Top Sets for Fiction (Records)
Records  Keys
1,296    defoe, daniel\1661 1731/robinson crusoe
1,267    carroll, lewis\1832 1898/alices adventures in wonderland
971      cervantes saavedra, miguel de\1547 1616/don quixote
828      stevenson, robert louis\1850 1894/treasure island
624
twain, mark\1835 1910/adventures of huckleberry finn
twain, mark\1835 1910/adventures of tom sawyer
618
swift, jonathan\1667 1745/gullivers travels
689
Top Sets for Fiction (Holdings)
Holdings  Keys
29,043    twain, mark\1835 1910/adventures of huckleberry finn
26,088    carroll, lewis\1832 1898/alices adventures in wonderland
20,843    twain, mark\1835 1910/adventures of tom sawyer
19,410    defoe, daniel\1661 1731/robinson crusoe
18,566    cervantes saavedra, miguel de\1547 1616/don quixote
18,492    stevenson, robert louis\1850 1894/treasure island
18,123    dickens, charles\1812 1870/christmas carol
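The record keys in the tables above pair a normalized author heading and dates with a normalized title. The following is a minimal Python sketch of how keys of that shape might be built and used to count records and holdings per work set. The normalization rules are assumptions inferred only from the sample keys (lowercasing, dropped apostrophes, other punctuation turned into spaces); the function names and sample records are hypothetical, and this is not the OCLC work-set algorithm itself.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text, keep_commas=False):
    """Lowercase, strip accents and apostrophes, turn other punctuation into spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    text = re.sub(r"['’]", "", text)  # "Alice's" -> "alices"
    pattern = r"[^a-z0-9, ]" if keep_commas else r"[^a-z0-9 ]"
    text = re.sub(pattern, " ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def work_key(author, dates, title):
    """Build an author/dates/title key in the shape shown on the slide (assumed format)."""
    return f"{normalize(author, keep_commas=True)}\\{normalize(dates)}/{normalize(title)}"


def summarize_work_sets(records):
    """Count records and sum holdings for every distinct key."""
    totals = defaultdict(lambda: {"records": 0, "holdings": 0})
    for rec in records:
        key = work_key(rec["author"], rec["dates"], rec["title"])
        totals[key]["records"] += 1
        totals[key]["holdings"] += rec["holdings"]
    return dict(totals)


# Hypothetical sample records, not WorldCat data.
sample = [
    {"author": "Defoe, Daniel", "dates": "1661-1731",
     "title": "Robinson Crusoe", "holdings": 12},
    {"author": "Defoe, Daniel", "dates": "1661-1731",
     "title": "Robinson Crusoe.", "holdings": 3},
    {"author": "Carroll, Lewis", "dates": "1832-1898",
     "title": "Alice's Adventures in Wonderland", "holdings": 7},
]

work_sets = summarize_work_sets(sample)
# List work sets by total holdings, largest first.
for key, totals in sorted(work_sets.items(), key=lambda kv: -kv[1]["holdings"]):
    print(totals["holdings"], totals["records"], key)
```

Ranking the same summary by "records" instead of "holdings" gives the other view shown on the slide: how many bibliographic records a work has attracted, as opposed to how widely it is held.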
Taking FRBR onto the open web
• Curio(u)ser
MetaWiki
• WIKI – web pages
• metaWIKI – data
• Capture user input in structured ways
Extending Wiki’s utility
Wiki:
• supported markup: wikitext
• page editing: a single text block
• searches: full text searching
• collections managed: one per wiki
MetaWiki:
• supported markup: wikitext; structured data (e.g., MARC, METS, DC…)
• page editing: a single text block, or field level
• searches: full text searching; fielded searching
• collections managed: one/multiple per OaiWiki
Lorcan: note that this is a work in progress
Management intelligence
• So we have all this data – what can it tell us?
• Several projects underway: only some discussed here
Making Data Work Harder

Activities “shed” data:
• Cataloging → bibliographic information
• Web site traffic → transaction logs
• Reference queries → search term lists
Need to mine this data for intelligence that creates value for libraries and users
OCLC Research undertaking a number of data-mining projects aimed at:
• Knowing more about the characteristics of library collections
• Creating interesting and useful data displays
• Generating intelligence to support library decision-making
Data mining
• OCLC has a new collection analysis service
• Some research projects looking at systemic questions described here.
Looking at Library Print Book Collections … Systematically
OCLC/Ithaka collaboration: use WorldCat to characterize the “system-wide” print book collection – i.e., aggregate print book holdings in WorldCat
• 32 million print books, representing 26 million distinct works
• Only about 120,000 works had both print book and e-book manifestations
• Half of print books published after 1977; more than 80% still “in copyright”
• Rareness is common! Only a third of print books have more than five holdings; half have two or fewer
Intelligence of this kind can help establish digitization priorities and inform preservation planning
More information: http://www.oclc.org/research/presentations/lavoie/cni2005.ppt
The Implications of GooglePrint …
• Potentially covers about one third of print books in WorldCat
• ~60 percent of “GooglePrint” books held by only one of the Google 5
• Less than 5 percent held by all of the Google 5
• ~20 percent of “GooglePrint” books out of copyright
Paper forthcoming …
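Overlap figures of this kind come down to counting, for each book, how many libraries in a chosen group hold it. A minimal Python sketch, under stated assumptions: holdings are modelled as a mapping from record ID to a set of library symbols, and the symbols, record IDs, and function name are all hypothetical. It illustrates the counting only and does not reproduce the study's method or data.

```python
from collections import Counter


def overlap_profile(holdings, group):
    """For records held by at least one group member, return the share
    of records held by exactly 1, 2, ... members of the group."""
    group = set(group)
    counts = Counter()
    for libraries in holdings.values():
        n = len(group & set(libraries))
        if n:  # ignore records no group member holds
            counts[n] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: counts[n] / total for n in sorted(counts)}


# Hypothetical holdings: record ID -> symbols of the holding libraries.
holdings = {
    "rec1": {"LIB_A"},
    "rec2": {"LIB_A", "LIB_B"},
    "rec3": {"LIB_C", "LIB_X"},
    "rec4": {"LIB_X"},  # held outside the group only
}
group = ["LIB_A", "LIB_B", "LIB_C"]

profile = overlap_profile(holdings, group)
for n, share in profile.items():
    print(f"held by {n} of {len(group)} group libraries: {share:.0%}")
```

The share for key 1 corresponds to "held by only one" of the group, and the share for the key equal to the group size corresponds to "held by all".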
Know Your Audience!
Holdings represent selection decisions by librarians … implies there are about 1 billion individual selection decisions in the WorldCat holdings file
Selections are made to serve the interests of a library’s target community …
• Associate target community (audience level) to particular library profiles - e.g., ARL, non-ARL academic, public, K-12 school …
Implies: we can infer materials’ audience level from holdings patterns, which in turn can support:
• Collection management
• Readers’ advisory services
• Reference services
• Information retrieval
Paper forthcoming!
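One way to picture the inference from holdings patterns to audience level is a holdings-weighted average over library types. The Python sketch below is an illustration only: the numeric score attached to each library type is invented for this example, the type labels and function name are hypothetical, and none of it should be read as the method of the forthcoming paper.

```python
# Assumed scores: 0 = most general audience, 1 = most specialized.
# These values are invented for illustration.
TYPE_SCORES = {
    "k12_school": 0.1,
    "public": 0.3,
    "non_arl_academic": 0.6,
    "arl": 0.9,
}


def audience_level(holdings_by_type, scores=TYPE_SCORES):
    """Holdings-weighted mean of the library-type scores.

    holdings_by_type maps a library type to the number of libraries of
    that type holding the item. Returns None if no scored type holds it.
    """
    total = sum(holdings_by_type.get(t, 0) for t in scores)
    if total == 0:
        return None
    weighted = sum(scores[t] * holdings_by_type.get(t, 0) for t in scores)
    return weighted / total


# Hypothetical items: one held mostly by public and school libraries,
# one held mostly by research libraries.
popular_title = {"k12_school": 40, "public": 55, "arl": 5}
research_title = {"non_arl_academic": 10, "arl": 30}

print(audience_level(popular_title))
print(audience_level(research_title))
```

Under these assumed scores, an item held mainly by school and public libraries lands near the low end of the scale, and one held mainly by research libraries lands near the high end.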
“Last Copy”: Identifying At-Risk Materials
~23 million WorldCat records have only a single holding attached
Libraries need to know what portions of their collections are:
Rare … Rare and valuable …
“Last copy” (artifact and/or content)
Identification of rare materials essential intelligence in support of storage, digitization, and preservation decision-making
Data-mining study of Vanderbilt holdings in WorldCat:
• Identified 23,000 items held uniquely by Vanderbilt
• ~60% are print books
• ~60% produced prior to 1950; ~25% produced after 1970
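Finding the items a single library holds uniquely is a filter over the holdings attached to each record. A minimal Python sketch, assuming each record carries a set of holding-library symbols plus a format and a year; the field names, symbols, and sample records are hypothetical and are not taken from the Vanderbilt study.

```python
def uniquely_held(records, library):
    """Return the records whose only holding library is `library`."""
    return [rec for rec in records if set(rec["holdings"]) == {library}]


def share(records, predicate):
    """Fraction of records satisfying predicate (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(1 for rec in records if predicate(rec)) / len(records)


# Hypothetical records, not WorldCat data.
records = [
    {"id": "r1", "holdings": {"LIB_A"}, "format": "book", "year": 1921},
    {"id": "r2", "holdings": {"LIB_A", "LIB_B"}, "format": "book", "year": 1985},
    {"id": "r3", "holdings": {"LIB_A"}, "format": "map", "year": 1975},
    {"id": "r4", "holdings": {"LIB_B"}, "format": "book", "year": 1940},
]

last_copies = uniquely_held(records, "LIB_A")
print([rec["id"] for rec in last_copies])
print(share(last_copies, lambda rec: rec["format"] == "book"))
print(share(last_copies, lambda rec: rec["year"] < 1950))
```

The same `share` helper gives the kind of breakdown reported on the slide (by format, by date range) once the uniquely held set has been identified.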
Paper forthcoming!
Thank you!
OCLC Research: http://www.oclc.org/research/
Lorcan: http://orweblog.oclc.org/