Big Data, Linked Data: Classification Research at the Junction
24th ASIS&T SIG/CR Classification Research Workshop, 2 November 2013
The Interplay of Big Data, WorldCat, and Dewey
Rebecca Green, OCLC
Michael Panzer, OCLC
The world’s libraries. Connected.
Roadmap
• Setting the stage
  • Big data
  • WorldCat as big data
  • Literary warrant and the DDC
• “Classification analytics”
  • Classified works
  • Access points
  • Trending topics
  • Structure of discipline
Setting the stage
3 V’s of big data
• Volume
  • Terabytes (1000⁴ bytes), petabytes (1000⁵ bytes), exabytes (1000⁶ bytes), ...
  • Number of transactions vs. number of bytes
  • My big data is not your big data
3 V’s of big data – cont.
• Variety
  • Sources, perspectives, standards
  • Structured vs. unstructured data
  • Semantically related datasets
• Velocity
  • Data creation
  • Data analysis
WorldCat as big data
• Variety
  • Records in MARC Bibliographic Format
  • Records in MARC Holdings Format
  • Records in MARC Authority Format (e.g., LCSH, FAST, BISAC, MeSH, VIAF)
  • Vendor records
  • WorldCat knowledge base
  • Institutional registry data
  • Institution-specific acquisitions, circulation, ILL data
WorldCat as big data
• Volume
  • Bibliographic data: over 300 million records
  • Holdings data: over 2 billion records
  • Authority data
    • LCSH: 26.4 million headings
    • VIAF: 24.2 million clusters; 21 million links between records
Literary warrant and the DDC
• DDC editorial rules call for literary warrant to be taken into account for:
  • Expansions (i.e., development of new classes)
  • Reductions (i.e., discontinuing entire classes)
  • Form of name used in class descriptions
  • Order in which topics are listed in multitopic caption
  • Creation of and choice of examples in add instructions
  • Indexability of topics (print; WebDewey)
  • Form of name for index entries
“Classification analytics”
Classified works
• Periodic profiles of distribution of classified works across the classification to identify:
  • Expansions: Disciplines/subjects with sufficient literary warrant
  • Reductions: Classes with insufficient literary warrant
Classified works: Expansion warranted (1)
306.44 Language
    Including pragmatics
    Class here anthropological linguistics, ethnolinguistics, sociolinguistics
306.446 Bilingualism and multilingualism
306.449 Language planning and policy
306.449 4–.449 9 Specific continents, countries, localities in modern world
    Add to base number 306.449 notation 4–9 from Table 2, e.g., language policy of India 306.44954
Classified works: Expansion warranted (2)
• Records retrieved in WorldCat searches on dd:306.44* not dd:(306.440* or 306.446* or 306.449*)
Time period   Records retrieved   Language-specific: English, French, German, Spanish
1981-1985                   120                    14
1986-1990                   412                    59
1991-1995                   912                   134
1996-2000                  1230                   163
2001-2005                  1603                   199
2006-2010                  2369                   446
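The search pattern and the counts above can be worked through in a few lines. What follows is a minimal sketch, not the presenters' tooling: it only assembles the search string in the form shown on the slide (no WorldCat API is called), and the function names `residual_class_query` and `growth_factors` are hypothetical. The counts are copied from the "Records retrieved" column.

```python
def residual_class_query(base, developed_digits):
    """Build a query for works classed under `base` but outside its
    already-developed subdivisions (hypothetical helper)."""
    excluded = " or ".join(f"{base}{digit}*" for digit in developed_digits)
    return f"dd:{base}* not dd:({excluded})"


# Counts copied from the "Records retrieved" column of the table.
records_by_period = {
    "1981-1985": 120,
    "1986-1990": 412,
    "1991-1995": 912,
    "1996-2000": 1230,
    "2001-2005": 1603,
    "2006-2010": 2369,
}


def growth_factors(counts):
    """Ratio of each period's count to the count of the period before it."""
    values = list(counts.values())
    return [later / earlier for earlier, later in zip(values, values[1:])]


query = residual_class_query("306.44", ["0", "6", "9"])
overall_growth = records_by_period["2006-2010"] / records_by_period["1981-1985"]

print(query)
print([round(f, 2) for f in growth_factors(records_by_period)])
print(round(overall_growth, 1))
```

Every period-over-period factor is above 1, which is the kind of sustained growth that supports an expansion at 306.44.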
Classified works: Reduction warranted (1)
006.33 *Knowledge-based systems
    ...
006.336 *Programming for knowledge-based systems
006.336 3 *Programming languages for knowledge-based systems
006.337 Programming for knowledge-based systems for specific types of computers, for specific operating systems, for specific user interfaces
006.338 *Programs for knowledge-based systems
Classified works: Reduction warranted (2)
• Records retrieved in WorldCat searches for disjunction of DDC class number and standard subdivisions of number
DDC class   1986-1990   1991-1995   1996-2000   2001-2005   2006-2010   2011-2015
006.33           1241         978         612         660         915         246
006.336             0           1           1           6          14           3
006.3363            1           1           0           0           1           0
006.337             0           1           5           5          10           0
006.338             0           0           3           1           3           1
• Duplicates not filtered out of search results for 006.33
• Duplicates filtered out of all other search results
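A profile like this can be turned into a list of reduction candidates by totaling each row and comparing it with a cutoff. The sketch below assumes a cutoff of 30 records across all periods; that figure is purely illustrative and is not a DDC editorial rule. The counts are copied from the table, and the 006.33 row still includes duplicates, as noted above.

```python
# Counts per period (1986-1990 through 2011-2015), copied from the table.
counts_by_class = {
    "006.33": [1241, 978, 612, 660, 915, 246],
    "006.336": [0, 1, 1, 6, 14, 3],
    "006.3363": [1, 1, 0, 0, 1, 0],
    "006.337": [0, 1, 5, 5, 10, 0],
    "006.338": [0, 0, 3, 1, 3, 1],
}

# Hypothetical cutoff, chosen only for illustration.
MIN_TOTAL_RECORDS = 30

# Total records per class across all periods.
totals = {ddc: sum(counts) for ddc, counts in counts_by_class.items()}

# Classes whose total falls below the cutoff are reduction candidates.
reduction_candidates = [
    ddc for ddc, total in totals.items() if total < MIN_TOTAL_RECORDS
]

for ddc, total in totals.items():
    flag = "candidate for reduction" if ddc in reduction_candidates else "keep"
    print(f"{ddc:<10}{total:>6}  {flag}")
```

Under this assumed cutoff, all four subdivisions of 006.33 are flagged while 006.33 itself is not.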
Access points
• Analysis of subject heading data in DDC categorized content to identify:
  • Areas where expansions of new classes should be considered
  • Additional access points / mappings for DDC classes
  • Additional topics to be added to class description
Access points: Standing room topics and literary warrant
• DDC class
    004.678 *Internet
        Including extranets, virtual private networks
        Class here World Wide Web
• LCSH:
    010 ## $a sh 97006102
    150 ## $a Extranets (Computer networks)
    450 ## $a Virtual private networks (Computer networks)
• dd: 004.678* and (hl: extranets w computer w networks) retrieves 69 records
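The co-occurrence search above pairs a DDC number with the words of a subject heading. As a rough sketch, the search string can be assembled from the two parts as follows. The helper name `heading_cooccurrence_query` is hypothetical, the output simply mirrors the string shown on the slide, and nothing here queries WorldCat.

```python
import re


def heading_cooccurrence_query(ddc_number, heading):
    """Combine a truncated DDC number with a subject heading whose words
    are joined by the `w` operator, as in the search shown on the slide."""
    # Keep only the words of the heading; parentheses and other
    # punctuation are dropped.
    words = re.findall(r"[A-Za-z0-9]+", heading.lower())
    return f"dd: {ddc_number}* and (hl: {' w '.join(words)})"


query = heading_cooccurrence_query("004.678", "Extranets (Computer networks)")
print(query)
```

The same helper could be applied to the 450 field (the "Virtual private networks" reference) to check the other standing room topic.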
Access points: Topics added to class description
004.6 *Interfacing and communications
    ...
    Including sensor networks
    ...
006.22 *Embedded computer systems [formerly 004.1]
    Class here microcontrollers
    For a specific aspect of embedded computer systems, see the aspect, e.g., systems analysis and design of embedded computer systems 004.21, wireless sensor networks 004.6, software for embedded systems 005.3
Trending topics
• My trending topics are not your trending topics
  • Twitter: sudden high-magnitude spike in activity
  • DDC: “quick” achievement of literary warrant threshold + plateaus at steady rate
• Trending topic detection vs. new topic detection
  • Newly minted LCSHs
  • Chapter/paper titles
  • Conferences
Trending topics: Newly minted LCSHs (1)
Date entered   LCSH
2012-08-13     Big data
2012-08-22     Contrast data mining
2013-07-18     Linked data
Trending topics: Newly minted LCSHs (2)
Records retrieved, by search:
Time period   su:"big data"   su:"big data" or ti:"big data"
2001-2005                 1                              17
2006                      0                               0
2007                      0                               0
2008                      0                               2
2009                      0                               0
2010                      0                               7
2011                      6                              74
2012                     51                             227
2013                    131                             413
Trending topics: Conferences
• Big data: 29th British National Conference on Databases
• 1st Workshop on Architectures and Systems for Big Data
• Workshop on big data
• Big Data Analytics: First International Conference
• The Semantic Web: Semantics and Big Data: 10th International Conference
• 2012 workshop on Management of big data systems
• 2nd Workshop on Research in the Large: Using App Stores, Wide Distribution Channels and Big Data in UbiComp Research
• IEEE International Congress on Big Data
• Big Data 2 Knowledge (Workshop)
Trending topics: Chapter/paper titles
• Welcome to the big data age
• Big Brother and big data around the world
• How to make sense of big data?
• Business and social implications of big data
• Big data and health care
• How should big data abuses be addressed?
• What is big data?
• Does big-data equal big value?
• Big-data technologies
Trending topics: Newly minted LCSHs (3)
Records retrieved, by search:
Time period   su:"linked data"   su:"linked data" or ti:"linked data"
2001-2005                    7                                    38
2006                         1                                     2
2007                         2                                     8
2008                         2                                    14
2009                        14                                    34
2010                        17                                    72
2011                        29                                    84
2012                        54                                   152
2013                        57                                   114
(Non-)Trending topics: Newly minted LCSHs (4)
Records retrieved, by search:
Time period   su:"contrast data mining"   su:"contrast data mining" or ti:"contrast data mining"
2001-2005                             0                                                      0
2006                                  0                                                      0
2007                                  0                                                      0
2008                                  0                                                      0
2009                                  0                                                      0
2010                                  0                                                      0
2011                                  1                                                      1
2012                                  1                                                      3
2013                                  5                                                      9
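The contrast between trending and non-trending headings can be made concrete with a simple threshold test on the subject-heading (su:) counts for the single years 2006 through 2013. This is only a sketch: the cutoff of 20 records per year is an assumption chosen for illustration, not a published literary warrant threshold, and `first_year_reaching` is a hypothetical helper.

```python
YEARS = list(range(2006, 2014))  # 2006 through 2013

# Annual su: counts copied from the three tables for the new headings.
su_counts = {
    "big data": [0, 0, 0, 0, 0, 6, 51, 131],
    "linked data": [1, 2, 2, 14, 17, 29, 54, 57],
    "contrast data mining": [0, 0, 0, 0, 0, 1, 1, 5],
}

# Hypothetical cutoff, chosen only for illustration.
WARRANT_THRESHOLD = 20


def first_year_reaching(counts, threshold):
    """Return the first year whose count meets the threshold, or None."""
    for year, count in zip(YEARS, counts):
        if count >= threshold:
            return year
    return None


threshold_years = {
    heading: first_year_reaching(counts, WARRANT_THRESHOLD)
    for heading, counts in su_counts.items()
}

for heading, year in threshold_years.items():
    print(f"{heading}: {year if year is not None else 'threshold not reached'}")
```

With this cutoff, "linked data" crosses in 2011 and "big data" in 2012, while "contrast data mining" never does, which matches its (non-)trending label.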
Structure of discipline
• Analysis of title data in DDC categorized content to identify facet structure of discipline
  • Retrieve bibliographic records from WorldCat for monographic literature
  • Isolate title data
  • Identify noun phrases in the titles
  • Use conceptual density measure of Agirre & Rigau
    • Disambiguate noun phrases
    • Identify appropriate generalizations
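The first steps of this pipeline (isolate titles, pull out candidate phrases, count them) can be roughed out with the standard library alone. The sketch below is a deliberately crude stand-in: it splits titles at stopwords instead of performing real noun-phrase identification, and it does not implement the conceptual density measure of Agirre & Rigau or any disambiguation. The stopword list and the helper `candidate_phrases` are assumptions, and the sample titles are taken from the chapter/paper titles slide.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {
    "a", "an", "and", "the", "of", "to", "in", "for", "is", "be",
    "how", "what", "does", "should", "make", "welcome", "around",
}


def candidate_phrases(title):
    """Split a title into runs of consecutive non-stopwords.

    This approximates noun-phrase extraction; it is not a parser.
    """
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", title.lower())
    phrases, current = [], []
    for word in words:
        if word in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


titles = [
    "Big data and health care",
    "Business and social implications of big data",
    "What is big data?",
]

# Count how often each candidate phrase occurs across the titles.
phrase_counts = Counter(
    phrase for title in titles for phrase in candidate_phrases(title)
)

for phrase, count in phrase_counts.most_common():
    print(f"{count}  {phrase}")
```

Phrases that recur across many titles in a class are the raw material for the later disambiguation and generalization steps, which would need a lexical resource and are out of scope for this sketch.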
The Interplay of Big Data, WorldCat, and Dewey
That’s all, folks! -- Thank you = La fin -- Merci beaucoup
(Non-)Trending topics: Newly minted LCSHs (5)
Records retrieved, by search:
Time period   su:"Attribute focusing Data mining"   su:"Attribute focusing Data mining" or ti:"Attribute focusing"
2001-2005                                       0                                                              0
2006                                            0                                                              0
2007                                            0                                                              0
2008                                            0                                                              0
2009                                            0                                                              0
2010                                            0                                                              0
2011                                            0                                                              0
2012                                            1                                                              1
2013                                            0                                                              0