text mining 101: what you should know

TEXT MINING 101:
WHAT YOU SHOULD KNOW
Ethan Pullman (Carnegie Mellon University)
Denise Novak (Carnegie Mellon University)
Kristen Garlock (Ithaka.org)
Patricia Cleary (Springer US)
NASIG annual
Saturday, June 11, 2016 10:30 am
Working with Your Constituents
Ethan Pullman
Humanities Liaison & Library Instruction Coordinator
Carnegie Mellon University
Audience Survey
On a scale from 1-5, novice to experienced, how familiar are you with text mining?
1 -- 2 -- 3 -- 4 -- 5
What is your role in providing Text Mining (TM) Services?
A. We have an expert librarian/service center
B. I work directly with my department(s)
C. Other: (Please describe)
My TM population mainly consists of:
Faculty
PhDs/TAs
Undergrads
How many of you have used TM for your own research/project?
Text Mining Briefly
● What is it?
● What is its purpose?
Text mining is the automated processing of large amounts
of structured digital texts
Purpose: retrieval, analysis, and interpretation of texts.
Note: Mining non-textual information falls under “data mining”.
Although often included with Text Mining as “Text & Data
Mining”, data mining is different and requires tools and
methodologies that are distinct from text mining.
Photo adapted from Text Mine ‘01
Text Mining Examples
Visualization tools build word clouds from words mined from large texts.
SDFB (Six Degrees of Francis Bacon) mines British early modern texts to trace “social connections”
between individuals from that period.
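As a rough illustration of the word-cloud example above: such visualizations typically start from word frequencies. The sketch below is a minimal Python example, not the method of any tool named in these slides. The stopword list, the sample sentence, and the function name word_frequencies are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "at"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stopword words as (word, count) pairs."""
    # Lowercase the text and keep runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)


# Hypothetical sample text, used only to exercise the function.
sample = (
    "Text mining is the automated processing of texts. "
    "Mining texts at scale supports retrieval and analysis of texts."
)
result = word_frequencies(sample, top_n=3)
print(result)
```

A word cloud would then scale each word's font size by its count.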
It Ain’t About the Money, Money, Money…
or is it?
Authors’ Guild vs. Google Books: A Rhetorical Analysis
A class project that used text mining to analyze case documents and
briefs submitted by Authors’ Guild in Authors’ Guild vs. Google. The
analysis shed light on the rhetorical strategy used by Authors’ Guild
lawyers and informed outcome prediction.
The Role of Library Liaisons
What is new?
Acquiring texts?
Providing access?
Librarians need to understand:
> how texts are used in the digital age
> what tools are available
> issues impacting acquisition and access
How I stay informed ...
Stay Informed:
Faculty Profiles
● Curriculum Vitae
● Publications
● Syllabi
How I stay informed ...
● Attend departmental lectures
● Read about campus initiatives
● Visit research showcases
How I stay informed ...
● Maintaining our Text-Mining Website
● Professional participation:
    ● Organizations & Conferences: for example, Text Analytics World
    ● Social Networks/Email lists, blogs
● Seek continuing education opportunities
● Collaborate with our acquisition and data services librarians
Acquisitions Point of View
Denise Novak
Acquisitions Librarian
Carnegie Mellon University
Supporting Text Mining of the
JSTOR Digital Library
Kristen Garlock
Associate Director of Education and Outreach, JSTOR
Ithaka
What is Data for Research?
Data for Research is a self-service
website for generating datasets from
the content on JSTOR.
http://dfr.jstor.org
How it works
Service is free, permitted under Terms & Conditions.
● Data for Research: Researcher creates free account on
site, defines parameters of dataset, submits request,
downloads dataset.
● Full-text datasets: Letter agreement (may be established
with individuals or libraries). Datasets not limited by
licenses or institutional affiliation.
Support for Text Mining
Why?
● Supporting new types of scholarship is part of our mission
● Opportunities to build beneficial partnerships
● Increasing value of publications; corpus in and
of itself has value as a scholarly tool
NOTE: For a bibliography of projects and research that incorporated datasets from JSTOR, please
contact Kristen Garlock ([email protected]).
Challenges
Biggest challenges:
● Staffing and support
● Keeping up with evolving researcher needs
Trends:
● Increasing numbers of requests
● Requests for larger and more complex datasets
● Interest from non-technologists
● Scholars not anticipating/understanding gaps or data issues in datasets
● Desire to combine datasets from multiple sources
Springer TDM policy
Patricia Cleary
Global eProduct Development Manager
Springer US
Springer TDM Policy Update (June 2016)
• This presentation provides an overview of the current Springer TDM policy.
• Springer is currently working on a new combined TDM policy for Springer Nature.
• The new TDM policy will be announced sometime in the near future.
Springer’s TDM policy was introduced in 2014
• The volume of scientific publications is increasing and TDM software tools
continue to improve
• Springer acknowledges the need for a more formalized process to enable TDM
• Springer strives to make the process as simple as possible for researchers
Springer grants text- and data-mining rights to
subscribed content, provided the purpose is
non-commercial research
For researchers with subscription access
• Individual researchers can download subscription and open access content for
TDM purposes directly from the SpringerLink platform
• No registration or API key is required
• Full-text content can be accessed easily and programmatically at friendly URLs
based on the content’s Digital Object Identifier (DOI)
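To make the idea of DOI-based URLs concrete, here is a minimal Python sketch that turns a DOI into a download URL. The URL template, the function names, and the example DOI are assumptions for illustration only. The slides do not state the actual pattern, so consult Springer's TDM policy page for the supported URLs before relying on this.

```python
from urllib.parse import quote

# ASSUMPTION: illustrative URL pattern, not confirmed by the slides.
ASSUMED_TEMPLATE = "https://link.springer.com/content/pdf/{doi}.pdf"


def normalize_doi(raw):
    """Strip common prefixes so only the bare DOI (10.xxxx/...) remains."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def build_fulltext_url(raw_doi, template=ASSUMED_TEMPLATE):
    """Build a full-text URL from a DOI using the (assumed) template."""
    doi = normalize_doi(raw_doi)
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {raw_doi!r}")
    # Keep the slash between prefix and suffix; percent-encode anything unusual.
    return template.format(doi=quote(doi, safe="/"))


# Hypothetical DOI used only as a placeholder.
url = build_fulltext_url("doi:10.1007/s00000-000-0000-0")
print(url)
```

Keeping the template as a parameter means the same helper can be pointed at whatever pattern the publisher documents.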
For researchers with no subscription access
• Researchers who do not have subscription access to SpringerLink can send
requests for TDM access to a contact within Springer
• These inquiries will be considered on a case by case basis
Implementation by academic and government institutions
• For subscribers at academic and government institutions, these rights will be
included in all new and renewed SpringerLink subscription agreements as an
additional TDM clause
• Existing subscribers may also add the TDM clause before their agreement is up for
renewal
Use of text and data mining results and research output
• Publications or analyses resulting from TDM of subscribed content may include
quotations from the original text of up to 200 characters, or 20 words, or 1
complete sentence
• Such publications should cite the original Springer content in the form of a DOI link
• Permission to reproduce images may be granted on a case-by-case basis
• For Open Access (OA) publications from Springer, BioMed Central and
SpringerOpen, TDM is usually allowed without restrictions since the majority of
Springer's OA content is licensed under CC-BY
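The quotation limits above can be turned into a quick pre-publication check. The Python sketch below rests on an assumed reading of the policy: a quotation is acceptable if it meets any one of the three limits (200 characters, 20 words, or one complete sentence). The sentence test is a crude heuristic, and the function names are hypothetical. Treat it as a convenience check, not an authoritative interpretation of Springer's terms.

```python
import re

MAX_CHARS = 200  # limit taken from the policy summary above
MAX_WORDS = 20   # limit taken from the policy summary above


def is_single_sentence(quote):
    """Heuristic: no sentence-ending mark followed by more text.

    Assumption: abbreviations such as "e.g. this" will be misread as a
    sentence break; a real check would need proper sentence segmentation.
    """
    return re.search(r"[.!?]\s+\S", quote.strip()) is None


def quotation_allowed(quote):
    """Return True if the quote meets at least one limit (assumed reading)."""
    stripped = quote.strip()
    return (
        len(stripped) <= MAX_CHARS
        or len(stripped.split()) <= MAX_WORDS
        or is_single_sentence(stripped)
    )


short_quote = "Text mining is fun. Really."
long_sentence = "word " * 40 + "end."
long_passage = "Word one two three. " * 15

print(quotation_allowed(short_quote))
print(quotation_allowed(long_sentence))
print(quotation_allowed(long_passage))
```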
Technical guide to downloading content
• For TDM researchers interested in cross-publisher automated downloading, the
CrossRef TDM initiative may be useful
• Springer is actively collaborating with CrossRef on this project and we expect
Springer content to be fully supported soon
• Guidelines for performing TDM of Springer content are located on the Springer’s
text- and data-mining policy page on Springer.com
Springer Metadata API
• Springer provides the free Springer Metadata API for searching within Springer
content
• Provides rich searching for the vast majority of Springer, BioMed Central and
SpringerOpen documents, including all journal content, book chapters and
protocols
• The Springer Book Archives will soon be searchable through this API as well
Q&A
[Q] Do publishers prefer to sign agreements directly with
researchers, or with the libraries that either have an active
subscription or have purchased the corpus to be mined?
[A] So far, Springer has only signed licenses with libraries. We
are currently focused on customers who have an active
subscription with us. TDM access covers content that researchers
have access to through their institutional subscription, as well
as OA content.
Q&A (cont’d)
[Q] If libraries do sign agreements on behalf of researchers,
does Springer expect libraries to track or monitor researcher
activities, either for compliance to terms of the agreement, or
for reporting purposes?
[A] Springer doesn't expect libraries to directly monitor
researcher TDM activities as separate from regular content
access activities. TDM access is subject to the same restrictions
as any regular content from a library-researcher relationship.
Q&A (cont’d)
[Q] What drives publisher decisions to host data vs. send the data to
libraries for hosting? What types of costs are associated with hosting?
How can libraries support an infrastructure for text mining if the data is
sent on drives? Do publishers mind if researchers get copies of this
data (sort of like a dataset that we buy for them)?
[A] This is different per publisher. Since Springer provides content that is
DRM-free, we can host content on our native site SpringerLink, or offline at
the library.
The advantage of SpringerLink is that the library does not have to
constantly receive updated data from us, and doesn’t have to build a GUI or
Useful links
Springer's Text and Data Mining Policy
https://www.springer.com/gp/rights-permissions/springer-s-text-and-data-mining-policy/29056
Springer / BioMed Central API Portal
https://dev.springer.com/
CrossRef TDM Initiative
Thank You!
[email protected]