Computational Research and Copyright - Virginia


Computational Research and Copyright
John Unsworth
BNN Future of the Academy Speaker Series
MIT Faculty Club
May 25, 2012
HathiTrust: A Shared Digital Repository
HathiTrust Research Center
Goals of the HTRC
• Maintain repository of text mining algorithms, retrieval tools,
derived data sets, and indices available for human and
programmatic discovery.
• Be a user-driven resource, with an active advisory board, and
a community model that allows users to share tools and
results.
• Support interoperability across collections and institutions,
through use of InCommon SAML identity.
• See also: http://www.ideals.illinois.edu/handle/2142/29936, a
report prepared by the Illinois Center for Informatics
Research in Science and Scholarship on the experience of
Google Digital Humanities grant recipients.
The HathiTrust Research Center
• The HathiTrust Research Center (HTRC) enables
computational access for nonprofit and
educational users to published works in the
public domain.
• In the future, it will offer computational access to
in-copyright works from the HathiTrust as well.
• The center will break new ground in the areas of
text mining and non-consumptive research,
allowing scholars to fully utilize content of the
HathiTrust Library while observing the
requirements of current U.S. copyright law.
HTRC Partners
• The HTRC is a collaborative research center
launched jointly by Indiana University and the
University of Illinois, along with the HathiTrust
Digital Library and Google.
• The HTRC will help researchers meet the
technical challenges of working with massive
digital collections, by developing tools and
cyberinfrastructure that enable advanced
computational access to those collections.
Memoranda of Understanding
Completed:
• IU/UIUC MOU
• HT/IU/UIUC MOU
• Google/UIUC MOU
• Google/IU MOU
To be developed:
• HTRC-Researcher/Center MOU
Executive Committee
The HathiTrust Research Center is led by an Executive
Management Team that includes:
• Stephen Downie (Co-director), Professor and Associate
Dean for Research, University of Illinois Graduate School of
Library and Information Science
• Beth Plale (Co-director and chair), Data To Insight Center
director and professor in the School of Informatics and
Computing at Indiana University
• Scott Poole, I-CHASS director and professor in the
Department of Communication at the University of Illinois
• Robert McDonald, Indiana University Associate Dean of
Libraries
• John Unsworth, Vice-Provost and CIO, Brandeis University
Advisory Board
• Cathy Blake, University of Illinois, Urbana-Champaign
• Beth Cate, Indiana University
• Greg Crane, Tufts University
• Laine Farley, California Digital Library
• Brian Geiger, University of California at Riverside
• David Greenbaum, University of California at Berkeley
• Fotis Jannidis, University of Würzburg, Germany
• Matthew Jockers, Stanford University
• Jim Neal, Columbia University
• Bill Newman, Indiana University
• Bethany Nowviskie, University of Virginia
• Andrey Rzhetsky, University of Chicago
• Pat Steele, University of Maryland
• Craig Stewart, Indiana University
• David Theo Goldberg, University of California at Irvine
• John Towns, National Center for Supercomputing Applications
• Madelyn Wessel, University of Virginia
Timeline: Phase 1
• The primary areas of work in Phase 1 include
architecting the core cyberinfrastructure for data
analysis, deploying some general-purpose analytical
tools, and prototyping end-user services, including an
access portal, support center capabilities, and facilities
for sharing and storing derived research data. In Phase
1, only the public domain works in the HathiTrust will
be available to researchers, since the security
framework and policies for working with copyrighted
material will still be under development. The HTRC will
deliver a demonstration system in June 2012.
Timeline: Phase 2
• This phase, which will require significant
funding, will involve development of an
operational research center that will provide
ongoing and up-to-date access to the HTRC
research corpus and associated tools. Phase 2
will commence during the 18th month of the
project, and its launch will depend on
garnering resources during Phase 1 and on the
sustainability plan that will be developed in
Phase 1.
Current Collections
• HTRC currently has a 250,000-volume collection of
non-Google digitized content and a 50,000-volume
collection of content digitized by the IU Libraries.
These collections reside in a cluster of three 4-core,
16 GB RAM machines.
• About 2.8M volumes of Google-produced
public domain material will shortly be added
to the HTRC collections, now that the Google
MOUs have been signed.
HTRC Access and Use
• Users will be able to access the HTRC through a portal
or programmatically, through a Data API.
• The Data API cannot be used to download volumes, but
it can be used to move data to a location where
computation takes place. It can also be used to search
Solr indexes and pass volume IDs to other services for
access and computation.
• The target audience of the HTRC is non-profit and
educational researchers.
• Authentication will depend on InCommon, a
Shibboleth implementation that most HathiTrust
institutions already support.
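As a rough illustration of programmatic access, a client might build Data API request URLs like the sketch below. The base URL, path, and `volumeIDs` parameter are assumptions for illustration only, not the published interface.

```python
import urllib.parse

# Hypothetical endpoint; the real Data API base URL may differ.
BASE = "https://htrc.example.org/data-api"

def build_pages_url(volume_ids):
    """Build a page-retrieval URL for a list of volume IDs (illustrative).

    The API moves text to where computation happens; it does not let
    users download whole volumes for keeps.
    """
    query = urllib.parse.urlencode({"volumeIDs": "|".join(volume_ids)})
    return f"{BASE}/volumes/pages?{query}"
```

An authenticated HTTP client (carrying InCommon/Shibboleth credentials) would then issue the request and receive page-level text for computation.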
Architecture
• Solr Indexes: The HathiTrust and the HTRC both use Apache
Solr to index the materials in their collections.
• The Solr index is accessed through the Data API layer. The
Data API layer limits some access and does auditing, but
otherwise is a pass-through to the Solr API.
• Volume Store: HTRC uses a cluster of Apache Cassandra, a
NoSQL data store, to hold the volumes of digitized text.
• Volume- and page-level access to HTRC data is provided
through the HTRC Data API. Each machine has 500 GB of
disk, and the volumes are partitioned and replicated across
the three Cassandra instances.
• Registry: IU is running a version of WSO2 Governance
Registry, where applications are registered prior to running
in the non-consumptive framework. The registry is also
used as a temporary storage for returned results.
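The partition-and-replicate layout described above can be sketched with a simplified hash-ring scheme. The node names and replication factor are invented for illustration; Cassandra's actual placement uses partitioners and replication strategies configured per keyspace.

```python
import hashlib

NODES = ["cassandra-1", "cassandra-2", "cassandra-3"]  # assumed names
REPLICATION_FACTOR = 2  # illustrative; the HTRC setting is not stated

def replicas_for(volume_id):
    """Pick a primary node by hashing the volume ID, then replicate to
    the next node(s) in ring order -- a simplified Cassandra-style scheme."""
    h = int(hashlib.md5(volume_id.encode()).hexdigest(), 16)
    primary = h % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]
```

Hashing makes placement deterministic, so any node can compute where a volume lives without a central lookup table.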
Non-consumptive Research
“Research in which computational analysis is performed on
one or more Books, but not research in which a researcher
reads or displays substantial portions of a Book to
understand the intellectual content presented within the
Book.”
• One of HTRC’s unique challenges is support
for non-consumptive research.
• This will entail bringing algorithms to the data and
exporting results, and/or providing people with secure
computational environments in which they can work with
copyrighted materials without exporting them.
• Why is this worth doing? Because it enables a
new art of information that can be used to
make new kinds of arguments (and possibly to
settle some old ones).
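The "algorithms come in, only derived results go out" contract can be modeled with a toy capsule. Everything here (the class, its methods, the leak check) is a conceptual sketch, not the HTRC's actual security framework.

```python
class Capsule:
    """Toy model of a secure capsule: raw text may enter but never leave."""

    def __init__(self, volumes):
        self._volumes = volumes  # {volume_id: full text}, sealed inside

    def execute(self, algorithm):
        # The algorithm runs against the text only inside the capsule.
        return {vid: algorithm(text) for vid, text in self._volumes.items()}

    def export(self, results):
        # Refuse to release any result that contains raw volume text.
        for value in results.values():
            if isinstance(value, str) and any(
                text in value for text in self._volumes.values()
            ):
                raise ValueError("raw text may not leave the capsule")
        return results
```

For example, a word-count algorithm produces derived data that may be exported, while an algorithm that simply returns the text would be blocked:

```python
capsule = Capsule({"v1": "to be or not to be"})
counts = capsule.execute(lambda text: len(text.split()))
capsule.export(counts)  # derived counts may leave; the text may not
```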
Non-Consumptive Research
• HTRC received funding from the Alfred P. Sloan
Foundation for development of secure
infrastructure on which to carry out execution of
large-scale parallel tasks on copyrighted data
using public compute resources such as
FutureGrid or resources at NCSA.
• The high-level design uses a pool of VM images
that run in a secure-capsule mode and are
deployed onto compute resources. The team is
working on a proof of concept deployment
process with an OpenStack platform using Sigiri.
Blacklight
• Developed at the University of Virginia, Blacklight
is an open-source discovery interface:
http://projectblacklight.org/
• Blacklight supports faceted searches, a known
need of researchers.
• We expect Blacklight to be a significant
component of the public face of the HTRC.
• Blacklight is designed to support data that is both
full text and bibliographic.
• Blacklight is built on Solr, the same technology
that we already use to index the HTRC data.
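A faceted search of the kind Blacklight presents boils down to a Solr request with faceting parameters turned on. The sketch below builds such a request URL using standard Solr parameters (`facet`, `facet.field`, `rows`); the host and core name are assumptions.

```python
import urllib.parse

SOLR = "http://localhost:8983/solr/htrc/select"  # assumed host and core

def facet_query(q, facet_field):
    """Build a Solr select URL that requests facet counts for one field."""
    params = urllib.parse.urlencode({
        "q": q,
        "wt": "json",
        "facet": "true",
        "facet.field": facet_field,
        "rows": 0,  # only facet counts are needed for a facet browse
    })
    return f"{SOLR}?{params}"
```

Blacklight issues queries of this shape and renders the returned facet counts as clickable filters in the discovery interface.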
Google DH study
• Google Digital Humanities Awards Recipient
Interviews Report, prepared for the
HathiTrust Research Center by Virgil E. Varvel
Jr. and Andrea Thomer at the Center for
Informatics Research in Science and
Scholarship, Graduate School of Library and
Information Science, University of Illinois at
Urbana-Champaign, in Fall 2011
Scope of the report
• 22 researchers who had received Google
Digital Humanities grants were invited to
debrief on their experience, in order to
provide input to the design of the HTRC
• Interviews were conducted by phone, in
person, or by Skype, using a semi-structured
interview protocol
Findings of the report: OCR
• OCR quality is a significant issue; steps should
be taken to improve OCR output where possible
• OCR quality should be indicated in volume-level
metadata
• Scalability of scanned page images is
necessary for human correction of OCR errors
Other Findings of the Report
• Researchers would like better metadata about the
languages included in texts, particularly in multi-lingual
documents.
• Better metadata about language by sections within
volumes would be helpful.
• Automatic language identification functions would be
helpful, but human‐created metadata is preferred,
particularly for documents with low OCR quality.
• For one researcher, the primary issue was retrieving
the bibliographic records in usable form. It took 10
months to design the queries and get the data.
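The automatic language identification the researchers asked for can be approximated very crudely by stopword-profile overlap. The tiny profiles below are invented for illustration; real systems use character n-gram models trained on much larger data, and (as the report notes) they degrade on low-quality OCR.

```python
# Tiny stopword profiles for a few languages (illustrative only).
PROFILES = {
    "en": {"the", "and", "of", "to", "in", "is", "that"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein"},
    "fr": {"le", "la", "et", "les", "des", "est", "une"},
}

def guess_language(text):
    """Score each language by stopword overlap and return the best match."""
    words = set(text.lower().split())
    scores = {lang: len(words & stops) for lang, stops in PROFILES.items()}
    return max(scores, key=scores.get)
```

Applied page by page or section by section, a classifier like this could supply the within-volume language metadata the researchers requested, with human-created metadata still preferred where OCR is poor.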
Matt Jockers, “The Nineteenth-Century Literary Genome,” via
Digital Humanities Specialist (aka Elijah Meeks), http://dhs.stanford.edu
Arguing with Data
• Data enables arguments based on
quantitative and/or empirical data
• Data still requires interpretation, and you
can still make better and worse
interpretations, and more or less
compelling arguments
• In addition to new kinds of arguments, you
can make new kinds of mistakes, especially
mistakes based on incomplete data or on
an incomplete understanding of data
Mistakes based on incomplete data
New kinds of arguments
http://tedunderwood.wordpress.com/
Ted Underwood is exploring the changing
etymological basis of diction in English, over a
200-year period, especially the shift from words
derived from German, to words derived from
Latin, and back again.
Etymology and Style
Ted Underwood, 2011
o English professors have a long, lively history of drawing
specious conclusions from the “Latinate” or
“Germanic” character of a particular writer’s style.
o There is nevertheless good evidence that older words
do predominate in informal, and especially spoken,
English. [Laly Bar-Ilan and Ruth A. Berman,
“Developing register differentiation: the Latinate-Germanic
divide in English,” Linguistics 45 (2007): 1–35.]
o Can we use this fact to trace broad changes of register
in the history of written English?
The fundamental distinction is not Latinate/Germanic, but
date of entry. French was the written language for 200 years;
words that entered English before that point had to be used
in the spoken language to survive. This includes “Latinate”
words like “street” and “wall.”
http://bit.ly/h8cJem
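The date-of-entry measure can be sketched as the share of pre-1150 words in a passage. The tiny lookup table below is invented for illustration; an actual analysis would draw dates of entry from a dictionary such as the OED.

```python
# Hypothetical date-of-entry lookup (year each word entered English).
ENTRY_DATES = {
    "street": 900, "wall": 900, "house": 800, "king": 800,
    "question": 1470, "consider": 1375, "demonstrate": 1550,
}

def pre1150_ratio(words):
    """Fraction of the words with known dates that entered English
    before 1150 -- a rough proxy for 'speechlike' diction."""
    dated = [ENTRY_DATES[w] for w in words if w in ENTRY_DATES]
    if not dated:
        return 0.0
    return sum(1 for year in dated if year < 1150) / len(dated)
```

Computed over texts binned by publication year and genre, a ratio like this is what lets one trace broad changes of register across the history of written English.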
To understand the significance of the result, it needs to be broken
down by genre. Initial results suggest that fiction and nonfiction prose
both become more formal (less like speech) in the 18c. Drama and
poetry change little, although older, less formal, “speechlike” words
always predominate in drama.
The Value of HTRC
• Ted’s investigation concerns historical trends: as
such, it is reasonable to think that it might be
interesting to extend beyond 1900.
• Can he do that? Only if he is given the data.
• Will researchers have this kind of computational
access to copyrighted data? Only through some
institutional affordance like HTRC.
• Institutions are risk-averse: in some sense, the most
important infrastructure in HTRC is the MOU.