The HathiTrust Research Center:
Building Shared Computational Resources to Mine
the Largest Academic Digital Library Corpus
Tweet Us: #HTRC #SESS037 #EDU13
Robert H. McDonald – Indiana University
Beth Sandore Namachchivaya – University of Illinois
John Unsworth – Brandeis University
Educause Annual Meeting
Anaheim, CA
October 16, 2013
http://bit.ly/1fDcK91
http://www.hathitrust.org/htrc
HathiTrust Partnership
Allegheny College
Arizona State University
Baylor University
Boston College
Boston University
California Digital Library
Carnegie Mellon University
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Florida State University
Getty Research Institute
Harvard University Library
Indiana University
Iowa State University
Johns Hopkins University
Lafayette College
Library of Congress
Massachusetts Institute of Technology
McGill University
Michigan State University
New York Public Library
New York University
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Syracuse University
Texas A&M University
Tufts University
Universidad Complutense de Madrid
University of Alabama
University of Arizona
University of Calgary
University of California
– Berkeley
– Davis
– Irvine
– Los Angeles
– Merced
– Riverside
– San Diego
– San Francisco
– Santa Barbara
– Santa Cruz
The University of Chicago
University of Connecticut
University of Delaware
University of Florida
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Miami
University of Michigan
University of Minnesota
University of Missouri
University of Nebraska-Lincoln
The University of North Carolina at Chapel Hill
University of Notre Dame
University of Oklahoma
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Virginia Tech
Wake Forest University
Washington University
Yale University Library
HathiTrust Mission
To contribute to the common good by collecting,
organizing, preserving, communicating, and sharing
the record of human knowledge
HathiTrust Services
• Long-term preservation
– Bit-level and migration
• Bibliographic search
• Full-text search
• Reading and download capabilities
• Print on demand
• Collections
• Datasets
• HathiTrust Research Center
HathiTrust “Wow” Numbers
• 10,819,596 total volumes
• 5,672,046 book titles
• 281,890 serial titles
• 3,786,858,600 pages
• 485 terabytes
• 128 miles
• 8,791 tons
• 3,469,225 volumes (~32% of total) in the public domain
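The ratios behind these figures can be checked with a few lines of Python. This is only a sanity-check sketch using the numbers on the slide; the variable names are illustrative. The page total divides evenly into 350 pages per volume, which suggests (as an assumption, not something the slide states) that the page figure is an estimate rather than a measured count.

```python
# Figures copied from the slide above.
total_volumes = 10_819_596
public_domain_volumes = 3_469_225
total_pages = 3_786_858_600

# Share of the collection that is in the public domain.
pd_share = public_domain_volumes / total_volumes

# Average length of a volume, in pages.
pages_per_volume = total_pages / total_volumes

print(f"public domain share: {pd_share:.1%}")              # 32.1%
print(f"average pages per volume: {pages_per_volume:.0f}")  # 350
```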
Discovery and Use
• Search, collections, online access
• APIs and data feeds
– Data API
– Bibliographic API
– “Hathifiles” inventory files
– OAI
• Computational Research
– Distribution of datasets
– Protocol-based access
– Research Center
Research Center in Context
Goals for HTRC
• Provide a persistent and sustainable structure to
enable scholars to ask and answer new questions.
– Leverage data storage and computational infrastructure at Indiana
& Illinois
– Stimulate community development of new functionality and tools
– Use tools to enable discoveries that would not be possible
without the HTRC
• Enable scholars to fully utilize content of HathiTrust
Library while preventing intellectual property misuse
within U.S. copyright law.
– Provide a secure computational and data environment for
scholars to perform research using HathiTrust Digital Library.
[Organization diagram. Labels: HathiTrust (Board of Governors, Executive Committee, Executive Director); HathiTrust Research Center; Indiana University; University of Illinois; University of Michigan; Data Copy #1; Data Copy #2]
HTRC Governance
• Reports to the HathiTrust Board of Governors
• HTRC Executive Committee
– J. Stephen Downie (Co-director), Professor and Associate
Dean for Research, University of Illinois GSLIS
– Beth Plale (Co-director and Chair), Director Data To Insight
Center and professor in the School of Informatics and
Computing at Indiana University
– Robert H. McDonald, Associate Dean of Libraries/Deputy
Director Data to Insight Center at Indiana University
– Beth Sandore Namachchivaya, Associate University Librarian
for Information Technology Planning & Policy at the
University of Illinois
– John Unsworth, Vice Provost for Library & Technology
Services and Chief Information Officer at Brandeis University
• HTRC Advisory Board (see members on next slide)
• Google Public Domain agreement – in place for IU and UIUC
HTRC Advisory Board
• Cathy Blake, University of Illinois, Urbana-Champaign
• Beth Cate, Indiana University
• Greg Crane, Tufts University
• Laine Farley, California Digital Library
• Brian Geiger, University of California at Riverside
• David Greenbaum, University of California at Berkeley
• Fotis Jannidis, University of Würzburg, Germany
• Matthew Jockers, Stanford University
• Jim Neal, Columbia University
• Bill Newman, Indiana University
• Bethany Nowviskie, University of Virginia
• Andrey Rzhetsky, University of Chicago
• Pat Steele, University of Maryland
• Craig Stewart, Indiana University
• David Theo Goldberg, University of California at Irvine
• John Towns, National Center for Supercomputing Applications
• Madelyn Wessel, University of Virginia
Data Overview
Hathifiles
• Tab-delimited inventory files
• Aggregated monthly
• Daily incremental files
• Contain
– Identifiers
– Limited bibliographic information
– Rights, language, gov docs status information
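Because the inventory files are tab-delimited, they can be read with a standard CSV reader. The sketch below is illustrative only: the column order, the identifiers, and the sample rows are made up for the example, since the slide does not give the actual file layout.

```python
import csv
import io

# Hypothetical column order; the real file layout is not given on the slide.
COLUMNS = ["volume_id", "rights", "title", "language", "us_gov_doc"]

# Made-up tab-delimited rows standing in for an inventory file.
SAMPLE = (
    "vol.0001\tpd\tAn Example Title\teng\t0\n"
    "vol.0002\tic\tAnother Title\tger\t0\n"
    "vol.0003\tpd\tA Federal Report\teng\t1\n"
)


def read_inventory(handle):
    """Yield one dict per tab-delimited inventory line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(COLUMNS, row))


def public_domain_ids(handle):
    """Return the identifiers of rows whose rights code is 'pd'."""
    return [rec["volume_id"] for rec in read_inventory(handle) if rec["rights"] == "pd"]


ids = public_domain_ids(io.StringIO(SAMPLE))
print(ids)  # ['vol.0001', 'vol.0003']
```

For a real file, an open file handle would replace the in-memory sample, and the column list would need to match the published layout.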
Content Distribution
• In-copyright or undetermined: 70%
• “Public Domain”: 30%
– Public Domain (worldwide): 15%
– Public Domain (US): 10%
– U.S. Federal Government Documents (worldwide): 4%
– Open Access: 0.1%
– Creative Commons: 0.01%
Content Sources
[Pie chart of content by contributing institution. Legible shares: Michigan 45%, California 33%. The remaining contributors shown (Wisconsin, Cornell, Illinois, Princeton, NYPL, Penn State, Purdue, Columbia, Northwestern, Duke, NCSU, Chicago, Indiana, Utah State, Harvard, Madrid, Virginia, Yale, UNC-Chapel Hill, Minnesota, LC) each account for 5% or less.]
Dates
• 0-1500: 0%
• 1500-1599: 0%
• 1600-1699: 0%
• 1700-1799: 1%
• 1800-1849: 3%
• 1850-1899: 8%
• 1900-1909: 4%
• 1910-1919: 4%
• 1920-1929: 4%
• 1930-1939: 4%
• 1940-1949: 4%
• 1950-1959: 6%
• 1960-1969: 11%
• 1970-1979: 13%
• 1980-1989: 15%
• 1990-1999: 14%
• 2000-2009: 10%
Language Distribution
• English: 48%
• German: 9%
• French: 7%
• Spanish: 5%
• Chinese: 4%
• Russian: 4%
• Japanese: 3%
• Italian: 3%
• Arabic: 2%
• Latin: 1%
• Remaining languages: 14%
The top 10 languages make up ~86% of all content
Data Availability
[Diagram labels. Source: Catalog, Bib Data. Data Management: Ingest; Bibliographic Data; Rights Data; Holdings Data; Content Package Storage (Indiana, Michigan). Access: Full-text Search, PageTurner, Collections, APIs, Datasets]
How is it available?
• Web interfaces
• APIs
– Data API
– Bib API
• Data feeds and distribution
– Hathifiles
– OAI
– Datasets
• Soon: Virtual Machines
Copyright
• Strongly bound to US copyright issues with
constant vigilance of the international scene
• Status determinations via:
– Bibliographic metadata
– Automatic and manual rights determination
Automatic Rights Determination
• Conducted on all works at time of ingest and
when records are modified
– Public domain worldwide
• US works published before 1923, US federal
government publications, non-US works published prior
to 1872
– Public domain in the United States
• Non-US works published prior to 1923
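The date rules above can be written as a small decision function. This is a sketch of the rules exactly as the slide lists them, not the actual ingest code: the function name and parameters are hypothetical, and the return values reuse the "pd" and "pdus" codes from the Rights Attributes table.

```python
from typing import Optional


def automatic_rights(pub_year: int, us_work: bool, us_federal_doc: bool = False) -> Optional[str]:
    """Apply the date rules listed on the slide.

    Returns 'pd' (public domain worldwide), 'pdus' (public domain in the
    United States), or None when none of the listed rules applies.
    """
    if us_federal_doc:
        return "pd"      # US federal government publication
    if us_work and pub_year < 1923:
        return "pd"      # US work published before 1923
    if not us_work and pub_year < 1872:
        return "pd"      # non-US work published prior to 1872
    if not us_work and pub_year < 1923:
        return "pdus"    # non-US work published prior to 1923
    return None          # left to bibliographic metadata or manual review


print(automatic_rights(1900, us_work=True))   # pd
print(automatic_rights(1900, us_work=False))  # pdus
print(automatic_rights(1950, us_work=True))   # None
```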
Manual Rights Determination
• IMLS-funded CRMS project
– US-published works 1923-1963
– Conformance with formalities
– Expanding to non-US works
– Double-blind review with expert review for conflicts
– Staff at 4 HathiTrust partner institutions (15 will take part in non-US)
– As of February 2012 ~190,000 reviewed, more than 100,000 opened
• Rights Holder Permissions
Rights Attributes
id  name         type       dscr
1   pd           copyright  public domain
2   ic           copyright  in-copyright
3   opb          copyright  out-of-print and brittle (implies in-copyright)
4   orph         copyright  copyright-orphaned (implies in-copyright)
5   und          copyright  undetermined copyright status
6   umall        access     available to UM affiliates and walk-in patrons (all campuses)
7   world        access     available to everyone in the world
8   nobody       access     available to nobody; blocked for all users
9   pdus         copyright  public domain only when viewed in the US
10  cc-by        copyright  Creative Commons Attribution
11  cc-by-nd     copyright  Creative Commons Attribution-NoDerivatives
12  cc-by-nc-nd  copyright  Creative Commons Attribution-NonCommercial-NoDerivatives
13  cc-by-nc     copyright  Creative Commons Attribution-NonCommercial
14  cc-by-nc-sa  copyright  Creative Commons Attribution-NonCommercial-ShareAlike
15  cc-by-sa     copyright  Creative Commons Attribution-ShareAlike
16  orphcand     copyright  orphan candidate - in 90-day holding period (implies in-copyright)
17  cc-zero      copyright  Creative Commons Zero license (implies pd)
18  und-world    copyright  undetermined copyright status and permitted as world-viewable by the depositor
19  ic-us        copyright  in copyright in the US
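As a rough illustration of how the rights codes might be consumed, the sketch below maps a few of the names from the table to their descriptions and applies a deliberately simplified rule for deciding whether a volume is treated as public domain for a given viewer. The function and its rule are assumptions made for this example; actual access decisions also depend on factors such as the digitizing source and partner status.

```python
# A subset of the rights attributes table, keyed by the "name" column.
RIGHTS = {
    "pd": "public domain",
    "ic": "in-copyright",
    "und": "undetermined copyright status",
    "pdus": "public domain only when viewed in the US",
    "cc-zero": "Creative Commons Zero license (implies pd)",
}


def treated_as_public_domain(code: str, viewer_in_us: bool) -> bool:
    """Simplified reading of the table (an assumption for this sketch):
    'pd' and 'cc-zero' count as public domain everywhere,
    'pdus' only for viewers located in the US."""
    if code not in RIGHTS:
        raise KeyError(f"unknown rights code: {code}")
    if code in ("pd", "cc-zero"):
        return True
    if code == "pdus":
        return viewer_in_us
    return False


print(treated_as_public_domain("pdus", viewer_in_us=True))   # True
print(treated_as_public_domain("pdus", viewer_in_us=False))  # False
```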
Rights Determination Reason Codes
id  name  dscr
1   bib   bibliographically-derived by automatic processes
2   ncn   no printed copyright notice
3   con   contractual agreement with copyright holder on file
4   ddd   due diligence documentation on file
5   man   manual access control override; see note for details
6   pvt   private personal information visible
7   ren   copyright renewal research was conducted
8   nfi   needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9   cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
10  cip   condition review and in-print status research was conducted
11  unp   unpublished work
12  gfv   Google viewability set at VIEW_FULL
13  crms  derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14  add   author death date research was conducted or notification was received from authoritative source
15  exp   expiration of copyright term for non-US work with corporate author
16  del   deleted from repository; see note for details
17  gatt  non-US public domain work restored to in-copyright in the US by GATT
Type of work | Searchable (bibliographic and full-text) | Viewable* | Full-PDF download (Data API) | Print on Demand | Print disabilities* | Preservation uses (Section 108)*
Public domain
worldwide
Worldwide
Worldwide
Worldwide
Partners
worldwide
N/A
Public domain
(US) – Non-US
works
published
between 1872
and 1923.
Worldwide
When accessed
from within the
United States
Partners only if
scanned by
Google, if not,
worldwide.
Partners in the
US if scanned
by Google, if
not, anyone US
Works that
rights holders
have opened
access to in
HathiTrust
Worldwide
Worldwide
Works that are
in-copyright or
of
undetermined
status
Worldwide
Available within Partners in the
the United
US; partners
worldwide
States
where similar
laws in effect
N/A
Worldwide (if
Worldwide with Partners
digitized by
permission
worldwide
Google, full-PDF
only available if
opened with CC
license)
Partners in the
Not available
Not available
Not available
US; partners
worldwide
where similar
laws in effect
Partners in the
To participating Not available
Not available
US
partners
N/A
Partners in the
US; partner
worldwide
where similar
laws in effect
Partners in the
Orphan works
Worldwide
US; partners
worldwide
* Note: Access to in-copyright works is subject to conditions on Terms of Access slide. See here also.
where similar
laws in effect
HTRC Research Paradigm
Bring the COMPUTATION to the DATA!
• Web services architecture and protocols
• Registry of services and algorithms
• Solr full-text indexes
• NoSQL store as volume store
• OpenID authentication
• Portal front-end, programmatic access
• Data mining algorithms
[Architecture diagram. Labels: Portal; Blacklight; programmatic access; Agent framework with Agent instances; SEASR analytics service; Meandre orchestration; WSO2 registry (services, collections, data capsule images); WSO2 Identity Server; HTRC Data API v0.1; Solr index; Volume store (Cassandra), three instances; Non-consumptive Data capsules; Task deployment; NCSA local resources; NSF XSEDE; Big Red II/IU Quarry; rsync; HathiTrust corpus; Page/volume tree (file system); University of Michigan]
[Diagram labels: Request; Complexity hiding interface (shown twice); HTRC; All the complexity; Text mining algorithms; Subsets of corpus; Other data (dictionaries, wiki data); outputs: Spatial plots, Statistical plots, Tabular info]
HTRC Research Access
[Diagram labels: Researcher; Request for VM; VM Manager; VM Image Manager; VM Image Builder; VM Image Store; Secure Virtual Cloud; VM instance; SSH; Non-consumptive Output Storage]
1 Select volumes for analysis
2 Select algorithm
3 View/download results
Example analyses: Named Entities, Word frequencies, Topic models
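The three steps (select volumes, select an algorithm, view the results) can be mimicked locally in a few lines. This is only an illustrative sketch and not the HTRC portal or its API: the volume identifiers and texts are made up, and a plain word-frequency count stands in for the selected algorithm.

```python
import re
from collections import Counter

# Step 1: a hypothetical workset mapping made-up volume ids to plain text.
workset = {
    "vol.a": "The origin of species by means of natural selection",
    "vol.b": "Natural selection and the descent of man",
}


# Step 2: the chosen algorithm, here a plain word-frequency count.
def word_frequencies(texts):
    """Count lowercase alphabetic tokens across all given texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


# Step 3: view the results.
freqs = word_frequencies(workset.values())
print(freqs.most_common(4))
```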
Research Engagements
Colin Allen
Professor, Cognitive Science
Indiana University
https://inpho.cogs.indiana.edu/
1315 volumes selected using a keyword search for ‘Darwin’, ‘Romanes’,
‘anthropomorphism’, and ‘comparative psychology’. This set contains many books
that are not of particular interest, e.g., books on theology and college course catalogs.
Challenge: Find the philosophical arguments in a haystack of sentences
Digging into Data 2011
Yearly values of the ratio between two wordlists in three
different genres. 4,275 volumes. 1700-1899
Ted Underwood, Dept of English, UIUC
http://goo.gl/hVbNfZ
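A yearly ratio between two wordlists can be computed along the following lines. This is a minimal sketch, not the researcher's actual method: the word lists, the sample tokens, and the function name are all made up for illustration, and a real study would also separate the volumes by genre.

```python
from collections import defaultdict


def yearly_ratio(volumes, list_a, list_b):
    """Return {year: hits from list_a / hits from list_b}.

    `volumes` is an iterable of (year, tokens) pairs. Years with no
    hits from list_b are skipped to avoid division by zero.
    """
    hits_a = defaultdict(int)
    hits_b = defaultdict(int)
    for year, tokens in volumes:
        for token in tokens:
            if token in list_a:
                hits_a[year] += 1
            elif token in list_b:
                hits_b[year] += 1
    return {year: hits_a[year] / hits_b[year] for year in sorted(hits_b)}


# Made-up word lists and token samples, for illustration only.
list_a = {"house", "stone"}
list_b = {"liberty", "duty"}
volumes = [
    (1750, ["house", "liberty", "house", "stone"]),
    (1850, ["liberty", "liberty", "house", "duty"]),
]

ratios = yearly_ratio(volumes, list_a, list_b)
print(ratios)
```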
Phenotypes implemented at the level of genes
General study: understanding how phenotypes, such as healthy
human diversity and maladies, are implemented at the level of
genes.
Why HTRC: capture properties of language automatically, for
text transformations and information extraction.
Generalize grammatical and idiomatic patterns
as related to systems biology.
Andrey Rzhetsky
Professor, Department of Medicine
University of Chicago
http://www.ci.uchicago.edu/research/rzhetsky/
Other Grants and Proposals involving HTRC
• Zdenek Zdrahal, “DiscoveryCORE, Discovering Hidden Relationships in
Semantically Connected Resources”, NEH Digging Into Data Challenge.
• Matthew Wilkens, Notre Dame, “Literary Geography at Scale”, American
Council of Learned Societies (ACLS).
• Ichiro Fujinaga, “Single Interface for Music Score Searching and Analysis
(SIMSSA)”, SSHRC, Canada. Pending.
• Andrew Piper, “Text Mining the Novel: Establishing the Foundations of a New
Discipline”, SSHRC, Canada.
• Robert Iliffe, University of Sussex, Textual Genomics Project (TTGP), United
Kingdom Arts and Humanities Research Council.
• Edie Rasmussen, “From Indexer’s Legacy to Scholar’s Desktop”.
• Adam Farquhar, The British Library, IRIS, Arts and Humanities Research
Council grant.
Workset Creation for Scholarly Analysis
Funded at $493,000 by the Andrew W. Mellon Foundation;
Co-PIs: J. Stephen Downie, Tim Cole, Beth Plale; 1 July 2013 – 30 June 2015. Goals:
1) enriching the metadata in the HathiTrust corpus
2) augmenting string-based metadata with URIs to leverage
discovery and sharing through external services, and
3) formalizing the notion of collections and worksets in the
context of the HathiTrust Research Center.
Includes an open, competitive Request for Proposals in
November 2013, with the intent to fund four prototyping
projects that will build tools for enriching and augmenting
metadata for the HathiTrust corpus.
HTRC Sloan Cloud for Secure Text Mining at Scale
Funded at $606,000 by The Alfred P. Sloan Foundation; Beth
Plale, Indiana University, PI; Atul Prakash, University of Michigan,
Co-PI; Fall 2011 - Spring 2013.
Goal: Prototype a system that enables secure text mining to be
carried out at scale using public cloud resources, including:
1. a software cloud infrastructure based on OpenStack
2. mechanisms for managing a secure virtual machine We plan
The Sloan Cloud will provide users with dedicated virtual
machines that are pre-configured with appropriate tools and
provide secure access to remote data that cannot be funneled
through the VM to outside filesystems.
Thank You
• This presentation was made possible with content
provided by many HTRC colleagues: John Unsworth, J.
Stephen Downie, Beth Plale, Robert H. McDonald, Beth
Sandore, Yiming Sun, Miao Chen, Guangchen Ruan,
Loretta Auvil, Kirk Hess, and many others
• The HTRC Non-Consumptive Research Grant is
graciously funded by the Alfred P. Sloan Foundation
• IU D2I-PTI is graciously funded by The Lilly Endowment,
Inc.
• HTRC - http://www.hathitrust.org/htrc
• IU D2I Center - http://d2i.indiana.edu/
• UIUC GSLIS - http://www.lis.illinois.edu/
Contact Information
Speakers:
Robert H. McDonald, Indiana University
[email protected] | @mcdonald
Beth Sandore Namachchivaya, University of Illinois
[email protected]
John Unsworth, Brandeis University
[email protected] | @unsworth
Requests for assistance:
Miao Chen, HTRC Education and Outreach
[email protected]