Scaling up to archive the UK Web


Building a National
Collection of the Historical
UK Web for scholarly use
Helen Hockx-Yu
Head of Web Archiving, British Library
IIPC General Assembly, Paris, May 2014
Scholarly interaction with web archives (1)
• Archive-driven
– Initiated by archival institutions
– Aimed at understanding scholarly requirements and improving archival
practice
• Scholar-driven
– Initiated by scholars with research interest related to web archiving or
archived web material, including many “unknown” scholars
– A number of active research groups emerging: Netlab, WebArt and DMI,
IHR, OII, ODU…
– Attention from the Web Science community
• Project-based
– Varying scale, scope and funding sources
– Developing web archiving or discipline specific solutions
– Researchers and archiving institutions work as partners
www.bl.uk
Scholarly interaction with web archives (2)
• Phase 1: Building collections
– Scholars’ involvement in scoping collections, selecting and
describing websites relevant to research interest
– Creation of specific (narrow) topical collections, e.g. “Religion,
politics and law since 2005” in the UK Web Archive
• Phase 2: Formulating research questions
– Brainstorming sessions, workshops etc.
– Shift of focus to web archives in their entirety
– Lack of awareness & baseline knowledge
– Time & resource consuming
– Challenging: you don’t know what you don’t know
Scholarly interaction: the “go-to” state
• Independent use of web archives
• Meet common scholarly requirements, support scholarly workflow
• Base-line knowledge is self-explanatory, e.g. scope of the archive, its
coverage and lacunae, how it was collected, and how a particular
website was crawled
• Clear interfaces and jargon-free descriptions in alignment with scholarly
requirements
• Open access
– Including provision of downloadable derived or secondary datasets, e.g.
http://data.webarchive.org.uk/opendata/
• Publication of work citing web archives
Selective archiving since 2003
• Permission-based
• Open UK Web Archive:
http://www.webarchive.org.uk/ukwa/
• ~14,000 websites, ~64,000 instances
• URL and full-text search
• Curated collections
• Many websites no longer
available on the live web
6th April 2013…
• Legal Deposit Libraries (Non-Print Works) Regulations 2013
• Extension of existing legal
framework
• Systematic collection of the UK’s
published output for heritage
& preservation
• By 6 UK Legal Deposit
Libraries
JISC UK Web Domain dataset (1996-2014)
• Collaboration between the Internet Archive (IA), the Joint Information
Systems Committee (JISC) and the British Library
• Extracted copies of UK websites from the Internet Archive’s collection
– 1st tranche: 1996 – 2010, 30TB, 2.5 billion URLs
– 2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated)
• Research agreement between JISC and IA, upholding IA’s Terms of Use
– Access via IA’s Wayback Machine
– Allows replication / extraction of derivative or secondary datasets
• BL hosts the dataset on behalf of JISC
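The tranche figures above can be combined into a rough sense of scale. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted on the slide; it assumes decimal terabytes (10^12 bytes), and the per-URL averages describe stored size in the dataset, not the original size of the web resources. The second tranche is marked as estimated on the slide, so anything derived from it is equally approximate.

```python
# Figures copied from the slide; the second tranche is an estimate.
TRANCHES = {
    "1996-2010": {"terabytes": 30.0, "urls": 2.5e9},
    "2010-April 2013": {"terabytes": 27.5, "urls": 1.5e9},
}

BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def summarise(tranches):
    """Return total size (TB), total URL count, and average stored KB per URL."""
    total_tb = sum(t["terabytes"] for t in tranches.values())
    total_urls = sum(t["urls"] for t in tranches.values())
    avg_kb = {
        name: t["terabytes"] * BYTES_PER_TB / t["urls"] / 1000
        for name, t in tranches.items()
    }
    return total_tb, total_urls, avg_kb


total_tb, total_urls, avg_kb = summarise(TRANCHES)
print(f"Total: {total_tb} TB, {total_urls:.1e} URLs")
for name, kb in avg_kb.items():
    print(f"{name}: about {kb:.1f} KB stored per URL")
```

On these figures the two tranches together come to 57.5TB and roughly 4 billion URLs.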
Completed work
• Analytical Access to the Domain Dark Archive Project
– Use cases & experimental UI
• Demonstrating the Value of the UK Web Domain Dataset for Social
Science Research
– Analysis of link graph
– Paper accepted for WebSci’14: Mapping the UK Webspace:
Fifteen Years of British Universities on the Web
• MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher:
Creating and Analysing
• Secondary datasets under open licence
– Format profile, Geoindex, Host Link Graph
Exploring Host Link Graph
Courtesy of Peter Webster, Rainer Simon and Jules Mataly
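A host link graph reduces page-to-page links to links between hosts. The sketch below illustrates that general idea only; it is not the method used to produce the published dataset. The input layout (year, source URL, target URL), the function names and the sample records are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the hostname of a URL (lower-cased by urlsplit), or None."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None  # malformed URL


def host_link_graph(page_links):
    """Collapse (year, source_url, target_url) records into host-level counts.

    Links within the same host are dropped, since they say nothing about
    how hosts relate to one another.
    """
    counts = Counter()
    for year, source_url, target_url in page_links:
        src, dst = host_of(source_url), host_of(target_url)
        if src is None or dst is None or src == dst:
            continue
        counts[(year, src, dst)] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    (2005, "http://www.bl.uk/index.html", "http://www.webarchive.org.uk/ukwa/"),
    (2005, "http://www.bl.uk/about.html", "http://www.webarchive.org.uk/"),
    (2005, "http://www.bl.uk/index.html", "http://www.bl.uk/about.html"),
    (2006, "http://www.webarchive.org.uk/ukwa/", "http://www.bl.uk/"),
]
graph = host_link_graph(sample)
for (year, src, dst), n in sorted(graph.items()):
    print(year, src, "->", dst, n)
```

Keying the counts by year keeps the temporal dimension, so the same pair of hosts can be compared across crawl years.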
Visualising links (to and from bl.uk)
Interactive version
How it is done
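The slide links to the actual method, which is not reproduced in this transcript. As a rough illustration of the kind of query behind such a visualisation, the sketch below separates inbound from outbound links for one domain, given host-level rows. The row layout (year, source host, target host, count), the function names and the sample data are assumptions made for this example.

```python
from collections import Counter


def in_domain(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def links_for_domain(rows, domain):
    """Split host-level link counts into inbound and outbound for a domain.

    rows: iterable of (year, source_host, target_host, count).
    Links that stay inside the domain are ignored.
    """
    inbound, outbound = Counter(), Counter()
    for _year, src, dst, count in rows:
        src_in, dst_in = in_domain(src, domain), in_domain(dst, domain)
        if src_in and not dst_in:
            outbound[dst] += count
        elif dst_in and not src_in:
            inbound[src] += count
    return inbound, outbound


# Made-up rows, for illustration only.
rows = [
    (2004, "www.bl.uk", "example.ac.uk", 3),
    (2005, "catalogue.bl.uk", "example.ac.uk", 2),
    (2005, "example.org.uk", "www.bl.uk", 4),
    (2005, "www.bl.uk", "catalogue.bl.uk", 9),
]
inbound, outbound = links_for_domain(rows, "bl.uk")
print("Links to bl.uk from:", dict(inbound))
print("Links from bl.uk to:", dict(outbound))
```

Matching on the domain suffix treats subdomains such as catalogue.bl.uk as part of bl.uk; a stricter per-host view would compare hostnames exactly.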
Big UK Domain Data for Arts and
Humanities
• Funded by the UK Arts and Humanities Research Council as one of
the 21 “Big Data” projects
• Collaboration between the Institute of Historical Research, Oxford
Internet Institute, British Library and Aarhus University
• Develop theoretical and methodological framework for the study of
web archives
• Build on AADDA: researchers and the BL co-produce access tools
• A major study of the history of UK web space from 1996 to 2013 +
sub-projects covering a range of disciplines
• Also an online training course and peer-reviewed journal articles.
New projects and initiatives
• ALEXANDRIA: Foundations for Temporal Retrieval, Exploration
and Analytics in Web Archives
– 5-year project funded by the European Research Council
– Develop new models and algorithms for retrieval, exploration, and
analytics of web archives
– Collaborate on common issues, e.g. publication dates versus crawl
dates
• RESAW, a Research Infrastructure for the Study of Archived Web
Materials
– Currently a coordinated, self-organising, and self-financing open
network
– Preparing application for EU’s Horizon 2020 framework
Benefits
• Helps researchers understand the value of web archives and explore new
ways of using these for scholarly research
• Allows BL to obtain hands-on experience with indexing and processing
large scale web archive datasets
• Analytics and visualisations can be applied to our own Legal Deposit
collection
• Acts as test-bed for research and development projects
• Enables BL to participate in various UK, European and international
projects
• Helps curators understand characteristics of large scale digital corpora
• Improves the way we collect and store web archives
Some Issues
• Ownership
• Data quality
- Different formats, ARC and WARC
- Partially de-duplicated
• Context
- No crawl log or information on data cap applied during crawl time
- No detailed information on extraction mechanism
• More general issues related to analytical access
- Scepticism or suspicion about hidden algorithms behind analysis
- Biases in data and how data collection decisions lead to variances in
outputs
- Need to manage expectations: analysis and visualisation as finished
products versus first steps
- Ethical and privacy issues
Thank you!
Questions?
Getting in touch:
Twitter: @ukwebarchive
Email: [email protected]
UK Web Archive:
http://www.webarchive.org.uk