British Library Web Archiving


Web Archives: Interacting with Scholars
Helen Hockx-Yu
Head of Web Archiving
British Library
28 November 2013
Access to Web Archives
OVERVIEW
2
Web Archiving initiatives worldwide
http://en.wikipedia.org/wiki/File:Map_of_Web_archiving_initiatives_worldwide.png
3
(Scholarly) use of web archives?
• Restricted access, e.g. large-scale national web archives referred to as “dark archives”
• Archiving institutions’ focus on data collection, not usage
• “Document-centric” access methods
• Cannot produce replicas of original websites
• No agreed way of calculating / benchmarking access statistics
• Little evidence of scholarly use of web archives, making it difficult to understand requirements
4
Access methods
• International Internet Preservation Consortium (IIPC): 46 members worldwide
• “IIPC members’ archives” has 29 entries
• 19 have full or partial online access, often permission-based
• URL search as the standard, universal access method; it requires users to know the URL of the website they are looking for
• For many archives, full-text search is the next challenge on the roadmap
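[Chart: access methods offered by the listed archives. Categories: URL search, Keyword search, Full-text search, Thematic collections, Subject browsing, Alphabetical browsing. Values as extracted: 26, 15, 11, 11, 9, 14]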
5
Web archive as historical document
6
UK Web Archive
SCHOLARLY FEEDBACK
7
Scholarly feedback

• User survey in 2012 to identify the scholarly value of the UK Web Archive, as perceived by researchers
• To obtain feedback on the access mechanisms currently offered by the archive
• To identify gaps in terms of content coverage
• To obtain insight into the reasons why researchers may or may not use the web archive
8
Methodology
• Conducted by IRN Research between May and June 2012
• 94 telephone interviews with previous users and non-users of the UK Web Archive; 74% were non-users
• A small group was asked to undertake a second phase, running searches and detailing each stage; these were documented as case studies
9
Interview sample by subject
Subject | Non-users | Users
Arts and Humanities | 33 | 10
Social Sciences | 27 | 11
Science, Technology, Medicine | 4 | 3
Total | 64 | 24
Unclassified | 6 | -
10
Scholarly value
Non users
Users
Appreciate potential value but for
many no relevant content
All understand the value as
snapshot of selective sites at
specific times
More special collections would
increase value
Value would increase with more
scientific and technical content
11
Access Mechanisms
Non users
Users
Search tool easy to use but
complicated for minority
Majority satisfied with presentation
of results and ease of use of site
Most search / browse by special
collections
More interest in visualisation tools
Search results unstructured and
random
Need for improved data mining tools
More explanation about functions
and features needed
Limited interest in visualisation tools
12
Additional functions and features
Non users
Users
Improvements to search results
pages
6-monthly updates
Interactive features
Interactive features
Facility to suggest special
collections
Too much text on home page
13
Content coverage
Non users
Users
More relevant special collections
More images, illustrations, rich
media
More images, blogs
Politics, contemporary British history
Too much missed from specific
websites
14
Reason for using or not using UKWA
Non-users:
• Current content not relevant
• More information regarding selection policy
• Less than a quarter “very likely” to use again
Users:
• Majority “very likely” to use again as there is content of interest
• Another 39% “quite likely”
15
Why do researchers use / not use a web archive
• Relevance of content determines whether researchers use it
• Selective web archives please some but disappoint others
• Use web archives for reference AND analytics
• Still a significant portion of the research community yet to be reached
16
Access statistics of the UK Web Archive: 1 Jan – 28 Nov 2013
17
Web Archives
INTERACTING WITH
SCHOLARS
18
Scholarly interactions: three types

• Archive-driven
  - Initiated by archival institutions
  - Aimed at understanding scholarly requirements and improving archival practice
• Scholar-driven
  - Initiated by scholars with research interests related to web archiving or archived web material, including many “unknown” scholars
  - A number of active research groups emerging
  - Netlab, WebArt and DMI, IHR, OII, ODU…
  - Attention from the Web Science community
• Project-based
  - Various scales, scopes and funding sources
  - Developing web archiving or discipline-specific solutions
  - Researchers and archiving institutions as partners
19
Scholarly interactions: three phases
• Phase 1: Building collections
  - Scholars’ involvement in scoping collections, selecting and describing websites relevant to research interests
  - Creation of specific (narrow) topical collections, e.g. “Religion, politics and law since 2005” in the UK Web Archive
• Phase 2: Formulating research questions
  - Brainstorming sessions, workshops etc.
  - Shift of focus to web archives in their entirety
  - The Analytical Access to the Domain Dark Archive (AADDA) project
  - 9 research proposals by arts, humanities and social sciences scholars
  - A prototype UI for analytical access
  - Lack of awareness and baseline knowledge
  - Time- and resource-consuming
  - Challenging: you don’t know what you don’t know
20
Scholarly interactions: three phases
• Phase 3: Independent use of web archives
  - The desired “go-to” state, meeting common scholarly requirements
  - Web archives do not become bottlenecks
  - Baseline knowledge is self-explanatory, e.g. scope of the archive, its coverage and lacunae, how it was collected, and how a particular website was crawled
  - Clear interfaces and jargon-free descriptions in alignment with scholarly requirements
  - Open access
  - Including provision of downloadable derived or secondary datasets, e.g. http://data.webarchive.org.uk/opendata/
21
How was the UK web linked in 1996?
• By Rainer Simon, using the UK Host-Level Link Graph (1996-2010) dataset
• Based on the 1996 portion: 58,842 hosts (nodes); 184,433 host-to-host links (edges)
• UK web as part of the global web
• Scalability issues with large datasets over time
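As an illustration of the kind of analysis a host-level link graph allows, the following is a minimal Python sketch that counts hosts (nodes), distinct host-to-host links (edges) and the most linked-to hosts for one year. The record layout `year|source_host|target_host|link_count` and the sample hosts are assumptions made for this example, not a statement of the dataset's actual format.

```python
from collections import Counter


def summarise_link_graph(lines, year):
    """Count hosts, distinct host-to-host links and in-degrees for one year."""
    hosts = set()
    edges = set()
    in_degree = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed record layout: year|source_host|target_host|link_count
        rec_year, source, target, _count = line.split("|")
        if int(rec_year) != year:
            continue
        hosts.update((source, target))
        if (source, target) not in edges:
            edges.add((source, target))
            in_degree[target] += 1  # one more distinct host links to target
    return {
        "nodes": len(hosts),
        "edges": len(edges),
        "top_linked": in_degree.most_common(3),
    }


# Hypothetical sample records, for illustration only.
sample = [
    "1996|a.co.uk|b.co.uk|3",
    "1996|c.ac.uk|b.co.uk|1",
    "1996|a.co.uk|c.ac.uk|2",
    "1997|a.co.uk|d.org.uk|5",
]
summary = summarise_link_graph(sample, 1996)
print(summary)
```

For the 1996 figures quoted above, 184,433 edges over 58,842 hosts works out at roughly 3.1 host-to-host links per host. Holding every edge in memory, as this sketch does, is the sort of approach that runs into the scalability issues noted for later, larger years.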
22
Web Archives
SCHOLARLY REQUIREMENTS
23
Scholarship is changing
• Blurred boundaries between scholarly sources and popular sources, even more so in the context of the web
• Any source used for scholarly purposes can be defined as a scholarly source
• Scholarship is evolving: computationally engaged research is gaining momentum, e.g. digital humanities
• Redrawing disciplinary boundaries
• Less text-based, more multimedia-driven
• The web is playing an important role; will archives of the web do the same?
24
Scholarly use (of digital sources): key characteristics

• Availability or accessibility
• Text and paratext, defined by Gérard Genette as “accompaniments” that “surround or prolong the text”. Niels Brügger (2010) applied this concept to websites and argues that it is different in form and function, and plays a crucial role in the textual coherence of a website
• Or context, in the usual sense of the word, e.g. out- and in-links
• Citation, the backbone of research, requires persistent identification of sources, ideally retrievable
• Sources relevant and specific to the research question, without any arbitrarily imposed (national, geographical or format-related) boundaries
• Quality
• Flexibility / ability to apply digital methods for analytics and discovery of new knowledge
25
Requirements for web archives
Characteristics of scholarly use | Requirements for web archives
Availability | No access restriction, available online
Paratext or context | Access to collection policy and scope, crawl configuration, crawl log and any contextual information
Persistence and citability | Longevity of web archives; persistent identifiers; standards for citing archived websites; integration with bibliographical management tools (e.g. Zotero)
Collect / organise research corpus | Archiving of research corpora on demand; means to mix and match and reassemble corpora based on research questions
Quality | Archival version represents as much as possible the live website in completeness, intellectual content, behaviour and look and feel; curation
Applying digital methods | Multiple access methods including data analytics and visualisations; access to web archives as “big data”
Boundary & format independent | Interlinked web archives; integration with other digital and printed holdings, e.g. books, ejournals
26
Unique Selling Points (USPs)
• The live web as a fast-evolving, interactive, multi-dimensional, open, participatory and interlinked collective system
• Web archives as static, flat, exclusive, individual systems with boundaries and limitations
• Focus on USPs: things that differentiate web archives from the live web
• Some web resources have vanished and web archives hold the only copies of these
• Periodic snapshots showing evolution and change of websites
• Web archives as comprehensive historical datasets, which lend themselves to analytical access
• Linked web archives
27
Who has archived http://www.conservatives.com/?
Mementos service
• Allows users to find archived web pages (mementos) in multiple web archives across the world (search based on aggregated metadata)
• Exposes the Memento protocol, which adds a time dimension to HTTP, so that accessing the past web works like accessing the current web
• Uses the Memento aggregate TimeGate hosted by lanl.gov
• Source code
• Also developed the Find memento bookmarklet, which finds archived versions of 404 webpages while browsing
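To make the "time dimension to HTTP" idea concrete, here is a minimal Python sketch of datetime negotiation in the Memento protocol (RFC 7089). A client sends an `Accept-Datetime` header to a TimeGate, and the archived copy it is directed to reports its capture time in a `Memento-Datetime` header. The TimeGate address below is a hypothetical placeholder, not the address of the lanl.gov service, and the sketch only builds the request without sending it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from urllib.request import Request

# Hypothetical TimeGate base URL, used for illustration only.
TIMEGATE = "http://timegate.example.org/timegate/"


def build_timegate_request(target_url, when):
    """Build (but do not send) a request for target_url as it was around `when`."""
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE + target_url, headers={"Accept-Datetime": accept_datetime})


def parse_memento_datetime(header_value):
    """Parse a Memento-Datetime response header into an aware datetime."""
    return parsedate_to_datetime(header_value)


req = build_timegate_request(
    "http://www.conservatives.com/",
    datetime(2013, 11, 28, tzinfo=timezone.utc),
)
# urllib stores header names with only the first letter capitalised.
print(req.full_url)
print(req.get_header("Accept-datetime"))
```

Sending the request (for example with `urllib.request.urlopen`) to a real TimeGate would normally result in a redirect to the memento closest to the requested date; which archives are consulted depends on the aggregator.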
UK Web Archive
EXTRA SLIDES FOR
ILLUSTRATION
30
UK Web Archive: search interface
31
UK Web Archive: browse interface
32
Using N-gram for scholarly research
• Courtesy of Dr Peter Webster, Institute of Historical Research, University of London
33
UK Web Archive: visual browsing
34
RSS feed of latest instances
35
Replacing original search function on site
36
Showing the big picture
http://seadragon.com/view/wky
37