Analytics and Access to the UK Web Archive
Access and Analytics to the UK Web Archive
Lewis Crawford,
Web Archive Technical Lead
The British Library
Introduction
This talk will cover:
Background of the UK Web Archive
Traditional access methods to Web Archives
Full text search for resource discovery
Problems of scale – needles and haystacks
Web Archiving: the basics
What
Selecting, capturing, storing, preserving and managing access to snapshots of websites over time
How
Use crawler software to download websites automatically
Selective or domain archiving
Provide access in a Web Archive
When
Since the mid-1990s
Who
Heritage and memory organisations, e.g. IIPC members
University libraries
Not-for-profit and commercial organisations, e.g. the Internet Archive
Individual researchers
Why
Global information resource
Artefact of cultural and technology change
Representative sample of the web: historical and sociological data that may not be found elsewhere
Part of national digital heritage - legal requirements
UK Web Archive:
Web archive as historical documents
Multimedia based content
3D visualisation wall
Full text search
N-gram visualisation
Media based results
Semantic analysis
Scale: needle and haystack
Subject hierarchy visualisation
UK Web Archive
~ 10,000 websites collected since 2004
~ 40,000 instances
Google: “seen 1 trillion unique URLs”
More than a billion new pages are added to the web every day
The UK web domain
9 million .uk domain names registered as of December 2010
~ 1 million using other domain names
Growing at 11%-14% per year
40% estimated to be in scope for Legal Deposit
Estimated ~110 TB for each UK domain crawl
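As a rough illustration of what the growth range above implies, the following sketch compounds the December 2010 figure forward. It assumes the 11%-14% rate applies to the ~9 million .uk names and compounds annually; the slide does not state either, and the function name `project` is only illustrative.

```python
# Rough projection of .uk registrations under the slide's growth range.
# Assumption (not stated in the slides): the 11%-14% annual growth applies
# to the ~9 million .uk names and compounds once per year.

def project(start: float, rate: float, years: int) -> float:
    """Compound `start` at `rate` per year for `years` years."""
    return start * (1 + rate) ** years

UK_NAMES_2010 = 9_000_000

for years in (1, 3, 5):
    low = project(UK_NAMES_2010, 0.11, years)
    high = project(UK_NAMES_2010, 0.14, years)
    print(f"+{years} years: {low / 1e6:.1f}M to {high / 1e6:.1f}M names")
```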
The value of the haystacks – content visualisation
Big Data analytics
Java MapReduce jobs use Tika to extract text and generate XML files for Solr ingest
Hive & Pig for ad hoc query analysis
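The XML generation step can be sketched as follows. This is a minimal illustration in Python rather than the Java MapReduce code the slide refers to, and it covers only the last step: wrapping already-extracted text in Solr's XML update format (`<add><doc><field name="...">`). The field names and the sample record are hypothetical, not the archive's actual schema.

```python
import xml.etree.ElementTree as ET

def solr_add_xml(docs):
    """Build a Solr XML update message: <add><doc><field name="...">...</field></doc></add>."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")

# Hypothetical record: text assumed to have been extracted already (e.g. by Tika).
record = {
    "id": "example-0001",
    "url": "http://www.example.org.uk/",
    "content_type": "text/html",
    "text": "Extracted page text & more",
}

print(solr_add_xml([record]))
```

Writing one such file per batch of records would give Solr a set of XML files to pick up and index.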
Search indexing process
[Diagram: search indexing process. Labels shown: WCT crawlers generate (w)arcs and insert meta information into a dedicated meta database; Hadoop (Node 1 to Node 50) generates XML files; XML media, image and document stores, each with a dedicated Solr indexer whose DIH indexes new XML; replication from indexers to dedicated Solr search servers; document and meta service; web access retrieves (w)arcs and meta information.]
Tag cloud analysis – General Election 2005
• Special Collection 2005 general election
• 147 websites archived during and immediately after the UK general election campaign of 2005.
• Tag clouds (or weighted lists) generated for websites belonging to key political parties
• Shows the most frequently used words in the websites during the 2005 election campaign
• Special collection 2010 general election now available
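A weighted word list of the kind described above can be sketched in a few lines. This is an illustrative approach, not the method used for the archive's tag clouds: the stopword list, the font-size range and the function name `tag_weights` are all assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "for", "is", "on", "we", "our"}

def tag_weights(text, top_n=10, min_size=10, max_size=40):
    """Return (word, count, font_size) for the most frequent words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid division by zero when all counts match
    return [
        (word, count, round(min_size + (count - lowest) * (max_size - min_size) / span))
        for word, count in top
    ]

sample = "tax tax tax schools schools hospitals the the the the"
for word, count, size in tag_weights(sample):
    print(f"{word}: count={count}, size={size}")
```

Font size is scaled linearly between the least and most frequent of the selected words, so the list stays readable whatever the absolute counts are.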
The value of the haystacks – postcode-based access
1: Blue
2-5: Green
5+: Purple
50+: Yellow
100+: Red
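One way to build postcode-based access is to pull postcodes out of page text and count how many sites mention each one. The sketch below makes several assumptions: the regular expression is a simplified UK postcode pattern rather than a full validator, the legend values are read as paired in order (1 Blue, 2-5 Green, 5+ Purple, 50+ Yellow, 100+ Red) and as counts of websites per postcode, and a count of exactly 5 is placed in the Green band. The function names and sample pages are hypothetical.

```python
import re
from collections import Counter

# Simplified UK postcode pattern (outward code, optional space, inward code).
# Assumption: good enough for illustration; it will accept some invalid codes.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b")

def extract_postcodes(text):
    """Return the set of normalised postcodes found in `text`."""
    return {f"{outward} {inward}" for outward, inward in POSTCODE_RE.findall(text.upper())}

def sites_per_postcode(pages):
    """Count how many sites mention each postcode. `pages` maps site -> text."""
    counts = Counter()
    for text in pages.values():
        counts.update(extract_postcodes(text))  # each site counted once per postcode
    return counts

def marker_colour(count):
    """Map a count to a legend colour (band boundaries are an assumption)."""
    if count >= 100:
        return "Red"
    if count >= 50:
        return "Yellow"
    if count > 5:
        return "Purple"
    if count >= 2:
        return "Green"
    if count == 1:
        return "Blue"
    return None

# Hypothetical sample pages.
pages = {
    "site-a": "Contact: 96 Euston Road, London NW1 2DB",
    "site-b": "Visit us at nw1 2db or LS23 7BQ",
}
for postcode, count in sites_per_postcode(pages).items():
    print(postcode, count, marker_colour(count))
```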
Questions?
Thank you.
http://www.webarchive.org.uk
[email protected]
@relephantdata