
Building Scalable Web Archives
Florent Carpentier, Leïla Medjkoune
Internet Memory Foundation
IIPC GA, Paris, May 2014
Internet Memory Foundation
Internet Memory Foundation (European Archive)
Established in 2004 in Amsterdam and then in Paris:
• Mission: Preserve Web content by building a shared web archiving platform
• Actions: Dissemination, R&D and partnerships with research groups and
cultural institutions
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN,
The National Library of Ireland, etc.
Internet Memory Research
Spin-off of IMF established in June 2011 in Paris
• Mission: Operate large scale or selective crawls & develop new
technologies (processing and extraction)
Internet Memory Foundation
Focused crawling:
• Automated crawls through the Archivethe.Net shared
platform
• Quality-focused crawls:
– Video capture (YouTube channels), Twitter crawls,
complex crawls
Large scale crawling
• Distributed software developed in-house
• Scalable crawler: MemoryBot
• Also designed for focused crawls and complex scoping
Research projects
Web Archiving and Preservation
• Living Web Archives (2007-2010)
• Archives to Community MEMories (2010-2013)
• SCAlable Preservation Environment (2010-2013)
Webscale data Archiving and Extraction
• Living Knowledge (2009-2012)
• Longitudinal Analytics of Web Archive data (2010-2013)
MemoryBot design (1)
• Started in 2010 with the support of the LAWA
(Longitudinal Analytics of Web Archive data)
project
• URL store designed for large-scale crawls (DRUM, Disk
Repository with Update Management)
• Built in Erlang: a language for distributed, fault-tolerant
systems
• Distributed (consistent hashing)
• Robust: topology change adaptation, memory
usage regulation, process isolation
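
The "Distributed (consistent hashing)" point can be illustrated with a minimal sketch. This is not MemoryBot's Erlang code; the class name HashRing, the node names, and the use of virtual nodes are assumptions made for illustration. It shows why consistent hashing suits topology changes: when a node leaves, only the hosts it owned are reassigned.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical, illustrative only)."""

    def __init__(self, nodes, replicas=64):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at several positions to even out the load.
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, host):
        """Return the node responsible for a host: the next point clockwise."""
        idx = bisect.bisect(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[idx]]
```

Usage: HashRing(["crawler-1", "crawler-2", "crawler-3"]).node_for("example.org") returns the name of the node that would handle that host; after remove("crawler-3"), hosts owned by the other two nodes keep their assignment.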
MemoryBot design (2)
MemoryBot performance
• Good throughput with a slow decrease
– 85 resources written per second, slowing to 55 after 4
weeks, on a cluster of nine 8-core servers (32 GiB of RAM)
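
To put the two rates in perspective, they can be converted to daily volumes. The sketch below is plain arithmetic on the figures from the slide; the function name is hypothetical and nothing here is a measured result beyond the 85 and 55 resources per second quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def resources_per_day(rate_per_second):
    """Convert a sustained write rate into resources written per day."""
    return rate_per_second * SECONDS_PER_DAY


initial_daily = resources_per_day(85)  # 7,344,000 resources/day
late_daily = resources_per_day(55)     # 4,752,000 resources/day

# Relative slowdown between the start of the crawl and week 4.
relative_drop = (85 - 55) / 85
```

The relative drop works out to roughly 35% over four weeks.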
MemoryBot counters
MemoryBot – quality
• Support of HTTPS, retries on server failure,
configurable URL canonicalisation
• Scope: domain suffixes, language, hops sequence,
white lists, black lists
• Priorities
• Trap detection (URL pattern identification, duplicate
detection within a PLD, i.e. pay-level domain)
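
A minimal sketch of what configurable URL canonicalisation and suffix/black-list scoping might look like. The function names, the parameter names, and the list of dropped query parameters are all assumptions for illustration; MemoryBot's actual rules are not described in the slides.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalise(url, drop_params=("sessionid", "phpsessid")):
    """Lower-case scheme and host, drop default ports, fragments and
    session-like parameters, and sort the remaining query parameters.
    Any userinfo in the URL is dropped as well."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in drop_params
    ]
    query.sort()
    return urlunsplit((scheme, netloc, path, urlencode(query), ""))


def in_scope(url, suffixes=(), blacklist=()):
    """Accept a URL if its host matches a domain suffix and no black-list entry."""
    host = (urlsplit(url).hostname or "").lower()
    if any(host == b or host.endswith("." + b) for b in blacklist):
        return False
    return any(host == s or host.endswith("." + s) for s in suffixes)
```

Usage: canonicalise("HTTP://Example.ORG:80/a?b=2&a=1&PHPSESSID=xyz#frag") gives "http://example.org/a?a=1&b=2", so both spellings map to the same entry in the URL store.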
MemoryBot – multi-crawl
• Easier management
• Politeness observed across different crawls
• Better resource utilisation
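
Observing politeness across crawls implies that the per-host delay is tracked in one place rather than per crawl. A minimal sketch under that assumption follows; the class name PolitenessGate, the 2-second delay, and the explicit "now" argument are hypothetical and chosen to keep the example deterministic.

```python
class PolitenessGate:
    """Tracks, for every host, the earliest time it may be fetched again.
    A single instance is shared by all crawls (illustrative only)."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self._next_allowed = {}  # host -> earliest allowed fetch time

    def try_acquire(self, host, now):
        """Return True and reserve the host if it may be fetched at time `now`."""
        if now < self._next_allowed.get(host, 0.0):
            return False
        self._next_allowed[host] = now + self.delay
        return True
```

If crawl A fetches example.org at time 0, a request from crawl B for the same host at time 1 is refused, while a request for a different host goes through.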
IM Infrastructure
Green datacenters
• Through a collaboration with NoRack
• Designed for massive storage (petabytes of data)
• Highly scalable/low consumption
• Reduces storage and processing costs
Repository:
• HDFS (Hadoop Distributed File System): a distributed,
fault-tolerant file system
• HBase: a distributed key-value index (temporal archives)
• MapReduce: a distributed execution framework
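
The idea of a key-value index over temporal archives can be sketched in memory: keys are (URL, timestamp) pairs kept in sorted order, so all versions of a URL are adjacent and the version valid at a given date is a single lookup. This does not use the HBase API; the class TemporalIndex and the 14-digit timestamp strings are assumptions for illustration.

```python
import bisect


class TemporalIndex:
    """In-memory stand-in for a sorted key-value store keyed by (url, timestamp)."""

    def __init__(self):
        self._keys = []    # sorted list of (url, timestamp) pairs
        self._values = {}  # (url, timestamp) -> payload

    def put(self, url, timestamp, payload):
        key = (url, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = payload

    def get_at(self, url, timestamp):
        """Return the latest version captured at or before `timestamp`, or None."""
        idx = bisect.bisect_right(self._keys, (url, timestamp))
        if idx == 0:
            return None
        key = self._keys[idx - 1]
        return self._values[key] if key[0] == url else None

    def versions(self, url):
        """List the capture timestamps of a URL in chronological order."""
        start = bisect.bisect_left(self._keys, (url,))
        out = []
        for key in self._keys[start:]:
            if key[0] != url:
                break
            out.append(key[1])
        return out
```

Because versions of the same URL sit next to each other, a range scan returns the full history of a page without touching unrelated records.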
IM Platform (1)
Data storage:
• temporal aspect (versions)
Organised data:
• Fast and easy access to content
• Easy processing distribution (Big Data)
Several views on same data:
• Raw, extracted and/or analysed
Takes care of data replication:
• No (W)ARC synchronisation required
IM Platform (2)
Extensive characterisation and data mining actions:
• Process and reprocess information any time depending
on needs/requests
– Extract information such as MIME type, text resources,
image metadata, etc.
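
A characterisation pass of this kind maps naturally onto a map/reduce job. The sketch below counts resources per MIME type in plain Python, with no Hadoop dependency; the record layout (a dict with a "mime" field) and the function names are assumptions for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Emit (normalised MIME type, 1) for one archived resource."""
    mime = record["mime"].split(";")[0].strip().lower()
    yield mime, 1


def reduce_counts(key, values):
    """Sum the counts emitted for one MIME type."""
    return key, sum(values)


def run_job(records):
    """Tiny single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())
```

Since the map step only looks at one record at a time, the same job can be rerun over the stored data whenever a new extraction is needed.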
SCAlable Preservation Environment
(SCAPE)
QA/Preservation challenges?
• Growing size of web archives
• Ephemeral and heterogeneous content
• Costly tools/actions
Objectives:
• Develop scalable quality assurance tools
• Enhance existing characterisation tools
Visual automated QA: Pagelizer
• Visual and structural comparison tool developed by
the UPMC as part of SCAPE
• Trained and enhanced through a collaboration with IMF
• Wrapped by the IMF team to be used at large scale within
its platform
• Allows comparison of two web page snapshots
• Provides a similarity score as an output
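
To make the notion of a similarity score concrete, here is a deliberately simple stand-in: a grey-level histogram intersection between two screenshots given as non-empty grids of 0-255 values. This is not Pagelizer's algorithm, which combines visual and structural comparison; the function names and the choice of 8 bins are assumptions for illustration only.

```python
def histogram(pixels, bins=8):
    """Normalised grey-level histogram of a non-empty 2-D grid of 0-255 values."""
    counts = [0] * bins
    for row in pixels:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def similarity(a, b, bins=8):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    ha, hb = histogram(a, bins), histogram(b, bins)
    return sum(min(x, y) for x, y in zip(ha, hb))
```

A threshold on such a score would then decide whether two captures of the same page are treated as equivalent.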
Visual automated QA: Pagelizer
• Tested on 13,000 pairs of URLs (Firefox & Opera)
• 75% of assessments correct
• Whole workflow runs for around 4 seconds/pair
– 2 seconds for screenshot (depends on page rendered)
– 2 seconds for comparison
• Run time already halved since initial tests (MapReduce)
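
The figures above give a rough idea of the total cost of the test. The sketch below is plain arithmetic on the slide's numbers; the function name and the assumption of perfectly even parallelism across workers are hypothetical.

```python
def total_seconds(pairs, seconds_per_pair, workers=1):
    """Wall-clock estimate assuming work is split evenly across workers."""
    return pairs * seconds_per_pair / workers


sequential = total_seconds(13_000, 4)       # 52,000 seconds
sequential_hours = sequential / 3600        # about 14.4 hours
correct_pairs = int(13_000 * 0.75)          # 9,750 pairs assessed correctly
```

Run sequentially, the 13,000 pairs take about 14.4 hours, which is why distributing the workflow matters at crawl scale.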
Next steps
Improvements are to be made:
• Performance
• Robustness
• Correctness
New test in progress on a large-scale crawl:
• Results to be disseminated to the community
through the SCAPE project and through on-site
demos (contact IMF)!
Thank you.
Any questions?
http://internetmemory.org - http://archivethe.net