Update from SB/KB
Recent approaches to capturing web content
that Heritrix can’t harvest
Capturing Social Media
Screen filming of Rich Media
Project: Event crawl of The Eurovision Song Contest in
Copenhagen 2014
Cooperation with researchers
NAS workshop, Paris 2014/Sabine Schostag
Why focus on social media?
Nowadays social media are the primary communication
platforms during cultural and political events
Politicians, artists, musicians, and even traditional news media
such as TV use social media more than traditional web
pages
The entries on social media pages are ephemeral, so we need
to capture them at a very high frequency
Which social media did we crawl?
Twitter.com comments
Youtube.com video and comments
Facebook.com comments
Live blogs
Excluded for technical reasons …
instagram.com video and image
tumblr.com multimedia blog
flickr.com images
vimeo.com video
Which tools did we use?
Harvesting with NetarchiveSuite using Heritrix 1.4*,
weekly, daily and hourly
“Crontab”-based screen dumping of static URLs using
PhantomJS to searchable PDFs
Manual browsing with LAP (Live Archive Program)
XML extracts from APIs using tools developed in-house
and/or Digitalfootprints.dk
Harvesting YouTube videos by extracting the video URLs
from the “watch-url” pages with a tool developed in-house
Screen recording using CamStudio.org and a Netlab.dk
Linux tool wrapping “ffmpeg”
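A minimal sketch of how a cron-driven screen dump job might be assembled. It is not the tool used in the project. It assumes the `rasterize.js` example script shipped with PhantomJS is available locally; the file naming scheme, the function names and the example URL are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse
import re


def dump_filename(url, when):
    """Build a timestamped PDF name so repeated dumps never overwrite each other.

    `when` should be a timezone-aware datetime; it is converted to UTC.
    """
    parsed = urlparse(url)
    # Collapse everything that is not a letter or digit into single dashes.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + parsed.path).strip("-")
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{slug}_{stamp}.pdf"


def phantomjs_command(url, when, script="rasterize.js", paper="A4"):
    """Return the argument list for: phantomjs rasterize.js URL output.pdf paper-format.

    The script path is an assumption; adjust it to the local PhantomJS install.
    """
    return ["phantomjs", script, url, dump_filename(url, when), paper]


if __name__ == "__main__":
    # Hypothetical crontab entry calling this script every 10 minutes:
    #   */10 * * * * python3 /path/to/dump.py
    url = "https://twitter.com/Eurovision"  # example URL, not from the project
    print(" ".join(phantomjs_command(url, datetime.now(timezone.utc))))
```

The command is only printed here; passing the list to `subprocess.run` would execute it. Keeping the capture time in the file name also records a small piece of provenance for each dump.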
…more about the automated screen filming tool
developed as part of a research project by a
curator/researcher, now implemented as a tool
allows scheduled capturing
is well suited to capture pre-planned streamed content
is well suited to capture frequently updated content which
refreshes automatically (no mouse clicks)
is not a replacement for existing collection methods, but a
supplement
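A rough sketch of scheduled screen capture with ffmpeg on Linux. The interface of the Netlab.dk tool is not described here, so this only illustrates the general idea under stated assumptions: an X11 display at `:0.0`, ffmpeg's `x11grab` input device, and made-up function names, times and file names.

```python
from datetime import datetime


def seconds_until(start, now):
    """Seconds to wait before a planned capture begins (never negative)."""
    return max(0.0, (start - now).total_seconds())


def ffmpeg_screen_command(output, duration_s, display=":0.0",
                          size="1280x720", fps=25):
    """Return the ffmpeg argument list for recording an X11 screen region.

    -f x11grab   : use the X11 screen grabbing input device
    -video_size  : size of the captured area
    -framerate   : frames per second to grab
    -i           : the X display to record (assumed to be :0.0)
    -t           : stop after this many seconds
    """
    return ["ffmpeg", "-f", "x11grab", "-video_size", size,
            "-framerate", str(fps), "-i", display,
            "-t", str(duration_s), output]


if __name__ == "__main__":
    # Hypothetical schedule: record one hour starting at a planned time.
    start = datetime(2014, 5, 10, 21, 0, 0)   # made-up start time
    now = datetime(2014, 5, 10, 20, 59, 30)   # made-up current time
    print("wait", seconds_until(start, now), "seconds, then run:")
    print(" ".join(ffmpeg_screen_command("capture.mkv", 3600)))
```

A scheduler would sleep for `seconds_until(...)` and then run the command, for example with `subprocess.run`. Audio capture and scripted mouse clicks are left out of this sketch.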
…more about the automated screen filming tool
The tool enables the user to programme every mouse click
and every interaction on the web page
…some screenshots from the filming tool
ESC 2014 and the European
Parliament Elections 2014
Lessons learned
NetarchiveSuite using Heritrix 1.4* can’t harvest JavaScript with AJAX, nor
keep up with the high frequency of feeds, e.g. 47,000 tweets/minute.
You can record the “look and feel” with screen recording and
dumping, but producing the files and provenance documentation
outside the archive is a HUGE amount of manual work.
The LAP tool is not very useful as it doesn’t support https (most
social media use https today).
“Digitalfootprints.dk” can archive almost all XML content for Twitter,
and that content could be harvested afterwards by NetarchiveSuite/Heritrix.
Current issues
wider access
better access (free text search)
inclusion of older net collections
collection of websites with restricted access
advanced web content, i.e. with
sound/video/live interaction (chat, virtual
worlds …)
electronic communication networks ≠ the
web
long-term preservation
documentation
… and from the technical point of view
more stable and operational screen recording and dumping
tools for huge social media events
build social media API extract plugins into Heritrix, and better
support for WARC linking of e.g. YouTube watch and video
download URLs
build scripting and https support into the LAP tool
upgrade NetarchiveSuite to Heritrix 3.* to better support JavaScript
with AJAX (using the Umbra plugin) and continuous crawling
Epilogue
For the first time in Netarchive’s history the whole team met
for two days