Update from SB/KB

Recent approaches to capturing web content that Heritrix can’t harvest
 Capturing Social Media
 Screen filming of Rich Media
 Project: Event crawl of The Eurovision Song Contest in
Copenhagen 2014
 Cooperation with researchers
NAS workshop, Paris 2014/Sabine Schostag
Why focus on social media?
 Nowadays social media are the primary communication
platforms during cultural and political events
 Politicians, artists, musicians and even traditional news media
such as TV use social media more than traditional web pages
 Entries on social media pages are ephemeral, so we need
to capture them at a very high frequency
Which social media did we crawl?
 Twitter.com comments
 Youtube.com video and comments
 Facebook.com comments
 Live blogs
Excluded for technical reasons …
 instagram.com video and image
 tumblr.com multimedia blog
 flickr.com images
 vimeo.com video
Which tools did we use?
 Harvesting with NetarchiveSuite using Heritrix 1.4*,
weekly, daily and hourly
 Crontab-based screen dumping of static URLs to
searchable PDFs using PhantomJS
 Manual browsing with LAP (Live Archive Program)
 XML extracts from APIs using in-house tools
and/or Digitalfootprints.dk
 Harvesting YouTube videos by extracting the video URLs
from the “watch-url” pages with an in-house tool
 Screen recording using CamStudio.org and a Netlab.dk
Linux tool wrapping ffmpeg
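As an illustration of the crontab-based screen dumping, the following Python sketch builds one PhantomJS command line per URL, with a timestamped PDF name. It assumes the `rasterize.js` example script that ships with PhantomJS. The function name `dump_command`, the output directory and the crontab entry in the comment are hypothetical and are not taken from the actual setup.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def dump_command(url, out_dir, now=None, script="rasterize.js"):
    """Build a PhantomJS command line that renders `url` to a timestamped PDF.

    Assumption: `script` is the rasterize.js example shipped with PhantomJS,
    called as: phantomjs rasterize.js URL output.pdf [paper format].
    """
    now = now or datetime.now(timezone.utc)  # pass UTC times; the name ends in Z
    host = urlparse(url).netloc.replace(":", "_") or "unknown"
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # capture time kept in the file name
    pdf = PurePosixPath(out_dir) / f"{host}-{stamp}.pdf"
    return ["phantomjs", script, url, str(pdf), "A4"]


# Hypothetical crontab entry that runs a wrapper script every 10 minutes:
# */10 * * * * /usr/bin/python3 /opt/dumps/dump_urls.py
if __name__ == "__main__":
    print(" ".join(dump_command("https://twitter.com/", "/var/dumps")))
```

Putting the capture time into the file name is one simple way to keep a minimum of provenance with each dump. The command list can be passed to `subprocess.run` as it stands.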
…more about the automated screen filming tool
 developed as part of a research project by a
curator/researcher, now implemented as a tool
 allows scheduled capturing
 is well suited to capture pre-planned streamed content
 is well suited to capture frequently updated content which
refreshes automatically (no mouse clicks)
 is not a replacement for existing collection methods, but a
supplement
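Scheduled capturing with an ffmpeg wrapper could look roughly like the Python sketch below. It is not the Netlab.dk tool itself: the function name `recording_plan`, the display `:0.0`, the frame size and the output file name are assumptions for illustration. Only the ffmpeg options used (`-f x11grab`, `-video_size`, `-framerate`, `-i`, `-t`) are standard.

```python
from datetime import datetime


def recording_plan(start, end, now, display=":0.0", size="1280x720",
                   fps=25, out="capture.mkv"):
    """Return (seconds to wait, ffmpeg command) for one scheduled screen capture."""
    begin = max(start, now)  # start at once if the planned start has passed
    duration = int((end - begin).total_seconds())
    if duration <= 0:
        raise ValueError("capture window is already over")
    wait = (begin - now).total_seconds()
    cmd = [
        "ffmpeg",
        "-f", "x11grab",        # grab the X11 display (Linux)
        "-video_size", size,
        "-framerate", str(fps),
        "-i", display,
        "-t", str(duration),    # stop after the planned number of seconds
        out,
    ]
    return wait, cmd


if __name__ == "__main__":
    # Hypothetical capture window for a pre-planned live stream.
    wait, cmd = recording_plan(
        start=datetime(2014, 5, 10, 21, 0),
        end=datetime(2014, 5, 10, 23, 30),
        now=datetime(2014, 5, 10, 20, 45),
    )
    print(f"sleep {wait:.0f}s, then run: {' '.join(cmd)}")
```

The sketch records video only. Capturing sound would need an additional audio input, which depends on the local sound setup.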
…more about the automated screen filming tool
 The tool enables the user to programme every mouse click
and every other interaction on the web page
…some screenshots from the filming tool
ESC 2014 and the European Parliament elections 2014
Lessons learned
 NetarchiveSuite using Heritrix 1.4* can’t harvest JavaScript with AJAX,
nor keep up with high-frequency feeds, e.g. 47,000 tweets/minute.
 You can record the “look and feel” with screen recording and
dumping, but producing the files and provenance documentation
outside the archive is a HUGE manual effort.
 The LAP tool is not very useful, as it doesn’t support HTTPS (most
social media use HTTPS today).
 “Digitalfootprints.dk” can archive almost all XML content for Twitter,
and its output could be harvested afterwards by NetarchiveSuite/Heritrix.
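To put the quoted peak of 47,000 tweets per minute in perspective, the short calculation below estimates how many tweets would appear between two consecutive captures. It assumes, purely for illustration, that the peak rate were sustained for the whole interval; the helper name `tweets_between_captures` is hypothetical.

```python
TWEETS_PER_MINUTE = 47_000  # peak figure quoted in the lessons learned


def tweets_between_captures(interval_seconds, per_minute=TWEETS_PER_MINUTE):
    """Tweets posted during one capture interval at a constant rate."""
    return per_minute * interval_seconds // 60


per_second = TWEETS_PER_MINUTE / 60

if __name__ == "__main__":
    print(f"per second: {per_second:.0f}")
    print(f"between hourly harvests: {tweets_between_captures(3600):,}")
```

At that rate roughly 783 tweets arrive every second, and about 2.8 million would be posted between two hourly harvests.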
Current issues
 wider access
 better access (free text search)
 inclusion of older net collections
 collection of websites with restricted access
 advanced web content, i.e. with sound/video/live interaction
(chat, virtual worlds …)
 electronic communication networks ≠ the
web
 long-term preservation
 documentation
… and from the technical point of view
 more stable and operational screen recording and dumping
tools for huge social media events
 build social media API extraction plugins into Heritrix and add better
support for WARC linking of e.g. YouTube watch and video
download URLs
 build scripting and HTTPS support into the LAP tool
 upgrade NetarchiveSuite to Heritrix 3.* to better support JavaScript
with AJAX (using the Umbra plugin) and continuous crawling
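The wish to link a YouTube watch page to its video download URL can be illustrated with a small Python sketch. It is not a WARC implementation: it only builds a linking record keyed by the `v` parameter of the watch URL, which could later be stored alongside the harvested files. The names `video_id` and `link_record`, the record layout and the example URLs are assumptions.

```python
from urllib.parse import parse_qs, urlparse


def video_id(watch_url):
    """Return the `v` query parameter of a YouTube watch URL, or None."""
    parts = urlparse(watch_url)
    if not parts.path.rstrip("/").endswith("/watch"):
        return None
    values = parse_qs(parts.query).get("v")
    return values[0] if values else None


def link_record(watch_url, download_url):
    """Pair a watch page with the video file harvested for it."""
    vid = video_id(watch_url)
    if vid is None:
        raise ValueError(f"not a watch URL: {watch_url}")
    return {"video_id": vid, "watch_url": watch_url, "download_url": download_url}


if __name__ == "__main__":
    # Both URLs are made-up placeholders.
    print(link_record(
        "https://www.youtube.com/watch?v=abc123XYZ_-&t=10s",
        "https://video.example.org/abc123XYZ_-.mp4",
    ))
```

Keying the record by video ID keeps the link stable even when the same video is reached through watch URLs that carry different extra parameters.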
Epilogue
 For the first time in Netarchive’s history, the whole team met
for two days