Web Site Preservation - Arts and Humanities Data Service

Preserving Web Sites
Brian Kelly
UKOLN
University of Bath
Email: [email protected]
UKOLN – a centre of expertise in digital information management
Contents
• Why Is Web Site Preservation An Issue?
• The Nightmare Scenario
• Administrative Issues
• Technical Challenges
• What Is My Web Site?
• What Is My Preferred Future For My Web Site?
• Mothballing Procedures
• Lessons For Future Work
• Questions
Why Is Web Site Preservation An Issue?
Digital Resources Don't Rot
• Digital resources (images, video, software, Web
sites, …) don't degrade due to environmental
factors. This is a key difference with physical
resources.
• Web sites are made from various digital resources:
HTML pages, GIF, JPEG, etc. image files, PDF
resources, software (CGI scripts, JavaScript, etc.)
• These won't degrade so why is Web site
preservation an issue?
• Isn't the fact that old Web sites won't disappear and
may be embarrassing more of a challenge?
Digital Resources Do Rot!
In fact digital resources do 'rot':
• Operating systems are upgraded and existing
applications cease to work
• Security holes are identified and there is a need to
install patches
• Resources may be dependent on external
resources (e.g. links, news feeds, …) which may
disappear
• Resources may be hosted by external services
and there is a need for ongoing funding for the
hosting
• …
The Nightmare Scenario
To be avoided:
• The funding finishes
• Project staff leave, partnership dissolves
• Hosting agency upgrades operating system,
breaking the scripts which access resources from
the backend database
• User finds page with invitation to project launch and
travels to meeting. Unfortunately the event took
place in 2002.
• Invoice for domain name is not paid, as
administrator has left.
• Web site domain taken over by porn company
• Prime Minister picks up pen containing project URL
and visits pornographic Web site
It Has Happened
Webtechs.com
• Software company which
hosted early HTML
validation service
• In 1998/99 confusion over
payment of domain name
• March 1999 company
receives many messages
saying validation service
is now a porn site
• Over 30,000 links to Web
site!
• Sept 1999 porn company agrees to sell
domain name back to Webtechs
The Embarrassment Still Exists
The hijacked Web site can still be accessed using the
Internet Archive's Wayback Machine.
Note that the archived Web site contains JavaScript
(and ActiveX controls?) which could delete data on the
viewer's PC
See <http://www.exploit-lib.org/issue1/webtechs/>
Related Incidents
Grove
• Grove Art and Music set up Web site in mid 1990s
• Company decides cost of managing and
maintaining Web site isn't justified
• Company relinquishes domain
• Domain bought by porn company
• Grove belatedly realise they have made a mistake
Lynx
• A Web site which provides advice on the Lynx text
browser decides to get a 'better' domain name
• The old domain name is taken over by a porn
company
• The author uses the Web site in his workshops!
A Possible Scenario For NOF-digi
A potential scenario:
• Cultural heritage Web site developed
• Heritage body has limited networking expertise
• Domain name lapsed due to lack of knowledge of
terminology ("What's a DNS? Is this invoice legit?")
• Once virtual domain name lapsed, accesses go to
service developer's Web site
• Service developer's Web site has links to Web sites
they've built (including some of a dubious nature)
• Once address expires in DNS caches links go to a
porn company
• Cultural heritage gateway points to a porn site!
Let's ensure that this doesn't happen
A Web site isn't just for Christmas, it's for life!
The lessons:
• You need to be aware that Web sites developed
using short-term project funding need to be kept
for a long period after funding finishes
• Porn domain name pirates are looking for Web
sites whose domain name has expired
• Web sites which are well-linked and easily found
using Google are particularly attractive to porn
pirates
• Avoid this happening to your Web site
• Avoid this happening to the Web sites you link to
Other Administrative Issues
Digital Signatures
• You buy a digital signature
which identifies your Web
site as belonging to a
legitimate organisation
• The digital signature is used
for (a) the encryption of
credit card details and (b)
use of an Intranet
• You fail to renew the
signature / renewal not
accepted as the consortium
is not a legal entity
• Users see "Non-valid signature" message
See <http://www.bathimpact.com/io_article.php?section=On%20Campus&ref=48>
What Is My Web Site?
What do we mean by my Web site?
What purposes could be provided by my Web site?
• The public Web site which users see
• Several Web services used by users (e.g.
www.foo.org.uk, search.foo.org.uk, …)
• The Web site containing a public area and a private
area for use by consortium members
• A public Web site and a private one
• Several public Web sites, one for each member of
the consortium
See <http://www.ukoln.ac.uk/qa-focus/documents/briefings/briefing-15/>
The Preferred Future For My Web Site
After the project funding finishes:
• The project money has helped pump-prime an
activity which is core to my organisation's mission.
The project Web site will be developed through my
organisation's existing funding streams.
• We'd like to build on the work. We're looking for
new funding streams.
• We've decided we don't want to engage in the
e-world. We'd like someone to take the Web site off
our hands (we don't want it to become a porn site!)
• We haven't given any thought to this. Anyway we've
all left the project.
Technical Issues
Standards And Formats
• Has the Web site been designed using open
standards, which should help future-proofing?
• Have proprietary formats been used (for which
backwards compatibility may not be considered)?
Architecture & Implementation
• Has the technical architecture of the Web site been
documented?
• Can I continue to use technical systems after
funding has finished?
Content Issues
Accuracy:
• Is the content of my Web site accurate today –
and tomorrow?
• Could the content of my Web site be misleading?
Usability:
• Are links working today – and tomorrow?
Legal:
• Is my Web site legal today (accessibility;
copyright; defamation; IPR; …)?
• Will my Web site be legal tomorrow, if new
legislation is enacted?
Mothballing Your Web Site (1)
Before funding finishes you should take steps
for the mothballing of your Web site:
• Run a link check across the Web site. Fix broken
internal links and as many external links as is
reasonable. Document the link report.
• Run HTML (and CSS) validation checks across the
Web site. Fix as many invalid pages as is
reasonable. Document the findings.
• Run an accessibility check across the Web site. Fix
as many inaccessible pages as is reasonable.
Document the findings.
This should not be an onerous task if you have been following the
NOF-digi guidelines. Note that, with the findings documented, you can
show that errors found later occurred after your funding finished.
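The first step of a link check, separating internal links (which you can fix) from external ones (which you may only be able to document), can be sketched as follows. This is a minimal illustration using only the Python standard library, not a replacement for a full link-checking tool; the class and function names and the example.org address are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def classify_links(html, page_url):
    """Return (internal, external) lists of absolute URLs found in html."""
    collector = LinkCollector()
    collector.feed(html)
    site_host = urlparse(page_url).netloc
    internal, external = [], []
    for link in collector.links:
        absolute = urljoin(page_url, link)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc == site_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


# Hypothetical page content and address, for illustration only.
page = (
    '<a href="/about.html">About</a> '
    '<a href="http://www.ukoln.ac.uk/">UKOLN</a> '
    '<a href="mailto:help@example.org">Mail</a>'
)
internal, external = classify_links(page, "http://www.example.org/index.html")
print("Internal:", internal)
print("External:", external)
```

Each URL in the two lists could then be requested in turn and the results saved as the documented link report.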
Mothballing Your Web Site (2)
You should also address technical areas:
• Remove any backend scripts which are no longer
needed (e.g. online booking forms for old events).
• Remember that scripts, etc. are liable to go
wrong. Ensure that applications are configured to
break gracefully and provide meaningful errors:
✓ The config.ssi is missing. This should be
reported to the systems administrator (email
[email protected] or ring +44 020 123
123. Please provide the URL of the broken
page and the project name)
✗ Apache error 6963
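Breaking gracefully can be sketched as a script that catches a missing file and tells the user who to contact and what to report. This is only an illustration: the function name, the contact address and the example URL are hypothetical, and a real script would follow the conventions of whatever server-side technology the Web site uses.

```python
import os

# Hypothetical generic contact address (not an individual's).
SUPPORT_CONTACT = "webmaster@example.org"


def load_config(path, page_url, project_name):
    """Return (config_text, error_message); exactly one of them is None."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read(), None
    except OSError:
        # Tell the user what is wrong, who to tell, and what to quote.
        message = (
            f"The configuration file {os.path.basename(path)} could not be read. "
            f"Please report this to {SUPPORT_CONTACT}, quoting the page "
            f"{page_url} and the project name {project_name}."
        )
        return None, message


config, error = load_config(
    "/nonexistent/config.ssi",
    "http://www.example.org/booking/",
    "Example Project",
)
print(error)
```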
Mothballing Your Web Site (3)
You should also address the content of your Web site:
• Clarify the status of the Web site on the home
page.
• Ensure the tense of the content reflects the
position i.e. don't say "This project will …"
• Ensure that contact details will remain valid i.e.
provide generic email addresses not an
individual's
• Remember that many users will arrive deep in
your Web site (e.g. using Google). If necessary
use CSS to flag all pages with a watermark
See <http://www.ukoln.ac.uk/qa-focus/documents/briefings/briefing-04/>
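If the pages do not share a stylesheet which could carry a CSS watermark, one alternative is to insert a status notice into each page. The sketch below shows the idea for a single page held as a string; the notice wording, the archive-notice class name and the function name are assumptions for illustration, and applying it to a real Web site would mean looping over the HTML files and keeping a backup.

```python
import re

# Hypothetical wording; adapt to the actual status of the Web site.
NOTICE = "This Web site is no longer being updated."


def add_status_banner(html, notice=NOTICE):
    """Insert a status banner after the opening <body> tag, once only."""
    if 'class="archive-notice"' in html:
        return html  # page has already been flagged
    banner = f'<div class="archive-notice">{notice}</div>'
    # Match <body> with any attributes, ignoring case; change first match only.
    new_html, count = re.subn(
        r"(<body\b[^>]*>)",
        lambda match: match.group(1) + banner,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    return new_html if count else html


page = "<html><body class='x'><p>Hi</p></body></html>"
flagged = add_status_banner(page)
print(flagged)
```

Running the function twice on the same page leaves it unchanged the second time, so it is safe to re-run across the whole site.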
Mothballing Toolkit
UKOLN and AHDS are developing a QA methodology
through the QA Focus work, which is funded by JISC.
We have developed an automated self-assessment toolkit
which, although aimed at JISC projects, is free to
use by all.
See <http://www.ukoln.ac.uk/qa-focus/toolkit/mothballing/>
Testing Repurposing Of Your Web Site
You may find that:
• Your Web site is repurposed by third parties
• You wish to move your Web site to another location
In order to check that repurposing can happen without
errors you should think about testing the process:
• If you have a PDA, use AvantGo (or a similar tool) to
access your Web site on another device
• Use a Web site mirroring tool (e.g. HTTrack) to copy
your Web site to your desktop PC
Such tools can:
• Show how your Web site will look on other devices
• Spot potential problems for mirroring your Web site
See <http://www.ukoln.ac.uk/qa-focus/documents/briefings/briefing-05/>
Lessons For The Future
How easy is it for you to implement mothballing
techniques?
• You may find that deploying a watermark on every
page of your Web site is time-consuming to
implement
• Any difficulties encountered with your NOF-digi
project should be noted, and lessons learnt should
be applied for future development work
• Think about preservation from the original planning
stage for a Web site
Case Study - Exploit Interactive (1)
Exploit Interactive:
• EU-funded ejournal available at
<http://www.exploit-lib.org/>
• Funded from Jan 1999 – Dec 2000
• Web site is still hosted locally
Issues:
• Should we continue hosting domain after 3 years?
• What is the cost of this (domain name registration,
disk storage, system maintenance)?
Case Study - Exploit Interactive (2)
Findings:
• Disk storage is 4 GB (a large proportion is log files)
• A 30 GB disk drive costs ~ £40
• It was decided to run an annual link check of the Web
site. Although there were broken links to external
sites, the internal links all worked.
• It was estimated that it would take about 30 minutes
/ year to run a link check and document findings.
• A policy for the ongoing provision of the Web site
was agreed
• See <http://www.ukoln.ac.uk/qa-focus/documents/case-studies/case-study-17/>
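The storage figures above give a rough cost estimate: 4 GB of storage on a 30 GB drive costing about £40. The short calculation below is a sketch which assumes the cost is simply shared in proportion to the disk space used; the function name is illustrative, and staff time, domain registration and system maintenance are not included.

```python
def storage_cost(site_gb, disk_gb, disk_price):
    """Share of the disk price attributable to the site, pro rata by size."""
    return disk_price * site_gb / disk_gb


# Figures from the case study: 4 GB site, 30 GB drive, about 40 pounds.
cost = storage_cost(4, 30, 40.0)
print(f"Approximate one-off storage cost: GBP {cost:.2f}")
```

On these assumptions the storage cost is a little over £5, which suggests that staff time and domain registration, not disk space, are the costs to watch.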
Short-Medium Term Access Strategy
• We will seek to ensure the Web site continues for at
least 10 years after the end of funding.
• We will seek to ensure that the Web site continues
to function.
• We will not fix broken links to external resources.
• We will not fix non-compliant HTML resources.
We will use the following procedures:
• We will have internal administrative procedures to
ensure that the domain name bill is paid.
• We will record disk space usage and provide an
estimate of the cost of providing disk space
• We will run a link checker annually and record the
number of broken internal links. We will keep an audit
trail to see if internal links start breaking.
• Any changes to the policy … need to be agreed by
an appropriate management group.
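The audit trail for the annual link check could be as simple as one CSV row per year. The sketch below records each check and reports whether internal links have started breaking; the dates and counts are invented for illustration, the function names are assumptions, and an in-memory buffer stands in for the log file.

```python
import csv
import io
from datetime import date


def record_check(log, check_date, internal_broken, external_broken):
    """Append one row to the audit trail: date, internal and external counts."""
    writer = csv.writer(log)
    writer.writerow([check_date.isoformat(), internal_broken, external_broken])


def internal_links_degrading(rows):
    """True if the latest check found more broken internal links than the first."""
    counts = [int(row[1]) for row in rows]
    return len(counts) >= 2 and counts[-1] > counts[0]


# Illustrative data only; a real log would be a file kept with the project records.
log = io.StringIO()
record_check(log, date(2003, 1, 15), 0, 12)
record_check(log, date(2004, 1, 15), 3, 20)

rows = list(csv.reader(io.StringIO(log.getvalue())))
print(rows)
print("Internal links degrading:", internal_links_degrading(rows))
```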
Conclusions
To conclude:
• Web sites can disappear
• They may reappear as porn sites!
• Organisations should ensure they have
procedures to ensure this does not happen
• You should develop a medium term Web site
preservation strategy
• You should test mirroring of your Web site
• You should seek to address such issues at the
planning stage of your Web site
Questions?