JCDL 2006 Doctoral Consortium


Website Reconstruction using the Web Infrastructure
Frank McCown
http://www.cs.odu.edu/~fmccown/
Doctoral Consortium, June 11, 2006
[Figure: the Web Infrastructure and an HTTP 404 response]
Cost of Preservation
[Chart: preservation approaches plotted by publisher's cost (time, equipment, knowledge) against coverage of the Web, each axis ranging from low (L) to high (H). Client-view approaches: browser cache, Furl/Spurl, iPROXY, Hanzo:web, web archives, SE caches. Server-view approaches: filesystem backups, InfoMonitor, LOCKSS, TTApache.]
Research Questions
- How much digital preservation of websites is afforded by lazy preservation?
  - Can we reconstruct entire websites from the WI?
  - What factors contribute to the success of website reconstruction?
  - Can we predict how much of a lost website can be recovered?
  - How can the WI be utilized to provide preservation of server-side components?
Prior Work
- Is website reconstruction from the WI feasible?
  - Web repositories: Google, MSN, Yahoo, Internet Archive
  - Web-repository crawler: Warrick
  - Reconstructed 24 websites
- How long do search engines keep cached content after it is removed?
Timeline of SE Resource Acquisition and Release
- Vulnerable resource: not yet cached (t_ca is not defined)
- Replicated resource: available on web server and in SE cache (t_ca < current time < t_r)
- Endangered resource: removed from web server but still cached (t_r < current time < t_cr)
- Unrecoverable resource: missing from web server and cache (t_ca < t_cr < current time)
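The four states can be read as a simple decision procedure over the three timestamps. A minimal sketch of that logic follows; the function name, the use of None for undefined times, and the boundary handling are illustrative assumptions, not part of the original timeline.

```python
from typing import Optional

def resource_state(now: float, t_ca: Optional[float],
                   t_r: Optional[float], t_cr: Optional[float]) -> str:
    """Classify a resource by its position on the SE cache timeline.

    t_ca: time the SE cache acquired the resource (None = never cached)
    t_r : time the resource was removed from the web server (None = still live)
    t_cr: time the resource was purged from the SE cache (None = still cached)
    """
    if t_ca is None or now < t_ca:
        return "vulnerable"      # not yet cached
    if t_cr is not None and t_cr <= now:
        return "unrecoverable"   # gone from both server and cache
    if t_r is not None and t_r <= now:
        return "endangered"      # off the server but still cached
    return "replicated"          # live on the server and in the cache
```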
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
How Much Did We Reconstruct?
[Diagram: a "lost" website (resources A-G) beside the reconstructed website (A, B', C', E, F). Annotations: missing link to D; points to old resource G; F can't be found.]
Reconstruction Diagram
[Pie chart: identical 50%, changed 33%, missing 17%; added 20%]
Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
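The identical/changed/missing/added categories in the reconstruction diagram amount to a set comparison between the lost site and its reconstruction. A hedged sketch, assuming each site snapshot is a mapping from URL to content bytes (not necessarily the representation Warrick actually uses):

```python
import hashlib

def classify(lost: dict, recovered: dict) -> dict:
    """Bucket each resource as identical, changed, missing, or added."""
    buckets = {"identical": [], "changed": [], "missing": [], "added": []}
    for url, content in lost.items():
        if url not in recovered:
            buckets["missing"].append(url)
        elif hashlib.sha1(content).digest() == hashlib.sha1(recovered[url]).digest():
            buckets["identical"].append(url)
        else:
            buckets["changed"].append(url)
    buckets["added"] = [url for url in recovered if url not in lost]
    return buckets
```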
Warrick Milestones
- www2006.org: first lost website reconstructed (Nov 2005)
- DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially "blesses" Warrick (mid Mar 2006)[1]

[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html
Proposed Work
- How lazy can we afford to be?
  - Find factors influencing success of website reconstruction from the WI
  - Perform search engine cache characterization
  - Inject server-side components into WI for complete website reconstruction
- Improving the Warrick crawler
  - Evaluate different crawling policies
  - Development of a web-repository API for inclusion in Warrick
Factors Influencing Website Recoverability from the WI
- Previous study did not find a statistically significant relationship between recoverability and website size or PageRank
- Methodology:
  - Sample a large number of websites from dmoz.org
  - Perform several reconstructions over time using the same policy
  - Download sites several times over time to capture change rates (see the sketch below)
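One way to capture change rates from the repeated downloads is to hash each resource across consecutive snapshots. A sketch, assuming each download is stored as a URL-to-bytes mapping:

```python
import hashlib

def change_rate(snapshots: list) -> float:
    """Fraction of persisting resources that changed between consecutive downloads."""
    changed = persisted = 0
    for before, after in zip(snapshots, snapshots[1:]):
        for url, content in before.items():
            if url in after:
                persisted += 1
                if hashlib.md5(content).digest() != hashlib.md5(after[url]).digest():
                    changed += 1
    return changed / persisted if persisted else 0.0
```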
Evaluation
- Use statistical analysis to test for the following factors:
  - Size
  - Makeup
  - Path depth
  - PageRank
  - Change rate
- Create a predictive model: how much of my lost website do I expect to get back? (see the sketch below)
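A predictive model could take many forms; as one hedged possibility, an ordinary least-squares regression over the factors above. scikit-learn and all feature values here are illustrative assumptions, not the dissertation's chosen method or data:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training rows: [size, path depth, PageRank, change rate]
# ("makeup" is categorical and would need encoding first).
X = [[100, 2, 4.0, 0.10],
     [5000, 5, 6.5, 0.40],
     [750, 3, 5.2, 0.20]]
y = [0.85, 0.40, 0.70]  # fraction recovered in past reconstructions

model = LinearRegression().fit(X, y)
print(f"Expected recovery: {model.predict([[1200, 4, 5.0, 0.30]])[0]:.0%}")
```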
SE Cache Characterization
- Web characterization is an active field; search engine caches have never been characterized
- Methodology (a skeleton is sketched below):
  - Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
  - Access cached version if present
  - Download live version from the Web
  - Examine HTTP headers and page content
  - Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
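A skeleton of the sampling loop, in Python for illustration. The live download uses only the standard library; the URL sampling against each engine's index is repository-specific and only hinted at here:

```python
import urllib.request

def live_version(url: str):
    """Download the live copy of a URL; return (headers, body)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return dict(resp.headers), resp.read()

# Hypothetical: in the real study, sampled URLs would come from random
# queries against Google, MSN, Yahoo, and Ask.
for url in ["http://example.com/"]:
    headers, body = live_version(url)
    print(url, headers.get("Content-Type"), len(body), "bytes")
```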
Evaluation
- Compute the ratio of indexed to cached
- Find types, size, age of resources
- Do HTTP Cache-Control directives 'no-cache' and 'no-store' stop resources from being cached? (see the sketch below)
- Compare the different SE caches
- How prevalent is the use of NOARCHIVE meta tags to keep HTML pages from being cached?
- How much of the Web is cached by SEs?
- What is the overlap with the Internet Archive?
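The two opt-out signals above (Cache-Control directives and the NOARCHIVE meta tag) can be checked mechanically per page. A minimal sketch; the regex for the robots meta tag is deliberately simplified and assumes reasonably well-formed HTML:

```python
import re
import urllib.request

def caching_optouts(url: str) -> dict:
    """Report whether a page asks not to be cached/archived."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        cc = resp.headers.get("Cache-Control", "").lower()
        body = resp.read().decode("utf-8", errors="replace")
    return {
        "cache-control": "no-cache" in cc or "no-store" in cc,
        "noarchive-meta": bool(
            re.search(r"<meta[^>]*noarchive", body, re.IGNORECASE)),
    }

print(caching_optouts("http://example.com/"))
```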
[Screenshot: Marshall TR Server running EPrints]
We can recover the missing page and PDF, but what about the services?
Recovery of Web Server Components
- Recovering the client-side representation is not enough to reconstruct a dynamically produced website
- How can we inject the server-side functionality into the WI?
- Web repositories like HTML:
  - Canonical versions stored by all web repos
  - Text-based
  - Comments can be inserted without changing appearance of page
Injection Techniques
- Inject entire server file into HTML comments
- Divide server file into parts and insert parts into HTML comments
- Use erasure codes to break a server file into chunks and insert the chunks into HTML comments of different pages
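A minimal sketch of the first (and simplest) technique. The comment marker is invented here for illustration, and base64 is an assumption to keep server code from colliding with HTML comment syntax:

```python
import base64

def inject(filename: str, html: str) -> str:
    """Embed one server-side file in a page as an HTML comment."""
    with open(filename, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    # "server-file" is a made-up marker; base64 avoids "--" inside the comment.
    comment = f"<!-- server-file:{filename}:{payload} -->"
    return html.replace("</body>", comment + "\n</body>")
```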
Recover Server File from WI
[Diagram omitted]
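Recovery is then the inverse: scan the HTML pages returned by the web repositories for the injected comments. A sketch matching the marker assumed in the previous example:

```python
import base64
import re

MARKER = re.compile(r"<!-- server-file:(.+?):([A-Za-z0-9+/=]+) -->")

def recover(pages: list) -> dict:
    """Rebuild server-side files from comments found in recovered pages."""
    files = {}
    for html in pages:
        for name, payload in MARKER.findall(html):
            files[name] = base64.b64decode(payload)
    return files
```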
Evaluation
- Find the most efficient values for n and r (chunks created/recovered)
- Security:
  - Develop simple mechanism for selecting files that can be injected into the WI
  - Address encryption issues
- Reconstruct an EPrints website with a few hundred resources
Recent Work
- URL canonicalization
- Crawling policies:
  - Naïve policy
  - Knowledgeable policy
  - Exhaustive policy
- Reconstructed 24 websites with each policy
- Found that exhaustive and knowledgeable are significantly more efficient at recovering websites

[Histogram: frequency of reconstructions per efficiency ratio bin (0.1-0.9) for the naïve, knowledgeable, and exhaustive policies]

Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006, to appear.
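The histogram bins reconstructions by an efficiency ratio. Assuming (my reading, not a definition given on the slide) that the ratio is resources recovered per repository request issued, the binning could look like:

```python
from collections import Counter

def efficiency_ratio(recovered: int, requests: int) -> float:
    """Assumed definition: resources recovered per repository request."""
    return recovered / requests if requests else 0.0

# Hypothetical (recovered, requests) counts for reconstructed sites.
sites = [(40, 90), (12, 20), (75, 150)]
bins = Counter(round(efficiency_ratio(r, q), 1) for r, q in sites)
for b in sorted(bins):
    print(f"{b:.1f}: {bins[b]} site(s)")
```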
Warrick API
- API should provide a clear and flexible interface for web repositories
- Goals:
  - Shield Warrick from changes to WI
  - Facilitate inclusion of new web repositories
  - Minimize implementation and maintenance costs
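One way to realize these goals is a small abstract interface that each repository implements. A Python illustration (regardless of Warrick's actual implementation language), with method names invented here:

```python
from abc import ABC, abstractmethod
from typing import Optional

class WebRepository(ABC):
    """Contract a web-repository plugin fulfills for the crawler."""

    @abstractmethod
    def holds(self, url: str) -> bool:
        """True if the repository has a stored copy of url."""

    @abstractmethod
    def retrieve(self, url: str) -> Optional[bytes]:
        """Return the stored copy, or None if it cannot be fetched."""

# Adding a new repository means subclassing WebRepository; the crawler
# itself need not change when a repository's query interface does.
```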
Evaluation
- Internet Archive has endorsed use of Warrick
- Make Warrick available on SourceForge
- Measure the community adoption & modification
Risks and Threats
- Time for enough resources to be cached
- Search engine caching behavior may change at any time
- Repository antagonism:
  - Spam
  - Cloaking
Timeline
[Timetable omitted]
Summary
When this work is completed, I will have...
- demonstrated and evaluated the lazy preservation technique
- provided a reference implementation
- characterized SE caching behavior
- provided a layer of abstraction on top of SE behavior (API)
- explored how much we store in the WI (server-side vs. client-side representations)
Thank You
Questions?