Web Archiving: Why? How? Why?

Erik Hetzner


What is web archiving?

PLOS 2003


PLOS 2004


PLOS 2006


PLOS 2007


PLOS 2011


PLOS 2014


PLOS 2015


Not just screenshots

  • Web archiving allows you to browse the web, as it was


  • The web is now part of the historical & scientific record



  • J Clerk Maxwell (1865) A Dynamical Theory of the Electromagnetic Field. Philosophical Transactions of the Royal Society of London 155: 459-512. 10.1098/rstl.1865.0008



  • PR Smeesters, A Vergison, D Campos, E de Aguiar, VYM Deyi, L Van Melderen (2006) Differences between Belgian and Brazilian Group A Streptococcus Epidemiologic Landscape. PLOS One 1(1): e10. 10.1371/journal.pone.0000010





Link rot

  • “50% of the URLs within U.S. Supreme Court opinions suffer reference rot”
  • JL Zittrain, K Albert, L Lessig (2013) Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Harvard Public Law Working Paper No. 13-42. 10.2139/ssrn.2329161

Content changing

  • White House web documents listing the “Coalition of the Willing” changed, but dates remained the same
  • “The text of three of these five documents was altered at some point after their initial release, even though in most cases the documents still retained their original release dates and were presented as unaltered originals. These alterations to the public record changed the apparent number of countries making up the coalition, as well as the names of countries in the coalition.”
  • S Althaus, K Leetaru. Airbrushing History, American Style. http://www.clinecenter.illinois.edu/research/affiliated/airbrush/

But how can that work?

  • It’s actually pretty simple.
    • Crawl the web
    • Replay the crawls


  • Largely standard web crawling


  • Initialize queue of URLs with seeds
  • For each URL in queue
    • Download URL and save the response
    • Extract all the other URLs and add them to the queue
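
The loop above can be sketched in a few lines of Python. This is only an illustration, not the crawler any archive actually runs: the `crawl` function, the `fetch` callback, and the `limit` parameter are hypothetical names, and a real crawler would also handle politeness delays, robots.txt, and scoping rules.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect href/src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(seeds, fetch, limit=100):
    """Breadth-first crawl; fetch(url) returns the body as text, or None."""
    queue = deque(seeds)          # initialize queue of URLs with seeds
    seen = set(seeds)
    saved = {}
    while queue and len(saved) < limit:
        url = queue.popleft()
        body = fetch(url)         # download URL
        if body is None:
            continue
        saved[url] = body         # save the response
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:  # extract the other URLs, add to queue
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved


# A two-page in-memory "site" stands in for the network (hypothetical data).
SITE = {
    "http://example.org/": '<a href="/a.html">A</a><img src="logo.png">',
    "http://example.org/a.html": '<a href="/">home</a>',
}
archive = crawl(["http://example.org/"], SITE.get)
```

Passing `fetch` in as a parameter keeps the queue logic separate from the HTTP client, so the same loop can be tried against a dictionary, as here, or a real downloader.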


  • Store the response (and sometimes the request):
HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:28 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "34dc-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg

FF D8 FF E0 00 10 4A



  • You can see how replay would work, if all URLs (for CSS, JS, images, etc.) are rewritten to point into the archive
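
As a rough sketch of such rewriting, assuming a hypothetical replay URL layout of `/replay/<timestamp>/<original URL>` and handling only quoted `href` and `src` attributes (real replay tools also have to deal with CSS, JavaScript, and many other cases):

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." with either quote style.
ATTR = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def rewrite(html, page_url, timestamp, prefix="/replay"):
    """Point every href/src at the archived copy instead of the live web."""

    def repl(match):
        # Resolve relative URLs against the page they were found on.
        absolute = urljoin(page_url, match.group("url"))
        quote = match.group("q")
        return f'{match.group("attr")}={quote}{prefix}/{timestamp}/{absolute}{quote}'

    return ATTR.sub(repl, html)


page = '<img src="logo.png"><a href="http://example.org/b">b</a>'
rewritten = rewrite(page, "http://example.org/a/", "20080430204826")
```

Any URL this pattern misses is still fetched from the live site during replay.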


  • Concatenate the results into ~100MB files
http://www.archive.org/index.php 20080430204826 text/html 29000
HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:25 GMT
http://www.archive.org/flv/flv.js?v=1.34 20080430204833 application/x-javascript 16969
HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:32 GMT
  • Compress the results
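
One way to keep compressed records individually readable is to compress each record as its own gzip member and note where it starts. The sketch below assumes that approach and a simplified record layout (one header line, then the payload); `append_record` and `read_record` are hypothetical helper names, not part of any archive format library.

```python
import gzip
import io
import zlib


def append_record(out, url, timestamp, mime, payload):
    """Append one record as its own gzip member; return its byte offset."""
    offset = out.tell()
    header = f"{url} {timestamp} {mime} {len(payload)}\n".encode("utf-8")
    out.write(gzip.compress(header + payload + b"\n"))
    return offset


def read_record(f, offset, chunk_size=4096):
    """Decompress only the gzip member that starts at `offset`."""
    f.seek(offset)
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    while not decomp.eof:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        parts.append(decomp.decompress(chunk))
    return b"".join(parts)


buf = io.BytesIO()  # stands in for a ~100MB file on disk
first = append_record(buf, "http://www.archive.org/index.php",
                      "20080430204826", "text/html", b"<html></html>")
second = append_record(buf, "http://www.archive.org/flv/flv.js?v=1.34",
                       "20080430204833", "application/x-javascript", b"var x;")
```

Because each member is self-contained, `read_record(buf, second)` returns the second record without decompressing the first.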


  • Store these files on a bunch of disks


  • Index the files with the URL (in sortable order), the timestamp, the filename it is stored in, and the byte-offset into the file.
org,archive)/index.php 20080430204826 file1.arc.gz 8643 …
org,archive)/flv/flv.js?v=1.34 20080430204833 file2.arc.gz 5130 …
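
A small sketch of how such an index might be built and searched. The key function is a simplified guess at the canonicalization visible in the sample lines (drop a leading `www.`, reverse the host labels); `sort_key` and `lookup` are hypothetical names, and the in-memory list stands in for a sorted index file.

```python
import bisect
from urllib.parse import urlsplit


def sort_key(url):
    """Turn http://www.archive.org/index.php into org,archive)/index.php."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def lookup(index, url, timestamp):
    """Return (filename, offset) of the latest capture at or before timestamp."""
    key = sort_key(url)
    # The sentinel sorts after any filename, so captures taken exactly at
    # `timestamp` fall to the left of the insertion point.
    probe = (key, timestamp, chr(0x10FFFF))
    pos = bisect.bisect_right(index, probe)
    if pos and index[pos - 1][0] == key:
        return index[pos - 1][2], index[pos - 1][3]
    return None


index = sorted([
    (sort_key("http://www.archive.org/index.php"),
     "20080430204826", "file1.arc.gz", 8643),
    (sort_key("http://www.archive.org/flv/flv.js?v=1.34"),
     "20080430204833", "file2.arc.gz", 5130),
])
hit = lookup(index, "http://archive.org/index.php", "20100101000000")
```

Reversing the host puts all captures from one domain next to each other in sort order, and a binary search then finds the capture closest to a requested time.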

Tricky engineering


  • Storage
    • Need to store data in the petabyte range
    • Deduplication
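
One common way to deduplicate is to key stored payloads by a content hash, so a page that has not changed between crawls is stored only once. The sketch below assumes that approach; `DedupStore` is a hypothetical class, and in-memory dictionaries stand in for real storage.

```python
import hashlib


class DedupStore:
    """Store each distinct payload once; record every capture separately."""

    def __init__(self):
        self.payloads = {}  # digest -> payload bytes, stored once
        self.captures = []  # (url, timestamp, digest), one per capture

    def add(self, url, timestamp, payload):
        """Record a capture; return True if the payload was new."""
        digest = hashlib.sha1(payload).hexdigest()
        is_new = digest not in self.payloads
        if is_new:
            self.payloads[digest] = payload
        self.captures.append((url, timestamp, digest))
        return is_new


store = DedupStore()
store.add("http://example.org/", "20080430204826", b"<html>same</html>")
store.add("http://example.org/", "20080507204826", b"<html>same</html>")
```

After the two calls, `store.captures` has two entries but `store.payloads` holds a single copy of the page.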


  • “Leaks” into the live web
    • Rewriting all URLs is hard
    • JavaScript makes it harder

Full-text indexing

  • Full-text indexing
    • So much content
    • What is relevant when you have many copies of the same page?

Brief history

  • Wayback Machine (Internet Archive) began in 1996.
    • Now contains 9PB of data, growing by 20TB/week.
  • Later, other organizations got involved.
    • Many countries have “legal deposit” laws. These laws require publishers to deposit items in a library.
    • These countries have often decided that these laws apply to the national internet
    • Examples: France, Iceland, Denmark, Norway, Portugal


  • Section 108
    • Section 108 of the Copyright Act provides limitations on exclusive rights for libraries and archives
    • Section 108 Study Group has determined that libraries and archives have the right to preserve web content
    • Opt out
    • Government entities, political parties and campaigns cannot opt out
  • Section 108 Study Group (2008) The Section 108 Study Group report.


  • The same architectural principles that make the scalable, proxyable web possible make web archiving possible
  • Think of web archiving as “proxying” the web, with time travel


  • Stateless
    • No client content stored on the server
  • This means that we need not emulate a server in order to recreate the user experience


  • Code on demand
    • Client-side code (JavaScript)
  • Allows us to archive client behavior


  • Identification of resources
    • Resources have globally unique identifiers
  • We needn’t keep track of whether this is document “1” from site A or document “1” from site B


  • Self-descriptive messages
    • Every server request-response is self-sufficient
  • We do not need to archive context


  • Hypermedia as the engine of application state
    • Each document contains within itself “links” to next client state
  • For each page, we have the globally unique next states for the client
  • RT Fielding (2000) Architectural styles and the design of network-based software architectures. UCI.


  • Stateful
    • Client state stored on server
  • All code on server
  • No identification of resource
    • Navigate to locations via interaction
  • Context dependent messages
    • Every interaction depends on all previous interactions

Next steps



Social media

  • Twitter, etc.
    • Often customized for the user
    • Highly dynamic


  • Integrating the concept of “time” into web requests
  • Works with Wikipedia, Internet Archive, etc.
  • http://mementoweb.org/
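
In the Memento protocol, the client states the desired time in an `Accept-Datetime` request header, formatted as an HTTP date. A minimal sketch of building that header from a 14-digit archive timestamp (the helper name is hypothetical, and no request is actually sent):

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(timestamp):
    """Build an Accept-Datetime header from a YYYYMMDDhhmmss timestamp."""
    moment = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    moment = moment.replace(tzinfo=timezone.utc)  # archive timestamps are UTC
    return {"Accept-Datetime": format_datetime(moment, usegmt=True)}


headers = accept_datetime_header("20080430204826")
```

A server that supports Memento can then answer with, or redirect to, the capture closest to that time.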

Catching up with the changing web

  • HTML5
  • Single page applications
  • etc.

Scholarly archiving


  • Work with http://perma.cc/ to ensure that all web references in our articles are preserved