Web Archiving: Why? How? Why?

Erik Hetzner

ehetzner@plos.org

What is web archiving?

PLOS 2003

plos-20030404.png

PLOS 2004

plos-20040405.png

PLOS 2006

plos-20060202.png

PLOS 2007

plos-20070406.png

PLOS 2011

plos-20111101.png

PLOS 2014

plos-20140102.png

PLOS 2015

plos-20151205.png

Not just screenshots

  • Web archiving allows you to browse the web as it was

Why?

  • The web is now part of the historical & scientific record

Why?

maxwell.png

  • J Clerk Maxwell (1865) A Dynamical Theory of the Electromagnetic Field Philosophical Transactions of the Royal Society of London 155: 459-512. 10.1098/rstl.1865.0008

Why?

web-protocol.png

  • PR Smeesters, A Vergison, D Campos, E de Aguiar, VYM Deyi, L Van Melderen (2006) Differences between Belgian and Brazilian Group A Streptococcus Epidemiologic Landscape. PLOS One 1(1): e10. 10.1371/journal.pone.0000010

Why?

404.png

Why?

cdc-archive.png

Link rot

  • “50% of the URLs within U.S. Supreme Court opinions suffer reference rot”
  • JL Zittrain, K Albert, L Lessig (2013) Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations Harvard Public Law Working Paper No. 13-42. 10.2139/ssrn.2329161

Content changing

  • White House web documents listing the “Coalition of the Willing” changed, but dates remained the same
  • “The text of three of these five documents was altered at some point after their initial release, even though in most cases the documents still retained their original release dates and were presented as unaltered originals. These alterations to the public record changed the apparent number of countries making up the coalition, as well as the names of countries in the coalition.”
  • S Althaus, K Leetaru Airbrushing History, American Style. http://www.clinecenter.illinois.edu/research/affiliated/airbrush/

But how can that work?

  • It’s actually pretty simple.
    • Crawl the web
    • Replay the crawls

Crawling

  • Largely standard web crawling

Procedure

  • Initialize queue of URLs with seeds
  • For each URL in queue
    • Download URL and save the response
    • Extract all the other URLs and add them to the queue
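
The procedure above can be sketched in a few lines of Python. This is only an illustration, not how any particular archive crawler is written: the `fetch` callable, the `max_pages` limit, and the `LinkExtractor` helper are assumptions made for the example, and a real crawler would also handle politeness delays, robots.txt, and non-HTML content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collect href/src attribute values from HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Crawl outward from the seeds and return {url: response body}.

    `fetch(url)` is an assumed callable that returns the body as text,
    or None if the download failed.
    """
    queue = deque(seeds)          # initialize queue of URLs with seeds
    seen = set(seeds)             # never queue the same URL twice
    saved = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        body = fetch(url)         # download URL ...
        if body is None:
            continue
        saved[url] = body         # ... and save the response
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links: # extract the other URLs, add to queue
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved
```

Passing `fetch` in as a parameter keeps the queue logic separate from the network code, so the loop can be tried against a small in-memory dictionary of pages.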

Storing

  • Store the response (and sometimes the request):
HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:28 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "34dc-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg

FF D8 FF E0 00 10 4A
…

Replay

  • You can see how this would work, if all URLs (for CSS, JS, images, etc.) are rewritten
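
As a rough sketch of that rewriting step, the function below points every `href` and `src` attribute in a page at an archive replay URL. The replay prefix `https://archive.example/replay` and the `/<timestamp>/<original URL>` layout are assumptions made for the example, not the scheme of any specific archive.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src='...' attributes (a deliberately simple pattern).
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<quote>["'])(?P<url>.*?)(?P=quote)''',
    re.IGNORECASE,
)


def rewrite_html(html, page_url, timestamp,
                 prefix="https://archive.example/replay"):
    """Rewrite href/src URLs so they resolve inside the archive.

    `prefix` is a hypothetical replay endpoint; `timestamp` is the
    14-digit capture time of the page being replayed.
    """
    def replace(match):
        # Resolve relative links against the original page URL first.
        absolute = urljoin(page_url, match.group("url"))
        quote = match.group("quote")
        return f'{match.group("attr")}={quote}{prefix}/{timestamp}/{absolute}{quote}'

    return ATTR_RE.sub(replace, html)
```

A pattern like this only catches URLs written directly in attributes. URLs inside CSS or built at run time by scripts are not touched by it.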

Storage

  • Concatenate the results into ~100MB files
http://www.archive.org/index.php 207.241.229.39 20080430204826 text/html 29000
HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:25 GMT
[…]
http://www.archive.org/flv/flv.js?v=1.34 207.241.229.39 20080430204833 application/x-javascript 16969
HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:32 GMT
[…]
  • Compress the results
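
A minimal sketch of concatenating and compressing records follows. It assumes each record is compressed as its own gzip member, so that a single record can later be read back from its byte offset without decompressing the whole file. The header line is modeled loosely on the sample lines above; the function names are made up for the example.

```python
import gzip
import io


def append_record(out, url, ip, timestamp, content_type, payload):
    """Append one record to the open binary file `out`.

    Returns (offset, length) of the compressed record within the file.
    """
    header = f"{url} {ip} {timestamp} {content_type} {len(payload)}\n"
    # Each record becomes an independent gzip member.
    compressed = gzip.compress(header.encode("utf-8") + payload + b"\n")
    offset = out.tell()
    out.write(compressed)
    return offset, len(compressed)


def read_record(data, offset, length):
    """Decompress a single record given its offset and compressed length."""
    raw = gzip.decompress(data[offset:offset + length])
    header, _, rest = raw.partition(b"\n")
    return header.decode("utf-8"), rest[:-1]  # drop the trailing newline


# Usage: write two records into an in-memory "file", read the second back.
archive = io.BytesIO()
append_record(archive, "http://www.archive.org/index.php", "207.241.229.39",
              "20080430204826", "text/html", b"HTTP/1.1 200 OK\r\n\r\n<html/>")
second = append_record(archive, "http://www.archive.org/flv/flv.js?v=1.34",
                       "207.241.229.39", "20080430204833",
                       "application/x-javascript", b"HTTP/1.1 200 OK\r\n\r\njs")
header, payload = read_record(archive.getvalue(), *second)
```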

Storage

  • Store these files on a bunch of disks

Storage

  • Index the files with the URL (in sortable order), the timestamp, the filename it is stored in, and the byte-offset into the file.
…
org,archive)/index.php 20080430204826 file1.arc.gz 8643 …
org,archive)/flv/flv.js?v=1.34 20080430204833 file2.arc.gz 5130 …
…
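
The index lines above can be approximated as follows. The key function reverses the host name so that all pages of one site sort together, and the lookup uses binary search to find the latest capture at or before a requested time. The exact canonicalization rules (here: lower-case host, drop a leading "www.") are assumptions for the sketch, and a real index would live in sorted files on disk rather than a Python list.

```python
import bisect
from urllib.parse import urlsplit


def sort_key(url):
    """Turn a URL into a sortable key such as 'org,archive)/index.php'."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def lookup(index, url, timestamp):
    """Find the latest capture of `url` at or before `timestamp`.

    `index` is a sorted list of (key, timestamp, filename, offset) tuples.
    Returns the matching tuple, or None if there is no such capture.
    """
    key = sort_key(url)
    # Built on every call for brevity; a real index would keep this sorted.
    keys = [(entry[0], entry[1]) for entry in index]
    pos = bisect.bisect_right(keys, (key, timestamp))
    if pos == 0:
        return None
    entry = index[pos - 1]
    return entry if entry[0] == key else None


index = sorted([
    ("org,archive)/index.php", "20080430204826", "file1.arc.gz", 8643),
    ("org,archive)/flv/flv.js?v=1.34", "20080430204833", "file2.arc.gz", 5130),
])
hit = lookup(index, "http://www.archive.org/index.php", "20090101000000")
```

Because timestamps are fixed-width digit strings, plain string comparison orders them chronologically, which is what lets one sorted list serve both "which URL" and "which time" queries.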

Tricky engineering

Storage

  • Storage
    • Need to store data in the petabyte range
    • Deduplication
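
One common way to deduplicate is to hash each response body and store identical bodies only once. The sketch below assumes SHA-1 digests and an in-memory store; the `DedupStore` class is invented for the example and is not taken from any archive's code.

```python
import hashlib


class DedupStore:
    """Hypothetical store that keeps one copy of each distinct payload."""

    def __init__(self):
        self.payloads = {}   # digest -> payload bytes, stored once
        self.captures = []   # (url, timestamp, digest) for every capture

    def add(self, url, timestamp, payload):
        """Record a capture; return True if the payload was new."""
        digest = hashlib.sha1(payload).hexdigest()
        is_new = digest not in self.payloads
        if is_new:
            self.payloads[digest] = payload
        # Every capture is recorded, even when the bytes are a repeat.
        self.captures.append((url, timestamp, digest))
        return is_new

    def get(self, url, timestamp):
        """Return the payload captured for (url, timestamp), or None."""
        for captured_url, captured_at, digest in self.captures:
            if (captured_url, captured_at) == (url, timestamp):
                return self.payloads[digest]
        return None
```

An unchanged page crawled every week then costs one small capture entry per crawl instead of a full copy of the page.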

Leakage

  • “Leaks” into the live web
    • Rewriting all URLs is hard
    • JavaScript makes it harder

Full-text indexing

  • Full-text indexing
    • So much content
    • What is relevant when you have many copies of the same page?

Brief history

  • Wayback Machine (Internet Archive) began in 1996.
    • Now contains 9PB of data, growing by 20TB/week.
  • Later, other organizations got involved.
    • Many countries have “legal deposit” laws. These laws require publishers to deposit items in a library.
    • These countries have often decided that these laws apply to the national internet
    • Examples: France, Iceland, Denmark, Norway, Portugal

Copyright

  • Section 108
    • Section 108 of the Copyright Act provides limitations on exclusive rights for libraries and archives
    • Section 108 Study Group has determined that libraries and archives have the right to preserve web content
    • Opt out
    • Government entities, political parties and campaigns cannot opt out
  • Section 108 Study Group (2008) The Section 108 Study Group report.

Architecture

  • The same architectural principles that make the scalable, proxyable web possible make web archiving possible
  • Think of web archiving as “proxy”ing the web, with time travel

REST

  • Stateless
    • No client context stored on the server
  • This means that we need not emulate a server in order to recreate the user experience

REST

  • Code on demand
    • Client-side code (JavaScript)
  • Allows us to archive client behavior

REST

  • Identification of resources
    • Resources have globally unique identifiers
  • We needn’t keep track of whether this is document “1” from site A or document “1” from site B

REST

  • Self-descriptive messages
    • Every server request-response is self-sufficient
  • We do not need to archive context

REST

  • Hypermedia as the engine of application state
    • Each document contains within itself “links” to next client state
  • For each page, we have the globally unique next states for the client
  • RT Fielding (2000) Architectural styles and the design of network-based software architectures. UCI.

BBS

  • Stateful
    • Client state stored on server
  • All code on server
  • No identification of resources
    • Navigate to locations via interaction
  • Context dependent messages
    • Every interaction depends on all previous interactions

Next steps

Emulation

oldweb-today.png

Social media

  • Twitter, etc.
    • Often customized for the user
    • Highly dynamic

Memento

  • Integrating the concept of “time” into web requests
  • Works with Wikipedia, Internet Archive, etc.
  • http://mementoweb.org/
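
Memento (RFC 7089) expresses the desired time with an `Accept-Datetime` request header carrying an HTTP-style date. The sketch below only builds that header; the helper name is an assumption for the example, and sending the request to a Memento-aware server is left out.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when):
    """Build request headers asking for a resource as of `when`.

    `when` must be a timezone-aware datetime; it is converted to GMT
    because HTTP dates are always expressed in GMT.
    """
    value = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": value}


headers = accept_datetime_headers(
    datetime(2008, 4, 30, 20, 48, 26, tzinfo=timezone.utc)
)
```

These headers could then be attached to an ordinary HTTP GET, which is what lets "time" ride along with a normal web request.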

Catching up with the changing web

  • HTML5
  • Single page applications
  • etc.

Scholarly archiving

PLOS

  • Work with http://perma.cc/ to ensure that all web references in our articles are preserved