
Finding and Using Digital Archives: Web archives

This guide covers how to find digitised and digital archives material and how to critically examine what you find.

Introduction

Web archives are treated separately from other born-digital archives in this guide. In some ways, websites are more similar to books than to archives, as they are published as discrete entities. However, just as archival documents refer to other documents in their archive, so websites link to other websites, and this interconnection is crucial to how they are captured for preservation.

This page is, by necessity, a very brief overview of using and understanding web archives. For a more thorough introduction to web archives as sources, we recommend History in the Age of Abundance: How the Web is Transforming Historical Research by Ian Milligan, which we have used extensively in the writing of this guide.

Why are web archives different?

To understand why web archives are different, let's first consider a document in a paper archive.

On the right here, you can see a programme for a Sports Day held in 1920. The programme includes multiple fonts and an illustrative design around the edge. These elements are all fixed to the page and so will appear the same no matter who takes it out of its box and when. The typefaces look the same today as they did in 1920, although the edges of the paper are a little more scruffy than they would have been when it first came back from the printers. The content has not changed in the 100 years since the document was produced, and neither has the experience of using it - if you come to the archive to view this document, you will hold it in your hand and read it in exactly the same way as you would have done in 1920. Your experience is not mediated through any technology - you don't need to know how to make paper in order to use the document.

A web page, however, is not a single static document but a composite of text and images, with instructions about how to put these different elements together. It will appear differently depending on the size of the screen being used to view it - such as a laptop or a mobile - and may even look different depending on the web browser that is used.
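
To see what that composition looks like in practice, the minimal Python sketch below (standard library only) reads a locally saved copy of a page - the file name saved_page.html is just a placeholder - and lists the separate image, stylesheet and script files that the page instructs a browser to fetch. Every one of those files would also need to be captured for the page to be preserved in full.

```python
# A minimal sketch: list the separate files a single web page depends on.
# The file name 'saved_page.html' is a placeholder for a locally saved copy.
from html.parser import HTMLParser

class ResourceLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.resources.append(("image", attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            self.resources.append(("stylesheet", attrs.get("href")))
        elif tag == "script" and "src" in attrs:
            self.resources.append(("script", attrs["src"]))

parser = ResourceLister()
with open("saved_page.html", encoding="utf-8") as f:
    parser.feed(f.read())

for kind, url in parser.resources:
    print(f"{kind}: {url}")
```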

When we preserve a web page there are three different elements we have to consider preserving - the actual information contained in the web page (the content); how it looked to people at the time (the experience); and the underlying code (the technology). All of these elements could be of interest to different types of researchers.

The very first webpage on the internet was http://info.cern.ch. Although it had changed over the years, in 2013 the website was resurrected as a historical document and can now be browsed in two different ways: firstly, as a modern website, allowing you to click the links to navigate around it; secondly, through an emulator that gives the website the same appearance it would have had in 1991 and which requires you to use numbers, rather than links, to navigate through the hierarchy. However, neither of these versions uses the original code of the website, which is preserved separately as it would not be able to run on a modern operating system.

This intensive approach is not sustainable for every website, but it will give you an idea of what to consider when you are using web archives - which element of the preserved webpage are you interested in as a researcher?

 

A traditional archival document

1920 Sports Day programme

Timeline

  • 1991 - first webpage created
  • 1996 - Internet Archive is founded
  • 2001 - Wayback Machine is launched to view the Internet Archive
  • 2013 - UK Web Archive gains the right to collect any website with a .uk address

Collecting web archives

Websites can be collected for preservation either automatically, by crawlers, or manually, by an individual making a specific capture.

Crawlers

Web archiving crawlers start from one website and follow its links to other websites, moving across the internet. They may have restrictions in place by geographic area (for example, the UK Web Archive only collects websites with a .uk address) or instructions to follow only a certain number of links from each initial page. This means that they may record different pages within a website on different days, or even different components of the same page at different times.
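
The sketch below illustrates the principle in a very simplified form: start from a seed page, record it, then follow links outwards up to a fixed number of hops, only visiting hosts that end in a given suffix. Real archiving crawlers are far more sophisticated, so treat this purely as an illustration; the seed URL, hop limit and .uk restriction here are placeholder assumptions.

```python
# A simplified illustration of a web-archiving crawl (not the code used by
# any real archive): record a seed page, then follow links up to a fixed
# number of hops, only visiting hosts that end in a given suffix.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed, max_hops=2, domain_suffix=".uk"):
    seen = set()
    frontier = [(seed, 0)]          # (url, number of hops from the seed)
    while frontier:
        url, hops = frontier.pop(0)
        host = urlparse(url).hostname or ""
        if url in seen or hops > max_hops or not host.endswith(domain_suffix):
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue
        print(f"captured {url} (hop {hops})")  # a real crawler would write a WARC record here
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            frontier.append((urljoin(url, href), hops + 1))
    return seen

# crawl("https://www.example.ac.uk/", max_hops=1)   # hypothetical seed URL
```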

The easiest way to demonstrate this is with weather websites, as they are frequently updated. We can take bbc.co.uk/weather as an example. This page was captured by the Internet Archive on 13 November 2008, and the archived page includes a link to the 'Full 5 day forecast for London, United Kingdom'. You would therefore hope that clicking on the link would show you the 5 day forecast for London on 13 November 2008. However, it actually takes you to a capture of the London weather forecast for 10 November 2008 - three days earlier - which was the most recent time that the crawler had visited that page. The next date on which the web crawler visited the 5 day forecast for London page is 4 December 2008.

When browsing an archived website on the Wayback Machine or other platforms, it's important to be aware that you may be viewing a website that never existed in the form that you are viewing it. Pay attention to the dates of captures, especially if the captures were around a time when information was likely to have been updated (e.g. political websites around the date of an election).
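
One way to pay attention to those dates is to check them programmatically. The Internet Archive exposes a CDX API that lists every capture timestamp it holds for a URL; the endpoint and parameter names below reflect that public API as we understand it, not anything described in this guide, so consult the Internet Archive's documentation if the query does not behave as expected.

```python
# List the dates on which the Wayback Machine captured a page, using the
# Internet Archive's CDX API (parameter names as we understand them - check
# the current API documentation before relying on this sketch).
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def list_captures(page_url, year="2008"):
    params = urlencode({
        "url": page_url,
        "from": year,
        "to": year,
        "output": "json",
        "fl": "timestamp",
        "limit": "50",
    })
    with urlopen(f"https://web.archive.org/cdx/search/cdx?{params}") as resp:
        rows = json.load(resp)
    # The first row is a header; each remaining row is one capture.
    return [row[0] for row in rows[1:]]

for ts in list_captures("bbc.co.uk/weather"):
    print(ts)   # e.g. 20081113... - compare these dates before following links
```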

Just as Google's search results are ranked by the popularity of a website and the number of links to it, so web crawls have an inherent bias towards documenting the most popular websites, as they are the ones that are most likely to be found by following links from other sites. You will therefore find that corporate websites are more likely to be well-documented than an amateur hobby website, which may have only been captured infrequently, if at all.

Manual capture

Institutions may also choose to perform more manual, targeted captures of websites. These use similar underlying technology to the web crawlers, but the parameters are more tightly controlled. Such captures can document the institution's own history (the University Archive captures the University website for this purpose) or build thematic collections in a manner similar to purchasing books for a library.

For example, the Library of Congress has maintained an archive of its own websites since 2016, as well as thematic collections relating to international elections and political administrations, such as the African Government Web Archive.

Example Web Archives

Internet Archive/Wayback Machine

The Internet Archive is the most well-known of the web archives. The Wayback Machine is the portal for accessing its captures.

Archive-It

As well as running its own web crawlers, the Internet Archive offers the subscription-based Archive-It service, which enables institutions to make thematic collections of archived websites. The webpages are added to the Wayback Machine, but the collections can also be browsed on the Archive-It site.

The UK Web Archive

The UK Web Archive aims to crawl all websites with a .uk address at least once a year and also hosts thematic collections on subjects such as Black and Asian Britain, Mental Health, and the Smoking Ban 2007. It has had a legal right to collect websites since 2013, but not all websites in the UK Web Archive are viewable remotely - unless copyright permission has been given by the owner, they are only viewable on British Library premises. To see what is available remotely, use the search bar and then ensure the 'Viewable Online' box is ticked. You can find out more about using the service remotely in the UK Web Archive's blogpost from 30 March 2020.

Web archives as Big Data

Currently, most archived websites are accessed through replay interfaces like the Wayback Machine, which allow you to look at a website as if it were live on the internet today. However, there are other ways of using this data.

Instead of looking at one website to see what information it displayed on a certain date, or how it changed incrementally over time, we can take a Big Data approach and use the archives to look at patterns or trends.

The UK Web Archive has developed a prototype search engine called Shine which allows you to do this with their own dataset. Alongside a search engine that retrieves web pages matching the search terms, Shine also has a 'trends' functionality. This allows you to see how frequently a word or phrase is mentioned within the dataset, which covers 1996-2013. For example, if we search for "millennium bug" (the belief that computers would not be able to handle the date change to 1 January 2000, causing widespread chaos), we can see, as we would expect, a steady climb in frequency until 1999 and then a slow decline afterwards. However, the parallel term "Y2K bug", which refers to exactly the same event, has a different trajectory and seems to peak halfway through 2000. Using the Trends search can therefore give us an idea of when an event rose into public consciousness, without ever needing to look at individual webpages, as well as helping us to determine which search terms might be most useful when searching other sources, such as newspaper archives.
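
Shine itself is a web interface, but the idea behind a trends search is straightforward to express in code: group the documents in a dataset by year and count how many mention a term. The sketch below does this over a toy list of dated records - it illustrates the principle only, and is not how Shine is actually implemented.

```python
# An illustration of the idea behind a 'trends' search: count how many
# documents per year mention a term. The records list is a toy stand-in
# for an archived-web dataset; Shine runs this kind of query at scale.
from collections import Counter

records = [
    ("1998-07-02", "Preparing your office for the millennium bug"),
    ("1999-11-15", "Millennium bug: is your PC ready?"),
    ("2000-03-09", "Why the Y2K bug never bit"),
]

def trend(term, records):
    counts = Counter()
    for date, text in records:
        year = date[:4]
        counts[year] += term.lower() in text.lower()
    return dict(sorted(counts.items()))

print(trend("millennium bug", records))   # {'1998': 1, '1999': 1, '2000': 0}
```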

Further reading

The Archives Unleashed project is developing a toolkit for analysing web archives, using a digital humanities/Big Data approach.

Documenting the Now builds tools to help activists and researchers preserve and work with social media data in an ethical way.

The UK Web Archive blog details some of their recent collecting activities and events.

The blog of the Web Science and Digital Libraries Research Group at Old Dominion University reports on new methods for interacting with and visualising web archive data.