What is Web Archiving?
Web archiving is the process of capturing the content of pages on the World Wide Web for long-term preservation. Because the web is so large, archivists traditionally rely on “web crawlers,” small programs that browse the web automatically and save or index what they find. Besides web crawling, other methods include database archiving, which involves extracting database content into a standard format such as XML, and transactional archiving, which involves capturing the interactions between a web client and a server. The standard file format for web preservation is WARC (Web ARChive), published as ISO 28500.
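To make the WARC format a little more concrete, here is a minimal sketch of how a single captured page might be serialized as one WARC/1.0 "resource" record, using only Python's standard library. The function name `build_warc_resource_record` and the example URL are assumptions made for illustration. Real archives typically store full request and response records, often compressed, and are usually written with a dedicated library rather than by hand.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri, payload, content_type="text/html"):
    """Serialize one WARC/1.0 'resource' record as bytes (illustrative helper)."""
    headers = [
        ("WARC-Type", "resource"),
        # Each record carries a globally unique identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        # Capture time in UTC.
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers)
    # Header block, blank line, payload, then two CRLFs that close the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


record = build_warc_resource_record(
    "https://example.org/",  # placeholder URL
    b"<html><body>Hello</body></html>",
)
print(record.decode("utf-8"))
```

A WARC file is simply a sequence of such records written one after another, which is what allows a single file to hold an entire crawl.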
Who would want to use Web Archiving?
Archiving has traditionally fallen to librarians, archivists, and information scientists. Organizations such as the Internet Archive and Rhizome also undertake their own large-scale web archiving efforts. The Internet Archive maintains the most comprehensive web preservation effort, which is accessible through its Wayback Machine. It has been using web crawlers since 1996 to amass the most complete copy of the World Wide Web to date. As time goes on, more and more individual scholars and researchers are preserving their own born-digital content. Because digital content is ephemeral, anybody who works on a digital project should consider methods of preservation.
How do I get access to Web Archiving?
There are a number of tools and web crawlers to assist with digital archiving at every skill level. They are listed here from the most basic (easiest to get up and running) to the most complex.
- Web Recorder is a web-based archiving tool developed by Rhizome. It lets you make an “interactive copy” of a website, recording video, images, and the results of clicking links, which most other tools cannot capture. You have to create an account, but usage is free, and it is the easiest tool to get started with.
- Archive-It is the Internet Archive’s subscription-based service for individuals and organizations to preserve collections of digital content. It allows partners working with the Internet Archive to identify important web pages for their records by pointing the web crawler at specific pages. Archive-It is similar to Heritrix, but adds a graphical user interface and a playback mechanism.
- Heritrix is the Internet Archive’s open-source web crawler, which you can download and deploy yourself. It is something of an industry standard, used by various national libraries around the world. Heritrix is more complex to use: it works like Archive-It without the graphical user interface, and you need your own server to run it.
- Scrapy is not a standalone tool but a Python framework (meaning a collection of reusable code) for web crawling and “web scraping,” that is, extracting structured data from the pages a crawler visits. It offers somewhat more functionality than traditional web crawlers, because it can also extract data through APIs. Scrapy requires some basic knowledge of the Python programming language.
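The crawlers in the list above differ in interface and scale, but they share the same basic loop: fetch a page, keep a copy, extract its links, and repeat. The following is a minimal sketch of that loop using only Python's standard library. It is not Scrapy's or Heritrix's API, and the names `LinkParser`, `fetch`, and `crawl` are assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its body as text (makes a network request)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl from seed_url, staying on the seed's host.

    Returns a dict mapping each visited URL to the HTML that was fetched.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("https://example.org/")` would perform live requests against that placeholder address. The `fetch_page` parameter exists so the loop can be tried against a canned set of pages first. A production crawler would also need to honor robots.txt and pace its requests, which this sketch leaves out.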
Where can I find more help with using Web Archiving?
- Video Tutorial for using Web Recorder.
- The official documentation for using Scrapy, which requires a bit of background in Python.