Getting the most out of the Wayback Machine
Tips for the Wayback Machine browser extension, API, and more!
Roughly a year ago, the Wayback Machine Chrome extension got a major update.
The new version has useful customization features and the ability to connect it to your personal Wayback Machine account, making it an even more essential tool for journalists and investigators. (For the sake of efficiency, I’m going to use WM to refer to the Wayback Machine.)
Here’s a rundown of the extension, a look at advice surfaced in a recent Medium post by cyb_detective (they also have a Substack you should subscribe to!), and a few other tips.
Extension: Main Menu
Before we dive in, please do two things:
Install the Chrome extension.
Create a free account at archive.org.
Once you have an account, use it to sign into the extension. Open the extension by clicking on the little Internet Archive icon in your Chrome extensions menu and then click the button I highlighted in red below:
Now you have the option to automatically add everything you save via the extension to your account. This means the pages you archive will get stored there for easy retrieval.
I’ll show you how to activate that feature in the settings. For now, let’s get into the extension’s basic functionality. Here’s the interface:
You can instantly archive the page you’re viewing by clicking Save Page Now at the top. I always click the Screenshot button before archiving to ensure the WM captures an image of the page.
The Oldest and Newest buttons are shortcuts that will take you to, well, the oldest or most recently archived versions of the page you’re on. You can also click the little red calendar button between them to go to the page’s full archive history.
Most of the features relate to the page you’re on when you open the extension. But two overall site options are URLs and Site Map. The former takes you to a list of all URLs archived for the domain (as opposed to just the page you’re viewing). The latter takes you to a page where the WM gives you a visual breakdown of the data it has for the site. It “groups all the archives we have for websites by year, then builds a visual site map, in the form of a radial-tree graph, for each year.”
Getting your settings right
Click the Settings icon in the bottom left:
This opens the Context menu, where you have a few useful options.
Check the Wayback Machine Count box so you can see how many times a page has already been archived when viewing it in your web browser. I recommend checking the 404 Not Found option, which tells you if there’s an archived version of the webpage you’re on in cases where the page is dead. I also check the Alert if Content is Available box. This can surface information from fact-checking organizations if the page you’re on has been checked. It’s niche, but can be interesting.
Now click on the General menu at the top.
I don’t typically have the Auto Save Page option enabled because I use Hunchly as my tool to capture webpages during an investigation. I treat the WM as a place to do targeted saves and to check for older versions of pages. But as you can see below, there are ways to make sure the WM automatically grabs pages that have never been archived, or that have not been grabbed recently. It’s definitely an option to consider.
I do recommend checking the Save To My Web Archive box. This puts all of the pages into your account at Archive.org for easy retrieval. That’s why I strongly encouraged you to create an account.
Using the Wayback Machine’s API
The WM has an API you can use to pull larger sets of data. Typically, an API (application programming interface) is only useful if you know how to code. But cyb_detective published an article that showed ways to access data via the API using simple URLs. No programming required. You can pull data from the WM if you want to add it to a spreadsheet for further sorting and examination. Note that this approach is mostly useful if you’re dealing with a large amount of data.
One simple formula shared in their article is this:
https://web.archive.org/cdx/search/cdx?url=osintme.com
It tells the WM’s API (which is called CDX) to return a list of the captures archived for osintme.com. By default the query matches that exact URL; to cover every page on the site, end the url value with /* (for example, url=osintme.com/*). You can replace osintme.com with any domain and run this in your web browser. Then it’s easy to copy and paste the results into a spreadsheet.
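If you do want to script it, the same query can be run from Python. This is only a minimal sketch: the function names are my own, and the column order assumes the CDX server's default plain-text output (seven space-separated fields per capture).

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

# Assumed column order of the default plain-text CDX output.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]


def build_cdx_url(target):
    """Return the same query URL you would paste into the browser."""
    return CDX_ENDPOINT + "?" + urlencode({"url": target})


def parse_cdx_text(text):
    """Turn the space-separated response into one dict per capture."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines rather than guessing at their layout.
        if len(parts) == len(CDX_FIELDS):
            rows.append(dict(zip(CDX_FIELDS, parts)))
    return rows


def fetch_captures(target):
    """Download and parse the capture list (needs network access)."""
    with urlopen(build_cdx_url(target), timeout=30) as response:
        return parse_cdx_text(response.read().decode("utf-8"))
```

Calling fetch_captures("osintme.com") should give you a list of dictionaries that you can write to a CSV file or load into a spreadsheet tool.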
Cyb_detective offers other formulas for getting archived pages for a domain using filters such as timeframe and site section. These are useful approaches to gather data for a domain with a lot of archived pages.
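To give a rough idea of what filtered queries look like, here is a hedged Python sketch that combines a timeframe with a site section. It is not taken from cyb_detective's article. The parameters (matchType, from, to, filter, collapse, output, limit) come from the CDX server documentation; the function names and the "osintme.com/blog/" section in the usage note are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_filtered_cdx_url(prefix, start, end, limit=500):
    """Build a CDX query limited to a URL prefix and a date range.

    start and end are Wayback-style timestamps such as "2021" or "20211231".
    """
    params = {
        "url": prefix,                # site section to search under
        "matchType": "prefix",        # every URL that starts with the prefix
        "from": start,
        "to": end,
        "filter": "statuscode:200",   # keep successful captures only
        "collapse": "urlkey",         # one row per distinct URL
        "output": "json",
        "limit": limit,
    }
    # Keep "/" and ":" unescaped so the URL stays readable.
    return CDX_ENDPOINT + "?" + urlencode(params, safe="/:")


def rows_to_dicts(json_rows):
    """The JSON output is a list of lists; the first row holds column names."""
    if not json_rows:
        return []
    header, *captures = json_rows
    return [dict(zip(header, capture)) for capture in captures]


def fetch_filtered(prefix, start, end):
    """Run the query and return one dict per capture (needs network access)."""
    url = build_filtered_cdx_url(prefix, start, end)
    with urlopen(url, timeout=30) as response:
        return rows_to_dicts(json.load(response))
```

For example, fetch_filtered("osintme.com/blog/", "2021", "2022") would ask for successful captures under that hypothetical section between 2021 and 2022.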
Of course, you can also pull up archived URLs for a domain without using the API. A 2021 piece from osintcurio.us has tips for domain-based shortcuts. For example, if you want an easy-to-navigate list of the URLs archived for osintme.com, you can use this formula:
https://web.archive.org/*/www.osintme.com/*
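If you're checking several domains, a few lines of Python can build that shortcut for each one and open it in your default browser. This is a small sketch with helper names I made up; it simply fills the domain into the pattern above.

```python
import webbrowser


def wildcard_listing_url(domain):
    """Return the WM page that lists the archived URLs for a domain."""
    return f"https://web.archive.org/*/{domain}/*"


def open_listings(domains):
    """Open one browser tab per domain."""
    for domain in domains:
        webbrowser.open(wildcard_listing_url(domain))
```

For instance, open_listings(["www.osintme.com"]) opens the same page as the formula above.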
A Few Final Tips
You can search in the WM using keywords. People default to searching by URL, which is understandable given the WM is all about archiving webpages. But you can and should run keyword queries such as names, social media handles, etc.
The WM is for searching webpages. But the larger Internet Archive has a ton of other digital material you can search. This includes an archive of TV news clips and their captions. It has clips from 2,484,000 shows since 2009.
The Internet Archive also has a tremendously useful Advanced Search. This is where you can craft a specific query targeting media type, keyword, date range, etc., and deploy Boolean search operators.
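The Advanced Search page is backed by a URL endpoint (advancedsearch.php) that can return JSON, so the same kind of query can be scripted. The Python sketch below is an illustration under assumptions: the function names are mine, and the query combines a keyword, a mediatype value, and a date range with AND operators purely as an example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(keyword, mediatype, start_date, end_date, rows=25):
    """Build an Advanced Search URL for a keyword, media type and date range.

    Dates are written as YYYY-MM-DD; mediatype is a value such as "texts".
    """
    query = (
        f'"{keyword}" AND mediatype:{mediatype} '
        f"AND date:[{start_date} TO {end_date}]"
    )
    params = [
        ("q", query),
        ("fl[]", "identifier"),   # fields to return for each item
        ("fl[]", "title"),
        ("fl[]", "date"),
        ("rows", rows),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def extract_docs(payload):
    """Pull the list of matching items out of the JSON response."""
    return payload.get("response", {}).get("docs", [])


def search(keyword, mediatype, start_date, end_date, rows=25):
    """Run the search and return matching items (needs network access)."""
    url = build_search_url(keyword, mediatype, start_date, end_date, rows)
    with urlopen(url, timeout=30) as response:
        return extract_docs(json.load(response))
```

Each returned item carries an identifier, which you can append to https://archive.org/details/ to open it on the Internet Archive.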
Most journalists and researchers know about the WM and Internet Archive, but I think we fail to utilize the full range of customization and search options. I hope this helps you get more out of these incredibly useful services.
If you have a WM or Internet Archive tip to share, please add it in the comments! And if you liked this article, subscribe to the free Digital Investigations newsletter.