
Discoverable Archives

A simple standard for finding past versions of web pages, even when the URL no longer exists.

Thanks to the Internet Archive's Wayback Machine, it is possible to search for old web pages that no longer reside at their original URL (for example, a thirteen-year-old web page of mine). A slight change to the Wayback Machine model would make it possible to have an index page with links to all of the versions of a single page (rather than organizing only by web site). The archived pages are stored on a web site with a completely different domain, because storing them under a completely different URL naming scheme on the existing web site can be problematic as well as a potential performance drag.

A Discoverable Archive should be a well-organized archive that preserves the original URLs, essentially with this structure: archive domain, version/date, old URL. There should then be an index page for each old URL that links to all of its versions: archive domain, index folder, old URL. For web pages that can never be considered static, or that are generated from the viewer's identity, there may be no straightforward method of preserving the document. Some archive pages might be generated whenever a page is updated, and others from weekly, monthly, or yearly snapshots. Historic data should need only one snapshot, except in the off chance that the document needs to be fixed, in which case the archive will preserve both versions.
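
The URL structure described above can be sketched in a few lines of Python. This is only an illustration: the archive host archivedomain.gov and the index folder indexofURLs are taken from the example link in this article, while the function names and the date format used for the version segment are assumptions.

```python
from urllib.parse import urlsplit

ARCHIVE_DOMAIN = "archivedomain.gov"  # example archive host used in this article
INDEX_FOLDER = "indexofURLs"          # example index folder used in this article


def _strip_scheme(old_url: str) -> str:
    """Return the host, path and query of the old URL, without the scheme."""
    parts = urlsplit(old_url)
    tail = parts.netloc + parts.path
    if parts.query:
        tail += "?" + parts.query
    return tail


def snapshot_url(old_url: str, version: str) -> str:
    """Archive domain, version/date, old URL."""
    return f"http://{ARCHIVE_DOMAIN}/{version}/{_strip_scheme(old_url)}"


def index_url(old_url: str) -> str:
    """Archive domain, index folder, old URL."""
    return f"http://{ARCHIVE_DOMAIN}/{INDEX_FOLDER}/{_strip_scheme(old_url)}"


if __name__ == "__main__":
    page = "http://www.agency.gov/thispage.html"
    print(snapshot_url(page, "2009-06-01"))  # the date format is an assumption
    print(index_url(page))
```

Because the old host and path are kept intact inside the archive URL, the archive address of any page can be derived from its original address without a lookup table.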

Making the archives discoverable is as simple as adding one or two features to every page of a web site. The first is a link tag in the page header that points to the index page listing previous versions (e.g. <link rel="archive" href="http://archivedomain.gov/indexofURLs/www.agency.gov/thispage.html" title="Archived Versions of This Page" />). The second is that same link and title as a visible link at the bottom of the page, or in another place outside the content div tag where the page template can generate it automatically. Optionally, the <a> tag might carry a rel attribute equal to "archive".
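
A page template could produce both pieces of markup from the index URL. The following Python sketch shows one way to do it; the function names are hypothetical, and the rel value "archive" is the one proposed in this article rather than an established convention.

```python
from html import escape

DEFAULT_TITLE = "Archived Versions of This Page"


def archive_link_tag(index_href: str, title: str = DEFAULT_TITLE) -> str:
    """Build the header link tag pointing at the index of archived versions."""
    return (
        f'<link rel="archive" href="{escape(index_href, quote=True)}" '
        f'title="{escape(title, quote=True)}" />'
    )


def archive_footer_link(index_href: str, title: str = DEFAULT_TITLE) -> str:
    """Build the visible link for the page footer, outside the content div."""
    return (
        f'<a rel="archive" href="{escape(index_href, quote=True)}">'
        f"{escape(title)}</a>"
    )


if __name__ == "__main__":
    href = "http://archivedomain.gov/indexofURLs/www.agency.gov/thispage.html"
    print(archive_link_tag(href))
    print(archive_footer_link(href))
```

Escaping the URL and title keeps the generated markup valid even when the old URL contains characters such as an ampersand in a query string.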

The same rule should apply to the web page that is now served in place of a true 404 error. Web servers, for human readability and security, have closed off directory listings and bare 404 errors and replaced them with auto-generated web pages; those pages should carry both the link tag and the visible link described above. This would work even if the entire domain is stripped of all pages, as long as the domain is maintained. If the domain is not maintained, then the URLs simply become URIs that a search engine might be able to use to find the archived pages.
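
As a rough illustration, an auto-generated "not found" page could be built from the requested host and path alone. In this Python sketch the function name, the page wording, and the archive base URL are assumptions made for the example; a real server would plug something similar into its own error handler.

```python
from html import escape

# Assumed archive index location, following the example link in this article.
ARCHIVE_INDEX_BASE = "http://archivedomain.gov/indexofURLs"
TITLE = "Archived Versions of This Page"


def not_found_page(host: str, path: str) -> str:
    """Return the HTML body served with a 404 status for a missing page.

    The requested host and path are kept so the archive index for the
    missing URL can be linked from both the header and the visible page.
    """
    index_href = escape(f"{ARCHIVE_INDEX_BASE}/{host}{path}", quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        "<title>Page Not Found</title>\n"
        f'<link rel="archive" href="{index_href}" title="{TITLE}" />\n'
        "</head>\n<body>\n"
        "<h1>Page Not Found</h1>\n"
        f'<p><a rel="archive" href="{index_href}">{TITLE}</a></p>\n'
        "</body>\n</html>\n"
    )


if __name__ == "__main__":
    print(not_found_page("www.agency.gov", "/removed/report.html"))
```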

A tougher problem is how to deal with new pages at new URLs that replace older pages at different URLs (this is especially a problem when web content management tools produce technology- or server-specific URLs, or during organizational reorganizations). In that case the link to the older page in the archive site needs to be maintained by hand. That preserves the jump from one URL to another for the older version. Once the newer page is edited but stays at the same URL, the hand-coded jump can be replaced with the auto-generated links. Of course, the new management may choose to ignore the old regime, but in government continuity should be respected.
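
One way to picture the hand-maintained part is as a small table that maps each current URL to the URLs it replaced. The Python sketch below is purely illustrative: the table contents, the names, and the archive base URL are all hypothetical, and the article does not prescribe how such a mapping should be stored.

```python
from typing import Dict, List

# Assumed archive index location, following the example link in this article.
ARCHIVE_INDEX_BASE = "http://archivedomain.gov/indexofURLs"

# Hand-maintained table: current URL (without scheme) -> URLs it replaced.
# All entries are made-up examples.
PREDECESSORS: Dict[str, List[str]] = {
    "www.agency.gov/programs/overview.html": ["www.agency.gov/cms/page.aspx?id=42"],
    "www.agency.gov/cms/page.aspx?id=42": ["www.agency.gov/overview.htm"],
}


def archive_indexes(current: str) -> List[str]:
    """List index pages for a URL and every older URL it replaced, newest first."""
    ordered: List[str] = []
    seen = set()
    pending = [current]
    while pending:
        url = pending.pop(0)
        if url in seen:  # guard against accidental loops in the hand-made table
            continue
        seen.add(url)
        ordered.append(f"{ARCHIVE_INDEX_BASE}/{url}")
        pending.extend(PREDECESSORS.get(url, []))
    return ordered


if __name__ == "__main__":
    for link in archive_indexes("www.agency.gov/programs/overview.html"):
        print(link)
```

A URL with no entry in the table yields only its own index page, which corresponds to the fully auto-generated case.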

 
