Guidelines for making archivable websites
Archivability refers to the ease with which the content, structure, functionality, and front-end presentation(s) of a website can be preserved and later re-presented, using contemporary web archiving tools. The way your website is designed can prevent a web crawler from archiving your content.
Here are some things that can be done to improve the visibility of content for archiving:
- Maintain stable links. Stable links, maintained either through redirects or by not changing web addresses, ensure that references to your content continue to work and remain usable over time.
- Use durable data formats. Along with markup, file formats may be at risk of becoming obsolete; even if the content is saved to long-term storage, there may be no way of interpreting it in the future. Prefer open formats or at least those that can be read using open-source software.
- Use HTTP GET instead of HTTP POST wherever possible: crawlers can capture and replay GET requests, but content reachable only through POST submissions generally cannot be archived.
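As a sketch, a search form submitted with GET puts its parameters in the URL, so each results page gets its own address that a crawler can capture and replay; the /search path and the field names here are hypothetical.

    <!-- Search form using GET: the query appears in the address
         (e.g. /search?q=annual+report), so each results page has
         its own URL that can be archived and replayed. -->
    <form action="/search" method="get">
      <label for="q">Search the catalogue</label>
      <input type="text" id="q" name="q">
      <button type="submit">Search</button>
    </form>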
- Give every unique resource on the website (pages, images and files) its own static URL.
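For illustration (the paths are hypothetical), compare stable, static addresses with one that depends on session state and so cannot be reliably revisited by a crawler:

    <!-- Stable, static addresses: one permanent URL per resource -->
    <a href="/reports/2023/annual-report.pdf">Annual report 2023 (PDF)</a>
    <a href="/images/reading-room.jpg">The reading room</a>

    <!-- Avoid addresses generated per visit or per session, e.g.
         /viewdoc?session=8f3a2c&doc=1742 -->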
- Allow every publicly available resource to be browsed as well as searched, i.e. position it on the 'front end' of the site where it is reachable via HTTP links.
- Avoid proprietary formats for important content, especially the home page. It is not advisable to build home pages that rely heavily on images or animations such as Flash, but if you do create such pages, also provide alternative text-only HTML versions.
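A minimal sketch of the fallback idea, assuming a hypothetical Flash-based home page: the plain HTML inside the object element, including a link to a text-only version, is what a visitor without the plugin (and the archive crawler) will see.

    <object data="/intro-animation.swf" type="application/x-shockwave-flash">
      <!-- Fallback shown (and archived) when the animation cannot run -->
      <p>Welcome to the Example Records Office.</p>
      <p><a href="/index-text.html">Text-only version of this page</a></p>
    </object>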
- Create a site map, and if possible an XML sitemap too, listing the pages in the site, their relative importance, and how often they are updated. A site map helps ensure that all the website content can be crawled (some pages may not otherwise be discoverable by the crawler, for example pages reached only through Flash or JavaScript navigation).
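A minimal XML sitemap, following the sitemaps.org protocol, might look like the sketch below; the URLs, change frequencies and priorities are hypothetical.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.org/</loc>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>https://www.example.org/collections/photographs/</loc>
        <changefreq>monthly</changefreq>
        <priority>0.5</priority>
      </url>
    </urlset>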
- Use robots.txt to prevent access to areas of the site that may cause problems if crawled, e.g. databases (including online catalogues) and calendar functions. Also check your robots.txt file to be sure that directories containing stylesheets and images are not restricted. To give our crawler full access specifically, add the following two lines to your robots.txt file:
    User-agent: archive.org_bot
    Disallow:
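A fuller robots.txt sketch, using hypothetical directory names, blocks crawler traps such as calendars and database-driven search while leaving stylesheets and images crawlable:

    # Block areas that cause problems when crawled (hypothetical paths)
    User-agent: *
    Disallow: /calendar/
    Disallow: /search/
    # Note: /css/ and /images/ are deliberately NOT disallowed,
    # so archived pages keep their stylesheets and images

    # Give the Internet Archive's crawler full access
    User-agent: archive.org_bot
    Disallow: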
- Search and filtering tools on websites cannot be captured, which creates problems if users can only find content by using a search box or filters. Solve this by providing standard links to content that would otherwise be reachable only by searching or selecting drop-down menus.
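For example, a plain "browse" page of ordinary links (the URLs are hypothetical) exposes records that would otherwise be reachable only through the search box or drop-down filters:

    <ul>
      <li><a href="/records/census-1901/">1901 census records</a></li>
      <li><a href="/records/census-1911/">1911 census records</a></li>
      <li><a href="/records/valuation-books/">Valuation books</a></li>
    </ul>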
- Provide a static alternative for streamed audio-visual content that is fully resolvable by HTTP GET requests. Likewise, provide a static alternative (e.g. an HTML, JPEG or CSV download) for significant information contained within interactive maps.
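As a sketch (the file names are hypothetical), a recording and the data behind an interactive map can each be offered as static files fetched with plain GET requests:

    <video controls>
      <source src="/media/lecture-2023.mp4" type="video/mp4">
    </video>
    <p><a href="/media/lecture-2023.mp4">Download the recording (MP4)</a></p>
    <p><a href="/data/site-locations.csv">Download the map data (CSV)</a></p>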
- Use breadcrumb trails where possible; these provide links back to each of the pages the user navigated through to reach the current one, and show the user's current location within the website.
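A breadcrumb trail can be ordinary HTML links, as in this sketch of a hypothetical site structure; every level is a plain link the crawler can follow:

    <nav aria-label="Breadcrumb">
      <a href="/">Home</a> &gt;
      <a href="/collections/">Collections</a> &gt;
      <a href="/collections/photographs/">Photographs</a>
    </nav>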
- Make hyperlinks logical, human-readable and consistent.
- Make sure all content is under one central website domain.
- Make sure your HTML5 and CSS are valid and standards-compliant.
- Follow accessibility standards. If the site is accessible, it will very likely be archive-friendly as well. Once web content is archived it can no longer be changed, which means its accessibility can never be improved beyond what it was at the time of capture. Following web accessibility best practices may be a legal mandate or organizational priority, but it also ensures the usability of your website for the growing number of users with impairments. In particular, the guideline to provide equivalent text for non-textual content facilitates both search crawler indexing and later full-text search in the archive.
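For instance, descriptive alt text on an image (the file and caption here are hypothetical) gives crawlers something to index and keeps the content findable by full-text search in the archive:

    <img src="/images/opening-ceremony-1998.jpg"
         alt="Delegates at the opening ceremony of the new building, 1998">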