Requirements and best practices for web preservation
Web archiving technologies are always improving, but some web content remains difficult to capture, preserve, or make accessible. This is especially true of content that relies on:
- human interaction and interactive technologies
- streaming media or data
- databases or document filters
- proprietary technology
- dynamic elements
LAC collects the Web for posterity and for the purposes of building its own digital and library research collections.
LAC undertakes to faithfully preserve the content and functionality of web resources targeted for inclusion in the Web and Social Media Preservation Program’s collections. However, LAC cannot offer any warranty, guarantee, or service level with respect to the preservation or functionality of selected web resources, nor is it possible to acquire and preserve all websites.
Requirements for preservation of a web resource
LAC accepts nominations of Canadian web resources for preservation as documentary heritage.
- LAC requires at least three months’ advance notice to collect a final preservation master copy of a given web resource;
- LAC will begin collecting a final preservation copy once all final revisions of the web resource are complete;
- If there is a known date for decommissioning a web resource or taking it permanently offline (for example, “the site must close on date X”), please notify LAC three months in advance.
To nominate a web resource for digital preservation, please contact us at archivesweb-webarchives@bac-lac.gc.ca and provide us with:
- The URL of the resource to be preserved;
- The date when the final edits to the web resource will be made;
- An indication of whether you are the website owner, if relevant; and
- The date the web resource must be decommissioned, if known.
What can I do to ensure my website is preservable?
Library and Archives Canada recommends the following best practices when designing web resources. Each recommendation that is not adopted will make your website more difficult to preserve.
Best practices in web development and architecture
To maximize the ability of LAC to acquire and preserve your web resource, please save documents directly on your own server (for example, your images, audio and video recordings, stylesheets and JavaScript files). Web resources hosted within a single domain tend to be simpler to preserve than resources distributed across multiple domains.
- Where possible, ensure that objects such as audio, video, stylesheets and JavaScript are hosted within your site or domain rather than on third-party platforms.
- Do not rely on Facebook or other social media platforms to host your audio, video, or images. These platforms change almost daily, which complicates LAC’s ability to collect resources from them.
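As an illustration, a page whose supporting files live on its own domain can reference them with site-relative paths, so a crawler capturing the domain picks them up automatically (the file names below are hypothetical):

```html
<!-- Assets hosted on the site's own domain, referenced by relative paths -->
<link rel="stylesheet" href="/css/styles.css">
<script src="/js/main.js"></script>
<img src="/images/banner.jpg" alt="Banner">

<!-- Harder to preserve: an asset served from a third-party platform -->
<!-- <video src="https://video.example-platform.com/id/98765"></video> -->
```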
Maintain a consistent base URL or domain
If you have published a major resource at a given URL (for example, www.mywebsite.com/webarchiving), avoid changing this address. Changing it to webarchiving.mywebsite.com will result in LAC preserving two distinct resources, which may or may not be associated with each other in the GCWA.
Interactive, proprietary, or backend-dependent technologies are difficult to preserve
Content requiring human interaction (such as running database searches or applying dynamic filters) can be difficult to capture and preserve. These may be important features for your website, but LAC cannot guarantee they will be preserved faithfully. Such features include, for example, those where a user has:
- performed an action that accessed or created data in a database;
- hovered a mouse cursor over an item;
- zoomed in or out of resources or a map.
Robots.txt exclusions
It is normal practice to prohibit or delay web search engines from excessively querying your web resource or domain. However, this also inhibits LAC’s web crawlers, which can make a web resource impossible or tedious to acquire. Similarly, instructing crawlers to skip a website’s CSS or JavaScript directories would have a considerable impact on the digital preservation copy of your web resource.
To allow web crawlers to successfully access all components of your website:
- Do not create an exclusion file (usually named robots.txt), or ensure that access is granted to the user agent archive.org_bot (known as whitelisting, this permits LAC’s web crawler to access your website while denying access to any crawler not specified);
- For any website security software, whitelist user-agent archive.org_bot and special-archiver.
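Following the two points above, a minimal robots.txt could grant LAC’s crawlers unrestricted access while still limiting other agents (the crawl-delay rule for other agents is purely illustrative):

```
# Give LAC's archiving crawlers unrestricted access
User-agent: archive.org_bot
Disallow:

User-agent: special-archiver
Disallow:

# Illustrative: slow down all other crawlers without blocking them
User-agent: *
Crawl-delay: 10
```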
Avoid the use of tokens and session identifiers
Avoid the use of session tokens unless absolutely necessary. Tokens and session tracking (for example, www.website.com/t?=123456/…) can prevent LAC from confirming we have crawled all pages in a target web resource. This can make preservation of a given website more complex.
Make use of direct and static hyperlinks
Avoid the use of dynamically generated URLs when possible.
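For example, a plain HTML anchor can be followed by a crawler, while a link assembled by a script at click time usually cannot (the page and function names below are hypothetical):

```html
<!-- Crawler-friendly: a direct, static hyperlink -->
<a href="/reports/annual-report.html">Annual report</a>

<!-- Difficult to crawl: the URL exists only after the script runs -->
<a href="#" onclick="window.location = buildReportUrl('annual')">Annual report</a>
```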
Observe International Web Development and Accessibility Standards
Since web crawlers interact with target web resources in a manner similar to a browser, following W3C international standards and best practices for web development generally facilitates digital preservation for web resources.
Ensure you consult and abide by W3C standards and best practices.
Also consult W3C’s Web Accessibility Initiative (WAI) guidelines and conform to WCAG 2.0 Level AA at minimum.
For Government of Canada web resources, also consult and abide by the applicable Government of Canada web standards.
Generate a site map and indices
Web crawlers operate and acquire digital preservation copies of target web resources by following hyperlinks. Any page within your web resource that is not hyperlinked to other pages is referred to as an “orphaned page”. Orphaned pages are effectively invisible to web crawlers.
Databases and other dynamic technologies are invisible to crawlers in part because their contents are not discoverable or accessible by direct hyperlink (because they reside in the database and are accessed by dynamic URIs). Generating an index of a database, for example, is an effective method of enabling LAC to extract all its content.
Generating a “site map”, and even comprehensive indices of the major components of your web resource, is not only a best practice; it ensures web crawlers can follow, detect and acquire all aspects of your website (even if the rest of the content does not follow these guidelines!)
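A site map in the widely used sitemaps.org XML format simply lists the URL of every page, so a crawler can reach each one even if no other page links to it (the URLs below are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.mywebsite.com/</loc>
    <lastmod>2020-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.mywebsite.com/webarchiving</loc>
  </url>
</urlset>
```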
Include metadata and define your character encoding
The Web and Social Media Preservation Program relies on header and embedded metadata in web resources, such as the title and character encoding. Ensuring these are present in your page headers allows LAC to automate the proper indexing and curation of your resource under its correct name and details.
To ensure digital preservation and faithful emulation of your web resource from within the GCWA:
- The media (MIME) type must be correctly identified through the HTTP header field “Content-Type”. The value for this field can be provided in two ways:
- Content-Type field of the HTTP header: the HTTP header is supplied by a Web server and it defines a set of characteristics of the content before it is downloaded;
- Meta tag Content-Type: can be included in the source code of a page. Example: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The Content-Type field in the HTTP header must correctly identify the character set encoding in order for successful capture and rendering of the archived copy (in the example above, “UTF-8”). The meta tag Content-Type in the source code of a page must be consistent with the character set cited in the HTTP header.
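As a sketch of the consistency check described above, the following Python snippet (an illustration only, not an LAC tool) compares the charset declared in an HTTP Content-Type header against the one declared in a page’s meta tag:

```python
import re

def charset_consistent(header_value, html_text):
    """Return True when the charset in an HTTP Content-Type header matches
    the charset declared in the page's <meta http-equiv="Content-Type">
    tag (case-insensitive). Illustrative sketch only."""
    pattern = r'charset=["\']?([\w-]+)'
    header_match = re.search(pattern, header_value, re.IGNORECASE)
    meta_match = re.search(pattern, html_text, re.IGNORECASE)
    if not header_match or not meta_match:
        return False
    return header_match.group(1).lower() == meta_match.group(1).lower()

# Example: header and meta tag both declare UTF-8
header = "text/html; charset=utf-8"
page = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
print(charset_consistent(header, page))  # True
```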
Use archiving-friendly platform providers and content management systems
Avoid the use of proprietary “website builders” and content management systems (e.g., Wix, Squarespace) in favour of open source frameworks whenever possible. Proprietary systems are difficult for crawlers to navigate and capture.
Additional resources
Library of Congress. “Recommended Formats Statement for websites”
Portuguese Web Archive. "Recommendations for authors to enable web archiving"