Guidelines on File Formats for Transferring Information Resources of Enduring Value
1. Effective Date
These Guidelines have been approved by the Senior Director General, Innovation and Chief Information Officer Branch, and take effect on October 1, 2014.
2. Application
These Guidelines provide advice on the digital file formats to be used when transferring information resources of enduring value (IREV) to Library and Archives Canada (LAC).
These Guidelines apply to all persons and organizations transferring digital IREV to LAC (hereinafter referred to as the “donor”).
These Guidelines supersede the Local Digital Format Registry File Format Guidelines for Preservation and Long-term Access Version 1.0 (2010).
3. Definitions
See Appendix A.
4. Context
These Guidelines are part of LAC’s Stewardship Policy Framework (2013) and the accompanying Policy on Holdings Management (July 2014). These documents mandate that IREV acquired and managed by LAC be accessible over time, and that consideration be given to stewardship requirements and resource capacity. The sustainability of IREV is therefore to be considered as part of all acquisition, stewardship, and reappraisal activities.
In accordance with sections 8 (2) and 10 of the Library and Archives of Canada Act, and section 2 (a) and (b) of the Legal Deposit of Publications Regulations, these Guidelines outline the appropriate file formats for submission to LAC of digital publications affected by legal deposit. While the Library and Archives of Canada Act section 10 (4) entitles LAC to collect all published versions and formats of a given title, LAC currently prefers to acquire publications in digital file formats defined within these Guidelines.
In accordance with sections 7, 12 and 13 of the Library and Archives of Canada Act, these Guidelines outline the appropriate digital formats that support any agreements between LAC and federal institutions for the transfer of digital IREV. Where such a transfer is governed by an existing records transfer agreement that specifies a digital format other than what is outlined in these Guidelines, federal institutions must consult with LAC prior to preparing the transfer.
These Guidelines will also apply to other acquisition agreements in which LAC representatives specify the file formats for the transfer of IREV.
5. Objective
These Guidelines restrict the number and types of file formats submitted to those formats that LAC has reasonable confidence can be preserved and made accessible over time, thereby ensuring sustainability.
6. Expected Results
Adherence to these Guidelines will allow LAC to achieve the following:
- Collaboration with donors on the long-term management and preservation of IREV;
- Acquisition of only those digital file formats identified as being sustainable;
- Transfer of digital IREV in a consistent, transparent and reliable manner that enables overall accountability;
- Alignment with international best practice in digital preservation.
7. Approach
File formats are specific patterns or structures that organize and define data. Some formats contain only one stream of uncompressed data; others may contain codecs to encode and compress the data; and others may support several streams of media.
In addition to file formats, there are also container or encapsulating formats. These formats can contain and support various types or layers of data and metadata. Each of these formats may be handled by different programs, processes, or hardware but for the data stream to be interpreted properly, the information must be wrapped together.
The ability to preserve and use digital information is at risk if the computer hardware and software needed to access the information are no longer available or if the format specifications are not obtainable. The use of appropriate file formats is therefore critical to sustainable long-term preservation. Due to a mix of technical and practical issues, certain file formats are more suitable for preservation.
The file format recommendations in these Guidelines are based on LAC’s experience in collecting and preserving digital content as well as international best practices1. In developing these Guidelines, LAC has attempted to balance the requirements for quality, stability, potential longevity and industry acceptance. Where possible, a preference has been placed on the selection of non-proprietary national and international standards, or failing this, on de facto industry standard file formats. De facto formats are widely used and recognized formats that have become industry standards because of their ubiquitous use and support and not because they have been formally approved by a standards organization. In some cases, LAC has also selected formats that it believes will become widely adopted in the near future.
The following criteria were considered when evaluating the sustainability of a given format:
- Openness/transparency
  - The relative ease with which knowledge of the file format and its technical information can be accumulated.
- Adoption as a preservation standard
  - The extent to which the format has been formally adopted by national libraries, archives and other memory institutions internationally.
- Stability/compatibility
  - The degree to which the format is backward and forward compatible.
  - The degree to which the format is protected against file corruption.
  - The relative frequency of updated or replacement versions of the format over time.
- Dependencies/interoperability
  - The degree to which the format relies on a particular hardware or software.
8. Scope
These Guidelines identify broad content categories covering all digital IREV acquired by LAC and provide transfer file format recommendations for each category. The file formats covered in this document have been divided into the following content categories2 and subcategories:
- Text
- Presentations
- Email
- Still images
  - Digital photographs
  - Scanned text
- Digital audio
- Digital moving images
  - Digital cinema
  - Digital video
- Geospatial
- Computer Aided Design
- Data sets
The transfer file formats are identified as either:
- Preferred for transfer; or
- Acceptable for transfer.
Preferred formats are those formats that are readily usable and have been identified by LAC as possessing a high degree of long-term sustainability. These formats require little or no immediate management to achieve appropriate levels of preservation.
Acceptable formats are those that meet LAC’s minimum criteria for sustainability. These formats may require LAC to perform some preservation actions on ingest to ensure their long-term sustainability.
All other formats are considered unacceptable because they do not meet the minimum requirements to be considered sustainable by LAC.
As a general rule, LAC will only accept file formats listed in these Guidelines. The onus is on the donor to ensure that IREV are in a preferred or acceptable file format. LAC reserves the right to refuse any file that is not in a preferred or acceptable file format and to request the migration of the files to a preferred or acceptable format. IREV may be exempted from compliance on a case-by-case basis after consultation with LAC representatives from the functional area responsible for acquisition.
These Guidelines do not contain information on creation, migration and capture standards. See LAC’s Standards on Digitization (in development) for information on the production of digital IREV.
These Guidelines do not give information on the generation of metadata during the record creation process. See LAC’s Standards on Metadata (in development).
These Guidelines do not outline how to achieve the actual physical or electronic transfer of IREV. Discuss the logistics of the transfer with the LAC representative responsible for the transfer.3
9. Transfer Requirements
When transferring digital IREV, identify the applicable content category and submit the resources in a preferred or acceptable format. Formats are listed by name and include a reference to the relevant specification that defines appropriate encoding methods. The formats in each section are organized alphabetically and do not imply an order of preference for any given format. However, LAC always prefers to receive a preferred file format over an acceptable file format.
Where required, the format category tables include a column that specifies the codec that must be used with each format. Donors must submit files that comply with both the format and codec that are listed.
In some cases, the donor must take additional steps to ensure that files are acceptable for long-term preservation by:
- Deactivating file level encryption;
- Deactivating digital rights management technologies;4
- Embedding in each record all fonts necessary to interpret the information;5
- Providing metadata6 either embedded within the record itself or in an accompanying digital file.
9.1 Text Formats
9.2 Presentation Formats
9.3 Email Formats 7
9.4 Formats for Still Images
This content category contains two subcategories: digital photographs and scanned text.
9.4.1 Digital Photographs
9.4.2 Scanned Text
9.5 Digital Audio Formats
9.6 Formats for Digital Moving Images
This content category contains two subcategories: digital cinema and digital video.
9.6.1 Digital Cinema
| Acceptable Formats | Acceptable Codecs | Format Specifications |
| --- | --- | --- |
| Digital Cinema Package (DCP): Unencrypted Interop or SMPTE compliant | JPEG 2000 (as outlined by the DCI specifications) | Digital Cinema Initiatives, DCI Specification Version 1.2, 2012 |
9.6.2 Digital Video
9.7 Geospatial Formats
| Preferred Formats | Format Specifications |
| --- | --- |
| Band Interleaved by Line (BIL) | BIL, BIP, and BSQ raster files |
| Band Interleaved by Pixel | BIL, BIP, and BSQ raster files |
| Band Interleaved Sequential (BSQ) | BIL, BIP, and BSQ raster files |
| Digital Elevation Model (DEM) | USGS, Part 1: General and Part 2: Specifications, Standards for Digital Elevation Model |
| Environmental Systems Research Institute (ESRI) Arc/Info ASCII Grid | ESRI ASCII Raster Format: http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/ESRI_ASCII_raster_format/009t0000000z000000/ ; http://webhelp.esri.com/arcgisdesktop/9.1/index.cfm?id=886&pid=885&topicname=ASCII%20to%20Raster%20(Conversion) ; http://resources.esri.com/help/9.3/arcgisengine/java/GP_ToolRef/spatial_analyst_tools/esri_ascii_raster_format.htm |
| Environmental Systems Research Institute (ESRI) Shapefile (SHP) | ESRI Shapefile Technical Description |
| GeoTiff | GeoTiff Format Specification, Version 1.8.2, Revision 1.0, 2000 |
| Geography Markup Language (GML) | ISO 19136:2007 & Version 3.2, OpenGIS Geography Markup Language (GML) Encoding Standard 07-036 |
| Keyhole Markup Language (KML) | Open Geospatial Consortium Inc. OGC KML 07-147r2 |
9.8 Computer Aided Design Formats
9.9 Formats for Data Sets
Tabular data from databases and spreadsheets must meet the following requirements:
- Each record must contain an end-of-record marker;
- Each field within a file must be defined with the same fixed width;
- Each record must be defined with the same logical record length;
- All fields within a record in a database, or tuples in a relational database, should have the same logical format;
- A record should not contain nested repeating groups of data;
- Every file must be accompanied by documentation that specifies the field names and the field definitions.8
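As an illustration only, a donor could run a check along the following lines before transfer to confirm that every record ends with an end-of-record marker and has the same logical record length. This is a minimal sketch, not an LAC tool: the function names, the field widths and the CR+LF marker are assumptions chosen for the example, and "fixed width" is interpreted here as each field occupying the same number of bytes in every record.

```python
from typing import List, Sequence


def check_fixed_width_records(
    data: bytes,
    field_widths: Sequence[int],
    terminator: bytes = b"\r\n",  # assumed end-of-record marker (CR+LF)
) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: List[str] = []
    record_length = sum(field_widths)  # logical record length in bytes

    # The final record must also be closed by an end-of-record marker.
    if not data.endswith(terminator):
        problems.append("last record has no end-of-record marker")

    records = data.split(terminator)
    # A file that ends with the marker yields one trailing empty chunk.
    if records and records[-1] == b"":
        records.pop()

    # Every record must have the same logical record length.
    for number, record in enumerate(records, start=1):
        if len(record) != record_length:
            problems.append(
                f"record {number}: length {len(record)}, expected {record_length}"
            )
    return problems


def split_fields(record: bytes, field_widths: Sequence[int]) -> List[bytes]:
    """Cut one record into its fixed-width fields."""
    fields: List[bytes] = []
    start = 0
    for width in field_widths:
        fields.append(record[start:start + width])
        start += width
    return fields


if __name__ == "__main__":
    # Hypothetical layout: a 4-byte identifier followed by a 10-byte name.
    widths = [4, 10]
    sample = b"0001Smith     \r\n0002Tremblay  \r\n"
    print(check_fixed_width_records(sample, widths))
    print(split_fields(b"0001Smith     ", widths))
```

The field widths passed to such a check would come from the documentation that accompanies the file, which is one reason the field names and definitions must be supplied with every transfer.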
10. Roles and responsibilities
Responsibility for administering these Guidelines rests with the Directors General of the relevant functional areas.
Directors are responsible for implementing these Guidelines within their management areas.
LAC staff involved in the acquisition, stewardship and reappraisal of digital IREV are responsible for communicating and operationalizing these Guidelines.
Donors are to adhere to these Guidelines and consult with LAC on any matters that may impede their ability to comply with these Guidelines.
11. Monitoring, evaluation and review
The functional area responsible for acquisition will monitor application of these Guidelines and report on compliance.
Evaluation and review of these Guidelines will be undertaken every 3 years by representatives of the branches responsible for acquisition and stewardship, or earlier if requested by senior management.
12. Consequences
Non-compliance with these Guidelines will have a negative impact on acquisition, stewardship and reappraisal activities and results.
Consequences for non-compliance with these Guidelines may include initial or full rejection of proposed file transfers, or corrective measures, at the discretion of LAC staff responsible for the acquisition of IREV. Corrective measures may include any actions deemed appropriate and acceptable under the circumstances.
13. Information
Please address any questions about these Guidelines to:
Director General
Evaluation and Acquisition Branch
Library and Archives Canada
550 de la Cité Boulevard
Gatineau, Québec
K1A 0N4
Appendix A: Definitions
- Acceptable format: a file format that meets LAC’s minimum requirements for sustainability. This format may require LAC to perform some preservation actions on ingest to ensure its long-term sustainability.
- Bitmap: an image created from a series of bits and bytes that form pixels. Each pixel carries a value, stored as bits or bytes, that defines its colour or greyscale level. Such images are also known as raster images.
- Codec: hardware or software capable of encoding and/or decoding a data stream for transmission. When used with digital audio or video, the term codec refers to the digital signal encapsulated in a wrapper.
- Container format: a format that can contain and support various types or layers of audio, video, still imagery and their associated metadata. For the data stream to be properly interpreted, the information must be encapsulated or wrapped together. The wrapper refers to a particular way of storing and synchronizing data content into a single file.
- Compression: the encoding of information using fewer bits than in the original. There are two forms of data compression – lossless and lossy. A lossless compression technique discards no information. It looks for more efficient ways to represent data, while making no compromises in accuracy. Lossy compression accepts some degradation in the data to achieve smaller file sizes. Because of this degradation in quality, lossy compression should be avoided.
- Computer Aided Design (CAD): vector programs used to create animations that represent two- and three-dimensional surfaces of inanimate objects. CAD and vector graphics programs can output binary and XML formats.
- Data sets: data stored in defined fields such as databases and spreadsheets.
- Database formats: organized collections of data that conform to a logical structure. Database formats are determined by data models that describe specific data structures used to model an application and generally include navigational, relational, and hybrid models.
- Digital audio: file formats that encode recorded sound as machine readable files by converting acoustic sound waves into digital signals. Digital audio formats are generally composed of both a wrapper format and an encoding method or codec. Audio file stream encodings are independent of the audio container file format.
- Digital cinema: both born-digital cinematic productions and digital moving image files created by digitizing motion picture film.
- Digital moving images: a sequence of bitmap digital images displayed in rapid succession at a constant rate, giving the appearance of movement. Digital moving image file formats function as containers or wrappers to provide storage areas for any moving image essence, associated audio essence (if present), as well as metadata. Moving image essence data contained within a given wrapper file format is encoded for playback using a specific codec. The parameters of the codec employed determine the presence and method of compression that was used to store the digital moving image data within the wrapper. This category includes two subcategories: digital cinema and digital video.
- Digital photographs: both still photographs produced by digital cameras as well as scanned images of photographic prints, slides, and negatives.
- Digital rights management technologies: technologies to prevent unauthorized use or reproduction of digital content and devices.
- Digital video: both born-digital video and digital files created by digitizing video from an analog source.
- Email: electronic communication transmitted over the Simple Mail Transfer Protocol (SMTP) between two or more accounts. Email is composed of a header, message body and attachments. The header is structured metadata that establishes the provenance of the record. The following data must be present: sender name and address; names and addresses of all recipients; sent date; and received date. The message body is the intellectual content of the message. Attachments are any additional objects sent with the email.
- Encapsulating format: see container format.
- Encryption: the use of an algorithm to render a file unreadable. A decryption key is required to undo the work of the algorithm.
- Enduring value: the quality of having continuing archival or historical usefulness or significance to Canadian society.
- End-of-record marker: the character or characters that mark the end of a record in a file. The marker varies with the operating system that is used to create the file. In a classic Mac OS environment, a carriage return (CR, ASCII code 0x0D) is placed at the end of a record. In a DOS or Windows environment, a CR followed by a line feed (LF, ASCII code 0x0A) is placed at the end. In UNIX, only an LF appears at the end.
- File format: specific pattern or structure that organizes and defines data. Some formats contain only one stream of uncompressed data, others may contain codecs to encode and compress the data, and others may support several streams of media.
- Geospatial: data that may be contained within a database to enable analysis across the datasets (e.g. geo-database), united within a complex file format structure where one geospatial file is comprised of several distinct, but related, formats (e.g. shapefile), or contained within a single file (e.g. GML).
- Information resources: any documentary material, published or unpublished, regardless of communications source, information format, production mode or recording medium.
- Information resources of enduring value (IREV): information resources that have long-term importance and relevance to Canadian society.
- Metadata: data about other data.
- Migration: the movement of digital information from one software/hardware environment/storage medium to another as standards and technology evolve.
- Preferred format: a file format that is readily usable and has been identified by LAC as possessing a high degree of long-term sustainability. This format requires little or no immediate management to achieve appropriate levels of preservation.
- Presentation format: a format that conveys graphical information to audiences as a slide show.
- Raster image: see bitmap.
- Scanned text: a photograph of a printed page produced by either a digital camera or scanner.
- Spreadsheets: tables made up of columns and rows that contain cells of data. Relationships between cells can be pre-defined as mathematical formulas.
- Still images: files that are sampled and bitmapped as a grid of rectangular dots, picture elements (pixels) or points of color.
- Stewardship: the responsible management of IREV in one’s care, custody, control or ownership so that it can be passed on to future generations.
- Sustainability: ensuring that the documentary heritage acquired and managed by LAC is accessible over time, including giving consideration to its one-time or ongoing stewardship requirements and to LAC’s resource capacity. In the context of these Guidelines sustainability is tied to the suitability of a format to preserve encoded information over time. Factors that contribute to a format’s sustainability include quality, stability, potential longevity and industry acceptance.
- Text: there are two general types of text: plain and formatted. Formatted text files contain encoded ASCII data and format definitions that display the information in a defined pattern. Plain text files contain encoded ASCII or Unicode data that has no formatting or layout code to influence the presentation of the data.
- Unacceptable format: a format that does not meet the minimum requirements to be considered sustainable by LAC.
- Vector graphics: digital images made up of object-oriented images that use the geometry of points, lines, curves and polygons to represent images.
- Wrapper: see container format.
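To illustrate the end-of-record marker definition in this appendix, the sketch below counts the CR, CR+LF and LF sequences in a file's bytes and reports which convention is in use. It is an example under stated assumptions rather than a prescribed procedure: the function name and the return labels ("CR", "CRLF", "LF", "mixed", "none") are hypothetical and chosen for the example.

```python
def detect_end_of_record_marker(data: bytes) -> str:
    """Report which end-of-record convention appears in the given bytes."""
    crlf = data.count(b"\r\n")        # CR (0x0D) followed by LF (0x0A)
    cr = data.count(b"\r") - crlf     # CR not followed by LF
    lf = data.count(b"\n") - crlf     # LF not preceded by CR

    counts = {"CRLF": crlf, "CR": cr, "LF": lf}
    used = [name for name, count in counts.items() if count > 0]

    if not used:
        return "none"   # no end-of-record marker found at all
    if len(used) > 1:
        return "mixed"  # more than one convention in the same file
    return used[0]


if __name__ == "__main__":
    print(detect_end_of_record_marker(b"record one\r\nrecord two\r\n"))
```

A result of "mixed" or "none" would suggest that the file should be reviewed before transfer, since each record is expected to carry a consistent end-of-record marker.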
Appendix B: Bibliography
- Library of Congress. Sustainability of Digital Formats. Accessed August 20, 2013.
- National Archives and Records Administration. NARA Bulletin 2013-XX Revised Format Guidance for the Transfer of Permanent Electronic Records. 2013.
- National Archives (UK). Suitable file formats for transfer of digital records to The National Archives. September 2011. Accessed August 20, 2013.
1. See Appendix B: Bibliography.
2. Web content is not currently a content category because LAC actively harvests the web content that it seeks to acquire and preserve. Normally, LAC does not accept pre-harvested web content from donors. Any transfer of web content has to be negotiated with LAC.
3. Government departments may also consult the Procedures for the Transfer of Unpublished Information Resources of Enduring Value from Government of Canada Institutions to Library and Archives Canada (2013).
4. This is a requirement for publications submitted to LAC on Legal Deposit, in accordance with the Legal Deposit of Publications Regulations, section 2 (a). For all other IREV, this applies only if the donor has the legal right to do so.
5. If the donor has the legal right to do so.
6. This is a requirement for publications submitted to LAC on Legal Deposit, in accordance with the Legal Deposit of Publications Regulations, section 2 (b). Generally, the preferred format of the metadata is a structured format (e.g. XML, CSV, DBF) to facilitate reuse. Furthermore, certain metadata standards may also be necessary such as ONIX 3.0 or Dublin Core for bibliographic metadata. Contact LAC to discuss metadata requirements prior to transfer.
7. Email attachments are considered a component of the email and therefore the attachment does not have to meet the transfer standards specified by the format category that the attachment alone would fall under.
8. Please clarify the specific documentation requirements for data sets with the LAC representative responsible for the transfer.