The PDF/A file format

PDF/A is a file format that is intended to solve a basic problem of electronic data handling: how to make sure that documents can still be read or processed in the future. If you have been working with computers for some years, you may already have experienced that reusing older archived data can be a challenge. Those WordPerfect files created 15 years ago can no longer be read by your new word processor. You may still have drawings created in an application that simply doesn’t exist any more.
For organisations who need to archive thousands or millions of documents electronically, it became crucial to have a document format which:

  • preserves the original appearance of documents.
  • is well documented.
  • is vendor and operating system independent.
  • is self-contained: no additional data are needed to view the document.
  • can be searched.

The PDF file format meets most of the above requirements, provided a set of extra rules is applied to make sure that the data can still be processed decades from now. This set of rules is called PDF/A. Its development is in the hands of ISO.

PDF/A is a well-defined subset of the PDF standard, optimised for the long-term preservation of electronic documents.

PDF/A flavors

The PDF/A standard gradually evolves to meet new needs or use newer technology. There are currently 3 PDF/A flavors.

PDF/A-3

This is a refinement of the PDF/A-2 file format. The specification (ISO 19005-3:2012, Part 3) was published on October 17, 2012.

The following restrictions and rules ensure that PDF/A-3 files meet the above-mentioned goals:

  • PDF/A-3 files adhere to the PDF 1.7 specifications.
  • The use of transparency is allowed, as are layers.
  • A PDF/A file should be self-contained, which means that it cannot contain any external references or dependencies.
  • All fonts must be embedded in the file. Subsetting fonts (storing only the part of the font that the document actually uses) is allowed, as long as every glyph used in the file is included.
  • RGB or CMYK data can be included but you cannot mix them: the file is either an RGB file or a CMYK file.
  • Comments and notes are only permitted to a limited extent. They must behave in the same way when viewed on screen and printed.
  • PDF/A files cannot contain embedded multimedia content such as music or movies.
  • The file should not contain forms or JavaScript code.
  • Compression algorithms that are covered by a company's patents are not supported, since patent rights could restrict the use of the files. In PDF/A-3 files, as in PDF/A-2 files, LZW compression cannot be used but JPEG 2000 compression is allowed. The advantage of JPEG 2000 is that it supports both lossy and lossless data compression.
  • A PDF/A-3 file can contain other PDF/A documents as embedded files. Other arbitrary file formats (such as XML, CSV, CAD, word-processing documents, spreadsheet documents and others) can also be embedded in a PDF/A-3 file.

Besides the things that are not allowed, there is also some information that must be present in a PDF/A file but that you may not find in regular PDF files:

  • There is a separate PDF/A identifier which needs to be present in the file.
  • Although their presence is not mandatory, the use of metadata is recommended. These metadata should be coherent (clear and logically consistent).
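
To make the identifier less abstract: it is stored in the file's XMP metadata as the properties pdfaid:part and pdfaid:conformance. The sketch below is a rough illustration, not a validator. It assumes the metadata packet is stored uncompressed and in a single-byte encoding such as UTF-8, and the function name claimed_pdfa_flavor is made up for this example. Keep in mind that the identifier is only a claim made by the file; it does not prove that the file really complies.

```python
import re

# XMP can write a property as an attribute (pdfaid:part="1")
# or as an element (<pdfaid:part>1</pdfaid:part>); accept both.
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def claimed_pdfa_flavor(data: bytes):
    """Return e.g. 'PDF/A-1b' if the bytes carry a PDF/A identifier, else None.

    Hypothetical helper: it only reads the claim, it does not validate the file.
    """
    part = _PART.search(data)
    if part is None:
        return None
    flavor = "PDF/A-" + part.group(1).decode("ascii")
    conformance = _CONF.search(data)
    if conformance is not None:
        flavor += conformance.group(1).decode("ascii").lower()
    return flavor
```

To try it on a real document, read the file in binary mode and pass the bytes in, for instance claimed_pdfa_flavor(open("archive.pdf", "rb").read()), where archive.pdf is a placeholder name.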

PDF/A-2

This version was released in 2011. It is also referred to as ISO 19005-2. The specifications are identical to those of PDF/A-3 with one exception: in a PDF/A-2 file you cannot embed any other file format except PDF/A documents.

PDF/A-1

The PDF/A-1 standard dates from 2005 and is also known as ISO 19005-1:2005. It has since become a well-established data format. These are some of the specifications of the format:

  • PDF/A-1 files adhere to the PDF 1.4 specifications.
  • Transparency should not be used in PDF/A-1 files, nor should layers.
  • A PDF/A file should be self-contained, which means that it cannot contain any external references or dependencies.
  • All fonts must be embedded in the file. Subsetting fonts (storing only the part of the font that the document actually uses) is allowed, as long as every glyph used in the file is included.
  • RGB or CMYK data can be included but you cannot mix them: the file is either an RGB file or a CMYK file.
  • Comments and notes are only permitted to a limited extent. They must behave in the same way when viewed on screen and printed.
  • PDF/A files cannot contain embedded contents such as music, movies or other files.
  • The file should not contain forms or JavaScript code.
  • Compression algorithms that are covered by a company's patents are not supported, since patent rights could restrict the use of the files. In PDF/A-1 files LZW and JPEG 2000 compression cannot be used.
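
As a rough illustration of the compression rule: each compressed stream in a PDF names its compression filter in the stream dictionary, and LZW appears there as /LZWDecode (JPEG 2000 appears as /JPXDecode). The sketch below simply scans the raw bytes for such names. It is a heuristic under stated assumptions, not a validator: the function name filters_used is made up for this example, and a name written with PDF's #xx escape notation would be missed.

```python
import re


def filters_used(data: bytes, names=("LZWDecode",)):
    """Return the subset of `names` that occur as PDF name objects in `data`.

    Hypothetical helper: a plain byte scan, so it can report a false hit
    if the same bytes happen to occur inside binary stream data.
    """
    found = set()
    for name in names:
        # A PDF name starts with '/' and ends at whitespace or a delimiter,
        # so require that no further letter or digit follows the name.
        pattern = re.compile(
            rb"/" + re.escape(name.encode("ascii")) + rb"(?![A-Za-z0-9])"
        )
        if pattern.search(data):
            found.add(name)
    return found
```

Calling filters_used(data) with the bytes of a file returns an empty set when no LZW-compressed stream was spotted. Pass a longer tuple of names to look for other filters as well.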

Besides the things that are not allowed, there is also some information that must be present in a PDF/A file but that you may not find in regular PDF files:

  • There is a separate PDF/A identifier which needs to be present in the file.
  • Although their presence is not mandatory, the use of metadata is recommended. These metadata should be coherent (clear and logically consistent).

To confuse matters, there are actually 2 subflavors of PDF/A-1:

  • PDF/A-1a – In a PDF/A-1a file the content of the document is also embedded in the file as ‘tagged content’. This means that the PDF describes the visual appearance of the document but also contains all text as structural data using Unicode, so that the logical structure of the text is still recognizable and searching through the text or extracting it is easier. Let me give a simple example: suppose the word ‘Appalachians’ (a mountain range in the US) appears only once in a certain document, in which it is hyphenated. If the PDF file only contains a description of the visual content, it contains the fragments ‘Appa-’ and ‘lachians’. A query to find all documents relating to the Appalachians will not list this file. But if all the text of the PDF document is also embedded as tagged content, the full word is included so that a search engine can indeed locate this word in the file.
  • PDF/A-1b – This variant of the PDF/A-1 standard only focuses on the integrity of the visual display of the document.
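
The Appalachians example can be mimicked in a few lines of Python. This is only a toy model of the difference, with invented sample sentences and made-up function names; it does not read PDF files or reflect how any particular search engine works.

```python
# Text as it is drawn on the page: the word is split across two lines.
visual_lines = [
    "a hiking trip through the Appa-",
    "lachians in the autumn",
]

# Text as tagged content can expose it: the logical text, with the word whole.
logical_text = "a hiking trip through the Appalachians in the autumn"


def search_visual(lines, query):
    """Look for the query in each drawn line separately."""
    return any(query in line for line in lines)


def search_logical(text, query):
    """Look for the query in the logical text."""
    return query in text


print(search_visual(visual_lines, "Appalachians"))   # the split word is missed
print(search_logical(logical_text, "Appalachians"))  # the whole word is found
```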

How to create a PDF/A file

The cheapest solution is probably to purchase and use Adobe Acrobat 8 or later. PDF/A support is built right into this application.

There are also third-party tools and plug-ins on the market, such as Callas pdfaPilot. pdfaPilot can convert a number of file formats such as PostScript/EPS, JPEG, TIFF and PNG into PDF/A files. It also includes a PDF/VT validation feature.

More information

Try this page from the US government or jump to this PDF/A web site.

The ‘PDF/A in a Nutshell 2.0’ ebook provides an in-depth introduction to the standard. You can find the same content on the pdfa.org web site.

3 thoughts on “The PDF/A file format”

  1. I have a problem converting a file to PDF/A-1b format. It has arrows in the questionnaire which were embedded as Wingdings. I get the following error message when trying to convert:

    WIndings-Regular True type (CID) embedded as a subset.

    How do I prevent the font from being embedded as a subset, as this is not allowed in PDF/A-1b compliance?

  2. Is there any open source library which allows us to create PDF/A-1a compliant PDF docs? There are many for PDF/A-1b but I am not aware of any for PDF/A-1a. Please inform me about it.
