The PDF/A file format

Nowadays more and more documents are archived electronically. If you have been working with computers for some years, you may already have learned that reusing older archived data can be a challenge. Those WordPerfect files created 10 years ago can no longer be read by your new word processor. You may still have drawings created in an application that simply doesn’t exist any more.
For organisations that need to archive thousands or millions of documents electronically, it became crucial to have a document format which:

  • preserves the original appearance of documents.
  • is well documented.
  • is vendor and operating system independent.
  • is self-contained: no additional data are needed to view the document.
  • can be searched.

The PDF file format meets most of the above requirements, provided a set of extra rules is applied to PDF files to make sure that the data can still be processed a hundred years from now. This set of rules is called PDF/A. It is developed and maintained by ISO, and its first part, PDF/A-1, is defined in the ISO 19005-1:2005 standard.

PDF/A is a well-defined subset of the PDF standard, optimised for the long-term preservation of electronic documents.

There are currently two PDF/A flavors, both based on the PDF 1.4 specification:

  • PDF/A-1a
    • The content of the document is also embedded in the file as “tagged content”. This means that the PDF describes the visual appearance of the document but also contains all text as structural data using Unicode, so that the logical structure of the text is still recognizable and searching through the text or extracting it is easier. Let me give a simple example: suppose the word “Appalachians” (a mountain range in the US) appears only once in a certain document, and there it is hyphenated across two lines. If the PDF file only contains a description of the visual appearance, it contains the fragments “Appa-” and “lachians”. A query to find all documents relating to the Appalachians will not list this file. But if all the text of the PDF document is also embedded as tagged content, the full word is included so that a search engine can indeed locate this word in the file.
  • PDF/A-1b
    • Only focuses on the integrity of the visual display of the document.
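
The hyphenation example can be made concrete with a small sketch. It does not read a real PDF: the page layout and the tagged text are hypothetical strings, and `visual_words` and `contains_word` are illustrative helper names, chosen only to show why a whole-word search misses a word that is split across lines.

```python
def visual_words(lines):
    """Words as they appear in the rendered layout, line by line."""
    return [word for line in lines for word in line.split()]


def contains_word(words, query):
    """Case-insensitive whole-word match, ignoring surrounding punctuation."""
    wanted = query.lower()
    return any(w.strip('.,;:!?"()').lower() == wanted for w in words)


# Hypothetical page layout: the word is hyphenated across a line break.
layout_lines = ["We hiked through the Appa-", "lachians last summer."]

# Hypothetical tagged content: the logical text, stored unhyphenated.
tagged_text = "We hiked through the Appalachians last summer."

print(contains_word(visual_words(layout_lines), "Appalachians"))  # False
print(contains_word(tagged_text.split(), "Appalachians"))         # True
```

A search engine could try to re-join hyphenated fragments by guesswork, but having the logical text in the file removes the need to guess.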

What makes PDF/A files well suited for archiving documents?

There are a number of restrictions that apply to PDF/A files.

  • PDF/A-1 files adhere to the PDF 1.4 specifications.
  • Transparency should not be used.
  • A PDF/A file should be self-contained which means that it cannot contain any external references or dependencies.
  • All fonts must be embedded in the file. Subsetting fonts (storing only a part of the full font) is allowed, as long as every glyph used in the document is included.
  • RGB or CMYK data can be included but you cannot mix them: the file is either an RGB file or a CMYK file.
  • Comments and notes are only permitted to a limited extent. They must behave in the same way when viewed on screen and printed.
  • PDF/A files cannot contain embedded content such as music, movies or other files.
  • The file should not contain JavaScript code. Form fields are only permitted in a restricted form, without any actions attached to them.
  • Compression algorithms that are encumbered by patents are not supported, since patent rights could restrict the use of the files. This means LZW compression cannot be used in PDF/A files. Regular JPEG compression is allowed, but JPEG 2000 is not available in PDF/A-1 because it is not part of PDF 1.4.
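
As a rough illustration, several of these restrictions can be written down as a checklist. The sketch below is not a validator and does not parse PDF files: `PdfSummary` is a hypothetical record of what some inspection tool might report about a file, and `pdfa1_problems` simply tests those reported values against a few of the rules listed above. The filter and colour space names (`LZWDecode`, `DeviceRGB`, `DeviceCMYK`) are the ones used in the PDF specification.

```python
from dataclasses import dataclass, field


@dataclass
class PdfSummary:
    """Hypothetical inspection result; a real tool would have to extract these."""
    pdf_version: str = "1.4"
    uses_transparency: bool = False
    external_references: int = 0
    unembedded_fonts: list = field(default_factory=list)
    device_colour_spaces: set = field(default_factory=set)
    embedded_files: int = 0
    has_javascript: bool = False
    stream_filters: set = field(default_factory=set)


def pdfa1_problems(doc: PdfSummary) -> list:
    """Return readable descriptions of rule violations (empty if none found)."""
    problems = []
    major, minor = (int(part) for part in doc.pdf_version.split("."))
    if (major, minor) > (1, 4):
        problems.append("declares a PDF version newer than 1.4")
    if doc.uses_transparency:
        problems.append("uses transparency")
    if doc.external_references:
        problems.append("depends on external references")
    if doc.unembedded_fonts:
        problems.append("fonts not embedded: " + ", ".join(doc.unembedded_fonts))
    if {"DeviceRGB", "DeviceCMYK"} <= doc.device_colour_spaces:
        problems.append("mixes RGB and CMYK data")
    if doc.embedded_files:
        problems.append("contains embedded files")
    if doc.has_javascript:
        problems.append("contains JavaScript")
    if "LZWDecode" in doc.stream_filters:
        problems.append("uses LZW compression")
    return problems


# Hypothetical scan result for a file that breaks three of the rules.
scan = PdfSummary(
    uses_transparency=True,
    unembedded_fonts=["Helvetica"],
    stream_filters={"FlateDecode", "LZWDecode"},
)
for problem in pdfa1_problems(scan):
    print(problem)
```

An empty list only means that none of these particular checks failed. Real conformance checking covers far more rules and is best left to a dedicated preflight tool.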

Besides the things that are not allowed, there is also some information that needs to be present in a PDF/A file but that you may not find in regular PDF files:

  • There is a separate PDF/A identifier which needs to be present in the file. It is stored in the XMP metadata of the document.
  • Beyond that identifier, the use of descriptive metadata such as a title and author is recommended. These metadata should be coherent (clear and logically consistent).
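
To see which PDF/A level a file claims, you can look for the identifier yourself. In practice it is written into the file's XMP metadata as the properties `pdfaid:part` and `pdfaid:conformance`. The sketch below assumes that this XMP packet is stored as plain, uncompressed text, so a byte search can find it. The function name `claimed_pdfa_level` is made up for this example, and the result only reflects what the file claims, not whether it actually conforms.

```python
import re

# XMP allows two serialisations: pdfaid:part="1" and <pdfaid:part>1</pdfaid:part>
_PART = re.compile(rb'''pdfaid:part\s*(?:=\s*["']|>)\s*(\d)''')
_CONF = re.compile(rb'''pdfaid:conformance\s*(?:=\s*["']|>)\s*([A-Za-z])''')


def claimed_pdfa_level(data: bytes):
    """Return e.g. 'PDF/A-1b' if the bytes carry a PDF/A identifier, else None."""
    part = _PART.search(data)
    if part is None:
        return None
    level = "PDF/A-" + part.group(1).decode("ascii")
    conformance = _CONF.search(data)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return level


# Hypothetical fragment of an XMP packet, as it might appear inside a file.
sample = (
    b'<rdf:Description rdf:about="" '
    b'xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">'
    b"<pdfaid:part>1</pdfaid:part>"
    b"<pdfaid:conformance>B</pdfaid:conformance>"
    b"</rdf:Description>"
)
print(claimed_pdfa_level(sample))  # PDF/A-1b
```

For a file on disk, the same function could be applied to the bytes returned by `open(path, "rb").read()`.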

How do you create a PDF/A file?

The cheapest solution is probably to purchase and use Adobe Acrobat 8 or later. PDF/A support is built right into this application.

There are also third-party tools and plug-ins on the market, such as Callas pdfaPilot.

More information

Try this page from the US government or jump to this PDF/A web site.
