The PDF file format

The average user never needs to look ‘under the hood’ of PDF files. For curious people, this page takes a closer look at the way information is stored in a PDF file.

General conventions

Here is some useful information in case you intend to open PDF files and edit them directly:

  • PDF files are either 8-bit binary files or 7-bit ASCII text files (using ASCII-85 encoding).
  • Every line in a PDF can contain up to 255 characters.
  • Every line ends with a carriage return, a line feed or a carriage return followed by a line feed (depending upon the application or platform used to create the PDF file).
  • PDF is case sensitive.
  • The file format is completely independent of the platform that it is viewed or created on. Files can be moved back and forth between Macs, Windows systems, Linux systems, and so on. When transferring a PDF file via FTP, it does make sense to compress it first, to avoid data corruption by some outdated system that the file needs to pass through.
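
The conventions above can be checked on raw bytes. The following is a minimal sketch rather than a validator: the helper names `is_seven_bit` and `line_ending_styles` are made up for this illustration, and Python's standard `base64.a85encode` is used only to show how 8-bit data can be turned into 7-bit ASCII-85 text.

```python
import base64
import re

def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (plain ASCII)."""
    return all(b < 0x80 for b in data)

def line_ending_styles(data: bytes) -> set[str]:
    """Report which of the three permitted end-of-line markers occur."""
    styles = set()
    if b"\r\n" in data:
        styles.add("CRLF")
    # A lone CR is one not followed by LF; a lone LF is one not preceded by CR.
    if re.search(rb"\r(?!\n)", data):
        styles.add("CR")
    if re.search(rb"(?<!\r)\n", data):
        styles.add("LF")
    return styles

raw = bytes([0x00, 0x9F, 0xFF, 0x10])   # arbitrary 8-bit sample data
encoded = base64.a85encode(raw)         # the same bytes as ASCII-85 text

print(is_seven_bit(raw), is_seven_bit(encoded))
print(line_ending_styles(b"%PDF-1.2\r\n1 0 obj\n<< >>\rendobj\n"))
```

The sample bytes are invented for the example; a real PDF would be read from disk in binary mode before being passed to these helpers.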

PDF file structure

PDF files use a fixed structure. They always contain four sections:

  • A header, which states the version of the PDF specification that the file adheres to. This line looks like this: ‘%PDF-1.2’.
  • The body, which contains a description of the various elements that are placed on the pages.
  • A cross-reference table, which refers to all the elements from the body that are used on the pages of the PDF file.
  • A trailer, which tells applications or RIPs where to find the cross-reference table and always ends with ‘%%EOF’. If this line is missing, the PDF file is not complete and probably cannot be processed by any RIP or application. This is not the case with PostScript files. If the last few lines of a PostScript file are missing (because of a lost connection while transferring the file or a computer crash), you can often still print most of the pages. With a PDF file, you’ll lose everything.
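
As a rough illustration of the four sections, the sketch below builds a toy file in memory and looks for the header and the end of the trailer. It assumes the usual trailer ending of a `startxref` keyword, a byte offset, and ‘%%EOF’. The function name `inspect_pdf_structure` and the sample bytes are invented for this example, and the toy file is far too small to be opened in a real viewer.

```python
import re

def inspect_pdf_structure(data: bytes) -> dict:
    """Locate the header version, the xref offset and the %%EOF marker."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    # The trailer's final lines are: startxref, <byte offset>, %%EOF.
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "xref_offset": int(tail.group(1)) if tail else None,
        "complete": header is not None and tail is not None,
    }

# Toy file: header + one body object, then cross-reference table, then trailer.
body = b"%PDF-1.2\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
trailer = (
    b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
    + str(len(body)).encode("ascii")   # the table starts right after the body
    + b"\n%%EOF\n"
)
sample = body + xref + trailer

print(inspect_pdf_structure(sample))
print(inspect_pdf_structure(sample[:-10])["complete"])  # truncated file
```

Cutting off the last few bytes removes the ‘%%EOF’ line, so the second call reports the file as incomplete, which mirrors the behaviour described in the last bullet.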

PDF imaging model

Objects that are placed on PDF pages are called ‘marks’. The page surface is called the ‘canvas’.

  • A coordinate system is used to define where each mark is placed. By default, coordinates are defined in points (72 units per inch) but this measurement system can be redefined within a PDF. The origin is in the bottom left but this 0,0 coordinate can also be redefined. This flexible coordinate system is called ‘User Space’. Later, when a PDF is sent to an output device such as a printer, the RIP needs to convert everything to ‘Device Space’, the coordinate system of the output device.
  • Marks can have a number of characteristics:
    • a fill
    • a stroke
    • a color, which can be defined in one of the color spaces that PDF supports (11 in the most recent versions)
    • a certain level of transparency (from PDF 1.4 onwards)
  • All graphic objects are either
    • paths – shapes made out of lines, curves and/or rectangles
    • text
    • bitmap images
    • Form XObjects – which are reusable elements
    • PostScript language fragments – worst case
  • Some of the content of a PDF can be optional content: marks that can be selectively viewed or hidden. In Acrobat, optional content is somewhat misleadingly referred to as ‘layers’. Some practical examples of this:
    • Maps that contain a coordinate grid that can be activated/deactivated
    • Brochures with text in multiple languages
    • Packaging files with the die-cut and embossing information as separate optional content.
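
The conversion from User Space to Device Space mentioned above can be sketched for the default case. This example assumes the default unit of 1/72 inch with a bottom-left origin, and a hypothetical raster device whose origin is in the top left and whose resolution is given in dots per inch. The function name `user_to_device` is made up for the illustration. A real RIP applies a general transformation matrix rather than this simplified scale and flip.

```python
def user_to_device(x_pt: float, y_pt: float,
                   page_height_pt: float, dpi: float) -> tuple[float, float]:
    """Map default User Space points to device pixels (top-left origin)."""
    scale = dpi / 72.0                      # device units per point
    x_dev = x_pt * scale
    y_dev = (page_height_pt - y_pt) * scale  # flip the vertical axis
    return (x_dev, y_dev)

# A mark one inch from the left and bottom edges of a 612 x 792 pt page,
# sent to an assumed 144 dpi device.
print(user_to_device(72, 72, 792, 144))  # (144.0, 1440.0)
```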

Modifying data in a PDF

If data are appended to a PDF file (for instance because the user edited text in Adobe Acrobat and saved the file again, or because PDF files were merged), another body area, cross-reference table, and trailer are added to the end of the file. This bloats the PDF file. By opening the file in Acrobat and using ‘Save as’, you force the application to clean up the PDF so that it contains only a single set of data areas. The same is true when deleting pages in a PDF: only when you force the application to generate a new file will it clean up unused data.
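
Because every appended section brings its own trailer, counting ‘%%EOF’ markers gives a rough hint of how often a file has been updated in this way. The sketch below is a heuristic only: the marker could also occur by chance inside binary data, the helper names are invented for this example, and the sample bytes use `...` as placeholders rather than real PDF syntax.

```python
def count_eof_markers(data: bytes) -> int:
    """Count %%EOF markers; each appended section normally ends with one."""
    return data.count(b"%%EOF")

def first_revision(data: bytes) -> bytes:
    """Return the bytes up to and including the first %%EOF marker."""
    marker = b"%%EOF"
    end = data.find(marker)
    if end == -1:
        raise ValueError("no %%EOF marker found")
    return data[: end + len(marker)]

# Placeholder content standing in for an original file and one appended update.
original = b"%PDF-1.4\n...body...\nxref\n...\ntrailer\n...\nstartxref\n0\n%%EOF\n"
update = b"...new body...\nxref\n...\ntrailer\n...\nstartxref\n0\n%%EOF\n"
edited = original + update

print(count_eof_markers(original))  # 1
print(count_eof_markers(edited))    # 2
```

A count above one suggests that the file carries appended sections and might shrink after a ‘Save as’.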

One thought on “The PDF file format”

  1. Thank you for a very good summary!
    PDF 2.0 (ISO 32000-2 / 2017) introduced UTF-8 encoded strings as an additional format for text strings.
