The Problem with PDF Files

Published Nov 13, 2023 in News

Scan2CAD is a solution for converting files of varying formats to CAD or CNC data. Of all the potentially complex file formats we’ve supported over the years, and continue to support, PDF is by far the biggest pain in the butt. So much so that I have considered writing this post for many years, to plant a flag in the ground for software developers supporting this format.

Why are PDF files such a pain? Let me break it down for you:

PDFs are versatile

…By which I mean, they can contain data, text and imagery of different formats.

In the context of Scan2CAD, we deal with PDFs which can contain both raster and vector imagery, all of which we must appropriately convert to CAD or CNC data. Therefore the versatility of a PDF file brings a huge scope of elements which must be supported ‘out-of-the-box’.

There’s no true PDF standard

…I’m oversimplifying with this statement. There is a published specification (ISO 32000) and a known way to support PDF files. But the problem is that, in practice, the encoding can be slightly (or sometimes radically) different for each PDF.

There are tens of thousands of applications and hardware devices that create PDF files. Each may opt to encode a PDF in a slightly different manner. We have spent over a decade iterating upon our PDF interpreter and it’s still not unusual to find a PDF that has been encoded in a novel way which requires further development to support. You can see how much of our change log mentions added support for new PDF files.

PDFs are ‘wrappers’ for raster images

If you have a PDF which contains a raster image, you might tell someone that the image format is ‘.pdf’. That’s not quite the full story. PDFs are just wrappers which may contain any number of different raster image file formats, such as JPEG, TIFF and somewhat antiquated raster file formats such as JBIG2. This again means that quite extensive development is required to support the raster image formats contained within the PDF. In fact, as an example, we developed ‘jbig2codec’, a library dedicated to decoding just one of the raster image formats found within PDFs.
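To make the ‘wrapper’ idea concrete, here is a minimal sketch (not Scan2CAD’s code) that lists which encodings the image objects in a PDF declare. It relies on the standard PDF filter names, for example /DCTDecode for JPEG data and /JBIG2Decode for JBIG2 data. The function name and the regex-based scan are my own assumptions for illustration: the scan only sees objects stored uncompressed in the file, so images referenced from compressed object streams are missed, and a real interpreter would parse the file properly.

```python
import re
from collections import Counter

# Standard PDF stream filter names and the encoding each one implies.
FILTER_TO_FORMAT = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "JBIG2Decode": "JBIG2",
    "CCITTFaxDecode": "CCITT Group 3/4 fax",
    "FlateDecode": "Flate (zlib) compressed samples",
    "LZWDecode": "LZW compressed samples",
    "RunLengthDecode": "run-length encoded samples",
}

# Text between "N G obj" and the first "stream" or "endobj" keyword,
# which is roughly the object's dictionary. This is a naive scan, not a parser.
OBJECT_HEAD = re.compile(rb"\d+\s+\d+\s+obj(.*?)(?:stream|endobj)", re.DOTALL)
IMAGE_SUBTYPE = re.compile(rb"/Subtype\s*/Image\b")
# /Filter is either a single name or an array of names.
FILTER_ENTRY = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")


def embedded_image_encodings(pdf_bytes: bytes) -> Counter:
    """Count the filters declared by image objects visible in the raw bytes."""
    counts = Counter()
    for head in OBJECT_HEAD.finditer(pdf_bytes):
        dictionary = head.group(1)
        if not IMAGE_SUBTYPE.search(dictionary):
            continue  # not an image object
        entry = FILTER_ENTRY.search(dictionary)
        if entry is None:
            counts["no filter (raw samples)"] += 1
            continue
        # Count every filter in the chain; unknown names are reported as-is.
        for name in re.findall(rb"/(\w+)", entry.group(1)):
            key = name.decode("ascii")
            counts[FILTER_TO_FORMAT.get(key, key)] += 1
    return counts
```

Calling `embedded_image_encodings(open("drawing.pdf", "rb").read())` on a scanned drawing (the file name is hypothetical) would typically show one or two encodings, each of which needs its own decoder before the pixels can be used.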

Many PDFs use solid-fill polygons to display the bounds of a page

Sometimes you open a PDF and it appears blank when it’s not. This can happen when the application which created the PDF used a white solid-fill polygon to represent the bounds of the PDF page. So when opening the PDF in some applications, this white polygon overlays the contents of the page, and you get what appears to be an empty page.

To account for this in Scan2CAD, we automatically remove all solid white rectangles contained in the PDF by default, a setting which can be disabled if you so wish.
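As a rough sketch of that kind of filter (an illustration only, not Scan2CAD’s implementation), assume the vector content of a page has already been parsed into simple shape records. The `Shape` class, its fields and the tolerance value below are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

RGB = Tuple[float, float, float]  # colour components in the range 0.0 to 1.0


@dataclass(frozen=True)
class Shape:
    """Hypothetical parsed vector element from a PDF page."""
    kind: str                                # e.g. "rect", "line", "polyline"
    fill: Optional[RGB]                      # None means the shape is not filled
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 in page units


def is_white(color: RGB, tolerance: float = 0.01) -> bool:
    """Treat a colour as white if every component is within tolerance of 1.0."""
    return all(component >= 1.0 - tolerance for component in color)


def drop_white_filled_rects(shapes: List[Shape], enabled: bool = True) -> List[Shape]:
    """Remove solid white rectangles; pass enabled=False to keep everything."""
    if not enabled:
        return list(shapes)
    return [
        shape
        for shape in shapes
        if not (shape.kind == "rect" and shape.fill is not None and is_white(shape.fill))
    ]
```

The `enabled` flag mirrors the idea of a default-on setting that the user can switch off, for the cases where a white rectangle is a genuine part of the drawing.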

This one quirk of PDF files is a good example of the messy nature of the file format.

Hidden text layers are common

There are many situations in which a PDF may contain useless text strings. This can happen when an application (including some document scanner software integrated with hardware) uses OCR, commonly a rudimentary one, to create a hidden text layer representing the text in a raster image.

I don’t really know why applications do this. I suspect it’s just feature bloat on their part: adding a simple OCR to make images searchable.

I don’t think I’ve ever seen a hidden text layer on a technical drawing that is useful. The OCR-created text will usually contain a large number of false positives, forming nonsense information.

What’s worse, when these PDFs are opened in a subsequent application, like Scan2CAD, the user may think that this new application is creating this nonsense text.

For this reason, in Scan2CAD, we added a quick method for deleting all vector text strings from a PDF, if required.
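For the curious, text in a PDF page lives in text objects, which the page’s content stream brackets with the BT and ET operators. The sketch below (my own illustration, not how Scan2CAD does it) strips every such block from a content stream that has already been decompressed. The function name is hypothetical, and the regex is deliberately naive: it assumes BT and ET never appear as standalone tokens inside a string literal or inline image data, which a proper tokenizer would have to handle.

```python
import re

# A text object runs from a standalone "BT" token to the next standalone "ET".
# (?<!\S) and (?!\S) require whitespace, or the start/end of the data, on each side.
TEXT_OBJECT = re.compile(rb"(?<!\S)BT(?!\S).*?(?<!\S)ET(?!\S)", re.DOTALL)


def strip_text_objects(content_stream: bytes) -> bytes:
    """Return the content stream with all BT ... ET text objects removed."""
    return TEXT_OBJECT.sub(b"", content_stream)
```

Everything outside the text objects, such as the path-drawing operators that make up the vector linework, passes through untouched, so the drawing itself is preserved while the OCR noise disappears.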

Conclusion

PDF files are a pain in the butt if you wish to extensively support them beyond a simple viewer. But they aren’t going anywhere. So software developers will instead rant in a blog post.
