OCR for PDF Files—How to Convert Text in Your PDF

Updated Mar 14, 2020

PDFs are one of the most popular file formats around, no matter what industry you’re working in. They’re the perfect way to share and exchange all types of documents. A key reason for this is that you can open them using any standard web browser. Additionally, they can contain a wide variety of graphical information and text. As versatile as PDFs are, however, they’re not infallible. If you’ve got text in your PDF, you’ll have a hard time trying to edit it. The answer, then, is to convert text in your PDF. 

In this article, we’ll go through everything you need to know about converting text in your PDF files, from why it’s necessary to how the process works. We’ll also delve into the importance of OCR technology.




Video: How to convert text in a PDF


Using OCR on PDF files is a common requirement. PDF files contain images just like TIFF files and BMP files and JPEGs, and so on. But PDF files are a little more complex when it comes to using OCR because they can contain raster elements, vector elements, or both. You can see here that I’ve loaded a PDF into Scan2CAD, and when I switch off raster we can see what disappears, and the same when we switch off vector, so we can see that this PDF contains both raster and vector elements. So, let’s load that into the canvas and view just the raster elements. Let’s zoom in. So, we can see this is a raster image, i.e., it’s made up of pixels, and we want to use OCR on this image, converting the text using optical character recognition into fully editable vector text strings. If we now change the view to the vector image, we can see that we already have part of this floor plan in vector format, held within the same PDF. We can see, for example, that we have some text here and so on.

So, what we need to do is convert the raster parts of this PDF into vector, and then automatically combine that with the existing vector, creating a final vector PDF file with all the elements. To do that, we’ll use Scan2CAD. We’ll go to the “vectorize” button. We need to vectorize the image as well as use OCR, because there are elements that need to be converted to vectors and there are also elements that need to be converted to text using the OCR. So, I’m going to use the default settings here and we’ll just click “run”, and we can see it’s complete already. We’ll click the “vector color” button to see what kind of results we have here, and we can see that we have pink, which represents the vector text, where we have bedroom and bathroom and so on. So, the results look very good. So, I’m happy with just the defaults we’ve used there. I’m going to click “OK” now, to save that to the canvas and combine it with the existing vector image. Let’s turn off the raster image now in the view and we can see that we have one complete vector image in which we can edit the text, which was previously raster, to whatever we need. Click “OK”. And we now have fully editable vector text strings from the original raster image.

What is a PDF?

Before we start explaining what OCR is (don’t be too intimidated if it’s an unfamiliar term), we’re going to delve into the PDF file format: what makes it so special and why it can be tricky to work with.

Believe it or not, PDFs have been around since 1993! Yes, they’ve had a pretty lengthy run so far. As we’ve said, the PDF file format is quite easily one of the most versatile and ubiquitous around. With most file formats, there are limitations with compatibility. For example, you might find it difficult to open a TIFF file that someone has sent you. PDFs, by comparison, have the edge—they can be opened on virtually any device. Not only that, but they display documents in the same manner, no matter what you’re opening them with—a far cry from formats like Microsoft Word’s .doc.

One of the most intriguing aspects of the PDF format, however, has to be its ability to support both raster and vector elements. But what exactly is the difference between raster and vector? Let’s take a look…

Raster text

Raster text is made up purely of pixels—tiny squares of color that become more apparent as you change the size of the text. The issue here, then, is that raster text has no structure—it’s just pixels. As such, you’ll find it difficult to edit raster text. To do so, you’d have to use a paint brush or erase specific sections. It’s essentially like painting over an entire canvas—you can’t make changes to individual sections. 

If that wasn’t enough to give you trouble with text in your PDF, raster text comes with an even wider range of issues. You’ll find it impossible to combat pixelation when attempting to zoom into or resize your raster text. Additionally, you won’t be able to attach any data to your text—or edit it within CAD software.

A floorplan saved in .TIFF format with the labels "Bedroom" and "Bathroom"

This raster version of a floorplan is neither editable nor scalable.

Vector text

Vector text, on the other hand, is in a world of its own. If you’re looking to edit individual elements of your text, vector is the way to go.

Each element within vector text is mathematically defined. This means that it’s infinitely scalable—you can zoom in or resize as much as you’d like, with absolutely no impact on the text. As a result, you won’t have to worry about any degradation in quality. 

And that’s not all. Vector text is also incredibly easy to edit. Unlike raster text, you can change individual elements within your vector text. So, if you’ve got a typo or you’re looking to add more text, you can do it quickly and efficiently. Even more handily, you can easily take the elements you like and reuse them in other drawings or PDFs. 

Vector floorplan

Meanwhile, you can edit this vector version of a floorplan in CAD software.
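
To make the raster/vector difference concrete, here is a minimal Python sketch. It is a toy illustration, not how a PDF or Scan2CAD stores data: the 3x3 bitmap, the two line segments and the function names are all invented for this example. It enlarges the same "+" shape both ways.

```python
# Toy comparison of raster and vector representations (illustrative only).

# Raster: a 3x3 bitmap of a "+" shape; 1 = ink, 0 = background.
raster = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

def scale_raster(bitmap, factor):
    """Enlarge a bitmap by repeating each pixel (nearest-neighbour scaling)."""
    scaled = []
    for row in bitmap:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled

# Vector: the same "+" as two line segments defined by coordinates.
vector = [((0.0, 1.0), (2.0, 1.0)), ((1.0, 0.0), (1.0, 2.0))]

def scale_vector(segments, factor):
    """Scale every endpoint; the shape is still two exact lines."""
    return [tuple((x * factor, y * factor) for x, y in seg) for seg in segments]

big_raster = scale_raster(raster, 4)
big_vector = scale_vector(vector, 4)

# The enlarged raster is made of 4x4 blocks: the pixels become visible.
for row in big_raster:
    print("".join("#" if p else "." for p in row))

# The enlarged vector is the same two lines with larger coordinates.
print(big_vector)
```

The raster version gains no detail when enlarged, it only gets blockier, while the vector version remains two exact lines at any size.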


Raster text vs. Vector text

Choosing whether to convert raster text to vector all depends on what type of PDF drawings you’re using. If, for example, you’re sharing technical drawings in the PDF file format, you’ll probably need them to be editable in CAD software. To do so, you’ll need your raster text in a vector format.

Why you should avoid raster text in your PDF files…

  • It can’t be edited—making it a pain if you discover a typo or you realize you need to add more text.
  • It’s resolution dependent—if you decide to change the scale, you’ll have to deal with pixelation.
  • You can’t attach any additional data.

Why you should use vector text in your PDF files instead…

  • It can be edited—whether it’s to shorten your text or elaborate on a point, it’s as easy as pie!
  • It’s infinitely scalable—you don’t have to worry about your text losing quality. 
  • You can attach additional data to your text object.

Needless to say, using vector text in your PDF files will make your life easier—cutting your workload in half, if you end up needing to make alterations to any of your PDF files. So, if you’ve got a PDF file containing raster text, you’ve come to the right place—we’ll show you how you can use Scan2CAD’s OCR capabilities to convert text in your PDF.

Can’t I just use an online converter?

You’re probably wondering if it might not just be simpler to use an online converter—rather than use Scan2CAD and learn what OCR actually is. We’ll make it short and sweet: you’d be better off avoiding online converters altogether. 

Most conversion software you’ll come across will struggle when it comes to raster text. In most cases, they’re not advanced enough to differentiate between text and images. Your text will instead be converted to simple vector shapes—like angles and arcs. This is otherwise known as exploded text: 

Exploded text

Example of exploded text. The characters are formed of vector shapes rather than actual text.

Not only is this not actual text, it’s also incredibly difficult to edit. This makes the entire conversion process redundant. Instead, you want a converter that will give you text strings, as Scan2CAD does: 

Example of vector text strings. This is the desired result of vectorization because they can be edited and displayed correctly.
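
The difference between exploded text and a text string can be sketched as two data structures. This is a simplified Python illustration: the `Line` and `TextString` classes and their fields are hypothetical, chosen for this example, and do not reflect Scan2CAD's internals or any particular CAD file format.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Line:
    """A bare vector line: geometry only, no idea which letter it belongs to."""
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass(frozen=True)
class TextString:
    """A text entity: the characters themselves plus placement data."""
    content: str
    x: float
    y: float
    height: float
    rotation_deg: float = 0.0

# "Exploded" letter L: two strokes. Nothing here says "this is an L".
exploded_l = [Line(0, 0, 0, 2), Line(0, 0, 1, 0)]

# A label stored as a real text string (with a deliberate typo).
label = TextString(content="Bedrom", x=10.0, y=5.0, height=2.5)

# Fixing the typo is a one-field change; position and height are untouched.
fixed = replace(label, content="Bedroom")
print(fixed)
```

With exploded text, correcting the same typo would mean deleting and redrawing individual strokes, because the software has no record of which letters those strokes form.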

Scan2CAD is market-leading raster-to-vector conversion software, so you can be sure it has what it takes to convert text in your PDF to vector. Using OCR technology, Scan2CAD can produce impeccable results.


What is OCR?

Optical Character Recognition (OCR) is the technology that enables Scan2CAD to detect raster text in your file and convert it to vector text. It might sound simple, but it can be more difficult depending on the situation.

For starters, there are hundreds of font styles out there—making it pretty tricky for some computers to recognize letters. Take the example of the letter ‘g’ below…

Lowercase letter g in six fonts

Easy, right? Wrong. Computers find it difficult to figure out what the image represents. You’ve probably come across this problem before if you’ve ever attempted to convert text with an online converter—the converter, more often than not, will give you a garbled output, as a result of it not recognizing your text. 

And that’s where OCR technology saves the day! OCR software is trained to recognize the shape of each letter, so it can detect those letters when they show up in any image. More advanced OCR works on a feature detection basis: rather than comparing whole letter images, it looks for the strokes, loops and intersections that define a character, which helps it cope with different fonts.
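
As a rough illustration of shape recognition, the Python sketch below uses simple template matching: it compares a small glyph against stored letter shapes and picks the closest one. This is a deliberately basic technique, and it is not a description of how Scan2CAD's OCR works. The 3x5 templates and function names are made up for this example.

```python
# Known letter shapes on a 3x5 grid ("#" = ink, "." = background).
TEMPLATES = {
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "T": ["###",
          ".#.",
          ".#.",
          ".#.",
          ".#."],
}

def distance(a, b):
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))

def recognize(glyph):
    """Return the template letter whose pixels differ least from the glyph."""
    return min(TEMPLATES, key=lambda letter: distance(glyph, TEMPLATES[letter]))

# A scanned "T" with one stray pixel of noise to the right of the stem.
noisy_t = ["###",
           ".#.",
           ".##",
           ".#.",
           ".#."]

print(recognize(noisy_t))  # prints "T"
```

Template matching like this breaks down quickly when the font, size or rotation changes, which is one reason practical OCR relies on more robust features than raw pixel comparison.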

Of course, as groundbreaking as OCR is, that’s not to say it can work with any raster text, which is why we created our own raster text quality checklist.


Get the best conversion result

Before we get into the thick of Scan2CAD’s conversion process, we first have to look at the ways in which you can increase your chances of a successful conversion. To put it simply, you need to ensure that your drawing is viable. If your text suffers from any of the following issues, you’re probably going to struggle with getting a decent output…

Poor image quality for raster to vector conversion

Raster images with any of these problems are unlikely to convert successfully.

As incredible as Scan2CAD’s OCR technology is, it still has limitations. As such, you should try to ensure you’re working with a high quality PDF—the raster text should be clean and crisp. Additionally, you need to meet a few conditions…

Is your text legible?

If the raster text in your PDF file is of poor quality, you’re probably better off replacing it entirely. If you’re unable to read it, for example, how do you expect the OCR technology to do so? To remedy this, you can retype your text manually in Scan2CAD. 

Are the text characters touching?

OCR software generally struggles to recognize text characters that are touching. If you don’t want to have to manually retype your text, you can use software like Scan2CAD, which will automatically split the touching characters during the OCR process.
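
To see why touching characters cause trouble, consider a very simple way of separating letters: looking for blank columns between them. The Python sketch below is an assumption-laden toy (tiny hand-drawn bitmaps, a hypothetical `split_columns` helper) and is not Scan2CAD's splitting method.

```python
def split_columns(bitmap):
    """Split a text-line bitmap into (start, end) column ranges
    wherever a completely blank column separates the ink."""
    width = len(bitmap[0])
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a character begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends it
            start = None
    if start is not None:
        segments.append((start, width))    # character runs to the right edge
    return segments

# "L" and "T" with a blank column between them.
separate = ["#...###",
            "#....#.",
            "###..#."]

# The same letters pushed together so that no blank column remains.
touching = ["#..###",
            "#...#.",
            "###.#."]

print(split_columns(separate))  # two segments, one per letter
print(split_columns(touching))  # a single merged segment
```

When the letters touch, this naive approach sees one wide blob instead of two characters, so a recognizer has nothing sensible to match against. Splitting such blobs correctly needs a smarter strategy than searching for gaps.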

Is the text written over other drawing elements?

It can be difficult for Scan2CAD to recognize raster text in a PDF if it’s written over drawing elements. Similarly, Scan2CAD will struggle if your text is underlined or inside a box. 

Is the text at more than one orientation?

If you’ve got text displayed in your PDF at different orientations, it will be more challenging to use OCR than on a file where all the text is horizontal. Fortunately, good OCR software will support text at multiple orientations.

 


Convert text in your PDF with Scan2CAD

Prepare your image

To ensure you get the best possible output, you should clean up your image before you begin the conversion process. Scan2CAD’s got its own suite of ‘Raster Effects’ that can help you remove any image distortion and clean image noise. 
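
As an example of what cleaning image noise can mean in practice, the Python sketch below removes isolated specks, that is, ink pixels with no neighbouring ink. It is a generic despeckle routine written for illustration, with an invented `despeckle` function and a tiny sample bitmap. It is not the implementation behind Scan2CAD's Raster Effects.

```python
def despeckle(bitmap):
    """Remove ink pixels that have no ink in any of their 8 neighbours."""
    height, width = len(bitmap), len(bitmap[0])

    def ink(y, x):
        # Pixels outside the image count as background.
        return 0 <= y < height and 0 <= x < width and bitmap[y][x] == "#"

    cleaned = []
    for y in range(height):
        row = ""
        for x in range(width):
            neighbours = sum(
                ink(y + dy, x + dx)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            # Keep ink only if it is connected to other ink.
            row += "#" if ink(y, x) and neighbours > 0 else "."
        cleaned.append(row)
    return cleaned

# An "L" shape plus one stray speck from a dirty scan.
noisy = ["#.....",
         "#...#.",
         "#.....",
         "###..."]

for row in despeckle(noisy):
    print(row)
```

The letter survives because its pixels are connected to each other, while the lone speck is dropped. Real scans usually need a size threshold rather than a single-pixel rule, since specks are often larger than one pixel.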

Choose conversion settings

To start the process, you’ll need to click the Vectorize icon.

Once you’ve done so, the Vectorization Settings dialog box will launch. You can choose to select ‘Vectorize‘, ‘OCR‘ or ‘Vectorize and OCR‘, depending on what type of drawing you’re working with. 

If you’re converting text only, then you should select ‘OCR’. If, on the other hand, you’re converting a raster image along with text, then ‘Vectorize and OCR’ is your go-to option. Got a PDF containing vector text and raster text? Not to worry: Scan2CAD can convert the raster text and combine it with existing vector elements. 

No matter which conversion option you pick, you’ll need to look at OCR settings under the ‘OCR’ tab. Here, you need to specify the size and rotation of the text in your image. 

Convert

Once you’re happy with the settings, it’s time to click ‘Run‘. Wait for the conversion process to finish and look at the resulting output in the preview window. Not happy? Alter the settings and run the conversion again until you’ve got the output you require, then click ‘OK‘ to save your vector text.

And that’s all there is to it!
