OCR Guide: Converting Images to Searchable Documents

Updated Jun 20, 2019
Glasses patent drawing with text converted using OCR

If this isn’t your first visit to this blog, you’ll know that we’re forever touting the benefits of converting your images from raster to vector, and for good reason: you simply can’t get the full potential from your technical drawings while they’re in a raster format.

But converting your images has valuable benefits beyond making your drawings editable. What if your goal is to create a searchable database of the data held within your images? This is where a technology like OCR comes into its own.

If you convert the text within your imagery to text strings, you can begin to catalog your imagery in a searchable database. Once your images are organized in such a system, you simply search for a text string and the relevant image appears. This level of efficiency is possible when you use conversion software that incorporates the power of OCR.

In this article, we’ll explore the process by which you can transform your images into versatile, editable and, most importantly, searchable documents. Let’s get stuck in!



What is OCR and how does it work?

Optical Character Recognition (OCR) in Scan2CAD

OCR stands for Optical Character Recognition. It is the technology that allows computers to detect text within an image and convert it into characters they can process. You can see it in action across the globe, put to use by industries with a range of needs. For example, the cameras that police use to track number plates rely on OCR, as does the software that enables law clerks to search for particular legal cases within a giant database.

OCR uses a number of different techniques, the two most common of which are pattern recognition and feature extraction. The former involves a computer scanning an image and comparing what it finds to a stored collection of fonts, numbers and symbols. While somewhat effective, this approach is limited: the OCR can only recognize fonts it already holds, typically common ones like Times New Roman or OCR-A, a typeface designed specifically for machine reading.
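
To make pattern recognition concrete, here is a minimal Python sketch. It is an illustration only, not how any particular OCR engine works: the tiny 3x3 templates and the names `TEMPLATES`, `match_score` and `recognize` are all made up for this example, and real engines compare far larger bitmaps.

```python
# Hypothetical 3x3 glyph templates ('#' = ink, '.' = background).
TEMPLATES = {
    "T": ("###",
          ".#.",
          ".#."),
    "L": ("#..",
          "#..",
          "###"),
    "I": (".#.",
          ".#.",
          ".#."),
}


def match_score(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += (g == t)
    return same / total


def recognize(glyph):
    """Return the best-matching character and its score."""
    best = max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))
    return best, match_score(glyph, TEMPLATES[best])


# A 'T' with one stray pixel of dirt in the bottom-right corner.
noisy_t = ("###",
           ".#.",
           ".##")

char, score = recognize(noisy_t)
print(char, round(score, 2))  # best match is 'T', agreeing on 8 of 9 pixels
```

A glyph in a font the system has never stored would score poorly against every template, which is exactly the limitation described above.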

Feature extraction, on the other hand, has vastly improved the accuracy of OCR technology. Instead of matching whole letters against stored examples, the computer looks for features that it has learned form, in combination, a particular letter or number. It should recognize, for example, that a short horizontal line sitting on top of a longer, vertical line makes a ‘T’. Using this technique, a system built on neural networks (the basis of deep learning) can even be trained to recognize handwritten text!
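
The ‘T’ example can be sketched in a few lines of Python. This is a deliberately simplified illustration: the feature names and the rules in `classify` are hand-written assumptions, whereas a real system would learn which combinations of features matter.

```python
def longest_run(cells):
    """Length of the longest unbroken run of ink ('#') in a sequence."""
    best = current = 0
    for cell in cells:
        current = current + 1 if cell == "#" else 0
        best = max(best, current)
    return best


def extract_features(glyph):
    """Describe a glyph by its strokes rather than by its exact pixels."""
    height, width = len(glyph), len(glyph[0])
    columns = ["".join(row[x] for row in glyph) for x in range(width)]
    return {
        "top_bar": longest_run(glyph[0]) == width,
        "bottom_bar": longest_run(glyph[-1]) == width,
        "full_stem": any(longest_run(col) == height for col in columns),
        "stem_on_left": longest_run(columns[0]) == height,
    }


def classify(glyph):
    """Map a combination of features to a letter (hand-written rules)."""
    f = extract_features(glyph)
    if f["top_bar"] and f["full_stem"] and not f["bottom_bar"] and not f["stem_on_left"]:
        return "T"
    if f["bottom_bar"] and f["stem_on_left"] and not f["top_bar"]:
        return "L"
    return "?"


small_t = ("###",
           ".#.",
           ".#.")
big_t = ("#####",
         "..#..",
         "..#..",
         "..#..",
         "..#..")

print(classify(small_t), classify(big_t))  # both sizes are classified as 'T'
```

Because the rules describe strokes rather than pixels, the same code handles both sizes of ‘T’ without needing a stored template for each.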


Raster text vs vector text

Comparison of poor quality raster text with vector text string

Raster

Raster images are good for certain purposes. If you want to store high quality photographs, for example, TIFF files are handy because they support a large number of colors and lossless compression, so images retain their quality however many times they are saved.

The issue when it comes to text, however, is that raster images are made up of pixels. And that’s it. Even if a raster image appears to contain text, for all intents and purposes (in other words, from a computer’s perspective) the text is indistinguishable from the imagery because it’s all just pixels. The text isn’t really text and thus it isn’t possible to search for these details within a raster image.

What’s more, data cannot be attached to particular elements of the file, and zooming in or changing scale will result in a reduction in quality of the overall image. All of this is to say that having textual information stored in a raster format is a bad idea.

 

Vector

Vector images are composed of distinct elements, each of which is defined by a mathematical equation. This means that users can edit or attach data to individual components (including text) of a technical drawing.

Because vector text is stored as text, distinct from the surrounding drawing, you can search through it as you would in any other document. There’s also the option of attaching data to the text elements within vector images. You may, for example, add metadata like ‘page title’ or ‘draft number’ to your drawings.
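
As a rough illustration of what this looks like to a program, the Python sketch below models vector text as a small record. The `TextEntity` class, its fields and the `find_text` helper are hypothetical names chosen for this example; they do not correspond to any particular file format.

```python
from dataclasses import dataclass, field


@dataclass
class TextEntity:
    """One piece of vector text: its content, position, size and metadata."""
    content: str
    x: float
    y: float
    height: float
    metadata: dict = field(default_factory=dict)


def find_text(entities, query):
    """Return the text entities whose content contains the query."""
    needle = query.lower()
    return [e for e in entities if needle in e.content.lower()]


drawing = [
    TextEntity("FIG. 2", x=120.0, y=40.0, height=5.0,
               metadata={"role": "figure label"}),
    TextEntity("DRAFT 3", x=10.0, y=10.0, height=3.5,
               metadata={"draft number": 3}),
]

matches = find_text(drawing, "fig. 2")
print(matches[0].content, matches[0].metadata)
```

Nothing comparable is possible with a raster image, where the same label exists only as a cluster of pixels.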

Before you can make the most of this potential versatility, however, you need to convert the text in your images using OCR. 


Why make searchable databases from your images?

Patent drawing of the Cameron EVO BOP (a drill part).

Making your images searchable can save a huge amount of time and effort. Imagine you have a large volume of patent drawings, for example. In such a case, storing them as raster images isn’t efficient at all. What you have is just a collection of pixels—the images do not hold any useful information about their contents. How will you ever be able to locate the image that refers to, say, ‘fig. 2’ when needed?

Enter OCR. When you use OCR to convert the pixels that form text in your image into vector text, you capture information about the image that can be stored in a database. Users faced with tens of thousands of images can then search for that information instead of scrolling through every file.
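
Here is a minimal sketch of that idea in Python, assuming the OCR step has already produced a list of text strings for each image. The file names and strings in `ocr_results` are invented sample data, and `build_index` and `search` are hypothetical helpers rather than part of any product.

```python
from collections import defaultdict

# Hypothetical OCR output: file name -> text strings recognized in that image.
ocr_results = {
    "patent_001.tif": ["FIG. 1", "VALVE BODY"],
    "patent_002.tif": ["FIG. 2", "RAM ASSEMBLY"],
    "patent_003.tif": ["FIG. 2", "VALVE SEAT"],
}


def build_index(results):
    """Map every recognized string, and each word in it, to its files."""
    index = defaultdict(set)
    for filename, strings in results.items():
        for text in strings:
            lowered = text.lower()
            index[lowered].add(filename)
            for word in lowered.split():
                index[word].add(filename)
    return index


def search(index, query):
    """Return the files containing the query, in alphabetical order."""
    return sorted(index.get(query.lower(), set()))


index = build_index(ocr_results)
print(search(index, "fig. 2"))  # ['patent_002.tif', 'patent_003.tif']
print(search(index, "valve"))   # ['patent_001.tif', 'patent_003.tif']
```

The index is built once, so each search is a single lookup however many images are in the collection.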

On a somewhat more serious note, making your images searchable can also provide protection on a legal basis. Take, for example, designs for products. If your work is patented, this needs to be documented and available for others to see, so that they don’t infringe on your designs. Inventors working for large companies like Nike ensure their patented designs are searchable through large online databases. Interested parties can then find the images by performing a simple search on engines like Google Patents.

Aside from benefits to your workflow like increased efficiency and organisation, making your images searchable can also be a savvy business decision. It’s not just easier for you to locate your work—depending on where you store it, it’s also easier for other people to find. This could be great for promoting your services and getting your name or brand out there.


Why you need more than just OCR

There are many simple OCR solutions available which will convert imagery containing only text to fully editable text strings. However, if the imagery you are converting contains elements other than text, you will hit a multitude of problems. OCR software will attempt to convert whatever it is given, so a key part of the solution is identifying what should and shouldn’t be sent to OCR.

Scan2CAD is focused on solutions for converting technical drawings. Scan2CAD’s technology identifies which elements are likely to be text and ‘sends’ only these elements to the OCR. Other elements are vectorized into their appropriate vector entities, which results in a much higher level of OCR accuracy.

Performing OCR on patent drawings

Scan2CAD identifies the areas of the image which are likely to contain text, and converts only those elements using OCR.
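
To give a feel for the problem of deciding what to send to OCR, here is a deliberately crude Python sketch. It is not Scan2CAD’s method, which this article does not describe. It assumes the image has already been split into connected groups of pixels, each summarized by a bounding box, and the `Component` class, the thresholds and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Bounding box of one connected group of pixels (hypothetical input)."""
    x: int
    y: int
    width: int
    height: int


def looks_like_text(c, max_char_height=40, max_aspect=3.0):
    """Crude rule: characters are small and not extremely elongated."""
    if c.height > max_char_height or c.width > max_char_height * max_aspect:
        return False
    longer = max(c.width, c.height)
    shorter = max(1, min(c.width, c.height))
    return longer / shorter <= max_aspect


def split_for_ocr(components):
    """Separate likely text (for OCR) from geometry (for vectorization)."""
    text, geometry = [], []
    for c in components:
        (text if looks_like_text(c) else geometry).append(c)
    return text, geometry


components = [
    Component(x=0, y=0, width=12, height=20),       # a character-sized blob
    Component(x=0, y=50, width=600, height=3),      # a long drawn line
    Component(x=100, y=100, width=300, height=300), # a large circle or part outline
]

text, geometry = split_for_ocr(components)
print(len(text), "text candidate;", len(geometry), "geometry elements")
```

Even this toy rule shows why the task is hard: a thin character such as ‘I’ or ‘1’ is elongated enough to be misrouted as geometry, so a practical tool needs something far more robust than size thresholds.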


Tips to ensure OCR is successful

Optimize for conversion

 

If you want to end up with a high quality conversion, your original document needs to be optimized first. This means making sure the raster text is as clean and clear as possible. Manually erase any dirt or smudges, so that the software doesn’t treat such flaws as part of the actual image. It’s worth running through Scan2CAD’s raster text quality checklist to ensure that the image you want to use is suitable in the first place.
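
Some of this cleanup can be automated. As a hedged illustration, the Python sketch below removes single-pixel specks from a tiny black-and-white image represented as strings. The `despeckle` function and the sample `scan` are invented for this example, and larger smudges would still need the manual attention described above.

```python
def despeckle(image):
    """Clear ink pixels ('#') that have no ink among their 8 neighbours."""
    height, width = len(image), len(image[0])

    def has_ink_neighbour(y, x):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if not (dy or dx):
                    continue  # skip the pixel itself
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width and image[ny][nx] == "#":
                    return True
        return False

    return [
        "".join(
            cell if cell != "#" or has_ink_neighbour(y, x) else "."
            for x, cell in enumerate(row)
        )
        for y, row in enumerate(image)
    ]


scan = [
    "#....",  # an isolated speck of dirt in the corner
    ".....",
    "..##.",  # part of a real stroke
    "..#..",
]

for row in despeckle(scan):
    print(row)
```

The speck in the corner is cleared while the connected stroke is left untouched.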

Please note: if your image contains too many flaws like overlapping characters, text positioned at different orientations, or unusual fonts, successful conversion may not be possible.

Select the right conversion software

As with many things in life, the quality of the end product in text conversion largely depends on the quality of the software you use. Cheap (and even free) conversion programs are available on the internet. We urge you to exercise caution when it comes to these enticing options, though.

If you don’t invest in a legitimate brand, the OCR may not be up to scratch. Issues like text orientation and non-standard fonts can easily stump basic conversion software. This may result in the final product containing exploded text, rather than defined text strings. The former is just a collection of vector lines and curves. In other words, the software has assumed that the letters are mere shapes rather than text. Thus, you will not be able to edit them as text—let alone make them searchable!

So, even when the results produced by cheap online converters initially look promising, closer inspection may reveal otherwise…


Convert your images to searchable documents

 

Using Scan2CAD to convert text and other elements in an image

Scan2CAD is the world’s leading solution for converting technical drawings. Scan2CAD’s powerful OCR capabilities are designed for real-world technical drawings, not simplified text-only images.

Want to give it a try yourself? Learn more about Scan2CAD.

 
