Tech Xpress

Extract Text from Scanned Images of Documents - OCR in Windows

Scanners usually produce image files. The big drawback is that you cannot copy or search the text in a scanner-generated image. This is where OCR tools come in. OCR, or Optical Character Recognition, is a technique for translating images of text into editable text.

Here's how to extract text from an image using OCR utilities in Windows. There are both free and commercial tools for this. Let's look at each.

The Free Option
1. Create a folder named OCR on your D: drive. Of course, you can create it anywhere you want. I'm just using D:\OCR for this demo.

2. Copy the image file (one that contains some text) to the D:\OCR folder. The image can be in any common format, e.g. PNG, TIF, JPG or PNM. For simplicity, let's call the image Demo.png.

3. Download the Gocr.exe Windows binary [The source files of GOCR are available at sourceforge. We only need the Windows binary though]. GOCR, also known as JOCR, is a Free and Open Source OCR engine that creates a text file from an image file.
Copy the downloaded gocr045.exe file to D:\OCR.
As GOCR only accepts image files in PNM format, we need to convert our Demo.png to Demo.pnm.

4. Convert the image to PNM format. You can skip this step if your image file is already in PNM format.
We'll use ImageMagick to convert image files to pnm format.
Download ImageMagick for Windows. Install it.

a. Start the Windows Command Prompt [Start >> Run. Type CMD and press Enter].
Navigate to D:\OCR from the Command Prompt and type:
convert demo.png demo.pnm

[Screenshot: converting a PNG image to PNM using the ImageMagick command line]
This converts the Demo.png image file to Demo.pnm.
ImageMagick can convert JPG, PNG, TIF and many other formats to PNM. It supports conversion from over a hundred formats.
b. Make sure the Demo.pnm file is in D:\OCR.
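
If you have more than one scan to convert, the conversion step can be scripted. What follows is only a rough sketch in Python, not part of the original walkthrough: the function names build_convert_command and convert_folder are made up for illustration, and it assumes ImageMagick is installed and reachable under the command name used above. On newer ImageMagick releases the command is magick rather than convert, and on Windows the name convert can also resolve to a built-in system utility, so you may need to pass a different name or a full path through the exe parameter.

```python
import subprocess
from pathlib import Path


def build_convert_command(image, exe="convert"):
    """Return the ImageMagick command line that converts `image` to PNM.

    `exe` is an assumption: use "magick" or a full path if "convert"
    does not point at ImageMagick on your machine.
    """
    image = Path(image)
    # Same file name, with the extension swapped to .pnm
    target = image.with_suffix(".pnm")
    return [exe, str(image), str(target)]


def convert_folder(folder, exe="convert"):
    """Convert every PNG, JPG and TIF file in `folder` to PNM."""
    converted = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in {".png", ".jpg", ".tif"}:
            # check=True raises CalledProcessError if the conversion fails
            subprocess.run(build_convert_command(image, exe), check=True)
            converted.append(image.with_suffix(".pnm"))
    return converted
```

For the demo folder, calling convert_folder(r"D:\OCR") would run the same command as step a for each image it finds there.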

5. Open the Windows Command Prompt [Start >> Run. Type CMD and press Enter].
Navigate to D:\OCR.
Type gocr045 -i Demo.pnm -o Demo.txt

[Screenshot: using GOCR to extract text from images]
This will create a text file, Demo.txt, with most of the text from the Demo.pnm image file.
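
The GOCR step can be wrapped in a small script too. This is a minimal sketch under a few assumptions: the function names build_gocr_command and ocr_image are hypothetical, the executable is called gocr045 as in this demo, and only the -i and -o options shown above are used.

```python
import subprocess
from pathlib import Path


def build_gocr_command(pnm_image, exe="gocr045"):
    """Return the GOCR command line for one PNM image.

    The text file gets the same name as the image, with a .txt extension.
    """
    pnm_image = Path(pnm_image)
    text_file = pnm_image.with_suffix(".txt")
    return [exe, "-i", str(pnm_image), "-o", str(text_file)]


def ocr_image(pnm_image, exe="gocr045"):
    """Run GOCR on `pnm_image` and return the recognised text."""
    command = build_gocr_command(pnm_image, exe)
    # check=True raises CalledProcessError if GOCR exits with an error
    subprocess.run(command, check=True)
    # The last element of the command is the output text file
    return Path(command[-1]).read_text(errors="replace")
```

With the files from this demo, ocr_image(r"D:\OCR\Demo.pnm", exe=r"D:\OCR\gocr045.exe") would write D:\OCR\Demo.txt and return its contents. Passing the full path to the executable avoids depending on which folder the script is started from.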

The Commercial Option
There are a few options here, but I have read that ABBYY FineReader is a good one. You can download and try it free for 30 days.
[Download ABBYY FineReader]

posted by Vijeesh Ravindran, Monday, October 06, 2008

