Saturday, 5 July 2014

Managed to install Tesseract OCR tonight

I'm feeling happy. Tonight I managed to install Tesseract OCR on my gently ageing laptop that runs OpenSUSE 12.3.

What are the steps that I need to replicate in case I want to reinstall it?

First, I had to download the following RPM files. Most of them came from the RPM search engine PBone. I picked the build for OpenSUSE 12.x whenever one was available.
  1. tesseract-ocr-eng-3.02-23mgc26.i686.rpm
  2. tesseract-ocr-3.02.02-126.1.i586.rpm
  3. liblept2-1.68-6.2.2.i586.rpm
  4. libwebp2-0.1.3-4.1.4.i586.rpm
  5. libpng14-14-1.4.11-2.5.1.i586.rpm

Then install them in the reverse of the order listed above, so number 5 goes first and number 1 goes last. This is because Tesseract needs liblept, and liblept in turn needs libwebp and libpng. The tesseract-ocr-eng file is the English language data, so it should be installed last, once Tesseract OCR itself is in place.
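For next time, here is a rough sketch of that install order as a small script. It assumes the five RPM files sit in the current directory under exactly the names listed above. The function name install_all and the RUN variable are my own inventions for illustration. By default it only prints the rpm commands so the order can be checked first.

```shell
#!/bin/sh
# Dependencies first, Tesseract next, language data last.
# File names are the ones from the list above; adjust for other versions.
packages="libpng14-14-1.4.11-2.5.1.i586.rpm
libwebp2-0.1.3-4.1.4.i586.rpm
liblept2-1.68-6.2.2.i586.rpm
tesseract-ocr-3.02.02-126.1.i586.rpm
tesseract-ocr-eng-3.02-23mgc26.i686.rpm"

# install_all and RUN are hypothetical names used only in this sketch.
# Dry run by default: print each command instead of running it.
# Set RUN=1 to really install (needs root).
install_all() {
    for pkg in $packages; do
        if [ "${RUN:-0}" = "1" ]; then
            rpm -ivh "$pkg" || return 1
        else
            echo "rpm -ivh $pkg"
        fi
    done
}

install_all
```

Running it as root with RUN=1 set does the actual install and stops at the first package that fails. As far as I know, rpm should also accept all five files in a single command and work out the order itself, but installing one at a time makes it easier to see which dependency is missing.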

I haven't gotten the hang of using Tesseract for OCR yet. But here is a little command I used in Konsole to run OCR on a bunch of JPG files (it will process all JPG files in the directory).

for a in *.jpg; do convert "$a" "$a.tif"; tesseract "$a.tif" "$a" -l eng; rm "$a.tif"; done
rename jpg.txt txt *.jpg.txt

What does it do? First it loops over all the files ending in jpg. Each one is converted to TIF. Then Tesseract runs OCR on the TIF file and writes the text to a file named after the original with .txt added, so photo.jpg produces photo.jpg.txt. Finally the TIF file is deleted.

The second command merely renames every file ending in "jpg.txt" so that it ends in "txt" instead. For example, photo.jpg.txt becomes photo.txt.
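The rename step could probably be skipped by stripping the .jpg ending before handing the name to Tesseract. Below is a minimal sketch of that idea, assuming convert (from ImageMagick) and tesseract are both on the PATH. The function name ocr_one is made up for this example.

```shell
#!/bin/sh
# ocr_one is a hypothetical helper name used only in this sketch.
# It OCRs one JPG and writes photo.txt rather than photo.jpg.txt.
ocr_one() {
    img=$1
    base=${img%.jpg}                  # "photo.jpg" -> "photo"
    # Convert to TIF, then run OCR only if the conversion worked.
    convert "$img" "$base.tif" &&
        tesseract "$base.tif" "$base" -l eng
    status=$?
    rm -f "$base.tif"                 # remove the temporary TIF either way
    return $status
}

for img in *.jpg; do
    [ -e "$img" ] || continue         # no JPGs here: the pattern stays literal
    ocr_one "$img"
done
```

The quotes around the variables mean file names with spaces in them should work too, and the existence check keeps the loop from running on a literal "*.jpg" when the directory has no JPG files.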

Every day is a learning experience, and using Linux is a journey of discovery. I do enjoy using Linux, even if it's not as convenient as Windows. But I've been without Windows for a few years now, and I keep plodding along.