|
BACK TO
INDEX >>
OPTICAL CHARACTER RECOGNITION IS AN UPHILL BATTLE FOR OPEN SOURCE
OCR & ICR - IMT Magazine
(Issue 2 - 06)
By:
Nathan Willis
If you use
Linux, or another free operating system, and need optical character
recognition (OCR) software, be prepared for a challenge. OCR is a tricky
problem on any computing platform -- both because it is conceptually hard,
and because the task does not lend itself to simple, easy-to-use
interfaces. OCR is the use of visual pattern matching to extract text from
an image -- usually a scanned paper document, but it could be a digital
photo, a frame of video, or a screenshot just as easily.
Proprietary
platforms like Mac OS X and Windows have a few commercial choices, but even
the best of them can be headache-inducing. A few years ago, while working
for a non-profit organization, I had to scan in and convert old documents
using OCR -- out-of-print books, magazine articles, the occasional printout
of a lost digital original. We tried several different products, and found
that even the best of them required thorough proofreading, numerous
corrections, manual tweaking of the input images, and a whole lot of time.
So it was with low expectations that I approached the task of finding a
suitable open source OCR program.
Extracting
text with Kooka
The best of
the lot available today is
Kooka,
the KDE environment's default scanning application. Kooka uses the GNU
project's
Ocrad
as its OCR engine. It supports a wide variety of scanners via the
SANE
library, for when you are acquiring images as you work.
Its
documentation suggests that Kooka can use other OCR engines in addition to
Ocrad, but Kooka offers only one other option through its preferences,
GOCR.
Despite selecting GOCR and restarting Kooka as directed, the application
persisted in using Ocrad. I did verify that GOCR was installed, functional,
and located in /usr/bin as Kooka expected by testing its GTK front end,
gtk-ocr. Kooka just would not use it.
Kooka's
controls give you a lot of leeway over the resolution, brightness, and
contrast of the document you are scanning. For the best results, you want to
tweak the settings to give maximum contrast, eliminating if possible specks
of dust and shadows on the paper. The downside is that excessive contrast in
the scan can wash out the serifs, dots, and thin strokes of the letters,
making them harder for OCR software to distinguish.
In my case,
the images I needed to OCR were already-digitized scans of the user manual
for an old camera system. I chose previously digitized images in order to
test all of the OCR apps I found (some of which do not perform scanning) on
an equal footing. The images were of good quality, and the text was
perfectly readable to start with. To adjust brightness and contrast in these
images I had to open them in an external image editor, which Kooka
conveniently lets you do through the Image menu. You initiate an OCR job
with OCR Image from the Image menu. Kooka presents a dialog from which you
can tweak several settings, including whether the program should attempt to
guess the layout of the document, and whether to enable spell-checking on
the result.
Listing 1
is the result of the first pass with Kooka's OCR, in which all settings were
left at their defaults. Clearly there is much to be desired; among other
things a plethora of stray pixels were picked up and interpreted as
punctuation marks.
Listing 2
is the result of the second pass, after having cleaned up the original image
considerably with the GIMP.
Listing 3
began with the original text from Listing 2, corrected by the spell checker.
Correcting
spelling with Kooka
Spell
checking in Kooka is a manual process; the program cycles through the OCRed
text and (much as in any office app) asks you about every unrecognized word,
suggesting replacements when possible. Kooka uses existing spell-checking
engines, such as
ispell,
aspell,
and
hspell
(for Hebrew).
The problem
with this approach is that none of these spell-checking engines are aware
that the checked text is from an OCR source. Word processing spelling errors
are very different from OCR spelling errors; the former are the results of
mistyping, the latter are the results of machine vision problems. OCR
spell-checking needs to take into account that the mistakes it is looking
for stem from similarities in letter shape, confusion between capital and
lowercase letters, and misplaced word breaks. Ispell and Aspell can suggest
replacement corrections based on dictionary words, but they are of little
help.
Another
feature often found in commercial OCR applications is the ability to mark or
exclude areas of the page from scanning. In some cases, it is possible to
indicate that two or more blocks of text are connected, which can be very
helpful when scanning in magazine layouts. Kooka has the beginnings of
automatic layout detection via the Ocrad engine -- you can tell it to
interpret the document as multi-column or to guess at more complex layouts.
But, in most cases, specifying the layout yourself would be much faster.
Everything
else
By my count,
Kooka was able to correctly identify just 123 of 270 words, or 45.56 percent
-- not an encouraging number. But the pickings are slim for Linux users. I
did examine other applications:
gtk-ocr
and
Clara.
Both are functioning GUI programs, but neither has close to the feature set
of Kooka.
Gtk-ocr is
sparse in its interface and controls. The command-line client may be more
functional, but for a graphically intensive task such as this I would not
recommend it. Clara has a more promising feature list than gtk-ocr, but the
last update to the program was a long time ago, and it uses Xlib for its
interface -- a fact that may frighten away younger users, but elicit
feelings of nostalgia in others with more experience.
A number of
other projects turn up on a
Freshmeat
or
SourceForge
search for "OCR," but most are academic in nature, and not suited for end
users. Given the inherent complexity of tasks involving natural language
processing (OCR, speech recognition, and machine translation, to name a
few), it should come as no surprise that (a) the available tools are just
marginally useful, and (b) research into the subject continues.
I hope that
some of the research being conducted will find its way into a usable open
source OCR application for Linux. Right now, the best alternatives for those
needing to extract text from an image all involve spending money. There are
a few proprietary solutions advertising Linux support (VueScan
and
OCRShop,
for example) and proprietary OCR software for proprietary OSes -- but don't
underestimate the value of paying a typist to transcribe your text the
old-fashioned way.
|