WHAT IS OCR (OPTICAL CHARACTER
RECOGNITION)
OCR & ICR - IMT Magazine
(Issue 2 - 06)
By: IMT Staff
What Is OCR?
Optical Character
Recognition (OCR) is a process of converting printed materials into text or
word processing files that can be easily edited and stored. The technology
has enabled such materials to be stored using much less storage space than
the hard copy materials. OCR technology has made a huge impact on the way
information is stored, shared and edited. Prior to Optical Character
Recognition, if someone wanted to turn a book into a word processing file,
each page would have to be typed word for word.
OCR technology requires
both hardware and software. In addition, sophisticated OCR systems require
an additional circuit board in the computer itself to complete the process.
An optical scanner scans the text on a page, then breaks the image down into
a series of dots called a bitmap. The software, which can read most common
fonts and distinguish where lines start and stop, then translates this bitmap
into computer text.
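As a rough illustration of the bitmap step, the sketch below thresholds a tiny grayscale "page" into dots and then finds where lines of text start and stop by looking for rows that contain ink. The function names, the threshold of 128, and the sample pixel values are assumptions made for this example; real OCR software uses far more elaborate methods.

```python
def to_bitmap(gray, threshold=128):
    """Turn grayscale rows (0 = black, 255 = white) into a bitmap of 0/1 dots."""
    # A pixel darker than the (assumed) threshold counts as ink.
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def find_text_lines(bitmap):
    """Return (start_row, end_row) pairs, end exclusive, for each run of inked rows."""
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a text line begins here
        elif not has_ink and start is not None:
            lines.append((start, y))       # a blank row ends the text line
            start = None
    if start is not None:                  # text runs to the bottom of the page
        lines.append((start, len(bitmap)))
    return lines


# Hypothetical 7-row page containing two lines of "text".
page = [
    [255, 255, 255, 255],
    [255,  10,  20, 255],
    [255,  15, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [ 30, 255,  40, 255],
    [255, 255, 255, 255],
]

bitmap = to_bitmap(page)
text_lines = find_text_lines(bitmap)
print(text_lines)  # [(1, 3), (5, 6)]
```

Each pair marks one line of text in the bitmap. A recognizer would then cut every line into individual characters and match them against known font shapes.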
While Optical Character
Recognition has made huge advances in recent years, it still does not
perform well in recognizing handwriting or fonts that look similar to
handwriting. There are systems within the banking industry that use OCR
technology to try to read the amounts on handwritten checks, to go along
with the computer's ability to read the routing and account numbers.
To give an idea of the
power of OCR, let us take a look at a real-world example. Imagine a police
department that has all its criminal records stored in vast file cabinets.
Although scanning millions of pages would be an expensive and time-consuming
undertaking, the benefits are huge. Once the OCR system has converted the
pages into computer-readable text, a detective, for example, could search
through the entire history in a few seconds. Manually finding a particular
record might not be too difficult, but imagine a detective trying to search
for all the crimes committed at a certain intersection between 8 and 8:30.
This example only scratches the surface of the power of searchable text, and
it is only one reason that many companies and institutions are spending
millions of dollars to OCR their legacy data.
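To make the detective's search concrete, here is a minimal sketch that filters records by intersection and time window. It assumes the OCR output has already been cleaned up into structured records; the field names, the sample data, and the `search` function are all hypothetical.

```python
from datetime import time

# Hypothetical records, as they might look after OCR and some manual cleanup.
records = [
    {"id": 1, "location": "5th Ave & Main St", "time": time(8, 10), "text": "Vehicle break-in reported"},
    {"id": 2, "location": "5th Ave & Main St", "time": time(14, 45), "text": "Shoplifting reported"},
    {"id": 3, "location": "Oak St & 2nd Ave", "time": time(8, 20), "text": "Vandalism reported"},
    {"id": 4, "location": "5th Ave & Main St", "time": time(8, 30), "text": "Purse snatching reported"},
]


def search(records, location, start, end):
    """Return records at the given location whose time falls within [start, end]."""
    wanted = location.casefold()  # compare without regard to letter case
    return [
        r for r in records
        if r["location"].casefold() == wanted and start <= r["time"] <= end
    ]


matches = search(records, "5th ave & main st", time(8, 0), time(8, 30))
print([r["id"] for r in matches])  # [1, 4]
```

The same query against paper files would mean reading every record by hand.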
Ideal Source Material
for OCR
OCR works best with
originals or very clear copies and mono-spaced fonts like
Courier.
If you have choices, use the following source material:
- 12 point or greater font size.
- Black text on a white background.
- A clean copy; not a fuzzy multi-generation copy from a copy machine.
- Standard type font (Times New Roman, etc.). Fancy fonts may not be recognized.
- Single column layout.
OCR Limitations
- Using text from a source with font size less than 12 points or from a fuzzy copy will result in more errors.
- Except for tab stops and paragraph marks, most document formatting is lost during text scanning (bold, italic and underline are sometimes recognized).
- The output from a finished text scan will be a single-column, editable text file. This text file will always require spellchecking and proofreading, as well as reformatting to the desired final layout.
- Scanning plain text files or printouts from a spreadsheet usually works, but the text must be imported into a spreadsheet and reformatted to match the original.
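For the spreadsheet case in the last item, the import step might look something like the sketch below. It assumes the scanned columns come out separated by tabs or by runs of two or more spaces, which will not hold for every scan; the function names and the sample text are made up for illustration.

```python
import csv
import io
import re


def ocr_text_to_rows(text):
    """Split each non-empty line into cells at tabs or runs of 2+ whitespace characters."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\t+|\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet program can import."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical OCR output from a scanned spreadsheet printout.
ocr_output = """
Item            Qty    Price
Toner cartridge   2    49.99
Paper, A4        10     4.50
"""

rows = ocr_text_to_rows(ocr_output)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Cells that contain a single space, such as "Toner cartridge", stay intact because only tabs or wider gaps count as column breaks. The result still needs proofreading and reformatting to match the original.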
What Source Material
Doesn't Work Well for OCR?
- Forms (especially with boxes and check boxes)
- Very small text
- Multi-generation fuzzy or blurry copies from a copy machine
- Mathematical formulas
- Draft copies of documents with hand-written revisions
- Fancy text and unusual fonts
- Handwritten text