Pre-processing forms is essential
By Dana Allen, President
Sequoia Data Corporation
Forms were developed as a low-cost method for transmitting information. Although they are easy for humans to read, they are much more complicated images for computers to process.
Forms vary and must be dealt with differently depending on the individual characteristics of the form being processed. The paper is frequently shaded, and text on the back may bleed through to the front. Forms also contain lines, instructional text, and corporate logos or objects which interfere with the information the user wants to extract. The form being scanned is often a photocopy or carbon copy of the original, and therefore the scanned information on the form can be faint or damaged, and may have extraneous marks on the image. As a result of these issues, it is impossible to scan a form, send it to a recognition engine, and receive any meaningful data without pre-processing. Sequoia Data specializes in software solutions for pre-processing images in order to make automated forms processing practical.
Problems with Skew
One of the most significant hindrances to OCR accuracy is skew. Typically, scanning will result in 1% to as much as 15% skew. This skew (a) makes it difficult to recognize and process the form, because the information the computer is looking for is shifted out of the OCR zone; (b) results in unacceptably high OCR error rates; and (c) can lead to data from different fields being mixed together if lines of data are tightly packed, because the OCR engine reads part of one line and part of another as one field.
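To make the skew figures above concrete, the following is a minimal sketch of how far a text line drifts vertically across a page. It assumes that "percent skew" means slope (rise over run), which the article does not state, and it uses a hypothetical page 8.5 inches wide scanned at 300 dpi with six text lines per inch. The function names are illustrative only.

```python
def skew_drift_pixels(line_width_px: float, skew_percent: float) -> float:
    """Vertical drift of a text line across its width.

    Assumption: skew is expressed as a slope (rise/run) in percent.
    """
    return line_width_px * skew_percent / 100.0


def line_pitches_crossed(line_width_px: float, skew_percent: float,
                         line_pitch_px: float) -> float:
    """Drift expressed in units of the spacing between text lines."""
    return skew_drift_pixels(line_width_px, skew_percent) / line_pitch_px


if __name__ == "__main__":
    width_px = 8.5 * 300   # hypothetical: 8.5 in page at 300 dpi
    pitch_px = 300 / 6     # hypothetical: six text lines per inch
    for pct in (1, 15):
        drift = skew_drift_pixels(width_px, pct)
        pitches = line_pitches_crossed(width_px, pct, pitch_px)
        print(f"{pct}% skew: {drift:.1f} px drift, {pitches:.2f} line pitches")
```

Under these assumptions, even 1% skew moves the end of a line about half a line pitch away from where a fixed OCR zone expects it, and 15% skew carries it across several neighboring lines, which is how data from different fields ends up mixed together.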
So the first step to facilitate forms processing is to correct the skew and register the image to provide consistent positioning. Sequoia's ScanFix performs this function, and also automatically locates and removes lines, dot-shaded backgrounds, specks, and random "noise," and corrects inverse text. Forms typically contain shaded backgrounds that interfere with the data. This background needs to be removed to enable OCR or other recognition technologies to work effectively. If the image contains complex backgrounds that cannot be easily removed with ScanFix, GrayFix, Sequoia's dynamic grayscale thresholding software, will automatically remove shading and convert the image to black and white.
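The idea behind dynamic thresholding can be illustrated with a generic local-mean method. This is a sketch of the general technique only, not a description of how GrayFix works: the functions, the window radius, and the offset are all assumptions chosen for illustration. Each pixel is compared with the average brightness of its own neighborhood rather than with one fixed cutoff, so a background that gets darker across the page does not turn into ink.

```python
def global_threshold(gray, cutoff=128):
    """Fixed cutoff: every pixel darker than `cutoff` becomes ink (1)."""
    return [[1 if v < cutoff else 0 for v in row] for row in gray]


def local_threshold(gray, radius=1, offset=10):
    """Local-mean thresholding on rows of 0-255 values (0 = black).

    A pixel becomes ink (1) when it is darker than the mean of its
    (2*radius+1) x (2*radius+1) neighborhood by more than `offset`.
    """
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = count = 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += gray[yy][xx]
                    count += 1
            if gray[y][x] < total / count - offset:
                out[y][x] = 1
    return out


# Hypothetical test image: background shading darkens from left to right,
# with two ink pixels (at columns 1 and 5) in the middle row.
shaded = [
    [200, 180, 160, 140, 120, 100, 90],
    [200, 100, 160, 140, 120, 20, 90],
    [200, 180, 160, 140, 120, 100, 90],
]

if __name__ == "__main__":
    for row in global_threshold(shaded):
        print("global:", row)
    for row in local_threshold(shaded):
        print("local: ", row)
```

On this small example the fixed cutoff marks the dark right-hand background as ink, while the local method keeps only the two ink pixels. Real images need a larger window and tuning of the offset.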
Once the image is aligned and de-skewed, software can automatically identify and remove the form, leaving the variable data behind. Although this sounds simple, there are still challenges to address. For example, data may overlap lines or instructional text, or a key signature may be scrawled across form text. Sequoia's FormFix accurately identifies these problems, removes the lines and instructional text, and reconstructs damaged characters in the data left behind. Clean, recognizable characters remain that can then be passed to the OCR engine to be converted to usable data.
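The form-removal step, and the damage it can cause where data crosses a form line, can be sketched with simple template subtraction. This is a minimal illustration under stated assumptions, not FormFix's method: it assumes a blank copy of the form is available as a registered black-and-white bitmap, and the repair rule (restore a removed pixel when data ink sits directly above and below it) only handles one-pixel-thick horizontal lines. The function name is hypothetical.

```python
def remove_form(filled, template):
    """Drop form pixels from a registered bitmap, keeping variable data.

    `filled` and `template` are equal-sized rows of 0/1 values (1 = ink).
    """
    h, w = len(filled), len(filled[0])

    # Step 1: keep only ink that is not part of the blank form.
    data = [[1 if filled[y][x] and not template[y][x] else 0
             for x in range(w)] for y in range(h)]

    # Step 2: repair strokes cut by the removed form. A removed pixel is
    # restored when data ink lies directly above and directly below it.
    repaired = [row[:] for row in data]
    for y in range(1, h - 1):
        for x in range(w):
            if template[y][x] and filled[y][x] and data[y - 1][x] and data[y + 1][x]:
                repaired[y][x] = 1
    return repaired


# Hypothetical example: a horizontal form line on row 2, and a vertical
# pen stroke in column 3 that crosses it.
template = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
filled = [
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

if __name__ == "__main__":
    for row in remove_form(filled, template):
        print(row)
```

Without the repair step, the stroke would be left with a gap where the line was removed, which is the kind of damaged character an OCR engine tends to misread.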
With documents that vary greatly, FormFix's IDL (Intelligent Document Logic) makes it possible to identify and extract specific objects on forms, such as signatures, names, and addresses. It can also identify encoded data on checks, stamps on deeds, logos, or even part numbers from an engineering drawing. By grouping these objects, IDL enables automatic processing of this data.
Sequoia's software, which works either individually or in combination with a wide variety of platforms and operating systems, takes care of the difficult pre-processing of forms. Without pre-processing, forms recognition rates are poor, unwanted data gets captured, and fields are incomplete. With pre-processing, the results are faster and more accurate data recognition, enhanced images, and up to 90% reduction in file sizes, leading to lower storage needs and faster network performance. Pre-processing is an essential ingredient of successful forms processing.
Dana Allen is the president of Sequoia Data Corporation (Burlingame, CA 415-696-8750; Fax 415-696-8755). Sequoia Data specializes in image enhancement and forms processing technologies. ScanFix, FormFix and GrayFix are registered trademarks of Sequoia Data Corporation.
IW Special Supplement, March 1996 SX