OCR is an acronym for Optical Character Recognition. This is a process whereby text is extracted from a scanned image and converted into a plain text file. OCR has been available for some time on other machine platforms and accuracy has been steadily improving.
Sleuth 3 is the latest member of the Ovation Suite, which is a suite of software designed to complement the Ovation Pro Desktop Publishing System. Sleuth 3 can be used with the majority of scanners available for 32-bit RISC OS computers. This version has most of the features of the full version, with the exception that you cannot save the resulting text.
Loading Sleuth
To load Sleuth, double-click over the application's icon and it will be installed on the icon bar. If you click Select over the icon bar icon an empty window will be opened; the application now requires a sprite to process.
Loading a sprite
To load a sprite drag the sprite's icon onto the icon bar icon or into the open window. Only one sprite can be loaded at a time; if a further sprite is loaded it will replace the current sprite. A new sprite cannot be loaded if the package is currently converting a sprite. If a sprite file contains more than one sprite only the first sprite will be displayed. Most scanners, especially those using a TWAIN driver (see below), will output monochrome or greyscale sprites suitable for use with Sleuth.
Sleuth can automatically load sprites that have been squashed using !Squash.
Greyscale sprites
Sleuth will accept greyscale sprites and convert them into monochrome sprites. Using the Input and processing preferences dialogue box, the user can determine the levels of grey converted to white in this process. A Pre-sharpening facility is also provided that will generally improve the quality of degraded text during the conversion.
Scanning using TWAIN
Sleuth supports TWAIN to allow direct scanning into the package. To use this facility you must have a copy of TWAIN, an appropriate scanner and scanner driver.
TWAIN supports the Select and Acquire options on the submenu. Choosing Select opens a dialogue box that allows you to choose the scanner driver source that you require. Choosing Acquire allows you to set up the scanner before scanning. Sleuth will only accept images that are monochrome, so ensure that the monochrome option is selected. Full information about these options is provided with TWAIN.
The Scan and OCR... option opens the same dialogue box as the Acquire option, but the OCR process will start as soon as the image is acquired. This is useful when you want to scan the same area from more than one page. When using the Scan and OCR... option you must do a full scan rather than a preview.
option allows you to save the scanned sprite.
Please note that !TWAIN must be seen before Sleuth can use these options.
Configuring Sleuth
Before beginning the OCR process, it is important that Sleuth is configured correctly for your needs. The Preferences dialogue box is made up of two separate sets of options; these are accessed through the Output and Input and processing options.
Output preferences
The Reject character is the character used by Sleuth when, in its estimation, there is no equivalent character in the set of characters that it knows, or it is unsure what the correct character is. The Reject character should be set to an infrequently used character. If you change this character and your keyboard does not contain the default ~ character, hold down the Alt key and enter 126 on the numeric keypad.
The End-of-line string, End-of-paragraph string, Smart quotes, Ligatures and Remove hyphens options relate to how the output is saved if it is saved as text.
End-of-line string
option can be set to any character. A carriage return is represented by
escape sequence, a line feed by
. These can be used separately or in combination e.g.
. You can also enter a space if you want to reformat the text in another package. These characters will be inserted at the end of each line of text converted by Sleuth.
The End-of-paragraph string option uses the same escape sequences as the End-of-line option. Sleuth will insert these characters at the end of each paragraph.
The Smart quotes option will cause Sleuth to recognise smart quotes (“ ” and ‘ ’) if they are present in the image. Otherwise they will be converted into ordinary quotes " and '.
Choosing the Ligatures option will replace the pairs of characters fi and fl with the single character ligatures ﬁ and ﬂ.
The Remove hyphens option, if selected, will remove hyphens at the end of lines. It will also remove the End-of-line string so that hyphenated words are joined up correctly.
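As a rough illustration of how the text output options interact, the sketch below joins recognised lines into paragraphs. It is not Sleuth's own code: the function name `format_text_output`, the list-of-lines input, the Python strings used for the two settings, and the rule that the last line of a paragraph receives only the end-of-paragraph string are all assumptions made for the example.

```python
def format_text_output(paragraphs, end_of_line=" ", end_of_paragraph="\n",
                       remove_hyphens=True):
    """Join OCR'd lines into one string (hypothetical helper, not Sleuth's code).

    paragraphs: a list of paragraphs, each a list of line strings.
    """
    pieces = []
    for lines in paragraphs:
        for index, line in enumerate(lines):
            is_last = index == len(lines) - 1
            if is_last:
                # Assumption: the final line of a paragraph gets only the
                # end-of-paragraph string.
                pieces.append(line + end_of_paragraph)
            elif remove_hyphens and line.endswith("-"):
                # Drop the trailing hyphen and omit the end-of-line string
                # so that the two halves of the word are joined up.
                pieces.append(line[:-1])
            else:
                pieces.append(line + end_of_line)
    return "".join(pieces)


sample = [["The quick brown", "fox jum-", "ped over"], ["Next paragraph"]]
result = format_text_output(sample)
```

Note that a blanket rule like this would also join a genuinely hyphenated compound that happens to break at the end of a line; how Sleuth treats that case is not stated here.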
As these settings are read just before the converted text is saved, it is possible to change these settings after the image is converted to text. When saving in RTF format the Remove hyphens process is done automatically and the End-of-line string and End-of-paragraph string are ignored. The settings for Ligatures and Smart quotes will be honoured.
The RTF fonts options allow the user to map fonts present on the system to the generic font types (serif, sans serif and mono-spaced) recognised by Sleuth. These mappings only apply when the output is saved in RTF format.
When using Sleuth Batch mode you can set what file format is used for the output files. The types are Editable (which allows the results of the OCR to be edited at some later date), plain text and RTF. A further option, Append, allows you to concatenate the results of OCRing the files in the batch. These options must be set before the batch is processed.
Input and processing preferences
It is very important to set the Resolution option before starting the OCR process if you have loaded a sprite into Sleuth and you want to output in RTF format. The value refers to the resolution in dpi of the scan you have loaded. This setting will determine what size and spacing text will have when it is output in the RTF file. This setting cannot be altered after the OCR process has begun. Users of TWAIN do not need to set this value as it is read directly from TWAIN.
The Auto page orientation option makes Sleuth examine the sprite file before displaying it to ensure that it has not been scanned sideways or upside down. If it has, then the program will automatically rotate the sprite by 90, 180 or 270 degrees and then display it. Note that the original image on disc will not be affected. This option is off by default because of the slight time penalty it causes when an image is loaded.
Sleuth will accept greyscale sprites and convert them into monochrome sprites. Using the Grey scale sprites options, the user can choose the level of grey up to which pixels are converted to white in this process. A Pre-sharpening facility is also provided that will generally improve the quality of degraded text during the conversion of the sprite from greyscale to monochrome.
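The greyscale-to-monochrome step is, in essence, a threshold operation. The sketch below shows the general idea only. The 0 to 255 grey scale (0 being black), the `white_from` parameter and the function name are assumptions for illustration and are not taken from Sleuth. Pre-sharpening is not modelled.

```python
def grey_to_mono(rows, white_from=128):
    """Threshold grey levels into monochrome pixels (illustrative sketch).

    rows: a list of rows, each a list of grey levels from 0 (black)
          to 255 (white). This scale is an assumption.
    Returns rows of 0 (black) and 1 (white).
    """
    return [[1 if level >= white_from else 0 for level in row]
            for row in rows]


# A lower threshold turns more of the greys white, which thins the text;
# a higher threshold keeps more grey as black, which thickens it.
scan = [[0, 90, 127, 128, 200, 255]]
default_result = grey_to_mono(scan)
lighter_result = grey_to_mono(scan, white_from=80)
```

Trying a couple of threshold values on a small portion of a scan is a quick way to see whether faint strokes are being lost or background speckle is being kept.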
Sleuth now has foreign language support built in; you can choose which language is to be used from the Language menu. This will select the resources for that language. You will need to save the preferences and reload the program for the change to take effect.
The Learned menu allows the user to select whichever of the provided learned files best suits the current use of the package. A short description of the learned file is displayed in the icon to the right of the menu icon.
The Word checking options allow you to limit how Sleuth checks whether a word is valid or not. If you choose Context, context checking is used to distinguish similar characters from one another, for example, an I (an upper case i) and an l (a lower case l) in a sans serif font. Context checking should only be disabled when you are getting a lot of similar errors, for example where a lower case l (ell) is incorrectly being interpreted as a 1 (one). When Context is selected you can choose which dictionaries you want Sleuth to use.
The Speed/Accuracy options are fairly self-explanatory. Medium is the default and recommended setting and will give the best overall results in the majority of cases. Quick will improve the speed of output, but generally at the cost of accuracy. Careful will generally improve the accuracy, but at the cost of speed of output.
The Multitask option allows you to define how much time Sleuth will use each time it gets control of the processor. Increase this value on RISC OS 3.5 and 3.6 machines if you have other applications on the icon bar and find that Sleuth slows down considerably because of this.
The Min. text size option tells Sleuth what size of text to ignore in the image. You should set this value if you are OCRing documents which include graphics that contain text. Sleuth generally ignores text in graphics, but sometimes it will try to convert it. The text is usually smaller than the main body of text to be converted, so by setting this threshold such text can be ignored.
Clicking on
will set the options for the current session. Choosing
will also set the options and save the settings to a file so that they will be reloaded when the application is next run.
The icon bar menu
These are the other options available from the icon bar menu:
The user dictionary
Sleuth now has a user dictionary. This allows the user to tell Sleuth about words it will encounter that are not in the built-in dictionary. This will improve the accuracy in reading these words.
To edit the user dictionary choose the Edit dictionary... option from the icon bar menu. This will open the Edit dictionary dialogue. You can enter words one at a time or you can drop a text file onto the dialogue box and the program will read all of the words and place them in the dictionary. Any duplicate words will be discarded. Entering words that are already in the built-in dictionary should not cause any problems as they will not be added to the dictionary; however, having a large user dictionary will slow down the program and use more memory.
Use lower case to add normal words. Adding color will cause Sleuth to recognise color, Color and COLOR. Adding Color, by contrast, effectively adds Color and COLOR but not color, and adding COLOR adds just COLOR.
To delete a word, click over it to highlight it and then click on Delete.
Cancel
will close the dialogue box without saving the changes made, choosing
will save the changes into the user dictionary.
Minimise memory
This option on the icon bar menu will reduce the memory required by Sleuth to the absolute minimum. Use this option when you want to free some memory, but retain Sleuth on the icon bar.
Reload image
Once you have started an OCR, you may want to alter some of the settings described above. You can stop the OCR, change the settings and then choose the Reload image option to restart OCRing the image. This only works with the last image you were OCRing.
Stop OCR
Use this option to stop the OCR process. No output will be shown if you choose to do this.
This will quit the program; you will be warned if there is unsaved output from an OCR.
Manipulating the sprite
Before starting to convert a sprite you may need to alter it. The sprite needs to have black text on a white background and the text has to be upright. Sleuth allows the sprite to be rotated, inverted or zoomed. To rotate or invert the sprite choose the
option from the main menu and this submenu will be displayed:
Rotate
option offers a further submenu from which you can choose to rotate the sprite by 90, 180 or 270 degrees. Choosing the
degree option will rotate the sprite in an anti-clockwise direction; the
degree option will rotate the sprite in a clockwise direction.
Choosing the Invert option will swap the black and white elements of the sprite.
To zoom the sprite choose the
option from the main menu and the usual zoom dialogue box will be opened.
Starting the OCR process
Sleuth will deal with complex page layouts automatically. A complex page layout can consist of text in columns and/or include graphics. Sleuth should extract text from the scan and output it in a sensible order. You can start the OCR process by clicking
over the sprite window and choosing the
option. The sprite window will be closed and another window opened; this is the OCR progress window:
It shows various statistics concerning the progress of the OCR. The top section shows messages that indicate what the program is currently doing. Initially it will show Finding lines; this indicates that the program is in the process of extracting the text from the scan. Once Sleuth has found all of the lines it will begin converting them to text and the window will show Reading lines. If you close the OCR progress window, clicking Select on the icon bar icon will re-open it. As the OCRing progresses the statistics are updated, showing which line of the total in the current zone is being read and which zone is being processed out of the total number, plus the time taken, estimated accuracy, words per minute and characters per second. You should pay attention to the estimated accuracy: whilst this value is not especially precise, it will give you an indication of how the program thinks it is doing. If it is below 95% it is probably worthwhile stopping the OCR, reloading the image and using a zone to OCR a smaller portion. This will save you having to wait to discover if there is indeed a problem with OCRing your image.
If you experience problems or are unhappy with the results you obtain from Sleuth please refer to the Problem solving section.
Stopping the OCR process
It is possible to halt the OCR process before all of the text has been converted. To do this click Menu over the icon bar icon and choose the Stop OCR option. No output will be shown if you choose to do this.
Using zones
Sleuth will be able to deal with most complex page layouts automatically, but some page layouts will cause it to output text in the wrong order. It is difficult to describe the page layouts that could cause problems, but examples include pages where text in columns runs around a complex graphic shape, pages containing tables, or pages where text in graphic captions appears in the main text.
To overcome problems caused by such images, zones can be created. A zone is a user-defined area from which Sleuth will extract text. To create a zone drag with Select over the sprite window. As soon as you start dragging, a zone rectangle will be drawn with eight handles and the pointer will be positioned in the bottom right-hand corner handle.
As you continue to drag the zone will be resized. The window will scroll automatically if necessary. Release Select to finish drawing the zone. The zone can be resized by dragging with Select over one of the handles and it can be moved by dragging with Select inside the zone.
You can create as many zones as you need to enclose the text you want to convert. The order of output from the zones will follow the order in which the zones were created, so text from the second zone created will follow on from the text from the first zone. You cannot draw an ordinary zone on top of another, although zones can overlap. To have overlapping zones, draw a zone and then move it on top of another zone. If any text is common in the overlapping zones it will be output twice.
By default the program will treat the area inside of a zone as it would treat a page without zones, i.e. it will try to find columns or graphics within the text. See the section Single column zones below for details of how to stop this.
Automatic zone creation
Sleuth can now find areas of text and create the zones around them for you automatically. Choose the Find zones option to try this. The zones that are created can be manipulated in exactly the same way as zones you draw yourself.
Changing the order of zones
You will notice that all of the zones you create are linked by a yellow line that extends from the bottom of one zone to the top of another. These lines show the order of the zones and, therefore, the order of the text that will be output from them. To change this order, drag over the red (or grey if the zone is not selected) square at the bottom of a zone. A line will be drawn to follow the pointer; drag the line to the zone that you want to be next in the order. You can drop the line anywhere inside the zone. The program will then automatically update the order of the zones, deleting and making links as necessary. This process can be repeated until the order of the zones is correct.
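One way to picture the re-linking is as moving an item within an ordered list. The sketch below is only a model of that idea: the function name `link_zone`, the use of simple labels for zones, and the rule for where the remaining zones end up are assumptions, since the text does not say exactly how Sleuth remakes the other links.

```python
def link_zone(order, source, target):
    """Return a new zone order in which `target` directly follows `source`.

    order:  list of zone labels in their current output order.
    source: the zone whose link square was dragged.
    target: the zone the link was dropped on.
    Assumption: all other zones keep their relative order.
    """
    if source == target:
        raise ValueError("a zone cannot be linked to itself")
    if source not in order or target not in order:
        raise ValueError("unknown zone")
    remaining = [zone for zone in order if zone != target]
    position = remaining.index(source) + 1
    return remaining[:position] + [target] + remaining[position:]


zones = ["A", "B", "C", "D"]
after_first_drag = link_zone(zones, "A", "C")              # C now follows A
after_second_drag = link_zone(after_first_drag, "D", "B")  # B now follows D
```

Repeating the operation, as the text suggests, lets any ordering be reached one link at a time.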
Manipulating zones
Choosing the
option from the main menu will open the following menu:
These options allow zones to be manipulated. To manipulate a zone it must be selected; an ordinary selected zone is drawn in red. To select a zone, click over it with Select. It is possible to have more than one zone selected. To select another zone click over it with Adjust. All zones can be selected by choosing the Select all option. Selected zones can be moved together by dragging inside any zone in the selection.
To deselect a zone click Adjust over it. All zones can be deselected by choosing the Clear option.
To delete the selected zones choose the Delete option. The links between zones will automatically be remade after any deletion based on the current order of the zones. Zones will be deleted when a new sprite is imported or the current sprite rotated.
The selected zones can be copied by choosing the
option. Links will automatically be made between the old and new zones. Zones will be resized to fit inside the grey border inside the sprite window.
Ignore zones
Ignore zones can be created to stop the program trying to OCR a part of a scan. This is useful if a page contains a graphic which, in turn, contains some text. To create an ignore zone you can draw an ordinary zone and then choose the Ignore option. Alternatively, you can draw a zone with Shift held down. If you use this method you can draw an ignore zone on top of another zone. Ignore zones are drawn containing a cross to distinguish them from other zones.
If you use ignore zones on a complex page layout and allow the rest of the image to be automatically processed, Sleuth can output the text in the wrong order. This is due to the program interpreting the ignore zone as white space. If this occurs it is recommended that you use ordinary text zones to define where the text is and its order.
Single column zones
Single column zones are provided to allow you to force the program into treating the area inside a zone as a single column. This is useful if the scanned text is widely spaced due to tabulation, since usually such spacing would cause the program to output the text in the wrong order. To create a single column zone, draw an ordinary zone and choose the Single column option. Single column zones will be drawn in green when selected.
Saving zones
If you repeat an OCR operation regularly on the same page layout you can save the zones that you use so you don't have to recreate them each time. To do this choose the Save zones option. The zone file that you save can then be reloaded after another sprite file has been loaded. You need to load the zone file after each sprite as zones are discarded after a sprite is loaded. If you scan using TWAIN the zones will be maintained between scans.
Table zones
Choosing the Table option on the Zone submenu will tell the program to interpret the data within that zone as a table. The text will be output in the Editor in CSV format. You need to extract this data manually and give it a CSV filetype after saving the output as a text file. The program will only recognise simple tables with a single line in each row.
Zones and TWAIN
After using the Preview option, or any other scan from TWAIN, you can create zones to determine which parts of the scan you want to re-scan at a higher resolution for OCRing. Sleuth will automatically calculate what the total area of the scan will be from the zones you create and pass this information to TWAIN when you choose the
option. The zones will then be scaled so that they cover the same area on the new scan as they did on the preview scan.
Zone files should not be loaded during the preview stage, unless they were saved at this stage, otherwise they will be scaled after scanning the final image.
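The area calculation and zone scaling described in this section can be sketched as a bounding-box union followed by a change of origin and resolution. This is an illustrative model only: the `(left, top, right, bottom)` pixel tuples, the function names, and the assumption that the new scan's origin is the top-left corner of the combined area are not taken from Sleuth or TWAIN.

```python
def bounding_box(zones):
    """Smallest rectangle enclosing every zone.

    zones: list of (left, top, right, bottom) tuples in preview pixels.
    """
    lefts, tops, rights, bottoms = zip(*zones)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def rescale_zones(zones, preview_dpi, final_dpi):
    """Map preview zones onto a re-scan of just the bounding box.

    Assumption: the re-scan starts at the bounding box's top-left corner,
    so each coordinate is shifted by that corner and then scaled by the
    ratio of the two resolutions.
    """
    left, top, _, _ = bounding_box(zones)
    factor = final_dpi / preview_dpi
    offsets = (left, top, left, top)
    return [tuple(round((value - offset) * factor)
                  for value, offset in zip(zone, offsets))
            for zone in zones]


preview_zones = [(100, 200, 300, 400), (250, 100, 500, 350)]
scan_area = bounding_box(preview_zones)
final_zones = rescale_zones(preview_zones, preview_dpi=75, final_dpi=300)
```

With a 75 dpi preview and a 300 dpi final scan every distance grows by a factor of four, which is why a zone file saved against one stage would not line up if loaded against the other.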
Guided editing
As text appears in the text window you will notice that certain words are highlighted in yellow. This implies that the program is suspicious that these words are not correct. It is possible that a word is correct, but the program suspects that something may be wrong and so draws your attention to the word. In most cases the word will be highlighted and will contain one or more errors. To step through the suspicious words place the caret in the text and press F1 or click on the Next suspicious button. Pressing F1 will move the caret to the first suspicious word and highlight the word in white. A word is considered corrected once the caret enters a word and the highlighting is removed. Pressing F1 again will move onto the next word. If you decide to edit a word further down the page, pressing F1 will always take you back to the first uncorrected suspicious word. You should note that a word may not be highlighted, but is wrong. This should be a rare occurrence and will only happen when the word spellchecks correctly.
Whilst editing you may wish to add the currently highlighted word to the user dictionary. You can do this by pressing F2 or clicking on the Add word button. A small dialogue box will open for a short time telling you that the word has been added to the dictionary. Adding a word to the dictionary will not have any effect on the currently highlighted words.
As you step through the dubious words the Other occurrences option shows how many times this word occurs further on in the text. If you find that the same word is read incorrectly, or has been highlighted as suspicious when in fact it is correct, you can correct all of the occurrences in one go. To do this, edit the first occurrence so that it is correct, then press F4 or click on the Replace button. Using this option when the highlighted words are correct will remove the highlighting from them all.
Whilst the caret is in the text window the word display window is open showing the word containing the caret from the original sprite.
A list of keypresses used when editing text is given below:
Down arrow          move the caret to the next line
Up arrow            move the caret to the previous line
Right arrow         move the caret to the next character
Left arrow          move the caret to the previous character
Shift Right arrow   move the caret to the next word
Shift Left arrow    move the caret to the previous word
Ctrl Right arrow    move the caret to the end of the line
Ctrl Left arrow     move the caret to the start of the line
Page Up             scroll the text window up
Page Down           scroll the text window down
Delete              delete the character before the caret
Copy (End)          delete the character after the caret
Shift Delete        delete the word at the caret
Ctrl Delete         delete the line at the caret
Inserting and deleting paragraph breaks
It is possible to insert and delete paragraph breaks in the text. To insert a paragraph break place the caret at the start of the first line of the paragraph or at the end of the previous line and press Return. To delete a paragraph break place the caret at the start of the first line of the paragraph and press Delete, or at the end of the previous line and press Copy (End).
The insertion and deletion of paragraphs will affect the way that styles are assigned if the output is saved as an RTF file. When inserting and deleting a paragraph, the new paragraph will inherit the style of the previous paragraph. Any local styles will be maintained.
Saving the output
(this feature does not exist in the
version)
Once the OCR process has finished you can save the resulting output. You have three choices: you can save the text as plain ASCII text, as a Rich Text Format (RTF) file or in an editable file format. Saving an RTF file will allow you to import data directly into packages that accept RTF files, maintaining any style information that Sleuth has managed to extract from the scanned image.
To save in a particular file format choose the relevant option from the main menu over the text window. This will open a standard save as dialogue box. Drag the icon to the filer or application where you want to save the text.
Editable file format
When you save the output in this format you can reload it at a later date and still be able to edit the file using the guided editing features described above. To see the relevant portions of the image you must have saved the image and it must still exist in the same location as it was when it was OCRed.
Style information
Before you save the output as an RTF file you can see some of the information that will be included in it by choosing the Show styles option. This will switch the display to use outline fonts to mimic the styles in the original document. In the text editor the styles that will be shown are: normal, bold, italic and bold italic variants of the fonts that are specified in the Output preferences dialogue.
When the file is output as an RTF file these styles will be augmented with information on font size, justification, line spacing and paragraph spacing. Sleuth will create base styles if the same style is used in two or more paragraphs and overlay local styles on top of these. So, for example, you can get a paragraph with a base style of normal serif text with a single word in italic serif within it.
Sleuth can have problems in identifying styles if there is insufficient information for it to analyse. Cases where this can occur are, for example, where only a small amount of text is in a particular style or where the serifs of a font have been lost during the printing/scanning process. Sleuth can also experience difficulty in distinguishing bold words in normal text if the variants are quite similar.
Batch processing
You can let Sleuth process a batch of scanned images for you whilst you do something else. To use this facility place the scanned images within a directory. The images can be squashed or normal mono or greyscale sprites. You can only process 75 sprites at a time. Output from the batch will be placed in a sub-directory named SlthOutput. The output files will have the same name as the input image files. By default the output file format will be Editable so that the user can edit the files using the guided editing features, but this can be changed using the Output preferences
dialogue box. If the Append option is chosen all of the output will be written into one file with the same name as the first image processed. This option can only be used if
is chosen as the output file type.
Zones can be used with the images as they are processed. To use a zone file create a sub-directory named
SlthZones
and place the zone files that you want to use within it. The zone file should have the same name as the image it is to be used for. If you want to use the same zone file for every image in the batch, call the zone file
Progress
window will show relevant information as the batch is processed. If an error occurs during the batch processing, Sleuth will attempt to move onto the next image. The Progress window will show which images have been processed successfully.
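The naming conventions above (output files and zone files sharing the image's name, and the limit of 75 sprites per batch) can be summarised in a short planning sketch. It does not touch the filing system and is not part of Sleuth: the function names, the tuple representation of a location, and the idea of pre-splitting a large job into several batches are assumptions for illustration.

```python
MAX_SPRITES_PER_BATCH = 75  # limit stated in the text


def split_into_batches(image_names, limit=MAX_SPRITES_PER_BATCH):
    """Split a long list of image names into runs of at most `limit`."""
    return [image_names[start:start + limit]
            for start in range(0, len(image_names), limit)]


def plan_batch(image_names, zone_file_names=()):
    """Describe, per image, where its output goes and which zone file applies.

    Locations are (sub-directory, leaf name) pairs relative to the
    directory that holds the images.
    """
    available_zones = set(zone_file_names)
    plan = []
    for name in image_names:
        plan.append({
            "image": name,
            # Output files have the same name as the input image.
            "output": ("SlthOutput", name),
            # A zone file applies when it has the same name as the image.
            "zones": ("SlthZones", name) if name in available_zones else None,
        })
    return plan


pages = ["page%03d" % number for number in range(1, 161)]
batches = split_into_batches(pages)
first_plan = plan_batch(batches[0][:2], zone_file_names=["page002"])
```

Here 160 scanned pages would need three separate batch directories, and only `page002` would be processed with its own zone file.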
The batch process can be halted at any time by clicking on the Stop OCR button or by choosing the Stop OCR option from the icon bar menu.
How to get the best results from Sleuth
There are many factors which affect the accuracy of the output from Sleuth. We recommend that the following suggestions are followed where possible to get the best results:
Use 300 or 400 dpi resolution scanning if it is available. Scan the original document, if available, not a photocopy. Before scanning a document scan a portion and convert it to ensure that the scan is not too light or too dark. If you are using a hand-held scanner try to make the scan as straight as possible. Here is a suggested method for scanning to get the best results:
Review the document and plan the best method of scanning. Do not waste time scanning graphics which will be discarded in the OCR process and will take up memory and disc space.
Scan a portion of the document and convert it to ensure that the contrast setting of the scanner is correct. Time spent getting this right will be saved later by having fewer errors in the output to correct.
If you are using a hand-held scanner, run the scanner over the document without scanning. This will make sure that there are no obstacles that will impede the scanner's progress during scanning and it will give you an opportunity to make sure that the scanner is running as straight as possible. Try to line up the body of the scanner with a straight line on the page or the left edge of the text being scanned.
Scan the document carefully. If you are scanning to the end of a page you may find that a hand-held scanner will jump slightly as the body of the scanner rolls off the end of the page. When this happens, it will certainly affect the output from Sleuth on the line being scanned, and it can have further detrimental effects if the scanner is tilted. Often in this case the image gets dark smudges on it which degrade the text past the point of recognition. To alleviate this problem scan the top half of the page and then turn the document round and scan the bottom half. The second half can be rotated in Sleuth before conversion.
When you have completed your scanning, load the images one at a time into Sleuth for conversion or use Batch Processing.
If using a flat-bed scanner ensure that the document sits flat on the scanner. If you do not have it flat, dark smudges will appear on the scan.
When scanning books, use the side of the scanner that has the least gap between the scannable area and the edge of the scanner. This will minimise problems with scanning to the inner margin and keeping the book flat.
Problem solving
Here is a list of possible problems that may occur when using Sleuth, with an explanation of each problem and possible solutions:
I'm getting output but it is all punctuation characters or apparently random characters
- The scan is either too light, too dark or contains a font that Sleuth has not been trained on. Re-scan the image to lighten or darken it. If there is a font that is not recognised by Sleuth in the image it will never accurately convert it.
The output is good but the occasional word comes out badly
- It is likely that there is some defect in the scan where the conversion accuracy breaks down or the word is in a font unknown to Sleuth.
The output is good, but gradually gets worse
- Check that the contrast of the image is consistent. Uniform contrast over the image is important in order to get good results. Inconsistent contrast is likely to occur if the document has not been printed well or is a photocopy. Try re-scanning the document in portions adjusting the contrast of the scanner to compensate for the change in document contrast.
The output is good, but the order of the text is wrong
- The page layout is probably complex or it contains text separated by large amounts of white space. If the complex page output is very poor try OCRing again, but use zones this time. If the page contains lots of white space in a single column of text place a zone around the text and make the zone a single column zone.
The output has a lot of the words highlighted in the editor, but they are not spelt incorrectly
- The words are probably not in the spell check dictionary. You can either ignore the highlighted words or un-tick the Context option from the Preferences dialogue box when you OCR this sort of scan. Generally the words will be of a specialised or technical nature.
The output is fair, but could be better
- Try using the Careful option from the Preferences dialogue box.
The output is excellent, but could be faster
- Try using the Quick option from the Preferences dialogue box.
I get text that is contained in graphics converted
- Use an ignore zone to ignore the graphic or set the Min. text size in the Input and processing preferences dialogue box to a larger value. If the page layout is complex you may need to use text zones to keep the order of output correct.
Sleuth's limitations
Although Sleuth has only been trained on these fonts it will also recognise other similar fonts without any further training. It will recognise the following characters:
, we would like to thank them for their efficiency and their help in this matter.
Sleuth will convert characters between 9 and 24 point in size, which equates to approximately 1/8 to 1/3 of an inch depending on the font. The actual limits are between 10 and 80 pixels, measured from the top of an upper-case letter, like an
, to the bottom of a lower-case letter with a descender, like a
. To convert 24 point text use a 200 dpi scan.
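The relationship between point size, scan resolution and the pixel limits can be checked with a little arithmetic. The following is a minimal sketch, assuming that the measured height of the text is roughly equal to its point size (72 points to the inch); the function names are illustrative and are not part of Sleuth.

```python
MIN_PIXELS = 10  # lower limit quoted above
MAX_PIXELS = 80  # upper limit quoted above


def text_height_pixels(point_size, dpi):
    """Approximate height of the text in pixels at a given scan resolution.

    Assumes the height from the top of a capital to the bottom of a
    descender is about the same as the point size (72 points = 1 inch).
    """
    return point_size / 72.0 * dpi


def within_limits(point_size, dpi):
    """True if text of this size, scanned at this resolution, should fall
    inside the 10 to 80 pixel range."""
    return MIN_PIXELS <= text_height_pixels(point_size, dpi) <= MAX_PIXELS
```

Under this assumption 24 point text is about 67 pixels high at 200 dpi, which is inside the limits, but about 100 pixels high at 300 dpi, which is not. This is why a lower resolution is suggested for large text.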
Sleuth will convert text at speeds of between 80 and 400 words per minute depending on the machine being used, the size and quality of the text and the resolution of the sprite being converted.
Sleuth will automatically cope with slightly skewed scans (generally less than 2 degrees of skew, depending on line spacing) and lines of text that are slightly wavy.
Important points to remember
Here is a list of points to check and remember when using Sleuth:
Check that the image is not too bright or too dark.
Check that the image is not overly skewed.
Check that the image has a uniform contrast.
Use the highest resolution scan that is available unless the text is large. 300-400 dpi should be sufficient.
Sleuth has only been trained on the fonts given earlier.
There is a limitation on the size of text that can be converted.
Sleuth has been trained on a specific set of characters; other characters will not be converted correctly.
Sleuth will automatically deal with most complex page layouts.
Set the Resolution setting in the Preferences dialogue box if you load in a sprite and want to output in RTF format.
Style information can be incorrect if there is only a small amount of text in a particular style or the scan quality is poor.
How Sleuth works
This is a brief description of the way Sleuth 3 works. It follows the order in which the program works, starting with the training process.
Training
In order to recognise letters, the OCR program must be taught their shapes. The main problem is that there is a lot of variation for each letter, due to different fonts, different sizes, and many kinds of defects introduced by the printing and scanning processes.
The training program for Sleuth 3 (not supplied) incorporates a sophisticated algorithm which starts with an outline font and mimics the processes of printing, faxing and scanning to make many examples of each letter. The training program uses about 1,000,000 examples in all (for v3.00). Each of these shapes is converted into a list of numbers (a feature vector) which represents the important aspects of the shape. Since each feature vector uses about 100 bytes, it would require about 100 MB to store all this information directly. It is therefore summarised in a special way and saved for use when reading.
Finding lines of text
The first thing Sleuth 3 does is to find all the black shapes in the image. Halftone areas (lots of small dots close together) are detected and these are stored along with large shapes as graphic areas. These are used later when the lines of text are found and ordered. Very small shapes that do not seem to be part of a half-tone area or a letter are weeded out at this stage as noise.
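Finding the black shapes is essentially a connected-component search. The following is a minimal sketch of that idea only, not Sleuth's actual code: it assumes the image is a list of strings in which '#' is a black pixel, treats diagonal neighbours as connected, and uses a hypothetical `min_pixels` threshold to weed out noise. Halftone detection is not attempted here.

```python
from collections import deque

# A tiny example image: two 2x2 blobs and one isolated speck.
PAGE = [
    "##...#.",
    "##.....",
    ".....##",
    ".....##",
]


def find_shapes(image, min_pixels=3):
    """Return (shapes, noise), each a list of bounding boxes
    (left, top, right, bottom) of connected groups of black pixels."""
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    shapes, noise = [], []
    for y in range(height):
        for x in range(width):
            if image[y][x] != "#" or seen[y][x]:
                continue
            # Flood-fill outwards from this pixel to collect one shape.
            queue = deque([(x, y)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            count = 0
            while queue:
                cx, cy = queue.popleft()
                count += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            box = (left, top, right, bottom)
            # Shapes with very few pixels are set aside as noise.
            (shapes if count >= min_pixels else noise).append(box)
    return shapes, noise
```

Calling `find_shapes(PAGE)` keeps the two blobs as shapes and sets the single speck aside as noise.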
Next the program estimates the skew in the image, and removes it. It does not rotate the actual shapes (which would introduce further distortion) but moves them so that lines are horizontal and gutters (the space between columns) are vertical. Then gutters are found. After that, the shapes are gradually merged together to form lines. Care is taken to avoid jumping over graphics and gutters or joining dropped capitals onto a line. Special processing is used for table zones.
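The idea of moving shapes rather than rotating them can be sketched as follows. This is an illustration under stated assumptions, not the method Sleuth uses: each shape is represented only by its bounding box, the skew angle is already known, and a positive angle means that lines of text drift downwards from left to right. The function name is hypothetical.

```python
import math


def deskew_boxes(boxes, skew_degrees):
    """Shift each bounding box (left, top, right, bottom) to cancel the skew.

    The boxes keep their size and the pixels inside them are untouched;
    only their positions change.
    """
    slope = math.tan(math.radians(skew_degrees))
    moved = []
    for left, top, right, bottom in boxes:
        centre_x = (left + right) / 2
        centre_y = (top + bottom) / 2
        dy = -round(centre_x * slope)  # brings lines of text back to horizontal
        dx = round(centre_y * slope)   # brings gutters back to vertical
        moved.append((left + dx, top + dy, right + dx, bottom + dy))
    return moved
```

For small angles, such as the 2 degrees or so that Sleuth copes with, these shifts are a close approximation to a true rotation of the box positions.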
The lines of text are tentatively ordered, but unlike Sleuth 2, the final ordering is not fixed until they have been read. This means that Sleuth 3 is less likely to be confused by non-text that looks a bit like text (which could cause Sleuth 2 to get the order of the lines wrong); it can use information about text size to join lines of text into blocks; and it can use sentence structure to order blocks of text where this is ambiguous. Unfortunately, this also means that Sleuth 3 cannot output text until the whole page has been read.
Character recognition
During reading, shapes are extracted from the image, converted into feature vectors and compared with the information about known shapes acquired during training. The result is a shortlist of possible letters, with scores representing how close the matches are, and also information about the style (bold/italic/serif) of the letter.
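Producing a shortlist can be pictured as ranking the known letters by how closely they match the unknown shape. The sketch below is a simple stand-in: it assumes one stored feature vector per letter and uses straight-line distance as the score, whereas Sleuth compares against summarised training data whose form is not described here. The two-number vectors in `PROTOTYPES` are made up for illustration.

```python
import math

# Hypothetical stored feature vectors, one per letter.
PROTOTYPES = {
    "c": (1.0, 0.0),
    "e": (1.0, 0.4),
    "o": (0.0, 1.0),
}


def shortlist(features, prototypes, size=3):
    """Return up to `size` (letter, distance) pairs, closest match first."""
    scored = sorted(
        (math.dist(features, vector), letter)
        for letter, vector in prototypes.items()
    )
    return [(letter, distance) for distance, letter in scored[:size]]
```

A shape with features `(1.0, 0.1)` gives a shortlist of c, then e, then o, with c and e scoring closely. This is the kind of ambiguity that later stages have to resolve.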
Although character recognition is at the heart of any OCR program, it is only useful as one part of a complex system. Letters are often joined to their neighbours, and sometimes letters split into two or more parts, so these problems must be dealt with before character recognition can take place. Some characters (e.g. p, P, c and C) can only be identified once the scale and position are known from the rest of the line. Others (e.g. I, l, 1, e and c) can often only be correctly identified by using information about the likely identities of other shapes in the same word.
Reading a line of text
Sleuth 3 uses a technique called oversegmentation to locate the letters in a line of text. The image of the line is split into many small fragments by making vertical cuts at likely points. These are then ordered left to right, and the program considers various ways of joining them together again to make letters. Here is part of a word that has been oversegmented:
Sleuth 2 used a similar technique, but only when it failed to recognise a word using a simple way of dividing up the words into letters. Sleuth 3 uses oversegmentation on every line, and also has a special module to recognise monospaced text and divide the line up in this case. When text quality is poor, Sleuth 2 sometimes fails to find enough letters in the line to make a reliable estimate of scale and position. In the picture above, Sleuth 2 would see the whole shape as a reject at its first attempt - and if much of the line was joined up into clumps of letters, or if many letters were split, it would make a mess of the whole line. Sleuth 3 stands a much better chance of getting most letters right first time and so can deal with worse quality text. The disadvantage is that it is somewhat slower.
A shortlist of possibilities is obtained for each shape in the line, and the well-recognised letters are used to find the baseline and scale of the text in the line, and therefore remove letters from the shortlists if they have the wrong size or position.
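As an illustration of how scale can prune a shortlist, the sketch below separates letters such as c and C, which have the same shape in both cases. It assumes the scale is taken as the median height of reliably recognised small letters, and that a shape counts as upper case if it is clearly taller than that. The function names and the 20% tolerance are assumptions made for this example.

```python
from statistics import median


def estimate_x_height(reliable_heights):
    """Scale estimate: the median height, in pixels, of well-recognised
    small lower-case letters in the line."""
    return median(reliable_heights)


def filter_by_size(shape_height, x_height, candidates, tolerance=0.2):
    """Drop candidates whose case does not fit the height of the shape."""
    is_small = shape_height <= x_height * (1 + tolerance)
    return [letter for letter in candidates if letter.islower() == is_small]
```

With an estimated height of 20 pixels, a 20 pixel shape with the shortlist c, C keeps only the c, while a 30 pixel shape keeps only the C.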
Next, word breaks are found, and each word is visited in turn. If it is possible to choose letters from the shortlists so that the word spellchecks, and without any letters being poorly recognised, the word is considered done. Here's how the word "excellent" from a low quality scan might look at this stage. There are no split or joined letters, but the correct interpretation is far from obvious.
cxcc11cnt <- first choices
e eeIle [ <- second choices
lI <- third choices
!! <- fourth choices
ii <- fifth choices
} <- sixth choice
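The search for a reading that spellchecks can be sketched as trying one letter from each shortlist and keeping the combinations that appear in a dictionary. This is a brute-force illustration, not Sleuth's spellchecker. The shortlists below take the first and second choices from the table above; the placing of the third choices under the fifth and sixth letters is an assumption, and the lower choices are left out. The tiny dictionary is made up for the example.

```python
from itertools import product

# One string per letter position, best choice first.
SHORTLISTS = ["ce", "x", "ce", "ce", "1Il", "1lI", "ce", "n", "t["]

# Hypothetical dictionary for illustration.
DICTIONARY = {"excellent", "excelled", "extent"}


def spellchecked_readings(shortlists, dictionary):
    """Return the dictionary words that can be made by picking one letter
    from each shortlist, ordered so that readings using higher-ranked
    choices come first."""
    readings = []
    for picks in product(*(list(enumerate(choices)) for choices in shortlists)):
        word = "".join(letter for _, letter in picks)
        if word in dictionary:
            total_rank = sum(rank for rank, _ in picks)
            readings.append((total_rank, word))
    return [word for _, word in sorted(readings)]
```

Here `spellchecked_readings(SHORTLISTS, DICTIONARY)` finds "excellent" as the only reading that spellchecks, even though five of its nine letters were not first choices.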
If the word is not recognised at the first attempt, it is examined more closely. The program considers various ways in which the fragments can be put together to make new possibilities for the shapes in the word. Each time a new set of pieces is combined to make a shape, the shape must be recognised, and each time a new set of shortlists for a word is produced, the results must be spellchecked. The time the program spends on this depends on the Quick/Medium/Careful settings.
For example, in the
image earlier, the pieces
would probably be combined so that 1 was on its own as a
, 2 and 3 would be combined to make a reasonable
, then 4 would be added to make the
, 4 and 5 could make an italic
, 5 and 6 both make bad i's, but together make a good
, it is possible that 6, 7 and 8 could make a bad
, but that would be discarded for 7, 8 and 9 to be combined to make the
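Choosing how to join the fragments back together can be sketched as a search over runs of neighbouring fragments. The example below is one possible approach and is not taken from Sleuth: it assumes a hypothetical `score_run(start, end)` function that returns a recognition score between 0 and 1 for the shape made from fragments `start` to `end - 1` (numbered from 0), or `None` if the shape is not recognised at all. Each score is weighted by the number of fragments used, so that two poor letters do not beat one good one.

```python
def best_segmentation(num_fragments, score_run, max_run=3):
    """Return (total, runs) for the best way of grouping the fragments into
    letters, or None if no grouping covers every fragment.

    runs is a list of (start, end) pairs, each covering fragments
    start .. end - 1.
    """
    # best[i] holds the best result found for the first i fragments.
    best = [(0.0, [])] + [None] * num_fragments
    for end in range(1, num_fragments + 1):
        for start in range(max(0, end - max_run), end):
            score = score_run(start, end)
            if best[start] is None or score is None:
                continue
            total = best[start][0] + score * (end - start)
            if best[end] is None or total > best[end][0]:
                best[end] = (total, best[start][1] + [(start, end)])
    return best[num_fragments]


# Made-up scores: fragments 1 and 2 are poor letters on their own,
# but make a good letter when joined.
SCORES = {(0, 1): 0.9, (1, 2): 0.3, (2, 3): 0.3, (1, 3): 0.8, (3, 4): 0.7}
```

With these scores, `best_segmentation(4, lambda s, e: SCORES.get((s, e)))` groups the fragments as one letter, then a joined pair, then one letter.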
Final steps
Once all the (supposed) text lines have been read, any that were unreadable are rejected, and the remainder are grouped into blocks and ordered. Then paragraph breaks are found and for each paragraph the text size, line spacing, justification and main style (bold/italic/serif/monospaced) are found. The program then finds any words which do not match the main style. Finally, the text and formatting information are sent to the text editor, or to a file if in batch mode.
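Finding the main style of a paragraph and the words that differ from it can be sketched as a simple majority vote. This is an assumption about one way of doing it, with a hypothetical function name and made-up style labels; it also shows why style information is less reliable when a paragraph contains only a few words.

```python
from collections import Counter


def main_style_and_exceptions(words):
    """words is a list of (text, style) pairs for one paragraph.

    Returns the most common style and a list of (index, text, style)
    entries for the words that do not match it.
    """
    counts = Counter(style for _, style in words)
    main_style = counts.most_common(1)[0][0]
    exceptions = [
        (index, text, style)
        for index, (text, style) in enumerate(words)
        if style != main_style
    ]
    return main_style, exceptions
```

For a paragraph of four words where only the first is bold, the main style is plain and the first word is reported as the exception.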
Why is Sleuth 3 slower than Sleuth 2?
The main reasons are:
Sleuth 3 uses a more careful initial segmentation of each line into letters as described earlier.
Sleuth 3 has a more sophisticated spellchecker which uses information about word frequencies (eg it knows that
is more likely than
). It also incorporates a user dictionary.
Sleuth 3 does more processing of graphics, which aids the division into text and graphics and the segmentation of text areas, eg using vertical rulings.
Sleuth 3 is trained on more fonts and a wider variety of text qualities, including symbols that have been artificially "faxed".
Prices & Other BEEBUG Products
Code    Description                 Price
0079b   Sleuth 3                   116.32
0081b   Sleuth 2                    81.08
0051b   Sleuth 1 - 2 upgrade        40.54
0048b   Sleuth 1 - 3 upgrade        81.08
0049b   Sleuth 2 - 3 upgrade        45.83
0082d   Ovation Pro                158.62
0083b   Ovation Pro Colour Supp     41.12
0108c   Ovation                     57.58
0093c   Hearsay 2                   29.00
0098b   Masterfile 3                23.50
0094b   Hard Disc Companion II      22.99
0096b   Desktop Thesaurus
0092b   TypeStudio
PHANb   Phantasm                    17.50
PAS3b   ArcScan III                 15.86
PAU1a   ArcScan Index Update
Prices include VAT but not carriage which will be charged at cost.
BEEBUG Ltd, 117 Hatfield Road, St Albans, AL1 4JS, England