home *** CD-ROM | disk | FTP | other *** search
- Path: sparky!uunet!news.claremont.edu!lexcel!mtinews!mti!adrian
- From: adrian@mti.mti.com (Adrian McCarthy)
- Newsgroups: comp.programming
- Subject: Re: Spell checking routine
- Message-ID: <709@mtinews.mti.com>
- Date: 2 Sep 92 23:08:24 GMT
- References: <la5rkvINN9vr@aludra.usc.edu>
- Sender: news@mtinews.mti.com
- Organization: Micro Technology, Inc., Anaheim, CA
- Lines: 46
- Nntp-Posting-Host: mti
-
- In article <la5rkvINN9vr@aludra.usc.edu> joseph@aludra.usc.edu (Jeff Lawson) writes:
- >I need a routine that will suggest words for a mispelled word from a
- >large dictionary. I'd like something that could produce words with
- >totally different spellings, like WordPerfect's. It doesn't really
- >matter what code it's written in, but C is preferable. Thanks
-
- One of my favorite techniques that I use in my toy spell checker is to use
- multiple hash indexes into the dictionary. Each has index is generated with
- a different type of hash function. The main hash function is the usual sort,
- designed for efficient lookup.
-
- The additional hash functions are designed to cause words that are alike
- in some way to collide. Thus, once you look up the incorrect word,
- you don't have to do any more work to guess at correct spellings; a list
- of guesses is already computed and stored as the collision list. Precomputing
- the index when the dictionary is built means that guessing is blindingly fast.
-
- For example, a hash function based on a Soundex-type of algorithm will
- automatically give you a list of words which sound like the word you tried to
- look up (within the limits of the Soundex algorithm).
-
- I have another hash function which maps characters to the finger a touch typist
- would use. This quickly finds mistakes where you used the right finger, but
- missed the key you aimed at.
-
- Yet another hash function is bitmask of characters, so words that are anagrams
- of each other map to the same index. This is very useful for quickly solving
- word-jumble puzzles or making a program to play Scrabble. (It's also not too
- bad at finding transposed characters, but it tends to find a lot of other words
- as well.)
-
- This technique has worked pretty well for me. I've found it's quicker to
- try to sort a precomputed list of guesses based on similarity to the original
- word than to work the other direction. Of course, there is a big space
- trade-off because the hash indexes may be large for a big dictionary.
-
- Using the more traditional approach of generating guesses, it's pretty
- straightforward to transpose adjacent characters, delete characters, and
- substitute characters. To quickly make reasonable guesses for inserting
- characters, compute the frequency of all of the trigraphs (3-letter
- combinations) in your dictionary. Build a sparse 2-d table indexed by the
- first and last letter whose values are a sorted list of the most common
- interceding letters. (If memory requirements are very limiting, you could
- try a simpler but similar approach using digraphs instead of trigraphs.)
-
- Aid. (adrian@lexcel.com)
-