home *** CD-ROM | disk | FTP | other *** search
- Xref: sparky comp.programming:2313 alt.msdos.programmer:2200 comp.sys.ibm.pc.programmer:337 comp.unix.programmer:4296
- Newsgroups: comp.programming,alt.msdos.programmer,comp.sys.ibm.pc.programmer,comp.unix.programmer
- Path: sparky!uunet!cis.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!cbfsb!cbnewsc!cbnews!eotto
- From: eotto@cbnews.cb.att.com (erick.otto)
- Subject: Soundex algorithms /With code 'C' etc.
- Organization: AT&T
- Date: Fri, 14 Aug 1992 12:09:13 GMT
- Message-ID: <1992Aug14.120913.26413@cbnews.cb.att.com>
- Followup-To: poster
- Keywords: soundex, papers, code
- Lines: 281
-
-
- All,
- Thank you very much everyone who replied to my requests for the soundex
- code/algorithms.
-
- I have promised to post all the outcome and have combined all the stuff
- I could find in the below posting.
- I contains various articles on the soundex algoritm.
- The next article should contain various code fragments (functions)
- that produce soundex codes, pick the ones you like.
-
- P.S. The code is cpio'ed and uuencoded
-
-
- Erick Otto
- eotto@cbnews.att.com
- ------------------------- Article 1 -------------------------------------------
- 1. Retain the first letter of the name and drop all occurrances
- of a, e, h, i, o, u, w, y in other positions.
-
- 2. Assign the following numbers to the remaining letters after
- the first:
- b, f, p, v ---> 1
- c, g, j, k, q, s, x, z ---> 2
- d, t ---> 3
- l ---> 4
- m, n ---> 5
- r ---> 6
-
- 3. If two or more letters with the same code were adjacent in the
- original name (before step 1) omit all but the first.
-
- 4. Convert to the form "letter, digit, digit, digit" by adding
- trailing zeros (if there are less than 3 digits) or by dropping
- rightmost digits (if there are more than three).
-
- "The Art of Computer Programming, Sorting and Searching"
- Knuth, Donald, Addison-Wesley, 1973, pp. 391-392.
-
- ------------------------- Article 2 -------------------------------------------
- The algorithm is very simple,
-
- 1: Assign number values to all but the first letter of the
- word, using this table
- 1 - B P F V
- 2 - C S K G J Q X Z
- 3 - D T
- 4 - L
- 5 - M N
- 6 - R
- 7 - A E I O U W H Y
-
- 2: Apply the following rules to produce a code of one letter and
- three numbers.
- A: The first letter of the word becomes the initial character
- in the code.
- B: When two or more letters from the same group occur together
- only the first is coded.
- C: If two letters from the same group are seperated by an H or
- a W, code only the first.
- D: Group 7 letters are never coded (this does not include the
- first letter in the word, which is always coded).
-
- Of course, this can be used without the numeric substitution to
- produce abbreviations also, but the numbers indicate the phonemic
- similarity (e.g. Bear = Bare = B6), or Rhymes (e.g. Glare = G46,
- Flair = F46). This can also be useful for finding duplicate entries
- in a large database, where a name may be slightly mis-spelled (e.g.
- Smith = Simth = S53).
-
- ------------------------- Article 3 -------------------------------------------
- I've written a soundex program based on the rules in Knuth's _Searching_and_
- Sorting. These are also the rules used at the National Archives to sort
- census data. These rules differ slightly from the ones posted by Mr. Stewart.
- If you don't need to match anyone else's soundex, the most important rule is
- to be consistent. I will insert Knuth's rules below.
- The algorithm is very simple,
-
- 1. Retain the first letter of the name, and drop all occurrances of a, e, h,
- i, o, u, w, y in other positions.
-
- 1: Assign number values to all but the first letter of the
- word, using this table
- 1 - B P F V 2 - C S K G J Q X Z
- 3 - D T 4 - L
- 5 - M N 6 - R
- 7 - A E I O U W H Y
-
- 2. Assign number values as above, except for 7.
-
- 2: Apply the following rules to produce a code of one letter and
- three numbers.
- A: The first letter of the word becomes the initial character
- in the code.
- B: When two or more letters from the same group occur together
- only the first is coded.
- C: If two letters from the same group are seperated by an H or
- a W, code only the first.
-
- 3. If two or more letters with the same code were adjacent in the original
- name (begore step 1), omit all but the first.
-
- D: Group 7 letters are never coded (this does not include the
- first letter in the word, which is always coded).
-
- 4. Convert to the form "letter, digit, digit, digit" by adding trailing
- zeros or dropping rightmost digits.
-
-
- BTW according to the reference in Knuth's book, this algorithm was
- developed by Margaret Odell and Robert Russell in 1922.
- ------------------------- Article 4 -------------------------------------------
- transcribed from ACIUS tech note TN-06
- (c) 1992 ACIUS Inc
-
- INTRODUCTION
-
- This technical note will show you how to use 4th Dimension code to
- build phonetic equivalent for an alphanumeric field. It will give you
- phonetic searching capabilities in your databases.
-
- About the Soundex Function
-
- In order to acheive phonetic searching, we use an algorithm calles
- "Soundex." This algorithm converts a string of characters into a
- phonetic equivalent by grouping together characters that sound alike
- and assigning a single digit to each of them. The procedure allows
- you to convert any passed string to its phonetic equivalent, thus
- allowing you to adjust the phonetic precision.
-
- The following four points must be taken into consideration when using
- the "Soundex" function.
-
- The user must always know the correct first character: The algorithm
- creates the phonetic equivalent with the first character of the
- phonetic equivalent that is equal to the first character of the
- original string.
-
- Each character is read and assigned a number: This number is
- associated with a group; charaters are grouped according to their
- sounds. For example, group six contains characters "m" and "n"
- because they sound alike. We have set up a common grouping in
- "Changing Character Sound Grouping," but you may change this as you
- feel necessary.
-
- While converting the string to its phonetic equivalent: If character
- n+1 is in the same group as character n, character n + 1 is ignored.
-
- The second paramater of the procedure determines the precision of the
- phonetic equivalent: By default, it is set to three different groups
- of character sounds. If you increase the precision, it will be more
- discriminating in determining phonetic matches. For example, if you
- have a precision of three and build a phonetic equivalent for the word
- "monitor," searching for a similar word, like "monita," will return a
- match. However, if you have a precision of six, a match will not be
- found. We have found a precision of three or four gives the best
- results.
-
- Soundex Function
-
- The operation of the code is quite straightforward. This procedure
- walks backwards down the string until it runs into either a colon or
- the end of the string. The Soundex function returns a string that can
- be used as the phonetic representation of a give string. This
- function returns a result of <0 if an error has occurred.
-
- Changing the Character Sound Groupings
-
- In the Soundex function, we group characters together that sound alike
- and then assign an interfer to them. Examine the lines of code:
-
- $Sounds:= "aehiouwybfpvcgjkqsxzdtlmnr" `Group chars that sound alike
- $SoundsI:="11111111222233333333445667" `Assign numbers for each group
-
- Each letter is given a specific code, as shown in the following table.
-
- Letter Soundex Code
- a,e,h,i,o,u,w,y 1
- b,f,p,v 2
- c,g,j,k,q,s,x,z 3
- d,t 4
- l 5
- m,n 6
- r 7
-
- You can regroup these characters as you like. $SoundsI contains the
- numerical equivalent for each character in $Sounds. If you wanted to
- group "aeh" in its own separate group, you would declare the two
- variables as follows:
-
- $Sounds:= "aehiouwybfpvcgjkqsxzdtlmnr" `Group chars that sound alike
- $SoundsI:="11122222333344444444556778" `Assign numbers for each group
-
- Following this simple rule, you can recreate your own soundex
- groupings for your specific phonetics.
-
- [4D specific examples deleted]
-
- Usage:
-
- Apply the soundex function to fields in your database you will
- phonetic search against, and store the results in another field in the
- same table, i.e.
- Name (filed to search against)
- SNDX_Name := soundex(Name)
- Then, when user wants to do a sounds alike search, translate the
- user's input to soundex string and look for matches
- SNDX_input:=soundex(Input)
- select * from TABLE where SNDX_Name = SNDX_input
-
- ------------------------- Article 5 -------------------------------------------
- The Soundex algorithm described by two previous follow-ups is only
- the best-known of sounds-like alorithms. It is tied to single-language
- assumptions, which may be violated in the CD-Disc application cited.
-
- A proprietary algorithm which is available for license and as a
- product which might fit that application better is Proximity
- Technology's Proximity Search algorithm. They have a (patented) VLSI
- accelerator chip which used to be necessary (on old PC-ATs) if the
- database grew larger than trivial (400 records); a modern 486 might
- not need one, I don't know. Their "Friendly Finder" would list the
- 10 to 20 records which most closely match the key you are typing,
- with the list changing as you extend the key, character by character --
- like incremental-search mode in emacs, only it shows 20 matching lines,
- like list-matching-lines. Usually there are a few records in the list
- that you don't understand why their algorithm thinks they're a close
- match, but what you want is almost always on page 1. In 1986, Friendly
- Finder would scan DBase II and III and ASCII files. I suppose
- they may have upgraded it for DBase IV.
-
- They will also license to developpers to use the library & hardware
- in your programs -- but for a customer-inquiry device, the pop-up
- database-searcher may be just right -- you can paste the product
- number from the searcher into the order-entry program, I think!
-
- The company exists to exploit their two resources: the Proximity
- Search Algoritm in hardware and software, and their exclusive
- licesinsing arrangement with Merriam-Webster's lexicographic database.
- (Yes, that pocket-dictionary/spell-checker/thesaurus that has the
- MW logo on it has the Proximity algorithm in it, and probably has
- their logo in fine print too.)
-
- To contact them: in 1986, they had an office in Framingham Mass,
- which I visited as an Evil Defense Contractor looking for useful
- technologies for databases (they already had a contract with NSA
- for use of their VLSI chip in something hush-hush, but that had
- little to do with me except provide a handle for checking references
- if I got my sponsor interested); it may still be there if you call
- 508-555-1212. Their literature listed home office as:
- Proximity Technology Inc.
- 3511 NE 22nd Avenue
- Fort Lauderdale, FL 33308-6226
- 305-566-3511 (FAX: -2088)
- TELEX 160054 PRXMTY
-
-
- Disclaimer: My only remuneration from them was a Beta copy of the
- thesaurus that my wife was supposed to review, but never used due to
- lack of a harddrive to install it on and a trial (no hardware) copy of
- Friendly Finder for evaluation. Hmmm. Maybe I should install those
- again, they are nice programs. I wonder how it would work on my
- 1E4 record book catalogue on the super-486 at the office?
- ------------------------- Article 6 -------------------------------------------
- Actually there was an interesting article in the C-users journal of june or
- july about string comparisons with unequal lengths and approximating how much
- they differ and listing the strings(or records) that have an acceptable
- difference.
- *****************************************************************************
- *&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&*
- * | *
- * ______ _______ _____ _______ ------- | Erick Otto *
- * / __ \ |__ __| / _ \ |__ __| -=====----- | Network Systems- *
- * | (__) | | | \ \ \_\ | | -=======----- | Netherlands *
- * | __ | | | / / __ | | --=====------ | eotto@cbnews.att.com *
- * | | | | | | | (\ / / | | ----------- | eotto@hvlpa.att.com *
- * |_| |_| |_| \_____/ |_| ------- | x'2632, BE118 *
- * | *
- *&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&*
- *****************************************************************************
-
-
-