NetNews Usenet Archive 1992 #18

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #18 / NN_1992_18.iso / spool / comp / programm / 2313 < prev next >

Wrap

Internet Message Format | 1992-08-14 | 12.4 KB

Xref: sparky comp.programming:2313 alt.msdos.programmer:2200 comp.sys.ibm.pc.programmer:337 comp.unix.programmer:4296 Newsgroups: comp.programming,alt.msdos.programmer,comp.sys.ibm.pc.programmer,comp.unix.programmer Path: sparky!uunet!cis.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!cbfsb!cbnewsc!cbnews!eotto From: eotto@cbnews.cb.att.com (erick.otto) Subject: Soundex algorithms /With code 'C' etc. Organization: AT&T Date: Fri, 14 Aug 1992 12:09:13 GMT Message-ID: <1992Aug14.120913.26413@cbnews.cb.att.com> Followup-To: poster Keywords: soundex, papers, code Lines: 281 All, Thank you very much everyone who replied to my requests for the soundex code/algorithms. I have promised to post all the outcome and have combined all the stuff I could find in the below posting. I contains various articles on the soundex algoritm. The next article should contain various code fragments (functions) that produce soundex codes, pick the ones you like. P.S. The code is cpio'ed and uuencoded Erick Otto eotto@cbnews.att.com ------------------------- Article 1 ------------------------------------------- 1. Retain the first letter of the name and drop all occurrances of a, e, h, i, o, u, w, y in other positions. 2. Assign the following numbers to the remaining letters after the first: b, f, p, v ---> 1 c, g, j, k, q, s, x, z ---> 2 d, t ---> 3 l ---> 4 m, n ---> 5 r ---> 6 3. If two or more letters with the same code were adjacent in the original name (before step 1) omit all but the first. 4. Convert to the form "letter, digit, digit, digit" by adding trailing zeros (if there are less than 3 digits) or by dropping rightmost digits (if there are more than three). "The Art of Computer Programming, Sorting and Searching" Knuth, Donald, Addison-Wesley, 1973, pp. 391-392. ------------------------- Article 2 ------------------------------------------- The algorithm is very simple, 1: Assign number values to all but the first letter of the word, using this table 1 - B P F V 2 - C S K G J Q X Z 3 - D T 4 - L 5 - M N 6 - R 7 - A E I O U W H Y 2: Apply the following rules to produce a code of one letter and three numbers. A: The first letter of the word becomes the initial character in the code. B: When two or more letters from the same group occur together only the first is coded. C: If two letters from the same group are seperated by an H or a W, code only the first. D: Group 7 letters are never coded (this does not include the first letter in the word, which is always coded). Of course, this can be used without the numeric substitution to produce abbreviations also, but the numbers indicate the phonemic similarity (e.g. Bear = Bare = B6), or Rhymes (e.g. Glare = G46, Flair = F46). This can also be useful for finding duplicate entries in a large database, where a name may be slightly mis-spelled (e.g. Smith = Simth = S53). ------------------------- Article 3 ------------------------------------------- I've written a soundex program based on the rules in Knuth's _Searching_and_ Sorting. These are also the rules used at the National Archives to sort census data. These rules differ slightly from the ones posted by Mr. Stewart. If you don't need to match anyone else's soundex, the most important rule is to be consistent. I will insert Knuth's rules below. The algorithm is very simple, 1. Retain the first letter of the name, and drop all occurrances of a, e, h, i, o, u, w, y in other positions. 1: Assign number values to all but the first letter of the word, using this table 1 - B P F V 2 - C S K G J Q X Z 3 - D T 4 - L 5 - M N 6 - R 7 - A E I O U W H Y 2. Assign number values as above, except for 7. 2: Apply the following rules to produce a code of one letter and three numbers. A: The first letter of the word becomes the initial character in the code. B: When two or more letters from the same group occur together only the first is coded. C: If two letters from the same group are seperated by an H or a W, code only the first. 3. If two or more letters with the same code were adjacent in the original name (begore step 1), omit all but the first. D: Group 7 letters are never coded (this does not include the first letter in the word, which is always coded). 4. Convert to the form "letter, digit, digit, digit" by adding trailing zeros or dropping rightmost digits. BTW according to the reference in Knuth's book, this algorithm was developed by Margaret Odell and Robert Russell in 1922. ------------------------- Article 4 ------------------------------------------- transcribed from ACIUS tech note TN-06 (c) 1992 ACIUS Inc INTRODUCTION This technical note will show you how to use 4th Dimension code to build phonetic equivalent for an alphanumeric field. It will give you phonetic searching capabilities in your databases. About the Soundex Function In order to acheive phonetic searching, we use an algorithm calles "Soundex." This algorithm converts a string of characters into a phonetic equivalent by grouping together characters that sound alike and assigning a single digit to each of them. The procedure allows you to convert any passed string to its phonetic equivalent, thus allowing you to adjust the phonetic precision. The following four points must be taken into consideration when using the "Soundex" function. The user must always know the correct first character: The algorithm creates the phonetic equivalent with the first character of the phonetic equivalent that is equal to the first character of the original string. Each character is read and assigned a number: This number is associated with a group; charaters are grouped according to their sounds. For example, group six contains characters "m" and "n" because they sound alike. We have set up a common grouping in "Changing Character Sound Grouping," but you may change this as you feel necessary. While converting the string to its phonetic equivalent: If character n+1 is in the same group as character n, character n + 1 is ignored. The second paramater of the procedure determines the precision of the phonetic equivalent: By default, it is set to three different groups of character sounds. If you increase the precision, it will be more discriminating in determining phonetic matches. For example, if you have a precision of three and build a phonetic equivalent for the word "monitor," searching for a similar word, like "monita," will return a match. However, if you have a precision of six, a match will not be found. We have found a precision of three or four gives the best results. Soundex Function The operation of the code is quite straightforward. This procedure walks backwards down the string until it runs into either a colon or the end of the string. The Soundex function returns a string that can be used as the phonetic representation of a give string. This function returns a result of <0 if an error has occurred. Changing the Character Sound Groupings In the Soundex function, we group characters together that sound alike and then assign an interfer to them. Examine the lines of code: $Sounds:= "aehiouwybfpvcgjkqsxzdtlmnr" `Group chars that sound alike $SoundsI:="11111111222233333333445667" `Assign numbers for each group Each letter is given a specific code, as shown in the following table. Letter Soundex Code a,e,h,i,o,u,w,y 1 b,f,p,v 2 c,g,j,k,q,s,x,z 3 d,t 4 l 5 m,n 6 r 7 You can regroup these characters as you like. $SoundsI contains the numerical equivalent for each character in $Sounds. If you wanted to group "aeh" in its own separate group, you would declare the two variables as follows: $Sounds:= "aehiouwybfpvcgjkqsxzdtlmnr" `Group chars that sound alike $SoundsI:="11122222333344444444556778" `Assign numbers for each group Following this simple rule, you can recreate your own soundex groupings for your specific phonetics. [4D specific examples deleted] Usage: Apply the soundex function to fields in your database you will phonetic search against, and store the results in another field in the same table, i.e. Name (filed to search against) SNDX_Name := soundex(Name) Then, when user wants to do a sounds alike search, translate the user's input to soundex string and look for matches SNDX_input:=soundex(Input) select * from TABLE where SNDX_Name = SNDX_input ------------------------- Article 5 ------------------------------------------- The Soundex algorithm described by two previous follow-ups is only the best-known of sounds-like alorithms. It is tied to single-language assumptions, which may be violated in the CD-Disc application cited. A proprietary algorithm which is available for license and as a product which might fit that application better is Proximity Technology's Proximity Search algorithm. They have a (patented) VLSI accelerator chip which used to be necessary (on old PC-ATs) if the database grew larger than trivial (400 records); a modern 486 might not need one, I don't know. Their "Friendly Finder" would list the 10 to 20 records which most closely match the key you are typing, with the list changing as you extend the key, character by character -- like incremental-search mode in emacs, only it shows 20 matching lines, like list-matching-lines. Usually there are a few records in the list that you don't understand why their algorithm thinks they're a close match, but what you want is almost always on page 1. In 1986, Friendly Finder would scan DBase II and III and ASCII files. I suppose they may have upgraded it for DBase IV. They will also license to developpers to use the library & hardware in your programs -- but for a customer-inquiry device, the pop-up database-searcher may be just right -- you can paste the product number from the searcher into the order-entry program, I think! The company exists to exploit their two resources: the Proximity Search Algoritm in hardware and software, and their exclusive licesinsing arrangement with Merriam-Webster's lexicographic database. (Yes, that pocket-dictionary/spell-checker/thesaurus that has the MW logo on it has the Proximity algorithm in it, and probably has their logo in fine print too.) To contact them: in 1986, they had an office in Framingham Mass, which I visited as an Evil Defense Contractor looking for useful technologies for databases (they already had a contract with NSA for use of their VLSI chip in something hush-hush, but that had little to do with me except provide a handle for checking references if I got my sponsor interested); it may still be there if you call 508-555-1212. Their literature listed home office as: Proximity Technology Inc. 3511 NE 22nd Avenue Fort Lauderdale, FL 33308-6226 305-566-3511 (FAX: -2088) TELEX 160054 PRXMTY Disclaimer: My only remuneration from them was a Beta copy of the thesaurus that my wife was supposed to review, but never used due to lack of a harddrive to install it on and a trial (no hardware) copy of Friendly Finder for evaluation. Hmmm. Maybe I should install those again, they are nice programs. I wonder how it would work on my 1E4 record book catalogue on the super-486 at the office? ------------------------- Article 6 ------------------------------------------- Actually there was an interesting article in the C-users journal of june or july about string comparisons with unequal lengths and approximating how much they differ and listing the strings(or records) that have an acceptable difference. ***************************************************************************** *&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&* * | * * ______ _______ _____ _______ ------- | Erick Otto * * / __ \ |__ __| / _ \ |__ __| -=====----- | Network Systems- * * | (__) | | | \ \ \_\ | | -=======----- | Netherlands * * | __ | | | / / __ | | --=====------ | eotto@cbnews.att.com * * | | | | | | | (\ / / | | ----------- | eotto@hvlpa.att.com * * |_| |_| |_| \_____/ |_| ------- | x'2632, BE118 * * | * *&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&* *****************************************************************************