- Path: sparky!uunet!europa.asd.contel.com!darwin.sura.net!sgiblab!cs.uoregon.edu!ogicse!das-news.harvard.edu!spdcc!dirtydog.ima.isc.com!keps.kodak.com!cronkite!atar!umak
- From: umak@atar.epps.kodak.com (Uma Krishnan)
- Newsgroups: comp.databases
- Subject: Soundex Algorithm and SQL Parser Available
- Keywords: Soundex,SQL Parser
- Message-ID: <1992Sep2.142044.9443@APS.Atex.Kodak.COM>
- Date: 2 Sep 92 14:20:44 GMT
- Article-I.D.: APS.1992Sep2.142044.9443
- References: <1992Aug27.182721.27790@cbfsb.cb.att.com> <Btp2Bo.4yC@news.larc.nasa.gov> <mr.715283007@ogre> <1992Aug31.204025.23562@cbfsb.cb.att.com>
- Sender: news@APS.Atex.Kodak.COM
- Reply-To: umak@atar.Atex.Kodak.COM
- Organization: Atex Publishing Systems, Inc.
- Lines: 1206
-
- Thanks to Ruth Iverson for the pointers to SQL parsers and to Walt Hultgren
- for the Soundex routines.
-
- I am forwarding the responses that I got from them.
-
- ====== SQL LEX/YACC PARSER POINTERS ========
-
-
- Greetings to all of you who were interested in my findings regarding
- a generic public domain SQL interpreter. Unfortunately, all I was able
- to come up with were a couple of fairly old, but still largely ANSI-compatible
- examples of lex/yacc grammars for SQL. I am using these as a guide to
- "roll my own" SQL to B-tree interpreter.
-
- One of these originated from eris.tar.Z on wilma.cs.brown.edu, which Warwick
- was kind enough to point me at [ thanks, Warwick! ]. The other sample
- came from /pub/sql.shar on the animal-farm.nevada.edu ftp site.
-
-
- There doesn't seem to be much of what we're all looking for out there
- in PD land, but I hope this information helps!
-
-
- Ruth Iverson iverson@nas.nasa.gov
- NASA Ames Research Center
-
-
-
- ======= SOUNDEX ALGORITHM =======
-
- #!/bin/sh
- #
- # This is a shell archive. To extract its contents,
- # execute this file with /bin/sh to create the file(s):
- #
- # README soundex.ec soundex2b.c soundex4.c
- # mds.globals.h soundex1.c soundex3.c soundex5.c
- # soundex.4gl soundex2a.c
- #
- # This shell archive created: Wed Feb 19 17:03:20 EST 1992
- #
- echo "Extracting file README"
- sed -e 's/^X//' <<\SHAR_EOF > README
- XREADME
- X
- XThis directory holds files that contain various routines that produce or use
- Xthe Soundex string matching code. The files and their contents are:
- X
- X
- X Programs and functions written in C
- X
- X soundex1.c Program to display soundex code for one string.
- X
- X soundex2a.c Function written by Jonathan Leffler. Has #define's to
- X create a main() for testing.
- X
- X soundex2b.c A more recent version of the function in soundex2a.c.
- X
- X soundex3.c Program with a copy of the function in soundex2a.c that
- X displays soundex matches of a word from a list of words.
- X
- X soundex4.c Function callable from Informix-4GL. Doesn't appear to
- X zero-pad the last soundex character correctly.
- X
- X soundex5.c Another function callable from C.
- X
- X
- X Function written in ESQL/C
- X
- X soundex.ec Function to return soundex code
- X
- X mds.globals.h Header file for soundex.ec
- X
- X
- X Function written in Informix-4GL
- X
- X soundex.4gl Function to return soundex code
- X
- X
- XThe following people contributed either directly or by referral to these
- Xfiles:
- X
- X David I. Berg <infmx!dberg@uunet.uu.net>
- X Neil Briscoe <nbriscoe@cix.compulink.co.uk>
- X Luis P. Caamano <ddssuprs!lpc@uunet.uu.net>
- X John Gorman <semantic!john@uunet.uu.net>
- X Walt Hultgren <walt@rmy.emory.edu>
- X Jonathan Leffler <johnl@informix.com>
- SHAR_EOF
- if [ `wc -c < README` -ne 1405 ]
- then
- echo "Lengths do not match -- Bad Copy of README"
- fi
- echo "Extracting file mds.globals.h"
- sed -e 's/^X//' <<\SHAR_EOF > mds.globals.h
- X/*
- X * mds.globals.h Header file containing structures & definitions
- X * needed by ALL of the subroutines for the mds database.
- X */
- X
- X/*
- X*
- X* Set XENIX to 1 so the SCO Xenix precompiler picks up the Xenix-specific
- X* routines rather than the SunOS ones
- X*
- X*/
- X
- X
- X
- X#include <stdio.h>
- X#include <sys/types.h>
- X#include <ctype.h>
- X
- X#define FALSE 0
- X#define TRUE !FALSE
- X/* #define XENIX 1 */
- X
- X#ifdef XENIX
- X#include <sys/ndir.h>
- X#else
- X#include <sys/dir.h>
- X#endif
- X
- Xextern int errno;
- X
- X/*
- X * Subroutine Declarations.
- X */
- X
- Xchar *mktemp(), *strclip(), *strfromdate(), *index(), *getenv(), *malloc();
- Xchar *strchr();
- Xint strncmp(), int_search(), strlen();
- SHAR_EOF
- if [ `wc -c < mds.globals.h` -ne 644 ]
- then
- echo "Lengths do not match -- Bad Copy of mds.globals.h"
- fi
- echo "Extracting file soundex.4gl"
- sed -e 's/^X//' <<\SHAR_EOF > soundex.4gl
- X{ soundex.4gl Calculate Soundex Code }
- X
- X
- X{
- X Summary: Calculate the 4-character Soundex code for a supplied string
- X
- X Environment: SunOS 4.1, Informix-4GL
- X
- X Submitted by: Walt Hultgren <walt@rmy.emory.edu>
- X
- X
- X This function will return the 4-character Soundex code for the string
- X supplied by the calling routine. The "char(100)" definition of "str"
- X should be tailored to fit your needs.
- X
- X}
- X
- X
- Xfunction soundex ( str )
- X
- X
- X define str char(100), { supplied string }
- X str_leng smallint, { length of string }
- X i_str smallint, { pointer to string characters }
- X
- X s_code char(4), { soundex code }
- X i_code smallint, { pointer to characters of soundex code }
- X
- X char3 char(3),
- X char1 char(1),
- X p_char1 char(1)
- X
- X
- X {
- X Initialize soundex code and check length of supplied string
- X }
- X
- X let s_code = "0000"
- X
- X if ( str is NULL ) then let str_leng = 0
- X else let str_leng = length ( str )
- X end if
- X
- X
- X {
- X Calculate soundex code if string is non-NULL
- X }
- X
- X if ( str_leng > 0 )
- X then
- X
- X {
- X Step through letters in string
- X }
- X
- X let i_code = 0
- X
- X for i_str = 1 to str_leng
- X
- X let char1 = upshift ( str [ i_str, i_str ] )
- X
- X if ( char1 < "A" or char1 > "Z" )
- X then
- X continue for
- X end if
- X
- X
- X {
- X If first letter, start soundex code with it
- X }
- X
- X if ( i_code = 0 )
- X then
- X let s_code[1,1] = char1
- X let i_code = 1
- X let p_char1 = "0"
- X
- X continue for
- X end if
- X
- X
- X {
- X Get group code for this letter
- X }
- X
- X let char3 = "*", char1, "*"
- X
- X case
- X when ( "BFPV" matches char3 ) let char1 = "1" exit case
- X when ( "CGJKSXZ" matches char3 ) let char1 = "2" exit case
- X when ( "DT" matches char3 ) let char1 = "3" exit case
- X when ( "L" matches char3 ) let char1 = "4" exit case
- X when ( "MN" matches char3 ) let char1 = "5" exit case
- X when ( "R" matches char3 ) let char1 = "6" exit case
- X otherwise let char1 = "0" exit case
- X end case
- X
- X
- X {
- X If group code is non-zero and not the same as the previous code,
- X append it to soundex code. Stop after 4 soundex characters.
- X }
- X
- X if ( char1 <> p_char1 )
- X then
- X
- X if ( char1 <> "0" )
- X then
- X
- X let i_code = i_code + 1
- X let s_code[i_code,i_code] = char1
- X
- X if ( i_code = 4 )
- X then
- X exit for
- X end if
- X
- X end if
- X
- X let p_char1 = char1
- X
- X end if
- X
- X end for
- X
- X end if
- X
- X
- X {
- X Return soundex code to calling routine
- X }
- X
- X return s_code
- X
- X
- Xend function
- SHAR_EOF
- if [ `wc -c < soundex.4gl` -ne 3168 ]
- then
- echo "Lengths do not match -- Bad Copy of soundex.4gl"
- fi
- echo "Extracting file soundex.ec"
- sed -e 's/^X//' <<\SHAR_EOF > soundex.ec
- X/* I/O Routines for the reporting functions of the mds 4gl database */
- X
- X#include "mds.globals.h"
- X
- X$include sqltypes;
- X
- X#define SURNAMELEN 21
- X#define SOUNDEXLEN 5
- X#define UPSHIFT ('a' - 'A')
- X
- X/*
- X * Args : 1st off - string to be `soundex'ed (usually a surname)
- X */
- X
- Xmake_soundex (nargs)
- Xint nargs;
- X{
- X char inputstr[SURNAMELEN];
- X char workstr[SURNAMELEN];
- X char longsoundex[SURNAMELEN];
- X char outputstr[SOUNDEXLEN];
- X char nullstr[SOUNDEXLEN];
- X char *inptr, *workptr, *outworkptr, *longsoundptr, *nodupsptr, *outptr;
- X char ch, oldch;
- X
- X rsetnull (CCHARTYPE, nullstr);
- X
- X if (nargs != 1) {
- X
- X /*
- X * Close the database safely
- X */
- X
- X retquote (nullstr);
- X return (1);
- X }
- X
- X /*
- X * Pop input string
- X */
- X
- X popquote (inputstr, SURNAMELEN);
- X
- X if ((risnull (inputstr)) || (*inputstr == '\0')) {
- X retquote (nullstr);
- X return (1);
- X }
- X inptr = inputstr;
- X workptr = workstr;
- X
- X /*
- X * Remove ALL non alphabetic characters and force to uppercase
- X */
- X
- X for (; ((ch = *inptr) != '\0'); inptr++)
- X if ((ch >= 'A') && (ch <= 'Z'))
- X *workptr++ = ch;
- X else if ((ch >= 'a') && (ch <= 'z'))
- X *workptr++ = (ch - UPSHIFT);
- X
- X *workptr = '\0';
- X
- X /*
- X * Collapse runs of the same letter anywhere in the string
- X */
- X
- X for (outworkptr = workptr = workstr, oldch = '\0'; ((ch = *workptr) != '\0'); workptr++) {
- X if (ch != oldch)
- X *outworkptr++ = ch;
- X oldch = ch;
- X }
- X
- X *outworkptr = '\0';
- X
- X /*
- X * Test whether soundex string has any alphabetic characters in it
- X */
- X
- X if (*workstr == '\0') {
- X retquote (nullstr);
- X return (1);
- X }
- X for (workptr = (workstr + 1), longsoundptr = longsoundex; ((ch = *workptr) != '\0'); workptr++)
- X switch (ch) {
- X case 'B':
- X case 'F':
- X case 'P':
- X case 'V':
- X *longsoundptr++ = '1';
- X break;
- X
- X case 'C':
- X case 'G':
- X case 'J':
- X case 'K':
- X case 'Q':
- X case 'S':
- X case 'X':
- X case 'Z':
- X *longsoundptr++ = '2';
- X break;
- X
- X case 'D':
- X case 'T':
- X *longsoundptr++ = '3';
- X break;
- X
- X case 'L':
- X *longsoundptr++ = '4';
- X break;
- X
- X case 'M':
- X case 'N':
- X *longsoundptr++ = '5';
- X break;
- X
- X case 'R':
- X *longsoundptr++ = '6';
- X break;
- X }
- X
- X *longsoundptr = '\0';
- X
- X /*
- X * Remove any duplicates. eg. "11234453" --> "123453"
- X */
- X
- X for (longsoundptr = nodupsptr = longsoundex, oldch = '0'; ((ch = *longsoundptr) != '\0'); longsoundptr++) {
- X if (ch != oldch)
- X *nodupsptr++ = ch;
- X oldch = ch;
- X }
- X
- X *nodupsptr = '\0';
- X
- X /*
- X * Copy 1st character from upshifted original and then up to 3 digits from the longsoundex
- X */
- X
- X outputstr[0] = *workstr;
- X
- X for (outptr = (outputstr + 1), longsoundptr = longsoundex;
- X (((ch = *longsoundptr) != '\0') && (longsoundptr < (longsoundex + 3)));
- X longsoundptr++) /* at most 3 digits, so the NUL fits in outputstr */
- X *outptr++ = ch;
- X
- X *outptr = '\0';
- X
- X
- X retquote (outputstr);
- X return (1);
- X}
- X
- X
- SHAR_EOF
- if [ `wc -c < soundex.ec` -ne 3234 ]
- then
- echo "Lengths do not match -- Bad Copy of soundex.ec"
- fi
- echo "Extracting file soundex1.c"
- sed -e 's/^X//' <<\SHAR_EOF > soundex1.c
- X/***
- X* SOUNDEX ALGORITHM in C *
- X* *
- X* The basic Algorithm source is taken from EDN Nov. *
- X* 14, 1985 pg. 36. *
- X* *
- X* As a test Those in Illinois will find that the *
- X* first group of numbers in their drivers license *
- X* number is the soundex number for their last name. *
- X* *
- X* RHW PC-IBBS ID. #1230 *
- X* *
- X****************************************************************/
- X
- X#include <ctype.h>
- X
- Xchar (*soundex(out_pntr, in_pntr))
- Xchar *in_pntr;
- Xchar *out_pntr;
- X{
- Xextern char get_scode();
- Xchar ch,last_ch;
- Xint count = 0;
- X
- X strcpy(out_pntr,"0000"); /* Pre-fill output string for */
- X /* error and trailing zeros. */
- X *out_pntr = toupper(*in_pntr); /* Copy first letter */
- X last_ch = get_scode(*in_pntr); /* code of the first letter */
- X /* for the first 'double-letter */
- X /* check. */
- X /* Loop on input letters until */
- X /* end of input (null) or output */
- X /* letter code count = 3 */
- X
- X while( (ch = get_scode(*(++in_pntr)) ) && (count < 3) )
- X {
- X if( (ch != '0') && (ch != last_ch) ) /* if not skipped or double */
- X *(out_pntr+(++count)) = ch; /* letter, copy to output */
- X last_ch = ch; /* save code of last input letter for */
- X /* next double-letter check */
- X }
- X return(out_pntr); /* pointer to output string */
- X}
- X
- Xchar get_scode(ch)
- Xchar ch;
- X{
- X /* ABCDEFGHIJKLMNOPQRSTUVWXYZ */
- X /* :::::::::::::::::::::::::: */
- Xstatic char soundex_map[] = "01230120022455012623010202";
- X
- X /* If alpha, map input letter to soundex code. If not, return 0 */
- X
- X if( !isalpha(ch) ) /*error if not alpha */
- X return(0);
- X else
- X return(soundex_map[(toupper(ch) - 'A')] );
- X}
- X
- Xmain(argc, argv)
- Xint argc;
- Xchar *argv[];
- X{
- Xchar code[10]; /* buffer for the code, not an array of pointers */
- X
- Xint i;
- X
- X if(argc == 1) /* No arguments, give usage */
- X {
- X printf("\nUsage: soundex (name) (...)\n");
- X exit(1);
- X }
- X
- X
- X for(i = 1; i < argc; i++)
- X {
- X soundex(code, argv[i]) ;
- X
- X printf("The Soundex Code for \"%s\" is: %s\n", argv[i],code);
- X }
- X
- X exit(0);
- X}
- SHAR_EOF
- if [ `wc -c < soundex1.c` -ne 2918 ]
- then
- echo "Lengths do not match -- Bad Copy of soundex1.c"
- fi
- echo "Extracting file soundex2a.c"
- sed -e 's/^X//' <<\SHAR_EOF > soundex2a.c
- X/*
- X** SOUNDEX CODING
- X**
- X** Rules:
- X** 1. Retain the first letter; ignore non-alphabetic characters.
- X** 2. Replace second and subsequent characters by a group code.
- X** Group Letters
- X** 1 BFPV
- X** 2 CGJKSXZ
- X** 3 DT
- X** 4 L
- X** 5 MN
- X** 6 R
- X** 3. Do not repeat digits
- X** 4. Truncate or zero-pad to 4-character result.
- X**
- X** Originally formatted with tabstops set at 4 spaces -- you were warned!
- X**
- X** Code by: Jonathan Leffler (john@sphinx.co.uk)
- X** This code is shareware -- I wrote it; you can have it for free
- X** if you supply it to anyone else who wants it for free.
- X**
- X** BUGS: Assumes ASCII
- X*/
- X
- X#include <ctype.h>
- Xstatic char lookup[] = {
- X '0', /* A */
- X '1', /* B */
- X '2', /* C */
- X '3', /* D */
- X '0', /* E */
- X '1', /* F */
- X '2', /* G */
- X '0', /* H */
- X '0', /* I */
- X '2', /* J */
- X '2', /* K */
- X '4', /* L */
- X '5', /* M */
- X '5', /* N */
- X '0', /* O */
- X '1', /* P */
- X '0', /* Q */
- X '6', /* R */
- X '2', /* S */
- X '3', /* T */
- X '0', /* U */
- X '1', /* V */
- X '0', /* W */
- X '2', /* X */
- X '0', /* Y */
- X '2', /* Z */
- X};
- X
- X/*
- X** Soundex for arbitrary number of characters of information
- X*/
- Xchar *nsoundex(str, n)
- Xchar *str; /* In: String to be converted */
- Xint n; /* In: Number of characters in result string */
- X{
- X static char buff[10];
- X register char *s;
- X register char *t;
- X char c;
- X char l;
- X
- X if (n <= 0)
- X n = 4; /* Default */
- X if (n > sizeof(buff) - 1)
- X n = sizeof(buff) - 1;
- X t = &buff[0];
- X
- X for (s = str; ((c = *s) != '\0') && t < &buff[n]; s++)
- X {
- X if (!isascii(c))
- X continue;
- X if (!isalpha(c))
- X continue;
- X c = toupper(c);
- X if (t == &buff[0])
- X {
- X l = *t++ = c;
- X continue;
- X }
- X c = lookup[c-'A'];
- X if (c != '0' && c != l)
- X l = *t++ = c;
- X }
- X while (t < &buff[n])
- X *t++ = '0';
- X *t = '\0';
- X return(&buff[0]);
- X}
- X
- X/* Normal external interface */
- Xchar *soundex(str)
- Xchar *str;
- X{
- X return(nsoundex(str, 9));
- X}
- X
- X/*
- X** Alternative interface:
- X** void soundex(given, gets)
- X** char *given;
- X** char *gets;
- X** {
- X** strcpy(gets, nsoundex(given, 4));
- X** }
- X*/
- X
- X
- X#ifdef TEST
- X#include <stdio.h>
- Xmain()
- X{
- X char buff[30];
- X
- X while (fgets(buff, sizeof(buff), stdin) != (char *)0)
- X printf("Given: %s Soundex produces %s\n", buff,
- X soundex(buff));
- X}
- X#endif
- SHAR_EOF
- if [ `wc -c < soundex2a.c` -ne 2839 ]
- then
- echo "Lengths do not match -- Bad Copy of soundex2a.c"
- fi
- echo "Extracting file soundex2b.c"
- sed -e 's/^X//' <<\SHAR_EOF > soundex2b.c
- X/*
- X@(#)File: soundex.c
- X@(#)Version: 1.2
- X@(#)Last changed: 89/12/18
- X@(#)Purpose: Produce SOUNDEX code for string
- X@(#)Author: Jonathan Leffler (john@sphinx.co.uk)
- X*/
- X
- X/*
- X** SOUNDEX CODING
- X**
- X** Rules:
- X** 1. Retain the first letter; ignore non-alphabetic characters.
- X** 2. Replace second and subsequent characters by a group code.
- X** Group Letters
- X** 1 BFPV
- X** 2 CGJKSXZ
- X** 3 DT
- X** 4 L
- X** 5 MN
- X** 6 R
- X** 3. Do not repeat digits if they come from adjacent characters.
- X** (Corrected by: Raymond Chen <reading!bosco.berkeley.edu!raymond>)
- X** 4. Truncate or zero-pad to 4-character result.
- X**
- X** Originally formatted with tabstops set at 4 spaces -- you were warned!
- X**
- X** This code is shareware -- I wrote it; you can have it for free
- X** if you supply it to anyone else who wants it for free.
- X**
- X** BUGS: Assumes ASCII
- X*/
- X
- X#include <ctype.h>
- X
- X#ifndef lint
- Xstatic char sccs[] = "@(#)soundex.c 1.2 89/12/18";
- X#endif
- X
- Xstatic char lookup[] = {
- X '0', /* A */
- X '1', /* B */
- X '2', /* C */
- X '3', /* D */
- X '0', /* E */
- X '1', /* F */
- X '2', /* G */
- X '0', /* H */
- X '0', /* I */
- X '2', /* J */
- X '2', /* K */
- X '4', /* L */
- X '5', /* M */
- X '5', /* N */
- X '0', /* O */
- X '1', /* P */
- X '0', /* Q */
- X '6', /* R */
- X '2', /* S */
- X '3', /* T */
- X '0', /* U */
- X '1', /* V */
- X '0', /* W */
- X '2', /* X */
- X '0', /* Y */
- X '2', /* Z */
- X};
- X
- X/*
- X** Soundex for arbitrary number of characters of information
- X*/
- Xchar *nsoundex(str, n)
- Xchar *str; /* In: String to be converted */
- Xint n; /* In: Number of characters in result string */
- X{
- X static char buff[10];
- X register char *s;
- X register char *t;
- X char c;
- X char l;
- X
- X if (n <= 0)
- X n = 4; /* Default */
- X if (n > sizeof(buff) - 1)
- X n = sizeof(buff) - 1;
- X t = &buff[0];
- X
- X for (s = str; ((c = *s) != '\0') && t < &buff[n]; s++)
- X {
- X if (!isascii(c) || !isalpha(c))
- X continue;
- X c = toupper(c);
- X if (t == &buff[0])
- X {
- X l = *t++ = c;
- X continue;
- X }
- X c = lookup[c-'A']; /* Assumes ASCII */
- X if (c != '0' && c != l)
- X *t++ = c;
- X l = c;
- X }
- X while (t < &buff[n])
- X *t++ = '0';
- X *t = '\0';
- X return(&buff[0]);
- X}
- X
- X/* Normal external interface */
- Xchar *soundex(str)
- Xchar *str;
- X{
- X return(nsoundex(str, 4));
- X}
- X
- X/*
- X** Alternative interface:
- X** void soundex(given, gets)
- X** char *given;
- X** char *gets;
- X** {
- X** strcpy(gets, nsoundex(given, 4));
- X** }
- X*/
- X
- X
- X#ifdef TEST
- X#include <stdio.h>
- Xmain()
- X{
- X char buff[30];
- X
- X printf("String? ");
- X while (fgets(buff, sizeof(buff), stdin) != (char *)0)
- X {
- X printf("String : %sSoundex: %s\n", buff, soundex(buff));
- X printf("String? ");
- X }
- X putchar('\n');
- X}
- X#endif
- SHAR_EOF
- if [ `wc -c < soundex2b.c` -ne 3032 ]
- then
- echo "Lengths do not match -- Bad Copy of soundex2b.c"
- fi
- echo "Extracting file soundex3.c"
- sed -e 's/^X//' <<\SHAR_EOF > soundex3.c
- X#include <stdio.h>
- X#include <ctype.h>
- X
- X#define TRUE 1
- X#define FALSE 0
- X
- X#define DEFAULT_DICT "/usr/dict/words"
- X#define PATTERN_SIZE 6
- X
- X#define my_toupper(x) (islower(x) ? toupper(x) : (x))
- X
- Xstatic char lookup[] = {
- X '0', /* A */
- X '1', /* B */
- X '2', /* C */
- X '3', /* D */
- X '0', /* E */
- X '1', /* F */
- X '2', /* G */
- X '0', /* H */
- X '0', /* I */
- X '2', /* J */
- X '2', /* K */
- X '4', /* L */
- X '5', /* M */
- X '5', /* N */
- X '0', /* O */
- X '1', /* P */
- X '0', /* Q */
- X '6', /* R */
- X '2', /* S */
- X '3', /* T */
- X '0', /* U */
- X '1', /* V */
- X '0', /* W */
- X '2', /* X */
- X '0', /* Y */
- X '2', /* Z */
- X};
- X
- Xchar *soundex();
- X
- X
- Xmain (argc, argv)
- Xint argc;
- Xchar *argv[];
- X{
- X int count;
- X char pattern[PATTERN_SIZE + 1]; /* PATTERN_SIZE characters plus NUL */
- X FILE *fp, *fopen();
- X
- X if (argc < 2)
- X {
- X fprintf (stderr, "Usage: %s word [ - | wordlist ...]\n",
- X argv[0]);
- X fprintf (stderr, " use \"-\" to read from stdin.\n");
- X exit (1);
- X }
- X
- X strcpy (pattern, soundex(argv[1]));
- X
- X if (argc == 2) /* use default dictionary */
- X {
- X if ((fp = fopen (DEFAULT_DICT, "r")) == NULL)
- X {
- X fprintf (stderr, "%s: Cannot open %s for reading\n",
- X argv[0], DEFAULT_DICT);
- X }
- X else
- X {
- X match (fp, pattern);
- X fclose(fp);
- X }
- X }
- X else /* use specified dictionaries */
- X {
- X for (count = 2 ; count < argc ; count++)
- X {
- X if (strcmp (argv[count], "-") == 0)
- X {
- X match (stdin, pattern);
- X }
- X else if ((fp = fopen (argv[count], "r")) == NULL)
- X {
- X fprintf (stderr, "%s: Cannot open %s for reading\n",
- X argv[0], argv[count]);
- X }
- X else
- X {
- X match (fp, pattern);
- X fclose(fp);
- X }
- X }
- X }
- X exit (0);
- X}
- X
- X/************************************************************************
- X/*
- X/*
- X/************************************************************************/
- X
- Xint match (fp, pattern)
- Xregister FILE *fp;
- Xregister char *pattern;
- X{
- X char word[BUFSIZ];
- X register char *wordp = &word[0];
- X
- X /* read all words before our stuff */
- X do
- X {
- X if (fgets(wordp, BUFSIZ - 1, fp) == NULL)
- X return (FALSE); /* EOF before any candidate word */
- X } while (*pattern != my_toupper(*wordp));
- X
- X word[strlen(word) - 1] = '\0'; /* remove the \n */
- X if (the_same (pattern, soundex(wordp)))
- X {
- X puts (word);
- X }
- X
- X while (fgets(wordp, BUFSIZ - 1,fp) != NULL)
- X {
- X if (*pattern != my_toupper (*wordp)) /* give it up */
- X {
- X break;
- X }
- X word[strlen(word) - 1] = '\0'; /* remove the \n */
- X if (the_same (pattern, soundex(wordp)))
- X {
- X puts (word);
- X }
- X }
- X return (TRUE);
- X}
- X
- X
- X
- X/************************************************************************
- X/*
- X/*
- X/************************************************************************/
- X
- Xint the_same (str1, str2)
- Xregister char *str1;
- Xregister char *str2;
- X{
- X while (*(str1++) == *(str2++))
- X if (*str1 == '\0')
- X return (TRUE);
- X return (FALSE);
- X}
- X
- X
- X/*
- X** SOUNDEX CODING
- X**
- X** Rules:
- X** 1. Retain the first letter; ignore non-alphabetic characters.
- X** 2. Replace second and subsequent characters by a group code.
- X** Group Letters
- X** 1 BFPV
- X** 2 CGJKSXZ
- X** 3 DT
- X** 4 L
- X** 5 MN
- X** 6 R
- X** 3. Do not repeat digits
- X** 4. Truncate or zero-pad to 4-character result.
- X**
- X** Originally formatted with tabstops set at 4 spaces -- you were warned!
- X**
- X** Code by: Jonathan Leffler (john@sphinx.co.uk)
- X** This code is shareware -- I wrote it; you can have it for free
- X** if you supply it to anyone else who wants it for free.
- X**
- X** BUGS: Assumes ASCII
- X*/
- X
- X/*
- X** Soundex for arbitrary number of characters of information
- X*/
- Xchar *soundex(str)
- Xchar *str; /* In: String to be converted */
- X{
- X static char buff[PATTERN_SIZE +1];
- X register char *s;
- X register char *t;
- X char c;
- X char l;
- X
- X t = &buff[0];
- X
- X for (s = str; ((c = *s) != '\0') && t < &buff[PATTERN_SIZE]; s++)
- X {
- X if (!isascii(c))
- X continue;
- X if (!isalpha(c))
- X continue;
- X if (islower(c))
- X c = toupper(c);
- X if (t == &buff[0])
- X {
- X l = *t++ = c;
- X continue;
- X }
- X c = lookup[c-'A'];
- X if (c != '0' && c != l)
- X l = *t++ = c;
- X }
- X while (t < &buff[PATTERN_SIZE])
- X *t++ = '0';
- X *t = '\0';
- X return(&buff[0]);
- X}
- SHAR_EOF
- if [ `wc -c < soundex3.c` -ne 4996 ]
- then
- echo "Lengths do not match -- Bad Copy of soundex3.c"
- fi
- echo "Extracting file soundex4.c"
- sed -e 's/^X//' <<\SHAR_EOF > soundex4.c
- X/***
- X*
- X* soundex()
- X* adapted for 4gl
- X*
- X**/
- X#include "stdio.h"
- X#include "ctype.h"
- X#define MAX_DIGITS 4 /** number of digits in code sequence **/
- X
- Xstatic char omit_letter[] ={"AEHIOUWY"};
- Xstatic char *code_group[] =
- X{
- X "BFPV", "CGJKQSXZ", "DT", "L", "MN", "R", NULL
- X};
- X/*
- XTo test this stand-alone, remove the comment markers
- X
- X#define popquote(a,b) strcpy(a,str)
- X#define pushquote(a) strcpy(rstr,a)
- Xchar str[100];
- Xchar rstr[100];
- Xmain()
- X{
- X while(gets(str))
- X {
- X if(strcmp (str,"end") == 0) break;
- X soundex(1);
- X *str = 0;
- X printf("code = %s for name %s\n",rstr,str);
- X }
- X}
- X*/
- Xsoundex(args)
- Xint args;
- X{
- X char name_for_soundex[100];
- X char soundex[MAX_DIGITS + 1]; /* room for the terminating NUL */
- X char *p;
- X int k,i,j;
- X
- X popquote(name_for_soundex,sizeof(name_for_soundex));
- X
- X /**
- X first character not translated. But shouldn't we make sure
- X it is an alpha? In case a funny user enters " Smith";
- X **/
- X for(i = j = 0; !isalpha(name_for_soundex[i]);i++)
- X ;
- X
- X soundex[j++] = toupper(name_for_soundex[i]);
- X
- X for (i++;j < MAX_DIGITS && name_for_soundex[i];i++)
- X {
- X /**
- X Only alphas are coded. Why this should return an error
- X if the letter isn't an alpha I don't know;
- X it would prohibit names like D'Artagnan, who was
- X one of the three musketeers.
- X **/
- X if (isalpha(name_for_soundex[i]))
- X {
- X name_for_soundex[i] = toupper(name_for_soundex[i]);
- X if (strchr(omit_letter,name_for_soundex[i]) == NULL)
- X {
- X for(k = 0; code_group[k];k++)
- X {
- X if(strchr(code_group[k],name_for_soundex[i]))
- X soundex[j++] = (k + 1) + 48;
- X }
- X }
- X }
- X }
- X
- X /** fill string if necessary **/
- X while (j < MAX_DIGITS)
- X {
- X soundex[j++] = '0';
- X }
- X soundex[j] = '\0';
- X pushquote(soundex);
- X return(args);
- X}
- SHAR_EOF
- if [ `wc -c < soundex4.c` -ne 1980 ]
- then
- echo "Lengths do not match -- Bad Copy of soundex4.c"
- fi
- echo "Extracting file soundex5.c"
- sed -e 's/^X//' <<\SHAR_EOF > soundex5.c
- X/*
- X * Reference: Adapted from Knuth, D.E. (1973) The art of computer programming;
- X * Volume 3: Sorting and searching. Addison-Wesley Publishing Company:
- X * Reading, Mass. Page 392.
- X *
- X * 1. Retain the first letter of the name, and drop all occurrences of
- X * a, e, h, i, o, u, w, y in other positions.
- X *
- X * 2. Assign the following numbers to the remaining letters after the first:
- X * b, f, p, v -> 1 l -> 4
- X * c, g, j, k, q, s, x, z -> 2 m, n -> 5
- X * d, t -> 3 r -> 6
- X *
- X * 3. If two or more letters with the same code were adjacent in the original
- X * name (before step 1), omit all but the first.
- X *
- X * 4. Convert to the form ``letter, digit, digit, digit'' by adding trailing
- X * zeros (if there are less than three digits), or by dropping rightmost
- X * digits (if there are more than three).
- X *
- X * The examples given in the book are:
- X *
- X * Euler, Ellery E460
- X * Gauss, Ghosh G200
- X * Hilbert, Heilbronn H416
- X * Knuth, Kant K530
- X * Lloyd, Ladd L300
- X * Lukasiewicz, Lissajous L222
- X *
- X * Most algorithms fail in two ways:
- X * 1. they omit adjacent letters with the same code AFTER step 1, not before.
- X * 2. they do not omit adjacent letters with the same code at the beginning
- X * of the name.
- X *
- X */
- X
- X#include <stdio.h>
- X#include <ctype.h>
- X
- X#define SDXLEN 4
- X
- Xchar *soundex(name)
- Xchar *name;
- X{
- Xstatic char buf[SDXLEN+1];
- Xstatic char map[] = "01230120022455012623010202";
- Xregister char mc, uc, pc = '0';
- Xregister int idx;
- X
- Xstrcpy(buf,"Z000");
- X
- Xfor (idx = 0; *name && idx < SDXLEN; name++)
- X if (isalpha(*name)) {
- X uc = toupper(*name);
- X mc = map[uc-'A'];
- X if (idx == 0 || (mc != '0' && mc != pc)) {
- X buf[idx] = idx ? mc : uc;
- X idx++;
- X }
- X pc = mc;
- X }
- Xreturn(buf);
- X}
- SHAR_EOF
- if [ `wc -c < soundex5.c` -ne 1928 ]
- then
- echo "Lengths do not match -- Bad Copy of soundex5.c"
- fi
- echo "Done."
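
For anyone porting these routines, here is a minimal sketch in ANSI C of the four rules quoted in the soundex5.c header, including the two points that header says most versions get wrong (adjacency is judged on the original letters, and the first letter's own code counts). The names soundex_knuth, soundex_code_of and SDX_LEN are mine and appear nowhere in the archive. The sketch assumes ASCII, takes a caller-supplied buffer instead of a static one, and returns "0000" for a name with no letters, which is my choice rather than anything the sources specify. It follows the rule exactly as quoted, so H and W break a run of equal codes just as vowels do; other Soundex variants may treat those two letters differently.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define SDX_LEN 4   /* one letter followed by three digits */

/* Group code for an uppercase ASCII letter; '0' means "dropped"
 * (A E H I O U W Y).  Same table as soundex1.c and soundex5.c. */
static char soundex_code_of(char upper)
{
    /*                         ABCDEFGHIJKLMNOPQRSTUVWXYZ */
    static const char map[] = "01230120022455012623010202";
    return map[upper - 'A'];
}

/* Writes the code and a terminating NUL into out, which must hold at
 * least SDX_LEN + 1 bytes, and returns out. */
char *soundex_knuth(const char *name, char out[SDX_LEN + 1])
{
    size_t len = 0;
    char prev = '0';

    assert(name != NULL && out != NULL);
    memset(out, '0', SDX_LEN);      /* pre-fill: short names end up zero-padded */
    out[SDX_LEN] = '\0';

    for (; *name != '\0' && len < SDX_LEN; name++) {
        unsigned char ch = (unsigned char)*name;
        char upper, code;

        if (ch > 127 || !isalpha(ch))
            continue;               /* ignore anything that is not an ASCII letter */
        upper = (char)toupper(ch);
        code = soundex_code_of(upper);

        if (len == 0)
            out[len++] = upper;     /* rule 1: keep the first letter as a letter */
        else if (code != '0' && code != prev)
            out[len++] = code;      /* rules 2 and 3: new, non-dropped code */

        /* Remember the code of every letter, dropped or not, so that
         * adjacency is tested on the original name and the first
         * letter's code suppresses an immediate repeat ("Lloyd"). */
        prev = code;
    }
    return out;
}
```

With a five-byte buffer buf, soundex_knuth("Lukasiewicz", buf) should give L222 and soundex_knuth("Lloyd", buf) should give L300, which agrees with the examples listed in the soundex5.c header.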
-
- ------------------------------------------------------------------------------
-
- --
- Walt Hultgren Internet: walt@rmy.emory.edu (IP 128.140.8.1)
- Emory University UUCP: {...,gatech,rutgers,uunet}!emory!rmy!walt
- 954 Gatewood Road, NE BITNET: walt@EMORY
- Atlanta, GA 30329 USA Voice: +1 404 727 0648
-