Source Code 1994 March

home *** CD-ROM | disk | FTP | other *** search

/ Source Code 1994 March / Source_Code_CD-ROM_Walnut_Creek_March_1994.iso / compsrcs / misc / volume37 / phon2alf / part01 < prev next >

Wrap

Text File | 1993-05-15 | 24.7 KB | 847 lines

Newsgroups: comp.sources.misc From: dave@devteq.co.uk (Dave Thomas) Subject: v37i055: phone2alf - guess sensible words from a phone number, Part01/01 Message-ID: <1993May16.005347.15283@sparky.imd.sterling.com> X-Md4-Signature: 02b36b8a396558f7562d560dde569c82 Date: Sun, 16 May 1993 00:53:47 GMT Approved: kent@sparky.imd.sterling.com Submitted-by: dave@devteq.co.uk (Dave Thomas) Posting-number: Volume 37, Issue 55 Archive-name: phone2alf/part01 Environment: C This is a set of three programs which together make slightly better than random guesses at words that can be made up from phone numbers (actually any numbers). We apply a simple heuristic to try to put likely English (actually any language) words at the top of the list of permutations. Often these words aren't English, but the should have been. :-) This was all written as a bit of fun, and is released without restriction for non-commercial use. I'd be interested in any feedback. From_____________________________________________________________________ | Dave Thomas - Devteq Ltd - 18 Thames St - Windsor - Berks SL4 1PL - UK | | Tel: +44 753 830333 - Fax: +44 753 831645 - email: dave@devteq.co.uk | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -----------------------v--- cut here ---v-------------------------------- #! /bin/sh # This is a shell archive. Remove anything before this line, then feed it # into a shell via "sh file" or similar. To overwrite existing files, # type "sh file -c". # Contents: README Makefile generate num2alpha.c score.c threes.c # Wrapped by kent@sparky on Sat May 15 19:47:51 1993 PATH=/bin:/usr/bin:/usr/ucb:/usr/local/bin:/usr/lbin ; export PATH echo If this archive is complete, you will see the following message: echo ' "shar: End of archive 1 (of 1)."' if test -f 'README' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'README'\" else echo shar: Extracting \"'README'\" $2799 characters$ sed "s/^X//" >'README' <<'END_OF_FILE' XPHONE2ALF - generate sensible words from a phone number X XThis is a set of three programs which together make slightly Xbetter than random guesses at words that can be made up from Xphone numbers (actually any numbers). X XClearly, generating the possible permutations of 'words' from Xa phone number is simple. However with over 2000 possible Xpermutations of a 7 digit number, looking for the word that's Xjust right can take a while. X XTo speed things up, we apply a simple heuristic, based on a Xspelling checking algorithm I remember vaguely from a mid X70's CACM or SigPLAN. Before starting, we build up a Xfrequency table for each of the 21952 possible combinations Xof three consecutive letters (including start and end of word Xas letters) in a (largish) sample of text representative of Xthe language we want to test against. Thus for the word X'Thus' we would count X X <SOW>th X thu X hus X us<EOW> X XTo reduce the size of the table, and prevent possible Xoverflow issues, we actually store ~log2(frequency). X XWe then look at the words generated by our phone number. For Xeach, we add together the previously observed trigraph Xfrequencies for the letters in that word. This gives us a Xscore. By sorting the words by that score, we put the words Xmost likely to be in our target language set first. X XTo use this package, first determine a set of files and/or Xdirectories contains words you feel respresentative of your Xtarget. The bigger, the better. As distributed, the program Xwill look through some of your usenet directories. Edit the XMakefile to point to these. X XRun 'make'. This generates three programs. X X 'threes' does the counting of frequencies. The makefile X will invoke this automatically to generate the X file 'freq.table'. X X 'num2alpha' takes a number as a parameter and generates on X stdout all words that can be made using the X letters on a phone pad corresponding to the digits X in the number. X X 'score' Takes a list of words and scores them according X to the previously observed frequencies of X occurance of each words constituent trigraphs. X X XTo generate a sorted list of words for a phone number, run the Xshell script X X generate <number> X X..... X XThis was all written as a bit of fun, and is released without Xrestriction for non-commercial use. I'd be interested in any Xfeedback. If anyone wants to make a million out of this, I'll Xtake my cut :-) X X XFrom_____________________________________________________________________ X| Dave Thomas - Devteq Ltd - 18 Thames St - Windsor - Berks SL4 1PL - UK | X| Tel: +44 753 830333 - Fax: +44 753 831645 - email: dave@devteq.co.uk | X ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ X X X END_OF_FILE if test 2799 -ne `wc -c <'README'`; then echo shar: \"'README'\" unpacked with wrong size! fi # end of 'README' fi if test -f 'Makefile' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'Makefile'\" else echo shar: Extracting \"'Makefile'\" $1721 characters$ sed "s/^X//" >'Makefile' <<'END_OF_FILE' X# X# Generate the various programs and datafiles needed to X# play with the 'English from phone numbers' suite. See X# README for more info. X# X# X# dave@devteq.co.uk X# X X# X# WORDS points at a set of files and/or directories containing X# text in the language you want to use. Be careful not to include X# too many non text files, as these will degrade the result. X# In particular, if you're using the USENET feed as a source, X# avoid .binaries & .sources (becuase uuencoded stuff contains X# patterns you don't want to accept and programmer can't spell X# in the first place!) X# X XWORDS = /usr/spool/news/comp/lang /usr/spool/news/alt/politics X XCFLAGS = -O X X####################################################################### X# X# Rest shouldn't need changing X# X XSHAR_FILES = Makefile README generate num2alpha.c score.c threes.c X Xall: threes num2alpha score freq.table X X# X# The THREEs program generates the original frequency tables X# based on a set of input files & directories X Xthrees: threes.o X cc -o threes threes.o X X# X# Num2Alpha generates all possible letter combinations corresponding X# to a given phone number X Xnum2alpha: num2alpha.o X cc -o num2alpha num2alpha.o X X# X# Score takes a set of words (normally generated by num2alpha) and X# generates a score based on the frequencies of each word's letters X# in the original sample set (generated by threes) X Xscore: score.o X cc -o score score.o X X X# X# Generate the binary frequency table based on the given set X# of files & directories. This will always be into a file called X# freq.table X Xfreq.table: threes X threes $(WORDS) X X X# X# Save 'em all away X# X Xshar: phone2alf.shar X Xphone2alf.shar: $(SHAR_FILES) X shar -d -v $(SHAR_FILES) >phone2alf.shar X END_OF_FILE if test 1721 -ne `wc -c <'Makefile'`; then echo shar: \"'Makefile'\" unpacked with wrong size! fi # end of 'Makefile' fi if test -f 'generate' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'generate'\" else echo shar: Extracting \"'generate'\" $642 characters$ sed "s/^X//" >'generate' <<'END_OF_FILE' X#!/bin/sh X# X# Given a phone number as $1, generate a sorted list X# words where each letter comes from the characters on X# a telephone keypad corresponding to the input number. X# These words are sorted according to their liklihood of X# being approximately 'English' (or to be more accurate X# their nearness to an arbitrary sample of text previously X# loaded into 'freq.table' by the trieme program) X# X# dave@devteq.co.uk X Xif [ $# -ne 1 ]; then X echo usage: $0 \<phone no\> X exit 1 Xfi X Xif [ ! -f freq.table ]; then X echo You must run \'threes\' first to generate the frequency table X exit 2 Xfi X Xnum2alpha $1 | score | sort -rn X END_OF_FILE if test 642 -ne `wc -c <'generate'`; then echo shar: \"'generate'\" unpacked with wrong size! fi chmod +x 'generate' # end of 'generate' fi if test -f 'num2alpha.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'num2alpha.c'\" else echo shar: Extracting \"'num2alpha.c'\" $3270 characters$ sed "s/^X//" >'num2alpha.c' <<'END_OF_FILE' X/***********************************************************************\ X* num2alpha.c * Convert a phone number to alpha strings. * * X\************************************************************************ X* * X* Given a phone number, generate all possible alpha strings * X* corresponding to the labels on those digits on a standard phone. * X* * X*_______________________________________________________________________* X* Dave Thomas - Devteq Ltd - 18 Thames St - Windsor - Berks SL4 1PL - UK* X* Tel: +44 753 830333 - Fax: +44 753 831645 - email: dave@devteq.co.uk* X*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~* X* * X\***********************************************************************/ X X#include <stdio.h> X#include <stdlib.h> X#include <ctype.h> X#include <memory.h> X X/*** X* Array indexed by 'digit' - '0', gives set of each alpha X* character for that digit X***/ X Xchar *alphas[10] = X{ X /* 0 */ "0", X /* 1 */ "1", X /* 2 */ "abc", X /* 3 */ "def", X /* 4 */ "ghi", X /* 5 */ "jkl", X /* 6 */ "mno", X /* 7 */ "prs", X /* 8 */ "tuv", X /* 9 */ "wxy", X}; X X/*** X* array holds result of each permutation X***/ X Xchar result[20] = X{0}; /* any bigger & we'll take forever anyway */ X Xvoid Xperm (const char *pNumber, int offset) X{ X if (pNumber[offset] == 0) X { X puts (result); X } X else X { X char *nextAlpha = alphas[pNumber[offset] - '0']; X X while (*nextAlpha) X { X result[offset] = *nextAlpha; X perm (pNumber, offset + 1); X nextAlpha++; X } X } X} X X X/***********************************************************************\ X* convert * Convert a number to alpha strings * * X\************************************************************************ X* * X* Convert a single number to its corrsponding alpha strings. * X* * X\***********************************************************************/ X Xvoid Xconvert (const char *pNumber) X{ X register const char *p; X X /*** X * First check its actually a number X ***/ X X for (p = pNumber; *p; p++) X if (!isdigit (*p)) X { X fprintf (stderr, "'%s' is not a number - skipping...\n", pNumber); X return; X } X else if ((*p == '0') || (*p == '1')) X { X fprintf (stderr, "'%s' contains a '%c' - this doesn't" X " have a letter associated with it\n", pNumber, *p); X return; X } X X if ((p - pNumber) >= sizeof (result)) X { X fprintf (stderr, "Number '%s' is longer than %d characters\n", X pNumber, sizeof (result) - 1); X return; X } X X perm (pNumber, 0); X} X X/***********************************************************************\ X* main * Generate alpha strings from number * * X\************************************************************************ X* * X* Accept a number from the argument list, and generate the cross * X* product of its digits and those digits labels on a standard phone. * X* * X\***********************************************************************/ X Xint Xmain (int argc, char **argv) X{ X argc--; X argv++; X X while (argc--) X convert (*argv++); X X return 0; X} END_OF_FILE if test 3270 -ne `wc -c <'num2alpha.c'`; then echo shar: \"'num2alpha.c'\" unpacked with wrong size! fi # end of 'num2alpha.c' fi if test -f 'score.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'score.c'\" else echo shar: Extracting \"'score.c'\" $4694 characters$ sed "s/^X//" >'score.c' <<'END_OF_FILE' X X/***********************************************************************\ X* score.c * Calculate trieme score for a set of words * * X\************************************************************************ X* * X* Someone's previously sampled a whole lot of English words, and * X* generated a table of frequencies for all the triemes (groups of * X* 3 consecutive letters) in that sample. We now read in a set of * X* words from stdin and add together the previously accumulated * X* frequencies for each trieme in each word. For each word, when then * X* output the total score for that word. * X* * X* Why? * X* * X* Well - the word list is generated from (say) all the permutations * X* of letters corresponding to a phone number. By scoring each * X* against a population of triemes, we can identify those combinations* X* of letters that are most likely to be words from that population. * X* We can then more readily find appropriate nmemonics! * X* * X* Released to the net for free use. Clearly the author accepts no * X* responsibility for anything to do with this program. Share & enjoy * X* * X*_______________________________________________________________________* X* Dave Thomas - Devteq Ltd - 18 Thames St - Windsor - Berks SL4 1PL - UK* X* Tel: +44 753 830333 - Fax: +44 753 831645 - email: dave@devteq.co.uk* X*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~* X* * X\***********************************************************************/ X X#include <stdio.h> X#include <stdlib.h> X X#define FREQ_TABLE_FILE "freq.table" X X#define SOW ('a'-1) /* start & end of word 'pseudo' characters */ X#define EOW ('z'+1) X X#if !defined(TRUE) X#define TRUE 1 X#define FALSE 0 X#endif X Xunsigned char count[28][28][28] = X{0}; /* trieme counter */ X X/***********************************************************************\ X* handleWord * Accumulate trieme frequencies for a single word * * X\************************************************************************ X* * X* Given a word delimited by SOW/EOW, calculate the frequencies for * X* each trieme (including the delimters) * X* * X\***********************************************************************/ X X#define INDEX(ch) (ch-SOW) X Xvoid XhandleWord (char *pWord) X{ X register char *p; X int score = 0; X X for (p = pWord; *(p + 1) != EOW; p++) X { X score += count[INDEX (*p)][INDEX (p[1])][INDEX (p[2])]; X } X X p[1] = 0; /* remove EOW */ X X printf ("%-5d %s\n", score, pWord + 1); X} X X/***********************************************************************\ X* processFile * Process an individual file * * X\************************************************************************ X* * X* Go through a file extracting word by word. For each, calll handleWord * X* to build trieme frequencies. If we ever come across a NUL in the file * X* we give it - it must be a binary file. * X\***********************************************************************/ X Xvoid XprocessFile (FILE * f) X{ X char word[50]; /* collects current word */ X int iWord; /* index into it */ X X int fInWord = FALSE; X int ch; X X while (((ch = fgetc (f)) != EOF) && ch) X { X X if (!fInWord && isalpha (ch)) X { X fInWord = TRUE; X iWord = 0; X word[iWord++] = SOW; X } X X if (fInWord) X { X if (!isalpha (ch)) X { X if (iWord > 4) X { X word[iWord] = EOW; X handleWord (word); X } X fInWord = FALSE; X } X else X { X word[iWord] = tolower (ch); X if (iWord < (sizeof (word) - 1)) X iWord++; X } X X } X } X X} X X/***********************************************************************\ X* main * Parse args and run analysis * * X\************************************************************************ X* * X* Trundle through the arguments. For each file found, process it. * X* For each directory, process it recursively. * X* * X\***********************************************************************/ X Xint Xmain () X{ X FILE *f = fopen (FREQ_TABLE_FILE, "r"); X X if (f == NULL) X { X perror (FREQ_TABLE_FILE); X return 0; X } X X if (fread (count, 1, sizeof (count), f) != sizeof (count)) X { X fprintf (stderr, "Couldn't read in '%s'\n", FREQ_TABLE_FILE); X return 0; X } X X fclose (f); X X X processFile (stdin); X return 0; X X} END_OF_FILE if test 4694 -ne `wc -c <'score.c'`; then echo shar: \"'score.c'\" unpacked with wrong size! fi # end of 'score.c' fi if test -f 'threes.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'threes.c'\" else echo shar: Extracting \"'threes.c'\" $7110 characters$ sed "s/^X//" >'threes.c' <<'END_OF_FILE' X/***********************************************************************\ X* threes.c * Calculate trigraph frequencies for a group of files * X\************************************************************************ X* * X* Given a group of files or directories, calculate the frequencies * X* of each group of three adjacent letters in each word. We treat * X* start of word and end of word as special letters. * X* * X* The output is an array of 28^3 unsigned chars, corresponding to * X* log(2) of the frequencies (except log(2)(0) == 0). * X* * X* This is a simple precursor to finding most likely English words * X* amongst perumations of letters generated from (for example) * X* phone numbers. * X* * X* Released to the net for free use. Clearly the author accepts no * X* responsibility for anything to do with this program. Share & enjoy * X* * X*_______________________________________________________________________* X* Dave Thomas - Devteq Ltd - 18 Thames St - Windsor - Berks SL4 1PL - UK* X* Tel: +44 753 830333 - Fax: +44 753 831645 - email: dave@devteq.co.uk* X*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~* X* * X\***********************************************************************/ X X#include <limits.h> X#include <stdio.h> X#include <stdlib.h> X#include <ftw.h> X#include <ctype.h> X#include <sys/types.h> X#include <sys/stat.h> X X#define SOW ('a'-1) /* start & end of word 'pseudo' characters */ X#define EOW ('z'+1) X X#if !defined(TRUE) X#define TRUE 1 X#define FALSE 0 X#endif X Xtypedef unsigned int CountType; X#define COUNT_MAX UINT_MAX /* depends on CountType */ X/* MUST BE 2^n -1 */ X XCountType count[28][28][28] = {0}; /* trigraph counter */ X X/***********************************************************************\ X* handleWord * Accumulate trigraph frequencies for a single word * X\************************************************************************ X* * X* Given a word delimited by SOW/EOW, calculate the frequencies for * X* each trigraph(including the delimters) * X* * X\***********************************************************************/ X X#define INDEX(ch) (ch-SOW) X Xvoid XhandleWord (char *pWord) X{ X register char *p; X X X for (p = pWord; *(p + 1) != EOW; p++) X { X if (++count[INDEX (*p)][INDEX (p[1])][INDEX (p[2])] == COUNT_MAX) X { X printf ("Trieme overflow \"%c%c%c\"\n", p[0], p[1], p[2]); X count[INDEX (*p)][INDEX (p[1])][INDEX (p[2])] -= 1000; X } X } X X} X X#undef INDEX X X X/***********************************************************************\ X* processFile * Process an individual file * X\************************************************************************ X* * X* Go through a file extracting word by word. For each, calll handleWord * X* to build trigraph frequencies. If we ever come across a NUL in the * X* file we give it - it must be a binary file, so we stop processing. * X\***********************************************************************/ X Xvoid XprocessFile (FILE * f) X{ X char word[50]; /* collects current word */ X int iWord; /* index into it */ X X int fInWord = FALSE; X int ch; X X while (((ch = fgetc (f)) != EOF) && ch) X { X X if (!fInWord && isalpha (ch)) X { X fInWord = TRUE; X iWord = 0; X word[iWord++] = SOW; X } X X if (fInWord) X { X if (!isalpha (ch)) X { X if (iWord > 4) /* ignore words < 3 chars */ X { X word[iWord] = EOW; X handleWord (word); X } X fInWord = FALSE; X } X else X { X word[iWord] = tolower (ch); X if (iWord < (sizeof (word) - 1)) X iWord++; X } X X } X } X X} X X/***********************************************************************\ X* ftwCallback * Handle call backs from file tree walking * * X\************************************************************************ X* * X* For each regular file file found by FTW, open it and process the * X* contents. * X* * X\***********************************************************************/ X Xint XftwCallback (const char *pName, const struct stat *pStat, int flag) X{ X FILE *pFile; X X if (flag == FTW_F) X { /* is it a file? */ X if ((pFile = fopen (pName, "r")) == NULL) X perror (pName); X else X { X fputs (pName, stderr); X fputs (" ", stderr); X putc ('\r', stderr); X processFile (pFile); X fclose (pFile); X } X } X return 0; X} X X X/***********************************************************************\ X* dumpFrequencies * Dump log(2) of frequencies * * X\************************************************************************ X* * X* Go through the frequency array, converting values to their approximate* X* log(2) (by looking for the highest order bit set). Then dump out the * X* lot to a file "freq.h" * X\***********************************************************************/ X Xvoid XdumpFrequencies () X{ X int z = 0, nz = 0; X int i; X CountType *pCount = (CountType *) count; X FILE *op; X X if ((op = fopen ("freq.table", "w")) == NULL) X { X perror ("freq.table"); X return; X } X X for (i = 0; i < sizeof (count) / sizeof (CountType); i++, pCount++) X { X int val = *pCount; X val == 0 ? z++ : nz++; X X /* Try to chop into the word to work out the magnitude. X We're assuming a sensible representation of 'int' here X */ X X if (val) X { /* map 0 => 0 */ X X CountType bit; X int hi, lo, med; X X hi = sizeof (CountType) * CHAR_BIT; /* Top bit number */ X lo = 0; X X while ((hi > lo)) X { /* 2^lo <= val < 2^hi */ X med = (hi + lo) / 2; X bit = (1 << (med)) - 1; /* lo <= med <= hi */ X if (val == bit) X { X hi = med; X break; X } X X if (val >= bit) X lo = med + 1; /* 2^med <= val < 2^hi */ X else X hi = med; /* 2^lo <= val < 2^med */ X } X val = hi - 1; X } X X fputc (val, op); X } X X printf ("\n%d zero, %d not zero\n", z, nz); X X fclose (op); X X} X X/***********************************************************************\ X* main * Parse args and run analysis * * X\************************************************************************ X* * X* Trundle through the arguments. For each file found, process it. * X* For each directory, process it recursively. * X* * X\***********************************************************************/ X Xint Xmain (int argc, char **argv) X{ X argc--; X argv++; /* Point to first file */ X X if (argc <= 0) /* no args, do stdin */ X processFile (stdin); X else X { X X while (argc--) X { X char *pName = *argv++; X int rc; X X /* Note: ftw works as expected if its arg is a regular X file as well as a directory */ X X if ((rc = ftw (pName, ftwCallback, 12)) != 0) X perror (pName); X } X X } X X dumpFrequencies (); X return 0; X X} END_OF_FILE if test 7110 -ne `wc -c <'threes.c'`; then echo shar: \"'threes.c'\" unpacked with wrong size! fi # end of 'threes.c' fi echo shar: End of archive 1 $of 1$. cp /dev/null ark1isdone MISSING="" for I in 1 ; do if test ! -f ark${I}isdone ; then MISSING="${MISSING} ${I}" fi done if test "${MISSING}" = "" ; then echo You have the archive. rm -f ark[1-9]isdone else echo You still must unpack the following archives: echo " " ${MISSING} fi exit 0 exit 0 # Just in case...