Text File | 1991-05-14 | 10KB | 245 lines

Mark V. Shaney V1.0
a probabilistic text generator
(c) 1991 Stefan Strack
stst@cs.albany.edu
ss9557@albnyvms.bitnet

You may copy this program freely, provided you leave the program and
this documentation unchanged. You may not charge more than the cost of
the media for its distribution. I disclaim all liabilities for any
damages or loss of data incurred by using this program. Comments are
appreciated.

--------------------
Credits

Mark V. Shaney featured in the "Computer Recreations" column by
A.K. Dewdney in Scientific American. The original program (for a
mainframe, I believe) was written by Bruce Ellis based on an idea by
Don P. Mitchell. Dewdney tells the amusing story of a riot on
net.singles when Mark V. Shaney's ramblings were unleashed.

--------------------
Who is Mark V. Shaney?

Mark V. Shaney produces a confused imitation of the style and content
of a piece of writing. Mark reads the original text and builds a "word
probability table" that reflects the probability of a word following a
sequence of words. In output mode, Mark generates random text weighted
by the probabilities in this table (a so-called Markov chain, hence
Mark's name). Since Mark considers punctuation as part of the word, he
is likely to produce grammatical sentences, albeit a caricature of the
original text.

Mark V. Shaney is in the same league with the famous "ELIZA" and
"Racter" programs, and shows that you don't need AI for "almost human"
writing. This implementation of Mark V. Shaney allows you to vary the
"randomness" of the text output and supports huge probability tables in
expanded memory or on disk.

--------------------
What the program runs on

Mark V. Shaney is not a speed demon. While you could run the program on
a dual-floppy XT, you probably would not want to. A 640K AT system with
a hard disk is recommended as the minimum configuration. If you have
memory above 640K and an expanded memory driver installed, Mark V.
Shaney will be able to make use of it.

--------------------
Using the program

Type MARKV to start Mark V. Shaney. The menu on the bottom line of the
screen gives you the following choices, invoked by pressing their first
letter:

Read text
    The program will prompt you for a text file to read, building a
    word probability table as it goes along. Text files should not
    contain control or formatting characters other than tabs and
    carriage return/line feeds. Reading is very slow and can take
    several minutes for large files. Speed can be improved by
    increasing the number of DOS buffers in CONFIG.SYS. The bottleneck
    seems to be the CPU, which means you won't gain much from reading
    from a RAM drive. You can repeat this command to build a table from
    several text files. Reading will stop if you press any key.

Generate text
    After you have created a working probability table by either
    reading text files or loading a saved table, "Generate text" will
    start the probabilistic text generator. You are prompted for a file
    to append the text output to. If you press <Return>, output will go
    to the screen. Mark's ramblings will word-wrap automatically; no
    other formatting is done. The output stops when (a) Mark comes
    across a word that is the last word in one of the input texts, or
    (b) you press <Escape> after pausing output. You can pause output
    by pressing any key.

Load table
    Once a probability table has been built, it can be saved to a file
    (see Save table). "Load table" prompts you for a table to be
    retrieved from disk. This is much faster than re-creating the table
    from scratch by reading the original text files again. The downside
    is that saved tables are huge, about 9 times the size of the input
    text.

Save table
    Saves the probability table to disk.

Quit
    Exits the program.

--------------------
Command line options

    MARKV [-e/-m/-d[n:path]] [-g#] [-s#]

    -e           use EMS (default)
    -m           use conventional memory
    -d[n:path]   use disk
    -g#          set word grain to # (default 2)
    -s#          set random number seed to #

These options are explained in the next sections. Mark will also accept
the slash (/) character instead of the hyphen, or simply the option
letter by itself.

--------------------
The word grain option

In the original description of Mark V. Shaney's algorithm, the word
grain is 2. This means that the program breaks down an input text into
pairs of words, and calculates the probability of a third word
following a given pair of words. The probability table then looks like
this:

    Word grain 2
             \ Next word
    Word seq  \    A      B      C      D     ..
    ---------------------------------------------
    AB         |  0.0    0.02   0.08   0.15   ..
    BC         |  0.0    0.0    0.0    0.02   ..
    CD         |  0.02   0.22   0.0    0.0    ..
    DE         |  0.1    0.06   0.03   0.0    ..
    ..         |  ..     ..     ..     ..     ..

MARKV.EXE allows you to choose the word grain, i.e. the number of words
that make up a word sequence, with the command line option -g#. E.g.:

    MARKV -G3

After reading a text, this results in the following probability table:

    Word grain 3
             \ Next word
    Word seq  \    A      B      C      D     ..
    ---------------------------------------------
    ABC        |  0.01   0.0    0.0    0.02   ..
    BCD        |  0.0    0.0    0.0    0.0    ..
    CDE        |  0.04   0.01   0.0    0.0    ..
    DEF        |  0.02   0.0    0.08   0.0    ..
    ..         |  ..     ..     ..     ..     ..

Similarly, -g1 breaks down an input text into word sequences of
length 1. The word grain can be any number greater than or equal to 1.
Word sequences longer than 3 are not very interesting, since Mark V.
Shaney will essentially re-create the original text when in output
mode.

In general, large values for the word grain produce low variability in
the output text. Also, large input texts tend to result in more diverse
output. You can create a sufficiently variable output from a short text
sample by setting the word grain to 1.

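
To make the effect of the word grain concrete, here is a minimal sketch
in Python (not the program's own Prolog source). The names build_table
and probabilities and the sample sentence are made up for this
illustration, and normalizing each row so that it sums to 1 is an
assumption about how the table is meant to be read.

```python
from collections import Counter, defaultdict

def build_table(text, grain=2):
    """Map each sequence of `grain` words to a Counter of following words."""
    words = text.split()  # punctuation stays attached to its word
    table = defaultdict(Counter)
    for i in range(len(words) - grain):
        seq = tuple(words[i:i + grain])
        table[seq][words[i + grain]] += 1
    return table

def probabilities(table, seq):
    """Turn the raw follower counts for one word sequence into probabilities."""
    counts = table.get(seq, Counter())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

# Hypothetical sample text, chosen only to keep the tables small.
sample = "the cat sat on the mat. the cat ran on the road."
table1 = build_table(sample, grain=1)
table2 = build_table(sample, grain=2)

print(probabilities(table1, ("the",)))        # followers of "the"
print(probabilities(table2, ("the", "cat")))  # followers of "the cat"
print(len(table1), len(table2))               # number of rows per grain
```

With grain 1 the row for "the" has three possible followers, while with
grain 2 the row for "the cat" has only two, which is the drop in
variability described above.
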
Table creation speed is largely independent of the word grain, whereas
table size increases with word grain. Text playback depends on the
structure of the current probability table, but is independent of the
value of the word grain parameter with which the program was invoked.
This is useful to know if you are loading a previously created table
from disk.

--------------------
Storage options

MARKV.EXE is able to use EMS memory or a disk for storing huge
probability tables. The location of the working probability table can
be specified by one of the following options:

MARKV -E
    This tells Mark to use expanded memory if available; this is also
    the default. If expanded memory is not available, the table will
    reside in conventional memory. 1 MB of EMS allows you to process
    about 110K of text input.

MARKV -M
    Mark will keep the table in conventional memory along with the
    program code. On a 640K machine, this gives you room for processing
    approx. 40K of original text.

MARKV -D[Drive:Path]
    Mark uses the disk to store the working probability table. The
    optional Drive:Path specifies where the temporary file is to be
    stored. This is the slowest option, but you can speed up processing
    greatly by specifying a RAM drive. The amount of input text that
    can be processed is limited by the available disk space. Example:

        MARKV -DE:

    will set up disk storage on drive E:.

--------------------
The random number seed option

Mark will initialize the random number generator by reading the system
time at program start. This makes sure that the random output is
different every time you use the same probability table. In case you
see some particularly interesting output scrolling off screen, you may
wish to restart the program under identical conditions. For this
purpose, Mark displays the number that initialized the random number
generator when you exit the program. You can then provide this seed on
the command line, as in

    MARKV -S12345

to re-create a previous run.

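
The idea behind the seed option can be sketched in a few lines of
Python. This is an illustration only, not the program's actual code:
build_table, generate, the sample text and the way the seed is derived
from the clock are all assumptions made for the example.

```python
import random
import time
from collections import defaultdict

def build_table(words, grain=2):
    """Map each word sequence to the list of words seen after it."""
    table = defaultdict(list)
    for i in range(len(words) - grain):
        table[tuple(words[i:i + grain])].append(words[i + grain])
    return table

def generate(table, seed, max_words=30):
    """Random walk through the table, fully determined by `seed`."""
    rng = random.Random(seed)
    seq = rng.choice(sorted(table))   # sorted so the start does not depend on dict order
    out = list(seq)
    # Stop at a sequence with no recorded follower (assumed stand-in for
    # "reached the end of an input text") or after max_words words.
    while seq in table and len(out) < max_words:
        nxt = rng.choice(table[seq])  # repeated entries act as weights
        out.append(nxt)
        seq = seq[1:] + (nxt,)
    return " ".join(out)

# Hypothetical sample text.
words = "I saw the dog and the dog saw me and I saw the cat.".split()
table = build_table(words)

seed = int(time.time()) % 65536      # time-based seed; the modulus is arbitrary here
first = generate(table, seed)
print("seed:", seed)
print(first)

replay = generate(table, seed)       # same table + same seed -> same text
print(replay == first)
```

Because the generator's only source of randomness is the seeded
random.Random instance, replaying with the displayed seed and the same
table reproduces the earlier output word for word.
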
--------------------
Sources of text

The larger the input text, the more diversified Mark V. Shaney's
ramblings will be. With a word grain of 2, a 15 to 30K text will
produce a varied, but still comprehensible output. Good choices for
input text are: personal letters, textbooks, children's books, simple
narratives, poems (anything made up of simple, short sentences). Poor
choices are texts containing incomplete or complicated sentences with
special terminology (technical writing, program documentation (this
one's especially poor :-) ), etc.). The Old Testament provides
excellent material downloadable from many anonymous FTP sites; try e.g.
the Song of Solomon. (Please don't take offense; this suggestion comes
out of deep respect for the Bible.)

--------------------
Implementation

The source for Mark V. Shaney is ca. 400 lines of code written in PDC
Prolog V3.21. Word probability tables are stored in external databases.
The dictionary (list of unique words) and frequency table are hashed to
speed up table generation, and cross-referenced for text output in
constant time. I designed this program for flexibility (i.e. adjustable
word grain) and speedy output generation. The trade-offs are slow text
scanning and large memory/disk space requirements.
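
For readers who want a feel for such a layout, here is a rough Python
sketch of a hashed dictionary of unique words cross-referenced with a
hashed frequency table. It is an assumption-laden illustration, not a
transcription of the Prolog source: the class WordTable, its method
names and the use of integer word IDs are all invented for this
example.

```python
from collections import Counter

class WordTable:
    """Hypothetical layout: words are stored once and referred to by ID."""

    def __init__(self, grain=2):
        self.grain = grain
        self.word_to_id = {}  # hashed dictionary: word -> integer ID
        self.id_to_word = []  # cross-reference back: ID -> word
        self.freq = {}        # hashed table: tuple of IDs -> Counter of next IDs

    def intern(self, word):
        """Return the ID of `word`, adding it to the dictionary if new."""
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]

    def read(self, text):
        """Scan a text and count which word ID follows each ID sequence."""
        ids = [self.intern(w) for w in text.split()]
        for i in range(len(ids) - self.grain):
            key = tuple(ids[i:i + self.grain])
            self.freq.setdefault(key, Counter())[ids[i + self.grain]] += 1

    def followers(self, *words):
        """Look up one row by hash and translate the IDs back into words."""
        key = tuple(self.word_to_id.get(w, -1) for w in words)
        counts = self.freq.get(key, Counter())
        return {self.id_to_word[i]: n for i, n in counts.items()}

t = WordTable(grain=2)
t.read("to be or not to be that is the question")
print(len(t.id_to_word))         # number of unique words
print(t.followers("to", "be"))   # follower counts for the pair "to be"
```

Storing each word once and keeping only small integer IDs in the
frequency table is one common way to trade scanning work for fast
lookups during output; whether the original program does exactly this
is not stated above.
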