Wilf’s Programmers’ Workshop

Wilf Hey continues his workshop on text handling.

While creating a binary tree of items, as we do this month with the words in the novel Moby Dick, it is wise to try to anticipate other useful information to collect. For example, one of the tasks set for us was to note the longest word encountered, and another was to tabulate the usage of the 'top ten' words within the text. You can readily add code to cope with the first of these tasks on the 'input' side - that is, as soon as a word has been found, you can compare its length with that of the longest word found so far. However, it is not so easy with the 'top ten' words. How can this be accomplished?
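The 'input side' check for the longest word might be sketched like this in Python (the article's program is QBasic; the function name and word pattern here are my own illustration, not taken from MOBY.BAS):

```python
import re

def longest_word(text):
    """Track the longest word as each word is scanned, comparing
    each new word's length against the longest found so far."""
    longest = ""
    for word in re.findall(r"[A-Za-z']+", text.upper()):
        if len(word) > len(longest):
            longest = word
    return longest
```

Because the comparison happens as each word arrives, no second pass over the text is needed.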

You will have to attach a usage count to each word. This will be a small number - an integer - but it should accompany each word as it is placed into the binary search tree. When an item is first added to the tree (because it was established as missing therein), a record is created for it with a usage counter within it, set to value [1]. Whenever that same word is encountered again in the text, it will of course be found (after following a small number of branches, using the binary search algorithm), and you can simply add one to its usage counter before writing it back to the tree in the work output file. Before going on to the next word, we must pass the word, along with its usage counter, to a routine which will make sure the top ten words (the ten words with the highest usage counts) are noted. This is quite tricky, and prone to programming mistakes - you may end up with a final list in which one word appears several times. This error arises when you forget to remove an item from the top-ten list before adding it again.
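The insert-or-increment step and the top-ten bookkeeping might look like this in Python (a sketch of the technique only - the names bump and note_top_ten are mine, not MOBY.BAS's). Note how the word is removed from the list before being re-added, which is precisely the step that prevents the duplicate-entry bug described above:

```python
def bump(tree, word):
    """Insert word into a binary search tree of {word, count, left, right}
    nodes, or add one to its counter if already present; return the count."""
    if tree["word"] is None:                       # empty root
        tree.update(word=word, count=1, left=None, right=None)
        return 1
    node = tree
    while True:
        if word == node["word"]:
            node["count"] += 1
            return node["count"]
        side = "left" if word < node["word"] else "right"
        if node[side] is None:
            node[side] = {"word": word, "count": 1, "left": None, "right": None}
            return 1
        node = node[side]

def note_top_ten(top, word, count):
    """Keep the ten highest counts; remove any old entry for this word
    first, otherwise the same word piles up in the list."""
    top[:] = [(w, c) for (w, c) in top if w != word]
    top.append((word, count))
    top.sort(key=lambda wc: -wc[1])
    del top[10:]

# Driving loop: every word goes through both routines.
tree = {"word": None}
top = []
for w in "THE CAT SAT ON THE MAT THE".split():
    note_top_ten(top, w, bump(tree, w))
```

After this loop the list holds each word at most once, with [THE] at the head.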

In the example program MOBY.BAS (a QBasic source, readily adaptable to other languages), you will see that I have made provision for keeping this 'top-ten' usage in a separate DIM, so that at any instant the most popularly used words so far appear on the screen. This table is built from two pieces of information: the word just read, and its usage counter; both are retrieved from the binary tree once the appropriate algorithm has established the proper place for the word within the tree. Running through the two hundred thousand plus words of Moby Dick, replete with its arcane and specialised vocabulary, were you at all surprised by what the ten most used words were? [THE], [OF], [AND], [A], [TO], [IN], [THAT], [HIS], [IT], [I]. So with the binary search algorithm, along with a few adjuncts like a simple bean counter attached to each item, we are able to answer the remaining questions, and establish that though there are 210,226 words in the novel, there are only 18,679 different words.

I added a few other bits of information into the MOBY.BAS program, recognising that there may be circumstances where they become valuable. For example, of the eighteen-odd-thousand words used by Melville in his novel, how many are used just once? I thought about it, and realised that I could keep a counter to which I ADDED one every time a word was ADDED to the binary tree, and SUBTRACTED one whenever a word's usage counter was set to exactly the value TWO. Instead I opted for the slightly easier strategy: a 'multi-use' counter, to which one is added whenever a usage count becomes exactly TWO; no subtraction is needed. This counter indicates the number of words which appear more than once in the novel, and subtracting it from the number of distinct words gives the number used just once.
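The 'multi-use' counter strategy can be sketched in a few lines of Python (again an illustration of the idea, not Wilf's actual QBasic; a plain dictionary stands in for the binary tree here):

```python
def count_usage(words):
    """Tally word usage; bump a 'multi-use' counter the moment a word's
    count reaches exactly two - no subtraction is ever needed."""
    counts = {}
    multi_use = 0
    for w in words:
        counts[w] = counts.get(w, 0) + 1
        if counts[w] == 2:
            multi_use += 1
    singles = len(counts) - multi_use   # words used exactly once
    return counts, multi_use, singles
```

A word's count passes through the value two at most once, which is why the counter never needs correcting later.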

Another feature I added to the process - one that takes up hardly any time - was to note the deepest (highest numbered) level to which the binary tree grows, along with the first word that occurs at that level. I reasoned that this would be valuable for estimating how long the binary sorting might take. At first I pictured the job of scanning the text as one that slows down as the tree grows larger, since more comparisons have to be made all the time. That is true - the process does slow down - but surprisingly, not by a very great deal. There was, however, a side-effect of adding the 'Depth' tally: I noticed that after the scan had gone deeply into the book, the word at the deepest level was consistently a Roman numeral. Why should that be? This peculiarity arises because the Roman numerals in the text come from chapter headings: they are similar strings of letters and therefore alphabetically close to one another. For example, from thirty (XXX) to thirty-nine (XXXIX) there are ten 'words' that arrive in an order that makes them fall into a long spur, nine levels deep. This cannot be representative of how long or complex the sort process is, because these Roman numerals are used only once each.
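Tracking the depth costs almost nothing, because the insertion routine already walks down the tree one level at a time; it need only count as it goes. A Python sketch of the idea (my own names, not MOBY.BAS's), including the Roman-numeral spur effect:

```python
def insert_with_depth(tree, word):
    """Insert into a BST of dicts, returning (count, depth) so the
    caller can note the deepest level and the first word found there."""
    depth = 1
    if not tree:                       # empty tree: word becomes the root
        tree.update(word=word, count=1, left={}, right={})
        return 1, depth
    node = tree
    while True:
        if word == node["word"]:
            node["count"] += 1
            return node["count"], depth
        side = "left" if word < node["word"] else "right"
        depth += 1
        if not node[side]:
            node[side].update(word=word, count=1, left={}, right={})
            return 1, depth
        node = node[side]

# Consecutive chapter numerals each sort after the last, so each one
# lands one level deeper: a long, thin spur.
tree, deepest, first_at_deepest = {}, 0, None
for w in ["XXX", "XXXI", "XXXII", "XXXIII"]:
    count, depth = insert_with_depth(tree, w)
    if depth > deepest:
        deepest, first_at_deepest = depth, w
```

Four numerals produce a spur four levels deep, which mirrors what the 'Depth' tally showed for the chapter headings.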

I had a flash of inspiration: I decided to continue keeping note of the deepest level as before, but added a note of the deepest level accessed more than once. This combines the 'multi-use' counter idea and the 'Depth' idea. In the course of testing this, I found, simply by looking at the interactive results on the screen, two typing errors I had made within the text of Moby Dick: the Roman numerals of two chapter headings were keyed incorrectly, and so two Roman numerals appeared to be 'multi-used'. This resulted in these 'words' (and the depth of their long spurs) appearing in the 'Multi-depth' note. So I altered the text to correct these errors. No doubt there are many, many more errors, but I gave up spell-checking because the peculiar, slang and archaic terminology kept confusing the issue.

Maybe you balked at my definition of 'word' for this project: I came to wonder about it myself. I built a very simple diagnostic tool for analysing the output from MOBY.BAS - namely the binary table of the text. This is ANLYS.BAS - also a QBasic program - which appears also on this SuperCD. Using only text mode it displays a small portion of the table, and using arrow keys (plus a few others - look at the source of the program for explanation near the end) you can navigate over the whole of the tree, up and down branches. One of the first things I noted while I was testing ANLYS was that it reported MOBY as appearing 83 times, but DICK only 82. I thought for a moment: is there a scene in which somebody yells out something like 'THAR SHE BLOWS, IT'S MOBY..' and is cut off when the white whale's fluke slams his tiny launch? No, I could not remember anything like that. I loaded MOBYDICK.TXT into a word processor, and looked for [MOBY], for [DICK], and, for good measure, for [MOBY DICK]. This made things even worse, because each of these appeared not 82, nor 83, but 84 times. I thought about the possessive - as in [MOBY DICK'S] - and looked for this. Yes, there were two such occurrences. That accounted for the word processor's count of 84 for [DICK] and [MOBY DICK], and for why ANLYS found only 82 mentions of [DICK]: in the tree, [DICK'S] is an entirely different word. So I went back to ANLYS, and indeed found [DICK'S] there, with a usage count of two.

After this bit of detective work, I was none the wiser as to why ANLYS reported [MOBY] with a count of 83 instead of 84. I suspected that I had a bug - especially because MOBY was the first word in both the text and the tree, and could be considered a special case; maybe the counter was being missed on the first occasion? Testing ruled that out quickly. Eventually I found the reason: there are indeed exactly 83 occurrences of [MOBY] in the text, plus one of ['MOBY] - the familiar name preceded by an apostrophe. How did this apostrophe get there? It arises from the convention that quotes inside quotes use 'single quotes' instead of the usual 'double quotes'. Should a text scanning program take steps to avoid classifying [MOBY] and ['MOBY] as different words? Should it be clever enough to scan ahead and recognise where quotations start and end so that it can accommodate such conventions? And more importantly - is there any way that such a scanner could do its job without getting into ambiguities?
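The ambiguity is easy to reproduce. A scanner in the spirit of MOBY.BAS (my Python paraphrase, not the actual QBasic) must keep apostrophes so that possessives like [DICK'S] survive as single words - but then a leading apostrophe from a quote-within-a-quote sticks to the word as well:

```python
import re

def scan_words(text):
    """Uppercase the text, strip punctuation and double quotes, but
    keep apostrophes (needed for possessives such as DICK'S). A single
    quote opening a quotation-within-a-quotation therefore clings to
    the next word, so 'MOBY and MOBY come out as different words."""
    return re.findall(r"[A-Z']+", text.upper())

words = scan_words("He shouted 'MOBY is coming' and Moby came")
```

Here the scanner yields both ['MOBY] and [MOBY] as separate entries, which is exactly the discrepancy ANLYS uncovered.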

Now consider a new project, to be built on the foundation of this one. There exist programs that 'learn' how you write, and which words you use most commonly. After a while, as you type the first few letters of a word, it will suggest the WHOLE word, allowing you to use a single keystroke to finish it. Some Internet browsers, for example, do this with URLs that you begin keying which you have used before. Can we do this with the words that make up Moby Dick? What changes will be necessary? I remember an experiment I made with one key entry program that had this feature: I kept on accepting each suggestion without question. The output text made little sense, but as I read it I realised with horror that it had captured something of my style; it used my kind of words, and about as often as I do. I recognised some whole phrases: articles and adjectives were placed correctly (most of the time). It was language - my language - but it meant nothing. Can you make a program that will, given the tree output from MOBY.BAS (called MOBYDICK.NDX), produce a new 'great American novel' - or something that sounds a little like one? What sort of changes would you need to make to MOBY.BAS? (Hint: maybe you will want to include punctuation and lettercase as part of distinct words).
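The suggestion mechanism itself needs nothing beyond the usage counts we have already gathered. A minimal Python sketch (a hypothetical illustration - the suggest function and the use of a Counter are mine, and a real version would read the counts from MOBYDICK.NDX):

```python
from collections import Counter

def suggest(counts, prefix):
    """Return the most frequently used word that starts with the
    typed prefix - the 'single keystroke to finish it' idea."""
    candidates = [(c, w) for w, c in counts.items() if w.startswith(prefix)]
    return max(candidates)[1] if candidates else None

# Usage counts as MOBY.BAS would have tallied them.
counts = Counter("THE THAT THE THIS THE THAT".split())
```

Typing [TH] would offer [THE]; typing one letter more, [THA], narrows the suggestion to [THAT].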


Here's the culprit: the use of apostrophes for 'quoted quotes' has created an ambiguity. The text scanner stripped away double-quotes, but not knowing this convention, it treated the string ['MOBY] as a word, different from [MOBY].

The on-screen results of the MOBY.BAS program, which has analysed 'Moby Dick' for the number of words, distinct words and most used words in it, and has identified the longest word found. In the process it has built a useful sorted index.

ANLYS.BAS is a simple interactive program that guides us around the index to 'Moby Dick' created by the earlier program. This tells us such details as the number of times a particular word appears within the entire text.

How to contact Wilf
I’m always pleased to receive letters and e-mail with programming queries, ideas and opinions. As a strict rule I can’t reply directly with personal one-to-one programming advice, but your input could form the basis of a future Workshop. You can e-mail me at
whey@futurenet.co.uk, Fax to 01225 732295 (+44 1225 732295 from outside the UK) or write to: Wilf’s Programmers’ Workshop, PC Plus, Future Publishing, 30 Monmouth Street, Bath BA1 2BW, United Kingdom