- Path: uunet!rs
- From: rs@uunet.UU.NET (Rich Salz)
- Newsgroups: comp.sources.unix
- Subject: v10i27: Bill Tuthill's "hum" text concordance package, Part01/03
- Message-ID: <474@uunet.UU.NET>
- Date: 26 Jun 87 21:48:47 GMT
- Organization: UUNET Communications Services, Arlington, VA
- Lines: 1965
- Approved: rs@uunet.uu.net
-
- Submitted by: John Gilmore <hoptoad!gnu>
- Mod.Sources: Volume 10, Number 27
- Archive-name: hum/Part01
-
- I advertised this package in comp.text.desktop and a bunch of people
- are interested. I find it a good bunch of tools for text analysis and
- I think it belongs in the archives.
-
- It was developed by Bill Tuthill while he was at UC Berkeley. He
- authorized me to pass it on to you for publication.
-
- [ Its a pretty gud writing tool. --r$ ]
- : To unbundle, sh this file
- echo Hum.tutorial
- cat >Hum.tutorial <<'@@@ Fin de Hum.tutorial'
- .if n .ds Q \&"
- .if n .ds U \&"
- .if t .ds Q ``
- .if t .ds U ''
- .ND 3 November 1981
- .nr LL 6.5i
- .nr FL 6.2i
- .RP
- .TL
- Hum \(em A Concordance and Text Analysis Package
- .AU
- William Tuthill
- .AI
- Comparative Literature Department
- University of California
- Berkeley, California 94720
- .AB
- A new package of programs for literary and linguistic
- computing is available, emphasizing the preparation
- of concordances and supporting documents.
- Both keyword in context and keyword and line generators are provided,
- as well as exclusion routines, a reverse concordance module,
- formatting programs, a dictionary maker, and lemmatization facilities.
- There are also word, character, and digraph
- frequency counting programs,
- word length tabulation routines,
- a cross reference generator, and other related utilities.
- The programs are written in the C programming language,
- and implemented on several Version 7 Unix\(dg systems at Berkeley.
- .FS
- \(dg Unix is a trademark of Bell Laboratories.
- .FE
- They should be portable to any system that has a C compiler
- and which supports some kind of pipe mechanism.
- At Berkeley, they reside in ~hum/bin; documentation is in ~hum/man.
- This paper constitutes a tutorial introduction to the package.
- Manual pages for the various programs are available separately.
- .AE
- .SH
- The Literary Text
- .PP
- There are two indispensable prerequisites for the analysis
- of a literary text by computer: the computer, and the text itself.
- About 98% of the work involved is entering and correcting the text,
- so if you can obtain your text elsewhere, you will save a lot of time.
- Check with the Unix consultants to see what tape formats
- can be read at the Berkeley installation,
- and have the text taped in one of these formats.
- It is also possible to read cards through the link
- to the IBM 4341; again, check with the consultants.
- .PP
- If you are forced to enter the text yourself (and most people are),
- you must learn to use the Unix system, and the editor.
- Fortunately, the Computer Center has provided
- good documentation at the beginning stages.
- Start off with the tutorial,
- \fICommunicating with Unix,\fP\|
- which will teach you how to use the Unix system.\**
- .FS
- Ricki Blau,
- \fICommunicating with Unix,\fP\|
- Computing Services, Berkeley (1980).
- .FE
- Then read through
- \fIEdit: A Tutorial\fP\|
- carefully, and learn how to use the \fBedit/ex\fP editor.\**
- .FS
- Ricki Blau and James Joyce,
- \fIEdit: a Tutorial,\fP\|
- Computing Services, Berkeley (1980).
- .FE
- At first, you will be spending most of your time inside the editor.
- .PP
- After you become comfortable with the editor,
- it would save you time in the long run
- to familiarize yourself with \fBex\fP and \fBvi\fP.
- For \fBvi\fP, read
- \fIAn Introduction to Display Editing with Vi,\fP\|
- a tutorial introduction that is relatively easy to understand.\**
- .FS
- William Joy and Mark Horton,
- \fIAn Introduction to Display Editing with Vi,\fP\|
- EECS Dep't, Berkeley (1980).
- The \fIEx Reference Manual,\fP\|
- EECS Dep't, Berkeley (1980),
- by the same two authors,
- can be used in conjunction with the introduction to \fBvi\fR.
- .FE
- \fBEdit\fR is a simplified but less flexible version of \fBex\fR,
- both being line-oriented command editors.
- \fBVi\fR is identical to visual mode of \fBex\fR,
- a screen-oriented display editor that permits
- intraline changes to be made easily and effectively.
- By the time you begin correcting your text,
- you should know how to use \fBvi\fP, or at least open mode of \fBex\fR.
- It is far easier to make changes this way
- than with the substitute command of \fBedit\fR.
- .PP
- When you finish correcting your text,
- it would be helpful for you to learn more about the C-Shell,
- which is the default command processing language at Berkeley.
- The tutorial and reference document to consult is
- \fIAn Introduction to the C Shell.\fP\**
- .FS
- William Joy,
- \fIAn Introduction to the C Shell,\fP\|
- EECS Dep't, Berkeley (1979).
- .FE
- Most of the text analysis is going to be done with programs
- called from the shell.
- Another good document that is slightly out of date,
- but which is still very helpful, is
- \fIUnix for Beginners.\fP\**
- .FS
- Brian Kernighan,
- \fIUnix for Beginners,\fP\|
- Bell Laboratories, Murray Hill (1978).
- .FE
- By beginner, they mean someone
- who is a systems programmer on another system.
- The article teaches you about programming the shell,
- and covers other advanced topics.
- .SH
- Practical Suggestions
- .PP
- Before beginning text entry,
- you should organize your file and directory structure.
- If you are encoding a novel (or a saga),
- you should probably make a directory for the novel,
- and then put each chapter in a file of its own.
- If you are entering an epic poem, it would be best
- to have a directory for the epic, and individual files for each section.
- And if you are typing in many short poems by various authors,
- put each author into a separate directory,
- and each poem into a file in its respective directory.
- In other words, let your data determine your file structure.
- Later, it will be easy to access and label your texts.
- In any case, files should not be larger than 250,000 characters,
- and even then they are difficult to work with.
- .PP
- At this point, you must decide how your text is to be numbered.
- A novel, of course, should be labelled by page number,
- an epic or romance by line number and perhaps section number,
- and lyric poems by author, poem number, and line number.
- A good rule is to follow the numbering in the best edition available.
- We will discuss section numbering and author labelling later.
- Line numbering is no problem,
- since many Unix programs are built around line structure.
- If you need to page number, however, reserve the equals sign,
- if possible, because it is normally the page indicator.
- Put an equals sign on a line of its own every time
- there is a new page in your novel,
- and follow line separation in your edition exactly;
- do not hyphenate words from one line to the next, however.
- The \fBkwic\fP, \fBsfind\fP, and \fBxref\fP programs will use the page indicator
- to label your text with the correct page numbers.
- .PP
- Text in a foreign language, especially if there are diacritical marks,
- requires a good deal of thought and preparation.
- The first rule to follow is that an accent mark
- cannot be a punctuation mark, and vice-versa,
- because an accent mark is part of a word,
- while a punctuation mark is not.
- One typical problem is the cedilla,
- which is often represented as an overstruck comma.
- A similar problem is the umlaut, which is best
- represented by the double quote mark.
- If you have cedillas or umlauts in your text,
- you will have to use one of two special programs,
- \fBcedilla\fP or \fBumlaut\fP, along with special concordance options.
- All of this will be explained below.
- A third problem area is the use of a single quote mark,
- which interferes with the apostrophe and the acute accent.
- Two solutions are to use the double quote mark all the time,
- or to use the grave accent for a single quote mark
- (provided, of course, you don't need the grave accent).
- .PP
- When you have decided what is a punctuation mark
- and what is a diacritical mark,
- you may have to establish a punctuation file (Pfile, for example)
- containing all punctuation marks that are not part of a word.
- The default punctuation marks, usable for English
- and some other languages, are:
- .DS
- ,.;:-"?!()[]{}
- .DE
- Note that the apostrophe is considered part of a word.
- If these punctuation marks are acceptable,
- you do not need a punctuation file.
- If you want a different punctuation set, however,
- simply type your punctuation marks onto a single line in a file.
- Many Humanities programs will read the file you specify
- after the \-d option, and reload the punctuation set
- with the last line of this file.
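What a punctuation set does can be sketched with the standard tr utility (the concordance programs handle this internally; tr here is only an illustration, and the sample line is invented). The default marks become word separators, while the apostrophe, absent from the set, stays inside its word:

```shell
# Split a sample line into words using the default punctuation set.
# The apostrophe is not in the set, so "Don't" survives as one word.
echo "Don't go; tomorrow, (he said) tomorrow!" |
tr ',.;:"?!()[]{}-' ' ' |   # punctuation marks become spaces
tr -s ' ' '\n'              # one word per line, blank runs squeezed
```

For a concordance run, the same marks would instead go onto a single line of a punctuation file named after the \-d option.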
- .PP
- After text entry is finished,
- you will begin the process of correcting the text,
- often the most laborious part of your work.
- There are a number of programs to help you in this endeavor,
- although nothing can replace careful rereading of the text.
- In order to save money and eyestrain, use \fBlpr\fP to get
- a hard-copy version of your text from the lineprinter.
- If you are working with modern English,
- \fBspell\fP may help you to identify words
- that are not in the on-line dictionary.
- If you are willing to type in your text twice,
- you can check one version against the other with \fBdiff\fP,
- a process that will eliminate most errors.
- All the preceding programs are part of the standard Unix system.
- Humanities programs that may help are \fBcfreq \-a\fP,
- which can alert you to the presence of strange characters,
- and \fBtprep \-t\fP, which gets rid of useless trailing white space.
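The double-entry check with diff can be sketched as follows; the file names are hypothetical, and the misspelling is planted for the demonstration:

```shell
# Two independent typings of the same passage (hypothetical files).
printf 'To be, or not to be,\nthat is the question:\n' > typing1
printf 'To be, or not to be,\nthat is the qeustion:\n' > typing2

# diff flags every line where the typings disagree; exit status 1
# only means that differences were found.
diff typing1 typing2 || true
```

Each reported pair marks a typing error in one version or the other; once both copies are fixed, a silent diff means the two typings agree.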
- .SH
- Background and References
- .PP
- Generating a concordance is one of the best uses
- of a computer's capabilities that a Humanities scholar can make.
- Before the 1950's, many concordances were done by hand,
- and you may well imagine the time and effort that went into this.
- The first large-scale computer concordance was
- a concordance to the Revised Standard Version of the \fIBible,\fP\|
- done in 1956 on a Univac I computer.\**
- .FS
- John W. Ellison,
- \fINelson's Complete Concordance of the Revised Standard Version Bible,\fP\|
- New York, Thomas Nelson & Sons (1957).
- .FE
- This pioneering concordance set a standard for intelligent design
- and legibility that has seldom been matched since then.
- A hand concordance to the King James Version had been done
- by James Strong in the late 1800's; it took 35 years to complete.\**
- .FS
- James Strong,
- \fIAn Exhaustive Concordance of the Bible,\fP\|
- New York, Hunt & Eaton (1894).
- .FE
- .PP
- It is important to emphasize that not all computer concordance projects
- go this smoothly, or show such a dramatic improvement in efficiency.
- For example, a hand concordance to Shakespeare's plays was done
- by John Bartlett over a 15 year period in the late 1800's,
- using \*Qonly the leisure that [his teaching] profession allowed.\*U\**
- .FS
- John Bartlett,
- \fIA Complete Concordance to the Dramatic Works of Shakespeare,\fP\|
- New York, Macmillan & Co (1896).
- .FE
- A computer generated concordance to Shakespeare done at Harvard
- University, by contrast, took nearly as long to complete.
- It was started in the days before their computer recognized
- upper and lower case distinctions, and consequently,
- Shakespeare's text, as concorded, contains no capital letters!
- This work was done on an IBM 360/50 computer.
- A preliminary version was published by a German press in 1969,
- but the final typeset version was not ready until 1973.\**
- .FS
- Marvin Spevack,
- \fIThe Harvard Concordance to Shakespeare,\fP\|
- Cambridge Mass, Harvard University Press (1973).
- .FE
- .PP
- Since the pioneering computer concordance by Ellison,
- a plethora of concordances has been published,
- some better than others, and some almost unusable.
- Much work has remained, for lack of money and interest,
- in the form of semi-legible lineprinter output.
- This has resulted in a kind of academic provincialism,
- since these stacks of lineprinter output are seldom shared.
- One viable alternative to this situation would be
- to make and distribute microfilm concordances,
- drastically reducing the cost of paper, printing, and mailing.
- .PP
- There are, of course, many other ways besides the concordance
- to make use of the computer for literary studies.
- Susan Hockey's recent book gives a good introduction to
- the various possibilities for computer research in the Humanities.\**
- .FS
- Susan Hockey,
- \fIA Guide to Computer Applications in the Humanities,\fP\|
- Baltimore and London, Johns Hopkins University Press (1980).
- .FE
- She gives little practical advice, but does very well at
- pointing out the current limitations of the computer.
- James Joyce has written a provocative article on
- poetry generation and analysis by computer, which combines
- the intuitive and analytical approaches to the subject.\**
- .FS
- James Joyce, \*QPoetry Generation and Analysis,\*U in
- \fIAdvances in Computers,\fP\|
- vol. 13, pp. 43-72 (1975).
- .FE
- Finally, there is a good article from Bell Labs
- on statistical text processing, which describes
- some linguistic research done on their Unix system.\**
- .FS
- L. E. McMahon, L. L. Cherry, and R. Morris,
- \*QStatistical Text Processing,\*U
- in \fIThe Bell System Technical Journal,\fP\|
- vol. 57, no. 6, pt. 2, pp. 2137-54 (1978).
- .FE
- .SH
- Making a Concordance
- .PP
- Before you start using the special package of Humanities programs,
- you will need to set your pathname properly, since
- the programs all reside in the directory \*Q~hum/bin\*U.
- The best way to do this is to have a \*Q.cshrc\*U file
- something like the following:
- .DS
- set path = ~hum/bin
- set path = (/usr/cc/bin /usr/ucb/bin $path .\|)
- set history = 20
- set noclobber
- .DE
- In addition to setting your pathname, this will give you
- a history list for performing repeat commands,
- and will prevent certain kinds of unintentional file destruction.
- After creating the .cshrc file, type the command
- \fBsource .cshrc\fP in order to activate the new settings.
- Now you can obtain an index to the concordance package
- by typing \fBhuman index\fP;
- manual pages for all the programs are available
- through the same command.
- .PP
- How you use the concordance package
- will be largely determined by how big your text is,
- and by what kind of concordance you want.
- There are two basic styles available, the
- Key Word In Context concordance, and the
- Key Word And Line concordance
- (hence the names \fBkwic\fP and \fBkwal\fP).
- At other computer installations,
- a \fBkwal\fP-like program might be called \*Qkwoc\*U or \*Qconc\*U.
- To try them out, take a short text, and type these two commands:
- .DS
- kwic text | sort | format
- kwal text | sort | format
- .DE
- Many people use the \fBkwal\fP program for poetry, and the
- \fBkwic\fP program for prose, but if you are doing formulaic analysis,
- you will probably want to use the \fBkwic\fP program for poetry as well.
- If you decide to use \fBkwic\fP, learn to use \fBtprep\fP,
- since your concordance will look much better
- if the end of line is trimmed and the beginning of line is padded.
- .PP
- If you have a text that is fairly long,
- you may want to concord it in the background,
- while you do other work.
- With background processing,
- you should always redirect output to a temporary file,
- where you can examine (and perhaps edit) the results
- before sending your concordance to the lineprinter.
- Use a command something like this:
- .DS
- kwic text* | sort | format > /tmp/final &
- .DE
- The asterisk expands to mean any file beginning with text,
- so if your text has several parts, this will put them all
- together, in the same order as they are listed with \fBls\fP.
- The greater-than symbol redirects output to the tempfile,
- and the ampersand puts the process in background.
- If you logout before the process is completed,
- then all your background processes will be halted,
- and you will have a partial concordance, or none at all.
- (You can wait for background processes to terminate
- by typing \*Qwait\*U before logging out.)
- .PP
- If your text is extremely long, or if you want to go home,
- you may want to submit the concordance,
- so that you can logout without ending the concording process.
- Submit processing is charged at the rate of $1.00 per processor minute.
- Use the following command:
- .DS
- submit "kwic text* | sort | format > /tmp/final"
- .DE
- The quotes are necessary, to remove the magic
- of the pipe and redirect symbols.
- A few hours later, or the next morning,
- you can return to examine the temporary file,
- and if everything is all right, you can send it to the lineprinter.
- Make sure to examine the end of your output file with \fBtail\fP;
- your concordance should end with the entries for X, Y, and Z.
- .PP
- \fBKwic\fP and \fBkwal\fP have many options,
- and you will probably use some or all of them.
- They are fully explained in the manual section,
- but a thumbnail sketch is also given here.
- The \-k option sets the keyword length;
- you should run the \fBmaxwd\fP program on all your texts,
- to determine the length of your longest word.
- If it is longer than 15, reset the keyword length accordingly.
- The \-w option can be used to label the concordance
- as to work or author, and the \-f can be used
- to label it for poem number or chapter.
- .PP
- Line numbering is done automatically,
- but can be affected by using the \-l and \-r options.
- The \fBkwic\fP program will also do page numbering,
- which can be controlled with the \-p and \-i options.
- The \fBkwal\fP program has the esoteric \-s and \-x options instead,
- to skip over a text-embedded lefthand identification field,
- and to suppress automatic linenumbering.
- It is possible to use \fBlno\fP to double linenumber
- or hemistich number your text,
- and then use \-s and \-x when compiling the concordance.
- You may have some other system for labelling your lines.
- Older text entered on IBM cards often has the embedded id field.
- .PP
- With the \fBkwic\fP program, the most popular option
- is the \-c to set context width.
- The default context width is 50 characters,
- that is to say, 25 on either side of the keyword.
- This width is suitable for the CRT terminal,
- but to fill up a page from the lineprinter, a width
- of 100 to 120 is better, depending on the \-w and \-f options.
- The \fBkwal\fP program, by contrast, gives context on a line by line basis,
- so if you have some extremely long lines,
- they may be too wide for the page.
- The \-d option, to read a punctuation file, was explained above.
- The + option is used along with the \fBcedilla\fP and \fBumlaut\fP programs.
- The \fBkwic\fP and \fBkwal\fP programs can both read from standard input,
- if you use a hyphen instead of a filename, but this is not recommended.
- .PP
- Ordinarily, sorting is done from the beginning of the line,
- so that the primary sort field is the keyword,
- and the secondary sort field is the line or page number.
- With \fBkwic\fP, if you want to sort by context rather than by text location,
- you will have to call the \fBsort\fP program as follows:
- .DS
- kwic text* | sort \-dft\e| +3 +1 | format
- .DE
- The \-d means dictionary order, ie, ignore punctuation marks.
- The \-f means to fold upper case to lower case, ie, ignore capitals.
- The \-t indicates that the tab character between fields is a
- vertical bar (the backslash takes away its magic meaning).
- The +3 makes the keyword and its following context
- into the primary sort field, and +1 makes
- the linenumber (or page number) into the secondary sort field.
- The results of \fBsort\fP are then piped to \fBformat\fP.
- If you are doing formulaic analysis, you will want to sort this way.
- Make sure to use \fBtprep\fP if you do this kind of sorting.
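On systems where sort no longer accepts the +3 +1 positional syntax, the POSIX \-k form is equivalent; the sample entries below are synthetic, and assume a field layout of work, line number, preceding context, and keyword with following context:

```shell
# Sort synthetic kwic-style entries by keyword field, then line number.
printf 'play|12|said he|tomorrow and tomorrow\nplay|03|cried she|banquo is dead\n' |
sort -df -t'|' -k4 -k2
```

Here \-k4 (from the fourth field to the end of the line) corresponds to +3, and \-k2 to +1; the \-k keys count fields from one rather than from zero.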
- .PP
- If you are planning to publish your concordance,
- or even if other people want to peruse it, you should read
- William Ingram's article on concordance style.\**
- .FS
- William Ingram, \*QConcordances for the Seventies,\*U
- in \fIComputers and the Humanities,\fP\|
- vol. 13, pp. 43-72 (1975).
- .FE
- Concordances for the eighties will be much like those of the seventies.
- Ingram gives many practical suggestions, and points out some
- ridiculous errors made in several recently published concordances.
- .SH
- Concordance Modules
- .PP
- The concordance package comes with a number of independent modules,
- in order to attain acceptable flexibility and efficiency.
- With these modules, you can exclude unnecessary words,
- create a reverse concordance,
- put the concordance into separate alphabetical files,
- lemmatize, and format the output.
- The \fBsort\fP program may be considered a module,
- although it is not technically part of the Humanities package.
- For convenience, these modules may be divided into two types:
- those used before \fBsort\fP, and those used afterwards.
- .PP
- The \fBexclude\fP program reads words in an ignore file
- (\*Qexclfile\*U by default),
- and filters out concordance entries with these keywords.
- It can also be used with an only file, to filter out
- all concordance entries except those with the desired keywords.
- \fBExclude\fP should immediately follow \fBkwic\fP or \fBkwal\fP,
- in order to reduce the amount of output going to later modules,
- especially \fBsort\fP, which is the most expensive module.
- .PP
- The \fBrevconc\fP program reverses the keyword,
- so that the concordance can be sorted by word endings
- rather than by word beginnings.
- \fBRevconc\fP should be called both before and after the \fBsort\fP program,
- unless you want your words backwards in the final concordance.
- .PP
- If you are using the + option of \fBkwic\fP or \fBkwal\fP,
- to indicate cedillas or umlauts,
- you will need to filter your output with \fBcedilla\fP or \fBumlaut\fP.
- These programs replace the plus sign with a backspace
- and the appropriate accent mark.
- .PP
- The \fBformat\fP program counts and removes repeated occurrences of keywords,
- and creates keyword section headings for the concordance.
- If desired, keyword counting can be suppressed,
- as can mapping of the keyword to upper case,
- and printing of the section heading itself.
- \fBFormat\fP is generally the last module used before the output
- is sent to a temporary file or to the lineprinter.
- .PP
- If you were to use all of the modules above,
- which is quite unlikely, this would be the proper order
- of the command line:
- .DS
- kwic + text | exclude | revconc | sort | revconc | umlaut | format
- .DE
- Of course, after all that work, it would be good to redirect the
- output to a temporary file somewhere.
- .PP
- For large concordance projects,
- you will want to reduce the size of the output files,
- so you can edit and print them more easily.
- The \fBdict\fP program sends concordance entries to their
- proper dictionary file, from A to Z
- (actually in ASCII sequence from blank to tilde).
- This is useful for large concordances that you must edit by hand.
- \fBDict\fP should be used in place of \fBformat\fP;
- concordances can be formatted when they are finally printed.
- .PP
- The \fBlemma\fP program takes specified words and positions them
- in the proper order after their specified headword.
- This is for keeping inflected words together in one location.
- \fBLemma\fP should be called only after the \fBdict\fP program
- has separated your concordance into alphabetical sections.
- It only works on groups of files created by \fBdict\fP\|.
- .PP
- Generally, it is considered bad style for a concordance
- to contain multiple entries when the same word occurs twice on a line.
- For example, Shakespeare's line, \*QTomorrow, tomorrow, and tomorrow,\*U
- when run through \fBkwal\fR, will produce three identical entries.
- \fBKwic\fR will produce three slightly different entries,
- but two of them will be largely redundant.
- In order to avoid such redundancy,
- use the Unix utility \fBuniq\fP as follows:
- .DS
- kwal text* | sort | uniq | format
- kwal text* | sort \-u | format
- .DE
- The second line has exactly the same effect, and is somewhat faster.
- It is more difficult to remove redundant lines
- with \fBkwic\fP than with \fBkwal\fP,
- because \fBuniq\fP will not ignore differing context fields.
- However, it is possible to write an \fBawk\fP program
- that should work for this purpose.
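Such an awk program might look like the sketch below. The field layout (work, line number, preceding context, keyword with following context) and the vertical-bar separator are assumed for illustration; the key is the keyword plus its location, so entries that differ only in their context fields collapse to one:

```shell
# Keep one kwic entry per keyword per line of text, ignoring the
# differing context fields (hypothetical two-entry input).
printf 'play|18|and|tomorrow creeps in this petty pace\nplay|18|creeps in|tomorrow creeps and tomorrow\n' |
awk -F'|' '{ split($4, w, " ")                  # w[1] is the keyword itself
             if (!seen[$1 "|" $2 "|" w[1]]++)   # first sighting at this spot?
                 print }'
```

Adjust \-F and the field numbers to match your own kwic output before using this in earnest.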
- .PP
- At the current time, no facilities are provided
- for making interlinear concordances;
- it is expected that they will be provided in the future.
- .SH
- Other Programs for Textual Analysis
- .PP
- There are other methods of textual analysis besides the concordance,
- and programs for some of these methods are available.
- One common method is the word frequency count,
- which can be accomplished with the \fBfreq\fP program.
- Ordinarily, \fBfreq\fP will give frequency counts of words in your text,
- organized into alphabetical order.
- With the \-n option, it will give these words organized
- by order of frequency, with the most common words first.
- If you run out of core with \fBfreq\fP, the same thing can
- be done, but much more slowly, with this command:
- .DS
- wheel +1 text | sort | uniq \-c
- .DE
- You will probably want to use \fBfreq\fP in conjunction
- with \fBpr \-n\fP, to create n-column output.
- See the manual page for \fBpr\fP(1) in the
- \fIUnix Programmer's Manual\fP\| for details.
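Where neither freq nor wheel is at hand, standard tools alone can approximate the slow pipeline above, with tr standing in for wheel +1 and pr folding the list into columns (a sketch on a one-line sample):

```shell
# Word frequency count, most common words first, in four columns.
printf 'to be or not to be\n' |
tr -s ' \t' '\n' |   # one word per line, as wheel +1 would produce
sort | uniq -c |     # count runs of identical words
sort -rn |           # most frequent first
pr -4 -t -w 40       # four columns, no page headers
```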
- .PP
- Some linguistics students may be interested in
- the relative frequency of different characters;
- the \fBcfreq\fP program may help them out.
- \fBCfreq\fP can also count the frequency of digraphs in a text.
- Comparison of word lengths may also be useful,
- for which purpose there is a \fBwdlen\fP program,
- which prints out a histogram along with word length frequencies.
- .PP
- Most good concordances nowadays are published along with
- a word frequency list (in numerical order),
- and a reverse list of the graphic forms.
- This is simply a list of all the different words in the text,
- sorted from the end to the beginning, rather than vice-versa.
- It is far less elaborate than the reverse concordance,
- but serves many of the same purposes.
- To create such a list, use the following program sequence:
- .DS
- wheel +1 text | rev | sort \-u | rev
- .DE
- \fBWheel +1\fP gives a word list, \fBrev\fP reverses each word,
- and the \-u option of \fBsort\fP suppresses all but one copy of
- each identical line, rather than giving you useless repeated lines.
- Finally \fBrev\fP restores normal order to all the words.
- Again, \fBpr \-n\fP may be of some use in creating multi-column output.
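The reversal trick can be watched on a few words typed by hand; wheel +1 merely produces such a one-word-per-line list from a text:

```shell
# Sort words by their endings: reverse each word, sort, reverse back.
printf 'walking\nsings\nsinging\nwalks\n' |
rev | sort -u | rev
```

The \-ing forms come out adjacent, followed by the \-s forms; a reverse list groups words by suffix in exactly this way.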
- .PP
- Syntactic analysis by computer is more difficult than lexical analysis,
- but the \fBwheel\fP program may be of some help for the former endeavor.
- It rolls through the text several words at a time,
- printing word clusters of arbitrary size on each line.
- These syntactic clusters can then be sorted and counted,
- using \fBsort\fP and \fBuniq\fP, in order to find repeated patterns.
- .DS
- wheel +3 text | sort | uniq \-c | sort \-r
- .DE
- The above command, for example, yields a count
- of three-word syntactic clusters,
- with the most frequent ones heading the list.
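If wheel itself is not installed, a rough awk equivalent for three-word clusters (window size hard-coded, single sample line shown) behaves the same way:

```shell
# Print every overlapping three-word cluster, then count repeats;
# the repeated formula heads the list.
printf 'tomorrow and tomorrow and tomorrow\n' |
awk '{ for (i = 1; i <= NF; i++) w[++n] = $i }
     END { for (i = 1; i + 2 <= n; i++) print w[i], w[i+1], w[i+2] }' |
sort | uniq -c | sort -rn
```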
- .PP
- Pattern searching is another way to study syntactic patterns
- of a text with the aid of a computer.
- The Unix utility \fBgrep\fP is useful in this context,
- but it is line-oriented, and many literary texts
- cannot be studied on a line by line basis.
- So there is a program called \fBsfind\fP, which searches
- through a text sentence by sentence for a specified pattern.
- The entire sentence is printed, along with the matching pattern.
- .PP
- If you do not need a full-length concordance,
- or if you just want a simple index that gives word locations,
- there is a cross reference generator named \fBxref\fP.
- It will index distinct words in a text by line number,
- or by page number, depending on how your text is labelled.
- It is slower than \fBfreq\fP, but much faster than a concordance series.
- Moreover, its output will be relatively compact.
- In the final concordance, you may want
- to have some common words listed with \fBxref\fP indexing,
- and more important words listed with full concordance entries.
- .PP
- After doing a bit of work with the computer,
- you may feel constrained by not knowing a programming language.
- Many things are extremely difficult to accomplish
- if you do not know how to program;
- but with a little programming knowledge,
- these tasks become exceedingly trivial.
- If you are so inclined, you ought to learn \fBawk\fP.
- It has the simplicity of \fBbasic\fP,
- the string handling capabilities of \fBsnobol\fP,
- and the economy of expression found in \fBapl\fP.
- The tutorial, \fIAwk \(em A Pattern Scanning and Processing Language,\fP\|
- should be enough to get you started.\**
- .FS
- Alfred Aho, Brian Kernighan, and Peter Weinberger,
- \fIAwk \(em A Pattern Scanning and Processing Language,\fP\|
- Bell Laboratories, Murray Hill (1978).
- .FE
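As a first taste of the language, the short program below counts every word in its input, roughly what freq does (a sketch only, not a replacement for freq):

```shell
# Count each distinct word, then list the counts, largest first.
printf 'out out brief candle\n' |
awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (word in count) print count[word], word }' |
sort -rn
```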
- .SH
- Related Utilities
- .PP
- For times when you have to change a print wheel or type ball,
- there is the \fBpause\fP program for temporarily halting output.
- Pause may also be helpful for generating form letters or mailing lists,
- since it gives you time to change paper or envelopes.
- .PP
- For those who work with textual variants,
- the \fBpair\fP program may be of some use.
- \fBPair \-m\fP intercalates two files line by line,
- while \fBpair\fP sets two files side by side.
- Different manuscript versions can be compared
- by means of this program.
- It is possible to make two textual variants parallel
- by inserting blank lines in one variant;
- then \fBdiff\fP can be used to collate the manuscripts.
- (There is even a \fBdiff3\fP program for examining three texts.)
- .PP
- Many novice Unix users wonder how they can linenumber a text,
- as with the \*Qset number\*U option of \fBex/vi\fP.
- There are several ways, but the best approach
- for Humanities users is the \fBlno\fP program.
- It allows you to set the linenumber
- to start at whatever value you want,
- and also supports double line numbering and hemistich numbering.
- .PP
- If your final copy is to be lineprinted, you may want to use
- \fBtolpr\fP to shift your text to the right, away from the holes.
- It is possible to do extra formatting on your concordance,
- and even to print your concordance using the phototypesetter.
- If you want to use the typesetter,
- use \fBtroffmt\fP instead of \fBformat\fP,
- and send the resulting output to \fBtroff \-Q\fP.
- It is not necessary to use the \-ms macros.
- .PP
- If you intend to transport tapes to or from another installation,
- especially an IBM-compatible machine,
- a number of programs in \*Q/usr/inp/bin\*U may be helpful.
- For importing text encoded in fixed length format,
- the \fBline\fP program can be used to insert missing newline characters;
- \fBtprep\fP can then be used to strip off trailing blanks.
- For exporting text, the \fBfixlen\fP program can be used to transform
- variable length Unix lines into IBM-style fixed length records.
- For rearranging fields of a record
- (for instance, to move linenumbers from the last few columns
- to the first few columns), use the \fBpermute\fP program.
- .SH
- Acknowledgements
- .PP
- The author would like to thank Professors
- Joseph J. Duggan and John Lindow for research funding
- during the early stages of this project.
- James Joyce gave advice on the general design of this package,
- and Brad Rubenstein and Bob Campbell helped immeasurably
- with much of the actual programming.
- The project was finally completed with the financial support
- of a grant from the National Endowment for the Humanities.
-
- @@@ Fin de Hum.tutorial
- echo Makefile
- cat >Makefile <<'@@@ Fin de Makefile'
- BIN=
-
- all: $(BIN)accent $(BIN)cfreq $(BIN)dict $(BIN)exclude $(BIN)format\
- $(BIN)freq $(BIN)kwal $(BIN)kwic $(BIN)lno $(BIN)med\
- $(BIN)maxwd $(BIN)pair $(BIN)pause $(BIN)revconc $(BIN)sfind\
- $(BIN)skel $(BIN)togrk $(BIN)tolpr $(BIN)tosel $(BIN)tprep\
- $(BIN)troffmt $(BIN)wdlen $(BIN)wheel $(BIN)xref
-
- clean:
- rm -f $(BIN)accent $(BIN)cfreq $(BIN)dict $(BIN)exclude $(BIN)format
- rm -f $(BIN)freq $(BIN)kwal $(BIN)kwic $(BIN)lno $(BIN)med
- rm -f $(BIN)maxwd $(BIN)pair $(BIN)pause $(BIN)revconc $(BIN)sfind
- rm -f $(BIN)skel $(BIN)togrk $(BIN)tolpr $(BIN)tosel $(BIN)tprep
- rm -f $(BIN)troffmt $(BIN)wdlen $(BIN)wheel $(BIN)xref
- rm -f $(BIN)cedilla $(BIN)umlaut
-
- $(BIN)accent: accent.c
- cc accent.c -O -s -o $(BIN)accent
- ln $(BIN)accent $(BIN)cedilla
- ln $(BIN)accent $(BIN)umlaut
- $(BIN)cfreq: cfreq.c
- cc cfreq.c -O -s -o $(BIN)cfreq
- $(BIN)dict: dict.c
- cc dict.c -O -s -o $(BIN)dict
- $(BIN)exclude: exclude.c
- cc exclude.c -O -s -o $(BIN)exclude
- $(BIN)format: format.c
- cc format.c -O -s -o $(BIN)format
- $(BIN)freq: freq.c # separate i&d space
- cc freq.c -O -s -o $(BIN)freq
- $(BIN)kwal: kwal.c
- cc kwal.c -O -s -o $(BIN)kwal
- $(BIN)kwic: kwic.c
- cc kwic.c -O -s -o $(BIN)kwic
- $(BIN)lno: lno.c
- cc lno.c -O -s -o $(BIN)lno
- $(BIN)med: med.c
- cc med.c -O -s -o $(BIN)med -lcurses -ltermcap
- $(BIN)maxwd: maxwd.c
- cc maxwd.c -O -s -o $(BIN)maxwd
- $(BIN)pair: pair.c
- cc pair.c -O -s -o $(BIN)pair
- $(BIN)pause: pause.c
- cc pause.c -O -s -o $(BIN)pause
- $(BIN)revconc: revconc.c
- cc revconc.c -O -s -o $(BIN)revconc
- $(BIN)sfind: sfind.c
- cc sfind.c -O -s -o $(BIN)sfind
- $(BIN)skel: skel.c
- cc skel.c -O -s -o $(BIN)skel
- $(BIN)togrk: togrk.c
- cc togrk.c -O -s -o $(BIN)togrk
- $(BIN)tolpr: tolpr.c
- cc tolpr.c -O -s -o $(BIN)tolpr
- $(BIN)tosel: tosel.c
- cc tosel.c -O -s -o $(BIN)tosel
- $(BIN)tprep: tprep.c
- cc tprep.c -O -s -o $(BIN)tprep
- $(BIN)troffmt: troffmt.c
- cc troffmt.c -O -s -o $(BIN)troffmt
- $(BIN)wdlen: wdlen.c
- cc wdlen.c -O -s -o $(BIN)wdlen
- $(BIN)wheel: wheel.c
- cc wheel.c -O -s -o $(BIN)wheel
- $(BIN)xref: xref.c
- cc xref.c -O -s -o $(BIN)xref
- @@@ Fin de Makefile
- echo ORIGIN
- cat >ORIGIN <<'@@@ Fin de ORIGIN'
- Unix-From: sun!cairo!tut Thu Oct 10 20:50:28 1985
- Date: Thu, 10 Oct 85 14:31:16 PDT
- From: sun!cairo!tut (Bill Tuthill)
- Message-Id: <8510102131.AA08577@cairo.sun.uucp>
- To: gnu@l5.uucp
- Subject: Re: Hum tutorial
-
- Date: Fri, 19 Jun 87 16:45:45 PDT
- From: sun!tut (Bill Tuthill)
- Message-Id: <8706192345.AA04406@cairo.sun.uucp>
- To: hoptoad!gnu
- Subject: Re: Hum programs
-
- It's fine with me if you submit Hum to Rich $alz for mod.sources.
- How did he get such a $trange letter in his la$t name?
- @@@ Fin de ORIGIN
- echo README
- cat >README <<'@@@ Fin de README'
- Here is the source code for the "hum" concordance package.
- To compile all the programs, just type "make", and the programs will
- all be compiled, one by one, and put in "../bin". You can change
- the destination bin if you like (for example, to "/usr/hum/bin") by
- changing the first line in the "Makefile". All the programs here
- are self-contained, with no dependencies on one another; only
- "med" requires special libraries (curses and termcap). If you do
- extensive revising, especially on the longer programs, you will
- probably want to split them into separate source files and change
- the "Makefile" accordingly.
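-
- For example, to install the programs in "/usr/hum/bin", the first
- line of the "Makefile" would read as follows (the trailing slash
- matters, because "BIN" is used as a prefix on each program name):
-
-     BIN=/usr/hum/bin/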
-
- Several programs here are new and undocumented. The "togrk"
- program employs Wes Walton's Greek transcription system to convert
- English letters to Greek character strings for "nroff/troff". The
- "troffmt" program replaces "format" for the phototypesetter, and
- requires the macro package "../tmac/trmacs" to work correctly.
- It has not been fully debugged.
-
- The old "lemma" program was not good enough to distribute.
- It is currently in the process of revision so it can be used along
- with "dict". Disk seeks and reads will replace sequential I/O.
- If you need the new "lemma" program, contact Bill Tuthill for a
- software update.
- @@@ Fin de README
- echo accent.c
- cat >accent.c <<'@@@ Fin de accent.c'
- # include <stdio.h> /* accent.c (rev3.7) */
-
- char mode = 'a'; /* accent mode: a=file c=cedilla u=umlaut */
- char accfile[15] = "accfile"; /* file containing accent mark definitions */
-
- usage() /* print proper usage and exit */
- {
- fprintf(stderr, "Usage: accent [-a accfile] [filename(s)]\t(rev3.7)\n");
- fprintf(stderr, "\taccent mark definitions read from accfile\n");
- exit(1);
- }
-
- main(argc, argv) /* user-controlled module for accent marks */
- int argc;
- char *argv[];
- {
- FILE *fopen(), *fp;
- int n, i;
-
- n = strlen(argv[0]);
- /* umlaut */
- if (argv[0][n-2] == 'u' && argv[0][n-1] == 't')
- mode = 'u';
- /* cedilla */
- if (argv[0][n-2] == 'l' && argv[0][n-1] == 'a')
- mode = 'c';
- i = 1;
- if (mode == 'a' && argc > 1 && argv[i][0] == '-')
- {
- if (argv[i][1] == 'a' && argc >= 3)
- {
- strcpy(accfile, argv[++i]);
- argc -= 2;
- i++;
- }
- else
- usage();
- }
- if (mode == 'a')
- getacc(accfile);
-
- fp = stdin;
- do {
- if (argc > 1 && (fp = fopen(argv[i], "r")) == NULL)
- {
- fprintf(stderr, "%s: can't access file %s\n",
- argv[0], argv[i]);
- exit(1);
- }
- else {
- if (mode == 'a')
- accent(fp);
- else /* c or u mode */
- ced_um(fp);
- fclose(fp);
- }
- } while (++i < argc);
- exit(0);
- }
-
- char from[BUFSIZ/8]; /* accent marks placed in text */
- char into[BUFSIZ/8]; /* accent marks to be output */
-
- getacc(file) /* retrieve accent mark definitions */
- char *file;
- {
- FILE *afp, *fopen();
- char str[BUFSIZ/16];
- register int n;
-
- if ((afp = fopen(file, "r")) == NULL)
- {
- fprintf(stderr, "can't find accent file: %s\n", file);
- exit(1);
- }
- for (n = 0; n < BUFSIZ/8 - 1 && fgets(str, BUFSIZ/16, afp); n++)
- {
- if (strlen(str) == 4)
- {
- from[n] = str[0];
- into[n] = str[2];
- }
- else {
- fprintf(stderr,
- "syntax error in line %d of %s\n", n+1, file);
- fprintf(stderr,
- "usage: <fromchar> <space> <intochar>\n");
- exit(1);
- }
- }
- from[n] = into[n] = NULL;
- }
-
- accent(fp) /* change accent marks into <BS><acc> */
- FILE *fp;
- {
- register int c, n;
-
- while ((c = getc(fp)) != EOF)
- {
- n = position(from, c);
- if (n >= 0)
- {
- putchar('\b');
- putchar(into[n]);
- }
- else
- putchar(c);
- }
- }
-
- ced_um(fp) /* change + into cedilla: <BS>, or umlaut: <BS>" */
- FILE *fp;
- {
- register int c;
-
- while ((c = getc(fp)) != EOF)
- {
- if (c == '+')
- {
- putchar('\b');
- if (mode == 'c')
- putchar(',');
- if (mode == 'u')
- putchar('"');
- }
- else
- putchar(c);
- }
- }
-
- position(str, c) /* return location of c in str, -1 if not */
- char str[], c;
- {
- register int i;
-
- for (i = 0; str[i]; i++)
- if (str[i] == c)
- return(i);
- return(-1);
- }
- @@@ Fin de accent.c
- echo cfreq.c
- cat >cfreq.c <<'@@@ Fin de cfreq.c'
- # include <stdio.h> /* cfreq.c (rev 3.7) */
- # include <ctype.h>
-
- char *chars[512] =
- {
- "^@", "^A", "^B", "^C", "^D", "^E", "^F", "^G",
- "^H", "^I", "^J", "^K", "^L", "^M", "^N", "^O",
- "^P", "^Q", "^R", "^S", "^T", "^U", "^V", "^W",
- "^X", "^Y", "^Z", "^[", "^\\", "^]", "^^", "^_",
- " ", "!", "\"", "#", "$", "%", "&", "'",
- "(", ")", "*", "+", ",", "-", ".", "/",
- "0", "1", "2", "3", "4", "5", "6", "7",
- "8", "9", ":", ";", "<", "=", ">", "?",
- "@", "A", "B", "C", "D", "E", "F", "G",
- "H", "I", "J", "K", "L", "M", "N", "O",
- "P", "Q", "R", "S", "T", "U", "V", "W",
- "X", "Y", "Z", "[", "\\", "]", "^", "_",
- "`", "a", "b", "c", "d", "e", "f", "g",
- "h", "i", "j", "k", "l", "m", "n", "o",
- "p", "q", "r", "s", "t", "u", "v", "w",
- "x", "y", "z", "{", "|", "}", "~", "^?",
- };
-
- struct tnode /* binary tree for digraphs and count */
- {
- char *word;
- int count;
- struct tnode *left;
- struct tnode *right;
- };
-
- long count[256]; /* count of individual characters */
- long total; /* total number of characters */
- char printable = 0; /* toggle printable chars option */
- char ascii = 0; /* toggle all ascii chars option */
- char digraph = 0; /* toggle digraph frequency count */
- char nomap = 0; /* turn off digraph mapping to lower case */
-
- main(argc,argv) /* count frequency of characters or digraphs */
- int argc;
- char *argv[];
- {
- FILE *fopen(), *fp;
- struct tnode *tree(), *root;
- char dgraphs[3];
- int i;
-
- root = NULL;
- dgraphs[2] = NULL;
- if (argc == 1)
- {
- puts("Usage: cfreq [-p -a -d -m -] filename(s)\t(rev3.7)");
- puts("-p: list all printable characters (blank - '~')");
- puts("-a: list all ascii characters (null - delete)");
- puts("-d: count digraphs rather than single characters");
- puts("-m: disable mapping of digraphs to lower case");
- puts("- : read standard input instead of files");
- exit(1);
- }
- for (i = 1; i < argc; i++)
- {
- if (*argv[i] == '-')
- getflag(argv[i]);
- else if ((fp = fopen(argv[i], "r")) != NULL)
- {
- if (digraph)
- while (dfreq(dgraphs, fp))
- root = tree(root, dgraphs);
- else
- cfreq(fp);
- fclose(fp);
- }
- else /* cannot open file */
- {
- fprintf(stderr,
- "Cfreq cannot access the file: %s\n", argv[i]);
- exit(1);
- }
- }
- if (digraph)
- {
- printf("Digraph: Freq:\n");
- treeprint(root);
- }
- else
- pr_results();
- exit(0);
- }
-
- getflag(f) /* parses command line to set options */
- char *f;
- {
- struct tnode *tree(), *root;
- char dgraphs[3];
-
- f++;
- switch(*f)
- {
- case 'p':
- printable = 1;
- break;
- case 'a':
- ascii = 1;
- break;
- case 'd':
- digraph = 1;
- break;
- case 'm':
- nomap = 1;
- break;
- case NULL:
- root = NULL;
- dgraphs[2] = NULL;
- if (digraph)
- {
- while (dfreq(dgraphs, stdin))
- root = tree(root, dgraphs);
- printf("Digraph: Freq:\n");
- treeprint(root);
- exit(0);
- }
- else
- cfreq(stdin);
- break;
- default:
- fprintf(stderr, "Invalid cfreq flag: -%s\n", f);
- exit(1);
- break;
- }
- }
-
- cfreq(fp) /* run through files counting characters */
- FILE *fp;
- {
- register int c;
-
- while ((c = getc(fp)) != EOF)
- {
- count[(int)c]++;
- total++;
- }
- }
-
- pr_results() /* print out table of character counts */
- {
- int i;
-
- printf("Char:\t Freq:\n");
- if (printable)
- for (i = 32; i < 127; i++)
- printf("%s\t%6ld\n", chars[i], count[i]);
- else if (ascii)
- for (i = 0; i < 128; i++)
- printf("%s\t%6ld\n", chars[i], count[i]);
- else
- {
- for (i = 65; i < 91; i++)
- printf("%s\t%6ld\n", chars[i], count[i]);
- for (i = 97; i < 123; i++)
- printf("%s\t%6ld\n", chars[i], count[i]);
- }
- printf("Total:\t%6ld\n", total);
- }
-
- dfreq(dgraphs,fp) /* drives program through text by digraphs */
- char dgraphs[];
- FILE *fp;
- {
- register int c;
-
- if ((c = getc(fp)) != EOF)
- {
- if (c == '\n' || c == '\t')
- dgraphs[0] = ' ';
- else if (!nomap && isupper(c))
- dgraphs[0] = tolower(c);
- else dgraphs[0] = c;
- }
- else return(0);
- if ((c = getc(fp)) != EOF)
- {
- if (c == '\n' || c == '\t')
- dgraphs[1] = ' ';
- else if (!nomap && isupper(c))
- dgraphs[1] = tolower(c);
- else dgraphs[1] = c;
- }
- else return(0);
- ungetc(c, fp);
- return(1);
- }
-
- struct tnode *tree(p, w) /* build tree beginning at root */
- struct tnode *p;
- char *w;
- {
- struct tnode *talloc();
- char *strsave();
- int cond;
-
- if (p == NULL)
- {
- p = talloc();
- p->word = strsave(w);
- p->count = 1;
- p->left = p->right = NULL;
- }
- else if ((cond = strcmp(w, p->word)) == 0)
- p->count++;
- else if (cond < 0)
- p->left = tree(p->left, w);
- else /* if cond > 0 */
- p->right = tree(p->right, w);
- return(p);
- }
-
- treeprint(p) /* print out contents of binary tree */
- struct tnode *p;
- {
- if (p != NULL)
- {
- treeprint(p->left);
- printf("%s\t %5d\n", p->word, p->count);
- treeprint(p->right);
- }
- }
-
- struct tnode *talloc() /* core allocator for tree */
- {
- struct tnode *p;
- char *malloc();
-
- if ((p = ((struct tnode *)malloc(sizeof(struct tnode)))) != NULL)
- ; /* will return */
- else /* if (p == NULL) */
- overflow();
- return(p);
- }
-
- char *strsave(s) /* allocate space for string of text */
- char *s;
- {
- char *p, *malloc(), *strcpy();
-
- if ((p = malloc((unsigned)(strlen(s)+1))) != NULL)
- strcpy(p, s);
- else /* if (p == NULL) */
- overflow();
- return(p);
- }
-
- overflow() /* exit gracefully in case of core overflow */
- {
- fprintf(stderr,
- "Cfreq: no more core available (maximum on PDP is 64K bytes).\n");
- exit(1);
- }
- @@@ Fin de cfreq.c
- echo dict.c
- cat >dict.c <<'@@@ Fin de dict.c'
- # include <stdio.h> /* dict.c (rev3.7) */
-
-
- main(argc, argv) /* split file into dictionary sections */
- int argc;
- char *argv[];
- {
- FILE *fp, *fopen();
- char *strcpy(), root[15]; /* root name of outfile */
-
- if (argc == 1 || argc > 3)
- {
- puts("Usage: dict [-]filename [outroot]\t\t(rev3.7)");
- puts("- : read standard input rather than file");
- exit(1);
- }
- if (argc == 2)
- strcpy(root, "X"); /* default output root name */
- if (argc == 3)
- strcpy(root, argv[2]);
-
- if (argv[1][0] == '-' && argv[1][1] == NULL)
- dict(stdin, root);
- else if ((fp = fopen(argv[1], "r")) != NULL)
- {
- dict(fp, root);
- fclose(fp);
- }
- else /* can't open input file */
- {
- fprintf(stderr,
- "Dict cannot access the file: %s\n", argv[1]);
- exit(1);
- }
- exit(0);
- }
-
- dict(fp, root) /* write each letter to separate file */
- FILE *fp;
- char *root;
- {
- FILE *sp, *fopen();
- char s[BUFSIZ], fname[20], ch = 'A';
- int first = 1, len;
-
- sp = NULL;
- while (fgets(s, BUFSIZ, fp))
- {
- if (s[0] != ch) /* new letter found */
- {
- ch = s[0];
- first = 1;
- }
- if (!first && s[0] == ch) /* same letter */
- {
- fputs(s, sp);
- if (ferror(sp))
- {
- perror(fname);
- exit(1);
- }
- }
- if (first) /* first encounter with this letter */
- {
- strcpy(fname, root); /* derive filename */
- len = strlen(fname);
- fname[len] = ch;
- fname[++len] = NULL;
-
- if (sp != NULL)
- fclose(sp);
- if ((sp = fopen(fname, "a")) == NULL)
- {
- perror(fname);
- exit(1);
- }
- fputs(s, sp);
- if (ferror(sp))
- {
- perror(fname);
- exit(1);
- }
- first = 0;
- }
- }
- }
- @@@ Fin de dict.c
- echo exclude.c
- cat >exclude.c <<'@@@ Fin de exclude.c'
- # include <stdio.h> /* exclude.c (rev3.7) */
- # include <ctype.h>
- # define MXWDS 500
-
- char mode = 'i'; /* default is ignore mode, not only mode */
-
- usage() /* print proper usage and exit */
- {
- puts("Usage: exclude [-i -o exclfile] [filename(s)]\t(rev3.7)");
- puts("-i: exclfile contains words to be ignored, one per line");
- puts("-o: exclfile has only words to be printed, one per line");
- puts("With no options, excluded words should be in \"exclfile\".");
- exit(1);
- }
-
- main(argc, argv) /* exclude common words from concordance */
- int argc;
- char *argv[];
- {
- FILE *fp, *fopen();
- int i = 1;
-
- if (argc == 1)
- {
- rdexclf("exclfile");
- exclude(stdin);
- exit(0);
- }
- if (argv[1][0] == '-')
- {
- if (argc == 2)
- usage();
- if (argv[1][1] == 'i')
- mode = 'i';
- else if (argv[1][1] == 'o')
- mode = 'o';
- else /* bad flag */
- usage();
- rdexclf(argv[2]);
- if (argc == 3)
- exclude(stdin);
- i = 3;
- }
- else /* no options used */
- rdexclf("exclfile");
- for (; i < argc; i++)
- {
- if ((fp = fopen(argv[i], "r")) != NULL)
- {
- exclude(fp);
- fclose(fp);
- }
- else /* can't open input file */
- {
- fprintf(stderr,
- "Exclude cannot access the file: %s\n", argv[i]);
- continue;
- }
- }
- exit(0);
- }
-
- char *wdptr[MXWDS]; /* array of pointers to excluded words */
- int nwds = 0; /* the number of excluded words in core */
-
- rdexclf(fname) /* load structure with words from exclfile */
- char fname[];
- {
- FILE *efp, *fopen();
- char wd[512], *p, *malloc(), *strcpy();
-
- if ((efp = fopen(fname, "r")) == NULL)
- {
- fprintf(stderr,
- "Cannot access exclude file: %s\n", fname);
- usage();
- exit(1);
- }
- while (fgets(wd, 512, efp))
- {
- if (nwds >= MXWDS)
- {
- fprintf(stderr,
- "Maximum of %d exclude words allowed.\n", MXWDS);
- exit(1);
- }
- else if ((p = malloc((unsigned)(strlen(wd)+1))) == NULL)
- {
- fprintf(stderr,
- "Exclude: no more space left in core.\n");
- exit(1);
- }
- else /* everything is OK */
- {
- strcpy(p, wd);
- wdptr[nwds++] = p;
- }
- }
- return;
- }
-
- exclude(fp) /* filter out excluded words, i or o mode */
- FILE *fp;
- {
- char s[512], word[512];
-
- while (fgets(s, 512, fp))
- {
- if (firstword(s, word) == 0)
- continue;
- if (mode == 'i')
- {
- if (!inlist(word))
- fputs(s, stdout);
- }
- if (mode == 'o')
- {
- if (inlist(word))
- fputs(s, stdout);
- }
- }
- }
-
- firstword(s, wd) /* return first word of string s */
- char s[], *wd;
- {
- int i = 0;
-
- if (isspace(s[i]))
- return(0);
- while (!isspace(s[i]))
- *wd++ = s[i++];
- *wd++ = '\n';
- *wd = NULL;
- return(1);
- }
-
- inlist(word) /* check to see if word is in exclude list */
- char word[];
- {
- int i;
-
- for (i = 0; i < nwds; i++)
- {
- if (strcmp(word, wdptr[i]) == 0)
- return(1);
- }
- return(0);
- }
- @@@ Fin de exclude.c
- echo format.c
- cat >format.c <<'@@@ Fin de format.c'
- # include <stdio.h> /* format.c (rev3.7) */
- # include <ctype.h>
- # include <signal.h>
-
- char *tempfile; /* to store overflow while counting */
- int nomap = 0; /* toggle for mapping keyword to lcase */
- int nocnt = 0; /* toggle for counting keyword */
- int nokwd = 0; /* toggle for suppressing keyword */
-
- usage() /* print proper usage and exit */
- {
- puts("Usage: format [-mck] [filename(s)]\t\t(rev3.7)");
- puts("-m: keywords not mapped from lower to upper case");
- puts("-c: suppress counting of keyword frequency");
- puts("-k: entirely suppress printing of keyword");
- exit(1);
- }
-
- main(argc, argv) /* make keyword headings with count */
- int argc;
- char *argv[];
- {
- FILE *fopen(), *fp;
- int i, j, onintr();
- char *mktemp();
-
- if (signal(SIGINT, SIG_IGN) != SIG_IGN)
- signal(SIGINT, onintr);
-
- tempfile = "/tmp/FmtXXXXX";
- mktemp(tempfile);
-
- for (i = 1; i < argc && *argv[i] == '-'; i++)
- {
- for (j = 1; argv[i][j] != NULL; j++)
- {
- if (argv[i][j] == 'm')
- nomap = 1;
- else if (argv[i][j] == 'c')
- nocnt = 1;
- else if (argv[i][j] == 'k')
- nokwd = 1;
- else /* bad option */
- {
- fprintf(stderr,
- "Illegal format flag: -%c\n", argv[i][j]);
- usage();
- }
- }
- }
- if (i == argc)
- {
- if (nokwd)
- rmkwds(stdin);
- else if (nocnt)
- ffmt(stdin);
- else
- format(stdin);
- }
- for (; i < argc; i++)
- {
- if ((fp = fopen(argv[i], "r")) != NULL)
- {
- if (nokwd)
- rmkwds(fp);
- else if (nocnt)
- ffmt(fp);
- else
- format(fp);
- fclose(fp);
- }
- else /* attempt to open file failed */
- {
- fprintf(stderr,
- "Format cannot access the file: %s\n", argv[i]);
- continue;
- }
- }
- unlink(tempfile);
- exit(0);
- }
-
- char buff[BUFSIZ*8]; /* tempfile buffer for storing contexts */
- int bufflen; /* total length of contexts in buffer */
- int fulltf = 0; /* does the tempfile contain something? */
- FILE *tf = NULL; /* file pointer for tempfile routines */
-
- format(fp) /* print keyword and count only if different */
- FILE *fp;
- {
- char s[BUFSIZ], okw[BUFSIZ/2], nkw[BUFSIZ/2], cntxt[BUFSIZ];
- char *sp, *kwp, *cxp, *strcpy();
- int kwfreq = 0;
-
- strcpy(okw,"~~~~~"); /* make sure 1st keyword is printed */
-
- while (fgets(s, BUFSIZ, fp))
- {
- for (sp = s, kwp = nkw; *sp && *sp != '|'; sp++, kwp++)
- {
- if (!nomap && islower(*sp))
- *kwp = toupper(*sp);
- else
- *kwp = *sp;
- }
- *kwp = NULL;
-
- for (++sp, cxp = cntxt; *sp && *sp != '\n'; sp++, cxp++)
- {
- if (*sp == '|') {
- *cxp = ' '; *++cxp = ' '; *++cxp = ' ';
- } else
- *cxp = *sp;
- }
- *cxp = '\n';
- *++cxp = NULL;
-
- if (strcmp(nkw, okw) != 0) /* kwds different */
- {
- if (kwfreq != 0)
- {
- getbuff(kwfreq);
- putchar('\n');
- }
- *buff = NULL;
- bufflen = 0;
- fputs(nkw, stdout);
- putbuff(cntxt);
- kwfreq = 1;
- }
- else /* if keywords are the same */
- {
- putbuff(cntxt);
- kwfreq++;
- }
- strcpy(okw, nkw);
- }
- getbuff(kwfreq);
- }
-
- putbuff(cntxt) /* cache routine to buffer tempfile */
- char cntxt[];
- {
- char *strcat();
-
- if (!fulltf)
- {
- bufflen += strlen(cntxt);
- if (bufflen < BUFSIZ*8)
- strcat(buff, cntxt);
- else {
- fulltf = 1;
- if ((tf = fopen(tempfile, "w")) == NULL)
- perror(tempfile);
- fputs(buff, tf);
- *buff = NULL;
- bufflen = 0;
- }
- }
- else /* fulltf */
- fputs(cntxt, tf);
- }
-
- getbuff(kwfreq) /* print frequency and context buffer */
- int kwfreq;
- {
- char str[BUFSIZ];
-
- printf("(%d)\n", kwfreq);
- if (!fulltf)
- fputs(buff, stdout);
- else
- {
- fclose(tf);
- if ((tf = fopen(tempfile, "r")) == NULL)
- perror(tempfile);
- while (fgets(str, BUFSIZ, tf))
- fputs(str, stdout);
- fclose(tf);
- fulltf = 0;
- }
- }
-
- int onintr() /* remove tempfile in case of interrupt */
- {
- fprintf(stderr, "\nInterrupt\n");
- unlink(tempfile);
- exit(1);
- }
-
- ffmt(fp) /* if different, print keyword without count */
- FILE *fp;
- {
- char s[BUFSIZ], okw[BUFSIZ/2], nkw[BUFSIZ/2], cntxt[BUFSIZ];
- char *sp, *kwp, *cxp, *strcpy();
-
- strcpy(okw,"~~~~~"); /* make sure 1st keyword is printed */
- while (fgets(s, BUFSIZ, fp))
- {
- for (sp = s, kwp = nkw; *sp && *sp != '|'; sp++, kwp++)
- {
- if (!nomap && islower(*sp))
- *kwp = toupper(*sp);
- else
- *kwp = *sp;
- }
- *kwp = NULL;
-
- for (++sp, cxp = cntxt; *sp && *sp != '\n'; sp++, cxp++)
- {
- if (*sp == '|') {
- *cxp = ' '; *++cxp = ' '; *++cxp = ' ';
- } else
- *cxp = *sp;
- }
- *cxp = '\n';
- *++cxp = NULL;
-
- if (strcmp(nkw, okw) != 0) /* kwds different */
- printf("\n%s\n %s", nkw, cntxt);
- else /* if keywords are the same */
- printf(" %s", cntxt);
- strcpy(okw, nkw);
- }
- }
-
- rmkwds(fp) /* completely suppress printing of keyword */
- FILE *fp;
- {
- char s[BUFSIZ], *sp;
-
- while (fgets(s, BUFSIZ, fp))
- {
- for (sp = s; *sp && *sp != '|'; sp++)
- ;
- for (; *sp; sp++)
- {
- if (*sp == '|')
- printf(" ");
- else
- putchar(*sp);
- }
- }
- }
- @@@ Fin de format.c
- echo freq.c
- cat >freq.c <<'@@@ Fin de freq.c'
- # include <stdio.h> /* freq.c (rev3.7) */
- # include <ctype.h>
-
- struct tnode /* binary tree for word and count */
- {
- char *word;
- int count;
- struct tnode *left;
- struct tnode *right;
- };
-
- char punctuation[BUFSIZ] = ",.;:-?!\"()[]{}" ;
-
- long int total = 0; /* total number of words */
- long int different = 0; /* number of different words */
- char numfreq = 0; /* toggle for numerical freq */
- char nomap = 0; /* do not map to lower case */
-
- usage() /* print proper usage and exit */
- {
- puts("Usage: freq [-n -m -dF -] filename(s)\t\t(rev3.7)");
- puts("-n: list words in numerical order of frequency");
- puts("-m: disable mapping of words to lower case");
- puts("-d: define punctuation set according to file F");
- puts("- : read standard input instead of files");
- exit(1);
- }
-
- main(argc, argv) /* tabulate word frequencies of a text */
- int argc;
- char *argv[];
- {
- FILE *fopen(), *fp;
- struct tnode *root, *tree();
- char word[BUFSIZ];
- int i;
-
- if (argc == 1)
- usage();
- root = NULL; /* initialize tree */
- for (i = 1; i < argc; i++)
- {
- if (*argv[i] == '-')
- getflag(argv[i]);
- else if ((fp = fopen(argv[i], "r")) != NULL)
- {
- while (getword(word, fp))
- {
- ++total;
- root = tree(root, word);
- }
- fclose(fp);
- }
- else /* attempt to open file failed */
- {
- fprintf(stderr,
- "Freq cannot access the file: %s\n", argv[i]);
- exit(1);
- }
- }
- if (numfreq) /* print results */
- treesort(root);
- else
- treeprint(root, stdout);
-
- printf("------------------------------\n");
- printf("%5ld Total number of words\n", total);
- printf("%5ld Different words used\n", different);
- exit(0);
- }
-
- getflag(f) /* parses command line to set options */
- char *f;
- {
- char *pfile, word[BUFSIZ];
- struct tnode *root, *tree();
-
- f++;
- switch(*f++)
- {
- case 'n':
- numfreq = 1;
- break;
- case 'm':
- nomap = 1;
- break;
- case 'd':
- pfile = f;
- getpunct(pfile);
- break;
- case NULL:
- root = NULL;
- while (getword(word, stdin))
- {
- ++total;
- root = tree(root, word);
- }
- if (numfreq)
- treesort(root);
- else
- treeprint(root, stdout);
- break;
- default:
- fprintf(stderr,
- "Invalid freq flag: -%s\n", --f);
- exit(1);
- break;
- }
- }
-
- getpunct(pfile) /* read user's punctuation from pfile */
- char *pfile;
- {
- FILE *pfp, *fopen();
- char s[BUFSIZ], *strcpy();
-
- if ((pfp = fopen(pfile, "r")) == NULL)
- {
- fprintf(stderr,
- "Freq cannot access Pfile: %s\n", pfile);
- exit(1);
- }
- else
- while (fgets(s, BUFSIZ, pfp))
- strcpy(punctuation, s);
- }
-
- getword(word, fp) /* drives program through text word by word */
- char word[];
- FILE *fp;
- {
- while ((*word = getc(fp)) && isskip(*word) && *word != EOF)
- ;
- if (*word == EOF)
- return(0);
- if (!nomap && isupper(*word))
- *word = tolower(*word);
-
- while ((*++word = getc(fp)) && !isskip(*word) && *word !=EOF)
- {
- if (!nomap && isupper(*word))
- *word = tolower(*word);
- }
- *word = NULL;
- return(1);
- }
-
- isskip(c) /* function to evaluate punctuation */
- char c;
- {
- char *ptr;
-
- if (isspace(c))
- return(1);
- for (ptr = punctuation; *ptr != c && *ptr != NULL; ptr++)
- ;
- if (*ptr == NULL)
- return(0);
- else
- return(1);
- }
-
- struct tnode *tree(p, w) /* build tree beginning at root */
- struct tnode *p;
- char *w;
- {
- struct tnode *talloc();
- char *strsave();
- int cond;
-
- if (p == NULL)
- {
- p = talloc();
- p->word = strsave(w);
- p->count = 1;
- p->left = p->right = NULL;
- }
- else if ((cond = strcmp(w, p->word)) == 0)
- p->count++;
- else if (cond < 0)
- p->left = tree(p->left, w);
- else /* if cond > 0 */
- p->right = tree(p->right, w);
- return(p);
- }
-
- treesort(p) /* sort contents of binary tree and print */
- struct tnode *p;
- {
- FILE *pfp, *popen();
-
- pfp = popen("sort +0rn -1 +1", "w");
- if (p != NULL)
- treeprint(p, pfp);
- pclose(pfp);
- }
-
- treeprint(p, fp) /* write tree onto fp file stream */
- struct tnode *p;
- FILE *fp;
- {
- if (p != NULL)
- {
- treeprint(p->left, fp);
- fprintf(fp, "%5d %s\n", p->count, p->word);
- ++different;
- treeprint(p->right, fp);
- }
- }
-
- struct tnode *talloc() /* core allocator for tree */
- {
- struct tnode *p;
- char *malloc();
-
- if ((p = ((struct tnode *)malloc(sizeof(struct tnode)))) != NULL)
- ; /* will return */
- else /* if (p == NULL) */
- overflow();
- return(p);
- }
-
- char *strsave(s) /* allocate space for string of text */
- char *s;
- {
- char *p, *malloc();
-
- if ((p = malloc((unsigned)(strlen(s)+1))) != NULL)
- strcpy(p, s);
- else /* if (p == NULL) */
- overflow();
- return(p);
- }
-
- overflow() /* exit gracefully in case of core overflow */
- {
- fprintf(stderr,
- "Freq: no more core available (maximum on PDP 11/70 is 64K bytes).\n");
- fprintf(stderr,
- "You might try: dissolve filename(s) | sort | uniq -c\n");
- exit(1);
- }
- @@@ Fin de freq.c
- exit 0
-