Path: uunet!rs
From: rs@uunet.UU.NET (Rich Salz)
Newsgroups: comp.sources.unix
Subject: v10i27: Bill Tuthill's "hum" text concordance package, Part01/03
Message-ID: <474@uunet.UU.NET>
Date: 26 Jun 87 21:48:47 GMT
Organization: UUNET Communications Services, Arlington, VA
Lines: 1965
Approved: rs@uunet.uu.net
Submitted by: John Gilmore <hoptoad!gnu>
Mod.Sources: Volume 10, Number 27
Archive-name: hum/Part01
I advertised this package in comp.text.desktop and a number of people
expressed interest. I find it a good set of tools for text analysis, and
I think it belongs in the archives.
It was developed by Bill Tuthill while he was at UC Berkeley. He
authorized me to pass it on to you for publication.
[ Its a pretty gud writing tool. --r$ ]
: To unbundle, sh this file
echo Hum.tutorial
cat >Hum.tutorial <<'@@@ Fin de Hum.tutorial'
.if n .ds Q \&"
.if n .ds U \&"
.if t .ds Q ``
.if t .ds U ''
.ND 3 November 1981
.nr LL 6.5i
.nr FL 6.2i
.RP
.TL
Hum \(em A Concordance and Text Analysis Package
.AU
William Tuthill
.AI
Comparative Literature Department
University of California
Berkeley, California 94720
.AB
A new package of programs for literary and linguistic
computing is available, emphasizing the preparation
of concordances and supporting documents.
Both keyword in context and keyword and line generators are provided,
as well as exclusion routines, a reverse concordance module,
formatting programs, a dictionary maker, and lemmatization facilities.
There are also word, character, and digraph
frequency counting programs,
word length tabulation routines,
a cross reference generator, and other related utilities.
The programs are written in the C programming language,
and implemented on several Version 7 Unix\(dg systems at Berkeley.
.FS
\(dg Unix is a trademark of Bell Laboratories.
.FE
They should be portable to any system that has a C compiler
and supports some kind of pipe mechanism.
At Berkeley, they reside in ~hum/bin; documentation is in ~hum/man.
This paper constitutes a tutorial introduction to the package.
Manual pages for the various programs are available separately.
.AE
.SH
The Literary Text
.PP
There are two indispensable prerequisites for the analysis
of a literary text by computer: the computer, and the text itself.
About 98% of the work involved is entering and correcting the text,
so if you can obtain your text elsewhere, you will save a lot of time.
Check with the Unix consultants to see what tape formats
can be read at the Berkeley installation,
and have the text taped in one of these formats.
It is also possible to read cards through the link
to the IBM 4341; again, check with the consultants.
.PP
If you are forced to enter the text yourself (and most people are),
you must learn to use the Unix system, and the editor.
Fortunately, the Computer Center has provided
good documentation at the beginning stages.
Start off with the tutorial,
\fICommunicating with Unix,\fP\|
which will teach you how to use the Unix system.\**
.FS
Ricki Blau,
\fICommunicating with Unix,\fP\|
Computing Services, Berkeley (1980).
.FE
Then read through
\fIEdit: A Tutorial\fP\|
carefully, and learn how to use the \fBedit/ex\fP editor.\**
.FS
Ricki Blau and James Joyce,
\fIEdit: a Tutorial,\fP\|
Computing Services, Berkeley (1980).
.FE
At first, you will be spending most of your time inside the editor.
.PP
After you become comfortable with the editor,
it would save you time in the long run
to familiarize yourself with \fBex\fP and \fBvi\fP.
For \fBvi\fP, read
\fIAn Introduction to Display Editing with Vi,\fP\|
a tutorial introduction that is relatively easy to understand.\**
.FS
William Joy and Mark Horton,
\fIAn Introduction to Display Editing with Vi,\fP\|
EECS Dep't, Berkeley (1980).
The \fIEx Reference Manual,\fP\|
EECS Dep't, Berkeley (1980),
by the same two authors,
can be used in conjunction with the introduction to \fBvi\fR.
.FE
\fBEdit\fR is a simplified but less flexible version of \fBex\fR;
both are line-oriented command editors.
\fBVi\fR is identical to the visual mode of \fBex\fR:
a screen-oriented display editor that permits
intraline changes to be made easily and effectively.
By the time you begin correcting your text,
you should know how to use \fBvi\fP, or at least open mode of \fBex\fR.
It is far easier to make changes this way
than with the substitute command of \fBedit\fR.
.PP
When you finish correcting your text,
it would be helpful for you to learn more about the C-Shell,
which is the default command processing language at Berkeley.
The tutorial and reference document to consult is
\fIAn Introduction to the C Shell.\fP\**
.FS
William Joy,
\fIAn Introduction to the C Shell,\fP\|
EECS Dep't, Berkeley (1979).
.FE
Most of the text analysis is going to be done with programs
called from the shell.
Another good document, slightly out of date
but still very helpful, is
\fIUnix for Beginners.\fP\**
.FS
Brian Kernighan,
\fIUnix for Beginners,\fP\|
Bell Laboratories, Murray Hill (1978).
.FE
By beginner, they mean someone
who is a systems programmer on another system.
The article teaches you about programming the shell,
and covers other advanced topics.
.SH
Practical Suggestions
.PP
Before beginning text entry,
you should organize your file and directory structure.
If you are encoding a novel (or a saga),
you should probably make a directory for the novel,
and then put each chapter in a file of its own.
If you are entering an epic poem, it would be best
to have a directory for the epic, and individual files for each section.
And if you are typing in many short poems by various authors,
put each author into a separate directory,
and each poem into a file in its respective directory.
In other words, let your data determine your file structure.
Later, it will be easy to access and label your texts.
In any case, files should not be larger than 250,000 characters,
and even then they are difficult to work with.
.PP
At this point, you must decide how your text is to be numbered.
A novel, of course, should be labelled by page number,
an epic or romance by line number and perhaps section number,
and lyric poems by author, poem number, and line number.
A good rule is to follow the numbering in the best edition available.
We will discuss section numbering and author labelling later.
Line numbering is no problem,
since many Unix programs are built around line structure.
If you need page numbering, however, reserve the equals sign
if possible, because it is normally the page indicator.
Put an equals sign on a line of its own every time
there is a new page in your novel,
and follow line separation in your edition exactly;
do not hyphenate words from one line to the next, however.
The \fBkwic\fP, \fBsfind\fP, and \fBxref\fP programs will use the page indicator
to label your text with the correct page numbers.
.PP
Text in a foreign language, especially if there are diacritical marks,
requires a good deal of thought and preparation.
The first rule to follow is that an accent mark
cannot be a punctuation mark, and vice-versa,
because an accent mark is part of a word,
while a punctuation mark is not.
One typical problem is the cedilla,
which is often represented as an overstruck comma.
A similar problem is the umlaut, which is best
represented by the double quote mark.
If you have cedillas or umlauts in your text,
you will have to use one of two special programs,
\fBcedilla\fP or \fBumlaut\fP, along with special concordance options.
All of this will be explained below.
A third problem area is the use of a single quote mark,
which interferes with the apostrophe and the acute accent.
Two solutions are to use the double quote mark all the time,
or to use the grave accent for a single quote mark
(provided, of course, you don't need the grave accent).
.PP
When you have decided what is a punctuation mark
and what is a diacritical mark,
you may have to establish a punctuation file (Pfile, for example)
containing all punctuation marks that are not part of a word.
The default punctuation marks, usable for English
and some other languages, are:
.DS
,.;:-"?!()[]{}
.DE
Note that the apostrophe is considered part of a word.
If these punctuation marks are acceptable,
you do not need a punctuation file.
If you want a different punctuation set, however,
simply type your punctuation marks onto a single line in a file.
Many Humanities programs will read the file you specify
after the \-d option, and load the punctuation set
from the last line of this file.
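.PP
Suppose, for example, that you want the apostrophe
treated as a punctuation mark after all.
You might put the line
.DS
,.;:-"?!()[]{}'
.DE
into a file called \*QPfile\*U (the name is arbitrary),
and then invoke the concordance programs with something like:
.DS
kwic \-dPfile text | sort | format
.DE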
.PP
After text entry is finished,
you will begin the process of correcting the text,
often the most laborious part of your work.
There are a number of programs to help you in this endeavor,
although nothing can replace careful rereading of the text.
In order to save money and eyestrain, use \fBlpr\fP to get
a hard-copy version of your text from the lineprinter.
If you are working with modern English,
\fBspell\fP may help you to identify words
that are not in the on-line dictionary.
If you are willing to type in your text twice,
you can check one version against the other with \fBdiff\fP,
a process that will eliminate most errors.
All the preceding programs are part of the standard Unix system.
Humanities programs that may help are \fBcfreq \-a\fP,
which can alert you to the presence of strange characters,
and \fBtprep \-t\fP, which gets rid of useless trailing white space.
.SH
Background and References
.PP
Generating a concordance is one of the best uses
of a computer's capabilities that a Humanities scholar can make.
Before the 1950's, many concordances were done by hand,
and you may well imagine the time and effort that went into this.
The first large-scale computer concordance was
a concordance to the Revised Standard Version of the \fIBible,\fP\|
done in 1956 on a Univac I computer.\**
.FS
John W. Ellison,
\fINelson's Complete Concordance of the Revised Standard Version Bible,\fP\|
New York, Thomas Nelson & Sons (1957).
.FE
This pioneering concordance set a standard for intelligent design
and legibility that has seldom been matched since then.
A hand concordance to the King James Version had been done
by James Strong in the late 1800's; it took 35 years to complete.\**
.FS
James Strong,
\fIAn Exhaustive Concordance of the Bible,\fP\|
New York, Hunt & Eaton (1894).
.FE
.PP
It is important to emphasize that not all computer concordance projects
go this smoothly, or show such a dramatic improvement in efficiency.
For example, a hand concordance to Shakespeare's plays was done
by John Bartlett over a 15 year period in the late 1800's,
using \*Qonly the leisure that [his teaching] profession allowed.\*U\**
.FS
John Bartlett,
\fIA Complete Concordance to the Dramatic Works of Shakespeare,\fP\|
New York, Macmillan & Co (1896).
.FE
A computer-generated concordance to Shakespeare done at Harvard
University, by contrast, took nearly as long to complete.
It was started in the days before their computer recognized
upper and lower case distinctions, and consequently,
Shakespeare's text, as concorded, contains no capital letters!
This work was done on an IBM 360/50 computer.
A preliminary version was published by a German press in 1969,
but the final typeset version was not ready until 1973.\**
.FS
Marvin Spevack,
\fIThe Harvard Concordance to Shakespeare,\fP\|
Cambridge Mass, Harvard University Press (1973).
.FE
.PP
Since the pioneering computer concordance by Ellison,
a plethora of concordances has been published,
some better than others, and some almost unusable.
Much work has remained, for lack of money and interest,
in the form of semi-legible lineprinter output.
This has resulted in a kind of academic provincialism,
since these stacks of lineprinter output are seldom shared.
One viable alternative to this situation would be
to make and distribute microfilm concordances,
drastically reducing the cost of paper, printing, and mailing.
.PP
There are, of course, many other ways besides the concordance
to make use of the computer for literary studies.
Susan Hockey's recent book gives a good introduction to
the various possibilities for computer research in the Humanities.\**
.FS
Susan Hockey,
\fIA Guide to Computer Applications in the Humanities,\fP\|
Baltimore and London, Johns Hopkins University Press (1980).
.FE
She gives little practical advice, but does very well at
pointing out the current limitations of the computer.
James Joyce has written a provocative article on
poetry generation and analysis by computer, which combines
the intuitive and analytical approaches to the subject.\**
.FS
James Joyce, \*QPoetry Generation and Analysis,\*U in
\fIAdvances in Computers,\fP\|
vol. 13, pp. 43-72 (1975).
.FE
Finally, there is a good article from Bell Labs
on statistical text processing, which describes
some linguistic research done on their Unix system.\**
.FS
L. E. McMahon, L. L. Cherry, and R. Morris,
\*QStatistical Text Processing,\*U
in \fIThe Bell System Technical Journal,\fP\|
vol. 57, no. 6, pt. 2, pp. 2137-54 (1978).
.FE
.SH
Making a Concordance
.PP
Before you start using the special package of Humanities programs,
you will need to set your pathname properly, since
the programs all reside in the directory \*Q~hum/bin\*U.
The best way to do this is to have a \*Q.cshrc\*U file
something like the following:
.DS
set path = ~hum/bin
set path = (/usr/cc/bin /usr/ucb/bin $path .\|)
set history = 20
set noclobber
.DE
In addition to setting your pathname, this will give you
a history list for performing repeat commands,
and will prevent certain kinds of unintentional file destruction.
After creating the .cshrc file, type the command
\fBsource .cshrc\fP in order to activate the new settings.
Now you can obtain an index to the concordance package
by typing \fBhuman index\fP;
manual pages for all the programs are available
through the same command.
.PP
How you use the concordance package
will be largely determined by how big your text is,
and by what kind of concordance you want.
There are two basic styles available, the
Key Word In Context concordance, and the
Key Word And Line concordance
(hence the names \fBkwic\fP and \fBkwal\fP).
At other computer installations,
a \fBkwal\fP-like program might be called \*Qkwoc\*U or \*Qconc\*U.
To try them out, take a short text, and type these two commands:
.DS
kwic text | sort | format
kwal text | sort | format
.DE
Many people use the \fBkwal\fP program for poetry, and the
\fBkwic\fP program for prose, but if you are doing formulaic analysis,
you will probably want to use the \fBkwic\fP program for poetry as well.
If you decide to use \fBkwic\fP, learn to use \fBtprep\fP,
since your concordance will look much better
if the end of line is trimmed and the beginning of line is padded.
.PP
If you have a text that is fairly long,
you may want to concord it in the background,
while you do other work.
With background processing,
you should always redirect output to a temporary file,
where you can examine (and perhaps edit) the results
before sending your concordance to the lineprinter.
Use a command something like this:
.DS
kwic text* | sort | format > /tmp/final &
.DE
The asterisk expands to mean any file beginning with text,
so if your text has several parts, this will put them all
together, in the same order as they are listed with \fBls\fP.
The greater-than symbol redirects output to the tempfile,
and the ampersand puts the process in background.
If you logout before the process is completed,
then all your background processes will be halted,
and you will have a partial concordance, or none at all.
(You can wait for background processes to terminate
by typing \*Qwait\*U before logging out.)
.PP
If your text is extremely long, or if you want to go home,
you may want to submit the concordance,
so that you can logout without ending the concording process.
Submit processing is charged at the rate of $1.00 per processor minute.
Use the following command:
.DS
submit "kwic text* | sort | format > /tmp/final"
.DE
The quotes are necessary, to remove the magic
of the pipe and redirect symbols.
A few hours later, or the next morning,
you can return to examine the temporary file,
and if everything is all right, you can send it to the lineprinter.
Make sure to examine the end of your output file with \fBtail\fP;
your concordance should end with the entries for X, Y, and Z.
.PP
\fBKwic\fP and \fBkwal\fP have many options,
and you will probably use some or all of them.
They are fully explained in the manual section,
but a thumbnail sketch is also given here.
The \-k option sets the keyword length;
you should run the \fBmaxwd\fP program on all your texts,
to determine the length of your longest word.
If it is longer than 15, reset the keyword length accordingly.
The \-w option can be used to label the concordance
as to work or author, and the \-f can be used
to label it for poem number or chapter.
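.PP
For example, if \fBmaxwd\fP were to report a longest word
of 18 characters, a pair of commands like the following
would widen the keyword field accordingly:
.DS
maxwd text
kwic \-k18 text | sort | format
.DE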
.PP
Line numbering is done automatically,
but can be affected by using the \-l and \-r options.
The \fBkwic\fP program will also do page numbering,
which can be controlled with the \-p and \-i options.
The \fBkwal\fP program has the esoteric \-s and \-x options instead,
to skip over a text-embedded lefthand identification field,
and to suppress automatic linenumbering.
It is possible to use \fBlno\fP to double linenumber
or hemistich number your text,
and then use \-s and \-x when compiling the concordance.
You may have some other system for labelling your lines.
Older text entered on IBM cards often has the embedded id field.
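.PP
A card-derived text with such an id field, for instance,
might be concorded with something like this:
.DS
kwal \-s \-x text | sort | format
.DE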
.PP
With the \fBkwic\fP program, the most popular option
is the \-c to set context width.
The default context width is 50 characters,
that is to say, 25 on either side of the keyword.
This width is suitable for the CRT terminal,
but to fill up a page from the lineprinter, a width
of 100 to 120 is better, depending on the \-w and \-f options.
The \fBkwal\fP program, by contrast, gives context on a line by line basis,
so if you have some extremely long lines,
they may be too wide for the page.
The \-d option, to read a punctuation file, was explained above.
The + option is used along with the \fBcedilla\fP and \fBumlaut\fP programs.
The \fBkwic\fP and \fBkwal\fP programs can both read from standard input,
if you use a hyphen instead of a filename, but this is not recommended.
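.PP
For lineprinter output, then, you might ask for
a wider context with something like:
.DS
kwic \-c110 text | sort | format > /tmp/final &
.DE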
.PP
Ordinarily, sorting is done from the beginning of the line,
so that the primary sort field is the keyword,
and the secondary sort field is the line or page number.
With \fBkwic\fP, if you want to sort by context rather than by text location,
you will have to call the \fBsort\fP program as follows:
.DS
kwic text* | sort \-dft\e| +3 +1 | format
.DE
The \-d means dictionary order, i.e., ignore punctuation marks.
The \-f means to fold upper case to lower case, i.e., ignore capitals.
The \-t indicates that the tab character between fields is a
vertical bar (the backslash takes away its magic meaning).
The +3 makes the keyword and its following context
into the primary sort field, and +1 makes
the linenumber (or page number) into the secondary sort field.
The results of \fBsort\fP are then piped to \fBformat\fP.
If you are doing formulaic analysis, you will want to sort this way.
Make sure to use \fBtprep\fP if you do this kind of sorting.
.PP
If you are planning to publish your concordance,
or even if other people want to peruse it, you should read
William Ingram's article on concordance style.\**
.FS
William Ingram, \*QConcordances for the Seventies,\*U
in \fIComputers and the Humanities,\fP\|
vol. 13, pp. 43-72 (1975).
.FE
Concordances for the eighties will be much like those of the seventies.
Ingram gives many practical suggestions, and points out some
ridiculous errors made in several recently published concordances.
.SH
Concordance Modules
.PP
The concordance package comes with a number of independent modules,
in order to attain acceptable flexibility and efficiency.
With these modules, you can exclude unnecessary words,
create a reverse concordance,
put the concordance into separate alphabetical files,
lemmatize, and format the output.
The \fBsort\fP program may be considered a module,
although it is not technically part of the Humanities package.
For convenience, these modules may be divided into two types:
those used before \fBsort\fP, and those used afterwards.
.PP
The \fBexclude\fP program reads words in an ignore file
(\*Qexclfile\*U by default),
and filters out concordance entries with these keywords.
It can also be used with an only file, to filter out
all concordance entries except those with the desired keywords.
\fBExclude\fP should immediately follow \fBkwic\fP or \fBkwal\fP,
in order to reduce the amount of output going to later modules,
especially \fBsort\fP, which is the most expensive module.
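.PP
A typical pipeline with exclusion, using the default
\*Qexclfile\*U, looks like this:
.DS
kwic text | exclude | sort | format
.DE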
.PP
The \fBrevconc\fP program reverses the keyword,
so that the concordance can be sorted by word endings
rather than by word beginnings.
\fBRevconc\fP should be called both before and after the \fBsort\fP program,
unless you want your words backwards in the final concordance.
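.PP
A reverse concordance, then, is produced by a pipeline like this one:
.DS
kwal text | revconc | sort | revconc | format
.DE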
.PP
If you are using the + option of \fBkwic\fP or \fBkwal\fP,
to indicate cedillas or umlauts,
you will need to filter your output with \fBcedilla\fP or \fBumlaut\fP.
These programs replace the plus sign with a backspace
and the appropriate accent mark.
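.PP
A text with umlauts coded as plus signs, for example,
might be concorded as follows:
.DS
kwic + text | sort | umlaut | format
.DE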
.PP
The \fBformat\fP program counts and removes repeated occurrences of keywords,
and creates keyword section headings for the concordance.
If desired, keyword counting can be suppressed,
as can mapping of the keyword to upper case,
and printing of the section heading itself.
\fBFormat\fP is generally the last module used before the output
is sent to a temporary file or to the lineprinter.
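.PP
To leave keywords in their original case
and suppress the frequency count, for instance, you might type:
.DS
kwal text | sort | format \-mc
.DE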
.PP
If you were to use all of the modules above,
which is quite unlikely, this would be the proper order
of the command line:
.DS
kwic + text | exclude | revconc | sort | revconc | umlaut | format
.DE
Of course, after all that work, it would be good to redirect the
output to a temporary file somewhere.
.PP
For large concordance projects,
you will want to reduce the size of the output files,
so you can edit and print them more easily.
The \fBdict\fP program sends concordance entries to their
proper dictionary file, from A to Z
(actually in ASCII sequence from blank to tilde).
This is useful for large concordances that you must edit by hand.
\fBDict\fP should be used in place of \fBformat\fP;
concordances can be formatted when they are finally printed.
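.PP
For example, to split a sorted concordance into files
named WebsterA through WebsterZ
(\*QWebster\*U being an arbitrary root name for the output files):
.DS
kwic text* | sort | dict \- Webster
.DE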
.PP
The \fBlemma\fP program takes specified words and positions them
in the proper order after their specified headword.
This is for keeping inflected words together in one location.
\fBLemma\fP should be called only after the \fBdict\fP program
has separated your concordance into alphabetical sections.
It only works on groups of files created by \fBdict\fP\|.
.PP
Generally, it is considered bad style for a concordance
to contain multiple entries when the same word occurs twice on a line.
For example, Shakespeare's line, \*QTomorrow, tomorrow, and tomorrow,\*U
when run through \fBkwal\fR, will produce three identical entries.
\fBKwic\fR will produce three slightly different entries,
but two of them will be largely redundant.
In order to avoid such redundancy,
use the Unix utility \fBuniq\fP as follows:
.DS
kwal text* | sort | uniq | format
kwal text* | sort \-u | format
.DE
The second line has exactly the same effect, and is somewhat faster.
It is more difficult to remove redundant lines
with \fBkwic\fP than with \fBkwal\fP,
because \fBuniq\fP will not ignore differing context fields.
However, it is possible to write an \fBawk\fP program
that should work for this purpose.
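.PP
One such \fBawk\fP program, assuming as in the sorting example above
that the keyword is the first field and the linenumber the second,
might discard an entry whenever both fields repeat:
.DS
kwic text | sort | awk \-F'|' '$1 != kw || $2 != ln { print } { kw = $1; ln = $2 }' | format
.DE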
.PP
At the current time, no facilities are provided
for making interlinear concordances;
it is expected that they will be provided in the future.
.SH
Other Programs for Textual Analysis
.PP
There are other methods of textual analysis besides the concordance,
and programs for some of these methods are available.
One common method is the word frequency count,
which can be accomplished with the \fBfreq\fP program.
Ordinarily, \fBfreq\fP will give frequency counts of words in your text,
organized into alphabetical order.
With the \-n option, it will give these words organized
by order of frequency, with the most common words first.
If you run out of core with \fBfreq\fP, the same thing can
be done, but much more slowly, with this command:
.DS
wheel +1 text | sort | uniq \-c
.DE
You will probably want to use \fBfreq\fP in conjunction
with \fBpr \-n\fP, to create n-column output.
See the manual page for \fBpr\fP(1) in the
\fIUnix Programmer's Manual\fP\| for details.
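.PP
For example, to list the most common words first, in three columns:
.DS
freq \-n text | pr \-3
.DE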
.PP
Some linguistics students may be interested in
the relative frequency of different characters;
the \fBcfreq\fP program may help them out.
\fBCfreq\fP can also count the frequency of digraphs in a text.
Comparison of word lengths may also be useful,
for which purpose there is a \fBwdlen\fP program,
which prints out a histogram along with word length frequencies.
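.PP
To count digraphs rather than single characters, for instance:
.DS
cfreq \-d text
.DE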
.PP
Most good concordances nowadays are published along with
a word frequency list (in numerical order),
and a reverse list of the graphic forms.
This is simply a list of all the different words in the text,
sorted from the end to the beginning, rather than vice-versa.
It is far less elaborate than the reverse concordance,
but serves many of the same purposes.
To create such a list, use the following program sequence:
.DS
wheel +1 text | rev | sort \-u | rev
.DE
\fBWheel +1\fP gives a word list, \fBrev\fP reverses each word,
and the \-u option of \fBsort\fP suppresses all but one copy of
each identical line, rather than giving you useless repeated lines.
Finally \fBrev\fP restores normal order to all the words.
Again, \fBpr \-n\fP may be of some use in creating multi-column output.
.PP
Syntactic analysis by computer is more difficult than lexical analysis,
but the \fBwheel\fP program may be of some help for the former endeavor.
It rolls through the text several words at a time,
printing word clusters of arbitrary size on each line.
These syntactic clusters can then be sorted and counted,
using \fBsort\fP and \fBuniq\fP, in order to find repeated patterns.
.DS
wheel +3 text | sort | uniq \-c | sort \-r
.DE
The above command, for example, yields a count
of three-word syntactic clusters,
with the most frequent ones heading the list.
.PP
Pattern searching is another way to study syntactic patterns
of a text with the aid of a computer.
The Unix utility \fBgrep\fP is useful in this context,
but it is line-oriented, and many literary texts
cannot be studied on a line by line basis.
So there is a program called \fBsfind\fP, which searches
through a text sentence by sentence for a specified pattern.
The entire sentence is printed, along with the matching pattern.
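.PP
A search for a phrase might look something like this,
although you should check the manual page for the exact argument syntax:
.DS
sfind 'wine-dark sea' text*
.DE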
.PP
If you do not need a full-length concordance,
or if you just want a simple index that gives word locations,
there is a cross reference generator named \fBxref\fP.
It will index distinct words in a text by line number,
or by page number, depending on how your text is labelled.
It is slower than \fBfreq\fP, but much faster than a concordance series.
Moreover, its output will be relatively compact.
In the final concordance, you may want
to have some common words listed with \fBxref\fP indexing,
and more important words listed with full concordance entries.
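.PP
A compact index, possibly in multiple columns, can be
produced with something like:
.DS
xref text* | pr \-2
.DE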
.PP
After doing a bit of work with the computer,
you may feel constrained by not knowing a programming language.
Many things are extremely difficult to accomplish
if you do not know how to program;
but with a little programming knowledge,
these tasks become trivial.
If you are so inclined, you ought to learn \fBawk\fP.
It has the simplicity of \fBbasic\fP,
the string handling capabilities of \fBsnobol\fP,
and the economy of expression found in \fBapl\fP.
The tutorial, \fIAwk \(em A Pattern Scanning and Processing Language,\fP\|
should be enough to get you started.\**
.FS
Alfred Aho, Brian Kernighan, and Peter Weinberger,
\fIAwk \(em A Pattern Scanning and Processing Language,\fP\|
Bell Laboratories, Murray Hill (1978).
.FE
.SH
Related Utilities
.PP
For times when you have to change a print wheel or type ball,
there is the \fBpause\fP program for temporarily halting output.
\fBPause\fP may also be helpful for generating form letters or mailing lists,
since it gives you time to change paper or envelopes.
.PP
For those who work with textual variants,
the \fBpair\fP program may be of some use.
\fBPair \-m\fP intercalates two files line by line,
while \fBpair\fP sets two files side by side.
Different manuscript versions can be compared
by means of this program.
It is possible to make two textual variants parallel
by inserting blank lines in one variant;
then \fBdiff\fP can be used to collate the manuscripts.
(There is even a \fBdiff3\fP program for examining three texts.)
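.PP
For example, with two variant files called version1 and version2,
you might type one of the following:
.DS
pair version1 version2
pair \-m version1 version2
.DE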
.PP
Many novice Unix users wonder how they can linenumber a text,
as with the \*Qset number\*U option of \fBex/vi\fP.
There are several ways, but the best approach
for Humanities users is the \fBlno\fP program.
It allows you to set the linenumber
to start at whatever value you want,
and also supports double line numbering and hemistich numbering.
.PP
If your final copy is to be lineprinted, you may want to use
\fBtolpr\fP to shift your text to the right, away from the holes.
It is possible to do extra formatting on your concordance,
and even to print your concordance using the phototypesetter.
If you want to use the typesetter,
use \fBtroffmt\fP instead of \fBformat\fP,
and send the resulting output to \fBtroff \-Q\fP.
It is not necessary to use the \-ms macros.
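.PP
A typeset concordance, then, is produced with a pipeline like this one:
.DS
kwic text | sort | troffmt | troff \-Q
.DE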
.PP
If you intend to transport tapes to or from another installation,
especially an IBM-compatible machine,
a number of programs in \*Q/usr/inp/bin\*U may be helpful.
For importing text encoded in fixed length format,
the \fBline\fP program can be used to insert missing newline characters;
\fBtprep\fP can then be used to strip off trailing blanks.
For exporting text, the \fBfixlen\fP program can be used to transform
variable length Unix lines into IBM-style fixed length records.
For rearranging fields of a record
(for instance, to move linenumbers from the last few columns
to the first few columns), use the \fBpermute\fP program.
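.PP
Importing an IBM tape, for example, might involve commands
something like these (check the manual pages for the exact
record-length arguments):
.DS
line tapefile > text
tprep \-t text > text.new
.DE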
.SH
Acknowledgements
.PP
The author would like to thank Professors
Joseph J. Duggan and John Lindow for research funding
during the early stages of this project.
James Joyce gave advice on the general design of this package,
and Brad Rubenstein and Bob Campbell helped immeasurably
with much of the actual programming.
The project was finally completed with the financial support
of a grant from the National Endowment for the Humanities.
@@@ Fin de Hum.tutorial
echo Makefile
cat >Makefile <<'@@@ Fin de Makefile'
BIN=
all: $(BIN)accent $(BIN)cfreq $(BIN)dict $(BIN)exclude $(BIN)format\
	$(BIN)freq $(BIN)kwal $(BIN)kwic $(BIN)lno $(BIN)med\
	$(BIN)maxwd $(BIN)pair $(BIN)pause $(BIN)revconc $(BIN)sfind\
	$(BIN)skel $(BIN)togrk $(BIN)tolpr $(BIN)tosel $(BIN)tprep\
	$(BIN)troffmt $(BIN)wdlen $(BIN)wheel $(BIN)xref

clean:
	rm -f $(BIN)accent $(BIN)cfreq $(BIN)dict $(BIN)exclude $(BIN)format
	rm -f $(BIN)freq $(BIN)kwal $(BIN)kwic $(BIN)lno $(BIN)med
	rm -f $(BIN)maxwd $(BIN)pair $(BIN)pause $(BIN)revconc $(BIN)sfind
	rm -f $(BIN)skel $(BIN)togrk $(BIN)tolpr $(BIN)tosel $(BIN)tprep
	rm -f $(BIN)troffmt $(BIN)wdlen $(BIN)wheel $(BIN)xref
	rm -f $(BIN)cedilla $(BIN)umlaut

$(BIN)accent: accent.c
	cc accent.c -O -s -o $(BIN)accent
	ln $(BIN)accent $(BIN)cedilla
	ln $(BIN)accent $(BIN)umlaut

$(BIN)cfreq: cfreq.c
	cc cfreq.c -O -s -o $(BIN)cfreq

$(BIN)dict: dict.c
	cc dict.c -O -s -o $(BIN)dict

$(BIN)exclude: exclude.c
	cc exclude.c -O -s -o $(BIN)exclude

$(BIN)format: format.c
	cc format.c -O -s -o $(BIN)format

$(BIN)freq: freq.c	# separate i&d space
	cc freq.c -O -s -o $(BIN)freq

$(BIN)kwal: kwal.c
	cc kwal.c -O -s -o $(BIN)kwal

$(BIN)kwic: kwic.c
	cc kwic.c -O -s -o $(BIN)kwic

$(BIN)lno: lno.c
	cc lno.c -O -s -o $(BIN)lno

$(BIN)med: med.c
	cc med.c -O -s -o $(BIN)med -lcurses -ltermcap

$(BIN)maxwd: maxwd.c
	cc maxwd.c -O -s -o $(BIN)maxwd

$(BIN)pair: pair.c
	cc pair.c -O -s -o $(BIN)pair

$(BIN)pause: pause.c
	cc pause.c -O -s -o $(BIN)pause

$(BIN)revconc: revconc.c
	cc revconc.c -O -s -o $(BIN)revconc

$(BIN)sfind: sfind.c
	cc sfind.c -O -s -o $(BIN)sfind

$(BIN)skel: skel.c
	cc skel.c -O -s -o $(BIN)skel

$(BIN)togrk: togrk.c
	cc togrk.c -O -s -o $(BIN)togrk

$(BIN)tolpr: tolpr.c
	cc tolpr.c -O -s -o $(BIN)tolpr

$(BIN)tosel: tosel.c
	cc tosel.c -O -s -o $(BIN)tosel

$(BIN)tprep: tprep.c
	cc tprep.c -O -s -o $(BIN)tprep

$(BIN)troffmt: troffmt.c
	cc troffmt.c -O -s -o $(BIN)troffmt

$(BIN)wdlen: wdlen.c
	cc wdlen.c -O -s -o $(BIN)wdlen

$(BIN)wheel: wheel.c
	cc wheel.c -O -s -o $(BIN)wheel

$(BIN)xref: xref.c
	cc xref.c -O -s -o $(BIN)xref
@@@ Fin de Makefile
echo ORIGIN
cat >ORIGIN <<'@@@ Fin de ORIGIN'
Unix-From: sun!cairo!tut Thu Oct 10 20:50:28 1985
Date: Thu, 10 Oct 85 14:31:16 PDT
From: sun!cairo!tut (Bill Tuthill)
Message-Id: <8510102131.AA08577@cairo.sun.uucp>
To: gnu@l5.uucp
Subject: Re: Hum tutorial
Date: Fri, 19 Jun 87 16:45:45 PDT
From: sun!tut (Bill Tuthill)
Message-Id: <8706192345.AA04406@cairo.sun.uucp>
To: hoptoad!gnu
Subject: Re: Hum programs
It's fine with me if you submit Hum to Rich $alz for mod.sources.
How did he get such a $trange letter in his la$t name?
@@@ Fin de ORIGIN
echo README
cat >README <<'@@@ Fin de README'
Here is the source code for the "hum" concordance package.
To compile all the programs, just type "make", and the programs will
all be compiled, one by one, and put in "../bin". You can change
the destination bin if you like (for example, to "/usr/hum/bin") by
changing the first line in the "Makefile". All the programs here
are self-contained, requiring no special libraries, and having no
dependencies. If you do extensive revising, especially on the longer
programs, you will probably want to split them up, and change the
"Makefile" accordingly.
Several programs here are new and undocumented. The "togrk"
program employs Wes Walton's Greek transcription system to convert
English letters to Greek character strings for "nroff/troff". The
"troffmt" program replaces "format" for the phototypesetter, and
requires the macro package "../tmac/trmacs" to work correctly.
It has not been fully debugged.
The old "lemma" program was not good enough to distribute.
It is currently in the process of revision so it can be used along
with "dict". Disk seeks and reads will replace sequential I/O.
If you need the new "lemma" program, contact Bill Tuthill for a
software update.
@@@ Fin de README
echo accent.c
cat >accent.c <<'@@@ Fin de accent.c'
# include <stdio.h>			/* accent.c (rev3.7) */

char mode = 'a';		/* accent mode: a=file c=cedilla u=umlaut */
char accfile[15] = "accfile";	/* file containing accent mark definitions */

usage()		/* print proper usage and exit */
{
	fprintf(stderr, "Usage: accent [-a accfile] [filename(s)]\t(rev3.7)\n");
	fprintf(stderr, "\taccent mark definitions read from accfile\n");
	exit(1);
}

main(argc, argv)	/* user-controlled module for accent marks */
int argc;
char *argv[];
{
	FILE *fopen(), *fp;
	int n, i;

	n = strlen(argv[0]);
	/* umlaut */
	if (argv[0][n-2] == 'u' && argv[0][n-1] == 't')
		mode = 'u';
	/* cedilla */
	if (argv[0][n-2] == 'l' && argv[0][n-1] == 'a')
		mode = 'c';
	i = 1;
	if (mode == 'a' && argc > 1 && argv[i][0] == '-')
	{
		if (argv[i][1] == 'a' && argc >= 3)
		{
			strcpy(accfile, argv[++i]);
			argc -= 2;
			i++;
		}
		else
			usage();
	}
	if (mode == 'a')
		getacc(accfile);
	fp = stdin;
	do {
		if (argc > 1 && (fp = fopen(argv[i], "r")) == NULL)
		{
			fprintf(stderr, "%s: can't access file %s\n",
				argv[0], argv[i]);
			exit(1);
		}
		else {
			if (mode == 'a')
				accent(fp);
			else	/* c or u mode */
				ced_um(fp);
			fclose(fp);
		}
	} while (++i < argc);
	exit(0);
}

char from[BUFSIZ/8];	/* accent marks placed in text */
char into[BUFSIZ/8];	/* accent marks to be output */

getacc(file)		/* retrieve accent mark definitions */
char *file;
{
	FILE *afp, *fopen();
	char str[BUFSIZ/16];
	register int n;

	if ((afp = fopen(file, "r")) == NULL)
	{
		fprintf(stderr, "can't find accent file: %s\n", file);
		exit(1);
	}
	for (n = 0; fgets(str, BUFSIZ/16, afp) && n < BUFSIZ/8; n++)
	{
		if (strlen(str) == 4)
		{
			from[n] = str[0];
			into[n] = str[2];
		}
		else {
			fprintf(stderr,
				"syntax error in line %d of %s\n", n+1, file);
			fprintf(stderr,
				"usage: <fromchar> <space> <intochar>\n");
			exit(1);
		}
	}
	from[n] = into[n] = NULL;
}

accent(fp)		/* change accent marks into <BS><acc> */
FILE *fp;
{
	register int c, n;

	while ((c = getc(fp)) != EOF)
	{
		n = position(from, c);
		if (n >= 0)
		{
			putchar('\b');
			putchar(into[n]);
		}
		else
			putchar(c);
	}
}

ced_um(fp)		/* change + into cedilla: <BS>, or umlaut: <BS>" */
FILE *fp;
{
	register int c;

	while ((c = getc(fp)) != EOF)
	{
		if (c == '+')
		{
			putchar('\b');
			if (mode == 'c')
				putchar(',');
			if (mode == 'u')
				putchar('"');
		}
		else
			putchar(c);
	}
}

position(str, c)	/* return location of c in str, -1 if not */
char str[], c;
{
	register int i;

	for (i = 0; str[i]; i++)
		if (str[i] == c)
			return(i);
	return(-1);
}
@@@ Fin de accent.c
echo cfreq.c
cat >cfreq.c <<'@@@ Fin de cfreq.c'
# include <stdio.h>			/* cfreq.c (rev 3.7) */
# include <ctype.h>

char *chars[512] =
{
	"^@", "^A", "^B", "^C", "^D", "^E", "^F", "^G",
	"^H", "^I", "^J", "^K", "^L", "^M", "^N", "^O",
	"^P", "^Q", "^R", "^S", "^T", "^U", "^V", "^W",
	"^X", "^Y", "^Z", "^[", "^\\", "^]", "^^", "^_",
	" ", "!", "\"", "#", "$", "%", "&", "'",
	"(", ")", "*", "+", ",", "-", ".", "/",
	"0", "1", "2", "3", "4", "5", "6", "7",
	"8", "9", ":", ";", "<", "=", ">", "?",
	"@", "A", "B", "C", "D", "E", "F", "G",
	"H", "I", "J", "K", "L", "M", "N", "O",
	"P", "Q", "R", "S", "T", "U", "V", "W",
	"X", "Y", "Z", "[", "\\", "]", "^", "_",
	"`", "a", "b", "c", "d", "e", "f", "g",
	"h", "i", "j", "k", "l", "m", "n", "o",
	"p", "q", "r", "s", "t", "u", "v", "w",
	"x", "y", "z", "{", "|", "}", "~", "^?",
};

struct tnode		/* binary tree for digraphs and count */
{
	char *word;
	int count;
	struct tnode *left;
	struct tnode *right;
};

long count[128];	/* count of individual characters */
long total;		/* total number of characters */
char printable = 0;	/* toggle printable chars option */
char ascii = 0;		/* toggle all ascii chars option */
char digraph = 0;	/* toggle digraph frequency count */
char nomap = 0;		/* turn off digraph mapping to lower case */

main(argc,argv)		/* count frequency of characters or digraphs */
int argc;
char *argv[];
{
	FILE *fopen(), *fp;
	struct tnode *tree(), *root;
	char dgraphs[3];
	int i;

	root = NULL;
	dgraphs[2] = NULL;
	if (argc == 1)
	{
		puts("Usage: cfreq [-p -a -d -m -] filename(s)\t(rev3.7)");
		puts("-p: list all printable characters (blank - '~')");
		puts("-a: list all ascii characters (null - delete)");
		puts("-d: count digraphs rather than single characters");
		puts("-m: disable mapping of digraphs to lower case");
		puts("- : read standard input instead of files");
		exit(1);
	}
	for (i = 1; i < argc; i++)
	{
		if (*argv[i] == '-')
			getflag(argv[i]);
		else if ((fp = fopen(argv[i], "r")) != NULL)
		{
			if (digraph)
				while (dfreq(dgraphs, fp))
					root = tree(root, dgraphs);
			else
				cfreq(fp);
			fclose(fp);
		}
		else	/* cannot open file */
		{
			fprintf(stderr,
				"Cfreq cannot access the file: %s\n", argv[i]);
			exit(1);
		}
	}
	if (digraph)
	{
		printf("Digraph: Freq:\n");
		treeprint(root);
	}
	else
		pr_results();
	exit(0);
}

getflag(f)		/* parses command line to set options */
char *f;
{
	struct tnode *tree(), *root;
	char dgraphs[3];

	f++;
	switch(*f)
	{
	case 'p':
		printable = 1;
		break;
	case 'a':
		ascii = 1;
		break;
	case 'd':
		digraph = 1;
		break;
	case 'm':
		nomap = 1;
		break;
	case NULL:
		root = NULL;
		dgraphs[2] = NULL;
		if (digraph)
		{
			while (dfreq(dgraphs, stdin))
				root = tree(root, dgraphs);
			printf("Digraph: Freq:\n");
			treeprint(root);
			exit(0);
		}
		else
			cfreq(stdin);
		break;
	default:
		fprintf(stderr, "Invalid cfreq flag: -%s\n", f);
		exit(1);
		break;
	}
}

cfreq(fp)		/* run through files counting characters */
FILE *fp;
{
	register int c;

	while ((c = getc(fp)) != EOF)
	{
		count[(int)c]++;
		total++;
	}
}

pr_results()		/* print out table of character counts */
{
	int i;

	printf("Char:\t Freq:\n");
	if (printable)
		for (i = 32; i < 127; i++)
			printf("%s\t%6ld\n", chars[i], count[i]);
	else if (ascii)
		for (i = 0; i < 128; i++)
			printf("%s\t%6ld\n", chars[i], count[i]);
	else
	{
		for (i = 65; i < 91; i++)
			printf("%s\t%6ld\n", chars[i], count[i]);
		for (i = 97; i < 123; i++)
			printf("%s\t%6ld\n", chars[i], count[i]);
	}
	printf("Total:\t%6ld\n", total);
}

dfreq(dgraphs,fp)	/* drives program through text by digraphs */
char dgraphs[];
FILE *fp;
{
	register int c;

	if ((c = getc(fp)) != EOF)
	{
		if (c == '\n' || c == '\t')
			dgraphs[0] = ' ';
		else if (!nomap && isupper(c))
			dgraphs[0] = tolower(c);
		else dgraphs[0] = c;
	}
	else return(0);
	if ((c = getc(fp)) != EOF)
	{
		if (c == '\n' || c == '\t')
			dgraphs[1] = ' ';
		else if (!nomap && isupper(c))
			dgraphs[1] = tolower(c);
		else dgraphs[1] = c;
	}
	else return(0);
	ungetc(c, fp);
	return(1);
}

struct tnode *tree(p, w)	/* build tree beginning at root */
struct tnode *p;
char *w;
{
	struct tnode *talloc();
	char *strsave();
	int cond;

	if (p == NULL)
	{
		p = talloc();
		p->word = strsave(w);
		p->count = 1;
		p->left = p->right = NULL;
	}
	else if ((cond = strcmp(w, p->word)) == 0)
		p->count++;
	else if (cond < 0)
		p->left = tree(p->left, w);
	else	/* if cond > 0 */
		p->right = tree(p->right, w);
	return(p);
}

treeprint(p)		/* print out contents of binary tree */
struct tnode *p;
{
	if (p != NULL)
	{
		treeprint(p->left);
		printf("%s\t %5d\n", p->word, p->count);
		treeprint(p->right);
	}
}

struct tnode *talloc()	/* core allocator for tree */
{
	struct tnode *p;
	char *malloc();

	if ((p = ((struct tnode *)malloc(sizeof(struct tnode)))) != NULL)
		;	/* will return */
	else	/* if (p == NULL) */
		overflow();
	return(p);
}

char *strsave(s)	/* allocate space for string of text */
char *s;
{
	char *p, *malloc(), *strcpy();

	if ((p = malloc((unsigned)(strlen(s)+1))) != NULL)
		strcpy(p, s);
	else	/* if (p == NULL) */
		overflow();
	return(p);
}

overflow()		/* exit gracefully in case of core overflow */
{
	fprintf(stderr,
	"Cfreq: no more core available (maximum on PDP is 64K bytes).\n");
	exit(1);
}
@@@ Fin de cfreq.c
echo dict.c
cat >dict.c <<'@@@ Fin de dict.c'
# include <stdio.h>			/* dict.c (rev3.7) */

main(argc, argv)	/* split file into dictionary sections */
int argc;
char *argv[];
{
	FILE *fp, *fopen();
	char *strcpy(), root[15];	/* root name of outfile */

	if (argc == 1 || argc > 3)
	{
		puts("Usage: dict [-]filename [outroot]\t\t(rev3.7)");
		puts("- : read standard input rather than file");
		exit(1);
	}
	if (argc == 2)
		strcpy(root, "X");
	if (argc == 3)
		strcpy(root, argv[2]);
	if (argv[1][0] == '-' && argv[1][1] == NULL)
		dict(stdin, root);
	else if ((fp = fopen(argv[1], "r")) != NULL)
	{
		dict(fp, root);
		fclose(fp);
	}
	else	/* can't open input file */
	{
		fprintf(stderr,
			"Dict cannot access the file: %s\n", argv[1]);
		exit(1);
	}
	exit(0);
}

dict(fp, root)		/* write each letter to separate file */
FILE *fp;
char *root;
{
	FILE *sp, *fopen();
	char s[BUFSIZ], fname[20], ch = 'A';
	int first = 1, len;

	sp = NULL;
	while (fgets(s, BUFSIZ, fp))
	{
		if (s[0] != ch)		/* new letter found */
		{
			ch = s[0];
			first = 1;
		}
		if (!first && s[0] == ch)	/* same letter */
		{
			fputs(s, sp);
			if (ferror(sp))
			{
				perror(fname);
				exit(1);
			}
		}
		if (first)	/* first encounter with this letter */
		{
			strcpy(fname, root);	/* derive filename */
			len = strlen(fname);
			fname[len] = ch;
			fname[++len] = NULL;
			if (sp != NULL)
				fclose(sp);
			if ((sp = fopen(fname, "a")) == NULL)
			{
				perror(fname);
				exit(1);
			}
			fputs(s, sp);
			if (ferror(sp))
			{
				perror(fname);
				exit(1);
			}
			first = 0;
		}
	}
}
@@@ Fin de dict.c
echo exclude.c
cat >exclude.c <<'@@@ Fin de exclude.c'
# include <stdio.h>			/* exclude.c (rev3.7) */
# include <ctype.h>
# define MXWDS 500

char mode = 'i';	/* default is ignore mode, not only mode */

usage()		/* print proper usage and exit */
{
	puts("Usage: exclude [-i -o exclfile] [filename(s)]\t(rev3.7)");
	puts("-i: exclfile contains words to be ignored, one per line");
	puts("-o: exclfile has only words to be printed, one per line");
	puts("With no options, excluded words should be in \"exclfile\".");
	exit(1);
}

main(argc, argv)	/* exclude common words from concordance */
int argc;
char *argv[];
{
	FILE *fp, *fopen();
	int i = 1;

	if (argc == 1)
	{
		rdexclf("exclfile");
		exclude(stdin);
		exit(0);
	}
	if (argv[1][0] == '-')
	{
		if (argc == 2)
			usage();
		if (argv[1][1] == 'i')
			mode = 'i';
		else if (argv[1][1] == 'o')
			mode = 'o';
		else	/* bad flag */
			usage();
		rdexclf(argv[2]);
		if (argc == 3)
			exclude(stdin);
		i = 3;
	}
	else	/* no options used */
		rdexclf("exclfile");
	for (; i < argc; i++)
	{
		if ((fp = fopen(argv[i], "r")) != NULL)
		{
			exclude(fp);
			fclose(fp);
		}
		else	/* can't open input file */
		{
			fprintf(stderr,
				"Exclude cannot access the file: %s\n", argv[i]);
			continue;
		}
	}
	exit(0);
}

char *wdptr[MXWDS];	/* array of pointers to excluded words */
int nwds = 0;		/* the number of excluded words in core */

rdexclf(fname)		/* load structure with words from exclfile */
char fname[];
{
	FILE *efp, *fopen();
	char wd[512], *p, *malloc(), *strcpy();

	if ((efp = fopen(fname, "r")) == NULL)
	{
		fprintf(stderr,
			"Cannot access exclude file: %s\n", fname);
		usage();
		exit(1);
	}
	while (fgets(wd, 512, efp))
	{
		if (nwds >= MXWDS)
		{
			fprintf(stderr,
				"Maximum of %d exclude words allowed.\n", MXWDS);
			exit(1);
		}
		else if ((p = malloc((unsigned)(strlen(wd)+1))) == NULL)
		{
			fprintf(stderr,
				"Exclude: no more space left in core.\n");
			exit(1);
		}
		else	/* everything is OK */
		{
			strcpy(p, wd);
			wdptr[nwds++] = p;
		}
	}
	return;
}

exclude(fp)		/* filter out excluded words, i or o mode */
FILE *fp;
{
	char s[512], word[512];

	while (fgets(s, 512, fp))
	{
		if (firstword(s, word) == 0)
			continue;
		if (mode == 'i')
		{
			if (!inlist(word))
				fputs(s, stdout);
		}
		if (mode == 'o')
		{
			if (inlist(word))
				fputs(s, stdout);
		}
	}
}

firstword(s, wd)	/* return first word of string s */
char s[], *wd;
{
	int i = 0;

	if (isspace(s[i]))
		return(0);
	while (!isspace(s[i]))
		*wd++ = s[i++];
	*wd++ = '\n';
	*wd = NULL;
	return(1);
}

inlist(word)		/* check to see if word is in exclude list */
char word[];
{
	int i;

	for (i = 0; i < nwds; i++)
	{
		if (strcmp(word, wdptr[i]) == 0)
			return(1);
	}
	return(0);
}
@@@ Fin de exclude.c
echo format.c
cat >format.c <<'@@@ Fin de format.c'
# include <stdio.h>			/* format.c (rev3.7) */
# include <ctype.h>
# include <signal.h>

char *tempfile;		/* to store overflow while counting */
int nomap = 0;		/* toggle for mapping keyword to lcase */
int nocnt = 0;		/* toggle for counting keyword */
int nokwd = 0;		/* toggle for suppressing keyword */

usage()		/* print proper usage and exit */
{
	puts("Usage: format [-mck] [filename(s)]\t\t(rev3.7)");
	puts("-m: keywords not mapped from lower to upper case");
	puts("-c: suppress counting of keyword frequency");
	puts("-k: entirely suppress printing of keyword");
	exit(1);
}

main(argc, argv)	/* make keyword headings with count */
int argc;
char *argv[];
{
	FILE *fopen(), *fp;
	int i, j, onintr();
	char *mktemp();

	if (signal(SIGINT, SIG_IGN) != SIG_IGN)
		signal(SIGINT, onintr);
	tempfile = "/tmp/FmtXXXXX";
	mktemp(tempfile);
	for (i = 1; i < argc && *argv[i] == '-'; i++)
	{
		for (j = 1; argv[i][j] != NULL; j++)
		{
			if (argv[i][j] == 'm')
				nomap = 1;
			else if (argv[i][j] == 'c')
				nocnt = 1;
			else if (argv[i][j] == 'k')
				nokwd = 1;
			else	/* bad option */
			{
				fprintf(stderr,
					"Illegal format flag: -%c\n", argv[i][j]);
				usage();
			}
		}
	}
	if (i == argc)
	{
		if (nokwd)
			rmkwds(stdin);
		else if (nocnt)
			ffmt(stdin);
		else
			format(stdin);
	}
	for (; i < argc; i++)
	{
		if ((fp = fopen(argv[i], "r")) != NULL)
		{
			if (nokwd)
				rmkwds(fp);
			else if (nocnt)
				ffmt(fp);
			else
				format(fp);
			fclose(fp);
		}
		else	/* attempt to open file failed */
		{
			fprintf(stderr,
				"Format cannot access the file: %s\n", argv[i]);
			continue;
		}
	}
	unlink(tempfile);
	exit(0);
}

char buff[BUFSIZ*8];	/* tempfile buffer for storing contexts */
int bufflen;		/* total length of contexts in buffer */
int fulltf = 0;		/* does the tempfile contain something? */
FILE *tf = NULL;	/* file pointer for tempfile routines */

format(fp)		/* print keyword and count only if different */
FILE *fp;
{
	char s[BUFSIZ], okw[BUFSIZ/2], nkw[BUFSIZ/2], cntxt[BUFSIZ];
	char *sp, *kwp, *cxp, *strcpy();
	int kwfreq = 0;

	strcpy(okw,"~~~~~");	/* make sure 1st keyword is printed */
	while (fgets(s, BUFSIZ, fp))
	{
		for (sp = s, kwp = nkw; *sp && *sp != '|'; sp++, kwp++)
		{
			if (!nomap && islower(*sp))
				*kwp = toupper(*sp);
			else
				*kwp = *sp;
		}
		*kwp = NULL;
		for (++sp, cxp = cntxt; *sp && *sp != '\n'; sp++, cxp++)
		{
			if (*sp == '|') {
				*cxp = ' '; *++cxp = ' '; *++cxp = ' ';
			} else
				*cxp = *sp;
		}
		*cxp = '\n';
		*++cxp = NULL;
		if (strcmp(nkw, okw) != 0)	/* kwds different */
		{
			if (kwfreq != 0)
			{
				getbuff(kwfreq);
				putchar('\n');
			}
			*buff = NULL;
			bufflen = 0;
			fputs(nkw, stdout);
			putbuff(cntxt);
			kwfreq = 1;
		}
		else	/* if keywords are the same */
		{
			putbuff(cntxt);
			kwfreq++;
		}
		strcpy(okw, nkw);
	}
	getbuff(kwfreq);
}

putbuff(cntxt)		/* cache routine to buffer tempfile */
char cntxt[];
{
	char *strcat();

	if (!fulltf)
	{
		bufflen += strlen(cntxt);
		if (bufflen < BUFSIZ*8)
			strcat(buff, cntxt);
		else {
			fulltf = 1;
			if ((tf = fopen(tempfile, "w")) == NULL)
				perror(tempfile);
			fputs(buff, tf);
			*buff = NULL;
			bufflen = 0;
		}
	}
	else	/* fulltf */
		fputs(cntxt, tf);
}

getbuff(kwfreq)		/* print frequency and context buffer */
int kwfreq;
{
	char str[BUFSIZ];

	printf("(%d)\n", kwfreq);
	if (!fulltf)
		fputs(buff, stdout);
	else
	{
		fclose(tf);
		if ((tf = fopen(tempfile, "r")) == NULL)
			perror(tempfile);
		while (fgets(str, BUFSIZ, tf))
			fputs(str, stdout);
		fclose(tf);
		fulltf = 0;
	}
}

int onintr()		/* remove tempfile in case of interrupt */
{
	fprintf(stderr, "\nInterrupt\n");
	unlink(tempfile);
	exit(1);
}

ffmt(fp)	/* if different, print keyword without count */
FILE *fp;
{
	char s[BUFSIZ], okw[BUFSIZ/2], nkw[BUFSIZ/2], cntxt[BUFSIZ];
	char *sp, *kwp, *cxp, *strcpy();

	strcpy(okw,"~~~~~");	/* make sure 1st keyword is printed */
	while (fgets(s, BUFSIZ, fp))
	{
		for (sp = s, kwp = nkw; *sp && *sp != '|'; sp++, kwp++)
		{
			if (!nomap && islower(*sp))
				*kwp = toupper(*sp);
			else
				*kwp = *sp;
		}
		*kwp = NULL;
		for (++sp, cxp = cntxt; *sp && *sp != '\n'; sp++, cxp++)
		{
			if (*sp == '|') {
				*cxp = ' '; *++cxp = ' '; *++cxp = ' ';
			} else
				*cxp = *sp;
		}
		*cxp = '\n';
		*++cxp = NULL;
		if (strcmp(nkw, okw) != 0)	/* kwds different */
			printf("\n%s\n %s", nkw, cntxt);
		else	/* if keywords are the same */
			printf(" %s", cntxt);
		strcpy(okw, nkw);
	}
}

rmkwds(fp)	/* completely suppress printing of keyword */
FILE *fp;
{
	char s[BUFSIZ], *sp;

	while (fgets(s, BUFSIZ, fp))
	{
		for (sp = s; *sp && *sp != '|'; sp++)
			;
		for (; *sp; sp++)
		{
			if (*sp == '|')
				printf(" ");
			else
				putchar(*sp);
		}
	}
}
@@@ Fin de format.c
echo freq.c
cat >freq.c <<'@@@ Fin de freq.c'
# include <stdio.h>			/* freq.c (rev3.7) */
# include <ctype.h>

struct tnode		/* binary tree for word and count */
{
	char *word;
	int count;
	struct tnode *left;
	struct tnode *right;
};

char punctuation[BUFSIZ] = ",.;:-?!\"()[]{}" ;
long int total = 0;	/* total number of words */
long int different = 0;	/* number of different words */
char numfreq = 0;	/* toggle for numerical freq */
char nomap = 0;		/* do not map to lower case */

usage()		/* print proper usage and exit */
{
	puts("Usage: freq [-n -m -dF -] filename(s)\t\t(rev3.7)");
	puts("-n: list words in numerical order of frequency");
	puts("-m: disable mapping of words to lower case");
	puts("-d: define punctuation set according to file F");
	puts("- : read standard input instead of files");
	exit(1);
}

main(argc, argv)	/* tabulate word frequencies of a text */
int argc;
char *argv[];
{
	FILE *fopen(), *fp;
	struct tnode *root, *tree();
	char word[BUFSIZ];
	int i;

	if (argc == 1)
		usage();
	root = NULL;	/* initialize tree */
	for (i = 1; i < argc; i++)
	{
		if (*argv[i] == '-')
			getflag(argv[i]);
		else if ((fp = fopen(argv[i], "r")) != NULL)
		{
			while (getword(word, fp))
			{
				++total;
				root = tree(root, word);
			}
			fclose(fp);
		}
		else	/* attempt to open file failed */
		{
			fprintf(stderr,
				"Freq cannot access the file: %s\n", argv[i]);
			exit(1);
		}
	}
	if (numfreq)	/* print results */
		treesort(root);
	else
		treeprint(root, stdout);
	printf("------------------------------\n");
	printf("%5ld Total number of words\n", total);
	printf("%5ld Different words used\n", different);
	exit(0);
}

getflag(f)		/* parses command line to set options */
char *f;
{
	char *pfile, word[BUFSIZ];
	struct tnode *root, *tree();

	f++;
	switch(*f++)
	{
	case 'n':
		numfreq = 1;
		break;
	case 'm':
		nomap = 1;
		break;
	case 'd':
		pfile = f;
		getpunct(pfile);
		break;
	case NULL:
		root = NULL;
		while (getword(word, stdin))
		{
			++total;
			root = tree(root, word);
		}
		if (numfreq)
			treesort(root);
		else
			treeprint(root, stdout);
		break;
	default:
		fprintf(stderr,
			"Invalid freq flag: -%s\n", --f);
		exit(1);
		break;
	}
}

getpunct(pfile)		/* read user's punctuation from pfile */
char *pfile;
{
	FILE *pfp, *fopen();
	char s[BUFSIZ], *strcpy();

	if ((pfp = fopen(pfile, "r")) == NULL)
	{
		fprintf(stderr,
			"Freq cannot access Pfile: %s\n", pfile);
		exit(1);
	}
	else
		while (fgets(s, BUFSIZ, pfp))
			strcpy(punctuation, s);
}

getword(word, fp)	/* drives program through text word by word */
char word[];
FILE *fp;
{
	while ((*word = getc(fp)) && isskip(*word) && *word != EOF)
		;
	if (*word == EOF)
		return(0);
	if (!nomap && isupper(*word))
		*word = tolower(*word);
	while ((*++word = getc(fp)) && !isskip(*word) && *word != EOF)
	{
		if (!nomap && isupper(*word))
			*word = tolower(*word);
	}
	*word = NULL;
	return(1);
}

isskip(c)		/* function to evaluate punctuation */
char c;
{
	char *ptr;

	if (isspace(c))
		return(1);
	for (ptr = punctuation; *ptr != c && *ptr != NULL; ptr++)
		;
	if (*ptr == NULL)
		return(0);
	else
		return(1);
}

struct tnode *tree(p, w)	/* build tree beginning at root */
struct tnode *p;
char *w;
{
	struct tnode *talloc();
	char *strsave();
	int cond;

	if (p == NULL)
	{
		p = talloc();
		p->word = strsave(w);
		p->count = 1;
		p->left = p->right = NULL;
	}
	else if ((cond = strcmp(w, p->word)) == 0)
		p->count++;
	else if (cond < 0)
		p->left = tree(p->left, w);
	else	/* if cond > 0 */
		p->right = tree(p->right, w);
	return(p);
}

treesort(p)		/* sort contents of binary tree and print */
struct tnode *p;
{
	FILE *pfp, *popen();

	pfp = popen("sort +0rn -1 +1", "w");
	if (p != NULL)
		treeprint(p, pfp);
	pclose(pfp);
}

treeprint(p, fp)	/* write tree onto fp file stream */
struct tnode *p;
FILE *fp;
{
	if (p != NULL)
	{
		treeprint(p->left, fp);
		fprintf(fp, "%5d %s\n", p->count, p->word);
		++different;
		treeprint(p->right, fp);
	}
}

struct tnode *talloc()	/* core allocator for tree */
{
	struct tnode *p;
	char *malloc();

	if ((p = ((struct tnode *)malloc(sizeof(struct tnode)))) != NULL)
		;	/* will return */
	else	/* if (p == NULL) */
		overflow();
	return(p);
}

char *strsave(s)	/* allocate space for string of text */
char *s;
{
	char *p, *malloc();

	if ((p = malloc((unsigned)(strlen(s)+1))) != NULL)
		strcpy(p, s);
	else	/* if (p == NULL) */
		overflow();
	return(p);
}

overflow()		/* exit gracefully in case of core overflow */
{
	fprintf(stderr,
	"Freq: no more core available (maximum on PDP 11/70 is 64K bytes).\n");
	fprintf(stderr,
	"You might try: dissolve filename(s) | sort | uniq -c\n");
	exit(1);
}
@@@ Fin de freq.c
exit 0