Copyright © 1990 Free Software Foundation, Inc. Francois Pinard <pinard@iro.umontreal.ca>, 1988.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
gptx - GNU permuted index generator
This is the 0.2 alpha release of gptx, the GNU version of a permuted index generator. This software has the main goal of providing a replacement that is almost compatible with ptx, able to handle small files quickly, while providing a platform for more development.
This version reimplements and extends standard ptx. Among other things, it can produce a readable KWIC (keywords in their context) index without the need for nroff; there is also an option to produce TeX-compatible output. This version does not yet handle huge input files, that is, files which do not fit in memory all at once.
Please note that an overall renaming of all options is foreseeable. In fact, GNU ptx specifications are not frozen yet.
1.1 How to use this program | How to use the program, its options and parameters. |
1.2 Syntax of Regular Expressions | How a regular expression is written and used. |
1.3 ptx compatibility mode | In which ways ptx mode is different. |
1.4 Development guidelines | What are the development lines of this program. |
This tool reads a text file and essentially produces a permuted index, with each keyword in its context. The calling sketch is one of:
gptx [option]… [input]… >output
or:
ptx [option]… [input [output]]
These are two different versions of one program. Using ptx instead of gptx implies built-in ptx compatibility mode, which disallows extensions, introduces some limitations, and changes several of the program’s default option values. This documentation describes both modes of operation. See section ptx compatibility mode for an explicit list of differences.
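To make the idea of a permuted index concrete, here is a minimal sketch in Python. It is only an illustration of keywords shown in their context, not the algorithm gptx uses; the function name `kwic` and its `width` parameter are hypothetical, and details such as sorting and truncation are simplified assumptions.

```python
import re

def kwic(lines, width=30):
    """Return a sorted keyword-in-context listing (illustrative sketch only)."""
    entries = []
    for line in lines:
        # Every run of word characters is treated as a keyword occurrence.
        for match in re.finditer(r"\w+", line):
            left = line[:match.start()]    # context before the keyword
            right = line[match.start():]   # keyword and what follows it
            entries.append((match.group().lower(), left, right))
    entries.sort()
    # Right-justify the left context so that keywords line up in one column.
    return [f"{left[-width:]:>{width}}  {right[:width]}"
            for _, left, right in entries]

for row in kwic(["the quick brown fox", "a lazy dog"], width=12):
    print(row)
```

Each output row carries one keyword occurrence, with the keyword starting at the same column on every row.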
As usual, each option is represented by a hyphen followed by a single letter. Some options require a parameter in the form of a decimal number or a file name, in which case the parameter follows the option after some whitespace. Option letters may be grouped and tied together as a string which follows only one hyphen; if one or several of them require parameters, these parameters should follow the combined options in the order of appearance of the individual letters in the string. Individual options are explained below.
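The grouping rule for option letters can be sketched as follows. This is a hypothetical Python helper (`parse_cluster` is not part of gptx), written only to show how parameters are matched to clustered letters in order of appearance; the option letters in the usage example are taken from this manual.

```python
def parse_cluster(argv, takes_param):
    """Split grouped single-letter options and attach their parameters.

    `takes_param` is the set of option letters that require a parameter.
    Returns (options, remaining_arguments). Illustrative sketch only.
    """
    options = []
    args = list(argv)
    # A lone "-" is not an option group; it stands for standard input.
    while args and args[0].startswith("-") and len(args[0]) > 1:
        cluster = args.pop(0)[1:]
        for letter in cluster:
            if letter in takes_param:
                if not args:
                    raise ValueError(f"option -{letter} requires a parameter")
                # Parameters follow the group, in the order of the letters.
                options.append((letter, args.pop(0)))
            else:
                options.append((letter, None))
    return options, args

opts, inputs = parse_cluster(["-fbi", "breaks", "ignore", "input.txt"],
                             takes_param={"b", "i", "o", "g", "w"})
print(opts)
print(inputs)
```

Here `-fbi breaks ignore` is read as `-f`, `-b breaks` and `-i ignore`, leaving `input.txt` as the input file.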
When not in ptx compatibility mode, there may be zero, one or several parameters after the options. If there are no parameters, the program reads the standard input. If there are one or several parameters, they name the input files, which are all read in turn, as if all the input files were concatenated. However, there is a full contextual break between each file, and when automatic referencing is requested, file names and line numbers refer to individual text input files. In all cases, the program writes the permuted index to the standard output.
When in ptx compatibility mode, besides the options, there may be zero, one or two parameters. If there are no parameters, the program reads the standard input and writes the permuted index to the standard output. If there is only one parameter, it names the text file to be read instead of the standard input. If two parameters are given, they name respectively the file to read and the file to produce. Be careful to note that, in this case, the contents of the file named by the second parameter are destroyed. This behaviour is dictated by compatibility; GNU standards discourage output parameters not introduced by an option.
Note that for any file named as the value of an option or as an input text file, a single dash - may be used, in which case standard input is assumed. However, it would not make sense to use this convention more than once per program invocation.
1.1.1 General options | Options which affect general program behaviour. | |
1.1.2 Charset selection | Underlying character set considerations. | |
1.1.3 Word selection | Input fields, contexts, and keyword selection. | |
1.1.4 Output formatting | Types of output format, and sizing the fields. |
-C
Prints a short note about the Copyright and copying conditions.
As it is set up now, the program assumes that the input file is coded using 8-bit ISO 8859-1 code, also known as the Latin-1 character set, unless it is compiled for MS-DOS, in which case it uses the character set of the IBM-PC. Compared to 7-bit ASCII, the set of characters which are letters is then different, and this alters the behaviour of regular expression matching. Thus, the default regular expression for a keyword allows foreign or diacriticized letters. Keyword sorting, however, is still crude; it obeys the underlying character set ordering quite blindly.
-f
Fold lower case letters to upper case for sorting.
-b file
This option is an alternative to option -W for describing which characters make up words. It introduces the name of a file which contains a list of characters which cannot be part of a word; this file is called the Break file. Any character which is not part of the Break file is a word constituent. If both options -b and -W are specified, then -W has precedence and -b is ignored.
In normal mode, the only way to avoid newline as a break character is to write all the break characters in the file with no newline at all, not even at the end of the file. In ptx compatibility mode, spaces, tabs and newlines are always considered as break characters even if not included in the Break file.
-i file
The file associated with this option contains a list of words which will never be taken as keywords in concordance output. It is called the Ignore file. The file contains exactly one word on each line; the end of line separation of words is not subject to the value of the -S option.
If not specified, there might be a default Ignore file. Default Ignore files are not necessarily the same in normal mode and in ptx compatibility mode. Unless changed by the local installation, there is no default Ignore file in normal mode, and the Ignore file is /usr/lib/eign in ptx compatibility mode. If you want to deactivate a default Ignore file, use /dev/null instead.
-o file
The file associated with this option contains a list of words which will be retained in concordance output; any word not mentioned in this file is ignored. The file is called the Only file. The file contains exactly one word on each line; the end of line separation of words is not subject to the value of the -S option.
There is no default for the Only file. When both an Only file and an Ignore file are given, a word is eligible as a keyword only if it appears in the Only file and does not appear in the Ignore file.
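The combined effect of the Only file and the Ignore file can be written as a small predicate. The following Python sketch uses a hypothetical `is_keyword` function and assumes exact, case-sensitive comparison of words, which this manual does not specify.

```python
def is_keyword(word, only=None, ignore=frozenset()):
    """Decide whether `word` may be used as a keyword.

    `only` is the set of words from the Only file, or None when no Only
    file is given; `ignore` is the set of words from the Ignore file.
    """
    if only is not None and word not in only:
        return False           # an Only file was given and the word is not in it
    return word not in ignore  # the Ignore file still applies in every case

only_words = {"index", "keyword", "the"}
ignore_words = {"the", "a", "of"}
print(is_keyword("index", only_words, ignore_words))
print(is_keyword("the", only_words, ignore_words))
```

The second call shows that a word listed in both files is rejected: the Only file does not inhibit the Ignore file.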
-r
On each input line, the leading sequence of non-white characters will be taken to be a reference that has the purpose of identifying this input line in the produced permuted index. See section Output formatting for more information about reference production. Using this option changes the default value for option -S.
Using this option, the program does not try very hard to remove references from contexts in output, but it succeeds in doing so when the context ends exactly at the newline. If option -r is used with the default value of -S, or when in ptx compatibility mode, this condition is always met and references are completely excluded from the output contexts.
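As an illustration of how a leading reference may be separated from the rest of an input line, here is a short Python sketch. The function name `split_reference` is hypothetical, and skipping the white space that follows the reference is an assumption made for the example.

```python
import re

def split_reference(line):
    """Split an input line into (reference, text).

    The reference is the leading run of non-white characters; it is
    empty when the line starts with white space. Illustrative only.
    """
    match = re.match(r"\S*", line)
    reference = match.group()
    text = line[match.end():].lstrip()
    return reference, text

print(split_reference("p12 The quick brown fox"))
```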
-S regexp
This option selects which regular expression will describe the end of a line or the end of a sentence. In fact, there is no other distinction between end of lines and end of sentences than the effect of this regular expression, and input line boundaries have no special significance outside this option. By default, in ptx compatibility mode or if the -r option is used, end of lines are used; in this case, the regexp used is very simple:
\n
In normal mode and if the -r option is not used, by default, end of sentences are used; the precise regexp is imported from GNU Emacs:
[.?!][]\"')}]*\\($\\|\t\\|  \\)[ \t\n]*
An empty REGEXP is equivalent to completely disabling end of line or end of sentence recognition. In this case, the whole file is considered to be a single big line or sentence. The user might want to disallow all truncation flag generation as well, through option -F "". On regular expression writing and usage, see section Syntax of Regular Expressions.
When the keywords happen to be near the beginning of the input line or sentence, this often creates an unused area at the beginning of the output context line; when the keywords happen to be near the end of the input line or sentence, this often creates an unused area at the end of the output context line. The program tries to fill those unused areas by wrapping around context in them; the tail of the input line or sentence is used to fill the unused area on the left of the output line; the head of the input line or sentence is used to fill the unused area on the right of the output line.
This option is not available when the program is operating in ptx compatibility mode.
-W regexp
This option selects which regular expression will describe each keyword. By default, in ptx compatibility mode, a word is anything which ends with a space, a tab or a newline; the regexp used is [^ \t\n]+. In normal mode, a word is a sequence of letters; the regexp used is \w+.
An empty REGEXP is equivalent to not using this option, letting the default apply. On regular expression writing and usage, see section Syntax of Regular Expressions.
This option is not available when the program is operating in ptx compatibility mode.
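The difference between the two default keyword regexps can be tried with Python's `re` module, as in the sketch below. This is only an approximation: Python's `\w` also matches digits and the underscore, whereas this manual describes a word in normal mode as a sequence of letters.

```python
import re

text = "don't stop, naïve café"

# Normal mode default, approximated with Python's \w+.
normal_mode_words = re.findall(r"\w+", text)

# ptx compatibility mode default: anything up to a space, tab or newline.
ptx_mode_words = re.findall(r"[^ \t\n]+", text)

print(normal_mode_words)
print(ptx_mode_words)
```

In the first list the apostrophe and the comma act as separators; in the second they stay attached to the neighbouring words.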
Output format is mainly controlled by the -O and -T options, described in the table below. However, when neither -O nor -T is selected, and if we are not running in ptx compatibility mode, the program chooses an output format suited for a dumb terminal. This is the default format when working in normal mode.
Each keyword occurrence is output to the center of one line, surrounded by its left and right contexts. Each field is properly justified, so the concordance output can readily be observed. As a special feature, if automatic references are selected by option -A and are output before the left context, that is, if option -R is not selected, then a colon is added after the reference; this nicely interfaces with GNU Emacs next-error processing. In this default output format, each white space character, like newline and tab, is merely changed to exactly one space, with no special attempt to compress consecutive spaces. This might change in the future. Except for those white space characters, every other character of the underlying set of 256 characters is transmitted verbatim.
Output format is further controlled by the following options.
-g number
Select the size of the minimum white gap between the fields on the output line.
-w number
Select the output maximum width of each final line. If references are used, they are included or excluded from the output maximum width depending on the value of option -R. If this option is not selected, that is, when references are output before the left context, the output maximum width takes into account the maximum length of all references. If this option is selected, that is, when references are output after the right context, the output maximum width does not take into account the space taken by references, nor the gap that precedes them.
-A
Select automatic references. Each input line will have an automatic reference made up of the file name and the line ordinal, with a single colon between them. However, the file name will be empty when standard input is being read. If both -A and -r are selected, then the input reference is still read and skipped, but the automatic reference is used at output time, overriding the input reference.
This option is not available when the program is operating in ptx compatibility mode.
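The shape of an automatic reference can be sketched in a few lines of Python. The helper `auto_references` is hypothetical, and numbering lines from 1 is an assumption; the manual only says that the reference is the file name and the line ordinal separated by a single colon.

```python
def auto_references(file_name, lines):
    """Pair each input line with a "file:line" reference.

    `file_name` is the empty string when standard input is being read.
    Line ordinals are assumed to start at 1.
    """
    return [(f"{file_name}:{number}", line)
            for number, line in enumerate(lines, start=1)]

for reference, line in auto_references("notes.txt", ["first line", "second line"]):
    print(reference, line)
```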
-R
In default output format, when option -R is not used, any references produced by the effect of options -r or -A are given at the beginning of each output line, before the left context. In default output format, when option -R is specified, references are rather given at the far right of output lines, after the right context. For any other output format, option -R is almost ignored, except for the fact that the width of references is not taken into account in the total output width given by -w whenever -R is selected.
This option is not explicitly selectable when the program is operating in ptx compatibility mode. However, in this case, it is always implicitly selected.
-F string
This option requests that any truncation in the output be reported using the string string. Most output fields theoretically extend towards the beginning or the end of the current line, or current sentence, as selected with option -S. But there is a maximum allowed output line width, changeable through option -w, which is further divided into space for various output fields. When a field cannot extend all the way to the beginning or the end of the current line and still fit in its allotted space, a truncation occurs. By default, the string used is a single slash, as in ‘-F /’.
string may have more than one character, as in ‘-F ...’.
Also, in the particular case where string is empty (-F ""), truncation flagging is disabled, and no truncation marks are appended.
This option is not available when the program is operating in ptx compatibility mode.
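Truncation flagging can be pictured with the following Python sketch, which cuts a field on its right side. The function `fit_right` is hypothetical, and the choice to count the flag inside the field width is an assumption; gptx may allot space differently.

```python
def fit_right(text, width, flag="/"):
    """Cut `text` to `width` columns, marking any cut with `flag`.

    An empty `flag` disables truncation marks, as with -F "".
    """
    if len(text) <= width:
        return text             # the field fits, nothing to report
    if not flag:
        return text[:width]     # truncate silently
    return text[:max(width - len(flag), 0)] + flag

print(fit_right("keyword in context", 10))
print(fit_right("keyword in context", 10, "..."))
print(fit_right("keyword in context", 10, ""))
```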
-O
Choose an output format suitable for nroff or troff processing. Each output line will look like:
.xx "tail" "before" "keyword_and_after" "head" "ref"
so it will be possible to write an ‘.xx’ roff macro to take care of the output typesetting. This is the default output format when working in ptx compatibility mode.
In this output format, each non-graphical character, like newline and tab, is merely changed to exactly one space, with no special attempt to compress consecutive spaces. Each quote character " is doubled so it will be correctly processed by nroff or troff. All characters having their eighth bit set are turned into spaces in this version. Diacriticized characters should eventually be correctly expressed in roff terms, once I learn how to do this. So, let me know how to improve this special character processing.
This option is not available when the program is operating in ptx compatibility mode. In fact, it then becomes the default and sole output format.
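The roff output line described above can be approximated in Python as follows. This is a sketch under stated assumptions: `roff_line` is a hypothetical name, only tab and newline are treated as non-graphical characters, and the handling of characters with the eighth bit set is left out.

```python
def roff_line(tail, before, keyword_and_after, head, ref, macro="xx"):
    """Format one concordance entry as a roff macro call (illustrative)."""
    def quote(field):
        # Each tab or newline becomes exactly one space, with no compression.
        cleaned = "".join(" " if ch in "\t\n" else ch for ch in field)
        # Each double quote is doubled so that roff reads it as a literal quote.
        return '"' + cleaned.replace('"', '""') + '"'

    fields = [tail, before, keyword_and_after, head, ref]
    return "." + macro + " " + " ".join(quote(field) for field in fields)

print(roff_line("", "say", 'hello "world"', "", "notes.txt:1"))
```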
-T
Choose an output format suitable for TeX processing. Each output line will look like:
\xx {tail}{before}{keyword}{after}{head}{ref}
so it will be possible to write a \xx definition to take care of the output typesetting. Note that when references are not being produced, that is, when neither option -A nor option -r is selected, the last parameter of each \xx call is inhibited.
In this output format, some special characters, like $, %, &, # and _ are automatically protected with a backslash. Curly brackets {, } are also protected with a backslash, but also enclosed in a pair of dollar signs to force mathematical mode. The backslash itself produces the sequence \backslash{}. Circumflex and tilde diacritics produce the sequence ^\{ } and ~\{ } respectively. Other diacriticized characters of the underlying character set produce an appropriate TeX sequence as far as possible. The other non-graphical characters, like newline and tab, and all other characters which are not part of ASCII, are merely changed to exactly one space, with no special attempt to compress consecutive spaces. Let me know how to improve this special character processing for TeX.
This option is not available when the program is operating in ptx compatibility mode.
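A part of the character protection described for TeX output can be sketched in Python. The function `tex_escape` is hypothetical and covers only the cases stated explicitly above (the five special characters, curly brackets, the backslash, tab and newline); diacritics and other non-ASCII characters are deliberately left untouched here.

```python
def tex_escape(text):
    """Protect characters that are special to TeX (partial sketch)."""
    out = []
    for ch in text:
        if ch in "$%&#_":
            out.append("\\" + ch)           # protected with a backslash
        elif ch in "{}":
            out.append("$\\" + ch + "$")    # backslash, inside math mode
        elif ch == "\\":
            out.append("\\backslash{}")
        elif ch in "\t\n":
            out.append(" ")                 # exactly one space each
        else:
            out.append(ch)
    return "".join(out)

print(tex_escape("50% of {x}_1"))
```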
Regular expressions have a syntax in which a few characters are special constructs and the rest are ordinary. An ordinary character is a simple regular expression which matches that character and nothing else. The special characters are ‘$’, ‘^’, ‘.’, ‘*’, ‘+’, ‘?’, ‘[’, ‘]’ and ‘\’; no new special characters will be defined. Any other character appearing in a regular expression is ordinary, unless a ‘\’ precedes it.
For example, ‘f’ is not a special character, so it is ordinary, and therefore ‘f’ is a regular expression that matches the string ‘f’ and no other string. (It does not match the string ‘ff’.) Likewise, ‘o’ is a regular expression that matches only ‘o’.
Any two regular expressions a and b can be concatenated. The result is a regular expression which matches a string if a matches some amount of the beginning of that string and b matches the rest of the string.
As a simple example, we can concatenate the regular expressions ‘f’ and ‘o’ to get the regular expression ‘fo’, which matches only the string ‘fo’. Still trivial. To do something nontrivial, you need to use one of the special characters. Here is a list of them.
‘.’ (period) is a special character that matches any single character except a newline. Using concatenation, we can make regular expressions like ‘a.b’ which matches any three-character string which begins with ‘a’ and ends with ‘b’.
‘*’ is not a construct by itself; it is a suffix, which means the preceding regular expression is to be repeated as many times as possible. In ‘fo*’, the ‘*’ applies to the ‘o’, so ‘fo*’ matches one ‘f’ followed by any number of ‘o’s. The case of zero ‘o’s is allowed: ‘fo*’ does match ‘f’.
‘*’ always applies to the smallest possible preceding expression. Thus, ‘fo*’ has a repeating ‘o’, not a repeating ‘fo’.
The matcher processes a ‘*’ construct by matching, immediately, as many repetitions as can be found. Then it continues with the rest of the pattern. If that fails, backtracking occurs, discarding some of the matches of the ‘*’-modified construct in case that makes it possible to match the rest of the pattern. For example, matching ‘ca*ar’ against the string ‘caaar’, the ‘a*’ first tries to match all three ‘a’s; but the rest of the pattern is ‘ar’ and there is only ‘r’ left to match, so this try fails. The next alternative is for ‘a*’ to match only two ‘a’s. With this choice, the rest of the regexp matches successfully.
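The backtracking behaviour of ‘*’ can be observed with Python's `re` module, which also uses a backtracking matcher. This only illustrates the principle; it says nothing about the internals of the matcher used by gptx. Note that Python writes grouping as `( )` where the syntax described here uses ‘\( … \)’.

```python
import re

# The group records how many 'a's the starred part finally keeps.
star_match = re.fullmatch(r"c(a*)ar", "caaar")
print(star_match.group(1))   # the starred part gives one 'a' back to 'ar'

# Zero repetitions are allowed: 'fo*' matches a lone 'f'.
zero_match = re.fullmatch(r"fo*", "f")
print(zero_match is not None)
```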
‘+’ is a suffix character similar to ‘*’ except that it requires that the preceding expression be matched at least once. So, for example, ‘ca+r’ will match the strings ‘car’ and ‘caaaar’ but not the string ‘cr’, whereas ‘ca*r’ would match all three strings.
‘?’ is a suffix character similar to ‘*’ except that it can match the preceding expression either once or not at all. For example, ‘ca?r’ will match ‘car’ or ‘cr’; nothing else.
‘[’ begins a character set, which is terminated by a ‘]’. In the simplest case, the characters between the two form the set. Thus, ‘[ad]’ matches either one ‘a’ or one ‘d’, and ‘[ad]*’ matches any string composed of just ‘a’s and ‘d’s (including the empty string), from which it follows that ‘c[ad]*r’ matches ‘cr’, ‘car’, ‘cdr’, ‘caddaar’, etc.
Character ranges can also be included in a character set, by writing two characters with a ‘-’ between them. Thus, ‘[a-z]’ matches any lower-case letter. Ranges may be intermixed freely with individual characters, as in ‘[a-z$%.]’, which matches any lower case letter or ‘$’, ‘%’ or period.
Note that the usual special characters are not special any more inside a character set. A completely different set of special characters exists inside character sets: ‘]’, ‘-’ and ‘^’.
To include a ‘]’ in a character set, you must make it the first character. For example, ‘[]a]’ matches ‘]’ or ‘a’. To include a ‘-’, write ‘---’, which is a range containing only ‘-’. To include ‘^’, make it other than the first character in the set.
‘[^’ begins a complement character set, which matches any character except the ones specified. Thus, ‘[^a-z0-9A-Z]’ matches all characters except letters and digits.
‘^’ is not special in a character set unless it is the first character. The character following the ‘^’ is treated as if it were first (‘-’ and ‘]’ are not special there).
Note that a complement character set can match a newline, unless newline is mentioned as one of the characters not to match.
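The character set rules above happen to carry over to Python's `re` module, so they can be checked with a short script. The helper `matches` is hypothetical and simply tests whether a pattern matches a whole string.

```python
import re

def matches(pattern, text):
    """Return True when `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"c[ad]*r", "caddaar"))     # any mix of 'a' and 'd' between c and r
print(matches(r"c[ad]*r", "cr"))          # the empty repetition is allowed
print(matches(r"[a-z$%.]", "."))          # '.' is ordinary inside a set
print(matches(r"[]a]", "]"))              # ']' first in the set is literal
print(matches(r"[^a-z0-9A-Z]", "\n"))     # a complement set can match newline
print(matches(r"[^a-z0-9A-Z]", "q"))      # but not a letter
```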
‘^’ is a special character that matches the empty string, but only if at the beginning of a line in the text being matched. Otherwise it fails to match anything. Thus, ‘^foo’ matches a ‘foo’ which occurs at the beginning of a line.
‘$’ is similar to ‘^’ but matches only at the end of a line. Thus, ‘xx*$’ matches a string of one ‘x’ or more at the end of a line.
‘\’ has two functions: it quotes the special characters (including ‘\’), and it introduces additional special constructs.
Because ‘\’ quotes special characters, ‘\$’ is a regular expression which matches only ‘$’, and ‘\[’ is a regular expression which matches only ‘[’, and so on.
Note: for historical compatibility, special characters are treated as ordinary ones if they are in contexts where their special meanings make no sense. For example, ‘*foo’ treats ‘*’ as ordinary since there is no preceding expression on which the ‘*’ can act. It is poor practice to depend on this behavior; better to quote the special character anyway, regardless of where it appears.
For the most part, ‘\’ followed by any character matches only that character. However, there are several exceptions: characters which, when preceded by ‘\’, are special constructs. Such characters are always ordinary when encountered on their own. Here is a table of ‘\’ constructs.
‘\|’ specifies an alternative. Two regular expressions a and b with ‘\|’ in between form an expression that matches anything that either a or b will match.
Thus, ‘foo\|bar’ matches either ‘foo’ or ‘bar’ but no other string.
‘\|’ applies to the largest possible surrounding expressions. Only a surrounding ‘\( … \)’ grouping can limit the grouping power of ‘\|’.
Full backtracking capability exists to handle multiple uses of ‘\|’.
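‘\( … \)’ is a grouping construct that serves three purposes:
To enclose a set of ‘\|’ alternatives for other operations. Thus, ‘\(foo\|bar\)x’ matches either ‘foox’ or ‘barx’.
To enclose a complicated expression for the postfix ‘*’ to operate on. Thus, ‘ba\(na\)*’ matches ‘bananana’, etc., with any (zero or more) number of ‘na’ strings.
To mark a matched substring for future reference.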
This last application is not a consequence of the idea of a parenthetical grouping; it is a separate feature which happens to be assigned as a second meaning to the same ‘\( … \)’ construct because there is no conflict in practice between the two meanings. Here is an explanation of this feature:
‘\digit’: after the end of a ‘\( … \)’ construct, the matcher remembers the beginning and end of the text matched by that construct. Then, later on in the regular expression, you can use ‘\’ followed by digit to mean “match the same text matched the digit’th time by the ‘\( … \)’ construct.”
The strings matching the first nine ‘\( … \)’ constructs appearing in a regular expression are assigned numbers 1 through 9 in order that the open-parentheses appear in the regular expression. ‘\1’ through ‘\9’ may be used to refer to the text matched by the corresponding ‘\( … \)’ construct.
For example, ‘\(.*\)\1’ matches any newline-free string that is composed of two identical halves. The ‘\(.*\)’ matches the first half, which may be anything, but the ‘\1’ that follows must match the same exact text.
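The back-reference example can be tried in Python, keeping in mind that Python spells the grouping construct `( )` rather than ‘\( … \)’. This sketch only demonstrates the matching rule, not the matcher used by gptx.

```python
import re

# Python equivalent of the pattern made of a group followed by \1.
two_halves = re.compile(r"(.*)\1")

doubled = two_halves.fullmatch("abcabc")
print(doubled.group(1))                          # the text remembered by the group
print(two_halves.fullmatch("abcabd") is None)    # the halves differ, so no match
```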
‘\`’ matches the empty string, provided it is at the beginning of the buffer.
‘\'’ matches the empty string, provided it is at the end of the buffer.
‘\b’ matches the empty string, provided it is at the beginning or end of a word. Thus, ‘\bfoo\b’ matches any occurrence of ‘foo’ as a separate word. ‘\bballs?\b’ matches ‘ball’ or ‘balls’ as a separate word.
‘\B’ matches the empty string, provided it is not at the beginning or end of a word.
‘\<’ matches the empty string, provided it is at the beginning of a word.
‘\>’ matches the empty string, provided it is at the end of a word.
‘\w’ matches any word-constituent character. The editor syntax table determines which characters these are.
‘\W’ matches any character that is not a word-constituent.
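Some of the word-related constructs exist under the same names in Python's `re` module, which allows a quick experiment. Python has no ‘\<’ and ‘\>’ constructs and spells the buffer boundaries differently, so only ‘\b’, ‘\B’, ‘\w’ and ‘\W’ are shown in this sketch.

```python
import re

text = "ball balls football ballsy"

# \b requires a word boundary on both sides of the match.
whole_words = re.findall(r"\bballs?\b", text)
print(whole_words)

# \B requires that there is no word boundary at that position.
inner = re.findall(r"\Boo\B", "foo book")
print(inner)

# \W matches the characters that separate words.
separators = re.findall(r"\W+", "a, b")
print(separators)
```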
Here is a complicated regexp, used by Emacs to recognize the end of a sentence together with any whitespace that follows. It is given in Lisp syntax to enable you to distinguish the spaces from the tab characters. In Lisp syntax, the string constant begins and ends with a double-quote. ‘\"’ stands for a double-quote as part of the regexp, ‘\\’ for a backslash as part of the regexp, ‘\t’ for a tab and ‘\n’ for a newline.
"[.?!][]\"')]*\\($\\|\t\\|  \\)[ \t\n]*"
This contains four parts in succession: a character set matching period, ‘?’ or ‘!’; a character set matching close-brackets, quotes or parentheses, repeated any number of times; an alternative in backslash-parentheses that matches end-of-line, a tab or two spaces; and a character set matching whitespace characters, repeated any number of times.
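The four parts of this regexp can be transcribed for Python's `re` module and used to cut a text into sentences, as in the sketch below. The transcription is an assumption-laden illustration: Python writes the alternative as `(?:…|…)`, the `MULTILINE` flag is needed so that `$` matches at the end of every line, and `split_sentences` is a hypothetical helper, not something gptx provides.

```python
import re

# Punctuation, optional closers, then end of line, a tab or two spaces,
# then any amount of trailing white space.
SENTENCE_END = re.compile(r"""[.?!][]"')]*(?:$|\t|  )[ \t\n]*""", re.MULTILINE)

def split_sentences(text):
    """Split `text` at each sentence end found by SENTENCE_END."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        # Keep the punctuation and closers, drop the white space after them.
        sentences.append(text[start:match.start()] + match.group().rstrip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences

sample = 'He said "stop."  Then he left.\nNobody e.g. noticed!'
for sentence in split_sentences(sample):
    print(sentence)
```

The periods in ‘e.g.’ do not end a sentence here, because each is followed by neither an end of line, a tab, nor two spaces.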
ptx compatibility mode
This section outlines the differences between this program and standard ptx. For someone used to standard ptx, here are some points worth noticing when not using ptx compatibility mode:
Concordance output is not formatted for troff or nroff. By default, output is rather formatted for a dumb terminal. troff or nroff output may still be selected through option -O.
Unless the -R option is used, the maximum reference width is subtracted from the total output line width. In ptx compatibility mode, the width of references is not taken into account in the output line width computations.
All 256 characters of the underlying character set are accepted on input, even in ptx compatibility mode. However, standard ptx does not accept 8-bit characters, a few control characters are rejected, and the tilde ~ is condemned.
Input lines may be of any length, even in ptx compatibility mode. However, standard ptx processes only the first 200 characters in each line.
The break characters default to every character which is not a letter. In ptx compatibility mode, the break characters default to space, tab and newline only.
The output disposition of standard ptx is imitated only in ptx compatibility mode. Even in ptx mode, there are some slight disposition glitches this program does not completely reproduce, even if it comes quite close.
The default Ignore file in ptx compatibility mode is not the same as in normal mode. In the default installation, the default Ignore file is ‘/usr/lib/eign’ in ptx compatibility mode, and there is none in normal mode.
Standard ptx disallows specifying both the Ignore file and the Only file at the same time. This version allows both, and specifying an Only file does not inhibit processing the Ignore file.
This software is meant to evolve towards a concordance package for GNU, which should ideally be able to tackle true, real, big concordance jobs, while staying fast and easy to use for little jobs. Several packages of this kind are awfully slow; I’m trying to keep speed in mind. I am interested in interactive query, but I am postponing that work so as not to burden myself too much too soon.
Here is a What To Do Next list, in expected execution order.
Most of the boosting work should go along the line of fast recognition of multiple and complex boundaries, which define various ‘languages’. Each such language has its own rules for words, sentences, paragraphs, and reporting requests. This is less difficult than I first thought:
sort has been released recently, and could evolve with gptx.
This document was generated on March 29, 2022 using texi2html 5.0.