Text File | 1995-06-04 | 16KB | 397 lines
DDigest (version 0.05)
Copyright (c) 1994 Robert P. Rush
Permission to copy and distribute this material for any
purpose and without fee is hereby granted, provided that
the above copyright notice and this permission notice
appear in all copies. THE AUTHOR MAKES NO
REPRESENTATIONS ABOUT THE ACCURACY OR SUITABILITY OF
THIS MATERIAL FOR ANY PURPOSE. IT IS PROVIDED "AS IS,"
WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE AUTHOR
WILL ASSUME NO LIABILITY FOR DAMAGES EITHER FROM THE
DIRECT USE OF THIS PRODUCT OR AS A CONSEQUENCE OF THE
USE OF THIS PRODUCT.
DDigest (De-Digest) is a program designed to extract
individual articles from a digest format mailing list. It
will then place these articles into a rnews packet to be
imported into an off-line news reader. DDigest will find
these digests in either a mail packet (in soup format), or a
Yarn folder.
Requirements:
MSDOS or other compatible operating system.
Installation:
Copy ddigest.exe to a directory contained in the path.
Edit a configuration file for each mailing list.
Configuration files:
A configuration file contains settings telling DDigest how to
find and process a digest. Comments may be placed in the
configuration file by placing a "#" in the first column.
DDigest uses regular expressions in searching for the
`FindDigest', `FirstArticleBreak', and `ArticleBreak' strings.
See the section on regular expressions for additional
information.
Finding a digest:
DDigest will look in the "FindHeader" of each mail message for
the "FindDigest" search string. If found, the mail message
will be processed as a digest. Otherwise, DDigest will
continue looking in the next message. For example, if the
configuration file contains the settings:
FindHeader=Sender:
FindDigest=yarn-list
Any mail message containing the string "yarn-list" in the
"Sender:" header will be considered a digest.
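The header test described above can be sketched in Python. This is only an illustration of the idea, not DDigest's own code: the function name `is_digest` is hypothetical, header names are compared without regard to case (an assumption), and continuation lines of folded headers are ignored.

```python
import re

def is_digest(message: str, find_header: str, find_digest: str) -> bool:
    """Return True if the named header contains a match for the pattern."""
    # Headers end at the first blank line of the message.
    headers = message.split("\n\n", 1)[0]
    for line in headers.splitlines():
        # Assumption: header names are matched case-insensitively.
        if line.lower().startswith(find_header.lower()):
            value = line[len(find_header):]
            if re.search(find_digest, value):
                return True
    return False

msg = (
    "From: someone@example.com\n"
    "Sender: yarn-list-request@example.com\n"
    "\n"
    "Body text\n"
)
print(is_digest(msg, "Sender:", "yarn-list"))
```

A message whose "Sender:" header lacks the string, or which has no "Sender:" header at all, is simply skipped.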
Finding Articles in a digest:
Once a digest has been found, DDigest will look in the message
body for the beginning of the first article using the
"FirstArticleBreak" setting. The end of each article and the
beginning of the next article is found with the
"ArticleBreak" setting. The extracted article will contain
everything from the end of one article break to the beginning
of the next article break. The article breaks are not
included in any articles.
For example, suppose the digestifier places an introduction followed
by a line containing seventy-six dashes at the beginning of
the digest. Each article in the digest is separated by three
lines, the first one being blank, the second line containing
thirty dashes, and the third line blank. The configuration
file should contain the settings:
FirstArticleBreak=^-{60,100}$
ArticleBreak=^$-{30}$$
This will find a line containing between sixty and one hundred
dashes, and nothing but dashes, and use it as the break
between the introduction and the first article. It will also
use any group of three lines where the first and third lines
in the group are blank and the middle line contains exactly
thirty dashes as a break between articles.
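The same splitting logic can be approximated in Python for experimentation. This is a sketch under assumptions, not DDigest's implementation: Python's `re` module does not treat "$" as a line separator, so the article break is written with explicit "\n" characters, and the function name `split_digest` is made up for this example.

```python
import re

# A line of sixty to one hundred dashes and nothing else.
FIRST_BREAK = re.compile(r"^-{60,100}$", re.MULTILINE)
# A blank line, a line of exactly thirty dashes, and another blank line.
ARTICLE_BREAK = re.compile(r"^\n-{30}\n\n", re.MULTILINE)

def split_digest(body: str) -> list:
    """Return the articles found after the first article break."""
    first = FIRST_BREAK.search(body)
    if first is None:
        return []
    # Skip the line ends that follow the first break.
    rest = body[first.end():].lstrip("\n")
    # The breaks themselves are not included in any article.
    return ARTICLE_BREAK.split(rest)

body = (
    "Introduction\n"
    + "-" * 76 + "\n\n"
    + "From: a\n\nfirst\n"
    + "\n" + "-" * 30 + "\n\n"
    + "From: b\n\nsecond\n"
)
for article in split_digest(body):
    print(repr(article))
```

With the sample body this yields two articles, the first beginning with "From: a" and the second with "From: b".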
Placing extracted articles into a newsgroup:
If necessary, create the newsgroup in the off-line reader.
Yarn will also allow you to set up a posting address so that
posted articles will automatically be posted to the list and
not to a phantom newsgroup.
DDigest will add a "Newsgroups:" line to each extracted article
using the "NewsGroup" setting in the configuration file. For
example, if the configuration file contains the setting:
NewsGroup=Yarn-List
All extracted articles will be placed into the newsgroup
"Yarn-List".
Regular Expressions:
Using a regular expression in place of a simple search string
will allow searching for a string that will change from one
digest to the next. For example, you want to look for the
line:
This is digest #123 in a series on weather
The following regular expression will match the entire line,
whatever the digest number:
This is digest #[[:digit:]]+ in a series on weather
The `[[:digit:]]+' part of the expression will match any
sequence of one or more digits. It is composed of an operator
`[[:digit:]]', and a modifier `+'. The operator will match
one digit. The modifier tells DDigest to apply the operator
one or more times. A modifier may follow an operator,
ordinary character, or parenthesized expression to match
multiple occurrences of what the operator, character, or
parenthesized expression would match.
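To try the pattern outside DDigest, here is a rough Python equivalent. Python's `re` module does not accept the `[[:digit:]]' class, so `[0-9]' stands in for it; the rest of the expression is unchanged.

```python
import re

# [0-9] plays the role of the operator, + is the modifier.
pattern = re.compile(r"This is digest #[0-9]+ in a series on weather")

for line in (
    "This is digest #123 in a series on weather",
    "This is digest #7 in a series on weather",
    "This is digest # in a series on weather",
):
    print(line, "->", bool(pattern.fullmatch(line)))
```

The first two lines match; the third does not, because `+' requires at least one digit.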
An anchor indicates that the character preceding or following
it should start or end a word or line.
To Find                      Operator  Examples
===========================  ========  ==========================
Any single character.        .         "f.t" finds "fat,"
                                       "fet," and "fit."

One of the specified         []        "f[ae]t" finds "fat"
characters.                            and "fet," but not
                                       "fit."

Any single character         [^]       "f[^ae]t" finds "fit,"
except one of the                      but not "fat" and
specified characters.                  "fet."

Any word constituent         \w        "f\wt" finds "fit,"
character. (A word                     "fat," and "f_t" but
constituent character is               not "f t" or "f,t."
a letter, number or
underline.)

Any non word constituent     \W        "f\Wt" finds "f t" or
character.                             "f,t" but not "fit,"
                                       "fat," and "f_t."

Any character that would     \         "f\+t" finds "f+t" but
otherwise be an operator,              not "fft," while "f+t"
modifier, or anchor.                   finds "fft" but not
(These are "^.[$()|*+?{\")             "f+t."

A line separator.            $         "f$t" finds text where
                                       "f" is the last letter
                                       on one line and "t" is
                                       the first letter on
                                       the next line.
Any operator or character could be followed by a modifier to
match multiple occurrences of a character or expression.
To Find                      Modifier  Examples
===========================  ========  ==========================
Zero or more occurrences     *         "f.*t" finds "ft,"
of the previous character              "fat," "fabt," and "f
or expression.                         a whole lot more text
                                       t."

One or more occurrences      +         "f.+t" finds "fat,"
of the previous character              "fabt," and "f a whole
or expression.                         lot more text t," but
                                       not "ft."

Zero or one occurrence       ?         "f.?t" finds "ft" and
of the previous character              "fat," but not "fabt"
or expression.                         or "f a whole lot more
                                       text t."

Exactly <n> occurrences      {n}       "fa{3}t" finds "faaat"
of the previous character              but not "faat" or
or expression.                         "faaaat."

At least <n> occurrences     {n,}      "fa{3,}t" finds
of the previous character              "faaat" and "faaaat"
or expression.                         but not "fat."

Between zero and <n>         {,n}      "fa{,3}t" finds "ft,"
occurrences of the                     and "faaat" but not
previous character or                  "faaaat."
expression.

Between <n1> and <n2>        {n1,n2}   "fa{2,3}t" finds
occurrences of the                     "faat" and "faaat" but
previous character or                  not "fat" or "faaaat."
expression.
To indicate that a           Anchor    Examples
character or expression
must:
===========================  ========  ==========================
start a line.                ^         "^hi" will match "hi"
(This must occur at the                only if it is at the
beginning of an                        beginning of a line.
expression.)

end a line.                  $         "hi$" will match "hi"
                                       only if it is at the
                                       end of a line.

start a word.                \<        "\<hi" will match "hi"
                                       and "hit," but not
                                       "phi."

end a word.                  \>        "hi\>" will match "hi"
                                       and "phi" but not
                                       "hit."
Brackets may be used to match a range or class of characters.
A list of characters enclosed by `[' and `]' matches any
single character in that list; if the first character of the
list is the caret `^' then it matches any character not in the
list. For example, the regular expression _[0123456789]_
matches any single digit. A range of ASCII characters may be
specified by giving the first and last characters, separated
by a hyphen. Finally, certain named classes of characters are
predefined. Their names are self explanatory, and they are
`[:alnum:]`, `[:alpha:]`, `[:cntrl:]`, `[:digit:]`,
`[:graph:]`, `[:lower:]`, `[:print:]`, `[:punct:]`,
`[:space:]`, `[:upper:]`, `[:word:]`, `[:xdigit:]`. For
example, _[[:alnum:]]_ means _[0-9A-Za-z]_, except that the
latter form is dependent upon the ASCII character encoding,
whereas the former is portable. _[[:word:]]_ is equivalent to
_\w_, _[[:digit:][:alpha:]_]_ or _[0-9A-Za-z_]_. (Note that the
brackets in these class names are part of the symbolic names,
and must be included in addition to the brackets delimiting
the bracket list.) Most metacharacters lose their special
meaning inside lists. To include a literal `]' place it first
in the list. Similarly, to include a literal `^' place it
anywhere but first. Finally, to include a literal `-' place
it last.
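The placement rules for literal `]', `^' and `-' can be checked with a short Python experiment. Python's `re` module follows the same placement conventions for these three characters, but it does not support the named classes such as `[:digit:]`, so this sketch only covers the literal-character rules.

```python
import re

# "]" is placed first, "^" is not first, and "-" is last,
# so all three are taken literally inside the list.
specials = re.compile(r"[]^-]")

print(specials.findall("a]b^c-d"))

# A range of characters given by its first and last members.
lower_hex = re.compile(r"[a-f]")
print(lower_hex.findall("0a1g2f"))
```

The first call finds the three literal characters in order; the second finds only "a" and "f".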
A "^" should be used to match a line separator at the
beginning of an expression. Otherwise, a "$" should be used
to match a line separator. Only the "^" and "$" operators
will match a line separator.
Two adjacent (concatenated) regular expressions match a match
of the first followed by a match of the second.
Two regular expressions separated by | match either a match
for the first or a match for the second.
A regular expression enclosed in parentheses matches a match
for the regular expression.
A regular expression enclosed in parentheses followed by a
modifier will match an integer multiple of the enclosed
expression. For example, "(abc)*" will match all of "abcabc"
or "abcabcabc" but will only match the first 6 characters of
"abcabcab."
The order of precedence of operators at the same parenthesis
level is "[]" then "*+?{}" then concatenation then "|".
An expression will match the longest string it can.
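The behaviour of a parenthesized expression followed by a modifier, together with the longest-match rule, can be observed with Python's `re` module. Python is used here only as a convenient stand-in; for this simple pattern its greedy matching gives the same result as the longest-match rule described above.

```python
import re

# The group repeats as a whole, so a trailing partial "ab" is left unmatched.
m = re.match(r"(abc)*", "abcabcab")
print(m.group(0))  # the matched text
print(m.end())     # number of characters consumed
```

The match covers "abcabc", which is six characters, and stops before the incomplete "ab".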
Multiple line matches are limited to five lines.
The reported start and end of a match may be specified by
embedding a `\s` or `\e` in the regular expression. This will
allow using part of the next article in the search pattern.
For example, the search pattern _^$-{30}$$\eFROM:_ will match an
article break composed of: a blank line, followed by a line
containing 30 dashes, followed by a blank line, followed by a
line starting with _FROM:_. Everything following the `\e`
will be included in the following message.
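Python's `re` module has no `\e` marker, but a lookahead gives a comparable effect: the text after the break must be present for the break to match, yet it is not consumed and so stays with the following article. The sketch below is an approximation under that assumption, not a description of how DDigest implements `\e`.

```python
import re

# Blank line, thirty dashes, blank line, then a line starting with "FROM:".
# The lookahead (?=FROM:) checks for the header without consuming it.
BREAK_BEFORE_FROM = re.compile(r"^\n-{30}\n\n(?=FROM:)", re.MULTILINE)

text = (
    "first article\n"
    + "\n" + "-" * 30 + "\n\n"
    + "FROM: bob\nsecond article\n"
    + "\n" + "-" * 30 + "\n\n"
    + "not a header\n"
)
parts = BREAK_BEFORE_FROM.split(text)
print(len(parts))
print(parts[1].splitlines()[0])
```

Only the first row of dashes is treated as a break. The second one is followed by ordinary text rather than "FROM:", so it remains inside the second article.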
Regular expressions used by DDigest are the same as those used
by Yarn with the following exceptions:
1) A "\" may only be followed by one of the following
"^.[$()|*+?{\", or "sewW". The sequence "\w" will match
any word constituent character, not a "w". A "\" followed
by any other character is illegal.
2) A "$" may also be used to match a newline in the middle of
a string.
3) The "{}" operator may be used in place of "*+?".
The command line is:
DDigest <config file> <Email file> <rnews file>
or
DDigest <config file> <Yarn folder> <rnews file> -Y
or
DDigest <config file> <Text file> <rnews file> -T
Where:
<config file>
This file contains the search patterns used to find the
digest and to separate it into individual articles.
<Email file>
This file will be found in the soup packet imported into
the off-line news reader.
<rnews file>
This file will be generated by DDigest. It could then be
imported into the off-line news reader as an rnews file.
<Yarn Folder>
A file contained in Yarn's (USER)\mail directory.
<Text file>
A file containing a single digest in text format.
If the <Email file>, <Yarn Folder>, or <Text file> name is
replaced with a `-`, the input will be taken from the standard
input. If the <rnews file> is replaced with a `-`, the output
will be sent to the standard output. This will allow the input
to be piped from another program and the output to be piped to
another program.
To De-Digestify a mailing list digest:
1) Unzip the soup packet containing Email into an empty
directory.
2) Decide which file contains Email. This is probably named
"0000000.MSG" if you are using UQWK to create your SOUP
packets. If this does not work, you can find out which
one it is by looking at the AREAS file. Each line contains
three fields separated by tabs. The first field is the
file name (Without the ".MSG" extension). The second field
is the file type. Look for the line containing "Email" in
the second field. On my system, the line is "0000000
Email bn". Therefore the file name is "0000000.MSG".
3) Execute the command (substituting your filenames):
DDigest list.CFG 0000000.MSG list.MSG
4) Import the resultant file:
import -r list.MSG
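Step 2 can be automated with a few lines of Python. This sketch relies only on the AREAS layout described above (tab-separated fields, file name first, file type second); the function name `find_email_file` is hypothetical, and the second sample line is invented for illustration.

```python
from typing import Optional

def find_email_file(areas_text: str) -> Optional[str]:
    """Return the .MSG file name of the line whose second field is "Email"."""
    for line in areas_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1] == "Email":
            # The first field is the file name without its extension.
            return fields[0] + ".MSG"
    return None

# The first line is the one quoted in step 2; the second is made up.
areas = "0000000\tEmail\tbn\n0000001\tcomp.os.msdos.misc\tun\n"
print(find_email_file(areas))
```

The returned name is the file to pass to DDigest as the <Email file> argument in step 3.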
Notes:
* Each article extracted from the digest will have a unique
Message-Id based on the Message-Id of the digest.
Therefore, if a digest is imported twice, the articles
will be recognized as duplicates.
* The program will also insert the following headers into
each extracted article based on the digest headers:
X-Digest-Subject: <Digest subject>
X-Digest-Date: <Digest Date>
* The 'X-Digest-Subject:', 'X-Digest-Date:', 'Message-Id:',
and 'Newsgroups:' lines are added to the beginning of each
extracted article.
* The closing will be output as an extra article without a
subject.
* The search strings must be contained on one line, but may
be any length up to 255 characters long.
* A temporary file is created if either the '-T' option is
used or if the input is piped into ddigest.
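The header handling described in these notes can be pictured with a small Python sketch. The way DDigest actually derives each Message-Id is not documented here, so the scheme below (inserting the article's position into the digest's Message-Id) is purely an assumption, as is the function name `add_digest_headers`. The point it illustrates is that a deterministic Message-Id gives the same result on every run, which is what lets the news reader recognize duplicates.

```python
def add_digest_headers(article: str, digest_subject: str, digest_date: str,
                       digest_id: str, index: int, newsgroup: str) -> str:
    """Prepend the four headers named in the notes to an extracted article."""
    # Hypothetical scheme: <local@domain> becomes <local.N@domain>.
    local, _, domain = digest_id.strip("<>").partition("@")
    message_id = f"<{local}.{index}@{domain}>"
    headers = [
        f"X-Digest-Subject: {digest_subject}",
        f"X-Digest-Date: {digest_date}",
        f"Message-Id: {message_id}",
        f"Newsgroups: {newsgroup}",
    ]
    return "\n".join(headers) + "\n" + article

out = add_digest_headers(
    "Subject: hello\n\nBody\n",
    "Yarn-List Digest",
    "Sun, 4 Jun 1995",
    "<abc@example.com>",
    2,
    "Yarn-List",
)
print(out)
```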
Bob Rush
bobr@mcs.com