Text File | 1993-02-07 | 25KB | 650 lines
UNPOST
Name:
unpost - Extract binary files from multi-segment uuencoded USENET
postings or Email.
Synopsis:
unpost -f[-] -d[-] -c config -e errors -t text -i incompletes [file]
Description:
UNPOST is a tool designed primarily to extract binaries from USENET
binaries postings, such as those made to alt.binaries.pictures.misc
and comp.binaries.ibm.pc. UNPOST can also extract binaries from
multi-segment uuencoded mailings; to simplify this documentation,
however, only USENET article postings are discussed. The principles
are the same for multi-segment mailings.
UNPOST assumes that the source file that is given to it will have the
following format:
SEGMENT begin line
...
HEADER ID line
...
BODY ID line
...
UUENCODED line
The lines are:
SEGMENT begin line - Is the line that identifies the beginning of a
segment.
HEADER ID line - One or more lines that contain segment number,
total number of segments or the ID string in
the article or mail header.
BODY ID line - One or more lines that contain segment number,
total number of segments or the ID string in
the article or mail message body.
UUENCODED line - Is the first uuencoded line in the file.
UUencoded lines include the begin and end lines.
... - Indicates zero or more lines that can contain
any information so long as they CANNOT be
misidentified as SEGMENT begin, ID or UUENCODED
lines.
Notice that the ID information can be spread across multiple lines. A
segment is assumed to end at the beginning of the next segment, or at
the end of the source file. An UNPOST source file contains one or more
segments.
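The segment layout above can be sketched in Python. This is only an
illustration, not UNPOST's own code: the SEGMENT_BEGIN pattern is an
assumption (an rn-style "Article <n> of <group>:" line), and the real
patterns come from the UNPOST configuration.

```python
import re

# Assumption: a segment starts at an rn-style "Article <n> of <group>:" line.
# UNPOST's real SEGMENT begin patterns are defined in its configuration.
SEGMENT_BEGIN = re.compile(r"^Article \d+ of \S+:")


def split_segments(lines):
    """Group lines into segments, each starting at a SEGMENT begin line."""
    segments = []
    current = None
    for line in lines:
        if SEGMENT_BEGIN.match(line):
            # A new SEGMENT begin line ends the previous segment.
            current = [line]
            segments.append(current)
        elif current is not None:
            current.append(line)
        # Lines before the first SEGMENT begin line belong to no segment.
    return segments
```

The last segment simply runs to the end of the input, matching the rule
that a segment ends at the next SEGMENT begin line or at end of file.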
UNPOST has three different modes: interpretation mode, concatenation
mode and UU decoder mode. In all three modes, UNPOST can accept one
or more input files.
In the first mode, interpretation mode, UNPOST looks at article header
and body lines before the first UU encoded line, and attempts to extract
three pieces of information from them: segment number, total number
of segments that the binary was split into, and an ID string that is
common to all segments. If UNPOST finds something that it considers
to be an ID string, and a uuencoded line in the article, but it does
not find a segment number and number of segments, UNPOST assumes that
the article is a single segment binary posting (part 1 of 1).
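A minimal Python sketch of this extraction follows. The two patterns
and the function name interpret are hypothetical; they assume the ID
information sits on a Subject line of the form "Subject: <id> (Part 2/5)"
and are not the patterns UNPOST ships with.

```python
import re

# Hypothetical patterns: "Subject: <id> (Part 2/5)" or "Subject: <id> (2/5)".
PART_RE = re.compile(r"^Subject: (.*?)\s*[\(\[](?:[Pp]art )?(\d+)/(\d+)[\)\]]")
SUBJECT_RE = re.compile(r"^Subject: (.+)$")


def interpret(header_lines, has_uu_data):
    """Return (id_string, segment_number, total_segments), or None."""
    for line in header_lines:
        m = PART_RE.match(line)
        if m:
            return m.group(1), int(m.group(2)), int(m.group(3))
    if has_uu_data:
        for line in header_lines:
            m = SUBJECT_RE.match(line)
            if m:
                # ID string and uuencoded data but no part numbering:
                # assume a single segment posting (part 1 of 1).
                return m.group(1).strip(), 1, 1
    return None
```

For example, interpret(["Subject: cat.gif (Part 2/5)"], True) yields the
ID string "cat.gif", segment 2 of 5.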
To aid in finding out what happened, in interpretation mode UNPOST
will write a list of all the different ID strings and their respective
segment lists to standard error or the file specified as the error
file (see Standards section for details of what an ID string is).
Any errors or warnings detected during processing will also be
written to standard error or error file.
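The kind of per-ID segment list described above could be built roughly
as follows. This is a sketch only: the function name segment_report and
the tuple layout are assumptions, and the actual report format written
by UNPOST is not reproduced here.

```python
from collections import defaultdict


def segment_report(found):
    """Map each ID string to (segments seen, segments missing).

    found is an iterable of (id_string, segment_number, total_segments).
    """
    seen = defaultdict(set)
    totals = {}
    for id_string, number, total in found:
        seen[id_string].add(number)
        totals[id_string] = max(total, totals.get(id_string, 0))
    report = {}
    for id_string, numbers in seen.items():
        expected = range(1, totals[id_string] + 1)
        missing = [n for n in expected if n not in numbers]
        report[id_string] = (sorted(numbers), missing)
    return report
```

An ID string whose missing list is empty has all of its segments and is
a candidate for decoding.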
In interpretation mode three other files can optionally be created.
All three of these files will contain segments copied out of the source
file, and none of these files will be created unless they are turned
on and named by a command line switch.
The first optional file that UNPOST can create for the user in
interpretation mode is the text file (-t switch). This file will have
copied to it all segments from the source file that do not contain
uuencoded data.
The second optional file is the description file (-d switch).
Segments of the part 0/# type that do not contain uuencoded data are
NOT copied to the text file. They are considered to be description
segments, and they are copied to the description file only if the
-d switch is turned on. In addition, every binary posting that has
all of its segments present will have the segment header and body
of segment #1 (up to and including the uuencode begin line) copied
into the description file.
The third optional file that can be created in interpretation mode is
the incomplete or unused uuencode data segments file. This file
contains all segments that have uuencoded data, that were not used in
a successful uudecoding. This file will only be created if the -i
switch is present.
The incompletes file allows the user to hand decode those binaries that
could not be interpreted or decoded by UNPOST. Often a binary
will have all of its parts, but UNPOST will not be able to put them
together because of differences in the ID string between segments, or
problems with the part numbering information. The simplest way to
solve these problems is to collect the incompletes, edit the ID
lines to correct the problem, and rerun UNPOST on the incompletes
file.
In the second mode, concatenation mode, UNPOST assumes that all of the
segments in the source file between a uuencode begin and a uuencode
end line are part of one binary posting and that the segments are in
order. UNPOST scans from the beginning of the file until it finds a
uuencode begin line, and decodes from there (skipping over
non-uuencoded lines such as article header lines and signatures) until
it finds a uuencode end line.
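The scan described above can be sketched in Python with the standard
binascii module. The helper names are hypothetical, and the test used to
tell uuencoded lines from other lines (legal characters plus an exact
length check) is an assumption about one reasonable approach, not a
description of UNPOST's internals.

```python
import binascii


def looks_uuencoded(line):
    """True if the line length matches the byte count in its first character."""
    if not line or not all(" " <= ch <= "`" for ch in line):
        return False
    nbytes = (ord(line[0]) - 32) & 0x3F
    return len(line) == 1 + 4 * ((nbytes + 2) // 3)


def decode_in_order(lines):
    """Decode one binary from lines whose segments are already in order."""
    data = bytearray()
    name = None
    for line in lines:
        line = line.rstrip("\n")  # keep trailing spaces, they are data
        if name is None:
            parts = line.split(" ", 2)
            if len(parts) == 3 and parts[0] == "begin" and parts[1].isdigit():
                name = parts[2]
            continue  # everything before the begin line is skipped
        if line == "end":
            return name, bytes(data)
        if looks_uuencoded(line):
            data += binascii.a2b_uu(line)
        # Anything else (headers, signatures) is skipped.
    raise ValueError("missing uuencode end line")
```

A header or signature line that happens to pass the length check would
be decoded as data, which is why this mode depends on well-formed input.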
In the last mode, UU decoder mode, UNPOST assumes that the source
file contains one or more UU encoded files. Only UU encoded lines
are allowed between the uuencode begin line and the uuencode end line
of any single uuencoded file.
Options:
-c <file> To read and use a different configuration than the
default configuration. The default configuration is
stored in a file called def.cfg.
-d Turns on description capturing and writes descriptions
to a file that has the same name as the output but with
a .inf extension. This defaults to off.
-e <file> Redirects error and information output from standard
error to <file>.
-f[-] Modify file names to be MS-DOS compatible. Use of -f
turns file name modification on if the default is off,
and -f- turns file name modification off if the default
is on. File name modification is currently the default.
-h Turns on full interpretation mode. This is the default.
-i <file> Turns on incomplete binaries capturing and writes the
segments to file <file>.
-s Switch to ordered segment (concatenation) mode. This mode ignores
article headers, and assumes that the segments are in order.
-t <file> Turns on text only segment capturing and writes the segments
to <file>.
-u Switch to uudecoder mode. Assume only uuencoded data
between begin and end lines. Multiple uuencoded files
are allowed.
-? Show a summary of the command line switches.
It is important to realize that UNPOST
Standards:
In all modes, UNPOST recognizes and decodes only uuencoded data.
In interpretation mode UNPOST requires that:
1) The uuencoded lines be true uuencoded lines. This means
that if trailing spaces are truncated by a mailer, editor
or news node, UNPOST will not consider those lines to
be uuencoded lines. Also, the uuencode character set
recognized by UNPOST is ' ' - '`', with no other characters
being legal.
2) That all segments of the same binary file posting have
the same, recognizable ID string.
3) Segments have a recognizable SEGMENT begin line as the
first line in the segment (denoting the beginning of a
segment).
4) That all ID lines follow the SEGMENT begin line in the
segment.
5) That the first UUencoded line of the segment follows the
last ID line.
6) That the first uuencode line in the first segment be a
begin line.
7) That the last segment contain a uuencode end line.
In ordered segment (concatenation) mode, UNPOST requires that:
1) The uuencoded lines be true uuencoded lines. This means
that if trailing spaces are truncated by a mailer, editor
or news node, UNPOST will not consider those lines to
be uuencoded lines. Also, the uuencode character set
recognized by UNPOST is ' ' - '`', with no other characters
being legal.
2) That the segments be stored in the file in order.
3) That the first uuencode line in the first segment be a
begin line.
4) That the last segment contain a uuencode end line.
In uudecoder mode, UNPOST requires that:
1) There be only uuencoded lines between a uuencode begin and
a uuencode end line. In this mode, UNPOST will recognize
and attempt to repair lines that had trailing spaces
truncated.
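The difference between a true uuencoded line and one that lost its
trailing spaces can be illustrated in Python. The function names are
hypothetical and the repair shown (padding back to the length implied by
the first character) is an assumed approach, not necessarily the one
UNPOST uses.

```python
def expected_length(line):
    """Line length implied by the byte count in the first character."""
    nbytes = (ord(line[0]) - 32) & 0x3F
    return 1 + 4 * ((nbytes + 2) // 3)


def has_legal_chars(line):
    """True if every character is in the set ' ' through '`'."""
    return bool(line) and all(" " <= ch <= "`" for ch in line)


def is_true_uu_line(line):
    """True only if characters are legal and the length is exact."""
    return has_legal_chars(line) and len(line) == expected_length(line)


def repair_truncated(line):
    """Restore stripped trailing spaces; return None if not repairable."""
    if not has_legal_chars(line):
        return None
    want = expected_length(line)
    if len(line) > want:
        return None
    return line.ljust(want)
```

For instance, three zero bytes encode as "#" followed by four spaces; a
mailer that strips the spaces leaves just "#", which fails the strict
test but can be padded back to five characters.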
Examples:
To extract a single binary that had all of its segments saved in order
to a single file:
unpost -s binary.uue
To extract all binaries that have had all of their segments saved
to a single file:
unpost multiple.uue 2> errors
Or
unpost -e errors multiple.uue
The file errors will contain a list of all the ID strings that UNPOST
found and thought could have been binary files, and any errors
that occurred during processing.
To capture the incomplete or unused segments that have uuencoded
data in them:
unpost -e errors -i multiple.inc multiple.uue
To capture descriptions and text only segments as well:
unpost -d -e errors -t text -i multiple.inc multiple.uue
To process two different files, one in uudecoder mode, one in
interpretation mode:
unpost -e errors -u uuencode.uue -h multiple.uue
To process a file that requires a different configuration:
unpost -c config -e errors multiple.uue
Notes:
To use this program to collect all of the binaries posted to, say,
the alt.binaries.misc group on a daily basis, start up rn, go to
the alt.binaries.misc newsgroup, and save all of the unread articles
by using this command:
.-$smisc.uue:j
This will save all articles from the current number to the last to
the file misc.uue, then junk them. After exiting rn, run UNPOST
on the file misc.uue in interpretation mode (default mode):
unpost -e errors -i misc.1 misc.uue
Make sure to check the errors and/or misc.1 file for segments
that UNPOST couldn't extract.
Diagnostics:
Error - file 'filename' already exists.
UNPOST will not overwrite an existing file. Delete the file or
rename it and try again.
Error - missing begin line.
UNPOST expected to find a uuencode begin line in this segment,
but did not.
Error - Could not open description file 'filename' for writing.
UNPOST could not open a file of that name for some reason.
Possibly a permission problem, or the file exists and is not
writeable.
Error - Bad write to binary file.
A file write failed for some unknown reason. Possibly a full
disk?
Error - missing segment #
Binary ID: 'binaryID'
In attempting to decode a file whose ID string is binaryID,
one or more segments are missing.
Error - Missing UU end line.
As this is the last segment, it ought to have a uuencode end
line in it, but UNPOST did not find one.
Warning - Early uuencode end line.
UNPOST found a uuencode end line, but this was not the last
segment, so we found it early. Did the poster screw up and
misnumber his segments?
Error - Unexpected UU begin line.
We found an unexpected (read: this is not the first line of the
first segment, so what is this doing here?) UU begin line.
Error - cannot identify string '' in line #
In reading in a configuration file, the configuration file
lexical analyzer could not recognize this string.
Error - Out of memory.
Yup. Out of memory. Split the source file into smaller
pieces and try again.
Error - Could not modify file name to be MS-DOS conformant.
File name mungling is turned on, and the name of one of the
files cannot be made conformant (probably due to having too
many numbers in it).
Warning - Unexpected end of file in segment:
Segment: 'segment line'
File name mungling is turned on, and UNPOST is attempting to
identify the file type (so it can use the proper extension
when modifying the file name) but the UU begin line was the
last line in the file.
Warning - No UU line after begin.
Segment: 'segment line'
File name mungling is turned on, and UNPOST is attempting to
identify the file type (so it can use the proper extension
when modifying the file name) but the UU begin line was not
followed by a line of UU encoded binary data.
Error - Got number of segments but not segment number.
Error - Got segment number but not number of segments.
UNPOST must have all three pieces of relevant data, but if
UNPOST has at least an ID string, UNPOST will attempt to
assume a one part binary.
Error - Could not get ID string.
Fatal error. With no ID string, there is no way to collect
the pieces together.
Error - No begin line in first segment:
Segment: 'segment line'
UNPOST did not find a UU begin line in the first segment.
Error - missing '}' in regular expression.
In a regular expression of the type abc{1, 2}, the closing curly
brace is missing.
Error - To many sub-expressions.
UNPOST has a limit on the number of sub-expressions it
allows. This is a compile time option that can be changed
by modifying the value of MAX_SUB_EXPRS in regexp.h.
Error - missing ')' in regular expression.
Mismatched parentheses.
Error - badly formed regular expression.
Unexpected character 'c'
I give up! What is this character doing at this point in
a regular expression?
Error, can not enumerate a sub expression.
Regular expressions of the type: (...)* are not allowed.
Error - illegal regular expression node type.
Whoops, we have an internal programmers error here. Let
me know if you see this.
Error - Sub expression # extraction failed.
Another internal error that needs to be brought to my attention.
Error - could not open file 'filename' for reading.
UNPOST could not open file 'filename' for processing. Did you
spell it right?
Error - Unexpected end of file.
Error - Unexpected UU begin line.
Error - Segment number # greater than number of segments in:
Segment: 'segment line'
Either UNPOST got screwed up somehow or the poster posted
something like (Part 10/9).
Warning - duplicate segment # in:
Binary ID: 'binaryID'
UNPOST found two segments with the same binary ID and the
same segment number.
Error - reading source file.
Could not read a line from the source file.
Error - Could not open file 'filename' for output.
Could not open one of the text, incomplete or error files
for writing.
Configuration:
Ok, here's how to configure UNPOST to work for you. UNPOST relies
heavily on regular expressions. These regular expressions may
not be correct for your news reader, or system.
There are five classes of regular expressions:
1) The SEGMENT begin line regular expression.
2) The ID line prefix regular expression.
3) The ID line with part description regular expression list.
4) The begin line regular expression.
5) The end line regular expression.
Of these five, I don't expect you to have to modify the regular
expressions for handling begin and end lines, because they should
be correct for all uuencoders that follow the standard format.
Be aware that UNPOST has a hierarchy of regular expressions.
Each SEGMENT begin line regular expression has underneath it two
lists of regular expressions that recognize ID line prefixes,
and each element in the list of ID line prefix regular expressions
has a list under it that attempts to parse the ID line.
The two lists are for 1) the header and 2) the body.
The ID line prefix regular expression exists for the sake of
efficiency. It is used to find an ID line before we attempt
to parse it. Modify or add one of these if you wish to change
whether or not a line is recognized by UNPOST as being an ID line.
If you modify this, you must modify the list of segment description
regular expressions to match.
The SEGMENT begin line regular expressions are used to find the beginning
of a SEGMENT, or the end of a previous segment. Modify these to change
the line or lines that UNPOST recognizes as the beginning of a segment.
If you get an error message that indicates that the Subject line
has no identifiable part description, and you see that some bright
boy/girl has come up with a brand new part description format, then
you have two choices, modify the source and hope they don't post
again, or add a new ID line regular expression to the list of
ID line regular expressions in the segment.c source file.
Be aware that the lists of regular expressions are searched in order
from top to bottom to find a match. This means that less specific
regular expressions should be placed later in the list. For example:
the regular expression '\(([0-9]+)/([0-9]+)\)' should come before the
regular expression '([0-9]+) ([0-9]+)' in the part syntax parsing regular
expression list. This reduces the number of misparses that occur.
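The effect of list order can be seen in a short Python sketch. Python's
re syntax stands in for UNPOST's here, and the pattern list and the
function parse_part are illustrative assumptions rather than entries
from the real configuration.

```python
import re

# Most specific pattern first, most general last.
PART_PATTERNS = [
    re.compile(r"\(([0-9]+)/([0-9]+)\)"),  # specific: "(2/5)"
    re.compile(r"([0-9]+) ([0-9]+)"),      # general: "2 5"
]


def parse_part(subject, patterns=PART_PATTERNS):
    """Return (segment, total) from the first pattern that matches."""
    for pattern in patterns:
        m = pattern.search(subject)
        if m:
            return int(m.group(1)), int(m.group(2))
    return None
```

With the subject "pic 640 480 (2/5)" the list as written yields (2, 5).
Reversing the list lets the general pattern match first and misparses
the image dimensions as part 640 of 480.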
Remember that C uses the backslash (\) as an escape character in
strings, so to put a backslash into a regular expression you
need to put two into the C source string.
All regular expressions can be found at the top of the parse.c source
file. Before you modify the actual source code and recompile, I
strongly suggest that you compile the regular expression test harness
and test your new regular expression. Then, when you are sure that
it is correct, copy the def.cfg file to a new name, make your changes
there and use that configuration file for a while. If after all this,
you are sure that it works, go for it.
Before you add or modify a regular expression, you have to know the
syntax of the regular expressions used in this program. The syntax
is very similar to that used by UN*X style regular expressions,
but is not exactly the same. See the section titled Regular
Expressions before attempting to configure UNPOST.
Regular Expressions:
Operands
--------
UNPOST regular expressions have three types of operands: character
strings (one or more characters), character sets and match any
single character. A character string is any series of adjacent
characters that are not meta-characters (special characters).
A character set is a string of characters enclosed in square brackets
with an optional caret (^) as the first character following the open
square bracket. The match any character operand matches any single
character except the end of line character.
A character string in a regular expression matches the exact string
in the source, including case.
Example of character strings:
AirPlane - Matches the string 'AirPlane', but not the strings
'airPlane' or 'Airplane'.
A character set will match any single character in the source if
that character is a member of the set. If the first character
of the set is the caret, the character set will match any
character that is NOT a member of the set (including control
characters!) except for NUL and LF.
A character set can be described using ranges.
Examples of character sets:
[abcd] - Matches either a, b, c or d.
[0-9] - Matches any decimal digit.
[^a-z] - Matches any character that is not a lower
case alphabetic.
The match any character operand does just that, it matches any
character. But it does not match the case of no character, NUL
or LF.
Example of match any character:
. - Matches any character.
Operators
---------
UNPOST regular expressions also contain operators. The operators that
UNPOST recognizes are the alternation operator, the span operators, the
concatenation operator and the enumeration operators.
The alternation operator has the lowest precedence of all the operators
and its action is to attempt to match one of two alternatives.
Example of alternation:
Airplane|dirigible - Matches either the string Airplane or the string
dirigible.
The next higher precedence operator is the catenation operator. The
catenation operator specifies that both the left and right hand
regular expressions must match. The catenation operator does not
have a special character, it is assumed to exist between two
different operands that have no other operator between them.
Example of catenation:
[Aa]irplane - Matches either an 'A' or an 'a' followed by the string
irplane. This is a catenation of the two regular
expressions [Aa] and irplane.
The next higher precedence operator is the enumeration operator.
The enumeration operator specifies how many instances of a regular
expression must be matched.
Examples of Enumeration:
abc* - Matches zero or more occurrences of the string abc.
[A-Z]+ - Matches one or more occurrences of an upper case
alphabetic character.
[ ]? - Matches zero or one occurrence of the space character.
very{1} - Matches one or more occurrences of the string very.
b{1,3} - Matches a minimum of one and a maximum of three occurrences
of the string b.
An enumeration operator attempts to match the largest source sub-
string possible, except in the case of the . (match any character)
followed by an enumeration operator. In this case, the smallest
possible sub-string is matched.
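The difference between largest and smallest matching can be shown with
Python's re module as a stand-in. Note the assumption: in Python the
minimal form must be requested explicitly with ".*?", whereas the text
above says an enumerated "." is minimal by default in UNPOST.

```python
import re

GREEDY = re.compile(r".*([0-9]+)/")    # ".*" takes as much as it can
MINIMAL = re.compile(r".*?([0-9]+)/")  # ".*?" takes as little as it can


def segment_number(line, minimal):
    """Return the digits captured just before the slash, or None."""
    pattern = MINIMAL if minimal else GREEDY
    m = pattern.match(line)
    return m.group(1) if m else None
```

On the line "Subject: pic (12/34)" the minimal form captures "12", while
the greedy form swallows the "1" and captures only "2".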
The precedence of the operators can be modified with the use of
parentheses. Parentheses have another meaning as well, described
below.
Example of parenthesis use:
Death( defying|wish) - Will match either the string 'Death defying'
or the string 'Deathwish'. Without the
parentheses, the regular expression would
match either the string 'Death defying'
or the string 'wish'.
Sub Expressions
---------------
UNPOST regular expressions are used primarily for identifying a
particular line and extracting substrings from that line. To
this end, UNPOST regular expressions support sub-expression
marking. Subexpressions are marked by parentheses.
To determine the sub-expression number of a sub-expression, scan
the regular expression from left to right, counting the number
of left parentheses. Start with one; the count reached at a
sub-expression's left parenthesis is its sub-expression number.
Example:
.*((abcd)(([0-9]+)/([0-9]+)))
Sub-expression ((abcd)(([0-9]+)/([0-9]+))) is sub-expression #1.
Sub-expression (abcd) is #2. Sub-expression (([0-9]+)/([0-9]+)) is #3.
The first sub-expression ([0-9]+) is #4. The second ([0-9]+) is #5.
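Python's re module numbers groups by the same left-parenthesis rule, so
it can be used to check a numbering by hand. The pattern below is an
illustrative equivalent written in Python syntax with [0-9] character
sets; it is not taken from the UNPOST sources.

```python
import re

# Groups are numbered by counting left parentheses from left to right.
PATTERN = re.compile(r".*((abcd)(([0-9]+)/([0-9]+)))")


def sub_expressions(line):
    """Return a dict of sub-expression number to matched text, or None."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return {n: m.group(n) for n in range(1, PATTERN.groups + 1)}
```

For the line "xx abcd12/34" this gives #1 "abcd12/34", #2 "abcd",
#3 "12/34", #4 "12" and #5 "34".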
Anchoring
---------
Normally, a regular expression will match a sub-string anywhere in
the source string. If you want to specify that the matching sub-string
must start at the beginning of the source string, you may use a caret
character as the first character of the regular expression. This
anchors the regular expression match to the start of the line.
To anchor a regular expression to the end of a line, use the dollar
sign character. This effectively matches the end of line or end
of string character.
Anchor operators have a higher precedence than alternation, but lower
than catenation.
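The anchor precedence can be checked with Python's re module, which
follows the same rule for this case. The pattern is only an example:
"^begin|end$" groups as (^begin)|(end$), so each alternative carries its
own anchor.

```python
import re

# Alternation binds loosest: this is (^begin) | (end$).
PATTERN = re.compile(r"^begin|end$")


def matches(line):
    """True if the line starts with 'begin' or finishes with 'end'."""
    return PATTERN.search(line) is not None
```

So "begin 644 x" and "the end" both match, while "we begin" and
"end of begin" do not.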
Bugs:
This program has been pretty extensively tested in interpretation mode,
and it appears to be both robust and flexible.
Unfortunately, about once a week, somebody comes up with a new and
unusual way to encode the parts description on the Subject line.
Author:
John W. M. Stevens - jstevens@csn.org