PEP

Section: Misc. Reference Manual Pages (1L)
Updated: 28 December 1989
 

NAME

pep - a file detergent  

SYNOPSIS

pep [ -a ] [ -b ] [ -c [ size ]] [ -d + | - ]
      [ -e [ 0 | 1 | 2 ]] [ -g file ] [ -h ] [ -i + | - ]
      [ -k + | - ] [ -m + | - ] [ -o [ b ]] [ -p ]
      [ -s [ size ]] [ -t [ size ]] [ -u terminator ] [ -v ]
      [ -w + | - ] [ -x ] [ -z ] [ filename ... ]  

DESCRIPTION

Pep is a filter program to "clean" files. It is named after a popular Norwegian detergent.

Pep may be used to remove control characters, strip parity bits, interpret ANSI escape sequences, compress tabulation, extract strings and convert character sets. Nine out of ten hackers prefer "pep" to soap (which may very well explain why some of them smell the way they do).

Pep is a filter. Its default operation is to read from standard input (the keyboard) and write on standard output (the terminal).

You may also specify the name of one or more files as the last argument on the command line. Most versions of pep (not the version compiled for the DEC VMS operating system) allow ambiguous filename arguments, where a single filename argument may specify several files.

You may instruct pep to write the result back onto the original input file with the -o option. If you use this option, the original file will be lost. If you want to keep the original file (something that usually will be the case when you do things like extracting strings from an executable file), you should make a copy of the file before applying pep, and filter the copy rather than the original. Some of the functions in pep (in particular those selected with the -b and -s options) may remove a lot of material from files, and it may be unfortunate if this happens to the wrong file. It is probably a good idea to always use pep on copies until you have some experience with the various pep-options. You may also use the b argument on the -o option to save the original in a .BAK-file.

To get a brief summary of the command line syntax and all the options, you need to specify the -h option. Just type the command:

pep -h

followed by the RETURN key. Note that just pep will not give you this summary. The command:

pep

will start pep as a filter, and it will just echo back whatever you type, until you type the end of file character (usually CTRL-D or CTRL-Z).

When pep is running as a filter, it reads from the standard input and writes to the standard output. In this state, pep will be much less verbose than it usually is. It will still print error messages, but very little else. Note that while:

pep < foobar.in > foobar.out
pep -ob foobar.txt

will do more or less the same job, the first will do it quietly, in the tradition of Unix filters; the latter will print the copyright notice, a detailed list of the things it will do, and finally a list and line count of all the files it processes as it plods along.

Pep will remove some "noise" from files, even if no options are specified. The following is the default behavior:

* remove trailing spaces;
* terminate each line with the canonical line terminator (usually LF, CR or both);
* remove underlining intended for backspacing printers;
* remove control characters (character codes < 32) except canonical line terminator, FF and TAB;
* break the line before the FF if a line contains an FF anywhere except in the first column.
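
The default behavior listed above can be approximated in a few lines of Python. This is only an illustrative sketch, not pep's actual source: the function name clean is hypothetical, underlining is assumed to be encoded as an underscore and a backspace next to the underlined character, and LF is assumed as the canonical line terminator.

```python
import re

def clean(data: bytes, terminator: bytes = b"\n") -> bytes:
    """Rough approximation of the default cleaning (assumptions noted above)."""
    pieces = re.split(rb"\r\n|\n|\r", data)   # accept CR/LF, LF or CR as input line ends
    if pieces and pieces[-1] == b"":
        pieces.pop()                          # input ended with a terminator
    lines = []
    for raw in pieces:
        raw = re.sub(rb"_\x08(.)", rb"\1", raw)   # "_", backspace, "X" becomes "X"
        raw = re.sub(rb"(.)\x08_", rb"\1", raw)   # "X", backspace, "_" becomes "X"
        # Drop control characters below 32, keeping TAB (9) and FF (12).
        kept = bytes(b for b in raw if b >= 32 or b in (9, 12))
        # Break the line before any FF that is not in the first column.
        start = 0
        for i in range(1, len(kept)):
            if kept[i] == 12:
                lines.append(kept[start:i])
                start = i
        lines.append(kept[start:])
    # Remove trailing spaces and apply the chosen terminator.
    return b"".join(line.rstrip(b" ") + terminator for line in lines)
```

For example, clean(b"a_\x08b  \r\n") returns b"ab\n": the underlining and the trailing spaces are gone, and the CR/LF pair has become a single LF.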

If you want to check what pep actually intends to do to your file before it does it, you may make it pause with the -p option. For example:

pep -p foobar.txt

will make pep stop after displaying a list of the conversions it will apply to the file. The user is prompted and may choose to proceed (hitting the RETURN key), or abort the program without doing anything (hitting CTRL-C).

The user may want other conversions than the default action described above. A number of conversion functions may be selected by specifying one or more options on the command line.

Some of the options require an additional argument switch and must be followed by a "+" or a "-"; other options require a number or a filename argument. Most of the options may be combined with other options, but a few are mutually exclusive. If the user specifies invalid options or option arguments, then pep will abort with an error message and return an error exit code on operating systems that support exit codes.  

OPTIONS

-a
Write out information about pep.
-b
Remove all characters not in the original 7-bit character set (ISO 646), i.e. remove the characters which are encoded from 128 to 255. (If this option is combined with the -x option, it will print the codes for these characters in hexadecimal instead of removing them.) The -b option is powerful, and may remove a lot of bytes if you use it on the wrong file. Only use it if you know exactly how the eighth bit is used in the file you intend to filter. Also note that the options i, d, k, g, m, w or z in most cases are better suited to process files where the eighth bit is set.
-c [ size ]
Compress space into tabulation. I.e. insert TAB characters when replacing a run of two or more SPACE characters would produce a smaller output file. This function is the opposite of the function invoked with the -t option.
The default tabulation size is 8, but you may specify any other tabulation with the optional numeric argument.
-d + | -
Convert to or from the ISO 8859/1 8 bit character set and the Norwegian version of the ISO 646 7 bit character set. If the argument is "+", the file is converted to ISO 8859/1. If the argument is "-", the file is converted from ISO 8859/1. The ISO 8859/1 character set is also known as the "DEC Multinational Character Set".
-e [ 0 | 1 | 2 ]
Interpret ANSI screen control sequences (also known as ANSI ESCAPE sequences). This function makes pep emulate cursor positioning and other functions on an ANSI-terminal.
Pep will complain about "strange" (i.e. implementation dependent) use of ANSI escape sequences.
Pep will normally save a screen image on the output file when one of two events occurs: 1) when the screen is full and scrolls up; or 2) just before a screen image is erased with the "erase screen" ANSI screen control sequence. In some cases important fields on the screen will be overwritten or erased. There is no good solution to this problem, but pep provides the user with some opportunity to guard against overwriting and erasure. This is done by specifying an additional numeric argument to the -e option. This number indicates the level of protection and is interpreted as follows:

0:
no protection --- fields may be erased and overwritten (this is the default);
1:
sequences that erase fields are ignored;
2:
sequences that erase or overwrite fields are ignored.
-g file
Read the conversion table from a file. The name of the file must be appended as the argument to this option.
The file itself is a standard ASCII text file where each line should contain two decimal numbers. The first number is the character code to convert from, and the second number is the character code to convert to. A "#" character and all the following characters up to a NEWLINE are considered a comment, and are ignored. Comments are, however, echoed on the screen along with the other comments pep makes, unless the comment line starts with a "##".
Below is an example of how such a conversion file may look:

# Convert from Macintosh to IBM-PC
##This line is not echoed on the screen.
# MAC IBM
174 146
175 157
129 143
190 145
191 155
140 134
# EOF
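
To make the file format concrete, the following Python sketch reads such a table and applies it to a byte string. It illustrates the format as described above and is not pep's own parser; the names parse_table and convert are hypothetical, and the echoing of comments is left out.

```python
def parse_table(text: str) -> dict[int, int]:
    """Parse lines of 'from to' decimal pairs; '#' starts a comment."""
    table = {}
    for line in text.splitlines():
        body = line.split("#", 1)[0].strip()   # discard the comment part, if any
        if not body:
            continue                           # blank or comment-only line
        src, dst = (int(field) for field in body.split())
        table[src] = dst
    return table

def convert(data: bytes, table: dict[int, int]) -> bytes:
    """Map each byte through the table; bytes without an entry pass unchanged."""
    return bytes(table.get(b, b) for b in data)
```

With the example table above, byte 174 (upper case AE on the Macintosh) becomes byte 146, while plain ASCII letters are left alone.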
-h
Write a brief summary of pep options, and exit.
-i + | -
Convert to or from the IBM 8 bit character set (Code Page 850 Multilingual) and the Norwegian version of the ISO 646 7 bit character set. If the argument is "+", the file is converted to CP 850. If the argument is "-", the file is converted from CP 850. The CP 850 character set (or a subset of it) is what is used in the IBM PC, AT, and PS/2 series of computers and their clones. Note that some machines with American PROMs have a yen- and cent character in the position rightfully belonging to upper and lower case versions of the Norwegian character written as an "o" with a slash across it (often referred to as oslash).
-k + | -
Convert to or from an 8 bit character set and the ISO 646 7 bit character set. This is a modified version of the -i function, hacked to preserve both the backslash character and the upper case oslash character as required by, among others, the "KnowledgeMan" package. These characters share the same code (92 decimal) in 7 bit ISO 646, but use different codes (92 is backslash, 157 is oslash) in 8 bit CP 850. To get around this, two backslashes in ISO 646 will be converted to the upper case oslash character in CP 850, while a single backslash will be preserved --- and vice versa.
If this option is combined with the -d or -m option, the DEC/ISO or the Macintosh character set is used as base instead of CP 850.
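
The backslash rule can be sketched in Python as below. This covers only the doubled-backslash convention described above, with CP 850 as the base; the conversion of the other national characters is omitted, and the function names are hypothetical.

```python
OSLASH_UPPER = bytes([157])   # upper case oslash in CP 850
DOUBLE_BACKSLASH = b"\\\\"    # two backslash characters (code 92 twice)

def k_to_cp850(data: bytes) -> bytes:
    """Two backslashes become upper case oslash; a single one is preserved."""
    return data.replace(DOUBLE_BACKSLASH, OSLASH_UPPER)

def k_from_cp850(data: bytes) -> bytes:
    """The reverse direction: upper case oslash becomes two backslashes."""
    return data.replace(OSLASH_UPPER, DOUBLE_BACKSLASH)
```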
-m + | -
Convert to or from the Apple Macintosh 8 bit character set and the Norwegian version of the ISO 646 7 bit character set. If the argument is "+", the file is converted to the Macintosh character set; if the argument is "-", the file is converted from the Macintosh character set. See description of -v option below and note in "bugs" section below about treatment of "end-of-line" and "end-of-paragraph".
-o [ b ]
Pep will usually write the result of conversions on the standard output (stdout). This option instead instructs pep to replace each named input file with a file containing the result of filtering the file through pep. If the option is augmented with the argument b (i.e. -ob), then pep will create a backup copy of the original input file on a file with extension .BAK. If you just specify -o the original file is deleted.
The VMS version of pep will always run as if this option was specified. This is because VMS does not support useful redirection or pipes. Therefore, it is never necessary to specify the -o option under VMS, but users should still specify -ob if they want a backup copy of the original input file.
-p
Write out a brief description of the conversion functions that will be activated by the current set of options, and pause. The user may review the list of conversion functions and abort (by hitting CTRL-C) if they do not have the intended effect.
-s [ size ]
Find strings in extremely "noisy" files.
Pep's concept of a string is that it is a sequence of "printable" characters of a certain length. The default minimum length of this sequence is 4, but this may be changed by the user by supplying an optional numeric argument that becomes the minimum length of the sequence.
The default definition of a "printable" character is a symbol with encoding above 31 decimal (i.e. 32 to 255) plus certain common control characters (TAB, CR and LF). This definition is almost always too liberal, and will include a lot of "noise" in the output. One or more of the options -b, -d, -i, -m or -z should be specified in addition to -s in order to narrow the definition and the search space. In my experience, the -b option is a particularly useful additional filter when searching for strings.
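
The idea behind string extraction can be sketched in Python as follows. This is an approximation under stated assumptions, not pep's implementation: the name find_strings is hypothetical, and the seven_bit flag only mimics the narrowing effect of -b by treating codes 128 to 255 as unprintable.

```python
import re

def find_strings(data: bytes, min_len: int = 4, seven_bit: bool = False) -> list[bytes]:
    """Return runs of at least min_len printable characters."""
    high = rb"\x7f" if seven_bit else rb"\xff"
    # Printable: TAB, LF, CR and every code from 32 up to the chosen limit.
    pattern = rb"[\t\n\r\x20-" + high + rb"]{" + str(min_len).encode() + rb",}"
    return re.findall(pattern, data)
```

Narrowing the printable range shortens or drops runs that contain high-bit bytes, which is why combining -s with -b usually gives much less noise.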
-t [ size ]
Expand tabulation, replacing the TAB character with a suitable number of spaces. The default tabulation size is 8, but the optional numeric argument size may be used to set tabulation to any desired size.
-u r | n | s | - | # | number
Pep's default behaviour is to terminate lines with whatever is the canonical line terminator (the standard way to terminate a text line) on the assumed target system for the output file. This means CR/LF on a microcomputer system, LF on a UNIX system, and CR if the target is a Macintosh. The assumed target system is usually the system pep is running on, unless you request folding to the character set of another computer system. Then, that computer system becomes the assumed target.
The -u option allows you to override this assumption. You do this by specifying explicitly (in decimal) the numeric ASCII value of the end of line character you want in your output file. For example, to make sure lines are terminated by LF (the standard for UNIX text files), you may use -u10, because 10 is the ASCII value of the newline (LF) control character. Instead of a numeric argument, you may specify r, for carriage return (CR), n, for newline (LF), s, for record separator (RS), the symbol -, for no line terminator, or the symbol # to get carriage return followed by a newline (CR/LF).
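
The argument forms accepted by -u map to byte sequences as in this small Python sketch. The function name terminator is hypothetical; the byte values are the standard ASCII codes (CR is 13, LF is 10, RS is 30).

```python
def terminator(arg: str) -> bytes:
    """Translate the argument of -u into the bytes written at each line end."""
    named = {
        "r": b"\r",      # carriage return (CR)
        "n": b"\n",      # newline (LF)
        "s": b"\x1e",    # record separator (RS)
        "-": b"",        # no line terminator
        "#": b"\r\n",    # carriage return followed by newline (CR/LF)
    }
    if arg in named:
        return named[arg]
    return bytes([int(arg)])   # decimal ASCII value, e.g. "10" for LF
```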
-v
Normally, pep will terminate each line with the canonical line terminator. Some typesetting programs and word processors, however, require that no hard line terminator is present within a paragraph, and that only paragraphs are hard terminated. If you want to import a file to such a typesetting program or word processor, you may instruct pep to terminate paragraphs only with this option.
See note in "bugs" section below about treatment of "end-of-line" and "end-of-paragraph".
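
A minimal Python sketch of paragraph-only termination follows. It relies on the blank-line heuristic described in the "bugs" section; joining the lines of a paragraph with a single space is an assumption made for this sketch, and the function name is hypothetical.

```python
def terminate_paragraphs(text: str, terminator: str = "\r") -> str:
    """Join the lines of each paragraph; terminate paragraphs only."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:                      # a blank line ends the paragraph
            paragraphs.append(" ".join(current))
            current = []
    if current:                            # last paragraph without a blank line
        paragraphs.append(" ".join(current))
    return "".join(p + terminator for p in paragraphs)
```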
-w + | -
This slightly obsolete option converts files to and from the WordStar version 3.2 "document" mode. If the argument is "+", the file is converted to WordStar document mode; if the argument is "-", the file is converted from WordStar document mode into plain ASCII text.
-x
Expand unprintable characters. This option will make pep expand the characters it would otherwise remove from the file by printing the character encoding of these characters in hexadecimal between angle brackets.
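
As an illustration, the expansion might look like the Python sketch below. The exact output format is an assumption: two upper case hexadecimal digits per character is a guess, since the text above only says that the code appears in hexadecimal between angle brackets. The function name is hypothetical.

```python
def expand_unprintable(data: bytes) -> bytes:
    """Show control characters as <XX> instead of removing them."""
    keep = {9, 10, 12, 13}        # TAB, LF, FF and CR are never removed
    out = bytearray()
    for b in data:
        if b >= 32 or b in keep:
            out.append(b)
        else:
            out += b"<%02X>" % b  # assumed format, e.g. ESC becomes <1B>
    return bytes(out)
```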
-z
Zero the eighth bit (a.k.a. the parity bit) on all characters in the file.
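
The difference between -b and -z is easy to miss, so here is a short Python sketch of both as described above (the function names are hypothetical). The first drops every byte with the eighth bit set; the second keeps the byte but clears that bit.

```python
def strip_high(data: bytes) -> bytes:
    """Like -b: remove characters encoded from 128 to 255."""
    return bytes(b for b in data if b < 128)

def zero_eighth_bit(data: bytes) -> bytes:
    """Like -z: clear the eighth bit on every character."""
    return bytes(b & 0x7F for b in data)
```

Given the four bytes b"caf\xe9", strip_high returns b"caf", while zero_eighth_bit returns b"cafi", because 0xE9 with the eighth bit cleared is 0x69, the code for "i". Zeroing is the right choice for parity-marked text and the wrong one for genuine 8 bit characters.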
 

ENVIRONMENT

Pep knows a single environment variable: PEP, which may be used to indicate the lookup path for files with conversion tables. Below are some examples of how to set this in some operating systems:

set PEP=c:\usr\lib                              (MS-DOS)
setenv PEP /usr/local/lib                               (UNIX)
define PEP "DISK_USR:<LOCAL.LIB>"               (VMS)

The command to set this environment variable should usually be part of the command file that is read during login (this may be named AUTOEXEC.BAT, LOGIN.COM, .profile or .login depending upon your choice of operating system). Please note that environment variables do not exist under CP/M.  

EXAMPLES

Some of the examples below use i/o redirection and pipes, as indicated with the symbols ">" and "<" (redirection) and "|" (pipe symbol). These examples only apply to operating systems that support redirection and pipes.

pep -h
Print a quick summary of all available options, and exit.
pep
Read input from standard input (the keyboard), and write the result on standard output (the screen) until the user types the end of file character (usually CTRL-D (UNIX) or CTRL-Z (MS-DOS)). This is of limited practical use by itself; usually this command is inserted into the middle of a command where the standard input and standard output are pipes.
pep < foo.bar
Display a slightly cleaned-up version of the file foo.bar on the screen.
pep < foo.bar > foo.txt
Read the file foo.bar, clean it, and write the result on the file foo.txt.
pep foo.bar > foo.txt
Read the file foo.bar, clean it, and write the result on the file foo.txt.
pep foo1.bar foo2.bar > foo.txt
Read the files foo1.bar and foo2.bar, clean them, and catenate the result on the file foo.txt.
pep -o foo.fil bar.fil
Clean the files foo.fil and bar.fil, replacing the original files with the cleaned-up versions.
pep -ob foo.fil bar.fil
Clean the files foo.fil and bar.fil, replacing the original files with the cleaned-up versions. The original files are preserved as foo.bak and bar.bak.
pep -i+ -o program.dok
Convert the Norwegian text in the file program.dok to use the IBM-PC 8 bit character set. Please note that this conversion may not be 100 percent correct. For instance, the pipe symbol "|" will be converted to the lower case Norwegian oslash character. This is because the pipe symbol and the character share the same ASCII code (124) in the Norwegian version of the 7-bit character set, but they have different codes when using 8-bit character sets.
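
The mapping behind this example can be sketched in Python. The table below covers only the six Norwegian letters and is written from the standard code charts for the Norwegian version of ISO 646 and for CP 850; pep's compiled-in table may well contain more entries. The names are hypothetical.

```python
# Norwegian ISO 646 code -> CP 850 code (six national letters only).
NORWEGIAN_TO_CP850 = {
    91: 146,    # "[" position: upper case AE
    92: 157,    # "\" position: upper case oslash
    93: 143,    # "]" position: upper case A with ring
    123: 145,   # "{" position: lower case ae
    124: 155,   # "|" position: lower case oslash
    125: 134,   # "}" position: lower case a with ring
}

def to_cp850(data: bytes) -> bytes:
    """Fold Norwegian 7 bit text to CP 850; other bytes pass unchanged."""
    return bytes(NORWEGIAN_TO_CP850.get(b, b) for b in data)
```

This also shows why the conversion cannot be fully correct: a real pipe symbol in the input has code 124 and is turned into lower case oslash just like any other occurrence of that code.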
pep -e2 -o kermit.log
Interpret ANSI screen control sequences in the file kermit.log. Set guard to level 2 (no deletion or overwriting).
In this example, it is assumed that the file kermit.log is a log record of an on-line session with some Bulletin Board System (BBS). Such files may be created with the command "log session" in the popular kermit communication program. Most other communication programs have similar commands. Many BBSs use ANSI sequences for simple graphics, highlighting and other special effects, and you will get a much more readable session log if you run it through pep with the -e option turned on.
test | pep -e > test.scr
Run the program test, and pipe its output to pep, which interprets any ANSI sequences and stores the resulting screen images in the file test.scr. Note that this is only possible on operating systems that support pipes (i.e. UNIX and MS-DOS).
The screen images will now be on standard text files which have the same general layout as the original screen images. This may be useful if you need text versions of the screen images for inclusion in manuals or for prototypes.
nroff -man -Tlpr pep.1l | pep > pep.doc
Generate a plain text version of this manual, without backspaces or double strikes (nroff is the standard Unix text formatter).
pep -d- -o *.txt
Convert all files with extension .txt from DEC/ISO character set to Norwegian 7-bit ASCII characters.
pep -gibm2mac -ur < foo.ibm > foo.mac
Use the conversion table in the file ibm2mac to convert the character set in the file foo.ibm. Store the result on the file foo.mac, where each line should be terminated by a single CR character.
pep -m- < foo.mac | pep -i+ > foo.ibm
Convert Apple Macintosh encoded Norwegian characters in the file foo.mac to IBM-PC (Code Page 850) encoding. This is an alternative way to accomplish the same thing as the conversion done in the previous example.
pep -w- -o *.*
Convert all files in the current directory from WordStar document mode to 7-bit ASCII.
pep -w+ -t4 < foo.txt > foo.ws
Convert the file foo.txt to WordStar document mode format, also expanding tabulation (tabstop = 4) to space characters. The result is stored on a file named foo.ws. Pep uses a simple pattern recognition mechanism to recognize pages, paragraphs, soft white space and soft hyphens. It will probably not do a 100% conversion, but the file will be much easier to edit in WordStar than the original.
pep -z -x < foo.dat > foo.dmp
Strip the 8th bit and expand control characters to hex digits in the file foo.dat, and store the result on the file foo.dmp.
Expanding the unprintable characters to hexadecimal makes it easier to inspect a file in an ordinary text editor, and to post-process it by a customized filter you may create yourself with the search/replace and macro facilities found in many editors today.
pep -s6 -b < pep.exe
Extract "strings" from the file pep.exe. The strings are just listed on standard output (the screen). "Strings" are in this context assumed to be any sequence of characters that is at least 6 characters long. The -b option excludes characters with codes in the range 128 to 255 from the search. It is almost always a good idea to combine the -b option with the -s option; otherwise too much garbage is picked up by the filter.
pep -t4 -c8 -o foo.c
If both tab expansion (-t) and tab compression (-c) are specified, then pep will repack the tabulation. This is useful if you want to convert a file from one tab-size to another (e.g. to convert non-standard 4 character tabulation into standard 8 character tabulation). In this example, two TAB characters in the file foo.c are replaced by a single TAB character, and any TAB character that cannot be paired up is replaced by the appropriate number of spaces.
pep -t -c -o foo.c
Remove redundant space characters in existing tabulation in the file foo.c. What happens is that tabulation on each line is first expanded and then compressed again, which effectively removes any space characters "inside" a tabulation.
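
The expand-then-compress idea behind the last two examples can be sketched in Python for a single line of text. This is an illustration and not pep's algorithm: the function names are hypothetical, and the compression step simply replaces two or more spaces that run up to a tab stop with one TAB.

```python
def compress(line: str, size: int = 8) -> str:
    """Replace runs of two or more spaces ending at a tab stop with a TAB.

    The line is assumed to contain no TAB characters (already expanded).
    """
    out = []
    for start in range(0, len(line), size):
        cell = line[start:start + size]          # text between two tab stops
        stripped = cell.rstrip(" ")
        pad = len(cell) - len(stripped)
        if len(cell) == size and pad >= 2:
            out.append(stripped + "\t")          # TAB reaches the same column
        else:
            out.append(cell)
    return "".join(out)

def repack(line: str, old_size: int = 8, new_size: int = 8) -> str:
    """Expand tabs at old_size, then compress again at new_size."""
    return compress(line.expandtabs(old_size), new_size)
```

With old_size 4 and new_size 8, a line starting with two TABs comes out starting with one TAB, while a single leading TAB becomes four spaces. With both sizes at 8, a line such as two spaces followed by a TAB is reduced to the TAB alone.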
 

DIAGNOSTICS

If you specify an option that pep does not recognize, then pep will write a summary of usage and abort. Other errors on the command line will result in pep writing an error message before aborting.

On operating systems that support exit codes, pep will return an exit code upon termination.

If pep is interpreting ANSI escape sequences and notices syntactical or semantical errors in the way they are used, a warning is printed on the screen, prefixed with the string "ansi:". This means that it is also possible to use pep to check if programs use ANSI sequences in a portable way.  

FILES

pep, pep.exe, pep.cmd
executable file (actual name depends upon which operating system you use).
mac2ibm
small example of a user supplied conversion table to convert from the Macintosh character set to that used on the Norwegian version of the original IBM-PC (the sample file only covers the Norwegian characters --- to complete it is left as an exercise to the reader :-) ).
ibm2mac
inverse of mac2ibm: conversion table from a small subset of IBM CP 850 to Macintosh character set.
ebc2ns7
conversion table from the IBM EBCDIC character set to the Norwegian version of the ASCII 7-bit character set (ISO646 NS4551).
ibm2ro8
conversion table from the IBM-PC 8-bit character set to Hewlett-Packard ROMAN8.
ro82ibm
inverse of ibm2ro8: conversion table from ROMAN8 to IBM-PC character set.
ibm2iso
conversion table from the IBM-PC CP 850 8-bit character set to ISO 8859/1.
iso2ibm
inverse of ibm2iso: conversion table from ISO 8859/1 to CP 850.
 

AUTHOR

Copyright © 1989 Gisle Hannemyr.

Pep may be freely distributed and copied, as long as this file is included in the distribution and these statements about authorship and copyright are not altered or removed.

Bug reports, improvements, comments, suggestions and flames to:
   Snail: Gisle Hannemyr, Brageveien 3A, 0452 Oslo, Norway.
   Email: gisle@nr.uninett (EAN);
          gisle@ifi.uio.no (Internet);
          ...!mcvax!ifi!gisle (UUCP);
          (and several BBS mailboxes).  

ACKNOWLEDGMENTS

Thanks to Robert Andersson, for the SYS-V rename function; and to Knut Borge, Bjoern Larsen, Knut Omang and Geir-Harald Strand, for elucidation of the unspeakable mysteries of VMS. Special thanks are due to Inge Arnesen for finding and fixing a bug (and to Nils-Eivind Naas for bringing it to my attention). Several people have contributed ideas and/or bug reports. In addition to those mentioned above, Ola Garstad, Ottar Grimstad, Tor Sjoewall, and Jens-Henrik Soerensen should be mentioned. My apologies if anyone is forgotten.  

SEE ALSO

dd(1), detex(1L), convert(VMS), expand(1), od(1V), strings(1), tr(1), unexpand(1).

Detex(1L) is a lex-based program to convert LaTeX and TeX manuscripts into plain ASCII text. It is available from the author upon request. Those marked VMS are standard VMS utilities. The others are standard UNIX utilities.  

BUGS

There is a very strong Norwegian bias in pep. In particular, there exist several national versions of the ISO 646 7-bit character set, but all built-in functions to convert between this and various 8-bit character sets (i.e. -d, -i, -k and -m) bluntly assume the standard Norwegian version of ISO 646. For pep to work with other national 7-bit character sets, the compiled-in conversion tables (type FOLDMATRIX for those who read the source code) need to be extended.

The VMS version of pep runs with the -o option permanently enabled. This is because VMS does not support a useful i/o redirection or pipe mechanism.

The VMS Record Management Service (RMS) knows of several record formats. You can see what record format a file has by using the VMS DCL command DIRECTORY/FULL and examining the field "Record format". On VMS systems, pep will always generate output files with record format set to "Stream_LF", but some programs may require that the output file is in another format. To fix this, it might be necessary to run the output of pep through the VMS CONVERT utility. Please see the DEC VMS manuals for details.

The Macintosh "text only" format uses the carriage return (CR) character (ASCII 13) as terminator. Most text processors (e.g. MacWrite) seem capable of handling two conventions: one is to use CR to terminate each line (and two or more consecutive CRs between paragraphs); the other is to use CR between paragraphs only. Pep is also capable of handling both conventions. The default behaviour is to terminate each line, but the -v option may be used to terminate paragraphs only. Please note that pep uses a rather simplistic heuristic to identify the end of a paragraph: it bluntly assumes that paragraphs are separated by blank lines.

If you use the -o option, then the original input file will be overwritten. Before you are familiar with pep, you may find that it sometimes removes more material than you expect from a file. It may be a good idea to always make a copy of the original file before you start experimenting with pep, or you may add the "b" argument to the -o option (-ob).

The built-in IBM-PC, DEC and Macintosh conversion tables convert to and from the Norwegian version of 7-bit "ASCII" characters. You should use the -g option and "general" conversion tables for all other purposes.

Pep only knows the ANSI sequences implemented in the standard MS-DOS console driver ANSI.SYS.

There cannot be a space character between an option and the option's argument (e.g. you'll have to use "-gfoo.bar", not "-g foo.bar").

Pep will only filter "regular" files. It will skip directories, sockets and "special" files.

Links are the GOTOs of file systems. If you run a hard linked file through pep using the -o option, the link will not be preserved. Pep will just skip soft linked files.

Pep searches for the conversion tables requested with the -g option in the following order: first the current directory, then the directory of the file PEP.EXE (MS-DOS only), and finally the directory pointed to by the PEP environment variable.
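
The search order can be expressed as a short Python sketch. It only mirrors the order stated above; the function names and the exe_dir parameter (standing in for the directory of PEP.EXE on MS-DOS) are hypothetical.

```python
import os

def table_candidates(name, exe_dir=None, environ=os.environ):
    """List the paths where a conversion table is looked for, in order."""
    dirs = [os.curdir]                   # 1. the current directory
    if exe_dir:
        dirs.append(exe_dir)             # 2. directory of PEP.EXE (MS-DOS only)
    if environ.get("PEP"):
        dirs.append(environ["PEP"])      # 3. directory named by PEP
    return [os.path.join(d, name) for d in dirs]

def find_table(name, exe_dir=None, environ=os.environ):
    """Return the first candidate that exists as a file, or None."""
    for path in table_candidates(name, exe_dir, environ):
        if os.path.isfile(path):
            return path
    return None
```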

Pep knows nothing about the COFF-format and the -s option is primitive compared to the UNIX command strings(1). So if you are on a UNIX-system --- forget about the -s option and use strings(1) instead.

Pep will not convert Word Perfect documents into plain ASCII. This much requested function is, however, built into Word Perfect. It is named "store as DOS-text" and is activated by pressing CTRL-F5 (at least in Word Perfect 4.2).


 
