STRSED

Section: Misc. Reference Manual Pages (3C)

NAME

strsed - ed(1)/tr(1)-like substitute and replace function.

SYNOPSIS

char *strsed(string, command, 0)
char *string;
char *command;

char *strsed(string, command, range)
char *string;
char *command;
int range[2];

DESCRIPTION

Strsed is a regular expression pattern match and replace function that also combines tr(1)-like transliteration. The GNU regex package is used for the regular expression matching.

Strsed can be used to provide the functionality of most of the other more "complicated" string functions (e.g. strchr, strrchr, strpbrk, strspn, strcspn, and strtok), although less efficiently in each case, due to its generality. Strsed is a very powerful and general function that can be used to carry out complicated string manipulations such as those that are possible in text editors.

USAGE

String should be a null-terminated character string. A copy is made and will be operated on according to the search and replace instructions contained in command. Unless an error occurs (see ERRORS), the passed character strings string and command are never corrupted, and the string that is returned may always be passed to free(3) since its space is obtained from malloc(3).

Both string and command may contain the following C-like escape sequences:

    \b      Backspace.
    \f      Formfeed.
    \n      Newline.
    \r      Carriage Return.
    \s      Space.
    \t      Horizontal Tab.
    \v      Vertical Tab.
    \z      Used to remove ambiguity if necessary.
    \0-9    A reference to a register.
             (except for \0 in a regular expression.)
    \0x3d   The character whose value is 3d hexadecimal.
    \0X3d   The character whose value is 3d hexadecimal.
    \040    The character whose value is 40 octal.
    \32     The character whose value is 32 decimal.

The NUL (0) character cannot be specified. A ``\'' followed by one to three digits can be interpreted in several ways. If one or two hex digits are preceded by an ``x'' or an ``X'', they will be taken as specifying a character in hexadecimal. If there are exactly three octal digits and the first is in the range ``0'' to ``3'' then they are taken as specifying a character in octal. Otherwise a single digit is taken to be a register reference and two or three digits are interpreted as specifying a character in decimal. \z can be used to avoid problems with ambiguity. For instance, \007 will be interpreted by strsed as octal 007. To specify the contents of register zero (\0) followed by the two characters ``07'', use \0\z07. The \z makes it clear what is meant (acting like a punctuation mark) and is otherwise ignored.
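The disambiguation rules above can be sketched as a small classifier. This is only an illustration of the rules as stated, not strsed's own code: the names esc_kind, struct esc and classify_escape are hypothetical, and the hexadecimal form is assumed to be written \0x.. or \0X.. as in the table.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names for this sketch; none of them come from strsed. */
enum esc_kind { ESC_NONE, ESC_REGISTER, ESC_CHAR };

struct esc {
    enum esc_kind kind;
    int value;   /* register number, or character code */
    int length;  /* characters consumed after the backslash */
};

static int is_octal_digit(char c)
{
    return c >= '0' && c <= '7';
}

/* Classify the text that follows a backslash, per the rules above. */
struct esc classify_escape(const char *s)
{
    struct esc e = { ESC_NONE, 0, 0 };
    char digits[4];
    int n = 0;

    assert(s != NULL);

    /* \0x3d or \0X3d: one or two hex digits after the x (assumed form). */
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')
        && isxdigit((unsigned char)s[2])) {
        int len = isxdigit((unsigned char)s[3]) ? 2 : 1;
        memcpy(digits, s + 2, (size_t)len);
        digits[len] = '\0';
        e.kind = ESC_CHAR;
        e.value = (int)strtol(digits, NULL, 16);
        e.length = 2 + len;
        return e;
    }

    /* Collect at most three decimal digits. */
    while (n < 3 && isdigit((unsigned char)s[n]))
        n++;
    if (n == 0)
        return e;               /* not a numeric escape at all */
    memcpy(digits, s, (size_t)n);
    digits[n] = '\0';

    if (n == 3 && s[0] <= '3' && is_octal_digit(s[1]) && is_octal_digit(s[2])) {
        e.kind = ESC_CHAR;      /* three octal digits, first in 0..3 */
        e.value = (int)strtol(digits, NULL, 8);
    } else if (n == 1) {
        e.kind = ESC_REGISTER;  /* a single digit names a register */
        e.value = s[0] - '0';
    } else {
        e.kind = ESC_CHAR;      /* two or three digits: decimal */
        e.value = (int)strtol(digits, NULL, 10);
    }
    e.length = n;
    return e;
}
```

Under these assumptions, the text after the backslash in \007 classifies as a character (octal 7), while in \0\z07 only the single digit 0 is consumed and it is reported as a reference to register zero.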

Strsed allows ed(1) like regular expressions and substitutions on string. The search and replace command is specified by command. The format of command is either

/search_pattern/replacement/
or
g/search_pattern/replacement/

In the first form, the search and replace is performed once on the string, and in the second, the replacement is done globally (i.e. for every occurrence of the search pattern in string). A leading ``s'' in the above is silently ignored. This allows for a syntax more like that of ed(1). For example, s/e/x/ is the same as /e/x/.

If replacement is empty, then the matched text will be replaced by nothing - i.e. deleted.

Search_pattern is a full regular expression (see ed(1)), including register specifications (i.e. \( ... \)) and register references (e.g. \2), but not the {m,n} repetition feature of ed(1).

Replacement consists of ordinary characters and/or register references (e.g. \1 or \2). \0 means the entire matched text. In addition, a register reference may be immediately followed by a transliteration request, of the form

{char-list-1}{char-list-2}.

The characters from char-list-1 will be transliterated into the corresponding ones from char-list-2 in the same manner as tr(1). If the register reference before a transliteration request is omitted, it defaults to \0. Within a transliteration request, the characters "}" and "-" are metacharacters and must be escaped with a leading \ if you want them to be interpreted literally. Character ranges such as a-z are expanded in the same fashion as tr(1). If char-list-2 is shorter than char-list-1 then char-list-2 is padded to be the same length as char-list-1 by repeating its last character as many times as are needed. For example, the transliteration request

{a-z}{X}

will transliterate all lower case letters into an 'X'. Character ranges may be increasing or decreasing.
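The range expansion and padding rules can be sketched in plain C as follows. This is a minimal illustration, not strsed's implementation: expand_list and transliterate are hypothetical names, only simple ranges of the form x-y (increasing or decreasing) and backslash-escaped characters are handled, and an empty char-list-2 is rejected because the text does not say what it should mean.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Expand a character list such as "a-z" or "z-a" into its individual
 * characters. Returns a malloc'd string, or NULL if malloc fails.
 * Hypothetical helper; chained ranges like a-f-0 are not handled.
 */
char *expand_list(const char *list)
{
    const unsigned char *p = (const unsigned char *)list;
    size_t len, n = 0;
    char *out;

    assert(list != NULL);
    len = strlen(list);
    /* Each range uses 3 input characters and yields fewer than 256. */
    out = malloc((len / 3 + 1) * 256 + len + 1);
    if (out == NULL)
        return NULL;

    while (*p != '\0') {
        if (*p == '\\' && p[1] != '\0') {        /* \- or \} taken literally */
            out[n++] = (char)p[1];
            p += 2;
        } else if (p[1] == '-' && p[2] != '\0') { /* range, either direction */
            int c = p[0], last = p[2];
            int step = (c <= last) ? 1 : -1;
            for (;; c += step) {
                out[n++] = (char)c;
                if (c == last)
                    break;
            }
            p += 3;
        } else {
            out[n++] = (char)*p++;
        }
    }
    out[n] = '\0';
    return out;
}

/*
 * Return a malloc'd copy of s in which each character found in from_list
 * is replaced by the corresponding character of to_list. A short to_list
 * is padded by reusing its last character. Returns NULL on failure.
 */
char *transliterate(const char *s, const char *from_list, const char *to_list)
{
    char *from = expand_list(from_list);
    char *to = expand_list(to_list);
    char *out = NULL;
    size_t from_len, to_len, i;

    if (from == NULL || to == NULL)
        goto done;
    from_len = strlen(from);
    to_len = strlen(to);
    if (to_len == 0)                 /* undefined by the text; rejected here */
        goto done;
    out = malloc(strlen(s) + 1);
    if (out == NULL)
        goto done;

    for (i = 0; s[i] != '\0'; i++) {
        const char *hit = from_len ? memchr(from, s[i], from_len) : NULL;
        if (hit == NULL) {
            out[i] = s[i];
        } else {
            size_t k = (size_t)(hit - from);
            out[i] = to[k < to_len ? k : to_len - 1];   /* padding rule */
        }
    }
    out[i] = '\0';
done:
    free(from);
    free(to);
    return out;
}
```

With this sketch, transliterate("pinto bean", "a-z", "X") yields "XXXXX XXXX", matching the {a-z}{X} request described above, and expand_list("e-a") yields "edcba".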

Unusual character ranges (such as a-f-0-\0x2d-c) are interpreted as running from their first character to their last (so the above would be treated as a-c). Note that it is not possible (in this release) to specify the complement of a character range in a transliteration request. However, this can be done in the search_pattern by commencing a character class with a "^" in the normal regular expression fashion.

The highest register that can be referenced is \9.

EXAMPLES

Here are some example command strings that might be given to strsed:

/a/A/            # Change the first 'a' into an 'A'
g/a/A/           # Change every 'a' into an 'A'
g/://            # Delete every ':'
g/jack/jill/     # Change every 'jack' to a 'jill'
/[^\s\t]/X/      # Change the first non-whitespace
                 # character into an 'X'.

Some more advanced examples...

/\([\s\t]*\)\([^\s\t]*\)/\1\2{a-z}{A-Z}/

This converts the first non-whitespace word to upper case, preserving any initial whitespace. It catches the first run of spaces and TABs into register one \([\s\t]*\), and then the following run of non-white characters into register two \([^\s\t]*\). The replacement, \1\2{a-z}{A-Z} specifies register 1 (the whitespace) followed by the contents of register 2 transliterated into uppercase. This would produce

"   SPOTTED pinto bean"

if called on the string

"   spotted pinto bean".

g/\([a-z]\)\1+/\1/

This is a very useful example and performs the same function as tr -s. That is, it squeezes runs of identical characters (in the range a to z) down to a single instance of that character. So "beeee good" becomes "be god". The "+" is the regular expression notation meaning "one or more".
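For comparison, the same squeeze can be written by hand. The sketch below does not call strsed; squeeze_lower is a hypothetical name, and the function is only meant to show the result that the command above is described as producing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd copy of s in which every run of one repeated
 * lower-case letter (a to z) is reduced to a single instance.
 * Other characters are copied unchanged. Returns NULL if malloc fails.
 */
char *squeeze_lower(const char *s)
{
    char *out;
    size_t i, n = 0;

    assert(s != NULL);
    out = malloc(strlen(s) + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; s[i] != '\0'; i++) {
        int lower = (s[i] >= 'a' && s[i] <= 'z');
        if (lower && i > 0 && s[i] == s[i - 1])
            continue;               /* repeat of the previous letter */
        out[n++] = s[i];
    }
    out[n] = '\0';
    return out;
}
```

Calling squeeze_lower("beeee good") gives "be god", the same result as described for the strsed command.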

g/\([\t\s]*\)\(.\)\([^\t\s]*\)/\1\2{a-z}{A-Z}\3/

This example capitalises the first letter of each word in the string, and preserves all whitespace. It catches three things,

1) the initial whitespace         \([\t\s]*\)  in register 1
2) the next letter                \(.\)        in register 2
3) the following nonwhite letters \([^\t\s]*\) in register 3

and then prints them out as they were found, with the only difference being the uppercase conversion of the contents of register 2. Given the string

"  this is a line  "

this command would return

"  This Is A Line  ".

If the initial 'g' was not present in the command, then the capitalisation would only be done to the first word in the string. It is important to understand this difference well.
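The effect of the leading 'g' can be illustrated with a hand-written equivalent. The function below is a hypothetical stand-in (capitalise_words is not part of strsed); its global argument plays the role of the 'g' prefix.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd copy of s with the first letter of each word in
 * upper case. If global is zero, only the first word is changed.
 * Words are separated by spaces and TABs. Returns NULL if malloc fails.
 */
char *capitalise_words(const char *s, int global)
{
    char *out;
    size_t i;
    int in_word = 0;   /* previous character was non-white */
    int done = 0;      /* non-global mode: first word already handled */

    assert(s != NULL);
    out = malloc(strlen(s) + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; s[i] != '\0'; i++) {
        int white = (s[i] == ' ' || s[i] == '\t');
        out[i] = s[i];
        if (!white && !in_word && !done) {
            if (s[i] >= 'a' && s[i] <= 'z')
                out[i] = (char)toupper((unsigned char)s[i]);
            if (!global)
                done = 1;
        }
        in_word = !white;
    }
    out[i] = '\0';
    return out;
}
```

Given "  this is a line  ", a non-zero global yields "  This Is A Line  ", while a zero global yields "  This is a line  ".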

SEARCHING ONLY

Strsed may be used to search for a regular expression in a string, but perform no action. The portion of the string that matched will be returned in the third argument range. In this case command should be of the form /pattern/. On return, range[0] will contain an index into the original string to indicate where the match began, and range[1] will index the first character after the end of the match. For example, after the call

strsed("two big macs please", "/b.*c/", range);

range[0] will contain 4 and range[1] will contain 11. If no match is found, both elements of range will contain -1.
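The range convention (start index, and index of the first character after the match) is the same half-open convention used by POSIX regexec(3). The sketch below reproduces the example with the POSIX regex API rather than with strsed or the GNU Emacs regex package; find_range is a hypothetical helper, and the pattern is given without the surrounding slashes.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Search string for pattern (a POSIX basic regular expression).
 * On return, range[0] is the index where the match began and range[1]
 * the index of the first character after it; both are -1 if there is
 * no match. Returns 0 normally, or -1 if the pattern fails to compile.
 */
int find_range(const char *string, const char *pattern, int range[2])
{
    regex_t re;
    regmatch_t m;
    int rc;

    assert(string != NULL && pattern != NULL && range != NULL);
    range[0] = range[1] = -1;

    if (regcomp(&re, pattern, 0) != 0)   /* flags 0: basic syntax */
        return -1;
    rc = regexec(&re, string, 1, &m, 0);
    regfree(&re);

    if (rc == 0) {
        range[0] = (int)m.rm_so;
        range[1] = (int)m.rm_eo;
    }
    return 0;
}
```

For the string "two big macs please" and the pattern "b.*c", this fills range with 4 and 11: the match "big mac" starts at index 4 and its last character is at index 10.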

ERRORS

If strsed detects any error it returns NULL. This can happen if the syntax of command is incorrect, if the regular expression in command is incorrect, if space cannot be obtained from malloc(3), or for other similar reasons. Note that it is not an error if the empty string is returned.
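A caller therefore has two duties: test the result against NULL, and free(3) it when finished. The sketch below shows one possible calling pattern. The helper edit_in_place and the typedef strsed_fn are hypothetical, and stub_strsed is only a stand-in with the SYNOPSIS signature so that the example can be built without strsed.o; in real use strsed itself would be passed instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Signature of strsed as given in the SYNOPSIS. */
typedef char *(*strsed_fn)(char *string, char *command, int *range);

/*
 * Stand-in for strsed (NOT the real function): returns a malloc'd copy
 * of string, or NULL when command is empty, to mimic an error return.
 */
char *stub_strsed(char *string, char *command, int *range)
{
    char *copy;

    (void)range;
    if (command[0] == '\0')
        return NULL;
    copy = malloc(strlen(string) + 1);
    if (copy != NULL)
        strcpy(copy, string);
    return copy;
}

/*
 * Run command on *sp and, on success, replace *sp with the result.
 * *sp must point to heap memory, since the old string is freed.
 * Returns 0 on success, or -1 if fn returned NULL; in that case this
 * helper leaves the pointer *sp alone.
 */
int edit_in_place(strsed_fn fn, char **sp, char *command)
{
    char *result;

    assert(fn != NULL && sp != NULL && *sp != NULL && command != NULL);
    result = fn(*sp, command, NULL);   /* third argument 0: no range wanted */
    if (result == NULL)
        return -1;                     /* syntax error, bad regexp, no memory */
    free(*sp);
    *sp = result;
    return 0;
}
```

Note that an empty result string counts as success here, in line with the remark that returning the empty string is not an error.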

COMPILING AND LINKING STRSED

Strsed should be compiled with the -O and -c options of your C compiler. It has no main() function. At link time, link strsed.o together with regex.o from the GNU Emacs 18.55 (or 18.54) distribution.

OBSCURE NOTE ON REGULAR EXPRESSIONS

It is possible (but not too likely) that the regular expression language that is recognised may differ slightly from installation to installation. This is because the GNU regular expression package may be compiled with different settings for recognition of meta-characters. So on one machine, the character "|" might be taken as being the OR operator, whilst somewhere else you need to give "\|" - or vice-versa. This could be a pain in the neck, but there's not a lot that can be done about it. If you really need to know the difference in a portable way, look in regex.h to see what things are defined and then act accordingly when constructing commands for strsed.

AUTHOR

Terry Jones
PCS Computer Systeme GmbH
Pfaelzer-Wald-Str 36
8000 Muenchen 90
West Germany 49-89-68004288

terry@distel.pcs.com
or ...!{pyramid,unido}!pcsbst!distel!terry

January 8th, 1990.

ACKNOWLEDGEMENTS

Many thanks to Jordan K. (mother) Hubbard for discussions, bugfinding, handholding, forcing me to use emacs and torrents of (usually) uncalled-for abuse.
