- Amiga SED An Amiga Stream Editor © THOR-Software (Thomas Richter)
- ______________________________________________________________________________
-
- Purpose of this program:
-
- SED takes an input file, checks each line of this file against a
- pattern supplied on the command line, and generates a new line from
- this pattern match in the destination. This could either mean that
- the matching line is removed completely from the output, replaced
- by a different line, or changed according to the specifications of
- SED.
-
- SED is an approximation of the Unix "stream editor" sed. It is not
- quite as powerful as sed because its command set is currently very
- limited, and it does not support command files. Its pattern syntax
- is different, too. It still looks like "line noise" to me - I didn't
- want to break with this tradition - but it's at least the Amiga kind
- of line noise.
-
- Its pattern matching rules are a superset of the AmigaOs patterns,
- with some additional features like "captured expressions" and more
- powerful "character classes" and "escaping".
-
- SED is useful for automatic processing of text files, e.g. the modi-
- fication of the startup-sequence. SED can also be run as a "filter"
- in which case it reads its input from stdin and prints output to
- stdout. Combine this feature with pipes and you get a very powerful
- text processing tool.
-
- A warning: Pattern matching looks simple, but is full of hard-to-grasp
- traps. This tool is therefore intended for "expert usage". In
- case you think SED doesn't process your pattern correctly, think
- twice!
-
- ______________________________________________________________________________
-
- SYNOPSIS:
-
- SED FROM,TO,MATCH/A,REPLACE,CHANGE,DELETE/S,USECASE/S,ALL/S,VERBOSE/S
-
- FROM An (AmigaOs) pattern specifying the input file(s) to process.
- If not given, SED reads from the standard input.
-
- TO The output file to write the processed lines to. If not
- given, SED writes to the standard output.
-
- MATCH A pattern specification used for filtering the input lines.
- More on the pattern rules below.
-
-
- The next options specify what to do with the lines matching the pattern:
-
-
- REPLACE Replace matching lines in the input by this replace rule, do
- not write non-matching lines to the output. The replacement
- rules are given below.
-
- CHANGE Replace matching lines by the replace rule given by this
- expression. In contrast to REPLACE, non-matching lines are placed
- in the destination without change. Useful for modifying a file
- according to a pattern rule.
-
- DELETE Print all lines except those matching the pattern. This
- effectively removes the matching lines from the input file.
-
- USECASE Be case-sensitive. By default, SED is case-insensitive.
- Note that SED differs in this detail from the Un*x sed.
-
- ALL In case the FROM pattern is a wildcard, enter sub-directories
- recursively.
-
- VERBOSE Print information about the file currently scanned, and upon
- entering a directory. By default, SED is quiet.
- Note that "Search" is by default not quiet.
- ______________________________________________________________________________
-
- Pattern specification:
-
- In the following, the syntax of the patterns is specified. By good
- tradition, this is done incomprehensibly.
-
- First comes a "quick and dirty" overview of the available patterns,
- a quick reference which might give you an impression of the
- possibilities. It is all but sufficient to work with SED. Then, a
- detailed and more precise, but also more confusing presentation
- follows.
-
- ______________________________________________________________________________
-
- Quick guide to patterns:
-
- SED patterns work much like Amiga patterns. Unlike in "Search", a pattern is
- applied to a FULL line, and not to sub-strings of this line. This means that
- the pattern "hello" matches ONLY a line consisting of the single word "hello",
- nothing more. If you want to match lines containing "hello", use "#?hello#?"
- instead - see below for what "#?" means. Unlike Un*x sed, there are no special
- characters to match the start or the end of a line. They are not required in
- the SED approach.
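-
- The full-line rule can be illustrated with a minimal Python sketch. It is
- not SED's implementation, only a model of a small subset: literal
- characters, "?" (any single character) and "#" followed by a one-character
- symbol (zero or more repetitions). The function name full_line_match is
- hypothetical, and the sketch is case-sensitive (as with USECASE), whereas
- SED ignores case by default.
-
```python
def full_line_match(pattern, line):
    """Return True if the pattern matches the WHOLE line (subset model)."""
    if not pattern:
        # Pattern used up: the line must be used up as well.
        return not line
    head, rest = pattern[0], pattern[1:]
    if head == "#":
        if not rest:
            raise ValueError("# must be followed by a symbol")
        symbol, rest = rest[0], rest[1:]
        # Count how many leading characters the repeated symbol can cover.
        most = 0
        while most < len(line) and symbol in ("?", line[most]):
            most += 1
        # Try the longest repetition first, then back off down to zero.
        return any(full_line_match(rest, line[n:]) for n in range(most, -1, -1))
    if line and head in ("?", line[0]):
        return full_line_match(rest, line[1:])
    return False


print(full_line_match("hello", "hello"))                # True
print(full_line_match("hello", "say hello there"))      # False
print(full_line_match("#?hello#?", "say hello there"))  # True
```
-
- Because the empty pattern only accepts the empty rest of the line, no
- start-of-line or end-of-line markers are needed in this model either.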
-
- Standard patterns:
-
- ? Matches a single arbitrary character.
- # Matches zero or more repetitions of the following symbol in
- the AmigaOs sense. Note that # may match zero(!) characters
- as well.
- Therefore, #? matches an arbitrary sequence of at least zero
- characters, hence any string.
- + Matches one or more repetitions of the following symbol.
- New to AmigaOs, but standard in Un*x regular expressions.
- * Matches zero or more arbitrary characters in the sense of
- MS-DOS. Note that this is a functional difference to Un*x
- regular expressions where * has the meaning of #.
-
- Note that you need to write ** instead of * if you enclose
- the pattern in double quotes on the shell command line. This
- is because * is also the BCPL escape character. Messy.
-
- (...) Groups the characters in the bracket into a single symbol.
- For example, #(ab) would match an arbitrary repetition
- of "ab", such as the empty string, "ab", "abab" or "ababab",
- but not "aba".
- Brackets can be nested.
-
- (..|..) The vertical bar means "or". Matches either the left or the
- right string. The bar is only valid within brackets.
-
- {...} Groups expressions much like (...) but captures the contents
- of the sub string that matched the brackets. This captured
- expression is then available for the output replacement rules,
- see below for more information. For example,
-
- SED MATCH {#?}.c REPLACE {1}.o
-
- would match all lines ending in ".c", and would capture the
- string in front of the ".c". The "{1}" in the replace pattern
- would insert this string, and would append an ".o".
- That is, the above replaces all lines ending in ".c" by a
- similar line ending in ".o".
- Works very much like the Un*x \(..\) matching.
-
- {..|..} The vertical bar works exactly the same way here as described
- above. Matches either the left or the right expression, and
- captures the expression that fits.
-
- % Matches the empty string. Useful for patterns like
- "#?(.c|.o|%)" which could be used to match the source, the
- object code and the final executable of a C project, for
- example.
-
- ~ Means "not" and matches all symbols that do not match the
- following symbol. Be warned, ~ is full of traps, see below
- for the full description.
-
- [..] Character classes. Matches a single character from a set of
- valid characters specified in the interior of the bracket.
- For example, "[ac]" would match the single character "a" or
- "c".
-
- [..|..] Matches either the left or the right character range. Hence,
- [a|c] is equivalent to [ac].
-
- [..,..] Another equivalent formulation of the above. [a,c] is the same
- as [ac] or [a|c].
-
- [..-..] Matches a character range. [a-z] matches all letters, except
- language-specific "Umlaute", though, which have different
- encodings. Several ranges can be grouped much the same way as
- single characters. [a-z|0-9] means "any letter or any digit"
- but nothing else.
-
- [-..] Matches all characters up to the specified character. Hence,
- [-z] means "all characters up to z". Note that unlike in Un*x
- implementations, there are no messy rules concerning the "["
- itself as a character. The escape character "\" must be used to
- specify "[" or "]" itself, see below.
- This syntax can be combined freely with "|" or "," to specify
- more than one range.
-
- [..-] Matches all characters starting at the given ASCII value. Can
- be combined freely with "," and "|". There are no messy rules
- concerning "-" in the middle or at the end of a character range;
- proper escaping must be used if "]" should be matched.
- [a-] therefore matches all characters "a" and up.
-
- [~..] Matches all characters not in the following range. ~ is applied
- up to the next "|" or ",", unlike in the standard AmigaOs (Arp)
- expression matching. Therefore,
- [~ab] matches all characters except "a" and "b" and is
- equivalent to [~a,~b] and [~a|~b].
-
- \ Escape character. Specifies a character to be matched:
-
- \t Tabulator \v Vertical TAB
- \b Backspace \r CR
- \f Form Feed \a Bell
- \n is INVALID since the end of the line is matched
- by the end of the pattern itself.
- \x.. The character encoded by the hex value following
- the "x". In case this specification is ambiguous,
- the number might be terminated by a dot ".".
- Hence, "\x9.0" matches a tabulator sign and the
- digit "0", whereas "\x90" matches the ASCII
- character of the code hex 90.
- Note that this rule differs from the ANSI-C rule.
- \0.. The character of the ASCII code encoded as an
- octal number.
- The dot is used as above as separator, unlike in
- ANSI-C.
- \$.. Identical to \x.., matches the character encoded
- by the ASCII code in hex.
- \d Matches the dollar sign since \$ has a different
- meaning already.
- \#.. Matches the character encoded by the ASCII code
- in decimal notation.
- \h Matches the hash-mark since \# has a different
- meaning already.
-
- Everything else: The character following the backslash
- itself. Especially, \\ is the backslash itself and \" is
- the double quote.
-
- Note that you must use the backslash to match characters
- which are otherwise part of the pattern syntax, as for
- example "\(" to match the bracket. Note that "#" and "$"
- are special in this sense since "\$" and "\#" are used
- to specify characters by ASCII code.
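-
- The escape rules above can be modelled by the following Python sketch. It
- is not taken from SED; the function name decode_escape is hypothetical, and
- two points the text leaves open are ASSUMPTIONS here: the number takes as
- many digits as follow, and a dot directly after the number is always
- consumed as its terminator. The octal form \0.. is left out.
-
```python
HEX_DIGITS = "0123456789abcdefABCDEF"
NAMED = {"t": "\t", "v": "\v", "b": "\b", "r": "\r", "f": "\f", "a": "\a",
         "d": "$", "h": "#"}
NUMERIC = {"x": (16, HEX_DIGITS), "$": (16, HEX_DIGITS),
           "#": (10, "0123456789")}


def decode_escape(text, pos):
    """Decode the escape whose backslash sits at text[pos].

    Returns (character, index of the first position after the escape).
    """
    kind = text[pos + 1]
    if kind == "n":
        raise ValueError("\\n is invalid: the pattern end matches the line end")
    if kind == "0":
        raise NotImplementedError("octal form is left out of this sketch")
    if kind in NUMERIC:
        base, digits = NUMERIC[kind]
        end = pos + 2
        # ASSUMPTION: the number is the longest run of valid digits.
        while end < len(text) and text[end] in digits:
            end += 1
        if end == pos + 2:
            raise ValueError("numeric escape needs at least one digit")
        char = chr(int(text[pos + 2:end], base))
        # ASSUMPTION: a dot right after the number only terminates it.
        if end < len(text) and text[end] == ".":
            end += 1
        return char, end
    # Named escapes; anything else stands for the character itself.
    return NAMED.get(kind, kind), pos + 2


print(decode_escape(r"\x9.0", 0))  # ('\t', 4)
print(decode_escape(r"\x90", 0))   # ('\x90', 4)
print(decode_escape(r"\#65", 0))   # ('A', 4)
print(decode_escape(r"\d", 0))     # ('$', 2)
```
-
- In the first call the returned index 4 points at the "0" that follows the
- dot, so the digit is matched as an ordinary character afterwards.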
-
- !,",§,$,&,=
- -,^,',`,<,> are reserved for future use AND MUST NOT be used at all.
- Escape them if you need them. However, the dot (".") is
- free, unlike Un*x regexp, same goes for "@" and "/".
-
- .. Everything else: Matches the character itself. Hence "a"
- matches a single "a" much like "[a]".
-
- ______________________________________________________________________________
-
- Replacement rules:
-
- The arguments of REPLACE and CHANGE specify what to do with the lines which
- matched the specified pattern. Unlike the pattern specification, only the
- special operators \ and {..} are allowed. All other operators from the
- above list are forbidden and generate an error.
-
- \ Escape character, works identically to the \ in the pattern
- and places the single character encoded by the sequence
- following the backslash on the output directly.
-
- {..} Specifies a captured expression to be inserted into the
- output stream. The brackets take up to three arguments:
- The index number of the captured expression, and optionally
- two arguments that say how to format the expression, separated
- by a dot ".". These numbers work very much the same way as
- the arguments to the %s format specifier in ANSI-C.
-
- The first number in the bracket describes which captured
- expression to insert. If it is a positive number, the number
- is simply the index of the captured expression, counting from
- one upwards.
-
- Each opening bracket "{" in the input pattern starts a new
- captured expression, hence in nested expressions the
- outermost bracket has the lowest index.
-
- If this number is negative, it counts the captured ex-
- pressions downwards from the last expression.
-
- If the specified expression does not exist, the brackets
- expand into an empty string that is formatted according to
- the rules given by the next two arguments.
-
- {1} is the first captured expression,
- {3} is the third expression,
- {-1} is the last expression,
- {-2} is the second to last expression.
-
-
- The next number is the width of the field to print the captured
- expression in. At least the specified number of characters are
- printed, or more if the expression is longer. If the
- expression is too short, the field is padded with blank spaces.
- The expression is right-justified in this field, unless
- the field width is negative, in which case the expression is
- left-justified. The sign of the field width is otherwise
- ignored.
- Defaults to 0, i.e. the field is always as small as possible.
-
- The last number is the size limit of the expression. The
- expression will be cut down if it is longer than the specified
- limit. SED will cut the end of the string if this argument is
- positive, or the start of the string if it is negative. The
- sign of the limit is otherwise ignored.
- If the limit is 0, which is the default, the expression will
- not be cut down at all.
-
- {1.10} is the first captured expression right justified in
- a field of ten characters or longer.
-
- {2.-5.7} is the second captured expression, left justified in
- a field of five characters. At most seven characters
- of the expression will be printed.
-
- .. Everything else: The character itself is printed on the output.
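-
- The rules for the {..} bracket can be summarised in a short Python sketch.
- This is a model written from the description above, not SED's code; the
- function name format_capture is hypothetical. One ASSUMPTION: the size
- limit is applied before the padding, as ANSI-C does for the %s precision.
-
```python
def format_capture(spec, captures):
    """Expand the inside of one {index.width.limit} bracket (sketch).

    spec     -- text between the braces, e.g. "2.-5.7"
    captures -- list of captured expressions, first capture at position 0
    """
    numbers = [int(part) for part in spec.split(".")]
    # Width and limit are optional and default to 0.
    index, width, limit = (numbers + [0, 0])[:3]

    # Positive indices count from one, negative ones from the last capture.
    if 1 <= index <= len(captures):
        text = captures[index - 1]
    elif -len(captures) <= index <= -1:
        text = captures[index]
    else:
        text = ""  # missing expression: empty string, still formatted

    # Size limit: positive cuts the end, negative cuts the start.
    if limit > 0:
        text = text[:limit]
    elif limit < 0:
        text = text[limit:]

    # Field width: positive right-justifies, negative left-justifies.
    return text.rjust(width) if width >= 0 else text.ljust(-width)


print(repr(format_capture("1.10", ["abc"])))                # '       abc'
print(repr(format_capture("2.-5.7", ["x", "ab"])))          # 'ab   '
print(repr(format_capture("2.-5.7", ["x", "abcdefghij"])))  # 'abcdefg'
print(repr(format_capture("-1", ["first", "last"])))        # 'last'
```
-
- The third call shows that the field width is only a minimum: the seven
- characters left after cutting are longer than the field of five.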
-
- ______________________________________________________________________________
-
- Detailed pattern matching rules:
-
- And now for the detailed rules to confuse you completely:
-
- - A SYMBOL is either a single character, one of the following operators
- followed by its arguments, a character class [..] or a (..) or {..} sequence.
-
- - A PATTERN is a sequence of SYMBOLs.
-
- - The POSTFIX of a symbol in a pattern is the subsequence of the pattern
- following the symbol, not including the argument of the symbol itself.
-
- ? Matches a single character except the end of a string.
-
- # Matches as many repetitions of the following symbol as
- possible, but at least zero, such that the postfix of the
- symbol matches the postfix of the input.
- Hence, "#" is greedy. There is currently no non-greedy form.
-
- + Matches as many repetitions of the following symbol as
- possible, but at least one, such that the postfix of the
- symbol matches the postfix of the input. "+" is greedy.
-
- * is fully equivalent to "#?" and therefore greedy.
-
- (...) groups the pattern up to the next | or ) into a symbol
- which matches if the contents of the brackets match.
-
- (..|..) An or-combined symbol matches if one of the expressions
- in the bracket matches such that the postfix of the symbol
- matches the postfix of the input.
-
- {...},{..|..} Similar to the above except that the matched string is
- captured.
-
- % Is completely ignored as pattern and gobbles no character
- from the input sequence at all.
-
- ~ Matches the longest subsequence of at least zero characters
- that does not match the following symbol, such that the
- postfix of the symbol still matches the postfix of the input.
- "~" is greedy and will try to match as many characters as
- possible first.
-
- Note that a symbol could either be a single character or a
- sequence of characters grouped by () or #. Since a single
- character cannot match a string larger or smaller than one
- character, ~ followed by a one-character symbol will match
- all subsequences except those whose postfix either doesn't
- match the postfix of the character, or which match the
- character and the postfix.
-
- This is *very* tricky and you should think about the con-
- sequences of this rule twice. More examples below.
-
- [..] Character classes. Groups a range of characters into a
- symbol that matches exactly a single character, but never
- the empty string.
-
- ~ in character classes is special: If there are not-sequences
- in a character class, the class matches if all not-sequences
- match at once and one of the normal sequences matches. Hence
-
- [~p,~q,a-z]
-
- matches all letters except p and q.
-
- .. Everything else matches exactly the one character that
- it represents. It will not match the empty string.
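-
- The character class rule, including the special role of "~", can be
- modelled in Python as follows. This is only a sketch under stated
- ASSUMPTIONS: the names item_matches and class_matches are hypothetical,
- escapes are not handled, each item is either one range or a plain set of
- characters, characters are 8-bit, and matching is case-sensitive (as with
- USECASE). A class made up of not-sequences only, such as [~ab], matches
- whenever all of them match, as described in the quick guide.
-
```python
import re


def item_matches(item, ch):
    """Test one item of a class: a range like a-z, -z, a- or a set like ac."""
    if "-" in item:
        low, _, high = item.partition("-")
        # An open end extends to the lowest or highest 8-bit character.
        return (low or "\x00") <= ch <= (high or "\xff")
    return ch in item


def class_matches(body, ch):
    """Test a single character against the inside of a [..] class."""
    items = re.split(r"[,|]", body)
    negated = [item[1:] for item in items if item.startswith("~")]
    normal = [item for item in items if not item.startswith("~")]
    # Every not-sequence must match, i.e. none of them may contain ch.
    if any(item_matches(item, ch) for item in negated):
        return False
    # With normal sequences present, one of them has to match as well.
    return not normal or any(item_matches(item, ch) for item in normal)


print(class_matches("~p,~q,a-z", "r"))  # True
print(class_matches("~p,~q,a-z", "p"))  # False
print(class_matches("~p,~q,a-z", "5"))  # False
print(class_matches("~ab", "c"))        # True
```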
-
- ______________________________________________________________________________
-
- Some examples of patterns to think of:
-
- % Matches only empty lines in the input.
-
- ~% Matches only non-empty lines in the input.
-
- #?.c Matches all lines ending in ".c".
-
- The#? Matches all lines starting with "The".
-
- #?Example#? Matches all lines containing the word "Example".
-
- Example#? Matches all lines starting with the word "Example".
-
- Example Matches all lines consisting of the single word "Example".
-
- #? Matches all lines.
-
- #?(.c|.o|%) Matches all lines (think about why!).
-
- foo(.c|.o|%) Matches all lines consisting entirely of the word "foo", "foo.c"
- or "foo.o".
-
- foo(.c|.o|) Just the same.
-
- ~(Example) Matches all lines except the line consisting of the single word
- "Example".
-
- ~(#?Example#?) Matches all lines that do not contain the word "Example".
-
- ~(ab)cd Matches all lines that do not start with "ab" and that end in
- "cd". In particular, this would match "bccd". It would also match
- the line "cd" since "ab" does not match the empty sequence in
- front of "cd". (think about this!)
-
- ~#a.c Matches all lines ending in ".c" except those where the ".c" is
- prefixed by an arbitrary number of a's, including zero a's.
- Hence, it would match "bc.c" and even "ab.c", but not "a.c" or
- ".c" as the last consists of zero a's and one ".c". It would
- not match "bc.o". This is identical to ~(#a).c since # binds
- the following a.
-
- ~(#ab)#? Matches all lines except those starting with a possibly empty
- sequence of a's followed by a single b. Hence, does not match
- aaabccc or bccc.
-
- ~(#[ ,\t];)#? Matches all lines except those starting with a possibly empty
- sequence of blanks or tabulators followed by a semicolon. Hence,
- for a shell script, this would match all non-comment lines.
-
- ~(#[ ,\t];)if#? This is a tricky one. Unlike what you might think, this does
- not match all non-comment lines starting with if. It also
- matches lines starting with a semicolon provided the string
- "if" is in the line and not directly behind the semicolon.
- Note that this is the intended behaviour. For example, it
- would match
-
- ;aifb
-
- The reason is simple: This is a string ";a" that does not
- match the symbol #[ ,\t]; followed by "ifb" which matches
- if#?.
-
- What you want here instead is #[ ,\t]if#? which matches
- all if-lines with additional, at least zero, blanks or tabs
- in front.
-
- The above example shows again the tricky nature of pattern
- matching.
-
- A real life example would be
-
- sed from S:Startup-Sequence match "{#[ ,\t]}RunBack{#?}" change "{1}Launch{2}"
-
- which would replace all invocations of "RunBack" in the Startup-Sequence by
- similar invocations of "Launch".
-
- Another example to think about as an exercise:
-
- ({}|{#[~;]+[ \t]}#[~; \t][/:]|{}#[~; \t][/:]|{#[~;]+[ \t]})FooBar{|[ \t;]#?}
-
- Yes, this pattern is useful. Consider again S:Startup-Sequence as input file
- and think about what this could possibly do. Note that some expressions are
- captured. (Hey, I said this would look like line noise!)
- ______________________________________________________________________________
-
- Thomas Richter,
- October 2000
-