This section describes regular expressions and provides information and examples for using them in HomeSite. The rules listed in this section are for creating regular expressions in HomeSite; the rules used by other RegExp parsers might differ.
An excellent reference on regular expressions is Mastering Regular Expressions by Jeffrey E.F. Friedl, published by O'Reilly & Associates, Inc.
A regular expression is a pattern that defines a set of character strings. The RegExp parser in HomeSite evaluates the indicated text and returns each matching pattern.
As in arithmetic expressions, you can use various operators to combine smaller expressions; simple regular expressions can be concatenated into complex criteria. For more information, see "Anchoring a regular expression to a string".
In HomeSite, you can use regular expressions for extended searches and validating code. Following is a description of each usage:
(" [A-Za-z] "){2,}
.
In an extended search, all matches are added to the list of results. But in an extended search and replace, matches are immediately replaced with the replacement text. So consider not only what is matched but what is not matched; for example, there might be two or more strings that you must replace with the same text. Also, it is always a good idea to back up your files first!
In a search and replace operation, the RegExp engine processes the entire document; it does not parse on a line-by-line basis. This affects the way that you should use characters such as the asterisk (*), caret (^), and dollar sign ($).
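The effect of whole-document processing can be sketched with Python's `re` module. This is only an analogy: Python's engine is not HomeSite's, and the `re.MULTILINE` and `re.DOTALL` flags are Python-specific assumptions with no counterpart implied in HomeSite. The sample text is invented for illustration.

```python
import re

# A three-line "document" processed as one string, not line by line.
document = "alpha one\nbeta two\nalpha three"

# By default, ^ anchors only at the very start of the whole text.
whole_text_matches = re.findall(r"^alpha", document)

# With re.MULTILINE (Python-specific), ^ also anchors after each newline.
per_line_matches = re.findall(r"^alpha", document, flags=re.MULTILINE)

# When . may match newlines (re.DOTALL here), a greedy .* runs across
# line breaks and can take in far more text than intended.
greedy = re.search(r"alpha.*two", document, flags=re.DOTALL)

print(whole_text_matches)        # ['alpha']
print(per_line_matches)          # ['alpha', 'alpha']
print(repr(greedy.group(0)))     # 'alpha one\nbeta two'
```

When an engine treats the document as a single string, it is worth checking anchored and starred patterns on a small sample before running a replace across many files.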
For more information, also see "Using extended search commands" and "Validating Code".
Because special characters are the operators in regular expressions, in order to represent a special character as an ordinary one, you need to precede it with a backslash. To represent a backslash, for instance, use a double backslash (\\).
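As a rough illustration of escaping, the following Python sketch contrasts an unescaped and an escaped special character. Python's `re` module stands in for the HomeSite parser here, and `re.escape` is a Python helper, not a HomeSite feature; the sample strings are made up.

```python
import re

text = "Price: $5.00 (approx.) C:\\temp"

# Unescaped, the dot is an operator and matches any character.
dot_as_operator = re.search(r"5.00", "5x00") is not None

# Preceded by a backslash, the dot matches only a literal dot.
dot_as_literal = re.search(r"5\.00", "5x00") is not None

# The dollar sign and dot are both escaped to match the literal price.
literal_price = re.search(r"\$5\.00", text).group(0)

# A literal backslash is written as a double backslash in the pattern.
backslash_path = re.search(r"C:\\temp", text).group(0)

# re.escape (Python-specific) inserts the backslashes automatically.
escaped = re.escape("(approx.)")

print(dot_as_operator, dot_as_literal)  # True False
print(literal_price)                    # $5.00
print(escaped)                          # \(approx\.\)
```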
This section describes the rules for creating regular expressions. You can use regular expressions in the Search > Extended Find and Replace command to match complex string patterns.
The following rules govern one-character regular expressions, which match a single character:
[^#chr(13)##chr(10)#], which excludes the ASCII carriage return and line feed codes.
[akm] matches an a, k, or m. Note that if you want to include a closing square bracket (]) in square brackets, it must be the first character. Otherwise, it does not work even if you use \].
[a-z] matches any lowercase letter. However, if the first character of the set is the caret (^), the RegExp matches any character except those in the set. It does not match the empty string. For example, [^akm] matches any character except a, k, or m. The caret loses its special meaning if it is not the first character of the set.
[Nn][Ii][Cc][Kk] matches "nick" in any combination of uppercase and lowercase letters.
You can specify a character by using a POSIX character class. You enclose the character class name inside two square brackets, as in this Replace example:
"Macromedia's Web Site","[[:space:]]","*","ALL")
This code replaces all the spaces with *, producing this string:
Macromedia's*Web*Site
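The same replacement can be approximated in Python as a sketch. Python's `re` module does not implement POSIX class names such as `[[:space:]]`, so this example assumes an explicit character set covering what the space class is described as matching in this section; the name `POSIX_SPACE` is hypothetical.

```python
import re

# Explicit stand-in for [[:space:]]: tab, new line, vertical tab,
# form feed, carriage return, or space.
POSIX_SPACE = r"[\t\n\v\f\r ]"

# Replace every whitespace character with an asterisk.
result = re.sub(POSIX_SPACE, "*", "Macromedia's Web Site")

print(result)  # Macromedia's*Web*Site
```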
The following table shows the supported POSIX character classes:
Character Class | Matches |
---|---|
alpha | Any letter, [A-Za-z] |
upper | Any uppercase letter, [A-Z] |
lower | Any lowercase letter, [a-z] |
digit | Any digit, [0-9] |
alnum | Any alphanumeric character, [A-Za-z0-9] |
xdigit | Any hexadecimal digit, [0-9A-Fa-f] |
space | A tab, new line, vertical tab, form feed, carriage return, or space |
print | Any printable character |
punct | Any punctuation character: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ |
graph | Any character defined as a printable character except those defined as part of the space character class |
cntrl | Any character not part of the character classes: [:upper:], [:lower:], [:alpha:], [:digit:], [:punct:], [:graph:], [:print:], [:xdigit:] |
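For engines that lack POSIX class names, the explicit sets listed for several classes can be substituted by hand. The sketch below does this in Python; `POSIX_CLASSES` and `expand_posix` are hypothetical names introduced for illustration, and the helper only handles a class that stands alone as `[[:name:]]`, not one embedded in a larger bracket set.

```python
import re

# Explicit bracket expressions for the classes that have one listed.
POSIX_CLASSES = {
    "alpha": "[A-Za-z]",
    "upper": "[A-Z]",
    "lower": "[a-z]",
    "digit": "[0-9]",
    "alnum": "[A-Za-z0-9]",
    "xdigit": "[0-9A-Fa-f]",
}

def expand_posix(pattern: str) -> str:
    """Replace each standalone [[:name:]] with its explicit set."""
    for name, explicit in POSIX_CLASSES.items():
        pattern = pattern.replace(f"[[:{name}:]]", explicit)
    return pattern

# Six hexadecimal digits after a literal '#'.
hex_color = expand_posix("#[[:xdigit:]]{6}")

print(hex_color)                                  # #[0-9A-Fa-f]{6}
print(bool(re.fullmatch(hex_color, "#1A2b3C")))   # True
print(bool(re.fullmatch(hex_color, "#12345G")))   # False
```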
You can use the following rules to build multicharacter regular expressions:
xy?z matches either "xyz" or "xz".
HomeSite supports back referencing, which allows you to match text in previously matched sets of parentheses. You can use a backslash followed by a digit n (\n) to refer to the nth parenthesized subexpression.
One example of how you can use back references is searching for doubled words, for example, to find instances of "is is" or "the the" in text. The following example shows the syntax you use for back referencing in regular expressions:
("There is is coffee in the the kitchen", "([A-Za-z]+)[ ]+\1","*","ALL")
This code searches for words that are all letters ([A-Za-z]+) followed by one or more spaces [ ]+ followed by the first matched subexpression in parentheses. The parser detects the two occurrences of is as well as the two occurrences of the and replaces them with an asterisk, resulting in the following text:
There * coffee in * kitchen
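The doubled-word search can be reproduced as a sketch in Python, whose `re` module also uses `\1` for the first parenthesized subexpression. The second pattern adds `\b` word boundaries, which is a Python feature assumed here and not something this section documents for HomeSite; it shows one way to avoid matching a repeated fragment inside longer words.

```python
import re

text = "There is is coffee in the the kitchen"

# A word, one or more spaces, then the same word again via \1.
doubled_word = r"([A-Za-z]+)[ ]+\1"
result = re.sub(doubled_word, "*", text)
print(result)  # There * coffee in * kitchen

# Without boundaries, the "is" inside "this" pairs with the next "is".
unbounded = re.sub(doubled_word, "*", "this is fine")
print(unbounded)  # th* fine

# With \b on both sides (Python-specific), only whole words match.
bounded_pattern = r"\b([A-Za-z]+)[ ]+\1\b"
bounded = re.sub(bounded_pattern, "*", "this is fine")
print(bounded)  # this is fine
```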
You can anchor all or part of a regular expression to either the beginning or end of the string being searched:
The following table shows some regular expressions and describes what they match:
Expression | Description |
---|---|
[\?&]value= | A URL parameter value in a URL |
[A-Z]:(\\[A-Z0-9_]+)+ | An uppercase DOS/Windows full path that is not the root of a drive, and that has only letters, numbers, and underscores in its text |
(\+|-)?[1-9][0-9]* | An integer that does not begin with a zero and has an optional sign |
(\+|-)?[1-9][0-9]*(\.[0-9]*)? | A real number |
(\+|-)?[1-9]\.[0-9]*E(\+|-)?[0-9]+ | A real number in engineering notation |
a{2,4} | Two to four occurrences of "a": aa, aaa, aaaa |
(ba){3,} | At least three "ba" pairs: bababa, babababa, ... |
  | At least two occurrences of the same word |
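Several of the expressions in the table can be spot-checked with a short Python sketch. This assumes Python's `re` module treats these particular patterns the same way as the HomeSite parser, which holds for the basic operators used here but is not guaranteed in general; the sample inputs are invented.

```python
import re

INTEGER = r"(\+|-)?[1-9][0-9]*"
DOS_PATH = r"[A-Z]:(\\[A-Z0-9_]+)+"

# Each entry is True when the pattern behaves as the table describes.
checks = {
    "signed integer accepted": bool(re.fullmatch(INTEGER, "-42")),
    "leading zero rejected": re.fullmatch(INTEGER, "042") is None,
    "DOS path accepted": bool(re.fullmatch(DOS_PATH, r"C:\WINDOWS\SYSTEM32")),
    "drive root rejected": re.fullmatch(DOS_PATH, "C:\\") is None,
    "a{2,4} accepts 'aaa'": bool(re.fullmatch(r"a{2,4}", "aaa")),
    "a{2,4} rejects 'a'": re.fullmatch(r"a{2,4}", "a") is None,
    "(ba){3,} accepts 'bababa'": bool(re.fullmatch(r"(ba){3,}", "bababa")),
    "(ba){3,} rejects 'baba'": re.fullmatch(r"(ba){3,}", "baba") is None,
}

for description, passed in checks.items():
    print(f"{description}: {passed}")
```

`re.fullmatch` is used so that each sample must match in its entirety; a plain search would also succeed on a matching substring.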