



ExpressionList file


The plug-in validator, VALLEX, performs validation by searching data for occurrences of key words or phrases. This type of validation is known as lexical analysis. See page 7-86 for more details on VALLEX.

VALLEX uses a special file, known as an ExpressionList file, to obtain most of the information it needs when performing lexical analysis. This information includes the expressions to be searched for, configuration settings and any loading values [1] that are to be applied.

The format of the file is line based, with a complete statement appearing on a line of its own.

Comments can be included in the file. A comment is denoted by placing the semicolon character at the start of the comment. The comment is terminated at the end of the line.

For example:

; This is a comment line.

The file can be considered as being split into three sections, in this order: Loadings, Configuration settings and Expressions.

It is important not to change the ordering of these sections, due to the way VALLEX performs the search.

Loadings

The loading statements are listed first in the ExpressionList file.

These statements define conditional loading values that may be applied when calculating the final lexical analysis score. If a loading value is to be applied, the numeric total generated after all the data has been searched is multiplied by it to give the final lexical analysis score. For more details on how this score is generated see page 7-138.

Loading values are usually applied to allow different results to be generated depending on the value of previously set attributes.

For example, you may decide that it is more important to capture profanity in outgoing messages than it is in incoming messages, or you may consider that sensitive company information leaving your sales department carries a higher risk than company information leaving other departments.

In these situations an attribute expression [2] can be used to determine the direction and/or source of the message. An appropriate loading value can be applied depending on the results.

The format of the loading statement is:

Loading <Value> when <Attribute_Expression>

or

Loading <Value> Default

There can be several loading statements listed in the file.

For example:

Loading 3 when direction==in
Loading 4 when direction==out
Loading 2 Default

Each Loading statement evaluates its attribute expression against the previously set attributes to determine whether its loading value is applied. If the attribute expression evaluates to TRUE then the loading value is included when calculating the final lexical score.

The Default loading value is used if no matching attribute expression is found. If the Default loading statement is not present and no attribute expression evaluates to TRUE then no loading value is applied.
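The selection rules above can be sketched in Python. This is an illustration only, not the VALLEX implementation: the function name, the data layout and the rule that the first matching statement wins are all assumptions, because the text does not say what happens when more than one attribute expression evaluates to TRUE.

```python
# Hypothetical sketch: the names and the first-match rule are assumptions,
# not taken from VALLEX itself.

def select_loading(statements, attributes):
    """Return the loading value to apply, or None if no statement applies.

    statements: list of (value, condition) pairs in file order, where
        condition is the string "Default" or a (name, expected) pair.
    attributes: dict of previously set attributes, e.g. {"direction": "in"}.
    """
    default = None
    for value, condition in statements:
        if condition == "Default":
            default = value
        else:
            name, expected = condition
            if attributes.get(name) == expected:
                return value  # assumed: the first matching statement wins
    # No attribute expression matched: fall back to Default, if present.
    return default


statements = [(3, ("direction", "in")), (4, ("direction", "out")), (2, "Default")]
print(select_loading(statements, {"direction": "in"}))   # 3
print(select_loading(statements, {"direction": "out"}))  # 4
print(select_loading(statements, {}))                    # 2
```

When the Default statement is absent and nothing matches, the sketch returns None, which corresponds to no loading value being applied.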

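In the direction example above, a loading value of 3 is applied to incoming messages, a loading value of 4 is applied to outgoing messages, and a loading value of 2 is applied if the direction attribute has neither value.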

Another example is:

Loading 4 when department==out_of_sales

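In this example, a loading value of 4 is applied when the department attribute has the value out_of_sales. [3] If this is the only loading statement in the file, no loading value is applied in any other case.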

Configuration settings

The second section of the ExpressionList file defines certain configuration settings that are used to control how the data is searched.

This section can contain three directives: CaseSensitive, Separators and MaxWordLength.

Any of these directives can be omitted from the ExpressionList file. If so, the default values are used.

The CaseSensitive directive indicates whether or not expression matches are to be case sensitive. It can have a value of TRUE or FALSE.

For example:

CaseSensitive=TRUE

indicates that all expression matches are to be case sensitive.

CaseSensitive=FALSE

indicates that expression matches are not case sensitive.

The default value is FALSE.

The value set by the CaseSensitive directive can be overridden for individual expressions. This is achieved by setting a value with the expression when it is listed in the ExpressionList file. See page 7-139 for details.

The Separators directive lists the characters that are to be used as word separators during the search. These characters are entered as a string enclosed within double quotes, using escaped characters where necessary, for example, \t for tab, \n for linefeed, \r for carriage return, \" for double quotes. For a full list of escaped characters see page 7-4.

The format is:

Separators="<Separator_List>"

For example:

Separators=" \t\n\r\"-,."

In this example, the separator characters used are space, tab, linefeed, carriage return, double quotes, hyphen, comma and full stop.

Any character can be included in the list by using the three digit decimal representation of the character, preceded by a backslash. A range of characters can be specified by separating two such entries with a hyphen. There must be no space either side of the hyphen.

For example:

Separators=" \t\n\r\"-,.\001-\031"

These separators would typically be used in more complicated files, for example, Word documents that include formatting commands.

If no separator characters are specified in the file then the default characters used are space, tab, linefeed, carriage return, full stop and comma. A null character is always interpreted as a separator character.
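As an illustration of how a separator list might be expanded and used to split data into words, consider the following Python sketch. It is not the VALLEX implementation. The function names are hypothetical, only the escapes named above (plus \\) are handled because the full list on page 7-4 is not reproduced here, and an unrecognised escaped character is assumed to stand for itself.

```python
import re

# Only the escapes named in the text are handled here (assumption).
SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", '"': '"', "\\": "\\"}

# Order matters: a decimal range, then a single decimal code, then any
# other escaped character, then an ordinary character.
TOKEN = re.compile(r'\\(\d{3})-\\(\d{3})|\\(\d{3})|\\(.)|(.)', re.DOTALL)


def parse_separators(spec):
    """Expand the text found between the double quotes of a Separators directive."""
    separators = {"\0"}  # a null character is always a separator
    for low, high, code, escaped, literal in TOKEN.findall(spec):
        if low:
            separators.update(chr(c) for c in range(int(low), int(high) + 1))
        elif code:
            separators.add(chr(int(code)))
        elif escaped:
            separators.add(SIMPLE_ESCAPES.get(escaped, escaped))
        else:
            separators.add(literal)
    return separators


def split_words(data, separators):
    """Split data into words at any run of separator characters."""
    words, current = [], []
    for ch in data:
        if ch in separators:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


seps = parse_separators(r' \t\n\r\"-,.\001-\031')
print(split_words("Company-Confidential, Internal.", seps))
```

The sketch does not model the MaxWordLength directive.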

The MaxWordLength directive is used to alter the maximum word length allowed in the search. It would typically be used to ensure that VALLEX performs efficiently when presented with non text data.

The format is:

MaxWordLength=<Number>

where <Number> is any integer from 1 to 4095 inclusive. The default value is 127.

When performing lexical analysis on compound documents, such as Word documents, it is recommended that you set the MaxWordLength directive to 4095.

Expressions

The third section of the ExpressionList file lists the expressions that are to be included in the search.

Each expression is listed as a string within double quotes, along with a numeric value for the expression.

An optional + or - character can also be included, to indicate whether matches with the expression are case sensitive. If it is omitted, the value set by the CaseSensitive directive is used (see page 7-136).

The format is:

[+|-] "<Expression>" <Value>

where [+|-] indicates that the + or - character is optional.

Each expression can be a word or phrase. A phrase can contain a maximum of 20 words. Each phrase is split into words using one or more separator characters defined by the Separators directive.

For example:

"Company Confidential"  		10
"Policy N*$"				5
"Copyright"				4
"Internal"				3

Certain wildcard characters can be included in an expression. These characters match with certain characters during the search and can be used to make the search more general.

For example:

"Policy N*$"				5

In this example, the * character matches any number of non-separator characters, including none. The $ character matches any number of digits, including none. Valid matches for the second word would therefore be `NA5', `NABC5', `NA55', `NABC555' and so on.

The wildcard characters that can be used in an expression are shown in the following table:

Wildcard    Matches
*           Any number of non-separator characters, including none.
?           Any single non-separator character.
$           Any number of digits (0 to 9), including none.
#           Any single digit (0 to 9).

If any of the wildcard characters need to be included in the search as themselves, each character must be preceded by a backslash (\). [4]

For example:

A*B - reads as any word that starts with the character A and ends with the character B. (The * matches any string of non-separator characters.)

A\*B - reads as the text string A*B.

If the backslash is also required in the search then use two backslashes (\\).
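The wildcard and escape rules above can be illustrated by translating a single expression word into a Python regular expression. This is a sketch under assumptions rather than a description of how VALLEX matches internally: the function name is hypothetical, the separator set is passed in by the caller, and splitting a phrase into words is not shown.

```python
import re


def word_to_regex(word, separators, case_sensitive=False):
    """Translate one expression word into a compiled regular expression.

    separators must be a non-empty string or set of separator characters.
    """
    # Character class that matches any single non-separator character.
    non_sep = "[^" + "".join(re.escape(c) for c in sorted(separators)) + "]"
    parts = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "\\" and i + 1 < len(word):
            # A backslash makes the next character literal, e.g. \* or \\.
            parts.append(re.escape(word[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(non_sep + "*")   # any number of non-separators
        elif ch == "?":
            parts.append(non_sep)         # one non-separator
        elif ch == "$":
            parts.append("[0-9]*")        # any number of digits
        elif ch == "#":
            parts.append("[0-9]")         # one digit
        else:
            parts.append(re.escape(ch))
        i += 1
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile("".join(parts), flags)


seps = " \t\n\r.,"
pattern = word_to_regex("N*$", seps, case_sensitive=True)
print(bool(pattern.fullmatch("NABC555")))                            # True
print(bool(word_to_regex(r"A\*B", seps).fullmatch("AxB")))           # False
```

Using fullmatch means the whole word must match the expression word, which reflects the word-by-word matching described in this section.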

Two other characters can be used in association with expression lists. These are + and -. These characters are optional and are used to indicate whether or not individual expression matches are case sensitive.

The value of this character overrides the value set by the CaseSensitive directive (see page 7-136), but only for the one associated expression.

The + or - character is specified at the start of the line containing the expression, before the double quote.

For example:

+ "Company Confidential" 		10
- "Internal"              		3

This example indicates that matches with the expression `Company Confidential' are case sensitive and that matches with the expression `Internal' are not.

If neither the + nor the - character is specified then the value set by the CaseSensitive directive is used.

The numeric value assigned to the expression depends on the nature and context of the expression. The more important you consider the expression to be, the higher the value that should be assigned to it.

In the above example, the expression `Company Confidential' is considered to be more important than the expression `Internal' so it is assigned a higher value.
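The expression line format described in this section, including the optional + or - prefix, could be read with a small Python function such as the one below. It is a hypothetical sketch, not part of VALLEX: the function name and return shape are invented for illustration, and the numeric value is assumed to be a non-negative integer.

```python
import re

# [+|-] "<Expression>" <Value>
# The quoted part may contain backslash-escaped characters such as \" or \*.
EXPRESSION_LINE = re.compile(r'^\s*([+-])?\s*"((?:\\.|[^"\\])*)"\s+(\d+)\s*$')


def parse_expression_line(line, default_case_sensitive=False):
    """Return (expression, value, case_sensitive), or None if the line does not fit."""
    match = EXPRESSION_LINE.match(line)
    if match is None:
        return None
    flag, expression, value = match.groups()
    if flag == "+":
        case_sensitive = True
    elif flag == "-":
        case_sensitive = False
    else:
        # No prefix: fall back to the CaseSensitive directive value.
        case_sensitive = default_case_sensitive
    return expression, int(value), case_sensitive


print(parse_expression_line('+ "Company Confidential" \t\t10'))
print(parse_expression_line('- "Internal"              \t\t3'))
```

The expression text is returned with its escapes intact, so that wildcard handling can still distinguish * from \*.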

The numeric value assigned to an expression is added to a running total whenever the expression is encountered in the data being validated. This running total represents a numeric score for the expressions found in the data so far.

If a loading value is to be applied (see page 7-134), the running total that is generated when the search is complete is multiplied by it. This results in a final score being generated for the data.

The final score is checked against a list of numeric values. [5] Comparison with these values sets the <Response> that is generated as the result of the lexical analysis.

Lexical analysis example

The following example shows how lexical analysis could be configured to block incoming or outgoing messages containing confidential information.

In VALIDATE.CFG:

[AMU]
Authfile=c:\MSW\CONFIG\AUTHFILE.TXT
If=allow_in,direction=in,allow
If=allow_out,direction=out,allow

[Validation]
F-PROT=VALEXE
Confidential=VALLEX
ValidateAttributes=VALATTR

[Confidential]
ExpressionList=C:\MSW\CONFIG\CONFID.LST
20=ConfidentialModerate
50=Confidential

The If directive checks the <Response> generated by AMUcheck. If it is allow_in or allow_out then an attribute called direction is set, with the value in or out respectively. The <Response> is then reset to allow, which is the actual <Response> generated by AMUcheck. It allows the message to be delivered normally, assuming no higher priority <Response> is returned by one of the other validator instances (for example, the VALLEX validator instance defined in this example).

A new instance of the VALLEX validator, Confidential, is defined by creating an entry in the [Validation] section. A configuration section with the same name is created in the body of the file. The configuration section maps numeric values to <Response> values. It also specifies the path to the ExpressionList file, called CONFID.LST. See page 7-86 for more details on how to configure a VALLEX validator instance.

CONFID.LST contains the expressions to be searched for, loading values that may be applied and some configuration settings. An example file is shown below.

In CONFID.LST (the ExpressionList file):

;
;Direction Loadings
;
Loading 3 when direction==in
Loading 4 when direction==out
Loading 2 Default
;
; Configuration
;
CaseSensitive=TRUE
Separators=" \t\n\r-,."
MaxWordLength=4095
;
; Expression list
;
"Company Confidential" 		 	10
"Policy N*$"            	 	5
"Copyright"             		4
"Internal"              		3

Using this file, assume that an incoming message has one occurrence of the text `Policy NAT5' and three occurrences of `Internal'. The occurrence of `Policy NAT5' has a value of 5 and each occurrence of `Internal' has a value of 3. After validation is complete this gives a running total for the message of:

5 + 3 + 3 + 3 = 14

Next, any loading values to be applied are determined. The message is incoming so the attribute direction has the value in. [6] A loading value of 3 is therefore applied. This gives a final score for the message of:

14 * 3 = 42

The final score is checked against the numeric values defined in the [Confidential] section of VALIDATE.CFG (shown above). The value of 20 has been exceeded, but the value of 50 has not been reached. The <Response> returned in this instance is therefore ConfidentialModerate.
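The arithmetic in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the rule that a threshold counts as reached when the score is greater than or equal to it is an assumption, since the text does not say how a score exactly equal to a threshold is treated.

```python
def final_score(match_counts, values, loading=None):
    """Sum value * occurrences for each expression, then apply any loading."""
    running_total = sum(values[expr] * count for expr, count in match_counts.items())
    return running_total if loading is None else running_total * loading


def response_for(score, thresholds):
    """Return the response for the highest threshold reached, or None.

    Assumption: a threshold is reached when score >= threshold.
    """
    reached = [limit for limit in thresholds if score >= limit]
    return thresholds[max(reached)] if reached else None


# Values from CONFID.LST and thresholds from the [Confidential] section.
values = {"Company Confidential": 10, "Policy N*$": 5, "Copyright": 4, "Internal": 3}
thresholds = {20: "ConfidentialModerate", 50: "Confidential"}

# One match for "Policy N*$" and three for "Internal", incoming (loading 3).
score = final_score({"Policy N*$": 1, "Internal": 3}, values, loading=3)
print(score, response_for(score, thresholds))  # 42 ConfidentialModerate
```

Under the same assumptions, the same message sent outwards would use a loading value of 4, giving 14 * 4 = 56 and the response Confidential.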






[1] MAILsweeper only. Loading values are usually applied to allow different results to be generated, depending on previously set attribute values.

[2] The attribute is usually set by AMUcheck, using the If directive. See page 7-102 for details.

[3] This attribute would be set by AMUcheck, using the If directive. See page 7-102 for details.

[4] The double-quotes character must also be preceded by a backslash if it is to be included in the search, for example, \".

[5] These values are defined in the VALLEX configuration section. See page 7-86 for details.

[6] This is the attribute set by AMUcheck, using the If directive. See page 7-102 for details.

msw.support@mimesweeper.com

Copyright © 1998, Content Technologies Limited. All rights reserved.