



Blocking unsuitable pages


WEBsweeper can be configured to prevent pages with unsuitable material from being downloaded. This can include material of an offensive, obscene or illegal nature. It can also be used to prevent users outside your company from retrieving confidential information from internal Web servers.

Inappropriate pages can be identified using either PICS ratings or lexical analysis. Each method is described below.

Using PICS ratings

WEBsweeper can identify the PICS rating given to an HTML page. This rating is set as standard attributes on the HTML data and comprises several different categories. The PICS rating may be contained in the body of the HTML page or may be requested separately, from a bureau. The HTML page may be assigned more than one rating, supplied by different sources.

WEBsweeper can only check the PICS rating if it is contained in the body of the HTML page. It cannot check ratings that are requested separately.

For ease of configuration, WEBsweeper maps the PICS ratings into its own internal rating system, using several categories, shown below. These mappings are set using the VALHTML validator. See page 7-92 for details.

Category        Values
WSW_AGE         0-4
WSW_SEX         0-4
WSW_LANGUAGE    0-4
WSW_VIOLENCE    0-4
WSW_OTHER       0-4
WSW_LABEL       TRUE/FALSE

An instance of the VALATTR validator, called PICS, is used to check the PICS rating of the HTML page, using the mapped categories. It is found in the [Validation] section of the http configuration file, HTTP.CFG.

For example:

[Validation]
PICS=VALATTR

[PICS]
;NoRating=WSW_LABEL==FALSE
HaveOther=WSW_OTHER>2
HaveAge=WSW_AGE>2
HaveLanguage=WSW_LANGUAGE>2
HaveViolence=WSW_VIOLENCE>2
HaveSex=WSW_SEX>2

Each WEBsweeper category is listed in the [PICS] configuration section, along with a threshold value for the category, as an attribute expression. For example, the attribute expression WSW_AGE>2 has a category of WSW_AGE and a threshold value of 2.

The mapped ratings of the page are compared against the threshold values for each of the categories. If a mapped rating exceeds the threshold value for one or more categories then an appropriate <Response> is generated, according to the VALATTR rules.

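For example, with the [PICS] configuration shown above, a page whose mapped WSW_LANGUAGE rating is 3 exceeds the threshold value of 2, so the HaveLanguage <Response> is generated.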

See page 7-81 for more details on how VALATTR performs validation.
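
The threshold comparison described above can be modelled in a few lines of Python. This is an illustrative sketch only, not WEBsweeper's implementation: the function name `evaluate_pics`, the regular expression, and the restriction to the `>` and `==` operators are assumptions based solely on the attribute expressions shown in this section.

```python
import re

# Attribute expressions copied from the [PICS] example in this section.
PICS_RULES = {
    "HaveOther": "WSW_OTHER>2",
    "HaveAge": "WSW_AGE>2",
    "HaveLanguage": "WSW_LANGUAGE>2",
    "HaveViolence": "WSW_VIOLENCE>2",
    "HaveSex": "WSW_SEX>2",
}

# Assumption: only the two operators that appear in this section are handled.
_EXPR = re.compile(r"^(\w+)(>|==)(\w+)$")


def evaluate_pics(rules, ratings):
    """Return the responses whose attribute expression holds for `ratings`."""
    responses = []
    for response, expression in rules.items():
        match = _EXPR.match(expression)
        if match is None:
            raise ValueError(f"unsupported expression: {expression}")
        category, operator, value = match.groups()
        if category not in ratings:
            continue  # no mapped rating supplied for this category
        actual = ratings[category]
        if operator == ">" and actual > int(value):
            responses.append(response)
        elif operator == "==" and str(actual).upper() == value.upper():
            responses.append(response)
    return responses


# A page rated 3 for language and 1 for violence exceeds only one threshold.
page = {"WSW_LANGUAGE": 3, "WSW_VIOLENCE": 1, "WSW_LABEL": "TRUE"}
print(evaluate_pics(PICS_RULES, page))  # ['HaveLanguage']
```

The sketch returns every matching response; how VALATTR chooses between several matches is not modelled here.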

Each <Response> defined in the [PICS] configuration section has a corresponding entry in the [Disposal] configuration section. This section is also found in the http configuration file, HTTP.CFG.

The entry maps the <Response> to a final disposition for the data.



For example:

[Disposal]
DefaultDisposal=Clean
...
NoRating=BlockNoRating
HaveOther=BlockOther
HaveAge=BlockAge
HaveLanguage=BlockLanguage
HaveViolence=BlockViolence
HaveSex=BlockSex
...
VIRUSPRESENT=Virus

Using this example, and assuming that HaveLanguage is the highest priority <Response> generated by validation, the final disposition for the Web page is BlockLanguage.
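
The mapping from a <Response> to its final disposition behaves like a table lookup. The Python sketch below illustrates this under two stated assumptions: that the highest priority <Response> has already been chosen (priority ordering is not described in this section, so it is not modelled), and that DefaultDisposal applies when validation generates no <Response>. The helper name `final_disposition` is hypothetical.

```python
# Entries copied from the [Disposal] example above; the elided "..." lines
# are left out rather than guessed.
DISPOSAL = {
    "DefaultDisposal": "Clean",
    "NoRating": "BlockNoRating",
    "HaveOther": "BlockOther",
    "HaveAge": "BlockAge",
    "HaveLanguage": "BlockLanguage",
    "HaveViolence": "BlockViolence",
    "HaveSex": "BlockSex",
    "VIRUSPRESENT": "Virus",
}


def final_disposition(disposal, response=None):
    """Map the chosen <Response> to its disposition (hypothetical helper)."""
    if response is None:
        # Assumption: DefaultDisposal applies when validation raised nothing.
        return disposal["DefaultDisposal"]
    # A missing entry is treated as a configuration error (KeyError).
    return disposal[response]


print(final_disposition(DISPOSAL, "HaveLanguage"))  # BlockLanguage
print(final_disposition(DISPOSAL))                  # Clean
```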

Each disposition listed has a corresponding configuration section in the same file, used to control the disposal actions taken.

For example:

[BlockLanguage]
InformText=Page blocked - content unsuitable

In this example the page is discarded and replaced with a message indicating that the download was not successful. The message text sent is the string specified by the value of the InformText directive.

See page 7-43 for more details on the InformText directive.

Pages with no PICS rating, or a rating that cannot be mapped, are assigned the attribute WSW_LABEL, with the value FALSE. The value of this attribute can be checked like all the other values.

To block such pages, edit the [PICS] configuration section of the http configuration file, HTTP.CFG, so that the NoRating directive is no longer commented out.

That is, change:

[PICS]
;NoRating=WSW_LABEL==FALSE
HaveOther=WSW_OTHER>2
HaveAge=WSW_AGE>2
HaveLanguage=WSW_LANGUAGE>2
HaveViolence=WSW_VIOLENCE>2
HaveSex=WSW_SEX>2

to

[PICS]
NoRating=WSW_LABEL==FALSE
HaveOther=WSW_OTHER>2
HaveAge=WSW_AGE>2
HaveLanguage=WSW_LANGUAGE>2
HaveViolence=WSW_VIOLENCE>2
HaveSex=WSW_SEX>2

The above example will block any page that has a mapped rating greater than 2 in any category, and also any page that has no PICS rating assigned.

The majority of HTML pages do not currently have a PICS rating assigned. The above example will therefore result in more pages being blocked than may be practical.

The mappings from external rating services to WEBsweeper's rating scheme are found in the file PICSMAP.CFG. See Appendix A for details and for a full description of the PICS rating scheme.


Using lexical analysis

The previous example showed how PICS ratings can be used to detect and block unsuitable pages. However, as some sites do not currently use PICS to rate their pages, much unsuitable material may remain undetected.

Another method of detecting and blocking unsuitable pages is to search the HTML text for certain expressions, for example, words or phrases that indicate profanity is present. This can be achieved using the lexical analysis validator, VALLEX.

The following example shows how VALLEX can be configured to detect unsuitable content, by searching the HTML text for certain keywords and phrases.

[Validation]
F-PROT=VALEXE
LEX=VALLEX
PICS=VALATTR

[LEX]
PerformIf=ContainerName==PlainText
ExpressionList=C:\MSW\CONFIG\PROF.LST
1=HaveProfane

A new instance of the VALLEX validator is created, called LEX. It is defined in the [Validation] section and a corresponding [LEX] configuration section is created in the body of the file.

The [LEX] configuration section specifies the name of an ExpressionList file that contains the expressions to be searched for and certain other configuration information. In this example the file is called PROF.LST.

The [LEX] configuration section also maps numeric values that may be obtained as a result of the search to <Response> values. In this example there is only one mapping, that is, 1=HaveProfane. This mapping has a numeric value of 1 and a <Response> of HaveProfane.

The <Response> generated by the LEX validator instance is determined by a numeric score obtained as a result of the search. In this example:

In HTTP.CFG:

[Disposal]
DefaultDisposal=Clean
...
HaveProfane=BlockProfane
...
VIRUSPRESENT=Virus

Each <Response> used in the [LEX] configuration section has a corresponding entry in the [Disposal] section. In this example there is only one entry, for the HaveProfane <Response>. This entry maps the <Response> to a final disposition for the Web data.

Assuming that HaveProfane is the highest priority <Response> generated by validation, the final disposition is BlockProfane.

The BlockProfane disposition has a corresponding configuration section in the same file. This configuration section controls the disposal actions taken.

[BlockProfane]
InformText=Page blocked - content unsuitable

In this example the page is discarded and replaced with a message indicating that the download was not successful. The message text sent is the string specified by the value of the InformText directive.

See page 7-43 for more details on the InformText directive.

In PROF.LST (the ExpressionList file):

"profane_word1"         1
"profane_word2"         1

The ExpressionList file contains, amongst other configuration information, the expressions to be included in the search.

Each expression is given a numeric value, depending on its considered importance in the search. In this example, each expression is considered to be of equal importance, so is given the same value, that is, 1.
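
The expression entries shown above pair a quoted expression with a numeric value. The Python sketch below reads only lines of that shape. It is an assumption-laden illustration: the full ExpressionList syntax is not described in this section, so any line that does not match the quoted-expression-plus-integer pattern is simply skipped, and the function name `parse_expression_lines` is hypothetical.

```python
import re

# Matches lines of the form:  "expression"   <integer>
_ENTRY = re.compile(r'^\s*"([^"]+)"\s+(\d+)\s*$')


def parse_expression_lines(lines):
    """Collect {expression: value} from lines that match the pattern."""
    expressions = {}
    for line in lines:
        match = _ENTRY.match(line)
        if match:
            expressions[match.group(1)] = int(match.group(2))
        # Other lines (other configuration information) are ignored here.
    return expressions


sample = [
    '"profane_word1"         1',
    '"profane_word2"         1',
]
print(parse_expression_lines(sample))
# {'profane_word1': 1, 'profane_word2': 1}
```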

Each time an expression is found in the data being searched, the associated numeric value is added to a running score for that data. At the end of validation a final numeric score is obtained. This score is used to determine the <Response> generated, by comparing it with the entries listed in the [LEX] configuration section, as explained on page 5-59.

In this example, if any of the expressions listed are detected, even once, the <Response> generated is HaveProfane.
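
The scoring described above can be sketched in Python as follows. This is a simplified model rather than VALLEX itself: it uses plain case-sensitive substring counting because the matching rules are not described in this section, and it assumes that the <Response> chosen is the one with the largest numeric value that the final score reaches. The function names `lexical_score` and `response_for` are hypothetical.

```python
# Values taken from the PROF.LST and [LEX] examples in this section.
EXPRESSIONS = {"profane_word1": 1, "profane_word2": 1}
LEX_MAPPINGS = {1: "HaveProfane"}


def lexical_score(text, expressions):
    """Add an expression's value to the score once per occurrence found."""
    score = 0
    for expression, value in expressions.items():
        score += text.count(expression) * value
    return score


def response_for(score, mappings):
    """Assumption: pick the largest mapped value not exceeding the score."""
    reached = [value for value in mappings if value <= score]
    return mappings[max(reached)] if reached else None


text = "a profane_word1 b profane_word1 c profane_word2"
score = lexical_score(text, EXPRESSIONS)
print(score, response_for(score, LEX_MAPPINGS))   # 3 HaveProfane
print(response_for(lexical_score("clean text", EXPRESSIONS), LEX_MAPPINGS))  # None
```

Under these assumptions a single occurrence already yields a score of 1, which matches the behaviour stated above for the HaveProfane <Response>.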

See the VALLEX section on page 7-86 and the Disposal section on page 7-22 for more details.
 





Note 1: The PerformIf directive in the [LEX] example is used to ensure that lexical analysis is only performed on plain text.

msw.support@mimesweeper.com

Copyright © 1998, Content Technologies Limited. All rights reserved.