How to Use Filters

You can use filters to limit the addresses and types of files that will be downloaded. For instance, you can specify that the OC should:
- not download images embedded on Web pages
- download only from a particular Web site
- not download from a certain directory on a Web site
- exclude particular Web sites.

The scope of filters can be project-wide or be limited to a single task.


Task filters are combined with the current project filters using the logical AND operation every time the program decides whether to retrieve a given URL. Thus, project filters affect every task in a project, while task filters affect only the task for which they were specified.
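The AND combination can be pictured with a short Python sketch. This is only an illustration, not the program's actual implementation: the two filter functions and their patterns are hypothetical, and Python's `fnmatchcase` stands in for the program's own wildcard matching.

```python
from fnmatch import fnmatchcase


def project_filter(url: str) -> bool:
    # Hypothetical project-wide filter: stay on domain.com.
    return fnmatchcase(url, "*domain.com*")


def task_filter(url: str) -> bool:
    # Hypothetical task filter: skip the cgi-bin directory.
    return not fnmatchcase(url, "*domain.com/cgi-bin/*")


def should_retrieve(url: str) -> bool:
    # A URL is retrieved only if BOTH filters accept it (logical AND).
    return project_filter(url) and task_filter(url)
```

With these assumed filters, a URL rejected by the project filter is never retrieved, no matter what the task filter says.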

There are four types of filters:

You can also filter files by size. You can specify that the program should retrieve pages or files that are no more than a certain maximum size and/or no less than a certain minimum size. Type in the maximum size or a range of sizes in the Max Size or Min Size - Max Size text box for each type of filter. This is an optional parameter. If you do not provide the sizes, Internet Researcher will download files regardless of their sizes.
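The size check described above amounts to a simple range test. The sketch below assumes sizes are plain numbers in a single unit (the unit is not stated here) and uses `None` to stand for a size that was left blank; the function and parameter names are illustrative only.

```python
from typing import Optional


def size_accepted(size: int,
                  min_size: Optional[int] = None,
                  max_size: Optional[int] = None) -> bool:
    # A missing limit (None) means "no restriction on that side".
    if min_size is not None and size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    return True
```

When both limits are omitted, every file passes, which mirrors the behavior described for an empty size field.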

Pattern Matching Filters:

A pattern matching filter is a logical expression that is evaluated for each URL of a task to determine whether the program should retrieve that URL. The expression evaluates to either true or false. If a filter applied to a URL evaluates to true, the URL is accepted and placed in the download queue. If the filter evaluates to false, the URL is rejected and not downloaded.

The operands of the logical expression are URLs, URL patterns, and other expressions. The operators are AND, OR, and NOT (case is not significant: operators can be written in either lowercase or uppercase). Expressions can include subexpressions enclosed in parentheses. URL patterns can include two kinds of wildcard characters: an asterisk (*) and a question mark (?).

Character: *
Usage: Matches zero or more characters. In most cases it is used as the first or last character in a URL pattern.
Example: http://domain.com/* matches http://domain.com/, http://domain.com/index.com, http://domain.com/images/logo.jpg

Character: ?
Usage: Matches any single character.
Example: img??.gif matches img01.gif, img35.gif but does not match img112.gif
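The two wildcards can be emulated by translating a pattern into a regular expression. The sketch below is an approximation rather than the program's own matcher: it assumes, as the img??.gif example suggests, that ? matches any one character (including digits), that matching is case-sensitive, and that a pattern must cover the whole URL. The function name `wildcard_match` is made up for this illustration.

```python
import re


def wildcard_match(pattern: str, url: str) -> bool:
    # Translate the pattern piece by piece:
    #   *  -> any run of characters (possibly empty)
    #   ?  -> exactly one character
    # Everything else is matched literally.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # fullmatch: the pattern must cover the entire URL.
    return re.fullmatch("".join(parts), url, re.DOTALL) is not None
```

Because the whole URL must be covered, a pattern such as domain.com without leading and trailing asterisks would match only the literal string domain.com under this assumption.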

Note that a filter is not a list of URLs that must be included or excluded. It is a logical expression that is evaluated for each URL to decide whether to download it, and each URL must satisfy the whole filter to be accepted. This is why you must use OR to join different URLs. If you join two URL patterns with AND, a URL must match both patterns to be accepted. Joining two fully qualified URLs (containing no wildcard characters) with AND makes no sense at all, because no URL can be equal to two different URLs at once.
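To make the "one expression per URL" idea concrete, here is a small Python evaluator for such filters. It is a sketch under several assumptions that are not stated in this document: NOT binds tighter than AND, which binds tighter than OR; patterns contain no spaces or parentheses; and Python's `fnmatchcase` approximates the program's wildcard rules (it additionally treats square brackets as special). The name `evaluate` is hypothetical.

```python
import re
from fnmatch import fnmatchcase


def evaluate(expr: str, url: str) -> bool:
    # Split into parentheses and whitespace-separated words.
    tokens = re.findall(r"[()]|[^\s()]+", expr)
    pos = 0

    def peek():
        # Operators are case-insensitive, so compare in lowercase.
        return tokens[pos].lower() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "or":
            advance()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "and":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "not":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        if pos >= len(tokens):
            raise ValueError("unexpected end of filter")
        tok = advance()
        if tok == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        # Anything else is a URL or URL pattern tested against the URL.
        return fnmatchcase(url, tok)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result
```

Under these assumptions, `evaluate("http://a.com/ or http://b.com/", "http://a.com/")` accepts the URL, while the same two URLs joined with AND reject every URL, which is the pitfall described above.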

You can check whether a URL is accepted or rejected by a task filter on the Test Filters page. To view the Test Filters page, click on a task and select Test Filters from the Task Menu. Type in a URL, select a type of filter, and press the Test button. The page will reload showing you the result of the test (accepted or not) as well as a detailed explanation of why the URL was accepted or rejected by the filter. Every pattern in both the task filter and the project filter will be highlighted in green or red depending on whether the pattern accepts or rejects the specified URL.

Examples for link filters:

Filter: *domain-1.com* or *domain-2.com*
Description: Matches any URL that contains either domain-1.com or domain-2.com, such as:
  • http://domain-1.com/
  • http://domain-2.com/
  • http://www.domain-1.com/
  • http://www.domain-2.com/about.htm
  • ftp://ftp.domain-1.com/archive.zip
  • http://go.to/cgi-bin/redirect.exe?domain-1.com
Does not match URLs that contain neither domain-1.com nor domain-2.com.

Filter: http://domain.com/* or http://www.domain.com/*
Description: Matches any URL that begins with http://domain.com/ or with http://www.domain.com/
Does not match:
  • ftp://ftp.domain.com/archive.zip
  • http://mail.domain.com/
  • https://www.domain.com/

Filter: http://*.domain.com/*
Description: Matches http://www.domain.com/
Does not match http://domain.com/ because there is no dot before domain.com

Filter: http://*domain.com/* and (*.html or *.htm)
Description: Matches any URL from the domain.com website ending with .html or .htm
Does not match http://domain.com/ because it does not end with .html or .htm

Filter: (*domain-1.com* or *domain-2.com*) and (*.html or *.htm)
Description: Matches any URL from domain-1.com or domain-2.com ending with either .html or .htm

Filter: *domain.com* and not *domain.com/cgi-bin/*
Description: Matches any page from domain.com except pages in the cgi-bin directory on that site.

Filter: *domain.com* and not (*domain.com/dir1* and not *domain.com/dir1/dir2*)
This can also be written as:
*domain.com* and not *domain.com/dir1* or *domain.com/dir1/dir2*
Description: Matches any page from domain.com except the pages within the dir1 directory. Pages in its subdirectory dir2 are still accepted.
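The two spellings of the dir1/dir2 filter can be compared with a quick Python check. This sketch relies on Python's own operator precedence (NOT, then AND, then OR) and assumes the program evaluates filters the same way, which this document does not state explicitly. `fnmatchcase` stands in for the program's wildcard matching, and the sample URLs are invented.

```python
from fnmatch import fnmatchcase


def operands(url: str):
    # The three patterns used by both spellings of the filter.
    a = fnmatchcase(url, "*domain.com*")
    b = fnmatchcase(url, "*domain.com/dir1*")
    c = fnmatchcase(url, "*domain.com/dir1/dir2*")
    return a, b, c


def form_with_parentheses(url: str) -> bool:
    a, b, c = operands(url)
    return a and not (b and not c)


def form_without_parentheses(url: str) -> bool:
    a, b, c = operands(url)
    # Read as (a and (not b)) or c under the assumed precedence.
    return a and not b or c


SAMPLE_URLS = [
    "http://domain.com/index.htm",
    "http://domain.com/dir1/page.htm",
    "http://domain.com/dir1/dir2/page.htm",
    "http://other.com/",
]

# Both spellings should give the same verdict for every sample URL.
for url in SAMPLE_URLS:
    assert form_with_parentheses(url) == form_without_parentheses(url)
```

The forms agree because any URL matching the dir2 pattern necessarily matches *domain.com* as well, so the trailing OR term cannot admit a URL from another site.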

Examples for image filters:

Filter: *.jpeg OR *.jpg
Description: Downloads only JPEG files.

Filter: not (*doubleclick.com* or *humanclick.com*)
Description: Does not retrieve images from those two ad banner serving websites.

Filter: *cool-images.com*
Description: Downloads images only from the cool-images.com website.

Filter: *my-domain.com/my-photos/*
Description: Downloads images only from the my-photos directory on the my-domain.com website.

Filter: not *
Description: Can be used to exclude all embedded images. It has the same effect as setting the "Do not retrieve" option. This pattern does not match any URL.