Word-Breaker DLLs and Noise Words

This topic describes how document text and properties returned by filters are broken up into words and how common words are excluded.

Word-Breaker DLLs

A word-breaker DLL parses the text and textual properties returned by the filter DLL into words. The word-breaker DLL is language dependent. For a list of languages supported by Index Server, see the Index Server Web page.
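As a rough illustration of what "breaking text into words" means, the following Python sketch splits a string into lowercase words. It is not the algorithm used by any Index Server word-breaker DLL; the function name break_words and the pattern are hypothetical, and a real word breaker applies language-specific rules that this sketch ignores.

```python
import re

# Hypothetical stand-in for a language-dependent word breaker.
# Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
# Real word-breaker DLLs apply language-specific rules instead.
WORD_PATTERN = re.compile(r"[A-Za-z0-9']+")


def break_words(text):
    """Return the lowercase words found in text, in their original order."""
    return [match.group(0).lower() for match in WORD_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(break_words("The filter returns text; the word breaker splits it."))
```

A pattern this simple would not work for languages that do not separate words with spaces or punctuation, which is one reason the word breaker is language dependent.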

Noise Words

Words that are not significant for searching are called noise words or stop words. Noise words are stored in the %systemroot%\system32 directory in noise word files (Noise.dat, by default). The noise word files are language dependent. The noise word file for a particular language is specified in the registry under the key:

HKEY_LOCAL_MACHINE
\SYSTEM
 \CurrentControlSet
  \Control
   \ContentIndex
    \Language
     \<language>
      \NoiseFile

For example, noise.dat is specified as the noise word file for English_US by the following registry entry:

HKEY_LOCAL_MACHINE
\SYSTEM
 \CurrentControlSet
  \Control
   \ContentIndex
    \Language
     \English_US
      \NoiseFile
       \noise.dat
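The registry lookup described above could be scripted along the following lines. This is a minimal sketch, assuming that NoiseFile is a named value under the language key and that the script runs on Windows with Index Server installed; the helper names noise_file_subkey and read_noise_file_name are hypothetical. It uses Python's standard winreg module.

```python
import sys

# Registry path taken from the key shown above, relative to HKEY_LOCAL_MACHINE.
LANGUAGE_KEY = r"SYSTEM\CurrentControlSet\Control\ContentIndex\Language"


def noise_file_subkey(language):
    """Build the subkey path for one language, for example English_US."""
    return LANGUAGE_KEY + "\\" + language


def read_noise_file_name(language):
    """Read the noise word file name for a language (Windows only).

    Assumption: NoiseFile is a named value stored under the language key.
    """
    import winreg  # imported here because the module exists only on Windows

    subkey = noise_file_subkey(language)
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
        value, _value_type = winreg.QueryValueEx(key, "NoiseFile")
    return value


if __name__ == "__main__":
    print(noise_file_subkey("English_US"))
    if sys.platform == "win32":
        try:
            print(read_noise_file_name("English_US"))
        except OSError as error:
            # Raised when the key or value is missing or access is denied.
            print("Could not read NoiseFile:", error)
```

The path-building step is kept separate from the registry read so that it can be checked on any platform.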

Noise word files can be edited with a text editor to add new words or to remove words that are not considered “noise” at a particular installation. Because noise words are excluded from the index, a query for a noise word does not return any hits.

Caution    Removing all noise words from the noise word files can significantly increase the size of indexes.
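The effect of a noise word list can be sketched in a few lines of Python. The sketch assumes that a noise word file is plain text containing words separated by white space; the sample word list and the function names are hypothetical and are not taken from an actual Noise.dat file.

```python
def parse_noise_words(file_text):
    """Return the set of noise words in the text of a noise word file.

    Assumption: the file holds words separated by spaces or line breaks.
    """
    return {word.lower() for word in file_text.split()}


def remove_noise_words(words, noise_words):
    """Drop every word that appears in the noise word set."""
    return [word for word in words if word.lower() not in noise_words]


# Hypothetical noise word file contents, one word per line.
sample_file_text = "a\nan\nand\nthe\nof\n"
noise = parse_noise_words(sample_file_text)

# Only the significant words remain to be indexed.
print(remove_noise_words(["the", "size", "of", "an", "index"], noise))  # ['size', 'index']

# A query made up entirely of noise words leaves nothing to search for.
print(remove_noise_words(["the", "of"], noise))  # []
```

With an empty noise word set, every word passes through and would be indexed, which illustrates why removing all noise words increases index size.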


© 1997 by Microsoft Corporation. All rights reserved.