####################################################################
#
#
# File: tknztbld.def
#
# Personal Library Software, July, 1993
# Tom Donaldson
#
# Tokenizer definitional data for the table driven tokenizer:
# CplTabledRomanceTokenizer.
#
# The CplTabledRomanceTokenizer allows customization of tokenization by
# editing rules that define the operation of the tokenizer. The central
# concept is "word continuation" rules, defining kinds of characters
# that CANNOT be split from each other.
#
# History
# -------
#
# 29jul1993  tomd   Performance improvements. Got rid of unused
#                   character classes. Defined all non-token chars
#                   to be the "break" character class. Ordered
#                   user-defined character classes by expected
#                   frequency in text.
#
#                   Also added performance hints throughout.
#
# 26aug93    tomd   No longer need to map all chars to char-classes.
#                   Unmapped ones will default to break-chars.
#
#                   No longer need to assign numeric values to
#                   character classes. No longer need to assign names
#                   to predefined character classes (and cannot).
#
#                   Canonizer map no longer has to contain all
#                   characters.
#
####################################################################



####################################################################
#
# Installation
# ============
#
# Database.def File
# -----------------
#
# To use the CplTabledRomanceTokenizer, you need this line in the .def
# file for the database:
#
#     TOKENIZER = CplTabledRomanceTokenizer
#
#
# Tokenizer File
# --------------
#
# This file, tknztbld.def, is the rule file. Note that the name of the
# file CANNOT be changed, and the file MUST be in the "home directory"
# of the database using the tokenizer, or in the "system" directory for
# the CPL installation.
#
####################################################################


####################################################################
#
# Operational Overview
# ====================
#
# Database Open
# -------------
#
# When a database is opened, its .def file is read. In the .def file,
# a non-default tokenizer may be specified via a line of the form
#
#     TOKENIZER = aTokenizerName
#
# If aTokenizerName is "CplTabledRomanceTokenizer", as soon as the
# tokenizer is needed, it will try to read its definition file (i.e.,
# this tknztbld.def file) from the same directory as the database's .def
# file.
#
# If a problem arises during the load, tokenizer creation will fail. In
# this case, please do a "diff" between the file that failed to load and
# the original copy. Regretfully, few diagnostic error messages are
# currently available to assist in determining why tokenizer definition
# rules did not load.
#
# During Tokenization
# -------------------
#
# As a buffer is scanned for words, each character is converted to a
# "character class" via a Character Classification Map. The character
# classes are compared against tokenization rules, defined in this file.
# If the rules explicitly state that the current character code may not
# be separated from the preceding one, then the current character is
# treated as part of a word, and the next character is classified and
# tested. This process continues until the scanner finds a character
# whose classification is not specified in the rules as always being
# kept with the character class of the just-preceding character.
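#
# The scan loop just described can be sketched in Python. This is a
# hedged illustration of the algorithm, not the actual CPL code; the
# function and parameter names are invented for the sketch, and any
# character missing from the classification map is treated as Break,
# per the defaults described later in this file.
#
```python
def scan(text, classify, rules):
    """classify: dict mapping a character to its class name.
    rules: set of (class, class) pairs that may NOT be split."""
    tokens, current = [], ""
    for ch in text:
        cls = classify.get(ch, "Break")        # unmapped chars break words
        if current and (classify.get(current[-1], "Break"), cls) in rules:
            current += ch                      # may not separate: keep scanning
        else:
            if current:
                tokens.append(current)         # boundary found: emit the word
            current = ch if cls != "Break" else ""
    if current:
        tokens.append(current)
    return tokens
```
# Note this simplified loop lets any non-Break character start a
# token; the real tokenizer's handling of one-character tokens is
# governed by the Word Continuation Rules in Section 3, below.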
#
#
# Performance Hints
# =================
#
# The table driven tokenizer allows a great deal of flexibility in
# scanning words. It is possible to create a tokenizer definition that
# will scan complex patterns as tokens, or not, depending upon the
# immediate context of the characters being scanned. However, this
# flexibility comes at a price: performance.
#
# In general, the simpler and fewer the character classes and rules,
# the faster tokenization will be. That is, the closer the rules come
# to generating tokens that would pass the simple isalnum() test, the
# faster tokenization will be. The further your needs depart from this
# simplicity, the longer it will take to index files.
#
#
####################################################################


####################################################################
#
#
# Required File Layout
# ====================
#
# This tokenizer definition file, tknztbld.def, must contain the
# following sections in this order:
#
#     Section 1: Character Class Definitions
#
#     Section 2: Character Classification Map
#
#     Section 3: Word Continuation Rules
#
#     Section 4: Canonization Map
#
# Section 1, the "Character Class Definitions", gives meaningful names
# to "kinds" of characters. These class names are used in defining
# what kinds of characters make up words.
#
# Section 2, the "Character Classification Map", assigns a character
# class to each character in the character set used by documents.
#
# Section 3, the "Word Continuation Rules", uses the character class
# names defined in the Character Class Definitions to specify what
# groupings of characters may not be separated from each other during
# tokenization.
#
# Section 4, the "Canonization Map", specifies translations of
# characters from their raw input form to their final "canonical"
# indexed form.
#
# The detailed structure of each section of the tokenizer definition is
# covered in comments accompanying the sections, below.
#
# The lines in this file that are preceded by the pound character, '#',
# are comment lines (obviously). Comments may only appear on a line by
# themselves, but comment lines may appear anywhere in the file.
# Likewise, blank lines may appear anywhere in the file.
#
####################################################################
-
- ####################################################################
- #
- # Section 1: Character Class Definitions
- #
- ####################################################################
- #
- # The Character Class Definitions give names to the types of
- # characters that can take part in words, and that delimit words. These
- # names will be used in Section 2, the Character Classification Map, to
- # assign character classes to individual characters.
- #
- # You may define up to 250 character classes, although fewer than 10
- # will most likely be enough. Every character than can possibly ever
- # appear in the data to be tokenized MUST be assigned one of the defined
- # classes. The mapping is done via a character-class map, which appears
- # that the end of this file.
- #
- #
- # Predefined Special Values
- # =========================
- #
- # There are four predefined values that are special to the tokenizer:
- #
- # Invalid - NO character should EVER be of this class.
- #
- # EndRule - This is a character class that is used in this
- # definition file to mark the end of your character
- # classification names. It must be the last character
- # class name listed in the table below, and can only be
- # the last one listed.
- #
- # Break --- Characters of the "Break" class can NEVER be part of a
- # token. It is always valid for the tokenizer to
- # word-break the datastream either before or after a Break
- # character.
- #
- # EndBuff - Characters of the EndBuff class will be treated as a
- # "null" terminating character when scanning data for
- # tokens. The ASCII NUL character is an EndBuff character
- # by default, and you will not normally map other
- # characters to this class.
- #
- #
- # Only predefined Character Class names, or Character Class names that
- # are defined in this Character Class Definition section may be used
- # anywere else in the definition. Definitions are case sensitive.
- #
- #
- # Performance Hints
- # =================
- #
- # 1) Use The "Break" Class As Much As Possible. The Break class is a
- # special character class. Minimal testing is done on characters
- # classified as Break. This is because it is ALWAYS valid to separate a
- # Break character from any other character. Any characters that will
- # NEVER be part of a token should be classified as Break.
- #
- # 2) Define Class Names By Frequency. When creating user-defined character
- # classes, list the classes that will be assigned to the largest numbers
- # of characters in the data first. For example, you would expect most
- # data in a text file to classified as "letter" characters; you should
- # define your "letter" class name first.
- #
- # 3) Define As Few Classes As Possible. Fewer classes means less
- # testing. Less testing means faster tokenization.
-
-

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Default Character Class Definitions: Names
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
#
# Name
# ----
Letter
Number
Dot
Paragraf
HardSpace
#
# The EndRule character class must always be the last class name listed,
# and must only appear at the end of your character class definitions.
# If it is not at the end of the character class definitions, an error
# occurs as soon as the loader hits a non-blank, non-comment line that
# is not a character class definition.
#
EndRule

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

####################################################################
#
# Section 2: Character Classification Map
#
####################################################################
#
# Maps characters to word-continuation character classes.
#
# The Character Classification Map associates a character code with a
# character class. The character classes must have been defined in the
# Character Class Definitions in Section 1, above; only those class
# names may appear in this Character Classification Map (Section 2) or
# in the Word Continuation Rules (Section 3, below).
#
# Default Mapping
# ===============
#
# You only need to provide character classifications for the characters
# that you want to appear in tokens.
#
# Any characters that you do NOT map will be classified in this way:
#     - ASCII NUL is mapped as the end-of-buffer marker.
#     - All other characters are mapped as break characters.
#
#
# End Of Map Marker
# =================
#
# As for the previous table, there is a special value for this Character
# Classification Map that marks its end. The special value is -1. The
# decimal character code -1 will cause the Character Classification Map
# loader to stop reading.
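#
# A loader honoring these defaults might look like the following Python
# sketch (the helper names are invented for illustration; this is not
# the actual CPL loader):
#
```python
def load_class_map(lines):
    """Parse 'code class-name' lines; -1 ends the map."""
    class_map = {0: "EndBuff"}                   # NUL is end-of-buffer by default
    for line in lines:
        line = line.split("#", 1)[0].strip()     # strip comments and whitespace
        if not line:
            continue                             # blank and comment-only lines
        code_text, class_name = line.split()
        code = int(code_text)
        if code == -1:                           # end-of-map marker: stop reading
            break
        class_map[code] = class_name
    return class_map

def classify(char_code, class_map):
    return class_map.get(char_code, "Break")     # unmapped chars -> Break
```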
#
#
# Performance Hints
# =================
#
# Leave as many characters classified as "break" characters as possible.
# Classify as many characters to the same class as possible.
#
# The following sample map is set up for Polish-language text: it maps
# the Polish letters, along with A-Z and a-z, to the Letter class, and
# maps the period, section sign, and hard space to the Dot, Paragraf,
# and HardSpace classes. If you do not need some of these classes in
# indexable terms, change their mappings to Break, and remove all
# references to them in the Character Class Definitions and the Word
# Continuation Rules. Your database will index faster.


# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Character Classification Map
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# ------- --------- -----------------------
# Decimal Class
# Value   Name      Comment
# ------- --------- -----------------------
# Special characters:
46  Dot       # Period '.'
167 Paragraf  # Section sign
160 HardSpace # Hard (non-breaking) space
# Digits:
48 Number # Char '0'
49 Number # Char '1'
50 Number # Char '2'
51 Number # Char '3'
52 Number # Char '4'
53 Number # Char '5'
54 Number # Char '6'
55 Number # Char '7'
56 Number # Char '8'
57 Number # Char '9'
91 Number # Char '['
93 Number # Char ']'

# Upper case letters:
65  Letter # Char 'A'
165 Letter # Char 'Ą' (Polish)
66  Letter # Char 'B'
67  Letter # Char 'C'
198 Letter # Char 'Ć' (Polish)
68  Letter # Char 'D'
69  Letter # Char 'E'
202 Letter # Char 'Ę' (Polish)
70  Letter # Char 'F'
71  Letter # Char 'G'
72  Letter # Char 'H'
73  Letter # Char 'I'
74  Letter # Char 'J'
75  Letter # Char 'K'
76  Letter # Char 'L'
163 Letter # Char 'Ł' (Polish)
77  Letter # Char 'M'
78  Letter # Char 'N'
209 Letter # Char 'Ń' (Polish)
79  Letter # Char 'O'
211 Letter # Char 'Ó' (Polish)
80  Letter # Char 'P'
81  Letter # Char 'Q'
82  Letter # Char 'R'
83  Letter # Char 'S'
140 Letter # Char 'Ś' (Polish)
84  Letter # Char 'T'
85  Letter # Char 'U'
86  Letter # Char 'V'
87  Letter # Char 'W'
88  Letter # Char 'X'
89  Letter # Char 'Y'
90  Letter # Char 'Z'
143 Letter # Char 'Ź' (Polish)
175 Letter # Char 'Ż' (Polish)
# Lower case letters:
97  Letter # Char 'a'
185 Letter # Char 'ą' (Polish)
98  Letter # Char 'b'
99  Letter # Char 'c'
230 Letter # Char 'ć' (Polish)
100 Letter # Char 'd'
101 Letter # Char 'e'
234 Letter # Char 'ę' (Polish)
102 Letter # Char 'f'
103 Letter # Char 'g'
104 Letter # Char 'h'
105 Letter # Char 'i'
106 Letter # Char 'j'
107 Letter # Char 'k'
108 Letter # Char 'l'
179 Letter # Char 'ł' (Polish)
109 Letter # Char 'm'
110 Letter # Char 'n'
241 Letter # Char 'ń' (Polish)
111 Letter # Char 'o'
243 Letter # Char 'ó' (Polish)
112 Letter # Char 'p'
113 Letter # Char 'q'
114 Letter # Char 'r'
115 Letter # Char 's'
156 Letter # Char 'ś' (Polish)
116 Letter # Char 't'
117 Letter # Char 'u'
118 Letter # Char 'v'
119 Letter # Char 'w'
120 Letter # Char 'x'
121 Letter # Char 'y'
122 Letter # Char 'z'
159 Letter # Char 'ź' (Polish)
191 Letter # Char 'ż' (Polish)
# --- ----- -----------------------
-1 EndOfDefs # Not loaded. Just marks end of map definition.
# --- ----- -----------------------


####################################################################
#
# Section 3: Word Continuation Rules
#
####################################################################
#
# The word continuation rules specify which sequences of characters
# CANNOT be separated from each other in breaking a stream of data into
# words.
#
# Each rule consists of character class names separated by spaces or
# tabs. A rule says that characters of the specified classes may not be
# split. For example, the rule:
#
#     Letter Letter
#
# says that when two data characters are of class Letter, and occur side
# by side, the data characters may not be separated.
#
# Similarly, the rule:
#
#     Letter Number
#
# says that a character that classifies as a Letter may not be separated
# from a following character that classifies as a Number.
#
# Example 1:
#
# How does the tokenizer decide whether a character is a Letter or a
# Number? That association is formed by the Character Classification
# Map, in Section 2. Using the Character Classification Map in this
# file, and the two "Letter Letter" and "Letter Number" rules just
# presented, the following input text:
#
#     "A-1 B 2 C3 4D 5 E 6-F GH IJ7LMNOP"
#
# will tokenize as:
#
#     "A" "B" "C3" "D" "E" "F" "GH" "IJ7" "LMNOP"
#
# because: a Letter may not be separated from a following Letter, and a
# Letter may not be separated from a following Number. However, a
# Number may be separated from a following Letter, and all other
# characters are treated as delimiters. Obviously, we need more
# rules. A more complete sample set follows.
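#
# The following Python sketch reproduces Example 1. It encodes one
# reading that is consistent with the output shown above: a token may
# only START with a class that begins some rule, which is why the lone
# Number characters ("1", "2", "4", "5", "6") yield no tokens. Names
# are invented for the sketch; this is not the CPL implementation.
#
```python
def classify(ch):
    # Character Classification Map, reduced to what Example 1 needs.
    if ch.isalpha():
        return "Letter"
    if ch.isdigit():
        return "Number"
    return "Break"

RULES = {("Letter", "Letter"), ("Letter", "Number")}
STARTERS = {first for first, _ in RULES}   # classes that may begin a token

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if classify(text[i]) not in STARTERS:
            i += 1                         # Break chars and non-starters are skipped
            continue
        j = i
        while (j + 1 < len(text)
               and (classify(text[j]), classify(text[j + 1])) in RULES):
            j += 1                         # the rules say these chars may not split
        tokens.append(text[i:j + 1])
        i = j + 1
    return tokens
```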


# Character Class Names, and '*'
# ------------------------------
#
# All character class names in each rule MUST have been defined in the
# Character Class Definitions in Section 1, above, with the exception of
# one special "name": the '*' character. The '*' character means that
# characters of the preceding class may occur one or more times in a
# sequence.
#
# Note that in a previous rule we said that no two letters could be
# separated. We did this with the rule:
#
#     Letter Letter
#
# But what if a single letter, such as the "A" in "Section A", occurs by
# itself? The "Letter Letter" rule does NOT say that "A" is a valid
# token. The following two rules together DO say that a single letter,
# or any number of letters in a row, must be treated as a token:
#
#     Letter
#     Letter Letter
#
# However, we can reduce this to a single rule using the special
# character class name "*":
#
#     Letter *
#
#
#
# Example 2: "Unusual" Characters In Tokens
#
# This rule:
#
#     Dollar * Letter
#
# says that a stream of Dollar characters may not be broken if it is
# followed by a Letter. This rule will cause these strings to be
# treated as words:
#
#     "$$SysDevice"
#     "$Fr$ed"
#     "$x8"
#
# But the same "Dollar * Letter" rule will not accept these strings:
#
#     "SysDevice$$$" -- Token will be "SysDevice", "$$$" is junk.
#     "Fr$ed"        -- Tokens will be "Fr" and "$ed".
#     "x$8"          -- Token will be "x"; the "$" and "8" will be
#                       discarded.
#
#
#
# Example 3: More Complex Rules
#
# Using the example rules to this point, the string:
#
#     "tomd@pls.com"
#
# will be tokenized as:
#
#     "tomd" "pls" "com"
#
# To cause tomd@pls.com to be accepted as a token, we can define this
# rule:
#
#     Letter * AtSign Letter * Dot Letter *
#
# Or define these equivalent rules:
#
#     Letter *
#     Letter AtSign
#     AtSign Letter
#     Letter Dot Letter
#
#
# Implicit Linking of Rules
# -------------------------
#
# It is important to note that rules functionally link to each other.
# For example, we used these two rules in the previous example:
#
#     Letter AtSign
#     AtSign Letter
#
# That is, a Letter may not be separated from a following AtSign, and an
# AtSign may not be separated from a following Letter, which
# functionally has the same effect as:
#
#     Letter AtSign Letter
#
# Thus, the last character class in a rule can match up with the same
# character class at the head of another rule (or the same rule) to
# match longer strings. In fact, the tokenizer does this, internally,
# to create the longest tokens it can from an input stream.
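#
# The linking effect can be sketched in Python using pair rules only;
# the three-class "Letter Dot Letter" rule is approximated here by the
# linked pairs (Letter, Dot) and (Dot, Letter). A hedged sketch under
# those assumptions, not CPL's engine:
#
```python
RULES = {("Letter", "Letter"), ("Letter", "AtSign"), ("AtSign", "Letter"),
         ("Letter", "Dot"), ("Dot", "Letter")}

def classify(ch):
    if ch.isalpha():
        return "Letter"
    return {"@": "AtSign", ".": "Dot"}.get(ch, "Break")

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if classify(text[i]) == "Break":
            i += 1
            continue
        j = i
        while (j + 1 < len(text)
               and (classify(text[j]), classify(text[j + 1])) in RULES):
            j += 1       # pairs chain: Letter-AtSign links to AtSign-Letter
        tokens.append(text[i:j + 1])
        i = j + 1
    return tokens
```
# Because the pairs link end-to-head, "tomd@pls.com" survives as one
# token even though no single rule spans the whole string.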
#
#
# EndRule: End of Definitions Marker
# ----------------------------------
#
# The last "rule" in the list of rules MUST consist solely of the
# EndRule character class. It tells the Word Continuation Rule loader
# that it is finished. If the EndRule is missing, the Word Continuation
# Rule loader will try to eat any following data as continuation rules,
# and will fail.
#
# Default Word Continuation Rules
# -------------------------------
#
# The following, short, ruleset will create tokens consisting of
# Letters and Numbers freely intermixed. It will also keep
# abbreviation-like sequences together: a Letter, followed by a Dot, a
# HardSpace, and a Letter or Number; and a Paragraf (section sign)
# followed by a HardSpace and a Letter or Number. Note that these
# rules are intended for a particular type of documentation, and will
# probably not exactly fit your needs.
#
#
# Examples Using Default Word Continuation Rules
# ----------------------------------------------
#
# Example 1':
#
# The string from Example 1:
#
#     "A-1 B 2 C3 4D 5 E 6-F GH IJ7LMNOP"
#
# Previously tokenized as:
#
#     "A" "B" "C3" "D" "E" "F" "GH" "IJ7" "LMNOP"
#
# With the default rules, below, tokenizes as:
#
#     "A" "1" "B" "2" "C3" "4D" "5" "E" "6" "F" "GH" "IJ7LMNOP"
#
#
# Example 2':
#
# The strings from Example 2:
#
#     "$$SysDevice"
#     "$Fr$ed"
#     "$x8"
#
# Previously (with the "Dollar * Letter" rule) tokenized as whole
# words. The default rules, below, define no Dollar class, so the '$'
# characters default to Break, and the strings tokenize as:
#
#     "SysDevice" "Fr" "ed" "x8"
#
#
#
# Example 3':
#
# The string from Example 3:
#
#     "tomd@pls.com"
#
# Will still be tokenized as (the rules below have no AtSign class, and
# their Dot rules require a following HardSpace):
#
#     "tomd" "pls" "com"
#
#
#
#
#
# Performance Hints
# =================
#
#
# 1) Keep Rules As Simple As Possible. The simpler the rule, the faster
# the word-continuation tests. If a rule uses few character classes,
# and contains few items, it is "simple." If a rule uses more than two
# character classes, or uses two or more classes in different
# permutations, the rule is "complex."
#
# 2) Define As Few Rules As Possible. The fewer rules there are to
# check, the faster tokenization will be.
#
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Word Continuation Rules
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# I apologize if this explanation has been overly long, especially given
# the brevity of the following rules. This documentation is partially
# for me (Tom Donaldson), and partially for anyone using the tokenizer.
# If I jot it down now, there is a better chance I will "remember" it
# all later!
#
#
Letter
Number
Letter Letter
Number Number
Letter Number
Number Letter
Letter Dot HardSpace Letter
Letter Dot HardSpace Number
Paragraf HardSpace Letter
Paragraf HardSpace Number
#
#
# EndRule MUST be the last rule in the continuation rules.
# It must NOT occur as part of any other rule.
#
EndRule
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


####################################################################
#
# Section 4: Canonization Map
#
####################################################################
#
# After the tokenizer returns a "word", based on the rules defined in
# the first three sections of this file, the word must be put into a
# regularized, "canonical", form to make searches easier and faster. In
# fact, canonizing tokens can also drastically reduce the size of the
# database's dictionary.
#
# A common form of canonization is to convert all lower case letters to
# upper case, and to use the upper cased terms in the index and during
# searches. This allows, for example, the word "NeXT" to match the
# words "next", "NEXT", "nExt", etc. Note that this also means that
# only one version of "next" is stored in the index, rather than all
# permutations on the case of the letters that might exist in the
# database.
#
# The canonization map allows you to determine which character-by-character
# transforms are performed during canonization. The default
# canonization supplied in the following table maps all lower case
# characters to upper case. All other values are mapped to themselves;
# that is, all other values are unchanged after canonization.
#
# For example, in some databases you might want to convert the "A WITH
# TILDE" to a plain "A". You can do this by specifying that the "A
# WITH TILDE" character, with character code 195, should be canonized
# to the character "A", with character code 65:
#
#     195 65 # Canonize A-tilde as A
#
#
# Default Values
# ==============
#
# You do not have to define a mapping for all characters. The default
# for every character is to map to itself. Thus, your canonization
# mapping table need only contain the characters that you want to have
# translated after tokenization.
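#
# The canonization step described above amounts to a per-character
# translation; in Python it can be sketched with str.translate. Only
# the a-z to A-Z portion of the table is shown, plus the A-tilde
# example; unlisted characters map to themselves, as the defaults
# require:
#
```python
# Canonization map: input character code -> output character code.
CANON = {code: code - 32 for code in range(ord("a"), ord("z") + 1)}  # 'a'->'A' ...
CANON[195] = 65   # e.g. canonize "A WITH TILDE" (code 195) to plain 'A'

def canonize(token):
    # Characters absent from the map pass through unchanged.
    return token.translate(CANON)
```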
#
# CRITICAL
# ========
#
# As for the previous tables, there is a special value for this
# Canonization Map that marks its end. The special value is -1. The
# decimal character code -1 will cause the Canonization Map loader to
# stop reading.
#
#
# Performance Hints
# =================
#
# None.
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Default Canonization Map
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# ------- ------- -----------
# Input   Output
# Decimal Decimal
# Char    Char
# Value   Value   Comment
# ------- ------- -----------
#
# Map the characters a-z to the "canonical" characters A-Z. That is,
# all letters will be upper cased.
97 65 # Char 'a' canonizes to 'A'
98 66 # Char 'b' canonizes to 'B'
99 67 # Char 'c' canonizes to 'C'
100 68 # Char 'd' canonizes to 'D'
101 69 # Char 'e' canonizes to 'E'
102 70 # Char 'f' canonizes to 'F'
103 71 # Char 'g' canonizes to 'G'
104 72 # Char 'h' canonizes to 'H'
105 73 # Char 'i' canonizes to 'I'
106 74 # Char 'j' canonizes to 'J'
107 75 # Char 'k' canonizes to 'K'
108 76 # Char 'l' canonizes to 'L'
109 77 # Char 'm' canonizes to 'M'
110 78 # Char 'n' canonizes to 'N'
111 79 # Char 'o' canonizes to 'O'
112 80 # Char 'p' canonizes to 'P'
113 81 # Char 'q' canonizes to 'Q'
114 82 # Char 'r' canonizes to 'R'
115 83 # Char 's' canonizes to 'S'
116 84 # Char 't' canonizes to 'T'
117 85 # Char 'u' canonizes to 'U'
118 86 # Char 'v' canonizes to 'V'
119 87 # Char 'w' canonizes to 'W'
120 88 # Char 'x' canonizes to 'X'
121 89 # Char 'y' canonizes to 'Y'
122 90 # Char 'z' canonizes to 'Z'
185 165 # Char 'ą' canonizes to 'Ą'
230 198 # Char 'ć' canonizes to 'Ć'
234 202 # Char 'ę' canonizes to 'Ę'
179 163 # Char 'ł' canonizes to 'Ł'
241 209 # Char 'ń' canonizes to 'Ń'
243 211 # Char 'ó' canonizes to 'Ó'
156 140 # Char 'ś' canonizes to 'Ś'
159 143 # Char 'ź' canonizes to 'Ź'
191 175 # Char 'ż' canonizes to 'Ż'
# --- ----- -----------------------
-1 -1 # Not loaded. Just marks end of map definition.
# --- ----- -----------------------

####################################################################
#
#
# End Of File: tknztbld.def
#
#
####################################################################