####################################################################
#
#
# File: tknztbld.def
#
# Personal Library Software, July, 1993
# Tom Donaldson
#
# Tokenizer definitional data for the table driven tokenizer:
# CplTabledRomanceTokenizer.
#
# The CplTabledRomanceTokenizer allows customization of tokenization by
# editing rules that define the operation of the tokenizer. The central
# concept is "word continuation" rules, defining kinds of characters
# that CANNOT be split from each other.
#
# History
# -------
#
# 29jul1993  tomd   Performance improvements. Got rid of unused
#                   character classes. Defined all non-token chars
#                   to be the "break" character class. Ordered
#                   user-defined character classes by expected
#                   frequency in text.
#
#                   Also added performance hints throughout.
#
# 26aug93    tomd   No longer need to map all chars to char-classes.
#                   Unmapped ones will default to break-chars.
#
#                   No longer need to assign numeric values to
#                   character classes. No longer need to assign names
#                   to predefined character classes (and cannot).
#
#                   Canonizer map no longer has to contain all
#                   characters.
#
####################################################################



####################################################################
#
# Installation
# ============
#
# Database.def File
# -----------------
#
# To use the CplTabledRomanceTokenizer, you need this line in the .def
# file for the database:
#
#     TOKENIZER = CplTabledRomanceTokenizer
#
#
# Tokenizer File
# --------------
#
# This file, tknztbld.def, is the rule file. Note that the name of the
# file CANNOT be changed, and the file MUST be in the "home directory"
# of the database using the tokenizer, or in the "system" directory for
# the CPL installation.
#
####################################################################


####################################################################
#
# Operational Overview
# ====================
#
# Database Open
# -------------
#
# When a database is opened, its .def file is read. In the .def file,
# a non-default tokenizer may be specified via a line of the form
#
#     TOKENIZER = aTokenizerName
#
# If aTokenizerName is "CplTabledRomanceTokenizer", as soon as the
# tokenizer is needed, it will try to read its definition file (i.e.,
# this tknztbld.def file) from the same directory as the database's .def
# file.
#
# If a problem arises during the load, tokenizer creation will fail. In
# this case, please do a "diff" between the file that failed to load and
# the original copy. Regretfully, few diagnostic error messages are
# currently available to assist in determining why tokenizer definition
# rules did not load.
#
# During Tokenization
# -------------------
#
# As a buffer is scanned for words, each character is converted to a
# "character class" via a Character Classification Map. The character
# classes are compared against tokenization rules, defined in this file.
# If the rules explicitly state that the current character code may not
# be separated from the preceding one, then the current character is
# treated as part of a word, and the next character is classified and
# tested. This process continues until the scanner finds a character
# whose classification is not specified in the rules as always being
# kept with the character class of the just-preceding character.
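#
# The scan loop just described can be sketched in Python. This is a
# hedged illustration of the algorithm, not the actual CPL code; the
# function and parameter names are invented for the sketch, and any
# character missing from the classification map is treated as Break,
# per the defaults described later in this file.
#
```python
def scan(text, classify, rules):
    """classify: dict mapping a character to its class name.
    rules: set of (class, class) pairs that may NOT be split."""
    tokens, current = [], ""
    for ch in text:
        cls = classify.get(ch, "Break")        # unmapped chars break words
        if current and (classify.get(current[-1], "Break"), cls) in rules:
            current += ch                      # may not separate: keep scanning
        else:
            if current:
                tokens.append(current)         # boundary found: emit the word
            current = ch if cls != "Break" else ""
    if current:
        tokens.append(current)
    return tokens
```
# Note this simplified loop lets any non-Break character start a
# token; the real tokenizer's handling of one-character tokens is
# governed by the Word Continuation Rules in Section 3, below.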
#
#
# Performance Hints
# =================
#
# The table driven tokenizer allows a great deal of flexibility in
# scanning words. It is possible to create a tokenizer definition that
# will scan complex patterns as tokens, or not, depending upon the
# immediate context of the characters being scanned. However, this
# flexibility comes at a price: performance.
#
# In general, the simpler and fewer the character classes and rules,
# the faster tokenization will be. That is, the closer the rules come
# to generating tokens that would pass the simple isalnum() test, the
# faster tokenization will be. The further your needs depart from this
# simplicity, the longer it will take to index files.
#
#
####################################################################


####################################################################
#
#
# Required File Layout
# ====================
#
# This tokenizer definition file, tknztbld.def, must contain the
# following sections in this order:
#
#     Section 1: Character Class Definitions
#
#     Section 2: Character Classification Map
#
#     Section 3: Word Continuation Rules
#
#     Section 4: Canonization Map
#
# Section 1, the "Character Class Definitions", gives meaningful names
# to "kinds" of characters. These class names are used in defining
# what kinds of characters make up words.
#
# Section 2, the "Character Classification Map", assigns a character
# class to each character in the character set used by documents.
#
# Section 3, the "Word Continuation Rules", uses the character class
# names defined in the Character Class Definitions to specify what
# groupings of characters may not be separated from each other during
# tokenization.
#
# Section 4, the "Canonization Map", specifies translations of
# characters from their raw input form to their final "canonical"
# indexed form.
#
# The detailed structure of each section of the tokenizer definition is
# covered in comments accompanying the sections, below.
#
# The lines in this file that are preceded by the pound character, '#',
# are comment lines (obviously). Comments may only appear on a line by
# themselves, but comment lines may appear anywhere in the file.
# Likewise, blank lines may appear anywhere in the file.
#
####################################################################
-
- ####################################################################
- #
- # Section 1: Character Class Definitions
- #
- ####################################################################
- #
- # The Character Class Definitions give names to the types of
- # characters that can take part in words, and that delimit words. These
- # names will be used in Section 2, the Character Classification Map, to
- # assign character classes to individual characters.
- #
- # You may define up to 250 character classes, although fewer than 10
- # will most likely be enough. Every character than can possibly ever
- # appear in the data to be tokenized MUST be assigned one of the defined
- # classes. The mapping is done via a character-class map, which appears
- # that the end of this file.
- #
- #
- # Predefined Special Values
- # =========================
- #
- # There are four predefined values that are special to the tokenizer:
- #
- # Invalid - NO character should EVER be of this class.
- #
- # EndRule - This is a character class that is used in this
- # definition file to mark the end of your character
- # classification names. It must be the last character
- # class name listed in the table below, and can only be
- # the last one listed.
- #
- # Break --- Characters of the "Break" class can NEVER be part of a
- # token. It is always valid for the tokenizer to
- # word-break the datastream either before or after a Break
- # character.
- #
- # EndBuff - Characters of the EndBuff class will be treated as a
- # "null" terminating character when scanning data for
- # tokens. The ASCII NUL character is an EndBuff character
- # by default, and you will not normally map other
- # characters to this class.
- #
- #
- # Only predefined Character Class names, or Character Class names that
- # are defined in this Character Class Definition section may be used
- # anywere else in the definition. Definitions are case sensitive.
- #
- #
- # Performance Hints
- # =================
- #
- # 1) Use The "Break" Class As Much As Possible. The Break class is a
- # special character class. Minimal testing is done on characters
- # classified as Break. This is because it is ALWAYS valid to separate a
- # Break character from any other character. Any characters that will
- # NEVER be part of a token should be classified as Break.
- #
- # 2) Define Class Names By Frequency. When creating user-defined character
- # classes, list the classes that will be assigned to the largest numbers
- # of characters in the data first. For example, you would expect most
- # data in a text file to classified as "letter" characters; you should
- # define your "letter" class name first.
- #
- # 3) Define As Few Classes As Possible. Fewer classes means less
- # testing. Less testing means faster tokenization.
-
-

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Default Character Class Definitions: Names
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
#
# Name
# ----
Letter
Number
Dot
Paragraf
HardSpace
#
# The EndRule character class must always be the last class name listed,
# and must only appear at the end of your character class definitions.
# If it is not at the end of the character class definitions, an error
# occurs as soon as the loader hits a non-blank, non-comment line that
# is not a character class definition.
#
EndRule

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

####################################################################
#
# Section 2: Character Classification Map
#
####################################################################
#
# Maps characters to word-continuation character classes.
#
# The Character Classification Map associates a character code with a
# character class. The character classes must have been defined in the
# Character Class Definitions in Section 1, above; only those class
# names may appear in this Character Classification Map (Section 2) or
# in the Word Continuation Rules (Section 3, below).
#
# Default Mapping
# ===============
#
# You only need to provide character classifications for the characters
# that you want to appear in tokens.
#
# Any characters that you do NOT map will be classified in this way:
#     - ASCII NUL is mapped as the end-of-buffer marker.
#     - All other characters are mapped as break characters.
#
#
# End Of Map Marker
# =================
#
# As for the previous table, there is a special value for this Character
# Classification Map that marks its end. The special value is -1. The
# decimal character code -1 will cause the Character Classification Map
# loader to stop reading.
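#
# A loader honoring these defaults might look like the following Python
# sketch (the helper names are invented for illustration; this is not
# the actual CPL loader):
#
```python
def load_class_map(lines):
    """Parse 'code class-name' lines; -1 ends the map."""
    class_map = {0: "EndBuff"}                   # NUL is end-of-buffer by default
    for line in lines:
        line = line.split("#", 1)[0].strip()     # strip comments and whitespace
        if not line:
            continue                             # blank and comment-only lines
        code_text, class_name = line.split()
        code = int(code_text)
        if code == -1:                           # end-of-map marker: stop reading
            break
        class_map[code] = class_name
    return class_map

def classify(char_code, class_map):
    return class_map.get(char_code, "Break")     # unmapped chars -> Break
```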
#
#
# Performance Hints
# =================
#
# Leave as many characters classified as "break" characters as possible.
# Classify as many characters to the same class as possible.
#
# The following sample map is set up for Polish-language text: it maps
# the Polish letters, along with A-Z and a-z, to the Letter class, and
# maps the period, section sign, and hard space to the Dot, Paragraf,
# and HardSpace classes. If you do not need some of these classes in
# indexable terms, change their mappings to Break, and remove all
# references to them in the Character Class Definitions and the Word
# Continuation Rules. Your database will index faster.


# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Character Classification Map
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# ------- --------- -----------------------
# Decimal Class
# Value   Name      Comment
# ------- --------- -----------------------
# Special characters:
46  Dot       # Period '.'
167 Paragraf  # Section sign
160 HardSpace # Hard (non-breaking) space
# Digits:
48 Number # Char '0'
49 Number # Char '1'
50 Number # Char '2'
51 Number # Char '3'
52 Number # Char '4'
53 Number # Char '5'
54 Number # Char '6'
55 Number # Char '7'
56 Number # Char '8'
57 Number # Char '9'
91 Number # Char '['
93 Number # Char ']'

# Upper case letters:
65  Letter # Char 'A'
165 Letter # Char 'Ą' (Polish)
66  Letter # Char 'B'
67  Letter # Char 'C'
198 Letter # Char 'Ć' (Polish)
68  Letter # Char 'D'
69  Letter # Char 'E'
202 Letter # Char 'Ę' (Polish)
70  Letter # Char 'F'
71  Letter # Char 'G'
72  Letter # Char 'H'
73  Letter # Char 'I'
74  Letter # Char 'J'
75  Letter # Char 'K'
76  Letter # Char 'L'
163 Letter # Char 'Ł' (Polish)
77  Letter # Char 'M'
78  Letter # Char 'N'
209 Letter # Char 'Ń' (Polish)
79  Letter # Char 'O'
211 Letter # Char 'Ó' (Polish)
80  Letter # Char 'P'
81  Letter # Char 'Q'
82  Letter # Char 'R'
83  Letter # Char 'S'
140 Letter # Char 'Ś' (Polish)
84  Letter # Char 'T'
85  Letter # Char 'U'
86  Letter # Char 'V'
87  Letter # Char 'W'
88  Letter # Char 'X'
89  Letter # Char 'Y'
90  Letter # Char 'Z'
143 Letter # Char 'Ź' (Polish)
175 Letter # Char 'Ż' (Polish)
# Lower case letters:
97  Letter # Char 'a'
185 Letter # Char 'ą' (Polish)
98  Letter # Char 'b'
99  Letter # Char 'c'
230 Letter # Char 'ć' (Polish)
100 Letter # Char 'd'
101 Letter # Char 'e'
234 Letter # Char 'ę' (Polish)
102 Letter # Char 'f'
103 Letter # Char 'g'
104 Letter # Char 'h'
105 Letter # Char 'i'
106 Letter # Char 'j'
107 Letter # Char 'k'
108 Letter # Char 'l'
179 Letter # Char 'ł' (Polish)
109 Letter # Char 'm'
110 Letter # Char 'n'
241 Letter # Char 'ń' (Polish)
111 Letter # Char 'o'
243 Letter # Char 'ó' (Polish)
112 Letter # Char 'p'
113 Letter # Char 'q'
114 Letter # Char 'r'
115 Letter # Char 's'
156 Letter # Char 'ś' (Polish)
116 Letter # Char 't'
117 Letter # Char 'u'
118 Letter # Char 'v'
119 Letter # Char 'w'
120 Letter # Char 'x'
121 Letter # Char 'y'
122 Letter # Char 'z'
159 Letter # Char 'ź' (Polish)
191 Letter # Char 'ż' (Polish)
# --- ----- -----------------------
-1 EndOfDefs # Not loaded. Just marks end of map definition.
# --- ----- -----------------------


####################################################################
#
# Section 3: Word Continuation Rules
#
####################################################################
#
# The word continuation rules specify which sequences of characters
# CANNOT be separated from each other in breaking a stream of data into
# words.
#
# Each rule consists of character class names separated by spaces or
# tabs. A rule says that characters of the specified classes may not be
# split. For example, the rule:
#
#     Letter Letter
#
# says that when two data characters are of class Letter, and occur side
# by side, the data characters may not be separated.
#
# Similarly, the rule:
#
#     Letter Number
#
# says that a character that classifies as a Letter may not be separated
# from a following character that classifies as a Number.
#
# Example 1:
#
# How does the tokenizer decide whether a character is a Letter or a
# Number? That association is formed by the Character Classification
# Map, in Section 2. Using the Character Classification Map in this
# file, and the two "Letter Letter" and "Letter Number" rules just
# presented, the following input text:
#
#     "A-1 B 2 C3 4D 5 E 6-F GH IJ7LMNOP"
#
# will tokenize as:
#
#     "A" "B" "C3" "D" "E" "F" "GH" "IJ7" "LMNOP"
#
# because: a Letter may not be separated from a following Letter, and a
# Letter may not be separated from a following Number. However, a
# Number may be separated from a following Letter, and all other
# characters are treated as delimiters. Obviously, we need more
# rules. A more complete sample set follows.
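#
# The following Python sketch reproduces Example 1. It encodes one
# reading that is consistent with the output shown above: a token may
# only START with a class that begins some rule, which is why the lone
# Number characters ("1", "2", "4", "5", "6") yield no tokens. Names
# are invented for the sketch; this is not the CPL implementation.
#
```python
def classify(ch):
    # Character Classification Map, reduced to what Example 1 needs.
    if ch.isalpha():
        return "Letter"
    if ch.isdigit():
        return "Number"
    return "Break"

RULES = {("Letter", "Letter"), ("Letter", "Number")}
STARTERS = {first for first, _ in RULES}   # classes that may begin a token

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if classify(text[i]) not in STARTERS:
            i += 1                         # Break chars and non-starters are skipped
            continue
        j = i
        while (j + 1 < len(text)
               and (classify(text[j]), classify(text[j + 1])) in RULES):
            j += 1                         # the rules say these chars may not split
        tokens.append(text[i:j + 1])
        i = j + 1
    return tokens
```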


# Character Class Names, and '*'
# ------------------------------
#
# All character class names in each rule MUST have been defined in the
# Character Class Definitions in Section 1, above, with the exception of
# one special "name": the '*' character. The '*' character means that
# characters of the preceding class may occur one or more times in a
# sequence.
#
# Note that in a previous rule we said that no two letters could be
# separated. We did this with the rule:
#
#     Letter Letter
#
# But what if a single letter, such as the "A" in "Section A", occurs by
# itself? The "Letter Letter" rule does NOT say that "A" is a valid
# token. The following two rules together DO say that a single letter,
# or any number of letters in a row, must be treated as a token:
#
#     Letter
#     Letter Letter
#
# However, we can reduce this to a single rule using the special
# character class name "*":
#
#     Letter *
#
#
#
# Example 2: "Unusual" Characters In Tokens
#
# This rule:
#
#     Dollar * Letter
#
# says that a stream of Dollar characters may not be broken if it is
# followed by a Letter. This rule will cause these strings to be
# treated as words:
#
#     "$$SysDevice"
#     "$Fr$ed"
#     "$x8"
#
# But the same "Dollar * Letter" rule will not accept these strings:
#
#     "SysDevice$$$" -- Token will be "SysDevice", "$$$" is junk.
#     "Fr$ed"        -- Tokens will be "Fr" and "$ed".
#     "x$8"          -- Token will be "x"; the "$" and "8" will be
#                       discarded.
#
#
#
# Example 3: More Complex Rules
#
# Using the example rules to this point, the string:
#
#     "tomd@pls.com"
#
# will be tokenized as:
#
#     "tomd" "pls" "com"
#
# To cause tomd@pls.com to be accepted as a token, we can define this
# rule:
#
#     Letter * AtSign Letter * Dot Letter *
#
# Or define these equivalent rules:
#
#     Letter *
#     Letter AtSign
#     AtSign Letter
#     Letter Dot Letter
#
#
# Implicit Linking of Rules
# -------------------------
#
# It is important to note that rules functionally link to each other.
# For example, we used these two rules in the previous example:
#
#     Letter AtSign
#     AtSign Letter
#
# That is, a Letter may not be separated from a following AtSign, and an
# AtSign may not be separated from a following Letter, which
# functionally has the same effect as:
#
#     Letter AtSign Letter
#
# Thus, the last character class in a rule can match up with the same
# character class at the head of another rule (or the same rule) to
# match longer strings. In fact, the tokenizer does this, internally,
# to create the longest tokens it can from an input stream.
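#
# The linking effect can be sketched in Python using pair rules only;
# the three-class "Letter Dot Letter" rule is approximated here by the
# linked pairs (Letter, Dot) and (Dot, Letter). A hedged sketch under
# those assumptions, not CPL's engine:
#
```python
RULES = {("Letter", "Letter"), ("Letter", "AtSign"), ("AtSign", "Letter"),
         ("Letter", "Dot"), ("Dot", "Letter")}

def classify(ch):
    if ch.isalpha():
        return "Letter"
    return {"@": "AtSign", ".": "Dot"}.get(ch, "Break")

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if classify(text[i]) == "Break":
            i += 1
            continue
        j = i
        while (j + 1 < len(text)
               and (classify(text[j]), classify(text[j + 1])) in RULES):
            j += 1       # pairs chain: Letter-AtSign links to AtSign-Letter
        tokens.append(text[i:j + 1])
        i = j + 1
    return tokens
```
# Because the pairs link end-to-head, "tomd@pls.com" survives as one
# token even though no single rule spans the whole string.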
#
#
# EndRule: End of Definitions Marker
# ----------------------------------
#
# The last "rule" in the list of rules MUST consist solely of the
# EndRule character class. It tells the Word Continuation Rule loader
# that it is finished. If the EndRule is missing, the Word Continuation
# Rule loader will try to eat any following data as continuation rules,
# and will fail.
#
# Default Word Continuation Rules
# -------------------------------
#
# The following, short, ruleset will create tokens consisting of
# Letters and Numbers freely intermixed. It will also keep
# abbreviation-like sequences together: a Letter, followed by a Dot, a
# HardSpace, and a Letter or Number; and a Paragraf (section sign)
# followed by a HardSpace and a Letter or Number. Note that these
# rules are intended for a particular type of documentation, and will
# probably not exactly fit your needs.
#
#
# Examples Using Default Word Continuation Rules
# ----------------------------------------------
#
# Example 1':
#
# The string from Example 1:
#
#     "A-1 B 2 C3 4D 5 E 6-F GH IJ7LMNOP"
#
# Previously tokenized as:
#
#     "A" "B" "C3" "D" "E" "F" "GH" "IJ7" "LMNOP"
#
# With the default rules, below, tokenizes as:
#
#     "A" "1" "B" "2" "C3" "4D" "5" "E" "6" "F" "GH" "IJ7LMNOP"
#
#
# Example 2':
#
# The strings from Example 2:
#
#     "$$SysDevice"
#     "$Fr$ed"
#     "$x8"
#
# Previously (with the "Dollar * Letter" rule) tokenized as whole
# words. The default rules, below, define no Dollar class, so the '$'
# characters default to Break, and the strings tokenize as:
#
#     "SysDevice" "Fr" "ed" "x8"
#
#
#
# Example 3':
#
# The string from Example 3:
#
#     "tomd@pls.com"
#
# Will still be tokenized as (the rules below have no AtSign class, and
# their Dot rules require a following HardSpace):
#
#     "tomd" "pls" "com"
#
#
#
#
#
# Performance Hints
# =================
#
#
# 1) Keep Rules As Simple As Possible. The simpler the rule, the faster
# the word-continuation tests. If a rule uses few character classes,
# and contains few items, it is "simple." If a rule uses more than two
# character classes, or uses two or more classes in different
# permutations, the rule is "complex."
#
# 2) Define As Few Rules As Possible. The fewer rules there are to
# check, the faster tokenization will be.
#
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Word Continuation Rules
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# I apologize if this explanation has been overly long, especially given
# the brevity of the following rules. This documentation is partially
# for me (Tom Donaldson), and partially for anyone using the tokenizer.
# If I jot it down now, there is a better chance I will "remember" it
# all later!
#
#
Letter
Number
Letter Letter
Number Number
Letter Number
Number Letter
Letter Dot HardSpace Letter
Letter Dot HardSpace Number
Paragraf HardSpace Letter
Paragraf HardSpace Number
#
#
# EndRule MUST be the last rule in the continuation rules.
# It must NOT occur as part of any other rule.
#
EndRule
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


####################################################################
#
# Section 4: Canonization Map
#
####################################################################
#
# After the tokenizer returns a "word", based on the rules defined in
# the first three sections of this file, the word must be put into a
# regularized, "canonical", form to make searches easier and faster. In
# fact, canonizing tokens can also drastically reduce the size of the
# database's dictionary.
#
# A common form of canonization is to convert all lower case letters to
# upper case, and to use the upper cased terms in the index and during
# searches. This allows, for example, the word "NeXT" to match the
# words "next", "NEXT", "nExt", etc. Note that this also means that
# only one version of "next" is stored in the index, rather than all
# permutations on the case of the letters that might exist in the
# database.
#
# The canonization map allows you to determine which character-by-character
# transforms are performed during canonization. The default
# canonization supplied in the following table maps all lower case
# characters to upper case. All other values are mapped to themselves;
# that is, all other values are unchanged after canonization.
#
# For example, in some databases you might want to convert the "A WITH
# TILDE" to a plain "A". You can do this by specifying that the "A
# WITH TILDE" character, with character code 195, should be canonized
# to the character "A", with character code 65:
#
#     195 65 # Canonize A-tilde as A
#
#
# Default Values
# ==============
#
# You do not have to define a mapping for all characters. The default
# for every character is to map to itself. Thus, your canonization
# mapping table need only contain the characters that you want to have
# translated after tokenization.
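#
# The canonization step described above amounts to a per-character
# translation; in Python it can be sketched with str.translate. Only
# the a-z to A-Z portion of the table is shown, plus the A-tilde
# example; unlisted characters map to themselves, as the defaults
# require:
#
```python
# Canonization map: input character code -> output character code.
CANON = {code: code - 32 for code in range(ord("a"), ord("z") + 1)}  # 'a'->'A' ...
CANON[195] = 65   # e.g. canonize "A WITH TILDE" (code 195) to plain 'A'

def canonize(token):
    # Characters absent from the map pass through unchanged.
    return token.translate(CANON)
```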
#
# CRITICAL
# ========
#
# As for the previous tables, there is a special value for this
# Canonization Map that marks its end. The special value is -1. The
# decimal character code -1 will cause the Canonization Map loader to
# stop reading.
#
#
# Performance Hints
# =================
#
# None.
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Default Canonization Map
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# ------- ------- -----------
# Input   Output
# Decimal Decimal
# Char    Char
# Value   Value   Comment
# ------- ------- -----------
#
# Map the characters a-z to the "canonical" characters A-Z. That is,
# all letters will be upper cased.
97 65 # Char 'a' canonizes to 'A'
98 66 # Char 'b' canonizes to 'B'
99 67 # Char 'c' canonizes to 'C'
100 68 # Char 'd' canonizes to 'D'
101 69 # Char 'e' canonizes to 'E'
102 70 # Char 'f' canonizes to 'F'
103 71 # Char 'g' canonizes to 'G'
104 72 # Char 'h' canonizes to 'H'
105 73 # Char 'i' canonizes to 'I'
106 74 # Char 'j' canonizes to 'J'
107 75 # Char 'k' canonizes to 'K'
108 76 # Char 'l' canonizes to 'L'
109 77 # Char 'm' canonizes to 'M'
110 78 # Char 'n' canonizes to 'N'
111 79 # Char 'o' canonizes to 'O'
112 80 # Char 'p' canonizes to 'P'
113 81 # Char 'q' canonizes to 'Q'
114 82 # Char 'r' canonizes to 'R'
115 83 # Char 's' canonizes to 'S'
116 84 # Char 't' canonizes to 'T'
117 85 # Char 'u' canonizes to 'U'
118 86 # Char 'v' canonizes to 'V'
119 87 # Char 'w' canonizes to 'W'
120 88 # Char 'x' canonizes to 'X'
121 89 # Char 'y' canonizes to 'Y'
122 90 # Char 'z' canonizes to 'Z'
185 165 # Char 'ą' canonizes to 'Ą'
230 198 # Char 'ć' canonizes to 'Ć'
234 202 # Char 'ę' canonizes to 'Ę'
179 163 # Char 'ł' canonizes to 'Ł'
241 209 # Char 'ń' canonizes to 'Ń'
243 211 # Char 'ó' canonizes to 'Ó'
156 140 # Char 'ś' canonizes to 'Ś'
159 143 # Char 'ź' canonizes to 'Ź'
191 175 # Char 'ż' canonizes to 'Ż'
# --- ----- -----------------------
-1 -1 # Not loaded. Just marks end of map definition.
# --- ----- -----------------------

####################################################################
#
#
# End Of File: tknztbld.def
#
#
####################################################################