Document: FSC-0044
Version:  001
Date:     04/01/90

An improved method of duplicate message detection and prevention

Jack Decker
1:154/8@Fidonet

Status of this document:

This FSC suggests a proposed protocol for the FidoNet(r) community, and requests discussion and suggestions for improvements. Distribution of this document is subject to the restrictions stated in the copyright paragraph below.

Fido and FidoNet are registered marks of Tom Jennings and Fido Software.

Purpose:

The purpose of this document is to present a proposal for an improved method of duplicate message detection and prevention, that could eventually replace the existing PATH and SEEN-BY lines currently used within Fidonet. The principal advantages of this method over previous schemes are that it is fully Zone-aware and Point-aware, and that it adds far fewer bytes to a message than the current combination of SEEN-BY and PATH lines. It can also be run in parallel with existing SEEN-BY and PATH lines for an indefinite period, thus allowing a "transition period" of as long as is necessary for software to be converted.

Copyright:

This document is Copyright 1990 by Jack Decker. However, permission is granted for any and all non-commercial use, providing the contents of this document are not altered in any way. Also, permission is expressly granted for any use by developers of software primarily intended to be used in the Fidonet amateur communications network, or in any similar amateur communications network that primarily uses Fidonet technology and protocols, whether that software is commercial or not.

Comments on this proposal, and suggestions for improvement are welcomed by the author. In particular, suggestions on how this proposal might be reworded to make the meaning clearer are especially welcome.

A. Definition:

In this document the characters ^A (caret and capital A) will be used to represent a 0x01 (SOH) byte. It will be most commonly used in reference to the "^APTH line", which will be a line that begins with a 0x01 byte immediately followed by the letters "PTH" (and then by additional data as specified below).

B. Why a new method of duplicate message detection is needed:

Most of the methods of duplicate message detection currently used in Fidonet echomail processing operate by trying to find some distinguishing characteristic of an echomail message (whether it be something deliberately included in the message, such as an EID, MSGID, etc. type of "kludge line", or something which is contained in all echomail messages, such as the message header). Typically, either the item being used for duplicate detection itself or a checksum of that item is then saved in a data file, and if another item comes in with that same distinguishing characteristic, the message is considered to be a duplicate message. The data files used to store previously-seen message data can occupy a significant amount of disk space if many conferences are carried on a system.

Unfortunately, all such schemes seem to suffer from the drawback that under the proper circumstances, messages that are not duplicates of each other may be created with identifying characteristics that are similar enough to be falsely recognized as duplicates. The circumstances under which this can happen may differ from method to method, but none are totally foolproof. Thus, it's possible that messages may be deleted as duplicates even though in reality they are not duplicates, but rather they are simply messages that contain data that make them appear to be duplicates.

The most common cause of duplicate messages is improper echomail topology (also known as the infamous "dupe loop"). While there are certainly other ways that duplicates can be generated, improper topology is far and away the leading cause. Further, many of the current duplicate elimination schemes will NOT catch most of the duplicates that are NOT generated as a result of improper topology (which is why duplicate messages are seen occasionally, even though duplicate message detection schemes are currently in use).

Unfortunately, if a duplicate killer is to be effective, it must store the identifying characteristics for the last several thousand messages that have passed through a particular system. This not only uses up disk storage space, it consumes extra processing time during echomail processing, since each new arriving message must be compared to the data list in the attempt to determine if it is, indeed, a duplicate.

A better approach would be to store within a message itself data that identifies it as having already been received by a particular system, before sending it on to another system. Then, if the same message came back to a given system in a "dupe loop", it would be possible to positively identify it as one that has already been seen on that particular system. And, since the data necessary to identify the message as a duplicate is stored within the message itself, it is possible to use this method without the necessity of storing great amounts of data on previously-seen messages (in many implementations this alone would save 10K or more of disk space per conference carried!).

Were it not for the fact that the PATH line present in most echomail messages does not contain Zone or Point information, we could use it for that purpose. However, since it does not contain that information, it cannot and should not be used in that manner. Another drawback of the PATH line is that because it is physically located at the end of a message (after the SEEN-BY lines), if only the last part of a message is scrambled or deleted, the PATH line information will be lost.

C. Proposal:

1) A new type of kludge line (commonly known within FIDONET as an "IFNA kludge line"), which combines certain characteristics of the existing PATH and SEEN-BY lines, will be placed into each echomail message. This line, known as the ^APTH line, will be placed at the TOP of the message (not necessarily the first line, but prior to the body of the message text).

IMPORTANT: Support for the existing PATH and SEEN-BY lines will be retained as long as is necessary to accommodate everyone in Fidonet, but eventually the ^APTH line could possibly replace both the current PATH and SEEN-BY lines.

2) The ^APTH line will contain full four-dimensional addressing (Zone:Net/Node.Point), BUT elements that are the same as the previous entry in the line need not be repeated. When the "point" portion of an address stands alone, it shall be preceded by at least a "." character (to distinguish it from a node address).

3) If a system is importing messages and finds a message with its own address already in the ^APTH line, it will discard the message, unless that address is in the very last position on the line. This exception allows for the odd situation where a point or another task on the same system has already inserted the system's address in the ^APTH line, or where it is desirable to process the same message a second time.

4) One (and only one) non-numeric character may appear just AFTER any address on the ^APTH line. When using the ^APTH line for duplicate message checking only, you may just skip past any such address, unless it's your own address (see examples later in this document). In that case, strip the address and the non-numeric character (in other words, if you see your own address but it's immediately followed by a non-numeric character, remove that address, add yours to the end of the ^APTH line, and toss (import) the message anyway).

The reason for doing this is to allow the design of an echomail processor that doesn't rely on SEEN-BY's. Such a processor could append a non-numeric character (such as a "!") to an address, in order to indicate that "this message hasn't really passed through this node, but don't send it back there" (which would be the equivalent of a SEEN-BY statement for that node, indicating that this message has already been sent to that node). Thus the ^APTH line could eventually take the place of SEEN-BY lines, but you could still have a "fully coupled" triangular or rectangular topology. In this case, you'd add the nodes that are part of that fully coupled topology to the ^APTH line BEFORE sending the message to them, but with the special character after the address. The receiver would know that the message didn't really pass through that node yet, but it should NOT send it to that node under any circumstances. (Please note that during the initial design of software to create ^APTH lines, you would not have to worry about generating the special case with the trailing non-numeric characters; you'd just have to be able to handle them as shown in the examples below if you came across one.)

D. Specifications and examples:

The general specifications for a ^APTH line, and a general outline of how an incoming message might be processed, follow.

A valid ^APTH line will contain at a minimum the string ^APTH followed by a single space character and the network address of the system that created the ^APTH line, in Zone:Net/Node[.Point] format, where ^A is a 0x01 byte (SOH) and the point address is required only if the system is a point (specifically, a system that is NOT a point should not necessarily use .0).

Once again, the FIRST address specified in a ^APTH line is expected to contain, at a minimum, Zone, Net, and Node numbers. If any of these are missing from the FIRST address, the line should be considered defective. It will be left to the discretion of the software author as to how to handle a message with a defective ^APTH line.

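As a rough illustration of the minimum requirements just described, the following Python sketch checks that a line starts with ^APTH and a single space, and that its first entry carries Zone, Net, and Node. The function name parse_first_address and the choice to return None for a defective line are assumptions made for this example only; the proposal leaves the handling of defective lines to the software author.

```python
import re

SOH = "\x01"  # the byte written as ^A in this document

# Zone:Net/Node, an optional .Point, and an optional single trailing
# non-numeric character such as "!".
FIRST_ADDRESS = re.compile(r"^(\d+):(\d+)/(\d+)(?:\.(\d+))?([^\d\s])?$")


def parse_first_address(line):
    """Return (zone, net, node, point) for the first entry, or None if defective.

    point is None when no point specifier is present; an explicit .0 is
    kept as 0 so the two cases stay distinguishable.
    """
    line = line.rstrip("\r\n")
    prefix = SOH + "PTH "
    if not line.startswith(prefix):
        return None
    entries = line[len(prefix):].split()
    if not entries:
        return None
    match = FIRST_ADDRESS.match(entries[0])
    if match is None:
        return None  # Zone, Net, or Node is missing from the first address
    zone, net, node, point, _flag = match.groups()
    return (int(zone), int(net), int(node),
            int(point) if point is not None else None)
```

A line such as ^APTH 1:234/5.6 5 would yield (1, 234, 5, 6), while a line whose first entry is only 234/5 would be reported as defective.
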
Subsequent addresses in the ^APTH line are delimited by spaces and should contain only that information that is different from the previous entry on the line, except that when a bossnode receives a message from a point, then the bossnode should append its node number only. Specific examples follow:

a. If the Zone is the same as the previous address, but the net is different, then only Net/Node[.Point] should be used.

b. If the Zone and Net are the same as the previous address, but the node is different, then only Node[.Point] should be used.

c. If the Zone, Net, and Node are the same as the previous address, but the point is different, then only .Point should be used. Note that in this case, the period is included.

d. If the Zone, Net, and Node are the same as the previous address, but the previous address contains a point specifier and the receiving system is not a point (i.e., it IS the bossnode), then only Node should be used. .0 (point zero) might also be a valid entry in this case, but only if the bossnode consistently identifies itself to other systems using a full four-dimensional address.

For example, a message that originated on 1:234/5.6 and went from there to 1:234/5 would contain a ^APTH line in this format:

    ^APTH 1:234/5.6 5

If the bossnode is also considered to be point zero, then

    ^APTH 1:234/5.6 .0

would be equally valid.

In the case of a "fully connected" topology, nodes may be added to the ^APTH line even though a message has not actually passed through those nodes, to prevent the message from being sent to those nodes. Such nodes should have an exclamation point character ("!") appended to the end of the entry, immediately following the node or point number. These nodes should be added to the very end of a new or existing ^APTH line, after the address of the node which added them. For example, suppose that 157/200, 154/9, and 228/6 were in a "fully connected" topology. When a message was received by 157/200 and then sent to 154/9 and 228/6, the ^APTH line might look something like this:

    ^APTH 3:711/431.5 431 430 403 1:124/4210 4115 157/200 154/9! 228/6!

When a message arrives on one of the nodes indicated by the exclamation point, the exclamation point entry should be removed, and the node should add itself to the end of the line in the normal manner. For example, after the message containing the above ^APTH line was received at 154/9, it would be modified to read:

    ^APTH 3:711/431.5 431 430 403 1:124/4210 4115 157/200 228/6! 154/9

Please note that at the time of this proposal, the exclamation point (!) is the ONLY "officially recognized" non-numeric character that can be expected to be appended to a ^APTH line address. However, the possibility remains that someone may figure out a good reason to use a different trailing character for some other (but similar) purpose, so I am allowing for that possibility by using the generic terminology "non-numeric character" rather than the more specific "exclamation point" throughout this document.

The ^APTH line must be terminated with a carriage return and/or linefeed (a carriage return followed by a linefeed is preferred, and should be used by all systems capable of generating a carriage return/linefeed combination).

There is no specified limit on the length of a ^APTH line. Each message should contain only one ^APTH line, even if it extends beyond the typical 80 column screen width. The ^APTH line is primarily intended for use by the conference mail processing software, so primary consideration is being given to ease of processing the line, rather than making it easily human-readable (most software will not display kludge lines hidden behind a ^A character in any event).

E. Pseudo-outline of message processing:

Here is a suggested flow pseudo-outline showing one way that messages might be processed in a standalone program that runs between the import and export cycles of an existing conference mail processor such as ConfMail (this outline assumes that the standard Fido/Opus style *.msg files are used, though obviously that need not be the case):

1. Open *.msg file for input.

2. Open temporary file for output.

3. Copy header (first 190 bytes) from input to output file. The following operations begin immediately following this header.

4. Examine each line of input file (a line is delimited by a carriage return, linefeed, or any combination thereof) for one of the following:

   a. A blank line - Write to output and examine next line.

   b. A line containing spaces only - Write to output and examine next line.

   c. A line that begins with a 0x01 byte (SOH) - Goto 5.

   d. A line that does not meet any of the above specifications.

      I. Create a line containing the string ^APTH followed by a single space character and your network address, in Zone:Net/Node[.Point] format, where ^A is a 0x01 byte (SOH) and the point address is required only if you are a point (specifically, a system that is NOT a point should not necessarily use .0). This line should be terminated with a carriage return and/or linefeed (a carriage return followed by a linefeed is preferred).

      II. Write the line created in 4.d.I. to the output file.

      III. Write the line input in 4. to the output file.

      IV. Goto 9.

5. If a line begins with a 0x01 (SOH) byte, examine the keyword immediately following it.

   a. If the keyword is NOT "PTH", write the entire line to output and examine the next line (go back to 4).

6. If a line begins with ^APTH, examine each address in the line in turn. Addresses start immediately following the characters "PTH " (note the space).

   a. The FIRST address is expected to contain, at a minimum, Zone, Net, and Node numbers. If any of these are missing from the FIRST address, the line should be considered defective. It will be left to the discretion of the software author as to how to handle a message with a defective ^APTH line.

   b. As each address is found, any Zone, Net, and Node numbers found should be stored in temporary variables, to be used as defaults for subsequent addresses when only a partial address is given. For example, the first address will contain a Zone number. This should be stored in a temporary variable and used as the default Zone for all subsequent addresses, until and unless another Zone number is seen in the line, at which time that Zone number should be stored in the temporary variable and used as the default Zone.

7. As each address is found, it should be compared against the system's address. If a match is found:

   a. Check to make sure that the address is not a point address if the system's address does not contain a point specifier. If the system's address is given without a point specifier, then it should not be considered a match if ANY point address is found in the ^APTH line address that is being compared (not even .0 - for example, if the address 1:234/5.0 is seen in the ^APTH line, and 1:234/5 is the given system address, then it is NOT a match).

   b. If the address does match exactly, check to see if a non-numeric character (specifically the "!" character) immediately follows the address. If it does, then that address must be removed from the line at that point.

      I. When removing an address, please make sure that you do not change the address of subsequent entries. This may require modification of the next entry on the line, if one exists. For example, suppose you had a "fully connected" topology where 1:157/200 sent an echo to both 1:154/9 and 1:154/970. The ^APTH line might end as follows:

             ..... 157/200 154/9! 970!

         However, after modification of the ^APTH line, it should read:

             ..... 157/200 154/970! 9

         You can see that if 154/9 were simply deleted without regard to what follows on the line, the following (incorrect) line might result:

             ..... 157/200 970! 154/9      (THIS IS INCORRECT)

         The above is incorrect because 154/970 has been transformed into 157/970.

      II. After removing an address followed by a non-numeric character, continue to scan any remaining addresses in the ^APTH line in case a match is found later in the line. If no other matches are found, proceed as if no match had been found. Goto 8.

   c. Check to see if the address is the last one on the line (not counting addresses that have a non-numeric character immediately following them). If this address is followed only by the end of the line, or ONLY by addresses that have a non-numeric trailing character, then there is a very high probability that we have either inadvertently or deliberately processed this message twice, and it is not really a duplicate. In this case, the original *.msg file should be left untouched.

      I. Close both the input and output files.

      II. Delete the temporary output file. END.

   d. If a match is found, and it is not followed by a non-numeric character, and it is not the last address on the ^APTH line, then the message is a duplicate message and should be treated as such (either by deleting it, or moving it to a "bad messages" area or the netmail area).

      I. Close both the input and output files.

      II. Delete the temporary output file.

      III. Either delete or move the original .msg input file, as appropriate. END.

8. If the end of the ^APTH line is reached and a match has not been found, then add the system's address to the end of the ^APTH line. Then write the modified ^APTH line to the output file.

   I. If one or more addresses with an appended non-numeric character (used within "fully-coupled" topologies) are to be added to the ^APTH line, they should be added at the very end of the line, after the address of the system currently processing the message.

9. Copy the remainder of the input file to the output file. Close both files.

10. Delete the input file.

11. Rename the temporary output filename to the old input filename. END.

[End of outline]

F. Additional notes and clarifications:

Note 1: In section 7.b.I. I mentioned the necessity of not simply deleting a node from the ^APTH line without checking to see if the next address in the ^APTH line needs to be modified. This can easily be accomplished if TWO sets of temporary variables are kept, for the CURRENT and PREVIOUS Zone, Net, and Node values (Point addresses are NOT kept as defaults, thus there is no need to store Point information). When reading the FIRST address in the ^APTH line, the Zone, Net, and Node numbers of that address would be stored in both the CURRENT and PREVIOUS variables. Thereafter, whenever a new Zone, Net, or Node number is explicitly specified in a ^APTH line address, the new value(s) are stored in the CURRENT variables, but first the CURRENT values are moved to the PREVIOUS values.

To help visualize this, let's again suppose we have a ^APTH line that ends as follows (all of these addresses are in Zone 1):

    ..... 157/200 154/9! 970!

Let's suppose that we are processing this message on 154/9, and will need to remove the 154/9! address. When we encounter 157/200, our variables will be set as follows:

            Previous | Current
    Zone       1     |    1
    Net        ?     |   157
    Node       ?     |   200

Now, when we read 154/9, our current values will be moved to the previous:

            Previous | Current
    Zone       1     |    1
    Net       157    |   154
    Node      200    |    9

We now have the data we need to determine what needs to be added to the next address, after we delete 154/9. In this case, we need only compare the Previous and Current addresses to determine which are UNEQUAL. In this case, the Zone is the same, but the Net and Node are not. So, if the following address lacks either the Net or Node, we'll have to add those.

Now we delete the 154/9! and look at the next address, 970. At this point our variables will look like this:

            Previous | Current
    Zone       1     |    1
    Net       154    |   154
    Node       9     |   970

Again, we compare to see which addresses are UNEQUAL. In this case, only the NODE address is. So we know we do NOT have to add the NODE address, nor do we have to add the Zone address (because it was not different on the first compare). We only need add those address components which were unequal on the first compare, but equal on the second compare. So, in this case, the Net address must be added to the next address in the ^APTH line, leaving as a result:

    ..... 157/200 154/970!

The current system address is then added back in at the end of the line, thus:

    ..... 157/200 154/970! 9

Note 2: In section 4.d it is suggested that, when a line that is neither blank nor a kludge line (that begins with a ^A character) is found, a ^APTH line be added at that point. However, there are reports that under certain circumstances (particularly when messages are "forwarded" or "hurled"), certain software may insert a non-kludge line prior to previously-existing kludge lines in a message. It should be recognized by software authors that a non-kludge line should NEVER be inserted in front of existing kludge lines located at the start of a message, if those kludge lines are still valid. If they are NOT still valid, they should be removed. When a message is forwarded or hurled, it is probably desirable to remove duplicate control information, since what is essentially happening is that the text of a previous message is being inserted into a NEW message. Since the message is new, the "old" duplicate control information is no longer valid.

Software authors that are implementing the ^APTH line in their software should never search beyond the first text line of a message for the ^APTH line, because if one is found later in the text, it is in all probability an old ^APTH line that was inadvertently copied over from another message, and is not relevant to the current message.

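Steps 6 through 8 of the outline, together with the adjustment described in Note 1, can be sketched in Python as follows. This is only one possible arrangement and not the method of Note 1 itself: instead of keeping PREVIOUS and CURRENT variables, the sketch expands every entry to a full address, drops the system's own flagged entry, and then re-abbreviates the whole line, which produces the same corrected entries. The names expand, abbreviate, and process, the status strings, and the tuple layout (zone, net, node, point, flag) are all assumptions made for this example. The input is the list of space-delimited entries that follow "^APTH " on the line.

```python
import re

# One entry: [[Zone:]Net/]Node[.Point] or .Point alone, plus an optional
# single trailing non-numeric character such as "!".
ENTRY = re.compile(r"^(?:(?:(\d+):)?(\d+)/)?(\d+)?(?:\.(\d+))?([^\d\s])?$")


def expand(entries):
    """Step 6: expand abbreviated entries to (zone, net, node, point, flag)."""
    zone = net = node = None
    full = []
    for text in entries:
        m = ENTRY.match(text)
        if m is None:
            raise ValueError(f"defective entry: {text!r}")
        z, n, f, p, flag = m.groups()
        if f is None and (n is not None or p is None):
            raise ValueError(f"defective entry: {text!r}")
        zone = int(z) if z is not None else zone
        net = int(n) if n is not None else net
        node = int(f) if f is not None else node
        if None in (zone, net, node):
            raise ValueError("first address lacks Zone, Net, or Node")
        point = int(p) if p is not None else None  # .0 stays distinct from "no point"
        full.append((zone, net, node, point, flag or ""))
    return full


def abbreviate(full):
    """Rebuild entries, repeating only what differs from the previous entry."""
    out, prev = [], None
    for zone, net, node, point, flag in full:
        if prev is None or zone != prev[0]:
            text = f"{zone}:{net}/{node}"
        elif net != prev[1]:
            text = f"{net}/{node}"
        elif node != prev[2] or point is None:
            text = str(node)
        else:
            text = ""  # same Zone, Net, and Node: only .Point follows
        if point is not None:
            text += f".{point}"
        out.append(text + flag)
        prev = (zone, net, node)
    return out


def process(entries, me):
    """Steps 7 and 8. me is (zone, net, node, point) with point None if not a point."""
    full = expand(entries)
    kept = [e for e in full if not (e[:4] == me and e[4])]  # 7.b: drop own flagged entry
    real = [e for e in kept if not e[4]]                    # entries without a flag
    if any(e[:4] == me for e in real[:-1]):
        return "duplicate", list(entries)                   # 7.d
    if real and real[-1][:4] == me:
        return "already-processed", list(entries)           # 7.c: leave untouched
    kept.append(me + ("",))                                 # 8: add own address
    return "new", abbreviate(kept)
```

With the Note 1 example placed in Zone 1, process("1:157/200 154/9! 970!".split(), (1, 154, 9, None)) returns ("new", ["1:157/200", "154/970!", "9"]), which matches the corrected line shown in that note.
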
Note 3: This is an optional suggestion, for use during the transition period in which the ^APTH line has to coexist with more traditional PATH and SEEN-BY lines. If ^APTH line checking is being used during the import phase of echomail processing in a conference mail processor, it might be a good idea to optionally check to make sure that all ^APTH line addresses that are in the system's home Zone (including those with an appended non-numeric character) have been properly included in the SEEN-BY lines, and to add any that have not been so included. It should be obvious that ^APTH line addresses that are NOT in the system's home Zone should NOT be added to the SEEN-BY lines. If this feature is implemented, it may be a good idea to give the sysop the ability to enable or disable it by means of a command line switch or a configuration file option.

Note 4: If nodes with trailing non-numeric characters are inserted into a ^APTH line for the purpose of indicating "SEEN-BY" nodes in a fully coupled topology, it is permissible (but not required) to include those nodes ONLY in the ^APTH lines of messages actually exported to the nodes participating in the circular topology. In other words, it's permissible to add such nodes to the ^APTH lines of messages during the import cycle, in which case messages with ^APTH lines containing the added nodes would be exported to all nodes. However, it's also permissible to add those nodes to the ^APTH line during the export cycle, including them only in the ^APTH lines of the nodes that need to see them. Please keep in mind that such nodes are added only to the END of the ^APTH line, AFTER the address of the system processing the message. In any event, it's up to the software author to implement this feature in such a manner that duplicates will not be created.

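The export-cycle option just described might be sketched in Python as below. The function names abbreviate_after and export_tail, the representation of a fully coupled group as a list of (zone, net, node, point) tuples, and the assumption that the message reached this system from outside the group are all illustrative choices, not part of the proposal. The sketch returns the entries to append to an existing ^APTH line for one destination: the system's own address first, then the flagged entries, and only when the destination is a member of the group.

```python
def abbreviate_after(prev, addr, flag=""):
    """Shortest entry text for addr, given the preceding full address prev."""
    zone, net, node, point = addr
    if prev is None or zone != prev[0]:
        text = f"{zone}:{net}/{node}"
    elif net != prev[1]:
        text = f"{net}/{node}"
    elif node != prev[2] or point is None:
        text = str(node)
    else:
        text = ""  # same Zone, Net, and Node: only .Point follows
    if point is not None:
        text += f".{point}"
    return text + flag


def export_tail(last, me, dest, coupled):
    """Entries to append when exporting to dest.

    last    -- full address of the final entry already on the line (or None)
    me      -- this system's address
    dest    -- the system the message is being exported to
    coupled -- hypothetical list of addresses forming a fully coupled group
    """
    tail = [abbreviate_after(last, me)]      # own address always comes first
    prev = me
    if me in coupled and dest in coupled:    # flag entries only for group members
        for addr in coupled:
            if addr != me:
                tail.append(abbreviate_after(prev, addr, "!"))
                prev = addr
    return tail
```

For the group 157/200, 154/9, and 228/6 used earlier, exporting from 1:157/200 to 1:154/9 after a line ending in 1:124/4115 gives the tail 157/200 154/9! 228/6!, while a destination outside the group receives only 157/200.
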
Similarly, if a node RECEIVES a message containing a ^APTH line that lists nodes with trailing non-numeric characters, it is permissible to remove those nodes from the ^APTH line if it can be positively ascertained that they are no longer required. Generally speaking, this should NOT be done unless there is absolutely NO possibility of the message being exported to one of the nodes in question. Note that in most situations, if a ^APTH line contains a node with a trailing non-numeric character, but it is followed by a node number (other than your own) that does NOT have a trailing non-numeric character (that is, the node with the trailing non-numeric character is not one of the last nodes on the line), then it can usually be safely removed since it will have already "passed through" the fully-coupled topology.

Using the previous example of 157/200, 154/9, and 154/970 participating in a fully-coupled topology, the ^APTH line as received at 154/9 and 154/970 might end as follows:

    ..... 157/200 154/9! 970!

However, please note that if 157/200 also feeds other nodes that are NOT part of this particular fully coupled topology, there is no real reason they would have to see the "154/9! 970!" at the end of the line. However, there is no prohibition against including those nodes in the ^APTH lines of messages exported to other nodes.

Once this example message arrives at 154/9, the ^APTH line would be changed to look like this:

    ..... 157/200 154/970! 9

Now, when this message is exported from 154/9 to another node (154/111 for example), that node may remove the "154/970!" as long as 154/9 remains in the ^APTH line, since as long as the message cannot be sent back to 154/9, it cannot re-enter the fully-coupled topology. The ^APTH line at this point (after the message is received on 154/111) might look like this:

    ..... 157/200 154/9 111

It would probably not be advisable to remove the "154/970!" at 154/9 in this example, even if the message has already been exported, because the message might need to be re-exported (such as when a new board picks up an echo feed).

When in doubt, nodes with trailing non-numeric characters (other than your own) should be left in the ^APTH line. While there is a cost of a few extra bytes per message if you leave them in, it does not compare to the cost of the duplicate messages that could be generated if they are removed indiscriminately.

Jack Decker
April 1, 1990