Network Working Group                                      Chris Weider
Internet Draft                                               Paul Leach
                                                        Microsoft Corp.
                                                            April, 1997

        Hierarchical Extensions to the Common Indexing Protocol

Status of this Memo

This is a personal submission to the FIND Working Group. It does not represent working group consensus.

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

WARNING: The specification in this document is subject to change, and will certainly change. It is inappropriate AND STUPID to implement to the proposed specification in this document. In particular, anyone who implements to this specification and then complains when it changes will be properly viewed as an idiot, and any such complaints shall be ignored. YOU HAVE BEEN WARNED.

To learn the current status of any Internet-Draft, please check the 1id-abstracts.txt listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Distribution of this document is unlimited. Please send comments to the FIND working group at . Discussions of the working group are archived at .

1. Introduction

This work explores what, in the parlance of the current CIP draft, is called an index type -- specifically, a new kind of index that merges indexing of hierarchically named attribute-value entities (such as in LDAP and RWHOIS) and ones without distinguished names (such as in WHOIS++).
It is based on a previous version of the CIP specification, but that was just a convenient syntactic jumping-off point. It is intended to be orthogonal to the FIND working group task of defining a framing syntax and functionality for a common indexing data wrapping protocol; the concepts and protocol elements in this draft should be expressible in a manner consistent with the new CIP framework at the appropriate time.

2. Protocol Functionality and components of the Index Service

2.1 Base data servers

Most directory services today specify only the query language, the information model, and the server responses for their servers. Most also use a basic 'template-based' information model, in which each entry consists of a set of attribute-value pairs. Thus the basic service can be provided by a wide variety of databases and directory services. However, to participate in the Index Service, that underlying database must also be able to generate a 'centroid', or some other type of forward knowledge, for the data it serves.

Connections out from the indexing service to the base data servers will be accomplished using URIs for the various end protocols. This will avoid the need to rewrite the data from its native formats.

2.2 Centroids as forward knowledge

The centroid of a server consists of a list of the templates and attributes used by that server, and a word list for each attribute. The word list for a given attribute contains one occurrence of every word which appears at least once in that attribute in some record in that server's data, and nothing else.
For example, if a server contains exactly three records, as follows:

   Record 1                        Record 2
   Template: User                  Template: User
   First Name: John                First Name: Joe
   Last Name: Smith                Last Name: Smith
   Favourite Drink: Labatt Beer    Favourite Drink: Molson Beer

   Record 3
   Template: Domain
   Domain Name: foo.edu
   Contact Name: Mike Foobar

the centroid for this server would be

   Template:         User
   First Name:       Joe
                     John
   Last Name:        Smith
   Favourite Drink:  Beer
                     Labatt
                     Molson

   Template:         Domain
   Domain Name:      foo.edu
   Contact Name:     Mike
                     Foobar

It is this information which is handed up the tree to provide forward knowledge. As we mention above, this may not turn out to be the ideal solution for forward knowledge, and we suspect that there may be a number of different sets of forward knowledge used in the Index Service. However, the indexing architecture is in a very real sense independent of what types of forward knowledge are handed around, and it is entirely possible to build a unified directory which uses many types of forward knowledge.

2.3 Other types of forward information

There are several other types of forward information that might be useful in an indexing service. The first is untokenized values for the given attributes, as opposed to the tokenized values given in the centroid. A second type is forward information generated by a typical query; this can be used for replication of databases or of specific records in a database. A third type is forward information which specifies from which server a given value was obtained. All of these are given in the protocol.

A fourth type is aggregated hierarchical values: for example, let's assume that server A holds many email addresses with domain names such as foo.microsoft.com, bar.microsoft.com, and so forth. It would enhance compression if server A could simply specify that the email attribute was hierarchical and that any query whose email value contained microsoft.com as the rightmost string would be a hit for purposes of referral.
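The centroid construction described in section 2.2 can be sketched as follows. This is only an illustration, not part of the protocol: the function name build_centroid, the dictionary representation of records, and the choice of whitespace as the token separator are all assumptions made for the example.

```python
from collections import defaultdict


def build_centroid(records):
    """Return {template: {attribute: sorted list of unique words}}.

    Hypothetical helper: each record is a dict with a "Template" key
    plus one key per attribute.
    """
    words = defaultdict(lambda: defaultdict(set))
    for record in records:
        template = record["Template"]
        for attribute, value in record.items():
            if attribute == "Template":
                continue
            # Assumption: words are separated by whitespace only, so
            # "foo.edu" stays a single word.
            words[template][attribute].update(value.split())
    return {
        template: {attr: sorted(vals) for attr, vals in attrs.items()}
        for template, attrs in words.items()
    }


# The three records from the example in section 2.2.
records = [
    {"Template": "User", "First Name": "John", "Last Name": "Smith",
     "Favourite Drink": "Labatt Beer"},
    {"Template": "User", "First Name": "Joe", "Last Name": "Smith",
     "Favourite Drink": "Molson Beer"},
    {"Template": "Domain", "Domain Name": "foo.edu",
     "Contact Name": "Mike Foobar"},
]

centroid = build_centroid(records)
for template, attributes in centroid.items():
    print("Template:", template)
    for attribute, word_list in attributes.items():
        print(f"  {attribute}:", " ".join(word_list))
```

Each word appears once per attribute no matter how many records contain it, which is why "Smith" and "Beer" are listed only once for the User template.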
2.4 Index servers and Index server Architecture

An index server collects and collates the centroids (or other forward knowledge) of either a number of base servers or of a number of other index servers. An index server must be able to generate a centroid (or other forward knowledge) for the information it contains. In addition, an index server can index any other server it wishes, which allows one base level server (or index server) to participate in many hierarchies in the directory mesh.

2.4.1 Queries to index servers

An index server receives a query, searches its collections of centroids and other forward information, determines which servers hold records which may fill that query, and then notifies the user's client of the next servers to contact to submit the query. An index server can also contain primary data of its own, and thus act as both an index server and a base level server. In this case, the index server's response to a query may be a mix of records and referral pointers.

Each index server is required to support the following query protocols and to generate referrals in the proper format for those protocols: RWhois, WHOIS++, and LDAP. Index servers which directly index a base level server may in the future return referrals to those servers in their native protocols.

2.4.2 Index server distribution model and forward knowledge propagation

The diagram below illustrates how a mesh of index servers might be created for a set of base servers. Although it looks like a hierarchy, the protocols allow (for example) server A to be indexed by both server D and by server H.
    whois++                index                  index
    servers                servers                servers
                           for                    for
                           whois++                lower-level
                           servers                index servers
     _______
    |       |
    |   A   |__
    |_______|  \            _______
                \----------|       |
     _______               |   D   |__             ______
    |       |   /----------|_______|  \           |      |
    |   B   |__/                       \----------|      |
    |_______|                                     |  F   |
                                       /----------|______|
                                      /
     _______                _______  /
    |       |              |       |-
    |   C   |--------------|   E   |
    |_______|              |_______|-
                                     \
                                      \
     _______                           \            ______
    |       |                           \----------|      |
    |   G   |--------------------------------------|  H   |
    |_______|                                      |______|

          Figure 1: Sample layout of the Index Service mesh

In the portion of the index tree shown above, base servers A and B hand their centroids up to index server D, base server C hands its centroid up to index server E, and index servers D and E hand their centroids up to index server F. Servers E and G also hand their centroids up to H.

The number of levels of index servers, and the number of index servers at each level, will depend on the number of base servers deployed, and the response time of individual layers of the server tree. These numbers will have to be determined in the field.

2.4.3 Forward knowledge propagation and changes to forward knowledge

Forward knowledge propagation is initiated by an authenticated POLL command (sec. 4.4.1). The format of the POLL command allows the poller to request the forward knowledge of any or all templates and attributes held by the polled server. After the polled server has authenticated the poller, it determines which of the requested forward knowledge the poller is allowed to request, and then issues a CENTROID-CHANGES report (sec. 4.4.2) to transmit the data. When the poller receives the CENTROID-CHANGES report, it can authenticate the pollee to determine whether to add the new changes to its data. Additionally, if a given pollee knows what pollers hold forward knowledge from the pollee, it can signal to those pollers the fact that its information has changed by issuing a DATA-CHANGED command.
The poller can then determine if and when to issue a new POLL request to get the updated information. The DATA-CHANGED command is included in this protocol to allow 'interactive' updating of critical information.

2.4.4 Forward knowledge propagation and mesh traversal

When an index server issues a POLL request, it may indicate to the polled server what relationship it has to the polled server. This information can be used to help traverse the directory mesh. Two fields are specified in the current proposal to transmit the relationship information, although it is expected that richer relationship information will be shared in future revisions of this protocol.

One field used for this information is the Hierarchy field, which can take on three values. The first is 'topology', which indicates that the indexing server is at a higher level in the network topology (e.g. indexes the whole regional ISP). The second is 'geographical', which indicates that the polling server covers a geographical area subsuming the pollee. The third is 'administrative', which indicates that the indexing server covers an administrative domain subsuming the pollee.

The second field used for this information is the Description field, which contains the DESCRIBE record of the polling server. This allows users to obtain richer metainformation for the directory mesh, enabling them to expand queries more effectively.

2.4.5 Loop control

Since there are no a priori restrictions on which servers may poll which other servers, and since a given server may participate in many sub-meshes, mechanisms must be installed to allow the detection of cycles in the polling relationships. This is accomplished in the current protocol by including a hop-count on polling relationships. Each time a polled server generates forward information, it informs the polling server about its current hopcount, which is the maximum of the hopcounts of all the servers it polls, plus 1.
A base level server (one which polls no other servers) will have a hopcount of 0. When a server decides to poll a new server, if its hopcount goes up, then it must inform all the other servers which poll it about its new hopcount. A maximum hopcount (8 in the current version) will help the servers detect polling loops.

A second approach to loop detection is to do all the work in the client, which would determine which new referrals have already appeared in the referral list, and then simply iterate the referral process until there are no new servers to ask. An algorithm to accomplish this in WHOIS++ is detailed in [Faltstrom 95].

2.4.6 Query handling and passing algorithms

When an index server receives a query, it searches its collection of forward knowledge and determines which servers hold records which may fill that query. As this service becomes widely deployed, it is expected that some index servers may specialize in indexing certain template types or perhaps even certain fields within those templates. If an index server obtains a match with the query _for those template fields and attributes the server indexes_, it is to be considered a match for the purpose of forwarding the query.

2.4.7 Query referral

Query referral is the process of informing a client which servers to contact next to resolve a query. The syntax for notifying a client is outlined in section 4.5. A query can specify the 'trace' option, which causes each server which receives the query to send its server handle and an identification string to the client.

2.5 Security considerations

In the opinion of this author, until a generally accepted Internet wide security service is available (or until a web of such services reaches into most of the Internet) administrators should not assume that servers outside their control, or with which they have not established a trust relationship, will secure their data.
Propagating security information through the common index mesh will run immediately into the problems of common authentication, access control, and incommensurable security features. Thus any index information propagated to an untrusted (i.e. public) server should be considered unsecured.

3. Integrating disparate services

3.1 The service model

The basic service model uses a common data model and allows the use of different access protocols to access a CIP server. CIP schema will not be standardized in this version of the protocol.

3.2 Integration of data models

The basic data models for most of the existing directory services are essentially the same: a set of templates or object classes which are composed of attribute-value pairs. Therefore integration of the data models should not prove too difficult.

3.3 Integration of schema

The various protocols use different attribute names for attributes which typically contain the same data. In this version of the protocol, the attributes will not be changed for inclusion into the CIP mesh. However, it is our intent at some point to require the translation of the base schema into a standard CIP schema set. This implies that in meshes based on this version of the protocol, the schema may be different for each mesh.

3.4 Using different query protocols to access the CIP service

As this document is presently constituted, one can use many protocols to access a CIP server. If the attributes used by the client and server are the same, the query may be answered by the CIP service.

4. Protocol Specification for the Index Service

The syntax for each protocol component is listed below. In addition, each section contains a listing of which of these attributes are required and which are optional for each of the components. All timestamps must be in the format YYYYMMDDHHMM in GMT.
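As an illustration of the timestamp format just stated, the following sketch renders and parses YYYYMMDDHHMM values in GMT. The helper names format_cip_timestamp and parse_cip_timestamp are hypothetical and are not defined by this protocol; the sketch also assumes that callers supply timezone-aware values.

```python
from datetime import datetime, timedelta, timezone

# strftime/strptime pattern for YYYYMMDDHHMM
CIP_TIME_FORMAT = "%Y%m%d%H%M"


def format_cip_timestamp(moment):
    """Render a timezone-aware datetime as YYYYMMDDHHMM in GMT."""
    if moment.tzinfo is None:
        # Assumption of this sketch: refuse naive values rather than
        # guess which zone they were meant to be in.
        raise ValueError("timestamp must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime(CIP_TIME_FORMAT)


def parse_cip_timestamp(text):
    """Parse a YYYYMMDDHHMM string and mark it as GMT."""
    return datetime.strptime(text, CIP_TIME_FORMAT).replace(tzinfo=timezone.utc)


# 04:30 at five hours behind GMT is 09:30 GMT.
local = datetime(1997, 4, 15, 4, 30, tzinfo=timezone(timedelta(hours=-5)))
stamp = format_cip_timestamp(local)
print(stamp)
```

Converting to GMT before formatting keeps values comparable across servers in different time zones, which matters when a Start-time is later matched against a previously reported End-time.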
4.1 Request-Response model

There are two basic transactions in the Common Indexing Protocol: a Change Notification, with which a polled server indicates that the data it holds has changed and that the polling server should repoll the polled server, and a Poll, in which a polling server indicates which data it would like an index for and the polled server sends that index. A polling server may issue a poll at any time, even if a prior change notification has not been received from the polled server.

4.2 Syntax Conventions

All lines in the protocol end in . Line breaks are not to be included in the values extracted from a line. Special characters are escaped by a backslash, “\”. An escaped line break indicates that the line following the line ending in an escaped line break is supposed to be concatenated with the previous line to form a single value. A line break which is part of a value (in a postal address, for example) is indicated by the special token . Component specifications and grouping operators are expressed using the standard HTML format <token> to open a block and <\token> to close the block.

4.3 Change Notification

A polled server opens a TCP connection to a polling server, and issues a Data-Changed report, as detailed in 4.3.1. When the polling server receives the <\Data-Changed> line, it generates a Data-Changed-Ack, as detailed in 4.3.2. When the polled server receives the <\Data-Changed-Ack> line of the Data-Changed-Ack, it closes the connection. If the transaction is interrupted at any point, the polled server should assume that the report was not received, and should resend as appropriate.

4.3.1 Data-changed report syntax

The data changed report looks like this:

<Data-Changed>
Version-number: // version number of index service software, used to ensure
   // compatibility.
   // Current value is 2.3
Time-of-latest-centroid-change: // time stamp of latest forward information
   // change, GMT
Time-of-message-generation: // time when this message was generated, GMT
DSI: // Data set identifier. This uniquely identifies a given data set in case the
   // server manages multiple logical data sets
Server-handle: // IANA unique identifier for this server
   // or OID for this server
   // Or Distinguished Name of the root of the subtree this server
   // is responsible for.
Host-Name: // Host name of this server (current name)
Host-Port: // Port number of this server (current port)
Protocol: // Access protocol to use when speaking to this server
Best-time-to-poll: // For heavily used servers, this will identify when
   // the server is likely to be lightly loaded
   // so that response to the poll will be speedy, GMT
<\Data-Changed> // This line must be used to terminate the data changed message

Required/optional table

   Version-Number                   REQUIRED
   Time-of-latest-centroid-change   REQUIRED
   Time-of-message-generation       REQUIRED
   DSI                              OPTIONAL
   Server-handle                    REQUIRED
   Host-Name                        REQUIRED
   Host-Port                        REQUIRED
   Protocol                         REQUIRED
   Best-time-to-poll                OPTIONAL

4.3.2 DATA-CHANGED-ACK report

The DATA-CHANGED-ACK report has the following syntax:

<Data-Changed-Ack>
<\Data-Changed-Ack>

4.4 Centroid Change Report

A polling server opens a TCP connection to a polled server, and issues a POLL command, as detailed in 4.4.1. When the polled server receives the <\Poll> line, it generates a CENTROID-CHANGES report, as detailed in 4.4.2. When the polling server receives the # END CENTROID-CHANGES line of the CENTROID-CHANGES report, it commits the data to its database and closes the connection. If the transaction is interrupted at any point, the polling server should assume that the entire centroid was not received, and should repoll the polled server.

4.4.1 Poll syntax

<Poll>
Version: // version number of poller's index software, used to
   // ensure compatibility.
   // Current is 2.2
Charset: // specifies character set in which the centroid changes are to be
   // transmitted. Must be one of ISO-8859-1 or UNICODE-1-1-UTF-8
DSI: // Data set identifier. Indicates which data set of multiple data sets
   // should be indexed. Must be an OID.
Type-of-poll: // optional. If not present, indicates centroid poll
Start-time: // give me all the centroid changes starting at this time, GMT
End-time: // ending at this time, GMT
<Request> // This block may occur multiple times
Template: // a standard template or object class name, or the keyword ALL, for a
   // full update.
Field: // used to limit centroid update information to specific fields,
   // is either a specific field name, a list of field names separated by
   // spaces, or the keyword ALL. May occur multiple times per template.
<\Request>
Starting-point: // location in the DIT or other hierarchical structure
   // to start the index. If used, it implies that the entire subtree is
   // indexed as well. If this attribute is missing, then the index request is
   // assumed to cover the entire data store of the polled server.
Server-handle: // IANA unique identifier for the polling server.
   // this handle may optionally be cached by the polled
   // server to announce future changes
Host-Name: // Host name of the polling server.
Host-Port: // Port number of the polling server.
Description: // This field contains the DESCRIBE record of the
   // polling server
Tokenization: // The tokenization algorithm used
   // Can be one of: "TOKENS", or "FIELDS".
   // Default is "FIELDS", which means the entire value in each field.
Options: // Can be used to request the WEIGHT, HANDLE, and/or HOST information
   // for the returned values
<\Poll> // This line must be used to terminate the poll message

When the poll type is CENTROID, the poll scope is FULL if the Start-time attribute is missing and incremental otherwise. If Start-time is present, it must be the same value as the End-time from a previous CENTROID-CHANGES report from this server.
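The scope rule in the preceding paragraph can be sketched as follows. This is an illustrative reading of the rule rather than normative text: the function poll_scope, the dictionary holding the poll attributes, and the set of End-time values remembered from earlier CENTROID-CHANGES reports are assumptions of the sketch.

```python
def poll_scope(poll_fields, previous_end_times):
    """Classify a CENTROID poll as "FULL" or "INCREMENTAL".

    poll_fields        -- dict of poll attribute names to values
                          (hypothetical in-memory form of a POLL)
    previous_end_times -- End-time values taken from earlier
                          CENTROID-CHANGES reports sent by the server
                          being polled
    """
    start_time = poll_fields.get("Start-time")
    if start_time is None:
        # No Start-time attribute: the whole centroid is requested.
        return "FULL"
    if start_time not in previous_end_times:
        # Start-time must equal an End-time previously reported by
        # this server; anything else does not satisfy the rule.
        raise ValueError(
            "Start-time %r does not match a previous End-time" % start_time
        )
    return "INCREMENTAL"


seen_end_times = {"199704150930"}
full = poll_scope({"Version": "2.0", "Charset": "ISO-8859-1"}, seen_end_times)
incremental = poll_scope({"Start-time": "199704150930"}, seen_end_times)
print(full, incremental)
```

Raising an error on a mismatched Start-time is one possible policy; a poller could instead fall back to a full poll by omitting Start-time.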
The allowable values for Options are WEIGHT, HANDLE, and HOST. Support for the HANDLE and HOST values is required. HANDLE indicates that each attribute value must be listed with the server handle of the server from which this value was obtained by the polled server; HOST indicates that each attribute value must be listed with the host name and port number of the server from which this value was obtained. WEIGHT is optional, and allows each value to be assigned a relative weight according to a defined and specified weighting scheme. This value is included for future clarification. Since a weighting scheme will need to be identified, WEIGHT will take additional scheme identifiers in a syntax to be determined.

Required/Optional Table

   Version               REQUIRED, value is 2.0
   Charset               REQUIRED  Support for values ISO-8859-1 and
                                   UNICODE-1-1-UTF-8 is required
   DSI                   OPTIONAL
   Type-Of-Poll          OPTIONAL
   Start-time            OPTIONAL
   End-Time              OPTIONAL
   Template              OPTIONAL  If not present, report all templates
   Field                 OPTIONAL  If not present, report all fields
   Starting-point        OPTIONAL
   Server-handle         REQUIRED
   Host-Name             REQUIRED
   Host-Port             REQUIRED
   Description           OPTIONAL
   Tokenization          REQUIRED  Support for values TOKENS and FIELDS
                                   is required
   Options               OPTIONAL
   Authentication-Type   OPTIONAL
   Authentication-data   OPTIONAL

Example of a POLL command:

<Poll>
Version: 2.0
Charset: UNICODE-1-1-UTF-8
Server-handle: BUNYIP01
Host-Name: services.bunyip.com
Host-Port: 7070
Tokenization: TOKENS
<\Poll>

4.4.2 Centroid-changes report syntax

The centroid change report contains nested multiply occurring blocks. These blocks are delimited by lines which start with the # character, and have comments indicating that they may be used multiple times.

The syntax of a Data: item is either a list of values (words or other phrases, depending on the tokenization value), one value per line, with the syntax:

   word <weight>weight<\weight>

or the keyword:

   *

The weight is not required, but is expected to be used by advanced servers.
The weight is the relative weight of the value for weighting servers. The keyword * as the only item of a Data: list means that any value for this field should be treated as a hit by the indexing server.

The field Any-field: needs more explanation than can be given in the body of the syntax description below. It can take two values, True or False. If the value is True, the pollee is indicating that there are fields in this template which are not being exported to the polling server, but which it wishes to have treated as a hit. Thus, when the polling server gets a query which has a term requesting a field not in this list for this template, the polling server will treat that term as a 'hit'. If the value is False, the pollee is indicating that there are no other fields for this template which should be treated as a hit. This field is required because the basic model for the CIP query syntax requires that the results of each search term be 'and'ed together. This field allows polled servers to export data only for non-sensitive fields, yet still get referrals of queries which contain sensitive terms.

Version: // version number of pollee's index software, used to
   // ensure compatibility. Current value is 2.3
Character-set: // Specifies which character set the data is in. Allowable values
   // are ISO-8859-1 and UNICODE-1-1-UTF-8
Start-time: // change list starting time, GMT
End-time: // change list ending time, GMT
Server-handle: // IANA unique identifier of the responding server
Hop-Count: // One more than the largest value the polled server has received
   // when polling other servers. If the polled server is a leaf
   // server, hop-count should be zero. The current maximum value
   // (Oct 96) is 8.
Options: // Which options the polled server was able to satisfy. Values are
   // WEIGHT
Status-Codes: // transmit error codes which indicate errors in the fulfillment of
   // the request. See section 5.
Compression-type: // Type of compression used on the data, or NONE
Size-of-compressed-data: // size of compressed data if compression is used
Protocol: // Query protocol spoken by the polled server. Used to construct the URLs
   // for referrals. One of WHOIS++, LDAP, CCSO, RWHOIS
Operation: // One of 3 keywords: ADD, DELETE, FULL
   // ADD - add these entries to the centroid for this server
   // DELETE - delete these entries from the centroid of this server
   // FULL - the full centroid as of end-time follows
Tokenization: // The tokenization algorithm used
   // Can be one of: "TOKENS", or "FIELDS".
   // Default is "FIELDS".
Token: // Character(s) used in the tokenization algorithm
   // may occur multiple times
Host: // Host name of server to which the following centroid data belongs. Must
   // be present and have a correct value even if the only server presenting
   // data is the polled server.
Port: // Port number of server to which the following centroid data belongs. Must
   // be present and have a correct value even if the only server presenting
   // data is the polled server.
Server-Handle: // server handle of server to which the following centroid data
   // belongs. Must be present and have a correct value even if the only server
   // presenting data is the polled server
   // may occur multiple times
Template: // A standard template name
Field: // an attribute (field) name inside the template
<\Field>
<\Schema>