THE TWELVE RULES for determining how relational a DBMS product is by E. F. Codd President The Relational Institute May 16, 1986 Abstract In a database management system (DBMS) omission of a few features of the relational model can have seriously adverse effects on the usability, the economic advantage, and the integrity advantage of that system. Frequently, DBMS vendors have been amazingly short-sighted in the design of their new DBMS products. However, they all want to claim their products are "relational". The rules described in this paper are intended to reflect the compelling advantages to be gained by a DBMS that is fully supportive of each and every feature of the relational model. Furthermore, these rules may help users to evaluate DBMS products (which are claimed to be relational) more rapidly than if they examined the support for each and every feature of the relational model. The rules may also help vendors realize how far short of the mark their DBMS products are. NOTE: Authorization to upload to a BBS was obtained from The Relational Institute by both Hugo Blasdel (who scanned corrected and uploaded the document) and Fabian Pascal (who originally obtained it from TRI). Errors in text are preserved (bottom of page 5) and formating (middle of page 5) are preserved, although this note and fixed phone number below are added. Copyright (c) 1986 E. F. Codd / The Relational Institute All rights reserved This material is planned for inclusion in a book by E. F. Codd to be published soon by Addison-Wesley The Relational Institute Telephone: 408-268-8821 Suite 106 6489 Camden Avenue TRI Technical Report San Jose, CA 95120 EFC-6 / 05-16-86 CONTENTS PAGE 1 Introduction to the rules . . . . . . . . . . . . . 1 2 Non-claims . . . . . . . . . . . . . . . . . . . . . 2 R1 Rule 1 The information rule . . . . . . . . 2 R2 Rule 2 Guaranteed access . . . . . . . . . . 3 R3 Rule 3 Systematic treatment of missing information . . . . . . . 3 R4 Rule 4 Dynamic on-line catalog . . . . . . . 4 R5 Rule 5 Comprehensive data sublanguage . . . 4 R6 Rule 6 View updating . . . . . . . . . . . . 5 R7 Rule 7 High-level insert, update, delete . . 5 R8 Rule 8 Physical data independence . . . . . 6 R9 Rule 9 Logical data independence . . . . . . 6 R10 Rule 10 Integrity independence . . . . . . . 6 R11 Rule 11 Distribution independence . . . . . . 7 R12 Rule 12 Non-subversion . . . . . . . . . . . 8 3 Concluding remarks . . . . . . . . . . . . . . . . 9 4 References . . . . . . . . . . . . . . . . . . . . 9 Copyright (c) 1986 E. F. Codd / The Relational Institute THE TWELVE RULES 1. Introduction The twelve rules defined below were first published in Computerworld U.S.A. October 14 & 21, 1985. The versions presented here are slightly revised to make their definitions more precise (not to change their meaning). A database management system (DBMS) is fully relational if it fully supports each and every one of the twelve rules and all 30 features of the relational model. Like the relational model itself, these rules are based on the practical requirements of users, database administrators (DBAs), application programmers, security officers and their managers. For example, in rules 8 through 11, I specify, and require, four different types of independence aimed at protecting customers' investment in application programs, terminal activities, and training. Rules 8 and 9 -- physical and logical data independence -- have been heavily discussed for many years. Rules 10 and 11 -- integrity independence and distribution independence -- are aspects of the relational approach which have received inadequate attention to-date, but are likely to become as important as rules 8 and 9. These rules are collectively based on a single foundation rule, which I call Rule Zero:. For any system that is advertised as, or claimed to be, a RELATIONAL DATABASE MANAGEMENT SYSTEM, that system must be able to manage databases entirely through its relational capabilities as specified in the relational model. This must hold whether or not the system supports any non- relational capabilities of managing data. Any DBMS that does not satisfy this Rule Zero is not worth rating as a RELATIONAL DBMS. One consequence of this rule: any system claimed to be a relational DBMS must support database insert, update, and delete at the RELATIONAL level (multiple-record-at-a-time) -- see rule R7 below. Another consequence is the necessity of supporting the information rule and the guaranteed access rule -- see rules Rl and R2 below. "Multiple-record-at-a-time" includes as special cases those situations in which zero or one record is retrieved, inserted, updated, or deleted. In other words, a relation (R-table) may have either zero tuples (rows) or one tuple (row) and still be a valid relation (R-table). Note that although these are special cases, they do not receive special treatment. What is the danger to buyers and users of a system: 1 that is claimed to be a RELATIONAL DBMS and 2 that fails on Rule Zero ? 1 Buyers and users will expect all the advantages of a truly relational DBMS, and they will fail to get these advantages. 2. Non-Claims It is important for the reader to realize that the author is NOT making any of the following claims about the 12 rules: 1) that the rules are independent of one another; 2) that the 12 rules are complete in any sense; 3) that all a user needs to consider in evaluating any DBMS is its fidelity to the relational model or to the rules (for more complete evaluations, see the forthcoming Product Reports from The Relational Institute). As time goes on, it may be deemed advisable to augment these rules to make them "more complete", but in any event they will continue to be based principally on the relational model. This model is not expected to change in any radical way. It has been very stable for the last ten years at least. Minor additions are expected every five years at the beginning and midpoint of each decade. For an example of such an addition, see the references [EFC1,2] cited in connection with Rule 3: it contains a candidate extension re missing information, one that will be included by 1990, if there are no insuperable technical or semantic problems uncovered. Now for the 12 rules based on Rule Zero. Rl The Information Rule Rule 1: ALL INFORMATION in a relational database is represented explicitly at the logical level and in exactly one way -- by values in R-tables. Even R-table names, column names, and domain names are represented as character strings in some R-tables. R-tables containing such names are normally part of the built-in system catalog. The catalog is accordingly a relational database itself -- one that is dynamic and active and represents the meta-data (data describing the rest of the data in the system). The information rule is enforced not only for user productivity, but also to make it a reasonably simple job for software vendors to define additional software packages (like application development aids, expert systems, etc.) which interface with relational DBMS and, by definition, are well integrated with the DBMS -- that is, these packages retrieve information already existing in the catalog and, as needed, put new information in the catalog by the very act of using the DBMS. An additional reason is to make the database administrator's task of maintaining the database in a state of overall integrity both simpler and more effective. There is nothing more embarrassing to a DBA than being asked if his database contains 2 certain specific information, and then replying after a week's examination of the database that he does not know. R2 Guaranteed Access Rule Rule 2: Each and every datum (atomic value) in a relational database is guaranteed to be logically accessible by resorting to a combination of R-table name, primary key value, and column name. Clearly, each datum in a relational database can be accessed in a rich variety (possibly thousands) of logically distinct ways. However, it is important to have at least one way (independent of the specific relational database) that is guaranteed -- because most computer-oriented concepts (such as scanning successive addresses) have been deliberately omitted from the relational model. Note that the guaranteed access rule represents an associative addressing scheme that is unique to the relational model. It does not depend at all on the usual computer-oriented addressing. However, the primary key concept is an essential part of it. Structural feature S8 (see section 3.1) requires each base relation to have a primary key. R3 Systematic treatment of missing information Rule 3: Indicators (distinct from the empty character string or a string of blank characters, and distinct from zero or any other number) are supported in fully relational DBMS for representing at the logical level the fact that information is missing (applicable and inapplicable information) in a systematic way -- independent of data type. Besides the logical representation, the DBMS must support manipulative functions for these indicators [EFC1,2] and these must also be independent of the data type of the missing information. To support database integrity, it must be possible to specify "indicators not allowed" for each primary key column and for any other columns where the DBA considers it an appropriate integrity constraint (for example, certain foreign key columns). Past techniques entailed defining a special value (peculiar to each column or field) to represent missing information. This would be most unsystematic in a relational database, because users would have to employ different techniques for each column or for each domain -- a difficult task due to the high level of language in use (and a task which, I believe, would lower the productivity of users). Moreover, the use of a special value selected by a user from the domain pertinent to the column is tantamount to equating the semantics of the FACT that a value is missing to the semantics of any value from that domain -- I call this error the "value misrepresentation error". 3 R4 Dynamic on-line catalog based on the relational model Rule 4: The database description is represented at the logical level just like ordinary data, so that authorized users can apply the same relational language to its interrogation as they apply to the regular data. One consequence of this is that each user (whether an application programmer or end user) needs to learn only one data model -- an advantage which non-relational systems usually do not offer (IMS together with its dictionary requires the user to learn two distinct data models). Another consequence is that authorized users can easily extend the catalog to become a full-fledged, active, relational data dictionary whenever the vendor fails to do so. R5 Comprehensive Data Sublanguage Rule Rule 5: A relational DBMS (no matter how many languages and what modes of terminal use it supports -- for example, the fill-in-the-blanks mode) MUST support at least one one language (1) whose statements are expressible per some well-defined syntax as character strings; and (2) which is comprehensive in supporting ALL of the following items: 1 data definition 2 view definition 3 data manipulation (interactive & by program) 4 integrity constraints 5 authorization 6 transaction boundaries (begin, commit, and rollback) The relational approach is intentionally highly dynamic -- that is, it should rarely be necessary to bring the database activity to a halt (in contrast to non-relational DBMS). Therefore it does not make sense to separate the services listed above into distinct languages. Note: in the mid-seventies ANSI-SPARC generated a document advocating 42 distinct interfaces and (potentially) 42 distinct languages for database management systems! Fortunately, that idea has apparently been abandoned. 4 R6 View Updating Rule Rule 6: The DBMS includes an algorithm at least as powerful as VU-1 [see EFC3] for determining (at view definition time) whether that view is.tuple-insertible and tuple deletable, and whether each of its columns is updatable. It records the result of this investigation in the catalog. Note that a view is theoretically updatable if there exists a time-independent algorithm for unambiguously determining a single series of changes to the base relations which will have as their effect precisely the requested changes in the view. In this regard, "update" is intended to include insertion and deletion, as well as modification. Suppose the definition of a particular view V 1) happens to be expressed directly in terms of base relations; or 2) its definition is expanded until it is so expressed. This view is simple if its definition in terms of base relations involves no more than: 1) four occurrences of operators of the class: union, outer union, set difference, intersection; 2) four occurrences of operators of the class: join, relational division; 3) twice as many occurrences of the project operator as there are relations cited in the query or update; and 4) twice as many occurrences of algebraic selects as there are relations cited in the query or update. The reason that rule # 6 refers to simple views only is that the general question as to whether or not a view is updatable is NOT LOGICALLY DECIDABLE [HWB]. By limiting the class of views in rule #6 to those defined above, the updatability of these views becomes decidable -- and it should be decided by the DBMS, not by a user! R7 High-level Insert, Update, and Delete Rule 7: The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update, and deletion of data. This requirement gives the system much more scope in optimizing the efficiency of its execution-time actions. It allows the system to determine which access paths to exploit to obtain the most efficient code. It can also be extremely important in obtaining efficient handling of transactions across a distributed database In this case, the total communication costs can easily become enormous. Hence, many companies wish to costs are saved by avoiding the necessity of transmitting a separate request for each record obtained from remote sites. 5 R8 Physical Data Independence Rule 8: Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods. To handle this, the DBMS must support a clear, sharp boundary between the logical and semantic aspects on the one hand and the physical and performance aspects of the base tables on the other hand; application programs must deal with the logical aspects only. Non-relational DBMS rarely provide complete support for this rule -- in fact, I know of none that do. R9 Logical Data Independence Rule 9: Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables. Two examples: (1) splitting a base R-table into two tables either by rows using row content or by columns using column names, if primary keys are preserved in each result; (2) combining two base R-tables into one by means of a non-loss join (Stanford University and MIT authors now call these joins "lossless"). To provide this service whenever possible, the DBMS must be capable of handling inserts, updates, and deletes at least when algorithm VU-1 determines that such updatability is feasible (see rule R6). Rule R9 frequently permits logical database design to be dynamically changed if it appears to be necessary or desirable -- for example, if such a change would improve performance. The physical and logical data independence rules permit database designers for relational DBMS to make mistakes in their designs without the heavy penalties levied by non-relational DBMS. This in turn means that it is much easier to get started with a relational DBMS, because not nearly as much performance-oriented planning is needed prior to "blast-off". R10 Integrity Independence Rule 10: Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog (not in the application programs). Information about inadequately identified objects is NEVER recorded in a relational database. To be more specific, the following two integrity rules apply to every relational database: 1) Entity integrity No component of a primary key is allowed to have a missing value; 6 2) Referential integrity For each distinct non-missing foreign key value in a relational database, there must exist a matching primary key value from the same domain. In addition to the two integrity rules (entity integrity and referential integrity), which apply to the base relations of each and every RELATIONAL DATABASE, there is a clear need to be able to specify additional integrity constraints reflecting either business policies or government regulations. Assume the relational model is faithfully reflected. Then, these additional integrity constraints are defined in terms of the high level data sublanguage, and the definitions are stored in the catalog (not in the application programs). If, as sometimes happens, either business policies or government regulations change, it will probably become necessary to change the integrity constraints. Normally, this can be accomplished in a fully relational DBMS by changing one or more integrity statements stored in the catalog. In many cases, neither the application programs nor the terminal activities are logically impaired. Non-relational DBMS rarely support this rule as part of the DBMS engine (where it belongs) -- instead they depend on a dictionary package (which may or may not be present and can readily be bypassed). Rll Distribution Independence Rule 11: A relational DBMS has distribution independence. By distribution independence, I mean that the DBMS has a data sublanguage which enables application programs and terminal activities to remain logically unimpaired: 1) when data distribution is first introduced (if the DBMS originally installed manages non-distributed data only); and 2) when data is re-distributed (if the DBMS manages distributed data). Note that this definition is carefully worded so that even a non- distributed DBMS can fully support rule 11. SQL/DS, DB2, Oracle, INGRES, and SUPRA fully support this rule, although each one is either non-distributed or supports distributed databases only partially in its present release. This has been demonstrated as follows: (1) SQL programs written to operate on non-distributed data (using System R) run correctly on distributed versions of that data using System R* -- the IBM San Jose Research prototype; (2) the distributed INGRES project has shown the same capability for the QUEL language of INGRES. It is important to distinguish distributed processing from distributed data. In the former case, work (i.e., programs) is transmitted to the data; in the latter case, data is transmitted to the work. Many non-relational DBMS support distributed 7 processing, but not distributed data. The only systems that support the concept of making all the distributed data aopear to be local are relational DBMS (these are prototypes right now). In the case of a distributed relational DBMS, a single trans- action may straddle several remote sites. Such straddling is managed entirely under the covers -- the system may have to execute recovery at multiple sites. Each program or terminal activity treats the totality of data as if it were all local to the site where the application program or terminal activity is being executed. A fully relational DBMS that does not support distributed databases has the capability of being extended to provide that support, while leaving application programs and terminal activities logically unimpaired -- both at the time of initial distribution and whenever later re-distribution is made. There are 4 important reasons why relational DBMS enjoy this advantage: decomposition flexibility in deciding how to deploy the data, recomposition power of the relational operators when combining the results of sub-transactions executed at different sites, economy of transmission resulting from the fact that there need not be a request message sent for each record to be retrieved from any remote site, analyzability of intent (due to the very high level of relational languages) for vastly improved optimization of execution. R12 Non-Subversion Rule Rule 12: If a relational system has a low level (single-record- at-a-time) language, that low level cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher level relational language (multiple-records-at-a-time). In the relational approach, preservation of integrity is made independent of logical data structure to achieve integrity independence (see rule R10). Remark: Rulc R12 is extremely difficult for a "relational pretender" like Cullinet's IDMS/R and ADR's Datacom/DB to obey, because such a system already supports an interface below the relational integrity constraint interface. Even if the DBMS had an authorization feature which would permit only certain specified programs to use the low level language, that would not be enough (and neither DBMS goes even this far, at this time). In fact, in talking with vendors of "relational pretenders", it is quite obvious that they had not given this matter a thought ! 8 3. Conclusion As stressed in section 2 (Non-Claims), there is no contention in this article that evaluating a candidate DBMS with regard to its degree of fidelity to the relational model is all that needs to be done, when trying to determine which is the best DBMS for a particular company or portion thereof. However, I do claim that this fidelity evaluation is today an essential, and possibly crucial, part of the choosing process. There are many clear (and obvious) relationships between user needs in general and the twelve rules cited above. This applies whether we are discussing productivity of application programmers, direct access to the databases by end users, ease of database design and installation, vastly improved control of database integrity, or protection of the firm's investment in application programs and in the training of their employees. 4. References [EFCl] E. F. Codd, "Missing Information (Applicable and Inapplicable) in Relational Databases", Report EFC-4, The Relational Institute, San Jose, CA, Feb 21, 1986 [EFC2] E. F. Codd, "More Commentary on Missing Information in Relational Databases", Report EFC-14, The Relational Institute, San Jose, CA, January 12, 1987 [EFC3] E. F. Codd, "View Updatability: Algorithm VU-1", Report EFC-15, The Relational Institute, San Jose, CA January 22, 1987 [HWB] H. W. Buff, "The View Update Problem is Undecidable", private communication from the Swiss Reinsurance Co., Zurich, Switzerland August 4, 1986 (revised June 2, 1987) 9