home *** CD-ROM | disk | FTP | other *** search
Text File | 1992-08-27 | 73.1 KB | 1,975 lines |
-
-
- .nr tp 12
- .nr sp 12
- .nr pp 11
- .nr fp 10
- .sz \n(pp
- .fo ''%''
- .(l C
- .sz \n(sp
- .b
- EXTENDING A DATA BASE SYSTEM WITH PROCEDURES
- .sz \n(pp
- .sp 3
- .i
- Michael Stonebraker, Jeff Anton and Eric Hanson
- .sp
- EECS Department
- University of California
- Berkeley, Ca., 94720
- .sp
- .r
- .)l
- .sp 3
- .(f
- This research was sponsored
- by the U.S. Air Force Office of Scientific
- Research Grant 83-0254
- and
- the Naval Electronics Systems
- Command Contract N39-82-C-0235
- .)f
- .uh Abstract
- .pp
- This paper suggests that more powerful data base systems (DBMS) can
- be built by supporting data base procedures as full fledged
- data base objects. In particular, allowing fields of a data base to
- be a collection of queries in the query language of the system
- is shown to allow complex data relationships to be
- naturally expressed.
- Moreover, many of the features present in object-oriented
- systems and semantic data models can be
- supported by this facility.
- .pp
- In order to implement this construct, extensions
- to a typical relational query language must be made and
- considerable work on the execution engine of the underlying DBMS
- must be accomplished. This paper reports on the extensions for
- one particular query language and data manager and then gives
- performance figures for a prototype implementation.
- Even though the performance of the prototype is competitive with that of
- a
- conventional system, suggestions
- for improvement are presented.
- .sh 1 "INTRODUCTION"
- .pp
- Most current data base systems store information only as data.
- However
- older data base systems (e.g. [DBTG71]) specifically allowed data base
- procedures written in a general purpose
- programming language to be called during command execution.
- Moreover, Lisp [WILE84] supports objects which are interchangeably
- either
- procedures or data. In this paper we suggest that supporting
- a restricted form of data base procedures in a DBMS allows complex
- data base problems to be easily and naturally addressed. In
- particular, we propose that a field in a data base be allowed to
- have a value which is a collection of commands
- in the query language supported by the DBMS (e.g. SQL [SORD84] or QUEL).
- .pp
- Our proposal should augment a field-oriented abstract data
- type (ADT) facility (e.g. [ONG84]). Such an ADT capability appears useful
- for supporting relatively simple objects which do not require
- shared subobjects (e.g. lines, points, complex numbers, etc.). On
- the other hand, data base procedures are attractive for more complex
- objects, possibly with shared subobjects (e.g. forms, icons, reports, etc.).
- .pp
- We begin in Section 2 by presenting the data definition facilities for
- procedural data along with several examples of the
- use of this construct.
- Then, in Section 3
- we review briefly how to extend one query language with
- necessary facilities to use procedures.
- Our choice is QUEL [STON76], but the
- extensions are easy to map into most other relational query languages.
- The definition of this language, QUEL+, is indicated in Section
- 3 and is based on suggestions in [STON84].
- Substantial changes to the query execution code
- of a data base system are required to process QUEL+. In Section 4 we indicate
- the changes that were necessary
- to support our constructs in the University of California version
- of INGRES [STON76]. Then, in Section 5 the performance of
- our prototype on several problems with complex data relationships
- is indicated. Lastly, Section 6 discusses ways in which
- the performance
- of the prototype could be improved.
- .sh 1 "DATA BASE PROCEDURES"
- .pp
- The motivation behind using procedures as full-fledged
- data base objects was to retain the ``spartan simplicity''
- of the relational model, while allowing it to address situations
- where it has been found inadequate.
- Such situations include
- generalization, aggregation, referential
- integrity, transitive closure, complex objects with shared subobjects,
- stored queries, and objects with unpredictable composition.
- The main advantage of our approach is that a single mechanism
- can address a large class of recognized deficiencies.
- We discuss the data definition capabilities of our proposal along
- with examples of its application to some of the above problems
- in the remainder of
- this section.
- .sh 2 "Objects with Unpredictable Composition"
- .pp
- The basic concept is that a field in a relation can
- have a value consisting of a collection of
- query language commands. Consider, for example, a conventional
- EMP relation with the requirement of storing
- data on
- the various hobbies of employees.
- Three relations containing hobby data might be:
- .(l
- SOFTBALL (emp-name, position, average)
- SAILING (emp-name, rating, boat-type, marina)
- JOGGING (emp-name, distance, best-time, shoe-type, number-of-races)
- .)l
- Each gives relevant data for a particular hobby. For example,
- Smith could be added as the catcher of the softball team by:
- .(l
- append to SOFTBALL (name = ``Smith'', position = ``catcher'', average = 0)
- .)l
- The desired form of the EMP relation would be:
- .(l
- create EMP (name = c10, age = i4, salary = f8, hobbies = procedure)
- .)l
- Then, for example, Smith could be added as an employee by:
- .(l
- append to EMP (
- name = ``Smith''
- age = 40
- salary = 10000,
- hobbies = ``retrieve (SOFTBALL.all)
- where SOFTBALL.name = ``Smith''''
- )
- .)l
- In this case, the first three values are conventional fields while
- the fourth is a field of data type ``collection of commands in the
- query language''. The value of this last field is obtained by
- executing the command (s) in the field.
- As such the ultimate value of each hobbies object is an arbitrary
- collection of records of arbitrary composition. A procedural field has
- the flexibility to model environments where there is no predetermined
- structure
- to objects. A second example of the need for procedural
- fields is indicated in the next subsection.
- .sh 2 "Stored Queries"
- .pp
- Most data base systems which preprocess commands in advance
- of execution (e.g. System R [ASTR76] and the IDM [EPST80]) store
- access plans or compiled
- code in the data base system. Such systems already manage a data base
- of compiled queries. Their implementations would become somewhat
- cleaner if data base commands became full-fledged data base objects.
- For example, the precompiler for a programming language
- could run a conventional APPEND command to insert
- a tuple into the following
- relation for each data base command found in a user program:
- .(l
- TODO (id, command)
- .)l
- Then, at run time the program would use the EXECUTE command to
- be introduced in Section 3:
- .(l
- execute (TODO.command) where TODO.id = value
- .)l
- To substitute parameters into such a command, one
- requires an additional operator ``with'' to specify:
- .(l
- execute (TODO.command with param-list) where TODO.id = value
- .)l
- In this way, the compile-time and run-time interfaces to the data base system
- are the same, resulting in a more compact implementation.(**)
- .(f
- ** Of course, authorization must be done for the above
- command to support access control.
- It would
- be beneficial to avoid reauthorizing a command each time it
- is executed from an application program. A mechanism to
- accomplish this task is beyond the scope of this paper.
- .)f
- Moreover, in Section 6 we discuss how to asynchronously
- build query processing plans
- for user commands between the time that the preprocessor inserts
- then in the TODO relation and the time that the user executes them. Hence,
- there is no performance penalty to our approach compared to current technology.
- In fact, our approach may well run faster because in Section 6
- we also
- propose caching the
- answers to commands as well as their execution plan.
- .pp
- A second use of stored queries is to support
- the definition of relational views.
- Each view can be stored as a row
- in a VIEW relation as follows:
- .(l
- VIEW (name, query)
- .)l
- Here, the retrieval command that defines the view can be stored in
- the ``query'' field while the name of the view is stored in
- the ``name'' field.
- The query modification facilities of [STON75] are needed to support
- the extensions that we propose to a query language in the next section;
- consequently, it will be seen that views require very little special case
- code if implemented as procedural fields.
- .pp
- Lastly, many applications require the ability to
- store algorithms made up
- of data base commands in the data base.
- An example of this kind
- of application is [KUNG84].
- Our proposal contains exactly the facilities
- needed in such environments.
- .sh 2 "Complex Objects with Shared Subobjects"
- .pp
- Another example where procedures are helpful is in
- modeling of complex objects.
- Suppose an
- object is composed of text, line segments, and polygons and
- is represented in the following relations:
- .(l
- OBJECT (Oid, text, shape)
- LINE (Lid, l-desc)
- TEXT (Tid, t-desc)
- POLYGON (Pid, p-desc)
- .)l
- Subcomponents of objects would be inserted
- into the LINE, TEXT or POLYGON relation,
- and we assume that l-desc and p-desc are of
- type ``line'' and ``point'' respectively and utilize a field-oriented
- ADT facility (e.g. [ONG84]).
- For example:
- .(l
- append to LINE (Lid = 22,
- l-desc = ``(0,0) (14,28)'')
- append to POLYGON (Pid = 44,
- p-desc = ``(1,10) (14,22) (6,19) (12,22)'')
- append to TEXT (Tid = 16,
- t-desc = ``the fox jumped over the log'')
- .)l
- Then, the ``text'' and ``shape'' fields of
- OBJECT would be of type procedure, and each tuple in OBJECT would
- contain queries to assemble a specific object from pieces stored in
- the other relations.
- For example,
- the following query would make object 6 be
- composed of all line segments with identifiers less
- than 20,
- polygon 44, and the first 9 text fragments.
- .(l
- append to OBJECT( Oid = 6,
- shape = ``retrieve (LINE.all) where LINE.Lid < 20
- retrieve (POLYGON.all) where POLYGON.Pid = 44'',
- text = ``retrieve (TEXT.all) where TEXT.Tid < 10'')
- .)l
- Notice that sharing is easily accomplished by inserting
- queries into multiple ``shape'' or ``text'' fields which reference the same
- subobject.
- .pp
- Additional examples of complex objects
- include forms (such as found in a system like FADS [ROWE82]),
- icons, reports, and complex geographic objects (e.g. a plumbing
- fixture which makes a right angle bend).
- .pp
- When objects can have a variety
- of subobjects and those subobjects can be shared, most contemporary
- modelling ideas are flawed. For example, the proposal of [HASK82, LORI83] does
- not easily allow shared subobjects. Semantic
- data models (e.g. [HAMM81, MYLO80,
- SHIP81, SMIT77, ZANI83]) lack
- the flexibility to deal with uncertain structure.
- The proposal of [COPE84] allows sharing by storing subobjects
- as separate records and connecting them with
- pointer chains. Our sharing is accomplished without
- requiring a specialized low level storage manager, and
- we will show in Section 6 how caching can be
- used to make performance competitive with pointer
- based proposals.
- .sh 2 "Generalizations to Arbitrary Procedures"
- .pp
- Our proposal should be easily generalizable to procedures
- written in a general purpose programming language.
- An example that can utilize more general procedures is a graphics
- application that wishes to store icons in the data base (e.g. [KALA85]).
- Icons should be stored in human readable form, so their description
- can be browsed
- easily. However, display software requires icons to be
- converted into a display list
- for a particular graphics terminal. An icon could be
- a complex object, and its components assembled
- by a query. However, the components must then be turned
- into a display list by a procedure in a general
- purpose programming language which appears in an application
- program. Efficiency can be gained by
- caching icons as noted in Section 6; however, further
- efficiency results from caching the actual display list.
- Such a capability requires general procedures rather than just
- data base procedures.
- .pp
- A second example of the need for general procedures
- is in the support for
- extended data type proposals (e.g. [ONG84])
- They require user-defined
- procedures to implement new operators. Such procedures must
- be called by the DBMS as appropriate, and it
- would be more natural if they
- were full fledged data base objects.
- .pp
- A last example of the use of general procedures would be in
- the system catalogs of a
- typical
- relational data base system where the following two relations appear.
- .(l
- RELATION (relation-name, owner, ...)
- ATTRIBUTE (relation-name, attribute-name, position, data-type, ...)
- .)l
- Whenever a relation with N attributes is ``opened'', a ``descriptor'' must be
- built by accessing one
- tuple in RELATION
- plus N tuples from the
- ATTRIBUTE relation.
- In order to allow ``browsing'' of the system catalogs, it is desirable
- to store the catalogs in the above fashion; however the penalty
- is the lengthy time required to open a relation.
- .pp
- An alternate solution is to
- add a procedural field
- to RELATION, e.g:
- .(l
- RELATION (relation-name, owner, ... , descriptor)
- .)l
- The ``descriptor'' field contains queries
- to retrieve the appropriate
- tuples from the ATTRIBUTE relation and the current tuple from the
- RELATION relation.
- These queries are surrounded by code in a general purpose programming
- language to build the actual descriptor in the format desired by the run
- time system.
- .pp
- In Section 6 we will discuss a technique that allows the value
- for a procedural field to be cached in the field itself. If this
- is accomplished, then the N accesses to the ATTRIBUTE relation
- are avoided, and the descriptor can be accessed directly from the
- RELATION relation.
- Writes to
- tuples in the ATTRIBUTE relation
- which make up an object (an infrequent event) will cause
- the cached value to be invalidated as explained in Section 6.
- The next time a relation is opened, the contents of the cached
- value must be reassembled.
- .pp
- Alternate implementations of complex objects (e.g. [COPE84]) store
- subobjects as individual records. Hence, pointers must
- be followed to assemble
- a composite object. Sophisticated clustering will be required to
- avoid extra disk reads in this environment. Moreover, if subobjects
- are shared,
- it
- will be impossible to guarantee clustering. Our caching implementation
- should offer superior performance to one based on pointers
- when updates are infrequent.
- It should be noted, however, that our caching idea can be applied to
- any DBMS to improve performance. Hence, a pointer based
- DBMS that also implemented
- caching might be an attractive alternative.
- .pp
- We now turn to a special case of procedural data types
- and indicate
- its utility.
- .sh 2 "Referential Integrity"
- .pp
- Consider the standard EMP and DEPT example as follows:
- .(l
- EMP (name, age, salary, dept)
- DEPT (dname, floor, budget)
- .)l
- Here, one often wants to guarantee that the values that occur in the
- column ``dept'' of EMP are a subset of the values that occur in
- the field ``dname'' in DEPT. This concept has been
- termed
- .b referential
- .b integrity
- in
- [DATE81] and occurs because ``dept'' is, in effect, a pointer to a tuple
- in DEPT and is represented by a foreign key.
- .pp
- Procedural data can alleviate the need for special case syntax
- and implementation code to support referential integrity
- in the following way. Suppose
- the ``dept'' field for each employee in the EMP relation
- contains the following procedure:
- .(l
- retrieve (DEPT.all) where DEPT.dname = ``the-appropriate-dept''
- .)l
- In this case the following semantics are automatically enforced.
- Whenever, an
- employee is hired and assigned to a non-existent department, then the
- procedure in the ``dept'' field evaluates to null, and the employee
- is effectively placed in the
- null department. Moreover, whenever a department is deleted from the
- DEPT relation, then all employees who were previously in that department
- now have a procedural field which evaluates to null and are thereby placed
- in the null department. Although [DATE81] has
- several other options, procedural data captures the main thrust of that
- proposal.
- .pp
- Notice that all fields in the ``dept'' column have the same basic query
- as their value, differing only in the constant used in the qualification.
- Consider an
- implementation of this special case whereby the parameterized
- command(s) is stored in the system catalogs and only the
- parameter(s) stored in the field itself.
- Hence, in the example above, only the
- department name of the employee's department would appear in the
- field ``dept'', while the remainder of the query:
- .(l
- retrieve (DEPT.all) where DEPT.dname = parameter-1
- .)l
- would appear in the system catalogs.
- Moreover, an update to the ``dept'' field would only need to
- specify the parameter and not the entire query, e.g:
- .(l
- append to EMP (name = ``Joe'', age = 25, salary = 10000, dept = ``shoe'')
- .)l
- .pp
- To specify this special case syntactically, one
- could proceed in two steps. First,
- one could
- .b register
- the procedure containing the parameter(s) with the data manager and give it
- some internal name, say DEPARTMENT, with the following command:
- .(l
- define DEPARTMENT as retrieve (DEPT.all) where DEPT.dname = parameter-1
- .)l
- Then, one
- could create the EMP relation as:
- .(l
- create EMP (name = c10, age = i4, salary = f8, dept = DEPARTMENT)
- .)l
- Alternatively, one could avoid the registration step for commonly
- used procedures such as the one above by accepting the following syntax:
- .(l
- create EMP (name = c10, age = i4, salary = f8, dept = DEPT[dname])
- .)l
- The syntactic token DEPT[dname] signifies that the
- procedure
- .(l
- retrieve (DEPT.all) where DEPT.dname = parameter-1
- .)l
- should be automatically defined and associated with the ``dept'' field.
- .pp
- The data type ``pointer to a tuple'' suggested in [POWE83, ZANI83]
- can be effectively supported by another special case. Suppose each
- relation automatically contains a unique identifier (UID), a feature
- commonly requested in some environments. Moreover, suppose
- in the syntax:
- .(l
- create EMP (name = c10, age = i4, salary = f8, dept = DEPT)
- .)l
- the DEPT token is automatically associated with the query:
- .(l
- retrieve (DEPT.all) where DEPT.UID = parameter-1
- .)l
- In this way procedures can be used to support the capability that a
- field in one relation can be a uniquely identified tuple in another
- relation.
- .sh 2 "Aggregation and Generalization"
- .pp
- Procedural fields can support both generalization and
- aggregation as proposed in [SMIT77].
- For example, consider:
- .(l
- PEOPLE (name, phone#)
- .)l
- where phone# is
- of type procedure and is an aggregate for the more detailed
- values area-code, exchange and number. As such, the following parameterized
- procedure can be used for the phone# field:
- .(l
- retrieve (area-code = parameter-1, exchange = parameter-2, number = parameter-3)
- .)l
- A simple append to PEOPLE might be:
- .(l
- append to PEOPLE (name = ``Fred'', phone# = ``415-841-3461'')
- .)l
- Here, ``-'' is the assumed separator between the values of the three parameters.
- .pp
- Generalization is also easy to support. If all employees
- have exactly one hobby, then the hobbies field in the example EMP
- relation from Section 2.1 will specify a simple generalization hierarchy. In fact,
- our example use of hobbies
- supports a generalization hierarchy with members
- which can be in several of the subcategories at once.
- .sh 2 "Summary"
- .pp
- In summary, data base procedures are a high leverage construct.
- Not only can they be used to simulate a variety of semantic
- data modelling ideas such as generalization and aggregation, but also
- they can be used to support objects that have unpredictable composition
- and shared subobjects. In addition, they are useful in simplifying the
- design of current relational systems by allowing a more uniform treatment of
- compiled queries and views. Lastly, support for procedures written in an arbitrary
- programming language is a natural and valuable extension, and a
- preliminary proposal in this direction appears in [STON86]. Hence, a single
- construct is useful in a wide variety of circumstances.
- .sh 1 "THE QUERY LANGUAGE, QUEL+"
- .pp
- In order to make procedures a useful construct, several extensions
- must be made to QUEL and these are indicated in the next several
- subsections. This language, QUEL+, contains slight modifications to
- the facilities proposed in [STON84], and a concise summary of its
- extensions to QUEL appears in Appendix 1.
- .sh 2 "Execution of the Data"
- .pp
- A procedural field can be interpreted in two ways, namely it has
- a
- .b definition
- which is the QUEL code in the field
- and a
- .b value
- which is obtained by executing the QUEL commands.
- Since a user needs to gain access to both representations, we use the
- convention that a normal retrieval returns the definition. For example, the
- query:
- .(l
- retrieve (EMP.hobbies) where EMP.name = ``Smith''
- .)l
- will return a collection of QUEL commands. Execution of a procedural
- field is accomplished by an additional
- QUEL+
- command which allows one to execute data in the
- data base.
- For example, one
- can find all the hobby data for Smith
- by running the
- following command:
- .(l
- execute (EMP.hobbies) where EMP.name = ``Smith''
- .)l
- This command will search for qualifying tuples
- and then execute the contents of
- the hobbies field.
- .pp
- Two points should be noted about the above command. First,
- notice that a user program must be prepared to accept the
- tuples returned from the above query.
- Since the composition of these tuples may vary
- from tuple to tuple,
- the run time system must send output to an application
- program using a more complex format than often used currently. In
- particular, each tuple must either be self-describing
- or a tuple descriptor must be sent to the application which
- describes all subsequent tuples until a new descriptor is sent.
- Run time support code in the application program must be prepared
- to accept this more complex format and deal with the more complex
- buffering and communication with variables in an
- application program that
- this entails.
- Second,
- a user must note which fields contain procedural data,
- since retrieving a procedural field does not yield the ultimate data value.
- We considered
- automatic evaluation of procedural fields, but this option requires
- a second operator to ``unevaluate'' the procedure and seemed no more
- user-friendly.
- Also, it would have required the application program to accept
- unnormalized relations. For example, automatic evaluation
- of procedural fields for the query:
- .(l
- retrieve (EMP.name, EMP.hobbies) where EMP.age > 35
- .)l
- would yield an unnormalized relation as a result.
- .pp
- In some applications, it is desirable to execute only one
- of a collection of qualifying tuples. The following command
- will execute the hobby description for one
- employee over 70.
- .(l
- execute-one (EMP.hobbies) where EMP.age > 70
- .)l
- The intent of this command is that query processing heuristics along the lines
- of [SELI79] would be run on each candidate hobby description. The
- one with the expected least cost would be selected for execution.
- The
- use of this construct in a particular expert system
- application is discussed in [KUNG84].
- .sh 2 "Multiple-Dot Notation"
- .pp
- Our second extension to QUEL allows the
- components of a complex object to be addressed directly.
- For example, one could retrieve the batting average of Smith as follows:
- .(l
- retrieve (EMP.hobbies.average) where EMP.name = ``Smith''
- .)l
- This
- .b multiple-dot
- notation has many points in common with the data manipulation
- language GEM [ZANI83],
- and allows one to conveniently access subsets of components
- of complex objects.
- More exactly, QUEL+ allows an indirectly referenced column name
- of the form:
- .(l
- relation.column-name-1.column-name-2 ... column-name-n
- .)l
- wherever
- a normal column name:
- .(l
- relation.column-name
- .)l
- is allowed in QUEL. The only restriction
- is that ``column-name-i'' must be a procedural data type
- for 1 <= i < n-1. Moreover,
- column-name-(i+1) is a column
- in any relation specified by a RETRIEVE
- command contained in the field specified by
- column-name-i.
- Of course, the same construct is allowed for relation surrogates
- (tuple variables).
- .pp
- The above QUEL+ command returns the average of Smith for any hobby
- that has a field with name ``average''. Since there may be
- several hobbies with this field defined, one requires a notation
- to restrict the average only to the SOFTBALL relation. This is
- easily accomplished with another operator, i.e:
- .(l
- retrieve (EMP.hobbies.average)
- where EMP.name = ``Smith''
- and EMP.hobbies.average in SOFTBALL
- .)l
- Here ``in'' expects an indirectly referenced
- column name as the left operand and a relation name as
- the right operand and returns true only if the column is in the
- indicated relation.
- Additional operators associated with procedural objects may be appropriate
- and will be added to QUEL+ as a need arises.
- .sh 2 "Extended Scoping"
- .pp
- To change the position of Smith from catcher to outfield,
- one could make a direct
- update to the SOFTBALL relation. However, it is sometimes
- cleaner to allow the update to be made through the EMP
- relation as follows:
- .(l
- replace EMP.hobbies (position = ``outfield'') where EMP.name = ``Smith''
- .)l
- The desired construct is that a procedural field (in this
- case EMP.hobbies) can appear as the target of a
- DELETE, REPLACE or APPEND command.
- In general, this procedural field is identified by an arbitrary multiple-dot
- expression of the form discussed in the previous section, and we term
- this expression the
- .b scope
- of the update.
- .pp
- The semantics of an
- .b extended
- .b scope
- command
- are that the RETRIEVE commands in
- the procedural field used as the target of the update command
- define conventional
- relational views.
- Once a specific instance of such a procedural field has been identified,
- for each view, Vi, associated with a RETRIEVE command, Ri, one need
- only replace the
- the update scope by Vi
- in every place it appears in the user command, and then
- standard query modification [STON75] using Ri should
- be performed on the qualification and the target list of the
- resulting user's command.
- .pp
- For example, if Smith's ``EMP.hobbies'' field contains the single
- query:
- .(l
- retrieve (SOFTBALL.all) where SOFTBALL.name = ``Smith''
- .)l
- then the above command to move Smith to the outfield will have the form
- .(l
- replace EMP.hobbies (position = ``outfield'')
- .)l
- once the clause
- .(l
- where EMP.name = ``Smith''
- .)l
- has been evaluated to identify a specific ``EMP.hobbies'' value.
- Hence, this query is turned into:
- .(l
- replace V1 (position = ``outfield'')
- .)l
- and then query modification converts it to:
- .(l
- replace SOFTBALL (position = ``outfield'') where SOFTBALL.name = ``Smith''
- .)l
- .pp
- Notice that this construct allows a very simple means for supporting relational
- views. If the definition of each view appears in the VIEW relation as suggested
- in the previous section, e.g:
- .(l
- VIEW (name, query)
- .)l
- then any command involving a view, V, need only be modified to replace every
- reference to V with VIEW.query and then the clause
- .(l
- VIEW.name = V
- .)l
- must be added to the qualification. The resulting command will be one
- containing multiple-dot clauses and extended scoping statements and
- can be executed
- as a conventional QUEL+ command.
- .sh 2 "Extended Scoping with Tuple Variables"
- .pp
- In addition to allowing the above construct, QUEL+ also allows
- a tuple variable to be used whenever a relation name or a field
- of type QUEL is permissible.
- Hence, the example above can also be expressed as:
- .(l
- range of e is EMP.hobbies
- replace e (position = ``outfield'') where EMP.name = ``Smith''
- .)l
- .sh 2 "Relation Level Operators"
- .pp
- In addition, QUEL+ supports relation level operators, including
- union, intersection, outer join, natural join, containment and a test
- for emptiness.
- We illustrate the use of this construct with an example from the
- previous section where objects were made up of lines, polygons, and
- text fragments.
- In this situation, one might want to find
- all pairs of objects, one of which
- contains all
- the shapes in the other. This would be formulated as:
- .(l
- range of o is OBJECT
- range of o1 is OBJECT
- retrieve (o.Oid, o1.Oid) where o.shape >> o1.shape
- .)l
- Here, the containment operator >>, accepts two procedural operands
- and returns true if the relation specified by the
- procedure in the left operand includes the relation specified by the
- procedure in the right operand.
- The relation on the left is found by
- constructing the outer union defined by the
- RETRIEVE
- commands in o.shape.
- If all commands have identical target lists, then the outer union is
- the same as a normal union. Otherwise, it is formed by constructing a
- relation with all columns appearing in any command, filling each
- target list with nulls to be the full width of the composite relation, and
- then performing a normal union.
- This resulting relation must be
- compared for set inclusion with the relation to which
- o1.shape
- evaluates.
- Our initial collection of operators is indicated in Table 1.
- .(z
- .(c
- .TS
- |c|c|
- |l|l|.
- Operator Function
- _ _
- U union
- !! intersection
- >> containment
- << containment
- == equality
- <> inequality
- JJ natural join on all common column names
- OJ outer (natural) join
- empty emptyness
- _ _
- .TE
- .)c
-
- .ce
- Relation Level Operators
-
- .ce
- Table 1
- .)z
- .sh 1 "PROCESSING QUEL+"
- .pp
- The purpose of this section is to explain how our existing
- prototype executes QUEL+ commands.
- This prototype supports the complete language noted in the previous section
- with the exception of execute-one and extended scoping statements.
- Moreover, it only implements general QUEL procedural fields. The
- optimization routines to support the special case that all queries in a
- given column differ only by a collection of parameters have not yet
- been implemented.
- .pp
- Although more sophisticated
- query processing algorithms have been constructed [SELI79, KOOI82],
- our implementation
- builds on the original INGRES strategy [WONG76]. The implementation
- of QUEL+ has been accomplished using this code because it
- is readily available for experimentation.
- Integration of our constructs into more advanced optimizers
- appears straightforward, and we discuss this point again at
- the end of this section.
- .pp
- Figure 1 shows a diagram of the extended decomposition process.
- .(z
- .hl
- .nf
- |-----------------
- | |
- | V
- | -------------------------------
- | | extract and process one |
- | | variable clauses which |
- | | do not contain relation |
- | | level or multiple-dot |
- | | operators |
- | -------------------------------
- | |
- | V
- | -----------------------
- | | apply reduction |
- | | algorithm |
- | -----------------------
- | |
- | V
- | --------------------------- ---------------------------
- | | is the qualification | yes | are there relations |
- | | free of variables? |----------->| to materialize? |
- | --------------------------- ---------------------------
- | no | yes | | no
- | | -------------| |
- | | | |
- | V V V
- | --------------------- ----------------- ----------------------
- | | do tuple | | materialize a | | pass to extended |
- | | substitution | | relation | | OVQP for relation |
- | --------------------- ----------------- | level operator |
- | | | | evaluation |
- | | | --------------------
- | | | |
- | V V V
- ----------------------------------------------------------------------------------
-
- .ce
- Extended Decomposition
-
- .ce
- Figure 1
- .fi
- .hl
- .)z
- Detachment of one-variable queries that do not
- contain multiple-dot or relation level
- operators can proceed as in the original
- INGRES algorithms [WONG76]. Similarly, the reduction module of decomposition
- is unaffected by our extensions to QUEL.
- In addition, tuple substitution is performed when all
- other processing steps fail.
- A glance at the left hand column of Figure 1 indicates that a test
- for zero variables must be inserted into the original flow of control
- after the reduction module. Then, new facilities must be included
- to process the ``yes'' branch of the test. These include a test
- for whether there is a relation to materialize and
- the code to perform this step.
- Lastly,
- the one-variable query
- processor must be extended to process relation level operators. We
- explain these extensions with a detailed example.
- .pp
- The desired task is to find the polygon descriptions
- with identifiers less than 5 for all objects which have the
- same collection of shapes as the
- complex object with Oid equal to 10,
- i.e:
- .(l
- range of o is OBJECT
- range of o1 is OBJECT
- retrieve (o.shape.p-desc)
- where o.shape.Pid < 5
- and o.shape == o1.shape
- and o1.Oid = 10
- .)l
- In the initial step of the reduction process
- the last clause in the query is found to have a single variable, so
- it can be executed as:
- .(l
- retrieve into TEMP-1 (o1.shape) where o1.Oid = 10
- .)l
- The original query is now:
- .(l
- retrieve (o.shape.p-desc) where o.shape.Pid < 5 and o.shape == TEMP-1.shape
- .)l
- The first clause above contains a multiple-dot attribute and should
- not be processed until later.
- At this point reduction fails and the query still has two
- variables in it, so processing falls through to the tuple
- substitution module. If TEMP-1 is selected for substitution, the
- resulting query is:
- .(l
- retrieve (o.shape.p-desc)
- where o.shape.Pid < 5
- and o.shape == ``QUEL-constant-1''
- .)l
- Notice that the variable ``TEMP-1.shape'' has been replaced by a
- constant ``QUEL-constant-1'' which is a collection of QUEL commands.
- Processing now returns to the top of the loop where the query
- still does not have any one-variable clauses. Processing again returns
- to tuple substitution where the variable o might be chosen. This
- results in the query:
- .(l
- retrieve (``QUEL-constant-2''.p-desc)
- where ``QUEL-constant-2''.Pid < 5
- and ``QUEL-constant-3'' == ``QUEL-constant-1''
- .)l
- Notice that o.shape has been replaced by two constants ``QUEL-constant-2''
- and ``QUEL-constant-3'' which are identical. When o.shape
- is materialized, there will be a one-relation clause (o.shape.Pid < 5)
- that
- can be used to restrict and project the relation.
- Moreover, it is desirable to check this clause as early as possible
- because the current query will have no answer if this clause is false.
- On the other hand, o.shape must be retained as a complete object
- so that the
- the relation level comparison with QUEL-constant-1 can be performed
- if necessary. In order to avoid forcing the relation level operator
- to be executed first, we have duplicated the QUEL constant and
- thereby retained the option of
- performing the one-variable restriction first.
- Even though QUEL-constant-2 and QUEL-constant-3 define the same object,
- the caching discussed in Section 6
- should avoid materializing this object more than
- once.
- .pp
- Now the command has zero variables and is passed to the
- materialize module. This processing step chooses one of the
- QUEL constants and materializes the outer-union of the RETRIEVE
- commands into a relation TEMP-2. If ``QUEL-constant-2''
- is chosen, then the resulting query will be:
- .(l
- retrieve (TEMP-2.p-desc) where
- TEMP-2.Pid < 5
- and ``QUEL-constant-3'' == ``QUEL-constant-1''
- .)l
- This query now has a one-variable clause which can be
- detached and processed creating another temporary
- relation TEMP-3.
- If TEMP-3 is empty then
- the query is false and
- can be terminated. Alternately, processing must
- continue on the following command:
- .(l
- retrieve (TEMP-3.p-desc) where ``QUEL-constant-3'' == ``QUEL-constant-1''
- .)l
- The qualification is again free from variables, so another relation
- must be materialized. If ``QUEL-constant-1'' is chosen, we obtain:
- .(l
- retrieve (TEMP-3.p-desc) where ``QUEL-constant-3'' == TEMP-4
- .)l
- The qualification is still free from variables, so the final
- relation must be materialized as follows:
- .(l
- retrieve (TEMP-3.p-desc) where TEMP-5 == TEMP-4
- .)l
- After another trip around the processing loop, no further materialization is
- possible. Hence, the query must now be passed to the one-variable query
- processor. This module will process the operator == for the two
- relations involved.
- .pp
- Several comments are appropriate at this time. First, this algorithm delays
- materializing a relation until there is no conventional
- processing to do. In addition, it delays evaluating relation
- level operators until there is nothing else to do. This reflects
- our belief that expensive operations should never be done until
- absolutely necessary.
- The current prototype only materializes a procedural
- field if the desired columns
- actually appear in the result. This tactic avoids
- obviously unnecessary materializations.
- However, no attempt has been made to materialize only a subset of a procedural
- object by using qualification in the user command to advantage. For
- example, only the tuples where Pid < 5 could have been materialized
- from the query in ``QUEL-constant-2'' by modifying the qualification. Such
- restricted materializations would not allow the caching that we have in
- mind, and we did not consider them. A more sophisticated query planner
- would try to optimize the decision of
- whether to materialize the value of the whole procedural object
- or a qualified subset.
- .pp
- Second, most current optimizers build a complete query plan in advance
- of executing the command. Such optimizers (e.g. [SELI79, KOOI82]) can
- construct a plan for the portion of the query without nested dot constructs.
- However, run-time planning will be required on remaining portions
- of commands.
- For example,
- the following query must be processed by tuple substitution
- for o or o1.
- .(l
- retrieve (o.shape.p-desc, o1.shape.p-desc) where o.shape.l-desc = o1.shape.l-desc
- .)l
- After substitution twice, the remaining query is:
- .(l
- retrieve (TEMP-1.p-desc, TEMP-2.p-desc) where TEMP-1.l-desc = TEMP-2.l-desc
- .)l
- The characteristics of TEMP-1 and TEMP-2 are not known until
- run time, so further query planning must be deferred to
- this time.
- .pp
- The only exception to run time planning would occur if all values
- in a procedural column contain the same query as discussed in Section 2.
- In this situation, a view translation
- algorithm can be run on the initial user
- command instead of applying the algorithm of this section.
- The algorithm is similar to the one presented in [STON75]
- and would translate a multiple-dot query
- into a conventional query which can be optimized in the
- conventional fashion.
- This ``flattening'' of a query
- will allow a compile time plan to be built and additionally
- will support a wide range of query processing alternatives to
- be explored, rather than just the ``outside-to-inside'' strategy
- discussed in this section.
- The details of this algorithm
- are straight-forward and are omitted for the sake of brevity.
- .pp
- Lastly, in our prototype
- the module that materializes a relation passes the RETRIEVE commands
- to another process which also runs the INGRES+ code. This second
- INGRES+ executes the command, stores the resulting
- relation in the data base, and then passes control back to the
- first INGRES+. A second process is required because the INGRES
- code will not allow a command to
- suspend in the middle of the decomposition process so that
- a new command can be executed. The ability to ``stack'' the
- execution state of a query would be a very desirable addition
- to the system.
- .sh 1 "BENCHMARK RESULTS"
- .pp
- It would be clearly desirable to compare the performance of
- INGRES+ against various other approaches to object management. These
- could include using a conventional relational system
- as well as prototypes with other capabilities
- (e.g. [COPE84, LORI83]).
- Only a conventional relational system was easily available in
- our environment as a test case. Hence,
- a more detailed performance study is left as a future exercise
- and would require the acquisition of appropriate hardware to run other
- prototypes.
- .pp
- In this section we describe a collection of benchmarks
- which we performed on our prototype.
- We modeled three different tasks using QUEL+ and then compared
- them to a conventional relational system, namely INGRES [STON76].
- In all cases
- we chose tasks which would result in different
- queries in the two systems.
- Running the same command in both systems would clearly result in equal
- performance.
- In all tests recovery and concurrency control
- has been turned off, and CPU time and total elapsed time
- in a single user environment were tabulated.
- For
- convenience, INGRES numbers are normalized to 1 while INGRES+
- numbers are given as a multiple of the corresponding
- INGRES result. All
- tests are run on a single-user VAX 11/780.
- .pp
- Both systems contain substantial inefficiencies (e.g. run-time optimization,
- generation of an excessive number of temporary relations). However,
- it appears that such problems penalize both systems about equally.
- Only three issues excessively penalize INGRES+. First, the
- unnecessary communication with a second INGRES+ task adds unnecessary
- overhead that could be eliminated in a commercial implementation.
- Second, in many cases INGRES will be seen to execute a single
- two-variable query while INGRES+ runs a larger number of one-variable
- commands. Run time query planning of a larger number of commands
- imposes an excessive penalty on INGRES+.
- Lastly, the flattening of parameterized procedural fields has not
- yet been implemented in INGRES+. Consequently, execution of muliple-dot
- queries is constrained by the structure of the query, and
- an inefficient plan may be executed as a result.
- Hence, a compiled query implementation of QUEL+ which included parameterized
- procedual fields
- should yield results similar to or more
- favorable toward INGRES+ than those we present.
- .pp
- The three experiments are discussed in the following
- subsections.
- .sh 2 "Simulation of Simple Complex Objects"
- .pp
- This experiment involves accessing simple variant records
- corresponding to hobbies in the EMP relation of Section 2. Each of 7000
- employees has
- a collection of hobbies.
- From a total of 50 possible hobbies, each employee practices
- between one and eight.
- .pp
- Both an INGRES and an INGRES+ data base must store records on
- each of the 50 hobbies in relations:
- .(l
- SOFTBALL (emp-name, other data)
- SAILING (emp-name, other data)
- JOGGING (emp-name, other data)
- .
- .
- .
- .)l
- A normal DBMS would store in addition the relations:
- .(l
- EMP(name, age, salary)
- HOBBIES(emp-name, hobby-name)
- .)l
- while an INGRES+ data base would only require a single relation:
- .(l
- EMP(name, age, salary, hobbies)
- .)l
- The field ``hobbies'' has a collection of queries, one per
- hobby as noted in Section 2.
- .pp
- The task is to find the information
- on all hobbies
- for a given employee and is expressed in INGRES+ as follows:
- .(l
- execute (EMP.hobbies) where EMP.name = ``unique-emp''
- .)l
- A normal DBMS query language cannot express
- this task, and the most reasonable option is to execute the
- following algorithm.
- .(l
- retrieve (HOBBIES.hobby-name) where HOBBIES.emp-name = ``unique-emp''
- for each such hobby-name do
- retrieve (hobby-name.all) where hobby-name.emp-name = ``unique-emp''
- end-do
- .)l
- Table 2 indicates a performance comparison for
- various numbers of hobbies per employee. The INGRES+ numbers result from
- running the above execute command while the INGRES number were obtained
- using the terminal monitor to
- retrieve the hobbies for a given employee and then
- executing the appropriate number of retrieve commands to obtain
- hobby data. This would simulate a user who ran a query to
- obtain the collection of hobbies and then ran the correct collection
- of queries on the various hobbies relations.
- Notice that the INGRES+ option is superior except when there
- is a single hobby per employee.
- This performance difference results from the fact that a large
- number of queries are passed
- through the INGRES terminal monitor,
- which has noticeable
- overhead. The INGRES+ solution runs the same collection of queries,
- but the hobby queries are generated internally by the system and
- do not go through a terminal monitor.
- .(z
- .hl
- .(c
- .TS
- l|l|l|l|l
- l|l|l|l|l
- l|n|n|n|n.
- Query INGRES-CPU INGRES+-CPU INGRES-total INGRES+-total
- _ _ _ _ _
- one-hobby 1 1.23 1 1.28
- four-hobby 1 .73 1 .61
- eight-hobby 1 .59 1 .61
- .TE
- .)c
-
- .ce
- A Benchmark of Simple Complex Objects
-
- .ce
- Table 2
- .hl
- .)z
- .sh 2 "Simulation of More Complex Objects With Shared Subobjects"
- .pp
- Consider the example from Section 2 where complex objects are
- composed of
- lines, text and polygons. Moreover, assume that these subobjects
- must be shared among various complex objects.
- Consequently, both schemas have relations for the subobjects
- as follows:
- .(l
- LINE (Lid, l-desc)
- TEXT (Tid, t-desc)
- POLYGON (Pid, p-desc)
- .)l
- Then, a normal schema must have three additional relations
- indicating which subobjects are in which complex objects:
- .(l
- T-obj (Tid, Oid)
- P-obj (Pid, Oid)
- L-obj (Lid, Oid)
- .)l
- However, an INGRES+ schema needs only the single additional
- relation from Section 2, namely:
- .(l
- OBJECT (Oid, trim, shape)
- .)l
- The trim field in OBJECT is a collection of queries of the form:
- .(l
- retrieve (TEXT.all) where TEXT.Tid = value
- .)l
- while the shape field has a collection of queries of the form:
- .(l
- retrieve (LINE.all) where LINE.Lid = value
- retrieve (POLYGON.all) where POLYGON.Pid = value
- .)l
- .pp
- The benchmark query is to find
- the shapes of a particular complex object, e.g:
- .(l
- retrieve (POLYGON.all)
- where POLYGON.Pid = P-obj.Pid
- and P-obj.Oid = ``unique-value''
-
- retrieve (LINE.all)
- where LINE.Lid = L-obj.Lid
- and L-obj.Oid = ``unique-value''
- .)l
- The QUEL+ query is simply:
- .(l
- execute (OBJECT.shapes) where OBJECT.Oid = ``unique-value''
- .)l
- .pp
- The INGRES+ prototype limits the length of procedural fields to 255
- bytes (about 9 queries); hence, multiple rows are required
- to express an object with a larger number of subobjects. This
- limitation affects INGRES+ performance marginally.
- The costs of the two systems for objects
- having respectively
- 1, 4, 8, 16 and 32 subobjects is shown in Table 3.
- .(z
- .hl
- .(c
- .TS
- l|l|l|l|l
- l|l|l|l|l
- l|n|n|n|n.
- Query INGRES-CPU INGRES+-CPU INGRES-total INGRES+-total
- _ _ _ _ _
- Q1 1 .54 1 .67
- Q4 1 .75 1 .78
- Q8 1 .99 1 1.0
- Q16 1 1.35 1 1.44
- Q32 1 2.17 1 2.94
- .TE
- .)c
-
- .ce
- Benchmarks of Complex Objects
-
- .ce
- Table 3
- .hl
- .)z
- INGRES must run 2 two-variable
- queries while INGRES+ runs a single query on the OBJECT relation
- followed by a one-relation query per subobject.
- If there is only one subobject, it is clear that INGRES+ will run 2
- one-variable queries and have better performance than INGRES. This
- performance advantage deteriorates until there are 16 subobjects
- at which point
- 17 one-variable queries take more time than 2 two-variable queries.
- In a commercial implementation, two-variables queries would be better
- optimized and the crossover point might occur at a lower number of
- subobjects. On the other hand, run-time optimization of 17 commands
- in INGRES+ is a serious source of overhead which would not be
- present in a commercial system. Hence, it is not clear how Table 3
- would look in a commercial environment.
- .sh 2 "Simulation of Unnormalized Relations"
- .pp
- Although QUEL+ is most useful when applied to applications with
- complex structure, it is also possible to provide multiple-dot
- addressing on conventional data. This will allow
- a more natural query formulation
- compared to conventional techniques; however,
- much of the same effect can be alternately achieved using relational views.
- This section is included to demonstrate that QUEL+ provides
- reasonable performance even in ordinary situations.
- .pp
- The normal way to store data for the standard
- EMP, DEPT and JOB data base is:
- .(l
- EMP (name, age, salary, dept, jid)
- DEPT(dname, floor)
- JOB (jid, jname, benefits)
- .)l
- Here, employees have a name, an age, a salary, are in a
- department, and have a job identifier. The other two relations
- are self-evident. We assume that there are 7000 employees, 500
- departments and 50 job descriptions. Moreover, EMP tuples are
- 32 bytes wide, DEPT tuples are 14, and JOB tuples are 24.
- .pp
- On the other hand, in INGRES+ one can use an alternate schema
- as follows:
- .(l
- EMP (name, age, salary, dept, j-emp)
- DEPT(dname, floor, d-emp)
- JOB (jid, jname, benefits)
- .)l
- Here, j-emp is a procedural field of the form:
- .(l
- retrieve (JOB.all) where JOB.jid = ``value-for-this-employee''
- .)l
- In addition, d-emp is a procedural field of the form:
- .(l
- retrieve (EMP.all) where EMP.dept = ``this-dept''
- .)l
- Consequently, all the employees in a specific department are accessible
- though the d-emp field while the job description of a particular
- employee can be obtained through the j-emp field.
- .pp
- We ran the following three queries in both INGRES and INGRES+
-
- Query 1: a normal join returning a few tuples
-
- The queries to run in the two systems are respectively:
- .(l
- retrieve (EMP.name, DEPT.floor)
- where EMP.dept = DEPT.dname
- and DEPT.dname = ``unique-name''
-
- retrieve (DEPT.d-emp.name, DEPT.floor)
- where DEPT.dname = ``unique-name''
- .)l
- In this case, we are comparing the processing speed of normal INGRES
- running a two variable query with that of INGRES+ which must
- execute a one-variable query and then a second one-variable subquery.
-
- Query 2: the full join
-
- The two queries are respectively:
- .(l
- retrieve (EMP.name, DEPT.floor)
- where EMP.dept = DEPT.dname
-
- retrieve (DEPT.d-emp.name, DEPT.floor)
- .)l
- In this case, we are computing the full join between EMP and DEPT. The
- comparison is between a single two variable query and a single one-variable
- query to scan the DEPT relation along with 500 one-variable queries to
- find appropriate information in EMP. This should be a
- poor query for INGRES+ because of the run-time optimization of 500
- queries.
- Moreover, because of the structure of the query, INGRES+ will iterate
- over DEPT tuples and then access the EMP relation for each one.
- This may (or may not) correspond to the plan which would be
- selected by a conventional optimizer using a flat representation
- of the query. If
- iterative substitution for DEPT tuples is not a wise plan, then INGRES+
- will have poor performance because of the structure of the query.
-
- Query 3: a three way join to find the job of a particular employee in a particular department
-
- The queries are:
- .(l
- retrieve (JOB.jname, EMP.name)
- where DEPT.dname = ``value-1''
- and EMP.dept = dept.dname
- and EMP.job = JOB.jid
- and EMP.name = ``value-2''
-
- retrieve (DEPT.d-emp.j-emp.jname, DEPT.d-emp.name)
- where DEPT.dname = ``value-1''
- and DEPT.d-emp.name = ``value-2''
- .)l
- Here, INGRES is running a single three variable query
- while INGRES+ will execute three one-variable queries. The
- extended system should be especially attractive in this case, because
- of the extra complexity required to process multivariable queries in
- a conventional system.
- .pp
- Table 4 presents the results for these three queries.
- .(l
- .hl
- .(c
- .TS
- l|l|l|l|l
- l|l|l|l|l
- l|n|n|n|n.
- Query INGRES-CPU INGRES+-CPU INGRES-total INGRES+-total
- _ _ _ _ _
- Query 1 1 1.37 1 1.15
- Query 2 1 15.1 1 17.2
- Query 3 1 .29 1 .36
- .TE
- .)c
-
- .ce
- Benchmarks of Unnormalized Relations
-
- .ce
- Table 4
- .hl
- .)l
- As can be seen, Query 1 performs at about the same speed in both
- systems. In this case two one-variable queries are comparable to
- a single two variable query.
- The full join was a factor 15-17 worse in INGRES+
- because the overhead of running 500 queries to retrieve EMP tuples is
- overwhelming.
- Finally, Query 3 shows that three one-variable commands are
- faster by a factor of 3-4 than a single three-variable command.
- Although one would expect superior performance from INGRES+, the
- magnitude is surprising and reflects the fact that the normal INGRES optimizer
- is not especially good at three way joins.
- .pp
- The conclusion to be drawn is that INGRES+ is competitive except when
- it utilizes a poor query plan or is forced to run a large number of commands.
- Bad performance in the latter situation
- should be eliminated by compile time query planning.
- Bad performance in the former case can be alleviated by
- flattening out multiple-dot
- commands when an entire column has the same query structure.
- .pp
- The next section turns to a suggestion to dramatically improve the performance
- of procedural fields.
- .sh 1 "CACHING PROCEDURAL FIELDS"
- .pp
- The performance of
- INGRES+ may be dramatically improved by caching
- frequently used objects so they will
- not have to be repeatedly rematerialized.
- This section explores the use of this tactic.
- .sh 2 "The Cache Model"
- .pp
- Caching QUEL procedures should be thought of as a two step process:
- .(l
- 1) compile a query plan for the command(s) (plan caching)
- 2) execute the plan (result caching)
- .)l
- .pp
- The first step (plan caching)
- is often done at compile time in current
- systems; however, in our model it should be thought of as
- computing an intermediate representation of the object. This
- representation can be saved for later reuse. Moreover, in current
- systems a query plan is usually invalidated at execution time if
- the schema has changed in a way that compromises the validity
- of the plan.
- In our model
- query plans are cached until a compromising update
- forces invalidation. One can optionally support a ``demon'' which
- utilizes any idle CPU resources recompiling invalidated plans.
- Alternatively, one can simply recompile when the plan is executed,
- as in [ASTR76].
- .pp
- The size of a compiled plan depends on the target language
- of the compiler. If access plans are generated, then the
- size will be modest (e.g. hundreds of bytes). If machine code
- is generated, then the size will be thousands of bytes.
- If plans are of moderate size, they can be cached directly
- in the field that defines them. Larger plan representations
- can be cached in a separate relation, i.e:
- .(l
- CACHE (identifier, compiled-plan)
- .)l
- and only the identifier stored in the defining field.
- .pp
- The second step (result caching) involves materializing
- the object from the compiled plan.
- Again this step can be performed on demand and saved
- for reuse or even computed in advance.
- The
- size of a materialized object depends, of course,
- on
- the command(s) which is executed.
- Small objects can be cached directly in the defining field.
- Larger objects should probably be cached as
- individual relations, and
- the name of the relation(s) inserted in the defining field.
- When objects are hierarchically composed of other objects, the
- above constructs can be applied recursively; hence, small objects
- which are composed of small objects will be cached together
- in the field describing the enclosing object.
- .pp
- It is straightforward for the data base system
- to allow result caching for N1 ``big objects'' and execute a least
- recently used (LRU) algorithm to select big objects to be discarded.
- Of course this requires N1 relations to hold these objects and
- the corresponding extra entries in the system catalogs.
- Alternately, a data manager could reserve N2 blocks of storage
- for objects and select a victim based on a function of
- size and time-since-last-reference. Of course, N1 and N2
- would be carefully chosen system-specific parameters. Objects
- must also be
- discarded upon an update to one or more tuples from which they are
- composed. We discuss a mechanism to accomplish this task
- presently. Alternately, it should be possible to incrementally
- update the cached object when a subobject is modified.
- Recent work on supporting materialized views (e.g. [BLAK86])
- can be applied to this task. Moreover, if the object is an
- QUEL aggregate, it is straightforward to update the cached value.
- Additional effort in this direction is currently in progress.
- .pp
- For cached big objects, it is desirable to store their name
- and the query(s) which compose them in a main memory data structure.
- Then, if the same or another user materializes an object with the same
- description, the one already materialized can be used instead.
- An example of this situation occurred in Section 4 where our
- algorithm materialized the object represented by o.shape twice.
- .pp
- The prototype discussed in Section 4 caches
- big objects as discussed above, but small object caching as well as
- invalidation on update has
- yet to be implemented.
- We turn now to the efficient invalidation of objects.
- .sh 2 "Object Invalidation"
- .pp
- Consider a new kind of lock mode called I mode. Hence,
- objects can be locked in R, W or I mode. A lock set in
- I mode has an associated identifier indicating the object which
- has been precomputed using the locked object.
- The compatibility of the various modes is
- indicated in Table 5.
- .(z
- .hl
- .(c
- .TS
- c c c c
- l l l l.
- R W I
-
- R ok no ok
- W no no *
- I ok no ok
- .TE
- .)c
-
- .ce
- Compatibility Modes for I Locks
-
- .ce
- Table 5
- .hl
- .)z
- The * in that table indicates that a W requestor for an object
- locked in I mode will be allowed to proceed and set
- a W lock on the object. First, however the object
- with which the I lock is associated will be invalidated.
- .pp
- When INGRES+ materializes any object, it
- simply sets I locks on all objects read by the query(s)
- which materialize the object. These I locks are held
- until the object is deleted or invalidated through an
- update to a subobject.
- .sh 2 "Implementation of I Locks"
- .pp
- A straight-forward approach would be to place I locks in the
- same lock table holding R and W locks. In this case, one must cope
- with a lock table of widely varying size since the number
- of I locks can change dramatically. Moreover, when a
- failure occurs, either all precomputed objects must be invalidated
- (which may be a costly alternative if the facility is extensively
- used) or I locks must
- be made
- recoverable. Lastly, phantoms must be correctly
- handled. Hence, if a new tuple is added which
- satisfies the qualification of some precomputed object, then this
- object must be invalidated.
- .pp
- The first objective can be satisfied by using
- extendible hashing [FAGI79] for the lock table instead of conventional
- hashing. The second objective requires setting I locks
- as part of a transaction and writing them into the log.
- If a failure occurs during this transaction, then recovery
- code must be extended slightly to back
- out the I locks which were set. Moreover, I locks
- must be periodically checkpointed to allow recovery from media
- failures. Hence, when recovery code is rolling forward from a
- checkpoint, I locks can be suitable updated.
- In summary, making I locks recoverable simply involves treating them
- as ordinary data and presents only modest implementation difficulties.
- .pp
- The phantom problem poses more serious issues. Systems
- which perform page level locking (e.g. [RTI86, CHEN84])
- have few difficulties
- supporting correct semantics in the presence of
- phantoms.
- Hence, one can include I locks in such systems without concern.
- However, finer granularity locking is required to
- avoid an excessive number of unnecessary invalidations.
- Systems which perform record
- level locking (e.g. System R [ASTR76]) can allow detection of
- phantoms by holding locks on index intervals in the leaf nodes of
- secondary indexes as well as on data records.
- Hence, a transaction which inserts a tuple will
- hold a write lock on the tuple and on the appropriate index interval
- for any field for which a secondary index
- exists.
- If I locks are also held on data records and index intervals,
- then all conflicts caused by phantoms will be detected by a
- collision in some index.
- .pp
- When access paths other than B-trees are present, a slight
- generalization to the above scheme will support phantom
- detection.
- Tuple level locks are held
- on index and data records which use a hashed or B-tree organization.
- Then,
- a precomputed object must be invalidated if an I lock which
- is held on its behalf falls
- adjacent to a tuple
- on which a W lock is held.
- For B-tree data records and indexes,
- adjacency means ``logically adjacent in tuple identifier order'';
- for hashed records and indexes,
- adjacency means ``in the same hash bucket.''
- .pp
- The only problem with the adjacency approach
- is that a B-tree page split will cause a write lock to be set
- on the page to be split, and thereby will cause an invalidation
- of all objects holding I locks
- on that page. Unless tuple identifiers
- are constructed so that they do not change when a tuple moves to
- a different page, this invalidation is required. Otherwise
- I locks will be held on the previous identifier for the moved tuple,
- and incorrect operation will result.
- .pp
- The phantom problem and the logging problem
- appear easier to solve if an alternate strategy is employed.
- Consider storing I locks in the data and index records
- themselves. Systems which support variable length records can simply
- add as many I locks to each record as necessary. Such locks are
- recoverable using conventional techniques which
- are automatically applied to data records.
- The phantom problem requires the above adjacency algorithm; however,
- structure modifications (e.g. B-tree page splits) do not cause
- unnecessary invalidations.
- Moreover, since the extra locks are stored separately from
- R and W locks, extendible hashing is not a prerequisite for the
- lock table.
- Lastly, this approach is easily extended to field-level
- locks, which may offer superior performance to record level I locks.
- .pp
- The drawbacks of this second alternative is that a second implementation
- of a lock manager must be coded for I locks.
- Moreover, setting an I lock on an object requires rewriting the object instead
- of just reading it. Hence, setting I locks will be expensive.
- .sh 2 "Performance of Cached Objects"
- .pp
- The purpose of this section is to present a model which can be
- used to suggest when caching should be applied to a procedural object.
- Consider a ``small'' object which can be constructed
- in N1 page accesses
- and inspection of N2 records. Assume that this object
- occupies R1 = 1 page of space,
- consists of R2 tuples and is cached in the data record itself.
- In addition,
- consider a query pattern in which P percent of the accesses are
- reads of the complex object and 1-P are updates to subobjects from which
- the complex object is composed.
- Moreover, each update is applied to a randomly chosen subobject from the
- N2 candidates.
- Lastly, the record in which the description of the complex object is stored
- must be read during retrieval whether or not caching is employed. Hence,
- that access is not counted in the following analysis.
- .pp
- Consider the sequence of accesses to be a collection of
- .b intervals,
- each consisting of
- one or
- more consecutive writes followed by one or more consecutive reads.
- In a given interval, the expected length ER of the run of reads and
- the length EW of the run of writes is:
- .(l
- ER = 1 / (1 - P)
- EW = 1 / P
- .)l
- Consider first the no-cache alternative.
- With the
- parameter K discussed in [SELI79] which weighs the CPU time used
- in evaluating a query plan relative to the number of I/O's,
- the cost, M to materialize the
- object after its definition has been obtained is:
- .(l
- M = N1 + N2 * K
- .)l
- Without caching, this cost must be paid on each read access,
- and the
- cost, C1 of these accesses is:
- .(l
- C1 = (ER) * (M)
- .)l
- .pp
- We now turn to the corresponding costs for
- cached objects. The cost to invalidate the complex object
- on the initial write is:
- .(l
- IN = 1
- .)l
- This assumes that I locks are stored in the records of the subobjects
- and that only the
- data record containing the object description
- must be rewritten to perform invalidation. All
- I locks are simply left in place.
- Subsequent writes prior to the first read also require the same
- cost even though the complex object has already been invalidated.
- .pp
- The first read to the complex object
- requires a materialization at cost, M. In addition, the cost, C
- to cache the object is:
- .(l
- C = 1
- .)l
- The materialized object must be written into the field occupied by
- the description of the complex object requiring a single record to be rewritten.
- Since I locks were never reset from prior materializations,
- no extra cost is required to ensure that they are set.
- Subsequent reads only require accessing the object in the
- cache. Since the access to the object description is not being
- counted, there is no additional I/O because the cached object is stored
- in this record. Hence, only the CPU cost to access the R2 records in the
- cached object must be paid, i.e:
- .(l
- A = R2 * K
- .)l
- .pp
- Therefore, the expected cost of an interval using
- caching is:
- .(l
- C2 = EW * IN /*cache invalidation on each write*/
-
- + M + C /*materialize plus cache on first read*/
-
- + (ER - 1) * (A) /*subsequent reads access the cache*/
-
- .)l
- Define Z = M - A to be the caching factor, i.e. the
- difference between the cost to materialize the object
- and the cost to access the object from the cache.
- This cost is in weighted CPU and page costs and is typically
- in the range of 10 - 1000.
- Algebraic manipulation now yields that caching will be preferred if:
- .(l
- Z > (1 / P ** 2) - 1
- .)l
- The consequence of this analysis is that caching will generally
- win unless P is very small. If P is 1/2, then Z must be
- greater than 3 for caching to be
- beneficial. If P is 1/10, then Z must exceed 99.
- .sh 2 "Indexing Cached Fields"
- .pp
- An extension of this caching
- tactic may be attractive when the
- queries in a field have certain compositions. Consider
- a situation, such as salaries in an EMP relation for which
- most of the fields are specified as constants while
- a few are specified procedurally. For example, Joe might make
- $10,000 and Bill is specified as having the same wages as Joe.
- In this case salary can be a procedural field and most
- rows have a value of the form:
- .(l
- retrieve (wages = constant)
- .)l
- while Bill would have a value of:
- .(l
- retrieve (EMP.salary.wages) where EMP.name = ``Joe''
- .)l
- In this case, one would desire a salary index on salaries for
- efficient processing
- of queries of the form:
- .(l
- retrieve (EMP.name) where EMP.salary.wages = value
- .)l
- This section indicates how to construct such indexes on procedural fields.
- .pp
- Instead of caching values as they are
- computed, consider caching all values
- in advance. If this is done, then
- it is straightforward to
- build an index on a field appearing in the cached objects using
- the conventional indexing utility.
- One can use this index to answer queries of the above form
- rapidly.
- On updates to subobjects, one
- must invalidate cached objects as discussed
- in the previous subsection. However,
- as part of the transaction which updates a subobject, the invalidated
- object must be rebuilt and the index updated.
- Conventional locking must be employed to guarantee consistency,
- and aborting a transaction must cause a backout of all
- updates or an
- invalidation of
- the new cached value.
- .pp
- In the case that most values are simply constants, such indexes
- should be beneficial, and be only marginally more expensive to
- maintain than conventional ones.
- .sh 1 "CONCLUSIONS"
- .pp
- This paper has suggested that data base procedures are a natural
- way to model complex objects and to allow data base oriented algorithms
- and precompiled queries in the data base. Moreover, they
- appear to be easily generalizable to arbitrary programming language
- procedures which may be useful in certain applications.
- Lastly, they can be used to model aggregation, generalization and
- most other environments addressed by semantic data models. The
- advantage of data base procedures is that a user does not need to learn
- additional concepts to design his application. Since he must
- know the query language anyway, there is little extra complexity.
- Hence, this proposal is in the same spirit of the ``spartan simplicity''
- stressed by the original advocates of the relational model.
- .pp
- A prototype implementation was described and initial performance
- studies explored. They indicated comparable
- performance between the extended and the non-extended prototypes
- except when many one-variable commands were processed
- by INGRES+ or when a
- bad query plan was
- indicated by the structure of the query.
- Bad plans can be avoided if procedural fields are restricted to
- parameterized queries, and compile time optimization should
- alleviate the overhead of one-relation commands. Moreover,
- a caching strategy was described which should
- accelerate performance of the prototype especially when objects
- are of moderate size. Our caching scheme should
- offer superior performance to pointer oriented
- implementations of complex objects when update frequency
- is low or moderate.
- .bp
- .lp
- .ce
- REFERENCES
- .ls 1
- .nr ii 10m
- .ip [ASTR76]
- Astrahan, M., et. al., ``System R: A Relational Approach to Data,''
- ACM-TODS, June 1976.
- .ip [BLAK86]
- Blakeley, J. et. al., ``Efficiently Updating Materialized Views,''
- Proc. 1986 ACM-SIGMOD Conference on Management of Data, Washington, D.C.,
- May 1986.
- .ip [CHEN84]
- Cheng, J., et. al., ``IBM Database 2 Performance: Design, Implementation,
- and Tuning,'' IBM Systems Journal, February 1984.
- .ip [COPE84]
- Copeland, G. and Maier, D., ``Making Smalltalk a Data Base System,''
- Proc. 1984 ACM-SIGMOD Conference on Management of Data, Boston, Mass.,
- June 1984.
- .ip [DATE81]
- Date, C., ``Referential Integrity,'' Proc. 6th VLDB Conference,
- Cannes, France, September 1981.
- .ip [DBTG71]
- Data Base Task Group, ``Report to the CODASYL Programming Language
- Committee,'' April 1971.
- .ip [EPST80]
- Epstein, R., and Hawthorn, P., ``Design Decisions for the Intelligent
- Database Machine,'' Proc. 1980 National Computer Conference,
- Anaheim, Ca., May 1980.
- .ip [FAGI79]
- Fagin, R. et. al., ``Extendible Hashing: A Fast Access Method for Dynamic
- Files,'' ACM-TODS, Sept. 1979.
- .ip [HAMM81]
- Hammer, M. and McLeod, D., ``Database Description with SDM,'' ACM-TODS,
- September 1981.
- .ip [HASK82]
- Haskins, R. and Lorie, R.,
- ``On Extending the Functions of a Relational Database System,''
- Proc. 1982
- ACM-SIGMOD Conference on Management of Data, Orlando, Fl, June 1982.
- .ip [HELD75]
- Held, G. et. al., ``INGRES - A Relational Data Base System,'' Proc. 1975
- National Computer Conference, Anaheim, Ca., May 1975.
- .ip [KALA85]
- Kalash, J., ``Implementation of a Data Base Browser,'' Electronics
- Research Laboratory, University of California, Berkeley, Ca.,
- Memo No. M85/22, May 1985.
- .ip [KOOI82]
- Kooi, R. and Frankfurth, D., ``Query Optimization in INGRES,''
- Database Engineering, Sept. 1982.
- .ip [KUNG84]
- Kung, R. et. al., ``Heuristic Search in Database Systems,'' Proc.
- 1st International Conference on Expert Systems, Kiowah, S.C.,
- Oct. 1984.
- .ip [LORI83]
- Lorie, R. and Plouffe, W., ``Complex Objects and Their Use in
- Design Transactions,'' Proc. Engineering Design Applications Stream of
- ACM-IEEE Data Base Week, San Jose, Ca., May 1983.
- .ip [MYLO80]
- Myloupoulis, J. et. al., ``A Language Facility for Designing Database
- Intensive Applications,'' ACM-TODS, June 1980.
- .ip [ONG84]
- Ong, J., et. al., ``Implementation of Data Abstraction in the Relational
- Database System INGRES,'' SIGMOD Record, March 1984.
- .ip [POWE83]
- Powell, M. and Linton, M., ``Database Support for Programming
- Environments,'' Proc. Engineering Design Applications Stream of ACM-IEEE
- Database Week, San Jose, Ca., May 1983.
- .ip [ROWE82]
- Rowe, L. and Shoens, K., ``A Form Application Development System,''
- Proc. 1982 ACM-SIGMOD Conference on Management of Data, Orlando, Fl,
- June 1982.
- .ip [RTI86]
- Relational Technology, Inc., ``INGRES Version 5.0 Reference Manual,''
- November 1986.
- .ip [SELI79]
- Selinger, P., ``Access Path Selection in a Relational Database System,''
- Proc. 1979 ACM-SIGMOD Conference on Management of Data, Boston, Mass.,
- June 1979.
- .ip [SHIP81]
- Shipman, D., ``The Functional Model and the Data Language Daplex,''
- ACM-TODS, March, 1981.
- .ip [SMIT77]
- Smith, J and Smith, D., ``Database Abstractions: Aggregation and
- Generalization,'' ACM TODS, June 1977.
- .ip [SORD84]
- Sordi, J., ``IBM Database 2: The Query Management Facility,'' IBM Systems
- Journal, February 1984.
- .ip [STON75]
- Stonebraker, M., ``Implementation of Views and Integrity Control by
- Query Modification,'' Proc. 1975 ACM-SIGMOD Conference on Management
- of Data, San Jose, Ca., June 1975.
- .ip [STON76]
- Stonebraker, M. et al., ``The Design and Implementation of INGRES,''
- ACM-TODS,
- September 1976.
- .ip [STON84]
- Stonebraker, M. et. al., ``QUEL as a Data Type,'' Proc. 1984 ACM-SIGMOD
- Conference on Management of Data, Boston, Mass., June 1984.
- .ip [STON86]
- Stonebraker, M., ``Object Management in POSTGRES Using Procedures,''
- Electronics Research Laboratory, University of California,
- Memo M86/42, July 1986.
- .ip [WILE84]
- Wilensky, R., ``The LISP PRIMER''
- W. Norton, Co, New York, 1984.
- .ip [WONG76]
- Wong, E., ``Decomposition: A Strategy for Query Processing,'' ACM-TODS,
- Sept. 1976.
- .ip [ZANI83]
- Zaniolo, C., ``The Database Language GEM,''
- Proc. 1983 ACM-SIGMOD Conference on Management of Data,
- San Jose, Ca., May 1983.
-
- .bp
- .ce
- APPENDIX 1
-
- The syntax of QUEL has been described elsewhere [HELD75]. Here,
- we report only the extensions to the original language, QUEL, which
- define QUEL+. Let F be the construct ``QUEL-col-1.,...,QUEL-col-n.field''.
-
- 1) F can appear
- wherever a field of a relation can appear in QUEL.
-
- 2) The construct ``tuple-variable.F can appear
- whenever a tuple variable or a relation name can appear in QUEL.
-
- 3) Clauses of the form:
- .(l
- G1 newop G2
- .)l
- are allowable if G1 and G2 are tuple variables or the
- construct ``tuple-variable.F'' and newop is in the set:
- .(l
- { U , !! , >> , << , == , <> , JJ , OJ, empty }
- .)l
-
- 4) EXECUTE and EXECUTE-ONE are added as commands
-
- 5) An operator ``in'' is added accepting an indirectly referenced column
- as a left operand and a relation name as a right operand.
-
- 6) A keyword ``with'' is added which is usable with the EXECUTE command
- to indicate the presence of a parameter list.
-