home *** CD-ROM | disk | FTP | other *** search
- .sh 1 "PERFORMANCE OF THE POSTGRES OPTIMIZER" 5
- .pp
- This section describes how we went about validating the POSTGRES optimizer.
- It also presents and discusses our results.
- .sh 2 "Testing The Optimizer"
- .pp
- To evaluate the POSTGRES optimizer's performance as well as
- assess the credibility of its cost formulas, we could do the following.
- We could measure the amount of time required to execute
- various query plans, and then we could compare these measurements with
- the optimizer's predicted costs.
- Ideally, the predicted cost will be identical
- to the actual measured cost.
- However, this is an unrealistic expectation since optimizer
- cost formulas are merely estimates.
- A more attainable goal is for the optimizer to select the true
- optimal plan in a large majority of cases.
- This will at least show that in almost all circumstances, a selected plan
- is the best plan not only according to the optimizer but also in reality.
- .pp
- The tests described above illustrate how we would have wanted to validate
- the optimizer.
- However, at the time when we wanted to run these performance tests,
- measuring the amount of time required to execute a query plan
- was impossible because other parts of POSTGRES
- had not been fully implemented.
- As an alternative, we compared POSTGRES plans with
- query plans selected by the optimizer of commercial INGRES [KOOI82],
- which also happens to use an enumerative planning algorithm.
- The assumption behind this is that the INGRES optimizer selects
- \*(lqcorrect\*(rq plans.
- Therefore, if a plan selected by POSTGRES is equivalent to a plan selected by
- INGRES, (for the same query under similar conditions), then
- the POSTGRES optimizer has also generated a \*(lqcorrect\*(rq plan.
- Although this may not always be a valid assumption, it is probably
- a very good one since the INGRES optimizer has been tested, tuned, and
- used widely; and tests have shown that it is
- \*(lqextremely good\*(rq [ROWE86].
- Furthermore, comparing POSTGRES
- and INGRES query plans at least allowed us to validate our optimizer against
- something other than our intuition as to which plans \fIshould\fR be
- selected.
- .pp
- Commercial INGRES was chosen as a basis for our comparisons because
- it is ideal in several respects.
- First of all, it has a feature that allows users to examine
- plans selected by the optimizer.
- By issuing the command \*(lq\fBset qep\fR,\*(rq
- plan trees for all subsequent join queries will be printed on
- the terminal monitor, easily allowing us to make comparisons between
- INGRES and POSTGRES plans.
- In addition, both POSTGRES and INGRES
- store within database catalogs statistical information that can be used to
- compute more realistic operator selectivities.
- In INGRES, these statistics are generated and subsequently updated
- using the \*(lq\fBoptimizedb\fR\*(rq command, while POSTGRES maintains
- them using system demons.
- An example of how this extra information would be put to use is the following.
- Suppose a relation, \fIr\fR, has 100 tuples but only 10 distinct
- values in attribute, \fIr.f\fR.
- Because the number
- of distinct values is a better indicator of the distribution of data
- within \fIr.f\fR, a clause like \fIr.f = 1\fR would have a
- selectivity of $1 over 10$ rather than $1 over 100$, as a result of using
- this extra information.
- In both systems, we made use of this and other statistical information
- to generate more accurate cost estimations.
- As a result, fair comparisons could be made between the two optimizers.
- .pp
- However, the POSTGRES optimizer is \fInot\fR identical to the INGRES
- optimizer, and consequently, in certain situations, INGRES selected
- a plan that POSTGRES did not even consider.
- In these cases, it was impossible to determine whether POSTGRES had selected
- the optimal path from among those plans it had considered; but as
- will become evident in the next subsection, we were still able to draw
- some interesting conclusions.
- One situation where this occurs relates to the manner in which both
- systems treat secondary indices.
- Figure 5.1 illustrates how a secondary index is generally used.
- Given the key(s) of the index, the system locates within
- the secondary index relation a unique tuple identifier (tid).
- It then uses the tid to directly retrieve an appropriate tuple from the
- corresponding indexed relation.
- .(z
- .hl
- .GS
- file figure5.1
- .GE
- .sp
- .ce 2
- \fBFigure 5.1\fR
- Using secondary indices
- .hl
- .)z
- The POSTGRES query plan tree corresponding to such a secondary index scan is
- shown in figure 5.2a, while the INGRES tree is shown in figure 5.2b.
- .(z
- .hl
- .GS
- file figure5.2
- .GE
- .sp
- .ce 2
- \fBFigure 5.2\fR
- Plan trees for secondary index scans in POSTGRES and INGRES
- .hl
- .)z
- A join between the tidp field in EMPINDEX and tid in EMP is required because
- INGRES literally treats secondary indices as relations, given that relations
- are used to implement them.
- In this particular case,
- although the two trees are different in appearance, they are both
- processed in the manner shown in figure 5.1.
- The only difference is that joining of tids
- is implicit in the POSTGRES indexscan node.
- .pp
- It will not always be the case that there will be a POSTGRES tree
- equivalent to an INGRES tree because by treating secondary indices as
- relations, further strategies are available to the INGRES optimizer.
- In figure 5.2, the index relation, EMPINDEX, is joined directly with
- its corresponding indexed relation, EMP.
- POSTGRES will only use a secondary index in this manner.
- INGRES, however, may choose to sort the result of a secondary scan, or
- it may join the result
- with a relation other than the corresponding indexed relation.
- Figure 5.3 illustrates these two situations.
- Examples where this generality is advantageous will be discussed in the
- next subsection.
- .(z
- .hl
- .GS
- file figure5.3
- .GE
- .sp
- .ce 2
- \fBFigure 5.3\fR
- Other processing strategies using secondary indices in INGRES
- .hl
- .)z
- .pp
- Another disadvantage of testing our optimizer against INGRES's is that
- INGRES is a conventional database system that does not support user-defined
- operators and access methods or nested-attribute queries.
- As a result, we could only test standard operators, access methods, and queries.
- Fortunately, this drawback was insignificant.
- In POSTGRES, costs are computed using
- formulas as well as operator selectivities.
- The latter is supplied by the user.
- In other words, it is a parameter that is not inherent within the optimizer
- and thus cannot be manipulated or tuned (except by the user who supplied the
- routines).
- Consequently, provided selectivities relevant to a single operator and storage
- structure are accurate, one of each was sufficient for testing purposes.
- It would have been nice to illustrate the generality of our optimizer by using
- non-standard operators in our tests, but even standard operators and access
- methods in POSTGRES are implemented as if they were user-defined.
- Thus, no generality was lost in using a conventional operator and access
- method in our tests.
- .pp
- The single operator and access method used were the equality operator,
- since it is a
- mergesortable operator, and an ISAM storage structure [HELD75a].
- To build an ISAM storage structure, tuples must first be sorted
- into data pages.
- Then, a multi-level directory is constructed indicating the high key value
- stored within each page.
- Such a structure provides quick access when used with the following operators:
- .(l
- =, <, $<=$, >, and $>=$.
- .)l
- The directory is analogous to the internal nodes of a B-tree except that
- once it has been built, the directory remains static.
- Therefore, when a data page is filled, rather than
- splitting nodes to maintain a balanced tree,
- overflow pages are created and linked to the filled data page.
- If a large number of overflow pages are created, finding a tuple within a page
- requires an inefficient linear search through the primary page
- as well as its overflow pages.
- So, given a choice between an ISAM storage structure
- and a B-tree, a user would probably choose a B-tree.
- However, we could not use B-trees in our tests because the version of
- INGRES that we used only supported ISAM and
- hash access methods.
- Forced to choose between the two, we chose ISAM because
- there is a greater overhead
- associated with searching through an ISAM directory, making them more
- interesting than hash tables.
- .pp
- Using an ISAM access method does have its disadvantages, though.
- Although tuples in an ISAM structure are initially sorted when
- the structure is built, the tuples are not maintained in sort order.
- As a result, we could not test the POSTGRES optimizer feature that takes
- advantage of access methods like B-trees
- to eliminate sort steps required to order a
- user-specified sort result or tuples for a merge-sort join.
- However, although merge-sorting on a tuple by tuple basis is not possible,
- a variation of merge-sort can be performed on
- a page by page basis since the range of values within each page is known.
- INGRES, in fact, does this.
- In contrast, the POSTGRES optimizer does not, and as a result, differences
- arose in our performance tests.
- We chose not to account
- for partial index sorts because few access methods have this unusual
- characteristic.
- Moreover, as already alluded to, users will likely opt for access
- methods like B-trees, which always maintain their records in sort order.
- In other words, this feature would probably not
- be employed very often, had we implemented it.
- This should be kept in mind when differences arise between POSTGRES and
- INGRES plans in the next subsection.
- .pp
- With respect to nested-attribute queries, not being able to test these either is
- also of minimal importance.
- As discussed in section 3.5, relations materialized from queries
- embedded within data fields will generally be small, and as a result, the
- cost of executing any subplan corresponding to a nested portion of a query
- will also be small.
- Therefore, it is less important if the true optimal subplan is not selected
- while optimizing this portion of a query.
- .pp
- In testing the optimizer, we used version 2.1 of INGRES running on
- a VAX 11/780.
- To simulate the INGRES optimizer as closely as possible,
- we had to determine values for two system-dependent parameters that affect
- cost estimations.
- One of these is the page size used by the database.
- This is required because in estimating the cost of a sort, the
- optimizer must determine the number
- of pages occupied by the tuples to be sorted.
- By multiplying the number of tuples in an INGRES relation by the width of
- a tuple (including space for internal page pointers) and
- dividing by the number of pages in the relation,
- it turns out that INGRES uses 2K pages.
- .pp
- Determining a value for the second parameter, \fIW\fR, which relates
- I/O to CPU cost, was less straightforward.
- Assuming that it takes 30 milliseconds and about 2000 instructions of
- operating system overhead to
- read an I/O page, while
- manipulating a tuple only consumes about 200 CPU instructions,
- \fIW\fR is in the range of 0.03 to 0.1 [DEWI84].
- For the queries used in our tests, using various values within
- this range did not affect the plan choice except in a few situations
- where a merge-sort join was chosen over nested iteration.
- This occurred when \fIW\fR was close to 0.03, a value that apparently minimized
- the cost of CPU-intensive tasks like sorting.
- In these situations, the INGRES optimizer always selected a nested
- iteration join.
- Thus, to insure compatibility with INGRES plans, we assigned \fIW\fR the value
- 0.065, arbitrarily choosing a value in the middle of our original range.
- .sh 2 "Tests Performed"
- .pp
- For testing purposes, we used the following three relations:
- EMP, DEPT, and WATER.
- Schemas and relevant statistics for these relation are shown in figure 5.4.
- .(z
- .hl
- .sz \n(ep
- .nf
- \fBRELATIONS :\fR
- .sp
- EMP ( name = c10, salary = i4, manager = c10, age = i4, dept = c10 )
- .in +0.5i
- # pages : 600 (unindexed)
- # tuples : 30,000
- # distinct values in name : 30,000
- # distinct values in dept = 1000
- .in -0.5i
- .sp
- DEPT ( dname = c10, floor = i4, budget = i4 )
- .in +0.5i
- # pages : 10 (unindexed), 14 (primary ISAM index on floor)
- # tuples : 1000
- # distinct values in dname : 1000
- # distinct values in floor : 9
- .in -0.5i
- .sp
- WATER ( cid = i4, floor = i4 )
- .in +0.5i
- # pages : 1 (unindexed)
- # tuples : 50
- # distinct values in cid : 9
- # distinct values in floor : 9
- .in -0.5i
- .sp 2
- \fBSECONDARY INDICES : \fR
- .sp
- EMPINDEX ( ISAM on dept ) :
- .in +0.5i
- # pages : 253
- # tuples : 30,000
- .in -0.5i
- .sp
- DEPTINDEX ( ISAM on floor ) :
- .in +0.5i
- # pages : 8
- # tuples : 1000
- .in -0.5i
- .sp
- DEPTINDEX ( ISAM on dept ) :
- .in +0.5i
- # pages : 11
- # tuples : 1000
- .in -0.5i
- .fi
- .sz \n(pp
- .sp
- .ce 2
- \fBFigure 5.4\fR
- Schemas and statistics for test relations
- .hl
- .)z
- First, we ran the following four two-way join queries on EMP and DEPT:
- .(l
- A: \fBretrieve\fR (EMP.name, EMP.age, DEPT.all) \fBwhere\fR
- EMP.dept = DEPT.dname
-
- B: \fBretrieve\fR (EMP.name, EMP.age, DEPT.dname, DEPT.budget) \fBwhere\fR
- EMP.dept = DEPT.dname \fBand\fR
- DEPT.floor = 1
-
- C: \fBretrieve\fR (EMP.age, EMP.dname, DEPT.floor, DEPT.budget) \fBwhere\fR
- EMP.dept = DEPT.dname \fBand\fR
- EMP.name = \*(lqDiamond\*(rq
-
- D: \fBretrieve\fR (EMP.age, DEPT.dname, DEPT.budget) \fBwhere\fR
- EMP.dept = DEPT.dname \fBand\fR
- DEPT.floor = 1 \fBand\fR
- EMP.name = \*(lqDiamond\*(rq
- .)l
- To illustrate that the optimizer chooses appropriate strategies
- under varying conditions, we ran the four queries for non-indexed relations as
- well as relations with primary and secondary indices defined on relevant
- attributes.
- Indices were selected so as to illustrate the combined effects of relation
- sizes, distribution of values, and secondary index overhead in determining
- a query's optimal plan.
- The results of running queries A-D using various
- indices are shown in table 5.1.
- .(z
- .hl
- .sz \n(ep
- .nf
- .in +1i
- A: \fBretrieve\fR (EMP.name, EMP.age, DEPT.all) \fBwhere\fR
- EMP.dept = DEPT.dname
-
- B: \fBretrieve\fR (EMP.name, EMP.age, DEPT.dname, DEPT.budget) \fBwhere\fR
- EMP.dept = DEPT.dname \fBand\fR
- DEPT.floor = 1
-
- C: \fBretrieve\fR (EMP.age, EMP.dname, DEPT.floor, DEPT.budget) \fBwhere\fR
- EMP.dept = DEPT.dname \fBand\fR
- EMP.name = \*(lqDiamond\*(rq
-
- D: \fBretrieve\fR (EMP.age, DEPT.dname, DEPT.budget) \fBwhere\fR
- EMP.dept = DEPT.dname \fBand\fR
- DEPT.floor = 1 \fBand\fR
- EMP.name = \*(lqDiamond\*(rq
- .in -1i
- .fi
- .sp
- .TS
- center box;
- c|c|c|c|c.
- Storage Structure Query A Query B Query C Query D
- =
- no indices Plan (a) \(dg Plan (a) Plan (c) Plan (c)
- _
- secondary index on DEPT (floor) Plan (a)
- _
- primary index on DEPT (floor) Plan (b) Plan (d)
- _
- secondary index on EMP (dept) Plan (c) Plan (e)
- _
- primary index on DEPT (floor) & Plan (b) Plan (g)
- secondary index on DEPT (dname)
- _
- secondary index on EMP (dept) Plan (f) Plan (c)
- _
- secondary index on DEPT (dname) & Plan (f)
- secondary index on EMP (dept)
- .TE
- .sp
- .sz \n(fp
- $" " sup \(dg$ See figure 5.5
- .sz \n(pp
- .sp
- .ce 2
- \fBTable 5.1\fR
- Results of executing queries A-D using various storage structures
- .hl
- .)z
- .(z
- .GS
- file figure5.6
- .GE
- .sz \n(fp
- .nf
- MS = merge-sort
- NL = nestloop
- .fi
- .sz \n(pp
- .ce 2
- \fBFigure 5.5\fR
- Comparison of selected POSTGRES and INGRES plans
- .hl
- .)z
- For the fourteen executions shown, when indices were defined, in all but
- one case both
- POSTGRES and INGRES made similar decisions as to whether or not the indices
- should be used, and except in cases where INGRES took advantage of
- a partial ISAM sort, both selected equivalent join strategies.
- .pp
- Figure 5.5 gives a pictorial representation of the seven different
- corresponding plan trees mentioned in table 5.1.
- For each pair, the POSTGRES tree is shown on the left while the INGRES tree
- is on the right.
- Both pairs of plans (a) and (b) correspond to merge-sort joins,
- and both plan (b)'s
- scan DEPT using a primary index defined on \*(lqfloor.\*(rq
- In plans (c) and (d), POSTGRES uses nested iteration, scanning EMP first
- because the highly restrictive clause,
- .(l
- EMP.name = \*(lqDiamond,\*(rq
- .)l
- applies in these cases.
- INGRES, however, uses a slight variation of nested iteration in both
- (c) and (d).
- EMP is scanned and sorted into a temporary before it is nest iterated with
- DEPT.
- Sorting only the inner relation on the inner join attribute alleviates having
- to scan the entire inner join relation on every iteration because
- the scan halts after a value greater than the current outer
- join attribute has been reached.
- In certain cases, this can be helpful;
- however, in these cases, there is really no advantage because only
- a single tuple results after scanning EMP.
- In fact, these INGRES and POSTGRES plans are equivalent
- in cost because the cost of the extra sort step in the INGRES trees is zero,
- and in both cases, EMP and DEPT only need to be scanned once.
- .pp
- POSTGRES's plan (e) is similar to INGRES's except INGRES takes advantage
- of an ISAM's partial sort and
- thus uses a slight variation of a merge-sort join with EMP and DEPTINDEX.
- Again, only one tuple qualifies after scanning EMP, resulting in a sort cost
- of zero.
- Ignoring the sort step, the two plans are actually equivalent.
- .pp
- Plan (f) is an example where INGRES's treatment of an access method's
- partial ordering is helpful.
- In this particular case, INGRES takes advantage of the feature
- to join EMPINDEX with DEPT.
- The result of this join is then
- sorted on the tidp field so EMP can be scanned in page order.
- According to POSTGRES cost estimates, the INGRES plan is better by about a
- factor of four to five.
- It is significantly better than the POSTGRES plan partly because
- of INGRES's treatment of secondary indices, but more so due to
- INGRES accounting for an ISAM's partial sort.
- Therefore, the fact that INGRES was significantly better in this case is
- misleading for the reasons discussed in the previous subsection.
- .pp
- The last plan pair (g) illustrates the one case where different indices
- are selected by POSTGRES and INGRES.
- INGRES chooses to use
- the index defined on \*(lqfloor,\*(rq while POSTGRES chooses the
- index defined on \*(lqdept.\*(rq
- POSTGRES chose the plan it did because according to its cost estimates
- the selected plan was a mere 0.27% better than a plan equivalent to that chosen
- by INGRES.
- Thus, although the two optimizers chose different plans, the difference is
- very minor.
- .pp
- The set of tests just described indicate that differences between selected
- INGRES and POSTGRES plans are rather subtle, resulting in minor variations,
- except in one case where the difference is minimized by other circumstances.
- Additional strategies employed by INGRES
- resulted in greater differences when the following three-way join:
- .(l
- \fBretrieve\fR (EMP.name, DEPT.floor, WATER.cid) \fBwhere\fR
- EMP.dept = DEPT.dname \fBand\fR
- DEPT.floor = WATER.floor
- .)l
- was run on several different relation configurations.
- The results are shown in table 5.2.
- .(z
- .hl
- .GS
- file table5.2
- .GE
- .sp
- .ce 2
- \fBTable 5.2\fR
- Results of executing a three-way join on various storage structures
- .hl
- .)z
- .(z
- .hl
- .GS
- file table5.3
- .GE
- .sp
- .ce 3
- \fBTable 5.2\fR
- (continued)
- .hl
- .)z
- .pp
- Plans (1), (2), and (4) in table 5.2 illustrate another strategy
- employed by INGRES that POSTGRES does not consider.
- In these plans, INGRES first sorts
- DEPT on \*(lqdname\*(rq and then nest iterates the result with WATER.
- Since a join preserves the sort order of its outer join relation, the join
- result is still sorted in \*(lqdname\*(rq order and thus can be used
- in a merge with EMP.
- This is advantageous because it may be cheaper to sort DEPT prior to
- its join with WATER, since more tuples may have to be sorted after the join.
- Other differences were already discussed in reference to the previous set
- of tests and will not be repeated.
- .pp
- To assess the magnitude of differences in POSTGRES and INGRES plans
- for the four cases shown, the costs of all INGRES plans were calculated using
- POSTGRES cost estimation formulas.
- These computed costs were compared with the costs of corresponding
- POSTGRES plans,
- and the percentage differences are shown in table 5.2.
- The results show that INGRES was better than POSTGRES in all four cases.
- However, INGRES was only significantly better in one case, and this was
- a result of INGRES accounting for a partial ISAM ordering.
- In all other cases, the additional strategies considered by INGRES
- resulted in no more than a 13.8% difference.
- .pp
- To summarize, in spite of differences between the POSTGRES and INGRES
- optimizers, we were able to draw two promising conclusions from our
- performance tests.
- For the most part, both systems selected similar strategies
- for queries involving a single join.
- Assuming that the INGRES optimizer is correct, these results indicate that
- POSTGRES selects true optimal plans.
- For queries for which INGRES was able to capitalize on a broader range
- of strategies, INGRES did not perform considerably better.
- Thus, not having considered all these other processing strategies,
- which would have added further complexity to
- our implementation effort, did not dramatically affect the overall performance
- of our optimizer.
-