Approximate Component Retrieval: An Academic Exercise or a Practical Concern?

Lamia Labed Jilani

Regional Institute for Research in Computing and Telecommunications
Cite Montplaisir, Belvedere 1002 Tunisia
Tel: (216) 1 787 757, Fax: (216) 1 787 827
Email: lamia.labed@irsit.rnrt.tn

Rym Mili

School of Engineering and Computer Science
University of Texas at Dallas, Richardson, TX 75028, USA
Tel: (972) 883-2091, Fax: (972) 883-2349
Email: rmili@utdallas.edu

Ali Mili

Department of Computer Science, University of Ottawa
Ottawa, Ont. K1N 6N5, Canada
Tel: (613) 562 5800 X 6714, Fax: (613) 562 5187
Email: amili@csi.uottawa.ca

Abstract:

When one uses informal methods to retrieve a component that satisfies some requirements out of a software reuse library, one cannot distinguish between the retrieved components that do satisfy the requirements and those that merely approximate the requirements (i.e. almost satisfy them). On the other hand, if one uses formal retrieval methods based on precise specifications of components and queries and on formal matching criteria, then one can clearly distinguish between two retrieval methods: exact retrieval, which seeks to identify components that are proved to satisfy the requirements at hand; and approximate retrieval, which is content with components that do not necessarily satisfy but approximate the requirements at hand. In this paper we advocate the need to make the distinction between these two families of methods, and introduce a possible approach thereto.

Keywords: Component based software, software component storage and retrieval, software libraries, software reuse, formal specifications, information retrieval, measures of distance between specifications.

Workshop Goals: Learning; networking; assessing the pertinence of our work; advocating the need for scientifically based methods.

Working Groups: component-based software, formal methods, reuse libraries.

Background

Software reuse libraries are repositories in which reusable software components are stored and from which they are retrieved. They play a crucial role in determining the success of a software reuse policy, because they have a profound impact on the practice of software reuse in an organization:

A library which is poorly stocked (few components, or few relevant components) may cause a significant overhead on the development process, while seldom producing reusable components.
A library whose retrieval method has poor recall causes users (programmers) to miss reuse opportunities, when such opportunities do exist.
A library whose retrieval method has poor precision causes users (programmers) to be unnecessarily distracted by components that are retrieved but prove to be irrelevant.

The weight of these problems increases as a function of the size of the library, and there is every indication that reuse libraries increase in size all the time. In order to ensure that library components remain relevant to the application domain, one has to define precise criteria for inclusion in the reuse library. Also, in order to ensure that the library maintains good recall, one has to design a retrieval method that is as exhaustive as possible (one which visits all the entries, or at least ensures that it skips an entry only if it knows it to be irrelevant). Finally, in order to ensure that the library maintains good precision, one must define a storage and retrieval method which provides precise descriptions of components and queries, and formally defined matching criteria.
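To make the notions of recall and precision concrete, the following minimal sketch computes both for a single query, using the standard information retrieval definitions. The function name, the component names and the convention adopted for empty sets are assumptions made for illustration; they are not part of any particular reuse library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, each a fraction in [0, 1].

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Convention (an assumption of this sketch): an empty denominator yields 1.0.
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical query: four components retrieved, three actually relevant.
retrieved = {"sort", "merge", "hash", "parse"}
relevant = {"sort", "merge", "search"}
p, r = precision_recall(retrieved, relevant)
print(p)  # 0.5: two of the four retrieved components are relevant
print(r)  # two of the three relevant components were retrieved
```

In this example, poor precision shows up as the two irrelevant components that the user must inspect and discard, while poor recall shows up as the one relevant component that was never proposed.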

Position

In light of the foregoing observations, one might expect formal methods of software component storage and retrieval to be widely used in practice. Yet despite the abundance of such methods, and despite the wide range of cost vs quality tradeoffs that these methods provide [1, 2, 3, 4, 5, 6, 7], they are mostly ignored by industry, in favor of traditional, low-tech solutions that are inspired by information retrieval or by library science [8].

We submit the position that both kinds of methods are needed to do a satisfactory job in component storage and retrieval: traditional retrieval techniques are most useful in the early stages of the search process, when large chunks of the library can be excluded by simple keyword matches; mathematically based techniques are most effective in the later stages of the search process, when a great deal of precision is required to discriminate between several candidates which differ only slightly from each other.
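The division of labor advocated here can be sketched as a two-stage pipeline. The sketch below is only an illustration under strong assumptions: the `Component` record, its `keywords` and `features` fields, and the use of set inclusion as a stand-in for formal matching are all hypothetical, and a real formal stage would rely on specifications and proofs of correctness rather than on feature sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    keywords: frozenset   # informal description (hypothetical field)
    features: frozenset   # stand-in for a formal specification (assumption)


def keyword_stage(library, query_keywords):
    """Cheap first pass: keep components that share at least one keyword."""
    return [c for c in library if c.keywords & query_keywords]


def formal_stage(candidates, required):
    """Precise second pass: keep components providing every required feature."""
    return [c for c in candidates if required <= c.features]


library = [
    Component("quick_sort", frozenset({"sort", "array"}), frozenset({"sorted", "in_place"})),
    Component("merge_sort", frozenset({"sort", "list"}), frozenset({"sorted", "stable"})),
    Component("bin_search", frozenset({"search", "array"}), frozenset({"finds_key"})),
]

shortlist = keyword_stage(library, frozenset({"sort"}))
exact = formal_stage(shortlist, frozenset({"sorted", "stable"}))
print([c.name for c in shortlist])  # ['quick_sort', 'merge_sort']
print([c.name for c in exact])      # ['merge_sort']
```

The first stage discards a large part of the library at low cost; the second stage is applied only to the shortlist, where the candidates differ in detail.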

One of the key differences between informal retrieval methods and formal retrieval methods is the ability to distinguish between exact retrieval and approximate retrieval. Because informal methods focus on matching component descriptions with user queries, they do not support the idea of correctness: a component may well match the query in all its detail but still fail to be correct (due to a mismatch between the library manager's interpretation of a feature, and the user's); also, a component may fail to match a query but still be correct with respect to the query (the component does satisfy a required feature, but the library manager neglected to record it). Hence, with informal retrieval methods, all retrievals are approximate retrievals. The decision of whether a component is correct (and can be used verbatim), is not correct but is close enough (and can be used after modification), or is not correct and costs too much to modify (and must be discarded) is taken after the retrieval operation, rather than as part of it.

We have investigated a formal method of component retrieval [3], based on formal specifications and program correctness, and have discussed in turn exact retrieval then approximate retrieval under this method. In this paper, we briefly introduce our main results on approximate retrieval.

Discussion

In [9], Mili defines four measures of distance between specifications; we review these measures in turn and see how they can be used to perform approximate retrieval. Basically, for a given measure of distance, we consider a reuse library L and a query K, and we seek to identify all the components C of L that minimize the distance between C and K.
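The general scheme can be sketched as follows. This is a minimal illustration that assumes the distance takes values that can be compared and minimized directly (here, integers); the measures of [9] are defined on formal specifications, and the `toy_distance` function and the feature-set encoding of specifications are placeholders introduced only for this sketch.

```python
def approximate_retrieve(library, query, distance):
    """Return every component of `library` whose distance to `query` is minimal."""
    if not library:
        return []
    scored = [(distance(query, c), c) for c in library]
    best = min(d for d, _ in scored)
    # Keep all minimizers, not just the first one encountered.
    return [c for d, c in scored if d == best]


def toy_distance(k, c):
    """Placeholder distance: number of features on which k and c disagree."""
    return len(k ^ c)


# Hypothetical library L and query K; no component matches K exactly.
L = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"d"})]
K = frozenset({"a", "b", "c"})
closest = approximate_retrieve(L, K, toy_distance)
print(len(closest))  # 2: the first two components tie at distance 1
```

Returning all minimizers, rather than a single one, leaves the final choice between equally close candidates to the user.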

Functional Consensus

The first measure of distance is what we call functional consensus. The rationale for this measure can be summarized as follows:

Given a component C and a query K, we consider that C is close to K if C and K have plenty of information in common.

Among all the components of the library, this measure will select that which has most information in common with the query.
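As an informal illustration of this selection, consider a toy model in which a specification is reduced to a set of requirement features, so that the information two specifications have in common is simply the intersection of their sets. This encoding, the component names and the feature names are assumptions of the sketch, not the formal definition of functional consensus.

```python
def consensus(c, k):
    """Toy functional consensus: the features that c and k both record."""
    return c & k


# Hypothetical library and query.
library = {
    "C1": frozenset({"parse", "typecheck"}),
    "C2": frozenset({"parse", "typecheck", "optimize"}),
    "C3": frozenset({"link"}),
}
query = frozenset({"parse", "typecheck", "optimize", "debug_info"})

# Select the component sharing the most information with the query.
best = max(library, key=lambda name: len(consensus(library[name], query)))
print(best)  # C2
```

Here C2 shares three features with the query, C1 shares two and C3 shares none, so C2 is selected even though it does not satisfy the query.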

Refinement Difference

Consider two specifications C and K such that K refines C (i.e. all the requirements information of C is recorded in K). The refinement difference between K and C is the smallest functional increment that we must add to C to obtain K. The rationale of this measure is the following:

Given a component C and a query K, we consider that C is close to K if the amount of functionality of K that is not satisfied by C is small.

Note that unlike all other measures of distance presented in this section, the measure of refinement difference is not symmetric.
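The asymmetry can be seen in a toy model where a specification is a set of requirement features and refinement is set inclusion; both choices are simplifying assumptions of this sketch rather than the formal definitions. Under these assumptions, the increment needed to go from C to K is the set of features of K that C lacks.

```python
def refinement_difference(k, c):
    """Toy version: the features of k that c does not already provide."""
    return k - c


c = frozenset({"parse", "typecheck"})
k = frozenset({"parse", "typecheck", "optimize"})  # k refines c in the toy model

print(sorted(refinement_difference(k, c)))  # ['optimize']
print(sorted(refinement_difference(c, k)))  # []
```

Swapping the arguments changes the result: one feature must be added to C to obtain K, whereas nothing needs to be added to K to obtain C.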

Refinement Distance

Given two specifications K and C, the refinement distance between K and C reflects all the functional information of K that is not recorded in C and all the requirements information of C that is not recorded in K. The rationale of this measure of distance is the following:

The refinement distance reflects two terms: the functional requirements of K that C does not satisfy; and the functional properties of C that K does not need. Ideally, we want to minimize both of these terms: we minimize the first term in order to have fewer additional features to add to C; and we minimize the second term in order to have fewer irrelevant features of C to deal with when we are modifying C to satisfy K.
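The two terms can be made explicit in a toy model where specifications are sets of requirement features (an assumption of this sketch, not the formal definition). The names `missing` and `surplus` are introduced here for illustration only.

```python
def refinement_distance(k, c):
    """Toy version: return the two terms of the distance as a pair of sets.

    missing: requirements of k that c does not satisfy (features to add to c)
    surplus: properties of c that k does not need (features to work around)
    """
    missing = k - c
    surplus = c - k
    return missing, surplus


k = frozenset({"parse", "optimize"})
c = frozenset({"parse", "link"})

missing, surplus = refinement_distance(k, c)
print(sorted(missing))  # ['optimize']
print(sorted(surplus))  # ['link']
```

Exchanging K and C merely exchanges the two terms, which is why this measure treats its arguments symmetrically when both terms are taken together.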

Functional Distance

The rationale of functional distance is the following:

Given two specifications A and B, the distance between A and B is reflected by two features: the amount of requirements information that A and B have in common, which is reflected by the functional consensus of A and B; and the amount of requirements information that sets them apart, which is reflected by their refinement distance.

Consequently, we define the functional distance between A and B as the vector formed by these two terms.

Experimentation: A Library of Compilers

In order to illustrate how these distances can be used to perform approximate retrieval in a database of software components, we have considered the library of compilers that is presented in [3] and a user query K that no element of the library satisfies. Figure 1 gives a graphic representation of these compilers, where the nodes are ordered by means of the refinement relation.

Figure 1: A Database of Pascal Compilers

For each measure of distance, we consider all the entries of the original database and compare them with respect to their distance to specification K. Specifically, whenever one component is closer to K than another under the measure at hand, we draw the former higher than the latter in the new graph; also, whenever two components have the same distance to K, we represent them at the same node in the new graph. The graphs that we obtain for functional consensus, refinement difference, refinement distance and functional distance are given in Figure 2. On each graph, the specifications that minimize the measure of distance (hence are prime candidates in an approximate retrieval) are those that appear at the top of the graph.

Figure 2: Graphs derived from Measures of Distance
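The construction of such a graph can be sketched in a toy model. The sketch assumes that specifications are sets of requirement features, that the distance of a component to K is the set of features of K it lacks, and that one distance is smaller than another when it is a proper subset of it; the component names P1 to P4 are hypothetical and do not correspond to the compilers of Figure 1.

```python
def missing(k, c):
    """Toy distance: the features of query k that component c lacks."""
    return k - c


def build_graph(library, query):
    """Group components into nodes and identify the nodes at the top.

    Components at the same distance from the query share a node; a node is
    at the top when no other node is strictly closer to the query.
    """
    nodes = {}
    for name, spec in library.items():
        nodes.setdefault(missing(query, spec), []).append(name)
    top = [sorted(names) for dist, names in nodes.items()
           if not any(other < dist for other in nodes)]  # '<' is proper subset
    return nodes, top


library = {
    "P1": frozenset({"a", "b"}),
    "P2": frozenset({"a", "b", "x"}),
    "P3": frozenset({"a", "c"}),
    "P4": frozenset({"a"}),
}
K = frozenset({"a", "b", "c"})

nodes, top = build_graph(library, K)
print(top)  # [['P1', 'P2'], ['P3']]
```

P1 and P2 lack the same feature and therefore share a node; P3 lacks a different feature and is incomparable with them; P4 lacks both features and so sits below the other two nodes. Because the ordering is only partial, several nodes may appear at the top of the graph.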

References

1: R. Hall, ``Generalized behaviour-based retrieval,'' in Proceedings, 16th Int. Conf. on Soft. Eng., (Sorrento, Italy), IEEE Computer Society Press, May 1994.
2: J. Jeng and B. Cheng, ``Formal methods applied to reuse,'' in Proceedings, 5th Workshop on Software Reuse, (Palo Alto, CA), University of Maine, November 1992.
3: R. Mili, R. Mittermeir, and A. Mili, ``Storing and retrieving software components: A refinement based approach,'' in Proceedings, 16th Int. Conf. on Soft. Eng., (Sorrento, Italy), IEEE Computer Society Press, May 1994.
4: D. Perry and S. Popovich, ``Inquire: Predicate-based use and reuse,'' in Proceedings, Knowledge Based Software Engineering Conference, (Chicago, IL), IEEE Computer Society Press, September 1993.
5: A. Podgurski and L. Pierce, ``Behaviour sampling: a technique for automated retrieval of reusable components,'' in Proceedings, 14th International Conference on Software Engineering, (Melbourne, Victoria, Australia), pp. 300-304, IEEE Computer Society Press, September 1992.
6: A. M. Zaremski and J. M. Wing, ``Signature matching: A tool for using software libraries,'' ACM Transactions on Software Engineering and Methodology, vol. 4, pp. 146-170, April 1995.
7: A. M. Zaremski and J. M. Wing, ``Specification matching of software components,'' in Proceedings, SIGSOFT '95: Third ACM SIGSOFT Symposium on the Foundations of Software Engineering, (New York, NY), ACM Press, April 1995.
8: W. Frakes and T. Pole, ``An empirical study of representation methods for reusable software components,'' IEEE Transactions on Software Engineering, vol. 20, pp. 617-630, August 1994.
9: R. Mili, ``Assessing the reuse worthiness of a component: Empirical and analytical approaches,'' Tech. Rep. PhD Dissertation, University of Ottawa, 1996.

Biographies

Lamia Labed Jilani holds an Engineering degree in Computer Engineering from the University of Tunis II; she is a PhD candidate at the University of Tunis II and is a researcher with the Regional Institute for Research in Computing and Telecommunications in Tunis, Tunisia.

Rym Mili holds a Doctorate in Computer Science from the University of Tunis and a PhD in Computer Science from the University of Ottawa; she is an Assistant Professor of Computer Science at the University of Texas at Dallas.

Ali Mili holds a PhD from the University of Illinois and a Doctorat d'Etat from the University of Grenoble; he is Professor of Computer Science at the University of Ottawa.