Prototyping vs. Production

Well-intentioned reuse of existing tools may make scaling more difficult.

As a system starts to show promise of utility, ambitions turn from small demos to practical codes, and scale difficulties begin to show up. A foundation that looked attractive at the outset may have other properties whose cost at scale is large relative to the expected benefits. One must weigh those benefits carefully; engineering for scale may require paying higher costs during the early stages of development.

As an example, we built Sinapse on top of a symbolic mathematical system, Mathematica (MMa), because it had several attractive properties from a prototyping point of view. It provided, for free, a single program development environment containing considerable built-in knowledge for manipulating symbolic (algebraic and differential) equations, a procedural programming language, and a transformation system. It promised portability of the result, since MMa ran on several platforms, and it was interactive, which eased debugging by making inspection simpler. Early versions of Sinapse benefited because these mechanisms made it easier to prototype parts of the system, thereby selling the system concept via small demos.
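To make that leverage concrete, the small sketch below uses sympy as a stand-in for the built-in symbolic knowledge a prototype gets from a system like MMa; the wave-equation residual and every name in it are illustrative assumptions, not Sinapse code.

\begin{verbatim}
# Illustrative only: sympy stands in for the symbolic knowledge a prototype
# gets "for free" from a system like MMa; nothing here is Sinapse code.
import sympy as sp

x, t, c = sp.symbols('x t c')
u = sp.Function('u')(x, t)

# Built-in calculus: form the 1-D wave-equation residual symbolically.
residual = sp.diff(u, t, 2) - c**2 * sp.diff(u, x, 2)

# Built-in rewriting: check a candidate solution by substitution + simplification.
candidate = sp.sin(x - c*t)
print(sp.simplify(residual.subs(u, candidate).doit()))   # prints 0
\end{verbatim}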

Growing sophistication caused Sinapse's internal algorithm form to become more specialized, and it became progressively more difficult to keep that form compatible with the underlying MMa system's. Eventually, we gave up direct compatibility, opting to convert to MMa's form when we needed MMa's mathematical knowledge and then convert the result back to the Sinapse form. Happily, such conversions have turned out to be infrequent, with only a few needed per run. We clearly wish to (re)use the knowledge in a symbolic algebra system, but our need for a representation of our own suggests that staying within the environment defined by such a tool is not actually required. Ideally, we would like a way of compiling symbolic manipulation knowledge stored in a knowledge-representation language such as KIF [#!Genesereth92a:knowledge-interchange-format-reference!#] into a form compatible with our desired representations. This is itself a program synthesis problem we do not know how to solve.
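A minimal sketch of this conversion idiom, again with sympy standing in for MMa and with a tuple-based AST invented to play the role of Sinapse's internal form:

\begin{verbatim}
# Hypothetical sketch of the conversion idiom: keep a system-specific AST and
# convert to the algebra system's form only when its knowledge is needed.
# sympy stands in for MMa; the tuple-based AST is invented for illustration.
import sympy as sp

OPS = {'+': sp.Add, '*': sp.Mul, '**': sp.Pow}

def to_cas(node, env):
    """Internal ('op', args...) tuple -> sympy expression."""
    if isinstance(node, str):                        # variable reference
        return env.setdefault(node, sp.Symbol(node))
    if isinstance(node, (int, float)):
        return sp.sympify(node)
    op, *args = node
    return OPS[op](*(to_cas(a, env) for a in args))

def from_cas(expr):
    """sympy expression -> internal tuple AST (only the node types used here)."""
    if expr.is_Symbol:
        return expr.name
    if expr.is_Number:
        return int(expr) if expr.is_Integer else float(expr)
    for op, cls in OPS.items():
        if isinstance(expr, cls):
            return (op, *map(from_cas, expr.args))
    raise ValueError(f"cannot convert {expr}")

# One round trip per request keeps the conversions infrequent.
env = {}
internal = ('+', ('*', 'x', 'x'), ('*', 2, 'x'), 1)   # x*x + 2*x + 1
factored = sp.factor(to_cas(internal, env))           # (x + 1)**2
print(from_cas(factored))                             # e.g. ('**', ('+', 1, 'x'), 2)
\end{verbatim}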

The MMa tree-to-tree rewrite system works well on small trees representing ``typical''-size mathematical formulas. Unfortunately, MMa's built-in control strategies are slow when large sets of transformations (hundreds of rules) are applied to large ASTs (roughly 10K tree nodes), which are typical of the codes we synthesize. Further, MMa is an interpreter without any corresponding compiler, costing us an additional runtime factor of roughly 10x. Synthesis times without optimization run to 30 minutes on a Sparc 2 and can be much longer with the optimizations. This has made building and testing the Sinapse system components relatively painful. If one insists on using ASTs, a tree-to-tree transformation engine is cheap to replicate compared to the investment in a synthesis system; alternatively, we would recommend looking into more efficient production system tools such as OPS5 [#!Brownston85a!#].
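The following toy rewriter illustrates the cost pattern: a naive engine re-traverses the whole tree after every rule application, which is tolerable on formula-sized trees but not on synthesis-sized ones. The node and rule formats are invented for illustration and are not Sinapse's.

\begin{verbatim}
# Illustrative sketch (not Sinapse's engine): a naive tree-to-tree rewriter
# that re-scans the whole AST after every rule application. With hundreds of
# rules and ~10K-node trees this repeated traversal is where such engines slow
# down; production-system engines (e.g. OPS5's Rete matching) instead index
# rules against the elements that changed.

def rewrite_fixpoint(tree, rules):
    """Apply rules bottom-up until no rule fires anywhere in the tree."""
    changed = True
    while changed:
        tree, changed = _rewrite_once(tree, rules)
    return tree

def _rewrite_once(node, rules):
    if not isinstance(node, tuple):            # leaf: a constant or variable
        return node, False
    new_kids, changed = [], False
    for kid in node[1:]:                       # rewrite children first (bottom-up)
        new_kid, c = _rewrite_once(kid, rules)
        new_kids.append(new_kid)
        changed |= c
    node = (node[0], *new_kids)
    for rule in rules:                         # every rule is tried at every node
        result = rule(node)
        if result is not None:
            return result, True
    return node, changed

# Two toy algebraic rules; real rule sets run to the hundreds.
rules = [
    lambda n: n[1] if n[0] == '+' and len(n) == 3 and n[2] == 0 else None,  # x + 0 -> x
    lambda n: 0    if n[0] == '*' and 0 in n[1:] else None,                 # x * 0 -> 0
]
print(rewrite_fixpoint(('+', ('+', 'y', 0), ('*', 'x', 0)), rules))         # -> 'y'
\end{verbatim}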

Even the choice of ASTs is called into question by scale. Most ``purist'' transformation systems, MMa being no exception, operate by manipulating ASTs. However, serious optimization of refined code requires dataflow analysis, and ASTs are a poor choice for this. The lion's share of the cost in the optimizer is the need to recompute the dataflow analysis information after each optimization is applied. An approach we are pursuing is the use of compiler representations such as Static Single Assignment form [#!Cytron91:computing-SSA-form-efficiently!#], in which the dataflow has been made explicit and optimizations transform the dataflows directly, keeping them up to date. An ideal solution would be to treat dataflow analysis as a problem in incremental re-computation, and use finite difference methods (a term which, unfortunately, is common to both PDE solving and program synthesis but does not mean the same thing in each) to keep dataflows current [#!Smith89a:KIDS-overview!#].
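The toy sketch below (not the algorithms of the SSA literature cited above) suggests why explicit dataflow helps: when each name has a single definition, def-use chains can be stored directly and patched locally after a transformation instead of being recomputed globally. The instruction format is an invented illustration.

\begin{verbatim}
# Toy sketch: explicit def-use information over single-assignment names,
# patched incrementally rather than recomputed after each transformation.
# The ('dest', 'op', args...) instruction format is invented for illustration.

code = [
    ('t1', 'const', 4),
    ('t2', 'mul', 't1', 'x'),
    ('t3', 'add', 't2', 't1'),
]

def build_uses(code):
    """One full pass: map each name to the set of instructions that read it."""
    uses = {}
    for i, (dest, op, *args) in enumerate(code):
        for a in args:
            if isinstance(a, str):
                uses.setdefault(a, set()).add(i)
    return uses

def fold_constant(code, uses, name, value):
    """Replace reads of `name` by `value`, updating only the affected chains."""
    for i in uses.pop(name, set()):
        dest, op, *args = code[i]
        code[i] = (dest, op, *[value if a == name else a for a in args])
        # no global reanalysis: only the chains that touch `name` change

uses = build_uses(code)
fold_constant(code, uses, 't1', 4)
print(code)   # [('t1','const',4), ('t2','mul',4,'x'), ('t3','add','t2',4)]
print(uses)   # {'x': {1}, 't2': {2}}
\end{verbatim}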

Many difficulties can be traced to the tension between prototyping and production. Synthesis systems are controversial enough that the need to prototype them early can predispose their evolutionary descendants to later scale troubles. For conventional SE tasks, we know that prototypes are a poor foundation for production code; for knowledge-based systems, evidence that a prototype works should similarly not be treated as evidence that the prototype can be easily extended to a production code by mere incremental addition of knowledge.