Compiling to C

Partial C-ification [tdb95] is a translation framework which `does less instead of doing more' to bring the performance of emulators close to that of native code systems.

Starting from an emulator for a language L written in C, we translate to C a subset of its instruction set (usually frequent and fine-grained instructions which are executed in contiguous sequences) and then simply use a C compiler to generate a single executable program.

A translation threshold allows the programmer to fine-tune the C-ification process empirically by choosing the length of emulator instruction sequence from which translation is enabled. The process uses a reasonable default value and can easily be controlled by the programmer. For example, the directive

:-set_c_threshold(Min,Max).

will ensure that only emulated sequences of length between Min and Max get translated to C. This allows the size/speed tradeoff to be handled gracefully.
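The effect of the threshold can be illustrated with a minimal C sketch. It is not the BinProlog implementation: the function name `select_chunks`, the flag arrays, and the choice to leave runs longer than Max entirely to the emulator (rather than splitting them) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not BinProlog's code.
   translatable[i] != 0 means instruction i belongs to the C-ifiable subset.
   On return, selected[i] != 0 iff instruction i lies in a contiguous run of
   translatable instructions whose length is between min_len and max_len.
   Assumption: runs longer than max_len are left to the emulator, not split.
   Returns the number of runs (chunks) selected for translation. */
size_t select_chunks(const int *translatable, int *selected, size_t n,
                     size_t min_len, size_t max_len)
{
    size_t chunks = 0;
    size_t i = 0;

    assert(min_len <= max_len);
    while (i < n) {
        if (!translatable[i]) {          /* emulator-only instruction */
            selected[i] = 0;
            i++;
            continue;
        }
        size_t start = i;                /* scan one contiguous run */
        while (i < n && translatable[i])
            i++;
        size_t len = i - start;
        int keep = (len >= min_len && len <= max_len);
        for (size_t k = start; k < i; k++)
            selected[k] = keep;
        if (keep)
            chunks++;
    }
    return chunks;
}
```

With Min = 2 and Max = 3, a run of two translatable instructions would be selected, while runs of length one or four would stay emulated.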

Communication between the run-time system (still under the control of the emulator) and the C-ified chunks is handled as follows.

The emulated code representation of a given program (in particular the compiler itself) is mapped to a C data structure which allows exchange of symbol table information at link time.

To be able to call a C-routine from the emulator we have to know its address. Unfortunately, only the linker knows the eventual address of a C-routine. A simple and fully portable technique to plug the address of a C-routine into the byte code is to C-ify the byte code of the emulator into a huge C array of records containing the symbolic addresses of the C-chunks. After compilation and linking with the emulator, the linker will automatically resolve all the missing addresses and generate warnings for the missing C-routines.
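A minimal sketch of this technique in C follows. The record layout, the opcodes, and the names `cell`, `machine`, `chunk_0` and `run` are hypothetical and chosen only for illustration. In a real build the chunk would typically live in a separate file and be declared `extern`; it is defined in place here so the sketch is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical machine state shared by the emulator and the C-chunks. */
typedef struct { long regs[4]; } machine;

typedef void (*chunk_fn)(machine *);

/* One byte-code record: opcode, argument, and (for OP_CALL_CHUNK only)
   the symbolic address of a C-chunk, to be resolved by the linker. */
typedef struct {
    int      opcode;
    long     arg;
    chunk_fn chunk;
} cell;

enum { OP_LOAD, OP_ADD, OP_CALL_CHUNK, OP_HALT };

/* A C-ified chunk (illustrative body). */
void chunk_0(machine *m) { m->regs[0] = m->regs[0] * 2 + 1; }

/* The byte code emitted as C data. The name chunk_0 is only a symbol
   here; its numeric address is filled in at link time. */
const cell bytecode[] = {
    { OP_LOAD,       20, NULL    },
    { OP_CALL_CHUNK,  0, chunk_0 },
    { OP_ADD,         1, NULL    },
    { OP_HALT,        0, NULL    },
};

/* A toy emulator loop: interprets records and calls chunks by address. */
long run(const cell *code)
{
    machine m = { { 0 } };
    for (const cell *pc = code; ; pc++) {
        switch (pc->opcode) {
        case OP_LOAD:       m.regs[0] = pc->arg;  break;
        case OP_ADD:        m.regs[0] += pc->arg; break;
        case OP_CALL_CHUNK: assert(pc->chunk != NULL);
                            pc->chunk(&m);        break;
        case OP_HALT:       return m.regs[0];
        default:            return -1;            /* unknown opcode */
        }
    }
}
```

The point of the design is that neither the emulator nor the code generator ever computes an address: the array initializer names the routine and the toolchain does the rest.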

This array is compiled together with the C code of the emulator into a stand-alone executable whose performance lies in the range between pure emulators and native code implementations.

The method ensures a strong operational equivalence between emulated and translated code, which share exactly the same observables in the run-time system.

An important characteristic is easy debugging of the resulting compiler, coming from the full sharing of the run-time system between emulated and compiled code and the following property we call instruction-level compositionality: if every translated instruction has the same observable effect on a (small) subset of the program state (registers and a few data areas) in emulated and translated mode, then arbitrary sequences of emulated and translated instructions are operationally equivalent.
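Instruction-level compositionality can be checked on a toy machine with the following C sketch. The three instructions, the `state` layout and all function names are hypothetical. The code only tests the property exhaustively for one short program; it does not prove it.

```c
#include <assert.h>

/* Hypothetical observable state: two registers and a small heap. */
#define HEAP_SIZE 8
typedef struct { long reg[2]; long heap[HEAP_SIZE]; int top; } state;

enum { PUT_CONST, PUSH_HEAP, ADD_REGS, N_OPS };
typedef struct { int op; long arg; } instr;

/* Emulated mode: one switch case per instruction. */
void emulate(state *s, instr i)
{
    switch (i.op) {
    case PUT_CONST: s->reg[0] = i.arg; break;
    case PUSH_HEAP: if (s->top < HEAP_SIZE) s->heap[s->top++] = s->reg[0]; break;
    case ADD_REGS:  s->reg[1] += s->reg[0]; break;
    }
}

/* Translated mode: one C function per instruction, same observable effect. */
void t_put_const(state *s, long a) { s->reg[0] = a; }
void t_push_heap(state *s, long a)
{
    (void)a;
    if (s->top < HEAP_SIZE) s->heap[s->top++] = s->reg[0];
}
void t_add_regs(state *s, long a) { (void)a; s->reg[1] += s->reg[0]; }

void (*const translated[N_OPS])(state *, long) = {
    t_put_const, t_push_heap, t_add_regs
};

/* Run prog; instruction k uses translated mode iff bit k of mask is set. */
state run_mixed(const instr *prog, int n, unsigned mask)
{
    state s = { { 0, 0 }, { 0 }, 0 };
    assert(n >= 0 && n < 16);
    for (int k = 0; k < n; k++) {
        assert(prog[k].op >= 0 && prog[k].op < N_OPS);
        if ((mask >> k) & 1u)
            translated[prog[k].op](&s, prog[k].arg);
        else
            emulate(&s, prog[k]);
    }
    return s;
}

/* Compare the observable parts of two states. */
int same_state(const state *a, const state *b)
{
    if (a->reg[0] != b->reg[0] || a->reg[1] != b->reg[1] || a->top != b->top)
        return 0;
    for (int k = 0; k < a->top; k++)
        if (a->heap[k] != b->heap[k]) return 0;
    return 1;
}

/* Returns 1 if all 2^n emulated/translated mixes agree with pure emulation. */
int all_mixes_agree(const instr *prog, int n)
{
    state ref = run_mixed(prog, n, 0u);
    for (unsigned mask = 1; mask < (1u << n); mask++) {
        state s = run_mixed(prog, n, mask);
        if (!same_state(&s, &ref)) return 0;
    }
    return 1;
}
```

Because each instruction is checked for equivalence in isolation, a debugging session can narrow a discrepancy down to a single translated instruction by flipping one bit of the mask at a time.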

Currently C-ification covers term creation on the heap and frequently used inline operations which can be processed in Binary Prolog before calling the `real goal' in the body.

Chunks containing small built-ins that do not require a procedure call will generate `leaf-routines' in C (which are called efficiently and do not use stack space).

On the other hand, large built-ins implemented as macros in the emulator would make code size explode. Implementing them as functions to be called from the C-chunk would require code duplication and would destroy the leaf-routine discipline, which is particularly rewarding on Sparcs. We have chosen to implement them through an abstraction with a coroutining flavor: anti-calls. Note that calling a built-in from a C-chunk is operationally equivalent to the following sequence:

Overall, anti-calls can be seen as a form of coroutining (jumping back and forth) between native and emulated code. Anti-calls can be implemented even more efficiently with the direct-jump technique, although for portability reasons we have chosen a conventional return/call sequence, which is still fairly efficient, as a return/call costs the same as a call/return. Moreover, this allows the chunks to remain leaf-routines, while delegating overflow and signal handling to the emulator. Note that excessively small chunks created as a result of anti-calls are removed by an optimizing step of the compiler, with the net result that such code is left entirely to the emulator. This is of course more compact and provably no slower than its fully C-expanded alternative.
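One plausible reading of the return/call form of an anti-call is sketched below in C. It is an illustration under stated assumptions, not BinProlog's code: the built-in (`OP_SQUARE`), the two chunks and the record layout are hypothetical. A chunk that would need the built-in is split in two. The first half simply returns (the anti-call), the emulator executes the built-in as an ordinary emulated instruction, and the emulator then calls the second half.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared state. */
typedef struct { long reg[2]; } state;
typedef void (*chunk_fn)(state *);

enum { OP_CHUNK, OP_SQUARE, OP_HALT };
typedef struct { int op; chunk_fn fn; } cell;

/* Both halves stay leaf routines: neither calls another function. */
void chunk_before(state *s) { s->reg[0] += 1; }
void chunk_after(state *s)  { s->reg[1] = s->reg[0] + 10; }

const cell anticall_code[] = {
    { OP_CHUNK,  chunk_before },  /* call; its return is the anti-call   */
    { OP_SQUARE, NULL         },  /* built-in runs inside the emulator   */
    { OP_CHUNK,  chunk_after  },  /* call the chunk that follows         */
    { OP_HALT,   NULL         },
};

long run_with_anticall(long x)
{
    state s = { { x, 0 } };
    for (const cell *pc = anticall_code; ; pc++) {
        switch (pc->op) {
        case OP_CHUNK:
            assert(pc->fn != NULL);
            pc->fn(&s);
            break;
        case OP_SQUARE:               /* stands in for a large built-in   */
            s.reg[0] *= s.reg[0];     /* that exists only in the emulator */
            break;
        case OP_HALT:
            return s.reg[1];
        default:
            return -1;                /* unknown opcode */
        }
    }
}
```

In this sketch the built-in's code exists exactly once, in the emulator, so nothing is duplicated, and any overflow or signal checks could be placed in the emulator loop between the two calls.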


