-*- Text -*- $Id: cmpint.txt,v 1.12 2000/03/21 04:29:53 cph Exp $ Copyright (c) 1991-1992 Massachusetts Institute of Technology Documentation of the C interface to MIT Scheme compiled code *DRAFT* Remarks: In the following, whenever Scheme is used, unless otherwise specified, we refer to the MIT Scheme dialect and its CScheme implementation. This file describes the compiled-code data structures and macros defined in cmpint-md.h and required by cmpint.c and cmpgc.h . cmpaux.txt describes the assembly language code that must be written to get all of this to work. cmpint-md.h is the machine dependent header file that defines many of these parameters. A new version must be written for each architecture. The "settable" fields in cmpint-md.h are described in the paragraphs marked with "=>". In the following, word and longword are the size of an item that fills a processor register, typically 32 bits. Halfword is half this size, and byte is typically 8 bits. Description of compiled-code objects and relevant types: The Scheme compiler compiles scode expressions (often procedure definitions) into native code. As its output, it produces Scheme objects that represent compiled expressions. To execute these expressions, they are passed as arguments to scode-eval together with an environment. Typically these expressions will construct some pointers to compiled procedures and define them in the environment. These procedures can then be invoked normally from the read-eval-print loop, from interpreted code, or from other compiled code. In the course of their computation, these procedures will need to call other procedures and then proceed the computation. In order to accomplish this, they will push compiled return addresses on the stack, and these will eventually be popped and "jumped through" to return. Compiled code and objects referenced by it are collected into "vector-like" objects called compiled-code blocks. The above four classes of objects (compiled expressions, compiled procedures, compiled return addresses, and compiled-code blocks) are implemented using two microcode types: TC_COMPILED_CODE_BLOCK is used to implement compiled-code blocks. It is a vector type, that is, it is a pointer type, the word addressed by the pointer contains the length of the vector (not including the length header word), and the rest of the words (length) follow at increasing addresses in memory. Typically the first word after the header is a non-marked-vector header, and the instructions follow it. The non-marked-vector header covers all the instructions, but the vector may contain arbitrary objects after the instructions, covered only by the normal vector header. The optional, additional space at the end of the block is called the "constants" section, since it is used, among other things, to keep copies of constant objects used by the compiled code. See the picture below for a diagram of the typical layout. TC_COMPILED_ENTRY is used to implement compiled expressions, compiled return addresses, compiled procedures, and some other entry points that the compiler and the compiled-code interface need. A compiled entry is a non-standard pointer type described below. Description of compiled entries: The address portion of a compiled entry object points to an instruction in the "middle" of a compiled-code block. In order for the garbage collector to be able to move the whole block as a unit it must be able to determine the address of the first word. Note that this word contains the length of the whole block, so this need not be specified again. The address of the first word of the block can be found from the address of the instruction, and a few bytes, currently a halfword preceding the instruction. These bytes are called the offset field of a compiled entry object, and typically encode the distance (in bytes) between the beginning of the block and the compiled entry. A few bytes preceding the offset field are called the format field and encode the type of compiled entry (procedure vs. expression, etc.) and some type-specific information (number of arguments, offset to next return address on the stack, etc.). The gc-offset field and the format field must be the same size, and their size is determined by the C typedef of format_word at the beginning of cmpint-md.h. Note that, to date, the compiler has only been ported to systems where this size is 2 bytes (for each), but it should be possible to port it to systems where these fields are larger. Encoding of the offset field: The offset field is decoded as follows: If the low order bit is 0 the offset is a simple offset, ie. subtracting the offset from the address of the compiled entry results in the address of the compiled-code block that contains the entry. If the low order bit is 1, it is a continued offset, ie. subtracting the offset from the address of the compiled entry results in the address of another compiled entry whose offset may or may not have a low bit of 0. The rest of the bits (typically 15) are some encoding of the offset: - If instructions can appear at arbitrary byte addresses (including odd addresses), this field is the offset itself. - If instructions have alignment constraints (ie. halfword alignment on MC68K, or longword alignment on many RISCs), this field is the offset shifted appropriately. In this way, no bits are wasted, and the range of offsets is increased. For example, The DEC VAX can have instructions at any byte location. The 15 bits are the offset. The MC68020 can have instructions only at halfword boundaries. The 15 bit field shifted left by 1 is the real offset. Note that in this case, if the low bit of the offset field is 0, the offset field is the real offset. The HP Precision Architecture, and many other RISCs, can have instructions only at longword boundaries. The 15 bit field shifted left by 2 is the real offset. Encoding of the format field: The preceding bytes encode the kind of compiled entry in the following way: The format field is further subdivided into two equal sized halves, used, roughly, for the minimum (high order half) and maximum (low order half) number of arguments that a compiled procedure will accept. Inappropriate values for these numbers of arguments imply that the entry is not a procedure, and then the two halves may be combined to generate information appropriate to the entry type. The examples below assume that the format field is 2 bytes long. - For compiled expressions it is always -1 (0xffff) - For compiled entries it is always -3 or -2 (0xfff[d-e]). It is -2 for compiler generated entries, -3 for compiler-interface generated entries. - For compiled return addresses with saved dynamic links it is always -4 (0xfffc). The next item on the stack is then a dynamic link. - For the special return address `return_to_interpreter' it is always -5 (0xfffb). - For all other compiled return addresses, both halves (bytes) must have their sign bit set, that is, they must appear negative when sign-extended. The remaining bits of the high-order half of the field (all but the sign bit) and all but the two most significant bits of the low-order half of the field (sign bit and adjacent bit), when concatenated, form the offset in the stack to the previous (earlier) return address. This information is used by the debugger to "parse" the stack into frames. The sub-fields are actually concatenated backwards, with the bits from the high order half being the low order bits in the result. If the format field is two bytes long, each half is a single byte, and the valid range for the high-order half is 0x80-0xff, while the valid range for the low-order half is 0x80-0xdf - For compiled procedures, the format field describes the arity (number of parameters) and the format of the frame on the stack: The high order half of the field is (1+ REQ) where REQ is the number of required arguments. Note that REQ must such that the resulting half of the format field does not appear negative! If the format field is two bytes long, REQ must be less than 127. The low order half of the field is given by the expression (* (EXPT -1 REST?) FRAME-SIZE) where FRAME-SIZE is (+ 1 REQ OPT REST?), REQ is as above, OPT is the number of named optional arguments, and REST? is 1 if the procedure has a rest parameter (ie. it is a "lexpr"), or 0 otherwise. FRAME-SIZE must not appear negative, thus if the format field is two bytes long, FRAME-SIZE must be less than 127. Picture of typical compiled-code block and entry: ---------------------------------------- start_address | MANIFEST-VECTOR | tl | ----------------------------------------<---------\ | NM-HEADER | il | \ ----------------------------------------<---\ | | | \ | | | | | | | | | | | | | | some instructions | | | | | | | | | | | | | | | | | | | ---------------------------------------- | | | format_field_1 | offset_field_1 | | | ---------------------------------------- | | entry_address_1 | movel arg0,reg0 | | | ---------------------------------------- | | | | | | | | | | | | > il | | | | | | more instructions | | | | | | | | | | | | | | | ---------------------------------------- | > tl | format_field_2 | offset_field_2 | | | ---------------------------------------- | | entry_address_2 | andl pointer_mask,arg0,reg0 | | | ---------------------------------------- | | | | | | | | | | | | | | | | | | | more instructions | | | | | | | | | | | | | / | /--->----------------------------------------<---/ | / | Scheme object | | | ---------------------------------------- | "cons- | | | | tants" | | | | | | | | < | | | | | more Scheme objects | | section | | | | | | | | | | | | \ | | / \--->----------------------------------------<----------/ Note: The picture above assumes that each machine instruction takes the same space as a scheme object, and that this is also the combined length of the gc-offset and format fields. The type tags are always at the most significant end of the word, which depending on endianness may be at the lowest or highest addressed part of the word in memory. The picture above depicts them on the left. Description of picture: [TC_COMPILED_CODE_BLOCK | start_address] would be the object representing the compiled-code block. [TC_COMPILED_ENTRY | entry_address_1] would represent entry1. [TC_COMPILED_ENTRY | entry_address_2] would represent entry2. 1) Assuming that instructions are longword aligned and that entry_address_1 is close enough to start_address not to need an extension, but entry_address_2 is not, then offset_field_1 = ((entry_address_1 - start_address) >> 1) offset_field_2 = (((entry_address_2 - entry_address_1) >> 1) | 1) note that entry_address_1 - start_address is a multiple of 4 because of the alignment assumption. 2) Assuming that instructions are halfword aligned and that entry_address_1 is close enough to start_address not to need an extension, but entry_address_2 is not, then offset_field_1 = (entry_address_1 - start_address) offset_field_2 = ((entry_address_2 - entry_address_1) | 1) note that entry_address_1 - start_address is a multiple of 2 because of the alignment assumption. 3) Assuming that instructions are byte aligned and that entry_address_1 is close enough to start_address not to need an extension, but entry_address_2 is not, then offset_field_1 = ((entry_address_1 - start_address) << 1) offset_field_2 = (((entry_address_2 - entry_address_1) << 1) | 1) The length of the "constants" section is (tl - il). There are (tl + 1) total words in the object. => Macro PC_ZERO_BITS should be defined to be the number of bits in instruction addresses that are always 0 (0 if no alignment constraints, 1 if halfword, etc.). => format_word should be 'typedefd' to be the size of the descriptor fields. It is assumed that the offset field and the format field are the same size. This definition is unlikely to need modification. Compiled closures: Most compiled procedures are represented as a simple compiled entry pointing to the compiled-code block generated by the compiler. Some procedures, called closures, have free variables whose locations cannot be determined statically by the compiler or the linker. The compiler will generate code to construct a tiny compiled-code block on the fly and make the compiled procedure be an entry point pointing to this dynamically allocated compiled-code block. For example, consider the following code, appearing at top level, (define foo (lambda (x) (lambda (y) (+ x y)))) ;lambda-1 The outer LAMBDA will be represented as a compiled entry pointing to the appropriate block. The inner LAMBDA cannot be since there may be more than one instance, each with its independent value for X: (define foo1 (foo 1)) (define foo2 (foo 2)) Compiled closures are implemented in the following way: The entry corresponding to the procedure points to a jump-to-subroutine (or branch-and-link) instruction. The target of this jump is the code corresponding to the body of the procedure. This code resides in the compiled-code block that the compiler generated. The free variables follow the jump-to-subroutine instruction (after aligning to longword). Using this representation, the caller need not know whether it is invoking a "normal" compiled procedure or a compiled closure. When the closure is invoked normally, it jumps to the real code for the procedure, after leaving a "return address" into the closure object in a standard place (stack or link register). This "return address" is the address of the free variables of the procedure, so the code can reference them by using indirect loads through the "return address". Here is a stylized picture of the situation, where the procedure object (closure entry point) is a pointer to <1>. closure object: +-------------------------------+ | | |
| | | +-------------------------------+ <1> | jsr instruction to <2> | +-------------------------------+ | | +-------------------------------+ compiled code blok produced by the compiler: +-------------------------------+ | | | ... | | | +-------------------------------+ <2> | | | | | | V | | | +-------------------------------+ The code above could be compiled as (in pseudo-assembly language, in which & denotes an immediate value): const format-word:0x0202;gc-offset:?? foo: mov rfree,reg0 mov &[TC_MANIFEST_CLOSURE | 4],reg1 ; gc header mov reg1,0(reg0) mov &[format_field | offset_field],reg1 ; entry descriptor mov reg1,NEXT_WORD(reg0) mov &[JSR absolute opcode],reg1 ; jsr absolute opcode/prefix mov reg1,2*NEXT_WORD(reg0) mova lambda-1,reg1 ; entry point mov reg1,3*NEXT_WORD(reg0) mov arg1,4*NEXT_WORD(reg0) ; x mov &5*NEXT_WORD,reg1 add reg0,reg1,rfree mov &[TC_COMPILED_ENTRY | 2*NEXT_WORD],reg1 add reg0,reg1,retval ret const format-word:0xfffe;gc-offset:?? lambda-1: mov arg1,reg0 ; y mov x_offset(retlnk),reg1 ; x x_offset = 0 add reg1,reg0,reg0 mov reg0,retval ret A more detailed picture of the closure object would be: ---------------------------------------- | MANIFEST-CLOSURE | 4 | ---------------------------------------- | format_field | gc_offset_field | ;format = 0x0202 ---------------------------------------- ;offset = encode(8) entry | JSR absolute opcode | ---------------------------------------- | address of lambda-1 | ---------------------------------------- ;address of retadd retadd | value of x | ; -> retlnk before ---------------------------------------- ; entering lambda-1 The following macros are used to manipulate closure objects: => COMPILED_CLOSURE_ENTRY_SIZE specifies the size of a compiled closure entry (there may be many in a single compiled closure block) in bytes. In the example above this would be 12 bytes (4 total for the format and gc offset fields, 4 for JSR opcode, and 4 for the address of the real entry point). => EXTRACT_CLOSURE_ENTRY_ADDRESS is used to extract the real address of the entry point from a closure object when given the address of the closure entry. Note that the real entry point may be smeared out over multiple instructions. In the example above, given the address the word labeled ENTRY, it would extract the address of LAMBDA-1. => STORE_CLOSURE_ENTRY_ADDRESS is the inverse of EXTRACT_CLOSURE_ENTRY_ADDRESS. That is, given the address of a closure entry point, and a real entry point, it stores the real entry point in the closure object. In the example above, given the address of the word labeled ENTRY, and a different entry point, say for LAMBDA-2, it would make the closure jump to LAMBDA-2 instead. This is used to relocate closures after garbage collection and similar processes. Some caveats: - The code for lambda-1 described above does not match what the compiler would currently generate. The current parameter-passing convention specifies that all the state needed to continue the computation at a procedure's entry point must be on the stack and all the information on the stack must be valid objects (for GC correctness in case of interrupt, more on this below). Thus the contents of retlnk must be pushed on the stack as a valid object, and this is done by reconstructing the closure object whose datum field encodes the address of entry and whose type tag is TC_COMPILED_ENTRY, and then pushing it onto the stack. Note that on some machines, the return address for the subroutine-call instruction is pushed on the stack by the hardware, and thus this value might have to be popped, adjusted, and re-pushed if it cannot be adjusted in place. The code for lambda-1 would then be closer to: lambda-1: sub retlnk,&retadd-entry,retlnk or &[TC_COMPILED_ENTRY | 0],retlnk,retlnk ; set type code push retlnk ; more on this below mov arg1,reg0 mov top_of_stack,reg1 and &[0 | -1],reg1,reg1 ; remove type code mov x_offset+retadd-entry(reg1),reg1 add reg1,reg0,retval pop ; the closure object ret Note that (retadd-entry) is a constant known at compile time, and is the same for the first entry point of all closures. On many machines, the combination sub/or can be obtained with a single add instruction: add &([TC_COMPILED_ENTRY | 0]-(retadd-entry)),retlnk,retlnk This value is called the "magic constant", encoded in the first few instructions of a closure's code. - Multiple closures sharing free variables can share storage by having multiple entry points (multiple JSR instructions) in the closure object. The compiler occasionally merges multiple related closures into single objects. A complication arises when closure entry points are not necessarily long-word aligned, since the compiler expects all variable offsets (like x_offset above) to be long-word offsets. This problem only occurs on machines where instructions are not all long-word aligned and for closures with multiple entry points, since the first entry point is guaranteed to be aligned on a long-word boundary on all machines. The current solution to this problem, on those machines on which it is a problem, is to choose a canonical entry point (the first one) guaranteed to be aligned properly, and push that on the stack on entry to a closure's code. The compiler keeps track of what actual entry point the code belongs to even though the value on the stack may correspond to a different entry point. The "magic constant" becomes an entry-point dependent value, since each return address may have to be bumped back to the first entry point in the closure object rather than to the immediately preceding entry point. Interrupts: Scheme polls for interrupts. That is, interrupt processing is divided into two stages: - When an asynchronous interrupt arrives, the handler (written in C) invoked by the operating system sets a bit in a pending-interrupts mask, stores the relevant information (if any) in a queue, and proceeds the computation where it was interrupted. - The interpreter and compiled code periodically check whether an interrupt is pending and if so, invoke an interrupt handler written in Scheme to process the interrupt. The interpreter checks for interrupts at the apply point. Compiled code currently checks at every procedure entry (including loops) and at every continuation invocation. This may change in the future, although it will always be the case that interrupts will be checked at least once in each iteration of a loop or recursion. Compiled code does not actually check the bits in the mask to determine whether an interrupt is pending. It assumes that the first-level interrupt handler (the handler written in C) not only sets the bits, but also changes the copy of the MemTop (top of consing area) pointer used by the compiler so that it will appear that we have run out of consing room. Thus compiled code merely checks whether the Free pointer (pointer into the heap) is numerically larger than the MemTop pointer, and if so it invokes an assembly-language or C utility that decides whether a garbage collection is needed or an interrupt must be processed. Sometimes this utility will decide that the interrupt need not be processed (it is disabled, for example), and will need to return to the compiled code skipping the interrupt check since otherwise we will get into an infinite loop. The interrupt check code is fixed (so that the handler can determine how much code to skip) and comes in two varieties: closure interrupt code, and normal-entry (other) interrupt code. Normal-entry interrupt code is always the first code in an entry point (procedure or continuation, but not closure code) and merely compares the Free and MemTop pointers and branches. Closure code does this comparison after setting up the closure object. Closure code assumes that the closure object is in the first parameter location (the closure itself is argument 0) so that free variables can be fetched. Thus a closure label must first set this up correctly, and then check for interrupts. In pseudo-assembly language, a "normal" entry might look like gc_or_int LOADI #interrupt-handler-index,rindex LOADA entry,rentry JMP scheme-to-interface format word and gc word for the entry entry CMP Free,MemTop BGE gc_or_int after_entry a "closure" entry might look like (this is not in the closure object, but in the code block at which the closure object points) gc_or_int LOADI #interrupt-handler-index,rindex LOADA entry,rentry JMP scheme-to-interface format word and gc word for the entry entry ADDI offset,retadd,ret_add ; bump ret. add. to entry point ORI #[TC_CLOSURE | 0],ret_add PUSH ret_add ; arguments on the stack CMP Free,MemTop BGE gc_or_int after_entry The following macros are used by the C utility and handler to determine how much code to skip: => ENTRY_SKIPPED_CHECK_OFFSET is the number of bytes between entry and after_entry in a normal entry. => CLOSURE_SKIPPED_CHECK_OFFSET is the number of bytes between entry and after_entry in a closure entry. => ENTRY_PREFIX_LENGTH is the number of bytes between gc_or_int and entry in a normal entry, not counting those taken up by the format word and the gc word. Important considerations: The Scheme compiled code register set includes the current copy of the Free pointer, but does not include the copy of MemTop, although it is mostly constant. The reason is that the C-level interrupt handler does not have convenient access to the register set at the point of the interrupt, and thus would have a hard time changing the version of MemTop used by compiled code at the point of the interrupt. Thus the copy of MemTop used by compiled code is kept in memory. On machines where register-to-memory comparisons can be done directly this is no problem, but on load/store architectures (most RISCs for example), this is not feasible. Furthermore, most RISCs have a few cycles of delay for memory loads, and adjacent instructions may not be interlocked by the hardware. Thus a sequence like LOAD Memory_MemTop,Rtemp CMP Rfree,Rtemp BGE gc_or_int may be very slow and NOPs may have to be inserted explicitly between the LOAD and CMP instructions to make the code work. Since Scheme's interrupt response is not immediate, and polling is frequent, the following sequence can be used instead: CMP Rfree,Rmemtop BGE gc_or_int LOAD Memory_MemTop,Rmemtop Where Rmemtop is a register that holds a recent value of MemTop and is reloaded at every interrupt check. Thus interrupt processing will start at the second interrupt check after the actual interrupt comes in. In other words, if the sequence of entry points executed dynamically is ep1, ep2, ep3, and an asynchronous interrupt occurs between ep1 and ep2, the interrupt handler will not be invoked until ep3, rather than ep2. This instruction sequence eliminates the need to wait for the LOAD to complete, and the LOAD will have completed (or will be handled by the hardware's interlock mechanism) by the next check since at least one instruction (a branch instruction), and often many more, will intervene. Note that this delayed checking does not affect garbage collection interruptions since MemTop is constant between garbage collections, and thus the value being loaded is always the same, in the absence of asynchronous interrupts. Various operating systems allow the signal handler convenient access to the interrupted code's register set. In such a situation, the LOAD instruction can be eliminated and the C-level interrupt handler can modify Rmemtop directly. Rmemtop should be chosen from the caller-saves convention (super-temporary) registers if possible, since these registers must be explicitly saved by the signal handler, rather than implicitly by the calling convention. Interrupts and closures that share storage: If an interrupt arrives on entry to the closure, the correct closure object must be reconstructed so that the computation will continue correctly on return from the interrupt. The code to reconstruct the correct closure is also issued by the compiler, which at compile time maintains the identity of each closure and the distance to the canonical closure used for environment purposes. If the interrupt is dismissed, instead of processed, we need to continue the computation bypassing the interrupt checking code in order to avoid an infinite loop. This is what the macro CLOSURE_SKIPPED_CHECK_OFFSET is used for. We must skip the preamble of the closure code and emulate part of it, that is, adjust the object on top of the stack to be the closure object that the code expects to have there. This can be done by extracting the magic constant from the entry point, and bumping the corresponding return address by this value. The macro ADJUST_CLOSURE_AT_CALL accomplishes this feat on those machines where it is needed. => ADJUST_CLOSURE_AT_CALL, when given an entry point and a location, adjusts the closure object stored at location so that it is the closure object that the entry point expects on top of the stack. On machines where all instructions are long-word aligned, it is a NOP, on other machines (eg. 68k, VAX), it extracts the magic constant from the closure's code, and uses it to construct the appropriate closure object. External calls from compiled code: Many calls in scheme code (and particularly in large programs) are calls to independently compiled procedures or procedures appearing at the top level of a file. All these calls are calls to potentially unknown procedures since the names to which they are bound can be unbound or redefined dynamically at run time. The code issued by the compiler for such an external call must take into account the possibility of the lack of a valid value, run-time definition, and run-time assignment. This is done as follows: For each external procedure called with a fixed number of arguments (more on this below), a small contiguous space is allocated in the "constants" section of the compiled-code block. This space initially contains the name of the external variable whose value is being invoked, and the number of arguments (+ 1 for technical reasons) being passed to the procedure. These locations will be replaced at load time by an absolute jump to the correct entry point of the called procedure if the number of arguments matches and the callee (target procedure) is compiled, or by an absolute jump to some utility code generated on the fly to interface the caller and the callee (called a trampoline procedure). Note that both procedures need not be in the same compiled-code block. The fixed code in the code section of the compiled-code block contains a pc-relative branch instruction to this space allocated in the "constants" section. When the compiled-code block is loaded, a linker that resolves these references and replaces the name and arguments with machine-specific code to do the absolute jump is invoked. The linker also records the locations of all such jump instructions so that a subsequent redefinition or assigment of the same name will cause the jump instruction to be replaced by a new one to the correct value. Note that the number of arguments needs to be checked only by the linker, so no instructions are issued to check it at run time. It is for this reason that the number of arguments is part of the information left by the compiler in the "constants" section. These entries in the "constants" section are called execute caches, operator links, or "UUO" links for historical reasons. They must be large enough to contain the instructions required for an absolute jump (and possibly some delay slot instructions in a RISC-style machine), and the number of arguments passed in the call. This number of arguments is not used in the call sequence, but is used by the linker when initially linking and when relinking. Execute caches are contiguous in the "constants" section, and the whole lot is preceded by a GC header of type TC_LINKAGE_SECTION which contains two fields. The least-significant halfword of the header contains the size in longwords of the execute-cache section (note that each cache entry may take up more than one longword). The remaining bits (ignoring the type code) MUST be 0. If a file makes enough external calls that this halfword field cannot hold the size, the links caches must be separated into multiple blocks each with its own header. Occasionally a procedure is called with more than one number of arguments within the same file. For example, the LIST procedure may be called with three and seven arguments in the same file. In this case there would be two execute caches for LIST. One would correspond to the argument count of three, and the other to seven. As an example, consider the code generated for (sort ) where sort is the "global" procedure sort. The code section would contain push push branch sort-uuo-link In the constants section there would be a label that would contain the following after linking sort-uuo-link: jump sort ; Absolute address for sort 3 ; Number of arguments + 1 Before linking it would contain sort-uuo-link: SORT ; The symbol SORT 3 ; Number of arguments + 1 This assumes that the absolute jump instruction takes one word. If it takes more, the appropriate padding would have to be inserted between the symbol SORT and the number 3. On machines where instructions are not necessarily longword aligned (MC68020 and VAX, for example), the padding bits for the instruction can be used to contain the argument count. Note that the order of the instructions and the count are machine dependent, although typically the instructions precede the count. The following macros are used to manipulate execute caches: => EXECUTE_CACHE_ENTRY_SIZE specifies the length (in longwords) of an execute-cache entry. This includes the size of the instructions and the argument count. For the example above it would be 3, assuming that the jump instruction and the absolute address take two words together (the third is for the argument count). Note that on RISC machines, this size may have to include the size of the branch delay slot instruction. => EXTRACT_EXECUTE_CACHE_ARITY specifies how to read the argument count from an execute-cache entry when given the address of the entry. In the above example, it would extract 3 from the address labeled sort-uuo-link. => EXTRACT_EXECUTE_CACHE_SYMBOL specifies how to read the symbol from an execute-cache entry (before it is actually linked) when given the address of an entry. In the above example, it would extract the symbol SORT from sort-uuo-link. => EXTRACT_EXECUTE_CACHE_ADDRESS fetches the real entry point stored in an execute-cache entry when given the address of the entry. In the above example, it would extract the entry point of the sort procedure when given the address of the jump instruction (labeled as sort-uuo-link). => STORE_EXECUTE_CACHE_ADDRESS is the inverse of this, ie. when given a target entry point and the address of an execute cache entry, it stores the entry point there. In the above example, given a new entry point for sort, and sort-uuo-link, it would modify the jump instruction to jump to the new location. => STORE_EXECUTE_CACHE_CODE stores the fixed instructions (opcodes), if any, in an execute cache cell. If the opcodes depend on the actual target address, this macro should be a NOP, and all the work should be done by STORE_EXECUTE_CACHE_ADDRESS. These two macros are separated to avoid extra work at garbage collection time on architectures where some or all of the code need not change. In the example above, this macro would store the jump opcode. Trampolines: Trampolines are the linker-generated procedures that interface the caller and the callee when they are not directly compatible. They may not be directly compatible because the callee may not exist, may not be a compiled procedure, or may expect arguments in different locations. Trampolines typically call a C or assembly-language procedure to reformat the argument list or invoke the error handler. C procedures are invoked using the scheme_to_interface (and trampoline_to_interface) code described below. A trampoline is similar to a compiled closure in that it is a small compiled-code block with some additional storage needed by the trampoline handler (like the actual procedure being invoked, the variable that is unbound, or the number of arguments being passed). The code typically invokes an out-of-line handler passing it the address of the storage section, and an index into a table of C or assembly language procedures that handle the actual transfer. A typical trampoline looks like ---------------------------------------- block | MANIFEST-VECTOR | 6 | (4 + words of storage) ---------------------------------------- | NM-HEADER | 3 | (fixed) ---------------------------------------- | format_field | offset_field | (fixed) ---------------------------------------- entry | LOADI #index,rindex | (index varies) ---------------------------------------- | JSR trampoline_to_interface | (fixed) ---------------------------------------- retadd | first word of storage | (variable) ---------------------------------------- | second word of storage | (variable) ---------------------------------------- => TRAMPOLINE_ENTRY_SIZE is the size in longwords of the compiled-code portion of a trampoline. It is similar to COMPILED_CLOSURE_ENTRY_SIZE but in longwords, and will typically represent less storage since an absolute address is not needed (or desirable). It must include the format word and the GC offset for the entry. In the example above it would be 3. Note that it must be an integer number of longwords, so the code should be padded if necessary. => TRAMPOLINE_ENTRY_POINT returns the address of the entry point of a trampoline when given the address of a trampoline's block, i.e. the address of the word that will contain the manifest vector header. In the picture above, when given the address of the byte labeled `block', it will return the address of the byte labeled `entry'. => TRAMPOLINE_STORAGE returns the address of the first storage word in a trampoline when given the addres of the first instruction (the entry point of the trampoline). In the picture above it would return the address of the word labeled `retadd' when given the address of the word labeled `entry'. Most versions of cmpint-md.h define the last two macros in the same way, in terms of TRAMPOLINE_BLOCK_TO_ENTRY, which is used only for this purpose. The definitions that use TRAMPOLINE_BLOCK_TO_ENTRY assume that the first instruction of a trampoline is aligned on a longword boundary. If this is the case, you can define TRAMPOLINE_BLOCK_TO_ENTRY appropriately and use the definitions of TRAMPOLINE_ENTRY_POINT and TRAMPOLINE_STORAGE from another machine. TRAMPOLINE_BLOCK_TO_ENTRY is the number of longwords from the start of a trampoline's block (the manifest vector header in the picture above), to the first instruction, which must be longword aligned. This will typically be 3 since there are two scheme header words, and the gc and format word typically take one longword together. => STORE_TRAMPOLINE_ENTRY stores the "compiled" code into an "empty" trampoline. It is given the address of the entry point, and the index of the C procedure to invoke (they are all in a table), and stores the machine code necessary to invoke scheme_to_interface (or trampoline_to_interface), passing the index and the address of the trampoline storage area as parameters. In the example above this macro would store the LOADI (load immediate) and JSR instructions. Compiled code and processor caches: Many modern computers have processor caches that speed up the average memory reference if the code exhibits sufficient locality in its reference patterns. In order to obtain increased performance at a lower cost, many processors have split caches for instructions and data that are not guaranteed to be consistent, ie. they are not necessarily invisible to the programmer. This presents problems for self-modifying code and for dynamic loaders and linkers, since instructions are stored using data references (and therefore the data cache), but the instruction cache may not reflect the updates. Modern hardware with split caches often provides some way to synchronize both caches so that the operating system can guarantee correct operation of newly-loaded programs. The Scheme compiled code support performs some of the same tasks that operating systems do, and therefore runs into these problems. The ways in which the consistency problem arises in the Scheme system are: - Newly allocated instructions. The compiler can be invoked dynamically, compiled code can be loaded dynamically into freshly allocated storage, and compiled closures are created dynamically. The instruction cache must reflect the changes made to memory through the data cache. The operating system's program loader must solve precisely this problem. - Execute caches may change their contents. Execute caches contain jump instructions to the appropriate code, but these instructions may change when the corresponding variables are assigned. If the instruction cache is not updated, the wrong code may be entered on subsequent calls. Operating systems with dynamic linking must solve this problem as well. - Code is moved by the garbage collector, since code space is not separate from data space and static. If the caches are not synchronized after a garbage collection, subsequent instruction fetches may result in the execution of incorrect instructions. The operating system must solve this problem when it re-allocates virtual memory pages. The problem can be solved by synchronizing the caches in the appropriate places. The relevant places in the Scheme system have been identified, and use two machine-dependent macros to synchronize both caches or flush the instruction cache. => FLUSH_I_CACHE is used to flush the portion of the I-cache that Scheme code addresses may be in, or alternatively, to guarantee that the I-cache contains only valid data. It may flush/synchronize the entire I-cache instead, if it is easier. It is used after garbage collections and image loads. => FLUSH_I_CACHE_REGION is used to flush or synchronize a region of the address space from the I-cache. It is given the base address and the number of long-words of the region of memory that has just been modified and whose new contents must be copied into the I-cache for correct execution. It is used after updating an execute cache while running between garbage collections. It is not used during garbage collection since FLUSH_I_CACHE will be used afterwards. These macros need not be defined if it is not needed to flush the cache. A NOP version is provided by the code when they are not defined in cmpint-md.h Note that on some machine/OS combinations, all system calls cause a cache flush, thus an innocuous system call (eg., a time reading call) may be used to achieve this purpose. Many modern machines only make their cache flushing instructions available to the operating system (they are priviledged instructions), and some operating systems provide no system calls to perform this task. In the absence of information on the structure and characteristics of the cache (the information could be used to write flushing routines), the Scheme compiler and system may have to be changed in order to run on such machines. Here is a set of changes that will bypass the problem, at the expense of some functionality and perhaps performance: - Change the entry code for closures and execute caches. The code in execute caches can be changed from jump target to jsr fixed-utility-routine target address where fixed-utility-routine extracts target address from the return address and invokes it. The same change can be made to the closure entry code. This solves the problem of assignment to variables with execute caches. This change can be done quite easily since the format of closures and execute caches is already machine dependent, and all the accessors, constructors, and mutators have been abstracted into macros or can be easily rewritten in the compiler. - Change the storage management scheme to accomodate a code area that is never garbage collected so that code, once placed there, never moves. This would constitue a major change to the system. The main problem that this change would present is the following: Closures are data structures created and dropped on the fly, thus they cannot be allocated from a region of memory that is never reclaimed. Thus closures would have to be allocated from data space, and could no longer contain instructions. This implies that the format of entry points would have to change, since the relevant information would no longer consist of a single address, but two, ie. the address of the code in code space and the address of the data in data space. This would imply many changes to the compiler for there are implicit assumptions throughout that compiled entry points take no space besides the space taken by the code. In particular, simple closures would disappear, and multi-closures would have to be redesigned. Implementation registers and utilities The C runtime support maintains some state variables that Scheme code may need to access. In order to make these variables easily accessible to both languages, all these variables are collected into a contiguous vector (the register block) accesible by both. The C variable "Registers" holds the address of this vector, and while compiled Scheme code is being executed, a processor register also holds the address of the vector. Among other data, the register block contains the memory version of MemTop, the interpreter's expression and environment registers, the interrupt mask and pending interrupt words. In addition, the compiler occasionally needs static memory locations into which it can spill the values contained in processor registers. Rather than using another register to hold the address of the spill locations, these are allocated on the same vector as the register block, and the register that holds the address of the register block can be used to access the spill locations as well. Compiled code also needs to invoke assembly language and C utilities to perform certain tasks that would take too much space to code in-line. Rather than choosing fixed addresses for these routines, or having to update them every time a piece of code is loaded or dumped, a register is reserved to hold the address of one of them, and the distance between them is pre-determined, so that compiled code can invoke any of them by adding an offset to the value in the register and jumping there. On processors with few registers (eg. 68k), it would be wasteful to reserve two registers in this fashion. Both registers are therefore merged. Yet another section of the register block array is reserved for utility procedures, and appropriate jump instructions are placed there so that compiled code can invoke the utilities by jumping into the register block array. The following macros define the sizes of the various areas of the array. None of them need to be defined except to override the default. The default assumes that there are enough processor registers that another one can be reserved to point to the utility handles. => COMPILER_REGBLOCK_N_FIXED is the size of the register block section of the array. It must accomodate at least as many locations as the interpreter expects to have. => COMPILER_REGBLOCK_N_TEMPS is the number of spill locations. => COMPILER_TEMP_SIZE is the size (in long words) of the contents of a floating point register if different from a double. For example, an MC68881 saves registers in 96 bit (3 longword) blocks. The default is fine for most machines. => COMPILER_REGBLOCK_EXTRA_SIZE is the additional size (in longwords) to be reserved for utility handles. It is typically defined the following way as (COMPILER_REGBLOCK_N_HOOKS * COMPILER_HOOK_SIZE). => COMPILER_REGBLOCK_N_HOOKS is the maximum number of utility handles. => COMPILER_HOOK_SIZE is the size in longwords of a utility handle (an absolute jump instruction). => Macro ASM_RESET_HOOK can be used to initialize the register block array. It is invoked at boot time. Miscellany: Macro IN_CMPINT_C, defined in cmpint.c, can be used to conditionally include code (extern procedures) needed by the port. It is only defined when cmpint-md.h is included by cmpint.c . => Macro COMPILER_PROCESSOR_TYPE identifies the processor type. It should be unique for each kind of processor.