

Volume Number: 4
Issue Number: 3
Column Tag: The Mac Hacker
QuickTrap Routines Bypass Trap Dispatcher
By Mike Morton, University of Hawaii
Bypassing the ROM trap dispatcher
In an article a while back, I covered the basics of bypassing the Macintosh trap dispatcher to call ROM routines directly, to speed up calls to the Toolbox and OS. In this article, I’ll present a set of subroutines which implement this technique in a practical way.
The package is written in MPW assembler, and should be easily callable from any of the MPW languages. It’s short and should be portable to other development systems. It also includes a “fail-soft” feature, in case it turns out not to work on some future Macintosh.
A quick review
Programs call the Macintosh Toolbox and Operating System routines by executing “illegal” instructions, which are handed to the trap dispatching code in the ROM. In addition to the time it takes for the 680x0 processor to recover from the emotional trauma of this illegal instruction, the dispatcher must fetch the offending instruction, decode it, and call the routine it specifies. This is very general, since it “hides” ROM locations from the application, but it’s also slow.
With the GetTrapAddress routine, you can calculate the address of a ROM routine just once each time your application runs. Calling that address directly can save you a lot of time, with very little cost in generality.
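As a minimal sketch of that idea, assuming MPW assembler with Apple's Traps.a available and a 128K-or-later ROM: the names SetupFastSetPort, FastSetPort and setPortAddr are hypothetical and are not part of the package presented later. It looks up _SetPort once, then reaches the routine by pushing the cached address and executing RTS, so no registers are disturbed on the way in.

```asm
; Sketch only: look up _SetPort once, then call the ROM routine directly.
; SetupFastSetPort, FastSetPort and setPortAddr are hypothetical names.

            INCLUDE 'Traps.a'           ; Apple's trap definitions (assumed available)

FastPort:   PROC
            EXPORT  SetupFastSetPort
            EXPORT  FastSetPort

; Call once at startup, before any call to FastSetPort.
SetupFastSetPort:
            MOVE.W  #$A873, D0          ; trap word for _SetPort
            _GetTrapAddress newTool     ; A0 := current address of the routine
            LEA     setPortAddr, A1
            MOVE.L  A0, (A1)            ; cache it (impure: stored in code)
            RTS

; Same interface as _SetPort: GrafPtr on the stack, popped by the callee.
FastSetPort:
            MOVE.L  setPortAddr, -(SP)  ; push the cached address ...
            RTS                         ; ... and "return" into the routine

setPortAddr: DC.L   0                   ; filled in by SetupFastSetPort
            ENDPROC
            END
```

The cached address is only as good as the last lookup: if anything replaces the trap afterwards, this sketch keeps calling the old routine.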
What does the dispatcher do?
Here’s the code for the dispatcher in my MacPlus ROM. Your Mac may have something a little different, but all existing Macs seem to be similar in principle. The dispatcher, at address $401F52 in my ROM, disassembles to:
disp:   SUBQ.L  #2, SP              ; add 2 bytes above CCR
        MOVEM.L D1-D2/A2, -(SP)     ; save 12 bytes of regs
        MOVE.L  12+4(SP), A2        ; get PC of trap word
        MOVE.W  (A2)+, D2           ; get A-trap word
        MOVE.L  A2, 12+4(SP)        ; restore updated PC
        MOVE.W  D2, D1              ; copy trap word to D1
        ANDI.W  #$01FF, D2          ; get just trap number
        CMPI.W  #$A800, D1          ; Toolbox or OS?
        BLO.S   doOS                ; jump if OS
        LEA     $0C00, A2           ; point->Toolbox dispatch
        LSL.W   #2, D2              ; scale number->longwords
        MOVE.L  (A2,D2.W), 12(SP)   ; copy address to stack
        CMPI.W  #$AC00, D1          ; "auto-pop" bit set?
        MOVEM.L (SP)+, D1-D2/A2     ; restore regs; leave CCR
        BLO.S   callTB              ; skip if "auto-pop" off
        MOVE.L  (SP)+, (SP)         ; RTS to caller, not glue
callTB: RTS                         ; "call" Toolbox routine

doOS:   LEA     $0400, A2           ; point to OS dispatch
        BCLR    #8, D2              ; clear&test "keep A0" bit
        BNE.S   OSa0                ; skip to allow A0 returned
        LSL.W   #2, D2              ; scale number->longwords
        MOVE.L  (A2,D2.W), A2       ; fetch OS routine address
        MOVEM.L A0-A1, -(SP)        ; save regs (incl A0)
        JSR     (A2)                ; call OS routine
        MOVEM.L (SP)+, A0-A1        ; and restore OS regs
OSrt:   MOVEM.L (SP)+, D1-D2/A2     ; restore OUR regs
        ADDQ.W  #4, SP              ; ignore stacked CCR
        TST.W   D0                  ; preset CCR on result
        RTS                         ; and return

OSa0:   LSL.W   #2, D2              ; scale number->longwords
        MOVE.L  (A2,D2.W), A2       ; fetch OS routine address
        MOVE.L  A1, -(SP)           ; preserve A1, *not* A0
        JSR     (A2)                ; call OS routine
        MOVE.L  (SP)+, A1           ; and restore A1
        BRA.S   OSrt                ; clean up with common code
[An aside: This is the first piece of ROM code I ever read, and I still think it’s a great example of tight 68000 coding. It’s tighter on the Mac II, with indirect addressing available. I can’t see any way to make it faster; can anyone spot a way to save a few bytes, though?]
Besides figuring out which routine to call (using the Toolbox dispatch table at $0C00 or OS table at $0400), the dispatcher also does some other important things. For Toolbox traps, it discards the return address if the “auto-pop” bit is set -- this is useful for “glue”. And for OS traps, it preserves D1, D2, A1 and A2, and sometimes A0. For OS traps, it also passes the trap word to the routine, in D1.
Our task is to make a trap “dispatcher” which does all this, but is much faster. Note, for instance, that the new code must still pass the trap word in D1.w -- I believe this is how some routines test for flag bits set in the word. (For instance, CmpString has a bit to specify if the comparison is case-sensitive.)
Hey, wait a minute! Isn’t it a bad idea to know how one ROM routine (the dispatcher) communicates with all the others? Isn’t code which depends on this interface likely to fall apart when the Mac III hits the streets? Well, first of all, it’d be awfully hard for Apple to change hundreds of routines. But more importantly, there’s a way to back out gracefully. Trust me; we’ll get to it.
An application’s view of the QuickTrap routines
The fundamental speedup is to get rid of the dispatcher, and have one “quick trap” routine for every real routine you’d like fast access to. For instance, if your program does a lot of SetPort calls, you can easily create “qtSetPort”, which has exactly the same interface and does the same thing, only faster. As you might guess, each qtxxx routine caches the address for its routine.
Once, at the beginning of your application, you must call qtEval, which “evaluates” each address and stores it. If you don’t call it, everything will still work -- this is related to the fail-soft scheme.
Other than this, everything works the same as old-style trap routines.
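From the caller's side, the change might look like the following sketch. It assumes the package below has been linked in with its four sample routines (qtSetPort, qtMoveTo, qtHLock, qtHUnlock); DrawDemo is a hypothetical name, and A3 and A4 are assumed to hold a GrafPtr and a handle set up elsewhere.

```asm
; Sketch only: the caller's side. A3 and A4 are assumed to hold a GrafPtr
; and a handle set up elsewhere; DrawDemo is a hypothetical name.

            IMPORT  qtEval
            IMPORT  qtSetPort
            IMPORT  qtMoveTo
            IMPORT  qtHLock

DrawDemo:   PROC    EXPORT
            JSR     qtEval              ; normally done once, at startup

            MOVE.L  A3, -(SP)           ; thePort
            JSR     qtSetPort           ; was: _SetPort

            MOVE.W  #10, -(SP)          ; h
            MOVE.W  #20, -(SP)          ; v
            JSR     qtMoveTo            ; was: _MoveTo

            MOVE.L  A4, A0              ; handle goes in A0, as for the trap
            JSR     qtHLock             ; was: _HLock; result comes back in D0
            RTS
            ENDPROC
```

Parameters go exactly where the trap expects them; only the instruction that makes the call changes.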
Caching problems
Imagine that you spend a lot of time doing FrameOval calls to draw circles on the screen, and would like to speed this up. (Actually, I’m sure the trap time is insignificant compared to the drawing time; this is just an example.) You install “qtFrameOval” and call it instead, and everything works great.
Now your friend gives you this neat, public-domain desk accessory which causes all ovals to be drawn on your screen with smile-faces in them. [Any takers to write this, by the way? You could call it The Smiling Moose.] It does this by altering the FrameOval trap to call it. But since your application never executes that trap, its ovals are drawn unmolested. How can you make sure your ovals are happy?
The answer is to call qtEval at the right times -- not just at initialization but whenever you suspect someone has installed a replacement trap routine. Since the qtxxx routines are supposed to “cache” the real addresses, they must track new addresses when they’re installed, or the cache becomes “stale”.
One way to do this is to call qtEval every time you regain control from a desk accessory, each time you regain control from Switcher or MultiFinder, and each time you invoke an FKEY. Perhaps you’d also have to call it for every SystemTask call. And of course you must call it if your application does any SetTrapAddress calls for the relevant traps. In short, whenever anyone could have changed trap addresses, refresh the cache.
A simpler approach is to change the SetTrapAddress trap by installing a prefix routine which sets a flag in your globals that re-evaluation is needed. If DAs, FKEYs, etc., play by the rules and use SetTrapAddress calls, nobody can make the trap tables get out of sync with your cached addresses.
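A rough sketch of such a prefix routine follows, under several assumptions: MPW assembler with Traps.a, the code living in a locked segment, and hypothetical names throughout (InstallSTAWatch, RemoveSTAWatch, RefreshIfStale, staPrefix, oldSTA, qtStale). The flag is kept next to the code rather than in application globals, in the same impure style as the package itself. Treat it as an illustration, not a tested patch.

```asm
; Sketch only: watch SetTrapAddress and note when the cache may be stale.
; All names here are hypothetical; the code must sit in a locked segment.

            INCLUDE 'Traps.a'           ; Apple's trap definitions (assumed available)
            IMPORT  qtEval

STAWatch:   PROC
            EXPORT  InstallSTAWatch
            EXPORT  RemoveSTAWatch
            EXPORT  RefreshIfStale

; Call once at startup.
InstallSTAWatch:
            MOVE.W  #$A047, D0          ; trap word for _SetTrapAddress
            _GetTrapAddress newOS       ; A0 := current handler
            LEA     oldSTA, A1
            MOVE.L  A0, (A1)            ; remember it so the prefix can chain
            LEA     staPrefix, A0       ; A0 := our prefix routine
            MOVE.W  #$A047, D0          ; reload: D0 was overwritten by the result
            _SetTrapAddress newOS       ; install the prefix
            RTS

; Call before quitting. Assumes nobody patched the trap after we did.
RemoveSTAWatch:
            MOVE.L  oldSTA, A0          ; A0 := handler we displaced
            MOVE.W  #$A047, D0
            _SetTrapAddress newOS
            RTS

; Call from the main loop: re-evaluate only if someone changed a trap.
RefreshIfStale:
            LEA     qtStale, A0
            TST.B   (A0)                ; anything changed?
            BEQ.S   refreshDone         ; no: cache is still good
            SF      (A0)                ; clear the flag first
            JSR     qtEval              ; then refresh the cached addresses
refreshDone: RTS

; The prefix: preserves every register, sets the flag, then continues
; into the routine that was installed before us.
staPrefix:  MOVE.L  A1, -(SP)
            LEA     qtStale, A1
            ST      (A1)                ; flag: cached addresses may be stale
            MOVEA.L (SP)+, A1           ; restore A1
            MOVE.L  oldSTA, -(SP)       ; push the old handler's address ...
            RTS                         ; ... and jump to it

oldSTA:     DC.L    0                   ; previous SetTrapAddress handler
qtStale:    DC.B    0,0                 ; non-zero means "call qtEval"
            ENDPROC
            END
```

This catches only changes made through SetTrapAddress; anything that writes the dispatch tables directly goes unnoticed.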
It’s tempting to call qtEval in your idle-loop as a heavy-handed way to make sure it’s done often enough. I suspect this is a bad idea -- it can cause seemingly random bugs.
One other way: if you use, for instance, qtFrameOval only in some code which doesn’t relinquish control, call qtEval once before each time you enter that code. Remember that qtEval isn’t all that speedy -- it must call GetTrapAddress for every qtxxx routine.
Reasons not to use these routines
Because the routines are JSR’d to, they take up four bytes instead of two. This is no big problem for most applications, but don’t change all your calls.
When you’re debugging, commands to break on traps don’t work, since your application is not executing trap instructions. You can force these traps to occur by disabling the caching; see below for details.
The routines use impure code. You must make sure you put them in a segment which is locked in memory.
Which traps should you replace?
Remember that many traps take so much time that the dispatch isn’t worth improving. Others do next to nothing, and speed up a lot. In early use of these routines at Lotus, we estimated about thirty routines were worth replacing. In the OS world, things like BlockMove and UprString were included. Routines which just twiddled handles are also important, like HLock, HUnlock, HPurge, HNoPurge, and GetHandleSize. Among the Toolbox routines, things like MoveTo and SetPort seemed to help.
Even if a routine is slow, it may be worth tweaking if it’s called a lot. We got measurable improvements substituting for CharWidth, DrawString, StringWidth, and SystemTask.
You can also replace package calls, which is kind of a pain. If you want to change all the FP68K traps to qtFP68K, you have to change Apple’s include files, since each of the SANE macros invokes the trap. Another solution is to just redefine FP68K to be a macro to JSR to the qtxxx routine. But then you have to define a trap like “myFP68K” which still expands to the A-line trap -- this is because the qtxxx routine must have a copy of the trap word.
How much does it help?
As the TV diet ads say, results vary directly with how closely you stick to the plan. Average performance in a large Macintosh product at Lotus was improved by about 5%. A couple of heavily CPU-bound loops were improved by 15%. These aren’t huge gains, but considering that they took only a day or so of work to install in a very large program, they’re pretty good.
When does the warranty run out?
OK, it’s time to face the music. If these routines dive directly into the ROM, they may someday dive into ROM routines in a new machine which expect different parameters. (For more on this topic, see Macintosh Technical Note #110.) Or even if the ROM doesn’t change, some caching problem may come up if your application’s users use some odd way of altering trap addresses and making your cache stale.
The initialization routine qtEval can be easily disabled by modifying resources. For instance, when a user calls to complain that some FKEY or DA doesn’t work with your application, you can quickly change a copy of the application to disable address caching and test if that’s the problem. If it is the problem, you can either distribute the altered application or tell power users how to edit the resources to alter the copy they already have.
The resource used to control caching is QTRP 257. The format is simple: if the resource is present and the first word is zero, caching is enabled. To turn off caching, just remove this resource under ResEdit (or renumber it, to easily restore caching). Remember that programmers may want to disable caching for certain types of debugging when they want to “see” traps under a debugger.
In a future format, a non-zero first word could signal that the resource contains a list of specific traps to be enabled/disabled.
In short, it’s easy to experimentally turn off this hackery to check if it’s causing problems, and easy to turn it off permanently if it is. In tests at Lotus, an application with caching disabled ran less than 1% slower than one which executed traps in the first place. This is the cost of calling a qtxxx routine, which in turn must do the xxx trap anyway because caching is off.
Notes on the code
The routines are intended to be pretty simple; I’ll walk through them and point out a few things.
Code caching: While this stuff works fine on a Mac II, I believe it ought to flush the 68020’s code cache after patching itself. Any recommendations from 68020 gurus out there?
Layout: The Toolbox and OS routines are laid out with symbols defining offsets in them. This is so they can be patched; the symbols must stay in sync with the layout.
Toolbox routines: These start out looking a lot like “glue” routines for a higher-level language -- they use the “auto-pop” bit, so their return is ignored and the trap returns straight to the application routine which called qtxxx. This trap word, plus four bytes of slop, is patched to be JMP <trap address>. Also, before the entry point there’s another copy of the trap word, in case qtEval is called more than once.
OS routines: These are more complicated. In their simple form, they execute the trap and return, because there is no “auto-pop” bit for OS calls. After being patched, they do just what the dispatcher does: save registers, set up D1 and D2 with the trap word and trap number, JSR to the routine, restore registers, test D0.w, and return. Note that the registers saved and restored depend on the trap word -- if bit 8 is clear, then A0 is included in the registers saved.
Bit-coding: The OS routines hard-wire the trap number passed in D1 and D2. If you want to call, for example, NewHandle with the “clear” bit set, you must define two routines: qtNewHandle and qtNewHandleClear (or whatever you want to call them). This is necessary because your JSR qtNewHandle can’t communicate whether it wants the “clear” bit set -- that’s something normally encoded in the trap word.
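One way to get the second routine, sketched here on the assumption that the assembler's OPWORD directive is used to give the flagged trap word its own name: _NewHandleClear is a hypothetical name (the stock trap definitions spell this as _NewHandle with a CLEAR flag), and $A322 is the _NewHandle trap word $A122 with the “clear” bit, $0200, set. The two qtOS lines belong between qtOSStart and qtOSEnd in the listing below.

```asm
; Sketch only: a separate name for NewHandle with the "clear" bit set.
; _NewHandleClear is a hypothetical name, defined here for illustration.

_NewHandleClear OPWORD $A322            ; $A122 (_NewHandle) + $0200 (clear)

; Between qtOSStart and qtOSEnd:
            qtOS    qtNewHandle,_NewHandle
            qtOS    qtNewHandleClear,_NewHandleClear
```

Because qtEval copies the whole trap word into the MOVE.W #xxx, D1 instruction, the patched qtNewHandleClear hands the “clear” bit to the ROM routine just as the dispatcher would.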
Adding routines: The qtTool and qtOS macros do all the work for you. For each one, supply the name of the qtxxx routine you want to define and the _xxx name of the trap it’s going to handle.
Using the routines
Pick some segment which won’t leave memory and change the “SEG” directive to specify it.
Assemble the routines and link them into your program as normal. If you’re using a higher-level language, declare the qtxxx routines to have exactly the same calling interface as the _xxx trap, except that they’re defined externally instead of invoking in-line trap words.
Remember to call qtEval once at startup. And if you want to avoid the cache getting stale, use one of the strategies described above to decide when to call qtEval again.
If you’re using a language which uses glue, you may not be able to easily do this. Write to your language developer and pester them to do it for you.
Comments? Improvements? Letter bombs?
I’d be interested to know how these routines work, how easy they are to install in various development environments, and what kind of performance improvements you see from using them. Drop me a line at P.O. Box 11378, Honolulu, HI 96828.
Since this stuff is stretching the ROM in ways it wasn’t meant to be stretched, I’d also appreciate hearing about the technique in general. Do you think it’s safe? Can you suggest a better way? And if you have improvements, send them in to MacTutor.
; Macintosh Toolbox and OS-trap bypass routines.
; Copyright © 1987 Michael S. Morton
;
; History:  10-May-87 - MM - Initial version.
;           22-Oct-87 - MM - Neatened for publication.

            BLANKS  ON
            STRING  ASIS

            PRINT   OFF
            LOAD    'tlAsmSyms.sym'
            PRINT   ON

; Impure code! Should be in a segment which is locked in memory.
            SEG     'LOCKDSEG'          ; *** change to a locked segment ***

;------------------------------------------------------------------
; Each Toolbox routine starts out life as:
;           2   <pure copy of trap word>
;   entry:  2   <trap word with auto-pop bit>
;           4   <four bytes unused>
;
; The evaluator changes this to:
;           2   <pure copy of trap word>
;   entry:  6   JMP <actual address>
;
; Either form is callable with a JSR because the former
; includes the "auto-pop" bit, so the Toolbox routine returns
; to its caller's caller. Offsets are:

tTrap:      EQU     0                   ; pure copy of trap
tJump:      EQU     2                   ; JMP xxx.L instruction
tAddress:   EQU     4                   ; address to jump to
tLength:    EQU     8                   ; length of one block

; The "qtTool" macro generates code for one "qt" Toolbox routine.
;
; args:     routine - Name for routine.
;                     Typically "qt" plus trap name.
;           trap    - Name of the trap, eg, "_MoveTo".

            MACRO
            qtTool
            EXPORT  &Syslst[1]          ; define the "qtXXX" routine globally
            &Syslst[2]                  ; first, a pure copy of the trap word
&Syslst[1]  &Syslst[2] autoPop          ; entry: if not overwritten, just trap
            ds.b    4                   ; reserve room for overwriting
            ENDM

;----------------------------------------------------------------
; Each OS trap bypass routine is 28 bytes long.
; The unevaluated routine is:
;           2   <pure copy of trap word>
;   entry:  2   <trap word>                 ; only this part
;           2   RTS                         ; gets executed before eval
;           4   MOVE.W #<xxx>, D1           ; trap word patched here
;           4   MOVE.W #<xxx>, D2           ; trap number patched here
;           6   JSR xxx.L                   ; routine address patched here
;           4   MOVEM.L (SP)+, D1-D2/A1-A2  ; register list patched here
;           2   TST.W D0                    ; set condition codes
;           2   RTS                         ; and return
; After the evaluator is done, the routine becomes:
;           2   <pure copy of trap word>
;   entry:  4   MOVEM.L D1-D2/A1-A2, -(SP)  ; (saves A0 too, if bit 8 clear)
;           4   MOVE.W #<trapword>, D1      ; get trap word in D1
;           4   MOVE.W #<trapword & $01FF>, D2  ; and trap number
;           6   JSR xxx.L                   ; call the routine
;           4   MOVEM.L (SP)+, D1-D2/A1-A2  ; (gets A0 too, if bit 8 clear)
;           2   TST.W D0                    ; set condition codes
;           2   RTS                         ; and return

oTrap:      EQU     0                   ; original trap word
oSave:      EQU     2                   ; for MOVEM.L xxx, -(SP) to go
oTrapWord:  EQU     8                   ; for trap word in MOVE.W #xxx, D1
oTrapNum:   EQU     12                  ; for trap number in MOVE.W #xxx, D2
oAddress:   EQU     16                  ; address to call
oRestore:   EQU     22                  ; for second copy of MOVEM regs list
oLength:    EQU     28                  ; length of one block

; The "qtOS" macro generates code for one "qt" OS routine.
;
; args:     routine - Name for trap routine.
;                     Typically "qt" plus trap name.
;           trap    - Name of the trap, eg, "_GetHandleSize".

            MACRO
            qtOS
            EXPORT  &Syslst[1]          ; define the "qtXXX" routine globally
            &Syslst[2]                  ; pure copy of trap word
&Syslst[1]  &Syslst[2]                  ; entry: if not overwritten, just trap
            RTS                         ; and return
            MOVE.W  #$5555, D1          ; get trap word in D1, for OS routine
            MOVE.W  #$5555, D2          ; and trap number
            JSR     $55555555           ; leave space for a longword address
            MOVEM.L (SP)+, D1-D2/A1-A2  ; assume A0 not in the register list
            TST.W   D0                  ; set condition codes
            RTS                         ; and return
            ENDM

;----------------------------------------------------------------
; Resource type and ID for the flag used to disable the trickery.

qtrpType:   EQU     'QTRP'              ; resource type for flag
qtrpId:     EQU     257                 ; resource ID for flag

; qtEval
;
; description:
;   Update the routines so they jump directly into
;   the ROM, or wherever. This routine should be called
;   at startup, and each time the application
;   thinks anyone has (or might have) called SetTrapAddress.
;
; uses: (no registers)

qtEval:     PROC    EXPORT

; Stuff used in patching together routines:
jmpInst:    EQU     $4EF9               ; opcode word of JMP xxx.L
OSregs:     REG     D1-D2/A1-A2         ; registers saved by OS dispatcher
OSregs2:    REG     D1-D2/A0-A2         ; registers saved when bit 8 is zero

            MOVEM.L D0-D2/A0-A2, -(SP)  ; save caller's registers

; First, see if we've already been told not to do our thing:
            LEA     qtEnabled, A2       ; point to the flag
            TST.B   (A2)                ; have we snuffed it already?
            BEQ     evalEnd             ; yes: nothing to do

; Second, decide if the resource flag allows us to map/cache.
            SUBQ    #4, SP              ; make room for function result
            MOVE.L  #qtrpType, -(SP)    ; pass the type
            MOVE.W  #qtrpId, -(SP)      ; and ID
            _GetResource                ; try to find our flag
            MOVE.L  (SP)+, A0           ; pop result
            MOVE.L  A0, D0              ; and test for NIL
            BEQ.S   evalTurnOff         ; no such thing? go flag this and exit

            MOVE.L  (A0), A1            ; deref. handle; point to rsrc with A1
            MOVE.W  (A1), D2            ; pick up first word, to check later
            MOVE.L  A0, -(SP)           ; pass handle
            _ReleaseResource            ; and get rid of it
            TST.W   D2                  ; now check -- did rsrc start with zero?
            BNE.S   evalTurnOff         ; no: we don't yet do selective disable

; Nothing forbids hackery. Evaluate all the
; toolbox bypass routines.
            LEA     qtToolStart, A1     ; point to first routine
            LEA     qtToolEnd, A2       ; point to just after last routine
            MOVE.W  #jmpInst, D1        ; get a JMP xxx.L instruction
            BRA.S   toolEnd             ; check for no routines

toolLp:     MOVE.W  tTrap(A1), D0       ; pick up the trap number in D0.w
            _GetTrapAddress newTool     ; ask where this routine lives
            MOVE.L  A0, tAddress(A1)    ; store address first, THEN
            MOVE.W  D1, tJump(A1)       ; the JMP, so routine's always OK
            ADDQ    #tLength, A1        ; advance to the next routine
toolEnd:    CMP.L   A2, A1              ; at (or past) end of toolbox routines?
            BLO.S   toolLp              ; nope: go evaluate another one

; Evaluate all the OS bypass routines.
            LEA     qtOSStart, A1       ; point to first
            LEA     qtOSEnd, A2         ; and to just after last
            BRA.S   osEnd               ; handle degenerate case

osLoop:     MOVE.W  oTrap(A1), D0       ; pick up trap number
            MOVE.W  D0, D2              ; copy it for later use (BTST, etc.)
            _GetTrapAddress newOS       ; find where the routine lives
            MOVE.L  A0, oAddress(A1)    ; save routine address in JSR xxx.L
            MOVE.W  D2, oTrapWord(A1)   ; fill in MOVE.W #trapword, D1
            AND.W   #$01FF, D2          ; get just the trap number
            MOVE.W  D2, oTrapNum(A1)    ; and store in MOVE.W #trapnum, D2

; Decide whether the saved registers include A0.
            MOVE.L  OSent, D0           ; assume we want usual registers saved
            MOVE.L  OSexit, D1          ; and restored
            BTST    #8, D2              ; but should we save A0, too?
            BNE.S   osLp1               ; nope: OSent's registers are fine
            MOVE.L  OSent2, D0          ; yep: use reg list which includes A0
            MOVE.L  OSexit2, D1         ; and ditto for one which saves A0

osLp1:      MOVE.W  D1, oRestore(A1)    ; store register list for restore
            MOVE.L  D0, oSave(A1)       ; lastly, get rid of 1st trap
            ADD     #oLength, A1        ; advance to the next routine
osEnd:      CMP.L   A2, A1              ; at the end?
            BLO.S   osLoop              ; nope: go do another

evalEnd:    MOVEM.L (SP)+, D0-D2/A0-A2  ; restore caller's registers
            RTS

; Here when the resource forbids caching.
; A2 points to the flag.
evalTurnOff:
            SF      (A2)                ; disable it for faster call next time
            BRA.S   evalEnd             ; clean up and exit

; *** Impure *** flag: 0 means mapping disabled; non-zero means enabled.
qtEnabled:  DC.B    $FF,00              ; initially enabled; 2nd byte to align

; Instructions and register lists to stick into OS routines.
; Each is 2 words.
OSent:      MOVEM.L OSregs, -(SP)
OSent2:     MOVEM.L OSregs2, -(SP)
OSexit:     MOVEM.L (SP)+, OSregs
OSexit2:    MOVEM.L (SP)+, OSregs2

; Toolbox trap replacement routines. To be re-evaluated,
; these must be between qtToolStart and qtToolEnd.
; Nothing else must be in here -- the evaluator
; walks through this as an array.

qtToolStart:                            ; Beginning of Toolbox trap replacement routines.
            qtTool  qtMoveTo,_MoveTo
            qtTool  qtSetPort,_SetPort
            ; add your own here
qtToolEnd:                              ; End of Toolbox trap replacement routines.

; OS trap replacement routines. As with toolbox,
; keep only these in here.

qtOSStart:                              ; Beginning of OS trap replacement routines.
            qtOS    qtHLock,_HLock
            qtOS    qtHUnlock,_HUnlock
            ; add your own here
qtOSEnd:                                ; End of OS trap replacement routines.

            END
