Volume Number:		4
Issue Number:		3
Column Tag:		The Mac Hacker

QuickTrap Routines Bypass Trap Dispatcher

By Mike Morton, University of Hawaii

Bypassing the ROM trap dispatcher

In an article a while back, I covered the basics of bypassing the Macintosh trap dispatcher to call ROM routines directly, to speed up calls to the Toolbox and OS. In this article, I’ll present a set of subroutines which implement this technique in a practical way.

The package is written in MPW assembler, and should be easily callable from any of the MPW languages. It’s short and should be portable to other development systems. It also includes a “fail-soft” feature, in case it turns out not to work on some future Macintosh.

A quick review

Programs call the Macintosh Toolbox and Operating System routines by executing “illegal” instructions, which are handed to the trap dispatching code in the ROM. In addition to the time it takes for the 680x0 processor to recover from the emotional trauma of this illegal instruction, the dispatcher must fetch the offending instruction, decode it, and call the routine it specifies. This is very general, since it “hides” ROM locations from the application, but it’s also slow.

With the GetTrapAddress routine, you can calculate the address of a ROM routine just once each time your application runs. Calling that address directly can save you a lot of time, with very little cost in generality.
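
To make the idea concrete, here is a minimal sketch of caching one address by hand. It is separate from the package presented later. It assumes the MPW trap macros, assumes that $0073 is the trap number of _SetPort, and assumes the code sits in a locked segment, since it stores the address inside itself. The names initFast, fastSetPort and setPortAddr are invented for the example.

```asm
; Hypothetical sketch: look up one Toolbox address once, then call it directly.

initFast:
 MOVE.W  #$0073, D0          ; trap number of _SetPort (assumed value)
 _GetTrapAddress             ; D0.w = trap number in, A0 = address out
 LEA     setPortAddr, A1     ; point to our variable
 MOVE.L  A0, (A1)            ; remember the address for later calls
 RTS

; Same stack interface as _SetPort: caller pushes the GrafPtr, then JSRs here.
fastSetPort:
 MOVE.L  setPortAddr, A0     ; fetch the cached address
 JMP     (A0)                ; the routine's own RTS returns to our caller

setPortAddr:
 DC.L    0                   ; impure: filled in by initFast
```

Call initFast once at startup; after that, JSR fastSetPort reaches the routine without going through the dispatcher.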

What does the dispatcher do?

Here’s the code for the dispatcher in my MacPlus ROM. Your Mac may have something a little different, but all existing Macs seem to be similar in principle. The dispatcher, at address $401F52 in my ROM, disassembles to:

disp:
 SUBQ.L  #2, SP              ; add 2 bytes above CCR
 MOVEM.L D1-D2/A2, -(SP)     ; save 12 bytes of regs
 MOVE.L  12+4(SP), A2        ; get PC of trap word
 MOVE.W  (A2)+, D2           ; get A-trap word
 MOVE.L  A2, 12+4(SP)        ; restore updated PC
 MOVE.W  D2, D1              ; copy trap word to D1
 ANDI.W  #$01FF, D2          ; get just trap number
 CMPI.W  #$A800, D1          ; Toolbox or OS?
 BLO.S   doOS                ; jump if OS
 LEA     $0C00, A2           ; point->Toolbox dispatch
 LSL.W   #2, D2              ; scale number->longwords
 MOVE.L  (A2,D2.W), 12(SP)   ; copy address to stack
 CMPI.W  #$AC00, D1          ; “auto-pop” bit set?
 MOVEM.L (SP)+, D1-D2/A2     ; restore regs; leave CCR
 BLO.S   tBox                ; skip if “auto-pop” off
 MOVE.L  (SP)+, (SP)         ; RTS to caller, not glue
tBox:    RTS                 ; “call” Toolbox routine

doOS:
 LEA     $0400, A2           ; point to OS dispatch
 BCLR    #8, D2              ; clear&test “keep A0” bit
 BNE.S   OSa0                ; skip to allow A0 returned
 LSL.W   #2, D2              ; scale number->longwords
 MOVE.L  (A2,D2.W), A2       ; fetch OS routine address
 MOVEM.L A0-A1, -(SP)        ; save regs (incl A0)
 JSR     (A2)                ;   call OS routine
 MOVEM.L (SP)+, A0-A1        ;    and restore OS regs
OSrt:
 MOVEM.L (SP)+, D1-D2/A2     ; restore OUR regs
 ADDQ.W  #4, SP              ; ignore stacked CCR
 TST.W   D0                  ; preset CCR on result
 RTS                         ; and return

OSa0:
 LSL.W   #2, D2              ; scale number->longwords
 MOVE.L  (A2,D2.W), A2       ; fetch OS routine address
 MOVE.L  A1, -(SP)           ; preserve A1, *not* A0
 JSR     (A2)                ;   call OS routine
 MOVE.L  (SP)+, A1           ;    and restore A1
 BRA.S   OSrt                ; clean up with common code

[An aside: This is the first piece of ROM code I ever read, and I still think it’s a great example of tight 68000 coding. It’s tighter on the Mac II, with indirect addressing available. I can’t see any way to make it faster; can anyone spot a way to save a few bytes, though?]

Besides figuring out which routine to call (using the Toolbox dispatch table at $0C00 or OS table at $0400), the dispatcher also does some other important things. For Toolbox traps, it discards the return address if the “auto-pop” bit is set -- this is useful for “glue”. For OS traps, it preserves D1, D2, A1 and A2, and sometimes A0. For OS traps, it also passes the trap word to the routine in D1, so the routine can see the flag bits as well as the trap number.

Our task is to make a trap “dispatcher” which does all this, but is much faster. Note, for instance, that the new code must still pass the trap word in D1.w -- I believe this is how some routines test for flag bits set in the word. (For instance, CmpString has a bit to specify if the comparison is case-sensitive.)

Hey, wait a minute! Isn’t it a bad idea to know how one ROM routine (the dispatcher) communicates with all the others? Isn’t code which depends on this interface likely to fall apart when the Mac III hits the streets? Well, first of all, it’d be awfully hard for Apple to change hundreds of routines. But more importantly, there’s a way to back out gracefully. Trust me; we’ll get to it.

An application’s view of the QuickTrap routines

The fundamental speedup is to get rid of the dispatcher, and have one “quick trap” routine for every real routine you’d like fast access to. For instance, if your program does a lot of SetPort calls, you can easily create “qtSetPort”, which has exactly the same interface and does the same thing, only faster. As you might guess, each qtxxx routine caches the address for its routine.

Once, at the beginning of your application, you must call qtEval, which “evaluates” each address and stores it. If you don’t call it, everything will still work -- this is related to the fail-soft scheme.

Other than this, everything works the same as old-style trap routines.

Caching problems

Imagine that you spend a lot of time doing FrameOval calls to draw circles on the screen, and would like to speed this up. (Actually, I’m sure the trap time is insignificant compared to the drawing time; this is just an example.) You install “qtFrameOval” and call it instead, and everything works great.

Now your friend gives you this neat, public-domain desk accessory which causes all ovals to be drawn on your screen with smile-faces in them. [Any takers to write this, by the way? You could call it The Smiling Moose.] It does this by altering the FrameOval trap to call it. But since your application never executes that trap, its ovals are drawn unmolested. How can you make sure your ovals are happy?

The answer is to call qtEval at the right times -- not just at initialization but whenever you suspect someone has installed a replacement trap routine. Since the qtxxx routines are supposed to “cache” the real addresses, they must track new addresses when they’re installed, or the cache becomes “stale”.

One way to do this is to call qtEval every time you regain control from a desk accessory, each time you regain control from Switcher or MultiFinder, and each time you invoke an FKEY. Perhaps you’d also have to call it for every SystemTask call. And of course you must call it if your application does any SetTrapAddress calls for the relevant traps. In short, whenever anyone could have changed trap addresses, refresh the cache.

A simpler approach is to change the SetTrapAddress trap by installing a prefix routine which sets a flag in your globals that re-evaluation is needed. If DAs, FKEYs, etc., play by the rules and use SetTrapAddress calls, nobody can make the trap tables get out of sync with your cached addresses.
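
A prefix routine of this kind might be sketched as follows. This is an illustration under stated assumptions, not part of the package: it assumes $47 is the trap number of _SetTrapAddress, keeps its variables in the code (so it needs a locked segment), and uses invented names (installWatch, watchSetTrap, checkCache, oldSetTrap, needEval). A real application would presumably also put the original address back before it quits.

```asm
; Hypothetical sketch: mark the cache stale whenever SetTrapAddress is called.

 IMPORT  qtEval

setTrapNum: EQU $47              ; trap number of _SetTrapAddress (assumed)

installWatch:
 MOVE.W  #setTrapNum, D0
 _GetTrapAddress                 ; A0 = current SetTrapAddress routine
 LEA     oldSetTrap, A1
 MOVE.L  A0, (A1)                ; keep it so the prefix can chain to it
 LEA     watchSetTrap, A0        ; A0 = our prefix routine
 MOVE.W  #setTrapNum, D0
 _SetTrapAddress                 ; install the prefix
 RTS

; Prefix: runs in place of SetTrapAddress, then continues to the original.
; It leaves D0 and A0 (the caller's arguments) untouched.
watchSetTrap:
 MOVE.L  A1, -(SP)               ; borrow A1
 LEA     needEval, A1
 ST      (A1)                    ; non-zero: re-evaluation is needed
 MOVE.L  (SP)+, A1
 MOVE.L  oldSetTrap, -(SP)       ; push the original routine's address
 RTS                             ; "return" into it; it returns for us

; Call this from the main event loop.
checkCache:
 LEA     needEval, A0
 TST.B   (A0)                    ; has anyone changed a trap since last time?
 BEQ.S   cacheOK                 ; no: cached addresses are still good
 SF      (A0)                    ; yes: clear the flag first,
 JSR     qtEval                  ;  then refresh the cache
cacheOK:
 RTS

oldSetTrap: DC.L 0               ; impure: original SetTrapAddress address
needEval:   DC.B 0,0             ; impure flag; 2nd byte to align
```

Because qtEval only calls GetTrapAddress, refreshing the cache does not set the flag again.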

It’s tempting to call qtEval in your idle-loop as a heavy-handed way to make sure it’s done often enough. I suspect this is a bad idea -- it can cause seemingly random bugs.

One other way: if you use, for instance, qtFrameOval only in some code which doesn’t relinquish control, call qtEval once before each time you enter that code. Remember that qtEval isn’t all that speedy -- it must call GetTrapAddress for every qtxxx routine.

Reasons not to use these routines

Because each routine is reached with a JSR, every call takes up four bytes instead of the two a trap word needs. This is no big problem for most applications, but don’t change all your calls.

When you’re debugging, commands to break on traps don’t work, since your application is not executing trap instructions. You can force these traps to occur by disabling the caching; see below for details.

The routines use impure code. You must make sure you put them in a segment which is locked in memory.

Which traps should you replace?

Remember that many traps take so much time that the dispatch isn’t worth improving. Others do next to nothing, and speed up a lot. In early use of these routines at Lotus, we estimated about thirty routines were worth replacing. In the OS world, things like BlockMove and UprString were included. Routines which just twiddled handles are also important, like HLock, HUnlock, HPurge, HNoPurge, and GetHandleSize. Among the Toolbox routines, things like MoveTo and SetPort seemed to help.

Even if a routine is slow, it may be worth tweaking if it’s called a lot. We got measurable improvements substituting for CharWidth, DrawString, StringWidth, and SystemTask.

You can also replace package calls, which is kind of a pain. If you want to change all the FP68K traps to qtFP68K, you have to change Apple’s include files, since each of the SANE macros invokes the trap. Another solution is to just redefine FP68K to be a macro to JSR to the qtxxx routine. But then you have to define a trap like “myFP68K” which still expands to the A-line trap -- this is because the qtxxx routine must have a copy of the trap word.

How much does it help?

As the TV diet ads say, results vary directly with how closely you stick to the plan. Average performance in a large Macintosh product at Lotus was improved by about 5%. A couple of heavily CPU-bound loops were improved by 15%. These aren’t huge gains, but considering that they took only a day or so of work to install in a very large program, they’re pretty good.

When does the warranty run out?

OK, it’s time to face the music. If these routines dive directly into the ROM, they may someday dive into ROM routines in a new machine which expect different parameters. (For more on this topic, see Macintosh Technical Note #110.) Or even if the ROM doesn’t change, some caching problem may come up if your application’s users use some odd way of altering trap addresses and making your cache stale.

The initialization routine qtEval can be easily disabled by modifying resources. For instance, when a user calls to complain that some FKEY or DA doesn’t work with your application, you can quickly change a copy of the application to disable address caching and test if that’s the problem. If it is the problem, you can either distribute the altered application or tell power users how to edit the resources to alter the copy they already have.

The resource used to control caching is QTRP 257. The format is simple: if the resource is present and the first word is zero, caching is enabled. To turn off caching, just remove this resource with ResEdit (or renumber it, to easily restore caching). Remember that programmers may want to disable caching for certain types of debugging when they want to “see” traps under a debugger.

In a future format, a non-zero first word could signal that the resource contains a list of specific traps to be enabled/disabled.

In short, it’s easy to experimentally turn off this hackery to check if it’s causing problems, and easy to turn it off permanently if it is. In tests at Lotus, an application with caching disabled ran less than 1% slower than one which executed traps in the first place. This is the cost of calling a qtxxx routine, which in turn must do the xxx trap anyway because caching is off.

Notes on the code

The routines are intended to be pretty simple; I’ll walk through them and point out a few things.

Code caching: While this stuff works fine on a Mac II, I believe it ought to flush the 68020’s code cache after patching itself. Any recommendations from 68020 gurus out there?

Layout: The Toolbox and OS routines are laid out with symbols defining offsets in them. This is so they can be patched; the symbols must stay in sync with the layout.

Toolbox routines: These start out looking a lot like “glue” routines for a higher-level language -- they use the “auto-pop” bit, so their return is ignored and the trap returns straight to the application routine which called qtxxx. This trap word, plus four bytes of slop, is patched to be JMP <trap address>. Also, before the entry point there’s another copy of the trap word, in case qtEval is called more than once.

OS routines: These are more complicated. In their simple form, they execute the trap and return, because there is no “auto-pop” bit for OS calls. After being patched, they do just what the dispatcher does: save registers, set up D1 and D2 with the trap word and trap number, JSR to the routine, restore registers, test D0.w, and return. Note that the registers saved and restored depend on the trap word -- if bit 8 is clear, then A0 is included in the registers saved; if it is set, the routine returns a value in A0, so A0 is left alone.
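
As a worked example, here is roughly what one OS block looks like once it has been patched. It assumes _HLock is trap word $A029; the JSR operand is a placeholder, since the real address is whatever GetTrapAddress returns on the machine at hand.

```asm
; Worked example (assumes _HLock = $A029):
;   trap number = $A029 & $01FF = $0029
;   bit 8 of $A029 is clear, so A0 is saved and restored,
;   just as the dispatcher's doOS path does.

 DC.W    $A029                   ; pure copy of trap word, never executed
qtHLock:
 MOVEM.L D1-D2/A0-A2, -(SP)      ; save what the dispatcher would save
 MOVE.W  #$A029, D1              ; trap word, for routines that test flag bits
 MOVE.W  #$0029, D2              ; trap number
 JSR     $55555555               ; placeholder for the address from GetTrapAddress
 MOVEM.L (SP)+, D1-D2/A0-A2      ; restore the same registers
 TST.W   D0                      ; condition codes reflect the result code
 RTS
```

For a trap with bit 8 set, such as _NewHandle (assumed to be $A122), both register lists would be D1-D2/A1-A2 instead, so the handle returned in A0 survives.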

Bit-coding: The OS routines hard-wire the trap number passed in D1 and D2. If you want to call, for example, NewHandle with the “clear” bit set, you must define two routines: qtNewHandle and qtNewHandleClear (or whatever you want to call them). This is necessary because your JSR qtNewHandle can’t communicate whether it wants the “clear” bit set -- that’s something normally encoded in the trap word.
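
One way to get the second routine is sketched below. It assumes your assembler defines trap names with the OPWORD directive, as MPW's trap include file does, that _NewHandle is $A122, and that the “clear” flag is bit 9 ($0200). The name _NewHandleClear is invented here; it is not one of Apple's.

```asm
; Hypothetical sketch: a separate trap name for NewHandle with "clear" set.
_NewHandleClear OPWORD $A322     ; $A122 + $0200 (assumed values)

; These two lines belong between qtOSStart and qtOSEnd:
 qtOS qtNewHandle,_NewHandle
 qtOS qtNewHandleClear,_NewHandleClear
```

After qtEval runs, qtNewHandleClear loads $A322 into D1 before calling the routine, so the routine sees the “clear” bit just as if the trap had been executed.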

Adding routines: The qtTool and qtOS macros do all the work for you. For each one, supply the name of the qtxxx routine you want to define and the _xxx name of the trap it’s going to handle.

Using the routines

Pick some segment which won’t leave memory and change the “SEG” directive to specify it.

Assemble the routines and link them into your program as normal. If you’re using a higher-level language, declare the qtxxx routines to have exactly the same calling interface as the _xxx trap, except that they’re defined externally instead of invoking in-line trap words.

Remember to call qtEval once at startup. And if you want to avoid the cache getting stale, use one of the strategies described above to decide when to call qtEval again.
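
From assembly, using the routines might look like the sketch below. The names startup, useThem and lockFailed, and the application globals myPort and myHandle, are made up for the example.

```asm
; Hypothetical sketch of an application using the package.

 IMPORT  qtEval, qtSetPort, qtHLock

startup:
 JSR     qtEval                  ; once, after the usual Toolbox initialization
 RTS

useThem:
 MOVE.L  myPort(A5), -(SP)       ; push the GrafPtr, exactly as for _SetPort
 JSR     qtSetPort               ; instead of the _SetPort trap

 MOVE.L  myHandle(A5), A0        ; OS routines take their arguments in registers
 JSR     qtHLock                 ; instead of _HLock; result code comes back in D0
 BNE.S   lockFailed              ; condition codes are set from D0.w, as after a trap
 RTS

lockFailed:
 RTS                             ; handle the error here
```

If qtEval is never called, or caching is disabled, the same calls still work; they just execute the trap inside the qtxxx routine.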

If you’re using a language which uses glue, you may not be able to easily do this. Write to your language developer and pester them to do it for you.

Comments? Improvements? Letter bombs?

I’d be interested to know how these routines work, how easy they are to install in various development environments, and what kind of performance improvements you see from using them. Drop me a line at P.O. Box 11378, Honolulu, HI 96828.

Since this stuff is stretching the ROM in ways it wasn’t meant to be stretched, I’d also appreciate hearing about the technique in general. Do you think it’s safe? Can you suggest a better way? And if you have improvements, send them in to MacTutor.

; Macintosh Toolbox and OS-trap bypass routines.
; Copyright © 1987 Michael S. Morton
;
; History: 10-May-87 - MM - Initial version.
;          22-Oct-87 - MM - Neatened for publication.

 BLANKS ON
 STRING ASIS
 PRINT  OFF
 LOAD   'tlAsmSyms.sym'
 PRINT  ON

 ; Impure code!  Should be in a segment which is locked in memory.
 SEG    'LOCKDSEG' ; *** change to a locked segment ***

;------------------------------------------------------------------
; Each Toolbox routine starts out life as:
;        2 <pure copy of trap word>
; entry: 2 <trap word with auto-pop bit>
;        4 <four bytes unused>
;
; The evaluator changes this to:
;        2 <pure copy of trap word>
; entry: 6 JMP    <actual address>
;
; Either form is callable with a JSR because the former 
; includes the “auto-pop” bit, so the Toolbox routine returns 
; to its caller’s caller.  Offsets are:

tTrap:    EQU 0    ; pure copy of trap
tJump:    EQU 2    ; JMP xxx.L instruction
tAddress: EQU 4    ; address to jump to
tLength:  EQU 8    ; length of one block

;   The “qtTool” macro generates code for one “qt” Toolbox routine.
;
; args:     routine - Name for routine.
;                     Typically “qt” plus trap name.
;           trap    - Name of the trap, eg, “_MoveTo”.

 MACRO
 qtTool
 EXPORT &Syslst[1]          ; define the “qtXXX” routine globally
 &Syslst[2]                 ; first, a pure copy of the trap word
&Syslst[1] &Syslst[2] autoPop ; entry: if not overwritten, just trap
 ds.b 4                     ; reserve room for overwriting
 ENDM

;----------------------------------------------------------------
; Each OS trap bypass routine is 28 bytes long.
; The unevaluated routine is:
;        2 <pure copy of trap word>
; entry: 2 <trap word>                     ; only this part
;        2 RTS                             ;  gets executed before eval
;        4 MOVE.W  #<xxx>, D1              ; trap word patched here
;        4 MOVE.W  #<xxx>, D2              ; trap number patched here
;        6 JSR     xxx.L                   ; routine address patched here
;        4 MOVEM.L (SP)+, D1-D2/A1-A2      ; register list patched here
;        2 TST.W   D0                      ; set condition codes
;        2 RTS                             ; and return
; After the evaluator is done, the routine becomes:
;        2 <pure copy of trap word>
; entry: 4 MOVEM.L D1-D2/A1-A2, -(SP)      ; (saves A0 too, if bit 8 clear)
;        4 MOVE.W  #<trapword>, D1         ; get trap word in D1
;        4 MOVE.W  #<trapword & $01FF>, D2 ;  and trap number
;        6 JSR     xxx.L                   ; call the routine
;        4 MOVEM.L (SP)+, D1-D2/A1-A2      ; (gets A0 too, if bit 8 clear)
;        2 TST.W   D0                      ; set condition codes
;        2 RTS                             ; and return

oTrap:     EQU 0    ; original trap word
oSave:     EQU 2    ; for MOVEM.L xxx, -(SP) to go
oTrapWord: EQU 8    ; for trap word in MOVE.W #xxx, D1
oTrapNum:  EQU 12   ; for trap number in MOVE.W #xxx, D2
oAddress:  EQU 16   ; address to jump to
oRestore:  EQU 22   ; for second copy of MOVEM regs list
oLength:   EQU 28   ; length of one block

;   The “qtOS” macro generates code for one “qt” OS routine.
;
; args:     routine - Name for trap routine.
;                     Typically “qt” plus trap name.
;           trap    - Name of the trap, eg, “_GetHandleSize”.

 MACRO
 qtOS
 EXPORT &Syslst[1]          ; define the “qtXXX” routine globally
 &Syslst[2]                 ; pure copy of trap word
&Syslst[1] &Syslst[2]       ; entry: if not overwritten, just trap
 RTS                        ;  and return
 MOVE.W  #$5555, D1         ; get trap word in D1, for OS routine
 MOVE.W  #$5555, D2         ;  and trap number
 JSR     $55555555          ; leave space for a longword address
 MOVEM.L (SP)+, D1-D2/A1-A2 ; assume A0 not in the register list
 TST.W   D0                 ; set condition codes
 RTS                        ; and return
 ENDM

;----------------------------------------------------------------
; Resource type and ID for the flag used to disable the trickery.

qtrpType: EQU 'QTRP'  ; resource type for flag
qtrpId:   EQU 257     ; resource ID for flag


;   qtEval
;
; description:
;Update the routines so they jump directly into 
;  the ROM, or wherever. This routine should be called 
;   at startup, and each time the application
;thinks anyone has (or might have) called SetTrapAddress.
;
; uses: (no registers)

qtEval: PROC EXPORT

; Stuff used in patching together routines:
jmpInst: EQU $4EF9         ; opcode word of JMP xxx.L

OSregs:  REG D1-D2/A1-A2   ; registers saved by OS dispatcher
OSregs2: REG D1-D2/A0-A2   ; registers saved when bit 8 is zero

 MOVEM.L D0-D2/A0-A2, -(SP)  ; save caller’s registers

 ; First, see if we’ve already been told not to do our thing:
 LEA     qtEnabled, A2       ; point to the flag
 TST.B   (A2)                ; have we snuffed it already?
 BEQ     evalEnd             ; yes: nothing to do

 ; Second, decide if the resource flag allows us to map/cache.
 SUBQ    #4, SP              ; make room for function result
 MOVE.L  #qtrpType, -(SP)    ; pass the type
 MOVE.W  #qtrpId, -(SP)      ;  and ID
 _GetResource                ; try to find our flag
 MOVE.L  (SP)+, A0           ; pop result
 MOVE.L  A0, D0              ;  and test for NIL
 BEQ.S   evalTurnOff         ; no such thing?  go flag this and exit

 MOVE.L  (A0), A1            ; deref. handle; point to rsrc with A1
 MOVE.W  (A1), -(SP)         ; stash first word on the stack; Toolbox calls may trash D2
 MOVE.L  A0, -(SP)           ; pass handle
 _ReleaseResource            ;  and get rid of it

 TST.W   (SP)+               ; now check -- did rsrc start with zero?
 BNE.S   evalTurnOff         ;  no: we don’t yet do selective disable

 ; Nothing forbids hackery.  Evaluate all the
 ; toolbox bypass routines.

 LEA     qtToolStart, A1     ; point to first routine
 LEA     qtToolEnd, A2       ; point to just after last routine
 MOVE.W  #jmpInst, D1        ; get a JMP xxx.L instruction
 BRA.S   toolEnd             ; check for no routines

toolLp:  MOVE.W tTrap(A1), D0 ; pick up the trap number in D0.w
 _GetTrapAddress newTool     ; ask where this routine lives
 MOVE.L  A0, tAddress(A1)    ; store address first, THEN
 MOVE.W  D1, tJump(A1)       ;  the JMP, so routine’s always OK
 ADDQ    #tLength, A1        ; advance to the next routine

toolEnd: CMP.L  A2, A1       ; at (or past) end of toolbox routines?
 BLO.S   toolLp              ; nope: go evaluate another one

 ; Evaluate all the OS bypass routines.
 LEA     qtOSStart, A1       ; point to first
 LEA     qtOSEnd, A2         ;  and to just after last
 BRA.S   osEnd               ; handle degenerate case

osLoop:  MOVE.W oTrap(A1), D0 ; pick up trap number
 MOVE.W  D0, D2              ; copy it for later use (BTST, etc.)
 _GetTrapAddress newOS       ; find where the routine lives
 MOVE.L  A0, oAddress(A1)    ; save routine address in JSR xxx.L

 MOVE.W  D2, oTrapWord(A1)   ; fill in MOVE.W #trapword, D1
 AND.W   #$01FF, D2          ; get just the trap number
 MOVE.W  D2, oTrapNum(A1)    ; and store in MOVE.W #trapnum, D2

 ; Decide whether the saved registers include A0.
 MOVE.L  OSent, D0           ; assume we want usual registers saved
 MOVE.L  OSexit, D1          ;  and restored
 BTST    #8, D2              ; but should we save A0, too?
 BNE.S   osLp1               ; nope: OSent’s registers are fine
 MOVE.L  OSent2, D0          ; yep: use reg list which includes A0
 MOVE.L  OSexit2, D1         ;  and ditto for one which restores A0
osLp1:   MOVE.W D1, oRestore(A1) ; store register list for restore
 MOVE.L  D0, oSave(A1)       ; lastly, get rid of 1st trap
 ADD     #oLength, A1        ; advance to the next routine

osEnd:   CMP.L  A2, A1       ; at the end?
 BLO.S   osLoop              ; nope: go do another

evalEnd: MOVEM.L (SP)+, D0-D2/A0-A2 ; restore caller’s registers
 RTS

 ; Here when the resource forbids caching.
 ; A2 points to the flag.
evalTurnOff:
 SF      (A2)                ;  disable it for faster call next time
 BRA.S   evalEnd             ; clean up and exit

 ; *** Impure *** flag: 0 means mapping disabled; non-zero means enabled.
qtEnabled:
 DC.B    $FF,00              ; initially enabled; 2nd byte to align

 ; Instructions and register lists to stick into OS routines.
 ; Each is 2 words.
OSent:   MOVEM.L OSregs, -(SP)
OSent2:  MOVEM.L OSregs2, -(SP)
OSexit:  MOVEM.L (SP)+, OSregs
OSexit2: MOVEM.L (SP)+, OSregs2

; Toolbox trap replacement routines.  To be re-evaluated, 
; these must be between qtToolStart and qtToolEnd.  
; Nothing else must be in here -- the evaluator
; walks through this as an array.

qtToolStart: ; Beginning of Toolbox trap replacement routines.

 qtTool qtMoveTo,_MoveTo
 qtTool qtSetPort,_SetPort
 ;  add your own here 

qtToolEnd:  ; End of Toolbox trap replacement routines.

; OS trap replacement routines.  As with toolbox, 
; keep only these in here.

qtOSStart:  ; Beginning of OS trap replacement routines.

 qtOS qtHLock,_HLock
 qtOS qtHUnlock,_HUnlock
 ;  add your own here 

qtOSEnd:  ; End of OS trap replacement routines.

 END
