stack to hold interrupt and subroutine return address, pushed/poppedì
register data; stack-oriented data structures.
Naturally, some instructions get used a lot more than others. But ì
frequency-of-use studies reveal that many programs NEVER use large portionsì
of the instruction set. Sometimes there are good reasons, like sticking toì
8080 opcodes so your code runs on an 8080/8085/V20 etc. More often theì
programmer is simply unfamiliar with the entire instruction set, and soì
restricts himself to what he knows.
This is fine for noncritical uses but suicidal when performance counts.ì
It's like running a racehorse with a gag in its throat. Take some time toì
go over the entire instruction set, one by one. Devise an example for eachì
instruction that puts it to good use. I only know of nine turkeys with noì
use besides "trick" NOPs (can you find them?).
Here's a routine that might be written by a rather inept programmer (orì
an unusually efficient compiler). It outputs a string of characters endingì
in 0 to the console. It generally follows good programming practices; it'sì
well structured, has clearly-defined entry and exit points, and carefullyì
saves and restores all registers used.
...
ld de,string ; point to message
call outstr ; output it
...
outstr: push af ; save registers
push bc
push de
push hl
ld a,(de) ; get next character
cp 0 ; compare it to 0
jp z,outend ; if not last char,
push de ; ..save registers
ld e,a
ld c,conout ; output character to console
call bdos
pop de ; restore registers
inc de ; advance to next
jp outstr ; repeat until doneèoutend: pop hl ; else 0 is last char
pop de
pop bc ; restore registers
pop af
ret ; return
string: db 'message' ; display our message
db 0 ; end of string marker
Now let's see how it can be improved. First, note that over half theì
instructions are PUSHes or POPs. This is the consequence of saving everyì
register before use, a common compiler strategy. Though safe and simple,ì
it's the single worst performance-killer I know.
The alternative is to push/pop only as necessary. This is easier saidì
than done; miss one, and you've got a nasty bug to find. A good strategyì
helps. I initially define my routines to minimize the registers used; onlyì
push/pop as needed within the routine itself; and restore nothing on exit. ì
In OUTSTR, this eliminates all but the PUSH DE/POP DE around the CALL BDOS.
This shifts the save/restore burden to the calling routine. Since theì
caller also follows the rule of minimal register usage and push/pops only asì
necessary, it will probably not push/pop as many registers; thus we haveì
increased speed by eliminating redundant push/pops. We have also made itì
explicitly clear which registers a caller really needs preserved.
Now I move the remaining push/pops to the called routines to save memory. ì
If every caller saves a particular register, it obviously should beì
saved/restored by the subroutine itself. If two or more callers save it,ì
speed is the deciding factor; preserve that register in the subroutine ifì
the extra overhead is not a problem for callers that don't need thatì
register preserved.
Push/pops are sloooww; at 21 to 29 T-states per pair, they make wonderfulì
low-byte time killers. If possible, either use, or save to, a register thatì
isn't killed by the called routine. In our example, try IX or IY instead ofì
DE; the index registers aren't trashed by the BDOS call (except, see Jayì
Sage's column. Ed). This saves 5 T-states/loop but adds 2 bytes (seeì
why?). The instruction EX DE,HL (8 T-states per pair) is often useful, butì
not here; the BIOS eats both HL and DE. The ultimate speed demon is a fast-n-drastic pair of EXX instructions to replace the PUSH DE/POP DE. They saveì
13 T-states with no size increase, and even preserve BC so we don't have toì
reload it for every loop.
Comparisons
A CP 0 instruction was used to test for 0, an obvious choice. But itì
takes 2 bytes and 7 T-states to execute. The Z80's Zero flag makes theì
special case of testing for zero easy; all we have to do is update the flagsì
to match the byte loaded. This is most easily done with an OR Aì
instruction, which takes only 1 byte and 4 T-states. You'll find this trickì
often in Z80 code.
è Note that OR A has no effect on A; we just used it to update the flagsì
because it's smaller and faster than CP 0. This illustrates a basicì
principle of assembly languages; the side effects of an instruction areì
often more important than the main effect. Here are some other not-so-obvious instructions:
and a ; update flags and clear Carry
xor a ; set A=0, update flags, P/V flag=1
sub a ; same, but P/V flag=0
sbc a,a ; set all bits in A to Carry (00 or FF)
add a,a ; A*2, or shift A left & set lsb=0
add hl,hl ; HL*2, or shift HL left & set lsb=0
adc hl,hl ; shift HL left & lsb=Carry
sbc hl,hl ; set all bits in HL to Carry (0000 or FFFF)
ld hl,0 ; \_load SP into HL so it can be examined
add hl,sp ; /
Using DE as the string pointer is a weak choice. It forces us to loadì
the character into A, then move it to E. If we use HL, IX, or IY instead,ì
we can load E directly and save a byte. But this makes it harder to testì
for 0.
An INC E, DEC E updates the Z flag without changing E. Or mark the endì
of the string with 80h, and use BIT 7,E to test for end. Both are asì
efficient as the OR A trick but don't need A. If you are REALLY desperate,ì
add 1 to every byte in the string, so a single DEC E restores the characterì
and sets the Z flag; kinky, but short and fast.
Jumps
This example used 3-byte absolute jump instructions. We can save memoryì
by using the Z80's 2-byte relative jumps instead; each use saves a byte. ì
Since jumps are among the most common instructions, this adds up fast.
Relative jumps have a limited range, so it pays to arrange your codeì
carefully to maximize their use. I've found that about half the jumps in aì
well structured program can be relative. When most of the jumps are out ofì
range, it's often a sign of structural weaknesses, "spaghetti-code" orì
excessively complex subroutines.
How about execution speed? An absolute jump always takes 10 T-states; aì
relative jump takes 12 to jump, or 7 to continue. So if speed counts, useì
absolute jumps when the branch is normally taken, and relative jumps when itì
is not. In the example, this means changing the JP Z,OUTEND to JR Z,OUTENDì
but keeping the JP at the end.
But wait a minute! The JR Z,OUTEND merely jumps to the RET at the end ofì
the subroutine. It would be more efficient still to replace it with RET Z,ì
a 1-byte conditional return that is only 5 T-states if the return is notì
taken. This illustrates another difference between assembler and high-levelì
languages; entry and exit points are often not at the beginning and end of aì
routine.
è We can speed up unconditional jumps within a loop. On entry, load HLì
with the start address of the loop, and replace JP LABEL by JP (HL). Itì
takes 1 byte and 4 T-states, saving 6 T-states per loop. This scheme costsì
us a byte (+3 to set HL; -2 for JP (HL)). But if used more than once in theì
routine, we save 2 bytes per occurrence. If HL is unavailable (as is theì
case here; the BDOS trashes it), IX or IY can be used instead. However, theì
JP (IX) and JP (IY) instructions take 2 bytes and 8 T-states, making the ì
savings marginal.
Can we do better yet? Yes, if we carefully rethink the structure of ourì
program. Notice it has two jump instructions per loop; yet only one test isì
performed (test for 0). This is a hint that one conditional jump should beì
all we need. Think of the instructions in the loop as links in a chain.ì
Rotate the chain to put the test-for-0 link at the bottom, and LD C,CONOUTì
on top (which we'll label OUTNXT). The JP OUTSTR is now unnecessary, andì
can be removed. JP NZ,OUTNXT performs the test and loops until 0 (remember,ì
absolute for speed, relative for size). The entry point is still OUTSTR,ì
though (horrors!) it's now in the middle of the routine.
We've also made a subtle change in the logic. Presumably we wouldn'tì
call OUTSTR unless there was at least one character to output. But whatì
would happen if we did?
Another way is to use DJNZ to close the loop. Make the first byte of theì
string its length (1-256). Load this value into B as part of theì
initialization. The resulting program takes 34 T-states per loop (notì
counting the CALL).
STILL faster? OK, you twisted my arm. If you're absolutely sure theì
string won't cross a page boundary, you can use INC L instead of INC HL toì
save 2 T-states. The 8-bit INC/DEC instructions are faster than theirì
16-bit counterparts, but should only be used if you're positive the addressì
will never require a carry. This brings us to 32 T-states/loop, which isì
the best I can do within this routine itself. Or can you do better?
...
ld hl,string ; point to message
call outstr ; output it
...
outstr: ld b,(hl) ; get length of message
ld c,conout ; output to console
ld e,(hl) ; get 1st char
outnxt: exx ; save registers,
call bdos ; output char,
exx ; and restore
inc hl ; advance to next
ld e,(hl) ; get next character
djnz z,outnxt ; loop until end
ret
string: db strend - strbeg ; message lengthèstrbeg: db 'message' ; message itself
strend:
Parameter Passing
In the above example, parameters were passed to the subroutine viaì
registers (string address in HL). This is fast and easy, but each call toì
OUTSTR takes 6 bytes. Now let's look at methods that save memory at theì
expense of speed.
Parameters can be passed to a subroutine as "data" bytes immediatelyì
following the CALL. Let's define the two bytes after CALL OUTSTR as theì
address of the string. The following code then picks up this pointer,ì
saving us a byte per call. The penalty is in making OUTSTR 4 bytes longerì
and 38 T-states/loop slower; thus it doesn't pay until we use it 5 or moreì
times.
...
call outstr ; output message
dw string ; beginning here
...
outstr: pop hl ; get pointer to "DW STRING"
ld e,(hl) ; E=low byte of string addr
inc hl
ld d,(hl) ; D=high byte of string addr
inc hl ; skip over "DW STRING" & push
push hl ; corrected return address
outnxt: ld a,(de) ; get next character
or a ; if 0,
ret z ; all done, return
push de
ld e,a
ld c,conout ; output character to console
call bdos
pop de
inc de ; advance to next
jr outnxt
We also had to rethink our choice of registers. If we tried to use HL orì
IX as the string pointer, OUTSTR would have been larger and slower (try itì
yourself). This demonstrates the consequences of inappropriate registerì
choices.
The more parameters that must be passed, the more efficient thisì
technique becomes. A further refinement is to put the string itselfì
immediately after CALL. This saves an additional two bytes per call, andì
shortens OUTSTR by 6 bytes.
...
call outstr ; output messageè db 'message',0 ; which immediately follows
...
outstr: pop de ; get pointer to message
ld a,(de) ; get next character
inc de ; advance to next
push de ; & save as return address
or a ; if char=0, all done
ret z ; pointer is return addr
ld e,a
ld c,conout ; else output char to console
call bdos
jr outstr ; & repeat
Constants and Variables
Constants and variables are part of every program. Constants are usuallyì
embedded within the program itself, as "immediate" bytes. Variables on theì
other hand are usually separated, grouped into a common region perhaps atì
the end of the program. This makes sense for programs in ROM, where theì
variables obviously must be stored elsewhere. But it is not a requirementì
for programs in RAM.
If your program executes from RAM, performance can be improved byì
treating variables as in-line constants; storage for the variable is in theì
last byte (or two) of an immediate instruction. For example, here is aì
routine that creates a new stack, toggles a variable FLAG between twoì
states, and then restores the original stack:
toggle: ld (stack),sp ; save old stack pointer
ld sp,mystack ; setup my stack
ld a,(flag) ; get Yes/No flag
cp 'Y' ; if "Y",
ld a,'N' ; set it to "N"
jr z,setno ; else "N",
ld a,'Y' ; set it to "Y"
setno: ld (flag),a ; save new state
ld sp,(stack) ; restore stack pointer
ret
stack: dw 0 ; old stack pointer
flag: db 'Y' ; value of flag
The LD A,(FLAG) instruction takes 13 T-states and 4 bytes of RAM (3 forì
the instruction, 1 to store FLAG). It can be replaced by LD A,'Y' where 'Y'ì
is the initial value of the variable FLAG, the 2nd byte of the instruction.ì
Speed and memory are improved 2:1, to 7 T-states and 2 bytes respectively.
It works for 16-bit variables as well. Replace LD SP,(STACK) with LDì
SP,0 where 0 is a placeholder for the 2-byte variable STACK. This saves 3ì
bytes and 10 T-states.
ètoggle: ld (stack+1),sp ; save old stack pointer
ld sp,mystack ; setup my stack
flag: ld a,'Y' ; get Y/N flag (byte 2=var)
cp 'Y' ; if "Y",
ld a,'N' ; set it to "N"
jr z,setno ; else "N",
ld a,'Y' ; set it to "Y"
setno: ld (flag+1),a ; save new state
stack: ld sp,0 ; restore stack (byte 2,3=var)
ret
There is another advantage to this technique -- versatility. Anyì
immediate-mode instruction can have variable data; loads, math, compares,ì
logical, even jumps and calls. Try changing our first example so a variableì
OUTDEV selects the output device; console or printer. Now see how simple itì
is if OUTDEV is the 2nd byte of the LD C,CONOUT instruction.
It even creates new instructions. For instance, the Z80's indexed ì
instructions don't allow a variable offset. This makes it awkward to loadì
the "n"th byte of a table, where we would like LD A,(IX+b) where "b" is aì
variable. But it can be done if the variable offset is stored in the lastì
byte of the indexed instruction itself.
Storing variables in the address field of a jump or call instruction canì
do some weird and wonderful things. There is no faster way to perform aì
conditional branch based on a variable. But remember you are treading onì
the thin ice of self-modifying code; debugging and relocation become muchì
more difficult, and you must insure that the variable NEVER has anì
unexpected value. Also, in microprocessors with instruction caches (fastì
memory containing copies of the contents of regular memory), there can beì
problems if the cache data are not updated.
I put a LABEL at each instruction with an immediate variable, then useì
LABEL+1 for all references to it. This serves as a reminder that somethingì
odd is going on. Be sure to document what you're doing, or you'll driveì
some poor soul (probably yourself) batty.
Exclusive OR
The XOR operator is a powerful tool for manipulating data. Sinceì
anything XOR'd with itself is 0, use XOR A instead of LD A,0. To toggle aì
variable between two values, XOR it with the difference between the twoì
values. Our last example can be performed much more efficiently by:
toggle: ld (stack+1),sp ; save old stack pointer
ld sp,mystack ; setup my stack
flag: ld a,'Y' ; get Y/N flag (byte 2=var)
xor 'Y'-'N' ; toggle "Y" <-> "N"
ld (flag+1),a ; save new state
stack: ld sp,0 ; restore stack (byte 2,3=var)
retè
XOR eliminated the jump, for a 2:1 improvement in size and speed. Thisì
illustrates a generally useful rule. Almost any permutation can beì
performed faster, without jumps, by XOR and the other math and logicalì
operators. Consider the following routine to convert the ASCII character inì
A to uppercase: it's both shorter and faster than the traditional methodì
using jumps.
convert: ld b,a ; save a copy of the char in B
sub 'a' ; if char is lowercase (a thru z),
cp 'z'-'a'+1 ; then carry=1, else carry=0
sbc a,a ; fill A with carry
and 'a'-'A' ; difference between upper/lower
xor b ; convert to uppercase
Data Compaction
Programs frequently include large blocks of text, data tables, and otherì
non-program data. Careful organization of such information can produceì
large savings in memory and speed of execution.
ASCII is a 7-bit code. The 8th bit of each byte is either unused or justì
marks the end of a string. You can bit-pack 8 characters into 7 bytes withì
a suitable routine. If upper case alone is sufficient, 6 bits are enough.ì
For dedicated applications, don't overlook older but more memory-efficientì
codes like Baudot (5 bits), EBCD (4 bits), or even International Morse (2-10ì
bits, with frequent characters the shortest).
If your text is destined for a CRT or printer, it may be heavy on controlì
characters and ESC sequences. I've found the following algorithm useful.ì
Bytes whose msb=0 are normal ASCII characters: output as-is. Printableì
characters whose msb=1 are preceeded by ESC, so "A"+80h=C1h sends "ESC A".ì
Control codes whose msb=1 are a "repeat" prefix to output the next byteì
between 2 and 32 times. For example, linefeed+80h=8Ah repeats the nextì
character 11 times. The value 80h, which otherwise would be "repeat once",ì
is reserved as the marker for the end of the string.
Programs can be compacted, too. One technique is to write your programì
in an intermediate language (IL) better suited to the task at hand. Itì
might be a high-level language, the instruction set of another CPU, or aì
unique creation specifically for the job at hand. The rest of your programì
is then an interpreter to execute this language. Tom Pittman's Tiny BASICì
is an excellent example of this technique. His intermediate languageì
implemented BASIC in just 384 bytes; the IL interpreter in turn took aboutì
2K.
Another approach is threaded code, made popular by the FORTH language. Aì
tight, well-structured program will probably use lots of CALLs. At theì
highest levels, the code may in fact be nothing but long sequences of CALLs:
main: call getname
call openfileè call readfile
call expandtabs
call writefile
call closefile
ret
Every 3rd byte is a CALL; large programs will have 1000s of them. Soì
let's eliminate the CALL opcodes, making our program just a list ofì
addresses:
main: ld (stack),sp ; save stack pointer
ld sp,first ; point it to first address in the list
ret ; and go execute it
first: dw openfile
dw readfile
dw expandtabs
dw writefile
dw closefile
dw return ; end of list
return: ld sp,(stack) ; restore stack pointer
ret ; and return (to MAIN's caller)
The stack pointer is pointed to the address of the first subroutine in theì
list to execute. RET then loads this address into the program counter andì
advances the stack pointer to the next address. Since each subroutine alsoì
ends with a RET, it automatically jumps directly to the next routine in theì
list to be executed. This is called directly threaded code.
RETURN is always the last subroutine in a list. It restores the stackì
pointer and returns to the caller of MAIN.
Directly threaded code can cut program size up to 30%, while actuallyì
increasing execution speed. However, it has some rather drasticì
limitations. During execution of the machine-code subroutines in theì
address list, the Z80's one and only stack pointer is tied up as an addressì
pointer. That means the stack can't be used; no interrupts, calls, pushes,ì
or pops are allowed without first switching to a local stack.
The solution to this is called indirectly threaded code, made famous (orì
infamous) by the FORTH language. Rather than have each subroutine directlyì
chain into the next, they are linked by a tiny interpreter, called NEXT:
main: call next
dw openfile
dw readfile
dw expandtabs
dw writefile
dw closefile
dw return ; end of list
next: pop ix ; make IX our next-subroutine pointerènext1: ld hl,next1 ; push address so RET comes back here
push hl
ld l,(ix+0) ; get address to "call"
inc ix ; low byte
ld h,(ix+0) ; high byte
inc ix ; point IX to addr for next time
jp (hl) ; call address
return: pop hl ; end of list; discard NEXT addr
ret ; and return to MAIN's caller
Now IX is our pointer into the address list; it points to the nextì
subroutine to be executed. Subroutines can use the stack normally within,ì
but must preserve IX and can't pass parameters in HL. When they exit viaì
RET, it returns them to NEXT1.
Though the example executes the address list as straight-line code,ì
subroutines can be written to perform jumps and calls via IX as well. NEXTì
can provide special handling for commonly-used routines as well; words withì
the high byte=0 could jump IX by a relative offset if A=0. If there areì
less than 256 subroutines, each address can be replaced by a single byte,ì
which NEXT converts into an address via a lookup table.
Indirectly threaded code can reduce size up to 2:1 in return for aì
similar loss in execution speed. The decrease in program size is oftenì
remarkable. I learned this in 1975 designing a sound-level dosimeter. Thisì
cigarette-pack sized gadget rode around in a shirt pocket all day, loggingì
the noise a person was exposed to. It then did various statisticalì
computations to report the high, low, mean, and RMS noise levels versusì
time.
In those dark ages, a BIG memory chip was 256x4. Cost, power, and sizeì
forced us into an RCA 1802 CMOS microprocessor, with just 512 bytes ofì
program memory (bytes, not K!). Try as we might, we couldn't do it. Inì
desperation, we tried Charlie Moore's FORTH. Incredibly, it bettered evenì
our heavily optimized code by 30%!
Of course, very little of FORTH itself wound up in the final product; itì
just showed us the way. Once you know HOW it's done, you can apply the sameì
techniques to any assembly-language program without becoming a born-againì
FORTH zealot.
Shortcuts
Here are some "quickies" that didn't fit in elsewhere. Keep in mind whatì
is actually in all the registers as you program. Do you really need toì
clear carry, or is it already cleared as the result of a previous operation? ì
Before you load a register, are you sure it's necessary? Perhaps it'sì
already there, or sitting in another register.
Many routines return "leftovers" that can be very useful, such as HL, DE,ìèand BC=0 after a block move. Perhaps an INC or DEC will produce the valueì
you want. Variables can be grouped so you needn't reload the entire addressì
for each. If the high byte of a register is correct, just load the lowerì
half.
Keep an index register pointed to your frequently-used variables. Thisì
makes them easier to access (up to 256 bytes) and opens the door to memory-ì
(rather than register-) oriented manipulations. The indexed instructionsì
are slower and less memory-efficient, but the versatility sometimes makes upì
for it (store an immediate byte to memory, load/save to memory fromì
registers other than A, etc.).
The Z80's bit test/set/reset instructions add considerable versatility ifì
you define your flags as bits rather than bytes. Bit flags can be accessedì
in any register, or even directly in memory, without loading them into aì
register.
If the last two instructions of a subroutine are CALL FOO and RET, youì
could just as well end with JP FOO and let FOO do the return for you. Ifì
the entry point of FOO is at the top, even the JP is unnecessary; you canì
locate FOO immediately after and "fall in" to it.
If you have a large number of jumps to a particular label (like the startì
of your MAIN program), it may be more efficient to push the address of MAINì
onto the stack at the top of the routine. Each JP MAIN can then be replacedì
by a 1-byte RET.
SKIP instructions are a short, fast way to jump a fixed distance. Theì
Z80 has no skips, but you can simulate a 1- or 2-byte skip with a 2- orì
3-byte do-nothing instruction: JR or JP on a condition that is never true,ì
for instance. If the flags aren't in a known state, load to an unusedì