- Path: sparky!uunet!ferkel.ucsb.edu!taco!rock!stanford.edu!agate!doc.ic.ac.uk!uknet!mcsun!news.funet.fi!funic!nntp.hut.fi!vipunen.hut.fi!jmunkki
- From: jmunkki@vipunen.hut.fi (Juri Munkki)
- Newsgroups: comp.sys.mac.programmer
- Subject: Re: Help! making an assembly routine faster
- Message-ID: <1992Nov16.190947.9920@nntp.hut.fi>
- Date: 16 Nov 92 19:09:47 GMT
- References: <1992Nov14.091905.29520@cs.uoregon.edu> <1992Nov14.200831.20477@nntp.hut.fi> <1992Nov16.014850.28678@cs.uoregon.edu>
- Sender: usenet@nntp.hut.fi (Usenet pseudouser id)
- Reply-To: jmunkki@vipunen.hut.fi (Juri Munkki)
- Organization: Helsinki University of Technology
- Lines: 123
- Nntp-Posting-Host: vipunen.hut.fi
-
- In article <1992Nov16.014850.28678@cs.uoregon.edu> mkelly@mystix.cs.uoregon.edu (Michael A. Kelly) writes:
- >So, here are the resulting routines. The first uses the jump table approach,
- >the second uses the wide mask approach. Can they be made even faster??
-
- Yes.
-
- > @2: ; copy the next eight bytes in the row
- >
- > MOVE.B (A2), D2 ; copy the next mask byte
- >
- > TST.B D2
-
- A move instruction always does an implied tst, so you can just throw away
- the test instruction.
-
- > BEQ @nocopy ; if zero, don't copy anything
- >
- > CMPI.B #0xFF, D2
- > BNE @hardway ; don't copy everything
-
- An addq.b #1 followed by a beq might prove to be faster than the cmp with
- an immediate value, because a mask of 0xFF wraps around to zero. If the
- test fails, you have to adjust the mask back to its old value, but this
- can be done either with the jump tables (not with the ones you are using
- now, but the longer ones I will suggest later in this article) or with a
- subq.b #1.
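Here is a rough C model of that test sequence, just to show the wrap-around. It assumes a byte-sized add, so that 0xFF becomes zero; classify_mask and its return codes are made up for the illustration and are not part of the original routine.

```c
#include <stdint.h>

/* Hypothetical helper: classify one mask byte the way the
   MOVE.B / BEQ / ADDQ.B / BEQ sequence would.
   Returns 0 for "copy nothing", 2 for "copy all eight bytes",
   and 1 for the mixed case that needs the jump tables. */
static int classify_mask(uint8_t mask)
{
    if (mask == 0)
        return 0;                /* BEQ right after the MOVE.B */

    mask = (uint8_t)(mask + 1);  /* ADDQ.B #1: 0xFF wraps to 0x00 */
    if (mask == 0)
        return 2;                /* BEQ taken: the mask was 0xFF */

    mask = (uint8_t)(mask - 1);  /* SUBQ.B #1 restores the old mask */
    return 1;
}
```

Only the mixed case pays for the restore, and that is the slow path anyway.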
-
- >
- > MOVE.L (A0)+, (A1)+ ; copy all bytes
- > MOVE.L (A0)+, (A1)+
- > ADDQ.L #1, A2
-
- Use move.b (A2)+ when you load the mask at the top of the loop instead of
- this separate addq. I can't see any reason why you can't do the increment
- there.
-
- > JMP @endloop
-
- Duplicate the end of the loop here, so that this path ends in its own DBF
- instead of a JMP. Put the jump after the DBF, where it is taken only once,
- when the loop finishes. There's absolutely no reason to jump around when
- you can just use another DBF.
-
- > @nocopy: ; copy no bytes
- > ADDQ.L #8, A0
- > ADDQ.L #8, A1
- > ADDQ.L #1, A2
- > JMP @endloop
-
- Same here as above.
-
- > @hardway:
- > ANDI.L #0xF0, D2 ; mask off the low four bits
- > LSR.W #4, D2 ; shift bits 4-7 into bits 0-3
-
- The AND is totally wasted. The LSR will do the masking for you. This
- is assuming that you can keep the high bytes of D2 cleared. I think
- you should be able to do it. (I think it's already that way.)
-
- You can also eliminate both the AND and the LSR if you use two 256-entry
- jump tables, indexed by the whole mask byte, that simply ignore the high
- or the low 4 bits. The tables will take some memory (2 x 4 x 256 = 2048
- bytes), but they are easy to construct with copy and paste.
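A minimal C sketch of the two-table idea, assuming the same bit order as the quoted subroutines (the top bit of each nibble guards the first byte). The names hi_table, lo_table, build_tables and copy_hard are mine, and the tables are filled by a loop here rather than by copy and paste.

```c
#include <stdint.h>

typedef void (*nibble_fn)(const uint8_t *src, uint8_t *dst);

/* One handler per 4-bit mask, standing in for @sub0..@sub15. */
#define HANDLER(n)                                            \
    static void sub##n(const uint8_t *s, uint8_t *d)          \
    {                                                         \
        if ((n) & 8) d[0] = s[0];                             \
        if ((n) & 4) d[1] = s[1];                             \
        if ((n) & 2) d[2] = s[2];                             \
        if ((n) & 1) d[3] = s[3];                             \
    }

HANDLER(0)  HANDLER(1)  HANDLER(2)  HANDLER(3)
HANDLER(4)  HANDLER(5)  HANDLER(6)  HANDLER(7)
HANDLER(8)  HANDLER(9)  HANDLER(10) HANDLER(11)
HANDLER(12) HANDLER(13) HANDLER(14) HANDLER(15)

static const nibble_fn sub16[16] = {
    sub0, sub1, sub2,  sub3,  sub4,  sub5,  sub6,  sub7,
    sub8, sub9, sub10, sub11, sub12, sub13, sub14, sub15
};

/* Both tables are indexed by the full mask byte. */
static nibble_fn hi_table[256];   /* ignores the low 4 bits  */
static nibble_fn lo_table[256];   /* ignores the high 4 bits */

static void build_tables(void)
{
    for (int m = 0; m < 256; m++) {
        hi_table[m] = sub16[m >> 4];
        lo_table[m] = sub16[m & 15];
    }
}

/* Copy eight bytes under one mask byte with no AND or shift at run time. */
static void copy_hard(uint8_t mask, const uint8_t *src, uint8_t *dst)
{
    hi_table[mask](src, dst);
    lo_table[mask](src + 4, dst + 4);
}
```

The shift and the AND have moved into build_tables, which runs once. The inner loop only indexes with the mask byte it already has.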
-
- > ADD.W D2, D2 ; double the index
- > ADD.W @table(D2.W), D2 ; calculate the address
- > JSR @table(D2.W) ; plot four pixels
-
- The 68020 has addressing modes that do the multiplication of the index.
- I haven't needed them myself, but I'm fairly certain that you can improve
- this part with the right addressing mode.
-
- Replace the JSR with a LEA that loads the return address into An, followed
- by a JMP to the subroutine. Then return with a JMP (An) instead of an RTS.
- This is quite a bit faster than a JSR/RTS combination, although it's not
- "good style".
-
- > CLR.L D2 ; clear the mask register
- > MOVE.B (A2)+, D2 ; copy the next mask byte
- > ANDI.B #0xF, D2 ; mask off the high four bits
-
- Use BFEXTU, if you must read the mask again. Remember that you can use
- -1(A2), if you already incremented A2 or you might be able to account
- for this with the bitfield offset. You can also use constant bitfield
- offsets, if I remember correctly. I think you have some registers that
- you could use, so you could store fairly constant bitfield indices
- there.
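If the bitfield numbering is unfamiliar: BFEXTU counts the offset from the most significant bit and zero-extends the result. A small C model of the single-byte case (the real instruction handles fields of up to 32 bits that may span bytes; bfextu8 is a hypothetical name for this sketch):

```c
#include <stdint.h>

/* Model of BFEXTU restricted to one byte.
   offset 0 is the most significant bit of the byte.
   Assumes width >= 1 and offset + width <= 8. */
static uint32_t bfextu8(uint8_t byte, unsigned offset, unsigned width)
{
    unsigned shift = 8u - offset - width;       /* bits to the right of the field */
    return ((uint32_t)byte >> shift) & ((1u << width) - 1u);
}
```

So an offset of 4 with a width of 4 picks out the low nibble of the mask byte, and an offset of 0 with a width of 4 picks out the high nibble, with no separate AND or shift.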
-
- > @sub6: ; mask = 0110
- > ADDQ.L #1, A0
- > ADDQ.L #1, A1
- > MOVE.B (A0)+, (A1)+
-
- This should be a move.w: mask 0110 covers the two middle bytes, and a
- move.b copies only one of them.
-
- > ADDQ.L #1, A0
- > ADDQ.L #1, A1
- > RTS
- >
- > @sub8: ; mask = 1000
- > MOVE.B (A0)+, (A1)+
- > ADDQ.L #3, A0
- > ADDQ.L #3, A1
- > RTS
-
- A move.b (a0),(a1) along with addq #4 is faster on a 68000, but I
- don't think it matters on new processors. I may be wrong, but you'll
- probably never see the difference.
-
- In the deep mask version, you could unroll the loop. It's kind of
- surprising that the 1-bit mask is actually faster, but it's mostly
- because of the superior algorithm that allows you to directly copy
- 8 bytes at a time in the most common case.
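For checking an optimized version against something obviously correct, a plain C model of the 1-bit mask algorithm might look like the sketch below. It assumes 8-bit pixels, one mask byte per eight pixels, and the most significant mask bit guarding the first pixel, as in the quoted subroutines; masked_row_copy is a name I made up.

```c
#include <stdint.h>
#include <string.h>

/* Reference model: copy one row of 8-bit pixels under a 1-bit mask.
   groups is the number of mask bytes; each covers eight pixels. */
static void masked_row_copy(const uint8_t *src, uint8_t *dst,
                            const uint8_t *mask, size_t groups)
{
    while (groups--) {
        uint8_t m = *mask++;

        if (m == 0xFF) {
            memcpy(dst, src, 8);            /* common case: two long moves */
        } else if (m != 0) {
            for (int i = 0; i < 8; i++)     /* mixed case: pixel by pixel */
                if (m & (0x80 >> i))
                    dst[i] = src[i];
        }
        /* m == 0: nothing to copy, just step over the eight pixels */

        src += 8;
        dst += 8;
    }
}
```

It makes no attempt to be fast. The point is that the two cheap cases are decided by one byte of mask, which is where the 1-bit version gets its speed.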
-
- I think you did really well with the assembly. My changes will probably
- not make a big difference. I think 5% is the best you can hope for, but
- it might be as much as 10%. The only way to go beyond this is to make
- the move.l commands aligned on long word destinations, as I mentioned
- in my previous article.
-
- I hope my articles offer proof for the other half of my .signature... :-)
- Can anyone do significantly better? I really love optimizing graphics
- routines.
-
- --
- Juri Munkki Windsurf: fast sailing
- jmunkki@hut.fi Macintosh: fast software
-