NetNews Usenet Archive 1992 #27


Internet Message Format | 1992-11-17 | 5.3 KB

Path: sparky!uunet!ferkel.ucsb.edu!taco!rock!stanford.edu!agate!doc.ic.ac.uk!uknet!mcsun!news.funet.fi!funic!nntp.hut.fi!vipunen.hut.fi!jmunkki
From: jmunkki@vipunen.hut.fi (Juri Munkki)
Newsgroups: comp.sys.mac.programmer
Subject: Re: Help! making an assembly routine faster
Message-ID: <1992Nov16.190947.9920@nntp.hut.fi>
Date: 16 Nov 92 19:09:47 GMT
References: <1992Nov14.091905.29520@cs.uoregon.edu> <1992Nov14.200831.20477@nntp.hut.fi> <1992Nov16.014850.28678@cs.uoregon.edu>
Sender: usenet@nntp.hut.fi (Usenet pseudouser id)
Reply-To: jmunkki@vipunen.hut.fi (Juri Munkki)
Organization: Helsinki University of Technology
Lines: 123
Nntp-Posting-Host: vipunen.hut.fi

In article <1992Nov16.014850.28678@cs.uoregon.edu> mkelly@mystix.cs.uoregon.edu (Michael A. Kelly) writes:
>So, here are the resulting routines.  The first uses the jump table approach,
>the second uses the wide mask approach.  Can they be made even faster??

Yes.

>    @2:     ; copy the next eight bytes in the row
>
>        MOVE.B  (A2), D2        ; copy the next mask byte
>
>        TST.B   D2

A move instruction always does an implied tst (it sets N and Z from the
data it moves), so you can just throw away the test instruction.

>        BEQ     @nocopy         ; if zero, don't copy anything
>
>        CMPI.B  #0xFF, D2
>        BNE     @hardway        ; don't copy everything

An addq.b #1, and then a beq might prove to be faster than the cmp with
an immediate value. (It has to be a byte-sized add: 0xFF wraps to zero
only in the low byte.) You have to adjust the mask back to its old
value if the test fails, but this can be done either with the jump
tables (not with the ones you are using now, but the longer ones I will
suggest later in this article) or by a subq.b #1.

>
>        MOVE.L  (A0)+, (A1)+    ; copy all bytes
>        MOVE.L  (A0)+, (A1)+
>        ADDQ.L  #1, A2

Load the mask with a move.b (A2)+ and drop this instruction. I can't
see any reason why you can't do the increment there.

>        JMP     @endloop

Copy the end of the loop here, so that you have the DBF instruction
here instead of a JMP. Put the jump after the DBF. There's absolutely
no reason to jump around when you can just use another DBF.
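For readers following the control flow rather than the cycle counts, the loop quoted so far can be modelled in C. This is only a sketch, not code from the thread: the function name masked_copy_row, its parameters, and the bit order (bit 7 of a mask byte guarding the first pixel of its group of eight, inferred from the @sub6 and @sub8 routines quoted in the article) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference model, not taken from the original post.
 * Copies one row of 8-bit pixels through a 1-bit-per-pixel mask.
 * Each mask byte covers eight pixels; bit 7 is assumed to guard the
 * first pixel of the group. */
static void masked_copy_row(const uint8_t *src, uint8_t *dst,
                            const uint8_t *mask, size_t mask_bytes)
{
    for (size_t i = 0; i < mask_bytes; i++, src += 8, dst += 8) {
        uint8_t m = mask[i];

        if (m == 0x00) {
            continue;               /* like @nocopy: skip all eight pixels */
        }
        if (m == 0xFF) {
            memcpy(dst, src, 8);    /* like the two MOVE.L (A0)+, (A1)+ */
            continue;
        }

        /* like @hardway: copy only the pixels whose mask bit is set */
        for (int bit = 0; bit < 8; bit++) {
            if (m & (0x80u >> bit)) {
                dst[bit] = src[bit];
            }
        }
    }
}
```

The two early exits stand for the @nocopy path and the pair of MOVE.L instructions; the inner loop stands in for the jump-table subroutines. The suggestions in the article change how quickly each path is reached and left, not what the paths copy.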
>    @nocopy:        ; copy no bytes
>        ADDQ.L  #8, A0
>        ADDQ.L  #8, A1
>        ADDQ.L  #1, A2
>        JMP     @endloop

Same here as above.

>    @hardway:
>        ANDI.L  #0xF0, D2       ; mask off the low four bits
>        LSR.W   #4, D2          ; shift bits 4-7 into bits 0-3

The AND is totally wasted. The LSR will do the masking for you. This is
assuming that you can keep the high bytes of D2 cleared. I think you
should be able to do it. (I think it's already that way.)

You can also eliminate the AND and the LSR if you use two 256-entry
jump tables that simply ignore the high or low 4 bits. The tables will
take some memory (2 x 4 x 256 bytes), but they are easy to construct
with copy and paste.

>        ADD.W   D2, D2              ; double the index
>        ADD.W   @table(D2.W), D2    ; calculate the address
>        JSR     @table(D2.W)        ; plot four pixels

The 68020 has addressing modes that do the multiplication of the index.
I haven't needed them myself, but I'm fairly certain that you can
improve this part with the right addressing mode.

Replace the JSR with a LEA that loads the return address into An and a
JMP to the subroutine. Then jump back with a JMP (An). This is quite a
bit faster than a JSR/RTS combination, although it's not "good style".

>        CLR.L   D2              ; clear the mask register
>        MOVE.B  (A2)+, D2       ; copy the next mask byte
>        ANDI.B  #0xF, D2        ; mask off the high four bits

Use BFEXTU if you must read the mask again. Remember that you can use
-1(A2) if you already incremented A2, or you might be able to account
for this with the bitfield offset. You can also use constant bitfield
offsets, if I remember correctly. I think you have some registers that
you could use, so you could store fairly constant bitfield indices
there.

>    @sub6:  ; mask = 0110
>        ADDQ.L  #1, A0
>        ADDQ.L  #1, A1
>        MOVE.B  (A0)+, (A1)+

This should be a move.w

>        ADDQ.L  #1, A0
>        ADDQ.L  #1, A1
>        RTS
>
>    @sub8:  ; mask = 1000
>        MOVE.B  (A0)+, (A1)+
>        ADDQ.L  #3, A0
>        ADDQ.L  #3, A1
>        RTS

A move.b (a0),(a1) along with addq #4 is faster on a 68000, but I don't
think it matters on new processors.
I may be wrong, but you'll probably never see the difference.

In the deep mask version, you could unroll the loop. It's kind of
surprising that the 1 bit mask is actually faster, but it's mostly
because of the superior algorithm that allows you to directly copy
8 bytes at a time in the most common case.

I think you did really well with the assembly. My changes will probably
not make a big difference. I think 5% is the best you can hope for, but
it might be as much as 10%.

The only way to go beyond this is to make the move.l commands aligned
on long word destinations, as I mentioned in my previous article.

I hope my articles offer proof for the other half of my .signature... :-)

Can anyone do significantly better? I really love optimizing graphics
routines.

-- 
  Juri Munkki                           Windsurf: fast sailing
 jmunkki@hut.fi                       Macintosh: fast software