NetNews Usenet Archive 1992 #27

Text File | 1992-11-18 | 5.1 KB | 124 lines

Newsgroups: comp.sys.mac.programmer
Path: sparky!uunet!charon.amdahl.com!pacbell.com!sgiblab!darwin.sura.net!paladin.american.edu!news.univie.ac.at!hp4at!mcsun!news.funet.fi!funic!nntp.hut.fi!vipunen.hut.fi!jmunkki
From: jmunkki@vipunen.hut.fi (Juri Munkki)
Subject: Re: Help! making an assembly routine faster
Message-ID: <1992Nov18.203345.1290@nntp.hut.fi>
Sender: usenet@nntp.hut.fi (Usenet pseudouser id)
Nntp-Posting-Host: vipunen.hut.fi
Reply-To: jmunkki@vipunen.hut.fi (Juri Munkki)
Organization: Helsinki University of Technology
References: <1992Nov16.014850.28678@cs.uoregon.edu> <1992Nov16.190947.9920@nntp.hut.fi> <1992Nov18.010815.6649@cs.uoregon.edu>
Date: Wed, 18 Nov 1992 20:33:45 GMT
Lines: 110

In article <1992Nov18.010815.6649@cs.uoregon.edu> mkelly@mystix.cs.uoregon.edu (Michael A. Kelly) writes:
> According to the Motorola manual, you're right. But in practice this slowed
> things down quite a bit. I can't figure out why. I replaced the CMPI with
> an ADDQ #1, then at @hardway I did a SUBQ #1. My test case is a 32x32 rect
> with a 32x32 filled circle as the mask. I think it would slow things down
> a lot more with more complicated masks. But still, I don't know why it's
> slower, since the CMPI takes 8 clock cycles and the ADDQ and SUBQ each
> take 4, so it really should be faster.... Then again, those timings are
> for the 68000 (and 68020 too I think), and I'm using a 68040.

On the 040, the instructions can overlap quite a bit. I guess that the
modification of a data register prevented the overlap. I suggest that
you try storing the constant 0xFF in a free data register and doing
the compare with the data register. Register to register compares should
always be faster than immediate to register compares.

> >it might be as much as 10%. The only way to go beyond this is to make
> >the move.l commands aligned on long word destinations, as I mentioned
> >in my previous article.
>
> But as long as I align the source and destination Pixmaps, that isn't an
> issue, right?

I thought about this alignment stuff and it occurred to me that the mask
bitmap would be a lot harder to use if you aligned your writes to video
RAM. On the Quadras, video RAM is so fast that alignment probably doesn't
matter all that much. On NuBus, things are usually quite different.

> OK, here's the new code. The first one is the newer, better version of
> Quick8CopyMask, with most of the optimizations suggested by Juri. It's
> about 5.5 times as fast as QuickDraw's CopyMask, at least with my simple
> circle mask test case. The second one is a small part of a very large
> Quick8CopyMask that has 256 separate subroutines to handle each mask
> byte, rather than only 16 subroutines to handle a mask nibble (a nibble is
> half a byte, right?). It's far too long to post here, but if you want a
> copy I'll be happy to email it to you. It's about 6.5 times as fast as
> CopyMask; about 15% faster than the short version.
>
> I tested the routines with the mask used in the CalcCMask DTS snippet;
> the short version was 5.7 times as fast as CopyMask and the long version
> was 7 times as fast.

It should be quite hard to improve speed from the longer code. I bet it
took quite a few minutes to write it. :-) I do have an idea that you
could try, if you still feel like the code should be improved.

Snippet from long version:

>   @1:                                 ; copy the next row
>       MOVE.W  w, D1
>   @2:                                 ; copy the next eight bytes in the row
>       CLR.W   D2                      ; clear the mask register
>       MOVE.B  (A2)+, D2               ; copy the next mask byte
>       BEQ     @nocopy                 ; if zero, don't copy anything
>
>       CMPI.B  #0xFF, D2
>       BNE     @hardway                ; don't copy everything
>
>       MOVE.L  (A0)+, (A1)+            ; copy all bytes
>       MOVE.L  (A0)+, (A1)+
>
>       DBF     D1, @2
>       JMP     @endloop
>
>   @nocopy:                            ; copy no bytes
>       ADDQ.L  #8, A0
>       ADDQ.L  #8, A1
>
>       DBF     D1, @2
>       JMP     @endloop
>
>   @hardway:
>       ADD.W   D2, D2                  ; double the index
>       ADD.W   @table(D2.W), D2        ; calculate the address
>       JMP     @table(D2.W)            ; plot eight pixels

I finally dug up my 020 manual and went through the addressing modes.
Instead of having a jump table, you should probably use a table of jumps. :-)

        clr.w   D2                      ; only move.b writes D2, so the high byte stays 0
@1      move.w  w,D1
@2      move.b  (A2)+,D2
        jmp     (@jumptable,PC,D2.w*4)  ; each bra.w entry is 4 bytes, hence *4
@jumptable
        bra.w   @mask0
        bra.w   @mask1
        bra.w   @mask2
        bra.w   @mask3
        ...
        bra.w   @mask254
        move.l  (A0)+,(A1)+             ; This is mask 255
        move.l  (A0)+,(A1)+
        dbf     D1,@2
        ...

I checked with Think C and at least the above code (or something similar
to it) compiles and the disassembly looks reasonable.

Note that I removed the special checks for 0 and 255. I think they are
mostly wasted, but it's possible they speed things up with masks that have
large solid areas.

-- 
Juri Munkki                           Windsurf: fast sailing
jmunkki@hut.fi                       Macintosh: fast software
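
For readers who want to check what the per-mask-byte dispatch discussed above is supposed to compute, here is a minimal C sketch. It is a reference model only, not the assembly from the thread and not QuickDraw's CopyMask. The names copy_masked8 and copy_mask_row are hypothetical, and it assumes 8-bit pixels with bit 7 of each mask byte governing the leftmost of its eight pixels. The switch keeps the same three cases as the quoted snippet (mask 0, mask 0xFF, and the "hard way").

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the pixels selected by one mask byte.
   Assumption: bit 7 of the mask governs the leftmost of the 8 pixels. */
static void copy_masked8(const uint8_t *src, uint8_t *dst, uint8_t mask)
{
    switch (mask) {
    case 0x00:                  /* nothing selected: leave dst untouched */
        return;
    case 0xFF:                  /* everything selected: copy all 8 bytes */
        memcpy(dst, src, 8);
        return;
    default:                    /* the "hard way": test each mask bit */
        for (int i = 0; i < 8; i++) {
            if (mask & (0x80u >> i))
                dst[i] = src[i];
        }
    }
}

/* Copy one row of 8-bit pixels; each of the mask_bytes mask bytes
   covers 8 consecutive pixels of src and dst. */
static void copy_mask_row(const uint8_t *src, uint8_t *dst,
                          const uint8_t *mask, size_t mask_bytes)
{
    for (size_t i = 0; i < mask_bytes; i++)
        copy_masked8(src + 8 * i, dst + 8 * i, mask[i]);
}
```

A model like this is slow, but it can serve as a check: run it and a hand-optimized routine over the same source, mask, and destination buffers and compare the results byte for byte.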