- Newsgroups: comp.sys.mac.programmer
- Path: sparky!uunet!charon.amdahl.com!pacbell.com!sgiblab!darwin.sura.net!paladin.american.edu!news.univie.ac.at!hp4at!mcsun!news.funet.fi!funic!nntp.hut.fi!vipunen.hut.fi!jmunkki
- From: jmunkki@vipunen.hut.fi (Juri Munkki)
- Subject: Re: Help! making an assembly routine faster
- Message-ID: <1992Nov18.203345.1290@nntp.hut.fi>
- Sender: usenet@nntp.hut.fi (Usenet pseudouser id)
- Nntp-Posting-Host: vipunen.hut.fi
- Reply-To: jmunkki@vipunen.hut.fi (Juri Munkki)
- Organization: Helsinki University of Technology
- References: <1992Nov16.014850.28678@cs.uoregon.edu> <1992Nov16.190947.9920@nntp.hut.fi> <1992Nov18.010815.6649@cs.uoregon.edu>
- Date: Wed, 18 Nov 1992 20:33:45 GMT
- Lines: 110
-
- In article <1992Nov18.010815.6649@cs.uoregon.edu> mkelly@mystix.cs.uoregon.edu (Michael A. Kelly) writes:
- > According to the Motorola manual, you're right. But in practice this slowed
- > things down quite a bit. I can't figure out why. I replaced the CMPI with
- > an ADDQ #1, then at @hardway I did a SUBQ #1. My test case is a 32x32 rect
- > with a 32x32 filled circle as the mask. I think it would slow things down
- > a lot more with more complicated masks. But still, I don't know why it's
- > slower, since the CMPI takes 8 clock cycles and the ADDQ and SUBQ each
- > take 4, so it really should be faster.... Then again, those timings are
- > for the 68000 (and 68020 too I think), and I'm using a 68040.
-
- On the 040, the instructions can overlap quite a bit. I guess that the
- modification of a data register prevented the overlap. I suggest that
- you try storing the constant 0xFF in a free data register and doing
- the compare with the data register. Register to register compares should
- always be faster than immediate to register compares.
-
- > >it might be as much as 10%. The only way to go beyond this is to make
- > >the move.l commands aligned on long word destinations, as I mentioned
- > >in my previous article.
- >
- > But as long as I align the source and destination Pixmaps, that isn't an
- > issue, right?
-
- I thought about this alignment stuff and it occurred to me that the mask
- bitmap would be a lot harder to use if you aligned your writes to video
- RAM. On the Quadras, video RAM is so fast that alignment probably doesn't
- matter all that much. On NuBus, things are usually quite different.
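
To make the difficulty concrete, here is a rough C sketch (not taken from the routine under discussion; `lead_bytes` and `group_mask` are made-up names). It assumes a 1-bit mask stored most significant bit first, with one mask bit per 8-bit pixel. If the destination needs `lead` single-byte writes before it reaches a long-word boundary, every later group of eight pixels takes its mask bits from two neighbouring mask bytes, so a mask byte can no longer be used directly as a table index.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of single-byte writes needed before dst sits on a 4-byte boundary. */
static unsigned lead_bytes(uintptr_t dst)
{
    return (unsigned)((4u - (dst & 3u)) & 3u);
}

/*
 * Mask bits for the g-th aligned group of eight pixels, assuming the first
 * `lead` pixels (0..3) were plotted separately and the mask is MSB-first.
 * When lead != 0 the caller must make sure mask[g + 1] is readable.
 */
static uint8_t group_mask(const uint8_t *mask, size_t g, unsigned lead)
{
    if (lead == 0)
        return mask[g];             /* mask bytes line up with pixel groups */
    /* Otherwise stitch the group together from two adjacent mask bytes. */
    return (uint8_t)((mask[g] << lead) | (mask[g + 1] >> (8 - lead)));
}
```

The extra shift and OR per group is the kind of overhead that could eat up whatever the aligned writes gain.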
-
- > OK, here's the new code. The first one is the newer, better version of
- > Quick8CopyMask, with most of the optimizations suggested by Juri. It's
- > about 5.5 times as fast as QuickDraw's CopyMask, at least with my simple
- > circle mask test case. The second one is a small part of a very large
- > Quick8CopyMask that has 256 separate subroutines to handle each mask
- > byte, rather than only 16 subroutines to handle a mask nibble (a nibble is
- > half a byte, right?). It's far too long to post here, but if you want a
- > copy I'll be happy to email it to you. It's about 6.5 times as fast as
- > CopyMask; about 15% faster than the short version.
- >
- > I tested the routines with the mask used in the CalcCMask DTS snippet;
- > the short version was 5.7 times as fast as CopyMask and the long version
- > was 7 times as fast.
-
- It should be quite hard to improve on the speed of the longer code. I bet it
- took quite a few minutes to write it. :-)
-
- I do have an idea that you could try, if you still feel like the code should
- be improved.
-
- Snippet from long version:
- > @1: ; copy the next row
- > MOVE.W w, D1
- > @2: ; copy the next eight bytes in the row
- > CLR.W D2 ; clear the mask register
- > MOVE.B (A2)+, D2 ; copy the next mask byte
- > BEQ @nocopy ; if zero, don't copy anything
- >
- > CMPI.B #0xFF, D2
- > BNE @hardway ; don't copy everything
- >
- > MOVE.L (A0)+, (A1)+ ; copy all bytes
- > MOVE.L (A0)+, (A1)+
- >
- > DBF D1, @2
- > JMP @endloop
- >
- > @nocopy: ; copy no bytes
- > ADDQ.L #8, A0
- > ADDQ.L #8, A1
- >
- > DBF D1, @2
- > JMP @endloop
- >
- > @hardway:
- > ADD.W D2, D2 ; double the index
- > ADD.W @table(D2.W), D2 ; calculate the address
- > JMP @table(D2.W) ; plot eight pixels
-
- I finally dug up my 020 manual and went through the addressing modes.
-
- Instead of having a jump table, you should probably use a table of jumps. :-)
-
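-         clr.w   D2              ; high byte of D2.w stays zero from here on
- @1
-         move.w  w,D1
-
- @2
-         move.b  (A2)+,D2        ; next mask byte, used directly as the index
-         jmp     (@jumptable,PC,D2.w*4)  ; each bra.w is 4 bytes (68020+ scaled index)
-
- @jumptable
-         bra.w   @mask0
-         bra.w   @mask1
-         bra.w   @mask2
-         bra.w   @mask3
-         ...
-         bra.w   @mask254
-         move.l  (A0)+,(A1)+     ; This is mask 255
-         move.l  (A0)+,(A1)+
-         dbf     D1,@2
-         ...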
-
- I checked with Think C, and at least the above code (or something similar
- to it) compiles and the disassembly looks reasonable.
-
- Note that I removed the special checks for 0 and 255. I think they are
- mostly wasted, but it's possible they speed things up for masks with large
- solid areas.
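
For readers who do not follow 68k assembly, the dispatch idea can be modelled in C. This is only a sketch under assumptions, not the routine from this thread: all names (`plot8_fn`, `plot_table`, `copy_mask_row` and so on) are invented, the mask is assumed to be MSB-first with one bit per 8-bit pixel, and a single generic `plot_some` stands in for the 254 hand-written partial-mask routines. The point it illustrates is that the mask byte indexes the table directly, with no compare against 0 or 0xFF before the jump.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One routine per mask byte value; each plots a group of eight pixels. */
typedef void (*plot8_fn)(const uint8_t *src, uint8_t *dst, uint8_t mask);

/* Mask 0: nothing to copy. */
static void plot_none(const uint8_t *src, uint8_t *dst, uint8_t mask)
{
    (void)src; (void)dst; (void)mask;
}

/* Mask 255: copy all eight bytes (the two MOVE.L instructions). */
static void plot_all(const uint8_t *src, uint8_t *dst, uint8_t mask)
{
    (void)mask;
    memcpy(dst, src, 8);
}

/* Generic stand-in for the specialised routines for masks 1..254. */
static void plot_some(const uint8_t *src, uint8_t *dst, uint8_t mask)
{
    for (int i = 0; i < 8; i++)
        if (mask & (0x80u >> i))    /* assumed MSB-first bit order */
            dst[i] = src[i];
}

static plot8_fn plot_table[256];

/* Fill the table once; 0 and 255 are ordinary entries, not special cases. */
static void init_plot_table(void)
{
    for (int m = 0; m < 256; m++)
        plot_table[m] = plot_some;
    plot_table[0] = plot_none;
    plot_table[255] = plot_all;
}

/* Copy one row of `groups` eight-pixel groups, dispatching on each mask byte. */
static void copy_mask_row(const uint8_t *src, uint8_t *dst,
                          const uint8_t *mask, size_t groups)
{
    for (size_t g = 0; g < groups; g++) {
        uint8_t m = mask[g];
        plot_table[m](src + 8 * g, dst + 8 * g, m);
    }
}
```

Call `init_plot_table()` once before the first `copy_mask_row()`. Whether explicit tests for 0 and 255 ahead of the table lookup pay off depends on how much of the mask is solid, so it is worth timing both variants on real masks.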
-
- --
- Juri Munkki Windsurf: fast sailing
- jmunkki@hut.fi Macintosh: fast software
-