No Fragments Archive 10: Diskmags

Text File | 1989-06-08 | 17.6 KB | 445 lines

<head>
<title="...forever...">
<font=monaco10.fnt>
<font=newy36.fnt>
<font=time24.fnt>
<image=back.raw w=256 h=256 t=-1>
<buf=6841>
<bgcolor=-1>
<background=0>
<link_color=000>
<module=console.mod>
<pal=back.pal>
colors:
251 - black
</head>
<body>
<frame x=0 y=0 w=640 h=6841 b=-1 c=-1>

<f1><c000> DSP routines and optimisation <f0>
<f1><c000> various bits and pieces <f0>

by earx 2003

- history ---------------------------------------------------------------------

a falcon is fitted with a cute little dsp chip. this piece of silicon is a
wonder of efficiency, especially for its time. it runs most instructions in
just 2 cycles and is capable of running at 50MHz without breaking a sweat (or
higher with new srams).

now when the falcon was released there were a lot of discussions going on
about whether it could actually beat a top class pc (costing twice as much).
most people agreed that a 16MHz 68030 wasn't much compared to a 50MHz 80486.
others argued that the 56001 added so much power to the falcon that it could
easily take on such a pc. i was and still am one of the supporters of the
latter statement.

the only problem was this: a 030 might be relatively easy to code, easier
even than the 68000 we loved from the st days, but the dsp was a whole
different story. its architecture differs a lot from the usual cpus
(risc/cisc). it wasn't easy to code dsp, especially not in the early days.
people were forced to program with motorola's assembler, which might be
complete and flexible, but is also bloated and slower than snails.

i have enormous respect for the pioneers on the dsp: eko, aura, bitmaster,
dnt crew, aggression and others. from just doing what the dsp was initially
intended to do (audio filtering) they went on to make it play mods, paint
shaded vectors, do 3d transformations, fractals and so on.

- motivation ------------------------------------------------------------------

the point here is that it really took a few years before the dsp's potential
was realised.
however, i think we haven't seen the best routs yet. this is why i'm writing
this doc: to possibly get some people interested in dsp hacking. also, dsp
coding is a lot of fun. there are really funny ways of doing things on it
that don't seem like cpu coding at all. but we'll get to that later on..

first, i want to mention the reasons for coding dsp. a good reason may be
educational purposes. the dsp/cpu system is basically a multiprocessor
scheme, complete with separate buses and synchronous and asynchronous
transfer. it can tell you a good deal about how other multiprocessor systems
work (consoles!).

another reason can be an interest in filtering, ffts, dcts and all that
stuff. the dsp is very well suited for these and implementation is relatively
easy.

the third reason is just to show what a basic machine can do! to fully
exploit the potential of the falcon you have to use the dsp, there is no way
around it! using only a 030 might seem like working with one hand tied behind
your back, but this analogy is somewhat flawed. people use both hands
intuitively and without much effort, but programming multiprocessor systems
is definitely less intuitive than coding singleprocessor systems. yep, this
is a bit of a warning. but don't get scared yet ;)

- getting started -------------------------------------------------------------

i won't explain everything you need to know here. i'm sorry, but i have
little time. and besides, tat wrote a good doc on getting started with the
dsp (loading p56s/lods, transferring with the hostport). all i will say here
is that a running dsp program is often a synergy of a cpu program and a dsp
program, not just a dsp program and its loader. and to get this running you
need to make some protocols for transmission.

furthermore you need a good dsp assembler. as i said, motorola's is complete,
but not very practical to use. the conversion of cld->p56 is tedious. or at
least i tend to think so. there are alternatives.
dspdit for lovers of turboasm and qdsp for others (you can install it as a
tool in devpac or just run it from the commandline). there is also a56, but i
have never tried it.

a debugger is also a must. on cpu it's often possible to just hack something
and, if it goes wrong, look at the code a bit to set it right. on the dsp
however, especially when starting out, it's not so easy. we'll get to that
later.

okay, now it's time to give some pointers. dsp coding has quite a lot of
pitfalls and in order to have a decent coding session it's advisable to avoid
these.

- split buses -----------------------------------------------------------------

the 56k dsps use a split bus architecture known as modified harvard. this
means you actually have 3 buses. the first is for p(rogram) memory, guess
what it is for ;) the second and third are for x and y data memory. to
maximise throughput you can load and store in parallel, or load twice, or
store twice, whatever..

this is quite a change of philosophy from the normal 680x0 we're used to. the
680x0 can of course load and store in one instruction, which is damn easy to
code. however, it can definitely not do it in parallel. also, next to this
parallel load/store the dsp has an arithmetic instruction. all three can run
in parallel. of course we can't do this or need this all of the time, but it
is possible to reach these peak throughputs with some good thinking.

good advice however is to first write your code in pure 'sequential' form and
get it to work, and later on parallelize it step by step. for instance:

; x0=scalar
        do      #13,_loop
        move    y:(r4)+,y0                      ; get input.
        mpy     x0,y0,a                         ; a=result
        move    a,x:(r0)+                       ; store res.
_loop:

can be done as:

; x0=scalar
        move    y:(r4)+,y0                      ; y0=input[0]
        mpy     x0,y0,a                         ; a=result[0]
        do      #13,_loop
        move    y:(r4)+,y0                      ; get next input.
        mpy     x0,y0,a a,x:(r0)+               ; store previous res.
_loop:

and we can go further:

; x0=scalar
        move    y:(r4)+,y0                      ; y0=input[0]
        mpy     x0,y0,a y:(r4)+,y0              ; a=result[0], y0=input[1]
        do      #13,_loop
        mpy     x0,y0,a a,x:(r0)+ y:(r4)+,y0    ; store result[n], a=result[n+1]
_loop:

the parallel moves work on the registers as they were before the
instruction: the old a is stored while mpy computes the new one, and mpy
still uses the y0 that was fetched one step earlier. that is why the input
always has to be one step ahead of the multiply.

yes, about 3 times as fast!
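to see what the registers hold at each step, here is a small c model of the
sequential loop and the fully parallel one. it is only a sketch, not falcon
code: plain int multiplies stand in for the dsp's fractional mpy, the
function names are made up for this example, and old_a / old_y0 model the
assumption that the parallel moves on one instruction line work with the
register contents from before that instruction. note that in this model the
multiply in the head also fetches input[1]; without that, result[0] would be
stored twice.

```c
#include <assert.h>

/* sequential form: load, multiply, store, one after the other */
void scale_sequential(const int *in, int *out, int n, int x0)
{
    for (int i = 0; i < n; i++) {
        int y0 = in[i];         /* move y:(r4)+,y0 */
        int a = x0 * y0;        /* mpy  x0,y0,a    */
        out[i] = a;             /* move a,x:(r0)+  */
    }
}

/* parallel form: one modelled instruction per iteration.
   in[] needs n+2 readable words because the loop fetches ahead. */
void scale_pipelined(const int *in, int *out, int n, int x0)
{
    assert(n >= 1);             /* the head always runs once */

    int y0 = in[0];             /* head: move y:(r4)+,y0               */
    int a = x0 * y0;            /* head: a = result[0] ...             */
    y0 = in[1];                 /* ... with input[1] fetched alongside */

    for (int i = 0; i < n; i++) {
        int old_a = a;          /* registers before the instruction */
        int old_y0 = y0;
        a = x0 * old_y0;        /* mpy x0,y0,a : a = result[i+1]  */
        out[i] = old_a;         /* a,x:(r0)+   : store result[i]  */
        y0 = in[i + 2];         /* y:(r4)+,y0  : fetch input[i+2] */
    }
}
```

feeding both functions the same input (with two spare words at the end,
since the pipelined version fetches ahead) should give identical output
arrays.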
but notice that we have to make a special 'head' to initialize correctly.
this is often the case. also notice the r0/r4. these are chosen for a reason.
in a parallel x/y move one reg must be r0..r3 and the other must be r4..r7.
all things we have to keep in mind. but as we can see, this does pay off. :)

about the extra header code: it's often also the case that you need a tail to
complete the last iteration. so, indeed, this optimisation isn't without
consequences, sometimes even increasing the code-size a bit.

- overlaps --------------------------------------------------------------------

a thing correlated to the split buses is the overlap of memory. you would
like to think of your memory as 3 banks (p,x,y) which have nothing to do with
each other. this is not true. in fact you have both internal and external
memory (oh boy!). the internal memory is small:

               internal         | external
               -----------------+-----------------
p: 512 words   p:$0000..p:$01FF | p:$0200..p:$7FFF
x: 256 words   x:$0000..x:$00FF | x:$0100..x:$3FFF
y: 256 words   y:$0000..y:$00FF | y:$0100..y:$3FFF

but where's the overlap then? right here:

p:$0200..p:$3FFF = y:$0200..y:$3FFF
p:$4000..p:$7FFF = x:$0000..x:$3FFF

note that p:$4000..p:$40FF is _external_ x ram. no internal ram is mirrored,
only external.

as you can see, that's quite nasty. it's possible to be unaware of this since
it's all very transparent. and then all of a sudden your code gets
overwritten ;) a situation that occurs a lot is that you start coding your
program at p:$40 as usual and then you code a few hours till you hit p:$200.
you now overwrite y:$200. that's quite cute, as long as there's nothing
useful there in y mem. *g*

okay, i'll not leave you standing in the cold and give a good hint. use an
endlabel to get the end address of your code..

        ....
        org     p:$40
        ...
        ..
        mpy     ..
        mac     ..
        rep     ..
        div     ..
        rts
end_p_mem:

; internal y memory 0..$100..
        org     y:0

; external y memory end_p_mem..$3FFF
        org     y:end_p_mem

yep, this way you avoid fuckups.
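the overlap rules above can be turned into a little checker. a minimal sketch
in c, assuming the external ram words are simply numbered by their p address
(my own convention for this example, as are the names external_word and
overlaps); the address ranges are the ones given in the text.

```c
#include <assert.h>

enum dsp_space { P_MEM, X_MEM, Y_MEM };

/* which external ram word does a dsp address land on?
   returns -1 for internal (on-chip) memory or an unmapped address. */
int external_word(enum dsp_space space, unsigned addr)
{
    assert(addr <= 0xFFFF);     /* dsp addresses are 16 bit */

    switch (space) {
    case P_MEM:                 /* p:$0200..p:$7FFF is external */
        return (addr >= 0x0200 && addr <= 0x7FFF) ? (int)addr : -1;
    case Y_MEM:                 /* y:$0100..y:$3FFF, shared with low p */
        return (addr >= 0x0100 && addr <= 0x3FFF) ? (int)addr : -1;
    case X_MEM:                 /* x:$0100..x:$3FFF, shared with p:$4000.. */
        return (addr >= 0x0100 && addr <= 0x3FFF) ? (int)(addr + 0x4000) : -1;
    }
    return -1;
}

/* true if both addresses end up on the same external ram word */
int overlaps(enum dsp_space s1, unsigned a1, enum dsp_space s2, unsigned a2)
{
    int w1 = external_word(s1, a1);
    int w2 = external_word(s2, a2);
    return w1 >= 0 && w1 == w2;
}
```

overlaps(P_MEM, 0x200, Y_MEM, 0x200) comes out true, which is exactly the
'code grows past p:$200' accident, while code below p:$200 never collides
with anything because it sits in internal memory.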
now some things to take into consideration for speed optimisation. the first
thing is internal memory. the dsp can do 3 accesses on the 3 internal
memories at once. it can also do 1 access on an external memory and 2 on
internal memories in the same period. however, it can't do 2 accesses on the
external ram at once. it needs an extra cycle! for example:

; r0<$100, r4>=$100 (r0 internal, r4 external)
        org     p:$40
        mpy     x0,y0,a a,x:(r0)+ y:(r4)+,y0

will run in 2 cycles. but:

; r0>=$100, r4>=$100 (both external!)
        org     p:$40
        mpy     x0,y0,a a,x:(r0)+ y:(r4)+,y0

will take an extra cycle! another bad situation:

; r0>=$100, r4>=$100 (both external!)
        org     p:$200
        mpy     x0,y0,a a,x:(r0)+ y:(r4)+,y0

now 3 accesses on the external memories are done. please note i'm not exactly
sure about the timings, but i do know the latter situations actually are
slower than the first.

we may conclude from this that keeping at least the time-critical loops in
internal p memory is a good idea. also it's healthy to keep one address
register pointing to internal mem (either x or y).

- bootstrapping ---------------------------------------------------------------

if i'm not mistaken, a nice small piece of bootstrap code is found at p:$7800
(or x:$3800). if you need all the dsp mem you can get, you will most likely
take it for yourself. no problem, you say. next time a p56 is loaded, the os
will just reset the dsp and reinstall this bootstrap code. *buzz* wrong.
atari didn't implement this, so killing the bootstrap on the dsp is
effectively the same as killing low memory (tos) on the cpu side.

in order to do this right you need to re-install a bootstrapper yourself. the
code involved here is too much to mention and i recommend you look at the
bootstrapper by nocrew.

- hostport --------------------------------------------------------------------

the hostport is an essential communication channel between cpu and dsp. there
has been a lot of fuss about it.
first of all atari decided to make it a cheap 8bit implementation. 24bit
words are split up into 3 bytes and transferred sequentially. also, in some
cases you are forced to use handshaking, which further decreases performance.
these are really the achilles heel of the dsp system. it's like driving a
racecar and being restricted to 100 km/h. but enough about this. i am here to
tell how to get around these limitations as much as possible.

handshaking is an interesting issue. first of all, it's the fastest processor
that needs to be handshaked, since it is the one that has to wait. note the
subtlety: if the complexity of the calculations fluctuates, it's hard to know
which one that is... so, in those cases you have to handshake both cpu and
dsp. anyway, that's safest. it will always run on any falcon.

often however coders have kicked out the cpu side handshaking to increase
throughput. this is possible. i have seen routs function well even when the
cpu had a higher clock than the dsp. however, remember to always handshake
before transmitting the first word. this way, you start synchronized. also,
the dsp loop must be damn tight and must typically be a maximum of 10 cycles
(excluding the actual transmit instructions!). for instance:

; wrapped texturemapping (for 64x64 textures)
; y:(r4): texture
; high accuracy.. 10bit subtexel. on 68K 8bit is common..
        movec   #$FFFF,m0
        movec   #$FFFF,m5
        move    #<$08,y0        ; y0=downscalar (%UUUUUUuuuuuuuuuu -> %0000UUUUUUuuuuuu)
        move    #$000FC0,y1     ; y1=U.0 mask
        move    #$002000,x1     ; x1=downscalar (V<<10+v -> V)
; r0: %UUUUUUuuuuuuuuuu[0] (start)
; n0: u_step (%UUUUUUuuuuuuuuuu)
; r5: %VVVVVVvvvvvvvvvv[0] (start)
; n5: v_step (%VVVVVVvvvvvvvvvv)
; n2=#pixels in scan
        do      n2,_xloop
        move    r0,x0           ; x0=%UUUUUUuuuuuuuuuu
        mpyr    y0,x0,a (r0)+n0 ; a=%0000UUUUUUuuuuuu, r0=%UUUUUUuuuuuuuuuu[n+1]
        and     y1,a    r5,x0   ; a=%0000UUUUUU000000, x0=%VVVVVVvvvvvvvvvv
        mac     x1,x0,a (r5)+n5 ; a=%0000UUUUUUVVVVVV, r5=%VVVVVVvvvvvvvvvv[n+1]
        move    a,n4            ; n4=%0000UUUUUUVVVVVV
        jclr    #1,x:<<HSR,*    ; Wait until host is ready.
        movep   y:(r4+n4),x:<<HTX ; Transmit texel.
_xloop:

this is my new texturing loop. it is able to wrap (for seamless 'wallpaper'
textures) and is fast enough to keep up with accelerated cpus.

now optimising the dsp part is important, but like i said before, many dsp
applications are a combined process, so the cpu could do with some
optimisation as well. on a standard falcon (16MHz 030, 32MHz dsp) the dsp
runs circles around the cpu, so to say, so the cpu becomes the bottleneck. a
normal texturing loop would look like this:

        move.w  #NUM_PIXELS-1,d7
        lea     $FFFFA206.w,a1          ; dsp xmit (16 bit)
loop:   move.w  (a1),(a0)+              ; copy dsp->screen
        dbf     d7,loop                 ; and another..

we already see that we only use the lower 16bit part of the transmit
register. this causes only two reads on the bus instead of three. remember
the story about the 8bit hostport? well, that applies here. because we fetch
a 16bit word we cause two (8bit) accesses.

this loop is reasonable, however it can be much better for wider spans. and
with that i mean spans with more than 2 pixels ;) you can try to unroll as
far as the i-cache can take you, to kill the dbf time (10 cycles mostly!).

        moveq   #NUM_PIXELS,d7
        moveq   #16-1,d1
        move.l  d7,d0
        and.l   d1,d0                   ; count mod 16
        neg.l   d0
        lsr.w   #4,d7                   ; count/16
        jmp     jump(pc,d0.l*2)         ; skip unneeded pix.
loop:
        rept    16                      ; unroll 16 times
        move.w  (a1),(a0)+              ; copy dsp->screen
        endr
jump:   dbf     d7,loop                 ; next 16 pixels..

a more costly initialisation, but as you can see also a much cheaper loop.
each copy is one 2-byte instruction, so jumping back from the jump label by
2 bytes per leftover pixel does exactly count mod 16 copies first. this can
result in 1.1 mln pix/s on a standard falcon (rgb 320x200) and this is a
_net_ measurement: not just a measurement of the innerloop, but also
including the yloop and scanconversion.

yes, this is all very nice. but what about the cases _with_ handshaking?
right, if you really need handshaking it's best to use it on larger words.
handshaking for each byte is madness, it's a good idea to use it on 16bit or
even 24bit portions. sadly, 24bit portions aren't very common.
the 030 doesn't use 12 or 24bit memory words (would be nice in this case
wouldn't it? ;)) so we're more interested in 16 and 8 bit modes. if we need
to transmit bytes it's best to pack them together in a 16bit word. a good
example is using bytes to index a palette. often, when rgb operations like
saturation are too costly, you'd want to use this. now we can look up as
follows:

; init
        clr.l   d0
        lea     $FFFFA207.w,a1          ; dsp xmit (lsb)
        lea     palette,a2
        lea     $FFFFA202.w,a3          ; dsp control
; ..loop..
wait:   btst    #0,(a3)                 ; handshake.....
        beq.s   wait                    ; ..
        move.b  (a1),d0                 ; get index from dsp.
        move.w  (a2,d0.l*2),(a0)+       ; lookup and store.

of course a complete handshake for only 1 silly byte lookup is crazy, so
fetch two indices per handshake:

; init
        clr.l   d0
        lea     $FFFFA206.w,a1          ; dsp xmit (lsw)
        lea     palette,a2
        lea     $FFFFA202.w,a3          ; dsp control
; ..loop..
wait:   btst    #0,(a3)                 ; handshake.....
        beq.s   wait                    ; ..
        move.w  (a1),d0                 ; get indices from dsp.
        move.l  (a2,d0.l*4),(a0)+       ; lookup and store.

here we see two pixels are done at once with only 3 bus accesses extra (only
1 or 2 with good use of ttram!). please note that you also need a special
double palette here, with one longword for every pair of indices, like so:

        dc.w    col0,col0
        dc.w    col0,col1
        dc.w    col0,col2
        dc.w    ...
        ...
        ...
        dc.w    col0,col254
        dc.w    col0,col255
        dc.w    col1,col0
        dc.w    col1,col1
        ...
        ...
        dc.w    col1,col255
        dc.w    col2,col0
        ...
        ...

of course on the dsp this requires a packing operation. but as we shall see,
this is nothing:

        move    #>128,y0        ; y0=scalar
; x0: first index ($00FF), a0: second index ($00SS)
        mac     x0,y0,a         ; a0=$00FFSS

(the fractional multiply doubles the product, so multiplying by 128 shifts
the first index up by 8 bits.)

okay, one last note about handshaking: btst sucks! yes folks, there is
another and better way to do it.

        btst    #0,(a3)
=
        moveq   #1,d0
        and.b   d0,(a3)

this does not clear the other bits in the status reg. why? it's a read only
register, so whatever and.b writes back is simply ignored. the zero flag
comes out the same as with btst and the instruction is one word shorter, so
it's basically not more than a btst but faster. of course you have to reserve
a register, but who cares. i don't know why atari or other programmers didn't
notice this one. it can give a healthy speedup...
okay, not a factor 2, but still enough for some cases where you need to have
an fx run in 1 or 2 vbl and need to give it a little 'push'.

<link=art48b.scr>Go to NEXT PART</l>
</frame>
</body>