- <head>
- <title="...forever...">
- <font=monaco10.fnt>
- <font=newy36.fnt>
- <font=time24.fnt>
- <image=back.raw w=256 h=256 t=-1>
- <buf=6841>
- <bgcolor=-1>
- <background=0>
- <link_color=000>
- <module=console.mod>
- <pal=back.pal>
- colors:
- 251 - black
- </head>
- <body>
- <frame x=0 y=0 w=640 h=6841 b=-1 c=-1>
-
-
- <f1><c000> DSP routines and optimisation <f0>
- <f1><c000> various bits and pieces <f0>
-
- by earx 2003
-
-
- - history ---------------------------------------------------------------------
-
- a falcon is fitted with a cute little dsp chip. this piece of silicon is a
- wonder of efficiency, especially for its time. it runs most instructions in
- just 2 cycles and is capable of running at 50MHz without breaking a sweat (or
- higher with new srams).
-
- now when the falcon was released there was a lot of discussion about
- whether it could actually beat a top class pc (costing twice as much). most
- people agreed that a 16MHz 68030 wasn't much compared to a 50MHz 80486. others
- argued that the 56001 added so much power to the falcon that it could easily
- take on such a pc.
-
- i was and still am one of the supporters of the latter statement. the only
- problem was the coding. a 030 is relatively easy to code, easier even than the
- 68000 we loved from the st days. but the dsp was a whole different story. its
- architecture differs a lot from the usual cpus (risc/cisc).
-
- it wasn't easy to code dsp. especially not in the early days. people were
- forced to program with motorola's assembler, which might be complete and
- flexible, but also bloated and slower than snails.
-
- i have enormous respect for the pioneers on the dsp: eko, aura, bitmaster, dnt
- crew, aggression and others. from just doing what the dsp was initially
- intended to do (audio filtering) they made it play mods, paint shaded
- vectors, do 3d transformations, fractals and so on.
-
- - motivation ------------------------------------------------------------------
-
- the point here is that it really took a few years before the dsp's potential
- was unleashed. however, i think we haven't seen the best routs yet. this is why
- i wrote this doc: to possibly get some people interested in dsp hacking. also,
- dsp coding is a lot of fun. there are really funny ways of doing things on it
- that don't seem like cpu coding at all. but we'll get to that later on..
-
- first, i want to mention the reasons for coding dsp. a good reason may be
- educational purposes. the dsp/cpu system is basically a multiprocessor scheme,
- complete with separate buses, synchronous and asynchronous transfer. it might
- teach you a good deal about how other multiprocessor systems work (consoles!).
-
- another reason can be an interest in filtering, fft's, dct's and all that
- stuff. the dsp is very suited for these and implementation is relatively easy.
-
- the third reason is just to show what a basic machine can do! to fully
- exploit the potential of the falcon you have to use dsp, there is no way around
- it! using only a 030 might seem like working with one hand tied behind your
- back, but this analogy is somewhat flawed. people use both hands intuitively
- and without much effort. but programming multiprocessor systems is definitely
- less intuitive than coding singleprocessor systems. yep, this is a bit of a
- warning. but don't get scared yet ;)
-
- - getting started -------------------------------------------------------------
-
- i won't explain everything you need to know here. i'm sorry, but i have little
- time. and besides, tat wrote a good doc on getting started with the dsp
- (loading p56s/lods, transferring with hostport).
-
- all i will say here is that a running dsp program is often a synergy of a cpu
- program and a dsp program, not just a dsp program and its loader. and to get
- this running you need to make some protocols for transmission.
-
- furthermore you need a good dsp assembler. as i said, motorola's is complete,
- but not very practical to use. the conversion of cld->p56 is tedious. or at
- least i tend to think so. there are alternatives. dspdit for lovers of turboasm
- and qdsp for others (you can install it as a tool in devpac or just run with
- commandline). there is also a56, but i have never tried.
-
- a debugger is also a must. on cpu it's often possible to just hack something
- and if it goes wrong look at the code a bit to set it right. on the dsp
- however, especially when starting it's not so easy. we'll get to that later.
-
- okay, now it's time to give some pointers. dsp coding has quite a lot of
- pitfalls and in order to have a decent coding session it's advisable to avoid
- these.
-
- - split buses -----------------------------------------------------------------
-
- the 56k dsps use a split bus architecture known as modified harvard. this
- means you actually have 3 buses. the first is p(rogram) memory, guess what it
- is for ;) the second and third are x,y data memory. to maximise throughput you
- can load and store in parallel, or load twice, or store twice, whatever.. this
- is quite a change of philosophy from the normal 680x0 we're used to.
-
- the 680x0 can of course load and store in one instruction, which is damn easy
- to code. however, it can definitely not do it in parallel. also, next to this
- parallel load/store the dsp has an arithmetic instruction. all three can run
- in parallel.
-
- of course we can't do this or need this all of the time, but it is possible to
- reach these peak throughputs with some good thinking. a good piece of advice,
- however, is to first write your code in pure 'sequential' form and get it to
- work, and later on parallelize it step by step.
-
- for instance:
-
- ; x0=scalar
- do #13,_loop
- move y:(r4)+,y0 ; get input.
- mpy x0,y0,a ; a=result
- move a,x:(r0)+ ; store res.
- _loop:
-
- can be done as:
-
- ; x0=scalar
- move y:(r4)+,y0 ; y0=input[0]
- mpy x0,y0,a ; a=result[0]
- move y:(r4)+,y0 ; y0=input[1]
- do #13,_loop
- mpy x0,y0,a a,x:(r0)+ ; store res[n], a=result[n+1]
- move y:(r4)+,y0 ; get input[n+2].
- _loop:
-
- and we can go further
-
- ; x0=scalar
- move y:(r4)+,y0 ; y0=input[0]
- mpy x0,y0,a y:(r4)+,y0 ; a=result[0], y0=input[1]
- do #13,_loop
- mpy x0,y0,a a,x:(r0)+ y:(r4)+,y0 ; store res[n], a=result[n+1]
- _loop:
-
- yes, about 3 times as fast! but notice that we have to make a special 'head' to
- initialize correctly. this is often the case. also notice the r0/r4. these are
- chosen for a reason. in a parallel x/y move one reg must be r0..r3 and the
- other must be r4..r7. all things we have to keep in mind. but as we can see,
- this does pay off. :)
-
- about the extra header code. it's often also the case that you need a tail to
- complete the last iteration. so, indeed, this optimisation isn't without
- consequences, sometimes even increasing the code-size a bit.
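-
- to see why the head and tail are needed, here is a rough model of the same
- idea in python. this is not falcon code, just a sketch: the names
- scale_sequential and scale_pipelined are made up, and the "parallel" store,
- multiply and load of one dsp instruction are written as three ordinary steps
- that all use the values from before the instruction.
-

```python
def scale_sequential(inputs, scalar):
    """Straightforward loop: load, multiply, store."""
    out = []
    for value in inputs:
        out.append(scalar * value)
    return out


def scale_pipelined(inputs, scalar):
    """Same result with store/multiply/load overlapped per pass."""
    n = len(inputs)
    if n == 0:
        return []
    out = []
    # head: fetch input[0], compute result[0], fetch input[1]
    y0 = inputs[0]
    a = scalar * y0
    y0 = inputs[1] if n > 1 else 0
    # body: each pass stores the previous result, multiplies the current
    # input and fetches the next one
    for i in range(n - 1):
        stored = a                                # a,x:(r0)+ sees the old a
        a = scalar * y0                           # mpy x0,y0,a
        y0 = inputs[i + 2] if i + 2 < n else 0    # next load, guarded here
        out.append(stored)
    # tail: the last result is still sitting in the accumulator
    out.append(a)
    return out
```

- both functions give the same list. the pipelined one just does the work one
- step "ahead", which is exactly what costs you the extra head and tail.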
-
- - overlaps --------------------------------------------------------------------
-
- a thing correlated to the split buses is the overlap of memory. you would like
- to think of your memory as 3 banks (p,x,y) which have nothing to do with each
- other. this is not true. in fact you have both internal and external memory (oh
- boy!). the internal memory is small:
-
- internal | external
- ------------------------+-------------------------
- p: 512 words p:0..p:$1FF|p:$0200..p:$7FFF
- x: 256 words x:0..x:$0FF|x:$0100..x:$3FFF
- y: 256 words y:0..y:$0FF|y:$0100..y:$3FFF
-
- but where's the overlap then? right here:
-
- p:$0200..p:$3FFF = y:$0200..y:$3FFF
- p:$4000..p:$7FFF = x:$0000..x:$3FFF
-
- note that p:$4000..p:$40FF is _external_ x ram. no internal ram is mirrored,
- only external.
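-
- the mapping above is easy to get wrong in your head, so here is a tiny python
- sketch of it. p_alias is a made-up helper, not part of any tool. it only
- encodes the two ranges listed above and treats p:0..p:$1FF as internal.
-

```python
def p_alias(p_addr):
    """Return the (bank, address) sharing ram with p:p_addr, or None if internal."""
    if not 0 <= p_addr <= 0x7FFF:
        raise ValueError("p address out of range")
    if p_addr < 0x200:
        return None                      # internal p ram, not mirrored
    if p_addr < 0x4000:
        return ("y", p_addr)             # p:$0200..$3FFF = y:$0200..$3FFF
    return ("x", p_addr - 0x4000)        # p:$4000..$7FFF = x:$0000..$3FFF
```

- for example p_alias(0x4000) gives ("x", 0), the external x ram mentioned in
- the note above.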
-
- as you can see, that's quite nasty. it's possible to be unaware of this since
- it's all very transparent. and then all of a sudden your code gets overwritten
- ;)
-
- a situation that occurs a lot is that you start coding your program at p:$40 as
- usual and then you code a few hours till you hit p:$200. you now overwrite
- y:$200. that's quite cute, as long as there's nothing useful there in y mem.
- *g* okay, i'll not leave you standing in the cold and give a good hint. use an
- endlabel to get the end address of your code..
-
- ....
-
- org p:$40
-
- ...
- ..
-
- mpy ..
- mac ..
- rep ..
- div ..
- rts
- end_p_mem:
-
- ; internal y memory 0..$100..
- org y:0
-
- ; external y memory end_p_mem..$3FFF
- org y:end_p_mem
-
- yep, this way you avoid fuckups.
-
- now some things to take into consideration for speed optimisation. the first
- thing is internal memory. the dsp can do 3 accesses on the 3 internal memories
- at once. also it can do 1 access on an external memory and 2 on internal
- memories in the same period. however, it can't do 2 accesses on the external
- ram at once. it needs an extra cycle!
-
- for example:
-
- ; r0<$100, r4>=$100 (r0 internal, r4 external)
- org p:$40
- mpy x0,y0,a a,x:(r0)+ y:(r4)+,y0
-
- will run in 2 cycles. but:
-
- ; r0>=$100, r4>=$100 (both external!)
- org p:$40
- mpy x0,y0,a a,x:(r0)+ y:(r4)+,y0
-
- will take an extra cycle! another bad situation:
-
- ; r0>=$100, r4>=$100 (both external!)
- org p:$200
- mpy x0,y0,a a,x:(r0)+ y:(r4)+,y0
-
- now 3 accesses on the external memories are done.
-
- please note i'm not exactly sure about the timings, but i do know the latter
- situations actually are slower than the first. we may conclude from this that
- keeping at least the time-critical loops in internal p memory is a good idea.
- also it's healthy to keep one address register pointing to internal mem
- (either x or y).
-
- - bootstrapping ---------------------------------------------------------------
-
- if i'm not mistaken, at p:$7800 (or x:$3800) a nice small piece of bootstrap
- code is found. if you need all the dsp mem you can get, you will most likely
- take it for yourself. no problem, you say. next time a p56 is
- loaded, the os will just reset the dsp and reinstall this bootstrap code.
- *buzz* wrong.
-
- atari didn't implement this, so killing the bootstrap on dsp is effectively the
- same as killing low memory (tos) on the cpu side. in order to do this right you
- need to re-install a bootstrapper yourself. the code involved here is too much
- to mention and i recommend you look at the bootstrapper by nocrew.
-
- - hostport --------------------------------------------------------------------
-
- the hostport is an essential communication channel between cpu and dsp. there
- has been a lot of fuss about it. first of all atari decided to make it a cheap
- 8bit implementation. 24bit words are split up into 3 bytes and transferred
- sequentially. also, in some cases you are forced to use handshaking, which
- further decreases performance. these are really the achilles heel of the dsp
- system. it's like driving a racecar and being restricted to 100 km/h.
-
- but enough about this. i am here to tell how to get around these limitations as
- much as that's possible.
-
- handshaking is an interesting issue. first of all, the fastest processor needs
- to be handshaked. note the subtlety. if the complexity of the calculations
- fluctuates, this is hard to know... so, in those cases you have to handshake
- both cpu and dsp. anyway, that's safest. it will always run on any falcon.
-
- often however coders have kicked the cpu side handshaking to increase
- throughput. this is possible. i have seen routs function well even when the
- cpu had a higher clock than the dsp. however, remember to always
- handshake before transmitting the first word. this way, you start synchronized.
- also, the dsp loop must be damn tight and must typically be a maximum of 10
- cycles (excluding the actual transmit instructions!).
-
- for instance:
-
- ; wrapped texturemapping (for 64x64 textures)
- ; y:(r4): texture
- ; high accuracy.. 10bit subtexel. on 68K 8bit is common..
- movec #$FFFF,m0
- movec #$FFFF,m5
- move #<$08,y0
- ; y0=downscalar (%UUUUUUuuuuuuuuuu -> %0000UUUUUUuuuuuu)
- move #$000FC0,y1
- ; y1=U.0 mask
- move #$002000,x1
- ; x1=downscalar (V<<10+v -> V)
-
- ; r0: %UUUUUUuuuuuuuuuu[0] (start)
- ; n0: u_step (%UUUUUUuuuuuuuuuu)
- ; r5: %VVVVVVvvvvvvvvvv[0] (start)
- ; n5: v_step (%VVVVVVvvvvvvvvvv)
- ; n2=#pixels in scan
-
- do n2,_xloop
- move r0,x0
- ; x0=%UUUUUUuuuuuuuuuu
- mpyr y0,x0,a (r0)+n0
- ; a=%0000UUUUUUuuuuuu, r0=%UUUUUUuuuuuuuuuu[n+1]
- and y1,a r5,x0
- ; a=%0000UUUUUU000000, x0=%VVVVVVvvvvvvvvvv
- mac x1,x0,a (r5)+n5
- ; a=%0000UUUUUUVVVVVV, r5=%VVVVVVvvvvvvvvvv[n+1]
- move a,n4
- ; n4=%0000UUUUUUVVVVVV
- jclr #1,x:<<HSR,* ; Wait until host is ready.
- movep y:(r4+n4),x:<<HTX ; Transmit texel.
- _xloop:
-
- this is my new texturing loop. it is able to wrap (for seamless 'wallpaper'
- textures) and fast enough to keep up with accelerated cpu's.
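-
- if the bit juggling in that loop is hard to follow, this python sketch does
- the same address calculation with plain shifts and masks. the function names
- are made up, and it is only an approximation: it ignores the rounding that
- mpyr does, and uses an or where the dsp uses mac (same thing here, since the
- two fields don't overlap).
-

```python
def texel_index(u, v):
    """u, v: 16-bit fixed point, 6 integer bits + 10 subtexel bits."""
    u &= 0xFFFF                     # 16-bit values wrap around by themselves
    v &= 0xFFFF
    u_part = (u >> 4) & 0x0FC0      # %0000UUUUUU000000
    v_part = v >> 10                # %0000000000VVVVVV
    return u_part | v_part          # %0000UUUUUUVVVVVV


def scan_indices(u, v, u_step, v_step, count):
    """Offsets into a 64x64 texture for one scanline."""
    out = []
    for _ in range(count):
        out.append(texel_index(u, v))
        u = (u + u_step) & 0xFFFF   # (r0)+n0
        v = (v + v_step) & 0xFFFF   # (r5)+n5
    return out
```

- stepping u from 63 to 64 falls back to 0 because of the 16-bit wrap, which is
- where the seamless 'wallpaper' effect comes from.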
-
- now optimising the dsp part is important, but like i said before, many dsp
- applications are a combined process. so the cpu could do with some optimisation
- as well. on a standard falcon (16MHz 030, 32MHz dsp) the dsp runs circles
- around the cpu so to say, so the cpu becomes the bottleneck.
-
- a normal texturing loop would look like this:
-
- move.w #NUM_PIXELS-1,d7
-
- lea $FFFFA206.w,a1 ; dsp xmit (16 bit)
- loop: move.w (a1),(a0)+ ; copy dsp->screen
- dbf d7,loop ; and another..
-
- we already see that we only use the lower 16bit part of the transmit register.
- this causes only two reads on the bus instead of three. remember the story
- about the 8bit hostport? well, that applies here. because we fetch a 16bit word
- we cause two (8bit) accesses.
-
- this loop is reasonable, however it can be much better for wider spans. and
- with that i mean spans with more than 2 pixels ;) you can try to unroll as far
- as i-cache can take you, to kill the dbf time (10 cycles mostly!).
-
- moveq #NUM_PIXELS,d7
- moveq #16-1,d1
-
- move.l d7,d0
- and.l d1,d0 ; count mod 16
- neg.l d0
- lsr.w #4,d7 ; count/16
- jmp jump(pc,d0.l*2) ; skip unneeded pix.
-
- loop:
- rept 16 ; unroll 16 times
- move.w (a1),(a0)+ ; copy dsp->screen
- endr
- jump: dbf d7,loop ; next 16 pixels..
-
- a more costly initialisation, but as you can see also a much cheaper loop. this
- can result in 1.1 mln pix/s on a standard falcon (rgb 320x200) and this is a
- _net_ measurement. not just measurements of the innerloop, but also including
- the yloop and scanconversion.
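-
- the jump into the middle of the unrolled block is the tricky bit. here is a
- python sketch of just the counting, assuming the 16x unroll from above.
- unrolled_copy is a made-up name and the "jump" is modelled as a short first
- pass that only does the remainder.
-

```python
UNROLL = 16


def unrolled_copy(src, count):
    """Copy `count` items using a 16x unrolled body entered part-way through."""
    out = []
    it = iter(src)
    remainder = count % UNROLL       # and.l d1,d0
    blocks = count // UNROLL         # lsr.w #4,d7
    # first pass: jump into the unrolled body so only `remainder` moves run
    for _ in range(remainder):
        out.append(next(it))
    # dbf d7,loop: the full body then runs `blocks` times
    for _ in range(blocks):
        for _ in range(UNROLL):
            out.append(next(it))
    return out
```

- so 37 pixels become one short pass of 5 followed by two full passes of 16.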
-
- yes, this is all very nice. but what about the cases _with_ handshaking? right,
- if you really need handshaking it's best to use it on larger words. handshaking
- for each byte is madness, it's a good idea to use it on 16bit or even 24bit
- portions. sadly, 24bit portions aren't very common. the 030 doesn't use 12 or
- 24bit memory words (would be nice in this case wouldn't it? ;)) so we're more
- interested in 16 and 8 bit modes.
-
- if we need to transmit bytes it's best to pack them together in a 16bit word.
- a good example is using bytes to index a palette. often, when rgb operations
- like saturated are too costly, you'd want to use this.
-
- now we can look up as follows:
-
- ; init
- clr.l d0
- lea $FFFFA207.w,a1 ; dsp xmit (lsb)
- lea palette,a2
- lea $FFFFA202.w,a3 ; dsp host status (isr)
- ; ..loop..
- wait: btst #0,(a3) ; handshake.....
- beq.s wait ; ..
- move.b (a1),d0 ; get index from dsp.
- move.w (a2,d0.l*2),(a0)+ ; lookup and store.
-
- of course a complete handshake for only 1 silly byte lookup is crazy:
-
- ; init
- clr.l d0
- lea $FFFFA206.w,a1 ; dsp xmit (lsw)
- lea palette,a2
- lea $FFFFA202.w,a3 ; dsp host status (isr)
- ; ..loop..
- wait: btst #0,(a3) ; handshake.....
- beq.s wait ; ..
- move.w (a1),d0 ; get indices from dsp.
- move.l (a2,d0.l*4),(a0)+ ; lookup and store.
-
- here we see two pixels are done at once with only 3 bus accesses extra (only 1
- or 2 with good use of ttram!). please note that you also need a special double
- palette here, like so:
-
- dc.w col0,col0
- dc.w col0,col1
- dc.w col0,col2
- dc.w ...
- ...
- ...
- dc.w col0,col254
- dc.w col0,col255
- dc.w col1,col0
- dc.w col1,col1
- ...
- ...
- dc.w col1,col255
- dc.w col2,col0
- ...
- ...
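-
- nobody types that table by hand, you generate it. a sketch in python,
- assuming a 256 colour palette. double_palette is a made-up name and the
- tuples stand in for the two 16bit words of each entry. note the size: 65536
- entries of 4 bytes each is 256 KB.
-

```python
def double_palette(palette):
    """Build a 65536-entry table of (first, second) colour pairs."""
    if len(palette) != 256:
        raise ValueError("expected 256 colours")
    table = []
    for first in palette:            # high byte of the 16-bit index
        for second in palette:       # low byte of the 16-bit index
            table.append((first, second))
    return table
```

- entry (i << 8) | j then holds the pair (palette[i], palette[j]), matching the
- col0,col0 / col0,col1 / ... order above.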
-
-
- of course on the dsp this requires a packing operation. but as we shall see,
- this is nothing:
-
- move #>128,y0 ; y0=scalar
- ; x0: first index ($00FF), a0: second index ($00SS)
- mac x0,y0,a ; a0=$00FFSS
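-
- why 128 and not 256? the dsp multiplier works on fractions, so the product
- gets shifted left one extra bit. a small python sketch of the arithmetic,
- assuming both indices fit in a byte and a1 is zero. mac_pack is a made-up
- name.
-

```python
def mac_pack(first, second):
    """Model `mac x0,y0,a` with x0=first, y0=128 and a0=second."""
    if not (0 <= first <= 0xFF and 0 <= second <= 0xFF):
        raise ValueError("indices must fit in a byte")
    y0 = 128
    product = (first * y0) << 1      # fractional multiply: one extra left shift
    acc = second + product           # accumulate into the low part of a
    return acc & 0xFFFFFF            # a0, the low 24-bit word
```

- so 128 with the extra shift acts like a multiply by 256, which moves the
- first index up one byte.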
-
- okay, one last note about handshaking: btst sucks! yes folks, there is another
- and better way to do it.
-
- btst #0,(a0)
- =
- moveq #1,d0
- and.b d0,(a0)
-
- this does not clear the other bits in the status reg. why? it's a read only
- register! the good part of this is also that no write is issued on the bus, so
- it's basically not more than a btst but faster. of course you have to reserve a
- register, but who cares. i don't know why atari or other programmers didn't
- notice this one. it can give a healthy speedup... okay, not factor 2, but still
- enough for some cases where you need to have an fx run in 1 or 2 vbl and need
- to give it a little 'push'.
-
- <link=art48b.scr>Go to NEXT PART</l>
- </frame>
- </body>
-
-