# -------------------------------------------------------------------
# GPU/DSP            (c) Copyright 1995 KKP & Nat!
# -------------------------------------------------------------------
# These are some of the results/guesses that Klaus and I (Nat!) found
# out about the Jaguar with a few helpful hints by other people,
# who'd prefer to remain anonymous.
#
# Since we are not under NDA or anything from Atari we feel free to
# give this to you for educational purposes only.
#
# Please note, that this is not official documentation from Atari
# or derived work thereof (both of us have never seen the Atari docs)
# and Atari isn't connected with this in any way.
#
# Please use this informationphile as a starting point for your own
# exploration and not as a reference. If you find anything inaccurate,
# missing, needing more explanation etc. by all means please write
# to us:
#    nat@zumdick.rhein-main.de
# or
#    kkp@gamma.dou.dk
#
# If you could do us a small favor, don't use this information for
# those lame flamewars on r.g.v.a or the mailing list.
#
# HTML soon ?
# -------------------------------------------------------------------
# $Id: risc_doc.txt,v 1.10 1996/01/28 20:35:56 nat Exp $
# -------------------------------------------------------------------

This now contains lots of stuff that is cryptic because I just
incorporated third source knowledge. There's quite a bit I don't
understand yet :)      [nat/1995]

Please note the high bullshit content when it comes to the
description of the pipeline business.


1 RISCy Business
=-=-=-=-=-=-=-=-=

The RISCs have 2 register banks of 32 registers each: the Current
and the Alternative register bank. Register R31 is the stack pointer
and normally R0 is initialized to 0 (Zero).

G_FLAGS and D_FLAGS are the status registers. The first 5 bits contain
the Carry, Zero and Minus flags (I THINK).

GPU_FLAGS:
=-=-=-=-=

 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------+---+----^------+-^--------+-+------+
1 |               unused                |aux| irq_pend  | irq_enab |m|flags |
  +-------------------------------------+---+-----------+----------+-+------+

flags:
   bit 0:    zero
   bit 1:    carry
   bit 2:    negative

   These are the GPU status flags that are set on arithmetic and
   logical instructions (except irq_mask)

mask (m):
   bit 3:    IMASK

   IMASK (Interrupt disable?)

irq_enable:
   bit 4:    IRQ 0 enable
   bit 5:    IRQ 1 enable
   bit 6:    IRQ 2 enable
   bit 7:    IRQ 3 enable
   bit 8:    IRQ 4 enable

   You can enable any of the 5 interrupts by setting the appropriate
   bit. (?)

irq_clear:
   bit 9:    IRQ 0 clear
   bit 10:   IRQ 1 clear
   bit 11:   IRQ 2 clear
   bit 12:   IRQ 3 clear
   bit 13:   IRQ 4 clear

   When through with processing an interrupt, you probably have to
   clear the appropriate bit here.

aux:
   bit 14:   register bank selection
   bit 15:   DMA

   Switching between the register banks is done like this:

        movei   #G_FLAGS,r1     ; Status flags
   or
        movei   #D_FLAGS,r1     ; Status flags

        load    (r1),r0
        bset    #14,r0
        store   r0,(r1)         ; Switch the GPU/DSP to bank 1

   Normally the GPU is running on Bank 1, since on an IRQ Bank 0
   automatically becomes active.

   Bit 15 seems to control the way the GPU load/store instructions
   access memory. If set they run at DMA priority. If cleared ??


GPU_CONTROL:
=-=-=-=-=-=

 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------+--+-----^---+--+-^--------+
1 |                    unused                    | h| irq_lat | d| control  |
  +----------------------------------------------+--+---------+--+----------+

control:
   bit 0:    start the GPU / run status
   bit 1:    allow GPU to interrupt the 68K (?)
   bit 2:    generate a GPU type 0 interrupt (on the 68K (?))
   bit 3:    enable single step
   bit 4:    perform a single step

   Setting bit 0 starts the GPU. When reading this register this bit
   will tell you whether the GPU is running or not. You can stop the
   GPU by clearing this bit.

dma (d):
   bit 5:    set external DMA ACK (?)

irq_lat:
   bit 6:    IRQ 0 pending   VI-IRQ (VBLANK)
   bit 7:    IRQ 1 pending
   bit 8:    IRQ 2 pending
   bit 9:    IRQ 3 pending
   bit 10:   IRQ 4 pending

   Clear or poll any pending interrupts with these bits. (?)

bus_hog (h):
   bit 11:   hog mode on

   Allows the GPU to 'hog' the bus. When the GPU code uses a lot of
   load/store instructions consecutively it could be that the OP does
   not get enough time to do its processing. Use with care.

Register R31 is used by the RISCs as the stack pointer. It only seems
to be used by interrupts. See the section on interrupts.


GPU_MATRIX_CONTROL:
=-=-=-=-=-=-=-=-=-=

 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------^--------^-----+--+--------+
1 |                           unused                            | t|  size  |
  +-------------------------------------------------------------+--+--------+

size:
   bits 0-3: size as a binary number

   Size of one row of the matrix.

type (t):
   bit 4:    row order

   Specify whether your matrix is Row Major (0) or Column Major (1).


GPU_MATRIX_ADDRESS:
=-=-=-=-=-=-=-=-=-=

 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------^--------^--------^--------+
1 |                                 address                                 |
  +-------------------------------------------------------------------------+

Points to the matrix in memory.


DIV_CONTROL:
=-=-=-=-=-=

 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------^--------^--------^--------^-----+--+
1 |                                unknown                               |c |
  +----------------------------------------------------------------------+--+

This register is write only.

control (c):
   bit 0:    division control

   If bit 0 is set, then the division operation will assume an
   unsigned (?) 16.16 integer fractional representation for the
   divide. Else you get a straight 32 bit unsigned integer divide
   (like on the 68000 DIVU).


DIV_REMAINDER:
=-=-=-=-=-=-=

 32       28        24        20       16       12        8        4        0
  +--------^---------^---------^--------+--------^--------^--------^--------+
1 |               unused                |               value               |
  +-------------------------------------+-----------------------------------+

This register can only be read. Remainder of the division operation.
Guess: only 16 bits wide.


############################################################################

Architecture:
=-=-=-=-=-=-=

Ingredients:

GPU/DSP:
   two load/store units
   one ALU
   one divisor unit
   various control logic for branching etc.

The GPU and the DSP are both pipelined processors, employing a triple
stage forwarding pipeline.

The pipeline is: (???)

   Stage 1:  Load (LAS1/LAS2)
   Stage 2:  Arithmetic and Logic Unit
   Stage 3:  Store (LAS1/LAS2)


Load and Store Unit (LAS)
=-=-=-=-=-=-=-=-=-=-=-=

The LAS aren't just called LAS because they can Load and Store, but
because they can also Load and Store at the same time. To the same
register that is... Therefore writing a register back still retains
the register value in the LAS for usage by the ALU again.


Arithmetic and Logic Unit (ALU)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

add, mult, shift: all 'atomic' instructions execute in one cycle


Registers
=-=-=-=-=

64 registers, each 32 bits wide, stored in two banks of r0..r31.
Interrupts always execute out of bank 0 (i.e. your code should always
execute in bank 1..)


                 I M P O R T A N T   N O T I C E
******************************************************************
I have lots of problems with the following treatment. Don't take it
as gospel.
*******************************************************************
                 I M P O R T A N T   N O T I C E

Currently I favor this as the way the Jaguar does an add. This is a
completely unsubstantiated guess on my part, solely based on some
outsider info (the code with the STALLs).

   0:   nop
   1:   add r0,r1
   2:   nop
   3:   nop

        INSTR        LAS1        ALU     LAS2
   ---------------------------------------------
   0:   nop          -           -       -
   1:   add r0,r1    fetch r0    -       fetch r1
   2:   nop          -           add     write r1
   3:   nop          -           -       -

Instructions that need registers which aren't yet updated with the
new values force the processor to stall (like on any good pipelined
system). The processor keeps track of the register states, and
introduces stalls when appropriate.

Here are a few more complex examples: (Thanks, you know who!)

Ex 1:
        div     r0,r1   ; (r1 is not available now!)
        STALL
        STALL
        STALL*12
        add     r1,r2   ; (yay, we can use r1 again :-)

You could replace the STALLs with code that does not need to access
r1 and the division wouldn't slow you down more than any other
instruction. (Of course a second division is impossible when the DIV
unit is already in use.)

Ex 2:
        nop
        nop
        nop             (LS1)      (LS2)     (ALU)
        add   r0,r1     (load r0,  load r1,  nop)
        add   r2,r3     (load r2,  load r3,  add r0,r1)
        add   r4,r5     (store r1, load r4,  add r2,r3)
                        (load r5,  nop    ,  STALL)
        add   r6,r7     (load r6,  load r7,  add r4,r5)
        add   r8,r9     (store r5, load r8,  add r6,r7)
                        (load r9,  nop    ,  STALL)
        add   r0,r1     (load r0,  load r1,  add r8,r9)
        nop             (store r9, nop       add r0,r1)
        nop             (store r1, nop       nop)


1.0 Move instructions
=-=-=-=-=-=-=-=-=-=-=

   move    Rn,Rn
   move    PC,Rn
   movei   #xxxxxxxx,Rn

   load    (Rn),Rn
   load    (Rm+n),Rn       * Rm = R14 | R15 !
   load    (Rm+Ri),Rn      * Rm = R14 | R15 !
   loadb   (Rn),Rn         * Load Byte
   loadw   (Rn),Rn         * Load Word
   loadp   (Rn),Rn         * Load Phrase (GPU only)

   store   Rn,(Rn)
   store   Rn,(Rm+n)       * Rm = R14 | R15 !
   store   Rn,(Rm+Ri)      * Rm = R14 | R15 !
   storeb  Rn,(Rn)         * Store Byte
   storew  Rn,(Rn)         * Store Word
   storep  Rn,(Rn)         * Store Phrase (GPU only)

   moveta  Rn,Rn           * move to alternative register bank
   movefa  Rn,Rn           * move from alternative register bank


1.1 Logical Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=

   or      Rn,Rn
   xor     Rn,Rn
   and     Rn,Rn


1.2 Bit Operation Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   bset    #,Rn
   bclr    #,Rn
   btst    #,Rn


1.3 Shift Instructions
=-=-=-=-=-=-=-=-=-=-=-=

   shlq    #xx,Rn
   shrq    #xx,Rn
   sharq   #xx,Rn
   ror     Rn,Rn
   rorq    #xx,Rn


1.4 Arith. Instructions
=-=-=-=-=-=-=-=-=-=-=-=

   mult    Rn,Rn
   imult   Rn,Rn
   mmult   Rn,Rn
   imultn  Rn,Rn
   imacn   Rn,Rn
   resmac  Rn
   div     Rn,Rn           * exec seems to use max 4 i-cycles

   add     Rn,Rn
   addc    Rn,Rn           * add with carry
   addq    #xx,Rn
   addqt   #xx,Rn          * add quick, test result
   addqmod #xx,Rn          * add quick, take modulo

   sub     Rn,Rn
   subc    Rn,Rn           * subtract with carry
   subq    #xx,Rn
   subqt   #xx,Rn          * sub quick, test result
   subqmod #xx,Rn          * sub quick, take modulo

   cmp     Rn,Rn
   cmpq    #xx,Rn

   neg     Rn
   not     Rn
   abs     Rn


1.5 Program Structure Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   jump    CC,(Rn)
   jump    (Rn)
   jr      CC,xxxxxx
   jr      xxxxxxx
   nop


1.6 Condition Codes
=-=-=-=-=-=-=-=-=-=

Condition codes CC can be any of

   CC (%00100)   CS (%01000)   EQ (%00010)   MI (%11000)
   NE (%00001)   PL (%10100)   HI (%00101)   T  (%00000)

They are used together with the jump instructions...


2.0 Restrictions
=-=-=-=-=-=-=-=-=

   'JR+MOVEI', 'JUMP+MOVEI', 'JR+JR', 'JR+JUMP', 'JUMP+JR',
   'JUMP+JUMP', 'JR+MOVE PC', 'JUMP+MOVE PC'

   IMULTN must be followed by an IMACN (Error displayed)
   IMACN must be followed by an IMACN or RESMAC (Error displayed)
   RESMAC must be preceded by an IMACN (Error displayed)

   A NOP is inserted between LOAD+MMULT and STORE+MMULT (Warning
   displayed). I don't know if LOADB+MMULT, LOADW+MMULT, LOADP+MMULT,
   ... are valid or not. Currently, it's not tested...


3.0 Instruction Encoding
=-=-=-=-=-=-=-=-=-=-=-=-=

Most instructions are only 2 bytes long. This means that 4
instructions can be pulled from RAM in one memory access!! This also
makes the code extremely tight, which is of prime concern when writing
cartridge based programs.

One instruction longer than 2 bytes is movei #x,Rn, which has the
32 bit constant just after the 2 byte instruction. This saves a lot of
time and space over other RISCs. The ARM for example uses 4 32 bit
instructions to fill a register (8 bits at a time), the SPARC 2 32 bit
instructions.


3.2 Instruction Encoding
=-=-=-=-=-=-=-=-=-=-=-=-=

All instructions use the top 6 bits to encode the instruction.
The 2 operand instructions split the remainder of the 16 bits into
two 5 bit fields, the source (quick or register) and the destination
register.


3.2.1 The Implied Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   iiiiii 0000000000
     /\       /\
     ||       |_============== room for extensions
     ||
     \`======================= instruction

The implied instruction is nop!


3.2.2 The 1 Operand Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   iiiiii 00000 ddddd <====== destination register
     /\     /\
     ||     |_================ room for extensions
     ||
     \`======================= instruction

The one operand instructions are:

        neg     R0
        not     R1
        abs     R2
        resmac  R3


3.2.3 The 2 Operand Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Most instructions are 2 operand and follow this pattern. The register
to register instructions use the sssss and ddddd fields to specify the
source and destination registers, as in add r1,r0. In the quick to
register instructions the sssss field is used to hold a constant, as
in shlq #3,r0 where the constant is between 1 and 32, and
moveq #0,r2 where the constant is between 0 and 31.

   iiiiii sssss ddddd <====== destination register
     /\     /\
     ||     |_================ source (quick or register)
     ||
     \`======================= instruction

Examples of 2 operand instructions are:

        move    R1,R2
        bset    #31,R2
        etc...


3.2.4 The movei Instruction
=-=-=-=-=-=-=-=-=-=-=-=-=-=

The movei instruction is very special! It is the only 6 byte
instruction, and that is what makes it special. The instruction word
follows the general structure,

   iiiiii 00000 ddddd <====== destination register
     /\     /\
     ||     |_================ room for extensions
     ||
     \`======================= instruction ($98)

but the 32 bit constant that is to be loaded into the destination
register follows the instruction:

   +-------------+ +------------+ +------------+
   |  Movei Rn   | | Lower word | | Upper word |
   +-------------+ +------------+ +------------+


3.2.5 The Load & Store Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Most instructions are 2 operand and follow this pattern.

   iiiiii ppppp ddddd <====== destination register
     /\     /\
     ||     |_================ indirect register
     ||
     \`======================= instruction


3.2.5.1 Addressing Modes For Load/Store Byte/Word/Phrase
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

All load and store instructions support register indirect addressing,
which is written (Rn). This means that you can load the memory
location pointed to by a register into yet another register (or the
same).


3.2.5.2 Addressing Modes For Load/Store Longword
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The Load/Store longword instructions support two further addressing
modes:

   * indexed register indirect addressing, which is written (Rn+Rm),
   * register indirect addressing w. offset, which is written (Rn+xx).

In these addressing modes Rn _has_ to be R14 or R15!

fx:
        load    (r14+r2),r0
        store   r0,(r15+16)


3.2.5.3 Load/Store Phrase (GPU Only)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The GPU has a direct 64 bit (Phrase) interface to the main memory.
The loadp/storep instructions access this memory's full width. The
lower part of the phrase pointed to by (Rp) goes from/to the register
specified, the other part of the phrase is in

   G_HIDATA ( 0xF02118 )   /* GPU Bus Interface high data */

fx:
        storep  r0,(rp)


3.2.6 The Program Control Instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Most Program Control instructions follow this pattern:

   iiiiii ddddd ccccc <====== Condition Vector
     /\     /\
     ||     |_================ source (quick or register)
     ||
     \`======================= instruction

The ddddd field can either specify an offset (jr instruction) or a
register containing an absolute address (jump instruction). All jump
instructions are conditional.

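The field layouts described in sections 3.2.2 to 3.2.6 can be sketched
in a few lines of Python. This is only an illustration and not part of
any Atari tool: the helper names (encode, encode_movei, encode_jump)
are made up here, the opcode numbers are assumed to be the "hex"
column of the table in section 3.3 shifted right by two, and the word
order for movei follows the diagram in 3.2.4. Whether the hardware
agrees in every detail is as much a guess as the rest of this file.

```python
# Sketch of the 16 bit instruction word layout: iiiiii sssss ddddd.
# Opcode numbers are assumptions taken from the table in section 3.3
# (hex column >> 2); all helper names are hypothetical.

OPCODES = {
    "add": 0x00 >> 2,
    "move": 0x88 >> 2,
    "movei": 0x98 >> 2,
    "jump": 0xD0 >> 2,
}

# Condition vectors as listed in section 3.2.6.1.
CONDITIONS = {"t": 0b00000, "ne": 0b00001, "eq": 0b00010, "mi": 0b11000}


def encode(mnemonic, src, dst):
    """Pack a 2 operand instruction: 6 bit opcode, 5 bit source, 5 bit dest."""
    if not (0 <= src <= 31 and 0 <= dst <= 31):
        raise ValueError("register/quick fields are 5 bits wide")
    return (OPCODES[mnemonic] << 10) | (src << 5) | dst


def encode_movei(value, dst):
    """movei: instruction word, then lower word, then upper word (3.2.4)."""
    value &= 0xFFFFFFFF
    return [encode("movei", 0, dst), value & 0xFFFF, value >> 16]


def encode_jump(cond, reg):
    """jump CC,(Rn): register in the middle field, condition in the low field."""
    if not 0 <= reg <= 31:
        raise ValueError("register field is 5 bits wide")
    return (OPCODES["jump"] << 10) | (reg << 5) | CONDITIONS[cond]


if __name__ == "__main__":
    print(hex(encode("move", 1, 2)))                       # move  r1,r2
    print([hex(w) for w in encode_movei(0x12345678, 3)])   # movei #$12345678,r3
    print(hex(encode_jump("mi", 5)))                       # jump  mi,(r5)
```

Note that the quick instructions marked "q is [32, 1..31]" in the
table would need 32 stored as 0 in the 5 bit field; the sketch leaves
that mapping out.
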
3.2.6.1 Condition Codes
=-=-=-=-=-=-=-=-=-=-=-=

Condition codes ccccc can be any 5 bit vector. Here are some ready
defined useful values:

   CC (%00100)   CS (%01000)   EQ (%00010)   MI (%11000)
   NE (%00001)   PL (%10100)   HI (%00101)   T  (%00000)

Examples of Program Control instructions:

        jump    mi,(r5)
        jr      ne,exit
        jr      t,loop          ; loop forever
        jr      loop            ; loop forever
        jump    (r5)


3.2.7 Modulo Arithmetic (DSP only)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The instructions addqmod and subqmod are modular, with the size
specified in

   D_MOD (0xF1A118)   /* DSP Modulo Instruction Mask */

The mask register contains a mask that is applied to the register
after the add operation, as in the following two step sequence:

        movei   #%111111,r1
loop:   addq    #4,r0
        and     r1,r0
        ...
        jr      loop

With the modulo register this can be written:

        movei   #D_MOD,r3
        movei   #~%111111,r1
        store   r1,(r3)
loop:   addqmod #4,r0
        ...
        jr      loop

This is an obvious win! - you save a cycle each loop!
Instructions are subqmod, addqmod


3.2.8 Multiply and Multiply-Accumulate Locations
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

D_MACHI         EQU     BASE+$1A120     ; DSP Hi byte of MAC operations


3.2.9 Matrix Multiply Locations
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

D_MTXC          EQU     BASE+$1A104     ; DSP Matrix Control
D_MTXA          EQU     BASE+$1A108     ; DSP Matrix Address

G_MTXC          EQU     BASE+$2104      ; GPU Matrix Control
G_MTXA          EQU     BASE+$2108      ; GPU Matrix Address


3.2.10 Divide Locations
=-=-=-=-=-=-=-=-=-=-=-=

D_REMAIN        EQU     BASE+$1A11C     ; DSP Division Remainder
D_DIVCTRL       EQU     BASE+$1A11C     ; DSP Divider control

G_REMAIN        EQU     BASE+$211C      ; GPU Division Remainder
G_DIVCTRL       EQU     BASE+$211C      ; GPU Divider control


3.3 Instruction numbers
=-=-=-=-=-=-=-=-=-=-=-=

     Mnemonic Mode            iiiiii sssss ddddd  hex   Notes
     -----------------------------------------------------------------------
     ADD      Rs,Rd           000000 sssss ddddd  $00
     ADDC     Rs,Rd           000001 sssss ddddd  $04
     ADDQ     #q,Rd           000010 qqqqq ddddd  $08   q is [32, 1..31]
     ADDQT    #q,Rd           000011 qqqqq ddddd  $0C   q is [32, 1..31]
     SUB      Rs,Rd           000100 sssss ddddd  $10
     SUBC     Rs,Rd           000101 sssss ddddd  $14
     SUBQ     #q,Rd           000110 qqqqq ddddd  $18   q is [32, 1..31]
     SUBQT    #q,Rd           000111 qqqqq ddddd  $1C   q is [32, 1..31]
     NEG      Rd              001000 00000 ddddd  $20
     AND      Rs,Rd           001001 sssss ddddd  $24
     OR       Rs,Rd           001010 sssss ddddd  $28
     XOR      Rs,Rd           001011 sssss ddddd  $2C
     NOT      Rd              001100 00000 ddddd  $30
     BTST     #q,Rd           001101 qqqqq ddddd  $34   q is [0..31]
     BSET     #q,Rd           001110 qqqqq ddddd  $38   q is [0..31]
     BCLR     #q,Rd           001111 qqqqq ddddd  $3C   q is [0..31]
     MULT     Rs,Rd           010000 sssss ddddd  $40
     IMULT    Rs,Rd           010001 sssss ddddd  $44
     IMULTN   Rs,Rd           010010 sssss ddddd  $48
     RESMAC   Rd              010011 00000 ddddd  $4C
     IMACN    Rs,Rd           010100 sssss ddddd  $50
     DIV      Rs,Rd           010101 sssss ddddd  $54
     ABS      Rd              010110 00000 ddddd  $58
                                                  $5C
     SHLQ     #q,Rd           011000 qqqqq ddddd  $60   q is [32, 1..31]
     SHRQ     #q,Rd           011001 qqqqq ddddd  $64   q is [32, 1..31]
                                                  $68
     SHARQ    #q,Rd           011011 qqqqq ddddd  $6C   q is [32, 1..31]
     ROR      Rs,Rd           011100 sssss ddddd  $70
     RORQ     #q,Rd           011101 qqqqq ddddd  $74   q is [32, 1..31]
     CMP      Rs,Rd           011110 sssss ddddd  $78
     CMPQ     #q,Rd           011111 qqqqq ddddd  $7C   q is [0..31]
DSP  SUBQMOD  #q,Rd           100000 qqqqq ddddd  $80   q is [32, 1..31]
                                                  $84
     MOVE     Rs,Rd           100010 sssss ddddd  $88
     MOVEQ    #q,Rd           100011 qqqqq ddddd  $8C   q is [0..31]
     MOVETA   Rs,Rd           100100 sssss ddddd  $90
     MOVEFA   Rs,Rd           100101 sssss ddddd  $94
     MOVEI    #c32,Rd         100110 00000 ddddd  $98   followed by a 32 bit const
     LOADB    (Rp),Rd         100111 ppppp ddddd  $9C
     LOADW    (Rp),Rd         101000 ppppp ddddd  $A0
     LOAD     (Rp),Rd         101001 ppppp ddddd  $A4
GPU  LOADP    (Rp),Rd         101010 ppppp ddddd  $A8   Load Phrase
     LOAD     (R14+n),Rd      101011 nnnnn ddddd  $AC
     LOAD     (R15+n),Rd      101100 nnnnn ddddd  $B0
     STOREB   Rs,(Rp)         101101 ppppp sssss  $B4
     STOREW   Rs,(Rp)         101110 ppppp sssss  $B8
     STORE    Rs,(Rp)         101111 ppppp sssss  $BC
GPU  STOREP   Rs,(Rp)         110000 ppppp sssss  $C0   Store Phrase
     STORE    Rs,(R14+n)      110001 nnnnn sssss  $C4
     STORE    Rs,(R15+n)      110010 nnnnn sssss  $C8
     MOVE     PC,Rn           110011 00000 ddddd  $CC
     JUMP     CC,(Rd)         110100 ddddd ccccc  $D0
     JR       CC,q            110101 qqqqq ccccc  $D4
     MMULT    Rs,Rd           110110 sssss ddddd  $D8
                                                  $DC
                                                  $E0
     NOP                      111001 00000 00000  $E4
     LOAD     (R14+Ri),Rd     111010 iiiii ddddd  $E8
     LOAD     (R15+Ri),Rd     111011 iiiii ddddd  $EC
     STORE    Rs,(R14+Ri)     111100 iiiii sssss  $F0
     STORE    Rs,(R15+Ri)     111101 iiiii sssss  $F4
                                                  $F8
DSP  ADDQMOD  #q,Rd           111111 qqqqq ddddd  $FC   q is [32, 1..31]


3.4 Move instructions
=-=-=-=-=-=-=-=-=-=-=-=

None of the move instructions affect the status flags of the GPU,
except when moving data into the status register itself.


3.5 Arithmetic instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   ABS      sets the carry flag if a negative value was transformed
            to a positive, else clears it

   ADDQT    does not affect the status flags

   DIV      takes 16 cycles to execute. Supposedly you can do either
            16.16 integer fractional division or 32 bit integer
            division. The DIV does a 2 bit divide each cycle, hence
            16 cycles total for 32 bit. The remainder of the division
            is saved in a special register.

   IMACN    they don't have a register write back (so they're easier
            to optimize) and you'll find if you get in the habit of
            using them, you can normally structure your code a bit
            faster by using them..


3.6 Logical instructions
=-=-=-=-=-=-=-=-=-=-=-=-=-=

   SHLQ     affects the status flags
   SHRQ     affects the status flags


4.0 Matrix multiplication
=-=-=-=-=-=-=-=-=-=-=-=-=-=

        MMULT   starting_register,destination_register

You have to set up the matrix control register and the matrix address
register before executing the MMULT instruction. The MMULT instruction
will then multiply the values contained in the interrupt (alternate ?)
register bank (0), starting at register "starting_register", with the
matrix pointed to by the matrix address register.

The values used for the multiplication are 16 bit values (sign extended ?)
and are arranged in this slightly peculiar fashion (for a 5x1 matrix):

      Register bank #0 (RB)
   +-------------+------------+
   |      1      |     0      |   Rn
   +-------------+------------+
   |      3      |     2      |   Rn+1
   +-------------+------------+
   |   unused    |     4      |   Rn+2
   +-------------+------------+

These values are then multiplied with a 32 bit table in memory and
the results of these are added together:

                Main memory (MM)
   $020000 +--------------------------+
           |            0             | $020003
   $020004 +--------------------------+
           |            1             | $020007
   $020008 +--------------------------+
           |            2             | $02000B
   $02000C +--------------------------+
           |            3             | $02000F
   $020010 +--------------------------+
           |            4             | $020013
           +--------------------------+

   result = MM0 * RB0 + MM1 * RB1 + MM2 * RB2 + MM3 * RB3 + MM4 * RB4;

The 32 bit result is stored into "destination_register" in the
current bank (or bank 1?).

Brief internal description of MMULT: it strips program instruction
fetches and forces instructions straight into the pipeline.. getting
a throughput of one (16 bit) multiply per tick.. (25 million per
second :-)

Supposedly the MMULT is performed by inserting generated instructions
into the instruction stream. Supposedly for a MMULT the instructions
inserted are a leading IMULTN, the middle ones IMACN, and finally a
RESMAC. ??? These have their operands modified in the manner described
above. ??? i.e. that funky packed thingy, two elements per register,
that allows all of an eight by eight matrix to be stored in the
secondary register bank and is the raison d'etre of the second bank..
(woosh)


5.0 Interrupts
=-=-=-=-=-=-=-

The GPU and the DSP use an interrupt scheme that looks a lot like the
56000's way of handling interrupts. The interrupt entry points sit in
the lowest part of each processor's memory. There are 16 bytes for
each interrupt. This should be enough to jump into the real interrupt
handler.

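As a small illustration of the vector layout just described, the
following Python sketch computes where each 16 byte entry starts and
checks whether a stub fits into one entry. Everything here is an
assumption layered on the text: the function names are invented, the
instruction sizes come from section 3.0 (2 bytes, 6 for movei), and
the stub (movei, jump, nop) is just one plausible way to reach a
handler, not a documented sequence.

```python
# Sketch: interrupt entry points are 16 bytes apart at the bottom of
# each processor's local memory (offsets relative to its start).

VECTOR_SIZE = 16  # bytes per interrupt entry, as stated in the text

# Instruction sizes in bytes, from section 3.0
# (movei carries a 32 bit constant after the instruction word).
SIZES = {"movei": 6}
DEFAULT_SIZE = 2


def vector_offset(irq):
    """Byte offset of the entry point for interrupt number irq."""
    if irq < 0:
        raise ValueError("interrupt number must be non-negative")
    return irq * VECTOR_SIZE


def stub_size(mnemonics):
    """Total size in bytes of a list of instruction mnemonics."""
    return sum(SIZES.get(m, DEFAULT_SIZE) for m in mnemonics)


def fits_in_vector(mnemonics):
    """True if the stub fits into one 16 byte interrupt entry."""
    return stub_size(mnemonics) <= VECTOR_SIZE


if __name__ == "__main__":
    # Hypothetical stub: load the handler address, jump there, pad.
    stub = ["movei", "jump", "nop"]
    print(hex(vector_offset(1)), stub_size(stub), fits_in_vector(stub))
```
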
( If this works like on the 56000 it should be possible to have Fast
Interrupts, where the CPU returns automatically when the 16 bytes have
been executed and no jump instructions have been executed ).

For the DSP it looks like this:

   000000   Reset (or DSP control interrupt)
   000010   I2S Interrupt

Enable interrupts

I2S:    movei   #D_FLAGS,r1     ; load dsp flags to go to bank 1
        load    (r1),r0
        bset    #5,r0           ; enable I2S interrupt
        store   r0,(r1)         ; save dsp flags

Handle I2S interrupts

i2s_isr:
        movei   #D_FLAGS,r30    ; get flags ptr
        load    (r30),r12
        bclr    #3,r12          ; clear IMASK
        load    (r31),r28       ; get last instruction address
        bset    #10,r12         ; clear I2S interrupt
        addq    #2,r28          ; point at next to be executed
        addq    #4,r31          ; update the stack pointer
        ...
        jump    T,(r28)         ; and return
        store   r12,(r30)       ; restore flags


BUGS:
=-=-=

There are also apparently some bugs (pretty heavy ones IMO) in the
GPU/DSP that you should be aware of:

1) INDEXED STORES NEVER STALL

   e.g.
        div     r0,r3
        store   r3,(r14+6)

   should be

        div     r0,r3
        or      r3,r3
        store   r3,(r14+6)

   Here the OR is used to 'touch' the register for the scoreboard.
   If you didn't touch the r3 register you would most likely (but not
   always, think of those IRQs!) write the old value of r3 back.

2) TWO CONSECUTIVE WRITES TO THE SAME REGISTER MIGHT BE PROBLEMATIC

   Although writing code like this is a bug anyway, you should be
   careful: if you write to the same reg with no intervening read,
   and the second instruction finishes first, garbage will result:

        load    (r3),r2
        moveq   #3,r2

   should be

        load    (r3),r2
        or      r2,r2
        moveq   #3,r2

3) NEITHER THE DSP NOR THE GPU CAN EXECUTE 'jr' OR 'jump' FROM
   EXTERNAL RAM

4) NEITHER THE DSP NOR THE GPU MAY BE USED IN HIGH PRIORITY

5) A mmult INSTRUCTION MUST NEVER BE INTERRUPTED

   how very convenient...

6) THE DSP (ONLY) MUST NOT DO AN EXTERNAL WRITE UNLESS PRECEDED BY AN
   EXTERNAL READ THAT COMPLETES BEFORE THE WRITE STARTS.

   The saying goes that this bug is only spurious and can remain
   undetected for quite some time.
   Hint: for external I/O use the Blitter (as always :))

   e.g.

   A:   load    (r1),r2
        or      r10,r11
        store   r11,(r3)

   B:   load    (r1),r2
        or      r2,r11
        store   r11,(r3)

   C:   load    (r1),r2
        or      r2,r2
        or      r10,r11
        store   r11,(r3)

   [A] will not work but [B] will. This is because the result of the
   load is required for the 'or' operation to be performed. To make
   [A] work, change it to [C]...