OS/2 Shareware BBS: 10 Tools

home *** CD-ROM | disk | FTP | other *** search

/ OS/2 Shareware BBS: 10 Tools / 10-Tools.zip / crypl200.zip / IDEA / IDEA8086.ASM < prev

Wrap

Assembly Source File | 1995-06-21 | 10KB | 286 lines

;A while ago I posted a message claiming a speed of 238,000 ;bytes/sec for an implementation of IDEA on a 33Mh 486. Below is ;an explanation and some code to show how it works. The basic ;trick should be useful on many (but not all) processors. I ;expect only those familiar with IDEA and its reference ;implementation will be able to follow the discussion. See: ; ;Lai, Xueja and Massey, James L. A Proposal for a New Block ;Encryption Standard, Eurocrypt 90 ; ;For those who have been asking for the code, sorry I kept ;putting it off. I wanted to get it out of Turbo Pascal ;ideal-mode, but I never had the time. ; ;Colin Plum wrote IDEA-386 code which is included in PGP ;2.3a and uses the same tricks. I don't know who's is ;faster, but I expect they will be very close. Now ;here's how it's done. ; ;A major bottleneck in software IDEA is the mul() routine, which ;is used 34 times per 64 bit block. The routine performs ;multiplication in the multiplicative group mod 2^16+1. The two ;factors are each in a 16 bit word, and the output is also in a 16 ;bit word. Note that 0 is not a member of the multiplicative ;group and 2^16 does not fit in 16 bits. We therefor use the 0 ;word to represent 2^16. Now group elements map one to one onto ;all possible 16 bit words, since 2^16+1 is prime. ; ;Here is (essentially) the reference implementation from [Lai]. ; ; ;unsigned mul( unsigned a, unsigned b ) { ; long int p ; ; long unsigned q ; ; if( a==0 ) p= 0x00010001 - b ; ; else if( b==0 ) p= 0x00010001 - a ; ; else { ; q= a*b; ; p= (q & 0xffff) - (q>>16) ; if( p<0 ) p= p + 0x00010001 ; ; } ; return (unsigned)(p & 0xffff) ; ;} ; ; ;Note the method of reducing a 32 bit word modulo 2^16-1. We ;subtract the high word from the low word, and add the modulus ;back if the result is less than 0. [Lai] contains a proof that ;this works, and you can convince yourself fairly easily. ; ;To speed up this routine, we note that the tests for a=0 and b=0 ;will rarely be false. With the possible exception of the first 2 ;of the 34 multiplications, 0 should be no more likely than any of ;the other 65535 numbers. Note that if (and only if) either a or ;b is 0 then q will also be 0, and we can check for this in one ;instruction if our processor sets a zero flag for multiplication ;(as the 68000 does but 80x86 does not). ; ;Fortunately p will also be zero after the subtraction if and only ;if either a or b is 0. Proof: r will be zero when the high order ;word of q equals the low order word, and that happens when q is ;divisible by 00010001 hex. Since 00010001h = 2^16+1 is prime, ;this happens if either a or b is a multiple of 2^16+1, and 0 is ;the only such multiple which will fit in a 16 bit word. ; ;The speed-up strategy is to proceed under the assumption that a ;and b are not 0, check to be sure in one instruction, and ;recompute if the assumption was wrong. Here's some 8086 ;assembler code: ; ; mov ax, [a] ; mul [b] ; ax is implied. q is now in DX AX ; sub ax, dx ; mod 2^16+1 ; jnz not0 ; Jump if neither op was 0. Usually taken. ; ; mov ax, 1 ; recompute result knowing one op is 0. ; sub ax, [a] ; sub ax, [b] ; jmp out ; Just jump over adding the carry. ;not0: ; adc ax, 0 ; If r<0 add 1, otherwise do nothing. ;out: ; Result is now in ax ; ; ;Note that when r<0 we add 1 instead of 2^16+1 since the 2^16 part ;overflows out of the result. The "adc ax, 0" does all the work ;of checking for a negative result and adding the modulus if ;needed. ; ;The multiplication takes 9 instructions, 4 of which are rarely ;executed. I believe similar tricks are possible on many ;processors. The one drawback to the check-after-multiply tactic ;is that we can't let the multiply overwrite the only copy of an ;operand. ; ;Note that most software implementations of IDEA will run at ;slightly different speeds when 0's come up in the multiply ;routine. The reference implementation is faster on 0, this one ;is faster on non-zero. This may be a problem for some real-time ;stuff, and also suggests an attack based on timing. ; ;Finally, below is an implementation of the complete encryption ;function in 8086 assembler, to replace the cipher_idea() function ;in PGP. It takes the same parameters as the function from PGP, ;and uses the c language calling conventions. I tested it using ;the debug features of the idea.c file in PGP. You will need to ;add segment/assume directives. This version uses no global data ;and should be reentrant. ; ;The handling of zero multipliers is outside the inner loop so ;that a short conditional jump can loop back to the beginning. ;Forward conditional jumps are usually not taken and backward ;jumps are usually taken, which is consistent with 586 branch ;prediction (or so I've heard). Stalls where the output of one ;instruction is needed for the next seem unavoidable. ; ;Last I heard, IDEA was patent pending. My code is up for grabs, ;although I would get a kick out being credited if you use it. ;On the other hand Colin's code is already tested and ready ;to assemble and link with PGP. ; ;--Bryan ; ;____________________CODE STARTS BELOW THIS LINE_________ ; Called as: asmcrypt( inbuff, outbuff, zkey ) just like PGP PROC _asmcrypt ; establish parameter and local space on stack ; follow c language calling conventions ARG inblock:Word, outblock:Word, zkey:Word LOCAL sx1:Word,sx4:Word,skk:Word,done8:Word =stacksize push bp mov bp, sp sub sp, stacksize ; push ax ; My compiler assumes these are not saved. ; push bx ; push cx ; push dx push si push di ; Put the 16 bit sub-blocks in registers and/or local variables mov si, [inblock] mov ax, [si] mov [sx1], ax ; x1 is in ax and sx1 mov di, [si+2] ; x2 is in di mov bx, [si+4] ; x3 is in bx mov dx, [si+6] mov [sx4], dx ; x4 is in sx4 mov si, [zkey] ; si points to next subkey mov [done8], si add [done8], 96 ; we will be finished with 8 rounds ; when si=done8 @@loop: ; 8 rounds of this add di, [si+2] ; x2+=zkey[2] is in di add bx, [si+4] ; x3+=zkey[4] is in bx mul [Word si] ;x1 *= zkey[0] sub ax, dx jz @@x1 ; if 0, use special case multiply adc ax, 0 @@x1out: mov [sx1], ax ; x1 is in ax and sx1 xor ax, bx ; ax= x1^x3 mul [Word si+8] ; compute kk sub ax, dx ; if 0, use special case multiply jz @@kk adc ax, 0 @@kkout: mov cx, ax ; kk is in cx mov ax, [sx4] ; x4 *= zkey[6] mul [Word si+6] sub ax, dx jz @@x4 ; if 0, use special case multiply adc ax, 0 @@x4out: mov [sx4], ax ; x4 is in sx4 and ax xor ax, di ; x4^x2 add ax, cx ; kk+(x2^x4) mul [Word si+10] ; compute t1 sub ax, dx jz @@t1 ; if 0, use special case multiply adc ax, 0 @@t1out: ; t1 is in ax add cx, ax ; t2 is in cx kk+t1 xor [sx4], cx ; x4 in sx4 xor di, cx ; new x3 in di xor bx, ax ; new x2 in bx xchg bx, di ; x2 in di, x3 in bx xor ax, [sx1] ; x1 in ax mov [sx1], ax ; and [sx1] add si, 12 ; point to next subkey cmp si, [done8] jne @@loop jmp @@out8 ;------------------------------------------ ; Special case multiplications, when one factor is 0 @@x1: mov ax, 1 sub ax, [sx1] sub ax, [Word si] jmp @@x1out @@kk: mov ax, [sx1] ; rebuild overwritten operand xor ax, bx neg ax inc ax sub ax, [si+8] jmp @@kkout @@x4: mov ax, 1 sub ax, [sx4] sub ax, [Word si+6] jmp @@x4out @@t1: mov ax, [sx4] ; rebuild xor ax, di add ax, cx neg ax inc ax sub ax, [si+10] jmp @@t1out ;--------------------------------------------------- ; 8 rounds are done, now that extra pseudo-round @@out8: push di mov di, [outblock] mul [Word si] sub ax, dx jnz @@o1n ; jump over special case code mov ax, 1 sub ax, [sx1] sub ax, [si] jmp @@o1out @@o1n: adc ax, 0 @@o1out: mov [di], ax ; final ciphertext block 1 mov ax, [sx4] mul [Word si+6] sub ax, dx jnz @@o4n ; jump over special case code mov ax, 1 sub ax, [sx4] sub ax, [si+6] jmp @@o4out @@o4n: adc ax, 0 @@o4out: mov [di+6], ax ; final ciphertext block 4 add bx, [si+2] mov [di+2], bx ; final ciphertext block 2 pop ax add ax, [si+4] mov [di+4], ax ; final ciphertext block 3 ; Restore the stack and return pop di pop si ; pop dx ; pop cx ; pop bx ; pop ax mov sp, bp pop bp ret ENDP _asmcrypt