
Re: 486 optimization




[Note: this is a fairly technical reply to Jered's question, but there
 are a bunch of assembler-heads on this list, so I thought I'd post it.
 Most people will just want to skip this message.]

>>>>> "jered" == jered  <jered@MIT.EDU> writes:

    jered> I just saw on the linux kernel discuss meeting that 486es
    jered> and higher have a special instruction for converting
    jered> big-endian to/from little-endian.  Does anyone know if gcc
    jered> (djgpp) uses this and optimizes for it, what sort of
    jered> performance increase it might give, and if it would be
    jered> worth anyone's while to have a 486-higher executable of
    jered> Executor?

Jered is talking about the "bswap" instruction, which byte swaps a
four-byte value in a register in one cycle.  It was added when the
80486 came out, so it isn't present on 80386's.  It's non-pairable on
the Pentium.
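
For anyone without the instruction set reference handy, here is a
rough sketch in portable C of what bswap computes for a 32-bit value.
It only illustrates the instruction's effect; it is not code from
Executor, and the name swap32_ref is made up for this example.

```c
#include <stdint.h>

/* Illustration only: the result "bswap" leaves in a 32-bit register.
   swap32_ref is a hypothetical name, not an Executor function. */
uint32_t swap32_ref(uint32_t x)
{
    return ((x & 0x000000FFu) << 24)   /* lowest byte moves to the top  */
         | ((x & 0x0000FF00u) << 8)    /* second byte moves up one slot */
         | ((x & 0x00FF0000u) >> 8)    /* third byte moves down one slot */
         | ((x & 0xFF000000u) >> 24);  /* top byte moves to the bottom  */
}
```

Swapping twice gives back the original value, which is why the same
operation converts big-endian to little-endian and back again.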

gcc doesn't generate the "bswap" instruction, because it won't work on
an 80386, and even the -m486 flag isn't allowed to produce code that
won't run on an 80386.  I don't know whether gcc has any way of
treating byte swaps specially anyway.  From gcc.info:

 `-m486'
 `-mno-486'
      Control whether or not code is optimized for a 486 instead of an
      386.  Code generated for an 486 will run on a 386 and vice versa.

Executor's C code uses inline assembly to byte swap with three rotate
instructions, which works on both the 80386 and 80486+.  Our CPU
emulator (syn68k) decides at runtime if you have an 80486 or better
and generates bswap instructions "on the fly" if you do.  Otherwise,
it generates three rotate instructions.
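
The three-rotate trick can be modelled in portable C as below.  This
is a sketch rather than Executor's actual inline assembly: it assumes
the usual sequence of rotating the low 16 bits by 8, rotating all 32
bits by 16, then rotating the low 16 bits by 8 again.  All function
names are hypothetical, and the function pointer is only an analogy;
syn68k picks which instructions to emit into generated code rather
than calling through a pointer.  How the CPU type is detected is not
shown, so the caller supplies it as a flag.

```c
#include <stdint.h>

/* Rotate the low 16 bits of x left by 8 and leave the high 16 bits
   untouched (models a 16-bit rotate on the low half of a register). */
uint32_t rot_low16_by_8(uint32_t x)
{
    uint32_t lo = x & 0xFFFFu;
    lo = ((lo << 8) | (lo >> 8)) & 0xFFFFu;
    return (x & 0xFFFF0000u) | lo;
}

/* Rotate all 32 bits by 16, i.e. exchange the two 16-bit halves. */
uint32_t rot32_by_16(uint32_t x)
{
    return (x << 16) | (x >> 16);
}

/* Byte swap built from three rotates; needs nothing beyond an 80386. */
uint32_t swap32_rotates(uint32_t x)
{
    x = rot_low16_by_8(x);  /* AABBCCDD -> AABBDDCC */
    x = rot32_by_16(x);     /* AABBDDCC -> DDCCAABB */
    x = rot_low16_by_8(x);  /* DDCCAABB -> DDCCBBAA */
    return x;
}

/* Stand-in for the single-instruction path.  A real 80486+ build
   would emit bswap here; this model uses shifts and masks. */
uint32_t swap32_single(uint32_t x)
{
    return (x << 24) | ((x & 0x0000FF00u) << 8)
         | ((x & 0x00FF0000u) >> 8) | (x >> 24);
}

typedef uint32_t (*swap32_fn)(uint32_t);

/* Choose an implementation once, based on a flag from the caller. */
swap32_fn choose_swap32(int cpu_is_486_or_better)
{
    return cpu_is_486_or_better ? swap32_single : swap32_rotates;
}
```

Both paths give the same answer, so code that calls the chosen
routine doesn't need to know which CPU it is running on.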

A version of Executor that didn't work on 80386's would be a little
smaller and a little faster than the current one, but there's no
reason to think the performance difference would be huge.  We
benchmarked such a version long, long ago and found that an
80486-specific version was something like 5-10% faster.

Note that since NEXTSTEP/Intel only works on 80486's or better,
Executor/NEXTSTEP/Intel assumes the presence of an 80486 and takes
advantage of it.

The new, faster blitter I'm writing may take advantage of the bswap
instruction, if present.  Once that's done, the CPU emulator and the
graphics engine will both use bswap wherever it's available, so there
won't be much performance to gain by creating an 80486-specific
version of Executor.

Thanks for the suggestion, though.

-Mat

