- Xref: sparky comp.sys.intel:1593 alt.lang.asm:363 comp.os.msdos.programmer:8843
- Path: sparky!uunet!munnari.oz.au!yoyo.aarnet.edu.au!sirius.ucs.adelaide.edu.au!amodra
- From: amodra@ucs.adelaide.edu.au (Alan Modra)
- Newsgroups: comp.sys.intel,alt.lang.asm,comp.os.msdos.programmer
- Subject: 386SX code optimisation
- Message-ID: <8348@sirius.ucs.adelaide.edu.au>
- Date: 27 Aug 92 15:24:15 GMT
- Followup-To: comp.sys.intel
- Organization: Information Technology Division, The University of Adelaide, AUSTRALIA
- Lines: 36
-
- I'm often involved in producing fast, tight assembly code for things
- like BIOSes and drivers. One little thing that helps here is to put
- entry points into code on machine word boundaries, i.e. on a 386SX
- align to word (2-byte) boundaries, and on a 386DX align to dword
- (4-byte) boundaries. OK, you say, that's all fairly obvious: feed as
- much code as you can into the CPU pipeline on its first fetch. The
- curious thing about a 386SX is that it often pays to align to
- addresses that are an odd multiple of two (2 mod 4) rather than to
- dword boundaries. Anyone know why this is so?
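
The padding needed to place an entry point at a chosen offset within a dword is plain modular arithmetic. A minimal sketch in C, assuming a flat byte address; the name `pad_to_mod4` is made up for illustration and does not come from any assembler or toolchain:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of filler bytes to insert before an entry
 * point currently at `addr` so that it lands on an address congruent to
 * `target` modulo 4.  target == 0 gives dword alignment; target == 2
 * gives the "odd multiple of two" placement described above. */
unsigned pad_to_mod4(uint32_t addr, unsigned target)
{
    assert(target < 4);
    return (target + 4u - (addr & 3u)) & 3u;
}
```

For example, an entry point that would otherwise sit at 0x100 needs two filler bytes to reach 0x102, while one already at 0x102 needs none. In assembler source this usually comes down to an alignment directive followed by two bytes of filler; the exact directive depends on the assembler.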
-
- I suspect it may be that the internal pipelines of the 386SX are 32 bits
- wide, and the prefetch unit gets as much of the first dword as it can
- before passing instructions on to the next stage. This means that if
- your first code fetch is on a dword boundary, you get two word reads
- before anything happens. If your first code fetch is at an address that
- is an odd multiple of 2, then you read only one word.
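
The guess above can be written down as a tiny model. This is only a sketch of the hypothesis, not documented 386SX behaviour: it assumes the prefetcher reads aligned 16-bit words and keeps reading until it reaches the end of the dword that contains the first code byte. The function name is made up for illustration.

```c
#include <stdint.h>

/* Model of the hypothesis only: count the aligned 16-bit bus reads needed
 * to get from the word containing `addr` to the end of the dword
 * containing `addr`. */
unsigned word_reads_to_fill_dword(uint32_t addr)
{
    uint32_t first_word = addr & ~(uint32_t)1;       /* word holding the first byte */
    uint32_t dword_end  = (addr & ~(uint32_t)3) + 4; /* one past the enclosing dword */
    return (unsigned)((dword_end - first_word) / 2);
}
```

Under this model, start addresses of 0 and 1 mod 4 cost two word reads before the first instruction is passed on, and 2 and 3 mod 4 cost one. It says nothing about the extra cost of an instruction that begins on an odd byte, so it can be at best part of the story for odd start addresses.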
-
- Here are some relative timings I measured on one of my 386SX systems,
- for 64K filled with repeated copies of a JMP or other code snippet:
-
-                                      Start address mod 4
-                                   0       1       2       3
- Code             Bytes                     Timing
- JMP short        EB,02,xx,xx    1.208   1.208   1.000   1.318
- JMP near         E9,01,00,xx    1.000   1.000   1.091   1.091
- JNC short        73,02,xx,xx    1.203   1.203   1.000   1.312
- JNC long         0F,83,00,00    1.000   1.158   1.000   1.075
- NOP; JMP short   90,EB,01,xx    1.065   1.065   1.000   1.076
- MOV AX,AX; JMP   89,C0,EB,00    1.000   1.089   1.009   1.089
- XCHG BX,BX; JMP  87,DB,EB,00    1.058   1.076   1.000   1.136
- MUL BP; JMP      F7,E5,EB,00    1.062   1.062   1.000   1.087
-
- Timing is normalized to the fastest time in each row.
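
The normalization, and a quick tally of which start offset wins, can be sketched as follows. The figures are copied from the table above; the function names and the 1.0005 threshold used to recognise a 1.000 entry are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define OFFSETS 4
#define ROWS    8

/* Relative timings from the table: one row per snippet, one column per
 * start address mod 4. */
const double timings[ROWS][OFFSETS] = {
    {1.208, 1.208, 1.000, 1.318},   /* JMP short       */
    {1.000, 1.000, 1.091, 1.091},   /* JMP near        */
    {1.203, 1.203, 1.000, 1.312},   /* JNC short       */
    {1.000, 1.158, 1.000, 1.075},   /* JNC long        */
    {1.065, 1.065, 1.000, 1.076},   /* NOP; JMP short  */
    {1.000, 1.089, 1.009, 1.089},   /* MOV AX,AX; JMP  */
    {1.058, 1.076, 1.000, 1.136},   /* XCHG BX,BX; JMP */
    {1.062, 1.062, 1.000, 1.087},   /* MUL BP; JMP     */
};

/* Divide every entry of a row of raw times by the row's fastest
 * (smallest) time, so the fastest entry becomes 1.0. */
void normalize_row(const double *raw, double *out, size_t n)
{
    assert(n > 0);
    double best = raw[0];
    for (size_t i = 1; i < n; i++)
        if (raw[i] < best)
            best = raw[i];
    assert(best > 0.0);
    for (size_t i = 0; i < n; i++)
        out[i] = raw[i] / best;
}

/* Count the rows in which the given offset is the fastest, or joint
 * fastest, i.e. its normalized timing is 1.000. */
unsigned rows_fastest_at(unsigned offset)
{
    assert(offset < OFFSETS);
    unsigned count = 0;
    for (size_t r = 0; r < ROWS; r++)
        if (timings[r][offset] < 1.0005)
            count++;
    return count;
}
```

With the table's figures, a start address of 2 mod 4 is fastest or joint fastest in 6 of the 8 rows, against 3 rows for a dword-aligned start.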
-
- Anyone from Intel like to comment on why the 386SX (mis)behaves this
- way? I'd expect this sort of behaviour on a 386DX with 16 bit memory
- but not on a 386SX which can (and does) fetch words rather than dwords.
-