NetNews Usenet Archive 1992 #19

Internet Message Format | 1992-08-27 | 2.4 KB

Xref: sparky comp.sys.intel:1593 alt.lang.asm:363 comp.os.msdos.programmer:8843
Path: sparky!uunet!munnari.oz.au!yoyo.aarnet.edu.au!sirius.ucs.adelaide.edu.au!amodra
From: amodra@ucs.adelaide.edu.au (Alan Modra)
Newsgroups: comp.sys.intel,alt.lang.asm,comp.os.msdos.programmer
Subject: 386SX code optimisation
Message-ID: <8348@sirius.ucs.adelaide.edu.au>
Date: 27 Aug 92 15:24:15 GMT
Followup-To: comp.sys.intel
Organization: Information Technology Division, The University of Adelaide, AUSTRALIA
Lines: 36

I'm often involved in producing fast, tight assembly code for things
like BIOS and drivers. One little thing that helps here is to put entry
points into code on machine word boundaries, i.e. on a 386SX align to
word boundaries, and on a 386DX align to dword boundaries. OK, you say,
that's all fairly obvious - feed as much code as you can into the CPU
pipeline on its first fetch.

The curious thing about a 386SX is that it often pays to align to
addresses that are an odd multiple of two (address mod 4 = 2). Anyone
know why this is so?

I suspect it may be that the internal pipelines of the 386SX are 32 bits
wide, and the prefetch unit gets as much of the first dword as it can
before passing instructions on to the next stage. This means that if
your first code fetch is on a dword boundary, you get two word reads
before anything happens. If your first code fetch is at an address that
is an odd multiple of 2, then you read only one word.
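
To make the guess above concrete, here is a minimal Python sketch. The function names `pad_to_residue` and `word_reads_before_decode` are hypothetical, and the second function encodes only the poster's suspicion (word reads continue to the end of the current dword before anything is decoded), not documented 386SX behaviour.

```python
def pad_to_residue(addr: int, residue: int = 2, modulus: int = 4) -> int:
    """Number of filler bytes needed so that (addr + padding) % modulus == residue.

    With the defaults this pads an entry point up to the next address that
    is an odd multiple of two.
    """
    return (residue - addr) % modulus


def word_reads_before_decode(addr: int) -> int:
    """Word reads issued before decoding starts, under the poster's guess.

    Assumption (not documented behaviour): the prefetcher issues 16-bit,
    word-aligned bus reads until it reaches the end of the 32-bit dword
    that contains addr, and only then passes bytes on.
    """
    first_word = addr & ~1        # bus reads start on a word boundary
    dword_end = (addr & ~3) + 4   # first address past the containing dword
    return (dword_end - first_word) // 2


if __name__ == "__main__":
    for offset in range(4):
        addr = 0x100 + offset
        print(offset, pad_to_residue(addr), word_reads_before_decode(addr))
```

Under this model, start offsets 0 and 1 cost two word reads and offsets 2 and 3 cost one. That is consistent with offsets 0 and 1 timing the same and offset 2 being faster for a short JMP, but the model says nothing about instructions that straddle a dword boundary, so it should not be expected to predict the offset 3 case.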

Here's some relative timing I measured on one of my 386SX systems, for
64K of repeated JMPs and other code snippets:

                                 Timing, by start address mod 4
Instruction       Code bytes       0       1       2       3
JMP short         EB,02,xx,xx    1.208   1.208   1.000   1.318
JMP near          E9,01,00,xx    1.000   1.000   1.091   1.091
JNC short         73,02,xx,xx    1.203   1.203   1.000   1.312
JNC long          0F,83,00,00    1.000   1.158   1.000   1.075
NOP; JMP short    90,EB,01,xx    1.065   1.065   1.000   1.076
MOV AX,AX; JMP    89,C0,EB,00    1.000   1.089   1.009   1.089
XCHG BX,BX; JMP   87,DB,EB,00    1.058   1.076   1.000   1.136
MUL BP; JMP       F7,E5,EB,00    1.062   1.062   1.000   1.087

Timing is normalized to the fastest time in each row.

Anyone from Intel like to comment on why the 386SX (mis)behaves this
way? I'd expect this sort of behaviour on a 386DX with 16 bit memory
but not on a 386SX which can (and does) fetch words rather than dwords.
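
For anyone who wants to repeat the measurement, the bookkeeping behind the table can be sketched in Python as below. The figures are copied from the table above; the helper names `normalise` and `fastest_offsets` are hypothetical, and the original measurement code is not shown in the post.

```python
# Relative timings from the table above, indexed by start address mod 4.
TIMINGS = {
    "JMP short":       (1.208, 1.208, 1.000, 1.318),
    "JMP near":        (1.000, 1.000, 1.091, 1.091),
    "JNC short":       (1.203, 1.203, 1.000, 1.312),
    "JNC long":        (1.000, 1.158, 1.000, 1.075),
    "NOP; JMP short":  (1.065, 1.065, 1.000, 1.076),
    "MOV AX,AX; JMP":  (1.000, 1.089, 1.009, 1.089),
    "XCHG BX,BX; JMP": (1.058, 1.076, 1.000, 1.136),
    "MUL BP; JMP":     (1.062, 1.062, 1.000, 1.087),
}


def normalise(row):
    """Scale a row of timings so that its fastest entry becomes 1.000."""
    fastest = min(row)
    return tuple(round(t / fastest, 3) for t in row)


def fastest_offsets(row):
    """Return every start offset (mod 4) that ties for the fastest time."""
    fastest = min(row)
    return [offset for offset, t in enumerate(row) if t == fastest]


if __name__ == "__main__":
    for name, row in TIMINGS.items():
        print(f"{name:<16} fastest at offset(s) {fastest_offsets(row)}")
```

Run over the table, this shows offset 2 fastest or tied for fastest in six of the eight rows; the exceptions are JMP near and MOV AX,AX; JMP, where offset 0 wins.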