home *** CD-ROM | disk | FTP | other *** search
- String functions that meet the needs of i[45]86
- November, 1994
- Ulrich Drepper <drepper@ira.uka.de>
-
-
- Torbjorn Granlund made in 1991 a great job while he wrote the
- assembler versions of nearly all the essential strings functions in
- the GNU libc. Most are written in assembler for speed in the features
- of the processor are used for better speed.
-
- At that time the i386 was the CPU one found. In the development for the
- i486 Intel manage to implement the most important assembler instructions
- to use only one clock cycle. But this could not be done for all instruc-
- tions. So some of them now use the same or a bigger number of cycles.
- That's the way the CISC CPU get closer to RISC.
-
- Among those instructions which suffer from speedup are the "unimportant"
- string instructions: lods, stos, movs, scas, cmps, ins, and outs.
- Formerly very fast in comparison with the two or three instructions they
- substitute now they take longer. E.g.
-
- lodsl movl (%esi),%eax stosl movl %eax,(%edi)
- addl $4,%esi addl $4,%edi
-
- i386 5 4 + 2 4 2 + 2
-
- i486 5 1 + 1 5 1 + 1
-
-
- The only situation where the string instructions can still compete is
- when they are compined with 'rep' prefix and can process 32 bits in one
- step. This is the case for functions like 'memset', 'memcpy' etc. You
- will not find these functions in the collection.
-
-
- In the process of writing this functions I found several more techniques
- that can be used when breaking the string instructions in the sequence.
- For understanding this one must know that all kind of jumps (transfer of
- control) is expensive. The main goal should be to keep the main path of
- execution free of jumps. Of course we need loops but with some tricks we
- can get more power out of a simple loop.
-
-
- Some optimization techniques
-
- This is not intend to be a course in assembler programming but just a
- collection of the experiences I made. Perhaps somebody doing work on
- further functions (or even myself) want to read this instead of
- re-invent the wheel.
-
- 0. (As said above) Optimize the path of execution to use as little
- _TAKEN_ jumps as possible.
-
- In conditional jumps the main path should follow the way of a not
- fulfilled condition.
-
- 1. Unroll loops.
-
- The contents of loops should be unrolled to have the loop check to
- occure not every step. A good number I found is 4: nearly all the
- functions I wrote have a four times unfolded loop.
-
- To achieve this for directly bounded loops (such as in strncmp,
- bound by the length parameter) one have to do a check for the
- whole loop. This leads to some extra code which can process the
- remaining bytes but this is limited to a small range. E.g. the
- loop processes 16 bytes each round one has to write a tail code
- which can process 0 to 15 bytes. This number is so small that one
- need not process it in a loop but instead write a straight code.
-
- 2. Use the registers which are used for register variables only if
- necessary.
-
- The binary format specification for a.out and ELF on i386 CPUs
- says that the registers %ebx, %edi, and %esi must not be changed
- by a function (they are 'callee-save-registers'). A saving is
- generally done by a push/pop pair. This costs 5 cycles. Not much
- but we are looking for every cycle :).
-
- Before using tempories on the stack there is another possibility.
- Most those functions are tail-functions. I.e. they won't call any
- other functions. Therefor the base register is not needed. This
- way one has another register available. (Caution: you won't be
- able to use a debugger anymore if you use this technique.)
-
- 3. Exploit the adressing modes of the i386 CPUs
-
- The processor has a rather complex general adressing mode:
-
- offset + base register + index register * index factor
-
- where offset is a constant
- base register is a general register
- index register another general register
- index factor is 1, 2, 4, or 8
-
- One possible use is found in strcmp. We have to process two
- strings and therefor have to deal with two pointers. But these
- pointers are NOT unrelated. If we used one pointer and a constant,
- but only known at runtime, offset we can write loops simpler. An
- example:
-
- str1 => %edi str2 => %esi
-
-
- normal code optimized code
-
- subl %esi,%edi
-
- L: mov (%edi),... L: mov (%edi,%esi),...
- mov (%esi),... mov (%esi),
-
- ... ...
-
- add $x,%edi add $x,%esi
- add $x,%esi
-
- jmp L jmp L
-
- The CPU is able to compute the adrress (most of the time) in one
- cycle.
-
-
- 4. Use 32 bit for processing
- (This is what gains the most speedup !!!)
-
- The nature of the string functions generally appears to be only
- solvable by processing byte by byte. One has to match a character
- and also look for the end of the string (NULL char).
-
- But this not true. Basing on an algorithm I found in the C
- version of 'memchr' by Roland McGrath based on a work of (again)
- Torbjorn Granlund one has an efficient possiblity of matching a
- specific character out of a 32-bit word. I don't describe it here
- (look into the source code of the assembler version of 'memchr').
- The only thing to say here is that it can be implemented with very
- cheap assembler instruction.
-
- (Unfortunately the test is only a necessary condition but not
- sufficient. If you look through the code you will known what I
- mean.)
-
- But does this extra arithmetic is it worth? Is a simple
- comparison not of similar or greater speed?
-
- With a simple word: no. An extreme example: 'strlen'. One could
- think of a naive implementation, perhaps with an unfolded loop.
- But we have one comparison and one conditional jump for each byte.
- With this nice little trick we have some arithmetic and two
- conditionals for each 32-bit word. That's worth it.
-
-
- But there must be some problems. The world would not wait so long
- with the invention if the gain is so big. And in fact: there is a
- problem.
-
- The address space in modern OS is a virtual one. One effect of
- this is that memory access to some areas of the address space is
- illegal. If we now process string which is located at the top of
- a valid area and an 32-bit memory access crosses the boundary to
- the not mapped area we get a SEGFAULT. But what to do? We cannot
- check before every memory access if it is valid. This would
- nullify the gains.
-
- The solution comes if we put some knowledge about the environment
- in the algorithm. The MMU of the i386 CPU is only able to map
- memory areas on 4096-bytes boundaries (I assume the big grainity
- which is found in all interesting OS). If we now align all memory
- access on a 4-byte boundary we can never cross the line between to
- areas with one memory access. If we access an illegal address
- (perhaps because of a non-terminated string) the simple algorithm
- would make this, too. The behaviour is identical! (Another
- profit from aligned memory access is that it is faster.)
-
- You might say that this sounds nice but if you have two strings to
- process you cannot assure to align them both simultaneously. And
- you are right. This was one of the biggest problems I come over
- but if you look through the code of 'strcmp' you will note that
- there is a solution.
-
- NOTE NOTE NOTE: PLEASE check it out !!!!!!!!!!
-
- One can identify two big areas of memory where the 32 bit access
- is valid: The normal data space upto the '___brk_addr' and the
- stack (from %esp upto 0xc0000000). At the beginning of 'strcmp'
- and 'strncmp' we check if both pointers fall into one of these
- areas. If not we use the simple byte-by-byte algorithm.
-
- In the other case there are three possibilities:
- a) both pointers are in the data areas
-
- Here it is sufficient to align the bigger pointer because it is
- the only one which could cause a boundary crossing.
-
- b) both pointers are in the stack area
-
- We need not do any special action. The first illegal address
- is 0xc000000. But at the top of the stack there is (at least)
- the return address for the 'main' function and its parameter
- pointers. Every access in this area can only be a fault and
- perhaps a resulting SEGFAULT will help to debug.
- (If somebody says that this is not right we also here can align
- the bigger pointer.)
-
- c) one pointer in data the other in stack area
-
- As one can guess we have to align the pointer in the data area.
- This works with the same arguments as above.
-
-
- 5. Align loops.
-
- It is important to align the main loop according to the needs of
- the processor. The i386 only needs a 4-byte alignment but the
- i486 a 16-byte alignment (no info about i586). I use the generic
- macro 'ALIGN' which is defined in 'asm-ops.h'.
-
- In some files you find another optimization with this alignment.
- The aligning is done by padding with a fill-instruction: NOP.
- This cannot be executed for free, it costs on i386 3 cycles and on
- i486 1 cycle. Many functions have a jump to an instruction
- immediately in front of the loop. If we now arrange the code that
- the instructions are placed behind the NOP many times the NOPs
- need not be executed. Please look for a conrete example. It is
- a rather ugly method but the assembler does not have a directive
- like 'align to XXX but place the NOP before YYY'.
-