NetNews Usenet Archive 1992 #18

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #18 / NN_1992_18.iso / spool / comp / theory / cellaut / 343 < prev next >

Wrap

Text File | 1992-08-19 | 16.7 KB | 352 lines

Newsgroups: comp.theory.cell-automata Path: sparky!uunet!charon.amdahl.com!pacbell.com!mips!sdd.hp.com!elroy.jpl.nasa.gov!decwrl!csus.edu!netcom.com!pdh From: pdh@netcom.com (Phil Howard ) Subject: Using lookup tables for faster Life Game calculations Message-ID: <a_#n=la.pdh@netcom.com> Date: Thu, 20 Aug 92 02:30:29 GMT Organization: Netcom - Online Communication Services (408 241-9760 guest) Keywords: 8086 real mode lookup table life game Lines: 341 /************************************************************\ |Copyright (C) 1992, by Philip D. Howard, all rights reserved| |Permission is hereby granted to Usenet, and the underlying | |networks making up Usenet, to distribute this article. | \************************************************************/ Lookup table logic is already a well known method of optimizing program code in many situations. The Game of Life is one of those cases where the method can definitely speed up the calculation process. However such use of lookup tables for the Game of Life is not particularly trivial, so I will be summarizing the explorations in this area I have made, which has resulted in at least one fast working executable program. BACKGROUND For those unfamiliar, the Game of Life is a cellular automata simulation game. Others who participate in the comp.theory.cell-automata Usenet newgroup could give better citations than I can. Unfortunately I cannot find an FAQ (Frequently Asked Questions, with answers) posting. The original Game of Life, published by Conway in an article in Scientific American involved a rectangular array of positions in which a cell may or may not be alive. Time occurred in generations in which the cells could be born or die. The original rules specified that a live cell would die (of loneliness) if it had fewer than 2 neighbors, or die (of overcrowding or lack of food) if it had more than 3 neighbors. So in order for a cell to continue to live, it must have either 2 or 3 neighbors. A new cell could be born into an empty position (no live cell there currently) if that position had exactly 3 neighboring live cells. Both the orthogonal and diagonal positions were counted, but they had to be the adjacent ones. Variations have also since been experimented with both on the same board as above, or in different arrangements (hexagonal, 3d, etc). In this article I will be limiting discussion to just the original rectangular board (array of cell positions). EARLY EFFORTS When I first read about the Game of Life, I acquired a program for my Apple II computer to run the simulation. It was very obvious that figuring the generations by hand was no fun at all. The computer did the work for me just fine, until I wanted to explore lots of new cell configurations to see what they did. I also wanted to explore large configurations, but the program only did 40x48 or so. The program I got was rather slow (about 1 generation per two seconds). I then tried to write one of my own and it ended up being even slower. This was in BASIC so it was obvious I needed machine language. I also wanted to use the Hi-Res display mode of the Apple II for a larger configuration. The Apple II stored 7 pixels per byte for a width of 280 pixels and a height of 192 pixels. Obviously this meant even more work for the computer, so speed was essential. The first draft of my program had a lot of shifting going on, cell counting, and some code to check the numbers against the rules to decide whether a cell should be in the given position in the next generation. While seeking to find ways to reduce the amount of shifting of bits that was taking place, I realized that the neighbor bits in each row were already in a unique group, and maybe I could capitalize on that in some way. So the first approach I took was to use the 3 cells in either the row above or the row below, and use them as a lookup to a table that gave me the count. I didn't need to mask off the 3 bits because I made the table repeat the counts to fill out 256 bytes in a way that treated the other 5 bits as irrelevant. At the expense of some memory I saved several CPU cycles. Later on I tried a different approach of collecting all 8 neighbor bits into an index, which directly got the data without the counting and the test of the count. One further test of the center cell made a simple branch. In order to make things as fast as possible, I designed 7 different types of machine code for each of the 7 different pixel positions in a byte. The resultant program ran at about the same speed as the first program I had, but now was doing 280x192 instead of 40x48. By limiting the size I got a significant speedup. I've since sold my Apple II, and gave all my old Apple II disks to a friend who probably erased them to put his games on. I no longer have any of that code. RECENT ADVANCES More recently I have considered again the writing of a program to run the Game of Life, but this time on a PC clone architecture. The EGA and VGA displays have a video mode (number 13, hexadecimal 0D) in which there are 320x200 pixels, organized so that each plane (4 planes total) contains 8 pixels per byte. I could use just one plane to store that data as well as retrieve it. The original Apple II was limited to 48K (subsequent models have pushed that figure up) but the PC architecture in 8086 real mode had a memory capacity of 640K (not utilizing the memory locations reserved for things like video). This allowed for an even larger lookup table. The original method I examined involved calculating 4 cells in parallel. For each byte, 2 sections of code would be used, one for the higher 4 bits and one for the lower 4 bits. I chose to use a linear arrangement of 4 bits to make the calculation of a full byte of cells easier to do. In these graphs, the cells being calculated are h, i, j, and k: higher 4 cells: lower 4 cells: .......a bcdef... ...abcde f....... .......g hijkl... ...ghijk l....... .......m nopqr... ...mnopq r....... The dots represent bit positions that do not participate, and the space is the boundary between bytes. For each 4 cells being calculated, it was necessary to load 6 bytes of data, and perform some shifting and masking. Since this is a total of 18 bits and the target machine had a limit of 64K for a single array, I decided to split things up and utilize the segmented memory access of the 8086. The lookup table was allocated as 32 pieces of 8K bytes. A selected 5 bits was used to index into a table of segment numbers, and the other 13 bits was used as an offset. For the upper 4 bits, the bits were shifted around so that: segment index = ........ ...hijkl offset = ...nopqr bcdefagm For the lower 4 bits, the bits were shifted around so that: segment index = ........ ...ghijk offset value = ...mnopq rlfabcde Since each lookup table entry had pre-calculated bits in the proper position, it would have been a simple matter when generating the table to do the upper 4 bits one way, and the lower 4 bits a different way. The reason for the different arrangement was to minimize the total amount of shifting being done. The C code to calculate the upper 4 bits is like: b = segref( uchar, seglist[ ( c2 & 248 ) >> 3 ], // segindex HIJKl ( ( ( ( u1 & 1 ) << 1 ) // ..... ......a. | ( c1 & 1 ) // ..... .......g ) << 1 ) // ..... .....ag. | ( d1 & 1 ) // ..... .......m | ( u2 & 248 ) // ..... bcdef... | ( ( d2 & 248 ) << 5 ) // nopqr ........ ) & 0xf0; The C code to calculate the lower 4 bits is like: b |= segref( uchar, seglist[ c1 & 31 ], // segindex gHIJK ( ( ( ( u2 & 128 ) >> 1 ) // ..... .f...... | ( c2 & 128 ) // ..... l....... ) >> 1 ) // ..... .lf..... | ( d2 & 128 ) // ..... r....... | ( u1 & 31 ) // ..... ...abcde | ( ( d1 & 31 ) << 8 ) // mnopq ........ ) & 0x0f; segref( uchar , arg_next , tp++ ) = b; The array "seglist" contained the 32 segment values (paragraph number). The C macro "segref" is a macro I wrote to simply accessing memory when I am specifying the segment and offset separately. It results in an lvalue in the C language, which can be used to fetch or store data. Argument 1 is the data type, argument 2 is the segment, and argument 3 is the offset. The variables u1, c1, d1, u2, c2, and d2 were pre-loaded with the data from the old generation. I was hoping the C compiler would handle these in registers, but unfortunately the 8086 has too few registers, and those it does have are too limited (for instance you cannot use AX, CX, or DX as base registers). So it allocated memory even for these. This would be a loss primarily on non-cache machines. The speed results were quite good (at full 320x200): Gateway 2000 386/25 (no cache): 6.75 generations per second IBM PS/2-70: approx 7.5 generations per second Gateway 2000 486/33C: 17.5 generations per second GOING EVEN FASTER I studied the machine code the compiler generated and was still convinced it should be possible to go even faster. There were a lot of shift operations being done, and neither I nor the compiler could find any way to optimize any of it (I tried the maximum optimization of the MicroSoft C compiler). So I needed something else if I was going to get any more speed. One of the things I noticed was that only a 1 column slice was being taken of the adjacent byte. If I shifted the working position over, I could have ONE of the two "phases" of calculation being performed on only 3 bytes instead of 6 bytes. The input would also be better aligned resulting in less shifting of data. The collection of the new cells would be more complex but no new shifting was needed to do so since the lookup table would hold cells in the ultimate position anyway. As an added bonus, I discovered that some of the shifting was 4 bits, which suggested a further possible savings by taking advantage of the inherint 4 bit shift of the address calculation in the 8086 real mode memory segmentation. So a new construction of the 18 bits to form an address in the lookup table looks like this. In order to show the effect of the 4 bit shift in the segment value, the graphic will show the inherint shift: first 4 cells: second 4 cells: ......ab cdef.... ..abcdef ......gh ijkl.... ..ghijkl ......mn opqr.... ..mnopqr For the first 4 bits, the bits were shifted around so that: segment = ..gh .... cdef .... offset = ijkl .... opqr abmn index = gh ijkl cdef opqr abmn For the second 4 bits, the bits were shifted around so that: segment = ..gh ijkl .... .... offset = .... abcd efmn opqr index = gh ijkl abcd efmn opqr Since the segment portion of the index was being constructed entirely in position, it was necessary to have the lookup table entirely contiguous in memory. The base segment value of the table would be added to the calculated segment value to be used as a portion of the address of the ultimate memory location in the lookup table. I am currently coding this method in assembly language instead of C, and am also including the ability to have any board width, not just multiples of 8 which would have been easier to do. Therefore the edge calculations are more complex and separate code is dedicated. This version will also allow choosing to have wrap around between top and bottom, or between left and right, or both. REDUCING THE MEMORY Both of the methods described above for the PC/8086 use an 18 bit index and so requires a lookup table of 256K bytes. Several people have reported that the first beta-test version I released would not run on their machine because it was out of memory (some additional memory was require for the code, I/O buffers, a copy of the active board, and an active shape pattern. It would certainly be nice to find a way to reduce the amount of memory required. One method that was suggested was to organize the 4 cells in a square instead of a line. This would result in only 16 bits instead of a 18 bits for the index. The memory used would be 64K. The configurations would be like: phase 0: phase 1: phase 2: phase 3: ......ab cd...... abcd.... ..abcd.. ....abcd ......ef gh...... efgh.... ..efgh.. ....efgh ......ij kl...... ijkl.... ..ijkl.. ....ijkl ......mn op...... mnop.... ..mnop.. ....mnop In each case, the cells f, g, j, and k are being calculated. The number of phases (different sets of code) needed is 4 since the width would be 2 instead of 4, and that results in 4 different positions within a byte. In addition, each use of the lookup table will have to be different for the upper 2 new cells than for the lower 2 new cells. Since the cells that contribute are different, the need to be pulled in from different places in the lookup table. This could either be a different bit position (but these are already used up for the 4 phases) or a different byte (which doubles the table size from 64K to 128K). At the present time I have chosen not to explore the 2x2 method any further. It does have potential, but it also looks like it will be more complex. Another possible method that I do plan to explore involves calculating fewer cells at a time, trading some speed for even more memory savings. The organization would be: phase 0: phase 1: phase 2: ......ab cd...... abcde... ...abcde ......ef gh...... fghij... ...fghij ......ij kl...... klmno... ...klmno The cells being calculated in phase 0 are f and g. The cells being calculated in phases 1 and 2 are g, h, and i. The number of bits to form an index is only 15 (12 in phase 0, but this gives us the freedom to place them anywhere in the lower 15 bit positions) so the table size needed is only 32K bytes. I plan to try to implement this method in C without using any segment and offset tricks, although pre-processor selected code may be used to force based addresses to be sure that the index is used as an offset directly instead of being added to a far address in the 8086 architecture. Other architectures would be compiled using ordinary indexing. SUMMARY The lookup table, especially the modified form where multiple cells are effectively calculated in parallel does offer a speedup advantage at the cost of storing the table itself. Care must be taken in limited machine architectures when the tables are large, but some tricky programming may also allow one to select some advantages in these cases. This method of using a lookup table allows for using any kind of rules for a rectangular board where cells are affected by their immediate eight neighbors. I have not yet explored any of these variations, not have I constructed code that would flexibly build the lookup table from a rule set. I am looking for ways to specify a rule set before doing this. Very large memory spaces (particularly if an equivalent amount of real memory is available to keep the lookup table resident in virtual memory systems) may offer significant further speedups. I look forward to the day I have a machine with 2 gigabytes of real memory, of which 1 gigabyte can be used to hold the lookup table needed for a 30 bit index which would let me calculate a full 8-bit byte all at once. If you would like a uuencoded or xxencoded copy of the PCLIFE program I wrote using the first 4-bit method I described, send me e-mail to the address: pdh@netcom.com and ask for a copy. The ZIP file will include a few sample boards, selected via the 12 function keys. The user interface is this version is very rough and incomplete; some commands are not implemented. Suggestions for a full release version are invited. If you already have a user interface but would like to add my generation code, please contact me at the same e-mail address. I will send uuencoded unless you specify xxencoded. -- /***********************************************************************\ | Phil Howard --- KA9WGN --- pdh@netcom.com | "The problem with | | depending on government is that you cannot depend on it" - Tony Brown | \***********************************************************************/