home *** CD-ROM | disk | FTP | other *** search
- Newsgroups: comp.theory.cell-automata
- Path: sparky!uunet!charon.amdahl.com!pacbell.com!mips!sdd.hp.com!elroy.jpl.nasa.gov!decwrl!csus.edu!netcom.com!pdh
- From: pdh@netcom.com (Phil Howard )
- Subject: Using lookup tables for faster Life Game calculations
- Message-ID: <a_#n=la.pdh@netcom.com>
- Date: Thu, 20 Aug 92 02:30:29 GMT
- Organization: Netcom - Online Communication Services (408 241-9760 guest)
- Keywords: 8086 real mode lookup table life game
- Lines: 341
-
- /************************************************************\
- |Copyright (C) 1992, by Philip D. Howard, all rights reserved|
- |Permission is hereby granted to Usenet, and the underlying |
- |networks making up Usenet, to distribute this article. |
- \************************************************************/
-
-
- Lookup table logic is already a well known method of optimizing program
- code in many situations. The Game of Life is one of those cases where
- the method can definitely speed up the calculation process. However
- such use of lookup tables for the Game of Life is not particularly
- trivial, so I will be summarizing the explorations in this area I have
- made, which has resulted in at least one fast working executable program.
-
-
-
- BACKGROUND
-
- For those unfamiliar, the Game of Life is a cellular automata simulation
- game. Others who participate in the comp.theory.cell-automata Usenet
- newgroup could give better citations than I can. Unfortunately I cannot
- find an FAQ (Frequently Asked Questions, with answers) posting.
-
- The original Game of Life, published by Conway in an article in Scientific
- American involved a rectangular array of positions in which a cell may or
- may not be alive. Time occurred in generations in which the cells could
- be born or die. The original rules specified that a live cell would die
- (of loneliness) if it had fewer than 2 neighbors, or die (of overcrowding
- or lack of food) if it had more than 3 neighbors. So in order for a cell
- to continue to live, it must have either 2 or 3 neighbors. A new cell
- could be born into an empty position (no live cell there currently) if
- that position had exactly 3 neighboring live cells. Both the orthogonal
- and diagonal positions were counted, but they had to be the adjacent ones.
-
- Variations have also since been experimented with both on the same board
- as above, or in different arrangements (hexagonal, 3d, etc). In this
- article I will be limiting discussion to just the original rectangular
- board (array of cell positions).
-
-
-
- EARLY EFFORTS
-
- When I first read about the Game of Life, I acquired a program for my
- Apple II computer to run the simulation. It was very obvious that
- figuring the generations by hand was no fun at all. The computer did
- the work for me just fine, until I wanted to explore lots of new cell
- configurations to see what they did. I also wanted to explore large
- configurations, but the program only did 40x48 or so. The program I
- got was rather slow (about 1 generation per two seconds). I then tried
- to write one of my own and it ended up being even slower. This was in
- BASIC so it was obvious I needed machine language.
-
- I also wanted to use the Hi-Res display mode of the Apple II for a larger
- configuration. The Apple II stored 7 pixels per byte for a width of 280
- pixels and a height of 192 pixels. Obviously this meant even more work
- for the computer, so speed was essential. The first draft of my program
- had a lot of shifting going on, cell counting, and some code to check
- the numbers against the rules to decide whether a cell should be in the
- given position in the next generation.
-
- While seeking to find ways to reduce the amount of shifting of bits that
- was taking place, I realized that the neighbor bits in each row were
- already in a unique group, and maybe I could capitalize on that in some
- way. So the first approach I took was to use the 3 cells in either the
- row above or the row below, and use them as a lookup to a table that
- gave me the count. I didn't need to mask off the 3 bits because I made
- the table repeat the counts to fill out 256 bytes in a way that treated
- the other 5 bits as irrelevant. At the expense of some memory I saved
- several CPU cycles.
-
- Later on I tried a different approach of collecting all 8 neighbor bits
- into an index, which directly got the data without the counting and the
- test of the count. One further test of the center cell made a simple
- branch. In order to make things as fast as possible, I designed 7
- different types of machine code for each of the 7 different pixel
- positions in a byte. The resultant program ran at about the same speed
- as the first program I had, but now was doing 280x192 instead of 40x48.
- By limiting the size I got a significant speedup.
-
- I've since sold my Apple II, and gave all my old Apple II disks to a
- friend who probably erased them to put his games on. I no longer have
- any of that code.
-
-
-
- RECENT ADVANCES
-
- More recently I have considered again the writing of a program to run
- the Game of Life, but this time on a PC clone architecture. The EGA
- and VGA displays have a video mode (number 13, hexadecimal 0D) in which
- there are 320x200 pixels, organized so that each plane (4 planes total)
- contains 8 pixels per byte. I could use just one plane to store that
- data as well as retrieve it.
-
- The original Apple II was limited to 48K (subsequent models have pushed
- that figure up) but the PC architecture in 8086 real mode had a memory
- capacity of 640K (not utilizing the memory locations reserved for things
- like video). This allowed for an even larger lookup table.
-
- The original method I examined involved calculating 4 cells in parallel.
- For each byte, 2 sections of code would be used, one for the higher 4 bits
- and one for the lower 4 bits. I chose to use a linear arrangement of
- 4 bits to make the calculation of a full byte of cells easier to do.
-
- In these graphs, the cells being calculated are h, i, j, and k:
-
- higher 4 cells: lower 4 cells:
-
- .......a bcdef... ...abcde f.......
- .......g hijkl... ...ghijk l.......
- .......m nopqr... ...mnopq r.......
-
- The dots represent bit positions that do not participate, and the space
- is the boundary between bytes. For each 4 cells being calculated, it
- was necessary to load 6 bytes of data, and perform some shifting and
- masking. Since this is a total of 18 bits and the target machine had
- a limit of 64K for a single array, I decided to split things up and
- utilize the segmented memory access of the 8086. The lookup table was
- allocated as 32 pieces of 8K bytes. A selected 5 bits was used to
- index into a table of segment numbers, and the other 13 bits was used
- as an offset.
-
- For the upper 4 bits, the bits were shifted around so that:
- segment index = ........ ...hijkl
- offset = ...nopqr bcdefagm
-
- For the lower 4 bits, the bits were shifted around so that:
- segment index = ........ ...ghijk
- offset value = ...mnopq rlfabcde
-
- Since each lookup table entry had pre-calculated bits in the proper
- position, it would have been a simple matter when generating the table
- to do the upper 4 bits one way, and the lower 4 bits a different way.
- The reason for the different arrangement was to minimize the total
- amount of shifting being done.
-
- The C code to calculate the upper 4 bits is like:
-
- b = segref( uchar,
- seglist[ ( c2 & 248 ) >> 3 ], // segindex HIJKl
- ( (
- ( ( u1 & 1 ) << 1 ) // ..... ......a.
- | ( c1 & 1 ) // ..... .......g
- ) << 1 ) // ..... .....ag.
- | ( d1 & 1 ) // ..... .......m
- | ( u2 & 248 ) // ..... bcdef...
- | ( ( d2 & 248 ) << 5 ) // nopqr ........
- ) & 0xf0;
-
- The C code to calculate the lower 4 bits is like:
-
- b |= segref( uchar,
- seglist[ c1 & 31 ], // segindex gHIJK
- ( (
- ( ( u2 & 128 ) >> 1 ) // ..... .f......
- | ( c2 & 128 ) // ..... l.......
- ) >> 1 ) // ..... .lf.....
- | ( d2 & 128 ) // ..... r.......
- | ( u1 & 31 ) // ..... ...abcde
- | ( ( d1 & 31 ) << 8 ) // mnopq ........
- ) & 0x0f;
- segref( uchar , arg_next , tp++ ) = b;
-
- The array "seglist" contained the 32 segment values (paragraph number).
-
- The C macro "segref" is a macro I wrote to simply accessing memory when
- I am specifying the segment and offset separately. It results in an
- lvalue in the C language, which can be used to fetch or store data.
- Argument 1 is the data type, argument 2 is the segment, and argument
- 3 is the offset.
-
- The variables u1, c1, d1, u2, c2, and d2 were pre-loaded with the data
- from the old generation. I was hoping the C compiler would handle these
- in registers, but unfortunately the 8086 has too few registers, and those
- it does have are too limited (for instance you cannot use AX, CX, or DX
- as base registers). So it allocated memory even for these. This would
- be a loss primarily on non-cache machines.
-
- The speed results were quite good (at full 320x200):
-
- Gateway 2000 386/25 (no cache): 6.75 generations per second
- IBM PS/2-70: approx 7.5 generations per second
- Gateway 2000 486/33C: 17.5 generations per second
-
-
-
- GOING EVEN FASTER
-
- I studied the machine code the compiler generated and was still convinced
- it should be possible to go even faster. There were a lot of shift
- operations being done, and neither I nor the compiler could find any way
- to optimize any of it (I tried the maximum optimization of the MicroSoft C
- compiler).
-
- So I needed something else if I was going to get any more speed. One of
- the things I noticed was that only a 1 column slice was being taken of the
- adjacent byte. If I shifted the working position over, I could have ONE
- of the two "phases" of calculation being performed on only 3 bytes instead
- of 6 bytes. The input would also be better aligned resulting in less
- shifting of data. The collection of the new cells would be more complex
- but no new shifting was needed to do so since the lookup table would
- hold cells in the ultimate position anyway.
-
- As an added bonus, I discovered that some of the shifting was 4 bits,
- which suggested a further possible savings by taking advantage of the
- inherint 4 bit shift of the address calculation in the 8086 real mode
- memory segmentation. So a new construction of the 18 bits to form an
- address in the lookup table looks like this. In order to show the
- effect of the 4 bit shift in the segment value, the graphic will show
- the inherint shift:
-
- first 4 cells: second 4 cells:
-
- ......ab cdef.... ..abcdef
- ......gh ijkl.... ..ghijkl
- ......mn opqr.... ..mnopqr
-
- For the first 4 bits, the bits were shifted around so that:
- segment = ..gh .... cdef ....
- offset = ijkl .... opqr abmn
- index = gh ijkl cdef opqr abmn
-
- For the second 4 bits, the bits were shifted around so that:
- segment = ..gh ijkl .... ....
- offset = .... abcd efmn opqr
- index = gh ijkl abcd efmn opqr
-
- Since the segment portion of the index was being constructed entirely
- in position, it was necessary to have the lookup table entirely contiguous
- in memory. The base segment value of the table would be added to the
- calculated segment value to be used as a portion of the address of the
- ultimate memory location in the lookup table.
-
- I am currently coding this method in assembly language instead of C,
- and am also including the ability to have any board width, not just
- multiples of 8 which would have been easier to do. Therefore the
- edge calculations are more complex and separate code is dedicated.
- This version will also allow choosing to have wrap around between
- top and bottom, or between left and right, or both.
-
-
-
- REDUCING THE MEMORY
-
- Both of the methods described above for the PC/8086 use an 18 bit index
- and so requires a lookup table of 256K bytes. Several people have reported
- that the first beta-test version I released would not run on their machine
- because it was out of memory (some additional memory was require for the
- code, I/O buffers, a copy of the active board, and an active shape pattern.
- It would certainly be nice to find a way to reduce the amount of memory
- required.
-
- One method that was suggested was to organize the 4 cells in a square
- instead of a line. This would result in only 16 bits instead of a 18
- bits for the index. The memory used would be 64K. The configurations
- would be like:
-
- phase 0: phase 1: phase 2: phase 3:
-
- ......ab cd...... abcd.... ..abcd.. ....abcd
- ......ef gh...... efgh.... ..efgh.. ....efgh
- ......ij kl...... ijkl.... ..ijkl.. ....ijkl
- ......mn op...... mnop.... ..mnop.. ....mnop
-
- In each case, the cells f, g, j, and k are being calculated.
-
- The number of phases (different sets of code) needed is 4 since the
- width would be 2 instead of 4, and that results in 4 different positions
- within a byte. In addition, each use of the lookup table will have
- to be different for the upper 2 new cells than for the lower 2 new cells.
- Since the cells that contribute are different, the need to be pulled
- in from different places in the lookup table. This could either be a
- different bit position (but these are already used up for the 4 phases)
- or a different byte (which doubles the table size from 64K to 128K).
-
- At the present time I have chosen not to explore the 2x2 method any further.
- It does have potential, but it also looks like it will be more complex.
-
-
-
- Another possible method that I do plan to explore involves calculating
- fewer cells at a time, trading some speed for even more memory savings.
- The organization would be:
-
- phase 0: phase 1: phase 2:
-
- ......ab cd...... abcde... ...abcde
- ......ef gh...... fghij... ...fghij
- ......ij kl...... klmno... ...klmno
-
- The cells being calculated in phase 0 are f and g. The cells being
- calculated in phases 1 and 2 are g, h, and i.
-
- The number of bits to form an index is only 15 (12 in phase 0, but this
- gives us the freedom to place them anywhere in the lower 15 bit positions)
- so the table size needed is only 32K bytes.
-
- I plan to try to implement this method in C without using any segment and
- offset tricks, although pre-processor selected code may be used to force
- based addresses to be sure that the index is used as an offset directly
- instead of being added to a far address in the 8086 architecture. Other
- architectures would be compiled using ordinary indexing.
-
-
-
- SUMMARY
-
- The lookup table, especially the modified form where multiple cells are
- effectively calculated in parallel does offer a speedup advantage at the
- cost of storing the table itself. Care must be taken in limited machine
- architectures when the tables are large, but some tricky programming may
- also allow one to select some advantages in these cases.
-
- This method of using a lookup table allows for using any kind of rules
- for a rectangular board where cells are affected by their immediate eight
- neighbors. I have not yet explored any of these variations, not have I
- constructed code that would flexibly build the lookup table from a rule
- set. I am looking for ways to specify a rule set before doing this.
-
- Very large memory spaces (particularly if an equivalent amount of real
- memory is available to keep the lookup table resident in virtual memory
- systems) may offer significant further speedups. I look forward to the
- day I have a machine with 2 gigabytes of real memory, of which 1 gigabyte
- can be used to hold the lookup table needed for a 30 bit index which
- would let me calculate a full 8-bit byte all at once.
-
- If you would like a uuencoded or xxencoded copy of the PCLIFE program
- I wrote using the first 4-bit method I described, send me e-mail to
- the address: pdh@netcom.com and ask for a copy. The ZIP file will
- include a few sample boards, selected via the 12 function keys. The
- user interface is this version is very rough and incomplete; some
- commands are not implemented. Suggestions for a full release version
- are invited. If you already have a user interface but would like to
- add my generation code, please contact me at the same e-mail address.
- I will send uuencoded unless you specify xxencoded.
- --
- /***********************************************************************\
- | Phil Howard --- KA9WGN --- pdh@netcom.com | "The problem with |
- | depending on government is that you cannot depend on it" - Tony Brown |
- \***********************************************************************/
-