DISCOURSE ON ALPHASORT
PATENT NO. 5,218,700
COMPARISON WITH
OTHER SORTING ALGORITHMS
Written by Al Beechick
Copyright 1994 by Arrow Connection
P.O. Box 899
Pollock Pines CA 95726
(916) 644-2341
TABLE OF CONTENTS
1. Alphasort, the Newer and Faster MSD Radix Algorithm
2. Comparison to Quicksort
3. Comparison to Radix Exchange Sorting
4. Comparison to Sorting by Distribution
5. Comparison to Sorting by Address Calculation
6. Comparison to Patent No. 4,809,158
1. ALPHASORT, THE NEWER AND FASTER MSD RADIX ALGORITHM
Computer sorting algorithms can be divided into two classes,
those that use compares and those that don't. Radix sorting does not
use compares, and Alphasort falls into this class. This paper
compares Alphasort to other sorting algorithms that fall into both
classes. The goal is to help the reader understand why and how
Alphasort is faster.
Alphasort is so called because it sorts by placing words
according to the letters of the alphabet. This bypasses the need for
compares. In addition to letters of the alphabet, Alphasort examines
numbers or any computer-recognizable character. This method of
placing words in "bins" or "piles" according to the characters in the
word is radix sorting.
When Alphasort examines the characters, it starts at the most
significant digit (MSD). In other words, it starts at the front of
the word. This is the opposite of traditional radix sorting which
starts at the least significant digit (LSD). Old radix methods
examined each character of the word, starting at the end of the word.
The final pass, therefore, would be on the most significant digit, and
this was necessary because of memory handling problems.
The memory problem encountered by radix sorting was how to keep
track of the "bins" or "piles." According to traditional radix
thought, starting at the most significant digit would result in a
multiplicity of "piles," and therefore use up a huge amount of memory
in the computer. But Alphasort resolves this problem, finding a way
to use a minimum amount of memory, no more memory than other methods
which use compares.
Resolving the memory problem is the key to faster sorting.
Because Alphasort starts at the most significant digit, only a
fraction of the total digits need to be examined, instead of
all the digits as before. For example, if the first two passes result
in a unique placement of the word, then the rest of the letters in the
word are not examined. If "Alfred" is the only name in the list which
starts with "Al," then the characters "fred" are not examined. This
eliminates unnecessary passes, and it results in maximum efficiency.
Alphasort not only increases efficiency over traditional radix
sorting, but it also surpasses the other class of sorting methods,
namely those methods which use compares. This could not be said of
traditional radix sorting methods which sort according to the least
significant digit. The necessity to examine each and every digit
resulted in lower efficiency, and so there was no compelling need to
abandon methods which use compares in favor of a radix method. But a
radix method which sorts by the most significant digit examines only a
fraction of the digits, resulting in an efficiency which surpasses
that class of methods which use compares.
Formulas show that, for methods which use compares, sorting time
grows faster than the size of the list. In other words, adding one
item to the list adds not one compare but several. If N is the
number of items in
the list being sorted, then a method which uses compares may need
N x log(base 2) N passes. In contrast, Alphasort uses approximately
N x log(base r) N passes, where r is the range of values that
individual characters can take, such as 26 for alphabetical sorting,
or 10 for numerical sorting.
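To make the contrast concrete, here is a back-of-the-envelope check
of the two estimates above, evaluated for N = 1000 and r = 26 using
the identity log(base b) N = ln N / ln b. The program is an
illustration of the formulas only, not taken from the Alphasort
source. (The sketches throughout this paper use Pascal, the language
of the timed tests described later.)

    program PassEstimate;
    { Back-of-the-envelope check of the two pass estimates above:
      N x log(base 2) N for a method using compares versus
      N x log(base 26) N for alphabetic radix sorting. }
    var
      n: longint;
    begin
      n := 1000;
      writeln('compares, N log2 N  : ', n * ln(n) / ln(2.0) : 8 : 0);
      writeln('radix,    N log26 N : ', n * ln(n) / ln(26.0) : 8 : 0)
    end.

For N = 1000 this gives roughly 9966 passes for the compare-based
estimate against roughly 2120 for the radix estimate.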
If you were to graph sorting time against the size of the list,
Alphasort's increased sorting time would rise almost linearly. The
reason the graph would not show an exactly straight line is that
additional items may require Alphasort to examine more digits in some
items. Additional passes may be required because, as the list grows,
the likelihood increases that words will share the same first two or three
characters. On the other hand, if items added to the list have unique
opening letters, then additional sorting passes will not be required.
So sorting time depends upon the makeup of the list more than the size
of the list.
In contrast, methods which use compares require additional passes
as the list size increases. This is an inherent requirement of the
algorithm. Therefore, if you were to graph sorting time against the
size of the list, increased sorting time would rise with a markedly
upward curve.
The preferable method for long lists, therefore, is Alphasort.
But it works with equal efficiency for short lists. This is in
contrast to traditional radix sorting. Since traditional radix
sorting starts at the least significant digit, it has to examine each
digit, and this is wasted effort in a short list. However, Alphasort
starts with the most significant digit, meaning that for a short list
only a small fraction of the total digits need to be examined. The
concept of Alphasort is that only the minimum number of passes need be
applied to sort the list at hand. Thus, for a list of only 10 items,
there is a strong likelihood that the entire list can be sorted in one
pass. If, by chance, two words in the short list started with the same
letter, then an additional pass would be taken on those two words.
The worst case scenario for Alphasort is a large number of
duplicates or near duplicates in the list being sorted. This forces
additional passes in order to arrive at a unique combination of
opening characters for each item. Even in this worst case scenario,
Alphasort compares favorably with traditional radix sorting methods,
because the traditional methods examined each character regardless of
whether the items were duplicates or not.
The table in the next section compares sorting times of Alphasort
and quicksort that I tested on my personal computer. The list used
had many duplicates, because I increased list size by copying portions
of the list. Therefore, the table reflects Alphasort's performance at
its worst case scenario. Even then it compares favorably with
quicksort.
To summarize thus far, Alphasort compares favorably with both
classes of sorting algorithms. The class of algorithms which use
compares will always be at a disadvantage next to Alphasort, because
Alphasort never uses compares. The necessity of multiplying compares
hampers efficiency. Also, Alphasort compares favorably with
traditional radix methods, because it starts at the most significant
digit instead of the least significant digit. The necessity of
examining each digit hampers efficiency in traditional radix methods.
Alphasort achieves this efficiency because of the way it handles
the "bins" or the "piles." How Alphasort handles computer memory will
be examined more closely in the last section where the Alphasort
patent is compared to a prior patent.
A question I often receive is, "How does Alphasort handle
integers?" My answer, "It handles integers the same way it handles
words." The method examines the digits, which can be any alphanumeric
character. It does not matter whether the digits are letters,
numbers, or any other special character in the set of recognizable
computer characters.
When sorting integers, however, the programmer will want to make
one small adjustment. Make the first pass on the length of the
integer, and then make the second pass on the first digit of the
integer. In other words, when sorting 10, 100, and 1000, the
respective lengths are 2, 3 and 4. Thus the first pass places shorter
integers before longer ones, so that 10 ends up in a different "pile"
than 100. The second pass, then, distinguishes 10 from 20, and so on
up to 99. You can think of it as sorting on the zero byte first and
the first byte second.
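The adjustment can be pictured as prefixing each integer with its
length. The following sketch shows that idea as a key transform; it
is an illustration of the ordering only, not the Alphasort code, and
a plain insertion sort stands in for the radix passes. It assumes
non-negative integers of at most 9 digits with no leading zeros.

    program LengthFirst;
    { Prefix each number with its length (the "zero byte") so that
      plain string order equals numeric order. }
    const
      N = 5;
    var
      a: array[1..N] of string;
      i, j: integer;
      t: string;

    function key(s: string): string;
    begin
      key := chr(ord('0') + length(s)) + s   { length-first key }
    end;

    begin
      a[1] := '100'; a[2] := '10'; a[3] := '99';
      a[4] := '1000'; a[5] := '20';
      { simple insertion sort on the length-prefixed key }
      for i := 2 to N do begin
        t := a[i];
        j := i - 1;
        while (j >= 1) and (key(a[j]) > key(t)) do begin
          a[j + 1] := a[j];
          j := j - 1
        end;
        a[j + 1] := t
      end;
      for i := 1 to N do write(a[i], ' ');   { 10 20 99 100 1000 }
      writeln
    end.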
Now that you have an overview of Alphasort, let's get into more
detail as we compare Alphasort to other methods of sorting.
2. COMPARISON TO QUICKSORT
My tests show that Alphasort runs at least twice as fast as quicksort,
even faster for certain kinds of lists. The quicksort I used came
with Borland's Turbo Pascal. The list I used for testing purposes was
a list of people's names. The original list had 175 names. I created
longer lists by copying the original list several times. Therefore,
the longer lists have many exact duplicates, putting Alphasort at its
worst case scenario. I ran these tests on my 386 personal computer.
The times shown in the following tables run from reading the list
from the hard drive until the list is sorted in memory. I displayed
the sorted list on the screen for verification, but such display was
not counted in the sorting time. Assuming that the reading-from-disk
time was equal for both algorithms, the actual sorting time for
Alphasort vs. quicksort would be even more favorable than the
following tables show. I did not attempt to separate read time from
sorting time; my source code combines the read with the first pass.
In other words, as Alphasort reads one item it also places that item
in a "bin" or "pile" according to the first letter of that item. So
the read and the first placing both occur in the first loop. It is
not necessary to combine these, of course, but that is the way I
happened to do it.
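A minimal sketch of that first loop follows. It is a reconstruction
from the description above, not the actual source: each item read
from disk is immediately linked into a pile chosen by its first
character. The file name and the array bound are illustrative
assumptions.

    program ReadAndPlace;
    { Combine the read with the first pass: "head" remembers the
      most recent item in each pile, and "next" links each item to
      the one read before it, with 0 marking the end of a pile. }
    const
      MaxItems = 200;
    var
      item: array[1..MaxItems] of string;
      next: array[1..MaxItems] of integer;
      head: array[0..255] of integer;
      f: text;
      n, c: integer;
    begin
      for c := 0 to 255 do head[c] := 0;
      assign(f, 'names.txt');              { assumed input file }
      reset(f);
      n := 0;
      while (not eof(f)) and (n < MaxItems) do begin
        n := n + 1;
        readln(f, item[n]);
        if length(item[n]) > 0 then
          c := ord(item[n][1])             { first letter picks the pile }
        else
          c := 0;                          { empty line: pile 0 }
        next[n] := head[c];                { link to the previous member }
        head[c] := n                       { this item is now most recent }
      end;
      close(f)
    end.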
Table 1. Comparison of sorting times of a list of names up to 24
characters wide. Programs were compiled in Turbo Pascal on a 386
personal computer. Times are given in seconds and hundredths of
seconds.

    length of list:   2000   1800   1600   1400   1200   1000
    Quicksort         4.94   4.40   4.17   3.46   3.24   2.69
    Alphasort         2.41   2.31   2.14   2.04   1.97   1.26
Table 2. Comparison of sorting times of a list of names 10
characters wide.

    length of list:   5000   3000   1000
    Quicksort        12.08   6.93   2.53
    Alphasort         3.51   2.86   1.20
These tables show that the width of the items affects Alphasort more
than the length of the list. The width of the items is more of a
factor when there are more duplicates in the list. Fewer duplicates
would mean that the width of the items would not be as much of a
factor. This is due to the nature of radix sorting, which examines
the items one character at a time and puts them into "bins" or "piles"
according to those characters.
I was limited in these tests by the memory limitations of my Turbo
Pascal compiler, which in turn was limited by the memory limitations
of DOS. Alphasort itself has no inherent memory limitation, and it
uses a minimal amount of memory. Its use of memory will be discussed
in the last section of this paper where it is compared to a prior
patent.
Although these tests show promise, more comprehensive tests and
benchmarks need to be designed. Considering that these tests were run
with a list that put Alphasort at a disadvantage, I expect that future
tests will either track these results closely or produce even more
favorable results.
To understand why Alphasort runs faster than quicksort, let's look at
a table of running times.
Table 3. Comparison of running times. The figures for the first two
algorithms are taken from Donald Knuth's volume Sorting and Searching
(p. 381), and the figures for Alphasort are taken from my test. To
arrive at Alphasort's running time, I used a counter in the repeat
loops.

    length of list:                    16    1000
    Distribution counting (radix)   10362   32010
    Median-of-3 quicksort             470   81486
    Alphasort                          54    7744
Knuth uses this table to show that radix sorting is not efficient for
short lists. As you can see, distribution counting (traditional
radix) loses to quicksort in the short list, but wins in the long
list. However, Alphasort wins for both lists, because it sorts from
the most significant digit, unlike traditional radix sorting which
sorts from the least significant digit.
Now that we've looked at some numbers, let's discuss the theory behind
the numbers. Let's explain why Alphasort runs faster than quicksort.
Quicksort uses partitions. It divides a list in half, compares words
in each half, and then swaps them if necessary. It further partitions
the list and makes further swaps. As I understand it, each time a word
is swapped the word moves about halfway closer to its destination.
It's like a football team starting at its own zero yard line. On the
first play they gain 50 yards, halfway to the goal line. On the next
play they make 25 yards. On the next play, 12 or 13 yards. At that
rate they should reach the goal in 4 or 5 plays.
Assuming 26 letters in the alphabet, each pass of Alphasort gets the
word 26 times closer to its destination. It's like a football team
making 96 yards on its first play. Alphasort often reaches the goal in
2 or 3 passes.
How does Alphasort accomplish this? The first pass examines the first
letter of each word. All the As are linked together in memory, all the
Bs are linked together in memory, and so on. The words are not moved,
but memory links their addresses together so that they can be accessed
for the next pass. The second pass examines the second letter, not the
second letter of every word in the list, but the second letter of only
those words beginning with A. The third pass likewise examines the
third letter of the highest group of words on the list. In most cases
a unique combination of the first two or three letters allows words to
be quickly sorted. In other cases, where the first 10 letters are the
same, more passes are needed to sort those words. Never is any
letter examined unnecessarily. Passes are kept to an absolute minimum.
We've discussed various scenarios, long lists, short lists, and lists
with duplicates. Another scenario is a list that is previously
sorted. Based on the previous paragraph, it makes no difference to
Alphasort whether or not the list was previously sorted. Quicksort,
however, would be affected by such a list.
3. COMPARISON TO RADIX EXCHANGE SORTING
Radix exchange sorting, described in Knuth's volume, Sorting and
Searching, is a combination of quicksort and radix sorting. The
sorting process works like quicksort does because the list is divided
into partitions, and records are swapped from one side of the
partition to the other. It differs from quicksort in that individual
digits are examined instead of comparing entire records.
Does this method gain any advantage? The textbook says "Thus radix
exchange sorting takes about as long as quicksort, on the average"
(p. 128).
Exchange sorting (whether it be quicksort or radix exchange sorting)
sets up a partition midway in the list and moves records from one side
to the other. Then it sets up other partitions to further divide the
list, and so on. Therefore, each record is moved several times, each
time a little closer to its final destination. Each time a record is
moved, it lands (on average) twice as close to its final destination
as it was before, until it finally hits the right spot.
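Radix exchange sorting can be sketched as follows, illustrated with
8-bit integer keys rather than records. The partitioning on one bit
at a time follows Knuth's description; the details are an
illustration only.

    program RadixExchange;
    { Partition the range on the current bit, swapping records
      across the partition, then recurse on each half with the
      next bit. }
    const
      MaxItems = 8;
    var
      a: array[1..MaxItems] of byte;
      i: integer;

    procedure Sort(lo, hi, bit: integer);
    var
      i, j: integer;
      t: byte;
    begin
      if (lo >= hi) or (bit < 0) then exit;
      i := lo;
      j := hi;
      while i <= j do begin
        { advance over keys already on the correct side }
        while (i <= j) and ((a[i] shr bit) and 1 = 0) do i := i + 1;
        while (i <= j) and ((a[j] shr bit) and 1 = 1) do j := j - 1;
        if i < j then begin
          t := a[i]; a[i] := a[j]; a[j] := t;  { swap across the partition }
          i := i + 1;
          j := j - 1
        end
      end;
      Sort(lo, j, bit - 1);    { keys with a 0 in this bit }
      Sort(i, hi, bit - 1)     { keys with a 1 in this bit }
    end;

    begin
      a[1] := 170; a[2] := 3;  a[3] := 97; a[4] := 255;
      a[5] := 12;  a[6] := 97; a[7] := 0;  a[8] := 41;
      Sort(1, MaxItems, 7);
      for i := 1 to MaxItems do write(a[i], ' ');
      writeln
    end.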
The same comments can be said here as were said about quicksort above.
Visualize again a football field. Imagine a football team starting
with the ball almost on the 0 yard line. On their first play they make
50 yards. That's half way to the goal. On the next play they make 25
yards. Again that's half way to the goal. On the next play they make
12 and one-half, then 6 and one-fourth yards, and so on, until they
finally reach the goal on the last play. That is how exchange sorting
moves the ball.
Alphasort moves the ball farther on each play. Instead of moving half
way to the goal, it moves the ball 26 times closer to the goal (in the
case of alphabetic sorting) or 10 times closer to the goal (in the
case of numeric sorting).
4. COMPARISON TO SORTING BY DISTRIBUTION
In contrast to quicksort, which uses compares, sorting by distribution
does not use compares. It is a radix method like Alphasort.
This discussion of sorting by distribution is based on Knuth's
description of the method in his volume, "Sorting and Searching."
Sorting by distribution distributes items in the list into "piles."
The piles correspond to digits, such as the A pile for all items
beginning with A, and so on. The main problem of radix sorting by
distribution is what to do with the piles. "In order to handle radix
sorting inside a computer, we must decide what to do with the piles"
(page 171). Setting aside a pile for each letter of the alphabet (M
piles) proved unsatisfactory "since each area must be large enough to
hold N items. . . . Such an excessive memory requirement originally
caused most people to reject the idea of radix sorting within a
computer, until H.H. Seward . . . pointed out that we can achieve the
same effect with only 2N record areas and M count fields. We simply
count how many elements will lie in each of the M piles, by making a
preliminary pass over the data; this tells us precisely how to
allocate memory for the piles" (pages 171-172). Therefore, the
problem of the piles is solved at the expense of an extra pass, a
"preliminary pass" as the book calls it.
How does Alphasort solve the problem of the piles? Instead of
squeezing the memory down to 2N record areas as above, Alphasort
uses only N record areas, in which the items are stored. In
addition, it uses an array of N integers. The bottom line
is that it requires less memory than the textbook method.
The textbook method reduces memory requirements at the expense of an
extra "preliminary pass." However, Alphasort does not require an
extra "preliminary pass." It doesn't need this extra pass, because
Alphasort does not count how many items go into each pile. It has
another way of keeping track of the piles, and it's not by counting.
The bottom line is that Alphasort needs fewer passes than the textbook
method, approximately half as many.
As a rule, in computer programming, you gain efficiency at the expense
of memory. In this case, however, the reverse is true. Alphasort
uses less memory, but it also accomplishes the sort with fewer passes.
The textbook suggests an alternative solution to the problem of the
piles. The alternative involves a linking mechanism. Records are
linked to one another in such a way that the program follows the links
rather than moving the records themselves. This is getting closer to
Alphasort's method; so let's take a look at the linking mechanism.
The textbook linking method uses links which point to both the top of
the pile and the bottom of the pile. Alphasort, on the other hand,
uses links which point to only one end of the pile. This is half as
many links to keep track of.
What does the textbook conclude about sorting by distribution (radix
sorting)? Page 177 reads "Therefore radix sorting is usually more
efficient than any other known method for internal sorting on such
machines, provided that N is not too small and the keys are not too
long." In other words, radix sorting is more efficient sometimes, but
not in two cases: (1) when the list is short, and (2) when the items
in the list are long.
How would Alphasort rate if Knuth had known about it when he wrote the
book? Regarding case 1, short lists, we have seen in the first
section above, that Alphasort sorts both short and long lists equally
well. Regarding case 2, long items, we also have seen in the first
section above, that Alphasort is not ordinarily hindered by long items,
assuming there is not an overabundance of duplicates in the list.
I suppose that these two cases have prevented radix sorting from more
common use. Since many applications deal with short lists or long
items, radix sorting has not often become the method of choice.
But Alphasort avoids both limitations because it sorts from the most
significant digit instead of the least significant digit.
The textbook recognizes the value of most significant digit (MSD)
sorting, if such a thing were possible, but it can't quite figure out
how to manipulate the computer's memory in order to accomplish it.
"In our analysis of radix-exchange sorting, we found that it was
unnecessary to inspect many bits of the key, when we approach the keys
from the left instead of the right. Let us therefore reconsider the
idea of a radix sort which starts at the most significant digit (MSD)
instead of the least significant digit (LSD). . . . an MSD-first radix
method suggests itself naturally. . . . This principle of `divide and
conquer' is quite appealing, and the only reason it doesn't work
especially well for sorting punched cards is that the multiplicity of
piles tends to become confusing" (page 177).
The "multiplicity of piles." That's the problem.
The textbook continues: "Perhaps the best compromise has been
suggested by M.D. MacLaren . . . who recommends an LSD-first sort . . .
but applied only to the most significant digits. This does not
completely sort the file, but it usually brings the file very nearly
into order so that very few inversions remain; therefore straight
insertion can be used to finish up" (page 177).
A "best compromise" solution that partially sorts the file.
Alphasort surmounts the compromise with a unique memory-handling
technique which we will examine in the last section of this paper
which compares Alphasort to a prior patent. Alphasort's method of
keeping track of the "piles" allows it to start at the most
significant digit, and therefore, it matters not if the list is long
or short, or if the items are long or short.
5. COMPARISON TO SORTING BY ADDRESS CALCULATION
If I handed you a deck of cards to sort, and asked you to lay them out
in numerical order on the table in front of you, I think you have a
pretty good idea of how you would do it. If your first card were an
ace, then you would put it to the far left on the table. Why not put
it in the middle of the table? Because you already know that ace is
the lowest number. So it goes to the left. When you turn over a 7,
then you would put it in the middle of the table, because you know it
is the middle number.
Knowing ahead of time the range of values, you sort more efficiently.
Most computer sorting methods, however, sort by trial and error.
Give the same card sorting job to a computer, and it would do the
equivalent of blindly placing the first card in the middle, even if it
were an ace. Then it would compare the second card to the first
before it knew where to place the second card. Compare and move,
compare and move, until finally the list ends up in the right order.
Sorting by address calculation attempts to avoid much of the trial and
error of other sorting algorithms. Such an idea was first suggested in
1956 by Isaac and Singleton (ACM Journal, vol. 3). They devised a
method similar to how you would sort a deck of cards on a table.
The algorithm first determines the range of the list by a preliminary
pass on the entire list or on a sample portion of it. Using this
knowledge about the range, the algorithm then places items one by one
in the approximate memory location where they belong. If it attempts to
put an item in a memory location that is already occupied by another
item, then it compares the two items, looks for the nearest empty
memory location, and moves adjacent items if necessary. Allowing more
memory locations than there are items in the list provides a buffer,
and that reduces the number of moves needed. The last step is to get
rid of the gaps in the extra memory locations by packing the list.
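A simplified sketch of the idea follows. It is not Isaac and
Singleton's exact procedure: integer keys stand in for records, and
the range 0..99 is assumed known in advance instead of being found
by a preliminary pass.

    program AddrCalc;
    { Sorting by address calculation: map each key to an estimated
      slot in a table larger than the list (the extra room is the
      "buffer"), resolve collisions by comparing and shifting
      neighbours, then pack out the gaps. }
    const
      N = 6;
      Slots = 12;                  { table with extra buffer room }
      Empty = -1;
      MaxKey = 100;                { assumed key range is 0..99 }
    var
      key: array[1..N] of integer;
      table: array[1..Slots] of integer;
      i, j, s, k: integer;
    begin
      key[1] := 42; key[2] := 7;  key[3] := 93;
      key[4] := 7;  key[5] := 55; key[6] := 21;
      for s := 1 to Slots do table[s] := Empty;
      for i := 1 to N do begin
        k := key[i];
        { the address calculation: map the key range onto the table }
        s := 1 + (k * (Slots - 1)) div MaxKey;
        { compare: walk right past occupants that sort before k }
        while (s <= Slots) and (table[s] <> Empty) and (table[s] <= k) do
          s := s + 1;
        { move: shift any larger occupants right into empty space }
        j := s;
        while (j <= Slots) and (table[j] <> Empty) do j := j + 1;
        if j > Slots then begin
          writeln('buffer exhausted');  { a real version must handle this }
          halt(1)
        end;
        while j > s do begin
          table[j] := table[j - 1];
          j := j - 1
        end;
        table[s] := k
      end;
      { pack the list: close the gaps left in the buffer }
      j := 0;
      for s := 1 to Slots do
        if table[s] <> Empty then begin
          j := j + 1;
          key[j] := table[s]
        end;
      for i := 1 to N do write(key[i], ' ');
      writeln
    end.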
The key to the success of the method is to reduce the number of moves.
Isaac and Singleton say, "The fewer times the items have to be moved
from one location to another, the more efficient the sorting
process." (p. 169).
How efficient is sorting by address calculation? Isaac and Singleton's
evaluation is favorable: ". . . estimates indicate that on this
computer, sorting random numbers with the tested method is more than
twice as fast as any of the conventional merge-sort methods." (p. 173).
Could sorting by address calculation be made even more efficient than
it already is? If we could eliminate three steps, then yes, it could
be made more efficient. The three steps are: (1) determining the
range of the list in advance, (2) occasional compares and moves,
and (3) closing the gaps and packing the list. On top of these three
steps, it would be desirable to eliminate the need for the extra
"buffer" memory.
Alphasort can be described as sorting by address calculation, because
it places items in pre-determined memory locations. Actually, it
doesn't place the item itself, because the items are never moved, but
a pointer to the item goes into the pre-determined memory location.
So Alphasort works on the same principle as Isaac and Singleton's
method.
How does it differ? It differs in four ways. First, Alphasort does
not determine the range of the list in advance. Before it ever sees the
list, it already knows what range it will use. This is the range of
computer-recognizable characters, which is from 0 to 255 to include
all of them. The programmer may narrow the range, depending upon the
application. For example, the programmer may decide to narrow the
range from 0 to 127. Or, if the programmer knows there are no
control codes to be dealt with in the application, then he/she can
narrow the range even further, starting with 32, for example. The
narrower the range, the faster Alphasort works. But the programmer
adjusts the range to fit the application.
What if a list has an overabundance of items starting with "A"? Isaac
and Singleton's method would need to know this in advance, so that it
can allow extra memory locations for "A" words. But Alphasort does
not need to know this in advance. It matters not whether the list has
an overabundance of "A" words or an overabundance of "Z" words.
Therefore, Alphasort does not need to make a preliminary pass over the
list like Isaac and Singleton's method does.
There's a second difference between Alphasort and Isaac and Singleton's
sorting by address calculation. Alphasort eliminates the need for
compares and moves altogether. Isaac and Singleton's method uses
occasional compares whenever items bump into each other. In that
case, items need to be moved over into adjacent empty locations.
However, Alphasort has a flexibility built into its memory management
system that avoids the rigid requirement to have empty memory
buffers available in precise locations. Even if most of the items
began with "A," none of the items would have to move over to make room
for others.
There's a third difference. Alphasort does not need to close the gaps
in the memory buffers and pack the list. The items are packed tight
from start to finish; so this final step is eliminated.
The fourth and final difference: Alphasort has no need for extra
"buffer" memory. Instead of using extra memory, Alphasort makes
flexible use of the memory it has.
When I say "flexible use of memory," does this mean that items are
moved within the memory? No, items are never moved. By flexible, I
mean that memory is written over itself. One location at a time, as
each word is being processed, a memory location receives a new value.
We'll discuss memory management more in the final section of this
paper, when we compare Alphasort to a prior patent.
6. COMPARISON TO PATENT NO. 4,809,158
My Alphasort patent application was initially denied by the patent
office on the basis of McCauley's prior patent no. 4,809,158, issued
in 1989.
His is an MSD radix method, and therefore it is similar to Alphasort
in the fact that it sorts items by examining the digits, starting with
the first (most significant) digit. But the two methods differ in
memory management.
In my appeal to the patent office, I pointed out the differences in
memory management, how Alphasort saves steps, and is therefore more
efficient. The uniqueness of Alphasort, and therefore its claim to
patentability, lies in its efficient memory management.
The patent examiner accepted the appeal and granted me patent no.
5,218,700 in 1993.
Both methods group words together according to the characters
they contain. For example, all words beginning with A are grouped
together, all words with B are grouped together, and so on.
However, Alphasort's implementation of the idea is different than
McCauley's, because of the different way of tracking groups and
subgroups of words, and because of a different linking mechanism.
McCauley's invention tracks groups of words by remembering the
first and last word in each group. His patent, column 6, lines
46-49, reads: "Each `bin' is comprised of pointers kept in two
vectors, the `first pointers vector' (`FPV'), and the `last pointers
vector' (`LPV')." The following lines go on to explain that these
two pointers point to the first and last words in the bin. Column 10,
lines 62-68 say: ". . . the bin itself just holds pointers to the
first and last records of that list (in vectors `FPV' and `LPV'). That
is, the bins do not need to hold all the pointers to the records in
each bin, for it is economically sufficient that the system be able to
locate immediately only the first and last records in each bin."
Therefore, McCauley's invention uses two pointers to keep track of the
group of words in each bin.
On the other hand, my invention uses only one pointer where
McCauley's uses two. Alphasort eliminates the "first pointers
vector" ("FPV"). This leaves only the "last pointers vector"
("LPV"), which is comparable in my invention to lettertracker.
Lettertracker is a two-dimensional array of integers. Each memory
location in lettertracker holds an integer which serves the same
purpose as McCauley's LPV, pointing to the last, or most recently
processed word, in a group of words.
The reason McCauley uses two pointers for each bin is that he needs
both in order to re-link the list after each sorting pass. And herein
lies the second difference: my invention does not re-link the list
during the sorting process.
Re-linking in McCauley's invention occurs after each sorting pass.
This is accomplished by connecting the LPV in one group to the FPV in
the next group, so that the last word in one group points to the first
word in the next group, and so on, until the entire list is re-linked
into one unified list. Column 10, line 54 through column 11, line 20
explain how the LPV and FPV values are changed in order to accomplish
this re-linking process.
Such re-linking does not occur in my invention, saving steps,
saving sorting time, and reducing to only one the number of pointer
variables needed for each group, or bin.
Alphasort does use a linking mechanism, but it links only groups of
words rather than linking the entire list. The linking mechanism in
my invention is called wordtracker. After lettertracker points to the
most recent word in a group, wordtracker tracks down the rest of the
words in that group. As groups are subdivided into smaller and
smaller groups, as is the manner in radix sorting, wordtracker also
keeps track of these smaller subgroups. But Alphasort never re-links
the entire list during intermediate stages of the sorting process as
McCauley's invention does.
Therefore, Alphasort's linking mechanism is different than McCauley's
because it performs fewer steps.
Fewer steps in the sorting process result in significantly faster
sorting times. Timed tests on my 386 IBM-compatible personal computer
corroborate this. Using a 4-character key drawn from a range of 256
random characters, as McCauley did, a list of 2000 items sorted 3.89
times faster than quicksort, and a list of 8000 items sorted 4.39
times faster than quicksort.
McCauley also compared his method to quicksort, and these numbers show
significant improvement over the relative efficiency given by McCauley.
It is difficult to determine his precise numbers, because his results
appear on a graph from which I could only estimate them. However,
his graphical representation, compared to my numbers for Alphasort,
is sufficient to show that Alphasort runs significantly faster.
If 128 random characters are used instead of 256, the sorting times
for Alphasort improve even more. In this case, a list 2000 long
sorted 4.60 times faster than quicksort, and a list 8000 long sorted
5.79 times faster than quicksort.
These faster sorting times are made possible by eliminating the extra
steps; namely by keeping track of only one pointer per bin instead of
two, and by skipping the re-linking process after each sorting pass.
How does Alphasort manage memory so that it can skip these extra
steps? Does it use extra memory in order to gain efficiency? No, it
uses a minimal amount of memory. The memory needed consists of two
large arrays and one small array. First, an array of strings stores
the list being sorted. Second, an array of integers (called
wordtracker) stores pointers to items in the list. These two large
arrays are no larger than the size of the list being sorted. No
buffers are needed, no extra room is required, because items are not
moved, swapped, packed, or consolidated.
The final array, the small one, is a two-dimensional array of
integers. One dimension corresponds to the range of possible digits
(such as A through Z; or in computer terms, characters 0 through 255
or 0 through 127). The other dimension corresponds to the possible
width of the items (such as 30 if sorting a database whose largest
field is 30 characters long). This array I call lettertracker. One
dimension of the array keeps track of which letter of the word is
being examined, whether it be the first letter, second letter, third
letter, and so on. The other dimension keeps track of which character
is being examined, whether it be A, or B, or C and so on.
Therefore, any particular location of the lettertracker array
corresponds to a certain character and to the position of that
character. Since lettertracker is a two-dimensional array, the
indexes of the array track those two things. This array roughly
corresponds to McCauley's bins which store First Pointers Vector and
Last Pointers Vector. The main difference is that lettertracker
points to only one end, instead of both ends, of a group of records.
In other words, to use McCauley's terms, the array holds Last Pointers
Vectors. It does not track First Pointers Vectors at all.
Remember, it is necessary to track only one set of pointers because
the list never needs to be re-linked and consolidated.
The value stored at any particular location in lettertracker is an
integer which acts as a pointer. The integer points to an index in
the wordtracker array. Thus, these two integer arrays work hand-in-
hand. Lettertracker (the smaller array) and wordtracker (the larger
array) cooperate in keeping track of groups and subgroups of items
during the sorting process. The way they work together keeps memory
usage down to the absolute minimum. Wordtracker, remember, is exactly
the size of the list, no larger.
The long-standing problem of MSD radix sorting, as you recall, is how
to keep track of the "piles" or "bins." These piles represent groups
of words within the list. These groups constantly change during the
sorting process. Also, the groups get smaller and more numerous as
the sorting progresses.
The "pile" problem is solved by the two integer arrays, lettertracker
and wordtracker. This is how the two work hand-in-hand to keep track
of the piles. Lettertracker remembers the last record processed, and
wordtracker recalls the rest of the group. First lettertracker points
to an index in wordtracker. Then the value at that index of
wordtracker points to another index in wordtracker. And the value at
that index in turn points to another index in wordtracker. The values
keep pointing until the value zero is found. Zero signals the end of
the group.
All of these values which point to indexes in wordtracker also
correspond to record numbers, namely the number of the record as
stored in the array of strings. For example, if the pointer is 7,
then that corresponds to word number 7, the 7th word in the array of
strings.
To illustrate, suppose that you are one of 26 detectives. Each
detective receives one clue on a piece of paper. Each clue is
different. The clue you receive directs you to another clue somewhere
in the city. The city has clues placed all over it. You don't know
how many clues you will need to find, and you don't know how many of
the clues scattered throughout the city belong to you and how many
belong to one of the other 25 detectives. You follow the clues one by
one, each clue leading to the next, until you reach the last
clue. The 26 detectives are like different letters of the alphabet;
each hunts for a different set of clues just as each letter of the
alphabet has a different set of words in its pile. The original 26
clues given to the 26 detectives are like the lettertracker array;
each memory location of lettertracker doesn't reveal much about the
entire pile, but it points to the next word in the pile, and that is
sufficient. The clues scattered throughout the city are like the
wordtracker array; the words of each pile are not contiguous, but as
one word points to another word, the entire pile can be found.
Let's make the detective illustration work backwards. Before you go
off looking for your next clue, suppose I take your clue from you and
hand you another. The clue I take from you I hide somewhere in the
city. The clue I hide will be the first one you find, because the clue
I just handed you will direct you to it. This is what happens as words
are read from disk. The most recent word pointer (like the clue in the
detective's hand) is stored into lettertracker. The word pointer that
was in lettertracker (like the clue I took from you) is stored into
wordtracker. So word pointers are stored and retrieved in opposite
fashion.
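The bookkeeping just described can be sketched as follows. This is
only a reconstruction of the first pass from the description above,
not the patented source code. It shows pointers stored in "opposite
fashion" (the new word number into lettertracker, the old
lettertracker value into wordtracker) and a pile retrieved by
following the chain until zero.

    program TrackerSketch;
    { lettertracker[pos, ch] holds the number of the most recently
      processed word in the pile for character ch at position pos;
      wordtracker chains each word to the word placed before it,
      with 0 marking the end of the pile.  Shown for the first
      character position only. }
    const
      MaxItems = 5;
      MaxWidth = 10;               { widest key this sketch allows }
    var
      item: array[1..MaxItems] of string;
      wordtracker: array[1..MaxItems] of integer;
      lettertracker: array[1..MaxWidth, 0..255] of integer;
      w, c, p: integer;
    begin
      item[1] := 'bee'; item[2] := 'ant'; item[3] := 'bat';
      item[4] := 'cow'; item[5] := 'ape';
      for p := 1 to MaxWidth do
        for c := 0 to 255 do lettertracker[p, c] := 0;
      { store pointers in "opposite fashion": the new word number
        goes into lettertracker, and the number that was there goes
        into wordtracker }
      for w := 1 to MaxItems do begin
        c := ord(item[w][1]);
        wordtracker[w] := lettertracker[1, c];
        lettertracker[1, c] := w
      end;
      { retrieve the pile for 'b' by following the chain to zero }
      w := lettertracker[1, ord('b')];
      while w <> 0 do begin
        writeln(item[w]);          { prints bat, then bee }
        w := wordtracker[w]
      end
    end.

Retrieval naturally yields the most recently processed word first,
which is why lettertracker needs to remember only one end of each
pile.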
As a pile is being retrieved, it is being subdivided into smaller
piles. As this happens, values in wordtracker change, one by one, as
each word is processed, with the result that the words are linked
together in new ways according to the new piles.
Big piles are divided into smaller and smaller piles, until only one
word remains in each pile. If you were sorting a large stack of cards
on the floor, the floor would quickly become filled with little piles
everywhere, and you would run out of space. But the two arrays
lettertracker and wordtracker work hand-in-hand to manage the piles
without additional memory space.
And in this case, efficient use of memory also translates into
efficient use of time. The result is faster sorting.
The core of the algorithm is a repeat loop that contains 6 lines of
code. To inquire about obtaining the source code and permission to
use it, contact:
Al Beechick, P.O. Box 899, Pollock Pines, CA 95726
Phone: (916) 644-2341