DISCOURSE ON ALPHASORT
PATENT NO. 5,218,700
COMPARISON WITH
OTHER SORTING ALGORITHMS
Written by Al Beechick
Copyright 1994 by Arrow Connection
P.O. Box 899
Pollock Pines CA 95726
(916) 644-2341
TABLE OF CONTENTS
1. Alphasort, the Newer and Faster MSD Radix Algorithm
2. Comparison to Quicksort
3. Comparison to Radix Exchange Sorting
4. Comparison to Sorting by Distribution
5. Comparison to Sorting by Address Calculation
6. Comparison to Patent No. 4,809,158
1. ALPHASORT, THE NEWER AND FASTER MSD RADIX ALGORITHM
Computer sorting algorithms can be divided into two classes,
those that use compares and those that don't. Radix sorting does not
use compares, and Alphasort falls into this class. This paper
compares Alphasort to other sorting algorithms that fall into both
classes. The goal is to help the reader understand why and how
Alphasort is faster.
Alphasort is so called because it sorts by placing words
according to the letters of the alphabet. This bypasses the need for
compares. In addition to letters of the alphabet, Alphasort examines
numbers or any computer-recognizable character. This method of
placing words in "bins" or "piles" according to the characters in the
word is radix sorting.
When Alphasort examines the characters, it starts at the most
significant digit (MSD). In other words, it starts at the front of
the word. This is the opposite of traditional radix sorting which
starts at the least significant digit (LSD). Old radix methods
examined each character of the word, starting at the end of the word.
The final pass, therefore, would be on the most significant digit, and
this was necessary because of memory handling problems.
The memory problem encountered by radix sorting was how to keep
track of the "bins" or "piles." According to traditional radix
thought, starting at the most significant digit would result in a
multiplicity of "piles," and therefore use up a huge amount of memory
in the computer. But Alphasort resolves this problem, finding a way
to use a minimum amount of memory, no more memory than other methods
which use compares.
Resolving the memory problem is the key to faster sorting.
Because Alphasort starts at the most significant digit, only a
fraction of the total digits need to be examined, instead of
all the digits as before. For example, if the first two passes result
in a unique placement of the word, then the rest of the letters in the
word are not examined. If "Alfred" is the only name in the list which
starts with "Al," then the characters "fred" are not examined. This
eliminates unnecessary passes, and it results in maximum efficiency.
Alphasort not only increases efficiency over traditional radix
sorting, but it also surpasses the other class of sorting methods,
namely those methods which use compares. This could not be said of
traditional radix sorting methods which sort according to the least
significant digit. The necessity to examine each and every digit
resulted in lower efficiency, and so there was no compelling need to
abandon methods which use compares in favor of a radix method. But a
radix method which sorts by the most significant digit examines only a
fraction of the digits, resulting in an efficiency which surpasses
that class of methods which use compares.
Formulas show that, for methods which use compares, sorting time
grows faster than the size of the list. In other words, adding one
item to the list adds not one compare but several. If N is the
number of items in
the list being sorted, then a method which uses compares may need
N x log(base 2) N passes. In contrast, Alphasort uses approximately
N x log(base r) N passes, where r is the range of values that
individual characters can take, such as 26 for alphabetical sorting,
or 10 for numerical sorting.
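To make the contrast concrete, here is a back-of-the-envelope check
of the two estimates above, evaluated for N = 1000 and r = 26 using
the identity log(base b) N = ln N / ln b. The program is an
illustration of the formulas only, not taken from the Alphasort
source. (The sketches throughout this paper use Pascal, the language
of the timed tests described later.)

    program PassEstimate;
    { Back-of-the-envelope check of the two pass estimates above:
      N x log(base 2) N for a method using compares versus
      N x log(base 26) N for alphabetic radix sorting. }
    var
      n: longint;
    begin
      n := 1000;
      writeln('compares, N log2 N  : ', n * ln(n) / ln(2.0) : 8 : 0);
      writeln('radix,    N log26 N : ', n * ln(n) / ln(26.0) : 8 : 0)
    end.

For N = 1000 this gives roughly 9966 passes for the compare-based
estimate against roughly 2120 for the radix estimate.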
If you were to graph sorting time against the size of the list,
Alphasort's increased sorting time would rise almost linearly. The
reason the graph would not show an exactly straight line is that
additional items may require Alphasort to examine more digits in some
items. Additional passes may be required because, as the list grows,
the likelihood increases that words will share the same first two or three
characters. On the other hand, if items added to the list have unique
opening letters, then additional sorting passes will not be required.
So sorting time depends upon the makeup of the list more than the size
of the list.
In contrast, methods which use compares require additional passes
as the list size increases. This is an inherent requirement of the
algorithm. Therefore, if you were to graph sorting time against the
size of the list, increased sorting time would rise with a markedly
upward curve.
The preferable method for long lists, therefore, is Alphasort.
But it works with equal efficiency for short lists. This is in
contrast to traditional radix sorting. Since traditional radix
sorting starts at the least significant digit, it has to examine each
digit, and this is wasted effort in a short list. However, Alphasort
starts with the most significant digit, meaning that for a short list
only a small fraction of the total digits need to be examined. The
concept of Alphasort is that only the minimum number of passes need be
applied to sort the list at hand. Thus, for a list of only 10 items,
there is a strong likelihood that the entire list can be sorted in one
pass. If, by chance, two words in the short list started with the same
letter, then an additional pass would be taken on those two words.
The worst case scenario for Alphasort is a large number of
duplicates or near duplicates in the list being sorted. This forces
additional passes in order to arrive at a unique combination of
opening characters for each item. Even in this worst case scenario,
Alphasort compares favorably with traditional radix sorting methods,
because the traditional methods examined each character regardless of
whether the items were duplicates or not.
The table in the next section compares sorting times of Alphasort
and quicksort that I tested on my personal computer. The list used
had many duplicates, because I increased list size by copying portions
of the list. Therefore, the table reflects Alphasort's performance at
its worst case scenario. Even then it compares favorably with
quicksort.
To summarize thus far, Alphasort compares favorably with both
classes of sorting algorithms. The class of algorithms which use
compares will always be at a disadvantage next to Alphasort, because
Alphasort never uses compares. The necessity of multiplying compares
hampers efficiency. Also, Alphasort compares favorably with
traditional radix methods, because it starts at the most significant
digit instead of the least significant digit. The necessity of
examining each digit hampers efficiency in traditional radix methods.
Alphasort achieves this efficiency because of the way it handles
the "bins" or the "piles." How Alphasort handles computer memory will
be examined more closely in the last section where the Alphasort
patent is compared to a prior patent.
A question I often receive is, "How does Alphasort handle
integers?" My answer, "It handles integers the same way it handles
words." The method examines the digits, which can be any alphanumeric
character. It does not matter whether the digits are letters,
numbers, or any other special character in the set of recognizable
computer characters.
When sorting integers, however, the programmer will want to make
one small adjustment. Make the first pass on the length of the
integer, and then make the second pass on the first digit of the
integer. In other words, when sorting 10, 100, and 1000, the
respective lengths are 2, 3 and 4. Thus the first pass places shorter
integers before longer ones, so that 10 ends up in a different "pile"
than 100. The second pass, then, distinguishes 10 from 20, and so on
up to 99. You can think of it as sorting on the zero byte first and
the first byte second.
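The adjustment can be pictured as prefixing each integer with its
length. The following sketch shows that idea as a key transform; it
is an illustration of the ordering only, not the Alphasort code, and
a plain insertion sort stands in for the radix passes. It assumes
non-negative integers of at most 9 digits with no leading zeros.

    program LengthFirst;
    { Prefix each number with its length (the "zero byte") so that
      plain string order equals numeric order. }
    const
      N = 5;
    var
      a: array[1..N] of string;
      i, j: integer;
      t: string;

    function key(s: string): string;
    begin
      key := chr(ord('0') + length(s)) + s   { length-first key }
    end;

    begin
      a[1] := '100'; a[2] := '10'; a[3] := '99';
      a[4] := '1000'; a[5] := '20';
      { simple insertion sort on the length-prefixed key }
      for i := 2 to N do begin
        t := a[i];
        j := i - 1;
        while (j >= 1) and (key(a[j]) > key(t)) do begin
          a[j + 1] := a[j];
          j := j - 1
        end;
        a[j + 1] := t
      end;
      for i := 1 to N do write(a[i], ' ');   { 10 20 99 100 1000 }
      writeln
    end.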
Now that you have an overview of Alphasort, let's get into more
detail as we compare Alphasort to other methods of sorting.
2. COMPARISON TO QUICKSORT
My tests show that Alphasort runs at least twice as fast as quicksort,
even faster for certain kinds of lists. The quicksort I used came
with Borland's Turbo Pascal. The list I used for testing purposes was
a list of people's names. The original list had 175 names. I created
longer lists by copying the original list several times. Therefore,
the longer lists have many exact duplicates, putting Alphasort at its
worst case scenario. I ran these tests on my 386 personal computer.
The times shown in the following tables run from reading the list
from the hard drive until the list is sorted in memory. I displayed
the sorted list on the screen for verification, but such display was
not counted in the sorting time. Assuming that the reading-from-disk
time was equal for both algorithms, the actual sorting time for
Alphasort vs. quicksort would be even more favorable than the
following tables show. I did not attempt to separate read time from
sorting time; my source code combines the read with the first pass.
In other words, as Alphasort reads one item it also places that item
in a "bin" or "pile" according to the first letter of that item. So
the read and the first placing both occur in the first loop. It is
not necessary to combine these, of course, but that is the way I
happened to do it.
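A minimal sketch of that first loop follows. It is a reconstruction
from the description above, not the actual source: each item read
from disk is immediately linked into a pile chosen by its first
character. The file name and the array bound are illustrative
assumptions.

    program ReadAndPlace;
    { Combine the read with the first pass: "head" remembers the
      most recent item in each pile, and "next" links each item to
      the one read before it, with 0 marking the end of a pile. }
    const
      MaxItems = 200;
    var
      item: array[1..MaxItems] of string;
      next: array[1..MaxItems] of integer;
      head: array[0..255] of integer;
      f: text;
      n, c: integer;
    begin
      for c := 0 to 255 do head[c] := 0;
      assign(f, 'names.txt');              { assumed input file }
      reset(f);
      n := 0;
      while (not eof(f)) and (n < MaxItems) do begin
        n := n + 1;
        readln(f, item[n]);
        if length(item[n]) > 0 then
          c := ord(item[n][1])             { first letter picks the pile }
        else
          c := 0;                          { empty line: pile 0 }
        next[n] := head[c];                { link to the previous member }
        head[c] := n                       { this item is now most recent }
      end;
      close(f)
    end.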
Table 1. Comparison of sorting times of a list of names up to 24
characters wide. Programs were compiled in Turbo Pascal on a 386
personal computer. Times are given in seconds and hundredths of
seconds.

    length of list:   2000   1800   1600   1400   1200   1000
    Quicksort         4.94   4.40   4.17   3.46   3.24   2.69
    Alphasort         2.41   2.31   2.14   2.04   1.97   1.26
Table 2. Comparison of sorting times of a list of names 10
characters wide.

    length of list:   5000   3000   1000
    Quicksort        12.08   6.93   2.53
    Alphasort         3.51   2.86   1.20
These tables show that the width of the items affects Alphasort more
than the length of the list. The width of the items is more of a
factor when there are more duplicates in the list. Fewer duplicates
would mean that the width of the items would not be as much of a
factor. This is due to the nature of radix sorting, which examines
the items one character at a time and puts them into "bins" or "piles"
according to those characters.
I was limited in these tests by the memory limitations of my Turbo
Pascal compiler, which in turn was limited by the memory limitations
of DOS. Alphasort itself has no inherent memory limitation, and it
uses a minimal amount of memory. Its use of memory will be discussed
in the last section of this paper where it is compared to a prior
patent.
Although these tests show promise, more comprehensive tests and
benchmarks need to be designed. Considering that these tests were run
with a list that put Alphasort at a disadvantage, I expect that future
tests will either track these results closely or produce even more
favorable results.
To understand why Alphasort runs faster than quicksort, let's look at
a table of running times.
Table 3. Comparison of running times. The figures for the first two
algorithms are taken from Donald Knuth's volume Sorting and Searching
(p. 381), and the figures for Alphasort are taken from my test. To
arrive at Alphasort's running time, I used a counter in the repeat
loops.

    length of list:                    16    1000
    Distribution counting (radix)   10362   32010
    Median-of-3 quicksort             470   81486
    Alphasort                          54    7744
Knuth uses this table to show that radix sorting is not efficient for
short lists. As you can see, distribution counting (traditional
radix) loses to quicksort in the short list, but wins in the long
list. However, Alphasort wins for both lists, because it sorts from
the most significant digit, unlike traditional radix sorting which
sorts from the least significant digit.
Now that we've looked at some numbers, let's discuss the theory behind
the numbers. Let's explain why Alphasort runs faster than quicksort.
Quicksort uses partitions. It divides a list in half, compares words
in each half, and then swaps them if necessary. It further partitions
the list and makes further swaps. As I understand it, each time a word
is swapped the word moves about halfway closer to its destination.
It's like a football team starting at its own zero yard line. On the
first play they gain 50 yards, halfway to the goal line. On the next
play they make 25 yards. On the next play, 12 or 13 yards. At that
rate they should reach the goal in 4 or 5 plays.
Assuming 26 letters in the alphabet, each pass of Alphasort gets the
word 26 times closer to its destination. It's like a football team
making 96 yards on its first play. Alphasort often reaches the goal in
2 or 3 passes.
How does Alphasort accomplish this? The first pass examines the first
letter of each word. All the As are linked together in memory, all the
Bs are linked together in memory, and so on. The words are not moved,
but memory links their addresses together so that they can be accessed
for the next pass. The second pass examines the second letter, not the
second letter of every word in the list, but the second letter of only
those words beginning with A. The third pass likewise examines the
third letter of the highest group of words on the list. In most cases
a unique combination of the first two or three letters allows words to
be quickly sorted. In other cases, where the first 10 letters are the
same, more passes are needed to sort those words. Never is any
letter examined unnecessarily. Passes are kept to an absolute minimum.
We've discussed various scenarios, long lists, short lists, and lists
with duplicates. Another scenario is a list that is previously
sorted. Based on the previous paragraph, it makes no difference to
Alphasort whether or not the list was previously sorted. Quicksort,
however, would be affected by such a list.
3. COMPARISON TO RADIX EXCHANGE SORTING
Radix exchange sorting, described in Knuth's volume, Sorting and
Searching, is a combination of quicksort and radix sorting. The
sorting process works like quicksort does because the list is divided
into partitions, and records are swapped from one side of the
partition to the other. It differs from quicksort in that individual
digits are examined instead of comparing entire records.
Does this method gain any advantage? The textbook says "Thus radix
exchange sorting takes about as long as quicksort, on the average"
(p. 128).
Exchange sorting (whether it be quicksort or radix exchange sorting)
sets up a partition midway in the list and moves records from one side
to the other. Then it sets up other partitions to further divide the
list, and so on. Therefore, each record is moved several times, each
time a little closer to its final destination. Each time a record is
moved, it lands (on average) twice as close to its final destination
as it was before, until it finally hits the right spot.
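Radix exchange sorting can be sketched as follows, illustrated with
8-bit integer keys rather than records. The partitioning on one bit
at a time follows Knuth's description; the details are an
illustration only.

    program RadixExchange;
    { Partition the range on the current bit, swapping records
      across the partition, then recurse on each half with the
      next bit. }
    const
      MaxItems = 8;
    var
      a: array[1..MaxItems] of byte;
      i: integer;

    procedure Sort(lo, hi, bit: integer);
    var
      i, j: integer;
      t: byte;
    begin
      if (lo >= hi) or (bit < 0) then exit;
      i := lo;
      j := hi;
      while i <= j do begin
        { advance over keys already on the correct side }
        while (i <= j) and ((a[i] shr bit) and 1 = 0) do i := i + 1;
        while (i <= j) and ((a[j] shr bit) and 1 = 1) do j := j - 1;
        if i < j then begin
          t := a[i]; a[i] := a[j]; a[j] := t;  { swap across the partition }
          i := i + 1;
          j := j - 1
        end
      end;
      Sort(lo, j, bit - 1);    { keys with a 0 in this bit }
      Sort(i, hi, bit - 1)     { keys with a 1 in this bit }
    end;

    begin
      a[1] := 170; a[2] := 3;  a[3] := 97; a[4] := 255;
      a[5] := 12;  a[6] := 97; a[7] := 0;  a[8] := 41;
      Sort(1, MaxItems, 7);
      for i := 1 to MaxItems do write(a[i], ' ');
      writeln
    end.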
The same comments can be said here as were said about quicksort above.
Visualize again a football field. Imagine a football team starting
with the ball almost on the 0 yard line. On their first play they make
50 yards. That's half way to the goal. On the next play they make 25
yards. Again that's half way to the goal. On the next play they make
12 and one-half, then 6 and one-fourth yards, and so on, until they
finally reach the goal on the last play. That is how exchange sorting
moves the ball.
Alphasort moves the ball farther on each play. Instead of moving half
way to the goal, it moves the ball 26 times closer to the goal (in the
case of alphabetic sorting) or 10 times closer to the goal (in the
case of numeric sorting).
4. COMPARISON TO SORTING BY DISTRIBUTION
In contrast to quicksort, which uses compares, sorting by distribution
does not use compares. It is a radix method like Alphasort.
This discussion of sorting by distribution is based on Knuth's
description of the method in his volume, "Sorting and Searching."
Sorting by distribution distributes items in the list into "piles."
The piles correspond to digits, such as the A pile for all items
beginning with A, and so on. The main problem of radix sorting by
distribution is what to do with the piles. "In order to handle radix
sorting inside a computer, we must decide what to do with the piles"
(page 171). Setting aside a pile for each letter of the alphabet (M
piles) proved unsatisfactory "since each area must be large enough to
hold N items. . . . Such an excessive memory requirement originally
caused most people to reject the idea of radix sorting within a
computer, until H.H. Seward . . . pointed out that we can achieve the
same effect with only 2N record areas and M count fields. We simply
count how many elements will lie in each of the M piles, by making a
preliminary pass over the data; this tells us precisely how to
allocate memory for the piles" (pages 171-172). Therefore, the
problem of the piles is solved at the expense of an extra pass, a
"preliminary pass" as the book calls it.
How does Alphasort solve the problem of the piles? Instead of
squeezing the memory down to 2N record areas as above, Alphasort
uses only N record areas, in which the items are stored. In
addition, it uses an array of N integers. The bottom line
is that it requires less memory than the textbook method.
The textbook method reduces memory requirements at the expense of an
extra "preliminary pass." However, Alphasort does not require an
extra "preliminary pass." It doesn't need this extra pass, because
Alphasort does not count how many items go into each pile. It has
another way of keeping track of the piles, and it's not by counting.
The bottom line is that Alphasort needs fewer passes than the textbook
method, approximately half as many.
As a rule, in computer programming, you gain efficiency at the expense
of memory. In this case, however, the reverse is true. Alphasort
uses less memory, but it also accomplishes the sort with fewer passes.
The textbook suggests an alternative solution to the problem of the
piles. The alternative involves a linking mechanism. Records are
linked to one another in such a way that the program follows the links
rather than moving the records themselves. This is getting closer to
Alphasort's method; so let's take a look at the linking mechanism.
The textbook linking method uses links which point to both the top of
the pile and the bottom of the pile. Alphasort, on the other hand,
uses links which point to only one end of the pile. This is half as
many links to keep track of.
What does the textbook conclude about sorting by distribution (radix
sorting)? Page 177 reads "Therefore radix sorting is usually more
efficient than any other known method for internal sorting on such
machines, provided that N is not too small and the keys are not too
long." In other words, radix sorting is more efficient sometimes, but
not in two cases: (1) when the list is short, and (2) when the items
in the list are long.
How would Alphasort rate if Knuth had known about it when he wrote the
book? Regarding case 1, short lists, we have seen in the first
section above, that Alphasort sorts both short and long lists equally
well. Regarding case 2, long items, we also have seen in the first
section above, that Alphasort is not ordinarily hindered by long items,
assuming there is not an overabundance of duplicates in the list.
I suppose that these two cases have prevented radix sorting from more
common use. Since many applications deal with short lists or long
items, radix sorting has not often become the method of choice.
But Alphasort avoids both limitations because it sorts from the most
significant digit instead of the least significant digit.
The textbook recognizes the value of most significant digit (MSD)
sorting, if such a thing were possible, but it can't quite figure out
how to manipulate the computer's memory in order to accomplish it.
"In our analysis of radix-exchange sorting, we found that it was
unnecessary to inspect many bits of the key, when we approach the keys
from the left instead of the right. Let us therefore reconsider the
idea of a radix sort which starts at the most significant digit (MSD)
instead of the least significant digit (LSD). . . . an MSD-first radix
method suggests itself naturally. . . . This principle of `divide and
conquer' is quite appealing, and the only reason it doesn't work
especially well for sorting punched cards is that the multiplicity of
piles tends to become confusing" (page 177).
The "multiplicity of piles." That's the problem.
The textbook continues: "Perhaps the best compromise has been
suggested by M.D. MacLaren . . . who recommends an LSD-first sort . . .
but applied only to the most significant digits. This does not
completely sort the file, but it usually brings the file very nearly
into order so that very few inversions remain; therefore straight
insertion can be used to finish up" (page 177).
A "best compromise" solution that partially sorts the file.
Alphasort surmounts the compromise with a unique memory-handling
technique which we will examine in the last section of this paper
which compares Alphasort to a prior patent. Alphasort's method of
keeping track of the "piles" allows it to start at the most
significant digit, and therefore, it matters not if the list is long
or short, or if the items are long or short.
5. COMPARISON TO SORTING BY ADDRESS CALCULATION
If I handed you a deck of cards to sort, and asked you to lay them out
in numerical order on the table in front of you, I think you have a
pretty good idea of how you would do it. If your first card were an
ace, then you would put it to the far left on the table. Why not put
it in the middle of the table? Because you already know that ace is
the lowest number. So it goes to the left. When you turn over a 7,
then you would put it in the middle of the table, because you know it
is the middle number.
Knowing ahead of time the range of values, you sort more efficiently.
Most computer sorting methods, however, sort by trial and error.
Give the same card sorting job to a computer, and it would do the
equivalent of blindly placing the first card in the middle, even if it
were an ace. Then it would compare the second card to the first
before it knew where to place the second card. Compare and move,
compare and move, until finally the list ends up in the right order.
Sorting by address calculation attempts to avoid much of the trial and
error of other sorting algorithms. Such an idea was first suggested in
1956 by Isaac and Singleton (ACM Journal, vol. 3). They devised a
method similar to how you would sort a deck of cards on a table.
The algorithm first determines the range of the list by a preliminary
pass on the entire list or on a sample portion of it. Using this
knowledge about the range, the algorithm then places items one by one
in the approximate memory location where they belong. If it attempts to
put an item in a memory location that is already occupied by another
item, then it compares the two items, looks for the nearest empty
memory location, and moves adjacent items if necessary. Allowing more
memory locations than there are items in the list provides a buffer,
and that reduces the number of moves needed. The last step is to get
rid of the gaps in the extra memory locations by packing the list.
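A simplified sketch of the idea follows. It is not Isaac and
Singleton's exact procedure: integer keys stand in for records, and
the range 0..99 is assumed known in advance instead of being found
by a preliminary pass.

    program AddrCalc;
    { Sorting by address calculation: map each key to an estimated
      slot in a table larger than the list (the extra room is the
      "buffer"), resolve collisions by comparing and shifting
      neighbours, then pack out the gaps. }
    const
      N = 6;
      Slots = 12;                  { table with extra buffer room }
      Empty = -1;
      MaxKey = 100;                { assumed key range is 0..99 }
    var
      key: array[1..N] of integer;
      table: array[1..Slots] of integer;
      i, j, s, k: integer;
    begin
      key[1] := 42; key[2] := 7;  key[3] := 93;
      key[4] := 7;  key[5] := 55; key[6] := 21;
      for s := 1 to Slots do table[s] := Empty;
      for i := 1 to N do begin
        k := key[i];
        { the address calculation: map the key range onto the table }
        s := 1 + (k * (Slots - 1)) div MaxKey;
        { compare: walk right past occupants that sort before k }
        while (s <= Slots) and (table[s] <> Empty) and (table[s] <= k) do
          s := s + 1;
        { move: shift any larger occupants right into empty space }
        j := s;
        while (j <= Slots) and (table[j] <> Empty) do j := j + 1;
        if j > Slots then begin
          writeln('buffer exhausted');  { a real version must handle this }
          halt(1)
        end;
        while j > s do begin
          table[j] := table[j - 1];
          j := j - 1
        end;
        table[s] := k
      end;
      { pack the list: close the gaps left in the buffer }
      j := 0;
      for s := 1 to Slots do
        if table[s] <> Empty then begin
          j := j + 1;
          key[j] := table[s]
        end;
      for i := 1 to N do write(key[i], ' ');
      writeln
    end.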
The key to the success of the method is to reduce the number of moves.
Isaac and Singleton say, "The fewer times the items have to be moved
from one location to another, the more efficient the sorting
process." (p. 169).
How efficient is sorting by address calculation? Isaac and Singleton's
evaluation is favorable: ". . . estimates indicate that on this
computer, sorting random numbers with the tested method is more than
twice as fast as any of the conventional merge-sort methods." (p. 173).
Could sorting by address calculation be made even more efficient than
it already is? If we could eliminate three steps, then yes, it could
be made more efficient. The three steps are: (1) determining the
range of the list in advance, (2) occasional compares and moves,
and (3) closing the gaps and packing the list. On top of these three
steps, it would be desirable to eliminate the need for the extra
"buffer" memory.
Alphasort can be described as sorting by address calculation, because
it places items in pre-determined memory locations. Actually, it
doesn't place the item itself, because the items are never moved, but
a pointer to the item goes into the pre-determined memory location.
So Alphasort works on the same principle as Isaac and Singleton's
method.
How does it differ? It differs in four ways. First, Alphasort does
not determine the range of the list in advance. Before it ever sees the
list, it already knows what range it will use. This is the range of
computer-recognizable characters, which is from 0 to 255 to include
all of them. The programmer may narrow the range, depending upon the
application. For example, the programmer may decide to narrow the
range from 0 to 127. Or, if the programmer knows there are no
control codes to be dealt with in the application, then he/she can
narrow the range even further, starting with 32, for example. The
narrower the range, the faster Alphasort works. But the programmer
adjusts the range to fit the application.
What if a list has an overabundance of items starting with "A"? Isaac
and Singleton's method would need to know this in advance, so that it
can allow extra memory locations for "A" words. But Alphasort does
not need to know this in advance. It matters not whether the list has
an overabundance of "A" words or an overabundance of "Z" words.
Therefore, Alphasort does not need to make a preliminary pass over the
list like Isaac and Singleton's method does.
There's a second difference between Alphasort and Isaac and Singleton's
sorting by address calculation. Alphasort eliminates the need for
compares and moves altogether. Isaac and Singleton's method uses
occasional compares whenever items bump into each other. In that
case, items need to be moved over into adjacent empty locations.
However, Alphasort has a flexibility built into its memory management
system that avoids the rigid requirement to have empty memory
buffers available in precise locations. Even if most of the items
began with "A," none of the items would have to move over to make room
for others.
There's a third difference. Alphasort does not need to close the gaps
in the memory buffers and pack the list. The items are packed tight
from start to finish; so this final step is eliminated.
The fourth and final difference: Alphasort has no need for extra
"buffer" memory. Instead of using extra memory, Alphasort makes
flexible use of the memory it has.
When I say "flexible use of memory," does this mean that items are
moved within the memory? No, items are never moved. By flexible, I
mean that memory is written over itself. One location at a time, as
each word is being processed, a memory location receives a new value.
We'll discuss memory management more in the final section of this
paper, when we compare Alphasort to a prior patent.
6. COMPARISON TO PATENT NO. 4,809,158
My Alphasort patent application was initially denied by the patent
office on the basis of McCauley's prior patent no. 4,809,158, issued
in 1989.
His is an MSD radix method, and therefore it is similar to Alphasort
in the fact that it sorts items by examining the digits, starting with
the first (most significant) digit. But the two methods differ in
memory management.
In my appeal to the patent office, I pointed out the differences in
memory management, how Alphasort saves steps, and is therefore more
efficient. The uniqueness of Alphasort, and therefore its claim to
patentability, lies in its efficient memory management.
The patent examiner accepted the appeal and granted me patent no.
5,218,700 in 1993.
Both methods group words together according to the characters
they contain. For example, all words beginning with A are grouped
together, all words with B are grouped together, and so on.
However, Alphasort's implementation of the idea is different than
McCauley's, because of the different way of tracking groups and
subgroups of words, and because of a different linking mechanism.
McCauley's invention tracks groups of words by remembering the
first and last word in each group. His patent, column 6, lines
46-49, reads: "Each `bin' is comprised of pointers kept in two
vectors, the `first pointers vector' (`FPV'), and the `last pointers
vector' (`LPV')." The following lines go on to explain that these
two pointers point to the first and last words in the bin. Column 10,
lines 62-68 say: ". . . the bin itself just holds pointers to the
first and last records of that list (in vectors `FPV' and `LPV'). That
is, the bins do not need to hold all the pointers to the records in
each bin, for it is economically sufficient that the system be able to
locate immediately only the first and last records in each bin."
Therefore, McCauley's invention uses two pointers to keep track of the
group of words in each bin.
On the other hand, my invention uses only one pointer where
McCauley's uses two. Alphasort eliminates the "first pointers
vector" ("FPV"). This leaves only the "last pointers vector"
("LPV"), which is comparable in my invention to lettertracker.
Lettertracker is a two-dimensional array of integers. Each memory
location in lettertracker holds an integer which serves the same
purpose as McCauley's LPV, pointing to the last, or most recently
processed word, in a group of words.
The reason McCauley uses two pointers for each bin is that he needs
both in order to re-link the list after each sorting pass. And herein
lies the second difference: my invention does not re-link the list
during the sorting process.
Re-linking in McCauley's invention occurs after each sorting pass.
This is accomplished by connecting the LPV in one group to the FPV in
the next group, so that the last word in one group points to the first
word in the next group, and so on, until the entire list is re-linked
into one unified list. Column 10, line 54 through column 11, line 20
explain how the LPV and FPV values are changed in order to accomplish
this re-linking process.
Such re-linking does not occur in my invention, saving steps,
saving sorting time, and reducing to only one the number of pointer
variables needed for each group, or bin.
Alphasort does use a linking mechanism, but it links only groups of
words rather than linking the entire list. The linking mechanism in
my invention is called wordtracker. After lettertracker points to the
most recent word in a group, wordtracker tracks down the rest of the
words in that group. As groups are subdivided into smaller and
smaller groups, as is the manner in radix sorting, wordtracker also
keeps track of these smaller subgroups. But Alphasort never re-links
the entire list during intermediate stages of the sorting process as
McCauley's invention does.
Therefore, Alphasort's linking mechanism is different than McCauley's
because it performs fewer steps.
Fewer steps in the sorting process result in significantly faster
sorting times. Timed tests on my 386 IBM-compatible personal computer
corroborate this. Using a 4-character key drawn from a range of 256
random characters, as McCauley did, a list of 2000 items sorted 3.89
times faster than quicksort, and a list of 8000 items sorted 4.39
times faster than quicksort.
McCauley also compared his method to quicksort, and these numbers show
significant improvement over the relative efficiency given by McCauley.
It is difficult to determine his precise numbers, because his results
appear on a graph from which I could only estimate them. However,
his graphical representation, compared to my numbers for Alphasort,
is sufficient to show that Alphasort runs significantly faster.
If 128 random characters are used instead of 256, the sorting times
for Alphasort improve even more. In this case, a list 2000 long
sorted 4.60 times faster than quicksort, and a list 8000 long sorted
5.79 times faster than quicksort.
These faster sorting times are made possible by eliminating the extra
steps; namely by keeping track of only one pointer per bin instead of
two, and by skipping the re-linking process after each sorting pass.
How does Alphasort manage memory so that it can skip these extra
steps? Does it use extra memory in order to gain efficiency? No, it
uses a minimal amount of memory. The memory needed consists of two
large arrays and one small array. First, an array of strings stores
the list being sorted. Second, an array of integers (called
wordtracker) stores pointers to items in the list. These two large
arrays are no larger than the size of the list being sorted. No
buffers are needed, no extra room is required, because items are not
moved, swapped, packed, or consolidated.
The final array, the small one, is a two-dimensional array of
integers. One dimension corresponds to the range of possible digits
(such as A through Z; or in computer terms, characters 0 through 255
or 0 through 127). The other dimension corresponds to the possible
width of the items (such as 30 if sorting a database whose largest
field is 30 characters long). This array I call lettertracker. One
dimension of the array keeps track of which letter of the word is
being examined, whether it be the first letter, second letter, third
letter, and so on. The other dimension keeps track of which character
is being examined, whether it be A, or B, or C and so on.
Therefore, any particular location of the lettertracker array
corresponds to a certain character and to the position of that
character. Since lettertracker is a two-dimensional array, the
indexes of the array track those two things. This array roughly
corresponds to McCauley's bins which store First Pointers Vector and
Last Pointers Vector. The main difference is that lettertracker
points to only one end, instead of both ends, of a group of records.
In other words, to use McCauley's terms, the array holds Last Pointers
Vectors. It does not track First Pointers Vectors at all.
Remember, it is necessary to track only one set of pointers because
the list never needs to be re-linked and consolidated.
The value stored at any particular location in lettertracker is an
integer which acts as a pointer. The integer points to an index in
the wordtracker array. Thus, these two integer arrays work hand-in-
hand. Lettertracker (the smaller array) and wordtracker (the larger
array) cooperate in keeping track of groups and subgroups of items
during the sorting process. The way they work together keeps memory
usage down to the absolute minimum. Wordtracker, remember, is exactly
the size of the list, no larger.
The long-standing problem of MSD radix sorting, as you recall, is how
to keep track of the "piles" or "bins." These piles represent groups
of words within the list. These groups constantly change during the
sorting process. Also, the groups get smaller and more numerous as
the sorting progresses.
The "pile" problem is solved by the two integer arrays, lettertracker
and wordtracker. This is how the two work hand-in-hand to keep track
of the piles. Lettertracker remembers the last record processed, and
wordtracker recalls the rest of the group. First lettertracker points
to an index in wordtracker. Then the value at that index of
wordtracker points to another index in wordtracker. And the value at
that index in turn points to another index in wordtracker. The values
keep pointing until the value zero is found. Zero signals the end of
the group.
All of these values which point to indexes in wordtracker also
correspond to record numbers, namely the number of the record as
stored in the array of strings. For example, if the pointer is 7,
then that corresponds to word number 7, the 7th word in the array of
strings.
To illustrate, suppose that you are one of 26 detectives. Each
detective receives one clue on a piece of paper. Each clue is
different. The clue you receive directs you to another clue somewhere
in the city. The city has clues placed all over it. You don't know
how many clues you will need to find, and you don't know how many of
the clues scattered throughout the city belong to you and how many
belong to one of the other 25 detectives. You follow the clues one by
one, each clue leading to the next, until you reach the last
clue. The 26 detectives are like different letters of the alphabet;
each hunts for a different set of clues just as each letter of the
alphabet has a different set of words in its pile. The original 26
clues given to the 26 detectives are like the lettertracker array;
each memory location of lettertracker doesn't reveal much about the
entire pile, but it points to the next word in the pile, and that is
sufficient. The clues scattered throughout the city are like the
wordtracker array; the words of each pile are not contiguous, but as
one word points to another word, the entire pile can be found.
Let's make the detective illustration work backwards. Before you go
off looking for your next clue, suppose I take your clue from you and
hand you another. The clue I take from you I hide somewhere in the
city. The clue I hide will be the first one you find, because the clue
I just handed you will direct you to it. This is what happens as words
are read from disk. The most recent word pointer (like the clue in the
detective's hand) is stored into lettertracker. The word pointer that
was in lettertracker (like the clue I took from you) is stored into
wordtracker. So word pointers are stored and retrieved in opposite
fashion.
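The bookkeeping just described can be sketched as follows. This is
only a reconstruction of the first pass from the description above,
not the patented source code. It shows pointers stored in "opposite
fashion" (the new word number into lettertracker, the old
lettertracker value into wordtracker) and a pile retrieved by
following the chain until zero.

    program TrackerSketch;
    { lettertracker[pos, ch] holds the number of the most recently
      processed word in the pile for character ch at position pos;
      wordtracker chains each word to the word placed before it,
      with 0 marking the end of the pile.  Shown for the first
      character position only. }
    const
      MaxItems = 5;
      MaxWidth = 10;               { widest key this sketch allows }
    var
      item: array[1..MaxItems] of string;
      wordtracker: array[1..MaxItems] of integer;
      lettertracker: array[1..MaxWidth, 0..255] of integer;
      w, c, p: integer;
    begin
      item[1] := 'bee'; item[2] := 'ant'; item[3] := 'bat';
      item[4] := 'cow'; item[5] := 'ape';
      for p := 1 to MaxWidth do
        for c := 0 to 255 do lettertracker[p, c] := 0;
      { store pointers in "opposite fashion": the new word number
        goes into lettertracker, and the number that was there goes
        into wordtracker }
      for w := 1 to MaxItems do begin
        c := ord(item[w][1]);
        wordtracker[w] := lettertracker[1, c];
        lettertracker[1, c] := w
      end;
      { retrieve the pile for 'b' by following the chain to zero }
      w := lettertracker[1, ord('b')];
      while w <> 0 do begin
        writeln(item[w]);          { prints bat, then bee }
        w := wordtracker[w]
      end
    end.

Retrieval naturally yields the most recently processed word first,
which is why lettertracker needs to remember only one end of each
pile.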
As a pile is being retrieved, it is being subdivided into smaller
piles. As this happens, values in wordtracker change, one by one, as
each word is processed, with the result that the words are linked
together in new ways according to the new piles.
Big piles are divided into smaller and smaller piles, until only one
word remains in each pile. If you were sorting a large stack of cards
on the floor, the floor would quickly become filled with little piles
everywhere, and you would run out of space. But the two arrays
lettertracker and wordtracker work hand-in-hand to manage the piles
without additional memory space.
And in this case, efficient use of memory also translates into
efficient use of time. The result is faster sorting.
The core of the algorithm is a repeat loop that contains 6 lines of
code. To inquire about obtaining the source code and permission to
use it, contact:
Al Beechick, P.O. Box 899, Pollock Pines, CA 95726
Phone: (916) 644-2341