home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Monster Media 1994 #1
/
monster.zip
/
monster
/
PROG_C
/
SORTSRCH.ZIP
/
SORT_SR.TXT
Wrap
Text File
|
1994-01-19
|
32KB
|
809 lines
Sorting and Searching
================================
In the world of computer science perhaps no other tasks are
more fundamental or as extensively analyzed as those of sorting
and of searching. These routines are used in virtually all
database programs, as well as in compilers, interpreters, and
operating systems. This lesson introduces the basics of sorting
and searching information. Because sorting data generally makes
searching the data easier and faster, sorting is discussed first.
Sorting is the process of arranging a set of similar informa-
tion into an increasing or decreasing order. Specifically, given
a sorted list a of n elements,
a[1] <= a[2] <= ... <= a[n]
Though ANSI C supplies the standard qsort() function as part
of the standard library, the study and understanding of sorting
is important for three main reasons. First, a generalized
function like qsort() cannot be applied to all situations.
Second, because qsort() is parameterized to operate on a wide
variety of data, it will run slower than a similar sort that
operates on only one type of data. (The generalization process
inherently increases run time because of the extra processing
time needed to handle various data types.) Finally, the Quicksort
algorithm (used by qsort()), although very good for the general
case, may not be the best type of sort for specialized situations.
Generally, when information (such as a mailing list) is
sorted, only a portion of that information is used as the sort
key. This key is used in comparisons, but when an exchange is
made the entire data structure is swapped. In a mailing list, for
example, the ZIP code field might be used as the key, but the
entire address is sorted. For the sake of simplicity while
developing the various sorting methods, you will be sorting
character arrays. Later, you will learn how to adapt any of these
methods to any type of data structure.
The three general methods that can be used to sort arrays
are:
By exchange
By selection
By insertion
To understand these three methods, imagine a deck of cards.
To sort the cards using exchange, you would spread the cards,
face up, on a table and then proceed to exchange out-of-order
cards until the deck is ordered.
To sort by selection, you would spread the cards on the
table, select the lowest-value card, take it out of the deck, and
hold it in your hand. Then you would select from the remaining
cards on the table the lowest card and place it behind the one
already in your hand. This process would continue until all the
cards were in your hand. Because you always select the lowest
card from those remaining on the table, the cards in your hand
would be sorted when the process was complete.
To sort the cards using insertion, you would hold the cards
in your hand and would take one at a time. As you took cards from
the deck, you would place them into a new deck on the table,
always inserting them in the correct position. The deck would be
sorted when you had no cards in your hand.
Many different algorithms exist for each of the three sorting
methods. Each algorithm has its merits, but the general criteria
for judging a sorting algorithm are based on the following
questions:
How fast can it sort information in an average case?
How fast is its best and worst case?
Does it exhibit natural or unnatural behavior?
Does it rearrange elements with equal keys?
How fast a particular algorithm sorts is of great concern.
The speed with which an array can be sorted is directly related
to the number of comparisons and the number of exchanges
(exchanges take more time). A comparison occurs when one array
element is compared to another. An exchange happens when two
elements are swapped in the array. Later in this chapter you will
see that some sorts require an exponential amount of time per
element to sort and some require logarithmic time.
The best- and worst-case run times are important if you
expect to encounter one of these situations frequently. Often a
sort will have a good average case, but a terrible worst case.
A sort is said to exhibit natural behavior if it works least
when the list is already in order, harder as the list becomes
less ordered, and hardest when a list is in inverse order. How
hard a sort works is based on the number of comparisons and moves
that must be executed.
To understand the importance of rearranging elements with
equal keys, imagine a database that is sorted on a main key and a
subkey - for example, a mailing list with the ZIP code as the main
key and the last name within the same ZIP code as the subkey.
When a new address is added to the list, and the list sorted
again, you do not want the subkeys (that is, last names within
ZIP codes) to be rearranged. To guarantee this, a sort must not
exchange main keys of equal value.
In the following sections, representative sorts from each
class of sorting algorithms are analyzed to judge their
efficiency. Later, improved sorting methods are studied.
The Bubble Sort
~~~~~~~~~~~~~~~
The best-known (and most infamous) sort is the Bubble sort.
Its popularity is derived from its catchy name and its simplicity.
For reasons that will become evident, this is one of the worst
sorts ever conceived.
The Bubble sort uses the exchange method of sorting. The
general concept behind the Bubble sort is the repeated
comparisons and, if necessary, exchanges of adjacent elements.
Its name comes from the method's similarity to bubbles in a tank
of water, where each bubble seeks its own level. In this simplest
form of the Bubble sort:
/* bubble sort */
void bubble(char item[], int count)
{
int i, k;
char t;
for (i = 1; i < count; ++i)
for (k = count-1; k >= i; --k) {
if (item[k-1] > item[k]) {
/* exchange elements */
t = item[k-1];
item[k-1] = item[k];
item[k] = t;
}
}
}
'item' is a pointer to the character array to be sorted and
'count' is the number of elements in the array.
The Bubble sort is driven by two loops. Given that there are
'count' elements in the array, the outer loop causes the array to
be scanned 'count-1' times. This ensures that, in the worst case,
every element is in its proper position when the function
terminates. The inner loop performs the actual comparisons and
exchanges. (A slightly optimized version of the Bubble sort will
terminate if no exchanges occur, but this also adds another
comparison to each pass through the inner loop.)
This version of the Bubble sort can be used to sort a
character array into ascending order. For example, this short
program sorts a string typed in from the keyboard:
#include <stdio.h>
#include <string.h>
/* function prototypes */
void bubble(char [], int);
void main()
{
char str[256];
int len;
do {
printf("\nEnter a string:\n");
printf("(Exit program with an empty string.)\n");
gets(str);
len = strlen(str);
if (len > 0) {
bubble(str, len);
printf("Sorted list:\n%s\n", str);
}
} while (len > 0);
}
To illustrate how the Bubble sort works, here are the passes
used to sort 'dcab'.
initial: d c a b
pass l: a d c b ('a' moved to the leftmost)
pass 2: a b d c ('b' moved to the left)
pass 3: a b c d ('c' exchanged with 'd')
When analyzing any sort, you must determine how many
comparisons and exchanges will be performed for the best,
average, and worst case. With the Bubble sort, the number of
comparisons is always the same because the two for loops will
still repeat the specified number of times, whether or not the
list is initially ordered. This means that the Bubble sort will
always perform (n^2-n)/2 comparisons, where n is the number of
elements to be sorted. (There are 6 comparisons in this example.)
The number of exchanges is 0 for the best case - an already
sorted list. The numbers are 3*(n^2-n)/4 for the average case and
3*(n^2-n)/2 for the worst case. It is beyond the scope of this
course to explain the derivation of these cases, but you can see
that as the list becomes less ordered, the number of elements
that are out of order approaches the number of comparisons.
(Remember, there are three exchanges in a Bubble sort for every
element out of order.) The Bubble sort is said to be an n-squared
algorithm because its execution time is a multiple of the square
of the number of elements. A Bubble sort is bad for a large
number of elements because execution time is directly related to
the number of comparisons and exchanges.
For example, if you ignore the time it takes to exchange any
out-of-position element, and if each comparison takes 0.001
seconds, then sorting 10 elements will take about 0.05 seconds,
sorting 100 elements will take about 5 seconds, and sorting 1000
elements will take about 500 seconds. A 100,000 element sort (the
size of a small telephone book) would take about 5,000,000
seconds, or about 1400 hours - about two months of continuous
sorting!
Sorting by Selection
~~~~~~~~~~~~~~~~~~~~
A Selectwn sort selects the element with the lowest value
and exchanges that with the first element. Then from the remain-
ing (n-1) elements, the element with the least key is found and
exchanged with the second element, and so forth, up to the last
two elements. For example, if the selection method were to be
used on the array 'bdac', each pass would look like this:
initial: b d a c
pass 1: a d b c ('a' was selected)
pass 2: a b d c ('b' was selected)
pass 3: a b c d ('c' was selected)
The basic Selection sort is shown here:
/* selection sort */
void select(char item[], int count)
{
int i, j, k;
char t;
for (i = 0; i < count-1; ++i) {
k = i;
t = item[i];
for (j = i+1; j < count; ++j) {
if(item[j] < t) {
k = j;
t = item[j];
}
}
item[k] = item[i];
item[i] = t;
}
}
Unfortunately, the number of comparisons in the Select sort
is the same as the Bubble sort. This means that the Selection
sort requires (n^2-n)/2 comparisons, which makes it too slow for
a large number of items. The number of exchanges for the best
case is 3*(n-1) and for the worst case is n^2/4 + 3*(n-1).
For the best case, if the list is ordered, then only n-1
elements need to be moved, and each move requires three
exchanges. The worst case approximates the number of comparisons.
Although the average case is beyond the scope of this course to
develop. It is n*(ln(n)+e) where e is Euler's constant (about
0.577216). This means that although the number of comparisons for
both the Bubble sort and the Selection sort is the same, the
number of exchanges in the average case is far less for the
Selection sort.
Sorting by Insertion
~~~~~~~~~~~~~~~~~~~~
The Insertion sort is the last of the simple sorting
algorithms. The Insertion sort initially sorts the first two
members of the array. Next, the algorithm inserts the third
member into its sorted position in relation to the first two
members. Then, the fourth element is inserted into the list of
three elements. The process continues until all elements have
been sorted. For example, given the array 'dcab', each pass of
the Insertion sort would look like this:
initial: d c a b
pass 1: c d a b ('c' & 'd' were exchanged)
pass 3: a c d b ('a' moved to the leftmost)
pass 4: a b c d ('b' inserted)
A version of the Insertion sort is shown here:
/* sorting by straight insertion */
void insert(char item[], int count)
{
int i, k;
char t;
for (i = 1; i < count; ++i) {
t = item[i];
k = i-1;
while (k >= 0 && t < item[k]) {
item[k+1] = item[k];
k--;
}
item[k+1] = t;
}
}
Unlike the Bubble sort and the Selection sort, the number of
comparisons that occur while the Insertion sort is used will
depend on how the list is initially ordered. If the list is in
order, then the number of comparisons is (n-1). If the list is in
inverse order, then the number of comparisons is (n^2+n)/2-1,
while its average number of comparisons is (n^2+n-2)/4.
The number of exchanges for each case is as follows:
best 2*(n-1)
average (n^2+9*n-10)/4
worst (n^2+3*n-4)/2
Therefore, the number for the worst case is as bad as those
for the Bubble and Selection sorts, and for the average case it
is only slightly better.
The Insertion sort does have two advantages, however. First,
it behaves naturally: it works the least when the array is
already sorted and the hardest when the array is sorted in
inverse order. This makes the Insertion sort useful for lists
that are almost in order. Second, it leaves the order of equal
keys unchanged: if a list is sorted using two keys, then it
remains sorted for both keys after an Insertion sort.
Even though the comparisons may be fairly good for certain
sets of data, the fact that the array must always be shifted over
each time an element is placed in its proper location means that
the number of moves can be very significant. However, the
Insertion sort still behaves naturally, with the least exchanges
occurring for an almost sorted list and the most exchanges for an
inversely ordered array.
Improved Sorts
==============
All of the algorithms thus far had the fatal flaw of
executing in n^2 time. For large amounts of data, the sorts would
be slow - in fact, at some point, too slow to use. Every computer
programmer has heard, or told, the horror story of the "sort that
took three days." Unfortunately, these stories are often true.
When a sort takes too long, it may be the fault of the
underlying algorithm. However, a sad commentary is that the first
response is often "write it in assembly code." Although assembler
code will sometimes speed up a routine by a constant factor, if
the underlying algorithm is bad, the sort will be slow no matter
how optimal the coding. Remember, when the run time of a routine
is relative to n^2, increasing the speed of the coding or of the
computer will only cause a slight improvement because the rate at
which the run time increases changes exponentially. Keep in mind
that if something is not fast enough in C code, then it will
generally not be fast enough in assembler. The solution is to use
a better sorting algorithm.
In this section, two excellent sorts are developed. The first
is the Shell sort, and the second is the Quicksort, which is
generally considered the best sorting routine. These sorts run so
fast that if you blink, you will miss them.
The Shell Sort
~~~~~~~~~~~~~~
The Shell sort is named after its inventor, D.L. Shell.
However, the name seems to have stuck because its method of
operation resembles sea shells piled upon one another.
The general method, derived from the Insertion sort, is based
on diminishing increments. Below gives a diagram of a Shell sort
on the array 'fdacbe'. First, all elements that are three positions
apart are sorted. Then all elements that are two positions apart
are sorted. Finally, all those adjacent to each other are sorted.
initial: f d a c b e
^--------^ exchange
^--------^ exchange
^--------^ no exchange
pass1: c b a f d e
^-----^ exchange
^-----^ no exchange
^-----^ no exchange
^-----^ exchange
pass2: a b c e d f
^--^ exchange
pass3: a b c d e f
It may not be obvious that this method yields good results,
or even that it will sort the array, but it does both. This
algorithm is efficient because each sorting pass involves
relatively few elements, or elements that are already in
reasonable order; therefore each pass increases the order of the
data.
The exact sequence for the increments can be changed. The
only rule is that the last increment must be 1. For example, the
sequence 9, 5, 3, 2, 1 works well and is used in the Shell sort
shown here. Avoid sequences that are powers of 2 because, for
mathematically complex reasons, they reduce the efficiency of the
sorting algorithm. (However, even if you use them, the sort would
still work.)
/* a shell sort */
void shell(char item[], int count)
{
int i, j, k, s, w;
char x, a[5];
a[0] = 9; a[1] = 5; a[2] = 3; a[3] = 2; a[4] = 1;
for (w = 0; w < 5; w++) {
k = a[w]; s = -k;
for (i = k; i < count; ++i) {
x = item[i];
j = i - k;
if (s == 0) {
s = -k;
s++;
item[s] = x;
}
while (x < item[j] && j >= 0 && j <= count) {
item[j+k] = item[j];
j -= k;
}
item[j+k] = x;
}
}
}
You may have noticed that the inner while loop has three test
conditions. The 'x < item[j]' is a comparison necessary for the
sorting process. The tests 'j >= 0' and 'j <= count' are used to
keep the sort from overrunning the boundary of the array 'item'.
These extra checks degrade the performance of Shell sort to some
extent. Slightly different versions of the Shell sort employ spe-
cial array elements, called sentinels, which are not actually
part of the array to be sorted. Sentinels hold special
termination values that indicate the least and greatest possible
elements. In this way the boundary checks are unnecessary.
However, using sentinels requires a specific knowledge of the
data, which limits the generality of the sort function.
The analysis of the Shell sort presents some very difficult
mathematical problems that are far beyond the scope of this text.
However, execution time is proportional to n^l.2 for sorting n
elements. This is a very significant improvement over the n^2
sorts of the previous section. However, before you decide to use
the Shell sort, you should know that the Quicksort is even
hetter.
The Quicksort
~~~~~~~~~~~~~
The Quicksort, invented and named by C.A.R. Hoare, is
generally considered the best sorting algorithm currently
available. It is based on the exchange method of sorting. This is
surprising if you consider the terrible performance of the Bubble
sort, which is also based on the exchange method.
The Quicksort is built on the idea of partitions. The general
procedure is to select a value (called the comparand) and then to
partition the array into two parts with all elements greater than
or equal to the partition value on one side and those less than
the partition value on the other. This process is then repeated
for each remaining part until the array is sorted. For example,
given the array 'fedacb' and using the value 'd', the first pass
of the Quicksort would rearrange the array like this:
initial: f e d | a c b
^------------ the comparand
pass 1: b c a | d e f
This process is then repeated for each half ('bca' and
'def'). The process is essentially recursive; indeed, the
cleanest implementations of Quicksort are recursive algorithms.
The selection of the middle comparand value can be
accomplished two ways. The value can be chosen either at random
or by averaging a small set of values taken from the array. For
the optimal sorting it is desirable to select a value that is
precisely in the middle of the range of values. However, this is
not easy to do for most sets of data. Even in the worst case - the
value chosen is at one extremity - Quicksort still performs well.
The following version of Quicksort selects the middle element
of the array. Although this may not always result in a good
choice, the sort still performs correctly.
/* quicksort set up */
void quick(char item[], int count)
{
qs(item, 0, count-1);
}
/* quicksort */
void qs(char item[], int left, int right)
{
int i, j;
char x, y;
i = left; j = right;
x = item[(left+right)/2];
do {
while (item[i] < x && i < right) i++;
while (x < item[j] && j > left) j--;
if (i <= j) {
y = item[i];
item[i] = item[j];
item[j] = y;
i++; j--;
}
} while (i <= j);
if (left < j) qs(item, left, j);
if (i < right) qs(item, i, right);
}
In this version, the function quick() sets up a call to the
main sorting function, called qs(). While this maintains the same
common interface of 'item' and 'count', it is not essential
because qs() could have been called directly using three
arguments.
The derivation of the number of comparisons and number of
exchanges that Quicksort performs requires mathematics beyond the
scope of this text. However, you can assume that the number of
comparisons is n*log(n) and that the number of exchanges is
approximately n*log(n)/6. These are significantly better than any
of the previous sorts discussed so far.
The equation
N = 10^x
can be rewritten as
x = log(N)
This means, for example, that if 100 elements were to be
sorted, Quicksort would require 100 * 2, or 200, comparisons
because log 100 is 2. Compared with the Bubble sort's 4950
comparisons, this number is quite good.
You should be aware of one particularly nasty aspect to
Quicksort. If the comparand value for each partition happens to
be the largest value, then Quicksort degenerates into "slowsort"
with an n^2 run time. Generally, however, this does not happen.
You must carefully choose a method of determining the value
of the comparand. Often the value is determined by the actual
data you are sorting. In large mailing lists where the sorting is
often by ZIP code, the selection is simple, because the ZIP codes
are fairly evenly distributed and a simple algebraic function can
determine a suitable comparand. However, in certain databases,
the sort keys may be so close in value (with many being the same
value) that a random selection is often the best method
available. A common and fairly effective method is to sample
three elements from a partition and take the middle value.
Generally, the Quicksort is the sort of choice because it is
so fast. However, when only very small lists of data are to be
sorted (less than 100), the overhead created by Quicksort's
recursive calls may offset the benefits of a superior algorithm.
In rare cases like this, one of the simpler sorts (perhaps even
the Bubble sort) will be quicker.
Sorting Other Data Structures
=============================
Until now, you have only been sorting arrays of characters.
This has made it easy to present each of the sorting routines.
Obviously, arrays of any of the built-in data types can be sorted
simply by changing the data types of the parameters and variables
to the sort function. However, generally complex data types like
strings, or groupings of information like structures, need to be
sorted. Most sorting involves a key and information linked to
that key. To adapt the algorithms to sort other structures, you
need to alter either the comparison section or the exchange
section, or both. The algorithm itself will remain unchanged.
Because Quicksort is one of the best general-purpose routines
available at this time, it will be used in the following
examples. The same techniques will apply to any of the sorts
described earlier.
Sorting strings
~~~~~~~~~~~~~~~
The easiest way to sort strings is to create an array of
character pointers to those strings. This allows you to maintain
easy indexing and keeps the basic Quicksort algorithm unchanged.
The string version of Quicksort shown here will accept an array
of char pointers that point to the strings to be sorted. The sort
rearranges the pointers to the strings, not the actual strings in
memory. This version sorts the strings in alphabetical order.
/* quick sort for strings setup */
void quick_string(char *item[], int count)
{
qs_string(item, 0, count-1);
}
/* quick sort for strings */
void qs_string(char *item[], int left, int right)
{
int i, j;
char *x, *y;
i = left; j = right;
x = item[(left+right)/2];
do {
while (strcmp(item[i], x) < 0 && i < right) i++;
while (strcmp(item[j], x) > 0 && j > left) j--;
if (i <= j) {
y = item[i];
item[i] = item[j];
item[j] = y;
}
} while (i <= j);
if (left < j) qs_tring(item, left, j);
if (i < right) qs_string(item, i, right);
}
The comparison step has been changed to use the function
strcmp(), which returns a negative number if the first string is
lexicographically less than the second, 0 if the strings are
equal, and a positive number if the first string is
lexicographically greater than the second. The exchange part of
the routine has been left unchanged because only the pointers are
being exchanged - not the actual strings. To exchange the actual
strings, you would have to use the function strcpy().
The use of strcmp() will slow down the sort for two reasons.
First, it involves a function call, which always takes time;
second, the strcmp() function itself performs several comparisons
to determine the relationship of the two strings. If speed is
absolutely critical, the code for strcmp() can be duplicated
in-line inside the routine. However, there is no way to avoid
comparing the strings, since this is by definition what the task
involves.
Sorting Structures
~~~~~~~~~~~~~~~~~~
Most application programs that require a sort will need to
have a grouping of data sorted. A mailing list is an excellent
example because a name, street, city, state, and ZIP code are all
linked together. When this conglomerate unit of data is sorted, a
sort key is used, but the entire structure is exchanged. To see
how this is done, you first need to create a structure. For the
mailing-list example, a convenient structure is
struct address {
char name[40];
char street[40];
char city[20];
char state[3];
char zip[10];
};
The 'state' is three characters long and 'zip' is ten
characters long because a string array always needs to be one
character longer than the maximum length of any string in order
to store the null terminator.
Since it is reasonable to arrange a mailing list as an array
of structures, assume for this example that the sort routine
sorts an array of structures of type 'address', as shown here.
void quick_struct(struct address [], int);
void qs_struct(struct address [], int, int);
/* quick sort for structures setup */
void quick_struct(struct address item[], int count)
{
qs_struct(item, 0, count-1);
}
/* quick sort for structures */
void qs_struct(struct address item[], int left, int right)
{
int i, j;
char *x, *y;
i = left; j = right;
x = item[(left+right)/2].zip;
do {
while (strcmp(item[i].zip, x) < 0 && i < right) i++;
while (strcmp(item[j].zip, x) > 0 && j > left) j--;
if (i <= j) {
swap_all_fields(item, i, j);
i++; j--;
}
} while (i <= j);
if (left < j) qs_struct(item, left, j);
if (i < right) qs_struct(item, i, right);
}
Notice that both the comparison code and the exchange code
needed to be altered. Because so many fields needed to be
exchanged, a separate function, swap_all_fields(), was created to
do this. You will need to create swap_ all_fields() in accordance
with the nature of the structure being sorted.
Searching Methods
=================
Databases of information exist so that, from time to time, a
user can locate a record by knowing its key. There is only one
method of finding information in an unsorted file or array, and
another for a sorted file or array. Many compilers supply
search functions as part of the standard library. However, as
with sorting, general-purpose routines sometimes are simply too
inefficient for use in demanding situations because of the extra
overhead created by the generalization.
Finding information in an unsorted array requires a
sequential search starting at the first element and stopping
either when a match is found or when the end of the array is
reached. This method must be used on unsorted data, but can also
be applied to sorted data. If the data has been sorted, then a
binary search can be used, which greatly speeds up any search.
The Sequential Search
~~~~~~~~~~~~~~~~~~~~~
The sequential search is easy to code. The following function
searches a character array of known length until a match is found
with the specified key.
int sequential_search(char item[], int count, char key)
{
int i;
for (i = 0; i < count; ++i)
if (key == item[i]) return (i);
return (-1); /* no match */
}
This function will return the index number of the matching
entry if there is one, or -1 if there is not.
A straight sequential search will, on the average, test n/2
elements. In the best case, it will test only one element and in
the worst case n elements. If the information is stored on disk,
the search time can be very long. But if the data is unsorted,
this is the only method available.
The Binary Search
~~~~~~~~~~~~~~~~~
If the data to be searched is in sorted order, then a
superior method, called the binary search, can be used to find a
match. The method uses the "divide-and-conquer" approach. It
first tests the middle element; if the element is larger than the
key, it then tests the middle element of the first half;
otherwise, it tests the middle element of the second half This
process is repeated until either a match is found, or there are
no more elements to test.
For example, to find the number 4 in the array
1 2 3 4 5 6 7 8 9
the binary search would first test the middle, which is 5.
Since this is greater than 4, the search would continue with the
first half, or
1 2 3 4 5
In this example, the middle element is 3. This is less than
4, so the first half is discarded and the search continues with
4 5
This time the match is found.
In the binary search, the number of comparisons given the
worst case is log2(n). With average cases, the number is somewhat
better; in the best case, the number is 1.
A binary search function for character arrays is shown here.
You can make this search any arbitrary data structure by changing
the comparison portion of the routine.
/* Binary search */
int binary(char item[], int count, char key)
{
int low, high, mid;
low = 0; high = count-1;
while(low <= high) {
mid = (low+high) / 2;
if (key < item[mid]) high = mid - 1;
else if( key > item[mid]) low = mid + 1;
else return (mid); /* found */
}
return (-1);