Sorting and Searching ================================ In the world of computer science perhaps no other tasks are more fundamental or as extensively analyzed as those of sorting and of searching. These routines are used in virtually all database programs, as well as in compilers, interpreters, and operating systems. This lesson introduces the basics of sorting and searching information. Because sorting data generally makes searching the data easier and faster, sorting is discussed first. Sorting is the process of arranging a set of similar informa- tion into an increasing or decreasing order. Specifically, given a sorted list a of n elements, a[1] <= a[2] <= ... <= a[n] Though ANSI C supplies the standard qsort() function as part of the standard library, the study and understanding of sorting is important for three main reasons. First, a generalized function like qsort() cannot be applied to all situations. Second, because qsort() is parameterized to operate on a wide variety of data, it will run slower than a similar sort that operates on only one type of data. (The generalization process inherently increases run time because of the extra processing time needed to handle various data types.) Finally, the Quicksort algorithm (used by qsort()), although very good for the general case, may not be the best type of sort for specialized situations. Generally, when information (such as a mailing list) is sorted, only a portion of that information is used as the sort key. This key is used in comparisons, but when an exchange is made the entire data structure is swapped. In a mailing list, for example, the ZIP code field might be used as the key, but the entire address is sorted. For the sake of simplicity while developing the various sorting methods, you will be sorting character arrays. Later, you will learn how to adapt any of these methods to any type of data structure. The three general methods that can be used to sort arrays are: By exchange By selection By insertion To understand these three methods, imagine a deck of cards. To sort the cards using exchange, you would spread the cards, face up, on a table and then proceed to exchange out-of-order cards until the deck is ordered. To sort by selection, you would spread the cards on the table, select the lowest-value card, take it out of the deck, and hold it in your hand. Then you would select from the remaining cards on the table the lowest card and place it behind the one already in your hand. This process would continue until all the cards were in your hand. Because you always select the lowest card from those remaining on the table, the cards in your hand would be sorted when the process was complete. To sort the cards using insertion, you would hold the cards in your hand and would take one at a time. As you took cards from the deck, you would place them into a new deck on the table, always inserting them in the correct position. The deck would be sorted when you had no cards in your hand. Many different algorithms exist for each of the three sorting methods. Each algorithm has its merits, but the general criteria for judging a sorting algorithm are based on the following questions: How fast can it sort information in an average case? How fast is its best and worst case? Does it exhibit natural or unnatural behavior? Does it rearrange elements with equal keys? How fast a particular algorithm sorts is of great concern. The speed with which an array can be sorted is directly related to the number of comparisons and the number of exchanges (exchanges take more time). A comparison occurs when one array element is compared to another. An exchange happens when two elements are swapped in the array. Later in this chapter you will see that some sorts require an exponential amount of time per element to sort and some require logarithmic time. The best- and worst-case run times are important if you expect to encounter one of these situations frequently. Often a sort will have a good average case, but a terrible worst case. A sort is said to exhibit natural behavior if it works least when the list is already in order, harder as the list becomes less ordered, and hardest when a list is in inverse order. How hard a sort works is based on the number of comparisons and moves that must be executed. To understand the importance of rearranging elements with equal keys, imagine a database that is sorted on a main key and a subkey - for example, a mailing list with the ZIP code as the main key and the last name within the same ZIP code as the subkey. When a new address is added to the list, and the list sorted again, you do not want the subkeys (that is, last names within ZIP codes) to be rearranged. To guarantee this, a sort must not exchange main keys of equal value. In the following sections, representative sorts from each class of sorting algorithms are analyzed to judge their efficiency. Later, improved sorting methods are studied. The Bubble Sort ~~~~~~~~~~~~~~~ The best-known (and most infamous) sort is the Bubble sort. Its popularity is derived from its catchy name and its simplicity. For reasons that will become evident, this is one of the worst sorts ever conceived. The Bubble sort uses the exchange method of sorting. The general concept behind the Bubble sort is the repeated comparisons and, if necessary, exchanges of adjacent elements. Its name comes from the method's similarity to bubbles in a tank of water, where each bubble seeks its own level. In this simplest form of the Bubble sort: /* bubble sort */ void bubble(char item[], int count) { int i, k; char t; for (i = 1; i < count; ++i) for (k = count-1; k >= i; --k) { if (item[k-1] > item[k]) { /* exchange elements */ t = item[k-1]; item[k-1] = item[k]; item[k] = t; } } } 'item' is a pointer to the character array to be sorted and 'count' is the number of elements in the array. The Bubble sort is driven by two loops. Given that there are 'count' elements in the array, the outer loop causes the array to be scanned 'count-1' times. This ensures that, in the worst case, every element is in its proper position when the function terminates. The inner loop performs the actual comparisons and exchanges. (A slightly optimized version of the Bubble sort will terminate if no exchanges occur, but this also adds another comparison to each pass through the inner loop.) This version of the Bubble sort can be used to sort a character array into ascending order. For example, this short program sorts a string typed in from the keyboard: #include #include /* function prototypes */ void bubble(char [], int); void main() { char str[256]; int len; do { printf("\nEnter a string:\n"); printf("(Exit program with an empty string.)\n"); gets(str); len = strlen(str); if (len > 0) { bubble(str, len); printf("Sorted list:\n%s\n", str); } } while (len > 0); } To illustrate how the Bubble sort works, here are the passes used to sort 'dcab'. initial: d c a b pass l: a d c b ('a' moved to the leftmost) pass 2: a b d c ('b' moved to the left) pass 3: a b c d ('c' exchanged with 'd') When analyzing any sort, you must determine how many comparisons and exchanges will be performed for the best, average, and worst case. With the Bubble sort, the number of comparisons is always the same because the two for loops will still repeat the specified number of times, whether or not the list is initially ordered. This means that the Bubble sort will always perform (n^2-n)/2 comparisons, where n is the number of elements to be sorted. (There are 6 comparisons in this example.) The number of exchanges is 0 for the best case - an already sorted list. The numbers are 3*(n^2-n)/4 for the average case and 3*(n^2-n)/2 for the worst case. It is beyond the scope of this course to explain the derivation of these cases, but you can see that as the list becomes less ordered, the number of elements that are out of order approaches the number of comparisons. (Remember, there are three exchanges in a Bubble sort for every element out of order.) The Bubble sort is said to be an n-squared algorithm because its execution time is a multiple of the square of the number of elements. A Bubble sort is bad for a large number of elements because execution time is directly related to the number of comparisons and exchanges. For example, if you ignore the time it takes to exchange any out-of-position element, and if each comparison takes 0.001 seconds, then sorting 10 elements will take about 0.05 seconds, sorting 100 elements will take about 5 seconds, and sorting 1000 elements will take about 500 seconds. A 100,000 element sort (the size of a small telephone book) would take about 5,000,000 seconds, or about 1400 hours - about two months of continuous sorting! Sorting by Selection ~~~~~~~~~~~~~~~~~~~~ A Selectwn sort selects the element with the lowest value and exchanges that with the first element. Then from the remain- ing (n-1) elements, the element with the least key is found and exchanged with the second element, and so forth, up to the last two elements. For example, if the selection method were to be used on the array 'bdac', each pass would look like this: initial: b d a c pass 1: a d b c ('a' was selected) pass 2: a b d c ('b' was selected) pass 3: a b c d ('c' was selected) The basic Selection sort is shown here: /* selection sort */ void select(char item[], int count) { int i, j, k; char t; for (i = 0; i < count-1; ++i) { k = i; t = item[i]; for (j = i+1; j < count; ++j) { if(item[j] < t) { k = j; t = item[j]; } } item[k] = item[i]; item[i] = t; } } Unfortunately, the number of comparisons in the Select sort is the same as the Bubble sort. This means that the Selection sort requires (n^2-n)/2 comparisons, which makes it too slow for a large number of items. The number of exchanges for the best case is 3*(n-1) and for the worst case is n^2/4 + 3*(n-1). For the best case, if the list is ordered, then only n-1 elements need to be moved, and each move requires three exchanges. The worst case approximates the number of comparisons. Although the average case is beyond the scope of this course to develop. It is n*(ln(n)+e) where e is Euler's constant (about 0.577216). This means that although the number of comparisons for both the Bubble sort and the Selection sort is the same, the number of exchanges in the average case is far less for the Selection sort. Sorting by Insertion ~~~~~~~~~~~~~~~~~~~~ The Insertion sort is the last of the simple sorting algorithms. The Insertion sort initially sorts the first two members of the array. Next, the algorithm inserts the third member into its sorted position in relation to the first two members. Then, the fourth element is inserted into the list of three elements. The process continues until all elements have been sorted. For example, given the array 'dcab', each pass of the Insertion sort would look like this: initial: d c a b pass 1: c d a b ('c' & 'd' were exchanged) pass 3: a c d b ('a' moved to the leftmost) pass 4: a b c d ('b' inserted) A version of the Insertion sort is shown here: /* sorting by straight insertion */ void insert(char item[], int count) { int i, k; char t; for (i = 1; i < count; ++i) { t = item[i]; k = i-1; while (k >= 0 && t < item[k]) { item[k+1] = item[k]; k--; } item[k+1] = t; } } Unlike the Bubble sort and the Selection sort, the number of comparisons that occur while the Insertion sort is used will depend on how the list is initially ordered. If the list is in order, then the number of comparisons is (n-1). If the list is in inverse order, then the number of comparisons is (n^2+n)/2-1, while its average number of comparisons is (n^2+n-2)/4. The number of exchanges for each case is as follows: best 2*(n-1) average (n^2+9*n-10)/4 worst (n^2+3*n-4)/2 Therefore, the number for the worst case is as bad as those for the Bubble and Selection sorts, and for the average case it is only slightly better. The Insertion sort does have two advantages, however. First, it behaves naturally: it works the least when the array is already sorted and the hardest when the array is sorted in inverse order. This makes the Insertion sort useful for lists that are almost in order. Second, it leaves the order of equal keys unchanged: if a list is sorted using two keys, then it remains sorted for both keys after an Insertion sort. Even though the comparisons may be fairly good for certain sets of data, the fact that the array must always be shifted over each time an element is placed in its proper location means that the number of moves can be very significant. However, the Insertion sort still behaves naturally, with the least exchanges occurring for an almost sorted list and the most exchanges for an inversely ordered array. Improved Sorts ============== All of the algorithms thus far had the fatal flaw of executing in n^2 time. For large amounts of data, the sorts would be slow - in fact, at some point, too slow to use. Every computer programmer has heard, or told, the horror story of the "sort that took three days." Unfortunately, these stories are often true. When a sort takes too long, it may be the fault of the underlying algorithm. However, a sad commentary is that the first response is often "write it in assembly code." Although assembler code will sometimes speed up a routine by a constant factor, if the underlying algorithm is bad, the sort will be slow no matter how optimal the coding. Remember, when the run time of a routine is relative to n^2, increasing the speed of the coding or of the computer will only cause a slight improvement because the rate at which the run time increases changes exponentially. Keep in mind that if something is not fast enough in C code, then it will generally not be fast enough in assembler. The solution is to use a better sorting algorithm. In this section, two excellent sorts are developed. The first is the Shell sort, and the second is the Quicksort, which is generally considered the best sorting routine. These sorts run so fast that if you blink, you will miss them. The Shell Sort ~~~~~~~~~~~~~~ The Shell sort is named after its inventor, D.L. Shell. However, the name seems to have stuck because its method of operation resembles sea shells piled upon one another. The general method, derived from the Insertion sort, is based on diminishing increments. Below gives a diagram of a Shell sort on the array 'fdacbe'. First, all elements that are three positions apart are sorted. Then all elements that are two positions apart are sorted. Finally, all those adjacent to each other are sorted. initial: f d a c b e ^--------^ exchange ^--------^ exchange ^--------^ no exchange pass1: c b a f d e ^-----^ exchange ^-----^ no exchange ^-----^ no exchange ^-----^ exchange pass2: a b c e d f ^--^ exchange pass3: a b c d e f It may not be obvious that this method yields good results, or even that it will sort the array, but it does both. This algorithm is efficient because each sorting pass involves relatively few elements, or elements that are already in reasonable order; therefore each pass increases the order of the data. The exact sequence for the increments can be changed. The only rule is that the last increment must be 1. For example, the sequence 9, 5, 3, 2, 1 works well and is used in the Shell sort shown here. Avoid sequences that are powers of 2 because, for mathematically complex reasons, they reduce the efficiency of the sorting algorithm. (However, even if you use them, the sort would still work.) /* a shell sort */ void shell(char item[], int count) { int i, j, k, s, w; char x, a[5]; a[0] = 9; a[1] = 5; a[2] = 3; a[3] = 2; a[4] = 1; for (w = 0; w < 5; w++) { k = a[w]; s = -k; for (i = k; i < count; ++i) { x = item[i]; j = i - k; if (s == 0) { s = -k; s++; item[s] = x; } while (x < item[j] && j >= 0 && j <= count) { item[j+k] = item[j]; j -= k; } item[j+k] = x; } } } You may have noticed that the inner while loop has three test conditions. The 'x < item[j]' is a comparison necessary for the sorting process. The tests 'j >= 0' and 'j <= count' are used to keep the sort from overrunning the boundary of the array 'item'. These extra checks degrade the performance of Shell sort to some extent. Slightly different versions of the Shell sort employ spe- cial array elements, called sentinels, which are not actually part of the array to be sorted. Sentinels hold special termination values that indicate the least and greatest possible elements. In this way the boundary checks are unnecessary. However, using sentinels requires a specific knowledge of the data, which limits the generality of the sort function. The analysis of the Shell sort presents some very difficult mathematical problems that are far beyond the scope of this text. However, execution time is proportional to n^l.2 for sorting n elements. This is a very significant improvement over the n^2 sorts of the previous section. However, before you decide to use the Shell sort, you should know that the Quicksort is even hetter. The Quicksort ~~~~~~~~~~~~~ The Quicksort, invented and named by C.A.R. Hoare, is generally considered the best sorting algorithm currently available. It is based on the exchange method of sorting. This is surprising if you consider the terrible performance of the Bubble sort, which is also based on the exchange method. The Quicksort is built on the idea of partitions. The general procedure is to select a value (called the comparand) and then to partition the array into two parts with all elements greater than or equal to the partition value on one side and those less than the partition value on the other. This process is then repeated for each remaining part until the array is sorted. For example, given the array 'fedacb' and using the value 'd', the first pass of the Quicksort would rearrange the array like this: initial: f e d | a c b ^------------ the comparand pass 1: b c a | d e f This process is then repeated for each half ('bca' and 'def'). The process is essentially recursive; indeed, the cleanest implementations of Quicksort are recursive algorithms. The selection of the middle comparand value can be accomplished two ways. The value can be chosen either at random or by averaging a small set of values taken from the array. For the optimal sorting it is desirable to select a value that is precisely in the middle of the range of values. However, this is not easy to do for most sets of data. Even in the worst case - the value chosen is at one extremity - Quicksort still performs well. The following version of Quicksort selects the middle element of the array. Although this may not always result in a good choice, the sort still performs correctly. /* quicksort set up */ void quick(char item[], int count) { qs(item, 0, count-1); } /* quicksort */ void qs(char item[], int left, int right) { int i, j; char x, y; i = left; j = right; x = item[(left+right)/2]; do { while (item[i] < x && i < right) i++; while (x < item[j] && j > left) j--; if (i <= j) { y = item[i]; item[i] = item[j]; item[j] = y; i++; j--; } } while (i <= j); if (left < j) qs(item, left, j); if (i < right) qs(item, i, right); } In this version, the function quick() sets up a call to the main sorting function, called qs(). While this maintains the same common interface of 'item' and 'count', it is not essential because qs() could have been called directly using three arguments. The derivation of the number of comparisons and number of exchanges that Quicksort performs requires mathematics beyond the scope of this text. However, you can assume that the number of comparisons is n*log(n) and that the number of exchanges is approximately n*log(n)/6. These are significantly better than any of the previous sorts discussed so far. The equation N = 10^x can be rewritten as x = log(N) This means, for example, that if 100 elements were to be sorted, Quicksort would require 100 * 2, or 200, comparisons because log 100 is 2. Compared with the Bubble sort's 4950 comparisons, this number is quite good. You should be aware of one particularly nasty aspect to Quicksort. If the comparand value for each partition happens to be the largest value, then Quicksort degenerates into "slowsort" with an n^2 run time. Generally, however, this does not happen. You must carefully choose a method of determining the value of the comparand. Often the value is determined by the actual data you are sorting. In large mailing lists where the sorting is often by ZIP code, the selection is simple, because the ZIP codes are fairly evenly distributed and a simple algebraic function can determine a suitable comparand. However, in certain databases, the sort keys may be so close in value (with many being the same value) that a random selection is often the best method available. A common and fairly effective method is to sample three elements from a partition and take the middle value. Generally, the Quicksort is the sort of choice because it is so fast. However, when only very small lists of data are to be sorted (less than 100), the overhead created by Quicksort's recursive calls may offset the benefits of a superior algorithm. In rare cases like this, one of the simpler sorts (perhaps even the Bubble sort) will be quicker. Sorting Other Data Structures ============================= Until now, you have only been sorting arrays of characters. This has made it easy to present each of the sorting routines. Obviously, arrays of any of the built-in data types can be sorted simply by changing the data types of the parameters and variables to the sort function. However, generally complex data types like strings, or groupings of information like structures, need to be sorted. Most sorting involves a key and information linked to that key. To adapt the algorithms to sort other structures, you need to alter either the comparison section or the exchange section, or both. The algorithm itself will remain unchanged. Because Quicksort is one of the best general-purpose routines available at this time, it will be used in the following examples. The same techniques will apply to any of the sorts described earlier. Sorting strings ~~~~~~~~~~~~~~~ The easiest way to sort strings is to create an array of character pointers to those strings. This allows you to maintain easy indexing and keeps the basic Quicksort algorithm unchanged. The string version of Quicksort shown here will accept an array of char pointers that point to the strings to be sorted. The sort rearranges the pointers to the strings, not the actual strings in memory. This version sorts the strings in alphabetical order. /* quick sort for strings setup */ void quick_string(char *item[], int count) { qs_string(item, 0, count-1); } /* quick sort for strings */ void qs_string(char *item[], int left, int right) { int i, j; char *x, *y; i = left; j = right; x = item[(left+right)/2]; do { while (strcmp(item[i], x) < 0 && i < right) i++; while (strcmp(item[j], x) > 0 && j > left) j--; if (i <= j) { y = item[i]; item[i] = item[j]; item[j] = y; } } while (i <= j); if (left < j) qs_tring(item, left, j); if (i < right) qs_string(item, i, right); } The comparison step has been changed to use the function strcmp(), which returns a negative number if the first string is lexicographically less than the second, 0 if the strings are equal, and a positive number if the first string is lexicographically greater than the second. The exchange part of the routine has been left unchanged because only the pointers are being exchanged - not the actual strings. To exchange the actual strings, you would have to use the function strcpy(). The use of strcmp() will slow down the sort for two reasons. First, it involves a function call, which always takes time; second, the strcmp() function itself performs several comparisons to determine the relationship of the two strings. If speed is absolutely critical, the code for strcmp() can be duplicated in-line inside the routine. However, there is no way to avoid comparing the strings, since this is by definition what the task involves. Sorting Structures ~~~~~~~~~~~~~~~~~~ Most application programs that require a sort will need to have a grouping of data sorted. A mailing list is an excellent example because a name, street, city, state, and ZIP code are all linked together. When this conglomerate unit of data is sorted, a sort key is used, but the entire structure is exchanged. To see how this is done, you first need to create a structure. For the mailing-list example, a convenient structure is struct address { char name[40]; char street[40]; char city[20]; char state[3]; char zip[10]; }; The 'state' is three characters long and 'zip' is ten characters long because a string array always needs to be one character longer than the maximum length of any string in order to store the null terminator. Since it is reasonable to arrange a mailing list as an array of structures, assume for this example that the sort routine sorts an array of structures of type 'address', as shown here. void quick_struct(struct address [], int); void qs_struct(struct address [], int, int); /* quick sort for structures setup */ void quick_struct(struct address item[], int count) { qs_struct(item, 0, count-1); } /* quick sort for structures */ void qs_struct(struct address item[], int left, int right) { int i, j; char *x, *y; i = left; j = right; x = item[(left+right)/2].zip; do { while (strcmp(item[i].zip, x) < 0 && i < right) i++; while (strcmp(item[j].zip, x) > 0 && j > left) j--; if (i <= j) { swap_all_fields(item, i, j); i++; j--; } } while (i <= j); if (left < j) qs_struct(item, left, j); if (i < right) qs_struct(item, i, right); } Notice that both the comparison code and the exchange code needed to be altered. Because so many fields needed to be exchanged, a separate function, swap_all_fields(), was created to do this. You will need to create swap_ all_fields() in accordance with the nature of the structure being sorted. Searching Methods ================= Databases of information exist so that, from time to time, a user can locate a record by knowing its key. There is only one method of finding information in an unsorted file or array, and another for a sorted file or array. Many compilers supply search functions as part of the standard library. However, as with sorting, general-purpose routines sometimes are simply too inefficient for use in demanding situations because of the extra overhead created by the generalization. Finding information in an unsorted array requires a sequential search starting at the first element and stopping either when a match is found or when the end of the array is reached. This method must be used on unsorted data, but can also be applied to sorted data. If the data has been sorted, then a binary search can be used, which greatly speeds up any search. The Sequential Search ~~~~~~~~~~~~~~~~~~~~~ The sequential search is easy to code. The following function searches a character array of known length until a match is found with the specified key. int sequential_search(char item[], int count, char key) { int i; for (i = 0; i < count; ++i) if (key == item[i]) return (i); return (-1); /* no match */ } This function will return the index number of the matching entry if there is one, or -1 if there is not. A straight sequential search will, on the average, test n/2 elements. In the best case, it will test only one element and in the worst case n elements. If the information is stored on disk, the search time can be very long. But if the data is unsorted, this is the only method available. The Binary Search ~~~~~~~~~~~~~~~~~ If the data to be searched is in sorted order, then a superior method, called the binary search, can be used to find a match. The method uses the "divide-and-conquer" approach. It first tests the middle element; if the element is larger than the key, it then tests the middle element of the first half; otherwise, it tests the middle element of the second half This process is repeated until either a match is found, or there are no more elements to test. For example, to find the number 4 in the array 1 2 3 4 5 6 7 8 9 the binary search would first test the middle, which is 5. Since this is greater than 4, the search would continue with the first half, or 1 2 3 4 5 In this example, the middle element is 3. This is less than 4, so the first half is discarded and the search continues with 4 5 This time the match is found. In the binary search, the number of comparisons given the worst case is log2(n). With average cases, the number is somewhat better; in the best case, the number is 1. A binary search function for character arrays is shown here. You can make this search any arbitrary data structure by changing the comparison portion of the routine. /* Binary search */ int binary(char item[], int count, char key) { int low, high, mid; low = 0; high = count-1; while(low <= high) { mid = (low+high) / 2; if (key < item[mid]) high = mid - 1; else if( key > item[mid]) low = mid + 1; else return (mid); /* found */ } return (-1);