home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Programming Tool Box
/
SIMS_2.iso
/
bp_6_93
/
bonus
/
winer
/
chap8.txt
< prev
next >
Wrap
Text File
|
1994-09-01
|
118KB
|
2,877 lines
CHAPTER 8
SORTING AND SEARCHING
Two fundamental operations required of many applications are searching and
sorting the data they operate on. Many different types of data are
commonly sorted, such as customer names, payment due dates, or even a list
of file names displayed in a file selection menu. If you are writing a
programmer's cross reference utility, you may need to sort a list of
variable names without regard to capitalization. In some cases, you may
want to sort several pieces of related information based on the contents
of only one of them. One example of that is a list of names and addresses
sorted in ascending zip code order.
Searching is equally important; for example, to locate a customer name
in an array or disk file. In some cases you may wish to search for a
complete match, while in others a partial match is needed. If you are
searching a list of names for, say, Leonard, you probably would want to
ignore Leonardo. But when searching a list of zip codes you may need to
locate all that begin with the digits 068. There are many different ways
sorting and searching can be accomplished, and the subject is by no means
a simple one.
Most programmers are familiar with the Bubble Sort, because it is the
simplest to understand. Each adjacent pair of items is compared, and then
exchanged if they are out of order. This process is repeated over and
over, until the entire list has been examined as many times as there are
items. Unfortunately, these repeated comparisons make the Bubble Sort an
extremely poor performer. Similarly, code to perform a linear search that
simply examines each item in succession for a match is easy to grasp, but
it will be painfully slow when there are many items.
In this chapter you will learn how sophisticated algorithms that handle
these important programming chores operate. You will also learn how to
sort data on more than one key. Often, it is not sufficient to merely sort
a list of customers by their last name. For example, you may be expected
to sort first by last name, then by first name, and finally by balance due.
That is, all of the last names would first be sorted. Then within all of
the Smiths you would sort again by first name, and for all of the John
Smiths sort that subgroup based on how much money is owed.
For completeness I will start each section by introducing sorting and
searching methods that are easy to understand, and then progress to the
more complex algorithms that are much more effective. Specifically, I will
show the Quick Sort and Binary Search algorithms. When there are many
thousands of data items, a good algorithm can make the difference between
a sort routine that takes ten minutes to complete, and one that needs only
a few seconds.
Finally, I will discuss both BASIC and assembly language sort routines.
As important as the right algorithm is for good performance, an assembly
language implementation will be even faster. Chapter 12 describes how
assembly language routines are written and how they work, and in this
chapter I will merely show how to use the routines included with this book.
SORTING FUNDAMENTALS
====================
Although there are many different ways to sort an array, the simplest
sorting algorithm is the Bubble Sort. The name Bubble is used because a
FOR/NEXT loop repeatedly examines each adjacent pair of elements in the
array, and those that have higher values rise to the top like bubbles in
a bathtub. The most common type of sort is ascending, which means that "A"
comes before "B", which comes before "C", and so forth. Figure 8-1 shows
how the name Zorba ascends to the top of a five-item list of first names.
Initial array contents:
Element 4 Kathy
Element 3 Barbara
Element 2 Cathy
Element 1 Zorba <
After 1 pass:
Element 4 Kathy
Element 3 Barbara
Element 2 Zorba <
Element 1 Cathy
After 2 passes:
Element 4 Kathy
Element 3 Zorba <
Element 2 Barbara
Element 1 Cathy
After 3 passes:
Element 4 Zorba <
Element 3 Kathy
Element 2 Barbara
Element 1 Cathy
Figure 8.1: Data ascending a list during a bubble sort.
The Bubble Sort routine that follows uses a FOR/NEXT loop to repeatedly
examine an array and exchange elements as necessary, until all of the items
are in the correct order.
DEFINT A-Z
DECLARE SUB BubbleSort (Array$())
CONST NumItems% = 20
CONST False% = 0
CONST True% = -1
DIM Array$(1 TO NumItems%)
FOR X = 1 TO NumItems%
READ Array$(X)
NEXT
CALL BubbleSort(Array$())
CLS
FOR X = 1 TO NumItems%
PRINT Array$(X)
NEXT
DATA Zorba, Cathy, Barbara, Kathy, Josephine
DATA Joseph, Joe, Peter, Arnold, Glen
DATA Ralph, Elli, Lucky, Rocky, Louis
DATA Paula, Paul, Mary Lou, Marilyn, Keith
END
SUB BubbleSort (Array$()) STATIC
DO
OutOfOrder = False% 'assume it's sorted
FOR X = 1 TO UBOUND(Array$) - 1
IF Array$(X) > Array$(X + 1) THEN
SWAP Array$(X), Array$(X + 1) 'if we had to swap
OutOfOrder = True% 'we may not be done
END IF
NEXT
LOOP WHILE OutOfOrder
END SUB
This routine is simple enough to be self-explanatory, and only a few things
warrant discussing. One is the OutOfOrder flag variable. When the array
is nearly sorted to begin with, fewer passes through the loop are needed.
The OutOfOrder variable determines when no more passes are necessary. It
is cleared at the start of each loop, and set each time two elements are
exchanged. If, after examining all of the elements in one pass no
exchanges were required, then the sorting is done and there's no need for
the DO loop to continue.
The other item worth mentioning is that the FOR/NEXT loop is set to
consider one element less than the array actually holds. This is necessary
because each element is compared to the one above it. If the last element
were included in the loop, then BASIC would issue a "Subscript out of
range" error on the statement that examines Array$(X + 1).
There are a number of features you can add to this Bubble Sort routine.
For example, you could sort without regard to capitalization. In that case
"adams" would come before "BAKER", even though the lowercase letter "a" has
a higher ASCII value than the uppercase letter "B". To add that capability
simply use BASIC's UCASE$ (or LCASE$) function as part of the comparisons:
IF UCASE$(Array$(X)) > UCASE$(Array$(X + 1)) THEN
And to sort based on the eight-character portion that starts six bytes
into each string you would use this:
IF MID$(Array$(X), 5, 8) > MID$(Array$(X + 1), 5, 8) THEN
Although the comparisons in this example are based on just a portion of
each string, the SWAP statement must exchange the entire elements. This
opens up many possibilities as you will see later in this chapter.
If there is a chance that the strings may contain trailing blanks that
should be ignored, you can use RTRIM$ on each pair of elements:
IF RTRIM$(Array$(X)) > RTRIM$(Array$(X + 1)) THEN
Of course, you can easily combine these enhancements to consider only the
characters in the middle after they have been converted to upper or lower
case.
Sorting in reverse (descending) order is equally easy; you'd simply
replace the greater-than symbol (>) with a less-than symbol (<).
Finally, you can modify the routine to work with any type of data by
changing the array type identifier. That is, for every occurrence of
Array$ you will change that to Array% or Array# or whatever is appropriate.
If you are sorting a numeric array, then different modifications may be in
order. For example, to sort ignoring whether the numbers are positive or
negative you would use BASIC's ABS (absolute value) function:
IF ABS(Array!(X)) > ABS(Array!(X + 1)) THEN
It is important to point out that all of the simple modifications described
here can also be applied to the more sophisticated sort routines we will
look at later in this chapter.
INDEXED SORTS
Besides the traditional sorting methods--whether a Bubble Sort or Quick
Sort or any other type of sort--there is another category of sort routine
you should be familiar with. Where a conventional sort exchanges elements
in an array until they are in order, an Index Sort instead exchanges
elements in a parallel numeric array of *pointers*. The original data is left
intact, so it may still be accessed in its natural order. However, the
array can also be accessed in sorted order by using the element numbers
contained in the index array.
As with a conventional sort, the comparisons in an indexed sort routine
examine each element in the primary array, but based on the element numbers
in that index array. If it is determined that the data is out of order,
the routine exchanges the elements in the index array instead of the
primary array. A modification to the Bubble Sort routine to sort using an
index is shown below.
DEFINT A-Z
DECLARE SUB BubbleISort (Array$(), Index())
CONST NumItems% = 20
CONST False% = 0
CONST True% = -1
DIM Array$(1 TO NumItems%) 'this holds the string data
DIM Ndx(1 TO NumItems%) 'this holds the index
FOR X = 1 TO NumItems%
READ Array$(X) 'read the string data
Ndx(X) = X 'initialize the index array
NEXT
CALL BubbleISort(Array$(), Ndx())
CLS
FOR X = 1 TO NumItems%
PRINT Array$(Ndx(X)) 'print based on the index
NEXT
DATA Zorba, Cathy, Barbara, Kathy, Josephine
DATA Joseph, Joe, Peter, Arnold, Glen
DATA Ralph, Elli, Lucky, Rocky, Louis
DATA Paula, Paul, Mary lou, Marilyn, Keith
SUB BubbleISort (Array$(), Index()) STATIC
DO
OutOfOrder = False% 'assume it's sorted
FOR X = 1 TO UBOUND(Array$) - 1
IF Array$(Index(X)) > Array$(Index(X + 1)) THEN
SWAP Index(X), Index(X + 1) 'if we had to swap
OutOfOrder% = True% 'we're not done yet
END IF
NEXT
LOOP WHILE OutOfOrder%
END SUB
In this indexed sort, all references to the data are through the index
array. And when a swap is necessary, it is the index array elements that
are exchanged. Note that an indexed sort requires that the index array be
initialized to increasing values--even if the sort routine is modified to
be descending instead of ascending. Therefore, when BubbleISort is called
Ndx(1) must hold the value 1, Ndx(2) is set to 2, and so forth.
In this example the index array is initialized by the caller. However,
it would be just as easy to put that code into the subprogram itself.
Since you can't pass an array that hasn't yet been dimensioned, it makes
the most sense to do both steps outside of the subprogram. Either way, the
index array must be assigned to these initial values.
As I mentioned earlier, one feature of an indexed sort is that it lets
you access the data in both its original and sorted order. But there are
other advantages, and a disadvantage as well. The disadvantage is that
each comparison takes slightly longer, because of the additional overhead
required to first look up the element number in the index array, to
determine which elements in the primary array will be compared. In some
cases, though, that can be more than offset by requiring less time to
exchange elements.
If you are sorting an array of 230-byte TYPE variables, the time needed
for SWAP to exchange the elements can become considerable. Every byte in
both elements must be read and written, so the time needed increases
linearly as the array elements become longer. Contrast that with the fixed
two bytes in the integer index array that are swapped.
Another advantage of an indexed sort is that it lends itself to sorting
more data than can fit in memory. As you will see later in the section
that shows how to sort files, it is far easier to manipulate an integer
index than an entire file. Further, sorting the file data using multiple
passes requires twice as much disk space as the file already occupies.
DATA MANIPULATION TECHNIQUES
Before I show the Quick Sort algorithm that will be used as a basis for
the remaining sort examples in this chapter, you should also be aware of
a few simple tricks that can help you maintain and sort your data. One was
described in Chapter 6, using a pair of functions that pack and unpack
dates such that the year is stored before the month, which in turn is
before the day. Thus, date strings are reduced to only three characters
each, and they can be sorted directly.
Another useful speed-up trick is to store string data as integers or
long integers. If you had a system of four-digit account numbers you could
use an integer instead of a string. Besides saving half the memory and
disk space, the integer comparisons in a sort routine will be many times
faster than a comparison on string equivalents. Zip codes are also suited
to this, and could be stored in a long integer. Even though the space
savings is only one byte, the time needed to compare the values for sorting
will be greatly reduced.
This brings up another important point. As you learned in Chapter 2,
all conventional (not fixed-length) strings require more memory than might
be immediately apparent. Besides the amount of memory needed to hold the
data itself, four additional bytes are used for a string descriptor, and
two more beyond those for a back pointer. Therefore, a zip code stored as
a string will actually require eleven bytes rather than the five you might
expect. With this in mind, you may be tempted to think that using a fixed-
length string to hold the zip codes will solve the problem. Since fixed-
length strings do not use either descriptors or back pointers, they do not
need the memory they occupy. And that leads to yet another issue.
Whenever a fixed-length string or the string portion of a TYPE variable
is compared, it must first be converted to a regular descriptored string.
BASIC has only one string comparison routine, and it expects the addresses
for two conventional string descriptors. Every time a fixed-length string
is used as an argument for comparison, BASIC must create a temporary copy,
call its comparison routine, and then delete the copy. This copying adds
code and wastes an enormous amount of time; in many cases the copying will
take longer than the comparison itself. Therefore, using integers and long
integers for numeric data where possible will provide more improvement than
just the savings in memory use.
In some cases, however, you must use fixed-length string or TYPE arrays.
In particular, when sorting information from a random access disk file it
is most sensible to load the records into a TYPE array. And as you learned
in Chapter 2, the string components of a TYPE variable or array element are
handled by BASIC as a fixed-length string. So how can you effectively sort
fixed-length string arrays without incurring the penalty BASIC's overhead
imposes? With assembly language subroutines, of course!
Rather than ask BASIC to pass the data using its normal methods,
assembly language routines can be invoked passing the data segments and
addresses directly. When you use SEG, or a combination of VARSEG and
VARPTR with fixed-length and TYPE variables, BASIC knows that you want the
segmented address of the variable or array element. Thus, you are tricking
BASIC into not making a copy as it usually would when passing such data.
An assembly language subroutine or function can be designed to accept data
addresses in any number of ways. As you will see later when we discuss
sorting on multiple keys, extra trickery is needed to do the same thing in
a BASIC procedure.
The three short assembly language functions that follow compare two
portions of memory, and then return a result that can be tested by your
program.
;COMPARE.ASM - compares two ranges of memory
.Model Medium, Basic
.Code
Compare Proc Uses DS ES DI SI, SegAdr1:DWord, _
SegAdr2:DWord, NumBytes:Word
Cld ;compare in the forward direction
Mov SI,NumBytes ;get the address for NumBytes%
Mov CX,[SI] ;put it into CX for comparing below
Les DI,SegAdr1 ;load ES:DI with the first
; segmented address
Lds SI,SegAdr2 ;load DS:SI with the second
; segmented address
Repe Cmpsb ;do the compare
Mov AX,0 ;assume the bytes didn't match
Jne Exit ;we were right, skip over
Dec AX ;wrong, decrement AX down to -1
Exit:
Ret ;return to BASIC
Compare Endp
End
;COMPARE2.ASM - compares memory case-insensitive
.Model Medium, Basic
.Code
Compare2 Proc Uses DS ES DI SI, SegAdr1:DWord, _
SegAdr2:DWord, NumBytes:Word
Cld ;compare in the forward direction
Mov BX,-1 ;assume the ranges are the same
Mov SI,NumBytes ;get the address for NumBytes%
Mov CX,[SI] ;put it into CX for comparing below
Jcxz Exit ;if zero bytes were given, they're
; the same
Les DI,SegAdr1 ;load ES:DI with the first address
Lds SI,SegAdr2 ;load DS:SI with the second address
Do:
Lodsb ;load the current character from
; DS:SI into AL
Call Upper ;capitalize as necessary
Mov AH,AL ;copy the character to AH
Mov AL,ES:[DI] ;load the other character into AL
Inc DI ;point at the next one for later
Call Upper ;capitalize as necessary
Cmp AL,AH ;now, are they the same?
Jne False ;no, exit now and show that
Loop Do ;yes, continue
Jmp Short Exit ;if we get this far, the bytes are
; all the same
False:
Inc BX ;increment BX to return zero
Exit:
Mov AX,BX ;assign the function output
Ret ;return to BASIC
Upper:
Cmp AL,"a" ;is the character below an "a"?
Jb Done ;yes, so we can skip it
Cmp AL,"z" ;is the character above a "z"?
Ja Done ;yes, so we can skip that too
Sub AL,32 ;convert to upper case
Done:
Retn ;do a near return to the caller
Compare2 Endp
End
;COMPARE3.ASM - case-insensitive, greater/less than
.Model Medium, Basic
.Code
Compare3 Proc Uses DS ES DI SI, SegAdr1:DWord, _
SegAdr2:DWord, NumBytes:Word
Cld ;compare in the forward direction
Xor BX,BX ;assume the ranges are the same
Mov SI,NumBytes ;get the address for NumBytes%
Mov CX,[SI] ;put it into CX for comparing below
Jcxz Exit ;if zero bytes were given, they're
; the same
Les DI,SegAdr1 ;load ES:DI with the first address
Lds SI,SegAdr2 ;load DS:SI with the second address
Do:
Lodsb ;load the current character from
; DS:SI into AL
Call Upper ;capitalize as necessary, remove for
; case-sensitive
Mov AH,AL ;copy the character to AH
Mov AL,ES:[DI] ;load the other character into AL
Inc DI ;point at the next character for later
Call Upper ;capitalize as necessary, remove for
; case-sensitive
Cmp AL,AH ;now, are they the same?
Loope Do ;yes, continue
Je Exit ;we exhausted the data and they're
; the same
Mov BL,1 ;assume block 1 was "greater"
Ja Exit ;we assumed correctly
Dec BX ;wrong, bump BX down to -1
Dec BX
Exit:
Mov AX,BX ;assign the function output
Ret ;return to BASIC
Upper:
Cmp AL,"a" ;is the character below an "a"?
Jb Done ;yes, so we can skip it
Cmp AL,"z" ;is the character above a "z"?
Ja Done ;yes, so we can skip that too
Sub AL,32 ;convert to upper case
Done:
Retn ;do a near return to the caller
Compare3 Endp
End
The first Compare routine above simply checks if all of the bytes are
identical, and returns -1 (True) if they are, or 0 (False) if they are not.
By returning -1 or 0 you can use either
IF Compare%(Type1, Type2, NumBytes%) THEN
or
IF NOT Compare%(Type1, Type2, NumBytes%) THEN
depending on which logic is clearer for your program. Compare2 is similar
to Compare, except it ignores capitalization. That is, "SMITH" and Smith"
are considered equal. The Compare3 function also compares memory and
ignores capitalization, but it returns either -1, 0, or 1 to indicate if
the first data range is less than, equal to, or greater than the second.
The correct declaration and usage for each of these routines is shown
below. Note that Compare and Compare2 are declared and used in the same
fashion.
Compare and Compare2:
DECLARE FUNCTION Compare%(SEG Type1 AS ANY, SEG Type2 AS ANY, _
NumBytes%)
Same = Compare%(Type1, Type2, NumBytes%)
or
DECLARE FUNCTION Compare%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _
BYVAL Adr2%, NumBytes%)
Same = Compare%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%)
Here, Same receives -1 if the two TYPE variables or ranges of memory are
the same, or 0 if they are not. NumBytes% tells how many bytes to compare.
Compare3:
DECLARE FUNCTION Compare3%(SEG Type1 AS ANY, SEG Type2 AS ANY, _
NumBytes%)
Result = Compare3%(Type1, Type2, NumBytes%)
or
DECLARE FUNCTION Compare3%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _
BYVAL Adr2%, NumBytes%)
Result = Compare3%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%)
Result receives 0 if the two type variables or ranges of memory are the
same, -1 if the first is less when compared as strings, or 1 if the first
is greater. NumBytes% tells how many bytes are to be to compared. In the
context of a sort routine you could invoke Compare3 like this:
IF Compare3%(TypeEl(X), TypeEl(X + 1), NumBytes%) = 1 THEN
SWAP TypeEl(X), TypeEl(X + 1)
END IF
As you can see, these routines may be declared in either of two ways.
When used with TYPE arrays the first is more appropriate and results in
slightly less setup code being generated by the compiler. When comparing
fixed-length strings or arbitrary blocks of memory (for example, when one
of the ranges is on the display screen) you should use the second method.
Since SEG does not work correctly with fixed-length strings, if you want
to use that more efficient version you must create a dummy TYPE comprised
solely of a single string portion:
TYPE FixedLength
Something AS STRING * 35
END TYPE
Then simply use DIM to create a single variable or an array based on this
or a similar TYPE, depending on what your program needs. The requirement
to create a dummy TYPE was discussed in Chapter 2, and I won't belabor the
reasons again here. These comparison routines will be used extensively in
the sort routines presented later in this chapter; however, their value in
other, non-sorting situations should also be apparent.
Although these routines are written in assembly language, they are
fairly simple to follow. It is important to understand that you do not
need to know anything about assembly language to use them. All of the
files you need to add these and all of the other routines in this book are
contained on the accompanying diskette [here, in the same ZIP file as this
text]. Chapter 12 discusses assembly language in great detail, and you can
refer there for further explanation of the instructions used.
If you plan to run the programs that follow in the QuickBASIC editor,
you must load the BASIC.QLB Quick Library as follows:
qb program /l basic
Later when you compile these or other programs you must link with the
parallel BASIC.LIB file:
bc program [/o];
link program , , nul , basic;
If you are using BASIC PDS start QBX using the BASIC7.QLB file, and then
link with BASIC7.LIB to produce a stand-alone .EXE program. [VB/DOS users
will also use the BASIC7 version.]
THE QUICK SORT ALGORITHM
========================
It should be obvious to you by now that a routine written in assembly
language will always be faster than an equivalent written in BASIC.
However, simply translating a procedure to assembly language is not always
the best solution. Far more important than which language you use is
selecting an appropriate algorithm. The best sorting method I know is the
Quick Sort, and a well-written version of Quick Sort using BASIC will be
many times faster than an assembly language implementation of the Bubble
Sort.
The main problem with the Bubble Sort is that the number of comparisons
required grows exponentially as the number of elements increases. Since
each pass through the array exchanges only a few elements, many passes are
required before the entire array is sorted. The Quick Sort was developed
by C.A.R. (Tony) Hoare, and is widely recognized as the fastest algorithm
available. In some special cases, such as when the data is already sorted
or nearly sorted, the Quick Sort may be slightly slower than other methods.
But in most situations, a Quick Sort is many times faster than any other
sorting algorithm.
As with the Bubble Sort, there are many different variations on how a
Quick Sort may be coded. (You may have noticed that the Bubble Sort shown
in Chapter 7 used a nested FOR/NEXT loop, while the one shown here uses a
FOR/NEXT loop within a DO/WHILE loop.) A Quick Sort divides the array into
sections--sometimes called partitions--and then sorts each section
individually. Many implementations therefore use recursion to invoke the
subprogram from within itself, as each new section is about to be sorted.
However, recursive procedures in any language are notoriously slow, and
also consume stack memory at an alarming rate.
The Quick Sort version presented here avoids recursion, and instead uses
a local array as a form of stack. This array stores the upper and lower
bounds showing which section of the array is currently being considered.
Another refinement I have added is to avoid making a copy of elements in
the array. As a Quick Sort progresses, it examines one element selected
arbitrarily from the middle of the array, and compares it to the elements
that lie above and below it. To avoid assigning a temporary copy this
version simply keeps track of the selected element number.
When sorting numeric data, maintaining a copy of the element is
reasonable. But when sorting strings--especially strings whose length is
not known ahead of time--the time and memory required to keep a copy can
become problematic. For clarity, the generic Quick Sort shown below uses
the copy method. Although this version is meant for sorting a single
precision array, it can easily be adapted to sort any type of data by
simply changing all instances of the "!" type declaration character.
'******** QSORT.BAS, Quick Sort algorithm demonstration
'Copyright (c) 1991 Ethan Winer
DEFINT A-Z
DECLARE SUB QSort (Array!(), StartEl, NumEls)
RANDOMIZE TIMER 'generate a new series each run
DIM Array!(1 TO 21) 'create an array
FOR X = 1 TO 21 'fill with random numbers
Array!(X) = RND(1) * 500 'between 0 and 500
NEXT
FirstEl = 6 'sort starting here
NumEls = 10 'sort this many elements
CLS
PRINT "Before Sorting:"; TAB(31); "After sorting:"
PRINT "==============="; TAB(31); "=============="
FOR X = 1 TO 21 'show them before sorting
IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN
PRINT "==>";
END IF
PRINT TAB(5); USING "###.##"; Array!(X)
NEXT
CALL QSort(Array!(), FirstEl, NumEls)
LOCATE 3
FOR X = 1 TO 21 'print them after sorting
LOCATE , 30
IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN
PRINT "==>"; 'point to sorted items
END IF
LOCATE , 35
PRINT USING "###.##"; Array!(X)
NEXT
SUB QSort (Array!(), StartEl, NumEls) STATIC
REDIM QStack(NumEls \ 5 + 10) 'create a stack array
First = StartEl 'initialize work variables
Last = StartEl + NumEls - 1
DO
DO
Temp! = Array!((Last + First) \ 2) 'seek midpoint
I = First
J = Last
DO 'reverse both < and > below to sort descending
WHILE Array!(I) < Temp!
I = I + 1
WEND
WHILE Array!(J) > Temp!
J = J - 1
WEND
IF I > J THEN EXIT DO
IF I < J THEN SWAP Array!(I), Array!(J)
I = I + 1
J = J - 1
LOOP WHILE I <= J
IF I < Last THEN
QStack(StackPtr) = I 'Push I
QStack(StackPtr + 1) = Last 'Push Last
StackPtr = StackPtr + 2
END IF
Last = J
LOOP WHILE First < Last
IF StackPtr = 0 THEN EXIT DO 'Done
StackPtr = StackPtr - 2
First = QStack(StackPtr) 'Pop First
Last = QStack(StackPtr + 1) 'Pop Last
LOOP
ERASE QStack 'delete the stack array
END SUB
Notice that I have designed this routine to allow sorting only a portion
of the array. To sort the entire array you'd simply omit the StartEl and
NumEls parameters, and assign First and Last from the LBOUND and UBOUND
element numbers. That is, you will change these:
First = StartEl
and
Last = StartEl + NumEls - 1
to these:
First = LBOUND(Array!)
and
Last = UBOUND(Array!)
As I mentioned earlier, the QStack array serves as a table of element
numbers that reflect which range of elements is currently being considered.
You will need to dimension this array to one element for every five
elements in the primary array being sorted, plus a few extra for good
measure. In this program I added ten elements, because one stack element
for every five main array elements is not enough for very small arrays.
For data arrays that have a large amount of duplicated items, you will
probably need to increase the size of the stack array.
Note that this ratio is not an absolute--the exact size of the stack
that is needed depends on how badly out of order the data is to begin with.
Although it is possible that one stack element for every five in the main
array is insufficient in a given situation, I have never seen this formula
fail. Because the stack is a dynamic integer array that is stored in far
memory, it will not impinge on near string memory. If this routine were
designed using the normal recursive method, BASIC's stack would be used
which is in near memory.
Each of the innermost DO loops searches the array for the first element
in each section about the midpoint that belongs in the other section. If
the elements are indeed out of order (when I is less than J) the elements
are exchanged. This incrementing and comparing continues until I and J
cross. At that point, assuming the variable I has not exceeded the upper
limits of the current partition, the partition bounds are saved and Last
is assigned to the top of the next inner partition level. When the entire
partition has been processed, the previous bounds are retrieved, but as a
new set of First and Last values. This process continues until no more
partition boundaries are on the stack. At that point the entire array is
sorted.
On the accompanying disk you will find a program called SEEQSORT.BAS
that contains an enhanced version of the QSort demo and subprogram. This
program lets you watch the progress of the comparisons and exchanges as
they are made, and actually see this complex algorithm operate. Simply
load SEEQSORT.BAS into the BASIC editor and run it. A constant named
Delay! is defined at the beginning of the program. Increasing its value
makes the program run more slowly; decreasing it causes the program to run
faster.
AN ASSEMBLY LANGUAGE QUICK SORT
As fast as the BASIC QuickSort routine is, we can make it even faster.
The listing below shows an assembly language version that is between ten
and twenty percent faster, depending on which compiler you are using and
if the BASIC PDS /fs (far strings) option is in effect.
;SORT.ASM - sorts an entire BASIC string array
.Model Medium, Basic
.Data
S DW 0
F DW 0
L DW 0
I DW 0
J DW 0
MidPoint DW 0
.Code
Extrn B$SWSD:Proc ;this swaps two strings
Extrn B$SCMP:Proc ;this compares two strings
Sort Proc Uses SI DI ES, Array:Word, Dir:Word
Cld ;all fills and compares are forward
Push DS ;set ES = DS for string compares
Pop ES
Xor CX,CX ;clear CX
Mov AX,7376h ;load AL and AH with the opcodes
; Jae and Jbe in preparation for
; code self-modification
Mov BX,Dir ;get the sorting direction
Cmp [BX],CX ;is it zero (ascending sort)?
Je Ascending ;yes, skip ahead
Xchg AL,AH ;no exchange the opcodes
Ascending:
Mov CS:[X1],AH ;install correct comparison opcodes
Mov CS:[X2],AL ; based on the sort direction
Mov BX,Array ;load the array descriptor address
Mov AX,[BX+0Eh] ;save the number of elements
Dec AX ;adjust the number to zero-based
Jns L0 ;at least 1 element, continue
Jmp L4 ;0 or less elements, get out now!
L0:
Mov BX,Array ;reload array descriptor address
Mov BX,[BX] ;Array$(LBOUND) descriptor address
Mov S,SP ;StackPtr = 0 (normalized to SP)
Mov F,CX ;F = 0
Mov L,AX ;L = Size%
;----- calculate the value of MidPoint
L1:
Mov DI,L ;MidPoint = (L + F) \ 2
Add DI,F
Shr DI,1
Mov MidPoint,DI
Mov AX,F ;I = F
Mov I,AX
Mov AX,L ;J = L
Mov J,AX
;----- calculate the offset into the descriptor table for Array$(MidPoint)
L1_2:
Shl DI,1 ;multiply MidPoint in DI times 4
Shl DI,1 ;now DI holds how far beyond Array$(Start)
; Array$(MidPoint)'s descriptor is
Add DI,BX ;add the array base address to produce the final
; address for Array$(MidPoint)
;----- calculate descriptor offset for Array$(I)
L2:
Mov SI,I ;put I into SI
Shl SI,1 ;as above
Shl SI,1 ;now SI holds how far beyond Array$(Start)
; Array$(I)'s descriptor is
Add SI,BX ;add the base to produce the final descriptor
; address
;IF Array$(I) < Array$(MidPoint) THEN I = I + 1: GOTO L2
Push BX ;save BX because B$SCMP trashes it
Push SI
Push DI
Call B$SCMP ;do the compare
Pop BX ;restore BX
X1 Label Byte ;modify the code below to "Jbe" if descending sort
Jae L2_1 ;Array$(I) isn't less, continue on
Inc Word Ptr I ;I = I + 1
Jmp Short L2 ;GOTO L2
;----- calculate descriptor offset for Array$(J)
L2_1:
Mov SI,J ;put J into SI
Shl SI,1 ;as above
Shl SI,1 ;now SI holds how far beyond Array$(Start)
; Array$(J)'s descriptor is
Add SI,BX ;add the base to produce the final descriptor
; address
;IF Array$(J) > Array$(MidPoint) THEN J = J - 1: GOTO L2.1
Push BX ;preserve BX
Push SI
Push DI
Call B$SCMP ;do the compare
Pop BX ;restore BX
X2 Label Byte ;modify the code below to "Jae" if descending sort
Jbe L2_2 ;Array$(J) isn't greater, continue on
Dec Word Ptr J ;J = J - 1
Jmp Short L2_1 ;GOTO L2.1
L2_2:
Mov AX,I ;IF I > J GOTO L3
Cmp AX,J
Jg L3 ;J is greater, go directly to L3
Je L2_3 ;they're the same, skip the swap
;Swap Array$(I), Array$(J)
Mov SI,I ;put I into SI
Mov DI,J ;put J into DI
Cmp SI,MidPoint ;IF I = MidPoint THEN MidPoint = J
Jne No_Mid1 ;not equal, skip ahead
Mov MidPoint,DI ;equal, assign MidPoint = J
Jmp Short No_Mid2 ;don't waste time comparing again
No_Mid1:
Cmp DI,MidPoint ;IF J = MidPoint THEN MidPoint = I
Jne No_Mid2 ;not equal, skip ahead
Mov MidPoint,SI ;equal, assign MidPoint = I
No_Mid2:
Mov SI,I ;put I into SI
Shl SI,1 ;multiply times four for the
Shl SI,1 ; for the descriptors
Add SI,BX ;add address for first descriptor
Mov DI,J ;do the same for J in DI
Shl DI,1
Shl DI,1
Add DI,BX
Push BX ;save BX because B$SWSD destroys it
Call B$SWSD ;and swap 'em good
Pop BX
L2_3:
Inc Word Ptr I ;I = I + 1
Dec Word Ptr J ;J = J - 1
Mov AX,I ;IF I <= J GOTO L2
Cmp AX,J
Jg L3 ;it's greater, skip to L3
Mov DI,MidPoint ;get MidPoint again
Jmp L1_2 ;go back to just before L2
L3:
Mov AX,I ;IF I < L THEN PUSH I: PUSH L
Cmp AX,L
Jnl L3_1 ;it's not less, so skip Pushes
Push I ;Push I
Push L ;Push L
L3_1:
Mov AX,J ;L = J
Mov L,AX
Mov AX,F ;IF F < L GOTO L1
Cmp AX,L
Jnl L3_2 ;it's not less, jump ahead to L3_2
Jmp L1 ;it's less, go to L1
L3_2:
Cmp S,SP ;IF S = 0 GOTO L4
Je L4
Pop L ;Pop L
Pop F ;Pop F
Jmp L1 ;GOTO L1
L4:
Ret ;return to BASIC
Sort Endp
End
Besides being faster than the BASIC version, the assembly language Sort
routine is half the size. This version also supports sorting either
forward or backward, but not just a portion of an array. The general
syntax is:
CALL Sort(Array$(), Direction)
Where Array$() is any variable-length string array, and Direction is 0 for
ascending, or any other value for descending. Note that this routine calls
upon BASIC's internal services to perform the actual comparing and
swapping; therefore, the exact same code can be used with either QuickBASIC
or BASIC PDS. Again, I refer you forward to Chapter 12 for an explanation
of the assembly language commands used in SORT.ASM.
SORTING ON MULTIPLE KEYS
In many situations, sorting based on one key is sufficient. For example,
if you are sorting a mailing list to take advantage of bulk rates you must
sort all of the addresses in order by zip code. When considering complex
data such as a TYPE variable, it is easy to sort the array based on one
component of each element. The earlier Bubble Sort example showed how MID$
could be used to consider just a portion of each string, even though the
entire elements were exchanged. Had that routine been designed to operate
on a TYPE array, the comparisons would have examined just one component,
but the SWAP statements would exchange entire elements:
IF Array(X).ZipCode > Array(X + 1).ZipCode THEN
SWAP Array(X), Array(X + 1)
END IF
This way, each customer's last name, first name, street address, and so
forth remain connected to the zip codes that are being compared and
exchanged.
There are several ways to sort on more than one key, and all are of
necessity more complex than simply sorting based on a single key. One
example of a multi-key sort first puts all of the last names in order.
Then within each group of identical last names the first names are sorted,
and within each group of identical last and first names further sorting is
performed on yet another key--perhaps Balance Due. As you can see, this
requires you to sort based on differing types of data, and also to compare
ranges of elements for the subgroups that need further sorting.
The biggest complication with this method is designing a calling syntax
that lets you specify all of the information. A table array must be
established to hold the number of keys, the type of data in each key
(string, double precision, and so forth), and how many bytes into the TYPE
element each key portion begins. Worse, you can't simply use the name of
a TYPE component in the comparisons inside the sort routine--which would
you use: Array(X).LastName, Array(X).FirstName, or Array(X).ZipCode?
Therefore, a truly general multi-key sort must be called passing the
address where the array begins in memory, and a table of offsets beyond
that address where each component being considered is located.
To avoid this added complexity I will instead show a different method
that has only a few minor restrictions, but is much easier to design and
understand. This method requires you to position each TYPE component into
the key order you will sort on. You will also need to store all numbers
that will be used for a sort key as ASCII digits. To sort first on last
name, then first name, and then on balance due, the TYPE might be
structured as follows:
TYPE Customer
LastName AS STRING * 15
FirstName AS STRING * 15
BalanceDue AS STRING * 9
Street AS STRING * 32
City AS STRING * 15
State AS STRING * 2
ZipCode AS STRING * 5
AnyNumber AS DOUBLE
END TYPE
In most cases the order in which each TYPE member is placed has no
consequence. When you refer to TypeVar.LastName, BASIC doesn't care if
LastName is defined before or after FirstName in the TYPE structure.
Either way it translates your reference to LastName into an address.
Having to store numeric data as strings is a limitation, but this is needed
only for those TYPE fields that will be used as a sort key.
The key to sorting on multiple items simultaneously is by treating the
contiguous fields as a single long field. Since assignments to the string
portion of a TYPE variable are handled internally by BASIC's LSET routine,
the data in each element will be aligned such that subsequent fields can
be treated as an extension of the primary field. Figure 8-2 below shows
five TYPE array elements in succession, as they would be viewed by a string
comparison routine. This data is defined as a subset of the name and
address TYPE shown above, using just the first three fields. Notice that
the balance due fields must be right-aligned (using RSET) for the numeric
values to be considered correctly.
Type.LastName Type.FirstName Type.BalanceDue
===============---------------=========
Munro Jay 8000.00
Smith John 122.03
Johnson Alfred 14537.89
Rasmussen Peter 100.90
Hudson Cindy 21.22
^ ^ ^
Field 1 Field 2 Field 3
starts here starts here starts here
Figure 8-2: Multiple contiguous fields in a TYPE can be treated as a single
long field.
Thus, the sort routine would be told to start at the first field, and
consider the strings to be 15 + 15 + 9 = 39 characters long. This way all
three fields are compared at one time, and treated as a single entity.
Additional fields can of course follow these, and they may be included in
the comparison or not at your option.
The combination demonstration and subroutine below sorts such a TYPE
array on any number of keys using this method, and it has a few additional
features as well. Besides letting you confine the sorting to just a
portion of the array, you may also specify how far into each element the
first key is located. As long as the key fields are contiguous, they do
not have to begin at the start of each TYPE. Therefore, you could sort
just on the first name field, or on any other field or group of fields.
'TYPESORT.BAS - performs a multi-key sort on TYPE arrays
'Copyright (c) 1991 Ethan Winer
DEFINT A-Z
DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
BYVAL Adr2, NumBytes)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _
BYVAL Length)
DECLARE SUB TypeSort (Segment, Address, ElSize, Offset, KeySize, NumEls)
CONST NumEls% = 23 'this keeps it all on the screen
TYPE MyType
LastName AS STRING * 10
FirstName AS STRING * 10
Dollars AS STRING * 6
Cents AS STRING * 2
END TYPE
REDIM Array(1 TO NumEls%) AS MyType
'---- Disable (REM out) all but one of the following blocks to test
Offset = 27 'start sorting with Cents
ElSize = LEN(Array(1)) 'the length of each element
KeySize = 2 'sort on the Cents only
Offset = 21 'start sorting with Dollars
ElSize = LEN(Array(1)) 'the length of each element
KeySize = 8 'sort Dollars and Cents only
Offset = 11 'start sorting with FirstName
ElSize = LEN(Array(1)) 'the length of each element
KeySize = 18 'sort FirstName through Cents
Offset = 1 'start sorting with LastName
ElSize = LEN(Array(1)) 'the length of each element
KeySize = ElSize 'sort based on all 4 fields
FOR X = 1 TO NumEls% 'build the array from DATA
READ Array(X).LastName
READ Array(X).FirstName
READ Amount$ 'format the amount into money
Dot = INSTR(Amount$, ".")
IF Dot THEN
RSET Array(X).Dollars = LEFT$(Amount$, Dot - 1)
Array(X).Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2)
ELSE
RSET Array(X).Dollars = Amount$
Array(X).Cents = "00"
END IF
NEXT
Segment = VARSEG(Array(1)) 'show where the array is
Address = VARPTR(Array(1)) ' located in memory
CALL TypeSort(Segment, Address, ElSize, Offset, KeySize, NumEls%)
CLS 'display the results
FOR X = 1 TO NumEls%
PRINT Array(X).LastName, Array(X).FirstName,
PRINT Array(X).Dollars; "."; Array(X).Cents
NEXT
DATA Smith, John, 123.45
DATA Cramer, Phil, 11.51
DATA Hogan, Edward, 296.08
DATA Cramer, Phil, 112.01
DATA Malin, Donald, 13.45
DATA Cramer, Phil, 111.3
DATA Smith, Ralph, 123.22
DATA Smith, John, 112.01
DATA Hogan, Edward, 8999.04
DATA Hogan, Edward, 8999.05
DATA Smith, Bob, 123.45
DATA Cramer, Phil, 11.50
DATA Hogan, Edward, 296.88
DATA Malin, Donald, 13.01
DATA Cramer, Phil, 111.1
DATA Smith, Ralph, 123.07
DATA Smith, John, 112.01
DATA Hogan, Edward, 8999.33
DATA Hogan, Edward, 8999.17
DATA Hogan, Edward, 8999.24
DATA Smith, John, 123.05
DATA Cramer, David, 1908.80
DATA Cramer, Phil, 112
END
SUB TypeSort (Segment, Address, ElSize, Displace, KeySize, NumEls) STATIC
REDIM QStack(NumEls \ 5 + 10) 'create a stack array
First = 1 'initialize working variables
Last = NumEls
Offset = Displace - 1 'decrement once now rather than
' repeatedly later
DO
DO
Temp = (Last + First) \ 2 'seek midpoint
I = First
J = Last
DO
WHILE Compare3%(Segment, Address + Offset + (I - 1) * ElSize, Segment, _
Address + Offset + (Temp-1) * ElSize, KeySize) = -1 '< 1 for descending
I = I + 1
WEND
WHILE Compare3%(Segment, Address + Offset + (J - 1) * ElSize, Segment, _
Address + Offset + (Temp-1) * ElSize, KeySize) = 1 '< -1 for descending
J = J - 1
WEND
IF I > J THEN EXIT DO
IF I < J THEN
CALL SwapMem(Segment, Address + (I - 1) * ElSize, Segment, _
Address + (J - 1) * ElSize, ElSize)
IF Temp = I THEN
Temp = J
ELSEIF Temp = J THEN
Temp = I
END IF
END IF
I = I + 1
J = J - 1
LOOP WHILE I <= J
IF I < Last THEN
QStack(StackPtr) = I 'Push I
QStack(StackPtr + 1) = Last 'Push Last
StackPtr = StackPtr + 2
END IF
Last = J
LOOP WHILE First < Last
IF StackPtr = 0 THEN EXIT DO 'Done
StackPtr = StackPtr - 2
First = QStack(StackPtr) 'Pop First
Last = QStack(StackPtr + 1) 'Pop Last
LOOP
ERASE QStack 'delete the stack array
END SUB
As you can see, this version of the Quick Sort subprogram is derived from
the one shown earlier. The important difference is that all of the
incoming information is passed as segments, addresses, and bytes, rather
than using an explicit array name. But before describing the inner details
of the subprogram itself, I'll address the demonstration portion and show
how the routine is set up and called.
As with some of the other procedures on the disk that comes with this
book, you will extract the TypeSort subprogram and add it to your own
programs by loading it as a module, and then using the Move option of
BASIC's View Subs menu. You can quickly access this menu by pressing F2,
and then use Alt-M to select Move. Once this is done you will unload
TYPESORT.BAS using the Alt-F-U menu selection, and answer *No* when asked
if you want to save the modified file. You could also copy the TypeSort
subprogram into a separate file, and then load that file as a module in
each program that needs it.
Although the example TYPE definition here shows only four components,
you may of course use any TYPE structure. TypeSort expects six parameters
to tell it where in memory the array is located, how far into each element
the comparison routines are to begin, the total length of each element, the
length of the key fields, and the number of elements to sort.
After defining MyType, the setup portion of TYPESORT.BAS establishes the
offset, element size, and key size parameters. As you can see, four
different sample setups are provided, and you should add remarking
apostrophes to all but one of them. If the program is left as is, the last
setup values will take precedence.
The next section reads sample names, addresses and dollar amounts from
DATA statements, and formats the dollar amounts as described earlier. The
dollar portion of the amounts are right justified into the Dollars field
of each element, and the Cents portion is padded with trailing zeros as
necessary to provide a dollars and cents format. This way, the value 12.3
will be assigned as 12.30, and 123 will be formatted to 123.00 which gives
the expected appearance.
The final setup step is to determine where the array begins in memory.
Since you specify the starting segment and address, it is simple to begin
sorting at any array element. For example, to sort elements 100 through
200--even if the array is larger than that--you'd use VARSEG(Array(100))
and VARPTR(Array(100) instead of element 1 as shown in this example.
In addition to the starting segment and address of the array, TypeSort
also requires you to tell it how many elements to consider. If you are
sorting the entire array and the array starts with element 1, this will
simply be the UBOUND of the array. If you are sorting just a portion of
the array then you give it only the number of elements to be sorted. So
to sort elements 100 through 200, the number of elements will be 101. A
general formula you can use for calculating this based on element numbers
is NumElements = LastElement - FirstElement + 1.
Now let's consider the TypeSort subprogram itself. Since it is more
like the earlier QSort program than different, I will cover only the
differences here. In fact, the primary difference is in the way
comparisons and exchanges are handled. The Compare3 function introduced
earlier is used to compare the array elements with the midpoint. Although
QSort made a temporary copy of the midpoint element, that would be
difficult to do here. Since the routine is designed to work with any type
of data--and the size of each element can vary depending on the TYPE
structure--it is impractical to make a copy.
While SPACE$ could be used to claim a block of memory into which the
midpoint element is copied, there's a much better way: the Temp variable
is used to remember the element number itself. The only complication is
that once elements I and J are swapped, Temp must be reassigned if it was
equal to either of them. (This happens just below the call to SwapMem.)
But the simple integer IF test and assignment required adds far less code
and is much faster than making a copy of the element.
TypeSort is designed to sort the array in ascending order, and comments
in the code show how to change it to sort descending instead. If you
prefer to have one subprogram that can do both, you should add an extra
parameter, perhaps called Direction. Near the beginning of the routine
before the initial outer DO you would add this:
IF Direction = 0 THEN 'sort ascending
ICompare = -1
JCompare = 1
ELSE 'sort descending
ICompare = 1
JCompare = -1
END IF
Then, where the results from Compare3 are compared to -1 and 1 replace
those comparisons (at the end of each WHILE line) to instead use ICompare
and JCompare:
WHILE Compare3%(...) = ICompare
I = I + 1
WEND
WHILE Compare3%(...) = JCompare
J = J - 1
WEND
This way, you are using variables to establish the sorting direction, and
those variables can be set either way each time TypeSort is called.
The last major difference is that elements are exchanged using the
SwapMem routine rather than BASIC's SWAP statement. While it is possible
to call SWAP by aliasing its name as shown in Chapter 5, it was frankly
simpler to write a new routine for this purpose. Further, BASIC's SWAP is
slower than SwapMem because it must be able to handle variables of
different lengths, and also exchange fixed-length and conventional strings.
SwapMem is extremely simple, and it works very quickly.
As I stated earlier, the only way to write a truly generic sort routine
is by passing segments and addresses and bytes, instead of array names.
Although it would be great if BASIC could let you declare a subprogram or
function using the AS ANY option to allow any type of data, that simply
wouldn't work. As BASIC compiles your program, it needs to know the size
and type of each parameter. When you reference TypeVar.LastName, BASIC
knows where within TypeVar the LastName portion begins, and uses that in
its address calculations. It is not possible to avoid this limitation
other than by using addresses as is done here.
Indeed, this is the stuff that C and assembly language programs are made
of. In these languages--especially assembly language--integer pointer
variables are used extensively to show where data is located and how long
it is. However, the formulas used within the Compare3 and SwapMem function
calls are not at all difficult to understand.
The formula Address + Offset - (I - 1) * ElSize indicates where the key
field of element I begins. Address holds the address of the beginning of
the first element, and Offset is added to identify the start of the first
sort key. (I - 1) is used instead of I because addresses are always zero-
based. That is, the first element in the array from TypeSort's perspective
is element 0, even though the calling program considers it to be element
1. Finally, the element number is multiplied times the length of each
element, to determine the value that must be added to the starting address
and offset to obtain the final address for the data in element I. Please
understand that calculations such as these are what the compiler must do
each time you access an array element.
Note that if you call TypeSort incorrectly or give it illegal element
numbers, you will not receive a "Subscript out of range" error from BASIC.
Rather, you will surely crash your PC and have to reboot. This is the
danger--and fun--of manipulating pointers directly.
As I stated earlier, the SwapMem routine that does the actual exchanging
of elements is very simple, and it merely takes a byte from one element and
exchanges it with the corresponding byte in the other. This task is
greatly simplified by the use of the XCHG assembly language command, which
is similar to BASIC's SWAP statement. Although XCHG cannot swap a word in
memory with another word in memory, it can exchange memory with a register.
SwapMem is shown in the listing below.
;SWAPMEM.ASM, swaps two sections of memory
.Model Medium, Basic
.Code
SwapMem Proc Uses SI DI DS ES, Var1:DWord, Var2:DWord, NumBytes:Word
Lds SI,Var1 ;get the segmented address of the
; first variable
Les DI,Var2 ;and for the second variable
Mov CX,NumBytes ;get the number of bytes to exchange
Jcxz Exit ;we can't swap zero bytes!
DoSwap:
Mov AL,ES:[DI] ;get a byte from the second variable
Xchg AL,[SI] ;swap it with the first variable
Stosb ;complete the swap and increment DI
Inc SI ;point to the next source byte
Loop DoSwap ;continue until done
Exit:
Ret ;return to BASIC
SwapMem Endp
End
INDEXED SORTING ON MULTIPLE KEYS
Earlier I showed how to modify the simple Bubble Sort routine to sort a
parallel index array instead of the primary array. One important reason
you might want to do that is to allow access to the primary array in both
its original and sorted order. Another reason, and one we will get to
shortly, is to facilitate sorting disk files. Although a routine to sort
the records in a file could swap the actual data, it takes a long time to
read and write that much data on disk. Further, each time you wanted to
access the data sorted on a different key, the entire file would need to
be sorted again.
A much better solution is to create one or more sorted lists of record
numbers, and store those on disk each in a separate file. This lets you
access the data sorted by name, or by zip code, or by any other field,
without ever changing the actual file. The TypeISort subprogram below is
adapted from TypeSort, and it sorts an index array that holds the element
numbers of a TYPE array.
'TYPISORT.BAS, indexed multi-key sort for TYPE arrays
DEFINT A-Z
DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
BYVAL Adr2, NumBytes)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
BYVAL Adr2, BYVAL Length)
DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _
NumEls, Index())
CONST NumEls% = 23 'this keeps it all on the screen
TYPE MyType
LastName AS STRING * 10
FirstName AS STRING * 10
Dollars AS STRING * 6
Cents AS STRING * 2
END TYPE
REDIM Array(1 TO NumEls%) AS MyType
REDIM Index(1 TO NumEls%) 'create the index array
Offset = 1 'start sorting with LastName
ElSize = LEN(Array(1)) 'the length of each element
KeySize = ElSize 'sort based on all 4 fields
FOR X = 1 TO NumEls% 'build the array from DATA
READ Array(X).LastName
READ Array(X).FirstName
READ Amount$
... 'this continues as already
... ' shown in TypeSort
NEXT
FOR X = 1 TO NumEls% 'initialize the index
Index(X) = X - 1 'but starting with 0
NEXT
Segment = VARSEG(Array(1)) 'show where the array is
Address = VARPTR(Array(1)) ' located in memory
CALL TypeISort(Segment, Address, ElSize, Offset, KeySize, NumEls%, Index())
CLS 'display the results
FOR X = 1 TO NumEls% '+ 1 adjusts to one-based
PRINT Array(Index(X) + 1).LastName,
PRINT Array(Index(X) + 1).FirstName,
PRINT Array(Index(X) + 1).Dollars; ".";
PRINT Array(Index(X) + 1).Cents
NEXT
DATA Smith, John, 123.45 'this continues as already
... ' shown in TypeSort
...
END
SUB TypeISort (Segment, Address, ElSize, Displace, KeySize, NumEls, _
Index()) STATIC
REDIM QStack(NumEls \ 5 + 10) 'create a stack
First = 1 'initialize working variables
Last = NumEls
Offset = Displace - 1 'make zero-based now for speed later
DO
DO
Temp = (Last + First) \ 2 'seek midpoint
I = First
J = Last
DO 'change -1 to 1 and 1 to -1 to sort descending
WHILE Compare3%(Segment, Address + Offset + (Index(I) * ElSize), _
Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = -1
I = I + 1
WEND
WHILE Compare3%(Segment, Address + Offset + (Index(J) * ElSize), _
Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = 1
J = J - 1
WEND
IF I > J THEN EXIT DO
IF I < J THEN
SWAP Index(I), Index(J)
IF Temp = I THEN
Temp = J
ELSEIF Temp = J THEN
Temp = I
END IF
END IF
I = I + 1
J = J - 1
LOOP WHILE I <= J
IF I < Last THEN
QStack(StackPtr) = I 'Push I
QStack(StackPtr + 1) = Last 'Push Last
StackPtr = StackPtr + 2
END IF
Last = J
LOOP WHILE First < Last
IF StackPtr = 0 THEN EXIT DO 'Done
StackPtr = StackPtr - 2
First = QStack(StackPtr) 'Pop First
Last = QStack(StackPtr + 1) 'Pop Last
LOOP
ERASE QStack 'delete the stack array
END SUB
As with TypeSort, TypeISort is entirely pointer based so it can be used
with any type of data and it can sort multiple contiguous keys. The only
real difference is the addition of the Index() array parameter, and the
extra level of indirection needed to access the index array each time a
comparison is made. Also, when a swap is required, only the integer index
elements are exchanged, which simplifies the code and reduces its size.
Like TypeSort, you can change the sort direction by reversing the -1 and
1 values used with Compare3, or add a Direction parameter to the list and
modify the code to use that.
As with BubbleISort, the index array is initialized to increasing values
by the calling program; however, here the first element is set to hold a
value of 0 instead of 1. This reduces the calculations needed within the
routine each time an address must be obtained. Therefore, when TypeISort
returns, the caller must add 1 to the element number held in each index
element. This is shown within the FOR/NEXT loop that displays the sorted
results.
SORTING FILES
With the development of TypeISort complete, we can now use that routine
to sort disk files. The sorting strategy will be to determine how many
records are in the file, to determine how many separate passes are needed
to process the entire file. TypeISort and TypeSort are restricted to
working with arrays no larger than 64K (32K in the editing environment),
so there is a limit as to how much data may be loaded into memory at one
time. These sort routines can accommodate more data when compiled because
address calculations that result in values larger than 32767 cause an
overflow error in the QB editor. This overflow is in fact harmless, and
is ignored in a compiled program unless you use the /d switch.
Although the routines could be modified to perform segment and address
arithmetic to accommodate larger arrays, that still wouldn't solve the
problem of having more records than can fit in memory at once. Therefore,
separate passes must be used to sort the file contents in sections, with
each pass writing a temporary index file to disk. A final merge pass then
reads each index to determine which pieces fits where, and then writes the
final index file. The program FILESORT.BAS below incorporates all of the
sorting techniques shown so far, and includes a few custom BASIC routines
to improve its performance.
'FILESORT.BAS, indexed multi-key random access file sort
DEFINT A-Z
DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
BYVAL Adr2, NumBytes)
DECLARE FUNCTION Exist% (FileSpec$)
DECLARE SUB DOSInt (Registers AS ANY)
DECLARE SUB FileSort (FileName$, NDXName$, RecLength, Offset, KeySize)
DECLARE SUB LoadFile (FileNum, Segment, Address, Bytes&)
DECLARE SUB SaveFile (FileNum, Segment, Address, Bytes&)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _
BYVAL Length)
DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _
NumEls, Index())
RANDOMIZE TIMER 'create new data each run
DEF FnRand% = INT(RND * 10 + 1) 'returns RND from 1 to 10
TYPE RegType 'used by DOSInt
AX AS INTEGER
BX AS INTEGER
CX AS INTEGER
DX AS INTEGER
BP AS INTEGER
SI AS INTEGER
DI AS INTEGER
FL AS INTEGER
DS AS INTEGER
ES AS INTEGER
END TYPE
DIM SHARED Registers AS RegType 'share among all subs
REDIM LastNames$(1 TO 10) 'we'll select names at
REDIM FirstNames$(1 TO 10) ' random from these
NumRecords = 2988 'how many test records to use
FileName$ = "TEST.DAT" 'really original, eh?
NDXName$ = "TEST.NDX" 'this is the index file name
TYPE RecType
LastName AS STRING * 11
FirstName AS STRING * 10
Dollars AS STRING * 6
Cents AS STRING * 2
AnyNumber AS LONG 'this shows that only key
OtherNum AS LONG ' information must be ASCII
END TYPE
FOR X = 1 TO 10 'read the possible last names
READ LastNames$(X)
NEXT
FOR X = 1 TO 10 'and the possible first names
READ FirstNames$(X)
NEXT
DIM RecordVar AS RecType 'to create the sample file
RecLength = LEN(RecordVar) 'the length of each record
CLS
PRINT "Creating a test file..."
IF Exist%(FileName$) THEN 'if there's an existing file
KILL FileName$ 'kill the old data from prior
END IF ' runs to start fresh
IF Exist%(NDXName$) THEN 'same for any old index file
KILL NDXName$
END IF
'---- Create some test data and write it to the file
OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
FOR X = 1 TO NumRecords
RecordVar.LastName = LastNames$(FnRand%)
RecordVar.FirstName = FirstNames$(FnRand%)
Amount$ = STR$(RND * 10000)
Dot = INSTR(Amount$, ".")
IF Dot THEN
RSET RecordVar.Dollars = LEFT$(Amount$, Dot - 1)
RecordVar.Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2)
ELSE
RSET RecordVar.Dollars = Amount$
RecordVar.Cents = "00"
END IF
RecordVar.AnyNumber = X
PUT #1, , RecordVar
NEXT
CLOSE
'----- Created a sorted index based on the main data file
Offset = 1 'start sorting with LastName
KeySize = 29 'sort based on first 4 fields
PRINT "Sorting..."
CALL FileSort(FileName$, NDXName$, RecLength, Offset, KeySize)
'----- Display the results
CLS
VIEW PRINT 1 TO 24
LOCATE 25, 1
COLOR 15
PRINT "Press any key to pause/resume";
COLOR 7
LOCATE 1, 1
OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
OPEN NDXName$ FOR BINARY AS #2
FOR X = 1 TO NumRecords
GET #2, , ThisRecord 'get next rec. number
GET #1, ThisRecord, RecordVar 'then the actual data
PRINT RecordVar.LastName; 'print each field
PRINT RecordVar.FirstName;
PRINT RecordVar.Dollars; ".";
PRINT RecordVar.Cents
IF LEN(INKEY$) THEN 'pause on a keypress
WHILE LEN(INKEY$) = 0: WEND
END IF
NEXT
CLOSE
VIEW PRINT 1 TO 24 'restore the screen
END
DATA Smith, Cramer, Malin, Munro, Passarelli
DATA Bly, Osborn, Pagliaro, Garcia, Winer
DATA John, Phil, Paul, Anne, Jacki
DATA Patricia, Ethan, Donald, Tami, Elli
END
FUNCTION Exist% (Spec$) STATIC 'reports if a file exists
DIM DTA AS STRING * 44 'the work area for DOS
DIM LocalSpec AS STRING * 60 'guarantee the spec is in
LocalSpec$ = Spec$ + CHR$(0) ' DGROUP for BASIC PDS
Exist% = -1 'assume true for now
Registers.AX = &H1A00 'assign DTA service
Registers.DX = VARPTR(DTA) 'show DOS where to place it
Registers.DS = VARSEG(DTA)
CALL DOSInt(Registers)
Registers.AX = &H4E00 'find first matching file
Registers.CX = 39 'any file attribute okay
Registers.DX = VARPTR(LocalSpec)
Registers.DS = VARSEG(LocalSpec)
CALL DOSInt(Registers) 'see if there's a match
IF Registers.FL AND 1 THEN 'if the Carry flag is set
Exist% = 0 ' there were no matches
END IF
END FUNCTION
SUB FileSort (FileName$, NDXName$, RecLength, Displace, KeySize) STATIC
CONST BufSize% = 32767 'holds the data being sorted
Offset = Displace - 1 'make zero-based for speed later
'----- Open the main data file
FileNum = FREEFILE
OPEN FileName$ FOR BINARY AS #FileNum
'----- Calculate the important values we'll need
NumRecords = LOF(FileNum) \ RecLength
RecsPerPass = BufSize% \ RecLength
IF RecsPerPass > NumRecords THEN RecsPerPass = NumRecords
NumPasses = (NumRecords \ RecsPerPass) - ((NumRecords MOD RecsPerPass) _
<> 0)
IF NumPasses = 1 THEN
RecsLastPass = RecsPerPass
ELSE
RecsLastPass = NumRecords MOD RecsPerPass
END IF
'----- Create the buffer and index sorting arrays
REDIM Buffer(1 TO 1) AS STRING * BufSize
REDIM Index(1 TO RecsPerPass)
IndexAdjust = 1
'----- Process all of the records in manageable groups
FOR X = 1 TO NumPasses
IF X < NumPasses THEN 'if not the last pass
RecsThisPass = RecsPerPass 'do the full complement
ELSE 'the last pass may have
RecsThisPass = RecsLastPass ' fewer records to do
END IF
FOR Y = 1 TO RecsThisPass 'initialize the index
Index(Y) = Y - 1 'starting with value of 0
NEXT
'----- Load a portion of the main data file
Segment = VARSEG(Buffer(1)) 'show where the buffer is
CALL LoadFile(FileNum, Segment, Zero, RecsThisPass * CLNG(RecLength))
CALL TypeISort(Segment, Zero, RecLength, Displace, KeySize, _
RecsThisPass, Index())
'----- Adjust the zero-based index to record numbers
FOR Y = 1 TO RecsThisPass
Index(Y) = Index(Y) + IndexAdjust
NEXT
'----- Save the index file for this pass
TempNum = FREEFILE
OPEN "$$PASS." + LTRIM$(STR$(X)) FOR OUTPUT AS #TempNum
CALL SaveFile(TempNum, VARSEG(Index(1)), Zero, RecsThisPass * 2&)
CLOSE #TempNum
'----- The next group of record numbers start this much higher
IndexAdjust = IndexAdjust + RecsThisPass
NEXT
ERASE Buffer, Index 'free up the memory
'----- Do a final merge pass if necessary
IF NumPasses > 1 THEN
NDXNumber = FREEFILE
OPEN NDXName$ FOR BINARY AS #NDXNumber
REDIM FileNums(NumPasses) 'this holds the file numbers
REDIM RecordNums(NumPasses) 'this holds record numbers
REDIM MainRec$(1 TO NumPasses) 'holds main record data
REDIM Remaining(1 TO NumPasses) 'tracks index files
'----- Open the files and seed the first round of data
FOR X = 1 TO NumPasses
FileNums(X) = FREEFILE
OPEN "$$PASS." + LTRIM$(STR$(X)) FOR BINARY AS #FileNums(X)
Remaining(X) = LOF(FileNums(X)) 'this is what remains
MainRec$(X) = SPACE$(RecLength) 'holds main data file
GET #FileNums(X), , RecordNums(X) 'get the next record number
RecOffset& = (RecordNums(X) - 1) * CLNG(RecLength) + 1
GET #FileNum, RecOffset&, MainRec$(X) 'then get the data
NEXT
FOR X = 1 TO NumRecords
Lowest = 1 'assume this is the lowest data in the group
WHILE Remaining(Lowest) = 0 'Lowest can't refer to a dead index
Lowest = Lowest + 1 'so seek to the next higher active index
WEND
FOR Y = 2 TO NumPasses 'now seek out the truly lowest element
IF Remaining(Y) THEN 'consider only active indexes
IF Compare3%(SSEG(MainRec$(Y)), _ '<-- use VARSEG with QB
SADD(MainRec$(Y)) + Offset, _
SSEG(MainRec$(Lowest)), _ '<-- use VARSEG with QB
SADD(MainRec$(Lowest)) + Offset, KeySize) = -1 THEN
Lowest = Y
END IF
END IF
NEXT
PUT #NDXNumber, , RecordNums(Lowest) 'write the main index
Remaining(Lowest) = Remaining(Lowest) - 2
IF Remaining(Lowest) THEN 'if the index is still active
GET #FileNums(Lowest), , RecordNums(Lowest)
RecOffset& = (RecordNums(Lowest) - 1) * CLNG(RecLength) + 1
GET #FileNum, RecOffset&, MainRec$(Lowest)
END IF
NEXT
ELSE
'----- Only one pass was needed so simply rename the index file
NAME "$$PASS.1" AS NDXName$
END IF
CLOSE 'close all open files
IF Exist%("$$PASS.*") THEN 'ensure there's a file to kill
KILL "$$PASS.*" 'kill the work files
END IF
ERASE FileNums, RecordNums 'erase the work arrays
ERASE MainRec$, Remaining
END SUB
SUB LoadFile (FileNum, Segment, Address, Bytes&) STATIC
IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536
Registers.AX = &H3F00 'read from file service
Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle
Registers.CX = Bytes& 'how many bytes to load
Registers.DX = Address 'and at what address
Registers.DS = Segment 'and at what segment
CALL DOSInt(Registers)
END SUB
SUB SaveFile (FileNum, Segment, Address, Bytes&) STATIC
IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536
Registers.AX = &H4000 'write to file service
Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle
Registers.CX = Bytes& 'how many bytes to load
Registers.DX = Address 'and at what address
Registers.DS = Segment 'and at what segment
CALL DOSInt(Registers)
END SUB
SUB TypeISort (....) STATIC 'as shown in TYPISORT.BAS
END SUB
FILESORT.BAS begins by defining a function that returns a random number
between 1 and 10. Although the earlier sort demonstrations simply read the
test data from DATA statements, that is impractical when creating thousands
of records. Instead, two arrays are filled--one with ten last names and
another with ten first names--and these names are drawn from at random.
The Registers TYPE variable that is defined is used by three of the
supporting routines in this program. RegType is normally associated with
CALL Interrupt and InterruptX, but I have written a small-code replacement
to mimic InterruptX that works with DOS Interrupt &H21 only. DOSInt
accepts just a single Registers argument, instead of the three parameters
that BASIC's Interrupt and InterruptX require. Besides adding less code
each time it is used, the routine itself is smaller and simpler than
InterruptX.
The remainder of the demonstration program should be easy to follow, so
I won't belabor its operation; the real action is in the FileSort
subprogram.
Like TypeSort and TypeISort, FileSort is entirely pointer based, to
accommodate TYPE elements of any size and structure. You provide the name
of the main data file to be sorted, the name of an index file to create,
and the length and offset of the keys within the disk records. The
Displace parameter tells how far into the TYPE structure the key
information is located. When calling TypeISort this value is should be
one-based, but in the final merge pass where Compare3 is used, a zero-based
number is required. Therefore, a copy is made (Offset = Displace - 1) near
the beginning of the routine. This way, both are available quickly without
having to calculate - 1 repeatedly slowing its operation.
The initial steps FileSort performs are to determine how many records
are in the data file, and from that how many records can fit into memory
at one time. Once these are known, the number of passes necessary can be
easily calculated. An extra step is needed to ensure that RecsPerPass is
not greater than the number of records in the file. Just because 200
records can fit into memory at once doesn't mean there are that many
records. In most cases where multiple passes are needed the last pass will
process fewer records than the others. If there are, say, 700 records and
each pass can sort 300, the last pass will sort only 100 records.
Once the pass information is determined, a block of memory is created
to hold each portion of the file for sorting. This is the purpose of the
Buffer array. REDIM is used to create a 32K chunk of memory that doesn't
impinge on available string space.
For each pass that is needed, the number of records in the current pass
is determined and the index array is initialized to increasing values.
Then, a portion of the main data file is read using the LoadFile
subprogram. BASIC does not allow you to read records from a random access
file directly into a buffer specified by its address. And even if it did,
we can load data much faster than pure BASIC by reading a number of records
all at once.
Once the current block of records has been loaded, TypeISort is called
to sort the index array. The index array is also saved very quickly using
SaveFile, which is the compliment to LoadFile. A unique name is given to
each temporary index file such that the first one is named $$PASS.1, the
second $$PASS.2, and so forth. By using dollar signs in the name it is
unlikely that the routine will overwrite an existing file from another
application. Of course, you may change the names to anything else if you
prefer.
Notice the extra step that manipulates the IndexAdjust variable. This
adjustment is needed because each sort pass returns the index array holding
record numbers starting at 0. The first time through, 1 must be added to
each element to reflect BASIC's use of record numbers that start at 1. If
the first pass sorts, say, 250 records, then the index values 1 through 250
are saved to disk. But the second pass is processing records 251 through
500, so an adjustment value of 251 must be added to each element prior to
writing it to disk.
If the data file is small and only one pass was needed, the $$PASS.1
file is simply renamed to whatever the caller specified. Otherwise, a
merge pass is needed to determine which record number is the next in
sequence based on the results of each pass. Believe it or not, this is the
trickiest portion of the entire program. For the sake of discussion, we'll
assume that four passes were required to sort the file.
Each of the four index files contains a sequence of record numbers, and
all of the records within that sequence are in sorted order. However,
there is no relationship between the data records identified in one index
file and those in another. Thus, each index file and corresponding data
record must be read in turn. A FOR/NEXT loop then compares each of the
four records, to see which is truly next in the final sequence. The
complication arises as the merge nears completion, because some of the
indexes will have become exhausted. This possibility is handled by the
Remaining array.
Elements in the Remaining array are initialized to the length of each
index file as the indexes are opened. Then, as each index entry is read
from disk, the corresponding element is decremented by two to show that
another record number was read. Therefore, the current Remaining element
must be checked to see if that index has been exhausted. Otherwise, data
that was already processed might be considered in the merge comparisons.
The final steps are to close all the open files, delete the temporary
index files, and erase the work arrays to free the memory they occupied.
One important point to observe is the use of SSEG to show Compare3 where
the MainRec$ elements are located. SSEG is for BASIC 7 only; if you are
using QuickBASIC you must change SSEG to VARSEG. SSEG can be used with
either near or far strings in BASIC 7, but VARSEG works with near strings
only. SSEG is used as the default, so an error will be reported if you are
using QuickBASIC. The cursor will then be placed near the comment in the
program that shows the appropriate correction.
SEARCHING FUNDAMENTALS
======================
As with sorting, searching data effectively also requires that you select
an appropriate algorithm. There are many ways to search data, and we will
look at several methods here. The easiest to understand is a linear
search, which simply examines each item in sequence until a match is found:
FoundAt = 0 'assume no match
FOR X = 1 TO NumElements 'search all elements
IF Array$(X) = Sought$ THEN
FoundAt = X 'remember where it is
EXIT FOR 'no need to continue
END IF
NEXT
IF FoundAt THEN 'if it was found
PRINT "Found at element"; FoundAt
ELSE
PRINT "Not found" 'otherwise
END IF
For small arrays a linear search is effective and usually fast enough.
Also, integer and long integer arrays can be searched reasonably quickly
even if there are many elements. But with string data, as the number of
elements that must be searched increases, the search time can quickly
become unacceptable. This is particularly true when additional features
are required such as searching without regard to capitalization or
comparing only a portion of each element using MID$. Indeed, many of the
same techniques that enhance a sort routine can also be employed when
searching.
To search ignoring capitalization you would first capitalize Sought$
outside of the loop, and then use UCASE$ with each element in the
comparisons. Using UCASE$(Sought$) repeatedly within the loop is both
wasteful and unnecessary:
Sought$ = UCASE$(Sought$)
.
.
IF UCASE$(Array$(X)) = Sought$ THEN
Likewise, comparing only a portion of each string will require MID$ with
each comparison, after using MID$ initially to extract what is needed from
Sought$:
Sought$ = MID$(Sought$, 12, 6)
.
.
IF MID$(Array$(X), 12, 6) = Sought$ THEN
And again, as with sorting, these changes may be combined in a variety of
ways. You could even use INSTR to see if the string being searched for
is within the array, when an exact match is not needed:
IF INSTR(UCASE$(Array$(X)), Sought$) THEN
However, each additional BASIC function you use will make the searching
slower and slower. Although BASIC's INSTR is very fast, adding UCASE$ to
each comparison as shown above slows the overall process considerably.
There are three primary ways that searching can be speeded up. One is
to apply simple improvements based on understanding how BASIC works, and
knowing which commands are fastest. The other is to select a better
algorithm. The third is to translate selected portions of the search
routine into assembly language. I will use all three of these techniques
here, starting with enhancements to the linear search, and culminating with
a very fast binary search for use with sorted data.
One of the slowest operations that BASIC performs is comparing strings.
For each string, its descriptor address must be loaded and passed to the
comparison routine. That routine must then obtain the actual data address,
and examine each byte in both strings until one of the characters is
different, or it determines that both strings are the same. As I mentioned
earlier, if one or both of the strings are fixed-length, then copies also
must be made before the comparison can be performed.
There is another service that the string comparison routine must
perform, which is probably not obvious to most programmers and which also
impacts its speed. BASIC frequently creates and then deletes temporary
strings without your knowing it. One example is the copy it makes of
fixed-length strings before comparing them. But there are other, more
subtle situations in which this can happen.
For example, when you use IF X$ + Y$ > Z$ BASIC must create a temporary
string comprised of X$ + Y$, and then pass that to the comparison routine.
Therefore, that routine is also responsible for determining if the incoming
string is a temporary copy, and deleting it if so. In fact, all of BASIC's
internal routines that accept string arguments are required to do this.
Therefore, one good way to speed searching of conventional (not fixed-
length) string arrays is to first compare the lengths. Since strings whose
lengths are different can't possibly be the same, this will quickly weed
those out. BASIC's LEN function is much faster than its string compare
routine, and it offers a simple but effective opportunity to speed things
up. LEN is made even faster because it requires only a single argument,
as opposed to the two required for the comparison routine.
SLen = LEN(Sought$) 'do this once outside the loop
FOR X = 1 TO NumElements
IF LEN(Array$(X)) = SLen THEN 'maybe...
IF Array$(X) = Sought$ THEN 'found it!
FoundAt = X
EXIT FOR
END IF
END IF
NEXT
Similarly, if the first characters are not the same then the strings can't
match either. Like LEN, BASIC's ASC is much faster than the full string
comparison routine, and it too can improve search time by eliminating
elements that can't possibly match. Depending on the type and distribution
of the data in the array, using both LEN and ASCII can result in a very
fast linear search:
SLen = LEN(Sought$)
SAsc = ASC(Sought$)
FOR X = 1 TO NumElements
IF LEN(Array$(X)) = SLen THEN
IF ASC(Array$(X)) = SAsc THEN
IF Array$(X) = Sought$ THEN
...
END IF
END IF
END IF
NEXT
Notice that the LEN test must always be before the ASC test, to avoid an
"Illegal function call" error if the array element is a null string. If
all or most of the strings are the same length, then LEN will not be
helpful, and ASC should be used alone.
As I mentioned before, when comparing fixed-length string arrays BASIC
makes a copy of each element into a conventional string, prior to calling
its comparison routine. This copying is also performed when using ASC is
used, but not LEN. After all, the length of a fixed-length never changes,
and BASIC is smart enough to know the length directly. But then, comparing
the lengths of these string is pointless anyway.
Because of the added overhead to make these copies, the performance of
a conventional linear search for fixed-length data is generally quite poor.
This is a shame, because fixed-length strings are often the only choice
when as much data as possible must be kept in memory at once. And fixed-
length strings lend themselves perfectly to names and addresses. It should
be apparent by now that the best solution for quickly comparing fixed-
length string arrays--and the string portion of TYPE arrays too--is with
the various Compare functions already shown.
If you are searching for an exact match, then either Compare or Compare2
will be ideal, depending on whether you want to ignore capitalization. If
you have only a single string element in each array, you should define a
dummy TYPE. This avoids the overhead of having to use both VARSEG and
VARPTR as separate arguments. The short example program and SearchType
functions that follow search a fixed-length string array for a match.
DEFINT A-Z
DECLARE FUNCTION Compare% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes)
DECLARE FUNCTION Compare2% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes)
DECLARE FUNCTION SearchType% (Array() AS ANY, Sought AS ANY)
DECLARE FUNCTION SearchType2% (Array() AS ANY, Sought AS ANY)
DECLARE FUNCTION SearchType3% (Array() AS ANY, Searched AS ANY)
CLS
TYPE FLen 'this lets us use SEG
LastName AS STRING * 15
END TYPE
REDIM Array(1 TO 4000) AS FLen '4000 is a lot of names
DIM Search AS FLen 'best comparing like data
FOR X = 1 TO 4000 STEP 2 'impart some realism
Array(X).LastName = "Henderson"
NEXT
Array(4000).LastName = "Henson" 'almost at the end
Search.LastName = "Henson" 'find the same name
'----- first time how long it takes using Compare
Start! = TIMER 'start timing
FOR X = 1 TO 5 'search five times
FoundAt = SearchType%(Array(), Search)
NEXT
IF FoundAt >= 0 THEN
PRINT "Found at element"; FoundAt
ELSE
PRINT "Not found"
END IF
Done! = TIMER
PRINT USING "##.## seconds with Compare"; Done! - Start!
PRINT
'----- then time how long it takes using Compare2
Start! = TIMER 'start timing
FOR X = 1 TO 5 'as above
FoundAt = SearchType2%(Array(), Search)
NEXT
IF FoundAt >= 0 THEN
PRINT "Found at element"; FoundAt
ELSE
PRINT "Not found"
END IF
Done! = TIMER
PRINT USING "##.## seconds with Compare2"; Done! - Start!
PRINT
'---- finally, time how long it takes using pure BASIC
Start! = TIMER
FOR X = 1 TO 5
FoundAt = SearchType3%(Array(), Search)
NEXT
IF FoundAt >= 0 THEN
PRINT "Found at element"; FoundAt
ELSE
PRINT "Not found"
END IF
Done! = TIMER
PRINT USING "##.## seconds using BASIC"; Done! - Start!
END
FUNCTION SearchType% (Array() AS FLen, Sought AS FLen) STATIC
SearchType% = -1 'assume not found
FOR X = LBOUND(Array) TO UBOUND(Array)
IF Compare%(Array(X), Sought, LEN(Sought)) THEN
SearchType% = X 'save where it was found
EXIT FOR 'and skip what remains
END IF
NEXT
END FUNCTION
FUNCTION SearchType2% (Array() AS FLen, Sought AS FLen) STATIC
SearchType2% = -1 'assume not found
FOR X = LBOUND(Array) TO UBOUND(Array)
IF Compare2%(Array(X), Sought, LEN(Sought)) THEN
SearchType2% = X 'save where it was found
EXIT FOR 'and skip what remains
END IF
NEXT
END FUNCTION
FUNCTION SearchType3% (Array() AS FLen, Searched AS FLen) STATIC
SearchType3% = -1 'assume not found
FOR X = LBOUND(Array) TO UBOUND(Array)
IF Array(X).LastName = Searched.LastName THEN
SearchType3% = X 'save where it was found
EXIT FOR 'and skip what remains
END IF
NEXT
END FUNCTION
When you run this program it will be apparent that the SearchType function
is the fastest, because it uses Compare which doesn't perform any case
conversions. SearchType2 is only slightly slower with that added overhead,
and the purely BASIC function, SearchType3, lags far behind at half the
speed. Note that the array is searched five times in succession, to
minimize the slight errors TIMER imposes. Longer timings are generally
more accurate than short ones, because of the 1/18th second resolution of
the PC's system timer.
BINARY SEARCHES
This is about as far as we can go using linear searching, and to achieve
higher performance requires a better algorithm. The Binary Search is one
of the fastest available; however, it requires the data to already be in
sorted order. A Binary Search can also be used with a sorted index, and
both methods will be described.
Binary searches are very fast, and also very simple to understand.
Unlike the Quick Sort algorithm which achieves great efficiency at the
expense of being complicated, a Binary Search can be written using only a
few lines of code. The strategy is to start the search at the middle of
the array. If the value of that element value is less than that of the
data being sought, a new halfway point is checked and the process repeated.
This way, the routine can quickly zero in on the value being searched for.
Figure 8-3 below shows how this works.
13: Zambia
12: Sweden
11: Peru
10: Mexico <-- step 2
9: Holland
8: Germany
7: Finland <-- step 1
6: England
5: Denmark
4: China
3: Canada
2: Austria
1: Australia
Figure 8-3: How a Binary Search locates data in a sorted array.
If you are searching for Mexico, the first element examined is number 7,
which is halfway through the array. Comparing Mexico to Finland shows
that Mexico is greater, so the distance is again cut in half. In this
case, a match was found after only two tries--remarkably faster than a
linear search that would have required ten comparisons. Even when huge
arrays must be searched, data can often be found in a dozen or so tries.
One interesting property of a binary search is that it takes no longer to
find the last element in the array than the first one.
The program below shows one way to implement a Binary Search.
DEFINT A-Z
DECLARE FUNCTION BinarySearch% (Array$(), Find$)
CLS
PRINT "Creating test data..."
REDIM Array$(1 TO 1000) 'create a "sorted" array
FOR X = 1 TO 1000
Array$(X) = "String " + RIGHT$("000" + LTRIM$(STR$(X)), 4)
NEXT
PRINT "Searching array..."
FoundAt = BinarySearch%(Array$(), "String 0987")
IF FoundAt >= 0 THEN
PRINT "Found at element"; FoundAt
ELSE
PRINT "Not found"
END IF
END
FUNCTION BinarySearch% (Array$(), Find$) STATIC
BinarySearch% = -1 'no matching element yet
Min = LBOUND(Array$) 'start at first element
Max = UBOUND(Array$) 'consider through last
DO
Try = (Max + Min) \ 2 'start testing in middle
IF Array$(Try) = Find$ THEN 'found it!
BinarySearch% = Try 'return matching element
EXIT DO 'all done
END IF
IF Array$(Try) > Find$ THEN 'too high, cut in half
Max = Try - 1
ELSE
Min = Try + 1 'too low, cut other way
END IF
LOOP WHILE Max >= Min
END FUNCTION
The BinarySearch function returns either the element number where a match
was found, or -1 if the search string was not found. Not using a value of
zero to indicate failure lets you use arrays that start with element number
0. As you can see, the simplicity of this algorithm belies its incredible
efficiency. The only real problem is that the data must already be in
sorted order. Also notice that two string comparisons must be made--one
to see if the strings are equal, and another to see if the current element
is too high. Although you could use Compare3 which examines the strings
once and tells if the data is the same or which is greater, a Binary Search
is so fast that this probably isn't worth the added trouble. As you will
see when you run the test program, it takes far longer to create the data
than to search it!
Besides the usual enhancements that can be applied to the comparisons
using UCASE$ or MID$, this function could also be structured to use a
parallel index array. Assuming the data is not sorted but the index array
is, the modified Binary Search would look like this:
FUNCTION BinaryISearch% (Array$(), Index(), Find$) STATIC
BinaryISearch% = -1 'assume not found
Min = LBOUND(Array$) 'start at first element
Max = UBOUND(Array$) 'consider through last
DO
Try = (Max + Min) \ 2 'start testing in middle
IF Array$(Index(Try)) = Find$ THEN 'found it!
BinaryISearch% = Try 'return matching element
EXIT DO 'all done
END IF
IF Array$(Index(Try)) > Find$ THEN 'too high, cut
Max = Try - 1
ELSE
Min = Try + 1 'too low, cut other way
END IF
LOOP WHILE Max >= Min
END FUNCTION
NUMERIC ARRAYS
All of the searching techniques considered so far have addressed string
data. In most cases, string array searches are the ones that will benefit
the most from improved techniques. As you have already seen, BASIC makes
copies of fixed-length strings before comparing them, which slows down
searching. And the very nature of strings implies that many bytes may have
to be compared before determining if they are equal or which string is
greater. In most cases, searching a numeric array is fast enough without
requiring any added effort, especially when the data is integer or long
integer.
However, a few aspects of numeric searching are worth mentioning here.
One is avoiding the inevitable rounding errors that are sure to creep into
the numbers you are examining. Another is that in many cases, you may not
be looking for an exact match. For example, you may need to find the first
element that is higher than a given value, or perhaps determine the
smallest value in an array.
Unlike strings that are either the same or they aren't, the binary
representation of numeric values is not always so precise. Consider the
following test which *should* result in a match, but doesn't.
Value! = 1!
Result! = 2!
CLS
FOR X = 1 TO 1000
Value! = Value! + .001
NEXT
IF Value! = Result! THEN
PRINT "They are equal"
ELSE
PRINT "Value! ="; Value!
PRINT "Result! ="; Result!
END IF
After adding .001 to Value! 1000 times Value! should be equal to 2, but
instead it is slightly higher. This is because the binary storage method
used by computers simply cannot represent every possible value with
absolute accuracy. Even changing all of the single precision exclamation
points (!) to double precision pound signs (#) will not solve the problem.
Therefore, to find a given value in a numeric array can require some extra
trickery.
What is really needed is to determine if the numbers are *very close* to
each other, as opposed to exactly the same. One way to accomplish this is
to subtract the two, and see if the result is very close to zero. This is
shown below.
Value! = 1!
Result! = 2!
CLS
FOR X = 1 TO 1000
Value! = Value! + .001
NEXT
IF ABS(Value! - Result!) < .0001 THEN
PRINT "They are equal"
ELSE
PRINT "Value! ="; Value!
PRINT "Result! ="; Result!
END IF
Here, the absolute value of the difference between the numbers is examined,
and if that difference is very small the numbers are assumed to be the
same. Unfortunately, the added overhead of subtracting before comparing
slows the comparison even further. There is no simple cure for this, and
an array search must apply this subtraction to each element that is
examined.
Another common use for numeric array searches is when determining the
largest or smallest value. Many programmers make the common mistake shown
below when trying to find the largest value in an array.
MaxValue# = 0
FOR X = 1 TO NumElements
IF Array#(X) > MaxValue# THEN
MaxValue# = Array#(X)
Element = X
END IF
NEXT
PRINT "The largest value found is"; MaxValue#
PRINT "And it was found at element"; Element
The problem with this routine is that it doesn't account for arrays where
all of the elements are negative numbers! In that case no element will be
greater than the initial MaxValue#, and the routine will incorrectly report
zero as the result. The correct method is to obtain the lowest element
value, and use that as a starting point:
MaxValue# = Array#(1)
FOR X = 2 TO NumElements
IF Array#(X) > MaxValue# THEN
MaxValue# = Array#(X)
END IF
NEXT
PRINT "The largest value found is"; MaxValue#
Determining the highest value in an array would be handled similarly,
except the greater-than symbol (>) would be replaced with a less-than
operator (<).
SOUNDEX
The final searching technique I will show is Soundex. It is often useful
to search for data based on its sound, for example when you do not know how
to spell a person's name. Soundex was invented in the 1920's and has been
used since then by, among others, the U.S. Census Bureau. A Soundex code
is an alpha-numeric representation of the sound of a word, and it is
surprisingly accurate despite its simplicity. The classic implementation
of Soundex returns a four-character result code. The first character is
the same as the first letter of the word, and the other three are numeric
digits coded as shown in Figure 8-4.
1 B, F, P, V
2 C, G, J, K, Q, S, X
3 D, T
4 L
5 M, N
6 R
Figure 8-4: The Soundex code numbers returned for significant letters of
the alphabet.
Letters not shown are simply skipped as being statistically insignificant
to the sound of the word. In particular, speaking accents often minimize
the importance of vowels, and blur their distinction. If the string is
short and there are fewer than four digits, the result is simply padded
with trailing zeros. One additional rule is that a code digit is never
repeated, unless there is an uncoded letter in between. In the listing
that follows, two different implementations of Soundex are shown.
'SOUNDEX.BAS, Soundex routines and example
DEFINT A-Z
DECLARE FUNCTION ASoundex$ (Word$)
DECLARE FUNCTION ISoundex% (Word$)
CLS
DO
PRINT "press Enter alone to exit"
INPUT "What is the first word"; FWord$
IF LEN(FWord$) = 0 THEN EXIT DO
INPUT "What is the second word"; SWord$
PRINT
'Test by alpha-numeric soundex
PRINT "Alpha-Numeric Soundex: "; FWord$; " and ";
PRINT SWord$; " do ";
IF ASoundex$(FWord$) <> ASoundex$(SWord$) THEN
PRINT "NOT ";
END IF
PRINT "sound the same."
PRINT
'Test by numeric soundex
PRINT " Numeric Soundex: "; FWord$; " and ";
PRINT SWord$; " do ";
IF ISoundex%(FWord$) <> ISoundex%(SWord$) THEN
PRINT "NOT ";
END IF
PRINT "sound the same."
PRINT
LOOP
END
FUNCTION ASoundex$ (InWord$) STATIC
Word$ = UCASE$(InWord$)
Work$ = LEFT$(Word$, 1) + "000"
WkPos = 2
PrevCode = 0
FOR L = 2 TO LEN(Word$)
Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1))
IF Temp THEN
Temp = ASC(MID$("111122222222334556", Temp, 1))
IF Temp <> PrevCode THEN
MID$(Work$, WkPos) = CHR$(Temp)
PrevCode = Temp
WkPos = WkPos + 1
IF WkPos > 4 THEN EXIT FOR
END IF
ELSE
PrevCode = 0
END IF
NEXT
ASoundex$ = Work$
END FUNCTION
FUNCTION ISoundex% (InWord$) STATIC
Word$ = UCASE$(InWord$)
Work$ = "0000"
WkPos = 1
PrevCode = 0
FOR L = 1 TO LEN(Word$)
Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1))
IF Temp THEN
Temp = ASC(MID$("111122222222334556", Temp, 1))
IF Temp <> PrevCode THEN
MID$(Work$, WkPos) = CHR$(Temp)
PrevCode = Temp
WkPos = WkPos + 1
IF WkPos > 4 THEN EXIT FOR
END IF
ELSE
PrevCode = 0
END IF
NEXT
ISoundex% = VAL(Work$)
END FUNCTION
The first function, ASoundex, follows the standard Soundex definition and
returns the result as a string. The ISoundex version cheats slightly by
coding the first letter as a number, but it returns an integer value
instead of a string. Because integer searches are many times faster than
string searches, this version will be better when thousands--or even
hundreds of thousands--of names must be examined.
An additional benefit of the integer-only method is that it allows for
variations on the first letter. For example, if you enter Cane and Kane
in response to the prompts from SOUNDEX.BAS ASoundex will not recognize the
names as sounding alike where ISoundex will.
LINKED DATA
===========
No discussion of searching and sorting would be complete without a mention
of linked lists and other data links. Unlike arrays where all of the
elements lie in adjacent memory locations, linked data is useful when data
locations may be disjointed. One example is the linked list used by the
DOS File Allocation Table (FAT) on every disk. As I described in Chapter
6, the data in each file may be scattered throughout the disk, and only
through a linked list can DOS follow the thread from one sector in a file
to another.
Another example where linked data is useful--and the one we will focus
on here--is to keep track of memo fields in a database. A memo field is
a field that can store freeform text such as notes about a sales contact
or a patient's medical history. Since these fields typically require
varying lengths, it is inefficient to reserve space for the longest one
possible in the main database file. Therefore, most programs store memo
fields in a separate disk file, and use a *pointer field* in the main data
file to show where the corresponding memo starts in the dedicated memo
file. Similarly, a back pointer adjacent to each memo identifies the
record that points to it. This is shown in Figure 8-5 below.
┌────────┬─────────┬─╥────────┬─────────┬──╥─────────
MAIN.DAT │LastName│FirstName│1║LastName│FirstName│73║LastNa...
└────────┴─────────┴─╨────────┴─────────┴──╨─────────
^ ^
| |
pointers into memo file -----+--------------------+
(forward pointers)
offsets into --+----------+-------+--------------+
this memo file | | | |
1 73 126 233
┌──────────┬───────┬──────────────┬─────────┐
MEMO.DAT │1LMemo1 │2LMemo2│3LMemo3 │4LMemo4 │
└──────────┴───────┴──────────────┴─────────┘
^ ^ ^ ^
| | | |
record numbers +----------+-------+--------------+
(back pointers)
(L = length of this memo)
Figure 8-5: Pointers relate record numbers to memo file offsets and vice
versa.
Here, the pointer in the main data file record is a long integer that holds
the byte offset into the memo file where the corresponding memo text
begins. And just before the memo text is an integer record number that
shows which record this memo belongs to. (If you anticipate more than
65,535 records a long integer must be used instead.) Thus, these pointers
provide links between the two files, and relate the information they
contain.
When a new record is added to the main file, the memo that goes with it
is appended to the end of the memo file. BASIC's LOF function can be used
to determine the current end of the memo file, which is then used as the
beginning offset for the new memo text. And as the new memo is appended
to MEMO.DAT, the first data actually written is the number of the new
record in the main data file.
The record number back pointer in the memo file is needed to allow memo
data to be edited. Since there's no reasonable way to extend memo text
when other memo data follows it, most programs simply abandon the old text,
and allocate new space at the end of the file. The abandoned text is then
marked as such, perhaps by storing a negative value as the record number.
Storing a negative version of the abandoned data's length is ideal, because
that both identifies the data as obsolete, and also tells how much farther
into the file the next memo is located.
The idea here is that you would periodically run a memo file maintenance
program that compacts the file, thus eliminating the wasted space the
abandoned memos occupy. This is similar to the DBPACK.BAS utility shown
in Chapter 7, and also similar to the way that BASIC compacts string memory
when it becomes full. But when an existing memo is relocated in the memo
file, the field in the main data file that points to the memo must also be
updated. And that's why the record number back pointer is needed: so the
compaction program can know which record in the main file must be updated.
The "L" identifier in the memo file in Figure 8-5, shown between the
record number and memo text, is a length byte or word that tells how long
the text is. If you plan to limit the memo field lengths to 255 or fewer
characters, then a single byte is sufficient. Otherwise an integer must
be used. An example of code that reads a data record and then its
associated memo text is shown below.
GET #MainFile, RecNumber, TypeVar
MemoOffset& = TypeVar.MemoOff
GET #MemoFile, MemoOffset& + 2, MemoLength%
Memo$ = SPACE$(MemoLength%)
GET #MemoFile, , Memo$
The first step reads a record from the main data file into a TYPE variable,
and the second determines where in the memo file the memo text begins. Two
is added to that offset in the second GET statement, to skip over the
record number back pointer which isn't needed here. Once the length of the
memo text is known, a string is assigned to that length, and the actual
text is read into it.
If you are using long integer record numbers you would of course use
MemoOffset& + 4 in the second GET. And if you're using a single byte to
hold the memo length you would define a fixed-length string to receive
that byte:
DIM Temp AS STRING *1
GET #MemoFile, MemoOffset& + 2, Temp
MemoLength = ASC(Temp)
Since BASIC doesn't offer a byte-sized integer data type, ASC and STR$ can
be used to convert between numeric and string formats.
ARRAY ELEMENT INSERTION AND DELETION
====================================
The last issue related to array and memory manipulation I want to cover
is inserting and deleting elements. If you intend to maintain file indexes
or other information in memory and in sorted order, you will need some way
to insert a new entry. By the same token, deleting an entry in a database
requires that the parallel index entry also be deleted.
The most obvious way to insert or delete elements in an array is with
a FOR/NEXT loop. The first example below inserts an element, and the
second deletes one.
'----- Insert an element:
Element = 200
InsertValue = 999
FOR X = UBOUND(Array) TO Element + 1 STEP -1
Array(X) = Array(X - 1)
NEXT
Array(Element) = InsertValue
'----- Delete an element:
Element = 200
FOR X = Element TO UBOUND(Array) - 1
Array(X) = Array(X + 1)
NEXT
Array(UBOUND(Array)) = 0 'optionally clear last element
For integer, long integer, and fixed-length arrays this is about as
efficient as you can get, short of rewriting the code in assembly language.
However, with floating point and string arrays the performance is less than
ideal. Unless a numeric coprocessor is installed, floating point values
are assigned using interrupts and support code in the emulator library.
This adds an unnecessary level of complication that also impacts the speed.
When strings are assigned the situation is even worse, because of the
memory allocation overhead associated with dynamic string management.
A better solution for floating point and string arrays is a series of
SWAP statements. The short program below benchmarks the speed difference
of the two methods, as it inserts an element into a single precision array.
REDIM Array(1 TO 500)
CLS
Element% = 200
InsertValue = 999
Start = TIMER
FOR A% = 1 TO 500
FOR X% = UBOUND(Array) TO Element% + 1 STEP -1
Array(X%) = Array(X% - 1)
NEXT
Array(Element%) = InsertValue
NEXT
Done = TIMER
PRINT USING "##.## seconds when assigning"; Done - Start
Start = TIMER
FOR A% = 1 TO 500
FOR X% = UBOUND(Array) TO Element% + 1 STEP -1
SWAP Array(X%), Array(X% - 1)
NEXT
Array(Element%) = InsertValue
NEXT
Done = TIMER
PRINT USING "##.## seconds when swapping"; Done - Start
If you run this program in the BASIC environment, the differences may not
appear that significant. But when the program is compiled to an executable
file, the swapping method is more than four times faster. In fact, you
should never compare programming methods using the BASIC editor for exactly
this reason. In many cases, the slowness of the interpreting process
overshadows significant differences between one approach and another.
String arrays also benefit greatly from using SWAP instead of
assignments, though the amount of benefit varies depending on the length
of the strings. If you modify the previous program to use a string array,
also add this loop to initialize the elements:
FOR X% = 1 TO 500
Array$(X%) = "String number" + STR$(X)
NEXT
With BASIC PDS far strings the difference is only slightly less at about
three to one, due to the added complexity of far data. Also, SWAP will
always be worse than assignments when inserting or deleting elements in a
fixed-length string or TYPE array. An assignment merely copies the data
from one location to another. SWAP, however, must copy the data in both
directions.
Understand that when using SWAP with conventional string arrays, the
data itself is not exchanged. Rather, the four-byte string descriptors are
copied. But because BASIC program modules store string data in different
segments, extra work is necessary to determine which descriptor goes with
which segment. When near strings are being used, only six bytes are
exchanged, regardless of the length of the strings. Four bytes hold the
descriptors, and two more store the back pointers.
SUMMARY
=======
This chapter explained many of the finer points of sorting and searching
all types of data in BASIC. It began with sorting concepts using the
simple Bubble Sort as a model, and then went on to explain indexed and
multi-key sorts. One way to implement a multi-key sort is by aligning the
key fields into adjacent TYPE components. While there are some
restrictions to this method, it is fairly simple to implement and also
very fast.
The Quick Sort algorithm was shown, and the SEEQSORT.BAS program on the
accompanying disk helps you to understand this complex routine by
displaying graphically the progress of the comparisons and exchanges as
they are performed. Along the way you saw how a few simple modifications
to any string sort routine can be used to sort regardless of
capitalization, or based on only a portion of a string element.
You also learned that writing a truly general sort routine that can
handle any type of data requires dealing exclusively with segment and
address pointers. Here, assembly language routines are invaluable for
assisting you when performing the necessary comparisons and data exchanges.
Although the actual operation of the assembly language routines will be
deferred until Chapter 12, such routines may easily be added to a BASIC
program using .LIB and .QLB libraries.
I mentioned briefly the usefulness of packing and aligning data when
possible, as an aid to fast sorting. In particular, dates can be packed
to only three bytes in Year/Month/Day order, and other data such as zip
codes can be stored in long integers. Because numbers can be compared much
faster than strings, this helps the sorting routines operate more quickly.
Array searching was also discussed in depth, and both linear and binary
search algorithms were shown. As with the sorting routines, searching can
also employ UCASE$ and MID$ to search regardless of capitalization, or on
only a portion of each array element. Two versions of the Soundex
algorithm were given, to let you easily locate names and other data based
on how they sound.
Besides showing the more traditional searching methods, I presented
routines to determine the minimum and maximum values in a numeric array.
I also discussed some of the ramifications involved when searching floating
point data, to avoid the inevitable rounding errors that might cause a
legitimate match to be ignored.
Finally, some simple ways to insert and delete elements in both string
and numeric arrays were shown. Although making direct assignments in a
loop is the most obvious way to do this, BASIC's often-overlooked SWAP
command can provide a significant improvement in speed.
The next chapter will conclude this section about hands-on programming
by showing a variety of program optimization techniques.