Text File | 1993-10-23 | 89.1 KB | 2,172 lines

Atari ST Machine Specific Programming In Assembly

Chapter 6: Selections Based On Speed

Reducing Program Loading Time

As I was concluding chapter 3, I intended to indicate that, in many ways, this chapter is an extension of that one. For example, the first thing that I shall do here is to resume the discussion of assembly modes. As I stated there, the first step involved in rapid execution is rapid loading. A program cannot even begin to run until it has been moved from disk storage to ram.

For programs which simply load, execute and terminate, without spending much time in ram, the time required to load the program may be a significant portion of the total time involved in program execution. Many of the programs that are executed from an AUTO folder during boot are of this type. It seems reasonable, therefore, that discussions which involve the reduction of loading time precede discussions involving the reduction of execution time.

As I mentioned in program SPEED_1's documentation, the time required to load a particular program from disk to ram depends on the assembly mode which produced the code; the type of drive from which it is loaded, hard disk or floppy; the method used to format the disk, such as ST normal, Double Click's DC Formatter or David Small and Dan Moore's Mega-Twister; the position of the file on the disk, which could include fragmentation; and, if the program is spawned by another, the distance between parent and child, which could include intermedia displacement.

The Effects of Assembly Modes

Programs 23 and 24 have been prepared for use in comparing the load and execution times of a program assembled in three assembly modes. Program 23 should be assembled in the Relocatable mode to produce PRG_2CC.TOS.
Program 24 should be assembled in AssemPro's PC-relative mode to produce PRG_2CP.TOS, then the name of the source should be changed to PRG_2CR, while it is still in the editor, so that it can be assembled in the Relocatable mode to produce PRG_2CR.TOS.

I obtained the load and execute times for the three TOS programs using the following method to isolate the experiment from the intermedia displacement variables discussed above. Execution results follow the appropriate listings.

 1. Copy SPEEDTST.TTP to a blank hard disk partition or floppy disk.
 2. Copy PRG_2CC.TOS to the same hard disk partition or floppy disk.
 3. Execute SPEEDTST.TTP and type PRG_2CC.TOS on the command line.
 4. Copy PRG_2CC.DAT to a safe place.
 5. Remove PRG_2CC.TOS and PRG_2CC.DAT from the partition or floppy.
 6. Repeat steps 2 through 5 for PRG_2CP.TOS and PRG_2CR.TOS.

Program 23. Source file for PRG_2CC.TOS

; Program Name: PRG_2CC.S
;      Version: 1.002

; Assembly Instructions:

;    Assemble in Relocatable mode and save with a TOS extension.

; Execution Instructions:

;    Execute program SPEEDTST.TTP and type PRG_2CC.TOS on the input
; parameter line. SPEEDTST.TTP will produce a data file named PRG_2CC.DAT
; on disk. You will be able to compare the data for this program to that
; produced for programs PRG_2CP.TOS and PRG_2CR.TOS.

; Program Function:

;    Statements within a nested loop structure are executed 500,000 times
; (ten passes of a 50,000-iteration inner loop) so that the load and
; execution time of this program can be compared with similar programs
; assembled in the PC-relative and Relocatable modes.

store_after_load_time:
        trap    #3                       ; Returns value of system clock in D0.
        lea     after_load_time(pc), a0
        move.w  d0, (a0)
        move.w  #9, d1                   ; Initialize outer loop counter.
outer_loop:                              ; Loop ten times.
        move.w  #49999, d0               ; Initialize inner loop counter.
inner_loop:                              ; Loop 50,000 times.
        move.l  #label, a0               ; Can't use (pc) here.
        lea     label(pc), a0
        move.l  label(pc), a0
        move.l  #label, -(sp)            ; Can't use (pc) here.
        pea     label(pc)
        move.l  label(pc), -(sp)
        lea     $C(sp), sp               ; Reposition stack pointer to top of stack.
        dbra    d0, inner_loop           ; Loop back until D0 = -1.
        dbra    d1, outer_loop           ; Loop back until D1 = -1.
terminate:
        move.w  after_load_time(pc), -(sp) ; Pass after load time to SPEEDTST.TTP.
        move.w  #$4C, -(sp)              ; Function = p_term = GEMDOS $4C.
        trap    #1

        data

; NOTE: Below, the variable "label" is supposed to be a pointer to the
;       variable "after_load_time". If this program is assembled in
;       Relocatable mode, the "run time" address of "after_load_time" will be
;       stored in the 4 bytes declared at "label" when the program is loaded
;       from disk to ram.
;       But, if the program is assembled in PC-relative mode, the "run time"
;       address will not be stored there; instead, the "assembly time" address
;       will be stored in the 4 bytes. That is undesirable.

label:           dc.l   after_load_time  ; This works for COMBO assembly.
        bss
after_load_time: ds.w   1
        end

SPEEDTST.TTP Execution Results for PRG_2CC.TOS, loaded from drive: G

     Load time:      1995 milliseconds
     Execution time: 7485 milliseconds

Program 24. Source file for PRG_2CP.TOS and PRG_2CR.TOS.

; Program Name: PRG_2CP.S
;      Version: 1.002

; Assembly Instructions:

;    Assemble in PC-relative mode and save with a TOS extension, then
; change the name to PRG_2CR, assemble in Relocatable mode and save as
; PRG_2CR.TOS. This will allow you to prepare two object code files from
; the same source file.

; Execution Instructions:

;    Execute program SPEEDTST.TTP and type PRG_2CP.TOS on the input
; parameter line. SPEEDTST.TTP will produce a data file named PRG_2CP.DAT
; on disk. You will be able to compare the data for this program to that
; produced for programs PRG_2CC.TOS and PRG_2CR.TOS.
;    Do the same with PRG_2CR.TOS to produce a data file named PRG_2CR.DAT.
; You can view the DAT files with an editor, from the desktop using the Show
; function or by printing them.

; Program Function:

;    Statements within a nested loop structure are executed 500,000 times
; (ten passes of a 50,000-iteration inner loop) so that the load and
; execution time of this program can be compared with similar programs
; assembled in the Relocatable and Combo modes.

store_after_load_time:
        trap    #3                       ; Returns value of system clock in D0.
        lea     after_load_time, a0
        move.w  d0, (a0)                 ; Store time in variable "after_load_time".
        move.w  #9, d1                   ; Initialize outer loop counter.
outer_loop:                              ; Loop ten times.
        move.w  #49999, d0               ; Initialize inner loop counter.
inner_loop:                              ; Loop 50,000 times.
        move.l  #label, a0
        lea     label, a0
        move.l  label, a0
        move.l  #label, -(sp)
        pea     label
        move.l  label, -(sp)
        lea     $C(sp), sp               ; Reposition stack pointer to top of stack.
        dbra    d0, inner_loop           ; Loop back until D0 = -1.
        dbra    d1, outer_loop           ; Loop back until D1 = -1.
terminate:
        move.w  after_load_time, -(sp)   ; Pass after load time to SPEEDTST.TTP.
        move.w  #$4C, -(sp)              ; Function = p_term = GEMDOS $4C.
        trap    #1

        data

; NOTE: Below, the variable "label" is supposed to be a pointer to the
;       variable "after_load_time". If this program is assembled in
;       Relocatable mode, the "run time" address of "after_load_time" will be
;       stored in the 4 bytes declared at "label" when the program is loaded
;       from disk to ram.
;       But, if the program is assembled in PC-relative mode, the "run time"
;       address will not be stored there; instead, the "assembly time" address
;       will be stored in the 4 bytes. That is undesirable.

label:           dc.l   after_load_time  ; This does not give desired response for
        bss                              ; PC-relative assembly, it works for
after_load_time: ds.w   1                ; Relocatable assembly.
        end

SPEEDTST.TTP Execution Results for PRG_2CP.TOS, loaded from drive: G

     Load time:        35 milliseconds
     Execution time: 7475 milliseconds

SPEEDTST.TTP Execution Results for PRG_2CR.TOS, loaded from drive: G

     Load time:      1995 milliseconds
     Execution time: 8510 milliseconds

A comparison of load times is dramatic.
When the program is assembled in AssemPro's PC-relative mode, the time required to load is 1960 msec (about 2 seconds) less than it is when the program is assembled in Relocatable mode. And, although the use of pc-relative addressing in a program that is assembled in Relocatable mode does not reduce load time, its effect on execution time is no less dramatic: for even so short a program as this, it reduces the execution time from 8510 msec to 7485 msec, a savings of about 1 second.

Why PC-relative Programs Load Faster

As I mentioned in chapter 2, part of the extra load time (or that which I have chosen to designate as load time, but which could as easily be called start-up time) required by programs assembled in the Relocatable mode is due to the task of program relocation itself. I have since received information from a reader which further clarifies the comparison between the two assembly modes. This reader is well known for his hard disk backup utility, TURTLE.PRG. I speak, of course, of George Woodside.

I now paraphrase the information that Mr. Woodside was kind enough to share with me. It seems that when a Relocatable program is executed the operating system clears all unoccupied ram before the onset of execution. In addition, Mr. Woodside went on to explain that PC-relative LSR programs which are executed from the AUTO folder can't be deleted because the operating system does not close those files while they are still ram resident. I have mentioned this phenomenon elsewhere, but I have not known the reason until now.

I had planned to write a few programs that would graphically display the information presented by Mr. Woodside, but something has occurred which probably renders such explorations irrelevant from my position. Since receiving his letter I have been forced to purchase a MEGA 2 because I needed the additional memory.
And, while the differences in loading and execution times are still noticeable, they are not as dramatic as they were with the 1040ST. Therefore, I simply rest the matter by saying that there is a difference and that difference is reported as I experienced it.

The Effects of Floppy Disk Formatting Techniques

The first thing I'm going to do in this section is to refer you to a couple of START magazine articles written by Dave Small and Dan Moore: Hard Disk Warfare, Spring, 1987 and Let's Twist Again, Summer, 1988. Those articles will help you to understand why this section is included in the book.

Briefly, there is more than one way to format floppy disks for the ST. I have designed two experiments which permit you to compare the effects of two non-standard methods to that of the standard ST method. I refer to the formatted disk you obtain when you format from the desktop by selecting the Format option under the File menu as standard ST. I refer to the formatted disk you obtain by any other method as non-standard. The non-standard formatting methods used in the experiment are Mega Twister, included on START magazine's Summer, 1988 disk and DC Formatter Version 2.2, included on ST Informer's PDM 188 disk.

The most extensive reference to DC Formatter that I could find occurs in the July/August 1988 issue of RESET magazine, on page 11, within the Public Beat column. If this article is not readily available to you, I suggest that you not be concerned about it. The author's enthusiasm for DC Formatter is undoubtedly genuine. I have found DC Formatter version 2.2 to format faster than Mega Twister (1 minute 7 seconds versus 1 minute 45 seconds), and the data in figures 6.1 and 6.2 seem to confirm that DC Formatter may indeed write to floppies slightly faster than Mega Twister. The data does not, however, confirm the author's apparent conclusion that DC Formatter is tremendously faster than Mega Twister.
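
To put "slightly faster" in perspective, the short sketch below expresses the floppy write-time differences between the two non-standard formats in units of the 5 msec system clock tick. It is written in Python rather than assembly purely for brevity and is not one of the book's programs; the names TICK_MSEC, write_times and tick_difference are illustrative assumptions, while the millisecond values are copied from figures 6.1 and 6.2.

```python
# System clock resolution reported in the text: one tick every 5 msec.
TICK_MSEC = 5

# (Twister, DC Formatter) write times in msec, copied from figures 6.1 and 6.2.
write_times = {
    "WRITE_1, verify on":  (3195, 3185),
    "WRITE_2, verify on":  (2260, 2255),
    "WRITE_1, verify off": (1595, 1590),
    "WRITE_2, verify off": (1060, 1050),
}


def tick_difference(twister_msec, dc_msec):
    """Return how many clock ticks DC Formatter finishes ahead of Twister."""
    return (twister_msec - dc_msec) // TICK_MSEC


differences = {name: tick_difference(*pair) for name, pair in write_times.items()}

for name, ticks in differences.items():
    twister, dc = write_times[name]
    percent = 100.0 * (twister - dc) / twister
    print(f"{name}: {ticks} tick(s), {percent:.2f}% faster")

# Formatting time itself, from the text: 1 min 45 s versus 1 min 7 s.
format_saving_sec = (1 * 60 + 45) - (1 * 60 + 7)
print(f"Formatting: DC Formatter finishes {format_saving_sec} s sooner")
```

Every write-time difference is one or two clock ticks, which is under 1 percent and at the limit of what a 5 msec clock can resolve. The saving in formatting time, by contrast, is large enough to be unambiguous.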

A new version of DC Formatter, Version 3.0, is discussed in the December, 1988 issue of ST Informer. Before deciding to use this method of formatting disks, I suggest that you read the information on page 33 of that issue, under the heading DCFRMT3.ARC. There you will find a statement concerning the incompatibility of the earlier version with GDOS. Furthermore, I suggest that you use any non-standard method of formatting disks with all of the caution the word non-standard connotes.

Program 25 has been designed to permit a comparison of the load, write and read times associated with each disk formatting method. The steps used to isolate the program on the appropriate disks are discussed within the program's documentation. Program 25 invokes trap calls to custom trap #10, which is installed by executing TRAP_10.PRG. The source listing for TRAP_10.PRG follows the listing for program 25.

Program 25. This program is used to compare the effects of three floppy disk formatting techniques.

; Program Name: PRG_2DP.S
;      Version: 1.003

; Assembly Instructions:

;    Assemble in PC-relative mode and save with a TOS extension.

; Execution Note:

;    This program invokes custom traps which must be installed by
; TRAPS.PRG and TRAP_10.PRG prior to its execution. Execute this program
; by typing its name on SPEEDTST.TTP's command line.

;    But before executing this program, prepare three floppy disks. The
; first should be formatted from the desktop, using the ST formatting
; algorithm. The second should be formatted with a version of DCFORMAT,
; available from ST Informer, 909 NW Starlite Place, Grants Pass, OR 97526,
; (503)476-0071 on disk PDM 188 or PDM 1288. The price of each disk from
; ST Informer is $6.00 for non-subscribers. You can subscribe for $18.00
; per year and get a free disk coupon. I highly recommend a subscription.
;    The third disk should be formatted with Dave Small and Dan Moore's
; Twister program available on START magazine's Summer, 1988 issue.
; You can find an order form in any issue of START or phone (800)234-7001.

;    Copy PRG_2DP.TOS to each blank disk. Copy SPEEDTST.TTP to each disk.
; Execute SPEEDTST.TTP on each disk in turn, typing PRG_2DP.TOS on its
; command line. Compare the contents of the WRITE_1 and WRITE_2 files of
; each disk to verify that they are identical. Compare the PRG_2DP.DAT file
; on each disk to the others.

; Program Function:

;    This program writes data to WRITE_1.DAT, reads WRITE_1.DAT, then
; writes what it has read to WRITE_2.DAT to confirm that it has correctly
; written and read the data declared within the data section of the program.
;    SPEEDTST.TTP will report the load and total execution time for this
; program.
;    Within the program, the time to write the data to WRITE_1.DAT will be
; calculated and reported; and the time to read the data from WRITE_1.DAT
; will also be calculated and reported; finally, the time to write the data
; to WRITE_2.DAT will be calculated and reported.
;    This program is to be used to compare the write to and read from times
; involving three floppy disks, each of which has been formatted with a
; different formatting algorithm.

fetch_load_time:
        trap    #3                  ; Returns value of system clock in D0.
release_excess_memory:              ; Also stores after-load time in TRAPS bss.
        lea     program_end, a0     ; Put "end of program" address in A0.
        movea.l 4(a7), a1           ; Put "basepage" address in A1.
        trap    #6                  ; Calculate program size and release memory.
        lea     stack, a7
print_heading:
        lea     heading, a0
        bsr     print_string
fetch_write_start_time:
        trap    #3
        lea     write_start_time, a0
        move.w  d0, (a0)
create_file_1:
        move.w  #0, -(sp)           ; File attribute = read/write.
        pea     file_1_name         ; For WRITE_1.DAT.
        move.w  #$3C, -(sp)         ; Function = f_create = GEMDOS $3C.
        trap    #1                  ; File handle is returned in D0.
        addq.l  #8, sp
        lea     file_1_handle, a0   ; Store returned file handle.
        move.w  d0, (a0)
write_to_file_1:
        pea     string              ; Push address of buffer.
        move.l  #451, -(sp)         ; Number of bytes to write.
        move.w  d0, -(sp)           ; File handle to be written to.
        move.w  #$40, -(sp)         ; GEMDOS function = write.
        trap    #1
        lea     $C(sp), sp          ; Reposition stack pointer to top of stack.
close_file_1:
        move.w  file_1_handle, -(sp)
        move.w  #$3E, -(sp)         ; Function = GEMDOS $3E = f_close.
        trap    #1
        addq.l  #4, sp
get_end_time:
        trap    #3
        sub.w   write_start_time, d0 ; Subtract start time from end time.
        ext.l   d0                  ; Extend to 32 bits
        trap    #10                 ; Convert to milliseconds and print.
set_dta:
        pea     dta                 ; dta = address of 44 byte buffer.
        move.w  #$1A, -(sp)         ; GEMDOS function = set dta.
        trap    #1
        addq.l  #6, sp
print_read_time_label:
        lea     read_msg, a0
        bsr     print_string
fetch_read_start_time:
        trap    #3
        lea     read_start_time, a0
        move.w  d0, (a0)

; NOTE: Reading is so fast, must loop to accumulate enough time to measure.

        move.w  #99, d3             ; Set up counter for 100 loops.
search_for_file:
        move.w  #0, -(sp)           ; Attribute = normal access.
        pea     file_1_name         ; Name of file to search for.
        move.w  #$4E, -(sp)         ; GEMDOS function = search first.
        trap    #1
        addq.l  #8, sp
        tst     d0
        bne.s   not_found
read_WRITE_1_DAT:
        lea     dta, a0
        pea     buffer
        move.l  $1A(a0), -(sp)      ; Number of bytes to read.
        move.w  file_1_handle, -(sp) ; File to read.
        move.w  #$3F, -(sp)         ; GEMDOS function = read.
        trap    #1
        lea     $C(sp), sp          ; Reposition stack pointer.
        dbra    d3, search_for_file
_get_end_time:
        trap    #3
        sub.w   read_start_time, d0 ; Subtract start time from end time.
        ext.l   d0                  ; Extend to 32 bits
        trap    #10                 ; Convert to milliseconds and print.
print_write_time_label:
        lea     write_msg, a0
        bsr     print_string
_fetch_write_start_time:
        trap    #3
        lea     write_start_time, a0
        move.w  d0, (a0)
create_file_2:
        move.w  #0, -(sp)           ; File attribute = read/write.
        pea     file_2_name         ; For WRITE_2.DAT.
        move.w  #$3C, -(sp)         ; Function = f_create = GEMDOS $3C.
        trap    #1                  ; File handle is returned in D0.
        addq.l  #8, sp
        lea     file_2_handle, a0   ; Store returned file handle.
        move.w  d0, (a0)
write_to_file_2:
        lea     dta, a0
        pea     string              ; Push address of buffer.
        move.l  $1A(a0), -(sp)      ; Number of bytes to write.
        move.w  d0, -(sp)           ; File handle to be written to.
        move.w  #$40, -(sp)         ; GEMDOS function = write.
        trap    #1
        lea     $C(sp), sp          ; Reposition stack pointer to top of stack.
close_file_2:
        move.w  file_2_handle, -(sp)
        move.w  #$3E, -(sp)         ; Function = GEMDOS $3E = f_close.
        trap    #1
        addq.l  #4, sp
_get__end_time:
        trap    #3
        sub.w   write_start_time, d0 ; Subtract start time from end time.
        ext.l   d0                  ; Extend to 32 bits
        trap    #10                 ; Convert to milliseconds and print.
not_found:
        trap    #8                  ; Terminate.

print_string:
        pea     (a0)
        move.w  #9, -(sp)
        trap    #1
        addq.l  #6, sp
        rts

        data
file_1_name: dc.b 'WRITE_1.DAT',0
file_2_name: dc.b 'WRITE_2.DAT',0
heading:     dc.b 'PRG_2DP.TOS Execution Results',$D,$A,$D,$A
             dc.b ' Time to create, write and close WRITE_1.DAT: ',0
read_msg:    dc.b ' Time to read WRITE_1.DAT into buffer 10 times: ',0
write_msg:   dc.b ' Time to create, write and close WRITE_2.DAT: ',0
string:      dc.b ' This paragraph will be written to a disk file named '
             dc.b 'WRITE_1.DAT. The time ',$D,$A
             dc.b ' required to write the paragraph will be reported in '
             dc.b 'file PRG_2DP.DAT.',$D,$A
             dc.b ' Then the contents of WRITE_1.DAT will be read into a '
             dc.b 'buffer. The time ',$D,$A
             dc.b ' required to read the contents of the file will be '
             dc.b 'reported in file ',$D,$A
             dc.b ' PRG_2DP.DAT. Finally, the contents of the buffer '
             dc.b 'will be written to ',$D,$A
             dc.b ' WRITE_2.DAT so that what has been read can be compared '
             dc.b 'to what was written.',$D,$A,$1A

; NOTE: The ASCII code for ^Z (control Z) is normally used to mark the end
;       of a file so that a program reading the file may look for that mark.

        bss
        align
file_1_handle:    ds.w 1
file_2_handle:    ds.w 1
write_start_time: ds.w 1
read_start_time:  ds.w 1
dta:              ds.b 44
buffer:           ds.b 452
                  ds.l 96
stack:            ds.l 0
program_end:      ds.l 0
        end

Program 26. The custom trap #10 algorithm.

; Program Name: TRAP_10.S
;      Version: 1.001

; Assembly Instructions:

;    Assemble in PC-relative mode and save with a PRG extension.

; Program Function:

;    This is a LSR program that establishes a user defined trap. It may be
; executed from the desktop, but you may prefer to copy it to the AUTO
; folder of your boot partition or floppy disk so that it will execute
; automatically during boot.

;    Trap #10 is used by programs which time an interval. This trap
; converts the interval passed in register D0 to milliseconds, then it
; prints the ASCII decimal value of that interval in milliseconds.

;    This program invokes a custom trap that is established by TRAPS.PRG,
; therefore, that program must be executed before trap #10 is invoked by a
; program.

program_start:                      ; Calculate program size and retain result.
        lea     program_end, a3     ; Fetch program end address.
        suba.l  4(a7), a3           ; Subtract basepage address.
enter_supervisor_mode:
        move.l  #0, -(sp)           ; The zero turns on supervisor mode.
        move.w  #$20, -(sp)         ; Function = super = GEMDOS $20.
        trap    #1                  ; Go to supervisor mode.
        addq.l  #6, sp              ; Supervisor stack pointer (SSP) returned in D0.
        movea.l d0, a5              ; Save SSP in scratch register.
install_trap_10_routine:            ; Note: pointer = vector = pointer.
        lea     $A8, a0             ; Fetch trap #10 pointer address.
        lea     trap_10_routine, a1 ; Fetch address of trap #10 routine.
        move.l  a1, (a0)            ; Store trap address at pointer address.
enter_user_mode:
        pea     (a5)                ; Restore supervisor stack pointer.
        move.w  #$20, -(sp)         ; Function = super = GEMDOS $20.
        trap    #1                  ; Go to user mode.
        addq.l  #6, sp              ; Reset stack pointer to top of stack.
relinquish_processor_control:       ; Maintain memory residency.
        move.w  #0, -(sp)           ; See page 121 of Internals book.
        move.l  a3, -(sp)           ; Program size.
        move.w  #$31, -(sp)         ; Function = ptermres = GEMDOS $31.
        trap    #1

trap_10_routine:

; Expects a binary value in D0 which represents a measured interval. This
; algorithm converts the value in D0 to milliseconds (msec) then prints the
; value in decimal msec.

preserve_value_in_d3:
        move.l  d3, -(sp)
convert_time_to_msec:
        move.l  d0, d3              ; Copy D0 into D3; D0 is added back below.
        asl.l   #2, d3              ; Shift to multiply by 4.
        add.l   d0, d3              ; To complete multiplication by 5.
print_time:
        cmpi.l  #999, d3            ; If time is less than 1000, then
        bgt     no_space            ; print a leading blank space for output
        lea     space, a0           ; alignment.
        bsr     print_string
        cmpi.l  #99, d3             ; If time is less than 100, then
        bgt     no_space            ; print another leading blank space.
        lea     space, a0
        bsr     print_string
no_space:
        move.l  d3, d1              ; Move to D1 so can convert to ASCII decimal
        trap    #4                  ; Returns address of decimal string in A0.
        bsr.s   print_string
        lea     units_label, a0
        bsr.s   print_string
        move.l  (sp)+, d3           ; Restore contents of D3.
        rte

;
; Subroutine
;

print_string:                       ; Expects address of string to be in A0.
        pea     (a0)                ; Push address of string onto stack.
        move.w  #9, -(sp)           ; Function = c_conws = GEMDOS $9.
        trap    #1                  ; GEMDOS call
        addq.l  #6, sp              ; Reset stack pointer to top of stack.
        rts

        data
space:       dc.b " ",0
units_label: dc.b " milliseconds", $D,$A,0
        bss
        align                       ; Align storage on a word boundary.
program_end: ds.l 0
        end

Figure 6.1 shows the execution results of program 25 with the contents of memory location $444 unaltered (See page 252 of the Internals book and Dave's Write-With-Verify Lecture, page 88 of START's Spring, 1987 issue.). As you will see, it makes sense to perform the experiment twice, once with the ST standard value in this memory location and once with the value zero stored there.

And since that subject has come up, this is a convenient spot at which to introduce a program that I use to configure certain ST system variables during boot. Refer to section 3.7 The ST System Variables, pages 250-257 of the Internals book.

Program 27. A program used to configure system variables during boot. I do not guarantee that this program will work with all ST systems.

; Program Name: CONFIG.S

; Assembly Instructions:

;    Assemble in PC-relative mode and save with a PRG extension. Move
; CONFIG.PRG to the AUTO folder of the boot disk.

; Program Purpose:

;    Configures system variables.

mainline:
        lea     stack, a7           ; Point A7 to this program's stack.
enter_supervisor_mode:
        move.l  #0, -(sp)           ; The zero turns on supervisor mode.
        move.w  #$20, -(sp)         ; Function = super = GEMDOS $20.
        trap    #1                  ; Supervisor stack pointer returned in D0.
        addq.l  #6, sp

; Algorithm 1:

;    Turns off keyclick sound. Refer to page 254 of the Internals book. The
; system variable at address $484 is a byte length variable. The bits of
; this variable have the meanings as indicated in the Internals book. The
; bit of interest is #0. When this bit is a one, the computer emits a
; click each time a key is pressed. When the bit is a zero, these clicks
; are not emitted. A zero is placed in this bit by replacing the content
; of the byte at $484 (which is 7 before the replacement, if key click is
; enabled) with $6.

disable_key_click:
        move.b  #6, $484            ; Refer to page 254 of the Internals book.

; Algorithm 2:

;    Performs the printer installation that is accomplished by CONTROL.ACC.
; The printer configuration table consists of one word, stored at $E4A.
; The bits of this word have the following meanings:

;    BIT   MEANING IF ZERO    MEANING IF ONE
;    ---   ---------------    --------------
;     0    Dot matrix         Daisy printer
;     1    Black/white        Color printer
;     2    1280 dots/line     960 dots/line
;     3    Draft mode         Final mode
;     4    Printer port       Modem port
;     5    Continuous feed    Single sheet

; Bits 6 through 15 are not used.

install_printer:
        move.w  #$4, $E4A

; Algorithm 3:

;    Turns off disk write verify.

disable_write_verify:
        move.w  #0, $444            ; Refer to page 252 of the Internals book.

return_to_user_mode:
        move.l  d0, -(sp)           ; Restore "before call" SSP.
        move.w  #$20, -(sp)         ; Function = super = GEMDOS $20.
        trap    #1
        addq.l  #6, sp
terminate:
        move.w  #0, -(sp)           ; Function = p_term_old = GEMDOS $0.
        trap    #1                  ; GEMDOS call.

             ds.l 24                ; Stack.
stack:       ds.l 1                 ; Address of stack.
program_end: ds.l 0
        end

Figure 6.1. Program 25's execution results with write-verify active.

PRG_2DP.TOS Execution Results: Hard disk drive.

  Time to create, write and close WRITE_1.DAT:     160 milliseconds
  Time to read WRITE_1.DAT into buffer:            520 milliseconds
  Time to create, write and close WRITE_2.DAT:     115 milliseconds

SPEEDTST.TTP Execution Results for PRG_2DP.TOS, loaded from drive: G

     Load time:        55 milliseconds
     Execution time: 1025 milliseconds

PRG_2DP.TOS Execution Results: ST Standard Formatted.

  Time to create, write and close WRITE_1.DAT:    3185 milliseconds
  Time to read WRITE_1.DAT into buffer 10 times:   540 milliseconds
  Time to create, write and close WRITE_2.DAT:    2650 milliseconds

SPEEDTST.TTP Execution Results for PRG_2DP.TOS, loaded from drive: A

     Load time:       385 milliseconds
     Execution time: 7840 milliseconds

PRG_2DP.TOS Execution Results: Twister Formatted, Version 2.0.

  Time to create, write and close WRITE_1.DAT:    3195 milliseconds
  Time to read WRITE_1.DAT into buffer 10 times:   540 milliseconds
  Time to create, write and close WRITE_2.DAT:    2260 milliseconds

SPEEDTST.TTP Execution Results for PRG_2DP.TOS, loaded from drive: A

     Load time:       365 milliseconds
     Execution time: 7300 milliseconds

PRG_2DP.TOS Execution Results: DC Formatter, Version 2.2.

  Time to create, write and close WRITE_1.DAT:    3185 milliseconds
  Time to read WRITE_1.DAT into buffer 10 times:   540 milliseconds
  Time to create, write and close WRITE_2.DAT:    2255 milliseconds

SPEEDTST.TTP Execution Results for PRG_2DP.TOS, loaded from drive: A

     Load time:       365 milliseconds
     Execution time: 7275 milliseconds

Figure 6.2 shows the execution results of program 25 with the contents of memory location $444 equal to zero, thereby disabling the write-verify function. Notice that the state of the write-verify function does not seem to affect disk access times when the program is executed from a hard disk partition. The hard disk times of figures 6.1 and 6.2 are identical, considering the 5 msec resolution of the system clock.

Figure 6.2. Program 25's execution results with write-verify disabled.

PRG_2DP.TOS Execution Results: Hard disk drive.

  Time to create, write and close WRITE_1.DAT:     160 milliseconds
  Time to read WRITE_1.DAT into buffer 10 times:   520 milliseconds
  Time to create, write and close WRITE_2.DAT:     110 milliseconds

SPEEDTST.TTP Execution Results for PRG_2DP.TOS, loaded from drive: G

     Load time:        60 milliseconds
     Execution time: 1020 milliseconds

PRG_2DP.TOS Execution Results: ST Standard Formatted.

  Time to create, write and close WRITE_1.DAT:    1590 milliseconds
  Time to read WRITE_1.DAT into buffer 10 times:   540 milliseconds
  Time to create, write and close WRITE_2.DAT:    1450 milliseconds

SPEEDTST.TTP Execution Results for PRG_2DP.TOS, loaded from drive: A

     Load time:       390 milliseconds
     Execution time: 4640 milliseconds

PRG_2DP.TOS Execution Results: Twister Formatted, Version 2.0.

  Time to create, write and close WRITE_1.DAT:    1595 milliseconds
  Time to read WRITE_1.DAT into buffer 10 times:   540 milliseconds
  Time to create, write and close WRITE_2.DAT:    1060 milliseconds

SPEEDTST.TTP Execution Results for PRG_2DP.TOS, loaded from drive: B

     Load time:       365 milliseconds
     Execution time: 4100 milliseconds

PRG_2DP.TOS Execution Results: DC Formatter, Version 2.2.

  Time to create, write and close WRITE_1.DAT:    1590 milliseconds
  Time to read WRITE_1.DAT into buffer 10 times:   540 milliseconds
  Time to create, write and close WRITE_2.DAT:    1050 milliseconds

SPEEDTST.TTP Execution Results for PRG_2DP.TOS, loaded from drive: A

     Load time:       370 milliseconds
     Execution time: 4080 milliseconds

Unfinished Business: Pushing Zero Onto the Stack

In chapter 2, I discussed the execution speed and memory requirements of three ways to push a zero onto the stack. At that time, I used table 2.1 to list data that was observed and prepared manually. I am now going to introduce a program that prints comparative data for the three methods previously discussed.
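
As a reminder of the memory side of that comparison, the sketch below lists the opcode words for the three methods, plus the lone move.w d0, -(sp) that remains when the zero is prestored, and counts their bytes. It is a Python illustration and not one of the book's programs. The hexadecimal encodings are hand-assembled from the standard 68000 instruction formats, which is an assumption of this sketch rather than data from the chapter, and the names METHODS and size_in_bytes are hypothetical.

```python
# Each method maps to the 16-bit words that make up its instruction(s).
# The encodings are hand-assembled from the 68000 instruction formats;
# they are an assumption of this sketch, not values quoted from the book.
METHODS = {
    "clr.w -(sp)": [0x4267],
    "move.w #0, -(sp)": [0x3F3C, 0x0000],       # opcode word + immediate word
    "moveq #0, d0 / move.w d0, -(sp)": [0x7000, 0x3F00],
    "move.w d0, -(sp) alone": [0x3F00],         # zero already held in D0
}


def size_in_bytes(words):
    """Every 68000 instruction word occupies two bytes."""
    return 2 * len(words)


sizes = {name: size_in_bytes(words) for name, words in METHODS.items()}

for name, nbytes in sizes.items():
    print(f"{name}: {nbytes} bytes")
```

The counts of 2, 4 and 4 bytes for the three methods agree with the requisite memory values that PUSHZERO reports in its execution results.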

Because the time required to execute individual instructions is too short for accurate measurement, we rely on multiple executions of each instruction to lengthen the period of time that we are measuring. In addition, we are not as interested in absolute execution speed as we are in relative speed. We just want to know which instructions are faster, not how long it takes to execute one of them.

Program 28 will generate values that we can compare objectively. In addition, the program will calculate the memory requirements for each of the three instructions in which we are interested. As a finale, the advantage of using the third method, when it is appropriate to do so, is illustrated by removal of the statement which needs be executed only once from the execution loop. That is the statement which prestores a zero in register D0.

Program 28. Three ways to push zero onto the stack.

; Program Name: PUSHZERO.S
;      Version: 1.003

; Assembly Instructions:

;    Assemble in AssemPro PC-relative mode and save with a TOS extension.

; Execution Instructions:

;    Execute from the desktop; or execute SPAWN.TTP, type PUSHZERO.TOS on
; its command line and view this program's output in PUSHZERO.DAT.

; Program Function:

;    For each of the three methods of pushing $0 onto the stack discussed in
; chapter two, calculates the memory occupied by each instruction, and
; calculates the execution time in milliseconds required for 50,000
; executions of each.

;    Then, in order to emphasize the comparison for a more practical type of
; application, the one statement in the third algorithm that needs be
; executed only once, before the execution of the loop, is placed outside
; of the loop, and the time for 50,000 executions of the third method is
; repeated.

calculate_program_size:
        lea     -$102(pc), a1       ; Fetch basepage start address.
        lea     program_end, a0     ; Fetch program end address.
        adda.l  #100512, a0         ; Attach large stack space to end of program.

; NOTE: The above method of declaring stack space at the end of this program is
;       preferable to that which I have been using because, if a label were
;       used to declare the stack in the bss section, then the label used
;       to mark the end of the program would be too far to permit the program
;       to be assembled in AssemPro's PC-relative mode.

        movea.l a0, a7              ; Point A7 to this program's stack.
        trap    #6                  ; Return unused memory to op system.
print_heading:
        lea     heading, a0
        bsr     print_string

push_method_1:
        lea     header_1, a0
        bsr     print_string
        move.l  #49999, d7          ; D7 is counter for the push loop.
        trap    #3                  ; Fetch start time.
        move.l  d0, d3              ; Save start_time in D3.
push_1_loop:                        ; Marks start of instruction in the loop.
        clr.w   -(sp)               ; Instruction in the loop.
memory_1:                           ; Marks end of instruction in the loop.
        dbra    d7, push_1_loop     ; Loop 50000 times.
        trap    #3                  ; Fetch end time.
        sub.l   d3, d0              ; Subtract start time from end time.
        trap    #10                 ; Convert to decimal milliseconds and print.
        adda.l  #100000, sp         ; Reset stack pointer to top of stack.
print_method_1_requisite_memory:
        lea     header_5, a0        ; Print requisite memory header.
        bsr     print_string
        lea     memory_1, a0        ; Calculate number of bytes occupied by the
        lea     push_1_loop, a1     ; instruction in the loop.
        suba.l  a1, a0
        bsr     print_requisite_memory

push_method_2:
        lea     header_2, a0
        bsr     print_string
        move.l  #49999, d7          ; D7 is counter for the push loop.
        trap    #3                  ; Fetch start time.
        move.l  d0, d3              ; Save start_time in D3.
push_2_loop:                        ; Marks start of instruction in the loop.
        move.w  #0, -(sp)           ; Instruction in the loop.
memory_2:                           ; Marks end of instruction in the loop.
        dbra    d7, push_2_loop     ; Loop 50000 times.
        trap    #3                  ; Fetch end time.
        sub.l   d3, d0              ; Subtract start time from end time.
        trap    #10                 ; Convert to decimal milliseconds and print.
        adda.l  #100000, sp         ; Reset stack pointer to top of stack.
print_method_2_requisite_memory:
        lea     header_6, a0        ; Print requisite memory header.
        bsr.s   print_string
        lea     memory_2, a0            ; Calculate number of bytes occupied by the
        lea     push_2_loop, a1         ; instruction in the loop.
        suba.l  a1, a0
        bsr.s   print_requisite_memory

push_method_3:
        lea     header_3, a0
        bsr.s   print_string
        move.l  #49999, d7              ; D7 is counter for the push loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
push_3_loop:                            ; Marks start of instructions in the loop.
        moveq   #0, d0                  ; There are two instructions in the loop.
        move.w  d0, -(sp)               ;
memory_3:                               ; Marks end of instructions in the loop.
        dbra    d7, push_3_loop         ; Loop 50000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.
        adda.l  #100000, sp             ; Reset stack pointer to top of stack.

print_method_3_requisite_memory:
        lea     header_7, a0            ; Print requisite memory header.
        bsr.s   print_string
        lea     memory_3, a0            ; Calculate number of bytes occupied by the
        lea     push_3_loop, a1         ; instructions in the loop.
        suba.l  a1, a0
        bsr.s   print_requisite_memory

modified_push_method_3:
        lea     header_4, a0
        bsr.s   print_string
        move.l  #49999, d7              ; D7 is counter for the push loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
        moveq   #0, d0                  ; Prestore $0 in D0.
push_4_loop:
        move.w  d0, -(sp)               ; Contents of D0 onto the stack.
        dbra    d7, push_4_loop         ; Loop 50000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.
        adda.l  #100000, sp             ; Reset stack pointer to top of stack.

terminate:
        trap    #8

;
; SUBROUTINES
;

print_requisite_memory:
        move.l  a0, d1
        trap    #4
        bsr.s   print_string
        lea     header_8, a0
        bsr.s   print_string
        rts

print_string:                           ; Expects address of string to be in A0.
        move.l  a0, -(sp)               ; Push address of string onto stack.
        move.w  #9, -(sp)               ; Function = c_conws = GEMDOS $9.
        trap    #1                      ; GEMDOS call
        addq.l  #6, sp                  ; Reset stack pointer to top of stack.
        rts

        data
heading:        dc.b $D,$A,"PUSHZERO Execution Results",$D,$A,$D,$A,0
header_1:       dc.b " Elapsed time for clr.w -(sp): ",0
header_2:       dc.b " Elapsed time for move.w #0, -(sp): ",0
header_3:       dc.b " Elapsed time for moveq #0, d0",$D,$A
                dc.b " move.w d0, -(sp): ",0
header_4:       dc.b " Time for only move.w d0, -(sp): ",0
header_5:       dc.b " Requisite memory: ",0
header_6:       dc.b " Requisite memory: ",0
header_7:       dc.b " Requisite memory: ",0
header_8:       dc.b " bytes",$D,$A,$D,$A,0
        bss
        align
program_end:    ds.l 0
        end

PUSHZERO Execution Results

  Elapsed time for clr.w -(sp):         180 milliseconds
  Requisite memory:                     2 bytes

  Elapsed time for move.w #0, -(sp):    155 milliseconds
  Requisite memory:                     4 bytes

  Elapsed time for moveq #0, d0
                   move.w d0, -(sp):    155 milliseconds
  Requisite memory:                     4 bytes

  Time for only move.w d0, -(sp):       125 milliseconds

Accuracy of the Results

It is true that we are interested in relative results, as far as instruction execution time is concerned, but it is worth remembering that when we view the results of a PUSHZERO execution, the elapsed times are calculated to a resolution of approximately 10 msec (milliseconds): 5 msec for the first call to get_time and 5 for the second call. How does this affect the accuracy of the program's report? Well, there are four cases to consider, each composed of two incidents, with each incident depending on whether we receive the time just before or just after the variable _hz_200 has been incremented.

1. Both the first and second times are received just after increment. In this case true time equals reported time, and loss of accuracy is due to overhead in making the calls and the inherent accuracy of the system clock.

2. The first time is received immediately after increment, the second is received immediately before. Then, true time equals reported time plus 5 msec plus overhead.

3. The first time is received immediately before increment, the second is received immediately after. Then, true time equals reported time minus 5 msec plus overhead.

4. Both the first and second times are received immediately before increment. In this case true time equals reported time plus overhead.

If you execute PUSHZERO repeatedly, you should notice some differences in reported times. For each of the timed events in the program, I have never seen a variance greater than 5 msec; however, the variance can fall on either side of the most often reported value. We should expect this to happen because the resolution of the elapsed time calculations is 10 msec.

To compare the reported times with the instruction execution times given in Motorola's reference book, we need only consider the instructions that form the loop. For push_method_1, the instructions are:

        clr.w   -(sp)
        dbra    d7, push_1_loop

The clr.w instruction requires 14 clock periods; the dbra instruction requires 10. The loop therefore requires 24 clock periods per pass:

        24 X 50,000 = 1,200,000 clock periods
        1,200,000 X .000000125 second (time for one clock period) = .15 second = 150 msec

The program reports the time most often as 180 msec. The difference is due to inaccuracies in the system clock, the imprecise resolution and the time needed to execute non-related instructions.

Increasing Loop Execution Speed

My primary use of program 28 was to introduce a method of determining the relative execution speed and memory requirements of two or more algorithms so that these attributes could be compared. As an adjunct to the three primary algorithms, I took the opportunity to introduce the modified_push_method_3 algorithm. With this algorithm, I intended to show you an example for which the third method of pushing a zero onto the stack was definitely superior because it clearly decreased loop execution time by decreasing the number of instructions within the loop. The next program, CLR_MEM, extends the idea of increasing loop execution speed, not by decreasing the instructions within the loop, but by increasing them.
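The 150 msec estimate made above for the clr.w loop can be repeated for all four PUSHZERO loops, which puts a number on what moving the moveq out of the loop should buy. The following is only a calculator sketch, written in Python rather than assembly. The clr.w and dbra clock periods are the ones quoted in the text; the figures for move.w #0,-(sp), moveq and move.w d0,-(sp) are nominal values assumed from the Motorola timing tables, and all of the names are hypothetical.

```python
CLOCK_PERIOD_NS = 125       # one clock period at 8 MHz, in nanoseconds
LOOPS = 50_000
DBRA_CYCLES = 10            # dbra while the count has not expired (from the text)

# Clock periods per instruction. Only clr.w is quoted in the text;
# the other three entries are assumptions taken from the Motorola tables.
CYCLES = {
    "clr.w -(sp)": 14,
    "move.w #0, -(sp)": 12,     # assumed
    "moveq #0, d0": 4,          # assumed
    "move.w d0, -(sp)": 8,      # assumed
}

def loop_msec(instructions, loops=LOOPS):
    """Estimated milliseconds for `loops` passes of the instructions plus dbra."""
    cycles_per_pass = sum(CYCLES[name] for name in instructions) + DBRA_CYCLES
    return cycles_per_pass * loops * CLOCK_PERIOD_NS / 1_000_000

estimates = {
    "push_method_1": loop_msec(["clr.w -(sp)"]),
    "push_method_2": loop_msec(["move.w #0, -(sp)"]),
    "push_method_3": loop_msec(["moveq #0, d0", "move.w d0, -(sp)"]),
    "modified_push_method_3": loop_msec(["move.w d0, -(sp)"]),
}

for name, msec in estimates.items():
    print(f"{name:24s}{msec:7.1f} msec")
```

Under these assumptions the estimates are 150, 137.5, 137.5 and 112.5 msec. They are all lower than the reported 180, 155, 155 and 125 msec, for the reasons given above, but the ordering is the same, and methods 2 and 3 tie in the estimate just as they do in the report.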
We begin the example by considering the task of clearing 32000 bytes of memory, the size of a video screen, to zero. Having been conditioned to consider quantities of memory in terms of bytes, the thought of clearing it a byte at a time should occur naturally. Program 29 also illustrates the advantages to be gained by thinking of memory in terms of words and longwords.

Program 29. A program that clears 32,000 bytes of ram.

; Program Name: CLR_MEM.S
; Version: 1.004

; Assembly Instructions:

; Assemble in PC-relative mode and save with a TOS extension.

; Execution Instructions:

; Execute from the desktop; or execute SPAWN.TTP, type CLR_MEM.TOS on
; its command line and view this program's output in CLR_MEM.DAT.

; Program Function:

; Expands the concepts established with program PUSHZERO. Here,
; simulating the type of algorithm that would be used to clear video screen
; memory, the execution time when clearing memory a longword each time,
; within a loop, is compared to that of the more obvious method of clearing
; a byte, or perhaps a word, each time through the loop.

; Then, the speed advantage of clearing a longword by storing the
; content of a cleared register is illustrated.

; Finally, the advantage of clearing more than one longword within the
; body of the loop is explored, with an eye out for the maximum amount of
; beneficial loop expansion.

calculate_program_size:
        lea     -$102(pc), a1           ; Fetch basepage start address.
        lea     array, a0               ; Fetch program end = array address.
        adda.l  #32000, a0              ; Add in array space.
        movea.l a0, a7                  ; Point A7 to this program's stack.
        trap    #6                      ; Return unused memory to op system.

print_heading:
        lea     heading, a0
        bsr     print_string

clear_one_byte_algorithm:
        lea     header_1, a0
        bsr     print_string
        lea     array, a0               ; A0 is pointer to 32000 byte array.
        move.l  #31999, d7              ; D7 is counter for the clear loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clear_a_byte:
        move.b  #0, (a0)+
        dbra    d7, clear_a_byte        ; Loop 32000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

clear_one_word_algorithm:
        lea     header_2, a0
        bsr     print_string
        lea     array, a0               ; A0 is pointer to 32000 byte array.
        move.l  #15999, d7              ; D7 is counter for the clear loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clear_a_word:
        move.w  #0, (a0)+
        dbra    d7, clear_a_word        ; Loop 16000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

clear_one_longword_algorithm:
        lea     header_3, a0
        bsr     print_string
        lea     array, a0
        move.l  #7999, d7
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clear_a_longword:
        move.l  #0, (a0)+
        dbra    d7, clear_a_longword    ; Loop 8000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

clear_one_longword_with_precleared_register:
        lea     header_4, a0
        bsr     print_string
        lea     array, a0
        moveq   #0, d1                  ; Preclear D1 for use in the loop.
        move.l  #7999, d7
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clear_with_register:
        move.l  d1, (a0)+               ; Clear a longword.
        dbra    d7, clear_with_register
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

clear_4_longwords_algorithm:
        lea     header_5, a0
        bsr.s   print_string
        lea     array, a0
        move.l  #0, d1                  ; Preclear D1 for use in the loop.
        move.l  #1999, d7
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clr_4_longwords:
        move.l  d1, (a0)+               ; A single move statement clears 4 bytes.
        move.l  d1, (a0)+               ; Reduces number of loops to 4000.
        move.l  d1, (a0)+               ; Reduces number of loops to 2666+.
        move.l  d1, (a0)+               ; Reduces number of loops to 2000.
        dbra    d7, clr_4_longwords     ; Loop 2000 times, clear 16 bytes each time.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

clear_8_longwords_algorithm:            ; Will a further increase in move instructions
        lea     header_6, a0            ; within the loop decrease execution time?
        bsr.s   print_string
        lea     array, a0
        move.l  #0, d1                  ; Preclear D1 for use in the loop.
        move.l  #999, d7
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clr_8_longwords:
        move.l  d1, (a0)+               ; A single move statement clears 4 bytes.
        move.l  d1, (a0)+               ; Reduces number of loops to 4000.
        move.l  d1, (a0)+               ; Reduces number of loops to 2666+.
        move.l  d1, (a0)+               ; Reduces number of loops to 2000.
        move.l  d1, (a0)+               ; Reduces number of loops to 1600.
        move.l  d1, (a0)+               ; Reduces number of loops to 1333+.
        move.l  d1, (a0)+               ; Reduces number of loops to 1142+.
        move.l  d1, (a0)+               ; Reduces number of loops to 1000.
        dbra    d7, clr_8_longwords     ; Loop 1000 times, clear 32 bytes each time.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

terminate:
        trap    #8

;
; SUBROUTINES
;

print_string:                           ; Expects address of string to be in A0.
        move.l  a0, -(sp)               ; Push address of string onto stack.
        move.w  #9, -(sp)               ; Function = c_conws = GEMDOS $9.
        trap    #1                      ; GEMDOS call
        addq.l  #6, sp                  ; Reset stack pointer to top of stack.
        rts

        data
heading:        dc.b 'CLR_MEM Execution Results',$D,$A,0
header_1:       dc.b ' Clear one byte time: ',0
header_2:       dc.b ' Clear one word time: ',0
header_3:       dc.b ' Clear one longword time: ',0
header_4:       dc.b ' With register time: ',0
header_5:       dc.b ' Clear 4 longwords time: ',0
header_6:       dc.b ' Clear 8 longwords time: ',0
        bss
        align
start_time:     ds.l 1
                ds.l 96                 ; Stack.
stack:          ds.l 1                  ; Address of stack.
array:          ds.l 0
        end

CLR_MEM Execution Results

  Clear one byte time:          100 milliseconds
  Clear one word time:           45 milliseconds
  Clear one longword time:       35 milliseconds
  With register time:            25 milliseconds
  Clear 4 longwords time:        15 milliseconds
  Clear 8 longwords time:        10 milliseconds

If you assemble the program and execute it repeatedly, you will occasionally obtain 95 msec for the one byte time, 50 msec for the one word time, 15 msec for the 8 longwords time, and so on. You know that variances are possible because of your experience with program 28, so don't let the variances bother you. The important things to realize are these:

1. Clearing the 32000 bytes a word at a time is just about twice as fast as clearing it a byte at a time. This is so only because we were able to reduce the number of times through the loop by half. The amount of time to clear one word of memory is exactly equal to the time it takes to clear one byte.

2. The time it took to clear the 32000 bytes a longword at a time is not one-fourth the time it took to clear it a byte at a time, even though the number of loops was reduced by a factor of four, because it takes more time to clear a longword of memory than it does to clear a byte or a word.

3. Clearing a register to zero before we enter the loop, then storing the content of the register, is faster than using a move.l #0 instruction within the loop.

4. Storing more than one longword within the loop reduced the loop execution time significantly. However, the times reported for the clear_4_longwords and clear_8_longwords algorithms are too close to the resolution of the time recording functions; therefore, the significance of the reported values is questionable.
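The four observations above can be cross-checked against nominal instruction timings. The sketch below is a calculator written in Python, not part of CLR_MEM. It assumes an 8 MHz clock and 10 clock periods for each dbra, and the clock periods given for the four store instructions are assumptions taken from the Motorola timing tables; the names are hypothetical.

```python
CLOCK_PERIOD_NS = 125       # one clock period at 8 MHz, in nanoseconds
ARRAY_BYTES = 32_000
DBRA_CYCLES = 10            # dbra while the count has not expired

# (label, bytes per store, assumed clock periods per store, stores per pass)
VARIANTS = [
    ("Clear one byte",     1, 12, 1),   # move.b #0, (a0)+
    ("Clear one word",     2, 12, 1),   # move.w #0, (a0)+
    ("Clear one longword", 4, 20, 1),   # move.l #0, (a0)+
    ("With register",      4, 12, 1),   # move.l d1, (a0)+
    ("Clear 4 longwords",  4, 12, 4),   # four move.l d1, (a0)+ per pass
    ("Clear 8 longwords",  4, 12, 8),   # eight move.l d1, (a0)+ per pass
]

def estimate_msec(bytes_per_store, store_cycles, stores_per_pass):
    """Estimated milliseconds to clear the array with one loop variant."""
    passes = ARRAY_BYTES // (bytes_per_store * stores_per_pass)
    cycles = passes * (store_cycles * stores_per_pass + DBRA_CYCLES)
    return cycles * CLOCK_PERIOD_NS / 1_000_000

estimates = {label: estimate_msec(b, c, n) for label, b, c, n in VARIANTS}

# Lower bound for the register method: the 8000 stores alone, with no dbra.
floor_msec = (ARRAY_BYTES // 4) * 12 * CLOCK_PERIOD_NS / 1_000_000

for label, msec in estimates.items():
    print(f"{label:20s}{msec:6.2f} msec")
print(f"{'Stores only':20s}{floor_msec:6.2f} msec")
```

With these assumed timings the estimates are 88, 44, 30, 22, 14.5 and 13.25 msec, against a floor of 12 msec for the stores alone. The byte and word stores cost the same, the immediate longword store costs more than the register store, and going from four to eight longwords per pass can recover only a small part of what remains.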
Even though the results obtained for the longword algorithms mentioned in item four are not entirely reliable, they seem to suggest that, as the number of instructions within the loop is increased, we will begin to notice a saturation effect; that is, a reduction in the number of loops will affect the overall loop execution time to a lesser degree because of the amount of time spent within the loop.

After I had seen the results of CLR_MEM, I could have rewritten the program so that it would produce more palatable results. If I had done that, however, I would have denied you the experience of viewing a program which produces valid results for only a portion of its execution. You would not know that I had not been completely successful the first time.

In doing something to validate the results for the last two algorithms of program 29, one would be easily tempted to simply increase the number of times through each loop in the program by an identical factor, say 10. In fact, it is the first method of improvement that occurred to me. When it didn't work, I was forced to examine the details of the DBRA instruction, something I had not done in a while.

If you refer back to program 29, you can see that, if we increase the size of memory to be cleared by a factor of 10, we must clear 320,000 bytes in the clear_one_byte algorithm. To do that, we must loop 320,000 times; therefore, the value 319,999 must be prestored in the loop counter. What is wrong with this picture? The DBRA instruction (as do all of the DBcc instructions) permits its counter parameter to contain only a 16-bit value. The maximum count we can fit in 16 bits is 65535. So we see why we can't set the counter to 319,999 and expect valid results. Similarly, we know that a value of 159,999 for the clear_one_word algorithm will not work, nor will a value of 79,999 work for the clear_one_longword algorithm.
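To see what actually happens to an oversized count, here is a minimal sketch in Python (used only as a calculator; the function names are hypothetical). It models just the one property described above, that DBRA tests and decrements only the low-order 16 bits of its counter register, and it assumes the counter was loaded with move.l as in program 29.

```python
def dbra_passes(initial_count):
    """Loop passes made by a dbra loop whose counter register was loaded
    with initial_count. Only the low-order word takes part in the count,
    and the loop exits when that word steps from 0 to -1 ($FFFF)."""
    return (initial_count & 0xFFFF) + 1

def split_passes(total, inner_max=65536):
    """Smallest outer count such that outer * inner == total and the inner
    count fits a single dbra loop. Returns (outer, inner)."""
    for outer in range(1, total + 1):
        if total % outer == 0 and total // outer <= inner_max:
            return outer, total // outer

for count in (31_999, 319_999, 159_999, 79_999):
    print(f"counter {count:7d} -> {dbra_passes(count):6d} passes")

print(split_passes(320_000))
```

The counter value 31,999 gives the intended 32,000 passes, but 319,999 gives only 57,856, 159,999 gives 28,928 and 79,999 gives 14,464. For 320,000 passes the search returns 5 outer passes of 64,000 inner passes each, which is one way the count could be divided between two nested counters.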
In order to rearrange CLR_MEM so that the results for each algorithm would be comparable, we would have to resort to a minor loop inside of a major loop. Preparing CLR_MEM2, by extracting the necessary portions from PRGM_1F and increasing the appropriate values, is a more elegant solution. And, while we are doing that, we may as well include an algorithm that will let us obtain another saturation effect observation point.

Program 30. Measures the time to clear memory with greater accuracy.

; Program Name: CLR_MEM2.S
; Version: 1.002

; Assembly Instructions:

; Assemble in PC-relative mode and save with a TOS extension.

; Execution Instructions:

; Execute SPAWN.TTP and type CLR_MEM2.TOS on its command line. View the
; output of this program in CLR_MEM2.DAT.

; Program Function:

; The four_longwords_time and the eight_longwords_time reported by program
; CLR_MEM are too short to be reliable. Both values are too close to
; the resolution of the time recording functions.

; In this program, the significance of the reported execution times is
; increased by increasing the number of loops for those two algorithms
; by a factor of 10. This will permit a more valid comparison between the
; two algorithms.

; In addition, since I am going through the trouble of preparing this
; special program, I include one more algorithm. This one is to investigate
; the intermediate five_longwords_time.

; Why go through this trouble? Well, as the number of instructions within
; the loop increases, we should observe a saturation effect in the reduction
; of execution time, even though the number of loops is decreased, simply
; because more time is spent inside the loop. This additional point of
; observation will help to confirm or dismiss that theory.

; Now, because we are increasing the number of loops by 10 in this program,
; if we want to compare the results we obtain with those obtained from
; program CLR_MEM, we will have to divide CLR_MEM2's results by 10.

calculate_program_size:
        lea     -$102(pc), a1           ; Fetch basepage start address.
        lea     program_end, a0         ; Fetch program end address.
        adda.l  #320000, a0             ; Add in array space.
        movea.l a0, a7                  ; Point A7 to this program's stack.
        trap    #6                      ; Return unused memory to op system.

print_heading:
        lea     heading, a0
        bsr     print_string

clear_4_longwords_algorithm:
        lea     header_1, a0
        bsr     print_string
        lea     array, a0
        move.l  #0, d1                  ; Preclear D1 for use in the loop.
        move.l  #19999, d7
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clr_4_longwords:
        move.l  d1, (a0)+               ; A single move statement clears 4 bytes.
        move.l  d1, (a0)+               ; Reduces number of loops to 40000.
        move.l  d1, (a0)+               ; Reduces number of loops to 26666+.
        move.l  d1, (a0)+               ; Reduces number of loops to 20000.
        dbra    d7, clr_4_longwords     ; Loop 20000 times, clear 16 bytes each time.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

clear_5_longwords_algorithm:
        lea     header_2, a0
        bsr.s   print_string
        lea     array, a0
        move.l  #0, d1                  ; Preclear D1 for use in the loop.
        move.l  #15999, d7
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clr_5_longwords:
        move.l  d1, (a0)+               ; A single move statement clears 4 bytes.
        move.l  d1, (a0)+               ; Reduces number of loops to 40000.
        move.l  d1, (a0)+               ; Reduces number of loops to 26666+.
        move.l  d1, (a0)+               ; Reduces number of loops to 20000.
        move.l  d1, (a0)+               ; Reduces number of loops to 16000.
        dbra    d7, clr_5_longwords     ; Loop 16000 times, clear 20 bytes each time.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

clear_8_longwords_algorithm:
        lea     header_3, a0
        bsr.s   print_string
        lea     array, a0
        move.l  #0, d1                  ; Preclear D1 for use in the loop.
        move.l  #9999, d7
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
clr_8_longwords:
        move.l  d1, (a0)+               ; A single move statement clears 4 bytes.
        move.l  d1, (a0)+               ; Reduces number of loops to 40000.
        move.l  d1, (a0)+               ; Reduces number of loops to 26666+.
        move.l  d1, (a0)+               ; Reduces number of loops to 20000.
        move.l  d1, (a0)+               ; Reduces number of loops to 16000.
        move.l  d1, (a0)+               ; Reduces number of loops to 13333+.
        move.l  d1, (a0)+               ; Reduces number of loops to 11428+.
        move.l  d1, (a0)+               ; Reduces number of loops to 10000.
        dbra    d7, clr_8_longwords     ; Loop 10000 times, clear 32 bytes each time.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

terminate:
        trap    #8

;
; SUBROUTINES
;

print_string:                           ; Expects address of string to be in A0.
        move.l  A0, -(sp)               ; Push address of string onto stack.
        move.w  #9, -(sp)               ; Function = c_conws = GEMDOS $9.
        trap    #1                      ; GEMDOS call
        addq.l  #6, sp                  ; Reset stack pointer to top of stack.
        rts

        data
heading:        dc.b 'CLR_MEM2 Execution Results',$D,$A,0
header_1:       dc.b ' Clear 4 longwords time: ',0
header_2:       dc.b ' Clear 5 longwords time: ',0
header_3:       dc.b ' Clear 8 longwords time: ',0
        bss
        align                           ; Align storage on a word boundary.
                ds.l 24                 ; Stack.
stack:          ds.l 0                  ; Address of stack.
program_end:    ds.l 0                  ; Marks the end of program memory.
array:          ds.l 0
        end

CLR_MEM2 Execution Results

  Clear 4 longwords time:       155 milliseconds
  Clear 5 longwords time:       145 milliseconds
  Clear 8 longwords time:       140 milliseconds

Program 30's results indicate that those of program 29 were actually valid enough. If we divide the values reported by CLR_MEM2 by 10, we obtain 15.5, 14.5 and 14.0 msec. In addition, this is the kind of data grouping we observe at a saturation level. Program 30's report gives us the confidence we need to utilize the clear_4_longwords algorithm to clear the video screen, knowing that we are prudently compromising between execution speed and memory requirement.

Unfinished Business: The Morton Conclusion

As promised at the end of chapter 3, I will resume my discussion of Mr. Morton's conclusion concerning multiplication.
Program 31 records the relative speeds of the two methods of multiplying a word operand by four. I have added a little spice, however, because not only are the execution speeds for the two methods identical, but the shift method is the faster of the two when multiplying a longword operand by four. Furthermore, the shift method requires less memory. I should also state that the theory that adding dn to itself twice is faster than shifting left twice is supported in the Kelly-Bootle book on page 202.

Program 31. The refutation.

; Program Name: TIMES_4.S
; Version: 1.005

; Assembly Instructions:

; Assemble in PC-relative mode and save with a TOS extension.

; Execution Instructions:

; Execute from the desktop; or execute SPAWN.TTP, type TIMES_4.TOS on
; its command line and view this program's output in TIMES_4.DAT.

; Program Function:

; Compares the speed of multiplying by 4 with add dn,dn to asl #2,dn
; to determine which is faster. This experiment attempts to prove or
; refute the conclusion advanced by Mike Morton in his magazine article,
; "68000 Tricks and Traps", Byte, September, 1986, p163. As it turns out,
; when multiplying a word operand by 4, as Mr. Morton specifies in his
; article, the execution times are identical. But when multiplying a
; longword by 4, the shift algorithm is faster. Furthermore, the shift
; algorithm requires less memory.

; Author Stan Kelly-Bootle also asserts that the add.w dn,dn algorithm
; is a faster method of multiplying by 4 than is asl #2,dn at the top of
; page 202 of his book "680x0 Programming by Example".

calculate_program_size:
        lea     -$102(pc), a1           ; Fetch basepage start address.
        lea     program_end, a0         ; Fetch program end address.
        trap    #6                      ; Return unused memory to op system.

print_heading:
        lea     heading, a0
        bsr     print_string

word_addition:
        lea     header_1, a0
        bsr     print_string
        move.l  #49999, d7              ; D7 is counter for the loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
word_addition_loop:                     ; Marks start of instruction in the loop.
        add.w   d0, d0                  ; To multiply by two.
        add.w   d0, d0                  ; To multiply by four.
memory_1:                               ; Marks end of instruction in the loop.
        dbra    d7, word_addition_loop
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

print_word_addition_requisite_memory:
        lea     header_5, a0
        bsr     print_string
        lea     memory_1, a0            ; Calculate number of bytes occupied by
        lea     word_addition_loop, a1  ; the instructions in the loop.
        suba.l  a1, a0
        bsr     print_requisite_memory

word_shift:
        lea     header_2, a0
        bsr     print_string
        move.l  #49999, d7              ; D7 is counter for the loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
word_shift_loop:                        ; Marks start of instruction in the loop.
        asl.w   #2, d0                  ; Shift to multiply by 4.
memory_2:                               ; Marks end of instruction in the loop.
        dbra    d7, word_shift_loop     ; Loop 50000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

print_word_shift_requisite_memory:
        lea     header_6, a0
        bsr.s   print_string
        lea     memory_2, a0            ; Calculate number of bytes occupied by the
        lea     word_shift_loop, a1     ; instruction in the loop, then store.
        suba.l  a1, a0
        bsr.s   print_requisite_memory

longword_addition:
        lea     header_3, a0
        bsr.s   print_string
        move.l  #49999, d7              ; D7 is counter for the loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
addition_loop:                          ; Marks start of instructions in the loop.
        add.l   d0, d0                  ; To multiply by two.
        add.l   d0, d0                  ; To multiply by four.
memory_3:                               ; Marks end of instruction in the loop.
        dbra    d7, addition_loop       ; Loop 50000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

print_longword_addition_requisite_memory:
        lea     header_7, a0
        bsr.s   print_string
        lea     memory_3, a0            ; Calculate number of bytes occupied by the
        lea     addition_loop, a1       ; instruction in the loop, then store.
        suba.l  a1, a0
        bsr.s   print_requisite_memory

longword_shift:
        lea     header_4, a0
        bsr.s   print_string
        move.l  #49999, d7              ; D7 is counter for the loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
        trap    #3                      ; Value of system clock returned in D0.
        move.l  d0, d1                  ; Save in trap call register.
shift_loop:                             ; Marks start of instruction in the loop.
        asl.l   #2, d0                  ; Shift to multiply by 4.
memory_4:                               ; Marks end of instruction in the loop.
        dbra    d7, shift_loop          ; Loop 50000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

print_longword_shift_requisite_memory:
        lea     header_8, a0
        bsr.s   print_string
        lea     memory_4, a0            ; Calculate number of bytes occupied by the
        lea     shift_loop, a1          ; instruction in the loop, then store.
        suba.l  a1, a0
        bsr.s   print_requisite_memory

terminate:
        trap    #8

;
; SUBROUTINES
;

print_requisite_memory:
        move.l  a0, d1
        trap    #4
        bsr.s   print_string
        lea     header_9, a0
        bsr.s   print_string
        rts

print_string:                           ; Expects address of string to be in A0.
        move.l  a0, -(sp)               ; Push address of string onto stack.
        move.w  #9, -(sp)               ; Function = c_conws = GEMDOS $9.
        trap    #1                      ; GEMDOS call
        addq.l  #6, sp                  ; Reset stack pointer to top of stack.
        rts

        data
heading:        dc.b "TIMES_4 Execution Results",$D,$A,$D,$A,0
header_1:       dc.b " Word addition time: ",0
header_2:       dc.b $D,$A," Word shift time: ",0
header_3:       dc.b $D,$A," Longword addition time: ",0
header_4:       dc.b $D,$A," Longword shift time: ",0
header_5:       dc.b " Word addition requisite memory: ",0
header_6:       dc.b " Word shift requisite memory: ",0
header_7:       dc.b " Longword add requisite memory: ",0
header_8:       dc.b " Longword shift requisite memory: ",0
header_9:       dc.b " bytes",$D,$A,0
        bss
        align
program_end:    ds.l 0
        end

TIMES_4 Execution Results

  Word addition time:                   130 milliseconds
  Word addition requisite memory:       4 bytes

  Word shift time:                      125 milliseconds
  Word shift requisite memory:          2 bytes

  Longword addition time:               180 milliseconds
  Longword add requisite memory:        4 bytes

  Longword shift time:                  155 milliseconds
  Longword shift requisite memory:      2 bytes

SPEEDTST.TTP Execution Results for TIMES_4.TOS, loaded from drive: G

  Load time:       50 milliseconds
  Execution time: 925 milliseconds

Now that the subject of multiplication algorithms has been introduced, it seems appropriate to present a program which compares the MC68000 multiply instruction to algorithms that accomplish the same task. This might seem puzzling at first, because multiplication instructions are part of the MC68000's repertoire. Nevertheless, as is indicated in Mr. Morton's article, multiplication algorithms which ignore those instructions are faster for certain multipliers, even when the multipliers are not powers of 2. Program 32 illustrates this fact for multiplication by 5, a multiplier that is used in many of my programs to convert the system clock count to milliseconds.

MULTIPLY.TOS compares three methods of multiplying by 5. The program can be executed from the desktop or by typing its name on the command line of SPAWN.TTP or SPEEDTST.TTP.

Program 32. Compares the speed of three multiplication algorithms.

; Program Name: MULTIPLY.S
; Version: 1.002

; Assembly Instructions:

; Assemble in PC-relative mode and save with a TOS extension.

; Execution Instructions:

; Execute from the desktop; or execute SPAWN.TTP, type MULTIPLY.TOS on
; its command line and view this program's output in MULTIPLY.DAT.

; Program Function:

; Measures the speed of multiplication algorithms.

calculate_program_size:
        lea     -$102(pc), a1           ; Fetch basepage start address.
        lea     program_end, a0         ; Fetch program end address.
        trap    #6                      ; Return unused memory to op system.

print_heading:
        lea     heading, a0
        bsr     print_string

the_mulu_instruction:
        lea     header_1, a0
        bsr     print_string
        move.l  #49999, d7              ; D7 is counter for the loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
mulu_loop:                              ; Marks start of instruction in the loop.
        mulu    #5, d0                  ; Instruction in the loop.
memory_1:                               ; Marks end of instruction in the loop.
        dbra    d7, mulu_loop           ; Loop 50000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

addition:
        lea     header_2, a0
        bsr.s   print_string
        move.l  #49999, d7              ; D7 is counter for the loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
addition_loop:                          ; Marks start of instructions in the loop.
        move.l  d0, d2                  ; To add one.
        add.l   d0, d0                  ; To double to two.
        add.l   d0, d0                  ; To double to four.
        add.l   d2, d0                  ; To complete multiplication by 5.
memory_2:                               ; Marks end of instructions in the loop.
        dbra    d7, addition_loop       ; Loop 50000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

shift_and_add:
        lea     header_3, a0
        bsr.s   print_string
        move.l  #49999, d7              ; D7 is counter for the loop.
        trap    #3                      ; Fetch start time.
        move.l  d0, d3                  ; Save start_time in D3.
shift_loop:                             ; Marks start of instructions in the loop.
        move.l  d0, d2                  ; Save a copy to add.
        asl.l   #2, d0                  ; Shift to multiply by 4.
        add.l   d2, d0                  ; To complete multiplication by 5.
memory_3:                               ; Marks end of instructions in the loop.
        dbra    d7, shift_loop          ; Loop 50000 times.
        trap    #3                      ; Fetch end time.
        sub.l   d3, d0                  ; Subtract start time from end time.
        trap    #10                     ; Convert to decimal milliseconds and print.

print_mulu_requisite_memory:
        lea     header_4, a0
        bsr.s   print_string
        lea     memory_1, a0            ; Calculate number of bytes occupied by the
        lea     mulu_loop, a1           ; instruction in the loop.
        bsr.s   print_requisite_memory

print_addition_requisite_memory:
        lea     header_5, a0
        bsr.s   print_string
        lea     memory_2, a0            ; Calculate number of bytes occupied by the
        lea     addition_loop, a1       ; instructions in the loop.
        bsr.s   print_requisite_memory

print_shift_and_add_requisite_memory:
        lea     header_6, a0
        bsr.s   print_string
        lea     memory_3, a0            ; Calculate number of bytes occupied by the
        lea     shift_loop, a1          ; instructions in the loop.
        bsr.s   print_requisite_memory

terminate:
        trap    #8

; SUBROUTINES

print_requisite_memory:
        suba.l  a1, a0
        move.l  a0, d1
        trap    #4                      ; Returns address of decimal string in A0.
        bsr.s   print_string
        lea     header_7, a0
        bsr.s   print_string
        rts

print_string:                           ; Expects address of string to be in A0.
        pea     (a0)                    ; Push address of string onto stack.
        move.w  #9, -(sp)               ; Function = c_conws = GEMDOS $9.
        trap    #1                      ; GEMDOS call
        addq.l  #6, sp                  ; Reset stack pointer to top of stack.
        rts

        data
heading:        dc.b "MULTIPLY Execution Results",$D,$A,$D,$A,0
header_1:       dc.b " MULU time: ",0
header_2:       dc.b " Addition time: ",0
header_3:       dc.b " Shift and add time: ",0
header_4:       dc.b $D,$A," MULU requisite memory: ",0
header_5:       dc.b " Addition requisite memory: ",0
header_6:       dc.b " Shift and add requisite memory: ",0
header_7:       dc.b " bytes",$D,$A,0
        bss
        align
program_end:    ds.l 0
        end

MULTIPLY Execution Results

  MULU time:                            360 milliseconds
  Addition time:                        260 milliseconds
  Shift and add time:                   230 milliseconds

  MULU requisite memory:                4 bytes
  Addition requisite memory:            8 bytes
  Shift and add requisite memory:       6 bytes

The results clearly show that the shift and add algorithm, in this case, is superior. It is roughly 1.5 times faster than the MULU instruction (360 msec against 230 msec), while requiring only 2 additional bytes of memory. On page 168 of his article, Mr. Morton shows an example of multiplication by 17.

LEA Versus ADDA For Stack Adjustments

On page 166 of the same Morton article is a nice tip concerning stack pointer movement. Mr. Morton used so few words to describe this hint that it is easy to slide right by it while reading the article. Program 33 explores this subject in a bit more detail, confirming that adjustments to the stack using the LEA instruction are twice as fast as adjustments using the ADDA instruction. The reason that this comparison is of concern is that the ADDQ instruction can be used for stack adjustments only when the alteration is limited to a maximum of 8 bytes. Beyond that, the obvious choice is the ADDA instruction, and it is there that LEA offers a faster alternative.

Within the documentation of program 33, I have included a timing note concerning the timing method I have been using and its relationship to the instruction execution times listed in the Motorola manual. With suitable precautions, you may be able to use the information there to generate timing data that more closely conforms to the Motorola data, if you should find it necessary or desirable.

Program 33.
Confirms that the LEA instruction is the better choice for stack adjustments greater than 8 bytes.

 ; Program Name: LEA_ADDA.S
 ; Version: 1.004

 ; Assembly Instructions:

 ; Assemble in PC-relative mode and save with a TOS extension.

 ; Execution Instructions:

 ; Execute from the desktop; or execute SPAWN.TTP, type LEA_ADDA.TOS on
 ; its command line and view this program's output in LEA_ADDA.DAT.

 ; Program Function:

 ; Confirms (or refutes) the contention that lea $C(An), An is faster
 ; than adda.l #$C, An, where n = 0 - 7. Reference: "68000 Tricks and Traps",
 ; by Mike Morton, p.166, Byte, September, 1986. See the paragraph which
 ; begins "Small adjustments to the stack pointer...".

calculate_program_size:
        lea     -$102(pc), a1       ; Fetch basepage start address.
        lea     program_end, a0     ; Fetch program end address.
        trap    #6                  ; Return unused memory to op system.
        lea     stack, a7           ; Point A7 to this program's stack.

print_heading:
        lea     heading, a0
        bsr     print_string

lea_method:
        lea     lea_time_msg, a0    ; Print label for lea execution results.
        bsr     print_string
        move.l  #49999, d7          ; D7 is counter for the loop.
        trap    #3                  ; Value of system clock returned in D0.
        move.l  d0, d1              ; Save time in scratch register.
lea_loop:                           ; Marks start of instruction in the loop.
        lea     $C(a2), a2          ; Instruction in the loop.
memory_1:                           ; Marks end of instruction in the loop.
        dbra    d7, lea_loop        ; Loop 50000 times.
        trap    #3                  ; Get current value of system clock.
        sub.l   d1, d0              ; Subtract beginning value from ending value.
        mulu    #5, d0              ; Convert to milliseconds.
        sub.l   #80, d0             ; Subtract dbra time and "error".
                                    ; See Timing Note below.
        move.l  d0, d1              ; Transfer time for trap #4 call.
convert_lea_time_to_ASCII_decimal:
        trap    #4                  ; Address of decimal string returned in A0.
print_lea_time:
        bsr.s   print_string        ; Print the decimal string.
        lea     time_msg, a0        ; Print units label.
        bsr.s   print_string

 ; Timing Note:

 ; Until the count has expired, each dbra instruction requires 10 clock
 ; periods to execute.
 ; 49,999 dbra instructions require that amount of time. When the last
 ; dbra instruction is executed, the count has expired and that
 ; instruction requires 14 clock periods.

 ; Total clock periods consumed by the dbra instruction is

 ;      49,999 X 10 + 14 = 500,004.

 ; Each clock period is .000000125 second. Total time consumed by the
 ; dbra instructions is

 ;      500,004 X .000000125 = .0625 second = 62.5 milliseconds.

 ; If we assume that the 8 clock period execution time given in the
 ; Motorola reference manual for the lea d(An) instruction is correct,
 ; then 50,000 executions of lea $C(a2), a2 require

 ;      50,000 X 8 X .000000125 = 50 milliseconds

 ; Total time for the lea loop is 62.5 + 50 = 112.5 milliseconds.

 ; When no adjustment is made for "overhead" time, such as attendant
 ; instructions, which includes the dbra instruction, trap invocations and
 ; a few others, this program reports a lea loop execution time of 130
 ; milliseconds. We can assume that the excess 17.5 milliseconds (130-112.5)
 ; is partially due to the "overhead" time and partially due to a system
 ; clock frequency different from 8 MHz. We can simply combine both
 ; components into "overhead" time.

 ; With no adjustment for "overhead" time this program reports a loop
 ; execution time of 180 milliseconds for the adda.l #$C, a2 loop. We
 ; subtract the 17.5 milliseconds "overhead" from this to obtain the true
 ; loop time of 162.5 milliseconds.

 ; From this we subtract the dbra execution time of 62.5 milliseconds to
 ; get 100 milliseconds for 50,000 adda.l instructions.

 ;      100 milliseconds / 50,000 = .000002 second per instruction.

 ;      .000002 / .000000125 = 16 clock periods per instruction

 ; The execution time given in the Motorola manual for the adda.l #d, An
 ; instruction is 16 clock periods.

 ; These calculations validate the method and justify subtracting the
 ; "overhead" time from the total loop execution times before printing the
 ; execution time for 50,000 executions of each of the two instructions of
 ; interest.
 ; Finally, we can combine dbra time and "overhead" time into
 ; a "new" overhead time that is 17.5 + 62.5 = 80 milliseconds.

 ; When the adjustments are made to the recorded loop times, the program
 ; prints out the correct time for 50,000 executions of each instruction,
 ; and the output correctly indicates that the lea instruction used in
 ; the program is twice as fast as the adda.l instruction used. The
 ; execution results of the program agree with the timing data given in
 ; the Motorola reference guide.

adda_method:
        lea     adda_time_msg, a0   ; Print label for adda execution results.
        bsr.s   print_string
        move.l  #49999, d7          ; D7 is counter for the loop.
        trap    #3                  ; Value of system clock returned in D0.
        move.l  d0, d1              ; Save time in scratch register.
adda_loop:                          ; Marks start of instruction in the loop.
        adda.l  #$C, a2             ; Instruction in the loop.
memory_2:                           ; Marks end of instruction in the loop.
        dbra    d7, adda_loop       ; Loop 50000 times.
        trap    #3                  ; Get current value of system clock.
        sub.l   d1, d0              ; Subtract beginning value from ending value.
        mulu    #5, d0              ; Convert to milliseconds.
        sub.l   #80, d0             ; Subtract dbra time and "error".
        move.l  d0, d1              ; Transfer time for trap #4 call.
convert_adda_time_to_ASCII_decimal:
        trap    #4                  ; Address of decimal string returned in A0.
print_adda_time:
        bsr.s   print_string        ; Print the decimal string.
        lea     time_msg, a0        ; Print units label.
        bsr.s   print_string

print_lea_requisite_memory:
        lea     lea_memory_msg, a0
        bsr.s   print_string
        lea     memory_1, a0        ; Calculate number of bytes occupied by the
        lea     lea_loop, a1        ; instruction in the loop.
        bsr.s   print_requisite_memory

print_adda_requisite_memory:
        lea     adda_memory_msg, a0
        bsr.s   print_string
        lea     memory_2, a0        ; Calculate number of bytes occupied by the
        lea     adda_loop, a1       ; instruction in the loop.
        bsr.s   print_requisite_memory

terminate:
        trap    #8

 ; SUBROUTINES

print_requisite_memory:
        suba.l  a1, a0
        move.l  a0, d1
        trap    #4                  ; Returns address of decimal string in A0.
        bsr.s   print_string
        lea     memory_msg, a0
        bsr.s   print_string
        rts

print_string:                       ; Expects address of string to be in A0.
        pea     (a0)                ; Push address of string onto stack.
        move.w  #9, -(sp)           ; Function = c_conws = GEMDOS $9.
        trap    #1                  ; GEMDOS call
        addq.l  #6, sp              ; Reset stack pointer to top of stack.
        rts

        data
heading:         dc.b $D,$A,'LEA_ADDA Execution Results',$D,$A,$D,$A,0
lea_time_msg:    dc.b ' Time for 50,000 lea instructions: ',0
adda_time_msg:   dc.b ' Time for 50,000 adda.l instructions: ',0
time_msg:        dc.b ' milliseconds',$D,$A,0
lea_memory_msg:  dc.b $D,$A,' LEA requisite memory: ',0
adda_memory_msg: dc.b ' ADDA.L requisite memory: ',0
memory_msg:      dc.b ' bytes',$D,$A,0
        bss
        align                       ; Align storage on a word boundary.
                 ds.l 96
stack:           ds.l 0
program_end:     ds.l 0             ; Marks the end of program memory.
        end

LEA_ADDA Execution Results

 Time for 50,000 lea instructions: 50 milliseconds
 Time for 50,000 adda.l instructions: 100 milliseconds

 LEA requisite memory: 4 bytes
 ADDA.L requisite memory: 6 bytes

The programs that I have used to compare the execution speeds and requisite memory of specific instructions and algorithms should have, by now, revealed their similarity of structure. It is appropriate that I present a more general algorithm which can be used to perform such comparisons. By endowing this algorithm with a certain amount of intelligence, I will be able to reduce the size of the programs that compare algorithms and, in addition, to introduce the subject of algorithmic intelligence itself.

At first, my inclination was to include the general purpose performance testing algorithm in this chapter, but the detail with which I intend to discuss it and its ramifications would force this chapter to become very long and unwieldy. Therefore, I have decided to devote an exclusive chapter to the algorithm.
The subject of the next chapter, then, is the design of an algorithm that is smart enough to accept a variety of instructions or other algorithms as input and to generate suitable data for execution speed and requisite memory comparisons.

I suggest that, if at all possible, you read The Micro Millennium, by Christopher Evans, first published in 1980 by The Viking Press, before you turn to chapter 7. At the very least, I think that you should read Chapters 12 through 14, which discuss the following subjects: "The Nature of Intelligence", "Can a Machine Think?" and "Towards the Ultra Intelligent Machine". If you follow this advice, then you will probably feel more comfortable with my use of the word intelligence as it applies to computers and computer algorithms.

Conclusion

The emphasis in this chapter, as it has been in previous chapters and as it will continue to be in future chapters, is on accuracy, reliability, execution speed and minimum requisite memory. I feel that it is via a constant pressure to maintain these algorithmic attributes that all computer programming goals are eventually achieved. In chapter 7, I shall illustrate the manner in which the desire to maintain these attributes leads to the development of algorithms which assume an intelligence of their own, thereby violating the "computers are dumb" principle.
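
As a closing illustration of the stack adjustment tip examined in program 33, the following is a minimal sketch of how the tip might be applied after a GEMDOS call that leaves more than 8 bytes of parameters on the stack. The f_read call (GEMDOS $3F) is used only because its parameter frame happens to be 12 bytes; the labels read_block, handle and buffer are hypothetical names chosen for this illustration and do not appear in any of the chapter's programs. The clock period and byte counts in the comments are the ones quoted in program 33 and its execution results.

```asm
read_block:                         ; Hypothetical label, for illustration only.
        pea     buffer              ; Push address of input buffer (4 bytes).
        move.l  #512, -(sp)         ; Push number of bytes to read (4 bytes).
        move.w  handle, -(sp)       ; Push file handle (2 bytes).
        move.w  #$3F, -(sp)         ; Function = f_read = GEMDOS $3F (2 bytes).
        trap    #1                  ; GEMDOS call
        lea     $C(sp), sp          ; Remove 12 bytes: 8 clock periods, 4 bytes.
 ;      adda.l  #$C, sp             ; Same effect: 16 clock periods, 6 bytes.
```

For adjustments of 8 bytes or fewer, addq.l remains the natural choice, as in the print_string subroutine used throughout this chapter.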