Truly efficient software has an intimate, almost incestuous relationship
with its hardware. They merge so thoroughly as to become inseparable;
neither makes any sense without the other.
This requires that you, the programmer, must TOTALLY understand the
hardware. I cannot stress this point too strongly. The strengths and
weaknesses of the hardware influence program structure at every level, not
just the low-level drivers. A system with weak disk I/O will be slow and
unresponsive if your program relies on overlays. A shallow keyboard buffer
requires frequent checks to avoid missing keys. The characteristics of the
console device determine your whole approach to data displays. If you try
to hide from these limitations in a high-level language, your program will
work as if it were written in BASIC 101. Let's consider some actual case
histories of what can be gained by paying attention to the hardware.
CASE #1
A customer needed a faster way to transfer data between two computers.
He had been using a serial port at 9600 baud but complained that it was too
slow and tied up the computer's serial port. Hardware mods were ruled out.
After study, I found that each computer had unused handshake lines in its
RS-232 port. A special "Y" cable was built to cross-connect two of these
lines, providing one bit of serial I/O in each direction. A "software UART"
program was then written to transfer data between the two machines. This
worked to about 30K bits per second before timing dither (due to
interrupts, memory refresh, etc.) caused errors.
The serial port's UART could be programmed to generate an interrupt when
the handshake line went low. Therefore, an interrupt-driven protocol with
handshaking was devised. A '0' was sent by pulling the output low until the
other computer echoed the low on its output. A '1' was sent by pulsing the
output low and immediately back high and waiting until the other system
echoed it. The data rate increased to over 100K bits per second, and
transfers were now unaffected by disk I/O, keyboard activity, etc.
CASE #2
The firmware for a CRT terminal was to be upgraded to run 38400 bits per
second without handshaking. Now, 38400 bps is fast, only 260 microseconds
per character (about 75 instructions for a 3 MHz Z80).
The slowest routines need the most attention. For example, clear-line
was accomplished by moving the stack pointer to the end of the line and
executing 36 PUSH HL instructions. The interrupt handler needed a 4-level
stack, so the last 8 bytes were cleared normally. Clear-screen used 25
iterations of clear-line.
This still isn't fast enough to complete every ESC sequence before the
next one is received. This calls for an interrupt-driven system. Each
character received generates an interrupt. The interrupt handler pushes the
character into one end of a FIFO (First-In-First-Out) buffer in memory. The
main program pops characters out the other end and processes them. The FIFO
fills while we process slow commands like clear-screen and empties back out
during fast commands.
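In C, such a buffer might look like the sketch below (a single-producer,
single-consumer ring; the names and the 256-byte size are illustrative,
not from the original firmware). The one-byte indices wrap around
automatically at 256, much as the Z80 version would wrap a one-byte
pointer.

```c
#include <stdint.h>

/* Sketch of the receive FIFO: the interrupt handler pushes at the head,
   the main loop pops at the tail. */
static uint8_t fifo[256];
static volatile uint8_t head, tail;    /* head == tail means empty */

static int fifo_put(uint8_t c)         /* called by the interrupt handler */
{
    uint8_t next = (uint8_t)(head + 1);
    if (next == tail) return 0;        /* full -- caller must shed load */
    fifo[head] = c;
    head = next;
    return 1;
}

static int fifo_get(uint8_t *c)        /* called by the main program */
{
    if (tail == head) return 0;        /* empty */
    *c = fifo[tail++];                 /* one-byte index wraps at 256 */
    return 1;
}
```

One slot is sacrificed so that full and empty are distinguishable without
a separate count, which keeps the interrupt side free of read-modify-write
races against the main loop.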
But what if some idiot sends a long string of slow commands (like 100
clear-screens in a row)? The FIFO would eventually overflow, and data would
be lost. I prevented this with "look-ahead" logic. When the interrupt
handler spots a clear-screen command, it sets a flag so MAIN expects it.
MAIN can then ignore unnecessary commands (no sense altering a screen that's
about to be cleared).
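The idea can be modeled as below. This is a simplified C sketch: the CLS
code and the counters are assumptions for illustration, and the real
firmware did this with Z80 registers and flags.

```c
#define CLS 0x0C                  /* assumed one-byte clear-screen code */

static volatile int pending_cls;  /* clears queued but not yet executed */
static int drawn, cleared;        /* stand-ins for actual screen work */

static void irq_receive(char c)   /* interrupt side: look ahead as we queue */
{
    if (c == CLS) pending_cls++;
    /* ...then push c into the FIFO as usual... */
}

static void main_process(char c)  /* main-loop side: pop and execute */
{
    if (c == CLS) { pending_cls--; cleared++; return; }
    if (pending_cls > 0) return;  /* a clear is coming; skip this work */
    drawn++;                      /* otherwise really draw the character */
}
```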
Scrolling is one of the most difficult actions. The obvious algorithm is
to block move lines 2-24 up 1, then clear line 24. But that's what IBM did
on the PC, and we all know how well that worked. So examine the 6845 CRT
controller. The Start-Address register holds the address of the first
character on the screen, the one displayed in the top left corner. If we
add 80 to it, line 2 instantly becomes the top line, and we've scrolled the
whole screen up a line. All that remains is to clear the 80 bytes that
form the new 24th line, for which we have a fast routine.
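Here is a C model of the technique (illustrative only; the real routine
was of course a few Z80 instructions poking the 6845 registers), assuming
a 24-line, 80-column display in 2K of video RAM at F800-FFFF:

```c
#include <stdint.h>

#define SCREEN_BASE 0xF800u
#define SCREEN_SIZE 0x0800u           /* 2K of video RAM, F800-FFFF */
#define COLS        80u

static uint8_t  vram[SCREEN_SIZE];    /* stand-in for the video RAM */
static uint16_t start_addr = SCREEN_BASE;  /* 6845 Start-Address register */

static uint16_t wrap(unsigned addr)   /* keep addresses inside F800-FFFF */
{
    return (uint16_t)(SCREEN_BASE + (addr - SCREEN_BASE) % SCREEN_SIZE);
}

static void put_char(unsigned row, unsigned col, char c)
{
    vram[wrap(start_addr + row * COLS + col) - SCREEN_BASE] = (uint8_t)c;
}

static char char_at(unsigned row, unsigned col)
{
    return (char)vram[wrap(start_addr + row * COLS + col) - SCREEN_BASE];
}

static void scroll_up(void)
{
    start_addr = wrap(start_addr + COLS);   /* line 2 becomes the top line */
    for (unsigned i = 0; i < COLS; i++)     /* clear only the new 24th line */
        put_char(23, i, ' ');
}
```

Only 80 bytes are touched per scroll; everything else is one register
write to the 6845.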
Each scroll moves the start address up another 80 bytes. This obviously
can't go on indefinitely, so the original program spent a great deal of time
checking for overflow outside its 2K block of screen RAM (F800-FFFF). For
instance, the old code read:
ld (hl),a ; put character on screen
inc hl ; advance to next
ld a,h ; get new address
or 0F8h ; if overflow to 0000,
ld h,a ; force it to F800-FFFF
But is this really necessary? The schematic revealed that the 2K RAM was
partially decoded and actually occupied 16K in the Z80's address space
(C000-FFFF). It's far easier to insure that an address lies within this
range:
ld (hl),a ; put character on screen
res 6,h ; insure we don't wrap to 0000
inc hl ; advance to next
CASE #3
Fast Disk I/O. Way back in 8 B.C. (eight years Before Clones) I had an
S-100 system. Its 8080 CPU blazed along at 1.843 MHz, through 32K of RAM
spread over half a dozen furnace boards. Two Shugart SA-801R single-sided
8" drives provided disk storage, with CP/M 1.4 tying it all together. That
old war horse and I fought many battles together, until it finally died the
Death-of-1000-Intermittents.
Many of its "features" I'd rather forget, but it had one outstanding
attribute: the fastest floppies I've ever seen. Warm boots were done before
your fingers were off the keys; WordStar loaded in under a second; PIP
copied files at 10K bytes/sec. All without a fast CPU, DMA, vectored
interrupts, or even a disk controller IC. The "controller" was just a bunch
of TTL chips implementing a parallel port, an 8-bit shift register, and a
CRC checkcode generator. The real work was done by the CPU, byte-banging
out the IBM 3740 SD/DD format in software.
How good was it? An 8" disk spins at 360 rpm, or 6 revs/sec. Each track
held 6.5K (26 double-density sectors of 256 bytes each). That makes the
theoretical maximum transfer rate 6.5K x 6 = 39K bytes/sec. It actually
achieved 50% of this, or 20K bytes/sec.
Few modern micros come anywhere near this level of performance. The
Kaypro I wrote this article on creeps through the disk at 4K/sec. My PC
clone is closer, at 12K/sec. The problem is that the CPU spends most of its
time in wait loops: waiting for the drive motor to start, for the head to
load, for an index hole, for a certain sector to come around on the disk.
The capabilities of fast CPUs, elaborate interrupt systems, DMA, and fancy
disk controllers are thrown away by crude software.
The CPU has better things to do. If the disk isn't ready when an
application program needs it, the BIOS should start the task, save the data
in a buffer, and set up an interrupt to finish the task later when the disk
is REALLY ready. The time lost to wait loops is thus reclaimed to run your
application programs.
That's how my antique worked. The BIOS maintained a track buffer in RAM.
The first read from a particular track moved the head to the desired track
and read the whole thing into the buffer. Further reads from that track
simply came from RAM, taking virtually no time at all.
Similarly, writes to a sector on the current track just put data in the
buffer and marked it as changed. The actual write was performed later, when
a new track was selected for read/write, or just before the drive timed out
from a lack of disk activity.
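In outline, this is an ordinary write-back cache with the track as its
unit. A C sketch follows; the arrays stand in for the physical disk, the
sizes match the 26-sector format described earlier, and the names are
illustrative, not from the original BIOS.

```c
#include <string.h>

#define TRACKS     77
#define TRACK_SIZE (26 * 256)           /* one full track */
#define SEC_SIZE   256

static char disk[TRACKS][TRACK_SIZE];   /* stand-in for the physical disk */
static char trkbuf[TRACK_SIZE];         /* the BIOS track buffer */
static int  cur_track = -1;
static int  dirty;                      /* buffer changed but not written */

static void select_track(int t)
{
    if (t == cur_track) return;
    if (dirty) {                        /* flush deferred writes first */
        memcpy(disk[cur_track], trkbuf, TRACK_SIZE);
        dirty = 0;
    }
    memcpy(trkbuf, disk[t], TRACK_SIZE);/* one revolution reads it all */
    cur_track = t;
}

static void read_sector(int t, int s, char *out)
{
    select_track(t);                    /* often a no-op: served from RAM */
    memcpy(out, &trkbuf[s * SEC_SIZE], SEC_SIZE);
}

static void write_sector(int t, int s, const char *in)
{
    select_track(t);
    memcpy(&trkbuf[s * SEC_SIZE], in, SEC_SIZE);
    dirty = 1;                          /* physical write happens later */
}
```

A real BIOS would also flush on the drive timeout described above; that
timer is omitted here.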
Physical track reads/writes were fast as well. The key was to simply
begin wherever the head was. After seeking to the desired track, it read
the ID# of each sector encountered and transferred it to/from the appropriate
place in the RAM buffer. No need to find the index hole, wait for a
particular sector ID#, or worry about interleave; one revolution got it all.
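The placement step is the whole trick, and it is tiny. In the sketch below
(illustrative names; the sector count matches the format above), sectors
arrive in whatever rotational order they happen to pass the head and are
dropped into the buffer by their ID fields:

```c
#include <string.h>

#define SECTORS  26
#define SEC_SIZE 256

struct sector { int id; char data[SEC_SIZE]; };  /* as read off the disk */

/* One revolution's worth of sectors, in arrival order, placed by ID. */
static void read_track(const struct sector *pass, int n, char *buf)
{
    for (int i = 0; i < n; i++)
        memcpy(&buf[pass[i].id * SEC_SIZE], pass[i].data, SEC_SIZE);
}
```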
Such a system must be implemented carefully. CP/M does not expect
delayed error messages, which can produce some odd results. For instance, a
BDOS read error might be reported when the real cause was a write error in
flushing the previous track buffer to disk. Also, modern drives do not have
door locks to prevent disk removal if unwritten data remains in the track
buffer.
The main factor limiting my S-100 system's performance was the slow CPU
and lack of DMA. A double-density 8" disk has a peak data transfer rate of
500K bits/sec, which only allows 16 microseconds between bytes. This
required polled I/O where the CPU was 100% devoted to the disk during actual
reads/writes.
5-1/4" disks have a slower maximum transfer rate, but modern hardware has
advantages that can make up for it. A normal 5-1/4" disk spins at 300 rpm,
or 5 rev/sec. Assuming 9 sectors of 512 bytes per track, the maximum
transfer rate is 22.5K bytes/sec. The peak data rate is 250K bits/sec, or
32 microseconds per byte. This is slow enough for a 4 MHz Z80 to (barely)
handle it on an interrupt basis. Here's an interrupt handler to read 256
bytes from a disk controller chip at 32 microseconds max. per byte:
T-states
23 ; time to finish longest instruction
13 ; Z80 interrupt mode 0 or 1 response
11 int: push af ; save registers used
11 in a,(data) ; read data byte from disk controller
13 next: ld (buffer),a ; store it in buffer (a variable)
13 ld a,(next+1) ; get buffer address
4 inc a ; increment
13 ld (next+1),a ; save for next time
7 jr z,done ; if end of page, done
10 pop af ; else restore registers
10 ret ; and return
----
128 T-states max = 32 microseconds with a 4 MHz Z80
But this routine barely squeaks by. It can't use interrupt mode 2 (which
adds 6 T-states to the response time) or signal Z80 peripherals that the
interrupt is complete with an RETI (which adds 4 T-states). It's limited to
a 256-byte sector. Worse, some disk controller chips need processing time
of their own. The popular Western Digital FD179x series only allows 27.5
microseconds for each byte.
So we have to get clever again. The following example reads pairs of
bytes, the first on an interrupt and the second by polled I/O. This
improves performance to allow interrupt mode 2, larger sector sizes, and the
slow response time of a FD179x chip:
T-states
23 ; time to finish longest instruction
19 ; Z80 interrupt mode 2 response time
11 int: push af ; save A and flags
11 in a,(data) ; read 1st byte from disk controller
11 push hl ; save HL
10 next: ld hl,buffer ; get buffer address (a variable)
7 ld (hl),a ; store byte in buffer
6 inc hl ; advance buffer pointer
6 inc hl ; for next interrupt
16 ld (next+1),hl; & store it
6 dec hl ; point to current address
11+11 check:in a,(status) ; check disk controller status
4+ 4 rra ; if not busy (bit 0=1),
7+ 7 jr nc,done ; then we're done
4+ 4 rra ; if next byte not ready (bit 1=0),
12+ 7 jr nc,check ; then loop until it is
11 in a,(data) ; get 2nd byte from disk controller
7 ld (hl),a ; & store it in buffer
10 pop hl ; restore registers
10 pop af
14 reti ; return
----
188 or 226 T-states (for 1 or 2 passes through status loop)
This routine reads bytes from the controller chip within 17.75
microseconds worst-case. Interrupt overhead averages 80% for a 4 MHz Z80,
leaving 20% for the main program execution. The peculiar way of
incrementing the address pointer minimizes the worst-case delay from an
interrupt or status flag change until the byte is read. We want to maximize
the chance that the second character is ready the first time the status is
checked.
Why improve your disk system? Because, as a practical matter, there's
more to be gained by improving it than any other change you could make.
It's disk I/O that sets the pace, not CPU speed or memory size. Users
almost never wait on CPU speed; it's the disk that keeps you twiddling your
thumbs with the keyboard ignored, the screen frozen, and the disk drive
emitting Bronx cheers. Put a Commodore 64's tinkertoy disk system on an AT
clone, and you'd have high-priced junk that only a masochist would use.
Conversely, the AT's DMA-based disk I/O would transform a C64 into a
fire-breathing dragon that would eat its competition alive.
Algorithms
When a hardware engineer sits down to design a circuit, he doesn't begin
with a blank sheet of paper. He has a vast library of textbooks, data
sheets, and catalogs of standard circuits to choose from. Most of the task
is simply connecting off-the-shelf components into one of these standard
configurations, modifying them as necessary to satisfy any unique
requirements.
Algorithms are to programmers what IC chips are to hardware designers.
Just as the engineer builds a library of standard parts and circuits, every
programmer must continually build his own algorithm collection. Whether
it's a shoebox full of magazine clippings or a carefully indexed series of
notebooks, start NOW.
Programming textbooks tend to concentrate on traditional computer
algorithms for floating-point math, transcendental functions, and sorting
routines. The old standby is Knuth's "The Art of Computer Programming".
Hamming's "Numerical Methods for Scientists and Engineers" explains the
basics of iterative calculations. "Digital Computation and Numerical
Methods" by Southworth and DeLeeuw provides detailed flowcharts and sample
code as well.
Magazines are a great source and tend to be more down-to-earth and closer
to the state of the art. Read carefully! Good algorithms may be expressed
in BASIC listings, assembly code for some obscure processor, or pocket
calculator key sequences, or even disguised as circuit diagrams.
Professional journals like EDN or Computer Design are often better than the
popular magazines, which have pretty much abandoned education in favor of
marketing. Especially check out back issues. The cruder the hardware, the
trickier the algorithms had to be to make up for it.
Manufacturers' technical literature is a gold mine. Get the
manufacturers' own manuals, not some boiled-down paperback from the
bookstore. They won't be models of clarity but are full of hidden gold.
Read everything: hardware and software manuals, data sheets, application
notes, etc.
User groups are the traditional source of solutions to specific
problems. Even better, they provide actual implementations in printed
listings, on disk, or even by modem.
Don't waste time reinventing the wheel. Learn from others what works,
and what doesn't. Some of the best (and worst) algorithms I know were found
by disassembling existing programs. And once you find a good algorithm,
recycle it. That clever sort routine for an antique 8008 may be the
foundation of the fastest '386 sort yet!
Conclusion
These techniques are not new; in fact, old-timers will recognize many of
them from the early days of computing, when hardware limitations were more
severe. However, they have fallen into disuse. A whole generation of
programmers has been taught that such techniques have no place in modern
structured programming.
The theory goes something like this: Programs written in a high-level
language are faster and easier to write, debug, document, and maintain.
Memory and speed are viewed as infinite resources, so the performance loss
is unimportant. Programs should be totally generic; it is the compiler's or
run-time library's job to worry about the hardware interface.
These rules make sense in a mainframe environment, where the hardware
resources are truly awesome and teams of programmers spend years working on
one application. But they impose severe penalties on a microcomputer
system. The user must pay for the programmer's luxuries with higher
hardware cost and lackluster performance.
It's easy to forget that "microcomputer" literally means "one millionth
of a computer". Microprocessors make abysmally bad CPUs. Build a computer
with one, and you'll wind up needing $5000 worth of memory and peripherals
to support a $5 CPU chip.
But micros make superlative controllers. That's what they were designed
for, and what they do best. A single microcomputer can replace dozens of
boards and hundreds of ICs with as little as a single chip. That's why 90%
of all microprocessors go into non-computer uses: calculators, auto emission
controls, home entertainment equipment, industrial controls, and the like.
Of 30 million Z80s sold last year, fewer than 1 million went into computers.
Programming a controller is different from programming a computer. Most
applications demand real-time multi-tasking capabilities, and there is never
enough speed or memory. Inputs and outputs are physical hardware devices,
not abstract data structures, so the code must inevitably be
hardware-dependent. Computer languages are just not cut out for this sort
of thing.
The question is not, "How do I write a computer program to handle this
data?" Instead, you should ask yourself, "How must I manipulate this
hardware to do the job?" The techniques in this article may be out of place
in the bureaucracy of a large computer but are right at home in the
wild-west world of a microcomputer.
Lest you think this has nothing to do with a "real" computer like your PC
clone, consider this. Instead of a '286 with 1 meg of memory, suppose it
contained ten Z80 controller boards, each with 64K of memory and a fast
network to tie them together. Each Z80 handles a different device:
keyboard, screen, printer, modem, and one for each disk. The rest are free
to run application programs, several at a time!
Suppose you're doing word processing on this system. The keyboard Z80
does spelling correction on data entry. The screen Z80 displays your text
in bit-mapped fonts to match the printer's Z80, which is simultaneously
printing a file. The Z80 running the word processor itself suffers no
annoying pauses or hesitation, since disk I/O is handled instantaneously via
each drive's Z80 track buffer. Meanwhile, the modem's Z80 is downloading a
file while another assembles a program. Pop-up utilities are ready and
waiting in still other Z80s in case they're needed.
Such a system would clearly have half the hardware cost of a PC, yet
would outperform it by a wide margin. True multi-tasking becomes child's
play with multiple processors. More processors can be readily added for
even higher performance, or removed to save cost (or continue operation
while waiting for a replacement).
If the computer scientists really want to further the micro-revolution,
they should stop trying to force antiquated mainframe languages onto micros
and concentrate on developing tools to maximize the use of micros as they
are!
[This article was originally published in issue 40 of The Computer Journal,
P.O. Box 12, South Plainfield, NJ 07080-0012 and is reproduced with the
permission of the author and the publisher. Further reproduction for non-
commercial purposes is authorized. This copyright notice must be retained.
(c) Copyright 1989, 1991 Socrates Press and respective authors]