Truly efficient software has an intimate, almost incestuous relationship
with its hardware. They merge so thoroughly as to become inseparable;
neither makes any sense without the other.
This requires that you, the programmer, must TOTALLY understand the
hardware. I cannot stress this point too strongly. The strengths and
weaknesses of the hardware influence program structure at every level, not
just the low-level drivers. A system with weak disk I/O will be slow and
unresponsive if your program relies on overlays. A shallow keyboard buffer
requires frequent checks to avoid missing keys. The characteristics of the
console device determine your whole approach to data displays. If you try
to hide from these limitations in a high-level language, your program will
work as if it were written in BASIC 101. Let's consider some actual case
histories of what can be gained by paying attention to the hardware.
CASE #1
A customer needed a faster way to transfer data between two computers.
He had been using a serial port at 9600 baud but complained that it was too
slow and tied up the computer's serial port. Hardware mods were ruled out.
After study, I found that each computer had unused handshake lines in its
RS-232 port. A special "Y" cable was built to cross-connect two of these
lines, providing one bit of serial I/O in each direction. A "software UART"
program was then written to transfer data between the two machines. This
worked to about 30K bits per second before timing dither (due to
interrupts, memory refresh, etc.) caused errors.
The serial port's UART could be programmed to generate an interrupt when
the handshake line went low. Therefore, an interrupt-driven protocol with
handshaking was devised. A '0' was sent by pulling the output low until the
other computer echoed the low on its output. A '1' was sent by pulsing the
output low and immediately back high and waiting until the other system
echoed it. The data rate increased to over 100K bits per second, and
transfers were now unaffected by disk I/O, keyboard activity, etc.
CASE #2
The firmware for a CRT terminal was to be upgraded to run 38400 bits per
second without handshaking. Now, 38400 bps is fast, only 260 microseconds
per character (about 75 instructions for a 3 MHz Z80).
The slowest routines need the most attention. For example, clear-line
was accomplished by moving the stack pointer to the end of the line and
executing 36 PUSH HL instructions. The interrupt handler needed a 4-level
stack, so the last 8 bytes were cleared normally. Clear-screen used 25
iterations of clear-line.
This still isn't fast enough to complete every ESC sequence before the
next one is received. This calls for an interrupt-driven system. Each
character received generates an interrupt. The interrupt handler pushes the
character into one end of a FIFO (First-In-First-Out) buffer in memory. The
main program pops characters out the other end and processes them. The FIFO
fills while we process slow commands like clear-screen and empties back out
during fast commands.
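In C, such a buffer might look like the sketch below (a single-producer,
single-consumer ring; the names and the 256-byte size are illustrative,
not from the original firmware). The one-byte indices wrap around
automatically at 256, much as the Z80 version would wrap a one-byte
pointer.

```c
#include <stdint.h>

/* Sketch of the receive FIFO: the interrupt handler pushes at the head,
   the main loop pops at the tail. */
static uint8_t fifo[256];
static volatile uint8_t head, tail;    /* head == tail means empty */

static int fifo_put(uint8_t c)         /* called by the interrupt handler */
{
    uint8_t next = (uint8_t)(head + 1);
    if (next == tail) return 0;        /* full -- caller must shed load */
    fifo[head] = c;
    head = next;
    return 1;
}

static int fifo_get(uint8_t *c)        /* called by the main program */
{
    if (tail == head) return 0;        /* empty */
    *c = fifo[tail++];                 /* one-byte index wraps at 256 */
    return 1;
}
```

One slot is sacrificed so that full and empty are distinguishable without
a separate count, which keeps the interrupt side free of read-modify-write
races against the main loop.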
But what if some idiot sends a long string of slow commands (like 100
clear-screens in a row)? The FIFO would eventually overflow, and data would
be lost. I prevented this with "look-ahead" logic. When the interrupt
handler spots a clear-screen command, it sets a flag so MAIN expects it.
MAIN can then ignore unnecessary commands (no sense altering a screen that's
about to be cleared).
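The idea can be modeled as below. This is a simplified C sketch: the CLS
code and the counters are assumptions for illustration, and the real
firmware did this with Z80 registers and flags.

```c
#define CLS 0x0C                  /* assumed one-byte clear-screen code */

static volatile int pending_cls;  /* clears queued but not yet executed */
static int drawn, cleared;        /* stand-ins for actual screen work */

static void irq_receive(char c)   /* interrupt side: look ahead as we queue */
{
    if (c == CLS) pending_cls++;
    /* ...then push c into the FIFO as usual... */
}

static void main_process(char c)  /* main-loop side: pop and execute */
{
    if (c == CLS) { pending_cls--; cleared++; return; }
    if (pending_cls > 0) return;  /* a clear is coming; skip this work */
    drawn++;                      /* otherwise really draw the character */
}
```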
Scrolling is one of the most difficult actions. The obvious algorithm is
to block move lines 2-24 up 1, then clear line 24. But that's what IBM did
on the PC, and we all know how well that worked. So examine the 6845 CRT
controller. The Start-Address register holds the address of the first
character on the screen, the one displayed in the top left corner. If we
add 80 to it, line 2 instantly becomes the top line, and we've scrolled the
whole screen up a line. All that remains is to clear the 80 bytes that
form the new 24th line, for which we have a fast routine.
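Here is a C model of the technique (illustrative only; the real routine
was of course a few Z80 instructions poking the 6845 registers), assuming
a 24-line, 80-column display in 2K of video RAM at F800-FFFF:

```c
#include <stdint.h>

#define SCREEN_BASE 0xF800u
#define SCREEN_SIZE 0x0800u           /* 2K of video RAM, F800-FFFF */
#define COLS        80u

static uint8_t  vram[SCREEN_SIZE];    /* stand-in for the video RAM */
static uint16_t start_addr = SCREEN_BASE;  /* 6845 Start-Address register */

static uint16_t wrap(unsigned addr)   /* keep addresses inside F800-FFFF */
{
    return (uint16_t)(SCREEN_BASE + (addr - SCREEN_BASE) % SCREEN_SIZE);
}

static void put_char(unsigned row, unsigned col, char c)
{
    vram[wrap(start_addr + row * COLS + col) - SCREEN_BASE] = (uint8_t)c;
}

static char char_at(unsigned row, unsigned col)
{
    return (char)vram[wrap(start_addr + row * COLS + col) - SCREEN_BASE];
}

static void scroll_up(void)
{
    start_addr = wrap(start_addr + COLS);   /* line 2 becomes the top line */
    for (unsigned i = 0; i < COLS; i++)     /* clear only the new 24th line */
        put_char(23, i, ' ');
}
```

Only 80 bytes are touched per scroll; everything else is one register
write to the 6845.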
Each scroll moves the start address up another 80 bytes. This obviously
can't go on indefinitely, so the original program spent a great deal of time
checking for overflow outside its 2K block of screen RAM (F800-FFFF). For
instance, the old code read:
ld (hl),a ; put character on screen
inc hl ; advance to next
ld a,h ; get new address
or 0F8h ; if overflow to 0000,
ld h,a ; force it to F800-FFFF
But is this really necessary? The schematic revealed that the 2K RAM was
partially decoded and actually occupied 16K in the Z80's address space
(C000-FFFF). It's far easier to insure that an address lies within this
range:
ld (hl),a ; put character on screen
res 6,h ; insure we don't wrap to 0000
inc hl ; advance to next
CASE #3
Fast Disk I/O. Way back in 8 B.C. (eight years Before Clones) I had an
S-100 system. Its 8080 CPU blazed along at 1.843 MHz, through 32K of RAM
spread over half a dozen furnace boards. Two Shugart SA-801R single-sided
8" drives provided disk storage, with CP/M 1.4 tying it all together. That
old war horse and I fought many battles together, until it finally died the
Death-of-1000-Intermittents.
Many of its "features" I'd rather forget, but it had one outstanding
attribute: the fastest floppies I've ever seen. Warm boots were done before
your fingers were off the keys; WordStar loaded in under a second; PIP
copied files at 10K bytes/sec. All without a fast CPU, DMA, vectored
interrupts, or even a disk controller IC. The "controller" was just a bunch
of TTL chips implementing a parallel port, an 8-bit shift register, and a
CRC checkcode generator. The real work was done by the CPU, byte-banging
out the IBM 3740 SD/DD format in software.
How good was it? An 8" disk spins at 360 rpm, or 6 revs/sec. Each track
held 6.5K (26 double-density sectors of 256 bytes each). That makes the
theoretical maximum transfer rate 6.5K x 6 = 39K bytes/sec. It actually
achieved 50% of this, or 20K bytes/sec.
Few modern micros come anywhere near this level of performance. The
Kaypro I wrote this article on creeps through the disk at 4K/sec. My PC
clone is closer, at 12K/sec. The problem is that the CPU spends most of its
time in wait loops: waiting for the drive motor to start, for the head to
load, for an index hole, for a certain sector to come around on the disk.
The capabilities of fast CPUs, elaborate interrupt systems, DMA, and fancy
disk controllers are thrown away by crude software.
The CPU has better things to do. If the disk isn't ready when an
application program needs it, the BIOS should start the task, save the data
in a buffer, and set up an interrupt to finish the task later when the disk
is REALLY ready. The time lost to wait loops is thus reclaimed to run your
application programs.
That's how my antique worked. The BIOS maintained a track buffer in RAM.
The first read from a particular track moved the head to the desired track
and read the whole thing into the buffer. Further reads from that track
simply came from RAM, taking virtually no time at all.
Similarly, writes to a sector on the current track just put data in the
buffer and marked it as changed. The actual write was performed later, when
a new track was selected for read/write, or just before the drive timed out
from a lack of disk activity.
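In outline, this is an ordinary write-back cache with the track as its
unit. A C sketch follows; the arrays stand in for the physical disk, the
sizes match the 26-sector format described earlier, and the names are
illustrative, not from the original BIOS.

```c
#include <string.h>

#define TRACKS     77
#define TRACK_SIZE (26 * 256)           /* one full track */
#define SEC_SIZE   256

static char disk[TRACKS][TRACK_SIZE];   /* stand-in for the physical disk */
static char trkbuf[TRACK_SIZE];         /* the BIOS track buffer */
static int  cur_track = -1;
static int  dirty;                      /* buffer changed but not written */

static void select_track(int t)
{
    if (t == cur_track) return;
    if (dirty) {                        /* flush deferred writes first */
        memcpy(disk[cur_track], trkbuf, TRACK_SIZE);
        dirty = 0;
    }
    memcpy(trkbuf, disk[t], TRACK_SIZE);/* one revolution reads it all */
    cur_track = t;
}

static void read_sector(int t, int s, char *out)
{
    select_track(t);                    /* often a no-op: served from RAM */
    memcpy(out, &trkbuf[s * SEC_SIZE], SEC_SIZE);
}

static void write_sector(int t, int s, const char *in)
{
    select_track(t);
    memcpy(&trkbuf[s * SEC_SIZE], in, SEC_SIZE);
    dirty = 1;                          /* physical write happens later */
}
```

A real BIOS would also flush on the drive timeout described above; that
timer is omitted here.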
Physical track reads/writes were fast as well. The key was to simply
begin wherever the head was. After seeking to the desired track, it read
the ID# of each sector encountered and transferred it to/from the appropriate
place in the RAM buffer. No need to find the index hole, wait for a
particular sector ID#, or worry about interleave; one revolution got it all.
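The placement step is the whole trick, and it is tiny. In the sketch below
(illustrative names; the sector count matches the format above), sectors
arrive in whatever rotational order they happen to pass the head and are
dropped into the buffer by their ID fields:

```c
#include <string.h>

#define SECTORS  26
#define SEC_SIZE 256

struct sector { int id; char data[SEC_SIZE]; };  /* as read off the disk */

/* One revolution's worth of sectors, in arrival order, placed by ID. */
static void read_track(const struct sector *pass, int n, char *buf)
{
    for (int i = 0; i < n; i++)
        memcpy(&buf[pass[i].id * SEC_SIZE], pass[i].data, SEC_SIZE);
}
```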
Such a system must be implemented carefully. CP/M does not expect
delayed error messages, which can produce some odd results. For instance, a
BDOS read error might be reported when the real cause was a write error in
flushing the previous track buffer to disk. Also, modern drives do not have
door locks to prevent disk removal if unwritten data remains in the track
buffer.
The main factor limiting my S-100 system's performance was the slow CPU
and lack of DMA. A double-density 8" disk has a peak data transfer rate of
500K bits/sec, which only allows 16 microseconds between bytes. This
required polled I/O where the CPU was 100% devoted to the disk during actual
reads/writes.
5-1/4" disks have a slower maximum transfer rate, but modern hardware has
advantages that can make up for it. A normal 5-1/4" disk spins at 300 rpm,
or 5 rev/sec. Assuming 9 sectors of 512 bytes per track, the maximum
transfer rate is 22.5K bytes/sec. The peak data rate is 250K bits/sec, or
32 microseconds per byte. This is slow enough for a 4 MHz Z80 to (barely)
handle it on an interrupt basis. Here's an interrupt handler to read 256
bytes from a disk controller chip at 32 microseconds max. per byte:
T-states
23 ; time to finish longest instruction
13 ; Z80 interrupt mode 0 or 1 response
11 int: push af ; save registers used
11 in a,(data) ; read data byte from disk controller
13 next: ld (buffer),a ; store it in buffer (a variable)
13 ld a,(next+1) ; get buffer address
4 inc a ; increment
13 ld (next+1),a ; save for next time
7 jr z,done ; if end of page, done
10 pop af ; else restore registers
10 ret ; and return
----
128 T-states max = 32 microseconds with a 4 MHz Z80
But this routine barely squeaks by. It can't use interrupt mode 2 (which
adds 6 T-states to the response time) or signal Z80 peripherals that the
interrupt is complete with an RETI (which adds 4 T-states). It's limited to
a 256-byte sector. Worse, some disk controller chips need processing time
of their own. The popular Western Digital FD179x series only allows 27.5
microseconds for each byte.
So we have to get clever again. The following example reads pairs of
bytes, the first on an interrupt and the second by polled I/O. This
improves performance to allow interrupt mode 2, larger sector sizes, and the
slow response time of a FD179x chip:
T-states
23 ; time to finish longest instruction
19 ; Z80 interrupt mode 2 response time
11 int: push af ; save A and flags
11 in a,(data) ; read 1st byte from disk controller
11 push hl ; save HL
10 next: ld hl,buffer ; get buffer address (a variable)
7 ld (hl),a ; store byte in buffer
6 inc hl ; advance buffer pointer
6 inc hl ; for next interrupt
16 ld (next+1),hl; & store it
6 dec hl ; point to current address
11+11 check:in a,(status) ; check disk controller status
4+ 4 rra ; if not busy (bit 0=1),
7+ 7 jr nc,done ; then we're done
4+ 4 rra ; if next byte not ready (bit 1=0),
12+ 7 jr nc,check ; then loop until it is
11 in a,(data) ; get 2nd byte from disk controller
7 ld (hl),a ; & store it in buffer
10 pop hl ; restore registers
10 pop af
14 reti ; return
----
188 or 226 T-states (for 1 or 2 passes through status loop)
This routine reads bytes from the controller chip within 17.75
microseconds worst-case. Interrupt overhead averages 80% for a 4 MHz Z80,
leaving 20% for the main program execution. The peculiar way of
incrementing the address pointer minimizes the worst-case delay from an
interrupt or status flag change until the byte is read. We want to maximize
the chance that the second character is ready the first time the status is
checked.
Why improve your disk system? Because, as a practical matter, there's
more to be gained by improving it than any other change you could make.
It's disk I/O that sets the pace, not CPU speed or memory size. Users
almost never wait on CPU speed; it's the disk that keeps you twiddling your
thumbs with the keyboard ignored, the screen frozen, and the disk drive
emitting Bronx cheers. Put a Commodore 64's tinkertoy disk system on an AT
clone, and you'd have high-priced junk that only a masochist would use.
Conversely, the AT's DMA-based disk I/O would transform a C64 into a
fire-breathing dragon that would eat its competition alive.
Algorithms
When a hardware engineer sits down to design a circuit, he doesn't begin
with a blank sheet of paper. He has a vast library of textbooks, data
sheets, and catalogs of standard circuits to choose from. Most of the task
is simply connecting off-the-shelf components into one of these standard
configurations, modifying them as necessary to satisfy any unique
requirements.
Algorithms are to programmers what IC chips are to hardware designers.
Just as the engineer builds a library of standard parts and circuits, every
programmer must continually build his own algorithm collection. Whether
it's a shoebox full of magazine clippings or a carefully indexed series of
notebooks, start NOW.
Programming textbooks tend to concentrate on traditional computer
algorithms for floating-point math, transcendental functions, and sorting
routines. The old standby is Knuth's "The Art of Computer Programming".
Hamming's "Numerical Methods for Scientists and Engineers" explains the
basics of iterative calculations. "Digital Computation and Numerical
Methods" by Southworth and DeLeeuw provides detailed flowcharts and sample
code as well.
Magazines are a great source and tend to be more down-to-earth and closer
to the state of the art. Read carefully! Good algorithms may be expressed
in BASIC listings, assembly code for some obscure processor, or pocket
calculator key sequences, or even disguised as circuit diagrams.
Professional journals like EDN or Computer Design are often better than the
popular magazines, which have pretty much abandoned education in favor of
marketing. Especially check out back issues. The cruder the hardware, the
trickier the algorithms had to be to make up for it.
Manufacturers' technical literature is a gold mine. Get the
manufacturers' own manuals, not some boiled-down paperback from the
bookstore. They won't be models of clarity but are full of hidden gold.
Read everything: hardware and software manuals, data sheets, application
notes, etc.
User groups are the traditional source of solutions to specific
problems. Even better, they provide actual implementations in printed
listings, on disk, or even by modem.
Don't waste time reinventing the wheel. Learn from others what works,
and what doesn't. Some of the best (and worst) algorithms I know were found
by disassembling existing programs. And once you find a good algorithm,
recycle it. That clever sort routine for an antique 8008 may be the
foundation of the fastest '386 sort yet!
Conclusion
These techniques are not new; in fact, old-timers will recognize many of
them from the early days of computing, when hardware limitations were more
severe. However, they have fallen into disuse. A whole generation of
programmers has been taught that such techniques have no place in modern
structured programming.
The theory goes something like this: Programs written in a high-level
language are faster and easier to write, debug, document, and maintain.
Memory and speed are viewed as infinite resources, so the performance loss
is unimportant. Programs should be totally generic; it is the compiler's or
run-time library's job to worry about the hardware interface.
These rules make sense in a mainframe environment, where the hardware
resources are truly awesome and teams of programmers spend years working on
one application. But they impose severe penalties on a microcomputer
system. The user must pay for the programmer's luxuries with higher
hardware cost and lackluster performance.
It's easy to forget that "microcomputer" literally means "one millionth
of a computer". Microprocessors make abysmally bad CPUs. Build a computer
with one, and you'll wind up needing $5000 worth of memory and peripherals
to support a $5 CPU chip.
But micros make superlative controllers. That's what they were designed
for, and what they do best. A single microcomputer can replace dozens of
boards and hundreds of ICs with as little as a single chip. That's why 90%
of all microprocessors go into non-computer uses: calculators, auto emission
controls, home entertainment equipment, industrial controls, and the like.
Of 30 million Z80s sold last year, fewer than 1 million went into computers.
Programming a controller is different from programming a computer. Most
applications demand real-time multi-tasking capabilities, and there is never
enough speed or memory. Inputs and outputs are physical hardware devices,
not abstract data structures, so the code must inevitably be
hardware-dependent. Computer languages are just not cut out for this sort
of thing.
The question is not, "How do I write a computer program to handle this
data?" Instead, you should ask yourself, "How must I manipulate this
hardware to do the job?" The techniques in this article may be out of place
in the bureaucracy of a large computer but are right at home in the
wild-west world of a microcomputer.
Lest you think this has nothing to do with a "real" computer like your PC
clone, consider this. Instead of a '286 with 1 meg of memory, suppose it
contained ten Z80 controller boards, each with 64K of memory and a fast
network to tie them together. Each Z80 handles a different device:
keyboard, screen, printer, modem, and one for each disk. The rest are free
to run application programs, several at a time!
Suppose you're doing word processing on this system. The keyboard Z80
does spelling correction on data entry. The screen Z80 displays your text
in bit-mapped fonts to match the printer's Z80, which is simultaneously
printing a file. The Z80 running the word processor itself suffers no
annoying pauses or hesitation, since disk I/O is handled instantaneously via
each drive's Z80 track buffer. Meanwhile, the modem's Z80 is downloading a
file while another assembles a program. Pop-up utilities are ready and
waiting in still other Z80s in case they're needed.
Such a system would clearly have half the hardware cost of a PC, yet
would outperform it by a wide margin. True multi-tasking becomes child's
play with multiple processors. More processors can be readily added for
even higher performance, or removed to save cost (or continue operation
while waiting for a replacement).
If the computer scientists really want to further the micro-revolution,
they should stop trying to force antiquated mainframe languages onto micros
and concentrate on developing tools to maximize the use of micros as they
are!
[This article was originally published in issue 40 of The Computer Journal,
P.O. Box 12, South Plainfield, NJ 07080-0012 and is reproduced with the
permission of the author and the publisher. Further reproduction for non-
commercial purposes is authorized. This copyright notice must be retained.
(c) Copyright 1989, 1991 Socrates Press and respective authors]