COCO-3 BOOT LIST ORDER BUG (BLOB)
Facts, fixes and theories by Kevin Darling & friends

THE BLOB. Some owners have it, some have never seen it. Ordering of modules in a bootlist for os9gen seems to affect it. Adding new devices may cause it to show up. What causes it? It's past time to lay out both what has been conjectured and what is truly known so far.

At first, the OS-9 kernel itself was blamed. We've been pretty sure for a long time now that it is NOT at fault. All the modules are position-independent, and have been gone over very closely by several of us, looking for anything that could cause a problem. We have found no software cause at all (with the exception of the disk driver - see below).

Instead, hardware and timing discrepancies in the CoCo-3 and peripherals have almost always been found to be at fault. In fact, with enough information it's often possible to pinpoint the exact cause of a particular problem.

Enough preliminaries. Here are most of the confirmed and unconfirmed symptoms and possible reasons, including things that act like BLOBs...

----------------------------------------------------------------------------
FLOPPY FORMATTING HALTS IN FIRST FEW TRACKS; READ/WRITES ARE OFF BY A BYTE:

Ken Schunk, myself, and others long ago found that the halt method used by CC3Disk (and some RSDOS drivers in programs) has a problem with some disk controllers (apparently mostly pre-1985 1773's). The usual method is to wait for the FDC (floppy disk controller) to indicate it is ready to exchange a byte of data, and then have the CoCo go into the halt mode. With the affected controllers, the first byte transfer gets lost, and this is returned as a "Read Error" by the driver.

For reasons as yet unknown, this "data lost" sequence sometimes "seems" to be driver position dependent. I would guess that most boot failures are caused by this one, especially with older controllers (altho I've seen it happen on newer ones, too).
The drivers can be fixed, and we should be able to post patches later.

----------------------------------------------------------------------------
READS/WRITES GO TO WRONG LSN:

Actually, they go to the wrong TRACK, which is also always the wrong LSN.

Usually caused by using disk drives that are set to turn on their motors only with drive select, instead of the required method of all motors on with the motor-on signal. All drivers assume that if one motor is on, ALL are on.

Because of this assumption, and especially because the drive READY line isn't usually available on the CoCo setup, the FDC will send stepping commands to a drive that is still spinning up when selected (it takes about 1/2 second to be actually "ready")... and those stepping pulses are totally ignored by drives not spun up. So while the FDC _thinks_ it's stepped the head to a new track, in fact some or all of the step pulses have been lost.

Worse, the 1773 FDC seems to ignore the embedded track information on the disk itself (contrary to docs), so as long as the sector number matches up, the data is read/written... to whatever track the head happens to be over!

So make sure your drive motors all come on at the same time.

----------------------------------------------------------------------------
SPEED AND BAD CHIPS

Testing and experiences by several people have shown that the American semiconductor industry has gotten pretty bad over the last few years as far as quality goes. Or perhaps retailers are selling more reject chips that they buy on the grey market. In any case, some failures of chips used in add-on devices have been found to be brand dependent.

For example, some of the LS245 data buffers inside CoCo-3's seem to fail to pass true data at times. Replacing this chip with a Japanese brand will usually cure this particular problem. Motorola chips seem to be the worst bet. Symptom is that an instruction loop reading from the MPI sometimes sees bits set that it shouldn't.
Solution is to replace the chip or slow down the loop.

Speedwise, many people use hardware designed and built for 1MHz operation from the CoCo1/2 days. A common problem is with RS232 paks... they may need the 6551 replaced with a higher speed version.

----------------------------------------------------------------------------
INTERRUPTS

Boot problems also sometimes appear when a device's interrupt line isn't correctly reset. I've had several 6551 ACIAs (used in RS232 paks, etc) that decided not to clear their interrupt line just by resetting the CoCo. This leaves an interrupt hanging and can mess up a machine trying to boot OS-9.

It's also been found that some RS232 paks were built with the E clock tied to the IRQ line... this can abort a boot also.

Stuck interrupts are covered in the various "IRQ HACK" files available on most networks, as are files on the RS232 pak.

----------------------------------------------------------------------------
MULTIPAK UPGRADE

A non-upgraded MPI definitely causes problems. At the least, it can cause wrong information to be read from the crucial GIME interrupt status port.

The most common rumor we see on BBS's is that the MPI upgrade "isn't needed", because "my machine runs fine without it". DO NOT LISTEN TO THESE PEOPLE. PLEASE EXPLAIN TO THEM THAT THEY ARE STUPID.

While we can't swear that you WILL hurt your GIME if you don't upgrade, we can certainly say that it does make electronic sense to DO the upgrade (plus Tandy at first sold the upgrades below their cost, which alone would make one think there's a good reason for having it, eh?).

The electronic reason for the upgrade is this: a READ from $FF80-9F will turn on BOTH the GIME data bus AND the MPI data bus. (In addition, really old MPIs ghost their slot select at $FF7F and $FF9F, which causes problems.) It's never a good idea to have two devices trying to put data on a bus at the same time... one of them could get hurt (usually the GIME, in reported experiences).
Especially under OS-9, where the interrupt register at $FF92 is read at least 60 times a second, it makes sense to not have that data be corrupted by bogus MPI data coming on at the same time.

So UPGRADE YOUR MULTIPAK NOW!

----------------------------------------------------------------------------
E-CLOCK SYNCHRONIZATION:

All accesses to peripherals need to use the 6809 E clock to validate the transfer of data (especially at 2MHz!). A few early versions of third-party devices were accidentally made with registers that didn't do this. All have been fixed for a year now, as far as I know.

The boot-order side of this came about whenever a device register was accessed at an odd/even address, and then the next cpu instruction fetch was at the opposite even/odd address... which meant the A0 address line (or sometimes A1 and maybe A2 also) would change after the E cycle ended and thus cause wrong device register addressing. This was shown on scopes as a small (around 10 ns) glitch.

So the *position* of the driver I/O access instructions in memory was very important, and was a true common "boot order" trouble causer (and may still be with older devices made in the pre-CoCo3 days).

----------------------------------------------------------------------------
GIME S0-2 DECODING

A variation of E-gating is that the SCS external select line is generated inside the CoCo-3 without being E-gated.

This could mean that while the GIME is decoding a different I/O selection, the S0-2 GIME lines decoded by the 74LS138 in the CoCo could easily wobble between outputs, possibly randomly enabling ROMs, PIAs, etc and placing bogus data on the bus. It also may be one cause of the video "sparklies".

Again, using the E gating on devices should mostly solve this, altho it's also recommended that if you have problems you should gate the 138 with the E clock (Roger Krupski came up with the easiest method: inside the CoCo on the cartridge port, simply tie the E clock to the SLENB line).
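
To picture the wobble described above, here is a minimal sketch in Python (not CoCo code) of a 3-to-8 decoder like the 74LS138 whose select inputs do not all change at the same instant. The function names, the bit-by-bit skew order, and the example select codes are assumptions made up for illustration; none of them come from GIME documentation or scope measurements. The sketch also assumes the select lines only change while E is low, which is the condition E-gating relies on.

```python
def decoder_138(select, enabled=True):
    """3-to-8 decoder: return the active output number, or None if disabled."""
    return (select & 0b111) if enabled else None


def transition_codes(old, new, skew_order=(0, 1, 2)):
    """Codes seen on the select lines if differing bits flip one at a time.

    skew_order is a hypothetical order in which the three lines settle.
    """
    codes = [old]
    current = old
    for bit in skew_order:
        mask = 1 << bit
        if (current ^ new) & mask:
            current ^= mask
            codes.append(current)
    return codes


def outputs_seen(old, new, e_gated):
    """Set of decoder outputs that go active while moving from old to new."""
    seen = set()
    for code in transition_codes(old, new):
        settling = code != new
        # Assumption: the lines settle while E is low, so an E-gated
        # decoder stays disabled until the final code is stable.
        enabled = not (e_gated and settling)
        out = decoder_138(code, enabled)
        if out is not None:
            seen.add(out)
    return seen


if __name__ == "__main__":
    old, new = 0b011, 0b100  # example codes only
    print("ungated:", sorted(outputs_seen(old, new, e_gated=False)))
    print("E-gated:", sorted(outputs_seen(old, new, e_gated=True)))
```

In this toy model, going from select code 3 to code 4 without gating briefly activates outputs 2 and 0 as well, while the E-gated version only ever enables output 4. Real hardware skew is a matter of nanoseconds and depends on the particular chips, so treat this as a picture of the idea rather than a timing model.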
----------------------------------------------------------------------------
DOUBLE INTERRUPTS

This is an oddball one. Sometimes people notice that their boot fails, or that their software clock runs at double speed while within a VDG screen.

Quite by accident, I stumbled across evidence that certain address bit combinations in these situations cause double the vertical interrupts to be generated.

No solution except to always boot to a real window, and if you have this clock problem, to change the order in which you start up a game so that its video address can be moved somewhere "safe". This also seems to be GIME dependent. Non-upgraded MPIs can cause this also, I think.

----------------------------------------------------------------------------
OTHER HARDWARE PROBLEMS

Bad connections. Bad connections. Bad connections.

Clean all your contacts regularly: the cartridge port, the MPI and slot pins, all rompak devices, and disk drive cables. Even yank your GIME and swab it with alcohol if need be, altho sometimes just pushing/tapping on it cures many oddball troubles. Make sure your drives don't have something covering the write-protect detect LEDs. In general, just keep everything clean!

It's also about now that many disk drives in use for years are wearing out or becoming misaligned. Heads become a lot weaker, and data becomes flaky.

We've also seen cases where a new cordless phone, or appliance on the same circuit breaker, can screw up floppy or hard disk transfers. Even satellite dish downfeeds running by the computer. If you start to have problems, ask yourself "did anything change here lately?"

----------------------------------------------------------------------------
OTHER SOFTWARE PROBLEMS

More and more, we find that many supposed boot list problems have an unrelated simple explanation...
such as making a new boot and forgetting that you patched some modules or used old ones; the common "oops forgot to put Grfdrv and Shell in the CMDS directory" gotcha; or leaving out a module.

Very often the cause is not having the latest drivers for a device. It's important to keep updated with the newest software made available.

Also, sometimes a module (especially os9p1) will get hit by an errant program, and then you os9gen a new disk... and the bad os9p1 gets perpetuated from then on through new os9gens.

We also find that people often reverify a bad module quite by accident using disk editors on their bootfile, thus hiding future trouble. Keep a log of all changes you make, and CRCs!

----------------------------------------------------------------------------
MISC THEORIES

Most other problems fall into the mystery section (meaning we don't have a firm handle on the cause yet). I have two pet ideas that may or may not make sense, but which are bolstered in part by experiences by myself and others.

One is that since interrupts cause the internal BASIC ROM to turn on (to get the interrupt vectors), the ROM stays on a bit too long and corrupts the data bus at times. Probably a dumb theory.

The other is that the dead cycles within many instructions have an effect. During the dead cycle the address bus contains $FFFF (which turns on the ROM!) and again, perhaps this data sticks around, or the address lines change too fast once in a while from true address to $FFFF. This ties in with partial evidence that some 6809s at 2MHz will start changing their address lines immediately after the end of an E cycle, perhaps even before E-gated devices finish up. We do know that oddball reads/writes occur at times to strange addresses, and this might explain them.
A third theory gaining some acceptance (but we just don't know how the GIME works internally) is that the GIME, like the SAM chip, powers up using either the up or down side of the main oscillator clock (remember hitting reset on SAM machines to get the right red/blue fake color phase? like that). Perhaps one side is better than the other. Certainly powering down sometimes cures a boot or other problem. So who knows?

We also know that changing cpu brands, and sometimes switching GIMEs, will often cure timing problems and the sparklies. Not always, though.

----------------------------------------------------------------------------
CONCLUSIONS

We're still gathering data, and occasionally do run across something unexplained. For the most part though, BLOBs have become fairly rare. This may be because people have more L-II experience, or newer hardware, or a combination.

OS-9 itself is not at fault, and note that even RSDOS applications can and do suffer from the same symptoms. The basic answer is that we moved up to a faster machine, while still using older peripheral equipment.

The order of the bootlist CAN affect the symptoms (as we've seen), but this is simply software showing up hardware bugs, and is NOT the fault of OS-9 itself.

So the final word is this: our best evidence is that there really _isn't_ a boot list order bug. Look to your hardware instead.

- kevin

The above information has been gleaned over the past two years from personal experience, many phone calls and network messages, and the work of Bruce Isted, Tony DiStefano, Chris Burke, Roger Krupski, DP Johnson, Dave Wiens, Ken Schunk, and many others.

This file may be reposted on BBS's and other electronic networks, but may not be used in commercial publications without the author's permission.

PS: if you have anything to add, please send information to me at:

  76703,4227 - compuserve
  OS9UGPRES - delphi
  uunet!76703.4227@compuserve.com

PPS: LATE UPDATE - Replacing the 7406/7416 chips in older floppy disk controllers with a different brand can help. Three people have called lately with info that some of the Fairchild chips have nasty waveforms.

-eof-